@botbotgo/better-call 0.1.0 → 0.1.2

package/README.md CHANGED
@@ -4,7 +4,7 @@

  # BetterCall

- **One-line wrapper. BFCL tool-call accuracy: 81.3% → 91.3%.**
+ **One-line wrapper. Three full BFCL remote runs completed. Best: 55.5% → 63.6%.**

  ```ts
  const tools = betterTools([searchTool, calculatorTool]);
@@ -46,6 +46,29 @@ This prints the BFCL v4 targeted weak-category table used above. It is a BetterC

  For the full official BFCL v4 reproduction, use Berkeley's published checkpoint/package: commit `f7cf735` or `pip install bfcl-eval==2025.12.17`.

+ To run the real-model benchmark against your own Ollama endpoint, install the BFCL package and point BetterCall at the package data:
+
+ ```bash
+ python3 -m venv /tmp/better-call-bfcl-venv
+ /tmp/better-call-bfcl-venv/bin/pip install "bfcl-eval==2025.12.17" soundfile
+
+ OLLAMA_BASE_URL=http://127.0.0.1:11434 \
+ BENCH_MODELS=qwen3.5:0.8b \
+ BENCH_CATEGORIES=all \
+ BENCH_CASES_PER_CATEGORY=0 \
+ npm run bench:bfcl:real
+ ```
+
+ For long categories, resume or shard a category with `BENCH_CASE_OFFSET` and `BENCH_CASES_PER_CATEGORY`; the benchmark still uses real model calls and counts request errors/timeouts as incorrect.
+
+ For a model/category matrix run, set `OLLAMA_BASE_URL` and run:
+
+ ```bash
+ OLLAMA_BASE_URL=http://127.0.0.1:11434 npm run bench:bfcl:remote-small-all
+ ```
+
+ The runner does real `/api/chat` tool-call requests. Benchmark JSON redacts the endpoint by default; set `BENCH_SHOW_ENDPOINT=1` only for private debugging.
+
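The sharding described in the added README text can be sketched as follows. `selectShard` mirrors the two `slice` calls in the benchmark script further down in this diff (an offset, then a per-category cap, where `0` means "all"); `planShards` and both function names are hypothetical helpers introduced only for illustration and are not part of the package.

```typescript
// Hypothetical helper: mirrors the offset/limit slicing in scripts/bench-bfcl-real.mjs.
// caseOffset <= 0 means "start at the first case"; casesPerCategory <= 0 means "take all".
function selectShard<T>(cases: readonly T[], caseOffset: number, casesPerCategory: number): T[] {
  const offsetCases = caseOffset > 0 ? cases.slice(caseOffset) : cases.slice();
  return casesPerCategory > 0 ? offsetCases.slice(0, casesPerCategory) : offsetCases;
}

// Hypothetical helper: splits a category of `total` cases into consecutive shards.
// Each entry maps to one run with BENCH_CASE_OFFSET=offset and BENCH_CASES_PER_CATEGORY=count.
function planShards(total: number, shardSize: number): Array<{ offset: number; count: number }> {
  if (shardSize <= 0) throw new Error("shardSize must be positive");
  const shards: Array<{ offset: number; count: number }> = [];
  for (let offset = 0; offset < total; offset += shardSize) {
    shards.push({ offset, count: Math.min(shardSize, total - offset) });
  }
  return shards;
}

const ids = Array.from({ length: 10 }, (_, i) => `case_${i}`);
console.log(selectShard(ids, 4, 3));

// live_multiple has 1,053 cases in the tables of this diff.
console.log(planShards(1053, 500));
```

Because a per-category cap of `0` selects everything, the final shard of a category needs an explicit positive count rather than `0`, which is why `planShards` always emits the remaining size.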
  ## What It Catches

  | Category | Failures | Example |
@@ -94,9 +117,36 @@ Default recommendation: start with validate + block, then add `repairModel` for

  ## Benchmark

- Measured on BFCL v4 single-turn targeted weak categories with remote Ollama models.
+ Measured with real Ollama `/api/chat` calls over all supported BFCL v4 single-turn tool-call categories. Request errors and timeouts count as incorrect. This is **not an official BFCL leaderboard score**; it measures BetterCall as a runtime reliability layer.
+
+ Latest completed remote run artifact: `benchmarks/bfcl-real-remote-completed-summary.json`.
+
+ | Model | Completed cases | Raw | BetterCall repair | Accuracy lift | Request errors |
+ | --- | ---: | ---: | ---: | ---: | ---: |
+ | `qwen3:0.6b` | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
+ | `qwen3.5:0.8b` | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
+ | `lfm2.5-thinking:latest` | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
+
+ Latest completed model category detail: `lfm2.5-thinking:latest`.
+
+ | Category | Cases | Raw | BetterCall repair | Lift | Request errors |
+ | --- | ---: | ---: | ---: | ---: | ---: |
+ | `simple_python` | 400 | 72.5% | 72.5% | +0.0pp | 69 |
+ | `simple_java` | 100 | 54.0% | 53.0% | -1.0pp | 11 |
+ | `simple_javascript` | 50 | 60.0% | 58.0% | -2.0pp | 4 |
+ | `multiple` | 200 | 59.5% | 59.0% | -0.5pp | 65 |
+ | `parallel` | 200 | 57.0% | 57.0% | +0.0pp | 61 |
+ | `parallel_multiple` | 200 | 26.5% | 26.5% | +0.0pp | 131 |
+ | `irrelevance` | 240 | 29.2% | 36.7% | +7.5pp | 152 |
+ | `live_simple` | 258 | 57.0% | 57.0% | +0.0pp | 52 |
+ | `live_multiple` | 1,053 | 45.9% | 45.9% | +0.0pp | 302 |
+ | `live_parallel` | 16 | 43.8% | 43.8% | +0.0pp | 2 |
+ | `live_parallel_multiple` | 24 | 45.8% | 45.8% | +0.0pp | 2 |
+ | `live_irrelevance` | 884 | 52.3% | 67.1% | +14.8pp | 291 |
+
+ The strongest `lfm2.5-thinking:latest` category was `live_irrelevance`: 52.3% raw to 67.1% after BetterCall repair, a +14.8pp lift over 884 real remote cases.
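The percentages and percentage-point lifts in these tables follow from the raw counts in the JSON artifact added later in this diff. The sketch below recomputes two rows; the lift formula matches the expression used in the benchmark script's table output, while the names `formatPercent` and `formatLift` and the one-decimal rounding are assumptions chosen to reproduce the README figures. Errored requests stay in the denominator, which is how "request errors count as incorrect" shows up in the numbers.

```typescript
interface CategoryCounts {
  category: string;
  total: number; // all cases attempted, including those that errored or timed out
  rawCorrect: number;
  betterCorrect: number;
}

// Hypothetical formatter: accuracy as a percentage with one decimal place.
function formatPercent(correct: number, total: number): string {
  return total === 0 ? "0.0%" : `${((correct * 100) / total).toFixed(1)}%`;
}

// Hypothetical formatter: lift in percentage points, signed, one decimal place.
function formatLift(row: CategoryCounts): string {
  const lift = ((row.betterCorrect - row.rawCorrect) * 100) / row.total;
  return `${lift >= 0 ? "+" : ""}${lift.toFixed(1)}pp`;
}

// Counts copied from the `lfm2.5-thinking:latest` entries of the JSON artifact in this diff.
const rows: CategoryCounts[] = [
  { category: "live_irrelevance", total: 884, rawCorrect: 462, betterCorrect: 593 },
  { category: "simple_java", total: 100, rawCorrect: 54, betterCorrect: 53 },
];

for (const row of rows) {
  console.log(
    row.category,
    formatPercent(row.rawCorrect, row.total),
    formatPercent(row.betterCorrect, row.total),
    formatLift(row),
  );
}
```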

- This is **not an official BFCL leaderboard score**. It measures BetterCall as a runtime reliability layer.
+ Historical targeted wrapper benchmark:

  | Model | Raw | BetterCall repair | Accuracy lift |
  | --- | ---: | ---: | ---: |
@@ -0,0 +1,362 @@
+ {
+   "name": "BFCL v4 full remote Ollama runs",
+   "source": "Real model calls against all supported BFCL v4 single-turn tool-call categories completed in this repository",
+   "note": "Endpoint redacted. Scores count request errors/timeouts as incorrect. This is not an official BFCL leaderboard submission.",
+   "generatedAt": "2026-05-08T04:26:26.419Z",
+   "results": [
+     {
+       "model": "qwen3:0.6b",
+       "total": 3625,
+       "rawCorrect": 2011,
+       "betterCorrect": 2307,
+       "errors": 217,
+       "categories": [
+         {
+           "model": "qwen3:0.6b",
+           "category": "simple_python",
+           "total": 400,
+           "rawCorrect": 205,
+           "betterCorrect": 311,
+           "errors": 6,
+           "repaired": 129
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "simple_java",
+           "total": 100,
+           "rawCorrect": 29,
+           "betterCorrect": 43,
+           "errors": 2,
+           "repaired": 29
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "simple_javascript",
+           "total": 50,
+           "rawCorrect": 16,
+           "betterCorrect": 20,
+           "errors": 5,
+           "repaired": 12
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "multiple",
+           "total": 200,
+           "rawCorrect": 137,
+           "betterCorrect": 164,
+           "errors": 0,
+           "repaired": 31
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "parallel",
+           "total": 200,
+           "rawCorrect": 90,
+           "betterCorrect": 96,
+           "errors": 7,
+           "repaired": 57
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "parallel_multiple",
+           "total": 200,
+           "rawCorrect": 92,
+           "betterCorrect": 95,
+           "errors": 23,
+           "repaired": 31
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "irrelevance",
+           "total": 240,
+           "rawCorrect": 184,
+           "betterCorrect": 193,
+           "errors": 27,
+           "repaired": 9
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "live_simple",
+           "total": 258,
+           "rawCorrect": 94,
+           "betterCorrect": 119,
+           "errors": 14,
+           "repaired": 47
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "live_multiple",
+           "total": 1053,
+           "rawCorrect": 465,
+           "betterCorrect": 534,
+           "errors": 67,
+           "repaired": 143
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "live_parallel",
+           "total": 16,
+           "rawCorrect": 1,
+           "betterCorrect": 1,
+           "errors": 12,
+           "repaired": 0
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "live_parallel_multiple",
+           "total": 24,
+           "rawCorrect": 6,
+           "betterCorrect": 6,
+           "errors": 0,
+           "repaired": 3
+         },
+         {
+           "model": "qwen3:0.6b",
+           "category": "live_irrelevance",
+           "total": 884,
+           "rawCorrect": 692,
+           "betterCorrect": 725,
+           "errors": 54,
+           "repaired": 33
+         }
+       ],
+       "repaired": 524
+     },
+     {
+       "model": "qwen3.5:0.8b",
+       "total": 3625,
+       "rawCorrect": 1978,
+       "betterCorrect": 2062,
+       "errors": 901,
+       "repaired": 154,
+       "categories": [
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "simple_python",
+           "total": 400,
+           "rawCorrect": 282,
+           "betterCorrect": 307,
+           "errors": 29,
+           "repaired": 35
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "simple_java",
+           "total": 100,
+           "rawCorrect": 45,
+           "betterCorrect": 46,
+           "errors": 32,
+           "repaired": 6
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "simple_javascript",
+           "total": 50,
+           "rawCorrect": 18,
+           "betterCorrect": 21,
+           "errors": 15,
+           "repaired": 7
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "multiple",
+           "total": 200,
+           "rawCorrect": 146,
+           "betterCorrect": 155,
+           "errors": 17,
+           "repaired": 9
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "parallel",
+           "total": 200,
+           "rawCorrect": 127,
+           "betterCorrect": 128,
+           "errors": 26,
+           "repaired": 15
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "parallel_multiple",
+           "total": 200,
+           "rawCorrect": 97,
+           "betterCorrect": 99,
+           "errors": 41,
+           "repaired": 20
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "irrelevance",
+           "total": 240,
+           "rawCorrect": 110,
+           "betterCorrect": 116,
+           "errors": 113,
+           "repaired": 6
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "live_simple",
+           "total": 258,
+           "rawCorrect": 132,
+           "betterCorrect": 139,
+           "errors": 43,
+           "repaired": 8
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "live_multiple",
+           "total": 1053,
+           "rawCorrect": 538,
+           "betterCorrect": 545,
+           "errors": 212,
+           "repaired": 25
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "live_parallel",
+           "total": 16,
+           "rawCorrect": 0,
+           "betterCorrect": 0,
+           "errors": 13,
+           "repaired": 0
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "live_parallel_multiple",
+           "total": 24,
+           "rawCorrect": 4,
+           "betterCorrect": 4,
+           "errors": 19,
+           "repaired": 0
+         },
+         {
+           "model": "qwen3.5:0.8b",
+           "category": "live_irrelevance",
+           "total": 884,
+           "rawCorrect": 479,
+           "betterCorrect": 502,
+           "errors": 341,
+           "repaired": 23
+         }
+       ]
+     },
+     {
+       "model": "lfm2.5-thinking:latest",
+       "total": 3625,
+       "rawCorrect": 1840,
+       "betterCorrect": 1986,
+       "errors": 1142,
+       "repaired": 149,
+       "categories": [
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "simple_python",
+           "total": 400,
+           "rawCorrect": 290,
+           "betterCorrect": 290,
+           "errors": 69,
+           "repaired": 0
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "simple_java",
+           "total": 100,
+           "rawCorrect": 54,
+           "betterCorrect": 53,
+           "errors": 11,
+           "repaired": 0
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "simple_javascript",
+           "total": 50,
+           "rawCorrect": 30,
+           "betterCorrect": 29,
+           "errors": 4,
+           "repaired": 0
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "multiple",
+           "total": 200,
+           "rawCorrect": 119,
+           "betterCorrect": 118,
+           "errors": 65,
+           "repaired": 0
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "parallel",
+           "total": 200,
+           "rawCorrect": 114,
+           "betterCorrect": 114,
+           "errors": 61,
+           "repaired": 0
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "parallel_multiple",
+           "total": 200,
+           "rawCorrect": 53,
+           "betterCorrect": 53,
+           "errors": 131,
+           "repaired": 0
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "irrelevance",
+           "total": 240,
+           "rawCorrect": 70,
+           "betterCorrect": 88,
+           "errors": 152,
+           "repaired": 18
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "live_simple",
+           "total": 258,
+           "rawCorrect": 147,
+           "betterCorrect": 147,
+           "errors": 52,
+           "repaired": 0
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "live_multiple",
+           "total": 1053,
+           "rawCorrect": 483,
+           "betterCorrect": 483,
+           "errors": 302,
+           "repaired": 0
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "live_parallel",
+           "total": 16,
+           "rawCorrect": 7,
+           "betterCorrect": 7,
+           "errors": 2,
+           "repaired": 0
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "live_parallel_multiple",
+           "total": 24,
+           "rawCorrect": 11,
+           "betterCorrect": 11,
+           "errors": 2,
+           "repaired": 0
+         },
+         {
+           "model": "lfm2.5-thinking:latest",
+           "category": "live_irrelevance",
+           "total": 884,
+           "rawCorrect": 462,
+           "betterCorrect": 593,
+           "errors": 291,
+           "repaired": 131
+         }
+       ]
+     }
+   ]
+ }
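One way to sanity-check an artifact like this is to confirm that each model's top-level counts equal the sum of its category rows. The sketch below does that for the `lfm2.5-thinking:latest` entry, with counts copied from the JSON above. The helper names `sumCategories` and `findMismatches` are hypothetical; the script's own `summarize` function is not shown in this diff, so this is not a claim about how it is implemented.

```typescript
interface Totals {
  total: number;
  rawCorrect: number;
  betterCorrect: number;
  errors: number;
  repaired: number;
}

// Hypothetical helper: adds up per-category counts.
function sumCategories(rows: readonly Totals[]): Totals {
  const sum: Totals = { total: 0, rawCorrect: 0, betterCorrect: 0, errors: 0, repaired: 0 };
  for (const row of rows) {
    sum.total += row.total;
    sum.rawCorrect += row.rawCorrect;
    sum.betterCorrect += row.betterCorrect;
    sum.errors += row.errors;
    sum.repaired += row.repaired;
  }
  return sum;
}

// Hypothetical helper: names of the fields where the reported totals disagree with the rows.
function findMismatches(reported: Totals, rows: readonly Totals[]): string[] {
  const summed = sumCategories(rows);
  const keys: Array<keyof Totals> = ["total", "rawCorrect", "betterCorrect", "errors", "repaired"];
  return keys.filter((key) => summed[key] !== reported[key]);
}

// Row order: simple_python, simple_java, simple_javascript, multiple, parallel, parallel_multiple,
// irrelevance, live_simple, live_multiple, live_parallel, live_parallel_multiple, live_irrelevance.
const lfmRows: Totals[] = [
  { total: 400, rawCorrect: 290, betterCorrect: 290, errors: 69, repaired: 0 },
  { total: 100, rawCorrect: 54, betterCorrect: 53, errors: 11, repaired: 0 },
  { total: 50, rawCorrect: 30, betterCorrect: 29, errors: 4, repaired: 0 },
  { total: 200, rawCorrect: 119, betterCorrect: 118, errors: 65, repaired: 0 },
  { total: 200, rawCorrect: 114, betterCorrect: 114, errors: 61, repaired: 0 },
  { total: 200, rawCorrect: 53, betterCorrect: 53, errors: 131, repaired: 0 },
  { total: 240, rawCorrect: 70, betterCorrect: 88, errors: 152, repaired: 18 },
  { total: 258, rawCorrect: 147, betterCorrect: 147, errors: 52, repaired: 0 },
  { total: 1053, rawCorrect: 483, betterCorrect: 483, errors: 302, repaired: 0 },
  { total: 16, rawCorrect: 7, betterCorrect: 7, errors: 2, repaired: 0 },
  { total: 24, rawCorrect: 11, betterCorrect: 11, errors: 2, repaired: 0 },
  { total: 884, rawCorrect: 462, betterCorrect: 593, errors: 291, repaired: 131 },
];

const lfmReported: Totals = { total: 3625, rawCorrect: 1840, betterCorrect: 1986, errors: 1142, repaired: 149 };

console.log(findMismatches(lfmReported, lfmRows));
```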
@@ -1,5 +1,5 @@
  import { type RepairModelLike } from "./default-repair.js";
- import type { GuardPolicy, JsonSchema, RepairFunction, SemanticValidator } from "./types.js";
+ import type { GuardPolicy, JsonSchema, RepairFunction, SemanticValidator, ToolCallIssue } from "./types.js";
  export type BetterToolLike = {
      name: string;
      description?: string;
@@ -15,4 +15,9 @@ export type BetterToolsOptions = {
      policy?: GuardPolicy;
      mode?: "guard" | "repair" | "review";
  };
+ export declare class BetterToolValidationError extends Error {
+     readonly tool: string;
+     readonly issues: ToolCallIssue[];
+     constructor(tool: string, issues: ToolCallIssue[]);
+ }
  export declare function betterTools<T extends BetterToolLike>(tools: T[], options?: BetterToolsOptions): T[];
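With a dedicated error class, callers of a wrapped tool can tell a BetterCall rejection apart from other failures and read the structured `issues` instead of parsing the message. The sketch below is self-contained, so it defines a local stand-in that mirrors the class body shown in this diff; in real code the class would be imported from `@botbotgo/better-call`. The `ToolCallIssue` shape is reduced to the one field the diff demonstrably uses (`message`), and `describeFailure` is a hypothetical helper.

```typescript
// Stand-in for the package type; only `message` is known from this diff.
interface ToolCallIssue {
  message: string;
}

// Local stand-in mirroring the class added in 0.1.2 (normally imported from the package).
class BetterToolValidationError extends Error {
  readonly tool: string;
  readonly issues: ToolCallIssue[];
  constructor(tool: string, issues: ToolCallIssue[]) {
    super(`BetterCall rejected ${tool}: ${issues.map((issue) => issue.message).join("; ")}`);
    this.tool = tool;
    this.issues = issues;
    this.name = "BetterToolValidationError";
  }
}

// Hypothetical helper: turns any thrown value into a short log line.
function describeFailure(error: unknown): string {
  if (error instanceof BetterToolValidationError) {
    return `tool=${error.tool} issues=${error.issues.length}`;
  }
  return error instanceof Error ? error.message : String(error);
}

try {
  // Simulates what a wrapped tool would throw when validation fails.
  throw new BetterToolValidationError("search", [{ message: "missing required argument: query" }]);
} catch (error) {
  console.log(describeFailure(error));
}
```

The message format is unchanged from 0.1.0, so code that only logs `error.message` should keep working, while new code can branch on `instanceof`.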
@@ -1,5 +1,15 @@
  import { defaultRepair } from "./default-repair.js";
  import { reliableToolCalls } from "./reliable.js";
+ export class BetterToolValidationError extends Error {
+     tool;
+     issues;
+     constructor(tool, issues) {
+         super(`BetterCall rejected ${tool}: ${issues.map((issue) => issue.message).join("; ")}`);
+         this.tool = tool;
+         this.issues = issues;
+         this.name = "BetterToolValidationError";
+     }
+ }
  export function betterTools(tools, options = {}) {
      return tools.map((tool) => wrapOneTool(tool, options));
  }
@@ -16,7 +26,7 @@ function wrapOneTool(tool, options) {
              mode: options.mode,
          });
          if (!result.ok) {
-             throw new Error(`BetterCall rejected ${tool.name}: ${result.issues.map((issue) => issue.message).join("; ")}`);
+             throw new BetterToolValidationError(tool.name, result.issues);
          }
          const safeCall = result.calls.find((call) => call.tool === tool.name);
          if (!safeCall)
@@ -1 +1 @@
- {"version":3,"file":"better-tool.js","sourceRoot":"","sources":["../src/better-tool.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,aAAa,EAAwB,MAAM,qBAAqB,CAAC;AAC1E,OAAO,EAAE,iBAAiB,EAAE,MAAM,eAAe,CAAC;AAoBlD,MAAM,UAAU,WAAW,CAA2B,KAAU,EAAE,UAA8B,EAAE;IAChG,OAAO,KAAK,CAAC,GAAG,CAAC,CAAC,IAAI,EAAE,EAAE,CAAC,WAAW,CAAC,IAAI,EAAE,OAAO,CAAC,CAAC,CAAC;AACzD,CAAC;AAED,SAAS,WAAW,CAA2B,IAAO,EAAE,OAA2B;IACjF,MAAM,OAAO,GAAG,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,cAAc,CAAC,IAAI,CAAC,CAAC,EAAE,IAAI,CAAM,CAAC;IACrF,OAAO,CAAC,MAAM,GAAG,KAAK,EAAE,KAAc,EAAE,MAAgB,EAAE,EAAE;QAC1D,MAAM,IAAI,GAAG,WAAW,CAAC,KAAK,CAAC,CAAC;QAChC,MAAM,MAAM,GAAG,MAAM,iBAAiB,CAAC;YACrC,SAAS,EAAE,gBAAgB,CAAC,OAAO,CAAC,SAAS,EAAE,IAAI,CAAC;YACpD,KAAK,EAAE,CAAC,gBAAgB,CAAC,IAAI,EAAE,OAAO,CAAC,CAAC;YACxC,KAAK,EAAE,CAAC,EAAE,IAAI,EAAE,IAAI,CAAC,IAAI,EAAE,IAAI,EAAE,CAAC;YAClC,MAAM,EAAE,OAAO,CAAC,MAAM;YACtB,MAAM,EAAE,aAAa,CAAC,OAAO,CAAC;YAC9B,IAAI,EAAE,OAAO,CAAC,IAAI;SACnB,CAAC,CAAC;QACH,IAAI,CAAC,MAAM,CAAC,EAAE,EAAE,CAAC;YACf,MAAM,IAAI,KAAK,CAAC,uBAAuB,IAAI,CAAC,IAAI,KAAK,MAAM,CAAC,MAAM,CAAC,GAAG,CAAC,CAAC,KAAK,EAAE,EAAE,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,EAAE,CAAC,CAAC;QACjH,CAAC;QACD,MAAM,QAAQ,GAAG,MAAM,CAAC,KAAK,CAAC,IAAI,CAAC,CAAC,IAAI,EAAE,EAAE,CAAC,IAAI,CAAC,IAAI,KAAK,IAAI,CAAC,IAAI,CAAC,CAAC;QACtE,IAAI,CAAC,QAAQ;YAAE,MAAM,IAAI,KAAK,CAAC,sBAAsB,IAAI,CAAC,IAAI,oBAAoB,CAAC,CAAC;QACpF,OAAO,IAAI,CAAC,MAAM,CAAC,IAAI,CAAC,IAAI,EAAE,aAAa,CAAC,KAAK,EAAE,QAAQ,CAAC,IAAI,CAAC,EAAE,MAAM,CAAC,CAAC;IAC7E,CAAC,CAAC;IACF,OAAO,OAAO,CAAC;AACjB,CAAC;AAED,SAAS,aAAa,CAAC,OAA2B;IAChD,OAAO,OAAO,CAAC,MAAM,IAAI,CAAC,OAAO,CAAC,WAAW,CAAC,CAAC,CAAC,aAAa,CAAC,OAAO,CAAC,WAAW,CAAC,CAAC,CAAC,CAAC,SAAS,CAAC,CAAC;AAClG,CAAC;AAED,SAAS,gBAAgB,CAAC,IAAoB,EAAE,OAA2B;IACzE,OAAO;QACL,IAAI,EAAE,IAAI,CAAC,IAAI;QACf,WAAW,EAAE,IAAI,CAAC,WAAW;QAC7B,MAAM,EAAE,OAAO,CAAC,MAAM,IAAI,CAAC,YAAY,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC,CAAC,SAAS,CAAC;QAC/E,QAAQ,EAAE,OAAO,CAAC,QAAQ;KAC3B,CAAC;AACJ,CAAC;AAED,SAAS,gB
AAgB,CACvB,SAA0C,EAC1C,IAA6B;IAE7B,IAAI,OAAO,SAAS,KAAK,UAAU;QAAE,OAAO,SAAS,CAAC,IAAI,CAAC,CAAC;IAC5D,OAAO,SAAS,IAAI,IAAI,CAAC,SAAS,CAAC,IAAI,CAAC,CAAC;AAC3C,CAAC;AAED,SAAS,WAAW,CAAC,KAAc;IACjC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC;QAAE,OAAO,KAAK,CAAC,IAAI,CAAC;IAC/D,OAAO,QAAQ,CAAC,KAAK,CAAC,CAAC,CAAC,CAAC,KAAK,CAAC,CAAC,CAAC,EAAE,KAAK,EAAE,CAAC;AAC7C,CAAC;AAED,SAAS,aAAa,CAAC,KAAc,EAAE,IAA6B;IAClE,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC;QAAE,OAAO,EAAE,GAAG,KAAK,EAAE,IAAI,EAAE,CAAC;IACvE,OAAO,IAAI,CAAC;AACd,CAAC;AAED,SAAS,YAAY,CAAC,KAAc;IAClC,OAAO,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC,OAAO,KAAK,CAAC,IAAI,KAAK,QAAQ,IAAI,KAAK,CAAC,OAAO,CAAC,KAAK,CAAC,IAAI,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,UAAU,CAAC,CAAC,CAAC;AACxH,CAAC;AAED,SAAS,QAAQ,CAAC,KAAc;IAC9B,OAAO,OAAO,KAAK,KAAK,QAAQ,IAAI,KAAK,KAAK,IAAI,IAAI,CAAC,KAAK,CAAC,OAAO,CAAC,KAAK,CAAC,CAAC;AAC9E,CAAC"}
+ {"version":3,"file":"better-tool.js","sourceRoot":"","sources":["../src/better-tool.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,aAAa,EAAwB,MAAM,qBAAqB,CAAC;AAC1E,OAAO,EAAE,iBAAiB,EAAE,MAAM,eAAe,CAAC;AAoBlD,MAAM,OAAO,yBAA0B,SAAQ,KAAK;IAEhC;IACA;IAFlB,YACkB,IAAY,EACZ,MAAuB;QAEvC,KAAK,CAAC,uBAAuB,IAAI,KAAK,MAAM,CAAC,GAAG,CAAC,CAAC,KAAK,EAAE,EAAE,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,EAAE,CAAC,CAAC;QAHzE,SAAI,GAAJ,IAAI,CAAQ;QACZ,WAAM,GAAN,MAAM,CAAiB;QAGvC,IAAI,CAAC,IAAI,GAAG,2BAA2B,CAAC;IAC1C,CAAC;CACF;AAED,MAAM,UAAU,WAAW,CAA2B,KAAU,EAAE,UAA8B,EAAE;IAChG,OAAO,KAAK,CAAC,GAAG,CAAC,CAAC,IAAI,EAAE,EAAE,CAAC,WAAW,CAAC,IAAI,EAAE,OAAO,CAAC,CAAC,CAAC;AACzD,CAAC;AAED,SAAS,WAAW,CAA2B,IAAO,EAAE,OAA2B;IACjF,MAAM,OAAO,GAAG,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,cAAc,CAAC,IAAI,CAAC,CAAC,EAAE,IAAI,CAAM,CAAC;IACrF,OAAO,CAAC,MAAM,GAAG,KAAK,EAAE,KAAc,EAAE,MAAgB,EAAE,EAAE;QAC1D,MAAM,IAAI,GAAG,WAAW,CAAC,KAAK,CAAC,CAAC;QAChC,MAAM,MAAM,GAAG,MAAM,iBAAiB,CAAC;YACrC,SAAS,EAAE,gBAAgB,CAAC,OAAO,CAAC,SAAS,EAAE,IAAI,CAAC;YACpD,KAAK,EAAE,CAAC,gBAAgB,CAAC,IAAI,EAAE,OAAO,CAAC,CAAC;YACxC,KAAK,EAAE,CAAC,EAAE,IAAI,EAAE,IAAI,CAAC,IAAI,EAAE,IAAI,EAAE,CAAC;YAClC,MAAM,EAAE,OAAO,CAAC,MAAM;YACtB,MAAM,EAAE,aAAa,CAAC,OAAO,CAAC;YAC9B,IAAI,EAAE,OAAO,CAAC,IAAI;SACnB,CAAC,CAAC;QACH,IAAI,CAAC,MAAM,CAAC,EAAE,EAAE,CAAC;YACf,MAAM,IAAI,yBAAyB,CAAC,IAAI,CAAC,IAAI,EAAE,MAAM,CAAC,MAAM,CAAC,CAAC;QAChE,CAAC;QACD,MAAM,QAAQ,GAAG,MAAM,CAAC,KAAK,CAAC,IAAI,CAAC,CAAC,IAAI,EAAE,EAAE,CAAC,IAAI,CAAC,IAAI,KAAK,IAAI,CAAC,IAAI,CAAC,CAAC;QACtE,IAAI,CAAC,QAAQ;YAAE,MAAM,IAAI,KAAK,CAAC,sBAAsB,IAAI,CAAC,IAAI,oBAAoB,CAAC,CAAC;QACpF,OAAO,IAAI,CAAC,MAAM,CAAC,IAAI,CAAC,IAAI,EAAE,aAAa,CAAC,KAAK,EAAE,QAAQ,CAAC,IAAI,CAAC,EAAE,MAAM,CAAC,CAAC;IAC7E,CAAC,CAAC;IACF,OAAO,OAAO,CAAC;AACjB,CAAC;AAED,SAAS,aAAa,CAAC,OAA2B;IAChD,OAAO,OAAO,CAAC,MAAM,IAAI,CAAC,OAAO,CAAC,WAAW,CAAC,CAAC,CAAC,aAAa,CAAC,OAAO,CAAC,WAAW,CAAC,CAAC,CAAC,CAAC,SAAS,CAAC,CAAC;AAClG,CAAC;AAED,SAAS,gBAAgB,CAAC,IAAoB,EAAE,OAA2B;IACzE,OAAO;QACL,IAAI,EAAE,IAAI,CAAC,IAAI;
QACf,WAAW,EAAE,IAAI,CAAC,WAAW;QAC7B,MAAM,EAAE,OAAO,CAAC,MAAM,IAAI,CAAC,YAAY,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC,CAAC,SAAS,CAAC;QAC/E,QAAQ,EAAE,OAAO,CAAC,QAAQ;KAC3B,CAAC;AACJ,CAAC;AAED,SAAS,gBAAgB,CACvB,SAA0C,EAC1C,IAA6B;IAE7B,IAAI,OAAO,SAAS,KAAK,UAAU;QAAE,OAAO,SAAS,CAAC,IAAI,CAAC,CAAC;IAC5D,OAAO,SAAS,IAAI,IAAI,CAAC,SAAS,CAAC,IAAI,CAAC,CAAC;AAC3C,CAAC;AAED,SAAS,WAAW,CAAC,KAAc;IACjC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC;QAAE,OAAO,KAAK,CAAC,IAAI,CAAC;IAC/D,OAAO,QAAQ,CAAC,KAAK,CAAC,CAAC,CAAC,CAAC,KAAK,CAAC,CAAC,CAAC,EAAE,KAAK,EAAE,CAAC;AAC7C,CAAC;AAED,SAAS,aAAa,CAAC,KAAc,EAAE,IAA6B;IAClE,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC;QAAE,OAAO,EAAE,GAAG,KAAK,EAAE,IAAI,EAAE,CAAC;IACvE,OAAO,IAAI,CAAC;AACd,CAAC;AAED,SAAS,YAAY,CAAC,KAAc;IAClC,OAAO,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC,OAAO,KAAK,CAAC,IAAI,KAAK,QAAQ,IAAI,KAAK,CAAC,OAAO,CAAC,KAAK,CAAC,IAAI,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,UAAU,CAAC,CAAC,CAAC;AACxH,CAAC;AAED,SAAS,QAAQ,CAAC,KAAc;IAC9B,OAAO,OAAO,KAAK,KAAK,QAAQ,IAAI,KAAK,KAAK,IAAI,IAAI,CAAC,KAAK,CAAC,OAAO,CAAC,KAAK,CAAC,CAAC;AAC9E,CAAC"}
package/dist/index.d.ts CHANGED
@@ -3,6 +3,6 @@ export { buildRepairPrompt, buildReviewPrompt } from "./prompts.js";
  export { defaultRepair } from "./default-repair.js";
  export type { RepairModelLike } from "./default-repair.js";
  export { reliableToolCalls } from "./reliable.js";
- export { betterTools } from "./better-tool.js";
+ export { BetterToolValidationError, betterTools } from "./better-tool.js";
  export type { BetterToolLike, BetterToolsOptions } from "./better-tool.js";
  export type { GuardPolicy, GuardResult, IssueKind, JsonSchema, ReliableToolCallsInput, ReliableToolCallsResult, RepairFunction, RepairInput, SemanticValidator, ToolCall, ToolCallIssue, ToolDefinition, } from "./types.js";
package/dist/index.js CHANGED
@@ -2,5 +2,5 @@ export { guardToolCalls } from "./guard.js";
  export { buildRepairPrompt, buildReviewPrompt } from "./prompts.js";
  export { defaultRepair } from "./default-repair.js";
  export { reliableToolCalls } from "./reliable.js";
- export { betterTools } from "./better-tool.js";
+ export { BetterToolValidationError, betterTools } from "./better-tool.js";
  //# sourceMappingURL=index.js.map
package/dist/index.js.map CHANGED
@@ -1 +1 @@
- {"version":3,"file":"index.js","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,cAAc,EAAE,MAAM,YAAY,CAAC;AAC5C,OAAO,EAAE,iBAAiB,EAAE,iBAAiB,EAAE,MAAM,cAAc,CAAC;AACpE,OAAO,EAAE,aAAa,EAAE,MAAM,qBAAqB,CAAC;AAEpD,OAAO,EAAE,iBAAiB,EAAE,MAAM,eAAe,CAAC;AAClD,OAAO,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC"}
+ {"version":3,"file":"index.js","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,cAAc,EAAE,MAAM,YAAY,CAAC;AAC5C,OAAO,EAAE,iBAAiB,EAAE,iBAAiB,EAAE,MAAM,cAAc,CAAC;AACpE,OAAO,EAAE,aAAa,EAAE,MAAM,qBAAqB,CAAC;AAEpD,OAAO,EAAE,iBAAiB,EAAE,MAAM,eAAe,CAAC;AAClD,OAAO,EAAE,yBAAyB,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC"}
package/docs/banner.svg CHANGED
@@ -1,6 +1,6 @@
  <svg xmlns="http://www.w3.org/2000/svg" width="1400" height="520" viewBox="0 0 1400 520" role="img" aria-labelledby="title desc">
    <title id="title">BetterCall banner</title>
-   <desc id="desc">BetterCall is a one-line wrapper that boosts BFCL tool-call accuracy from 81.3 percent to 91.3 percent.</desc>
+   <desc id="desc">BetterCall is a one-line wrapper with three full BFCL remote runs completed; the best run improves from 55.5 percent to 63.6 percent.</desc>
    <defs>
      <linearGradient id="background" x1="0" y1="0" x2="1" y2="1">
        <stop offset="0" stop-color="#07111f"/>
@@ -24,6 +24,6 @@
    <g transform="translate(112 112)">
      <text x="0" y="80" font-family="Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="78" font-weight="800" fill="#f8fbff">BetterCall</text>
      <rect x="4" y="116" width="430" height="8" rx="4" fill="url(#accent)"/>
-     <text x="0" y="186" font-family="Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="34" font-weight="670" fill="#dcecff">One-line wrapper. BFCL tool-call accuracy: 81.3% → 91.3%.</text>
+     <text x="0" y="186" font-family="Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="34" font-weight="670" fill="#dcecff">Three full BFCL remote runs. Best: 55.5% → 63.6%.</text>
    </g>
  </svg>
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@botbotgo/better-call",
-   "version": "0.1.0",
+   "version": "0.1.2",
    "description": "LLM tool-call reliability layer.",
    "type": "module",
    "license": "Apache-2.0",
@@ -38,6 +38,8 @@
    "scripts": {
      "build": "tsc -p tsconfig.json",
      "bench:bfcl": "npm run build && node scripts/bench-bfcl.mjs",
+     "bench:bfcl:real": "npm run build && node scripts/bench-bfcl-real.mjs",
+     "bench:bfcl:remote-small-all": "bash scripts/run-remote-small-bfcl-all.sh",
      "release:pack": "npm pack --dry-run",
      "release:publish": "npm publish --access public",
      "test": "npm run build && node --test test/*.test.mjs",
@@ -0,0 +1,448 @@
1
+ import { mkdir, readFile, writeFile } from "node:fs/promises";
2
+ import { dirname, resolve } from "node:path";
3
+ import { fileURLToPath } from "node:url";
4
+ import { defaultRepair, reliableToolCalls } from "../dist/index.js";
5
+
6
+ const root = resolve(dirname(fileURLToPath(import.meta.url)), "..");
7
+ const allSingleTurnToolCategories = [
8
+ "simple_python",
9
+ "simple_java",
10
+ "simple_javascript",
11
+ "multiple",
12
+ "parallel",
13
+ "parallel_multiple",
14
+ "irrelevance",
15
+ "live_simple",
16
+ "live_multiple",
17
+ "live_parallel",
18
+ "live_parallel_multiple",
19
+ "live_irrelevance",
20
+ ];
21
+ const bfclDataDir = process.env.BFCL_DATA_DIR
22
+ ?? "/tmp/better-call-bfcl-venv/lib/python3.13/site-packages/bfcl_eval/data";
23
+ const ollamaBaseUrl = process.env.OLLAMA_BASE_URL ?? "http://127.0.0.1:11434";
24
+ const models = csv(process.env.BENCH_MODELS ?? "granite4.1:3b,qwen3.5:2b");
25
+ const categories = expandCategories(process.env.BENCH_CATEGORIES ?? "simple_python,irrelevance");
26
+ const casesPerCategory = Number(process.env.BENCH_CASES_PER_CATEGORY ?? "10");
27
+ const caseOffset = Number(process.env.BENCH_CASE_OFFSET ?? "0");
28
+ const requestTimeoutMs = Number(process.env.BENCH_REQUEST_TIMEOUT_MS ?? "120000");
29
+ const progressInterval = Number(process.env.BENCH_PROGRESS_INTERVAL ?? "25");
30
+ const outputPath = process.env.BENCH_OUTPUT
31
+ ?? resolve(root, "benchmarks", `bfcl-real-${new Date().toISOString().replaceAll(":", "-")}.json`);
32
+ const showEndpoint = process.env.BENCH_SHOW_ENDPOINT === "1";
33
+
34
+ await assertOllama();
35
+
36
+ console.log("# BFCL v4 real model benchmark");
37
+ console.log("");
38
+ console.log(`Ollama endpoint: ${showEndpoint ? ollamaBaseUrl : "<provided by OLLAMA_BASE_URL>"}`);
39
+ console.log(`BFCL data: ${bfclDataDir}`);
40
+ console.log(`Models: ${models.join(", ")}`);
41
+ console.log(`Categories: ${categories.join(", ")}`);
42
+ console.log(`Cases/category: ${casesPerCategory > 0 ? casesPerCategory : "all"}`);
43
+ console.log(`Case offset: ${caseOffset}`);
44
+ console.log(`Request timeout: ${requestTimeoutMs}ms`);
45
+ console.log(`Progress interval: ${progressInterval}`);
46
+ console.log(`Output: ${outputPath}`);
47
+ console.log("");
48
+
49
+ const results = [];
50
+ for (const model of models) {
51
+ const rows = [];
52
+ for (const category of categories) {
53
+ const cases = await loadCases(category);
54
+ const offsetCases = caseOffset > 0 ? cases.slice(caseOffset) : cases;
55
+ const selected = casesPerCategory > 0 ? offsetCases.slice(0, casesPerCategory) : offsetCases;
56
+ const row = await runCategory(model, category, selected);
57
+ rows.push(row);
58
+ printProgress(row);
59
+ await writeResults(results.concat([summarize(model, rows)]));
60
+ }
61
+ results.push(summarize(model, rows));
62
+ }
63
+
64
+ console.log("");
65
+ printTable(
66
+ ["Model", "Total", "Raw", "BetterCall", "Lift", "Raw Correct", "Better Correct", "Errors"],
67
+ results.map((row) => [
68
+ row.model,
69
+ String(row.total),
70
+ percent(row.rawCorrect, row.total),
71
+ percent(row.betterCorrect, row.total),
72
+ `${signed(((row.betterCorrect - row.rawCorrect) * 100) / row.total)}pp`,
73
+ String(row.rawCorrect),
74
+ String(row.betterCorrect),
75
+ String(row.errors),
76
+ ]),
77
+ );
78
+
79
+ console.log("");
80
+ console.log("## By Category");
81
+ console.log("");
82
+ printTable(
83
+ ["Model", "Category", "Total", "Raw", "BetterCall", "Lift", "Errors"],
84
+ results.flatMap((result) => result.categories.map((row) => [
85
+ result.model,
86
+ row.category,
87
+ String(row.total),
88
+ percent(row.rawCorrect, row.total),
89
+ percent(row.betterCorrect, row.total),
90
+ `${signed(((row.betterCorrect - row.rawCorrect) * 100) / row.total)}pp`,
91
+ String(row.errors),
92
+ ])),
93
+ );
94
+
95
+ await writeResults(results);
96
+
97
+ async function writeResults(currentResults) {
98
+ await mkdir(dirname(outputPath), { recursive: true });
99
+ await writeFile(outputPath, `${JSON.stringify({
100
+ ollamaEndpoint: showEndpoint ? ollamaBaseUrl : "<redacted>",
101
+ bfclDataDir,
102
+ models,
103
+ categories,
104
+ casesPerCategory,
105
+ requestTimeoutMs,
106
+ progressInterval,
107
+ results: currentResults,
108
+ }, null, 2)}\n`);
109
+ }
110
+
111
+ async function runCategory(model, category, cases) {
112
+ let rawCorrect = 0;
113
+ let betterCorrect = 0;
114
+ let rawToolCalls = 0;
115
+ let repaired = 0;
116
+ let errors = 0;
117
+ const failures = [];
118
+
119
+ for (const [caseIndex, item] of cases.entries()) {
120
+ const tools = item.functions.map(toToolDefinition);
121
+ let raw = [];
122
+ try {
123
+ raw = await callOllamaTools(model, item.userInput, tools);
124
+ } catch (error) {
125
+ errors += 1;
126
+ failures.push({ id: item.id, error: error instanceof Error ? error.message : String(error) });
127
+ printCaseProgress(model, category, caseIndex, cases.length, rawCorrect, betterCorrect, repaired, errors);
128
+ continue;
129
+ }
130
+ rawToolCalls += raw.length;
131
+ if (isCorrect(category, raw, item.groundTruth)) rawCorrect += 1;
132
+
133
+ const repairModel = {
134
+ invoke: (input) => callOllamaText(model, String(input)),
135
+ };
136
+ let better;
137
+ try {
138
+ better = await reliableToolCalls({
139
+ userInput: item.userInput,
140
+ tools,
141
+ calls: raw,
142
+ policy: { expected: isIrrelevanceCategory(category) ? "none" : "some" },
143
+ repair: defaultRepair(repairModel),
144
+ });
145
+ } catch (error) {
146
+ errors += 1;
147
+ failures.push({ id: item.id, raw, error: error instanceof Error ? error.message : String(error) });
148
+ printCaseProgress(model, category, caseIndex, cases.length, rawCorrect, betterCorrect, repaired, errors);
149
+ continue;
150
+ }
151
+ if (better.repaired) repaired += 1;
152
+ if (isCorrect(category, better.calls, item.groundTruth)) {
153
+ betterCorrect += 1;
154
+ } else if (failures.length < 3) {
155
+ failures.push({ id: item.id, raw, better: better.calls, issues: better.issues });
156
+ }
157
+ printCaseProgress(model, category, caseIndex, cases.length, rawCorrect, betterCorrect, repaired, errors);
158
+ }
159
+
160
+ return {
161
+ model,
162
+ category,
163
+ total: cases.length,
164
+ rawCorrect,
165
+ betterCorrect,
166
+ rawToolCalls,
167
+ repaired,
168
+ errors,
169
+ failures,
170
+ };
171
+ }
172
+
173
+ async function loadCases(category) {
174
+ const prompts = await readJsonl(resolve(bfclDataDir, `BFCL_v4_${category}.json`));
175
+ const answers = isIrrelevanceCategory(category)
176
+ ? prompts.map((item) => ({ id: item.id, ground_truth: [] }))
177
+ : await readJsonl(resolve(bfclDataDir, "possible_answer", `BFCL_v4_${category}.json`));
178
+
179
+ return prompts.map((item, index) => ({
180
+ id: item.id,
181
+ userInput: item.question.flat().map((message) => message.content).join("\n"),
182
+ functions: item.function ?? [],
183
+ groundTruth: answers[index].ground_truth,
184
+ }));
185
+ }
186
+
187
+ function toToolDefinition(fn) {
188
+ return {
189
+ name: fn.name,
190
+ description: fn.description,
191
+ schema: normalizeSchema(fn.parameters),
192
+ };
193
+ }
194
+
195
+ function normalizeSchema(schema) {
196
+ if (!schema || typeof schema !== "object") return schema;
197
+ const next = { ...schema };
198
+ if (next.type === "dict") next.type = "object";
199
+ if (next.properties) {
200
+ next.properties = Object.fromEntries(
201
+ Object.entries(next.properties).map(([key, value]) => [key, normalizeSchema(value)]),
202
+ );
203
+ }
204
+ if (next.items) next.items = normalizeSchema(next.items);
205
+ return next;
206
+ }
207
+
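As a quick illustration of what `normalizeSchema` does, the sketch below repeats the function from the diff and applies it to a hypothetical schema that uses the `"dict"` type name. The sample schema is made up for the example.

```javascript
// Same logic as normalizeSchema in the diff, repeated so the example runs alone.
function normalizeSchema(schema) {
  if (!schema || typeof schema !== "object") return schema;
  const next = { ...schema };
  if (next.type === "dict") next.type = "object";
  if (next.properties) {
    next.properties = Object.fromEntries(
      Object.entries(next.properties).map(([key, value]) => [key, normalizeSchema(value)]),
    );
  }
  if (next.items) next.items = normalizeSchema(next.items);
  return next;
}

// Hypothetical input with "dict" at the top level, in a property, and in array items.
const bfclStyleSchema = {
  type: "dict",
  properties: {
    location: { type: "dict", properties: { city: { type: "string" } } },
    tags: { type: "array", items: { type: "dict", properties: {} } },
  },
  required: ["location"],
};

const normalizedSchema = normalizeSchema(bfclStyleSchema);
console.log(JSON.stringify(normalizedSchema, null, 2));
```

Every `"dict"` becomes `"object"` at any depth reachable through `properties` or `items`, and the input object is copied rather than mutated.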
208
+ async function callOllamaTools(model, userInput, tools) {
209
+ const payload = {
210
+ model,
211
+ stream: false,
212
+ messages: [{ role: "user", content: userInput }],
213
+ tools: tools.map((tool) => ({
214
+ type: "function",
215
+ function: {
216
+ name: tool.name,
217
+ description: tool.description ?? "",
218
+ parameters: tool.schema ?? { type: "object", properties: {} },
219
+ },
220
+ })),
221
+ options: { temperature: 0 },
222
+ };
223
+ const json = await postJson(`${ollamaBaseUrl}/api/chat`, payload);
224
+ return normalizeOllamaToolCalls(json.message?.tool_calls);
225
+ }
226
+
227
+ async function callOllamaText(model, prompt) {
228
+ const json = await postJson(`${ollamaBaseUrl}/api/chat`, {
229
+ model,
230
+ stream: false,
231
+ messages: [{ role: "user", content: prompt }],
232
+ format: "json",
233
+ options: { temperature: 0 },
234
+ });
235
+ return json.message?.content ?? "";
236
+ }
237
+
238
+ function normalizeOllamaToolCalls(toolCalls) {
239
+ if (!Array.isArray(toolCalls)) return [];
240
+ return toolCalls
241
+ .map((call) => call?.function)
242
+ .filter((fn) => fn && typeof fn.name === "string")
243
+ .map((fn) => ({
244
+ tool: fn.name,
245
+ args: isRecord(fn.arguments) ? fn.arguments : {},
246
+ }));
247
+ }
248
+
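The following sketch shows the shape conversion performed by `normalizeOllamaToolCalls`. It repeats the function and `isRecord` from the diff and feeds them a hypothetical `tool_calls` array; the tool names and arguments are invented.

```javascript
// Repeated from the diff so the example is self-contained.
function isRecord(value) {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function normalizeOllamaToolCalls(toolCalls) {
  if (!Array.isArray(toolCalls)) return [];
  return toolCalls
    .map((call) => call?.function)
    .filter((fn) => fn && typeof fn.name === "string")
    .map((fn) => ({
      tool: fn.name,
      args: isRecord(fn.arguments) ? fn.arguments : {},
    }));
}

// Hypothetical message.tool_calls payload with one well-formed and two odd entries.
const sampleToolCalls = [
  { function: { name: "search", arguments: { query: "weather in Paris" } } },
  { function: { arguments: { query: "no name, so this entry is dropped" } } },
  { function: { name: "calculator", arguments: "{\"a\": 1}" } },
];

const normalizedCalls = normalizeOllamaToolCalls(sampleToolCalls);
console.log(normalizedCalls);
console.log(normalizeOllamaToolCalls(undefined));
```

Note that arguments supplied as a JSON string are replaced by an empty object rather than parsed, so such a call would only match a ground truth whose arguments are all optional.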
249
+ function isCorrect(category, calls, groundTruth) {
250
+ if (isIrrelevanceCategory(category)) return calls.length === 0;
251
+ if (!Array.isArray(groundTruth) || calls.length !== groundTruth.length) return false;
252
+ if (category.includes("parallel")) return callsMatchUnordered(calls, groundTruth);
253
+ return groundTruth.every((expected, index) => callMatches(calls[index], expected));
254
+ }
255
+
256
+ function callsMatchUnordered(calls, groundTruth) {
257
+ const used = new Set();
258
+ for (const expected of groundTruth) {
259
+ const index = calls.findIndex((call, candidateIndex) => !used.has(candidateIndex) && callMatches(call, expected));
260
+ if (index < 0) return false;
261
+ used.add(index);
262
+ }
263
+ return true;
264
+ }
265
+
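`callsMatchUnordered` assigns each expected call to the first unused actual call that matches it. The sketch below isolates that first-fit strategy with a pluggable matcher; `matchUnordered` and `accepts` are hypothetical names, and the string "calls" are a stand-in for real tool calls.

```javascript
// First-fit unordered matching, with the match predicate passed in.
function matchUnordered(calls, expectedList, matches) {
  const used = new Set();
  for (const expected of expectedList) {
    const index = calls.findIndex((call, i) => !used.has(i) && matches(call, expected));
    if (index < 0) return false;
    used.add(index);
  }
  return true;
}

// Hypothetical matcher: each expectation is a list of acceptable call names.
const accepts = (call, expected) => expected.includes(call);

// Order does not matter when each call fits exactly one expectation.
console.log(matchUnordered(["b", "a"], [["a"], ["b"]], accepts)); // true

// First-fit can miss a valid assignment: the first expectation takes "x",
// leaving only "y" for the second expectation, which accepts only "x".
console.log(matchUnordered(["x", "y"], [["x", "y"], ["x"]], accepts)); // false
```

The second case only arises when one actual call satisfies more than one expected entry. Whether the benchmark data contains such overlaps is not something this diff shows, so treat it as a property of the strategy rather than a known scoring error.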
266
+ function callMatches(call, expected) {
267
+ if (!call || !isRecord(expected)) return false;
268
+ const [tool, args] = Object.entries(expected)[0] ?? [];
269
+ if (call.tool !== tool) return false;
270
+ if (!isRecord(args)) return false;
271
+ for (const [key, allowed] of Object.entries(args)) {
272
+ if (!Object.hasOwn(call.args, key)) {
273
+ if (Array.isArray(allowed) && allowed.includes("")) continue;
274
+ return false;
275
+ }
276
+ if (!valueAllowed(call.args[key], allowed)) return false;
277
+ }
278
+ for (const key of Object.keys(call.args)) {
279
+ if (!Object.hasOwn(args, key)) return false;
280
+ }
281
+ return true;
282
+ }
283
+
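To make the ground-truth format read by `callMatches` concrete, here is a simplified sketch limited to scalar arguments. `scalarCallMatches` is a hypothetical reduction of the function above (the diff's version also handles nested objects, arrays, and string normalization), and the `get_weather` tool is invented.

```javascript
// Simplified matcher: each expected argument maps to a list of allowed scalars.
function scalarCallMatches(call, expected) {
  const [tool, args] = Object.entries(expected)[0] ?? [];
  if (!call || !args || call.tool !== tool) return false;
  for (const [key, allowed] of Object.entries(args)) {
    if (!Object.hasOwn(call.args, key)) {
      if (allowed.includes("")) continue; // "" in the list marks the argument as optional
      return false;
    }
    if (!allowed.includes(call.args[key])) return false;
  }
  // Arguments that the ground truth does not mention count as a mismatch.
  return Object.keys(call.args).every((key) => Object.hasOwn(args, key));
}

// Hypothetical ground-truth entry: one tool name mapped to allowed values per argument.
const expectedCall = { get_weather: { city: ["Paris"], unit: ["celsius", ""] } };

console.log(scalarCallMatches({ tool: "get_weather", args: { city: "Paris" } }, expectedCall)); // true
console.log(scalarCallMatches({ tool: "get_weather", args: { city: "Paris", unit: "kelvin" } }, expectedCall)); // false
console.log(scalarCallMatches({ tool: "get_weather", args: { city: "Paris", days: 3 } }, expectedCall)); // false
```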
284
+ function valueAllowed(actual, allowed) {
285
+ if (!Array.isArray(allowed)) return Object.is(actual, allowed);
286
+ return allowed.some((value) => {
287
+ if (value === "" && (actual === undefined || actual === "")) return true;
288
+ if (isRecord(actual) && isRecord(value)) return objectMatches(actual, value);
289
+ if (Array.isArray(actual) && Array.isArray(value)) return arrayMatches(actual, value);
290
+ if (typeof actual === "string" && typeof value === "string") {
291
+ return normalizeString(actual) === normalizeString(value);
292
+ }
293
+ return JSON.stringify(actual) === JSON.stringify(value);
294
+ });
295
+ }
296
+
297
+ function objectMatches(actual, expected) {
298
+ for (const [key, allowed] of Object.entries(expected)) {
299
+ if (!Object.hasOwn(actual, key)) {
300
+ if (Array.isArray(allowed) && allowed.includes("")) continue;
301
+ return false;
302
+ }
303
+ if (!valueAllowed(actual[key], allowed)) return false;
304
+ }
305
+ for (const key of Object.keys(actual)) {
306
+ if (!Object.hasOwn(expected, key)) return false;
307
+ }
308
+ return true;
309
+ }
310
+
311
+ function arrayMatches(actual, expected) {
312
+ if (actual.length !== expected.length) return false;
313
+ return expected.every((value, index) => {
314
+ if (isRecord(actual[index]) && isRecord(value)) return objectMatches(actual[index], value);
315
+ if (Array.isArray(actual[index]) && Array.isArray(value)) return arrayMatches(actual[index], value);
316
+ if (typeof actual[index] === "string" && typeof value === "string") {
317
+ return normalizeString(actual[index]) === normalizeString(value);
318
+ }
319
+ return Object.is(actual[index], value);
320
+ });
321
+ }
322
+
323
+ function normalizeString(value) {
324
+ return value.replace(/[ ,./\-_*^]/gu, "").toLowerCase().replaceAll("'", "\"");
325
+ }
326
+
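A few sample inputs show how lenient the string comparison becomes after `normalizeString`. The function is repeated from the diff; the input strings are made up.

```javascript
// Repeated from the diff: strips spaces and , . / - _ * ^, lowercases,
// then turns single quotes into double quotes.
function normalizeString(value) {
  return value.replace(/[ ,./\-_*^]/gu, "").toLowerCase().replaceAll("'", "\"");
}

console.log(normalizeString("New-York, NY")); // newyorkny
console.log(normalizeString("san_francisco") === normalizeString("San Francisco")); // true
console.log(normalizeString("3.14") === normalizeString("314")); // true
```

The last line shows the trade-off: string values that differ only in the stripped punctuation, including a decimal point inside a string, compare as equal. This applies to string arguments only; numbers take the other comparison paths.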
327
+ async function readJsonl(path) {
328
+ const text = await readFile(path, "utf8");
329
+ return text.trim().split("\n").filter(Boolean).map((line) => JSON.parse(line));
330
+ }
331
+
332
+ async function assertOllama() {
333
+ const tags = await postJson(`${ollamaBaseUrl}/api/tags`, undefined, "GET");
334
+ const available = new Set((tags.models ?? []).map((item) => item.name));
335
+ const missing = models.filter((model) => !available.has(model));
336
+ if (missing.length > 0) {
337
+ throw new Error(`Missing Ollama models: ${missing.join(", ")}`);
338
+ }
339
+ }
340
+
341
+ async function postJson(url, payload, method = "POST") {
342
+ const controller = new AbortController();
343
+ const timeout = setTimeout(() => controller.abort(), requestTimeoutMs);
344
+ try {
345
+ const response = await fetch(url, {
346
+ method,
347
+ headers: payload ? { "content-type": "application/json" } : undefined,
348
+ body: payload ? JSON.stringify(payload) : undefined,
349
+ signal: controller.signal,
350
+ });
351
+ if (!response.ok) {
352
+ throw new Error(`${method} ${redactUrl(url)} failed: ${response.status} ${await readBodyWithTimeout(response, "text", controller)}`);
353
+ }
354
+ return readBodyWithTimeout(response, "json", controller);
355
+ } finally {
356
+ clearTimeout(timeout);
357
+ }
358
+ }
359
+
360
+ async function readBodyWithTimeout(response, kind, controller) {
361
+ let timeout;
362
+ try {
363
+ return await Promise.race([
364
+ kind === "json" ? response.json() : response.text(),
365
+ new Promise((_, reject) => {
366
+ timeout = setTimeout(() => {
367
+ controller.abort();
368
+ reject(new Error(`Timed out reading ${kind} response body after ${requestTimeoutMs}ms`));
369
+ }, requestTimeoutMs);
370
+ }),
371
+ ]);
372
+ } finally {
373
+ clearTimeout(timeout);
374
+ }
375
+ }
376
+
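`readBodyWithTimeout` races the body read against a timer. The sketch below isolates that race in a generic helper so it can be tried without a network call. `withTimeout` and its `label` parameter are hypothetical names, and unlike the diff's version this helper does not abort a request.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out waiting for ${label} after ${ms}ms`)), ms);
  });
  // Clear the timer whichever side wins, so it cannot keep the process alive.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// A promise that never settles stands in for a stalled response body.
const stalled = new Promise(() => {});
const stalledOutcome = withTimeout(stalled, 20, "demo").then(
  () => "resolved",
  (error) => error.message,
);
stalledOutcome.then((outcome) => console.log(outcome));

// A promise that settles in time passes its value through unchanged.
const fastOutcome = withTimeout(Promise.resolve(42), 20, "fast");
fastOutcome.then((value) => console.log(value));
```

In the diff, the timer callback also calls `controller.abort()`, which cancels the underlying fetch instead of leaving it pending.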
377
+ function printCaseProgress(model, category, caseIndex, total, rawCorrect, betterCorrect, repaired, errors) {
378
+ if (progressInterval <= 0) return;
379
+ if ((caseIndex + 1) % progressInterval !== 0 && caseIndex + 1 !== total) return;
380
+ console.log(
381
+ `[${model}] ${category} progress ${caseIndex + 1}/${total}: `
382
+ + `raw ${rawCorrect}, better ${betterCorrect}, repaired ${repaired}, errors ${errors}`,
383
+ );
384
+ }
385
+
386
+ function redactUrl(url) {
387
+ try {
388
+ const parsed = new URL(url);
389
+ return `<redacted>${parsed.pathname}`;
390
+ } catch {
391
+ return "<redacted>";
392
+ }
393
+ }
394
+
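For reference, this is what `redactUrl` keeps and drops. The function is repeated from the diff; the URLs are illustrative.

```javascript
// Repeated from the diff: keep only the path, hide host, port, and query.
function redactUrl(url) {
  try {
    const parsed = new URL(url);
    return `<redacted>${parsed.pathname}`;
  } catch {
    return "<redacted>";
  }
}

console.log(redactUrl("http://127.0.0.1:11434/api/chat?token=abc")); // <redacted>/api/chat
console.log(redactUrl("not a url")); // <redacted>
```

Only the path survives, so error messages still say which endpoint failed without revealing where the server runs.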
395
+ function summarize(model, categories) {
396
+ return {
397
+ model,
398
+ total: sum(categories, "total"),
399
+ rawCorrect: sum(categories, "rawCorrect"),
400
+ betterCorrect: sum(categories, "betterCorrect"),
401
+ errors: sum(categories, "errors"),
402
+ categories,
403
+ };
404
+ }
405
+
406
+ function printProgress(row) {
407
+ console.log(
408
+ `[${row.model}] ${row.category}: raw ${row.rawCorrect}/${row.total}, `
409
+ + `better ${row.betterCorrect}/${row.total}, repaired ${row.repaired}, `
410
+ + `errors ${row.errors}, rawToolCalls ${row.rawToolCalls}`,
411
+ );
412
+ }
413
+
414
+ function printTable(headers, rows) {
415
+ console.log(`| ${headers.join(" | ")} |`);
416
+ console.log(`| ${headers.map(() => "---").join(" | ")} |`);
417
+ for (const row of rows) console.log(`| ${row.join(" | ")} |`);
418
+ }
419
+
420
+ function percent(correct, total) {
421
+ return `${((correct * 100) / total).toFixed(1)}%`;
422
+ }
423
+
424
+ function signed(value) {
425
+ return `${value >= 0 ? "+" : ""}${value.toFixed(1)}`;
426
+ }
427
+
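`percent` divides by `total` without a guard, so a category with zero cases (for example, if a case offset skips past the end of a category) would print `NaN%`. The sketch below shows that behavior next to a guarded variant; `safePercent` and its `"n/a"` placeholder are hypothetical and not part of the diff.

```javascript
// Repeated from the diff.
function percent(correct, total) {
  return `${((correct * 100) / total).toFixed(1)}%`;
}

// Hypothetical variant: report a placeholder for an empty category.
function safePercent(correct, total) {
  return total > 0 ? percent(correct, total) : "n/a";
}

console.log(percent(111, 200)); // 55.5%
console.log(percent(0, 0)); // NaN%
console.log(safePercent(0, 0)); // n/a
```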
428
+ function sum(rows, key) {
429
+ return rows.reduce((total, row) => total + row[key], 0);
430
+ }
431
+
432
+ function csv(value) {
433
+ return value.split(",").map((item) => item.trim()).filter(Boolean);
434
+ }
435
+
436
+ function expandCategories(value) {
437
+ const requested = csv(value);
438
+ if (requested.includes("all") || requested.includes("all_single_turn_tool")) return allSingleTurnToolCategories;
439
+ return requested;
440
+ }
441
+
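The next sketch shows how `BENCH_CATEGORIES` values are parsed by `csv` and `expandCategories`. Both functions are repeated from the diff. `allSingleTurnToolCategories` is defined outside this section, so the short list used here is a placeholder, not the real contents.

```javascript
// Placeholder list; the real allSingleTurnToolCategories is defined elsewhere in the script.
const allSingleTurnToolCategories = ["simple_python", "parallel", "irrelevance"];

// Repeated from the diff.
function csv(value) {
  return value.split(",").map((item) => item.trim()).filter(Boolean);
}

function expandCategories(value) {
  const requested = csv(value);
  if (requested.includes("all") || requested.includes("all_single_turn_tool")) return allSingleTurnToolCategories;
  return requested;
}

console.log(expandCategories(" simple_python , parallel ,")); // [ 'simple_python', 'parallel' ]
console.log(expandCategories("parallel,all") === allSingleTurnToolCategories); // true
```

Whitespace and empty entries are dropped, and `all` anywhere in the list overrides the other entries.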
442
+ function isIrrelevanceCategory(category) {
443
+ return category === "irrelevance" || category === "live_irrelevance";
444
+ }
445
+
446
+ function isRecord(value) {
447
+ return typeof value === "object" && value !== null && !Array.isArray(value);
448
+ }
@@ -0,0 +1,63 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
5
+ cd "$ROOT"
6
+
7
+ : "${OLLAMA_BASE_URL:?Set OLLAMA_BASE_URL to your Ollama endpoint, for example http://127.0.0.1:11434}"
8
+ BENCH_REQUEST_TIMEOUT_MS="${BENCH_REQUEST_TIMEOUT_MS:-5000}"
9
+ BENCH_PROGRESS_INTERVAL="${BENCH_PROGRESS_INTERVAL:-50}"
10
+
11
+ models=(
12
+ "qwen3.5:0.8b"
13
+ "qwen3:0.6b"
14
+ "lfm2.5-thinking:latest"
15
+ "qwen3.5:2b"
16
+ "granite4.1:3b"
17
+ "qwen3.5:4b"
18
+ "gemma4:e2b"
19
+ "qwen2.5:7b-instruct"
20
+ "gemma4:e4b"
21
+ "qwen3:latest"
22
+ "qwen3.5:9b"
23
+ )
24
+
25
+ categories=(
26
+ "simple_python"
27
+ "simple_java"
28
+ "simple_javascript"
29
+ "multiple"
30
+ "parallel"
31
+ "parallel_multiple"
32
+ "irrelevance"
33
+ "live_simple"
34
+ "live_multiple"
35
+ "live_parallel"
36
+ "live_parallel_multiple"
37
+ "live_irrelevance"
38
+ )
39
+
40
+ mkdir -p benchmarks/logs benchmarks/remote-small-runs
41
+
42
+ npm run build
43
+
44
+ for model in "${models[@]}"; do
45
+ safe_model="${model//[:\/]/-}"
46
+ for category in "${categories[@]}"; do
47
+ output="benchmarks/remote-small-runs/${safe_model}-${category}.json"
48
+ log="benchmarks/logs/${safe_model}-${category}.log"
49
+ if [[ -s "$output" ]]; then
50
+ echo "[skip] $model $category -> $output"
51
+ continue
52
+ fi
53
+ echo "[run] $model $category"
54
+ OLLAMA_BASE_URL="$OLLAMA_BASE_URL" \
55
+ BENCH_MODELS="$model" \
56
+ BENCH_CATEGORIES="$category" \
57
+ BENCH_CASES_PER_CATEGORY=0 \
58
+ BENCH_REQUEST_TIMEOUT_MS="$BENCH_REQUEST_TIMEOUT_MS" \
59
+ BENCH_PROGRESS_INTERVAL="$BENCH_PROGRESS_INTERVAL" \
60
+ BENCH_OUTPUT="$ROOT/$output" \
61
+ node scripts/bench-bfcl-real.mjs 2>&1 | tee "$log"
62
+ done
63
+ done