@botbotgo/better-call 0.1.0 → 0.1.2 (npm version diff, Mend)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

package/README.md +53 -3
package/benchmarks/bfcl-real-remote-completed-summary.json +362 -0
package/dist/better-tool.d.ts +6 -1
package/dist/better-tool.js +11 -1
package/dist/better-tool.js.map +1 -1
package/dist/index.d.ts +1 -1
package/dist/index.js +1 -1
package/dist/index.js.map +1 -1
package/docs/banner.svg +2 -2
package/package.json +3 -1
package/scripts/bench-bfcl-real.mjs +448 -0
package/scripts/run-remote-small-bfcl-all.sh +63 -0

package/README.md CHANGED

@@ -4,7 +4,7 @@
 # BetterCall
-**One-line wrapper. BFCL tool-call accuracy: 81.3% → 91.3%.**
+**One-line wrapper. Three full BFCL remote runs completed. Best: 55.5% → 63.6%.**
 ```ts
 const tools = betterTools([searchTool, calculatorTool]);
@@ -46,6 +46,29 @@ This prints the BFCL v4 targeted weak-category table used above. It is a BetterC
 For the full official BFCL v4 reproduction, use Berkeley's published checkpoint/package: commit `f7cf735` or `pip install bfcl-eval==2025.12.17`.
+To run the real-model benchmark against your own Ollama endpoint, install the BFCL package and point BetterCall at the package data:
+```bash
+python3 -m venv /tmp/better-call-bfcl-venv
+/tmp/better-call-bfcl-venv/bin/pip install "bfcl-eval==2025.12.17" soundfile
+OLLAMA_BASE_URL=http://127.0.0.1:11434 \
+BENCH_MODELS=qwen3.5:0.8b \
+BENCH_CATEGORIES=all \
+BENCH_CASES_PER_CATEGORY=0 \
+npm run bench:bfcl:real
+```
+For long categories, resume or shard a category with `BENCH_CASE_OFFSET` and `BENCH_CASES_PER_CATEGORY`; the benchmark still uses real model calls and counts request errors/timeouts as incorrect.
+For a model/category matrix run, set `OLLAMA_BASE_URL` and run:
+```bash
+OLLAMA_BASE_URL=http://127.0.0.1:11434 npm run bench:bfcl:remote-small-all
+```
+The runner does real `/api/chat` tool-call requests. Benchmark JSON redacts the endpoint by default; set `BENCH_SHOW_ENDPOINT=1` only for private debugging.
 ## What It Catches
 | Category | Failures | Example |
@@ -94,9 +117,36 @@ Default recommendation: start with validate + block, then add `repairModel` for
 ## Benchmark
-Measured on BFCL v4 single-turn targeted weak categories with remote Ollama models.
+Measured with real Ollama `/api/chat` calls over all supported BFCL v4 single-turn tool-call categories. Request errors and timeouts count as incorrect. This is **not an official BFCL leaderboard score**; it measures BetterCall as a runtime reliability layer.
+Latest completed remote run artifact: `benchmarks/bfcl-real-remote-completed-summary.json`.
+| Model | Completed cases | Raw | BetterCall repair | Accuracy lift | Request errors |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| `qwen3:0.6b` | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
+| `qwen3.5:0.8b` | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
+| `lfm2.5-thinking:latest` | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
+Latest completed model category detail: `lfm2.5-thinking:latest`.
+| Category | Cases | Raw | BetterCall repair | Lift | Request errors |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| `simple_python` | 400 | 72.5% | 72.5% | +0.0pp | 69 |
+| `simple_java` | 100 | 54.0% | 53.0% | -1.0pp | 11 |
+| `simple_javascript` | 50 | 60.0% | 58.0% | -2.0pp | 4 |
+| `multiple` | 200 | 59.5% | 59.0% | -0.5pp | 65 |
+| `parallel` | 200 | 57.0% | 57.0% | +0.0pp | 61 |
+| `parallel_multiple` | 200 | 26.5% | 26.5% | +0.0pp | 131 |
+| `irrelevance` | 240 | 29.2% | 36.7% | +7.5pp | 152 |
+| `live_simple` | 258 | 57.0% | 57.0% | +0.0pp | 52 |
+| `live_multiple` | 1,053 | 45.9% | 45.9% | +0.0pp | 302 |
+| `live_parallel` | 16 | 43.8% | 43.8% | +0.0pp | 2 |
+| `live_parallel_multiple` | 24 | 45.8% | 45.8% | +0.0pp | 2 |
+| `live_irrelevance` | 884 | 52.3% | 67.1% | +14.8pp | 291 |
+The strongest `lfm2.5-thinking:latest` category was `live_irrelevance`: 52.3% raw to 67.1% after BetterCall repair, a +14.8pp lift over 884 real remote cases.
-This is **not an official BFCL leaderboard score**. It measures BetterCall as a runtime reliability layer.
+Historical targeted wrapper benchmark:
 | Model | Raw | BetterCall repair | Accuracy lift |
 | --- | ---: | ---: | ---: |
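The README text above describes resuming or sharding a long category with `BENCH_CASE_OFFSET` and `BENCH_CASES_PER_CATEGORY`. As a rough sketch of how those two values select cases, the block below mirrors the slicing in `scripts/bench-bfcl-real.mjs` (skip `offset` cases, then take `limit`, where a limit of 0 means all remaining cases). `selectCases`, `Shard`, and `shardPlan` are hypothetical names introduced for illustration; they are not exports of the package.

```typescript
// Hypothetical helper mirroring the offset/limit slicing in scripts/bench-bfcl-real.mjs:
// skip `offset` cases first, then take `limit` cases (0 or less means "all remaining").
function selectCases<T>(cases: readonly T[], offset: number, limit: number): T[] {
  const afterOffset = offset > 0 ? cases.slice(offset) : cases.slice();
  return limit > 0 ? afterOffset.slice(0, limit) : afterOffset;
}

// Hypothetical shape: the environment-variable pair for one shard.
type Shard = { BENCH_CASE_OFFSET: string; BENCH_CASES_PER_CATEGORY: string };

// Hypothetical planner: one environment-variable pair per shard of a category.
function shardPlan(totalCases: number, shardSize: number): Shard[] {
  if (!Number.isInteger(shardSize) || shardSize <= 0) {
    throw new RangeError("shardSize must be a positive integer");
  }
  const plan: Shard[] = [];
  for (let offset = 0; offset < totalCases; offset += shardSize) {
    plan.push({
      BENCH_CASE_OFFSET: String(offset),
      BENCH_CASES_PER_CATEGORY: String(shardSize),
    });
  }
  return plan;
}

// live_multiple has 1,053 cases in the README tables above.
const livePlan = shardPlan(1053, 500);
console.log(livePlan.map((shard) => shard.BENCH_CASE_OFFSET).join(", ")); // prints "0, 500, 1000"
```

The `writeResults` payload shown in this diff records `casesPerCategory` but not the offset, so when sharding it may be worth encoding the offset in the `BENCH_OUTPUT` file name.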

package/benchmarks/bfcl-real-remote-completed-summary.json ADDED

@@ -0,0 +1,362 @@
+{
+  "name": "BFCL v4 full remote Ollama runs",
+  "source": "Real model calls against all supported BFCL v4 single-turn tool-call categories completed in this repository",
+  "note": "Endpoint redacted. Scores count request errors/timeouts as incorrect. This is not an official BFCL leaderboard submission.",
+  "generatedAt": "2026-05-08T04:26:26.419Z",
+  "results": [
+    {
+      "model": "qwen3:0.6b",
+      "total": 3625,
+      "rawCorrect": 2011,
+      "betterCorrect": 2307,
+      "errors": 217,
+      "categories": [
+        {
+          "model": "qwen3:0.6b",
+          "category": "simple_python",
+          "total": 400,
+          "rawCorrect": 205,
+          "betterCorrect": 311,
+          "errors": 6,
+          "repaired": 129
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "simple_java",
+          "total": 100,
+          "rawCorrect": 29,
+          "betterCorrect": 43,
+          "errors": 2,
+          "repaired": 29
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "simple_javascript",
+          "total": 50,
+          "rawCorrect": 16,
+          "betterCorrect": 20,
+          "errors": 5,
+          "repaired": 12
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "multiple",
+          "total": 200,
+          "rawCorrect": 137,
+          "betterCorrect": 164,
+          "errors": 0,
+          "repaired": 31
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "parallel",
+          "total": 200,
+          "rawCorrect": 90,
+          "betterCorrect": 96,
+          "errors": 7,
+          "repaired": 57
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "parallel_multiple",
+          "total": 200,
+          "rawCorrect": 92,
+          "betterCorrect": 95,
+          "errors": 23,
+          "repaired": 31
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "irrelevance",
+          "total": 240,
+          "rawCorrect": 184,
+          "betterCorrect": 193,
+          "errors": 27,
+          "repaired": 9
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "live_simple",
+          "total": 258,
+          "rawCorrect": 94,
+          "betterCorrect": 119,
+          "errors": 14,
+          "repaired": 47
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "live_multiple",
+          "total": 1053,
+          "rawCorrect": 465,
+          "betterCorrect": 534,
+          "errors": 67,
+          "repaired": 143
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "live_parallel",
+          "total": 16,
+          "rawCorrect": 1,
+          "betterCorrect": 1,
+          "errors": 12,
+          "repaired": 0
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "live_parallel_multiple",
+          "total": 24,
+          "rawCorrect": 6,
+          "betterCorrect": 6,
+          "errors": 0,
+          "repaired": 3
+        },
+        {
+          "model": "qwen3:0.6b",
+          "category": "live_irrelevance",
+          "total": 884,
+          "rawCorrect": 692,
+          "betterCorrect": 725,
+          "errors": 54,
+          "repaired": 33
+        }
+      ],
+      "repaired": 524
+    },
+    {
+      "model": "qwen3.5:0.8b",
+      "total": 3625,
+      "rawCorrect": 1978,
+      "betterCorrect": 2062,
+      "errors": 901,
+      "repaired": 154,
+      "categories": [
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "simple_python",
+          "total": 400,
+          "rawCorrect": 282,
+          "betterCorrect": 307,
+          "errors": 29,
+          "repaired": 35
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "simple_java",
+          "total": 100,
+          "rawCorrect": 45,
+          "betterCorrect": 46,
+          "errors": 32,
+          "repaired": 6
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "simple_javascript",
+          "total": 50,
+          "rawCorrect": 18,
+          "betterCorrect": 21,
+          "errors": 15,
+          "repaired": 7
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "multiple",
+          "total": 200,
+          "rawCorrect": 146,
+          "betterCorrect": 155,
+          "errors": 17,
+          "repaired": 9
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "parallel",
+          "total": 200,
+          "rawCorrect": 127,
+          "betterCorrect": 128,
+          "errors": 26,
+          "repaired": 15
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "parallel_multiple",
+          "total": 200,
+          "rawCorrect": 97,
+          "betterCorrect": 99,
+          "errors": 41,
+          "repaired": 20
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "irrelevance",
+          "total": 240,
+          "rawCorrect": 110,
+          "betterCorrect": 116,
+          "errors": 113,
+          "repaired": 6
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "live_simple",
+          "total": 258,
+          "rawCorrect": 132,
+          "betterCorrect": 139,
+          "errors": 43,
+          "repaired": 8
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "live_multiple",
+          "total": 1053,
+          "rawCorrect": 538,
+          "betterCorrect": 545,
+          "errors": 212,
+          "repaired": 25
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "live_parallel",
+          "total": 16,
+          "rawCorrect": 0,
+          "betterCorrect": 0,
+          "errors": 13,
+          "repaired": 0
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "live_parallel_multiple",
+          "total": 24,
+          "rawCorrect": 4,
+          "betterCorrect": 4,
+          "errors": 19,
+          "repaired": 0
+        },
+        {
+          "model": "qwen3.5:0.8b",
+          "category": "live_irrelevance",
+          "total": 884,
+          "rawCorrect": 479,
+          "betterCorrect": 502,
+          "errors": 341,
+          "repaired": 23
+        }
+      ]
+    },
+    {
+      "model": "lfm2.5-thinking:latest",
+      "total": 3625,
+      "rawCorrect": 1840,
+      "betterCorrect": 1986,
+      "errors": 1142,
+      "repaired": 149,
+      "categories": [
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "simple_python",
+          "total": 400,
+          "rawCorrect": 290,
+          "betterCorrect": 290,
+          "errors": 69,
+          "repaired": 0
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "simple_java",
+          "total": 100,
+          "rawCorrect": 54,
+          "betterCorrect": 53,
+          "errors": 11,
+          "repaired": 0
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "simple_javascript",
+          "total": 50,
+          "rawCorrect": 30,
+          "betterCorrect": 29,
+          "errors": 4,
+          "repaired": 0
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "multiple",
+          "total": 200,
+          "rawCorrect": 119,
+          "betterCorrect": 118,
+          "errors": 65,
+          "repaired": 0
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "parallel",
+          "total": 200,
+          "rawCorrect": 114,
+          "betterCorrect": 114,
+          "errors": 61,
+          "repaired": 0
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "parallel_multiple",
+          "total": 200,
+          "rawCorrect": 53,
+          "betterCorrect": 53,
+          "errors": 131,
+          "repaired": 0
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "irrelevance",
+          "total": 240,
+          "rawCorrect": 70,
+          "betterCorrect": 88,
+          "errors": 152,
+          "repaired": 18
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "live_simple",
+          "total": 258,
+          "rawCorrect": 147,
+          "betterCorrect": 147,
+          "errors": 52,
+          "repaired": 0
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "live_multiple",
+          "total": 1053,
+          "rawCorrect": 483,
+          "betterCorrect": 483,
+          "errors": 302,
+          "repaired": 0
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "live_parallel",
+          "total": 16,
+          "rawCorrect": 7,
+          "betterCorrect": 7,
+          "errors": 2,
+          "repaired": 0
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "live_parallel_multiple",
+          "total": 24,
+          "rawCorrect": 11,
+          "betterCorrect": 11,
+          "errors": 2,
+          "repaired": 0
+        },
+        {
+          "model": "lfm2.5-thinking:latest",
+          "category": "live_irrelevance",
+          "total": 884,
+          "rawCorrect": 462,
+          "betterCorrect": 593,
+          "errors": 291,
+          "repaired": 131
+        }
+      ]
+    }
+  ]
+}
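The README percentages can be recomputed from the counts in this artifact. The following is a minimal sketch, not part of the package: `ModelResult`, `pct`, `liftPp`, and `summarizeRow` are hypothetical names, and the three inputs are copied from the top-level `results` entries above. The lift is computed from the raw counts, as the benchmark script does, and then rounded to one decimal; the rounding step is an assumption, though it is consistent with the `qwen3:0.6b` row showing +8.2pp rather than the +8.1pp obtained by subtracting the rounded percentages.

```typescript
// Hypothetical type; field names mirror the summary JSON above.
type ModelResult = {
  model: string;
  total: number;
  rawCorrect: number;
  betterCorrect: number;
  errors: number;
};

// Counts copied from the top-level "results" entries of the artifact.
const results: ModelResult[] = [
  { model: "qwen3:0.6b", total: 3625, rawCorrect: 2011, betterCorrect: 2307, errors: 217 },
  { model: "qwen3.5:0.8b", total: 3625, rawCorrect: 1978, betterCorrect: 2062, errors: 901 },
  { model: "lfm2.5-thinking:latest", total: 3625, rawCorrect: 1840, betterCorrect: 1986, errors: 1142 },
];

// Accuracy as a percentage with one decimal place (rounding is an assumption).
function pct(correct: number, total: number): string {
  return total === 0 ? "n/a" : `${((correct * 100) / total).toFixed(1)}%`;
}

// Lift in percentage points, computed from counts rather than rounded percentages.
function liftPp(row: ModelResult): string {
  if (row.total === 0) return "n/a";
  const lift = ((row.betterCorrect - row.rawCorrect) * 100) / row.total;
  return `${lift >= 0 ? "+" : ""}${lift.toFixed(1)}pp`;
}

function summarizeRow(row: ModelResult): { model: string; raw: string; better: string; lift: string } {
  return {
    model: row.model,
    raw: pct(row.rawCorrect, row.total),
    better: pct(row.betterCorrect, row.total),
    lift: liftPp(row),
  };
}

for (const row of results.map(summarizeRow)) {
  console.log(`${row.model}: ${row.raw} -> ${row.better} (${row.lift})`);
}
```

Because request errors and timeouts are counted as incorrect, the `errors` field is already reflected in both accuracy columns and does not need to be subtracted from `total`.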

package/dist/better-tool.d.ts CHANGED

@@ -1,5 +1,5 @@
 import { type RepairModelLike } from "./default-repair.js";
-import type { GuardPolicy, JsonSchema, RepairFunction, SemanticValidator } from "./types.js";
+import type { GuardPolicy, JsonSchema, RepairFunction, SemanticValidator, ToolCallIssue } from "./types.js";
 export type BetterToolLike = {
     name: string;
     description?: string;
@@ -15,4 +15,9 @@ export type BetterToolsOptions = {
     policy?: GuardPolicy;
     mode?: "guard" | "repair" | "review";
 };
+export declare class BetterToolValidationError extends Error {
+    readonly tool: string;
+    readonly issues: ToolCallIssue[];
+    constructor(tool: string, issues: ToolCallIssue[]);
+}
 export declare function betterTools<T extends BetterToolLike>(tools: T[], options?: BetterToolsOptions): T[];
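Because rejections are now thrown as a dedicated class carrying `tool` and `issues`, callers can branch on `instanceof` instead of parsing the message string. The sketch below makes several assumptions: the class is a local stand-in that mirrors the declaration above so the example runs without the package installed (real code would import `BetterToolValidationError` from `@botbotgo/better-call`); `ToolCallIssue` is reduced to `message`, the only field this diff shows being read; and `describeFailure` is a hypothetical helper.

```typescript
// Reduced stand-in: the real ToolCallIssue type lives in ./types.js and may carry more fields.
type ToolCallIssue = { message: string };

// Local stand-in mirroring the class added in 0.1.2.
class BetterToolValidationError extends Error {
  readonly tool: string;
  readonly issues: ToolCallIssue[];

  constructor(tool: string, issues: ToolCallIssue[]) {
    super(`BetterCall rejected ${tool}: ${issues.map((issue) => issue.message).join("; ")}`);
    // Defensive addition for this sketch only; it matters when compiling to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
    this.tool = tool;
    this.issues = issues;
    this.name = "BetterToolValidationError";
  }
}

// Hypothetical helper: turn any thrown value into a short log line.
function describeFailure(error: unknown): string {
  if (error instanceof BetterToolValidationError) {
    return `${error.tool} failed validation with ${error.issues.length} issue(s)`;
  }
  return error instanceof Error ? error.message : String(error);
}

try {
  // Simulates what a wrapped tool would throw when validation fails.
  throw new BetterToolValidationError("search", [
    { message: "missing query" },
    { message: "bad limit" },
  ]);
} catch (error) {
  console.log(describeFailure(error)); // prints "search failed validation with 2 issue(s)"
}
```

Code that matched on the old `BetterCall rejected ...` message text should keep working, since the new class builds the same message string.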

package/dist/better-tool.js CHANGED

@@ -1,5 +1,15 @@
 import { defaultRepair } from "./default-repair.js";
 import { reliableToolCalls } from "./reliable.js";
+export class BetterToolValidationError extends Error {
+    tool;
+    issues;
+    constructor(tool, issues) {
+        super(`BetterCall rejected ${tool}: ${issues.map((issue) => issue.message).join("; ")}`);
+        this.tool = tool;
+        this.issues = issues;
+        this.name = "BetterToolValidationError";
+    }
+}
 export function betterTools(tools, options = {}) {
     return tools.map((tool) => wrapOneTool(tool, options));
 }
@@ -16,7 +26,7 @@ function wrapOneTool(tool, options) {
             mode: options.mode,
         });
         if (!result.ok) {
-            throw new Error(`BetterCall rejected ${tool.name}: ${result.issues.map((issue) => issue.message).join("; ")}`);
+            throw new BetterToolValidationError(tool.name, result.issues);
         }
         const safeCall = result.calls.find((call) => call.tool === tool.name);
         if (!safeCall)

package/dist/better-tool.js.map CHANGED

@@ -1 +1 @@
-{"version":3,"file":"better-tool.js","sourceRoot":"","sources":["../src/better-tool.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,aAAa,EAAwB,MAAM,qBAAqB,CAAC;AAC1E,OAAO,EAAE,iBAAiB,EAAE,MAAM,eAAe,CAAC;AAoBlD,MAAM,UAAU,WAAW,CAA2B,KAAU,EAAE,UAA8B,EAAE;IAChG,OAAO,KAAK,CAAC,GAAG,CAAC,CAAC,IAAI,EAAE,EAAE,CAAC,WAAW,CAAC,IAAI,EAAE,OAAO,CAAC,CAAC,CAAC;AACzD,CAAC;AAED,SAAS,WAAW,CAA2B,IAAO,EAAE,OAA2B;IACjF,MAAM,OAAO,GAAG,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,cAAc,CAAC,IAAI,CAAC,CAAC,EAAE,IAAI,CAAM,CAAC;IACrF,OAAO,CAAC,MAAM,GAAG,KAAK,EAAE,KAAc,EAAE,MAAgB,EAAE,EAAE;QAC1D,MAAM,IAAI,GAAG,WAAW,CAAC,KAAK,CAAC,CAAC;QAChC,MAAM,MAAM,GAAG,MAAM,iBAAiB,CAAC;YACrC,SAAS,EAAE,gBAAgB,CAAC,OAAO,CAAC,SAAS,EAAE,IAAI,CAAC;YACpD,KAAK,EAAE,CAAC,gBAAgB,CAAC,IAAI,EAAE,OAAO,CAAC,CAAC;YACxC,KAAK,EAAE,CAAC,EAAE,IAAI,EAAE,IAAI,CAAC,IAAI,EAAE,IAAI,EAAE,CAAC;YAClC,MAAM,EAAE,OAAO,CAAC,MAAM;YACtB,MAAM,EAAE,aAAa,CAAC,OAAO,CAAC;YAC9B,IAAI,EAAE,OAAO,CAAC,IAAI;SACnB,CAAC,CAAC;QACH,IAAI,CAAC,MAAM,CAAC,EAAE,EAAE,CAAC;YACf,MAAM,IAAI,KAAK,CAAC,uBAAuB,IAAI,CAAC,IAAI,KAAK,MAAM,CAAC,MAAM,CAAC,GAAG,CAAC,CAAC,KAAK,EAAE,EAAE,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,EAAE,CAAC,CAAC;QACjH,CAAC;QACD,MAAM,QAAQ,GAAG,MAAM,CAAC,KAAK,CAAC,IAAI,CAAC,CAAC,IAAI,EAAE,EAAE,CAAC,IAAI,CAAC,IAAI,KAAK,IAAI,CAAC,IAAI,CAAC,CAAC;QACtE,IAAI,CAAC,QAAQ;YAAE,MAAM,IAAI,KAAK,CAAC,sBAAsB,IAAI,CAAC,IAAI,oBAAoB,CAAC,CAAC;QACpF,OAAO,IAAI,CAAC,MAAM,CAAC,IAAI,CAAC,IAAI,EAAE,aAAa,CAAC,KAAK,EAAE,QAAQ,CAAC,IAAI,CAAC,EAAE,MAAM,CAAC,CAAC;IAC7E,CAAC,CAAC;IACF,OAAO,OAAO,CAAC;AACjB,CAAC;AAED,SAAS,aAAa,CAAC,OAA2B;IAChD,OAAO,OAAO,CAAC,MAAM,IAAI,CAAC,OAAO,CAAC,WAAW,CAAC,CAAC,CAAC,aAAa,CAAC,OAAO,CAAC,WAAW,CAAC,CAAC,CAAC,CAAC,SAAS,CAAC,CAAC;AAClG,CAAC;AAED,SAAS,gBAAgB,CAAC,IAAoB,EAAE,OAA2B;IACzE,OAAO;QACL,IAAI,EAAE,IAAI,CAAC,IAAI;QACf,WAAW,EAAE,IAAI,CAAC,WAAW;QAC7B,MAAM,EAAE,OAAO,CAAC,MAAM,IAAI,CAAC,YAAY,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC,CAAC,SAAS,CAAC;QAC/E,QAAQ,EAAE,OAAO,CAAC,QAAQ;KAC3B,CAAC;AACJ,CAAC;AAED,SAAS,gBAAgB,CACvB,SAA0C,EAC1C,IAA6B;IAE7B,IAAI,OAAO,SAAS,KAAK,UAAU;QAAE,OAAO,SAAS,CAAC,IAAI,CAAC,CAAC;IAC5D,OAAO,SAAS,IAAI,IAAI,CAAC,SAAS,CAAC,IAAI,CAAC,CAAC;AAC3C,CAAC;AAED,SAAS,WAAW,CAAC,KAAc;IACjC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC;QAAE,OAAO,KAAK,CAAC,IAAI,CAAC;IAC/D,OAAO,QAAQ,CAAC,KAAK,CAAC,CAAC,CAAC,CAAC,KAAK,CAAC,CAAC,CAAC,EAAE,KAAK,EAAE,CAAC;AAC7C,CAAC;AAED,SAAS,aAAa,CAAC,KAAc,EAAE,IAA6B;IAClE,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC;QAAE,OAAO,EAAE,GAAG,KAAK,EAAE,IAAI,EAAE,CAAC;IACvE,OAAO,IAAI,CAAC;AACd,CAAC;AAED,SAAS,YAAY,CAAC,KAAc;IAClC,OAAO,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC,OAAO,KAAK,CAAC,IAAI,KAAK,QAAQ,IAAI,KAAK,CAAC,OAAO,CAAC,KAAK,CAAC,IAAI,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,UAAU,CAAC,CAAC,CAAC;AACxH,CAAC;AAED,SAAS,QAAQ,CAAC,KAAc;IAC9B,OAAO,OAAO,KAAK,KAAK,QAAQ,IAAI,KAAK,KAAK,IAAI,IAAI,CAAC,KAAK,CAAC,OAAO,CAAC,KAAK,CAAC,CAAC;AAC9E,CAAC"}
+{"version":3,"file":"better-tool.js","sourceRoot":"","sources":["../src/better-tool.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,aAAa,EAAwB,MAAM,qBAAqB,CAAC;AAC1E,OAAO,EAAE,iBAAiB,EAAE,MAAM,eAAe,CAAC;AAoBlD,MAAM,OAAO,yBAA0B,SAAQ,KAAK;IAEhC;IACA;IAFlB,YACkB,IAAY,EACZ,MAAuB;QAEvC,KAAK,CAAC,uBAAuB,IAAI,KAAK,MAAM,CAAC,GAAG,CAAC,CAAC,KAAK,EAAE,EAAE,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,EAAE,CAAC,CAAC;QAHzE,SAAI,GAAJ,IAAI,CAAQ;QACZ,WAAM,GAAN,MAAM,CAAiB;QAGvC,IAAI,CAAC,IAAI,GAAG,2BAA2B,CAAC;IAC1C,CAAC;CACF;AAED,MAAM,UAAU,WAAW,CAA2B,KAAU,EAAE,UAA8B,EAAE;IAChG,OAAO,KAAK,CAAC,GAAG,CAAC,CAAC,IAAI,EAAE,EAAE,CAAC,WAAW,CAAC,IAAI,EAAE,OAAO,CAAC,CAAC,CAAC;AACzD,CAAC;AAED,SAAS,WAAW,CAA2B,IAAO,EAAE,OAA2B;IACjF,MAAM,OAAO,GAAG,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,MAAM,CAAC,cAAc,CAAC,IAAI,CAAC,CAAC,EAAE,IAAI,CAAM,CAAC;IACrF,OAAO,CAAC,MAAM,GAAG,KAAK,EAAE,KAAc,EAAE,MAAgB,EAAE,EAAE;QAC1D,MAAM,IAAI,GAAG,WAAW,CAAC,KAAK,CAAC,CAAC;QAChC,MAAM,MAAM,GAAG,MAAM,iBAAiB,CAAC;YACrC,SAAS,EAAE,gBAAgB,CAAC,OAAO,CAAC,SAAS,EAAE,IAAI,CAAC;YACpD,KAAK,EAAE,CAAC,gBAAgB,CAAC,IAAI,EAAE,OAAO,CAAC,CAAC;YACxC,KAAK,EAAE,CAAC,EAAE,IAAI,EAAE,IAAI,CAAC,IAAI,EAAE,IAAI,EAAE,CAAC;YAClC,MAAM,EAAE,OAAO,CAAC,MAAM;YACtB,MAAM,EAAE,aAAa,CAAC,OAAO,CAAC;YAC9B,IAAI,EAAE,OAAO,CAAC,IAAI;SACnB,CAAC,CAAC;QACH,IAAI,CAAC,MAAM,CAAC,EAAE,EAAE,CAAC;YACf,MAAM,IAAI,yBAAyB,CAAC,IAAI,CAAC,IAAI,EAAE,MAAM,CAAC,MAAM,CAAC,CAAC;QAChE,CAAC;QACD,MAAM,QAAQ,GAAG,MAAM,CAAC,KAAK,CAAC,IAAI,CAAC,CAAC,IAAI,EAAE,EAAE,CAAC,IAAI,CAAC,IAAI,KAAK,IAAI,CAAC,IAAI,CAAC,CAAC;QACtE,IAAI,CAAC,QAAQ;YAAE,MAAM,IAAI,KAAK,CAAC,sBAAsB,IAAI,CAAC,IAAI,oBAAoB,CAAC,CAAC;QACpF,OAAO,IAAI,CAAC,MAAM,CAAC,IAAI,CAAC,IAAI,EAAE,aAAa,CAAC,KAAK,EAAE,QAAQ,CAAC,IAAI,CAAC,EAAE,MAAM,CAAC,CAAC;IAC7E,CAAC,CAAC;IACF,OAAO,OAAO,CAAC;AACjB,CAAC;AAED,SAAS,aAAa,CAAC,OAA2B;IAChD,OAAO,OAAO,CAAC,MAAM,IAAI,CAAC,OAAO,CAAC,WAAW,CAAC,CAAC,CAAC,aAAa,CAAC,OAAO,CAAC,WAAW,CAAC,CAAC,CAAC,CAAC,SAAS,CAAC,CAAC;AAClG,CAAC;AAED,SAAS,gBAAgB,CAAC,IAAoB,EAAE,OAA2B;IACzE,OAAO;QACL,IAAI,EAAE,IAAI,CAAC,IAAI;QACf,WAAW,EAAE,IAAI,CAAC,WAAW;QAC7B,MAAM,EAAE,OAAO,CAAC,MAAM,IAAI,CAAC,YAAY,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC,CAAC,SAAS,CAAC;QAC/E,QAAQ,EAAE,OAAO,CAAC,QAAQ;KAC3B,CAAC;AACJ,CAAC;AAED,SAAS,gBAAgB,CACvB,SAA0C,EAC1C,IAA6B;IAE7B,IAAI,OAAO,SAAS,KAAK,UAAU;QAAE,OAAO,SAAS,CAAC,IAAI,CAAC,CAAC;IAC5D,OAAO,SAAS,IAAI,IAAI,CAAC,SAAS,CAAC,IAAI,CAAC,CAAC;AAC3C,CAAC;AAED,SAAS,WAAW,CAAC,KAAc;IACjC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC;QAAE,OAAO,KAAK,CAAC,IAAI,CAAC;IAC/D,OAAO,QAAQ,CAAC,KAAK,CAAC,CAAC,CAAC,CAAC,KAAK,CAAC,CAAC,CAAC,EAAE,KAAK,EAAE,CAAC;AAC7C,CAAC;AAED,SAAS,aAAa,CAAC,KAAc,EAAE,IAA6B;IAClE,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC;QAAE,OAAO,EAAE,GAAG,KAAK,EAAE,IAAI,EAAE,CAAC;IACvE,OAAO,IAAI,CAAC;AACd,CAAC;AAED,SAAS,YAAY,CAAC,KAAc;IAClC,OAAO,QAAQ,CAAC,KAAK,CAAC,IAAI,CAAC,OAAO,KAAK,CAAC,IAAI,KAAK,QAAQ,IAAI,KAAK,CAAC,OAAO,CAAC,KAAK,CAAC,IAAI,CAAC,IAAI,QAAQ,CAAC,KAAK,CAAC,UAAU,CAAC,CAAC,CAAC;AACxH,CAAC;AAED,SAAS,QAAQ,CAAC,KAAc;IAC9B,OAAO,OAAO,KAAK,KAAK,QAAQ,IAAI,KAAK,KAAK,IAAI,IAAI,CAAC,KAAK,CAAC,OAAO,CAAC,KAAK,CAAC,CAAC;AAC9E,CAAC"}

package/dist/index.d.ts CHANGED

@@ -3,6 +3,6 @@ export { buildRepairPrompt, buildReviewPrompt } from "./prompts.js";
 export { defaultRepair } from "./default-repair.js";
 export type { RepairModelLike } from "./default-repair.js";
 export { reliableToolCalls } from "./reliable.js";
-export { betterTools } from "./better-tool.js";
+export { BetterToolValidationError, betterTools } from "./better-tool.js";
 export type { BetterToolLike, BetterToolsOptions } from "./better-tool.js";
 export type { GuardPolicy, GuardResult, IssueKind, JsonSchema, ReliableToolCallsInput, ReliableToolCallsResult, RepairFunction, RepairInput, SemanticValidator, ToolCall, ToolCallIssue, ToolDefinition, } from "./types.js";

package/dist/index.js CHANGED

@@ -2,5 +2,5 @@ export { guardToolCalls } from "./guard.js";
 export { buildRepairPrompt, buildReviewPrompt } from "./prompts.js";
 export { defaultRepair } from "./default-repair.js";
 export { reliableToolCalls } from "./reliable.js";
-export { betterTools } from "./better-tool.js";
+export { BetterToolValidationError, betterTools } from "./better-tool.js";
 //# sourceMappingURL=index.js.map

package/dist/index.js.map CHANGED

@@ -1 +1 @@
-{"version":3,"file":"index.js","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,cAAc,EAAE,MAAM,YAAY,CAAC;AAC5C,OAAO,EAAE,iBAAiB,EAAE,iBAAiB,EAAE,MAAM,cAAc,CAAC;AACpE,OAAO,EAAE,aAAa,EAAE,MAAM,qBAAqB,CAAC;AAEpD,OAAO,EAAE,iBAAiB,EAAE,MAAM,eAAe,CAAC;AAClD,OAAO,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC"}
+{"version":3,"file":"index.js","sourceRoot":"","sources":["../src/index.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,cAAc,EAAE,MAAM,YAAY,CAAC;AAC5C,OAAO,EAAE,iBAAiB,EAAE,iBAAiB,EAAE,MAAM,cAAc,CAAC;AACpE,OAAO,EAAE,aAAa,EAAE,MAAM,qBAAqB,CAAC;AAEpD,OAAO,EAAE,iBAAiB,EAAE,MAAM,eAAe,CAAC;AAClD,OAAO,EAAE,yBAAyB,EAAE,WAAW,EAAE,MAAM,kBAAkB,CAAC"}

package/docs/banner.svg CHANGED

@@ -1,6 +1,6 @@
 <svg xmlns="http://www.w3.org/2000/svg" width="1400" height="520" viewBox="0 0 1400 520" role="img" aria-labelledby="title desc">
   <title id="title">BetterCall banner</title>
-  <desc id="desc">BetterCall is a one-line wrapper that boosts BFCL tool-call accuracy from 81.3 percent to 91.3 percent.</desc>
+  <desc id="desc">BetterCall is a one-line wrapper with three full BFCL remote runs completed; the best run improves from 55.5 percent to 63.6 percent.</desc>
   <defs>
     <linearGradient id="background" x1="0" y1="0" x2="1" y2="1">
       <stop offset="0" stop-color="#07111f"/>
@@ -24,6 +24,6 @@
   <g transform="translate(112 112)">
     <text x="0" y="80" font-family="Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="78" font-weight="800" fill="#f8fbff">BetterCall</text>
     <rect x="4" y="116" width="430" height="8" rx="4" fill="url(#accent)"/>
-    <text x="0" y="186" font-family="Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="34" font-weight="670" fill="#dcecff">One-line wrapper. BFCL tool-call accuracy: 81.3% → 91.3%.</text>
+    <text x="0" y="186" font-family="Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="34" font-weight="670" fill="#dcecff">Three full BFCL remote runs. Best: 55.5% → 63.6%.</text>
   </g>
 </svg>

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@botbotgo/better-call",
-  "version": "0.1.0",
+  "version": "0.1.2",
   "description": "LLM tool-call reliability layer.",
   "type": "module",
   "license": "Apache-2.0",
@@ -38,6 +38,8 @@
   "scripts": {
     "build": "tsc -p tsconfig.json",
     "bench:bfcl": "npm run build && node scripts/bench-bfcl.mjs",
+    "bench:bfcl:real": "npm run build && node scripts/bench-bfcl-real.mjs",
+    "bench:bfcl:remote-small-all": "bash scripts/run-remote-small-bfcl-all.sh",
     "release:pack": "npm pack --dry-run",
     "release:publish": "npm publish --access public",
     "test": "npm run build && node --test test/*.test.mjs",

package/scripts/bench-bfcl-real.mjs ADDED

@@ -0,0 +1,448 @@
+import { mkdir, readFile, writeFile } from "node:fs/promises";
+import { dirname, resolve } from "node:path";
+import { fileURLToPath } from "node:url";
+import { defaultRepair, reliableToolCalls } from "../dist/index.js";
+const root = resolve(dirname(fileURLToPath(import.meta.url)), "..");
+const allSingleTurnToolCategories = [
+  "simple_python",
+  "simple_java",
+  "simple_javascript",
+  "multiple",
+  "parallel",
+  "parallel_multiple",
+  "irrelevance",
+  "live_simple",
+  "live_multiple",
+  "live_parallel",
+  "live_parallel_multiple",
+  "live_irrelevance",
+];
+const bfclDataDir = process.env.BFCL_DATA_DIR
+  ?? "/tmp/better-call-bfcl-venv/lib/python3.13/site-packages/bfcl_eval/data";
+const ollamaBaseUrl = process.env.OLLAMA_BASE_URL ?? "http://127.0.0.1:11434";
+const models = csv(process.env.BENCH_MODELS ?? "granite4.1:3b,qwen3.5:2b");
+const categories = expandCategories(process.env.BENCH_CATEGORIES ?? "simple_python,irrelevance");
+const casesPerCategory = Number(process.env.BENCH_CASES_PER_CATEGORY ?? "10");
+const caseOffset = Number(process.env.BENCH_CASE_OFFSET ?? "0");
+const requestTimeoutMs = Number(process.env.BENCH_REQUEST_TIMEOUT_MS ?? "120000");
+const progressInterval = Number(process.env.BENCH_PROGRESS_INTERVAL ?? "25");
+const outputPath = process.env.BENCH_OUTPUT
+  ?? resolve(root, "benchmarks", `bfcl-real-${new Date().toISOString().replaceAll(":", "-")}.json`);
+const showEndpoint = process.env.BENCH_SHOW_ENDPOINT === "1";
+await assertOllama();
+console.log("# BFCL v4 real model benchmark");
+console.log("");
+console.log(`Ollama endpoint: ${showEndpoint ? ollamaBaseUrl : "<provided by OLLAMA_BASE_URL>"}`);
+console.log(`BFCL data: ${bfclDataDir}`);
+console.log(`Models: ${models.join(", ")}`);
+console.log(`Categories: ${categories.join(", ")}`);
+console.log(`Cases/category: ${casesPerCategory > 0 ? casesPerCategory : "all"}`);
+console.log(`Case offset: ${caseOffset}`);
+console.log(`Request timeout: ${requestTimeoutMs}ms`);
+console.log(`Progress interval: ${progressInterval}`);
+console.log(`Output: ${outputPath}`);
+console.log("");
+const results = [];
+for (const model of models) {
+  const rows = [];
+  for (const category of categories) {
+    const cases = await loadCases(category);
+    const offsetCases = caseOffset > 0 ? cases.slice(caseOffset) : cases;
+    const selected = casesPerCategory > 0 ? offsetCases.slice(0, casesPerCategory) : offsetCases;
+    const row = await runCategory(model, category, selected);
+    rows.push(row);
+    printProgress(row);
+    await writeResults(results.concat([summarize(model, rows)]));
+  }
+  results.push(summarize(model, rows));
+}
+console.log("");
+printTable(
+  ["Model", "Total", "Raw", "BetterCall", "Lift", "Raw Correct", "Better Correct", "Errors"],
+  results.map((row) => [
+    row.model,
+    String(row.total),
+    percent(row.rawCorrect, row.total),
+    percent(row.betterCorrect, row.total),
+    `${signed(((row.betterCorrect - row.rawCorrect) * 100) / row.total)}pp`,
+    String(row.rawCorrect),
+    String(row.betterCorrect),
+    String(row.errors),
+  ]),
+);
+console.log("");
+console.log("## By Category");
+console.log("");
+printTable(
+  ["Model", "Category", "Total", "Raw", "BetterCall", "Lift", "Errors"],
+  results.flatMap((result) => result.categories.map((row) => [
+    result.model,
+    row.category,
+    String(row.total),
+    percent(row.rawCorrect, row.total),
+    percent(row.betterCorrect, row.total),
+    `${signed(((row.betterCorrect - row.rawCorrect) * 100) / row.total)}pp`,
+    String(row.errors),
+  ])),
+);
+await writeResults(results);
+async function writeResults(currentResults) {
+  await mkdir(dirname(outputPath), { recursive: true });
+  await writeFile(outputPath, `${JSON.stringify({
+  ollamaEndpoint: showEndpoint ? ollamaBaseUrl : "<redacted>",
+  bfclDataDir,
+  models,
+  categories,
+  casesPerCategory,
+  requestTimeoutMs,
+  progressInterval,
+  results: currentResults,
+}, null, 2)}\n`);
+}
+async function runCategory(model, category, cases) {
+  let rawCorrect = 0;
+  let betterCorrect = 0;
+  let rawToolCalls = 0;
+  let repaired = 0;
+  let errors = 0;
+  const failures = [];
+  for (const [caseIndex, item] of cases.entries()) {
+    const tools = item.functions.map(toToolDefinition);
+    let raw = [];
+    try {
+      raw = await callOllamaTools(model, item.userInput, tools);
+    } catch (error) {
+      errors += 1;
+      failures.push({ id: item.id, error: error instanceof Error ? error.message : String(error) });
+      printCaseProgress(model, category, caseIndex, cases.length, rawCorrect, betterCorrect, repaired, errors);
+      continue;
+    }
+    rawToolCalls += raw.length;
+    if (isCorrect(category, raw, item.groundTruth)) rawCorrect += 1;
+    const repairModel = {
+      invoke: (input) => callOllamaText(model, String(input)),
+    };
+    let better;
+    try {
+      better = await reliableToolCalls({
+        userInput: item.userInput,
+        tools,
+        calls: raw,
+        policy: { expected: isIrrelevanceCategory(category) ? "none" : "some" },
+        repair: defaultRepair(repairModel),
+      });
+    } catch (error) {
+      errors += 1;
+      failures.push({ id: item.id, raw, error: error instanceof Error ? error.message : String(error) });
+      printCaseProgress(model, category, caseIndex, cases.length, rawCorrect, betterCorrect, repaired, errors);
+      continue;
+    }
+    if (better.repaired) repaired += 1;
+    if (isCorrect(category, better.calls, item.groundTruth)) {
+      betterCorrect += 1;
+    } else if (failures.length < 3) {
+      failures.push({ id: item.id, raw, better: better.calls, issues: better.issues });
+    }
+    printCaseProgress(model, category, caseIndex, cases.length, rawCorrect, betterCorrect, repaired, errors);
+  }
+  return {
+    model,
+    category,
+    total: cases.length,
+    rawCorrect,
+    betterCorrect,
+    rawToolCalls,
+    repaired,
+    errors,
+    failures,
+  };
+}
+async function loadCases(category) {
+  const prompts = await readJsonl(resolve(bfclDataDir, `BFCL_v4_${category}.json`));
+  const answers = isIrrelevanceCategory(category)
+    ? prompts.map((item) => ({ id: item.id, ground_truth: [] }))
+    : await readJsonl(resolve(bfclDataDir, "possible_answer", `BFCL_v4_${category}.json`));
+  return prompts.map((item, index) => ({
+    id: item.id,
+    userInput: item.question.flat().map((message) => message.content).join("\n"),
+    functions: item.function ?? [],
+    groundTruth: answers[index].ground_truth,
+  }));
+}
+function toToolDefinition(fn) {
+  return {
+    name: fn.name,
+    description: fn.description,
+    schema: normalizeSchema(fn.parameters),
+  };
+}
+function normalizeSchema(schema) {
+  if (!schema || typeof schema !== "object") return schema;
+  const next = { ...schema };
+  if (next.type === "dict") next.type = "object";
+  if (next.properties) {
+    next.properties = Object.fromEntries(
+      Object.entries(next.properties).map(([key, value]) => [key, normalizeSchema(value)]),
+    );
+  }
+  if (next.items) next.items = normalizeSchema(next.items);
+  return next;
+}
+async function callOllamaTools(model, userInput, tools) {
+  const payload = {
+    model,
+    stream: false,
+    messages: [{ role: "user", content: userInput }],
+    tools: tools.map((tool) => ({
+      type: "function",
+      function: {
+        name: tool.name,
+        description: tool.description ?? "",
+        parameters: tool.schema ?? { type: "object", properties: {} },
+      },
+    })),
+    options: { temperature: 0 },
+  };
+  const json = await postJson(`${ollamaBaseUrl}/api/chat`, payload);
+  return normalizeOllamaToolCalls(json.message?.tool_calls);
+}
+async function callOllamaText(model, prompt) {
+  const json = await postJson(`${ollamaBaseUrl}/api/chat`, {
+    model,
+    stream: false,
+    messages: [{ role: "user", content: prompt }],
+    format: "json",
+    options: { temperature: 0 },
+  });
+  return json.message?.content ?? "";
+}
+function normalizeOllamaToolCalls(toolCalls) {
+  if (!Array.isArray(toolCalls)) return [];
+  return toolCalls
+    .map((call) => call?.function)
+    .filter((fn) => fn && typeof fn.name === "string")
+    .map((fn) => ({
+      tool: fn.name,
+      args: isRecord(fn.arguments) ? fn.arguments : {},
+    }));
+}
+function isCorrect(category, calls, groundTruth) {
+  if (isIrrelevanceCategory(category)) return calls.length === 0;
+  if (!Array.isArray(groundTruth) || calls.length !== groundTruth.length) return false;
+  if (category.includes("parallel")) return callsMatchUnordered(calls, groundTruth);
+  return groundTruth.every((expected, index) => callMatches(calls[index], expected));
+}
+function callsMatchUnordered(calls, groundTruth) {
+  const used = new Set();
+  for (const expected of groundTruth) {
+    const index = calls.findIndex((call, candidateIndex) => !used.has(candidateIndex) && callMatches(call, expected));
+    if (index < 0) return false;
+    used.add(index);
+  }
+  return true;
+}
+function callMatches(call, expected) {
+  if (!call || !isRecord(expected)) return false;
+  const [tool, args] = Object.entries(expected)[0] ?? [];
+  if (call.tool !== tool) return false;
+  if (!isRecord(args)) return false;
+  for (const [key, allowed] of Object.entries(args)) {
+    if (!Object.hasOwn(call.args, key)) {
+      if (Array.isArray(allowed) && allowed.includes("")) continue;
+      return false;
+    }
+    if (!valueAllowed(call.args[key], allowed)) return false;
+  }
+  for (const key of Object.keys(call.args)) {
+    if (!Object.hasOwn(args, key)) return false;
+  }
+  return true;
+}
+function valueAllowed(actual, allowed) {
+  if (!Array.isArray(allowed)) return Object.is(actual, allowed);
+  return allowed.some((value) => {
+    if (value === "" && (actual === undefined || actual === "")) return true;
+    if (isRecord(actual) && isRecord(value)) return objectMatches(actual, value);
+    if (Array.isArray(actual) && Array.isArray(value)) return arrayMatches(actual, value);
+    if (typeof actual === "string" && typeof value === "string") {
+      return normalizeString(actual) === normalizeString(value);
+    }
+    return JSON.stringify(actual) === JSON.stringify(value);
+  });
+}
+function objectMatches(actual, expected) {
+  for (const [key, allowed] of Object.entries(expected)) {
+    if (!Object.hasOwn(actual, key)) {
+      if (Array.isArray(allowed) && allowed.includes("")) continue;
+      return false;
+    }
+    if (!valueAllowed(actual[key], allowed)) return false;
+  }
+  for (const key of Object.keys(actual)) {
+    if (!Object.hasOwn(expected, key)) return false;
+  }
+  return true;
+}
+function arrayMatches(actual, expected) {
+  if (actual.length !== expected.length) return false;
+  return expected.every((value, index) => {
+    if (isRecord(actual[index]) && isRecord(value)) return objectMatches(actual[index], value);
+    if (Array.isArray(actual[index]) && Array.isArray(value)) return arrayMatches(actual[index], value);
+    if (typeof actual[index] === "string" && typeof value === "string") {
+      return normalizeString(actual[index]) === normalizeString(value);
+    }
+    return Object.is(actual[index], value);
+  });
+}
+function normalizeString(value) {
+  return value.replace(/[ ,./\-_*^]/gu, "").toLowerCase().replaceAll("'", "\"");
+}
+async function readJsonl(path) {
+  const text = await readFile(path, "utf8");
+  return text.trim().split("\n").filter(Boolean).map((line) => JSON.parse(line));
+}
+async function assertOllama() {
+  const tags = await postJson(`${ollamaBaseUrl}/api/tags`, undefined, "GET");
+  const available = new Set((tags.models ?? []).map((item) => item.name));
+  const missing = models.filter((model) => !available.has(model));
+  if (missing.length > 0) {
+    throw new Error(`Missing Ollama models: ${missing.join(", ")}`);
+  }
+}
+async function postJson(url, payload, method = "POST") {
+  const controller = new AbortController();
+  const timeout = setTimeout(() => controller.abort(), requestTimeoutMs);
+  try {
+    const response = await fetch(url, {
+      method,
+      headers: payload ? { "content-type": "application/json" } : undefined,
+      body: payload ? JSON.stringify(payload) : undefined,
+      signal: controller.signal,
+    });
+    if (!response.ok) {
+      throw new Error(`${method} ${redactUrl(url)} failed: ${response.status} ${await readBodyWithTimeout(response, "text", controller)}`);
+    }
+    return readBodyWithTimeout(response, "json", controller);
+  } finally {
+    clearTimeout(timeout);
+  }
+}
+async function readBodyWithTimeout(response, kind, controller) {
+  let timeout;
+  try {
+    return await Promise.race([
+      kind === "json" ? response.json() : response.text(),
+      new Promise((_, reject) => {
+        timeout = setTimeout(() => {
+          controller.abort();
+          reject(new Error(`Timed out reading ${kind} response body after ${requestTimeoutMs}ms`));
+        }, requestTimeoutMs);
+      }),
+    ]);
+  } finally {
+    clearTimeout(timeout);
+  }
+}
+function printCaseProgress(model, category, caseIndex, total, rawCorrect, betterCorrect, repaired, errors) {
+  if (progressInterval <= 0) return;
+  if ((caseIndex + 1) % progressInterval !== 0 && caseIndex + 1 !== total) return;
+  console.log(
+    `[${model}] ${category} progress ${caseIndex + 1}/${total}: `
+      + `raw ${rawCorrect}, better ${betterCorrect}, repaired ${repaired}, errors ${errors}`,
+  );
+}
+function redactUrl(url) {
+  try {
+    const parsed = new URL(url);
+    return `<redacted>${parsed.pathname}`;
+  } catch {
+    return "<redacted>";
+  }
+}
+function summarize(model, categories) {
+  return {
+    model,
+    total: sum(categories, "total"),
+    rawCorrect: sum(categories, "rawCorrect"),
+    betterCorrect: sum(categories, "betterCorrect"),
+    errors: sum(categories, "errors"),
+    categories,
+  };
+}
+function printProgress(row) {
+  console.log(
+    `[${row.model}] ${row.category}: raw ${row.rawCorrect}/${row.total}, `
+      + `better ${row.betterCorrect}/${row.total}, repaired ${row.repaired}, `
+      + `errors ${row.errors}, rawToolCalls ${row.rawToolCalls}`,
+  );
+}
+function printTable(headers, rows) {
+  console.log(`| ${headers.join(" | ")} |`);
+  console.log(`| ${headers.map(() => "---").join(" | ")} |`);
+  for (const row of rows) console.log(`| ${row.join(" | ")} |`);
+}
+function percent(correct, total) {
+  return `${((correct * 100) / total).toFixed(1)}%`;
+}
+function signed(value) {
+  return `${value >= 0 ? "+" : ""}${value.toFixed(1)}`;
+}
+function sum(rows, key) {
+  return rows.reduce((total, row) => total + row[key], 0);
+}
+function csv(value) {
+  return value.split(",").map((item) => item.trim()).filter(Boolean);
+}
+function expandCategories(value) {
+  const requested = csv(value);
+  if (requested.includes("all") || requested.includes("all_single_turn_tool")) return allSingleTurnToolCategories;
+  return requested;
+}
+function isIrrelevanceCategory(category) {
+  return category === "irrelevance" || category === "live_irrelevance";
+}
+function isRecord(value) {
+  return typeof value === "object" && value !== null && !Array.isArray(value);
+}
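
Reviewer note on the matcher above: `callMatches` reads each ground-truth entry as a single-key object mapping the tool name to its arguments, where every argument lists its allowed values and an empty string marks the argument as optional. The sketch below is a simplified, flat-argument illustration of that behavior, not the package's code. `normalizeString` is copied from the diff; `flatValueAllowed`, `flatCallMatches`, and the `get_weather` entry are hypothetical names introduced only for this example, and nested objects and arrays are left out.

```javascript
// Copied from the script above: strips separators, lowercases, unifies quotes.
function normalizeString(value) {
  return value.replace(/[ ,./\-_*^]/gu, "").toLowerCase().replaceAll("'", "\"");
}

// Simplified version of valueAllowed for scalar values only (assumption:
// no nested objects or arrays, which the real matcher handles separately).
function flatValueAllowed(actual, allowed) {
  return allowed.some((value) => {
    if (value === "" && (actual === undefined || actual === "")) return true;
    if (typeof actual === "string" && typeof value === "string") {
      return normalizeString(actual) === normalizeString(value);
    }
    return JSON.stringify(actual) === JSON.stringify(value);
  });
}

// Simplified version of callMatches for flat argument maps.
function flatCallMatches(call, expected) {
  const [tool, args] = Object.entries(expected)[0] ?? [];
  if (call.tool !== tool) return false;
  for (const [key, allowed] of Object.entries(args)) {
    if (!Object.hasOwn(call.args, key)) {
      if (allowed.includes("")) continue; // "" marks the argument as optional
      return false;
    }
    if (!flatValueAllowed(call.args[key], allowed)) return false;
  }
  // Any argument the ground truth does not list makes the call incorrect.
  return Object.keys(call.args).every((key) => Object.hasOwn(args, key));
}

// Hypothetical ground-truth entry in the shape the matcher reads.
const expected = { get_weather: { city: ["Boston, MA", "Boston"], unit: ["celsius", ""] } };

// Separators and case are ignored; the optional "unit" may be omitted.
console.log(flatCallMatches({ tool: "get_weather", args: { city: "boston ma" } }, expected)); // true
// A value outside the allowed list fails.
console.log(flatCallMatches({ tool: "get_weather", args: { city: "Boston", unit: "kelvin" } }, expected)); // false
// An argument the ground truth does not mention fails.
console.log(flatCallMatches({ tool: "get_weather", args: { city: "Boston", days: 3 } }, expected)); // false
// Side effect of stripping "." from strings: these compare equal.
console.log(normalizeString("1.5") === normalizeString("15")); // true
```

The last line shows one consequence of the normalization worth keeping in mind when reading the scores: string values that differ only in the stripped separators, including a decimal point, are treated as the same value.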

package/scripts/run-remote-small-bfcl-all.sh ADDED

@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+set -euo pipefail
+ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+cd "$ROOT"
+: "${OLLAMA_BASE_URL:?Set OLLAMA_BASE_URL to your Ollama endpoint, for example http://127.0.0.1:11434}"
+BENCH_REQUEST_TIMEOUT_MS="${BENCH_REQUEST_TIMEOUT_MS:-5000}"
+BENCH_PROGRESS_INTERVAL="${BENCH_PROGRESS_INTERVAL:-50}"
+models=(
+  "qwen3.5:0.8b"
+  "qwen3:0.6b"
+  "lfm2.5-thinking:latest"
+  "qwen3.5:2b"
+  "granite4.1:3b"
+  "qwen3.5:4b"
+  "gemma4:e2b"
+  "qwen2.5:7b-instruct"
+  "gemma4:e4b"
+  "qwen3:latest"
+  "qwen3.5:9b"
+)
+categories=(
+  "simple_python"
+  "simple_java"
+  "simple_javascript"
+  "multiple"
+  "parallel"
+  "parallel_multiple"
+  "irrelevance"
+  "live_simple"
+  "live_multiple"
+  "live_parallel"
+  "live_parallel_multiple"
+  "live_irrelevance"
+)
+mkdir -p benchmarks/logs benchmarks/remote-small-runs
+npm run build
+for model in "${models[@]}"; do
+  safe_model="${model//[:\/]/-}"
+  for category in "${categories[@]}"; do
+    output="benchmarks/remote-small-runs/${safe_model}-${category}.json"
+    log="benchmarks/logs/${safe_model}-${category}.log"
+    if [[ -s "$output" ]]; then
+      echo "[skip] $model $category -> $output"
+      continue
+    fi
+    echo "[run] $model $category"
+    OLLAMA_BASE_URL="$OLLAMA_BASE_URL" \
+      BENCH_MODELS="$model" \
+      BENCH_CATEGORIES="$category" \
+      BENCH_CASES_PER_CATEGORY=0 \
+      BENCH_REQUEST_TIMEOUT_MS="$BENCH_REQUEST_TIMEOUT_MS" \
+      BENCH_PROGRESS_INTERVAL="$BENCH_PROGRESS_INTERVAL" \
+      BENCH_OUTPUT="$ROOT/$output" \
+      node scripts/bench-bfcl-real.mjs 2>&1 | tee "$log"
+  done
+done
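
Reviewer note on the loop above: each model and category pair gets its own output and log file, and a non-empty output file causes the pair to be skipped on the next invocation. A post-processing script that needs to locate those files could mirror the naming scheme as in the minimal sketch below. `runPaths` is a hypothetical helper, not part of the package; it assumes the directory layout and the `${model//[:\/]/-}` substitution shown in the script.

```javascript
// Hypothetical helper (not part of the package): mirrors the Bash expansion
// ${model//[:\/]/-}, which replaces every ":" and "/" in the model tag with "-".
function runPaths(model, category) {
  const safeModel = model.replace(/[:\/]/g, "-");
  return {
    output: `benchmarks/remote-small-runs/${safeModel}-${category}.json`,
    log: `benchmarks/logs/${safeModel}-${category}.log`,
  };
}

console.log(runPaths("qwen3.5:0.8b", "simple_python").output);
// benchmarks/remote-small-runs/qwen3.5-0.8b-simple_python.json
console.log(runPaths("lfm2.5-thinking:latest", "live_irrelevance").log);
// benchmarks/logs/lfm2.5-thinking-latest-live_irrelevance.log
```

Because the skip check only tests that the output file is non-empty, deleting a pair's JSON file is what forces that pair to be rerun.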