npm - @botbotgo/better-call - Versions diffs - 0.1.5 → 0.1.7 - Mend

@botbotgo/better-call 0.1.5 → 0.1.7

This diff shows the changes between two publicly released versions of the package, as they appear in the public registry. It is provided for informational purposes only.

Files changed (4)

package/README.md +51 -24
package/benchmarks/bfcl-real-remote-completed-summary.json +119 -1
package/docs/banner.svg +2 -2
package/package.json +1 -1

package/README.md

@@ -4,7 +4,7 @@
 # BetterCall
-**One-line wrapper. Six full BFCL remote runs completed. Best: 73.4% → 83.8%.**
+**One-line wrapper. Seven full BFCL remote runs completed. Best: 73.4% → 83.8%.**
 ```ts
 const tools = betterTools([searchTool, calculatorTool]);
@@ -121,33 +121,60 @@ Measured with real Ollama `/api/chat` calls over all supported BFCL v4 single-tu
 Latest completed remote run artifact: `benchmarks/bfcl-real-remote-completed-summary.json`.
-| Model | Completed cases | Raw | BetterCall repair | Accuracy lift | Request errors |
-| --- | ---: | ---: | ---: | ---: | ---: |
-| `granite4.1:3b` | 3,625 | 73.4% | 83.8% | +10.4pp | 25 |
-| `qwen3:0.6b` | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
-| `qwen3.5:0.8b` | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
-| `qwen3.5:2b` | 3,625 | 53.9% | 54.9% | +1.0pp | 1,308 |
-| `lfm2.5-thinking:latest` | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
-| `gemma4:e2b` | 3,625 | 24.3% | 24.7% | +0.4pp | 2,641 |
+Performance after wrapping the same model outputs with BetterCall:
+```text
+granite4.1:3b
+  Raw         73.4% | #############################...........
+  BetterCall  83.8% | ##################################......
+qwen2.5:7b-instruct
+  Raw         72.2% | #############################...........
+  BetterCall  78.2% | ###############################.........
+qwen3:0.6b
+  Raw         55.5% | ######################..................
+  BetterCall  63.6% | #########################...............
+qwen3.5:0.8b
+  Raw         54.6% | ######################..................
+  BetterCall  56.9% | #######################.................
+qwen3.5:2b
+  Raw         53.9% | ######################..................
+  BetterCall  54.9% | ######################..................
+lfm2.5-thinking:latest
+  Raw         50.8% | ####################....................
+  BetterCall  54.8% | ######################..................
+gemma4:e2b
+  Raw         24.3% | ##########..............................
+  BetterCall  24.7% | ##########..............................
+```
+| Rank | Model | Completed cases | Raw model | BetterCall | Lift | Request errors |
+| ---: | --- | ---: | ---: | ---: | ---: | ---: |
+| 1 | `granite4.1:3b` | 3,625 | 73.4% | 83.8% | +10.4pp | 25 |
+| 2 | `qwen2.5:7b-instruct` | 3,625 | 72.2% | 78.2% | +5.9pp | 80 |
+| 3 | `qwen3:0.6b` | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
+| 4 | `qwen3.5:0.8b` | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
+| 5 | `qwen3.5:2b` | 3,625 | 53.9% | 54.9% | +1.0pp | 1,308 |
+| 6 | `lfm2.5-thinking:latest` | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
+| 7 | `gemma4:e2b` | 3,625 | 24.3% | 24.7% | +0.4pp | 2,641 |
-Latest completed model category detail: `gemma4:e2b`.
+Latest completed model category detail: `qwen2.5:7b-instruct`.
 | Category | Cases | Raw | BetterCall repair | Lift | Request errors |
 | --- | ---: | ---: | ---: | ---: | ---: |
-| `simple_python` | 400 | 82.0% | 83.3% | +1.3pp | 46 |
-| `simple_java` | 100 | 50.0% | 50.0% | +0.0pp | 29 |
-| `simple_javascript` | 50 | 34.0% | 36.0% | +2.0pp | 22 |
-| `multiple` | 200 | 76.5% | 79.0% | +2.5pp | 32 |
-| `parallel` | 200 | 54.0% | 54.0% | +0.0pp | 84 |
-| `parallel_multiple` | 200 | 42.5% | 42.5% | +0.0pp | 109 |
-| `irrelevance` | 240 | 52.9% | 54.2% | +1.3pp | 104 |
-| `live_simple` | 258 | 4.7% | 5.0% | +0.4pp | 238 |
-| `live_multiple` | 1,053 | 0.0% | 0.0% | +0.0pp | 1,053 |
-| `live_parallel` | 16 | 0.0% | 0.0% | +0.0pp | 16 |
-| `live_parallel_multiple` | 24 | 0.0% | 0.0% | +0.0pp | 24 |
-| `live_irrelevance` | 884 | 0.0% | 0.0% | +0.0pp | 884 |
-The strongest `gemma4:e2b` category was `multiple`, improving from 76.5% to 79.0%. The live categories were dominated by request errors under the 5s timeout and are preserved as measured in the artifact.
+| `simple_python` | 400 | 91.3% | 93.0% | +1.8pp | 9 |
+| `simple_java` | 100 | 55.0% | 59.0% | +4.0pp | 3 |
+| `simple_javascript` | 50 | 60.0% | 64.0% | +4.0pp | 0 |
+| `multiple` | 200 | 89.5% | 91.0% | +1.5pp | 3 |
+| `parallel` | 200 | 81.0% | 84.5% | +3.5pp | 2 |
+| `parallel_multiple` | 200 | 77.0% | 78.0% | +1.0pp | 6 |
+| `irrelevance` | 240 | 64.6% | 89.6% | +25.0pp | 0 |
+| `live_simple` | 258 | 69.0% | 72.1% | +3.1pp | 1 |
+| `live_multiple` | 1,053 | 69.0% | 73.4% | +4.4pp | 18 |
+| `live_parallel` | 16 | 25.0% | 37.5% | +12.5pp | 2 |
+| `live_parallel_multiple` | 24 | 50.0% | 50.0% | +0.0pp | 2 |
+| `live_irrelevance` | 884 | 67.6% | 75.9% | +8.3pp | 34 |
+The strongest `qwen2.5:7b-instruct` category lift was `irrelevance`, improving from 64.6% to 89.6%. The largest absolute live gain was `live_irrelevance`, improving by 73 cases.
 Historical targeted wrapper benchmark:
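
The percentages, lift values, and text bars in the README hunk above can be reproduced from the raw counts in `benchmarks/bfcl-real-remote-completed-summary.json`. The following is a minimal sketch, not the package's own reporting code: the names `RunCounts`, `pct`, `liftPp`, and `bar` are hypothetical, and the 40-cell bar with round-to-nearest filling is an assumption that is merely consistent with the bars shown. Computing the lift from the unrounded counts explains why `qwen2.5:7b-instruct` is listed as +5.9pp even though the rounded scores (72.2% and 78.2%) differ by 6.0.

```ts
// Hypothetical helper types and functions; only the counts come from the diff.
interface RunCounts {
  total: number;
  rawCorrect: number;
  betterCorrect: number;
}

// Percentage with one decimal place, as displayed in the README tables.
function pct(correct: number, total: number): string {
  return ((100 * correct) / total).toFixed(1) + "%";
}

// Lift in percentage points, computed from the counts rather than from
// the already-rounded percentages.
function liftPp(c: RunCounts): string {
  const lift = (100 * (c.betterCorrect - c.rawCorrect)) / c.total;
  return (lift >= 0 ? "+" : "") + lift.toFixed(1) + "pp";
}

// Text bar of `width` cells, filled in proportion to accuracy.
// Assumption: the number of filled cells is rounded to the nearest integer.
function bar(correct: number, total: number, width = 40): string {
  const filled = Math.round((width * correct) / total);
  return "#".repeat(filled) + ".".repeat(width - filled);
}

// Top-level counts of the qwen2.5:7b-instruct entry added in 0.1.7.
const qwen: RunCounts = { total: 3625, rawCorrect: 2619, betterCorrect: 2833 };

console.log("Raw        ", pct(qwen.rawCorrect, qwen.total), "|", bar(qwen.rawCorrect, qwen.total));
console.log("BetterCall ", pct(qwen.betterCorrect, qwen.total), "|", bar(qwen.betterCorrect, qwen.total));
console.log("Lift       ", liftPp(qwen));
```

The same functions apply to a single category by passing that category's `total`, `rawCorrect`, and `betterCorrect`.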

package/benchmarks/bfcl-real-remote-completed-summary.json

@@ -2,7 +2,7 @@
   "name": "BFCL v4 full remote Ollama runs",
   "source": "Real model calls against all supported BFCL v4 single-turn tool-call categories completed in this repository",
   "note": "Endpoint redacted. Scores count request errors/timeouts as incorrect. This is not an official BFCL leaderboard submission.",
-  "generatedAt": "2026-05-08T21:10:42.234Z",
+  "generatedAt": "2026-05-08T22:24:15.073Z",
   "results": [
     {
       "model": "granite4.1:3b",
@@ -122,6 +122,124 @@
         }
       ]
     },
+    {
+      "model": "qwen2.5:7b-instruct",
+      "total": 3625,
+      "rawCorrect": 2619,
+      "betterCorrect": 2833,
+      "errors": 80,
+      "repaired": 279,
+      "categories": [
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "simple_python",
+          "total": 400,
+          "rawCorrect": 365,
+          "betterCorrect": 372,
+          "errors": 9,
+          "repaired": 8
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "simple_java",
+          "total": 100,
+          "rawCorrect": 55,
+          "betterCorrect": 59,
+          "errors": 3,
+          "repaired": 12
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "simple_javascript",
+          "total": 50,
+          "rawCorrect": 30,
+          "betterCorrect": 32,
+          "errors": 0,
+          "repaired": 11
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "multiple",
+          "total": 200,
+          "rawCorrect": 179,
+          "betterCorrect": 182,
+          "errors": 3,
+          "repaired": 4
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "parallel",
+          "total": 200,
+          "rawCorrect": 162,
+          "betterCorrect": 169,
+          "errors": 2,
+          "repaired": 9
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "parallel_multiple",
+          "total": 200,
+          "rawCorrect": 154,
+          "betterCorrect": 156,
+          "errors": 6,
+          "repaired": 7
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "irrelevance",
+          "total": 240,
+          "rawCorrect": 155,
+          "betterCorrect": 215,
+          "errors": 0,
+          "repaired": 60
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "live_simple",
+          "total": 258,
+          "rawCorrect": 178,
+          "betterCorrect": 186,
+          "errors": 1,
+          "repaired": 16
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "live_multiple",
+          "total": 1053,
+          "rawCorrect": 727,
+          "betterCorrect": 773,
+          "errors": 18,
+          "repaired": 75
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "live_parallel",
+          "total": 16,
+          "rawCorrect": 4,
+          "betterCorrect": 6,
+          "errors": 2,
+          "repaired": 4
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "live_parallel_multiple",
+          "total": 24,
+          "rawCorrect": 12,
+          "betterCorrect": 12,
+          "errors": 2,
+          "repaired": 0
+        },
+        {
+          "model": "qwen2.5:7b-instruct",
+          "category": "live_irrelevance",
+          "total": 884,
+          "rawCorrect": 598,
+          "betterCorrect": 671,
+          "errors": 34,
+          "repaired": 73
+        }
+      ]
+    },
     {
       "model": "qwen3:0.6b",
       "total": 3625,
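
Each model entry in the summary carries top-level counts plus a per-category breakdown, so one useful sanity check is that the categories add up to the top-level figures. The sketch below is an illustration under stated assumptions: the field names are taken from the JSON in this hunk, while the type names, the `cat` helper, and `findMismatches` are hypothetical and not part of the package. The source does not define `repaired`, so the check only verifies sums and makes no claim about how that field relates to the score.

```ts
// Field names follow the JSON above; type and function names are hypothetical.
interface Counts {
  total: number;
  rawCorrect: number;
  betterCorrect: number;
  errors: number;
  repaired: number;
}
interface CategoryResult extends Counts { category: string; }
interface ModelResult extends Counts { model: string; categories: CategoryResult[]; }

const FIELDS = ["total", "rawCorrect", "betterCorrect", "errors", "repaired"] as const;

// Returns one message per field whose category sum differs from the top level.
function findMismatches(result: ModelResult): string[] {
  const problems: string[] = [];
  for (const field of FIELDS) {
    const sum = result.categories.reduce((acc, c) => acc + c[field], 0);
    if (sum !== result[field]) {
      problems.push(`${field}: categories sum to ${sum}, top level says ${result[field]}`);
    }
  }
  return problems;
}

// Compact constructor for one category row (illustrative only).
const cat = (
  category: string, total: number, rawCorrect: number,
  betterCorrect: number, errors: number, repaired: number,
): CategoryResult => ({ category, total, rawCorrect, betterCorrect, errors, repaired });

// The qwen2.5:7b-instruct entry added in this diff.
const qwenResult: ModelResult = {
  model: "qwen2.5:7b-instruct",
  total: 3625, rawCorrect: 2619, betterCorrect: 2833, errors: 80, repaired: 279,
  categories: [
    cat("simple_python", 400, 365, 372, 9, 8),
    cat("simple_java", 100, 55, 59, 3, 12),
    cat("simple_javascript", 50, 30, 32, 0, 11),
    cat("multiple", 200, 179, 182, 3, 4),
    cat("parallel", 200, 162, 169, 2, 9),
    cat("parallel_multiple", 200, 154, 156, 6, 7),
    cat("irrelevance", 240, 155, 215, 0, 60),
    cat("live_simple", 258, 178, 186, 1, 16),
    cat("live_multiple", 1053, 727, 773, 18, 75),
    cat("live_parallel", 16, 4, 6, 2, 4),
    cat("live_parallel_multiple", 24, 12, 12, 2, 0),
    cat("live_irrelevance", 884, 598, 671, 34, 73),
  ],
};

console.log(findMismatches(qwenResult)); // an empty array means the sums agree
```

For the entry added in this release the check reports no mismatches: all five fields sum to the top-level values.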

package/docs/banner.svg

@@ -1,6 +1,6 @@
 <svg xmlns="http://www.w3.org/2000/svg" width="1400" height="520" viewBox="0 0 1400 520" role="img" aria-labelledby="title desc">
   <title id="title">BetterCall banner</title>
-  <desc id="desc">BetterCall is a one-line wrapper with six full BFCL remote runs completed; the best run improves from 73.4 percent to 83.8 percent.</desc>
+  <desc id="desc">BetterCall is a one-line wrapper with seven full BFCL remote runs completed; the best run improves from 73.4 percent to 83.8 percent.</desc>
   <defs>
     <linearGradient id="background" x1="0" y1="0" x2="1" y2="1">
       <stop offset="0" stop-color="#07111f"/>
@@ -24,6 +24,6 @@
   <g transform="translate(112 112)">
     <text x="0" y="80" font-family="Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="78" font-weight="800" fill="#f8fbff">BetterCall</text>
     <rect x="4" y="116" width="430" height="8" rx="4" fill="url(#accent)"/>
-    <text x="0" y="186" font-family="Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="34" font-weight="670" fill="#dcecff">Six full BFCL remote runs. Best: 73.4% → 83.8%.</text>
+    <text x="0" y="186" font-family="Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, sans-serif" font-size="34" font-weight="670" fill="#dcecff">Seven full BFCL remote runs. Best: 73.4% → 83.8%.</text>
   </g>
 </svg>

package/package.json

@@ -1,6 +1,6 @@
 {
   "name": "@botbotgo/better-call",
-  "version": "0.1.5",
+  "version": "0.1.7",
   "description": "LLM tool-call reliability layer.",
   "type": "module",
   "license": "Apache-2.0",