@botbotgo/better-call 0.1.17 → 0.1.18
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +104 -157
- package/package.json +1 -1
package/README.md
CHANGED

@@ -4,13 +4,17 @@
 
 # BetterCall
 
-**
+**A small runtime reliability layer for LangChain and LLM tool calls.**
+
+BetterCall wraps existing LangChain-style tools, validates model-generated calls before execution, and can optionally ask a repair model to fix rejected calls. In full BFCL v4 remote Ollama runs, the best measured lift was `granite4.1:3b`: **73.4% raw -> 83.8% with BetterCall**.
 
 ```ts
+import { betterTools } from "@botbotgo/better-call";
+
 const tools = betterTools([searchTool, calculatorTool]);
 ```
 
-No model means validate + block only. Add `repairModel` when you want automatic repair.
+No model means validate and block only. Add `repairModel` when you want automatic repair.
 
 ## Install
 
@@ -18,89 +22,92 @@ No model means validate + block only. Add `repairModel` when you want automatic
 npm install @botbotgo/better-call
 ```
 
-##
+## Quick Start
+
+LangChain tools work directly:
 
 ```ts
+import { tool } from "@langchain/core/tools";
 import { betterTools } from "@botbotgo/better-call";
+import { z } from "zod";
 
-
-
-
-
-Pass a repair model to let BetterCall fix rejected calls automatically. This can be the same chat model your agent uses, or a separate cheaper/stronger model dedicated to repair:
-
-```ts
-// Validate + repair: validate, repair with a model, validate again, then execute.
-const tools = betterTools([searchTool, calculatorTool], {
-  repairModel: model,
+const searchTool = tool(async ({ query }) => search(query), {
+  name: "search",
+  description: "Search the web.",
+  schema: z.object({ query: z.string() }),
 });
-```
 
-
-
-```bash
-npm run bench:bfcl
+const tools = betterTools([searchTool]);
 ```
 
-
-
-For the full official BFCL v4 reproduction, use Berkeley's published checkpoint/package: commit `f7cf735` or `pip install bfcl-eval==2025.12.17`.
+The same wrapper works for any tool with `name`, `schema`, and `invoke(input)`:
 
-
-
-```bash
-python3 -m venv /tmp/better-call-bfcl-venv
-/tmp/better-call-bfcl-venv/bin/pip install "bfcl-eval==2025.12.17" soundfile
+```ts
+import { betterTools } from "@botbotgo/better-call";
 
-
-
-BENCH_CATEGORIES=all \
-BENCH_CASES_PER_CATEGORY=0 \
-npm run bench:bfcl:real
+// Validate and block bad calls before execution.
+const tools = betterTools([searchTool, calculatorTool]);
 ```
 
-
+To repair rejected calls automatically, pass a LangChain chat model or any model with `invoke(input)`:
+
+```ts
+import { ChatOpenAI } from "@langchain/openai";
 
-
+const repairModel = new ChatOpenAI({ model: "gpt-4.1-mini" });
 
-
-
+const tools = betterTools([searchTool, calculatorTool], {
+  repairModel,
+});
 ```
 
-
+BetterCall validates, repairs rejected calls, validates again, and only then invokes the original tool.
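The added line above describes the wrapper's control flow in one sentence. As a rough standalone sketch of that validate -> repair -> validate -> execute loop (every name here is hypothetical and illustrative, not the package's actual API or internals):

```typescript
// Sketch of a validate -> repair -> validate -> execute loop.
// All identifiers below are hypothetical, not BetterCall's real internals.
type ToolCall = { name: string; args: Record<string, unknown> };

const knownTools: Record<string, (args: Record<string, unknown>) => string> = {
  search: (args) => `results for ${String(args.query)}`,
};

// First validation pass: unknown tools and missing required args are rejected.
function validateCall(call: ToolCall): string | null {
  if (!(call.name in knownTools)) return `unknown tool: ${call.name}`;
  if (typeof call.args.query !== "string") return "missing required arg: query";
  return null;
}

// Stand-in for a repair model; here it just fixes one common wrong arg name.
function repairCall(call: ToolCall): ToolCall {
  if ("q" in call.args) return { ...call, args: { query: String(call.args.q) } };
  return call;
}

function execute(call: ToolCall): string {
  const error = validateCall(call);
  if (error === null) return knownTools[call.name](call.args); // valid: run it
  const repaired = repairCall(call); // ask the "repair model" to fix it
  const secondError = validateCall(repaired); // validate again after repair
  if (secondError !== null) throw new Error(secondError); // still bad: block
  return knownTools[repaired.name](repaired.args); // only now execute
}
```

In this sketch, a call with the wrong arg name `q` fails the first check, gets repaired to `{ query: ... }`, passes the second check, and only then runs; a call to an unknown tool is blocked outright.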
 
 ## What It Catches
 
-
-| --- | --- |
-| Tool selection | Unknown
-
-
-
-
-| Schema | Invalid enum | `NASDAQ` where only `US`, `HK`, `CN` are allowed |
-| Schema | Extra arg | `currency` when `additionalProperties: false` |
-| Policy | Semantic validator rejection | Domain-specific validator rejects unsafe args |
-
-In validate + block mode, BetterCall rejects these calls before execution. With `repairModel`, it asks the model to fix rejected calls, validates again, and only then executes.
+| Area | Examples |
+| --- | --- |
+| Tool selection | Unknown tools, irrelevant tool calls |
+| Arguments | Missing required args, wrong arg names, extra args |
+| Schema | Wrong types, invalid enums, `additionalProperties: false` violations |
+| Policy | Expected no tool call, expected some tool call |
+| Semantic validation | Domain-specific validator rejects unsafe args |
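To make the schema-level rows of the new table concrete (missing required args, extra args under `additionalProperties: false`, invalid enums, wrong types), here is a minimal standalone checker sketch; the `checkArgs` helper and its schema shape are illustrative assumptions, not BetterCall's validator:

```typescript
// Minimal standalone sketch of the schema checks listed in the table above.
// The schema shape loosely mirrors JSON Schema; the checker is illustrative.
type ObjectSchema = {
  properties: Record<string, { type: string; enum?: string[] }>;
  required?: string[];
  additionalProperties?: boolean;
};

function checkArgs(schema: ObjectSchema, args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  // Missing required args.
  for (const name of schema.required ?? []) {
    if (!(name in args)) problems.push(`missing required arg: ${name}`);
  }
  for (const [name, value] of Object.entries(args)) {
    const prop = schema.properties[name];
    if (!prop) {
      // Extra args are only an error when additionalProperties is false.
      if (schema.additionalProperties === false) problems.push(`extra arg: ${name}`);
      continue;
    }
    if (typeof value !== prop.type) problems.push(`wrong type for ${name}`);
    if (prop.enum && !prop.enum.includes(String(value))) problems.push(`invalid enum for ${name}`);
  }
  return problems;
}
```

Against a schema whose `market` enum allows only `US`, `HK`, `CN`, a call with `market: "NASDAQ"` plus an unexpected `currency` arg yields two problems: an invalid enum and an extra arg.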
 
 ## API
 
-### `betterTools`
+### `betterTools(tools, options?)`
 
-
+Wraps an existing tool array. Each tool must expose `name` and `invoke(input)`. BetterCall preserves the original tool shape and wraps `invoke`.
 
 ```ts
-
-const tools = betterTools([searchTool, calculatorTool]);
+const guarded = betterTools([searchTool]);
 
-
-
+const repaired = betterTools([searchTool], {
+  repairModel: model,
+});
 ```
 
-
+Options:
+
+| Option | Purpose |
+| --- | --- |
+| `repairModel` | Model used by the default repair prompt/parser |
+| `repair` | Custom repair function |
+| `mode` | `"guard"`, `"repair"`, or `"review"` |
+| `policy` | Expected call policy, such as `expected: "none"` |
+| `validate` | Custom semantic validator |
+| `schema` | Override the wrapped tool schema |
+| `userInput` | Original user input or a function that derives it from tool args |
 
-
+Modes:
+
+| Mode | Behavior |
+| --- | --- |
+| no model / `"guard"` | Validate and block unsafe calls |
+| `"repair"` or `repairModel` | Repair rejected calls, then validate again |
+| `"review"` | Ask for a full self-check even when calls pass schema |
+
+Tool schemas can be JSON Schema, Zod object schemas, or Zod-shaped records:
 
 ```ts
 import { z } from "zod";

@@ -117,121 +124,61 @@ const tools = betterTools([
 ]);
 ```
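One common way to accept the three schema shapes the new text mentions is duck-typing. The sketch below is a hypothetical detector, not the package's `normalizeToolSchema`; the only library fact it relies on is that Zod schemas expose a `safeParse` function:

```typescript
// Hypothetical duck-typing of the three accepted schema shapes.
// "Zod-shaped record" means a plain object of field schemas, e.g. { query: z.string() }.
function describeSchema(schema: Record<string, unknown>): string {
  if (typeof schema.safeParse === "function") {
    return "zod-object-schema"; // Zod schemas expose safeParse()
  }
  if (schema.type === "object" && typeof schema.properties === "object") {
    return "json-schema"; // plain JSON Schema object
  }
  return "zod-shaped-record"; // fallback: treat as a record of field schemas
}
```

A normalizer built on this kind of check can convert all three shapes into one internal representation before validating calls.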
 
-
+Lower-level exports are available for custom runtimes:
 
-
+- `guardToolCalls(...)`
+- `reliableToolCalls(...)`
+- `normalizeToolSchema(schema)`
+- `defaultRepair(model)`
+- `BetterToolValidationError`
 
-
+## Benchmarks
 
-
-
-| Mode | Behavior |
-| --- | --- |
-| no model | Validate and block only |
-| `repairModel` | Validate, repair rejected calls, then validate again |
-| custom `repair` | Use your own repair function |
-| `review` | Ask for a full self-check even when calls pass schema |
+Run the bundled targeted BFCL wrapper benchmark:
 
-
+```bash
+npm run bench:bfcl
+```
 
-
+Run real-model BFCL v4 calls against your Ollama endpoint:
 
-
+```bash
+python3 -m venv /tmp/better-call-bfcl-venv
+/tmp/better-call-bfcl-venv/bin/pip install "bfcl-eval==2025.12.17" soundfile
 
-
+OLLAMA_BASE_URL=http://127.0.0.1:11434 \
+BENCH_MODELS=qwen3.5:0.8b \
+BENCH_CATEGORIES=all \
+BENCH_CASES_PER_CATEGORY=0 \
+npm run bench:bfcl:real
+```
 
-
+Run the small-model matrix:
 
-
-
-```text
-granite4.1:3b
-Raw 73.4% | #############################...........
-BetterCall 83.8% | ##################################......
-qwen2.5:7b-instruct
-Raw 72.2% | #############################...........
-BetterCall 78.2% | ###############################.........
-qwen3:0.6b
-Raw 55.5% | ######################..................
-BetterCall 63.6% | #########################...............
-qwen3.5:0.8b
-Raw 54.6% | ######################..................
-BetterCall 56.9% | #######################.................
-qwen3.5:2b
-Raw 53.9% | ######################..................
-BetterCall 54.9% | ######################..................
-lfm2.5-thinking:latest
-Raw 50.8% | ####################....................
-BetterCall 54.8% | ######################..................
-qwen3.5:4b
-Raw 43.6% | #################.......................
-BetterCall 43.4% | #################.......................
-gemma4:e2b
-Raw 24.3% | ##########..............................
-BetterCall 24.7% | ##########..............................
+```bash
+OLLAMA_BASE_URL=http://127.0.0.1:11434 npm run bench:bfcl:remote-small-all
 ```
 
-
-| ---: | --- | ---: | ---: | ---: | ---: | ---: |
-| 1 | `granite4.1:3b` | 3,625 | 73.4% | 83.8% | +10.4pp | 25 |
-| 2 | `qwen2.5:7b-instruct` | 3,625 | 72.2% | 78.2% | +5.9pp | 80 |
-| 3 | `qwen3:0.6b` | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
-| 4 | `qwen3.5:0.8b` | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
-| 5 | `qwen3.5:2b` | 3,625 | 53.9% | 54.9% | +1.0pp | 1,308 |
-| 6 | `lfm2.5-thinking:latest` | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
-| 7 | `qwen3.5:4b` | 3,625 | 43.6% | 43.4% | -0.2pp | 1,847 |
-| 8 | `gemma4:e2b` | 3,625 | 24.3% | 24.7% | +0.4pp | 2,641 |
+Notes:
 
-
+- These are BetterCall wrapper benchmarks, not official BFCL leaderboard submissions.
+- Full official BFCL v4 reproduction should use Berkeley's published checkpoint/package: commit `f7cf735` or `bfcl-eval==2025.12.17`.
+- Real runs use Ollama `/api/chat` tool-call requests.
+- Request errors and timeouts count as incorrect.
+- Benchmark JSON redacts endpoints by default. Set `BENCH_SHOW_ENDPOINT=1` only for private debugging.
+
+Latest completed remote run artifact: `benchmarks/bfcl-real-remote-completed-summary.json`.
 
-
+| Model | Cases | Raw | BetterCall | Lift | Request errors |
 | --- | ---: | ---: | ---: | ---: | ---: |
-| `
-| `
-| `
-| `
-| `
-| `
-| `
-| `
-| `live_multiple` | 1,053 | 41.6% | 41.0% | -0.6pp | 538 |
-| `live_parallel` | 16 | 0.0% | 0.0% | +0.0pp | 16 |
-| `live_parallel_multiple` | 24 | 0.0% | 0.0% | +0.0pp | 24 |
-| `live_irrelevance` | 884 | 0.0% | 0.0% | +0.0pp | 884 |
-
-This `qwen3.5:4b` run hit sustained remote request failures in the live categories; those failures are counted as incorrect by the benchmark.
-
-Historical targeted wrapper benchmark:
-
-| Model | Raw | BetterCall repair | Accuracy lift |
-| --- | ---: | ---: | ---: |
-| `gemma4:e2b` | 81.3% | 91.3% | +10.0pp |
-| `qwen3.5:2b` | 75.3% | 84.0% | +8.7pp |
-| `qwen3.5:9b` | 84.0% | 90.0% | +6.0pp |
-| `qwen3.5:4b` | 82.0% | 87.3% | +5.3pp |
-| `granite4.1:3b` | 66.0% | 69.3% | +3.3pp |
-
-Strongest result: BFCL `irrelevance`, where the model should not call any tool.
-
-| Model | Raw irrelevance | BetterCall repair |
-| --- | ---: | ---: |
-| `qwen3.5:2b` | 74% | 100% |
-| `qwen3.5:4b` | 84% | 100% |
-| `qwen3.5:9b` | 84% | 100% |
-| `granite4.1:3b` | 92% | 100% |
-| `gemma4:e2b` | 70% | 100% |
-
-## Why It Exists
-
-Small models are useful because they are cheap and fast. They also make tool mistakes:
-
-- call a tool that does not exist
-- fill the wrong parameter names
-- pass the wrong types
-- call tools when no tool is relevant
-- produce a call that looks valid but is unsafe to execute
-
-BetterCall reduces those failures before they reach production tools.
+| `granite4.1:3b` | 3,625 | 73.4% | 83.8% | +10.4pp | 25 |
+| `qwen2.5:7b-instruct` | 3,625 | 72.2% | 78.2% | +5.9pp | 80 |
+| `qwen3:0.6b` | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
+| `qwen3.5:0.8b` | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
+| `qwen3.5:2b` | 3,625 | 53.9% | 54.9% | +1.0pp | 1,308 |
+| `lfm2.5-thinking:latest` | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
+| `qwen3.5:4b` | 3,625 | 43.6% | 43.4% | -0.2pp | 1,847 |
+| `gemma4:e2b` | 3,625 | 24.3% | 24.7% | +0.4pp | 2,641 |
 
 ## License
 