@botbotgo/better-call 0.1.17 → 0.1.18
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +104 -157
- package/package.json +1 -1
package/README.md
CHANGED

@@ -4,13 +4,17 @@
 
 # BetterCall
 
-**
+**A small runtime reliability layer for LangChain and LLM tool calls.**
+
+BetterCall wraps existing LangChain-style tools, validates model-generated calls before execution, and can optionally ask a repair model to fix rejected calls. In full BFCL v4 remote Ollama runs, the best measured lift was `granite4.1:3b`: **73.4% raw -> 83.8% with BetterCall**.
 
 ```ts
+import { betterTools } from "@botbotgo/better-call";
+
 const tools = betterTools([searchTool, calculatorTool]);
 ```
 
-No model means validate + block only. Add `repairModel` when you want automatic repair.
+No model means validate and block only. Add `repairModel` when you want automatic repair.
 
 ## Install
 
@@ -18,89 +22,92 @@ No model means validate + block only. Add `repairModel` when you want automatic
 npm install @botbotgo/better-call
 ```
 
-##
+## Quick Start
+
+LangChain tools work directly:
 
 ```ts
+import { tool } from "@langchain/core/tools";
 import { betterTools } from "@botbotgo/better-call";
+import { z } from "zod";
 
-
-
-
-
-Pass a repair model to let BetterCall fix rejected calls automatically. This can be the same chat model your agent uses, or a separate cheaper/stronger model dedicated to repair:
-
-```ts
-// Validate + repair: validate, repair with a model, validate again, then execute.
-const tools = betterTools([searchTool, calculatorTool], {
-  repairModel: model,
+const searchTool = tool(async ({ query }) => search(query), {
+  name: "search",
+  description: "Search the web.",
+  schema: z.object({ query: z.string() }),
 });
-```
 
-
-
-```bash
-npm run bench:bfcl
+const tools = betterTools([searchTool]);
 ```
 
-
-
-For the full official BFCL v4 reproduction, use Berkeley's published checkpoint/package: commit `f7cf735` or `pip install bfcl-eval==2025.12.17`.
+The same wrapper works for any tool with `name`, `schema`, and `invoke(input)`:
 
-
-
-```bash
-python3 -m venv /tmp/better-call-bfcl-venv
-/tmp/better-call-bfcl-venv/bin/pip install "bfcl-eval==2025.12.17" soundfile
+```ts
+import { betterTools } from "@botbotgo/better-call";
 
-
-
-BENCH_CATEGORIES=all \
-BENCH_CASES_PER_CATEGORY=0 \
-npm run bench:bfcl:real
+// Validate and block bad calls before execution.
+const tools = betterTools([searchTool, calculatorTool]);
 ```
 
-
+To repair rejected calls automatically, pass a LangChain chat model or any model with `invoke(input)`:
+
+```ts
+import { ChatOpenAI } from "@langchain/openai";
 
-
+const repairModel = new ChatOpenAI({ model: "gpt-4.1-mini" });
 
-
-
+const tools = betterTools([searchTool, calculatorTool], {
+  repairModel,
+});
 ```
 
-
+BetterCall validates, repairs rejected calls, validates again, and only then invokes the original tool.
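The added line above describes the wrapper's control flow in one sentence. As a rough standalone sketch of that validate -> repair -> validate -> execute loop (every name here is hypothetical and illustrative, not the package's actual API or internals):

```typescript
// Sketch of a validate -> repair -> validate -> execute loop.
// All identifiers below are hypothetical, not BetterCall's real internals.
type ToolCall = { name: string; args: Record<string, unknown> };

const knownTools: Record<string, (args: Record<string, unknown>) => string> = {
  search: (args) => `results for ${String(args.query)}`,
};

// First validation pass: unknown tools and missing required args are rejected.
function validateCall(call: ToolCall): string | null {
  if (!(call.name in knownTools)) return `unknown tool: ${call.name}`;
  if (typeof call.args.query !== "string") return "missing required arg: query";
  return null;
}

// Stand-in for a repair model; here it just fixes one common wrong arg name.
function repairCall(call: ToolCall): ToolCall {
  if ("q" in call.args) return { ...call, args: { query: String(call.args.q) } };
  return call;
}

function execute(call: ToolCall): string {
  const error = validateCall(call);
  if (error === null) return knownTools[call.name](call.args); // valid: run it
  const repaired = repairCall(call); // ask the "repair model" to fix it
  const secondError = validateCall(repaired); // validate again after repair
  if (secondError !== null) throw new Error(secondError); // still bad: block
  return knownTools[repaired.name](repaired.args); // only now execute
}
```

In this sketch, a call with the wrong arg name `q` fails the first check, gets repaired to `{ query: ... }`, passes the second check, and only then runs; a call to an unknown tool is blocked outright.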
 
 ## What It Catches
 
-
-| --- | --- |
-| Tool selection | Unknown
-
-
-
-
-| Schema | Invalid enum | `NASDAQ` where only `US`, `HK`, `CN` are allowed |
-| Schema | Extra arg | `currency` when `additionalProperties: false` |
-| Policy | Semantic validator rejection | Domain-specific validator rejects unsafe args |
-
-In validate + block mode, BetterCall rejects these calls before execution. With `repairModel`, it asks the model to fix rejected calls, validates again, and only then executes.
+| Area | Examples |
+| --- | --- |
+| Tool selection | Unknown tools, irrelevant tool calls |
+| Arguments | Missing required args, wrong arg names, extra args |
+| Schema | Wrong types, invalid enums, `additionalProperties: false` violations |
+| Policy | Expected no tool call, expected some tool call |
+| Semantic validation | Domain-specific validator rejects unsafe args |
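To make the schema-level rows of the new table concrete (missing required args, extra args under `additionalProperties: false`, invalid enums, wrong types), here is a minimal standalone checker sketch; the `checkArgs` helper and its schema shape are illustrative assumptions, not BetterCall's validator:

```typescript
// Minimal standalone sketch of the schema checks listed in the table above.
// The schema shape loosely mirrors JSON Schema; the checker is illustrative.
type ObjectSchema = {
  properties: Record<string, { type: string; enum?: string[] }>;
  required?: string[];
  additionalProperties?: boolean;
};

function checkArgs(schema: ObjectSchema, args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  // Missing required args.
  for (const name of schema.required ?? []) {
    if (!(name in args)) problems.push(`missing required arg: ${name}`);
  }
  for (const [name, value] of Object.entries(args)) {
    const prop = schema.properties[name];
    if (!prop) {
      // Extra args are only an error when additionalProperties is false.
      if (schema.additionalProperties === false) problems.push(`extra arg: ${name}`);
      continue;
    }
    if (typeof value !== prop.type) problems.push(`wrong type for ${name}`);
    if (prop.enum && !prop.enum.includes(String(value))) problems.push(`invalid enum for ${name}`);
  }
  return problems;
}
```

Against a schema whose `market` enum allows only `US`, `HK`, `CN`, a call with `market: "NASDAQ"` plus an unexpected `currency` arg yields two problems: an invalid enum and an extra arg.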
 
 ## API
 
-### `betterTools`
+### `betterTools(tools, options?)`
 
-
+Wraps an existing tool array. Each tool must expose `name` and `invoke(input)`. BetterCall preserves the original tool shape and wraps `invoke`.
 
 ```ts
-
-const tools = betterTools([searchTool, calculatorTool]);
+const guarded = betterTools([searchTool]);
 
-
-
+const repaired = betterTools([searchTool], {
+  repairModel: model,
+});
 ```
 
-
+Options:
+
+| Option | Purpose |
+| --- | --- |
+| `repairModel` | Model used by the default repair prompt/parser |
+| `repair` | Custom repair function |
+| `mode` | `"guard"`, `"repair"`, or `"review"` |
+| `policy` | Expected call policy, such as `expected: "none"` |
+| `validate` | Custom semantic validator |
+| `schema` | Override the wrapped tool schema |
+| `userInput` | Original user input or a function that derives it from tool args |
 
-
+Modes:
+
+| Mode | Behavior |
+| --- | --- |
+| no model / `"guard"` | Validate and block unsafe calls |
+| `"repair"` or `repairModel` | Repair rejected calls, then validate again |
+| `"review"` | Ask for a full self-check even when calls pass schema |
+
+Tool schemas can be JSON Schema, Zod object schemas, or Zod-shaped records:
 
 ```ts
 import { z } from "zod";

@@ -117,121 +124,61 @@ const tools = betterTools([
 ]);
 ```
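One common way to accept the three schema shapes the new text mentions is duck-typing. The sketch below is a hypothetical detector, not the package's `normalizeToolSchema`; the only library fact it relies on is that Zod schemas expose a `safeParse` function:

```typescript
// Hypothetical duck-typing of the three accepted schema shapes.
// "Zod-shaped record" means a plain object of field schemas, e.g. { query: z.string() }.
function describeSchema(schema: Record<string, unknown>): string {
  if (typeof schema.safeParse === "function") {
    return "zod-object-schema"; // Zod schemas expose safeParse()
  }
  if (schema.type === "object" && typeof schema.properties === "object") {
    return "json-schema"; // plain JSON Schema object
  }
  return "zod-shaped-record"; // fallback: treat as a record of field schemas
}
```

A normalizer built on this kind of check can convert all three shapes into one internal representation before validating calls.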
 
-
+Lower-level exports are available for custom runtimes:
 
-
+- `guardToolCalls(...)`
+- `reliableToolCalls(...)`
+- `normalizeToolSchema(schema)`
+- `defaultRepair(model)`
+- `BetterToolValidationError`
 
-
+## Benchmarks
 
-
-
-| Mode | Behavior |
-| --- | --- |
-| no model | Validate and block only |
-| `repairModel` | Validate, repair rejected calls, then validate again |
-| custom `repair` | Use your own repair function |
-| `review` | Ask for a full self-check even when calls pass schema |
+Run the bundled targeted BFCL wrapper benchmark:
 
-
+```bash
+npm run bench:bfcl
+```
 
-
+Run real-model BFCL v4 calls against your Ollama endpoint:
 
-
+```bash
+python3 -m venv /tmp/better-call-bfcl-venv
+/tmp/better-call-bfcl-venv/bin/pip install "bfcl-eval==2025.12.17" soundfile
 
-
+OLLAMA_BASE_URL=http://127.0.0.1:11434 \
+BENCH_MODELS=qwen3.5:0.8b \
+BENCH_CATEGORIES=all \
+BENCH_CASES_PER_CATEGORY=0 \
+npm run bench:bfcl:real
+```
 
-
+Run the small-model matrix:
 
-
-
-```text
-granite4.1:3b
-Raw 73.4% | #############################...........
-BetterCall 83.8% | ##################################......
-qwen2.5:7b-instruct
-Raw 72.2% | #############################...........
-BetterCall 78.2% | ###############################.........
-qwen3:0.6b
-Raw 55.5% | ######################..................
-BetterCall 63.6% | #########################...............
-qwen3.5:0.8b
-Raw 54.6% | ######################..................
-BetterCall 56.9% | #######################.................
-qwen3.5:2b
-Raw 53.9% | ######################..................
-BetterCall 54.9% | ######################..................
-lfm2.5-thinking:latest
-Raw 50.8% | ####################....................
-BetterCall 54.8% | ######################..................
-qwen3.5:4b
-Raw 43.6% | #################.......................
-BetterCall 43.4% | #################.......................
-gemma4:e2b
-Raw 24.3% | ##########..............................
-BetterCall 24.7% | ##########..............................
+```bash
+OLLAMA_BASE_URL=http://127.0.0.1:11434 npm run bench:bfcl:remote-small-all
 ```
 
-
-| ---: | --- | ---: | ---: | ---: | ---: | ---: |
-| 1 | `granite4.1:3b` | 3,625 | 73.4% | 83.8% | +10.4pp | 25 |
-| 2 | `qwen2.5:7b-instruct` | 3,625 | 72.2% | 78.2% | +5.9pp | 80 |
-| 3 | `qwen3:0.6b` | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
-| 4 | `qwen3.5:0.8b` | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
-| 5 | `qwen3.5:2b` | 3,625 | 53.9% | 54.9% | +1.0pp | 1,308 |
-| 6 | `lfm2.5-thinking:latest` | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
-| 7 | `qwen3.5:4b` | 3,625 | 43.6% | 43.4% | -0.2pp | 1,847 |
-| 8 | `gemma4:e2b` | 3,625 | 24.3% | 24.7% | +0.4pp | 2,641 |
+Notes:
 
-
+- These are BetterCall wrapper benchmarks, not official BFCL leaderboard submissions.
+- Full official BFCL v4 reproduction should use Berkeley's published checkpoint/package: commit `f7cf735` or `bfcl-eval==2025.12.17`.
+- Real runs use Ollama `/api/chat` tool-call requests.
+- Request errors and timeouts count as incorrect.
+- Benchmark JSON redacts endpoints by default. Set `BENCH_SHOW_ENDPOINT=1` only for private debugging.
+
+Latest completed remote run artifact: `benchmarks/bfcl-real-remote-completed-summary.json`.
 
-
+| Model | Cases | Raw | BetterCall | Lift | Request errors |
 | --- | ---: | ---: | ---: | ---: | ---: |
-| `
-| `
-| `
-| `
-| `
-| `
-| `
-| `
-| `live_multiple` | 1,053 | 41.6% | 41.0% | -0.6pp | 538 |
-| `live_parallel` | 16 | 0.0% | 0.0% | +0.0pp | 16 |
-| `live_parallel_multiple` | 24 | 0.0% | 0.0% | +0.0pp | 24 |
-| `live_irrelevance` | 884 | 0.0% | 0.0% | +0.0pp | 884 |
-
-This `qwen3.5:4b` run hit sustained remote request failures in the live categories; those failures are counted as incorrect by the benchmark.
-
-Historical targeted wrapper benchmark:
-
-| Model | Raw | BetterCall repair | Accuracy lift |
-| --- | ---: | ---: | ---: |
-| `gemma4:e2b` | 81.3% | 91.3% | +10.0pp |
-| `qwen3.5:2b` | 75.3% | 84.0% | +8.7pp |
-| `qwen3.5:9b` | 84.0% | 90.0% | +6.0pp |
-| `qwen3.5:4b` | 82.0% | 87.3% | +5.3pp |
-| `granite4.1:3b` | 66.0% | 69.3% | +3.3pp |
-
-Strongest result: BFCL `irrelevance`, where the model should not call any tool.
-
-| Model | Raw irrelevance | BetterCall repair |
-| --- | ---: | ---: |
-| `qwen3.5:2b` | 74% | 100% |
-| `qwen3.5:4b` | 84% | 100% |
-| `qwen3.5:9b` | 84% | 100% |
-| `granite4.1:3b` | 92% | 100% |
-| `gemma4:e2b` | 70% | 100% |
-
-## Why It Exists
-
-Small models are useful because they are cheap and fast. They also make tool mistakes:
-
-- call a tool that does not exist
-- fill the wrong parameter names
-- pass the wrong types
-- call tools when no tool is relevant
-- produce a call that looks valid but is unsafe to execute
-
-BetterCall reduces those failures before they reach production tools.
+| `granite4.1:3b` | 3,625 | 73.4% | 83.8% | +10.4pp | 25 |
+| `qwen2.5:7b-instruct` | 3,625 | 72.2% | 78.2% | +5.9pp | 80 |
+| `qwen3:0.6b` | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
+| `qwen3.5:0.8b` | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
+| `qwen3.5:2b` | 3,625 | 53.9% | 54.9% | +1.0pp | 1,308 |
+| `lfm2.5-thinking:latest` | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
+| `qwen3.5:4b` | 3,625 | 43.6% | 43.4% | -0.2pp | 1,847 |
+| `gemma4:e2b` | 3,625 | 24.3% | 24.7% | +0.4pp | 2,641 |
 
 ## License
 