ai-lcr 0.6.1 → 0.6.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +67 -1
- package/README.md +48 -0
- package/README.zh-CN.md +36 -0
- package/dist/index.cjs +325 -42
- package/dist/index.d.cts +189 -2
- package/dist/index.d.ts +189 -2
- package/dist/index.js +324 -42
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
|
@@ -4,11 +4,77 @@ All notable changes to `ai-lcr` are documented here. The format follows
|
|
|
4
4
|
[Keep a Changelog](https://keepachangelog.com/), and the project adheres to
|
|
5
5
|
[Semantic Versioning](https://semver.org/).
|
|
6
6
|
|
|
7
|
+
## [0.6.3] — 2026-06-11
|
|
8
|
+
|
|
9
|
+
Caching — both kinds, each off by default and each a pure config flag with no
|
|
10
|
+
service to run. The response cache is the layer Vercel AI Gateway notably
|
|
11
|
+
doesn't offer; ai-lcr does it in-process and folds it into its cost truth.
|
|
12
|
+
|
|
13
|
+
### Added
|
|
14
|
+
|
|
15
|
+
- **`createLCR({ cache })`** — exact-match **response cache**. An identical
|
|
16
|
+
request replays the stored response and calls **no provider at all**: zero
|
|
17
|
+
latency, `costUsd: 0`. Storage is pluggable with **zero added dependencies**:
|
|
18
|
+
`cache: true` uses a bundled in-memory store, `cache: myStore` brings your own
|
|
19
|
+
(Redis / Vercel KV — required for cross-request hits on serverless, where
|
|
20
|
+
memory isn't shared), `cache: { store?, ttlMs? }` sets a TTL. A hit settles a
|
|
21
|
+
`CallRecord` with `cacheHit: true` and the avoided cost on its own
|
|
22
|
+
`cacheHitSavingUsd` line (a caching saving, never folded into routing savings).
|
|
23
|
+
Empty completions and usage-less results are never cached. New exports:
|
|
24
|
+
`createMemoryCacheStore`, types `CacheStore` / `CacheOptions` / `CachedCall` /
|
|
25
|
+
`CachedMeta` / `MemoryCacheOptions`.
|
|
26
|
+
- **`createLCR({ promptCache })`** — automatic provider-side **prompt-cache**
|
|
27
|
+
breakpoint. Inserts an Anthropic `cache_control` marker on the last system
|
|
28
|
+
message so the static prompt head bills at the cache-read rate (~0.1× input)
|
|
29
|
+
on repeats; the model still runs. `true` for the 5-minute window,
|
|
30
|
+
`{ ttl: "1h" }` for the longer one. Only writes the `anthropic` namespace
|
|
31
|
+
(ignored by other providers, safe on a mixed chain) and steps aside if you set
|
|
32
|
+
`cacheControl` yourself. Savings surface via the existing `cachedInputTokens` /
|
|
33
|
+
`cachedSavingUsd`. New exported type `PromptCacheOptions`.
|
|
34
|
+
- `CallRecord` gains **`cacheHit`** and **`cacheHitSavingUsd`** for response-cache
|
|
35
|
+
hits.
|
|
36
|
+
|
|
37
|
+
### Compatibility
|
|
38
|
+
|
|
39
|
+
- Fully backward compatible. Both `cache` and `promptCache` are **off by
|
|
40
|
+
default** — unset, routing behaves exactly as before.
|
|
41
|
+
|
|
42
|
+
## [0.6.2] — 2026-06-11
|
|
43
|
+
|
|
44
|
+
Circuit breaker for persistently-failing providers. Until now the only recovery
|
|
45
|
+
lever was `resetIntervalMs`, which snaps routing back to the cheapest provider on
|
|
46
|
+
a timer — so a provider that's actually down keeps eating one failed attempt
|
|
47
|
+
every window. The breaker remembers the failure and stops sending it traffic.
|
|
48
|
+
|
|
49
|
+
### Added
|
|
50
|
+
|
|
51
|
+
- **`createLCR({ cooldown })`.** A provider that fails `maxFailures` times within
|
|
52
|
+
`windowMs` is *skipped* for `cooldownMs` instead of being re-probed every
|
|
53
|
+
request; a single success clears its count. `true` enables defaults (3 / 60s →
|
|
54
|
+
60s); pass `{ maxFailures, windowMs, cooldownMs }` to tune. New exported type
|
|
55
|
+
`CooldownOptions`.
|
|
56
|
+
- The breaker only **reorders** each request's attempt list (cooling providers go
|
|
57
|
+
last), so when every provider is cooling a request still tries them all rather
|
|
58
|
+
than failing outright — it can never turn a recoverable request into a hard
|
|
59
|
+
failure.
|
|
60
|
+
|
|
61
|
+
### Changed
|
|
62
|
+
|
|
63
|
+
- The routing engine now snapshots a per-request **attempt order** once (cheapest
|
|
64
|
+
ring with cooling providers moved to the back) and threads it through streaming
|
|
65
|
+
failover, replacing the previous modular index walk. Behavior is identical when
|
|
66
|
+
`cooldown` is unset.
|
|
67
|
+
|
|
68
|
+
### Compatibility
|
|
69
|
+
|
|
70
|
+
- Fully backward compatible. `cooldown` is **off by default** — with it unset no
|
|
71
|
+
provider is ever skipped and routing behaves exactly as before.
|
|
72
|
+
|
|
7
73
|
## [0.6.1] — 2026-06-11
|
|
8
74
|
|
|
9
75
|
Zero-config pricing for native-maker routes. Until now every priced provider
|
|
10
76
|
needed a hand-typed `cost: { input, output }`; for a vendor's own API that number
|
|
11
|
-
is just the public list price you could look up. 0.
|
|
77
|
+
is just the public list price you could look up. 0.6.1 bundles those.
|
|
12
78
|
|
|
13
79
|
### Added
|
|
14
80
|
|
package/README.md
CHANGED
|
@@ -171,6 +171,53 @@ Look a price up yourself with `getModelPrice("claude-sonnet-4-6")`. The table is
|
|
|
171
171
|
2. **Fall through on failure.** On any provider failure — rate limit, 5xx, timeout, a **billing cap** (402 / out-of-credit / quota), *and* a client error like a **400** — it advances to the next provider, streaming-safe. A 400 fails over on purpose: across OpenAI-compatible aggregators a 400 is usually "*this* provider won't take this request" (an unsupported param, a model it hasn't listed, a stricter schema), not a universally-broken request — so the next provider may well serve it. If every provider rejects the request it still fails, surfacing the **original** error so a genuine caller bug stays debuggable. The one failure that never fails over is a deliberate caller cancellation (`AbortSignal`). Pass `shouldRetry: isRetryableError` to `createLCR` to restore the stricter "client errors fail fast" behavior.
|
|
172
172
|
3. **Recover.** After an idle window (`resetIntervalMs`, default 60s) it snaps back to the cheapest provider.
|
|
173
173
|
|
|
174
|
+
For a provider that's *persistently* down, the timer alone keeps re-probing it — one failed attempt every window. Turn on the **circuit breaker** to stop that:
|
|
175
|
+
|
|
176
|
+
```ts
|
|
177
|
+
const lcr = createLCR({
|
|
178
|
+
models: { /* … */ },
|
|
179
|
+
cooldown: true, // skip a provider that keeps failing, instead of re-probing it
|
|
180
|
+
});
|
|
181
|
+
```
|
|
182
|
+
|
|
183
|
+
With `cooldown` on, a provider that fails enough times in a window is *skipped* for a cooldown period rather than tried every request — and a single success clears it. Defaults are 3 failures / 60s → 60s cooldown; tune with `cooldown: { maxFailures, windowMs, cooldownMs }`. It only ever **reorders** the attempt list (cooling providers go last), so if *every* provider is cooling a request still tries them all rather than failing outright. Off by default — routing is unchanged unless you opt in.
|
|
184
|
+
|
|
185
|
+
## Cache
|
|
186
|
+
|
|
187
|
+
There are two completely different "caches" in LLM land, and ai-lcr does both — each off by default, each a pure config flag with no service to run.
|
|
188
|
+
|
|
189
|
+
### Skip the call entirely (`cache`) — response cache
|
|
190
|
+
|
|
191
|
+
When a request is byte-for-byte identical to one already answered, replay the stored response and call **no provider at all**: zero latency, `costUsd: 0`. This is the layer Vercel AI Gateway notably *doesn't* offer.
|
|
192
|
+
|
|
193
|
+
```ts
|
|
194
|
+
const lcr = createLCR({
|
|
195
|
+
models: { /* … */ },
|
|
196
|
+
cache: true, // exact-match response cache (in-memory by default)
|
|
197
|
+
});
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
Storage is pluggable and ai-lcr ships **zero dependencies** for it:
|
|
201
|
+
|
|
202
|
+
- **`cache: true`** uses a process-local in-memory store. Real on a long-running server, and useful *within* one serverless invocation (an agent loop repeating a sub-call) — but it does **not** survive across serverless requests, because separate function instances don't share memory.
|
|
203
|
+
- **For cross-request hits on serverless**, bring your own store backed by a shared layer (Upstash Redis, Vercel KV): `cache: myStore`. ai-lcr runs no service of its own — any shared store is yours. A custom store is just `{ get, set }` (see `CacheStore`); `createMemoryCacheStore({ maxEntries })` is exported if you want the bundled one with a cap.
|
|
204
|
+
- **`cache: { store?, ttlMs? }`** sets an entry lifetime.
|
|
205
|
+
|
|
206
|
+
A hit settles a `CallRecord` with `cacheHit: true`, `costUsd: 0`, and the money it avoided on its own `cacheHitSavingUsd` line — a *caching* saving, kept separate from routing savings (`baselineUsd − costUsd`), never folded in. Empty completions and usage-less results are never cached. One caveat worth stating: caching makes identical requests return identical responses — exactly right for idempotent / `temperature: 0` calls, a behavior change for sampled ones.
|
|
207
|
+
|
|
208
|
+
### Pay less for the call (`promptCache`) — provider prompt cache
|
|
209
|
+
|
|
210
|
+
Different mechanism: the model still runs, but the **static head of the prompt** (your system prompt) bills at the provider's cache-read rate (~0.1× input) on repeats. Anthropic needs an explicit `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix automatically. `promptCache: true` inserts that marker on the last system message for you:
|
|
211
|
+
|
|
212
|
+
```ts
|
|
213
|
+
const lcr = createLCR({
|
|
214
|
+
models: { /* … */ },
|
|
215
|
+
promptCache: true, // 5-minute window; { ttl: "1h" } for the longer one
|
|
216
|
+
});
|
|
217
|
+
```
|
|
218
|
+
|
|
219
|
+
It only writes the `anthropic` provider-options namespace, which every other provider ignores — so it's safe on a mixed chain. And it **steps aside entirely** if you set `cacheControl` yourself anywhere in the prompt. The savings then show up exactly as before: `cachedInputTokens` and `cachedSavingUsd` on the `CallRecord` (see [Cache-aware cost](#see-what-happened-oncall) below).
|
|
220
|
+
|
|
174
221
|
## See what happened (`onCall`)
|
|
175
222
|
|
|
176
223
|
`onError`/`onCost` fire separately and uncorrelated, so a failover is hard to read after the fact. `onCall` gives you **one record per request** — the full chain, the winner, the reason for each failed hop, latency, and cost — and `formatCallRecord` turns it into a one-liner you can scan:
|
|
@@ -490,6 +537,7 @@ Two OpenAI-compatible providers, same probe, same day. Cells cover both families
|
|
|
490
537
|
## Roadmap
|
|
491
538
|
|
|
492
539
|
- [x] Own failover engine — cheapest-first routing + streaming-safe fallback, no external routing dependency
|
|
540
|
+
- [x] Circuit breaker (`cooldown`) — skip a persistently-failing provider instead of re-probing it every window
|
|
493
541
|
- [x] Real per-call cost accounting (`onCost`)
|
|
494
542
|
- [x] One correlated record per request with the full failover chain (`onCall` + `formatCallRecord`)
|
|
495
543
|
- [x] Auto cheapest-first ordering (`autoSort`) from per-provider `cost`
|
package/README.zh-CN.md
CHANGED
|
@@ -144,6 +144,42 @@ DeepInfra 只承载开源权重——没有第一方 Claude / GPT / Gemini。那
|
|
|
144
144
|
2. **失败时向下穿透。** 遇到任何 provider 失败——限流、5xx、超时、**额度耗尽**(402 / 欠费 / 余额不足),以及 **400** 这类 client 错误——都会前进到下一个 provider,且对流式安全。400 会 failover 是有意为之:在 OpenAI 兼容聚合层里,400 往往是"*这家* provider 不吃这个请求"(不支持的参数、它没上架这个 model、更严格的 schema),而非请求本身坏了——换一家很可能就能服务。若所有 provider 都拒绝,请求仍会失败,并抛出**第一个**(原始)错误,让真正的调用方 bug 保持可调试。唯一永远不 failover 的是调用方主动取消(`AbortSignal`)。想恢复旧的"client 错误立即失败"行为,给 `createLCR` 传 `shouldRetry: isRetryableError`。
|
|
145
145
|
3. **恢复。** 在一段空闲窗口(`resetIntervalMs`,默认 60s)之后,自动回到最便宜的 provider。
|
|
146
146
|
|
|
147
|
+
## 缓存
|
|
148
|
+
|
|
149
|
+
LLM 世界里有两种完全不同的"缓存",ai-lcr 两种都做——都默认关闭,都只是一个配置开关,**不需要你跑任何服务**。
|
|
150
|
+
|
|
151
|
+
### 整次调用都省掉(`cache`)—— 响应缓存
|
|
152
|
+
|
|
153
|
+
当一个请求和之前回答过的一模一样时,直接重放已存的响应、**完全不调用任何 provider**:零延迟、`costUsd: 0`。这正是 Vercel AI Gateway 明显没做的那一层。
|
|
154
|
+
|
|
155
|
+
```ts
|
|
156
|
+
const lcr = createLCR({
|
|
157
|
+
models: { /* … */ },
|
|
158
|
+
cache: true, // 精确匹配响应缓存(默认进程内内存)
|
|
159
|
+
});
|
|
160
|
+
```
|
|
161
|
+
|
|
162
|
+
存储可插拔,且 ai-lcr 为此**零依赖**:
|
|
163
|
+
|
|
164
|
+
- **`cache: true`** 用进程内内存。在长驻 server 上是真缓存;在单次 serverless 调用内(比如 agent 循环里重复的子调用)也有用——但它**不会跨 serverless 请求存活**,因为不同函数实例之间不共享内存。
|
|
165
|
+
- **想在 serverless 上跨请求命中**,自带一个由共享层(Upstash Redis、Vercel KV)支撑的 store:`cache: myStore`。ai-lcr 自己不跑任何服务——共享层是你的。自定义 store 就是 `{ get, set }`(见 `CacheStore`);想要带上限的内置实现,用导出的 `createMemoryCacheStore({ maxEntries })`。
|
|
166
|
+
- **`cache: { store?, ttlMs? }`** 可设置过期时间。
|
|
167
|
+
|
|
168
|
+
命中会落一条 `CallRecord`:`cacheHit: true`、`costUsd: 0`,并把省下的钱单独记在 `cacheHitSavingUsd` 一行——这是**缓存**省的钱,和路由省的钱(`baselineUsd − costUsd`)分开,绝不混在一起。空回复和无 usage 的结果永不缓存。一个要点:缓存会让相同请求返回相同响应——对幂等 / `temperature: 0` 的调用正好,对采样型调用则是行为改变。
|
|
169
|
+
|
|
170
|
+
### 让这次调用更便宜(`promptCache`)—— provider 提示缓存
|
|
171
|
+
|
|
172
|
+
另一套机制:模型照样跑,但**提示的静态开头**(你的 system prompt)在重复时按 provider 的缓存读价(约 0.1× input)计费。Anthropic 需要显式 `cache_control` 标记;OpenAI / Gemini / DeepSeek 自动缓存前缀。`promptCache: true` 帮你在最后一条 system 消息上插入这个标记:
|
|
173
|
+
|
|
174
|
+
```ts
|
|
175
|
+
const lcr = createLCR({
|
|
176
|
+
models: { /* … */ },
|
|
177
|
+
promptCache: true, // 5 分钟窗口;想要更长用 { ttl: "1h" }
|
|
178
|
+
});
|
|
179
|
+
```
|
|
180
|
+
|
|
181
|
+
它只写 `anthropic` 这个 provider-options 命名空间,其他 provider 一律忽略——所以在混合链路上也安全。而且只要你在 prompt 里自己设了 `cacheControl`,它就**完全让位**。省下的钱照旧体现在 `CallRecord` 的 `cachedInputTokens` 和 `cachedSavingUsd` 上。
|
|
182
|
+
|
|
147
183
|
## 看清每次调用发生了什么(`onCall`)
|
|
148
184
|
|
|
149
185
|
`onError`/`onCost` 各自独立触发、互不关联,事后很难还原一次 failover 的全貌。`onCall` 给你**每个请求一条记录**——完整的尝试链、最终服务者、每跳失败的原因、延迟和成本;`formatCallRecord` 把它变成一行可扫读的日志:
|