ai-lcr 0.6.2 → 0.6.3
- package/CHANGELOG.md +35 -0
- package/README.md +36 -0
- package/README.zh-CN.md +36 -0
- package/dist/index.cjs +211 -3
- package/dist/index.d.cts +161 -2
- package/dist/index.d.ts +161 -2
- package/dist/index.js +210 -3
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -4,6 +4,41 @@ All notable changes to `ai-lcr` are documented here. The format follows
 [Keep a Changelog](https://keepachangelog.com/), and the project adheres to
 [Semantic Versioning](https://semver.org/).
 
+## [0.6.3] — 2026-06-11
+
+Caching — both kinds, each off by default and each a pure config flag with no
+service to run. The response cache is the layer Vercel AI Gateway notably
+doesn't offer; ai-lcr does it in-process and folds it into its cost truth.
+
+### Added
+
+- **`createLCR({ cache })`** — exact-match **response cache**. An identical
+  request replays the stored response and calls **no provider at all**: zero
+  latency, `costUsd: 0`. Storage is pluggable with **zero added dependencies**:
+  `cache: true` uses a bundled in-memory store, `cache: myStore` brings your own
+  (Redis / Vercel KV — required for cross-request hits on serverless, where
+  memory isn't shared), `cache: { store?, ttlMs? }` sets a TTL. A hit settles a
+  `CallRecord` with `cacheHit: true` and the avoided cost on its own
+  `cacheHitSavingUsd` line (a caching saving, never folded into routing savings).
+  Empty completions and usage-less results are never cached. New exports:
+  `createMemoryCacheStore`, types `CacheStore` / `CacheOptions` / `CachedCall` /
+  `CachedMeta` / `MemoryCacheOptions`.
+- **`createLCR({ promptCache })`** — automatic provider-side **prompt-cache**
+  breakpoint. Inserts an Anthropic `cache_control` marker on the last system
+  message so the static prompt head bills at the cache-read rate (~0.1× input)
+  on repeats; the model still runs. `true` for the 5-minute window,
+  `{ ttl: "1h" }` for the longer one. Only writes the `anthropic` namespace
+  (ignored by other providers, safe on a mixed chain) and steps aside if you set
+  `cacheControl` yourself. Savings surface via the existing `cachedInputTokens` /
+  `cachedSavingUsd`. New exported type `PromptCacheOptions`.
+- `CallRecord` gains **`cacheHit`** and **`cacheHitSavingUsd`** for response-cache
+  hits.
+
+### Compatibility
+
+- Fully backward compatible. Both `cache` and `promptCache` are **off by
+  default** — unset, routing behaves exactly as before.
+
 ## [0.6.2] — 2026-06-11
 
 Circuit breaker for persistently-failing providers. Until now the only recovery
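The cost bookkeeping this changelog entry describes (a hit settles with `costUsd: 0` and reports the avoided cost on `cacheHitSavingUsd`, separate from routing savings) can be sketched as a small aggregation. This is a minimal sketch, not ai-lcr's code: `RecordLike` is a hypothetical local type that copies only the fields named in the entry, `baselineUsd` is assumed to be optional and absent on hits, and `summarize` is an illustrative helper.

```typescript
// Hypothetical, trimmed-down record: only the fields this sketch needs.
type RecordLike = {
  costUsd: number;
  baselineUsd?: number; // assumption: what the reference route would have cost
  cacheHit?: boolean;
  cacheHitSavingUsd?: number;
};

type Summary = { spentUsd: number; routingSavingUsd: number; cacheSavingUsd: number };

function summarize(records: RecordLike[]): Summary {
  const out: Summary = { spentUsd: 0, routingSavingUsd: 0, cacheSavingUsd: 0 };
  for (const r of records) {
    out.spentUsd += r.costUsd;
    if (r.cacheHit) {
      // A hit cost nothing; its saving is what the original call cost.
      out.cacheSavingUsd += r.cacheHitSavingUsd ?? 0;
    } else {
      // Routing saving stays on its own line, never mixed with cache savings.
      out.routingSavingUsd += (r.baselineUsd ?? r.costUsd) - r.costUsd;
    }
  }
  return out;
}

const summary = summarize([
  { costUsd: 0.25, baselineUsd: 0.75 },
  { costUsd: 0, cacheHit: true, cacheHitSavingUsd: 0.25 },
]);
```

Keeping the two totals apart means a dashboard built on these records can report how much came from cheaper routing and how much from skipped calls without double counting.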
package/README.md
CHANGED
@@ -182,6 +182,42 @@ const lcr = createLCR({
 
 With `cooldown` on, a provider that fails enough times in a window is *skipped* for a cooldown period rather than tried every request — and a single success clears it. Defaults are 3 failures / 60s → 60s cooldown; tune with `cooldown: { maxFailures, windowMs, cooldownMs }`. It only ever **reorders** the attempt list (cooling providers go last), so if *every* provider is cooling a request still tries them all rather than failing outright. Off by default — routing is unchanged unless you opt in.
 
+## Cache
+
+There are two completely different "caches" in LLM land, and ai-lcr does both — each off by default, each a pure config flag with no service to run.
+
+### Skip the call entirely (`cache`) — response cache
+
+When a request is byte-for-byte identical to one already answered, replay the stored response and call **no provider at all**: zero latency, `costUsd: 0`. This is the layer Vercel AI Gateway notably *doesn't* offer.
+
+```ts
+const lcr = createLCR({
+  models: { /* … */ },
+  cache: true, // exact-match response cache (in-memory by default)
+});
+```
+
+Storage is pluggable and ai-lcr ships **zero dependencies** for it:
+
+- **`cache: true`** uses a process-local in-memory store. Real on a long-running server, and useful *within* one serverless invocation (an agent loop repeating a sub-call) — but it does **not** survive across serverless requests, because separate function instances don't share memory.
+- **For cross-request hits on serverless**, bring your own store backed by a shared layer (Upstash Redis, Vercel KV): `cache: myStore`. ai-lcr runs no service of its own — any shared store is yours. A custom store is just `{ get, set }` (see `CacheStore`); `createMemoryCacheStore({ maxEntries })` is exported if you want the bundled one with a cap.
+- **`cache: { store?, ttlMs? }`** sets an entry lifetime.
+
+A hit settles a `CallRecord` with `cacheHit: true`, `costUsd: 0`, and the money it avoided on its own `cacheHitSavingUsd` line — a *caching* saving, kept separate from routing savings (`baselineUsd − costUsd`), never folded in. Empty completions and usage-less results are never cached. One caveat worth stating: caching makes identical requests return identical responses — exactly right for idempotent / `temperature: 0` calls, a behavior change for sampled ones.
+
+### Pay less for the call (`promptCache`) — provider prompt cache
+
+Different mechanism: the model still runs, but the **static head of the prompt** (your system prompt) bills at the provider's cache-read rate (~0.1× input) on repeats. Anthropic needs an explicit `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix automatically. `promptCache: true` inserts that marker on the last system message for you:
+
+```ts
+const lcr = createLCR({
+  models: { /* … */ },
+  promptCache: true, // 5-minute window; { ttl: "1h" } for the longer one
+});
+```
+
+It only writes the `anthropic` provider-options namespace, which every other provider ignores — so it's safe on a mixed chain. And it **steps aside entirely** if you set `cacheControl` yourself anywhere in the prompt. The savings then show up exactly as before: `cachedInputTokens` and `cachedSavingUsd` on the `CallRecord` (see [Cache-aware cost](#see-what-happened-oncall) below).
+
 ## See what happened (`onCall`)
 
 `onError`/`onCost` fire separately and uncorrelated, so a failover is hard to read after the fact. `onCall` gives you **one record per request** — the full chain, the winner, the reason for each failed hop, latency, and cost — and `formatCallRecord` turns it into a one-liner you can scan:
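The `{ get, set }` store contract described in this README hunk can be sketched with a store that serializes entries to strings, as a shared key-value backend would require. A minimal sketch under stated assumptions: `KV` is a hypothetical stand-in for a Redis or Vercel KV client (here an in-process `Map`), `CachedCallLike` and `CacheStoreLike` are simplified local types rather than the package's `CachedCall` and `CacheStore`, and the `now` parameter exists only to make expiry testable.

```typescript
// Simplified stand-ins for the package's types (assumptions, not the real ones).
type CachedCallLike = { kind: "generate" | "stream"; meta: { winner: string; costUsd: number } };

interface CacheStoreLike {
  get(key: string): CachedCallLike | undefined;
  set(key: string, value: CachedCallLike, ttlMs?: number): void;
}

// Hypothetical string-valued backend; a real one would be a network client.
type KV = Map<string, string>;

function createSerializingStore(kv: KV, now: () => number = Date.now): CacheStoreLike {
  return {
    get(key) {
      const raw = kv.get(key);
      if (raw === undefined) return undefined;
      const entry = JSON.parse(raw) as { value: CachedCallLike; expiresAt?: number };
      // Expired entries are dropped lazily, on read.
      if (entry.expiresAt !== undefined && entry.expiresAt <= now()) {
        kv.delete(key);
        return undefined;
      }
      return entry.value;
    },
    set(key, value, ttlMs) {
      const entry = ttlMs !== undefined ? { value, expiresAt: now() + ttlMs } : { value };
      kv.set(key, JSON.stringify(entry));
    },
  };
}

let clock = 1000;
const store = createSerializingStore(new Map(), () => clock);
store.set("k", { kind: "generate", meta: { winner: "provider-a", costUsd: 0.25 } }, 500);
const fresh = store.get("k"); // still inside the 500 ms lifetime
clock = 1500;
const expired = store.get("k"); // lifetime elapsed, entry is gone
```

A store backed by a real shared service would return promises from `get` and `set`, and would normally delegate expiry to the backend's own TTL support instead of checking timestamps on read.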
package/README.zh-CN.md
CHANGED
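@@ -144,6 +144,42 @@ DeepInfra hosts only open weights: no first-party Claude / GPT / Gemini.
 2. **Fall through on failure.** Any provider failure (rate limits, 5xx, timeouts, **exhausted quota** (402 / overdue billing / insufficient balance), and client errors such as **400**) advances to the next provider, and this is streaming-safe. Failing over on 400 is deliberate: in an OpenAI-compatible aggregation layer, a 400 usually means "*this* provider won't take this request" (an unsupported parameter, a model it doesn't list, a stricter schema) rather than a broken request, so another provider can very likely serve it. If every provider rejects, the request still fails and throws the **first** (original) error, which keeps genuine caller bugs debuggable. The only thing that never fails over is a caller-initiated cancellation (`AbortSignal`). To restore the old "fail immediately on client errors" behavior, pass `shouldRetry: isRetryableError` to `createLCR`.
 3. **Recover.** After an idle window (`resetIntervalMs`, default 60s), routing automatically returns to the cheapest provider.
 
+## Cache
+
+There are two completely different kinds of "cache" in the LLM world, and ai-lcr does both. Each is off by default, and each is just a config switch: **no service for you to run**.
+
+### Skip the whole call (`cache`): response cache
+
+When a request is exactly the same as one already answered, the stored response is replayed and **no provider is called at all**: zero latency, `costUsd: 0`. This is precisely the layer that Vercel AI Gateway notably doesn't provide.
+
+```ts
+const lcr = createLCR({
+  models: { /* … */ },
+  cache: true, // exact-match response cache (process-local memory by default)
+});
+```
+
+Storage is pluggable, and ai-lcr adds **zero dependencies** for it:
+
+- **`cache: true`** uses process-local memory. On a long-running server it is a real cache, and it is also useful within a single serverless invocation (for example, repeated sub-calls in an agent loop), but it **does not survive across serverless requests**, because different function instances don't share memory.
+- **For cross-request hits on serverless**, bring your own store backed by a shared layer (Upstash Redis, Vercel KV): `cache: myStore`. ai-lcr runs no service of its own; the shared layer is yours. A custom store is just `{ get, set }` (see `CacheStore`); for the bundled implementation with a cap, use the exported `createMemoryCacheStore({ maxEntries })`.
+- **`cache: { store?, ttlMs? }`** sets an expiry time.
+
+A hit settles a `CallRecord` with `cacheHit: true` and `costUsd: 0`, and records the money saved on its own `cacheHitSavingUsd` line. This is a **caching** saving, kept separate from routing savings (`baselineUsd − costUsd`) and never mixed in. Empty completions and results without usage are never cached. One key point: caching makes identical requests return identical responses, which is exactly right for idempotent / `temperature: 0` calls and a behavior change for sampled calls.
+
+### Make the call cheaper (`promptCache`): provider prompt cache
+
+A different mechanism: the model still runs, but the **static head of the prompt** (your system prompt) is billed at the provider's cache-read price (about 0.1× input) on repeats. Anthropic needs an explicit `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix automatically. `promptCache: true` inserts this marker on the last system message for you:
+
+```ts
+const lcr = createLCR({
+  models: { /* … */ },
+  promptCache: true, // 5-minute window; use { ttl: "1h" } for a longer one
+});
+```
+
+It writes only the `anthropic` provider-options namespace, which every other provider ignores, so it is safe on a mixed chain as well. And as soon as you set `cacheControl` yourself in the prompt, it **steps aside entirely**. The savings still show up in `cachedInputTokens` and `cachedSavingUsd` on the `CallRecord`.
+
 ## See exactly what happened on each call (`onCall`)
 
 `onError`/`onCost` each fire independently and uncorrelated, so it is hard to reconstruct the full picture of a failover afterwards. `onCall` gives you **one record per request**: the full attempt chain, the final serving provider, the reason each hop failed, latency, and cost; `formatCallRecord` turns it into a one-line log you can scan: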
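What `promptCache: true` does to a prompt can be illustrated with a standalone function modeled on the behavior the README describes: mark the last system message, and leave the prompt alone if any message already carries an `anthropic.cacheControl`. This is a sketch only: `Msg`, `hasCacheControl`, and `addBreakpoint` are hypothetical names, and the message shape is simplified.

```typescript
type Msg = {
  role: "system" | "user" | "assistant";
  content: string;
  providerOptions?: Record<string, Record<string, unknown>>;
};

function hasCacheControl(m: Msg): boolean {
  const anthropic = m.providerOptions?.anthropic;
  return anthropic !== undefined && "cacheControl" in anthropic;
}

function addBreakpoint(prompt: Msg[], ttl: "5m" | "1h" = "5m"): Msg[] {
  // Caller-managed caching always wins: return the prompt untouched.
  if (prompt.some(hasCacheControl)) return prompt;
  // Find the LAST system message, the stable head of the prompt.
  let target = -1;
  for (let i = 0; i < prompt.length; i++) {
    if (prompt[i].role === "system") target = i;
  }
  if (target === -1) return prompt;
  const cacheControl = ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };
  return prompt.map((m, i) =>
    i !== target
      ? m
      : {
          ...m,
          providerOptions: {
            ...m.providerOptions,
            anthropic: { ...m.providerOptions?.anthropic, cacheControl },
          },
        },
  );
}

const marked = addBreakpoint([
  { role: "system", content: "You are a careful assistant." },
  { role: "user", content: "Hello" },
]);
```

Only the system message gains an `anthropic` entry; the user message is passed through as the same object, and other provider namespaces on the marked message are preserved by the spread.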
package/dist/index.cjs
CHANGED
@@ -34,6 +34,7 @@ __export(index_exports, {
   createKunavoMediaAdapter: () => createKunavoMediaAdapter,
   createLCR: () => createLCR,
   createMediaLCR: () => createMediaLCR,
+  createMemoryCacheStore: () => createMemoryCacheStore,
   createRunwareMediaAdapter: () => createRunwareMediaAdapter,
   durationFromInput: () => durationFromInput,
   formatCallRecord: () => formatCallRecord,
@@ -49,6 +50,105 @@ __export(index_exports, {
 });
 module.exports = __toCommonJS(index_exports);
 
+// src/cache.ts
+function isCacheStore(x) {
+  return typeof x === "object" && x !== null && typeof x.get === "function" && typeof x.set === "function";
+}
+function resolveCache(opt) {
+  if (!opt) return void 0;
+  if (opt === true) return { store: createMemoryCacheStore() };
+  if (isCacheStore(opt)) return { store: opt };
+  return {
+    store: opt.store ?? createMemoryCacheStore(),
+    ...opt.ttlMs !== void 0 ? { ttlMs: opt.ttlMs } : {}
+  };
+}
+function cacheKeyOf(modelName, options) {
+  const rest = options.providerOptions ? Object.entries(options.providerOptions).filter(([ns]) => ns !== "lcr") : [];
+  const po = rest.length > 0 ? Object.fromEntries(rest) : void 0;
+  return JSON.stringify({
+    m: modelName,
+    prompt: options.prompt,
+    maxOutputTokens: options.maxOutputTokens,
+    temperature: options.temperature,
+    topP: options.topP,
+    topK: options.topK,
+    frequencyPenalty: options.frequencyPenalty,
+    presencePenalty: options.presencePenalty,
+    stopSequences: options.stopSequences,
+    seed: options.seed,
+    responseFormat: options.responseFormat,
+    tools: options.tools,
+    toolChoice: options.toolChoice,
+    po
+  });
+}
+function streamFromParts(parts) {
+  return new ReadableStream({
+    start(controller) {
+      for (const part of parts) controller.enqueue(part);
+      controller.close();
+    }
+  });
+}
+function createMemoryCacheStore(opts = {}) {
+  const maxEntries = opts.maxEntries ?? 1e3;
+  const map = /* @__PURE__ */ new Map();
+  return {
+    get(key) {
+      const entry = map.get(key);
+      if (!entry) return void 0;
+      if (entry.expiresAt !== void 0 && entry.expiresAt <= Date.now()) {
+        map.delete(key);
+        return void 0;
+      }
+      return entry.value;
+    },
+    set(key, value, ttlMs) {
+      const entry = ttlMs !== void 0 ? { value, expiresAt: Date.now() + ttlMs } : { value };
+      map.delete(key);
+      map.set(key, entry);
+      if (map.size > maxEntries) {
+        const oldest = map.keys().next().value;
+        if (oldest !== void 0) map.delete(oldest);
+      }
+    }
+  };
+}
+
+// src/prompt-cache.ts
+function resolvePromptCache(opt) {
+  if (!opt) return void 0;
+  if (opt === true) return { ttl: "5m" };
+  return { ttl: opt.ttl ?? "5m" };
+}
+function hasAnthropicCacheControl(message) {
+  const anthropic = message.providerOptions?.anthropic;
+  return !!anthropic && "cacheControl" in anthropic;
+}
+function withPromptCacheBreakpoint(options, cfg) {
+  const prompt = options.prompt;
+  if (!Array.isArray(prompt) || prompt.length === 0) return options;
+  if (prompt.some(hasAnthropicCacheControl)) return options;
+  let target = -1;
+  for (let i = 0; i < prompt.length; i++) {
+    if (prompt[i].role === "system") target = i;
+  }
+  if (target === -1) return options;
+  const cacheControl = cfg.ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };
+  const newPrompt = prompt.map((message, i) => {
+    if (i !== target) return message;
+    return {
+      ...message,
+      providerOptions: {
+        ...message.providerOptions,
+        anthropic: { ...message.providerOptions?.anthropic, cacheControl }
+      }
+    };
+  });
+  return { ...options, prompt: newPrompt };
+}
+
 // src/fallback.ts
 var EmptyCompletionError = class extends Error {
   constructor(provider) {
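One consequence of `cacheKeyOf` in the hunk above is that the `lcr` provider-options namespace is left out of the key, so per-call metadata such as a `requestId` does not turn an otherwise identical request into a miss. The sketch below reproduces only that part with a reduced option set; `keyOf` and `Opts` are hypothetical names, and the real key covers more fields.

```typescript
type Opts = {
  prompt: unknown;
  temperature?: number;
  providerOptions?: Record<string, Record<string, unknown>>;
};

function keyOf(modelName: string, options: Opts): string {
  const src = options.providerOptions ?? {};
  const po: Record<string, Record<string, unknown>> = {};
  let kept = 0;
  for (const ns of Object.keys(src)) {
    if (ns === "lcr") continue; // routing metadata, not part of the request
    po[ns] = src[ns];
    kept++;
  }
  // Fixed property order keeps the serialized key stable across calls.
  return JSON.stringify({
    m: modelName,
    prompt: options.prompt,
    temperature: options.temperature,
    po: kept > 0 ? po : undefined,
  });
}

const prompt = [{ role: "user", content: "hi" }];
const a = keyOf("fast", { prompt, temperature: 0, providerOptions: { lcr: { requestId: "req-1" } } });
const b = keyOf("fast", { prompt, temperature: 0, providerOptions: { lcr: { requestId: "req-2" } } });
const c = keyOf("fast", { prompt, temperature: 0.7 });
```

Here `a` and `b` are the same key even though their request ids differ, while `c` differs because a sampling parameter changed. Since the key is a `JSON.stringify` of the request, matching is exact: any difference in message content, parameters, or other provider namespaces yields a different key.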
@@ -447,6 +547,16 @@ var LcrFallbackModel = class {
     const usageMissing = inputTokens === 0 && outputTokens === 0;
     const emptyCompletion = inputTokens > 0 && outputTokens === 0;
     const baselineUsd = this.baselineUsd(inputTokens, outputTokens, cacheReadTokens);
+    ctx.settled = {
+      meta: {
+        winner: provider.label,
+        costUsd,
+        inputTokens,
+        outputTokens,
+        ...cacheReadTokens > 0 ? { cachedInputTokens: cacheReadTokens } : {}
+      },
+      cacheable: !emptyCompletion && !usageMissing
+    };
     this.emitCost({
       model: this.opts.modelName,
       provider: provider.label,
@@ -491,6 +601,16 @@ var LcrFallbackModel = class {
     });
   }
   async doGenerate(options) {
+    const cache = this.opts.cache;
+    const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
+    if (cache && cacheKey !== void 0) {
+      const hit = await cache.store.get(cacheKey);
+      if (hit && hit.kind === "generate") {
+        this.finalizeCacheHit(this.startCall(options), hit.meta);
+        return hit.result;
+      }
+    }
+    const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
     const ctx = this.startCall(options);
     const providers = this.opts.providers;
     const order = this.routeOrder(this.startIndex());
@@ -501,7 +621,7 @@ var LcrFallbackModel = class {
       const isLast = pos === order.length - 1;
       const attemptStart = Date.now();
       try {
-        const result = await provider.model.doGenerate(
+        const result = await provider.model.doGenerate(callOptions);
         const out = result.usage?.outputTokens?.total ?? 0;
         const inp = result.usage?.inputTokens?.total ?? 0;
         if (inp > 0 && out === 0 && !isLast) {
@@ -514,6 +634,9 @@ var LcrFallbackModel = class {
         this.recordProviderSuccess(idx);
         this.settleSticky(idx);
         this.finalizeOk(ctx, provider, attemptStart, result.usage);
+        if (cache && cacheKey !== void 0 && ctx.settled?.cacheable) {
+          this.storeCache(cacheKey, { kind: "generate", result, meta: ctx.settled.meta });
+        }
         return result;
       } catch (error) {
         lastError = error;
@@ -530,7 +653,76 @@ var LcrFallbackModel = class {
     throw ctx.firstError ?? lastError;
   }
   async doStream(options) {
-
+    const cache = this.opts.cache;
+    const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
+    if (cache && cacheKey !== void 0) {
+      const hit = await cache.store.get(cacheKey);
+      if (hit && hit.kind === "stream") {
+        this.finalizeCacheHit(this.startCall(options), hit.meta);
+        return { stream: streamFromParts(hit.parts) };
+      }
+    }
+    const ctx = this.startCall(options);
+    const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
+    const inner = await this.doStreamWithCtx(
+      callOptions,
+      ctx,
+      this.routeOrder(this.startIndex()),
+      0
+    );
+    if (!cache || cacheKey === void 0) return inner;
+    const collected = [];
+    const self = this;
+    const wrapped = inner.stream.pipeThrough(
+      new TransformStream({
+        transform(part, controller) {
+          collected.push(part);
+          controller.enqueue(part);
+        },
+        flush() {
+          if (ctx.settled?.cacheable) {
+            self.storeCache(cacheKey, { kind: "stream", parts: collected, meta: ctx.settled.meta });
+          }
+        }
+      })
+    );
+    return { ...inner, stream: wrapped };
+  }
+  /** A response-cache hit: replay a stored answer with no provider call. Settles
+   * one {@link CallRecord} with `cacheHit`, `costUsd: 0`, and the avoided cost
+   * on its own `cacheHitSavingUsd` line. */
+  finalizeCacheHit(ctx, meta) {
+    this.emitCall({
+      id: ctx.id,
+      model: this.opts.modelName,
+      attempts: [{ provider: meta.winner, ok: true, latencyMs: Date.now() - ctx.startedAt }],
+      winner: meta.winner,
+      ok: true,
+      failedOver: false,
+      latencyMs: Date.now() - ctx.startedAt,
+      inputTokens: meta.inputTokens,
+      outputTokens: meta.outputTokens,
+      ...meta.cachedInputTokens ? { cachedInputTokens: meta.cachedInputTokens } : {},
+      costUsd: 0,
+      cacheHit: true,
+      ...meta.costUsd > 0 ? { cacheHitSavingUsd: meta.costUsd } : {},
+      ...ctx.requestId ? { requestId: ctx.requestId } : {}
+    });
+  }
+  /** Best-effort write to the response cache: a sync throw or a rejected async
+   * `set` must never break the request. Caching is an optimization, not a
+   * guarantee. */
+  storeCache(key, value) {
+    const cache = this.opts.cache;
+    if (!cache) return;
+    try {
+      const r = cache.store.set(key, value, cache.ttlMs);
+      if (r && typeof r.catch === "function") {
+        r.catch(() => {
+        });
+      }
+    } catch {
+    }
   }
   // The stream's failover recursion re-enters here with the SAME `ctx` and the
   // SAME `order` snapshot, advancing only the local position `pos`, so a
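The best-effort write in `storeCache` above guards against two failure modes: a `set` that throws synchronously and a `set` that returns a promise which later rejects. A minimal sketch of the same pattern follows. The names `safeSet` and `SetFn` are hypothetical, the boolean return value is added only so the behavior can be observed, and an `instanceof Promise` check stands in for the duck-typed `.catch` test used in the bundle.

```typescript
type SetFn = (key: string, value: string) => void | Promise<void>;

// Returns true when `set` was invoked without a synchronous throw.
function safeSet(set: SetFn, key: string, value: string): boolean {
  try {
    const r: unknown = set(key, value);
    if (r instanceof Promise) {
      // Swallow async rejections so they never surface as unhandled.
      r.catch(() => {});
    }
    return true;
  } catch {
    // A synchronous failure is ignored too: caching is an optimization.
    return false;
  }
}

const saved: string[] = [];
const syncThrow = safeSet(() => { throw new Error("store down"); }, "k", "v");
const asyncReject = safeSet(() => Promise.reject(new Error("store down")), "k", "v");
const ok = safeSet((k, v) => { saved.push(`${k}=${v}`); }, "k", "v");
```

In all three cases the caller continues normally; only the healthy store records the entry. The trade-off is that a misconfigured store fails silently, so a production wrapper might log inside the two catch paths.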
@@ -2012,12 +2204,16 @@ function createLCR(config) {
     autoPrice = false,
     resetIntervalMs,
     cooldown,
+    cache,
+    promptCache,
     onError,
     onCost,
     onCall,
     shouldRetry,
     defaultCacheReadRatio
   } = config;
+  const resolvedCache = resolveCache(cache);
+  const resolvedPromptCache = resolvePromptCache(promptCache);
   if (defaultCacheReadRatio !== void 0 && (defaultCacheReadRatio < 0 || defaultCacheReadRatio > 1)) {
     throw new Error(
       `ai-lcr: defaultCacheReadRatio must be in [0, 1], got ${defaultCacheReadRatio}`
@@ -2037,7 +2233,18 @@ function createLCR(config) {
     }
     routed.set(
       name,
-      new LcrFallbackModel({
+      new LcrFallbackModel({
+        modelName: name,
+        providers,
+        resetIntervalMs,
+        cooldown,
+        ...resolvedCache ? { cache: resolvedCache } : {},
+        ...resolvedPromptCache ? { promptCache: resolvedPromptCache } : {},
+        onError,
+        onCost,
+        onCall,
+        shouldRetry
+      })
     );
   }
   return (modelName) => {
@@ -2066,6 +2273,7 @@ function createLCR(config) {
   createKunavoMediaAdapter,
   createLCR,
   createMediaLCR,
+  createMemoryCacheStore,
   createRunwareMediaAdapter,
   durationFromInput,
   formatCallRecord,
package/dist/index.d.cts
CHANGED
@@ -1,4 +1,120 @@
-import { LanguageModelV3 } from '@ai-sdk/provider';
+import { LanguageModelV3GenerateResult, LanguageModelV3StreamPart, LanguageModelV3 } from '@ai-sdk/provider';
+
+/**
+ * Exact-match response cache (Feature ②).
+ *
+ * Unlike provider-side prompt caching (see ./prompt-cache), this skips the
+ * model call *entirely*: when a request is byte-for-byte identical to one
+ * already answered, the stored response is replayed and no provider is touched
+ * — zero latency, zero cost. This is the layer Vercel AI Gateway notably does
+ * NOT offer, and it composes naturally with ai-lcr's cost truth: a hit settles
+ * a {@link CallRecord} with `costUsd: 0` and `cacheHit: true`, and reports the
+ * money it avoided as `cacheHitSavingUsd` — a saving kept on its own line, the
+ * same discipline as prompt-cache savings, never folded into routing savings.
+ *
+ * Storage is pluggable and the package ships ZERO dependencies for it. The
+ * default `createMemoryCacheStore()` is a process-local Map: real on a
+ * long-running server, and useful within a single serverless invocation (an
+ * agent loop that repeats a sub-call), but it does NOT survive across
+ * serverless requests — different function instances don't share memory. For
+ * cross-request hits on serverless, inject your own store backed by a shared
+ * layer (Upstash Redis, Vercel KV). ai-lcr never runs that service; you bring
+ * it. A custom store is responsible for serializing the stored value.
+ *
+ * Determinism caveat: caching makes identical requests return identical
+ * responses. That is exactly right for idempotent / `temperature: 0` calls and
+ * changes behavior for sampled ones (the variety is gone). Enable it where a
+ * repeated answer is acceptable.
+ */
+
+/** A stored, replayable LLM response plus the cost it originally incurred. */
+type CachedCall = {
+    kind: "generate";
+    result: LanguageModelV3GenerateResult;
+    meta: CachedMeta;
+} | {
+    kind: "stream";
+    parts: LanguageModelV3StreamPart[];
+    meta: CachedMeta;
+};
+/** Settle-time facts carried in the cache entry so a hit can report honest
+ *  tokens, the originally-serving provider, and the money the hit avoided. */
+interface CachedMeta {
+    /** The provider that served the original (cached) call. */
+    winner: string;
+    /** What the original call actually cost — i.e. the money a hit avoids. */
+    costUsd: number;
+    inputTokens: number;
+    outputTokens: number;
+    /** Prompt-cache reads on the original call, when reported (> 0 only). */
+    cachedInputTokens?: number;
+}
+/**
+ * Pluggable response-cache backend. Implement it over Redis / Vercel KV / any
+ * shared store to get cross-request hits on serverless; the bundled
+ * {@link createMemoryCacheStore} is the dependency-free default. `get`/`set`
+ * may be sync or async. A `set` that throws must never break the request — the
+ * engine treats the cache as best-effort.
+ */
+interface CacheStore {
+    get(key: string): CachedCall | undefined | Promise<CachedCall | undefined>;
+    set(key: string, value: CachedCall, ttlMs?: number): void | Promise<void>;
+}
+/** Public response-cache config. See {@link LCRConfig.cache}. */
+interface CacheOptions {
+    /** Where to store responses. Defaults to a process-local in-memory store. */
+    store?: CacheStore;
+    /** Entry lifetime in ms. Omit for no expiry (entries live until evicted). */
+    ttlMs?: number;
+}
+/** Tuning for {@link createMemoryCacheStore}. */
+interface MemoryCacheOptions {
+    /**
+     * Cap on stored entries. When exceeded, the oldest-inserted entry is dropped
+     * (insertion-order FIFO — Map preserves it). Keeps an unbounded key space
+     * (every distinct prompt) from leaking memory in a long-running process.
+     * Default 1000.
+     */
+    maxEntries?: number;
+}
+/**
+ * A process-local in-memory {@link CacheStore} with optional TTL and a
+ * bounded entry count. Zero dependencies. See the module header for where this
+ * is (and isn't) useful — notably it does NOT share across serverless requests.
+ */
+declare function createMemoryCacheStore(opts?: MemoryCacheOptions): CacheStore;
+
+/**
+ * Automatic prompt-cache breakpoints (Feature ①).
+ *
+ * Provider-side prompt caching (Anthropic, MiniMax) caches a *prefix* of the
+ * prompt so repeated calls bill the static head — the system prompt — at the
+ * cache-read rate (~0.1× input) instead of full price. The model still runs;
+ * only the input cost of the cached prefix drops. Anthropic needs an explicit
+ * `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix
+ * automatically with no marker at all.
+ *
+ * This module adds, when `promptCache` is enabled, a single `cacheControl`
+ * breakpoint on the LAST system message — the canonical large, stable head of
+ * a prompt. It only writes the `anthropic` provider-options namespace, which
+ * every non-Anthropic provider ignores, so it is safe to apply on every leg of
+ * a mixed chain: Anthropic reads it, the rest pass it through untouched. No
+ * external service and no storage — the cache itself lives at the provider.
+ *
+ * It steps aside the moment the caller is managing caching themselves: if ANY
+ * message already carries an `anthropic.cacheControl`, the prompt is returned
+ * unchanged. Same "explicit always wins" discipline as the price table.
+ */
+
+/** Tuning for automatic prompt-cache breakpoints. See {@link LCRConfig.promptCache}. */
+interface PromptCacheOptions {
+    /**
+     * Cache lifetime for the injected breakpoint. `"5m"` (the Anthropic default,
+     * a cheaper cache write) or `"1h"` (a pricier write that pays off when the
+     * same prefix is reused over a longer span). Default `"5m"`.
+     */
+    ttl?: "5m" | "1h";
+}
 
 /**
  * Owned failover engine for ai-lcr.
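The eviction rule documented for `MemoryCacheOptions.maxEntries` (insertion-order FIFO over a `Map`) can be sketched in a few lines. This is an illustration with a hypothetical name (`createFifoStore`); it stores plain strings instead of `CachedCall` values and omits TTL handling.

```typescript
function createFifoStore(maxEntries: number) {
  const map = new Map<string, string>();
  return {
    get(key: string): string | undefined {
      return map.get(key); // reads do not refresh an entry's position
    },
    set(key: string, value: string): void {
      map.delete(key); // re-inserting moves the key to the newest position
      map.set(key, value);
      if (map.size > maxEntries) {
        // Map iterates in insertion order, so the first key is the oldest.
        const oldest = map.keys().next().value;
        if (oldest !== undefined) map.delete(oldest);
      }
    },
    size: (): number => map.size,
  };
}

const fifo = createFifoStore(2);
fifo.set("a", "1");
fifo.set("b", "2");
fifo.get("a"); // reading "a" does not protect it from eviction
fifo.set("c", "3"); // over the cap: "a", the oldest insertion, is dropped
```

Because reads never reorder entries, this is FIFO rather than LRU: a frequently read entry is still dropped once enough newer keys arrive, while rewriting a key does move it to the back of the queue.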
@@ -176,6 +292,21 @@ interface CallRecord {
      * be surfaced separately, never folded into `baselineUsd - costUsd`.
      */
     cachedSavingUsd?: number;
+    /**
+     * True when this request was served from ai-lcr's exact-match RESPONSE cache
+     * — no provider was called at all. Distinct from `cachedInputTokens` /
+     * `cachedSavingUsd`, which are the *provider's* prompt-cache (the model still
+     * ran). On a hit `costUsd` is 0, `winner` is the provider that served the
+     * ORIGINAL (now-cached) call, and `attempts` has a single synthetic entry.
+     */
+    cacheHit?: boolean;
+    /**
+     * On a `cacheHit`, the money the hit avoided — i.e. what the original call
+     * actually cost when it ran. Present only when > 0. Like `cachedSavingUsd`
+     * this is a caching saving, NOT a routing saving, so it lives on its own line
+     * and is never folded into `baselineUsd - costUsd`.
+     */
+    cacheHitSavingUsd?: number;
     /**
      * Caller-supplied correlation id, read from `providerOptions.lcr.requestId`
      * on the call. Multi-step tool loops emit one record per `doStream`/
@@ -926,6 +1057,34 @@ interface LCRConfig {
      * unchanged routing, no provider is ever skipped). See {@link CooldownOptions}.
      */
     cooldown?: boolean | CooldownOptions;
+    /**
+     * Exact-match RESPONSE cache: when a request is identical to one already
+     * answered, replay the stored response and call no provider at all — zero
+     * latency, `costUsd: 0`, and the avoided cost reported as `cacheHitSavingUsd`
+     * on the {@link CallRecord} (with `cacheHit: true`). Off by default.
+     *
+     * `true` uses a process-local in-memory store; pass a {@link CacheStore} to
+     * bring your own (Redis / Vercel KV — required for cross-request hits on
+     * serverless, where memory isn't shared); pass `{ store?, ttlMs? }` to set a
+     * TTL. ai-lcr runs no service of its own — any shared store is yours.
+     *
+     * Caching makes identical requests return identical responses: ideal for
+     * idempotent / `temperature: 0` calls, a behavior change for sampled ones.
+     * Empty completions and usage-less results are never cached.
+     */
+    cache?: boolean | CacheStore | CacheOptions;
+    /**
+     * Automatic provider-side PROMPT caching: insert a `cache_control` breakpoint
+     * on the last system message so the static prompt head bills at the
+     * cache-read rate (~0.1× input) on repeats. The model still runs — this only
+     * lowers input cost, it does not skip the call (that's `cache`). Only
+     * Anthropic / MiniMax need the marker; OpenAI / Gemini / DeepSeek cache the
+     * prefix automatically and ignore it, so this is safe on a mixed chain.
+     *
+     * `true` for the 5-minute default, `{ ttl: "1h" }` for the longer window.
+     * Off by default; steps aside entirely if you set `cacheControl` yourself.
+     */
+    promptCache?: boolean | PromptCacheOptions;
     /** Called when a provider errors and routing falls through to the next. */
     onError?: (error: Error, provider: string) => void;
     /** Called after each successful call with the serving provider, tokens, and cost. */
|
@@ -969,4 +1128,4 @@ type LCRRouter = (modelName: string) => LanguageModelV3;
|
|
|
969
1128
|
*/
|
|
970
1129
|
declare function createLCR(config: LCRConfig): LCRRouter;
|
|
971
1130
|
|
|
972
|
-
export { type BillableContext, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, OFFICIAL_PRICES, type PriceComparisonRow, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
|
|
1131
|
+
export { type BillableContext, type CacheOptions, type CacheStore, type CachedCall, type CachedMeta, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, type MemoryCacheOptions, OFFICIAL_PRICES, type PriceComparisonRow, type PromptCacheOptions, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createMemoryCacheStore, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
|
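The `CallRecord` comments above state that `cacheHitSavingUsd` lives on its own line and is never folded into `baselineUsd - costUsd`. A minimal sketch of a consumer that respects that split when totalling records follows. `RecordLike` is a hypothetical, trimmed-down local type that uses only field names mentioned in those comments, and `summarize` is an illustrative helper, not a package export.

```typescript
// Hypothetical stand-in for CallRecord: only fields named in the doc comments
// above. Treating baselineUsd as optional is an assumption of this sketch.
type RecordLike = {
  costUsd: number;
  baselineUsd?: number;
  cachedSavingUsd?: number;   // provider prompt-cache saving (the model still ran)
  cacheHit?: boolean;         // served from the response cache, no provider called
  cacheHitSavingUsd?: number; // what the original call cost, i.e. what the hit avoided
};

type Summary = {
  spentUsd: number;
  routingSavingUsd: number;
  promptCacheSavingUsd: number;
  responseCacheSavingUsd: number;
  hits: number;
};

// Illustrative helper (not part of ai-lcr): keeps the three kinds of saving apart.
function summarize(records: RecordLike[]): Summary {
  const s: Summary = {
    spentUsd: 0,
    routingSavingUsd: 0,
    promptCacheSavingUsd: 0,
    responseCacheSavingUsd: 0,
    hits: 0,
  };
  for (const r of records) {
    s.spentUsd += r.costUsd;
    if (r.cacheHit) {
      // A hit costs 0; its saving is reported separately from routing savings.
      s.hits += 1;
      s.responseCacheSavingUsd += r.cacheHitSavingUsd ?? 0;
      continue;
    }
    if (r.baselineUsd !== undefined) s.routingSavingUsd += r.baselineUsd - r.costUsd;
    s.promptCacheSavingUsd += r.cachedSavingUsd ?? 0;
  }
  return s;
}
```

In real use this would likely be fed from the `onCall` callback; the record shape there may carry more fields than this sketch assumes.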
package/dist/index.d.ts
CHANGED
|
@@ -1,4 +1,120 @@
|
|
|
1
|
-
import { LanguageModelV3 } from '@ai-sdk/provider';
|
|
1
|
+
import { LanguageModelV3GenerateResult, LanguageModelV3StreamPart, LanguageModelV3 } from '@ai-sdk/provider';
|
|
2
|
+
|
|
3
|
+
/**
|
|
4
|
+
* Exact-match response cache (Feature ②).
|
|
5
|
+
*
|
|
6
|
+
* Unlike provider-side prompt caching (see ./prompt-cache), this skips the
|
|
7
|
+
* model call *entirely*: when a request is byte-for-byte identical to one
|
|
8
|
+
* already answered, the stored response is replayed and no provider is touched
|
|
9
|
+
* — zero latency, zero cost. This is the layer Vercel AI Gateway notably does
|
|
10
|
+
* NOT offer, and it composes naturally with ai-lcr's cost truth: a hit settles
|
|
11
|
+
* a {@link CallRecord} with `costUsd: 0` and `cacheHit: true`, and reports the
|
|
12
|
+
* money it avoided as `cacheHitSavingUsd` — a saving kept on its own line, the
|
|
13
|
+
* same discipline as prompt-cache savings, never folded into routing savings.
|
|
14
|
+
*
|
|
15
|
+
* Storage is pluggable and the package ships ZERO dependencies for it. The
|
|
16
|
+
* default `createMemoryCacheStore()` is a process-local Map: real on a
|
|
17
|
+
* long-running server, and useful within a single serverless invocation (an
|
|
18
|
+
* agent loop that repeats a sub-call), but it does NOT survive across
|
|
19
|
+
* serverless requests — different function instances don't share memory. For
|
|
20
|
+
* cross-request hits on serverless, inject your own store backed by a shared
|
|
21
|
+
* layer (Upstash Redis, Vercel KV). ai-lcr never runs that service; you bring
|
|
22
|
+
* it. A custom store is responsible for serializing the stored value.
|
|
23
|
+
*
|
|
24
|
+
* Determinism caveat: caching makes identical requests return identical
|
|
25
|
+
* responses. That is exactly right for idempotent / `temperature: 0` calls and
|
|
26
|
+
* changes behavior for sampled ones (the variety is gone). Enable it where a
|
|
27
|
+
* repeated answer is acceptable.
|
|
28
|
+
*/
|
|
29
|
+
|
|
30
|
+
/** A stored, replayable LLM response plus the cost it originally incurred. */
|
|
31
|
+
type CachedCall = {
|
|
32
|
+
kind: "generate";
|
|
33
|
+
result: LanguageModelV3GenerateResult;
|
|
34
|
+
meta: CachedMeta;
|
|
35
|
+
} | {
|
|
36
|
+
kind: "stream";
|
|
37
|
+
parts: LanguageModelV3StreamPart[];
|
|
38
|
+
meta: CachedMeta;
|
|
39
|
+
};
|
|
40
|
+
/** Settle-time facts carried in the cache entry so a hit can report honest
|
|
41
|
+
* tokens, the originally-serving provider, and the money the hit avoided. */
|
|
42
|
+
interface CachedMeta {
|
|
43
|
+
/** The provider that served the original (cached) call. */
|
|
44
|
+
winner: string;
|
|
45
|
+
/** What the original call actually cost — i.e. the money a hit avoids. */
|
|
46
|
+
costUsd: number;
|
|
47
|
+
inputTokens: number;
|
|
48
|
+
outputTokens: number;
|
|
49
|
+
/** Prompt-cache reads on the original call, when reported (> 0 only). */
|
|
50
|
+
cachedInputTokens?: number;
|
|
51
|
+
}
|
|
52
|
+
/**
|
|
53
|
+
* Pluggable response-cache backend. Implement it over Redis / Vercel KV / any
|
|
54
|
+
* shared store to get cross-request hits on serverless; the bundled
|
|
55
|
+
* {@link createMemoryCacheStore} is the dependency-free default. `get`/`set`
|
|
56
|
+
* may be sync or async. A `set` that throws must never break the request — the
|
|
57
|
+
* engine treats the cache as best-effort.
|
|
58
|
+
*/
|
|
59
|
+
interface CacheStore {
|
|
60
|
+
get(key: string): CachedCall | undefined | Promise<CachedCall | undefined>;
|
|
61
|
+
set(key: string, value: CachedCall, ttlMs?: number): void | Promise<void>;
|
|
62
|
+
}
|
|
63
|
+
/** Public response-cache config. See {@link LCRConfig.cache}. */
|
|
64
|
+
interface CacheOptions {
|
|
65
|
+
/** Where to store responses. Defaults to a process-local in-memory store. */
|
|
66
|
+
store?: CacheStore;
|
|
67
|
+
/** Entry lifetime in ms. Omit for no expiry (entries live until evicted). */
|
|
68
|
+
ttlMs?: number;
|
|
69
|
+
}
|
|
70
|
+
/** Tuning for {@link createMemoryCacheStore}. */
|
|
71
|
+
interface MemoryCacheOptions {
|
|
72
|
+
/**
|
|
73
|
+
* Cap on stored entries. When exceeded, the oldest-inserted entry is dropped
|
|
74
|
+
* (insertion-order FIFO — Map preserves it). Keeps an unbounded key space
|
|
75
|
+
* (every distinct prompt) from leaking memory in a long-running process.
|
|
76
|
+
* Default 1000.
|
|
77
|
+
*/
|
|
78
|
+
maxEntries?: number;
|
|
79
|
+
}
|
|
80
|
+
/**
|
|
81
|
+
* A process-local in-memory {@link CacheStore} with optional TTL and a
|
|
82
|
+
* bounded entry count. Zero dependencies. See the module header for where this
|
|
83
|
+
* is (and isn't) useful — notably it does NOT share across serverless requests.
|
|
84
|
+
*/
|
|
85
|
+
declare function createMemoryCacheStore(opts?: MemoryCacheOptions): CacheStore;
|
|
86
|
+
|
|
87
|
+
/**
|
|
88
|
+
* Automatic prompt-cache breakpoints (Feature ①).
|
|
89
|
+
*
|
|
90
|
+
* Provider-side prompt caching (Anthropic, MiniMax) caches a *prefix* of the
|
|
91
|
+
* prompt so repeated calls bill the static head — the system prompt — at the
|
|
92
|
+
* cache-read rate (~0.1× input) instead of full price. The model still runs;
|
|
93
|
+
* only the input cost of the cached prefix drops. Anthropic needs an explicit
|
|
94
|
+
* `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix
|
|
95
|
+
* automatically with no marker at all.
|
|
96
|
+
*
|
|
97
|
+
* This module adds, when `promptCache` is enabled, a single `cacheControl`
|
|
98
|
+
* breakpoint on the LAST system message — the canonical large, stable head of
|
|
99
|
+
* a prompt. It only writes the `anthropic` provider-options namespace, which
|
|
100
|
+
* every non-Anthropic provider ignores, so it is safe to apply on every leg of
|
|
101
|
+
* a mixed chain: Anthropic reads it, the rest pass it through untouched. No
|
|
102
|
+
* external service and no storage — the cache itself lives at the provider.
|
|
103
|
+
*
|
|
104
|
+
* It steps aside the moment the caller is managing caching themselves: if ANY
|
|
105
|
+
* message already carries an `anthropic.cacheControl`, the prompt is returned
|
|
106
|
+
* unchanged. Same "explicit always wins" discipline as the price table.
|
|
107
|
+
*/
|
|
108
|
+
|
|
109
|
+
/** Tuning for automatic prompt-cache breakpoints. See {@link LCRConfig.promptCache}. */
|
|
110
|
+
interface PromptCacheOptions {
|
|
111
|
+
/**
|
|
112
|
+
* Cache lifetime for the injected breakpoint. `"5m"` (the Anthropic default,
|
|
113
|
+
* a cheaper cache write) or `"1h"` (a pricier write that pays off when the
|
|
114
|
+
* same prefix is reused over a longer span). Default `"5m"`.
|
|
115
|
+
*/
|
|
116
|
+
ttl?: "5m" | "1h";
|
|
117
|
+
}
|
|
2
118
|
|
|
3
119
|
/**
|
|
4
120
|
* Owned failover engine for ai-lcr.
|
|
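The `CacheStore` comments in the hunk above say a custom store may be sync or async, receives the TTL as the third argument to `set`, and is responsible for serializing the stored value. The sketch below shows one way such an adapter might look. Everything in it is an assumption made for illustration: `CachedCallLike` and `CacheStoreLike` are local stand-ins for the package types, `StringKV` is a hypothetical string-only client standing in for Redis or Vercel KV, and JSON is just one possible serializer.

```typescript
// Local stand-ins so the sketch runs without ai-lcr installed; in real use the
// CachedCall / CacheStore types would come from the package.
type CachedCallLike = {
  kind: "generate" | "stream";
  meta: { winner: string; costUsd: number };
};
interface CacheStoreLike {
  get(key: string): CachedCallLike | undefined | Promise<CachedCallLike | undefined>;
  set(key: string, value: CachedCallLike, ttlMs?: number): void | Promise<void>;
}

// Hypothetical string-only key-value client. It is synchronous here for
// brevity; a real client would return promises, which the interface permits.
interface StringKV {
  read(key: string): string | undefined;
  write(key: string, value: string): void;
}

function createStringKVStore(kv: StringKV, now: () => number = Date.now): CacheStoreLike {
  return {
    get(key) {
      const raw = kv.read(key);
      if (raw === undefined) return undefined;
      try {
        const parsed = JSON.parse(raw) as { value: CachedCallLike; expiresAt?: number };
        // Enforce the TTL on read, since this toy KV has no native expiry.
        if (parsed.expiresAt !== undefined && parsed.expiresAt <= now()) return undefined;
        return parsed.value;
      } catch {
        return undefined; // treat a corrupt entry as a miss
      }
    },
    set(key, value, ttlMs) {
      const expiresAt = ttlMs !== undefined ? now() + ttlMs : undefined;
      // JSON.stringify omits expiresAt when it is undefined (no expiry).
      kv.write(key, JSON.stringify({ value, expiresAt }));
    },
  };
}
```

JSON is a simplification: if a stored result or stream part holds values JSON cannot represent, a richer serializer would be needed.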
@@ -176,6 +292,21 @@ interface CallRecord {
|
|
|
176
292
|
* be surfaced separately, never folded into `baselineUsd - costUsd`.
|
|
177
293
|
*/
|
|
178
294
|
cachedSavingUsd?: number;
|
|
295
|
+
/**
|
|
296
|
+
* True when this request was served from ai-lcr's exact-match RESPONSE cache
|
|
297
|
+
* — no provider was called at all. Distinct from `cachedInputTokens` /
|
|
298
|
+
* `cachedSavingUsd`, which are the *provider's* prompt-cache (the model still
|
|
299
|
+
* ran). On a hit `costUsd` is 0, `winner` is the provider that served the
|
|
300
|
+
* ORIGINAL (now-cached) call, and `attempts` has a single synthetic entry.
|
|
301
|
+
*/
|
|
302
|
+
cacheHit?: boolean;
|
|
303
|
+
/**
|
|
304
|
+
* On a `cacheHit`, the money the hit avoided — i.e. what the original call
|
|
305
|
+
* actually cost when it ran. Present only when > 0. Like `cachedSavingUsd`
|
|
306
|
+
* this is a caching saving, NOT a routing saving, so it lives on its own line
|
|
307
|
+
* and is never folded into `baselineUsd - costUsd`.
|
|
308
|
+
*/
|
|
309
|
+
cacheHitSavingUsd?: number;
|
|
179
310
|
/**
|
|
180
311
|
* Caller-supplied correlation id, read from `providerOptions.lcr.requestId`
|
|
181
312
|
* on the call. Multi-step tool loops emit one record per `doStream`/
|
|
@@ -926,6 +1057,34 @@ interface LCRConfig {
|
|
|
926
1057
|
* unchanged routing, no provider is ever skipped). See {@link CooldownOptions}.
|
|
927
1058
|
*/
|
|
928
1059
|
cooldown?: boolean | CooldownOptions;
|
|
1060
|
+
/**
|
|
1061
|
+
* Exact-match RESPONSE cache: when a request is identical to one already
|
|
1062
|
+
* answered, replay the stored response and call no provider at all — zero
|
|
1063
|
+
* latency, `costUsd: 0`, and the avoided cost reported as `cacheHitSavingUsd`
|
|
1064
|
+
* on the {@link CallRecord} (with `cacheHit: true`). Off by default.
|
|
1065
|
+
*
|
|
1066
|
+
* `true` uses a process-local in-memory store; pass a {@link CacheStore} to
|
|
1067
|
+
* bring your own (Redis / Vercel KV — required for cross-request hits on
|
|
1068
|
+
* serverless, where memory isn't shared); pass `{ store?, ttlMs? }` to set a
|
|
1069
|
+
* TTL. ai-lcr runs no service of its own — any shared store is yours.
|
|
1070
|
+
*
|
|
1071
|
+
* Caching makes identical requests return identical responses: ideal for
|
|
1072
|
+
* idempotent / `temperature: 0` calls, a behavior change for sampled ones.
|
|
1073
|
+
* Empty completions and usage-less results are never cached.
|
|
1074
|
+
*/
|
|
1075
|
+
cache?: boolean | CacheStore | CacheOptions;
|
|
1076
|
+
/**
|
|
1077
|
+
* Automatic provider-side PROMPT caching: insert a `cache_control` breakpoint
|
|
1078
|
+
* on the last system message so the static prompt head bills at the
|
|
1079
|
+
* cache-read rate (~0.1× input) on repeats. The model still runs — this only
|
|
1080
|
+
* lowers input cost, it does not skip the call (that's `cache`). Only
|
|
1081
|
+
* Anthropic / MiniMax need the marker; OpenAI / Gemini / DeepSeek cache the
|
|
1082
|
+
* prefix automatically and ignore it, so this is safe on a mixed chain.
|
|
1083
|
+
*
|
|
1084
|
+
* `true` for the 5-minute default, `{ ttl: "1h" }` for the longer window.
|
|
1085
|
+
* Off by default; steps aside entirely if you set `cacheControl` yourself.
|
|
1086
|
+
*/
|
|
1087
|
+
promptCache?: boolean | PromptCacheOptions;
|
|
929
1088
|
/** Called when a provider errors and routing falls through to the next. */
|
|
930
1089
|
onError?: (error: Error, provider: string) => void;
|
|
931
1090
|
/** Called after each successful call with the serving provider, tokens, and cost. */
|
|
@@ -969,4 +1128,4 @@ type LCRRouter = (modelName: string) => LanguageModelV3;
|
|
|
969
1128
|
*/
|
|
970
1129
|
declare function createLCR(config: LCRConfig): LCRRouter;
|
|
971
1130
|
|
|
972
|
-
export { type BillableContext, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, OFFICIAL_PRICES, type PriceComparisonRow, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
|
|
1131
|
+
export { type BillableContext, type CacheOptions, type CacheStore, type CachedCall, type CachedMeta, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, type MemoryCacheOptions, OFFICIAL_PRICES, type PriceComparisonRow, type PromptCacheOptions, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createMemoryCacheStore, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
|
package/dist/index.js
CHANGED
|
@@ -1,3 +1,102 @@
|
|
|
1
|
+
// src/cache.ts
|
|
2
|
+
function isCacheStore(x) {
|
|
3
|
+
return typeof x === "object" && x !== null && typeof x.get === "function" && typeof x.set === "function";
|
|
4
|
+
}
|
|
5
|
+
function resolveCache(opt) {
|
|
6
|
+
if (!opt) return void 0;
|
|
7
|
+
if (opt === true) return { store: createMemoryCacheStore() };
|
|
8
|
+
if (isCacheStore(opt)) return { store: opt };
|
|
9
|
+
return {
|
|
10
|
+
store: opt.store ?? createMemoryCacheStore(),
|
|
11
|
+
...opt.ttlMs !== void 0 ? { ttlMs: opt.ttlMs } : {}
|
|
12
|
+
};
|
|
13
|
+
}
|
|
14
|
+
function cacheKeyOf(modelName, options) {
|
|
15
|
+
const rest = options.providerOptions ? Object.entries(options.providerOptions).filter(([ns]) => ns !== "lcr") : [];
|
|
16
|
+
const po = rest.length > 0 ? Object.fromEntries(rest) : void 0;
|
|
17
|
+
return JSON.stringify({
|
|
18
|
+
m: modelName,
|
|
19
|
+
prompt: options.prompt,
|
|
20
|
+
maxOutputTokens: options.maxOutputTokens,
|
|
21
|
+
temperature: options.temperature,
|
|
22
|
+
topP: options.topP,
|
|
23
|
+
topK: options.topK,
|
|
24
|
+
frequencyPenalty: options.frequencyPenalty,
|
|
25
|
+
presencePenalty: options.presencePenalty,
|
|
26
|
+
stopSequences: options.stopSequences,
|
|
27
|
+
seed: options.seed,
|
|
28
|
+
responseFormat: options.responseFormat,
|
|
29
|
+
tools: options.tools,
|
|
30
|
+
toolChoice: options.toolChoice,
|
|
31
|
+
po
|
|
32
|
+
});
|
|
33
|
+
}
|
|
34
|
+
function streamFromParts(parts) {
|
|
35
|
+
return new ReadableStream({
|
|
36
|
+
start(controller) {
|
|
37
|
+
for (const part of parts) controller.enqueue(part);
|
|
38
|
+
controller.close();
|
|
39
|
+
}
|
|
40
|
+
});
|
|
41
|
+
}
|
|
42
|
+
function createMemoryCacheStore(opts = {}) {
|
|
43
|
+
const maxEntries = opts.maxEntries ?? 1e3;
|
|
44
|
+
const map = /* @__PURE__ */ new Map();
|
|
45
|
+
return {
|
|
46
|
+
get(key) {
|
|
47
|
+
const entry = map.get(key);
|
|
48
|
+
if (!entry) return void 0;
|
|
49
|
+
if (entry.expiresAt !== void 0 && entry.expiresAt <= Date.now()) {
|
|
50
|
+
map.delete(key);
|
|
51
|
+
return void 0;
|
|
52
|
+
}
|
|
53
|
+
return entry.value;
|
|
54
|
+
},
|
|
55
|
+
set(key, value, ttlMs) {
|
|
56
|
+
const entry = ttlMs !== void 0 ? { value, expiresAt: Date.now() + ttlMs } : { value };
|
|
57
|
+
map.delete(key);
|
|
58
|
+
map.set(key, entry);
|
|
59
|
+
if (map.size > maxEntries) {
|
|
60
|
+
const oldest = map.keys().next().value;
|
|
61
|
+
if (oldest !== void 0) map.delete(oldest);
|
|
62
|
+
}
|
|
63
|
+
}
|
|
64
|
+
};
|
|
65
|
+
}
|
|
66
|
+
|
|
67
|
+
// src/prompt-cache.ts
|
|
68
|
+
function resolvePromptCache(opt) {
|
|
69
|
+
if (!opt) return void 0;
|
|
70
|
+
if (opt === true) return { ttl: "5m" };
|
|
71
|
+
return { ttl: opt.ttl ?? "5m" };
|
|
72
|
+
}
|
|
73
|
+
function hasAnthropicCacheControl(message) {
|
|
74
|
+
const anthropic = message.providerOptions?.anthropic;
|
|
75
|
+
return !!anthropic && "cacheControl" in anthropic;
|
|
76
|
+
}
|
|
77
|
+
function withPromptCacheBreakpoint(options, cfg) {
|
|
78
|
+
const prompt = options.prompt;
|
|
79
|
+
if (!Array.isArray(prompt) || prompt.length === 0) return options;
|
|
80
|
+
if (prompt.some(hasAnthropicCacheControl)) return options;
|
|
81
|
+
let target = -1;
|
|
82
|
+
for (let i = 0; i < prompt.length; i++) {
|
|
83
|
+
if (prompt[i].role === "system") target = i;
|
|
84
|
+
}
|
|
85
|
+
if (target === -1) return options;
|
|
86
|
+
const cacheControl = cfg.ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };
|
|
87
|
+
const newPrompt = prompt.map((message, i) => {
|
|
88
|
+
if (i !== target) return message;
|
|
89
|
+
return {
|
|
90
|
+
...message,
|
|
91
|
+
providerOptions: {
|
|
92
|
+
...message.providerOptions,
|
|
93
|
+
anthropic: { ...message.providerOptions?.anthropic, cacheControl }
|
|
94
|
+
}
|
|
95
|
+
};
|
|
96
|
+
});
|
|
97
|
+
return { ...options, prompt: newPrompt };
|
|
98
|
+
}
|
|
99
|
+
|
|
1
100
|
// src/fallback.ts
|
|
2
101
|
var EmptyCompletionError = class extends Error {
|
|
3
102
|
constructor(provider) {
|
|
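The `cacheKeyOf` function in the hunk above builds the key with `JSON.stringify` over the model name and request options, after dropping the `lcr` namespace from `providerOptions`. A trimmed-down sketch of that behaviour follows; `CallOptionsLike` and `keyOf` are local names invented here, and only a few of the keyed fields are reproduced.

```typescript
// Local assumption: a small subset of the call options that the diff keys on.
type CallOptionsLike = {
  prompt: unknown;
  temperature?: number;
  providerOptions?: Record<string, Record<string, unknown>>;
};

// Simplified mirror of cacheKeyOf from the diff above.
function keyOf(modelName: string, options: CallOptionsLike): string {
  const all = options.providerOptions ?? {};
  const po: Record<string, unknown> = {};
  for (const ns of Object.keys(all)) {
    // Drop the `lcr` namespace (e.g. requestId) so bookkeeping never splits the cache.
    if (ns !== "lcr") po[ns] = all[ns];
  }
  const hasPo = Object.keys(po).length > 0;
  return JSON.stringify({
    m: modelName,
    prompt: options.prompt,
    temperature: options.temperature,
    po: hasPo ? po : undefined, // undefined properties are omitted by JSON.stringify
  });
}

const prompt = [{ role: "user", content: [{ type: "text", text: "hi" }] }];
const a = keyOf("m1", { prompt, temperature: 0, providerOptions: { lcr: { requestId: "r1" } } });
const b = keyOf("m1", { prompt, temperature: 0, providerOptions: { lcr: { requestId: "r2" } } });
const c = keyOf("m1", { prompt, temperature: 0.7 });
console.log(a === b, a === c); // prints "true false"
```

Because the key is a plain `JSON.stringify`, two requests whose objects differ only in property insertion order serialize differently and would presumably miss; "exact-match" here means exact down to the serialized form.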
@@ -396,6 +495,16 @@ var LcrFallbackModel = class {
|
|
|
396
495
|
const usageMissing = inputTokens === 0 && outputTokens === 0;
|
|
397
496
|
const emptyCompletion = inputTokens > 0 && outputTokens === 0;
|
|
398
497
|
const baselineUsd = this.baselineUsd(inputTokens, outputTokens, cacheReadTokens);
|
|
498
|
+
ctx.settled = {
|
|
499
|
+
meta: {
|
|
500
|
+
winner: provider.label,
|
|
501
|
+
costUsd,
|
|
502
|
+
inputTokens,
|
|
503
|
+
outputTokens,
|
|
504
|
+
...cacheReadTokens > 0 ? { cachedInputTokens: cacheReadTokens } : {}
|
|
505
|
+
},
|
|
506
|
+
cacheable: !emptyCompletion && !usageMissing
|
|
507
|
+
};
|
|
399
508
|
this.emitCost({
|
|
400
509
|
model: this.opts.modelName,
|
|
401
510
|
provider: provider.label,
|
|
@@ -440,6 +549,16 @@ var LcrFallbackModel = class {
|
|
|
440
549
|
});
|
|
441
550
|
}
|
|
442
551
|
async doGenerate(options) {
|
|
552
|
+
const cache = this.opts.cache;
|
|
553
|
+
const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
|
|
554
|
+
if (cache && cacheKey !== void 0) {
|
|
555
|
+
const hit = await cache.store.get(cacheKey);
|
|
556
|
+
if (hit && hit.kind === "generate") {
|
|
557
|
+
this.finalizeCacheHit(this.startCall(options), hit.meta);
|
|
558
|
+
return hit.result;
|
|
559
|
+
}
|
|
560
|
+
}
|
|
561
|
+
const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
|
|
443
562
|
const ctx = this.startCall(options);
|
|
444
563
|
const providers = this.opts.providers;
|
|
445
564
|
const order = this.routeOrder(this.startIndex());
|
|
@@ -450,7 +569,7 @@ var LcrFallbackModel = class {
|
|
|
450
569
|
const isLast = pos === order.length - 1;
|
|
451
570
|
const attemptStart = Date.now();
|
|
452
571
|
try {
|
|
453
|
-
const result = await provider.model.doGenerate(
|
|
572
|
+
const result = await provider.model.doGenerate(callOptions);
|
|
454
573
|
const out = result.usage?.outputTokens?.total ?? 0;
|
|
455
574
|
const inp = result.usage?.inputTokens?.total ?? 0;
|
|
456
575
|
if (inp > 0 && out === 0 && !isLast) {
|
|
@@ -463,6 +582,9 @@ var LcrFallbackModel = class {
|
|
|
463
582
|
this.recordProviderSuccess(idx);
|
|
464
583
|
this.settleSticky(idx);
|
|
465
584
|
this.finalizeOk(ctx, provider, attemptStart, result.usage);
|
|
585
|
+
if (cache && cacheKey !== void 0 && ctx.settled?.cacheable) {
|
|
586
|
+
this.storeCache(cacheKey, { kind: "generate", result, meta: ctx.settled.meta });
|
|
587
|
+
}
|
|
466
588
|
return result;
|
|
467
589
|
} catch (error) {
|
|
468
590
|
lastError = error;
|
|
@@ -479,7 +601,76 @@ var LcrFallbackModel = class {
|
|
|
479
601
|
throw ctx.firstError ?? lastError;
|
|
480
602
|
}
|
|
481
603
|
async doStream(options) {
|
|
482
|
-
|
|
604
|
+
const cache = this.opts.cache;
|
|
605
|
+
const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
|
|
606
|
+
if (cache && cacheKey !== void 0) {
|
|
607
|
+
const hit = await cache.store.get(cacheKey);
|
|
608
|
+
if (hit && hit.kind === "stream") {
|
|
609
|
+
this.finalizeCacheHit(this.startCall(options), hit.meta);
|
|
610
|
+
return { stream: streamFromParts(hit.parts) };
|
|
611
|
+
}
|
|
612
|
+
}
|
|
613
|
+
const ctx = this.startCall(options);
|
|
614
|
+
const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
|
|
615
|
+
const inner = await this.doStreamWithCtx(
|
|
616
|
+
callOptions,
|
|
617
|
+
ctx,
|
|
618
|
+
this.routeOrder(this.startIndex()),
|
|
619
|
+
0
|
|
620
|
+
);
|
|
621
|
+
if (!cache || cacheKey === void 0) return inner;
|
|
622
|
+
const collected = [];
|
|
623
|
+
const self = this;
|
|
624
|
+
const wrapped = inner.stream.pipeThrough(
|
|
625
|
+
new TransformStream({
|
|
626
|
+
transform(part, controller) {
|
|
627
|
+
collected.push(part);
|
|
628
|
+
controller.enqueue(part);
|
|
629
|
+
},
|
|
630
|
+
flush() {
|
|
631
|
+
if (ctx.settled?.cacheable) {
|
|
632
|
+
self.storeCache(cacheKey, { kind: "stream", parts: collected, meta: ctx.settled.meta });
|
|
633
|
+
}
|
|
634
|
+
}
|
|
635
|
+
})
|
|
636
|
+
);
|
|
637
|
+
return { ...inner, stream: wrapped };
|
|
638
|
+
}
|
|
639
|
+
/** A response-cache hit: replay a stored answer with no provider call. Settles
|
|
640
|
+
* one {@link CallRecord} with `cacheHit`, `costUsd: 0`, and the avoided cost
|
|
641
|
+
* on its own `cacheHitSavingUsd` line. */
|
|
642
|
+
finalizeCacheHit(ctx, meta) {
|
|
643
|
+
this.emitCall({
|
|
644
|
+
id: ctx.id,
|
|
645
|
+
model: this.opts.modelName,
|
|
646
|
+
attempts: [{ provider: meta.winner, ok: true, latencyMs: Date.now() - ctx.startedAt }],
|
|
647
|
+
winner: meta.winner,
|
|
648
|
+
ok: true,
|
|
649
|
+
failedOver: false,
|
|
650
|
+
latencyMs: Date.now() - ctx.startedAt,
|
|
651
|
+
inputTokens: meta.inputTokens,
|
|
652
|
+
outputTokens: meta.outputTokens,
|
|
653
|
+
...meta.cachedInputTokens ? { cachedInputTokens: meta.cachedInputTokens } : {},
|
|
654
|
+
costUsd: 0,
|
|
655
|
+
cacheHit: true,
|
|
656
|
+
...meta.costUsd > 0 ? { cacheHitSavingUsd: meta.costUsd } : {},
|
|
657
|
+
...ctx.requestId ? { requestId: ctx.requestId } : {}
|
|
658
|
+
});
|
|
659
|
+
}
|
|
660
|
+
/** Best-effort write to the response cache: a sync throw or a rejected async
|
|
661
|
+
* `set` must never break the request. Caching is an optimization, not a
|
|
662
|
+
* guarantee. */
|
|
663
|
+
storeCache(key, value) {
|
|
664
|
+
const cache = this.opts.cache;
|
|
665
|
+
if (!cache) return;
|
|
666
|
+
try {
|
|
667
|
+
const r = cache.store.set(key, value, cache.ttlMs);
|
|
668
|
+
if (r && typeof r.catch === "function") {
|
|
669
|
+
r.catch(() => {
|
|
670
|
+
});
|
|
671
|
+
}
|
|
672
|
+
} catch {
|
|
673
|
+
}
|
|
483
674
|
}
|
|
484
675
|
// The stream's failover recursion re-enters here with the SAME `ctx` and the
|
|
485
676
|
// SAME `order` snapshot, advancing only the local position `pos`, so a
|
|
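The `storeCache` method in the hunk above treats the cache write as best-effort: neither a synchronous throw nor a rejected promise from `set` may break the request. A standalone sketch of the same idea follows. `SetOnly` and `bestEffortSet` are names invented for this example, and the sketch uses `Promise.resolve(...)` to cover the sync and async cases uniformly, which is a variant of, not a copy of, the check in the diff.

```typescript
// Local stand-in for the write side of a cache store; `set` may be sync or async.
interface SetOnly {
  set(key: string, value: unknown, ttlMs?: number): void | Promise<void>;
}

// Swallow both failure modes so a failing cache never fails the request.
function bestEffortSet(store: SetOnly, key: string, value: unknown, ttlMs?: number): void {
  try {
    // Promise.resolve wraps a sync return and adopts an async one alike,
    // so a single .catch covers a rejected async write.
    Promise.resolve(store.set(key, value, ttlMs)).catch(() => {});
  } catch {
    // Synchronous throw from set: ignore, caching is only an optimization.
  }
}

// Two deliberately broken stores: neither call below throws or rejects unhandled.
const throwing: SetOnly = {
  set() {
    throw new Error("store down");
  },
};
const rejecting: SetOnly = {
  set() {
    return Promise.reject(new Error("store down"));
  },
};
bestEffortSet(throwing, "k", 1);
bestEffortSet(rejecting, "k", 1);
```

The trade-off is silence: a store that fails on every write shows up only as a lack of cache hits, so a custom store may want its own logging inside `set`.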
@@ -1961,12 +2152,16 @@ function createLCR(config) {
|
|
|
1961
2152
|
autoPrice = false,
|
|
1962
2153
|
resetIntervalMs,
|
|
1963
2154
|
cooldown,
|
|
2155
|
+
cache,
|
|
2156
|
+
promptCache,
|
|
1964
2157
|
onError,
|
|
1965
2158
|
onCost,
|
|
1966
2159
|
onCall,
|
|
1967
2160
|
shouldRetry,
|
|
1968
2161
|
defaultCacheReadRatio
|
|
1969
2162
|
} = config;
|
|
2163
|
+
const resolvedCache = resolveCache(cache);
|
|
2164
|
+
const resolvedPromptCache = resolvePromptCache(promptCache);
|
|
1970
2165
|
if (defaultCacheReadRatio !== void 0 && (defaultCacheReadRatio < 0 || defaultCacheReadRatio > 1)) {
|
|
1971
2166
|
throw new Error(
|
|
1972
2167
|
`ai-lcr: defaultCacheReadRatio must be in [0, 1], got ${defaultCacheReadRatio}`
|
|
@@ -1986,7 +2181,18 @@ function createLCR(config) {
|
|
|
1986
2181
|
}
|
|
1987
2182
|
routed.set(
|
|
1988
2183
|
name,
|
|
1989
|
-
new LcrFallbackModel({
|
|
2184
|
+
new LcrFallbackModel({
|
|
2185
|
+
modelName: name,
|
|
2186
|
+
providers,
|
|
2187
|
+
resetIntervalMs,
|
|
2188
|
+
cooldown,
|
|
2189
|
+
...resolvedCache ? { cache: resolvedCache } : {},
|
|
2190
|
+
...resolvedPromptCache ? { promptCache: resolvedPromptCache } : {},
|
|
2191
|
+
onError,
|
|
2192
|
+
onCost,
|
|
2193
|
+
onCall,
|
|
2194
|
+
shouldRetry
|
|
2195
|
+
})
|
|
1990
2196
|
);
|
|
1991
2197
|
}
|
|
1992
2198
|
return (modelName) => {
|
|
@@ -2014,6 +2220,7 @@ export {
|
|
|
2014
2220
|
createKunavoMediaAdapter,
|
|
2015
2221
|
createLCR,
|
|
2016
2222
|
createMediaLCR,
|
|
2223
|
+
createMemoryCacheStore,
|
|
2017
2224
|
createRunwareMediaAdapter,
|
|
2018
2225
|
durationFromInput,
|
|
2019
2226
|
formatCallRecord,
|
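The `withPromptCacheBreakpoint` function added earlier in this file marks the last system message with an `anthropic.cacheControl` entry and returns the prompt unchanged when any message already carries one. A self-contained sketch of that rule follows, on a simplified message type. `Msg` and `markLastSystem` are local names assumed for illustration, and the function works on the prompt array directly where the diff's version takes the whole options object.

```typescript
// Local message shape: an assumption covering only what the sketch touches.
type Msg = {
  role: "system" | "user" | "assistant" | "tool";
  content: unknown;
  providerOptions?: Record<string, Record<string, unknown>>;
};

// True when the caller already set Anthropic cache control on this message.
const hasMarker = (m: Msg): boolean => {
  const anthropic = m.providerOptions?.anthropic;
  return !!anthropic && "cacheControl" in anthropic;
};

// Simplified mirror of withPromptCacheBreakpoint from this file's diff.
function markLastSystem(prompt: Msg[], ttl: "5m" | "1h" = "5m"): Msg[] {
  if (prompt.length === 0) return prompt;
  if (prompt.some(hasMarker)) return prompt; // explicit caller settings win
  const target = prompt.map((m) => m.role).lastIndexOf("system");
  if (target === -1) return prompt; // no system message, nothing to mark
  const cacheControl = ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };
  return prompt.map((m, i) =>
    i !== target
      ? m
      : {
          ...m,
          providerOptions: {
            ...m.providerOptions,
            anthropic: { ...m.providerOptions?.anthropic, cacheControl },
          },
        },
  );
}
```

The input array is never mutated: either the same reference comes back or a new array with one replaced message, which keeps the original options usable for computing the response-cache key.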
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "ai-lcr",
|
|
3
|
-
"version": "0.6.
|
|
3
|
+
"version": "0.6.3",
|
|
4
4
|
"description": "Least Cost Routing for LLMs — route every model call to the cheapest available provider, fall back automatically, and track real cost. Built for the Vercel AI SDK.",
|
|
5
5
|
"keywords": [
|
|
6
6
|
"ai",
|