ai-lcr 0.6.2 → 0.6.3

package/CHANGELOG.md CHANGED
@@ -4,6 +4,41 @@ All notable changes to `ai-lcr` are documented here. The format follows
4
4
  [Keep a Changelog](https://keepachangelog.com/), and the project adheres to
5
5
  [Semantic Versioning](https://semver.org/).
6
6
 
7
+ ## [0.6.3] — 2026-06-11
8
+
9
+ Caching — both kinds, each off by default and each a pure config flag with no
10
+ service to run. The response cache is the layer Vercel AI Gateway notably
11
+ doesn't offer; ai-lcr does it in-process and folds it into its cost truth.
12
+
13
+ ### Added
14
+
15
+ - **`createLCR({ cache })`** — exact-match **response cache**. An identical
16
+ request replays the stored response and calls **no provider at all**: zero
17
+ latency, `costUsd: 0`. Storage is pluggable with **zero added dependencies**:
18
+ `cache: true` uses a bundled in-memory store, `cache: myStore` brings your own
19
+ (Redis / Vercel KV — required for cross-request hits on serverless, where
20
+ memory isn't shared), `cache: { store?, ttlMs? }` sets a TTL. A hit settles a
21
+ `CallRecord` with `cacheHit: true` and the avoided cost on its own
22
+ `cacheHitSavingUsd` line (a caching saving, never folded into routing savings).
23
+ Empty completions and usage-less results are never cached. New exports:
24
+ `createMemoryCacheStore`, types `CacheStore` / `CacheOptions` / `CachedCall` /
25
+ `CachedMeta` / `MemoryCacheOptions`.
26
+ - **`createLCR({ promptCache })`** — automatic provider-side **prompt-cache**
27
+ breakpoint. Inserts an Anthropic `cache_control` marker on the last system
28
+ message so the static prompt head bills at the cache-read rate (~0.1× input)
29
+ on repeats; the model still runs. `true` for the 5-minute window,
30
+ `{ ttl: "1h" }` for the longer one. Only writes the `anthropic` namespace
31
+ (ignored by other providers, safe on a mixed chain) and steps aside if you set
32
+ `cacheControl` yourself. Savings surface via the existing `cachedInputTokens` /
33
+ `cachedSavingUsd`. New exported type `PromptCacheOptions`.
34
+ - `CallRecord` gains **`cacheHit`** and **`cacheHitSavingUsd`** for response-cache
35
+ hits.
36
+
37
+ ### Compatibility
38
+
39
+ - Fully backward compatible. Both `cache` and `promptCache` are **off by
40
+ default** — unset, routing behaves exactly as before.
41
+
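The bundled in-memory store named in the entry above can be pictured roughly as follows. This is a minimal TypeScript sketch modeled on the `createMemoryCacheStore` code shipped in this release's `dist/index.cjs`; `createSketchStore`, `SketchStore`, and `Entry` are hypothetical names used only for illustration and are not package exports.

```typescript
// Hypothetical re-creation for illustration; the real export is createMemoryCacheStore.
type Entry<V> = { value: V; expiresAt?: number };

interface SketchStore<V> {
  get(key: string): V | undefined;
  set(key: string, value: V, ttlMs?: number): void;
}

function createSketchStore<V>(maxEntries = 1000): SketchStore<V> {
  // A Map iterates in insertion order, which is what makes FIFO eviction cheap.
  const map = new Map<string, Entry<V>>();
  return {
    get(key) {
      const entry = map.get(key);
      if (!entry) return undefined;
      // Expired entries are removed lazily, on read.
      if (entry.expiresAt !== undefined && entry.expiresAt <= Date.now()) {
        map.delete(key);
        return undefined;
      }
      return entry.value;
    },
    set(key, value, ttlMs) {
      const entry: Entry<V> =
        ttlMs !== undefined ? { value, expiresAt: Date.now() + ttlMs } : { value };
      map.delete(key); // re-inserting moves the key to the newest position
      map.set(key, entry);
      if (map.size > maxEntries) {
        const oldest = map.keys().next().value;
        if (oldest !== undefined) map.delete(oldest);
      }
    },
  };
}

const fifoStore = createSketchStore<string>(2);
fifoStore.set("a", "first");
fifoStore.set("b", "second");
fifoStore.set("c", "third"); // exceeds the cap of 2, so the oldest key ("a") is dropped

const ttlStore = createSketchStore<string>();
ttlStore.set("k", "short-lived", 0); // a TTL of 0 ms is already expired on the next read
```

Eviction here follows insertion order rather than recency of reads, so a frequently read entry can still be dropped once enough newer keys arrive.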
7
42
  ## [0.6.2] — 2026-06-11
8
43
 
9
44
  Circuit breaker for persistently-failing providers. Until now the only recovery
package/README.md CHANGED
@@ -182,6 +182,42 @@ const lcr = createLCR({
182
182
 
183
183
  With `cooldown` on, a provider that fails enough times in a window is *skipped* for a cooldown period rather than tried every request — and a single success clears it. Defaults are 3 failures / 60s → 60s cooldown; tune with `cooldown: { maxFailures, windowMs, cooldownMs }`. It only ever **reorders** the attempt list (cooling providers go last), so if *every* provider is cooling a request still tries them all rather than failing outright. Off by default — routing is unchanged unless you opt in.
184
184
 
185
+ ## Cache
186
+
187
+ There are two completely different "caches" in LLM land, and ai-lcr does both — each off by default, each a pure config flag with no service to run.
188
+
189
+ ### Skip the call entirely (`cache`) — response cache
190
+
191
+ When a request matches one already answered (same model name, prompt, sampling parameters, tools, and provider options, ignoring the `lcr` namespace), replay the stored response and call **no provider at all**: no provider round-trip, `costUsd: 0`. This is the layer Vercel AI Gateway notably *doesn't* offer.
192
+
193
+ ```ts
194
+ const lcr = createLCR({
195
+ models: { /* … */ },
196
+ cache: true, // exact-match response cache (in-memory by default)
197
+ });
198
+ ```
199
+
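What counts as "the same request" comes down to the cache key. The sketch below is a reduced TypeScript illustration modeled on `cacheKeyOf` in this release's bundle: it serializes the model name and a subset of the call options to JSON and leaves out the `lcr` provider-options namespace (which carries the per-call `requestId`). `sketchCacheKey` and `SketchCallOptions` are hypothetical names, and the real key covers more fields (`topP`, `topK`, `seed`, `tools`, and others).

```typescript
type ProviderOptions = Record<string, Record<string, unknown>>;

// Hypothetical, reduced option shape; the real call options carry more fields.
interface SketchCallOptions {
  prompt: unknown;
  temperature?: number;
  maxOutputTokens?: number;
  providerOptions?: ProviderOptions;
}

function sketchCacheKey(modelName: string, options: SketchCallOptions): string {
  // Drop the `lcr` namespace so correlation ids do not split otherwise identical requests.
  const all = options.providerOptions ?? {};
  const kept: ProviderOptions = {};
  let count = 0;
  for (const ns of Object.keys(all)) {
    if (ns === "lcr") continue;
    const value = all[ns];
    if (value !== undefined) {
      kept[ns] = value;
      count++;
    }
  }
  const po = count > 0 ? kept : undefined;
  // JSON.stringify omits undefined fields, so unset options do not affect the key.
  return JSON.stringify({
    m: modelName,
    prompt: options.prompt,
    maxOutputTokens: options.maxOutputTokens,
    temperature: options.temperature,
    po,
  });
}

const prompt = [{ role: "user", content: "ping" }];
const keyA = sketchCacheKey("fast", {
  prompt,
  temperature: 0,
  providerOptions: { lcr: { requestId: "r1" } },
});
const keyB = sketchCacheKey("fast", {
  prompt,
  temperature: 0,
  providerOptions: { lcr: { requestId: "r2" } },
});
const keyC = sketchCacheKey("fast", { prompt, temperature: 0.7 });
const keyD = sketchCacheKey("fast", {
  prompt: [{ content: "ping", role: "user" }], // same data, different property order
  temperature: 0,
});
```

`keyA` and `keyB` are equal because only the `requestId` differs, while `keyC` differs because the temperature changed. `keyD` also differs: a JSON-string key is sensitive to property order, so this kind of cache pays off most when requests are built by the same code path.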
200
+ Storage is pluggable and ai-lcr ships **zero dependencies** for it:
201
+
202
+ - **`cache: true`** uses a process-local in-memory store. It is effective on a long-running server, and useful *within* one serverless invocation (an agent loop repeating a sub-call), but it does **not** survive across serverless requests, because separate function instances don't share memory.
203
+ - **For cross-request hits on serverless**, bring your own store backed by a shared layer (Upstash Redis, Vercel KV): `cache: myStore`. ai-lcr runs no service of its own — any shared store is yours. A custom store is just `{ get, set }` (see `CacheStore`); `createMemoryCacheStore({ maxEntries })` is exported if you want the bundled one with a cap.
204
+ - **`cache: { store?, ttlMs? }`** sets an entry lifetime.
205
+
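The "bring your own store" option above can be sketched as a thin `{ get, set }` wrapper that serializes entries to strings. Everything here is an assumption made for illustration: `StringKV` is a hypothetical stand-in for a shared backend, `createFakeKv` is an in-memory fake, and `SketchCachedCall` is a simplified version of the package's `CachedCall`. A real Redis or KV client would usually be asynchronous, which the `CacheStore` interface permits by allowing `get` and `set` to return promises.

```typescript
// Hypothetical minimal string key-value interface standing in for a shared backend.
interface StringKV {
  get(key: string): string | undefined;
  set(key: string, value: string, ttlMs?: number): void;
}

// Simplified stand-in for the cached value; the package's CachedCall type is richer.
interface SketchCachedCall {
  kind: "generate";
  result: { text: string };
  meta: { winner: string; costUsd: number; inputTokens: number; outputTokens: number };
}

function createKvBackedStore(kv: StringKV, prefix = "lcr:") {
  return {
    get(key: string): SketchCachedCall | undefined {
      const raw = kv.get(prefix + key);
      if (raw === undefined) return undefined;
      try {
        return JSON.parse(raw) as SketchCachedCall;
      } catch {
        return undefined; // treat a corrupt entry as a miss rather than failing the request
      }
    },
    set(key: string, value: SketchCachedCall, ttlMs?: number): void {
      // The store, not the router, is responsible for serializing the value.
      kv.set(prefix + key, JSON.stringify(value), ttlMs);
    },
  };
}

// In-memory fake used only to exercise the sketch; it ignores the TTL.
function createFakeKv(): StringKV {
  const data = new Map<string, string>();
  return {
    get: (key) => data.get(key),
    set: (key, value) => {
      data.set(key, value);
    },
  };
}

const kvStore = createKvBackedStore(createFakeKv());
kvStore.set("k1", {
  kind: "generate",
  result: { text: "pong" },
  meta: { winner: "provider-a", costUsd: 0.0042, inputTokens: 12, outputTokens: 3 },
});
const roundTripped = kvStore.get("k1");
const missing = kvStore.get("never-written");
```

Because cache keys are whole serialized requests, a production store might hash the key before using it as a backend key; that choice is left open here.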
206
+ A hit settles a `CallRecord` with `cacheHit: true`, `costUsd: 0`, and the money it avoided on its own `cacheHitSavingUsd` line — a *caching* saving, kept separate from routing savings (`baselineUsd − costUsd`), never folded in. Empty completions and usage-less results are never cached. One caveat worth stating: caching makes identical requests return identical responses — exactly right for idempotent / `temperature: 0` calls, a behavior change for sampled ones.
207
+
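The accounting described above can be illustrated with a small TypeScript sketch modeled on `finalizeCacheHit` in this release's bundle. `SketchMeta`, `SketchHitRecord`, and `sketchHitRecord` are hypothetical names, and fields such as `id`, `attempts`, and `latencyMs` are left out for brevity.

```typescript
// Facts stored alongside the cached response when the original call settled.
interface SketchMeta {
  winner: string;
  costUsd: number;
  inputTokens: number;
  outputTokens: number;
}

// Reduced record shape; the real CallRecord has more fields.
interface SketchHitRecord {
  winner: string;
  ok: true;
  failedOver: false;
  inputTokens: number;
  outputTokens: number;
  costUsd: 0;
  cacheHit: true;
  cacheHitSavingUsd?: number;
}

function sketchHitRecord(meta: SketchMeta): SketchHitRecord {
  return {
    winner: meta.winner, // the provider that served the original call
    ok: true,
    failedOver: false,
    inputTokens: meta.inputTokens,
    outputTokens: meta.outputTokens,
    costUsd: 0, // nothing was billed for the replay
    cacheHit: true,
    // The avoided cost is reported only when the original call cost something.
    ...(meta.costUsd > 0 ? { cacheHitSavingUsd: meta.costUsd } : {}),
  };
}

const paidHit = sketchHitRecord({
  winner: "provider-a",
  costUsd: 0.0042,
  inputTokens: 1200,
  outputTokens: 80,
});
const freeHit = sketchHitRecord({
  winner: "provider-b",
  costUsd: 0,
  inputTokens: 50,
  outputTokens: 10,
});

// Caching savings are summed on their own, separately from routing savings.
const totalCacheSavingUsd = [paidHit, freeHit].reduce(
  (sum, record) => sum + (record.cacheHitSavingUsd ?? 0),
  0,
);
```

Keeping `cacheHitSavingUsd` on its own line means a dashboard can add it up without double counting against `baselineUsd − costUsd`.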
208
+ ### Pay less for the call (`promptCache`) — provider prompt cache
209
+
210
+ Different mechanism: the model still runs, but the **static head of the prompt** (your system prompt) bills at the provider's cache-read rate (~0.1× input) on repeats. Anthropic needs an explicit `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix automatically. `promptCache: true` inserts that marker on the last system message for you:
211
+
212
+ ```ts
213
+ const lcr = createLCR({
214
+ models: { /* … */ },
215
+ promptCache: true, // 5-minute window; { ttl: "1h" } for the longer one
216
+ });
217
+ ```
218
+
219
+ It only writes the `anthropic` provider-options namespace, which every other provider ignores — so it's safe on a mixed chain. And it **steps aside entirely** if you set `cacheControl` yourself anywhere in the prompt. The savings then show up exactly as before: `cachedInputTokens` and `cachedSavingUsd` on the `CallRecord` (see [Cache-aware cost](#see-what-happened-oncall) below).
220
+
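The breakpoint insertion described above can be sketched in TypeScript as follows. It is modeled on `withPromptCacheBreakpoint` in this release's bundle, but `SketchMessage`, `CacheControl`, and `sketchBreakpoint` are hypothetical names with simplified message types, so treat it as an illustration rather than the package's exact code.

```typescript
type CacheControl = { type: "ephemeral"; ttl?: "1h" };

// Simplified message shape; real prompt messages carry structured content.
interface SketchMessage {
  role: "system" | "user" | "assistant";
  content: string;
  providerOptions?: { anthropic?: { cacheControl?: CacheControl } };
}

function hasCacheControl(message: SketchMessage): boolean {
  const anthropic = message.providerOptions?.anthropic;
  return anthropic !== undefined && "cacheControl" in anthropic;
}

function sketchBreakpoint(prompt: SketchMessage[], ttl: "5m" | "1h" = "5m"): SketchMessage[] {
  if (prompt.length === 0) return prompt;
  // Step aside if the caller already manages caching on any message.
  if (prompt.some(hasCacheControl)) return prompt;
  // Find the LAST system message: the end of the static prompt head.
  let target = -1;
  for (let i = 0; i < prompt.length; i++) {
    if (prompt[i]?.role === "system") target = i;
  }
  if (target === -1) return prompt;
  const cacheControl: CacheControl =
    ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };
  // Copy rather than mutate, and write only the `anthropic` namespace.
  return prompt.map((message, i) =>
    i !== target
      ? message
      : {
          ...message,
          providerOptions: {
            ...message.providerOptions,
            anthropic: { ...message.providerOptions?.anthropic, cacheControl },
          },
        },
  );
}

const basePrompt: SketchMessage[] = [
  { role: "system", content: "You are a support agent." },
  { role: "system", content: "Long, stable policy text." },
  { role: "user", content: "Where is my order?" },
];
const marked = sketchBreakpoint(basePrompt);
const markedHourly = sketchBreakpoint(basePrompt, "1h");

const callerManaged: SketchMessage[] = [
  { role: "system", content: "Head." },
  {
    role: "user",
    content: "Question",
    providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  },
];
const untouched = sketchBreakpoint(callerManaged);
```

Only the second system message in `marked` gains a marker, and `untouched` is the same array that was passed in, which mirrors the "explicit always wins" behavior.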
185
221
  ## See what happened (`onCall`)
186
222
 
187
223
  `onError`/`onCost` fire separately and uncorrelated, so a failover is hard to read after the fact. `onCall` gives you **one record per request** — the full chain, the winner, the reason for each failed hop, latency, and cost — and `formatCallRecord` turns it into a one-liner you can scan:
package/README.zh-CN.md CHANGED
@@ -144,6 +144,42 @@ DeepInfra 只承载开源权重——没有第一方 Claude / GPT / Gemini。那
144
144
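  2. **Fall through on failure.** On any provider failure (rate limit, 5xx, timeout, **quota exhausted** (402 / billing arrears / insufficient balance), and client errors such as **400**), routing advances to the next provider, and this is stream-safe. Failing over on 400 is deliberate: in an OpenAI-compatible aggregation layer, a 400 often means "*this* provider won't take this request" (an unsupported parameter, a model it doesn't list, a stricter schema) rather than a broken request, so another provider may well be able to serve it. If every provider rejects the request, it still fails and throws the **first** (original) error, which keeps genuine caller bugs debuggable. The only case that never fails over is a caller-initiated cancellation (`AbortSignal`). To restore the old "fail immediately on client errors" behavior, pass `shouldRetry: isRetryableError` to `createLCR`.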
145
145
  3. **Recover.** After an idle window (`resetIntervalMs`, default 60s), routing automatically returns to the cheapest provider.
146
146
 
147
+ ## Cache
148
+
149
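+ There are two completely different "caches" in the LLM world, and ai-lcr does both. Each is off by default, each is just a config flag, and **neither needs you to run any service**.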
150
+
151
+ ### Skip the call entirely (`cache`): response cache
152
+
153
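+ When a request is identical to one already answered, replay the stored response and **call no provider at all**: zero latency, `costUsd: 0`. This is exactly the layer Vercel AI Gateway notably doesn't offer.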
154
+
155
+ ```ts
156
+ const lcr = createLCR({
157
+ models: { /* … */ },
158
+ cache: true, // exact-match response cache (in-process memory by default)
159
+ });
160
+ ```
161
+
162
+ Storage is pluggable, and ai-lcr adds **zero dependencies** for it:
163
+
164
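+ - **`cache: true`** uses process-local memory. It is a real cache on a long-running server, and it is also useful within a single serverless invocation (for example, a sub-call repeated inside an agent loop), but it **does not survive across serverless requests**, because separate function instances don't share memory.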
165
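+ - **For cross-request hits on serverless**, bring your own store backed by a shared layer (Upstash Redis, Vercel KV): `cache: myStore`. ai-lcr runs no service of its own; the shared layer is yours. A custom store is just `{ get, set }` (see `CacheStore`); for the bundled implementation with a cap, use the exported `createMemoryCacheStore({ maxEntries })`.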
166
+ - **`cache: { store?, ttlMs? }`** sets an expiry time.
167
+
168
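+ A hit settles a `CallRecord` with `cacheHit: true` and `costUsd: 0`, and records the money saved on its own `cacheHitSavingUsd` line. This is a **caching** saving, kept separate from routing savings (`baselineUsd − costUsd`) and never mixed in. Empty completions and results without usage are never cached. One key point: caching makes identical requests return identical responses, which is exactly right for idempotent / `temperature: 0` calls and a behavior change for sampled ones.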
169
+
170
+ ### Pay less for the call (`promptCache`): provider prompt cache
171
+
172
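+ A different mechanism: the model still runs, but the **static head of the prompt** (your system prompt) is billed at the provider's cache-read rate (about 0.1× input) on repeats. Anthropic needs an explicit `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix automatically. `promptCache: true` inserts that marker on the last system message for you: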
173
+
174
+ ```ts
175
+ const lcr = createLCR({
176
+ models: { /* … */ },
177
+ promptCache: true, // 5-minute window; use { ttl: "1h" } for the longer one
178
+ });
179
+ ```
180
+
181
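+ It only writes the `anthropic` provider-options namespace, which every other provider ignores, so it is safe on a mixed chain. And as soon as you set `cacheControl` yourself anywhere in the prompt, it **steps aside entirely**. The savings show up as before, in `cachedInputTokens` and `cachedSavingUsd` on the `CallRecord`.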
182
+
147
183
  ## See what happened on every call (`onCall`)
148
184
 
149
185
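  `onError`/`onCost` fire independently and uncorrelated, so it is hard to reconstruct a failover after the fact. `onCall` gives you **one record per request**: the full attempt chain, the provider that finally served it, the reason each hop failed, latency, and cost; `formatCallRecord` turns it into a single scannable log line: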
package/dist/index.cjs CHANGED
@@ -34,6 +34,7 @@ __export(index_exports, {
34
34
  createKunavoMediaAdapter: () => createKunavoMediaAdapter,
35
35
  createLCR: () => createLCR,
36
36
  createMediaLCR: () => createMediaLCR,
37
+ createMemoryCacheStore: () => createMemoryCacheStore,
37
38
  createRunwareMediaAdapter: () => createRunwareMediaAdapter,
38
39
  durationFromInput: () => durationFromInput,
39
40
  formatCallRecord: () => formatCallRecord,
@@ -49,6 +50,105 @@ __export(index_exports, {
49
50
  });
50
51
  module.exports = __toCommonJS(index_exports);
51
52
 
53
+ // src/cache.ts
54
+ function isCacheStore(x) {
55
+ return typeof x === "object" && x !== null && typeof x.get === "function" && typeof x.set === "function";
56
+ }
57
+ function resolveCache(opt) {
58
+ if (!opt) return void 0;
59
+ if (opt === true) return { store: createMemoryCacheStore() };
60
+ if (isCacheStore(opt)) return { store: opt };
61
+ return {
62
+ store: opt.store ?? createMemoryCacheStore(),
63
+ ...opt.ttlMs !== void 0 ? { ttlMs: opt.ttlMs } : {}
64
+ };
65
+ }
66
+ function cacheKeyOf(modelName, options) {
67
+ const rest = options.providerOptions ? Object.entries(options.providerOptions).filter(([ns]) => ns !== "lcr") : [];
68
+ const po = rest.length > 0 ? Object.fromEntries(rest) : void 0;
69
+ return JSON.stringify({
70
+ m: modelName,
71
+ prompt: options.prompt,
72
+ maxOutputTokens: options.maxOutputTokens,
73
+ temperature: options.temperature,
74
+ topP: options.topP,
75
+ topK: options.topK,
76
+ frequencyPenalty: options.frequencyPenalty,
77
+ presencePenalty: options.presencePenalty,
78
+ stopSequences: options.stopSequences,
79
+ seed: options.seed,
80
+ responseFormat: options.responseFormat,
81
+ tools: options.tools,
82
+ toolChoice: options.toolChoice,
83
+ po
84
+ });
85
+ }
86
+ function streamFromParts(parts) {
87
+ return new ReadableStream({
88
+ start(controller) {
89
+ for (const part of parts) controller.enqueue(part);
90
+ controller.close();
91
+ }
92
+ });
93
+ }
94
+ function createMemoryCacheStore(opts = {}) {
95
+ const maxEntries = opts.maxEntries ?? 1e3;
96
+ const map = /* @__PURE__ */ new Map();
97
+ return {
98
+ get(key) {
99
+ const entry = map.get(key);
100
+ if (!entry) return void 0;
101
+ if (entry.expiresAt !== void 0 && entry.expiresAt <= Date.now()) {
102
+ map.delete(key);
103
+ return void 0;
104
+ }
105
+ return entry.value;
106
+ },
107
+ set(key, value, ttlMs) {
108
+ const entry = ttlMs !== void 0 ? { value, expiresAt: Date.now() + ttlMs } : { value };
109
+ map.delete(key);
110
+ map.set(key, entry);
111
+ if (map.size > maxEntries) {
112
+ const oldest = map.keys().next().value;
113
+ if (oldest !== void 0) map.delete(oldest);
114
+ }
115
+ }
116
+ };
117
+ }
118
+
119
+ // src/prompt-cache.ts
120
+ function resolvePromptCache(opt) {
121
+ if (!opt) return void 0;
122
+ if (opt === true) return { ttl: "5m" };
123
+ return { ttl: opt.ttl ?? "5m" };
124
+ }
125
+ function hasAnthropicCacheControl(message) {
126
+ const anthropic = message.providerOptions?.anthropic;
127
+ return !!anthropic && "cacheControl" in anthropic;
128
+ }
129
+ function withPromptCacheBreakpoint(options, cfg) {
130
+ const prompt = options.prompt;
131
+ if (!Array.isArray(prompt) || prompt.length === 0) return options;
132
+ if (prompt.some(hasAnthropicCacheControl)) return options;
133
+ let target = -1;
134
+ for (let i = 0; i < prompt.length; i++) {
135
+ if (prompt[i].role === "system") target = i;
136
+ }
137
+ if (target === -1) return options;
138
+ const cacheControl = cfg.ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };
139
+ const newPrompt = prompt.map((message, i) => {
140
+ if (i !== target) return message;
141
+ return {
142
+ ...message,
143
+ providerOptions: {
144
+ ...message.providerOptions,
145
+ anthropic: { ...message.providerOptions?.anthropic, cacheControl }
146
+ }
147
+ };
148
+ });
149
+ return { ...options, prompt: newPrompt };
150
+ }
151
+
52
152
  // src/fallback.ts
53
153
  var EmptyCompletionError = class extends Error {
54
154
  constructor(provider) {
@@ -447,6 +547,16 @@ var LcrFallbackModel = class {
447
547
  const usageMissing = inputTokens === 0 && outputTokens === 0;
448
548
  const emptyCompletion = inputTokens > 0 && outputTokens === 0;
449
549
  const baselineUsd = this.baselineUsd(inputTokens, outputTokens, cacheReadTokens);
550
+ ctx.settled = {
551
+ meta: {
552
+ winner: provider.label,
553
+ costUsd,
554
+ inputTokens,
555
+ outputTokens,
556
+ ...cacheReadTokens > 0 ? { cachedInputTokens: cacheReadTokens } : {}
557
+ },
558
+ cacheable: !emptyCompletion && !usageMissing
559
+ };
450
560
  this.emitCost({
451
561
  model: this.opts.modelName,
452
562
  provider: provider.label,
@@ -491,6 +601,16 @@ var LcrFallbackModel = class {
491
601
  });
492
602
  }
493
603
  async doGenerate(options) {
604
+ const cache = this.opts.cache;
605
+ const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
606
+ if (cache && cacheKey !== void 0) {
607
+ const hit = await cache.store.get(cacheKey);
608
+ if (hit && hit.kind === "generate") {
609
+ this.finalizeCacheHit(this.startCall(options), hit.meta);
610
+ return hit.result;
611
+ }
612
+ }
613
+ const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
494
614
  const ctx = this.startCall(options);
495
615
  const providers = this.opts.providers;
496
616
  const order = this.routeOrder(this.startIndex());
@@ -501,7 +621,7 @@ var LcrFallbackModel = class {
501
621
  const isLast = pos === order.length - 1;
502
622
  const attemptStart = Date.now();
503
623
  try {
504
- const result = await provider.model.doGenerate(options);
624
+ const result = await provider.model.doGenerate(callOptions);
505
625
  const out = result.usage?.outputTokens?.total ?? 0;
506
626
  const inp = result.usage?.inputTokens?.total ?? 0;
507
627
  if (inp > 0 && out === 0 && !isLast) {
@@ -514,6 +634,9 @@ var LcrFallbackModel = class {
514
634
  this.recordProviderSuccess(idx);
515
635
  this.settleSticky(idx);
516
636
  this.finalizeOk(ctx, provider, attemptStart, result.usage);
637
+ if (cache && cacheKey !== void 0 && ctx.settled?.cacheable) {
638
+ this.storeCache(cacheKey, { kind: "generate", result, meta: ctx.settled.meta });
639
+ }
517
640
  return result;
518
641
  } catch (error) {
519
642
  lastError = error;
@@ -530,7 +653,76 @@ var LcrFallbackModel = class {
530
653
  throw ctx.firstError ?? lastError;
531
654
  }
532
655
  async doStream(options) {
533
- return this.doStreamWithCtx(options, this.startCall(options), this.routeOrder(this.startIndex()), 0);
656
+ const cache = this.opts.cache;
657
+ const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
658
+ if (cache && cacheKey !== void 0) {
659
+ const hit = await cache.store.get(cacheKey);
660
+ if (hit && hit.kind === "stream") {
661
+ this.finalizeCacheHit(this.startCall(options), hit.meta);
662
+ return { stream: streamFromParts(hit.parts) };
663
+ }
664
+ }
665
+ const ctx = this.startCall(options);
666
+ const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
667
+ const inner = await this.doStreamWithCtx(
668
+ callOptions,
669
+ ctx,
670
+ this.routeOrder(this.startIndex()),
671
+ 0
672
+ );
673
+ if (!cache || cacheKey === void 0) return inner;
674
+ const collected = [];
675
+ const self = this;
676
+ const wrapped = inner.stream.pipeThrough(
677
+ new TransformStream({
678
+ transform(part, controller) {
679
+ collected.push(part);
680
+ controller.enqueue(part);
681
+ },
682
+ flush() {
683
+ if (ctx.settled?.cacheable) {
684
+ self.storeCache(cacheKey, { kind: "stream", parts: collected, meta: ctx.settled.meta });
685
+ }
686
+ }
687
+ })
688
+ );
689
+ return { ...inner, stream: wrapped };
690
+ }
691
+ /** A response-cache hit: replay a stored answer with no provider call. Settles
692
+ * one {@link CallRecord} with `cacheHit`, `costUsd: 0`, and the avoided cost
693
+ * on its own `cacheHitSavingUsd` line. */
694
+ finalizeCacheHit(ctx, meta) {
695
+ this.emitCall({
696
+ id: ctx.id,
697
+ model: this.opts.modelName,
698
+ attempts: [{ provider: meta.winner, ok: true, latencyMs: Date.now() - ctx.startedAt }],
699
+ winner: meta.winner,
700
+ ok: true,
701
+ failedOver: false,
702
+ latencyMs: Date.now() - ctx.startedAt,
703
+ inputTokens: meta.inputTokens,
704
+ outputTokens: meta.outputTokens,
705
+ ...meta.cachedInputTokens ? { cachedInputTokens: meta.cachedInputTokens } : {},
706
+ costUsd: 0,
707
+ cacheHit: true,
708
+ ...meta.costUsd > 0 ? { cacheHitSavingUsd: meta.costUsd } : {},
709
+ ...ctx.requestId ? { requestId: ctx.requestId } : {}
710
+ });
711
+ }
712
+ /** Best-effort write to the response cache: a sync throw or a rejected async
713
+ * `set` must never break the request. Caching is an optimization, not a
714
+ * guarantee. */
715
+ storeCache(key, value) {
716
+ const cache = this.opts.cache;
717
+ if (!cache) return;
718
+ try {
719
+ const r = cache.store.set(key, value, cache.ttlMs);
720
+ if (r && typeof r.catch === "function") {
721
+ r.catch(() => {
722
+ });
723
+ }
724
+ } catch {
725
+ }
534
726
  }
535
727
  // The stream's failover recursion re-enters here with the SAME `ctx` and the
536
728
  // SAME `order` snapshot, advancing only the local position `pos`, so a
@@ -2012,12 +2204,16 @@ function createLCR(config) {
2012
2204
  autoPrice = false,
2013
2205
  resetIntervalMs,
2014
2206
  cooldown,
2207
+ cache,
2208
+ promptCache,
2015
2209
  onError,
2016
2210
  onCost,
2017
2211
  onCall,
2018
2212
  shouldRetry,
2019
2213
  defaultCacheReadRatio
2020
2214
  } = config;
2215
+ const resolvedCache = resolveCache(cache);
2216
+ const resolvedPromptCache = resolvePromptCache(promptCache);
2021
2217
  if (defaultCacheReadRatio !== void 0 && (defaultCacheReadRatio < 0 || defaultCacheReadRatio > 1)) {
2022
2218
  throw new Error(
2023
2219
  `ai-lcr: defaultCacheReadRatio must be in [0, 1], got ${defaultCacheReadRatio}`
@@ -2037,7 +2233,18 @@ function createLCR(config) {
2037
2233
  }
2038
2234
  routed.set(
2039
2235
  name,
2040
- new LcrFallbackModel({ modelName: name, providers, resetIntervalMs, cooldown, onError, onCost, onCall, shouldRetry })
2236
+ new LcrFallbackModel({
2237
+ modelName: name,
2238
+ providers,
2239
+ resetIntervalMs,
2240
+ cooldown,
2241
+ ...resolvedCache ? { cache: resolvedCache } : {},
2242
+ ...resolvedPromptCache ? { promptCache: resolvedPromptCache } : {},
2243
+ onError,
2244
+ onCost,
2245
+ onCall,
2246
+ shouldRetry
2247
+ })
2041
2248
  );
2042
2249
  }
2043
2250
  return (modelName) => {
@@ -2066,6 +2273,7 @@ function createLCR(config) {
2066
2273
  createKunavoMediaAdapter,
2067
2274
  createLCR,
2068
2275
  createMediaLCR,
2276
+ createMemoryCacheStore,
2069
2277
  createRunwareMediaAdapter,
2070
2278
  durationFromInput,
2071
2279
  formatCallRecord,
package/dist/index.d.cts CHANGED
@@ -1,4 +1,120 @@
1
- import { LanguageModelV3 } from '@ai-sdk/provider';
1
+ import { LanguageModelV3GenerateResult, LanguageModelV3StreamPart, LanguageModelV3 } from '@ai-sdk/provider';
2
+
3
+ /**
4
+ * Exact-match response cache (Feature ②).
5
+ *
6
+ * Unlike provider-side prompt caching (see ./prompt-cache), this skips the
7
+ * model call *entirely*: when a request is byte-for-byte identical to one
8
+ * already answered, the stored response is replayed and no provider is touched
9
+ * — zero latency, zero cost. This is the layer Vercel AI Gateway notably does
10
+ * NOT offer, and it composes naturally with ai-lcr's cost truth: a hit settles
11
+ * a {@link CallRecord} with `costUsd: 0` and `cacheHit: true`, and reports the
12
+ * money it avoided as `cacheHitSavingUsd` — a saving kept on its own line, the
13
+ * same discipline as prompt-cache savings, never folded into routing savings.
14
+ *
15
+ * Storage is pluggable and the package ships ZERO dependencies for it. The
16
+ * default `createMemoryCacheStore()` is a process-local Map: real on a
17
+ * long-running server, and useful within a single serverless invocation (an
18
+ * agent loop that repeats a sub-call), but it does NOT survive across
19
+ * serverless requests — different function instances don't share memory. For
20
+ * cross-request hits on serverless, inject your own store backed by a shared
21
+ * layer (Upstash Redis, Vercel KV). ai-lcr never runs that service; you bring
22
+ * it. A custom store is responsible for serializing the stored value.
23
+ *
24
+ * Determinism caveat: caching makes identical requests return identical
25
+ * responses. That is exactly right for idempotent / `temperature: 0` calls and
26
+ * changes behavior for sampled ones (the variety is gone). Enable it where a
27
+ * repeated answer is acceptable.
28
+ */
29
+
30
+ /** A stored, replayable LLM response plus the cost it originally incurred. */
31
+ type CachedCall = {
32
+ kind: "generate";
33
+ result: LanguageModelV3GenerateResult;
34
+ meta: CachedMeta;
35
+ } | {
36
+ kind: "stream";
37
+ parts: LanguageModelV3StreamPart[];
38
+ meta: CachedMeta;
39
+ };
40
+ /** Settle-time facts carried in the cache entry so a hit can report honest
41
+ * tokens, the originally-serving provider, and the money the hit avoided. */
42
+ interface CachedMeta {
43
+ /** The provider that served the original (cached) call. */
44
+ winner: string;
45
+ /** What the original call actually cost — i.e. the money a hit avoids. */
46
+ costUsd: number;
47
+ inputTokens: number;
48
+ outputTokens: number;
49
+ /** Prompt-cache reads on the original call, when reported (> 0 only). */
50
+ cachedInputTokens?: number;
51
+ }
52
+ /**
53
+ * Pluggable response-cache backend. Implement it over Redis / Vercel KV / any
54
+ * shared store to get cross-request hits on serverless; the bundled
55
+ * {@link createMemoryCacheStore} is the dependency-free default. `get`/`set`
56
+ * may be sync or async. A `set` that throws must never break the request — the
57
+ * engine treats the cache as best-effort.
58
+ */
59
+ interface CacheStore {
60
+ get(key: string): CachedCall | undefined | Promise<CachedCall | undefined>;
61
+ set(key: string, value: CachedCall, ttlMs?: number): void | Promise<void>;
62
+ }
63
+ /** Public response-cache config. See {@link LCRConfig.cache}. */
64
+ interface CacheOptions {
65
+ /** Where to store responses. Defaults to a process-local in-memory store. */
66
+ store?: CacheStore;
67
+ /** Entry lifetime in ms. Omit for no expiry (entries live until evicted). */
68
+ ttlMs?: number;
69
+ }
70
+ /** Tuning for {@link createMemoryCacheStore}. */
71
+ interface MemoryCacheOptions {
72
+ /**
73
+ * Cap on stored entries. When exceeded, the oldest-inserted entry is dropped
74
+ * (insertion-order FIFO — Map preserves it). Keeps an unbounded key space
75
+ * (every distinct prompt) from leaking memory in a long-running process.
76
+ * Default 1000.
77
+ */
78
+ maxEntries?: number;
79
+ }
80
+ /**
81
+ * A process-local in-memory {@link CacheStore} with optional TTL and a
82
+ * bounded entry count. Zero dependencies. See the module header for where this
83
+ * is (and isn't) useful — notably it does NOT share across serverless requests.
84
+ */
85
+ declare function createMemoryCacheStore(opts?: MemoryCacheOptions): CacheStore;
86
+
87
+ /**
88
+ * Automatic prompt-cache breakpoints (Feature ①).
89
+ *
90
+ * Provider-side prompt caching (Anthropic, MiniMax) caches a *prefix* of the
91
+ * prompt so repeated calls bill the static head — the system prompt — at the
92
+ * cache-read rate (~0.1× input) instead of full price. The model still runs;
93
+ * only the input cost of the cached prefix drops. Anthropic needs an explicit
94
+ * `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix
95
+ * automatically with no marker at all.
96
+ *
97
+ * This module adds, when `promptCache` is enabled, a single `cacheControl`
98
+ * breakpoint on the LAST system message — the canonical large, stable head of
99
+ * a prompt. It only writes the `anthropic` provider-options namespace, which
100
+ * every non-Anthropic provider ignores, so it is safe to apply on every leg of
101
+ * a mixed chain: Anthropic reads it, the rest pass it through untouched. No
102
+ * external service and no storage — the cache itself lives at the provider.
103
+ *
104
+ * It steps aside the moment the caller is managing caching themselves: if ANY
105
+ * message already carries an `anthropic.cacheControl`, the prompt is returned
106
+ * unchanged. Same "explicit always wins" discipline as the price table.
107
+ */
108
+
109
+ /** Tuning for automatic prompt-cache breakpoints. See {@link LCRConfig.promptCache}. */
110
+ interface PromptCacheOptions {
111
+ /**
112
+ * Cache lifetime for the injected breakpoint. `"5m"` (the Anthropic default,
113
+ * a cheaper cache write) or `"1h"` (a pricier write that pays off when the
114
+ * same prefix is reused over a longer span). Default `"5m"`.
115
+ */
116
+ ttl?: "5m" | "1h";
117
+ }
2
118
 
3
119
  /**
4
120
  * Owned failover engine for ai-lcr.
@@ -176,6 +292,21 @@ interface CallRecord {
176
292
  * be surfaced separately, never folded into `baselineUsd - costUsd`.
177
293
  */
178
294
  cachedSavingUsd?: number;
295
+ /**
296
+ * True when this request was served from ai-lcr's exact-match RESPONSE cache
297
+ * — no provider was called at all. Distinct from `cachedInputTokens` /
298
+ * `cachedSavingUsd`, which are the *provider's* prompt-cache (the model still
299
+ * ran). On a hit `costUsd` is 0, `winner` is the provider that served the
300
+ * ORIGINAL (now-cached) call, and `attempts` has a single synthetic entry.
301
+ */
302
+ cacheHit?: boolean;
303
+ /**
304
+ * On a `cacheHit`, the money the hit avoided — i.e. what the original call
305
+ * actually cost when it ran. Present only when > 0. Like `cachedSavingUsd`
306
+ * this is a caching saving, NOT a routing saving, so it lives on its own line
307
+ * and is never folded into `baselineUsd - costUsd`.
308
+ */
309
+ cacheHitSavingUsd?: number;
179
310
  /**
180
311
  * Caller-supplied correlation id, read from `providerOptions.lcr.requestId`
181
312
  * on the call. Multi-step tool loops emit one record per `doStream`/
@@ -926,6 +1057,34 @@ interface LCRConfig {
926
1057
  * unchanged routing, no provider is ever skipped). See {@link CooldownOptions}.
927
1058
  */
928
1059
  cooldown?: boolean | CooldownOptions;
1060
+ /**
1061
+ * Exact-match RESPONSE cache: when a request is identical to one already
1062
+ * answered, replay the stored response and call no provider at all — zero
1063
+ * latency, `costUsd: 0`, and the avoided cost reported as `cacheHitSavingUsd`
1064
+ * on the {@link CallRecord} (with `cacheHit: true`). Off by default.
1065
+ *
1066
+ * `true` uses a process-local in-memory store; pass a {@link CacheStore} to
1067
+ * bring your own (Redis / Vercel KV — required for cross-request hits on
1068
+ * serverless, where memory isn't shared); pass `{ store?, ttlMs? }` to set a
1069
+ * TTL. ai-lcr runs no service of its own — any shared store is yours.
1070
+ *
1071
+ * Caching makes identical requests return identical responses: ideal for
1072
+ * idempotent / `temperature: 0` calls, a behavior change for sampled ones.
1073
+ * Empty completions and usage-less results are never cached.
1074
+ */
1075
+ cache?: boolean | CacheStore | CacheOptions;
1076
+ /**
1077
+ * Automatic provider-side PROMPT caching: insert a `cache_control` breakpoint
1078
+ * on the last system message so the static prompt head bills at the
1079
+ * cache-read rate (~0.1× input) on repeats. The model still runs — this only
1080
+ * lowers input cost, it does not skip the call (that's `cache`). Only
1081
+ * Anthropic / MiniMax need the marker; OpenAI / Gemini / DeepSeek cache the
1082
+ * prefix automatically and ignore it, so this is safe on a mixed chain.
1083
+ *
1084
+ * `true` for the 5-minute default, `{ ttl: "1h" }` for the longer window.
1085
+ * Off by default; steps aside entirely if you set `cacheControl` yourself.
1086
+ */
1087
+ promptCache?: boolean | PromptCacheOptions;
929
1088
  /** Called when a provider errors and routing falls through to the next. */
930
1089
  onError?: (error: Error, provider: string) => void;
931
1090
  /** Called after each successful call with the serving provider, tokens, and cost. */
@@ -969,4 +1128,4 @@ type LCRRouter = (modelName: string) => LanguageModelV3;
969
1128
  */
970
1129
  declare function createLCR(config: LCRConfig): LCRRouter;
971
1130
 
972
- export { type BillableContext, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, OFFICIAL_PRICES, type PriceComparisonRow, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
1131
+ export { type BillableContext, type CacheOptions, type CacheStore, type CachedCall, type CachedMeta, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, type MemoryCacheOptions, OFFICIAL_PRICES, type PriceComparisonRow, type PromptCacheOptions, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createMemoryCacheStore, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
package/dist/index.d.ts CHANGED
@@ -1,4 +1,120 @@
1
- import { LanguageModelV3 } from '@ai-sdk/provider';
1
+ import { LanguageModelV3GenerateResult, LanguageModelV3StreamPart, LanguageModelV3 } from '@ai-sdk/provider';
2
+
3
+ /**
4
+ * Exact-match response cache (Feature ②).
5
+ *
6
+ * Unlike provider-side prompt caching (see ./prompt-cache), this skips the
7
+ * model call *entirely*: when a request is byte-for-byte identical to one
8
+ * already answered, the stored response is replayed and no provider is touched
9
+ * — zero latency, zero cost. This is the layer Vercel AI Gateway notably does
10
+ * NOT offer, and it composes naturally with ai-lcr's cost truth: a hit settles
11
+ * a {@link CallRecord} with `costUsd: 0` and `cacheHit: true`, and reports the
12
+ * money it avoided as `cacheHitSavingUsd` — a saving kept on its own line, the
13
+ * same discipline as prompt-cache savings, never folded into routing savings.
14
+ *
15
+ * Storage is pluggable and the package ships ZERO dependencies for it. The
16
+ * default `createMemoryCacheStore()` is a process-local Map: real on a
17
+ * long-running server, and useful within a single serverless invocation (an
18
+ * agent loop that repeats a sub-call), but it does NOT survive across
19
+ * serverless requests — different function instances don't share memory. For
20
+ * cross-request hits on serverless, inject your own store backed by a shared
21
+ * layer (Upstash Redis, Vercel KV). ai-lcr never runs that service; you bring
22
+ * it. A custom store is responsible for serializing the stored value.
23
+ *
24
+ * Determinism caveat: caching makes identical requests return identical
25
+ * responses. That is exactly right for idempotent / `temperature: 0` calls and
26
+ * changes behavior for sampled ones (the variety is gone). Enable it where a
27
+ * repeated answer is acceptable.
28
+ */
29
+
30
+ /** A stored, replayable LLM response plus the cost it originally incurred. */
31
+ type CachedCall = {
32
+ kind: "generate";
33
+ result: LanguageModelV3GenerateResult;
34
+ meta: CachedMeta;
35
+ } | {
36
+ kind: "stream";
37
+ parts: LanguageModelV3StreamPart[];
38
+ meta: CachedMeta;
39
+ };
40
+ /** Settle-time facts carried in the cache entry so a hit can report honest
41
+ * tokens, the originally-serving provider, and the money the hit avoided. */
42
+ interface CachedMeta {
43
+ /** The provider that served the original (cached) call. */
44
+ winner: string;
45
+ /** What the original call actually cost — i.e. the money a hit avoids. */
46
+ costUsd: number;
47
+ inputTokens: number;
48
+ outputTokens: number;
49
+ /** Prompt-cache reads on the original call, when reported (> 0 only). */
50
+ cachedInputTokens?: number;
51
+ }
52
+ /**
53
+ * Pluggable response-cache backend. Implement it over Redis / Vercel KV / any
54
+ * shared store to get cross-request hits on serverless; the bundled
55
+ * {@link createMemoryCacheStore} is the dependency-free default. `get`/`set`
56
+ * may be sync or async. A `set` that throws must never break the request — the
57
+ * engine treats the cache as best-effort.
58
+ */
59
+ interface CacheStore {
60
+ get(key: string): CachedCall | undefined | Promise<CachedCall | undefined>;
61
+ set(key: string, value: CachedCall, ttlMs?: number): void | Promise<void>;
62
+ }
63
+ /** Public response-cache config. See {@link LCRConfig.cache}. */
64
+ interface CacheOptions {
65
+ /** Where to store responses. Defaults to a process-local in-memory store. */
66
+ store?: CacheStore;
67
+ /** Entry lifetime in ms. Omit for no expiry (entries live until evicted). */
68
+ ttlMs?: number;
69
+ }
70
+ /** Tuning for {@link createMemoryCacheStore}. */
71
+ interface MemoryCacheOptions {
72
+ /**
73
+ * Cap on stored entries. When exceeded, the oldest-inserted entry is dropped
74
+ * (insertion-order FIFO — Map preserves it). Keeps an unbounded key space
75
+ * (every distinct prompt) from leaking memory in a long-running process.
76
+ * Default 1000.
77
+ */
78
+ maxEntries?: number;
79
+ }
80
+ /**
81
+ * A process-local in-memory {@link CacheStore} with optional TTL and a
82
+ * bounded entry count. Zero dependencies. See the module header for where this
83
+ * is (and isn't) useful — notably it does NOT share across serverless requests.
84
+ */
85
+ declare function createMemoryCacheStore(opts?: MemoryCacheOptions): CacheStore;
86
+
87
+ /**
88
+ * Automatic prompt-cache breakpoints (Feature ①).
89
+ *
90
+ * Provider-side prompt caching (Anthropic, MiniMax) caches a *prefix* of the
91
+ * prompt so repeated calls bill the static head — the system prompt — at the
92
+ * cache-read rate (~0.1× input) instead of full price. The model still runs;
93
+ * only the input cost of the cached prefix drops. Anthropic needs an explicit
94
+ * `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix
95
+ * automatically with no marker at all.
96
+ *
97
+ * This module adds, when `promptCache` is enabled, a single `cacheControl`
98
+ * breakpoint on the LAST system message — the canonical large, stable head of
99
+ * a prompt. It only writes the `anthropic` provider-options namespace, which
100
+ * every non-Anthropic provider ignores, so it is safe to apply on every leg of
101
+ * a mixed chain: Anthropic reads it, the rest pass it through untouched. No
102
+ * external service and no storage — the cache itself lives at the provider.
103
+ *
104
+ * It steps aside the moment the caller is managing caching themselves: if ANY
105
+ * message already carries an `anthropic.cacheControl`, the prompt is returned
106
+ * unchanged. Same "explicit always wins" discipline as the price table.
107
+ */
108
+
109
+ /** Tuning for automatic prompt-cache breakpoints. See {@link LCRConfig.promptCache}. */
110
+ interface PromptCacheOptions {
111
+ /**
112
+ * Cache lifetime for the injected breakpoint. `"5m"` (the Anthropic default,
113
+ * a cheaper cache write) or `"1h"` (a pricier write that pays off when the
114
+ * same prefix is reused over a longer span). Default `"5m"`.
115
+ */
116
+ ttl?: "5m" | "1h";
117
+ }
2
118
 
3
119
  /**
4
120
  * Owned failover engine for ai-lcr.
@@ -176,6 +292,21 @@ interface CallRecord {
176
292
  * be surfaced separately, never folded into `baselineUsd - costUsd`.
177
293
  */
178
294
  cachedSavingUsd?: number;
295
+ /**
296
+ * True when this request was served from ai-lcr's exact-match RESPONSE cache
297
+ * — no provider was called at all. Distinct from `cachedInputTokens` /
298
+ * `cachedSavingUsd`, which are the *provider's* prompt-cache (the model still
299
+ * ran). On a hit `costUsd` is 0, `winner` is the provider that served the
300
+ * ORIGINAL (now-cached) call, and `attempts` has a single synthetic entry.
301
+ */
302
+ cacheHit?: boolean;
303
+ /**
304
+ * On a `cacheHit`, the money the hit avoided — i.e. what the original call
305
+ * actually cost when it ran. Present only when > 0. Like `cachedSavingUsd`
306
+ * this is a caching saving, NOT a routing saving, so it lives on its own line
307
+ * and is never folded into `baselineUsd - costUsd`.
308
+ */
309
+ cacheHitSavingUsd?: number;
179
310
  /**
180
311
  * Caller-supplied correlation id, read from `providerOptions.lcr.requestId`
181
312
  * on the call. Multi-step tool loops emit one record per `doStream`/
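Reviewer sketch (not part of the package): the doc comments above keep three kinds of saving apart: routing (`baselineUsd - costUsd`), provider prompt cache (`cachedSavingUsd`), and response cache (`cacheHitSavingUsd`). A consumer that aggregates records could preserve that separation roughly as follows. `RecordSlice`, `SavingsSummary` and `summarizeSavings` are hypothetical names, and treating `baselineUsd` as an optional record field is an assumption inferred from the comments, not something this hunk shows.

```typescript
// Local slice of the CallRecord fields this sketch reads; the real type has more.
interface RecordSlice {
  costUsd: number;
  baselineUsd?: number; // ASSUMPTION: optional field, inferred from the doc comments
  cachedSavingUsd?: number;
  cacheHit?: boolean;
  cacheHitSavingUsd?: number;
}

// Hypothetical report shape: one line per kind of saving.
interface SavingsSummary {
  spentUsd: number;
  routingSavingUsd: number;
  promptCacheSavingUsd: number;
  responseCacheSavingUsd: number;
  cacheHits: number;
}

function summarizeSavings(records: RecordSlice[]): SavingsSummary {
  const summary: SavingsSummary = {
    spentUsd: 0,
    routingSavingUsd: 0,
    promptCacheSavingUsd: 0,
    responseCacheSavingUsd: 0,
    cacheHits: 0,
  };
  for (const r of records) {
    summary.spentUsd += r.costUsd;
    summary.promptCacheSavingUsd += r.cachedSavingUsd ?? 0;
    if (r.cacheHit) {
      // A response-cache hit costs 0; its saving goes on its own line and is
      // never counted as a routing saving.
      summary.cacheHits += 1;
      summary.responseCacheSavingUsd += r.cacheHitSavingUsd ?? 0;
      continue;
    }
    if (r.baselineUsd !== undefined) {
      summary.routingSavingUsd += r.baselineUsd - r.costUsd;
    }
  }
  return summary;
}
```

Skipping the routing line on a hit is a design choice of this sketch: no provider was chosen for that request, so there is no routing decision to credit.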
@@ -926,6 +1057,34 @@ interface LCRConfig {
926
1057
  * unchanged routing, no provider is ever skipped). See {@link CooldownOptions}.
927
1058
  */
928
1059
  cooldown?: boolean | CooldownOptions;
1060
+ /**
1061
+ * Exact-match RESPONSE cache: when a request is identical to one already
1062
+ * answered, replay the stored response and call no provider at all — zero
1063
+ * latency, `costUsd: 0`, and the avoided cost reported as `cacheHitSavingUsd`
1064
+ * on the {@link CallRecord} (with `cacheHit: true`). Off by default.
1065
+ *
1066
+ * `true` uses a process-local in-memory store; pass a {@link CacheStore} to
1067
+ * bring your own (Redis / Vercel KV — required for cross-request hits on
1068
+ * serverless, where memory isn't shared); pass `{ store?, ttlMs? }` to set a
1069
+ * TTL. ai-lcr runs no service of its own — any shared store is yours.
1070
+ *
1071
+ * Caching makes identical requests return identical responses: ideal for
1072
+ * idempotent / `temperature: 0` calls, a behavior change for sampled ones.
1073
+ * Empty completions and usage-less results are never cached.
1074
+ */
1075
+ cache?: boolean | CacheStore | CacheOptions;
1076
+ /**
1077
+ * Automatic provider-side PROMPT caching: insert a `cache_control` breakpoint
1078
+ * on the last system message so the static prompt head bills at the
1079
+ * cache-read rate (~0.1× input) on repeats. The model still runs — this only
1080
+ * lowers input cost, it does not skip the call (that's `cache`). Only
1081
+ * Anthropic / MiniMax need the marker; OpenAI / Gemini / DeepSeek cache the
1082
+ * prefix automatically and ignore it, so this is safe on a mixed chain.
1083
+ *
1084
+ * `true` for the 5-minute default, `{ ttl: "1h" }` for the longer window.
1085
+ * Off by default; steps aside entirely if you set `cacheControl` yourself.
1086
+ */
1087
+ promptCache?: boolean | PromptCacheOptions;
929
1088
  /** Called when a provider errors and routing falls through to the next. */
930
1089
  onError?: (error: Error, provider: string) => void;
931
1090
  /** Called after each successful call with the serving provider, tokens, and cost. */
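Reviewer sketch (not part of the package): the `promptCache` comment above describes two rules, namely that the marker goes on the last system message and that an explicit `cacheControl` anywhere in the prompt disables the automatic one. The typed sketch below illustrates both on a simplified message shape. `Message`, `ProviderOptions` and `markLastSystemMessage` are illustrative names only; the real prompt type comes from `@ai-sdk/provider` and is richer than this.

```typescript
type CacheControl = { type: "ephemeral"; ttl?: "1h" };

// Simplified stand-ins for the SDK's prompt types (assumption: trimmed for the sketch).
interface ProviderOptions {
  anthropic?: { cacheControl?: CacheControl; [key: string]: unknown };
  [namespace: string]: { [key: string]: unknown } | undefined;
}
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: unknown;
  providerOptions?: ProviderOptions;
}

function hasExplicitMarker(message: Message): boolean {
  const anthropic = message.providerOptions?.anthropic;
  return anthropic !== undefined && "cacheControl" in anthropic;
}

function markLastSystemMessage(prompt: Message[], ttl: "5m" | "1h" = "5m"): Message[] {
  // Explicit always wins: if the caller manages caching, change nothing.
  if (prompt.some(hasExplicitMarker)) return prompt;

  // Find the LAST system message, the usual large static head.
  let target = -1;
  prompt.forEach((message, i) => {
    if (message.role === "system") target = i;
  });
  if (target === -1) return prompt;

  const cacheControl: CacheControl =
    ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };

  // Copy rather than mutate, so the caller's prompt stays untouched.
  return prompt.map((message, i): Message =>
    i !== target
      ? message
      : {
          ...message,
          providerOptions: {
            ...message.providerOptions,
            anthropic: { ...message.providerOptions?.anthropic, cacheControl },
          },
        },
  );
}
```

Because only the `anthropic` namespace is written, other namespaces already present on the message are carried over unchanged.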
@@ -969,4 +1128,4 @@ type LCRRouter = (modelName: string) => LanguageModelV3;
969
1128
  */
970
1129
  declare function createLCR(config: LCRConfig): LCRRouter;
971
1130
 
972
- export { type BillableContext, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, OFFICIAL_PRICES, type PriceComparisonRow, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
1131
+ export { type BillableContext, type CacheOptions, type CacheStore, type CachedCall, type CachedMeta, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, type MemoryCacheOptions, OFFICIAL_PRICES, type PriceComparisonRow, type PromptCacheOptions, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createMemoryCacheStore, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
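Reviewer sketch (not part of the package): the `CacheStore` contract declared above leaves serialization to a custom store. A minimal way to satisfy it over a shared string store might look like the following. `StringKV`, `createKvCacheStore` and `createFakeKv` are hypothetical names, the local `CachedCall` / `CacheStore` types are trimmed mirrors of the declarations in this diff, and the fake client only stands in for whatever Redis or KV client is actually used.

```typescript
// Trimmed local mirrors of the declared shapes (payloads typed as unknown here).
interface CachedMeta {
  winner: string;
  costUsd: number;
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens?: number;
}
type CachedCall =
  | { kind: "generate"; result: unknown; meta: CachedMeta }
  | { kind: "stream"; parts: unknown[]; meta: CachedMeta };
interface CacheStore {
  get(key: string): CachedCall | undefined | Promise<CachedCall | undefined>;
  set(key: string, value: CachedCall, ttlMs?: number): void | Promise<void>;
}

// HYPOTHETICAL minimal client interface for a shared string store.
interface StringKV {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlMs?: number): Promise<void>;
}

function createKvCacheStore(kv: StringKV, prefix: string = "lcr:"): CacheStore {
  return {
    async get(key: string): Promise<CachedCall | undefined> {
      const raw = await kv.get(prefix + key);
      if (raw === null) return undefined;
      try {
        return JSON.parse(raw) as CachedCall;
      } catch {
        return undefined; // treat a corrupt entry as a miss
      }
    },
    async set(key: string, value: CachedCall, ttlMs?: number): Promise<void> {
      await kv.set(prefix + key, JSON.stringify(value), ttlMs);
    },
  };
}

// In-memory stand-in for the shared store, with an injectable clock for tests.
function createFakeKv(now: () => number = Date.now): StringKV {
  const rows = new Map<string, { value: string; expiresAt?: number }>();
  return {
    async get(key: string): Promise<string | null> {
      const row = rows.get(key);
      if (row === undefined) return null;
      if (row.expiresAt !== undefined && row.expiresAt <= now()) {
        rows.delete(key);
        return null;
      }
      return row.value;
    },
    async set(key: string, value: string, ttlMs?: number): Promise<void> {
      rows.set(key, ttlMs !== undefined ? { value, expiresAt: now() + ttlMs } : { value });
    },
  };
}
```

Plain JSON is the simplest codec but only an assumption about what is good enough: any value in a stored result that does not survive a JSON round trip (dates, binary data) would need a richer encoding. Keys are the full serialized request, so a real store may also want to hash them before use.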
package/dist/index.js CHANGED
@@ -1,3 +1,102 @@
1
+ // src/cache.ts
2
+ function isCacheStore(x) {
3
+ return typeof x === "object" && x !== null && typeof x.get === "function" && typeof x.set === "function";
4
+ }
5
+ function resolveCache(opt) {
6
+ if (!opt) return void 0;
7
+ if (opt === true) return { store: createMemoryCacheStore() };
8
+ if (isCacheStore(opt)) return { store: opt };
9
+ return {
10
+ store: opt.store ?? createMemoryCacheStore(),
11
+ ...opt.ttlMs !== void 0 ? { ttlMs: opt.ttlMs } : {}
12
+ };
13
+ }
14
+ function cacheKeyOf(modelName, options) {
15
+ const rest = options.providerOptions ? Object.entries(options.providerOptions).filter(([ns]) => ns !== "lcr") : [];
16
+ const po = rest.length > 0 ? Object.fromEntries(rest) : void 0;
17
+ return JSON.stringify({
18
+ m: modelName,
19
+ prompt: options.prompt,
20
+ maxOutputTokens: options.maxOutputTokens,
21
+ temperature: options.temperature,
22
+ topP: options.topP,
23
+ topK: options.topK,
24
+ frequencyPenalty: options.frequencyPenalty,
25
+ presencePenalty: options.presencePenalty,
26
+ stopSequences: options.stopSequences,
27
+ seed: options.seed,
28
+ responseFormat: options.responseFormat,
29
+ tools: options.tools,
30
+ toolChoice: options.toolChoice,
31
+ po
32
+ });
33
+ }
34
+ function streamFromParts(parts) {
35
+ return new ReadableStream({
36
+ start(controller) {
37
+ for (const part of parts) controller.enqueue(part);
38
+ controller.close();
39
+ }
40
+ });
41
+ }
42
+ function createMemoryCacheStore(opts = {}) {
43
+ const maxEntries = opts.maxEntries ?? 1e3;
44
+ const map = /* @__PURE__ */ new Map();
45
+ return {
46
+ get(key) {
47
+ const entry = map.get(key);
48
+ if (!entry) return void 0;
49
+ if (entry.expiresAt !== void 0 && entry.expiresAt <= Date.now()) {
50
+ map.delete(key);
51
+ return void 0;
52
+ }
53
+ return entry.value;
54
+ },
55
+ set(key, value, ttlMs) {
56
+ const entry = ttlMs !== void 0 ? { value, expiresAt: Date.now() + ttlMs } : { value };
57
+ map.delete(key);
58
+ map.set(key, entry);
59
+ if (map.size > maxEntries) {
60
+ const oldest = map.keys().next().value;
61
+ if (oldest !== void 0) map.delete(oldest);
62
+ }
63
+ }
64
+ };
65
+ }
66
+
67
+ // src/prompt-cache.ts
68
+ function resolvePromptCache(opt) {
69
+ if (!opt) return void 0;
70
+ if (opt === true) return { ttl: "5m" };
71
+ return { ttl: opt.ttl ?? "5m" };
72
+ }
73
+ function hasAnthropicCacheControl(message) {
74
+ const anthropic = message.providerOptions?.anthropic;
75
+ return !!anthropic && "cacheControl" in anthropic;
76
+ }
77
+ function withPromptCacheBreakpoint(options, cfg) {
78
+ const prompt = options.prompt;
79
+ if (!Array.isArray(prompt) || prompt.length === 0) return options;
80
+ if (prompt.some(hasAnthropicCacheControl)) return options;
81
+ let target = -1;
82
+ for (let i = 0; i < prompt.length; i++) {
83
+ if (prompt[i].role === "system") target = i;
84
+ }
85
+ if (target === -1) return options;
86
+ const cacheControl = cfg.ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };
87
+ const newPrompt = prompt.map((message, i) => {
88
+ if (i !== target) return message;
89
+ return {
90
+ ...message,
91
+ providerOptions: {
92
+ ...message.providerOptions,
93
+ anthropic: { ...message.providerOptions?.anthropic, cacheControl }
94
+ }
95
+ };
96
+ });
97
+ return { ...options, prompt: newPrompt };
98
+ }
99
+
1
100
  // src/fallback.ts
2
101
  var EmptyCompletionError = class extends Error {
3
102
  constructor(provider) {
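Reviewer sketch (not part of the package): `cacheKeyOf` above builds the key with `JSON.stringify` and drops the `lcr` provider-options namespace. Two consequences are worth seeing in isolation: requests that differ only in `providerOptions.lcr` (for example the `requestId`) share a key, while the same message written with a different property order does not. The reduced function below keeps just three of the key fields to show this; `sketchCacheKey` and `KeyedOptions` are illustrative names, not package exports.

```typescript
interface KeyedOptions {
  prompt: unknown;
  temperature?: number;
  providerOptions?: { [namespace: string]: { [key: string]: unknown } };
}

// Reduced version of the key construction: model, prompt, temperature, and
// every provider-options namespace except "lcr".
function sketchCacheKey(modelName: string, options: KeyedOptions): string {
  const all = options.providerOptions ?? {};
  const kept: { [namespace: string]: unknown } = {};
  for (const ns of Object.keys(all)) {
    if (ns !== "lcr") kept[ns] = all[ns];
  }
  return JSON.stringify({
    m: modelName,
    prompt: options.prompt,
    temperature: options.temperature,
    // undefined properties are omitted by JSON.stringify, so "no provider
    // options" and "only lcr options" produce the same key
    po: Object.keys(kept).length > 0 ? kept : undefined,
  });
}

const prompt = [{ role: "user", content: "hi" }];

// Same request, different correlation ids: identical keys.
const withIdA = sketchCacheKey("m", { prompt, temperature: 0, providerOptions: { lcr: { requestId: "r1" } } });
const withIdB = sketchCacheKey("m", { prompt, temperature: 0, providerOptions: { lcr: { requestId: "r2" } } });

// Same content, different property order: different keys, so no hit.
const reordered = sketchCacheKey("m", { prompt: [{ content: "hi", role: "user" }], temperature: 0 });

console.log(withIdA === withIdB, withIdA === reordered);
```

This is what "byte-for-byte identical" means in practice: callers that build prompts through different code paths may miss the cache even when the requests are semantically equal.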
@@ -396,6 +495,16 @@ var LcrFallbackModel = class {
396
495
  const usageMissing = inputTokens === 0 && outputTokens === 0;
397
496
  const emptyCompletion = inputTokens > 0 && outputTokens === 0;
398
497
  const baselineUsd = this.baselineUsd(inputTokens, outputTokens, cacheReadTokens);
498
+ ctx.settled = {
499
+ meta: {
500
+ winner: provider.label,
501
+ costUsd,
502
+ inputTokens,
503
+ outputTokens,
504
+ ...cacheReadTokens > 0 ? { cachedInputTokens: cacheReadTokens } : {}
505
+ },
506
+ cacheable: !emptyCompletion && !usageMissing
507
+ };
399
508
  this.emitCost({
400
509
  model: this.opts.modelName,
401
510
  provider: provider.label,
@@ -440,6 +549,16 @@ var LcrFallbackModel = class {
440
549
  });
441
550
  }
442
551
  async doGenerate(options) {
552
+ const cache = this.opts.cache;
553
+ const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
554
+ if (cache && cacheKey !== void 0) {
555
+ const hit = await cache.store.get(cacheKey);
556
+ if (hit && hit.kind === "generate") {
557
+ this.finalizeCacheHit(this.startCall(options), hit.meta);
558
+ return hit.result;
559
+ }
560
+ }
561
+ const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
443
562
  const ctx = this.startCall(options);
444
563
  const providers = this.opts.providers;
445
564
  const order = this.routeOrder(this.startIndex());
@@ -450,7 +569,7 @@ var LcrFallbackModel = class {
450
569
  const isLast = pos === order.length - 1;
451
570
  const attemptStart = Date.now();
452
571
  try {
453
- const result = await provider.model.doGenerate(options);
572
+ const result = await provider.model.doGenerate(callOptions);
454
573
  const out = result.usage?.outputTokens?.total ?? 0;
455
574
  const inp = result.usage?.inputTokens?.total ?? 0;
456
575
  if (inp > 0 && out === 0 && !isLast) {
@@ -463,6 +582,9 @@ var LcrFallbackModel = class {
463
582
  this.recordProviderSuccess(idx);
464
583
  this.settleSticky(idx);
465
584
  this.finalizeOk(ctx, provider, attemptStart, result.usage);
585
+ if (cache && cacheKey !== void 0 && ctx.settled?.cacheable) {
586
+ this.storeCache(cacheKey, { kind: "generate", result, meta: ctx.settled.meta });
587
+ }
466
588
  return result;
467
589
  } catch (error) {
468
590
  lastError = error;
@@ -479,7 +601,76 @@ var LcrFallbackModel = class {
479
601
  throw ctx.firstError ?? lastError;
480
602
  }
481
603
  async doStream(options) {
482
- return this.doStreamWithCtx(options, this.startCall(options), this.routeOrder(this.startIndex()), 0);
604
+ const cache = this.opts.cache;
605
+ const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
606
+ if (cache && cacheKey !== void 0) {
607
+ const hit = await cache.store.get(cacheKey);
608
+ if (hit && hit.kind === "stream") {
609
+ this.finalizeCacheHit(this.startCall(options), hit.meta);
610
+ return { stream: streamFromParts(hit.parts) };
611
+ }
612
+ }
613
+ const ctx = this.startCall(options);
614
+ const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
615
+ const inner = await this.doStreamWithCtx(
616
+ callOptions,
617
+ ctx,
618
+ this.routeOrder(this.startIndex()),
619
+ 0
620
+ );
621
+ if (!cache || cacheKey === void 0) return inner;
622
+ const collected = [];
623
+ const self = this;
624
+ const wrapped = inner.stream.pipeThrough(
625
+ new TransformStream({
626
+ transform(part, controller) {
627
+ collected.push(part);
628
+ controller.enqueue(part);
629
+ },
630
+ flush() {
631
+ if (ctx.settled?.cacheable) {
632
+ self.storeCache(cacheKey, { kind: "stream", parts: collected, meta: ctx.settled.meta });
633
+ }
634
+ }
635
+ })
636
+ );
637
+ return { ...inner, stream: wrapped };
638
+ }
639
+ /** A response-cache hit: replay a stored answer with no provider call. Settles
640
+ * one {@link CallRecord} with `cacheHit`, `costUsd: 0`, and the avoided cost
641
+ * on its own `cacheHitSavingUsd` line. */
642
+ finalizeCacheHit(ctx, meta) {
643
+ this.emitCall({
644
+ id: ctx.id,
645
+ model: this.opts.modelName,
646
+ attempts: [{ provider: meta.winner, ok: true, latencyMs: Date.now() - ctx.startedAt }],
647
+ winner: meta.winner,
648
+ ok: true,
649
+ failedOver: false,
650
+ latencyMs: Date.now() - ctx.startedAt,
651
+ inputTokens: meta.inputTokens,
652
+ outputTokens: meta.outputTokens,
653
+ ...meta.cachedInputTokens ? { cachedInputTokens: meta.cachedInputTokens } : {},
654
+ costUsd: 0,
655
+ cacheHit: true,
656
+ ...meta.costUsd > 0 ? { cacheHitSavingUsd: meta.costUsd } : {},
657
+ ...ctx.requestId ? { requestId: ctx.requestId } : {}
658
+ });
659
+ }
660
+ /** Best-effort write to the response cache: a sync throw or a rejected async
661
+ * `set` must never break the request. Caching is an optimization, not a
662
+ * guarantee. */
663
+ storeCache(key, value) {
664
+ const cache = this.opts.cache;
665
+ if (!cache) return;
666
+ try {
667
+ const r = cache.store.set(key, value, cache.ttlMs);
668
+ if (r && typeof r.catch === "function") {
669
+ r.catch(() => {
670
+ });
671
+ }
672
+ } catch {
673
+ }
483
674
  }
484
675
  // The stream's failover recursion re-enters here with the SAME `ctx` and the
485
676
  // SAME `order` snapshot, advancing only the local position `pos`, so a
@@ -1961,12 +2152,16 @@ function createLCR(config) {
1961
2152
  autoPrice = false,
1962
2153
  resetIntervalMs,
1963
2154
  cooldown,
2155
+ cache,
2156
+ promptCache,
1964
2157
  onError,
1965
2158
  onCost,
1966
2159
  onCall,
1967
2160
  shouldRetry,
1968
2161
  defaultCacheReadRatio
1969
2162
  } = config;
2163
+ const resolvedCache = resolveCache(cache);
2164
+ const resolvedPromptCache = resolvePromptCache(promptCache);
1970
2165
  if (defaultCacheReadRatio !== void 0 && (defaultCacheReadRatio < 0 || defaultCacheReadRatio > 1)) {
1971
2166
  throw new Error(
1972
2167
  `ai-lcr: defaultCacheReadRatio must be in [0, 1], got ${defaultCacheReadRatio}`
@@ -1986,7 +2181,18 @@ function createLCR(config) {
1986
2181
  }
1987
2182
  routed.set(
1988
2183
  name,
1989
- new LcrFallbackModel({ modelName: name, providers, resetIntervalMs, cooldown, onError, onCost, onCall, shouldRetry })
2184
+ new LcrFallbackModel({
2185
+ modelName: name,
2186
+ providers,
2187
+ resetIntervalMs,
2188
+ cooldown,
2189
+ ...resolvedCache ? { cache: resolvedCache } : {},
2190
+ ...resolvedPromptCache ? { promptCache: resolvedPromptCache } : {},
2191
+ onError,
2192
+ onCost,
2193
+ onCall,
2194
+ shouldRetry
2195
+ })
1990
2196
  );
1991
2197
  }
1992
2198
  return (modelName) => {
@@ -2014,6 +2220,7 @@ export {
2014
2220
  createKunavoMediaAdapter,
2015
2221
  createLCR,
2016
2222
  createMediaLCR,
2223
+ createMemoryCacheStore,
2017
2224
  createRunwareMediaAdapter,
2018
2225
  durationFromInput,
2019
2226
  formatCallRecord,
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "ai-lcr",
3
- "version": "0.6.2",
3
+ "version": "0.6.3",
4
4
  "description": "Least Cost Routing for LLMs — route every model call to the cheapest available provider, fall back automatically, and track real cost. Built for the Vercel AI SDK.",
5
5
  "keywords": [
6
6
  "ai",