npm - ai-lcr - Versions diffs - 0.6.2 → 0.6.3 - Mend

ai-lcr 0.6.2 → 0.6.3


Files changed (8)

package/CHANGELOG.md CHANGED

@@ -4,6 +4,41 @@ All notable changes to `ai-lcr` are documented here. The format follows
 [Keep a Changelog](https://keepachangelog.com/), and the project adheres to
 [Semantic Versioning](https://semver.org/).
+## [0.6.3] — 2026-06-11
+Caching — both kinds, each off by default and each a pure config flag with no
+service to run. The response cache is the layer Vercel AI Gateway notably
+doesn't offer; ai-lcr does it in-process and folds it into its cost truth.
+### Added
+- **`createLCR({ cache })`** — exact-match **response cache**. An identical
+  request replays the stored response and calls **no provider at all**: zero
+  latency, `costUsd: 0`. Storage is pluggable with **zero added dependencies**:
+  `cache: true` uses a bundled in-memory store, `cache: myStore` brings your own
+  (Redis / Vercel KV — required for cross-request hits on serverless, where
+  memory isn't shared), `cache: { store?, ttlMs? }` sets a TTL. A hit settles a
+  `CallRecord` with `cacheHit: true` and the avoided cost on its own
+  `cacheHitSavingUsd` line (a caching saving, never folded into routing savings).
+  Empty completions and usage-less results are never cached. New exports:
+  `createMemoryCacheStore`, types `CacheStore` / `CacheOptions` / `CachedCall` /
+  `CachedMeta` / `MemoryCacheOptions`.
+- **`createLCR({ promptCache })`** — automatic provider-side **prompt-cache**
+  breakpoint. Inserts an Anthropic `cache_control` marker on the last system
+  message so the static prompt head bills at the cache-read rate (~0.1× input)
+  on repeats; the model still runs. `true` for the 5-minute window,
+  `{ ttl: "1h" }` for the longer one. Only writes the `anthropic` namespace
+  (ignored by other providers, safe on a mixed chain) and steps aside if you set
+  `cacheControl` yourself. Savings surface via the existing `cachedInputTokens` /
+  `cachedSavingUsd`. New exported type `PromptCacheOptions`.
+- `CallRecord` gains **`cacheHit`** and **`cacheHitSavingUsd`** for response-cache
+  hits.
+### Compatibility
+- Fully backward compatible. Both `cache` and `promptCache` are **off by
+  default** — unset, routing behaves exactly as before.
 ## [0.6.2] — 2026-06-11
 Circuit breaker for persistently-failing providers. Until now the only recovery
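
The 0.6.3 entry above keeps three kinds of saving on separate lines: routing savings (`baselineUsd − costUsd`), provider prompt-cache savings (`cachedSavingUsd`), and response-cache savings (`cacheHitSavingUsd`). A minimal sketch of an aggregation that respects that separation follows. `CallRecordLike` and `summarizeSavings` are hypothetical names, and `baselineUsd` is assumed to be an optional numeric field on the record; check the package's own `CallRecord` type before relying on it.

```typescript
// Local mirror of the CallRecord fields this sketch reads (assumed shape,
// not imported from ai-lcr).
interface CallRecordLike {
  costUsd: number;
  baselineUsd?: number;       // assumed: reference cost used for routing savings
  cachedSavingUsd?: number;   // provider prompt-cache saving (the model still ran)
  cacheHit?: boolean;         // response-cache hit (no provider was called)
  cacheHitSavingUsd?: number; // what the original, now-cached call cost
}

interface SavingsSummary {
  spentUsd: number;
  routingSavingUsd: number;
  promptCacheSavingUsd: number;
  responseCacheSavingUsd: number;
  cacheHits: number;
}

function summarizeSavings(records: CallRecordLike[]): SavingsSummary {
  const summary: SavingsSummary = {
    spentUsd: 0,
    routingSavingUsd: 0,
    promptCacheSavingUsd: 0,
    responseCacheSavingUsd: 0,
    cacheHits: 0,
  };
  for (const r of records) {
    summary.spentUsd += r.costUsd;
    if (r.cacheHit) {
      // A hit called no provider, so it contributes only a caching saving.
      summary.cacheHits += 1;
      summary.responseCacheSavingUsd += r.cacheHitSavingUsd ?? 0;
      continue;
    }
    if (r.baselineUsd !== undefined) {
      summary.routingSavingUsd += r.baselineUsd - r.costUsd;
    }
    summary.promptCacheSavingUsd += r.cachedSavingUsd ?? 0;
  }
  return summary;
}
```

Feeding this from an `onCall` handler would give one running total per saving type, so a dashboard never reports a cache hit as if cheaper routing had produced it.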

package/README.md CHANGED

@@ -182,6 +182,42 @@ const lcr = createLCR({
 With `cooldown` on, a provider that fails enough times in a window is *skipped* for a cooldown period rather than tried every request — and a single success clears it. Defaults are 3 failures / 60s → 60s cooldown; tune with `cooldown: { maxFailures, windowMs, cooldownMs }`. It only ever **reorders** the attempt list (cooling providers go last), so if *every* provider is cooling a request still tries them all rather than failing outright. Off by default — routing is unchanged unless you opt in.
+## Cache
+There are two completely different "caches" in LLM land, and ai-lcr does both — each off by default, each a pure config flag with no service to run.
+### Skip the call entirely (`cache`) — response cache
+When a request is byte-for-byte identical to one already answered, replay the stored response and call **no provider at all**: zero latency, `costUsd: 0`. This is the layer Vercel AI Gateway notably *doesn't* offer.
+```ts
+const lcr = createLCR({
+  models: { /* … */ },
+  cache: true, // exact-match response cache (in-memory by default)
+});
+```
+Storage is pluggable and ai-lcr ships **zero dependencies** for it:
+- **`cache: true`** uses a process-local in-memory store. Real on a long-running server, and useful *within* one serverless invocation (an agent loop repeating a sub-call) — but it does **not** survive across serverless requests, because separate function instances don't share memory.
+- **For cross-request hits on serverless**, bring your own store backed by a shared layer (Upstash Redis, Vercel KV): `cache: myStore`. ai-lcr runs no service of its own — any shared store is yours. A custom store is just `{ get, set }` (see `CacheStore`); `createMemoryCacheStore({ maxEntries })` is exported if you want the bundled one with a cap.
+- **`cache: { store?, ttlMs? }`** sets an entry lifetime.
+A hit settles a `CallRecord` with `cacheHit: true`, `costUsd: 0`, and the money it avoided on its own `cacheHitSavingUsd` line — a *caching* saving, kept separate from routing savings (`baselineUsd − costUsd`), never folded in. Empty completions and usage-less results are never cached. One caveat worth stating: caching makes identical requests return identical responses — exactly right for idempotent / `temperature: 0` calls, a behavior change for sampled ones.
+### Pay less for the call (`promptCache`) — provider prompt cache
+Different mechanism: the model still runs, but the **static head of the prompt** (your system prompt) bills at the provider's cache-read rate (~0.1× input) on repeats. Anthropic needs an explicit `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix automatically. `promptCache: true` inserts that marker on the last system message for you:
+```ts
+const lcr = createLCR({
+  models: { /* … */ },
+  promptCache: true,          // 5-minute window; { ttl: "1h" } for the longer one
+});
+```
+It only writes the `anthropic` provider-options namespace, which every other provider ignores — so it's safe on a mixed chain. And it **steps aside entirely** if you set `cacheControl` yourself anywhere in the prompt. The savings then show up exactly as before: `cachedInputTokens` and `cachedSavingUsd` on the `CallRecord` (see [Cache-aware cost](#see-what-happened-oncall) below).
 ## See what happened (`onCall`)
 `onError`/`onCost` fire separately and uncorrelated, so a failover is hard to read after the fact. `onCall` gives you **one record per request** — the full chain, the winner, the reason for each failed hop, latency, and cost — and `formatCallRecord` turns it into a one-liner you can scan:
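
The README hunk above says a custom store is "just `{ get, set }`". A minimal sketch of such a store over a shared key-value service could look like the following. `SharedKv`, `StoreLike`, `createSharedStore`, and the `lcr:` prefix are all hypothetical; real Redis or KV SDKs use different method names and TTL options. The sketch also assumes the cached value survives a JSON round trip, which may not hold if a response carries binary data.

```typescript
// Hypothetical minimal client surface for a shared key-value service.
interface SharedKv {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlMs?: number): Promise<void>;
}

// The { get, set } shape described in the README, made generic so the sketch
// does not need the package's CachedCall type.
interface StoreLike<V> {
  get(key: string): Promise<V | undefined>;
  set(key: string, value: V, ttlMs?: number): Promise<void>;
}

// Serialization is the custom store's job; JSON is assumed to be sufficient.
function encodeEntry<V>(value: V): string {
  return JSON.stringify(value);
}

function decodeEntry<V>(raw: string | null): V | undefined {
  if (raw === null) return undefined; // plain miss
  try {
    return JSON.parse(raw) as V;
  } catch {
    return undefined; // treat a corrupt entry as a miss rather than failing the call
  }
}

function createSharedStore<V>(kv: SharedKv, prefix = "lcr:"): StoreLike<V> {
  return {
    async get(key: string): Promise<V | undefined> {
      return decodeEntry<V>(await kv.get(prefix + key));
    },
    async set(key: string, value: V, ttlMs?: number): Promise<void> {
      await kv.set(prefix + key, encodeEntry(value), ttlMs);
    },
  };
}
```

The result would presumably be passed as `cache: createSharedStore(client)`. Because the cache key is a JSON string that includes the whole prompt, a production store may prefer to hash the key before using it; that step is left out here.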

package/README.zh-CN.md CHANGED

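@@ -144,6 +144,42 @@ DeepInfra hosts open weights only, with no first-party Claude / GPT / Gemini. So
 2. **Fall through on failure.** Any provider failure (rate limits, 5xx, timeouts, **exhausted quota** such as 402 / unpaid balance / insufficient funds, and client errors such as **400**) advances to the next provider, and this is safe for streaming. Failing over on 400 is deliberate: in an OpenAI-compatible aggregation layer, a 400 often means "*this* provider won't take the request" (an unsupported parameter, a model it doesn't list, a stricter schema) rather than a broken request, so another provider can likely serve it. If every provider rejects the request, it still fails and throws the **first** (original) error, which keeps genuine caller bugs debuggable. The only case that never fails over is a caller-initiated cancellation (`AbortSignal`). To restore the old "fail immediately on client errors" behavior, pass `shouldRetry: isRetryableError` to `createLCR`.
 3. **Recover.** After an idle window (`resetIntervalMs`, default 60s), routing automatically returns to the cheapest provider.
+## Cache
+There are two completely different "caches" in the LLM world, and ai-lcr does both. Each is off by default and each is just a config flag: **you don't need to run any service**.
+### Skip the whole call (`cache`): response cache
+When a request is identical to one already answered, replay the stored response and **call no provider at all**: zero latency, `costUsd: 0`. This is the layer Vercel AI Gateway notably doesn't offer.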
+```ts
+const lcr = createLCR({
+  models: { /* … */ },
+  cache: true, // exact-match response cache (in-process memory by default)
+});
+```
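+Storage is pluggable, and ai-lcr adds **zero dependencies** for it:
+- **`cache: true`** uses in-process memory. On a long-running server it is a real cache, and it is also useful within a single serverless invocation (for example, a repeated sub-call in an agent loop), but it **does not survive across serverless requests**, because separate function instances don't share memory.
+- **For cross-request hits on serverless**, bring your own store backed by a shared layer (Upstash Redis, Vercel KV): `cache: myStore`. ai-lcr runs no service of its own; the shared layer is yours. A custom store is just `{ get, set }` (see `CacheStore`); for the bundled implementation with a cap, use the exported `createMemoryCacheStore({ maxEntries })`.
+- **`cache: { store?, ttlMs? }`** sets an expiry time.
+A hit settles a `CallRecord` with `cacheHit: true` and `costUsd: 0`, and records the money saved on its own `cacheHitSavingUsd` line. This is a **caching** saving, kept separate from routing savings (`baselineUsd − costUsd`) and never mixed in. Empty completions and usage-less results are never cached. One point to note: caching makes identical requests return identical responses, which is exactly right for idempotent / `temperature: 0` calls and a behavior change for sampled ones.
+### Make this call cheaper (`promptCache`): provider prompt cache
+A different mechanism: the model still runs, but the **static head of the prompt** (your system prompt) bills at the provider's cache-read rate (about 0.1× input) on repeats. Anthropic needs an explicit `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix automatically. `promptCache: true` inserts that marker on the last system message for you: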
+```ts
+const lcr = createLCR({
+  models: { /* … */ },
+  promptCache: true,          // 5-minute window; use { ttl: "1h" } for a longer one
+});
+```
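+It only writes the `anthropic` provider-options namespace, which every other provider ignores, so it is safe on a mixed chain. And as soon as you set `cacheControl` yourself in the prompt, it **steps aside entirely**. The savings show up as before in `cachedInputTokens` and `cachedSavingUsd` on the `CallRecord`.
 ## See what happened on every call (`onCall`)
 `onError`/`onCost` fire independently and uncorrelated, so it is hard to reconstruct a failover after the fact. `onCall` gives you **one record per request**: the full attempt chain, the final serving provider, the reason each hop failed, latency, and cost. `formatCallRecord` turns it into a one-line log you can scan: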

package/dist/index.cjs CHANGED

@@ -34,6 +34,7 @@ __export(index_exports, {
   createKunavoMediaAdapter: () => createKunavoMediaAdapter,
   createLCR: () => createLCR,
   createMediaLCR: () => createMediaLCR,
+  createMemoryCacheStore: () => createMemoryCacheStore,
   createRunwareMediaAdapter: () => createRunwareMediaAdapter,
   durationFromInput: () => durationFromInput,
   formatCallRecord: () => formatCallRecord,
@@ -49,6 +50,105 @@ __export(index_exports, {
 });
 module.exports = __toCommonJS(index_exports);
+// src/cache.ts
+function isCacheStore(x) {
+  return typeof x === "object" && x !== null && typeof x.get === "function" && typeof x.set === "function";
+}
+function resolveCache(opt) {
+  if (!opt) return void 0;
+  if (opt === true) return { store: createMemoryCacheStore() };
+  if (isCacheStore(opt)) return { store: opt };
+  return {
+    store: opt.store ?? createMemoryCacheStore(),
+    ...opt.ttlMs !== void 0 ? { ttlMs: opt.ttlMs } : {}
+  };
+}
+function cacheKeyOf(modelName, options) {
+  const rest = options.providerOptions ? Object.entries(options.providerOptions).filter(([ns]) => ns !== "lcr") : [];
+  const po = rest.length > 0 ? Object.fromEntries(rest) : void 0;
+  return JSON.stringify({
+    m: modelName,
+    prompt: options.prompt,
+    maxOutputTokens: options.maxOutputTokens,
+    temperature: options.temperature,
+    topP: options.topP,
+    topK: options.topK,
+    frequencyPenalty: options.frequencyPenalty,
+    presencePenalty: options.presencePenalty,
+    stopSequences: options.stopSequences,
+    seed: options.seed,
+    responseFormat: options.responseFormat,
+    tools: options.tools,
+    toolChoice: options.toolChoice,
+    po
+  });
+}
+function streamFromParts(parts) {
+  return new ReadableStream({
+    start(controller) {
+      for (const part of parts) controller.enqueue(part);
+      controller.close();
+    }
+  });
+}
+function createMemoryCacheStore(opts = {}) {
+  const maxEntries = opts.maxEntries ?? 1e3;
+  const map = /* @__PURE__ */ new Map();
+  return {
+    get(key) {
+      const entry = map.get(key);
+      if (!entry) return void 0;
+      if (entry.expiresAt !== void 0 && entry.expiresAt <= Date.now()) {
+        map.delete(key);
+        return void 0;
+      }
+      return entry.value;
+    },
+    set(key, value, ttlMs) {
+      const entry = ttlMs !== void 0 ? { value, expiresAt: Date.now() + ttlMs } : { value };
+      map.delete(key);
+      map.set(key, entry);
+      if (map.size > maxEntries) {
+        const oldest = map.keys().next().value;
+        if (oldest !== void 0) map.delete(oldest);
+      }
+    }
+  };
+}
+// src/prompt-cache.ts
+function resolvePromptCache(opt) {
+  if (!opt) return void 0;
+  if (opt === true) return { ttl: "5m" };
+  return { ttl: opt.ttl ?? "5m" };
+}
+function hasAnthropicCacheControl(message) {
+  const anthropic = message.providerOptions?.anthropic;
+  return !!anthropic && "cacheControl" in anthropic;
+}
+function withPromptCacheBreakpoint(options, cfg) {
+  const prompt = options.prompt;
+  if (!Array.isArray(prompt) || prompt.length === 0) return options;
+  if (prompt.some(hasAnthropicCacheControl)) return options;
+  let target = -1;
+  for (let i = 0; i < prompt.length; i++) {
+    if (prompt[i].role === "system") target = i;
+  }
+  if (target === -1) return options;
+  const cacheControl = cfg.ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };
+  const newPrompt = prompt.map((message, i) => {
+    if (i !== target) return message;
+    return {
+      ...message,
+      providerOptions: {
+        ...message.providerOptions,
+        anthropic: { ...message.providerOptions?.anthropic, cacheControl }
+      }
+    };
+  });
+  return { ...options, prompt: newPrompt };
+}
 // src/fallback.ts
 var EmptyCompletionError = class extends Error {
   constructor(provider) {
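
To make the behavior of `withPromptCacheBreakpoint` in the hunk above concrete, here is a typed, simplified re-statement that operates on the prompt array only. `Message` and `addBreakpoint` are hypothetical names, and message content is reduced to a plain string, which is a simplification of the real prompt shape. This sketches the rule the bundled code appears to implement; it is not the package's own source.

```typescript
type ProviderOptions = Record<string, Record<string, unknown>>;

// Simplified message shape (assumption: content reduced to a string).
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
  providerOptions?: ProviderOptions;
}

function hasAnthropicCacheControl(message: Message): boolean {
  const anthropic = message.providerOptions?.anthropic;
  return anthropic !== undefined && "cacheControl" in anthropic;
}

function addBreakpoint(prompt: Message[], ttl: "5m" | "1h"): Message[] {
  // Explicit always wins: any caller-set cacheControl leaves the prompt untouched.
  if (prompt.some(hasAnthropicCacheControl)) return prompt;

  // Find the LAST system message; with none, there is nothing to mark.
  let target = -1;
  prompt.forEach((message, i) => {
    if (message.role === "system") target = i;
  });
  if (target === -1) return prompt;

  const cacheControl: Record<string, unknown> =
    ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };

  // Copy rather than mutate, and preserve any other provider namespaces.
  return prompt.map((message, i): Message => {
    if (i !== target) return message;
    return {
      ...message,
      providerOptions: {
        ...message.providerOptions,
        anthropic: { ...message.providerOptions?.anthropic, cacheControl },
      },
    };
  });
}
```

With two system messages followed by a user message, only the second system message gains the marker, and the input array is left unmodified.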
@@ -447,6 +547,16 @@ var LcrFallbackModel = class {
     const usageMissing = inputTokens === 0 && outputTokens === 0;
     const emptyCompletion = inputTokens > 0 && outputTokens === 0;
     const baselineUsd = this.baselineUsd(inputTokens, outputTokens, cacheReadTokens);
+    ctx.settled = {
+      meta: {
+        winner: provider.label,
+        costUsd,
+        inputTokens,
+        outputTokens,
+        ...cacheReadTokens > 0 ? { cachedInputTokens: cacheReadTokens } : {}
+      },
+      cacheable: !emptyCompletion && !usageMissing
+    };
     this.emitCost({
       model: this.opts.modelName,
       provider: provider.label,
@@ -491,6 +601,16 @@ var LcrFallbackModel = class {
     });
   }
   async doGenerate(options) {
+    const cache = this.opts.cache;
+    const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
+    if (cache && cacheKey !== void 0) {
+      const hit = await cache.store.get(cacheKey);
+      if (hit && hit.kind === "generate") {
+        this.finalizeCacheHit(this.startCall(options), hit.meta);
+        return hit.result;
+      }
+    }
+    const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
     const ctx = this.startCall(options);
     const providers = this.opts.providers;
     const order = this.routeOrder(this.startIndex());
@@ -501,7 +621,7 @@ var LcrFallbackModel = class {
       const isLast = pos === order.length - 1;
       const attemptStart = Date.now();
       try {
-        const result = await provider.model.doGenerate(options);
+        const result = await provider.model.doGenerate(callOptions);
         const out = result.usage?.outputTokens?.total ?? 0;
         const inp = result.usage?.inputTokens?.total ?? 0;
         if (inp > 0 && out === 0 && !isLast) {
@@ -514,6 +634,9 @@ var LcrFallbackModel = class {
         this.recordProviderSuccess(idx);
         this.settleSticky(idx);
         this.finalizeOk(ctx, provider, attemptStart, result.usage);
+        if (cache && cacheKey !== void 0 && ctx.settled?.cacheable) {
+          this.storeCache(cacheKey, { kind: "generate", result, meta: ctx.settled.meta });
+        }
         return result;
       } catch (error) {
         lastError = error;
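
The lookup at the top of `doGenerate` depends on `cacheKeyOf`, so "identical request" means "identical serialized key". A reduced sketch of that key function shows two consequences: the `lcr` provider-options namespace (where the correlation `requestId` is read from) does not affect the key, while any sampling parameter does. `CallOptionsLike` and `cacheKeySketch` are hypothetical, and only a few of the real key fields are included.

```typescript
// Reduced stand-in for the call options; the bundled key covers more fields.
interface CallOptionsLike {
  prompt: unknown;
  temperature?: number;
  providerOptions?: Record<string, Record<string, unknown>>;
}

function cacheKeySketch(modelName: string, options: CallOptionsLike): string {
  // Drop the `lcr` namespace so per-request metadata cannot cause a miss.
  const rest = Object.entries(options.providerOptions ?? {}).filter(
    ([namespace]) => namespace !== "lcr",
  );
  const po = rest.length > 0 ? Object.fromEntries(rest) : undefined;
  // JSON.stringify omits undefined properties, so an unset field and an
  // explicitly undefined one serialize to the same key.
  return JSON.stringify({
    m: modelName,
    prompt: options.prompt,
    temperature: options.temperature,
    po,
  });
}
```

One design consequence worth noting: `JSON.stringify` is sensitive to property order, so two requests that differ only in the order of keys inside `providerOptions` would produce different keys and miss each other.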
@@ -530,7 +653,76 @@ var LcrFallbackModel = class {
     throw ctx.firstError ?? lastError;
   }
   async doStream(options) {
-    return this.doStreamWithCtx(options, this.startCall(options), this.routeOrder(this.startIndex()), 0);
+    const cache = this.opts.cache;
+    const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
+    if (cache && cacheKey !== void 0) {
+      const hit = await cache.store.get(cacheKey);
+      if (hit && hit.kind === "stream") {
+        this.finalizeCacheHit(this.startCall(options), hit.meta);
+        return { stream: streamFromParts(hit.parts) };
+      }
+    }
+    const ctx = this.startCall(options);
+    const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
+    const inner = await this.doStreamWithCtx(
+      callOptions,
+      ctx,
+      this.routeOrder(this.startIndex()),
+      0
+    );
+    if (!cache || cacheKey === void 0) return inner;
+    const collected = [];
+    const self = this;
+    const wrapped = inner.stream.pipeThrough(
+      new TransformStream({
+        transform(part, controller) {
+          collected.push(part);
+          controller.enqueue(part);
+        },
+        flush() {
+          if (ctx.settled?.cacheable) {
+            self.storeCache(cacheKey, { kind: "stream", parts: collected, meta: ctx.settled.meta });
+          }
+        }
+      })
+    );
+    return { ...inner, stream: wrapped };
+  }
+  /** A response-cache hit: replay a stored answer with no provider call. Settles
+   *  one {@link CallRecord} with `cacheHit`, `costUsd: 0`, and the avoided cost
+   *  on its own `cacheHitSavingUsd` line. */
+  finalizeCacheHit(ctx, meta) {
+    this.emitCall({
+      id: ctx.id,
+      model: this.opts.modelName,
+      attempts: [{ provider: meta.winner, ok: true, latencyMs: Date.now() - ctx.startedAt }],
+      winner: meta.winner,
+      ok: true,
+      failedOver: false,
+      latencyMs: Date.now() - ctx.startedAt,
+      inputTokens: meta.inputTokens,
+      outputTokens: meta.outputTokens,
+      ...meta.cachedInputTokens ? { cachedInputTokens: meta.cachedInputTokens } : {},
+      costUsd: 0,
+      cacheHit: true,
+      ...meta.costUsd > 0 ? { cacheHitSavingUsd: meta.costUsd } : {},
+      ...ctx.requestId ? { requestId: ctx.requestId } : {}
+    });
+  }
+  /** Best-effort write to the response cache: a sync throw or a rejected async
+   *  `set` must never break the request. Caching is an optimization, not a
+   *  guarantee. */
+  storeCache(key, value) {
+    const cache = this.opts.cache;
+    if (!cache) return;
+    try {
+      const r = cache.store.set(key, value, cache.ttlMs);
+      if (r && typeof r.catch === "function") {
+        r.catch(() => {
+        });
+      }
+    } catch {
+    }
   }
   // The stream's failover recursion re-enters here with the SAME `ctx` and the
   // SAME `order` snapshot, advancing only the local position `pos`, so a
@@ -2012,12 +2204,16 @@ function createLCR(config) {
     autoPrice = false,
     resetIntervalMs,
     cooldown,
+    cache,
+    promptCache,
     onError,
     onCost,
     onCall,
     shouldRetry,
     defaultCacheReadRatio
   } = config;
+  const resolvedCache = resolveCache(cache);
+  const resolvedPromptCache = resolvePromptCache(promptCache);
   if (defaultCacheReadRatio !== void 0 && (defaultCacheReadRatio < 0 || defaultCacheReadRatio > 1)) {
     throw new Error(
       `ai-lcr: defaultCacheReadRatio must be in [0, 1], got ${defaultCacheReadRatio}`
@@ -2037,7 +2233,18 @@ function createLCR(config) {
     }
     routed.set(
       name,
-      new LcrFallbackModel({ modelName: name, providers, resetIntervalMs, cooldown, onError, onCost, onCall, shouldRetry })
+      new LcrFallbackModel({
+        modelName: name,
+        providers,
+        resetIntervalMs,
+        cooldown,
+        ...resolvedCache ? { cache: resolvedCache } : {},
+        ...resolvedPromptCache ? { promptCache: resolvedPromptCache } : {},
+        onError,
+        onCost,
+        onCall,
+        shouldRetry
+      })
     );
   }
   return (modelName) => {
@@ -2066,6 +2273,7 @@ function createLCR(config) {
   createKunavoMediaAdapter,
   createLCR,
   createMediaLCR,
+  createMemoryCacheStore,
   createRunwareMediaAdapter,
   durationFromInput,
   formatCallRecord,

package/dist/index.d.cts CHANGED

@@ -1,4 +1,120 @@
-import { LanguageModelV3 } from '@ai-sdk/provider';
+import { LanguageModelV3GenerateResult, LanguageModelV3StreamPart, LanguageModelV3 } from '@ai-sdk/provider';
+/**
+ * Exact-match response cache (Feature ②).
+ *
+ * Unlike provider-side prompt caching (see ./prompt-cache), this skips the
+ * model call *entirely*: when a request is byte-for-byte identical to one
+ * already answered, the stored response is replayed and no provider is touched
+ * — zero latency, zero cost. This is the layer Vercel AI Gateway notably does
+ * NOT offer, and it composes naturally with ai-lcr's cost truth: a hit settles
+ * a {@link CallRecord} with `costUsd: 0` and `cacheHit: true`, and reports the
+ * money it avoided as `cacheHitSavingUsd` — a saving kept on its own line, the
+ * same discipline as prompt-cache savings, never folded into routing savings.
+ *
+ * Storage is pluggable and the package ships ZERO dependencies for it. The
+ * default `createMemoryCacheStore()` is a process-local Map: real on a
+ * long-running server, and useful within a single serverless invocation (an
+ * agent loop that repeats a sub-call), but it does NOT survive across
+ * serverless requests — different function instances don't share memory. For
+ * cross-request hits on serverless, inject your own store backed by a shared
+ * layer (Upstash Redis, Vercel KV). ai-lcr never runs that service; you bring
+ * it. A custom store is responsible for serializing the stored value.
+ *
+ * Determinism caveat: caching makes identical requests return identical
+ * responses. That is exactly right for idempotent / `temperature: 0` calls and
+ * changes behavior for sampled ones (the variety is gone). Enable it where a
+ * repeated answer is acceptable.
+ */
+/** A stored, replayable LLM response plus the cost it originally incurred. */
+type CachedCall = {
+    kind: "generate";
+    result: LanguageModelV3GenerateResult;
+    meta: CachedMeta;
+} | {
+    kind: "stream";
+    parts: LanguageModelV3StreamPart[];
+    meta: CachedMeta;
+};
+/** Settle-time facts carried in the cache entry so a hit can report honest
+ *  tokens, the originally-serving provider, and the money the hit avoided. */
+interface CachedMeta {
+    /** The provider that served the original (cached) call. */
+    winner: string;
+    /** What the original call actually cost — i.e. the money a hit avoids. */
+    costUsd: number;
+    inputTokens: number;
+    outputTokens: number;
+    /** Prompt-cache reads on the original call, when reported (> 0 only). */
+    cachedInputTokens?: number;
+}
+/**
+ * Pluggable response-cache backend. Implement it over Redis / Vercel KV / any
+ * shared store to get cross-request hits on serverless; the bundled
+ * {@link createMemoryCacheStore} is the dependency-free default. `get`/`set`
+ * may be sync or async. A `set` that throws must never break the request — the
+ * engine treats the cache as best-effort.
+ */
+interface CacheStore {
+    get(key: string): CachedCall | undefined | Promise<CachedCall | undefined>;
+    set(key: string, value: CachedCall, ttlMs?: number): void | Promise<void>;
+}
+/** Public response-cache config. See {@link LCRConfig.cache}. */
+interface CacheOptions {
+    /** Where to store responses. Defaults to a process-local in-memory store. */
+    store?: CacheStore;
+    /** Entry lifetime in ms. Omit for no expiry (entries live until evicted). */
+    ttlMs?: number;
+}
+/** Tuning for {@link createMemoryCacheStore}. */
+interface MemoryCacheOptions {
+    /**
+     * Cap on stored entries. When exceeded, the oldest-inserted entry is dropped
+     * (insertion-order FIFO — Map preserves it). Keeps an unbounded key space
+     * (every distinct prompt) from leaking memory in a long-running process.
+     * Default 1000.
+     */
+    maxEntries?: number;
+}
+/**
+ * A process-local in-memory {@link CacheStore} with optional TTL and a
+ * bounded entry count. Zero dependencies. See the module header for where this
+ * is (and isn't) useful — notably it does NOT share across serverless requests.
+ */
+declare function createMemoryCacheStore(opts?: MemoryCacheOptions): CacheStore;
+/**
+ * Automatic prompt-cache breakpoints (Feature ①).
+ *
+ * Provider-side prompt caching (Anthropic, MiniMax) caches a *prefix* of the
+ * prompt so repeated calls bill the static head — the system prompt — at the
+ * cache-read rate (~0.1× input) instead of full price. The model still runs;
+ * only the input cost of the cached prefix drops. Anthropic needs an explicit
+ * `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix
+ * automatically with no marker at all.
+ *
+ * This module adds, when `promptCache` is enabled, a single `cacheControl`
+ * breakpoint on the LAST system message — the canonical large, stable head of
+ * a prompt. It only writes the `anthropic` provider-options namespace, which
+ * every non-Anthropic provider ignores, so it is safe to apply on every leg of
+ * a mixed chain: Anthropic reads it, the rest pass it through untouched. No
+ * external service and no storage — the cache itself lives at the provider.
+ *
+ * It steps aside the moment the caller is managing caching themselves: if ANY
+ * message already carries an `anthropic.cacheControl`, the prompt is returned
+ * unchanged. Same "explicit always wins" discipline as the price table.
+ */
+/** Tuning for automatic prompt-cache breakpoints. See {@link LCRConfig.promptCache}. */
+interface PromptCacheOptions {
+    /**
+     * Cache lifetime for the injected breakpoint. `"5m"` (the Anthropic default,
+     * a cheaper cache write) or `"1h"` (a pricier write that pays off when the
+     * same prefix is reused over a longer span). Default `"5m"`.
+     */
+    ttl?: "5m" | "1h";
+}
 /**
  * Owned failover engine for ai-lcr.
@@ -176,6 +292,21 @@ interface CallRecord {
      * be surfaced separately, never folded into `baselineUsd - costUsd`.
      */
     cachedSavingUsd?: number;
+    /**
+     * True when this request was served from ai-lcr's exact-match RESPONSE cache
+     * — no provider was called at all. Distinct from `cachedInputTokens` /
+     * `cachedSavingUsd`, which are the *provider's* prompt-cache (the model still
+     * ran). On a hit `costUsd` is 0, `winner` is the provider that served the
+     * ORIGINAL (now-cached) call, and `attempts` has a single synthetic entry.
+     */
+    cacheHit?: boolean;
+    /**
+     * On a `cacheHit`, the money the hit avoided — i.e. what the original call
+     * actually cost when it ran. Present only when > 0. Like `cachedSavingUsd`
+     * this is a caching saving, NOT a routing saving, so it lives on its own line
+     * and is never folded into `baselineUsd - costUsd`.
+     */
+    cacheHitSavingUsd?: number;
     /**
      * Caller-supplied correlation id, read from `providerOptions.lcr.requestId`
      * on the call. Multi-step tool loops emit one record per `doStream`/
@@ -926,6 +1057,34 @@ interface LCRConfig {
      * unchanged routing, no provider is ever skipped). See {@link CooldownOptions}.
      */
     cooldown?: boolean | CooldownOptions;
+    /**
+     * Exact-match RESPONSE cache: when a request is identical to one already
+     * answered, replay the stored response and call no provider at all — zero
+     * latency, `costUsd: 0`, and the avoided cost reported as `cacheHitSavingUsd`
+     * on the {@link CallRecord} (with `cacheHit: true`). Off by default.
+     *
+     * `true` uses a process-local in-memory store; pass a {@link CacheStore} to
+     * bring your own (Redis / Vercel KV — required for cross-request hits on
+     * serverless, where memory isn't shared); pass `{ store?, ttlMs? }` to set a
+     * TTL. ai-lcr runs no service of its own — any shared store is yours.
+     *
+     * Caching makes identical requests return identical responses: ideal for
+     * idempotent / `temperature: 0` calls, a behavior change for sampled ones.
+     * Empty completions and usage-less results are never cached.
+     */
+    cache?: boolean | CacheStore | CacheOptions;
+    /**
+     * Automatic provider-side PROMPT caching: insert a `cache_control` breakpoint
+     * on the last system message so the static prompt head bills at the
+     * cache-read rate (~0.1× input) on repeats. The model still runs — this only
+     * lowers input cost, it does not skip the call (that's `cache`). Only
+     * Anthropic / MiniMax need the marker; OpenAI / Gemini / DeepSeek cache the
+     * prefix automatically and ignore it, so this is safe on a mixed chain.
+     *
+     * `true` for the 5-minute default, `{ ttl: "1h" }` for the longer window.
+     * Off by default; steps aside entirely if you set `cacheControl` yourself.
+     */
+    promptCache?: boolean | PromptCacheOptions;
     /** Called when a provider errors and routing falls through to the next. */
     onError?: (error: Error, provider: string) => void;
     /** Called after each successful call with the serving provider, tokens, and cost. */
@@ -969,4 +1128,4 @@ type LCRRouter = (modelName: string) => LanguageModelV3;
  */
 declare function createLCR(config: LCRConfig): LCRRouter;
-export { type BillableContext, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, OFFICIAL_PRICES, type PriceComparisonRow, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
+export { type BillableContext, type CacheOptions, type CacheStore, type CachedCall, type CachedMeta, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, type MemoryCacheOptions, OFFICIAL_PRICES, type PriceComparisonRow, type PromptCacheOptions, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createMemoryCacheStore, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
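
The `MemoryCacheOptions` doc comment above describes eviction as insertion-order FIFO rather than LRU, which means reads do not protect an entry from eviction. The sketch below re-states the store with an injectable clock so that the eviction order and the TTL boundary can be checked deterministically. `createFifoStore` and its `now` parameter are hypothetical additions for illustration; the bundled `createMemoryCacheStore` takes only `{ maxEntries }`.

```typescript
interface Entry<V> {
  value: V;
  expiresAt?: number;
}

function createFifoStore<V>(maxEntries = 1000, now: () => number = Date.now) {
  const map = new Map<string, Entry<V>>();
  return {
    get(key: string): V | undefined {
      const entry = map.get(key);
      if (!entry) return undefined;
      // An entry expires at the instant expiresAt is reached (<=, not <).
      if (entry.expiresAt !== undefined && entry.expiresAt <= now()) {
        map.delete(key);
        return undefined;
      }
      return entry.value; // note: a read does NOT refresh the entry's position
    },
    set(key: string, value: V, ttlMs?: number): void {
      // Delete first so an overwritten key moves to the newest position.
      map.delete(key);
      map.set(key, ttlMs !== undefined ? { value, expiresAt: now() + ttlMs } : { value });
      if (map.size > maxEntries) {
        // Map iterates in insertion order, so the first key is the oldest.
        const oldest = map.keys().next().value;
        if (oldest !== undefined) map.delete(oldest);
      }
    },
  };
}
```

With a cap of two, writing `a`, `b`, reading `a`, then writing `c` evicts `a` even though it was just read; only a re-write moves a key to the back of the queue.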

package/dist/index.d.ts CHANGED

@@ -1,4 +1,120 @@
-import { LanguageModelV3 } from '@ai-sdk/provider';
+import { LanguageModelV3GenerateResult, LanguageModelV3StreamPart, LanguageModelV3 } from '@ai-sdk/provider';
+/**
+ * Exact-match response cache (Feature ②).
+ *
+ * Unlike provider-side prompt caching (see ./prompt-cache), this skips the
+ * model call *entirely*: when a request is byte-for-byte identical to one
+ * already answered, the stored response is replayed and no provider is touched
+ * — zero latency, zero cost. This is the layer Vercel AI Gateway notably does
+ * NOT offer, and it composes naturally with ai-lcr's cost truth: a hit settles
+ * a {@link CallRecord} with `costUsd: 0` and `cacheHit: true`, and reports the
+ * money it avoided as `cacheHitSavingUsd` — a saving kept on its own line, the
+ * same discipline as prompt-cache savings, never folded into routing savings.
+ *
+ * Storage is pluggable and the package ships ZERO dependencies for it. The
+ * default `createMemoryCacheStore()` is a process-local Map: real on a
+ * long-running server, and useful within a single serverless invocation (an
+ * agent loop that repeats a sub-call), but it does NOT survive across
+ * serverless requests — different function instances don't share memory. For
+ * cross-request hits on serverless, inject your own store backed by a shared
+ * layer (Upstash Redis, Vercel KV). ai-lcr never runs that service; you bring
+ * it. A custom store is responsible for serializing the stored value.
+ *
+ * Determinism caveat: caching makes identical requests return identical
+ * responses. That is exactly right for idempotent / `temperature: 0` calls and
+ * changes behavior for sampled ones (the variety is gone). Enable it where a
+ * repeated answer is acceptable.
+ */
+/** A stored, replayable LLM response plus the cost it originally incurred. */
+type CachedCall = {
+    kind: "generate";
+    result: LanguageModelV3GenerateResult;
+    meta: CachedMeta;
+} | {
+    kind: "stream";
+    parts: LanguageModelV3StreamPart[];
+    meta: CachedMeta;
+};
+/** Settle-time facts carried in the cache entry so a hit can report honest
+ *  tokens, the originally-serving provider, and the money the hit avoided. */
+interface CachedMeta {
+    /** The provider that served the original (cached) call. */
+    winner: string;
+    /** What the original call actually cost — i.e. the money a hit avoids. */
+    costUsd: number;
+    inputTokens: number;
+    outputTokens: number;
+    /** Prompt-cache reads on the original call, when reported (> 0 only). */
+    cachedInputTokens?: number;
+}
+/**
+ * Pluggable response-cache backend. Implement it over Redis / Vercel KV / any
+ * shared store to get cross-request hits on serverless; the bundled
+ * {@link createMemoryCacheStore} is the dependency-free default. `get`/`set`
+ * may be sync or async. A `set` that throws must never break the request — the
+ * engine treats the cache as best-effort.
+ */
+interface CacheStore {
+    get(key: string): CachedCall | undefined | Promise<CachedCall | undefined>;
+    set(key: string, value: CachedCall, ttlMs?: number): void | Promise<void>;
+}
+/** Public response-cache config. See {@link LCRConfig.cache}. */
+interface CacheOptions {
+    /** Where to store responses. Defaults to a process-local in-memory store. */
+    store?: CacheStore;
+    /** Entry lifetime in ms. Omit for no expiry (entries live until evicted). */
+    ttlMs?: number;
+}
+/** Tuning for {@link createMemoryCacheStore}. */
+interface MemoryCacheOptions {
+    /**
+     * Cap on stored entries. When exceeded, the oldest-inserted entry is dropped
+     * (insertion-order FIFO — Map preserves it). Keeps an unbounded key space
+     * (every distinct prompt) from leaking memory in a long-running process.
+     * Default 1000.
+     */
+    maxEntries?: number;
+}
+/**
+ * A process-local in-memory {@link CacheStore} with optional TTL and a
+ * bounded entry count. Zero dependencies. See the module header for where this
+ * is (and isn't) useful — notably it does NOT share across serverless requests.
+ */
+declare function createMemoryCacheStore(opts?: MemoryCacheOptions): CacheStore;
+/**
+ * Automatic prompt-cache breakpoints (Feature ①).
+ *
+ * Provider-side prompt caching (Anthropic, MiniMax) caches a *prefix* of the
+ * prompt so repeated calls bill the static head — the system prompt — at the
+ * cache-read rate (~0.1× input) instead of full price. The model still runs;
+ * only the input cost of the cached prefix drops. Anthropic needs an explicit
+ * `cache_control` marker; OpenAI / Gemini / DeepSeek cache the prefix
+ * automatically with no marker at all.
+ *
+ * This module adds, when `promptCache` is enabled, a single `cacheControl`
+ * breakpoint on the LAST system message — the canonical large, stable head of
+ * a prompt. It only writes the `anthropic` provider-options namespace, which
+ * every non-Anthropic provider ignores, so it is safe to apply on every leg of
+ * a mixed chain: Anthropic reads it, the rest pass it through untouched. No
+ * external service and no storage — the cache itself lives at the provider.
+ *
+ * It steps aside the moment the caller is managing caching themselves: if ANY
+ * message already carries an `anthropic.cacheControl`, the prompt is returned
+ * unchanged. Same "explicit always wins" discipline as the price table.
+ */
+/** Tuning for automatic prompt-cache breakpoints. See {@link LCRConfig.promptCache}. */
+interface PromptCacheOptions {
+    /**
+     * Cache lifetime for the injected breakpoint. `"5m"` (the Anthropic default,
+     * a cheaper cache write) or `"1h"` (a pricier write that pays off when the
+     * same prefix is reused over a longer span). Default `"5m"`.
+     */
+    ttl?: "5m" | "1h";
+}
 /**
  * Owned failover engine for ai-lcr.
@@ -176,6 +292,21 @@ interface CallRecord {
      * be surfaced separately, never folded into `baselineUsd - costUsd`.
      */
     cachedSavingUsd?: number;
+    /**
+     * True when this request was served from ai-lcr's exact-match RESPONSE cache
+     * — no provider was called at all. Distinct from `cachedInputTokens` /
+     * `cachedSavingUsd`, which are the *provider's* prompt-cache (the model still
+     * ran). On a hit `costUsd` is 0, `winner` is the provider that served the
+     * ORIGINAL (now-cached) call, and `attempts` has a single synthetic entry.
+     */
+    cacheHit?: boolean;
+    /**
+     * On a `cacheHit`, the money the hit avoided — i.e. what the original call
+     * actually cost when it ran. Present only when > 0. Like `cachedSavingUsd`
+     * this is a caching saving, NOT a routing saving, so it lives on its own line
+     * and is never folded into `baselineUsd - costUsd`.
+     */
+    cacheHitSavingUsd?: number;
     /**
      * Caller-supplied correlation id, read from `providerOptions.lcr.requestId`
      * on the call. Multi-step tool loops emit one record per `doStream`/
@@ -926,6 +1057,34 @@ interface LCRConfig {
      * unchanged routing, no provider is ever skipped). See {@link CooldownOptions}.
      */
     cooldown?: boolean | CooldownOptions;
+    /**
+     * Exact-match RESPONSE cache: when a request is identical to one already
+     * answered, replay the stored response and call no provider at all — zero
+     * latency, `costUsd: 0`, and the avoided cost reported as `cacheHitSavingUsd`
+     * on the {@link CallRecord} (with `cacheHit: true`). Off by default.
+     *
+     * `true` uses a process-local in-memory store; pass a {@link CacheStore} to
+     * bring your own (Redis / Vercel KV — required for cross-request hits on
+     * serverless, where memory isn't shared); pass `{ store?, ttlMs? }` to set a
+     * TTL. ai-lcr runs no service of its own — any shared store is yours.
+     *
+     * Caching makes identical requests return identical responses: ideal for
+     * idempotent / `temperature: 0` calls, a behavior change for sampled ones.
+     * Empty completions and usage-less results are never cached.
+     */
+    cache?: boolean | CacheStore | CacheOptions;
+    /**
+     * Automatic provider-side PROMPT caching: insert a `cache_control` breakpoint
+     * on the last system message so the static prompt head bills at the
+     * cache-read rate (~0.1× input) on repeats. The model still runs — this only
+     * lowers input cost, it does not skip the call (that's `cache`). Only
+     * Anthropic / MiniMax need the marker; OpenAI / Gemini / DeepSeek cache the
+     * prefix automatically and ignore it, so this is safe on a mixed chain.
+     *
+     * `true` for the 5-minute default, `{ ttl: "1h" }` for the longer window.
+     * Off by default; steps aside entirely if you set `cacheControl` yourself.
+     */
+    promptCache?: boolean | PromptCacheOptions;
     /** Called when a provider errors and routing falls through to the next. */
     onError?: (error: Error, provider: string) => void;
     /** Called after each successful call with the serving provider, tokens, and cost. */
@@ -969,4 +1128,4 @@ type LCRRouter = (modelName: string) => LanguageModelV3;
  */
 declare function createLCR(config: LCRConfig): LCRRouter;
-export { type BillableContext, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, OFFICIAL_PRICES, type PriceComparisonRow, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
+export { type BillableContext, type CacheOptions, type CacheStore, type CachedCall, type CachedMeta, type CallRecord, type CooldownOptions, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, MODEL_PRICES, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaJobHandle, type MediaJobStatus, type MediaLCR, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPollResult, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaStatusRequest, type MediaStatusResult, type MediaSubmitOptions, type MediaSubmitRequest, type MediaSubmitResult, type MediaUnit, type MediaUsage, type MemoryCacheOptions, OFFICIAL_PRICES, type PriceComparisonRow, type PromptCacheOptions, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, billableUnits, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createMemoryCacheStore, createRunwareMediaAdapter, durationFromInput, formatCallRecord, getModelPrice, isAbortError, isNetworkError, isRetryableError, normalizedCents, priceCents, rankRoutes, referenceMegapixels, shouldFailover };
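
The `CacheStore` contract declared above says a custom store is responsible for serializing the stored value, and that `get`/`set` may be sync or async. A minimal sketch of such an adapter follows. The `CachedMeta`, `CachedCall` and `CacheStore` types are local stand-ins copied from the declarations so the snippet runs without installing the package (with `result` and `parts` loosened to `unknown`). `StringBackend`, `createMapBackend` and `createSerializingStore` are hypothetical names invented for illustration, not part of ai-lcr. A real Redis or Vercel KV client would be async; the same adapter shape should apply with `async` methods.

```typescript
// Local stand-ins mirroring the declarations in index.d.ts (assumption:
// `result` and `parts` are simplified to `unknown` for this sketch).
type CachedMeta = {
  winner: string;
  costUsd: number;
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens?: number;
};
type CachedCall =
  | { kind: "generate"; result: unknown; meta: CachedMeta }
  | { kind: "stream"; parts: unknown[]; meta: CachedMeta };
interface CacheStore {
  get(key: string): CachedCall | undefined | Promise<CachedCall | undefined>;
  set(key: string, value: CachedCall, ttlMs?: number): void | Promise<void>;
}

// Hypothetical string-only backend, standing in for a shared key-value layer.
interface StringBackend {
  read(key: string): string | undefined;
  write(key: string, value: string, expiresAt?: number): void;
}

// In-memory fake of that backend so the example is runnable.
function createMapBackend(now: () => number): StringBackend {
  const data = new Map<string, { value: string; expiresAt?: number }>();
  return {
    read(key) {
      const entry = data.get(key);
      if (!entry) return undefined;
      if (entry.expiresAt !== undefined && entry.expiresAt <= now()) {
        data.delete(key); // drop expired entries lazily, on read
        return undefined;
      }
      return entry.value;
    },
    write(key, value, expiresAt) {
      data.set(key, expiresAt !== undefined ? { value, expiresAt } : { value });
    },
  };
}

// Adapter: JSON-encodes a CachedCall on write and parses it back on read.
function createSerializingStore(
  backend: StringBackend,
  now: () => number = Date.now,
): CacheStore {
  return {
    get(key) {
      const raw = backend.read(key);
      return raw === undefined ? undefined : (JSON.parse(raw) as CachedCall);
    },
    set(key, value, ttlMs) {
      const expiresAt = ttlMs !== undefined ? now() + ttlMs : undefined;
      backend.write(key, JSON.stringify(value), expiresAt);
    },
  };
}

// Usage with a controllable clock so the TTL behaviour is visible.
let clock = 0;
const store = createSerializingStore(createMapBackend(() => clock), () => clock);
const entry: CachedCall = {
  kind: "generate",
  result: { text: "hi" },
  meta: { winner: "provider-a", costUsd: 0.002, inputTokens: 120, outputTokens: 30 },
};
store.set("k1", entry, 1000);
const fresh = store.get("k1") as CachedCall | undefined; // parsed copy of `entry`
clock = 1000;
const expired = store.get("k1") as CachedCall | undefined; // TTL has elapsed
```

Under these assumptions the store would be passed as `cache: store` or `cache: { store, ttlMs }` in the `createLCR` config. Because the engine treats the cache as best-effort, a throwing `set` in such an adapter should not break the request.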

package/dist/index.js CHANGED

@@ -1,3 +1,102 @@
+// src/cache.ts
+function isCacheStore(x) {
+  return typeof x === "object" && x !== null && typeof x.get === "function" && typeof x.set === "function";
+}
+function resolveCache(opt) {
+  if (!opt) return void 0;
+  if (opt === true) return { store: createMemoryCacheStore() };
+  if (isCacheStore(opt)) return { store: opt };
+  return {
+    store: opt.store ?? createMemoryCacheStore(),
+    ...opt.ttlMs !== void 0 ? { ttlMs: opt.ttlMs } : {}
+  };
+}
+function cacheKeyOf(modelName, options) {
+  const rest = options.providerOptions ? Object.entries(options.providerOptions).filter(([ns]) => ns !== "lcr") : [];
+  const po = rest.length > 0 ? Object.fromEntries(rest) : void 0;
+  return JSON.stringify({
+    m: modelName,
+    prompt: options.prompt,
+    maxOutputTokens: options.maxOutputTokens,
+    temperature: options.temperature,
+    topP: options.topP,
+    topK: options.topK,
+    frequencyPenalty: options.frequencyPenalty,
+    presencePenalty: options.presencePenalty,
+    stopSequences: options.stopSequences,
+    seed: options.seed,
+    responseFormat: options.responseFormat,
+    tools: options.tools,
+    toolChoice: options.toolChoice,
+    po
+  });
+}
+function streamFromParts(parts) {
+  return new ReadableStream({
+    start(controller) {
+      for (const part of parts) controller.enqueue(part);
+      controller.close();
+    }
+  });
+}
+function createMemoryCacheStore(opts = {}) {
+  const maxEntries = opts.maxEntries ?? 1e3;
+  const map = /* @__PURE__ */ new Map();
+  return {
+    get(key) {
+      const entry = map.get(key);
+      if (!entry) return void 0;
+      if (entry.expiresAt !== void 0 && entry.expiresAt <= Date.now()) {
+        map.delete(key);
+        return void 0;
+      }
+      return entry.value;
+    },
+    set(key, value, ttlMs) {
+      const entry = ttlMs !== void 0 ? { value, expiresAt: Date.now() + ttlMs } : { value };
+      map.delete(key);
+      map.set(key, entry);
+      if (map.size > maxEntries) {
+        const oldest = map.keys().next().value;
+        if (oldest !== void 0) map.delete(oldest);
+      }
+    }
+  };
+}
+// src/prompt-cache.ts
+function resolvePromptCache(opt) {
+  if (!opt) return void 0;
+  if (opt === true) return { ttl: "5m" };
+  return { ttl: opt.ttl ?? "5m" };
+}
+function hasAnthropicCacheControl(message) {
+  const anthropic = message.providerOptions?.anthropic;
+  return !!anthropic && "cacheControl" in anthropic;
+}
+function withPromptCacheBreakpoint(options, cfg) {
+  const prompt = options.prompt;
+  if (!Array.isArray(prompt) || prompt.length === 0) return options;
+  if (prompt.some(hasAnthropicCacheControl)) return options;
+  let target = -1;
+  for (let i = 0; i < prompt.length; i++) {
+    if (prompt[i].role === "system") target = i;
+  }
+  if (target === -1) return options;
+  const cacheControl = cfg.ttl === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };
+  const newPrompt = prompt.map((message, i) => {
+    if (i !== target) return message;
+    return {
+      ...message,
+      providerOptions: {
+        ...message.providerOptions,
+        anthropic: { ...message.providerOptions?.anthropic, cacheControl }
+      }
+    };
+  });
+  return { ...options, prompt: newPrompt };
+}
 // src/fallback.ts
 var EmptyCompletionError = class extends Error {
   constructor(provider) {
@@ -396,6 +495,16 @@ var LcrFallbackModel = class {
     const usageMissing = inputTokens === 0 && outputTokens === 0;
     const emptyCompletion = inputTokens > 0 && outputTokens === 0;
     const baselineUsd = this.baselineUsd(inputTokens, outputTokens, cacheReadTokens);
+    ctx.settled = {
+      meta: {
+        winner: provider.label,
+        costUsd,
+        inputTokens,
+        outputTokens,
+        ...cacheReadTokens > 0 ? { cachedInputTokens: cacheReadTokens } : {}
+      },
+      cacheable: !emptyCompletion && !usageMissing
+    };
     this.emitCost({
       model: this.opts.modelName,
       provider: provider.label,
@@ -440,6 +549,16 @@ var LcrFallbackModel = class {
     });
   }
   async doGenerate(options) {
+    const cache = this.opts.cache;
+    const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
+    if (cache && cacheKey !== void 0) {
+      const hit = await cache.store.get(cacheKey);
+      if (hit && hit.kind === "generate") {
+        this.finalizeCacheHit(this.startCall(options), hit.meta);
+        return hit.result;
+      }
+    }
+    const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
     const ctx = this.startCall(options);
     const providers = this.opts.providers;
     const order = this.routeOrder(this.startIndex());
@@ -450,7 +569,7 @@ var LcrFallbackModel = class {
       const isLast = pos === order.length - 1;
       const attemptStart = Date.now();
       try {
-        const result = await provider.model.doGenerate(options);
+        const result = await provider.model.doGenerate(callOptions);
         const out = result.usage?.outputTokens?.total ?? 0;
         const inp = result.usage?.inputTokens?.total ?? 0;
         if (inp > 0 && out === 0 && !isLast) {
@@ -463,6 +582,9 @@ var LcrFallbackModel = class {
         this.recordProviderSuccess(idx);
         this.settleSticky(idx);
         this.finalizeOk(ctx, provider, attemptStart, result.usage);
+        if (cache && cacheKey !== void 0 && ctx.settled?.cacheable) {
+          this.storeCache(cacheKey, { kind: "generate", result, meta: ctx.settled.meta });
+        }
         return result;
       } catch (error) {
         lastError = error;
@@ -479,7 +601,76 @@ var LcrFallbackModel = class {
     throw ctx.firstError ?? lastError;
   }
   async doStream(options) {
-    return this.doStreamWithCtx(options, this.startCall(options), this.routeOrder(this.startIndex()), 0);
+    const cache = this.opts.cache;
+    const cacheKey = cache ? cacheKeyOf(this.opts.modelName, options) : void 0;
+    if (cache && cacheKey !== void 0) {
+      const hit = await cache.store.get(cacheKey);
+      if (hit && hit.kind === "stream") {
+        this.finalizeCacheHit(this.startCall(options), hit.meta);
+        return { stream: streamFromParts(hit.parts) };
+      }
+    }
+    const ctx = this.startCall(options);
+    const callOptions = this.opts.promptCache ? withPromptCacheBreakpoint(options, this.opts.promptCache) : options;
+    const inner = await this.doStreamWithCtx(
+      callOptions,
+      ctx,
+      this.routeOrder(this.startIndex()),
+      0
+    );
+    if (!cache || cacheKey === void 0) return inner;
+    const collected = [];
+    const self = this;
+    const wrapped = inner.stream.pipeThrough(
+      new TransformStream({
+        transform(part, controller) {
+          collected.push(part);
+          controller.enqueue(part);
+        },
+        flush() {
+          if (ctx.settled?.cacheable) {
+            self.storeCache(cacheKey, { kind: "stream", parts: collected, meta: ctx.settled.meta });
+          }
+        }
+      })
+    );
+    return { ...inner, stream: wrapped };
+  }
+  /** A response-cache hit: replay a stored answer with no provider call. Settles
+   *  one {@link CallRecord} with `cacheHit`, `costUsd: 0`, and the avoided cost
+   *  on its own `cacheHitSavingUsd` line. */
+  finalizeCacheHit(ctx, meta) {
+    this.emitCall({
+      id: ctx.id,
+      model: this.opts.modelName,
+      attempts: [{ provider: meta.winner, ok: true, latencyMs: Date.now() - ctx.startedAt }],
+      winner: meta.winner,
+      ok: true,
+      failedOver: false,
+      latencyMs: Date.now() - ctx.startedAt,
+      inputTokens: meta.inputTokens,
+      outputTokens: meta.outputTokens,
+      ...meta.cachedInputTokens ? { cachedInputTokens: meta.cachedInputTokens } : {},
+      costUsd: 0,
+      cacheHit: true,
+      ...meta.costUsd > 0 ? { cacheHitSavingUsd: meta.costUsd } : {},
+      ...ctx.requestId ? { requestId: ctx.requestId } : {}
+    });
+  }
+  /** Best-effort write to the response cache: a sync throw or a rejected async
+   *  `set` must never break the request. Caching is an optimization, not a
+   *  guarantee. */
+  storeCache(key, value) {
+    const cache = this.opts.cache;
+    if (!cache) return;
+    try {
+      const r = cache.store.set(key, value, cache.ttlMs);
+      if (r && typeof r.catch === "function") {
+        r.catch(() => {
+        });
+      }
+    } catch {
+    }
   }
   // The stream's failover recursion re-enters here with the SAME `ctx` and the
   // SAME `order` snapshot, advancing only the local position `pos`, so a
@@ -1961,12 +2152,16 @@ function createLCR(config) {
     autoPrice = false,
     resetIntervalMs,
     cooldown,
+    cache,
+    promptCache,
     onError,
     onCost,
     onCall,
     shouldRetry,
     defaultCacheReadRatio
   } = config;
+  const resolvedCache = resolveCache(cache);
+  const resolvedPromptCache = resolvePromptCache(promptCache);
   if (defaultCacheReadRatio !== void 0 && (defaultCacheReadRatio < 0 || defaultCacheReadRatio > 1)) {
     throw new Error(
       `ai-lcr: defaultCacheReadRatio must be in [0, 1], got ${defaultCacheReadRatio}`
@@ -1986,7 +2181,18 @@ function createLCR(config) {
     }
     routed.set(
       name,
-      new LcrFallbackModel({ modelName: name, providers, resetIntervalMs, cooldown, onError, onCost, onCall, shouldRetry })
+      new LcrFallbackModel({
+        modelName: name,
+        providers,
+        resetIntervalMs,
+        cooldown,
+        ...resolvedCache ? { cache: resolvedCache } : {},
+        ...resolvedPromptCache ? { promptCache: resolvedPromptCache } : {},
+        onError,
+        onCost,
+        onCall,
+        shouldRetry
+      })
     );
   }
   return (modelName) => {
@@ -2014,6 +2220,7 @@ export {
   createKunavoMediaAdapter,
   createLCR,
   createMediaLCR,
+  createMemoryCacheStore,
   createRunwareMediaAdapter,
   durationFromInput,
   formatCallRecord,
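
The response cache is exact-match, and the `cacheKeyOf` function in this diff decides what "identical" means. The sketch below is a reduced copy of that function (fewer sampling fields, same logic) with a simplified, hypothetical `CallOptions` type. It illustrates three consequences of building the key with `JSON.stringify`: the `lcr` provider-options namespace is excluded, so a per-call `requestId` does not prevent hits; any change to a sampling parameter produces a different key; and property order inside other provider options matters, because `JSON.stringify` preserves insertion order.

```typescript
// Simplified stand-in for the AI SDK call options (assumption: only the
// fields used in this sketch are declared).
type CallOptions = {
  prompt: unknown;
  temperature?: number;
  maxOutputTokens?: number;
  providerOptions?: Record<string, Record<string, unknown>>;
};

// Reduced copy of cacheKeyOf from the diff: drops the "lcr" namespace, then
// serializes the model name and request fields into one string key.
function cacheKeyOf(modelName: string, options: CallOptions): string {
  const rest = options.providerOptions
    ? Object.entries(options.providerOptions).filter(([ns]) => ns !== "lcr")
    : [];
  const po = rest.length > 0 ? Object.fromEntries(rest) : undefined;
  return JSON.stringify({
    m: modelName,
    prompt: options.prompt,
    maxOutputTokens: options.maxOutputTokens,
    temperature: options.temperature,
    po,
  });
}

const prompt = [{ role: "user", content: [{ type: "text", text: "ping" }] }];

// Same request, different correlation ids in the excluded "lcr" namespace.
const a = cacheKeyOf("model-x", {
  prompt,
  temperature: 0,
  providerOptions: { lcr: { requestId: "req-1" } },
});
const b = cacheKeyOf("model-x", {
  prompt,
  temperature: 0,
  providerOptions: { lcr: { requestId: "req-2" } },
});

// Same prompt, different sampling temperature.
const c = cacheKeyOf("model-x", { prompt, temperature: 0.7 });

// Same provider options, different property order.
const d = cacheKeyOf("model-x", {
  prompt,
  temperature: 0,
  providerOptions: { anthropic: { x: 1, y: 2 } },
});
const e = cacheKeyOf("model-x", {
  prompt,
  temperature: 0,
  providerOptions: { anthropic: { y: 2, x: 1 } },
});

const sameAcrossRequestIds = a === b; // true: "lcr" is filtered out of the key
const differsOnTemperature = a !== c; // true: temperature is part of the key
const keyOrderMatters = d !== e; // true: JSON.stringify keeps insertion order
```

One practical reading: callers who build `providerOptions` dynamically may want to construct them in a stable property order, otherwise logically equal requests can miss the cache.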

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "ai-lcr",
-  "version": "0.6.2",
+  "version": "0.6.3",
   "description": "Least Cost Routing for LLMs — route every model call to the cheapest available provider, fall back automatically, and track real cost. Built for the Vercel AI SDK.",
   "keywords": [
     "ai",