ai-lcr 0.2.6 → 0.4.0

package/CHANGELOG.md CHANGED
@@ -4,6 +4,65 @@ All notable changes to `ai-lcr` are documented here. The format follows
4
4
  [Keep a Changelog](https://keepachangelog.com/), and the project adheres to
5
5
  [Semantic Versioning](https://semver.org/).
6
6
 
7
+ ## [0.4.0] — 2026-06-02
8
+
9
+ All additions are optional and backward compatible.
10
+
11
+ ### Added
12
+
13
+ - **`CallRecord.ttftMs` — time to first token.** Streaming calls now report TTFT,
14
+ the industry-standard responsiveness metric: ms from the winning provider's
15
+ stream attempt start to its first content token (`text-delta` /
16
+ `reasoning-delta`). Measured against the *winner's* attempt, so failover
17
+ overhead (already in `latencyMs`) doesn't distort it. `undefined` for
18
+ `doGenerate` (no streaming → no "first token") and for calls that failed before
19
+ producing content. `formatCallRecord` shows it inline next to total latency when
20
+ present (`412ms (ttft 88ms)`). With `latencyMs` and `outputTokens` on the same
21
+ record, output throughput is derivable: `outputTokens / ((latencyMs − ttftMs) /
22
+ 1000)` tokens/sec.
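
As a rough illustration of the measurement described in the 0.4.0 entry, the sketch below timestamps the first `text-delta` / `reasoning-delta` part of a stream relative to the attempt start. `measureTtft`, `StreamPart`, and the injectable `now` clock are hypothetical names used only for this example; they are not part of ai-lcr, and the real router measures this while forwarding parts rather than by draining the stream.

```typescript
// Minimal stream part shape for this sketch (assumption: only `type` matters here).
type StreamPart = { type: string };

// Hypothetical helper: returns ms from `attemptStart` to the first content part,
// or undefined when the stream ended without any content part.
async function measureTtft<T extends StreamPart>(
  parts: AsyncIterable<T>,
  attemptStart: number,
  now: () => number = Date.now,
): Promise<number | undefined> {
  let ttftMs: number | undefined;
  for await (const part of parts) {
    const isContent = part.type === "text-delta" || part.type === "reasoning-delta";
    // Only the first content part sets the value; later parts leave it untouched.
    if (ttftMs === undefined && isContent) {
      ttftMs = now() - attemptStart;
    }
  }
  return ttftMs;
}
```

Non-content parts such as `stream-start` or `finish` never set the value, which is why a stream that fails before producing content leaves `ttftMs` undefined.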
23
+
24
+ ## [0.3.0] — 2026-06-02
25
+
26
+ Integration-feedback pass from wiring ai-lcr into a real agentic product
27
+ (multi-step tool loops, Anthropic prompt caching). All additions are optional
28
+ and backward compatible.
29
+
30
+ ### Fixed
31
+
32
+ - **`createHttpSink` is exported again.** It shipped in 0.2.0, then silently
33
+ dropped out of the package somewhere after — so `import { createHttpSink }`
34
+ (as the integration playbook documents) failed with TS2305 on 0.2.1+. The
35
+ source and tests are restored and the symbol is now pinned in the public-API
36
+ smoke test so it can't regress unnoticed.
37
+ - **Capability probe no longer false-FAILs tool support.** `check-provider.sh`
38
+ tested tools with `tool_choice:"auto"` and a single sample; reasoning / chatty
39
+ models often answer in text instead of calling a tool, which looked identical to
40
+ dropped tools. It now forces `tool_choice:"required"` (testing whether the
41
+ provider *can* call a tool, not whether the model *will* decide to). The
42
+ token-inflation parser also prints a stderr diagnostic on a parse failure instead
43
+ of silently returning an empty result (which masqueraded as an inconclusive result).
44
+
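
The capability-probe fix above can be pictured with a small sketch. The internals of `check-provider.sh` are not shown in this diff, so everything here is an assumption: `buildToolProbe`, `providerCanCallTools`, and the `get_weather` tool are hypothetical, and the request/response shapes follow the OpenAI-compatible chat-completions format.

```typescript
// Assumed subset of an OpenAI-compatible chat-completions response.
interface ChatResponse {
  choices?: Array<{ message?: { content?: string | null; tool_calls?: unknown[] } }>;
}

// Hypothetical probe body. `tool_choice: "required"` forces a tool call, so a
// text-only answer can no longer be mistaken for dropped tools.
function buildToolProbe(model: string) {
  return {
    model,
    messages: [{ role: "user", content: "What is the weather in Paris?" }],
    tools: [
      {
        type: "function",
        function: {
          name: "get_weather", // illustrative tool, not from the script
          description: "Look up the current weather for a city.",
          parameters: {
            type: "object",
            properties: { city: { type: "string" } },
            required: ["city"],
          },
        },
      },
    ],
    tool_choice: "required",
  };
}

// The probe passes when the first choice carries at least one tool call.
function providerCanCallTools(response: ChatResponse): boolean {
  return (response.choices?.[0]?.message?.tool_calls?.length ?? 0) > 0;
}
```

With `tool_choice: "auto"`, a chatty model answering in plain text would produce a response with no `tool_calls`, which is indistinguishable from a provider that silently drops the `tools` field.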
45
+ ### Added
46
+
47
+ - **`CallRecord.baselineUsd` on the text side.** The text router now fills the
48
+ savings baseline — the same token usage priced on the most expensive priced
49
+ provider in the chain — so `baselineUsd − costUsd` (the headline a cost
50
+ dashboard shows) is computable for text, not just media.
51
+ - **Prompt-cache-aware cost.** `ProviderCost` gains an optional `cacheRead`
52
+ (USD per 1M cached input tokens). When a call reports
53
+ `usage.inputTokens.cacheRead`, those tokens bill at that rate; omit it and
54
+ they fall back to the full `input` rate (unchanged). `CallRecord` exposes
55
+ `cachedInputTokens` for auditing. Accounting only — routing weights are
56
+ unchanged in this release.
57
+ - **`CallRecord.requestId` passthrough.** Read from `providerOptions.lcr.requestId`;
58
+ stamp the same id on every step of a tool loop to roll a multi-step request
59
+ up into one cost figure on the dashboard.
60
+ - **`CallRecord.usageMissing` flag.** Set when the winner served OK but reported
61
+ zero input *and* output tokens — i.e. the provider emitted no usage, so
62
+ `costUsd` (and any token-based credit metering) silently reads 0. Surfaces the
63
+ difference between "free" and "cost unknown"; `formatCallRecord` shows it as
64
+ `⚠no-usage`, and a savings suffix `(saved $X)` when `baselineUsd` beats cost.
65
+
7
66
  ## [0.2.6] — 2026-06-01
8
67
 
9
68
  ### Changed
package/README.md CHANGED
@@ -98,6 +98,46 @@ const lcr = createLCR({
98
98
 
99
99
  The same pattern works for any vendor's native SDK provider — `@ai-sdk/anthropic`, `@ai-sdk/google`, `@ai-sdk/openai`, `@ai-sdk/xai`, and so on. They all return `LanguageModelV3`, so you can mix a native vendor API with aggregators in one model's list. Native APIs are narrow (only that vendor's models) but featureful; aggregators are broad. **Official-first + aggregator-fallback** is the natural LCR shape.
100
100
 
101
+ ## Cheapest route for open-weights models (DeepInfra)
102
+
103
+ For open-weights models — DeepSeek, Kimi, MiniMax, GLM, Qwen — a dedicated inference host is usually the cheapest route, well under aggregator pricing. [DeepInfra](https://deepinfra.com) is OpenAI-compatible, so it slots in as just another entry. **One gotcha:** its OpenAI endpoint lives at `/v1/openai` (the `/v1/` precedes `openai`), not the usual `/v1`:
104
+
105
+ ```ts
106
+ import { createLCR } from "ai-lcr";
107
+ import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
108
+
109
+ const deepinfra = createOpenAICompatible({
110
+ name: "deepinfra",
111
+ baseURL: "https://api.deepinfra.com/v1/openai", // note: /v1/openai, not /v1
112
+ apiKey: process.env.DEEPINFRA_API_KEY,
113
+ });
114
+ const openrouter = createOpenAICompatible({
115
+ name: "openrouter",
116
+ baseURL: "https://openrouter.ai/api/v1",
117
+ apiKey: process.env.OPENROUTER_API_KEY,
118
+ });
119
+
120
+ const lcr = createLCR({
121
+ autoSort: true,
122
+ models: {
123
+ // DeepInfra is cheapest; OpenRouter is the breadth/uptime fallback.
124
+ // DeepInfra uses HuggingFace-style ids (org/Name).
125
+ "deepseek-v4-flash": [
126
+ { model: deepinfra("deepseek-ai/DeepSeek-V4-Flash"), label: "deepinfra", cost: { input: 0.10, output: 0.20 } },
127
+ { model: openrouter("deepseek/deepseek-v4-flash"), label: "openrouter", cost: { input: 0.27, output: 1.10 } },
128
+ ],
129
+ "minimax-m2.5": [
130
+ { model: deepinfra("MiniMaxAI/MiniMax-M2.5"), label: "deepinfra", cost: { input: 0.15, output: 1.15 } },
131
+ ],
132
+ "kimi-k2.5": [
133
+ { model: deepinfra("moonshotai/Kimi-K2.5"), label: "deepinfra", cost: { input: 0.45, output: 2.25 } },
134
+ ],
135
+ },
136
+ });
137
+ ```
138
+
139
+ DeepInfra carries open weights only — no first-party Claude / GPT / Gemini. For those closed models, route through OpenRouter or a discount gateway instead.
140
+
101
141
  ## How it routes
102
142
 
103
143
  1. **Cheapest first.** Providers are tried in order — list them cheapest-first, or set `autoSort: true` to order them by `cost` automatically.
@@ -144,12 +184,52 @@ interface CallRecord {
144
184
  ok: boolean;
145
185
  failedOver: boolean; // more than one provider was tried
146
186
  latencyMs: number;
187
+ ttftMs?: number; // streaming only: time to first token (winner's first content delta) — industry-standard responsiveness metric
147
188
  inputTokens: number;
148
189
  outputTokens: number;
149
- costUsd: number;
190
+ cachedInputTokens?: number; // prompt-cache hits the winner read (when reported)
191
+ costUsd: number; // winner cost, cache-discount applied (see `cacheRead`)
192
+ baselineUsd?: number; // same usage on the priciest priced leg → savings = baselineUsd − costUsd
193
+ requestId?: string; // your correlation id (see below) — roll multi-step tool loops into one request
194
+ usageMissing?: boolean; // winner served but reported 0/0 tokens → costUsd is 0 but unknown, not free
150
195
  }
151
196
  ```
152
197
 
198
+ **Savings, not just spend.** Whenever at least one provider in a chain carries a `cost`, `baselineUsd` is what the same call would have cost on the most expensive priced leg (typically your safety-net fallback). `baselineUsd − costUsd` is the money routing saved on that call — the number a cost dashboard exists to show.
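
To make the baseline concrete, here is a minimal sketch of the arithmetic, using the DeepInfra and OpenRouter prices from the example earlier in this README. `Price`, `Leg`, `usd`, and `baselineFor` are hypothetical names for illustration; they mirror the described behavior (highest cost among priced legs, unpriced legs skipped) but are not the library's API.

```typescript
interface Price { input: number; output: number } // USD per 1M tokens
interface Leg { label: string; cost?: Price }

// Cost of a given usage at a given price.
const usd = (p: Price, inputTokens: number, outputTokens: number): number =>
  (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;

// Highest cost among priced legs; undefined when no leg carries a price.
function baselineFor(chain: Leg[], inputTokens: number, outputTokens: number): number | undefined {
  let max: number | undefined;
  for (const leg of chain) {
    if (!leg.cost) continue; // unpriced legs cannot serve as a baseline
    const c = usd(leg.cost, inputTokens, outputTokens);
    if (max === undefined || c > max) max = c;
  }
  return max;
}

const deepinfraPrice: Price = { input: 0.10, output: 0.20 };
const openrouterPrice: Price = { input: 0.27, output: 1.10 };
const chain: Leg[] = [
  { label: "deepinfra", cost: deepinfraPrice },
  { label: "openrouter", cost: openrouterPrice },
];

// 10,000 input and 2,000 output tokens served by the cheaper leg.
const costUsd = usd(deepinfraPrice, 10_000, 2_000);
const baselineUsd = baselineFor(chain, 10_000, 2_000);
console.log(costUsd.toFixed(4), baselineUsd?.toFixed(4)); // 0.0014 0.0049
```

For this call the saving is roughly $0.0035. When the winner is itself the most expensive priced leg, baseline and cost are equal and the saving is zero.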
199
+
200
+ **Responsiveness, not just total time.** On streaming calls (`streamText`, `streamObject`, streaming agents), `ttftMs` is the **time to first token** — measured from the winning provider's attempt start to its first content delta. It's the metric most LLM dashboards lead with, because it's what a user feels as "how fast did it start replying". Total `latencyMs` covers the whole stream including any failover; `ttftMs` isolates the serving model's responsiveness. It's `undefined` for `generateText`/`generateObject` (no streaming → no "first" token) and for calls that failed before any content. Output throughput (tokens/sec) is then `outputTokens / ((latencyMs − ttftMs) / 1000)`.
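
The throughput formula above can be sketched as a small helper. `outputTokensPerSec` and `TimingFields` are hypothetical names, not part of ai-lcr. One caveat follows from the definitions: `latencyMs` spans all attempts while `ttftMs` is measured from the winner's attempt start, so on a failed-over call the difference still contains the time spent on failed attempts and the result understates the serving model's speed. This sketch therefore skips failed-over records, which is a design choice for the example rather than library behavior.

```typescript
// Structural subset of CallRecord, limited to the fields used here.
interface TimingFields {
  latencyMs: number;
  ttftMs?: number;
  outputTokens: number;
  failedOver: boolean;
}

// Hypothetical helper: output tokens per second after the first token arrived.
function outputTokensPerSec(r: TimingFields): number | undefined {
  if (r.ttftMs === undefined) return undefined; // non-streaming call or no content
  if (r.failedOver) return undefined;           // latencyMs includes failed attempts
  const generationMs = r.latencyMs - r.ttftMs;
  if (generationMs <= 0 || r.outputTokens <= 0) return undefined;
  return r.outputTokens / (generationMs / 1000);
}

const tps = outputTokensPerSec({ latencyMs: 2088, ttftMs: 88, outputTokens: 500, failedOver: false });
console.log(tps); // 250
```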
201
+
202
+ **Cache-aware cost.** Add `cacheRead` (USD per 1M cached input tokens) to a provider's `cost` and ai-lcr bills prompt-cache hits at that rate when the call reports `usage.inputTokens.cacheRead`. Omit it and cached tokens fall back to the full `input` rate (unchanged from before). For cache-heavy traffic (e.g. Anthropic, where a cache read is ~0.1×) this keeps `costUsd` honest — and `cachedInputTokens` lets a dashboard audit it:
203
+
204
+ ```ts
205
+ { model: claude, label: "anthropic", cost: { input: 3, output: 15, cacheRead: 0.3 } }
206
+ ```
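
A worked example of the cache split, as a sketch: the function below mirrors the `costForUsage` helper visible in this release's `dist/index.cjs`, retyped here so the block is self-contained. It assumes, as that helper does, that `inputTokens` is the total and already includes the cached tokens. The token counts are made up for illustration.

```typescript
interface ProviderCost { input: number; output: number; cacheRead?: number } // USD per 1M tokens

function costForUsage(
  cost: ProviderCost,
  inputTokens: number,
  outputTokens: number,
  cacheReadTokens: number,
): number {
  // Clamp cached tokens into [0, inputTokens] so a bad report cannot go negative.
  const cached = Math.min(Math.max(cacheReadTokens, 0), inputTokens);
  const fullInput = inputTokens - cached;
  // Without a cacheRead rate, cached tokens bill at the full input rate.
  const cachedRate = cost.cacheRead ?? cost.input;
  return (fullInput / 1e6) * cost.input + (cached / 1e6) * cachedRate + (outputTokens / 1e6) * cost.output;
}

// 100,000 input tokens, 80,000 of them cache reads, 1,000 output tokens.
const withCacheRate = costForUsage({ input: 3, output: 15, cacheRead: 0.3 }, 100_000, 1_000, 80_000);
const withoutCacheRate = costForUsage({ input: 3, output: 15 }, 100_000, 1_000, 80_000);
console.log(withCacheRate.toFixed(4), withoutCacheRate.toFixed(4)); // 0.0990 0.3150
```

In this example, omitting `cacheRead` would overstate the call's cost by about a factor of three.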
207
+
208
+ **Group a multi-step request.** An agentic turn emits one `onCall` record per `doStream`/`doGenerate` step, so a 10-step tool loop emits 10 records. Pass a stable id through `providerOptions.lcr.requestId` and every step's record carries it; group by `requestId` for per-request cost:
209
+
210
+ ```ts
211
+ await streamText({ model: lcr("chat"), messages, providerOptions: { lcr: { requestId } } });
212
+ ```
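
On the collecting side, grouping could look like the sketch below. `rollUpByRequest`, `StepRecord`, and `RequestTotal` are hypothetical names for this example, not part of ai-lcr. The sketch also carries the `usageMissing` flag forward, so a request containing a step with unreported usage is marked as "cost unknown" instead of looking cheap.

```typescript
// Structural subset of CallRecord, limited to the fields used here.
interface StepRecord { requestId?: string; costUsd: number; usageMissing?: boolean }
interface RequestTotal { steps: number; costUsd: number; costUnknown: boolean }

// Hypothetical helper: sum per-step records into one figure per requestId.
function rollUpByRequest(records: StepRecord[]): Map<string, RequestTotal> {
  const totals = new Map<string, RequestTotal>();
  for (const r of records) {
    if (!r.requestId) continue; // records without an id cannot be grouped
    const t = totals.get(r.requestId) ?? { steps: 0, costUsd: 0, costUnknown: false };
    t.steps += 1;
    t.costUsd += r.costUsd;
    if (r.usageMissing) t.costUnknown = true; // total is a lower bound, not the real cost
    totals.set(r.requestId, t);
  }
  return totals;
}

const totals = rollUpByRequest([
  { requestId: "req-1", costUsd: 0.002 },
  { requestId: "req-1", costUsd: 0, usageMissing: true },
  { requestId: "req-2", costUsd: 0.001 },
  { costUsd: 0.005 }, // no id: left out of the roll-up
]);
console.log(totals.get("req-1")?.steps, totals.get("req-1")?.costUnknown); // 2 true
```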
213
+
214
+ ### Ship records to a collector (`createHttpSink`)
215
+
216
+ `createHttpSink` builds an `onCall` handler that POSTs each `CallRecord` as JSON to an endpoint (e.g. a self-hosted dashboard's `/api/ingest`, or any drain that takes the shape). Fire-and-forget — a failed POST never breaks your app. On serverless, pass a `waitUntil`-style `dispatch` (Next.js: `after`) so the request isn't cut off:
217
+
218
+ ```ts
219
+ import { createLCR, createHttpSink } from "ai-lcr";
220
+ import { after } from "next/server";
221
+
222
+ const lcr = createLCR({
223
+ models: { /* … */ },
224
+ onCall: createHttpSink({
225
+ url: process.env.LCR_INGEST_URL + "/api/ingest",
226
+ headers: { authorization: `Bearer ${process.env.LCR_INGEST_KEY}` },
227
+ project: process.env.LCR_PROJECT, // optional tenant tag merged into each payload
228
+ dispatch: after,
229
+ }),
230
+ });
231
+ ```
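
One detail worth knowing: judging by the bundled `createHttpSink` code in this release, `onError` fires only when the fetch call throws, and a standard `fetch` resolves normally on HTTP error statuses. A collector answering 401 or 500 would therefore go unnoticed. A possible workaround, sketched under that assumption, is to pass a wrapping `fetchImpl`. `rejectOnHttpError` is a hypothetical helper, not part of ai-lcr.

```typescript
// Hypothetical wrapper: turns non-2xx responses into rejections so that a
// caller's error handler sees them.
function rejectOnHttpError(baseFetch: typeof fetch): typeof fetch {
  const wrapped = async (
    input: Parameters<typeof fetch>[0],
    init?: Parameters<typeof fetch>[1],
  ): Promise<Response> => {
    const res = await baseFetch(input, init);
    if (!res.ok) throw new Error(`ingest responded ${res.status}`);
    return res;
  };
  return wrapped as typeof fetch;
}

// Intended use (shown as a comment so this block stays self-contained):
//   createHttpSink({
//     url,
//     fetchImpl: rejectOnHttpError(fetch),
//     onError: (err) => console.error("lcr sink:", err),
//   });
```

On a long-lived server the default `dispatch` (run immediately) should be enough; the `after` / `waitUntil` wiring matters only where the runtime may stop before the POST completes.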
232
+
153
233
  ## Supported providers
154
234
 
155
235
  Any OpenAI-compatible endpoint works — and so does any AI SDK provider package, including a model vendor's own official API.
@@ -235,11 +315,11 @@ API_KEY=$KUNAVO_API_KEY BASE=https://api.kunavo.com \
235
315
  REF_API_KEY=$OPENROUTER_API_KEY REF_BASE=https://openrouter.ai/api \
236
316
  bash scripts/check-provider.sh
237
317
 
238
- # TokenMart uses vendor-prefixed model IDs
239
- API_KEY=$TOKENMART_API_KEY BASE=https://api.tokenmart.ai \
240
- MODEL_1=google/gemini-3-flash REF_1=google/gemini-3-flash-preview \
241
- MODEL_2=anthropic/claude-sonnet-4-6 REF_2=anthropic/claude-sonnet-4.6 \
242
- CACHE_MODEL=anthropic/claude-sonnet-4-6 \
318
+ # TokenMart (Inference AI) uses bare, un-prefixed model IDs
319
+ API_KEY=$INFERENCE_API_KEY BASE=https://model.service-inference.ai \
320
+ MODEL_1=gemini-3-flash-preview REF_1=google/gemini-3-flash-preview \
321
+ MODEL_2=claude-sonnet-4-6 REF_2=anthropic/claude-sonnet-4.6 \
322
+ CACHE_MODEL=claude-sonnet-4-6 \
243
323
  REF_API_KEY=$OPENROUTER_API_KEY REF_BASE=https://openrouter.ai/api \
244
324
  bash scripts/check-provider.sh
245
325
  ```
package/README.zh-CN.md CHANGED
@@ -98,6 +98,46 @@ const lcr = createLCR({
98
98
 
99
99
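  The same pattern works for any vendor's native SDK provider: `@ai-sdk/anthropic`, `@ai-sdk/google`, `@ai-sdk/openai`, `@ai-sdk/xai`, and so on. They all return `LanguageModelV3`, so you can mix a vendor's native API with aggregators in one model's list. Native APIs have narrow coverage (only that vendor's own models) but full features; aggregators have broad coverage. **Official-first + aggregator-fallback** is the most natural LCR shape.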
100
100
 
101
+ ## Cheapest route for open-weights models (DeepInfra)
102
+
103
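+ For open-weights models (DeepSeek, Kimi, MiniMax, GLM, Qwen), a dedicated inference host is usually the cheapest route, clearly below aggregator pricing. [DeepInfra](https://deepinfra.com) is OpenAI-compatible, so it can be used as just another entry in the list. **One gotcha:** its OpenAI endpoint is at `/v1/openai` (the `/v1/` comes **before** `openai`), not the usual `/v1`: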
104
+
105
+ ```ts
106
+ import { createLCR } from "ai-lcr";
107
+ import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
108
+
109
+ const deepinfra = createOpenAICompatible({
110
+ name: "deepinfra",
111
+ baseURL: "https://api.deepinfra.com/v1/openai", // note: /v1/openai, not /v1
112
+ apiKey: process.env.DEEPINFRA_API_KEY,
113
+ });
114
+ const openrouter = createOpenAICompatible({
115
+ name: "openrouter",
116
+ baseURL: "https://openrouter.ai/api/v1",
117
+ apiKey: process.env.OPENROUTER_API_KEY,
118
+ });
119
+
120
+ const lcr = createLCR({
121
+ autoSort: true,
122
+ models: {
123
+ // DeepInfra is cheapest; OpenRouter is the breadth / availability fallback.
124
+ // DeepInfra uses HuggingFace-style ids (org/Name).
125
+ "deepseek-v4-flash": [
126
+ { model: deepinfra("deepseek-ai/DeepSeek-V4-Flash"), label: "deepinfra", cost: { input: 0.10, output: 0.20 } },
127
+ { model: openrouter("deepseek/deepseek-v4-flash"), label: "openrouter", cost: { input: 0.27, output: 1.10 } },
128
+ ],
129
+ "minimax-m2.5": [
130
+ { model: deepinfra("MiniMaxAI/MiniMax-M2.5"), label: "deepinfra", cost: { input: 0.15, output: 1.15 } },
131
+ ],
132
+ "kimi-k2.5": [
133
+ { model: deepinfra("moonshotai/Kimi-K2.5"), label: "deepinfra", cost: { input: 0.45, output: 2.25 } },
134
+ ],
135
+ },
136
+ });
137
+ ```
138
+
139
+ DeepInfra hosts open weights only: no first-party Claude / GPT / Gemini. For those closed models, route through OpenRouter or a discount relay.
140
+
101
141
  ## How it routes
102
142
 
103
143
  1. **Cheapest first.** Providers are tried in order: list them cheapest-first, or set `autoSort: true` to have them sorted by `cost` automatically.
@@ -192,11 +232,11 @@ API_KEY=$KUNAVO_API_KEY BASE=https://api.kunavo.com \
192
232
  REF_API_KEY=$OPENROUTER_API_KEY REF_BASE=https://openrouter.ai/api \
193
233
  bash scripts/check-provider.sh
194
234
 
195
- # TokenMart uses vendor-prefixed model IDs
196
- API_KEY=$TOKENMART_API_KEY BASE=https://api.tokenmart.ai \
197
- MODEL_1=google/gemini-3-flash REF_1=google/gemini-3-flash-preview \
198
- MODEL_2=anthropic/claude-sonnet-4-6 REF_2=anthropic/claude-sonnet-4.6 \
199
- CACHE_MODEL=anthropic/claude-sonnet-4-6 \
235
+ # TokenMart (Inference AI) uses bare model IDs without a vendor prefix
236
+ API_KEY=$INFERENCE_API_KEY BASE=https://model.service-inference.ai \
237
+ MODEL_1=gemini-3-flash-preview REF_1=google/gemini-3-flash-preview \
238
+ MODEL_2=claude-sonnet-4-6 REF_2=anthropic/claude-sonnet-4.6 \
239
+ CACHE_MODEL=claude-sonnet-4-6 \
200
240
  REF_API_KEY=$OPENROUTER_API_KEY REF_BASE=https://openrouter.ai/api \
201
241
  bash scripts/check-provider.sh
202
242
  ```
package/dist/index.cjs CHANGED
@@ -27,6 +27,7 @@ __export(index_exports, {
27
27
  classifyErrorKind: () => classifyErrorKind,
28
28
  comparePrices: () => comparePrices,
29
29
  createFalMediaAdapter: () => createFalMediaAdapter,
30
+ createHttpSink: () => createHttpSink,
30
31
  createKunavoMediaAdapter: () => createKunavoMediaAdapter,
31
32
  createLCR: () => createLCR,
32
33
  createMediaLCR: () => createMediaLCR,
@@ -186,6 +187,16 @@ function newCallId() {
186
187
  if (c?.randomUUID) return c.randomUUID();
187
188
  return `lcr_${Date.now().toString(36)}_${(callSeq++).toString(36)}`;
188
189
  }
190
+ function costForUsage(cost, inputTokens, outputTokens, cacheReadTokens) {
191
+ const cached = Math.min(Math.max(cacheReadTokens, 0), inputTokens);
192
+ const fullInput = inputTokens - cached;
193
+ const cachedRate = cost.cacheRead ?? cost.input;
194
+ return fullInput / 1e6 * cost.input + cached / 1e6 * cachedRate + outputTokens / 1e6 * cost.output;
195
+ }
196
+ function requestIdFrom(options) {
197
+ const raw = options.providerOptions?.lcr?.requestId;
198
+ return typeof raw === "string" && raw.length > 0 ? raw : void 0;
199
+ }
189
200
  var LcrFallbackModel = class {
190
201
  constructor(opts) {
191
202
  this.opts = opts;
@@ -267,8 +278,13 @@ var LcrFallbackModel = class {
267
278
  } catch {
268
279
  }
269
280
  }
270
- startCall() {
271
- return { id: newCallId(), attempts: [], startedAt: Date.now() };
281
+ startCall(options) {
282
+ return {
283
+ id: newCallId(),
284
+ attempts: [],
285
+ startedAt: Date.now(),
286
+ requestId: requestIdFrom(options)
287
+ };
272
288
  }
273
289
  /** Record a failed attempt onto the call's chain (no event yet). */
274
290
  recordFail(ctx, provider, attemptStart, error) {
@@ -280,12 +296,29 @@ var LcrFallbackModel = class {
280
296
  kind: classifyErrorKind(error)
281
297
  });
282
298
  }
299
+ /**
300
+ * Baseline = what this same usage would have cost on the most expensive
301
+ * *priced* provider in the chain (typically the OpenRouter fallback leg). The
302
+ * winner's savings is `baselineUsd - costUsd`. Undefined when no provider in
303
+ * the chain carries a price (nothing to compare against).
304
+ */
305
+ baselineUsd(inputTokens, outputTokens, cacheReadTokens) {
306
+ let max;
307
+ for (const p of this.opts.providers) {
308
+ if (!p.cost) continue;
309
+ const c = costForUsage(p.cost, inputTokens, outputTokens, cacheReadTokens);
310
+ if (max === void 0 || c > max) max = c;
311
+ }
312
+ return max;
313
+ }
283
314
  /** Winner settled: record the attempt, fire `onCost` (compat) + `onCall`. */
284
- finalizeOk(ctx, provider, attemptStart, usage) {
315
+ finalizeOk(ctx, provider, attemptStart, usage, ttftMs) {
285
316
  ctx.attempts.push({ provider: provider.label, ok: true, latencyMs: Date.now() - attemptStart });
286
317
  const inputTokens = usage?.inputTokens?.total ?? 0;
287
318
  const outputTokens = usage?.outputTokens?.total ?? 0;
288
- const costUsd = provider.cost ? inputTokens / 1e6 * provider.cost.input + outputTokens / 1e6 * provider.cost.output : 0;
319
+ const cacheReadTokens = usage?.inputTokens?.cacheRead ?? 0;
320
+ const costUsd = provider.cost ? costForUsage(provider.cost, inputTokens, outputTokens, cacheReadTokens) : 0;
321
+ const usageMissing = inputTokens === 0 && outputTokens === 0;
289
322
  this.emitCost({
290
323
  model: this.opts.modelName,
291
324
  provider: provider.label,
@@ -301,9 +334,14 @@ var LcrFallbackModel = class {
301
334
  ok: true,
302
335
  failedOver: ctx.attempts.length > 1,
303
336
  latencyMs: Date.now() - ctx.startedAt,
337
+ ...ttftMs !== void 0 ? { ttftMs } : {},
304
338
  inputTokens,
305
339
  outputTokens,
306
- costUsd
340
+ ...cacheReadTokens > 0 ? { cachedInputTokens: cacheReadTokens } : {},
341
+ costUsd,
342
+ baselineUsd: this.baselineUsd(inputTokens, outputTokens, cacheReadTokens),
343
+ ...ctx.requestId ? { requestId: ctx.requestId } : {},
344
+ ...usageMissing ? { usageMissing: true } : {}
307
345
  });
308
346
  }
309
347
  /** Every provider failed: fire `onCall` with no winner. */
@@ -318,11 +356,12 @@ var LcrFallbackModel = class {
318
356
  latencyMs: Date.now() - ctx.startedAt,
319
357
  inputTokens: 0,
320
358
  outputTokens: 0,
321
- costUsd: 0
359
+ costUsd: 0,
360
+ ...ctx.requestId ? { requestId: ctx.requestId } : {}
322
361
  });
323
362
  }
324
363
  async doGenerate(options) {
325
- const ctx = this.startCall();
364
+ const ctx = this.startCall(options);
326
365
  const providers = this.opts.providers;
327
366
  const n = providers.length;
328
367
  const start = this.startIndex();
@@ -351,7 +390,7 @@ var LcrFallbackModel = class {
351
390
  throw lastError;
352
391
  }
353
392
  async doStream(options) {
354
- return this.doStreamWithCtx(options, this.startCall(), this.startIndex(), 0);
393
+ return this.doStreamWithCtx(options, this.startCall(options), this.startIndex(), 0);
355
394
  }
356
395
  // The stream's failover recursion re-enters here with the SAME `ctx` and a
357
396
  // threaded-through local cursor (`idx`/`tried`), so a mid-stream switch keeps
@@ -395,6 +434,7 @@ var LcrFallbackModel = class {
395
434
  const triedBeforeServing = tried;
396
435
  let usage;
397
436
  let streamedAny = false;
437
+ let ttftMs;
398
438
  const stream = new ReadableStream({
399
439
  async start(controller) {
400
440
  let reader = null;
@@ -408,11 +448,14 @@ var LcrFallbackModel = class {
408
448
  }
409
449
  if (done) break;
410
450
  if (value.type === "finish") usage = value.usage;
451
+ if (ttftMs === void 0 && (value.type === "text-delta" || value.type === "reasoning-delta")) {
452
+ ttftMs = Date.now() - servingAttemptStart;
453
+ }
411
454
  controller.enqueue(value);
412
455
  if (value.type !== "stream-start") streamedAny = true;
413
456
  }
414
457
  self.settleSticky(servingIdx);
415
- self.finalizeOk(ctx, servingProvider, servingAttemptStart, usage);
458
+ self.finalizeOk(ctx, servingProvider, servingAttemptStart, usage, ttftMs);
416
459
  controller.close();
417
460
  } catch (error) {
418
461
  self.emitError(error, servingProvider.label);
@@ -474,7 +517,12 @@ function formatCallRecord(record, opts = {}) {
474
517
  const glyph = !record.ok ? "\u2717" : record.failedOver ? "\u26A0" : "\u2713";
475
518
  const chain = record.attempts.map((a) => a.provider).join("\u2192") || record.winner || "\u2014";
476
519
  const status = formatCost(record);
477
- let line = `${glyph} ${record.model} ${chain} ${record.latencyMs}ms ${status}`;
520
+ const timing = record.ttftMs !== void 0 ? `${record.latencyMs}ms (ttft ${record.ttftMs}ms)` : `${record.latencyMs}ms`;
521
+ let line = `${glyph} ${record.model} ${chain} ${timing} ${status}`;
522
+ if (record.ok && record.baselineUsd !== void 0 && record.baselineUsd > record.costUsd) {
523
+ line += ` (saved $${(record.baselineUsd - record.costUsd).toFixed(4)})`;
524
+ }
525
+ if (record.usageMissing) line += ` \u26A0no-usage`;
478
526
  const failed = record.attempts.filter((a) => !a.ok);
479
527
  if (failed.length > 0) {
480
528
  const reasons = failed.map((a) => `${a.provider} ${a.errorClass ?? "error"}`).join(", ");
@@ -487,6 +535,40 @@ function formatCallRecord(record, opts = {}) {
487
535
  return line;
488
536
  }
489
537
 
538
+ // src/sink.ts
539
+ function createHttpSink(options) {
540
+ const {
541
+ url,
542
+ headers,
543
+ project,
544
+ dispatch = (task) => {
545
+ void task();
546
+ },
547
+ fetchImpl,
548
+ onError
549
+ } = options;
550
+ const doFetch = fetchImpl ?? globalThis.fetch;
551
+ return (record) => {
552
+ if (!doFetch) {
553
+ onError?.(new Error("ai-lcr: no fetch available for createHttpSink"));
554
+ return;
555
+ }
556
+ const payload = project ? { project, ...record } : record;
557
+ dispatch(async () => {
558
+ try {
559
+ await doFetch(url, {
560
+ method: "POST",
561
+ headers: { "content-type": "application/json", ...headers },
562
+ body: JSON.stringify(payload),
563
+ keepalive: true
564
+ });
565
+ } catch (err) {
566
+ onError?.(err);
567
+ }
568
+ });
569
+ };
570
+ }
571
+
490
572
  // src/media.ts
491
573
  var DEFAULT_REFERENCE = {
492
574
  image: { width: 1920, height: 1080 },
@@ -1053,6 +1135,7 @@ function createLCR(config) {
1053
1135
  classifyErrorKind,
1054
1136
  comparePrices,
1055
1137
  createFalMediaAdapter,
1138
+ createHttpSink,
1056
1139
  createKunavoMediaAdapter,
1057
1140
  createLCR,
1058
1141
  createMediaLCR,
package/dist/index.d.cts CHANGED
@@ -17,8 +17,18 @@ import { LanguageModelV3 } from '@ai-sdk/provider';
17
17
 
18
18
  /** USD per 1M tokens. */
19
19
  interface ProviderCost {
20
+ /** USD per 1M input (prompt) tokens. */
20
21
  input: number;
22
+ /** USD per 1M output (completion) tokens. */
21
23
  output: number;
24
+ /**
25
+ * USD per 1M *cached* input tokens read (prompt-cache hits). Optional. When a
26
+ * call reports `usage.inputTokens.cacheRead`, those tokens are billed at this
27
+ * rate instead of `input` — so the cost stays honest for cache-heavy traffic
28
+ * (e.g. Anthropic, where a cache read is ~0.1× the input price). Omit it and
29
+ * cached tokens fall back to the full `input` rate (the pre-0.3 behavior).
30
+ */
31
+ cacheRead?: number;
22
32
  }
23
33
  interface CostEvent {
24
34
  /** Logical model name (the key in createLCR's `models`). */
@@ -75,17 +85,50 @@ interface CallRecord {
75
85
  failedOver: boolean;
76
86
  /** Total wall time across all attempts, ms. */
77
87
  latencyMs: number;
88
+ /**
89
+ * Time to first token (TTFT), ms — the industry-standard responsiveness
90
+ * metric. Measured from the *winning* provider's stream attempt start to its
91
+ * first content token (`text-delta` / `reasoning-delta`), so it captures how
92
+ * fast the model that actually served started replying, not failover overhead
93
+ * (that's already in `latencyMs`). Streaming only: **undefined** for
94
+ * `doGenerate` (the whole response lands at once, so there's no "first token")
95
+ * and for calls that failed before producing any content. With `latencyMs` and
96
+ * `outputTokens`, output throughput is derivable: `outputTokens / ((latencyMs −
97
+ * ttftMs) / 1000)` tokens/sec.
98
+ */
99
+ ttftMs?: number;
78
100
  inputTokens: number;
79
101
  outputTokens: number;
102
+ /**
103
+ * Cached input (prompt-cache) tokens the winner read, when the provider
104
+ * reported them (`usage.inputTokens.cacheRead`). Present only when > 0. Lets
105
+ * the dashboard show cache-hit volume and audit why `costUsd` is lower than
106
+ * list price × tokens. Undefined when the provider reports no cache info.
107
+ */
108
+ cachedInputTokens?: number;
80
109
  /** Computed from the winner's `cost`; 0 if no price was given or the call failed. */
81
110
  costUsd: number;
82
111
  /**
83
- * What the same request would have cost on the most expensive configured
84
- * provider the savings baseline (`baselineUsd - costUsd`). Set by the media
85
- * router; the text router omits it (left undefined) until a per-call text
86
- * baseline lands. Optional so both routers share one {@link CallRecord} shape.
112
+ * What the same request would have cost on the most expensive *priced*
113
+ * provider in the chain, on identical token usage: the savings baseline
114
+ * (`baselineUsd - costUsd`). Set by both routers whenever at least one
115
+ * provider carries a `cost`; undefined only when no provider was priced.
87
116
  */
88
117
  baselineUsd?: number;
118
+ /**
119
+ * Caller-supplied correlation id, read from `providerOptions.lcr.requestId`
120
+ * on the call. Multi-step tool loops emit one record per `doStream`/
121
+ * `doGenerate` step; stamp the same `requestId` on every step to let the
122
+ * dashboard roll a whole user request up into one cost/`calls` figure.
123
+ */
124
+ requestId?: string;
125
+ /**
126
+ * True when the winner served OK but reported **zero** input *and* output
127
+ * tokens — i.e. the provider didn't emit usage. A silent danger: `costUsd`
128
+ * collapses to 0 and any token-based credit metering under-charges with no
129
+ * other signal. Treat a flagged record as "cost unknown", not "free".
130
+ */
131
+ usageMissing?: boolean;
89
132
  }
90
133
  /**
91
134
  * Normalize an error into a short, log-friendly class for {@link CallRecord}.
@@ -110,6 +153,7 @@ declare function classifyErrorKind(error: unknown): ErrorKind;
110
153
  * latency, cost, and — when anything failed — the reason for each failed hop.
111
154
  *
112
155
  * ✓ text tokenmart 412ms $0.0003
156
+ * ✓ text tokenmart 412ms (ttft 88ms) $0.0003 ← streaming: TTFT shown when known
113
157
  * ⚠ text tokenmart→openrouter 910ms $0.0004 ⤷ tokenmart 502
114
158
  * ✗ text deepseek→tokenmart→openrouter 1240ms FAILED ⤷ deepseek 401, tokenmart 502, openrouter 429
115
159
  *
@@ -122,6 +166,54 @@ interface FormatOptions {
122
166
  }
123
167
  declare function formatCallRecord(record: CallRecord, opts?: FormatOptions): string;
124
168
 
169
+ /**
170
+ * Optional HTTP sink for `onCall` — ship each {@link CallRecord} as JSON to a
171
+ * collector (e.g. a self-hosted ai-lcr-dashboard `/api/ingest`, or any endpoint
172
+ * that accepts the CallRecord shape).
173
+ *
174
+ * Fully optional and dashboard-agnostic: omit it and ai-lcr stores nothing;
175
+ * point `url` at whatever you run. Logging must never break your app, so a
176
+ * failed POST is swallowed by default (surface it via `onError` if you want).
177
+ *
178
+ * import { createLCR, createHttpSink } from "ai-lcr";
179
+ * import { after } from "next/server"; // serverless: don't block the response
180
+ *
181
+ * const lcr = createLCR({
182
+ * models: { ... },
183
+ * onCall: createHttpSink({
184
+ * url: process.env.LCR_INGEST_URL + "/api/ingest",
185
+ * headers: { authorization: `Bearer ${process.env.LCR_INGEST_KEY}` },
186
+ * project: process.env.LCR_PROJECT,
187
+ * dispatch: after, // run after the response is sent
188
+ * }),
189
+ * });
190
+ */
191
+
192
+ interface HttpSinkOptions {
193
+ /** Where to POST each CallRecord (a collector that accepts the JSON shape). */
194
+ url: string;
195
+ /** Extra headers, e.g. `{ authorization: "Bearer <key>" }`. */
196
+ headers?: Record<string, string>;
197
+ /** Optional tenant/project tag merged into each payload (`{ project, ...record }`). */
198
+ project?: string;
199
+ /**
200
+ * Wrap the dispatch so it survives a serverless function returning. On
201
+ * Next.js pass `after` from "next/server"; elsewhere pass a `waitUntil`-style
202
+ * function. Defaults to running immediately — correct for long-lived servers,
203
+ * but on serverless an un-awaited POST may be cut off, so pass `after`.
204
+ */
205
+ dispatch?: (task: () => void | Promise<void>) => void;
206
+ /** Custom fetch (tests / runtimes without a global `fetch`). */
207
+ fetchImpl?: typeof fetch;
208
+ /** Called if the POST fails. Failures are swallowed by default. */
209
+ onError?: (error: unknown) => void;
210
+ }
211
+ /**
212
+ * Build an `onCall` handler that POSTs each {@link CallRecord} to `url`.
213
+ * Returns a plain `(record) => void` — pass it straight to `createLCR`'s `onCall`.
214
+ */
215
+ declare function createHttpSink(options: HttpSinkOptions): (record: CallRecord) => void;
216
+
125
217
  /**
126
218
  * ai-lcr media routing — Least Cost Routing for image & video models.
127
219
  *
@@ -453,4 +545,4 @@ type LCRRouter = (modelName: string) => LanguageModelV3;
453
545
  */
454
546
  declare function createLCR(config: LCRConfig): LCRRouter;
455
547
 
456
- export { type CallRecord, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaUnit, type PriceComparisonRow, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createKunavoMediaAdapter, createLCR, createMediaLCR, createRunwareMediaAdapter, formatCallRecord, normalizedCents, rankRoutes, referenceMegapixels };
548
+ export { type CallRecord, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaUnit, type PriceComparisonRow, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createRunwareMediaAdapter, formatCallRecord, normalizedCents, rankRoutes, referenceMegapixels };
package/dist/index.d.ts CHANGED
@@ -17,8 +17,18 @@ import { LanguageModelV3 } from '@ai-sdk/provider';
17
17
 
18
18
  /** USD per 1M tokens. */
19
19
  interface ProviderCost {
20
+ /** USD per 1M input (prompt) tokens. */
20
21
  input: number;
22
+ /** USD per 1M output (completion) tokens. */
21
23
  output: number;
24
+ /**
25
+ * USD per 1M *cached* input tokens read (prompt-cache hits). Optional. When a
26
+ * call reports `usage.inputTokens.cacheRead`, those tokens are billed at this
27
+ * rate instead of `input` — so the cost stays honest for cache-heavy traffic
28
+ * (e.g. Anthropic, where a cache read is ~0.1× the input price). Omit it and
29
+ * cached tokens fall back to the full `input` rate (the pre-0.3 behavior).
30
+ */
31
+ cacheRead?: number;
22
32
  }
23
33
  interface CostEvent {
24
34
  /** Logical model name (the key in createLCR's `models`). */
@@ -75,17 +85,50 @@ interface CallRecord {
75
85
  failedOver: boolean;
76
86
  /** Total wall time across all attempts, ms. */
77
87
  latencyMs: number;
88
+ /**
89
+ * Time to first token (TTFT), ms — the industry-standard responsiveness
90
+ * metric. Measured from the *winning* provider's stream attempt start to its
91
+ * first content token (`text-delta` / `reasoning-delta`), so it captures how
92
+ * fast the model that actually served started replying, not failover overhead
93
+ * (that's already in `latencyMs`). Streaming only: **undefined** for
94
+ * `doGenerate` (the whole response lands at once, so there's no "first token")
95
+ * and for calls that failed before producing any content. With `latencyMs` and
96
+ * `outputTokens`, output throughput is derivable: `outputTokens / ((latencyMs −
97
+ * ttftMs) / 1000)` tokens/sec.
98
+ */
99
+ ttftMs?: number;
78
100
  inputTokens: number;
79
101
  outputTokens: number;
102
+ /**
103
+ * Cached input (prompt-cache) tokens the winner read, when the provider
104
+ * reported them (`usage.inputTokens.cacheRead`). Present only when > 0. Lets
105
+ * the dashboard show cache-hit volume and audit why `costUsd` is lower than
106
+ * sticker × tokens. Undefined when the provider reports no cache info.
107
+ */
108
+ cachedInputTokens?: number;
80
109
  /** Computed from the winner's `cost`; 0 if no price was given or the call failed. */
81
110
  costUsd: number;
82
111
  /**
83
- * What the same request would have cost on the most expensive configured
84
- * provider: the savings baseline (`baselineUsd - costUsd`). Set by the media
85
- * router; the text router omits it (left undefined) until a per-call text
86
- * baseline lands. Optional so both routers share one {@link CallRecord} shape.
112
+ * What the same request would have cost on the most expensive *priced*
113
+ * provider in the chain, on identical token usage: the savings baseline
114
+ * (`baselineUsd - costUsd`). Set by both routers whenever at least one
115
+ * provider carries a `cost`; undefined only when no provider was priced.
87
116
  */
88
117
  baselineUsd?: number;
118
+ /**
119
+ * Caller-supplied correlation id, read from `providerOptions.lcr.requestId`
120
+ * on the call. Multi-step tool loops emit one record per `doStream`/
121
+ * `doGenerate` step; stamp the same `requestId` on every step to let the
122
+ * dashboard roll a whole user request up into one cost/`calls` figure.
123
+ */
124
+ requestId?: string;
125
+ /**
126
+ * True when the winner served OK but reported **zero** input *and* output
127
+ * tokens — i.e. the provider didn't emit usage. A silent danger: `costUsd`
128
+ * collapses to 0 and any token-based credit metering under-charges with no
129
+ * other signal. Treat a flagged record as "cost unknown", not "free".
130
+ */
131
+ usageMissing?: boolean;
89
132
  }
90
133
  /**
91
134
  * Normalize an error into a short, log-friendly class for {@link CallRecord}.
@@ -110,6 +153,7 @@ declare function classifyErrorKind(error: unknown): ErrorKind;
110
153
  * latency, cost, and — when anything failed — the reason for each failed hop.
111
154
  *
112
155
  * ✓ text tokenmart 412ms $0.0003
156
+ * ✓ text tokenmart 412ms (ttft 88ms) $0.0003 ← streaming: TTFT shown when known
113
157
  * ⚠ text tokenmart→openrouter 910ms $0.0004 ⤷ tokenmart 502
114
158
  * ✗ text deepseek→tokenmart→openrouter 1240ms FAILED ⤷ deepseek 401, tokenmart 502, openrouter 429
115
159
  *
@@ -122,6 +166,54 @@ interface FormatOptions {
122
166
  }
123
167
  declare function formatCallRecord(record: CallRecord, opts?: FormatOptions): string;
124
168
 
169
+ /**
170
+ * Optional HTTP sink for `onCall` — ship each {@link CallRecord} as JSON to a
171
+ * collector (e.g. a self-hosted ai-lcr-dashboard `/api/ingest`, or any endpoint
172
+ * that accepts the CallRecord shape).
173
+ *
174
+ * Fully optional and dashboard-agnostic: omit it and ai-lcr stores nothing;
175
+ * point `url` at whatever you run. Logging must never break your app, so a
176
+ * failed POST is swallowed by default (surface it via `onError` if you want).
177
+ *
178
+ * import { createLCR, createHttpSink } from "ai-lcr";
179
+ * import { after } from "next/server"; // serverless: don't block the response
180
+ *
181
+ * const lcr = createLCR({
182
+ * models: { ... },
183
+ * onCall: createHttpSink({
184
+ * url: process.env.LCR_INGEST_URL + "/api/ingest",
185
+ * headers: { authorization: `Bearer ${process.env.LCR_INGEST_KEY}` },
186
+ * project: process.env.LCR_PROJECT,
187
+ * dispatch: after, // run after the response is sent
188
+ * }),
189
+ * });
190
+ */
191
+
192
+ interface HttpSinkOptions {
193
+ /** Where to POST each CallRecord (a collector that accepts the JSON shape). */
194
+ url: string;
195
+ /** Extra headers, e.g. `{ authorization: "Bearer <key>" }`. */
196
+ headers?: Record<string, string>;
197
+ /** Optional tenant/project tag merged into each payload (`{ project, ...record }`). */
198
+ project?: string;
199
+ /**
200
+ * Wrap the dispatch so it survives a serverless function returning. On
201
+ * Next.js pass `after` from "next/server"; elsewhere pass a `waitUntil`-style
202
+ * function. Defaults to running immediately — correct for long-lived servers,
203
+ * but on serverless an un-awaited POST may be cut off, so pass `after`.
204
+ */
205
+ dispatch?: (task: () => void | Promise<void>) => void;
206
+ /** Custom fetch (tests / runtimes without a global `fetch`). */
207
+ fetchImpl?: typeof fetch;
208
+ /** Called if the POST fails. Failures are swallowed by default. */
209
+ onError?: (error: unknown) => void;
210
+ }
211
+ /**
212
+ * Build an `onCall` handler that POSTs each {@link CallRecord} to `url`.
213
+ * Returns a plain `(record) => void` — pass it straight to `createLCR`'s `onCall`.
214
+ */
215
+ declare function createHttpSink(options: HttpSinkOptions): (record: CallRecord) => void;
216
+
125
217
  /**
126
218
  * ai-lcr media routing — Least Cost Routing for image & video models.
127
219
  *
@@ -453,4 +545,4 @@ type LCRRouter = (modelName: string) => LanguageModelV3;
453
545
  */
454
546
  declare function createLCR(config: LCRConfig): LCRRouter;
455
547
 
456
- export { type CallRecord, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaUnit, type PriceComparisonRow, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createKunavoMediaAdapter, createLCR, createMediaLCR, createRunwareMediaAdapter, formatCallRecord, normalizedCents, rankRoutes, referenceMegapixels };
548
+ export { type CallRecord, type CostEvent, DEFAULT_REFERENCE, type ErrorKind, type FormatOptions, type HttpSinkOptions, type LCRConfig, type LCRRouter, MEDIA_PRICING, type MediaAdapter, type MediaCostEvent, type MediaGenerateRequest, type MediaGenerateResult, type MediaLCRConfig, type MediaModality, type MediaModelDef, type MediaOutput, type MediaPricing, type MediaRegistry, type MediaRoute, type MediaRunResult, type MediaUnit, type PriceComparisonRow, type ProviderCost, type ProviderEntry, type RankedRoute, type ReferenceSpec, type RouteAttempt, cheapestRoute, classifyError, classifyErrorKind, comparePrices, createFalMediaAdapter, createHttpSink, createKunavoMediaAdapter, createLCR, createMediaLCR, createRunwareMediaAdapter, formatCallRecord, normalizedCents, rankRoutes, referenceMegapixels };
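The `ttftMs` doc comment above says output throughput is derivable from `latencyMs`, `ttftMs`, and `outputTokens`. A minimal sketch of that derivation follows. `CallRecordLike` and `outputTokensPerSecond` are hypothetical names introduced here for illustration; they are not part of ai-lcr, and the sample numbers are made up.

```typescript
// Local stand-in for the few CallRecord fields this calculation reads.
// Assumption: the real CallRecord carries these fields as declared in the diff.
interface CallRecordLike {
  ok: boolean;
  latencyMs: number;
  ttftMs?: number; // undefined for doGenerate and for calls with no content
  outputTokens: number;
  usageMissing?: boolean; // true when the provider reported no usage
}

// Hypothetical helper (not an ai-lcr export): tokens/sec after the first token.
function outputTokensPerSecond(record: CallRecordLike): number | undefined {
  // No meaningful throughput for failed calls or calls without usage data.
  if (!record.ok || record.usageMissing) return undefined;
  // Non-streaming calls have no TTFT, so the formula does not apply.
  if (record.ttftMs === undefined) return undefined;
  const generationMs = record.latencyMs - record.ttftMs;
  if (generationMs <= 0) return undefined; // guard against division by zero
  return record.outputTokens / (generationMs / 1000);
}

// Illustrative record: 412 ms total, first token at 88 ms, 162 output tokens.
const sample: CallRecordLike = { ok: true, latencyMs: 412, ttftMs: 88, outputTokens: 162 };
const rate = outputTokensPerSecond(sample); // roughly 500 tokens/sec

// A doGenerate-style record without ttftMs yields no throughput figure.
const noTtft = outputTokensPerSecond({ ok: true, latencyMs: 900, outputTokens: 200 });
```

Note that `latencyMs` spans all attempts while `ttftMs` is measured from the winning attempt's start, so for a failed-over call the difference also includes time spent on earlier attempts and the result likely understates the winner's real throughput.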
package/dist/index.js CHANGED
@@ -146,6 +146,16 @@ function newCallId() {
146
146
  if (c?.randomUUID) return c.randomUUID();
147
147
  return `lcr_${Date.now().toString(36)}_${(callSeq++).toString(36)}`;
148
148
  }
149
+ function costForUsage(cost, inputTokens, outputTokens, cacheReadTokens) {
150
+ const cached = Math.min(Math.max(cacheReadTokens, 0), inputTokens);
151
+ const fullInput = inputTokens - cached;
152
+ const cachedRate = cost.cacheRead ?? cost.input;
153
+ return fullInput / 1e6 * cost.input + cached / 1e6 * cachedRate + outputTokens / 1e6 * cost.output;
154
+ }
155
+ function requestIdFrom(options) {
156
+ const raw = options.providerOptions?.lcr?.requestId;
157
+ return typeof raw === "string" && raw.length > 0 ? raw : void 0;
158
+ }
149
159
  var LcrFallbackModel = class {
150
160
  constructor(opts) {
151
161
  this.opts = opts;
@@ -227,8 +237,13 @@ var LcrFallbackModel = class {
227
237
  } catch {
228
238
  }
229
239
  }
230
- startCall() {
231
- return { id: newCallId(), attempts: [], startedAt: Date.now() };
240
+ startCall(options) {
241
+ return {
242
+ id: newCallId(),
243
+ attempts: [],
244
+ startedAt: Date.now(),
245
+ requestId: requestIdFrom(options)
246
+ };
232
247
  }
233
248
  /** Record a failed attempt onto the call's chain (no event yet). */
234
249
  recordFail(ctx, provider, attemptStart, error) {
@@ -240,12 +255,29 @@ var LcrFallbackModel = class {
240
255
  kind: classifyErrorKind(error)
241
256
  });
242
257
  }
258
+ /**
259
+ * Baseline = what this same usage would have cost on the most expensive
260
+ * *priced* provider in the chain (typically the OpenRouter fallback leg). The
261
+ * winner's savings is `baselineUsd - costUsd`. Undefined when no provider in
262
+ * the chain carries a price (nothing to compare against).
263
+ */
264
+ baselineUsd(inputTokens, outputTokens, cacheReadTokens) {
265
+ let max;
266
+ for (const p of this.opts.providers) {
267
+ if (!p.cost) continue;
268
+ const c = costForUsage(p.cost, inputTokens, outputTokens, cacheReadTokens);
269
+ if (max === void 0 || c > max) max = c;
270
+ }
271
+ return max;
272
+ }
243
273
  /** Winner settled: record the attempt, fire `onCost` (compat) + `onCall`. */
244
- finalizeOk(ctx, provider, attemptStart, usage) {
274
+ finalizeOk(ctx, provider, attemptStart, usage, ttftMs) {
245
275
  ctx.attempts.push({ provider: provider.label, ok: true, latencyMs: Date.now() - attemptStart });
246
276
  const inputTokens = usage?.inputTokens?.total ?? 0;
247
277
  const outputTokens = usage?.outputTokens?.total ?? 0;
248
- const costUsd = provider.cost ? inputTokens / 1e6 * provider.cost.input + outputTokens / 1e6 * provider.cost.output : 0;
278
+ const cacheReadTokens = usage?.inputTokens?.cacheRead ?? 0;
279
+ const costUsd = provider.cost ? costForUsage(provider.cost, inputTokens, outputTokens, cacheReadTokens) : 0;
280
+ const usageMissing = inputTokens === 0 && outputTokens === 0;
249
281
  this.emitCost({
250
282
  model: this.opts.modelName,
251
283
  provider: provider.label,
@@ -261,9 +293,14 @@ var LcrFallbackModel = class {
261
293
  ok: true,
262
294
  failedOver: ctx.attempts.length > 1,
263
295
  latencyMs: Date.now() - ctx.startedAt,
296
+ ...ttftMs !== void 0 ? { ttftMs } : {},
264
297
  inputTokens,
265
298
  outputTokens,
266
- costUsd
299
+ ...cacheReadTokens > 0 ? { cachedInputTokens: cacheReadTokens } : {},
300
+ costUsd,
301
+ baselineUsd: this.baselineUsd(inputTokens, outputTokens, cacheReadTokens),
302
+ ...ctx.requestId ? { requestId: ctx.requestId } : {},
303
+ ...usageMissing ? { usageMissing: true } : {}
267
304
  });
268
305
  }
269
306
  /** Every provider failed: fire `onCall` with no winner. */
@@ -278,11 +315,12 @@ var LcrFallbackModel = class {
278
315
  latencyMs: Date.now() - ctx.startedAt,
279
316
  inputTokens: 0,
280
317
  outputTokens: 0,
281
- costUsd: 0
318
+ costUsd: 0,
319
+ ...ctx.requestId ? { requestId: ctx.requestId } : {}
282
320
  });
283
321
  }
284
322
  async doGenerate(options) {
285
- const ctx = this.startCall();
323
+ const ctx = this.startCall(options);
286
324
  const providers = this.opts.providers;
287
325
  const n = providers.length;
288
326
  const start = this.startIndex();
@@ -311,7 +349,7 @@ var LcrFallbackModel = class {
311
349
  throw lastError;
312
350
  }
313
351
  async doStream(options) {
314
- return this.doStreamWithCtx(options, this.startCall(), this.startIndex(), 0);
352
+ return this.doStreamWithCtx(options, this.startCall(options), this.startIndex(), 0);
315
353
  }
316
354
  // The stream's failover recursion re-enters here with the SAME `ctx` and a
317
355
  // threaded-through local cursor (`idx`/`tried`), so a mid-stream switch keeps
@@ -355,6 +393,7 @@ var LcrFallbackModel = class {
355
393
  const triedBeforeServing = tried;
356
394
  let usage;
357
395
  let streamedAny = false;
396
+ let ttftMs;
358
397
  const stream = new ReadableStream({
359
398
  async start(controller) {
360
399
  let reader = null;
@@ -368,11 +407,14 @@ var LcrFallbackModel = class {
368
407
  }
369
408
  if (done) break;
370
409
  if (value.type === "finish") usage = value.usage;
410
+ if (ttftMs === void 0 && (value.type === "text-delta" || value.type === "reasoning-delta")) {
411
+ ttftMs = Date.now() - servingAttemptStart;
412
+ }
371
413
  controller.enqueue(value);
372
414
  if (value.type !== "stream-start") streamedAny = true;
373
415
  }
374
416
  self.settleSticky(servingIdx);
375
- self.finalizeOk(ctx, servingProvider, servingAttemptStart, usage);
417
+ self.finalizeOk(ctx, servingProvider, servingAttemptStart, usage, ttftMs);
376
418
  controller.close();
377
419
  } catch (error) {
378
420
  self.emitError(error, servingProvider.label);
@@ -434,7 +476,12 @@ function formatCallRecord(record, opts = {}) {
434
476
  const glyph = !record.ok ? "\u2717" : record.failedOver ? "\u26A0" : "\u2713";
435
477
  const chain = record.attempts.map((a) => a.provider).join("\u2192") || record.winner || "\u2014";
436
478
  const status = formatCost(record);
437
- let line = `${glyph} ${record.model} ${chain} ${record.latencyMs}ms ${status}`;
479
+ const timing = record.ttftMs !== void 0 ? `${record.latencyMs}ms (ttft ${record.ttftMs}ms)` : `${record.latencyMs}ms`;
480
+ let line = `${glyph} ${record.model} ${chain} ${timing} ${status}`;
481
+ if (record.ok && record.baselineUsd !== void 0 && record.baselineUsd > record.costUsd) {
482
+ line += ` (saved $${(record.baselineUsd - record.costUsd).toFixed(4)})`;
483
+ }
484
+ if (record.usageMissing) line += ` \u26A0no-usage`;
438
485
  const failed = record.attempts.filter((a) => !a.ok);
439
486
  if (failed.length > 0) {
440
487
  const reasons = failed.map((a) => `${a.provider} ${a.errorClass ?? "error"}`).join(", ");
@@ -447,6 +494,40 @@ function formatCallRecord(record, opts = {}) {
447
494
  return line;
448
495
  }
449
496
 
497
+ // src/sink.ts
498
+ function createHttpSink(options) {
499
+ const {
500
+ url,
501
+ headers,
502
+ project,
503
+ dispatch = (task) => {
504
+ void task();
505
+ },
506
+ fetchImpl,
507
+ onError
508
+ } = options;
509
+ const doFetch = fetchImpl ?? globalThis.fetch;
510
+ return (record) => {
511
+ if (!doFetch) {
512
+ onError?.(new Error("ai-lcr: no fetch available for createHttpSink"));
513
+ return;
514
+ }
515
+ const payload = project ? { project, ...record } : record;
516
+ dispatch(async () => {
517
+ try {
518
+ await doFetch(url, {
519
+ method: "POST",
520
+ headers: { "content-type": "application/json", ...headers },
521
+ body: JSON.stringify(payload),
522
+ keepalive: true
523
+ });
524
+ } catch (err) {
525
+ onError?.(err);
526
+ }
527
+ });
528
+ };
529
+ }
530
+
450
531
  // src/media.ts
451
532
  var DEFAULT_REFERENCE = {
452
533
  image: { width: 1920, height: 1080 },
@@ -1012,6 +1093,7 @@ export {
1012
1093
  classifyErrorKind,
1013
1094
  comparePrices,
1014
1095
  createFalMediaAdapter,
1096
+ createHttpSink,
1015
1097
  createKunavoMediaAdapter,
1016
1098
  createLCR,
1017
1099
  createMediaLCR,
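To make the cache-aware cost and baseline logic in this file's diff concrete, here is a small worked sketch. `costForUsage` below mirrors the function shown in the diff with type annotations added; `baselineFor` is a hypothetical standalone version of the `baselineUsd` method, and every price and token count is an illustrative assumption, not real provider pricing.

```typescript
// Mirrors the ProviderCost shape from the declarations (USD per 1M tokens).
interface ProviderCost {
  input: number;
  output: number;
  cacheRead?: number;
}

// Same arithmetic as costForUsage in the diff, with types added.
function costForUsage(
  cost: ProviderCost,
  inputTokens: number,
  outputTokens: number,
  cacheReadTokens: number,
): number {
  // Clamp cached tokens into [0, inputTokens] so bad usage data cannot go negative.
  const cached = Math.min(Math.max(cacheReadTokens, 0), inputTokens);
  const fullInput = inputTokens - cached;
  // Without a cacheRead price, cached tokens are billed at the full input rate.
  const cachedRate = cost.cacheRead ?? cost.input;
  return (fullInput / 1e6) * cost.input + (cached / 1e6) * cachedRate + (outputTokens / 1e6) * cost.output;
}

// Hypothetical standalone form of the baselineUsd method: the highest cost
// among priced providers for the same usage, or undefined if none is priced.
function baselineFor(
  costs: Array<ProviderCost | undefined>,
  inputTokens: number,
  outputTokens: number,
  cacheReadTokens: number,
): number | undefined {
  let max: number | undefined;
  for (const cost of costs) {
    if (!cost) continue; // unpriced providers do not contribute
    const c = costForUsage(cost, inputTokens, outputTokens, cacheReadTokens);
    if (max === undefined || c > max) max = c;
  }
  return max;
}

// Illustrative prices only (assumptions for this example).
const cheap: ProviderCost = { input: 1, output: 5, cacheRead: 0.1 };
const fallback: ProviderCost = { input: 3, output: 15, cacheRead: 0.3 };

// 10,000 input tokens, 8,000 of them cache reads, 500 output tokens.
const costUsd = costForUsage(cheap, 10_000, 500, 8_000);
const baseline = baselineFor([cheap, undefined, fallback], 10_000, 500, 8_000);
const savedUsd = baseline === undefined ? undefined : baseline - costUsd;

// Same usage with no cacheRead price: cached tokens fall back to the input rate.
const costNoCachePrice = costForUsage({ input: 1, output: 5 }, 10_000, 500, 8_000);
```

With these numbers the winner's cost works out to about $0.0053, the baseline to about $0.0159, and the cost without a `cacheRead` price to about $0.0125, which shows how omitting the cached rate overstates cost for cache-heavy traffic.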
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "ai-lcr",
3
- "version": "0.2.6",
3
+ "version": "0.4.0",
4
4
  "description": "Least Cost Routing for LLMs — route every model call to the cheapest available provider, fall back automatically, and track real cost. Built for the Vercel AI SDK.",
5
5
  "keywords": [
6
6
  "ai",