ai-sdk-rate-limiter 0.4.0 → 0.6.0

package/README.md CHANGED
@@ -56,9 +56,152 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
56
56
 
57
57
  **Raw SDK support** — Works with the native OpenAI, Anthropic, Groq, Mistral, and Cohere SDKs directly via a transparent JavaScript Proxy. No Vercel AI SDK required.
58
58
 
59
+ **OpenTelemetry** — Drop-in OTel plugin that emits GenAI-spec spans for every request. Works with any OTel-compatible tracer.
60
+
61
+ **CLI audit** — `npx ai-sdk-rate-limiter audit` probes your API keys to detect your actual tier limits and generates a ready-to-paste config override.
62
+
63
+ ---
64
+
65
+ ## Contents
66
+
67
+ - [Vercel AI SDK usage](#vercel-ai-sdk-usage)
68
+ - [Raw SDK proxy](#raw-sdk-proxy)
69
+ - [Configuration reference](#configuration-reference)
70
+ - [Per-request options](#per-request-options)
71
+ - [Cost tracking](#cost-tracking)
72
+ - [Budget fallback routing](#budget-fallback-routing)
73
+ - [Multi-instance Redis store](#multi-instance-redis-store)
74
+ - [Events](#events)
75
+ - [Backpressure](#backpressure)
76
+ - [Error handling](#error-handling)
77
+ - [OpenTelemetry](#opentelemetry)
78
+ - [CLI audit](#cli-audit)
79
+ - [Model registry](#model-registry)
80
+ - [Advanced usage](#advanced-usage)
81
+ - [How it works](#how-it-works)
82
+ - [Comparison](#comparison)
83
+ - [TypeScript](#typescript)
84
+ - [Requirements](#requirements)
85
+
86
+ ---
87
+
88
+ ## Vercel AI SDK usage
89
+
90
+ ### Basic wrap
91
+
92
+ ```typescript
93
+ import { createRateLimiter } from 'ai-sdk-rate-limiter'
94
+ import { openai } from '@ai-sdk/openai'
95
+ import { generateText, streamText } from 'ai'
96
+
97
+ const limiter = createRateLimiter()
98
+ const model = limiter.wrap(openai('gpt-4o'))
99
+
100
+ // generateText
101
+ const { text } = await generateText({ model, prompt: 'Summarize this...' })
102
+
103
+ // streamText — streaming is first-class, rate limit slot consumed at request start
104
+ const result = streamText({ model, messages })
105
+ for await (const chunk of result.textStream) {
106
+ process.stdout.write(chunk)
107
+ }
108
+ ```
109
+
110
+ ### Using the raw middleware
111
+
112
+ If you use `wrapLanguageModel` directly or need to compose middleware:
113
+
114
+ ```typescript
115
+ import { wrapLanguageModel } from 'ai'
116
+
117
+ // Single middleware
118
+ const model = wrapLanguageModel({
119
+ model: openai('gpt-4o'),
120
+ middleware: limiter.middleware,
121
+ })
122
+
123
+ // Composed with other middleware
124
+ const composedModel = wrapLanguageModel({
125
+ model: openai('gpt-4o'),
126
+ middleware: [loggingMiddleware, limiter.middleware, cachingMiddleware],
127
+ })
128
+ ```
129
+
130
+ ---
131
+
132
+ ## Raw SDK proxy
133
+
134
+ If you're using the OpenAI, Anthropic, Groq, Mistral, or Cohere SDK directly — without the Vercel AI SDK — use `limiter.rawProxy()` to add rate limiting as a transparent drop-in:
135
+
136
+ ```typescript
137
+ import { createRateLimiter } from 'ai-sdk-rate-limiter'
138
+ import OpenAI from 'openai'
139
+ import Anthropic from '@anthropic-ai/sdk'
140
+
141
+ const limiter = createRateLimiter({
142
+ cost: { budget: { daily: 50 }, onExceeded: 'throw' },
143
+ on: { rateLimited: ({ model }) => console.warn(`${model} rate limited`) },
144
+ })
145
+
146
+ // Every API call goes through the same rate limiter and cost tracker
147
+ const openai = limiter.rawProxy(new OpenAI())
148
+ const anthropic = limiter.rawProxy(new Anthropic())
149
+
150
+ // Use exactly as before — no other changes needed
151
+ const completion = await openai.chat.completions.create({
152
+ model: 'gpt-4o',
153
+ messages: [{ role: 'user', content: 'Hello!' }],
154
+ })
155
+
156
+ const message = await anthropic.messages.create({
157
+ model: 'claude-opus-4-6',
158
+ max_tokens: 1024,
159
+ messages: [{ role: 'user', content: 'Hello!' }],
160
+ })
161
+
162
+ // Cost from both clients tracked together
163
+ const report = limiter.getCostReport()
164
+ ```
165
+
166
+ **Streaming** — the proxy wraps the returned `AsyncIterable` to capture the final usage chunk automatically:
167
+
168
+ ```typescript
169
+ const stream = await openai.chat.completions.create({
170
+ model: 'gpt-4o',
171
+ messages: [{ role: 'user', content: 'Stream this' }],
172
+ stream: true,
173
+ stream_options: { include_usage: true }, // OpenAI: include usage in final chunk
174
+ })
175
+
176
+ for await (const chunk of stream) {
177
+ process.stdout.write(chunk.choices[0]?.delta?.content ?? '')
178
+ }
179
+ // After the loop, tokens are recorded in limiter.getCostReport()
180
+ ```
181
+
182
+ **Standalone — no shared limiter needed:**
183
+
184
+ ```typescript
185
+ import { rateLimited } from 'ai-sdk-rate-limiter'
186
+
187
+ const openai = rateLimited(new OpenAI(), {
188
+ config: { cost: { budget: { daily: 20 } } },
189
+ })
190
+ ```
191
+
192
+ **Override auto-detected provider** — useful for OpenAI-compatible endpoints:
193
+
194
+ ```typescript
195
+ const client = limiter.rawProxy(new OpenAI({ baseURL: 'https://api.groq.com/openai/v1' }), {
196
+ provider: 'groq', // use Groq's limits and pricing instead of OpenAI's
197
+ })
198
+ ```
199
+
200
+ Provider is auto-detected from the client's constructor name (`OpenAI` → `openai`, `Anthropic` → `anthropic`, `Groq` → `groq`, etc.).
201
+
59
202
  ---
60
203
 
61
- ## Configuration
204
+ ## Configuration reference
62
205
 
63
206
  Everything has a sensible default. Override only what you need.
64
207
 
@@ -67,17 +210,15 @@ const limiter = createRateLimiter({
67
210
  // Override or extend built-in model limits for your API tier
68
211
  limits: {
69
212
  'gpt-4o': { rpm: 500, itpm: 30_000 },
70
- 'claude-opus-4-6': { rpm: 50, itpm: 30_000 },
71
- // Wildcard: apply to all models from a provider
72
- 'openai/*': { rpm: 1000 },
213
+ 'claude-opus-4-6': { rpm: 50, itpm: 20_000 },
73
214
  },
74
215
 
75
216
  // Cost budgets and behavior when exceeded
76
217
  cost: {
77
218
  budget: {
78
- hourly: 5, // USD
79
- daily: 50,
80
- monthly: 500,
219
+ hourly: 5, // USD — hard cap per hour
220
+ daily: 50, // USD — hard cap per day
221
+ monthly: 500, // USD — hard cap per month
81
222
  },
82
223
  onExceeded: 'throw', // 'throw' | 'queue' | 'fallback'
83
224
  },
@@ -100,14 +241,14 @@ const limiter = createRateLimiter({
100
241
  retryOn: [429, 500, 502, 503, 504],
101
242
  },
102
243
 
103
- // Observability
244
+ // Observability — see Events section for all available events
104
245
  on: {
105
246
  rateLimited: ({ model, source, resetAt }) =>
106
247
  console.warn(`${model} rate limited (${source}), resets ${new Date(resetAt).toISOString()}`),
107
248
  retrying: ({ model, attempt, delayMs }) =>
108
249
  console.log(`${model} retry ${attempt} in ${delayMs}ms`),
109
250
  budgetHit: ({ model, currentCostUsd, limitUsd, period }) =>
110
- alerts.send(`${model} hit $${limitUsd} ${period} budget ($${currentCostUsd} spent)`),
251
+ alerts.send(`${model} hit $${limitUsd} ${period} budget`),
111
252
  completed: ({ model, inputTokens, outputTokens, costUsd, latencyMs }) =>
112
253
  metrics.record({ model, inputTokens, outputTokens, costUsd, latencyMs }),
113
254
  },
@@ -123,14 +264,14 @@ Pass options to individual requests via `providerOptions.rateLimiter`:
123
264
  ```typescript
124
265
  import { generateText } from 'ai'
125
266
 
126
- // High-priority request — skips ahead of normal traffic in the queue
267
+ // High-priority — skips ahead of normal traffic in the queue
127
268
  await generateText({
128
269
  model,
129
270
  prompt: 'Urgent user request...',
130
271
  providerOptions: {
131
272
  rateLimiter: {
132
- priority: 'high', // 'high' | 'normal' | 'low'
133
- timeout: 10_000, // override the default queue timeout for this request
273
+ priority: 'high', // 'high' | 'normal' | 'low'
274
+ timeout: 10_000, // override the default queue timeout for this request
134
275
  },
135
276
  },
136
277
  })
@@ -145,7 +286,7 @@ await generateText({
145
286
  })
146
287
  ```
147
288
 
148
- This is the right way to colocate user requests and background jobs on the same model without background jobs starving users.
289
+ This is the recommended way to colocate user requests and background jobs on the same model without background jobs starving users.
149
290
 
150
291
  ---
151
292
 
@@ -157,7 +298,7 @@ const report = limiter.getCostReport()
157
298
 
158
299
  console.log(report)
159
300
  // {
160
- // hour: { requests: 42, inputTokens: 84_000, outputTokens: 21_000, costUsd: 0.29 },
301
+ // hour: { requests: 42, inputTokens: 84_000, outputTokens: 21_000, costUsd: 0.29 },
161
302
  // day: { requests: 318, inputTokens: 620_000, outputTokens: 155_000, costUsd: 2.11 },
162
303
  // month: { requests: 4821, inputTokens: 9_100_000, outputTokens: 2_200_000, costUsd: 34.80 },
163
304
  // byModel: {
@@ -173,17 +314,17 @@ Costs are based on **actual token counts** from API responses — not estimates.
173
314
 
174
315
  ## Budget fallback routing
175
316
 
176
- When a budget limit is hit, you can transparently reroute to a cheaper model instead of throwing an error. Pass a `fallback` option to `wrap()`:
317
+ When a budget limit is hit, you can transparently reroute to a cheaper model instead of throwing an error:
177
318
 
178
319
  ```typescript
179
320
  const limiter = createRateLimiter({
180
321
  cost: {
181
322
  budget: { daily: 10 },
182
- onExceeded: 'fallback', // reroute to fallback instead of throwing
323
+ onExceeded: 'fallback', // reroute to fallback instead of throwing
183
324
  },
184
325
  on: {
185
- budgetHit: ({ model, currentCostUsd, limitUsd, period }) =>
186
- console.warn(`${model} ${period} budget hit ($${currentCostUsd} of $${limitUsd})`),
326
+ budgetHit: ({ model, usingFallback }) =>
327
+ console.warn(`${model} budget hit — ${usingFallback ? 'using fallback' : 'throwing'}`),
187
328
  },
188
329
  })
189
330
 
@@ -192,17 +333,11 @@ const model = limiter.wrap(
192
333
  { fallback: openai('gpt-4o-mini') }, // used when budget is exceeded
193
334
  )
194
335
 
195
- // Under budget → uses gpt-4o normally
196
- // Over $10/day → silently switches to gpt-4o-mini, no code changes needed
336
+ // Under budget → uses gpt-4o normally
337
+ // Over $10/day → silently switches to gpt-4o-mini, no code changes needed
197
338
  const result = await generateText({ model, prompt })
198
339
  ```
199
340
 
200
- **How it works:**
201
- 1. The budget is checked before every request against total rolling spend
202
- 2. When exceeded, `BudgetExceededError` is caught inside `wrap()` before it reaches your code
203
- 3. The request is re-executed against the fallback model, bypassing the budget pre-check
204
- 4. Fallback usage is tracked under the fallback model's ID in `getCostReport()`
205
-
206
341
  **Behavior matrix:**
207
342
 
208
343
  | `onExceeded` | `fallback` configured | Outcome |
@@ -212,91 +347,47 @@ const result = await generateText({ model, prompt })
212
347
  | `'fallback'` | no | Throws `BudgetExceededError` |
213
348
  | `'queue'` | any | Queues until period resets |
214
349
 
215
- ---
216
-
217
- ## Raw SDK proxy
218
-
219
- If you're using the OpenAI, Anthropic, Groq, Mistral, or Cohere SDK directly — without the Vercel AI SDK — use `limiter.rawProxy()` to add rate limiting as a transparent drop-in:
220
-
221
- ```typescript
222
- import { createRateLimiter } from 'ai-sdk-rate-limiter'
223
- import OpenAI from 'openai'
224
- import Anthropic from '@anthropic-ai/sdk'
225
-
226
- const limiter = createRateLimiter({
227
- cost: { budget: { daily: 50 }, onExceeded: 'throw' },
228
- on: { rateLimited: ({ model }) => console.warn(`${model} rate limited`) },
229
- })
350
+ Fallback usage is tracked under the fallback model's ID in `getCostReport()`.
230
351
 
231
- // Every API call goes through the same rate limiter and cost tracker
232
- const openai = limiter.rawProxy(new OpenAI())
233
- const anthropic = limiter.rawProxy(new Anthropic())
352
+ ---
234
353
 
235
- // Use exactly as before — no other changes needed
236
- const completion = await openai.chat.completions.create({
237
- model: 'gpt-4o',
238
- messages: [{ role: 'user', content: 'Hello!' }],
239
- })
354
+ ## Multi-instance Redis store
240
355
 
241
- const message = await anthropic.messages.create({
242
- model: 'claude-opus-4-6',
243
- max_tokens: 1024,
244
- messages: [{ role: 'user', content: 'Hello!' }],
245
- })
356
+ By default, rate limit state is in-memory (per-process). For multi-instance deployments — multiple pods, serverless replicas, workers — each instance has its own counters. Install the Redis store to share state:
246
357
 
247
- // Cost from both clients tracked together
248
- const report = limiter.getCostReport()
249
358
  ```
250
-
251
- **Streaming works too** — the proxy wraps the returned `AsyncIterable` to capture the final usage chunk automatically:
252
-
253
- ```typescript
254
- const stream = await openai.chat.completions.create({
255
- model: 'gpt-4o',
256
- messages: [{ role: 'user', content: 'Stream this' }],
257
- stream: true,
258
- stream_options: { include_usage: true }, // tells OpenAI to include usage in final chunk
259
- })
260
-
261
- for await (const chunk of stream) {
262
- process.stdout.write(chunk.choices[0]?.delta?.content ?? '')
263
- }
264
- // After the loop, usage is recorded in limiter.getCostReport()
359
+ npm install ioredis
265
360
  ```
266
361
 
267
- **Zero-config standalone version** — if you don't need to share the limiter with other models:
268
-
269
362
  ```typescript
270
- import { rateLimited } from 'ai-sdk-rate-limiter'
363
+ import { createRateLimiter } from 'ai-sdk-rate-limiter'
364
+ import { RedisStore } from 'ai-sdk-rate-limiter/redis'
365
+ import Redis from 'ioredis'
271
366
 
272
- const openai = rateLimited(new OpenAI(), {
273
- config: { cost: { budget: { daily: 20 } } },
367
+ const limiter = createRateLimiter({
368
+ store: new RedisStore(new Redis(process.env.REDIS_URL)),
369
+ // ... rest of your config unchanged
274
370
  })
275
371
  ```
276
372
 
277
- **Provider is auto-detected** from the client's constructor name (`OpenAI`, `Anthropic`, `Groq`, etc.). Override it explicitly if needed:
373
+ That's the entire change. All APIs (`wrap()`, `rawProxy()`, events, cost reports) work identically. The Redis store enforces rate limits collectively, so multiple instances cannot jointly exceed the API limits.
374
+
375
+ **Options:**
278
376
 
279
377
  ```typescript
280
- const client = limiter.rawProxy(new OpenAI({ baseURL: 'https://api.groq.com/openai/v1' }), {
281
- provider: 'groq', // use Groq's limits and pricing instead of OpenAI's
378
+ new RedisStore(redis, {
379
+ keyPrefix: 'rl:myapp:', // namespace if multiple apps share one Redis instance
380
+ windowMs: 60_000, // window size in ms; match your provider's limit window
282
381
  })
283
382
  ```
284
383
 
285
- ---
286
-
287
- ## Backpressure — know before you send
384
+ **How it works internally:**
288
385
 
289
- Check estimated wait time before committing to a request. Useful for showing loading states or shedding load gracefully.
290
-
291
- ```typescript
292
- const waitMs = limiter.estimatedWait('gpt-4o')
386
+ Each request runs a Lua script atomically that: removes stale entries from a sorted set, counts requests and tokens in the current window, checks against RPM and ITPM limits, and either reserves the slot or returns when the next slot opens. The local queue (priority ordering, drain timer, timeout handling) stays in-memory per instance — only the window counters are shared via Redis.
293
387
 
294
- if (waitMs > 5_000) {
295
- return res.status(503).json({ error: 'Model is busy, try again shortly', retryAfterMs: waitMs })
296
- }
388
+ **Compatible clients** — any client with `eval()`, `get()`, and `set()` works: `ioredis`, `node-redis`, Upstash Redis.
297
389
 
298
- const result = await generateText({ model, prompt })
299
- ```
390
+ Use the default `InMemoryStore` for single-instance deployments — it's more accurate (true sliding window, no network round-trips) and zero-config. Only switch to `RedisStore` when you actually need cross-instance coordination.
300
391
 
301
392
  ---
302
393
 
@@ -320,24 +411,46 @@ limiter.off('queued', handler)
320
411
 
321
412
  | Event | When | Key fields |
322
413
  |---|---|---|
323
- | `queued` | Request enters the queue | `model`, `priority`, `queueDepth`, `estimatedWaitMs` |
324
- | `dequeued` | Request leaves the queue | `model`, `waitedMs`, `priority` |
325
- | `retrying` | A failed request is about to retry | `model`, `attempt`, `maxAttempts`, `delayMs`, `error` |
326
- | `rateLimited` | Limit hit (local or remote 429) | `model`, `source`, `limitType`, `resetAt` |
327
- | `budgetHit` | Cost budget exceeded | `model`, `currentCostUsd`, `limitUsd`, `period`, `usingFallback` |
328
- | `dropped` | Request rejected (queue full or timeout) | `model`, `reason` |
329
- | `completed` | Request finished successfully | `model`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs` |
414
+ | `queued` | Request enters the queue | `model`, `provider`, `priority`, `queueDepth`, `estimatedWaitMs` |
415
+ | `dequeued` | Request leaves the queue | `model`, `provider`, `waitedMs`, `priority` |
416
+ | `retrying` | A failed request is about to retry | `model`, `provider`, `attempt`, `maxAttempts`, `delayMs`, `error` |
417
+ | `rateLimited` | Limit hit (local or remote 429) | `model`, `provider`, `source`, `limitType`, `resetAt` |
418
+ | `budgetHit` | Cost budget exceeded | `model`, `provider`, `currentCostUsd`, `limitUsd`, `period`, `usingFallback` |
419
+ | `dropped` | Request rejected (queue full or timeout) | `model`, `provider`, `reason` |
420
+ | `completed` | Request finished successfully | `model`, `provider`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs`, `streaming` |
330
421
 
331
- The `source` on `rateLimited` distinguishes between requests we blocked locally (`'local'`) vs. requests the API rejected with a 429 (`'remote'`). Local blocks are free; remote blocks mean your limits are misconfigured.
422
+ The `source` on `rateLimited` distinguishes between requests we blocked locally (`'local'`) vs. requests the API rejected with a 429 (`'remote'`). Local blocks are expected and free. Frequent remote blocks mean your configured limits are too high for your tier — run `npx ai-sdk-rate-limiter audit` to get accurate numbers.
423
+
424
+ ---
425
+
426
+ ## Backpressure
427
+
428
+ Check estimated wait time before committing to a request. Useful for showing loading states or shedding load gracefully:
429
+
430
+ ```typescript
431
+ const waitMs = await limiter.estimatedWait('gpt-4o')
432
+
433
+ if (waitMs > 5_000) {
434
+ return res.status(503).json({
435
+ error: 'Model busy, try again shortly',
436
+ retryAfterMs: waitMs,
437
+ })
438
+ }
439
+
440
+ const result = await generateText({ model, prompt })
441
+ ```
442
+
443
+ Returns `0` if the model would proceed immediately.
332
444
 
333
445
  ---
334
446
 
335
447
  ## Error handling
336
448
 
337
- Every error is typed, structured, and tells you exactly what happened:
449
+ Every error is typed and carries structured context:
338
450
 
339
451
  ```typescript
340
452
  import {
453
+ RateLimitExceededError,
341
454
  QueueTimeoutError,
342
455
  QueueFullError,
343
456
  BudgetExceededError,
@@ -349,77 +462,163 @@ try {
349
462
  const result = await generateText({ model, prompt })
350
463
  } catch (error) {
351
464
  if (error instanceof QueueTimeoutError) {
352
- // error.model, error.waitedMs, error.queueDepth
465
+ // Request waited in queue longer than queue.timeout
353
466
  console.error(`Timed out after ${error.waitedMs}ms (queue depth: ${error.queueDepth})`)
354
467
  } else if (error instanceof BudgetExceededError) {
355
- // error.model, error.currentCostUsd, error.limitUsd, error.period
468
+ // Cost budget hit and onExceeded is 'throw' or no fallback configured
356
469
  console.error(`Budget exceeded: $${error.currentCostUsd} of $${error.limitUsd} ${error.period}`)
357
470
  } else if (error instanceof RetryExhaustedError) {
358
- // error.model, error.attempts, error.cause
471
+ // All retry attempts failed
359
472
  console.error(`All ${error.attempts} retries exhausted`, error.cause)
360
473
  } else if (error instanceof QueueFullError) {
361
- // error.model, error.maxSize
362
- console.error(`Queue full at ${error.maxSize} requests`)
474
+ // Queue at capacity and onFull is 'throw'
475
+ console.error(`Queue full at ${error.maxSize} requests for ${error.model}`)
476
+ } else if (error instanceof RateLimitExceededError) {
477
+ // Rate limit hit and the request could not be queued
478
+ console.error(`${error.model} ${error.limitType} limit of ${error.limit} exceeded`)
363
479
  }
364
480
  }
365
481
  ```
366
482
 
367
- All errors extend `RateLimiterError`, so a single `instanceof RateLimiterError` check separates rate-limiting failures from AI SDK errors.
483
+ All errors extend `RateLimiterError`, so a single `instanceof RateLimiterError` check separates rate-limiter failures from AI API errors.
484
+
485
+ **Error fields:**
486
+
487
+ | Error | Fields |
488
+ |---|---|
489
+ | `QueueTimeoutError` | `model`, `waitedMs`, `queueDepth` |
490
+ | `BudgetExceededError` | `model`, `currentCostUsd`, `limitUsd`, `period` |
491
+ | `RetryExhaustedError` | `model`, `attempts`, `cause` |
492
+ | `QueueFullError` | `model`, `maxSize` |
493
+ | `RateLimitExceededError` | `model`, `limitType`, `limit`, `resetAt` |
368
494
 
369
495
  ---
370
496
 
371
- ## Advanced middleware usage
497
+ ## OpenTelemetry
372
498
 
373
- If you use `wrapLanguageModel` directly, the raw middleware is available:
499
+ The `ai-sdk-rate-limiter/otel` entry point provides a plugin that emits OpenTelemetry spans for every AI request. There is no hard dependency on `@opentelemetry/api`: the plugin accepts any OTel-compatible tracer via structural typing.
374
500
 
375
501
  ```typescript
376
- import { wrapLanguageModel } from 'ai'
502
+ import { trace } from '@opentelemetry/api'
503
+ import { createRateLimiter } from 'ai-sdk-rate-limiter'
504
+ import { createOtelPlugin } from 'ai-sdk-rate-limiter/otel'
377
505
 
378
- const model = wrapLanguageModel({
379
- model: openai('gpt-4o'),
380
- middleware: limiter.middleware,
506
+ const limiter = createRateLimiter({
507
+ on: createOtelPlugin(trace.getTracer('my-service')),
381
508
  })
382
509
  ```
383
510
 
384
- You can also compose it with other middleware:
511
+ **Spans emitted:**
512
+
513
+ | Span name | When | Status |
514
+ |---|---|---|
515
+ | `gen_ai.request` | Every completed request | OK |
516
+ | `gen_ai.request` | Every dropped request | ERROR |
517
+ | `ai_rate_limiter.retry` | Each retry attempt | OK |
518
+ | `ai_rate_limiter.budget_hit` | Budget exceeded | ERROR |
519
+
520
+ **Attributes on `gen_ai.request` (completed):**
521
+
522
+ | Attribute | Value |
523
+ |---|---|
524
+ | `gen_ai.system` | Provider name (e.g. `openai`, `anthropic`) |
525
+ | `gen_ai.request.model` | Model ID |
526
+ | `gen_ai.usage.input_tokens` | Actual input tokens from API response |
527
+ | `gen_ai.usage.output_tokens` | Actual output tokens from API response |
528
+ | `ai_rate_limiter.cost_usd` | Cost in USD for this request |
529
+ | `ai_rate_limiter.latency_ms` | Total latency including queue wait |
530
+ | `ai_rate_limiter.streaming` | Whether the request used streaming |
531
+
532
+ Attribute names follow the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/). The `gen_ai.request` span duration is reconstructed from `latencyMs` so it reflects the full wall-clock time of the request, including any queue wait.
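A minimal sketch of how a span's duration could be reconstructed from `latencyMs`, assuming the span is created only after the request has completed. `TracerLike`, `SpanLike`, and `recordCompleted` are hypothetical names for illustration, not the plugin's actual code; the `startTime` option and the `end(endTime)` argument mirror the shape of the OpenTelemetry API.

```typescript
// Hypothetical structural types: just enough of a tracer to back-date a span.
type SpanLike = {
  setAttribute(key: string, value: string | number | boolean): unknown
  end(endTime?: number): void
}
type TracerLike = {
  startSpan(name: string, options?: { startTime?: number }): SpanLike
}

type CompletedEvent = { model: string; provider: string; latencyMs: number }

// Emit a span whose start is back-dated by latencyMs, so its duration
// covers the whole request including any time spent waiting in the queue.
function recordCompleted(tracer: TracerLike, event: CompletedEvent, now: number = Date.now()): void {
  const span = tracer.startSpan('gen_ai.request', { startTime: now - event.latencyMs })
  span.setAttribute('gen_ai.system', event.provider)
  span.setAttribute('gen_ai.request.model', event.model)
  span.end(now)
}
```

Because the tracer is typed structurally, the same function would accept a real OTel tracer or a small test double.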
533
+
534
+ **Custom tracer interface** — if you don't want to install `@opentelemetry/api`, implement the `OtelTracer` interface directly:
385
535
 
386
536
  ```typescript
387
- const model = wrapLanguageModel({
388
- model: openai('gpt-4o'),
389
- middleware: [loggingMiddleware, limiter.middleware, cachingMiddleware],
537
+ import { createOtelPlugin, type OtelTracer } from 'ai-sdk-rate-limiter/otel'
538
+
539
+ const tracer: OtelTracer = {
540
+ startSpan(name, options) {
541
+ // return any object that implements OtelSpan
542
+ },
543
+ }
544
+
545
+ const limiter = createRateLimiter({
546
+ on: createOtelPlugin(tracer),
390
547
  })
391
548
  ```
392
549
 
393
550
  ---
394
551
 
395
- ## Multiple limiters
552
+ ## CLI audit
396
553
 
397
- Run separate limiters for separate concerns e.g., one per customer tier:
554
+ Probe your API keys to discover your actual rate limit tier and generate a ready-to-paste config override:
398
555
 
399
- ```typescript
400
- const freeLimiter = createRateLimiter({
401
- limits: { 'gpt-4o-mini': { rpm: 5 } },
402
- cost: { budget: { daily: 0.10 }, onExceeded: 'throw' },
403
- queue: { timeout: 5_000 },
404
- })
556
+ ```
557
+ npx ai-sdk-rate-limiter audit
558
+ ```
405
559
 
406
- const paidLimiter = createRateLimiter({
407
- limits: { 'gpt-4o': { rpm: 100 } },
408
- cost: { budget: { daily: 20 } },
409
- queue: { timeout: 30_000 },
410
- })
560
+ ```
561
+ ────────────────────────────────────────────────────────────────────────────────
562
+ ai-sdk-rate-limiter audit
563
+ ────────────────────────────────────────────────────────────────────────────────
564
+
565
+ OPENAI (OPENAI_API_KEY)
566
+ Model RPM TPM Registry
567
+ ──────────────────────────────────────────────────────────────────────────────
568
+ gpt-4o 10000 2,000,000 (registry: 500 RPM / 30,000 TPM)
569
+ gpt-4o-mini 10000 10,000,000 ≠ (registry: 500 RPM / 200,000 TPM)
570
+
571
+ ────────────────────────────────────────────────────────────────────────────────
572
+ 1 model(s) differ from registry defaults.
573
+ Paste the config below into createRateLimiter():
574
+
575
+ const limiter = createRateLimiter({
576
+ limits: {
577
+ 'gpt-4o-mini': { rpm: 10000, itpm: 10_000_000 },
578
+ },
579
+ })
411
580
 
412
- // Route to the right limiter per request
413
- const model = req.user.plan === 'paid'
414
- ? paidLimiter.wrap(openai('gpt-4o'))
415
- : freeLimiter.wrap(openai('gpt-4o-mini'))
581
+ ────────────────────────────────────────────────────────────────────────────────
582
+ ```
583
+
584
+ **How it works** — Makes a minimal (5-token) request per model and reads the rate-limit headers (such as `x-ratelimit-limit-*`) that providers return on each response. These headers reflect your account's actual tier, not the published defaults.
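As a rough illustration of the header-reading step, the sketch below pulls request and token limits out of a response's headers. It assumes the OpenAI-style names `x-ratelimit-limit-requests` and `x-ratelimit-limit-tokens`; other providers may use different names, and `HeaderSource`, `parseLimit`, and `readOpenAiStyleLimits` are hypothetical helpers rather than the CLI's actual implementation.

```typescript
// Anything with a get() method works here, including a fetch Response's headers.
type HeaderSource = { get(name: string): string | null }
type DetectedLimits = { rpm: number | null; tpm: number | null }

// Parse a numeric header value; return null when it is missing or malformed.
function parseLimit(value: string | null): number | null {
  if (value === null) return null
  const parsed = Number.parseInt(value, 10)
  return Number.isNaN(parsed) ? null : parsed
}

// Assumed header names (OpenAI-style); adjust per provider.
function readOpenAiStyleLimits(headers: HeaderSource): DetectedLimits {
  return {
    rpm: parseLimit(headers.get('x-ratelimit-limit-requests')),
    tpm: parseLimit(headers.get('x-ratelimit-limit-tokens')),
  }
}
```

Returning `null` for a missing header keeps "not reported" distinct from a limit of zero.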
585
+
586
+ **Options:**
587
+
588
+ ```
589
+ npx ai-sdk-rate-limiter audit [options]
590
+
591
+ --provider, -p <name> Audit a single provider: openai, anthropic, groq, mistral, cohere
592
+ --json Machine-readable JSON output
593
+ --help, -h Show help
594
+ --version, -v Print version
595
+
596
+ Environment variables required:
597
+ OPENAI_API_KEY
598
+ ANTHROPIC_API_KEY
599
+ GROQ_API_KEY
600
+ MISTRAL_API_KEY
601
+ COHERE_API_KEY
602
+ ```
603
+
604
+ **Examples:**
605
+
606
+ ```bash
607
+ # Audit all configured providers
608
+ npx ai-sdk-rate-limiter audit
609
+
610
+ # Audit only OpenAI
611
+ npx ai-sdk-rate-limiter audit --provider openai
612
+
613
+ # Machine-readable output for CI / scripts
614
+ npx ai-sdk-rate-limiter audit --json | jq '.providers[0].models'
416
615
  ```
417
616
 
418
617
  ---
419
618
 
420
- ## Built-in model registry
619
+ ## Model registry
421
620
 
422
- Limits and pricing are built-in for every major model across 6 providers. Defaults are conservative (free/Tier 1) — override with your actual plan limits.
621
+ Limits and pricing are built-in for every major model across 6 providers. Defaults are conservative (free/Tier 1) — override with your actual plan limits via the `limits` config option or by running `audit`.
423
622
 
424
623
  **OpenAI**
425
624
 
@@ -493,30 +692,87 @@ import {
493
692
  console.log(GROQ_MODELS['llama-3.3-70b-versatile'])
494
693
  // { rpm: 30, itpm: 6000, rpd: 1000, inputPricePerMillion: 0.59, ... }
495
694
 
496
- console.log(isKnownModel('llama-3.3-70b-versatile', 'groq'))
497
- // true
695
+ console.log(isKnownModel('llama-3.3-70b-versatile', 'groq')) // true
696
+ console.log(isKnownModel('my-fine-tune', 'openai')) // false → fallback limits
697
+
698
+ // Resolve the effective limits for a model (registry defaults merged with user overrides)
699
+ const limits = resolveModelLimits('gpt-4o', 'openai', { 'gpt-4o': { rpm: 1000 } })
700
+ ```
701
+
702
+ ---
703
+
704
+ ## Advanced usage
705
+
706
+ ### Multiple limiters per tier
707
+
708
+ ```typescript
709
+ const freeLimiter = createRateLimiter({
710
+ limits: { 'gpt-4o-mini': { rpm: 5 } },
711
+ cost: { budget: { daily: 0.10 }, onExceeded: 'throw' },
712
+ queue: { timeout: 5_000 },
713
+ })
714
+
715
+ const paidLimiter = createRateLimiter({
716
+ limits: { 'gpt-4o': { rpm: 100 } },
717
+ cost: { budget: { daily: 20 } },
718
+ queue: { timeout: 30_000 },
719
+ })
720
+
721
+ // Route per request based on user plan
722
+ const model = req.user.plan === 'paid'
723
+ ? paidLimiter.wrap(openai('gpt-4o'))
724
+ : freeLimiter.wrap(openai('gpt-4o-mini'))
725
+ ```
726
+
727
+ ### Combine OTel tracing with event logging
728
+
729
+ ```typescript
730
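+ import { trace } from '@opentelemetry/api'
+ import { createRateLimiter } from 'ai-sdk-rate-limiter'
+ import { createOtelPlugin } from 'ai-sdk-rate-limiter/otel'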
731
+
732
+ const limiter = createRateLimiter({
733
+ on: {
734
+ // OTel spans for every request
735
+ ...createOtelPlugin(trace.getTracer('my-service')),
736
+ // Plus any additional handlers
737
+ budgetHit: ({ model, limitUsd, period }) =>
738
+ alerts.send(`Budget alert: ${model} hit $${limitUsd} ${period} cap`),
739
+ },
740
+ })
741
+ ```
742
+
743
+ ### Custom rate limit store
744
+
745
+ Implement `RateLimitStore` to use any backend (DynamoDB, Postgres, etc.):
746
+
747
+ ```typescript
748
+ import type { RateLimitStore } from 'ai-sdk-rate-limiter'
498
749
 
499
- console.log(isKnownModel('my-fine-tune', 'openai'))
500
- // false will use fallback limits
750
+ class MyStore implements RateLimitStore {
751
+ async checkAndReserve(key, tokens, limits) { /* ... */ }
752
+ async applyBackoff(key, untilMs) { /* ... */ }
753
+ async getBackoff(key) { /* ... */ }
754
+ }
755
+
756
+ const limiter = createRateLimiter({ store: new MyStore() })
501
757
  ```
502
758
 
503
759
  ---
504
760
 
505
761
  ## How it works
506
762
 
507
- **Rate limiting algorithm** — Sliding window counter. Each model tracks a list of `{timestamp, tokens}` entries for the past 60 seconds. On every request, stale entries are evicted and the window is checked against RPM and ITPM limits simultaneously.
763
+ **Rate limiting** — Sliding window counter per model. Each model tracks a list of `{timestamp, tokens}` entries for the past 60 seconds. On every request, stale entries are evicted and the window is checked against RPM and ITPM limits simultaneously.
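The sliding-window check described above can be sketched as follows. This is an illustrative in-memory version under stated assumptions, not the library's internal code: `SlidingWindow`, `tryReserve`, and the `Decision` type are hypothetical names, and the caller passes `now` explicitly so the behaviour is easy to test.

```typescript
type Entry = { timestamp: number; tokens: number }
type Limits = { rpm: number; itpm: number }
type Decision = { allowed: true } | { allowed: false; retryAt: number }

const WINDOW_MS = 60_000

class SlidingWindow {
  private entries: Entry[] = []

  tryReserve(tokens: number, limits: Limits, now: number): Decision {
    // Evict entries that have aged out of the 60-second window.
    this.entries = this.entries.filter((entry) => now - entry.timestamp < WINDOW_MS)

    const requests = this.entries.length
    const usedTokens = this.entries.reduce((sum, entry) => sum + entry.tokens, 0)

    // RPM and ITPM must both hold for the request to proceed.
    if (requests + 1 <= limits.rpm && usedTokens + tokens <= limits.itpm) {
      this.entries.push({ timestamp: now, tokens })
      return { allowed: true }
    }

    // Capacity can first free up when the oldest entry leaves the window.
    const oldest = this.entries[0]
    return { allowed: false, retryAt: oldest ? oldest.timestamp + WINDOW_MS : now }
  }
}
```

A blocked caller can use `retryAt` to schedule its next attempt instead of polling.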
508
764
 
509
- **Queue** — A min-heap priority queue per model, sorted by `priority` then enqueue time (FIFO within same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.
765
+ **Queue** — A sorted priority queue per model, ordered by `priority` then enqueue time (FIFO within same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.
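One way to get "priority first, FIFO within the same priority" is to keep the queue sorted on insert. The sketch below assumes the three priority levels documented for per-request options; `PriorityFifoQueue` and `RANK` are hypothetical names, and the drain timer is left out.

```typescript
type Priority = 'high' | 'normal' | 'low'

// Lower rank runs first.
const RANK: Record<Priority, number> = { high: 0, normal: 1, low: 2 }

type Waiter<T> = { priority: Priority; value: T }

class PriorityFifoQueue<T> {
  private items: Waiter<T>[] = []

  enqueue(value: T, priority: Priority = 'normal'): void {
    // Insert before the first waiter with strictly lower priority. Waiters of
    // equal priority stay ahead of the newcomer, which preserves FIFO order.
    const index = this.items.findIndex((other) => RANK[other.priority] > RANK[priority])
    const waiter = { priority, value }
    if (index === -1) this.items.push(waiter)
    else this.items.splice(index, 0, waiter)
  }

  dequeue(): T | undefined {
    return this.items.shift()?.value
  }

  get size(): number {
    return this.items.length
  }
}
```

Insertion is linear in queue length, which is usually acceptable for the short queues a rate limiter holds; a heap would trade simplicity for faster inserts.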
510
766
 
511
- **Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common failure mode where you retry one request while 10 more immediately follow and all get 429s.
767
+ **Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the entire model key in the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common thundering-herd failure where you retry one request while 10 others immediately follow and all get 429s.
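
A minimal sketch of key-wide backoff, assuming one shared deadline per model key. `BackoffGate`, its methods, and the key format are hypothetical. The only outside fact relied on is that `Retry-After` carries either a number of seconds or an HTTP date.

```typescript
// Hypothetical sketch of backoff shared by every request for the same model key.
class BackoffGate {
  private blockedUntil = new Map<string, number>()

  /** Called when a 429 arrives; `retryAfter` is the raw header value. */
  recordRetryAfter(key: string, retryAfter: string, now: number): void {
    const seconds = Number(retryAfter)
    // Retry-After is either delay-seconds or an HTTP-date.
    const until = Number.isFinite(seconds) ? now + seconds * 1000 : Date.parse(retryAfter)
    if (Number.isNaN(until)) return // unparseable header: leave the key untouched

    // Never shorten a backoff that is already in place.
    const current = this.blockedUntil.get(key) ?? 0
    this.blockedUntil.set(key, Math.max(current, until))
  }

  /** Milliseconds every request for this key must still wait (0 if clear). */
  waitMs(key: string, now: number): number {
    return Math.max(0, (this.blockedUntil.get(key) ?? 0) - now)
  }
}
```

Because the deadline lives on the key rather than on the failed request, every queued request that checks `waitMs` sees the same pause.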
512
768
 
513
- **Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk.
769
+ **Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk (Vercel AI SDK) or the final usage chunk (raw proxy).
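
The estimate-then-reconcile flow can be sketched as follows. The 4 characters per token ratio comes from the text above. Rounding up, the `TokenLedger` class, and its method names are assumptions for illustration.

```typescript
// Hypothetical sketch of estimate-then-reconcile token accounting.
const CHARS_PER_TOKEN = 4

function estimateTokens(prompt: string): number {
  // Rounding up is an assumption; it errs on the side of reserving too much.
  return Math.ceil(prompt.length / CHARS_PER_TOKEN)
}

class TokenLedger {
  private reserved = 0

  /** Reserve the estimate before the request fires and return it for later reconciliation. */
  reserve(prompt: string): number {
    const estimate = estimateTokens(prompt)
    this.reserved += estimate
    return estimate
  }

  /** Swap the estimate for the usage the API actually reported. */
  reconcile(estimate: number, actual: number): void {
    this.reserved += actual - estimate
  }

  get total(): number {
    return this.reserved
  }
}
```

For example, an 8-character prompt reserves 2 tokens up front. If the API later reports 5, `reconcile(2, 5)` raises the ledger total to 5.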
514
770
 
515
- **Zero dependencies** — The middleware interface is implemented structurally. `@ai-sdk/provider` types are referenced for type checking but not required at runtime. No `ioredis`, no `bottleneck`, no tokenizer libraries.
771
+ **Zero dependencies** — The Vercel AI SDK middleware interface is implemented structurally; `@ai-sdk/provider` types are used for type checking only and are not required at runtime. No `ioredis`, no `bottleneck`, no tokenizer libraries in the core.
516
772
 
517
773
  ---
518
774
 
519
- ## Why not X?
775
+ ## Comparison
520
776
 
521
777
  | | ai-sdk-rate-limiter | bottleneck | p-limit | SDK built-in retry | LangChain |
522
778
  |---|:---:|:---:|:---:|:---:|:---:|
@@ -528,37 +784,42 @@ console.log(isKnownModel('my-fine-tune', 'openai'))
528
784
  | Cost tracking + budgets | yes | no | no | no | no |
529
785
  | Retry-After header | yes | no | no | partial | partial |
530
786
  | Backoff propagation | yes | no | no | no | no |
787
+ | OpenTelemetry | yes | no | no | no | partial |
788
+ | CLI audit | yes | no | no | no | no |
531
789
  | Zero runtime deps | yes | no | yes | — | no |
532
790
  | Provider-agnostic | yes | yes | yes | no | no |
533
791
 
534
- **bottleneck** is excellent general-purpose rate limiting, but knows nothing about AI APIs: no model limits, no token counting, no cost tracking. You'd need to configure it per-model manually and rebuild the cost system yourself.
792
+ **bottleneck**: Excellent general-purpose rate limiting, but knows nothing about AI APIs. No model limits, no token counting, no cost tracking. You'd need to configure it per-model manually and rebuild the cost system yourself.
535
793
 
536
- **p-limit** controls concurrency, not rate. It doesn't throttle to N requests per minute, it throttles to N concurrent requests. Useful but completely different problem.
794
+ **p-limit**: Controls concurrency, not rate. It limits you to N concurrent requests, not N requests per minute, which is a different problem.
537
795
 
538
- **SDK built-in retry** retries on 429 with backoff. That's the floor, not the ceiling. It doesn't queue, doesn't prioritize, doesn't track cost, and doesn't propagate backoff to other in-flight requests.
796
+ **SDK built-in retry**: Retries on 429 with backoff. That's the floor, not the ceiling. No queuing, no priority, no cost tracking, no backoff propagation to other in-flight requests.
539
797
 
540
798
  ---
541
799
 
542
800
  ## TypeScript
543
801
 
544
- Fully typed. All configuration options, events, errors, and report shapes have precise TypeScript definitions.
802
+ Fully typed. All configuration options, events, errors, and report shapes have precise TypeScript definitions exported from the main entry point.
545
803
 
546
804
  ```typescript
547
805
  import type {
548
806
  RateLimiterConfig,
549
807
  CostReport,
808
+ EventMap,
550
809
  QueuedEvent,
551
810
  Priority,
811
+ ModelLimits,
552
812
  } from 'ai-sdk-rate-limiter'
553
-
554
- function handleQueuedRequest(event: QueuedEvent) {
555
- // event.model, event.priority, event.queueDepth, event.estimatedWaitMs
556
- // all typed, all autocompleted
557
- }
558
813
  ```
559
814
 
560
815
  ---
561
816
 
817
+ ## Examples
818
+
819
+ A full Next.js 15 App Router example is included at [`examples/nextjs/`](./examples/nextjs/). It demonstrates streaming chat with rate limiting, live cost display, and proper error handling for budget and rate limit errors.
820
+
821
+ ---
822
+
562
823
  ## Requirements
563
824
 
564
825
  - Node.js 18+ / Bun / Deno