ai-sdk-rate-limiter 0.5.0 → 0.7.0

package/README.md CHANGED
@@ -56,9 +56,152 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
56
56
 
57
57
  **Raw SDK support** — Works with the native OpenAI, Anthropic, Groq, Mistral, and Cohere SDKs directly via a transparent JavaScript Proxy. No Vercel AI SDK required.
58
58
 
59
+ **OpenTelemetry** — Drop-in OTel plugin that emits GenAI-spec spans for every request. Works with any OTel-compatible tracer.
60
+
61
+ **CLI audit** — `npx ai-sdk-rate-limiter audit` probes your API keys to detect your actual tier limits and generates a ready-to-paste config override.
62
+
63
+ ---
64
+
65
+ ## Contents
66
+
67
+ - [Vercel AI SDK usage](#vercel-ai-sdk-usage)
68
+ - [Raw SDK proxy](#raw-sdk-proxy)
69
+ - [Configuration reference](#configuration-reference)
70
+ - [Per-request options](#per-request-options)
71
+ - [Cost tracking](#cost-tracking)
72
+ - [Budget fallback routing](#budget-fallback-routing)
73
+ - [Multi-instance Redis store](#multi-instance-redis-store)
74
+ - [Events](#events)
75
+ - [Backpressure](#backpressure)
76
+ - [Error handling](#error-handling)
77
+ - [OpenTelemetry](#opentelemetry)
78
+ - [CLI audit](#cli-audit)
79
+ - [Model registry](#model-registry)
80
+ - [Advanced usage](#advanced-usage)
81
+ - [How it works](#how-it-works)
82
+ - [Comparison](#comparison)
83
+ - [TypeScript](#typescript)
84
+ - [Requirements](#requirements)
85
+
59
86
  ---
60
87
 
61
- ## Configuration
88
+ ## Vercel AI SDK usage
89
+
90
+ ### Basic wrap
91
+
92
+ ```typescript
93
+ import { createRateLimiter } from 'ai-sdk-rate-limiter'
94
+ import { openai } from '@ai-sdk/openai'
95
+ import { generateText, streamText } from 'ai'
96
+
97
+ const limiter = createRateLimiter()
98
+ const model = limiter.wrap(openai('gpt-4o'))
99
+
100
+ // generateText
101
+ const { text } = await generateText({ model, prompt: 'Summarize this...' })
102
+
103
+ // streamText — streaming is first-class, rate limit slot consumed at request start
104
+ const result = streamText({ model, messages })
105
+ for await (const chunk of result.textStream) {
106
+ process.stdout.write(chunk)
107
+ }
108
+ ```
109
+
110
+ ### Using the raw middleware
111
+
112
+ If you use `wrapLanguageModel` directly or need to compose middleware:
113
+
114
+ ```typescript
115
+ import { wrapLanguageModel } from 'ai'
116
+
117
+ // Single middleware
118
+ const model = wrapLanguageModel({
119
+ model: openai('gpt-4o'),
120
+ middleware: limiter.middleware,
121
+ })
122
+
123
+ // Composed with other middleware
124
+ const model = wrapLanguageModel({
125
+ model: openai('gpt-4o'),
126
+ middleware: [loggingMiddleware, limiter.middleware, cachingMiddleware],
127
+ })
128
+ ```
129
+
130
+ ---
131
+
132
+ ## Raw SDK proxy
133
+
134
+ If you're using the OpenAI, Anthropic, Groq, Mistral, or Cohere SDK directly — without the Vercel AI SDK — use `limiter.rawProxy()` to add rate limiting as a transparent drop-in:
135
+
136
+ ```typescript
137
+ import { createRateLimiter } from 'ai-sdk-rate-limiter'
138
+ import OpenAI from 'openai'
139
+ import Anthropic from '@anthropic-ai/sdk'
140
+
141
+ const limiter = createRateLimiter({
142
+ cost: { budget: { daily: 50 }, onExceeded: 'throw' },
143
+ on: { rateLimited: ({ model }) => console.warn(`${model} rate limited`) },
144
+ })
145
+
146
+ // Every API call goes through the same rate limiter and cost tracker
147
+ const openai = limiter.rawProxy(new OpenAI())
148
+ const anthropic = limiter.rawProxy(new Anthropic())
149
+
150
+ // Use exactly as before — no other changes needed
151
+ const completion = await openai.chat.completions.create({
152
+ model: 'gpt-4o',
153
+ messages: [{ role: 'user', content: 'Hello!' }],
154
+ })
155
+
156
+ const message = await anthropic.messages.create({
157
+ model: 'claude-opus-4-6',
158
+ max_tokens: 1024,
159
+ messages: [{ role: 'user', content: 'Hello!' }],
160
+ })
161
+
162
+ // Cost from both clients tracked together
163
+ const report = limiter.getCostReport()
164
+ ```
165
+
166
+ **Streaming** — the proxy wraps the returned `AsyncIterable` to capture the final usage chunk automatically:
167
+
168
+ ```typescript
169
+ const stream = await openai.chat.completions.create({
170
+ model: 'gpt-4o',
171
+ messages: [{ role: 'user', content: 'Stream this' }],
172
+ stream: true,
173
+ stream_options: { include_usage: true }, // OpenAI: include usage in final chunk
174
+ })
175
+
176
+ for await (const chunk of stream) {
177
+ process.stdout.write(chunk.choices[0]?.delta?.content ?? '')
178
+ }
179
+ // After the loop, tokens are recorded in limiter.getCostReport()
180
+ ```
181
+
182
+ **Standalone — no shared limiter needed:**
183
+
184
+ ```typescript
185
+ import { rateLimited } from 'ai-sdk-rate-limiter'
186
+
187
+ const openai = rateLimited(new OpenAI(), {
188
+ config: { cost: { budget: { daily: 20 } } },
189
+ })
190
+ ```
191
+
192
+ **Override auto-detected provider** — useful for OpenAI-compatible endpoints:
193
+
194
+ ```typescript
195
+ const client = limiter.rawProxy(new OpenAI({ baseURL: 'https://api.groq.com/openai/v1' }), {
196
+ provider: 'groq', // use Groq's limits and pricing instead of OpenAI's
197
+ })
198
+ ```
199
+
200
+ Provider is auto-detected from the client's constructor name (`OpenAI` → `openai`, `Anthropic` → `anthropic`, `Groq` → `groq`, etc.).
201
+
202
+ ---
203
+
204
+ ## Configuration reference
62
205
 
63
206
  Everything has a sensible default. Override only what you need.
64
207
 
@@ -67,17 +210,15 @@ const limiter = createRateLimiter({
67
210
  // Override or extend built-in model limits for your API tier
68
211
  limits: {
69
212
  'gpt-4o': { rpm: 500, itpm: 30_000 },
70
- 'claude-opus-4-6': { rpm: 50, itpm: 30_000 },
71
- // Wildcard: apply to all models from a provider
72
- 'openai/*': { rpm: 1000 },
213
+ 'claude-opus-4-6': { rpm: 50, itpm: 20_000 },
73
214
  },
74
215
 
75
216
  // Cost budgets and behavior when exceeded
76
217
  cost: {
77
218
  budget: {
78
- hourly: 5, // USD
79
- daily: 50,
80
- monthly: 500,
219
+ hourly: 5, // USD — hard cap per hour
220
+ daily: 50, // USD — hard cap per day
221
+ monthly: 500, // USD — hard cap per month
81
222
  },
82
223
  onExceeded: 'throw', // 'throw' | 'queue' | 'fallback'
83
224
  },
@@ -100,14 +241,14 @@ const limiter = createRateLimiter({
100
241
  retryOn: [429, 500, 502, 503, 504],
101
242
  },
102
243
 
103
- // Observability
244
+ // Observability — see Events section for all available events
104
245
  on: {
105
246
  rateLimited: ({ model, source, resetAt }) =>
106
247
  console.warn(`${model} rate limited (${source}), resets ${new Date(resetAt).toISOString()}`),
107
248
  retrying: ({ model, attempt, delayMs }) =>
108
249
  console.log(`${model} retry ${attempt} in ${delayMs}ms`),
109
250
  budgetHit: ({ model, currentCostUsd, limitUsd, period }) =>
110
- alerts.send(`${model} hit $${limitUsd} ${period} budget ($${currentCostUsd} spent)`),
251
+ alerts.send(`${model} hit $${limitUsd} ${period} budget`),
111
252
  completed: ({ model, inputTokens, outputTokens, costUsd, latencyMs }) =>
112
253
  metrics.record({ model, inputTokens, outputTokens, costUsd, latencyMs }),
113
254
  },
@@ -123,14 +264,14 @@ Pass options to individual requests via `providerOptions.rateLimiter`:
123
264
  ```typescript
124
265
  import { generateText } from 'ai'
125
266
 
126
- // High-priority request — skips ahead of normal traffic in the queue
267
+ // High-priority — skips ahead of normal traffic in the queue
127
268
  await generateText({
128
269
  model,
129
270
  prompt: 'Urgent user request...',
130
271
  providerOptions: {
131
272
  rateLimiter: {
132
- priority: 'high', // 'high' | 'normal' | 'low'
133
- timeout: 10_000, // override the default queue timeout for this request
273
+ priority: 'high', // 'high' | 'normal' | 'low'
274
+ timeout: 10_000, // override the default queue timeout for this request
134
275
  },
135
276
  },
136
277
  })
@@ -145,7 +286,7 @@ await generateText({
145
286
  })
146
287
  ```
147
288
 
148
- This is the right way to colocate user requests and background jobs on the same model without background jobs starving users.
289
+ This is the recommended way to colocate user requests and background jobs on the same model without background jobs starving users.
149
290
 
150
291
  ---
151
292
 
@@ -157,7 +298,7 @@ const report = limiter.getCostReport()
157
298
 
158
299
  console.log(report)
159
300
  // {
160
- // hour: { requests: 42, inputTokens: 84_000, outputTokens: 21_000, costUsd: 0.29 },
301
+ // hour: { requests: 42, inputTokens: 84_000, outputTokens: 21_000, costUsd: 0.29 },
161
302
  // day: { requests: 318, inputTokens: 620_000, outputTokens: 155_000, costUsd: 2.11 },
162
303
  // month: { requests: 4821, inputTokens: 9_100_000, outputTokens: 2_200_000, costUsd: 34.80 },
163
304
  // byModel: {
@@ -173,17 +314,17 @@ Costs are based on **actual token counts** from API responses — not estimates.
173
314
 
174
315
  ## Budget fallback routing
175
316
 
176
- When a budget limit is hit, you can transparently reroute to a cheaper model instead of throwing an error. Pass a `fallback` option to `wrap()`:
317
+ When a budget limit is hit, you can transparently reroute to a cheaper model instead of throwing an error:
177
318
 
178
319
  ```typescript
179
320
  const limiter = createRateLimiter({
180
321
  cost: {
181
322
  budget: { daily: 10 },
182
- onExceeded: 'fallback', // reroute to fallback instead of throwing
323
+ onExceeded: 'fallback', // reroute to fallback instead of throwing
183
324
  },
184
325
  on: {
185
- budgetHit: ({ model, currentCostUsd, limitUsd, period }) =>
186
- console.warn(`${model} ${period} budget hit ($${currentCostUsd} of $${limitUsd})`),
326
+ budgetHit: ({ model, usingFallback }) =>
327
+ console.warn(`${model} budget hit — ${usingFallback ? 'using fallback' : 'throwing'}`),
187
328
  },
188
329
  })
189
330
 
@@ -192,17 +333,11 @@ const model = limiter.wrap(
192
333
  { fallback: openai('gpt-4o-mini') }, // used when budget is exceeded
193
334
  )
194
335
 
195
- // Under budget → uses gpt-4o normally
196
- // Over $10/day → silently switches to gpt-4o-mini, no code changes needed
336
+ // Under budget → uses gpt-4o normally
337
+ // Over $10/day → silently switches to gpt-4o-mini, no code changes needed
197
338
  const result = await generateText({ model, prompt })
198
339
  ```
199
340
 
200
- **How it works:**
201
- 1. The budget is checked before every request against total rolling spend
202
- 2. When exceeded, `BudgetExceededError` is caught inside `wrap()` before it reaches your code
203
- 3. The request is re-executed against the fallback model, bypassing the budget pre-check
204
- 4. Fallback usage is tracked under the fallback model's ID in `getCostReport()`
205
-
206
341
  **Behavior matrix:**
207
342
 
208
343
  | `onExceeded` | `fallback` configured | Outcome |
@@ -212,11 +347,13 @@ const result = await generateText({ model, prompt })
212
347
  | `'fallback'` | no | Throws `BudgetExceededError` |
213
348
  | `'queue'` | any | Queues until period resets |
214
349
 
350
+ Fallback usage is tracked under the fallback model's ID in `getCostReport()`.
351
+
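
The behavior matrix can be read as a small decision function. The sketch below is illustrative only: `resolveBudgetOutcome` and the `Outcome` type are hypothetical names, not part of the package's API, and the `'throw'` branch is inferred from the error-handling notes rather than from the matrix rows shown in this diff.

```typescript
// Hypothetical helper mirroring the documented behavior matrix.
type OnExceeded = 'throw' | 'queue' | 'fallback'
type Outcome = 'throw' | 'queue' | 'use-fallback'

function resolveBudgetOutcome(onExceeded: OnExceeded, hasFallback: boolean): Outcome {
  // 'queue' waits for the budget period to reset, with or without a fallback.
  if (onExceeded === 'queue') return 'queue'
  // 'fallback' only reroutes when a fallback model was passed to wrap().
  if (onExceeded === 'fallback' && hasFallback) return 'use-fallback'
  // Everything else surfaces as BudgetExceededError.
  return 'throw'
}
```

Keeping the decision in one pure function makes it easy to see that a missing fallback model degrades to an error, not to a silent pass-through.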
215
352
  ---
216
353
 
217
354
  ## Multi-instance Redis store
218
355
 
219
- By default, rate limit state is in-memory (per-process). In multi-instance deployments — serverless functions, multiple pods, workers — each instance has its own counters. Install the Redis store to share state across all instances:
356
+ By default, rate limit state is in-memory (per-process). For multi-instance deployments — multiple pods, serverless replicas, workers — each instance has its own counters. Install the Redis store to share state:
220
357
 
221
358
  ```
222
359
  npm install ioredis
@@ -229,162 +366,91 @@ import Redis from 'ioredis'
229
366
 
230
367
  const limiter = createRateLimiter({
231
368
  store: new RedisStore(new Redis(process.env.REDIS_URL)),
232
- // ... rest of your config
369
+ // ... rest of your config unchanged
233
370
  })
234
371
  ```
235
372
 
236
373
  That's the entire change. All APIs — `wrap()`, `rawProxy()`, events, cost reports — work identically. The Redis store enforces rate limits collectively so no two instances can jointly exceed the API limits.
237
374
 
238
- **How it works:**
239
-
240
- Each request atomically runs a Lua script that:
241
- 1. Removes entries older than 60 seconds from a sorted set (`ZREMRANGEBYSCORE`)
242
- 2. Counts remaining requests and sums input tokens
243
- 3. Checks against RPM and ITPM limits
244
- 4. If allowed: reserves the slot (`ZADD`) and returns immediately
245
- 5. If blocked: returns the timestamp when the next slot opens
246
-
247
- The local queue (priority ordering, drain timer, timeout handling) stays in-memory per instance — only the window counters are shared.
248
-
249
375
  **Options:**
250
376
 
251
377
  ```typescript
252
378
  new RedisStore(redis, {
253
- keyPrefix: 'rl:myapp:', // namespace if multiple apps share Redis
254
- windowMs: 60_000, // window size; match your provider's limit window
379
+ keyPrefix: 'rl:myapp:', // namespace if multiple apps share one Redis instance
380
+ windowMs: 60_000, // window size in ms; match your provider's limit window
255
381
  })
256
382
  ```
257
383
 
258
- **Compatible clients** — any Redis client with `eval()`, `get()`, and `set()` works: `ioredis`, `node-redis`, Upstash Redis.
384
+ **How it works internally:**
385
+
386
+ Each request runs a Lua script atomically that: removes stale entries from a sorted set, counts requests and tokens in the current window, checks against RPM and ITPM limits, and either reserves the slot or returns when the next slot opens. The local queue (priority ordering, drain timer, timeout handling) stays in-memory per instance — only the window counters are shared via Redis.
387
+
388
+ **Compatible clients** — any client with `eval()`, `get()`, and `set()` works: `ioredis`, `node-redis`, Upstash Redis.
259
389
 
260
- **Single-instance deployments:** the default `InMemoryStore` is more accurate (true sliding window, no network round-trips) and zero-config. Only switch to `RedisStore` when you actually need cross-instance coordination.
390
+ Use the default `InMemoryStore` for single-instance deployments — it's more accurate (true sliding window, no network round-trips) and zero-config. Only switch to `RedisStore` when you actually need cross-instance coordination.
261
391
 
262
392
  ---
263
393
 
264
- ## Raw SDK proxy
394
+ ## Events
265
395
 
266
- If you're using the OpenAI, Anthropic, Groq, Mistral, or Cohere SDK directly — without the Vercel AI SDK — use `limiter.rawProxy()` to add rate limiting as a transparent drop-in:
396
+ All events are typed. Register handlers at creation time or dynamically:
267
397
 
268
398
  ```typescript
269
- import { createRateLimiter } from 'ai-sdk-rate-limiter'
270
- import OpenAI from 'openai'
271
- import Anthropic from '@anthropic-ai/sdk'
272
-
399
+ // At creation time
273
400
  const limiter = createRateLimiter({
274
- cost: { budget: { daily: 50 }, onExceeded: 'throw' },
275
- on: { rateLimited: ({ model }) => console.warn(`${model} rate limited`) },
276
- })
277
-
278
- // Every API call goes through the same rate limiter and cost tracker
279
- const openai = limiter.rawProxy(new OpenAI())
280
- const anthropic = limiter.rawProxy(new Anthropic())
281
-
282
- // Use exactly as before — no other changes needed
283
- const completion = await openai.chat.completions.create({
284
- model: 'gpt-4o',
285
- messages: [{ role: 'user', content: 'Hello!' }],
286
- })
287
-
288
- const message = await anthropic.messages.create({
289
- model: 'claude-opus-4-6',
290
- max_tokens: 1024,
291
- messages: [{ role: 'user', content: 'Hello!' }],
401
+ on: { rateLimited: handler },
292
402
  })
293
403
 
294
- // Cost from both clients tracked together
295
- const report = limiter.getCostReport()
296
- ```
297
-
298
- **Streaming works too** — the proxy wraps the returned `AsyncIterable` to capture the final usage chunk automatically:
299
-
300
- ```typescript
301
- const stream = await openai.chat.completions.create({
302
- model: 'gpt-4o',
303
- messages: [{ role: 'user', content: 'Stream this' }],
304
- stream: true,
305
- stream_options: { include_usage: true }, // tells OpenAI to include usage in final chunk
404
+ // Dynamically
405
+ limiter.on('queued', ({ model, queueDepth, estimatedWaitMs }) => {
406
+ console.log(`${model} queued (depth: ${queueDepth}, ~${estimatedWaitMs}ms wait)`)
306
407
  })
307
408
 
308
- for await (const chunk of stream) {
309
- process.stdout.write(chunk.choices[0]?.delta?.content ?? '')
310
- }
311
- // After the loop, usage is recorded in limiter.getCostReport()
312
- ```
313
-
314
- **Zero-config standalone version** — if you don't need to share the limiter with other models:
315
-
316
- ```typescript
317
- import { rateLimited } from 'ai-sdk-rate-limiter'
318
-
319
- const openai = rateLimited(new OpenAI(), {
320
- config: { cost: { budget: { daily: 20 } } },
321
- })
409
+ limiter.off('queued', handler)
322
410
  ```
323
411
 
324
- **Provider is auto-detected** from the client's constructor name (`OpenAI`, `Anthropic`, `Groq`, etc.). Override it explicitly if needed:
412
+ | Event | When | Key fields |
413
+ |---|---|---|
414
+ | `queued` | Request enters the queue | `model`, `provider`, `priority`, `queueDepth`, `estimatedWaitMs` |
415
+ | `dequeued` | Request leaves the queue | `model`, `provider`, `waitedMs`, `priority` |
416
+ | `retrying` | A failed request is about to retry | `model`, `provider`, `attempt`, `maxAttempts`, `delayMs`, `error` |
417
+ | `rateLimited` | Limit hit (local or remote 429) | `model`, `provider`, `source`, `limitType`, `resetAt` |
418
+ | `budgetHit` | Cost budget exceeded | `model`, `provider`, `currentCostUsd`, `limitUsd`, `period`, `usingFallback` |
419
+ | `dropped` | Request rejected (queue full or timeout) | `model`, `provider`, `reason` |
420
+ | `completed` | Request finished successfully | `model`, `provider`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs`, `streaming` |
325
421
 
326
- ```typescript
327
- const client = limiter.rawProxy(new OpenAI({ baseURL: 'https://api.groq.com/openai/v1' }), {
328
- provider: 'groq', // use Groq's limits and pricing instead of OpenAI's
329
- })
330
- ```
422
+ The `source` on `rateLimited` distinguishes between requests we blocked locally (`'local'`) vs. requests the API rejected with a 429 (`'remote'`). Local blocks are expected and free. Frequent remote blocks mean your configured limits are too high for your tier — run `npx ai-sdk-rate-limiter audit` to get accurate numbers.
331
423
 
332
424
  ---
333
425
 
334
- ## Backpressure — know before you send
426
+ ## Backpressure
335
427
 
336
- Check estimated wait time before committing to a request. Useful for showing loading states or shedding load gracefully.
428
+ Check estimated wait time before committing to a request. Useful for showing loading states or shedding load gracefully:
337
429
 
338
430
  ```typescript
339
- const waitMs = limiter.estimatedWait('gpt-4o')
431
+ const waitMs = await limiter.estimatedWait('gpt-4o')
340
432
 
341
433
  if (waitMs > 5_000) {
342
- return res.status(503).json({ error: 'Model is busy, try again shortly', retryAfterMs: waitMs })
434
+ return res.status(503).json({
435
+ error: 'Model busy, try again shortly',
436
+ retryAfterMs: waitMs,
437
+ })
343
438
  }
344
439
 
345
440
  const result = await generateText({ model, prompt })
346
441
  ```
347
442
 
348
- ---
349
-
350
- ## Events
351
-
352
- All events are typed. Register handlers at creation time or dynamically:
353
-
354
- ```typescript
355
- // At creation time
356
- const limiter = createRateLimiter({
357
- on: { rateLimited: handler },
358
- })
359
-
360
- // Dynamically
361
- limiter.on('queued', ({ model, queueDepth, estimatedWaitMs }) => {
362
- console.log(`${model} queued (depth: ${queueDepth}, ~${estimatedWaitMs}ms wait)`)
363
- })
364
-
365
- limiter.off('queued', handler)
366
- ```
367
-
368
- | Event | When | Key fields |
369
- |---|---|---|
370
- | `queued` | Request enters the queue | `model`, `priority`, `queueDepth`, `estimatedWaitMs` |
371
- | `dequeued` | Request leaves the queue | `model`, `waitedMs`, `priority` |
372
- | `retrying` | A failed request is about to retry | `model`, `attempt`, `maxAttempts`, `delayMs`, `error` |
373
- | `rateLimited` | Limit hit (local or remote 429) | `model`, `source`, `limitType`, `resetAt` |
374
- | `budgetHit` | Cost budget exceeded | `model`, `currentCostUsd`, `limitUsd`, `period`, `usingFallback` |
375
- | `dropped` | Request rejected (queue full or timeout) | `model`, `reason` |
376
- | `completed` | Request finished successfully | `model`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs` |
377
-
378
- The `source` on `rateLimited` distinguishes between requests we blocked locally (`'local'`) vs. requests the API rejected with a 429 (`'remote'`). Local blocks are free; remote blocks mean your limits are misconfigured.
443
+ Returns `0` if the model would proceed immediately.
379
444
 
380
445
  ---
381
446
 
382
447
  ## Error handling
383
448
 
384
- Every error is typed, structured, and tells you exactly what happened:
449
+ Every error is typed and carries structured context:
385
450
 
386
451
  ```typescript
387
452
  import {
453
+ RateLimitExceededError,
388
454
  QueueTimeoutError,
389
455
  QueueFullError,
390
456
  BudgetExceededError,
@@ -396,77 +462,163 @@ try {
396
462
  const result = await generateText({ model, prompt })
397
463
  } catch (error) {
398
464
  if (error instanceof QueueTimeoutError) {
399
- // error.model, error.waitedMs, error.queueDepth
465
+ // Request waited in queue longer than queue.timeout
400
466
  console.error(`Timed out after ${error.waitedMs}ms (queue depth: ${error.queueDepth})`)
401
467
  } else if (error instanceof BudgetExceededError) {
402
- // error.model, error.currentCostUsd, error.limitUsd, error.period
468
+ // Cost budget hit and onExceeded is 'throw' or no fallback configured
403
469
  console.error(`Budget exceeded: $${error.currentCostUsd} of $${error.limitUsd} ${error.period}`)
404
470
  } else if (error instanceof RetryExhaustedError) {
405
- // error.model, error.attempts, error.cause
471
+ // All retry attempts failed
406
472
  console.error(`All ${error.attempts} retries exhausted`, error.cause)
407
473
  } else if (error instanceof QueueFullError) {
408
- // error.model, error.maxSize
409
- console.error(`Queue full at ${error.maxSize} requests`)
474
+ // Queue at capacity and onFull is 'throw'
475
+ console.error(`Queue full at ${error.maxSize} requests for ${error.model}`)
476
+ } else if (error instanceof RateLimitExceededError) {
477
+ // Rate limit hit and the request could not be queued
478
+ console.error(`${error.model} ${error.limitType} limit of ${error.limit} exceeded`)
410
479
  }
411
480
  }
412
481
  ```
413
482
 
414
- All errors extend `RateLimiterError`, so a single `instanceof RateLimiterError` check separates rate-limiting failures from AI SDK errors.
483
+ All errors extend `RateLimiterError`, so a single `instanceof RateLimiterError` check separates rate-limiter failures from AI API errors.
484
+
485
+ **Error fields:**
486
+
487
+ | Error | Fields |
488
+ |---|---|
489
+ | `QueueTimeoutError` | `model`, `waitedMs`, `queueDepth` |
490
+ | `BudgetExceededError` | `model`, `currentCostUsd`, `limitUsd`, `period` |
491
+ | `RetryExhaustedError` | `model`, `attempts`, `cause` |
492
+ | `QueueFullError` | `model`, `maxSize` |
493
+ | `RateLimitExceededError` | `model`, `limitType`, `limit`, `resetAt` |
415
494
 
416
495
  ---
417
496
 
418
- ## Advanced middleware usage
497
+ ## OpenTelemetry
419
498
 
420
- If you use `wrapLanguageModel` directly, the raw middleware is available:
499
+ The `ai-sdk-rate-limiter/otel` entry point provides a plugin that emits OpenTelemetry spans for every AI request. It has no hard dependency on `@opentelemetry/api`: the plugin accepts any OTel-compatible tracer via structural typing.
421
500
 
422
501
  ```typescript
423
- import { wrapLanguageModel } from 'ai'
502
+ import { trace } from '@opentelemetry/api'
503
+ import { createRateLimiter } from 'ai-sdk-rate-limiter'
504
+ import { createOtelPlugin } from 'ai-sdk-rate-limiter/otel'
424
505
 
425
- const model = wrapLanguageModel({
426
- model: openai('gpt-4o'),
427
- middleware: limiter.middleware,
506
+ const limiter = createRateLimiter({
507
+ on: createOtelPlugin(trace.getTracer('my-service')),
428
508
  })
429
509
  ```
430
510
 
431
- You can also compose it with other middleware:
511
+ **Spans emitted:**
512
+
513
+ | Span name | When | Status |
514
+ |---|---|---|
515
+ | `gen_ai.request` | Every completed request | OK |
516
+ | `gen_ai.request` | Every dropped request | ERROR |
517
+ | `ai_rate_limiter.retry` | Each retry attempt | OK |
518
+ | `ai_rate_limiter.budget_hit` | Budget exceeded | ERROR |
519
+
520
+ **Attributes on `gen_ai.request` (completed):**
521
+
522
+ | Attribute | Value |
523
+ |---|---|
524
+ | `gen_ai.system` | Provider name (e.g. `openai`, `anthropic`) |
525
+ | `gen_ai.request.model` | Model ID |
526
+ | `gen_ai.usage.input_tokens` | Actual input tokens from API response |
527
+ | `gen_ai.usage.output_tokens` | Actual output tokens from API response |
528
+ | `ai_rate_limiter.cost_usd` | Cost in USD for this request |
529
+ | `ai_rate_limiter.latency_ms` | Total latency including queue wait |
530
+ | `ai_rate_limiter.streaming` | Whether the request used streaming |
531
+
532
+ Attribute names follow the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/). The `gen_ai.request` span duration is reconstructed from `latencyMs` so it reflects the full wall-clock time of the request, including any queue wait.
533
+
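
A minimal sketch of how a span duration could be reconstructed from `latencyMs`, assuming the span is created after the request finishes. `MinimalTracer`, `MinimalSpan`, `CompletedEvent`, and `recordCompletedSpan` are hypothetical names for illustration; the package's real `OtelTracer` and `OtelSpan` interfaces may differ.

```typescript
// Hypothetical structural types; not the package's exported interfaces.
interface MinimalSpan {
  setAttribute(key: string, value: string | number | boolean): unknown
  end(endTime?: number): void
}

interface MinimalTracer {
  startSpan(name: string, options?: { startTime?: number }): MinimalSpan
}

interface CompletedEvent {
  model: string
  provider: string
  latencyMs: number
}

function recordCompletedSpan(
  tracer: MinimalTracer,
  event: CompletedEvent,
  endTime: number = Date.now(),
): void {
  // Back-date the span start so its duration equals the measured latency.
  const span = tracer.startSpan('gen_ai.request', {
    startTime: endTime - event.latencyMs,
  })
  span.setAttribute('gen_ai.system', event.provider)
  span.setAttribute('gen_ai.request.model', event.model)
  span.setAttribute('ai_rate_limiter.latency_ms', event.latencyMs)
  span.end(endTime)
}
```

Back-dating the start time is what lets a span created from a `completed` event still cover the queue wait and the request itself.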
534
+ **Custom tracer interface** — if you don't want to install `@opentelemetry/api`, implement the `OtelTracer` interface directly:
432
535
 
433
536
  ```typescript
434
- const model = wrapLanguageModel({
435
- model: openai('gpt-4o'),
436
- middleware: [loggingMiddleware, limiter.middleware, cachingMiddleware],
537
+ import { createOtelPlugin, type OtelTracer } from 'ai-sdk-rate-limiter/otel'
538
+
539
+ const tracer: OtelTracer = {
540
+ startSpan(name, options) {
541
+ // return any object that implements OtelSpan
542
+ },
543
+ }
544
+
545
+ const limiter = createRateLimiter({
546
+ on: createOtelPlugin(tracer),
437
547
  })
438
548
  ```
439
549
 
440
550
  ---
441
551
 
442
- ## Multiple limiters
552
+ ## CLI audit
443
553
 
444
- Run separate limiters for separate concerns e.g., one per customer tier:
554
+ Probe your API keys to discover your actual rate limit tier and generate a ready-to-paste config override:
445
555
 
446
- ```typescript
447
- const freeLimiter = createRateLimiter({
448
- limits: { 'gpt-4o-mini': { rpm: 5 } },
449
- cost: { budget: { daily: 0.10 }, onExceeded: 'throw' },
450
- queue: { timeout: 5_000 },
451
- })
556
+ ```
557
+ npx ai-sdk-rate-limiter audit
558
+ ```
452
559
 
453
- const paidLimiter = createRateLimiter({
454
- limits: { 'gpt-4o': { rpm: 100 } },
455
- cost: { budget: { daily: 20 } },
456
- queue: { timeout: 30_000 },
457
- })
560
+ ```
561
+ ────────────────────────────────────────────────────────────────────────────────
562
+ ai-sdk-rate-limiter audit
563
+ ────────────────────────────────────────────────────────────────────────────────
564
+
565
+ OPENAI (OPENAI_API_KEY)
566
+ Model RPM TPM Registry
567
+ ──────────────────────────────────────────────────────────────────────────────
568
+ gpt-4o 10000 2,000,000 (registry: 500 RPM / 30,000 TPM)
569
+ gpt-4o-mini 10000 10,000,000 ≠ (registry: 500 RPM / 200,000 TPM)
570
+
571
+ ────────────────────────────────────────────────────────────────────────────────
572
+ 1 model(s) differ from registry defaults.
573
+ Paste the config below into createRateLimiter():
574
+
575
+ const limiter = createRateLimiter({
576
+ limits: {
577
+ 'gpt-4o-mini': { rpm: 10000, itpm: 10_000_000 },
578
+ },
579
+ })
458
580
 
459
- // Route to the right limiter per request
460
- const model = req.user.plan === 'paid'
461
- ? paidLimiter.wrap(openai('gpt-4o'))
462
- : freeLimiter.wrap(openai('gpt-4o-mini'))
581
+ ────────────────────────────────────────────────────────────────────────────────
582
+ ```
583
+
584
+ **How it works** — Makes a minimal (5-token) request per model and reads the rate-limit headers (for example `x-ratelimit-limit-*`) that providers return on each response. These headers reflect your account's actual tier, not the published defaults.
585
+
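
As a rough illustration of the header-reading step, the sketch below parses OpenAI-style limit headers from a response. `readLimitHeaders` and `DetectedLimits` are hypothetical names, not the CLI's actual code, and other providers are assumed to use different header names.

```typescript
// Hypothetical helper; header names follow OpenAI's documented convention.
interface DetectedLimits {
  rpm: number | undefined
  tpm: number | undefined
}

function readLimitHeaders(headers: Headers): DetectedLimits {
  const parse = (name: string): number | undefined => {
    const raw = headers.get(name)
    if (raw === null) return undefined
    const value = Number.parseInt(raw, 10)
    return Number.isNaN(value) ? undefined : value
  }
  return {
    rpm: parse('x-ratelimit-limit-requests'),
    tpm: parse('x-ratelimit-limit-tokens'),
  }
}
```

A missing or malformed header yields `undefined` rather than `0`, so a caller can tell "no limit reported" apart from a real value.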
586
+ **Options:**
587
+
588
+ ```
589
+ npx ai-sdk-rate-limiter audit [options]
590
+
591
+ --provider, -p <name> Audit a single provider: openai, anthropic, groq, mistral, cohere
592
+ --json Machine-readable JSON output
593
+ --help, -h Show help
594
+ --version, -v Print version
595
+
596
+ Environment variables required:
597
+ OPENAI_API_KEY
598
+ ANTHROPIC_API_KEY
599
+ GROQ_API_KEY
600
+ MISTRAL_API_KEY
601
+ COHERE_API_KEY
602
+ ```
603
+
604
+ **Examples:**
605
+
606
+ ```bash
607
+ # Audit all configured providers
608
+ npx ai-sdk-rate-limiter audit
609
+
610
+ # Audit only OpenAI
611
+ npx ai-sdk-rate-limiter audit --provider openai
612
+
613
+ # Machine-readable output for CI / scripts
614
+ npx ai-sdk-rate-limiter audit --json | jq '.providers[0].models'
463
615
  ```
464
616
 
465
617
  ---
466
618
 
467
- ## Built-in model registry
619
+ ## Model registry
468
620
 
469
- Limits and pricing are built-in for every major model across 6 providers. Defaults are conservative (free/Tier 1) — override with your actual plan limits.
621
+ Limits and pricing are built-in for every major model across 6 providers. Defaults are conservative (free/Tier 1) — override with your actual plan limits via the `limits` config option or by running `audit`.
470
622
 
471
623
  **OpenAI**
472
624
 
@@ -540,30 +692,87 @@ import {
540
692
  console.log(GROQ_MODELS['llama-3.3-70b-versatile'])
541
693
  // { rpm: 30, itpm: 6000, rpd: 1000, inputPricePerMillion: 0.59, ... }
542
694
 
543
- console.log(isKnownModel('llama-3.3-70b-versatile', 'groq'))
544
- // true
695
+ console.log(isKnownModel('llama-3.3-70b-versatile', 'groq')) // true
696
+ console.log(isKnownModel('my-fine-tune', 'openai')) // false → fallback limits
697
+
698
+ // Resolve the effective limits for a model (registry defaults merged with user overrides)
699
+ const limits = resolveModelLimits('gpt-4o', 'openai', { 'gpt-4o': { rpm: 1000 } })
700
+ ```
701
+
702
+ ---
703
+
704
+ ## Advanced usage
545
705
 
546
- console.log(isKnownModel('my-fine-tune', 'openai'))
547
- // false → will use fallback limits
706
+ ### Multiple limiters per tier
707
+
708
+ ```typescript
709
+ const freeLimiter = createRateLimiter({
710
+ limits: { 'gpt-4o-mini': { rpm: 5 } },
711
+ cost: { budget: { daily: 0.10 }, onExceeded: 'throw' },
712
+ queue: { timeout: 5_000 },
713
+ })
714
+
715
+ const paidLimiter = createRateLimiter({
716
+ limits: { 'gpt-4o': { rpm: 100 } },
717
+ cost: { budget: { daily: 20 } },
718
+ queue: { timeout: 30_000 },
719
+ })
720
+
721
+ // Route per request based on user plan
722
+ const model = req.user.plan === 'paid'
723
+ ? paidLimiter.wrap(openai('gpt-4o'))
724
+ : freeLimiter.wrap(openai('gpt-4o-mini'))
725
+ ```
726
+
727
+ ### Combine OTel tracing with event logging
728
+
729
+ ```typescript
730
+ import { createOtelPlugin } from 'ai-sdk-rate-limiter/otel'
731
+
732
+ const limiter = createRateLimiter({
733
+ on: {
734
+ // OTel spans for every request
735
+ ...createOtelPlugin(trace.getTracer('my-service')),
736
+ // Plus any additional handlers
737
+ budgetHit: ({ model, limitUsd, period }) =>
738
+ alerts.send(`Budget alert: ${model} hit $${limitUsd} ${period} cap`),
739
+ },
740
+ })
741
+ ```
742
+
743
+ ### Custom rate limit store
744
+
745
+ Implement `RateLimitStore` to use any backend (DynamoDB, Postgres, etc.):
746
+
747
+ ```typescript
748
+ import type { RateLimitStore } from 'ai-sdk-rate-limiter'
749
+
750
+ class MyStore implements RateLimitStore {
751
+ async checkAndReserve(key, tokens, limits) { /* ... */ }
752
+ async applyBackoff(key, untilMs) { /* ... */ }
753
+ async getBackoff(key) { /* ... */ }
754
+ }
755
+
756
+ const limiter = createRateLimiter({ store: new MyStore() })
548
757
  ```
549
758
 
550
759
  ---
551
760
 
552
761
  ## How it works
553
762
 
554
- **Rate limiting algorithm** — Sliding window counter. Each model tracks a list of `{timestamp, tokens}` entries for the past 60 seconds. On every request, stale entries are evicted and the window is checked against RPM and ITPM limits simultaneously.
763
+ **Rate limiting** — Sliding window counter per model. Each model tracks a list of `{timestamp, tokens}` entries for the past 60 seconds. On every request, stale entries are evicted and the window is checked against RPM and ITPM limits simultaneously.
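
The sliding-window check described above can be sketched roughly as follows. This is a minimal illustration, not the library's internal code: `SlidingWindow`, `WindowEntry`, and `tryReserve` are hypothetical names, and the caller passes `now` explicitly so the behaviour is easy to follow.

```typescript
// Illustrative sketch only: these names are hypothetical and are not
// exports of ai-sdk-rate-limiter.
interface WindowEntry {
  timestamp: number // ms timestamp of the request
  tokens: number // input tokens reserved by the request
}

class SlidingWindow {
  private entries: WindowEntry[] = []
  private readonly rpm: number
  private readonly itpm: number
  private readonly windowMs: number

  constructor(rpm: number, itpm: number, windowMs: number = 60_000) {
    this.rpm = rpm
    this.itpm = itpm
    this.windowMs = windowMs
  }

  /** Records the request and returns true only if both limits allow it. */
  tryReserve(tokens: number, now: number): boolean {
    // Evict entries that have aged out of the window.
    this.entries = this.entries.filter((e) => now - e.timestamp < this.windowMs)

    // Check the request-count limit and the token limit together.
    const usedTokens = this.entries.reduce((sum, e) => sum + e.tokens, 0)
    const fitsRpm = this.entries.length + 1 <= this.rpm
    const fitsItpm = usedTokens + tokens <= this.itpm
    if (!fitsRpm || !fitsItpm) return false

    this.entries.push({ timestamp: now, tokens })
    return true
  }
}
```

A request that fails either check would be handed to the queue rather than rejected outright.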
555
764
 
556
- **Queue** — A min-heap priority queue per model, sorted by `priority` then enqueue time (FIFO within same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.
765
+ **Queue** — A sorted priority queue per model, ordered by `priority` then enqueue time (FIFO within same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.
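
One way to picture the ordering rule is a sorted array with stable insertion. The sketch below is an assumption-laden illustration: `Waiter`, `WaitQueue`, and the numeric `priority` (lower means more urgent) are hypothetical, and the library's own `Priority` type may look different.

```typescript
// Illustrative sketch only: hypothetical names, not the library's queue.
interface Waiter {
  id: string
  priority: number // assumption: lower number = more urgent
  enqueuedAt: number // ms timestamp, used as the FIFO tie-breaker
}

// Order by priority first, then by enqueue time.
function compareWaiters(a: Waiter, b: Waiter): number {
  return a.priority - b.priority || a.enqueuedAt - b.enqueuedAt
}

class WaitQueue {
  private items: Waiter[] = []

  enqueue(w: Waiter): void {
    // Insert before the first item that should strictly come after `w`,
    // so items with equal keys keep their arrival order.
    const idx = this.items.findIndex((x) => compareWaiters(w, x) < 0)
    if (idx === -1) this.items.push(w)
    else this.items.splice(idx, 0, w)
  }

  dequeue(): Waiter | undefined {
    return this.items.shift()
  }

  get size(): number {
    return this.items.length
  }
}
```

A drain step would then call `dequeue()` repeatedly while the rate window still has capacity, and reschedule itself once it does not.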
557
766
 
558
- **Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common failure mode where you retry one request while 10 more immediately follow and all get 429s.
767
+ **Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the entire model key in the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common thundering-herd failure where you retry one request while 10 others immediately follow and all get 429s.
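
The idea of a backoff shared by every request on the same model key can be sketched as below. The names `parseRetryAfter`, `BackoffRegistry`, and `remainingMs` are hypothetical; only the `applyBackoff(key, untilMs)` shape mirrors the store interface shown earlier, and the key format in the usage note is an assumption.

```typescript
// Illustrative sketch only: hypothetical helpers, not the library's engine.

/** Converts a Retry-After value (delay in seconds or an HTTP date) to a delay in ms. */
function parseRetryAfter(header: string, now: number): number | undefined {
  const trimmed = header.trim()
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000
  const date = Date.parse(trimmed)
  return Number.isNaN(date) ? undefined : Math.max(0, date - now)
}

class BackoffRegistry {
  private until = new Map<string, number>()

  /** Records a backoff deadline for a key; an existing later deadline wins. */
  applyBackoff(key: string, untilMs: number): void {
    const current = this.until.get(key) ?? 0
    this.until.set(key, Math.max(current, untilMs))
  }

  /** How long any request on this key should still wait. */
  remainingMs(key: string, now: number): number {
    return Math.max(0, (this.until.get(key) ?? 0) - now)
  }
}
```

Every queued request would consult `remainingMs` for its key (for example a hypothetical `'openai:gpt-4o'`) before firing, so a single 429 pauses the whole key instead of only the request that received it.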
559
768
 
560
- **Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk.
769
+ **Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk (Vercel AI SDK) or the final usage chunk (raw proxy).
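
The estimate-then-reconcile flow might look like the sketch below. The 4 characters per token ratio comes from the paragraph above; `estimateTokens`, `reserve`, `reconcile`, and the `Reservation` shape are hypothetical names used only for illustration.

```typescript
// Illustrative sketch only: hypothetical helpers, not the library's API.
const CHARS_PER_TOKEN = 4

/** Rough pre-flight estimate from prompt length. */
function estimateTokens(promptText: string): number {
  return Math.ceil(promptText.length / CHARS_PER_TOKEN)
}

interface Reservation {
  tokens: number
  estimated: boolean // true until real usage replaces the estimate
}

/** Reserves window capacity using the estimate, before the request fires. */
function reserve(promptText: string): Reservation {
  return { tokens: estimateTokens(promptText), estimated: true }
}

/** Replaces the estimate with actual usage; returns the correction applied. */
function reconcile(reservation: Reservation, actualInputTokens: number): number {
  const delta = actualInputTokens - reservation.tokens
  reservation.tokens = actualInputTokens
  reservation.estimated = false
  return delta
}
```

A positive correction means the request consumed more of the window than was reserved, and a negative one frees capacity for waiting requests.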
561
770
 
562
- **Zero dependencies** — The middleware interface is implemented structurally. `@ai-sdk/provider` types are referenced for type checking but not required at runtime. No `ioredis`, no `bottleneck`, no tokenizer libraries.
771
+ **Zero dependencies** — The Vercel AI SDK middleware interface is implemented structurally: `@ai-sdk/provider` types are used for type checking only and are not required at runtime. No `ioredis`, no `bottleneck`, no tokenizer libraries in the core.
563
772
 
564
773
  ---
565
774
 
566
- ## Why not X?
775
+ ## Comparison
567
776
 
568
777
  | | ai-sdk-rate-limiter | bottleneck | p-limit | SDK built-in retry | LangChain |
569
778
  |---|:---:|:---:|:---:|:---:|:---:|
@@ -575,37 +784,42 @@ console.log(isKnownModel('my-fine-tune', 'openai'))
575
784
  | Cost tracking + budgets | yes | no | no | no | no |
576
785
  | Retry-After header | yes | no | no | partial | partial |
577
786
  | Backoff propagation | yes | no | no | no | no |
787
+ | OpenTelemetry | yes | no | no | no | partial |
788
+ | CLI audit | yes | no | no | no | no |
578
789
  | Zero runtime deps | yes | no | yes | — | no |
579
790
  | Provider-agnostic | yes | yes | yes | no | no |
580
791
 
581
- **bottleneck** is excellent general-purpose rate limiting, but knows nothing about AI APIs: no model limits, no token counting, no cost tracking. You'd need to configure it per-model manually and rebuild the cost system yourself.
792
+ **bottleneck**: Excellent general-purpose rate limiting, but it knows nothing about AI APIs. No model limits, no token counting, no cost tracking. You'd need to configure it per model manually and rebuild the cost system yourself.
582
793
 
583
- **p-limit** controls concurrency, not rate. It doesn't throttle to N requests per minute, it throttles to N concurrent requests. Useful but completely different problem.
794
+ **p-limit**: Controls concurrency, not rate. It limits you to N concurrent requests, not N requests per minute. A different problem.
584
795
 
585
- **SDK built-in retry** retries on 429 with backoff. That's the floor, not the ceiling. It doesn't queue, doesn't prioritize, doesn't track cost, and doesn't propagate backoff to other in-flight requests.
796
+ **SDK built-in retry**: Retries on 429 with backoff. That's the floor, not the ceiling. No queuing, no priority, no cost tracking, no backoff propagation to other in-flight requests.
586
797
 
587
798
  ---
588
799
 
589
800
  ## TypeScript
590
801
 
591
- Fully typed. All configuration options, events, errors, and report shapes have precise TypeScript definitions.
802
+ Fully typed. All configuration options, events, errors, and report shapes have precise TypeScript definitions exported from the main entry point.
592
803
 
593
804
  ```typescript
594
805
  import type {
595
806
  RateLimiterConfig,
596
807
  CostReport,
808
+ EventMap,
597
809
  QueuedEvent,
598
810
  Priority,
811
+ ModelLimits,
599
812
  } from 'ai-sdk-rate-limiter'
600
-
601
- function handleQueuedRequest(event: QueuedEvent) {
602
- // event.model, event.priority, event.queueDepth, event.estimatedWaitMs
603
- // all typed, all autocompleted
604
- }
605
813
  ```
606
814
 
607
815
  ---
608
816
 
817
+ ## Examples
818
+
819
+ A full Next.js 15 App Router example is included at [`examples/nextjs/`](./examples/nextjs/). It demonstrates streaming chat with rate limiting, live cost display, and proper error handling for budget and rate limit errors.
820
+
821
+ ---
822
+
609
823
  ## Requirements
610
824
 
611
825
  - Node.js 18+ / Bun / Deno