ai-sdk-rate-limiter 0.4.0 → 0.6.0
- package/README.md +415 -154
- package/dist/cli.js +1029 -0
- package/dist/index.cjs +206 -142
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +3 -320
- package/dist/index.d.ts +3 -320
- package/dist/index.js +206 -142
- package/dist/index.js.map +1 -1
- package/dist/otel.cjs +75 -0
- package/dist/otel.cjs.map +1 -0
- package/dist/otel.d.cts +63 -0
- package/dist/otel.d.ts +63 -0
- package/dist/otel.js +72 -0
- package/dist/otel.js.map +1 -0
- package/dist/redis.cjs +209 -0
- package/dist/redis.cjs.map +1 -0
- package/dist/redis.d.cts +54 -0
- package/dist/redis.d.ts +54 -0
- package/dist/redis.js +207 -0
- package/dist/redis.js.map +1 -0
- package/dist/types-CgePLtmQ.d.cts +385 -0
- package/dist/types-CgePLtmQ.d.ts +385 -0
- package/package.json +33 -2
package/README.md
CHANGED
|
@@ -56,9 +56,152 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
|
|
|
56
56
|
|
|
57
57
|
**Raw SDK support** — Works with the native OpenAI, Anthropic, Groq, Mistral, and Cohere SDKs directly via a transparent JavaScript Proxy. No Vercel AI SDK required.
|
|
58
58
|
|
|
59
|
+
**OpenTelemetry** — Drop-in OTel plugin that emits GenAI-spec spans for every request. Works with any OTel-compatible tracer.
|
|
60
|
+
|
|
61
|
+
**CLI audit** — `npx ai-sdk-rate-limiter audit` probes your API keys to detect your actual tier limits and generates a ready-to-paste config override.
|
|
62
|
+
|
|
63
|
+
---
|
|
64
|
+
|
|
65
|
+
## Contents
|
|
66
|
+
|
|
67
|
+
- [Vercel AI SDK usage](#vercel-ai-sdk-usage)
|
|
68
|
+
- [Raw SDK proxy](#raw-sdk-proxy)
|
|
69
|
+
- [Configuration reference](#configuration-reference)
|
|
70
|
+
- [Per-request options](#per-request-options)
|
|
71
|
+
- [Cost tracking](#cost-tracking)
|
|
72
|
+
- [Budget fallback routing](#budget-fallback-routing)
|
|
73
|
+
- [Multi-instance Redis store](#multi-instance-redis-store)
|
|
74
|
+
- [Events](#events)
|
|
75
|
+
- [Backpressure](#backpressure)
|
|
76
|
+
- [Error handling](#error-handling)
|
|
77
|
+
- [OpenTelemetry](#opentelemetry)
|
|
78
|
+
- [CLI audit](#cli-audit)
|
|
79
|
+
- [Model registry](#model-registry)
|
|
80
|
+
- [Advanced usage](#advanced-usage)
|
|
81
|
+
- [How it works](#how-it-works)
|
|
82
|
+
- [Comparison](#comparison)
|
|
83
|
+
- [TypeScript](#typescript)
|
|
84
|
+
- [Requirements](#requirements)
|
|
85
|
+
|
|
86
|
+
---
|
|
87
|
+
|
|
88
|
+
## Vercel AI SDK usage
|
|
89
|
+
|
|
90
|
+
### Basic wrap
|
|
91
|
+
|
|
92
|
+
```typescript
|
|
93
|
+
import { createRateLimiter } from 'ai-sdk-rate-limiter'
|
|
94
|
+
import { openai } from '@ai-sdk/openai'
|
|
95
|
+
import { generateText, streamText } from 'ai'
|
|
96
|
+
|
|
97
|
+
const limiter = createRateLimiter()
|
|
98
|
+
const model = limiter.wrap(openai('gpt-4o'))
|
|
99
|
+
|
|
100
|
+
// generateText
|
|
101
|
+
const { text } = await generateText({ model, prompt: 'Summarize this...' })
|
|
102
|
+
|
|
103
|
+
// streamText — streaming is first-class, rate limit slot consumed at request start
|
|
104
|
+
const result = streamText({ model, messages })
|
|
105
|
+
for await (const chunk of result.textStream) {
|
|
106
|
+
process.stdout.write(chunk)
|
|
107
|
+
}
|
|
108
|
+
```
|
|
109
|
+
|
|
110
|
+
### Using the raw middleware
|
|
111
|
+
|
|
112
|
+
If you use `wrapLanguageModel` directly or need to compose middleware:
|
|
113
|
+
|
|
114
|
+
```typescript
|
|
115
|
+
import { wrapLanguageModel } from 'ai'
|
|
116
|
+
|
|
117
|
+
// Single middleware
|
|
118
|
+
const model = wrapLanguageModel({
|
|
119
|
+
model: openai('gpt-4o'),
|
|
120
|
+
middleware: limiter.middleware,
|
|
121
|
+
})
|
|
122
|
+
|
|
123
|
+
// Composed with other middleware
|
|
124
|
+
const composedModel = wrapLanguageModel({
|
|
125
|
+
model: openai('gpt-4o'),
|
|
126
|
+
middleware: [loggingMiddleware, limiter.middleware, cachingMiddleware],
|
|
127
|
+
})
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
---
|
|
131
|
+
|
|
132
|
+
## Raw SDK proxy
|
|
133
|
+
|
|
134
|
+
If you're using the OpenAI, Anthropic, Groq, Mistral, or Cohere SDK directly — without the Vercel AI SDK — use `limiter.rawProxy()` to add rate limiting as a transparent drop-in:
|
|
135
|
+
|
|
136
|
+
```typescript
|
|
137
|
+
import { createRateLimiter } from 'ai-sdk-rate-limiter'
|
|
138
|
+
import OpenAI from 'openai'
|
|
139
|
+
import Anthropic from '@anthropic-ai/sdk'
|
|
140
|
+
|
|
141
|
+
const limiter = createRateLimiter({
|
|
142
|
+
cost: { budget: { daily: 50 }, onExceeded: 'throw' },
|
|
143
|
+
on: { rateLimited: ({ model }) => console.warn(`${model} rate limited`) },
|
|
144
|
+
})
|
|
145
|
+
|
|
146
|
+
// Every API call goes through the same rate limiter and cost tracker
|
|
147
|
+
const openai = limiter.rawProxy(new OpenAI())
|
|
148
|
+
const anthropic = limiter.rawProxy(new Anthropic())
|
|
149
|
+
|
|
150
|
+
// Use exactly as before — no other changes needed
|
|
151
|
+
const completion = await openai.chat.completions.create({
|
|
152
|
+
model: 'gpt-4o',
|
|
153
|
+
messages: [{ role: 'user', content: 'Hello!' }],
|
|
154
|
+
})
|
|
155
|
+
|
|
156
|
+
const message = await anthropic.messages.create({
|
|
157
|
+
model: 'claude-opus-4-6',
|
|
158
|
+
max_tokens: 1024,
|
|
159
|
+
messages: [{ role: 'user', content: 'Hello!' }],
|
|
160
|
+
})
|
|
161
|
+
|
|
162
|
+
// Cost from both clients tracked together
|
|
163
|
+
const report = limiter.getCostReport()
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
**Streaming** — the proxy wraps the returned `AsyncIterable` to capture the final usage chunk automatically:
|
|
167
|
+
|
|
168
|
+
```typescript
|
|
169
|
+
const stream = await openai.chat.completions.create({
|
|
170
|
+
model: 'gpt-4o',
|
|
171
|
+
messages: [{ role: 'user', content: 'Stream this' }],
|
|
172
|
+
stream: true,
|
|
173
|
+
stream_options: { include_usage: true }, // OpenAI: include usage in final chunk
|
|
174
|
+
})
|
|
175
|
+
|
|
176
|
+
for await (const chunk of stream) {
|
|
177
|
+
process.stdout.write(chunk.choices[0]?.delta?.content ?? '')
|
|
178
|
+
}
|
|
179
|
+
// After the loop, tokens are recorded in limiter.getCostReport()
|
|
180
|
+
```
|
|
181
|
+
|
|
182
|
+
**Standalone — no shared limiter needed:**
|
|
183
|
+
|
|
184
|
+
```typescript
|
|
185
|
+
import { rateLimited } from 'ai-sdk-rate-limiter'
|
|
186
|
+
|
|
187
|
+
const openai = rateLimited(new OpenAI(), {
|
|
188
|
+
config: { cost: { budget: { daily: 20 } } },
|
|
189
|
+
})
|
|
190
|
+
```
|
|
191
|
+
|
|
192
|
+
**Override auto-detected provider** — useful for OpenAI-compatible endpoints:
|
|
193
|
+
|
|
194
|
+
```typescript
|
|
195
|
+
const client = limiter.rawProxy(new OpenAI({ baseURL: 'https://api.groq.com/openai/v1' }), {
|
|
196
|
+
provider: 'groq', // use Groq's limits and pricing instead of OpenAI's
|
|
197
|
+
})
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
Provider is auto-detected from the client's constructor name (`OpenAI` → `openai`, `Anthropic` → `anthropic`, `Groq` → `groq`, etc.).
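
As a rough illustration of constructor-name detection, the sketch below maps a client instance to a provider id and lets an explicit override win. `detectProvider` and `KNOWN_PROVIDERS` are hypothetical names made up for this example; the package's real detection logic may differ.

```typescript
// Hypothetical sketch, not the package's actual implementation.
type Provider = 'openai' | 'anthropic' | 'groq' | 'mistral' | 'cohere'

const KNOWN_PROVIDERS: readonly Provider[] = ['openai', 'anthropic', 'groq', 'mistral', 'cohere']

function detectProvider(client: object, override?: Provider): Provider | undefined {
  // An explicit override always wins (e.g. an OpenAI client pointed at Groq).
  if (override) return override
  // Otherwise compare the lowercased constructor name against known providers.
  const name = client.constructor.name.toLowerCase()
  return KNOWN_PROVIDERS.find((p) => p === name)
}
```

Constructor names can be mangled by minifiers, which is one more situation where passing `provider` explicitly is the safer choice.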
|
|
201
|
+
|
|
59
202
|
---
|
|
60
203
|
|
|
61
|
-
## Configuration
|
|
204
|
+
## Configuration reference
|
|
62
205
|
|
|
63
206
|
Everything has a sensible default. Override only what you need.
|
|
64
207
|
|
|
@@ -67,17 +210,15 @@ const limiter = createRateLimiter({
|
|
|
67
210
|
// Override or extend built-in model limits for your API tier
|
|
68
211
|
limits: {
|
|
69
212
|
'gpt-4o': { rpm: 500, itpm: 30_000 },
|
|
70
|
-
'claude-opus-4-6': { rpm: 50, itpm:
|
|
71
|
-
// Wildcard: apply to all models from a provider
|
|
72
|
-
'openai/*': { rpm: 1000 },
|
|
213
|
+
'claude-opus-4-6': { rpm: 50, itpm: 20_000 },
|
|
73
214
|
},
|
|
74
215
|
|
|
75
216
|
// Cost budgets and behavior when exceeded
|
|
76
217
|
cost: {
|
|
77
218
|
budget: {
|
|
78
|
-
hourly: 5, // USD
|
|
79
|
-
daily: 50,
|
|
80
|
-
monthly: 500,
|
|
219
|
+
hourly: 5, // USD — hard cap per hour
|
|
220
|
+
daily: 50, // USD — hard cap per day
|
|
221
|
+
monthly: 500, // USD — hard cap per month
|
|
81
222
|
},
|
|
82
223
|
onExceeded: 'throw', // 'throw' | 'queue' | 'fallback'
|
|
83
224
|
},
|
|
@@ -100,14 +241,14 @@ const limiter = createRateLimiter({
|
|
|
100
241
|
retryOn: [429, 500, 502, 503, 504],
|
|
101
242
|
},
|
|
102
243
|
|
|
103
|
-
// Observability
|
|
244
|
+
// Observability — see Events section for all available events
|
|
104
245
|
on: {
|
|
105
246
|
rateLimited: ({ model, source, resetAt }) =>
|
|
106
247
|
console.warn(`${model} rate limited (${source}), resets ${new Date(resetAt).toISOString()}`),
|
|
107
248
|
retrying: ({ model, attempt, delayMs }) =>
|
|
108
249
|
console.log(`${model} retry ${attempt} in ${delayMs}ms`),
|
|
109
250
|
budgetHit: ({ model, currentCostUsd, limitUsd, period }) =>
|
|
110
|
-
alerts.send(`${model} hit $${limitUsd} ${period} budget
|
|
251
|
+
alerts.send(`${model} hit $${limitUsd} ${period} budget`),
|
|
111
252
|
completed: ({ model, inputTokens, outputTokens, costUsd, latencyMs }) =>
|
|
112
253
|
metrics.record({ model, inputTokens, outputTokens, costUsd, latencyMs }),
|
|
113
254
|
},
|
|
@@ -123,14 +264,14 @@ Pass options to individual requests via `providerOptions.rateLimiter`:
|
|
|
123
264
|
```typescript
|
|
124
265
|
import { generateText } from 'ai'
|
|
125
266
|
|
|
126
|
-
// High-priority
|
|
267
|
+
// High-priority — skips ahead of normal traffic in the queue
|
|
127
268
|
await generateText({
|
|
128
269
|
model,
|
|
129
270
|
prompt: 'Urgent user request...',
|
|
130
271
|
providerOptions: {
|
|
131
272
|
rateLimiter: {
|
|
132
|
-
priority: 'high',
|
|
133
|
-
timeout: 10_000,
|
|
273
|
+
priority: 'high', // 'high' | 'normal' | 'low'
|
|
274
|
+
timeout: 10_000, // override the default queue timeout for this request
|
|
134
275
|
},
|
|
135
276
|
},
|
|
136
277
|
})
|
|
@@ -145,7 +286,7 @@ await generateText({
|
|
|
145
286
|
})
|
|
146
287
|
```
|
|
147
288
|
|
|
148
|
-
This is the
|
|
289
|
+
This is the recommended way to colocate user requests and background jobs on the same model without background jobs starving users.
|
|
149
290
|
|
|
150
291
|
---
|
|
151
292
|
|
|
@@ -157,7 +298,7 @@ const report = limiter.getCostReport()
|
|
|
157
298
|
|
|
158
299
|
console.log(report)
|
|
159
300
|
// {
|
|
160
|
-
// hour: { requests: 42, inputTokens: 84_000, outputTokens: 21_000,
|
|
301
|
+
// hour: { requests: 42, inputTokens: 84_000, outputTokens: 21_000, costUsd: 0.29 },
|
|
161
302
|
// day: { requests: 318, inputTokens: 620_000, outputTokens: 155_000, costUsd: 2.11 },
|
|
162
303
|
// month: { requests: 4821, inputTokens: 9_100_000, outputTokens: 2_200_000, costUsd: 34.80 },
|
|
163
304
|
// byModel: {
|
|
@@ -173,17 +314,17 @@ Costs are based on **actual token counts** from API responses — not estimates.
|
|
|
173
314
|
|
|
174
315
|
## Budget fallback routing
|
|
175
316
|
|
|
176
|
-
When a budget limit is hit, you can transparently reroute to a cheaper model instead of throwing an error
|
|
317
|
+
When a budget limit is hit, you can transparently reroute to a cheaper model instead of throwing an error:
|
|
177
318
|
|
|
178
319
|
```typescript
|
|
179
320
|
const limiter = createRateLimiter({
|
|
180
321
|
cost: {
|
|
181
322
|
budget: { daily: 10 },
|
|
182
|
-
onExceeded: 'fallback',
|
|
323
|
+
onExceeded: 'fallback', // reroute to fallback instead of throwing
|
|
183
324
|
},
|
|
184
325
|
on: {
|
|
185
|
-
budgetHit: ({ model,
|
|
186
|
-
console.warn(`${model}
|
|
326
|
+
budgetHit: ({ model, usingFallback }) =>
|
|
327
|
+
console.warn(`${model} budget hit — ${usingFallback ? 'using fallback' : 'throwing'}`),
|
|
187
328
|
},
|
|
188
329
|
})
|
|
189
330
|
|
|
@@ -192,17 +333,11 @@ const model = limiter.wrap(
|
|
|
192
333
|
{ fallback: openai('gpt-4o-mini') }, // used when budget is exceeded
|
|
193
334
|
)
|
|
194
335
|
|
|
195
|
-
// Under budget
|
|
196
|
-
// Over $10/day
|
|
336
|
+
// Under budget → uses gpt-4o normally
|
|
337
|
+
// Over $10/day → silently switches to gpt-4o-mini, no code changes needed
|
|
197
338
|
const result = await generateText({ model, prompt })
|
|
198
339
|
```
|
|
199
340
|
|
|
200
|
-
**How it works:**
|
|
201
|
-
1. The budget is checked before every request against total rolling spend
|
|
202
|
-
2. When exceeded, `BudgetExceededError` is caught inside `wrap()` before it reaches your code
|
|
203
|
-
3. The request is re-executed against the fallback model, bypassing the budget pre-check
|
|
204
|
-
4. Fallback usage is tracked under the fallback model's ID in `getCostReport()`
|
|
205
|
-
|
|
206
341
|
**Behavior matrix:**
|
|
207
342
|
|
|
208
343
|
| `onExceeded` | `fallback` configured | Outcome |
|
|
@@ -212,91 +347,47 @@ const result = await generateText({ model, prompt })
|
|
|
212
347
|
| `'fallback'` | no | Throws `BudgetExceededError` |
|
|
213
348
|
| `'queue'` | any | Queues until period resets |
|
|
214
349
|
|
|
215
|
-
|
|
216
|
-
|
|
217
|
-
## Raw SDK proxy
|
|
218
|
-
|
|
219
|
-
If you're using the OpenAI, Anthropic, Groq, Mistral, or Cohere SDK directly — without the Vercel AI SDK — use `limiter.rawProxy()` to add rate limiting as a transparent drop-in:
|
|
220
|
-
|
|
221
|
-
```typescript
|
|
222
|
-
import { createRateLimiter } from 'ai-sdk-rate-limiter'
|
|
223
|
-
import OpenAI from 'openai'
|
|
224
|
-
import Anthropic from '@anthropic-ai/sdk'
|
|
225
|
-
|
|
226
|
-
const limiter = createRateLimiter({
|
|
227
|
-
cost: { budget: { daily: 50 }, onExceeded: 'throw' },
|
|
228
|
-
on: { rateLimited: ({ model }) => console.warn(`${model} rate limited`) },
|
|
229
|
-
})
|
|
350
|
+
Fallback usage is tracked under the fallback model's ID in `getCostReport()`.
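
The behavior matrix can be read as a small decision function. The sketch below is an assumption-laden restatement: `budgetOutcome` and the `BudgetOutcome` labels are hypothetical, and the `'throw'` row is inferred from the error-handling notes rather than taken from the package source.

```typescript
// Hypothetical sketch of the decision described by the behavior matrix.
type OnExceeded = 'throw' | 'queue' | 'fallback'
type BudgetOutcome = 'throw' | 'queue' | 'use-fallback'

function budgetOutcome(onExceeded: OnExceeded, hasFallback: boolean): BudgetOutcome {
  // 'queue' waits for the budget period to reset, whether or not a fallback exists.
  if (onExceeded === 'queue') return 'queue'
  // 'fallback' only reroutes when a fallback model was passed to wrap().
  if (onExceeded === 'fallback' && hasFallback) return 'use-fallback'
  // Everything else surfaces BudgetExceededError to the caller.
  return 'throw'
}
```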
|
|
230
351
|
|
|
231
|
-
|
|
232
|
-
const openai = limiter.rawProxy(new OpenAI())
|
|
233
|
-
const anthropic = limiter.rawProxy(new Anthropic())
|
|
352
|
+
---
|
|
234
353
|
|
|
235
|
-
|
|
236
|
-
const completion = await openai.chat.completions.create({
|
|
237
|
-
model: 'gpt-4o',
|
|
238
|
-
messages: [{ role: 'user', content: 'Hello!' }],
|
|
239
|
-
})
|
|
354
|
+
## Multi-instance Redis store
|
|
240
355
|
|
|
241
|
-
|
|
242
|
-
model: 'claude-opus-4-6',
|
|
243
|
-
max_tokens: 1024,
|
|
244
|
-
messages: [{ role: 'user', content: 'Hello!' }],
|
|
245
|
-
})
|
|
356
|
+
By default, rate limit state is in-memory (per-process). For multi-instance deployments — multiple pods, serverless replicas, workers — each instance has its own counters. Install the Redis store to share state:
|
|
246
357
|
|
|
247
|
-
// Cost from both clients tracked together
|
|
248
|
-
const report = limiter.getCostReport()
|
|
249
358
|
```
|
|
250
|
-
|
|
251
|
-
**Streaming works too** — the proxy wraps the returned `AsyncIterable` to capture the final usage chunk automatically:
|
|
252
|
-
|
|
253
|
-
```typescript
|
|
254
|
-
const stream = await openai.chat.completions.create({
|
|
255
|
-
model: 'gpt-4o',
|
|
256
|
-
messages: [{ role: 'user', content: 'Stream this' }],
|
|
257
|
-
stream: true,
|
|
258
|
-
stream_options: { include_usage: true }, // tells OpenAI to include usage in final chunk
|
|
259
|
-
})
|
|
260
|
-
|
|
261
|
-
for await (const chunk of stream) {
|
|
262
|
-
process.stdout.write(chunk.choices[0]?.delta?.content ?? '')
|
|
263
|
-
}
|
|
264
|
-
// After the loop, usage is recorded in limiter.getCostReport()
|
|
359
|
+
npm install ioredis
|
|
265
360
|
```
|
|
266
361
|
|
|
267
|
-
**Zero-config standalone version** — if you don't need to share the limiter with other models:
|
|
268
|
-
|
|
269
362
|
```typescript
|
|
270
|
-
import {
|
|
363
|
+
import { createRateLimiter } from 'ai-sdk-rate-limiter'
|
|
364
|
+
import { RedisStore } from 'ai-sdk-rate-limiter/redis'
|
|
365
|
+
import Redis from 'ioredis'
|
|
271
366
|
|
|
272
|
-
const
|
|
273
|
-
|
|
367
|
+
const limiter = createRateLimiter({
|
|
368
|
+
store: new RedisStore(new Redis(process.env.REDIS_URL)),
|
|
369
|
+
// ... rest of your config unchanged
|
|
274
370
|
})
|
|
275
371
|
```
|
|
276
372
|
|
|
277
|
-
|
|
373
|
+
That's the entire change. All APIs — `wrap()`, `rawProxy()`, events, cost reports — work identically. The Redis store enforces rate limits collectively so no two instances can jointly exceed the API limits.
|
|
374
|
+
|
|
375
|
+
**Options:**
|
|
278
376
|
|
|
279
377
|
```typescript
|
|
280
|
-
|
|
281
|
-
|
|
378
|
+
new RedisStore(redis, {
|
|
379
|
+
keyPrefix: 'rl:myapp:', // namespace if multiple apps share one Redis instance
|
|
380
|
+
windowMs: 60_000, // window size in ms; match your provider's limit window
|
|
282
381
|
})
|
|
283
382
|
```
|
|
284
383
|
|
|
285
|
-
|
|
286
|
-
|
|
287
|
-
## Backpressure — know before you send
|
|
384
|
+
**How it works internally:**
|
|
288
385
|
|
|
289
|
-
|
|
290
|
-
|
|
291
|
-
```typescript
|
|
292
|
-
const waitMs = limiter.estimatedWait('gpt-4o')
|
|
386
|
+
Each request runs a Lua script atomically that: removes stale entries from a sorted set, counts requests and tokens in the current window, checks against RPM and ITPM limits, and either reserves the slot or returns when the next slot opens. The local queue (priority ordering, drain timer, timeout handling) stays in-memory per instance — only the window counters are shared via Redis.
|
|
293
387
|
|
|
294
|
-
|
|
295
|
-
return res.status(503).json({ error: 'Model is busy, try again shortly', retryAfterMs: waitMs })
|
|
296
|
-
}
|
|
388
|
+
**Compatible clients** — any client with `eval()`, `get()`, and `set()` works: `ioredis`, `node-redis`, Upstash Redis.
|
|
297
389
|
|
|
298
|
-
|
|
299
|
-
```
|
|
390
|
+
Use the default `InMemoryStore` for single-instance deployments — it's more accurate (true sliding window, no network round-trips) and zero-config. Only switch to `RedisStore` when you actually need cross-instance coordination.
|
|
300
391
|
|
|
301
392
|
---
|
|
302
393
|
|
|
@@ -320,24 +411,46 @@ limiter.off('queued', handler)
|
|
|
320
411
|
|
|
321
412
|
| Event | When | Key fields |
|
|
322
413
|
|---|---|---|
|
|
323
|
-
| `queued` | Request enters the queue | `model`, `priority`, `queueDepth`, `estimatedWaitMs` |
|
|
324
|
-
| `dequeued` | Request leaves the queue | `model`, `waitedMs`, `priority` |
|
|
325
|
-
| `retrying` | A failed request is about to retry | `model`, `attempt`, `maxAttempts`, `delayMs`, `error` |
|
|
326
|
-
| `rateLimited` | Limit hit (local or remote 429) | `model`, `source`, `limitType`, `resetAt` |
|
|
327
|
-
| `budgetHit` | Cost budget exceeded | `model`, `currentCostUsd`, `limitUsd`, `period`, `usingFallback` |
|
|
328
|
-
| `dropped` | Request rejected (queue full or timeout) | `model`, `reason` |
|
|
329
|
-
| `completed` | Request finished successfully | `model`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs` |
|
|
414
|
+
| `queued` | Request enters the queue | `model`, `provider`, `priority`, `queueDepth`, `estimatedWaitMs` |
|
|
415
|
+
| `dequeued` | Request leaves the queue | `model`, `provider`, `waitedMs`, `priority` |
|
|
416
|
+
| `retrying` | A failed request is about to retry | `model`, `provider`, `attempt`, `maxAttempts`, `delayMs`, `error` |
|
|
417
|
+
| `rateLimited` | Limit hit (local or remote 429) | `model`, `provider`, `source`, `limitType`, `resetAt` |
|
|
418
|
+
| `budgetHit` | Cost budget exceeded | `model`, `provider`, `currentCostUsd`, `limitUsd`, `period`, `usingFallback` |
|
|
419
|
+
| `dropped` | Request rejected (queue full or timeout) | `model`, `provider`, `reason` |
|
|
420
|
+
| `completed` | Request finished successfully | `model`, `provider`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs`, `streaming` |
|
|
330
421
|
|
|
331
|
-
The `source` on `rateLimited` distinguishes between requests we blocked locally (`'local'`) vs. requests the API rejected with a 429 (`'remote'`). Local blocks are free
|
|
422
|
+
The `source` on `rateLimited` distinguishes between requests we blocked locally (`'local'`) vs. requests the API rejected with a 429 (`'remote'`). Local blocks are expected and free. Frequent remote blocks mean your configured limits are too high for your tier — run `npx ai-sdk-rate-limiter audit` to get accurate numbers.
|
|
423
|
+
|
|
424
|
+
---
|
|
425
|
+
|
|
426
|
+
## Backpressure
|
|
427
|
+
|
|
428
|
+
Check estimated wait time before committing to a request. Useful for showing loading states or shedding load gracefully:
|
|
429
|
+
|
|
430
|
+
```typescript
|
|
431
|
+
const waitMs = await limiter.estimatedWait('gpt-4o')
|
|
432
|
+
|
|
433
|
+
if (waitMs > 5_000) {
|
|
434
|
+
return res.status(503).json({
|
|
435
|
+
error: 'Model busy, try again shortly',
|
|
436
|
+
retryAfterMs: waitMs,
|
|
437
|
+
})
|
|
438
|
+
}
|
|
439
|
+
|
|
440
|
+
const result = await generateText({ model, prompt })
|
|
441
|
+
```
|
|
442
|
+
|
|
443
|
+
Returns `0` if the model would proceed immediately.
|
|
332
444
|
|
|
333
445
|
---
|
|
334
446
|
|
|
335
447
|
## Error handling
|
|
336
448
|
|
|
337
|
-
Every error is typed
|
|
449
|
+
Every error is typed and carries structured context:
|
|
338
450
|
|
|
339
451
|
```typescript
|
|
340
452
|
import {
|
|
453
|
+
RateLimitExceededError,
|
|
341
454
|
QueueTimeoutError,
|
|
342
455
|
QueueFullError,
|
|
343
456
|
BudgetExceededError,
|
|
@@ -349,77 +462,163 @@ try {
|
|
|
349
462
|
const result = await generateText({ model, prompt })
|
|
350
463
|
} catch (error) {
|
|
351
464
|
if (error instanceof QueueTimeoutError) {
|
|
352
|
-
//
|
|
465
|
+
// Request waited in queue longer than queue.timeout
|
|
353
466
|
console.error(`Timed out after ${error.waitedMs}ms (queue depth: ${error.queueDepth})`)
|
|
354
467
|
} else if (error instanceof BudgetExceededError) {
|
|
355
|
-
//
|
|
468
|
+
// Cost budget hit and onExceeded is 'throw' or no fallback configured
|
|
356
469
|
console.error(`Budget exceeded: $${error.currentCostUsd} of $${error.limitUsd} ${error.period}`)
|
|
357
470
|
} else if (error instanceof RetryExhaustedError) {
|
|
358
|
-
//
|
|
471
|
+
// All retry attempts failed
|
|
359
472
|
console.error(`All ${error.attempts} retries exhausted`, error.cause)
|
|
360
473
|
} else if (error instanceof QueueFullError) {
|
|
361
|
-
//
|
|
362
|
-
console.error(`Queue full at ${error.maxSize} requests`)
|
|
474
|
+
// Queue at capacity and onFull is 'throw'
|
|
475
|
+
console.error(`Queue full at ${error.maxSize} requests for ${error.model}`)
|
|
476
|
+
} else if (error instanceof RateLimitExceededError) {
|
|
477
|
+
// Rate limit hit and the request could not be queued
|
|
478
|
+
console.error(`${error.model} ${error.limitType} limit of ${error.limit} exceeded`)
|
|
363
479
|
}
|
|
364
480
|
}
|
|
365
481
|
```
|
|
366
482
|
|
|
367
|
-
All errors extend `RateLimiterError`, so a single `instanceof RateLimiterError` check separates rate-
|
|
483
|
+
All errors extend `RateLimiterError`, so a single `instanceof RateLimiterError` check separates rate-limiter failures from AI API errors.
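
A minimal sketch of why the shared base class is convenient. The classes here are local stand-ins written for this example (the real ones are exported by the package and may carry more fields), and `classify` is a hypothetical helper.

```typescript
// Stand-in classes for illustration only; import the real ones from the package.
class RateLimiterError extends Error {
  constructor(message: string) {
    super(message)
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

class QueueTimeoutError extends RateLimiterError {
  readonly model: string
  readonly waitedMs: number
  constructor(model: string, waitedMs: number) {
    super(`Queue timeout for ${model} after ${waitedMs}ms`)
    this.model = model
    this.waitedMs = waitedMs
  }
}

// One check separates limiter failures from everything else (e.g. provider errors).
function classify(error: unknown): 'limiter' | 'upstream' {
  return error instanceof RateLimiterError ? 'limiter' : 'upstream'
}
```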
|
|
484
|
+
|
|
485
|
+
**Error fields:**
|
|
486
|
+
|
|
487
|
+
| Error | Fields |
|
|
488
|
+
|---|---|
|
|
489
|
+
| `QueueTimeoutError` | `model`, `waitedMs`, `queueDepth` |
|
|
490
|
+
| `BudgetExceededError` | `model`, `currentCostUsd`, `limitUsd`, `period` |
|
|
491
|
+
| `RetryExhaustedError` | `model`, `attempts`, `cause` |
|
|
492
|
+
| `QueueFullError` | `model`, `maxSize` |
|
|
493
|
+
| `RateLimitExceededError` | `model`, `limitType`, `limit`, `resetAt` |
|
|
368
494
|
|
|
369
495
|
---
|
|
370
496
|
|
|
371
|
-
##
|
|
497
|
+
## OpenTelemetry
|
|
372
498
|
|
|
373
|
-
|
|
499
|
+
The `ai-sdk-rate-limiter/otel` entry point provides a plugin that emits OpenTelemetry spans for every AI request. No hard dependency on `@opentelemetry/api` — the plugin accepts any OTel-compatible tracer via structural typing.
|
|
374
500
|
|
|
375
501
|
```typescript
|
|
376
|
-
import {
|
|
502
|
+
import { trace } from '@opentelemetry/api'
|
|
503
|
+
import { createRateLimiter } from 'ai-sdk-rate-limiter'
|
|
504
|
+
import { createOtelPlugin } from 'ai-sdk-rate-limiter/otel'
|
|
377
505
|
|
|
378
|
-
const
|
|
379
|
-
|
|
380
|
-
middleware: limiter.middleware,
|
|
506
|
+
const limiter = createRateLimiter({
|
|
507
|
+
on: createOtelPlugin(trace.getTracer('my-service')),
|
|
381
508
|
})
|
|
382
509
|
```
|
|
383
510
|
|
|
384
|
-
|
|
511
|
+
**Spans emitted:**
|
|
512
|
+
|
|
513
|
+
| Span name | When | Status |
|
|
514
|
+
|---|---|---|
|
|
515
|
+
| `gen_ai.request` | Every completed request | OK |
|
|
516
|
+
| `gen_ai.request` | Every dropped request | ERROR |
|
|
517
|
+
| `ai_rate_limiter.retry` | Each retry attempt | OK |
|
|
518
|
+
| `ai_rate_limiter.budget_hit` | Budget exceeded | ERROR |
|
|
519
|
+
|
|
520
|
+
**Attributes on `gen_ai.request` (completed):**
|
|
521
|
+
|
|
522
|
+
| Attribute | Value |
|
|
523
|
+
|---|---|
|
|
524
|
+
| `gen_ai.system` | Provider name (e.g. `openai`, `anthropic`) |
|
|
525
|
+
| `gen_ai.request.model` | Model ID |
|
|
526
|
+
| `gen_ai.usage.input_tokens` | Actual input tokens from API response |
|
|
527
|
+
| `gen_ai.usage.output_tokens` | Actual output tokens from API response |
|
|
528
|
+
| `ai_rate_limiter.cost_usd` | Cost in USD for this request |
|
|
529
|
+
| `ai_rate_limiter.latency_ms` | Total latency including queue wait |
|
|
530
|
+
| `ai_rate_limiter.streaming` | Whether the request used streaming |
|
|
531
|
+
|
|
532
|
+
Attribute names follow the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/). The `gen_ai.request` span duration is reconstructed from `latencyMs` so it reflects the full wall-clock time of the request, including any queue wait.
|
|
533
|
+
|
|
534
|
+
**Custom tracer interface** — if you don't want to install `@opentelemetry/api`, implement the `OtelTracer` interface directly:
|
|
385
535
|
|
|
386
536
|
```typescript
|
|
387
|
-
|
|
388
|
-
|
|
389
|
-
|
|
537
|
+
import { createOtelPlugin, type OtelTracer } from 'ai-sdk-rate-limiter/otel'
|
|
538
|
+
|
|
539
|
+
const tracer: OtelTracer = {
|
|
540
|
+
startSpan(name, options) {
|
|
541
|
+
// return any object that implements OtelSpan
|
|
542
|
+
},
|
|
543
|
+
}
|
|
544
|
+
|
|
545
|
+
const limiter = createRateLimiter({
|
|
546
|
+
on: createOtelPlugin(tracer),
|
|
390
547
|
})
|
|
391
548
|
```
|
|
392
549
|
|
|
393
550
|
---
|
|
394
551
|
|
|
395
|
-
##
|
|
552
|
+
## CLI audit
|
|
396
553
|
|
|
397
|
-
|
|
554
|
+
Probe your API keys to discover your actual rate limit tier and generate a ready-to-paste config override:
|
|
398
555
|
|
|
399
|
-
```
|
|
400
|
-
|
|
401
|
-
|
|
402
|
-
cost: { budget: { daily: 0.10 }, onExceeded: 'throw' },
|
|
403
|
-
queue: { timeout: 5_000 },
|
|
404
|
-
})
|
|
556
|
+
```
|
|
557
|
+
npx ai-sdk-rate-limiter audit
|
|
558
|
+
```
|
|
405
559
|
|
|
406
|
-
|
|
407
|
-
|
|
408
|
-
|
|
409
|
-
|
|
410
|
-
|
|
560
|
+
```
|
|
561
|
+
────────────────────────────────────────────────────────────────────────────────
|
|
562
|
+
ai-sdk-rate-limiter audit
|
|
563
|
+
────────────────────────────────────────────────────────────────────────────────
|
|
564
|
+
|
|
565
|
+
OPENAI (OPENAI_API_KEY)
|
|
566
|
+
Model RPM TPM Registry
|
|
567
|
+
──────────────────────────────────────────────────────────────────────────────
|
|
568
|
+
gpt-4o 10000 2,000,000 (registry: 500 RPM / 30,000 TPM)
|
|
569
|
+
gpt-4o-mini 10000 10,000,000 ≠ (registry: 500 RPM / 200,000 TPM)
|
|
570
|
+
|
|
571
|
+
────────────────────────────────────────────────────────────────────────────────
|
|
572
|
+
1 model(s) differ from registry defaults.
|
|
573
|
+
Paste the config below into createRateLimiter():
|
|
574
|
+
|
|
575
|
+
const limiter = createRateLimiter({
|
|
576
|
+
limits: {
|
|
577
|
+
'gpt-4o-mini': { rpm: 10000, itpm: 10,000,000 },
|
|
578
|
+
},
|
|
579
|
+
})
|
|
411
580
|
|
|
412
|
-
|
|
413
|
-
|
|
414
|
-
|
|
415
|
-
|
|
581
|
+
────────────────────────────────────────────────────────────────────────────────
|
|
582
|
+
```
|
|
583
|
+
|
|
584
|
+
**How it works** — Makes a minimal (5-token) request per model and reads the `x-ratelimit-limit-*` headers that every provider returns on each response. These headers reflect your account's actual tier, not the published defaults.
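
To make the header-reading step concrete, here is a small sketch that extracts numeric limits from a response's headers. The function name `parseLimitHeaders` is hypothetical, and the two header names follow OpenAI's `x-ratelimit-limit-requests` / `x-ratelimit-limit-tokens` convention; other providers may use different suffixes, so treat the names as assumptions.

```typescript
// Hypothetical sketch: read rate-limit headers into numeric limits.
interface DetectedLimits {
  rpm: number | undefined
  tpm: number | undefined
}

function parseLimitHeaders(headers: Record<string, string>): DetectedLimits {
  const read = (name: string): number | undefined => {
    const raw = headers[name]
    if (raw === undefined) return undefined
    const n = Number.parseInt(raw, 10)
    // Ignore values that are not plain integers.
    return Number.isNaN(n) ? undefined : n
  }
  return {
    rpm: read('x-ratelimit-limit-requests'),
    tpm: read('x-ratelimit-limit-tokens'),
  }
}
```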
|
|
585
|
+
|
|
586
|
+
**Options:**
|
|
587
|
+
|
|
588
|
+
```
|
|
589
|
+
npx ai-sdk-rate-limiter audit [options]
|
|
590
|
+
|
|
591
|
+
--provider, -p <name> Audit a single provider: openai, anthropic, groq, mistral, cohere
|
|
592
|
+
--json Machine-readable JSON output
|
|
593
|
+
--help, -h Show help
|
|
594
|
+
--version, -v Print version
|
|
595
|
+
|
|
596
|
+
Environment variables required:
|
|
597
|
+
OPENAI_API_KEY
|
|
598
|
+
ANTHROPIC_API_KEY
|
|
599
|
+
GROQ_API_KEY
|
|
600
|
+
MISTRAL_API_KEY
|
|
601
|
+
COHERE_API_KEY
|
|
602
|
+
```
|
|
603
|
+
|
|
604
|
+
**Examples:**
|
|
605
|
+
|
|
606
|
+
```bash
|
|
607
|
+
# Audit all configured providers
|
|
608
|
+
npx ai-sdk-rate-limiter audit
|
|
609
|
+
|
|
610
|
+
# Audit only OpenAI
|
|
611
|
+
npx ai-sdk-rate-limiter audit --provider openai
|
|
612
|
+
|
|
613
|
+
# Machine-readable output for CI / scripts
|
|
614
|
+
npx ai-sdk-rate-limiter audit --json | jq '.providers[0].models'
|
|
416
615
|
```
|
|
417
616
|
|
|
418
617
|
---
|
|
419
618
|
|
|
420
|
-
##
|
|
619
|
+
## Model registry
|
|
421
620
|
|
|
422
|
-
Limits and pricing are built-in for every major model across 6 providers. Defaults are conservative (free/Tier 1) — override with your actual plan limits
|
|
621
|
+
Limits and pricing are built-in for every major model across 6 providers. Defaults are conservative (free/Tier 1) — override with your actual plan limits via the `limits` config option or by running `audit`.
|
|
423
622
|
|
|
424
623
|
**OpenAI**
|
|
425
624
|
|
|
@@ -493,30 +692,87 @@ import {
|
|
|
493
692
|
console.log(GROQ_MODELS['llama-3.3-70b-versatile'])
|
|
494
693
|
// { rpm: 30, itpm: 6000, rpd: 1000, inputPricePerMillion: 0.59, ... }
|
|
495
694
|
|
|
496
|
-
console.log(isKnownModel('llama-3.3-70b-versatile', 'groq'))
|
|
497
|
-
//
|
|
695
|
+
console.log(isKnownModel('llama-3.3-70b-versatile', 'groq')) // true
|
|
696
|
+
console.log(isKnownModel('my-fine-tune', 'openai')) // false → fallback limits
|
|
697
|
+
|
|
698
|
+
// Resolve the effective limits for a model (registry defaults merged with user overrides)
|
|
699
|
+
const limits = resolveModelLimits('gpt-4o', 'openai', { 'gpt-4o': { rpm: 1000 } })
|
|
700
|
+
```
|
|
701
|
+
|
|
702
|
+
---
|
|
703
|
+
|
|
704
|
+
## Advanced usage
|
|
705
|
+
|
|
706
|
+
### Multiple limiters per tier
|
|
707
|
+
|
|
708
|
+
```typescript
|
|
709
|
+
const freeLimiter = createRateLimiter({
|
|
710
|
+
limits: { 'gpt-4o-mini': { rpm: 5 } },
|
|
711
|
+
cost: { budget: { daily: 0.10 }, onExceeded: 'throw' },
|
|
712
|
+
queue: { timeout: 5_000 },
|
|
713
|
+
})
|
|
714
|
+
|
|
715
|
+
const paidLimiter = createRateLimiter({
|
|
716
|
+
limits: { 'gpt-4o': { rpm: 100 } },
|
|
717
|
+
cost: { budget: { daily: 20 } },
|
|
718
|
+
queue: { timeout: 30_000 },
|
|
719
|
+
})
|
|
720
|
+
|
|
721
|
+
// Route per request based on user plan
|
|
722
|
+
const model = req.user.plan === 'paid'
|
|
723
|
+
? paidLimiter.wrap(openai('gpt-4o'))
|
|
724
|
+
: freeLimiter.wrap(openai('gpt-4o-mini'))
|
|
725
|
+
```
|
|
726
|
+
|
|
727
|
+
### Combine OTel tracing with event logging
|
|
728
|
+
|
|
729
|
+
```typescript
|
|
730
|
+
import { createOtelPlugin } from 'ai-sdk-rate-limiter/otel'
|
|
731
|
+
|
|
732
|
+
const limiter = createRateLimiter({
|
|
733
|
+
on: {
|
|
734
|
+
// OTel spans for every request
|
|
735
|
+
...createOtelPlugin(trace.getTracer('my-service')),
|
|
736
|
+
// Plus any additional handlers
|
|
737
|
+
budgetHit: ({ model, limitUsd, period }) =>
|
|
738
|
+
alerts.send(`Budget alert: ${model} hit $${limitUsd} ${period} cap`),
|
|
739
|
+
},
|
|
740
|
+
})
|
|
741
|
+
```
|
|
742
|
+
|
|
743
|
+
### Custom rate limit store
|
|
744
|
+
|
|
745
|
+
Implement `RateLimitStore` to use any backend (DynamoDB, Postgres, etc.):
|
|
746
|
+
|
|
747
|
+
```typescript
|
|
748
|
+
import type { RateLimitStore } from 'ai-sdk-rate-limiter'
|
|
498
749
|
|
|
499
|
-
|
|
500
|
-
|
|
750
|
+
class MyStore implements RateLimitStore {
|
|
751
|
+
async checkAndReserve(key, tokens, limits) { /* ... */ }
|
|
752
|
+
async applyBackoff(key, untilMs) { /* ... */ }
|
|
753
|
+
async getBackoff(key) { /* ... */ }
|
|
754
|
+
}
|
|
755
|
+
|
|
756
|
+
const limiter = createRateLimiter({ store: new MyStore() })
|
|
501
757
|
```
|
|
502
758
|
|
|
503
759
|
---
|
|
504
760
|
|
|
505
761
|
## How it works
|
|
506
762
|
|
|
507
|
-
**Rate limiting
|
|
763
|
+
**Rate limiting** — Sliding window counter per model. Each model tracks a list of `{timestamp, tokens}` entries for the past 60 seconds. On every request, stale entries are evicted and the window is checked against RPM and ITPM limits simultaneously.
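
The sliding-window check described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: the `SlidingWindow` class, its method names, and the explicit `now` parameter are all hypothetical and exist only to make the eviction and dual-limit check concrete.

```typescript
// Hypothetical sketch of a per-model sliding window; none of these names
// are taken from ai-sdk-rate-limiter itself.
interface WindowEntry {
  timestamp: number // ms since epoch
  tokens: number // input tokens reserved by this request
}

class SlidingWindow {
  private entries: WindowEntry[] = []
  private readonly rpm: number
  private readonly itpm: number
  private readonly windowMs: number

  constructor(rpm: number, itpm: number, windowMs: number = 60_000) {
    this.rpm = rpm
    this.itpm = itpm
    this.windowMs = windowMs
  }

  // Drop entries that have aged out of the window.
  private evict(now: number): void {
    const cutoff = now - this.windowMs
    this.entries = this.entries.filter((e) => e.timestamp > cutoff)
  }

  // Reserve capacity only if BOTH the request and token limits allow it.
  tryReserve(tokens: number, now: number = Date.now()): boolean {
    this.evict(now)
    const usedTokens = this.entries.reduce((sum, e) => sum + e.tokens, 0)
    const withinRpm = this.entries.length + 1 <= this.rpm
    const withinItpm = usedTokens + tokens <= this.itpm
    if (!withinRpm || !withinItpm) return false
    this.entries.push({ timestamp: now, tokens })
    return true
  }

  // Milliseconds until the oldest entry leaves the window (0 if empty).
  msUntilNextExpiry(now: number = Date.now()): number {
    this.evict(now)
    const oldest = this.entries[0]
    return oldest === undefined ? 0 : oldest.timestamp + this.windowMs - now
  }
}
```

A request that fails `tryReserve` would be queued instead of rejected, and `msUntilNextExpiry` gives a natural delay for the next attempt.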
|
|
508
764
|
|
|
509
|
-
**Queue** — A
|
|
765
|
+
**Queue** — A sorted priority queue per model, ordered by `priority` then enqueue time (FIFO within same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.
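
As a rough sketch of the ordering rule above (priority first, then FIFO within a priority), the queue could look like the following. The `WaitQueue` class, the numeric `priority` field (higher means more urgent), and the `canRun` callback are assumptions made for illustration; the library's real priority values and drain logic may differ.

```typescript
// Hypothetical sketch of a sorted priority queue with FIFO tie-breaking.
interface Waiter {
  id: string
  priority: number // assumption: larger number = more urgent
  seq: number // enqueue order, kept for inspection
}

class WaitQueue {
  private items: Waiter[] = []
  private nextSeq = 0

  // Insert after every waiter of equal or higher priority, so that
  // waiters with the same priority keep their arrival order.
  enqueue(id: string, priority: number): void {
    const waiter: Waiter = { id, priority, seq: this.nextSeq++ }
    const firstLower = this.items.findIndex((w) => w.priority < priority)
    const insertAt = firstLower === -1 ? this.items.length : firstLower
    this.items.splice(insertAt, 0, waiter)
  }

  // Release waiters from the head until one cannot run yet.
  drain(canRun: (w: Waiter) => boolean): string[] {
    const released: string[] = []
    while (this.items.length > 0) {
      const head = this.items[0]
      if (head === undefined || !canRun(head)) break
      this.items.shift()
      released.push(head.id)
    }
    return released
  }

  get size(): number {
    return this.items.length
  }
}
```

In a real engine the `canRun` callback would be the rate-limit window check, and a timer would call `drain` again once the oldest window entry expires.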
|
|
510
766
|
|
|
511
|
-
**Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common failure
|
|
767
|
+
**Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the entire model key in the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common thundering-herd failure where you retry one request while 10 others immediately follow and all get 429s.
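
To make the propagation idea concrete, here is a minimal sketch of a shared backoff gate keyed by model. The `BackoffGate` class and `parseRetryAfterMs` helper are hypothetical names, not part of the package's API; the only fact relied on is that HTTP allows `Retry-After` to be either a number of seconds or an HTTP date.

```typescript
// Hypothetical sketch: one backoff deadline per model key, shared by all
// requests for that key.
function parseRetryAfterMs(header: string, now: number): number {
  const trimmed = header.trim()
  const seconds = Number(trimmed)
  // Form 1: delay in seconds, e.g. "Retry-After: 2"
  if (trimmed !== '' && Number.isFinite(seconds)) {
    return Math.max(0, seconds * 1000)
  }
  // Form 2: HTTP date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
  const date = Date.parse(trimmed)
  return Number.isNaN(date) ? 0 : Math.max(0, date - now)
}

class BackoffGate {
  private until = new Map<string, number>()

  // Record a 429 for the whole key; never shorten an existing backoff.
  apply(key: string, retryAfterHeader: string, now: number = Date.now()): void {
    const deadline = now + parseRetryAfterMs(retryAfterHeader, now)
    const current = this.until.get(key) ?? 0
    this.until.set(key, Math.max(current, deadline))
  }

  // How long any request for this key should still wait.
  remainingMs(key: string, now: number = Date.now()): number {
    return Math.max(0, (this.until.get(key) ?? 0) - now)
  }
}
```

Every queued request would consult `remainingMs` before firing, so a single 429 pauses all of them instead of only the request that received it.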
|
|
512
768
|
|
|
513
|
-
**Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk.
|
|
769
|
+
**Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk (Vercel AI SDK) or the final usage chunk (raw proxy).
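
The estimate-then-reconcile flow can be sketched like this. It assumes the plain 4-characters-per-token heuristic mentioned above with rounding up; the `estimateTokens` function and `TokenLedger` class are illustrative names, and the real engine adjusts entries in its window rather than a single counter.

```typescript
// Hypothetical sketch of reserving an estimate and correcting it later.
const CHARS_PER_TOKEN = 4 // heuristic from the text above

function estimateTokens(prompt: string): number {
  return Math.ceil(prompt.length / CHARS_PER_TOKEN)
}

class TokenLedger {
  private used = 0

  // Before the request: reserve the estimate and return it to the caller.
  reserve(prompt: string): number {
    const estimated = estimateTokens(prompt)
    this.used += estimated
    return estimated
  }

  // After the response: swap the estimate for the provider-reported usage.
  reconcile(estimated: number, actual: number): void {
    this.used += actual - estimated
  }

  get total(): number {
    return this.used
  }
}

// Usage: reserve up front, then correct once real usage is known.
const ledger = new TokenLedger()
const reserved = ledger.reserve('a'.repeat(400))
ledger.reconcile(reserved, 120)
```

Reserving before the call keeps concurrent requests from collectively overshooting the token limit, while the reconcile step stops estimation error from accumulating over the window.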
|
|
514
770
|
|
|
515
|
-
**Zero dependencies** — The middleware interface is implemented structurally
|
|
771
|
+
**Zero dependencies** — The Vercel AI SDK middleware interface is implemented structurally — `@ai-sdk/provider` types are used for type checking only and not required at runtime. No `ioredis`, no `bottleneck`, no tokenizer libraries in the core.
|
|
516
772
|
|
|
517
773
|
---
|
|
518
774
|
|
|
519
|
-
##
|
|
775
|
+
## Comparison
|
|
520
776
|
|
|
521
777
|
| | ai-sdk-rate-limiter | bottleneck | p-limit | SDK built-in retry | LangChain |
|
|
522
778
|
|---|:---:|:---:|:---:|:---:|:---:|
|
|
@@ -528,37 +784,42 @@ console.log(isKnownModel('my-fine-tune', 'openai'))
|
|
|
528
784
|
| Cost tracking + budgets | yes | no | no | no | no |
|
|
529
785
|
| Retry-After header | yes | no | no | partial | partial |
|
|
530
786
|
| Backoff propagation | yes | no | no | no | no |
|
|
787
|
+
| OpenTelemetry | yes | no | no | no | partial |
|
|
788
|
+
| CLI audit | yes | no | no | no | no |
|
|
531
789
|
| Zero runtime deps | yes | no | yes | — | no |
|
|
532
790
|
| Provider-agnostic | yes | yes | yes | no | no |
|
|
533
791
|
|
|
534
|
-
**bottleneck**
|
|
792
|
+
**bottleneck** — Excellent general-purpose rate limiting, but knows nothing about AI APIs. No model limits, no token counting, no cost tracking. You'd need to configure it per-model manually and rebuild the cost system yourself.
|
|
535
793
|
|
|
536
|
-
**p-limit**
|
|
794
|
+
**p-limit** — Controls concurrency, not rate. Limits to N concurrent requests, not N requests per minute. A different problem.
|
|
537
795
|
|
|
538
|
-
**SDK built-in retry**
|
|
796
|
+
**SDK built-in retry** — Retries on 429 with backoff. That's the floor, not the ceiling. No queuing, no priority, no cost tracking, no backoff propagation to other in-flight requests.
|
|
539
797
|
|
|
540
798
|
---
|
|
541
799
|
|
|
542
800
|
## TypeScript
|
|
543
801
|
|
|
544
|
-
Fully typed. All configuration options, events, errors, and report shapes have precise TypeScript definitions.
|
|
802
|
+
Fully typed. All configuration options, events, errors, and report shapes have precise TypeScript definitions exported from the main entry point.
|
|
545
803
|
|
|
546
804
|
```typescript
|
|
547
805
|
import type {
|
|
548
806
|
RateLimiterConfig,
|
|
549
807
|
CostReport,
|
|
808
|
+
EventMap,
|
|
550
809
|
QueuedEvent,
|
|
551
810
|
Priority,
|
|
811
|
+
ModelLimits,
|
|
552
812
|
} from 'ai-sdk-rate-limiter'
|
|
553
|
-
|
|
554
|
-
function handleQueuedRequest(event: QueuedEvent) {
|
|
555
|
-
// event.model, event.priority, event.queueDepth, event.estimatedWaitMs
|
|
556
|
-
// all typed, all autocompleted
|
|
557
|
-
}
|
|
558
813
|
```
|
|
559
814
|
|
|
560
815
|
---
|
|
561
816
|
|
|
817
|
+
## Examples
|
|
818
|
+
|
|
819
|
+
A full Next.js 15 App Router example is included at [`examples/nextjs/`](./examples/nextjs/). It demonstrates streaming chat with rate limiting, live cost display, and proper error handling for budget and rate limit errors.
|
|
820
|
+
|
|
821
|
+
---
|
|
822
|
+
|
|
562
823
|
## Requirements
|
|
563
824
|
|
|
564
825
|
- Node.js 18+ / Bun / Deno
|