ai-sdk-rate-limiter 0.1.0

package/README.md ADDED
# ai-sdk-rate-limiter

Smart rate limiting, queuing, and cost tracking for AI API calls. Works across providers. Zero required dependencies.

```
npm install ai-sdk-rate-limiter
```

---

## The problem

Every developer building with LLMs hits this eventually:

- `429 Too Many Requests` crashes a production request mid-flight
- You retry immediately and burn through your remaining quota
- Rate limits differ per model, per tier, per provider — and none of them are documented uniformly
- Your Node.js server runs 4 instances, all racing against the same API quota
- A bulk job spends $300 overnight and nobody notices until the bill arrives
- You have no idea which model is responsible for the cost spike

Every existing tool solves one of these problems. This one solves all of them.

---

## Quick start

```typescript
import { createRateLimiter } from 'ai-sdk-rate-limiter'
import { openai } from '@ai-sdk/openai'
import { generateText } from 'ai'

const limiter = createRateLimiter()

// Wrap any Vercel AI SDK model — that's it
const model = limiter.wrap(openai('gpt-4o'))

const { text } = await generateText({ model, prompt: 'Hello!' })
```

The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — streaming, tool calls, structured output — works exactly as before.

---

## What it does

**Rate limiting** — Tracks requests and tokens in a 60-second sliding window per model. When a limit is reached, requests queue automatically rather than crashing.

**Priority queuing** — Queued requests drain in priority order (`high` before `normal` before `low`), FIFO within the same priority. Your user-facing requests skip ahead of background jobs.

**Smart retry** — Retries on 429, 500, 502, 503, and 504 with exponential backoff plus jitter. Honors the `Retry-After` header exactly — if the API says wait 3 seconds, it waits 3 seconds, not 30.

**Cost tracking** — Records actual token usage from every response. Reports hourly, daily, and monthly spend per model. Optionally enforces budget caps.

**Built-in model registry** — Knows the RPM, ITPM (input tokens per minute), and per-token pricing for every major OpenAI, Anthropic, and Google model out of the box. Nothing to configure to get started.

---

## Configuration

Everything has a sensible default. Override only what you need.

```typescript
const limiter = createRateLimiter({
  // Override or extend built-in model limits for your API tier
  limits: {
    'gpt-4o': { rpm: 500, itpm: 30_000 },
    'claude-opus-4-6': { rpm: 50, itpm: 30_000 },
    // Wildcard: apply to all models from a provider
    'openai/*': { rpm: 1000 },
  },

  // Cost budgets and behavior when exceeded
  cost: {
    budget: {
      hourly: 5, // USD
      daily: 50,
      monthly: 500,
    },
    onExceeded: 'throw', // or 'queue' — wait until the period resets
  },

  // Queue behavior
  queue: {
    maxSize: 500, // max requests waiting; throws QueueFullError when full
    timeout: 30_000, // ms before a queued request times out with QueueTimeoutError
    onFull: 'throw', // or 'drop-low' — evict lowest-priority requests first
  },

  // Retry behavior
  retry: {
    maxAttempts: 4, // total attempts including the first
    backoff: 'exponential', // 'exponential' | 'linear' | 'fixed'
    baseDelay: 1_000, // ms
    maxDelay: 60_000, // ms cap
    jitter: true, // ±30% randomness (prevents thundering herd)
    parseRetryAfter: true, // honor Retry-After header from 429 responses
    retryOn: [429, 500, 502, 503, 504],
  },

  // Observability
  on: {
    rateLimited: ({ model, source, resetAt }) =>
      console.warn(`${model} rate limited (${source}), resets ${new Date(resetAt).toISOString()}`),
    retrying: ({ model, attempt, delayMs }) =>
      console.log(`${model} retry ${attempt} in ${delayMs}ms`),
    budgetHit: ({ model, currentCostUsd, limitUsd, period }) =>
      alerts.send(`${model} hit $${limitUsd} ${period} budget ($${currentCostUsd} spent)`),
    completed: ({ model, inputTokens, outputTokens, costUsd, latencyMs }) =>
      metrics.record({ model, inputTokens, outputTokens, costUsd, latencyMs }),
  },
})
```

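For intuition, the delay schedule those retry options imply can be sketched in a few lines. This is an illustrative helper, not part of the package's API; `computeDelay` and its parameter names are assumptions that simply mirror the config keys above.

```typescript
// Illustrative sketch of the retry delay schedule, mirroring the retry options.
type Backoff = 'exponential' | 'linear' | 'fixed'

interface BackoffOptions {
  backoff: Backoff
  baseDelay: number // ms
  maxDelay: number  // ms cap
  jitter: boolean   // ±30% randomness
}

function computeDelay(attempt: number, opts: BackoffOptions): number {
  // attempt is 1-based: the first retry is attempt 1.
  let delay =
    opts.backoff === 'exponential' ? opts.baseDelay * 2 ** (attempt - 1)
    : opts.backoff === 'linear' ? opts.baseDelay * attempt
    : opts.baseDelay
  delay = Math.min(delay, opts.maxDelay) // never exceed the cap
  if (opts.jitter) delay *= 0.7 + Math.random() * 0.6 // spread retries ±30%
  return delay
}

// With baseDelay 1000 and exponential backoff, the pre-jitter schedule is
// 1000ms, 2000ms, 4000ms, ... capped at maxDelay.
```
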
---

## Per-request options

Pass options to individual requests via `providerOptions.rateLimiter`:

```typescript
import { generateText } from 'ai'

// High-priority request — skips ahead of normal traffic in the queue
await generateText({
  model,
  prompt: 'Urgent user request...',
  providerOptions: {
    rateLimiter: {
      priority: 'high', // 'high' | 'normal' | 'low'
      timeout: 10_000, // override the default queue timeout for this request
    },
  },
})

// Background job — yields to user-facing traffic
await generateText({
  model,
  prompt: 'Nightly batch summary...',
  providerOptions: {
    rateLimiter: { priority: 'low' },
  },
})
```

This lets you run user-facing requests and background jobs against the same model without the background jobs starving users.

---

## Cost tracking

```typescript
// At any point — live snapshot
const report = limiter.getCostReport()

console.log(report)
// {
//   hour: { requests: 42, inputTokens: 84_000, outputTokens: 21_000, costUsd: 0.29 },
//   day: { requests: 318, inputTokens: 620_000, outputTokens: 155_000, costUsd: 2.11 },
//   month: { requests: 4821, inputTokens: 9_100_000, outputTokens: 2_200_000, costUsd: 34.80 },
//   byModel: {
//     'gpt-4o': { requests: 120, inputTokens: 240_000, outputTokens: 60_000, costUsd: 1.20 },
//     'gpt-4o-mini': { requests: 198, inputTokens: 380_000, outputTokens: 95_000, costUsd: 0.91 },
//   }
// }
```

Costs are based on **actual token counts** from API responses — not estimates. The report uses rolling windows, so `hour` always means "the last 60 minutes."

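The per-request arithmetic is simply token counts times the per-million prices. A sketch, where the `pricing` table is a hypothetical stand-in (real prices come from the built-in registry):

```typescript
// Hypothetical pricing table; the package resolves prices from its registry.
const pricing: Record<string, { inputPerMillion: number; outputPerMillion: number }> = {
  'gpt-4o': { inputPerMillion: 2.5, outputPerMillion: 10 },
}

function costUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = pricing[model]
  if (!p) return 0 // unknown models: no cost tracking
  return (
    (inputTokens / 1_000_000) * p.inputPerMillion +
    (outputTokens / 1_000_000) * p.outputPerMillion
  )
}

// 240k input + 60k output on gpt-4o:
// 0.24 * $2.50 + 0.06 * $10.00 ≈ $1.20, matching the gpt-4o row above.
```
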
---

## Backpressure — know before you send

Check the estimated wait time before committing to a request. Useful for showing loading states or shedding load gracefully.

```typescript
const waitMs = limiter.estimatedWait('gpt-4o')

if (waitMs > 5_000) {
  return res.status(503).json({ error: 'Model is busy, try again shortly', retryAfterMs: waitMs })
}

const result = await generateText({ model, prompt })
```

---

## Events

All events are typed. Register handlers at creation time or dynamically:

```typescript
// At creation time
const limiter = createRateLimiter({
  on: { rateLimited: handler },
})

// Dynamically
limiter.on('queued', ({ model, queueDepth, estimatedWaitMs }) => {
  console.log(`${model} queued (depth: ${queueDepth}, ~${estimatedWaitMs}ms wait)`)
})

limiter.off('queued', handler)
```

| Event | When | Key fields |
|---|---|---|
| `queued` | Request enters the queue | `model`, `priority`, `queueDepth`, `estimatedWaitMs` |
| `dequeued` | Request leaves the queue | `model`, `waitedMs`, `priority` |
| `retrying` | A failed request is about to retry | `model`, `attempt`, `maxAttempts`, `delayMs`, `error` |
| `rateLimited` | Limit hit (local or remote 429) | `model`, `source`, `limitType`, `resetAt` |
| `budgetHit` | Cost budget exceeded | `model`, `currentCostUsd`, `limitUsd`, `period` |
| `dropped` | Request rejected (queue full or timeout) | `model`, `reason` |
| `completed` | Request finished successfully | `model`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs` |

The `source` field on `rateLimited` distinguishes requests blocked locally (`'local'`) from requests the API rejected with a 429 (`'remote'`). Local blocks are free; remote 429s mean your configured limits are looser than what the API actually enforces.

---

## Error handling

Every error is typed, structured, and tells you exactly what happened:

```typescript
import {
  QueueTimeoutError,
  QueueFullError,
  BudgetExceededError,
  RetryExhaustedError,
  RateLimiterError,
} from 'ai-sdk-rate-limiter'

try {
  const result = await generateText({ model, prompt })
} catch (error) {
  if (error instanceof QueueTimeoutError) {
    // error.model, error.waitedMs, error.queueDepth
    console.error(`Timed out after ${error.waitedMs}ms (queue depth: ${error.queueDepth})`)
  } else if (error instanceof BudgetExceededError) {
    // error.model, error.currentCostUsd, error.limitUsd, error.period
    console.error(`Budget exceeded: $${error.currentCostUsd} of $${error.limitUsd} ${error.period}`)
  } else if (error instanceof RetryExhaustedError) {
    // error.model, error.attempts, error.cause
    console.error(`All ${error.attempts} retries exhausted`, error.cause)
  } else if (error instanceof QueueFullError) {
    // error.model, error.maxSize
    console.error(`Queue full at ${error.maxSize} requests`)
  }
}
```

All errors extend `RateLimiterError`, so a single `instanceof RateLimiterError` check separates rate-limiting failures from AI SDK errors.

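To make that catch-all concrete, here is the shape of the hierarchy in miniature. The classes below are illustrative stand-ins, not the package's source:

```typescript
// Illustrative stand-ins mirroring the exported error hierarchy.
class RateLimiterError extends Error {}
class QueueTimeoutError extends RateLimiterError {}
class BudgetExceededError extends RateLimiterError {}

// One check separates rate-limiting failures from everything else.
function isRateLimitingFailure(error: unknown): error is RateLimiterError {
  return error instanceof RateLimiterError
}
```

Any subclass instance passes the single `instanceof RateLimiterError` check; AI SDK and network errors do not.
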
---

## Advanced middleware usage

If you use `wrapLanguageModel` directly, the raw middleware is available:

```typescript
import { wrapLanguageModel } from 'ai'

const model = wrapLanguageModel({
  model: openai('gpt-4o'),
  middleware: limiter.middleware,
})
```

You can also compose it with other middleware:

```typescript
const model = wrapLanguageModel({
  model: openai('gpt-4o'),
  middleware: [loggingMiddleware, limiter.middleware, cachingMiddleware],
})
```

---

## Multiple limiters

Run separate limiters for separate concerns — e.g., one per customer tier:

```typescript
const freeLimiter = createRateLimiter({
  limits: { 'gpt-4o-mini': { rpm: 5 } },
  cost: { budget: { daily: 0.10 }, onExceeded: 'throw' },
  queue: { timeout: 5_000 },
})

const paidLimiter = createRateLimiter({
  limits: { 'gpt-4o': { rpm: 100 } },
  cost: { budget: { daily: 20 } },
  queue: { timeout: 30_000 },
})

// Route to the right limiter per request
const model = req.user.plan === 'paid'
  ? paidLimiter.wrap(openai('gpt-4o'))
  : freeLimiter.wrap(openai('gpt-4o-mini'))
```

---

## Built-in model registry

Limits and pricing are built in for every major model. These defaults are Tier 1 (the most conservative) — override them with your actual tier limits.

**OpenAI**

| Model | RPM | ITPM | Input $/M | Output $/M |
|---|---|---|---|---|
| gpt-4o | 500 | 30,000 | $2.50 | $10.00 |
| gpt-4o-mini | 500 | 200,000 | $0.15 | $0.60 |
| o1 | 500 | 30,000 | $15.00 | $60.00 |
| o3-mini / o4-mini | 500 | 200,000 | $1.10 | $4.40 |
| gpt-3.5-turbo | 3,500 | 90,000 | $0.50 | $1.50 |

**Anthropic**

| Model | RPM | ITPM | Input $/M | Output $/M |
|---|---|---|---|---|
| claude-opus-4-6 | 50 | 30,000 | $15.00 | $75.00 |
| claude-sonnet-4-6 | 50 | 30,000 | $3.00 | $15.00 |
| claude-haiku-4-5 | 50 | 50,000 | $0.80 | $4.00 |

**Google**

| Model | RPM | ITPM | Input $/M | Output $/M |
|---|---|---|---|---|
| gemini-2.0-flash | 15 | 1,000,000 | $0.10 | $0.40 |
| gemini-1.5-pro | 2 | 32,000 | $1.25 | $5.00 |
| gemini-1.5-flash | 15 | 1,000,000 | $0.075 | $0.30 |

Unknown models fall back to 60 RPM / 100k ITPM with no cost tracking. You can inspect or extend the registry:

```typescript
import { OPENAI_MODELS, ANTHROPIC_MODELS, resolveModelLimits, isKnownModel } from 'ai-sdk-rate-limiter'

console.log(OPENAI_MODELS['gpt-4o'])
// { rpm: 500, itpm: 30000, otpm: 30000, inputPricePerMillion: 2.5, ... }

console.log(isKnownModel('my-fine-tune', 'openai'))
// false → will use fallback limits
```

---

## How it works

**Rate limiting algorithm** — Sliding window counter. Each model tracks a list of `{timestamp, tokens}` entries for the past 60 seconds. On every request, stale entries are evicted and the window is checked against the RPM and ITPM limits simultaneously.

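That counter can be sketched in a few lines (illustrative only, not the package's actual internals):

```typescript
// Illustrative sliding-window counter: evict stale entries, then check limits.
interface Entry { timestamp: number; tokens: number }

class SlidingWindow {
  private entries: Entry[] = []
  constructor(private windowMs = 60_000) {}

  // Record a request and its (estimated or actual) token usage.
  record(tokens: number, now = Date.now()): void {
    this.entries.push({ timestamp: now, tokens })
  }

  // Evict entries older than the window, then report current usage.
  usage(now = Date.now()): { requests: number; tokens: number } {
    this.entries = this.entries.filter(e => now - e.timestamp < this.windowMs)
    return {
      requests: this.entries.length,
      tokens: this.entries.reduce((sum, e) => sum + e.tokens, 0),
    }
  }

  // Both the request count (RPM) and token count (ITPM) must have headroom.
  allows(rpm: number, itpm: number, now = Date.now()): boolean {
    const { requests, tokens } = this.usage(now)
    return requests < rpm && tokens < itpm
  }
}
```
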
**Queue** — A min-heap priority queue per model, ordered by `priority` then enqueue time (FIFO within the same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.

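The drain order is equivalent to sorting waiters with a comparator like this (a sketch; the real structure is a heap, but the ordering is the same):

```typescript
// Illustrative comparator for the queue ordering described above.
type Priority = 'high' | 'normal' | 'low'
const rank: Record<Priority, number> = { high: 0, normal: 1, low: 2 }

interface Waiter { priority: Priority; enqueuedAt: number }

// Negative result means `a` drains before `b`.
function compare(a: Waiter, b: Waiter): number {
  if (rank[a.priority] !== rank[b.priority]) return rank[a.priority] - rank[b.priority]
  return a.enqueuedAt - b.enqueuedAt // FIFO within the same priority
}
```
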
**Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common failure mode where you retry one request while 10 more immediately follow and all get 429s.

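One wrinkle any `Retry-After` handling has to cover: per the HTTP spec the header may be either delta-seconds or an HTTP-date. A hypothetical parse helper (not the package's API):

```typescript
// Hypothetical helper: Retry-After is delta-seconds or an HTTP-date (RFC 9110).
function parseRetryAfterMs(header: string, now = Date.now()): number | undefined {
  const seconds = Number(header)
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000)
  const date = Date.parse(header)
  if (!Number.isNaN(date)) return Math.max(0, date - now)
  return undefined // unparseable: fall back to the computed backoff
}
```
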
**Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 characters per token) and reserved in the window. After the response, the actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk.

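The heuristic is deliberately crude, which is exactly what keeps tokenizer libraries out of the dependency tree. Roughly:

```typescript
// Rough pre-flight estimate (~4 characters per token for English text).
// The reservation is corrected with the API's actual usage afterwards.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}
```
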
**Zero dependencies** — The middleware interface is implemented structurally. `@ai-sdk/provider` types are referenced for type checking but not required at runtime. No `ioredis`, no `bottleneck`, no tokenizer libraries.

---

## Why not X?

| | ai-sdk-rate-limiter | bottleneck | p-limit | SDK built-in retry | LangChain |
|---|:---:|:---:|:---:|:---:|:---:|
| Vercel AI SDK `.wrap()` | yes | no | no | — | no |
| Model-aware limits | yes | no | no | no | partial |
| ITPM / token tracking | yes | no | no | no | no |
| Priority queue | yes | yes | no | no | no |
| Cost tracking + budgets | yes | no | no | no | no |
| Retry-After header | yes | no | no | partial | partial |
| Backoff propagation | yes | no | no | no | no |
| Zero runtime deps | yes | no | yes | — | no |
| Provider-agnostic | yes | yes | yes | no | no |

**bottleneck** is excellent general-purpose rate limiting, but it knows nothing about AI APIs — no model limits, no token counting, no cost tracking. You'd need to configure it per model manually and rebuild the cost system yourself.

**p-limit** controls concurrency, not rate. It doesn't throttle to N requests per minute; it throttles to N concurrent requests. Useful, but a completely different problem.

**SDK built-in retry** retries on 429 with backoff. That's the floor, not the ceiling. It doesn't queue, doesn't prioritize, doesn't track cost, and doesn't propagate backoff to other in-flight requests.

---

## TypeScript

Fully typed. All configuration options, events, errors, and report shapes have precise TypeScript definitions.

```typescript
import type {
  RateLimiterConfig,
  CostReport,
  QueuedEvent,
  Priority,
} from 'ai-sdk-rate-limiter'

function handleQueuedRequest(event: QueuedEvent) {
  // event.model, event.priority, event.queueDepth, event.estimatedWaitMs
  // all typed, all autocompleted
}
```

---

## Requirements

- Node.js 18+ / Bun / Deno
- `ai` v4+ (Vercel AI SDK) — optional peer dependency, only needed for `.wrap()`
- Zero required runtime dependencies

---

## License

MIT