ai-sdk-rate-limiter 0.7.0 → 0.7.1
- package/README.md +266 -2
- package/package.json +1 -1
package/README.md
CHANGED
@@ -18,6 +18,7 @@ Every developer building with LLMs hits this eventually:
 - Your Node.js server runs 4 instances. They race against the same API quota
 - A bulk job spends $300 overnight and nobody notices until the bill arrives
 - You have no idea which model is responsible for the cost spike
+- In a multi-tenant app, one user's burst shouldn't block everyone else

 Every existing tool solves one of these. This solves all of them.

@@ -48,16 +49,24 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre

 **Priority queuing** — Queued requests drain in priority order (`high` before `normal` before `low`), FIFO within the same priority. Your user-facing requests skip ahead of background jobs.

+**Concurrency limits** — Optional `maxConcurrent` cap per model enforced as a semaphore. Requests queue behind in-flight ones, then release slots as they complete.
+
 **Smart retry** — Retries on 429, 500, 502, 503, 504 with exponential backoff + jitter. Honors the `Retry-After` header exactly — if the API says wait 3 seconds, waits 3 seconds, not 30.

 **Cost tracking** — Records actual token usage from every response. Reports hourly, daily, and monthly spend per model. Optionally enforces budget caps.

+**Multi-tenant scoped limits** — Give each user or org its own isolated rate limit window without running separate limiter instances. Wildcard patterns match user tiers.
+
+**AbortSignal propagation** — Cancelling a request (e.g. user navigates away) immediately removes it from the queue. No wasted API calls for abandoned requests.
+
 **Built-in model registry** — Knows the RPM, ITPM, and per-token pricing for every major OpenAI, Anthropic, Google, Groq, Mistral, and Cohere model out of the box. Nothing to configure to get started.

 **Raw SDK support** — Works with the native OpenAI, Anthropic, Groq, Mistral, and Cohere SDKs directly via a transparent JavaScript Proxy. No Vercel AI SDK required.

 **OpenTelemetry** — Drop-in OTel plugin that emits GenAI-spec spans for every request. Works with any OTel-compatible tracer.

+**Testing utilities** — `createTestLimiter()` records every completed call so you can assert on model usage, token counts, and costs in unit tests.
+
 **CLI audit** — `npx ai-sdk-rate-limiter audit` probes your API keys to detect your actual tier limits and generates a ready-to-paste config override.

 ---
@@ -68,6 +77,9 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
 - [Raw SDK proxy](#raw-sdk-proxy)
 - [Configuration reference](#configuration-reference)
 - [Per-request options](#per-request-options)
+- [Multi-tenant scoped limits](#multi-tenant-scoped-limits)
+- [Concurrency limits](#concurrency-limits)
+- [AbortSignal support](#abortsignal-support)
 - [Cost tracking](#cost-tracking)
 - [Budget fallback routing](#budget-fallback-routing)
 - [Multi-instance Redis store](#multi-instance-redis-store)
@@ -75,6 +87,7 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
 - [Backpressure](#backpressure)
 - [Error handling](#error-handling)
 - [OpenTelemetry](#opentelemetry)
+- [Testing utilities](#testing-utilities)
 - [CLI audit](#cli-audit)
 - [Model registry](#model-registry)
 - [Advanced usage](#advanced-usage)
@@ -209,7 +222,7 @@ Everything has a sensible default. Override only what you need.
 const limiter = createRateLimiter({
   // Override or extend built-in model limits for your API tier
   limits: {
-    'gpt-4o': { rpm: 500, itpm: 30_000 },
+    'gpt-4o': { rpm: 500, itpm: 30_000, maxConcurrent: 20 },
     'claude-opus-4-6': { rpm: 50, itpm: 20_000 },
   },

@@ -241,6 +254,13 @@ const limiter = createRateLimiter({
     retryOn: [429, 500, 502, 503, 504],
   },

+  // Per-scope rate limit overrides for multi-tenant use cases
+  scopes: {
+    'user:free:*': { rpm: 5, itpm: 10_000 },
+    'user:pro:*': { rpm: 60, itpm: 200_000 },
+    'org:*': { rpm: 300, maxConcurrent: 20 },
+  },
+
   // Observability — see Events section for all available events
   on: {
     rateLimited: ({ model, source, resetAt }) =>
@@ -284,12 +304,155 @@ await generateText({
     rateLimiter: { priority: 'low' },
   },
 })
+
+// Per-request scope (overrides any static scope set in limiter.wrap())
+await generateText({
+  model,
+  prompt: 'User message...',
+  providerOptions: {
+    rateLimiter: { scope: `user:${userId}` },
+  },
+})
 ```

 This is the recommended way to colocate user requests and background jobs on the same model without background jobs starving users.

 ---

+## Multi-tenant scoped limits
+
+Scoped limits give each user, org, or tenant its own isolated rate limit window. A burst from one user doesn't consume quota for anyone else.
+
+### Static scope on the model
+
+```typescript
+// Each user gets their own model with an isolated window
+function getModelForUser(userId: string) {
+  return limiter.wrap(openai('gpt-4o'), { scope: `user:${userId}` })
+}
+
+// user:alice has its own RPM/ITPM window independent of user:bob
+const aliceModel = getModelForUser('alice')
+const bobModel = getModelForUser('bob')
+```
+
+### Per-request scope
+
+```typescript
+// Same wrapped model, different scope per call
+await generateText({
+  model,
+  prompt: req.body.message,
+  providerOptions: {
+    rateLimiter: { scope: `user:${req.user.id}` },
+  },
+})
+```
+
+Per-request scope takes precedence over any static scope set in `limiter.wrap()`.
+
+### Defining scope-level limits
+
+Use `config.scopes` to define separate rate limits for each scope tier. Keys support `*` wildcards:
+
+```typescript
+const limiter = createRateLimiter({
+  scopes: {
+    'user:free:*': { rpm: 5, itpm: 10_000 }, // free tier: 5 rpm each
+    'user:pro:*': { rpm: 60, itpm: 200_000 }, // pro tier: 60 rpm each
+    'org:*': { rpm: 300, maxConcurrent: 20 },
+  },
+})
+
+// Each scope gets its own isolated window under the matched limits
+await generateText({
+  model,
+  providerOptions: {
+    rateLimiter: { scope: 'user:free:alice' }, // matches 'user:free:*' → 5 rpm
+  },
+})
+```
+
+**Scope fields:**
+
+| Field | Description |
+|---|---|
+| `rpm` | Max requests per minute for this scope |
+| `itpm` | Max input tokens per minute for this scope |
+| `maxConcurrent` | Max concurrent in-flight requests for this scope |
+
+When a scope matches, its limits replace the model's global limits for that request. Each scope gets a fully independent sliding window — `user:alice` and `user:bob` don't share quota.
+
+If no `scopes` config is defined, the model's global limits apply to all scoped requests.
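To make the wildcard lookup concrete, here is a minimal sketch of how a scope string could be resolved against `config.scopes`. It is not the package's implementation: `ScopeLimits`, `patternToRegExp`, and `resolveScopeLimits` are hypothetical names, and the "first matching pattern wins, otherwise fall back" rule is an assumption rather than documented behavior.

```typescript
// Hypothetical shape mirroring the scope fields table (rpm, itpm, maxConcurrent).
interface ScopeLimits {
  rpm?: number
  itpm?: number
  maxConcurrent?: number
}

// Convert a pattern such as 'user:free:*' into an anchored RegExp.
// Every regex metacharacter is escaped except '*', which becomes '.*'.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&')
  return new RegExp(`^${escaped.replace(/\*/g, '.*')}$`)
}

// Assumption: the first pattern (in insertion order) that matches wins.
// Returning undefined lets the caller fall back to the model's global limits.
function resolveScopeLimits(
  scope: string,
  scopes: Record<string, ScopeLimits>,
): ScopeLimits | undefined {
  for (const [pattern, limits] of Object.entries(scopes)) {
    if (patternToRegExp(pattern).test(scope)) return limits
  }
  return undefined
}

const scopes: Record<string, ScopeLimits> = {
  'user:free:*': { rpm: 5, itpm: 10_000 },
  'user:pro:*': { rpm: 60, itpm: 200_000 },
  'org:*': { rpm: 300, maxConcurrent: 20 },
}

// Matches 'user:free:*', so the free-tier limits are returned.
const freeLimits = resolveScopeLimits('user:free:alice', scopes)

// 'user:alice' has no tier segment and matches none of the patterns.
const unmatched = resolveScopeLimits('user:alice', scopes)
```

Under this sketch, a scope such as `user:${userId}` only picks up tier limits if the tier is part of the scope string, as in `user:free:alice`.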
+
+---
+
+## Concurrency limits
+
+Limit how many requests to a model can be in-flight simultaneously. Useful for:
+- Preventing connection pool exhaustion
+- Controlling cost burn rate during spikes
+- Enforcing per-scope parallelism
+
+```typescript
+const limiter = createRateLimiter({
+  limits: {
+    'gpt-4o': {
+      rpm: 500,
+      maxConcurrent: 10, // at most 10 requests executing at once
+    },
+  },
+})
+```
+
+Once `maxConcurrent` slots are occupied, new requests queue behind them. Each slot is released when its request completes (success or failure). Concurrency is checked after the rate limit slot is acquired — both limits apply independently.
+
+Concurrency limits also work in scoped contexts:
+
+```typescript
+const limiter = createRateLimiter({
+  scopes: {
+    'org:*': { rpm: 300, maxConcurrent: 20 },
+  },
+})
+```
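The slot behavior described in this section (queue when full, release on success or failure) is what a counting semaphore provides. The following is a generic sketch of that technique, not the package's internals. The `Semaphore` class and `demo` function are hypothetical names introduced for illustration.

```typescript
// Minimal counting semaphore: at most `max` holders at a time, FIFO waiters.
class Semaphore {
  private active = 0
  private readonly waiters: Array<() => void> = []
  private readonly max: number

  constructor(max: number) {
    this.max = max
  }

  // Resolves once a slot is available.
  async acquire(): Promise<void> {
    if (this.active < this.max) {
      this.active++
      return
    }
    // release() hands its slot directly to this waiter, so `active` is unchanged.
    await new Promise<void>((resolve) => this.waiters.push(resolve))
  }

  release(): void {
    const next = this.waiters.shift()
    if (next) next()
    else this.active--
  }

  // Run `task` inside a slot; the slot is released on success or failure.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire()
    try {
      return await task()
    } finally {
      this.release()
    }
  }
}

// Start 5 tasks against 2 slots and report the highest concurrency observed.
async function demo(): Promise<number> {
  const semaphore = new Semaphore(2)
  let inFlight = 0
  let peak = 0
  const task = async (): Promise<void> => {
    inFlight++
    peak = Math.max(peak, inFlight)
    await new Promise((resolve) => setTimeout(resolve, 10))
    inFlight--
  }
  await Promise.all(Array.from({ length: 5 }, () => semaphore.run(task)))
  return peak
}

demo().then((peak) => console.log('peak concurrency:', peak))
```

With two slots, the five tasks never overlap more than two at a time; the remaining three wait their turn in arrival order.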
+
+---
+
+## AbortSignal support
+
+Pass an `AbortSignal` to cancel a request that's waiting in the queue. If the signal fires before the request starts executing, it's removed from the queue immediately and the promise rejects with an `AbortError`.
+
+```typescript
+const controller = new AbortController()
+
+// User closes the browser tab — cancel pending AI requests
+window.addEventListener('beforeunload', () => controller.abort())
+
+const result = await generateText({
+  model,
+  prompt: 'Long running task...',
+  abortSignal: controller.signal,
+})
+```
+
+```typescript
+// With timeout
+const signal = AbortSignal.timeout(5_000)
+
+try {
+  const result = await generateText({ model, prompt, abortSignal: signal })
+} catch (err) {
+  if (err.name === 'AbortError') {
+    console.log('Request cancelled (timed out or aborted)')
+  }
+}
+```
+
+The signal threads through both the rate-limit queue and the concurrency queue. A request that's already executing is not affected — only queued requests can be aborted this way.
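A rough sketch of how a queued waiter can be removed when its signal fires is shown below. This is an illustration of the general pattern only: `AbortableQueue` and `makeAbortError` are hypothetical names, and the sketch assumes callers only inspect the error's `name`.

```typescript
// Assumption: callers only check `error.name === 'AbortError'`.
function makeAbortError(): Error {
  const error = new Error('Request aborted while queued')
  error.name = 'AbortError'
  return error
}

class AbortableQueue {
  private readonly waiters: Array<() => void> = []

  get length(): number {
    return this.waiters.length
  }

  // Wait in line until drain() reaches this entry, or until `signal` aborts.
  wait(signal?: AbortSignal): Promise<void> {
    return new Promise<void>((resolve, reject) => {
      if (signal?.aborted) {
        reject(makeAbortError())
        return
      }
      const onAbort = (): void => {
        // Splice this waiter out so it never consumes a slot.
        const index = this.waiters.indexOf(wake)
        if (index !== -1) this.waiters.splice(index, 1)
        reject(makeAbortError())
      }
      const wake = (): void => {
        // Once the request starts executing, the queue stops listening.
        signal?.removeEventListener('abort', onAbort)
        resolve()
      }
      this.waiters.push(wake)
      signal?.addEventListener('abort', onAbort, { once: true })
    })
  }

  // Let the oldest waiter proceed, if there is one.
  drain(): void {
    this.waiters.shift()?.()
  }
}

const queue = new AbortableQueue()
const controller = new AbortController()
const outcome = queue
  .wait(controller.signal)
  .then(() => 'started', (error: Error) => error.name)

// Aborting while the request is still queued removes it and rejects the promise.
controller.abort()
```

One detail worth checking in your own error handling: in the standard API, a signal created with `AbortSignal.timeout()` aborts with a `TimeoutError` reason, so code that forwards `signal.reason` would surface that name rather than `AbortError`.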
+
+---
+
 ## Cost tracking

 ```typescript
@@ -476,6 +639,9 @@ try {
   } else if (error instanceof RateLimitExceededError) {
     // Rate limit hit and the request could not be queued
     console.error(`${error.model} ${error.limitType} limit of ${error.limit} exceeded`)
+  } else if (error.name === 'AbortError') {
+    // Request was cancelled via AbortSignal before it started executing
+    console.log('Request cancelled')
   }
 }
 ```
@@ -549,6 +715,67 @@ const limiter = createRateLimiter({

 ---

+## Testing utilities
+
+The `ai-sdk-rate-limiter/testing` entry point provides a test-friendly limiter that records every completed call. Use it to assert on model usage, token counts, and costs in unit tests without mocking internals.
+
+```typescript
+import { createTestLimiter } from 'ai-sdk-rate-limiter/testing'
+import { openai } from '@ai-sdk/openai'
+import { generateText } from 'ai'
+
+const limiter = createTestLimiter()
+const model = limiter.wrap(openai('gpt-4o'))
+
+// Run your code under test
+await generateText({ model, prompt: 'Hello!' })
+await generateText({ model, prompt: 'Another request' })
+
+// Assert on recorded calls
+const calls = limiter.getCalls()
+expect(calls).toHaveLength(2)
+expect(calls[0].modelId).toBe('gpt-4o')
+expect(calls[0].inputTokens).toBeGreaterThan(0)
+expect(calls[0].costUsd).toBeGreaterThan(0)
+
+// Reset between tests
+limiter.reset()
+```
+
+`createTestLimiter()` accepts the same config as `createRateLimiter()`, so you can test budget enforcement, retry behavior, and other scenarios with real config:
+
+```typescript
+const limiter = createTestLimiter({
+  cost: { budget: { daily: 0.01 }, onExceeded: 'throw' },
+})
+
+// Test that your code handles budget errors gracefully
+```
+
+**`CallRecord` fields:**
+
+| Field | Type | Description |
+|---|---|---|
+| `modelId` | `string` | Model that was called |
+| `provider` | `string` | Provider name |
+| `inputTokens` | `number` | Input tokens from the API response |
+| `outputTokens` | `number` | Output tokens from the API response |
+| `costUsd` | `number` | Cost in USD for this call |
+| `latencyMs` | `number` | Total latency in ms |
+| `streaming` | `boolean` | Whether this was a streaming call |
+| `timestamp` | `number` | Unix timestamp (ms) when the call completed |
+
+**Methods:**
+
+| Method | Description |
+|---|---|
+| `getCalls()` | Returns all completed calls in chronological order |
+| `reset()` | Clears call history |
+
+All other `RateLimiter` methods (`wrap`, `rawProxy`, `getCostReport`, `getStatus`, `on`, `off`) work identically.
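Because `getCalls()` returns plain records, test helpers can aggregate them directly. The sketch below totals cost per model. The `CallRecord` interface is a local copy of the fields listed in the table, so the example runs without the package installed. `costByModel` is a hypothetical helper, and the sample values are made up for illustration.

```typescript
// Local stand-in for the CallRecord type, copied from the fields table.
interface CallRecord {
  modelId: string
  provider: string
  inputTokens: number
  outputTokens: number
  costUsd: number
  latencyMs: number
  streaming: boolean
  timestamp: number
}

// Hypothetical test helper: total recorded cost per model.
function costByModel(calls: CallRecord[]): Map<string, number> {
  const totals = new Map<string, number>()
  for (const call of calls) {
    totals.set(call.modelId, (totals.get(call.modelId) ?? 0) + call.costUsd)
  }
  return totals
}

// Stand-in for limiter.getCalls(); these values are invented for the example.
const sampleCalls: CallRecord[] = [
  { modelId: 'gpt-4o', provider: 'openai', inputTokens: 120, outputTokens: 40, costUsd: 0.25, latencyMs: 800, streaming: false, timestamp: 1 },
  { modelId: 'gpt-4o', provider: 'openai', inputTokens: 300, outputTokens: 90, costUsd: 0.5, latencyMs: 1200, streaming: true, timestamp: 2 },
  { modelId: 'gpt-4o-mini', provider: 'openai', inputTokens: 80, outputTokens: 20, costUsd: 0.125, latencyMs: 400, streaming: false, timestamp: 3 },
]

const totals = costByModel(sampleCalls)
```

In a real test, `sampleCalls` would be replaced by the array returned from `limiter.getCalls()`.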
+
+---
+
 ## CLI audit

 Probe your API keys to discover your actual rate limit tier and generate a ready-to-paste config override:
@@ -724,6 +951,30 @@ const model = req.user.plan === 'paid'
   : freeLimiter.wrap(openai('gpt-4o-mini'))
 ```

+### Per-user limits with a single limiter
+
+For large numbers of users, scoped limits are more efficient than one limiter per user:
+
+```typescript
+const limiter = createRateLimiter({
+  scopes: {
+    'user:free:*': { rpm: 5, itpm: 10_000 },
+    'user:pro:*': { rpm: 60, itpm: 200_000 },
+  },
+})
+
+const model = limiter.wrap(openai('gpt-4o'))
+
+// Each request is tracked in an isolated window per user
+await generateText({
+  model,
+  prompt: req.body.message,
+  providerOptions: {
+    rateLimiter: { scope: `user:${req.user.plan}:${req.user.id}` },
+  },
+})
+```
+
 ### Combine OTel tracing with event logging

 ```typescript
@@ -764,6 +1015,12 @@ const limiter = createRateLimiter({ store: new MyStore() })

 **Queue** — A sorted priority queue per model, ordered by `priority` then enqueue time (FIFO within same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.
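The ordering rule in the Queue paragraph (priority first, then enqueue time) can be sketched as a comparator plus a sorted insert. This is an illustration, not the package's code: `QueueEntry`, `PRIORITY_RANK`, `compareEntries`, and `insertSorted` are hypothetical names, and `Priority` here is a local stand-in for the exported type.

```typescript
// Local stand-in for the library's Priority type.
type Priority = 'high' | 'normal' | 'low'

const PRIORITY_RANK: Record<Priority, number> = { high: 0, normal: 1, low: 2 }

interface QueueEntry {
  id: string
  priority: Priority
  enqueuedAt: number
}

// Negative when `a` should drain before `b`: higher priority first,
// then earlier enqueue time (FIFO within the same priority).
function compareEntries(a: QueueEntry, b: QueueEntry): number {
  return PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority] || a.enqueuedAt - b.enqueuedAt
}

// Insert before the first existing entry that should drain after the new one.
function insertSorted(queue: QueueEntry[], entry: QueueEntry): void {
  const index = queue.findIndex((existing) => compareEntries(entry, existing) < 0)
  queue.splice(index === -1 ? queue.length : index, 0, entry)
}

const pending: QueueEntry[] = []
insertSorted(pending, { id: 'bg-1', priority: 'low', enqueuedAt: 1 })
insertSorted(pending, { id: 'user-1', priority: 'normal', enqueuedAt: 2 })
insertSorted(pending, { id: 'bg-2', priority: 'low', enqueuedAt: 3 })
insertSorted(pending, { id: 'user-2', priority: 'high', enqueuedAt: 4 })
insertSorted(pending, { id: 'user-3', priority: 'normal', enqueuedAt: 5 })

// Drain order: user-2, user-1, user-3, bg-1, bg-2
const drainOrder = pending.map((entry) => entry.id)
```

Keeping the array sorted on insert means draining is a simple `shift()`, at the cost of a linear scan per enqueue.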

+**Concurrency** — A semaphore (`activeCount` + `concurrencyWaiters`) per model key, checked after the rate limit slot is acquired. `release()` decrements the count and unblocks the next waiter.
+
+**Scoped limits** — Each scoped request uses a key in the form `scope:provider:modelId`. That key gets its own independent sliding window. Wildcard patterns in `config.scopes` are matched with `*` → `.*` regex expansion.
+
+**AbortSignal** — The signal is registered as an event listener on both the rate-limit queue and the concurrency queue. If it fires, the request is spliced out of whichever queue it's in and the promise rejects with an `AbortError`.
+
 **Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the entire model key in the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common thundering-herd failure where you retry one request while 10 others immediately follow and all get 429s.
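One way to picture the propagation step: compute a delay (from `Retry-After` when present, otherwise from exponential backoff with jitter) and record it against the shared model key so every queued request observes it. The sketch below makes several assumptions. All function names are hypothetical, the jitter strategy is "full jitter" (chosen here for simplicity; the package does not state which variant it uses), and the key format `openai:gpt-4o` is only an example.

```typescript
interface BackoffOptions {
  baseMs: number
  maxMs: number
  random?: () => number // injectable for deterministic tests
}

// Retry-After is either a number of seconds or an HTTP date.
function parseRetryAfterMs(header: string | null | undefined, now: number = Date.now()): number | undefined {
  if (!header) return undefined
  const seconds = Number(header)
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000)
  const date = Date.parse(header)
  return Number.isNaN(date) ? undefined : Math.max(0, date - now)
}

// Honor the header exactly when present; otherwise exponential backoff with full jitter.
function retryDelayMs(attempt: number, retryAfter: string | null | undefined, options: BackoffOptions): number {
  const fromHeader = parseRetryAfterMs(retryAfter)
  if (fromHeader !== undefined) return fromHeader
  const ceiling = Math.min(options.maxMs, options.baseMs * 2 ** attempt)
  const random = options.random ?? Math.random
  return Math.floor(random() * ceiling)
}

// Shared per-model-key backoff: every request for the key waits until it clears.
const backoffUntil = new Map<string, number>()

function applyBackoff(modelKey: string, delayMs: number, now: number = Date.now()): void {
  backoffUntil.set(modelKey, Math.max(backoffUntil.get(modelKey) ?? 0, now + delayMs))
}

function msUntilClear(modelKey: string, now: number = Date.now()): number {
  return Math.max(0, (backoffUntil.get(modelKey) ?? 0) - now)
}

// A 429 with "Retry-After: 3" pauses the whole key for 3000 ms.
applyBackoff('openai:gpt-4o', retryDelayMs(0, '3', { baseMs: 500, maxMs: 30_000 }), 1_000)
```

Because the deadline lives on the model key rather than on the failing request, a queue drain loop can call `msUntilClear(key)` before releasing any waiter.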

 **Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk (Vercel AI SDK) or the final usage chunk (raw proxy).
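The estimate-then-reconcile flow can be sketched in a few lines. This is a simplified illustration under stated assumptions: `estimateTokens` rounds up (the rounding rule is not documented), `TokenWindow` is a hypothetical class, and the sketch ignores the time dimension of the sliding window so that only the bookkeeping is shown.

```typescript
const CHARS_PER_TOKEN = 4

// Rough estimate from prompt length; rounding up is an assumption.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN)
}

// Tracks input tokens used against a per-minute limit (window expiry omitted).
class TokenWindow {
  private used = 0
  private readonly limit: number

  constructor(limit: number) {
    this.limit = limit
  }

  get tokensUsed(): number {
    return this.used
  }

  // Reserve the estimate before the request fires; false means "queue it".
  reserve(estimate: number): boolean {
    if (this.used + estimate > this.limit) return false
    this.used += estimate
    return true
  }

  // After the response, swap the estimate for the actual usage.
  reconcile(estimate: number, actual: number): void {
    this.used = Math.max(0, this.used - estimate + actual)
  }
}

const tokenWindow = new TokenWindow(1_000)
const prompt = 'a'.repeat(1_000)
const estimate = estimateTokens(prompt) // 1000 chars / 4 = 250 tokens
const reserved = tokenWindow.reserve(estimate)

// Suppose the API reports 231 input tokens for this request.
tokenWindow.reconcile(estimate, 231)
```

Reserving up front keeps concurrent requests from collectively overshooting the limit while their real token counts are still unknown.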
@@ -781,10 +1038,14 @@ const limiter = createRateLimiter({ store: new MyStore() })
 | Model-aware limits | yes | no | no | no | partial |
 | ITPM / token tracking | yes | no | no | no | no |
 | Priority queue | yes | yes | no | no | no |
+| Concurrency limits | yes | yes | yes | no | no |
 | Cost tracking + budgets | yes | no | no | no | no |
+| Multi-tenant scoped limits | yes | no | no | no | no |
+| AbortSignal propagation | yes | no | no | no | no |
 | Retry-After header | yes | no | no | partial | partial |
 | Backoff propagation | yes | no | no | no | no |
 | OpenTelemetry | yes | no | no | no | partial |
+| Testing utilities | yes | no | no | no | no |
 | CLI audit | yes | no | no | no | no |
 | Zero runtime deps | yes | no | yes | — | no |
 | Provider-agnostic | yes | yes | yes | no | no |
@@ -804,12 +1065,15 @@ Fully typed. All configuration options, events, errors, and report shapes have p
 ```typescript
 import type {
   RateLimiterConfig,
+  ModelLimits,
+  ScopeConfig,
   CostReport,
   EventMap,
   QueuedEvent,
   Priority,
-  ModelLimits,
 } from 'ai-sdk-rate-limiter'
+
+import type { CallRecord } from 'ai-sdk-rate-limiter/testing'
 ```

 ---
package/package.json
CHANGED