ai-sdk-rate-limiter 0.7.0 → 0.7.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +266 -2
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -18,6 +18,7 @@ Every developer building with LLMs hits this eventually:
18
18
  - Your Node.js server runs 4 instances. They race against the same API quota
19
19
  - A bulk job spends $300 overnight and nobody notices until the bill arrives
20
20
  - You have no idea which model is responsible for the cost spike
21
+ - In a multi-tenant app, one user's burst shouldn't block everyone else
21
22
 
22
23
  Every existing tool solves one of these. This solves all of them.
23
24
 
@@ -48,16 +49,24 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
48
49
 
49
50
  **Priority queuing** — Queued requests drain in priority order (`high` before `normal` before `low`), FIFO within the same priority. Your user-facing requests skip ahead of background jobs.
50
51
 
52
+ **Concurrency limits** — Optional `maxConcurrent` cap per model enforced as a semaphore. Requests queue behind in-flight ones, then release slots as they complete.
53
+
51
54
  **Smart retry** — Retries on 429, 500, 502, 503, 504 with exponential backoff + jitter. Honors the `Retry-After` header exactly — if the API says wait 3 seconds, waits 3 seconds, not 30.
52
55
 
53
56
  **Cost tracking** — Records actual token usage from every response. Reports hourly, daily, and monthly spend per model. Optionally enforces budget caps.
54
57
 
58
+ **Multi-tenant scoped limits** — Give each user or org its own isolated rate limit window without running separate limiter instances. Wildcard patterns match user tiers.
59
+
60
+ **AbortSignal propagation** — Cancelling a request (e.g. user navigates away) immediately removes it from the queue. No wasted API calls for abandoned requests.
61
+
55
62
  **Built-in model registry** — Knows the RPM, ITPM, and per-token pricing for every major OpenAI, Anthropic, Google, Groq, Mistral, and Cohere model out of the box. Nothing to configure to get started.
56
63
 
57
64
  **Raw SDK support** — Works with the native OpenAI, Anthropic, Groq, Mistral, and Cohere SDKs directly via a transparent JavaScript Proxy. No Vercel AI SDK required.
58
65
 
59
66
  **OpenTelemetry** — Drop-in OTel plugin that emits GenAI-spec spans for every request. Works with any OTel-compatible tracer.
60
67
 
68
+ **Testing utilities** — `createTestLimiter()` records every completed call so you can assert on model usage, token counts, and costs in unit tests.
69
+
61
70
  **CLI audit** — `npx ai-sdk-rate-limiter audit` probes your API keys to detect your actual tier limits and generates a ready-to-paste config override.
62
71
 
63
72
  ---
@@ -68,6 +77,9 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
68
77
  - [Raw SDK proxy](#raw-sdk-proxy)
69
78
  - [Configuration reference](#configuration-reference)
70
79
  - [Per-request options](#per-request-options)
80
+ - [Multi-tenant scoped limits](#multi-tenant-scoped-limits)
81
+ - [Concurrency limits](#concurrency-limits)
82
+ - [AbortSignal support](#abortsignal-support)
71
83
  - [Cost tracking](#cost-tracking)
72
84
  - [Budget fallback routing](#budget-fallback-routing)
73
85
  - [Multi-instance Redis store](#multi-instance-redis-store)
@@ -75,6 +87,7 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
75
87
  - [Backpressure](#backpressure)
76
88
  - [Error handling](#error-handling)
77
89
  - [OpenTelemetry](#opentelemetry)
90
+ - [Testing utilities](#testing-utilities)
78
91
  - [CLI audit](#cli-audit)
79
92
  - [Model registry](#model-registry)
80
93
  - [Advanced usage](#advanced-usage)
@@ -209,7 +222,7 @@ Everything has a sensible default. Override only what you need.
209
222
  const limiter = createRateLimiter({
210
223
  // Override or extend built-in model limits for your API tier
211
224
  limits: {
212
- 'gpt-4o': { rpm: 500, itpm: 30_000 },
225
+ 'gpt-4o': { rpm: 500, itpm: 30_000, maxConcurrent: 20 },
213
226
  'claude-opus-4-6': { rpm: 50, itpm: 20_000 },
214
227
  },
215
228
 
@@ -241,6 +254,13 @@ const limiter = createRateLimiter({
241
254
  retryOn: [429, 500, 502, 503, 504],
242
255
  },
243
256
 
257
+ // Per-scope rate limit overrides for multi-tenant use cases
258
+ scopes: {
259
+ 'user:free:*': { rpm: 5, itpm: 10_000 },
260
+ 'user:pro:*': { rpm: 60, itpm: 200_000 },
261
+ 'org:*': { rpm: 300, maxConcurrent: 20 },
262
+ },
263
+
244
264
  // Observability — see Events section for all available events
245
265
  on: {
246
266
  rateLimited: ({ model, source, resetAt }) =>
@@ -284,12 +304,155 @@ await generateText({
284
304
  rateLimiter: { priority: 'low' },
285
305
  },
286
306
  })
307
+
308
+ // Per-request scope (overrides any static scope set in limiter.wrap())
309
+ await generateText({
310
+ model,
311
+ prompt: 'User message...',
312
+ providerOptions: {
313
+ rateLimiter: { scope: `user:${userId}` },
314
+ },
315
+ })
287
316
  ```
288
317
 
289
318
  This is the recommended way to colocate user requests and background jobs on the same model without background jobs starving users.
290
319
 
291
320
  ---
292
321
 
322
+ ## Multi-tenant scoped limits
323
+
324
+ Scoped limits give each user, org, or tenant its own isolated rate limit window. A burst from one user doesn't consume quota for anyone else.
325
+
326
+ ### Static scope on the model
327
+
328
+ ```typescript
329
+ // Each user gets their own model with an isolated window
330
+ function getModelForUser(userId: string) {
331
+ return limiter.wrap(openai('gpt-4o'), { scope: `user:${userId}` })
332
+ }
333
+
334
+ // user:alice has its own RPM/ITPM window independent of user:bob
335
+ const aliceModel = getModelForUser('alice')
336
+ const bobModel = getModelForUser('bob')
337
+ ```
338
+
339
+ ### Per-request scope
340
+
341
+ ```typescript
342
+ // Same wrapped model, different scope per call
343
+ await generateText({
344
+ model,
345
+ prompt: req.body.message,
346
+ providerOptions: {
347
+ rateLimiter: { scope: `user:${req.user.id}` },
348
+ },
349
+ })
350
+ ```
351
+
352
+ Per-request scope takes precedence over any static scope set in `limiter.wrap()`.
353
+
354
+ ### Defining scope-level limits
355
+
356
+ Use `config.scopes` to define separate rate limits for each scope tier. Keys support `*` wildcards:
357
+
358
+ ```typescript
359
+ const limiter = createRateLimiter({
360
+ scopes: {
361
+ 'user:free:*': { rpm: 5, itpm: 10_000 }, // free tier: 5 rpm each
362
+ 'user:pro:*': { rpm: 60, itpm: 200_000 }, // pro tier: 60 rpm each
363
+ 'org:*': { rpm: 300, maxConcurrent: 20 },
364
+ },
365
+ })
366
+
367
+ // Each scope gets its own isolated window under the matched limits
368
+ await generateText({
369
+ model,
370
+ providerOptions: {
371
+ rateLimiter: { scope: 'user:free:alice' }, // matches 'user:free:*' → 5 rpm
372
+ },
373
+ })
374
+ ```
375
+
376
+ **Scope fields:**
377
+
378
+ | Field | Description |
379
+ |---|---|
380
+ | `rpm` | Max requests per minute for this scope |
381
+ | `itpm` | Max input tokens per minute for this scope |
382
+ | `maxConcurrent` | Max concurrent in-flight requests for this scope |
383
+
384
+ When a scope matches, its limits replace the model's global limits for that request. Each scope gets a fully independent sliding window — `user:alice` and `user:bob` don't share quota.
385
+
386
+ If no `scopes` config is defined, the model's global limits apply to all scoped requests.
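The matching rules above can be sketched roughly as follows. This is a minimal illustration, not the library's code: `patternToRegExp` and `resolveLimits` are hypothetical names, and two behaviours are assumptions the README does not confirm: the first matching pattern wins, and a scope that matches no pattern falls back to the model's limits.

```typescript
// Hypothetical types and helpers for illustration only.
type Limits = { rpm?: number; itpm?: number; maxConcurrent?: number }

// Convert a pattern such as 'user:free:*' into an anchored RegExp.
// Regex metacharacters are escaped first, then each '*' becomes '.*'.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&')
  return new RegExp(`^${escaped.replace(/\*/g, '.*')}$`)
}

// Pick the limits for a request: the matched scope's limits replace the
// model's limits; with no scope or no `scopes` config, model limits apply.
function resolveLimits(
  scope: string | undefined,
  scopes: Record<string, Limits> | undefined,
  modelLimits: Limits,
): Limits {
  if (scope === undefined || scopes === undefined) return modelLimits
  for (const [pattern, limits] of Object.entries(scopes)) {
    // Assumption: first match in insertion order wins.
    if (patternToRegExp(pattern).test(scope)) return limits
  }
  // Assumption: an unmatched scope falls back to the model's limits.
  return modelLimits
}
```

Escaping before expanding `*` keeps characters such as `.` in a scope name from being read as regex syntax. Whether fields missing from a scope entry inherit from the model is not stated in the README, so this sketch returns the scope entry as-is.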
387
+
388
+ ---
389
+
390
+ ## Concurrency limits
391
+
392
+ Limit how many requests to a model can be in-flight simultaneously. Useful for:
393
+ - Preventing connection pool exhaustion
394
+ - Controlling cost burn rate during spikes
395
+ - Enforcing per-scope parallelism
396
+
397
+ ```typescript
398
+ const limiter = createRateLimiter({
399
+ limits: {
400
+ 'gpt-4o': {
401
+ rpm: 500,
402
+ maxConcurrent: 10, // at most 10 requests executing at once
403
+ },
404
+ },
405
+ })
406
+ ```
407
+
408
+ Once `maxConcurrent` slots are occupied, new requests queue behind them. Each slot is released when its request completes (success or failure). Concurrency is checked after the rate limit slot is acquired — both limits apply independently.
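To make the slot lifecycle concrete, here is a minimal semaphore sketch. The `Semaphore` class and its `run` helper are hypothetical and only illustrate the behaviour described above; the library's internals may differ.

```typescript
// Hypothetical semaphore for illustration only.
class Semaphore {
  activeCount = 0
  private readonly maxConcurrent: number
  private readonly waiters: Array<() => void> = []

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent
  }

  get waiting(): number {
    return this.waiters.length
  }

  // Resolves immediately if a slot is free, otherwise once one is released.
  acquire(): Promise<void> {
    if (this.activeCount < this.maxConcurrent) {
      this.activeCount++
      return Promise.resolve()
    }
    return new Promise<void>((resolve) => this.waiters.push(resolve))
  }

  // Free the slot, then pass it to the oldest waiter if there is one.
  release(): void {
    this.activeCount--
    const next = this.waiters.shift()
    if (next) {
      this.activeCount++
      next()
    }
  }

  // Run a task inside a slot; `finally` releases on success or failure.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire()
    try {
      return await task()
    } finally {
      this.release()
    }
  }
}
```

Releasing in `finally` is what keeps a failed request from leaking its slot and slowly shrinking the effective concurrency.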
409
+
410
+ Concurrency limits also work in scoped contexts:
411
+
412
+ ```typescript
413
+ const limiter = createRateLimiter({
414
+ scopes: {
415
+ 'org:*': { rpm: 300, maxConcurrent: 20 },
416
+ },
417
+ })
418
+ ```
419
+
420
+ ---
421
+
422
+ ## AbortSignal support
423
+
424
+ Pass an `AbortSignal` to cancel a request that's waiting in the queue. If the signal fires before the request starts executing, the request is removed from the queue immediately and its promise rejects with an `AbortError`.
425
+
426
+ ```typescript
427
+ const controller = new AbortController()
428
+
429
+ // User closes the browser tab — cancel pending AI requests
430
+ window.addEventListener('beforeunload', () => controller.abort())
431
+
432
+ const result = await generateText({
433
+ model,
434
+ prompt: 'Long running task...',
435
+ abortSignal: controller.signal,
436
+ })
437
+ ```
438
+
439
+ ```typescript
440
+ // With timeout
441
+ const signal = AbortSignal.timeout(5_000)
442
+
443
+ try {
444
+ const result = await generateText({ model, prompt, abortSignal: signal })
445
+ } catch (err) {
446
+ if (err instanceof Error && (err.name === 'AbortError' || err.name === 'TimeoutError')) {
447
+ console.log('Request cancelled (timed out or aborted)')
448
+ }
449
+ }
450
+ ```
451
+
452
+ The signal threads through both the rate-limit queue and the concurrency queue. A request that's already executing is not affected — only queued requests can be aborted this way.
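A rough sketch of how a queued waiter can be cancelled this way follows. `AbortableQueue` is a hypothetical class written for illustration; it assumes a runtime with the standard `AbortController` and `AbortSignal` globals and is not taken from the library's source.

```typescript
// Hypothetical queue for illustration only.
class AbortableQueue {
  private readonly waiters: Array<() => void> = []

  get length(): number {
    return this.waiters.length
  }

  // Resolves when next() reaches this waiter; rejects with an error named
  // 'AbortError' if the signal fires while the waiter is still queued.
  wait(signal?: AbortSignal): Promise<void> {
    return new Promise<void>((resolve, reject) => {
      const abortError = (): Error => {
        const err = new Error('Request aborted while queued')
        err.name = 'AbortError'
        return err
      }
      if (signal?.aborted) {
        reject(abortError())
        return
      }
      const onAbort = (): void => {
        // Splice this waiter out so it never consumes a slot.
        const index = this.waiters.indexOf(start)
        if (index !== -1) this.waiters.splice(index, 1)
        reject(abortError())
      }
      const start = (): void => {
        // Once the request starts executing, the queue stops listening.
        signal?.removeEventListener('abort', onAbort)
        resolve()
      }
      this.waiters.push(start)
      signal?.addEventListener('abort', onAbort, { once: true })
    })
  }

  // Let the oldest waiter proceed, if any.
  next(): void {
    this.waiters.shift()?.()
  }
}
```

Removing the listener when the waiter starts mirrors the point above: after a request leaves the queue, aborting the signal no longer touches the queue.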
453
+
454
+ ---
455
+
293
456
  ## Cost tracking
294
457
 
295
458
  ```typescript
@@ -476,6 +639,9 @@ try {
476
639
  } else if (error instanceof RateLimitExceededError) {
477
640
  // Rate limit hit and the request could not be queued
478
641
  console.error(`${error.model} ${error.limitType} limit of ${error.limit} exceeded`)
642
+ } else if (error.name === 'AbortError') {
643
+ // Request was cancelled via AbortSignal before it started executing
644
+ console.log('Request cancelled')
479
645
  }
480
646
  }
481
647
  ```
@@ -549,6 +715,67 @@ const limiter = createRateLimiter({
549
715
 
550
716
  ---
551
717
 
718
+ ## Testing utilities
719
+
720
+ The `ai-sdk-rate-limiter/testing` entry point provides a test-friendly limiter that records every completed call. Use it to assert on model usage, token counts, and costs in unit tests without mocking internals.
721
+
722
+ ```typescript
723
+ import { createTestLimiter } from 'ai-sdk-rate-limiter/testing'
724
+ import { openai } from '@ai-sdk/openai'
725
+ import { generateText } from 'ai'
726
+
727
+ const limiter = createTestLimiter()
728
+ const model = limiter.wrap(openai('gpt-4o'))
729
+
730
+ // Run your code under test
731
+ await generateText({ model, prompt: 'Hello!' })
732
+ await generateText({ model, prompt: 'Another request' })
733
+
734
+ // Assert on recorded calls
735
+ const calls = limiter.getCalls()
736
+ expect(calls).toHaveLength(2)
737
+ expect(calls[0].modelId).toBe('gpt-4o')
738
+ expect(calls[0].inputTokens).toBeGreaterThan(0)
739
+ expect(calls[0].costUsd).toBeGreaterThan(0)
740
+
741
+ // Reset between tests
742
+ limiter.reset()
743
+ ```
744
+
745
+ `createTestLimiter()` accepts the same config as `createRateLimiter()`, so you can test budget enforcement, retry behavior, and other scenarios with real config:
746
+
747
+ ```typescript
748
+ const limiter = createTestLimiter({
749
+ cost: { budget: { daily: 0.01 }, onExceeded: 'throw' },
750
+ })
751
+
752
+ // Test that your code handles budget errors gracefully
753
+ ```
754
+
755
+ **`CallRecord` fields:**
756
+
757
+ | Field | Type | Description |
758
+ |---|---|---|
759
+ | `modelId` | `string` | Model that was called |
760
+ | `provider` | `string` | Provider name |
761
+ | `inputTokens` | `number` | Input tokens from the API response |
762
+ | `outputTokens` | `number` | Output tokens from the API response |
763
+ | `costUsd` | `number` | Cost in USD for this call |
764
+ | `latencyMs` | `number` | Total latency in ms |
765
+ | `streaming` | `boolean` | Whether this was a streaming call |
766
+ | `timestamp` | `number` | Unix timestamp (ms) when the call completed |
767
+
768
+ **Methods:**
769
+
770
+ | Method | Description |
771
+ |---|---|
772
+ | `getCalls()` | Returns all completed calls in chronological order |
773
+ | `reset()` | Clears call history |
774
+
775
+ All other `RateLimiter` methods (`wrap`, `rawProxy`, `getCostReport`, `getStatus`, `on`, `off`) behave the same as on a limiter created with `createRateLimiter()`.
776
+
777
+ ---
778
+
552
779
  ## CLI audit
553
780
 
554
781
  Probe your API keys to discover your actual rate limit tier and generate a ready-to-paste config override:
@@ -724,6 +951,30 @@ const model = req.user.plan === 'paid'
724
951
  : freeLimiter.wrap(openai('gpt-4o-mini'))
725
952
  ```
726
953
 
954
+ ### Per-user limits with a single limiter
955
+
956
+ For large numbers of users, scoped limits are more efficient than one limiter per user:
957
+
958
+ ```typescript
959
+ const limiter = createRateLimiter({
960
+ scopes: {
961
+ 'user:free:*': { rpm: 5, itpm: 10_000 },
962
+ 'user:pro:*': { rpm: 60, itpm: 200_000 },
963
+ },
964
+ })
965
+
966
+ const model = limiter.wrap(openai('gpt-4o'))
967
+
968
+ // Each request is tracked in an isolated window per user
969
+ await generateText({
970
+ model,
971
+ prompt: req.body.message,
972
+ providerOptions: {
973
+ rateLimiter: { scope: `user:${req.user.plan}:${req.user.id}` },
974
+ },
975
+ })
976
+ ```
977
+
727
978
  ### Combine OTel tracing with event logging
728
979
 
729
980
  ```typescript
@@ -764,6 +1015,12 @@ const limiter = createRateLimiter({ store: new MyStore() })
764
1015
 
765
1016
  **Queue** — A sorted priority queue per model, ordered by `priority` then enqueue time (FIFO within same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.
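The ordering rule can be illustrated with a small sorted-insert queue. This is a sketch under assumptions: `PriorityQueue`, `RANK`, and the local `Priority` type are defined here for illustration, and a sequence counter stands in for the enqueue time.

```typescript
// Local illustration types; not imported from the library.
type Priority = 'high' | 'normal' | 'low'
const RANK: Record<Priority, number> = { high: 0, normal: 1, low: 2 }

interface QueueEntry<T> {
  priority: Priority
  seq: number // stand-in for enqueue time
  value: T
}

class PriorityQueue<T> {
  private readonly entries: QueueEntry<T>[] = []
  private seq = 0

  get size(): number {
    return this.entries.length
  }

  enqueue(value: T, priority: Priority = 'normal'): void {
    const entry: QueueEntry<T> = { priority, seq: this.seq++, value }
    // Insert before the first entry of strictly lower priority, which
    // keeps FIFO order among entries that share a priority.
    const index = this.entries.findIndex((e) => RANK[e.priority] > RANK[priority])
    if (index === -1) this.entries.push(entry)
    else this.entries.splice(index, 0, entry)
  }

  // Remove and return the next value to drain, if any.
  dequeue(): T | undefined {
    return this.entries.shift()?.value
  }
}
```

Keeping the array sorted on insert makes draining a plain `shift()`, which suits a queue that is drained in bursts when a window entry expires.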
766
1017
 
1018
+ **Concurrency** — A semaphore (`activeCount` + `concurrencyWaiters`) per model key, checked after the rate limit slot is acquired. `release()` decrements the count and unblocks the next waiter.
1019
+
1020
+ **Scoped limits** — Each scoped request uses a key in the form `scope:provider:modelId`. That key gets its own independent sliding window. Wildcard patterns in `config.scopes` are matched with `*` → `.*` regex expansion.
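As a sketch of why two scopes never share quota, the example below keys a request-count sliding window by the `scope:provider:modelId` string. `windowKey` and `SlidingWindowLimiter` are hypothetical names, the unscoped key form is an assumption, and only RPM is modelled.

```typescript
// Hypothetical helpers for illustration only.
function windowKey(scope: string | undefined, provider: string, modelId: string): string {
  // Assumption: unscoped requests use `provider:modelId`.
  return scope ? `${scope}:${provider}:${modelId}` : `${provider}:${modelId}`
}

class SlidingWindowLimiter {
  private readonly windows = new Map<string, number[]>()
  private readonly windowMs: number

  constructor(windowMs: number = 60_000) {
    this.windowMs = windowMs
  }

  // Records the request and returns true if `key` is under `rpm`
  // within the trailing window; returns false otherwise.
  tryAcquire(key: string, rpm: number, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs
    const recent = (this.windows.get(key) ?? []).filter((t) => t > cutoff)
    const allowed = recent.length < rpm
    if (allowed) recent.push(now)
    this.windows.set(key, recent)
    return allowed
  }
}
```

Because each key owns its own timestamp list, exhausting `user:alice:openai:gpt-4o` leaves `user:bob:openai:gpt-4o` untouched.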
1021
+
1022
+ **AbortSignal** — An `abort` listener is attached to the signal while the request waits in the rate-limit queue or the concurrency queue. If the signal fires, the request is spliced out of whichever queue it's in and the promise rejects with an `AbortError`.
1023
+
767
1024
  **Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the entire model key in the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common thundering-herd failure where you retry one request while 10 others immediately follow and all get 429s.
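One way to picture key-level backoff is a shared gate that every request on the same model key consults before sending. The sketch below is an assumption-laden illustration: `parseRetryAfter` and `BackoffGate` are hypothetical names, and the header parsing covers the two forms HTTP allows (delay in seconds or an HTTP date).

```typescript
// Hypothetical helpers for illustration only.

// Parse a Retry-After value into milliseconds.
// Returns undefined when the header is missing or unparseable.
function parseRetryAfter(
  header: string | null | undefined,
  now: number = Date.now(),
): number | undefined {
  if (!header) return undefined
  const trimmed = header.trim()
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000
  const date = Date.parse(trimmed)
  return Number.isNaN(date) ? undefined : Math.max(0, date - now)
}

class BackoffGate {
  private readonly blockedUntil = new Map<string, number>()

  // Record a backoff for the whole key; a later deadline wins.
  block(key: string, delayMs: number, now: number = Date.now()): void {
    const until = now + delayMs
    if (until > (this.blockedUntil.get(key) ?? 0)) this.blockedUntil.set(key, until)
  }

  // Milliseconds any request on this key should still wait (0 when clear).
  remaining(key: string, now: number = Date.now()): number {
    return Math.max(0, (this.blockedUntil.get(key) ?? 0) - now)
  }
}
```

A queue drain loop would check `remaining(key)` before releasing the next waiter, so one 429 pauses every request behind it rather than only the one that failed.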
768
1025
 
769
1026
  **Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk (Vercel AI SDK) or the final usage chunk (raw proxy).
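The reserve-then-reconcile step can be sketched as below. `estimateTokens` and `TokenWindow` are hypothetical names; the sketch uses the ~4 characters per token heuristic stated above and ignores window expiry to stay short.

```typescript
// Hypothetical helpers for illustration only.

// Rough estimate: about 4 characters per token. Not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

class TokenWindow {
  private used = 0
  private readonly limit: number

  constructor(limit: number) {
    this.limit = limit
  }

  get usedTokens(): number {
    return this.used
  }

  // Reserve the estimate before the request fires.
  // Returns false if the reservation would exceed the limit.
  reserve(estimated: number): boolean {
    if (this.used + estimated > this.limit) return false
    this.used += estimated
    return true
  }

  // After the response, replace the estimate with the actual usage.
  reconcile(estimated: number, actual: number): void {
    this.used += actual - estimated
  }
}
```

Reserving up front keeps concurrent requests from all passing the check against the same stale count; reconciling afterwards stops estimation error from accumulating in the window.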
@@ -781,10 +1038,14 @@ const limiter = createRateLimiter({ store: new MyStore() })
781
1038
  | Model-aware limits | yes | no | no | no | partial |
782
1039
  | ITPM / token tracking | yes | no | no | no | no |
783
1040
  | Priority queue | yes | yes | no | no | no |
1041
+ | Concurrency limits | yes | yes | yes | no | no |
784
1042
  | Cost tracking + budgets | yes | no | no | no | no |
1043
+ | Multi-tenant scoped limits | yes | no | no | no | no |
1044
+ | AbortSignal propagation | yes | no | no | no | no |
785
1045
  | Retry-After header | yes | no | no | partial | partial |
786
1046
  | Backoff propagation | yes | no | no | no | no |
787
1047
  | OpenTelemetry | yes | no | no | no | partial |
1048
+ | Testing utilities | yes | no | no | no | no |
788
1049
  | CLI audit | yes | no | no | no | no |
789
1050
  | Zero runtime deps | yes | no | yes | — | no |
790
1051
  | Provider-agnostic | yes | yes | yes | no | no |
@@ -804,12 +1065,15 @@ Fully typed. All configuration options, events, errors, and report shapes have p
804
1065
  ```typescript
805
1066
  import type {
806
1067
  RateLimiterConfig,
1068
+ ModelLimits,
1069
+ ScopeConfig,
807
1070
  CostReport,
808
1071
  EventMap,
809
1072
  QueuedEvent,
810
1073
  Priority,
811
- ModelLimits,
812
1074
  } from 'ai-sdk-rate-limiter'
1075
+
1076
+ import type { CallRecord } from 'ai-sdk-rate-limiter/testing'
813
1077
  ```
814
1078
 
815
1079
  ---
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "ai-sdk-rate-limiter",
3
- "version": "0.7.0",
3
+ "version": "0.7.1",
4
4
  "description": "Smart rate limiting, queuing, and cost tracking middleware for AI SDK calls. Works across providers.",
5
5
  "type": "module",
6
6
  "main": "./dist/index.cjs",