
ai-sdk-rate-limiter 0.7.1 → 0.9.0


Files changed (36)

package/README.md +363 -12
package/dist/index.cjs +626 -155
package/dist/index.cjs.map +1 -1
package/dist/index.d.cts +17 -3
package/dist/index.d.ts +17 -3
package/dist/index.js +625 -156
package/dist/index.js.map +1 -1
package/dist/otel.d.cts +1 -1
package/dist/otel.d.ts +1 -1
package/dist/prometheus.cjs +133 -0
package/dist/prometheus.cjs.map +1 -0
package/dist/prometheus.d.cts +37 -0
package/dist/prometheus.d.ts +37 -0
package/dist/prometheus.js +131 -0
package/dist/prometheus.js.map +1 -0
package/dist/redis.cjs +160 -19
package/dist/redis.cjs.map +1 -1
package/dist/redis.d.cts +39 -2
package/dist/redis.d.ts +39 -2
package/dist/redis.js +160 -20
package/dist/redis.js.map +1 -1
package/dist/statsd.cjs +67 -0
package/dist/statsd.cjs.map +1 -0
package/dist/statsd.d.cts +46 -0
package/dist/statsd.d.ts +46 -0
package/dist/statsd.js +65 -0
package/dist/statsd.js.map +1 -0
package/dist/testing.cjs +624 -155
package/dist/testing.cjs.map +1 -1
package/dist/testing.d.cts +1 -1
package/dist/testing.d.ts +1 -1
package/dist/testing.js +624 -155
package/dist/testing.js.map +1 -1
package/dist/{types-D7qskXNw.d.cts → types-CUPpMRPE.d.cts} +146 -4
package/dist/{types-D7qskXNw.d.ts → types-CUPpMRPE.d.ts} +146 -4
package/package.json +21 -1

package/README.md CHANGED

@@ -2,6 +2,8 @@
 Smart rate limiting, queuing, and cost tracking for AI API calls. Works across providers. Zero required dependencies.
+[![npm](https://img.shields.io/npm/v/ai-sdk-rate-limiter)](https://www.npmjs.com/package/ai-sdk-rate-limiter)
 ```
 npm install ai-sdk-rate-limiter
 ```
@@ -63,6 +65,24 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
 **Raw SDK support** — Works with the native OpenAI, Anthropic, Groq, Mistral, and Cohere SDKs directly via a transparent JavaScript Proxy. No Vercel AI SDK required.
+**Circuit breaker** — Automatically opens on repeated 5xx failures, blocking requests until the upstream recovers. Transitions to half-open state to probe recovery, then closes once healthy.
+**Graceful shutdown** — `limiter.shutdown()` drains in-flight requests before the process exits. New requests received during shutdown are rejected with `ShutdownError`.
+**Persistent cost tracking** — `RedisCostStore` (from `ai-sdk-rate-limiter/redis`) survives process restarts so budget caps remain accurate. `warmUp()` pre-loads historical spend from the store on startup.
+**Per-scope cost attribution** — `getCostReport()` includes a `byScope` breakdown so you can see exact spend per user, org, or tenant.
+**Auto-detected limits** — Parses `x-ratelimit-limit-*` headers from every response and tightens the local windows automatically. Your config always wins; detected values fill in where you haven't overridden.
+**Prometheus metrics** — `createPrometheusPlugin()` (from `ai-sdk-rate-limiter/prometheus`) exports counters, gauges, and histograms for every request, token, cost, retry, and queue event.
+**StatsD / DogStatsD** — `createStatsDPlugin()` (from `ai-sdk-rate-limiter/statsd`) bridges all events to any StatsD-compatible client.
+**Call timeout** — `callTimeout` option kills a hung AI call after N milliseconds via `Promise.race()` — independent of the Vercel AI SDK `abortSignal`.
+**Fallback chains** — `fallback` now accepts an array of models. On `BudgetExceededError`, the chain is walked in order until one succeeds.
 **OpenTelemetry** — Drop-in OTel plugin that emits GenAI-spec spans for every request. Works with any OTel-compatible tracer.
 **Testing utilities** — `createTestLimiter()` records every completed call so you can assert on model usage, token counts, and costs in unit tests.
@@ -80,9 +100,15 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
 - [Multi-tenant scoped limits](#multi-tenant-scoped-limits)
 - [Concurrency limits](#concurrency-limits)
 - [AbortSignal support](#abortsignal-support)
+- [Call timeout](#call-timeout)
 - [Cost tracking](#cost-tracking)
 - [Budget fallback routing](#budget-fallback-routing)
+- [Persistent cost tracking](#persistent-cost-tracking)
 - [Multi-instance Redis store](#multi-instance-redis-store)
+- [Circuit breaker](#circuit-breaker)
+- [Graceful shutdown](#graceful-shutdown)
+- [Prometheus metrics](#prometheus-metrics)
+- [StatsD metrics](#statsd-metrics)
 - [Events](#events)
 - [Backpressure](#backpressure)
 - [Error handling](#error-handling)
@@ -224,6 +250,9 @@ const limiter = createRateLimiter({
   limits: {
     'gpt-4o':          { rpm: 500, itpm: 30_000, maxConcurrent: 20 },
     'claude-opus-4-6': { rpm: 50,  itpm: 20_000 },
+    // rpd  — requests per day (enforced in a rolling 24-hour window)
+    // otpm — output tokens per minute (based on actuals from completed requests)
+    'gemini-1.5-flash': { rpm: 15, rpd: 1_500, otpm: 500_000 },
   },
   // Cost budgets and behavior when exceeded
@@ -234,6 +263,14 @@ const limiter = createRateLimiter({
       monthly: 500,   // USD — hard cap per month
     },
     onExceeded: 'throw', // 'throw' | 'queue' | 'fallback'
+    store: new RedisCostStore(redis), // persist cost history across restarts (optional)
+  },
+  // Circuit breaker — open on repeated 5xx failures, probe recovery automatically
+  circuit: {
+    failureThreshold: 5,     // consecutive failures before opening
+    cooldownMs:       30_000, // how long to stay open before probing
+    tripOn:           [500, 502, 503, 504], // which status codes count as failures
   },
   // Queue behavior
@@ -252,6 +289,7 @@ const limiter = createRateLimiter({
     jitter:          true,           // ±30% randomness (prevents thundering herd)
     parseRetryAfter: true,           // honor Retry-After header from 429 responses
     retryOn:         [429, 500, 502, 503, 504],
+    callTimeout:     30_000,         // ms — kill hung AI calls via Promise.race()
   },
   // Per-scope rate limit overrides for multi-tenant use cases
@@ -290,8 +328,9 @@ await generateText({
   prompt: 'Urgent user request...',
   providerOptions: {
     rateLimiter: {
-      priority: 'high',   // 'high' | 'normal' | 'low'
-      timeout:  10_000,   // override the default queue timeout for this request
+      priority:    'high',   // 'high' | 'normal' | 'low'
+      timeout:     10_000,   // override the default queue timeout for this request
+      callTimeout: 15_000,   // kill the AI call itself if it hangs beyond 15s
     },
   },
 })
@@ -373,7 +412,17 @@ await generateText({
 })
 ```
-**Scope fields:**
+**Model limit fields:**
+| Field | Description |
+|---|---|
+| `rpm` | Max requests per minute |
+| `itpm` | Max input tokens per minute |
+| `otpm` | Max output tokens per minute (based on completed request actuals) |
+| `rpd` | Max requests per day (rolling 24-hour window) |
+| `maxConcurrent` | Max concurrent in-flight requests |
+**Scope fields (`config.scopes`):**
 | Field | Description |
 |---|---|
@@ -453,6 +502,30 @@ The signal threads through both the rate-limit queue and the concurrency queue.
 ---
+## Call timeout
+`callTimeout` kills a hung AI API call after N milliseconds using `Promise.race()`. This is distinct from the queue `timeout` (which fires if a request waits too long to _start_) — `callTimeout` fires if the request is already executing but the API hasn't responded.
+```typescript
+// Global default for all requests
+const limiter = createRateLimiter({
+  retry: { callTimeout: 30_000 }, // abort any call that takes longer than 30s
+})
+// Per-request override
+await generateText({
+  model,
+  prompt: '...',
+  providerOptions: {
+    rateLimiter: { callTimeout: 10_000 }, // stricter timeout for this request
+  },
+})
+```
+When a `callTimeout` fires, the request throws a `TimeoutError` (native `DOMException` with `name: 'TimeoutError'`). The retry logic treats it as a retryable failure if the status code is in `retryOn`. Set `callTimeout` to `undefined` (the default) to disable it.
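The `Promise.race()` approach described above can be sketched in a few lines. This is an illustration of the general technique, not the package's source: `withCallTimeout` and `CallTimeoutError` are hypothetical names, and a plain `Error` subclass stands in for the `DOMException` the README mentions.

```typescript
// Hypothetical helper for illustration; not part of ai-sdk-rate-limiter's API.
class CallTimeoutError extends Error {
  constructor(ms: number) {
    super(`Call exceeded ${ms} ms`)
    this.name = 'TimeoutError'
  }
}

function withCallTimeout<T>(call: Promise<T>, ms: number | undefined): Promise<T> {
  if (ms === undefined) return call // timeout disabled

  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new CallTimeoutError(ms)), ms)
  })

  // Whichever promise settles first wins. The timer is always cleared so a
  // finished call does not leave a pending timeout behind.
  return Promise.race([call, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer)
  })
}
```

Note that racing only stops the caller from waiting. The underlying call keeps running unless it is also cancelled by some other means, such as an abort signal.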
+---
 ## Cost tracking
 ```typescript
@@ -467,12 +540,18 @@ console.log(report)
 //   byModel: {
 //     'gpt-4o':      { requests: 120, inputTokens: 240_000, outputTokens: 60_000, costUsd: 1.20 },
 //     'gpt-4o-mini': { requests: 198, inputTokens: 380_000, outputTokens: 95_000, costUsd: 0.91 },
+//   },
+//   byScope: {
+//     'user:alice': { requests: 15, inputTokens: 30_000, outputTokens: 7_500, costUsd: 0.15 },
+//     'user:bob':   { requests: 8,  inputTokens: 12_000, outputTokens: 3_000, costUsd: 0.06 },
 //   }
 // }
 ```
 Costs are based on **actual token counts** from API responses — not estimates. The report uses rolling windows, so `hour` always means "the last 60 minutes."
+`byScope` is populated automatically when requests carry a `scope` (either set on `limiter.wrap()` or via `providerOptions.rateLimiter.scope`). Unscoped requests don't appear in `byScope`.
 ---
 ## Budget fallback routing
@@ -501,19 +580,73 @@ const model = limiter.wrap(
 const result = await generateText({ model, prompt })
 ```
+**Fallback chains** — pass an array to walk multiple fallbacks in order:
+```typescript
+const model = limiter.wrap(openai('gpt-4o'), {
+  fallback: [
+    openai('gpt-4o-mini'),    // try first
+    openai('gpt-3.5-turbo'),  // try second if gpt-4o-mini is also over budget
+  ],
+})
+```
+Each model in the chain is tried in order. If all are over budget, `BudgetExceededError` is thrown.
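As a rough mental model of the chain walk, the sketch below tries the primary model and then each fallback in order, returning the first one that is still within budget. The `Candidate` shape, `pickWithinBudget`, and the `isOverBudget` callback are assumptions made for illustration; the package's internal types and budget checks may differ.

```typescript
// Hypothetical shapes for illustration; the library's internal types may differ.
interface Candidate {
  id: string
}

function pickWithinBudget(
  primary: Candidate,
  fallbacks: Candidate[],
  isOverBudget: (candidate: Candidate) => boolean,
): Candidate {
  // Walk the primary model first, then each fallback in the order given.
  for (const candidate of [primary, ...fallbacks]) {
    if (!isOverBudget(candidate)) return candidate
  }
  // Stand-in for the BudgetExceededError the README describes.
  throw new Error('BudgetExceededError: every model in the chain is over budget')
}
```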
 **Behavior matrix:**
 | `onExceeded` | `fallback` configured | Outcome |
 |---|---|---|
 | `'throw'` | any | Throws `BudgetExceededError` |
-| `'fallback'` | yes | Transparently uses fallback model |
+| `'fallback'` | yes | Transparently walks the fallback chain |
 | `'fallback'` | no | Throws `BudgetExceededError` |
-| `'queue'` | any | Queues until period resets |
+| `'queue'` | any | Holds the request until the rolling window clears enough spend; throws `QueueTimeoutError` if `queue.timeout` elapses |
 Fallback usage is tracked under the fallback model's ID in `getCostReport()`.
 ---
+## Persistent cost tracking
+By default, cost history lives in memory and resets on restart. If your process restarts frequently (serverless, rolling deploys), budget caps could be bypassed because the new instance starts with $0 spend.
+`RedisCostStore` persists every cost entry to a Redis sorted set so budget caps survive restarts:
+```typescript
+import { createRateLimiter } from 'ai-sdk-rate-limiter'
+import { RedisCostStore } from 'ai-sdk-rate-limiter/redis'
+import Redis from 'ioredis'
+const redis = new Redis(process.env.REDIS_URL)
+const limiter = createRateLimiter({
+  cost: {
+    budget: { daily: 50 },
+    onExceeded: 'throw',
+    store: new RedisCostStore(redis), // persist entries to Redis
+  },
+})
+// On startup — pre-load the last 30 days of spend so the in-memory
+// window is accurate immediately (before any new requests)
+await limiter.warmUp()
+```
+**`warmUp()`** — loads the last 30 days of entries from the store into the in-memory cost tracker. Call it once after `createRateLimiter()`. Without it the limiter works correctly for new requests, but budget checks won't account for spend from previous process runs until the first request arrives.
+`RedisCostStore` options:
+```typescript
+new RedisCostStore(redis, {
+  keyPrefix: 'cost:myapp:',  // namespace key (default: 'airl:cost:')
+  ttlMs:     30 * 86_400_000, // TTL for the sorted set (default: 30 days)
+})
+```
+Errors from the cost store (connection failures, etc.) are swallowed silently — cost persistence is best-effort and never blocks request execution.
+---
 ## Multi-instance Redis store
 By default, rate limit state is in-memory (per-process). For multi-instance deployments — multiple pods, serverless replicas, workers — each instance has its own counters. Install the Redis store to share state:
@@ -546,7 +679,9 @@ new RedisStore(redis, {
 **How it works internally:**
-Each request runs a Lua script atomically that: removes stale entries from a sorted set, counts requests and tokens in the current window, checks against RPM and ITPM limits, and either reserves the slot or returns when the next slot opens. The local queue (priority ordering, drain timer, timeout handling) stays in-memory per instance — only the window counters are shared via Redis.
+Each request runs a Lua script atomically that: removes stale entries from a sorted set, counts requests and tokens in the current window, checks against RPM, ITPM, OTPM, and RPD limits, and either reserves the slot or returns when the next slot opens. The local queue (priority ordering, drain timer, timeout handling) stays in-memory per instance — only the window counters are shared via Redis.
+**Failover** — If Redis is unreachable (connection error, timeout), the store fails open: rate limit enforcement is suspended for that call and the request proceeds normally. Enforcement resumes as soon as the store recovers. This means AI calls never block on Redis availability — you trade enforcement precision for reliability during outages.
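The fail-open behaviour can be pictured as a thin wrapper around the store call. This is a minimal sketch under stated assumptions: `checkFailOpen` and `Decision` are hypothetical names, and `check` stands in for whatever store call performs the limit check.

```typescript
// Hypothetical wrapper for illustration; `check` stands in for a store call
// such as a Redis-backed limit check.
type Decision = { allowed: boolean; enforced: boolean }

async function checkFailOpen(check: () => Promise<boolean>): Promise<Decision> {
  try {
    return { allowed: await check(), enforced: true }
  } catch (err) {
    // Store unreachable: let the request through and record that no limit
    // was enforced for this call.
    return { allowed: true, enforced: false }
  }
}
```

Returning an `enforced` flag is one way to make outages visible to metrics without blocking the request path.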
 **Compatible clients** — any client with `eval()`, `get()`, and `set()` works: `ioredis`, `node-redis`, Upstash Redis.
@@ -554,6 +689,175 @@ Use the default `InMemoryStore` for single-instance deployments — it's more ac
 ---
+## Circuit breaker
+The circuit breaker protects against cascading failures when an upstream AI API is degrading. After N consecutive 5xx failures, the circuit opens and subsequent requests fail immediately with `CircuitOpenError` rather than piling up and timing out.
+```typescript
+import { createRateLimiter, CircuitOpenError } from 'ai-sdk-rate-limiter'
+const limiter = createRateLimiter({
+  circuit: {
+    failureThreshold: 5,      // open after 5 consecutive failures
+    cooldownMs:       30_000, // stay open for 30s, then probe
+    tripOn:           [500, 502, 503, 504], // which HTTP status codes trip the circuit
+  },
+  on: {
+    circuitOpen:   ({ model, openUntilMs }) =>
+      console.error(`Circuit open for ${model} until ${new Date(openUntilMs).toISOString()}`),
+    circuitClosed: ({ model }) =>
+      console.log(`Circuit closed for ${model} — upstream recovered`),
+  },
+})
+try {
+  const result = await generateText({ model, prompt })
+} catch (err) {
+  if (err instanceof CircuitOpenError) {
+    // Fail fast — upstream is degraded, don't pile on
+    return res.status(503).json({
+      error: 'AI service temporarily unavailable',
+      retryAfter: Math.ceil((err.openUntilMs - Date.now()) / 1000),
+    })
+  }
+}
+```
+**State machine:**
+- `CLOSED` (normal) — requests pass through; failures are counted
+- `OPEN` — requests fail immediately with `CircuitOpenError`; after `cooldownMs`, transitions to HALF_OPEN
+- `HALF_OPEN` — one probe request is allowed; success → CLOSED, failure → OPEN (resets cooldown)
+The circuit is per-model. A failing `gpt-4o` doesn't affect `gpt-4o-mini`. 429 rate-limit errors do **not** trip the circuit — only 5xx errors (or whatever you configure in `tripOn`) count as failures.
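The three states above map onto a small state machine. The sketch below is a generic circuit breaker written for illustration, not the package's implementation; `CircuitSketch` and its method names are hypothetical, and the clock is injectable so the transitions can be exercised without waiting.

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'

// Hypothetical class for illustration; not the library's internal breaker.
class CircuitSketch {
  private state: CircuitState = 'CLOSED'
  private failures = 0
  private openUntilMs = 0
  private probeInFlight = false
  private readonly failureThreshold: number
  private readonly cooldownMs: number
  private readonly now: () => number

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold
    this.cooldownMs = cooldownMs
    this.now = now
  }

  /** Returns true if a request may proceed right now. */
  allow(): boolean {
    // After the cooldown, move from OPEN to HALF_OPEN and permit one probe.
    if (this.state === 'OPEN' && this.now() >= this.openUntilMs) {
      this.state = 'HALF_OPEN'
      this.probeInFlight = false
    }
    if (this.state === 'CLOSED') return true
    if (this.state === 'HALF_OPEN' && !this.probeInFlight) {
      this.probeInFlight = true
      return true
    }
    return false
  }

  /** Any success closes the circuit and resets the consecutive-failure count. */
  onSuccess(): void {
    this.state = 'CLOSED'
    this.failures = 0
  }

  /** A failed probe, or reaching the threshold, opens the circuit and restarts the cooldown. */
  onFailure(): void {
    this.failures += 1
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN'
      this.openUntilMs = this.now() + this.cooldownMs
    }
  }

  current(): CircuitState {
    return this.state
  }
}
```

In this sketch the caller decides which errors count as failures, which corresponds to filtering on `tripOn` before calling `onFailure()`.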
+---
+## Graceful shutdown
+```typescript
+// On SIGTERM / process exit
+process.on('SIGTERM', async () => {
+  // Stop accepting new requests, wait up to 30s for in-flight ones to finish
+  await limiter.shutdown({ drainMs: 30_000 })
+  process.exit(0)
+})
+```
+After `shutdown()` is called:
+- New requests throw `ShutdownError` immediately
+- In-flight requests complete normally (up to `drainMs`)
+- The returned promise resolves when the queue drains or `drainMs` elapses
+```typescript
+import { ShutdownError } from 'ai-sdk-rate-limiter'
+try {
+  const result = await generateText({ model, prompt })
+} catch (err) {
+  if (err instanceof ShutdownError) {
+    // Process is shutting down — expected, not an error
+  }
+}
+```
+---
+## Prometheus metrics
+```
+npm install ai-sdk-rate-limiter
+```
+The `ai-sdk-rate-limiter/prometheus` entry point provides in-process Prometheus metrics with no external dependencies. Metrics are accumulated in memory and rendered to the text exposition format on demand.
+```typescript
+import { createRateLimiter } from 'ai-sdk-rate-limiter'
+import { createPrometheusPlugin } from 'ai-sdk-rate-limiter/prometheus'
+const prometheus = createPrometheusPlugin()
+const limiter = createRateLimiter({
+  on: prometheus,
+})
+// Expose /metrics endpoint (Express example)
+app.get('/metrics', (req, res) => {
+  res.set('Content-Type', 'text/plain; version=0.0.4')
+  res.send(prometheus.collect())
+})
+```
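For readers unfamiliar with the text exposition format that `collect()` is described as returning, the sketch below accumulates a labelled counter in memory and renders it on demand. `CounterSketch` is a hypothetical class written for illustration only; it skips `# HELP` lines and label-value escaping, and it makes no claim about how the plugin is implemented.

```typescript
type Labels = Record<string, string>

// Hypothetical in-memory counter for illustration; not the plugin's implementation.
class CounterSketch {
  private readonly values = new Map<string, number>()
  readonly name: string

  constructor(name: string) {
    this.name = name
  }

  inc(labels: Labels, by = 1): void {
    // Sort label names so the same label set always maps to the same series.
    const key = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(',')
    this.values.set(key, (this.values.get(key) ?? 0) + by)
  }

  render(): string {
    const lines = [`# TYPE ${this.name} counter`]
    this.values.forEach((value, key) => {
      lines.push(`${this.name}{${key}} ${value}`)
    })
    return lines.join('\n') + '\n'
  }
}
```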
+**Metrics exported:**
+| Metric | Type | Description |
+|---|---|---|
+| `ai_requests_total` | counter | Total requests, labelled by `model`, `provider`, `status` |
+| `ai_tokens_input_total` | counter | Total input tokens, labelled by `model`, `provider` |
+| `ai_tokens_output_total` | counter | Total output tokens, labelled by `model`, `provider` |
+| `ai_cost_usd_total` | counter | Total cost in USD, labelled by `model`, `provider` |
+| `ai_request_duration_ms` | summary | Request latency (p50, p90, p99), labelled by `model` |
+| `ai_retries_total` | counter | Total retry attempts, labelled by `model` |
+| `ai_rate_limited_total` | counter | Rate-limit hits, labelled by `model`, `source` |
+| `ai_budget_exceeded_total` | counter | Budget exceeded events, labelled by `model`, `period` |
+| `ai_queue_depth` | gauge | Current queue depth, labelled by `model` |
+```typescript
+// Custom metric prefix
+const prometheus = createPrometheusPlugin({ prefix: 'myapp_' })
+// → myapp_requests_total, myapp_tokens_input_total, ...
+// Reset counters (e.g. for tests)
+prometheus.reset()
+```
+---
+## StatsD metrics
+```typescript
+import { createRateLimiter } from 'ai-sdk-rate-limiter'
+import { createStatsDPlugin } from 'ai-sdk-rate-limiter/statsd'
+import StatsD from 'hot-shots' // or node-statsd, dogstatsd-client, etc.
+const statsd = new StatsD({ host: 'localhost', port: 8125 })
+const limiter = createRateLimiter({
+  on: createStatsDPlugin(statsd, {
+    prefix:     'myapp.ai.',   // default: 'ai.'
+    globalTags: ['env:prod'],  // appended to every metric
+  }),
+})
+```
+Any client that implements the `StatsDClient` interface works — `hot-shots`, `node-statsd`, `datadog-metrics`, or a custom implementation:
+```typescript
+import type { StatsDClient } from 'ai-sdk-rate-limiter/statsd'
+const client: StatsDClient = {
+  increment(metric, value, tags) { /* ... */ },
+  gauge(metric, value, tags) { /* ... */ },
+  timing(metric, value, tags) { /* ... */ },
+}
+```
+**Metrics emitted** (same set as Prometheus, DogStatsD tag format `['key:value']`):
+| Metric | Type | Tags |
+|---|---|---|
+| `ai.requests` | increment | `model:*`, `provider:*`, `status:completed\|dropped` |
+| `ai.tokens.input` | increment | `model:*`, `provider:*` |
+| `ai.tokens.output` | increment | `model:*`, `provider:*` |
+| `ai.cost_usd` | timing (gauge) | `model:*`, `provider:*` |
+| `ai.latency_ms` | timing | `model:*` |
+| `ai.retries` | increment | `model:*` |
+| `ai.rate_limited` | increment | `model:*`, `source:local\|remote` |
+| `ai.budget_exceeded` | increment | `model:*`, `period:hourly\|daily\|monthly` |
+| `ai.queue_depth` | gauge | `model:*` |
+---
 ## Events
 All events are typed. Register handlers at creation time or dynamically:
@@ -579,8 +883,20 @@ limiter.off('queued', handler)
 | `retrying` | A failed request is about to retry | `model`, `provider`, `attempt`, `maxAttempts`, `delayMs`, `error` |
 | `rateLimited` | Limit hit (local or remote 429) | `model`, `provider`, `source`, `limitType`, `resetAt` |
 | `budgetHit` | Cost budget exceeded | `model`, `provider`, `currentCostUsd`, `limitUsd`, `period`, `usingFallback` |
-| `dropped` | Request rejected (queue full or timeout) | `model`, `provider`, `reason` |
-| `completed` | Request finished successfully | `model`, `provider`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs`, `streaming` |
+| `dropped` | Request rejected | `model`, `provider`, `reason`, `waitedMs?`, `queueDepth?`, `scope?`, `metadata?` |
+| `completed` | Request finished successfully | `model`, `provider`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs`, `streaming`, `scope?` |
+| `circuitOpen` | Circuit breaker opened | `model`, `provider`, `openUntilMs`, `failureCount` |
+| `circuitClosed` | Circuit breaker closed (upstream recovered) | `model`, `provider` |
+| `limitsDetected` | Limits auto-updated from response headers | `model`, `provider`, `detectedLimits` |
+**`dropped` reason values:**
+| Reason | Cause |
+|---|---|
+| `'queue-full'` | Queue at `maxSize` capacity |
+| `'queue-timeout'` | Request waited longer than `queue.timeout` |
+| `'circuit-open'` | Circuit breaker is open |
+| `'shutdown'` | Limiter is shutting down |
 The `source` on `rateLimited` distinguishes between requests we blocked locally (`'local'`) vs. requests the API rejected with a 429 (`'remote'`). Local blocks are expected and free. Frequent remote blocks mean your configured limits are too high for your tier — run `npx ai-sdk-rate-limiter audit` to get accurate numbers.
@@ -618,6 +934,8 @@ import {
   QueueFullError,
   BudgetExceededError,
   RetryExhaustedError,
+  CircuitOpenError,
+  ShutdownError,
   RateLimiterError,
 } from 'ai-sdk-rate-limiter'
@@ -630,6 +948,13 @@ try {
   } else if (error instanceof BudgetExceededError) {
     // Cost budget hit and onExceeded is 'throw' or no fallback configured
     console.error(`Budget exceeded: $${error.currentCostUsd} of $${error.limitUsd} ${error.period}`)
+  } else if (error instanceof CircuitOpenError) {
+    // Circuit breaker is open — upstream is degraded
+    const retryAfterSec = Math.ceil((error.openUntilMs - Date.now()) / 1000)
+    res.status(503).json({ error: 'AI service temporarily unavailable', retryAfter: retryAfterSec })
+  } else if (error instanceof ShutdownError) {
+    // Limiter is shutting down — process is exiting
+    console.log('Limiter shutting down, request rejected')
   } else if (error instanceof RetryExhaustedError) {
     // All retry attempts failed
     console.error(`All ${error.attempts} retries exhausted`, error.cause)
@@ -654,6 +979,8 @@ All errors extend `RateLimiterError`, so a single `instanceof RateLimiterError`
 |---|---|
 | `QueueTimeoutError` | `model`, `waitedMs`, `queueDepth` |
 | `BudgetExceededError` | `model`, `currentCostUsd`, `limitUsd`, `period` |
+| `CircuitOpenError` | `model`, `openUntilMs` |
+| `ShutdownError` | — |
 | `RetryExhaustedError` | `model`, `attempts`, `cause` |
 | `QueueFullError` | `model`, `maxSize` |
 | `RateLimitExceededError` | `model`, `limitType`, `limit`, `resetAt` |
@@ -999,9 +1326,11 @@ Implement `RateLimitStore` to use any backend (DynamoDB, Postgres, etc.):
 import type { RateLimitStore } from 'ai-sdk-rate-limiter'
 class MyStore implements RateLimitStore {
-  async checkAndReserve(key, tokens, limits) { /* ... */ }
-  async applyBackoff(key, untilMs) { /* ... */ }
+  async checkAndRecord(key, estimatedInputTokens, limits) { /* ... */ }
+  async reconcile(key, actualInputTokens, actualOutputTokens) { /* ... */ }
+  async setBackoff(key, untilMs) { /* ... */ }
   async getBackoff(key) { /* ... */ }
+  async nextSlotMs(key, limits, estimatedInputTokens) { /* ... */ }
 }
 const limiter = createRateLimiter({ store: new MyStore() })
@@ -1011,7 +1340,7 @@ const limiter = createRateLimiter({ store: new MyStore() })
 ## How it works
-**Rate limiting** — Sliding window counter per model. Each model tracks a list of `{timestamp, tokens}` entries for the past 60 seconds. On every request, stale entries are evicted and the window is checked against RPM and ITPM limits simultaneously.
+**Rate limiting** — Sliding window counter per model. Each model tracks a list of `{timestamp, tokens}` entries for the past 60 seconds. On every request, stale entries are evicted and the window is checked against RPM, ITPM, and OTPM limits simultaneously. RPD uses a separate 24-hour rolling window. OTPM is based on actual output token counts from completed requests.
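A minimal sketch of the sliding-window check, assuming a single in-memory window that tracks only RPM and ITPM. `SlidingWindowSketch` and `MinuteLimits` are hypothetical names; the real engine also handles OTPM, RPD, queuing, and shared stores.

```typescript
interface WindowEntry {
  timestamp: number
  tokens: number
}

interface MinuteLimits {
  rpm?: number
  itpm?: number
}

// Hypothetical class for illustration; not the library's engine.
class SlidingWindowSketch {
  private entries: WindowEntry[] = []
  private readonly windowMs: number

  constructor(windowMs = 60_000) {
    this.windowMs = windowMs
  }

  /** Records the request and returns true if it fits in the current window. */
  tryReserve(nowMs: number, tokens: number, limits: MinuteLimits): boolean {
    // Evict entries that have aged out of the window.
    this.entries = this.entries.filter((e) => e.timestamp > nowMs - this.windowMs)

    const usedTokens = this.entries.reduce((sum, e) => sum + e.tokens, 0)
    // Both limits are checked before anything is recorded.
    if (limits.rpm !== undefined && this.entries.length + 1 > limits.rpm) return false
    if (limits.itpm !== undefined && usedTokens + tokens > limits.itpm) return false

    this.entries.push({ timestamp: nowMs, tokens })
    return true
  }
}
```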
 **Queue** — A sorted priority queue per model, ordered by `priority` then enqueue time (FIFO within same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.
@@ -1023,7 +1352,7 @@ const limiter = createRateLimiter({ store: new MyStore() })
 **Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the entire model key in the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common thundering-herd failure where you retry one request while 10 others immediately follow and all get 429s.
-**Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk (Vercel AI SDK) or the final usage chunk (raw proxy).
+**Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk (Vercel AI SDK) or the final usage chunk (raw proxy). If a stream ends without a usage chunk (some error paths), the window is updated with zeros rather than leaving the estimate in place.
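The estimate-then-reconcile flow can be illustrated with two small functions. Both names are hypothetical, and the 4 characters per token figure is the rough heuristic quoted above rather than a tokenizer.

```typescript
// Rough heuristic from the text: about 4 characters per token.
function estimateTokens(prompt: string): number {
  return Math.ceil(prompt.length / 4)
}

// Swap a reserved estimate for the actual usage reported by the API.
// Passing actual = 0 models a stream that ended without a usage chunk.
function reconcileTokens(windowTokens: number, estimated: number, actual: number): number {
  return Math.max(0, windowTokens - estimated + actual)
}
```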
 **Zero dependencies** — The Vercel AI SDK middleware interface is implemented structurally — `@ai-sdk/provider` types are used for type checking only and not required at runtime. No `ioredis`, no `bottleneck`, no tokenizer libraries in the core.
@@ -1040,10 +1369,19 @@ const limiter = createRateLimiter({ store: new MyStore() })
 | Priority queue | yes | yes | no | no | no |
 | Concurrency limits | yes | yes | yes | no | no |
 | Cost tracking + budgets | yes | no | no | no | no |
+| Persistent cost store | yes | no | no | no | no |
+| Per-scope cost attribution | yes | no | no | no | no |
+| Budget fallback chains | yes | no | no | no | no |
+| Circuit breaker | yes | no | no | no | no |
+| Graceful shutdown | yes | no | no | no | no |
+| Auto-detected limits | yes | no | no | no | no |
 | Multi-tenant scoped limits | yes | no | no | no | no |
 | AbortSignal propagation | yes | no | no | no | no |
+| Call timeout | yes | no | no | no | no |
 | Retry-After header | yes | no | no | partial | partial |
 | Backoff propagation | yes | no | no | no | no |
+| Prometheus metrics | yes | no | no | no | no |
+| StatsD metrics | yes | no | no | no | no |
 | OpenTelemetry | yes | no | no | no | partial |
 | Testing utilities | yes | no | no | no | no |
 | CLI audit | yes | no | no | no | no |
@@ -1071,9 +1409,22 @@ import type {
   EventMap,
   QueuedEvent,
   Priority,
+  CircuitBreakerConfig,
+  CircuitOpenEvent,
+  CircuitClosedEvent,
+  LimitsDetectedEvent,
+  DroppedEvent,
+  CompletedEvent,
 } from 'ai-sdk-rate-limiter'
 import type { CallRecord } from 'ai-sdk-rate-limiter/testing'
+import type {
+  CostStore,
+  PersistedCostEntry,
+} from 'ai-sdk-rate-limiter/redis'
+import type { StatsDClient } from 'ai-sdk-rate-limiter/statsd'
 ```
 ---