ai-sdk-rate-limiter 0.4.0 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

package/README.md +415 -154
package/dist/cli.js +1029 -0
package/dist/index.cjs +206 -142
package/dist/index.cjs.map +1 -1
package/dist/index.d.cts +3 -320
package/dist/index.d.ts +3 -320
package/dist/index.js +206 -142
package/dist/index.js.map +1 -1
package/dist/otel.cjs +75 -0
package/dist/otel.cjs.map +1 -0
package/dist/otel.d.cts +63 -0
package/dist/otel.d.ts +63 -0
package/dist/otel.js +72 -0
package/dist/otel.js.map +1 -0
package/dist/redis.cjs +209 -0
package/dist/redis.cjs.map +1 -0
package/dist/redis.d.cts +54 -0
package/dist/redis.d.ts +54 -0
package/dist/redis.js +207 -0
package/dist/redis.js.map +1 -0
package/dist/types-CgePLtmQ.d.cts +385 -0
package/dist/types-CgePLtmQ.d.ts +385 -0
package/package.json +33 -2

package/README.md CHANGED

@@ -56,9 +56,152 @@ The wrapped model is a drop-in replacement. Every Vercel AI SDK feature — stre
 **Raw SDK support** — Works with the native OpenAI, Anthropic, Groq, Mistral, and Cohere SDKs directly via a transparent JavaScript Proxy. No Vercel AI SDK required.
+**OpenTelemetry** — Drop-in OTel plugin that emits GenAI-spec spans for every request. Works with any OTel-compatible tracer.
+**CLI audit** — `npx ai-sdk-rate-limiter audit` probes your API keys to detect your actual tier limits and generates a ready-to-paste config override.
+---
+## Contents
+- [Vercel AI SDK usage](#vercel-ai-sdk-usage)
+- [Raw SDK proxy](#raw-sdk-proxy)
+- [Configuration reference](#configuration-reference)
+- [Per-request options](#per-request-options)
+- [Cost tracking](#cost-tracking)
+- [Budget fallback routing](#budget-fallback-routing)
+- [Multi-instance Redis store](#multi-instance-redis-store)
+- [Events](#events)
+- [Backpressure](#backpressure)
+- [Error handling](#error-handling)
+- [OpenTelemetry](#opentelemetry)
+- [CLI audit](#cli-audit)
+- [Model registry](#model-registry)
+- [Advanced usage](#advanced-usage)
+- [How it works](#how-it-works)
+- [Comparison](#comparison)
+- [TypeScript](#typescript)
+- [Requirements](#requirements)
+---
+## Vercel AI SDK usage
+### Basic wrap
+```typescript
+import { createRateLimiter } from 'ai-sdk-rate-limiter'
+import { openai } from '@ai-sdk/openai'
+import { generateText, streamText } from 'ai'
+const limiter = createRateLimiter()
+const model = limiter.wrap(openai('gpt-4o'))
+// generateText
+const { text } = await generateText({ model, prompt: 'Summarize this...' })
+// streamText — streaming is first-class, rate limit slot consumed at request start
+const result = streamText({ model, messages })
+for await (const chunk of result.textStream) {
+  process.stdout.write(chunk)
+}
+```
+### Using the raw middleware
+If you use `wrapLanguageModel` directly or need to compose middleware:
+```typescript
+import { wrapLanguageModel } from 'ai'
+// Single middleware
+const model = wrapLanguageModel({
+  model: openai('gpt-4o'),
+  middleware: limiter.middleware,
+})
+// Composed with other middleware
+const composedModel = wrapLanguageModel({
+  model: openai('gpt-4o'),
+  middleware: [loggingMiddleware, limiter.middleware, cachingMiddleware],
+})
+```
+---
+## Raw SDK proxy
+If you're using the OpenAI, Anthropic, Groq, Mistral, or Cohere SDK directly — without the Vercel AI SDK — use `limiter.rawProxy()` to add rate limiting as a transparent drop-in:
+```typescript
+import { createRateLimiter } from 'ai-sdk-rate-limiter'
+import OpenAI from 'openai'
+import Anthropic from '@anthropic-ai/sdk'
+const limiter = createRateLimiter({
+  cost: { budget: { daily: 50 }, onExceeded: 'throw' },
+  on: { rateLimited: ({ model }) => console.warn(`${model} rate limited`) },
+})
+// Every API call goes through the same rate limiter and cost tracker
+const openai    = limiter.rawProxy(new OpenAI())
+const anthropic = limiter.rawProxy(new Anthropic())
+// Use exactly as before — no other changes needed
+const completion = await openai.chat.completions.create({
+  model: 'gpt-4o',
+  messages: [{ role: 'user', content: 'Hello!' }],
+})
+const message = await anthropic.messages.create({
+  model: 'claude-opus-4-6',
+  max_tokens: 1024,
+  messages: [{ role: 'user', content: 'Hello!' }],
+})
+// Cost from both clients tracked together
+const report = limiter.getCostReport()
+```
+**Streaming** — the proxy wraps the returned `AsyncIterable` to capture the final usage chunk automatically:
+```typescript
+const stream = await openai.chat.completions.create({
+  model: 'gpt-4o',
+  messages: [{ role: 'user', content: 'Stream this' }],
+  stream: true,
+  stream_options: { include_usage: true }, // OpenAI: include usage in final chunk
+})
+for await (const chunk of stream) {
+  process.stdout.write(chunk.choices[0]?.delta?.content ?? '')
+}
+// After the loop, tokens are recorded in limiter.getCostReport()
+```
+**Standalone — no shared limiter needed:**
+```typescript
+import { rateLimited } from 'ai-sdk-rate-limiter'
+const openai = rateLimited(new OpenAI(), {
+  config: { cost: { budget: { daily: 20 } } },
+})
+```
+**Override auto-detected provider** — useful for OpenAI-compatible endpoints:
+```typescript
+const client = limiter.rawProxy(new OpenAI({ baseURL: 'https://api.groq.com/openai/v1' }), {
+  provider: 'groq', // use Groq's limits and pricing instead of OpenAI's
+})
+```
+Provider is auto-detected from the client's constructor name (`OpenAI` → `openai`, `Anthropic` → `anthropic`, `Groq` → `groq`, etc.).
 ---
-## Configuration
+## Configuration reference
 Everything has a sensible default. Override only what you need.
@@ -67,17 +210,15 @@ const limiter = createRateLimiter({
   // Override or extend built-in model limits for your API tier
   limits: {
     'gpt-4o':          { rpm: 500, itpm: 30_000 },
-    'claude-opus-4-6': { rpm: 50,  itpm: 30_000 },
-    // Wildcard: apply to all models from a provider
-    'openai/*':        { rpm: 1000 },
+    'claude-opus-4-6': { rpm: 50,  itpm: 20_000 },
   },
   // Cost budgets and behavior when exceeded
   cost: {
     budget: {
-      hourly:  5,     // USD
-      daily:   50,
-      monthly: 500,
+      hourly:  5,     // USD — hard cap per hour
+      daily:   50,    // USD — hard cap per day
+      monthly: 500,   // USD — hard cap per month
     },
     onExceeded: 'throw', // 'throw' | 'queue' | 'fallback'
   },
@@ -100,14 +241,14 @@ const limiter = createRateLimiter({
     retryOn:         [429, 500, 502, 503, 504],
   },
-  // Observability
+  // Observability — see Events section for all available events
   on: {
     rateLimited: ({ model, source, resetAt }) =>
       console.warn(`${model} rate limited (${source}), resets ${new Date(resetAt).toISOString()}`),
     retrying:    ({ model, attempt, delayMs }) =>
       console.log(`${model} retry ${attempt} in ${delayMs}ms`),
     budgetHit:   ({ model, currentCostUsd, limitUsd, period }) =>
-      alerts.send(`${model} hit $${limitUsd} ${period} budget ($${currentCostUsd} spent)`),
+      alerts.send(`${model} hit $${limitUsd} ${period} budget`),
     completed:   ({ model, inputTokens, outputTokens, costUsd, latencyMs }) =>
       metrics.record({ model, inputTokens, outputTokens, costUsd, latencyMs }),
   },
@@ -123,14 +264,14 @@ Pass options to individual requests via `providerOptions.rateLimiter`:
 ```typescript
 import { generateText } from 'ai'
-// High-priority request — skips ahead of normal traffic in the queue
+// High-priority — skips ahead of normal traffic in the queue
 await generateText({
   model,
   prompt: 'Urgent user request...',
   providerOptions: {
     rateLimiter: {
-      priority: 'high',    // 'high' | 'normal' | 'low'
-      timeout:  10_000,    // override the default queue timeout for this request
+      priority: 'high',   // 'high' | 'normal' | 'low'
+      timeout:  10_000,   // override the default queue timeout for this request
     },
   },
 })
@@ -145,7 +286,7 @@ await generateText({
 })
 ```
-This is the right way to colocate user requests and background jobs on the same model without background jobs starving users.
+This is the recommended way to colocate user requests and background jobs on the same model without background jobs starving users.
 ---
@@ -157,7 +298,7 @@ const report = limiter.getCostReport()
 console.log(report)
 // {
-//   hour:  { requests: 42,   inputTokens: 84_000,  outputTokens: 21_000, costUsd: 0.29 },
+//   hour:  { requests: 42,   inputTokens: 84_000,  outputTokens: 21_000,  costUsd: 0.29 },
 //   day:   { requests: 318,  inputTokens: 620_000, outputTokens: 155_000, costUsd: 2.11 },
 //   month: { requests: 4821, inputTokens: 9_100_000, outputTokens: 2_200_000, costUsd: 34.80 },
 //   byModel: {
@@ -173,17 +314,17 @@ Costs are based on **actual token counts** from API responses — not estimates.
 ## Budget fallback routing
-When a budget limit is hit, you can transparently reroute to a cheaper model instead of throwing an error. Pass a `fallback` option to `wrap()`:
+When a budget limit is hit, you can transparently reroute to a cheaper model instead of throwing an error:
 ```typescript
 const limiter = createRateLimiter({
   cost: {
     budget: { daily: 10 },
-    onExceeded: 'fallback',  // reroute to fallback instead of throwing
+    onExceeded: 'fallback', // reroute to fallback instead of throwing
   },
   on: {
-    budgetHit: ({ model, currentCostUsd, limitUsd, period }) =>
-      console.warn(`${model} ${period} budget hit ($${currentCostUsd} of $${limitUsd})`),
+    budgetHit: ({ model, usingFallback }) =>
+      console.warn(`${model} budget hit — ${usingFallback ? 'using fallback' : 'throwing'}`),
   },
 })
@@ -192,17 +333,11 @@ const model = limiter.wrap(
   { fallback: openai('gpt-4o-mini') },  // used when budget is exceeded
 )
-// Under budget  → uses gpt-4o normally
-// Over $10/day  → silently switches to gpt-4o-mini, no code changes needed
+// Under budget → uses gpt-4o normally
+// Over $10/day → silently switches to gpt-4o-mini, no code changes needed
 const result = await generateText({ model, prompt })
 ```
-**How it works:**
-1. The budget is checked before every request against total rolling spend
-2. When exceeded, `BudgetExceededError` is caught inside `wrap()` before it reaches your code
-3. The request is re-executed against the fallback model, bypassing the budget pre-check
-4. Fallback usage is tracked under the fallback model's ID in `getCostReport()`
 **Behavior matrix:**
 | `onExceeded` | `fallback` configured | Outcome |
@@ -212,91 +347,47 @@ const result = await generateText({ model, prompt })
 | `'fallback'` | no | Throws `BudgetExceededError` |
 | `'queue'` | any | Queues until period resets |
----
-## Raw SDK proxy
-If you're using the OpenAI, Anthropic, Groq, Mistral, or Cohere SDK directly — without the Vercel AI SDK — use `limiter.rawProxy()` to add rate limiting as a transparent drop-in:
-```typescript
-import { createRateLimiter } from 'ai-sdk-rate-limiter'
-import OpenAI from 'openai'
-import Anthropic from '@anthropic-ai/sdk'
-const limiter = createRateLimiter({
-  cost: { budget: { daily: 50 }, onExceeded: 'throw' },
-  on: { rateLimited: ({ model }) => console.warn(`${model} rate limited`) },
-})
+Fallback usage is tracked under the fallback model's ID in `getCostReport()`.
-// Every API call goes through the same rate limiter and cost tracker
-const openai   = limiter.rawProxy(new OpenAI())
-const anthropic = limiter.rawProxy(new Anthropic())
+---
-// Use exactly as before — no other changes needed
-const completion = await openai.chat.completions.create({
-  model: 'gpt-4o',
-  messages: [{ role: 'user', content: 'Hello!' }],
-})
+## Multi-instance Redis store
-const message = await anthropic.messages.create({
-  model: 'claude-opus-4-6',
-  max_tokens: 1024,
-  messages: [{ role: 'user', content: 'Hello!' }],
-})
+By default, rate limit state is in-memory (per-process). For multi-instance deployments — multiple pods, serverless replicas, workers — each instance has its own counters. Install the Redis store to share state:
-// Cost from both clients tracked together
-const report = limiter.getCostReport()
 ```
-**Streaming works too** — the proxy wraps the returned `AsyncIterable` to capture the final usage chunk automatically:
-```typescript
-const stream = await openai.chat.completions.create({
-  model: 'gpt-4o',
-  messages: [{ role: 'user', content: 'Stream this' }],
-  stream: true,
-  stream_options: { include_usage: true },  // tells OpenAI to include usage in final chunk
-})
-for await (const chunk of stream) {
-  process.stdout.write(chunk.choices[0]?.delta?.content ?? '')
-}
-// After the loop, usage is recorded in limiter.getCostReport()
+npm install ioredis
 ```
-**Zero-config standalone version** — if you don't need to share the limiter with other models:
 ```typescript
-import { rateLimited } from 'ai-sdk-rate-limiter'
+import { createRateLimiter } from 'ai-sdk-rate-limiter'
+import { RedisStore } from 'ai-sdk-rate-limiter/redis'
+import Redis from 'ioredis'
-const openai = rateLimited(new OpenAI(), {
-  config: { cost: { budget: { daily: 20 } } },
+const limiter = createRateLimiter({
+  store: new RedisStore(new Redis(process.env.REDIS_URL)),
+  // ... rest of your config unchanged
 })
 ```
-**Provider is auto-detected** from the client's constructor name (`OpenAI`, `Anthropic`, `Groq`, etc.). Override it explicitly if needed:
+That's the entire change. All APIs — `wrap()`, `rawProxy()`, events, cost reports — work identically. The Redis store enforces rate limits collectively so no two instances can jointly exceed the API limits.
+**Options:**
 ```typescript
-const client = limiter.rawProxy(new OpenAI({ baseURL: 'https://api.groq.com/openai/v1' }), {
-  provider: 'groq',  // use Groq's limits and pricing instead of OpenAI's
+new RedisStore(redis, {
+  keyPrefix: 'rl:myapp:', // namespace if multiple apps share one Redis instance
+  windowMs:  60_000,      // window size in ms; match your provider's limit window
 })
 ```
----
-## Backpressure — know before you send
+**How it works internally:**
-Check estimated wait time before committing to a request. Useful for showing loading states or shedding load gracefully.
-```typescript
-const waitMs = limiter.estimatedWait('gpt-4o')
+Each request runs a Lua script atomically that: removes stale entries from a sorted set, counts requests and tokens in the current window, checks against RPM and ITPM limits, and either reserves the slot or returns when the next slot opens. The local queue (priority ordering, drain timer, timeout handling) stays in-memory per instance — only the window counters are shared via Redis.
-if (waitMs > 5_000) {
-  return res.status(503).json({ error: 'Model is busy, try again shortly', retryAfterMs: waitMs })
-}
+**Compatible clients** — any client with `eval()`, `get()`, and `set()` works: `ioredis`, `node-redis`, Upstash Redis.
-const result = await generateText({ model, prompt })
-```
+Use the default `InMemoryStore` for single-instance deployments — it's more accurate (true sliding window, no network round-trips) and zero-config. Only switch to `RedisStore` when you actually need cross-instance coordination.
 ---
@@ -320,24 +411,46 @@ limiter.off('queued', handler)
 | Event | When | Key fields |
 |---|---|---|
-| `queued` | Request enters the queue | `model`, `priority`, `queueDepth`, `estimatedWaitMs` |
-| `dequeued` | Request leaves the queue | `model`, `waitedMs`, `priority` |
-| `retrying` | A failed request is about to retry | `model`, `attempt`, `maxAttempts`, `delayMs`, `error` |
-| `rateLimited` | Limit hit (local or remote 429) | `model`, `source`, `limitType`, `resetAt` |
-| `budgetHit` | Cost budget exceeded | `model`, `currentCostUsd`, `limitUsd`, `period`, `usingFallback` |
-| `dropped` | Request rejected (queue full or timeout) | `model`, `reason` |
-| `completed` | Request finished successfully | `model`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs` |
+| `queued` | Request enters the queue | `model`, `provider`, `priority`, `queueDepth`, `estimatedWaitMs` |
+| `dequeued` | Request leaves the queue | `model`, `provider`, `waitedMs`, `priority` |
+| `retrying` | A failed request is about to retry | `model`, `provider`, `attempt`, `maxAttempts`, `delayMs`, `error` |
+| `rateLimited` | Limit hit (local or remote 429) | `model`, `provider`, `source`, `limitType`, `resetAt` |
+| `budgetHit` | Cost budget exceeded | `model`, `provider`, `currentCostUsd`, `limitUsd`, `period`, `usingFallback` |
+| `dropped` | Request rejected (queue full or timeout) | `model`, `provider`, `reason` |
+| `completed` | Request finished successfully | `model`, `provider`, `inputTokens`, `outputTokens`, `costUsd`, `latencyMs`, `streaming` |
-The `source` on `rateLimited` distinguishes between requests we blocked locally (`'local'`) vs. requests the API rejected with a 429 (`'remote'`). Local blocks are free; remote blocks mean your limits are misconfigured.
+The `source` on `rateLimited` distinguishes between requests we blocked locally (`'local'`) vs. requests the API rejected with a 429 (`'remote'`). Local blocks are expected and free. Frequent remote blocks mean your configured limits are too high for your tier — run `npx ai-sdk-rate-limiter audit` to get accurate numbers.
+---
+## Backpressure
+Check estimated wait time before committing to a request. Useful for showing loading states or shedding load gracefully:
+```typescript
+const waitMs = await limiter.estimatedWait('gpt-4o')
+if (waitMs > 5_000) {
+  return res.status(503).json({
+    error: 'Model busy, try again shortly',
+    retryAfterMs: waitMs,
+  })
+}
+const result = await generateText({ model, prompt })
+```
+Returns `0` if the model would proceed immediately.
 ---
 ## Error handling
-Every error is typed, structured, and tells you exactly what happened:
+Every error is typed and carries structured context:
 ```typescript
 import {
+  RateLimitExceededError,
   QueueTimeoutError,
   QueueFullError,
   BudgetExceededError,
@@ -349,77 +462,163 @@ try {
   const result = await generateText({ model, prompt })
 } catch (error) {
   if (error instanceof QueueTimeoutError) {
-    // error.model, error.waitedMs, error.queueDepth
+    // Request waited in queue longer than queue.timeout
     console.error(`Timed out after ${error.waitedMs}ms (queue depth: ${error.queueDepth})`)
   } else if (error instanceof BudgetExceededError) {
-    // error.model, error.currentCostUsd, error.limitUsd, error.period
+    // Cost budget hit and onExceeded is 'throw' or no fallback configured
     console.error(`Budget exceeded: $${error.currentCostUsd} of $${error.limitUsd} ${error.period}`)
   } else if (error instanceof RetryExhaustedError) {
-    // error.model, error.attempts, error.cause
+    // All retry attempts failed
     console.error(`All ${error.attempts} retries exhausted`, error.cause)
   } else if (error instanceof QueueFullError) {
-    // error.model, error.maxSize
-    console.error(`Queue full at ${error.maxSize} requests`)
+    // Queue at capacity and onFull is 'throw'
+    console.error(`Queue full at ${error.maxSize} requests for ${error.model}`)
+  } else if (error instanceof RateLimitExceededError) {
+    // Rate limit hit and the request could not be queued
+    console.error(`${error.model} ${error.limitType} limit of ${error.limit} exceeded`)
   }
 }
 ```
-All errors extend `RateLimiterError`, so a single `instanceof RateLimiterError` check separates rate-limiting failures from AI SDK errors.
+All errors extend `RateLimiterError`, so a single `instanceof RateLimiterError` check separates rate-limiter failures from AI API errors.
+**Error fields:**
+| Error | Fields |
+|---|---|
+| `QueueTimeoutError` | `model`, `waitedMs`, `queueDepth` |
+| `BudgetExceededError` | `model`, `currentCostUsd`, `limitUsd`, `period` |
+| `RetryExhaustedError` | `model`, `attempts`, `cause` |
+| `QueueFullError` | `model`, `maxSize` |
+| `RateLimitExceededError` | `model`, `limitType`, `limit`, `resetAt` |
 ---
-## Advanced middleware usage
+## OpenTelemetry
-If you use `wrapLanguageModel` directly, the raw middleware is available:
+The `ai-sdk-rate-limiter/otel` entry point provides a plugin that emits OpenTelemetry spans for every AI request. No hard dependency on `@opentelemetry/api` — the plugin accepts any OTel-compatible tracer via structural typing.
 ```typescript
-import { wrapLanguageModel } from 'ai'
+import { trace } from '@opentelemetry/api'
+import { createRateLimiter } from 'ai-sdk-rate-limiter'
+import { createOtelPlugin } from 'ai-sdk-rate-limiter/otel'
-const model = wrapLanguageModel({
-  model: openai('gpt-4o'),
-  middleware: limiter.middleware,
+const limiter = createRateLimiter({
+  on: createOtelPlugin(trace.getTracer('my-service')),
 })
 ```
-You can also compose it with other middleware:
+**Spans emitted:**
+| Span name | When | Status |
+|---|---|---|
+| `gen_ai.request` | Every completed request | OK |
+| `gen_ai.request` | Every dropped request | ERROR |
+| `ai_rate_limiter.retry` | Each retry attempt | OK |
+| `ai_rate_limiter.budget_hit` | Budget exceeded | ERROR |
+**Attributes on `gen_ai.request` (completed):**
+| Attribute | Value |
+|---|---|
+| `gen_ai.system` | Provider name (e.g. `openai`, `anthropic`) |
+| `gen_ai.request.model` | Model ID |
+| `gen_ai.usage.input_tokens` | Actual input tokens from API response |
+| `gen_ai.usage.output_tokens` | Actual output tokens from API response |
+| `ai_rate_limiter.cost_usd` | Cost in USD for this request |
+| `ai_rate_limiter.latency_ms` | Total latency including queue wait |
+| `ai_rate_limiter.streaming` | Whether the request used streaming |
+Attribute names follow the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/). The `gen_ai.request` span duration is reconstructed from `latencyMs` so it reflects the full wall-clock time of the request, including any queue wait.
+**Custom tracer interface** — if you don't want to install `@opentelemetry/api`, implement the `OtelTracer` interface directly:
 ```typescript
-const model = wrapLanguageModel({
-  model: openai('gpt-4o'),
-  middleware: [loggingMiddleware, limiter.middleware, cachingMiddleware],
+import { createOtelPlugin, type OtelTracer } from 'ai-sdk-rate-limiter/otel'
+const tracer: OtelTracer = {
+  startSpan(name, options) {
+    // return any object that implements OtelSpan
+  },
+}
+const limiter = createRateLimiter({
+  on: createOtelPlugin(tracer),
 })
 ```
 ---
-## Multiple limiters
+## CLI audit
-Run separate limiters for separate concerns — e.g., one per customer tier:
+Probe your API keys to discover your actual rate limit tier and generate a ready-to-paste config override:
-```typescript
-const freeLimiter = createRateLimiter({
-  limits:  { 'gpt-4o-mini': { rpm: 5 } },
-  cost:    { budget: { daily: 0.10 }, onExceeded: 'throw' },
-  queue:   { timeout: 5_000 },
-})
+```
+npx ai-sdk-rate-limiter audit
+```
-const paidLimiter = createRateLimiter({
-  limits:  { 'gpt-4o': { rpm: 100 } },
-  cost:    { budget: { daily: 20 } },
-  queue:   { timeout: 30_000 },
-})
+```
+────────────────────────────────────────────────────────────────────────────────
+  ai-sdk-rate-limiter audit
+────────────────────────────────────────────────────────────────────────────────
+  OPENAI  (OPENAI_API_KEY)
+  Model                            RPM        TPM            Registry
+  ──────────────────────────────────────────────────────────────────────────────
+  gpt-4o                           10000      2,000,000      (registry: 500 RPM / 30,000 TPM)
+  gpt-4o-mini                      10000      10,000,000 ≠   (registry: 500 RPM / 200,000 TPM)
+────────────────────────────────────────────────────────────────────────────────
+  1 model(s) differ from registry defaults.
+  Paste the config below into createRateLimiter():
+  const limiter = createRateLimiter({
+    limits: {
+      'gpt-4o-mini': { rpm: 10000, itpm: 10_000_000 },
+    },
+  })
-// Route to the right limiter per request
-const model = req.user.plan === 'paid'
-  ? paidLimiter.wrap(openai('gpt-4o'))
-  : freeLimiter.wrap(openai('gpt-4o-mini'))
+────────────────────────────────────────────────────────────────────────────────
+```
+**How it works** — Makes a minimal (5-token) request per model and reads the `x-ratelimit-limit-*` headers that every provider returns on each response. These headers reflect your account's actual tier, not the published defaults.
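The header-reading step described above can be sketched as follows. This is an illustration only, not the CLI's actual code: `HeaderSource`, `DetectedLimits`, `readIntHeader`, and `readLimitHeaders` are hypothetical names, and the two header suffixes (`-requests`, `-tokens`) are an assumption based on OpenAI-style naming; other providers may use different names.

```typescript
// Minimal structural view of a headers object.
// The fetch `Headers` class satisfies this shape.
interface HeaderSource {
  get(name: string): string | null
}

interface DetectedLimits {
  rpm?: number
  itpm?: number
}

// Parse a header as a base-10 integer; undefined when absent or malformed.
function readIntHeader(headers: HeaderSource, name: string): number | undefined {
  const raw = headers.get(name)
  if (raw === null) return undefined
  const value = Number.parseInt(raw, 10)
  return Number.isFinite(value) ? value : undefined
}

// Assumed header names (OpenAI-style). The token limit is mapped to `itpm`,
// mirroring how the audit output above pastes TPM into the `itpm` field.
function readLimitHeaders(headers: HeaderSource): DetectedLimits {
  const detected: DetectedLimits = {}
  const rpm = readIntHeader(headers, 'x-ratelimit-limit-requests')
  const itpm = readIntHeader(headers, 'x-ratelimit-limit-tokens')
  if (rpm !== undefined) detected.rpm = rpm
  if (itpm !== undefined) detected.itpm = itpm
  return detected
}
```

Accepting a structural `HeaderSource` rather than a concrete class keeps the sketch usable with `fetch` responses or with a plain test double.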
+**Options:**
+```
+npx ai-sdk-rate-limiter audit [options]
+  --provider, -p <name>   Audit a single provider: openai, anthropic, groq, mistral, cohere
+  --json                  Machine-readable JSON output
+  --help, -h              Show help
+  --version, -v           Print version
+Environment variables required:
+  OPENAI_API_KEY
+  ANTHROPIC_API_KEY
+  GROQ_API_KEY
+  MISTRAL_API_KEY
+  COHERE_API_KEY
+```
+**Examples:**
+```bash
+# Audit all configured providers
+npx ai-sdk-rate-limiter audit
+# Audit only OpenAI
+npx ai-sdk-rate-limiter audit --provider openai
+# Machine-readable output for CI / scripts
+npx ai-sdk-rate-limiter audit --json | jq '.providers[0].models'
 ```
 ---
-## Built-in model registry
+## Model registry
-Limits and pricing are built-in for every major model across 6 providers. Defaults are conservative (free/Tier 1) — override with your actual plan limits.
+Limits and pricing are built-in for every major model across 6 providers. Defaults are conservative (free/Tier 1) — override with your actual plan limits via the `limits` config option or by running `audit`.
 **OpenAI**
@@ -493,30 +692,87 @@ import {
 console.log(GROQ_MODELS['llama-3.3-70b-versatile'])
 // { rpm: 30, itpm: 6000, rpd: 1000, inputPricePerMillion: 0.59, ... }
-console.log(isKnownModel('llama-3.3-70b-versatile', 'groq'))
-// true
+console.log(isKnownModel('llama-3.3-70b-versatile', 'groq')) // true
+console.log(isKnownModel('my-fine-tune', 'openai'))           // false → fallback limits
+// Resolve the effective limits for a model (registry defaults merged with user overrides)
+const limits = resolveModelLimits('gpt-4o', 'openai', { 'gpt-4o': { rpm: 1000 } })
+```
+---
+## Advanced usage
+### Multiple limiters per tier
+```typescript
+const freeLimiter = createRateLimiter({
+  limits: { 'gpt-4o-mini': { rpm: 5 } },
+  cost:   { budget: { daily: 0.10 }, onExceeded: 'throw' },
+  queue:  { timeout: 5_000 },
+})
+const paidLimiter = createRateLimiter({
+  limits: { 'gpt-4o': { rpm: 100 } },
+  cost:   { budget: { daily: 20 } },
+  queue:  { timeout: 30_000 },
+})
+// Route per request based on user plan
+const model = req.user.plan === 'paid'
+  ? paidLimiter.wrap(openai('gpt-4o'))
+  : freeLimiter.wrap(openai('gpt-4o-mini'))
+```
+### Combine OTel tracing with event logging
+```typescript
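+import { trace } from '@opentelemetry/api'
+import { createRateLimiter } from 'ai-sdk-rate-limiter'
+import { createOtelPlugin } from 'ai-sdk-rate-limiter/otel'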
+const limiter = createRateLimiter({
+  on: {
+    // OTel spans for every request
+    ...createOtelPlugin(trace.getTracer('my-service')),
+    // Plus any additional handlers
+    budgetHit: ({ model, limitUsd, period }) =>
+      alerts.send(`Budget alert: ${model} hit $${limitUsd} ${period} cap`),
+  },
+})
+```
+### Custom rate limit store
+Implement `RateLimitStore` to use any backend (DynamoDB, Postgres, etc.):
+```typescript
+import type { RateLimitStore } from 'ai-sdk-rate-limiter'
-console.log(isKnownModel('my-fine-tune', 'openai'))
-// false → will use fallback limits
+class MyStore implements RateLimitStore {
+  async checkAndReserve(key, tokens, limits) { /* ... */ }
+  async applyBackoff(key, untilMs) { /* ... */ }
+  async getBackoff(key) { /* ... */ }
+}
+const limiter = createRateLimiter({ store: new MyStore() })
 ```
 ---
 ## How it works
-**Rate limiting algorithm** — Sliding window counter. Each model tracks a list of `{timestamp, tokens}` entries for the past 60 seconds. On every request, stale entries are evicted and the window is checked against RPM and ITPM limits simultaneously.
+**Rate limiting** — Sliding window counter per model. Each model tracks a list of `{timestamp, tokens}` entries for the past 60 seconds. On every request, stale entries are evicted and the window is checked against RPM and ITPM limits simultaneously.
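As a rough illustration of the sliding-window check described above, here is a minimal in-memory sketch. It is not the package's implementation: the `SlidingWindow` class, its `checkAndReserve(tokens, now)` signature, and the `retryAtMs` field are assumptions made for the example. Only the `{timestamp, tokens}` entries, the 60-second window, and the simultaneous RPM/ITPM check come from the README text.

```typescript
interface WindowEntry {
  timestamp: number
  tokens: number
}

interface Limits {
  rpm: number
  itpm: number
}

type CheckResult = { allowed: true } | { allowed: false; retryAtMs: number }

class SlidingWindow {
  private entries: WindowEntry[] = []
  private readonly limits: Limits
  private readonly windowMs: number

  constructor(limits: Limits, windowMs: number = 60_000) {
    this.limits = limits
    this.windowMs = windowMs
  }

  // Evict stale entries, then check RPM and ITPM together before reserving.
  // Assumes `now` never decreases between calls.
  checkAndReserve(tokens: number, now: number = Date.now()): CheckResult {
    const cutoff = now - this.windowMs
    while (this.entries.length > 0) {
      const first = this.entries[0]
      if (first === undefined || first.timestamp > cutoff) break
      this.entries.shift()
    }

    const usedTokens = this.entries.reduce((sum, e) => sum + e.tokens, 0)
    const overRpm = this.entries.length + 1 > this.limits.rpm
    const overItpm = usedTokens + tokens > this.limits.itpm

    if (overRpm || overItpm) {
      // Capacity can first free up when the oldest entry leaves the window.
      // With an empty window the request alone exceeds ITPM; report `now`.
      const oldest = this.entries[0]
      const retryAtMs = oldest === undefined ? now : oldest.timestamp + this.windowMs
      return { allowed: false, retryAtMs }
    }

    this.entries.push({ timestamp: now, tokens })
    return { allowed: true }
  }
}
```

Passing `now` explicitly makes the window deterministic to test; a caller would normally rely on the `Date.now()` default.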
-**Queue** — A min-heap priority queue per model, sorted by `priority` then enqueue time (FIFO within same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.
+**Queue** — A sorted priority queue per model, ordered by `priority` then enqueue time (FIFO within same priority). A drain timer fires when the oldest window entry expires, processing as many waiters as possible before rescheduling.
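The ordering rule above (priority first, then FIFO within the same priority) can be sketched with a sorted array. This is an assumption-laden illustration rather than the package's queue: `PriorityWaitQueue`, `Waiter`, and `compareWaiters` are hypothetical names, and the drain timer is left out. The three priority levels match the `'high' | 'normal' | 'low'` values shown in the per-request options.

```typescript
type Priority = 'high' | 'normal' | 'low'

// Lower rank runs first.
const PRIORITY_RANK: Record<Priority, number> = { high: 0, normal: 1, low: 2 }

interface Waiter {
  id: string
  priority: Priority
  enqueuedAt: number
}

// Negative when `a` should run before `b`.
function compareWaiters(a: Waiter, b: Waiter): number {
  const byPriority = PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority]
  return byPriority !== 0 ? byPriority : a.enqueuedAt - b.enqueuedAt
}

class PriorityWaitQueue {
  private waiters: Waiter[] = []

  // Insert before the first waiter that should run strictly after the new one,
  // so equal-priority, equal-time waiters keep their arrival order.
  enqueue(waiter: Waiter): void {
    const index = this.waiters.findIndex((other) => compareWaiters(waiter, other) < 0)
    if (index === -1) {
      this.waiters.push(waiter)
    } else {
      this.waiters.splice(index, 0, waiter)
    }
  }

  dequeue(): Waiter | undefined {
    return this.waiters.shift()
  }

  get size(): number {
    return this.waiters.length
  }
}
```

A sorted array gives O(n) inserts, which is usually acceptable for per-model queues of modest depth; a heap would trade simplicity for O(log n).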
-**Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common failure mode where you retry one request while 10 more immediately follow and all get 429s.
+**Retry-After propagation** — When a remote 429 arrives with a `Retry-After` header, the backoff is applied to the entire model key in the engine, not just the failing request. All requests queued behind it pause until the backoff clears. This prevents the common thundering-herd failure where you retry one request while 10 others immediately follow and all get 429s.
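To make the propagation idea concrete, here is a small sketch of a backoff deadline shared per model key. It is not the package's engine: `ModelBackoff`, `remainingMs`, and `parseRetryAfterMs` are hypothetical names. The parser handles the two standard `Retry-After` forms (delta-seconds and HTTP-date).

```typescript
// Convert a Retry-After header value into a delay in milliseconds.
// Returns null when the header is missing or unparseable.
function parseRetryAfterMs(header: string | null, now: number): number | null {
  if (header === null) return null
  const value = header.trim()
  if (/^\d+$/.test(value)) return Number(value) * 1000
  const dateMs = Date.parse(value)
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - now)
}

// One backoff deadline per model key, shared by every request for that key.
class ModelBackoff {
  private readonly untilByKey = new Map<string, number>()

  // Keep the later deadline so a shorter backoff never shortens a longer one.
  apply(key: string, untilMs: number): void {
    const current = this.untilByKey.get(key) ?? 0
    this.untilByKey.set(key, Math.max(current, untilMs))
  }

  // Milliseconds a new request for this key should still wait (0 if none).
  remainingMs(key: string, now: number): number {
    return Math.max(0, (this.untilByKey.get(key) ?? 0) - now)
  }
}
```

In this sketch, a 429 handler would call `apply(model, now + delayMs)` once, and every queued request for that model would consult `remainingMs` before firing, rather than each discovering the limit through its own 429.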
-**Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk.
+**Token estimation** — Before a request fires, tokens are estimated from the prompt text (~4 chars/token) and reserved in the window. After the response, actual usage from the API replaces the estimate. For streaming, actual counts come from the `finish` chunk (Vercel AI SDK) or the final usage chunk (raw proxy).
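The estimate-then-reconcile flow can be illustrated in a few lines. This sketch assumes a plain character count divided by four, rounded up; the package's exact rounding and the names `estimateTokens`, `reserve`, and `reconcile` are assumptions for the example.

```typescript
// Heuristic from the README: roughly four characters per token.
const CHARS_PER_TOKEN = 4

function estimateTokens(promptText: string): number {
  return Math.ceil(promptText.length / CHARS_PER_TOKEN)
}

interface TokenReservation {
  timestamp: number
  tokens: number
  estimated: boolean
}

// Reserve the estimated tokens in the window before the request fires.
function reserve(promptText: string, now: number): TokenReservation {
  return { timestamp: now, tokens: estimateTokens(promptText), estimated: true }
}

// Replace the estimate with the actual usage reported by the API response.
function reconcile(entry: TokenReservation, actualInputTokens: number): TokenReservation {
  return { ...entry, tokens: actualInputTokens, estimated: false }
}
```

Reconciling against the original timestamp keeps the entry's position in the window unchanged; only its token weight is corrected.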
-**Zero dependencies** — The middleware interface is implemented structurally. `@ai-sdk/provider` types are referenced for type checking but not required at runtime. No `ioredis`, no `bottleneck`, no tokenizer libraries.
+**Zero dependencies** — The Vercel AI SDK middleware interface is implemented structurally — `@ai-sdk/provider` types are used for type checking only and not required at runtime. No `ioredis`, no `bottleneck`, no tokenizer libraries in the core.
 ---
-## Why not X?
+## Comparison
 | | ai-sdk-rate-limiter | bottleneck | p-limit | SDK built-in retry | LangChain |
 |---|:---:|:---:|:---:|:---:|:---:|
@@ -528,37 +784,42 @@ console.log(isKnownModel('my-fine-tune', 'openai'))
 | Cost tracking + budgets | yes | no | no | no | no |
 | Retry-After header | yes | no | no | partial | partial |
 | Backoff propagation | yes | no | no | no | no |
+| OpenTelemetry | yes | no | no | no | partial |
+| CLI audit | yes | no | no | no | no |
 | Zero runtime deps | yes | no | yes | — | no |
 | Provider-agnostic | yes | yes | yes | no | no |
-**bottleneck** is excellent general-purpose rate limiting, but knows nothing about AI APIs — no model limits, no token counting, no cost tracking. You'd need to configure it per-model manually and rebuild the cost system yourself.
+**bottleneck** — Excellent general-purpose rate limiting, but knows nothing about AI APIs. No model limits, no token counting, no cost tracking. You'd need to configure it per-model manually and rebuild the cost system yourself.
-**p-limit** controls concurrency, not rate. It doesn't throttle to N requests per minute, it throttles to N concurrent requests. Useful but completely different problem.
+**p-limit** — Controls concurrency, not rate. Limits to N concurrent requests, not N requests per minute. A different problem.
-**SDK built-in retry** retries on 429 with backoff. That's the floor, not the ceiling. It doesn't queue, doesn't prioritize, doesn't track cost, and doesn't propagate backoff to other in-flight requests.
+**SDK built-in retry** — Retries on 429 with backoff. That's the floor, not the ceiling. No queuing, no priority, no cost tracking, no backoff propagation to other in-flight requests.
 ---
 ## TypeScript
-Fully typed. All configuration options, events, errors, and report shapes have precise TypeScript definitions.
+Fully typed. All configuration options, events, errors, and report shapes have precise TypeScript definitions exported from the main entry point.
 ```typescript
 import type {
   RateLimiterConfig,
   CostReport,
+  EventMap,
   QueuedEvent,
   Priority,
+  ModelLimits,
 } from 'ai-sdk-rate-limiter'
-function handleQueuedRequest(event: QueuedEvent) {
-  // event.model, event.priority, event.queueDepth, event.estimatedWaitMs
-  // all typed, all autocompleted
-}
 ```
 ---
+## Examples
+A full Next.js 15 App Router example is included at [`examples/nextjs/`](./examples/nextjs/). It demonstrates streaming chat with rate limiting, live cost display, and proper error handling for budget and rate limit errors.
+---
 ## Requirements
 - Node.js 18+ / Bun / Deno