npm - @askalf/dario - Versions diffs - 3.12.0 → 3.13.0 - Mend

@askalf/dario 3.12.0 → 3.13.0

Files changed (11)

package/README.md +44 -4
package/dist/cc-template.js +15 -0
package/dist/live-fingerprint.d.ts +82 -0
package/dist/live-fingerprint.js +94 -0
package/dist/pool.d.ts +48 -0
package/dist/pool.js +99 -1
package/dist/proxy.js +168 -2
package/dist/sealed-pool.d.ts +202 -0
package/dist/sealed-pool.js +416 -0
package/dist/shim/runtime.cjs +146 -20
package/package.json +2 -2

package/README.md CHANGED

@@ -243,6 +243,7 @@ curl http://localhost:3456/analytics    # per-account / per-model stats, burn ra
 | `dario backend list` | List configured OpenAI-compat backends |
 | `dario backend add <name> --key=<key> [--base-url=<url>]` | Add an OpenAI-compat backend |
 | `dario backend remove <name>` | Remove an OpenAI-compat backend |
+| `dario shim -- <cmd> [args...]` | **Experimental (v3.12.0).** Run a child process with an in-process fetch patch that rewrites its outbound Anthropic requests — no HTTP proxy involved. See [Experimental: Shim mode](#experimental-shim-mode). |
 | `dario help` | Full command reference |
 ### Proxy options
@@ -454,6 +455,39 @@ curl http://localhost:3456/health
 ---
+## Experimental: Shim mode
+*New in v3.12.0. Opt-in. The default path is still the HTTP proxy — shim mode is a second transport, not a replacement.*
+Shim mode runs a child process with an **in-process `globalThis.fetch` patch** that rewrites the child's outbound requests to `api.anthropic.com/v1/messages` exactly the way the proxy would, then sends them directly from the child to Anthropic. No localhost HTTP hop. No port to bind. No `ANTHROPIC_BASE_URL` to set.
+```bash
+dario shim -- claude --print "hello"
+dario shim -v -- claude --print "hello"        # verbose
+```
+Under the hood: `dario shim` spawns the child with `NODE_OPTIONS=--require <dario-runtime.cjs>` and a unix socket / named pipe for telemetry. The runtime patches `globalThis.fetch` only for Anthropic messages requests, applies the same template replay the proxy does (system prompt, tools, user agent, beta flags), and relays per-request events back to the parent so analytics still work. Every other fetch call in the child is untouched and failsafe-passes through on any internal error.
+**When to use shim mode**
+- Running a single CC instance on a locked-down machine where binding a local port is inconvenient or forbidden.
+- Wrapping one-off scripts (`dario shim -- node my-agent.js`) without setting up environment variables.
+- Debugging a specific child process in isolation — verbose logs are scoped to that process.
+**When to stay on the proxy** (which is still the default)
+- Multi-client routing. The proxy serves every tool on the machine through one endpoint; the shim wraps one child at a time.
+- Multi-account pool mode. Pooling across subscriptions needs a shared OAuth pool the proxy owns — a shim patch inside one child can't see the pool state.
+- Anything that isn't a Node / Bun child. The shim relies on `NODE_OPTIONS`, so non-JS runtimes (Python SDK, a Go CLI) still need the proxy.
+Limitations at v3.12.0:
+- Bun child detection is partial — known-good with `claude --print` on Node.
+- No `--replace claude` global wrapper yet; you call `dario shim -- claude ...` explicitly.
+- Per-request token cost recording in shim mode is still being wired into analytics.
+- Windows named-pipe CI coverage is incomplete.
+The shim runtime lives at `src/shim/runtime.cjs` (hand-written CJS so `--require` can load it) and the host orchestrator at `src/shim/host.ts`. ~180 lines total. See the [v3.12.0 release notes](https://github.com/askalf/dario/releases/tag/v3.12.0) for the full design writeup.
+---
 ## Endpoints
 | Path | Description |
@@ -464,7 +498,7 @@ curl http://localhost:3456/health
 | `GET /health` | Proxy health + OAuth status + request count |
 | `GET /status` | Detailed Claude OAuth token status |
 | `GET /accounts` | Pool snapshot (pool mode only) |
-| `GET /analytics` | Per-account / per-model stats, burn rate, exhaustion predictions (pool mode only) |
+| `GET /analytics` | Per-account / per-model stats, burn rate, exhaustion predictions. **v3.11.1+:** every request carries a `billingBucket` field (`five_hour` / `seven_day` / `overage` / `unknown`) so you can see, at a glance, which bucket each request billed against. (pool mode only) |
 ---
@@ -523,6 +557,9 @@ This establishes a session baseline. Without priming, brand-new accounts occasio
 **What happens when Anthropic rotates the OAuth config?**
 Dario auto-detects OAuth config from the installed Claude Code binary. When CC ships a new version with rotated values, dario picks them up on the next run. Cache at `~/.dario/cc-oauth-cache-v3.json`, keyed by the CC binary fingerprint. Falls back to hardcoded CC 2.1.104 prod values if CC isn't installed.
+**What happens when Anthropic changes the CC request template?**
+*New in v3.11.0.* Dario extracts the live request template from your installed Claude Code binary on startup — the system prompt slices, tool schemas, user-agent, beta flags — and uses those to replay requests instead of a version pinned into dario itself. When CC ships a new version with a tweaked template, the next `dario proxy` run picks it up automatically. Fallback: the hand-curated `src/cc-template-data.json` bundled with the release, so dario still works even if the installed CC binary is a version the extractor doesn't know how to read. See `src/live-fingerprint.ts`.
 **I'm hitting rate limits on the Claude backend. What do I do?**
 Claude subscriptions have rolling 5-hour and 7-day usage windows. Check utilization with Claude Code's `/usage` command or the [statusline](https://code.claude.com/docs/en/statusline). For multi-agent workloads, add more accounts and let pool mode distribute the load: `dario accounts add <alias>`.
@@ -586,19 +623,22 @@ Longer-form writing on how dario works and why it works that way:
 ## Contributing
-PRs welcome. The codebase is ~2,500 lines of TypeScript across 10 files:
+PRs welcome. The codebase is small TypeScript — around ~3,000 lines across ~14 files:
 | File | Purpose |
 |---|---|
 | `src/proxy.ts` | HTTP proxy server, request handler, rate governor, Claude backend dispatch |
 | `src/cc-template.ts` | CC request template engine, tool mapping, orchestration & framework scrubbing |
-| `src/cc-template-data.json` | CC request template data (25 tools, 25KB system prompt) |
+| `src/cc-template-data.json` | Bundled fallback CC request template (used when live-fingerprint extraction isn't possible) |
 | `src/cc-oauth-detect.ts` | OAuth config auto-detection from the installed CC binary |
+| `src/live-fingerprint.ts` | **v3.11.0.** Live extraction of the CC request template (system prompt, tools, user-agent, beta flags) from the installed Claude Code binary |
 | `src/oauth.ts` | Single-account token storage, PKCE flow, auto-refresh |
 | `src/accounts.ts` | Multi-account credential storage and independent OAuth lifecycle |
 | `src/pool.ts` | Account pool, headroom-aware routing, failover target selection |
-| `src/analytics.ts` | Rolling request history, per-account / per-model stats, burn-rate |
+| `src/analytics.ts` | Rolling request history, per-account / per-model stats, burn-rate, billing bucket classification |
 | `src/openai-backend.ts` | OpenAI-compat backend credential storage and request forwarder |
+| `src/shim/runtime.cjs` | **v3.12.0.** Hand-written CJS payload loaded into child processes via `NODE_OPTIONS=--require`; patches `globalThis.fetch` for Anthropic messages requests only |
+| `src/shim/host.ts` | **v3.12.0.** Parent-side orchestrator for `dario shim` — spawns the child, owns the telemetry socket / named pipe, feeds analytics |
 | `src/cli.ts` | CLI entry point, command routing, Bun auto-relaunch |
 | `src/index.ts` | Library exports |
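
The shim-mode section of the README diff above describes an in-process `globalThis.fetch` patch that only touches Anthropic messages requests and passes everything else through, including on internal errors. A minimal sketch of that pattern follows. The names `patchFetch` and `rewriteRequest` are hypothetical and are not taken from dario's `runtime.cjs`; the sketch also only handles string and `URL` inputs and leaves `Request` objects untouched.

```javascript
// Hypothetical sketch of a scoped fetch patch. Not dario's actual runtime code.
const ANTHROPIC_MESSAGES = /^https:\/\/api\.anthropic\.com\/v1\/messages(\?|$)/;

// `target` is the object that owns `fetch` (globalThis in a real shim).
// `rewriteRequest(url, init)` is an assumed callback returning { url, init }.
function patchFetch(target, rewriteRequest) {
  const originalFetch = target.fetch;

  target.fetch = function patchedFetch(input, init) {
    // Only string and URL inputs are inspected; anything else passes through.
    const url =
      typeof input === 'string' ? input :
      input instanceof URL ? input.href :
      null;

    if (url === null || !ANTHROPIC_MESSAGES.test(url)) {
      return originalFetch.call(target, input, init);
    }

    let rewritten;
    try {
      rewritten = rewriteRequest(url, init ?? {});
    } catch {
      // Failsafe: if the rewrite itself fails, send the request unmodified.
      return originalFetch.call(target, input, init);
    }
    return originalFetch.call(target, rewritten.url, rewritten.init);
  };

  // Return an undo function so the patch can be removed again.
  return () => { target.fetch = originalFetch; };
}
```

In a real shim the first argument would be `globalThis`; passing the owner object explicitly keeps the sketch easy to exercise against a fake `fetch`.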

package/dist/cc-template.js CHANGED

@@ -390,6 +390,14 @@ export function buildCCRequest(clientBody, billingTag, cache1h, identity, opts =
                         default: return a;
                     }
                 },
+                // Unmapped-fallback mappings must always lose the reverse-lookup
+                // collision to any legitimate mapping that targets the same CC tool.
+                // Otherwise a client that declares both an unmapped tool (e.g.
+                // OpenClaw's `image`) round-robin'd onto Glob AND a real `glob` /
+                // `find_files` / `list_files` mapping can have the reverse path
+                // route real Glob tool_use blocks back to `image`, which then fails
+                // its own input validation ("image required"). dario#37, Glob half.
+                reverseScore: 0,
             });
         }
     }
@@ -554,12 +562,19 @@ function buildReverseLookup(toolMap) {
         }
     }
     // Score-based collision resolution in the non-identity pass.
+    // reverseScore: 0 means "never claim a reverse slot at all" — used for
+    // unmapped-fallback mappings whose forward path exists for round-robin
+    // distribution but whose reverse path would corrupt real CC tool calls
+    // (e.g. routing a real Glob tool_use back to an unmapped `image` client
+    // tool with the wrong input shape, dario#37 Glob half).
     const scoreOf = (m) => m.reverseScore ?? 10;
     for (const [clientName, mapping] of toolMap) {
         if (clientName.toLowerCase() === mapping.ccTool.toLowerCase())
             continue;
         if (identityClaimed.has(mapping.ccTool))
             continue;
+        if (scoreOf(mapping) === 0)
+            continue;
         const existing = reverseMap.get(mapping.ccTool);
         if (!existing || scoreOf(mapping) > scoreOf(existing.mapping)) {
             reverseMap.set(mapping.ccTool, { clientName, mapping });
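
The effect of `reverseScore: 0` in the hunk above can be illustrated with a reduced version of the reverse lookup. This is a sketch under assumptions: the identity pass is reconstructed from the `identityClaimed` check visible in the diff (its real body is not shown), and the function name `buildReverseLookupSketch` is hypothetical.

```javascript
// Simplified reverse lookup: maps a CC tool name back to one client tool.
// Assumption: the identity pass below approximates code not shown in the diff.
function buildReverseLookupSketch(toolMap) {
  const reverseMap = new Map();
  const identityClaimed = new Set();

  // Identity pass: a client tool with the same name as the CC tool wins outright.
  for (const [clientName, mapping] of toolMap) {
    if (clientName.toLowerCase() === mapping.ccTool.toLowerCase()) {
      reverseMap.set(mapping.ccTool, { clientName, mapping });
      identityClaimed.add(mapping.ccTool);
    }
  }

  // Non-identity pass: highest score wins; a score of 0 never claims a slot.
  const scoreOf = (m) => m.reverseScore ?? 10;
  for (const [clientName, mapping] of toolMap) {
    if (clientName.toLowerCase() === mapping.ccTool.toLowerCase()) continue;
    if (identityClaimed.has(mapping.ccTool)) continue;
    if (scoreOf(mapping) === 0) continue;
    const existing = reverseMap.get(mapping.ccTool);
    if (!existing || scoreOf(mapping) > scoreOf(existing.mapping)) {
      reverseMap.set(mapping.ccTool, { clientName, mapping });
    }
  }
  return reverseMap;
}

// An unmapped `image` tool round-robin'd onto Glob, plus a real `find_files` mapping.
const demoToolMap = new Map([
  ['image', { ccTool: 'Glob', reverseScore: 0 }],
  ['find_files', { ccTool: 'Glob' }],
]);
const demoReverse = buildReverseLookupSketch(demoToolMap);
console.log(demoReverse.get('Glob').clientName);
```

With the zero score in place, a Glob `tool_use` block resolves to `find_files` regardless of which mapping was declared first, and a tool map containing only the unmapped `image` entry yields no reverse entry for Glob at all.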

package/dist/live-fingerprint.d.ts CHANGED

@@ -15,6 +15,74 @@
  * only runs long enough to capture a single request. CC's OAuth token
  * never leaves the machine — we send CC to a loopback URL that CC itself
  * trusts because we set ANTHROPIC_BASE_URL in the child's environment.
+ *
+ * --------------------------------------------------------------------
+ * "Hide in the population" roadmap (v3.13 → ?)
+ * --------------------------------------------------------------------
+ *
+ * The fingerprint pipeline has historically cared about one axis: what
+ * goes INSIDE the /v1/messages body (agent identity, system prompt, tool
+ * list). That's only one fingerprint vector. Anthropic can (and likely
+ * does) look at several others:
+ *
+ *   1. Header ORDER. Node's http module emits headers in alphabetical
+ *      order via setHeader(). Undici preserves insertion order. Real CC
+ *      uses undici with a specific insertion pattern. If dario sends
+ *      headers in a different order than CC, the difference is trivially
+ *      observable on the server side via the raw header array.
+ *      → Captured as `header_order` below. Outbound proxy paths should
+ *        use the captured order when rebuilding fetch() headers.
+ *
+ *   2. TLS ClientHello (JA3 / JA4 fingerprint). The cipher list, elliptic
+ *      curves, extension order, and ALPN negotiation are determined by
+ *      the TLS library, and Node's TLS (OpenSSL) produces a distinctive
+ *      fingerprint that differs from any browser or from curl. Real CC
+ *      running on top of Node has the Node JA3 — so we already match,
+ *      provided both run on the same Node major. A cross-runtime worry
+ *      surfaces when Anthropic ships Bun- or bundled-binary CC: at that
+ *      point Node-dario and Bun-CC would JA-differ.
+ *      → Mitigation: detect Bun-compiled CC, fall back to shim mode
+ *        (which patches fetch INSIDE the CC process, inheriting CC's
+ *        own TLS stack for free).
+ *
+ *   3. HTTP/2 frame ordering + SETTINGS parameters. Similar to TLS, this
+ *      is controlled by the HTTP library. Node and undici produce a
+ *      consistent H2 fingerprint. Matches as long as both ends run the
+ *      same library.
+ *
+ *   4. Request timing distribution. Real CC sends requests with jitter
+ *      driven by user typing, tool-call sequencing, and internal retry
+ *      logic. Dario-through-a-client sends requests with jitter driven
+ *      by WHATEVER client is on the other end (OpenClaw, Hermes, curl).
+ *      That distribution differs from CC's. Anthropic could pattern-match
+ *      "no inter-request jitter" as a fingerprint for automated usage.
+ *      → Deferred. Adds latency for debatable gain. Analytics already
+ *        tracks per-request timing — could drive a replay distribution
+ *        later.
+ *
+ *   5. sessionId rotation cadence. CC rotates its internal session id
+ *      on a specific cadence (observed: roughly once per conversation
+ *      start, not per-request). Dario today uses a static session id
+ *      from loadClaudeIdentity. A proxy that kept rotating sessionId
+ *      randomly would stand out; a proxy that never rotates also stands
+ *      out. Matching CC's cadence requires observing CC over a longer
+ *      period than a single capture session.
+ *      → Deferred. Requires a longer-running capture mode.
+ *
+ *   6. Request body field ordering. JSON is unordered, but the wire
+ *      serialization IS ordered. Real CC uses a specific field order
+ *      for /v1/messages (e.g., `model` before `messages` before
+ *      `system` before `tools`). A proxy that serializes in a different
+ *      order leaks its origin.
+ *      → Worth matching. Cheap to implement — the template capture
+ *        already produces a body we can walk to recover field order.
+ *        Deferred to a follow-up.
+ *
+ * The concrete v3.13 move is (1): capture header_order and make it
+ * available on the template so the outbound proxy paths can reproduce
+ * it. Everything else is documented here as a roadmap so the next
+ * contributor — or dario maintainer six months from now — can pick up
+ * the right piece without re-deriving the threat model.
  */
 export interface TemplateData {
     _version: string;
@@ -28,6 +96,14 @@ export interface TemplateData {
         input_schema: Record<string, unknown>;
     }>;
     tool_names: string[];
+    /**
+     * The exact order CC emitted HTTP headers in when it hit the capture
+     * endpoint. Lowercased. Populated only from live captures — bundled
+     * snapshots leave this undefined and callers fall back to their own
+     * default order. Used by outbound proxy paths to reproduce CC's
+     * header ordering instead of Node's alphabetical default.
+     */
+    header_order?: string[];
 }
 /**
  * Load the template synchronously. Prefers the live cache (fresh capture
@@ -59,6 +135,12 @@ interface CapturedRequest {
     method: string;
     path: string;
     headers: Record<string, string>;
+    /**
+     * The flat [k1, v1, k2, v2, ...] array exactly as Node exposes it via
+     * req.rawHeaders. Preserves insertion order and duplicates, which the
+     * flattened `headers` map does not. Used to recover CC's header order.
+     */
+    rawHeaders: string[];
     body: Record<string, unknown>;
 }
 /**
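
The `header_order` field documented above is meant to let outbound paths reproduce the captured ordering. One way a caller might consume it is sketched here. The helper name `applyHeaderOrder` is hypothetical, and whether a particular HTTP client keeps this pair order on the wire is an assumption the sketch does not verify.

```javascript
// Reorder a header map to follow a captured, lowercased header_order list.
// Headers not named in the list are appended afterwards in their original order.
// Returns an array of [name, value] pairs with lowercased names.
function applyHeaderOrder(headers, headerOrder) {
  const byLower = new Map();
  for (const [name, value] of Object.entries(headers)) {
    byLower.set(name.toLowerCase(), value);
  }

  // No captured order (e.g. bundled snapshot): keep the caller's own order.
  if (!Array.isArray(headerOrder) || headerOrder.length === 0) {
    return [...byLower.entries()];
  }

  const ordered = [];
  for (const name of headerOrder) {
    if (byLower.has(name)) {
      ordered.push([name, byLower.get(name)]);
      byLower.delete(name);
    }
  }
  // Anything the capture did not mention goes last.
  for (const entry of byLower) ordered.push(entry);
  return ordered;
}
```

Returning pairs rather than a plain object keeps the ordering explicit; an array of pairs is one of the accepted shapes for a `fetch()` `headers` init value.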

package/dist/live-fingerprint.js CHANGED

@@ -15,6 +15,74 @@
  * only runs long enough to capture a single request. CC's OAuth token
  * never leaves the machine — we send CC to a loopback URL that CC itself
  * trusts because we set ANTHROPIC_BASE_URL in the child's environment.
+ *
+ * --------------------------------------------------------------------
+ * "Hide in the population" roadmap (v3.13 → ?)
+ * --------------------------------------------------------------------
+ *
+ * The fingerprint pipeline has historically cared about one axis: what
+ * goes INSIDE the /v1/messages body (agent identity, system prompt, tool
+ * list). That's only one fingerprint vector. Anthropic can (and likely
+ * does) look at several others:
+ *
+ *   1. Header ORDER. Node's http module emits headers in alphabetical
+ *      order via setHeader(). Undici preserves insertion order. Real CC
+ *      uses undici with a specific insertion pattern. If dario sends
+ *      headers in a different order than CC, the difference is trivially
+ *      observable on the server side via the raw header array.
+ *      → Captured as `header_order` below. Outbound proxy paths should
+ *        use the captured order when rebuilding fetch() headers.
+ *
+ *   2. TLS ClientHello (JA3 / JA4 fingerprint). The cipher list, elliptic
+ *      curves, extension order, and ALPN negotiation are determined by
+ *      the TLS library, and Node's TLS (OpenSSL) produces a distinctive
+ *      fingerprint that differs from any browser or from curl. Real CC
+ *      running on top of Node has the Node JA3 — so we already match,
+ *      provided both run on the same Node major. A cross-runtime worry
+ *      surfaces when Anthropic ships Bun- or bundled-binary CC: at that
+ *      point Node-dario and Bun-CC would JA-differ.
+ *      → Mitigation: detect Bun-compiled CC, fall back to shim mode
+ *        (which patches fetch INSIDE the CC process, inheriting CC's
+ *        own TLS stack for free).
+ *
+ *   3. HTTP/2 frame ordering + SETTINGS parameters. Similar to TLS, this
+ *      is controlled by the HTTP library. Node and undici produce a
+ *      consistent H2 fingerprint. Matches as long as both ends run the
+ *      same library.
+ *
+ *   4. Request timing distribution. Real CC sends requests with jitter
+ *      driven by user typing, tool-call sequencing, and internal retry
+ *      logic. Dario-through-a-client sends requests with jitter driven
+ *      by WHATEVER client is on the other end (OpenClaw, Hermes, curl).
+ *      That distribution differs from CC's. Anthropic could pattern-match
+ *      "no inter-request jitter" as a fingerprint for automated usage.
+ *      → Deferred. Adds latency for debatable gain. Analytics already
+ *        tracks per-request timing — could drive a replay distribution
+ *        later.
+ *
+ *   5. sessionId rotation cadence. CC rotates its internal session id
+ *      on a specific cadence (observed: roughly once per conversation
+ *      start, not per-request). Dario today uses a static session id
+ *      from loadClaudeIdentity. A proxy that kept rotating sessionId
+ *      randomly would stand out; a proxy that never rotates also stands
+ *      out. Matching CC's cadence requires observing CC over a longer
+ *      period than a single capture session.
+ *      → Deferred. Requires a longer-running capture mode.
+ *
+ *   6. Request body field ordering. JSON is unordered, but the wire
+ *      serialization IS ordered. Real CC uses a specific field order
+ *      for /v1/messages (e.g., `model` before `messages` before
+ *      `system` before `tools`). A proxy that serializes in a different
+ *      order leaks its origin.
+ *      → Worth matching. Cheap to implement — the template capture
+ *        already produces a body we can walk to recover field order.
+ *        Deferred to a follow-up.
+ *
+ * The concrete v3.13 move is (1): capture header_order and make it
+ * available on the template so the outbound proxy paths can reproduce
+ * it. Everything else is documented here as a roadmap so the next
+ * contributor — or dario maintainer six months from now — can pick up
+ * the right piece without re-deriving the threat model.
  */
 import { spawn } from 'node:child_process';
 import { createServer } from 'node:http';
@@ -165,6 +233,7 @@ async function runCapture(timeoutMs) {
                         method: req.method ?? 'POST',
                         path: req.url ?? '/v1/messages',
                         headers,
+                        rawHeaders: Array.isArray(req.rawHeaders) ? [...req.rawHeaders] : [],
                         body,
                     };
                 }
@@ -324,6 +393,7 @@ export function extractTemplate(captured) {
     if (tools.length === 0)
         return null;
     const version = extractCCVersion(captured.headers) ?? 'unknown';
+    const headerOrder = extractHeaderOrder(captured.rawHeaders);
     return {
         _version: version,
         _captured: new Date().toISOString(),
@@ -332,8 +402,32 @@ export function extractTemplate(captured) {
         system_prompt: systemPrompt,
         tools,
         tool_names: tools.map((t) => t.name),
+        header_order: headerOrder,
     };
 }
+/**
+ * Walk rawHeaders (flat [k1, v1, k2, v2, ...] array) and return the
+ * header names in insertion order, lowercased, de-duplicated. If the
+ * raw array is empty or unusable, returns undefined so the caller
+ * falls back to default ordering.
+ */
+function extractHeaderOrder(rawHeaders) {
+    if (!Array.isArray(rawHeaders) || rawHeaders.length === 0)
+        return undefined;
+    const order = [];
+    const seen = new Set();
+    for (let i = 0; i < rawHeaders.length; i += 2) {
+        const name = rawHeaders[i];
+        if (typeof name !== 'string')
+            continue;
+        const lower = name.toLowerCase();
+        if (seen.has(lower))
+            continue;
+        seen.add(lower);
+        order.push(lower);
+    }
+    return order.length > 0 ? order : undefined;
+}
 function pickTextBlock(block) {
     if (!block || typeof block !== 'object')
         return null;
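
Roadmap item 6 in the comment block above (request body field ordering) is marked as deferred. As an illustration of why it is described as cheap, the sketch below recovers top-level key order from a captured body and re-serializes another body in that order. Both function names are hypothetical and nothing here is part of the 3.13.0 diff; it relies on JavaScript preserving insertion order for non-integer string keys.

```javascript
// Recover the top-level field order from a captured request body.
function captureFieldOrder(capturedBody) {
  return Object.keys(capturedBody);
}

// Serialize `body` so its top-level keys follow `fieldOrder`.
// Keys absent from `fieldOrder` are appended in their existing order.
function serializeInOrder(body, fieldOrder) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  const out = {};
  for (const key of fieldOrder) {
    if (has(body, key)) out[key] = body[key];
  }
  for (const key of Object.keys(body)) {
    if (!has(out, key)) out[key] = body[key];
  }
  return JSON.stringify(out);
}

// Example: a client body built in a different order than the captured one.
const capturedOrder = captureFieldOrder({ model: 'm', messages: [], system: [], tools: [] });
const wire = serializeInOrder({ tools: [], model: 'm', messages: [] }, capturedOrder);
console.log(wire);
```

Only the top level is reordered here; nested objects keep whatever order the client produced.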

package/dist/pool.d.ts CHANGED

@@ -1,3 +1,15 @@
+/**
+ * Compute a stable stickiness key from a conversation's first user
+ * message. Multi-turn agent sessions carry the same first user message
+ * on every turn, so hashing it gives a stable per-conversation key that
+ * doesn't require client cooperation. Empty / whitespace-only inputs
+ * return null so callers bypass stickiness on unhashable requests.
+ *
+ * Uses SHA-256 truncated to 16 hex chars (64 bits) — plenty of collision
+ * headroom for a pool of at most a few hundred active conversations per
+ * proxy instance, and small enough to log without spam.
+ */
+export declare function computeStickyKey(firstUserMessage: string | null | undefined): string | null;
 export interface AccountIdentity {
     deviceId: string;
     accountUuid: string;
@@ -39,6 +51,7 @@ export declare class AccountPool {
     private queueMaxSize;
     private queueTimeoutMs;
     private drainTimer;
+    private sticky;
     add(alias: string, opts: {
         accessToken: string;
         refreshToken: string;
@@ -50,6 +63,41 @@ export declare class AccountPool {
     get size(): number;
     /** Select the best account for the next request. */
     select(): PoolAccount | null;
+    /**
+     * Select with session stickiness. If `stickyKey` is already bound to a
+     * healthy account (not rejected, token not near expiry, headroom > 2%),
+     * return that account. Otherwise pick by headroom (`select()`) and
+     * rebind the key to the chosen account. Null key bypasses stickiness
+     * and delegates to `select()`.
+     *
+     * Rebinding also fires when the previously-bound account is marked
+     * rejected (429) or has its headroom drop below 2% — at that point the
+     * conversation's cache entry on the old account is effectively stranded
+     * until reset anyway, so there's no cost to moving. The new account
+     * starts building its own cache for this conversation from turn 1 of
+     * the rebind.
+     *
+     * Also performs lazy cleanup of expired bindings (TTL or size cap).
+     */
+    selectSticky(stickyKey: string | null): PoolAccount | null;
+    /**
+     * Rebind a sticky key to a different account — called by proxy after an
+     * in-request 429 failover moves to the next-best account. Without this
+     * the next turn of the same conversation would re-select the exhausted
+     * account via the stale binding, eat another 429, and failover again.
+     */
+    rebindSticky(stickyKey: string | null, alias: string): void;
+    /**
+     * Drop any binding that points at an account no longer in the pool, any
+     * binding past the TTL, and if we're over the size cap drop the oldest
+     * entries until we're back under. O(n) but n is small (capped at 2k)
+     * and this only runs on selectSticky, not on every method.
+     */
+    private cleanupSticky;
+    /** Test/inspection helper — number of live sticky bindings. */
+    stickyCount(): number;
+    /** Test/inspection helper — current alias bound to a key, or null. */
+    stickyAliasFor(stickyKey: string): string | null;
     /** Select the next-best account, excluding the given set of aliases. */
     selectExcluding(excluded: Set<string>): PoolAccount | null;
     updateRateLimits(alias: string, snapshot: RateLimitSnapshot): void;
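
`computeStickyKey` takes the conversation's first user message as a string, but the declarations above do not show how a caller obtains it. A sketch of one possible extraction from a `/v1/messages` style `messages` array follows. The helper name `firstUserText` is hypothetical; it assumes `content` is either a string or an array of blocks whose text blocks have `type: 'text'`.

```javascript
// Pull the text of the first user message out of a messages array.
// Returns null when there is nothing usable, so the caller can skip stickiness.
function firstUserText(messages) {
  if (!Array.isArray(messages)) return null;

  const first = messages.find((m) => m && m.role === 'user');
  if (!first) return null;

  // Plain string content.
  if (typeof first.content === 'string') return first.content;

  // Block content: join the text blocks, ignore images and tool results.
  if (Array.isArray(first.content)) {
    const text = first.content
      .filter((b) => b && b.type === 'text' && typeof b.text === 'string')
      .map((b) => b.text)
      .join('\n');
    return text.length > 0 ? text : null;
  }
  return null;
}
```

The result would then be passed to `computeStickyKey`, which already returns `null` for `null`, empty, or whitespace-only input, so a `null` from this helper leads to `selectSticky` delegating to `select()`.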

package/dist/pool.js CHANGED

@@ -6,7 +6,24 @@
  * path it has always had; the pool only runs when there are multiple
  * accounts to distribute against.
  */
-import { randomUUID } from 'node:crypto';
+import { createHash, randomUUID } from 'node:crypto';
+/**
+ * Compute a stable stickiness key from a conversation's first user
+ * message. Multi-turn agent sessions carry the same first user message
+ * on every turn, so hashing it gives a stable per-conversation key that
+ * doesn't require client cooperation. Empty / whitespace-only inputs
+ * return null so callers bypass stickiness on unhashable requests.
+ *
+ * Uses SHA-256 truncated to 16 hex chars (64 bits) — plenty of collision
+ * headroom for a pool of at most a few hundred active conversations per
+ * proxy instance, and small enough to log without spam.
+ */
+export function computeStickyKey(firstUserMessage) {
+    const trimmed = (firstUserMessage ?? '').trim();
+    if (trimmed.length === 0)
+        return null;
+    return createHash('sha256').update(trimmed).digest('hex').slice(0, 16);
+}
 export const EMPTY_SNAPSHOT = {
     status: 'unknown',
     util5h: 0,
@@ -31,12 +48,15 @@ export function parseRateLimits(headers) {
         updatedAt: Date.now(),
     };
 }
+const STICKY_TTL_MS = 6 * 60 * 60 * 1000; // 6h
+const STICKY_MAX_ENTRIES = 2_000; // lazy cleanup cap
 export class AccountPool {
     accounts = new Map();
     queue = [];
     queueMaxSize = 50;
     queueTimeoutMs = 60_000;
     drainTimer = null;
+    sticky = new Map();
     add(alias, opts) {
         const existing = this.accounts.get(alias);
         this.accounts.set(alias, {
@@ -82,6 +102,84 @@ export class AccountPool {
         // No rate-limit data at all — least-used first
         return all.reduce((a, b) => a.requestCount < b.requestCount ? a : b);
     }
+    /**
+     * Select with session stickiness. If `stickyKey` is already bound to a
+     * healthy account (not rejected, token not near expiry, headroom > 2%),
+     * return that account. Otherwise pick by headroom (`select()`) and
+     * rebind the key to the chosen account. Null key bypasses stickiness
+     * and delegates to `select()`.
+     *
+     * Rebinding also fires when the previously-bound account is marked
+     * rejected (429) or has its headroom drop below 2% — at that point the
+     * conversation's cache entry on the old account is effectively stranded
+     * until reset anyway, so there's no cost to moving. The new account
+     * starts building its own cache for this conversation from turn 1 of
+     * the rebind.
+     *
+     * Also performs lazy cleanup of expired bindings (TTL or size cap).
+     */
+    selectSticky(stickyKey) {
+        if (!stickyKey)
+            return this.select();
+        this.cleanupSticky();
+        const binding = this.sticky.get(stickyKey);
+        if (binding) {
+            const bound = this.accounts.get(binding.alias);
+            const now = Date.now();
+            if (bound
+                && bound.rateLimit.status !== 'rejected'
+                && bound.expiresAt > now + 30_000
+                && (1 - Math.max(bound.rateLimit.util5h, bound.rateLimit.util7d)) > 0.02) {
+                return bound;
+            }
+        }
+        const picked = this.select();
+        if (picked) {
+            this.sticky.set(stickyKey, { alias: picked.alias, boundAt: Date.now() });
+        }
+        return picked;
+    }
+    /**
+     * Rebind a sticky key to a different account — called by proxy after an
+     * in-request 429 failover moves to the next-best account. Without this
+     * the next turn of the same conversation would re-select the exhausted
+     * account via the stale binding, eat another 429, and failover again.
+     */
+    rebindSticky(stickyKey, alias) {
+        if (!stickyKey)
+            return;
+        if (!this.accounts.has(alias))
+            return;
+        this.sticky.set(stickyKey, { alias, boundAt: Date.now() });
+    }
+    /**
+     * Drop any binding that points at an account no longer in the pool, any
+     * binding past the TTL, and if we're over the size cap drop the oldest
+     * entries until we're back under. O(n) but n is small (capped at 2k)
+     * and this only runs on selectSticky, not on every method.
+     */
+    cleanupSticky() {
+        const now = Date.now();
+        for (const [key, b] of this.sticky) {
+            if (!this.accounts.has(b.alias) || now - b.boundAt > STICKY_TTL_MS) {
+                this.sticky.delete(key);
+            }
+        }
+        if (this.sticky.size > STICKY_MAX_ENTRIES) {
+            const sorted = [...this.sticky.entries()].sort((a, b) => a[1].boundAt - b[1].boundAt);
+            const toDrop = sorted.slice(0, this.sticky.size - STICKY_MAX_ENTRIES);
+            for (const [key] of toDrop)
+                this.sticky.delete(key);
+        }
+    }
+    /** Test/inspection helper — number of live sticky bindings. */
+    stickyCount() {
+        return this.sticky.size;
+    }
+    /** Test/inspection helper — current alias bound to a key, or null. */
+    stickyAliasFor(stickyKey) {
+        return this.sticky.get(stickyKey)?.alias ?? null;
+    }
     /** Select the next-best account, excluding the given set of aliases. */
     selectExcluding(excluded) {
         if (this.accounts.size <= 1)
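
The health check inside `selectSticky` above combines three conditions. Pulled out as a standalone predicate it reads as follows. The function name `isBindingHealthy` and the named constants are hypothetical; the thresholds (30 seconds of token lifetime, 2% headroom) are the ones in the diff.

```javascript
const EXPIRY_MARGIN_MS = 30_000; // token must outlive "now" by at least 30s
const MIN_HEADROOM = 0.02;       // at least 2% left in the tighter window

// True when a previously bound account can keep serving its conversation.
function isBindingHealthy(account, now) {
  if (!account) return false;
  if (account.rateLimit.status === 'rejected') return false;
  if (!(account.expiresAt > now + EXPIRY_MARGIN_MS)) return false;
  // Headroom is measured against whichever window is more utilized.
  const headroom = 1 - Math.max(account.rateLimit.util5h, account.rateLimit.util7d);
  return headroom > MIN_HEADROOM;
}

// Worked example: 50% of the 5-hour window and 99% of the 7-day window used.
// Headroom is 1 - max(0.5, 0.99), roughly 0.01, which is below the 2% floor.
const demoNow = 1_000_000;
const nearlyExhausted = {
  rateLimit: { status: 'allowed', util5h: 0.5, util7d: 0.99 },
  expiresAt: demoNow + 3_600_000,
};
console.log(isBindingHealthy(nearlyExhausted, demoNow));
```

Because the check uses the maximum of the two utilizations, an account with plenty of 5-hour budget is still rebound once its 7-day window is nearly spent.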