promptecho 0.1.0__tar.gz

Files changed (23)

promptecho-0.1.0/.gitignore +10 -0
promptecho-0.1.0/.hgignore +17 -0
promptecho-0.1.0/DESIGN.md +196 -0
promptecho-0.1.0/PKG-INFO +232 -0
promptecho-0.1.0/README.md +217 -0
promptecho-0.1.0/SUPPORT.md +89 -0
promptecho-0.1.0/TUTORIAL.md +553 -0
promptecho-0.1.0/examples/cassettes/example.yaml +32 -0
promptecho-0.1.0/examples/test_example.py +24 -0
promptecho-0.1.0/pyproject.toml +29 -0
promptecho-0.1.0/src/promptecho/__init__.py +81 -0
promptecho-0.1.0/src/promptecho/cassette.py +138 -0
promptecho-0.1.0/src/promptecho/matcher.py +115 -0
promptecho-0.1.0/src/promptecho/normalizers.py +160 -0
promptecho-0.1.0/src/promptecho/patch.py +134 -0
promptecho-0.1.0/src/promptecho/pytest_plugin.py +29 -0
promptecho-0.1.0/src/promptecho/transport.py +89 -0
promptecho-0.1.0/tests/conftest.py +1 -0
promptecho-0.1.0/tests/test_diff_on_miss.py +138 -0
promptecho-0.1.0/tests/test_normalizers.py +79 -0
promptecho-0.1.0/tests/test_plugin.py +15 -0
promptecho-0.1.0/tests/test_reasoning_and_binary.py +128 -0
promptecho-0.1.0/tests/test_record_replay.py +177 -0

promptecho-0.1.0/.gitignore ADDED

@@ -0,0 +1,10 @@
+__pycache__/
+*.py[cod]
+.pytest_cache/
+*.egg-info/
+build/
+dist/
+.venv/
+venv/
+.env
+.DS_Store

promptecho-0.1.0/.hgignore ADDED

@@ -0,0 +1,17 @@
+# Use shell-style glob syntax
+syntax: glob
+# Compiled Python files
+*.pyc
+# Folder view configuration files
+.DS_Store
+Desktop.ini
+# Thumbnail cache files
+._*
+Thumbs.db
+# Files that might appear on external disks
+.Spotlight-V100
+.Trashes

promptecho-0.1.0/DESIGN.md ADDED

@@ -0,0 +1,196 @@
+# promptecho — design notes
+The ergonomics (`use_cassette` decorator, record modes, pytest fixture) are
+table stakes copied straight from vcrpy and aren't interesting. The six sections
+below are where this tool earns its name. Each one documents a deliberate choice
+and the specific failure of the obvious "vcr-for-llms" approach that motivated it.
+## 1. Interception: at the HTTP transport, not the SDK
+The Anthropic and OpenAI Python SDKs are built on `httpx`, and so are Mistral,
+Cohere v5+, `google-genai`, the AnthropicBedrock / AnthropicVertex variants, and the
+OpenAI SDK pointed at OpenRouter / Together / Fireworks / Cerebras / Groq / vLLM /
+TGI / SGLang via `base_url=`. So we intercept once at `httpx`'s transport layer and
+get every one of them for free, instead of monkeypatching each SDK's
+`messages.create` / `chat.completions.create` surface (which would be a maintenance
+treadmill as SDKs change).
+The mechanism: [`patch.py`](src/promptecho/patch.py) monkeypatches
+`httpx.HTTPTransport.handle_request` (sync) and `httpx.AsyncHTTPTransport.handle_async_request`
+(async) for the duration of the `use_cassette` block, then restores the originals.
+Each patched method routes the request through the record/replay decision and
+either returns a fresh `httpx.Response` reconstructed from the cassette or passes
+through to the real transport and captures the result.
+The decision logic itself lives in [`transport.py`](src/promptecho/transport.py) and
+is pure — no httpx, no I/O. That separation means the branching logic (cassette
+miss → record? error? re-record?) is unit-testable without standing up a network
+stack. It's the same separation `respx` and `vcrpy`'s httpx stub use.
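A minimal sketch of what a pure decision function of this kind could look like. The names `decide`, `Action`, and `MODES` are hypothetical and not taken from `transport.py`, and the rule that a miss against an existing cassette in `once` mode is an error is an assumption borrowed from vcrpy's semantics.

```python
from enum import Enum

MODES = ("once", "none", "new_episodes", "all")


class Action(Enum):
    REPLAY = "replay"
    RECORD = "record"
    ERROR = "error"


def decide(mode: str, cassette_exists: bool, has_match: bool) -> Action:
    """Pick an action from plain values only: no httpx, no disk, no network."""
    if mode not in MODES:
        raise ValueError(f"unknown record mode: {mode!r}")
    if mode == "all":
        return Action.RECORD          # always hit the network and re-record
    if has_match:
        return Action.REPLAY          # every other mode replays a hit
    if mode == "new_episodes":
        return Action.RECORD          # misses are appended to the cassette
    if mode == "once" and not cassette_exists:
        return Action.RECORD          # first run creates the cassette
    return Action.ERROR               # a miss must never reach the network
```

Because the function takes three plain values, every branch can be covered by a table-driven test with no server and no patched transport.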
+**The cost of this choice:** SDKs not on httpx (boto3-Bedrock, HF `InferenceClient`,
+`google-cloud-aiplatform`) are invisible — they just pass straight to the network as if
+promptecho weren't installed. That's a deliberate v1 scope. The roadmap item
+"`requests`/`urllib3` interception backend" closes it, at the cost of supporting two
+transport stacks at once.
+## 2. Matching: normalized fingerprint, not raw bytes
+The crux, and the thing the obvious "vcr for LLMs" pitch gets wrong.
+For **replay** you want determinism: the same logical request must map to the same
+recording, every time. vcrpy's default matchers look only at method and URI, which
+cannot tell two different prompts to the same endpoint apart, and its opt-in
+raw-byte body matching fails because LLM bodies carry volatile noise:
+client-injected request IDs, reordered `tools` arrays, key-order and whitespace
+differences from re-serialization. Same call, different bytes, missed match.
+So we compute a **fingerprint** over only the fields that determine the response:
+```
+fingerprint(body) = sha256( canonical_json( pick(body, match_on) ) )
+```
+- `canonical_json` sorts keys and strips insignificant whitespace, so
+  re-serialization can't change the key.
+- Volatile fields are simply not in `match_on`, so they can't affect the match.
+The default `match_on` is:
+```python
+["model", "messages", "system", "tools", "tool_choice",
+ "reasoning_effort", "reasoning", "thinking"]
+```
+`model`, `messages`, `system`, `tools`, `tool_choice` are obvious. The last three
+are less obvious and matter: reasoning-model knobs (OpenAI `reasoning_effort`,
+Anthropic `thinking`, OpenRouter `reasoning`) change the response without changing
+the prompt. If they aren't in the default match set, a test with
+`reasoning_effort="high"` would silently replay the recording made for
+`"low"` — a wrong-fixture bug that's hard to catch by eye. So they're in by
+default, even though omitting them would yield "smaller" fingerprints. Correctness
+wins.
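The fingerprint formula and the default match set above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `matcher.py`: the helper names `pick`, `canonical_json`, and `fingerprint` simply mirror the formula, and the sketch works on top-level fields only.

```python
import hashlib
import json

DEFAULT_MATCH_ON = ["model", "messages", "system", "tools", "tool_choice",
                    "reasoning_effort", "reasoning", "thinking"]


def pick(body: dict, match_on: list) -> dict:
    """Keep only the fields that are allowed to influence the match."""
    return {key: body[key] for key in match_on if key in body}


def canonical_json(value) -> str:
    """Sorted keys and no insignificant whitespace, so re-serialization is stable."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def fingerprint(body: dict, match_on=DEFAULT_MATCH_ON) -> str:
    """sha256 over the canonical JSON of the picked fields, as a hex string."""
    payload = canonical_json(pick(body, match_on)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

With this sketch, two bodies that differ only in key order or in a field outside `match_on` (say, a hypothetical `request_id`) produce the same digest, while changing `reasoning_effort` from `"low"` to `"high"` produces a different one.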
+`match_on` is also user-configurable, because only the test author knows which
+fields are load-bearing for *their* assertion (does this test care about
+`temperature`? `max_tokens`?).
+**What we deliberately do NOT do:** semantic / embedding matching on replay.
+"Different prompt, embedding-close enough → same recording" reintroduces
+non-determinism into the exact thing you adopted promptecho to make deterministic,
+and can silently serve the wrong recording. Semantic matching is a *caching*
+concern, not a *testing* one. Keeping these two ideas separate is a core stance
+— see the README's "What promptecho is not." A `fuzzy=True` dev-loop convenience
+is on the roadmap; it will never be the default and never used in CI.
+## 3. Cross-provider canonicalization
+The same logical prompt is expressed in different wire shapes across providers,
+SDK versions, and even within a single provider's API:
+- Anthropic puts the system prompt in a top-level `system` param; OpenAI puts
+  it in a `system`- or `developer`-role message.
+- Message content may be a bare string or a list of typed content blocks.
+- Tool defs differ: Anthropic `{name, description, input_schema}` vs OpenAI
+  `{type: function, function: {name, description, parameters}}`.
+- OpenAI's `max_completion_tokens` replaces `max_tokens` on newer models; the two name the same limit.
+[`normalizers.py`](src/promptecho/normalizers.py) maps each raw provider body into
+one canonical shape *before* fingerprinting. That's the capability a raw-bytes HTTP
+VCR fundamentally cannot have — and the reason "just point vcrpy at httpx" is
+unsatisfying. The provider is detected by URL host first, with body shape as the fallback (so
+localhost / self-hosted gateways / private proxies behave the same as the
+brand-name hosts).
+Two consequences worth being explicit about:
+1. **The canonical body is what's stored on disk** — not the raw provider JSON.
+   This makes cassettes provider-agnostic and easier to skim in code review, at
+   the cost of being one step removed from the wire format. Worth it.
+2. **Lossy joins.** Where shapes don't map cleanly, we choose the simpler form:
+   e.g. a multi-block system prompt collapses to a `\n`-joined string. That's
+   fine for matching; if you have something exotic worth preserving, override
+   `match_on` to include the original field path instead.
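To make the mapping concrete, here is a rough sketch of a canonicalizer that handles the three differences listed above (system prompt placement, string vs. content blocks, tool-definition shape). The function names and the chosen canonical tool shape `{name, description, parameters}` are assumptions for illustration; `normalizers.py` may structure this differently. The sketch keeps only text blocks, so it is lossier than a real normalizer would need to be.

```python
def _flatten(content) -> str:
    """Collapse a bare string or a list of typed blocks into one string."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    return "\n".join(b.get("text", "") for b in content if b.get("type") == "text")


def _tool(tool: dict) -> dict:
    """Map both tool-definition shapes onto {name, description, parameters}."""
    fn = tool.get("function", tool)   # OpenAI nests the definition under "function"
    return {
        "name": fn.get("name"),
        "description": fn.get("description"),
        "parameters": fn.get("parameters", fn.get("input_schema")),
    }


def canonicalize(body: dict) -> dict:
    system_parts = []
    if "system" in body:                               # Anthropic: top-level param
        system_parts.append(_flatten(body["system"]))
    messages = []
    for msg in body.get("messages", []):
        text = _flatten(msg.get("content"))
        if msg["role"] in ("system", "developer"):     # OpenAI: role-tagged message
            system_parts.append(text)
        else:
            messages.append({"role": msg["role"], "content": text})
    out = {"model": body.get("model"), "messages": messages}
    if system_parts:
        out["system"] = "\n".join(system_parts)        # the lossy "\n" join
    if "tools" in body:
        out["tools"] = [_tool(t) for t in body["tools"]]
    return out
```

Feeding the canonical dict (rather than the raw body) into the fingerprint is what lets an Anthropic-shaped and an OpenAI-shaped request for the same prompt land on the same recording.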
+Open: extending the same treatment to more providers (Gemini, Mistral), and adding a
+user-pluggable normalizer hook for in-house gateways with custom shapes.
+## 4. Streaming: record the events, re-emit faithfully
+Most real LLM calls are `stream=True` (SSE). A recording that only captures the
+final assembled body is useless for testing streaming code paths — you can't
+test progressive UI rendering, token-budget cutoffs, or streaming-tool-call
+handling against a one-shot fixture.
+promptecho captures the **ordered list of SSE events** as they arrive, stores them
+under `response.events`, and on replay re-emits them as a synthetic stream — so
+`for chunk in stream:` iterates identically against the cassette, including event
+boundaries and `message_delta` / `content_block_delta` / reasoning-delta ordering.
+This is fiddly (chunked transfer, `[DONE]` sentinels, provider-specific event
+shapes) and is exactly why generic HTTP VCR tools are unsatisfying for LLM work.
+Getting it right is the moat.
+**Known limitation:** on the *recording* run we buffer the full upstream
+response before returning it, so you lose true streaming timing while recording.
+Replay streams fine. Acceptable for a test tool — the recording run isn't where
+you measure latency.
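The record-then-re-emit idea can be sketched as a pair of functions: one that splits a buffered SSE body into an ordered event list, and one that turns that list back into wire-format chunks. The names `parse_sse` and `emit_sse` and the `{event, data}` dict shape are hypothetical, and the sketch ignores SSE comment lines, `id:` / `retry:` fields, and events that carry no `data:` line.

```python
def _field_value(line: str, name: str) -> str:
    """Return the value after 'name:', dropping one optional leading space."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(raw: bytes) -> list:
    """Split a buffered SSE body into an ordered list of {event, data} dicts."""
    events = []
    text = raw.decode("utf-8").replace("\r\n", "\n")
    for chunk in text.split("\n\n"):                  # blank line ends an event
        event, data_lines = None, []
        for line in chunk.split("\n"):
            if line.startswith("event:"):
                event = _field_value(line, "event")
            elif line.startswith("data:"):
                data_lines.append(_field_value(line, "data"))
        if data_lines:
            events.append({"event": event, "data": "\n".join(data_lines)})
    return events


def emit_sse(events: list):
    """Yield one wire-format chunk per recorded event, in recorded order."""
    for ev in events:
        lines = []
        if ev["event"] is not None:
            lines.append(f"event: {ev['event']}")
        lines.extend(f"data: {part}" for part in ev["data"].split("\n"))
        yield ("\n".join(lines) + "\n\n").encode("utf-8")
```

On replay, a generator like `emit_sse` would back the response stream, so client code iterating over chunks sees the same event boundaries it saw when the cassette was recorded.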
+## 5. Binary responses: detect, base64, round-trip
+Image-out, audio-out, and any `application/octet-stream` body cannot be
+text-decoded — the bytes have to survive the cassette round-trip exactly. A
+YAML cassette that runs binary bodies through `.decode("utf-8", "replace")`
+silently corrupts them (this was a real bug found by probe-testing before launch).
+So `_capture()` inspects `Content-Type` and, for anything `image/*`, `audio/*`,
+`video/*`, `application/octet-stream`, `pdf`, or `zip`, stores the body
+base64-encoded with a `binary: true` flag. Replay decodes it back. **Verified
+byte-equal** end-to-end (record → server shutdown → replay) in
+`tests/test_reasoning_and_binary.py::test_binary_image_response_byte_exact`.
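A small sketch of the detect / base64 / round-trip step, assuming that `pdf` and `zip` above mean `application/pdf` and `application/zip`. The function names and the stored dict shape are illustrative and not taken from `_capture()`.

```python
import base64

BINARY_PREFIXES = ("image/", "audio/", "video/")
BINARY_TYPES = ("application/octet-stream", "application/pdf", "application/zip")


def is_binary(content_type: str) -> bool:
    """Decide from Content-Type alone, ignoring parameters such as charset."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith(BINARY_PREFIXES) or media_type in BINARY_TYPES


def capture_body(content_type: str, body: bytes) -> dict:
    """Produce a text-safe representation suitable for a YAML cassette."""
    if is_binary(content_type):
        return {"binary": True, "body": base64.b64encode(body).decode("ascii")}
    return {"binary": False, "body": body.decode("utf-8")}


def restore_body(stored: dict) -> bytes:
    """Invert capture_body, returning the original bytes."""
    if stored["binary"]:
        return base64.b64decode(stored["body"])
    return stored["body"].encode("utf-8")
```

The failure mode described above is easy to reproduce: bytes such as `0x89 0x50 0x4E 0x47` (the start of a PNG header) do not survive `.decode("utf-8", "replace")` followed by `.encode("utf-8")`, whereas the base64 path returns them unchanged.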
+Multimodal-as-JSON (base64 inside `content` blocks — the Anthropic / OpenAI
+vision / GPT-4o image-out shape) was already fine, because the base64 string
+lives inside JSON and never gets text-decoded as bytes. That stays covered by
+its own test.
+`multipart/form-data` (file uploads/downloads) is explicitly out of scope for
+v1 — large payloads, encoding edge cases, rarely a thing you want to freeze in a
+fixture.
+## 6. Secrets: redact on record
+A cassette is meant to be committed, so it must be safe by default — opt-out, not
+opt-in. On record we strip `authorization`, `x-api-key`, and `openai-organization`
+headers and never write request auth to disk. The list is configurable
+(`REDACT_HEADERS` in [`cassette.py`](src/promptecho/cassette.py)) — extend it for
+provider-specific auth headers, never shrink it without thinking carefully.
+Body-level secrets (a prompt that happens to contain a credential) are *not*
+auto-redacted, because there's no reliable way to detect them. The workaround
+is simple: don't put secrets in prompts. A future `redact_body=[...]` hook is a
+reasonable addition.
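Header redaction on record can be sketched in a few lines. `REDACT_HEADERS` is the name used above, but its type (a set of lower-case names) is an assumption here, and the `redact_headers` function with its `extra` parameter is hypothetical.

```python
REDACT_HEADERS = {"authorization", "x-api-key", "openai-organization"}


def redact_headers(headers: dict, extra=()) -> dict:
    """Drop sensitive headers case-insensitively before anything is written to disk."""
    blocked = REDACT_HEADERS | {name.lower() for name in extra}
    return {key: value for key, value in headers.items() if key.lower() not in blocked}
```

Note that `extra` can only add names to the blocked set, which matches the "extend it, never shrink it" stance: there is no argument that removes a default.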
+---
+## Open design threads (good build-in-public material)
+- ~~**Field-level diff on cassette miss in CI.**~~ **Done** —
+  `matcher.diff_request()` walks both bodies in parallel and emits leaf-level
+  `(path, recorded, incoming)` triples (`messages[1].content`, etc.), shown
+  inline in the `CassetteMiss` error. Restricted to `match_on` fields so the
+  output isn't drowned in volatile-field noise. Long values are truncated to
+  ~80 chars. Open extension: an optional terminal-colored mode for local dev
+  (off by default to keep CI logs grep-able).
+- **Drift detection.** An optional `mode=all` run in a nightly (not PR) CI job
+  that re-records and flags when a model's output to a frozen prompt has changed
+  — turning cassettes into a cheap model-regression tripwire. The hardest part
+  is choosing the "did anything meaningful change?" comparator (raw text diff is
+  noisy; LLM-judge re-introduces non-determinism). Reasonable defaults: structural
+  diff for tool-call shapes, text-similarity for prose.
+- **A second interception backend (`requests`/`urllib3`).** Unlocks boto3-Bedrock
+  and HF `InferenceClient`. Non-trivial: different stack, no clean
+  `BaseTransport` equivalent in urllib3, and Bedrock specifically signs requests
+  at the urllib3 level with SigV4 — so any fingerprint over the request would
+  pick up timestamp/signature noise unless we strip them first. Worth doing when
+  there's evidence of demand, not before.
+- **User-pluggable normalizers.** For in-house gateways with custom shapes. A
+  `register_normalizer(detect_fn, normalize_fn)` API is straightforward; the
+  open question is whether to ship it before or after the surface stabilizes.
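As an illustration of the field-level diff described in the first thread, here is a sketch of a parallel walk that emits `(path, recorded, incoming)` triples, restricted to `match_on` fields and truncated to about 80 characters. The names `diff_bodies`, `_walk`, `_show`, and the `MISSING` sentinel are hypothetical; the actual `matcher.diff_request()` may differ in signature and output format.

```python
MISSING = object()  # marks a key or index present on only one side


def _show(value, limit=80) -> str:
    """Render a leaf for the error message, truncating long values."""
    text = "<missing>" if value is MISSING else repr(value)
    return text if len(text) <= limit else text[: limit - 3] + "..."


def _walk(recorded, incoming, path):
    """Walk both values in parallel and yield a triple for each differing leaf."""
    if isinstance(recorded, dict) and isinstance(incoming, dict):
        for key in sorted(set(recorded) | set(incoming)):
            yield from _walk(recorded.get(key, MISSING), incoming.get(key, MISSING),
                             f"{path}.{key}")
    elif isinstance(recorded, list) and isinstance(incoming, list):
        for i in range(max(len(recorded), len(incoming))):
            left = recorded[i] if i < len(recorded) else MISSING
            right = incoming[i] if i < len(incoming) else MISSING
            yield from _walk(left, right, f"{path}[{i}]")
    elif recorded != incoming:
        yield (path, _show(recorded), _show(incoming))


def diff_bodies(recorded: dict, incoming: dict, match_on: list) -> list:
    """Diff only the fields in match_on, so volatile fields never show up."""
    triples = []
    for field in match_on:
        triples.extend(_walk(recorded.get(field, MISSING),
                             incoming.get(field, MISSING), field))
    return triples
```

Each triple can then be formatted into the miss error as one line per changed path, which keeps CI logs grep-able.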

promptecho-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,232 @@
+Metadata-Version: 2.4
+Name: promptecho
+Version: 0.1.0
+Summary: Record & replay for LLM API calls — like vcrpy/nock, built for LLM traffic.
+License-Expression: MIT
+Keywords: anthropic,llm,mock,openai,pytest,record-replay,testing,vcr
+Requires-Python: >=3.9
+Requires-Dist: httpx>=0.24
+Requires-Dist: pyyaml>=6.0
+Provides-Extra: dev
+Requires-Dist: anthropic; extra == 'dev'
+Requires-Dist: openai; extra == 'dev'
+Requires-Dist: pytest>=7; extra == 'dev'
+Description-Content-Type: text/markdown

promptecho-0.1.0/README.md ADDED

@@ -0,0 +1,217 @@
+# promptecho
+**Record & replay for LLM API calls.** Like [`vcrpy`](https://github.com/kevin1024/vcrpy) / [`nock`](https://github.com/nock/nock), but built for the way LLM traffic actually behaves.
+Your LLM tests have three problems: they're **flaky** (non-deterministic outputs), **slow** (real network round-trips), and **expensive** (burning tokens in CI on every run). promptecho records each real API call once to a cassette file, then replays it forever — deterministically, instantly, for free.
+```python
+import promptecho
+from anthropic import Anthropic
+@promptecho.use_cassette("cassettes/summarize.yaml")
+def test_summarize():
+    client = Anthropic()
+    msg = client.messages.create(
+        model="claude-opus-4-8",
+        max_tokens=100,
+        messages=[{"role": "user", "content": "Summarize: the cat sat on the mat."}],
+    )
+    assert "cat" in msg.content[0].text.lower()
+```
+First run: one real call, recorded to `cassettes/summarize.yaml`.
+Every run after: replayed from disk. No network, no tokens, no flake.
+> **Proof, not marketing.** The end-to-end test that gates every release records against a local server, **shuts the server down**, then replays. Same response, zero network. If the response can come back with the upstream gone, the cassette is genuinely doing the work — not a partial proxy. See [`tests/test_record_replay.py`](tests/test_record_replay.py).
+---
+## Why not just use vcrpy?
+You can — at the HTTP layer, vcrpy works on LLM calls today. promptecho exists because LLM traffic breaks vcrpy's assumptions in five specific ways:
+1. **Matching.** vcrpy's default matchers compare only method and URI, which cannot tell two different prompts to the same endpoint apart, and its opt-in raw-body matching compares request bytes. LLM bodies carry volatile fields (client-injected IDs, reordered tools, whitespace) that change the bytes without changing the *meaning*, so byte-matching misses on replay. promptecho matches on a **normalized fingerprint** of the fields that determine the response, and **canonicalizes across providers**: it knows `content: "hi"` equals `content: [{"type":"text","text":"hi"}]`, an Anthropic top-level `system` equals an OpenAI `system`-role message, and an Anthropic `input_schema` tool def equals an OpenAI `function.parameters`. A raw-bytes VCR can't.
+2. **Streaming.** Most LLM calls are SSE streams. promptecho records the event stream and faithfully re-emits it on replay, so `stream=True` and token-by-token iteration work identically against a cassette — including reasoning deltas.
+3. **Binary / multimodal responses.** vcrpy's text-based cassettes silently corrupt raw `image/*` / `audio/*` / `octet-stream` bodies. promptecho detects them by `Content-Type` and base64-encodes them in the cassette, so image-out and audio-out responses round-trip byte-exact.
+4. **Debuggable CI failures.** When a vcrpy cassette miss happens, you get *"no match"*. promptecho prints the exact path that changed: `messages[1].content: recorded "summarize the cat" / incoming "summarize the dog"`. Test failures are actionable, not detective work.
+5. **Secrets.** API keys live in headers on every call. promptecho redacts them by default — a cassette is safe to commit.
+## What promptecho is *not*
+- **Not a cache.** Replay matching is exact/normalized and deterministic, on purpose. It does **not** semantically match "different prompt, close enough" — that would put non-determinism back into the harness you're using to remove it. (A separate opt-in fuzzy mode is on the roadmap as a dev-loop convenience; it will never be the default and never used in CI.)
+- **Not an eval.** It freezes a response so your *surrounding code* is testable. Judging whether the response is *good* is a different tool (see roadmap: `toMatchLLMSnapshot()`).
+---
+## What it covers
+promptecho intercepts at the `httpx` transport layer. **If the SDK uses httpx, promptecho sees the call** — which is almost everything modern.
+| You're calling | Covered? |
+|---|---|
+| Anthropic, OpenAI, Mistral, Cohere, `google-genai` SDKs | ✅ |
+| **OpenAI SDK with custom `base_url`** → OpenRouter, Together, Fireworks, Cerebras, Groq, DeepInfra, Perplexity | ✅ |
+| **Self-hosted vLLM / TGI / SGLang / LM Studio / Ollama** (OpenAI-compatible mode) | ✅ |
+| Your **own fine-tune** behind any of the above | ✅ |
+| **Reasoning models** — o1/o3, Claude extended thinking, DeepSeek-R1 | ✅ (incl. `reasoning_effort` / `thinking` in default match-on) |
+| **Multimodal** — base64-in-JSON (vision, Claude image-out, GPT-4o) and raw binary (`image/*`, `audio/*`) | ✅ (byte-exact round-trip) |
+| Bedrock via boto3, HF `InferenceClient`, in-process `transformers` | ❌ (see workarounds in [SUPPORT.md](SUPPORT.md)) |
+Full matrix with caveats and workarounds: [**SUPPORT.md**](SUPPORT.md). For practical recipes by scenario (startup / enterprise / research), see [**TUTORIAL.md**](TUTORIAL.md).
+### Hosted open-source via the OpenAI SDK
+This is the dominant pattern for non-Anthropic/non-OpenAI usage, and it Just Works:
+```python
+from openai import OpenAI
+client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
+@promptecho.use_cassette("cassettes/openrouter.yaml")
+def test_via_openrouter():
+    r = client.chat.completions.create(
+        model="meta-llama/llama-3.1-70b-instruct",
+        messages=[{"role": "user", "content": "hi"}],
+    )
+    assert r.choices[0].message.content
+```
+Detection falls back to body shape when the host is unknown, so localhost gateways, in-house proxies, and self-hosted vLLM/TGI behave the same way as the brand-name hosts.
+---
+## Install
+```bash
+pip install promptecho   # not yet on PyPI — install from source for now
+```
+```bash
+git clone <repo> && cd promptecho
+pip install -e .
+```
+Requires Python ≥ 3.9 and `httpx ≥ 0.24`.
+---
+## Usage
+### Decorator
+```python
+@promptecho.use_cassette("cassettes/foo.yaml")
+def test_foo(): ...
+```
+### Context manager
+```python
+with promptecho.use_cassette("cassettes/foo.yaml"):
+    client.messages.create(...)
+```
+### pytest fixture (auto-named per test)
+```python
+def test_bar(promptecho_cassette):   # records to cassettes/test_bar.yaml
+    client.messages.create(...)
+```
+The fixture defaults to `mode="once"` locally and `mode="none"` when `CI=true` — so a forgotten recording fails the build instead of making a live call.
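The mode selection described above amounts to a single environment check. A sketch, assuming the check is a case-insensitive comparison against `true`; the `default_mode` name and the injectable `environ` argument are hypothetical and exist only to make the rule testable.

```python
import os


def default_mode(environ=os.environ) -> str:
    """Replay-only on CI, record-once everywhere else."""
    return "none" if environ.get("CI", "").lower() == "true" else "once"
```

Most hosted CI systems set `CI=true` automatically, so a test that was never recorded locally fails there with a cassette miss instead of silently making a live call.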
+### Record modes
+Borrowed from vcrpy, so the mental model is free:
+| mode | absent cassette | present cassette | use for |
+|------|-----------------|------------------|---------|
+| `once` *(default)* | record | replay | normal dev |
+| `none` | **error** | replay | **CI** — guarantees no live calls |
+| `new_episodes` | record | replay + record new | evolving tests |
+| `all` | record | re-record everything | refreshing fixtures |
+```python
+@promptecho.use_cassette("cassettes/foo.yaml", mode="none")
+```
+### Choosing what to match on
+Defaults to `["model", "messages", "system", "tools", "tool_choice", "reasoning_effort", "reasoning", "thinking"]` — everything that determines the response for a chat-shaped call, including reasoning-model knobs.
+```python
+@promptecho.use_cassette(
+    "cassettes/foo.yaml",
+    match_on=["model", "messages", "system", "temperature"],  # add temperature
+)
+```
+For non-chat shapes (raw TGI `/generate`, embeddings) you'll want to override, e.g. `match_on=["model", "input"]` for an embeddings endpoint. See [SUPPORT.md → Request shapes](SUPPORT.md#request-shapes).
+### Async
+Works identically with `httpx.AsyncClient` and the async surfaces of Anthropic / OpenAI / Mistral SDKs — the async transport is patched the same way as sync.
+---
+## Cassette format
+Human-readable YAML, designed to diff cleanly in PRs:
+```yaml
+version: 1
+match_on: [model, messages, system, tools, tool_choice, reasoning_effort, reasoning, thinking]
+interactions:
+  - request:
+      method: POST
+      url: https://api.anthropic.com/v1/messages
+      match_key: ef43f6acaed95b2f        # fingerprint of matched fields
+      matched_on: [model, messages, system, tools, tool_choice]
+      body:                              # canonical (provider-normalized) body
+        model: claude-opus-4-8
+        messages:
+          - {role: user, content: "Summarize: the cat sat on the mat."}
+    response:
+      status: 200
+      headers: {content-type: application/json}
+      streaming: false
+      body:
+        content: [{type: text, text: "A cat sat on a mat."}]
+        usage: {input_tokens: 14, output_tokens: 8}
+```
+- **Streamed** responses store the ordered SSE events under `response.events` with `streaming: true`; replay re-emits them in order.
+- **Binary** responses (image/audio/octet-stream) get `binary: true` and the body is base64-encoded; replay decodes and returns the original bytes.
+- **The stored body is the canonical, provider-normalized shape** — not the raw provider JSON. That makes cassettes provider-agnostic and easier to skim in code review.
+Auto-redacted on record: `authorization`, `x-api-key`, `openai-organization`. Configurable.
+See [`examples/cassettes/example.yaml`](examples/cassettes/example.yaml) for a real one.
+---
+## Status
+**v0.1.0, working core. 19 tests, all green.** Not yet on PyPI.
+Records and replays real httpx traffic — sync, async, SSE streaming, binary responses, cross-provider request shapes — verified end-to-end against a local server that gets shut down between record and replay.
+### Roadmap (build-in-public)
+Done:
+- [x] httpx sync + async transport interception
+- [x] SSE streaming record/replay
+- [x] pytest plugin + auto-naming
+- [x] Per-provider request normalizers (Anthropic / OpenAI / generic)
+- [x] Reasoning-model match defaults (`reasoning_effort`, `thinking`, `reasoning`)
+- [x] Binary response round-trip (image/audio/octet-stream — base64 in cassette)
+- [x] Field-level diff on cassette miss (CI `mode=none` errors pinpoint the changed path, not just the field name)
+Next:
+- [ ] `requests` / `urllib3` interception backend — unlocks boto3-Bedrock and HF `InferenceClient`
+- [ ] `promptecho lint` — find un-recorded calls in a test suite
+- [ ] **`toMatchLLMSnapshot()` sibling** — semantic snapshot assertions on top of recorded calls
+## Design
+For the why-not-the-other-way decisions — fingerprint vs raw bytes, why semantic matching is fenced off, how SSE re-emission works, how cross-provider normalization is structured — see [DESIGN.md](DESIGN.md).
+## License
+MIT