promptecho 0.1.0__tar.gz

@@ -0,0 +1,10 @@
1
+ __pycache__/
2
+ *.py[cod]
3
+ .pytest_cache/
4
+ *.egg-info/
5
+ build/
6
+ dist/
7
+ .venv/
8
+ venv/
9
+ .env
10
+ .DS_Store
@@ -0,0 +1,17 @@
1
+ # Use shell-style glob syntax
2
+ syntax: glob
3
+
4
+ # Compiled Python files
5
+ *.pyc
6
+
7
+ # Folder view configuration files
8
+ .DS_Store
9
+ Desktop.ini
10
+
11
+ # Thumbnail cache files
12
+ ._*
13
+ Thumbs.db
14
+
15
+ # Files that might appear on external disks
16
+ .Spotlight-V100
17
+ .Trashes
@@ -0,0 +1,196 @@
1
+ # promptecho — design notes
2
+
3
+ The ergonomics — `use_cassette` decorator, record modes, pytest fixture — are
4
+ table stakes copied straight from vcrpy and aren't interesting. The six sections
5
+ below are where this tool earns its name. Each one is a deliberate choice and a
6
+ specific failure of the obvious "vcr-for-llms" approach.
7
+
8
+ ## 1. Interception: at the HTTP transport, not the SDK
9
+
10
+ The Anthropic and OpenAI Python SDKs (along with Mistral, Cohere v5+, `google-genai`,
11
+ the AnthropicBedrock / AnthropicVertex variants, and the OpenAI SDK pointed at
12
+ OpenRouter / Together / Fireworks / Cerebras / Groq / vLLM / TGI / SGLang via
13
+ `base_url=`) are all built on `httpx`. So we intercept once at `httpx`'s transport
14
+ layer and get every one of them for free, instead of monkeypatching each SDK's
15
+ `messages.create` / `chat.completions.create` surface (which would be a maintenance
16
+ treadmill as SDKs change).
17
+
18
+ The mechanism: [`patch.py`](src/promptecho/patch.py) monkeypatches
19
+ `httpx.HTTPTransport.handle_request` (sync) and `httpx.AsyncHTTPTransport.handle_async_request`
20
+ (async) for the duration of the `use_cassette` block, then restores the originals.
21
+ Each patched method routes the request through the record/replay decision and
22
+ either returns a fresh `httpx.Response` reconstructed from the cassette or passes
23
+ through to the real transport and captures the result.
24
+
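The patch-then-restore mechanism described above can be sketched with the standard library alone. This is an illustration, not the contents of `patch.py`: `patched`, `FakeTransport`, and `replaying_handle_request` are hypothetical names, and `FakeTransport` stands in for `httpx.HTTPTransport` so the sketch runs without httpx installed.

```python
from contextlib import contextmanager


@contextmanager
def patched(cls, name, replacement):
    """Swap cls.<name> for `replacement` inside the block, then restore it."""
    original = getattr(cls, name)
    setattr(cls, name, replacement)
    try:
        yield original
    finally:
        setattr(cls, name, original)  # restored even if the block raises


class FakeTransport:
    """Stand-in for httpx.HTTPTransport (assumption: same method name)."""

    def handle_request(self, request):
        return f"live:{request}"


def replaying_handle_request(self, request):
    # A real patch would consult the cassette here and fall back to the
    # original method on a miss; this stand-in always "replays".
    return f"replayed:{request}"


transport = FakeTransport()
with patched(FakeTransport, "handle_request", replaying_handle_request):
    inside = transport.handle_request("POST /v1/messages")
outside = transport.handle_request("POST /v1/messages")
```

Patching the class rather than an instance is what lets clients created inside SDKs be intercepted without the test ever touching them; the `finally` clause keeps a failing test from leaking the patch into the next one.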
25
+ The decision logic itself lives in [`transport.py`](src/promptecho/transport.py) and
26
+ is pure — no httpx, no I/O. That separation means the branching logic (cassette
27
+ miss → record? error? re-record?) is unit-testable without standing up a network
28
+ stack. It's the same separation `respx` and `vcrpy`'s httpx stub use.
29
+
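A pure decision function of this kind might look like the sketch below. The mode names come from the README's record-mode table; `Action`, `decide`, and its three arguments are hypothetical, and the choice to raise an error for a miss in `once` mode when the cassette already exists is an assumption borrowed from vcrpy's behaviour rather than something these notes state.

```python
from enum import Enum

MODES = ("once", "none", "new_episodes", "all")


class Action(Enum):
    REPLAY = "replay"
    RECORD = "record"
    ERROR = "error"


def decide(mode: str, cassette_exists: bool, has_match: bool) -> Action:
    """Pick an action for one request. Pure: no httpx, no file or network I/O."""
    if mode not in MODES:
        raise ValueError(f"unknown record mode: {mode!r}")
    if mode == "all":
        return Action.RECORD  # always go to the network and re-record
    if has_match:
        return Action.REPLAY
    if mode == "new_episodes" or (mode == "once" and not cassette_exists):
        return Action.RECORD
    # "none", or "once" with an existing cassette that lacks this request
    return Action.ERROR


ci_miss = decide("none", cassette_exists=True, has_match=False)
first_run = decide("once", cassette_exists=False, has_match=False)
```

Because the function takes plain values and returns an enum, every branch can be covered by a table-driven unit test with no server and no cassette file.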
30
+ **The cost of this choice:** SDKs not on httpx (boto3-Bedrock, HF `InferenceClient`,
31
+ `google-cloud-aiplatform`) are invisible — they just pass straight to the network as if
32
+ promptecho weren't installed. That's a deliberate v1 scope. The roadmap item
33
+ "`requests`/`urllib3` interception backend" closes it, at the cost of supporting two
34
+ transport stacks at once.
35
+
36
+ ## 2. Matching: normalized fingerprint, not raw bytes
37
+
38
+ The crux, and the thing the obvious "vcr for LLMs" pitch gets wrong.
39
+
40
+ For **replay** you want determinism: the same logical request must map to the same
41
+ recording, every time. Raw-byte body matching (opt-in in vcrpy, whose default matchers look only at method and URL) fails because LLM bodies
42
+ carry volatile noise — client-injected request IDs, reordered `tools` arrays,
43
+ key-order and whitespace differences from re-serialization. Same call, different
44
+ bytes, missed match.
45
+
46
+ So we compute a **fingerprint** over only the fields that determine the response:
47
+
48
+ ```
49
+ fingerprint(body) = sha256( canonical_json( pick(body, match_on) ) )
50
+ ```
51
+
52
+ - `canonical_json` sorts keys and strips insignificant whitespace, so
53
+ re-serialization can't change the key.
54
+ - Volatile fields are simply not in `match_on`, so they can't affect the match.
55
+
56
+ The default `match_on` is:
57
+ ```python
58
+ ["model", "messages", "system", "tools", "tool_choice",
59
+ "reasoning_effort", "reasoning", "thinking"]
60
+ ```
61
+
62
+ `model`, `messages`, `system`, `tools`, `tool_choice` are obvious. The last three
63
+ are less obvious and matter: reasoning-model knobs (OpenAI `reasoning_effort`,
64
+ Anthropic `thinking`, OpenRouter `reasoning`) change the response without changing
65
+ the prompt. If they aren't in the default match set, a test with
66
+ `reasoning_effort="high"` would silently replay the recording made for
67
+ `"low"` — a wrong-fixture bug that's hard to catch by eye. So they're in by
68
+ default, even though omitting them would yield "smaller" fingerprints. Correctness
69
+ wins.
70
+
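The fingerprint formula and default match set above can be sketched in a few lines of standard-library Python. The helper names mirror the formula (`pick`, `canonical_json`, `fingerprint`), but the bodies are an assumption about how they could work, not a copy of the shipped code; `request_id` is a made-up volatile field.

```python
import hashlib
import json

DEFAULT_MATCH_ON = ["model", "messages", "system", "tools", "tool_choice",
                    "reasoning_effort", "reasoning", "thinking"]


def pick(body: dict, match_on: list) -> dict:
    """Keep only the fields that take part in matching (absent ones are skipped)."""
    return {key: body[key] for key in match_on if key in body}


def canonical_json(value) -> str:
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def fingerprint(body: dict, match_on=DEFAULT_MATCH_ON) -> str:
    payload = canonical_json(pick(body, match_on)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


low = {"model": "m", "reasoning_effort": "low", "request_id": "abc",
       "messages": [{"role": "user", "content": "hi"}]}
# Same logical request: different key order, different volatile request_id.
low_again = {"request_id": "xyz", "messages": [{"content": "hi", "role": "user"}],
             "reasoning_effort": "low", "model": "m"}
high = dict(low, reasoning_effort="high")

same = fingerprint(low) == fingerprint(low_again)
differs = fingerprint(low) != fingerprint(high)
```

Here `same` and `differs` both come out true: key order and the volatile field drop out of the key, while the reasoning knob does not. Note that this sketch sorts object keys only; reordered arrays such as `tools` would need an extra normalization step.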
71
+ `match_on` is also user-configurable, because only the test author knows which
72
+ fields are load-bearing for *their* assertion (does this test care about
73
+ `temperature`? `max_tokens`?).
74
+
75
+ **What we deliberately do NOT do:** semantic / embedding matching on replay.
76
+ "Different prompt, embedding-close enough → same recording" reintroduces
77
+ non-determinism into the exact thing you adopted promptecho to make deterministic,
78
+ and can silently serve the wrong recording. Semantic matching is a *caching*
79
+ concern, not a *testing* one. Keeping these two ideas separate is a core stance
80
+ — see the README's "What promptecho is not." A `fuzzy=True` dev-loop convenience
81
+ is on the roadmap; it will never be the default and never used in CI.
82
+
83
+ ## 3. Cross-provider canonicalization
84
+
85
+ The same logical prompt is expressed in different wire shapes across providers,
86
+ SDK versions, and even within a single provider's API:
87
+
88
+ - Anthropic puts the system prompt in a top-level `system` param; OpenAI puts
89
+ it in a `system`- or `developer`-role message.
90
+ - Message content may be a bare string or a list of typed content blocks.
91
+ - Tool defs differ: Anthropic `{name, description, input_schema}` vs OpenAI
92
+ `{type: function, function: {name, description, parameters}}`.
93
+ - OpenAI's `max_completion_tokens` is an alias of `max_tokens` for newer models.
94
+
95
+ [`normalizers.py`](src/promptecho/normalizers.py) maps each raw provider body into
96
+ one canonical shape *before* fingerprinting. That's the capability a raw-bytes HTTP
97
+ VCR fundamentally cannot have — and the reason "just point vcrpy at httpx" is
98
+ unsatisfying. The provider is detected by URL host first, with body shape as the fallback (so
99
+ localhost / self-hosted gateways / private proxies behave the same as the
100
+ brand-name hosts).
101
+
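To make the mapping concrete, here is a minimal sketch of a normalizer covering three of the differences listed above (system prompt placement, string vs block content, tool definition shape). The canonical key names, including `schema`, and the function names are assumptions chosen for illustration; the real `normalizers.py` may use a different shape. The sketch handles only string content and text blocks and silently drops other block types.

```python
def _text(content):
    """Collapse a bare string or a list of typed text blocks into one string."""
    if isinstance(content, str):
        return content
    return "\n".join(block["text"] for block in content if block.get("type") == "text")


def _tool(tool: dict) -> dict:
    """Map an OpenAI-style or Anthropic-style tool definition onto one shape."""
    if tool.get("type") == "function" and "function" in tool:  # OpenAI style
        fn = tool["function"]
        return {"name": fn["name"], "description": fn.get("description", ""),
                "schema": fn.get("parameters", {})}
    return {"name": tool["name"], "description": tool.get("description", ""),
            "schema": tool.get("input_schema", {})}  # Anthropic style


def normalize(body: dict) -> dict:
    system_parts = []
    if "system" in body:  # Anthropic: top-level system param
        system_parts.append(_text(body["system"]))
    messages = []
    for message in body.get("messages", []):
        if message["role"] in ("system", "developer"):  # OpenAI: system as a message
            system_parts.append(_text(message["content"]))
        else:
            messages.append({"role": message["role"], "content": _text(message["content"])})
    canonical = {"model": body.get("model"), "messages": messages}
    if system_parts:
        canonical["system"] = "\n".join(system_parts)
    if body.get("tools"):
        canonical["tools"] = [_tool(tool) for tool in body["tools"]]
    return canonical


schema = {"type": "object", "properties": {"city": {"type": "string"}}}
anthropic_style = {
    "model": "m", "system": "Be brief.",
    "messages": [{"role": "user", "content": [{"type": "text", "text": "hi"}]}],
    "tools": [{"name": "weather", "description": "Look up weather", "input_schema": schema}],
}
openai_style = {
    "model": "m",
    "messages": [{"role": "system", "content": "Be brief."},
                 {"role": "user", "content": "hi"}],
    "tools": [{"type": "function",
               "function": {"name": "weather", "description": "Look up weather",
                            "parameters": schema}}],
}
equal = normalize(anthropic_style) == normalize(openai_style)
```

With the same model name on both sides, `equal` is true, which is the property fingerprinting relies on.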
102
+ Two consequences worth being explicit about:
103
+
104
+ 1. **The canonical body is what's stored on disk** — not the raw provider JSON.
105
+ This makes cassettes provider-agnostic and easier to skim in code review, at
106
+ the cost of being one step removed from the wire format. Worth it.
107
+ 2. **Lossy joins.** Where shapes don't map cleanly, we choose the simpler form:
108
+ e.g. a multi-block system prompt collapses to a `\n`-joined string. That's
109
+ fine for matching; if you have something exotic worth preserving, override
110
+ `match_on` to include the original field path instead.
111
+
112
+ Still open: extending the same treatment to more providers (Gemini, Mistral), and a
113
+ user-pluggable normalizer hook for in-house gateways with custom shapes.
114
+
115
+ ## 4. Streaming: record the events, re-emit faithfully
116
+
117
+ Most real LLM calls are `stream=True` (SSE). A recording that only captures the
118
+ final assembled body is useless for testing streaming code paths — you can't
119
+ test progressive UI rendering, token-budget cutoffs, or streaming-tool-call
120
+ handling against a one-shot fixture.
121
+
122
+ promptecho captures the **ordered list of SSE events** as they arrive, stores them
123
+ under `response.events`, and on replay re-emits them as a synthetic stream — so
124
+ `for chunk in stream:` iterates identically against the cassette, including event
125
+ boundaries and `message_delta` / `content_block_delta` / reasoning-delta ordering.
126
+
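The capture-and-re-emit idea can be sketched as a pair of functions: one that splits a recorded SSE byte stream into ordered events, and one that turns those events back into frames. The per-event dict of field names to values is an assumed storage shape, and `parse_sse` / `replay_sse` are hypothetical names; the real code also has to deal with chunk boundaries that fall mid-event, which this sketch ignores.

```python
def parse_sse(raw: bytes) -> list:
    """Split an SSE byte stream into an ordered list of field dicts."""
    events = []
    for chunk in raw.decode("utf-8").replace("\r\n", "\n").split("\n\n"):
        fields = {}
        for line in chunk.split("\n"):
            if not line or line.startswith(":"):  # blank line or SSE comment
                continue
            name, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            # Repeated fields are joined with "\n" (the SSE spec does this for data:).
            fields[name] = fields[name] + "\n" + value if name in fields else value
        if fields:
            events.append(fields)
    return events


def replay_sse(events: list):
    """Yield one encoded SSE frame per recorded event, in recorded order."""
    for event in events:
        lines = []
        for name, value in event.items():
            lines.extend(f"{name}: {part}" for part in value.split("\n"))
        yield ("\n".join(lines) + "\n\n").encode("utf-8")


raw = (b'event: content_block_delta\ndata: {"text": "Hel"}\n\n'
       b'event: content_block_delta\ndata: {"text": "lo"}\n\n'
       b'data: [DONE]\n\n')
events = parse_sse(raw)
replayed = b"".join(replay_sse(events))
```

For this input `replayed` is byte-identical to `raw`, and because `replay_sse` is a generator, the consumer still sees one frame at a time rather than a single assembled body.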
127
+ This is fiddly (chunked transfer, `[DONE]` sentinels, provider-specific event
128
+ shapes) and is exactly why generic HTTP VCR tools are unsatisfying for LLM work.
129
+ Getting it right is the moat.
130
+
131
+ **Known limitation:** on the *recording* run we buffer the full upstream
132
+ response before returning it, so you lose true streaming timing while recording.
133
+ Replay streams fine. Acceptable for a test tool — the recording run isn't where
134
+ you measure latency.
135
+
136
+ ## 5. Binary responses: detect, base64, round-trip
137
+
138
+ Image-out, audio-out, and any `application/octet-stream` body cannot be
139
+ text-decoded — the bytes have to survive the cassette round-trip exactly. A
140
+ YAML cassette that runs binary bodies through `.decode("utf-8", "replace")`
141
+ silently corrupts them (this was a real bug found by probe-testing before launch).
142
+
143
+ So `_capture()` inspects `Content-Type` and, for anything `image/*`, `audio/*`,
144
+ `video/*`, `application/octet-stream`, `pdf`, or `zip`, stores the body
145
+ base64-encoded with a `binary: true` flag. Replay decodes it back. **Verified
146
+ byte-equal** end-to-end (record → server shutdown → replay) in
147
+ `tests/test_reasoning_and_binary.py::test_binary_image_response_byte_exact`.
148
+
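A minimal sketch of the detect / base64 / round-trip step follows. The function names are hypothetical, and reading the `pdf` and `zip` entries as `application/pdf` and `application/zip` is an assumption; only the `binary: true` flag and the base64 encoding come from the text above.

```python
import base64

BINARY_PREFIXES = ("image/", "audio/", "video/")
# Assumption: "pdf" and "zip" are read as these two media types.
BINARY_TYPES = ("application/octet-stream", "application/pdf", "application/zip")


def is_binary(content_type: str) -> bool:
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith(BINARY_PREFIXES) or media_type in BINARY_TYPES


def encode_body(content_type: str, body: bytes) -> dict:
    """Produce the cassette representation of a response body."""
    if is_binary(content_type):
        return {"binary": True, "body": base64.b64encode(body).decode("ascii")}
    return {"binary": False, "body": body.decode("utf-8")}


def decode_body(stored: dict) -> bytes:
    """Recover the original bytes on replay."""
    if stored["binary"]:
        return base64.b64decode(stored["body"])
    return stored["body"].encode("utf-8")


png_like = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 0xFF, 0x00])
stored = encode_body("image/png", png_like)
round_tripped = decode_body(stored)
lossy = png_like.decode("utf-8", "replace").encode("utf-8")  # the bug described above
```

`round_tripped` equals the original bytes, while `lossy` does not: the `0x89` and `0xFF` bytes are not valid UTF-8 and get replaced, which is exactly the silent corruption the base64 path avoids.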
149
+ Multimodal-as-JSON (base64 inside `content` blocks — the Anthropic / OpenAI
150
+ vision / GPT-4o image-out shape) was already fine, because the base64 string
151
+ lives inside JSON and never gets text-decoded as bytes. That stays covered by
152
+ its own test.
153
+
154
+ `multipart/form-data` (file uploads/downloads) is explicitly out of scope for
155
+ v1 — large payloads, encoding edge cases, rarely a thing you want to freeze in a
156
+ fixture.
157
+
158
+ ## 6. Secrets: redact on record
159
+
160
+ A cassette is meant to be committed, so it must be safe by default — opt-out, not
161
+ opt-in. On record we strip `authorization`, `x-api-key`, and `openai-organization`
162
+ headers and never write request auth to disk. The list is configurable
163
+ (`REDACT_HEADERS` in [`cassette.py`](src/promptecho/cassette.py)) — extend it for
164
+ provider-specific auth headers, never shrink it without thinking carefully.
165
+
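Header redaction on record can be as small as the sketch below. `REDACT_HEADERS` is the name used in `cassette.py`, but its type here (a set of lowercase names), the `redact_headers` helper, and the `extra` parameter are assumptions made for illustration.

```python
REDACT_HEADERS = {"authorization", "x-api-key", "openai-organization"}


def redact_headers(headers: dict, extra=()) -> dict:
    """Return a copy of `headers` without auth headers (names compared case-insensitively)."""
    blocked = REDACT_HEADERS | {name.lower() for name in extra}
    return {name: value for name, value in headers.items() if name.lower() not in blocked}


recorded = redact_headers(
    {"Authorization": "Bearer sk-secret", "X-Api-Key": "secret",
     "Content-Type": "application/json", "X-Gateway-Token": "internal"},
    extra=["x-gateway-token"],  # hypothetical provider-specific header
)
```

After this call `recorded` contains only the `Content-Type` header. Extending the list through `extra` only ever adds names, which matches the "extend it, never shrink it" guidance.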
166
+ Body-level secrets (a prompt that happens to contain a credential) are *not*
167
+ auto-redacted, because there's no reliable way to detect them. For now the rule
168
+ is simply: don't put secrets in prompts. A future `redact_body=[...]` hook is a
169
+ reasonable addition.
170
+
171
+ ---
172
+
173
+ ## Open design threads (good build-in-public material)
174
+
175
+ - ~~**Field-level diff on cassette miss in CI.**~~ **Done** —
176
+ `matcher.diff_request()` walks both bodies in parallel and emits leaf-level
177
+ `(path, recorded, incoming)` triples (`messages[1].content`, etc.), shown
178
+ inline in the `CassetteMiss` error. Restricted to `match_on` fields so the
179
+ output isn't drowned in volatile-field noise. Long values are truncated to
180
+ ~80 chars. Open extension: an optional terminal-colored mode for local dev
181
+ (off by default to keep CI logs grep-able).
182
+ - **Drift detection.** An optional `mode=all` run in a nightly (not PR) CI job
183
+ that re-records and flags when a model's output to a frozen prompt has changed
184
+ — turning cassettes into a cheap model-regression tripwire. The hardest part
185
+ is choosing the "did anything meaningful change?" comparator (raw text diff is
186
+ noisy; LLM-judge re-introduces non-determinism). Reasonable defaults: structural
187
+ diff for tool-call shapes, text-similarity for prose.
188
+ - **A second interception backend (`requests`/`urllib3`).** Unlocks boto3-Bedrock
189
+ and HF `InferenceClient`. Non-trivial: different stack, no clean
190
+ `BaseTransport` equivalent in urllib3, and Bedrock specifically signs requests
191
+ at the urllib3 level with SigV4 — so any fingerprint over the request would
192
+ pick up timestamp/signature noise unless we strip them first. Worth doing when
193
+ there's evidence of demand, not before.
194
+ - **User-pluggable normalizers.** For in-house gateways with custom shapes. A
195
+ `register_normalizer(detect_fn, normalize_fn)` API is straightforward; the
196
+ open question is whether to ship it before or after the surface stabilizes.
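The field-level diff from the first thread above can be sketched as a parallel walk over the two bodies. The name `diff_request` and the `(path, recorded, incoming)` triple shape come from that bullet; the signature, the `max_len` parameter, and the use of `repr` for display are assumptions.

```python
def diff_request(recorded, incoming, match_on, max_len=80):
    """Return leaf-level (path, recorded, incoming) triples for the matched fields."""

    def clip(value):
        text = repr(value)
        return text if len(text) <= max_len else text[:max_len] + "..."

    def walk(path, a, b):
        if isinstance(a, dict) and isinstance(b, dict):
            for key in sorted(set(a) | set(b)):
                yield from walk(f"{path}.{key}" if path else key, a.get(key), b.get(key))
        elif isinstance(a, list) and isinstance(b, list):
            for i in range(max(len(a), len(b))):
                yield from walk(f"{path}[{i}]",
                                a[i] if i < len(a) else None,
                                b[i] if i < len(b) else None)
        elif a != b:
            yield (path, clip(a), clip(b))

    def only(body):
        # Restrict the comparison to match_on so volatile fields stay out of the output.
        return {key: body[key] for key in match_on if key in body}

    return list(walk("", only(recorded), only(incoming)))


recorded = {"model": "m", "temperature": 0.2,
            "messages": [{"role": "user", "content": "hi"},
                         {"role": "user", "content": "summarize the cat"}]}
incoming = {"model": "m", "temperature": 0.9,
            "messages": [{"role": "user", "content": "hi"},
                         {"role": "user", "content": "summarize the dog"}]}
triples = diff_request(recorded, incoming, match_on=["model", "messages"])
```

Only `messages[1].content` is reported here; the `temperature` change is ignored because it is not in `match_on`. One simplification to be aware of: a missing key and an explicit `None` are treated the same.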
@@ -0,0 +1,232 @@
1
+ Metadata-Version: 2.4
2
+ Name: promptecho
3
+ Version: 0.1.0
4
+ Summary: Record & replay for LLM API calls — like vcrpy/nock, built for LLM traffic.
5
+ License-Expression: MIT
6
+ Keywords: anthropic,llm,mock,openai,pytest,record-replay,testing,vcr
7
+ Requires-Python: >=3.9
8
+ Requires-Dist: httpx>=0.24
9
+ Requires-Dist: pyyaml>=6.0
10
+ Provides-Extra: dev
11
+ Requires-Dist: anthropic; extra == 'dev'
12
+ Requires-Dist: openai; extra == 'dev'
13
+ Requires-Dist: pytest>=7; extra == 'dev'
14
+ Description-Content-Type: text/markdown
15
+
16
+ # promptecho
17
+
18
+ **Record & replay for LLM API calls.** Like [`vcrpy`](https://github.com/kevin1024/vcrpy) / [`nock`](https://github.com/nock/nock), but built for the way LLM traffic actually behaves.
19
+
20
+ Your LLM tests have three problems: they're **flaky** (non-deterministic outputs), **slow** (real network round-trips), and **expensive** (burning tokens in CI on every run). promptecho records each real API call once to a cassette file, then replays it forever — deterministically, instantly, for free.
21
+
22
+ ```python
23
+ import promptecho
24
+ from anthropic import Anthropic
25
+
26
+ @promptecho.use_cassette("cassettes/summarize.yaml")
27
+ def test_summarize():
28
+ client = Anthropic()
29
+ msg = client.messages.create(
30
+ model="claude-opus-4-8",
31
+ max_tokens=100,
32
+ messages=[{"role": "user", "content": "Summarize: the cat sat on the mat."}],
33
+ )
34
+ assert "cat" in msg.content[0].text.lower()
35
+ ```
36
+
37
+ First run: one real call, recorded to `cassettes/summarize.yaml`.
38
+ Every run after: replayed from disk. No network, no tokens, no flake.
39
+
40
+ > **Proof, not marketing.** The end-to-end test that gates every release records against a local server, **shuts the server down**, then replays. Same response, zero network. If the response can come back with the upstream gone, the cassette is genuinely doing the work — not a partial proxy. See [`tests/test_record_replay.py`](tests/test_record_replay.py).
41
+
42
+ ---
43
+
44
+ ## Why not just use vcrpy?
45
+
46
+ You can — at the HTTP layer, vcrpy works on LLM calls today. promptecho exists because LLM traffic breaks vcrpy's assumptions in five specific ways:
47
+
48
+ 1. **Matching.** vcrpy's default matchers look only at method and URL, and its opt-in body matching has no notion of which fields matter. LLM bodies carry volatile fields (client-injected IDs, reordered tools, whitespace) that change the bytes without changing the *meaning*, so byte-matching misses on replay. promptecho matches on a **normalized fingerprint** of the fields that determine the response, and **canonicalizes across providers**: it knows `content: "hi"` equals `content: [{"type":"text","text":"hi"}]`, an Anthropic top-level `system` equals an OpenAI `system`-role message, and an Anthropic `input_schema` tool def equals an OpenAI `function.parameters`. A raw-bytes VCR can't.
49
+ 2. **Streaming.** Most LLM calls are SSE streams. promptecho records the event stream and faithfully re-emits it on replay, so `stream=True` and token-by-token iteration work identically against a cassette — including reasoning deltas.
50
+ 3. **Binary / multimodal responses.** vcrpy's text-based cassettes silently corrupt raw `image/*` / `audio/*` / `octet-stream` bodies. promptecho detects them by `Content-Type` and base64-encodes them in the cassette, so image-out and audio-out responses round-trip byte-exact.
51
+ 4. **Debuggable CI failures.** When a vcrpy cassette miss happens, you get *"no match"*. promptecho prints the exact path that changed: `messages[1].content: recorded "summarize the cat" / incoming "summarize the dog"`. Test failures are actionable, not detective work.
52
+ 5. **Secrets.** API keys live in headers on every call. promptecho redacts them by default — a cassette is safe to commit.
53
+
54
+ ## What promptecho is *not*
55
+
56
+ - **Not a cache.** Replay matching is exact/normalized and deterministic, on purpose. It does **not** semantically match "different prompt, close enough" — that would put non-determinism back into the harness you're using to remove it. (A separate opt-in fuzzy mode is on the roadmap as a dev-loop convenience; it will never be the default and never used in CI.)
57
+ - **Not an eval.** It freezes a response so your *surrounding code* is testable. Judging whether the response is *good* is a different tool (see roadmap: `toMatchLLMSnapshot()`).
58
+
59
+ ---
60
+
61
+ ## What it covers
62
+
63
+ promptecho intercepts at the `httpx` transport layer. **If the SDK uses httpx, promptecho sees the call** — which is almost everything modern.
64
+
65
+ | You're calling | Covered? |
66
+ |---|---|
67
+ | Anthropic, OpenAI, Mistral, Cohere, `google-genai` SDKs | ✅ |
68
+ | **OpenAI SDK with custom `base_url`** → OpenRouter, Together, Fireworks, Cerebras, Groq, DeepInfra, Perplexity | ✅ |
69
+ | **Self-hosted vLLM / TGI / SGLang / LM Studio / Ollama** (OpenAI-compatible mode) | ✅ |
70
+ | Your **own fine-tune** behind any of the above | ✅ |
71
+ | **Reasoning models** — o1/o3, Claude extended thinking, DeepSeek-R1 | ✅ (incl. `reasoning_effort` / `thinking` in default match-on) |
72
+ | **Multimodal** — base64-in-JSON (vision, Claude image-out, GPT-4o) and raw binary (`image/*`, `audio/*`) | ✅ (byte-exact round-trip) |
73
+ | Bedrock via boto3, HF `InferenceClient`, in-process `transformers` | ❌ (see workarounds in [SUPPORT.md](SUPPORT.md)) |
74
+
75
+ Full matrix with caveats and workarounds: [**SUPPORT.md**](SUPPORT.md). For practical recipes by scenario (startup / enterprise / research), see [**TUTORIAL.md**](TUTORIAL.md).
76
+
77
+ ### Hosted open-source via the OpenAI SDK
78
+
79
+ This is the dominant pattern for non-Anthropic/non-OpenAI usage, and it Just Works:
80
+
81
+ ```python
82
+ from openai import OpenAI
83
+ client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
84
+
85
+ @promptecho.use_cassette("cassettes/openrouter.yaml")
86
+ def test_via_openrouter():
87
+ r = client.chat.completions.create(
88
+ model="meta-llama/llama-3.1-70b-instruct",
89
+ messages=[{"role": "user", "content": "hi"}],
90
+ )
91
+ assert r.choices[0].message.content
92
+ ```
93
+
94
+ Detection falls back to body shape when the host is unknown, so localhost gateways, in-house proxies, and self-hosted vLLM/TGI behave the same way as the brand-name hosts.
95
+
96
+ ---
97
+
98
+ ## Install
99
+
100
+ ```bash
101
+ pip install promptecho # not yet on PyPI — install from source for now
102
+ ```
103
+
104
+ ```bash
105
+ git clone <repo> && cd promptecho
106
+ pip install -e .
107
+ ```
108
+
109
+ Requires Python ≥ 3.9 and `httpx ≥ 0.24`.
110
+
111
+ ---
112
+
113
+ ## Usage
114
+
115
+ ### Decorator
116
+ ```python
117
+ @promptecho.use_cassette("cassettes/foo.yaml")
118
+ def test_foo(): ...
119
+ ```
120
+
121
+ ### Context manager
122
+ ```python
123
+ with promptecho.use_cassette("cassettes/foo.yaml"):
124
+ client.messages.create(...)
125
+ ```
126
+
127
+ ### pytest fixture (auto-named per test)
128
+ ```python
129
+ def test_bar(promptecho_cassette): # records to cassettes/test_bar.yaml
130
+ client.messages.create(...)
131
+ ```
132
+
133
+ The fixture defaults to `mode="once"` locally and `mode="none"` when `CI=true` — so a forgotten recording fails the build instead of making a live call.
134
+
135
+ ### Record modes
136
+ Borrowed from vcrpy, so the mental model is free:
137
+
138
+ | mode | absent cassette | present cassette | use for |
139
+ |------|-----------------|------------------|---------|
140
+ | `once` *(default)* | record | replay | normal dev |
141
+ | `none` | **error** | replay | **CI** — guarantees no live calls |
142
+ | `new_episodes` | record | replay + record new | evolving tests |
143
+ | `all` | record | re-record everything | refreshing fixtures |
144
+
145
+ ```python
146
+ @promptecho.use_cassette("cassettes/foo.yaml", mode="none")
147
+ ```
148
+
149
+ ### Choosing what to match on
150
+
151
+ Defaults to `["model", "messages", "system", "tools", "tool_choice", "reasoning_effort", "reasoning", "thinking"]` — everything that determines the response for a chat-shaped call, including reasoning-model knobs.
152
+
153
+ ```python
154
+ @promptecho.use_cassette(
155
+ "cassettes/foo.yaml",
156
+ match_on=["model", "messages", "system", "temperature"], # add temperature
157
+ )
158
+ ```
159
+
160
+ For non-chat shapes (raw TGI `/generate`, embeddings) you'll want to override, e.g. `match_on=["model", "input"]` for an embeddings endpoint. See [SUPPORT.md → Request shapes](SUPPORT.md#request-shapes).
161
+
162
+ ### Async
163
+
164
+ Works identically with `httpx.AsyncClient` and the async surfaces of Anthropic / OpenAI / Mistral SDKs — the async transport is patched the same way as sync.
165
+
166
+ ---
167
+
168
+ ## Cassette format
169
+
170
+ Human-readable YAML, designed to diff cleanly in PRs:
171
+
172
+ ```yaml
173
+ version: 1
174
+ match_on: [model, messages, system, tools, tool_choice, reasoning_effort, reasoning, thinking]
175
+ interactions:
176
+ - request:
177
+ method: POST
178
+ url: https://api.anthropic.com/v1/messages
179
+ match_key: ef43f6acaed95b2f # fingerprint of matched fields
180
+ matched_on: [model, messages, system, tools, tool_choice]
181
+ body: # canonical (provider-normalized) body
182
+ model: claude-opus-4-8
183
+ messages:
184
+ - {role: user, content: "Summarize: the cat sat on the mat."}
185
+ response:
186
+ status: 200
187
+ headers: {content-type: application/json}
188
+ streaming: false
189
+ body:
190
+ content: [{type: text, text: "A cat sat on a mat."}]
191
+ usage: {input_tokens: 14, output_tokens: 8}
192
+ ```
193
+
194
+ - **Streamed** responses store the ordered SSE events under `response.events` with `streaming: true`; replay re-emits them in order.
195
+ - **Binary** responses (image/audio/octet-stream) get `binary: true` and the body is base64-encoded; replay decodes and returns the original bytes.
196
+ - **The stored body is the canonical, provider-normalized shape** — not the raw provider JSON. That makes cassettes provider-agnostic and easier to skim in code review.
197
+
198
+ Auto-redacted on record: `authorization`, `x-api-key`, `openai-organization`. Configurable.
199
+
200
+ See [`examples/cassettes/example.yaml`](examples/cassettes/example.yaml) for a real one.
201
+
202
+ ---
203
+
204
+ ## Status
205
+
206
+ **v0.1.0, working core. 19 tests, all green.** Not yet on PyPI.
207
+
208
+ Records and replays real httpx traffic — sync, async, SSE streaming, binary responses, cross-provider request shapes — verified end-to-end against a local server that gets shut down between record and replay.
209
+
210
+ ### Roadmap (build-in-public)
211
+
212
+ Done:
213
+ - [x] httpx sync + async transport interception
214
+ - [x] SSE streaming record/replay
215
+ - [x] pytest plugin + auto-naming
216
+ - [x] Per-provider request normalizers (Anthropic / OpenAI / generic)
217
+ - [x] Reasoning-model match defaults (`reasoning_effort`, `thinking`, `reasoning`)
218
+ - [x] Binary response round-trip (image/audio/octet-stream — base64 in cassette)
219
+ - [x] Field-level diff on cassette miss (CI `mode=none` errors pinpoint the changed path, not just the field name)
220
+
221
+ Next:
222
+ - [ ] `requests` / `urllib3` interception backend — unlocks boto3-Bedrock and HF `InferenceClient`
223
+ - [ ] `promptecho lint` — find un-recorded calls in a test suite
224
+ - [ ] **`toMatchLLMSnapshot()` sibling** — semantic snapshot assertions on top of recorded calls
225
+
226
+ ## Design
227
+
228
+ For the why-not-the-other-way decisions — fingerprint vs raw bytes, why semantic matching is fenced off, how SSE re-emission works, how cross-provider normalization is structured — see [DESIGN.md](DESIGN.md).
229
+
230
+ ## License
231
+
232
+ MIT
@@ -0,0 +1,217 @@
1
+ # promptecho
2
+
3
+ **Record & replay for LLM API calls.** Like [`vcrpy`](https://github.com/kevin1024/vcrpy) / [`nock`](https://github.com/nock/nock), but built for the way LLM traffic actually behaves.
4
+
5
+ Your LLM tests have three problems: they're **flaky** (non-deterministic outputs), **slow** (real network round-trips), and **expensive** (burning tokens in CI on every run). promptecho records each real API call once to a cassette file, then replays it forever — deterministically, instantly, for free.
6
+
7
+ ```python
8
+ import promptecho
9
+ from anthropic import Anthropic
10
+
11
+ @promptecho.use_cassette("cassettes/summarize.yaml")
12
+ def test_summarize():
13
+ client = Anthropic()
14
+ msg = client.messages.create(
15
+ model="claude-opus-4-8",
16
+ max_tokens=100,
17
+ messages=[{"role": "user", "content": "Summarize: the cat sat on the mat."}],
18
+ )
19
+ assert "cat" in msg.content[0].text.lower()
20
+ ```
21
+
22
+ First run: one real call, recorded to `cassettes/summarize.yaml`.
23
+ Every run after: replayed from disk. No network, no tokens, no flake.
24
+
25
+ > **Proof, not marketing.** The end-to-end test that gates every release records against a local server, **shuts the server down**, then replays. Same response, zero network. If the response can come back with the upstream gone, the cassette is genuinely doing the work — not a partial proxy. See [`tests/test_record_replay.py`](tests/test_record_replay.py).
26
+
27
+ ---
28
+
29
+ ## Why not just use vcrpy?
30
+
31
+ You can — at the HTTP layer, vcrpy works on LLM calls today. promptecho exists because LLM traffic breaks vcrpy's assumptions in five specific ways:
32
+
33
+ 1. **Matching.** vcrpy's default matchers look only at method and URL, and its opt-in body matching has no notion of which fields matter. LLM bodies carry volatile fields (client-injected IDs, reordered tools, whitespace) that change the bytes without changing the *meaning*, so byte-matching misses on replay. promptecho matches on a **normalized fingerprint** of the fields that determine the response, and **canonicalizes across providers**: it knows `content: "hi"` equals `content: [{"type":"text","text":"hi"}]`, an Anthropic top-level `system` equals an OpenAI `system`-role message, and an Anthropic `input_schema` tool def equals an OpenAI `function.parameters`. A raw-bytes VCR can't.
34
+ 2. **Streaming.** Most LLM calls are SSE streams. promptecho records the event stream and faithfully re-emits it on replay, so `stream=True` and token-by-token iteration work identically against a cassette — including reasoning deltas.
35
+ 3. **Binary / multimodal responses.** vcrpy's text-based cassettes silently corrupt raw `image/*` / `audio/*` / `octet-stream` bodies. promptecho detects them by `Content-Type` and base64-encodes them in the cassette, so image-out and audio-out responses round-trip byte-exact.
36
+ 4. **Debuggable CI failures.** When a vcrpy cassette miss happens, you get *"no match"*. promptecho prints the exact path that changed: `messages[1].content: recorded "summarize the cat" / incoming "summarize the dog"`. Test failures are actionable, not detective work.
37
+ 5. **Secrets.** API keys live in headers on every call. promptecho redacts them by default — a cassette is safe to commit.
38
+
39
+ ## What promptecho is *not*
40
+
41
+ - **Not a cache.** Replay matching is exact/normalized and deterministic, on purpose. It does **not** semantically match "different prompt, close enough" — that would put non-determinism back into the harness you're using to remove it. (A separate opt-in fuzzy mode is on the roadmap as a dev-loop convenience; it will never be the default and never used in CI.)
42
+ - **Not an eval.** It freezes a response so your *surrounding code* is testable. Judging whether the response is *good* is a different tool (see roadmap: `toMatchLLMSnapshot()`).
43
+
44
+ ---
45
+
46
+ ## What it covers
47
+
48
+ promptecho intercepts at the `httpx` transport layer. **If the SDK uses httpx, promptecho sees the call** — which is almost everything modern.
49
+
50
+ | You're calling | Covered? |
51
+ |---|---|
52
+ | Anthropic, OpenAI, Mistral, Cohere, `google-genai` SDKs | ✅ |
53
+ | **OpenAI SDK with custom `base_url`** → OpenRouter, Together, Fireworks, Cerebras, Groq, DeepInfra, Perplexity | ✅ |
54
+ | **Self-hosted vLLM / TGI / SGLang / LM Studio / Ollama** (OpenAI-compatible mode) | ✅ |
55
+ | Your **own fine-tune** behind any of the above | ✅ |
56
+ | **Reasoning models** — o1/o3, Claude extended thinking, DeepSeek-R1 | ✅ (incl. `reasoning_effort` / `thinking` in default match-on) |
57
+ | **Multimodal** — base64-in-JSON (vision, Claude image-out, GPT-4o) and raw binary (`image/*`, `audio/*`) | ✅ (byte-exact round-trip) |
58
+ | Bedrock via boto3, HF `InferenceClient`, in-process `transformers` | ❌ (see workarounds in [SUPPORT.md](SUPPORT.md)) |
59
+
60
+ Full matrix with caveats and workarounds: [**SUPPORT.md**](SUPPORT.md). For practical recipes by scenario (startup / enterprise / research), see [**TUTORIAL.md**](TUTORIAL.md).
61
+
62
+ ### Hosted open-source via the OpenAI SDK
63
+
64
+ This is the dominant pattern for non-Anthropic/non-OpenAI usage, and it Just Works:
65
+
66
+ ```python
67
+ from openai import OpenAI
68
+ client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
69
+
70
+ @promptecho.use_cassette("cassettes/openrouter.yaml")
71
+ def test_via_openrouter():
72
+ r = client.chat.completions.create(
73
+ model="meta-llama/llama-3.1-70b-instruct",
74
+ messages=[{"role": "user", "content": "hi"}],
75
+ )
76
+ assert r.choices[0].message.content
77
+ ```
78
+
79
+ Detection falls back to body shape when the host is unknown, so localhost gateways, in-house proxies, and self-hosted vLLM/TGI behave the same way as the brand-name hosts.
80
+
81
+ ---
82
+
83
+ ## Install
84
+
85
+ ```bash
86
+ pip install promptecho # not yet on PyPI — install from source for now
87
+ ```
88
+
89
+ ```bash
90
+ git clone <repo> && cd promptecho
91
+ pip install -e .
92
+ ```
93
+
94
+ Requires Python ≥ 3.9 and `httpx ≥ 0.24`.
95
+
96
+ ---
97
+
98
+ ## Usage
99
+
100
+ ### Decorator
101
+ ```python
102
+ @promptecho.use_cassette("cassettes/foo.yaml")
103
+ def test_foo(): ...
104
+ ```
105
+
106
+ ### Context manager
107
+ ```python
108
+ with promptecho.use_cassette("cassettes/foo.yaml"):
109
+ client.messages.create(...)
110
+ ```
111
+
112
+ ### pytest fixture (auto-named per test)
113
+ ```python
114
+ def test_bar(promptecho_cassette): # records to cassettes/test_bar.yaml
115
+ client.messages.create(...)
116
+ ```
117
+
118
+ The fixture defaults to `mode="once"` locally and `mode="none"` when `CI=true` — so a forgotten recording fails the build instead of making a live call.
119
+
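The CI-aware default described above amounts to a single environment check. A minimal sketch, assuming the fixture reads the `CI` variable directly; `default_mode` is a hypothetical helper name, not part of the documented API.

```python
import os


def default_mode(environ=os.environ) -> str:
    """Pick the fixture's record mode: strict replay in CI, record-once elsewhere."""
    return "none" if environ.get("CI") == "true" else "once"


in_ci = default_mode({"CI": "true"})
local = default_mode({})
```

Passing the environment in as an argument keeps the choice testable without mutating `os.environ`.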
120
+ ### Record modes
121
+ Borrowed from vcrpy, so the mental model is free:
122
+
123
+ | mode | absent cassette | present cassette | use for |
124
+ |------|-----------------|------------------|---------|
125
+ | `once` *(default)* | record | replay | normal dev |
126
+ | `none` | **error** | replay | **CI** — guarantees no live calls |
127
+ | `new_episodes` | record | replay + record new | evolving tests |
128
+ | `all` | record | re-record everything | refreshing fixtures |
129
+
130
+ ```python
131
+ @promptecho.use_cassette("cassettes/foo.yaml", mode="none")
132
+ ```
133
+
134
+ ### Choosing what to match on
135
+
136
+ Defaults to `["model", "messages", "system", "tools", "tool_choice", "reasoning_effort", "reasoning", "thinking"]` — everything that determines the response for a chat-shaped call, including reasoning-model knobs.
137
+
138
+ ```python
139
+ @promptecho.use_cassette(
140
+ "cassettes/foo.yaml",
141
+ match_on=["model", "messages", "system", "temperature"], # add temperature
142
+ )
143
+ ```
144
+
145
+ For non-chat shapes (raw TGI `/generate`, embeddings) you'll want to override, e.g. `match_on=["model", "input"]` for an embeddings endpoint. See [SUPPORT.md → Request shapes](SUPPORT.md#request-shapes).
146
+
147
+ ### Async
148
+
149
+ Works identically with `httpx.AsyncClient` and the async surfaces of Anthropic / OpenAI / Mistral SDKs — the async transport is patched the same way as sync.
150
+
151
+ ---
152
+
153
+ ## Cassette format
154
+
155
+ Human-readable YAML, designed to diff cleanly in PRs:
156
+
157
+ ```yaml
158
+ version: 1
159
+ match_on: [model, messages, system, tools, tool_choice, reasoning_effort, reasoning, thinking]
160
+ interactions:
161
+ - request:
162
+ method: POST
163
+ url: https://api.anthropic.com/v1/messages
164
+ match_key: ef43f6acaed95b2f # fingerprint of matched fields
165
+ matched_on: [model, messages, system, tools, tool_choice]
166
+ body: # canonical (provider-normalized) body
167
+ model: claude-opus-4-8
168
+ messages:
169
+ - {role: user, content: "Summarize: the cat sat on the mat."}
170
+ response:
171
+ status: 200
172
+ headers: {content-type: application/json}
173
+ streaming: false
174
+ body:
175
+ content: [{type: text, text: "A cat sat on a mat."}]
176
+ usage: {input_tokens: 14, output_tokens: 8}
177
+ ```
178
+
179
+ - **Streamed** responses store the ordered SSE events under `response.events` with `streaming: true`; replay re-emits them in order.
180
+ - **Binary** responses (image/audio/octet-stream) get `binary: true` and the body is base64-encoded; replay decodes and returns the original bytes.
181
+ - **The stored body is the canonical, provider-normalized shape** — not the raw provider JSON. That makes cassettes provider-agnostic and easier to skim in code review.
182
+
183
+ Auto-redacted on record: `authorization`, `x-api-key`, `openai-organization`. Configurable.
184
+
185
+ See [`examples/cassettes/example.yaml`](examples/cassettes/example.yaml) for a real one.
186
+
187
+ ---
188
+
189
+ ## Status
190
+
191
+ **v0.1.0, working core. 19 tests, all green.** Not yet on PyPI.
192
+
193
+ Records and replays real httpx traffic — sync, async, SSE streaming, binary responses, cross-provider request shapes — verified end-to-end against a local server that gets shut down between record and replay.
194
+
195
+ ### Roadmap (build-in-public)
196
+
197
+ Done:
198
+ - [x] httpx sync + async transport interception
199
+ - [x] SSE streaming record/replay
200
+ - [x] pytest plugin + auto-naming
201
+ - [x] Per-provider request normalizers (Anthropic / OpenAI / generic)
202
+ - [x] Reasoning-model match defaults (`reasoning_effort`, `thinking`, `reasoning`)
203
+ - [x] Binary response round-trip (image/audio/octet-stream — base64 in cassette)
204
+ - [x] Field-level diff on cassette miss (CI `mode=none` errors pinpoint the changed path, not just the field name)
205
+
206
+ Next:
207
+ - [ ] `requests` / `urllib3` interception backend — unlocks boto3-Bedrock and HF `InferenceClient`
208
+ - [ ] `promptecho lint` — find un-recorded calls in a test suite
209
+ - [ ] **`toMatchLLMSnapshot()` sibling** — semantic snapshot assertions on top of recorded calls
210
+
211
+ ## Design
212
+
213
+ For the why-not-the-other-way decisions — fingerprint vs raw bytes, why semantic matching is fenced off, how SSE re-emission works, how cross-provider normalization is structured — see [DESIGN.md](DESIGN.md).
214
+
215
+ ## License
216
+
217
+ MIT