agentrec 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- agentrec-0.2.0/.claude/scheduled_tasks.lock +1 -0
- agentrec-0.2.0/.github/workflows/ci.yml +26 -0
- agentrec-0.2.0/.gitignore +12 -0
- agentrec-0.2.0/CHANGELOG.md +36 -0
- agentrec-0.2.0/LICENSE +21 -0
- agentrec-0.2.0/NOTICE +24 -0
- agentrec-0.2.0/PKG-INFO +269 -0
- agentrec-0.2.0/README.md +239 -0
- agentrec-0.2.0/agentrec/__init__.py +94 -0
- agentrec-0.2.0/agentrec/__main__.py +3 -0
- agentrec-0.2.0/agentrec/capture.py +46 -0
- agentrec-0.2.0/agentrec/cli.py +173 -0
- agentrec-0.2.0/agentrec/comparators.py +271 -0
- agentrec-0.2.0/agentrec/keying.py +133 -0
- agentrec-0.2.0/agentrec/migration.py +506 -0
- agentrec-0.2.0/agentrec/providers/__init__.py +248 -0
- agentrec-0.2.0/agentrec/providers/anthropic.py +160 -0
- agentrec-0.2.0/agentrec/providers/base.py +152 -0
- agentrec-0.2.0/agentrec/providers/openai.py +159 -0
- agentrec-0.2.0/agentrec/report.py +424 -0
- agentrec-0.2.0/agentrec/session.py +160 -0
- agentrec-0.2.0/agentrec/store.py +338 -0
- agentrec-0.2.0/agentrec/transport.py +271 -0
- agentrec-0.2.0/pyproject.toml +53 -0
- agentrec-0.2.0/requirements.txt +26 -0
- agentrec-0.2.0/tests/__init__.py +0 -0
- agentrec-0.2.0/tests/test_anthropic.py +180 -0
- agentrec-0.2.0/tests/test_comparators.py +143 -0
- agentrec-0.2.0/tests/test_filestore.py +216 -0
- agentrec-0.2.0/tests/test_live_record.py +66 -0
- agentrec-0.2.0/tests/test_migration.py +499 -0
- agentrec-0.2.0/tests/test_non_streaming.py +144 -0
- agentrec-0.2.0/tests/test_providers.py +346 -0
- agentrec-0.2.0/tests/test_session.py +201 -0
- agentrec-0.2.0/tests/test_streaming.py +303 -0
agentrec-0.2.0/.claude/scheduled_tasks.lock
ADDED
@@ -0,0 +1 @@
{"sessionId":"49bf5cc6-008f-4af3-9bd0-e4b04cb0b71b","pid":29944,"acquiredAt":1781171304285}
agentrec-0.2.0/.github/workflows/ci.yml
ADDED
@@ -0,0 +1,26 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest]
        python-version: ["3.10", "3.11", "3.12", "3.13"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install
        run: pip install -e ".[dev]"
      # No API keys in CI: the live record/replay tests skip themselves and
      # the offline suite (canned SSE/JSON fixtures) is what gates merges.
      - name: Test
        run: pytest -q
agentrec-0.2.0/CHANGELOG.md
ADDED
@@ -0,0 +1,36 @@
# Changelog

## 0.2.0 — 2026-06-11

First public (PyPI) release.

### Added
- **Model-migration report**: `agentrec migrate | report | annotate` CLI.
  Replays every corpus prompt against a target model (cross-provider
  OpenAI ↔ Anthropic translation included), caches target answers as
  `migration__…` cassettes, and renders Markdown/HTML/console reports.
- Comparators: `exact`, `fuzzy` (offline), `embedding`, `judge` (live), all
  scored side-by-side in one run.
- **Per-category breakdown**: recordings tagged via
  `cassette(store, metadata={"category": "..."})` are grouped per task type
  in the report.
- **Output-token columns** per row, per category, and report-wide
  (baseline vs target volume and ratio) as a verbosity/cost signal.
- **Concurrent row scoring** in `run_migration` (`concurrency`, default 8),
  with deterministic report order and a `progress` callback.
- **Retry with backoff** on rate-limited/overloaded target calls
  (429/500/502/503/529), honouring `Retry-After`; failed responses are never
  cached.
- `agentrec[compression]` extra for brotli/zstd cassette decoding.

### Fixed
- Corpus tooling (migration, summaries) now decompresses recorded responses
  per their `Content-Encoding` (gzip/deflate built in, brotli/zstd via the
  extra). Replay was always correct; decoding raw chunks was not.

## 0.1.0

Internal prototype: record/replay at the httpx transport layer (streaming SSE
and non-streaming JSON), `InMemoryStore`/`FileStore` with header redaction and
request-body secret scrubbing, request-fingerprint keying with
provider/model/semantic-key provenance, `async_client()` + `cassette` facade.
agentrec-0.2.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Pi-Wi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
agentrec-0.2.0/NOTICE
ADDED
@@ -0,0 +1,24 @@
agentrec
========

Copyright (c) 2026 agentrec contributors.

This project is licensed under the MIT License (see LICENSE).

Third-party acknowledgements
-----------------------------

**baml_vcr** (https://github.com/BoundaryML/baml_vcr)
MIT License — Copyright (c) BoundaryML contributors

The streaming chunk capture and replay pattern in agentrec — specifically
the idea of recording each SSE byte frame as an ordered list of chunks with
relative timestamps and re-emitting them in the same order during replay —
draws inspiration from baml_vcr's approach.

agentrec's interception mechanism is deliberately different: instead of
patching a specific framework's client or collector (as baml_vcr does),
agentrec wraps httpx's AsyncBaseTransport so that interception sits below
any SDK abstraction (OpenAI, Anthropic, LangChain, etc.) and requires no
framework-specific code. Only the streaming-chunk pattern was taken as
inspiration; no source code was copied.
agentrec-0.2.0/PKG-INFO
ADDED
@@ -0,0 +1,269 @@
Metadata-Version: 2.4
Name: agentrec
Version: 0.2.0
Summary: Framework-agnostic record/replay for LLM API interactions (streaming and non-streaming)
License-Expression: MIT
License-File: LICENSE
License-File: NOTICE
Keywords: anthropic,httpx,llm,openai,record,replay,testing,vcr
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Requires-Dist: httpx>=0.27
Provides-Extra: compression
Requires-Dist: brotli>=1.1; extra == 'compression'
Requires-Dist: zstandard>=0.22; extra == 'compression'
Provides-Extra: dev
Requires-Dist: anthropic>=0.40; extra == 'dev'
Requires-Dist: openai>=1.30; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: python-dotenv>=1.0; extra == 'dev'
Description-Content-Type: text/markdown

# agentrec

Framework-agnostic record/replay for streaming LLM API interactions.
Records and replays at the **httpx transport layer**, so it works below
the OpenAI SDK, the Anthropic SDK, LangChain, or any other httpx-backed client —
the core depends on nothing but `httpx`.

> **Status:** beta (0.2). The record/replay mechanic is proven for streaming
> (SSE) and non-streaming (JSON) responses, for both OpenAI and Anthropic.
> On top of the recorded corpus sits a working **model-migration report**
> (see [Model-migration report](#model-migration-report)). The API may still
> change in minor releases before 1.0.
>
> **Scope limits:** record/replay works for *any* httpx-backed SDK, but the
> migration runner's cross-provider translation covers OpenAI ↔ Anthropic and
> **text-only** conversations — requests using tools, images, or other rich
> content become clearly-reasoned skipped rows rather than translations.

---

## Architecture

```
agentrec/
  capture.py       # CapturedChunk, CapturedRequest, CapturedInteraction — storage-agnostic data
  keying.py        # fingerprint() — provider/model/semantic_key + the default cassette id
  store.py         # InteractionStore ABC + InMemoryStore + FileStore (JSON cassettes)
  transport.py     # RecordingTransport, ReplayTransport, AutoTransport (the low-level seam)
  session.py       # async_client() + cassette — the high-level, ergonomic seam
  providers/       # ProviderAdapter registry: OpenAI + Anthropic request/response dialects
  comparators.py   # exact / fuzzy (offline), embedding / judge (live) response scoring
  migration.py     # run_migration() — replay the corpus against a candidate model
  report.py        # Markdown / HTML / console rendering of a MigrationReport
  cli.py           # `agentrec migrate | report | annotate`
```

Key design commitments:

- **Tee, don't intercept-and-buffer.** `RecordingTransport` wraps the live
  stream so the caller and the store both see every chunk in order, without
  the recorder buffering the whole response first.
- **Raw bytes, no parsing.** Chunks are stored as the original SSE byte frames.
  The SDK parser re-runs on replay and produces the same objects it would have
  from the network. OpenAI SSE and Anthropic SSE look identical here — both are
  byte streams — which is why one codebase covers both with no provider branches.
- **Injected store.** `InMemoryStore` (volatile) and `FileStore` (human-readable
  JSON cassettes, atomic writes, secret-scrubbing) both satisfy `InteractionStore`.
  A future store (Parquet corpus, S3, …) drops in without touching transport code.
- **Distinct transport classes.** `RecordingTransport` requires an inner
  transport; `ReplayTransport` has none — it *cannot* accidentally touch the
  network. `AutoTransport` composes the two for cassette semantics.
- **Request-derived keys.** Each interaction is keyed by a fingerprint of the
  request (method + path + model + normalised body), so one transport handles
  many distinct calls and the same call replays deterministically.

---

## Install

```bash
pip install agentrec                 # core is httpx-only
pip install "agentrec[compression]"  # + brotli/zstd cassette decoding

# from a checkout:
pip install -e ".[dev]"              # the dev extra adds the SDKs + pytest
```

---

## Quick start — the high-level seam

Build one `agentrec.async_client()` and pass it to any httpx-based SDK. Wrap
your calls in a `cassette`: `mode="auto"` replays a request if it's been
recorded, otherwise records it (true VCR-style cassette behaviour).

```python
import agentrec
from openai import AsyncOpenAI

store = agentrec.FileStore("corpus")
http = agentrec.async_client()  # honours the active cassette scope
oai = AsyncOpenAI(http_client=http)

# Streaming — every call inside is recorded once, then replayed:
@agentrec.cassette(store, mode="auto")
async def ask_stream(prompt: str) -> str:
    stream = await oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    out = ""
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            out += chunk.choices[0].delta.content
    return out

# Non-streaming — works identically; the JSON body is one chunk at the transport layer:
@agentrec.cassette(store, mode="auto")
async def ask(prompt: str) -> str:
    response = await oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Or as a context manager:
async with agentrec.cassette(store, mode="record"):
    await oai.chat.completions.create(...)
```

The same `async_client` + `cassette` works against the Anthropic SDK unchanged —
just `AsyncAnthropic(http_client=http)`.

---

## Lower-level seam — explicit transports

When you'd rather wire the httpx client yourself (no contextvar), use the
transports directly. `key` is optional: omit it for request-derived keying, or
pass a fixed id for a single named cassette.

```python
import httpx
from openai import AsyncOpenAI
from agentrec import FileStore, RecordingTransport, ReplayTransport

store = FileStore("corpus")

# --- Record (needs network) ---
async with httpx.AsyncClient(
    transport=RecordingTransport(httpx.AsyncHTTPTransport(), store, key="weather")
) as http_client:
    client = AsyncOpenAI(http_client=http_client)
    stream = await client.chat.completions.create(..., stream=True)
    async for chunk in stream:
        ...  # caller receives the live stream unchanged

# --- Replay (offline, no key needed if you recorded with request keying) ---
async with httpx.AsyncClient(transport=ReplayTransport(store, key="weather")) as http_client:
    client = AsyncOpenAI(http_client=http_client)
    stream = await client.chat.completions.create(..., stream=True)
    async for chunk in stream:
        ...  # identical to the recorded run
```

---

## Provider support

Interception is at the httpx transport, so agentrec is provider-neutral for
**any SDK that lets you pass an httpx client**:

| SDK / client | Works | How |
| ------------------------------------ | :---: | ------------------------------------------- |
| OpenAI (`openai`) | ✅ | `AsyncOpenAI(http_client=...)` |
| Anthropic (`anthropic`) | ✅ | `AsyncAnthropic(http_client=...)` |
| Most modern httpx-based SDKs / LangChain | ✅ | pass the agentrec httpx client through |
| **Non-httpx SDKs** (AWS Bedrock/boto3, some Vertex paths) | ❌ | they don't route through httpx, so the transport never sees the call — a different seam would be needed |

The boundary is "httpx-backed," not "OpenAI." If a client opens its sockets
through `botocore`/`urllib3` instead of httpx, transport interception can't see
it.

---

## Running the tests

```bash
pytest -q
```

| Test file | Needs a key? | What it proves |
| ------------------------ | ------------ | ----------------------------------------------------------------- |
| `tests/test_streaming.py` | offline + `OPENAI_API_KEY` | OpenAI SSE replay mechanic; live record→replay identity |
| `tests/test_non_streaming.py` | offline | Plain JSON (non-streaming) record/replay, auto mode, provenance |
| `tests/test_filestore.py` | offline | FileStore round-trip, redaction, hostile ids, readable cassettes |
| `tests/test_session.py` | offline | `async_client`/`cassette`, auto mode, request keying, metadata |
| `tests/test_providers.py` | offline | Adapter decoding (SSE/JSON × provider), translation, registry |
| `tests/test_comparators.py` | offline | Comparator scoring incl. mocked embedding/judge, spec parsing |
| `tests/test_migration.py` | offline | Migration end-to-end, caching, lineage metadata, report + CLI |
| `tests/test_anthropic.py` | offline + `ANTHROPIC_API_KEY` | Anthropic replay (provider-neutrality); live record→replay |
| `tests/test_live_record.py` | `OPENAI_API_KEY` | live capture against the real OpenAI API |

Key-gated tests skip cleanly when the key is absent. Live keys are read from a
project-root `.env` (via `python-dotenv`). The offline tests use canned SSE
frames and patch `httpx.AsyncHTTPTransport` so any accidental network access
fails the test.

---

## Model-migration report

Every recording carries provenance in `interaction.metadata`: `provider`,
`model`, `semantic_key`, and `recorded_at`. The **`semantic_key`** is a hash of
the request *without* the model (and other non-semantic fields), so two
interactions recorded against different models for the same logical prompt share
a `semantic_key`.

The migration runner builds on that: it groups the corpus by `semantic_key`,
re-asks every recorded prompt of a **target model** (cross-provider translation
included — an OpenAI-recorded prompt can be re-asked of Claude), records the
target's answers back into the corpus as `migration__…` cassettes, and scores
baseline vs. target with the selected comparators:

| Comparator | Needs network? | What it measures |
| ----------- | -------------- | ------------------------------------------------- |
| `exact` | no | normalized string equality (classification-style) |
| `fuzzy` | no | `difflib` sequence similarity |
| `embedding` | OpenAI API | cosine similarity of embeddings |
| `judge` | LLM API | an LLM scores semantic equivalence |

```bash
# Re-ask the corpus of a candidate model and write Markdown + HTML reports:
agentrec migrate --corpus corpus --target claude-haiku-4-5 --compare exact,fuzzy,judge

# Re-render fully offline from already-recorded migration cassettes
# (offline comparators only; --strict exits 1 on any failure — a CI gate):
agentrec report --corpus corpus --target claude-haiku-4-5 --strict

# Backfill summary blocks + fingerprint metadata into older cassettes:
agentrec annotate --corpus corpus
```

Re-running `migrate` is cheap: each target answer is itself a cassette, so
already-answered prompts are served from disk and only new prompts hit the API.
A failed (non-200) target call is never cached — a re-run retries it live.

Rows are scored concurrently (`--concurrency`, default 8). Recordings tagged
with a category — `cassette(store, metadata={"category": "extract"})` — get a
**per-category breakdown** in the report, and output-token counts per row and
per category surface verbosity/cost differences between the models.

---

## Attributions

See [NOTICE](NOTICE) for third-party acknowledgements, including inspiration
from **baml_vcr** for the streaming chunk capture/replay pattern.
agentrec-0.2.0/README.md
ADDED
@@ -0,0 +1,239 @@
# agentrec

Framework-agnostic record/replay for streaming LLM API interactions.
Records and replays at the **httpx transport layer**, so it works below
the OpenAI SDK, the Anthropic SDK, LangChain, or any other httpx-backed client —
the core depends on nothing but `httpx`.

> **Status:** beta (0.2). The record/replay mechanic is proven for streaming
> (SSE) and non-streaming (JSON) responses, for both OpenAI and Anthropic.
> On top of the recorded corpus sits a working **model-migration report**
> (see [Model-migration report](#model-migration-report)). The API may still
> change in minor releases before 1.0.
>
> **Scope limits:** record/replay works for *any* httpx-backed SDK, but the
> migration runner's cross-provider translation covers OpenAI ↔ Anthropic and
> **text-only** conversations — requests using tools, images, or other rich
> content become clearly-reasoned skipped rows rather than translations.

---

## Architecture

```
agentrec/
  capture.py       # CapturedChunk, CapturedRequest, CapturedInteraction — storage-agnostic data
  keying.py        # fingerprint() — provider/model/semantic_key + the default cassette id
  store.py         # InteractionStore ABC + InMemoryStore + FileStore (JSON cassettes)
  transport.py     # RecordingTransport, ReplayTransport, AutoTransport (the low-level seam)
  session.py       # async_client() + cassette — the high-level, ergonomic seam
  providers/       # ProviderAdapter registry: OpenAI + Anthropic request/response dialects
  comparators.py   # exact / fuzzy (offline), embedding / judge (live) response scoring
  migration.py     # run_migration() — replay the corpus against a candidate model
  report.py        # Markdown / HTML / console rendering of a MigrationReport
  cli.py           # `agentrec migrate | report | annotate`
```

Key design commitments:

- **Tee, don't intercept-and-buffer.** `RecordingTransport` wraps the live
  stream so the caller and the store both see every chunk in order, without
  the recorder buffering the whole response first.
- **Raw bytes, no parsing.** Chunks are stored as the original SSE byte frames.
  The SDK parser re-runs on replay and produces the same objects it would have
  from the network. OpenAI SSE and Anthropic SSE look identical here — both are
  byte streams — which is why one codebase covers both with no provider branches.
- **Injected store.** `InMemoryStore` (volatile) and `FileStore` (human-readable
  JSON cassettes, atomic writes, secret-scrubbing) both satisfy `InteractionStore`.
  A future store (Parquet corpus, S3, …) drops in without touching transport code.
- **Distinct transport classes.** `RecordingTransport` requires an inner
  transport; `ReplayTransport` has none — it *cannot* accidentally touch the
  network. `AutoTransport` composes the two for cassette semantics.
- **Request-derived keys.** Each interaction is keyed by a fingerprint of the
  request (method + path + model + normalised body), so one transport handles
  many distinct calls and the same call replays deterministically.
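
The "tee" commitment above can be illustrated with a minimal, stdlib-only sketch. It is not agentrec's implementation: `tee_chunks`, `fake_sse_stream`, and the sample frames are hypothetical names and data chosen for illustration, and the real `RecordingTransport` operates on httpx responses rather than a bare async iterator.

```python
import asyncio
import time
from typing import AsyncIterator, List, Tuple


async def tee_chunks(
    source: AsyncIterator[bytes],
    recorded: List[Tuple[float, bytes]],
) -> AsyncIterator[bytes]:
    """Yield each chunk to the caller as it arrives, recording it on the way."""
    start = time.monotonic()
    async for chunk in source:
        # Store the raw bytes with a relative timestamp, then hand the chunk
        # straight to the caller; nothing waits for the end of the stream.
        recorded.append((time.monotonic() - start, chunk))
        yield chunk


async def fake_sse_stream() -> AsyncIterator[bytes]:
    """Stand-in for a live SSE response body (hypothetical frames)."""
    frames = (
        b'data: {"delta": "Hel"}\n\n',
        b'data: {"delta": "lo"}\n\n',
        b"data: [DONE]\n\n",
    )
    for frame in frames:
        await asyncio.sleep(0)  # give control back to the loop, like real I/O
        yield frame


async def demo() -> Tuple[List[bytes], List[bytes]]:
    recorded: List[Tuple[float, bytes]] = []
    seen = [chunk async for chunk in tee_chunks(fake_sse_stream(), recorded)]
    return seen, [chunk for _, chunk in recorded]


seen, stored = asyncio.run(demo())
```

Because the recorder only appends the bytes it forwards, the stored chunk list matches what the caller saw, in the same order, which is the property replay relies on.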

---

## Install

```bash
pip install agentrec                 # core is httpx-only
pip install "agentrec[compression]"  # + brotli/zstd cassette decoding

# from a checkout:
pip install -e ".[dev]"              # the dev extra adds the SDKs + pytest
```

---

## Quick start — the high-level seam

Build one `agentrec.async_client()` and pass it to any httpx-based SDK. Wrap
your calls in a `cassette`: `mode="auto"` replays a request if it's been
recorded, otherwise records it (true VCR-style cassette behaviour).

```python
import agentrec
from openai import AsyncOpenAI

store = agentrec.FileStore("corpus")
http = agentrec.async_client()  # honours the active cassette scope
oai = AsyncOpenAI(http_client=http)

# Streaming — every call inside is recorded once, then replayed:
@agentrec.cassette(store, mode="auto")
async def ask_stream(prompt: str) -> str:
    stream = await oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    out = ""
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            out += chunk.choices[0].delta.content
    return out

# Non-streaming — works identically; the JSON body is one chunk at the transport layer:
@agentrec.cassette(store, mode="auto")
async def ask(prompt: str) -> str:
    response = await oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Or as a context manager:
async with agentrec.cassette(store, mode="record"):
    await oai.chat.completions.create(...)
```

The same `async_client` + `cassette` works against the Anthropic SDK unchanged —
just `AsyncAnthropic(http_client=http)`.

---

## Lower-level seam — explicit transports

When you'd rather wire the httpx client yourself (no contextvar), use the
transports directly. `key` is optional: omit it for request-derived keying, or
pass a fixed id for a single named cassette.

```python
import httpx
from openai import AsyncOpenAI
from agentrec import FileStore, RecordingTransport, ReplayTransport

store = FileStore("corpus")

# --- Record (needs network) ---
async with httpx.AsyncClient(
    transport=RecordingTransport(httpx.AsyncHTTPTransport(), store, key="weather")
) as http_client:
    client = AsyncOpenAI(http_client=http_client)
    stream = await client.chat.completions.create(..., stream=True)
    async for chunk in stream:
        ...  # caller receives the live stream unchanged

# --- Replay (offline, no key needed if you recorded with request keying) ---
async with httpx.AsyncClient(transport=ReplayTransport(store, key="weather")) as http_client:
    client = AsyncOpenAI(http_client=http_client)
    stream = await client.chat.completions.create(..., stream=True)
    async for chunk in stream:
        ...  # identical to the recorded run
```

---

## Provider support

Interception is at the httpx transport, so agentrec is provider-neutral for
**any SDK that lets you pass an httpx client**:

| SDK / client | Works | How |
| ------------------------------------ | :---: | ------------------------------------------- |
| OpenAI (`openai`) | ✅ | `AsyncOpenAI(http_client=...)` |
| Anthropic (`anthropic`) | ✅ | `AsyncAnthropic(http_client=...)` |
| Most modern httpx-based SDKs / LangChain | ✅ | pass the agentrec httpx client through |
| **Non-httpx SDKs** (AWS Bedrock/boto3, some Vertex paths) | ❌ | they don't route through httpx, so the transport never sees the call — a different seam would be needed |

The boundary is "httpx-backed," not "OpenAI." If a client opens its sockets
through `botocore`/`urllib3` instead of httpx, transport interception can't see
it.

---

## Running the tests

```bash
pytest -q
```

| Test file | Needs a key? | What it proves |
| ------------------------ | ------------ | ----------------------------------------------------------------- |
| `tests/test_streaming.py` | offline + `OPENAI_API_KEY` | OpenAI SSE replay mechanic; live record→replay identity |
| `tests/test_non_streaming.py` | offline | Plain JSON (non-streaming) record/replay, auto mode, provenance |
| `tests/test_filestore.py` | offline | FileStore round-trip, redaction, hostile ids, readable cassettes |
| `tests/test_session.py` | offline | `async_client`/`cassette`, auto mode, request keying, metadata |
| `tests/test_providers.py` | offline | Adapter decoding (SSE/JSON × provider), translation, registry |
| `tests/test_comparators.py` | offline | Comparator scoring incl. mocked embedding/judge, spec parsing |
| `tests/test_migration.py` | offline | Migration end-to-end, caching, lineage metadata, report + CLI |
| `tests/test_anthropic.py` | offline + `ANTHROPIC_API_KEY` | Anthropic replay (provider-neutrality); live record→replay |
| `tests/test_live_record.py` | `OPENAI_API_KEY` | live capture against the real OpenAI API |

Key-gated tests skip cleanly when the key is absent. Live keys are read from a
project-root `.env` (via `python-dotenv`). The offline tests use canned SSE
frames and patch `httpx.AsyncHTTPTransport` so any accidental network access
fails the test.

---

## Model-migration report

Every recording carries provenance in `interaction.metadata`: `provider`,
`model`, `semantic_key`, and `recorded_at`. The **`semantic_key`** is a hash of
the request *without* the model (and other non-semantic fields), so two
interactions recorded against different models for the same logical prompt share
a `semantic_key`.
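
To make the difference between the two kinds of key concrete, here is a rough sketch under stated assumptions. The helper names, the SHA-256 digest, the 16-character truncation, and the choice of `model` and `stream` as the "non-semantic" fields are all illustrative guesses; agentrec's actual `fingerprint()` may normalise and hash differently.

```python
import hashlib
import json
from typing import Any, Dict

# Assumption: which fields count as non-semantic is a guess for illustration.
NON_SEMANTIC_FIELDS = {"model", "stream"}


def _digest(payload: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) so that equal
    # requests always hash to the same value.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def request_key(method: str, path: str, body: Dict[str, Any]) -> str:
    """Request-derived key: method + path + the full body, model included."""
    return _digest({"method": method.upper(), "path": path, "body": body})


def semantic_key(body: Dict[str, Any]) -> str:
    """Hash of the body with the model and other non-semantic fields removed."""
    return _digest({k: v for k, v in body.items() if k not in NON_SEMANTIC_FIELDS})


messages = [{"role": "user", "content": "Summarise this ticket."}]
body_a = {"model": "gpt-4o-mini", "messages": messages}
body_b = {"model": "gpt-4o", "messages": messages, "stream": True}

key_a = request_key("post", "/v1/chat/completions", body_a)
key_b = request_key("post", "/v1/chat/completions", body_b)
shared_a = semantic_key(body_a)
shared_b = semantic_key(body_b)
```

In this sketch the two recordings get different request keys (so they are stored as separate cassettes) but the same semantic key (so they can be grouped as the same logical prompt).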

The migration runner builds on that: it groups the corpus by `semantic_key`,
re-asks every recorded prompt of a **target model** (cross-provider translation
included — an OpenAI-recorded prompt can be re-asked of Claude), records the
target's answers back into the corpus as `migration__…` cassettes, and scores
baseline vs. target with the selected comparators:

| Comparator | Needs network? | What it measures |
| ----------- | -------------- | ------------------------------------------------- |
| `exact` | no | normalized string equality (classification-style) |
| `fuzzy` | no | `difflib` sequence similarity |
| `embedding` | OpenAI API | cosine similarity of embeddings |
| `judge` | LLM API | an LLM scores semantic equivalence |
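
As a rough picture of what the offline comparators measure, the sketch below scores two strings with normalised equality and with `difflib`, and shows the cosine formula an embedding comparator would apply to vectors. The function names and the normalisation rule (lowercase, collapsed whitespace) are assumptions for illustration, not agentrec's actual comparator code.

```python
import difflib
import math
from typing import Sequence


def normalize(text: str) -> str:
    # Assumed normalisation: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def exact_score(baseline: str, target: str) -> float:
    """1.0 when the normalised strings are equal, otherwise 0.0."""
    return 1.0 if normalize(baseline) == normalize(target) else 0.0


def fuzzy_score(baseline: str, target: str) -> float:
    """difflib similarity ratio in [0, 1] over the normalised strings."""
    matcher = difflib.SequenceMatcher(None, normalize(baseline), normalize(target))
    return matcher.ratio()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


same_label = exact_score("  Positive ", "positive")
near_miss = fuzzy_score("the cat sat", "the cat sits")
```

`exact` suits classification-style answers where any deviation matters, while `fuzzy` gives partial credit for answers that differ by a few characters.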

```bash
# Re-ask the corpus of a candidate model and write Markdown + HTML reports:
agentrec migrate --corpus corpus --target claude-haiku-4-5 --compare exact,fuzzy,judge

# Re-render fully offline from already-recorded migration cassettes
# (offline comparators only; --strict exits 1 on any failure — a CI gate):
agentrec report --corpus corpus --target claude-haiku-4-5 --strict

# Backfill summary blocks + fingerprint metadata into older cassettes:
agentrec annotate --corpus corpus
```

Re-running `migrate` is cheap: each target answer is itself a cassette, so
already-answered prompts are served from disk and only new prompts hit the API.
A failed (non-200) target call is never cached — a re-run retries it live.
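
The caching rule described above can be sketched in a few lines. This is an illustration only: a plain dict stands in for the cassette store, and `answer_with_cache`, `flaky_target`, and the key `"prompt-1"` are hypothetical names, not part of agentrec.

```python
from typing import Callable, Dict, List, Tuple

Response = Tuple[int, str]  # (HTTP status, response body)


def answer_with_cache(
    key: str,
    cache: Dict[str, str],
    call_target: Callable[[], Response],
) -> Tuple[int, str, bool]:
    """Return (status, body, served_from_cache) for one prompt."""
    if key in cache:
        # Already answered on an earlier run: no API call is made.
        return 200, cache[key], True
    status, body = call_target()
    if status == 200:
        # Only successful answers are stored, so a failure is retried
        # live the next time this prompt is asked.
        cache[key] = body
    return status, body, False


calls: List[int] = []


def flaky_target() -> Response:
    """Hypothetical target: overloaded on the first call, fine afterwards."""
    calls.append(1)
    return (529, "overloaded") if len(calls) == 1 else (200, "answer")


cache: Dict[str, str] = {}
first = answer_with_cache("prompt-1", cache, flaky_target)
second = answer_with_cache("prompt-1", cache, flaky_target)
third = answer_with_cache("prompt-1", cache, flaky_target)
```

Here the first run fails and stores nothing, the second run calls the target again and caches the answer, and the third run is served from the cache without touching the target.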

Rows are scored concurrently (`--concurrency`, default 8). Recordings tagged
with a category — `cassette(store, metadata={"category": "extract"})` — get a
**per-category breakdown** in the report, and output-token counts per row and
per category surface verbosity/cost differences between the models.
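
One common way to get bounded concurrency with a deterministic result order is a semaphore plus `asyncio.gather`, which returns results in the order the awaitables were passed in. The sketch below assumes that approach; `score_rows` and `fake_score` are hypothetical names, and the changelog's `concurrency` and `progress` parameters are mirrored only loosely.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Sequence, Tuple


async def score_rows(
    rows: Sequence[str],
    score_one: Callable[[str], Awaitable[str]],
    concurrency: int = 8,
    progress: Optional[Callable[[int, int], None]] = None,
) -> List[str]:
    """Score rows with at most `concurrency` in flight, keeping input order."""
    semaphore = asyncio.Semaphore(concurrency)
    finished = 0

    async def bounded(row: str) -> str:
        nonlocal finished
        async with semaphore:
            result = await score_one(row)
        finished += 1
        if progress is not None:
            progress(finished, len(rows))
        return result

    # gather() orders results by input position, not by completion time.
    return list(await asyncio.gather(*(bounded(row) for row in rows)))


async def fake_score(row: str) -> str:
    """Hypothetical scorer: longer rows finish sooner, to shuffle completion."""
    await asyncio.sleep(0.01 * (3 - len(row)))
    return row.upper()


ticks: List[Tuple[int, int]] = []
results = asyncio.run(
    score_rows(
        ["a", "bb", "ccc"],
        fake_score,
        concurrency=2,
        progress=lambda done, total: ticks.append((done, total)),
    )
)
```

Even though the rows complete out of order in this demo, the returned list follows the input order, which is what keeps a report stable across runs.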

---

## Attributions

See [NOTICE](NOTICE) for third-party acknowledgements, including inspiration
from **baml_vcr** for the streaming chunk capture/replay pattern.