cat-claws 0.1.0__tar.gz

@@ -0,0 +1,5 @@
1
+ __pycache__/
2
+ *.pyc
3
+ dist/
4
+ *.egg-info/
5
+ .env
@@ -0,0 +1,32 @@
1
+ # Changelog
2
+
3
+ All notable changes to cat-claws will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [Unreleased]
9
+
10
+ ## [0.1.0] - 2026-07-03
11
+
12
+ - Initial scaffold: MASTERPLAN.md (design + step tracker), package skeleton.
13
+ - classify() v0: one-row-sealed calls, frozen prompt, bounded concurrency,
14
+ JSON re-asks, wide 0/1 DataFrame output.
15
+ - Phase 2 — rate-limit handling: the Claude adapter detects a genuinely
16
+ exhausted usage window (a `RateLimitEvent` whose primary `status ==
17
+ "rejected"`, or `ResultMessage.api_error_status == 429`; `allowed`/
18
+ `allowed_warning` are non-blocking) and surfaces it with a `"rate-limited: "`
19
+ error prefix, including the reset time when known. It deliberately IGNORES
20
+ `overage_status` — a `rejected` overage bucket with
21
+ `overage_disabled_reason == "org_level_disabled"` just means the org turned
22
+ off spillover billing while the request still succeeds; treating it as a
23
+ limit falsely failed every call on such accounts (common in institutional
24
+ subscriptions). A successful answer always wins over an informational limit
25
+ event. classify() gains `rate_limit_retries` (default 2): on a genuine limit a
26
+ row backs off exponentially (30s, 60s…) on a budget separate from
27
+ `json_retries`, other in-flight rows unaffected, and never raises. Reset-aware:
28
+ when the reset is farther out than the backoff budget can bridge (e.g. a
29
+ `five_hour` cap hours away) the row fails fast with the resumable message
30
+ instead of sleeping through futile retries; near/unknown resets still back off.
31
+ - Phase 2 — `benchmarks/bench_classify.py`: synthetic-data throughput
32
+ benchmark across max_workers settings; results in `benchmarks/RESULTS.md`.
@@ -0,0 +1,315 @@
1
+ # cat-agent — Implementation Guide (handoff document)
2
+
3
+ *Written 2026-07-03 for whoever (human or model) continues this project.
4
+ Read MASTERPLAN.md first for the why; this file is the how. If anything here
5
+ contradicts the code, trust the code and update this file.*
6
+
7
+ ## 0. Ground rules — read before touching anything
8
+
9
+ 1. **One row = one sealed, fresh-context agent call.** Never reuse a
10
+ conversation across rows. Never put multiple rows in one prompt. This is
11
+ a research-validity requirement the maintainer (Chris) set explicitly.
12
+ If a change would violate it, stop and ask.
13
+ 2. **The prompt is frozen.** Row prompts must be byte-identical to what
14
+ `catstack.text_functions_ensemble.build_text_classification_prompt`
15
+ produces. `tests/test_classify.py::TestPromptParity` enforces this — if
16
+ that test fails, YOUR change is wrong, not the test. Never "improve" the
17
+ prompt wording.
18
+ 3. **Output schema is fixed**: DataFrame with `input_data`,
19
+ `processing_status` ("success" or "error: ..."), and `category_N` 0/1
20
+ columns (None on error rows). Same as `catstack.classify()`.
21
+ 4. **One bad row never aborts a batch.** Errors are recorded per row.
22
+ 5. **cat-stack never hard-depends on cat-agent.** Dispatch in cat-stack is a
23
+ lazy import; distribution happens via the cat-llm meta-package.
24
+ 6. Workflow conventions (from the maintainer's standing preferences):
25
+ accumulate changes in CHANGELOG.md `[Unreleased]`; do NOT bump the
26
+ version per change — one bump per release batch. Every substantive fix
27
+ gets a live smoke test against the real agent, not just mocks. Prefer
28
+ stdlib over new dependencies. For provider/SDK behavior claims: probe
29
+ empirically before coding against them.
30
+
31
+ ## 1. What exists and works (verified live 2026-07-03)
32
+
33
+ - Environment: macOS, anaconda python (`/Users/chrissoria/anaconda3/bin/python3`).
34
+ `catagent` and `catstack` are installed editable in it. Claude Code CLI
35
+ 2.1.197 is installed and logged in (subscription auth — no API key needed
36
+ for agent calls). `claude-agent-sdk` 0.2.110 is installed.
37
+ - `catagent.classify()` works end to end: 3 rows on `claude-sonnet-5`,
38
+ `max_workers=3`, 4.4s total, correct matrix.
39
+ - Mocked tests: `cd ~/Documents/Research/cat-agent && python -m pytest tests/ -q`
40
+ → 5 passing. Run them after every change.
41
+ - Live smoke (costs ~nothing on subscription, takes ~10s):
42
+
43
+ ```bash
44
+ python3 -c "
45
+ import catagent
46
+ df = catagent.classify(
47
+ input_data=['I moved for a new job', 'Rent got too expensive', 'Closer to my parents'],
48
+ categories=['Employment', 'Cost of living', 'Family', 'Other'],
49
+ user_model='claude-sonnet-5', description='Why did you move?', max_workers=3)
50
+ print(df.to_string()); assert (df.processing_status == 'success').all()"
51
+ ```
52
+
53
+ ## 2. Verified facts — do NOT re-derive, do NOT trust your training data
54
+
55
+ These were established by live probes on 2026-07-03 (sdk 0.2.110, CLI
56
+ 2.1.197). Training-data knowledge of this SDK is likely stale.
57
+
58
+ | Fact | Consequence |
59
+ |---|---|
60
+ | Two `query()` one-shots share NO context | fresh-context-per-row design is sound |
61
+ | Process overhead ≈ 1.9s per one-shot; 4-way concurrency near-linear | throughput = concurrency; don't chase warm-process reuse |
62
+ | `ClaudeAgentOptions(output_format={"type":"json","schema":...})` is **silently ignored** — answer arrives as markdown text, no structured field | Phase 1 parses prompt-JSON via `catstack.extract_json`; re-probe `output_format` on every SDK upgrade before building Phase 3 |
63
+ | The agent **thinks by default** (ThinkingBlock observed on haiku) | adapter passes `thinking=ThinkingConfigDisabled(type="disabled")` at `thinking_budget=0` (engine parity) and `effort=<graded>` above 0 |
64
+ | `ThinkingConfigDisabled` is a TypedDict | `ThinkingConfigDisabled(type="disabled")` constructs a plain dict — fine |
65
+ | `catstack._utils.validate_classification_json(json_str, n)` returns a `(bool, dict)` TUPLE and the dict values are STRINGS "1"/"0" | unpack both; compare `str(v) == "1"` |
66
+ | `catstack.extract_json(reply)` returns a JSON *string* (handles fences/preambles) | feed its output to validate, don't json.loads the raw reply |
67
+ | `system_prompt` option REPLACES Claude Code's default agent persona | our `_SYSTEM_PROMPT` in classify.py is transport scaffolding, not part of the instrument |
68
+ | `setting_sources=[]` prevents CLAUDE.md/user-settings injection | never remove it — without it, running from inside a repo contaminates classifications |
69
+ | `RateLimitEvent` fires on SUCCESSFUL calls too (informational). Only primary `RateLimitInfo.status=='rejected'` blocks. `overage_status=='rejected'` + `overage_disabled_reason=='org_level_disabled'` = overage billing off (org config), NOT a block — call succeeds | detect rate limits on primary `status` only; a returned answer wins over any limit event. Verified 2026-07-03 against a live 74%-used, org-overage-disabled account |
70
+
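The two parsing rows in the table above can be sketched as follows. This is an illustration only, not catstack's code: `fake_validate` is a hypothetical stand-in that mimics the documented `(bool, dict)` return shape with string values, and the key format (`"1"` to `"n"`) is an assumption.

```python
import json


def fake_validate(json_str: str, n: int) -> tuple[bool, dict]:
    """Hypothetical stand-in for catstack._utils.validate_classification_json.

    Mimics only the documented shape: a (bool, dict) tuple whose dict values
    are the STRINGS "1"/"0". The key format ("1".."n") is an assumption.
    """
    try:
        raw = json.loads(json_str)
    except json.JSONDecodeError:
        return False, {}
    if not isinstance(raw, dict):
        return False, {}
    parsed = {str(k): str(raw.get(str(k), "0")) for k in range(1, n + 1)}
    ok = all(v in ("0", "1") for v in parsed.values())
    return ok, parsed


def to_row(json_str: str, n: int) -> dict:
    """Turn one reply into category_N 0/1 columns (None on an error row)."""
    ok, parsed = fake_validate(json_str, n)  # unpack BOTH tuple members
    if not ok:
        return {f"category_{i}": None for i in range(1, n + 1)}
    # Values are strings, so compare as strings instead of testing truthiness:
    # bool("0") is True and would mark every category as present.
    return {
        f"category_{i}": int(str(parsed[str(i)]) == "1") for i in range(1, n + 1)
    }
```

The string comparison is the point of the sketch: a truthiness check on `"0"` would silently produce an all-ones matrix.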
71
+ ## 3. Code map
72
+
73
+ ```
74
+ src/catagent/
75
+   __about__.py           version 0.1.0; single source of truth (hatch reads it)
76
+   __init__.py            exports classify
77
+   _adapters/base.py      AgentAdapter.one_shot(prompt, system_prompt, model,
78
+                          thinking_budget) -> (text|None, error|None)
79
+   _adapters/claude.py    ClaudeAdapter: sealed ClaudeAgentOptions; thinking
80
+                          parity; CLINotFoundError -> friendly install message;
81
+                          falls back to agent-default thinking if the explicit
82
+                          disable errors
83
+   _adapters/__init__.py  ADAPTERS registry + get_adapter(name)
84
+   _backend.py            gather_bounded(coro_fns, max_workers): sync->async
85
+                          seam via asyncio.run; captures per-task exceptions
86
+   classify.py            classify(input_data, categories, user_model, agent,
87
+                          description, multi_label, thinking_budget,
88
+                          max_workers, json_retries) -> DataFrame
89
+ tests/test_classify.py   FakeAdapter pattern for mocked tests; copy it for
90
+                          new tests; TestPromptParity is the canary
91
+ ```
92
+
93
+ Repo: github.com/chrissoria/cat-agent (PRIVATE until first PyPI release).
94
+ Commit style: imperative subject, body explains why, footer:
95
+ `Co-Authored-By: Claude ...` (see git log for examples). Push after commit.
96
+
97
+ ## 3b. Standing sanity checks — bracket EVERY work session with these
98
+
99
+ Run before starting (baseline must be green — if it isn't, fix that first,
100
+ don't build on a broken base) and after every substantive change:
101
+
102
+ ```bash
103
+ cd ~/Documents/Research/cat-agent
104
+ python -m pytest tests/ -q # ALL green, incl. TestPromptParity
105
+ python -c "import catagent; print(catagent.__version__)" # imports clean
106
+ git status --short # only files YOU meant to touch
107
+ ```
108
+
109
+ And once per session, the 3-row live smoke from §1 (~10s, subscription-only
110
+ cost). PASS = 3× "success" + correct 1s on the diagonal.
111
+
112
+ **Direction check (ask before each step, honestly):** does what I'm about to
113
+ do (a) keep one-row-sealed-calls, (b) leave the prompt byte-identical,
114
+ (c) keep the output schema unchanged, (d) avoid new hard dependencies? If
115
+ any answer is no → stop and surface it to the maintainer instead of coding.
116
+
117
+ ## 4. Phase 2 — benchmarks + rate-limit handling — DONE 2026-07-03
118
+
119
+ *(Throughput sweep deferred: the subscription five_hour window was exhausted
120
+ during development; re-run `python benchmarks/bench_classify.py --n 50
121
+ --write-results` after a reset. Everything else landed and is live-verified.)*
122
+
123
+ **Two live findings that changed the plan:**
124
+ 1. **Overage false positive (bug, fixed).** Detection first keyed on *any*
125
+ `RateLimitInfo` with a `rejected` bucket, including `overage_status`. But
126
+ org-disabled overage (`overage_disabled_reason=="org_level_disabled"`,
127
+ `isUsingOverage=False`) reports `overage_status=='rejected'` while the
128
+ primary `status=='allowed'` and **the call succeeds**. The old code aborted
129
+ every successful call as rate-limited — breaking cat-agent on every call for
130
+ institutional accounts. Fix: detect only on primary `status=='rejected'`, and
131
+ a returned answer always wins over an informational limit event. Regression
132
+ tests in `tests/test_rate_limit.py`. Confirm any detection change against the
133
+ RAW payload (`RateLimitInfo.raw`), never the parsed summary.
134
+ 2. **Window is five_hour, not minutes.** The reset on a genuine exhaustion is
135
+ hours out, so classify() **fails fast** when the reset (parsed via
136
+ `base.parse_reset_epoch`) is beyond its backoff budget, backing off only for
137
+ near/unknown resets. This supersedes the original item 3 "always retry up to
138
+ 2 times" below.
139
+
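The detection rule from finding 1 can be sketched as a small pure function. `LimitInfo` is a hypothetical stand-in for the SDK payload, with field names taken from this guide. Letting a returned answer also win over a 429 status is an assumption of the sketch, not confirmed behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LimitInfo:
    """Hypothetical stand-in for the SDK's rate-limit payload."""
    status: str = "allowed"  # primary bucket: allowed / allowed_warning / rejected
    overage_status: Optional[str] = None
    overage_disabled_reason: Optional[str] = None


def is_rate_limited(
    info: Optional[LimitInfo],
    api_error_status: Optional[int] = None,
    answer: Optional[str] = None,
) -> bool:
    """Return True only for a genuinely exhausted usage window."""
    if answer is not None:
        return False  # a returned answer wins over any limit event
    if api_error_status == 429:
        return True
    # Key on the PRIMARY status only. overage_status is deliberately ignored:
    # a rejected overage bucket is a billing setting, not a block.
    return info is not None and info.status == "rejected"
```

The regression case is an org-disabled overage bucket on an otherwise allowed call, which must come back as not rate-limited.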
140
+ Goal: know how this behaves at realistic N and fail gracefully at limits.
141
+
142
+ 1. **Benchmark script** (`benchmarks/bench_classify.py`, committed): classify
143
+ N=50 short rows (generate synthetic one-liners; do NOT use real study
144
+ data) on `claude-haiku-4-5` at max_workers ∈ {1, 4, 8}. Record wall time,
145
+ rows/s, error count. Write results into the script's docstring or a
146
+ `benchmarks/RESULTS.md` with date + CLI/SDK versions.
147
+
148
+ ✓ **Sanity before the 50-row run:** run the script with N=4 first. PASS:
149
+ 4/4 success, wall time ≈ 5–15s at workers=4, and per-worker scaling
150
+ visible (workers=1 clearly slower than workers=4). FAIL (e.g. workers=4
151
+ not faster, or errors): stop — something regressed in `_backend.py`
152
+ concurrency; do not burn a 50-row run to find out.
153
+
154
+ ✓ **Sanity after:** rows/s at workers=8 should be ≥ workers=4's. If it's
155
+ *worse*, you're likely hitting throttling — that's a finding, record it,
156
+ don't "fix" the code.
157
+
158
+ 2. **Rate-limit surfacing.** During Phase-0 probes a `RateLimitEvent`
159
+ message type was observed in the stream (untriggered). In
160
+ `_adapters/claude.py`, detect rate-limit conditions (probe: what does the
161
+ SDK emit when throttled? check message types + ResultMessage fields) and
162
+ return a distinguishable error string prefix, e.g.
163
+ `"rate-limited: ..."` so classify() can react.
164
+
165
+ ✓ **Sanity:** you cannot reliably trigger a real rate limit on demand —
166
+ so the check is structural: unit-test the detection function against a
167
+ synthetic RateLimitEvent/ResultMessage object. If you find yourself
168
+ hammering the live API trying to trigger a real 429, stop — that's the
169
+ wrong direction (and abuses the subscription).
170
+
171
+ 3. **Backoff in classify()**: on a rate-limited row, sleep (exponential,
172
+ start 30s — subscription windows are minutes-scale, not seconds) and
173
+ retry up to 2 times BEFORE consuming json_retries. Keep per-row
174
+ isolation: other in-flight rows continue.
175
+
176
+ ✓ **Sanity:** mocked test with a fake clock / patched `asyncio.sleep` —
177
+ the test suite must still finish in seconds. If tests now take minutes,
178
+ you forgot to patch the sleep. Also: non-rate-limited rows in the same
179
+ batch must complete WITHOUT waiting on the throttled row (assert via
180
+ call-order in the fake adapter).
181
+
182
+ 4. **Partial-results guarantee test**: mocked test where the adapter
183
+ rate-limits every call — classify() must return a full DataFrame with
184
+ `error: rate-limited...` statuses, never raise.
185
+
186
+ ✓ **Sanity:** `len(df) == len(input_data)` exactly, and
187
+ `TestPromptParity` still green (backoff logic must not have touched
188
+ prompt construction).
189
+
190
+ 5. Acceptance: mocked tests green; benchmark table committed; a live 50-row
191
+ haiku run completes with 0 errors (or documented rate-limit behavior).
192
+
193
+ ✓ **Phase-2 exit sanity:** re-run the §1 3-row sonnet-5 smoke one last
194
+ time; diff `benchmarks/RESULTS.md` numbers against the Phase-1 baseline
195
+ (1.5s/row effective at workers=3). Materially slower → find out why
196
+ before checking the phase off.
197
+
198
+ ## 5. Phase 3 — structured output (blocked until SDK supports it)
199
+
200
+ Re-probe on each SDK upgrade (takes 1 minute):
201
+
202
+ ```bash
203
+ # Probe: send output_format={"type":"json","schema":{...}} and inspect the messages
204
+ # (reuse the scratchpad/probe_agent_sdk.py pattern from this guide's git history, or
205
+ # rewrite it). PASS: ResultMessage gains a structured field, or the text is bare JSON.
206
+ python3 /path/to/probe
207
+ ```
208
+
209
+ If supported: add `output_format` to ClaudeAdapter behind a feature check,
210
+ keep extract_json as fallback. If not: leave Phase 3 alone.
211
+
212
+ ✓ **Sanity (gate for even starting Phase 3):** the probe must show a
213
+ *machine-parseable* result — bare JSON text or a populated structured field.
214
+ "Markdown that mentions the right numbers" (what 0.2.110 produces) is a
215
+ FAIL; do not write any Phase-3 code against it. If implemented: run the same
216
+ 3-row live smoke twice, once with structured output and once with the
217
+ extract_json fallback forced — both matrices must be identical.
218
+
219
+ ## 6. Phase 4 — cat-stack + cat-llm integration
220
+
221
+ Dispatch lives in cat-stack; distribution in cat-llm (decision recorded in
222
+ MASTERPLAN). Mirror the existing `claude-code` branches exactly — anchors in
223
+ `cat-stack/src/catstack/_providers.py` (line numbers as of 2.0.1+):
224
+
225
+ - `PROVIDER_CONFIG["claude-code"]` (~line 719): add a sibling
226
+ `"claude-agent"` entry (`endpoint: None`).
227
+ - `detect_provider` (~line 1742): add `model_source == "claude-agent"`.
228
+ - `complete()` dispatch (~line 1249): before payload build, add:
229
+ ```python
230
+ if self.provider == "claude-agent":
231
+     try:
232
+         from catagent._adapters import get_adapter
233
+         from catagent._backend import gather_bounded
234
+     except ImportError:
235
+         return None, ("cat-agent is not installed. "
236
+                       "Run: pip install cat-stack[agent]")
237
+     ...  # build system/user text from messages (see _call_claude_cli
238
+          # for the message-flattening pattern), run one sealed call
239
+ ```
240
+ NOTE: complete() is sync and called from worker threads; call the adapter
241
+ via `asyncio.run` per call (gather_bounded pattern) — do NOT create a
242
+ module-global event loop.
243
+ - `text_functions_ensemble.py` ~line 653: the `claude-code` validation
244
+ branch (CLI availability check, api_key not required) — add
245
+ `claude-agent` alongside it, checking catagent importability instead.
246
+ - pyproject: `[project.optional-dependencies] agent = ["cat-agent>=0.1.0"]`.
247
+ - cat-llm meta pyproject: add `cat-agent>=0.1.0` to `dependencies`.
248
+ - Tests: mocked test in cat-stack (`tests/test_claude_agent_dispatch.py`)
249
+ patching catagent; live test: `catstack.classify(model_source="claude-agent")`
250
+ 3 rows. Ensemble test: one API model + claude-agent in a panel.
251
+ - Ecosystem rules: cat-stack release = CHANGELOG entry + version bump at
252
+ batch end (see cat-stack/CLAUDE.md); cat-agent needs a PyPI release FIRST
253
+ (flip repo public, `python -m build`, twine with PYPI_API_TOKEN from
254
+ cat-stack/.env, TWINE_USERNAME=__token__).
255
+
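The NOTE on `complete()` above (sync, called from worker threads, one `asyncio.run` per call) can be sketched as follows. This assumes a hypothetical `fake_one_shot` in place of the real adapter call. `complete_sync` and `run_from_threads` are illustrative names, not cat-stack functions.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fake_one_shot(prompt: str):
    """Hypothetical stand-in for an adapter's async one_shot()."""
    await asyncio.sleep(0)
    return f"echo:{prompt}", None


def complete_sync(prompt: str):
    """Sync entry point: asyncio.run creates and closes a fresh loop per call.

    No module-global event loop is shared between worker threads.
    """
    try:
        return asyncio.run(fake_one_shot(prompt))
    except Exception as exc:
        return None, f"error: {exc}"  # keep the (text, error) contract


def run_from_threads(prompts, max_workers: int = 3):
    """Call the sync entry point from worker threads, as the engine would."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(complete_sync, prompts))
```

Each worker thread gets its own short-lived loop, so a stuck or closed loop in one call cannot affect another.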
256
+ ✓ **Per-step sanity for Phase 4 (cat-stack is production code used by 6+
257
+ downstream packages — check after EVERY edit there, not at the end):**
258
+
259
+ 1. After each cat-stack edit:
260
+ `cd ~/Documents/Research/cat-stack && python -m pytest tests/ -q` —
261
+ expected: everything green except the known pre-existing failure in
262
+ `test_chat_template_kwargs_strip.py::test_warning_printed_only_once`
263
+ (untracked WIP test, fails on clean HEAD too — NOT yours to fix).
264
+ Any OTHER failure = your change broke the engine; revert and rethink.
265
+ 2. The no-install path must degrade politely BEFORE testing the happy path:
266
+ temporarily `pip uninstall -y cat-agent`, run
267
+ `catstack.classify(model_source="claude-agent", ...)` → must return
268
+ error rows mentioning `pip install cat-stack[agent]`, never a raw
269
+ ImportError traceback. Reinstall (`pip install -e ~/Documents/Research/cat-agent`)
270
+ and confirm the same call succeeds.
271
+ 3. Regression canary: run one classify on a NORMAL provider
272
+ (`model_source="anthropic"`, sonnet-5, creativity=0.3, 1 row) after the
273
+ dispatch edits — the claude-agent branch must not have disturbed API
274
+ routing.
275
+ 4. Ensemble sanity: panel of claude-agent + one API model must produce
276
+ consensus columns and per-model columns with no schema drift
277
+ (compare `df.columns` against an API-only ensemble run).
278
+ 5. cat-llm meta edit: `pip download cat-llm --no-deps -d /tmp/x` is NOT the
279
+ check — the check is reading the diff: exactly one line added to
280
+ `dependencies` in cat-llm/pyproject.toml. Meta-package mistakes ship to
281
+ every user; keep the diff minimal and reviewed.
282
+
283
+ ## 7. Phase 5 — Codex adapter (later)
284
+
285
+ Spike first, code second (`codex exec` non-interactive mode: auth story,
286
+ model flag, JSON/event output, sandbox flags, startup cost, context
287
+ isolation — same probe checklist as Phase 0). Implement
288
+ `_adapters/codex.py` against `AgentAdapter`; register in `ADAPTERS`;
289
+ `model_source="codex-agent"` in cat-stack; split extras
290
+ (`cat-agent[claude]` / `cat-agent[codex]`) so neither SDK is forced on
291
+ users of the other. Cross-agent parity run for methodology disclosure.
292
+
293
+ ✓ **Sanity gates:** (1) Do not write `codex.py` until the spike proves
294
+ context isolation between `codex exec` calls — that probe result decides
295
+ whether the design is even possible, same as Phase 0 did for Claude.
296
+ (2) The Codex adapter must pass the SAME mocked test suite: parameterize
297
+ `tests/test_classify.py` over adapters rather than duplicating tests — if
298
+ you're copy-pasting the test file, wrong direction. (3) After the extras
299
+ split, `pip install cat-agent[claude]` in a fresh venv must work WITHOUT
300
+ any codex packages present (and vice versa) — import-time cross-adapter
301
+ leakage means `_adapters/__init__.py` needs lazier imports.
302
+
303
+ ## 8. Traps encountered (so you don't repeat them)
304
+
305
+ - zsh: an unquoted word that starts with `=` (e.g. `echo ===`) triggers zsh's `=command` expansion and fails or expands unexpectedly; quote it.
306
+ - The experiments `.env` values are quoted; `export $(grep ...)` leaks the
307
+ quotes. Use python-dotenv to read keys when shelling out.
308
+ - pandas/bottleneck UserWarning noise on every python start — filter with
309
+ `grep -v pandas`, it is not an error.
310
+ - Live API keys live in
311
+ `/Users/chrissoria/Documents/Research/Categorization_AI_experiments/.env`
312
+ (ANTHROPIC_API_KEY, GOOGLE_API_KEY, ...). Agent calls need NO key.
313
+ - Do not edit `cat-stack/src/catstack/collapse_themes.py` — it carries the
314
+ maintainer's uncommitted WIP. If you must build cat-stack dists, stash it
315
+ first and pop after (see cat-stack releases in git history).
@@ -0,0 +1,192 @@
1
+ # cat-agent — Claude Agent SDK backend for the CatLLM ecosystem
2
+
3
+ *Drafted 2026-07-03. Naming is provisional (`cat-agent` / import `catagent`);
4
+ renaming is cheap until first PyPI release.*
5
+
6
+ > **Continuing this project? Read `IMPLEMENTATION_GUIDE.md` next** — it holds
7
+ > the verified facts (don't re-derive them), the traps already hit, and
8
+ > step-by-step instructions with acceptance criteria for every remaining
9
+ > phase. This file is the why; that file is the how.
10
+
11
+ ## Why this package exists
12
+
13
+ cat-stack already has a `claude-code` provider: a ~100-line subprocess shim
14
+ around `claude -p`. It works, but it is under-engineered for research use:
15
+
16
+ 1. **Cost/access** — the real prize. Rows classified through the user's
17
+ Claude subscription instead of per-token API billing. For researchers and
18
+ students without API budgets, "install Claude Code, log in, classify your
19
+ survey" is a different accessibility story. The shim technically does
20
+ this; the SDK makes it robust enough to recommend.
21
+ 2. **Throughput** — the shim is sequential with full CLI startup per row
22
+ (~33s/row measured on claude-haiku-4-5, 2026-07-03). The SDK is
23
+ async-native: concurrent one-shot queries are the honest performance fix.
24
+ 3. **Reliability** — the shim scrapes stdout; the SDK yields typed message
25
+ objects, so "the assistant's final text" is extracted reliably.
26
+ 4. **Isolation** — `claude -p` loads project settings and CLAUDE.md by
27
+ default: running classify() from inside a repo can silently inject that
28
+ repo's instructions into every classification. The SDK exposes explicit
29
+ controls (`setting_sources`, custom `system_prompt`, `allowed_tools`).
30
+
31
+ ## Design constraints (non-negotiable)
32
+
33
+ - **One row = one call = one fresh context.** Never a persistent
34
+ conversation across rows (cross-row contamination breaks research
35
+ validity). Never corpus-in-one-prompt. Throughput comes from concurrency,
36
+ not context reuse.
37
+ - **The frozen prompt.** Prompts come from cat-stack's validated
38
+ `build_text_classification_prompt` — byte-identical to the API path. This
39
+ package is a transport, not a new instrument.
40
+ - **Sealed sessions.** `allowed_tools=[]`, single turn, no settings/CLAUDE.md
41
+ loading, custom system prompt only. Classification must not touch the
42
+ filesystem or improvise.
43
+ - **Same output contract.** The model answers in JSON (prompt-requested, as
44
+ today); parsing goes through cat-stack's `extract_json` +
45
+ `validate_classification_json`; output is the standard wide 0/1 DataFrame
46
+ (`input_data`, `processing_status`, `category_N` columns). Everything
47
+ downstream (ensembles, R, Stata, desktop) must be able to adopt this
48
+ backend without schema changes.
49
+ - **The subprocess shim stays** in cat-stack as the zero-dependency fallback.
50
+ `model_source="claude-code"` keeps meaning shim; this package introduces
51
+ `model_source="claude-agent"`.
52
+ - **Dependency discipline.** This package depends on `claude-agent-sdk` and
53
+ `cat-stack`. cat-stack never depends on this package — it lazy-imports it
54
+ behind the `claude-agent` model_source (the `[formatter]`-extra pattern),
55
+ erroring with `pip install cat-stack[agent]` guidance when absent.
56
+
57
+ ## Architecture
58
+
59
+ Multi-agent by design: Claude (via `claude-agent-sdk`) is the first adapter;
60
+ OpenAI Codex is a planned second (its `codex exec` non-interactive mode fits
61
+ the same one-shot contract). The seam between "the classification pipeline"
62
+ and "which agent CLI answers one prompt" is therefore an explicit adapter
63
+ interface from day one — everything above the adapter is agent-agnostic.
64
+
65
+ ```
66
+ cat-stack classify(model_source="claude-agent" | "codex-agent" …)
67
+ └─ lazy import catagent → backend satisfies the same (text, error)
68
+ contract complete() returns
69
+ catagent
70
+ ├─ _adapters/
71
+ │ base.py AgentAdapter: one_shot(prompt, system_prompt, model,
72
+ │ opts) -> (text, error). Sealed-session semantics are part
73
+ │ of the contract (no tools, fresh context, single turn).
74
+ │ claude.py claude-agent-sdk implementation (Phase 1)
75
+ │ codex.py codex CLI implementation (later phase)
76
+ ├─ _backend.py agent-agnostic plumbing: adapter registry, dedicated
77
+ │ event loop, semaphore-bounded concurrency, retries
78
+ ├─ classify.py standalone classify(agent="claude") (Phase 1: usable
79
+ │ without engine integration; later delegated to from
80
+ │ cat-stack)
81
+ └─ __about__.py version (single source of truth, hatch)
82
+ ```
83
+
84
+ Adapter-contract notes for Codex (recorded now, built later): `codex exec`
85
+ supports non-interactive one-shots with JSON event output and model
86
+ selection; auth via ChatGPT subscription login mirrors the Claude story
87
+ (subscription-based classification). Sandbox/approval flags are the sealed-
88
+ session equivalent. `claude-agent-sdk` stays an install extra once a second
89
+ adapter exists (`cat-agent[claude]`, `cat-agent[codex]`) so neither CLI's SDK
90
+ is forced on users of the other.
91
+
92
+ ## Known risks / open questions
93
+
94
+ - **Rate limits**: subscription plans have usage caps; large jobs may hit
95
+ them. Degrade gracefully (clear error, partial results, resumability) —
96
+ never promise API-like throughput.
97
+ - **Model parity**: CLI-served vs API-served output comparability is a
98
+ methodology-disclosure question for papers. Measure, document, disclose
99
+ (never silently swap).
100
+ - **SDK tempo**: the Agent SDK tracks Claude Code releases; re-audit on CLI
101
+ major versions (same habit as the 2026-07-03 shim audit).
102
+ - **Warm-process reuse**: can fresh-context queries share a warm process?
103
+ (Phase 0 answers; if not, concurrency alone is the plan.)
104
+
105
+ ## Step tracker
106
+
107
+ ### Phase 0 — empirical spike (kill-or-validate) — DONE 2026-07-03, sdk 0.2.110
108
+ - [x] Install `claude-agent-sdk`; introspect the real API surface
109
+ - [x] Timing: 5.6s wall for a trivial one-shot, only ~1.9s process overhead
110
+ (the shim's 33s/row was mostly sequential design + inference, not startup)
111
+ - [x] Context isolation: PASS — two `query()` calls share nothing
112
+ - [x] Sealed-session options verified (`allowed_tools=[]`, `max_turns=1`,
113
+ `setting_sources=[]`, custom `system_prompt`)
114
+ - [x] Model selection verified (`claude-sonnet-5` by name, no error)
115
+ - [x] Concurrency: 4 parallel one-shots in 3.6s total (near-linear speedup)
116
+ - [x] Structured output: `output_format={"type":"json","schema":...}` was
117
+ silently IGNORED (markdown answer, no structured field) — Phase 1 uses
118
+ prompt-JSON + extract_json; Phase 3 re-probes future SDK versions.
119
+ - [x] Finding: the agent enables THINKING by default (ThinkingBlock observed
120
+ on haiku). Engine parity requires thinking disabled at
121
+ thinking_budget=0 and graded `effort` above it — the adapter must set
122
+ this explicitly.
123
+
124
+ ### Phase 1 — classify() v0 (single function, parity with shim) — DONE 2026-07-03
125
+ - [x] Repo skeleton: pyproject (hatch), __about__, README, CHANGELOG
126
+ - [x] Adapter contract (`_adapters/base.py`) + Claude adapter — sealed
127
+ one-row call, (text, error) contract, thinking-off-by-default parity
128
+ - [x] `classify.py` — rows → frozen prompts → one_shot → extract_json →
129
+ wide 0/1 DataFrame with processing_status
130
+ - [x] JSON retry (`json_retries`, per-row isolation)
131
+ - [x] Mocked unit tests incl. frozen-prompt byte-parity test (5 passing)
132
+ - [x] Live smoke test: 3 rows, claude-sonnet-5, 4.4s total (1.5s/row
133
+ effective at max_workers=3 vs ~33s/row through the shim), matrix correct
134
+ - [x] Commit + push — github.com/chrissoria/cat-agent (private for now;
135
+ flip to public at first PyPI release)
136
+
137
+ *Note: Phase 2's core (semaphore-bounded concurrency) landed in Phase 1 via
138
+ `_backend.gather_bounded`; Phase 2 now means benchmarks at realistic N +
139
+ rate-limit handling.*
140
+
141
+ ### Phase 2 — concurrency + rate-limit handling — DONE 2026-07-03 (throughput sweep deferred)
142
+ - [x] Semaphore-bounded async gather (max_workers semantics) — landed in Phase 1
143
+ (`_backend.gather_bounded`); per-row isolation confirmed by a mocked test
144
+ (a throttled row's backoff does not stall healthy rows)
145
+ - [x] Graceful rate-limit handling + partial results — adapter detects a genuine
146
+ exhaustion (primary `RateLimitEvent.status=="rejected"` or
147
+ `ResultMessage.api_error_status==429`; `allowed`/`allowed_warning` non-
148
+ blocking) and surfaces `rate-limited: … (resets at epoch N)`; classify()
149
+ backs off on a budget separate from json_retries, **fails fast when the
150
+ reset is beyond the backoff budget** (five_hour caps), never raises.
151
+ - [x] FALSE-POSITIVE BUG found + fixed 2026-07-03: detection had treated a
152
+ `rejected` *overage* bucket as a limit, but org-disabled overage
153
+ (`overage_disabled_reason=="org_level_disabled"`) is a billing config, not
154
+ a block — the call succeeds. This failed EVERY call on institutional
155
+ accounts. Now keys only on primary `status`, and a successful answer wins
156
+ over an informational limit event. Live-confirmed: 3-row classify 3/3
157
+ success, correct matrix, 4.3s (~1.4s/row). 39 mocked tests green.
158
+ - [x] `benchmarks/bench_classify.py` (synthetic data) + `benchmarks/RESULTS.md`
159
+ - [ ] Clean throughput sweep (N=50 haiku, workers∈{1,4,8}) — not yet run (held
160
+ off to spare the maintainer's actively-used window; best on a fresh one):
161
+ `python benchmarks/bench_classify.py --n 50 --write-results`. Phase-1
162
+ reference stands: ~1.4–1.5s/row (workers=3) vs ~33s/row shim.
+
+ ### Phase 3 — structured output (if Phase 0 says it's real)
+ - [ ] Schema-enforced JSON (native or in-process tool trick)
+ - [ ] Fall back to prompt-JSON when unsupported
+
+ ### Phase 4 — engine + ecosystem integration
+
+ *Two-level integration (decided 2026-07-03): DISPATCH lives in cat-stack
+ (the engine is the only layer that sees `model_source`, and the domain
+ packages / R / Stata all call catstack directly — routing anywhere higher
+ would exclude them); DISTRIBUTION lives in cat-llm (the meta-package bundles
+ cat-agent for users, same as it bundles the domain packages cat-stack
+ doesn't depend on). cat-stack never hard-depends on cat-agent.*
+
+ - [ ] cat-stack: `model_source="claude-agent"` lazy-import branch + `[agent]` extra
+ - [ ] cat-llm (meta): add `cat-agent` to dependencies so `pip install cat-llm`
+ includes the agent backend
+ - [ ] Ensemble support (claude-agent as one model in a panel)
+ - [ ] explore/extract/summarize passthroughs
+ - [ ] R/Stata/desktop: no changes needed by design — verify
+ - [ ] Docs + methodology disclosure notes; first PyPI release (flip repo public)
+
+ ### Phase 5 — Codex adapter
+ - [ ] Phase-0-style spike on `codex exec` (auth, model selection, JSON
+ output, sandbox flags, startup cost, context isolation)
+ - [ ] `_adapters/codex.py` implementing the same AgentAdapter contract
+ - [ ] `model_source="codex-agent"` in cat-stack; extras split
+ (`cat-agent[claude]` / `cat-agent[codex]`)
+ - [ ] Cross-agent parity test: same rows, same frozen prompt, Claude vs
+ Codex adapters — document divergence for methodology disclosure
@@ -0,0 +1,62 @@
+ Metadata-Version: 2.4
+ Name: cat-claws
+ Version: 0.1.0
+ Summary: Claude Agent SDK backend for the CatLLM ecosystem — classify text through a Claude subscription instead of per-token API billing.
+ Project-URL: Source, https://github.com/chrissoria/cat-agent
+ Author-email: Chris Soria <chrissoria@berkeley.edu>
+ License-Expression: GPL-3.0-or-later
+ Keywords: agent-sdk,classification,claude,llm,survey
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Programming Language :: Python
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Requires-Python: >=3.10
+ Requires-Dist: cat-stack>=2.0.1
+ Requires-Dist: claude-agent-sdk>=0.1.0
+ Requires-Dist: pandas
+ Description-Content-Type: text/markdown
+
+ # cat-claws
+
+ Agent-CLI backend for the [CatLLM ecosystem](https://github.com/chrissoria/cat-llm):
+ classify text through a **Claude subscription** (via the Claude Agent SDK)
+ instead of per-token API billing. An OpenAI Codex adapter is planned.
+
+ *(Distribution name `cat-claws`; imports as `catclaws`. Source repo:
+ [cat-agent](https://github.com/chrissoria/cat-agent).)*
+
+ **Status: alpha, under active development.** See `MASTERPLAN.md` for the
+ design and step tracker.
+
+ ## Install
+
+ ```bash
+ pip install cat-claws
+ ```
+
+ ## Design in one paragraph
+
+ One row = one sealed, fresh-context agent call (no tools, single turn, no
+ settings/CLAUDE.md loading), using cat-stack's validated classification
+ prompt byte-for-byte. The model answers in JSON; parsing and the wide 0/1
+ output matrix reuse cat-stack's existing machinery. Throughput comes from
+ concurrent one-shot calls, never from shared conversations or
+ corpus-in-one-prompt (which would contaminate rows and break research
+ validity).
+
+ ## Quick start (Phase 1)
+
+ ```python
+ import catclaws
+
+ df = catclaws.classify(
+     input_data=["I moved for a new job", "Rent got too expensive"],
+     categories=["Employment", "Cost of living", "Other"],
+     user_model="claude-sonnet-5",  # any model your Claude login can use
+     description="Why did you move?",
+ )
+ ```
+
+ Requires [Claude Code](https://code.claude.com/docs) installed and logged in
+ (`claude` on PATH). No API key needed.
@@ -0,0 +1,43 @@
+ # cat-claws
+
+ Agent-CLI backend for the [CatLLM ecosystem](https://github.com/chrissoria/cat-llm):
+ classify text through a **Claude subscription** (via the Claude Agent SDK)
+ instead of per-token API billing. An OpenAI Codex adapter is planned.
+
+ *(Distribution name `cat-claws`; imports as `catclaws`. Source repo:
+ [cat-agent](https://github.com/chrissoria/cat-agent).)*
+
+ **Status: alpha, under active development.** See `MASTERPLAN.md` for the
+ design and step tracker.
+
+ ## Install
+
+ ```bash
+ pip install cat-claws
+ ```
+
+ ## Design in one paragraph
+
+ One row = one sealed, fresh-context agent call (no tools, single turn, no
+ settings/CLAUDE.md loading), using cat-stack's validated classification
+ prompt byte-for-byte. The model answers in JSON; parsing and the wide 0/1
+ output matrix reuse cat-stack's existing machinery. Throughput comes from
+ concurrent one-shot calls, never from shared conversations or
+ corpus-in-one-prompt (which would contaminate rows and break research
+ validity).
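
The "concurrent one-shot calls" idea from the design paragraph can be sketched as a semaphore-bounded fan-out. This is an illustration under stated assumptions, not the package's implementation: `sealed_call` is a placeholder that stands in for a real agent call and simply echoes its prompt, and `run_rows`, `template`, and `max_workers` are hypothetical names.

```python
import asyncio


async def sealed_call(prompt: str) -> str:
    """Placeholder for one fresh-context agent call; echoes the prompt it saw."""
    await asyncio.sleep(0.01)  # simulate latency
    return prompt


async def run_rows(rows: list[str], template: str, max_workers: int = 3) -> list[str]:
    """Run one isolated call per row, at most `max_workers` at a time."""
    sem = asyncio.Semaphore(max_workers)

    async def one(row: str) -> str:
        async with sem:
            # Each call receives only the frozen template plus its own row,
            # so no row can see another row's text.
            return await sealed_call(template.replace("{text}", row))

    # gather() returns results in input order regardless of completion order.
    return list(await asyncio.gather(*(one(r) for r in rows)))


if __name__ == "__main__":
    rows = ["I moved for a new job", "Rent got too expensive"]
    print(asyncio.run(run_rows(rows, "Classify: {text}")))
```

Because every prompt is built from the same template and exactly one row, the only thing shared between calls is the concurrency limit.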
+
+ ## Quick start (Phase 1)
+
+ ```python
+ import catclaws
+
+ df = catclaws.classify(
+     input_data=["I moved for a new job", "Rent got too expensive"],
+     categories=["Employment", "Cost of living", "Other"],
+     user_model="claude-sonnet-5",  # any model your Claude login can use
+     description="Why did you move?",
+ )
+ ```
+
+ Requires [Claude Code](https://code.claude.com/docs) installed and logged in
+ (`claude` on PATH). No API key needed.