PyPI - cat-claws - Versions diffs - 0.1.0__tar.gz - Mend

cat-claws 0.1.0__tar.gz

Files changed (18)

cat_claws-0.1.0/.gitignore +5 -0
cat_claws-0.1.0/CHANGELOG.md +32 -0
cat_claws-0.1.0/IMPLEMENTATION_GUIDE.md +315 -0
cat_claws-0.1.0/MASTERPLAN.md +192 -0
cat_claws-0.1.0/PKG-INFO +62 -0
cat_claws-0.1.0/README.md +43 -0
cat_claws-0.1.0/benchmarks/RESULTS.md +75 -0
cat_claws-0.1.0/benchmarks/bench_classify.py +124 -0
cat_claws-0.1.0/pyproject.toml +36 -0
cat_claws-0.1.0/src/catclaws/__about__.py +6 -0
cat_claws-0.1.0/src/catclaws/__init__.py +13 -0
cat_claws-0.1.0/src/catclaws/_adapters/__init__.py +16 -0
cat_claws-0.1.0/src/catclaws/_adapters/base.py +70 -0
cat_claws-0.1.0/src/catclaws/_adapters/claude.py +186 -0
cat_claws-0.1.0/src/catclaws/_backend.py +31 -0
cat_claws-0.1.0/src/catclaws/classify.py +186 -0
cat_claws-0.1.0/tests/test_classify.py +241 -0
cat_claws-0.1.0/tests/test_rate_limit.py +217 -0

cat_claws-0.1.0/.gitignore ADDED

@@ -0,0 +1,5 @@
+__pycache__/
+*.pyc
+dist/
+*.egg-info/
+.env

cat_claws-0.1.0/CHANGELOG.md ADDED

@@ -0,0 +1,32 @@
+# Changelog
+All notable changes to cat-claws will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [Unreleased]
+## [0.1.0] - 2026-07-03
+- Initial scaffold: MASTERPLAN.md (design + step tracker), package skeleton.
+- classify() v0: one-row-sealed calls, frozen prompt, bounded concurrency,
+  JSON re-asks, wide 0/1 DataFrame output.
+- Phase 2 — rate-limit handling: the Claude adapter detects a genuinely
+  exhausted usage window (a `RateLimitEvent` whose primary `status ==
+  "rejected"`, or `ResultMessage.api_error_status == 429`; `allowed`/
+  `allowed_warning` are non-blocking) and surfaces it with a `"rate-limited: "`
+  error prefix, including the reset time when known. It deliberately IGNORES
+  `overage_status` — a `rejected` overage bucket with
+  `overage_disabled_reason == "org_level_disabled"` just means the org turned
+  off spillover billing while the request still succeeds; treating it as a
+  limit falsely failed every call on such accounts (common in institutional
+  subscriptions). A successful answer always wins over an informational limit
+  event. classify() gains `rate_limit_retries` (default 2): on a genuine limit a
+  row backs off exponentially (30s, 60s…) on a budget separate from
+  `json_retries`, other in-flight rows unaffected, and never raises. Reset-aware:
+  when the reset is farther out than the backoff budget can bridge (e.g. a
+  `five_hour` cap hours away) the row fails fast with the resumable message
+  instead of sleeping through futile retries; near/unknown resets still back off.
+- Phase 2 — `benchmarks/bench_classify.py`: synthetic-data throughput
+  benchmark across max_workers settings; results in `benchmarks/RESULTS.md`.
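
The detection rule described in the 0.1.0 changelog entry can be sketched as below. This is a minimal illustration under stated assumptions, not the package's adapter code: `FakeLimitInfo`, its `resets_at` field, and `rate_limit_error` are hypothetical names standing in for the SDK's rate-limit payload. The sketch keys only on the primary `status` (or a 429), never reads `overage_status`, and lets a returned answer win over any limit event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeLimitInfo:
    """Hypothetical stand-in for the SDK's rate-limit payload (not the real class)."""
    status: str                                    # primary bucket: "allowed", "allowed_warning", "rejected"
    overage_status: Optional[str] = None           # present in the payload, deliberately never consulted
    overage_disabled_reason: Optional[str] = None  # e.g. "org_level_disabled"
    resets_at: Optional[int] = None                # assumed field name: reset time in epoch seconds


def rate_limit_error(answer, limit_info=None, api_error_status=None):
    """Return a "rate-limited: ..." error string for a genuine block, else None."""
    # A successful answer always wins over an informational limit event.
    if answer is not None:
        return None
    # Only the primary status (or an HTTP 429) counts as a block.
    rejected = limit_info is not None and limit_info.status == "rejected"
    if not rejected and api_error_status != 429:
        return None
    message = "rate-limited: usage window exhausted"
    if limit_info is not None and limit_info.resets_at is not None:
        message += f" (resets at epoch {limit_info.resets_at})"
    return message


# Org-disabled overage: overage bucket is "rejected" but the primary status is "allowed".
org_account = FakeLimitInfo(
    status="allowed",
    overage_status="rejected",
    overage_disabled_reason="org_level_disabled",
)
print(rate_limit_error(None, org_account))                               # not treated as a limit
print(rate_limit_error(None, FakeLimitInfo("rejected", resets_at=1234)))  # genuine block
```

Keeping the check as a pure function of plain values is what makes the structural unit test described in the implementation guide possible without triggering a real limit.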

cat_claws-0.1.0/IMPLEMENTATION_GUIDE.md ADDED

@@ -0,0 +1,315 @@
+# cat-agent — Implementation Guide (handoff document)
+*Written 2026-07-03 for whoever (human or model) continues this project.
+Read MASTERPLAN.md first for the why; this file is the how. If anything here
+contradicts the code, trust the code and update this file.*
+## 0. Ground rules — read before touching anything
+1. **One row = one sealed, fresh-context agent call.** Never reuse a
+   conversation across rows. Never put multiple rows in one prompt. This is
+   a research-validity requirement the maintainer (Chris) set explicitly.
+   If a change would violate it, stop and ask.
+2. **The prompt is frozen.** Row prompts must be byte-identical to what
+   `catstack.text_functions_ensemble.build_text_classification_prompt`
+   produces. `tests/test_classify.py::TestPromptParity` enforces this — if
+   that test fails, YOUR change is wrong, not the test. Never "improve" the
+   prompt wording.
+3. **Output schema is fixed**: DataFrame with `input_data`,
+   `processing_status` ("success" or "error: ..."), and `category_N` 0/1
+   columns (None on error rows). Same as `catstack.classify()`.
+4. **One bad row never aborts a batch.** Errors are recorded per row.
+5. **cat-stack never hard-depends on cat-agent.** Dispatch in cat-stack is a
+   lazy import; distribution happens via the cat-llm meta-package.
+6. Workflow conventions (from the maintainer's standing preferences):
+   accumulate changes in CHANGELOG.md `[Unreleased]`; do NOT bump the
+   version per change — one bump per release batch. Every substantive fix
+   gets a live smoke test against the real agent, not just mocks. Prefer
+   stdlib over new dependencies. For provider/SDK behavior claims: probe
+   empirically before coding against them.
+## 1. What exists and works (verified live 2026-07-03)
+- Environment: macOS, anaconda python (`/Users/chrissoria/anaconda3/bin/python3`).
+  `catagent` and `catstack` are installed editable in it. Claude Code CLI
+  2.1.197 is installed and logged in (subscription auth — no API key needed
+  for agent calls). `claude-agent-sdk` 0.2.110 is installed.
+- `catagent.classify()` works end to end: 3 rows on `claude-sonnet-5`,
+  `max_workers=3`, 4.4s total, correct matrix.
+- Mocked tests: `cd ~/Documents/Research/cat-agent && python -m pytest tests/ -q`
+  → 5 passing. Run them after every change.
+- Live smoke (costs ~nothing on subscription, takes ~10s):
+  ```bash
+  python3 -c "
+  import catagent
+  df = catagent.classify(
+      input_data=['I moved for a new job', 'Rent got too expensive', 'Closer to my parents'],
+      categories=['Employment', 'Cost of living', 'Family', 'Other'],
+      user_model='claude-sonnet-5', description='Why did you move?', max_workers=3)
+  print(df.to_string()); assert (df.processing_status == 'success').all()"
+  ```
+## 2. Verified facts — do NOT re-derive, do NOT trust your training data
+These were established by live probes on 2026-07-03 (sdk 0.2.110, CLI
+2.1.197). Training-data knowledge of this SDK is likely stale.
+| Fact | Consequence |
+|---|---|
+| Two `query()` one-shots share NO context | fresh-context-per-row design is sound |
+| Process overhead ≈ 1.9s per one-shot; 4-way concurrency near-linear | throughput = concurrency; don't chase warm-process reuse |
+| `ClaudeAgentOptions(output_format={"type":"json","schema":...})` is **silently ignored** — answer arrives as markdown text, no structured field | Phase 1 parses prompt-JSON via `catstack.extract_json`; re-probe `output_format` on every SDK upgrade before building Phase 3 |
+| The agent **thinks by default** (ThinkingBlock observed on haiku) | adapter passes `thinking=ThinkingConfigDisabled(type="disabled")` at `thinking_budget=0` (engine parity) and `effort=<graded>` above 0 |
+| `ThinkingConfigDisabled` is a TypedDict | `ThinkingConfigDisabled(type="disabled")` constructs a plain dict — fine |
+| `catstack._utils.validate_classification_json(json_str, n)` returns a `(bool, dict)` TUPLE and the dict values are STRINGS "1"/"0" | unpack both; compare `str(v) == "1"` |
+| `catstack.extract_json(reply)` returns a JSON *string* (handles fences/preambles) | feed its output to validate, don't json.loads the raw reply |
+| `system_prompt` option REPLACES Claude Code's default agent persona | our `_SYSTEM_PROMPT` in classify.py is transport scaffolding, not part of the instrument |
+| `setting_sources=[]` prevents CLAUDE.md/user-settings injection | never remove it — without it, running from inside a repo contaminates classifications |
+| `RateLimitEvent` fires on SUCCESSFUL calls too (informational). Only primary `RateLimitInfo.status=='rejected'` blocks. `overage_status=='rejected'` + `overage_disabled_reason=='org_level_disabled'` = overage billing off (org config), NOT a block — call succeeds | detect rate limits on primary `status` only; a returned answer wins over any limit event. Verified 2026-07-03 against a live 74%-used, org-overage-disabled account |
+## 3. Code map
+```
+src/catagent/
+  __about__.py        version 0.0.1 — single source of truth (hatch reads it)
+  __init__.py         exports classify
+  _adapters/base.py   AgentAdapter.one_shot(prompt, system_prompt, model,
+                      thinking_budget) -> (text|None, error|None)
+  _adapters/claude.py ClaudeAdapter — sealed ClaudeAgentOptions; thinking
+                      parity; CLINotFoundError -> friendly install message;
+                      falls back to agent-default thinking if the explicit
+                      disable errors
+  _adapters/__init__.py  ADAPTERS registry + get_adapter(name)
+  _backend.py         gather_bounded(coro_fns, max_workers) — sync->async
+                      seam via asyncio.run; captures per-task exceptions
+  classify.py         classify(input_data, categories, user_model, agent,
+                      description, multi_label, thinking_budget,
+                      max_workers, json_retries) -> DataFrame
+tests/test_classify.py  FakeAdapter pattern for mocked tests — copy it for
+                        new tests; TestPromptParity is the canary
+```
+Repo: github.com/chrissoria/cat-agent (PRIVATE until first PyPI release).
+Commit style: imperative subject, body explains why, footer:
+`Co-Authored-By: Claude ...` (see git log for examples). Push after commit.
+## 3b. Standing sanity checks — bracket EVERY work session with these
+Run before starting (baseline must be green — if it isn't, fix that first,
+don't build on a broken base) and after every substantive change:
+```bash
+cd ~/Documents/Research/cat-agent
+python -m pytest tests/ -q                  # ALL green, incl. TestPromptParity
+python -c "import catagent; print(catagent.__version__)"   # imports clean
+git status --short                          # only files YOU meant to touch
+```
+And once per session, the 3-row live smoke from §1 (~10s, subscription-only
+cost). PASS = 3× "success" + correct 1s on the diagonal.
+**Direction check (ask before each step, honestly):** does what I'm about to
+do (a) keep one-row-sealed-calls, (b) leave the prompt byte-identical,
+(c) keep the output schema unchanged, (d) avoid new hard dependencies? If
+any answer is no → stop and surface it to the maintainer instead of coding.
+## 4. Phase 2 — benchmarks + rate-limit handling — DONE 2026-07-03
+*(Throughput sweep deferred: the subscription five_hour window was exhausted
+during development; re-run `python benchmarks/bench_classify.py --n 50
+--write-results` after a reset. Everything else landed and is live-verified.)*
+**Two live findings that changed the plan:**
+1. **Overage false positive (bug, fixed).** Detection first keyed on *any*
+   `RateLimitInfo` with a `rejected` bucket, including `overage_status`. But
+   org-disabled overage (`overage_disabled_reason=="org_level_disabled"`,
+   `isUsingOverage=False`) reports `overage_status=='rejected'` while the
+   primary `status=='allowed'` and **the call succeeds**. The old code aborted
+   every successful call as rate-limited — breaking cat-agent on every call for
+   institutional accounts. Fix: detect only on primary `status=='rejected'`, and
+   a returned answer always wins over an informational limit event. Regression
+   tests in `tests/test_rate_limit.py`. Confirm any detection change against the
+   RAW payload (`RateLimitInfo.raw`), never the parsed summary.
+2. **Window is five_hour, not minutes.** The reset on a genuine exhaustion is
+   hours out, so classify() **fails fast** when the reset (parsed via
+   `base.parse_reset_epoch`) is beyond its backoff budget, backing off only for
+   near/unknown resets. This supersedes the original item 3 "always retry up to
+   2 times" below.
+Goal: know how this behaves at realistic N and fail gracefully at limits.
+1. **Benchmark script** (`benchmarks/bench_classify.py`, committed): classify
+   N=50 short rows (generate synthetic one-liners; do NOT use real study
+   data) on `claude-haiku-4-5` at max_workers ∈ {1, 4, 8}. Record wall time,
+   rows/s, error count. Write results into the script's docstring or a
+   `benchmarks/RESULTS.md` with date + CLI/SDK versions.
+   ✓ **Sanity before the 50-row run:** run the script with N=4 first. PASS:
+   4/4 success, wall time ≈ 5–15s at workers=4, and per-worker scaling
+   visible (workers=1 clearly slower than workers=4). FAIL (e.g. workers=4
+   not faster, or errors): stop — something regressed in `_backend.py`
+   concurrency; do not burn a 50-row run to find out.
+   ✓ **Sanity after:** rows/s at workers=8 should be ≥ workers=4's. If it's
+   *worse*, you're likely hitting throttling — that's a finding, record it,
+   don't "fix" the code.
+2. **Rate-limit surfacing.** During Phase-0 probes a `RateLimitEvent`
+   message type was observed in the stream (untriggered). In
+   `_adapters/claude.py`, detect rate-limit conditions (probe: what does the
+   SDK emit when throttled? check message types + ResultMessage fields) and
+   return a distinguishable error string prefix, e.g.
+   `"rate-limited: ..."` so classify() can react.
+   ✓ **Sanity:** you cannot reliably trigger a real rate limit on demand —
+   so the check is structural: unit-test the detection function against a
+   synthetic RateLimitEvent/ResultMessage object. If you find yourself
+   hammering the live API trying to trigger a real 429, stop — that's the
+   wrong direction (and abuses the subscription).
+3. **Backoff in classify()**: on a rate-limited row, sleep (exponential,
+   start 30s — subscription windows are minutes-scale, not seconds) and
+   retry up to 2 times BEFORE consuming json_retries. Keep per-row
+   isolation: other in-flight rows continue.
+   ✓ **Sanity:** mocked test with a fake clock / patched `asyncio.sleep` —
+   the test suite must still finish in seconds. If tests now take minutes,
+   you forgot to patch the sleep. Also: non-rate-limited rows in the same
+   batch must complete WITHOUT waiting on the throttled row (assert via
+   call-order in the fake adapter).
+4. **Partial-results guarantee test**: mocked test where the adapter
+   rate-limits every call — classify() must return a full DataFrame with
+   `error: rate-limited...` statuses, never raise.
+   ✓ **Sanity:** `len(df) == len(input_data)` exactly, and
+   `TestPromptParity` still green (backoff logic must not have touched
+   prompt construction).
+5. Acceptance: mocked tests green; benchmark table committed; a live 50-row
+   haiku run completes with 0 errors (or documented rate-limit behavior).
+   ✓ **Phase-2 exit sanity:** re-run the §1 3-row sonnet-5 smoke one last
+   time; diff `benchmarks/RESULTS.md` numbers against the Phase-1 baseline
+   (1.5s/row effective at workers=3). Materially slower → find out why
+   before checking the phase off.
+## 5. Phase 3 — structured output (blocked until SDK supports it)
+Re-probe on each SDK upgrade (takes 1 minute):
+```bash
+python3 /path/to/probe: send output_format={"type":"json","schema":{...}} and
+inspect messages — see scratchpad/probe_agent_sdk.py pattern in git history
+of this guide, or rewrite: if ResultMessage gains a structured field or the
+text becomes bare JSON, it works.
+```
+If supported: add `output_format` to ClaudeAdapter behind a feature check,
+keep extract_json as fallback. If not: leave Phase 3 alone.
+✓ **Sanity (gate for even starting Phase 3):** the probe must show a
+*machine-parseable* result — bare JSON text or a populated structured field.
+"Markdown that mentions the right numbers" (what 0.2.110 produces) is a
+FAIL; do not write any Phase-3 code against it. If implemented: run the same
+3-row live smoke twice, once with structured output and once with the
+extract_json fallback forced — both matrices must be identical.
+## 6. Phase 4 — cat-stack + cat-llm integration
+Dispatch lives in cat-stack; distribution in cat-llm (decision recorded in
+MASTERPLAN). Mirror the existing `claude-code` branches exactly — anchors in
+`cat-stack/src/catstack/_providers.py` (line numbers as of 2.0.1+):
+- `PROVIDER_CONFIG["claude-code"]` (~line 719): add a sibling
+  `"claude-agent"` entry (`endpoint: None`).
+- `detect_provider` (~line 1742): add `model_source == "claude-agent"`.
+- `complete()` dispatch (~line 1249): before payload build, add:
+  ```python
+  if self.provider == "claude-agent":
+      try:
+          from catagent._adapters import get_adapter
+          from catagent._backend import gather_bounded
+      except ImportError:
+          return None, ("cat-agent is not installed. "
+                        "Run: pip install cat-stack[agent]")
+      ...  # build system/user text from messages (see _call_claude_cli
+           # for the message-flattening pattern), run one sealed call
+  ```
+  NOTE: complete() is sync and called from worker threads; call the adapter
+  via `asyncio.run` per call (gather_bounded pattern) — do NOT create a
+  module-global event loop.
+- `text_functions_ensemble.py` ~line 653: the `claude-code` validation
+  branch (CLI availability check, api_key not required) — add
+  `claude-agent` alongside it, checking catagent importability instead.
+- pyproject: `[project.optional-dependencies] agent = ["cat-agent>=0.1.0"]`.
+- cat-llm meta pyproject: add `cat-agent>=0.1.0` to `dependencies`.
+- Tests: mocked test in cat-stack (`tests/test_claude_agent_dispatch.py`)
+  patching catagent; live test: `catstack.classify(model_source="claude-agent")`
+  3 rows. Ensemble test: one API model + claude-agent in a panel.
+- Ecosystem rules: cat-stack release = CHANGELOG entry + version bump at
+  batch end (see cat-stack/CLAUDE.md); cat-agent needs a PyPI release FIRST
+  (flip repo public, `python -m build`, twine with PYPI_API_TOKEN from
+  cat-stack/.env, TWINE_USERNAME=__token__).
+✓ **Per-step sanity for Phase 4 (cat-stack is production code used by 6+
+downstream packages — check after EVERY edit there, not at the end):**
+1. After each cat-stack edit:
+   `cd ~/Documents/Research/cat-stack && python -m pytest tests/ -q` —
+   expected: everything green except the known pre-existing failure in
+   `test_chat_template_kwargs_strip.py::test_warning_printed_only_once`
+   (untracked WIP test, fails on clean HEAD too — NOT yours to fix).
+   Any OTHER failure = your change broke the engine; revert and rethink.
+2. The no-install path must degrade politely BEFORE testing the happy path:
+   temporarily `pip uninstall -y cat-agent`, run
+   `catstack.classify(model_source="claude-agent", ...)` → must return
+   error rows mentioning `pip install cat-stack[agent]`, never a raw
+   ImportError traceback. Reinstall (`pip install -e ~/Documents/Research/cat-agent`)
+   and confirm the same call succeeds.
+3. Regression canary: run one classify on a NORMAL provider
+   (`model_source="anthropic"`, sonnet-5, creativity=0.3, 1 row) after the
+   dispatch edits — the claude-agent branch must not have disturbed API
+   routing.
+4. Ensemble sanity: panel of claude-agent + one API model must produce
+   consensus columns and per-model columns with no schema drift
+   (compare `df.columns` against an API-only ensemble run).
+5. cat-llm meta edit: `pip download cat-llm --no-deps -d /tmp/x` is NOT the
+   check — the check is reading the diff: exactly one line added to
+   `dependencies` in cat-llm/pyproject.toml. Meta-package mistakes ship to
+   every user; keep the diff minimal and reviewed.
+## 7. Phase 5 — Codex adapter (later)
+Spike first, code second (`codex exec` non-interactive mode: auth story,
+model flag, JSON/event output, sandbox flags, startup cost, context
+isolation — same probe checklist as Phase 0). Implement
+`_adapters/codex.py` against `AgentAdapter`; register in `ADAPTERS`;
+`model_source="codex-agent"` in cat-stack; split extras
+(`cat-agent[claude]` / `cat-agent[codex]`) so neither SDK is forced on
+users of the other. Cross-agent parity run for methodology disclosure.
+✓ **Sanity gates:** (1) Do not write `codex.py` until the spike proves
+context isolation between `codex exec` calls — that probe result decides
+whether the design is even possible, same as Phase 0 did for Claude.
+(2) The Codex adapter must pass the SAME mocked test suite: parameterize
+`tests/test_classify.py` over adapters rather than duplicating tests — if
+you're copy-pasting the test file, wrong direction. (3) After the extras
+split, `pip install cat-agent[claude]` in a fresh venv must work WITHOUT
+any codex packages present (and vice versa) — import-time cross-adapter
+leakage means `_adapters/__init__.py` needs lazier imports.
+## 8. Traps encountered (so you don't repeat them)
+- zsh: `echo ===` and unquoted `=` in commands expand weirdly; quote them.
+- The experiments `.env` values are quoted; `export $(grep ...)` leaks the
+  quotes. Use python-dotenv to read keys when shelling out.
+- pandas/bottleneck UserWarning noise on every python start — filter with
+  `grep -v pandas`, it is not an error.
+- Live API keys live in
+  `/Users/chrissoria/Documents/Research/Categorization_AI_experiments/.env`
+  (ANTHROPIC_API_KEY, GOOGLE_API_KEY, ...). Agent calls need NO key.
+- Do not edit `cat-stack/src/catstack/collapse_themes.py` — it carries the
+  maintainer's uncommitted WIP. If you must build cat-stack dists, stash it
+  first and pop after (see cat-stack releases in git history).
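
The reset-aware backoff from section 4 of the guide (exponential from 30s, two retries by default, fail fast when the reset is out of reach) can be sketched as a small planning function. This is a hedged illustration only: `backoff_delays` and `plan_backoff` are hypothetical names, and the exact comparison used to decide "beyond the backoff budget" is an assumption, here taken to be the sum of all planned sleeps.

```python
def backoff_delays(retries=2, base=30.0):
    """Exponential schedule: 30s, 60s, ... for the given number of retries."""
    return [base * 2 ** attempt for attempt in range(retries)]


def plan_backoff(now, reset_epoch=None, retries=2, base=30.0):
    """Return the sleeps to attempt, or an empty list to fail fast.

    Assumption: the "budget" is the total of all planned sleeps.
    """
    delays = backoff_delays(retries, base)
    if reset_epoch is None:
        # Unknown reset time: still worth backing off.
        return delays
    if reset_epoch - now > sum(delays):
        # Reset is farther out than the retries can bridge (e.g. a five_hour cap).
        return []
    return delays


now = 1_000.0
print(plan_backoff(now))                            # unknown reset: back off
print(plan_backoff(now, reset_epoch=now + 60))      # near reset: back off
print(plan_backoff(now, reset_epoch=now + 5 * 3600))  # hours away: fail fast
```

Separating the decision from the sleeping keeps the logic testable without a patched clock; the caller would then await each delay with `asyncio.sleep` for that row only.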

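The code map describes `_backend.gather_bounded(coro_fns, max_workers)` as a sync-to-async seam via `asyncio.run` that captures per-task exceptions. A minimal sketch of that pattern follows. It is not the package's implementation: returning the exception object in place of a result is an assumption about how failures are captured, and `ok_row` / `bad_row` are hypothetical stand-ins for per-row agent calls.

```python
import asyncio


def gather_bounded(coro_fns, max_workers=4):
    """Run zero-argument coroutine functions with at most max_workers in flight.

    Results come back in input order; a failing task yields its exception
    object instead of aborting the batch (assumed convention).
    """
    async def _run_all():
        # Create the semaphore inside the running loop.
        semaphore = asyncio.Semaphore(max_workers)

        async def _guarded(fn):
            async with semaphore:
                try:
                    return await fn()
                except Exception as exc:  # one bad row never aborts the batch
                    return exc

        return await asyncio.gather(*(_guarded(fn) for fn in coro_fns))

    # Sync entry point: a fresh event loop per call, no module-global loop.
    return asyncio.run(_run_all())


async def ok_row():
    await asyncio.sleep(0)
    return "ok"


async def bad_row():
    raise ValueError("bad row")


results = gather_bounded([ok_row, bad_row, ok_row], max_workers=2)
print([r if isinstance(r, str) else type(r).__name__ for r in results])
```

Because each call goes through `asyncio.run`, the function can be invoked from synchronous code and from worker threads, which matches the note in section 6 about not creating a module-global event loop.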
cat_claws-0.1.0/MASTERPLAN.md ADDED

@@ -0,0 +1,192 @@
+# cat-agent — Claude Agent SDK backend for the CatLLM ecosystem
+*Drafted 2026-07-03. Naming is provisional (`cat-agent` / import `catagent`);
+renaming is cheap until first PyPI release.*
+> **Continuing this project? Read `IMPLEMENTATION_GUIDE.md` next** — it holds
+> the verified facts (don't re-derive them), the traps already hit, and
+> step-by-step instructions with acceptance criteria for every remaining
+> phase. This file is the why; that file is the how.
+## Why this package exists
+cat-stack already has a `claude-code` provider: a ~100-line subprocess shim
+around `claude -p`. It works, but it is under-engineered for research use:
+1. **Cost/access** — the real prize. Rows classified through the user's
+   Claude subscription instead of per-token API billing. For researchers and
+   students without API budgets, "install Claude Code, log in, classify your
+   survey" is a different accessibility story. The shim technically does
+   this; the SDK makes it robust enough to recommend.
+2. **Throughput** — the shim is sequential with full CLI startup per row
+   (~33s/row measured on claude-haiku-4-5, 2026-07-03). The SDK is
+   async-native: concurrent one-shot queries are the honest performance fix.
+3. **Reliability** — the shim scrapes stdout; the SDK yields typed message
+   objects, so "the assistant's final text" is extracted reliably.
+4. **Isolation** — `claude -p` loads project settings and CLAUDE.md by
+   default: running classify() from inside a repo can silently inject that
+   repo's instructions into every classification. The SDK exposes explicit
+   controls (`setting_sources`, custom `system_prompt`, `allowed_tools`).
+## Design constraints (non-negotiable)
+- **One row = one call = one fresh context.** Never a persistent
+  conversation across rows (cross-row contamination breaks research
+  validity). Never corpus-in-one-prompt. Throughput comes from concurrency,
+  not context reuse.
+- **The frozen prompt.** Prompts come from cat-stack's validated
+  `build_text_classification_prompt` — byte-identical to the API path. This
+  package is a transport, not a new instrument.
+- **Sealed sessions.** `allowed_tools=[]`, single turn, no settings/CLAUDE.md
+  loading, custom system prompt only. Classification must not touch the
+  filesystem or improvise.
+- **Same output contract.** The model answers in JSON (prompt-requested, as
+  today); parsing goes through cat-stack's `extract_json` +
+  `validate_classification_json`; output is the standard wide 0/1 DataFrame
+  (`input_data`, `processing_status`, `category_N` columns). Everything
+  downstream (ensembles, R, Stata, desktop) must be able to adopt this
+  backend without schema changes.
+- **The subprocess shim stays** in cat-stack as the zero-dependency fallback.
+  `model_source="claude-code"` keeps meaning shim; this package introduces
+  `model_source="claude-agent"`.
+- **Dependency discipline.** This package depends on `claude-agent-sdk` and
+  `cat-stack`. cat-stack never depends on this package — it lazy-imports it
+  behind the `claude-agent` model_source (the `[formatter]`-extra pattern),
+  erroring with `pip install cat-stack[agent]` guidance when absent.
+## Architecture
+Multi-agent by design: Claude (via `claude-agent-sdk`) is the first adapter;
+OpenAI Codex is a planned second (its `codex exec` non-interactive mode fits
+the same one-shot contract). The seam between "the classification pipeline"
+and "which agent CLI answers one prompt" is therefore an explicit adapter
+interface from day one — everything above the adapter is agent-agnostic.
+```
+cat-stack classify(model_source="claude-agent" | "codex-agent" …)
+   └─ lazy import catagent  → backend satisfies the same (text, error)
+                              contract complete() returns
+catagent
+   ├─ _adapters/
+   │    base.py     AgentAdapter: one_shot(prompt, system_prompt, model,
+   │                opts) -> (text, error). Sealed-session semantics are part
+   │                of the contract (no tools, fresh context, single turn).
+   │    claude.py   claude-agent-sdk implementation (Phase 1)
+   │    codex.py    codex CLI implementation (later phase)
+   ├─ _backend.py   agent-agnostic plumbing: adapter registry, dedicated
+   │                event loop, semaphore-bounded concurrency, retries
+   ├─ classify.py   standalone classify(agent="claude") (Phase 1: usable
+   │                without engine integration; later delegated to from
+   │                cat-stack)
+   └─ __about__.py  version (single source of truth, hatch)
+```
+Adapter-contract notes for Codex (recorded now, built later): `codex exec`
+supports non-interactive one-shots with JSON event output and model
+selection; auth via ChatGPT subscription login mirrors the Claude story
+(subscription-based classification). Sandbox/approval flags are the sealed-
+session equivalent. `claude-agent-sdk` stays an install extra once a second
+adapter exists (`cat-agent[claude]`, `cat-agent[codex]`) so neither CLI's SDK
+is forced on users of the other.
+## Known risks / open questions
+- **Rate limits**: subscription plans have usage caps; large jobs may hit
+  them. Degrade gracefully (clear error, partial results, resumability) —
+  never promise API-like throughput.
+- **Model parity**: CLI-served vs API-served output comparability is a
+  methodology-disclosure question for papers. Measure, document, disclose
+  (never silently swap).
+- **SDK tempo**: the Agent SDK tracks Claude Code releases; re-audit on CLI
+  major versions (same habit as the 2026-07-03 shim audit).
+- **Warm-process reuse**: can fresh-context queries share a warm process?
+  (Phase 0 answers; if not, concurrency alone is the plan.)
+## Step tracker
+### Phase 0 — empirical spike (kill-or-validate) — DONE 2026-07-03, sdk 0.2.110
+- [x] Install `claude-agent-sdk`; introspect the real API surface
+- [x] Timing: 5.6s wall for a trivial one-shot, only ~1.9s process overhead
+      (the shim's 33s/row was mostly sequential design + inference, not startup)
+- [x] Context isolation: PASS — two `query()` calls share nothing
+- [x] Sealed-session options verified (`allowed_tools=[]`, `max_turns=1`,
+      `setting_sources=[]`, custom `system_prompt`)
+- [x] Model selection verified (`claude-sonnet-5` by name, no error)
+- [x] Concurrency: 4 parallel one-shots in 3.6s total (near-linear speedup)
+- [x] Structured output: `output_format={"type":"json","schema":...}` was
+      silently IGNORED (markdown answer, no structured field) — Phase 1 uses
+      prompt-JSON + extract_json; Phase 3 re-probes future SDK versions.
+- [x] Finding: the agent enables THINKING by default (ThinkingBlock observed
+      on haiku). Engine parity requires thinking disabled at
+      thinking_budget=0 and graded `effort` above it — the adapter must set
+      this explicitly.
+### Phase 1 — classify() v0 (single function, parity with shim) — DONE 2026-07-03
+- [x] Repo skeleton: pyproject (hatch), __about__, README, CHANGELOG
+- [x] Adapter contract (`_adapters/base.py`) + Claude adapter — sealed
+      one-row call, (text, error) contract, thinking-off-by-default parity
+- [x] `classify.py` — rows → frozen prompts → one_shot → extract_json →
+      wide 0/1 DataFrame with processing_status
+- [x] JSON retry (`json_retries`, per-row isolation)
+- [x] Mocked unit tests incl. frozen-prompt byte-parity test (5 passing)
+- [x] Live smoke test: 3 rows, claude-sonnet-5, 4.4s total (1.5s/row
+      effective at max_workers=3 vs ~33s/row through the shim), matrix correct
+- [x] Commit + push — github.com/chrissoria/cat-agent (private for now;
+      flip to public at first PyPI release)
+*Note: Phase 2's core (semaphore-bounded concurrency) landed in Phase 1 via
+`_backend.gather_bounded`; Phase 2 now means benchmarks at realistic N +
+rate-limit handling.*
+### Phase 2 — concurrency + rate-limit handling — DONE 2026-07-03 (throughput sweep deferred)
+- [x] Semaphore-bounded async gather (max_workers semantics) — landed in Phase 1
+      (`_backend.gather_bounded`); per-row isolation confirmed by a mocked test
+      (a throttled row's backoff does not stall healthy rows)
+- [x] Graceful rate-limit handling + partial results — adapter detects a genuine
+      exhaustion (primary `RateLimitEvent.status=="rejected"` or
+      `ResultMessage.api_error_status==429`; `allowed`/`allowed_warning` non-
+      blocking) and surfaces `rate-limited: … (resets at epoch N)`; classify()
+      backs off on a budget separate from json_retries, **fails fast when the
+      reset is beyond the backoff budget** (five_hour caps), never raises.
+- [x] FALSE-POSITIVE BUG found + fixed 2026-07-03: detection had treated a
+      `rejected` *overage* bucket as a limit, but org-disabled overage
+      (`overage_disabled_reason=="org_level_disabled"`) is a billing config, not
+      a block — the call succeeds. This failed EVERY call on institutional
+      accounts. Now keys only on primary `status`, and a successful answer wins
+      over an informational limit event. Live-confirmed: 3-row classify 3/3
+      success, correct matrix, 4.3s (~1.4s/row). 39 mocked tests green.
+- [x] `benchmarks/bench_classify.py` (synthetic data) + `benchmarks/RESULTS.md`
+- [ ] Clean throughput sweep (N=50 haiku, workers∈{1,4,8}) — not yet run (held
+      off to spare the maintainer's actively-used window; best on a fresh one):
+      `python benchmarks/bench_classify.py --n 50 --write-results`. Phase-1
+      reference stands: ~1.4–1.5s/row (workers=3) vs ~33s/row shim.
+### Phase 3 — structured output (if Phase 0 says it's real)
+- [ ] Schema-enforced JSON (native or in-process tool trick)
+- [ ] Fall back to prompt-JSON when unsupported
+### Phase 4 — engine + ecosystem integration
+*Two-level integration (decided 2026-07-03): DISPATCH lives in cat-stack
+(the engine is the only layer that sees `model_source`, and the domain
+packages / R / Stata all call catstack directly — routing anywhere higher
+would exclude them); DISTRIBUTION lives in cat-llm (the meta-package bundles
+cat-agent for users, same as it bundles the domain packages cat-stack
+doesn't depend on). cat-stack never hard-depends on cat-agent.*
+- [ ] cat-stack: `model_source="claude-agent"` lazy-import branch + `[agent]` extra
+- [ ] cat-llm (meta): add `cat-agent` to dependencies so `pip install cat-llm`
+      includes the agent backend
+- [ ] Ensemble support (claude-agent as one model in a panel)
+- [ ] explore/extract/summarize passthroughs
+- [ ] R/Stata/desktop: no changes needed by design — verify
+- [ ] Docs + methodology disclosure notes; first PyPI release (flip repo public)
+### Phase 5 — Codex adapter
+- [ ] Phase-0-style spike on `codex exec` (auth, model selection, JSON
+      output, sandbox flags, startup cost, context isolation)
+- [ ] `_adapters/codex.py` implementing the same AgentAdapter contract
+- [ ] `model_source="codex-agent"` in cat-stack; extras split
+      (`cat-agent[claude]` / `cat-agent[codex]`)
+- [ ] Cross-agent parity test: same rows, same frozen prompt, Claude vs
+      Codex adapters — document divergence for methodology disclosure
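
Note on the Phase 2 rate-limit item in MASTERPLAN.md: the "fail fast when the reset is beyond the backoff budget" rule can be sketched roughly as below. This is only an illustration, not the package's code. The function names, the `budget_left_s` parameter, and the choice to fail fast when no reset time can be parsed are all assumptions; only the `(resets at epoch N)` message suffix comes from the plan.

```python
import re
from typing import Optional

# Matches the "(resets at epoch N)" suffix described in the plan.
_RESET_RE = re.compile(r"\(resets at epoch (\d+)\)")


def parse_reset_epoch(message: str) -> Optional[int]:
    """Extract the reset time (epoch seconds) from a rate-limit message."""
    match = _RESET_RE.search(message)
    return int(match.group(1)) if match else None


def plan_wait(message: str, now: float, budget_left_s: float) -> Optional[float]:
    """Return how many seconds to sleep before retrying, or None to fail fast.

    Hypothetical helper: `budget_left_s` is the remaining backoff budget,
    tracked separately from any JSON re-ask counter.
    """
    reset = parse_reset_epoch(message)
    if reset is None:
        return None  # assumption: unknown reset time is treated as fail-fast
    wait = max(0.0, reset - now)
    # Waiting only makes sense if the window reopens within the budget.
    return wait if wait <= budget_left_s else None
```

Under these assumptions, a caller would sleep for the returned number of seconds and retry, or record the row as failed (without raising) when the result is `None`.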

cat_claws-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,62 @@
+Metadata-Version: 2.4
+Name: cat-claws
+Version: 0.1.0
+Summary: Claude Agent SDK backend for the CatLLM ecosystem — classify text through a Claude subscription instead of per-token API billing.
+Project-URL: Source, https://github.com/chrissoria/cat-agent
+Author-email: Chris Soria <chrissoria@berkeley.edu>
+License-Expression: GPL-3.0-or-later
+Keywords: agent-sdk,classification,claude,llm,survey
+Classifier: Development Status :: 3 - Alpha
+Classifier: Programming Language :: Python
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Requires-Python: >=3.10
+Requires-Dist: cat-stack>=2.0.1
+Requires-Dist: claude-agent-sdk>=0.1.0
+Requires-Dist: pandas
+Description-Content-Type: text/markdown
+# cat-claws
+Agent-CLI backend for the [CatLLM ecosystem](https://github.com/chrissoria/cat-llm):
+classify text through a **Claude subscription** (via the Claude Agent SDK)
+instead of per-token API billing. An OpenAI Codex adapter is planned.
+*(Distribution name `cat-claws`; imports as `catclaws`. Source repo:
+[cat-agent](https://github.com/chrissoria/cat-agent).)*
+**Status: alpha, under active development.** See `MASTERPLAN.md` for the
+design and step tracker.
+## Install
+```bash
+pip install cat-claws
+```
+## Design in one paragraph
+One row = one sealed, fresh-context agent call (no tools, single turn, no
+settings/CLAUDE.md loading), using cat-stack's validated classification
+prompt byte-for-byte. The model answers in JSON; parsing and the wide 0/1
+output matrix reuse cat-stack's existing machinery. Throughput comes from
+concurrent one-shot calls, never from shared conversations or
+corpus-in-one-prompt (which would contaminate rows and break research
+validity).
+## Quick start (Phase 1)
+```python
+import catclaws
+df = catclaws.classify(
+    input_data=["I moved for a new job", "Rent got too expensive"],
+    categories=["Employment", "Cost of living", "Other"],
+    user_model="claude-sonnet-5",   # any model your Claude login can use
+    description="Why did you move?",
+)
+```
+Requires [Claude Code](https://code.claude.com/docs) installed and logged in
+(`claude` on PATH). No API key needed.

cat_claws-0.1.0/README.md ADDED

@@ -0,0 +1,43 @@
+# cat-claws
+Agent-CLI backend for the [CatLLM ecosystem](https://github.com/chrissoria/cat-llm):
+classify text through a **Claude subscription** (via the Claude Agent SDK)
+instead of per-token API billing. An OpenAI Codex adapter is planned.
+*(Distribution name `cat-claws`; imports as `catclaws`. Source repo:
+[cat-agent](https://github.com/chrissoria/cat-agent).)*
+**Status: alpha, under active development.** See `MASTERPLAN.md` for the
+design and step tracker.
+## Install
+```bash
+pip install cat-claws
+```
+## Design in one paragraph
+One row = one sealed, fresh-context agent call (no tools, single turn, no
+settings/CLAUDE.md loading), using cat-stack's validated classification
+prompt byte-for-byte. The model answers in JSON; parsing and the wide 0/1
+output matrix reuse cat-stack's existing machinery. Throughput comes from
+concurrent one-shot calls, never from shared conversations or
+corpus-in-one-prompt (which would contaminate rows and break research
+validity).
+## Quick start (Phase 1)
+```python
+import catclaws
+df = catclaws.classify(
+    input_data=["I moved for a new job", "Rent got too expensive"],
+    categories=["Employment", "Cost of living", "Other"],
+    user_model="claude-sonnet-5",   # any model your Claude login can use
+    description="Why did you move?",
+)
+```
+Requires [Claude Code](https://code.claude.com/docs) installed and logged in
+(`claude` on PATH). No API key needed.
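
Note on the README's "Design in one paragraph": the idea of one sealed call per row with bounded concurrency might look roughly like the sketch below. It is not the package's implementation. `classify_rows`, `call_once`, and the result dictionary keys are hypothetical names, and `call_once` stands in for whatever performs a single fresh-context agent call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def classify_rows(
    rows: List[str],
    call_once: Callable[[str], str],
    workers: int = 3,
) -> List[Dict[str, Optional[str]]]:
    """Run one independent call per row, at most `workers` at a time.

    Rows never share a conversation: each call sees only its own row.
    Failures are captured per row instead of being raised.
    """

    def sealed(row: str) -> Dict[str, Optional[str]]:
        try:
            return {"input": row, "raw": call_once(row), "error": None}
        except Exception as exc:  # keep going; record the failure for this row
            return {"input": row, "raw": None, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with rows.
        return list(pool.map(sealed, rows))


if __name__ == "__main__":
    # Stand-in for a real agent call, used only to exercise the sketch.
    results = classify_rows(["I moved for a new job"], lambda text: text.upper())
    print(results)
```

A thread pool is used here only for brevity; the same shape works with `asyncio` and a semaphore if the underlying call is asynchronous.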