PyPI - brooder - Versions diffs - 0.1.0__tar.gz → 0.2.0__tar.gz - Mend

brooder 0.1.0tar.gz → 0.2.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (110)

{brooder-0.1.0 → brooder-0.2.0}/.github/workflows/ci.yml +8 -8
{brooder-0.1.0 → brooder-0.2.0}/.github/workflows/release.yml +23 -5
{brooder-0.1.0 → brooder-0.2.0}/CHANGELOG.md +46 -1
{brooder-0.1.0 → brooder-0.2.0}/PKG-INFO +204 -95
{brooder-0.1.0 → brooder-0.2.0}/README.md +201 -94
brooder-0.2.0/ROADMAP.md +336 -0
brooder-0.2.0/SECURITY.md +39 -0
{brooder-0.1.0 → brooder-0.2.0}/action.yml +25 -8
brooder-0.2.0/design/anti-flakiness.md +277 -0
{brooder-0.1.0 → brooder-0.2.0}/examples/github-action.yml +10 -4
{brooder-0.1.0 → brooder-0.2.0}/pyproject.toml +10 -1
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/__init__.py +3 -0
brooder-0.2.0/src/brooder/analysis.py +143 -0
brooder-0.2.0/src/brooder/budget.py +195 -0
brooder-0.2.0/src/brooder/cli.py +486 -0
brooder-0.2.0/src/brooder/config.py +266 -0
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/diffing.py +159 -12
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/errors.py +6 -3
brooder-0.2.0/src/brooder/integrations/__init__.py +90 -0
brooder-0.2.0/src/brooder/integrations/anthropic.py +127 -0
brooder-0.2.0/src/brooder/integrations/base.py +568 -0
brooder-0.2.0/src/brooder/integrations/bedrock.py +128 -0
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/integrations/claude_agent.py +83 -19
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/integrations/google.py +20 -4
brooder-0.2.0/src/brooder/integrations/openai.py +112 -0
brooder-0.2.0/src/brooder/judges.py +161 -0
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/metrics.py +52 -15
brooder-0.2.0/src/brooder/models.py +379 -0
brooder-0.2.0/src/brooder/pytest_plugin.py +309 -0
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/recorder.py +153 -2
brooder-0.2.0/src/brooder/redaction.py +153 -0
brooder-0.2.0/src/brooder/report.py +411 -0
brooder-0.2.0/src/brooder/storage.py +309 -0
brooder-0.2.0/tests/conftest.py +7 -0
brooder-0.2.0/tests/test_action.py +143 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_analysis.py +19 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_async_capture.py +5 -3
brooder-0.2.0/tests/test_budget.py +274 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_claude_agent.py +97 -9
brooder-0.2.0/tests/test_cli.py +472 -0
brooder-0.2.0/tests/test_config.py +98 -0
brooder-0.2.0/tests/test_diffing.py +159 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_integrations.py +43 -2
brooder-0.2.0/tests/test_judges.py +77 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_metrics.py +73 -1
brooder-0.2.0/tests/test_models.py +122 -0
brooder-0.2.0/tests/test_output.py +200 -0
brooder-0.2.0/tests/test_pytest_plugin.py +159 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_recorder.py +62 -0
brooder-0.2.0/tests/test_redaction.py +140 -0
brooder-0.2.0/tests/test_report.py +93 -0
brooder-0.2.0/tests/test_severity.py +110 -0
brooder-0.2.0/tests/test_storage.py +227 -0
brooder-0.2.0/tests/test_streaming.py +422 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_trajectory_diff.py +24 -0
brooder-0.1.0/ROADMAP.md +0 -134
brooder-0.1.0/SECURITY.md +0 -14
brooder-0.1.0/src/brooder/analysis.py +0 -79
brooder-0.1.0/src/brooder/cli.py +0 -281
brooder-0.1.0/src/brooder/config.py +0 -88
brooder-0.1.0/src/brooder/integrations/__init__.py +0 -75
brooder-0.1.0/src/brooder/integrations/anthropic.py +0 -46
brooder-0.1.0/src/brooder/integrations/base.py +0 -170
brooder-0.1.0/src/brooder/integrations/bedrock.py +0 -49
brooder-0.1.0/src/brooder/integrations/openai.py +0 -43
brooder-0.1.0/src/brooder/judges.py +0 -109
brooder-0.1.0/src/brooder/models.py +0 -148
brooder-0.1.0/src/brooder/report.py +0 -261
brooder-0.1.0/src/brooder/storage.py +0 -150
brooder-0.1.0/tests/test_action.py +0 -57
brooder-0.1.0/tests/test_cli.py +0 -194
brooder-0.1.0/tests/test_config.py +0 -44
brooder-0.1.0/tests/test_diffing.py +0 -54
brooder-0.1.0/tests/test_judges.py +0 -31
brooder-0.1.0/tests/test_output.py +0 -99
brooder-0.1.0/tests/test_storage.py +0 -39
{brooder-0.1.0 → brooder-0.2.0}/.github/ISSUE_TEMPLATE/bug_report.md +0 -0
{brooder-0.1.0 → brooder-0.2.0}/.github/ISSUE_TEMPLATE/feature_request.md +0 -0
{brooder-0.1.0 → brooder-0.2.0}/.github/dependabot.yml +0 -0
{brooder-0.1.0 → brooder-0.2.0}/.github/pull_request_template.md +0 -0
{brooder-0.1.0 → brooder-0.2.0}/.gitignore +0 -0
{brooder-0.1.0 → brooder-0.2.0}/.pre-commit-config.yaml +0 -0
{brooder-0.1.0 → brooder-0.2.0}/CONTRIBUTING.md +0 -0
{brooder-0.1.0 → brooder-0.2.0}/DCO +0 -0
{brooder-0.1.0 → brooder-0.2.0}/LICENSE +0 -0
{brooder-0.1.0 → brooder-0.2.0}/LICENSING.md +0 -0
{brooder-0.1.0 → brooder-0.2.0}/NOTICE +0 -0
{brooder-0.1.0 → brooder-0.2.0}/TRADEMARKS.md +0 -0
{brooder-0.1.0 → brooder-0.2.0}/assets/banner.svg +0 -0
{brooder-0.1.0 → brooder-0.2.0}/assets/demo.svg +0 -0
{brooder-0.1.0 → brooder-0.2.0}/assets/record-demo.sh +0 -0
{brooder-0.1.0 → brooder-0.2.0}/design/framework-adapters.md +0 -0
{brooder-0.1.0 → brooder-0.2.0}/design/trajectory.md +0 -0
{brooder-0.1.0 → brooder-0.2.0}/docs/api.md +0 -0
{brooder-0.1.0 → brooder-0.2.0}/docs/index.md +0 -0
{brooder-0.1.0 → brooder-0.2.0}/examples/flaky_agent.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/examples/loop_agent.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/examples/regressing_agent.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/examples/stable_agent.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/mkdocs.yml +0 -0
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/integrations/langchain.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/integrations/openai_agents.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/integrations/otel.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/log.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/src/brooder/py.typed +0 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_capture_core.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_langchain.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_openai_agents.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_otel.py +0 -0
{brooder-0.1.0 → brooder-0.2.0}/tests/test_trajectory.py +0 -0

{brooder-0.1.0 → brooder-0.2.0}/.github/workflows/ci.yml RENAMED

@@ -9,8 +9,8 @@ jobs:
   lint:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
+      - uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0
+      - uses: actions/setup-python@ece7cb06caefa5fff74198d8649806c4678c61a1 # v6.3.0
         with:
           python-version: "3.12"
       - run: pip install -e ".[dev]"
@@ -27,8 +27,8 @@ jobs:
       matrix:
         python-version: ["3.10", "3.11", "3.12"]
     steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
+      - uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0
+      - uses: actions/setup-python@ece7cb06caefa5fff74198d8649806c4678c61a1 # v6.3.0
         with:
           python-version: ${{ matrix.python-version }}
       - run: pip install -e ".[dev]"
@@ -42,8 +42,8 @@ jobs:
   docs:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
+      - uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0
+      - uses: actions/setup-python@ece7cb06caefa5fff74198d8649806c4678c61a1 # v6.3.0
         with:
           python-version: "3.12"
       - run: pip install -e ".[docs]"
@@ -54,8 +54,8 @@ jobs:
     # Keep the Apache-2.0 core free of strong copyleft (GPL/AGPL/SSPL). LGPL is allowed.
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
+      - uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0
+      - uses: actions/setup-python@ece7cb06caefa5fff74198d8649806c4678c61a1 # v6.3.0
         with:
           python-version: "3.12"
       - name: Install runtime deps only
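
Every hunk in this file swaps a floating tag (`@v4`, `@v5`) for a full commit SHA with the version kept as a trailing comment. A check like the following could flag any `uses:` line that is still unpinned. It is only a sketch: `unpinned_actions` and both regexes are hypothetical names written for this note, not part of brooder or of GitHub's tooling, and local (`./path`) references are deliberately ignored.

```python
import re

# Any `uses:` reference of the form owner/repo@ref (hypothetical helper pattern).
USES = re.compile(r"^\s*-?\s*uses:\s*(?P<action>[\w./-]+)@(?P<ref>\S+)")
# The same reference, but with a full 40-character lowercase hex commit SHA as the ref.
PINNED = re.compile(r"^\s*-?\s*uses:\s*[\w./-]+@[0-9a-f]{40}\b")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return `action@ref` strings whose ref is not a 40-hex commit SHA."""
    found = []
    for line in workflow_text.splitlines():
        match = USES.match(line)
        if match and not PINNED.match(line):
            found.append(f"{match['action']}@{match['ref']}")
    return found


if __name__ == "__main__":
    sample = (
        "      - uses: actions/checkout@v4\n"
        "      - uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0\n"
    )
    print(unpinned_actions(sample))
```

Run against the 0.1.0 workflow text, such a check would list every `uses:` line; against the 0.2.0 text shown in this hunk it should list none.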

{brooder-0.1.0 → brooder-0.2.0}/.github/workflows/release.yml RENAMED

@@ -23,8 +23,8 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
+      - uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0
+      - uses: actions/setup-python@ece7cb06caefa5fff74198d8649806c4678c61a1 # v6.3.0
         with:
           python-version: "3.12"
       - name: Build sdist + wheel
@@ -35,7 +35,7 @@ jobs:
         run: |
           python -m pip install --upgrade twine
           python -m twine check dist/*
-      - uses: actions/upload-artifact@v4
+      - uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
         with:
           name: dist
           path: dist/
@@ -47,9 +47,27 @@ jobs:
     permissions:
       id-token: write # OIDC token for Trusted Publishing
     steps:
-      - uses: actions/download-artifact@v4
+      - uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1
         with:
           name: dist
           path: dist/
       - name: Publish to PyPI
-        uses: pypa/gh-action-pypi-publish@release/v1
+        uses: pypa/gh-action-pypi-publish@cef221092ed1bacb1cc03d23a2d87d1d172e277b # v1.14.0
+  major-tag:
+    # Move the `v1` (Action interface) tag to this release so `uses: agentbrooder/brooder@v1`
+    # always resolves to the latest release — no manual tag bumping. Runs only after publish.
+    # Safe from loops: `v1` doesn't match the `v*.*.*` trigger, and a GITHUB_TOKEN push doesn't
+    # re-trigger workflows anyway.
+    needs: publish
+    runs-on: ubuntu-latest
+    permissions:
+      contents: write
+    steps:
+      - uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0
+      - name: Update the v1 tag to point at this release
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
+          git tag -f v1
+          git push -f origin v1
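
The comment on the new `major-tag` job says the forced `v1` push cannot re-trigger the release workflow because `v1` does not match the `v*.*.*` tag filter. That claim can be checked with a small sketch. Note the assumption: Python's `fnmatch` stands in for GitHub's own filter-pattern matching here, which is a different implementation; the two agree for these tags only because the pattern requires two literal dots. `triggers_release` is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Tag filter quoted in the job comment above; the workflow's `on:` block
# itself is not part of this hunk.
RELEASE_TRIGGER = "v*.*.*"


def triggers_release(tag: str) -> bool:
    """Approximate whether a pushed tag would match the release trigger."""
    return fnmatchcase(tag, RELEASE_TRIGGER)


if __name__ == "__main__":
    for tag in ("v0.2.0", "v1", "v1.0"):
        print(tag, triggers_release(tag))
```

A full release tag such as `v0.2.0` matches, while the moving `v1` tag has no dots and does not, which is the property the job relies on.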

{brooder-0.1.0 → brooder-0.2.0}/CHANGELOG.md RENAMED

@@ -6,6 +6,50 @@ All notable changes to this project are documented here. The format is based on
 ## [Unreleased]
+## [0.2.0] — 2026-07-04
+### Added
+- **Review UX — snapshot-fatigue defense** — every failing case is now classified **suspicious**
+  (a material change: the output, a tool/final step, an observed tool result, or a guardrail
+  terminal changed — or the case is flaky) vs. **expected** (cosmetic reasoning-turn or
+  turn/step-count churn with the tool path and answer intact). The results table and the Markdown PR
+  comment headline "N suspicious · M expected", sort the worst cases first, and add a **Review**
+  column. `brooder approve` gained selective acceptance: `brooder approve [SELECTOR]... --only
+  {all,expected,suspicious} [--dry-run]` — e.g. `brooder approve --only expected` clears the cosmetic
+  drift in one command so you review the suspicious few individually. Classification is advisory and
+  **never changes the CI gate** — every regression still fails the build. The machine-readable
+  summary adds `suspicious`/`expected` counts and a per-case `severity` (summary schema **v1 → v2**,
+  additive).
+- **Streaming capture** — auto-capture now records tool calls, content, and token usage from
+  streamed responses (`stream=True`) instead of only warning. The streamed iterator is teed as your
+  code consumes it and reduced into a normal captured call once the stream ends. Covers OpenAI
+  `create(stream=True)` (sync + async), Anthropic `messages.create(stream=True)`, and Bedrock
+  `converse_stream`. (The `messages.stream()` / `.stream()` context-manager helpers stay warn-only.)
+- **Cost/latency budget gate** — runs now capture `Run.usage` (wall-clock `duration_ms` via
+  `@record`, token counts via provider auto-capture for OpenAI/Anthropic/Bedrock/Google, and derived
+  `cost_usd` when `budget.prices` are configured). `brooder ci --budget` / `run --budget` fail on a
+  `budget:` breach — absolute caps (`max_total_tokens` / `max_duration_ms` / `max_cost_usd`) or
+  per-baseline drift (`max_tokens_increase` etc.). Usage is **not** part of the behavioral diff, so a
+  latency blip or token drift never reads as a regression; the gate is orthogonal to the verdict.
+  OTLP now also emits mean `brooder.usage.*` gauges. Baseline schema bumped to **v2** (additive — v1
+  baselines still load).
+- **pytest plugin** (`pip install brooder[pytest]`, auto-registers via the `pytest11` entry point):
+  a `brooder` fixture + `brooder.snapshot(result, inputs=...)` to snapshot-test agents inside pytest.
+  `pytest` checks each run against its committed baseline (a regression fails the test, a missing
+  baseline fails with a hint — never a silent pass); `pytest --brooder-update` records/refreshes
+  baselines. Configurable per-test with `@pytest.mark.brooder(agent=..., inputs=...)`. New public
+  `brooder.recorder.active_run(handle)` context manager and `brooder.analysis.analyze_with_config`
+  (shared by the CLI and the plugin so verdicts can't diverge).
+### Changed
+- **PR-comment hardening** — the Markdown report now opens with a machine-stable
+  `<!-- brooder-report -->` sentinel, and the GitHub Action slices the PR comment on that marker
+  instead of the content-derived `## Brooder results` headline (captured content escapes `<`, so it
+  can't forge the marker or truncate the comment).
+- **Baseline schema versioning** — loading a baseline written by a newer Brooder now fails with a
+  clear "upgrade Brooder" error instead of a confusing validation failure, and there is a migration
+  seam so a future `Run`/`Step` change can upgrade old committed baselines on load.
 ## [0.1.0] — 2026-07-02
 First public release.
@@ -92,5 +136,6 @@ First public release.
 - Strict typing (`py.typed`), atomic storage writes, typed config, structured logging.
 - Tooling: ruff, mypy (strict), pre-commit, pytest + coverage, CI matrix (3.10–3.12).
-[Unreleased]: https://github.com/agentbrooder/brooder/compare/v0.1.0...HEAD
+[Unreleased]: https://github.com/agentbrooder/brooder/compare/v0.2.0...HEAD
+[0.2.0]: https://github.com/agentbrooder/brooder/compare/v0.1.0...v0.2.0
 [0.1.0]: https://github.com/agentbrooder/brooder/releases/tag/v0.1.0
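
The 0.2.0 changelog entry describes the review triage rule in prose: a failing case is **suspicious** when the output, a tool/final step, an observed tool result, or a guardrail terminal changed, or when the case is flaky, and **expected** when only reasoning turns or turn/step counts churned. A minimal sketch of that rule follows. The change-kind labels and the `classify` function are hypothetical and chosen for illustration; brooder's actual implementation in `src/brooder/diffing.py` is not shown in this part of the diff.

```python
# Hypothetical labels for the cosmetic change kinds named in the changelog.
COSMETIC = frozenset({"reasoning_turn", "step_count"})


def classify(changes: set[str], flaky: bool = False) -> str:
    """Return 'suspicious', 'expected', or 'none' for one case.

    Anything that is not known to be cosmetic counts as material, so an
    unrecognised change kind errs on the side of human review.
    """
    if flaky or (changes - COSMETIC):
        return "suspicious"
    if changes:
        return "expected"
    return "none"


if __name__ == "__main__":
    print(classify({"reasoning_turn"}))
    print(classify({"reasoning_turn", "tool_step"}))
    print(classify(set()))
```

The label is advisory in the same sense the changelog describes: nothing in this sketch feeds back into a pass/fail decision.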

{brooder-0.1.0 → brooder-0.2.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: brooder
-Version: 0.1.0
+Version: 0.2.0
 Summary: Snapshot testing for AI agents — catch behavior regressions before they ship.
 Project-URL: Homepage, https://brooder.dev
 Project-URL: Repository, https://github.com/agentbrooder/brooder
@@ -43,6 +43,8 @@ Requires-Dist: openai-agents>=0.1; extra == 'openai-agents'
 Provides-Extra: otel
 Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.20; extra == 'otel'
 Requires-Dist: opentelemetry-sdk>=1.20; extra == 'otel'
+Provides-Extra: pytest
+Requires-Dist: pytest>=8.0; extra == 'pytest'
 Description-Content-Type: text/markdown
 <p align="center">
@@ -112,60 +114,204 @@ have shipped to production unnoticed. Brooder caught it — and exited non-zero,
 ---
-## The normal workflow
+## The workflow
 ```bash
 brooder record examples/regressing_agent.py     # capture golden baselines from real runs
 brooder run    examples/regressing_agent.py     # re-run after a change, diff vs baseline
 brooder diff                                    # see exactly what changed
-brooder approve                                 # accept the new behavior as the baseline
+brooder approve --only expected                 # bulk-accept the cosmetic drift...
+brooder approve <case>                           # ...then accept the reviewed ones case-by-case
 ```
 `brooder run` exits non-zero when behavior regressed — drop it into CI and it gates your PRs.
+**No snapshot fatigue.** When a model bump surfaces a wall of diffs, Brooder classifies each as
+**suspicious** (the output, tool path, or a guardrail actually changed) or **expected** (cosmetic
+reasoning-turn / count drift), headlines the summary `N suspicious · M expected`, and sorts the
+scary ones first — so `brooder approve --only expected` clears the noise in one command and you spend
+review on the few that matter. (`brooder approve` with no args still accepts everything.)
 ---
-## Instrument your own agent
+## Instrument your agent
-Add one decorator. Log tool calls with one function. That's the whole SDK.
+Add one decorator. That's the whole SDK. Log tool calls explicitly with `brooder.tool_call`, or
+wrap your LLM client with `brooder.instrument(...)` and Brooder captures the model's tool-call
+decisions for you.
 ```python
 import brooder
+import openai
-def search_kb(query):
-    brooder.tool_call("search_kb", {"query": query}, result="...")
-    return "..."
+client = brooder.instrument(openai.OpenAI())   # auto-captures tool calls while recording
 @brooder.record("support-agent")
 def agent(question: str) -> str:
-    docs = search_kb(question)
+    docs = client.chat.completions.create(model="gpt-4o", messages=[...])
     return answer_from(docs)
 # call it over your real inputs; brooder records/replays automatically
 ```
-Then run it through the CLI. Baselines are plain JSON committed to your repo, so diffs show up in
-code review like any other change.
+Baselines are plain JSON committed to your repo, so diffs show up in code review like any other
+change.
+**It tests the whole trajectory, not single LLM calls.** `@brooder.record` wraps your *entire*
+agent — every step of its plan → act → observe loop. The baseline is the full trajectory: every
+tool call across every turn, in order, plus the final output. So Brooder catches a `verify` step
+that silently disappears *inside the loop* — the kind of agent-level regression an LLM-output eval
+never sees.
+### Works with your stack
+| Layer | Supported |
+| --- | --- |
+| **LLM providers** | OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, Google (Gemini / Vertex) — auto-detected |
+| **Agent frameworks** | LangChain, LangGraph, CrewAI, AutoGen (via OpenTelemetry), OpenAI Agents SDK, Claude Agent SDK |
+| **Async** | `AsyncOpenAI`, `AsyncAzureOpenAI`, `AsyncAnthropic`, Google `generate_content_async` — no extra setup |
+| **Custom endpoints** | Any base URL / proxy / OpenAI-compatible gateway — Brooder never touches credentials |
+Setup for each is in **[Integrations](#integrations)** below.
+---
+## Why not just use observability / eval tools?
+| Tool type | Examples | What it does | The gap Brooder fills |
+| --- | --- | --- | --- |
+| Observability | Langfuse, Laminar, Phoenix | Trace/monitor **after** it runs | Doesn't gate **before** you ship |
+| Eval frameworks | DeepEval, Braintrust, Ragas | Score against **hand-written** datasets | Requires eval authoring nobody maintains |
+| **Brooder** | — | **Record real runs → behavioral diff on every change → CI gate** | **Zero eval-writing, catches model-migration regressions** |
+Your baselines are JSON files in **your** repo. No SaaS, no cloud account — nobody can acquire your
+test suite out from under you.
+---
+## Gate your PRs (GitHub Action)
+Drop Brooder into CI and it re-runs your agent on every pull request, comments the behavioral diff,
+and fails the check when behavior regresses. Copy [examples/github-action.yml](examples/github-action.yml)
+to `.github/workflows/brooder.yml`:
+```yaml
+permissions:
+  contents: read
+  pull-requests: write        # so it can comment the diff
+jobs:
+  agent-snapshot:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: agentbrooder/brooder@v1
+        with:
+          script: tests/agent_snapshot.py
+```
+The comment is upserted (updated in place, not spammed) and looks like the `--format markdown`
+output.
+> **Security:** `brooder ci` runs your checked-out agent script, so don't wire live provider
+> secrets into a `pull_request`-triggered job — see the
+> [CI trust model](SECURITY.md#running-brooder-in-ci-safely-trust-model).
+---
+## Snapshot-test inside pytest
+Prefer to stay in the test runner you already have? `pip install brooder[pytest]` and use the
+`brooder` fixture — no separate CLI harness:
+```python
+def test_support_agent(brooder):
+    answer = support_agent("refund my order")   # tool calls / LLM turns auto-captured
+    brooder.snapshot(answer, inputs="refund my order")
+```
+- `pytest --brooder-update` records the golden baseline (commit it, like any snapshot).
+- `pytest` checks each run against that baseline: a behavioral regression **fails the test**, and a
+  missing baseline fails with a hint instead of passing silently.
+It honors your `brooder.yaml` (judge, normalization, redaction) and reuses the same capture and diff
+engine as the CLI. Configure a case with `@pytest.mark.brooder(agent="support", inputs=...)`.
+---
+## What it checks
+- **Structural diff** — the sequence of tool calls, their arguments, and the final output.
+- **Semantic diff** — a pluggable judge (`judge: exact | llm`) so equivalent wording isn't a regression.
+- **Flakiness** — `brooder run --runs 3` runs each case N times and flags non-determinism (`FLAKY`).
+- **Review triage** — each regression is tagged **suspicious** (material) or **expected** (cosmetic)
+  so a model bump's wall of diffs sorts by attention, not just count (see *The workflow* above).
+Each case gets a verdict — `PASS` / `REGRESSED` / `NEW` / `FLAKY` — a review class, and a stability
+score.
+---
+## Gate on cost & latency drift
+Behavior isn't the only thing that regresses — a model swap can keep the *same* behavior while
+quietly doubling your token bill. Brooder captures each run's **latency and token usage** (and cost,
+if you configure prices) and can fail CI when they spike. Usage is tracked separately from behavior,
+so a noisy latency blip never reads as a behavioral regression.
+```yaml
+# brooder.yaml
+budget:
+  max_total_tokens: 3000      # absolute ceiling per case
+  max_tokens_increase: 0.2    # …or fail if tokens drift >20% vs the baseline
+  prices:                     # optional: enable USD cost caps
+    gpt-4o: { input_per_mtok: 2.5, output_per_mtok: 10.0 }
+```
+```console
+$ brooder ci --budget agent.py
+💸 Budget — 1 limit(s) exceeded
+ • assistant/19761739: total_tokens 2600 is +160% vs baseline 1000 (limit 1200)
+```
 ---
-## Auto-capture (no manual `tool_call`)
+## Integrations
+Everything above works with one decorator. These sections show the exact setup for each provider,
+framework, and output format — expand what you need.
-Wrap your LLM client and Brooder records the model's tool-call decisions automatically:
+<details>
+<summary><b>All providers, custom endpoints & async</b></summary>
+Wrap your LLM client and Brooder records the model's tool-call decisions automatically. The provider
+is auto-detected from the client; override it with a name, an alias, or a `Provider`:
 ```python
 import brooder
-import openai
+from brooder import Provider
-client = brooder.instrument(openai.OpenAI())
-# now every client.chat.completions.create(...) call is captured while recording
+brooder.instrument(openai.OpenAI())                          # OpenAI
+brooder.instrument(openai.AzureOpenAI(...))                  # Azure OpenAI (or provider="azure")
+brooder.instrument(anthropic.Anthropic())                   # Anthropic (or provider=Provider.ANTHROPIC)
+brooder.instrument(boto3.client("bedrock-runtime"))         # AWS Bedrock (or provider="aws")
+brooder.instrument(genai.GenerativeModel("gemini-1.5-pro")) # Google Gemini / Vertex (or provider="gemini")
 ```
-Supported providers: **OpenAI**, **Azure OpenAI**, **Anthropic**, **AWS Bedrock**, and
-**Google (Gemini / Vertex)**. The provider is auto-detected; override it with
-`brooder.instrument(client, provider="bedrock")`. Model *names* are intentionally not diffed, so
-switching models isn't itself a change — only the model's *behavior* (which tools it calls, with
-what arguments) is.
+The canonical set is `brooder.Provider`: **OpenAI**, **Azure OpenAI**, **Anthropic**, **AWS
+Bedrock**, and **Google (Gemini / Vertex)**.
+**Custom endpoints & proxies.** Brooder never manages credentials or URLs — the provider's own SDK
+does. Point a client at any base URL (an internal gateway, an OpenAI-compatible server, an
+Azure-APIM-proxied Anthropic endpoint, …) and hand it to `instrument` unchanged:
+```python
+client = anthropic.Anthropic(base_url="https://your-proxy/…", api_key="…")
+brooder.instrument(client)   # captured exactly the same
+```
+Model *names* are intentionally not diffed, so switching models isn't itself a change — only the
+model's *behavior* (which tools it calls, with what arguments) is.
 **Async works too.** `@brooder.record` and `instrument(...)` handle `async def` agents and async
 clients — `AsyncOpenAI`, `AsyncAzureOpenAI`, `AsyncAnthropic`, and Google's `generate_content_async`
@@ -182,11 +328,13 @@ async def agent(question: str) -> str:
 (Async AWS Bedrock via aioboto3 isn't covered yet — the sync boto3 client is.)
-## Capture from agent frameworks (OpenTelemetry)
+</details>
+<details>
+<summary><b>OpenTelemetry (LangGraph, CrewAI, AutoGen, …)</b></summary>
-Building on an agent framework? If it emits OpenTelemetry GenAI spans — **LangGraph, CrewAI,
-AutoGen**, and anything else on the convention — add one span processor and Brooder ingests the
-whole trajectory, no manual `tool_call`:
+If your framework emits OpenTelemetry GenAI spans, add one span processor and Brooder ingests the
+whole trajectory — no manual `tool_call`:
 ```python
 from opentelemetry import trace
@@ -199,27 +347,34 @@ It maps inference spans → turns, `execute_tool` spans → tool calls, and the
 input/output → the case identity and final answer. It also drops straight into the OTel pipelines
 you already run (Datadog / Arize / Honeycomb).
-Building directly on the **Claude Agent SDK**? Register Brooder's hooks and it records the tool
-trajectory automatically:
+</details>
+<details>
+<summary><b>Claude Agent SDK</b></summary>
+Register Brooder's hooks and it records the tool trajectory and the final answer automatically:
 ```python
 import brooder
-from claude_agent_sdk import ClaudeSDKClient, ClaudeAgentOptions, ResultMessage
-from brooder.integrations import claude_agent
+from claude_agent_sdk import ClaudeSDKClient, ClaudeAgentOptions
 options = ClaudeAgentOptions(hooks=brooder.claude_agent_hooks(agent="support-agent"))
 async with ClaudeSDKClient(options=options) as client:
     await client.query(prompt)
     async for msg in client.receive_response():
-        if isinstance(msg, ResultMessage):
-            claude_agent.record_output(msg.session_id, msg.result)  # optional: capture the answer
+        ...  # nothing Brooder-specific needed
 ```
-`UserPromptSubmit` opens a run (the prompt is the case identity), `PostToolUse` becomes a tool step,
+`UserPromptSubmit` opens a run (the prompt is the case identity), tool-use hooks become tool steps,
 and `Stop` finalizes it.
-On the **OpenAI Agents SDK**? Its tracing is on by default — install Brooder's trace processor once
-and every run is captured (no OpenAI API key required for capture):
+</details>
+<details>
+<summary><b>OpenAI Agents SDK</b></summary>
+Its tracing is on by default — install Brooder's trace processor once and every run is captured (no
+OpenAI API key required for capture):
 ```python
 import brooder.integrations.openai_agents as bd_agents
@@ -231,7 +386,12 @@ It maps generation/response spans → turns, function spans → tool calls, and
 guardrails into the trajectory too — so both tool selection *and* control-flow regressions get
 diffed.
-Using **LangChain or LangGraph**? Attach one callback handler — no OpenTelemetry setup required:
+</details>
+<details>
+<summary><b>LangChain / LangGraph</b></summary>
+Attach one callback handler — no OpenTelemetry setup required:
 ```python
 import brooder.integrations.langchain as bd_lc
@@ -243,66 +403,23 @@ graph.invoke({"messages": [...]}, config={"callbacks": [handler]})
 The root chain start opens a run (its input is the case identity), model calls become turns, and
 tool calls become tool steps — one handler covers both LangChain and LangGraph.
-## It tests agents (the whole trajectory), not single LLM calls
-`@brooder.record` wraps your **entire agent** — every step of its plan → act → observe loop.
-The baseline is the full **trajectory**: every tool call across every turn, in order, plus the
-final output. So Brooder catches agent-level regressions, not just token changes in one model
-response.
-```bash
-# A multi-step agent that silently stops verifying before answering on the newer model:
-brooder migrate --from gpt-4o --to gpt-5-new examples/loop_agent.py
-# -> REGRESSED: trajectory[1] "verify" removed
-```
-That dropped `verify` step happened *inside the loop* — the kind of thing an LLM-output eval
-would never see.
-## Why not just use observability / eval tools?
-| Tool type | Examples | What it does | The gap Brooder fills |
-| --- | --- | --- | --- |
-| Observability | Langfuse, Laminar, Phoenix | Trace/monitor **after** it runs | Doesn't gate **before** you ship |
-| Eval frameworks | DeepEval, Braintrust, Ragas | Score against **hand-written** datasets | Requires eval authoring nobody maintains |
-| **Brooder** | — | **Record real runs → behavioral diff on every change → CI gate** | **Zero eval-writing, catches model-migration regressions** |
----
+</details>
-## Gate your PRs (GitHub Action)
-Drop Brooder into CI and it re-runs your agent on every pull request, comments the behavioral diff,
-and fails the check when behavior regresses. Copy [examples/github-action.yml](examples/github-action.yml)
-to `.github/workflows/brooder.yml`:
-```yaml
-permissions:
-  contents: read
-  pull-requests: write        # so it can comment the diff
-jobs:
-  agent-snapshot:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: agentbrooder/brooder@v1
-        with:
-          script: tests/agent_snapshot.py
-```
-The comment is upserted (updated in place, not spammed) and looks like the `--format markdown`
-output below.
-## Machine-readable output (`--json` / OTLP)
+<details>
+<summary><b>Machine-readable output & dashboards (<code>--json</code> / OTLP)</b></summary>
 `run`, `ci`, and `diff` take `--format table|json|markdown` (`--json` is a shortcut). Exit codes are
 unchanged, so you can gate *and* parse:
 ```bash
 brooder run agent.py --json | jq '.summary'
-# { "total": 3, "passed": 2, "regressed": 1, "flaky": 0, "regressions": 1, "mean_stability": 80 }
+# { "total": 3, "passed": 2, "regressed": 1, "flaky": 0, "regressions": 1,
+#   "suspicious": 1, "expected": 0, "mean_stability": 80 }
 ```
+Each case also carries a `severity` (`suspicious` / `expected` / `none`) so a dashboard can rank the
+regressions that need a human by attention, not just count them.
 For dashboards, point Brooder at any OTLP endpoint and each run emits a snapshot of gauges
 (`brooder.cases.*`, `brooder.stability.mean`) — **one exporter** that reaches Datadog, Grafana,
 Honeycomb, and CloudWatch:
@@ -313,15 +430,7 @@ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/metrics   # or metri
 brooder ci agent.py
 ```
----
-## What it checks
-- **Structural diff** — the sequence of tool calls, their arguments, and the final output.
-- **Semantic diff** — a pluggable judge (`judge: exact | llm`) so equivalent wording isn't a regression.
-- **Flakiness** — `brooder run --runs 3` runs each case N times and flags non-determinism (`FLAKY`).
-Each case gets a verdict — `PASS` / `REGRESSED` / `NEW` / `FLAKY` — and a stability score.
+</details>
 ---
