PyPI - react-agent-harness - Versions diffs - 0.6.0__tar.gz → 0.7.0__tar.gz - Mend

react-agent-harness 0.6.0tar.gz → 0.7.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (89) hide show

{react_agent_harness-0.6.0/react_agent_harness.egg-info → react_agent_harness-0.7.0}/PKG-INFO RENAMED Viewed

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: react-agent-harness
-Version: 0.6.0
-Summary: Multi-agent LLM orchestration: hybrid DAG planning, two-tier memory, streaming
+Version: 0.7.0
+Summary: Multi-agent LLM orchestration: hybrid DAG planning, two-tier memory, streaming, cost/token budgets with per-call-site breakdown
 Requires-Python: >=3.10
 License-File: LICENSE
 Requires-Dist: prompt_toolkit>=3.0

{react_agent_harness-0.6.0 → react_agent_harness-0.7.0}/README.md RENAMED Viewed

@@ -1,7 +1,9 @@
 # react-agent-harness
 Bring-your-own-LLM multi-agent harness: hybrid DAG planning with replan-on-failure,
-two-tier memory (semantic KV + episodic vector), and a streaming-primary event model.
+two-tier memory (semantic KV + episodic vector), a streaming-primary event model,
+and cost/token budgets with per-call-site attribution (classifier, router,
+planner, synthesizer, agent).
 Config-driven — register tools and agents, run any goal. No subclassing.
@@ -33,15 +35,21 @@ events to stdout and prints elapsed time + cost at the end.
 ## Architecture
 ```
-harness/runtime.py          AgentRuntime — single entry point, wire once run anything
+harness/runtime.py          AgentRuntime — single entry point; BudgetGuard with cost/token caps + per-call-site breakdown
 harness/events.py           BusEvent + EventType — canonical event vocabulary
-harness/llm/openai.py       OpenAILLM — OpenAI adapter with usage + cost tracking
+harness/llm/openai.py       OpenAILLM — OpenAI API-key adapter with usage + cost tracking
+harness/llm/anthropic.py    AnthropicLLM — direct Anthropic API-key adapter with prompt-caching support
+harness/llm/claude_code.py  ClaudeCodeLLM — Claude subscription OAuth adapter (experimental, ToS caveats)
+harness/llm/openai_codex.py OpenAICodexLLM — ChatGPT subscription OAuth adapter (experimental, ToS caveats)
+harness/llm/auth.py         Shared OAuth + auth-file primitives for the subscription adapters
 harness/llm/fallback.py     FallbackLLM — transparent retry on transient upstream errors
 harness/llm/routing.py      RoutingLLM — dispatch calls to different adapters by a selector
+harness/trace.py            JSONL trace recorder + replay — durable, per-event flush
+harness/trace_viewer.py     Local web timeline viewer for recorded JSONL traces
 harness/annotation.py       Annotation store + AnnotationHook — RLHF trajectory capture
 harness/hitl.py             HITL approval gate — interactive CLI, session-allow list
 harness/tool_policy.py      Persistent tool policy — user-scoped allow rules, CLI management
-harness/console.py          ConsoleRenderer — centralised BusEvent formatting for CLI apps
+harness/console.py          ConsoleRenderer — centralised BusEvent formatting + render_budget helper
 harness/steering.py         Async steering — agent.steer(text), StdinRouter pub/sub, FileSteer, factory helpers
 harness/checkpoint.py       CheckpointStore + _ResumeHint + maybe_resume_key — pluggable run-state persistence (file + Redis); auto-resume built into dispatch_stream / run_stream
 harness/otel.py             OTELHook — OpenTelemetry span exporter (opt-in)
@@ -56,7 +64,8 @@ memory/redis_store.py       Redis semantic store — durable KV with TTL
 memory/stores.py            InMemory stores — local dev default, no deps
 tools/builtin/http_fetch.py HTTPFetch — minimal read-only GET tool
 tools/builtin/fetch_image.py FetchImage — fetch URL and return OpenAI image_url block
-tools/mcp/adapter.py        MCP tool adapter — connect any MCP server
+tools/mcp/adapter.py        MCP tool adapter — stdio, SSE, streamable-HTTP transports
+tools/mcp/auth.py           ApiKeyMCPAuth + BrowserOAuthMCPAuth — auth primitives for remote MCP servers
 ```
 Execution is **streaming-primary**: every path yields `BusEvent`s for
@@ -331,6 +340,69 @@ crashing the loop.
 - **During run**: `write_working_fact()` — lightweight KV, namespaced, short TTL
 - **End of run**: `write_run_end()` — LLM extraction → global semantic + episodic vector
+### Memory reconciliation (default-on)
+`write_run_end` runs the LLM-arbitrated reconciler by default: instead of
+extract-and-overwrite, the LLM sees existing relevant memory + new evidence
+and emits a plan of per-fact actions (`ADD` / `UPDATE` / `MERGE` / `DELETE`
+/ `NOOP`). Same call count as the legacy extraction step; the prompt is
+larger only when there's existing context to reconcile against.
+```python
+manager = MemoryManager(
+    semantic_store=…,
+    episodic_store=…,
+    llm=…,
+    reconcile_on_write=True,           # default — set False for legacy extract path
+    allow_destructive_reconcile=False, # default — DELETE actions demoted to NOOP
+    auto_compact_threshold={"agent_task": 20},  # optional — fire compact()
+                                                # when an agent accumulates this
+                                                # many task episodes
+)
+```
+`allow_destructive_reconcile=False` keeps the LLM from removing data unless
+you've vetted that DELETE actions are sensible for your workload — demoted
+decisions land in `manager.get_conflict_log()` so you can audit.
+`manager.compact(goal="…", agent_id="…")` is the same primitive with no
+new evidence — a pure cleanup pass that consolidates accumulated episodes
+and prunes redundant facts. Triggered automatically by
+`auto_compact_threshold`, or call it explicitly.
+Episodic supersede is now **hard-delete** (no `active=False` tombstones
+accumulating per run): `memory_policy="latest"` writes and reconciler
+`DELETE` actions both remove rows.
+If the LLM returns a response that doesn't parse as a reconcile plan
+(older / smaller models that don't follow the multi-action schema),
+`write_run_end` silently falls back to the legacy extract-and-overwrite
+path — no crash, no missed run-end write.
+### Tool-result caching (opt-in, per run)
+`AgentConfig.cache_tool_results = True` memoizes tool calls within a single
+run, keyed by `(tool_name, args)`. Useful for multi-agent runs where agents
+redo each other's idempotent reads (`HTTPFetch` on stable URLs,
+`kubectl get ...` discovery, MCP filesystem reads).
+A tool can veto caching for itself with `cacheable = False` on the
+instance — required for anything with side effects or time-dependent
+output. Errors are never cached (a transient failure shouldn't poison the
+rest of the run).
+```python
+class HTTPFetch:
+    name = "http_fetch"
+    cacheable = True  # default; explicit for clarity
+agents.register(AgentConfig(
+    agent_id="web",
+    ...,
+    cache_tool_results=True,
+))
+```
 Defaults are in-memory (`InMemorySemanticStore`, `InMemoryEpisodicStore`).
 For durable storage:
@@ -590,6 +662,92 @@ Cost ceiling fires on the *next* `check()` (start of next ReAct step or
 orchestrator batch), not synchronously mid-call — accept this for 0.0.1, the
 guard's job is preventing runaway loops, not bounding individual calls.
+### Token limits + per-call-site breakdown
+`GuardrailConfig.max_input_tokens` / `max_output_tokens` cap raw token
+usage independently of dollar cost. This is the only enforcement available
+to subscription-auth runs (`ClaudeCodeLLM`, `OpenAICodexLLM`) — those tiers
+don't expose pricing, so cost stays 0 and only token caps can fire.
+```python
+runtime = AgentRuntime(
+    ...,
+    guardrail_config=GuardrailConfig(
+        max_total_cost_usd=2.0,
+        max_input_tokens=100_000,
+        max_output_tokens=20_000,
+    ),
+)
+```
+Per-call-site attribution lives on the terminal event's `budget` payload
+— a snapshot of spending bucketed by the LLM slot that ran each call.
+The runtime tags classifier / router / planner / synthesizer calls
+automatically; ReAct agent calls go into the totals but don't get a
+bucket. So `cheap` (used for both `classifier_llm` and `router_llm`) and
+`premium` (used for `planner_llm`) report separately even though one is
+the same physical LLM instance shared across slots:
+```python
+async for event in runtime.dispatch_stream(goal):
+    # Routed (simple) goals terminate with TASK_DONE; orchestrated goals
+    # with DONE. Both carry the same ``budget`` shape.
+    if event.type in (EventType.TASK_DONE, EventType.DONE):
+        budget = event.payload["budget"]
+        print(f"total: in={budget['tokens_in']} out={budget['tokens_out']} "
+              f"${budget['cost_usd']:.4f}")
+        for slot, stats in budget["breakdown"].items():
+            print(f"  {slot}: in={stats['tokens_in']} out={stats['tokens_out']}")
+```
+The same `budget` dict is attached to `runtime.run(...)` and
+`runtime.dispatch(...)` return values under the `budget` key, so blocking
+callers don't need to read events.
+Anthropic / Claude Code adapters count input tokens as the *total* that
+hit the wire (non-cached + cache-creation + cache-read), so token caps
+reflect actual consumption regardless of cache hit rate. Cost calculation
+via `cost_fn` still respects cache pricing.
+### Evals via the trace recorder
+There's no shipped evals framework — opinions on scorers, judge models,
+and golden-set management belong outside the orchestration core. The
+[trace recorder](#trace-recorder--replay--local-viewer) already writes
+per-event token/cost/latency to JSONL, so a few lines of glue cover most
+in-house eval setups:
+```python
+import json
+from harness.trace import record_trace
+# 1. Record traces while running a fixture set.
+for fixture in fixtures:
+    async for _event in record_trace(
+        runtime.dispatch_stream(fixture["input"]),
+        path=f"runs/{fixture['id']}.jsonl",
+    ):
+        pass
+# 2. Score offline by replaying.
+def score_run(path: str, expected: str) -> dict:
+    answer = ""
+    budget = {"tokens_in": 0, "tokens_out": 0, "cost_usd": 0.0, "breakdown": {}}
+    for line in open(path):
+        event = json.loads(line)
+        if event["type"] in ("done", "task_done"):
+            answer = event["payload"].get("answer", "")
+            budget = event["payload"].get("budget", budget)
+    return {
+        "success": expected.lower() in answer.lower(),
+        **budget,  # tokens_in, tokens_out, cost_usd, breakdown
+    }
+```
+Plug in your own scorer (exact-match, LLM-judge, semantic similarity) on
+top. External tools like Braintrust, LangSmith, and Weave are
+purpose-built for this and ingest the same JSONL shape directly.
 ## Tool execution
 Tools that shell out (`kubectl`, `curl`, `sh -c …`) should not run inside the

{react_agent_harness-0.6.0 → react_agent_harness-0.7.0}/agents/base.py RENAMED Viewed

@@ -30,6 +30,7 @@ import asyncio
 import contextlib
 import json
 import logging
+import os
 import uuid
 from collections.abc import AsyncGenerator
 from dataclasses import dataclass
@@ -65,6 +66,13 @@ class AgentConfig:
     working_memory_max_tokens: int = 8000  # WorkingMemory eviction threshold; tune per agent
     hitl_tools: list[str] = None  # tools requiring human approval; None = no HITL
     checkpoint_every: int = 0  # write a resumable checkpoint every N steps; 0 = disabled
+    # Cache tool results within a single run, keyed by (tool_name, args).
+    # Opt-in because not every tool is idempotent — a tool may also veto
+    # caching for itself by exposing ``cacheable = False`` on its instance.
+    # Designed for read-mostly multi-agent runs where agents redo each
+    # other's lookups (HTTPFetch on stable URLs, ``kubectl get …`` style
+    # discovery, MCP filesystem reads).
+    cache_tool_results: bool = False
     def __post_init__(self):
         if self.hitl_tools is None:
@@ -159,6 +167,12 @@ class BaseAgent:
         self._resume_key: str = (
             ""  # key printed in --resume banner; set by orchestrator to outer run_id
         )
+        # Per-run tool-result cache. ``None`` when caching is off so the
+        # hot path on ``_execute_tool`` skips the lookup entirely; a fresh
+        # dict per BaseAgent instance bounds the lifetime to one run.
+        self._tool_cache: dict[tuple[str, str], Any] | None = (
+            {} if config.cache_tool_results else None
+        )
     # ── Async steering ────────────────────────────────────────────────────────
@@ -335,7 +349,15 @@ class BaseAgent:
                 agent_id=self.config.agent_id,
             )
             if not mem_context.is_empty():
-                parts.append(mem_context.render())
+                rendered = mem_context.render()
+                if os.environ.get("DEBUG_MEMORY_CONTEXT") == "1":
+                    print(f"\n[debug:memory] context injected for {self.config.agent_id}")
+                    print("─" * 64)
+                    print(rendered)
+                    print("─" * 64)
+                parts.append(rendered)
+            elif os.environ.get("DEBUG_MEMORY_CONTEXT") == "1":
+                print(f"\n[debug:memory] context injected for {self.config.agent_id}: (empty)")
         tool_list = ", ".join(self._tools.keys()) or "none"
         parts.append(REACT_FORMAT.replace("__TOOL_LIST__", tool_list))
@@ -422,6 +444,12 @@ class BaseAgent:
                         "summarizations": self._working_memory.summarization_count,
                     },
                 }
+                # Attach the current budget snapshot so dispatch_stream
+                # consumers can read totals + per-call-site breakdown off
+                # the routed path's terminal event, same shape as the
+                # orchestrator's DONE event.
+                if self._guard is not None and hasattr(self._guard, "snapshot"):
+                    result["budget"] = self._guard.snapshot()
                 logger.info(
                     "Agent %s completed: steps=%d confidence=%.2f summarizations=%d",
                     self.config.agent_id,
@@ -653,11 +681,17 @@ class BaseAgent:
             payload=before_usage,
         )
+        # Tag ReAct spending so it shows up in BudgetGuard.breakdown alongside
+        # classifier/router/planner/synthesizer. Per-agent attribution makes
+        # multi-agent demos surface which specialist agent actually drove the
+        # bulk of token usage.
+        react_source = f"agent:{self.config.agent_id}"
         try:
             if hasattr(self._llm, "stream_complete"):
                 async for token in self._llm.stream_complete(
                     system=None,
                     messages=messages,
+                    source=react_source,
                 ):
                     accumulated += token
                     if self.config.stream_tokens:
@@ -679,6 +713,7 @@ class BaseAgent:
                     system=None,
                     messages=messages,
                     response_format={"type": "json_object"},
+                    source=react_source,
                 )
                 response = _normalize_response(raw)
                 if response is None:
@@ -739,12 +774,32 @@ class BaseAgent:
             return (
                 f"Error: tool '{name}' not available. Available tools: {list(self._tools.keys())}"
             )
+        tool = self._tools[name]
+        # Per-run memoization, gated by both agent opt-in AND tool consent.
+        # Tools that have side effects or time-dependent output can veto
+        # caching by setting ``cacheable = False`` on the instance. Errors
+        # are NOT cached — a transient failure should not poison the rest
+        # of the run.
+        cache_key: tuple[str, str] | None = None
+        if self._tool_cache is not None and getattr(tool, "cacheable", True) is True:
+            try:
+                cache_key = (name, json.dumps(args, sort_keys=True, default=str))
+            except (TypeError, ValueError):
+                cache_key = None  # un-serialisable args — silently skip
+            if cache_key is not None and cache_key in self._tool_cache:
+                return self._tool_cache[cache_key]
         try:
-            return await self._tools[name].execute(**args)
+            result = await tool.execute(**args)
         except Exception as e:
             logger.error("Tool %s failed: %s", name, e)
             return f"Tool error ({name}): {e}"
+        if cache_key is not None and self._tool_cache is not None:
+            self._tool_cache[cache_key] = result
+        return result
     # ── Helpers ───────────────────────────────────────────────────────────────
     def _error_result(self, reason: str, steps: int) -> dict:

{react_agent_harness-0.6.0 → react_agent_harness-0.7.0}/harness/console.py RENAMED Viewed

@@ -178,17 +178,54 @@ class ConsoleRenderer:
             self.sep("═")
             print(p.get("answer", "(no answer)"), file=self._out)
             self.sep()
+            # ``budget`` snapshot supersedes the flat cost/elapsed fields when
+            # present (added with token caps + per-call-site breakdown).
+            budget = p.get("budget") or {}
+            cost = budget.get("cost_usd", p.get("cost_usd", 0))
+            elapsed = budget.get("elapsed_seconds", p.get("elapsed_seconds", 0))
             print(
                 f"Confidence: {p.get('confidence', 0):.2f}  |  "
                 f"Replans: {p.get('replan_count', 0)}  |  "
-                f"Cost: ${p.get('cost_usd', 0):.4f}  |  "
-                f"Time: {p.get('elapsed_seconds', 0):.1f}s",
+                f"Cost: ${cost:.4f}  |  "
+                f"Time: {elapsed:.1f}s",
                 file=self._out,
             )
+            self.render_budget(budget)
         elif t == EventType.ERROR:
             print(f"\n[error]      {event.error}", file=sys.stderr)
+    def render_budget(self, budget: dict | None) -> None:
+        """Print tokens + per-call-site breakdown from a ``BudgetGuard.snapshot()``
+        dict. Safe to call with ``{}`` or ``None`` — prints nothing when
+        there's no usage to show.
+        Exposed publicly so demos and other consumers that own their own
+        DONE / TASK_DONE rendering can still surface the breakdown without
+        duplicating the formatting.
+        """
+        if not budget:
+            return
+        tokens_in = budget.get("tokens_in")
+        tokens_out = budget.get("tokens_out")
+        if tokens_in is not None or tokens_out is not None:
+            print(
+                f"Tokens:     in={int(tokens_in or 0):,}  out={int(tokens_out or 0):,}",
+                file=self._out,
+            )
+        breakdown = budget.get("breakdown") or {}
+        if breakdown:
+            # Right-pad the slot label so columns line up — matters when
+            # the demo prints multiple slots in sequence.
+            width = max(len(name) for name in breakdown)
+            for slot, stats in breakdown.items():
+                print(
+                    f"  {slot:<{width}}  "
+                    f"in={int(stats.get('tokens_in', 0)):>7,}  "
+                    f"out={int(stats.get('tokens_out', 0)):>6,}",
+                    file=self._out,
+                )
     # ── private helpers ───────────────────────────────────────────────────────
     def _label(self, event: BusEvent) -> str:

{react_agent_harness-0.6.0 → react_agent_harness-0.7.0}/harness/llm/anthropic.py RENAMED Viewed

@@ -91,6 +91,8 @@ class AnthropicLLM:
         self,
         system: str | None,
         messages: list[dict],
+        *,
+        source: str | None = None,
         **kwargs: Any,
     ) -> dict:
         max_tokens = int(kwargs.pop("max_tokens", self._max_tokens))
@@ -110,7 +112,7 @@ class AnthropicLLM:
         cost = _compute_cost(usage, self._cost_fn)
         if cost is not None:
             usage["cost_usd"] = cost
-        self._record_cost(usage)
+        self._record_usage(usage, source=source)
         self.last_usage = usage
         text = _collect_text(resp.content)
@@ -122,6 +124,8 @@ class AnthropicLLM:
         self,
         system: str | None,
         messages: list[dict],
+        *,
+        source: str | None = None,
     ) -> AsyncGenerator[str, None]:
         sys_blocks = _system_blocks(system, prompt_caching=self._prompt_caching)
         built_messages = _build_messages(messages, prompt_caching=self._prompt_caching)
@@ -143,17 +147,34 @@ class AnthropicLLM:
             cost = _compute_cost(usage, self._cost_fn)
             if cost is not None:
                 usage["cost_usd"] = cost
-            self._record_cost(usage)
+            self._record_usage(usage, source=source)
             self.last_usage = usage
     # ── Internals ─────────────────────────────────────────────────────────────
-    def _record_cost(self, usage: dict) -> None:
-        if not self._budget:
+    def _record_usage(self, usage: dict, *, source: str | None) -> None:
+        """Forward usage to the budget guard.
+        Token count for budget purposes is the total input that hit the wire
+        — non-cached + cache-creation + cache-read — so token caps reflect
+        real wall-clock consumption regardless of cache hit rate. Cost
+        (which respects cache pricing via ``cost_fn``) is reported when
+        known.
+        """
+        guard = self._budget
+        if not guard:
             return
+        tokens_in = (
+            int(usage.get("tokens_in") or 0)
+            + int(usage.get("cache_read_tokens") or 0)
+            + int(usage.get("cache_creation_tokens") or 0)
+        )
+        tokens_out = int(usage.get("tokens_out") or 0)
+        if (tokens_in or tokens_out) and hasattr(guard, "add_tokens"):
+            guard.add_tokens(tokens_in, tokens_out, source=source)
         cost = usage.get("cost_usd")
         if cost and cost > 0:
-            self._budget.add_cost(cost)
+            guard.add_cost(cost, source=source)
 # ── Module-level helpers ──────────────────────────────────────────────────────

{react_agent_harness-0.6.0 → react_agent_harness-0.7.0}/harness/llm/claude_code.py RENAMED Viewed

@@ -68,12 +68,24 @@ class ClaudeCodeLLM:
         self._user_agent = user_agent or _default_user_agent()
         self._betas = betas
         self._prompt_caching = prompt_caching
+        self._budget: Any = None
         self.last_usage: dict | None = None
+    def set_budget(self, guard: Any) -> None:
+        """Inject a BudgetGuard so token caps fire on subscription-auth runs.
+        Cost stays 0 (no pricing schedule available for the subscription
+        tier), but ``add_tokens`` still lands so ``max_input_tokens`` /
+        ``max_output_tokens`` are enforced.
+        """
+        self._budget = guard
     async def complete(
         self,
         system: str | None,
         messages: list[dict],
+        *,
+        source: str | None = None,
         **kwargs: Any,
     ) -> dict:
         """Collect the streaming response into a single text + usage dict.
@@ -84,7 +96,9 @@ class ClaudeCodeLLM:
         """
         max_tokens = int(kwargs.pop("max_tokens", self._max_tokens))
         parts: list[str] = []
-        async for delta in self._iter_stream(system, messages, max_tokens=max_tokens, extra=kwargs):
+        async for delta in self._iter_stream(
+            system, messages, max_tokens=max_tokens, extra=kwargs, source=source
+        ):
             parts.append(delta)
         text = "".join(parts)
         if not text:
@@ -95,9 +109,11 @@ class ClaudeCodeLLM:
         self,
         system: str | None,
         messages: list[dict],
+        *,
+        source: str | None = None,
     ) -> AsyncGenerator[str, None]:
         async for delta in self._iter_stream(
-            system, messages, max_tokens=self._max_tokens, extra={}
+            system, messages, max_tokens=self._max_tokens, extra={}, source=source
         ):
             yield delta
@@ -114,6 +130,7 @@ class ClaudeCodeLLM:
         *,
         max_tokens: int,
         extra: dict[str, Any],
+        source: str | None = None,
     ) -> AsyncGenerator[str, None]:
         """Single source of truth: open Anthropic SSE stream, yield text
         deltas, populate `self.last_usage`. Auth refresh on 401/403
@@ -182,10 +199,31 @@ class ClaudeCodeLLM:
                     "total_tokens": tokens_in + tokens_out,
                     "provider": "claude-code",
                 }
+                self._record_usage(self.last_usage, source=source)
                 return
         raise RuntimeError("Claude Code authentication failed after refresh")
+    def _record_usage(self, usage: dict, *, source: str | None) -> None:
+        """Report token totals to the budget guard.
+        Tokens budgeted = total input that hit the wire (non-cached +
+        cache-creation + cache-read) plus output tokens — so ``max_input_tokens``
+        / ``max_output_tokens`` reflect real consumption regardless of cache
+        hit rate. No cost is reported (subscription auth, no pricing).
+        """
+        guard = self._budget
+        if not guard or not hasattr(guard, "add_tokens"):
+            return
+        tokens_in = (
+            int(usage.get("tokens_in") or 0)
+            + int(usage.get("cache_read_tokens") or 0)
+            + int(usage.get("cache_creation_tokens") or 0)
+        )
+        tokens_out = int(usage.get("tokens_out") or 0)
+        if tokens_in or tokens_out:
+            guard.add_tokens(tokens_in, tokens_out, source=source)
     async def _get_client(self) -> Any:
         if self._client is None:
             try:

{react_agent_harness-0.6.0 → react_agent_harness-0.7.0}/harness/llm/fallback.py RENAMED Viewed

@@ -123,6 +123,8 @@ class FallbackLLM:
         self,
         system: str | None,
         messages: list[dict],
+        *,
+        source: str | None = None,
     ) -> AsyncGenerator[str, None]:
         """Stream from the first adapter that doesn't fail before yielding.
@@ -136,7 +138,7 @@ class FallbackLLM:
             if not hasattr(llm, "stream_complete"):
                 continue
             try:
-                gen = llm.stream_complete(system, messages)
+                gen = llm.stream_complete(system, messages, source=source)
                 first = await _peek_first(gen)
             except BaseException as exc:
                 if i == len(self._llms) - 1 or not self._is_transient(exc):

{react_agent_harness-0.6.0 → react_agent_harness-0.7.0}/harness/llm/openai.py RENAMED Viewed

@@ -101,6 +101,8 @@ class OpenAILLM:
         self,
         system: str | None,
         messages: list[dict],
+        *,
+        source: str | None = None,
         **kwargs: Any,
     ) -> dict:
         full_messages = _prepend_system(system, messages)
@@ -120,7 +122,7 @@ class OpenAILLM:
         resp = raw.parse()
         headers = _headers_dict(raw)
         usage = self._build_usage(resp, headers)
-        self._record_cost(usage)
+        self._record_usage(usage, source=source)
         self.last_usage = usage
         content = resp.choices[0].message.content or ""
@@ -132,6 +134,8 @@ class OpenAILLM:
         self,
         system: str | None,
         messages: list[dict],
+        *,
+        source: str | None = None,
     ) -> AsyncGenerator[str, None]:
         full_messages = _prepend_system(system, messages)
         # include_usage adds a final SSE chunk with the same usage block as
@@ -156,7 +160,7 @@ class OpenAILLM:
         if final_chunk is not None:
             usage = self._build_usage(final_chunk, headers)
-            self._record_cost(usage)
+            self._record_usage(usage, source=source)
             self.last_usage = usage
     # ── Internals ─────────────────────────────────────────────────────────────
@@ -185,12 +189,24 @@ class OpenAILLM:
             usage["cost_usd"] = cost
         return usage
-    def _record_cost(self, usage: dict) -> None:
-        if not self._budget:
+    def _record_usage(self, usage: dict, *, source: str | None) -> None:
+        """Forward usage to the budget guard.
+        Tokens are reported on every call (even when no ``cost_fn`` is wired)
+        so token-based caps still fire. Cost is forwarded only when known.
+        Both calls accept the per-call-site ``source`` tag so the guard's
+        breakdown attributes spending to the right slot.
+        """
+        guard = self._budget
+        if not guard:
             return
+        tokens_in = int(usage.get("tokens_in") or 0)
+        tokens_out = int(usage.get("tokens_out") or 0)
+        if (tokens_in or tokens_out) and hasattr(guard, "add_tokens"):
+            guard.add_tokens(tokens_in, tokens_out, source=source)
         cost = usage.get("cost_usd")
         if cost and cost > 0:
-            self._budget.add_cost(cost)
+            guard.add_cost(cost, source=source)
 # ── Module-level helpers ─────────────────────────────────────────────────────

react-agent-harness 0.6.0__tar.gz → 0.7.0__tar.gz

react-agent-harness 0.6.0tar.gz → 0.7.0tar.gz