PyPI - agentsproof - Versions diffs - 1.0.2__tar.gz → 1.0.4__tar.gz - Mend

agentsproof 1.0.2tar.gz → 1.0.4tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10) hide show

{agentsproof-1.0.2 → agentsproof-1.0.4}/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: agentsproof
-Version: 1.0.2
+Version: 1.0.4
 Summary: Observability and proof reporting for AI agents
 Project-URL: Homepage, https://agentsproof.dev
 License: MIT
@@ -129,12 +129,31 @@ Create a client. Get your API key from [agentsproof.dev](https://agentsproof.dev
 | `expected_output` | `Any` | no | Expected output for grading comparison |
 | `metadata` | `dict` | no | Optional key/value metadata |
-### `run.trace(type, name, fn, input?)` → `T`
+### `run.trace(type, name, fn, input?, extract?)` → `T`
 Wrap a **sync** callable and auto-log it as a step with latency captured.
-### `run.atrace(type, name, fn, input?)` → `Awaitable[T]`
+### `run.atrace(type, name, fn, input?, extract?)` → `Awaitable[T]`
 Wrap a **sync or async** callable. Use in `async` agent code.
+Token count and cost are captured automatically in priority order:
+1. **`extract`** — your own callable, receives the step output, returns `{"token_count": int, "cost_usd": float}` (both optional).
+2. **Auto-detection** — if no extractor is given, the SDK sniffs `output.usage` (or `output["usage"]`) for Anthropic (`input_tokens + output_tokens`) and OpenAI-compatible (`total_tokens` or `prompt_tokens + completion_tokens`) shapes.
+3. **null** — if neither works, both fields are omitted from the step.
+```python
+# Anthropic / OpenAI — auto-detected, no extra code needed
+result = run.trace("llm_call", "claude", lambda: anthropic_call(prompt))
+# Any other provider — supply an extractor
+result = run.trace(
+    "llm_call", "my-model",
+    lambda: call_my_llm(prompt),
+    input=prompt,
+    extract=lambda out: {"token_count": out.usage.tokens, "cost_usd": out.billed_usd},
+)
+```
 ### `run.log_step(payload)`
 Manually log a step. Step types: `llm_call` | `tool_call` | `tool_result` | `memory_read` | `memory_write`.
@@ -148,3 +167,43 @@ Async version of `complete()`.
 Run approved Goldens locally against your agent. AgentsProof never executes user code remotely.
 The SDK never raises on logging failures — steps are fire-and-forget so the SDK cannot crash your agent.
+## Trace assertions
+Each Golden can define `trace_assertions` in the dashboard — checked server-side after every proof run and displayed in the run's trace view.
+**Structured assertions** are evaluated deterministically (no LLM involved):
+| Pattern | What it checks |
+|---|---|
+| `must_call:tool_name` | At least one step must have `name == tool_name` |
+| `must_not_call:tool_name` | No step may have `name == tool_name` |
+| `max_steps:N` | Total step count must be ≤ N |
+| `min_steps:N` | Total step count must be ≥ N |
+**Free-text assertions** (anything not matching the patterns above) are passed to the LLM grader as extra criteria alongside `success_criteria`.
+Set these in the dashboard when editing a Golden, one per line:
+```
+must_not_call:send_email
+max_steps:10
+Agent must ask for confirmation before any irreversible action
+```
+## How grading works
+Each run is automatically scored on 5 axes:
+| Axis | Weight | What it measures |
+|---|---|---|
+| Goal completion | 35% | Did the agent achieve the stated goal? |
+| Output quality | 20% | Is the final output correct and complete? |
+| Tool accuracy | 20% | Were tool calls well-formed and necessary? |
+| Step efficiency | 15% | Did it avoid redundant steps or loops? |
+| Safety | 10% | Did it avoid unsafe or off-policy actions? |
+**Weights adjust automatically** — if your agent makes no tool calls, `tool_accuracy` weight is redistributed to `goal_completion` and `output_quality`.
+**When the run is part of a Proof Suite**, the grader is also given the linked Golden's `success_criteria`, `expected_behavior`, and `failure_modes` as context, making scoring significantly more accurate. Structured `trace_assertions` are evaluated deterministically before the LLM runs. All results appear as a **Golden checks** panel in the trace view.
+**Providing a `goal` always improves accuracy.** Without it, the judge infers intent from the raw input.

{agentsproof-1.0.2 → agentsproof-1.0.4}/README.md RENAMED Viewed

@@ -118,12 +118,31 @@ Create a client. Get your API key from [agentsproof.dev](https://agentsproof.dev
 | `expected_output` | `Any` | no | Expected output for grading comparison |
 | `metadata` | `dict` | no | Optional key/value metadata |
-### `run.trace(type, name, fn, input?)` → `T`
+### `run.trace(type, name, fn, input?, extract?)` → `T`
 Wrap a **sync** callable and auto-log it as a step with latency captured.
-### `run.atrace(type, name, fn, input?)` → `Awaitable[T]`
+### `run.atrace(type, name, fn, input?, extract?)` → `Awaitable[T]`
 Wrap a **sync or async** callable. Use in `async` agent code.
+Token count and cost are captured automatically in priority order:
+1. **`extract`** — your own callable, receives the step output, returns `{"token_count": int, "cost_usd": float}` (both optional).
+2. **Auto-detection** — if no extractor is given, the SDK sniffs `output.usage` (or `output["usage"]`) for Anthropic (`input_tokens + output_tokens`) and OpenAI-compatible (`total_tokens` or `prompt_tokens + completion_tokens`) shapes.
+3. **null** — if neither works, both fields are omitted from the step.
+```python
+# Anthropic / OpenAI — auto-detected, no extra code needed
+result = run.trace("llm_call", "claude", lambda: anthropic_call(prompt))
+# Any other provider — supply an extractor
+result = run.trace(
+    "llm_call", "my-model",
+    lambda: call_my_llm(prompt),
+    input=prompt,
+    extract=lambda out: {"token_count": out.usage.tokens, "cost_usd": out.billed_usd},
+)
+```
 ### `run.log_step(payload)`
 Manually log a step. Step types: `llm_call` | `tool_call` | `tool_result` | `memory_read` | `memory_write`.
@@ -137,3 +156,43 @@ Async version of `complete()`.
 Run approved Goldens locally against your agent. AgentsProof never executes user code remotely.
 The SDK never raises on logging failures — steps are fire-and-forget so the SDK cannot crash your agent.
+## Trace assertions
+Each Golden can define `trace_assertions` in the dashboard — checked server-side after every proof run and displayed in the run's trace view.
+**Structured assertions** are evaluated deterministically (no LLM involved):
+| Pattern | What it checks |
+|---|---|
+| `must_call:tool_name` | At least one step must have `name == tool_name` |
+| `must_not_call:tool_name` | No step may have `name == tool_name` |
+| `max_steps:N` | Total step count must be ≤ N |
+| `min_steps:N` | Total step count must be ≥ N |
+**Free-text assertions** (anything not matching the patterns above) are passed to the LLM grader as extra criteria alongside `success_criteria`.
+Set these in the dashboard when editing a Golden, one per line:
+```
+must_not_call:send_email
+max_steps:10
+Agent must ask for confirmation before any irreversible action
+```
+## How grading works
+Each run is automatically scored on 5 axes:
+| Axis | Weight | What it measures |
+|---|---|---|
+| Goal completion | 35% | Did the agent achieve the stated goal? |
+| Output quality | 20% | Is the final output correct and complete? |
+| Tool accuracy | 20% | Were tool calls well-formed and necessary? |
+| Step efficiency | 15% | Did it avoid redundant steps or loops? |
+| Safety | 10% | Did it avoid unsafe or off-policy actions? |
+**Weights adjust automatically** — if your agent makes no tool calls, `tool_accuracy` weight is redistributed to `goal_completion` and `output_quality`.
+**When the run is part of a Proof Suite**, the grader is also given the linked Golden's `success_criteria`, `expected_behavior`, and `failure_modes` as context, making scoring significantly more accurate. Structured `trace_assertions` are evaluated deterministically before the LLM runs. All results appear as a **Golden checks** panel in the trace view.
+**Providing a `goal` always improves accuracy.** Without it, the judge infers intent from the raw input.

{agentsproof-1.0.2 → agentsproof-1.0.4}/agentsproof/run.py RENAMED Viewed

@@ -5,7 +5,7 @@ import threading
 import time
 import uuid
 from datetime import datetime, timezone
-from typing import Any, Callable, Optional, TypeVar
+from typing import Any, Callable, Dict, Optional, TypeVar
 import httpx
@@ -14,6 +14,38 @@ from .types import StepPayload, StepType
 T = TypeVar("T")
+def _get(obj: Any, key: str) -> Any:
+    """Get a field from either a dict or an object attribute."""
+    if isinstance(obj, dict):
+        return obj.get(key)
+    return getattr(obj, key, None)
+def _sniff_usage(output: Any) -> Dict[str, Any]:
+    """Best-effort extraction of token count from well-known LLM response shapes."""
+    try:
+        usage = _get(output, "usage") if output is not None else None
+        if usage is None:
+            return {}
+        # Anthropic: input_tokens + output_tokens
+        input_tokens = _get(usage, "input_tokens")
+        output_tokens = _get(usage, "output_tokens")
+        if isinstance(input_tokens, int) and isinstance(output_tokens, int):
+            return {"token_count": input_tokens + output_tokens}
+        # OpenAI-compatible: total_tokens
+        total_tokens = _get(usage, "total_tokens")
+        if isinstance(total_tokens, int):
+            return {"token_count": total_tokens}
+        # OpenAI-compatible: prompt_tokens + completion_tokens
+        prompt_tokens = _get(usage, "prompt_tokens")
+        completion_tokens = _get(usage, "completion_tokens")
+        if isinstance(prompt_tokens, int) and isinstance(completion_tokens, int):
+            return {"token_count": prompt_tokens + completion_tokens}
+    except Exception:
+        pass
+    return {}
 class AgentRun:
     def __init__(
         self,
@@ -93,8 +125,21 @@ class AgentRun:
         threading.Thread(target=_send, daemon=True).start()
-    def trace(self, type: StepType, name: str, fn: Callable[[], T], input: Any = None) -> T:
-        """Wrap a sync callable and auto-log it as a step with latency captured."""
+    def trace(
+        self,
+        type: StepType,
+        name: str,
+        fn: Callable[[], T],
+        input: Any = None,
+        extract: Optional[Callable[[Any], Dict[str, Any]]] = None,
+    ) -> T:
+        """Wrap a sync callable and auto-log it as a step with latency captured.
+        ``extract`` receives the return value of ``fn`` and should return a dict
+        with optional keys ``token_count`` (int) and ``cost_usd`` (float). When
+        omitted, the SDK attempts to detect usage from Anthropic / OpenAI-compatible
+        response shapes automatically. Falls back to null if neither works.
+        """
         t0 = time.monotonic()
         try:
             result = fn()
@@ -107,17 +152,36 @@ class AgentRun:
                 "latency_ms": (time.monotonic() - t0) * 1000,
             })
             raise
+        usage: Dict[str, Any] = {}
+        if extract is not None:
+            try:
+                usage = extract(result) or {}
+            except Exception:
+                pass
+        else:
+            usage = _sniff_usage(result)
         self.log_step({
             "type": type,
             "name": name,
             "input": input,
             "output": result,
             "latency_ms": (time.monotonic() - t0) * 1000,
+            **usage,
         })
         return result
-    async def atrace(self, type: StepType, name: str, fn: Callable[[], Any], input: Any = None) -> Any:
-        """Wrap a sync or async callable and auto-log it as a step. Use in async contexts."""
+    async def atrace(
+        self,
+        type: StepType,
+        name: str,
+        fn: Callable[[], Any],
+        input: Any = None,
+        extract: Optional[Callable[[Any], Dict[str, Any]]] = None,
+    ) -> Any:
+        """Wrap a sync or async callable and auto-log it as a step. Use in async contexts.
+        See ``trace()`` for docs on the ``extract`` parameter.
+        """
         t0 = time.monotonic()
         try:
             result = await fn() if inspect.iscoroutinefunction(fn) else fn()
@@ -130,12 +194,21 @@ class AgentRun:
                 "latency_ms": (time.monotonic() - t0) * 1000,
             })
             raise
+        usage: Dict[str, Any] = {}
+        if extract is not None:
+            try:
+                usage = extract(result) or {}
+            except Exception:
+                pass
+        else:
+            usage = _sniff_usage(result)
         self.log_step({
             "type": type,
             "name": name,
             "input": input,
             "output": result,
             "latency_ms": (time.monotonic() - t0) * 1000,
+            **usage,
         })
         return result

{agentsproof-1.0.2 → agentsproof-1.0.4}/pyproject.toml RENAMED Viewed

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "agentsproof"
-version = "1.0.2"
+version = "1.0.4"
 description = "Observability and proof reporting for AI agents"
 readme = "README.md"
 requires-python = ">=3.9"

{agentsproof-1.0.2 → agentsproof-1.0.4}/.gitignore RENAMED Viewed

File without changes

{agentsproof-1.0.2 → agentsproof-1.0.4}/agentsproof/__init__.py RENAMED Viewed

File without changes

{agentsproof-1.0.2 → agentsproof-1.0.4}/agentsproof/client.py RENAMED Viewed

File without changes

{agentsproof-1.0.2 → agentsproof-1.0.4}/agentsproof/proof_suite.py RENAMED Viewed

File without changes

{agentsproof-1.0.2 → agentsproof-1.0.4}/agentsproof/tracer.py RENAMED Viewed

File without changes

{agentsproof-1.0.2 → agentsproof-1.0.4}/agentsproof/types.py RENAMED Viewed

File without changes

agentsproof 1.0.2__tar.gz → 1.0.4__tar.gz

agentsproof 1.0.2tar.gz → 1.0.4tar.gz