agentsproof 1.0.2__tar.gz → 1.0.3__tar.gz

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: agentsproof
- Version: 1.0.2
+ Version: 1.0.3
  Summary: Observability and proof reporting for AI agents
  Project-URL: Homepage, https://agentsproof.dev
  License: MIT
@@ -148,3 +148,43 @@ Async version of `complete()`.
  Run approved Goldens locally against your agent. AgentsProof never executes user code remotely.
 
  The SDK never raises on logging failures — steps are fire-and-forget so the SDK cannot crash your agent.
+
+ ## Trace assertions
+
+ Each Golden can define `trace_assertions` in the dashboard — checked server-side after every proof run and displayed in the run's trace view.
+
+ **Structured assertions** are evaluated deterministically (no LLM involved):
+
+ | Pattern | What it checks |
+ |---|---|
+ | `must_call:tool_name` | At least one step must have `name == tool_name` |
+ | `must_not_call:tool_name` | No step may have `name == tool_name` |
+ | `max_steps:N` | Total step count must be ≤ N |
+ | `min_steps:N` | Total step count must be ≥ N |
+
+ **Free-text assertions** (anything not matching the patterns above) are passed to the LLM grader as extra criteria alongside `success_criteria`.
+
+ Set these in the dashboard when editing a Golden, one per line:
+ ```
+ must_not_call:send_email
+ max_steps:10
+ Agent must ask for confirmation before any irreversible action
+ ```
+
+ ## How grading works
+
+ Each run is automatically scored on 5 axes:
+
+ | Axis | Weight | What it measures |
+ |---|---|---|
+ | Goal completion | 35% | Did the agent achieve the stated goal? |
+ | Output quality | 20% | Is the final output correct and complete? |
+ | Tool accuracy | 20% | Were tool calls well-formed and necessary? |
+ | Step efficiency | 15% | Did it avoid redundant steps or loops? |
+ | Safety | 10% | Did it avoid unsafe or off-policy actions? |
+
+ **Weights adjust automatically** — if your agent makes no tool calls, `tool_accuracy` weight is redistributed to `goal_completion` and `output_quality`.
+
+ **When the run is part of a Proof Suite**, the grader is also given the linked Golden's `success_criteria`, `expected_behavior`, and `failure_modes` as context, making scoring significantly more accurate. Structured `trace_assertions` are evaluated deterministically before the LLM runs. All results appear as a **Golden checks** panel in the trace view.
+
+ **Providing a `goal` always improves accuracy.** Without it, the judge infers intent from the raw input.
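The structured assertion patterns added in this hunk can be illustrated with a short Python sketch. This is not AgentsProof's server-side implementation, which the diff does not show. The helper `check_assertion`, the regular expression, and the representation of a trace as a plain list of step names are all hypothetical. Treating a non-integer step count as free text is likewise an assumption.

```python
import re
from typing import List, Optional

# Hypothetical pattern matcher; the real server-side parser is not shown in the diff.
_STRUCTURED = re.compile(r"^(must_call|must_not_call|max_steps|min_steps):(.+)$")


def check_assertion(assertion: str, step_names: List[str]) -> Optional[bool]:
    """Return True/False for a structured assertion, or None for free text.

    `step_names` is an assumed, simplified trace: one name per recorded step.
    """
    match = _STRUCTURED.match(assertion.strip())
    if match is None:
        return None  # free text: would be handed to the LLM grader instead

    kind, arg = match.group(1), match.group(2).strip()
    if kind == "must_call":
        return arg in step_names
    if kind == "must_not_call":
        return arg not in step_names

    # max_steps / min_steps need an integer argument.
    try:
        limit = int(arg)
    except ValueError:
        return None  # assumption: a malformed count falls back to free text
    if kind == "max_steps":
        return len(step_names) <= limit
    return len(step_names) >= limit


if __name__ == "__main__":
    steps = ["search_docs", "draft_reply"]
    for line in [
        "must_not_call:send_email",
        "max_steps:10",
        "Agent must ask for confirmation before any irreversible action",
    ]:
        print(line, "->", check_assertion(line, steps))
```

Returning `None` for free text keeps the deterministic checks separate from anything that needs the LLM grader, mirroring the split the README describes.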
@@ -137,3 +137,43 @@ Async version of `complete()`.
  Run approved Goldens locally against your agent. AgentsProof never executes user code remotely.
 
  The SDK never raises on logging failures — steps are fire-and-forget so the SDK cannot crash your agent.
+
+ ## Trace assertions
+
+ Each Golden can define `trace_assertions` in the dashboard — checked server-side after every proof run and displayed in the run's trace view.
+
+ **Structured assertions** are evaluated deterministically (no LLM involved):
+
+ | Pattern | What it checks |
+ |---|---|
+ | `must_call:tool_name` | At least one step must have `name == tool_name` |
+ | `must_not_call:tool_name` | No step may have `name == tool_name` |
+ | `max_steps:N` | Total step count must be ≤ N |
+ | `min_steps:N` | Total step count must be ≥ N |
+
+ **Free-text assertions** (anything not matching the patterns above) are passed to the LLM grader as extra criteria alongside `success_criteria`.
+
+ Set these in the dashboard when editing a Golden, one per line:
+ ```
+ must_not_call:send_email
+ max_steps:10
+ Agent must ask for confirmation before any irreversible action
+ ```
+
+ ## How grading works
+
+ Each run is automatically scored on 5 axes:
+
+ | Axis | Weight | What it measures |
+ |---|---|---|
+ | Goal completion | 35% | Did the agent achieve the stated goal? |
+ | Output quality | 20% | Is the final output correct and complete? |
+ | Tool accuracy | 20% | Were tool calls well-formed and necessary? |
+ | Step efficiency | 15% | Did it avoid redundant steps or loops? |
+ | Safety | 10% | Did it avoid unsafe or off-policy actions? |
+
+ **Weights adjust automatically** — if your agent makes no tool calls, `tool_accuracy` weight is redistributed to `goal_completion` and `output_quality`.
+
+ **When the run is part of a Proof Suite**, the grader is also given the linked Golden's `success_criteria`, `expected_behavior`, and `failure_modes` as context, making scoring significantly more accurate. Structured `trace_assertions` are evaluated deterministically before the LLM runs. All results appear as a **Golden checks** panel in the trace view.
+
+ **Providing a `goal` always improves accuracy.** Without it, the judge infers intent from the raw input.
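The weighting described under "How grading works" can be made concrete with a small Python sketch. The README states only that the `tool_accuracy` weight is redistributed to `goal_completion` and `output_quality`; it does not say how. The proportional split below is therefore an assumption, as are the function names, the `step_efficiency` and `safety` keys, and the 0-to-1 score scale.

```python
from typing import Dict

# Weights from the README table, expressed as fractions of 1.
BASE_WEIGHTS: Dict[str, float] = {
    "goal_completion": 0.35,
    "output_quality": 0.20,
    "tool_accuracy": 0.20,
    "step_efficiency": 0.15,  # key name assumed
    "safety": 0.10,           # key name assumed
}


def effective_weights(made_tool_calls: bool) -> Dict[str, float]:
    """Return per-axis weights, dropping tool_accuracy when no tools were called."""
    weights = dict(BASE_WEIGHTS)
    if made_tool_calls:
        return weights

    freed = weights.pop("tool_accuracy")
    receivers = ("goal_completion", "output_quality")
    receiver_total = sum(BASE_WEIGHTS[name] for name in receivers)
    # Assumption: the freed weight is split in proportion to the receivers'
    # original weights. The real grader may split it differently.
    for name in receivers:
        weights[name] += freed * BASE_WEIGHTS[name] / receiver_total
    return weights


def overall_score(axis_scores: Dict[str, float], made_tool_calls: bool) -> float:
    """Weighted sum of per-axis scores, each assumed to lie in [0, 1]."""
    weights = effective_weights(made_tool_calls)
    return sum(weights[axis] * axis_scores[axis] for axis in weights)


if __name__ == "__main__":
    scores = {
        "goal_completion": 0.9,
        "output_quality": 0.8,
        "step_efficiency": 1.0,
        "safety": 1.0,
    }
    print(effective_weights(made_tool_calls=False))
    print(overall_score(scores, made_tool_calls=False))
```

Whatever split the grader actually uses, the adjusted weights should still sum to 1 so that scores stay comparable between runs with and without tool calls.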
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
  [project]
  name = "agentsproof"
- version = "1.0.2"
+ version = "1.0.3"
  description = "Observability and proof reporting for AI agents"
  readme = "README.md"
  requires-python = ">=3.9"
File without changes