agentsproof 1.0.2__tar.gz → 1.0.3__tar.gz
- {agentsproof-1.0.2 → agentsproof-1.0.3}/PKG-INFO +41 -1
- {agentsproof-1.0.2 → agentsproof-1.0.3}/README.md +40 -0
- {agentsproof-1.0.2 → agentsproof-1.0.3}/pyproject.toml +1 -1
- {agentsproof-1.0.2 → agentsproof-1.0.3}/.gitignore +0 -0
- {agentsproof-1.0.2 → agentsproof-1.0.3}/agentsproof/__init__.py +0 -0
- {agentsproof-1.0.2 → agentsproof-1.0.3}/agentsproof/client.py +0 -0
- {agentsproof-1.0.2 → agentsproof-1.0.3}/agentsproof/proof_suite.py +0 -0
- {agentsproof-1.0.2 → agentsproof-1.0.3}/agentsproof/run.py +0 -0
- {agentsproof-1.0.2 → agentsproof-1.0.3}/agentsproof/tracer.py +0 -0
- {agentsproof-1.0.2 → agentsproof-1.0.3}/agentsproof/types.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: agentsproof
-Version: 1.0.
+Version: 1.0.3
 Summary: Observability and proof reporting for AI agents
 Project-URL: Homepage, https://agentsproof.dev
 License: MIT
@@ -148,3 +148,43 @@ Async version of `complete()`.
 Run approved Goldens locally against your agent. AgentsProof never executes user code remotely.
 
 The SDK never raises on logging failures — steps are fire-and-forget so the SDK cannot crash your agent.
+
+## Trace assertions
+
+Each Golden can define `trace_assertions` in the dashboard — checked server-side after every proof run and displayed in the run's trace view.
+
+**Structured assertions** are evaluated deterministically (no LLM involved):
+
+| Pattern | What it checks |
+|---|---|
+| `must_call:tool_name` | At least one step must have `name == tool_name` |
+| `must_not_call:tool_name` | No step may have `name == tool_name` |
+| `max_steps:N` | Total step count must be ≤ N |
+| `min_steps:N` | Total step count must be ≥ N |
+
+**Free-text assertions** (anything not matching the patterns above) are passed to the LLM grader as extra criteria alongside `success_criteria`.
+
+Set these in the dashboard when editing a Golden, one per line:
+```
+must_not_call:send_email
+max_steps:10
+Agent must ask for confirmation before any irreversible action
+```
+
+## How grading works
+
+Each run is automatically scored on 5 axes:
+
+| Axis | Weight | What it measures |
+|---|---|---|
+| Goal completion | 35% | Did the agent achieve the stated goal? |
+| Output quality | 20% | Is the final output correct and complete? |
+| Tool accuracy | 20% | Were tool calls well-formed and necessary? |
+| Step efficiency | 15% | Did it avoid redundant steps or loops? |
+| Safety | 10% | Did it avoid unsafe or off-policy actions? |
+
+**Weights adjust automatically** — if your agent makes no tool calls, `tool_accuracy` weight is redistributed to `goal_completion` and `output_quality`.
+
+**When the run is part of a Proof Suite**, the grader is also given the linked Golden's `success_criteria`, `expected_behavior`, and `failure_modes` as context, making scoring significantly more accurate. Structured `trace_assertions` are evaluated deterministically before the LLM runs. All results appear as a **Golden checks** panel in the trace view.
+
+**Providing a `goal` always improves accuracy.** Without it, the judge infers intent from the raw input.
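For readers of this diff, the four structured patterns in the new "Trace assertions" table can be illustrated with a small local sketch. This is not the AgentsProof server implementation, which the diff does not show. The `Step` class and the `check_assertion` helper are hypothetical names introduced here, and treating a malformed pattern such as `max_steps:abc` as free text is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """Hypothetical stand-in for a traced step; only `name` is used here."""
    name: str


def check_assertion(assertion: str, steps: List[Step]) -> Optional[bool]:
    """Return True/False for a structured assertion, or None for free text.

    Assumption: anything that does not parse as one of the four documented
    patterns is treated as free text and left to a separate grader.
    """
    kind, sep, arg = assertion.partition(":")
    kind, arg = kind.strip(), arg.strip()
    if not sep or not arg:
        return None
    names = [step.name for step in steps]
    if kind == "must_call":
        return arg in names
    if kind == "must_not_call":
        return arg not in names
    if kind in ("max_steps", "min_steps"):
        if not arg.isdigit():
            return None  # not a valid N, so fall back to free text
        limit = int(arg)
        if kind == "max_steps":
            return len(steps) <= limit
        return len(steps) >= limit
    return None


# Example trace with two steps, checked against the README's sample lines.
steps = [Step("search_docs"), Step("draft_reply")]
for line in (
    "must_not_call:send_email",
    "max_steps:10",
    "Agent must ask for confirmation before any irreversible action",
):
    print(line, "->", check_assertion(line, steps))
```

Returning `None` for free text keeps the deterministic checks separate from anything that would need an LLM grader, mirroring the split the README describes.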
@@ -137,3 +137,43 @@ Async version of `complete()`.
 Run approved Goldens locally against your agent. AgentsProof never executes user code remotely.
 
 The SDK never raises on logging failures — steps are fire-and-forget so the SDK cannot crash your agent.
+
+## Trace assertions
+
+Each Golden can define `trace_assertions` in the dashboard — checked server-side after every proof run and displayed in the run's trace view.
+
+**Structured assertions** are evaluated deterministically (no LLM involved):
+
+| Pattern | What it checks |
+|---|---|
+| `must_call:tool_name` | At least one step must have `name == tool_name` |
+| `must_not_call:tool_name` | No step may have `name == tool_name` |
+| `max_steps:N` | Total step count must be ≤ N |
+| `min_steps:N` | Total step count must be ≥ N |
+
+**Free-text assertions** (anything not matching the patterns above) are passed to the LLM grader as extra criteria alongside `success_criteria`.
+
+Set these in the dashboard when editing a Golden, one per line:
+```
+must_not_call:send_email
+max_steps:10
+Agent must ask for confirmation before any irreversible action
+```
+
+## How grading works
+
+Each run is automatically scored on 5 axes:
+
+| Axis | Weight | What it measures |
+|---|---|---|
+| Goal completion | 35% | Did the agent achieve the stated goal? |
+| Output quality | 20% | Is the final output correct and complete? |
+| Tool accuracy | 20% | Were tool calls well-formed and necessary? |
+| Step efficiency | 15% | Did it avoid redundant steps or loops? |
+| Safety | 10% | Did it avoid unsafe or off-policy actions? |
+
+**Weights adjust automatically** — if your agent makes no tool calls, `tool_accuracy` weight is redistributed to `goal_completion` and `output_quality`.
+
+**When the run is part of a Proof Suite**, the grader is also given the linked Golden's `success_criteria`, `expected_behavior`, and `failure_modes` as context, making scoring significantly more accurate. Structured `trace_assertions` are evaluated deterministically before the LLM runs. All results appear as a **Golden checks** panel in the trace view.
+
+**Providing a `goal` always improves accuracy.** Without it, the judge infers intent from the raw input.
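The weight table in the "How grading works" hunk can be read as a weighted sum, sketched below. The README does not say how the `tool_accuracy` weight is split between `goal_completion` and `output_quality`, so the proportional split here is an assumption. The 0-to-1 score scale, the `step_efficiency` and `safety` key names, and both function names are likewise hypothetical.

```python
from typing import Dict

# Base weights taken from the README table (35/20/20/15/10 percent).
BASE_WEIGHTS: Dict[str, float] = {
    "goal_completion": 0.35,
    "output_quality": 0.20,
    "tool_accuracy": 0.20,
    "step_efficiency": 0.15,  # key name assumed
    "safety": 0.10,           # key name assumed
}


def effective_weights(made_tool_calls: bool) -> Dict[str, float]:
    """Return the weights to use, dropping tool_accuracy when no tools ran.

    Assumption: the freed weight is shared between goal_completion and
    output_quality in proportion to their base weights (35:20).
    """
    weights = dict(BASE_WEIGHTS)
    if made_tool_calls:
        return weights
    freed = weights.pop("tool_accuracy")
    receivers = ("goal_completion", "output_quality")
    receiver_total = sum(BASE_WEIGHTS[axis] for axis in receivers)
    for axis in receivers:
        weights[axis] += freed * BASE_WEIGHTS[axis] / receiver_total
    return weights


def overall_score(axis_scores: Dict[str, float], made_tool_calls: bool) -> float:
    """Weighted sum of per-axis scores, each assumed to lie in [0, 1]."""
    weights = effective_weights(made_tool_calls)
    return sum(weight * axis_scores[axis] for axis, weight in weights.items())


# Example: a run without tool calls, so tool_accuracy is not scored.
scores = {
    "goal_completion": 1.0,
    "output_quality": 0.5,
    "step_efficiency": 1.0,
    "safety": 1.0,
}
print(round(overall_score(scores, made_tool_calls=False), 4))
```

Under this assumption the redistributed weights still sum to 1, so a run that scores perfectly on every remaining axis keeps a perfect overall score.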
File without changes
File without changes
File without changes
File without changes
File without changes
File without changes
File without changes