tracecore 0.9.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- tracecore-0.9.0/LICENSE +21 -0
- tracecore-0.9.0/PKG-INFO +599 -0
- tracecore-0.9.0/README.md +553 -0
- tracecore-0.9.0/agent_bench/__init__.py +8 -0
- tracecore-0.9.0/agent_bench/cli.py +1201 -0
- tracecore-0.9.0/agent_bench/config.py +117 -0
- tracecore-0.9.0/agent_bench/interactive.py +477 -0
- tracecore-0.9.0/agent_bench/maintainer.py +162 -0
- tracecore-0.9.0/agent_bench/openclaw.py +484 -0
- tracecore-0.9.0/agent_bench/pairings.py +79 -0
- tracecore-0.9.0/pyproject.toml +66 -0
- tracecore-0.9.0/setup.cfg +4 -0
- tracecore-0.9.0/tasks/__init__.py +1 -0
- tracecore-0.9.0/tasks/config_drift_remediation/__init__.py +1 -0
- tracecore-0.9.0/tasks/config_drift_remediation/actions.py +33 -0
- tracecore-0.9.0/tasks/config_drift_remediation/setup.py +61 -0
- tracecore-0.9.0/tasks/config_drift_remediation/validate.py +14 -0
- tracecore-0.9.0/tasks/deterministic_rate_service/__init__.py +1 -0
- tracecore-0.9.0/tasks/deterministic_rate_service/actions.py +124 -0
- tracecore-0.9.0/tasks/deterministic_rate_service/service.py +113 -0
- tracecore-0.9.0/tasks/deterministic_rate_service/setup.py +59 -0
- tracecore-0.9.0/tasks/deterministic_rate_service/shared.py +8 -0
- tracecore-0.9.0/tasks/deterministic_rate_service/validate.py +18 -0
- tracecore-0.9.0/tasks/dice_game/actions.py +37 -0
- tracecore-0.9.0/tasks/dice_game/setup.py +21 -0
- tracecore-0.9.0/tasks/dice_game/validate.py +11 -0
- tracecore-0.9.0/tasks/filesystem_hidden_config/actions.py +34 -0
- tracecore-0.9.0/tasks/filesystem_hidden_config/setup.py +20 -0
- tracecore-0.9.0/tasks/filesystem_hidden_config/validate.py +8 -0
- tracecore-0.9.0/tasks/incident_recovery_chain/__init__.py +1 -0
- tracecore-0.9.0/tasks/incident_recovery_chain/actions.py +33 -0
- tracecore-0.9.0/tasks/incident_recovery_chain/setup.py +60 -0
- tracecore-0.9.0/tasks/incident_recovery_chain/validate.py +14 -0
- tracecore-0.9.0/tasks/log_alert_triage/__init__.py +1 -0
- tracecore-0.9.0/tasks/log_alert_triage/actions.py +33 -0
- tracecore-0.9.0/tasks/log_alert_triage/setup.py +62 -0
- tracecore-0.9.0/tasks/log_alert_triage/validate.py +14 -0
- tracecore-0.9.0/tasks/log_stream_monitor/__init__.py +0 -0
- tracecore-0.9.0/tasks/log_stream_monitor/actions.py +35 -0
- tracecore-0.9.0/tasks/log_stream_monitor/setup.py +50 -0
- tracecore-0.9.0/tasks/log_stream_monitor/validate.py +14 -0
- tracecore-0.9.0/tasks/rate_limited_api/actions.py +93 -0
- tracecore-0.9.0/tasks/rate_limited_api/service.py +76 -0
- tracecore-0.9.0/tasks/rate_limited_api/setup.py +41 -0
- tracecore-0.9.0/tasks/rate_limited_api/shared.py +6 -0
- tracecore-0.9.0/tasks/rate_limited_api/validate.py +21 -0
- tracecore-0.9.0/tasks/rate_limited_chain/actions.py +124 -0
- tracecore-0.9.0/tasks/rate_limited_chain/service.py +105 -0
- tracecore-0.9.0/tasks/rate_limited_chain/setup.py +58 -0
- tracecore-0.9.0/tasks/rate_limited_chain/shared.py +8 -0
- tracecore-0.9.0/tasks/rate_limited_chain/validate.py +22 -0
- tracecore-0.9.0/tasks/runbook_verifier/__init__.py +1 -0
- tracecore-0.9.0/tasks/runbook_verifier/actions.py +46 -0
- tracecore-0.9.0/tasks/runbook_verifier/setup.py +120 -0
- tracecore-0.9.0/tasks/runbook_verifier/shared.py +10 -0
- tracecore-0.9.0/tasks/runbook_verifier/validate.py +15 -0
- tracecore-0.9.0/tasks/sandboxed_code_auditor/__init__.py +1 -0
- tracecore-0.9.0/tasks/sandboxed_code_auditor/actions.py +38 -0
- tracecore-0.9.0/tasks/sandboxed_code_auditor/setup.py +76 -0
- tracecore-0.9.0/tasks/sandboxed_code_auditor/validate.py +16 -0
- tracecore-0.9.0/tests/test_agent_contract.py +4 -0
- tracecore-0.9.0/tests/test_autogen_adapter.py +217 -0
- tracecore-0.9.0/tests/test_baseline.py +222 -0
- tracecore-0.9.0/tests/test_baseline_diff_pretty.py +102 -0
- tracecore-0.9.0/tests/test_bundle_audit.py +67 -0
- tracecore-0.9.0/tests/test_chain_agent_runner.py +11 -0
- tracecore-0.9.0/tests/test_cli_baseline.py +142 -0
- tracecore-0.9.0/tests/test_cli_new_agent.py +88 -0
- tracecore-0.9.0/tests/test_cli_openclaw.py +408 -0
- tracecore-0.9.0/tests/test_cli_runs.py +86 -0
- tracecore-0.9.0/tests/test_cli_tasks_validate.py +40 -0
- tracecore-0.9.0/tests/test_config.py +46 -0
- tracecore-0.9.0/tests/test_determinism.py +90 -0
- tracecore-0.9.0/tests/test_deterministic_rate_service_task.py +83 -0
- tracecore-0.9.0/tests/test_dice_game_agent.py +73 -0
- tracecore-0.9.0/tests/test_dice_game_pydantic.py +45 -0
- tracecore-0.9.0/tests/test_interactive_cli.py +242 -0
- tracecore-0.9.0/tests/test_langchain_adapter.py +63 -0
- tracecore-0.9.0/tests/test_maintainer.py +56 -0
- tracecore-0.9.0/tests/test_naive_llm_agent.py +71 -0
- tracecore-0.9.0/tests/test_negative_cases.py +241 -0
- tracecore-0.9.0/tests/test_operations_tasks.py +175 -0
- tracecore-0.9.0/tests/test_pairing_contracts.py +77 -0
- tracecore-0.9.0/tests/test_planner_agent.py +92 -0
- tracecore-0.9.0/tests/test_rate_limited_api_task.py +52 -0
- tracecore-0.9.0/tests/test_rate_limited_chain_task.py +74 -0
- tracecore-0.9.0/tests/test_record_mode.py +221 -0
- tracecore-0.9.0/tests/test_replay_audit.py +104 -0
- tracecore-0.9.0/tests/test_runbook_verifier_agent.py +19 -0
- tracecore-0.9.0/tests/test_runner_contract.py +62 -0
- tracecore-0.9.0/tests/test_runner_failure_taxonomy.py +205 -0
- tracecore-0.9.0/tests/test_runner_smoke.py +10 -0
- tracecore-0.9.0/tests/test_sandbox_env.py +51 -0
- tracecore-0.9.0/tests/test_task_loading.py +38 -0
- tracecore-0.9.0/tests/test_task_registry.py +131 -0
- tracecore-0.9.0/tests/test_terminal_logic_failure.py +66 -0
- tracecore-0.9.0/tests/test_webui_context.py +58 -0
- tracecore-0.9.0/tests/test_webui_routes.py +238 -0
- tracecore-0.9.0/tracecore.egg-info/PKG-INFO +599 -0
- tracecore-0.9.0/tracecore.egg-info/SOURCES.txt +102 -0
- tracecore-0.9.0/tracecore.egg-info/dependency_links.txt +1 -0
- tracecore-0.9.0/tracecore.egg-info/entry_points.txt +2 -0
- tracecore-0.9.0/tracecore.egg-info/requires.txt +14 -0
- tracecore-0.9.0/tracecore.egg-info/top_level.txt +2 -0
tracecore-0.9.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Justin Dobbs

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
tracecore-0.9.0/PKG-INFO
ADDED
|
@@ -0,0 +1,599 @@
Metadata-Version: 2.4
Name: tracecore
Version: 0.9.0
Summary: A lightweight benchmark for action-oriented agents.
Author: Justin Dobbs
License: MIT License

Copyright (c) 2026 Justin Dobbs

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Project-URL: Homepage, https://github.com/justindobbs/Tracecore
Project-URL: Issues, https://github.com/justindobbs/Tracecore/issues
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich>=13.7
Provides-Extra: dev
Requires-Dist: pytest>=9.0; extra == "dev"
Requires-Dist: fastapi>=0.131; extra == "dev"
Requires-Dist: uvicorn>=0.27; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.9.0; extra == "dev"
Requires-Dist: jinja2>=3.1; extra == "dev"
Requires-Dist: python-multipart>=0.0.9; extra == "dev"
Requires-Dist: httpx>=0.27; extra == "dev"
Provides-Extra: pydantic-poc
Requires-Dist: pydantic-ai<1.0,>=0.0.3; extra == "pydantic-poc"
Dynamic: license-file

# TraceCore (Agent Bench CLI)

[](https://github.com/justindobbs/Tracecore/actions/workflows/tests.yml)
[](https://www.python.org/downloads/)
[](LICENSE)

# TraceCore overview
A lightweight benchmark for action-oriented agents inspired by the OpenClaw style—planner loops, tool APIs, partial observability—but open to any implementation that satisfies the harness.

TraceCore evaluates whether an agent can operate—not just reason.
No LLM judges. No vibes. No giant simulators.

> **Brand note:** TraceCore is the product name; the CLI/package and commands remain `agent-bench` for backward compatibility.

Core definition: see [`docs/core.md`](docs/core.md) for the Deterministic Episode Runtime primitive and invariant contracts.

If your agent can survive this benchmark, it can probably survive production.

## Installation

### Published package (recommended)

```bash
pip install tracecore
```

Or with uv:

```bash
uv pip install tracecore
```

This installs the `agent-bench` CLI and all runtime dependencies. The CLI is immediately available once your environment's `Scripts` directory is on PATH.

### Developer / contributor install

Clone the repo and install in editable mode to keep tasks and CLI entries in sync with your working tree (required for the web UI and the registry-powered loader):

```bash
git clone https://github.com/justindobbs/Tracecore.git
cd Tracecore
python -m venv .venv && .venv\Scripts\activate  # or source .venv/bin/activate on macOS/Linux
pip install -e .[dev]
```

`pip install -e` keeps the package in sync with your working tree so new tasks + CLI entries are immediately available.

### Windows PATH tip

The editable install drops `agent-bench.exe` into `%APPDATA%\Python\Python310\Scripts` (or whichever minor version you're using). Add that folder to **Path** via *System Properties → Environment Variables* so `agent-bench` works from any terminal. After updating Path, open a new shell.

> Prefer a one-step install? `pipx install tracecore` drops its own shim into `%USERPROFILE%\.local\bin` and handles PATH automatically.
>
> Already using [uv](https://docs.astral.sh/uv/)? Run `uv tool install tracecore` to create the CLI shim in `%USERPROFILE%\.local\bin`. uv's bootstrap already wires that directory into PATH, so no manual environment edits are required.

Prefer a shorter command name? Create a shell alias so `tracecore` forwards to `agent-bench`:

- **PowerShell** (add to `$PROFILE`): `Set-Alias tracecore agent-bench`
- **Command Prompt**: `doskey tracecore=agent-bench $*`
- **Bash/Zsh**: `alias tracecore='agent-bench'`

The alias simply invokes the same CLI, so all subcommands and flags continue to work.

## Quick start

**Fastest path** — run a known-good agent+task pairing by name:

```bash
agent-bench run pairing log_stream_monitor
agent-bench run pairing log_stream_monitor --seed 7
```

See all available pairings:

```bash
agent-bench run pairing --list
```

Smoke-test every pairing in one shot (useful after a harness change):

```bash
agent-bench run pairing --all
agent-bench run pairing --all --seed 7 --timeout 120  # 120 s wall-clock limit per run
```

Or navigate into the `agents/` directory — if only one pairing matches a file there, it auto-selects:

```bash
cd agents
agent-bench run pairing  # auto-detects if unambiguous
```

Run any agent+task+seed explicitly, with an optional wall-clock timeout:

```bash
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42 --timeout 60
```

Need an end-to-end TraceCore + Pydantic AI example? See [docs/pydantic_poc.md](docs/pydantic_poc.md) for the deterministic dice game agent/task combo.

Want a standalone proof-of-concept that walks through the full execution loop? See [`examples/simple_agent_demo/`](examples/simple_agent_demo/README.md) — a self-contained demo with a CLI that lists tasks, lists agents, and runs any pairing with verbose trace output:

```bash
cd examples/simple_agent_demo
python demo.py --task dice_game --agent dice_game_agent
python demo.py --list-tasks
python demo.py --list-agents
```

Prefer a guided setup? Launch the colorful wizard and let it walk you through agent/task/seed selection (it saves the answers and then calls the same `run` command under the hood):

```bash
agent-bench interactive
# add --dry-run to preview the command without executing
# add --save-session to remember your choices for next time
# add --plugins to include plugin tasks in discovery
# add --no-color if your terminal doesn't support ANSI colors
```

The wizard includes:
- **Suggested pairings**: See agent-task combinations with proven success (if baseline data exists)
- **Agent validation**: Checks that selected agents implement the required interface
- **Task budgets**: Shows `steps` and `tool_calls` limits for each task
- **Progress indicators**: Guides you through "Step 1/3", "Step 2/3", "Step 3/3"
- **Fuzzy search**: Type partial names to filter agents/tasks
- **Inline help**: Press `?` during any prompt for context-sensitive tips
- **Session persistence**: Use `--save-session` to remember your last selections
- **Dry-run mode**: Preview the exact command before execution with `--dry-run`

Prefer the UI?

```bash
agent-bench dashboard --reload
# then open http://localhost:8000
```

Point the form at `agents/toy_agent.py` + `filesystem_hidden_config@1` for a deterministic smoke test, or switch to `agents/rate_limit_agent.py` for the API scenarios. The **Pairings** tab in the dashboard provides one-click launch for every known-good pairing.

### Inspect recent runs

Print a compact table of recent runs without opening the dashboard:

```bash
agent-bench runs summary
agent-bench runs summary --task log_stream_monitor@1 --limit 10
agent-bench runs summary --failure-type budget_exhausted
```

For raw JSON output use `agent-bench runs list` (same filters).

### Run tests

```bash
python -m pytest
```

Want a single command that runs task validation + pytest and can apply a couple of guarded, mechanical fixes? See [`docs/maintainer.md`](docs/maintainer.md):

```bash
agent-bench maintain
```

### Write a new agent

Scaffold a stub with the correct `reset` / `observe` / `act` interface in one command:

```bash
agent-bench new-agent my_agent
# creates agents/my_agent_agent.py with inline docstrings and budget-guard boilerplate
```

Kebab-case names are normalised automatically (`my-agent` → `MyAgentAgent`). Use `--output-dir` to write elsewhere, `--force` to overwrite an existing file.

Then wire it to a task and run:

```bash
agent-bench run --agent agents/my_agent_agent.py --task filesystem_hidden_config@1 --seed 0
```

See [`docs/agents.md`](docs/agents.md) for the full interface contract and [`docs/task_harness.md`](docs/task_harness.md) for the action schema.

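The scaffold boils down to a class with those three hooks. A hand-rolled sketch of the shape (method signatures and the `noop` action are illustrative assumptions — the authoritative contract lives in `docs/agents.md`):

```python
class EchoAgent:
    """Minimal agent skeleton: reset / observe / act (signatures illustrative)."""

    def reset(self, seed):
        # Called once per episode; the seed keeps behaviour deterministic.
        self.seed = seed
        self.last_observation = None

    def observe(self, observation):
        # The harness pushes the current (partial) view of the environment.
        self.last_observation = observation

    def act(self):
        # Return the next action for the harness to execute.
        # A real agent would plan against self.last_observation here.
        return {"tool": "noop", "args": {}}
```

The key point is that the harness owns the loop; the agent only reacts to what it is shown and proposes one action at a time.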
## Troubleshooting

Need help diagnosing install, CLI, or validator issues? See [`docs/troubleshooting.md`](docs/troubleshooting.md) for a consolidated guide that covers PATH fixes, common failure types, and dashboard hiccups.

> **Note:** Task budgets are configured in each task's `task.toml` manifest and can be inspected via `agent-bench tasks validate --registry`. There is no `--budget` CLI override flag; budgets are enforced from the task definition.

## Tutorials
- **OpenClaw users**: see [`OPENCLAW_QUICKSTART.md`](OPENCLAW_QUICKSTART.md) for a 5-minute first run (no OpenClaw install required), or the full [`tutorials/openclaw_quickstart.md`](tutorials/openclaw_quickstart.md) for adapter patterns, budget mapping, and troubleshooting.

## Framing the idea
Terminal Bench works because it:

- Evaluates agents via real tasks, not synthetic prompts
- Uses a simple, opinionated interface (a terminal)
- Is cheap to run, easy to extend, and hard to game

An operations-focused benchmark should do the same, but centered on:

- Action-oriented agents with tool APIs
- Environment interaction and partial observability
- Longish horizons with state, retries, and recovery

In practice, this covers:
- OpenClaw-native agents
- Custom planner loops wired into REST or filesystem tools
- Orchestration agents (e.g., TaskWeaver, AutoGPT-style) that can wrap the simple `reset/observe/act` interface

Think of it less as "benchmarking a model" and more as benchmarking an agent loop end-to-end.

## What makes these agents distinct?
(Adjust these if your mental model differs.)

- Planner / policy loop instead of single-shot prompting
- Tool or action interfaces instead of raw chat completions
- Optional memory, world models, or reusable skills
- Strong emphasis on doing, not just responding

So the benchmark should **not** test raw language quality or one-shot reasoning. It should test:

- Decision-making under constraints
- Tool sequencing and dependency management
- Recovery from errors and partial failures
- State tracking over time and across steps

## Why this exists
Most benchmarks answer questions like:
- Can the model reason?
- Can it write the right patch?
- Can it roleplay an agent?

TraceCore answers a different question:
Can this agent run unattended and get the job done without breaking things?

We test:
- Tool sequencing
- Error recovery
- State tracking
- Long-horizon behavior
- Boring, reliable decision-making

## Design principles
1. **Minimal environment, maximal signal**
   - Keep worlds tiny, deterministic, and inspectable: toy filesystems, fake APIs, log streams, local services.
   - No giant simulators or cloud dependencies—everything should run in seconds on a laptop.
2. **Agent-in-the-loop evaluation**
   - Benchmark the entire perception → reasoning → action loop, not a single prompt.
   - Each task specifies initial state, tool interface, validator, and explicit budgets (steps + tool calls).
3. **Binary outcomes first**
   - Success or failure is the headline metric; secondary stats (steps, tool calls, errors) give color.
   - Deterministic tasks + frozen versions make regressions obvious and stop overfitting.
4. **Hard to game, easy to extend**
   - Sandboxed execution, limited affordances, and published hashes keep agents honest.
   - Tasks are small Python packages so contributors can add new scenarios without ceremony.

## Task categories (operations-native)
### 1. Tool choreography tasks
Goal: stress sequencing, dependency management, and retries.

- *Example:* `rate_limited_api@1` — retrieve an `ACCESS_TOKEN` from a mock API that enforces a deterministic rate limit and transient failures.
- *Signals:* correct tool ordering, retry logic, state retention, graceful degradation.
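To make the retry signal concrete, here is a sketch of the client-side pattern a task like `rate_limited_api@1` rewards. The `call_api` callable and `RateLimited` exception are hypothetical stand-ins, not the task's actual tool API:

```python
import time


class RateLimited(Exception):
    """Illustrative error carrying the server-supplied backoff hint."""

    def __init__(self, retry_after):
        self.retry_after = retry_after


def fetch_token(call_api, max_attempts=5):
    """Retry on rate limiting, honouring retry_after instead of hammering."""
    for _ in range(max_attempts):
        try:
            return call_api("/token")
        except RateLimited as exc:
            # Sleep exactly as long as the server asked; blind retries waste budget.
            time.sleep(exc.retry_after)
    raise RuntimeError("token endpoint still rate limited after retries")
```

An agent that classifies the error and waits the advertised interval finishes well inside the tool-call budget; one that retries immediately burns it.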
### 2. Partial observability & discovery
Goal: reward cautious exploration instead of brute force.

- *Example:* “Traverse a directory tree with undocumented schema. Find the real config key without trashing the filesystem.”
- *Signals:* hypothesis updates, selective reads, remembering seen paths, avoiding repeated mistakes.

### 3. Long-horizon maintenance
Goal: ensure persistence, monitoring, and acting at the right moment.

- *Example:* “A service degrades over time. Watch logs, detect the symptom, and apply the correct fix only when needed.”
- *Signals:* patience, trigger detection, not overreacting, applying steady-state playbooks.

### 4. Adversarial-but-fair environments
Goal: test robustness when the world is a little hostile.

- *Example:* flaky tools, malformed API responses, conflicting telemetry that needs disambiguation.
- *Signals:* error recovery, fallback strategies, keeping track of provenance before acting.

## Scoring without overengineering
- Binary success/failure is the scoreboard.
- Secondary metrics: steps taken, tool calls, wall-clock time, error count.
- No LLM judges, no vibes, no composite scores you can’t reason about.

## Interface sketch
Agents run exactly like they would in production: provide an agent, pick a task, respect the budget.

```sh
agent-bench run \
  --agent agents/toy_agent.py \
  --task filesystem_hidden_config@1 \
  --seed 42
```

Each task ships with a harness, fake environment, and validator. Agents only see what they’re allowed to see.

## Why this matters (and what’s missing today)
Most agent benchmarks collapse back into single-prompt exams. They rarely measure recovery, operational competence, or whether the agent can survive unattended. TraceCore surfaces engineering-quality differences and rewards boring-but-correct behavior.

## Potential pitfalls & guardrails
- **Overfitting to the harness** → Keep suites varied, publish fixtures, encourage new contributions.
- **Agents cheating via inspection** → Sandbox aggressively, freeze binaries, limit visibility.
- **Benchmark drift** → Freeze task versions, publish hashes/seeded assets, require changelog entries.

## What’s in v0
Task suites:
- Filesystem & State
- Tool Choreography
- Long-Horizon & Monitoring
- Adversarial-but-Fair
- Operations & Triage

Shipping tasks:
- `filesystem_hidden_config@1` (filesystem suite): explore a hidden directory tree to find the one true `API_KEY`.
- `rate_limited_api@1` (api suite): classify API errors, respect `retry_after`, and persist the returned `ACCESS_TOKEN`.
- `log_alert_triage@1` (operations suite): triage deterministic logs and extract the final `ALERT_CODE`.
- `config_drift_remediation@1` (operations suite): compare desired vs. live configs and output the remediation patch line.
- `incident_recovery_chain@1` (operations suite): follow a recovery handoff chain to recover `RECOVERY_TOKEN`.
- `log_stream_monitor@1` (operations suite): poll a paginated log stream, ignore noise, and emit `STREAM_CODE` when a `CRITICAL` entry is detected.

Each task:
- Defines an initial environment
- Exposes a constrained action interface
- Has a single, deterministic success condition

## How it works
You provide any agent that implements the documented interface.
We provide a task harness.
The agent runs until:
- It succeeds
- It fails
- It runs out of budget

No human in the loop. No retries.
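In harness terms, that stop condition is one loop. A sketch of the idea (the `Budget` type, `env.step` return values, and result keys are illustrative — the real Deterministic Episode Runtime is specified in `docs/core.md`):

```python
from dataclasses import dataclass


@dataclass
class Budget:
    steps: int
    tool_calls: int


def run_episode(agent, env, budget, seed=0):
    """Drive one episode until success, failure, or budget exhaustion."""
    agent.reset(seed)
    steps = 0
    while steps < budget.steps:
        agent.observe(env.observe())
        outcome = env.step(agent.act())  # "success" | "failure" | None (keep going)
        steps += 1
        if outcome is not None:
            return {"success": outcome == "success", "steps_used": steps}
    # Budget ran out before the validator settled the episode.
    return {"success": False, "failure_type": "budget_exhausted", "steps_used": steps}
```

The agent never sees the budget counters directly; it either finishes inside them or is cut off, which is exactly what "no human in the loop, no retries" means in practice.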
## Example
```sh
agent-bench run \
  --agent agents/toy_agent.py \
  --task filesystem_hidden_config@1 \
  --seed 42

# Replay a prior run_id (defaults to recorded agent/task/seed, but you can override):
agent-bench run --replay <run_id> --seed 42
```

### Configuration via `agent-bench.toml`

Rather not repeat `--agent`, `--task`, and `--seed` every time? Drop a config file in the repo root (or pass `--config path/to/file`). Set `AGENT_BENCH_CONFIG=agent-bench.toml` in CI (and any automation) so the same defaults apply everywhere.

```toml
[defaults]
agent = "agents/toy_agent.py"
task = "filesystem_hidden_config@1"
seed = 42

[agent."agents/rate_limit_agent.py"]
task = "rate_limited_api@1"
seed = 11
```

The CLI resolves flags first, then per-agent overrides, then the `[defaults]` block. Any command accepts `--config` to point at another file; otherwise `agent-bench.toml` (or `agent_bench.toml`) is used when present or when `AGENT_BENCH_CONFIG` is set.
If `agent-bench` isn’t on your PATH yet, call it via Python:

```powershell
python -m agent_bench.cli --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
```

Every CLI run writes a JSON artifact under `.agent_bench/runs/<run_id>.json`. Inspect them directly, or list them via:

```sh
agent-bench runs list --limit 5
```

Want to zero in on a specific outcome? Use the structured failure taxonomy filter:

```sh
agent-bench runs list --failure-type timeout --limit 5
agent-bench runs list --failure-type success --limit 5  # only successful runs
```

The same buckets surface in the Web UI’s **Recent Runs** list, where each entry is labeled `Success` or `Failure — <type>` so you can spot budget exhaustion vs. invalid actions at a glance.
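Because the artifacts are plain JSON, you can script against them directly. A sketch that mimics `runs list --failure-type` over a runs directory (the `failure_type` field comes from the run schema shown later; the helper itself is an assumption, not part of the CLI):

```python
import json
from pathlib import Path


def runs_with_failure_type(runs_dir, failure_type):
    """Collect persisted run artifacts whose failure_type matches."""
    matches = []
    for path in sorted(Path(runs_dir).glob("*.json")):
        run = json.loads(path.read_text())
        if run.get("failure_type") == failure_type:
            matches.append(run)
    return matches
```

Handy when you want to post-process runs (say, feed every `budget_exhausted` trace into a notebook) without going through the CLI.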
Need a quick aggregate of how an agent performs on a task? Use the baseline helper:

```sh
agent-bench baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1
```

It emits success rate, average steps/tool calls, and links back to the latest trace for that agent/task pair. Add `--export` to persist a frozen snapshot for the web UI:

```sh
agent-bench baseline --export         # writes .agent_bench/baselines/baseline-<ts>.json
agent-bench baseline --export latest  # custom filename in the baselines folder
```

Compare two specific runs (paths or `run_id`s) to see exactly where traces diverge:

```sh
agent-bench baseline --compare .agent_bench/runs/run_a.json .agent_bench/runs/run_b.json
# or mix path + run_id
agent-bench baseline --compare abcd1234 efgh5678
```

The diff output highlights whether the agent/task/success states match and lists per-step differences.
Use `--format text` for a quick human summary; exit codes are `0` (identical), `1` (different), `2` (incompatible task/agent).
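If you script your own checks around exported run JSON, you can mirror those exit-code semantics. A sketch only: the top-level field names come from the sample output below, but the exact incompatibility rule used by `--compare` is an assumption here:

```python
def compare_exit_code(run_a, run_b):
    """Approximate the documented compare codes: 0 identical, 1 different, 2 incompatible."""
    # Assumption: runs of different tasks/versions are "incompatible".
    if (run_a.get("task_id"), run_a.get("version")) != (run_b.get("task_id"), run_b.get("version")):
        return 2
    fields = ("success", "failure_type", "steps_used", "tool_calls_used")
    return 0 if all(run_a.get(f) == run_b.get(f) for f in fields) else 1
```

In CI you would typically treat 0 as green, 1 as a regression to investigate, and 2 as a wiring mistake in the job itself.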
For CI usage, see [`docs/ci_workflow.md`](docs/ci_workflow.md).
|
|
460
|
+
This repo also ships a `chain-agent-baseline` workflow wired to `agents/chain_agent.py` + `rate_limited_chain@1`.
|
|
461
|
+
|
|
462
|
+
The Baselines tab in the UI only shows a "Latest published" card after you export at least once.


## Minimal Web UI (Optional)

Prefer sliders and buttons over the CLI? Spin up the lightweight FastAPI form:

```sh
pip install tracecore
agent-bench dashboard --host 127.0.0.1 --port 8000 --reload
```

> **`--reload` is for local development only.** It enables uvicorn's auto-reload on file changes and should not be used in shared or production environments. Omit the flag for stable serving.

> Tip: create a virtual environment first (e.g., `python -m venv .venv && .venv\Scripts\activate` on Windows) so the FastAPI deps stay isolated. See the official FastAPI installation guide for more platform-specific options: <https://fastapi.tiangolo.com/#installation>

Then visit [http://localhost:8000](http://localhost:8000) to:

- Pick any agent module under `agents/`
- Choose a task (`filesystem_hidden_config@1`, `rate_limited_api@1`, etc.) and seed
- Launch runs, inspect structured JSON results (seed included), and drill into traces
- Replay a prior run by pasting its `run_id` and optionally overriding the seed/agent/task

The UI intentionally ships with **no** Node/Vite stack—just FastAPI + Jinja—so you can layer more elaborate frontends later without losing the minimal flow.

Each run produces a structured JSON result like this:

```json
{
  "task_id": "filesystem_hidden_config",
  "version": 1,
  "seed": 42,
  "success": true,
  "failure_reason": null,
  "failure_type": null,
  "steps_used": 37,
  "tool_calls_used": 12
}
```
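
Downstream tooling can rebuild the Recent Runs labels from this result shape; a minimal sketch:

```python
# Minimal sketch: derive the Recent Runs label (`Success` or
# `Failure — <type>`) from a result dict shaped like the JSON above.
def outcome_label(result: dict) -> str:
    if result["success"]:
        return "Success"
    return f"Failure — {result['failure_type']}"

print(outcome_label({"success": True, "failure_type": None}))  # Success
print(outcome_label({"success": False, "failure_type": "sandbox_violation"}))  # Failure — sandbox_violation
```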

### Diagnostics workflow

1. **Run & persist** — both the CLI and the web UI call the same harness and automatically persist artifacts under `.agent_bench/runs/` with metadata (`run_id`, `trace_id`, timestamps, harness version, trace entries).
2. **Inspect traces** — load `http://localhost:8000/?trace_id=<run_id>` to jump straight to the trace viewer, or fetch raw JSON via `/api/traces/<run_id>`.
3. **Compare outcomes** — use `agent-bench baseline ...` or the UI baseline table to spot regressions (success rate, average steps/tool calls) before publishing results.
4. **Freeze specs** — once a run set looks good, tag the task versions + harness revision so those run IDs remain reproducible proof of behavior.
5. **Manual verification** — before freezing or sharing results, run through `docs/manual_verification.md` to replay the CLI + UI flows end-to-end.

To inspect a specific run artifact directly, use:

```sh
agent-bench runs list --limit 5
# then load the JSON artifact from .agent_bench/runs/<run_id>.json
```
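
If you prefer to slice the persisted artifacts from your own scripts, a minimal sketch, assuming each artifact carries the `success`/`failure_type` fields from the result JSON shown earlier:

```python
# Minimal sketch: bucket run artifacts by outcome, mirroring the
# `runs list --failure-type` filter. The dicts below are inline stand-ins
# for JSON loaded from .agent_bench/runs/<run_id>.json.
runs = [
    {"run_id": "abcd1234", "success": True, "failure_type": None},
    {"run_id": "efgh5678", "success": False, "failure_type": "sandbox_violation"},
]

def bucket(run: dict) -> str:
    return "success" if run["success"] else run["failure_type"]

by_bucket = {}
for run in runs:
    by_bucket.setdefault(bucket(run), []).append(run["run_id"])
print(by_bucket)  # {'success': ['abcd1234'], 'sandbox_violation': ['efgh5678']}
```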

## Release process

Ready to cut a release? See [`docs/release_process.md`](docs/release_process.md) for the standard checklist (changelog, version stamping, test gate, SPEC_FREEZE alignment, trust evidence bundle, and tagging steps). Historical release notes are also archived there.

## What we measure

Per task:

- Success / failure
- Steps taken
- Tool calls
- Error count

Across a suite:

- Success rate
- Aggregate efficiency metrics
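
The suite-level aggregates are straightforward to recompute from per-task results; a minimal sketch using the result fields shown earlier:

```python
# Minimal sketch of the suite-level aggregates, computed from per-task
# results shaped like the run JSON shown earlier.
results = [
    {"success": True, "steps_used": 37, "tool_calls_used": 12},
    {"success": False, "steps_used": 50, "tool_calls_used": 20},
]

success_rate = sum(r["success"] for r in results) / len(results)
avg_steps = sum(r["steps_used"] for r in results) / len(results)
avg_tool_calls = sum(r["tool_calls_used"] for r in results) / len(results)
print(success_rate, avg_steps, avg_tool_calls)  # 0.5 43.5 16.0
```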

See [SPEC_FREEZE.md](SPEC_FREEZE.md) for the frozen v0.1.0 task list (including the new `rate_limited_chain@1` pain task) and the rules for bumping versions.

We deliberately avoid:

- LLM-based judges
- Natural language grading
- Weighted composite scores

## Reference agent

TraceCore ships with a minimal reference agent. It is:

- Conservative
- State-driven
- Explicit about errors
- Boring on purpose

If your agent can’t outperform the reference agent, that’s a signal.

Reference implementations:

- `agents/toy_agent.py` — solves filesystem discovery tasks.
- `agents/rate_limit_agent.py` — handles classic rate-limit retry flows (`rate_limited_api@1`).
- `agents/chain_agent.py` — completes the chained handshake + rate-limit pain task (`rate_limited_chain@1`).
- `agents/ops_triage_agent.py` — handles operations triage tasks (`log_alert_triage@1`, `config_drift_remediation@1`, `incident_recovery_chain@1`).
- `agents/cheater_agent.py` — intentionally malicious “cheater sim” that tries to read hidden state; the sandbox should block it with a `sandbox_violation` so you can prove the harness defenses work.

## Adding a task

Tasks are small and self-contained, but every bundled scenario now flows through a manifest so the registry and docs stay aligned.

### Bundled manifest

- `tasks/registry.json` enumerates every built-in task (`filesystem_hidden_config@1`, `rate_limited_api@1`, `rate_limited_chain@1`, `deterministic_rate_service@1`, `log_alert_triage@1`, `config_drift_remediation@1`, `incident_recovery_chain@1`).
- Whenever you add a task or bump a task version, update this manifest, SPEC_FREEZE, and the docs table in `docs/tasks.md`.

### Plugin workflow

- External packages can expose tasks without living in this repo via the `agent_bench.tasks` entry-point group.
- See [`docs/task_plugin_template.md`](docs/task_plugin_template.md) for a ready-to-copy layout, entry-point snippet, and `register()` helper contract.
- The loader automatically merges bundled manifest entries and plugin descriptors, so `agent-bench run --task my_plugin_task@1` works once the package is installed.
- Validate task manifests/registry entries with `agent-bench tasks validate` before publishing plugins or bumping versions.
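
The entry-point wiring can be sketched as a `pyproject.toml` fragment. Only the group name (`agent_bench.tasks`) comes from the docs above; the distribution, module, and function names below are hypothetical:

```toml
# Hypothetical plugin pyproject.toml fragment. Only the entry-point group
# name is documented; the module and task names are placeholders.
[project.entry-points."agent_bench.tasks"]
my_plugin_task = "my_plugin_tasks:register"
```

The exact `register()` contract is described in `docs/task_plugin_template.md`.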

### Task requirements

- Environment setup (`setup.py`)
- Available actions/tools (`actions.py`)
- Validator (`validate.py`)
- Budget defaults + metadata (`task.toml`)
- Contract fields defined in [`docs/contract_spec.md`](docs/contract_spec.md)

If your task:

- Requires internet access
- Needs a GPU
- Takes minutes to run

It probably doesn’t belong here.

## Non-goals

TraceCore does not aim to:

- Benchmark raw language quality
- Measure creativity
- Replace SWE-bench or Terminal Bench
- Simulate the real world

It tests operational competence, nothing more.

## Status

This project is early and opinionated. Expect:

- Breaking changes
- Small task suites
- Strong opinions

If you disagree, open an issue—or better, a PR.

One-line summary:

> Terminal Bench, but for agents that actually have to do things.