tracecore 0.9.0 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (104)
  1. tracecore-0.9.0/LICENSE +21 -0
  2. tracecore-0.9.0/PKG-INFO +599 -0
  3. tracecore-0.9.0/README.md +553 -0
  4. tracecore-0.9.0/agent_bench/__init__.py +8 -0
  5. tracecore-0.9.0/agent_bench/cli.py +1201 -0
  6. tracecore-0.9.0/agent_bench/config.py +117 -0
  7. tracecore-0.9.0/agent_bench/interactive.py +477 -0
  8. tracecore-0.9.0/agent_bench/maintainer.py +162 -0
  9. tracecore-0.9.0/agent_bench/openclaw.py +484 -0
  10. tracecore-0.9.0/agent_bench/pairings.py +79 -0
  11. tracecore-0.9.0/pyproject.toml +66 -0
  12. tracecore-0.9.0/setup.cfg +4 -0
  13. tracecore-0.9.0/tasks/__init__.py +1 -0
  14. tracecore-0.9.0/tasks/config_drift_remediation/__init__.py +1 -0
  15. tracecore-0.9.0/tasks/config_drift_remediation/actions.py +33 -0
  16. tracecore-0.9.0/tasks/config_drift_remediation/setup.py +61 -0
  17. tracecore-0.9.0/tasks/config_drift_remediation/validate.py +14 -0
  18. tracecore-0.9.0/tasks/deterministic_rate_service/__init__.py +1 -0
  19. tracecore-0.9.0/tasks/deterministic_rate_service/actions.py +124 -0
  20. tracecore-0.9.0/tasks/deterministic_rate_service/service.py +113 -0
  21. tracecore-0.9.0/tasks/deterministic_rate_service/setup.py +59 -0
  22. tracecore-0.9.0/tasks/deterministic_rate_service/shared.py +8 -0
  23. tracecore-0.9.0/tasks/deterministic_rate_service/validate.py +18 -0
  24. tracecore-0.9.0/tasks/dice_game/actions.py +37 -0
  25. tracecore-0.9.0/tasks/dice_game/setup.py +21 -0
  26. tracecore-0.9.0/tasks/dice_game/validate.py +11 -0
  27. tracecore-0.9.0/tasks/filesystem_hidden_config/actions.py +34 -0
  28. tracecore-0.9.0/tasks/filesystem_hidden_config/setup.py +20 -0
  29. tracecore-0.9.0/tasks/filesystem_hidden_config/validate.py +8 -0
  30. tracecore-0.9.0/tasks/incident_recovery_chain/__init__.py +1 -0
  31. tracecore-0.9.0/tasks/incident_recovery_chain/actions.py +33 -0
  32. tracecore-0.9.0/tasks/incident_recovery_chain/setup.py +60 -0
  33. tracecore-0.9.0/tasks/incident_recovery_chain/validate.py +14 -0
  34. tracecore-0.9.0/tasks/log_alert_triage/__init__.py +1 -0
  35. tracecore-0.9.0/tasks/log_alert_triage/actions.py +33 -0
  36. tracecore-0.9.0/tasks/log_alert_triage/setup.py +62 -0
  37. tracecore-0.9.0/tasks/log_alert_triage/validate.py +14 -0
  38. tracecore-0.9.0/tasks/log_stream_monitor/__init__.py +0 -0
  39. tracecore-0.9.0/tasks/log_stream_monitor/actions.py +35 -0
  40. tracecore-0.9.0/tasks/log_stream_monitor/setup.py +50 -0
  41. tracecore-0.9.0/tasks/log_stream_monitor/validate.py +14 -0
  42. tracecore-0.9.0/tasks/rate_limited_api/actions.py +93 -0
  43. tracecore-0.9.0/tasks/rate_limited_api/service.py +76 -0
  44. tracecore-0.9.0/tasks/rate_limited_api/setup.py +41 -0
  45. tracecore-0.9.0/tasks/rate_limited_api/shared.py +6 -0
  46. tracecore-0.9.0/tasks/rate_limited_api/validate.py +21 -0
  47. tracecore-0.9.0/tasks/rate_limited_chain/actions.py +124 -0
  48. tracecore-0.9.0/tasks/rate_limited_chain/service.py +105 -0
  49. tracecore-0.9.0/tasks/rate_limited_chain/setup.py +58 -0
  50. tracecore-0.9.0/tasks/rate_limited_chain/shared.py +8 -0
  51. tracecore-0.9.0/tasks/rate_limited_chain/validate.py +22 -0
  52. tracecore-0.9.0/tasks/runbook_verifier/__init__.py +1 -0
  53. tracecore-0.9.0/tasks/runbook_verifier/actions.py +46 -0
  54. tracecore-0.9.0/tasks/runbook_verifier/setup.py +120 -0
  55. tracecore-0.9.0/tasks/runbook_verifier/shared.py +10 -0
  56. tracecore-0.9.0/tasks/runbook_verifier/validate.py +15 -0
  57. tracecore-0.9.0/tasks/sandboxed_code_auditor/__init__.py +1 -0
  58. tracecore-0.9.0/tasks/sandboxed_code_auditor/actions.py +38 -0
  59. tracecore-0.9.0/tasks/sandboxed_code_auditor/setup.py +76 -0
  60. tracecore-0.9.0/tasks/sandboxed_code_auditor/validate.py +16 -0
  61. tracecore-0.9.0/tests/test_agent_contract.py +4 -0
  62. tracecore-0.9.0/tests/test_autogen_adapter.py +217 -0
  63. tracecore-0.9.0/tests/test_baseline.py +222 -0
  64. tracecore-0.9.0/tests/test_baseline_diff_pretty.py +102 -0
  65. tracecore-0.9.0/tests/test_bundle_audit.py +67 -0
  66. tracecore-0.9.0/tests/test_chain_agent_runner.py +11 -0
  67. tracecore-0.9.0/tests/test_cli_baseline.py +142 -0
  68. tracecore-0.9.0/tests/test_cli_new_agent.py +88 -0
  69. tracecore-0.9.0/tests/test_cli_openclaw.py +408 -0
  70. tracecore-0.9.0/tests/test_cli_runs.py +86 -0
  71. tracecore-0.9.0/tests/test_cli_tasks_validate.py +40 -0
  72. tracecore-0.9.0/tests/test_config.py +46 -0
  73. tracecore-0.9.0/tests/test_determinism.py +90 -0
  74. tracecore-0.9.0/tests/test_deterministic_rate_service_task.py +83 -0
  75. tracecore-0.9.0/tests/test_dice_game_agent.py +73 -0
  76. tracecore-0.9.0/tests/test_dice_game_pydantic.py +45 -0
  77. tracecore-0.9.0/tests/test_interactive_cli.py +242 -0
  78. tracecore-0.9.0/tests/test_langchain_adapter.py +63 -0
  79. tracecore-0.9.0/tests/test_maintainer.py +56 -0
  80. tracecore-0.9.0/tests/test_naive_llm_agent.py +71 -0
  81. tracecore-0.9.0/tests/test_negative_cases.py +241 -0
  82. tracecore-0.9.0/tests/test_operations_tasks.py +175 -0
  83. tracecore-0.9.0/tests/test_pairing_contracts.py +77 -0
  84. tracecore-0.9.0/tests/test_planner_agent.py +92 -0
  85. tracecore-0.9.0/tests/test_rate_limited_api_task.py +52 -0
  86. tracecore-0.9.0/tests/test_rate_limited_chain_task.py +74 -0
  87. tracecore-0.9.0/tests/test_record_mode.py +221 -0
  88. tracecore-0.9.0/tests/test_replay_audit.py +104 -0
  89. tracecore-0.9.0/tests/test_runbook_verifier_agent.py +19 -0
  90. tracecore-0.9.0/tests/test_runner_contract.py +62 -0
  91. tracecore-0.9.0/tests/test_runner_failure_taxonomy.py +205 -0
  92. tracecore-0.9.0/tests/test_runner_smoke.py +10 -0
  93. tracecore-0.9.0/tests/test_sandbox_env.py +51 -0
  94. tracecore-0.9.0/tests/test_task_loading.py +38 -0
  95. tracecore-0.9.0/tests/test_task_registry.py +131 -0
  96. tracecore-0.9.0/tests/test_terminal_logic_failure.py +66 -0
  97. tracecore-0.9.0/tests/test_webui_context.py +58 -0
  98. tracecore-0.9.0/tests/test_webui_routes.py +238 -0
  99. tracecore-0.9.0/tracecore.egg-info/PKG-INFO +599 -0
  100. tracecore-0.9.0/tracecore.egg-info/SOURCES.txt +102 -0
  101. tracecore-0.9.0/tracecore.egg-info/dependency_links.txt +1 -0
  102. tracecore-0.9.0/tracecore.egg-info/entry_points.txt +2 -0
  103. tracecore-0.9.0/tracecore.egg-info/requires.txt +14 -0
  104. tracecore-0.9.0/tracecore.egg-info/top_level.txt +2 -0
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Justin Dobbs
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,599 @@
+ Metadata-Version: 2.4
+ Name: tracecore
+ Version: 0.9.0
+ Summary: A lightweight benchmark for action-oriented agents.
+ Author: Justin Dobbs
+ License: MIT License
+
+ Copyright (c) 2026 Justin Dobbs
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
+ Project-URL: Homepage, https://github.com/justindobbs/Tracecore
+ Project-URL: Issues, https://github.com/justindobbs/Tracecore/issues
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: rich>=13.7
+ Provides-Extra: dev
+ Requires-Dist: pytest>=9.0; extra == "dev"
+ Requires-Dist: fastapi>=0.131; extra == "dev"
+ Requires-Dist: uvicorn>=0.27; extra == "dev"
+ Requires-Dist: pytest-cov>=5.0; extra == "dev"
+ Requires-Dist: ruff>=0.9.0; extra == "dev"
+ Requires-Dist: jinja2>=3.1; extra == "dev"
+ Requires-Dist: python-multipart>=0.0.9; extra == "dev"
+ Requires-Dist: httpx>=0.27; extra == "dev"
+ Provides-Extra: pydantic-poc
+ Requires-Dist: pydantic-ai<1.0,>=0.0.3; extra == "pydantic-poc"
+ Dynamic: license-file
+
+ # TraceCore (Agent Bench CLI)
+ [![Tests](https://github.com/justindobbs/Tracecore/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/justindobbs/Tracecore/actions/workflows/tests.yml)
+ [![Python](https://img.shields.io/badge/python-3.10%2B-blue?logo=python)](https://www.python.org/downloads/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
+
+ ![TraceCore](banner.png)
+
+ ## Overview
+ A lightweight benchmark for action-oriented agents inspired by the OpenClaw style—planner loops, tool APIs, partial observability—but open to any implementation that satisfies the harness.
+
+ TraceCore evaluates whether an agent can operate—not just reason.
+ No LLM judges. No vibes. No giant simulators.
+
+ > **Brand note:** TraceCore is the product name; the CLI/package and commands remain `agent-bench` for backward compatibility.
+
+ Core definition: see [`docs/core.md`](docs/core.md) for the Deterministic Episode Runtime primitive and invariant contracts.
+
+ If your agent can survive this benchmark, it can probably survive production.
+
+
+ ## Installation
+
+ ### Published package (recommended)
+
+ ```bash
+ pip install tracecore
+ ```
+
+ Or with uv:
+
+ ```bash
+ uv pip install tracecore
+ ```
+
+ This installs the `agent-bench` CLI and all runtime dependencies. The CLI is immediately available once your environment's `Scripts` (Windows) or `bin` (macOS/Linux) directory is on PATH.
+
+ ### Developer / contributor install
+
+ Clone the repo and install in editable mode to keep tasks and CLI entries in sync with your working tree (required for the web UI and the registry-powered loader):
+
+ ```bash
+ git clone https://github.com/justindobbs/Tracecore.git
+ cd Tracecore
+ python -m venv .venv && .venv\Scripts\activate  # or source .venv/bin/activate on macOS/Linux
+ pip install -e .[dev]
+ ```
+
+ `pip install -e` keeps the package in sync with your working tree so new tasks + CLI entries are immediately available.
+
+ ### Windows PATH tip
+
+ The editable install drops `agent-bench.exe` into `%APPDATA%\Python\Python310\Scripts` (or whichever minor version you're using). Add that folder to **Path** via *System Properties → Environment Variables* so `agent-bench` works from any terminal. After updating Path, open a new shell.
+
+ > Prefer a one-step install? `pipx install tracecore` drops its own shim into `%USERPROFILE%\.local\bin` and handles PATH automatically.
+ >
+ > Already using [uv](https://docs.astral.sh/uv/)? Run `uv tool install tracecore` to create the CLI shim in `%USERPROFILE%\.local\bin`. uv's bootstrap already wires that directory into PATH, so no manual environment edits are required.
+
+ Prefer a shorter command name? Create a shell alias so `tracecore` forwards to `agent-bench`:
+
+ - **PowerShell** (add to `$PROFILE`): `Set-Alias tracecore agent-bench`
+ - **Command Prompt**: `doskey tracecore=agent-bench $*`
+ - **Bash/Zsh**: `alias tracecore='agent-bench'`
+
+ The alias simply invokes the same CLI, so all subcommands and flags continue to work.
+
+ ## Quick start
+
+ **Fastest path** — run a known-good agent+task pairing by name:
+
+ ```bash
+ agent-bench run pairing log_stream_monitor
+ agent-bench run pairing log_stream_monitor --seed 7
+ ```
+
+ See all available pairings:
+
+ ```bash
+ agent-bench run pairing --list
+ ```
+
+ Smoke-test every pairing in one shot (useful after a harness change):
+
+ ```bash
+ agent-bench run pairing --all
+ agent-bench run pairing --all --seed 7 --timeout 120  # 120 s wall-clock limit per run
+ ```
+
+ Or navigate into the `agents/` directory — if only one pairing matches a file there, it auto-selects:
+
+ ```bash
+ cd agents
+ agent-bench run pairing  # auto-detects if unambiguous
+ ```
+
+ Run any agent+task+seed explicitly, with an optional wall-clock timeout:
+
+ ```bash
+ agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
+ agent-bench run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42 --timeout 60
+ ```
+
+ Need an end-to-end TraceCore + Pydantic AI example? See [docs/pydantic_poc.md](docs/pydantic_poc.md) for the deterministic dice game agent/task combo.
+
+ Want a standalone proof-of-concept that walks through the full execution loop? See [`examples/simple_agent_demo/`](examples/simple_agent_demo/README.md) — a self-contained demo with a CLI that lists tasks, lists agents, and runs any pairing with verbose trace output:
+
+ ```bash
+ cd examples/simple_agent_demo
+ python demo.py --task dice_game --agent dice_game_agent
+ python demo.py --list-tasks
+ python demo.py --list-agents
+ ```
+
+ Prefer a guided setup? Launch the colorful wizard and let it walk you through agent/task/seed selection (it saves the answers and then calls the same `run` command under the hood):
+
+ ```bash
+ agent-bench interactive
+ # add --dry-run to preview the command without executing
+ # add --save-session to remember your choices for next time
+ # add --plugins to include plugin tasks in discovery
+ # add --no-color if your terminal doesn't support ANSI colors
+ ```
+
+ The wizard includes:
+ - **Suggested pairings**: See agent-task combinations with proven success (if baseline data exists)
+ - **Agent validation**: Checks that selected agents implement the required interface
+ - **Task budgets**: Shows `steps` and `tool_calls` limits for each task
+ - **Progress indicators**: Guides you through "Step 1/3", "Step 2/3", "Step 3/3"
+ - **Fuzzy search**: Type partial names to filter agents/tasks
+ - **Inline help**: Press `?` during any prompt for context-sensitive tips
+ - **Session persistence**: Use `--save-session` to remember your last selections
+ - **Dry-run mode**: Preview the exact command before execution with `--dry-run`
+
+ Prefer the UI?
+
+ ```bash
+ agent-bench dashboard --reload
+ # then open http://localhost:8000
+ ```
+
+ Point the form at `agents/toy_agent.py` + `filesystem_hidden_config@1` for a deterministic smoke test, or switch to `agents/rate_limit_agent.py` for the API scenarios. The **Pairings** tab in the dashboard provides one-click launch for every known-good pairing.
+
+ ### Inspect recent runs
+
+ Print a compact table of recent runs without opening the dashboard:
+
+ ```bash
+ agent-bench runs summary
+ agent-bench runs summary --task log_stream_monitor@1 --limit 10
+ agent-bench runs summary --failure-type budget_exhausted
+ ```
+
+ For raw JSON output use `agent-bench runs list` (same filters).
+
+ ### Run tests
+
+ ```bash
+ python -m pytest
+ ```
+
+ Want a single command that runs task validation + pytest and can apply a couple of guarded, mechanical fixes? See [`docs/maintainer.md`](docs/maintainer.md):
+
+ ```bash
+ agent-bench maintain
+ ```
+
+ ### Write a new agent
+
+ Scaffold a stub with the correct `reset` / `observe` / `act` interface in one command:
+
+ ```bash
+ agent-bench new-agent my_agent
+ # creates agents/my_agent_agent.py with inline docstrings and budget-guard boilerplate
+ ```
+
+ Kebab-case names are normalised automatically (`my-agent` → `MyAgentAgent`). Use `--output-dir` to write elsewhere, `--force` to overwrite an existing file.
+
+ Then wire it to a task and run:
+
+ ```bash
+ agent-bench run --agent agents/my_agent_agent.py --task filesystem_hidden_config@1 --seed 0
+ ```
+
+ See [`docs/agents.md`](docs/agents.md) for the full interface contract and [`docs/task_harness.md`](docs/task_harness.md) for the action schema.
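As a rough illustration of that contract, a deliberately boring agent might look like the sketch below. This is a hedged example: the `reset`/`observe`/`act` method names come from the docs above, but the exact signatures, the observation shape, and the `read`/`submit` action names are assumptions, not the real schema from `docs/task_harness.md`.

```python
# Hypothetical minimal agent sketch; the real contract is in docs/agents.md.
class EchoAgent:
    """Conservative agent that submits as soon as it has an answer."""

    def reset(self, seed):
        # Called once per episode; the seed enables deterministic behavior.
        self.seed = seed
        self.answer = None

    def observe(self, observation):
        # Receive the harness's (partial) view of the environment.
        if "API_KEY" in str(observation):
            self.answer = observation

    def act(self):
        # Return the next action dict; "read"/"submit" names are assumptions.
        if self.answer is not None:
            return {"type": "submit", "value": self.answer}
        return {"type": "read", "path": "."}
```

The scaffold generated by `agent-bench new-agent` fills in the same three methods, plus budget-guard boilerplate.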
+
+ ## Troubleshooting
+
+ Need help diagnosing install, CLI, or validator issues? See [`docs/troubleshooting.md`](docs/troubleshooting.md) for a consolidated guide that covers PATH fixes, common failure types, and dashboard hiccups.
+
+ > **Note:** Task budgets are configured in each task's `task.toml` manifest and can be inspected via `agent-bench tasks validate --registry`. There is no `--budget` CLI override flag; budgets are enforced from the task definition.
+
+ ## Tutorials
+ - **OpenClaw users**: see [`OPENCLAW_QUICKSTART.md`](OPENCLAW_QUICKSTART.md) for a 5-minute first run (no OpenClaw install required), or the full [`tutorials/openclaw_quickstart.md`](tutorials/openclaw_quickstart.md) for adapter patterns, budget mapping, and troubleshooting.
+
+ ## Framing the idea
+ Terminal Bench works because it:
+
+ - Evaluates agents via real tasks, not synthetic prompts
+ - Uses a simple, opinionated interface (a terminal)
+ - Is cheap to run, easy to extend, and hard to game
+
+ An operations-focused benchmark should do the same, but centered on:
+
+ - Action-oriented agents with tool APIs
+ - Environment interaction and partial observability
+ - Longish horizons with state, retries, and recovery
+
+ In practice, this covers:
+ - OpenClaw-native agents
+ - Custom planner loops wired into REST or filesystem tools
+ - Orchestration agents (e.g., TaskWeaver, AutoGPT-style) that can wrap the simple `reset/observe/act` interface
+
+ Think of it less as "benchmarking a model" and more as benchmarking an agent loop end-to-end.
+
+ ## What makes these agents distinct?
+ (Adjust these if your mental model differs.)
+
+ - Planner / policy loop instead of single-shot prompting
+ - Tool or action interfaces instead of raw chat completions
+ - Optional memory, world models, or reusable skills
+ - Strong emphasis on doing, not just responding
+
+ So the benchmark should **not** test raw language quality or one-shot reasoning. It should test:
+
+ - Decision-making under constraints
+ - Tool sequencing and dependency management
+ - Recovery from errors and partial failures
+ - State tracking over time and across steps
+
+ ## Why this exists
+ Most benchmarks answer questions like:
+ - Can the model reason?
+ - Can it write the right patch?
+ - Can it roleplay an agent?
+
+ TraceCore answers a different question:
+ Can this agent run unattended and get the job done without breaking things?
+
+ We test:
+ - Tool sequencing
+ - Error recovery
+ - State tracking
+ - Long-horizon behavior
+ - Boring, reliable decision-making
+
+ ## Design principles
+ 1. **Minimal environment, maximal signal**
+    - Keep worlds tiny, deterministic, and inspectable: toy filesystems, fake APIs, log streams, local services.
+    - No giant simulators or cloud dependencies—everything should run in seconds on a laptop.
+ 2. **Agent-in-the-loop evaluation**
+    - Benchmark the entire perception → reasoning → action loop, not a single prompt.
+    - Each task specifies initial state, tool interface, validator, and explicit budgets (steps + tool calls).
+ 3. **Binary outcomes first**
+    - Success or failure is the headline metric; secondary stats (steps, tool calls, errors) give color.
+    - Deterministic tasks + frozen versions make regressions obvious and stop overfitting.
+ 4. **Hard to game, easy to extend**
+    - Sandboxed execution, limited affordances, and published hashes keep agents honest.
+    - Tasks are small Python packages so contributors can add new scenarios without ceremony.
+
+ ## Task categories (operations-native)
+ ### 1. Tool choreography tasks
+ Goal: stress sequencing, dependency management, and retries.
+
+ - *Example:* `rate_limited_api@1` — retrieve an `ACCESS_TOKEN` from a mock API that enforces a deterministic rate limit and transient failures.
+ - *Signals:* correct tool ordering, retry logic, state retention, graceful degradation.
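The retry discipline those signals reward can be sketched in a few lines. This is an illustrative stand-in, not the task's real tool API: `call` and the `status`/`retry_after` response fields are hypothetical.

```python
import time

# Hedged sketch of the behavior the rate-limit tasks reward: honor a
# server-provided retry_after instead of hammering the endpoint.
def call_with_retry(call, max_attempts=5):
    for _ in range(max_attempts):
        response = call()
        if response.get("status") == "ok":
            return response
        # Back off for exactly the interval the service asked for.
        time.sleep(response.get("retry_after", 0))
    raise RuntimeError("budget exhausted before success")
```

An agent that skips the `retry_after` wait burns tool calls against its budget; one that waits blindly without tracking state fails the chained variants.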
+
+ ### 2. Partial observability & discovery
+ Goal: reward cautious exploration instead of brute force.
+
+ - *Example:* “Traverse a directory tree with undocumented schema. Find the real config key without trashing the filesystem.”
+ - *Signals:* hypothesis updates, selective reads, remembering seen paths, avoiding repeated mistakes.
+
+ ### 3. Long-horizon maintenance
+ Goal: ensure persistence, monitoring, and acting at the right moment.
+
+ - *Example:* “A service degrades over time. Watch logs, detect the symptom, and apply the correct fix only when needed.”
+ - *Signals:* patience, trigger detection, not overreacting, applying steady-state playbooks.
+
+ ### 4. Adversarial-but-fair environments
+ Goal: test robustness when the world is a little hostile.
+
+ - *Example:* flaky tools, malformed API responses, conflicting telemetry that needs disambiguation.
+ - *Signals:* error recovery, fallback strategies, keeping track of provenance before acting.
+
+ ## Scoring without overengineering
+ - Binary success/failure is the scoreboard.
+ - Secondary metrics: steps taken, tool calls, wall-clock time, error count.
+ - No LLM judges, no vibes, no composite scores you can’t reason about.
+
+ ## Interface sketch
+ Agents run exactly like they would in production: provide an agent, pick a task, respect the budget.
+
+ ```sh
+ agent-bench run \
+   --agent agents/toy_agent.py \
+   --task filesystem_hidden_config@1 \
+   --seed 42
+ ```
+
+ Each task ships with a harness, fake environment, and validator. Agents only see what they’re allowed to see.
+
+ ## Why this matters (and what’s missing today)
+ Most agent benchmarks collapse back into single-prompt exams. They rarely measure recovery, operational competence, or whether the agent can survive unattended. TraceCore surfaces engineering-quality differences and rewards boring-but-correct behavior.
+
+ ## Potential pitfalls & guardrails
+ - **Overfitting to the harness** → Keep suites varied, publish fixtures, encourage new contributions.
+ - **Agents cheating via inspection** → Sandbox aggressively, freeze binaries, limit visibility.
+ - **Benchmark drift** → Freeze task versions, publish hashes/seeded assets, require changelog entries.
+
+ ## What’s in v0
+ Task suites:
+ - Filesystem & State
+ - Tool Choreography
+ - Long-Horizon & Monitoring
+ - Adversarial-but-Fair
+ - Operations & Triage
+
+ Shipping tasks:
+ - `filesystem_hidden_config@1` (filesystem suite): explore a hidden directory tree to find the one true `API_KEY`.
+ - `rate_limited_api@1` (api suite): classify API errors, respect `retry_after`, and persist the returned `ACCESS_TOKEN`.
+ - `log_alert_triage@1` (operations suite): triage deterministic logs and extract the final `ALERT_CODE`.
+ - `config_drift_remediation@1` (operations suite): compare desired vs. live configs and output the remediation patch line.
+ - `incident_recovery_chain@1` (operations suite): follow a recovery handoff chain to recover `RECOVERY_TOKEN`.
+ - `log_stream_monitor@1` (operations suite): poll a paginated log stream, ignore noise, and emit `STREAM_CODE` when a `CRITICAL` entry is detected.
+
+ Each task:
+ - Defines an initial environment
+ - Exposes a constrained action interface
+ - Has a single, deterministic success condition
+
+ ## How it works
+ You provide any agent that implements the documented interface.
+ We provide a task harness.
+ The agent runs until:
+ - It succeeds
+ - It fails
+ - It runs out of budget
+
+ No human in the loop. No retries.
+
+ ## Example
+ ```sh
+ agent-bench run \
+   --agent agents/toy_agent.py \
+   --task filesystem_hidden_config@1 \
+   --seed 42
+
+ # Replay a prior run_id (defaults to recorded agent/task/seed, but you can override):
+ agent-bench run --replay <run_id> --seed 42
+ ```
+
+ ### Configuration via `agent-bench.toml`
+
+ Rather not repeat `--agent`, `--task`, and `--seed` every time? Drop a config file in the repo root (or pass `--config path/to/file`). Set `AGENT_BENCH_CONFIG=agent-bench.toml` in CI (and any automation) so the same defaults apply everywhere.
+
+ ```toml
+ [defaults]
+ agent = "agents/toy_agent.py"
+ task = "filesystem_hidden_config@1"
+ seed = 42
+
+ [agent."agents/rate_limit_agent.py"]
+ task = "rate_limited_api@1"
+ seed = 11
+ ```
+
+ The CLI resolves flags first, then per-agent overrides, then the `[defaults]` block. Any command accepts `--config` to point at another file; otherwise `agent-bench.toml` (or `agent_bench.toml`) is used when present or when `AGENT_BENCH_CONFIG` is set.
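That precedence (explicit flag, then per-agent override, then `[defaults]`) is a first-match lookup. Sketched here as hypothetical Python for illustration — the actual resolver lives in `agent_bench/config.py` and is not reproduced:

```python
# Hypothetical sketch of the documented lookup order.
def resolve(option, cli_flags, agent_overrides, defaults):
    # First layer that defines the option wins: flag > per-agent > [defaults].
    for layer in (cli_flags, agent_overrides, defaults):
        if option in layer:
            return layer[option]
    return None
```

With the example config above, `agent-bench run --agent agents/rate_limit_agent.py` without `--seed` would resolve `seed = 11` from the per-agent block, while an explicit `--seed 3` would win outright.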
+
+ If `agent-bench` isn’t on your PATH yet, call it via Python:
+
+ ```powershell
+ python -m agent_bench.cli --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
+ ```
+
+ Every CLI run writes a JSON artifact under `.agent_bench/runs/<run_id>.json`. Inspect them directly, or list them via:
+
+ ```sh
+ agent-bench runs list --limit 5
+ ```
+
+ Want to zero in on a specific outcome? Use the structured failure taxonomy filter:
+
+ ```sh
+ agent-bench runs list --failure-type timeout --limit 5
+ agent-bench runs list --failure-type success --limit 5  # only successful runs
+ ```
+
+ The same buckets surface in the Web UI’s **Recent Runs** list, where each entry is labeled
+ `Success` or `Failure — <type>` so you can spot budget exhaustion vs. invalid actions at a glance.
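Because each artifact is plain JSON carrying the fields shown in the example run output (`success`, `failure_type`), the same bucketing is easy to reproduce yourself. A hedged sketch, assuming every file in the runs directory is a run artifact:

```python
import json
from pathlib import Path

# Illustrative only: group persisted run artifacts by failure_type, using
# the success/failure_type fields from the documented run JSON.
def bucket_runs(runs_dir=".agent_bench/runs"):
    buckets = {}
    for path in Path(runs_dir).glob("*.json"):
        run = json.loads(path.read_text())
        key = "success" if run.get("success") else run.get("failure_type", "unknown")
        buckets.setdefault(key, []).append(path.name)
    return buckets
```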
+
+ Need a quick aggregate of how an agent performs on a task? Use the baseline helper:
+
+ ```sh
+ agent-bench baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1
+ ```
+
+ It emits success rate, average steps/tool calls, and links back to the latest trace for that agent/task pair. Add `--export` to persist a frozen snapshot for the web UI:
+
+ ```sh
+ agent-bench baseline --export  # writes .agent_bench/baselines/baseline-<ts>.json
+ agent-bench baseline --export latest  # custom filename in the baselines folder
+ ```
+
+ Compare two specific runs (paths or `run_id`s) to see exactly where traces diverge:
+
+ ```sh
+ agent-bench baseline --compare .agent_bench/runs/run_a.json .agent_bench/runs/run_b.json
+ # or mix path + run_id
+ agent-bench baseline --compare abcd1234 efgh5678
+ ```
+
+ The diff output highlights whether the agent/task/success states match and lists per-step differences.
+ Use `--format text` for a quick human summary; exit codes are `0` (identical), `1` (different), `2` (incompatible task/agent).
+ For CI usage, see [`docs/ci_workflow.md`](docs/ci_workflow.md).
+ This repo also ships a `chain-agent-baseline` workflow wired to `agents/chain_agent.py` + `rate_limited_chain@1`.
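In a CI script, those exit codes map cleanly onto a gate. A hedged Python sketch — the subprocess invocation assumes `agent-bench` is on PATH; only the code-to-label mapping is taken from the documentation above:

```python
import subprocess

# Mapping taken from the documented exit codes of `baseline --compare`.
def classify(returncode):
    return {0: "identical", 1: "different", 2: "incompatible"}.get(returncode, "error")

# Hypothetical CI gate: surface a label instead of a bare return code.
def gate(run_a, run_b):
    result = subprocess.run(["agent-bench", "baseline", "--compare", run_a, run_b])
    return classify(result.returncode)
```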
461
+
462
+ The Baselines tab in the UI only shows a "Latest published" card after you export at least once.
463
+
464
+
465
+ ## Minimal Web UI (Optional)
466
+ Prefer sliders and buttons over the CLI? Spin up the lightweight FastAPI form:
467
+
468
+ ```sh
469
+ pip install tracecore
470
+ agent-bench dashboard --host 127.0.0.1 --port 8000 --reload
471
+ ```
472
+
473
+ > **`--reload` is for local development only.** It enables uvicorn's auto-reload on file changes and should not be used in shared or production environments. Omit the flag for stable serving.
474
+
475
+ > Tip: create a virtual environment first (e.g., `python -m venv .venv && .venv\Scripts\activate` on Windows) so the FastAPI deps stay isolated. See the official FastAPI installation guide for more platform-specific options: <https://fastapi.tiangolo.com/#installation>
476
+
477
+ Then visit [http://localhost:8000](http://localhost:8000) to:
478
+ - Pick any agent module under `agents/`
479
+ - Choose a task (`filesystem_hidden_config@1`, `rate_limited_api@1`, etc.) and seed
480
+ - Launch runs, inspect structured JSON results (seed included), and drill into traces
481
+ - Replay a prior run by pasting its `run_id` and optionally overriding the seed/agent/task
482
+
483
+ The UI intentionally ships with **no** Node/Vite stack—just FastAPI + Jinja—so you can layer more elaborate frontends later without losing the minimal flow.
484
+
485
+ Output:
486
+ ```json
487
+ {
488
+ "task_id": "filesystem_hidden_config",
489
+ "version": 1,
490
+ "seed": 42,
491
+ "success": true,
492
+ "failure_reason": null,
493
+ "failure_type": null,
494
+ "steps_used": 37,
495
+ "tool_calls_used": 12
496
+ }
497
+ ```

### Diagnostics workflow

1. **Run & persist** — both the CLI and the web UI call the same harness and automatically persist artifacts under `.agent_bench/runs/` with metadata (`run_id`, `trace_id`, timestamps, harness version, trace entries).
2. **Inspect traces** — load [http://localhost:8000/?trace_id=<run_id>](http://localhost:8000/?trace_id=%3Crun_id%3E) to jump straight to the trace viewer, or fetch raw JSON via `/api/traces/<run_id>`.
3. **Compare outcomes** — use `agent-bench baseline ...` or the UI baseline table to spot regressions (success rate, average steps/tool calls) before publishing results.
4. **Freeze specs** — once a run set looks good, tag the task versions + harness revision so those run IDs remain reproducible proof of behavior.
5. **Manual verification** — before freezing or sharing results, run through `docs/manual_verification.md` to replay the CLI + UI flows end-to-end.

To inspect a specific run artifact directly, use:
```sh
agent-bench runs list --limit 5
# then load the JSON artifact from .agent_bench/runs/<run_id>.json
```
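Loading an artifact from Python is just as direct. A minimal sketch; the `run_id`/`trace_id`/`trace` field names here are assumptions based on the metadata described in step 1, so check a real artifact for the exact schema:

```python
import json
from pathlib import Path


def summarize_run(artifact_path: str) -> str:
    """Summarize one persisted run artifact from .agent_bench/runs/.

    Field names ("run_id", "trace_id", "trace") are assumptions mirroring
    the metadata listed in the diagnostics workflow above.
    """
    data = json.loads(Path(artifact_path).read_text())
    trace_len = len(data.get("trace", []))
    return f"{data.get('run_id')}: trace_id={data.get('trace_id')}, {trace_len} trace entries"
```

For example, `summarize_run(".agent_bench/runs/<run_id>.json")` gives a one-line overview before you open the full trace viewer.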

## Release process

Ready to cut a release? See [`docs/release_process.md`](docs/release_process.md) for the standard checklist (changelog, version stamping, test gate, SPEC_FREEZE alignment, trust evidence bundle, and tagging steps). Historical release notes are also archived there.

## What we measure
Per task:
- Success / failure
- Steps taken
- Tool calls
- Error count

Across a suite:
- Success rate
- Aggregate efficiency metrics

See [SPEC_FREEZE.md](SPEC_FREEZE.md) for the frozen v0.1.0 task list (including the new `rate_limited_chain@1` pain task) and the rules for bumping versions.

We deliberately avoid:
- LLM-based judges
- Natural language grading
- Weighted composite scores

## Reference agent
TraceCore ships with a minimal reference agent.
It is:
- Conservative
- State-driven
- Explicit about errors
- Boring on purpose

If your agent can’t outperform the reference agent, that’s a signal.

Reference implementations:
- `agents/toy_agent.py` — solves filesystem discovery tasks.
- `agents/rate_limit_agent.py` — handles classic rate-limit retry flows (`rate_limited_api@1`).
- `agents/chain_agent.py` — completes the chained handshake + rate-limit pain task (`rate_limited_chain@1`).
- `agents/ops_triage_agent.py` — handles operations triage tasks (`log_alert_triage@1`, `config_drift_remediation@1`, `incident_recovery_chain@1`).
- `agents/cheater_agent.py` — intentionally malicious “cheater sim” that tries to read hidden state; the sandbox should block it with a `sandbox_violation` so you can prove the harness defenses work.

## Adding a task
Tasks are small and self-contained, but every bundled scenario now flows through a manifest so registry + docs stay aligned.

### Bundled manifest
- `tasks/registry.json` enumerates every built-in task (`filesystem_hidden_config@1`, `rate_limited_api@1`, `rate_limited_chain@1`, `deterministic_rate_service@1`, `log_alert_triage@1`, `config_drift_remediation@1`, `incident_recovery_chain@1`).
- Update this registry whenever you add new operations tasks.
- When you add or bump a task version, update this manifest, SPEC_FREEZE, and the docs table in `docs/tasks.md`.

### Plugin workflow
- External packages can expose tasks without living in this repo via the `agent_bench.tasks` entry-point group.
- See [`docs/task_plugin_template.md`](docs/task_plugin_template.md) for a ready-to-copy layout, entry-point snippet, and `register()` helper contract.
- The loader automatically merges bundled manifest entries and plugin descriptors, so `agent-bench run --task my_plugin_task@1` works once the package is installed.
- Validate task manifests/registry entries with `agent-bench tasks validate` before publishing plugins or bumping versions.

### Task requirements
Every task ships:
- Environment setup (`setup.py`)
- Available actions/tools (`actions.py`)
- Validator (`validate.py`)
- Budget defaults + metadata (`task.toml`)
- Contract fields defined in [`docs/contract_spec.md`](docs/contract_spec.md)

If your task:
- Requires internet access
- Needs a GPU
- Takes minutes to run

It probably doesn’t belong here.

## Non-goals
TraceCore does not aim to:
- Benchmark raw language quality
- Measure creativity
- Replace SWE-bench or Terminal Bench
- Simulate the real world

It tests operational competence, nothing more.

## Status
This project is early and opinionated.
Expect:
- Breaking changes
- Small task suites
- Strong opinions

If you disagree, open an issue—or better, a PR.

One-line summary:
Terminal Bench, but for agents that actually have to do things.