trapstreet-cli 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,40 @@
+ .env
+ # Python
+ __pycache__/
+ *.py[cod]
+ *.pyo
+ *.pyd
+ *.so
+
+ # Distribution / packaging
+ dist/
+ build/
+ *.egg-info/
+ .eggs/
+
+ # uv / virtual envs
+ .venv/
+ uv.lock~
+
+ # hatch-vcs generated
+ src/trap/_version.py
+
+ # trap runtime workspace
+ .trap/
+
+ # Testing
+ .pytest_cache/
+ .coverage
+ htmlcov/
+ .deepeval
+
+ # Ruff
+ .ruff_cache/
+
+ # macOS
+ .DS_Store
+
+ # Editors
+ .idea/
+ .vscode/
+ *.swp
@@ -0,0 +1,394 @@
+ Metadata-Version: 2.4
+ Name: trapstreet-cli
+ Version: 0.1.0
+ Summary: tp — non-invasive CLI testing framework for AI workflows. Submits results to trapstreet.run.
+ Project-URL: Homepage, https://trapstreet.run
+ Project-URL: Repository, https://github.com/AntiNoise-ai/trapstreet-mvp
+ Project-URL: Issues, https://github.com/AntiNoise-ai/trapstreet-mvp/issues
+ Author-email: AntiNoise <hcshu@dlyai.com>
+ License-Expression: MIT
+ Keywords: agent,ai,benchmark,eval,llm,trapstreet,workflow
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Environment :: Console
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3 :: Only
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Topic :: Software Development :: Testing
+ Requires-Python: >=3.13
+ Requires-Dist: loguru>=0.7
+ Requires-Dist: pydantic>=2.0
+ Requires-Dist: pyyaml>=6.0
+ Requires-Dist: rich>=13.0
+ Requires-Dist: typer>=0.12
+ Description-Content-Type: text/markdown
+
+ # trap — CLI
+
+ > Lives at [`trapstreet-mvp/cli/`](https://github.com/AntiNoise-ai/trapstreet-mvp/tree/main/cli) — part of the trapstreet monorepo. The standalone repo `AntiNoise-ai/trap` is retained for history only; all active development happens here.
+
+ Install (from PyPI — recommended):
+
+ ```bash
+ uv tool install trapstreet-cli
+ # also works via pipx / pip
+ ```
+
+ From git (latest main, no PyPI release needed):
+
+ ```bash
+ uv tool install "git+https://github.com/AntiNoise-ai/trapstreet-mvp.git#subdirectory=cli"
+ ```
+
+ The command name is `tp`. Releases are git-tagged `cli-vX.Y.Z`; tagging
+ triggers PyPI publish via GitHub Actions.
+
+ `trap` is a **non-invasive CLI testing framework for AI prompts, agents, and workflows**. It treats the program under test (the "solution") as a black box: it invokes the solution as a subprocess, captures stdout/stderr/files, then optionally pipes that output through a "judge" (per-case scorer) and a "grader" (overall aggregator) — also subprocesses, also language-agnostic.
+
+ The framework knows nothing about how the solution is implemented. Python, shell scripts, compiled binaries, agentic pipelines — anything that can be invoked from a shell works.
+
+ ---
+
+ ## Core idea: solution and task are decoupled
+
+ Two roles, two repos (or two directories), connected only by a small IO contract:
+
+ | Role | Owns | Configures |
+ |---|---|---|
+ | **Solution author** | `trap.yaml`, the solution code | how to invoke the solution, which inputs to feed it, which outputs it produces |
+ | **Task author** | `traptask.yaml`, `judge.py`, `grader.py`, `inputs/`, `expected/` | the test cases, scoring logic, expected outputs |
+
+ The solution doesn't need to import trap or know it exists. It reads paths from two environment variables (`INPUTS`, `OUTPUTS`) and runs.
+
+ ```
+ inputs/{case_id}/ ──[INPUTS env var]──▶ solution ──[OUTPUTS env var]──▶ .trap/{task}/{ts}/{case_id}/
+ expected/{case_id}/                                                                  │
+     │                                                                                │
+     └──────────────────────── judge ◀───────────────────────────────────────────────┘
+
+                               {metrics: any JSON}
+
+                      [collect all cases, hand to grader]
+
+                                     grader
+
+                              {passed, score, ...}
+ ```
+
+ ---
+
+ ## Install
+
+ Requires Python `>=3.14` and `uv`.
+
+ ```bash
+ git clone https://github.com/AntiNoise-ai/trap
+ cd trap
+ uv sync
+ ```
+
+ The installed entry point is **`tp`** (not `trap`), declared in `pyproject.toml`:
+
+ ```bash
+ uv run tp --help
+ ```
+
+ When working from a source checkout rather than the PyPI package, use `uv run tp …` inside the clone, or run it from a wheel you build locally (`uv build`).
+
+ ---
+
+ ## Quick start — the echo example
+
+ The repo ships two complete worked examples under `examples/`. Walk through `examples/echo/` to see the moving pieces.
+
+ ### 1. Solution side — `examples/echo/solution/`
+
+ `echo.py` reads JSON from stdin and prints `message` to stdout (or errors with exit 1 if `message` is missing):
+
+ ```python
+ import json, sys
+ data = json.load(sys.stdin)
+ if "message" not in data:
+     print("error: missing 'message' field", file=sys.stderr)
+     sys.exit(1)
+ print(data["message"])
+ ```
+
+ `trap.yaml` tells trap how to run it and where the task lives:
+
+ ```yaml
+ tasks:
+   test:
+     description: Echo solution — reads stdin JSON, writes it back to stdout
+     cmd: uv run python echo.py
+     traptask: ../task       # path to the task directory (relative to trap.yaml)
+     inputs:
+       stdin: input.json     # pipe inputs/{case_id}/input.json into stdin
+ ```
+
+ ### 2. Task side — `examples/echo/task/`
+
+ `traptask.yaml` lists the cases and points at the judge/grader:
+
+ ```yaml
+ dirs:
+   inputs: inputs/       # optional; this is the default
+   expected: expected/   # optional; this is the default
+
+ cases:
+   - id: contains_basic
+     description: stdout contains the substring (case-insensitive)
+     tags: [smoke]
+   - id: exit_code_failure
+     description: exit code is 1 when message field is missing
+   - id: skipped_example
+     skip: true
+     tags: [wip]
+
+ judge:
+   cmd: .venv/bin/python judge.py    # optional — omit for output-only mode
+
+ grader:
+   cmd: .venv/bin/python grader.py   # optional — omit to skip aggregation
+ ```
+
+ Each case has a directory under `inputs/{id}/` (and optionally `expected/{id}/`) holding whatever files that case needs.
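+
+ For the echo example, the task directory therefore looks roughly like this (the `input.json` / `expected.json` names come from the snippets above; any file names your cases need will do):
+
+ ```
+ task/
+ ├── traptask.yaml
+ ├── judge.py
+ ├── grader.py
+ ├── inputs/
+ │   ├── contains_basic/
+ │   │   └── input.json
+ │   └── exit_code_failure/
+ │       └── input.json
+ └── expected/
+     └── contains_basic/
+         └── expected.json
+ ```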
+
+ `judge.py` reads the payload from `TRAPTASK_PAYLOAD`, evaluates one case, prints a JSON metric to stdout:
+
+ ```python
+ import json, os, re
+ from pathlib import Path
+
+ data = json.loads(os.environ["TRAPTASK_PAYLOAD"])
+ stdout = Path(data["outputs"]["case_stdout"]).read_text().strip()
+ exit_code = json.loads(Path(data["outputs"]["case_meta.json"]).read_text())["exit_code"]
+ expected = json.loads(Path(data["expected"]["expected.json"]).read_text())
+
+ # … compute results …
+ print(json.dumps({"score": score}))
+ ```
+
+ `grader.py` receives the list of all case results and emits the overall verdict:
+
+ ```python
+ import json, os
+ results = json.loads(os.environ["TRAPTASK_PAYLOAD"])
+ passed = all(r["metrics"]["score"] == 1.0 for r in results)
+ avg_score = sum(r["metrics"]["score"] for r in results) / len(results)
+ print(json.dumps({"passed": passed, "score": avg_score}))
+ ```
+
+ ### 3. Run it
+
+ From `examples/echo/solution/`:
+
+ ```bash
+ uv run tp run                  # run the first task in trap.yaml
+ uv run tp run test             # run the named task
+ uv run tp run -t smoke         # only run cases tagged `smoke`
+ uv run tp run --output json    # print machine-readable JSON instead of a rich table
+ uv run tp run --fail-fast      # stop on first case whose judge score < 1.0
+ ```
+
+ Trap writes per-run artifacts under `.trap/{task}/{timestamp}/` and updates a `latest` symlink alongside it.
+
+ ---
+
+ ## CLI reference
+
+ ```
+ tp run [TASK] [OPTIONS]    # execute a task
+ tp report [TASK] [RUN]     # re-print the report for a stored run (defaults to `latest`)
+ tp init                    # scaffold trap.yaml + traptask.yaml — NOT YET IMPLEMENTED
+ ```
+
+ ### `tp run` options
+
+ | Flag | Default | Purpose |
+ |---|---|---|
+ | `TASK` (positional) | first task in `trap.yaml` | which task to run |
+ | `--config / -c` | `trap.yaml` | path to the trap config |
+ | `--tag / -t` | (none) | filter cases by tag; repeatable |
+ | `--output / -o` | `rich` | report renderer: `rich` or `json` |
+ | `--fail-fast` | `false` | stop after the first case whose judge `score < 1.0` |
+ | `--workspace / -w` | `.trap` | where to write run artifacts |
+
+ ### `tp report` options
+
+ Re-renders a previously stored run from disk. Same `--config / --output / --workspace` flags; the `RUN` argument is the timestamp directory name, or `latest` (default).
+
+ ### Exit codes
+
+ - `0` — every case exited 0, and (if a grader ran) `metrics.passed` is not `False`.
+ - `1` — at least one case had a non-zero exit code, **or** the grader returned `{"passed": false}`.
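+
+ That makes the exit code directly usable in CI. A minimal sketch, relying only on the documented exit codes and the `.trap/{task}/latest` symlink (the script itself is illustrative, not part of the repo):
+
+ ```bash
+ # Fail the CI step when tp exits non-zero (a case failed or the grader said {"passed": false}).
+ if ! uv run tp run test; then
+     echo "trap task failed: see .trap/test/latest/ for artifacts" >&2
+     exit 1
+ fi
+ ```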
+
+ ---
+
+ ## Configuration reference
+
+ ### `trap.yaml` (solution author)
+
+ ```yaml
+ tasks:
+   test:                       # task name; arbitrary
+     description: optional     # shown in the report
+     cmd: uv run python solution.py
+     traptask: ../task         # path to the task dir (contains traptask.yaml)
+     inputs:                   # optional
+       stdin: input.txt        # filename in inputs/{case_id}/ to pipe as stdin
+       files:                  # optional: filenames to assert exist before running
+         - config.json
+     file_outputs:             # files the solution promises to write
+       - result.json
+     timeout: 30               # seconds; default 30
+     inputs_envvar: INPUTS     # override the env var name if you want
+     outputs_envvar: OUTPUTS
+
+   run:                        # second task; same traptask, different cmd or inputs
+     cmd: uv run python solution.py
+     traptask: ../task
+     inputs:
+       stdin: input.txt
+     file_outputs:
+       - result.json
+ ```
+
+ - `tasks:` is a mapping; each key is a task name you can pass to `tp run`.
+ - `traptask` is required for every task — it points at the **directory** containing `traptask.yaml`.
+ - `cmd` is parsed via `shlex.split`, run with the trap.yaml's directory as `cwd`.
+
+ ### `traptask.yaml` (task author)
+
+ The entire file is optional. **If `traptask.yaml` is absent**, trap scans `inputs/` and treats each subdirectory as a case in *output-only mode* (no judge, no grader, no expected). With it:
+
+ ```yaml
+ dirs:
+   inputs: inputs/       # default
+   expected: expected/   # default
+
+ cases:
+   - id: contains_basic             # must match an inputs/<id>/ directory
+     description: optional
+     tags: [smoke]                  # for `tp run -t smoke`
+   - id: skipped_example
+     skip: true                     # case is not executed
+
+ judge:                             # optional; omit for output-only mode
+   cmd: .venv/bin/python judge.py
+   payload_envvar: TRAPTASK_PAYLOAD # default; override if you must
+
+ grader:                            # optional; omit to skip aggregation
+   cmd: .venv/bin/python grader.py
+ ```
+
+ `judge.cmd` and `grader.cmd` run with `cwd` set to the task directory.
+
+ ---
+
+ ## The IO contract
+
+ Trap injects environment variables at three points. Values are always **JSON strings** (not file paths) so consumers can `json.loads(os.environ[…])` directly.
+
+ ### Solution-side: `INPUTS` and `OUTPUTS`
+
+ Before running each case, trap injects:
+
+ ```jsonc
+ // INPUTS — every file in inputs/{case_id}/
+ {
+   "input.json": "/abs/path/task/inputs/contains_basic/input.json",
+   "config.json": "/abs/path/task/inputs/contains_basic/config.json"
+ }
+
+ // OUTPUTS — every filename declared in trap.yaml `file_outputs`
+ {
+   "result.json": "/abs/path/.trap/test/2026-05-09T14:30:00/contains_basic/result.json"
+ }
+ ```
+
+ Keys are full filenames *with* extension. Values are absolute paths. The solution reads `INPUTS["foo.json"]` and writes to `OUTPUTS["result.json"]`. If you have nothing to read or write via files, you can still receive content on stdin via `inputs.stdin` in trap.yaml.
+
+ `stdout`, `stderr`, and `meta.json` are captured automatically — the solution never writes to `OUTPUTS["case_stdout"]` itself.
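+
+ A minimal solution that consumes this contract might look like the sketch below (the `input.json` / `result.json` names are whatever your `trap.yaml` declares under `inputs` and `file_outputs`):
+
+ ```python
+ # Sketch: read one declared input, write one declared output.
+ import json, os
+
+ inputs = json.loads(os.environ["INPUTS"])      # {"input.json": "/abs/path/…"}
+ outputs = json.loads(os.environ["OUTPUTS"])    # {"result.json": "/abs/path/…"}
+
+ with open(inputs["input.json"]) as f:
+     payload = json.load(f)
+
+ with open(outputs["result.json"], "w") as f:
+     json.dump({"echoed": payload}, f)
+ ```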
+
+ ### Judge-side: `TRAPTASK_PAYLOAD`
+
+ For each case, the judge receives a JSON string with three namespaces:
+
+ ```jsonc
+ {
+   "inputs": { "input.json": "/abs/path/task/inputs/case1/input.json" },
+   "outputs": {
+     "case_stdout": "/abs/path/.trap/test/.../case1/case_stdout",
+     "case_stderr": "/abs/path/.trap/test/.../case1/case_stderr",
+     "case_meta.json": "/abs/path/.trap/test/.../case1/case_meta.json"
+   },
+   "expected": { "expected.json": "/abs/path/task/expected/case1/expected.json" }
+ }
+ ```
+
+ Note the captured outputs are keyed `case_stdout`, `case_stderr`, `case_meta.json` (prefixed) — that's what the runner writes to disk and what the example `judge.py` reads. `case_meta.json` contains `{"exit_code": N, "duration": seconds}`.
+
+ The judge prints free-form JSON to stdout. Trap stores it verbatim as `CaseResult.metrics`. Convention: include a numeric `score` field if you want `--fail-fast` to be meaningful (it checks `metrics.score < 1.0`).
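+
+ Filling in the elided scoring step from the quick-start snippet, a complete judge for the `contains_basic` case could look like this; the "stdout contains an expected substring, case-insensitively" rule and the `{"contains": "..."}` shape of `expected.json` are assumptions of this sketch, not anything trap mandates:
+
+ ```python
+ # Sketch of a full judge: score 1.0 when stdout contains the expected
+ # substring (case-insensitive) and the solution exited cleanly.
+ import json, os
+ from pathlib import Path
+
+ payload = json.loads(os.environ["TRAPTASK_PAYLOAD"])
+ stdout = Path(payload["outputs"]["case_stdout"]).read_text().strip()
+ meta = json.loads(Path(payload["outputs"]["case_meta.json"]).read_text())
+ expected = json.loads(Path(payload["expected"]["expected.json"]).read_text())
+
+ ok = expected["contains"].lower() in stdout.lower() and meta["exit_code"] == 0
+ print(json.dumps({"score": 1.0 if ok else 0.0}))
+ ```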
+
+ ### Grader-side: `TRAPTASK_PAYLOAD`
+
+ The grader receives the full list of per-case results as JSON:
+
+ ```jsonc
+ [
+   {"case_id": "contains_basic", "exit_code": 0, "duration": 0.12, "metrics": {"score": 1.0}, "skipped": false},
+   {"case_id": "exact_match", "exit_code": 0, "duration": 0.11, "metrics": {"score": 0.5}, "skipped": false}
+ ]
+ ```
+
+ It prints free-form JSON to stdout. Convention: include `passed: bool` (used for the exit-code check) and `score: float` (used in the report header).
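+
+ A grader that follows both conventions might look like the sketch below; the only extra assumption is that skipped cases are excluded from the average:
+
+ ```python
+ # Sketch of a grader: require every executed case to score 1.0 and report the mean score.
+ import json, os
+
+ results = json.loads(os.environ["TRAPTASK_PAYLOAD"])
+ scores = [r["metrics"].get("score", 0.0) for r in results if not r["skipped"]]
+
+ avg = sum(scores) / len(scores) if scores else 0.0
+ print(json.dumps({"passed": bool(scores) and all(s == 1.0 for s in scores), "score": avg}))
+ ```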
+
+ ---
+
+ ## The `.trap/` workspace
+
+ ```
+ .trap/
+ └── {task_name}/
+     ├── latest -> 2026-05-09T14:30:00/
+     └── 2026-05-09T14:30:00/
+         ├── {case_id}/
+         │   ├── case_stdout
+         │   ├── case_stderr
+         │   ├── case_meta.json     # {"exit_code": 0, "duration": 0.12}
+         │   ├── judge_stdout       # raw judge output (if judge ran)
+         │   ├── judge_stderr
+         │   ├── judge_meta.json
+         │   └── {any declared file_outputs}
+         ├── grader_stdout          # raw grader output (if grader ran)
+         ├── grader_stderr
+         ├── grader_meta.json
+         └── report.json            # the rendered/JSON report for this run
+ ```
+
+ Use `tp report` to re-display a stored run without re-executing the solution.
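+
+ The `latest` symlink also makes the newest run easy to inspect from the shell (paths follow the tree above; `test` and `contains_basic` are the example task and case names):
+
+ ```bash
+ cat .trap/test/latest/report.json                    # stored report for the last run
+ cat .trap/test/latest/contains_basic/case_stdout     # what the solution printed for one case
+ uv run tp report test latest --output json           # same report, re-rendered by tp
+ ```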
+
+ ---
+
+ ## Three running modes
+
+ Choose by what you put (or don't put) in `traptask.yaml`:
+
+ | Mode | `judge` | `grader` | Pass/fail signal |
+ |---|---|---|---|
+ | Output-only | absent | absent | any case `exit_code != 0` → exit 1 |
+ | Per-case scoring | present | absent | same as above |
+ | Full evaluation | present | present | grader's `metrics.passed == false` → exit 1 |
+
+ In output-only mode you can even omit `traptask.yaml` entirely — trap will discover cases by scanning `inputs/` subdirectories.
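+
+ The smallest possible setup is therefore a `trap.yaml` plus a bare `inputs/` tree; a sketch with illustrative names (`smoke`, `solution.sh`, `input.txt`):
+
+ ```yaml
+ # trap.yaml for output-only mode: ../task holds only inputs/{case_id}/ directories,
+ # no traptask.yaml, no judge, no grader. Pass/fail is just the cases' exit codes.
+ tasks:
+   smoke:
+     cmd: ./solution.sh
+     traptask: ../task
+     inputs:
+       stdin: input.txt
+ ```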
+
+ ---
+
+ ## Current limitations
+
+ - `tp init` is a stub; scaffolding is not implemented yet.
+ - Cases run sequentially (the `TaskRunner._iter` generator is deliberately left as a seam for future parallelization).
393
+ - No PyPI release; install from source.
394
+ - Python 3.14 minimum.