ghostlab 0.1.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (49) hide show
  1. ghostlab/__init__.py +5 -0
  2. ghostlab/__main__.py +4 -0
  3. ghostlab/cli.py +5 -0
  4. ghostlab-0.1.0.dist-info/METADATA +669 -0
  5. ghostlab-0.1.0.dist-info/RECORD +49 -0
  6. ghostlab-0.1.0.dist-info/WHEEL +4 -0
  7. ghostlab-0.1.0.dist-info/entry_points.txt +3 -0
  8. ghostlab-0.1.0.dist-info/licenses/LICENSE +21 -0
  9. rehearsal/__init__.py +3 -0
  10. rehearsal/__main__.py +4 -0
  11. rehearsal/apps_host/__init__.py +30 -0
  12. rehearsal/apps_host/assertions.py +102 -0
  13. rehearsal/apps_host/executor.py +112 -0
  14. rehearsal/apps_host/protocol.py +169 -0
  15. rehearsal/apps_host/renderer.py +205 -0
  16. rehearsal/apps_host/report.py +123 -0
  17. rehearsal/cli.py +1027 -0
  18. rehearsal/codex_backend.py +104 -0
  19. rehearsal/compare.py +117 -0
  20. rehearsal/config.py +193 -0
  21. rehearsal/critique.py +231 -0
  22. rehearsal/dataset.py +297 -0
  23. rehearsal/evaluate.py +478 -0
  24. rehearsal/generate.py +215 -0
  25. rehearsal/inspect.py +236 -0
  26. rehearsal/logging.py +17 -0
  27. rehearsal/mcp_apps.py +448 -0
  28. rehearsal/mcp_client.py +285 -0
  29. rehearsal/mcp_config.py +34 -0
  30. rehearsal/orchestrator.py +236 -0
  31. rehearsal/personas.py +134 -0
  32. rehearsal/profile.py +221 -0
  33. rehearsal/prompts.py +90 -0
  34. rehearsal/report.py +64 -0
  35. rehearsal/review.py +248 -0
  36. rehearsal/runner_presets.py +113 -0
  37. rehearsal/runners.py +178 -0
  38. rehearsal/scorecard.py +266 -0
  39. rehearsal/storage/__init__.py +19 -0
  40. rehearsal/storage/db.py +136 -0
  41. rehearsal/storage/hashing.py +19 -0
  42. rehearsal/storage/ids.py +44 -0
  43. rehearsal/storage/migrations/0001_initial.sql +349 -0
  44. rehearsal/storage/redact.py +46 -0
  45. rehearsal/storage/repository.py +1036 -0
  46. rehearsal/tool_capture.py +183 -0
  47. rehearsal/types.py +30 -0
  48. rehearsal/ui/__init__.py +1 -0
  49. rehearsal/ui/app.py +1095 -0
ghostlab/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ """Compatibility package for the MCP Ghostlab CLI."""
2
+
3
+ from rehearsal import __version__
4
+
5
+ __all__ = ["__version__"]
ghostlab/__main__.py ADDED
@@ -0,0 +1,4 @@
1
+ from rehearsal.cli import main
2
+
3
+
4
+ raise SystemExit(main())
ghostlab/cli.py ADDED
@@ -0,0 +1,5 @@
1
+ from rehearsal.cli import main
2
+
3
+
4
+ if __name__ == "__main__":
5
+ raise SystemExit(main())
@@ -0,0 +1,669 @@
1
+ Metadata-Version: 2.4
2
+ Name: ghostlab
3
+ Version: 0.1.0
4
+ Summary: Local end-to-end testing lab for any MCP server: coding agents role-play real users, drive your tools over multiple turns, score runs, and render/interact with MCP Apps ui:// widgets.
5
+ Project-URL: Documentation, https://sajjadgg.github.io/Rehearsal/
6
+ Project-URL: Homepage, https://github.com/sajjadGG/Rehearsal
7
+ Project-URL: Repository, https://github.com/sajjadGG/Rehearsal
8
+ Project-URL: Issues, https://github.com/sajjadGG/Rehearsal/issues
9
+ Project-URL: Changelog, https://github.com/sajjadGG/Rehearsal/releases
10
+ Author: Sajjad Gholamzadeh
11
+ License: MIT
12
+ License-File: LICENSE
13
+ Keywords: agents,ai-agents,claude,cli,codex,end-to-end-testing,evaluation,llm,llm-evaluation,mcp,mcp-apps,model-context-protocol,testing
14
+ Classifier: Development Status :: 3 - Alpha
15
+ Classifier: Environment :: Console
16
+ Classifier: Intended Audience :: Developers
17
+ Classifier: License :: OSI Approved :: MIT License
18
+ Classifier: Operating System :: OS Independent
19
+ Classifier: Programming Language :: Python :: 3
20
+ Classifier: Programming Language :: Python :: 3.10
21
+ Classifier: Programming Language :: Python :: 3.11
22
+ Classifier: Programming Language :: Python :: 3.12
23
+ Classifier: Programming Language :: Python :: 3.13
24
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
25
+ Classifier: Topic :: Software Development :: Quality Assurance
26
+ Classifier: Topic :: Software Development :: Testing
27
+ Requires-Python: >=3.10
28
+ Provides-Extra: apps
29
+ Requires-Dist: playwright>=1.40; extra == 'apps'
30
+ Provides-Extra: dev
31
+ Requires-Dist: build>=1.2; extra == 'dev'
32
+ Requires-Dist: mkdocs>=1.6; extra == 'dev'
33
+ Requires-Dist: pytest>=8.0; extra == 'dev'
34
+ Requires-Dist: twine>=5.0; extra == 'dev'
35
+ Provides-Extra: docs
36
+ Requires-Dist: mkdocs>=1.6; extra == 'docs'
37
+ Provides-Extra: ui
38
+ Requires-Dist: streamlit>=1.30; extra == 'ui'
39
+ Description-Content-Type: text/markdown
40
+
41
+ # MCP Rehearsal / Ghostlab
42
+
43
+ > A local, end-to-end **testing lab for any MCP server** — coding agents role-play
44
+ > real users, drive your tools over multiple turns, and the harness captures
45
+ > traces, scores outcomes, and even **renders and clicks through MCP Apps UI
46
+ > widgets**.
47
+
48
+ [![CI](https://github.com/sajjadGG/Rehearsal/actions/workflows/ci.yml/badge.svg)](https://github.com/sajjadGG/Rehearsal/actions)
49
+ [![Docs](https://img.shields.io/badge/docs-wiki-blue)](https://sajjadgg.github.io/Rehearsal/)
50
+ [![Python](https://img.shields.io/badge/python-3.10%E2%80%933.13-blue)](pyproject.toml)
51
+ [![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
52
+ [![llms.txt](https://img.shields.io/badge/llms.txt-✓-purple)](llms.txt)
53
+
54
+ **Test your MCP server the way it's actually used** — not with unit tests against
55
+ the protocol, but with a real coding agent (Codex / Claude) that picks tools,
56
+ makes mistakes, and tries to accomplish goals, while a second agent plays the
57
+ user. Ghostlab understands a target MCP, generates persona × scenario datasets,
58
+ runs the dual-agent loop, scores each run, and compares runs for regressions.
59
+
60
+ 📖 **Docs wiki:** https://sajjadgg.github.io/Rehearsal/ · 🤖 **For agents:** [`llms.txt`](llms.txt) · 🛠 **Contributing:** [`CONTRIBUTING.md`](CONTRIBUTING.md)
61
+
62
+ ## Quickstart
63
+
64
+ ```bash
65
+ python3.13 -m venv .venv
66
+ .venv/bin/pip install -e . # add '.[ui]' for the web UI, '.[apps]' for widget rendering
67
+
68
+ # 1. Understand a target MCP
69
+ ghostlab inspect --target targets/cortex-local.json
70
+
71
+ # 2. Drive it with two agents (one under test, one emulating a user)
72
+ ghostlab run --target targets/cortex-local.json --scenario scenarios/cortex-onboarding-status.json
73
+
74
+ # 3. Render and click through an MCP Apps ui:// widget (needs '.[apps]')
75
+ ghostlab apps-render --target targets/cortex-local.json --tool views_generate_sentence_scramble \
76
+ --arguments '{"target_sentence":"The cat sat on the mat","shuffled_elements":["mat","The","on","sat","cat","the"]}' \
77
+ --intent '{"type":"reorder","value":["The","cat","sat","on","the","mat"]}'
78
+ ```
79
+
80
+ ## What it does
81
+
82
+ | Stage | Commands | What you get |
83
+ | --- | --- | --- |
84
+ | **Understand** | `inspect`, `profile` | Tool/resource/prompt dump + a capability profile, with lint findings |
85
+ | **Generate** | `generate-scenarios`, `generate-personas`, `generate-dataset`, `review-dataset` | Reusable persona × scenario datasets you can curate |
86
+ | **Run** | `run`, `run-dataset` | Multi-turn dual-agent transcripts with structured tool-call capture |
87
+ | **Evaluate** | `evaluate`, `compare` | Pass/fail verdicts (codex judge) and regression diffs between runs |
88
+ | **MCP Apps** | `apps-probe`, `apps-render` | Fetch/diagnose `ui://` widgets, then render + interact with them in headless Chrome |
89
+ | **Persist & explore** | `db`, `ui` | SQLite run history + a Streamlit UI over the whole pipeline |
90
+
91
+ ## Goal
92
+
93
+ Build a repeatable, sandboxed tester that can:
94
+
95
+ - Run any target MCP app in an isolated environment.
96
+ - Launch one coding-agent session as the **agent-under-test** (with target MCP injected).
97
+ - Launch another coding-agent session as the **user emulator** (persona + goal driven).
98
+ - Drive multi-turn interactions between them.
99
+ - Capture full traces, tool activity, failures, and outcomes.
100
+
101
+ This lets you test with your existing Codex/Claude usage path, instead of wiring a separate LLM provider deployment just for E2E testing.
102
+
103
+ ## Scope
104
+
105
+ Rehearsal is intentionally **app-agnostic**:
106
+
107
+ - Works with any MCP server reachable by stdio/SSE/streamable HTTP.
108
+ - Supports local or remote MCP endpoints.
109
+ - Supports multiple coding-agent runners (Codex, Claude Code, and future adapters).
110
+
111
+ No Cortex-specific assumptions are required in the core harness.
112
+
113
+ ## Core Idea
114
+
115
+ Rehearsal uses a **dual-harness architecture**:
116
+
117
+ 1. **AUT Harness (Agent Under Test)**
118
+ - Starts a coding-agent session (Codex or Claude Code).
119
+ - Injects target MCP server config into that session.
120
+ - Exposes a controlled I/O bridge so it can receive user messages and return replies/tool results.
121
+
122
+ 2. **User Emulator Harness**
123
+ - Starts a second coding-agent session.
124
+ - Gives it a scenario file (persona, goals, constraints, success criteria).
125
+ - Asks it to act like a realistic user and send messages turn-by-turn to the AUT.
126
+
127
+ 3. **Orchestrator**
128
+ - Coordinates turn-taking, timeouts, retries, and stop conditions.
129
+ - Logs every message and event in structured format.
130
+ - Produces a run report with bug candidates and reproduction context.
131
+
132
+ ## First Implementation Plan
133
+
134
+ ### Phase 1: Local Loop
135
+
136
+ - Define scenario schema (JSON).
137
+ - Define target schema for MCP connection config (stdio/SSE/HTTP).
138
+ - Build a Python orchestrator that runs:
139
+ - `codex`/`claude` process A as AUT
140
+ - `codex`/`claude` process B as emulator
141
+ - Relay turns through a strict protocol.
142
+ - Write JSONL logs + markdown summary.
143
+
144
+ ### Phase 2: Sandboxed Execution
145
+
146
+ - Add Docker Compose profiles for generic MCP target services.
147
+ - Keep orchestrator on host or in sidecar container.
148
+ - Stamp each run with target ID + build SHA/version + scenario ID + timestamp.
149
+
150
+ ### Phase 3: Regression + CI
151
+
152
+ - Add deterministic scenario packs.
153
+ - Add pass/fail gates (timeouts, tool misuse, policy violations, hallucinated capabilities, schema errors).
154
+ - Publish comparison reports between runs.
155
+
156
+ ## Target Configuration Model
157
+
158
+ Each test run points to a target definition, for example:
159
+
160
+ - `target.id`: unique name (`filesystem-mcp-local`, `my-app-staging`)
161
+ - `transport`: `stdio` | `sse` | `streamable-http`
162
+ - `connection`: command+args+env (stdio) or URL+headers (network transports)
163
+ - `capabilities`: optional expected tools/resources/prompts
164
+ - `startup`: optional health checks and boot timeout
165
+
166
+ This model makes the same harness reusable across different MCP apps.
167
+
168
+ ## What We’ll Log
169
+
170
+ Per run:
171
+
172
+ - Target metadata (id, transport, endpoint/command fingerprint).
173
+ - Scenario metadata (id, persona, goal).
174
+ - Full AUT/emulator transcripts.
175
+ - MCP tool call envelopes (request/response/error).
176
+ - Timing (latency per turn, total runtime).
177
+ - Exit states (success, timeout, crash, policy breach).
178
+ - Repro bundle pointers.
179
+
180
+ ## Success Criteria
181
+
182
+ Rehearsal is useful when you can:
183
+
184
+ - Start one command and run multiple scenarios against any MCP target.
185
+ - Reproduce failures with the same target+scenario seed/config.
186
+ - Compare two runs and quickly see regressions.
187
+ - Debug from logs without rerunning blindly.
188
+
189
+ ## Current Folder Layout
190
+
191
+ ```text
192
+ mcp-rehearsal/
193
+ README.md
194
+ __main__.py
195
+ rehearsal/
196
+ targets/
197
+ scenarios/
198
+ runners/
199
+ runs/
200
+ docker/
201
+ ```
202
+
203
+ ## Commands
204
+
205
+ Install locally from this checkout:
206
+
207
+ ```bash
208
+ python3.13 -m venv .venv
209
+ .venv/bin/pip install -r requirements-dev.txt
210
+ ```
211
+
212
+ The package installs two equivalent console scripts:
213
+
214
+ ```bash
215
+ ghostlab --help
216
+ rehearsal --help
217
+ ```
218
+
219
+ Rehearsal exposes subcommands (the bare `--target ... --scenario ...` form still
220
+ works and is treated as `run`):
221
+
222
+ - `ghostlab inspect` — connect to a target MCP and capture what it exposes.
223
+ - `ghostlab profile` — turn an `inspect.json` into a capability profile (codex).
224
+ - `ghostlab generate-scenarios` — generate scenarios from a profile (codex).
225
+ - `ghostlab generate-personas` — generate a reusable persona library (codex).
226
+ - `ghostlab generate-dataset` — build a persona x scenario dataset (codex).
227
+ - `ghostlab review-dataset` — review & curate a dataset (coverage, flags, approve/reject).
228
+ - `ghostlab run-dataset` — run every case in a dataset.
229
+ - `ghostlab run` — run a dual-agent E2E scenario.
230
+ - `ghostlab evaluate` — score a run into a pass/fail verdict (codex judge).
231
+ - `ghostlab compare` — diff two dataset runs for regressions.
232
+ - `ghostlab apps-probe` — probe a target's MCP Apps (`ui://`) widgets: fetch resources + CSP diagnostics.
233
+ - `ghostlab apps-render` — render a `ui://` widget in headless Chrome, drive it, and capture proof.
234
+ - `ghostlab doctor` — check codex and validate runner presets.
235
+ - `ghostlab ui` — launch the Streamlit pipeline UI.
236
+
237
+ ### The UI: `ghostlab ui`
238
+
239
+ Run the whole pipeline from a browser instead of the CLI:
240
+
241
+ ```bash
242
+ pip install 'ghostlab[ui]' # installs streamlit
243
+ ghostlab ui # opens http://localhost:8501
244
+ ```
245
+
246
+ The app walks an MCP through four user-facing stages:
247
+
248
+ 1. **Inspect MCP** — connect to the MCP, verify its tools/resources, and analyze
249
+ its capabilities and likely workflows.
250
+ 2. **Build Test Cases** — generate personas and scenarios, pair them into
251
+ runnable cases, then review coverage, warnings, case selection, and the
252
+ resolved per-case prompts together.
253
+ 3. **Run & Evaluate** — run each selected persona + scenario case with the
254
+ agent-under-test and user emulator, then optionally evaluate it with a codex
255
+ judge. Generation and runs expose determinate progress and live turn activity.
256
+ 4. **Review Results** — browse each run's chronological conversation trace,
257
+ inline tool activity, exact runtime prompts, model/duration metadata, and
258
+ verdict evidence. Filter run history by target, status, verdict, or search.
259
+
260
+ A **case** is the concrete runnable preset formed by pairing one persona with
261
+ one scenario. Each selected case produces one run and one trace.
262
+
263
+ The sidebar sets the **workspace dir**, the **codex binary**, and the **codex
264
+ model** (applied to every codex-backed stage — generation, the AUT/user runners,
265
+ and the judge — and shown wherever it is used). Each stage has a **🔍 View
266
+ prompt** expander so you can see the exact prompt sent to codex (profile,
267
+ persona/scenario generation, the agent-under-test and user-emulator prompts, and
268
+ the judge). Steps gate on their prerequisites and the run step shows live
269
+ per-case progress.
270
+
271
+ Artifacts are written under the workspace directory (default
272
+ `ghostlab_workspace/`) so runs persist and can also be opened with the CLI.
273
+
274
+ ## Install from PyPI
275
+
276
+ Once published, install the released package directly:
277
+
278
+ ```bash
279
+ pip install ghostlab # add [ui] and/or [apps] for those extras
280
+ ghostlab --help
281
+ ```
282
+
283
+ ## Packaging & Release
284
+
285
+ Build and validate distributions locally:
286
+
287
+ ```bash
288
+ .venv/bin/python -m pytest
289
+ .venv/bin/python -m build
290
+ .venv/bin/twine check dist/*
291
+ ```
292
+
293
+ CI runs tests on Python 3.10 through 3.13 and verifies that the package builds.
294
+ Releases are automated: the **`publish.yml`** workflow builds the sdist + wheel,
295
+ publishes them to PyPI via **Trusted Publishing**, and attaches them to the
296
+ GitHub Release — triggered when you **publish a GitHub Release** (or run the
297
+ workflow manually). Cut a release like:
298
+
299
+ ```bash
300
+ # bump rehearsal/__init__.py __version__ first, then:
301
+ gh release create v0.1.0 --generate-notes
302
+ ```
303
+
304
+ To enable publishing, create the PyPI project **`ghostlab`** and add a Trusted
305
+ Publisher for this repository, workflow `.github/workflows/publish.yml`,
306
+ environment `pypi`. No PyPI username or token is committed.
307
+
308
+ The Pages workflow builds the docs wiki with MkDocs and deploys it to GitHub
309
+ Pages on pushes to `main`, `v*.*.*` release tags, and manual workflow runs. In
310
+ the GitHub repository settings, set Pages to use GitHub Actions as the source.
311
+
312
+ ### Understand a new MCP: `inspect`
313
+
314
+ Point it at a target and it introspects the server without any coding-agent
315
+ credits or manual `curl`:
316
+
317
+ ```bash
318
+ ghostlab inspect --target targets/cortex-local.json
319
+ ```
320
+
321
+ This connects over the configured transport (stdio / streamable-HTTP / SSE),
322
+ runs the `initialize` handshake, and pages through `tools/list`,
323
+ `resources/list`, `resources/templates/list`, and `prompts/list`. It writes
324
+ `runs/<id>-inspect/inspect.json` (raw) and `inspect.md` (readable), and **lints**
325
+ tool/resource descriptions for references to tools the server does not actually
326
+ expose (e.g. Cortex descriptions mention `kb_find` / `kb_read` / `kb_read_skill`,
327
+ which are not in `tools/list`). This capability dump is the input to capability
328
+ profiling and scenario generation.
329
+
330
+ ### Profile a new MCP: `profile`
331
+
332
+ Turn the raw `inspect.json` into a structured **capability profile** — the
333
+ bridge between Understand and Generate. Deterministic structure (tool taxonomy
334
+ by name family, read/write state surfaces, gaps) is computed locally; a domain
335
+ summary and inferred multi-step workflows are generated by codex:
336
+
337
+ ```bash
338
+ ghostlab profile \
339
+ --inspect runs/<id>-inspect/inspect.json
340
+ ```
341
+
342
+ It writes `capabilities.json` + `capabilities.md` next to the `inspect.json`.
343
+ Generated workflow steps are filtered to real tool names, so the profile never
344
+ references hallucinated or non-exposed tools. This profile is the input scenario
345
+ generation consumes.
346
+
347
+ ### Generate scenarios: `generate-scenarios`
348
+
349
+ Generate grounded use-case scenarios the MCP supports, derived from the
350
+ capability profile:
351
+
352
+ ```bash
353
+ ghostlab generate-scenarios \
354
+ --profile runs/<id>-inspect/capabilities.json \
355
+ --n 3 \
356
+ --output-dir scenarios
357
+ ```
358
+
359
+ Scenarios are spread across intents (`happy_path` / `edge_case` / `adversarial`)
360
+ and each declares an `exercises` list of the tools it should drive the assistant
361
+ to use. Tool references are filtered to real tool names, so scenarios never
362
+ depend on hallucinated or non-exposed tools. Each scenario is written as a
363
+ `ScenarioConfig`-shaped JSON file ready for `run`.
364
+
365
+ ### Build a persona library: `generate-personas`
366
+
367
+ Personas are reusable **user profiles** decoupled from scenarios, so the same
368
+ persona can be paired with many scenarios (the basis for the dataset matrix).
369
+ Generate a domain-relevant library from a capability profile:
370
+
371
+ ```bash
372
+ ghostlab generate-personas \
373
+ --profile runs/<id>-inspect/capabilities.json \
374
+ --n 4 \
375
+ --output-dir personas
376
+ ```
377
+
378
+ Each persona has a `summary`, behavioral `traits` (terse, impatient, easily
379
+ confused, non-native, ...), and a domain `context` map (native_language,
380
+ target_exam, level, ...). Pass one to a run with `--persona`:
381
+
382
+ ```bash
383
+ ghostlab run ... --persona personas/ielts-power-user.json
384
+ ```
385
+
386
+ The user-emulator prompt is composed from the persona's summary + traits +
387
+ context. Scenarios with an inline `persona` string still work unchanged; when a
388
+ persona is supplied, the scenario's inline note refines it.
389
+
390
+ ### Build a dataset: `generate-dataset`
391
+
392
+ A dataset is a **persona x scenario matrix** — different users, and different
393
+ scenarios tailored to each of them. For every persona, codex generates
394
+ persona-specific scenarios, and the pairs become runnable cases:
395
+
396
+ ```bash
397
+ ghostlab generate-dataset \
398
+ --profile runs/<id>-inspect/capabilities.json \
399
+ --personas 3 --scenarios-per-persona 3 --seed 7 \
400
+ --name cortex
401
+ ```
402
+
403
+ This writes a self-contained dataset directory:
404
+
405
+ ```text
406
+ datasets/cortex/
407
+ dataset.json manifest: mcp, seed, cases[]
408
+ personas/<id>.json
409
+ scenarios/<id>.json persona-namespaced; inline `persona` is a situational note
410
+ ```
411
+
412
+ The persona is the authoritative identity at run time; each scenario's inline
413
+ `persona` carries only a short situational note ("has 45 minutes before work"),
414
+ so the two never conflict. The `--seed` governs case ordering for reproducible
415
+ manifests.
416
+
417
+ ### Review & curate a dataset: `review-dataset`
418
+
419
+ Before spending agent credits, check that the dataset makes sense:
420
+
421
+ ```bash
422
+ ghostlab review-dataset \
423
+ --dataset datasets/cortex \
424
+ --profile runs/<id>-inspect/capabilities.json
425
+ ```
426
+
427
+ This writes `review.md` + `review.json` with a **tool-coverage matrix** (which
428
+ tool categories are exercised, which tools are never touched), **per-case
429
+ previews** (persona traits, situation, goal, opening message, success/failure
430
+ criteria, exercises), and **flags**: near-duplicate cases, scenarios exercising
431
+ non-exposed tools, and personas with no scenarios.
432
+
433
+ Curation is **file-first** — each case gets a `status` in `dataset.json`
434
+ (`pending` / `approved` / `rejected` / `needs-edit`). Edit it by hand, or use:
435
+
436
+ ```bash
437
+ # approve/reject by case id (no ids = all cases)
438
+ ghostlab review-dataset --dataset datasets/cortex \
439
+ --approve case-a case-b --reject case-c
440
+ ```
441
+
442
+ Then run only the approved cases:
443
+
444
+ ```bash
445
+ ghostlab run-dataset --dataset datasets/cortex \
446
+ --target targets/cortex-local.json --approved-only
447
+ ```
448
+
449
+ ### Run a dataset: `run-dataset`
450
+
451
+ Execute every case (use `--limit` for small dev runs):
452
+
453
+ ```bash
454
+ ghostlab run-dataset \
455
+ --dataset datasets/cortex \
456
+ --target targets/cortex-local.json \
457
+ --aut-runner runners/codex-cortex-aut.json \
458
+ --user-runner runners/codex-user-emulator.json \
459
+ --limit 2
460
+ ```
461
+
462
+ Each case runs through the orchestrator (with its persona) into its own run
463
+ directory, and a dataset-level `summary.md` + `results.json` capture per-case
464
+ status and turn counts.
465
+
466
+ ### Tool-call capture & output hygiene
467
+
468
+ Every run captures structured MCP tool calls from the agent host. The codex AUT
469
+ runners set `"parser": "codex-json"` and run `codex exec --json`, so the
470
+ orchestrator parses the JSONL stream and records each `mcp_tool_call` with its
471
+ **arguments, result, error, and status** — plus the clean assistant message —
472
+ into `events.jsonl`, with a per-turn table in `report.md`. (Runners without the
473
+ codex-json parser fall back to scraping codex's plain-text
474
+ `mcp: <server>/<tool> started|(completed)|(failed)` lines for tool name +
475
+ status.) When a scenario declares `exercises`, the report also shows a
476
+ tool-coverage line (expected vs. actually called).
477
+
478
+ stdout and stderr are kept **separate**: only stdout (with known host noise
479
+ redacted) becomes the conversational message handed to the other agent, while
480
+ raw stderr is logged for debugging. This prevents the emulator from reacting to
481
+ agent-host warnings instead of the assistant's actual reply.
482
+
483
+ ### Evaluate a run: `evaluate`
484
+
485
+ Turn a run into a structured pass/fail verdict:
486
+
487
+ ```bash
488
+ ghostlab evaluate --run runs/<id> --capabilities runs/<id>-inspect/capabilities.json
489
+ ```
490
+
491
+ It combines **deterministic checks** over the captured tool calls (failed calls,
492
+ expected-tool coverage from the scenario's `exercises`) with a **codex
493
+ LLM-judge** that scores each `success_criterion` (met?) and `failure_signal`
494
+ (triggered?) from the transcript and tool calls. Hard gates force an overall
495
+ `fail`: the run crashed, a failure signal triggered, or — when `--capabilities`
496
+ is supplied — the assistant claimed a tool the server does not expose. Writes
497
+ `verdict.json` + `verdict.md`; exits non-zero unless the verdict is `pass`
498
+ (`partial` exits 0 unless `--strict`), so datasets can gate CI.
499
+
500
+ ### Score a whole dataset
501
+
502
+ `run-dataset --evaluate` runs the codex judge on each case and records the
503
+ verdict in the per-case run dir and in the summary's `results.json` (stamped with
504
+ the ghostlab version + dataset seed for provenance):
505
+
506
+ ```bash
507
+ ghostlab run-dataset --dataset datasets/cortex \
508
+ --target targets/cortex-local.json \
509
+ --aut-runner runners/codex-cortex-local-session.json \
510
+ --evaluate --capabilities runs/<id>-inspect/capabilities.json
511
+ ```
512
+
513
+ ### Compare two runs: `compare`
514
+
515
+ After editing a prompt or tool description, re-run the same dataset and diff the
516
+ results to see what got better or worse:
517
+
518
+ ```bash
519
+ ghostlab compare --base runs/<base>-summary --candidate runs/<cand>-summary \
520
+ --output comparison.md
521
+ ```
522
+
523
+ It diffs case-by-case on verdict (falling back to run status), listing
524
+ **regressions** (newly failing) first, then **fixes** (newly passing), then other
525
+ changes. Exits non-zero when there are regressions, so it can gate CI.
526
+
527
+ ### Probe MCP Apps widgets: `apps-probe`
528
+
529
+ Some MCPs ship **MCP Apps UI** resources — a tool's `_meta.ui.resourceUri` points
530
+ to a `ui://…` HTML widget a compatible host is expected to render. The vanilla
531
+ runner can confirm an agent *called* a UI-producing tool, but not that the widget
532
+ rendered or that a user could interact with it (see
533
+ `specs/cortex-mcp-apps-e2e.spec`, issue #13).
534
+
535
+ `apps-probe` is the first increment of the MCP Apps host layer. It connects to a
536
+ target, finds every UI-producing tool, fetches each `ui://` resource via
537
+ `resources/read`, and reports render-readiness and CSP diagnostics:
538
+
539
+ ```bash
540
+ ghostlab apps-probe --target targets/cortex-local.json
541
+ # or restrict to specific widgets:
542
+ ghostlab apps-probe --target targets/cortex-local.json --tool views_create_listening_practice
543
+ ```
544
+
545
+ It writes `apps-probe.json` + `apps-probe.md` with the resource's MIME profile,
546
+ HTML size, preferred frame hints, and CSP connect/resource domains. Diagnostics
547
+ flag empty/unfetchable resources, non-`mcp-app` MIME types, and tools that accept
548
+ remote media (`audio_url`, `image_url`, …) whose resource CSP would block it. The
549
+ report reserves structured sections for the host-bridge transcript, interaction
550
+ trace, render artifacts, and final app state — populated by `apps-render` below.
551
+ The module also defines the **UI-intent contract**
552
+ (`reorder`/`choose`/`type`/`reveal`/`submit`/`rate`/`mark`) the user emulator
553
+ emits, and the host-bridge message vocabulary a renderer must implement.
554
+
555
+ ### Render & drive MCP Apps widgets: `apps-render`
556
+
557
+ `apps-render` actually **renders** a `ui://` widget and proves a user can see and
558
+ use it. It implements the MCP Apps host bridge (JSON-RPC over `postMessage`,
559
+ protocol `2026-01-26`), mounts the widget in a sandboxed headless-Chrome iframe,
560
+ completes the `ui/initialize` handshake, and feeds it the tool input + result so
561
+ it renders real content. It then captures a screenshot, the visible DOM text, the
562
+ host-bridge transcript, console/network errors, and runs app-aware assertions —
563
+ and can execute a sequence of UI intents against the live widget.
564
+
565
+ ```bash
566
+ pip install 'ghostlab[apps]' && playwright install chrome # one-time
567
+ ghostlab apps-render --target targets/cortex-local.json \
568
+ --tool views_generate_sentence_scramble \
569
+ --arguments '{"target_sentence":"The cat sat on the mat","shuffled_elements":["mat","The","on","sat","cat","the"]}' \
570
+ --intent '{"type":"reorder","value":["The","cat","sat","on","the","mat"]}' \
571
+ --intent '{"type":"reveal"}'
572
+ ```
573
+
574
+ By default it **calls the tool** with `--arguments` to obtain the result the
575
+ widget renders from (use `--no-call` to render from the arguments alone, or omit
576
+ `--tool` to pick the first UI-producing tool). It writes `apps-render.json` +
577
+ `apps-render.md`, a `widget.png` of the initial render, and a `widget-final.png`
578
+ after the intents run. Exit status is non-zero if the render errored or any
579
+ assertion failed, so it can gate CI. In the example above the reorder intent
580
+ rebuilds the sentence and the widget confirms _"Nice work. Sentence is correct."_
581
+ — proving both visibility and a completed interaction.
582
+
583
+ This is the browser-backed increment of issue #13. Still ahead: richer per-widget
584
+ assertions, the full request-side host bridge (`call-server-tool` proxying,
585
+ `open-link`/`download-file` handling), and wiring the user emulator to emit UI
586
+ intents during a live `run`.
587
+
588
+ ### Session runner (one live agent across turns)
589
+
590
+ By default each turn spawns a fresh agent process and the orchestrator replays
591
+ the transcript. The **session runner** (`"kind": "codex-session"`) instead keeps
592
+ one codex session alive: turn 1 records the `thread_id` from the JSONL
593
+ `thread.started` event, and later turns run `codex exec resume <thread_id>` so
594
+ codex retains context — the orchestrator then sends only the new user message
595
+ instead of the whole transcript (fewer tokens, no repeated cold-start noise). The
596
+ shared `session_id` is logged per turn in `events.jsonl` for auditability.
597
+
598
+ ```bash
599
+ ghostlab run --target targets/cortex-local.json --scenario <scenario.json> \
600
+ --aut-runner runners/codex-cortex-local-session.json --user-runner <user.json>
601
+ ```
602
+
603
+ ### Validate your setup: `doctor`
604
+
605
+ Check that codex is reachable and that runner presets are well-formed before a
606
+ run:
607
+
608
+ ```bash
609
+ ghostlab doctor # validates runners/*.json
610
+ ghostlab doctor --runners runners/codex-cortex-local-session.json
611
+ ```
612
+
613
+ It reports the codex binary + version and validates each runner's kind, command,
614
+ and parser (e.g. a `codex-session` command must contain `exec`).
615
+
616
+ ### Default agent backend
617
+
618
+ `codex` is the default coding-agent backend for the generation and run stages.
619
+ The `inspect` command needs no agent — it is a direct MCP client. The codex
620
+ binary is auto-detected from `$PATH`, then the macOS app bundle
621
+ (`/Applications/Codex.app/Contents/Resources/codex`); override with
622
+ `$REHEARSAL_CODEX_BIN` or `--codex-bin`.
623
+
624
+ ## Quick Start
625
+
626
+ Run a mock scenario without spending any coding-agent credits:
627
+
628
+ ```bash
629
+ cd mcp-rehearsal
630
+ ghostlab run \
631
+ --target targets/example-stdio.json \
632
+ --scenario scenarios/basic-discovery.json \
633
+ --aut-runner runners/mock-aut.json \
634
+ --user-runner runners/mock-user.json
635
+ ```
636
+
637
+ The run output is written under `runs/<run-id>/`:
638
+
639
+ - `events.jsonl`: structured event log
640
+ - `report.md`: readable run summary
641
+ - `target.mcp.json`: generated `mcpServers` config for the target
642
+
643
+ ## Runner Configs
644
+
645
+ Mock runner:
646
+
647
+ ```json
648
+ {
649
+ "kind": "mock"
650
+ }
651
+ ```
652
+
653
+ Process runner:
654
+
655
+ ```json
656
+ {
657
+ "kind": "process",
658
+ "command": ["codex", "exec", "-"],
659
+ "env": {},
660
+ "timeout_seconds": 300,
661
+ "prompt_mode": "stdin"
662
+ }
663
+ ```
664
+
665
+ The process runner starts one fresh process per turn. `prompt_mode` can be `stdin`, `append-arg`, or `replace-placeholder`. Rehearsal also sets `REHEARSAL_TARGET_ID` and `REHEARSAL_MCP_CONFIG` for the AUT process so runner commands can inject the generated MCP config into Codex, Claude Code, or another agent host.
666
+
667
+ ## Next Step
668
+
669
+ Wire the process runner to real Codex and Claude Code MCP config injection, then add a Docker Compose sandbox for target MCP apps.