ghostlab 0.1.0__py3-none-any.whl
- ghostlab/__init__.py +5 -0
- ghostlab/__main__.py +4 -0
- ghostlab/cli.py +5 -0
- ghostlab-0.1.0.dist-info/METADATA +669 -0
- ghostlab-0.1.0.dist-info/RECORD +49 -0
- ghostlab-0.1.0.dist-info/WHEEL +4 -0
- ghostlab-0.1.0.dist-info/entry_points.txt +3 -0
- ghostlab-0.1.0.dist-info/licenses/LICENSE +21 -0
- rehearsal/__init__.py +3 -0
- rehearsal/__main__.py +4 -0
- rehearsal/apps_host/__init__.py +30 -0
- rehearsal/apps_host/assertions.py +102 -0
- rehearsal/apps_host/executor.py +112 -0
- rehearsal/apps_host/protocol.py +169 -0
- rehearsal/apps_host/renderer.py +205 -0
- rehearsal/apps_host/report.py +123 -0
- rehearsal/cli.py +1027 -0
- rehearsal/codex_backend.py +104 -0
- rehearsal/compare.py +117 -0
- rehearsal/config.py +193 -0
- rehearsal/critique.py +231 -0
- rehearsal/dataset.py +297 -0
- rehearsal/evaluate.py +478 -0
- rehearsal/generate.py +215 -0
- rehearsal/inspect.py +236 -0
- rehearsal/logging.py +17 -0
- rehearsal/mcp_apps.py +448 -0
- rehearsal/mcp_client.py +285 -0
- rehearsal/mcp_config.py +34 -0
- rehearsal/orchestrator.py +236 -0
- rehearsal/personas.py +134 -0
- rehearsal/profile.py +221 -0
- rehearsal/prompts.py +90 -0
- rehearsal/report.py +64 -0
- rehearsal/review.py +248 -0
- rehearsal/runner_presets.py +113 -0
- rehearsal/runners.py +178 -0
- rehearsal/scorecard.py +266 -0
- rehearsal/storage/__init__.py +19 -0
- rehearsal/storage/db.py +136 -0
- rehearsal/storage/hashing.py +19 -0
- rehearsal/storage/ids.py +44 -0
- rehearsal/storage/migrations/0001_initial.sql +349 -0
- rehearsal/storage/redact.py +46 -0
- rehearsal/storage/repository.py +1036 -0
- rehearsal/tool_capture.py +183 -0
- rehearsal/types.py +30 -0
- rehearsal/ui/__init__.py +1 -0
- rehearsal/ui/app.py +1095 -0
ghostlab/__init__.py
ADDED
ghostlab/__main__.py
ADDED
ghostlab/cli.py
ADDED
|
@@ -0,0 +1,669 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: ghostlab
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Local end-to-end testing lab for any MCP server: coding agents role-play real users, drive your tools over multiple turns, score runs, and render/interact with MCP Apps ui:// widgets.
|
|
5
|
+
Project-URL: Documentation, https://sajjadgg.github.io/Rehearsal/
|
|
6
|
+
Project-URL: Homepage, https://github.com/sajjadGG/Rehearsal
|
|
7
|
+
Project-URL: Repository, https://github.com/sajjadGG/Rehearsal
|
|
8
|
+
Project-URL: Issues, https://github.com/sajjadGG/Rehearsal/issues
|
|
9
|
+
Project-URL: Changelog, https://github.com/sajjadGG/Rehearsal/releases
|
|
10
|
+
Author: Sajjad Gholamzadeh
|
|
11
|
+
License: MIT
|
|
12
|
+
License-File: LICENSE
|
|
13
|
+
Keywords: agents,ai-agents,claude,cli,codex,end-to-end-testing,evaluation,llm,llm-evaluation,mcp,mcp-apps,model-context-protocol,testing
|
|
14
|
+
Classifier: Development Status :: 3 - Alpha
|
|
15
|
+
Classifier: Environment :: Console
|
|
16
|
+
Classifier: Intended Audience :: Developers
|
|
17
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
18
|
+
Classifier: Operating System :: OS Independent
|
|
19
|
+
Classifier: Programming Language :: Python :: 3
|
|
20
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
21
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
22
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
23
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
24
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
25
|
+
Classifier: Topic :: Software Development :: Quality Assurance
|
|
26
|
+
Classifier: Topic :: Software Development :: Testing
|
|
27
|
+
Requires-Python: >=3.10
|
|
28
|
+
Provides-Extra: apps
|
|
29
|
+
Requires-Dist: playwright>=1.40; extra == 'apps'
|
|
30
|
+
Provides-Extra: dev
|
|
31
|
+
Requires-Dist: build>=1.2; extra == 'dev'
|
|
32
|
+
Requires-Dist: mkdocs>=1.6; extra == 'dev'
|
|
33
|
+
Requires-Dist: pytest>=8.0; extra == 'dev'
|
|
34
|
+
Requires-Dist: twine>=5.0; extra == 'dev'
|
|
35
|
+
Provides-Extra: docs
|
|
36
|
+
Requires-Dist: mkdocs>=1.6; extra == 'docs'
|
|
37
|
+
Provides-Extra: ui
|
|
38
|
+
Requires-Dist: streamlit>=1.30; extra == 'ui'
|
|
39
|
+
Description-Content-Type: text/markdown
|
|
40
|
+
|
|
41
|
+
# MCP Rehearsal / Ghostlab
|
|
42
|
+
|
|
43
|
+
> A local, end-to-end **testing lab for any MCP server** — coding agents role-play
|
|
44
|
+
> real users, drive your tools over multiple turns, and the harness captures
|
|
45
|
+
> traces, scores outcomes, and even **renders and clicks through MCP Apps UI
|
|
46
|
+
> widgets**.
|
|
47
|
+
|
|
53
|
+
|
|
54
|
+
**Test your MCP server the way it's actually used** — not with unit tests against
|
|
55
|
+
the protocol, but with a real coding agent (Codex / Claude) that picks tools,
|
|
56
|
+
makes mistakes, and tries to accomplish goals, while a second agent plays the
|
|
57
|
+
user. Ghostlab understands a target MCP, generates persona × scenario datasets,
|
|
58
|
+
runs the dual-agent loop, scores each run, and compares runs for regressions.
|
|
59
|
+
|
|
60
|
+
📖 **Docs wiki:** https://sajjadgg.github.io/Rehearsal/ · 🤖 **For agents:** [`llms.txt`](llms.txt) · 🛠 **Contributing:** [`CONTRIBUTING.md`](CONTRIBUTING.md)
|
|
61
|
+
|
|
62
|
+
## Quickstart
|
|
63
|
+
|
|
64
|
+
```bash
|
|
65
|
+
python3.13 -m venv .venv
|
|
66
|
+
.venv/bin/pip install -e . # add '.[ui]' for the web UI, '.[apps]' for widget rendering
|
|
67
|
+
|
|
68
|
+
# 1. Understand a target MCP
|
|
69
|
+
ghostlab inspect --target targets/cortex-local.json
|
|
70
|
+
|
|
71
|
+
# 2. Drive it with two agents (one under test, one emulating a user)
|
|
72
|
+
ghostlab run --target targets/cortex-local.json --scenario scenarios/cortex-onboarding-status.json
|
|
73
|
+
|
|
74
|
+
# 3. Render and click through an MCP Apps ui:// widget (needs '.[apps]')
|
|
75
|
+
ghostlab apps-render --target targets/cortex-local.json --tool views_generate_sentence_scramble \
|
|
76
|
+
--arguments '{"target_sentence":"The cat sat on the mat","shuffled_elements":["mat","The","on","sat","cat","the"]}' \
|
|
77
|
+
--intent '{"type":"reorder","value":["The","cat","sat","on","the","mat"]}'
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
## What it does
|
|
81
|
+
|
|
82
|
+
| Stage | Commands | What you get |
|
|
83
|
+
| --- | --- | --- |
|
|
84
|
+
| **Understand** | `inspect`, `profile` | Tool/resource/prompt dump + a capability profile, with lint findings |
|
|
85
|
+
| **Generate** | `generate-scenarios`, `generate-personas`, `generate-dataset`, `review-dataset` | Reusable persona × scenario datasets you can curate |
|
|
86
|
+
| **Run** | `run`, `run-dataset` | Multi-turn dual-agent transcripts with structured tool-call capture |
|
|
87
|
+
| **Evaluate** | `evaluate`, `compare` | Pass/fail verdicts (codex judge) and regression diffs between runs |
|
|
88
|
+
| **MCP Apps** | `apps-probe`, `apps-render` | Fetch/diagnose `ui://` widgets, then render + interact with them in headless Chrome |
|
|
89
|
+
| **Persist & explore** | `db`, `ui` | SQLite run history + a Streamlit UI over the whole pipeline |
|
|
90
|
+
|
|
91
|
+
## Goal
|
|
92
|
+
|
|
93
|
+
Build a repeatable, sandboxed tester that can:
|
|
94
|
+
|
|
95
|
+
- Run any target MCP app in an isolated environment.
|
|
96
|
+
- Launch one coding-agent session as the **agent-under-test** (with target MCP injected).
|
|
97
|
+
- Launch another coding-agent session as the **user emulator** (persona + goal driven).
|
|
98
|
+
- Drive multi-turn interactions between them.
|
|
99
|
+
- Capture full traces, tool activity, failures, and outcomes.
|
|
100
|
+
|
|
101
|
+
This lets you test with your existing Codex/Claude usage path, instead of wiring a separate LLM provider deployment just for E2E testing.
|
|
102
|
+
|
|
103
|
+
## Scope
|
|
104
|
+
|
|
105
|
+
Rehearsal is intentionally **app-agnostic**:
|
|
106
|
+
|
|
107
|
+
- Works with any MCP server reachable by stdio/SSE/streamable HTTP.
|
|
108
|
+
- Supports local or remote MCP endpoints.
|
|
109
|
+
- Supports multiple coding-agent runners (Codex, Claude Code, and future adapters).
|
|
110
|
+
|
|
111
|
+
No Cortex-specific assumptions are required in the core harness.
|
|
112
|
+
|
|
113
|
+
## Core Idea
|
|
114
|
+
|
|
115
|
+
Rehearsal uses a **dual-harness architecture**:
|
|
116
|
+
|
|
117
|
+
1. **AUT Harness (Agent Under Test)**
|
|
118
|
+
- Starts a coding-agent session (Codex or Claude Code).
|
|
119
|
+
- Injects target MCP server config into that session.
|
|
120
|
+
- Exposes a controlled I/O bridge so it can receive user messages and return replies/tool results.
|
|
121
|
+
|
|
122
|
+
2. **User Emulator Harness**
|
|
123
|
+
- Starts a second coding-agent session.
|
|
124
|
+
- Gives it a scenario file (persona, goals, constraints, success criteria).
|
|
125
|
+
- Asks it to act like a realistic user and send messages turn-by-turn to the AUT.
|
|
126
|
+
|
|
127
|
+
3. **Orchestrator**
|
|
128
|
+
- Coordinates turn-taking, timeouts, retries, and stop conditions.
|
|
129
|
+
- Logs every message and event in structured format.
|
|
130
|
+
- Produces a run report with bug candidates and reproduction context.
|
|
131
|
+
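The turn-taking described above can be sketched as a small relay loop. This is a minimal illustration rather than the package's orchestrator: `relay`, `Turn`, the `DONE` stop marker, and the callable agent interface are hypothetical stand-ins for real Codex/Claude sessions.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical agent interface: takes the other side's last message, returns a reply.
Agent = Callable[[str], str]


@dataclass
class Turn:
    index: int
    user_message: str
    assistant_reply: str


def relay(user: Agent, aut: Agent, opening: str, max_turns: int = 5,
          stop_marker: str = "DONE") -> List[Turn]:
    """Alternate user-emulator and agent-under-test turns until a stop condition."""
    turns: List[Turn] = []
    message = opening
    for index in range(1, max_turns + 1):
        reply = aut(message)            # agent under test answers the user
        turns.append(Turn(index, message, reply))
        message = user(reply)           # emulator reacts to the reply
        if stop_marker in message:      # emulator signals that it is finished
            break
    return turns


if __name__ == "__main__":
    # Stub agents so the sketch runs without any coding-agent CLI installed.
    scripted = iter(["tell me more", "DONE"])
    for turn in relay(lambda _reply: next(scripted), lambda msg: f"saw: {msg}", "hello"):
        print(turn)
```

Capping the loop with `max_turns` keeps a run bounded even if the emulator never emits the stop marker; a real harness would also need per-turn timeouts and retries.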
|
|
132
|
+
## First Implementation Plan
|
|
133
|
+
|
|
134
|
+
### Phase 1: Local Loop
|
|
135
|
+
|
|
136
|
+
- Define scenario schema (JSON).
|
|
137
|
+
- Define target schema for MCP connection config (stdio/SSE/HTTP).
|
|
138
|
+
- Build a Python orchestrator that runs:
|
|
139
|
+
- `codex`/`claude` process A as AUT
|
|
140
|
+
- `codex`/`claude` process B as emulator
|
|
141
|
+
- Relay turns through a strict protocol.
|
|
142
|
+
- Write JSONL logs + markdown summary.
|
|
143
|
+
|
|
144
|
+
### Phase 2: Sandboxed Execution
|
|
145
|
+
|
|
146
|
+
- Add Docker Compose profiles for generic MCP target services.
|
|
147
|
+
- Keep orchestrator on host or in sidecar container.
|
|
148
|
+
- Stamp each run with target ID + build SHA/version + scenario ID + timestamp.
|
|
149
|
+
|
|
150
|
+
### Phase 3: Regression + CI
|
|
151
|
+
|
|
152
|
+
- Add deterministic scenario packs.
|
|
153
|
+
- Add pass/fail gates (timeouts, tool misuse, policy violations, hallucinated capabilities, schema errors).
|
|
154
|
+
- Publish comparison reports between runs.
|
|
155
|
+
|
|
156
|
+
## Target Configuration Model
|
|
157
|
+
|
|
158
|
+
Each test run points to a target definition, for example:
|
|
159
|
+
|
|
160
|
+
- `target.id`: unique name (`filesystem-mcp-local`, `my-app-staging`)
|
|
161
|
+
- `transport`: `stdio` | `sse` | `streamable-http`
|
|
162
|
+
- `connection`: command+args+env (stdio) or URL+headers (network transports)
|
|
163
|
+
- `capabilities`: optional expected tools/resources/prompts
|
|
164
|
+
- `startup`: optional health checks and boot timeout
|
|
165
|
+
|
|
166
|
+
This model makes the same harness reusable across different MCP apps.
|
|
167
|
+
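A target definition with the fields listed above could be loaded and checked roughly as follows. This is a sketch under assumptions: the flat `id` key, the `command` and `url` keys inside `connection`, and the `validate_target` helper are illustrative and may not match the package's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

NETWORK_TRANSPORTS = {"sse", "streamable-http"}


@dataclass
class Target:
    id: str
    transport: str
    connection: Dict[str, Any]
    capabilities: Dict[str, List[str]] = field(default_factory=dict)
    startup: Dict[str, Any] = field(default_factory=dict)


def validate_target(raw: Dict[str, Any]) -> Target:
    """Check the documented fields; key names inside `connection` are assumed."""
    target = Target(
        id=raw["id"],
        transport=raw["transport"],
        connection=raw.get("connection", {}),
        capabilities=raw.get("capabilities", {}),
        startup=raw.get("startup", {}),
    )
    if target.transport == "stdio":
        required = "command"            # stdio targets are launched as a process
    elif target.transport in NETWORK_TRANSPORTS:
        required = "url"                # network targets are reached by URL
    else:
        raise ValueError(f"unknown transport: {target.transport!r}")
    if required not in target.connection:
        raise ValueError(f"{target.transport} target needs connection.{required}")
    return target
```

Failing fast on a malformed target is cheaper than discovering the problem after two agent sessions have already started.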
|
|
168
|
+
## What We’ll Log
|
|
169
|
+
|
|
170
|
+
Per run:
|
|
171
|
+
|
|
172
|
+
- Target metadata (id, transport, endpoint/command fingerprint).
|
|
173
|
+
- Scenario metadata (id, persona, goal).
|
|
174
|
+
- Full AUT/emulator transcripts.
|
|
175
|
+
- MCP tool call envelopes (request/response/error).
|
|
176
|
+
- Timing (latency per turn, total runtime).
|
|
177
|
+
- Exit states (success, timeout, crash, policy breach).
|
|
178
|
+
- Repro bundle pointers.
|
|
179
|
+
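One way to record items like these is a JSON Lines file with one event per line. The sketch below is illustrative only: the field names (`run_id`, `kind`, `ts`) and both helpers are assumptions, not the package's log format.

```python
import io
import json
import time
from typing import Any, Dict, Iterator, TextIO


def write_event(stream: TextIO, run_id: str, kind: str, payload: Dict[str, Any]) -> None:
    """Append one JSON object per line so logs can be tailed and grepped."""
    record = {"run_id": run_id, "kind": kind, "ts": time.time(), **payload}
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_events(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield parsed events, skipping blank lines."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


if __name__ == "__main__":
    buffer = io.StringIO()              # stands in for an events file on disk
    write_event(buffer, "run-1", "exit", {"state": "timeout"})
    buffer.seek(0)
    print(list(read_events(buffer)))
```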
|
|
180
|
+
## Success Criteria
|
|
181
|
+
|
|
182
|
+
Rehearsal is useful when you can:
|
|
183
|
+
|
|
184
|
+
- Start one command and run multiple scenarios against any MCP target.
|
|
185
|
+
- Reproduce failures with the same target+scenario seed/config.
|
|
186
|
+
- Compare two runs and quickly see regressions.
|
|
187
|
+
- Debug from logs without rerunning blindly.
|
|
188
|
+
|
|
189
|
+
## Current Folder Layout
|
|
190
|
+
|
|
191
|
+
```text
|
|
192
|
+
mcp-rehearsal/
|
|
193
|
+
README.md
|
|
194
|
+
__main__.py
|
|
195
|
+
rehearsal/
|
|
196
|
+
targets/
|
|
197
|
+
scenarios/
|
|
198
|
+
runners/
|
|
199
|
+
runs/
|
|
200
|
+
docker/
|
|
201
|
+
```
|
|
202
|
+
|
|
203
|
+
## Commands
|
|
204
|
+
|
|
205
|
+
Install locally from this checkout:
|
|
206
|
+
|
|
207
|
+
```bash
|
|
208
|
+
python3.13 -m venv .venv
|
|
209
|
+
.venv/bin/pip install -r requirements-dev.txt
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
The package installs two equivalent console scripts:
|
|
213
|
+
|
|
214
|
+
```bash
|
|
215
|
+
ghostlab --help
|
|
216
|
+
rehearsal --help
|
|
217
|
+
```
|
|
218
|
+
|
|
219
|
+
Rehearsal exposes subcommands (the bare `--target ... --scenario ...` form still
|
|
220
|
+
works and is treated as `run`):
|
|
221
|
+
|
|
222
|
+
- `ghostlab inspect` — connect to a target MCP and capture what it exposes.
|
|
223
|
+
- `ghostlab profile` — turn an `inspect.json` into a capability profile (codex).
|
|
224
|
+
- `ghostlab generate-scenarios` — generate scenarios from a profile (codex).
|
|
225
|
+
- `ghostlab generate-personas` — generate a reusable persona library (codex).
|
|
226
|
+
- `ghostlab generate-dataset` — build a persona x scenario dataset (codex).
|
|
227
|
+
- `ghostlab review-dataset` — review & curate a dataset (coverage, flags, approve/reject).
|
|
228
|
+
- `ghostlab run-dataset` — run every case in a dataset.
|
|
229
|
+
- `ghostlab run` — run a dual-agent E2E scenario.
|
|
230
|
+
- `ghostlab evaluate` — score a run into a pass/fail verdict (codex judge).
|
|
231
|
+
- `ghostlab compare` — diff two dataset runs for regressions.
|
|
232
|
+
- `ghostlab apps-probe` — probe a target's MCP Apps (`ui://`) widgets: fetch resources + CSP diagnostics.
|
|
233
|
+
- `ghostlab apps-render` — render a `ui://` widget in headless Chrome, drive it, and capture proof.
|
|
234
|
+
- `ghostlab doctor` — check codex and validate runner presets.
|
|
235
|
+
- `ghostlab ui` — launch the Streamlit pipeline UI.
|
|
236
|
+
|
|
237
|
+
### The UI: `ghostlab ui`
|
|
238
|
+
|
|
239
|
+
Run the whole pipeline from a browser instead of the CLI:
|
|
240
|
+
|
|
241
|
+
```bash
|
|
242
|
+
pip install 'ghostlab[ui]' # installs streamlit
|
|
243
|
+
ghostlab ui # opens http://localhost:8501
|
|
244
|
+
```
|
|
245
|
+
|
|
246
|
+
The app walks an MCP through four user-facing stages:
|
|
247
|
+
|
|
248
|
+
1. **Inspect MCP** — connect to the MCP, verify its tools/resources, and analyze
|
|
249
|
+
its capabilities and likely workflows.
|
|
250
|
+
2. **Build Test Cases** — generate personas and scenarios, pair them into
|
|
251
|
+
runnable cases, then review coverage, warnings, case selection, and the
|
|
252
|
+
resolved per-case prompts together.
|
|
253
|
+
3. **Run & Evaluate** — run each selected persona + scenario case with the
|
|
254
|
+
agent-under-test and user emulator, then optionally evaluate it with a codex
|
|
255
|
+
judge. Generation and runs expose determinate progress and live turn activity.
|
|
256
|
+
4. **Review Results** — browse each run's chronological conversation trace,
|
|
257
|
+
inline tool activity, exact runtime prompts, model/duration metadata, and
|
|
258
|
+
verdict evidence. Filter run history by target, status, verdict, or search.
|
|
259
|
+
|
|
260
|
+
A **case** is the concrete runnable preset formed by pairing one persona with
|
|
261
|
+
one scenario. Each selected case produces one run and one trace.
|
|
262
|
+
|
|
263
|
+
The sidebar sets the **workspace dir**, the **codex binary**, and the **codex
|
|
264
|
+
model** (applied to every codex-backed stage — generation, the AUT/user runners,
|
|
265
|
+
and the judge — and shown wherever it is used). Each stage has a **🔍 View
|
|
266
|
+
prompt** expander so you can see the exact prompt sent to codex (profile,
|
|
267
|
+
persona/scenario generation, the agent-under-test and user-emulator prompts, and
|
|
268
|
+
the judge). Steps gate on their prerequisites and the run step shows live
|
|
269
|
+
per-case progress.
|
|
270
|
+
|
|
271
|
+
Artifacts are written under the workspace directory (default
|
|
272
|
+
`ghostlab_workspace/`) so runs persist and can also be opened with the CLI.
|
|
273
|
+
|
|
274
|
+
## Install from PyPI
|
|
275
|
+
|
|
276
|
+
Once published, install the released package directly:
|
|
277
|
+
|
|
278
|
+
```bash
|
|
279
|
+
pip install ghostlab # add [ui] and/or [apps] for those extras
|
|
280
|
+
ghostlab --help
|
|
281
|
+
```
|
|
282
|
+
|
|
283
|
+
## Packaging & Release
|
|
284
|
+
|
|
285
|
+
Build and validate distributions locally:
|
|
286
|
+
|
|
287
|
+
```bash
|
|
288
|
+
.venv/bin/python -m pytest
|
|
289
|
+
.venv/bin/python -m build
|
|
290
|
+
.venv/bin/twine check dist/*
|
|
291
|
+
```
|
|
292
|
+
|
|
293
|
+
CI runs tests on Python 3.10 through 3.13 and verifies that the package builds.
|
|
294
|
+
Releases are automated: the **`publish.yml`** workflow builds the sdist + wheel,
|
|
295
|
+
publishes them to PyPI via **Trusted Publishing**, and attaches them to the
|
|
296
|
+
GitHub Release — triggered when you **publish a GitHub Release** (or run the
|
|
297
|
+
workflow manually). Cut a release like:
|
|
298
|
+
|
|
299
|
+
```bash
|
|
300
|
+
# bump rehearsal/__init__.py __version__ first, then:
|
|
301
|
+
gh release create v0.1.0 --generate-notes
|
|
302
|
+
```
|
|
303
|
+
|
|
304
|
+
To enable publishing, create the PyPI project **`ghostlab`** and add a Trusted
|
|
305
|
+
Publisher for this repository, workflow `.github/workflows/publish.yml`,
|
|
306
|
+
environment `pypi`. No PyPI username or token is committed.
|
|
307
|
+
|
|
308
|
+
The Pages workflow builds the docs wiki with MkDocs and deploys it to GitHub
|
|
309
|
+
Pages on pushes to `main`, `v*.*.*` release tags, and manual workflow runs. In
|
|
310
|
+
the GitHub repository settings, set Pages to use GitHub Actions as the source.
|
|
311
|
+
|
|
312
|
+
### Understand a new MCP: `inspect`
|
|
313
|
+
|
|
314
|
+
Point it at a target and it introspects the server without any coding-agent
|
|
315
|
+
credits or manual `curl`:
|
|
316
|
+
|
|
317
|
+
```bash
|
|
318
|
+
ghostlab inspect --target targets/cortex-local.json
|
|
319
|
+
```
|
|
320
|
+
|
|
321
|
+
This connects over the configured transport (stdio / streamable-HTTP / SSE),
|
|
322
|
+
runs the `initialize` handshake, and pages through `tools/list`,
|
|
323
|
+
`resources/list`, `resources/templates/list`, and `prompts/list`. It writes
|
|
324
|
+
`runs/<id>-inspect/inspect.json` (raw) and `inspect.md` (readable), and **lints**
|
|
325
|
+
tool/resource descriptions for references to tools the server does not actually
|
|
326
|
+
expose (e.g. Cortex descriptions mention `kb_find` / `kb_read` / `kb_read_skill`,
|
|
327
|
+
which are not in `tools/list`). This capability dump is the input to capability
|
|
328
|
+
profiling and scenario generation.
|
|
329
|
+
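The description lint can be approximated in a few lines. This sketch assumes that tool references appear as backtick-quoted snake_case identifiers, which is a guess about how descriptions are written; `lint_descriptions` and the sample tool names are hypothetical.

```python
import re
from typing import Dict, List, Set

# Assumption: references look like `kb_find`, i.e. backticked snake_case names.
TOOL_REF = re.compile(r"`([a-z][a-z0-9]*(?:_[a-z0-9]+)+)`")


def lint_descriptions(tools: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each tool to the referenced names that the server does not expose."""
    exposed: Set[str] = set(tools)
    findings: Dict[str, List[str]] = {}
    for name, description in tools.items():
        missing = sorted(set(TOOL_REF.findall(description)) - exposed)
        if missing:
            findings[name] = missing
    return findings


if __name__ == "__main__":
    sample = {
        "notes_list": "List notes. Follow up with `kb_find` or `notes_get`.",
        "notes_get": "Read one note.",
    }
    print(lint_descriptions(sample))
```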
|
|
330
|
+
### Profile a new MCP: `profile`
|
|
331
|
+
|
|
332
|
+
Turn the raw `inspect.json` into a structured **capability profile** — the
|
|
333
|
+
bridge between Understand and Generate. Deterministic structure (tool taxonomy
|
|
334
|
+
by name family, read/write state surfaces, gaps) is computed locally; a domain
|
|
335
|
+
summary and inferred multi-step workflows are generated by codex:
|
|
336
|
+
|
|
337
|
+
```bash
|
|
338
|
+
ghostlab profile \
|
|
339
|
+
--inspect runs/<id>-inspect/inspect.json
|
|
340
|
+
```
|
|
341
|
+
|
|
342
|
+
It writes `capabilities.json` + `capabilities.md` next to the `inspect.json`.
|
|
343
|
+
Generated workflow steps are filtered to real tool names, so the profile never
|
|
344
|
+
references hallucinated or non-exposed tools. This profile is the input that scenario
|
|
345
|
+
generation consumes.
|
|
346
|
+
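Filtering generated workflow steps down to real tool names might look like this. The workflow shape (a dict with a `steps` list of tool names) and the `filter_workflows` helper are assumptions made for illustration.

```python
from typing import Any, Dict, List


def filter_workflows(workflows: List[Dict[str, Any]],
                     exposed_tools: List[str]) -> List[Dict[str, Any]]:
    """Drop steps naming unknown tools, and drop workflows left with no steps."""
    allowed = set(exposed_tools)
    kept = []
    for workflow in workflows:
        steps = [step for step in workflow.get("steps", []) if step in allowed]
        if steps:
            kept.append({**workflow, "steps": steps})
    return kept


if __name__ == "__main__":
    generated = [
        {"name": "review notes", "steps": ["notes_list", "kb_find"]},
        {"name": "read skill", "steps": ["kb_read_skill"]},
    ]
    print(filter_workflows(generated, ["notes_list", "notes_get"]))
```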
|
|
347
|
+
### Generate scenarios: `generate-scenarios`
|
|
348
|
+
|
|
349
|
+
Generate grounded use-case scenarios the MCP supports, derived from the
|
|
350
|
+
capability profile:
|
|
351
|
+
|
|
352
|
+
```bash
|
|
353
|
+
ghostlab generate-scenarios \
|
|
354
|
+
--profile runs/<id>-inspect/capabilities.json \
|
|
355
|
+
--n 3 \
|
|
356
|
+
--output-dir scenarios
|
|
357
|
+
```
|
|
358
|
+
|
|
359
|
+
Scenarios are spread across intents (`happy_path` / `edge_case` / `adversarial`)
|
|
360
|
+
and each declares an `exercises` list of the tools it should drive the assistant
|
|
361
|
+
to use. Tool references are filtered to real tool names, so scenarios never
|
|
362
|
+
depend on hallucinated or non-exposed tools. Each scenario is written as a
|
|
363
|
+
`ScenarioConfig`-shaped JSON file ready for `run`.
|
|
364
|
+
|
|
365
|
+
### Build a persona library: `generate-personas`
|
|
366
|
+
|
|
367
|
+
Personas are reusable **user profiles** decoupled from scenarios, so the same
|
|
368
|
+
persona can be paired with many scenarios (the basis for the dataset matrix).
|
|
369
|
+
Generate a domain-relevant library from a capability profile:
|
|
370
|
+
|
|
371
|
+
```bash
|
|
372
|
+
ghostlab generate-personas \
|
|
373
|
+
--profile runs/<id>-inspect/capabilities.json \
|
|
374
|
+
--n 4 \
|
|
375
|
+
--output-dir personas
|
|
376
|
+
```
|
|
377
|
+
|
|
378
|
+
Each persona has a `summary`, behavioral `traits` (terse, impatient, easily
|
|
379
|
+
confused, non-native, ...), and a domain `context` map (native_language,
|
|
380
|
+
target_exam, level, ...). Pass one to a run with `--persona`:
|
|
381
|
+
|
|
382
|
+
```bash
|
|
383
|
+
ghostlab run ... --persona personas/ielts-power-user.json
|
|
384
|
+
```
|
|
385
|
+
|
|
386
|
+
The user-emulator prompt is composed from the persona's summary + traits +
|
|
387
|
+
context. Scenarios with an inline `persona` string still work unchanged; when a
|
|
388
|
+
persona is supplied, the scenario's inline note refines it.
|
|
389
|
+
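Composing the emulator prompt from a persona could be sketched as below. The persona keys follow the fields named above (`summary`, `traits`, `context`), but the prompt wording and the `compose_user_prompt` helper are hypothetical and not the package's actual template.

```python
from typing import Any, Dict, Optional


def compose_user_prompt(persona: Dict[str, Any],
                        situational_note: Optional[str] = None) -> str:
    """Build a role-play prompt from summary, traits, context, and an optional note."""
    lines = [f"You are role-playing this user: {persona['summary']}"]
    traits = persona.get("traits", [])
    if traits:
        lines.append("Traits: " + ", ".join(traits))
    for key, value in sorted(persona.get("context", {}).items()):
        lines.append(f"{key}: {value}")
    if situational_note:                # the scenario's inline note refines the persona
        lines.append(f"Situation: {situational_note}")
    return "\n".join(lines)


if __name__ == "__main__":
    persona = {
        "summary": "Exam candidate with little time",
        "traits": ["terse", "impatient"],
        "context": {"native_language": "Farsi", "level": "B2"},
    }
    print(compose_user_prompt(persona, "has 45 minutes before work"))
```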
|
|
390
|
+
### Build a dataset: `generate-dataset`
|
|
391
|
+
|
|
392
|
+
A dataset is a **persona x scenario matrix** — different users, and different
|
|
393
|
+
scenarios tailored to each of them. For every persona, codex generates
|
|
394
|
+
persona-specific scenarios, and the pairs become runnable cases:
|
|
395
|
+
|
|
396
|
+
```bash
|
|
397
|
+
ghostlab generate-dataset \
|
|
398
|
+
--profile runs/<id>-inspect/capabilities.json \
|
|
399
|
+
--personas 3 --scenarios-per-persona 3 --seed 7 \
|
|
400
|
+
--name cortex
|
|
401
|
+
```
|
|
402
|
+
|
|
403
|
+
This writes a self-contained dataset directory:
|
|
404
|
+
|
|
405
|
+
```text
|
|
406
|
+
datasets/cortex/
|
|
407
|
+
dataset.json manifest: mcp, seed, cases[]
|
|
408
|
+
personas/<id>.json
|
|
409
|
+
scenarios/<id>.json persona-namespaced; inline `persona` is a situational note
|
|
410
|
+
```
|
|
411
|
+
|
|
412
|
+
The persona is the authoritative identity at run time; each scenario's inline
|
|
413
|
+
`persona` carries only a short situational note ("has 45 minutes before work"),
|
|
414
|
+
so the two never conflict. The `--seed` governs case ordering for reproducible
|
|
415
|
+
manifests.
|
|
416
|
+
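A seeded case ordering can be reproduced with a dedicated random generator. The sketch below assumes a simple mapping from persona id to scenario ids and a hypothetical `build_cases` helper; the case id format is made up for the example.

```python
import random
from typing import Dict, List


def build_cases(scenarios_by_persona: Dict[str, List[str]], seed: int) -> List[Dict[str, str]]:
    """Pair each persona with its scenarios, then order the cases deterministically."""
    cases = [
        {"id": f"{persona}--{scenario}", "persona": persona,
         "scenario": scenario, "status": "pending"}
        for persona in sorted(scenarios_by_persona)
        for scenario in scenarios_by_persona[persona]
    ]
    # A private Random instance keeps the order stable for a given seed
    # without touching the global random state.
    random.Random(seed).shuffle(cases)
    return cases


if __name__ == "__main__":
    matrix = {"ana": ["warmup", "mock-test"], "bo": ["status-check"]}
    for case in build_cases(matrix, seed=7):
        print(case["id"])
```

Sorting the personas before shuffling matters: without it, the result would depend on dictionary insertion order as well as on the seed.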
|
|
417
|
+
### Review & curate a dataset: `review-dataset`
|
|
418
|
+
|
|
419
|
+
Before spending agent credits, check that the dataset makes sense:
|
|
420
|
+
|
|
421
|
+
```bash
|
|
422
|
+
ghostlab review-dataset \
|
|
423
|
+
--dataset datasets/cortex \
|
|
424
|
+
--profile runs/<id>-inspect/capabilities.json
|
|
425
|
+
```
|
|
426
|
+
|
|
427
|
+
This writes `review.md` + `review.json` with a **tool-coverage matrix** (which
|
|
428
|
+
tool categories are exercised, which tools are never touched), **per-case
|
|
429
|
+
previews** (persona traits, situation, goal, opening message, success/failure
|
|
430
|
+
criteria, exercises), and **flags**: near-duplicate cases, scenarios exercising
|
|
431
|
+
non-exposed tools, and personas with no scenarios.
|
|
432
|
+
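The coverage and flag checks can be illustrated with plain set arithmetic. This is a simplified sketch: the case shape (`id`, `exercises`) and the `review_cases` helper are assumptions, and near-duplicate detection is left out.

```python
from typing import Any, Dict, List


def review_cases(cases: List[Dict[str, Any]], exposed_tools: List[str]) -> Dict[str, Any]:
    """Report which exposed tools are exercised and flag unknown tool references."""
    exposed = set(exposed_tools)
    exercised = set()
    flags = []
    for case in cases:
        wanted = set(case.get("exercises", []))
        unknown = sorted(wanted - exposed)
        if unknown:
            flags.append({"case": case["id"], "non_exposed_tools": unknown})
        exercised |= wanted & exposed
    return {
        "covered": sorted(exercised),
        "never_touched": sorted(exposed - exercised),
        "flags": flags,
    }


if __name__ == "__main__":
    sample = [{"id": "case-a", "exercises": ["notes_list", "kb_find"]}]
    print(review_cases(sample, ["notes_list", "notes_get"]))
```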
|
|
433
|
+
Curation is **file-first** — each case gets a `status` in `dataset.json`
|
|
434
|
+
(`pending` / `approved` / `rejected` / `needs-edit`). Edit it by hand, or use:
|
|
435
|
+
|
|
436
|
+
```bash
|
|
437
|
+
# approve/reject by case id (no ids = all cases)
|
|
438
|
+
ghostlab review-dataset --dataset datasets/cortex \
|
|
439
|
+
--approve case-a case-b --reject case-c
|
|
440
|
+
```
|
|
441
|
+
|
|
442
|
+
Then run only the approved cases:
|
|
443
|
+
|
|
444
|
+
```bash
|
|
445
|
+
ghostlab run-dataset --dataset datasets/cortex \
|
|
446
|
+
--target targets/cortex-local.json --approved-only
|
|
447
|
+
```
|
|
448
|
+
|
|
449
|
+
### Run a dataset: `run-dataset`
|
|
450
|
+
|
|
451
|
+
Execute every case (use `--limit` for small dev runs):
|
|
452
|
+
|
|
453
|
+
```bash
|
|
454
|
+
ghostlab run-dataset \
|
|
455
|
+
--dataset datasets/cortex \
|
|
456
|
+
--target targets/cortex-local.json \
|
|
457
|
+
--aut-runner runners/codex-cortex-aut.json \
|
|
458
|
+
--user-runner runners/codex-user-emulator.json \
|
|
459
|
+
--limit 2
|
|
460
|
+
```
|
|
461
|
+
|
|
462
|
+
Each case runs through the orchestrator (with its persona) into its own run
|
|
463
|
+
directory, and a dataset-level `summary.md` + `results.json` capture per-case
|
|
464
|
+
status and turn counts.
|
|
465
|
+
|
|
466
|
+
### Tool-call capture & output hygiene
|
|
467
|
+
|
|
468
|
+
Every run captures structured MCP tool calls from the agent host. The codex AUT
|
|
469
|
+
runners set `"parser": "codex-json"` and run `codex exec --json`, so the
|
|
470
|
+
orchestrator parses the JSONL stream and records each `mcp_tool_call` with its
|
|
471
|
+
**arguments, result, error, and status** — plus the clean assistant message —
|
|
472
|
+
into `events.jsonl`, with a per-turn table in `report.md`. (Runners without the
|
|
473
|
+
codex-json parser fall back to scraping codex's plain-text
|
|
474
|
+
`mcp: <server>/<tool> started|(completed)|(failed)` lines for tool name +
|
|
475
|
+
status.) When a scenario declares `exercises`, the report also shows a
|
|
476
|
+
tool-coverage line (expected vs. actually called).
|
|
477
|
+
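The plain-text fallback amounts to matching status lines with a regular expression. The sketch below infers the line format from the pattern quoted above; the exact spacing in real codex output is an assumption, and `scrape_tool_status` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Inferred format: "mcp: <server>/<tool> started", "... (completed)", "... (failed)".
STATUS_LINE = re.compile(
    r"^mcp: (?P<server>[^/\s]+)/(?P<tool>\S+) (?P<status>started|\(completed\)|\(failed\))"
)


def scrape_tool_status(output: str) -> List[Dict[str, str]]:
    """Extract server, tool name, and status from plain-text agent output."""
    calls = []
    for line in output.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            calls.append({
                "server": match["server"],
                "tool": match["tool"],
                "status": match["status"].strip("()"),
            })
    return calls


if __name__ == "__main__":
    text = "thinking\nmcp: cortex/notes_list started\nmcp: cortex/notes_list (completed)\n"
    print(scrape_tool_status(text))
```

Unlike the JSONL path, this recovers only the tool name and status; arguments, results, and error details are not present in these lines.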
|
|
478
|
+
stdout and stderr are kept **separate**: only stdout (with known host noise
|
|
479
|
+
redacted) becomes the conversational message handed to the other agent, while
|
|
480
|
+
raw stderr is logged for debugging. This prevents the emulator from reacting to
|
|
481
|
+
agent-host warnings instead of the assistant's actual reply.
|
|
482
|
+
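Keeping the two streams apart is straightforward with `subprocess`. In this sketch the `WARNING:` prefix stands in for whatever host noise is actually redacted, and `run_turn` is a hypothetical helper, not the package's runner.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def run_turn(command: Sequence[str],
             noise_prefixes: Tuple[str, ...] = ("WARNING:",)) -> Tuple[str, str]:
    """Return (message for the other agent, raw stderr kept for debugging)."""
    proc = subprocess.run(list(command), capture_output=True, text=True, check=False)
    kept = [line for line in proc.stdout.splitlines()
            if not line.startswith(noise_prefixes)]
    return "\n".join(kept).strip(), proc.stderr


if __name__ == "__main__":
    # A tiny child process that writes to both streams, so no agent CLI is needed.
    child = ("import sys; print('WARNING: cache miss'); "
             "print('Here is your status.'); print('boom', file=sys.stderr)")
    message, debug = run_turn([sys.executable, "-c", child])
    print(repr(message), repr(debug))
```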
|
|
483
|
+
### Evaluate a run: `evaluate`
|
|
484
|
+
|
|
485
|
+
Turn a run into a structured pass/fail verdict:
|
|
486
|
+
|
|
487
|
+
```bash
|
|
488
|
+
ghostlab evaluate --run runs/<id> --capabilities runs/<id>-inspect/capabilities.json
|
|
489
|
+
```
|
|
490
|
+
|
|
491
|
+
It combines **deterministic checks** over the captured tool calls (failed calls,
|
|
492
|
+
expected-tool coverage from the scenario's `exercises`) with a **codex
|
|
493
|
+
LLM-judge** that scores each `success_criterion` (met?) and `failure_signal`
|
|
494
|
+
(triggered?) from the transcript and tool calls. Hard gates force an overall
|
|
495
|
+
`fail`: the run crashed, a failure signal triggered, or — when `--capabilities`
|
|
496
|
+
is supplied — the assistant claimed a tool the server does not expose. Writes
|
|
497
|
+
`verdict.json` + `verdict.md`; exits 0 on `pass` and non-zero on `fail`
|
|
498
|
+
(`partial` exits 0 unless `--strict`), so datasets can gate CI.
|
|
499
|
+
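The hard gates and exit-code rule can be sketched as two small functions. How the real judge derives `partial` is not described here, so treating "some but not all criteria met" as `partial` is an assumption, as are the function names and argument shapes.

```python
from typing import List, Optional


def overall_verdict(crashed: bool,
                    failure_signals: List[bool],
                    claimed_tools: List[str],
                    exposed_tools: Optional[List[str]],
                    criteria_met: List[bool]) -> str:
    """Apply hard gates first, then fall back to the success criteria."""
    if crashed or any(failure_signals):
        return "fail"
    # The claimed-tool gate only applies when a capability list was supplied.
    if exposed_tools is not None and set(claimed_tools) - set(exposed_tools):
        return "fail"
    if all(criteria_met):
        return "pass"
    return "partial" if any(criteria_met) else "fail"   # assumed rule


def exit_code(verdict: str, strict: bool = False) -> int:
    """0 for pass, 0 for partial unless strict, 1 otherwise."""
    if verdict == "pass" or (verdict == "partial" and not strict):
        return 0
    return 1


if __name__ == "__main__":
    verdict = overall_verdict(False, [False], ["kb_find"], ["notes_list"], [True])
    print(verdict, exit_code(verdict))
```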
|
|
500
|
+
### Score a whole dataset
|
|
501
|
+
|
|
502
|
+
`run-dataset --evaluate` runs the codex judge on each case and records the
|
|
503
|
+
verdict in the per-case run dir and in the summary's `results.json` (stamped with
|
|
504
|
+
the ghostlab version + dataset seed for provenance):
|
|
505
|
+
|
|
506
|
+
```bash
|
|
507
|
+
ghostlab run-dataset --dataset datasets/cortex \
|
|
508
|
+
--target targets/cortex-local.json \
|
|
509
|
+
--aut-runner runners/codex-cortex-local-session.json \
|
|
510
|
+
--evaluate --capabilities runs/<id>-inspect/capabilities.json
|
|
511
|
+
```
|
|
512
|
+
|
|
513
|
+
### Compare two runs: `compare`
|
|
514
|
+
|
|
515
|
+
After editing a prompt or tool description, re-run the same dataset and diff the
|
|
516
|
+
results to see what got better or worse:
|
|
517
|
+
|
|
518
|
+
```bash
|
|
519
|
+
ghostlab compare --base runs/<base>-summary --candidate runs/<cand>-summary \
|
|
520
|
+
--output comparison.md
|
|
521
|
+
```
|
|
522
|
+
|
|
523
|
+
It diffs case-by-case on verdict (falling back to run status), listing
|
|
524
|
+
**regressions** (newly failing) first, then **fixes** (newly passing), then other
|
|
525
|
+
changes. Exits non-zero when there are regressions, so it can gate CI.
|
|
526
|
+
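A case-by-case diff of two result sets might be sketched like this. The input shape (case id mapped to a verdict string) and the bucketing rules are assumptions: here any change that ends in `fail` counts as a regression and any change that ends in `pass` counts as a fix.

```python
from typing import Dict, List


def compare_runs(base: Dict[str, str], candidate: Dict[str, str]) -> Dict[str, List[str]]:
    """Bucket verdict changes for the cases present in both runs."""
    report: Dict[str, List[str]] = {"regressions": [], "fixes": [], "other": []}
    for case_id in sorted(set(base) & set(candidate)):
        before, after = base[case_id], candidate[case_id]
        if before == after:
            continue
        if after == "fail":
            bucket = "regressions"      # newly failing
        elif after == "pass":
            bucket = "fixes"            # newly passing
        else:
            bucket = "other"
        report[bucket].append(f"{case_id}: {before} -> {after}")
    return report


def compare_exit_code(report: Dict[str, List[str]]) -> int:
    """Non-zero when there is at least one regression, so CI can gate on it."""
    return 1 if report["regressions"] else 0


if __name__ == "__main__":
    result = compare_runs({"a": "pass", "b": "fail"}, {"a": "fail", "b": "pass"})
    print(result, compare_exit_code(result))
```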
|
|
527
|
+
### Probe MCP Apps widgets: `apps-probe`
|
|
528
|
+
|
|
529
|
+
Some MCPs ship **MCP Apps UI** resources — a tool's `_meta.ui.resourceUri` points
|
|
530
|
+
to a `ui://…` HTML widget that a compatible host is expected to render. The vanilla
|
|
531
|
+
runner can confirm an agent *called* a UI-producing tool, but not that the widget
|
|
532
|
+
rendered or that a user could interact with it (see
|
|
533
|
+
`specs/cortex-mcp-apps-e2e.spec`, issue #13).
|
|
534
|
+
|
|
535
|
+
`apps-probe` is the first increment of the MCP Apps host layer. It connects to a
|
|
536
|
+
target, finds every UI-producing tool, fetches each `ui://` resource via
|
|
537
|
+
`resources/read`, and reports render-readiness and CSP diagnostics:
|
|
538
|
+
|
|
539
|
+
```bash
|
|
540
|
+
ghostlab apps-probe --target targets/cortex-local.json
|
|
541
|
+
# or restrict to specific widgets:
|
|
542
|
+
ghostlab apps-probe --target targets/cortex-local.json --tool views_create_listening_practice
|
|
543
|
+
```
|
|
544
|
+
|
|
545
|
+
It writes `apps-probe.json` + `apps-probe.md` with the resource's MIME profile,
|
|
546
|
+
HTML size, preferred frame hints, and CSP connect/resource domains. Diagnostics
|
|
547
|
+
flag empty/unfetchable resources, non-`mcp-app` MIME types, and tools that accept
|
|
548
|
+
remote media (`audio_url`, `image_url`, …) whose resource CSP would block it. The
|
|
549
|
+
report reserves structured sections for the host-bridge transcript, interaction
|
|
550
|
+
trace, render artifacts, and final app state — populated by `apps-render` below.
|
|
551
|
+
The module also defines the **UI-intent contract**
|
|
552
|
+
(`reorder`/`choose`/`type`/`reveal`/`submit`/`rate`/`mark`) the user emulator
|
|
553
|
+
emits, and the host-bridge message vocabulary a renderer must implement.
|
|
554
|
+
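Validating a UI intent before handing it to a renderer could look like the sketch below. The intent types come from the list above and the JSON shape from the `--intent` examples in this README; the `parse_intent` helper and its specific checks are assumptions.

```python
import json
from typing import Any, Dict

INTENT_TYPES = {"reorder", "choose", "type", "reveal", "submit", "rate", "mark"}


def parse_intent(raw: str) -> Dict[str, Any]:
    """Parse one JSON intent and reject unknown types or malformed payloads."""
    intent = json.loads(raw)
    if not isinstance(intent, dict):
        raise ValueError("intent must be a JSON object")
    if intent.get("type") not in INTENT_TYPES:
        raise ValueError(f"unknown intent type: {intent.get('type')!r}")
    # Assumed rule: a reorder intent carries the desired order as a list.
    if intent["type"] == "reorder" and not isinstance(intent.get("value"), list):
        raise ValueError("reorder intent needs a list in 'value'")
    return intent


if __name__ == "__main__":
    print(parse_intent('{"type":"reorder","value":["The","cat","sat"]}'))
    print(parse_intent('{"type":"reveal"}'))
```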
|
|
555
|
+
### Render & drive MCP Apps widgets: `apps-render`
|
|
556
|
+
|
|
557
|
+
`apps-render` actually **renders** a `ui://` widget and proves a user can see and
|
|
558
|
+
use it. It implements the MCP Apps host bridge (JSON-RPC over `postMessage`,
|
|
559
|
+
protocol `2026-01-26`), mounts the widget in a sandboxed headless-Chrome iframe,
|
|
560
|
+
completes the `ui/initialize` handshake, and feeds it the tool input + result so
|
|
561
|
+
it renders real content. It then captures a screenshot, the visible DOM text, the
|
|
562
|
+
host-bridge transcript, console/network errors, and runs app-aware assertions —
|
|
563
|
+
and can execute a sequence of UI intents against the live widget.
|
|
564
|
+
|
|
565
|
+
```bash
|
|
566
|
+
pip install 'ghostlab[apps]' && playwright install chrome # one-time
|
|
567
|
+
ghostlab apps-render --target targets/cortex-local.json \
|
|
568
|
+
--tool views_generate_sentence_scramble \
|
|
569
|
+
--arguments '{"target_sentence":"The cat sat on the mat","shuffled_elements":["mat","The","on","sat","cat","the"]}' \
|
|
570
|
+
--intent '{"type":"reorder","value":["The","cat","sat","on","the","mat"]}' \
|
|
571
|
+
--intent '{"type":"reveal"}'
|
|
572
|
+
```
|
|
573
|
+
|
|
574
|
+
By default it **calls the tool** with `--arguments` to obtain the result the
widget renders from (use `--no-call` to render from the arguments alone, or omit
`--tool` to pick the first UI-producing tool). It writes `apps-render.json` +
`apps-render.md`, a `widget.png` of the initial render, and a `widget-final.png`
after the intents run. Exit status is non-zero if the render errored or any
assertion failed, so it can gate CI. In the example above the reorder intent
rebuilds the sentence and the widget confirms _"Nice work. Sentence is correct."_,
proving both visibility and a completed interaction.

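Because the exit status carries the pass/fail signal, a CI step can wrap the command in a few lines of Python. The sketch below is one possible wrapper, not something ghostlab ships: `apps_render_command` and `passed` are hypothetical helper names, and only the flags shown in the example above are used.

```python
import json
import subprocess


def apps_render_command(target: str, tool: str, arguments: dict) -> list[str]:
    """Build an apps-render invocation from the flags shown in the README example."""
    return [
        "ghostlab", "apps-render",
        "--target", target,
        "--tool", tool,
        "--arguments", json.dumps(arguments),
    ]


def passed(command: list[str]) -> bool:
    """Run the command and report whether it exited with status 0."""
    return subprocess.run(command, check=False).returncode == 0
```

A CI job could call `passed(apps_render_command(...))` and fail the build when it returns `False`.
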
`apps-render` is the browser-backed increment of issue #13. Still ahead: richer
per-widget assertions, the full request-side host bridge (`call-server-tool`
proxying, `open-link`/`download-file` handling), and wiring the user emulator to
emit UI intents during a live `run`.

### Session runner (one live agent across turns)

By default each turn spawns a fresh agent process and the orchestrator replays
the transcript. The **session runner** (`"kind": "codex-session"`) instead keeps
one codex session alive: turn 1 records the `thread_id` from the JSONL
`thread.started` event, and later turns run `codex exec resume <thread_id>` so
codex retains context. The orchestrator then sends only the new user message
instead of the whole transcript (fewer tokens, no repeated cold-start noise). The
shared `session_id` is logged per turn in `events.jsonl` for auditability.

```bash
ghostlab run --target targets/cortex-local.json --scenario <scenario.json> \
  --aut-runner runners/codex-cortex-local-session.json --user-runner <user.json>
```

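The thread-capture step of the session runner can be sketched as follows. This is an illustration under assumptions, not ghostlab's implementation: it assumes each JSONL event is an object with a top-level `type` field and that the `thread.started` event carries `thread_id` at the top level. Both function names are hypothetical, and the real resume command may carry extra flags.

```python
import json
from typing import Optional


def find_thread_id(jsonl_output: str) -> Optional[str]:
    """Return the thread_id from the first thread.started event, if any."""
    for line in jsonl_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise mixed into the stream
        if isinstance(event, dict) and event.get("type") == "thread.started":
            return event.get("thread_id")
    return None


def resume_command(thread_id: str) -> list[str]:
    """Build the follow-up invocation described above for turns after the first."""
    return ["codex", "exec", "resume", thread_id]
```

Turn 1 would run the plain command and call `find_thread_id` on its output; later turns would use `resume_command` with the recorded id.
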
### Validate your setup: `doctor`

Check that codex is reachable and that runner presets are well-formed before a
run:

```bash
ghostlab doctor                      # validates runners/*.json
ghostlab doctor --runners runners/codex-cortex-local-session.json
```

It reports the codex binary + version and validates each runner's kind, command,
and parser (e.g. a `codex-session` command must contain `exec`).

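To make the kind of check concrete, here is a rough sketch of a runner-preset validator. It is not the `doctor` implementation: `check_runner` is a hypothetical name, the set of kinds is limited to those mentioned in this README, and the rule that non-mock runners need a non-empty list of strings as `command` is an assumption. Only the `exec` rule for `codex-session` comes from the text above.

```python
# Kinds named in this README; the real tool may accept others.
KNOWN_KINDS = {"mock", "process", "codex-session"}


def check_runner(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the preset looks well-formed."""
    problems: list[str] = []
    kind = config.get("kind")
    if kind not in KNOWN_KINDS:
        problems.append(f"unknown kind: {kind!r}")
        return problems
    if kind == "mock":
        return problems  # a mock runner needs no command
    command = config.get("command")
    well_formed = (
        isinstance(command, list)
        and len(command) > 0
        and all(isinstance(arg, str) for arg in command)
    )
    if not well_formed:
        problems.append("command must be a non-empty list of strings")
    elif kind == "codex-session" and "exec" not in command:
        problems.append("a codex-session command must contain 'exec'")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a report show every issue in one pass.
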
### Default agent backend

`codex` is the default coding-agent backend for the generation and run stages.
The `inspect` command needs no agent; it is a direct MCP client. The codex
binary is auto-detected from `$PATH`, then the macOS app bundle
(`/Applications/Codex.app/Contents/Resources/codex`); override with
`$REHEARSAL_CODEX_BIN` or `--codex-bin`.

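The lookup order can be sketched like this. It is an illustration only: `resolve_codex` is a hypothetical function, and the precedence of `--codex-bin` over `$REHEARSAL_CODEX_BIN` when both are set is an assumption, since the text above does not say which wins.

```python
import os
import shutil
from typing import Callable, Mapping, Optional

MACOS_BUNDLE = "/Applications/Codex.app/Contents/Resources/codex"


def resolve_codex(
    cli_override: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
    exists: Callable[[str], bool] = os.path.isfile,
) -> Optional[str]:
    """Pick a codex binary: explicit override, then $PATH, then the macOS bundle."""
    env = os.environ if env is None else env
    # Assumption: the CLI flag beats the environment variable.
    explicit = cli_override or env.get("REHEARSAL_CODEX_BIN")
    if explicit:
        return explicit
    found = which("codex")
    if found:
        return found
    if exists(MACOS_BUNDLE):
        return MACOS_BUNDLE
    return None
```

The `which` and `exists` parameters exist only so the lookup can be exercised without touching the real filesystem.
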
## Quick Start

Run a mock scenario without spending any coding-agent credits:

```bash
cd mcp-rehearsal
ghostlab run \
  --target targets/example-stdio.json \
  --scenario scenarios/basic-discovery.json \
  --aut-runner runners/mock-aut.json \
  --user-runner runners/mock-user.json
```

The run output is written under `runs/<run-id>/`:

- `events.jsonl`: structured event log
- `report.md`: readable run summary
- `target.mcp.json`: generated `mcpServers` config for the target

## Runner Configs

Mock runner:

```json
{
  "kind": "mock"
}
```

Process runner:

```json
{
  "kind": "process",
  "command": ["codex", "exec", "-"],
  "env": {},
  "timeout_seconds": 300,
  "prompt_mode": "stdin"
}
```

The process runner starts one fresh process per turn. `prompt_mode` can be `stdin`, `append-arg`, or `replace-placeholder`. Rehearsal also sets `REHEARSAL_TARGET_ID` and `REHEARSAL_MCP_CONFIG` for the AUT process so runner commands can inject the generated MCP config into Codex, Claude Code, or another agent host.

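One way to picture the three `prompt_mode` values is the sketch below. It is a guess at the semantics from the mode names, not the actual runner code: `build_invocation` is a hypothetical helper, and the `{prompt}` placeholder token is an assumption, since this README does not name the real token.

```python
from typing import Optional


def build_invocation(
    command: list[str],
    prompt: str,
    prompt_mode: str,
    placeholder: str = "{prompt}",  # hypothetical token, not confirmed by the README
) -> tuple[list[str], Optional[str]]:
    """Return (argv, stdin_text) for one turn under the given prompt mode."""
    if prompt_mode == "stdin":
        # The prompt is piped to the process; argv is unchanged.
        return list(command), prompt
    if prompt_mode == "append-arg":
        # The prompt becomes the final command-line argument.
        return [*command, prompt], None
    if prompt_mode == "replace-placeholder":
        # Every occurrence of the placeholder inside argv is substituted.
        return [arg.replace(placeholder, prompt) for arg in command], None
    raise ValueError(f"unsupported prompt_mode: {prompt_mode!r}")
```

With the process-runner config shown above, `stdin` mode would leave `["codex", "exec", "-"]` untouched and send the prompt on standard input.
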
## Next Step

Wire the process runner to real Codex and Claude Code MCP config injection, then add a Docker Compose sandbox for target MCP apps.