cowork-harness 0.1.0 → 0.1.1

package/CHANGELOG.md CHANGED
@@ -4,187 +4,43 @@ All notable changes to this project are documented here. The format is based on
4
4
  [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The project uses
5
5
  [Semantic Versioning](https://semver.org/); pre-1.0 minor versions may include breaking changes.
6
6
 
7
- ## [0.1.0] — 2026-06-13
7
+ ## [0.1.1] — 2026-06-16
8
8
 
9
- Initial public release. A faithful, headless, scriptable harness for Claude Cowork's runtime for
10
- testing Claude Code **skills** outside the Desktop app — same staged agent, same spawn/control-protocol
11
- contract, same sandbox limitations, binary-grounded against `app.asar` 1.12603.1 / agent ELF 2.1.170.
9
+ Docs, distribution, and packaging. No CLI behavior change.
12
10
 
13
11
  ### Added
14
12
 
15
- - **Skill & scenario testing.** `cowork-harness skill <folder> "<prompt>"` runs a local skill folder
16
- directly (copied fresh each run: no install, marketplace, or version bump). `cowork-harness run
17
- <scenario.yaml | dir/>` runs authored, asserted regression scenarios with a CI-ready exit code.
18
- `--prompt-file <path>` passes a prompt **verbatim** (raw bytes, no shell `$`/backtick expansion) — the
19
- faithful-relay path for prompts with shell metacharacters. Per-command `skill --help` / `run --help`.
20
- - **Unified output.** `skill` renders the agent's work (assistant text + tool calls) + a metered
21
- footer; `run` is verdict-first but prints the **failing transcript inline** on a `FAIL`. `--quiet` /
22
- `--verbose` / `NO_COLOR`. `--output-format json` emits a stable, pipe-safe **compact single-line** envelope on
23
- stdout (`{tool, version, command, ok, results[], error}`; errors carry `{category, message, hint}`) —
24
- human output stays on stderr (see [SPEC §11](./SPEC.md)). Exit codes `0`/`1`/`2`.
25
- - **Five fidelity tiers** (`fidelity:`): `protocol` (L0, no sandbox), `container` (L1 Docker + per-run
26
- default-deny egress proxy), `microvm` (L2 Apple-VZ Lima microVM + guest firewall), `hostloop`
27
- (Cowork's production split-execution: agent loop on the host, shell/web routed into the VM via the
28
- workspace SDK-MCP server), and `cowork` (auto-picks host-loop vs container the way Cowork does, via
29
- GrowthBook gate `1143815894` decoded from the synced baseline).
30
- - **Three-seam driver.** `AgentSession` (typed event stream over the stream-json control protocol) →
31
- `Decider` (policy) → `Run` (turn loop + `RunRecord`). Multi-turn capable; the sub-agent dispatch tree,
32
- decisions (with who-decided + rationale), and cost are recorded.
33
- - **Input policy — no silent false-greens.** Scripted `answers:` / `--answer "q=choice"` resolve the
34
- agent's questions and tool-permission requests. An unscripted question follows an explicit
35
- `on_unanswered` policy: `fail` (error + the exact `--answer` to add — always the default for `run`),
36
- `prompt` (ask at the TTY), or `first` (pick option 1, loudly warn). Left unset, `skill` is adaptive
37
- (`prompt` on a TTY, `fail` when piped/CI). Exit codes: `0` pass · `1` assertion/agent failure · `2` usage / unanswered-under-`fail`.
38
- Tool permissions follow a `cowork` (allow-unscripted with an audit finding) or `strict` (deny) parity.
39
- - **In-band gate answering by the driving agent — `--decider-dir <dir>` + Monitor.** For the cases where
40
- the right answer encodes the *driving agent's test intent* (branch-steer / reproduce / boundary /
41
- differential), the harness writes each live gate to `<dir>/req-N.json` and blocks for `<dir>/resp-N.json`;
42
- the driving Claude session arms a **Monitor** on the dir (each gate wakes it via a task-notification) and
43
- writes the answer — the LIVE `AskUserQuestion` is answered **in-band**, no resume, no re-worded question
44
- (mechanism verified live). Same wire protocol as the other channels (reuses `ExternalDecider` whole);
45
- stdout stays free so it composes with `--output-format json` + `run_in_background`. The run is flagged
46
- non-deterministic. (Replaces the deferred, resume-brittle exit-at-gate approach.) Emits `[gate] req-N
47
- emitted` / `[gate] resp-N consumed` to stderr (visible even under `--output-format json`) and renames a
48
- consumed `req-N.json` → `.done` so a watcher can't re-emit it. The harness **owns the transport** so
49
- the driver only reads-and-decides: **`cowork-harness gates <dir> --follow`** streams one clean JSON
50
- line per pending gate + a terminal `{"done":true}` (point a single Monitor at it — no hand-written
51
- zsh/find/seen-set loop), and **`cowork-harness answer <dir> --gate <N> --choose <label>`** writes the
52
- answer atomically with the right wire shape.
53
- - **LLM decider — state the test's intent in one line (`--decider-llm --intent "<text>"`).** For
54
- agent-driven runs where writing a `--decider-cmd` helper is overkill: a small model (host `claude -p`,
55
- `COWORK_HARNESS_DECIDER_MODEL`) picks an option **by label** per live question, optionally steered by a
56
- one-line intent (e.g. "test the not_ai branch" → picks `not_ai`). Scripted `--answer`/`--answer-policy`
57
- still resolve first; an out-of-set answer **fails loud** (never a silent default). Because it's
58
- non-deterministic, the run is flagged `nonDeterministic` and the footer prints `⚠ non-deterministic
59
- (LLM-decided)` so a green can't be mistaken for a reproducible/scripted pass. Validate it in ~2s with
60
- `cowork-harness decide --decider-llm --intent …`. (`--decider-llm` is the only user-facing spelling;
61
- the internal `agent` policy it rides is not a CLI flag — `--on-unanswered agent` is rejected with a redirect.)
62
- - **Safety fix — a question is never silently answered with option 1.** A question that reaches the
63
- decider chain's terminal unanswered now **fails loud** (`UnansweredError`) instead of the prior
64
- silent option-1 fallback in the run loop — closing the worst failure mode (a wrong-branch run printing
65
- `✓ success`). Permissions/dialogs still fail *closed* (deny/cancel), which is correct.
66
- - **External decider — answer the LIVE question (the stochastic-question fix).** Because a skill's
67
- AskUserQuestions are LLM-generated and vary run to run, a pre-written `--answer` regex is brittle.
68
- **`--decider-cmd '<helper>'`** spawns a helper once and pipes each *actual* live question (with options
69
- + a scrubbed transcript `context` + a literal `reply_with` template) to it, reading the answer back —
70
- the agent-usable, one-shot path for custom logic (even an LLM call). The Python package's
71
- **`serve_decider(fn)`** pre-builds the wire loop so a helper writes only the decision function (the
72
- spawn-helper analogue of the `gates`/`answer` commands). Replies are lenient (label OR 1-based index,
73
- `id` optional); scripted `--answer` + permission parity still apply first; the request is
74
- secret-scrubbed before it leaves the process. The helper owns its own pipes, so **`--decider-cmd`
75
- keeps the CLI's stdout free and composes with `--output-format json`** — as does `--decider-dir`; both
76
- external channels are stdout-free and orthogonal (pick one terminal).
77
- - **Egress sandbox.** Default-deny outbound, enforced against the **synced** Cowork domain allowlist
78
- (plus per-scenario `extra_allow`); `egress_*` / `expect_denied` assertions; `web_fetch` modeled
79
- host/API-routed (gated by a web-fetch allowlist) as in real Cowork, distinct from container-sandboxed
80
- `bash`.
81
- - **Assertions** (`assert:`): transcript, files, user-visible artifacts, tool / sub-agent usage,
82
- `subagent_dispatched` / `subagent_declared_but_unused` / `dispatch_count_max`, egress, no-delete-in-
83
- outputs, self-heal, host-path-leak, question count, `gate_answers_delivered`, result status, and
84
- **`transcript_matches`/`transcript_not_matches`** (case-insensitive regex over the transcript — the
85
- drift-tolerant content check for stochastic prose; replay-safe, so it runs on the token-free PR gate).
86
- - **AskUserQuestion answer delivery (correctness fix).** The answer to an AskUserQuestion gate is now
87
- injected as the binary's COMPLETE tool input — `updatedInput:{questions, answers}`, not `{answers}` —
88
- matching the ELF's built-in handler, which does `questions.map(…)` over the input (verified against
89
- `claude-code-vm` 2.1.170). Dropping `questions` threw `undefined is not an object (evaluating 'q.map')`,
90
- so the answer never reached the model and gate-steering silently no-oped. (The earlier golden snapshot
91
- had blessed the `{answers}`-only shape as "faithful"; it was the bug — corrected, with a regression test
92
- asserting `questions` is preserved.) **New verification surfaces:** `tool_result` blocks are now captured;
93
- `RunResult.gateDeliveries[]` + the `gate_answers_delivered` assertion confirm each answer actually
94
- reached the model (a `::warning:: [gate] DELIVERY FAILED` fires in real time on an errored result);
95
- `cowork-harness trace <id> --tools` shows each tool's result status; `trace <id> --gates` shows the gate
96
- lifecycle (question → injected answer → delivered result); the gate rendezvous wire shapes are
97
- written into `<run>/gates/` on every run (so the forensic evidence survives the channel's cleanup, even
98
- without `--keep`).
99
- - **Truthful tool counts (`RunResult.toolCounts`).** Per-tool call counts from the actual tool_use stream
100
- (top-level only). On the cowork path `usage.server_tool_use.web_search_requests` is 0 (it counts the
101
- Anthropic *server* tool; WebSearch is a host-routed *client* tool) — `toolCounts.WebSearch` is the real
102
- count to assert on.
103
- - **Run-once-then-script.** Every question the agent asks that wasn't pre-scripted (auto-answered by
104
- `first`, or answered interactively) is echoed on the footer as a copy-pasteable `--answer "<q>=<choice>"`
105
- line — turning an exploratory run into a deterministic one. An **idle heartbeat** (`… still running
106
- (Xs · N tools)` on stderr after ~30s of silence) keeps long 5–20 min runs legible; disable with
107
- `COWORK_HARNESS_NO_HEARTBEAT`, tune with `COWORK_HARNESS_HEARTBEAT_MS`.
108
- - **File provision.** `--upload <file>` attaches a file at `mnt/uploads/<name>` (the "attach a file"
109
- path skills like deck-review require) and `--project <dir>` connects a folder at `mnt/.projects/<id>`,
110
- ad-hoc on the `skill` command (parity with the scenario `session.uploads`/`folders`).
111
- - **Resume-after-failure hardening.** Ephemeral Docker resources (egress networks/proxy + the host-loop
112
- container) are named by a unique per-invocation token (not the session id), and the agent container is
113
- reaped on teardown — so a `--resume` after a failed/interrupted run no longer collides with a leftover
114
- orphaned container/network.
115
- - **Session persistence & resume.** `--session-id <id>` pins a stable run dir + the agent's native
116
- session UUID (persisted in a `session.json` manifest); `--resume` reuses that work dir — preserving
117
- `mnt/.claude/projects/<uuid>.jsonl`, `gate_state.json`, and `mnt/outputs` — and passes the agent's
118
- own `--resume` so it reloads the conversation. This is how checkpoint-and-resume skills (a gate that
119
- writes state, ends, and is re-invoked later with the prior RUN_ID) are tested. The harness leans on
120
- the agent's native resume rather than reimplementing it (binary-verified).
121
- - **Interactive `chat`** — multi-turn REPL keeping the full harness (egress sandbox + control protocol);
122
- `chat --raw` drops to the agent's native interactive cowork mode via `docker run -it`.
123
- - **Cassettes + full-fidelity replay.** `record` saves a control-protocol cassette; `replay --cassette`
124
- plays it back deterministically (no token, no Docker) and re-evaluates content assertions.
125
- - *The cassette captures both protocol directions:* `events` (child→driver, the assistant turn stream)
126
- AND `controlOut` (driver→child decision responses). Both are recorded; `replay` now **consumes** both.
127
- - *Full-fidelity replay (C1 false-green fix).* Consuming `controlOut` re-runs the decision pipeline on
128
- replay, populating `rec.questions`/`rec.gateAnswers`/`rec.gateDeliveries`. Previously,
129
- `question_asked` silently false-failed (questions invisible), `questions_count_max` passed vacuously
130
- (0 ≤ max), and `gate_answers_delivered: true` passed vacuously (no deliveries recorded) — a silent
131
- false-green violating the project's core principle. All three now genuinely evaluate when `controlOut`
132
- is present.
133
- - *The O7 guard on the token-free lane (`replay_protocol_fidelity`).* `replay` re-serializes each
134
- decision response via `serializeDecision` and compares to the frozen `controlOut` envelope. A mismatch
135
- (e.g. `serializeDecision` dropping `questions` from the AskUserQuestion `updatedInput`) appends a
136
- `{ assertion: { replay_protocol_fidelity: true }, pass: false, message }` entry and exits 1 — catching
137
- the O7 answer-shape regression without a live model.
138
- - *Backward compatibility.* Old cassettes without `controlOut` get a loud `::warning::` on stderr;
139
- `question_asked`, `questions_count_max`, and `gate_answers_delivered` are excluded from evaluation
140
- (not vacuously passed). Re-record to enable full-fidelity mode.
141
- - *Committed synthetic fixture + CI replay gate.* `examples/replays/example-pdf-skill.cassette.json`
142
- is a hand-authored fixture (permission gate + AskUserQuestion gate + `tool_result`) committed to the
143
- repo and replayed in the token-free CI job — dogfooding the documented PR-gate pattern and pinning
144
- the fixture against `parseMessage`/assertion/`Run` regressions on every push.
145
- - **pytest `cowork` lane** (`python/`) — `@pytest.mark.cowork` + a `cowork` fixture over the
146
- `--output-format json` surface, selectable with `-m cowork` beside your fast tests.
147
- - **Faithful sub-agent aggregation.** Recognizes the real cowork dispatch tool — **`Agent`**
148
- (`{description, subagent_type, prompt}`; binary-verified primary name, with `Task` as its legacy
149
- alias) and any tool carrying `subagent_type` — so `subagents[]` and the `subagent_dispatched` /
150
- `dispatch_count_max` / `subagent_tool_*` assertions fire under `--fidelity cowork`. The cowork
151
- `TaskCreate`/`TaskUpdate` **todo list** and `Monitor` are correctly excluded (no over-counting). Each
152
- dispatch also captures its **`description`** — so a dispatch the skill made with no `subagent_type`
153
- (`agentType:"unknown"`) is still self-explaining in `trace` and assertable: `subagent_dispatched`
154
- matches the agentType **OR** the description.
155
- - **`trace <run-id | dir | events.jsonl> [--tools]`** — digests a run's `events.jsonl` into tool calls,
156
- sub-agent dispatches (deduped), decisions, and questions (reuses the live `parseMessage` so it tracks
157
- the schema); `--output-format json` for structured rows. Plus `result.json`/the json envelope now expose
158
- `workDir`/`outputsDir`, and `--keep` prints the deep `mnt/outputs` deliverable path.
159
- - **Per-run artifacts** under `runs/<scenario>/<id>/`: `events.jsonl` + `control-out.jsonl` (the cassette
160
- source), `run.jsonl` (harness-observability log), `trace.json` (structured trace), `egress.log`,
161
- `result.json`, `agent.stderr.log`. Injected secrets (OAuth token / API key) are scrubbed from every
162
- persisted log by value.
163
- - **Auth via env or `.env`.** `CLAUDE_CODE_OAUTH_TOKEN` (preferred) or `ANTHROPIC_API_KEY`, resolved in
164
- priority order: exported env > `--dotenv <path>` > `./.env` (cwd) > `<install>/.env` (package root),
165
- so a run from any directory still finds the install's credentials. Host-side, gitignored, exported
166
- vars win, never mounted into the sandbox; passed via argv/env, never written to a runtime path. (The
167
- flag is `--dotenv`, not `--env-file` — Node reserves the latter.)
168
- - **Platform baselines** (`cowork-harness sync`) derive the release-specific facts (agent version, domain
169
- allowlist, mount layout, the production GrowthBook gate states) from your installed Claude Desktop, so
170
- the code rides the stable protocol while data tracks each release. `--diff` previews changes; an
171
- `asarFingerprint` tripwire flags unrecognized deltas.
172
- - **Sandbox self-verification** — `cowork-harness boundary-check` proves the sandbox enforces Cowork's
173
- limitations; `cowork-harness vm <init|status|delete|prune>` manages the L2 microVM (`prune` drops orphaned VMs).
174
- - **CI** (`.github/workflows/ci.yml`): typecheck · format · unit + golden snapshots · build · boundary
175
- parity (Docker) · live-contract guards · scenario suite (gated on a key) · the pytest lane.
13
+ - **Companion Claude Code skill, installable.** A `.claude-plugin/marketplace.json` + skills-directory
14
+ plugin make the bundled skill installable via `/plugin marketplace add yaniv-golan/cowork-harness`;
15
+ the skill self-bootstraps the CLI (`npx cowork-harness@latest`) and fails loud on missing tier deps.
16
+ - **`AGENTS.md`** (canonical, cross-tool agent instructions) and **`llms.txt`** (doc index).
17
+ - **JSON Schema for scenario & session YAML** (`schema/*.schema.json`, generated via `npm run schema`,
18
+ pinned by a token-free drift-guard); `# yaml-language-server: $schema=` hints in the example scenarios.
19
+ - README banner, badges, a "For AI agents" section, and `npm install` instructions.
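
The drift-guard named in the schema entry is not part of this diff. As a rough sketch of the idea only, assuming the guard compares a freshly generated schema against the committed `schema/*.schema.json` copy, a comparison that ignores key order could look like this. The names `Json`, `canonical`, and `hasDrifted` are hypothetical and are not taken from the package.

```typescript
// Hypothetical sketch: none of these names come from cowork-harness itself.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize a JSON value with object keys sorted, so that two schemas with the
// same content but different key order produce the same string.
function canonical(value: Json): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonical).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    const parts = keys.map((k) => `${JSON.stringify(k)}:${canonical(value[k] as Json)}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value);
}

// True when the regenerated schema no longer matches the committed one.
function hasDrifted(generated: Json, committed: Json): boolean {
  return canonical(generated) !== canonical(committed);
}

// Illustrative data only: same content, different key order.
const committedSchema: Json = { type: "object", properties: { name: { type: "string" } } };
const regeneratedSchema: Json = { properties: { name: { type: "string" } }, type: "object" };
const changedSchema: Json = { type: "object", properties: { name: { type: "number" } } };

console.log(hasDrifted(regeneratedSchema, committedSchema));
console.log(hasDrifted(changedSchema, committedSchema));
```

A check of this kind needs no API token, which fits the "token-free" description, but how the real guard is wired into CI is not visible here.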
176
20
 
177
- ### Notes
21
+ ### Changed
178
22
 
179
- - **Two concepts, two names (no "profile" overload).** The synced per-release snapshot is the **platform
180
- baseline** (`baselines/desktop-*.json`, scenario field `baseline:`, type `PlatformBaseline`); the
181
- hand-authored pre-prompt config is the **session** (`sessions/*.yaml`, scenario field `session:`, type
182
- `SessionConfig`). For back-compat, a scenario's deprecated `profile:` key is still accepted for one minor
183
- (mapped to `baseline:` with a stderr deprecation warning) and `Profile` is re-exported as an alias of
184
- `PlatformBaseline`; both are removed next minor. The `--output-format json` `RunResult` field is `baseline`
185
- (was `profile`).
186
- - The agent binary is **bind-mounted from your own Claude Desktop install** (`claude-code-vm/<ver>/claude`)
187
- or `COWORK_AGENT_BINARY` — nothing Anthropic-owned is bundled or distributed. There is no npm path.
188
- - Cowork mode is enabled by `CLAUDE_CODE_IS_COWORK=1` (env), **not** a `--cowork` flag.
189
- - This is a fixture for testing, **not a security boundary** — see [SECURITY.md](./SECURITY.md) and the
190
- fidelity caveats in [DESIGN.md](./DESIGN.md).
23
+ - Release pipeline publishes via npm **Trusted Publishing (OIDC)** with provenance (no stored token).
24
+ - GitHub Actions bumped off the deprecated Node 20 runtime; CI live-scenario job skips cleanly without a key.
25
+
26
+ ## [0.1.0] 2026-06-16
27
+
28
+ Initial public release. A faithful, headless, scriptable harness for Claude Cowork's runtime — for
29
+ testing Claude Code **skills** outside the Desktop app with the same staged agent, spawn/control-protocol
30
+ contract, egress allowlist, permission protocol, and sandbox limitations. Binary-grounded against
31
+ `app.asar` 1.12603.1 / agent ELF 2.1.170.
32
+
33
+ ### Added
34
+
35
+ - Commands: `skill`, `run`, `chat`, `record`, `replay`, `trace`, and `decide`, plus `sync`,
36
+ `boundary-check`, and `vm` management. Stable `--output-format json` envelope and CI-ready exit codes.
37
+ - Five fidelity tiers (`fidelity:`): `protocol`, `container`, `microvm`, `hostloop`, and `cowork`
38
+ (auto-picks host-loop vs container the way Cowork does).
39
+ - Scenario YAML — prompt + scripted answers + `assert:` (transcript, files, artifacts, tool / sub-agent
40
+ usage, egress, and more) for authored, asserted regression runs.
41
+ - Input policy with no silent false-greens: scripted, LLM, and in-band (`--decider-dir`) answering for
42
+ AskUserQuestion / tool-permission gates; an unanswered gate fails loud.
43
+ - Default-deny egress sandbox enforced against the synced Cowork domain allowlist.
44
+ - Token-free, Docker-free cassette `record` / `replay` for the PR gate.
45
+ - Platform baselines synced from a local Claude Desktop install — nothing Anthropic-owned is bundled
46
+ or distributed.
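
The input-policy entry can be illustrated with a small sketch. It assumes, following the fuller 0.1.0 notes, that scripted answers are regex patterns tried first, and that an unscripted question then follows an `on_unanswered` policy of `fail` or `first` (the interactive `prompt` policy is left out). `resolveAnswer` and `ScriptedAnswer` are hypothetical names, and the `UnansweredError` class below is a local stand-in, not the package's own implementation.

```typescript
// Hypothetical sketch of the answer-resolution order; not the harness's code.
type OnUnanswered = "fail" | "first";

interface ScriptedAnswer {
  pattern: RegExp; // matched against the live question text
  choice: string;  // label of the option to pick
}

// Local stand-in for the error the changelog calls UnansweredError.
class UnansweredError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "UnansweredError";
  }
}

function resolveAnswer(
  question: string,
  options: string[],
  scripted: ScriptedAnswer[],
  policy: OnUnanswered,
): string {
  // 1. Scripted answers resolve first.
  for (const s of scripted) {
    if (s.pattern.test(question)) {
      // An out-of-set answer fails loudly instead of falling back to a default.
      if (!options.includes(s.choice)) {
        throw new Error(`scripted choice "${s.choice}" is not an offered option`);
      }
      return s.choice;
    }
  }
  // 2. `first` picks option 1 and warns on stderr.
  const first = options[0];
  if (policy === "first" && first !== undefined) {
    console.warn(`warning: unscripted question answered with option 1: ${question}`);
    return first;
  }
  // 3. Otherwise the question is never answered silently.
  throw new UnansweredError(`unanswered question: ${question}`);
}

// Illustrative question and options only.
const scriptedAnswers: ScriptedAnswer[] = [{ pattern: /which format/i, choice: "PDF" }];
const picked = resolveAnswer("Which format do you want?", ["PDF", "DOCX"], scriptedAnswers, "fail");
console.log(picked);
```

Throwing on the terminal unanswered case is what prevents a wrong-branch run from reporting success.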
package/README.md CHANGED
@@ -1,8 +1,15 @@
1
+ <p align="center">
2
+ <img src="docs/assets/banner.png" alt="cowork-harness — headless, scriptable, CI-ready test harness for Claude Cowork skills" width="100%">
3
+ </p>
4
+
1
5
  # cowork-harness
2
6
 
3
7
  [![ci](https://github.com/yaniv-golan/cowork-harness/actions/workflows/ci.yml/badge.svg)](https://github.com/yaniv-golan/cowork-harness/actions/workflows/ci.yml)
4
8
  [![license: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE)
5
9
  [![node: >=20](https://img.shields.io/badge/node-%3E%3D20-339933.svg)](#quick-start)
10
+ [![Claude Code plugin](https://img.shields.io/badge/Claude_Code-plugin-F97316)](#drive-it-from-claude-code-companion-skill)
11
+ [![Built with Skill Creator Plus](https://img.shields.io/badge/Built_with-Skill_Creator_Plus-4ecdc4)](https://github.com/yaniv-golan/skill-creator-plus)
12
+ [![Agent Skills compatible](https://img.shields.io/badge/Agent_Skills-compatible-4A90D9)](https://agentskills.io)
6
13
 
7
14
  Scriptable, CI-friendly test harness that reproduces **Claude Cowork's observable runtime contract** closely enough to test the skills you write — across many scenarios, headless, in CI — without the (locked) Desktop app. It reproduces not just Cowork's *behavior* but its *limitations*: sealed filesystem, default-deny egress, MCP-only cross-boundary — so a green test means green in real Cowork.
8
15
 
@@ -141,7 +148,13 @@ It mounts the folder(s) at the Cowork plugin path, runs the staged agent in cowo
141
148
 
142
149
  ## Quick start
143
150
 
144
- **Install from source** (not yet published to npm):
151
+ **Install from npm:**
152
+
153
+ ```bash
154
+ npm install -g cowork-harness # puts the `cowork-harness` command on your PATH
155
+ ```
156
+
157
+ **Or build from source:**
145
158
 
146
159
  ```bash
147
160
  git clone https://github.com/yaniv-golan/cowork-harness && cd cowork-harness
@@ -149,8 +162,18 @@ npm install && npm run build && npm link # puts the `cowork-harness` command
149
162
  # …or skip the link and call it directly: node dist/cli.js <cmd>
150
163
  ```
151
164
 
152
- > A global `npm install -g cowork-harness` will work once the package is published; for now, build from source.
153
- > (Heads-up: the repo folder is `claude-cowork-headless-emulator`, the package/CLI is `cowork-harness`, and the GitHub repo is `yaniv-golan/cowork-harness`.)
165
+ ### Drive it from Claude Code (companion skill)
166
+
167
+ This repo ships a **companion skill** (`.claude/skills/cowork-harness/`) that teaches an agent how to drive the harness — author scenarios, pick a fidelity tier, script answers, place assertions in the right CI lane, and avoid the "✓ passed ≠ correct" traps. Install it into Claude Code via the bundled marketplace:
168
+
169
+ ```bash
170
+ /plugin marketplace add yaniv-golan/cowork-harness
171
+ /plugin install cowork-harness@cowork-harness
172
+ ```
173
+
174
+ The skill **self-bootstraps the CLI**: if `cowork-harness` isn't on your PATH it falls back to `npx cowork-harness@latest` (Node ≥ 20). Tiers above `protocol` still need Docker/Lima and a Claude Desktop agent binary — see the prerequisites below.
175
+
176
+ It also follows the open [Agent Skills](https://github.com/vercel-labs/skills) spec, so it installs cross-editor (Cursor, Codex, OpenCode, …) by pointing the `npx skills` CLI at `.claude/skills/cowork-harness` in this repo. (Working *inside* this repo, the skill auto-loads as a project skill — no install needed.)
154
177
 
155
178
  **Prerequisites for anything above `protocol` fidelity** (the `protocol` tier needs none of these — it's pure logic iteration):
156
179
  1. **Claude Desktop, opened once.** The Cowork agent binary is **bind-mounted from your own install** at run time — nothing Anthropic-owned is bundled. Open Cowork once so the agent ELF is staged (`…/claude-code-vm/<ver>/claude`); the harness auto-detects it, or set `COWORK_AGENT_BINARY=<path>` to point at it. Without a staged agent, container/cowork runs fail with "Open Cowork once to stage it…".
@@ -309,32 +332,26 @@ Secrets (the injected OAuth token / API key) are scrubbed from every persisted l
309
332
  ## Architecture
310
333
 
311
334
  ```
312
- ┌──────────────────────────────────────────────┐
313
- scenario.yaml ─────► │ cowork-harness (TS CLI)
314
-
315
- baseline loader ── baselines/desktop-*.json ◄── cowork-sync
316
- │ (agent ver, mounts, │ (reads live
317
- egress allowlist) │ Desktop install
318
- │ │ + app.asar)
319
- scenario → runtime selector (L0/L1/L2)
320
- └───────────────┬──────────────────────────────┘
321
- spawns + speaks stream-json
322
- ┌───────────────▼──────────────────────────────┐
323
- │ Agent: claude -p (CLAUDE_CODE_IS_COWORK=1) │
324
- --input-format stream-json │
325
- --output-format stream-json │
326
- │ │
327
- cwd = /sessions/<id>/mnt
328
- mnt/uploads, mnt/.projects/*, plugin mounts
329
- └───────────────┬──────────────────────────────┘
330
- decision control req (tool/ egress
331
- question/dialog/elicit) │
332
- ┌─────────────────────▼──────────────┐ ┌───────────────────────┐
333
- │ AgentSession → Decider → Run │ │ Egress proxy │
334
- │ (protocol seam · policy seam · │ │ default-deny, │
335
- │ turn loop + RunRecord) │ │ allowlist = synced │
336
- └─────────────────────────────────────┘ │ vmAllowedDomains() │
337
- └───────────────────────┘
335
+ ┌────────────────────────────────────────────────┐
336
+ scenario.yaml ────► │ cowork-harness (TypeScript CLI)
337
+ baseline loader ◄── baselines/desktop-*.json
338
+ runtime selector ──► L0 / L1 / L2
339
+ └───────────────────────┬────────────────────────┘
340
+ spawns + speaks stream-json
341
+ ┌───────────────────────▼────────────────────────┐
342
+ Agent: claude -p (CLAUDE_CODE_IS_COWORK=1)
343
+ │ --input-format / --output-format stream-json│
344
+ cwd = /sessions/<id>/mnt │
345
+ │ mnt/uploads · mnt/.projects/* · plugins │
346
+ └───────────────────────┬────────────────────────┘
347
+ decision control request outbound network (egress)
348
+ (tool · question · dialog) default-deny → allowlist
349
+ ┌───────────────────────▼────────────┐ ┌────────────────────────┐
350
+ AgentSession ──► Decider ──► Run │ Egress proxy │
351
+ protocol · policy · turn loop │ │ default-deny;
352
+ │ + RunRecord │ │ allowlist = synced │
353
+ │ │ │ vmAllowedDomains()
354
+ └────────────────────────────────────┘ └────────────────────────┘
338
355
  ```
339
356
 
340
357
  - **AgentSession** speaks the Agent SDK control protocol over stream-json, emitting a typed event
@@ -424,7 +441,7 @@ The diff shows exactly what moved (agent bump, allowlist change, new mount). You
424
441
 
425
442
  ---
426
443
 
427
- ## Honest limitations
444
+ ## Limitations
428
445
 
429
446
  - **Not the full Desktop network transport.** L1 is a container, not a VM; L2 *is* a real Apple-VZ microVM but still does not reproduce Cowork's gVisor netstack — its egress is the same allowlist proxy as L1 (with a guest iptables firewall in front). If your skill depends on VM-kernel specifics, validate at L2; if it depends on packet-level gVisor behavior, no tier reproduces it.
430
447
  - **Cowork in-guest context is partial.** Desktop supplies host-loop staging, runtime `mountPath` RPC, and the bridge. We reproduce the *filesystem and cowork mode*, not those host-side services. Skills that call Desktop-only host RPCs won't run here (they wouldn't be portable anyway).
@@ -435,6 +452,17 @@ These are documented per-tier in [DESIGN.md](./DESIGN.md) so a green test means
435
452
 
436
453
  ---
437
454
 
455
+ ## For AI agents
456
+
457
+ This repo is built to be driven by agents, not just read by humans:
458
+
459
+ - **[AGENTS.md](./AGENTS.md)** — the canonical agent-instructions file (architecture seams, the build gate, invariants, ethos). Read it before changing code. Also indexed in **[llms.txt](./llms.txt)**.
460
+ - **Companion skill** — [`.claude/skills/cowork-harness/`](./.claude/skills/cowork-harness/SKILL.md) teaches an agent to drive the harness; install it via the marketplace (see [above](#drive-it-from-claude-code-companion-skill)).
461
+ - **Machine-readable interfaces** — stable `--output-format json` envelope on stdout, deterministic exit codes (`0`/`1`/`2`), and `--help` on every command.
462
+ - **JSON Schemas** — [`schema/scenario.schema.json`](./schema/scenario.schema.json) and [`schema/session.schema.json`](./schema/session.schema.json) describe every field of the YAML you author (generated from the source schemas; `npm run schema`).
463
+
464
+ ---
465
+
438
466
  ## Documentation
439
467
 
440
468
  | Doc | Read it for |
package/dist/types.js CHANGED
@@ -93,7 +93,7 @@ export const Assertion = z.object({
93
93
  result: z.enum(["success", "error"]).optional(),
94
94
  replay_protocol_fidelity: z.boolean().optional(), // synthesized by replayCassette — serializeDecision output matched the frozen controlOut recording (O7 guard on the token-free lane)
95
95
  });
96
- const ScenarioObject = z
96
+ export const ScenarioObject = z
97
97
  .object({
98
98
  // Optional: defaults to the scenario's filename (sans extension) via parseScenarioFile —
99
99
  // the file IS the identity. An explicit `name:` is an override (keys the run dir + cassette).
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "cowork-harness",
3
- "version": "0.1.0",
3
+ "version": "0.1.1",
4
4
  "description": "Scriptable, CI-friendly harness for Claude Cowork's runtime contract for testing skills across scenarios — same agent, mounts, egress allowlist, permission protocol, and sandbox limitations.",
5
5
  "license": "MIT",
6
6
  "type": "module",
@@ -43,6 +43,7 @@
43
43
  "scripts": {
44
44
  "build": "rm -rf dist && tsc -p tsconfig.json && chmod +x dist/cli.js",
45
45
  "dev": "tsx src/cli.ts",
46
+ "schema": "tsx scripts/gen-schema.ts",
46
47
  "test": "vitest run",
47
48
  "test:watch": "vitest",
48
49
  "test:live": "vitest run --config vitest.config.live.ts",
@@ -62,7 +63,8 @@
62
63
  "prettier": "3.8.4",
63
64
  "tsx": "^4.19.0",
64
65
  "typescript": "^5.6.0",
65
- "vitest": "^2.1.0"
66
+ "vitest": "^2.1.0",
67
+ "zod-to-json-schema": "^3.25.2"
66
68
  },
67
69
  "engines": {
68
70
  "node": ">=20"