npm - cowork-harness - Versions diffs - 0.1.0 - Mend

cowork-harness 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (49)

package/.env.example +16 -0
package/CHANGELOG.md +190 -0
package/LICENSE +21 -0
package/README.md +470 -0
package/baselines/desktop-1.11847.5.json +78 -0
package/baselines/desktop-1.12603.1.json +140 -0
package/baselines/prompts/desktop-1.12603.1/host-loop-append.md +8 -0
package/baselines/prompts/desktop-1.12603.1/subagent-append-vm.md +3 -0
package/baselines/prompts/desktop-1.12603.1/system-prompt-append.md +18 -0
package/dist/agent/session.js +465 -0
package/dist/assert.js +159 -0
package/dist/baseline.js +87 -0
package/dist/boundary.js +114 -0
package/dist/canary/grants.js +37 -0
package/dist/cli.js +1107 -0
package/dist/decide/decider.js +521 -0
package/dist/decide/external-channel.js +262 -0
package/dist/decide/llm-transport.js +52 -0
package/dist/dotenv.js +52 -0
package/dist/egress/proxy.js +138 -0
package/dist/egress/sidecar.js +125 -0
package/dist/hostloop/provenance.js +110 -0
package/dist/hostloop/workspace-handler.js +226 -0
package/dist/loop-decision.js +62 -0
package/dist/prompt.js +43 -0
package/dist/run/cassette.js +420 -0
package/dist/run/chat.js +194 -0
package/dist/run/envelope.js +31 -0
package/dist/run/execute.js +533 -0
package/dist/run/renderer.js +179 -0
package/dist/run/run.js +347 -0
package/dist/run/trace-view.js +227 -0
package/dist/runtime/argv.js +126 -0
package/dist/runtime/container.js +76 -0
package/dist/runtime/host-env.js +28 -0
package/dist/runtime/hostloop.js +129 -0
package/dist/runtime/lima.js +177 -0
package/dist/runtime/microvm.js +151 -0
package/dist/runtime/protocol.js +79 -0
package/dist/runtime/stage.js +52 -0
package/dist/secrets.js +42 -0
package/dist/session.js +315 -0
package/dist/sync/cowork-sync.js +215 -0
package/dist/types.js +127 -0
package/docker/Dockerfile.agent +31 -0
package/docker/Dockerfile.proxy +12 -0
package/docker/compose.yml +31 -0
package/fixtures/subagent-grants.json +5 -0
package/package.json +70 -0

package/.env.example ADDED

@@ -0,0 +1,16 @@
+# cowork-harness host-side credentials.
+#
+# Copy to `.env` (auto-loaded from the working dir at startup; already exported vars win).
+# `.env` is gitignored. SECURITY: keep this at your repo/working-dir ROOT — never inside a
+# mounted skill/project folder, or its contents would be copied into the agent's sandbox.
+# The token value is also scrubbed from all persisted run logs regardless of source.
+# Preferred: a long-lived OAuth token. Mint one with `claude setup-token` (valid ~1 year).
+CLAUDE_CODE_OAUTH_TOKEN=
+# Alternative: an API key (used ONLY when no OAuth token is set; cowork drops API keys when a
+# token is present, mirroring the desktop).
+# ANTHROPIC_API_KEY=
+# Optional: timezone passed through to the agent.
+# TZ=America/New_York
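
The comments in `.env.example` describe two rules: already exported variables win over `.env`, and an API key is used only when no OAuth token is set. A minimal Python sketch of that precedence follows. The function names `parse_dotenv` and `resolve_credentials` are hypothetical, the parser only handles plain `KEY=VALUE` lines, and nothing here is taken from the package's own `dist/dotenv.js`.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse plain KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve_credentials(exported: dict[str, str], dotenv_text: str) -> dict[str, str]:
    """Pick the single credential to hand to the agent (illustrative only)."""
    # Exported variables are merged last, so they override .env entries.
    merged = {**parse_dotenv(dotenv_text), **exported}
    token = merged.get("CLAUDE_CODE_OAUTH_TOKEN", "")
    api_key = merged.get("ANTHROPIC_API_KEY", "")
    if token:
        # A token is present, so the API key is dropped.
        return {"CLAUDE_CODE_OAUTH_TOKEN": token}
    if api_key:
        return {"ANTHROPIC_API_KEY": api_key}
    return {}
```

In use, `exported` would typically be `dict(os.environ)` and `dotenv_text` the contents of the `.env` file at the working-directory root.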

package/CHANGELOG.md ADDED

@@ -0,0 +1,190 @@
+# Changelog
+All notable changes to this project are documented here. The format is based on
+[Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The project uses
+[Semantic Versioning](https://semver.org/); pre-1.0 minor versions may include breaking changes.
+## [0.1.0] — 2026-06-13
+Initial public release. A faithful, headless, scriptable harness around Claude Cowork's runtime, for
+testing Claude Code **skills** outside the Desktop app: same staged agent, same spawn/control-protocol
+contract, same sandbox limitations, binary-grounded against `app.asar` 1.12603.1 / agent ELF 2.1.170.
+### Added
+- **Skill & scenario testing.** `cowork-harness skill <folder> "<prompt>"` runs a local skill folder
+  directly (copied fresh each run — no install, marketplace, or version bump). `cowork-harness run
+  <scenario.yaml | dir/>` runs authored, asserted regression scenarios with a CI-ready exit code.
+  `--prompt-file <path>` passes a prompt **verbatim** (raw bytes, no shell `$`/backtick expansion) — the
+  faithful-relay path for prompts with shell metacharacters. Per-command `skill --help` / `run --help`.
+- **Unified output.** `skill` renders the agent's work (assistant text + tool calls) + a metered
+  footer; `run` is verdict-first but prints the **failing transcript inline** on a `FAIL`. `--quiet` /
+  `--verbose` / `NO_COLOR`. `--output-format json` emits a stable, pipe-safe **compact single-line** envelope on
+  stdout (`{tool, version, command, ok, results[], error}`; errors carry `{category, message, hint}`) —
+  human output stays on stderr (see [SPEC §11](./SPEC.md)). Exit codes `0`/`1`/`2`.
+- **Five fidelity tiers** (`fidelity:`): `protocol` (L0, no sandbox), `container` (L1 Docker + per-run
+  default-deny egress proxy), `microvm` (L2 Apple-VZ Lima microVM + guest firewall), `hostloop`
+  (Cowork's production split-execution: agent loop on the host, shell/web routed into the VM via the
+  workspace SDK-MCP server), and `cowork` (auto-picks host-loop vs container the way Cowork does, via
+  GrowthBook gate `1143815894` decoded from the synced baseline).
+- **Three-seam driver.** `AgentSession` (typed event stream over the stream-json control protocol) →
+  `Decider` (policy) → `Run` (turn loop + `RunRecord`). Multi-turn capable; the sub-agent dispatch tree,
+  decisions (with who-decided + rationale), and cost are recorded.
+- **Input policy — no silent false-greens.** Scripted `answers:` / `--answer "q=choice"` resolve the
+  agent's questions and tool-permission requests. An unscripted question follows an explicit
+  `on_unanswered` policy: `fail` (error + the exact `--answer` to add — always the default for `run`),
+  `prompt` (ask at the TTY), or `first` (pick option 1, loudly warn). Left unset, `skill` is adaptive
+  (`prompt` on a TTY, `fail` when piped/CI). Exit codes: `0` pass · `1` assertion/agent failure · `2` usage / unanswered-under-`fail`.
+  Tool permissions follow a `cowork` (allow-unscripted with an audit finding) or `strict` (deny) parity.
+- **In-band gate answering by the driving agent — `--decider-dir <dir>` + Monitor.** For the cases where
+  the right answer encodes the *driving agent's test intent* (branch-steer / reproduce / boundary /
+  differential), the harness writes each live gate to `<dir>/req-N.json` and blocks for `<dir>/resp-N.json`;
+  the driving Claude session arms a **Monitor** on the dir (each gate wakes it via a task-notification) and
+  writes the answer — the LIVE `AskUserQuestion` is answered **in-band**, no resume, no re-worded question
+  (mechanism verified live). Same wire protocol as the other channels (reuses `ExternalDecider` whole);
+  stdout stays free so it composes with `--output-format json` + `run_in_background`. The run is flagged
+  non-deterministic. (Replaces the deferred, resume-brittle exit-at-gate approach.) Emits `[gate] req-N
+  emitted` / `[gate] resp-N consumed` to stderr (visible even under `--output-format json`) and renames a
+  consumed `req-N.json` → `.done` so a watcher can't re-emit it. The harness **owns the transport** so
+  the driver only reads-and-decides: **`cowork-harness gates <dir> --follow`** streams one clean JSON
+  line per pending gate + a terminal `{"done":true}` (point a single Monitor at it — no hand-written
+  zsh/find/seen-set loop), and **`cowork-harness answer <dir> --gate <N> --choose <label>`** writes the
+  answer atomically with the right wire shape.
+- **LLM decider — state the test's intent in one line (`--decider-llm --intent "<text>"`).** For
+  agent-driven runs where writing a `--decider-cmd` helper is overkill: a small model (host `claude -p`,
+  `COWORK_HARNESS_DECIDER_MODEL`) picks an option **by label** per live question, optionally steered by a
+  one-line intent (e.g. "test the not_ai branch" → picks `not_ai`). Scripted `--answer`/`--answer-policy`
+  still resolve first; an out-of-set answer **fails loud** (never a silent default). Because it's
+  non-deterministic, the run is flagged `nonDeterministic` and the footer prints `⚠ non-deterministic
+  (LLM-decided)` so a green can't be mistaken for a reproducible/scripted pass. Validate it in ~2s with
+  `cowork-harness decide --decider-llm --intent …`. (`--decider-llm` is the only user-facing spelling;
+  the internal `agent` policy it rides is not a CLI flag — `--on-unanswered agent` is rejected with a redirect.)
+- **Safety fix — a question is never silently answered with option 1.** A question that reaches the
+  decider chain's terminal unanswered now **fails loud** (`UnansweredError`) instead of the prior
+  silent option-1 fallback in the run loop — closing the worst failure mode (a wrong-branch run printing
+  `✓ success`). Permissions/dialogs still fail *closed* (deny/cancel), which is correct.
+- **External decider — answer the LIVE question (the stochastic-question fix).** Because a skill's
+  AskUserQuestions are LLM-generated and vary run to run, a pre-written `--answer` regex is brittle.
+  **`--decider-cmd '<helper>'`** spawns a helper once and pipes each *actual* live question (with options
+  + a scrubbed transcript `context` + a literal `reply_with` template) to it, reading the answer back —
+  the agent-usable, one-shot path for custom logic (even an LLM call). The Python package's
+  **`serve_decider(fn)`** pre-builds the wire loop so a helper writes only the decision function (the
+  spawn-helper analogue of the `gates`/`answer` commands). Replies are lenient (label OR 1-based index,
+  `id` optional); scripted `--answer` + permission parity still apply first; the request is
+  secret-scrubbed before it leaves the process. The helper owns its own pipes, so **`--decider-cmd`
+  keeps the CLI's stdout free and composes with `--output-format json`** — as does `--decider-dir`; both
+  external channels are stdout-free and orthogonal (pick one terminal).
+- **Egress sandbox.** Default-deny outbound, enforced against the **synced** Cowork domain allowlist
+  (plus per-scenario `extra_allow`); `egress_*` / `expect_denied` assertions; `web_fetch` modeled
+  host/API-routed (gated by a web-fetch allowlist) as in real Cowork, distinct from container-sandboxed
+  `bash`.
+- **Assertions** (`assert:`): transcript, files, user-visible artifacts, tool / sub-agent usage,
+  `subagent_dispatched` / `subagent_declared_but_unused` / `dispatch_count_max`, egress, no-delete-in-
+  outputs, self-heal, host-path-leak, question count, `gate_answers_delivered`, result status, and
+  **`transcript_matches`/`transcript_not_matches`** (case-insensitive regex over the transcript — the
+  drift-tolerant content check for stochastic prose; replay-safe, so it runs on the token-free PR gate).
+- **AskUserQuestion answer delivery (correctness fix).** The answer to an AskUserQuestion gate is now
+  injected as the binary's COMPLETE tool input — `updatedInput:{questions, answers}`, not `{answers}` —
+  matching the ELF's built-in handler, which does `questions.map(…)` over the input (verified against
+  `claude-code-vm` 2.1.170). Dropping `questions` threw `undefined is not an object (evaluating 'q.map')`,
+  so the answer never reached the model and gate-steering silently no-oped. (The earlier golden snapshot
+  had blessed the `{answers}`-only shape as "faithful"; it was the bug — corrected, with a regression test
+  asserting `questions` is preserved.) **New verification surfaces:** `tool_result` blocks are now captured;
+  `RunResult.gateDeliveries[]` + the `gate_answers_delivered` assertion confirm each answer actually
+  reached the model (a `::warning:: [gate] DELIVERY FAILED` fires in real time on an errored result);
+  `cowork-harness trace <id> --tools` shows each tool's result status; `trace <id> --gates` shows the gate
+  lifecycle (question → injected answer → delivered result); the gate rendezvous wire shapes are
+  written into `<run>/gates/` on every run (so the forensic evidence survives the channel's cleanup, even
+  without `--keep`).
+- **Truthful tool counts (`RunResult.toolCounts`).** Per-tool call counts from the actual tool_use stream
+  (top-level only). On the cowork path `usage.server_tool_use.web_search_requests` is 0 (it counts the
+  Anthropic *server* tool; WebSearch is a host-routed *client* tool) — `toolCounts.WebSearch` is the real
+  count to assert on.
+- **Run-once-then-script.** Every question the agent asks that wasn't pre-scripted (auto-answered by
+  `first`, or answered interactively) is echoed on the footer as a copy-pasteable `--answer "<q>=<choice>"`
+  line — turning an exploratory run into a deterministic one. An **idle heartbeat** (`… still running
+  (Xs · N tools)` on stderr after ~30s of silence) keeps long 5–20 min runs legible; disable with
+  `COWORK_HARNESS_NO_HEARTBEAT`, tune with `COWORK_HARNESS_HEARTBEAT_MS`.
+- **File provision.** `--upload <file>` attaches a file at `mnt/uploads/<name>` (the "attach a file"
+  path skills like deck-review require) and `--project <dir>` connects a folder at `mnt/.projects/<id>`,
+  ad-hoc on the `skill` command (parity with the scenario `session.uploads`/`folders`).
+- **Resume-after-failure hardening.** Ephemeral Docker resources (egress networks/proxy + the host-loop
+  container) are named by a unique per-invocation token (not the session id), and the agent container is
+  reaped on teardown — so a `--resume` after a failed/interrupted run no longer collides with a leftover
+  orphaned container/network.
+- **Session persistence & resume.** `--session-id <id>` pins a stable run dir + the agent's native
+  session UUID (persisted in a `session.json` manifest); `--resume` reuses that work dir — preserving
+  `mnt/.claude/projects/<uuid>.jsonl`, `gate_state.json`, and `mnt/outputs` — and passes the agent's
+  own `--resume` so it reloads the conversation. This is how checkpoint-and-resume skills (a gate that
+  writes state, ends, and is re-invoked later with the prior RUN_ID) are tested. The harness leans on
+  the agent's native resume rather than reimplementing it (binary-verified).
+- **Interactive `chat`** — multi-turn REPL keeping the full harness (egress sandbox + control protocol);
+  `chat --raw` drops to the agent's native interactive cowork mode via `docker run -it`.
+- **Cassettes + full-fidelity replay.** `record` saves a control-protocol cassette; `replay --cassette`
+  plays it back deterministically (no token, no Docker) and re-evaluates content assertions.
+  - *The cassette captures both protocol directions:* `events` (child→driver, the assistant turn stream)
+    AND `controlOut` (driver→child decision responses). Both are recorded; `replay` now **consumes** both.
+  - *Full-fidelity replay (C1 false-green fix).* Consuming `controlOut` re-runs the decision pipeline on
+    replay, populating `rec.questions`/`rec.gateAnswers`/`rec.gateDeliveries`. Previously,
+    `question_asked` silently false-failed (questions invisible), `questions_count_max` passed vacuously
+    (0 ≤ max), and `gate_answers_delivered: true` passed vacuously (no deliveries recorded) — a silent
+    false-green violating the project's core principle. All three now genuinely evaluate when `controlOut`
+    is present.
+  - *The O7 guard on the token-free lane (`replay_protocol_fidelity`).* `replay` re-serializes each
+    decision response via `serializeDecision` and compares to the frozen `controlOut` envelope. A mismatch
+    (e.g. `serializeDecision` dropping `questions` from the AskUserQuestion `updatedInput`) appends a
+    `{ assertion: { replay_protocol_fidelity: true }, pass: false, message }` entry and exits 1 — catching
+    the O7 answer-shape regression without a live model.
+  - *Backward compatibility.* Old cassettes without `controlOut` get a loud `::warning::` on stderr;
+    `question_asked`, `questions_count_max`, and `gate_answers_delivered` are excluded from evaluation
+    (not vacuously passed). Re-record to enable full-fidelity mode.
+  - *Committed synthetic fixture + CI replay gate.* `examples/replays/example-pdf-skill.cassette.json`
+    is a hand-authored fixture (permission gate + AskUserQuestion gate + `tool_result`) committed to the
+    repo and replayed in the token-free CI job — dogfooding the documented PR-gate pattern and pinning
+    the fixture against `parseMessage`/assertion/`Run` regressions on every push.
+- **pytest `cowork` lane** (`python/`) — `@pytest.mark.cowork` + a `cowork` fixture over the
+  `--output-format json` surface, selectable with `-m cowork` beside your fast tests.
+- **Faithful sub-agent aggregation.** Recognizes the real cowork dispatch tool — **`Agent`**
+  (`{description, subagent_type, prompt}`; binary-verified primary name, with `Task` as its legacy
+  alias) and any tool carrying `subagent_type` — so `subagents[]` and the `subagent_dispatched` /
+  `dispatch_count_max` / `subagent_tool_*` assertions fire under `--fidelity cowork`. The cowork
+  `TaskCreate`/`TaskUpdate` **todo list** and `Monitor` are correctly excluded (no over-counting). Each
+  dispatch also captures its **`description`** — so a dispatch the skill made with no `subagent_type`
+  (`agentType:"unknown"`) is still self-explaining in `trace` and assertable: `subagent_dispatched`
+  matches the agentType **OR** the description.
+- **`trace <run-id | dir | events.jsonl> [--tools]`** — digests a run's `events.jsonl` into tool calls,
+  sub-agent dispatches (deduped), decisions, and questions (reuses the live `parseMessage` so it tracks
+  the schema); `--output-format json` for structured rows. Plus `result.json`/the json envelope now expose
+  `workDir`/`outputsDir`, and `--keep` prints the deep `mnt/outputs` deliverable path.
+- **Per-run artifacts** under `runs/<scenario>/<id>/`: `events.jsonl` + `control-out.jsonl` (the cassette
+  source), `run.jsonl` (harness-observability log), `trace.json` (structured trace), `egress.log`,
+  `result.json`, `agent.stderr.log`. Injected secrets (OAuth token / API key) are scrubbed from every
+  persisted log by value.
+- **Auth via env or `.env`.** `CLAUDE_CODE_OAUTH_TOKEN` (preferred) or `ANTHROPIC_API_KEY`, resolved in
+  priority order: exported env > `--dotenv <path>` > `./.env` (cwd) > `<install>/.env` (package root),
+  so a run from any directory still finds the install's credentials. Host-side, gitignored, exported
+  vars win, never mounted into the sandbox; passed via argv/env, never written to a runtime path. (The
+  flag is `--dotenv`, not `--env-file` — Node reserves the latter.)
+- **Platform baselines** (`cowork-harness sync`) derive the release-specific facts (agent version, domain
+  allowlist, mount layout, the production GrowthBook gate states) from your installed Claude Desktop, so
+  the code rides the stable protocol while data tracks each release. `--diff` previews changes; an
+  `asarFingerprint` tripwire flags unrecognized deltas.
+- **Sandbox self-verification** — `cowork-harness boundary-check` proves the sandbox enforces Cowork's
+  limitations; `cowork-harness vm <init|status|delete|prune>` manages the L2 microVM (`prune` drops orphaned VMs).
+- **CI** (`.github/workflows/ci.yml`): typecheck · format · unit + golden snapshots · build · boundary
+  parity (Docker) · live-contract guards · scenario suite (gated on a key) · the pytest lane.
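
The input-policy entry in the list above fixes how an unscripted question is handled: `fail`, `prompt`, or `first`, with `run` defaulting to `fail` and `skill` adapting to whether a TTY is attached. A hedged Python sketch of that selection, and of the documented exit codes, is shown here. `effective_policy` and `exit_code` are hypothetical helpers written for illustration, not the harness's own code.

```python
from typing import Optional

# The three policies named in the changelog entry.
POLICIES = ("fail", "prompt", "first")


def effective_policy(command: str, explicit: Optional[str], is_tty: bool) -> str:
    """Return the on_unanswered policy that applies to this invocation."""
    if explicit is not None:
        if explicit not in POLICIES:
            # Anything outside the documented set is rejected, e.g. "agent".
            raise ValueError(f"unknown on_unanswered policy: {explicit!r}")
        return explicit
    if command == "run":
        return "fail"  # the documented default for `run`
    # `skill` left unset: prompt on a TTY, fail when piped or in CI.
    return "prompt" if is_tty else "fail"


def exit_code(passed: bool, usage_error: bool = False,
              unanswered_under_fail: bool = False) -> int:
    """Map an outcome onto the documented 0 / 1 / 2 exit codes."""
    if usage_error or unanswered_under_fail:
        return 2
    return 0 if passed else 1
```

The point of the split is that a piped or CI invocation never blocks on a prompt and never silently picks an option: it fails with exit code 2 instead.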
+### Notes
+- **Two concepts, two names (no "profile" overload).** The synced per-release snapshot is the **platform
+  baseline** (`baselines/desktop-*.json`, scenario field `baseline:`, type `PlatformBaseline`); the
+  hand-authored pre-prompt config is the **session** (`sessions/*.yaml`, scenario field `session:`, type
+  `SessionConfig`). For back-compat, a scenario's deprecated `profile:` key is still accepted for one minor
+  (mapped to `baseline:` with a stderr deprecation warning) and `Profile` is re-exported as an alias of
+  `PlatformBaseline`; both are removed next minor. The `--output-format json` `RunResult` field is `baseline`
+  (was `profile`).
+- The agent binary is **bind-mounted from your own Claude Desktop install** (`claude-code-vm/<ver>/claude`)
+  or `COWORK_AGENT_BINARY` — nothing Anthropic-owned is bundled or distributed. There is no npm path.
+- Cowork mode is enabled by `CLAUDE_CODE_IS_COWORK=1` (env), **not** a `--cowork` flag.
+- This is a fixture for testing, **not a security boundary** — see [SECURITY.md](./SECURITY.md) and the
+  fidelity caveats in [DESIGN.md](./DESIGN.md).
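
The naming note above says a scenario's deprecated `profile:` key is still accepted for one minor and mapped to `baseline:` with a stderr deprecation warning. A minimal Python sketch of such a mapping follows, assuming the scenario has already been parsed into a dict. `normalize_scenario` is a hypothetical name, the warning text is invented, and letting `baseline` win when both keys are present is an assumption the changelog does not state.

```python
import sys


def normalize_scenario(scenario: dict) -> dict:
    """Return a copy of the scenario with the legacy 'profile' key mapped to 'baseline'."""
    result = dict(scenario)  # leave the caller's dict untouched
    if "profile" in result:
        legacy = result.pop("profile")
        print(
            "warning: scenario key 'profile' is deprecated; use 'baseline'",
            file=sys.stderr,
        )
        # Assumption: an explicit 'baseline' takes precedence over 'profile'.
        result.setdefault("baseline", legacy)
    return result
```

Keeping the warning on stderr matches the changelog's rule that stdout is reserved for the `--output-format json` envelope.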

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Yaniv Golan
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.