cowork-harness 0.1.1 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. package/CHANGELOG.md +54 -0
  2. package/README.md +5 -3
  3. package/dist/agent/session.js +47 -14
  4. package/dist/assert.js +143 -45
  5. package/dist/boundary.js +65 -59
  6. package/dist/cli.js +114 -12
  7. package/dist/decide/decider.js +92 -39
  8. package/dist/decide/external-channel.js +2 -1
  9. package/dist/egress/proxy.js +26 -1
  10. package/dist/egress/sidecar.js +21 -4
  11. package/dist/hostloop/workspace-handler.js +5 -1
  12. package/dist/io.js +11 -0
  13. package/dist/prompt.js +2 -1
  14. package/dist/regex.js +14 -0
  15. package/dist/run/cassette.js +200 -26
  16. package/dist/run/chat.js +2 -1
  17. package/dist/run/envelope.js +24 -2
  18. package/dist/run/execute.js +92 -25
  19. package/dist/run/renderer.js +7 -4
  20. package/dist/run/run.js +26 -27
  21. package/dist/run/scaffold.js +54 -0
  22. package/dist/run/trace-view.js +54 -6
  23. package/dist/run/verdict.js +62 -0
  24. package/dist/runtime/argv.js +29 -10
  25. package/dist/runtime/container.js +2 -1
  26. package/dist/runtime/hostloop.js +4 -2
  27. package/dist/runtime/lima.js +13 -4
  28. package/dist/runtime/microvm.js +12 -42
  29. package/dist/runtime/protocol.js +2 -1
  30. package/dist/runtime/stage.js +17 -0
  31. package/dist/session.js +83 -41
  32. package/dist/staging/resolve.js +42 -0
  33. package/dist/types.js +87 -27
  34. package/docs/README.md +29 -0
  35. package/docs/assets/banner.png +0 -0
  36. package/docs/boundary.md +81 -0
  37. package/docs/cassette.md +226 -0
  38. package/docs/cowork-spawn-contract-1.12603.1.md +78 -0
  39. package/docs/decider-dir.md +134 -0
  40. package/docs/discovery.md +43 -0
  41. package/docs/maintenance.md +151 -0
  42. package/docs/scenario.md +392 -0
  43. package/docs/session.md +106 -0
  44. package/package.json +5 -1
  45. package/python/README.md +146 -0
  46. package/python/conftest.py +18 -0
  47. package/python/cowork_harness.py +387 -0
  48. package/python/pyproject.toml +28 -0
  49. package/python/test_cowork_lane.py +206 -0
  50. package/python/test_csv_fx_lane.py +48 -0
  51. package/python/test_csv_metrics_lane.py +49 -0
  52. package/schema/scenario.schema.json +263 -0
  53. package/schema/session.schema.json +240 -0
  54. package/scripts/gen-schema.ts +60 -0
package/CHANGELOG.md CHANGED
@@ -4,6 +4,60 @@ All notable changes to this project are documented here. The format is based on
4
4
  [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The project uses
5
5
  [Semantic Versioning](https://semver.org/); pre-1.0 minor versions may include breaking changes.
6
6
 
7
+ ## [Unreleased]
8
+
9
+ ## [0.2.0] — 2026-06-17
10
+
11
+ Binary-verified the AskUserQuestion answer wire shape (agent ELF 2.1.170), implemented the
12
+ harness-improvements plan, and resolved a 39-finding code-review pass behind two centralizing seams.
13
+
14
+ ### Added
15
+
16
+ - **AskUserQuestion answer shapes.** `multiSelect` gates (answer with a list of labels → the verified
17
+ comma-joined wire shape); free-text **"Other"** via `answer:` (distinct from the label-validated
18
+ `choose:`); `choose` tolerates the `(Recommended)` suffix + `recommended`/`first` keywords. A partial
19
+ match on a batched gate now **names the unmatched sub-questions**.
20
+ - **`artifact_json` assertion** — assert a JSON artifact's contents via a dotted path
21
+ (`equals`/`gt`/`exists`/`absent`/`is_null`); `absent`, `is_null`, and an unresolved intermediate are
22
+ distinct (the last fails loud, never a vacuous pass).
23
+ - **Artifact manifest in cassettes** — `record` snapshots `outputs/`/`.projects/` (paths + hashes + small
24
+ JSON bodies) so `file_exists`/`user_visible_artifact`/`artifact_json` run on token-free `replay`. A
25
+ cassette→skill/baseline **staleness fingerprint** warns on drift; `replay --strict` fails on it. Cassettes
26
+ now carry a `cassetteVersion` (forward-compat guard).
27
+ - **`RunResult.artifacts`** (ENV-MANIFEST) — observed user-visible files (path + bytes); also surfaced as
28
+ `Result.artifacts` in the Python helper.
29
+ - **`allow_permissive_auto_allow` assertion + `RunResult.scan`** — a security-scan surface for the
30
+ Cowork-parity verdict (below); the assertion opts a scenario into a permissive auto-allow on purpose.
31
+ - **CLI:** `trace --dispatches` (sub-agent dispatch tree + real total), `assert --list` (schema-generated),
32
+ `scaffold --from-run <id>` (kept run → starter scenario YAML).
33
+ - **Python:** `run_scenario()` — run an authored scenario YAML and get the typed `Result`.
34
+
35
+ ### Changed
36
+
37
+ - **Single verdict source (`computeVerdict()`)** wired into all five pass/fail sites (run/skill exit, footer,
38
+ replay exit, JSON-envelope `ok`) plus the Python `assert_success`. A Cowork-parity violation — a permissive
39
+ auto-allow, a recorded `outputs/` delete, or a host-path leak — now **default-fails** the run unless the
40
+ scenario explicitly asserts about it.
41
+ - **Single fail-loud staging policy (`src/staging/resolve.ts`)** for every declared input (marketplace
42
+ manifest, enabled-plugin resolution, local skills, `mcp.config`, uploads, folders), with a Docker-safe
43
+ marketplace charset.
44
+ - The run root honors `COWORK_HARNESS_RUNS_DIR`.
45
+
46
+ ### Fixed
47
+
48
+ - **Egress / runtime hardening:** per-hop redirect egress logging, allowlist validation, a per-run proxy
49
+ port, proxy/sidecar readiness handshakes, fail-loud Lima provisioning, and boundary teardown in
50
+ `try/finally`.
51
+ - **Protocol / decider hardening:** oversized control-frame hard-fail, a nonzero child-exit error event,
52
+ provenance untruncation, TTY-elicit cancel, and a JSON-safe `reply_with` key.
53
+ - **Detection / packaging:** `%2F`/backslash decode in the outputs-delete detector; the npm package now
54
+ ships `schema/`, `docs/`, `python/`, and `scripts/`; assertion path containment; resume empty-tree warning.
55
+
56
+ ### Notes
57
+
58
+ - Held/deferred per the plan's gating: composed partial-gate answering, `decider_intent:` in scenario YAML,
59
+ a whole-gate `response:` freeform, and `artifacts_share_field`. All additive/opt-in when built.
60
+
7
61
  ## [0.1.1] — 2026-06-16
8
62
 
9
63
  Docs, distribution, and packaging. No CLI behavior change.
package/README.md CHANGED
@@ -74,8 +74,10 @@ Skill testing is the headline use, but the tool is a general harness over the Co
74
74
  | `skill <folder> "<prompt>"` | Run a local skill/plugin folder once against the staged agent | ad-hoc "is the skill alive / does it do X?" — the fast inner loop |
75
75
  | `run <scenario.yaml \| dir/>` | Run authored scenarios with `assert:` + a CI-ready exit code | you want a repeatable, **asserted regression test** |
76
76
  | `chat <folder>` | Interactive multi-turn REPL against a skill (TTY) | debugging a multi-turn flow by hand |
77
- | `record` / `replay` | Save a control-protocol cassette, then replay it deterministically | **token-free, Docker-free CI** from a once-recorded run |
78
- | `trace <run-id>` | Digest a run's `events.jsonl` (tools, sub-agent dispatches, decisions) | "how many sub-agents *actually* dispatched, and which?" |
77
+ | `record` / `replay` | Save a control-protocol cassette, then replay it deterministically (`replay --strict` fails on a stale cassette) | **token-free, Docker-free CI** from a once-recorded run |
78
+ | `trace <run-id>` | Digest a run's `events.jsonl` (`--tools`, `--gates`, `--dispatches` for the sub-agent dispatch tree + total) | "how many sub-agents *actually* dispatched, and which?" |
79
+ | `scaffold --from-run <id>` | Turn a kept run into a starter scenario YAML (gates→answers, artifacts→`file_exists`) | authoring a scenario from a real run instead of guessing |
80
+ | `assert --list` | List the available scenario assertions (generated from the schema) | "what can I assert?" without grepping the source |
79
81
  | `decide` | Validate a decider against a sample question in ~2 s (no run) | sanity-check a `--decider-*` / `--answer` wiring before a long run |
80
82
  | `gates` / `answer` | Stream / answer in-band gates for `--decider-dir` | a **driving agent** answers live questions via a Monitor |
81
83
  | `boundary-check [baseline]` | Prove the sandbox enforces Cowork's limitations | verifying the harness's own fidelity |
@@ -171,7 +173,7 @@ This repo ships a **companion skill** (`.claude/skills/cowork-harness/`) that te
171
173
  /plugin install cowork-harness@cowork-harness
172
174
  ```
173
175
 
174
- The skill **self-bootstraps the CLI**: if `cowork-harness` isn't on your PATH it falls back to `npx cowork-harness@latest` (Node ≥ 20). Tiers above `protocol` still need Docker/Lima and a Claude Desktop agent binary — see the prerequisites below.
176
+ The skill **self-bootstraps the CLI**: if `cowork-harness` isn't on your PATH it falls back to `npx cowork-harness@>=0.2.0` (a version floor that fails loud rather than silently fetching a too-old CLI; Node ≥ 20). Tiers above `protocol` still need Docker/Lima and a Claude Desktop agent binary — see the prerequisites below.
175
177
 
176
178
  It also follows the open [Agent Skills](https://github.com/vercel-labs/skills) spec, so it installs cross-editor (Cursor, Codex, OpenCode, …) by pointing the `npx skills` CLI at `.claude/skills/cowork-harness` in this repo. (Working *inside* this repo, the skill auto-loads as a project skill — no install needed.)
177
179
 
package/dist/agent/session.js CHANGED
@@ -1,3 +1,4 @@
1
+ import { warn } from "../io.js";
1
2
  import { createWriteStream } from "node:fs";
2
3
  import { join } from "node:path";
3
4
  import readline from "node:readline";
@@ -159,6 +160,8 @@ export class LiveAgentSession {
159
160
  /** Reject function set when proc emits an error — bridges the callback into the async generator.
160
161
  * Set before the generator loop starts; called at most once (the Promise settles once). */
161
162
  rejectError;
163
+ /** Bounded tail of the child's stderr, for the nonzero-exit error message. */
164
+ stderrTail = "";
162
165
  constructor(proc, outDir) {
163
166
  this.proc = proc;
164
167
  this.outDir = outDir;
@@ -166,6 +169,11 @@ export class LiveAgentSession {
166
169
  this.controlOut = createWriteStream(join(outDir, "control-out.jsonl"), { flags: "a" });
167
170
  const errLog = createWriteStream(join(outDir, "agent.stderr.log"), { flags: "a" });
168
171
  this.proc.stderr.pipe(errLog);
172
+ // keep a bounded stderr tail and capture the exit code/signal so a child that dies nonzero
173
+ // (with no structured {type:"result"} error) is surfaced as a typed error event, not a silent stop.
174
+ this.proc.stderr.on("data", (d) => {
175
+ this.stderrTail = (this.stderrTail + d.toString()).slice(-2000);
176
+ });
169
177
  // #15: attach stdin error listener once at construction so dead-child writes don't produce
170
178
  // unhandled process errors. Routes to the same error path as spawn errors when possible.
171
179
  this.proc.stdin.on("error", (e) => {
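The `stderrTail` handler in the hunk above keeps only the last 2000 characters of the child's stderr. Below is a minimal sketch of that bounded-tail idea as a pure function, so the behavior is easy to check in isolation. `appendTail` and its `max` parameter are hypothetical names used for illustration; they are not part of the package.

```javascript
// Hypothetical helper (not in the package): keep at most `max` trailing characters.
function appendTail(tail, chunk, max = 2000) {
  // slice(-max) keeps the last `max` UTF-16 code units and drops everything older.
  return (tail + String(chunk)).slice(-max);
}

let tail = "";
for (const chunk of ["a".repeat(1500), "b".repeat(1500), "fatal: out of memory\n"]) {
  tail = appendTail(tail, chunk);
}

console.log(tail.length);                             // bounded by `max`
console.log(tail.endsWith("fatal: out of memory\n")); // the newest output always survives
```

Because the cap is applied on every chunk, memory stays bounded no matter how much the child writes, and the most recent lines (usually the ones that explain a crash) are the ones retained.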
@@ -241,6 +249,20 @@ export class LiveAgentSession {
241
249
  }
242
250
  yield* this.translate(msg);
243
251
  }
252
+ // stdout closed. Give a pending 'exit' one tick to land (NOT a blocking wait on 'close' — a
253
+ // mock/fake child may never emit it), then surface a nonzero/signal exit as a typed error — a
254
+ // crashed child that emitted no {type:"result"} error line would otherwise be a silent stop.
255
+ await new Promise((res) => setImmediate(res));
256
+ const code = this.proc.exitCode;
257
+ const signal = this.proc.signalCode;
258
+ if (signal || (code !== null && code !== 0)) {
259
+ const tail = this.stderrTail.trim();
260
+ yield {
261
+ type: "error",
262
+ source: "exit",
263
+ message: `agent process exited ${signal ? `on signal ${signal}` : `with code ${code}`}${tail ? ` — stderr tail: ${tail}` : ""}`,
264
+ };
265
+ }
244
266
  }
245
267
  finally {
246
268
  this.rejectError = undefined; // generator is done; stop routing errors here
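The hunk above turns a nonzero exit code or a terminating signal into a typed `error` event once stdout has closed. The sketch below isolates just that decision as a pure function. `exitErrorEvent` is a hypothetical name; the event fields follow the diff, while the message separator is simplified for the example.

```javascript
// Hypothetical helper (not in the package): map an exit status to an error event, or null.
function exitErrorEvent(code, signal, stderrTail = "") {
  // A clean exit (code 0) or an exit that has not been observed yet (code null, no signal)
  // is not an error.
  if (!signal && (code === null || code === 0)) return null;
  const tail = stderrTail.trim();
  const how = signal ? `on signal ${signal}` : `with code ${code}`;
  return {
    type: "error",
    source: "exit",
    message: `agent process exited ${how}${tail ? `, stderr tail: ${tail}` : ""}`,
  };
}

console.log(exitErrorEvent(0, null));                       // null
console.log(exitErrorEvent(null, null));                    // null
console.log(exitErrorEvent(137, null, "Killed\n").message); // includes the code and the tail
console.log(exitErrorEvent(null, "SIGTERM").message);       // names the signal
```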
@@ -274,7 +296,7 @@ export class LiveAgentSession {
274
296
  // reply — an unanswered mcp_message blocks the in-VM agent on the round-trip forever (deadlock).
275
297
  // Reply with a JSON-RPC error instead, mirroring the no-handler defense below.
276
298
  const message = e?.message ?? String(e);
277
- process.stderr.write(`::warning:: sdkMcp.handle threw for "${server}" — replying with a JSON-RPC error: ${message}\n`);
299
+ warn(`::warning:: sdkMcp.handle threw for "${server}" — replying with a JSON-RPC error: ${message}\n`);
278
300
  out = { error: { code: -32603, message: `handler error: ${message}` } };
279
301
  }
280
302
  this.write(mcpResponseEnvelope(msg.request_id, out, jr.id));
@@ -294,7 +316,7 @@ export class LiveAgentSession {
294
316
  // #10: an mcp_message arrived but no sdkMcp handler is configured. Reply with a JSON-RPC error
295
317
  // (well-formed via mcpResponseEnvelope) instead of silently dropping it — a dropped request
296
318
  // leaves the in-VM agent waiting on the round-trip forever (protocol deadlock in host-loop mode).
297
- process.stderr.write(`::warning:: mcp_message for server "${server}" arrived but no sdkMcp handler is configured — replying with a JSON-RPC error (would otherwise deadlock)\n`);
319
+ warn(`::warning:: mcp_message for server "${server}" arrived but no sdkMcp handler is configured — replying with a JSON-RPC error (would otherwise deadlock)\n`);
298
320
  this.write(mcpResponseEnvelope(msg.request_id, { error: { code: -32601, message: "no sdkMcp handler configured" } }, jr.id));
299
321
  return;
300
322
  }
@@ -312,7 +334,7 @@ export class LiveAgentSession {
312
334
  if (!req) {
313
335
  // #13: an id with no matching request_id is a protocol drift. Writing a guessed envelope would
314
336
  // be worse, but a silent return leaves the agent blocked until timeout (looks like a hang).
315
- process.stderr.write(`::warning:: respond() for unknown decision id "${decisionId}" — no matching request_id was seen; the agent may block until timeout (protocol drift)\n`);
337
+ warn(`::warning:: respond() for unknown decision id "${decisionId}" — no matching request_id was seen; the agent may block until timeout (protocol drift)\n`);
316
338
  return;
317
339
  }
318
340
  // #14: serializeDecision returns a safe deny envelope on a kind mismatch (defense in depth). That
@@ -320,7 +342,7 @@ export class LiveAgentSession {
320
342
  // while the agent actually received a deny. (serializeDecision stays a pure declared inverse of
321
343
  // deserializeDecision; the warning lives here in the caller, not in the pure function.)
322
344
  if (req.kind !== r.kind)
323
- process.stderr.write(`::warning:: decider returned kind "${r.kind}" for a "${req.kind}" request (id ${decisionId}) → sending a safe deny/cancel; the agent did NOT receive an answer\n`);
345
+ warn(`::warning:: decider returned kind "${r.kind}" for a "${req.kind}" request (id ${decisionId}) → sending a safe deny/cancel; the agent did NOT receive an answer\n`);
324
346
  this.write(serializeDecision(req, r));
325
347
  }
326
348
  close() {
@@ -333,13 +355,12 @@ export class LiveAgentSession {
333
355
  }
334
356
  write(obj) {
335
357
  const line = JSON.stringify(obj);
336
- // #54: the control protocol writes small single-line JSON frames, so stdin backpressure
337
- // effectively never engages; we intentionally ignore the write() return / drain here. If frame
338
- // sizes ever grow, revisit with a drain-aware queue (which would make write()/respond() async).
339
- // #47: guard that assumption a frame past the threshold warns loudly so the "revisit" trigger
340
- // fires instead of silently risking partial buffering on a frame far larger than expected.
358
+ // The control protocol writes small single-line JSON frames, so stdin backpressure effectively never
359
+ // engages; we ignore the write() return / drain here. A frame past the safe threshold is anomalous —
360
+ // hard-FAIL rather than risk a partially-buffered write that silently corrupts the protocol stream.
361
+ // (If large control frames ever become legitimate, switch to a drain-aware queue, making writes async.)
341
362
  if (line.length > 256 * 1024)
342
- process.stderr.write(`::warning:: control frame is ${line.length} bytes (> 256 KiB) — stdin backpressure may engage; revisit write() with a drain-aware queue\n`);
363
+ throw new Error(`control frame is ${line.length} bytes (> 256 KiB safe limit) — refusing to write to avoid partial stdin buffering; this indicates an unexpectedly large control payload`);
343
364
  this.controlOut.write(line + "\n");
344
365
  this.proc.stdin.write(line + "\n");
345
366
  }
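The `write()` change above makes an oversized control frame a hard failure instead of a warning. Below is a minimal sketch of that guard as a standalone encoder. `encodeFrame` and `MAX_FRAME_BYTES` are hypothetical names, and the payloads are invented. One deliberate difference, stated as an assumption: the diff compares `line.length` (UTF-16 code units) against the limit, while this sketch measures UTF-8 bytes with `Buffer.byteLength`, on the assumption that the limit is meant to bound what goes over the pipe.

```javascript
// Hypothetical names; the 256 KiB threshold is taken from the diff.
const MAX_FRAME_BYTES = 256 * 1024;

function encodeFrame(obj) {
  const line = JSON.stringify(obj);
  // Assumption: bound the UTF-8 size on the wire rather than the JS string length.
  const bytes = Buffer.byteLength(line, "utf8");
  if (bytes > MAX_FRAME_BYTES) {
    throw new Error(`control frame is ${bytes} bytes (> ${MAX_FRAME_BYTES}); refusing to write`);
  }
  return line + "\n"; // one JSON object per line
}

console.log(JSON.stringify(encodeFrame({ type: "ping" }))); // small frame passes through
try {
  encodeFrame({ blob: "x".repeat(300 * 1024) }); // invented oversized payload
} catch (e) {
  console.log(e.message);
}
```

Failing before the write means the protocol stream never sees a partially buffered frame; the caller gets an exception it can report instead.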
@@ -393,6 +414,7 @@ export function parseMessage(msg) {
393
414
  ev.push({
394
415
  type: "subagent_dispatch",
395
416
  toolUseId: String(block.id ?? ""),
417
+ parentToolUseId,
396
418
  // Skills often dispatch with only {description, prompt} (no subagent_type) → agentType is
397
419
  // "unknown" but the description still identifies the dispatch (e.g. "TOP_DOWN market sizing").
398
420
  agentType: String(inp.subagent_type ?? inp.subagentType ?? "unknown"),
@@ -415,6 +437,7 @@ export function parseMessage(msg) {
415
437
  toolUseId: block.tool_use_id ? String(block.tool_use_id) : undefined,
416
438
  isError: !!block.is_error,
417
439
  text: toolResultText(block.content),
440
+ provenanceText: toolResultRaw(block.content),
418
441
  });
419
442
  }
420
443
  break;
@@ -424,17 +447,27 @@ export function parseMessage(msg) {
424
447
  }
425
448
  return ev;
426
449
  }
427
- /** Flatten a tool_result `content` (a string, or an array of `{type:"text",text}` blocks) to a short string. */
428
- function toolResultText(content) {
450
+ /** Flatten a tool_result `content` (a string, or an array of `{type:"text",text}` blocks), capped at
451
+ * `max` chars. The 500-char DISPLAY value (toolResultText) keeps the recorder/trace compact; the larger
452
+ * PROVENANCE value (toolResultRaw) is what seeds web_fetch provenance, so a URL past char 500 isn't lost. */
453
+ function flattenToolResult(content, max) {
429
454
  if (typeof content === "string")
430
- return content.slice(0, 500);
455
+ return content.slice(0, max);
431
456
  if (Array.isArray(content))
432
457
  return content
433
458
  .map((b) => (b && typeof b === "object" && "text" in b ? String(b.text) : ""))
434
459
  .join(" ")
435
- .slice(0, 500);
460
+ .slice(0, max);
436
461
  return "";
437
462
  }
463
+ function toolResultText(content) {
464
+ return flattenToolResult(content, 500);
465
+ }
466
+ /** Larger cap for provenance (URL extraction) — matches the web_fetch body cap so any URL the agent
467
+ * could realistically act on is seeded; still bounded so a pathological result can't blow up memory. */
468
+ function toolResultRaw(content) {
469
+ return flattenToolResult(content, 200_000);
470
+ }
438
471
  export function toDecisionRequest(msg) {
439
472
  const sub = msg.request?.subtype;
440
473
  const id = msg.request_id;
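The refactor above splits one flattening routine into a 500-character display value and a 200,000-character provenance value. The sketch below reuses the `flattenToolResult` logic from the diff on an invented tool result to show why the second cap matters: a URL that sits past character 500 is cut from the display text but is still present in the provenance text. The sample content and URL are made up for illustration.

```javascript
// Same logic as the diff's flattenToolResult: accept a string or an array of text blocks.
function flattenToolResult(content, max) {
  if (typeof content === "string") return content.slice(0, max);
  if (Array.isArray(content)) {
    return content
      .map((b) => (b && typeof b === "object" && "text" in b ? String(b.text) : ""))
      .join(" ")
      .slice(0, max);
  }
  return "";
}

// Invented sample: 600 characters of filler followed by a link.
const content = [
  { type: "text", text: "x".repeat(600) },
  { type: "text", text: "see https://example.com/report" },
];

const display = flattenToolResult(content, 500);        // compact value for traces
const provenance = flattenToolResult(content, 200_000); // larger value for URL extraction

console.log(display.includes("https://example.com/report"));    // false
console.log(provenance.includes("https://example.com/report")); // true
```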
package/dist/assert.js CHANGED
@@ -1,5 +1,41 @@
1
- import { existsSync } from "node:fs";
2
- import { join } from "node:path";
1
+ import { existsSync, readFileSync } from "node:fs";
2
+ import { resolve, relative, isAbsolute, sep } from "node:path";
3
+ import { compileUserRegex } from "./regex.js";
4
+ /** Resolve a user-authored assertion path under `workRoot`, rejecting absolute paths and any `..` that
5
+ * escapes the root. Returns the absolute path, or null if it would leave `workRoot`. Assertion paths are
6
+ * author-controlled, not attacker input, but a `file_exists: "../../etc/passwd"` silently probing the host
7
+ * FS (or an `outputs/../../x` slipping past the user-visible prefix check) is a containment bug regardless. */
8
+ function containedPath(workRoot, p) {
9
+ if (isAbsolute(p))
10
+ return null;
11
+ const root = resolve(workRoot);
12
+ const abs = resolve(root, p);
13
+ const rel = relative(root, abs);
14
+ if (rel === ".." || rel.startsWith(".." + sep) || isAbsolute(rel))
15
+ return null;
16
+ return abs;
17
+ }
18
+ export function resolveDotPath(doc, path) {
19
+ if (!path)
20
+ return { state: "value", value: doc };
21
+ const segs = path.split(".");
22
+ let cur = doc;
23
+ for (let i = 0; i < segs.length; i++) {
24
+ const seg = segs[i];
25
+ const last = i === segs.length - 1;
26
+ if (cur === null || typeof cur !== "object")
27
+ return { state: "unresolved", at: segs.slice(0, i).join(".") || "(root)" };
28
+ const obj = cur;
29
+ const has = Object.prototype.hasOwnProperty.call(obj, seg);
30
+ if (last)
31
+ return has ? { state: "value", value: obj[seg] } : { state: "absent" };
32
+ if (!has)
33
+ return { state: "unresolved", at: segs.slice(0, i + 1).join(".") };
34
+ cur = obj[seg];
35
+ }
36
+ return { state: "value", value: cur };
37
+ }
38
+ const jsonEq = (a, b) => JSON.stringify(a) === JSON.stringify(b);
3
39
  /**
4
40
  * Boundary-aware host matching: `host` must equal `needle` exactly or be a proper subdomain of it.
5
41
  * `evilanthropic.com` does NOT match `anthropic.com`; `x.anthropic.com` does.
@@ -30,36 +66,41 @@ function check(a, ctx) {
30
66
  // `evaluate()` is a bare `.map(check)` with no error boundary, so a malformed pattern must be a
31
67
  // clean assertion failure, not an uncaught throw. Case-insensitive ("i").
32
68
  if (a.transcript_matches !== undefined) {
33
- let re;
34
- try {
35
- re = new RegExp(a.transcript_matches, "i");
36
- }
37
- catch (e) {
38
- results.push(fail(`transcript_matches: bad regex "${a.transcript_matches}": ${String(e.message)}`));
39
- }
40
- if (re)
41
- results.push(re.test(ctx.transcript) ? ok() : fail(`transcript did not match /${a.transcript_matches}/i`));
69
+ const c = compileUserRegex(a.transcript_matches);
70
+ if ("error" in c)
71
+ results.push(fail(`transcript_matches: bad regex "${a.transcript_matches}": ${c.error}`));
72
+ else
73
+ results.push(c.re.test(ctx.transcript) ? ok() : fail(`transcript did not match /${a.transcript_matches}/i`));
42
74
  }
43
75
  if (a.transcript_not_matches !== undefined) {
44
- let re;
45
- try {
46
- re = new RegExp(a.transcript_not_matches, "i");
47
- }
48
- catch (e) {
49
- results.push(fail(`transcript_not_matches: bad regex "${a.transcript_not_matches}": ${String(e.message)}`));
50
- }
51
- if (re)
52
- results.push(!re.test(ctx.transcript) ? ok() : fail(`transcript unexpectedly matched /${a.transcript_not_matches}/i`));
76
+ const c = compileUserRegex(a.transcript_not_matches);
77
+ if ("error" in c)
78
+ results.push(fail(`transcript_not_matches: bad regex "${a.transcript_not_matches}": ${c.error}`));
79
+ else
80
+ results.push(!c.re.test(ctx.transcript) ? ok() : fail(`transcript unexpectedly matched /${a.transcript_not_matches}/i`));
81
+ }
82
+ if (a.file_exists !== undefined) {
83
+ const abs = containedPath(ctx.workRoot, a.file_exists);
84
+ if (!abs)
85
+ results.push(fail(`unsafe file_exists path "${a.file_exists}" — must stay under the work root (no absolute paths or "..")`));
86
+ else
87
+ results.push(existsSync(abs) ? ok() : fail(`file not found: ${a.file_exists} (under ${ctx.workRoot})`));
53
88
  }
54
- if (a.file_exists !== undefined)
55
- results.push(existsSync(join(ctx.workRoot, a.file_exists)) ? ok() : fail(`file not found: ${a.file_exists} (under ${ctx.workRoot})`));
56
89
  if (a.user_visible_artifact !== undefined) {
57
90
  const p = a.user_visible_artifact;
58
- const visible = ctx.userVisiblePrefixes.some((pre) => p === pre || p.startsWith(pre + "/"));
59
- if (!visible)
60
- results.push(fail(`"${p}" is not under a user-visible prefix (${ctx.userVisiblePrefixes.join(", ")}) invisible to the user in Cowork`));
61
- else
62
- results.push(existsSync(join(ctx.workRoot, p)) ? ok() : fail(`user-visible artifact not found: ${p}`));
91
+ const abs = containedPath(ctx.workRoot, p);
92
+ if (!abs) {
93
+ // normalize/contain BEFORE the prefix test so `outputs/../../x` can't pass startsWith("outputs/")
94
+ results.push(fail(`unsafe user_visible_artifact path "${p}" — must stay under the work root (no absolute paths or "..")`));
95
+ }
96
+ else {
97
+ const rel = relative(resolve(ctx.workRoot), abs); // normalized, guaranteed under workRoot
98
+ const visible = ctx.userVisiblePrefixes.some((pre) => rel === pre || rel.startsWith(pre + "/"));
99
+ if (!visible)
100
+ results.push(fail(`"${p}" is not under a user-visible prefix (${ctx.userVisiblePrefixes.join(", ")}) — invisible to the user in Cowork`));
101
+ else
102
+ results.push(existsSync(abs) ? ok() : fail(`user-visible artifact not found: ${p}`));
103
+ }
63
104
  }
64
105
  if (a.tool_called !== undefined)
65
106
  results.push(ctx.toolsCalled.has(a.tool_called) ? ok() : fail(`tool not called: ${a.tool_called}`));
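The `file_exists` and `user_visible_artifact` changes above normalize and contain the path before any prefix test, so `outputs/../../x` can no longer pass a `startsWith("outputs/")` check. The diff's `containedPath` does this with `node:path`. The sketch below is a dependency-free approximation for `/`-separated paths only, written to make the rule easy to try out. `containedRelative` and `isUserVisible` are hypothetical names, the `["outputs"]` prefix list is illustrative, and this version is slightly stricter than the diff's: it rejects any path that climbs above the root at any point, even if it would come back in.

```javascript
// Hypothetical approximation of the containment rule for "/"-separated paths.
// Returns the normalized relative path, or null if the path is absolute or escapes the root.
function containedRelative(p) {
  if (p.startsWith("/")) return null; // absolute paths are rejected outright
  const kept = [];
  for (const seg of p.split("/")) {
    if (seg === "" || seg === ".") continue; // skip empty and current-dir segments
    if (seg === "..") {
      if (kept.length === 0) return null; // would climb above the root
      kept.pop();
    } else {
      kept.push(seg);
    }
  }
  return kept.join("/");
}

// Prefix test on the NORMALIZED path, mirroring the order of checks in the diff.
function isUserVisible(p, prefixes) {
  const rel = containedRelative(p);
  if (rel === null) return false;
  return prefixes.some((pre) => rel === pre || rel.startsWith(pre + "/"));
}

console.log(containedRelative("outputs/../../etc/passwd"));       // null
console.log(isUserVisible("outputs/../secret.txt", ["outputs"])); // false
console.log(isUserVisible("outputs/report.json", ["outputs"]));   // true
```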
@@ -72,15 +113,11 @@ function check(a, ctx) {
72
113
  if (a.subagent_dispatched !== undefined) {
73
114
  // Match the agentType OR the description — skills often dispatch with only a `description`
74
115
  // (no subagent_type → agentType "unknown"), so name-matching alone would miss those (O1).
75
- let rx;
76
- try {
77
- rx = new RegExp(a.subagent_dispatched, "i");
78
- }
79
- catch (e) {
80
- results.push(fail(`subagent_dispatched: bad regex "${a.subagent_dispatched}": ${String(e.message)}`));
81
- }
82
- if (rx)
83
- results.push(ctx.subagents.some((s) => rx.test(s.agentType) || rx.test(s.description ?? ""))
116
+ const c = compileUserRegex(a.subagent_dispatched);
117
+ if ("error" in c)
118
+ results.push(fail(`subagent_dispatched: bad regex "${a.subagent_dispatched}": ${c.error}`));
119
+ else
120
+ results.push(ctx.subagents.some((s) => c.re.test(s.agentType) || c.re.test(s.description ?? ""))
84
121
  ? ok()
85
122
  : fail(`no sub-agent matching "${a.subagent_dispatched}" was dispatched (by type or description)`));
86
123
  }
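Each regex assertion above now goes through `compileUserRegex` from `./regex.js`, whose body is not shown in this excerpt. Based only on how it is called here (the result carries either `re` or `error`) and on the removed `try`/`catch` around `new RegExp(pattern, "i")`, a plausible sketch looks like the following. The function name `compileUserRegexSketch` is hypothetical and the real implementation may differ.

```javascript
// Assumed shape, inferred from the call sites:
// { re: RegExp } on success, { error: string } on failure.
function compileUserRegexSketch(pattern) {
  try {
    return { re: new RegExp(pattern, "i") }; // case-insensitive, as in the removed code
  } catch (e) {
    return { error: String(e.message) };
  }
}

const good = compileUserRegexSketch("market sizing");
const bad = compileUserRegexSketch("(unclosed");

console.log("error" in good ? good.error : good.re.test("TOP_DOWN Market Sizing")); // true
console.log("error" in bad); // true
```

Returning a tagged result instead of throwing lets every call site turn a malformed pattern into an ordinary assertion failure with one `"error" in c` check.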
@@ -112,18 +149,18 @@ function check(a, ctx) {
112
149
  : fail(`delete op(s) touched outputs (forbidden in Cowork): ${ctx.outputsDeletes.slice(0, 3).join("; ")}`));
113
150
  if (a.self_heal_ran !== undefined)
114
151
  results.push(ctx.selfHealRan === a.self_heal_ran ? ok() : fail(`self_heal_ran was ${ctx.selfHealRan}, expected ${a.self_heal_ran}`));
152
+ // Verdict modifier (consumed by computeVerdict, not here). It always "passes" as an assertion so a
153
+ // standalone `{allow_permissive_auto_allow: true}` is a valid non-empty assertion, not "empty assertion".
154
+ if (a.allow_permissive_auto_allow !== undefined)
155
+ results.push(ok());
115
156
  if (a.transcript_no_host_path !== undefined)
116
157
  results.push(!ctx.hostPathLeaked === a.transcript_no_host_path ? ok() : fail(`host path leaked into model-visible text: ${ctx.hostPathLeaked}`));
117
158
  if (a.question_asked !== undefined) {
118
- let rx;
119
- try {
120
- rx = new RegExp(a.question_asked, "i");
121
- }
122
- catch (e) {
123
- results.push(fail(`question_asked: bad regex "${a.question_asked}": ${String(e.message)}`));
124
- }
125
- if (rx)
126
- results.push(ctx.questions.some((q) => rx.test(q)) ? ok() : fail(`no question matched: ${a.question_asked}`));
159
+ const c = compileUserRegex(a.question_asked);
160
+ if ("error" in c)
161
+ results.push(fail(`question_asked: bad regex "${a.question_asked}": ${c.error}`));
162
+ else
163
+ results.push(ctx.questions.some((q) => c.re.test(q)) ? ok() : fail(`no question matched: ${a.question_asked}`));
127
164
  }
128
165
  if (a.questions_count_max !== undefined)
129
166
  results.push(ctx.questions.length <= a.questions_count_max ? ok() : fail(`asked ${ctx.questions.length} questions, max ${a.questions_count_max}`));
@@ -150,6 +187,67 @@ function check(a, ctx) {
150
187
  results.push(failedConfirmed.length > 0 ? ok() : fail(`expected a confirmed gate-delivery failure but none was observed`));
151
188
  }
152
189
  }
190
+ if (a.artifact_json !== undefined) {
191
+ const aj = a.artifact_json;
192
+ const file = containedPath(ctx.workRoot, aj.artifact);
193
+ if (!file)
194
+ results.push(fail(`unsafe artifact_json path "${aj.artifact}" — must stay under the work root (no absolute paths or "..")`));
195
+ else if (!existsSync(file))
196
+ results.push(fail(`artifact_json: file not found: ${aj.artifact} (under ${ctx.workRoot})`));
197
+ else {
198
+ let doc;
199
+ let parsed = true;
200
+ try {
201
+ doc = JSON.parse(readFileSync(file, "utf8"));
202
+ }
203
+ catch (e) {
204
+ parsed = false;
205
+ results.push(fail(`artifact_json: ${aj.artifact} is not valid JSON: ${String(e.message)}`));
206
+ }
207
+ if (parsed) {
208
+ const r = resolveDotPath(doc, aj.path);
209
+ if (r.state === "unresolved") {
210
+ // Malformed/truncated artifact for this path — fail loud, NOT a vacuous "absent" pass (the H4
211
+ // false-green at the field level).
212
+ results.push(fail(`artifact_json: path "${aj.path}" unresolvable in ${aj.artifact} — intermediate "${r.at}" is missing or not an object`));
213
+ }
214
+ else {
215
+ const present = r.state === "value";
216
+ const val = r.state === "value" ? r.value : undefined;
217
+ let any = false;
218
+ if (aj.exists !== undefined) {
219
+ any = true;
220
+ results.push(present === aj.exists ? ok() : fail(`artifact_json: "${aj.path ?? "(root)"}" exists=${present}, expected ${aj.exists}`));
221
+ }
222
+ if (aj.absent !== undefined) {
223
+ any = true;
224
+ const absent = r.state === "absent";
225
+ results.push(absent === aj.absent ? ok() : fail(`artifact_json: "${aj.path}" absent=${absent}, expected ${aj.absent}`));
226
+ }
227
+ if (aj.is_null !== undefined) {
228
+ any = true;
229
+ const isNull = present && val === null;
230
+ results.push(isNull === aj.is_null ? ok() : fail(`artifact_json: "${aj.path}" is_null=${isNull}, expected ${aj.is_null}`));
231
+ }
232
+ if (aj.equals !== undefined) {
233
+ any = true;
234
+ results.push(present && jsonEq(val, aj.equals)
235
+ ? ok()
236
+ : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected ${JSON.stringify(aj.equals)}`));
237
+ }
238
+ if (aj.gt !== undefined) {
239
+ any = true;
240
+ results.push(typeof val === "number" && val > aj.gt
241
+ ? ok()
242
+ : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected > ${aj.gt}`));
243
+ }
244
+ // No operator → an existence assertion (the value must be present).
245
+ if (!any)
246
+ results.push(present ? ok() : fail(`artifact_json: "${aj.path ?? "(root)"}" is not present (no operator given → existence check)`));
247
+ }
248
+ }
249
+ }
250
+ }
153
251
  if (a.result !== undefined)
154
252
  results.push(ctx.result === a.result ? ok() : fail(`result was ${ctx.result}, expected ${a.result}`));
155
253
  if (results.length === 0)
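The `artifact_json` branch above depends on `resolveDotPath` returning one of three states. The sketch below repeats that function from the diff so it runs on its own, then applies it to an invented document to show how `value`, `absent`, and `unresolved` differ, and why a missing intermediate should not be treated as a passing `absent` check.

```javascript
// Same logic as the diff's resolveDotPath, reproduced so the example is self-contained.
function resolveDotPath(doc, path) {
  if (!path) return { state: "value", value: doc };
  const segs = path.split(".");
  let cur = doc;
  for (let i = 0; i < segs.length; i++) {
    const seg = segs[i];
    const last = i === segs.length - 1;
    // Cannot descend into null or a primitive: the path is unresolvable from here.
    if (cur === null || typeof cur !== "object")
      return { state: "unresolved", at: segs.slice(0, i).join(".") || "(root)" };
    const has = Object.prototype.hasOwnProperty.call(cur, seg);
    if (last) return has ? { state: "value", value: cur[seg] } : { state: "absent" };
    if (!has) return { state: "unresolved", at: segs.slice(0, i + 1).join(".") };
    cur = cur[seg];
  }
  return { state: "value", value: cur };
}

// Invented artifact body.
const doc = { summary: { total: 42, note: null } };

console.log(resolveDotPath(doc, "summary.total"));   // present, with a value
console.log(resolveDotPath(doc, "summary.note"));    // present, and the value is null
console.log(resolveDotPath(doc, "summary.missing")); // leaf key is absent
console.log(resolveDotPath(doc, "totals.count"));    // intermediate "totals" is missing
```

Only the third case is a legitimate `absent`. In the fourth, the parent object itself is missing, which more likely signals a malformed or truncated artifact than a deliberately omitted field.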
package/dist/boundary.js CHANGED
@@ -23,67 +23,73 @@ export function runBoundaryChecks(baseline, session) {
23
23
  const sidecar = startEgressSidecar(boundaryAllowList(baseline, session), mkdtempSync(join(tmpdir(), "cowork-bchk-")), runId);
24
24
  const network = sidecar.network;
25
25
  const proxy = sidecar.proxyUrl;
26
- const probe = (shell, withProxy = false) => spawnSync(runtime, [
27
- "run",
28
- "--rm",
29
- "--platform",
30
- "linux/arm64",
31
- "--network",
32
- network,
33
- ...(withProxy ? ["-e", `HTTPS_PROXY=${proxy}`, "-e", `HTTP_PROXY=${proxy}`] : []),
34
- "--entrypoint",
35
- "sh",
36
- image,
37
- "-c",
38
- shell,
39
- ], { encoding: "utf8", timeout: 30_000 });
40
- // 1. Host filesystem is NOT visible (no /Users, no host home bind).
41
- {
42
- const r = probe(`ls /Users 2>&1 || true; ls /host 2>&1 || true`);
43
- const out = (r.stdout ?? "") + (r.stderr ?? "");
44
- const blocked = isHostFsSealed(out);
45
- results.push({
46
- check: "host-fs-sealed",
47
- expectation: "host paths (/Users, /host) invisible",
48
- pass: blocked,
49
- detail: out.trim().slice(0, 200),
50
- });
26
+ // Probes can throw (spawnSync setup errors, etc.); tear the sidecar down in `finally` so an unexpected
27
+ // throw never leaks the proxy container + both Docker networks.
28
+ try {
29
+ const probe = (shell, withProxy = false) => spawnSync(runtime, [
30
+ "run",
31
+ "--rm",
32
+ "--platform",
33
+ "linux/arm64",
34
+ "--network",
35
+ network,
36
+ ...(withProxy ? ["-e", `HTTPS_PROXY=${proxy}`, "-e", `HTTP_PROXY=${proxy}`] : []),
37
+ "--entrypoint",
38
+ "sh",
39
+ image,
40
+ "-c",
41
+ shell,
42
+ ], { encoding: "utf8", timeout: 30_000 });
43
+ // 1. Host filesystem is NOT visible (no /Users, no host home bind).
44
+ {
45
+ const r = probe(`ls /Users 2>&1 || true; ls /host 2>&1 || true`);
46
+ const out = (r.stdout ?? "") + (r.stderr ?? "");
47
+ const blocked = isHostFsSealed(out);
48
+ results.push({
49
+ check: "host-fs-sealed",
50
+ expectation: "host paths (/Users, /host) invisible",
51
+ pass: blocked,
52
+ detail: out.trim().slice(0, 200),
53
+ });
54
+ }
55
+ // 2. Direct (non-proxied) egress is impossible — no route off the internal net.
56
+ {
57
+ const r = probe(`curl -sS -m 5 -o /dev/null http://example.com && echo REACHED || echo BLOCKED`);
58
+ const out = ((r.stdout ?? "") + (r.stderr ?? "")).trim();
59
+ results.push({
60
+ check: "direct-egress-denied",
61
+ expectation: "no route to internet without proxy",
62
+ pass: /BLOCKED/.test(out) && !/REACHED/.test(out),
63
+ detail: out,
64
+ });
65
+ }
66
+ // 3. Non-allowlisted egress via the proxy is refused (403).
67
+ {
68
+ const r = probe(`curl -sS -m 5 -o /dev/null https://example.com && echo REACHED || echo BLOCKED`, true);
69
+ const out = ((r.stdout ?? "") + (r.stderr ?? "")).trim();
70
+ results.push({
71
+ check: "allowlist-enforced",
72
+ expectation: "off-list host refused by proxy",
73
+ pass: /BLOCKED|403/.test(out) && !/REACHED/.test(out),
74
+ detail: out.slice(0, 200),
75
+ });
76
+ }
77
+ // 4. Allowlisted egress via the proxy works (so the agent can reach inference).
78
+ {
79
+ const r = probe(`curl -sS -m 8 -o /dev/null https://api.anthropic.com && echo OK || echo FAIL`, true);
80
+ const out = ((r.stdout ?? "") + (r.stderr ?? "")).trim();
81
+ results.push({
82
+ check: "allowlist-permits",
83
+ expectation: "allowlisted host reachable via proxy",
84
+ pass: /OK/.test(out),
85
+ detail: out.slice(0, 200),
86
+ });
87
+ }
88
+ return results;
51
89
  }
52
- // 2. Direct (non-proxied) egress is impossible — no route off the internal net.
53
- {
54
- const r = probe(`curl -sS -m 5 -o /dev/null http://example.com && echo REACHED || echo BLOCKED`);
55
- const out = ((r.stdout ?? "") + (r.stderr ?? "")).trim();
56
- results.push({
57
- check: "direct-egress-denied",
58
- expectation: "no route to internet without proxy",
59
- pass: /BLOCKED/.test(out) && !/REACHED/.test(out),
60
- detail: out,
61
- });
90
+ finally {
91
+ sidecar.teardown();
62
92
  }
63
- // 3. Non-allowlisted egress via the proxy is refused (403).
64
- {
65
- const r = probe(`curl -sS -m 5 -o /dev/null https://example.com && echo REACHED || echo BLOCKED`, true);
66
- const out = ((r.stdout ?? "") + (r.stderr ?? "")).trim();
67
- results.push({
68
- check: "allowlist-enforced",
69
- expectation: "off-list host refused by proxy",
70
- pass: /BLOCKED|403/.test(out) && !/REACHED/.test(out),
71
- detail: out.slice(0, 200),
72
- });
73
- }
74
- // 4. Allowlisted egress via the proxy works (so the agent can reach inference).
75
- {
76
- const r = probe(`curl -sS -m 8 -o /dev/null https://api.anthropic.com && echo OK || echo FAIL`, true);
77
- const out = ((r.stdout ?? "") + (r.stderr ?? "")).trim();
78
- results.push({
79
- check: "allowlist-permits",
80
- expectation: "allowlisted host reachable via proxy",
81
- pass: /OK/.test(out),
82
- detail: out.slice(0, 200),
83
- });
84
- }
85
- sidecar.teardown();
86
- return results;
87
93
  }
88
94
  /** Escape regex metacharacters in a literal so it can be embedded in a RegExp. */
89
95
  function escapeRegex(s) {
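The `runBoundaryChecks` change above wraps all four probes in `try`/`finally` so the sidecar is torn down even when a probe throws. Below is a minimal sketch of that pattern with a stand-in sidecar object. `withSidecar` and the fake sidecar are hypothetical and only illustrate the control flow; no containers or networks are involved.

```javascript
// Hypothetical wrapper: run `body` against a started sidecar and always tear it down.
function withSidecar(start, body) {
  const sidecar = start();
  try {
    return body(sidecar);
  } finally {
    sidecar.teardown(); // runs on normal return and on throw
  }
}

// Stand-in sidecar that only counts teardown calls.
let tornDown = 0;
const fakeStart = () => ({ network: "net-demo", teardown: () => { tornDown += 1; } });

const results = withSidecar(fakeStart, (s) => [{ check: "demo", pass: s.network === "net-demo" }]);
console.log(results, tornDown); // teardown ran after a normal return

let caught = "";
try {
  withSidecar(fakeStart, () => { throw new Error("probe setup failed"); });
} catch (e) {
  caught = e.message;
}
console.log(caught, tornDown); // teardown ran again despite the throw
```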