npm - cowork-harness - Versions diffs - 0.1.1 → 0.3.0 - Mend

cowork-harness 0.1.1 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (57)

package/CHANGELOG.md +90 -0
package/README.md +16 -4
package/dist/agent/session.js +47 -14
package/dist/assert.js +152 -45
package/dist/boundary.js +65 -59
package/dist/cli.js +120 -13
package/dist/decide/decider.js +92 -39
package/dist/decide/external-channel.js +2 -1
package/dist/egress/proxy.js +26 -1
package/dist/egress/sidecar.js +21 -4
package/dist/hostloop/workspace-handler.js +5 -1
package/dist/io.js +11 -0
package/dist/prompt.js +2 -1
package/dist/redact.js +101 -0
package/dist/regex.js +14 -0
package/dist/run/cassette.js +594 -59
package/dist/run/chat.js +2 -1
package/dist/run/envelope.js +24 -2
package/dist/run/execute.js +92 -25
package/dist/run/renderer.js +7 -4
package/dist/run/run.js +26 -27
package/dist/run/scaffold.js +54 -0
package/dist/run/trace-view.js +54 -6
package/dist/run/verdict.js +62 -0
package/dist/runtime/argv.js +29 -10
package/dist/runtime/container.js +2 -1
package/dist/runtime/hostloop.js +4 -2
package/dist/runtime/lima.js +13 -4
package/dist/runtime/microvm.js +12 -42
package/dist/runtime/protocol.js +2 -1
package/dist/runtime/stage.js +17 -0
package/dist/scan.js +30 -0
package/dist/session.js +83 -41
package/dist/staging/resolve.js +42 -0
package/dist/types.js +91 -27
package/docs/README.md +29 -0
package/docs/assets/banner.png +0 -0
package/docs/boundary.md +81 -0
package/docs/cassette.md +274 -0
package/docs/cowork-spawn-contract-1.12603.1.md +78 -0
package/docs/decider-dir.md +134 -0
package/docs/discovery.md +43 -0
package/docs/maintenance.md +165 -0
package/docs/scenario.md +400 -0
package/docs/session.md +106 -0
package/package.json +7 -2
package/python/README.md +146 -0
package/python/conftest.py +18 -0
package/python/cowork_harness.py +387 -0
package/python/pyproject.toml +28 -0
package/python/test_cowork_lane.py +206 -0
package/python/test_csv_fx_lane.py +48 -0
package/python/test_csv_metrics_lane.py +49 -0
package/schema/scenario.schema.json +268 -0
package/schema/session.schema.json +240 -0
package/scripts/check-versions.ts +90 -0
package/scripts/gen-schema.ts +60 -0

package/CHANGELOG.md CHANGED

@@ -4,6 +4,96 @@ All notable changes to this project are documented here. The format is based on
 [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The project uses
 [Semantic Versioning](https://semver.org/); pre-1.0 minor versions may include breaking changes.
+## [Unreleased]
+## [0.3.0] — 2026-06-17
+The CI-operate + privacy layer for committed cassettes: record-time redaction, an always-on
+`verify-cassettes` scan/staleness gate, batch recording, and a set-membership assert operator.
+### Added
+- **`verify-cassettes <file|dir>`** — a token/agent-free CI gate over committed cassettes. A privacy
+  **scan** flags `email`/`currency`/bare-`domain` matches across the whole cassette, excluding only the
+  agent's **capability-manifest** messages (`system/init` + the `init-1` registry) from the noisy classes —
+  that catalog/MCP-server boilerplate is the sole concentrated false-positive source (email still scans it,
+  since the registry `account` field can carry the dev's email). `--allow <regex>` suppresses synthetic/
+  public reference names; multi-word proper names are opt-in, not a default class. Plus a **staleness** check
+  (`--staleness-only`) fails when a cassette's fingerprint drifted (you edited the skill but didn't
+  re-record). Exit 1 on any finding/drift/unreadable cassette; a malformed cassette is tallied, never
+  crashes the batch. Dedicated JSON envelope (`{command, ok, results}`), not the `RunResult` shape.
+- **Record-time content redaction** (opt-in; distinct from secret-scrub). A `.cowork-redact.json` (or
+  `COWORK_HARNESS_REDACT_PATTERNS`/`_KEYS`) rewrites configured PII across the **whole** cassette surface
+  (transcript, artifact bodies + filenames, prompt/answers/assert, skillSources) **structurally** — JSON
+  stays valid and the AskUserQuestion question/answer strings stay in sync (the O7 guard still passes), with
+  collision-safe deterministic tokens. Redaction is **verdict-preserving**: `record` refuses to write if it
+  would flip an assertion (a manufactured green). `--no-redact` / `--allow-failing` escape hatches.
+- **Batch recording** — `record <dir>` records every scenario in a directory (classified by a positive
+  `prompt:` signal: a non-scenario YAML is an announced skip, a broken scenario is a failure, never a silent
+  skip); `record <cassette-dir> --rerecord-stale` re-records only the cassettes whose fingerprint drifted.
+- **`artifact_json` `in:` operator** — assert the resolved value deep-equals one of a fixed set; stable for
+  stochastic (LLM-extracted) values where `equals` churns across re-records.
+### Fixed
+- **`skillHash` cassette fingerprint was silently dead** — `skillSourceDirs` passed a path string to
+  `loadSession` (which wants parsed YAML), threw, and the throw was swallowed, so the staleness gate's
+  skill-edit signal never computed for a file-based session. Now parses + resolves the session correctly;
+  `hashDir` folds in each file's relative path + type marker (a *move* now registers); `skillSources` are
+  stored relative, never as absolute host paths.
+## [0.2.0] — 2026-06-17
+Binary-verified the AskUserQuestion answer wire shape (agent ELF 2.1.170), implemented the
+harness-improvements plan, and resolved a 39-finding code-review pass behind two centralizing seams.
+### Added
+- **AskUserQuestion answer shapes.** `multiSelect` gates (answer with a list of labels → the verified
+  comma-joined wire shape); free-text **"Other"** via `answer:` (distinct from the label-validated
+  `choose:`); `choose` tolerates the `(Recommended)` suffix + `recommended`/`first` keywords. A partial
+  match on a batched gate now **names the unmatched sub-questions**.
+- **`artifact_json` assertion** — assert a JSON artifact's contents via a dotted path
+  (`equals`/`gt`/`exists`/`absent`/`is_null`); `absent`, `is_null`, and an unresolved intermediate are
+  distinct (the last fails loud, never a vacuous pass).
+- **Artifact manifest in cassettes** — `record` snapshots `outputs/`/`.projects/` (paths + hashes + small
+  JSON bodies) so `file_exists`/`user_visible_artifact`/`artifact_json` run on token-free `replay`. A
+  cassette→skill/baseline **staleness fingerprint** warns on drift; `replay --strict` fails on it. Cassettes
+  now carry a `cassetteVersion` (forward-compat guard).
+- **`RunResult.artifacts`** (ENV-MANIFEST) — observed user-visible files (path + bytes); also surfaced as
+  `Result.artifacts` in the Python helper.
+- **`allow_permissive_auto_allow` assertion + `RunResult.scan`** — a security-scan surface for the
+  Cowork-parity verdict (below); the assertion opts a scenario into a permissive auto-allow on purpose.
+- **CLI:** `trace --dispatches` (sub-agent dispatch tree + real total), `assert --list` (schema-generated),
+  `scaffold --from-run <id>` (kept run → starter scenario YAML).
+- **Python:** `run_scenario()` — run an authored scenario YAML and get the typed `Result`.
+### Changed
+- **Single verdict source (`computeVerdict()`)** wired into all five pass/fail sites (run/skill exit, footer,
+  replay exit, JSON-envelope `ok`) plus the Python `assert_success`. A Cowork-parity violation — a permissive
+  auto-allow, a recorded `outputs/` delete, or a host-path leak — now **default-fails** the run unless the
+  scenario explicitly asserts about it.
+- **Single fail-loud staging policy (`src/staging/resolve.ts`)** for every declared input (marketplace
+  manifest, enabled-plugin resolution, local skills, `mcp.config`, uploads, folders), with a Docker-safe
+  marketplace charset.
+- The run root honors `COWORK_HARNESS_RUNS_DIR`.
+### Fixed
+- **Egress / runtime hardening:** per-hop redirect egress logging, allowlist validation, a per-run proxy
+  port, proxy/sidecar readiness handshakes, fail-loud Lima provisioning, and boundary teardown in
+  `try/finally`.
+- **Protocol / decider hardening:** oversized control-frame hard-fail, a nonzero child-exit error event,
+  provenance untruncation, TTY-elicit cancel, and a JSON-safe `reply_with` key.
+- **Detection / packaging:** `%2F`/backslash decode in the outputs-delete detector; the npm package now
+  ships `schema/`, `docs/`, `python/`, and `scripts/`; assertion path containment; resume empty-tree warning.
+### Notes
+- Held/deferred per the plan's gating: composed partial-gate answering, `decider_intent:` in scenario YAML,
+  a whole-gate `response:` freeform, and `artifacts_share_field`. All additive/opt-in when built.
 ## [0.1.1] — 2026-06-16
 Docs, distribution, and packaging. No CLI behavior change.
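
The `artifact_json` entries above distinguish a resolved value, an `absent` leaf, and an unresolved intermediate, and 0.3.0 adds an `in:` set-membership check. A minimal sketch of how those states could be told apart follows. The lookup is a compact restatement of the `resolveDotPath` helper that appears in the `package/dist/assert.js` diff; `inSet` and the sample `artifact` object are hypothetical, and using `JSON.stringify` equality for the "deep-equals" comparison is an assumption (it is sensitive to key order).

```javascript
// Compact restatement of the three-state dotted-path lookup, so this block runs on its own.
function resolveDotPath(doc, path) {
  if (!path) return { state: "value", value: doc };
  const segs = path.split(".");
  let cur = doc;
  for (let i = 0; i < segs.length; i++) {
    // A non-object in the middle of the path means an intermediate segment could not be followed.
    if (cur === null || typeof cur !== "object")
      return { state: "unresolved", at: segs.slice(0, i).join(".") || "(root)" };
    const has = Object.prototype.hasOwnProperty.call(cur, segs[i]);
    if (i === segs.length - 1)
      return has ? { state: "value", value: cur[segs[i]] } : { state: "absent" };
    if (!has) return { state: "unresolved", at: segs.slice(0, i + 1).join(".") };
    cur = cur[segs[i]];
  }
  return { state: "value", value: cur };
}

// Assumption: structural equality via JSON serialization (key-order sensitive).
const jsonEq = (a, b) => JSON.stringify(a) === JSON.stringify(b);

// Hypothetical helper for an `in:`-style check: pass only when a value resolved
// and it equals one member of the allowed set.
function inSet(resolved, allowed) {
  return resolved.state === "value" && allowed.some((v) => jsonEq(resolved.value, v));
}

// Hypothetical artifact body.
const artifact = { summary: { currency: "USD", total: null } };

const found = resolveDotPath(artifact, "summary.currency");   // state "value"
const isNull = resolveDotPath(artifact, "summary.total");     // state "value", value null
const absent = resolveDotPath(artifact, "summary.missing");   // state "absent"
const unresolved = resolveDotPath(artifact, "totals.usd");    // state "unresolved", at "totals"

console.log(found, isNull, absent, unresolved);
console.log(inSet(found, ["USD", "EUR"]));
```

Keeping `unresolved` separate from `absent` is what lets an assertion on a mistyped intermediate key fail instead of passing vacuously.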

package/README.md CHANGED

@@ -74,8 +74,11 @@ Skill testing is the headline use, but the tool is a general harness over the Co
 | `skill <folder> "<prompt>"` | Run a local skill/plugin folder once against the staged agent | ad-hoc "is the skill alive / does it do X?" — the fast inner loop |
 | `run <scenario.yaml \| dir/>` | Run authored scenarios with `assert:` + a CI-ready exit code | you want a repeatable, **asserted regression test** |
 | `chat <folder>` | Interactive multi-turn REPL against a skill (TTY) | debugging a multi-turn flow by hand |
-| `record` / `replay` | Save a control-protocol cassette, then replay it deterministically | **token-free, Docker-free CI** from a once-recorded run |
-| `trace <run-id>` | Digest a run's `events.jsonl` (tools, sub-agent dispatches, decisions) | "how many sub-agents *actually* dispatched, and which?" |
+| `record` / `replay` | Save a control-protocol cassette (one scenario, or batch a `dir/`; `--rerecord-stale` refreshes only drifted ones), then replay it deterministically (`replay --strict` fails on a stale cassette) | **token-free, Docker-free CI** from a once-recorded run |
+| `verify-cassettes <file\|dir>` | Token-free CI gate over committed cassettes: a privacy scan (email/currency/domain → exit 1) + a staleness check (`--staleness-only`) | gating **committed cassettes** against PII leaks + "edited the skill, forgot to re-record" |
+| `trace <run-id>` | Digest a run's `events.jsonl` (`--tools`, `--gates`, `--dispatches` for the sub-agent dispatch tree + total) | "how many sub-agents *actually* dispatched, and which?" |
+| `scaffold --from-run <id>` | Turn a kept run into a starter scenario YAML (gates→answers, artifacts→`file_exists`) | authoring a scenario from a real run instead of guessing |
+| `assert --list` | List the available scenario assertions (generated from the schema) | "what can I assert?" without grepping the source |
 | `decide` | Validate a decider against a sample question in ~2 s (no run) | sanity-check a `--decider-*` / `--answer` wiring before a long run |
 | `gates` / `answer` | Stream / answer in-band gates for `--decider-dir` | a **driving agent** answers live questions via a Monitor |
 | `boundary-check [baseline]` | Prove the sandbox enforces Cowork's limitations | verifying the harness's own fidelity |
@@ -171,7 +174,7 @@ This repo ships a **companion skill** (`.claude/skills/cowork-harness/`) that te
 /plugin install cowork-harness@cowork-harness
 ```
-The skill **self-bootstraps the CLI**: if `cowork-harness` isn't on your PATH it falls back to `npx cowork-harness@latest` (Node ≥ 20). Tiers above `protocol` still need Docker/Lima and a Claude Desktop agent binary — see the prerequisites below.
+The skill **self-bootstraps the CLI**: if `cowork-harness` isn't on your PATH it falls back to `npx cowork-harness@>=0.2.0` (a version floor that fails loud rather than silently fetching a too-old CLI; Node ≥ 20). Tiers above `protocol` still need Docker/Lima and a Claude Desktop agent binary — see the prerequisites below.
 It also follows the open [Agent Skills](https://github.com/vercel-labs/skills) spec, so it installs cross-editor (Cursor, Codex, OpenCode, …) by pointing the `npx skills` CLI at `.claude/skills/cowork-harness` in this repo. (Working *inside* this repo, the skill auto-loads as a project skill — no install needed.)
@@ -202,8 +205,17 @@ cowork-harness replay --cassette cassettes/example-pdf-skill.cassette.json
 # A committed synthetic fixture is ready to replay on a fresh clone (no record step needed):
 cowork-harness replay --cassette examples/replays/example-pdf-skill.cassette.json
+# Cassettes are COMMITTED fixtures — record against synthetic data, and gate them in CI:
+cowork-harness verify-cassettes cassettes/   # privacy scan (email/currency/domain) + staleness; exit 1 on a finding
 ```
+> **Privacy:** a cassette snapshots the transcript and the `outputs/` JSON bodies, so it's committed PII
+> surface. Record against synthetic inputs; opt into record-time **redaction** with a `.cowork-redact.json`
+> (verdict-preserving — `record` refuses to write if redaction would flip an assertion); and gate every
+> commit with `verify-cassettes` (the always-on scan, `--allow <regex>` for synthetic/public names). See
+> [docs/cassette.md](./docs/cassette.md).
 > **What replay checks.** A cassette bundles BOTH recorded protocol directions: the child→driver
 > `events` stream AND the driver→child `controlOut` decision responses. `replay` re-runs the
 > orchestration from both, re-evaluates the **content** assertions, and re-exercises
@@ -383,7 +395,7 @@ The provided [GitHub Actions workflow](.github/workflows/ci.yml) runs a **four-s
 | Stage | Runs | Needs | Gates |
 |---|---|---|---|
-| **unit** | format check · typecheck · unit tests · build · CLI smoke · token-free `replay` gate | nothing | every push/PR |
+| **unit** | format check · typecheck · unit tests · build · CLI smoke · token-free `replay` + `verify-cassettes` gates | nothing | every push/PR |
 | **boundary** | builds the pinned agent image, brings up the default-deny network, runs `boundary-check` | Docker, arm64 runner | proves the sandbox enforces Cowork's limits — **no API key** |
 | **scenarios** | the live scenario suite at `container` fidelity, uploads transcripts/egress logs as artifacts | `ANTHROPIC_API_KEY` (or `CLAUDE_CODE_OAUTH_TOKEN`) | fork PRs: the whole job is skipped (`if:` guard); same-repo without a key: warns and exits 0 |
 | **parity-drift** | reminder to re-`sync` when Desktop updates | nothing | informational, never blocks |
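
The README hunks above describe `verify-cassettes` as a privacy scan with `--allow <regex>` suppression and an exit code of 1 on any finding. The sketch below illustrates that idea only for the email class. It is not the package's scanner: the `EMAIL` pattern, the `stringLeaves` and `scanEmails` helpers, and the sample cassette shape are all hypothetical.

```javascript
// Hypothetical pattern: the package's real scan classes are not shown in this diff.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Collect every string leaf of a parsed JSON value.
function stringLeaves(value, out = []) {
  if (typeof value === "string") out.push(value);
  else if (Array.isArray(value)) value.forEach((v) => stringLeaves(v, out));
  else if (value && typeof value === "object") Object.values(value).forEach((v) => stringLeaves(v, out));
  return out;
}

// Return the email-like matches that no allow pattern suppresses.
// Allow patterns must not carry the "g" flag, so that test() stays stateless.
function scanEmails(doc, allow = []) {
  const findings = [];
  for (const s of stringLeaves(doc)) {
    for (const m of s.match(EMAIL) ?? []) {
      if (!allow.some((re) => re.test(m))) findings.push(m);
    }
  }
  return findings;
}

// Hypothetical cassette-like document; example.com stands in for a synthetic, allowed domain.
const cassette = {
  transcript: ["contact jane@corp.test or demo@example.com"],
  artifacts: { owner: "ops@example.com" },
};

const findings = scanEmails(cassette, [/@example\.com$/]);
const exitCode = findings.length > 0 ? 1 : 0; // mirrors "exit 1 on a finding"
console.log(findings, exitCode);
```

Walking every string leaf rather than a fixed list of fields matches the stated intent of scanning the whole cassette surface, at the cost of more false positives that the allow list then has to absorb.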

package/dist/agent/session.js CHANGED

@@ -1,3 +1,4 @@
+import { warn } from "../io.js";
 import { createWriteStream } from "node:fs";
 import { join } from "node:path";
 import readline from "node:readline";
@@ -159,6 +160,8 @@ export class LiveAgentSession {
     /** Reject function set when proc emits an error — bridges the callback into the async generator.
      *  Set before the generator loop starts; called at most once (the Promise settles once). */
     rejectError;
+    /** Bounded tail of the child's stderr, for the nonzero-exit error message. */
+    stderrTail = "";
     constructor(proc, outDir) {
         this.proc = proc;
         this.outDir = outDir;
@@ -166,6 +169,11 @@ export class LiveAgentSession {
         this.controlOut = createWriteStream(join(outDir, "control-out.jsonl"), { flags: "a" });
         const errLog = createWriteStream(join(outDir, "agent.stderr.log"), { flags: "a" });
         this.proc.stderr.pipe(errLog);
+        // keep a bounded stderr tail and capture the exit code/signal so a child that dies nonzero
+        // (with no structured {type:"result"} error) is surfaced as a typed error event, not a silent stop.
+        this.proc.stderr.on("data", (d) => {
+            this.stderrTail = (this.stderrTail + d.toString()).slice(-2000);
+        });
         // #15: attach stdin error listener once at construction so dead-child writes don't produce
         // unhandled process errors. Routes to the same error path as spawn errors when possible.
         this.proc.stdin.on("error", (e) => {
@@ -241,6 +249,20 @@ export class LiveAgentSession {
                 }
                 yield* this.translate(msg);
             }
+            // stdout closed. Give a pending 'exit' one tick to land (NOT a blocking wait on 'close' — a
+            // mock/fake child may never emit it), then surface a nonzero/signal exit as a typed error — a
+            // crashed child that emitted no {type:"result"} error line would otherwise be a silent stop.
+            await new Promise((res) => setImmediate(res));
+            const code = this.proc.exitCode;
+            const signal = this.proc.signalCode;
+            if (signal || (code !== null && code !== 0)) {
+                const tail = this.stderrTail.trim();
+                yield {
+                    type: "error",
+                    source: "exit",
+                    message: `agent process exited ${signal ? `on signal ${signal}` : `with code ${code}`}${tail ? ` — stderr tail: ${tail}` : ""}`,
+                };
+            }
         }
         finally {
             this.rejectError = undefined; // generator is done; stop routing errors here
@@ -274,7 +296,7 @@ export class LiveAgentSession {
                     // reply — an unanswered mcp_message blocks the in-VM agent on the round-trip forever (deadlock).
                     // Reply with a JSON-RPC error instead, mirroring the no-handler defense below.
                     const message = e?.message ?? String(e);
-                    process.stderr.write(`::warning:: sdkMcp.handle threw for "${server}" — replying with a JSON-RPC error: ${message}\n`);
+                    warn(`::warning:: sdkMcp.handle threw for "${server}" — replying with a JSON-RPC error: ${message}\n`);
                     out = { error: { code: -32603, message: `handler error: ${message}` } };
                 }
                 this.write(mcpResponseEnvelope(msg.request_id, out, jr.id));
@@ -294,7 +316,7 @@ export class LiveAgentSession {
             // #10: an mcp_message arrived but no sdkMcp handler is configured. Reply with a JSON-RPC error
             // (well-formed via mcpResponseEnvelope) instead of silently dropping it — a dropped request
             // leaves the in-VM agent waiting on the round-trip forever (protocol deadlock in host-loop mode).
-            process.stderr.write(`::warning:: mcp_message for server "${server}" arrived but no sdkMcp handler is configured — replying with a JSON-RPC error (would otherwise deadlock)\n`);
+            warn(`::warning:: mcp_message for server "${server}" arrived but no sdkMcp handler is configured — replying with a JSON-RPC error (would otherwise deadlock)\n`);
             this.write(mcpResponseEnvelope(msg.request_id, { error: { code: -32601, message: "no sdkMcp handler configured" } }, jr.id));
             return;
         }
@@ -312,7 +334,7 @@ export class LiveAgentSession {
         if (!req) {
             // #13: an id with no matching request_id is a protocol drift. Writing a guessed envelope would
             // be worse, but a silent return leaves the agent blocked until timeout (looks like a hang).
-            process.stderr.write(`::warning:: respond() for unknown decision id "${decisionId}" — no matching request_id was seen; the agent may block until timeout (protocol drift)\n`);
+            warn(`::warning:: respond() for unknown decision id "${decisionId}" — no matching request_id was seen; the agent may block until timeout (protocol drift)\n`);
             return;
         }
         // #14: serializeDecision returns a safe deny envelope on a kind mismatch (defense in depth). That
@@ -320,7 +342,7 @@ export class LiveAgentSession {
         // while the agent actually received a deny. (serializeDecision stays a pure declared inverse of
         // deserializeDecision; the warning lives here in the caller, not in the pure function.)
         if (req.kind !== r.kind)
-            process.stderr.write(`::warning:: decider returned kind "${r.kind}" for a "${req.kind}" request (id ${decisionId}) → sending a safe deny/cancel; the agent did NOT receive an answer\n`);
+            warn(`::warning:: decider returned kind "${r.kind}" for a "${req.kind}" request (id ${decisionId}) → sending a safe deny/cancel; the agent did NOT receive an answer\n`);
         this.write(serializeDecision(req, r));
     }
     close() {
@@ -333,13 +355,12 @@ export class LiveAgentSession {
     }
     write(obj) {
         const line = JSON.stringify(obj);
-        // #54: the control protocol writes small single-line JSON frames, so stdin backpressure
-        // effectively never engages; we intentionally ignore the write() return / drain here. If frame
-        // sizes ever grow, revisit with a drain-aware queue (which would make write()/respond() async).
-        // #47: guard that assumption — a frame past the threshold warns loudly so the "revisit" trigger
-        // fires instead of silently risking partial buffering on a frame far larger than expected.
+        // The control protocol writes small single-line JSON frames, so stdin backpressure effectively never
+        // engages; we ignore the write() return / drain here. A frame past the safe threshold is anomalous —
+        // hard-FAIL rather than risk a partially-buffered write that silently corrupts the protocol stream.
+        // (If large control frames ever become legitimate, switch to a drain-aware queue, making writes async.)
         if (line.length > 256 * 1024)
-            process.stderr.write(`::warning:: control frame is ${line.length} bytes (> 256 KiB) — stdin backpressure may engage; revisit write() with a drain-aware queue\n`);
+            throw new Error(`control frame is ${line.length} bytes (> 256 KiB safe limit) — refusing to write to avoid partial stdin buffering; this indicates an unexpectedly large control payload`);
         this.controlOut.write(line + "\n");
         this.proc.stdin.write(line + "\n");
     }
@@ -393,6 +414,7 @@ export function parseMessage(msg) {
                         ev.push({
                             type: "subagent_dispatch",
                             toolUseId: String(block.id ?? ""),
+                            parentToolUseId,
                             // Skills often dispatch with only {description, prompt} (no subagent_type) → agentType is
                             // "unknown" but the description still identifies the dispatch (e.g. "TOP_DOWN market sizing").
                             agentType: String(inp.subagent_type ?? inp.subagentType ?? "unknown"),
@@ -415,6 +437,7 @@ export function parseMessage(msg) {
                         toolUseId: block.tool_use_id ? String(block.tool_use_id) : undefined,
                         isError: !!block.is_error,
                         text: toolResultText(block.content),
+                        provenanceText: toolResultRaw(block.content),
                     });
             }
             break;
@@ -424,17 +447,27 @@ export function parseMessage(msg) {
     }
     return ev;
 }
-/** Flatten a tool_result `content` (a string, or an array of `{type:"text",text}` blocks) to a short string. */
-function toolResultText(content) {
+/** Flatten a tool_result `content` (a string, or an array of `{type:"text",text}` blocks), capped at
+ *  `max` chars. The 500-char DISPLAY value (toolResultText) keeps the recorder/trace compact; the larger
+ *  PROVENANCE value (toolResultRaw) is what seeds web_fetch provenance, so a URL past char 500 isn't lost. */
+function flattenToolResult(content, max) {
     if (typeof content === "string")
-        return content.slice(0, 500);
+        return content.slice(0, max);
     if (Array.isArray(content))
         return content
             .map((b) => (b && typeof b === "object" && "text" in b ? String(b.text) : ""))
             .join(" ")
-            .slice(0, 500);
+            .slice(0, max);
     return "";
 }
+function toolResultText(content) {
+    return flattenToolResult(content, 500);
+}
+/** Larger cap for provenance (URL extraction) — matches the web_fetch body cap so any URL the agent
+ *  could realistically act on is seeded; still bounded so a pathological result can't blow up memory. */
+function toolResultRaw(content) {
+    return flattenToolResult(content, 200_000);
+}
 export function toDecisionRequest(msg) {
     const sub = msg.request?.subtype;
     const id = msg.request_id;
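
The last hunk above splits tool-result flattening into a 500-character display value and a 200,000-character provenance value so that a URL past character 500 is not lost. A small sketch of that effect follows: `flattenToolResult` is restated from the hunk so the block runs on its own, while the sample `content` array and the `url` value are hypothetical.

```javascript
// Restated from the hunk above: flatten a string or an array of text blocks, capped at `max` chars.
function flattenToolResult(content, max) {
  if (typeof content === "string") return content.slice(0, max);
  if (Array.isArray(content))
    return content
      .map((b) => (b && typeof b === "object" && "text" in b ? String(b.text) : ""))
      .join(" ")
      .slice(0, max);
  return "";
}

// Hypothetical tool result: 600 characters of padding followed by a URL.
const url = "https://example.com/report";
const content = [
  { type: "text", text: "x".repeat(600) },
  { type: "text", text: url },
];

const display = flattenToolResult(content, 500);        // compact value for the recorder and trace
const provenance = flattenToolResult(content, 200_000); // larger value used to seed provenance

console.log(display.length, display.includes(url));
console.log(provenance.includes(url));
```

With the old single 500-character cap, the URL in this sample would never have reached provenance seeding; the second, larger cap keeps it while still bounding memory.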

package/dist/assert.js CHANGED

@@ -1,5 +1,41 @@
-import { existsSync } from "node:fs";
-import { join } from "node:path";
+import { existsSync, readFileSync } from "node:fs";
+import { resolve, relative, isAbsolute, sep } from "node:path";
+import { compileUserRegex } from "./regex.js";
+/** Resolve a user-authored assertion path under `workRoot`, rejecting absolute paths and any `..` that
+ *  escapes the root. Returns the absolute path, or null if it would leave `workRoot`. Assertion paths are
+ *  author-controlled, not attacker input, but a `file_exists: "../../etc/passwd"` silently probing the host
+ *  FS (or an `outputs/../../x` slipping past the user-visible prefix check) is a containment bug regardless. */
+function containedPath(workRoot, p) {
+    if (isAbsolute(p))
+        return null;
+    const root = resolve(workRoot);
+    const abs = resolve(root, p);
+    const rel = relative(root, abs);
+    if (rel === ".." || rel.startsWith(".." + sep) || isAbsolute(rel))
+        return null;
+    return abs;
+}
+export function resolveDotPath(doc, path) {
+    if (!path)
+        return { state: "value", value: doc };
+    const segs = path.split(".");
+    let cur = doc;
+    for (let i = 0; i < segs.length; i++) {
+        const seg = segs[i];
+        const last = i === segs.length - 1;
+        if (cur === null || typeof cur !== "object")
+            return { state: "unresolved", at: segs.slice(0, i).join(".") || "(root)" };
+        const obj = cur;
+        const has = Object.prototype.hasOwnProperty.call(obj, seg);
+        if (last)
+            return has ? { state: "value", value: obj[seg] } : { state: "absent" };
+        if (!has)
+            return { state: "unresolved", at: segs.slice(0, i + 1).join(".") };
+        cur = obj[seg];
+    }
+    return { state: "value", value: cur };
+}
+const jsonEq = (a, b) => JSON.stringify(a) === JSON.stringify(b);
 /**
  * Boundary-aware host matching: `host` must equal `needle` exactly or be a proper subdomain of it.
  * `evilanthropic.com` does NOT match `anthropic.com`; `x.anthropic.com` does.
@@ -30,36 +66,41 @@ function check(a, ctx) {
     // `evaluate()` is a bare `.map(check)` with no error boundary, so a malformed pattern must be a
     // clean assertion failure, not an uncaught throw. Case-insensitive ("i").
     if (a.transcript_matches !== undefined) {
-        let re;
-        try {
-            re = new RegExp(a.transcript_matches, "i");
-        }
-        catch (e) {
-            results.push(fail(`transcript_matches: bad regex "${a.transcript_matches}": ${String(e.message)}`));
-        }
-        if (re)
-            results.push(re.test(ctx.transcript) ? ok() : fail(`transcript did not match /${a.transcript_matches}/i`));
+        const c = compileUserRegex(a.transcript_matches);
+        if ("error" in c)
+            results.push(fail(`transcript_matches: bad regex "${a.transcript_matches}": ${c.error}`));
+        else
+            results.push(c.re.test(ctx.transcript) ? ok() : fail(`transcript did not match /${a.transcript_matches}/i`));
     }
     if (a.transcript_not_matches !== undefined) {
-        let re;
-        try {
-            re = new RegExp(a.transcript_not_matches, "i");
-        }
-        catch (e) {
-            results.push(fail(`transcript_not_matches: bad regex "${a.transcript_not_matches}": ${String(e.message)}`));
-        }
-        if (re)
-            results.push(!re.test(ctx.transcript) ? ok() : fail(`transcript unexpectedly matched /${a.transcript_not_matches}/i`));
+        const c = compileUserRegex(a.transcript_not_matches);
+        if ("error" in c)
+            results.push(fail(`transcript_not_matches: bad regex "${a.transcript_not_matches}": ${c.error}`));
+        else
+            results.push(!c.re.test(ctx.transcript) ? ok() : fail(`transcript unexpectedly matched /${a.transcript_not_matches}/i`));
+    }
+    if (a.file_exists !== undefined) {
+        const abs = containedPath(ctx.workRoot, a.file_exists);
+        if (!abs)
+            results.push(fail(`unsafe file_exists path "${a.file_exists}" — must stay under the work root (no absolute paths or "..")`));
+        else
+            results.push(existsSync(abs) ? ok() : fail(`file not found: ${a.file_exists} (under ${ctx.workRoot})`));
     }
-    if (a.file_exists !== undefined)
-        results.push(existsSync(join(ctx.workRoot, a.file_exists)) ? ok() : fail(`file not found: ${a.file_exists} (under ${ctx.workRoot})`));
     if (a.user_visible_artifact !== undefined) {
         const p = a.user_visible_artifact;
-        const visible = ctx.userVisiblePrefixes.some((pre) => p === pre || p.startsWith(pre + "/"));
-        if (!visible)
-            results.push(fail(`"${p}" is not under a user-visible prefix (${ctx.userVisiblePrefixes.join(", ")}) — invisible to the user in Cowork`));
-        else
-            results.push(existsSync(join(ctx.workRoot, p)) ? ok() : fail(`user-visible artifact not found: ${p}`));
+        const abs = containedPath(ctx.workRoot, p);
+        if (!abs) {
+            // normalize/contain BEFORE the prefix test so `outputs/../../x` can't pass startsWith("outputs/")
+            results.push(fail(`unsafe user_visible_artifact path "${p}" — must stay under the work root (no absolute paths or "..")`));
+        }
+        else {
+            const rel = relative(resolve(ctx.workRoot), abs); // normalized, guaranteed under workRoot
+            const visible = ctx.userVisiblePrefixes.some((pre) => rel === pre || rel.startsWith(pre + "/"));
+            if (!visible)
+                results.push(fail(`"${p}" is not under a user-visible prefix (${ctx.userVisiblePrefixes.join(", ")}) — invisible to the user in Cowork`));
+            else
+                results.push(existsSync(abs) ? ok() : fail(`user-visible artifact not found: ${p}`));
+        }
     }
     if (a.tool_called !== undefined)
         results.push(ctx.toolsCalled.has(a.tool_called) ? ok() : fail(`tool not called: ${a.tool_called}`));
@@ -72,15 +113,11 @@ function check(a, ctx) {
     if (a.subagent_dispatched !== undefined) {
         // Match the agentType OR the description — skills often dispatch with only a `description`
         // (no subagent_type → agentType "unknown"), so name-matching alone would miss those (O1).
-        let rx;
-        try {
-            rx = new RegExp(a.subagent_dispatched, "i");
-        }
-        catch (e) {
-            results.push(fail(`subagent_dispatched: bad regex "${a.subagent_dispatched}": ${String(e.message)}`));
-        }
-        if (rx)
-            results.push(ctx.subagents.some((s) => rx.test(s.agentType) || rx.test(s.description ?? ""))
+        const c = compileUserRegex(a.subagent_dispatched);
+        if ("error" in c)
+            results.push(fail(`subagent_dispatched: bad regex "${a.subagent_dispatched}": ${c.error}`));
+        else
+            results.push(ctx.subagents.some((s) => c.re.test(s.agentType) || c.re.test(s.description ?? ""))
                 ? ok()
                 : fail(`no sub-agent matching "${a.subagent_dispatched}" was dispatched (by type or description)`));
     }
@@ -112,18 +149,18 @@ function check(a, ctx) {
             : fail(`delete op(s) touched outputs (forbidden in Cowork): ${ctx.outputsDeletes.slice(0, 3).join("; ")}`));
     if (a.self_heal_ran !== undefined)
         results.push(ctx.selfHealRan === a.self_heal_ran ? ok() : fail(`self_heal_ran was ${ctx.selfHealRan}, expected ${a.self_heal_ran}`));
+    // Verdict modifier (consumed by computeVerdict, not here). It always "passes" as an assertion so a
+    // standalone `{allow_permissive_auto_allow: true}` is a valid non-empty assertion, not "empty assertion".
+    if (a.allow_permissive_auto_allow !== undefined)
+        results.push(ok());
     if (a.transcript_no_host_path !== undefined)
         results.push(!ctx.hostPathLeaked === a.transcript_no_host_path ? ok() : fail(`host path leaked into model-visible text: ${ctx.hostPathLeaked}`));
     if (a.question_asked !== undefined) {
-        let rx;
-        try {
-            rx = new RegExp(a.question_asked, "i");
-        }
-        catch (e) {
-            results.push(fail(`question_asked: bad regex "${a.question_asked}": ${String(e.message)}`));
-        }
-        if (rx)
-            results.push(ctx.questions.some((q) => rx.test(q)) ? ok() : fail(`no question matched: ${a.question_asked}`));
+        const c = compileUserRegex(a.question_asked);
+        if ("error" in c)
+            results.push(fail(`question_asked: bad regex "${a.question_asked}": ${c.error}`));
+        else
+            results.push(ctx.questions.some((q) => c.re.test(q)) ? ok() : fail(`no question matched: ${a.question_asked}`));
     }
     if (a.questions_count_max !== undefined)
         results.push(ctx.questions.length <= a.questions_count_max ? ok() : fail(`asked ${ctx.questions.length} questions, max ${a.questions_count_max}`));
@@ -150,6 +187,76 @@ function check(a, ctx) {
             results.push(failedConfirmed.length > 0 ? ok() : fail(`expected a confirmed gate-delivery failure but none was observed`));
         }
     }
+    if (a.artifact_json !== undefined) {
+        const aj = a.artifact_json;
+        const file = containedPath(ctx.workRoot, aj.artifact);
+        if (!file)
+            results.push(fail(`unsafe artifact_json path "${aj.artifact}" — must stay under the work root (no absolute paths or "..")`));
+        else if (!existsSync(file))
+            results.push(fail(`artifact_json: file not found: ${aj.artifact} (under ${ctx.workRoot})`));
+        else {
+            let doc;
+            let parsed = true;
+            try {
+                doc = JSON.parse(readFileSync(file, "utf8"));
+            }
+            catch (e) {
+                parsed = false;
+                results.push(fail(`artifact_json: ${aj.artifact} is not valid JSON: ${String(e.message)}`));
+            }
+            if (parsed) {
+                const r = resolveDotPath(doc, aj.path);
+                if (r.state === "unresolved") {
+                    // Malformed/truncated artifact for this path — fail loud, NOT a vacuous "absent" pass (the H4
+                    // false-green at the field level).
+                    results.push(fail(`artifact_json: path "${aj.path}" unresolvable in ${aj.artifact} — intermediate "${r.at}" is missing or not an object`));
+                }
+                else {
+                    const present = r.state === "value";
+                    const val = r.state === "value" ? r.value : undefined;
+                    let any = false;
+                    if (aj.exists !== undefined) {
+                        any = true;
+                        results.push(present === aj.exists ? ok() : fail(`artifact_json: "${aj.path ?? "(root)"}" exists=${present}, expected ${aj.exists}`));
+                    }
+                    if (aj.absent !== undefined) {
+                        any = true;
+                        const absent = r.state === "absent";
+                        results.push(absent === aj.absent ? ok() : fail(`artifact_json: "${aj.path}" absent=${absent}, expected ${aj.absent}`));
+                    }
+                    if (aj.is_null !== undefined) {
+                        any = true;
+                        const isNull = present && val === null;
+                        results.push(isNull === aj.is_null ? ok() : fail(`artifact_json: "${aj.path}" is_null=${isNull}, expected ${aj.is_null}`));
+                    }
+                    if (aj.equals !== undefined) {
+                        any = true;
+                        results.push(present && jsonEq(val, aj.equals)
+                            ? ok()
+                            : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected ${JSON.stringify(aj.equals)}`));
+                    }
+                    if (aj.gt !== undefined) {
+                        any = true;
+                        results.push(typeof val === "number" && val > aj.gt
+                            ? ok()
+                            : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected > ${aj.gt}`));
+                    }
+                    // #4: set membership — the resolved value deep-equals one of a fixed set. Stable for stochastic
+                    // (LLM-extracted) values where `equals` would churn across re-records. `present &&` guard mirrors
+                    // `equals` so an absent value never vacuously satisfies it.
+                    if (aj.in !== undefined) {
+                        any = true;
+                        results.push(present && Array.isArray(aj.in) && aj.in.some((x) => jsonEq(val, x))
+                            ? ok()
+                            : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected one of ${JSON.stringify(aj.in)}`));
+                    }
+                    // No operator → an existence assertion (the value must be present).
+                    if (!any)
+                        results.push(present ? ok() : fail(`artifact_json: "${aj.path ?? "(root)"}" is not present (no operator given → existence check)`));
+                }
+            }
+        }
+    }
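Reviewer note: the `artifact_json` block depends on `resolveDotPath` distinguishing three states: `value`, `absent`, and `unresolved`. Its implementation is not part of this hunk. The sketch below is one plausible reading of those states, inferred from the failure message ("intermediate ... is missing or not an object"). The name `resolveDotPathSketch` and the treatment of an undefined path as the document root are assumptions.

```javascript
// Hypothetical sketch of a three-state dot-path resolver (the package's resolveDotPath is not shown).
// Assumptions: an undefined/empty path means the root; only the LAST key may be "absent";
// a missing or non-object intermediate is "unresolved" and reports where resolution stopped.
function resolveDotPathSketch(doc, path) {
    if (path === undefined || path === "")
        return { state: "value", value: doc };
    const keys = path.split(".");
    let cur = doc;
    for (let i = 0; i < keys.length; i++) {
        if (cur === null || typeof cur !== "object")
            return { state: "unresolved", at: keys.slice(0, i).join(".") || "(root)" };
        if (!Object.prototype.hasOwnProperty.call(cur, keys[i])) {
            return i === keys.length - 1
                ? { state: "absent" }
                : { state: "unresolved", at: keys.slice(0, i + 1).join(".") };
        }
        cur = cur[keys[i]];
    }
    return { state: "value", value: cur };
}

const doc = { invoice: { total: 42, note: null } };
console.log(resolveDotPathSketch(doc, "invoice.total").state);  // "value"
console.log(resolveDotPathSketch(doc, "invoice.tax").state);    // "absent"
console.log(resolveDotPathSketch(doc, "customer.name").state);  // "unresolved"
```

Under this reading, `absent: true` passes only when the parent object exists and lacks the key. A truncated artifact that is missing the parent fails as unresolved instead of passing vacuously, which matches the intent stated in the diff's comment.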
     if (a.result !== undefined)
         results.push(ctx.result === a.result ? ok() : fail(`result was ${ctx.result}, expected ${a.result}`));
     if (results.length === 0)