cowork-harness 0.2.0 → 0.3.0

package/CHANGELOG.md CHANGED
@@ -6,6 +6,42 @@ All notable changes to this project are documented here. The format is based on
6
6
 
7
7
  ## [Unreleased]
8
8
 
9
+ ## [0.3.0] — 2026-06-17
10
+
11
+ The CI-operate + privacy layer for committed cassettes: record-time redaction, an always-on
12
+ `verify-cassettes` scan/staleness gate, batch recording, and a set-membership assert operator.
13
+
14
+ ### Added
15
+
16
+ - **`verify-cassettes <file|dir>`** — a token/agent-free CI gate over committed cassettes. A privacy
17
+ **scan** flags `email`/`currency`/bare-`domain` matches across the whole cassette, excluding only the
18
+ agent's **capability-manifest** messages (`system/init` + the `init-1` registry) from the noisy classes —
19
+ that catalog/MCP-server boilerplate is the sole concentrated false-positive source (email still scans it,
20
+ since the registry `account` field can carry the dev's email). `--allow <regex>` suppresses synthetic/
21
+ public reference names; multi-word proper names are opt-in, not a default class. Plus a **staleness** check
22
+ (`--staleness-only`) fails when a cassette's fingerprint drifted (you edited the skill but didn't
23
+ re-record). Exit 1 on any finding/drift/unreadable cassette; a malformed cassette is tallied, never
24
+ crashes the batch. Dedicated JSON envelope (`{command, ok, results}`), not the `RunResult` shape.
25
+ - **Record-time content redaction** (opt-in; distinct from secret-scrub). A `.cowork-redact.json` (or
26
+ `COWORK_HARNESS_REDACT_PATTERNS`/`_KEYS`) rewrites configured PII across the **whole** cassette surface
27
+ (transcript, artifact bodies + filenames, prompt/answers/assert, skillSources) **structurally** — JSON
28
+ stays valid and the AskUserQuestion question/answer strings stay in sync (the O7 guard still passes), with
29
+ collision-safe deterministic tokens. Redaction is **verdict-preserving**: `record` refuses to write if it
30
+ would flip an assertion (a manufactured green). `--no-redact` / `--allow-failing` escape hatches.
31
+ - **Batch recording** — `record <dir>` records every scenario in a directory (classified by a positive
32
+ `prompt:` signal: a non-scenario YAML is an announced skip, a broken scenario is a failure, never a silent
33
+ skip); `record <cassette-dir> --rerecord-stale` re-records only the cassettes whose fingerprint drifted.
34
+ - **`artifact_json` `in:` operator** — assert the resolved value deep-equals one of a fixed set; stable for
35
+ stochastic (LLM-extracted) values where `equals` churns across re-records.
36
+
37
+ ### Fixed
38
+
39
+ - **`skillHash` cassette fingerprint was silently dead** — `skillSourceDirs` passed a path string to
40
+ `loadSession` (which wants parsed YAML), threw, and the throw was swallowed, so the staleness gate's
41
+ skill-edit signal never computed for a file-based session. Now parses + resolves the session correctly;
42
+ `hashDir` folds in each file's relative path + type marker (a *move* now registers); `skillSources` are
43
+ stored relative, never as absolute host paths.
44
+
9
45
  ## [0.2.0] — 2026-06-17
10
46
 
11
47
  Binary-verified the AskUserQuestion answer wire shape (agent ELF 2.1.170), implemented the
package/README.md CHANGED
@@ -74,7 +74,8 @@ Skill testing is the headline use, but the tool is a general harness over the Co
74
74
  | `skill <folder> "<prompt>"` | Run a local skill/plugin folder once against the staged agent | ad-hoc "is the skill alive / does it do X?" — the fast inner loop |
75
75
  | `run <scenario.yaml \| dir/>` | Run authored scenarios with `assert:` + a CI-ready exit code | you want a repeatable, **asserted regression test** |
76
76
  | `chat <folder>` | Interactive multi-turn REPL against a skill (TTY) | debugging a multi-turn flow by hand |
77
- | `record` / `replay` | Save a control-protocol cassette, then replay it deterministically (`replay --strict` fails on a stale cassette) | **token-free, Docker-free CI** from a once-recorded run |
77
+ | `record` / `replay` | Save a control-protocol cassette (one scenario, or batch a `dir/`; `--rerecord-stale` refreshes only drifted ones), then replay it deterministically (`replay --strict` fails on a stale cassette) | **token-free, Docker-free CI** from a once-recorded run |
78
+ | `verify-cassettes <file\|dir>` | Token-free CI gate over committed cassettes: a privacy scan (email/currency/domain → exit 1) + a staleness check (`--staleness-only`) | gating **committed cassettes** against PII leaks + "edited the skill, forgot to re-record" |
78
79
  | `trace <run-id>` | Digest a run's `events.jsonl` (`--tools`, `--gates`, `--dispatches` for the sub-agent dispatch tree + total) | "how many sub-agents *actually* dispatched, and which?" |
79
80
  | `scaffold --from-run <id>` | Turn a kept run into a starter scenario YAML (gates→answers, artifacts→`file_exists`) | authoring a scenario from a real run instead of guessing |
80
81
  | `assert --list` | List the available scenario assertions (generated from the schema) | "what can I assert?" without grepping the source |
@@ -204,8 +205,17 @@ cowork-harness replay --cassette cassettes/example-pdf-skill.cassette.json
204
205
 
205
206
  # A committed synthetic fixture is ready to replay on a fresh clone (no record step needed):
206
207
  cowork-harness replay --cassette examples/replays/example-pdf-skill.cassette.json
208
+
209
+ # Cassettes are COMMITTED fixtures — record against synthetic data, and gate them in CI:
210
+ cowork-harness verify-cassettes cassettes/ # privacy scan (email/currency/domain) + staleness; exit 1 on a finding
207
211
  ```
208
212
 
213
+ > **Privacy:** a cassette snapshots the transcript and the `outputs/` JSON bodies, so it's committed PII
214
+ > surface. Record against synthetic inputs; opt into record-time **redaction** with a `.cowork-redact.json`
215
+ > (verdict-preserving — `record` refuses to write if redaction would flip an assertion); and gate every
216
+ > commit with `verify-cassettes` (the always-on scan, `--allow <regex>` for synthetic/public names). See
217
+ > [docs/cassette.md](./docs/cassette.md).
218
+
209
219
  > **What replay checks.** A cassette bundles BOTH recorded protocol directions: the child→driver
210
220
  > `events` stream AND the driver→child `controlOut` decision responses. `replay` re-runs the
211
221
  > orchestration from both, re-evaluates the **content** assertions, and re-exercises
@@ -385,7 +395,7 @@ The provided [GitHub Actions workflow](.github/workflows/ci.yml) runs a **four-s
385
395
 
386
396
  | Stage | Runs | Needs | Gates |
387
397
  |---|---|---|---|
388
- | **unit** | format check · typecheck · unit tests · build · CLI smoke · token-free `replay` gate | nothing | every push/PR |
398
+ | **unit** | format check · typecheck · unit tests · build · CLI smoke · token-free `replay` + `verify-cassettes` gates | nothing | every push/PR |
389
399
  | **boundary** | builds the pinned agent image, brings up the default-deny network, runs `boundary-check` | Docker, arm64 runner | proves the sandbox enforces Cowork's limits — **no API key** |
390
400
  | **scenarios** | the live scenario suite at `container` fidelity, uploads transcripts/egress logs as artifacts | `ANTHROPIC_API_KEY` (or `CLAUDE_CODE_OAUTH_TOKEN`) | fork PRs: the whole job is skipped (`if:` guard); same-repo without a key: warns and exits 0 |
391
401
  | **parity-drift** | reminder to re-`sync` when Desktop updates | nothing | informational, never blocks |
package/dist/assert.js CHANGED
@@ -241,6 +241,15 @@ function check(a, ctx) {
241
241
  ? ok()
242
242
  : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected > ${aj.gt}`));
243
243
  }
244
+ // #4: set membership — the resolved value deep-equals one of a fixed set. Stable for stochastic
245
+ // (LLM-extracted) values where `equals` would churn across re-records. `present &&` guard mirrors
246
+ // `equals` so an absent value never vacuously satisfies it.
247
+ if (aj.in !== undefined) {
248
+ any = true;
249
+ results.push(present && Array.isArray(aj.in) && aj.in.some((x) => jsonEq(val, x))
250
+ ? ok()
251
+ : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected one of ${JSON.stringify(aj.in)}`));
252
+ }
244
253
  // No operator → an existence assertion (the value must be present).
245
254
  if (!any)
246
255
  results.push(present ? ok() : fail(`artifact_json: "${aj.path ?? "(root)"}" is not present (no operator given → existence check)`));
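The `in:` branch added in this hunk can be exercised in isolation. What follows is a minimal sketch, not the package's code: `jsonEq` is a hypothetical stand-in (the real helper is not shown in this diff) that compares canonical JSON forms with sorted keys, and `checkIn` is an illustrative name for the condition inside `results.push(...)`.

```javascript
// Hypothetical stand-in for the package's jsonEq (not shown in this diff):
// deep equality via a canonical JSON form with sorted object keys.
function canonical(v) {
  if (Array.isArray(v)) return `[${v.map(canonical).join(",")}]`;
  if (v !== null && typeof v === "object") {
    const body = Object.keys(v)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(v[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(v);
}
const jsonEq = (a, b) => canonical(a) === canonical(b);

// Illustrative mirror of the `in:` condition: passes only when the value is
// present AND deep-equals at least one member of the allowed set.
function checkIn(present, val, allowed) {
  return present && Array.isArray(allowed) && allowed.some((x) => jsonEq(val, x));
}

console.log(checkIn(true, "Series A", ["Seed", "Series A"])); // true
console.log(checkIn(true, { a: 1, b: 2 }, [{ b: 2, a: 1 }])); // true (key order ignored)
console.log(checkIn(false, undefined, [undefined])); // false (absent never passes)
```

The `present &&` guard is what the hunk's comment calls out: without it, an absent value could vacuously satisfy a set that happens to contain `undefined` or `null`.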
package/dist/cli.js CHANGED
@@ -13,7 +13,7 @@ import { vmInit, vmDelete, vmStatus, vmPrune, instanceName } from "./runtime/lim
13
13
  import { sync } from "./sync/cowork-sync.js";
14
14
  import { runBoundaryChecks, formatBoundary } from "./boundary.js";
15
15
  import { cmdChat } from "./run/chat.js";
16
- import { cmdRecord, cmdReplay } from "./run/cassette.js";
16
+ import { cmdRecord, cmdReplay, cmdVerifyCassettes } from "./run/cassette.js";
17
17
  import { loadDotenv } from "./dotenv.js";
18
18
  import { makeRenderer, renderStart, renderFooter, startHeartbeat } from "./run/renderer.js";
19
19
  import { resolveEventsFile, buildTrace, formatTrace, buildGateTrace, formatGateTrace, buildDispatchTree, formatDispatchTree, } from "./run/trace-view.js";
@@ -50,6 +50,8 @@ const HELP = `cowork-harness <command> (v${"$VERSION"})
50
50
  [--out <file>] cassette path (default: cassettes/<scenario-name>.cassette.json)
51
51
  replay --cassette <file> deterministic protocol-replay of a cassette (no token) [--output-format json]
52
52
  [--strict] escalate a cassette-staleness warning (baseline/skill drift) to a failure
53
+ verify-cassettes <file|dir> CI gate (no token): privacy scan + staleness — exit 1 on a PII finding or drift
54
+ [--privacy-only|--staleness-only] [--allow <regex>]... [--output-format json]
53
55
  trace <run-id | dir | path> digest a run's events.jsonl (tools+result status, dispatches, decisions)
54
56
  [--tools] tool/dispatch rows only [--gates] gate lifecycle (question→answer→delivered)
55
57
  [--dispatches] sub-agent dispatch tree + the real total (read off dispatch_count_max)
@@ -199,6 +201,7 @@ async function main() {
199
201
  "chat",
200
202
  "record",
201
203
  "replay",
204
+ "verify-cassettes",
202
205
  "trace",
203
206
  "assert",
204
207
  "scaffold",
@@ -260,6 +263,8 @@ async function main() {
260
263
  return cmdRecord(rest);
261
264
  case "replay":
262
265
  return cmdReplay(rest);
266
+ case "verify-cassettes":
267
+ return cmdVerifyCassettes(rest);
263
268
  case "trace":
264
269
  return cmdTrace(rest);
265
270
  case "assert":
package/dist/redact.js ADDED
@@ -0,0 +1,101 @@
1
+ /**
2
+ * Content redaction for committed cassettes (#1 / A1). DISTINCT from `secrets.ts` (which scrubs auth
3
+ * tokens): this redacts author-configured PII patterns out of the cassette surface before it is written.
4
+ *
5
+ * Two hard requirements drive the design:
6
+ * - STRUCTURAL (not line-level) for JSON protocol lines: events/controlOut are JSON; redacting their raw
7
+ * text could unbalance the JSON (→ a silently skipped line on replay) or desync the AskUserQuestion
8
+ * question/answer strings the O7 guard compares across events and controlOut. So JSON is parsed, every
9
+ * string LEAF and object KEY is redacted, then re-serialized.
10
+ * - COLLISION-SAFE deterministic tokens: `[REDACTED:<label>:<hash>]`. The hash (of the matched text) keeps
11
+ * the token stable across re-records (no churn) AND injective — two distinct names never collapse into a
12
+ * single `answers` map key. A genuine collision (astronomically rare) fails loud, never silently merges.
13
+ */
14
+ import { createHash } from "node:crypto";
15
+ import { existsSync, readFileSync } from "node:fs";
16
+ import { join } from "node:path";
17
+ export const EMPTY_POLICY = { patterns: [], keyNames: [] };
18
+ function csv(v) {
19
+ return (v ?? "")
20
+ .split(",")
21
+ .map((s) => s.trim())
22
+ .filter(Boolean);
23
+ }
24
+ /** Assemble a redaction policy from `.cowork-redact.json` (searched in `searchDirs`, e.g. cwd then the
25
+ * scenario/cassette dir) merged with `COWORK_HARNESS_REDACT_PATTERNS`/`_KEYS`. No config + no env →
26
+ * EMPTY_POLICY (the opt-in default; the A2 scanner is the always-on safety net). A malformed regex throws —
27
+ * a silently-dropped redaction rule is under-redaction, i.e. a leak. */
28
+ export function loadRedactionPolicy(searchDirs) {
29
+ const patterns = [];
30
+ const keyNames = [];
31
+ const seen = new Set();
32
+ for (const dir of searchDirs) {
33
+ const f = join(dir, ".cowork-redact.json");
34
+ if (seen.has(f) || !existsSync(f))
35
+ continue;
36
+ seen.add(f);
37
+ const cfg = JSON.parse(readFileSync(f, "utf8"));
38
+ for (const p of cfg.patterns ?? [])
39
+ patterns.push({ re: new RegExp(p.regex, p.flags ?? "g"), label: p.label ?? "redacted" });
40
+ for (const k of cfg.keys ?? [])
41
+ keyNames.push(k);
42
+ }
43
+ for (const src of csv(process.env.COWORK_HARNESS_REDACT_PATTERNS))
44
+ patterns.push({ re: new RegExp(src, "g"), label: "redacted" });
45
+ for (const k of csv(process.env.COWORK_HARNESS_REDACT_KEYS))
46
+ keyNames.push(k);
47
+ return { patterns, keyNames };
48
+ }
49
+ /** Stable, collision-safe token for a matched span. Depends ONLY on the matched text (context-free), so the
50
+ * same logical string redacts identically wherever it appears (events question text == controlOut answers
51
+ * key) — the property the O7 guard relies on. */
52
+ function token(label, match) {
53
+ const h = createHash("sha256").update(match).digest("hex").slice(0, 12);
54
+ return `[REDACTED:${label}:${h}]`;
55
+ }
56
+ /** Apply every pattern to a single string. Each pattern is forced global so all occurrences go. */
57
+ export function redactText(text, policy) {
58
+ let out = text;
59
+ for (const { re, label } of policy.patterns) {
60
+ const g = new RegExp(re.source, re.flags.includes("g") ? re.flags : re.flags + "g");
61
+ out = out.replace(g, (m) => token(label, m));
62
+ }
63
+ return out;
64
+ }
65
+ /** Recursively redact a parsed JSON value: string leaves AND object keys (C3). Numbers/booleans/null pass
66
+ * through. A key collision after redaction (two distinct keys → one) throws — a silent merge would lose data
67
+ * and (for an `answers` map) break replay. */
68
+ export function redactStructural(value, policy) {
69
+ if (typeof value === "string")
70
+ return redactText(value, policy);
71
+ if (Array.isArray(value))
72
+ return value.map((v) => redactStructural(v, policy));
73
+ if (value !== null && typeof value === "object") {
74
+ const out = {};
75
+ for (const [k, v] of Object.entries(value)) {
76
+ const rk = redactText(k, policy);
77
+ // A value under a configured key is redacted wholesale regardless of TYPE (a sensitive number/object
78
+ // leaks just like a string). The hash is over its JSON form so the token stays deterministic.
79
+ const rv = policy.keyNames.includes(k) ? token("key", typeof v === "string" ? v : JSON.stringify(v)) : redactStructural(v, policy);
80
+ if (Object.prototype.hasOwnProperty.call(out, rk))
81
+ throw new Error(`redaction collision: two distinct keys both redacted to "${rk}" — refusing to silently merge`);
82
+ out[rk] = rv;
83
+ }
84
+ return out;
85
+ }
86
+ return value;
87
+ }
88
+ /** Redact one JSONL protocol line. If it parses as JSON, redact structurally (guaranteeing it still parses);
89
+ * otherwise fall back to safe text redaction (a non-JSON line has no protocol coupling). */
90
+ export function redactJsonLine(line, policy) {
91
+ if (!line.trim())
92
+ return line;
93
+ let parsed;
94
+ try {
95
+ parsed = JSON.parse(line);
96
+ }
97
+ catch {
98
+ return redactText(line, policy);
99
+ }
100
+ return JSON.stringify(redactStructural(parsed, policy));
101
+ }
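The property the header comment describes (the same string redacts to the same token whether it appears as a string leaf or as an object key) can be checked with a small sketch. This is an illustration under assumptions, not the module itself: `shortHash` is a hypothetical FNV-1a stand-in for the SHA-256 prefix so the block needs no imports, the `keyNames` path is omitted, and the policy and protocol lines are invented for the example.

```javascript
// Stand-in for the sha256 prefix used in redact.js, so the sketch needs no imports.
// FNV-1a (32-bit) is deterministic but NOT collision-resistant like SHA-256.
function shortHash(s) {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}
const token = (label, match) => `[REDACTED:${label}:${shortHash(match)}]`;

// Apply every pattern to one string (patterns here are already global).
function redactText(text, policy) {
  let out = text;
  for (const { re, label } of policy.patterns) out = out.replace(re, (m) => token(label, m));
  return out;
}

// Redact string leaves AND object keys; throw if two keys collapse into one.
function redactStructural(value, policy) {
  if (typeof value === "string") return redactText(value, policy);
  if (Array.isArray(value)) return value.map((v) => redactStructural(v, policy));
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const [k, v] of Object.entries(value)) {
      const rk = redactText(k, policy);
      if (Object.prototype.hasOwnProperty.call(out, rk)) throw new Error(`redaction collision on "${rk}"`);
      out[rk] = redactStructural(v, policy);
    }
    return out;
  }
  return value;
}

// Hypothetical policy and protocol lines (names and shapes are illustrative).
const policy = { patterns: [{ re: /Acme Corp/g, label: "org" }] };
const eventLine = JSON.stringify({ question: "Which plan for Acme Corp?" });
const controlLine = JSON.stringify({ answers: { "Which plan for Acme Corp?": "Pro" } });

const redactedEvent = redactStructural(JSON.parse(eventLine), policy);
const redactedControl = redactStructural(JSON.parse(controlLine), policy);

console.log(redactedEvent.question);
// The answers key still matches the question text after redaction.
console.log(Object.keys(redactedControl.answers)[0] === redactedEvent.question); // true
```

Because the token depends only on the matched text, the question in the event line and the key in the answers map stay identical after redaction, which is the coupling the O7 guard is described as comparing.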
package/dist/run/cassette.js CHANGED
@@ -3,8 +3,8 @@ import { readFileSync, writeFileSync, mkdirSync, mkdtempSync, existsSync, readdi
3
3
  import { createHash } from "node:crypto";
4
4
  import { tmpdir } from "node:os";
5
5
  import { join, dirname, relative, isAbsolute } from "node:path";
6
- import { executeScenario, parseScenarioFile, collectArtifacts } from "./execute.js";
7
- import { loadSession } from "../session.js";
6
+ import { executeScenario, parseScenarioFile, collectArtifacts, parseSessionFile } from "./execute.js";
7
+ import { loadSession, resolveSessionPaths } from "../session.js";
8
8
  import { loadBaseline } from "../baseline.js";
9
9
  import { Run } from "./run.js";
10
10
  import { parseMessage, serializeDecision, deserializeDecision, canon, } from "../agent/session.js";
@@ -13,6 +13,10 @@ import { evaluate } from "../assert.js";
13
13
  import { makeRenderer, renderFooter } from "./renderer.js";
14
14
  import { jsonEnvelope, parseOutputFormat } from "./envelope.js";
15
15
  import { computeVerdict } from "./verdict.js";
16
+ import { redactJsonLine, redactText, redactStructural, loadRedactionPolicy } from "../redact.js";
17
+ import { collectSecrets, scrub } from "../secrets.js";
18
+ import { scanText, DEFAULT_SCAN_PATTERNS, EMAIL_SCAN_PATTERNS } from "../scan.js";
19
+ import { parse as parseYaml } from "yaml";
16
20
  const out = (s) => process.stdout.write(s + "\n");
17
21
  const log = (s) => process.stderr.write(s + "\n");
18
22
  /** Current cassette format version. Readers tolerate ABSENT (legacy → 0) and warn on a FUTURE version. */
@@ -47,8 +51,10 @@ function materializeManifest(entries) {
47
51
  }
48
52
  return { workRoot, prefixes: ["outputs", ".projects"] };
49
53
  }
50
- /** Hash a directory's file CONTENTS recursively (sorted, path-relative) — stable across machines. */
51
- function hashDir(dir, hash) {
54
+ /** Hash a directory's structure + file CONTENTS recursively (sorted) — stable across machines. The hash
55
+ * folds in each entry's RELATIVE path (not just its basename) plus a type marker, so a file MOVING within
56
+ * the tree (`a/x.json` → `a/sub/x.json`, same content) changes the hash (S2 — basename-only missed moves). */
57
+ function hashDir(dir, hash, rel = "") {
52
58
  let entries;
53
59
  try {
54
60
  entries = readdirSync(dir).sort();
@@ -58,6 +64,7 @@ function hashDir(dir, hash) {
58
64
  }
59
65
  for (const name of entries) {
60
66
  const abs = join(dir, name);
67
+ const relPath = rel ? `${rel}/${name}` : name;
61
68
  let st;
62
69
  try {
63
70
  st = statSync(abs);
@@ -65,10 +72,12 @@ function hashDir(dir, hash) {
65
72
  catch {
66
73
  continue;
67
74
  }
68
- if (st.isDirectory())
69
- hashDir(abs, hash);
75
+ if (st.isDirectory()) {
76
+ hash.update(`D:${relPath}\n`); // structure marker — an empty/renamed dir registers too
77
+ hashDir(abs, hash, relPath);
78
+ }
70
79
  else if (st.isFile()) {
71
- hash.update(name);
80
+ hash.update(`F:${relPath}\n`); // relative path, not basename — a move changes the digest
72
81
  try {
73
82
  hash.update(readFileSync(abs));
74
83
  }
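Why the basename-only scheme missed moves can be seen by comparing the byte streams the two schemes feed to the hash. This is a hedged sketch over an in-memory tree (objects are directories, strings are file contents); `streamBasename` and `streamRelative` are illustrative names, and no real hashing or filesystem access is involved.

```javascript
// Old scheme: basename + contents, no directory marker.
function streamBasename(tree) {
  let s = "";
  for (const name of Object.keys(tree).sort()) {
    const node = tree[name];
    if (typeof node === "object") s += streamBasename(node);
    else s += name + node;
  }
  return s;
}

// New scheme: a type marker plus the RELATIVE path for every entry.
function streamRelative(tree, rel = "") {
  let s = "";
  for (const name of Object.keys(tree).sort()) {
    const relPath = rel ? `${rel}/${name}` : name;
    const node = tree[name];
    if (typeof node === "object") s += `D:${relPath}\n` + streamRelative(node, relPath);
    else s += `F:${relPath}\n` + node;
  }
  return s;
}

// The same file, moved from a/x.json to a/sub/x.json with identical content.
const before = { a: { "x.json": "{}" } };
const after = { a: { sub: { "x.json": "{}" } } };

console.log(streamBasename(before) === streamBasename(after)); // true  (move is invisible)
console.log(streamRelative(before) === streamRelative(after)); // false (move changes the input)
```

Identical hash input means an identical digest, so under the old scheme the moved file could never register as drift; the relative-path scheme produces different input and therefore, barring a hash collision, a different digest.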
@@ -78,28 +87,121 @@ function hashDir(dir, hash) {
78
87
  }
79
88
  }
80
89
  }
81
- /** #1b: the local skill/plugin/marketplace source dirs a session mounts — the "skill dir" hash unit. */
90
+ /** #1b: the local skill/plugin/marketplace source dirs a session mounts — the "skill dir" hash unit.
91
+ * Returns ABSOLUTE dirs (for hashing/reading) plus `baseDir`, the session-file dir the relative
92
+ * `skillSources` are stored against (so the committed fingerprint carries no absolute host path — C1). */
82
93
  function skillSourceDirs(sessionPath, cassetteDir) {
83
94
  const resolved = cassetteDir && !isAbsolute(sessionPath) ? join(cassetteDir, sessionPath) : sessionPath;
95
+ const baseDir = dirname(resolved);
84
96
  if (sessionPath === "(inline)" || !existsSync(resolved))
85
- return [];
97
+ return { dirs: [], baseDir };
86
98
  let cfg;
87
99
  try {
88
- cfg = loadSession(resolved);
100
+ // Mirror loadSessionFromFile (execute.ts): parse the YAML, then RESOLVE its relative skill/plugin
101
+ // paths against the session-file dir (`baseDir` — the post-cassetteDir-join location, so this works for
102
+ // both the record call (no cassetteDir) and the replay call (cassetteDir set)). Passing the raw path
103
+ // string to loadSession() throws (it wants parsed YAML) — the swallowed throw is why skillHash was
104
+ // silently never computed.
105
+ cfg = resolveSessionPaths(loadSession(parseSessionFile(resolved)), baseDir);
89
106
  }
90
107
  catch {
91
- return [];
108
+ return { dirs: [], baseDir };
92
109
  }
93
- return [...cfg.skills.local, ...cfg.plugins.local_plugins, ...cfg.plugins.remote_plugins, ...cfg.plugins.local_marketplaces].filter((d) => existsSync(d));
110
+ const dirs = [...cfg.skills.local, ...cfg.plugins.local_plugins, ...cfg.plugins.remote_plugins, ...cfg.plugins.local_marketplaces].filter((d) => existsSync(d));
111
+ return { dirs, baseDir };
94
112
  }
95
- function buildFingerprint(sessionPath, baselineAppVersion, cassetteDir) {
96
- const dirs = skillSourceDirs(sessionPath, cassetteDir);
113
+ export function buildFingerprint(sessionPath, baselineAppVersion, cassetteDir) {
114
+ const { dirs, baseDir } = skillSourceDirs(sessionPath, cassetteDir);
97
115
  if (dirs.length === 0)
98
116
  return { baseline: baselineAppVersion };
99
117
  const hash = createHash("sha256");
100
118
  for (const d of dirs.sort())
101
119
  hashDir(d, hash);
102
- return { baseline: baselineAppVersion, skillHash: hash.digest("hex"), skillSources: dirs };
120
+ // Store skillSources RELATIVE to the session-file dir — diagnostics only (the replay recompute re-derives
121
+ // the dirs from the session), so a relative path is enough and never leaks an absolute `/Users/...` path.
122
+ return { baseline: baselineAppVersion, skillHash: hash.digest("hex"), skillSources: dirs.map((d) => relative(baseDir, d)) };
123
+ }
124
+ /** A2: scan the WHOLE cassette surface for PII (default classes: email/currency/domain). A `truncated`
125
+ * artifact has NO committed body (hash-only) — nothing to leak — but is reported as `unscanned` so coverage
126
+ * is never silently implied. Real-class findings fail the gate; `unscanned` is informational. */
127
+ /** The agent's CAPABILITY MANIFEST — environment boilerplate, never user data, and the sole concentrated
128
+ * source of `domain`/`currency` scan noise (tool/skill catalog descriptions + MCP-server names a regex
129
+ * can't tell apart from customer data). Two stable structural forms:
130
+ * - the `system/init` event (tools/mcp_servers/skills/cwd registry), and
131
+ * - the `initialize` `control_response` (`request_id: "init-1"`; body = commands/agents/models/account).
132
+ * These get `email`-only scanning (email is universal — the `account` field can carry the dev's own email);
133
+ * the noisy classes are suppressed only here. */
134
+ function isCapabilityManifest(line) {
135
+ let m;
136
+ try {
137
+ m = JSON.parse(line);
138
+ }
139
+ catch {
140
+ return false;
141
+ }
142
+ if (m?.type === "system" && m?.subtype === "init")
143
+ return true;
144
+ if (m?.type === "control_response") {
145
+ const r = m.response ?? {};
146
+ if (r.request_id === "init-1")
147
+ return true;
148
+ const body = r.response;
149
+ if (body && typeof body === "object" && "commands" in body && "agents" in body)
150
+ return true; // shape fallback
151
+ }
152
+ return false;
153
+ }
154
+ export function scanCassette(cassette, allow) {
155
+ const findings = [];
156
+ const FULL = DEFAULT_SCAN_PATTERNS; // email + currency + domain
157
+ const EMAIL = EMAIL_SCAN_PATTERNS; // email only — for the capability-manifest messages
158
+ // Transcript: full net EXCEPT the capability-manifest messages (catalog noise), where only email runs.
159
+ cassette.events.forEach((l, i) => findings.push(...scanText(l, `events[${i}]`, allow, isCapabilityManifest(l) ? EMAIL : FULL)));
160
+ cassette.controlOut?.forEach((l, i) => findings.push(...scanText(l, `controlOut[${i}]`, allow, isCapabilityManifest(l) ? EMAIL : FULL)));
161
+ // Deliverable + author-written fields — full net (a real cap table's figures/domains live here).
162
+ for (const a of cassette.artifacts ?? []) {
163
+ findings.push(...scanText(a.path, `artifact path ${a.path}`, allow, FULL)); // a filename can name a customer
164
+ if (a.body !== undefined)
165
+ findings.push(...scanText(a.body, `artifact ${a.path}`, allow, FULL));
166
+ else if (a.truncated)
167
+ findings.push({ where: `artifact ${a.path}`, cls: "unscanned", sample: "(body not committed — too large or unreadable)" });
168
+ }
169
+ findings.push(...scanText(cassette.scenario.prompt, "scenario.prompt", allow, FULL));
170
+ findings.push(...scanText(JSON.stringify(cassette.scenario.answers ?? null), "scenario.answers", allow, FULL));
171
+ findings.push(...scanText(JSON.stringify(cassette.scenario.assert ?? null), "scenario.assert", allow, FULL));
172
+ for (const s of cassette.fingerprint?.skillSources ?? [])
173
+ findings.push(...scanText(s, "fingerprint.skillSources", allow, FULL));
174
+ return findings;
175
+ }
176
+ /** B3 staleness GATE: recompute the fingerprint and report drift. Unlike `replayCassette` (which WARNS),
177
+ * the gate treats an unresolvable skillHash as a failure — can't verify ⇒ not green. No fingerprint → nothing
178
+ * to check (legacy cassette). */
179
+ export function checkStaleness(cassette, cassetteDir) {
180
+ const fp = cassette.fingerprint;
181
+ if (!fp)
182
+ return [];
183
+ const msgs = [];
184
+ let liveBaseline;
185
+ try {
186
+ liveBaseline = loadBaseline("latest").appVersion;
187
+ }
188
+ catch {
189
+ /* baseline not loadable */
190
+ }
191
+ // Gate mode: can't verify ⇒ not green. The cassette carries a baseline-of-record but we can't load the
192
+ // current one to compare — a fail, not a silent skip (baselines ship with the package, so this is rare).
193
+ if (liveBaseline === undefined)
194
+ msgs.push("cannot load the latest baseline to verify staleness — run `cowork-harness sync` or ship baselines/ (can't verify ⇒ not green)");
195
+ else if (liveBaseline !== fp.baseline)
196
+ msgs.push(`baseline moved ${fp.baseline} → ${liveBaseline} since record — re-record`);
197
+ if (fp.skillHash) {
198
+ const live = buildFingerprint(cassette.scenario.session, fp.baseline, cassetteDir);
199
+ if (live.skillHash === undefined)
200
+ msgs.push("skill dirs not resolvable from the cassette location — cannot verify staleness (gate fails: can't verify ⇒ not green)");
201
+ else if (live.skillHash !== fp.skillHash)
202
+ msgs.push("local skill/plugin dir contents changed since record — re-record");
203
+ }
204
+ return msgs;
103
205
  }
104
206
  /** A minimal RunRecord for a truncated-cassette replay — empty collections so downstream evaluate()/the
105
207
  * mismatch loops don't NPE; result:"error" because the cassette could not be driven to completion. */
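The per-line pattern selection in `scanCassette` can be sketched as follows. The regexes are hypothetical placeholders (the real `DEFAULT_SCAN_PATTERNS` and `EMAIL_SCAN_PATTERNS` live in `scan.js`, which this diff does not show), `scanLine` is an illustrative name, and the shape fallback on `commands`/`agents` is left out for brevity.

```javascript
// Hypothetical pattern sets; placeholders for the ones exported by scan.js.
const EMAIL = [{ cls: "email", re: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/g }];
const FULL = [...EMAIL, { cls: "domain", re: /\b[a-z0-9-]+\.(?:com|io|net|org)\b/gi }];

// Structural test for the capability manifest: a system/init event,
// or the initialize control_response carrying request_id "init-1".
function isCapabilityManifest(line) {
  let m;
  try {
    m = JSON.parse(line);
  } catch {
    return false;
  }
  if (m?.type === "system" && m?.subtype === "init") return true;
  return m?.type === "control_response" && m.response?.request_id === "init-1";
}

// Illustrative scanner: returns the classes that matched one line.
// Manifest lines get the email-only net; everything else gets the full net.
function scanLine(line) {
  const patterns = isCapabilityManifest(line) ? EMAIL : FULL;
  return patterns.filter(({ re }) => line.search(re) !== -1).map(({ cls }) => cls);
}

const init = JSON.stringify({ type: "system", subtype: "init", mcp_servers: ["files.example.com"] });
const account = JSON.stringify({
  type: "control_response",
  response: { request_id: "init-1", response: { account: "dev@example.com" } },
});
const transcript = JSON.stringify({ type: "assistant", text: "see acme.com for details" });

console.log(scanLine(init)); // []
console.log(scanLine(account)); // [ 'email' ]
console.log(scanLine(transcript)); // [ 'domain' ]
```

The design choice this illustrates is that suppression is scoped by message structure rather than by content: catalog domains in the manifest stay quiet, while an email in the registry `account` field is still reported.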
@@ -270,75 +372,246 @@ const NOOP_DECIDER = {
270
372
  return ABSTAIN;
271
373
  },
272
374
  };
273
- /** `record <scenario.yaml> [--out <file>]` run live + save a cassette. */
375
+ /** Apply CONTENT redaction (the opt-in policy) across the WHOLE cassette surface (C1): events/controlOut
376
+ * protocol lines (structurally — string leaves AND object keys, keeping JSON valid + the O7 question/answer
377
+ * strings in sync), artifact bodies, the scenario prompt/answers/assert metadata, and the diagnostic
378
+ * skillSources. Identity fields (name/session/fidelity/baseline) are left intact so replay still resolves.
379
+ * Pure — returns a new cassette. Distinct from secret-scrub (`scrub`), which runs first. */
380
+ export function redactCassette(cassette, policy) {
381
+ const scenario = {
382
+ ...cassette.scenario,
383
+ prompt: redactText(cassette.scenario.prompt, policy),
384
+ answers: redactStructural(cassette.scenario.answers, policy),
385
+ assert: redactStructural(cassette.scenario.assert, policy),
386
+ };
387
+ return {
388
+ ...cassette,
389
+ scenario,
390
+ events: cassette.events.map((l) => redactJsonLine(l, policy)),
391
+ controlOut: cassette.controlOut?.map((l) => redactJsonLine(l, policy)),
392
+ artifacts: cassette.artifacts?.map((a) => ({
393
+ ...a,
394
+ path: redactText(a.path, policy), // C1: a filename can name a customer (outputs/Acme-cap-table.json)
395
+ ...(a.body !== undefined ? { body: redactJsonLine(a.body, policy) } : {}),
396
+ })),
397
+ fingerprint: cassette.fingerprint
398
+ ? { ...cassette.fingerprint, skillSources: cassette.fingerprint.skillSources?.map((s) => redactText(s, policy)) }
399
+ : undefined,
400
+ };
401
+ }
402
+ /** A3 / C4 cardinal-sin guard: redaction must be VERDICT-PRESERVING. Replay both the pre-redaction and the
403
+ * redacted cassette (token-free) and compare verdicts; if redaction flipped any replay-checkable assertion
404
+ * (e.g. stripped a value a `transcript_not_matches` keys on, manufacturing a green), throw — never write a
405
+ * cassette whose verdict was changed by redaction. */
406
+ export async function assertRedactionVerdictPreserved(base, redacted) {
407
+ const vb = computeVerdict(await replayCassette(base), "replay");
408
+ const vr = computeVerdict(await replayCassette(redacted), "replay");
409
+ if (vb.pass !== vr.pass)
410
+ throw new Error(`redaction changed the replay verdict (pre-redaction pass=${vb.pass} → redacted pass=${vr.pass}) — redaction altered an ` +
411
+ `asserted observable; refusing to write a cassette whose verdict was manufactured by redaction (A3). ` +
412
+ `Record against synthetic inputs, or narrow the redaction policy so it doesn't touch asserted values.`);
413
+ }
414
+ /** B1: classify the `*.yaml`/`*.yml` (non-recursive) under `dir` for batch `record`. Classification keys on a
415
+ * POSITIVE `prompt:` signal — NOT on "Scenario.parse threw", because a session YAML and a broken scenario
416
+ * both throw the same error. A doc with `prompt:` that fails to parse is BROKEN (a batch failure), never a
417
+ * silent skip — silently swallowing a broken scenario as a non-scenario is the false-green this guards. */
418
+ export function discoverScenarios(dir) {
419
+ const files = readdirSync(dir)
420
+ .filter((f) => /\.ya?ml$/i.test(f))
421
+ .sort()
422
+ .map((f) => join(dir, f));
423
+ const out = { scenarios: [], skipped: [], broken: [] };
424
+ for (const f of files) {
425
+ let raw;
426
+ try {
427
+ raw = parseYaml(readFileSync(f, "utf8"));
428
+ }
429
+ catch (e) {
430
+ out.broken.push({ file: f, error: `YAML parse error: ${e.message}` });
431
+ continue;
432
+ }
433
+ const hasPrompt = raw !== null && typeof raw === "object" && "prompt" in raw;
434
+ if (!hasPrompt) {
435
+ out.skipped.push(f); // no prompt → a session/other doc; announced skip, not a failure
436
+ continue;
437
+ }
438
+ try {
439
+ parseScenarioFile(f);
440
+ out.scenarios.push(f);
441
+ }
442
+ catch (e) {
443
+ out.broken.push({ file: f, error: e.message });
444
+ }
445
+ }
446
+ return out;
447
+ }
448
+ /** Read + parse a cassette, never throwing — a malformed `*.cassette.json` must be TALLIED, not crash a
449
+ * whole batch (a crash mid-walk reads as "the rest were fine" — a false-green by abort). */
450
+ function readCassette(path) {
451
+ try {
452
+ return { cassette: JSON.parse(readFileSync(path, "utf8")) };
453
+ }
454
+ catch (e) {
455
+ return { error: `unreadable / invalid cassette JSON: ${e.message}` };
456
+ }
457
+ }
458
+ /** B2: the committed cassettes under `dir` whose fingerprint has drifted (baseline/skill) — the re-record
459
+ * work-list. Pure + token-free (reuses `checkStaleness`); the actual re-record needs the live agent. A
460
+ * malformed cassette is surfaced as stale (needs attention) rather than silently dropped. */
461
+ export function selectStaleCassettes(dir) {
462
+ return readdirSync(dir)
463
+ .filter((f) => f.endsWith(".cassette.json"))
464
+ .sort()
465
+ .map((f) => join(dir, f))
466
+ .map((path) => {
467
+ const r = readCassette(path);
468
+ return "error" in r ? { path, staleness: [r.error] } : { path, staleness: checkStaleness(r.cassette, dirname(path)) };
469
+ })
470
+ .filter((x) => x.staleness.length > 0);
471
+ }
472
+ /** Record one scenario FILE → one cassette (parses the file, then shares the live-record tail with the
473
+ * in-memory path). The file's dir feeds the redaction-policy search (for a co-located .cowork-redact.json). */
474
+ async function recordScenarioFile(file, opts) {
475
+ return recordScenarioObject(parseScenarioFile(file), opts, [dirname(file)]);
476
+ }
477
+ /** `record <scenario.yaml | dir> [--out <file>] [--rerecord-stale] [--no-redact] [--allow-failing]` —
478
+ * run live + save a cassette. A single file records one; a dir batches (B1); --rerecord-stale (B2) treats
479
+ * the dir as committed cassettes and re-records only those whose fingerprint drifted. */
274
480
  export async function cmdRecord(args) {
481
+ const noRedact = args.includes("--no-redact");
482
+ if (noRedact)
483
+ log("record: --no-redact — content redaction is OFF; the cassette is written verbatim, so ensure inputs are synthetic.");
484
+ const allowFailing = args.includes("--allow-failing");
485
+ const rerecordStale = args.includes("--rerecord-stale");
275
486
  const outIdx = args.indexOf("--out");
276
- // #9: bounds-check --out's value — a trailing `--out` makes cassettePath undefined → a raw
277
- // dirname(undefined)/writeFileSync(undefined) crash surfacing as an `internal` error.
278
487
  if (outIdx >= 0 && args[outIdx + 1] === undefined) {
279
488
  log("usage: record <scenario.yaml> --out <file.cassette.json> (--out needs a value)");
280
- process.exit(2);
489
+ return process.exit(2);
281
490
  }
282
- // Skip --out's VALUE when scanning for the scenario positional, so the common flag-first form
283
- // `record --out out.json scenario.yaml` records scenario.yaml (not out.json) as the scenario.
284
- const scenarioPositionals = args.filter((a, i) => !a.startsWith("--") && args[i - 1] !== "--out");
285
- const file = scenarioPositionals[0];
286
- if (!file) {
287
- log("usage: record <scenario.yaml> [--out <file.cassette.json>]");
288
- process.exit(2);
491
+ const positionals = args.filter((a, i) => !a.startsWith("--") && args[i - 1] !== "--out");
492
+ const target = positionals[0];
493
+ if (!target) {
494
+ log("usage: record <scenario.yaml | dir/> [--out <file>] [--rerecord-stale] [--no-redact] [--allow-failing]");
495
+ return process.exit(2);
289
496
  }
290
- // Reject extra scenario positionals rather than silently dropping all but the first (record takes ONE).
291
- if (scenarioPositionals.length > 1) {
292
- log(`record takes a single scenario (got ${scenarioPositionals.length}: ${scenarioPositionals.join(", ")})`);
293
- process.exit(2);
497
+ if (positionals.length > 1) {
498
+ log(`record takes a single scenario or dir (got ${positionals.length}: ${positionals.join(", ")})`);
499
+ return process.exit(2);
294
500
  }
295
- const scenario = parseScenarioFile(file);
501
+ const isDir = existsSync(target) && statSync(target).isDirectory();
502
+ // B2: re-record only the drifted cassettes in a committed cassette dir.
503
+ if (rerecordStale) {
504
+ if (!isDir) {
505
+ log("record --rerecord-stale takes a DIRECTORY of committed cassettes");
506
+ return process.exit(2);
507
+ }
508
+ const stale = selectStaleCassettes(target);
509
+ if (stale.length === 0) {
510
+ log(`✓ record --rerecord-stale: all cassettes under ${target} are fresh — nothing to re-record`);
511
+ return process.exit(0);
512
+ }
513
+ let failures = 0;
514
+ for (const { path: cp, staleness } of stale) {
515
+ const rc = readCassette(cp);
516
+ if ("error" in rc) {
517
+ failures++;
518
+ log(` ✗ ${cp}: ${rc.error} — cannot re-record`);
519
+ continue;
520
+ }
521
+ const cassette = rc.cassette;
522
+ // Re-record from the embedded scenario, re-resolving its relocatable session against the cassette dir.
523
+ const sessionRef = cassette.scenario.session === "(inline)" ? "(inline)" : join(dirname(cp), cassette.scenario.session);
524
+ log(`↻ re-recording ${cp} (stale: ${staleness.join("; ")})`);
525
+ try {
526
+ const r = await recordScenarioObject({ ...cassette.scenario, session: sessionRef }, { noRedact, allowFailing, cassettePath: cp });
527
+ log(` ✓ ${cp} (${r.result.result})`);
528
+ }
529
+ catch (e) {
530
+ failures++;
531
+ log(` ✗ ${cp}: ${e.message}`);
532
+ }
533
+ }
534
+ return process.exit(failures > 0 ? 1 : 0);
535
+ }
536
+ // B1: batch a directory of scenarios.
537
+ if (isDir) {
538
+ const disc = discoverScenarios(target);
539
+ for (const s of disc.skipped)
540
+ log(`· skipped (not a scenario — no \`prompt:\`): ${s}`);
541
+ for (const b of disc.broken)
542
+ log(`✗ ${b.file}: ${b.error}`);
543
+ if (disc.scenarios.length === 0) {
544
+ log(`record: no scenarios discovered under ${target} (loud non-zero — not a vacuous "0 failures = green")`);
545
+ return process.exit(2);
546
+ }
547
+ let failures = disc.broken.length;
548
+ for (const f of disc.scenarios) {
549
+ try {
550
+ const r = await recordScenarioFile(f, { noRedact, allowFailing });
551
+ log(`✓ ${f} → ${r.cassettePath} (${r.result.result})`);
552
+ }
553
+ catch (e) {
554
+ failures++;
555
+ log(`✗ ${f}: ${e.message}`);
556
+ }
557
+ }
558
+ log(failures > 0
559
+ ? `✗ record: ${failures} of ${disc.scenarios.length + disc.broken.length} failed`
560
+ : `✓ record: ${disc.scenarios.length} cassette(s)`);
561
+ return process.exit(failures > 0 ? 1 : 0);
562
+ }
563
+ // Single scenario file.
564
+ try {
565
+ const cassettePath = outIdx >= 0 ? args[outIdx + 1] : undefined;
566
+ const r = await recordScenarioFile(target, { noRedact, allowFailing, cassettePath });
567
+ log(`✓ recorded ${r.result.result} · ${r.artifacts} artifact(s) → ${r.cassettePath}`);
568
+ }
569
+ catch (e) {
570
+ log(`record: ${e.message}`);
571
+ return process.exit(1);
572
+ }
573
+ }
574
+ /** The live-record TAIL shared by the file (B1/single) and in-memory (B2 re-record) paths: run live, refuse
575
+ * a failing run unless opted in (A3), snapshot + secret-scrub bodies (C2), opt-in redact + verdict-preserve
576
+ * (A1/A3), write. `extraPolicyDirs` adds the scenario-file dir to the .cowork-redact.json search. */
577
+ async function recordScenarioObject(scenario, opts, extraPolicyDirs = []) {
296
578
  const result = await executeScenario(scenario);
297
- const events = safeLines(join(result.outDir, "events.jsonl"));
298
- const controlOut = safeLines(join(result.outDir, "control-out.jsonl"));
299
- const cassettePath = outIdx >= 0 ? args[outIdx + 1] : join("cassettes", `${scenario.name}.cassette.json`);
579
+ const cassettePath = opts.cassettePath ?? join("cassettes", `${scenario.name}.cassette.json`);
300
580
  mkdirSync(dirname(cassettePath), { recursive: true });
301
- // Store a RELOCATABLE session path (relative to the cassette dir) instead of the absolute resolved path
302
- // parseScenarioFile baked in — replay never loads the session for the pipeline, so this is metadata-only,
303
- // but it keeps a moved bundle honest. Record the resolved tier so replay can report effectiveFidelity.
581
+ // A3: a failing live run frozen into a cassette is a latent false-signal, so refuse unless opted in.
582
+ if (!computeVerdict(result, "live").pass && !opts.allowFailing)
583
+ throw new Error(`live run did NOT pass (result=${result.result}); refusing to freeze a failing run (re-run, or --allow-failing)`);
584
+ // RELOCATABLE session path (relative to the cassette dir) — metadata-only, keeps a moved bundle honest.
304
585
  const relocatable = {
305
586
  ...scenario,
306
587
  session: scenario.session === "(inline)" ? "(inline)" : relative(dirname(cassettePath), scenario.session),
307
588
  };
308
- // #1: snapshot the user-visible artifacts (from the live work root, before --keep cleanup) so
309
- // file_exists/user_visible_artifact/artifact_json survive token-free replay. #1b: a staleness tripwire
310
- // over the recording's inputs (baseline + local skill dirs) — `scenario.session` is still absolute here.
311
- const artifacts = result.workDir ? buildManifest(result.workDir) : [];
312
- const fingerprint = buildFingerprint(scenario.session, result.baseline);
313
- const cassette = {
589
+ // C2: buildManifest reads output bodies RAW (executeScenario scrubs result/events/control-out, NOT
590
+ // outputs/), so secret-scrub each body before it is committed.
591
+ const secrets = collectSecrets();
592
+ const artifacts = (result.workDir ? buildManifest(result.workDir) : []).map((a) => a.body !== undefined ? { ...a, body: scrub(a.body, secrets) } : a);
593
+ const base = {
314
594
  cassetteVersion: CASSETTE_VERSION,
315
595
  scenario: relocatable,
316
- events,
317
- controlOut,
596
+ events: safeLines(join(result.outDir, "events.jsonl")),
597
+ controlOut: safeLines(join(result.outDir, "control-out.jsonl")),
318
598
  effectiveFidelity: result.effectiveFidelity,
319
599
  artifacts,
320
- fingerprint,
600
+ fingerprint: buildFingerprint(scenario.session, result.baseline),
321
601
  };
322
- writeFileSync(cassettePath, JSON.stringify(cassette, null, 2));
323
- // capture summary: turns (assistant messages) + tool calls in the recording
324
- let turns = 0;
325
- let tools = 0;
326
- for (const e of cassette.events) {
327
- let m;
328
- try {
329
- m = JSON.parse(e);
330
- }
331
- catch {
332
- continue;
333
- }
334
- if (m.type === "assistant") {
335
- turns++;
336
- for (const b of m.message?.content ?? [])
337
- if (b.type === "tool_use")
338
- tools++;
339
- }
602
+ // A1 (opt-in) content redaction over the whole surface (C1). Empty policy → no-op. Non-empty → must be
603
+ // VERDICT-PRESERVING (A3): replay both and refuse to write on divergence (a manufactured green).
604
+ const policy = opts.noRedact
605
+ ? { patterns: [], keyNames: [] }
606
+ : loadRedactionPolicy([process.cwd(), ...extraPolicyDirs, dirname(cassettePath)]);
607
+ let cassette = base;
608
+ if (policy.patterns.length || policy.keyNames.length) {
609
+ const redacted = redactCassette(base, policy);
610
+ await assertRedactionVerdictPreserved(base, redacted);
611
+ cassette = redacted;
340
612
  }
341
- log(`✓ recorded ${events.length} events · ${turns} turns · ${tools} tool calls · ${artifacts.length} artifact(s) → ${cassettePath} (${result.result})`);
613
+ writeFileSync(cassettePath, JSON.stringify(cassette, null, 2));
614
+ return { result, cassettePath, artifacts: artifacts.length };
342
615
  }
343
616
  /** `replay --cassette <file>` — deterministic protocol-replay; re-evaluates content assertions. */
344
617
  export async function cmdReplay(args) {
@@ -374,6 +647,94 @@ export async function cmdReplay(args) {
374
647
  renderFooter(result, plan, { renderer, lane: "replay" });
375
648
  process.exit(verdict.exitCode);
376
649
  }
650
+ /** `verify-cassettes <file|dir>` — the CI gate (token/agent-free). Runs the privacy scan (A2) and the
651
+ * staleness check (B3) over one cassette or every `*.cassette.json` in a dir (non-recursive). Exit 1 on any
652
+ * real PII finding or staleness drift; `unscanned` notes are informational. Dedicated JSON envelope. */
653
+ export function cmdVerifyCassettes(args) {
654
+ let json;
655
+ try {
656
+ json = parseOutputFormat(args) === "json";
657
+ }
658
+ catch (e) {
659
+ log(String(e.message));
660
+ return process.exit(2);
661
+ }
662
+ const privacyOnly = args.includes("--privacy-only");
663
+ const stalenessOnly = args.includes("--staleness-only");
664
+ // Both flags together would disable BOTH families → empty findings → ok=true → exit 0: a silent
665
+ // false-green in the gate itself. Reject it as a usage error.
666
+ if (privacyOnly && stalenessOnly) {
667
+ log("verify-cassettes: --privacy-only and --staleness-only are mutually exclusive (together they'd check nothing)");
668
+ return process.exit(2);
669
+ }
670
+ const doPrivacy = !stalenessOnly;
671
+ const doStaleness = !privacyOnly;
672
+ const allow = [];
673
+ for (let i = 0; i < args.length; i++) {
674
+ if (args[i] !== "--allow")
675
+ continue;
676
+ const src = args[++i];
677
+ if (src === undefined) {
678
+ log("--allow needs a regex value");
679
+ return process.exit(2);
680
+ }
681
+ try {
682
+ allow.push(new RegExp(src, "i"));
683
+ }
684
+ catch {
685
+ log(`--allow: invalid regex: ${src}`);
686
+ return process.exit(2);
687
+ }
688
+ }
689
+ const target = args.find((a, i) => !a.startsWith("--") && args[i - 1] !== "--allow");
690
+ if (!target) {
691
+ log("usage: verify-cassettes <file|dir> [--privacy-only|--staleness-only] [--allow <regex>]... [--output-format json]");
692
+ return process.exit(2);
693
+ }
694
+ if (!existsSync(target)) {
695
+ log(`verify-cassettes: path not found: ${target}`);
696
+ return process.exit(2);
697
+ }
698
+ const files = statSync(target).isDirectory()
699
+ ? readdirSync(target)
700
+ .filter((f) => f.endsWith(".cassette.json"))
701
+ .sort()
702
+ .map((f) => join(target, f))
703
+ : [target];
704
+ if (files.length === 0) {
705
+ log(`verify-cassettes: no .cassette.json files under ${target} — nothing verified (loud non-zero, not a vacuous pass)`);
706
+ return process.exit(2);
707
+ }
708
+ const results = files.map((f) => {
709
+ const rc = readCassette(f);
710
+ if ("error" in rc)
711
+ return { file: f, findings: [], staleness: [], error: rc.error };
712
+ const findings = doPrivacy ? scanCassette(rc.cassette, allow) : [];
713
+ const staleness = doStaleness ? checkStaleness(rc.cassette, dirname(f)) : [];
714
+ return { file: f, findings, staleness, error: undefined };
715
+ });
716
+ const realFindings = results.flatMap((r) => r.findings.filter((x) => x.cls !== "unscanned"));
717
+ const staleAny = results.some((r) => r.staleness.length > 0);
718
+ const errorAny = results.some((r) => r.error !== undefined);
719
+ const ok = realFindings.length === 0 && !staleAny && !errorAny;
720
+ if (json) {
721
+ out(JSON.stringify({ command: "verify-cassettes", ok, results }));
722
+ }
723
+ else {
724
+ for (const r of results) {
725
+ if (r.error)
726
+ log(`✗ ${r.file}: [error] ${r.error}`);
727
+ for (const f of r.findings)
728
+ log(`${f.cls === "unscanned" ? "·" : "✗"} ${r.file}: [${f.cls}] ${f.where} — ${f.sample}`);
729
+ for (const s of r.staleness)
730
+ log(`✗ ${r.file}: [stale] ${s}`);
731
+ }
732
+ log(ok
733
+ ? `✓ verify-cassettes: ${files.length} cassette(s) clean`
734
+ : `✗ verify-cassettes: ${realFindings.length} PII finding(s)${staleAny ? " + staleness drift" : ""}${errorAny ? " + unreadable cassette(s)" : ""} across ${files.length} cassette(s)`);
735
+ }
736
+ return process.exit(ok ? 0 : 1);
737
+ }
377
738
  /** Replay a cassette through Run and re-evaluate the content assertions. With a `cassette.artifacts`
378
739
  * manifest (#1), filesystem assertions (file_exists/user_visible_artifact/artifact_json) ALSO run, against
379
740
  * the materialized snapshot. `opts.strict` (#1b) escalates staleness warnings to failing assertions. */
package/dist/scan.js ADDED
@@ -0,0 +1,30 @@
1
+ export const DEFAULT_SCAN_PATTERNS = [
2
+ { re: /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi, cls: "email" },
3
+ { re: /\$\s?\d[\d,]*(?:\.\d+)?\s?(?:k|m|b|bn|million|billion)?/gi, cls: "currency" },
4
+ { re: /\b[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.(?:com|io|net|org|co|app|ai|dev|xyz)\b/gi, cls: "domain" },
5
+ ];
6
+ /** The high-precision subset (email only). `email` is scanned UNIVERSALLY — even on the agent
7
+ * capability-manifest messages (the `system/init` event and the `initialize` registry `control_response`),
8
+ * because the registry's `account` field can carry the developer's own email (a real leak). The noisy
9
+ * classes (`currency`/`domain`) are the ones suppressed on those two manifest messages, where every hit is
10
+ * the agent's tool/skill catalog or MCP-server names — environment boilerplate a regex can't tell apart
11
+ * from customer data. Everywhere else (assistant reasoning, tool I/O, decisions, the deliverable) gets the
12
+ * full net. */
13
+ export const EMAIL_SCAN_PATTERNS = DEFAULT_SCAN_PATTERNS.filter((p) => p.cls === "email");
14
+ function allowed(sample, allow) {
15
+ // Test against a non-global clone so a caller's /g regex can't carry lastIndex across calls.
16
+ return allow.some((a) => new RegExp(a.source, a.flags.replace("g", "")).test(sample));
17
+ }
18
+ /** Scan one string for PII matches, suppressing anything the allowlist covers. */
19
+ export function scanText(text, where, allow, patterns = DEFAULT_SCAN_PATTERNS) {
20
+ const out = [];
21
+ for (const { re, cls } of patterns) {
22
+ const g = new RegExp(re.source, re.flags.includes("g") ? re.flags : re.flags + "g");
23
+ for (const m of text.matchAll(g)) {
24
+ const sample = m[0];
25
+ if (!allowed(sample, allow))
26
+ out.push({ where, cls, sample });
27
+ }
28
+ }
29
+ return out;
30
+ }
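The scan above can be exercised on a sample string. This is a minimal sketch, assuming only the three default patterns and the per-sample allowlist test shown in this file; `scanSample` is a hypothetical stand-in written for illustration, not the package's export, and the input text is made up.

```javascript
// Hypothetical stand-in for scanText, using the same three default patterns.
const PATTERNS = [
  { re: /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi, cls: "email" },
  { re: /\$\s?\d[\d,]*(?:\.\d+)?\s?(?:k|m|b|bn|million|billion)?/gi, cls: "currency" },
  { re: /\b[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.(?:com|io|net|org|co|app|ai|dev|xyz)\b/gi, cls: "domain" },
];

function scanSample(text, where, allow) {
  const found = [];
  for (const { re, cls } of PATTERNS) {
    // matchAll needs a global regex; every pattern above carries the g flag.
    for (const m of text.matchAll(re)) {
      const sample = m[0];
      // Test a non-global clone so a caller's /g regex cannot carry lastIndex between calls.
      const suppressed = allow.some((a) => new RegExp(a.source, a.flags.replace("g", "")).test(sample));
      if (!suppressed) found.push({ where, cls, sample });
    }
  }
  return found;
}

// Made-up input: one email, one currency figure, and the bare domain inside the email.
const text = "Ping jane@example.org about the $2.5M round";
const all = scanSample(text, "events[3]", []);
const filtered = scanSample(text, "events[3]", [/example\.org/i]);
```

Because the allowlist is tested against each matched sample rather than the whole line, allowing `example\.org` suppresses both the email hit and the bare-domain hit, while the currency hit still surfaces.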
package/dist/types.js CHANGED
@@ -145,13 +145,17 @@ export const Assertion = z.object({
145
145
  artifact: z.string().describe("relative path to a JSON artifact under the work root (e.g. outputs/cap_state.json)"),
146
146
  path: z.string().optional().describe("dotted path into the JSON (e.g. me.run_id); omit to target the whole document"),
147
147
  equals: z.unknown().optional().describe("the resolved value deep-equals this"),
148
+ in: z
149
+ .array(z.unknown())
150
+ .optional()
151
+ .describe("the resolved value deep-equals one of these (stable for stochastic/LLM-extracted values where equals churns)"),
148
152
  gt: z.number().optional().describe("the resolved value is a number greater than this"),
149
153
  exists: z.boolean().optional().describe("the path resolves to a present (non-absent) value"),
150
154
  absent: z.boolean().optional().describe("the final key is absent from its (resolved) parent — the anti-hallucination negative"),
151
155
  is_null: z.boolean().optional().describe("the resolved value is JSON null (distinct from absent)"),
152
156
  })
153
157
  .optional()
154
- .describe("assert over a JSON artifact's contents (dotted path + equals|gt|exists|absent|is_null)"),
158
+ .describe("assert over a JSON artifact's contents (dotted path + equals|in|gt|exists|absent|is_null)"),
155
159
  });
156
160
  export const ScenarioObject = z
157
161
  .object({
package/docs/cassette.md CHANGED
@@ -183,6 +183,54 @@ Re-record a cassette when:
183
183
  - `replay` exits 1 on a `replay_protocol_fidelity` mismatch — this means `serializeDecision` changed;
184
184
  review the change, confirm it's correct, then re-record to update the frozen envelope.
185
185
 
186
+ ## Batch recording
187
+
188
+ `record` takes a single scenario OR a directory:
189
+
190
+ ```bash
191
+ cowork-harness record scenarios/ # record every scenario in the dir (one cassette each)
192
+ cowork-harness record cassettes/ --rerecord-stale # re-record ONLY the cassettes whose fingerprint drifted
193
+ ```
194
+
195
+ Directory discovery keys on a **positive `prompt:` signal**: a `*.yaml` with no top-level `prompt:` is an
196
+ announced skip (it's a session/other doc), but a doc that *looks* like a scenario (has `prompt:`) yet fails
197
+ to parse is a **failure**, never a silent skip. Zero scenarios discovered → loud non-zero exit. `record`
198
+ also **refuses to freeze a failing live run** into a cassette (`--allow-failing` overrides) — a committed
199
+ red cassette is a latent false-signal.
200
+
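The three discovery outcomes described above can be sketched as a pure classification step. This is an illustration under stated assumptions: `classify` and `exitCodeFor` are hypothetical names, `docs` stands in for files that were already read and YAML-parsed, and the `validate` callback (with its made-up "name is required" rule) stands in for the real scenario parser.

```javascript
// Hypothetical helper: sort pre-parsed docs into scenarios / skipped / broken.
function classify(docs, validate) {
  const out = { scenarios: [], skipped: [], broken: [] };
  for (const { file, raw, parseError } of docs) {
    if (parseError) {
      out.broken.push({ file, error: `YAML parse error: ${parseError}` });
      continue;
    }
    // The positive signal: a top-level `prompt` key marks the doc as a scenario candidate.
    const hasPrompt = raw !== null && typeof raw === "object" && "prompt" in raw;
    if (!hasPrompt) {
      out.skipped.push(file); // announced skip, not a failure
      continue;
    }
    try {
      validate(raw);
      out.scenarios.push(file);
    } catch (e) {
      out.broken.push({ file, error: e.message }); // looks like a scenario but is invalid
    }
  }
  return out;
}

// Exit code for the batch, assuming every discovered scenario then records successfully.
function exitCodeFor(out) {
  if (out.scenarios.length === 0) return 2; // nothing discovered: loud non-zero
  return out.broken.length > 0 ? 1 : 0;
}

// Made-up validation rule, for illustration only.
const validate = (raw) => {
  if (typeof raw.name !== "string") throw new Error("name is required");
};
const buckets = classify(
  [
    { file: "a.yaml", raw: { name: "a", prompt: "hi" } },
    { file: "b.yaml", raw: { prompt: "hi" } },
    { file: "session.yaml", raw: { mounts: [] } },
    { file: "c.yaml", parseError: "bad indentation" },
  ],
  validate,
);
const code = exitCodeFor(buckets);
```

In this sample, `b.yaml` and `c.yaml` land in `broken` and `session.yaml` in `skipped`, so the batch would exit 1 even though `a.yaml` is a valid scenario.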
201
+ ## Privacy: cassettes are committed fixtures
202
+
203
+ A cassette snapshots the transcript **and** the `outputs/` JSON bodies (names, dollar figures, share
204
+ counts) — committed PII surface. Two layers, distinct from secret-scrub (which only strips auth tokens):
205
+
206
+ - **Opt-in redaction** (the mutation). Drop a `.cowork-redact.json` next to your scenarios, or set
207
+ `COWORK_HARNESS_REDACT_PATTERNS` / `COWORK_HARNESS_REDACT_KEYS`. At record time it rewrites matching PII
208
+ across the whole cassette surface (transcript, artifact bodies + filenames, prompt/answers/assert,
209
+ skillSources) **structurally** — JSON stays valid and the AskUserQuestion question/answer strings stay in
210
+ sync, so the O7 guard still passes. Redaction is **verdict-preserving**: `record` replays before/after and
211
+ **refuses to write** if redaction would flip an assertion (a manufactured green is the cardinal sin).
212
+ `--no-redact` skips it for known-synthetic inputs.
213
+ - **Always-on scan gate** — `verify-cassettes <file|dir>` scans the committed cassettes and **exits
214
+ non-zero** on a finding, so "no leak" is a gate, not discipline. The full net (`email` + `currency` +
215
+ bare-`domain`) runs over the **whole cassette** — the deliverable (`outputs/` bodies + filenames), the
216
+ author-written `prompt`/`answers`/`assert`, AND the agent's reasoning + tool I/O — with **one structural
217
+ exception**: the agent's **capability-manifest** messages (the `system/init` event and the `initialize`
218
+ registry `control_response`, `request_id:"init-1"`) are excluded from the noisy classes. Those two carry
219
+ the tool/skill catalog (slash-command descriptions naming `docsend.com`, `Pitch.com`, …) and the MCP-server
220
+ names (`claude.ai Gmail`, …) — environment boilerplate a regex can't tell apart from customer data, and the
221
+ sole concentrated source of false positives. They are excluded **as a unit**, not by domain — but `email`
222
+ still scans them (the registry's `account` field can carry the developer's own email). `--allow <regex>`
223
+ suppresses synthetic / public reference names (e.g. `NVCA`, `Cooley GO`, `Acme`); multi-word proper names
224
+ are **not** a default class (too noisy). `verify-cassettes` also runs the **staleness** check
225
+ (`--staleness-only`): a drifted `skillHash` (you edited the skill but didn't re-record) fails the gate.
226
+
227
+ ```bash
228
+ cowork-harness verify-cassettes cassettes/ --allow 'NVCA|Cooley GO|Acme'
229
+ ```
230
+
231
+ The cardinal rule still holds: record against **synthetic** inputs (e.g. "Cadence / Acme", made-up
232
+ numbers) — redaction and the scan are belt-and-suspenders, not a license to record real customer data.
233
+
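When the gate runs with `--output-format json`, a CI step can condense the `{command, ok, results}` envelope into one line per problem. The sketch below assumes the per-result fields visible in this release's `cmdVerifyCassettes` (`file`, `findings`, `staleness`, optional `error`); `summarize` is a hypothetical helper, and the sample envelope, including its `where` and staleness strings, is made up.

```javascript
// Hypothetical CI helper: one line per problem in a verify-cassettes JSON envelope.
function summarize(envelope) {
  const lines = [];
  for (const r of envelope.results) {
    // `error` is absent from the serialized result when the cassette was readable.
    if (r.error !== undefined) lines.push(`${r.file}: [error] ${r.error}`);
    for (const f of r.findings) {
      if (f.cls === "unscanned") continue; // informational note, does not fail the gate
      lines.push(`${r.file}: [${f.cls}] ${f.where}: ${f.sample}`);
    }
    for (const s of r.staleness) lines.push(`${r.file}: [stale] ${s}`);
  }
  return lines;
}

// Made-up envelope for illustration.
const sample = {
  command: "verify-cassettes",
  ok: false,
  results: [
    { file: "cassettes/a.cassette.json", findings: [], staleness: [] },
    {
      file: "cassettes/b.cassette.json",
      findings: [
        { where: "artifacts[0].body", cls: "email", sample: "jane@example.org" },
        { where: "events[0]", cls: "unscanned", sample: "(manifest message)" },
      ],
      staleness: ["skill fingerprint drifted"],
    },
    { file: "cassettes/c.cassette.json", findings: [], staleness: [], error: "unreadable / invalid cassette JSON" },
  ],
};
const problems = summarize(sample);
```

The `ok` field and the process exit code already carry the verdict; a summary like this is only for readable CI logs.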
186
234
  ## Committed fixture
187
235
 
188
236
  `examples/replays/example-pdf-skill.cassette.json` is a **synthetic** cassette committed to the repo
@@ -112,12 +112,26 @@ npm version patch # or minor | major
112
112
  git push --follow-tags
113
113
  ```
114
114
 
115
- Pushing the `vX.Y.Z` tag triggers `release.yml`, which verifies the tag matches `package.json`, runs
116
- `npm run ci`, then `npm publish --provenance --access public`. Auth is **OIDC**: the workflow's
117
- `id-token: write` is exchanged for a short-lived publish credential; there is **no `NPM_TOKEN`**. A
118
- GitHub Release is opened from the tag, and `prepublishOnly` re-runs CI so a manual publish is guarded too.
119
- A published version is **immutable**: the same `X.Y.Z` can never be re-published, so a botched run needs a
120
- new patch (not a re-run against the same version).
115
+ Pushing the `vX.Y.Z` tag triggers `release.yml`, which (in order) **waits for `ci.yml` to have succeeded
116
+ for that commit**, verifies the tag matches `package.json`, checks `CHANGELOG.md` has a `## [X.Y.Z]`
117
+ heading, runs the **version-lockstep guard** (`npm run check:versions`), runs `npm run ci`, then
118
+ `npm publish --provenance --access public`. Auth is **OIDC**: the workflow's `id-token: write` is exchanged
119
+ for a short-lived publish credential; there is **no `NPM_TOKEN`**. A GitHub Release is opened from the tag,
120
+ and `prepublishOnly` re-runs CI so a manual publish is guarded too. A published version is **immutable** —
121
+ the same `X.Y.Z` can never be re-published, so a botched run needs a new patch (not a re-run against the
122
+ same version).
123
+
124
+ The `ci.yml`-success gate matters because `release.yml`'s own `npm run ci` is **TypeScript-only**, while
125
+ `ci.yml` also runs pytest (the Python helper lane), `format:check`, the replay gate, and the boundary +
126
+ scenario suites. Without the gate, a tag could publish a build that `main`'s CI would have rejected. The
127
+ gate polls (~30 min) so `git push --follow-tags` works even when the commit's CI is still running.
128
+
129
+ **Version-lockstep guard (`scripts/check-versions.ts`, run in both `ci.yml` and `release.yml`).** Fails
130
+ loud unless all version strings agree: `package.json` == `package-lock.json`; the three skill versions
131
+ (`marketplace.json`, the skill `plugin.json`, `SKILL.md` frontmatter) == each other; the `SKILL.md`
132
+ bootstrap floor `@>=X.Y.Z` == its `tracks-harness:` version; and that floor is `<=` `package.json` (the
133
+ skill can't demand a harness newer than this repo publishes). This enforces the lockstep the next section
134
+ describes, so a hand-edited bump can't silently drift.
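The four invariants can be sketched as a pure check over already-extracted version strings. This is an illustration, not the guard itself: `lockstepErrors` and its field names are hypothetical, the sample versions are made up, and the numeric comparison follows the `X.Y.Z` convention the guard documents.

```javascript
// Compare two X.Y.Z strings numerically: negative if a<b, 0 if equal, positive if a>b.
function cmp(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) if (pa[i] !== pb[i]) return pa[i] - pb[i];
  return 0;
}

// Hypothetical pure check; `v` holds the version strings read from each file.
function lockstepErrors(v) {
  const errors = [];
  if (v.lock !== v.pkg) errors.push("package-lock.json != package.json");
  if (new Set([v.market, v.plugin, v.skill]).size !== 1) errors.push("skill versions disagree");
  if (v.floor !== v.tracks) errors.push("bootstrap floor != tracks-harness");
  if (cmp(v.floor, v.pkg) > 0) errors.push("bootstrap floor is ahead of package.json");
  return errors;
}

// Made-up values: the skill versions independently of the npm package.
const aligned = lockstepErrors({
  pkg: "0.3.0", lock: "0.3.0",
  market: "1.2.0", plugin: "1.2.0", skill: "1.2.0",
  floor: "0.3.0", tracks: "0.3.0",
});
// A floor of 0.10.0 is ahead of 0.9.0 numerically, though a plain string compare would say otherwise.
const ahead = lockstepErrors({
  pkg: "0.9.0", lock: "0.9.0",
  market: "1.2.0", plugin: "1.2.0", skill: "1.2.0",
  floor: "0.10.0", tracks: "0.10.0",
});
```

Splitting on `.` and comparing numbers is what keeps two-digit components such as `0.10.0` ordered correctly.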
121
135
 
122
136
  **One-time setup (on npmjs.com):** configure a Trusted Publisher on the `cowork-harness` package →
123
137
  provider GitHub Actions, repo `yaniv-golan/cowork-harness`, workflow filename `release.yml`,
package/docs/scenario.md CHANGED
@@ -205,13 +205,21 @@ dotted `path` selects into the document; one operator decides the check:
205
205
  - artifact_json: { artifact: outputs/cap_state.json, path: me.run_id, equals: "r1" }
206
206
  - artifact_json: { artifact: outputs/cap_state.json, path: rounds.0.amount, gt: 0 }
207
207
  - artifact_json: { artifact: outputs/instruments.json, path: exclusivity_days, absent: true } # anti-hallucination
208
+ - artifact_json: { artifact: outputs/cap_state.json, path: stage, in: ["seed", "series-a"] } # one of a stable set
208
209
  ```
209
- Operators: `equals` (deep-equal) · `gt` (number) · `exists: <bool>` · `absent: <bool>` · `is_null: <bool>`.
210
+ Operators: `equals` (deep-equal) · `in: [<set>]` (deep-equal one of) · `gt` (number) · `exists: <bool>` · `absent: <bool>` · `is_null: <bool>`.
210
211
  The three states are **distinct**: `absent` (the final key is missing from a parent that resolved) vs
211
212
  `is_null` (present but JSON `null`) vs an **unresolved intermediate** segment (the artifact is malformed for
212
213
  that path) — which **fails loud**, never a vacuous pass. (No JSONPath/jq — a dotted path keeps it
213
214
  dependency-free and side-effect-free.)
214
215
 
216
+ > **Stable vs brittle asserts on stochastic (LLM-extracted) values.** A cassette freezes ONE stochastic
217
+ > output, so an `equals` on an LLM-extracted string will churn every time you re-record. Prefer **stable**
218
+ > operators for extracted values: `absent` / `exists` (the anti-hallucination negative is rock-stable),
219
+ > or `in: [<set>]` to accept any of a known-good set. Reserve `equals` for values the skill computes
220
+ > deterministically (ids, counts, enums). This pairs with record-time redaction: redaction rewrites the
221
+ > very strings an `equals` would pin, so `equals` on a redacted field would break on re-record anyway.
222
+
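The `in` operator's "deep-equals one of" semantics can be sketched as follows. The harness's own evaluator is not shown in this diff, so `deepEqual` and `matchesIn` are hypothetical helpers that assume plain JSON values (objects, arrays, strings, numbers, booleans, null).

```javascript
// Hypothetical structural equality over JSON values; object key order does not matter.
function deepEqual(a, b) {
  if (a === b) return true;
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const ka = Object.keys(a);
  const kb = Object.keys(b);
  if (ka.length !== kb.length) return false;
  return ka.every((k) => Object.prototype.hasOwnProperty.call(b, k) && deepEqual(a[k], b[k]));
}

// The resolved value passes when it deep-equals any member of the accepted set.
const matchesIn = (value, set) => set.some((candidate) => deepEqual(value, candidate));

const stageOk = matchesIn("seed", ["seed", "series-a"]);
const stageCase = matchesIn("Seed", ["seed", "series-a"]);
const objectOk = matchesIn({ round: "a", n: 1 }, [{ n: 1, round: "a" }]);
const arrayOrder = matchesIn([1, 2], [[2, 1]]);
```

Under these assumptions membership is exact: `"Seed"` does not match `"seed"`, and array order matters, so the accepted set should list every spelling that counts as known-good.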
215
223
  > **Boundary assertions** (`egress_*`, `expect_denied`) require a sandboxed fidelity — `container`, `microvm`, or `hostloop` (all share the container sandbox + egress proxy). Only `protocol` is rejected, to avoid a false pass — see [boundary.md](./boundary.md).
216
224
 
217
225
  ### Which assertions survive `replay` (CI placement)
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "cowork-harness",
3
- "version": "0.2.0",
3
+ "version": "0.3.0",
4
4
  "description": "Scriptable, CI-friendly harness for Claude Cowork's runtime contract for testing skills across scenarios — same agent, mounts, egress allowlist, permission protocol, and sandbox limitations.",
5
5
  "license": "MIT",
6
6
  "type": "module",
@@ -56,7 +56,8 @@
56
56
  "format:check": "prettier --check \"src/**/*.ts\" \"test/**/*.ts\"",
57
57
  "smoke:cli": "node dist/cli.js list",
58
58
  "ci": "npm run typecheck && npm run build && npm run test",
59
- "prepublishOnly": "npm run ci"
59
+ "prepublishOnly": "npm run ci",
60
+ "check:versions": "tsx scripts/check-versions.ts"
60
61
  },
61
62
  "dependencies": {
62
63
  "yaml": "^2.5.0",
@@ -223,6 +223,11 @@
223
223
  "equals": {
224
224
  "description": "the resolved value deep-equals this"
225
225
  },
226
+ "in": {
227
+ "type": "array",
228
+ "items": {},
229
+ "description": "the resolved value deep-equals one of these (stable for stochastic/LLM-extracted values where equals churns)"
230
+ },
226
231
  "gt": {
227
232
  "type": "number",
228
233
  "description": "the resolved value is a number greater than this"
@@ -244,7 +249,7 @@
244
249
  "artifact"
245
250
  ],
246
251
  "additionalProperties": false,
247
- "description": "assert over a JSON artifact's contents (dotted path + equals|gt|exists|absent|is_null)"
252
+ "description": "assert over a JSON artifact's contents (dotted path + equals|in|gt|exists|absent|is_null)"
248
253
  }
249
254
  },
250
255
  "additionalProperties": false
@@ -0,0 +1,90 @@
1
+ // Guards version lockstep across the npm package and the companion skill, so a
2
+ // hand-edited release can't drift. Fails loud (exit 1) on any mismatch.
3
+ //
4
+ // npm run check:versions
5
+ //
6
+ // Invariants (the skill versions INDEPENDENTLY from the npm package — see
7
+ // docs/maintenance.md — so we do NOT require the skill version to equal the
8
+ // package version):
9
+ // 1. npm self-consistency: package.json === package-lock.json (root + "" package).
10
+ // 2. skill self-consistency: marketplace.json === skill plugin.json === SKILL.md `version:`.
11
+ // 3. floor === tracks: SKILL.md bootstrap floor `@>=X.Y.Z` === `tracks-harness:` version.
12
+ // 4. floor <= package: the harness version the skill demands must be one this repo
13
+ // can publish (else the skill ships ahead of npm).
14
+ import { readFileSync } from "node:fs";
15
+ import { dirname, join } from "node:path";
16
+ import { fileURLToPath, pathToFileURL } from "node:url";
17
+
18
+ const REPO_ROOT = join(dirname(fileURLToPath(import.meta.url)), "..");
19
+ const r = (p: string) => readFileSync(join(REPO_ROOT, p), "utf8");
20
+ const json = (p: string) => JSON.parse(r(p)) as Record<string, any>;
21
+
22
+ const SEMVER = /^\d+\.\d+\.\d+$/;
23
+ /** Compare two X.Y.Z strings: <0 if a<b, 0 if equal, >0 if a>b. */
24
+ function cmp(a: string, b: string): number {
25
+ const pa = a.split(".").map(Number);
26
+ const pb = b.split(".").map(Number);
27
+ for (let i = 0; i < 3; i++) if (pa[i] !== pb[i]) return pa[i] - pb[i];
28
+ return 0;
29
+ }
30
+
31
+ export function checkVersions(): { ok: boolean; errors: string[]; values: Record<string, string | undefined> } {
32
+ const errors: string[] = [];
33
+
34
+ const pkg = json("package.json").version as string;
35
+ const lock = json("package-lock.json");
36
+ const lockRoot = lock.version as string;
37
+ const lockPkg = lock.packages?.[""]?.version as string | undefined;
38
+
39
+ const market = json(".claude-plugin/marketplace.json").plugins?.[0]?.version as string | undefined;
40
+ const plugin = json(".claude/skills/cowork-harness/.claude-plugin/plugin.json").version as string | undefined;
41
+
42
+ const skillMd = r(".claude/skills/cowork-harness/SKILL.md");
43
+ const frontmatter = skillMd.split("---")[1] ?? "";
44
+ const skillVer = frontmatter.match(/^\s*version:\s*(\S+)\s*$/m)?.[1];
45
+ const tracks = skillMd.match(/tracks-harness:\s*cowork-harness\s+(\d+\.\d+\.\d+)/)?.[1];
46
+ const floor = skillMd.match(/cowork-harness@>=(\d+\.\d+\.\d+)/)?.[1];
47
+
48
+ const values = { pkg, lockRoot, lockPkg, market, plugin, skillVer, tracks, floor };
49
+
50
+ // 1. npm self-consistency
51
+ if (!SEMVER.test(pkg)) errors.push(`package.json version "${pkg}" is not X.Y.Z`);
52
+ if (lockRoot !== pkg) errors.push(`package-lock.json root version "${lockRoot}" != package.json "${pkg}"`);
53
+ if (lockPkg !== pkg) errors.push(`package-lock.json packages[""].version "${lockPkg}" != package.json "${pkg}"`);
54
+
55
+ // 2. skill self-consistency
56
+ const skillSet = new Set([market, plugin, skillVer]);
57
+ if (skillSet.size !== 1 || [...skillSet][0] === undefined) {
58
+ errors.push(
59
+ `skill version mismatch — marketplace.json=${market}, plugin.json=${plugin}, SKILL.md=${skillVer} (all three must agree)`,
60
+ );
61
+ }
62
+
63
+ // 3. floor === tracks-harness
64
+ if (!floor) errors.push(`could not find bootstrap floor "cowork-harness@>=X.Y.Z" in SKILL.md`);
65
+ if (!tracks) errors.push(`could not find "tracks-harness: cowork-harness X.Y.Z" in SKILL.md`);
66
+ if (floor && tracks && floor !== tracks) {
67
+ errors.push(`bootstrap floor "@>=${floor}" != tracks-harness "${tracks}" (keep them in lockstep)`);
68
+ }
69
+
70
+ // 4. floor <= package.json (the skill must not demand an unpublished/future harness)
71
+ if (floor && SEMVER.test(pkg) && cmp(floor, pkg) > 0) {
72
+ errors.push(`bootstrap floor "@>=${floor}" is ahead of package.json "${pkg}" — skill would lead npm`);
73
+ }
74
+
75
+ return { ok: errors.length === 0, errors, values };
76
+ }
77
+
78
+ function main(): void {
79
+ const { ok, errors, values } = checkVersions();
80
+ process.stdout.write(`version lockstep: ${JSON.stringify(values)}\n`);
81
+ if (ok) {
82
+ process.stdout.write("✓ all version strings are aligned\n");
83
+ return;
84
+ }
85
+ for (const e of errors) process.stderr.write(`::error::${e}\n`);
86
+ process.exitCode = 1;
87
+ }
88
+
89
+ // Run only when invoked directly (so a test can import checkVersions without side effects).
90
+ if (import.meta.url === pathToFileURL(process.argv[1] ?? "").href) main();