cowork-harness 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14)

package/CHANGELOG.md +36 -0
package/README.md +12 -2
package/dist/assert.js +9 -0
package/dist/cli.js +6 -1
package/dist/redact.js +101 -0
package/dist/run/cassette.js +426 -65
package/dist/scan.js +30 -0
package/dist/types.js +5 -1
package/docs/cassette.md +48 -0
package/docs/maintenance.md +20 -6
package/docs/scenario.md +9 -1
package/package.json +3 -2
package/schema/scenario.schema.json +6 -1
package/scripts/check-versions.ts +90 -0

package/CHANGELOG.md CHANGED

@@ -6,6 +6,42 @@ All notable changes to this project are documented here. The format is based on
 ## [Unreleased]
+## [0.3.0] — 2026-06-17
+The CI-operate + privacy layer for committed cassettes: record-time redaction, an always-on
+`verify-cassettes` scan/staleness gate, batch recording, and a set-membership assert operator.
+### Added
+- **`verify-cassettes <file|dir>`** — a token/agent-free CI gate over committed cassettes. A privacy
+  **scan** flags `email`/`currency`/bare-`domain` matches across the whole cassette, excluding only the
+  agent's **capability-manifest** messages (`system/init` + the `init-1` registry) from the noisy classes —
+  that catalog/MCP-server boilerplate is the sole concentrated false-positive source (email still scans it,
+  since the registry `account` field can carry the dev's email). `--allow <regex>` suppresses synthetic/
+  public reference names; multi-word proper names are opt-in, not a default class. Plus a **staleness** check
+  (`--staleness-only`) fails when a cassette's fingerprint drifted (you edited the skill but didn't
+  re-record). Exit 1 on any finding/drift/unreadable cassette; a malformed cassette is tallied, never
+  crashes the batch. Dedicated JSON envelope (`{command, ok, results}`), not the `RunResult` shape.
+- **Record-time content redaction** (opt-in; distinct from secret-scrub). A `.cowork-redact.json` (or
+  `COWORK_HARNESS_REDACT_PATTERNS`/`_KEYS`) rewrites configured PII across the **whole** cassette surface
+  (transcript, artifact bodies + filenames, prompt/answers/assert, skillSources) **structurally** — JSON
+  stays valid and the AskUserQuestion question/answer strings stay in sync (the O7 guard still passes), with
+  collision-safe deterministic tokens. Redaction is **verdict-preserving**: `record` refuses to write if it
+  would flip an assertion (a manufactured green). `--no-redact` / `--allow-failing` escape hatches.
+- **Batch recording** — `record <dir>` records every scenario in a directory (classified by a positive
+  `prompt:` signal: a non-scenario YAML is an announced skip, a broken scenario is a failure, never a silent
+  skip); `record <cassette-dir> --rerecord-stale` re-records only the cassettes whose fingerprint drifted.
+- **`artifact_json` `in:` operator** — assert the resolved value deep-equals one of a fixed set; stable for
+  stochastic (LLM-extracted) values where `equals` churns across re-records.
+### Fixed
+- **`skillHash` cassette fingerprint was silently dead** — `skillSourceDirs` passed a path string to
+  `loadSession` (which wants parsed YAML), threw, and the throw was swallowed, so the staleness gate's
+  skill-edit signal never computed for a file-based session. Now parses + resolves the session correctly;
+  `hashDir` folds in each file's relative path + type marker (a *move* now registers); `skillSources` are
+  stored relative, never as absolute host paths.
 ## [0.2.0] — 2026-06-17
 Binary-verified the AskUserQuestion answer wire shape (agent ELF 2.1.170), implemented the

package/README.md CHANGED

@@ -74,7 +74,8 @@ Skill testing is the headline use, but the tool is a general harness over the Co
 | `skill <folder> "<prompt>"` | Run a local skill/plugin folder once against the staged agent | ad-hoc "is the skill alive / does it do X?" — the fast inner loop |
 | `run <scenario.yaml \| dir/>` | Run authored scenarios with `assert:` + a CI-ready exit code | you want a repeatable, **asserted regression test** |
 | `chat <folder>` | Interactive multi-turn REPL against a skill (TTY) | debugging a multi-turn flow by hand |
-| `record` / `replay` | Save a control-protocol cassette, then replay it deterministically (`replay --strict` fails on a stale cassette) | **token-free, Docker-free CI** from a once-recorded run |
+| `record` / `replay` | Save a control-protocol cassette (one scenario, or batch a `dir/`; `--rerecord-stale` refreshes only drifted ones), then replay it deterministically (`replay --strict` fails on a stale cassette) | **token-free, Docker-free CI** from a once-recorded run |
+| `verify-cassettes <file\|dir>` | Token-free CI gate over committed cassettes: a privacy scan (email/currency/domain → exit 1) + a staleness check (`--staleness-only`) | gating **committed cassettes** against PII leaks + "edited the skill, forgot to re-record" |
 | `trace <run-id>` | Digest a run's `events.jsonl` (`--tools`, `--gates`, `--dispatches` for the sub-agent dispatch tree + total) | "how many sub-agents *actually* dispatched, and which?" |
 | `scaffold --from-run <id>` | Turn a kept run into a starter scenario YAML (gates→answers, artifacts→`file_exists`) | authoring a scenario from a real run instead of guessing |
 | `assert --list` | List the available scenario assertions (generated from the schema) | "what can I assert?" without grepping the source |
@@ -204,8 +205,17 @@ cowork-harness replay --cassette cassettes/example-pdf-skill.cassette.json
 # A committed synthetic fixture is ready to replay on a fresh clone (no record step needed):
 cowork-harness replay --cassette examples/replays/example-pdf-skill.cassette.json
+# Cassettes are COMMITTED fixtures — record against synthetic data, and gate them in CI:
+cowork-harness verify-cassettes cassettes/   # privacy scan (email/currency/domain) + staleness; exit 1 on a finding
 ```
+> **Privacy:** a cassette snapshots the transcript and the `outputs/` JSON bodies, so it's committed PII
+> surface. Record against synthetic inputs; opt into record-time **redaction** with a `.cowork-redact.json`
+> (verdict-preserving — `record` refuses to write if redaction would flip an assertion); and gate every
+> commit with `verify-cassettes` (the always-on scan, `--allow <regex>` for synthetic/public names). See
+> [docs/cassette.md](./docs/cassette.md).
 > **What replay checks.** A cassette bundles BOTH recorded protocol directions: the child→driver
 > `events` stream AND the driver→child `controlOut` decision responses. `replay` re-runs the
 > orchestration from both, re-evaluates the **content** assertions, and re-exercises
@@ -385,7 +395,7 @@ The provided [GitHub Actions workflow](.github/workflows/ci.yml) runs a **four-s
 | Stage | Runs | Needs | Gates |
 |---|---|---|---|
-| **unit** | format check · typecheck · unit tests · build · CLI smoke · token-free `replay` gate | nothing | every push/PR |
+| **unit** | format check · typecheck · unit tests · build · CLI smoke · token-free `replay` + `verify-cassettes` gates | nothing | every push/PR |
 | **boundary** | builds the pinned agent image, brings up the default-deny network, runs `boundary-check` | Docker, arm64 runner | proves the sandbox enforces Cowork's limits — **no API key** |
 | **scenarios** | the live scenario suite at `container` fidelity, uploads transcripts/egress logs as artifacts | `ANTHROPIC_API_KEY` (or `CLAUDE_CODE_OAUTH_TOKEN`) | fork PRs: the whole job is skipped (`if:` guard); same-repo without a key: warns and exits 0 |
 | **parity-drift** | reminder to re-`sync` when Desktop updates | nothing | informational, never blocks |

package/dist/assert.js CHANGED

@@ -241,6 +241,15 @@ function check(a, ctx) {
                             ? ok()
                             : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected > ${aj.gt}`));
                     }
+                    // #4: set membership — the resolved value deep-equals one of a fixed set. Stable for stochastic
+                    // (LLM-extracted) values where `equals` would churn across re-records. `present &&` guard mirrors
+                    // `equals` so an absent value never vacuously satisfies it.
+                    if (aj.in !== undefined) {
+                        any = true;
+                        results.push(present && Array.isArray(aj.in) && aj.in.some((x) => jsonEq(val, x))
+                            ? ok()
+                            : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected one of ${JSON.stringify(aj.in)}`));
+                    }
                     // No operator → an existence assertion (the value must be present).
                     if (!any)
                         results.push(present ? ok() : fail(`artifact_json: "${aj.path ?? "(root)"}" is not present (no operator given → existence check)`));
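
The `in:` branch above can be exercised in isolation. The following is a minimal sketch, not the harness's own code: `jsonEq` is a hypothetical stand-in for the helper of the same name (its real implementation is not part of this diff), and `checkIn` is an illustrative wrapper around the condition the hunk adds.

```javascript
// Hypothetical stand-in for the harness's jsonEq helper: structural equality on JSON values.
function jsonEq(a, b) {
  if (a === b) return true;
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const ka = Object.keys(a);
  const kb = Object.keys(b);
  if (ka.length !== kb.length) return false;
  return ka.every((k) => Object.prototype.hasOwnProperty.call(b, k) && jsonEq(a[k], b[k]));
}

// Illustrative wrapper around the condition in the `in:` branch:
// an absent value or a non-array candidate set never passes.
function checkIn(present, val, candidates) {
  return present && Array.isArray(candidates) && candidates.some((x) => jsonEq(val, x));
}

console.log(checkIn(true, "USD", ["USD", "EUR"]));          // true: scalar member
console.log(checkIn(true, { a: [1, 2] }, [{ a: [1, 2] }])); // true: deep-equal member
console.log(checkIn(false, undefined, [undefined]));         // false: absent value never passes
console.log(checkIn(true, "usd", ["USD", "EUR"]));           // false: not in the set
```

The third call shows why the `present &&` guard matters: without it, an absent value could match an `undefined` entry in the candidate set.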

package/dist/cli.js CHANGED

@@ -13,7 +13,7 @@ import { vmInit, vmDelete, vmStatus, vmPrune, instanceName } from "./runtime/lim
 import { sync } from "./sync/cowork-sync.js";
 import { runBoundaryChecks, formatBoundary } from "./boundary.js";
 import { cmdChat } from "./run/chat.js";
-import { cmdRecord, cmdReplay } from "./run/cassette.js";
+import { cmdRecord, cmdReplay, cmdVerifyCassettes } from "./run/cassette.js";
 import { loadDotenv } from "./dotenv.js";
 import { makeRenderer, renderStart, renderFooter, startHeartbeat } from "./run/renderer.js";
 import { resolveEventsFile, buildTrace, formatTrace, buildGateTrace, formatGateTrace, buildDispatchTree, formatDispatchTree, } from "./run/trace-view.js";
@@ -50,6 +50,8 @@ const HELP = `cowork-harness <command>   (v${"$VERSION"})
       [--out <file>]           cassette path (default: cassettes/<scenario-name>.cassette.json)
   replay --cassette <file>     deterministic protocol-replay of a cassette (no token) [--output-format json]
       [--strict]               escalate a cassette-staleness warning (baseline/skill drift) to a failure
+  verify-cassettes <file|dir>  CI gate (no token): privacy scan + staleness — exit 1 on a PII finding or drift
+      [--privacy-only|--staleness-only] [--allow <regex>]... [--output-format json]
   trace <run-id | dir | path>  digest a run's events.jsonl (tools+result status, dispatches, decisions)
       [--tools]  tool/dispatch rows only   [--gates]  gate lifecycle (question→answer→delivered)
       [--dispatches]  sub-agent dispatch tree + the real total (read off dispatch_count_max)
@@ -199,6 +201,7 @@ async function main() {
             "chat",
             "record",
             "replay",
+            "verify-cassettes",
             "trace",
             "assert",
             "scaffold",
@@ -260,6 +263,8 @@ async function main() {
             return cmdRecord(rest);
         case "replay":
             return cmdReplay(rest);
+        case "verify-cassettes":
+            return cmdVerifyCassettes(rest);
         case "trace":
             return cmdTrace(rest);
         case "assert":

package/dist/redact.js ADDED

@@ -0,0 +1,101 @@
+/**
+ * Content redaction for committed cassettes (#1 / A1). DISTINCT from `secrets.ts` (which scrubs auth
+ * tokens): this redacts author-configured PII patterns out of the cassette surface before it is written.
+ *
+ * Two hard requirements drive the design:
+ *  - STRUCTURAL (not line-level) for JSON protocol lines: events/controlOut are JSON; redacting their raw
+ *    text could unbalance the JSON (→ a silently skipped line on replay) or desync the AskUserQuestion
+ *    question/answer strings the O7 guard compares across events and controlOut. So JSON is parsed, every
+ *    string LEAF and object KEY is redacted, then re-serialized.
+ *  - COLLISION-SAFE deterministic tokens: `[REDACTED:<label>:<hash>]`. The hash (of the matched text) keeps
+ *    the token stable across re-records (no churn) AND injective — two distinct names never collapse into a
+ *    single `answers` map key. A genuine collision (astronomically rare) fails loud, never silently merges.
+ */
+import { createHash } from "node:crypto";
+import { existsSync, readFileSync } from "node:fs";
+import { join } from "node:path";
+export const EMPTY_POLICY = { patterns: [], keyNames: [] };
+function csv(v) {
+    return (v ?? "")
+        .split(",")
+        .map((s) => s.trim())
+        .filter(Boolean);
+}
+/** Assemble a redaction policy from `.cowork-redact.json` (searched in `searchDirs`, e.g. cwd then the
+ *  scenario/cassette dir) merged with `COWORK_HARNESS_REDACT_PATTERNS`/`_KEYS`. No config + no env →
+ *  EMPTY_POLICY (the opt-in default; the A2 scanner is the always-on safety net). A malformed regex throws —
+ *  a silently-dropped redaction rule is under-redaction, i.e. a leak. */
+export function loadRedactionPolicy(searchDirs) {
+    const patterns = [];
+    const keyNames = [];
+    const seen = new Set();
+    for (const dir of searchDirs) {
+        const f = join(dir, ".cowork-redact.json");
+        if (seen.has(f) || !existsSync(f))
+            continue;
+        seen.add(f);
+        const cfg = JSON.parse(readFileSync(f, "utf8"));
+        for (const p of cfg.patterns ?? [])
+            patterns.push({ re: new RegExp(p.regex, p.flags ?? "g"), label: p.label ?? "redacted" });
+        for (const k of cfg.keys ?? [])
+            keyNames.push(k);
+    }
+    for (const src of csv(process.env.COWORK_HARNESS_REDACT_PATTERNS))
+        patterns.push({ re: new RegExp(src, "g"), label: "redacted" });
+    for (const k of csv(process.env.COWORK_HARNESS_REDACT_KEYS))
+        keyNames.push(k);
+    return { patterns, keyNames };
+}
+/** Stable, collision-safe token for a matched span. Depends ONLY on the matched text (context-free), so the
+ *  same logical string redacts identically wherever it appears (events question text == controlOut answers
+ *  key) — the property the O7 guard relies on. */
+function token(label, match) {
+    const h = createHash("sha256").update(match).digest("hex").slice(0, 12);
+    return `[REDACTED:${label}:${h}]`;
+}
+/** Apply every pattern to a single string. Each pattern is forced global so all occurrences go. */
+export function redactText(text, policy) {
+    let out = text;
+    for (const { re, label } of policy.patterns) {
+        const g = new RegExp(re.source, re.flags.includes("g") ? re.flags : re.flags + "g");
+        out = out.replace(g, (m) => token(label, m));
+    }
+    return out;
+}
+/** Recursively redact a parsed JSON value: string leaves AND object keys (C3). Numbers/booleans/null pass
+ *  through. A key collision after redaction (two distinct keys → one) throws — a silent merge would lose data
+ *  and (for an `answers` map) break replay. */
+export function redactStructural(value, policy) {
+    if (typeof value === "string")
+        return redactText(value, policy);
+    if (Array.isArray(value))
+        return value.map((v) => redactStructural(v, policy));
+    if (value !== null && typeof value === "object") {
+        const out = {};
+        for (const [k, v] of Object.entries(value)) {
+            const rk = redactText(k, policy);
+            // A value under a configured key is redacted wholesale regardless of TYPE (a sensitive number/object
+            // leaks just like a string). The hash is over its JSON form so the token stays deterministic.
+            const rv = policy.keyNames.includes(k) ? token("key", typeof v === "string" ? v : JSON.stringify(v)) : redactStructural(v, policy);
+            if (Object.prototype.hasOwnProperty.call(out, rk))
+                throw new Error(`redaction collision: two distinct keys both redacted to "${rk}" — refusing to silently merge`);
+            out[rk] = rv;
+        }
+        return out;
+    }
+    return value;
+}
+/** Redact one JSONL protocol line. If it parses as JSON, redact structurally (guaranteeing it still parses);
+ *  otherwise fall back to safe text redaction (a non-JSON line has no protocol coupling). */
+export function redactJsonLine(line, policy) {
+    if (!line.trim())
+        return line;
+    let parsed;
+    try {
+        parsed = JSON.parse(line);
+    }
+    catch {
+        return redactText(line, policy);
+    }
+    return JSON.stringify(redactStructural(parsed, policy));
+}
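
To see why context-free tokens keep the question text and the `answers` key in sync, the core of the file above can be condensed into a runnable sketch. `token` and `redactText` follow the added code; `redactStructural` is simplified here (it omits the `keyNames` branch), and the policy, the customer name, and the sample protocol line are hypothetical.

```javascript
const { createHash } = require("node:crypto");

// Deterministic token: depends only on the matched text, as in redact.js.
function token(label, match) {
  const h = createHash("sha256").update(match).digest("hex").slice(0, 12);
  return `[REDACTED:${label}:${h}]`;
}

// Apply every pattern globally to one string.
function redactText(text, policy) {
  let out = text;
  for (const { re, label } of policy.patterns) {
    const g = new RegExp(re.source, re.flags.includes("g") ? re.flags : re.flags + "g");
    out = out.replace(g, (m) => token(label, m));
  }
  return out;
}

// Simplified structural walk (no keyNames handling): string leaves and object keys are redacted.
function redactStructural(value, policy) {
  if (typeof value === "string") return redactText(value, policy);
  if (Array.isArray(value)) return value.map((v) => redactStructural(v, policy));
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const [k, v] of Object.entries(value)) {
      const rk = redactText(k, policy);
      if (Object.prototype.hasOwnProperty.call(out, rk)) throw new Error(`redaction collision on "${rk}"`);
      out[rk] = redactStructural(v, policy);
    }
    return out;
  }
  return value;
}

// Hypothetical policy and protocol line, for illustration only.
const policy = { patterns: [{ re: /Acme Corp/, label: "customer" }], keyNames: [] };
const line = JSON.stringify({ question: 'Approve "Acme Corp" filing?', answers: { "Acme Corp": "yes" } });

const redactedLine = JSON.stringify(redactStructural(JSON.parse(line), policy));
const parsed = JSON.parse(redactedLine); // still valid JSON after redaction
const tok = token("customer", "Acme Corp");

console.log(parsed.question.includes(tok));         // true: question carries the token
console.log(Object.keys(parsed.answers)[0] === tok); // true: the answers key is the same token
console.log(redactedLine.includes("Acme Corp"));     // false: the name is gone
```

Because the hash covers only the matched span, the same name yields the same token whether it appears as a string leaf or as an object key, which is the property the header comment attributes to the O7 guard.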

package/dist/run/cassette.js CHANGED

@@ -3,8 +3,8 @@ import { readFileSync, writeFileSync, mkdirSync, mkdtempSync, existsSync, readdi
 import { createHash } from "node:crypto";
 import { tmpdir } from "node:os";
 import { join, dirname, relative, isAbsolute } from "node:path";
-import { executeScenario, parseScenarioFile, collectArtifacts } from "./execute.js";
-import { loadSession } from "../session.js";
+import { executeScenario, parseScenarioFile, collectArtifacts, parseSessionFile } from "./execute.js";
+import { loadSession, resolveSessionPaths } from "../session.js";
 import { loadBaseline } from "../baseline.js";
 import { Run } from "./run.js";
 import { parseMessage, serializeDecision, deserializeDecision, canon, } from "../agent/session.js";
@@ -13,6 +13,10 @@ import { evaluate } from "../assert.js";
 import { makeRenderer, renderFooter } from "./renderer.js";
 import { jsonEnvelope, parseOutputFormat } from "./envelope.js";
 import { computeVerdict } from "./verdict.js";
+import { redactJsonLine, redactText, redactStructural, loadRedactionPolicy } from "../redact.js";
+import { collectSecrets, scrub } from "../secrets.js";
+import { scanText, DEFAULT_SCAN_PATTERNS, EMAIL_SCAN_PATTERNS } from "../scan.js";
+import { parse as parseYaml } from "yaml";
 const out = (s) => process.stdout.write(s + "\n");
 const log = (s) => process.stderr.write(s + "\n");
 /** Current cassette format version. Readers tolerate ABSENT (legacy → 0) and warn on a FUTURE version. */
@@ -47,8 +51,10 @@ function materializeManifest(entries) {
     }
     return { workRoot, prefixes: ["outputs", ".projects"] };
 }
-/** Hash a directory's file CONTENTS recursively (sorted, path-relative) — stable across machines. */
-function hashDir(dir, hash) {
+/** Hash a directory's structure + file CONTENTS recursively (sorted) — stable across machines. The hash
+ *  folds in each entry's RELATIVE path (not just its basename) plus a type marker, so a file MOVING within
+ *  the tree (`a/x.json` → `a/sub/x.json`, same content) changes the hash (S2 — basename-only missed moves). */
+function hashDir(dir, hash, rel = "") {
     let entries;
     try {
         entries = readdirSync(dir).sort();
@@ -58,6 +64,7 @@ function hashDir(dir, hash) {
     }
     for (const name of entries) {
         const abs = join(dir, name);
+        const relPath = rel ? `${rel}/${name}` : name;
         let st;
         try {
             st = statSync(abs);
@@ -65,10 +72,12 @@ function hashDir(dir, hash) {
         catch {
             continue;
         }
-        if (st.isDirectory())
-            hashDir(abs, hash);
+        if (st.isDirectory()) {
+            hash.update(`D:${relPath}\n`); // structure marker — an empty/renamed dir registers too
+            hashDir(abs, hash, relPath);
+        }
         else if (st.isFile()) {
-            hash.update(name);
+            hash.update(`F:${relPath}\n`); // relative path, not basename — a move changes the digest
             try {
                 hash.update(readFileSync(abs));
             }
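
The effect of hashing the relative path plus a type marker can be checked with a small experiment. This is a sketch under stated assumptions: `hashDir` below is a trimmed re-statement of the new version (the error-swallowing `try`/`catch` blocks are dropped), and `digest`, the temp-dir prefix, and the file names are hypothetical.

```javascript
const { createHash } = require("node:crypto");
const { mkdtempSync, mkdirSync, writeFileSync, renameSync, readdirSync, statSync, readFileSync, rmSync } = require("node:fs");
const { tmpdir } = require("node:os");
const { join } = require("node:path");

// Trimmed re-statement of the new hashDir: relative path + type marker + file contents.
function hashDir(dir, hash, rel = "") {
  for (const name of readdirSync(dir).sort()) {
    const abs = join(dir, name);
    const relPath = rel ? `${rel}/${name}` : name;
    const st = statSync(abs);
    if (st.isDirectory()) {
      hash.update(`D:${relPath}\n`);
      hashDir(abs, hash, relPath);
    } else if (st.isFile()) {
      hash.update(`F:${relPath}\n`);
      hash.update(readFileSync(abs));
    }
  }
}

// Hypothetical helper: one sha256 digest per directory tree.
const digest = (dir) => {
  const h = createHash("sha256");
  hashDir(dir, h);
  return h.digest("hex");
};

const root = mkdtempSync(join(tmpdir(), "hashdir-sketch-"));
mkdirSync(join(root, "a", "sub"), { recursive: true });
writeFileSync(join(root, "a", "x.json"), "{}");

const before = digest(root);
const again = digest(root); // same tree, same digest
renameSync(join(root, "a", "x.json"), join(root, "a", "sub", "x.json"));
const after = digest(root); // same content, different relative path
rmSync(root, { recursive: true, force: true });

console.log(before === again); // true
console.log(before !== after); // true
```

Under the old scheme, which fed only the basename and the contents into the hash, both trees would have contributed the same bytes, so the move would have gone unnoticed.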
@@ -78,28 +87,121 @@ function hashDir(dir, hash) {
         }
     }
 }
-/** #1b: the local skill/plugin/marketplace source dirs a session mounts — the "skill dir" hash unit. */
+/** #1b: the local skill/plugin/marketplace source dirs a session mounts — the "skill dir" hash unit.
+ *  Returns ABSOLUTE dirs (for hashing/reading) plus `baseDir`, the session-file dir the relative
+ *  `skillSources` are stored against (so the committed fingerprint carries no absolute host path — C1). */
 function skillSourceDirs(sessionPath, cassetteDir) {
     const resolved = cassetteDir && !isAbsolute(sessionPath) ? join(cassetteDir, sessionPath) : sessionPath;
+    const baseDir = dirname(resolved);
     if (sessionPath === "(inline)" || !existsSync(resolved))
-        return [];
+        return { dirs: [], baseDir };
     let cfg;
     try {
-        cfg = loadSession(resolved);
+        // Mirror loadSessionFromFile (execute.ts): parse the YAML, then RESOLVE its relative skill/plugin
+        // paths against the session-file dir (`baseDir` — the post-cassetteDir-join location, so this works for
+        // both the record call (no cassetteDir) and the replay call (cassetteDir set)). Passing the raw path
+        // string to loadSession() throws (it wants parsed YAML) — the swallowed throw is why skillHash was
+        // silently never computed.
+        cfg = resolveSessionPaths(loadSession(parseSessionFile(resolved)), baseDir);
     }
     catch {
-        return [];
+        return { dirs: [], baseDir };
     }
-    return [...cfg.skills.local, ...cfg.plugins.local_plugins, ...cfg.plugins.remote_plugins, ...cfg.plugins.local_marketplaces].filter((d) => existsSync(d));
+    const dirs = [...cfg.skills.local, ...cfg.plugins.local_plugins, ...cfg.plugins.remote_plugins, ...cfg.plugins.local_marketplaces].filter((d) => existsSync(d));
+    return { dirs, baseDir };
 }
-function buildFingerprint(sessionPath, baselineAppVersion, cassetteDir) {
-    const dirs = skillSourceDirs(sessionPath, cassetteDir);
+export function buildFingerprint(sessionPath, baselineAppVersion, cassetteDir) {
+    const { dirs, baseDir } = skillSourceDirs(sessionPath, cassetteDir);
     if (dirs.length === 0)
         return { baseline: baselineAppVersion };
     const hash = createHash("sha256");
     for (const d of dirs.sort())
         hashDir(d, hash);
-    return { baseline: baselineAppVersion, skillHash: hash.digest("hex"), skillSources: dirs };
+    // Store skillSources RELATIVE to the session-file dir — diagnostics only (the replay recompute re-derives
+    // the dirs from the session), so a relative path is enough and never leaks an absolute `/Users/...` path.
+    return { baseline: baselineAppVersion, skillHash: hash.digest("hex"), skillSources: dirs.map((d) => relative(baseDir, d)) };
+}
+/** A2: scan the WHOLE cassette surface for PII (default classes: email/currency/domain). A `truncated`
+ *  artifact has NO committed body (hash-only) — nothing to leak — but is reported as `unscanned` so coverage
+ *  is never silently implied. Real-class findings fail the gate; `unscanned` is informational. */
+/** The agent's CAPABILITY MANIFEST — environment boilerplate, never user data, and the sole concentrated
+ *  source of `domain`/`currency` scan noise (tool/skill catalog descriptions + MCP-server names a regex
+ *  can't tell apart from customer data). Two stable structural forms:
+ *   - the `system/init` event (tools/mcp_servers/skills/cwd registry), and
+ *   - the `initialize` `control_response` (`request_id: "init-1"`; body = commands/agents/models/account).
+ *  These get `email`-only scanning (email is universal — the `account` field can carry the dev's own email);
+ *  the noisy classes are suppressed only here. */
+function isCapabilityManifest(line) {
+    let m;
+    try {
+        m = JSON.parse(line);
+    }
+    catch {
+        return false;
+    }
+    if (m?.type === "system" && m?.subtype === "init")
+        return true;
+    if (m?.type === "control_response") {
+        const r = m.response ?? {};
+        if (r.request_id === "init-1")
+            return true;
+        const body = r.response;
+        if (body && typeof body === "object" && "commands" in body && "agents" in body)
+            return true; // shape fallback
+    }
+    return false;
+}
+export function scanCassette(cassette, allow) {
+    const findings = [];
+    const FULL = DEFAULT_SCAN_PATTERNS; // email + currency + domain
+    const EMAIL = EMAIL_SCAN_PATTERNS; // email only — for the capability-manifest messages
+    // Transcript: full net EXCEPT the capability-manifest messages (catalog noise), where only email runs.
+    cassette.events.forEach((l, i) => findings.push(...scanText(l, `events[${i}]`, allow, isCapabilityManifest(l) ? EMAIL : FULL)));
+    cassette.controlOut?.forEach((l, i) => findings.push(...scanText(l, `controlOut[${i}]`, allow, isCapabilityManifest(l) ? EMAIL : FULL)));
+    // Deliverable + author-written fields — full net (a real cap table's figures/domains live here).
+    for (const a of cassette.artifacts ?? []) {
+        findings.push(...scanText(a.path, `artifact path ${a.path}`, allow, FULL)); // a filename can name a customer
+        if (a.body !== undefined)
+            findings.push(...scanText(a.body, `artifact ${a.path}`, allow, FULL));
+        else if (a.truncated)
+            findings.push({ where: `artifact ${a.path}`, cls: "unscanned", sample: "(body not committed — too large or unreadable)" });
+    }
+    findings.push(...scanText(cassette.scenario.prompt, "scenario.prompt", allow, FULL));
+    findings.push(...scanText(JSON.stringify(cassette.scenario.answers ?? null), "scenario.answers", allow, FULL));
+    findings.push(...scanText(JSON.stringify(cassette.scenario.assert ?? null), "scenario.assert", allow, FULL));
+    for (const s of cassette.fingerprint?.skillSources ?? [])
+        findings.push(...scanText(s, "fingerprint.skillSources", allow, FULL));
+    return findings;
+}
+/** B3 staleness GATE: recompute the fingerprint and report drift. Unlike `replayCassette` (which WARNS),
+ *  the gate treats an unresolvable skillHash as a failure — can't verify ⇒ not green. No fingerprint → nothing
+ *  to check (legacy cassette). */
+export function checkStaleness(cassette, cassetteDir) {
+    const fp = cassette.fingerprint;
+    if (!fp)
+        return [];
+    const msgs = [];
+    let liveBaseline;
+    try {
+        liveBaseline = loadBaseline("latest").appVersion;
+    }
+    catch {
+        /* baseline not loadable */
+    }
+    // Gate mode: can't verify ⇒ not green. The cassette carries a baseline-of-record but we can't load the
+    // current one to compare — a fail, not a silent skip (baselines ship with the package, so this is rare).
+    if (liveBaseline === undefined)
+        msgs.push("cannot load the latest baseline to verify staleness — run `cowork-harness sync` or ship baselines/ (can't verify ⇒ not green)");
+    else if (liveBaseline !== fp.baseline)
+        msgs.push(`baseline moved ${fp.baseline} → ${liveBaseline} since record — re-record`);
+    if (fp.skillHash) {
+        const live = buildFingerprint(cassette.scenario.session, fp.baseline, cassetteDir);
+        if (live.skillHash === undefined)
+            msgs.push("skill dirs not resolvable from the cassette location — cannot verify staleness (gate fails: can't verify ⇒ not green)");
+        else if (live.skillHash !== fp.skillHash)
+            msgs.push("local skill/plugin dir contents changed since record — re-record");
+    }
+    return msgs;
 }
 /** A minimal RunRecord for a truncated-cassette replay — empty collections so downstream evaluate()/the
  *  mismatch loops don't NPE; result:"error" because the cassette could not be driven to completion. */
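
The capability-manifest filter added in this hunk is easy to probe on its own. The function body below is copied from the diff; the sample lines are hypothetical and only illustrate the two structural forms, the shape fallback, and two negatives.

```javascript
// Copied from the hunk above: true only for the agent's capability-manifest messages.
function isCapabilityManifest(line) {
  let m;
  try {
    m = JSON.parse(line);
  } catch {
    return false;
  }
  if (m?.type === "system" && m?.subtype === "init") return true;
  if (m?.type === "control_response") {
    const r = m.response ?? {};
    if (r.request_id === "init-1") return true;
    const body = r.response;
    if (body && typeof body === "object" && "commands" in body && "agents" in body) return true; // shape fallback
  }
  return false;
}

// Hypothetical sample lines, for illustration only.
const samples = {
  init: JSON.stringify({ type: "system", subtype: "init", tools: [] }),
  initResponse: JSON.stringify({ type: "control_response", response: { request_id: "init-1", response: {} } }),
  shapeFallback: JSON.stringify({ type: "control_response", response: { request_id: "other", response: { commands: [], agents: [] } } }),
  assistant: JSON.stringify({ type: "assistant", message: "total is $1,200" }),
  notJson: "plain text line",
};

for (const [name, line] of Object.entries(samples)) {
  console.log(name, isCapabilityManifest(line));
}
// init true
// initResponse true
// shapeFallback true
// assistant false
// notJson false
```

In `scanCassette`, lines for which this returns `true` are scanned with the email-only pattern set, while every other line gets the full email/currency/domain net.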
@@ -270,75 +372,246 @@ const NOOP_DECIDER = {
         return ABSTAIN;
     },
 };
-/** `record <scenario.yaml> [--out <file>]` — run live + save a cassette. */
+/** Apply CONTENT redaction (the opt-in policy) across the WHOLE cassette surface (C1): events/controlOut
+ *  protocol lines (structurally — string leaves AND object keys, keeping JSON valid + the O7 question/answer
+ *  strings in sync), artifact bodies, the scenario prompt/answers/assert metadata, and the diagnostic
+ *  skillSources. Identity fields (name/session/fidelity/baseline) are left intact so replay still resolves.
+ *  Pure — returns a new cassette. Distinct from secret-scrub (`scrub`), which runs first. */
+export function redactCassette(cassette, policy) {
+    const scenario = {
+        ...cassette.scenario,
+        prompt: redactText(cassette.scenario.prompt, policy),
+        answers: redactStructural(cassette.scenario.answers, policy),
+        assert: redactStructural(cassette.scenario.assert, policy),
+    };
+    return {
+        ...cassette,
+        scenario,
+        events: cassette.events.map((l) => redactJsonLine(l, policy)),
+        controlOut: cassette.controlOut?.map((l) => redactJsonLine(l, policy)),
+        artifacts: cassette.artifacts?.map((a) => ({
+            ...a,
+            path: redactText(a.path, policy), // C1: a filename can name a customer (outputs/Acme-cap-table.json)
+            ...(a.body !== undefined ? { body: redactJsonLine(a.body, policy) } : {}),
+        })),
+        fingerprint: cassette.fingerprint
+            ? { ...cassette.fingerprint, skillSources: cassette.fingerprint.skillSources?.map((s) => redactText(s, policy)) }
+            : undefined,
+    };
+}
+/** A3 / C4 cardinal-sin guard: redaction must be VERDICT-PRESERVING. Replay both the pre-redaction and the
+ *  redacted cassette (token-free) and compare verdicts; if redaction flipped any replay-checkable assertion
+ *  (e.g. stripped a value a `transcript_not_matches` keys on, manufacturing a green), throw — never write a
+ *  cassette whose verdict was changed by redaction. */
+export async function assertRedactionVerdictPreserved(base, redacted) {
+    const vb = computeVerdict(await replayCassette(base), "replay");
+    const vr = computeVerdict(await replayCassette(redacted), "replay");
+    if (vb.pass !== vr.pass)
+        throw new Error(`redaction changed the replay verdict (pre-redaction pass=${vb.pass} → redacted pass=${vr.pass}) — redaction altered an ` +
+            `asserted observable; refusing to write a cassette whose verdict was manufactured by redaction (A3). ` +
+            `Record against synthetic inputs, or narrow the redaction policy so it doesn't touch asserted values.`);
+}
+/** B1: classify the `*.yaml`/`*.yml` (non-recursive) under `dir` for batch `record`. Classification keys on a
+ *  POSITIVE `prompt:` signal — NOT on "Scenario.parse threw", because a session YAML and a broken scenario
+ *  both throw the same error. A doc with `prompt:` that fails to parse is BROKEN (a batch failure), never a
+ *  silent skip — silently swallowing a broken scenario as a non-scenario is the false-green this guards. */
+export function discoverScenarios(dir) {
+    const files = readdirSync(dir)
+        .filter((f) => /\.ya?ml$/i.test(f))
+        .sort()
+        .map((f) => join(dir, f));
+    const out = { scenarios: [], skipped: [], broken: [] };
+    for (const f of files) {
+        let raw;
+        try {
+            raw = parseYaml(readFileSync(f, "utf8"));
+        }
+        catch (e) {
+            out.broken.push({ file: f, error: `YAML parse error: ${e.message}` });
+            continue;
+        }
+        const hasPrompt = raw !== null && typeof raw === "object" && "prompt" in raw;
+        if (!hasPrompt) {
+            out.skipped.push(f); // no prompt → a session/other doc; announced skip, not a failure
+            continue;
+        }
+        try {
+            parseScenarioFile(f);
+            out.scenarios.push(f);
+        }
+        catch (e) {
+            out.broken.push({ file: f, error: e.message });
+        }
+    }
+    return out;
+}
+/** Read + parse a cassette, never throwing — a malformed `*.cassette.json` must be TALLIED, not crash a
+ *  whole batch (a crash mid-walk reads as "the rest were fine" — a false-green by abort). */
+function readCassette(path) {
+    try {
+        return { cassette: JSON.parse(readFileSync(path, "utf8")) };
+    }
+    catch (e) {
+        return { error: `unreadable / invalid cassette JSON: ${e.message}` };
+    }
+}
+/** B2: the committed cassettes under `dir` whose fingerprint has drifted (baseline/skill) — the re-record
+ *  work-list. Pure + token-free (reuses `checkStaleness`); the actual re-record needs the live agent. A
+ *  malformed cassette is surfaced as stale (needs attention) rather than silently dropped. */
+export function selectStaleCassettes(dir) {
+    return readdirSync(dir)
+        .filter((f) => f.endsWith(".cassette.json"))
+        .sort()
+        .map((f) => join(dir, f))
+        .map((path) => {
+        const r = readCassette(path);
+        return "error" in r ? { path, staleness: [r.error] } : { path, staleness: checkStaleness(r.cassette, dirname(path)) };
+    })
+        .filter((x) => x.staleness.length > 0);
+}
+/** Record one scenario FILE → one cassette (parses the file, then shares the live-record tail with the
+ *  in-memory path). The file's dir feeds the redaction-policy search (for a co-located .cowork-redact.json). */
+async function recordScenarioFile(file, opts) {
+    return recordScenarioObject(parseScenarioFile(file), opts, [dirname(file)]);
+}
+/** `record <scenario.yaml | dir> [--out <file>] [--rerecord-stale] [--no-redact] [--allow-failing]` —
+ *  run live + save a cassette. A single file records one; a dir batches (B1); --rerecord-stale (B2) treats
+ *  the dir as committed cassettes and re-records only those whose fingerprint drifted. */
 export async function cmdRecord(args) {
+    const noRedact = args.includes("--no-redact");
+    if (noRedact)
+        log("record: --no-redact — content redaction is OFF; the cassette is written verbatim, so ensure inputs are synthetic.");
+    const allowFailing = args.includes("--allow-failing");
+    const rerecordStale = args.includes("--rerecord-stale");
     const outIdx = args.indexOf("--out");
-    // #9: bounds-check --out's value — a trailing `--out` makes cassettePath undefined → a raw
-    // dirname(undefined)/writeFileSync(undefined) crash surfacing as an `internal` error.
     if (outIdx >= 0 && args[outIdx + 1] === undefined) {
         log("usage: record <scenario.yaml> --out <file.cassette.json>  (--out needs a value)");
-        process.exit(2);
+        return process.exit(2);
     }
-    // Skip --out's VALUE when scanning for the scenario positional, so the common flag-first form
-    // `record --out out.json scenario.yaml` records scenario.yaml (not out.json) as the scenario.
-    const scenarioPositionals = args.filter((a, i) => !a.startsWith("--") && args[i - 1] !== "--out");
-    const file = scenarioPositionals[0];
-    if (!file) {
-        log("usage: record <scenario.yaml> [--out <file.cassette.json>]");
-        process.exit(2);
+    const positionals = args.filter((a, i) => !a.startsWith("--") && args[i - 1] !== "--out");
+    const target = positionals[0];
+    if (!target) {
+        log("usage: record <scenario.yaml | dir/> [--out <file>] [--rerecord-stale] [--no-redact] [--allow-failing]");
+        return process.exit(2);
     }
-    // Reject extra scenario positionals rather than silently dropping all but the first (record takes ONE).
-    if (scenarioPositionals.length > 1) {
-        log(`record takes a single scenario (got ${scenarioPositionals.length}: ${scenarioPositionals.join(", ")})`);
-        process.exit(2);
+    if (positionals.length > 1) {
+        log(`record takes a single scenario or dir (got ${positionals.length}: ${positionals.join(", ")})`);
+        return process.exit(2);
     }
-    const scenario = parseScenarioFile(file);
+    const isDir = existsSync(target) && statSync(target).isDirectory();
+    // B2: re-record only the drifted cassettes in a committed cassette dir.
+    if (rerecordStale) {
+        if (!isDir) {
+            log("record --rerecord-stale takes a DIRECTORY of committed cassettes");
+            return process.exit(2);
+        }
+        const stale = selectStaleCassettes(target);
+        if (stale.length === 0) {
+            log(`✓ record --rerecord-stale: all cassettes under ${target} are fresh — nothing to re-record`);
+            return process.exit(0);
+        }
+        let failures = 0;
+        for (const { path: cp, staleness } of stale) {
+            const rc = readCassette(cp);
+            if ("error" in rc) {
+                failures++;
+                log(`  ✗ ${cp}: ${rc.error} — cannot re-record`);
+                continue;
+            }
+            const cassette = rc.cassette;
+            // Re-record from the embedded scenario, re-resolving its relocatable session against the cassette dir.
+            const sessionRef = cassette.scenario.session === "(inline)" ? "(inline)" : join(dirname(cp), cassette.scenario.session);
+            log(`↻ re-recording ${cp} (stale: ${staleness.join("; ")})`);
+            try {
+                const r = await recordScenarioObject({ ...cassette.scenario, session: sessionRef }, { noRedact, allowFailing, cassettePath: cp });
+                log(`  ✓ ${cp} (${r.result.result})`);
+            }
+            catch (e) {
+                failures++;
+                log(`  ✗ ${cp}: ${e.message}`);
+            }
+        }
+        return process.exit(failures > 0 ? 1 : 0);
+    }
+    // B1: batch a directory of scenarios.
+    if (isDir) {
+        const disc = discoverScenarios(target);
+        for (const s of disc.skipped)
+            log(`· skipped (not a scenario — no \`prompt:\`): ${s}`);
+        for (const b of disc.broken)
+            log(`✗ ${b.file}: ${b.error}`);
+        if (disc.scenarios.length === 0) {
+            log(`record: no scenarios discovered under ${target} (loud non-zero — not a vacuous "0 failures = green")`);
+            return process.exit(2);
+        }
+        let failures = disc.broken.length;
+        for (const f of disc.scenarios) {
+            try {
+                const r = await recordScenarioFile(f, { noRedact, allowFailing });
+                log(`✓ ${f} → ${r.cassettePath} (${r.result.result})`);
+            }
+            catch (e) {
+                failures++;
+                log(`✗ ${f}: ${e.message}`);
+            }
+        }
+        log(failures > 0
+            ? `✗ record: ${failures} of ${disc.scenarios.length + disc.broken.length} failed`
+            : `✓ record: ${disc.scenarios.length} cassette(s)`);
+        return process.exit(failures > 0 ? 1 : 0);
+    }
+    // Single scenario file.
+    try {
+        const cassettePath = outIdx >= 0 ? args[outIdx + 1] : undefined;
+        const r = await recordScenarioFile(target, { noRedact, allowFailing, cassettePath });
+        log(`✓ recorded ${r.result.result} · ${r.artifacts} artifact(s) → ${r.cassettePath}`);
+    }
+    catch (e) {
+        log(`record: ${e.message}`);
+        return process.exit(1);
+    }
+}
+/** The live-record TAIL shared by the file (B1/single) and in-memory (B2 re-record) paths: run live, refuse
+ *  a failing run unless opted in (A3), snapshot + secret-scrub bodies (C2), opt-in redact + verdict-preserve
+ *  (A1/A3), write. `extraPolicyDirs` adds the scenario-file dir to the .cowork-redact.json search. */
+async function recordScenarioObject(scenario, opts, extraPolicyDirs = []) {
     const result = await executeScenario(scenario);
-    const events = safeLines(join(result.outDir, "events.jsonl"));
-    const controlOut = safeLines(join(result.outDir, "control-out.jsonl"));
-    const cassettePath = outIdx >= 0 ? args[outIdx + 1] : join("cassettes", `${scenario.name}.cassette.json`);
+    const cassettePath = opts.cassettePath ?? join("cassettes", `${scenario.name}.cassette.json`);
     mkdirSync(dirname(cassettePath), { recursive: true });
-    // Store a RELOCATABLE session path (relative to the cassette dir) instead of the absolute resolved path
-    // parseScenarioFile baked in — replay never loads the session for the pipeline, so this is metadata-only,
-    // but it keeps a moved bundle honest. Record the resolved tier so replay can report effectiveFidelity.
+    // A3: a failing live run frozen into a cassette is a latent false-signal — refuse unless opted in.
+    if (!computeVerdict(result, "live").pass && !opts.allowFailing)
+        throw new Error(`live run did NOT pass (result=${result.result}) — refusing to freeze a failing run (re-run, or --allow-failing)`);
+    // RELOCATABLE session path (relative to the cassette dir) — metadata-only, keeps a moved bundle honest.
     const relocatable = {
         ...scenario,
         session: scenario.session === "(inline)" ? "(inline)" : relative(dirname(cassettePath), scenario.session),
     };
-    // #1: snapshot the user-visible artifacts (from the live work root, before --keep cleanup) so
-    // file_exists/user_visible_artifact/artifact_json survive token-free replay. #1b: a staleness tripwire
-    // over the recording's inputs (baseline + local skill dirs) — `scenario.session` is still absolute here.
-    const artifacts = result.workDir ? buildManifest(result.workDir) : [];
-    const fingerprint = buildFingerprint(scenario.session, result.baseline);
-    const cassette = {
+    // C2: buildManifest reads output bodies RAW (executeScenario scrubs result/events/control-out, NOT
+    // outputs/) — secret-scrub each body before it is committed.
+    const secrets = collectSecrets();
+    const artifacts = (result.workDir ? buildManifest(result.workDir) : []).map((a) => a.body !== undefined ? { ...a, body: scrub(a.body, secrets) } : a);
+    const base = {
         cassetteVersion: CASSETTE_VERSION,
         scenario: relocatable,
-        events,
-        controlOut,
+        events: safeLines(join(result.outDir, "events.jsonl")),
+        controlOut: safeLines(join(result.outDir, "control-out.jsonl")),
         effectiveFidelity: result.effectiveFidelity,
         artifacts,
-        fingerprint,
+        fingerprint: buildFingerprint(scenario.session, result.baseline),
     };
-    writeFileSync(cassettePath, JSON.stringify(cassette, null, 2));
-    // capture summary: turns (assistant messages) + tool calls in the recording
-    let turns = 0;
-    let tools = 0;
-    for (const e of cassette.events) {
-        let m;
-        try {
-            m = JSON.parse(e);
-        }
-        catch {
-            continue;
-        }
-        if (m.type === "assistant") {
-            turns++;
-            for (const b of m.message?.content ?? [])
-                if (b.type === "tool_use")
-                    tools++;
-        }
+    // A1 (opt-in) content redaction over the whole surface (C1). Empty policy → no-op. Non-empty → must be
+    // VERDICT-PRESERVING (A3): replay both and refuse to write on divergence (a manufactured green).
+    const policy = opts.noRedact
+        ? { patterns: [], keyNames: [] }
+        : loadRedactionPolicy([process.cwd(), ...extraPolicyDirs, dirname(cassettePath)]);
+    let cassette = base;
+    if (policy.patterns.length || policy.keyNames.length) {
+        const redacted = redactCassette(base, policy);
+        await assertRedactionVerdictPreserved(base, redacted);
+        cassette = redacted;
     }
-    log(`✓ recorded ${events.length} events · ${turns} turns · ${tools} tool calls · ${artifacts.length} artifact(s) → ${cassettePath}  (${result.result})`);
+    writeFileSync(cassettePath, JSON.stringify(cassette, null, 2));
+    return { result, cassettePath, artifacts: artifacts.length };
 }
 /** `replay --cassette <file>` — deterministic protocol-replay; re-evaluates content assertions. */
 export async function cmdReplay(args) {
@@ -374,6 +647,94 @@ export async function cmdReplay(args) {
         renderFooter(result, plan, { renderer, lane: "replay" });
     process.exit(verdict.exitCode);
 }
+/** `verify-cassettes <file|dir>` — the CI gate (token/agent-free). Runs the privacy scan (A2) and the
+ *  staleness check (B3) over one cassette or every `*.cassette.json` in a dir (non-recursive). Exit 1 on any
+ *  real PII finding or staleness drift; `unscanned` notes are informational. Dedicated JSON envelope. */
+export function cmdVerifyCassettes(args) {
+    let json;
+    try {
+        json = parseOutputFormat(args) === "json";
+    }
+    catch (e) {
+        log(String(e.message));
+        return process.exit(2);
+    }
+    const privacyOnly = args.includes("--privacy-only");
+    const stalenessOnly = args.includes("--staleness-only");
+    // Both flags together would disable BOTH families → empty findings → ok=true → exit 0: a silent
+    // false-green in the gate itself. Reject it as a usage error.
+    if (privacyOnly && stalenessOnly) {
+        log("verify-cassettes: --privacy-only and --staleness-only are mutually exclusive (together they'd check nothing)");
+        return process.exit(2);
+    }
+    const doPrivacy = !stalenessOnly;
+    const doStaleness = !privacyOnly;
+    const allow = [];
+    for (let i = 0; i < args.length; i++) {
+        if (args[i] !== "--allow")
+            continue;
+        const src = args[++i];
+        if (src === undefined) {
+            log("--allow needs a regex value");
+            return process.exit(2);
+        }
+        try {
+            allow.push(new RegExp(src, "i"));
+        }
+        catch {
+            log(`--allow: invalid regex: ${src}`);
+            return process.exit(2);
+        }
+    }
+    const target = args.find((a, i) => !a.startsWith("--") && args[i - 1] !== "--allow");
+    if (!target) {
+        log("usage: verify-cassettes <file|dir> [--privacy-only|--staleness-only] [--allow <regex>]... [--output-format json]");
+        return process.exit(2);
+    }
+    if (!existsSync(target)) {
+        log(`verify-cassettes: path not found: ${target}`);
+        return process.exit(2);
+    }
+    const files = statSync(target).isDirectory()
+        ? readdirSync(target)
+            .filter((f) => f.endsWith(".cassette.json"))
+            .sort()
+            .map((f) => join(target, f))
+        : [target];
+    if (files.length === 0) {
+        log(`verify-cassettes: no .cassette.json files under ${target} — nothing verified (loud non-zero, not a vacuous pass)`);
+        return process.exit(2);
+    }
+    const results = files.map((f) => {
+        const rc = readCassette(f);
+        if ("error" in rc)
+            return { file: f, findings: [], staleness: [], error: rc.error };
+        const findings = doPrivacy ? scanCassette(rc.cassette, allow) : [];
+        const staleness = doStaleness ? checkStaleness(rc.cassette, dirname(f)) : [];
+        return { file: f, findings, staleness, error: undefined };
+    });
+    const realFindings = results.flatMap((r) => r.findings.filter((x) => x.cls !== "unscanned"));
+    const staleAny = results.some((r) => r.staleness.length > 0);
+    const errorAny = results.some((r) => r.error !== undefined);
+    const ok = realFindings.length === 0 && !staleAny && !errorAny;
+    if (json) {
+        out(JSON.stringify({ command: "verify-cassettes", ok, results }));
+    }
+    else {
+        for (const r of results) {
+            if (r.error)
+                log(`✗ ${r.file}: [error] ${r.error}`);
+            for (const f of r.findings)
+                log(`${f.cls === "unscanned" ? "·" : "✗"} ${r.file}: [${f.cls}] ${f.where} — ${f.sample}`);
+            for (const s of r.staleness)
+                log(`✗ ${r.file}: [stale] ${s}`);
+        }
+        log(ok
+            ? `✓ verify-cassettes: ${files.length} cassette(s) clean`
+            : `✗ verify-cassettes: ${realFindings.length} PII finding(s)${staleAny ? " + staleness drift" : ""}${errorAny ? " + unreadable cassette(s)" : ""} across ${files.length} cassette(s)`);
+    }
+    return process.exit(ok ? 0 : 1);
+}
 /** Replay a cassette through Run and re-evaluate the content assertions. With a `cassette.artifacts`
  *  manifest (#1), filesystem assertions (file_exists/user_visible_artifact/artifact_json) ALSO run, against
  *  the materialized snapshot. `opts.strict` (#1b) escalates staleness warnings to failing assertions. */
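To make the batch-discovery rule in `discoverScenarios` concrete, here is a minimal sketch of the same three-way classification over in-memory documents. It is an illustration under stated assumptions, not the package's code: `classifyDocs`, `strictScenario`, and the sample file names are hypothetical, and `JSON.parse` stands in for the YAML parser so the snippet needs no dependencies.

```javascript
// Hypothetical stand-in for scenario parsing: a scenario must carry a string `name`.
function strictScenario(raw) {
  if (typeof raw.name !== "string") throw new Error("scenario is missing `name`");
  return raw;
}

// Classify docs on a POSITIVE `prompt` signal, mirroring the diff's discoverScenarios:
//   unparseable text          -> broken
//   parses, no `prompt`       -> skipped (announced, not a failure)
//   has `prompt`, parse fails -> broken (never a silent skip)
//   has `prompt`, parses      -> scenario
function classifyDocs(docs, parseText, parseScenario) {
  const out = { scenarios: [], skipped: [], broken: [] };
  for (const { file, text } of docs) {
    let raw;
    try {
      raw = parseText(text);
    } catch (e) {
      out.broken.push({ file, error: `YAML parse error: ${e.message}` });
      continue;
    }
    const hasPrompt = raw !== null && typeof raw === "object" && "prompt" in raw;
    if (!hasPrompt) {
      out.skipped.push(file);
      continue;
    }
    try {
      parseScenario(raw);
      out.scenarios.push(file);
    } catch (e) {
      out.broken.push({ file, error: e.message });
    }
  }
  return out;
}

// JSON text is used here only as a dependency-free substitute for YAML.
const discovered = classifyDocs(
  [
    { file: "a.yaml", text: '{"name":"a","prompt":"do x"}' },
    { file: "b.yaml", text: '{"mounts":[]}' },
    { file: "c.yaml", text: '{"prompt":"no name"}' },
    { file: "d.yaml", text: "{oops" },
  ],
  JSON.parse,
  strictScenario,
);
console.log(discovered);
```

The point of the design is that `c.yaml` lands in `broken` rather than `skipped`: a document that declares `prompt` has identified itself as a scenario, so a parse failure counts against the batch.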

package/dist/scan.js ADDED

@@ -0,0 +1,30 @@
+export const DEFAULT_SCAN_PATTERNS = [
+    { re: /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi, cls: "email" },
+    { re: /\$\s?\d[\d,]*(?:\.\d+)?\s?(?:k|m|b|bn|million|billion)?/gi, cls: "currency" },
+    { re: /\b[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.(?:com|io|net|org|co|app|ai|dev|xyz)\b/gi, cls: "domain" },
+];
+/** The high-precision subset (email only). `email` is scanned UNIVERSALLY — even on the agent
+ *  capability-manifest messages (the `system/init` event and the `initialize` registry `control_response`),
+ *  because the registry's `account` field can carry the developer's own email (a real leak). The noisy
+ *  classes (`currency`/`domain`) are the ones suppressed on those two manifest messages, where every hit is
+ *  the agent's tool/skill catalog or MCP-server names — environment boilerplate a regex can't tell apart
+ *  from customer data. Everywhere else (assistant reasoning, tool I/O, decisions, the deliverable) gets the
+ *  full net. */
+export const EMAIL_SCAN_PATTERNS = DEFAULT_SCAN_PATTERNS.filter((p) => p.cls === "email");
+function allowed(sample, allow) {
+    // Test against a non-global clone so a caller's /g regex can't carry lastIndex across calls.
+    return allow.some((a) => new RegExp(a.source, a.flags.replace("g", "")).test(sample));
+}
+/** Scan one string for PII matches, suppressing anything the allowlist covers. */
+export function scanText(text, where, allow, patterns = DEFAULT_SCAN_PATTERNS) {
+    const out = [];
+    for (const { re, cls } of patterns) {
+        const g = new RegExp(re.source, re.flags.includes("g") ? re.flags : re.flags + "g");
+        for (const m of text.matchAll(g)) {
+            const sample = m[0];
+            if (!allowed(sample, allow))
+                out.push({ where, cls, sample });
+        }
+    }
+    return out;
+}
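A short usage sketch may help show how the three default patterns and the allowlist interact. The patterns and the matching loop below are copied from the `scan.js` hunk above; the names `SKETCH_PATTERNS` and `scanSketch`, the sample sentence, and the `where` label are assumptions made for illustration.

```javascript
// Patterns copied from DEFAULT_SCAN_PATTERNS in the hunk above.
const SKETCH_PATTERNS = [
  { re: /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi, cls: "email" },
  { re: /\$\s?\d[\d,]*(?:\.\d+)?\s?(?:k|m|b|bn|million|billion)?/gi, cls: "currency" },
  { re: /\b[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.(?:com|io|net|org|co|app|ai|dev|xyz)\b/gi, cls: "domain" },
];

// Same shape as scanText: collect every match unless an allow regex matches the SAMPLE itself.
function scanSketch(text, where, allow, patterns = SKETCH_PATTERNS) {
  const found = [];
  for (const { re, cls } of patterns) {
    // Fresh global clone per call so lastIndex never leaks between scans.
    const g = new RegExp(re.source, re.flags.includes("g") ? re.flags : re.flags + "g");
    for (const m of text.matchAll(g)) {
      const sample = m[0];
      const isAllowed = allow.some((a) => new RegExp(a.source, a.flags.replace("g", "")).test(sample));
      if (!isAllowed) found.push({ where, cls, sample });
    }
  }
  return found;
}

const sampleText = "Send the deck to jane@acme.com; the round is $2.5m.";
const unfiltered = scanSketch(sampleText, "events[0]", []);
const filtered = scanSketch(sampleText, "events[0]", [/acme/i]);
console.log(unfiltered.map((f) => `${f.cls}:${f.sample}`));
console.log(filtered.map((f) => `${f.cls}:${f.sample}`));
```

Because the allowlist is tested against the matched sample rather than a class, an allow regex such as `/acme/i` suppresses both the bare domain and the email that contains it. That breadth is worth keeping in mind when choosing `--allow` values.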

package/dist/types.js CHANGED

@@ -145,13 +145,17 @@ export const Assertion = z.object({
         artifact: z.string().describe("relative path to a JSON artifact under the work root (e.g. outputs/cap_state.json)"),
         path: z.string().optional().describe("dotted path into the JSON (e.g. me.run_id); omit to target the whole document"),
         equals: z.unknown().optional().describe("the resolved value deep-equals this"),
+        in: z
+            .array(z.unknown())
+            .optional()
+            .describe("the resolved value deep-equals one of these (stable for stochastic/LLM-extracted values where equals churns)"),
         gt: z.number().optional().describe("the resolved value is a number greater than this"),
         exists: z.boolean().optional().describe("the path resolves to a present (non-absent) value"),
         absent: z.boolean().optional().describe("the final key is absent from its (resolved) parent — the anti-hallucination negative"),
         is_null: z.boolean().optional().describe("the resolved value is JSON null (distinct from absent)"),
     })
         .optional()
-        .describe("assert over a JSON artifact's contents (dotted path + equals|gt|exists|absent|is_null)"),
+        .describe("assert over a JSON artifact's contents (dotted path + equals|in|gt|exists|absent|is_null)"),
 });
 export const ScenarioObject = z
     .object({
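The new `in` operator is described in the schema as "deep-equals one of these". The evaluator itself lives in `dist/assert.js`, which is not shown in this part of the diff, so the following is only a sketch of what set membership under deep equality could look like. `deepEqual` and `matchesIn` are hypothetical helpers, not the package's functions.

```javascript
// Hypothetical structural equality over JSON-like values (primitives, arrays, plain objects).
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const ka = Object.keys(a);
  const kb = Object.keys(b);
  if (ka.length !== kb.length) return false;
  return ka.every((k) => Object.prototype.hasOwnProperty.call(b, k) && deepEqual(a[k], b[k]));
}

// `in: [<set>]` passes when the resolved value deep-equals at least one candidate.
function matchesIn(value, candidates) {
  return candidates.some((c) => deepEqual(value, c));
}

const stageOk = matchesIn("seed", ["seed", "series-a"]);
const stageBad = matchesIn("series-b", ["seed", "series-a"]);
const objectOk = matchesIn({ tags: ["a", "b"] }, [{ tags: ["a", "b"] }, { tags: [] }]);
console.log(stageOk, stageBad, objectOk);
```

Deep equality matters because `z.array(z.unknown())` admits objects and arrays as candidates, where a plain `===` check would never match a value freshly parsed from an artifact.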

package/docs/cassette.md CHANGED

@@ -183,6 +183,54 @@ Re-record a cassette when:
 - `replay` exits 1 on a `replay_protocol_fidelity` mismatch — this means `serializeDecision` changed;
   review the change, confirm it's correct, then re-record to update the frozen envelope.
+## Batch recording
+`record` takes a single scenario OR a directory:
+```bash
+cowork-harness record scenarios/                 # record every scenario in the dir (one cassette each)
+cowork-harness record cassettes/ --rerecord-stale # re-record ONLY the cassettes whose fingerprint drifted
+```
+Directory discovery keys on a **positive `prompt:` signal**: a `*.yaml` with no top-level `prompt:` is an
+announced skip (it's a session/other doc), but a doc that *looks* like a scenario (has `prompt:`) yet fails
+to parse is a **failure**, never a silent skip. Zero scenarios discovered → loud non-zero exit. `record`
+also **refuses to freeze a failing live run** into a cassette (`--allow-failing` overrides) — a committed
+red cassette is a latent false-signal.
+## Privacy: cassettes are committed fixtures
+A cassette snapshots the transcript **and** the `outputs/` JSON bodies (names, dollar figures, share
+counts) — committed PII surface. Two layers, distinct from secret-scrub (which only strips auth tokens):
+- **Opt-in redaction** (the mutation). Drop a `.cowork-redact.json` next to your scenarios, or set
+  `COWORK_HARNESS_REDACT_PATTERNS` / `COWORK_HARNESS_REDACT_KEYS`. At record time it rewrites matching PII
+  across the whole cassette surface (transcript, artifact bodies + filenames, prompt/answers/assert,
+  skillSources) **structurally** — JSON stays valid and the AskUserQuestion question/answer strings stay in
+  sync, so the O7 guard still passes. Redaction is **verdict-preserving**: `record` replays before/after and
+  **refuses to write** if redaction would flip an assertion (a manufactured green is the cardinal sin).
+  `--no-redact` skips it for known-synthetic inputs.
+- **Always-on scan gate** — `verify-cassettes <file|dir>` scans the committed cassettes and **exits
+  non-zero** on a finding, so "no leak" is a gate, not discipline. The full net (`email` + `currency` +
+  bare-`domain`) runs over the **whole cassette** — the deliverable (`outputs/` bodies + filenames), the
+  author-written `prompt`/`answers`/`assert`, AND the agent's reasoning + tool I/O — with **one structural
+  exception**: the agent's **capability-manifest** messages (the `system/init` event and the `initialize`
+  registry `control_response`, `request_id:"init-1"`) are excluded from the noisy classes. Those two carry
+  the tool/skill catalog (slash-command descriptions naming `docsend.com`, `Pitch.com`, …) and the MCP-server
+  names (`claude.ai Gmail`, …) — environment boilerplate a regex can't tell apart from customer data, and the
+  sole concentrated source of false positives. They are excluded **as a unit**, not by domain — but `email`
+  still scans them (the registry's `account` field can carry the developer's own email). `--allow <regex>`
+  suppresses synthetic / public reference names (e.g. `NVCA`, `Cooley GO`, `Acme`); multi-word proper names
+  are **not** a default class (too noisy). `verify-cassettes` also runs the **staleness** check
+  (`--staleness-only`): a drifted `skillHash` (you edited the skill but didn't re-record) fails the gate.
+```bash
+cowork-harness verify-cassettes cassettes/ --allow 'NVCA|Cooley GO|Acme'
+```
+The cardinal rule still holds: record against **synthetic** inputs (e.g. "Cadence / Acme", made-up
+numbers) — redaction and the scan are belt-and-suspenders, not a license to record real customer data.
 ## Committed fixture
 `examples/replays/example-pdf-skill.cassette.json` is a **synthetic** cassette committed to the repo
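The exit behaviour of `verify-cassettes` described above follows a small rule in `cmdVerifyCassettes`: `unscanned` notes are informational, while real findings, staleness drift, and unreadable cassettes each fail the gate. A minimal sketch of that rule, with the per-file result shape taken from the diff and the function name `gateVerdict` and sample data invented for illustration:

```javascript
// Mirrors the ok/exit computation in cmdVerifyCassettes; input objects here are made up.
function gateVerdict(results) {
  const realFindings = results.flatMap((r) => r.findings.filter((x) => x.cls !== "unscanned"));
  const staleAny = results.some((r) => r.staleness.length > 0);
  const errorAny = results.some((r) => r.error !== undefined);
  const ok = realFindings.length === 0 && !staleAny && !errorAny;
  return { ok, exitCode: ok ? 0 : 1, realFindings: realFindings.length };
}

const cleanRun = gateVerdict([
  { file: "a.cassette.json", findings: [{ cls: "unscanned", where: "events[3]", sample: "" }], staleness: [], error: undefined },
]);
const leakyRun = gateVerdict([
  { file: "a.cassette.json", findings: [{ cls: "email", where: "events[1]", sample: "dev@example.com" }], staleness: [], error: undefined },
]);
const brokenRun = gateVerdict([
  { file: "b.cassette.json", findings: [], staleness: [], error: "unreadable / invalid cassette JSON" },
]);
console.log(cleanRun, leakyRun, brokenRun);
```

Usage errors (both `--privacy-only` and `--staleness-only`, a missing path, or an empty directory) are a separate case and exit 2 before this verdict is ever computed.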

package/docs/maintenance.md CHANGED

@@ -112,12 +112,26 @@ npm version patch        # or minor | major
 git push --follow-tags
 ```
-Pushing the `vX.Y.Z` tag triggers `release.yml`, which verifies the tag matches `package.json`, runs
-`npm run ci`, then `npm publish --provenance --access public`. Auth is **OIDC**: the workflow's
-`id-token: write` is exchanged for a short-lived publish credential — there is **no `NPM_TOKEN`**. A
-GitHub Release is opened from the tag, and `prepublishOnly` re-runs CI so a manual publish is guarded too.
-A published version is **immutable** — the same `X.Y.Z` can never be re-published, so a botched run needs a
-new patch (not a re-run against the same version).
+Pushing the `vX.Y.Z` tag triggers `release.yml`, which (in order) **waits for `ci.yml` to have succeeded
+for that commit**, verifies the tag matches `package.json`, checks `CHANGELOG.md` has a `## [X.Y.Z]`
+heading, runs the **version-lockstep guard** (`npm run check:versions`), runs `npm run ci`, then
+`npm publish --provenance --access public`. Auth is **OIDC**: the workflow's `id-token: write` is exchanged
+for a short-lived publish credential — there is **no `NPM_TOKEN`**. A GitHub Release is opened from the tag,
+and `prepublishOnly` re-runs CI so a manual publish is guarded too. A published version is **immutable** —
+the same `X.Y.Z` can never be re-published, so a botched run needs a new patch (not a re-run against the
+same version).
+The `ci.yml`-success gate matters because `release.yml`'s own `npm run ci` is **TypeScript-only**, while
+`ci.yml` also runs pytest (the Python helper lane), `format:check`, the replay gate, and the boundary +
+scenario suites. Without the gate, a tag could publish a build that `main`'s CI would have rejected. The
+gate polls (~30 min) so `git push --follow-tags` works even when the commit's CI is still running.
+**Version-lockstep guard (`scripts/check-versions.ts`, run in both `ci.yml` and `release.yml`).** Fails
+loud unless all version strings agree: `package.json` == `package-lock.json`; the three skill versions
+(`marketplace.json`, the skill `plugin.json`, `SKILL.md` frontmatter) == each other; the `SKILL.md`
+bootstrap floor `@>=X.Y.Z` == its `tracks-harness:` version; and that floor is `<=` `package.json` (the
+skill can't demand a harness newer than this repo publishes). This enforces the lockstep the next section
+describes, so a hand-edited bump can't silently drift.
 **One-time setup (on npmjs.com):** configure a Trusted Publisher on the `cowork-harness` package →
 provider GitHub Actions, repo `yaniv-golan/cowork-harness`, workflow filename `release.yml`,

package/docs/scenario.md CHANGED

@@ -205,13 +205,21 @@ dotted `path` selects into the document; one operator decides the check:
 - artifact_json: { artifact: outputs/cap_state.json, path: me.run_id, equals: "r1" }
 - artifact_json: { artifact: outputs/cap_state.json, path: rounds.0.amount, gt: 0 }
 - artifact_json: { artifact: outputs/instruments.json, path: exclusivity_days, absent: true }   # anti-hallucination
+- artifact_json: { artifact: outputs/cap_state.json, path: stage, in: ["seed", "series-a"] }     # one of a stable set
 ```
-Operators: `equals` (deep-equal) · `gt` (number) · `exists: <bool>` · `absent: <bool>` · `is_null: <bool>`.
+Operators: `equals` (deep-equal) · `in: [<set>]` (deep-equal one of) · `gt` (number) · `exists: <bool>` · `absent: <bool>` · `is_null: <bool>`.
 The three states are **distinct**: `absent` (the final key is missing from a parent that resolved) vs
 `is_null` (present but JSON `null`) vs an **unresolved intermediate** segment (the artifact is malformed for
 that path) — which **fails loud**, never a vacuous pass. (No JSONPath/jq — a dotted path keeps it
 dependency-free and side-effect-free.)
+> **Stable vs brittle asserts on stochastic (LLM-extracted) values.** A cassette freezes ONE stochastic
+> output, so an `equals` on an LLM-extracted string will churn every time you re-record. Prefer **stable**
+> operators for extracted values: `absent` / `exists` (the anti-hallucination negative is rock-stable),
+> or `in: [<set>]` to accept any of a known-good set. Reserve `equals` for values the skill computes
+> deterministically (ids, counts, enums). This pairs with record-time redaction: redaction rewrites the
+> very strings an `equals` would pin, so `equals` on a redacted field would break on re-record anyway.
 > **Boundary assertions** (`egress_*`, `expect_denied`) require a sandboxed fidelity — `container`, `microvm`, or `hostloop` (all share the container sandbox + egress proxy). Only `protocol` is rejected, to avoid a false pass — see [boundary.md](./boundary.md).
 ### Which assertions survive `replay` (CI placement)
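The `scenario.md` hunk above distinguishes three outcomes for a dotted `path`: `absent`, `is_null`, and an unresolved intermediate segment that must fail loud. The resolver itself is not part of this diff, so the sketch below is one possible reading of that description. `resolvePath` and its return shape are hypothetical.

```javascript
// Hypothetical dotted-path resolver that keeps the three states apart.
//   { state: "present", value } -> the path resolved to a non-null value
//   { state: "null" }           -> the path resolved to JSON null
//   { state: "absent" }         -> the FINAL key is missing from a parent that resolved
//   { state: "unresolved", at } -> an intermediate segment could not be followed
function resolvePath(doc, path) {
  const segs = path === undefined || path === "" ? [] : path.split(".");
  let cur = doc;
  for (let i = 0; i < segs.length; i++) {
    if (cur === null || typeof cur !== "object") {
      return { state: "unresolved", at: segs.slice(0, i).join(".") };
    }
    const key = segs[i];
    if (!Object.prototype.hasOwnProperty.call(cur, key)) {
      return i === segs.length - 1
        ? { state: "absent" }
        : { state: "unresolved", at: segs.slice(0, i + 1).join(".") };
    }
    cur = cur[key];
  }
  return cur === null ? { state: "null" } : { state: "present", value: cur };
}

const capState = { me: { run_id: "r1", note: null }, rounds: [{ amount: 5 }] };
console.log(resolvePath(capState, "me.run_id"));
console.log(resolvePath(capState, "me.note"));
console.log(resolvePath(capState, "me.exclusivity_days"));
console.log(resolvePath(capState, "missing.run_id"));
console.log(resolvePath(capState, "rounds.0.amount"));
```

Under this reading, an `absent: true` assertion would pass only for the `absent` state. Treating `unresolved` as a pass would be exactly the vacuous green the documentation warns against.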

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "cowork-harness",
-  "version": "0.2.0",
+  "version": "0.3.0",
   "description": "Scriptable, CI-friendly harness for Claude Cowork's runtime contract for testing skills across scenarios — same agent, mounts, egress allowlist, permission protocol, and sandbox limitations.",
   "license": "MIT",
   "type": "module",
@@ -56,7 +56,8 @@
     "format:check": "prettier --check \"src/**/*.ts\" \"test/**/*.ts\"",
     "smoke:cli": "node dist/cli.js list",
     "ci": "npm run typecheck && npm run build && npm run test",
-    "prepublishOnly": "npm run ci"
+    "prepublishOnly": "npm run ci",
+    "check:versions": "tsx scripts/check-versions.ts"
   },
   "dependencies": {
     "yaml": "^2.5.0",

package/schema/scenario.schema.json CHANGED

@@ -223,6 +223,11 @@
                   "equals": {
                     "description": "the resolved value deep-equals this"
                   },
+                  "in": {
+                    "type": "array",
+                    "items": {},
+                    "description": "the resolved value deep-equals one of these (stable for stochastic/LLM-extracted values where equals churns)"
+                  },
                   "gt": {
                     "type": "number",
                     "description": "the resolved value is a number greater than this"
@@ -244,7 +249,7 @@
                   "artifact"
                 ],
                 "additionalProperties": false,
-                "description": "assert over a JSON artifact's contents (dotted path + equals|gt|exists|absent|is_null)"
+                "description": "assert over a JSON artifact's contents (dotted path + equals|in|gt|exists|absent|is_null)"
               }
             },
             "additionalProperties": false

package/scripts/check-versions.ts ADDED

@@ -0,0 +1,90 @@
+// Guards version lockstep across the npm package and the companion skill, so a
+// hand-edited release can't drift. Fails loud (exit 1) on any mismatch.
+//
+//   npm run check:versions
+//
+// Invariants (the skill versions INDEPENDENTLY from the npm package — see
+// docs/maintenance.md — so we do NOT require the skill version to equal the
+// package version):
+//   1. npm self-consistency:   package.json === package-lock.json (root + "" package).
+//   2. skill self-consistency: marketplace.json === skill plugin.json === SKILL.md `version:`.
+//   3. floor === tracks:       SKILL.md bootstrap floor `@>=X.Y.Z` === `tracks-harness:` version.
+//   4. floor <= package:       the harness version the skill demands must be one this repo
+//                              can publish (else the skill ships ahead of npm).
+import { readFileSync } from "node:fs";
+import { dirname, join } from "node:path";
+import { fileURLToPath, pathToFileURL } from "node:url";
+const REPO_ROOT = join(dirname(fileURLToPath(import.meta.url)), "..");
+const r = (p: string) => readFileSync(join(REPO_ROOT, p), "utf8");
+const json = (p: string) => JSON.parse(r(p)) as Record<string, any>;
+const SEMVER = /^\d+\.\d+\.\d+$/;
+/** Compare two X.Y.Z strings: <0 if a<b, 0 if equal, >0 if a>b. */
+function cmp(a: string, b: string): number {
+  const pa = a.split(".").map(Number);
+  const pb = b.split(".").map(Number);
+  for (let i = 0; i < 3; i++) if (pa[i] !== pb[i]) return pa[i] - pb[i];
+  return 0;
+}
+export function checkVersions(): { ok: boolean; errors: string[]; values: Record<string, string | undefined> } {
+  const errors: string[] = [];
+  const pkg = json("package.json").version as string;
+  const lock = json("package-lock.json");
+  const lockRoot = lock.version as string;
+  const lockPkg = lock.packages?.[""]?.version as string | undefined;
+  const market = json(".claude-plugin/marketplace.json").plugins?.[0]?.version as string | undefined;
+  const plugin = json(".claude/skills/cowork-harness/.claude-plugin/plugin.json").version as string | undefined;
+  const skillMd = r(".claude/skills/cowork-harness/SKILL.md");
+  const frontmatter = skillMd.split("---")[1] ?? "";
+  const skillVer = frontmatter.match(/^\s*version:\s*(\S+)\s*$/m)?.[1];
+  const tracks = skillMd.match(/tracks-harness:\s*cowork-harness\s+(\d+\.\d+\.\d+)/)?.[1];
+  const floor = skillMd.match(/cowork-harness@>=(\d+\.\d+\.\d+)/)?.[1];
+  const values = { pkg, lockRoot, lockPkg, market, plugin, skillVer, tracks, floor };
+  // 1. npm self-consistency
+  if (!SEMVER.test(pkg)) errors.push(`package.json version "${pkg}" is not X.Y.Z`);
+  if (lockRoot !== pkg) errors.push(`package-lock.json root version "${lockRoot}" != package.json "${pkg}"`);
+  if (lockPkg !== pkg) errors.push(`package-lock.json packages[""].version "${lockPkg}" != package.json "${pkg}"`);
+  // 2. skill self-consistency
+  const skillSet = new Set([market, plugin, skillVer]);
+  if (skillSet.size !== 1 || [...skillSet][0] === undefined) {
+    errors.push(
+      `skill version mismatch — marketplace.json=${market}, plugin.json=${plugin}, SKILL.md=${skillVer} (all three must agree)`,
+    );
+  }
+  // 3. floor === tracks-harness
+  if (!floor) errors.push(`could not find bootstrap floor "cowork-harness@>=X.Y.Z" in SKILL.md`);
+  if (!tracks) errors.push(`could not find "tracks-harness: cowork-harness X.Y.Z" in SKILL.md`);
+  if (floor && tracks && floor !== tracks) {
+    errors.push(`bootstrap floor "@>=${floor}" != tracks-harness "${tracks}" (keep them in lockstep)`);
+  }
+  // 4. floor <= package.json (the skill must not demand an unpublished/future harness)
+  if (floor && SEMVER.test(pkg) && cmp(floor, pkg) > 0) {
+    errors.push(`bootstrap floor "@>=${floor}" is ahead of package.json "${pkg}" — skill would lead npm`);
+  }
+  return { ok: errors.length === 0, errors, values };
+}
+function main(): void {
+  const { ok, errors, values } = checkVersions();
+  process.stdout.write(`version lockstep: ${JSON.stringify(values)}\n`);
+  if (ok) {
+    process.stdout.write("✓ all version strings are aligned\n");
+    return;
+  }
+  for (const e of errors) process.stderr.write(`::error::${e}\n`);
+  process.exitCode = 1;
+}
+// Run only when invoked directly (so a test can import checkVersions without side effects).
+if (import.meta.url === pathToFileURL(process.argv[1] ?? "").href) main();
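The four lockstep rules in `scripts/check-versions.ts` can also be exercised without touching the filesystem. The following is a minimal sketch, not part of the package: `lockstepErrors`, the `Versions` type, and the short error codes are hypothetical names introduced here for illustration, and the sample version strings are made up. It assumes the same rule order and the same numeric `cmp` as the script above.

```typescript
// Hypothetical pure variant of the lockstep rules: it takes already-extracted
// version strings instead of reading package.json, package-lock.json, etc.
type Versions = {
  pkg: string;
  lockRoot?: string;
  lockPkg?: string;
  market?: string;
  plugin?: string;
  skillVer?: string;
  tracks?: string;
  floor?: string;
};

const SEMVER = /^\d+\.\d+\.\d+$/;

/** Compare two X.Y.Z strings numerically: <0 if a<b, 0 if equal, >0 if a>b. */
function cmp(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return 0;
}

/** Return one illustrative error code per violated rule (empty when aligned). */
function lockstepErrors(v: Versions): string[] {
  const errors: string[] = [];
  // 1. npm self-consistency
  if (!SEMVER.test(v.pkg)) errors.push("pkg-not-semver");
  if (v.lockRoot !== v.pkg) errors.push("lock-root-mismatch");
  if (v.lockPkg !== v.pkg) errors.push("lock-pkg-mismatch");
  // 2. skill self-consistency: all three present and identical
  const skill = new Set([v.market, v.plugin, v.skillVer]);
  if (skill.size !== 1 || skill.has(undefined)) errors.push("skill-mismatch");
  // 3. floor === tracks-harness
  if (!v.floor) errors.push("floor-missing");
  if (!v.tracks) errors.push("tracks-missing");
  if (v.floor && v.tracks && v.floor !== v.tracks) errors.push("floor-tracks-mismatch");
  // 4. floor <= package version
  if (v.floor && SEMVER.test(v.pkg) && cmp(v.floor, v.pkg) > 0) errors.push("floor-ahead");
  return errors;
}

// Made-up sample values, chosen only to show the rules firing.
const aligned: Versions = {
  pkg: "0.3.0", lockRoot: "0.3.0", lockPkg: "0.3.0",
  market: "1.2.0", plugin: "1.2.0", skillVer: "1.2.0",
  tracks: "0.3.0", floor: "0.3.0",
};

console.log(lockstepErrors(aligned)); // []
console.log(lockstepErrors({ ...aligned, floor: "0.10.0", tracks: "0.10.0" })); // ["floor-ahead"]
```

The second call shows why `cmp` compares components as numbers: a plain string comparison would rank `"0.10.0"` below `"0.3.0"` and miss a floor that leads the package version.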