cowork-harness 0.2.0 → 0.3.0
- package/CHANGELOG.md +36 -0
- package/README.md +12 -2
- package/dist/assert.js +9 -0
- package/dist/cli.js +6 -1
- package/dist/redact.js +101 -0
- package/dist/run/cassette.js +426 -65
- package/dist/scan.js +30 -0
- package/dist/types.js +5 -1
- package/docs/cassette.md +48 -0
- package/docs/maintenance.md +20 -6
- package/docs/scenario.md +9 -1
- package/package.json +3 -2
- package/schema/scenario.schema.json +6 -1
- package/scripts/check-versions.ts +90 -0
package/CHANGELOG.md
CHANGED
@@ -6,6 +6,42 @@ All notable changes to this project are documented here. The format is based on
 
 ## [Unreleased]
 
+## [0.3.0] — 2026-06-17
+
+The CI-operate + privacy layer for committed cassettes: record-time redaction, an always-on
+`verify-cassettes` scan/staleness gate, batch recording, and a set-membership assert operator.
+
+### Added
+
+- **`verify-cassettes <file|dir>`** — a token/agent-free CI gate over committed cassettes. A privacy
+  **scan** flags `email`/`currency`/bare-`domain` matches across the whole cassette, excluding only the
+  agent's **capability-manifest** messages (`system/init` + the `init-1` registry) from the noisy classes —
+  that catalog/MCP-server boilerplate is the sole concentrated false-positive source (email still scans it,
+  since the registry `account` field can carry the dev's email). `--allow <regex>` suppresses synthetic/
+  public reference names; multi-word proper names are opt-in, not a default class. Plus a **staleness** check
+  (`--staleness-only`) fails when a cassette's fingerprint drifted (you edited the skill but didn't
+  re-record). Exit 1 on any finding/drift/unreadable cassette; a malformed cassette is tallied, never
+  crashes the batch. Dedicated JSON envelope (`{command, ok, results}`), not the `RunResult` shape.
+- **Record-time content redaction** (opt-in; distinct from secret-scrub). A `.cowork-redact.json` (or
+  `COWORK_HARNESS_REDACT_PATTERNS`/`_KEYS`) rewrites configured PII across the **whole** cassette surface
+  (transcript, artifact bodies + filenames, prompt/answers/assert, skillSources) **structurally** — JSON
+  stays valid and the AskUserQuestion question/answer strings stay in sync (the O7 guard still passes), with
+  collision-safe deterministic tokens. Redaction is **verdict-preserving**: `record` refuses to write if it
+  would flip an assertion (a manufactured green). `--no-redact` / `--allow-failing` escape hatches.
+- **Batch recording** — `record <dir>` records every scenario in a directory (classified by a positive
+  `prompt:` signal: a non-scenario YAML is an announced skip, a broken scenario is a failure, never a silent
+  skip); `record <cassette-dir> --rerecord-stale` re-records only the cassettes whose fingerprint drifted.
+- **`artifact_json` `in:` operator** — assert the resolved value deep-equals one of a fixed set; stable for
+  stochastic (LLM-extracted) values where `equals` churns across re-records.
+
+### Fixed
+
+- **`skillHash` cassette fingerprint was silently dead** — `skillSourceDirs` passed a path string to
+  `loadSession` (which wants parsed YAML), threw, and the throw was swallowed, so the staleness gate's
+  skill-edit signal never computed for a file-based session. Now parses + resolves the session correctly;
+  `hashDir` folds in each file's relative path + type marker (a *move* now registers); `skillSources` are
+  stored relative, never as absolute host paths.
+
 ## [0.2.0] — 2026-06-17
 
 Binary-verified the AskUserQuestion answer wire shape (agent ELF 2.1.170), implemented the
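The `hashDir` fix in the changelog entry above (folding each entry's relative path and a type marker into the digest) can be illustrated with a minimal sketch. This is not the package's code: it assumes a hypothetical in-memory tree (`hashTree`, `digest`, and the sample trees are invented for illustration) in place of the real filesystem walk, and it uses CommonJS `require` so it runs as a plain Node script.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical in-memory tree: a string value is a file body, an object value is a directory.
function hashTree(tree, hash, rel = "") {
  for (const name of Object.keys(tree).sort()) {
    const relPath = rel ? `${rel}/${name}` : name;
    const node = tree[name];
    if (typeof node === "string") {
      hash.update(`F:${relPath}\n`); // file marker + relative path, then the contents
      hash.update(node);
    } else {
      hash.update(`D:${relPath}\n`); // directory marker, so structure alone affects the digest
      hashTree(node, hash, relPath);
    }
  }
}

function digest(tree) {
  const hash = createHash("sha256");
  hashTree(tree, hash);
  return hash.digest("hex");
}

const before = { a: { "x.json": "{}" } };
const same = { a: { "x.json": "{}" } };
const moved = { a: { sub: { "x.json": "{}" } } }; // same content, new location

console.log(digest(before) === digest(same)); // identical trees hash identically
console.log(digest(before) === digest(moved)); // the move changes the byte stream fed to the hash
```

Because the relative path is part of the hashed stream, moving `a/x.json` to `a/sub/x.json` feeds different bytes to SHA-256 even though the file body is unchanged, which is the property the staleness gate needs.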
package/README.md
CHANGED
@@ -74,7 +74,8 @@ Skill testing is the headline use, but the tool is a general harness over the Co
 | `skill <folder> "<prompt>"` | Run a local skill/plugin folder once against the staged agent | ad-hoc "is the skill alive / does it do X?" — the fast inner loop |
 | `run <scenario.yaml \| dir/>` | Run authored scenarios with `assert:` + a CI-ready exit code | you want a repeatable, **asserted regression test** |
 | `chat <folder>` | Interactive multi-turn REPL against a skill (TTY) | debugging a multi-turn flow by hand |
-| `record` / `replay` | Save a control-protocol cassette, then replay it deterministically (`replay --strict` fails on a stale cassette) | **token-free, Docker-free CI** from a once-recorded run |
+| `record` / `replay` | Save a control-protocol cassette (one scenario, or batch a `dir/`; `--rerecord-stale` refreshes only drifted ones), then replay it deterministically (`replay --strict` fails on a stale cassette) | **token-free, Docker-free CI** from a once-recorded run |
+| `verify-cassettes <file\|dir>` | Token-free CI gate over committed cassettes: a privacy scan (email/currency/domain → exit 1) + a staleness check (`--staleness-only`) | gating **committed cassettes** against PII leaks + "edited the skill, forgot to re-record" |
 | `trace <run-id>` | Digest a run's `events.jsonl` (`--tools`, `--gates`, `--dispatches` for the sub-agent dispatch tree + total) | "how many sub-agents *actually* dispatched, and which?" |
 | `scaffold --from-run <id>` | Turn a kept run into a starter scenario YAML (gates→answers, artifacts→`file_exists`) | authoring a scenario from a real run instead of guessing |
 | `assert --list` | List the available scenario assertions (generated from the schema) | "what can I assert?" without grepping the source |
@@ -204,8 +205,17 @@ cowork-harness replay --cassette cassettes/example-pdf-skill.cassette.json
 
 # A committed synthetic fixture is ready to replay on a fresh clone (no record step needed):
 cowork-harness replay --cassette examples/replays/example-pdf-skill.cassette.json
+
+# Cassettes are COMMITTED fixtures — record against synthetic data, and gate them in CI:
+cowork-harness verify-cassettes cassettes/ # privacy scan (email/currency/domain) + staleness; exit 1 on a finding
 ```
 
+> **Privacy:** a cassette snapshots the transcript and the `outputs/` JSON bodies, so it's committed PII
+> surface. Record against synthetic inputs; opt into record-time **redaction** with a `.cowork-redact.json`
+> (verdict-preserving — `record` refuses to write if redaction would flip an assertion); and gate every
+> commit with `verify-cassettes` (the always-on scan, `--allow <regex>` for synthetic/public names). See
+> [docs/cassette.md](./docs/cassette.md).
+
 > **What replay checks.** A cassette bundles BOTH recorded protocol directions: the child→driver
 > `events` stream AND the driver→child `controlOut` decision responses. `replay` re-runs the
 > orchestration from both, re-evaluates the **content** assertions, and re-exercises
@@ -385,7 +395,7 @@ The provided [GitHub Actions workflow](.github/workflows/ci.yml) runs a **four-s
 
 | Stage | Runs | Needs | Gates |
 |---|---|---|---|
-| **unit** | format check · typecheck · unit tests · build · CLI smoke · token-free `replay`
+| **unit** | format check · typecheck · unit tests · build · CLI smoke · token-free `replay` + `verify-cassettes` gates | nothing | every push/PR |
 | **boundary** | builds the pinned agent image, brings up the default-deny network, runs `boundary-check` | Docker, arm64 runner | proves the sandbox enforces Cowork's limits — **no API key** |
 | **scenarios** | the live scenario suite at `container` fidelity, uploads transcripts/egress logs as artifacts | `ANTHROPIC_API_KEY` (or `CLAUDE_CODE_OAUTH_TOKEN`) | fork PRs: the whole job is skipped (`if:` guard); same-repo without a key: warns and exits 0 |
 | **parity-drift** | reminder to re-`sync` when Desktop updates | nothing | informational, never blocks |
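The README's privacy note says redaction is verdict-preserving: `record` refuses to write if redaction would flip an assertion. A rough sketch of that comparison follows. Everything here is an assumption made for illustration: `verdict` is a toy stand-in for the harness's replay plus verdict computation, the `forbidden` field imitates a `transcript_not_matches`-style assertion, and the token hash in the sample is a made-up placeholder.

```javascript
// Hypothetical verdict: the cassette passes when no transcript line matches a forbidden pattern.
function verdict(cassette) {
  const pass = cassette.forbidden.every(
    (src) => !cassette.events.some((line) => new RegExp(src).test(line)),
  );
  return { pass };
}

// Compare the verdict before and after redaction; refuse when redaction changed it.
function assertVerdictPreserved(base, redacted) {
  const vb = verdict(base);
  const vr = verdict(redacted);
  if (vb.pass !== vr.pass) {
    throw new Error(`redaction changed the verdict (pass=${vb.pass} -> pass=${vr.pass}); refusing to write`);
  }
}

// The unredacted run FAILS (the forbidden address appears). Redacting the address would
// hide the failure and manufacture a green, so the guard must refuse.
const base = { events: ["contact: jane@example.com"], forbidden: ["jane@example\\.com"] };
const redacted = { ...base, events: ["contact: [REDACTED:email:0123456789ab]"] };

let refused = false;
try {
  assertVerdictPreserved(base, redacted);
} catch {
  refused = true;
}
console.log(refused);
```

The design point is that the guard compares outcomes rather than inspecting the redaction policy, so it catches any policy that touches an asserted value, whatever the pattern.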
package/dist/assert.js
CHANGED
@@ -241,6 +241,15 @@ function check(a, ctx) {
                 ? ok()
                 : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected > ${aj.gt}`));
         }
+        // #4: set membership — the resolved value deep-equals one of a fixed set. Stable for stochastic
+        // (LLM-extracted) values where `equals` would churn across re-records. `present &&` guard mirrors
+        // `equals` so an absent value never vacuously satisfies it.
+        if (aj.in !== undefined) {
+            any = true;
+            results.push(present && Array.isArray(aj.in) && aj.in.some((x) => jsonEq(val, x))
+                ? ok()
+                : fail(`artifact_json: "${aj.path}" = ${JSON.stringify(val)}, expected one of ${JSON.stringify(aj.in)}`));
+        }
         // No operator → an existence assertion (the value must be present).
         if (!any)
             results.push(present ? ok() : fail(`artifact_json: "${aj.path ?? "(root)"}" is not present (no operator given → existence check)`));
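The `in:` check in the hunk above can be tried in isolation with a small sketch. The `jsonEq` helper is not shown in this diff, so the version below is a hypothetical stand-in (structural deep equality that ignores object key order); `checkIn` and the sample values are likewise invented for illustration.

```javascript
// Hypothetical stand-in for the harness's `jsonEq`: deep equality over JSON values.
function jsonEq(a, b) {
  if (a === b) return true;
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  if (Array.isArray(a)) return a.length === b.length && a.every((v, i) => jsonEq(v, b[i]));
  const ka = Object.keys(a);
  const kb = Object.keys(b);
  return (
    ka.length === kb.length &&
    ka.every((k) => Object.prototype.hasOwnProperty.call(b, k) && jsonEq(a[k], b[k]))
  );
}

// Set membership: passes only when the value is present AND deep-equals one allowed member.
function checkIn(present, val, allowed) {
  return present && Array.isArray(allowed) && allowed.some((x) => jsonEq(val, x));
}

const allowed = ["USD", "US Dollar", { code: "USD" }];
console.log(checkIn(true, "US Dollar", allowed)); // true
console.log(checkIn(true, { code: "USD" }, allowed)); // true
console.log(checkIn(true, "EUR", allowed)); // false
console.log(checkIn(false, undefined, [undefined])); // false: an absent value never passes
```

Listing every acceptable spelling of an extracted value keeps the assertion stable when a re-record produces a different but equally valid answer, which is the case where a single `equals` value would churn.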
package/dist/cli.js
CHANGED
@@ -13,7 +13,7 @@ import { vmInit, vmDelete, vmStatus, vmPrune, instanceName } from "./runtime/lim
 import { sync } from "./sync/cowork-sync.js";
 import { runBoundaryChecks, formatBoundary } from "./boundary.js";
 import { cmdChat } from "./run/chat.js";
-import { cmdRecord, cmdReplay } from "./run/cassette.js";
+import { cmdRecord, cmdReplay, cmdVerifyCassettes } from "./run/cassette.js";
 import { loadDotenv } from "./dotenv.js";
 import { makeRenderer, renderStart, renderFooter, startHeartbeat } from "./run/renderer.js";
 import { resolveEventsFile, buildTrace, formatTrace, buildGateTrace, formatGateTrace, buildDispatchTree, formatDispatchTree, } from "./run/trace-view.js";
@@ -50,6 +50,8 @@ const HELP = `cowork-harness <command> (v${"$VERSION"})
 [--out <file>] cassette path (default: cassettes/<scenario-name>.cassette.json)
 replay --cassette <file> deterministic protocol-replay of a cassette (no token) [--output-format json]
 [--strict] escalate a cassette-staleness warning (baseline/skill drift) to a failure
+verify-cassettes <file|dir> CI gate (no token): privacy scan + staleness — exit 1 on a PII finding or drift
+[--privacy-only|--staleness-only] [--allow <regex>]... [--output-format json]
 trace <run-id | dir | path> digest a run's events.jsonl (tools+result status, dispatches, decisions)
 [--tools] tool/dispatch rows only [--gates] gate lifecycle (question→answer→delivered)
 [--dispatches] sub-agent dispatch tree + the real total (read off dispatch_count_max)
@@ -199,6 +201,7 @@ async function main() {
         "chat",
         "record",
         "replay",
+        "verify-cassettes",
         "trace",
         "assert",
         "scaffold",
@@ -260,6 +263,8 @@ async function main() {
             return cmdRecord(rest);
         case "replay":
             return cmdReplay(rest);
+        case "verify-cassettes":
+            return cmdVerifyCassettes(rest);
         case "trace":
             return cmdTrace(rest);
         case "assert":
package/dist/redact.js
ADDED
@@ -0,0 +1,101 @@
+/**
+ * Content redaction for committed cassettes (#1 / A1). DISTINCT from `secrets.ts` (which scrubs auth
+ * tokens): this redacts author-configured PII patterns out of the cassette surface before it is written.
+ *
+ * Two hard requirements drive the design:
+ * - STRUCTURAL (not line-level) for JSON protocol lines: events/controlOut are JSON; redacting their raw
+ *   text could unbalance the JSON (→ a silently skipped line on replay) or desync the AskUserQuestion
+ *   question/answer strings the O7 guard compares across events and controlOut. So JSON is parsed, every
+ *   string LEAF and object KEY is redacted, then re-serialized.
+ * - COLLISION-SAFE deterministic tokens: `[REDACTED:<label>:<hash>]`. The hash (of the matched text) keeps
+ *   the token stable across re-records (no churn) AND injective — two distinct names never collapse into a
+ *   single `answers` map key. A genuine collision (astronomically rare) fails loud, never silently merges.
+ */
+import { createHash } from "node:crypto";
+import { existsSync, readFileSync } from "node:fs";
+import { join } from "node:path";
+export const EMPTY_POLICY = { patterns: [], keyNames: [] };
+function csv(v) {
+    return (v ?? "")
+        .split(",")
+        .map((s) => s.trim())
+        .filter(Boolean);
+}
+/** Assemble a redaction policy from `.cowork-redact.json` (searched in `searchDirs`, e.g. cwd then the
+ * scenario/cassette dir) merged with `COWORK_HARNESS_REDACT_PATTERNS`/`_KEYS`. No config + no env →
+ * EMPTY_POLICY (the opt-in default; the A2 scanner is the always-on safety net). A malformed regex throws —
+ * a silently-dropped redaction rule is under-redaction, i.e. a leak. */
+export function loadRedactionPolicy(searchDirs) {
+    const patterns = [];
+    const keyNames = [];
+    const seen = new Set();
+    for (const dir of searchDirs) {
+        const f = join(dir, ".cowork-redact.json");
+        if (seen.has(f) || !existsSync(f))
+            continue;
+        seen.add(f);
+        const cfg = JSON.parse(readFileSync(f, "utf8"));
+        for (const p of cfg.patterns ?? [])
+            patterns.push({ re: new RegExp(p.regex, p.flags ?? "g"), label: p.label ?? "redacted" });
+        for (const k of cfg.keys ?? [])
+            keyNames.push(k);
+    }
+    for (const src of csv(process.env.COWORK_HARNESS_REDACT_PATTERNS))
+        patterns.push({ re: new RegExp(src, "g"), label: "redacted" });
+    for (const k of csv(process.env.COWORK_HARNESS_REDACT_KEYS))
+        keyNames.push(k);
+    return { patterns, keyNames };
+}
+/** Stable, collision-safe token for a matched span. Depends ONLY on the matched text (context-free), so the
+ * same logical string redacts identically wherever it appears (events question text == controlOut answers
+ * key) — the property the O7 guard relies on. */
+function token(label, match) {
+    const h = createHash("sha256").update(match).digest("hex").slice(0, 12);
+    return `[REDACTED:${label}:${h}]`;
+}
+/** Apply every pattern to a single string. Each pattern is forced global so all occurrences go. */
+export function redactText(text, policy) {
+    let out = text;
+    for (const { re, label } of policy.patterns) {
+        const g = new RegExp(re.source, re.flags.includes("g") ? re.flags : re.flags + "g");
+        out = out.replace(g, (m) => token(label, m));
+    }
+    return out;
+}
+/** Recursively redact a parsed JSON value: string leaves AND object keys (C3). Numbers/booleans/null pass
+ * through. A key collision after redaction (two distinct keys → one) throws — a silent merge would lose data
+ * and (for an `answers` map) break replay. */
+export function redactStructural(value, policy) {
+    if (typeof value === "string")
+        return redactText(value, policy);
+    if (Array.isArray(value))
+        return value.map((v) => redactStructural(v, policy));
+    if (value !== null && typeof value === "object") {
+        const out = {};
+        for (const [k, v] of Object.entries(value)) {
+            const rk = redactText(k, policy);
+            // A value under a configured key is redacted wholesale regardless of TYPE (a sensitive number/object
+            // leaks just like a string). The hash is over its JSON form so the token stays deterministic.
+            const rv = policy.keyNames.includes(k) ? token("key", typeof v === "string" ? v : JSON.stringify(v)) : redactStructural(v, policy);
+            if (Object.prototype.hasOwnProperty.call(out, rk))
+                throw new Error(`redaction collision: two distinct keys both redacted to "${rk}" — refusing to silently merge`);
+            out[rk] = rv;
+        }
+        return out;
+    }
+    return value;
+}
+/** Redact one JSONL protocol line. If it parses as JSON, redact structurally (guaranteeing it still parses);
+ * otherwise fall back to safe text redaction (a non-JSON line has no protocol coupling). */
+export function redactJsonLine(line, policy) {
+    if (!line.trim())
+        return line;
+    let parsed;
+    try {
+        parsed = JSON.parse(line);
+    }
+    catch {
+        return redactText(line, policy);
+    }
+    return JSON.stringify(redactStructural(parsed, policy));
+}
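To see why structural redaction with context-free tokens keeps question and answer strings in sync, the following sketch condenses the logic of `redact.js` above and runs it on two sample protocol lines. It is a simplified restatement (no key-name policy, CommonJS `require` instead of the module's ESM imports), and the policy, the `Acme Corp` name, and the shapes of the two JSON lines are hypothetical, not the real AskUserQuestion wire format.

```javascript
const { createHash } = require("node:crypto");

// Token depends only on the matched text, so the same string redacts identically everywhere.
const token = (label, match) =>
  `[REDACTED:${label}:${createHash("sha256").update(match).digest("hex").slice(0, 12)}]`;

// Patterns in this sketch are assumed to carry the global flag already.
function redactText(text, policy) {
  return policy.patterns.reduce((out, { re, label }) => out.replace(re, (m) => token(label, m)), text);
}

// Redact string leaves AND object keys; refuse to merge two keys that redact to the same text.
function redactStructural(value, policy) {
  if (typeof value === "string") return redactText(value, policy);
  if (Array.isArray(value)) return value.map((v) => redactStructural(v, policy));
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const [k, v] of Object.entries(value)) {
      const rk = redactText(k, policy);
      if (Object.prototype.hasOwnProperty.call(out, rk)) throw new Error(`redaction collision on "${rk}"`);
      out[rk] = redactStructural(v, policy);
    }
    return out;
  }
  return value;
}

// Hypothetical policy and protocol lines, shaped only for illustration.
const policy = { patterns: [{ re: /Acme Corp/g, label: "org" }] };
const eventLine = JSON.stringify({ question: "Is Acme Corp the issuer?" });
const controlLine = JSON.stringify({ answers: { "Is Acme Corp the issuer?": "yes" } });

// Round-trip through JSON to confirm the redacted lines still parse.
const redactedEvent = JSON.parse(JSON.stringify(redactStructural(JSON.parse(eventLine), policy)));
const redactedControl = JSON.parse(JSON.stringify(redactStructural(JSON.parse(controlLine), policy)));

const questionText = redactedEvent.question;
const answerKey = Object.keys(redactedControl.answers)[0];
console.log(questionText === answerKey); // the question text and the answers key stay identical
console.log(questionText.includes("Acme")); // the configured name no longer appears
```

Because the key in the `answers` map is redacted with the same function as the question string, a guard that compares the two across `events` and `controlOut` sees matching values after redaction.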
package/dist/run/cassette.js
CHANGED
@@ -3,8 +3,8 @@ import { readFileSync, writeFileSync, mkdirSync, mkdtempSync, existsSync, readdi
 import { createHash } from "node:crypto";
 import { tmpdir } from "node:os";
 import { join, dirname, relative, isAbsolute } from "node:path";
-import { executeScenario, parseScenarioFile, collectArtifacts } from "./execute.js";
-import { loadSession } from "../session.js";
+import { executeScenario, parseScenarioFile, collectArtifacts, parseSessionFile } from "./execute.js";
+import { loadSession, resolveSessionPaths } from "../session.js";
 import { loadBaseline } from "../baseline.js";
 import { Run } from "./run.js";
 import { parseMessage, serializeDecision, deserializeDecision, canon, } from "../agent/session.js";
@@ -13,6 +13,10 @@ import { evaluate } from "../assert.js";
 import { makeRenderer, renderFooter } from "./renderer.js";
 import { jsonEnvelope, parseOutputFormat } from "./envelope.js";
 import { computeVerdict } from "./verdict.js";
+import { redactJsonLine, redactText, redactStructural, loadRedactionPolicy } from "../redact.js";
+import { collectSecrets, scrub } from "../secrets.js";
+import { scanText, DEFAULT_SCAN_PATTERNS, EMAIL_SCAN_PATTERNS } from "../scan.js";
+import { parse as parseYaml } from "yaml";
 const out = (s) => process.stdout.write(s + "\n");
 const log = (s) => process.stderr.write(s + "\n");
 /** Current cassette format version. Readers tolerate ABSENT (legacy → 0) and warn on a FUTURE version. */
@@ -47,8 +51,10 @@ function materializeManifest(entries) {
     }
     return { workRoot, prefixes: ["outputs", ".projects"] };
 }
-/** Hash a directory's file CONTENTS recursively (sorted
-
+/** Hash a directory's structure + file CONTENTS recursively (sorted) — stable across machines. The hash
+ * folds in each entry's RELATIVE path (not just its basename) plus a type marker, so a file MOVING within
+ * the tree (`a/x.json` → `a/sub/x.json`, same content) changes the hash (S2 — basename-only missed moves). */
+function hashDir(dir, hash, rel = "") {
     let entries;
     try {
         entries = readdirSync(dir).sort();
@@ -58,6 +64,7 @@ function hashDir(dir, hash) {
     }
     for (const name of entries) {
         const abs = join(dir, name);
+        const relPath = rel ? `${rel}/${name}` : name;
         let st;
         try {
             st = statSync(abs);
@@ -65,10 +72,12 @@ function hashDir(dir, hash) {
         catch {
             continue;
         }
-        if (st.isDirectory())
-
+        if (st.isDirectory()) {
+            hash.update(`D:${relPath}\n`); // structure marker — an empty/renamed dir registers too
+            hashDir(abs, hash, relPath);
+        }
         else if (st.isFile()) {
-            hash.update(
+            hash.update(`F:${relPath}\n`); // relative path, not basename — a move changes the digest
             try {
                 hash.update(readFileSync(abs));
             }
@@ -78,28 +87,121 @@ function hashDir(dir, hash) {
         }
     }
 }
-/** #1b: the local skill/plugin/marketplace source dirs a session mounts — the "skill dir" hash unit.
+/** #1b: the local skill/plugin/marketplace source dirs a session mounts — the "skill dir" hash unit.
+ * Returns ABSOLUTE dirs (for hashing/reading) plus `baseDir`, the session-file dir the relative
+ * `skillSources` are stored against (so the committed fingerprint carries no absolute host path — C1). */
 function skillSourceDirs(sessionPath, cassetteDir) {
     const resolved = cassetteDir && !isAbsolute(sessionPath) ? join(cassetteDir, sessionPath) : sessionPath;
+    const baseDir = dirname(resolved);
     if (sessionPath === "(inline)" || !existsSync(resolved))
-        return [];
+        return { dirs: [], baseDir };
     let cfg;
     try {
-
+        // Mirror loadSessionFromFile (execute.ts): parse the YAML, then RESOLVE its relative skill/plugin
+        // paths against the session-file dir (`baseDir` — the post-cassetteDir-join location, so this works for
+        // both the record call (no cassetteDir) and the replay call (cassetteDir set)). Passing the raw path
+        // string to loadSession() throws (it wants parsed YAML) — the swallowed throw is why skillHash was
+        // silently never computed.
+        cfg = resolveSessionPaths(loadSession(parseSessionFile(resolved)), baseDir);
     }
     catch {
-        return [];
+        return { dirs: [], baseDir };
     }
-
+    const dirs = [...cfg.skills.local, ...cfg.plugins.local_plugins, ...cfg.plugins.remote_plugins, ...cfg.plugins.local_marketplaces].filter((d) => existsSync(d));
+    return { dirs, baseDir };
 }
-function buildFingerprint(sessionPath, baselineAppVersion, cassetteDir) {
-    const dirs = skillSourceDirs(sessionPath, cassetteDir);
+export function buildFingerprint(sessionPath, baselineAppVersion, cassetteDir) {
+    const { dirs, baseDir } = skillSourceDirs(sessionPath, cassetteDir);
     if (dirs.length === 0)
         return { baseline: baselineAppVersion };
     const hash = createHash("sha256");
     for (const d of dirs.sort())
         hashDir(d, hash);
-
+    // Store skillSources RELATIVE to the session-file dir — diagnostics only (the replay recompute re-derives
+    // the dirs from the session), so a relative path is enough and never leaks an absolute `/Users/...` path.
+    return { baseline: baselineAppVersion, skillHash: hash.digest("hex"), skillSources: dirs.map((d) => relative(baseDir, d)) };
+}
+/** A2: scan the WHOLE cassette surface for PII (default classes: email/currency/domain). A `truncated`
+ * artifact has NO committed body (hash-only) — nothing to leak — but is reported as `unscanned` so coverage
+ * is never silently implied. Real-class findings fail the gate; `unscanned` is informational. */
+/** The agent's CAPABILITY MANIFEST — environment boilerplate, never user data, and the sole concentrated
+ * source of `domain`/`currency` scan noise (tool/skill catalog descriptions + MCP-server names a regex
+ * can't tell apart from customer data). Two stable structural forms:
+ * - the `system/init` event (tools/mcp_servers/skills/cwd registry), and
+ * - the `initialize` `control_response` (`request_id: "init-1"`; body = commands/agents/models/account).
+ * These get `email`-only scanning (email is universal — the `account` field can carry the dev's own email);
+ * the noisy classes are suppressed only here. */
+function isCapabilityManifest(line) {
+    let m;
+    try {
+        m = JSON.parse(line);
+    }
+    catch {
+        return false;
+    }
+    if (m?.type === "system" && m?.subtype === "init")
+        return true;
+    if (m?.type === "control_response") {
+        const r = m.response ?? {};
+        if (r.request_id === "init-1")
+            return true;
+        const body = r.response;
+        if (body && typeof body === "object" && "commands" in body && "agents" in body)
+            return true; // shape fallback
+    }
+    return false;
+}
+export function scanCassette(cassette, allow) {
+    const findings = [];
+    const FULL = DEFAULT_SCAN_PATTERNS; // email + currency + domain
+    const EMAIL = EMAIL_SCAN_PATTERNS; // email only — for the capability-manifest messages
+    // Transcript: full net EXCEPT the capability-manifest messages (catalog noise), where only email runs.
+    cassette.events.forEach((l, i) => findings.push(...scanText(l, `events[${i}]`, allow, isCapabilityManifest(l) ? EMAIL : FULL)));
+    cassette.controlOut?.forEach((l, i) => findings.push(...scanText(l, `controlOut[${i}]`, allow, isCapabilityManifest(l) ? EMAIL : FULL)));
+    // Deliverable + author-written fields — full net (a real cap table's figures/domains live here).
+    for (const a of cassette.artifacts ?? []) {
+        findings.push(...scanText(a.path, `artifact path ${a.path}`, allow, FULL)); // a filename can name a customer
+        if (a.body !== undefined)
+            findings.push(...scanText(a.body, `artifact ${a.path}`, allow, FULL));
+        else if (a.truncated)
+            findings.push({ where: `artifact ${a.path}`, cls: "unscanned", sample: "(body not committed — too large or unreadable)" });
+    }
+    findings.push(...scanText(cassette.scenario.prompt, "scenario.prompt", allow, FULL));
+    findings.push(...scanText(JSON.stringify(cassette.scenario.answers ?? null), "scenario.answers", allow, FULL));
+    findings.push(...scanText(JSON.stringify(cassette.scenario.assert ?? null), "scenario.assert", allow, FULL));
+    for (const s of cassette.fingerprint?.skillSources ?? [])
+        findings.push(...scanText(s, "fingerprint.skillSources", allow, FULL));
+    return findings;
+}
+/** B3 staleness GATE: recompute the fingerprint and report drift. Unlike `replayCassette` (which WARNS),
+ * the gate treats an unresolvable skillHash as a failure — can't verify ⇒ not green. No fingerprint → nothing
+ * to check (legacy cassette). */
+export function checkStaleness(cassette, cassetteDir) {
+    const fp = cassette.fingerprint;
+    if (!fp)
+        return [];
+    const msgs = [];
+    let liveBaseline;
+    try {
+        liveBaseline = loadBaseline("latest").appVersion;
+    }
+    catch {
+        /* baseline not loadable */
+    }
+    // Gate mode: can't verify ⇒ not green. The cassette carries a baseline-of-record but we can't load the
+    // current one to compare — a fail, not a silent skip (baselines ship with the package, so this is rare).
+    if (liveBaseline === undefined)
+        msgs.push("cannot load the latest baseline to verify staleness — run `cowork-harness sync` or ship baselines/ (can't verify ⇒ not green)");
+    else if (liveBaseline !== fp.baseline)
+        msgs.push(`baseline moved ${fp.baseline} → ${liveBaseline} since record — re-record`);
+    if (fp.skillHash) {
+        const live = buildFingerprint(cassette.scenario.session, fp.baseline, cassetteDir);
+        if (live.skillHash === undefined)
+            msgs.push("skill dirs not resolvable from the cassette location — cannot verify staleness (gate fails: can't verify ⇒ not green)");
+        else if (live.skillHash !== fp.skillHash)
+            msgs.push("local skill/plugin dir contents changed since record — re-record");
+    }
+    return msgs;
 }
 /** A minimal RunRecord for a truncated-cassette replay — empty collections so downstream evaluate()/the
  * mismatch loops don't NPE; result:"error" because the cassette could not be driven to completion. */
@@ -270,75 +372,246 @@ const NOOP_DECIDER = {
|
|
|
270
372
|
return ABSTAIN;
|
|
271
373
|
},
|
|
272
374
|
};
|
|
273
|
-
/**
|
|
375
|
+
/** Apply CONTENT redaction (the opt-in policy) across the WHOLE cassette surface (C1): events/controlOut
|
|
376
|
+
* protocol lines (structurally — string leaves AND object keys, keeping JSON valid + the O7 question/answer
|
|
377
|
+
* strings in sync), artifact bodies, the scenario prompt/answers/assert metadata, and the diagnostic
|
|
378
|
+
* skillSources. Identity fields (name/session/fidelity/baseline) are left intact so replay still resolves.
|
|
379
|
+
* Pure — returns a new cassette. Distinct from secret-scrub (`scrub`), which runs first. */
|
|
380
|
+
export function redactCassette(cassette, policy) {
|
|
381
|
+
const scenario = {
|
|
382
|
+
...cassette.scenario,
|
|
383
|
+
prompt: redactText(cassette.scenario.prompt, policy),
|
|
384
|
+
answers: redactStructural(cassette.scenario.answers, policy),
|
|
385
|
+
assert: redactStructural(cassette.scenario.assert, policy),
|
|
386
|
+
};
|
|
387
|
+
return {
|
|
388
|
+
...cassette,
|
|
389
|
+
scenario,
|
|
390
|
+
events: cassette.events.map((l) => redactJsonLine(l, policy)),
|
|
391
|
+
controlOut: cassette.controlOut?.map((l) => redactJsonLine(l, policy)),
|
|
392
|
+
artifacts: cassette.artifacts?.map((a) => ({
|
|
393
|
+
...a,
|
|
394
|
+
path: redactText(a.path, policy), // C1: a filename can name a customer (outputs/Acme-cap-table.json)
|
|
395
|
+
...(a.body !== undefined ? { body: redactJsonLine(a.body, policy) } : {}),
|
|
396
|
+
})),
|
|
397
|
+
fingerprint: cassette.fingerprint
|
|
398
|
+
? { ...cassette.fingerprint, skillSources: cassette.fingerprint.skillSources?.map((s) => redactText(s, policy)) }
|
|
399
|
+
: undefined,
|
|
400
|
+
};
|
|
401
|
+
}
|
|
402
|
+
/** A3 / C4 cardinal-sin guard: redaction must be VERDICT-PRESERVING. Replay both the pre-redaction and the
|
|
403
|
+
* redacted cassette (token-free) and compare verdicts; if redaction flipped any replay-checkable assertion
|
|
404
|
+
* (e.g. stripped a value a `transcript_not_matches` keys on, manufacturing a green), throw — never write a
|
|
405
|
+
* cassette whose verdict was changed by redaction. */
|
|
406
|
+
export async function assertRedactionVerdictPreserved(base, redacted) {
|
|
407
|
+
const vb = computeVerdict(await replayCassette(base), "replay");
|
|
408
|
+
const vr = computeVerdict(await replayCassette(redacted), "replay");
|
|
409
|
+
if (vb.pass !== vr.pass)
|
|
410
|
+
throw new Error(`redaction changed the replay verdict (pre-redaction pass=${vb.pass} → redacted pass=${vr.pass}) — redaction altered an ` +
|
|
411
|
+
`asserted observable; refusing to write a cassette whose verdict was manufactured by redaction (A3). ` +
|
|
412
|
+
`Record against synthetic inputs, or narrow the redaction policy so it doesn't touch asserted values.`);
|
|
413
|
+
}
|
|
414
|
+
/** B1: classify the `*.yaml`/`*.yml` files (non-recursive) under `dir` for batch `record`. Classification keys on a
|
|
415
|
+
* POSITIVE `prompt:` signal — NOT on "Scenario.parse threw", because a session YAML and a broken scenario
|
|
416
|
+
* both throw the same error. A doc with `prompt:` that fails to parse is BROKEN (a batch failure), never a
|
|
417
|
+
* silent skip — silently swallowing a broken scenario as a non-scenario is the false-green this guards. */
|
|
418
|
+
export function discoverScenarios(dir) {
|
|
419
|
+
const files = readdirSync(dir)
|
|
420
|
+
.filter((f) => /\.ya?ml$/i.test(f))
|
|
421
|
+
.sort()
|
|
422
|
+
.map((f) => join(dir, f));
|
|
423
|
+
const out = { scenarios: [], skipped: [], broken: [] };
|
|
424
|
+
for (const f of files) {
|
|
425
|
+
let raw;
|
|
426
|
+
try {
|
|
427
|
+
raw = parseYaml(readFileSync(f, "utf8"));
|
|
428
|
+
}
|
|
429
|
+
catch (e) {
|
|
430
|
+
out.broken.push({ file: f, error: `YAML parse error: ${e.message}` });
|
|
431
|
+
continue;
|
|
432
|
+
}
|
|
433
|
+
const hasPrompt = raw !== null && typeof raw === "object" && "prompt" in raw;
|
|
434
|
+
if (!hasPrompt) {
|
|
435
|
+
out.skipped.push(f); // no prompt → a session/other doc; announced skip, not a failure
|
|
436
|
+
continue;
|
|
437
|
+
}
|
|
438
|
+
try {
|
|
439
|
+
parseScenarioFile(f);
|
|
440
|
+
out.scenarios.push(f);
|
|
441
|
+
}
|
|
442
|
+
catch (e) {
|
|
443
|
+
out.broken.push({ file: f, error: e.message });
|
|
444
|
+
}
|
|
445
|
+
}
|
|
446
|
+
return out;
|
|
447
|
+
}
|
|
448
|
+
/** Read + parse a cassette, never throwing — a malformed `*.cassette.json` must be TALLIED, not crash a
|
|
449
|
+
* whole batch (a crash mid-walk reads as "the rest were fine" — a false-green by abort). */
|
|
450
|
+
function readCassette(path) {
|
|
451
|
+
try {
|
|
452
|
+
return { cassette: JSON.parse(readFileSync(path, "utf8")) };
|
|
453
|
+
}
|
|
454
|
+
catch (e) {
|
|
455
|
+
return { error: `unreadable / invalid cassette JSON: ${e.message}` };
|
|
456
|
+
}
|
|
457
|
+
}
|
|
458
|
+
/** B2: the committed cassettes under `dir` whose fingerprint has drifted (baseline/skill) — the re-record
|
|
459
|
+
* work-list. Pure + token-free (reuses `checkStaleness`); the actual re-record needs the live agent. A
|
|
460
|
+
* malformed cassette is surfaced as stale (needs attention) rather than silently dropped. */
|
|
461
|
+
export function selectStaleCassettes(dir) {
|
|
462
|
+
return readdirSync(dir)
|
|
463
|
+
.filter((f) => f.endsWith(".cassette.json"))
|
|
464
|
+
.sort()
|
|
465
|
+
.map((f) => join(dir, f))
|
|
466
|
+
.map((path) => {
|
|
467
|
+
const r = readCassette(path);
|
|
468
|
+
return "error" in r ? { path, staleness: [r.error] } : { path, staleness: checkStaleness(r.cassette, dirname(path)) };
|
|
469
|
+
})
|
|
470
|
+
.filter((x) => x.staleness.length > 0);
|
|
471
|
+
}
|
|
472
|
+
/** Record one scenario FILE → one cassette (parses the file, then shares the live-record tail with the
|
|
473
|
+
* in-memory path). The file's dir feeds the redaction-policy search (for a co-located .cowork-redact.json). */
|
|
474
|
+
async function recordScenarioFile(file, opts) {
|
|
475
|
+
return recordScenarioObject(parseScenarioFile(file), opts, [dirname(file)]);
|
|
476
|
+
}
|
|
477
|
+
/** `record <scenario.yaml | dir> [--out <file>] [--rerecord-stale] [--no-redact] [--allow-failing]` —
|
|
478
|
+
* run live + save a cassette. A single file records one; a dir batches (B1); --rerecord-stale (B2) treats
|
|
479
|
+
* the dir as committed cassettes and re-records only those whose fingerprint drifted. */
|
|
274
480
|
export async function cmdRecord(args) {
|
|
481
|
+
const noRedact = args.includes("--no-redact");
|
|
482
|
+
if (noRedact)
|
|
483
|
+
log("record: --no-redact — content redaction is OFF; the cassette is written verbatim, so ensure inputs are synthetic.");
|
|
484
|
+
const allowFailing = args.includes("--allow-failing");
|
|
485
|
+
const rerecordStale = args.includes("--rerecord-stale");
|
|
275
486
|
const outIdx = args.indexOf("--out");
|
|
276
|
-
// #9: bounds-check --out's value — a trailing `--out` makes cassettePath undefined → a raw
|
|
277
|
-
// dirname(undefined)/writeFileSync(undefined) crash surfacing as an `internal` error.
|
|
278
487
|
if (outIdx >= 0 && args[outIdx + 1] === undefined) {
|
|
279
488
|
log("usage: record <scenario.yaml> --out <file.cassette.json> (--out needs a value)");
|
|
280
|
-
process.exit(2);
|
|
489
|
+
return process.exit(2);
|
|
281
490
|
}
|
|
282
|
-
|
|
283
|
-
|
|
284
|
-
|
|
285
|
-
|
|
286
|
-
|
|
287
|
-
log("usage: record <scenario.yaml> [--out <file.cassette.json>]");
|
|
288
|
-
process.exit(2);
|
|
491
|
+
const positionals = args.filter((a, i) => !a.startsWith("--") && args[i - 1] !== "--out");
|
|
492
|
+
const target = positionals[0];
|
|
493
|
+
if (!target) {
|
|
494
|
+
log("usage: record <scenario.yaml | dir/> [--out <file>] [--rerecord-stale] [--no-redact] [--allow-failing]");
|
|
495
|
+
return process.exit(2);
|
|
289
496
|
}
|
|
290
|
-
|
|
291
|
-
|
|
292
|
-
|
|
293
|
-
process.exit(2);
|
|
497
|
+
if (positionals.length > 1) {
|
|
498
|
+
log(`record takes a single scenario or dir (got ${positionals.length}: ${positionals.join(", ")})`);
|
|
499
|
+
return process.exit(2);
|
|
294
500
|
}
|
|
295
|
-
const
|
|
501
|
+
const isDir = existsSync(target) && statSync(target).isDirectory();
|
|
502
|
+
// B2: re-record only the drifted cassettes in a committed cassette dir.
|
|
503
|
+
if (rerecordStale) {
|
|
504
|
+
if (!isDir) {
|
|
505
|
+
log("record --rerecord-stale takes a DIRECTORY of committed cassettes");
|
|
506
|
+
return process.exit(2);
|
|
507
|
+
}
|
|
508
|
+
const stale = selectStaleCassettes(target);
|
|
509
|
+
if (stale.length === 0) {
|
|
510
|
+
log(`✓ record --rerecord-stale: all cassettes under ${target} are fresh — nothing to re-record`);
|
|
511
|
+
return process.exit(0);
|
|
512
|
+
}
|
|
513
|
+
let failures = 0;
|
|
514
|
+
for (const { path: cp, staleness } of stale) {
|
|
515
|
+
const rc = readCassette(cp);
|
|
516
|
+
if ("error" in rc) {
|
|
517
|
+
failures++;
|
|
518
|
+
log(` ✗ ${cp}: ${rc.error} — cannot re-record`);
|
|
519
|
+
continue;
|
|
520
|
+
}
|
|
521
|
+
const cassette = rc.cassette;
|
|
522
|
+
// Re-record from the embedded scenario, re-resolving its relocatable session against the cassette dir.
|
|
523
|
+
const sessionRef = cassette.scenario.session === "(inline)" ? "(inline)" : join(dirname(cp), cassette.scenario.session);
|
|
524
|
+
log(`↻ re-recording ${cp} (stale: ${staleness.join("; ")})`);
|
|
525
|
+
try {
|
|
526
|
+
const r = await recordScenarioObject({ ...cassette.scenario, session: sessionRef }, { noRedact, allowFailing, cassettePath: cp });
|
|
527
|
+
log(` ✓ ${cp} (${r.result.result})`);
|
|
528
|
+
}
|
|
529
|
+
catch (e) {
|
|
530
|
+
failures++;
|
|
531
|
+
log(` ✗ ${cp}: ${e.message}`);
|
|
532
|
+
}
|
|
533
|
+
}
|
|
534
|
+
return process.exit(failures > 0 ? 1 : 0);
|
|
535
|
+
}
|
|
536
|
+
// B1: batch a directory of scenarios.
|
|
537
|
+
if (isDir) {
|
|
538
|
+
const disc = discoverScenarios(target);
|
|
539
|
+
for (const s of disc.skipped)
|
|
540
|
+
log(`· skipped (not a scenario — no \`prompt:\`): ${s}`);
|
|
541
|
+
for (const b of disc.broken)
|
|
542
|
+
log(`✗ ${b.file}: ${b.error}`);
|
|
543
|
+
if (disc.scenarios.length === 0) {
|
|
544
|
+
log(`record: no scenarios discovered under ${target} (loud non-zero — not a vacuous "0 failures = green")`);
|
|
545
|
+
return process.exit(2);
|
|
546
|
+
}
|
|
547
|
+
let failures = disc.broken.length;
|
|
548
|
+
for (const f of disc.scenarios) {
|
|
549
|
+
try {
|
|
550
|
+
const r = await recordScenarioFile(f, { noRedact, allowFailing });
|
|
551
|
+
log(`✓ ${f} → ${r.cassettePath} (${r.result.result})`);
|
|
552
|
+
}
|
|
553
|
+
catch (e) {
|
|
554
|
+
failures++;
|
|
555
|
+
log(`✗ ${f}: ${e.message}`);
|
|
556
|
+
}
|
|
557
|
+
}
|
|
558
|
+
log(failures > 0
|
|
559
|
+
? `✗ record: ${failures} of ${disc.scenarios.length + disc.broken.length} failed`
|
|
560
|
+
: `✓ record: ${disc.scenarios.length} cassette(s)`);
|
|
561
|
+
return process.exit(failures > 0 ? 1 : 0);
|
|
562
|
+
}
|
|
563
|
+
// Single scenario file.
|
|
564
|
+
try {
|
|
565
|
+
const cassettePath = outIdx >= 0 ? args[outIdx + 1] : undefined;
|
|
566
|
+
const r = await recordScenarioFile(target, { noRedact, allowFailing, cassettePath });
|
|
567
|
+
log(`✓ recorded ${r.result.result} · ${r.artifacts} artifact(s) → ${r.cassettePath}`);
|
|
568
|
+
}
|
|
569
|
+
catch (e) {
|
|
570
|
+
log(`record: ${e.message}`);
|
|
571
|
+
return process.exit(1);
|
|
572
|
+
}
|
|
573
|
+
}
|
|
574
|
+
/** The live-record TAIL shared by the file (B1/single) and in-memory (B2 re-record) paths: run live, refuse
|
|
575
|
+
* a failing run unless opted in (A3), snapshot + secret-scrub bodies (C2), opt-in redact + verdict-preserve
|
|
576
|
+
* (A1/A3), write. `extraPolicyDirs` adds the scenario-file dir to the .cowork-redact.json search. */
|
|
577
|
+
async function recordScenarioObject(scenario, opts, extraPolicyDirs = []) {
|
|
296
578
|
const result = await executeScenario(scenario);
|
|
297
|
-
const
|
|
298
|
-
const controlOut = safeLines(join(result.outDir, "control-out.jsonl"));
|
|
299
|
-
const cassettePath = outIdx >= 0 ? args[outIdx + 1] : join("cassettes", `${scenario.name}.cassette.json`);
|
|
579
|
+
const cassettePath = opts.cassettePath ?? join("cassettes", `${scenario.name}.cassette.json`);
|
|
300
580
|
mkdirSync(dirname(cassettePath), { recursive: true });
|
|
301
|
-
//
|
|
302
|
-
|
|
303
|
-
|
|
581
|
+
// A3: a failing live run frozen into a cassette is a latent false-signal — refuse unless opted in.
|
|
582
|
+
if (!computeVerdict(result, "live").pass && !opts.allowFailing)
|
|
583
|
+
throw new Error(`live run did NOT pass (result=${result.result}) — refusing to freeze a failing run (re-run, or --allow-failing)`);
|
|
584
|
+
// RELOCATABLE session path (relative to the cassette dir) — metadata-only, keeps a moved bundle honest.
|
|
304
585
|
const relocatable = {
|
|
305
586
|
...scenario,
|
|
306
587
|
session: scenario.session === "(inline)" ? "(inline)" : relative(dirname(cassettePath), scenario.session),
|
|
307
588
|
};
|
|
308
|
-
//
|
|
309
|
-
//
|
|
310
|
-
|
|
311
|
-
const artifacts = result.workDir ? buildManifest(result.workDir) : [];
|
|
312
|
-
const
|
|
313
|
-
const cassette = {
|
|
589
|
+
// C2: buildManifest reads output bodies RAW (executeScenario scrubs result/events/control-out, NOT
|
|
590
|
+
// outputs/) — secret-scrub each body before it is committed.
|
|
591
|
+
const secrets = collectSecrets();
|
|
592
|
+
const artifacts = (result.workDir ? buildManifest(result.workDir) : []).map((a) => a.body !== undefined ? { ...a, body: scrub(a.body, secrets) } : a);
|
|
593
|
+
const base = {
|
|
314
594
|
cassetteVersion: CASSETTE_VERSION,
|
|
315
595
|
scenario: relocatable,
|
|
316
|
-
events,
|
|
317
|
-
controlOut,
|
|
596
|
+
events: safeLines(join(result.outDir, "events.jsonl")),
|
|
597
|
+
controlOut: safeLines(join(result.outDir, "control-out.jsonl")),
|
|
318
598
|
effectiveFidelity: result.effectiveFidelity,
|
|
319
599
|
artifacts,
|
|
320
|
-
fingerprint,
|
|
600
|
+
fingerprint: buildFingerprint(scenario.session, result.baseline),
|
|
321
601
|
};
|
|
322
|
-
|
|
323
|
-
//
|
|
324
|
-
|
|
325
|
-
|
|
326
|
-
|
|
327
|
-
|
|
328
|
-
|
|
329
|
-
|
|
330
|
-
|
|
331
|
-
|
|
332
|
-
continue;
|
|
333
|
-
}
|
|
334
|
-
if (m.type === "assistant") {
|
|
335
|
-
turns++;
|
|
336
|
-
for (const b of m.message?.content ?? [])
|
|
337
|
-
if (b.type === "tool_use")
|
|
338
|
-
tools++;
|
|
339
|
-
}
|
|
602
|
+
// A1 (opt-in) content redaction over the whole surface (C1). Empty policy → no-op. Non-empty → must be
|
|
603
|
+
// VERDICT-PRESERVING (A3): replay both and refuse to write on divergence (a manufactured green).
|
|
604
|
+
const policy = opts.noRedact
|
|
605
|
+
? { patterns: [], keyNames: [] }
|
|
606
|
+
: loadRedactionPolicy([process.cwd(), ...extraPolicyDirs, dirname(cassettePath)]);
|
|
607
|
+
let cassette = base;
|
|
608
|
+
if (policy.patterns.length || policy.keyNames.length) {
|
|
609
|
+
const redacted = redactCassette(base, policy);
|
|
610
|
+
await assertRedactionVerdictPreserved(base, redacted);
|
|
611
|
+
cassette = redacted;
|
|
340
612
|
}
|
|
341
|
-
|
|
613
|
+
writeFileSync(cassettePath, JSON.stringify(cassette, null, 2));
|
|
614
|
+
return { result, cassettePath, artifacts: artifacts.length };
|
|
342
615
|
}
|
|
343
616
|
/** `replay --cassette <file>` — deterministic protocol-replay; re-evaluates content assertions. */
|
|
344
617
|
export async function cmdReplay(args) {
|
|
@@ -374,6 +647,94 @@ export async function cmdReplay(args) {
|
|
|
374
647
|
renderFooter(result, plan, { renderer, lane: "replay" });
|
|
375
648
|
process.exit(verdict.exitCode);
|
|
376
649
|
}
|
|
650
|
+
/** `verify-cassettes <file|dir>` — the CI gate (token/agent-free). Runs the privacy scan (A2) and the
|
|
651
|
+
* staleness check (B3) over one cassette or every `*.cassette.json` in a dir (non-recursive). Exit 1 on any
|
|
652
|
+
* real PII finding, staleness drift, or unreadable cassette; `unscanned` notes are informational. Dedicated JSON envelope. */
|
|
653
|
+
export function cmdVerifyCassettes(args) {
|
|
654
|
+
let json;
|
|
655
|
+
try {
|
|
656
|
+
json = parseOutputFormat(args) === "json";
|
|
657
|
+
}
|
|
658
|
+
catch (e) {
|
|
659
|
+
log(String(e.message));
|
|
660
|
+
return process.exit(2);
|
|
661
|
+
}
|
|
662
|
+
const privacyOnly = args.includes("--privacy-only");
|
|
663
|
+
const stalenessOnly = args.includes("--staleness-only");
|
|
664
|
+
// Both flags together would disable BOTH families → empty findings → ok=true → exit 0: a silent
|
|
665
|
+
// false-green in the gate itself. Reject it as a usage error.
|
|
666
|
+
if (privacyOnly && stalenessOnly) {
|
|
667
|
+
log("verify-cassettes: --privacy-only and --staleness-only are mutually exclusive (together they'd check nothing)");
|
|
668
|
+
return process.exit(2);
|
|
669
|
+
}
|
|
670
|
+
const doPrivacy = !stalenessOnly;
|
|
671
|
+
const doStaleness = !privacyOnly;
|
|
672
|
+
const allow = [];
|
|
673
|
+
for (let i = 0; i < args.length; i++) {
|
|
674
|
+
if (args[i] !== "--allow")
|
|
675
|
+
continue;
|
|
676
|
+
const src = args[++i];
|
|
677
|
+
if (src === undefined) {
|
|
678
|
+
log("--allow needs a regex value");
|
|
679
|
+
return process.exit(2);
|
|
680
|
+
}
|
|
681
|
+
try {
|
|
682
|
+
allow.push(new RegExp(src, "i"));
|
|
683
|
+
}
|
|
684
|
+
catch {
|
|
685
|
+
log(`--allow: invalid regex: ${src}`);
|
|
686
|
+
return process.exit(2);
|
|
687
|
+
}
|
|
688
|
+
}
|
|
689
|
+
const target = args.find((a, i) => !a.startsWith("--") && args[i - 1] !== "--allow");
|
|
690
|
+
if (!target) {
|
|
691
|
+
log("usage: verify-cassettes <file|dir> [--privacy-only|--staleness-only] [--allow <regex>]... [--output-format json]");
|
|
692
|
+
return process.exit(2);
|
|
693
|
+
}
|
|
694
|
+
if (!existsSync(target)) {
|
|
695
|
+
log(`verify-cassettes: path not found: ${target}`);
|
|
696
|
+
return process.exit(2);
|
|
697
|
+
}
|
|
698
|
+
const files = statSync(target).isDirectory()
|
|
699
|
+
? readdirSync(target)
|
|
700
|
+
.filter((f) => f.endsWith(".cassette.json"))
|
|
701
|
+
.sort()
|
|
702
|
+
.map((f) => join(target, f))
|
|
703
|
+
: [target];
|
|
704
|
+
if (files.length === 0) {
|
|
705
|
+
log(`verify-cassettes: no .cassette.json files under ${target} — nothing verified (loud non-zero, not a vacuous pass)`);
|
|
706
|
+
return process.exit(2);
|
|
707
|
+
}
|
|
708
|
+
const results = files.map((f) => {
|
|
709
|
+
const rc = readCassette(f);
|
|
710
|
+
if ("error" in rc)
|
|
711
|
+
return { file: f, findings: [], staleness: [], error: rc.error };
|
|
712
|
+
const findings = doPrivacy ? scanCassette(rc.cassette, allow) : [];
|
|
713
|
+
const staleness = doStaleness ? checkStaleness(rc.cassette, dirname(f)) : [];
|
|
714
|
+
return { file: f, findings, staleness, error: undefined };
|
|
715
|
+
});
|
|
716
|
+
const realFindings = results.flatMap((r) => r.findings.filter((x) => x.cls !== "unscanned"));
|
|
717
|
+
const staleAny = results.some((r) => r.staleness.length > 0);
|
|
718
|
+
const errorAny = results.some((r) => r.error !== undefined);
|
|
719
|
+
const ok = realFindings.length === 0 && !staleAny && !errorAny;
|
|
720
|
+
if (json) {
|
|
721
|
+
out(JSON.stringify({ command: "verify-cassettes", ok, results }));
|
|
722
|
+
}
|
|
723
|
+
else {
|
|
724
|
+
for (const r of results) {
|
|
725
|
+
if (r.error)
|
|
726
|
+
log(`✗ ${r.file}: [error] ${r.error}`);
|
|
727
|
+
for (const f of r.findings)
|
|
728
|
+
log(`${f.cls === "unscanned" ? "·" : "✗"} ${r.file}: [${f.cls}] ${f.where} — ${f.sample}`);
|
|
729
|
+
for (const s of r.staleness)
|
|
730
|
+
log(`✗ ${r.file}: [stale] ${s}`);
|
|
731
|
+
}
|
|
732
|
+
log(ok
|
|
733
|
+
? `✓ verify-cassettes: ${files.length} cassette(s) clean`
|
|
734
|
+
: `✗ verify-cassettes: ${realFindings.length} PII finding(s)${staleAny ? " + staleness drift" : ""}${errorAny ? " + unreadable cassette(s)" : ""} across ${files.length} cassette(s)`);
|
|
735
|
+
}
|
|
736
|
+
return process.exit(ok ? 0 : 1);
|
|
737
|
+
}
|
|
377
738
|
/** Replay a cassette through Run and re-evaluate the content assertions. With a `cassette.artifacts`
|
|
378
739
|
* manifest (#1), filesystem assertions (file_exists/user_visible_artifact/artifact_json) ALSO run, against
|
|
379
740
|
* the materialized snapshot. `opts.strict` (#1b) escalates staleness warnings to failing assertions. */
|
package/dist/scan.js
ADDED
|
@@ -0,0 +1,30 @@
|
|
|
1
|
+
export const DEFAULT_SCAN_PATTERNS = [
|
|
2
|
+
{ re: /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi, cls: "email" },
|
|
3
|
+
{ re: /\$\s?\d[\d,]*(?:\.\d+)?\s?(?:k|m|b|bn|million|billion)?/gi, cls: "currency" },
|
|
4
|
+
{ re: /\b[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.(?:com|io|net|org|co|app|ai|dev|xyz)\b/gi, cls: "domain" },
|
|
5
|
+
];
|
|
6
|
+
/** The high-precision subset (email only). `email` is scanned UNIVERSALLY — even on the agent
|
|
7
|
+
* capability-manifest messages (the `system/init` event and the `initialize` registry `control_response`),
|
|
8
|
+
* because the registry's `account` field can carry the developer's own email (a real leak). The noisy
|
|
9
|
+
* classes (`currency`/`domain`) are the ones suppressed on those two manifest messages, where every hit is
|
|
10
|
+
* the agent's tool/skill catalog or MCP-server names — environment boilerplate a regex can't tell apart
|
|
11
|
+
* from customer data. Everywhere else (assistant reasoning, tool I/O, decisions, the deliverable) gets the
|
|
12
|
+
* full net. */
|
|
13
|
+
export const EMAIL_SCAN_PATTERNS = DEFAULT_SCAN_PATTERNS.filter((p) => p.cls === "email");
|
|
14
|
+
function allowed(sample, allow) {
|
|
15
|
+
// Test against a non-global clone so a caller's /g regex can't carry lastIndex across calls.
|
|
16
|
+
return allow.some((a) => new RegExp(a.source, a.flags.replace("g", "")).test(sample));
|
|
17
|
+
}
|
|
18
|
+
/** Scan one string for PII matches, suppressing anything the allowlist covers. */
|
|
19
|
+
export function scanText(text, where, allow, patterns = DEFAULT_SCAN_PATTERNS) {
|
|
20
|
+
const out = [];
|
|
21
|
+
for (const { re, cls } of patterns) {
|
|
22
|
+
const g = new RegExp(re.source, re.flags.includes("g") ? re.flags : re.flags + "g");
|
|
23
|
+
for (const m of text.matchAll(g)) {
|
|
24
|
+
const sample = m[0];
|
|
25
|
+
if (!allowed(sample, allow))
|
|
26
|
+
out.push({ where, cls, sample });
|
|
27
|
+
}
|
|
28
|
+
}
|
|
29
|
+
return out;
|
|
30
|
+
}
|
package/dist/types.js
CHANGED
|
@@ -145,13 +145,17 @@ export const Assertion = z.object({
|
|
|
145
145
|
artifact: z.string().describe("relative path to a JSON artifact under the work root (e.g. outputs/cap_state.json)"),
|
|
146
146
|
path: z.string().optional().describe("dotted path into the JSON (e.g. me.run_id); omit to target the whole document"),
|
|
147
147
|
equals: z.unknown().optional().describe("the resolved value deep-equals this"),
|
|
148
|
+
in: z
|
|
149
|
+
.array(z.unknown())
|
|
150
|
+
.optional()
|
|
151
|
+
.describe("the resolved value deep-equals one of these (stable for stochastic/LLM-extracted values where equals churns)"),
|
|
148
152
|
gt: z.number().optional().describe("the resolved value is a number greater than this"),
|
|
149
153
|
exists: z.boolean().optional().describe("the path resolves to a present (non-absent) value"),
|
|
150
154
|
absent: z.boolean().optional().describe("the final key is absent from its (resolved) parent — the anti-hallucination negative"),
|
|
151
155
|
is_null: z.boolean().optional().describe("the resolved value is JSON null (distinct from absent)"),
|
|
152
156
|
})
|
|
153
157
|
.optional()
|
|
154
|
-
.describe("assert over a JSON artifact's contents (dotted path + equals|gt|exists|absent|is_null)"),
|
|
158
|
+
.describe("assert over a JSON artifact's contents (dotted path + equals|in|gt|exists|absent|is_null)"),
|
|
155
159
|
});
|
|
156
160
|
export const ScenarioObject = z
|
|
157
161
|
.object({
|
package/docs/cassette.md
CHANGED
|
@@ -183,6 +183,54 @@ Re-record a cassette when:
|
|
|
183
183
|
- `replay` exits 1 on a `replay_protocol_fidelity` mismatch — this means `serializeDecision` changed;
|
|
184
184
|
review the change, confirm it's correct, then re-record to update the frozen envelope.
|
|
185
185
|
|
|
186
|
+
## Batch recording
|
|
187
|
+
|
|
188
|
+
`record` takes a single scenario OR a directory:
|
|
189
|
+
|
|
190
|
+
```bash
|
|
191
|
+
cowork-harness record scenarios/ # record every scenario in the dir (one cassette each)
|
|
192
|
+
cowork-harness record cassettes/ --rerecord-stale # re-record ONLY the cassettes whose fingerprint drifted
|
|
193
|
+
```
|
|
194
|
+
|
|
195
|
+
Directory discovery keys on a **positive `prompt:` signal**: a `*.yaml`/`*.yml` file with no top-level `prompt:` is an
|
|
196
|
+
announced skip (it's a session/other doc), but a doc that *looks* like a scenario (has `prompt:`) yet fails
|
|
197
|
+
to parse is a **failure**, never a silent skip. Zero scenarios discovered → loud non-zero exit. `record`
|
|
198
|
+
also **refuses to freeze a failing live run** into a cassette (`--allow-failing` overrides) — a committed
|
|
199
|
+
red cassette is a latent false-signal.
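The discovery rule above can be sketched as follows. This is a minimal illustration and not the harness's code: `classify`, `batchExitCode`, and the `validate` callback are hypothetical names, and the documents are assumed to be already parsed from YAML.

```javascript
// Hypothetical: classify already-parsed YAML documents.
// `validate` stands in for the real scenario parser and is expected to throw on a bad scenario.
function classify(docs, validate) {
  const out = { scenarios: [], skipped: [], broken: [] };
  for (const { file, doc } of docs) {
    const hasPrompt = doc !== null && typeof doc === "object" && "prompt" in doc;
    if (!hasPrompt) {
      out.skipped.push(file); // no prompt: announced skip, not a failure
      continue;
    }
    try {
      validate(doc);
      out.scenarios.push(file);
    } catch (e) {
      out.broken.push({ file, error: e.message }); // looks like a scenario but is invalid
    }
  }
  return out;
}

// Hypothetical exit code following the documented behavior:
// zero discovered scenarios is a loud non-zero exit, never a vacuous pass.
function batchExitCode(out, recordFailures) {
  if (out.scenarios.length === 0) return 2;
  return out.broken.length + recordFailures > 0 ? 1 : 0;
}

// Assumed validator for the example: a scenario needs a string `name`.
const requireName = (doc) => {
  if (typeof doc.name !== "string") throw new Error("missing name");
};

const result = classify(
  [
    { file: "a.yaml", doc: { prompt: "hi", name: "a" } },
    { file: "session.yaml", doc: { mounts: [] } },
    { file: "b.yaml", doc: { prompt: "x" } },
  ],
  requireName,
);
console.log(result.scenarios, result.skipped, result.broken.length);
console.log(batchExitCode(result, 0));
```

Keying on the presence of `prompt` rather than on a parse failure is what lets a broken scenario count as a failure while a session file is only skipped.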
|
|
200
|
+
|
|
201
|
+
## Privacy: cassettes are committed fixtures
|
|
202
|
+
|
|
203
|
+
A cassette snapshots the transcript **and** the `outputs/` JSON bodies (names, dollar figures, share
|
|
204
|
+
counts) — committed PII surface. Two layers, distinct from secret-scrub (which only strips auth tokens):
|
|
205
|
+
|
|
206
|
+
- **Opt-in redaction** (the mutation). Drop a `.cowork-redact.json` next to your scenarios, or set
|
|
207
|
+
`COWORK_HARNESS_REDACT_PATTERNS` / `COWORK_HARNESS_REDACT_KEYS`. At record time it rewrites matching PII
|
|
208
|
+
across the whole cassette surface (transcript, artifact bodies + filenames, prompt/answers/assert,
|
|
209
|
+
skillSources) **structurally** — JSON stays valid and the AskUserQuestion question/answer strings stay in
|
|
210
|
+
sync, so the O7 guard still passes. Redaction is **verdict-preserving**: `record` replays before/after and
|
|
211
|
+
**refuses to write** if redaction would flip an assertion (a manufactured green is the cardinal sin).
|
|
212
|
+
`--no-redact` skips it for known-synthetic inputs.
|
|
213
|
+
- **Always-on scan gate** — `verify-cassettes <file|dir>` scans the committed cassettes and **exits
|
|
214
|
+
non-zero** on a finding, so "no leak" is a gate, not discipline. The full net (`email` + `currency` +
|
|
215
|
+
bare-`domain`) runs over the **whole cassette** — the deliverable (`outputs/` bodies + filenames), the
|
|
216
|
+
author-written `prompt`/`answers`/`assert`, AND the agent's reasoning + tool I/O — with **one structural
|
|
217
|
+
exception**: the agent's **capability-manifest** messages (the `system/init` event and the `initialize`
|
|
218
|
+
registry `control_response`, `request_id:"init-1"`) are excluded from the noisy classes. Those two carry
|
|
219
|
+
the tool/skill catalog (slash-command descriptions naming `docsend.com`, `Pitch.com`, …) and the MCP-server
|
|
220
|
+
names (`claude.ai Gmail`, …) — environment boilerplate a regex can't tell apart from customer data, and the
|
|
221
|
+
sole concentrated source of false positives. They are excluded **as a unit**, not by domain — but `email`
|
|
222
|
+
still scans them (the registry's `account` field can carry the developer's own email). `--allow <regex>`
|
|
223
|
+
suppresses synthetic / public reference names (e.g. `NVCA`, `Cooley GO`, `Acme`); multi-word proper names
|
|
224
|
+
are **not** a default class (too noisy). `verify-cassettes` also runs the **staleness** check
|
|
225
|
+
(`--staleness-only` runs it alone; `--privacy-only` runs only the scan): a drifted `skillHash` (you edited the skill but didn't re-record) fails the gate.
|
|
226
|
+
|
|
227
|
+
```bash
|
|
228
|
+
cowork-harness verify-cassettes cassettes/ --allow 'NVCA|Cooley GO|Acme'
|
|
229
|
+
```
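To make the effect of `--allow` concrete, here is a condensed sketch of the scan. The three patterns are copied from `DEFAULT_SCAN_PATTERNS` in `dist/scan.js`; the `scan` function itself is a simplified, hypothetical stand-in for `scanText` (it drops the `where` field and the manifest exclusion), and the sample text is invented.

```javascript
// Patterns as they appear in DEFAULT_SCAN_PATTERNS (dist/scan.js).
const PATTERNS = [
  { re: /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi, cls: "email" },
  { re: /\$\s?\d[\d,]*(?:\.\d+)?\s?(?:k|m|b|bn|million|billion)?/gi, cls: "currency" },
  { re: /\b[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.(?:com|io|net|org|co|app|ai|dev|xyz)\b/gi, cls: "domain" },
];

// Hypothetical condensed scan: report every match that no allowlist regex covers.
// `allow` is assumed to hold non-global regexes, so `test` carries no state between calls.
function scan(text, allow) {
  const findings = [];
  for (const { re, cls } of PATTERNS) {
    for (const m of text.matchAll(re)) {
      if (!allow.some((a) => a.test(m[0]))) findings.push({ cls, sample: m[0] });
    }
  }
  return findings;
}

const text = "Ping jane@corp.test about the $5k invoice on docsend.com";
console.log(scan(text, []).map((f) => f.cls));
console.log(scan(text, [/^docsend\.com$/i]).map((f) => f.cls));
```

The allowlist is tested against each matched sample, not against the surrounding text, so an entry only suppresses the hits it actually matches.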
|
|
230
|
+
|
|
231
|
+
The cardinal rule still holds: record against **synthetic** inputs (e.g. "Cadence / Acme", made-up
|
|
232
|
+
numbers) — redaction and the scan are belt-and-suspenders, not a license to record real customer data.
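The verdict-preserving property of redaction can be illustrated with a small sketch. Everything here is hypothetical: `redact`, `verdict`, and `redactPreservingVerdict` are invented names, the transcript is a plain string, and the single assertion is modeled loosely on `transcript_not_matches`. The real check replays full cassettes.

```javascript
// Hypothetical: rewrite every match of each (global) pattern with a placeholder.
function redact(text, patterns) {
  return patterns.reduce((t, re) => t.replace(re, "[REDACTED]"), text);
}

// Hypothetical verdict: pass when the transcript does NOT match the forbidden regex.
// `forbidden` is assumed to be non-global so `test` keeps no state.
function verdict(transcript, forbidden) {
  return { pass: !forbidden.test(transcript) };
}

// Refuse to keep a redacted transcript whose verdict differs from the original's.
function redactPreservingVerdict(transcript, patterns, forbidden) {
  const redacted = redact(transcript, patterns);
  if (verdict(transcript, forbidden).pass !== verdict(redacted, forbidden).pass) {
    throw new Error("redaction changed the verdict; refusing to write");
  }
  return redacted;
}

// Safe case: the assertion does not depend on the redacted value.
console.log(redactPreservingVerdict("Acme owes 12 shares", [/Acme/g], /Globex/));

// Unsafe case: redaction would turn a failing assertion into a passing one.
try {
  redactPreservingVerdict("Acme owes 12 shares", [/Acme/g], /Acme/);
} catch (e) {
  console.log(e.message);
}
```

The second call shows the manufactured green: the original transcript fails the assertion, the redacted one would pass, so the sketch throws instead of returning.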
|
|
233
|
+
|
|
186
234
|
## Committed fixture
|
|
187
235
|
|
|
188
236
|
`examples/replays/example-pdf-skill.cassette.json` is a **synthetic** cassette committed to the repo
|
package/docs/maintenance.md
CHANGED
|
@@ -112,12 +112,26 @@ npm version patch # or minor | major
|
|
|
112
112
|
git push --follow-tags
|
|
113
113
|
```
|
|
114
114
|
|
|
115
|
-
Pushing the `vX.Y.Z` tag triggers `release.yml`, which
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
|
|
119
|
-
|
|
120
|
-
|
|
115
|
+
Pushing the `vX.Y.Z` tag triggers `release.yml`, which (in order) **waits for `ci.yml` to have succeeded
|
|
116
|
+
for that commit**, verifies the tag matches `package.json`, checks `CHANGELOG.md` has a `## [X.Y.Z]`
|
|
117
|
+
heading, runs the **version-lockstep guard** (`npm run check:versions`), runs `npm run ci`, then
|
|
118
|
+
`npm publish --provenance --access public`. Auth is **OIDC**: the workflow's `id-token: write` is exchanged
|
|
119
|
+
for a short-lived publish credential — there is **no `NPM_TOKEN`**. A GitHub Release is opened from the tag,
|
|
120
|
+
and `prepublishOnly` re-runs CI so a manual publish is guarded too. A published version is **immutable** —
|
|
121
|
+
the same `X.Y.Z` can never be re-published, so a botched run needs a new patch (not a re-run against the
|
|
122
|
+
same version).
|
|
123
|
+
|
|
124
|
+
The `ci.yml`-success gate matters because `release.yml`'s own `npm run ci` is **TypeScript-only**, while
|
|
125
|
+
`ci.yml` also runs pytest (the Python helper lane), `format:check`, the replay gate, and the boundary +
|
|
126
|
+
scenario suites. Without the gate, a tag could publish a build that `main`'s CI would have rejected. The
|
|
127
|
+
gate polls (~30 min) so `git push --follow-tags` works even when the commit's CI is still running.
|
|
128
|
+
|
|
129
|
+
**Version-lockstep guard (`scripts/check-versions.ts`, run in both `ci.yml` and `release.yml`).** Fails
|
|
130
|
+
loud unless all version strings agree: `package.json` == `package-lock.json`; the three skill versions
|
|
131
|
+
(`marketplace.json`, the skill `plugin.json`, `SKILL.md` frontmatter) == each other; the `SKILL.md`
|
|
132
|
+
bootstrap floor `@>=X.Y.Z` == its `tracks-harness:` version; and that floor is `<=` `package.json` (the
|
|
133
|
+
skill can't demand a harness newer than this repo publishes). This enforces the lockstep the next section
|
|
134
|
+
describes, so a hand-edited bump can't silently drift.
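The four invariants can be sketched as below. This is an illustration under assumptions, not the contents of `scripts/check-versions.ts`: `compareSemver`, `checkLockstep`, and the shape of the input object are hypothetical, and the versions are passed in directly instead of being read from the repo files.

```javascript
// Hypothetical: compare two X.Y.Z strings numerically.
// Returns a negative number if a < b, zero if equal, a positive number if a > b.
function compareSemver(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return 0;
}

// Hypothetical input shape; each field is a plain X.Y.Z string.
function checkLockstep(v) {
  const problems = [];
  if (v.packageJson !== v.packageLock) problems.push("package.json != package-lock.json");
  if (new Set([v.marketplace, v.pluginJson, v.skillMd]).size !== 1) problems.push("skill versions disagree");
  if (v.floor !== v.tracksHarness) problems.push("bootstrap floor != tracks-harness");
  if (compareSemver(v.floor, v.packageJson) > 0) problems.push("floor is newer than package.json");
  return problems;
}

const versions = {
  packageJson: "0.3.0",
  packageLock: "0.3.0",
  marketplace: "1.2.0",
  pluginJson: "1.2.0",
  skillMd: "1.2.0",
  floor: "0.3.0",
  tracksHarness: "0.3.0",
};
console.log(checkLockstep(versions));
console.log(checkLockstep({ ...versions, floor: "0.10.0", tracksHarness: "0.10.0" }));
```

Comparing the parts as numbers matters for the floor check: a string comparison would rank `0.10.0` below `0.9.0`.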
|
|
121
135
|
|
|
122
136
|
**One-time setup (on npmjs.com):** configure a Trusted Publisher on the `cowork-harness` package →
|
|
123
137
|
provider GitHub Actions, repo `yaniv-golan/cowork-harness`, workflow filename `release.yml`,
|
package/docs/scenario.md
CHANGED
|
@@ -205,13 +205,21 @@ dotted `path` selects into the document; one operator decides the check:
|
|
|
205
205
|
- artifact_json: { artifact: outputs/cap_state.json, path: me.run_id, equals: "r1" }
|
|
206
206
|
- artifact_json: { artifact: outputs/cap_state.json, path: rounds.0.amount, gt: 0 }
|
|
207
207
|
- artifact_json: { artifact: outputs/instruments.json, path: exclusivity_days, absent: true } # anti-hallucination
|
|
208
|
+
- artifact_json: { artifact: outputs/cap_state.json, path: stage, in: ["seed", "series-a"] } # one of a stable set
|
|
208
209
|
```
|
|
209
|
-
Operators: `equals` (deep-equal) · `gt` (number) · `exists: <bool>` · `absent: <bool>` · `is_null: <bool>`.
|
|
210
|
+
Operators: `equals` (deep-equal) · `in: [<set>]` (deep-equal one of) · `gt` (number) · `exists: <bool>` · `absent: <bool>` · `is_null: <bool>`.
|
|
210
211
|
The three states are **distinct**: `absent` (the final key is missing from a parent that resolved) vs
|
|
211
212
|
`is_null` (present but JSON `null`) vs an **unresolved intermediate** segment (the artifact is malformed for
|
|
212
213
|
that path) — which **fails loud**, never a vacuous pass. (No JSONPath/jq — a dotted path keeps it
|
|
213
214
|
dependency-free and side-effect-free.)
|
|
214
215
|
|
|
216
|
+
> **Stable vs brittle asserts on stochastic (LLM-extracted) values.** A cassette freezes ONE stochastic
|
|
217
|
+
> output, so an `equals` on an LLM-extracted string will churn every time you re-record. Prefer **stable**
|
|
218
|
+
> operators for extracted values: `absent` / `exists` (the anti-hallucination negative is rock-stable),
|
|
219
|
+
> or `in: [<set>]` to accept any of a known-good set. Reserve `equals` for values the skill computes
|
|
220
|
+
> deterministically (ids, counts, enums). This pairs with record-time redaction: redaction rewrites the
|
|
221
|
+
> very strings an `equals` would pin, so `equals` on a redacted field would break on re-record anyway.
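As a rough illustration of why `in` is more stable than `equals`, here is a minimal sketch of set membership by deep equality over plain JSON values. The names `deepEqual` and `matchesIn` are hypothetical; the harness's own operator lives in `dist/assert.js`, which is not reproduced here.

```javascript
// Hypothetical helper: structural equality for JSON values
// (null, primitives, arrays, plain objects).
function deepEqual(a, b) {
  if (a === b) return true;
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const ka = Object.keys(a);
  const kb = Object.keys(b);
  if (ka.length !== kb.length) return false;
  return ka.every((k) => Object.prototype.hasOwnProperty.call(b, k) && deepEqual(a[k], b[k]));
}

// Hypothetical: `in` passes when the resolved value deep-equals at least one candidate.
function matchesIn(resolved, candidates) {
  return candidates.some((c) => deepEqual(resolved, c));
}

console.log(matchesIn("seed", ["seed", "series-a"])); // true
console.log(matchesIn("series-b", ["seed", "series-a"])); // false
console.log(matchesIn({ round: 1 }, [{ round: 1 }, { round: 2 }])); // true
```

A re-record that flips the extracted `stage` from `"seed"` to `"series-a"` would break an `equals` assertion but leave this `in` check passing.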
|
|
222
|
+
|
|
215
223
|
> **Boundary assertions** (`egress_*`, `expect_denied`) require a sandboxed fidelity — `container`, `microvm`, or `hostloop` (all share the container sandbox + egress proxy). Only `protocol` is rejected, to avoid a false pass — see [boundary.md](./boundary.md).
|
|
216
224
|
|
|
217
225
|
### Which assertions survive `replay` (CI placement)
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "cowork-harness",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.3.0",
|
|
4
4
|
"description": "Scriptable, CI-friendly harness for Claude Cowork's runtime contract for testing skills across scenarios — same agent, mounts, egress allowlist, permission protocol, and sandbox limitations.",
|
|
5
5
|
"license": "MIT",
|
|
6
6
|
"type": "module",
|
|
@@ -56,7 +56,8 @@
|
|
|
56
56
|
"format:check": "prettier --check \"src/**/*.ts\" \"test/**/*.ts\"",
|
|
57
57
|
"smoke:cli": "node dist/cli.js list",
|
|
58
58
|
"ci": "npm run typecheck && npm run build && npm run test",
|
|
59
|
-
"prepublishOnly": "npm run ci"
|
|
59
|
+
"prepublishOnly": "npm run ci",
|
|
60
|
+
"check:versions": "tsx scripts/check-versions.ts"
|
|
60
61
|
},
|
|
61
62
|
"dependencies": {
|
|
62
63
|
"yaml": "^2.5.0",
|
package/schema/scenario.schema.json
CHANGED
|
@@ -223,6 +223,11 @@
 "equals": {
   "description": "the resolved value deep-equals this"
 },
+"in": {
+  "type": "array",
+  "items": {},
+  "description": "the resolved value deep-equals one of these (stable for stochastic/LLM-extracted values where equals churns)"
+},
 "gt": {
   "type": "number",
   "description": "the resolved value is a number greater than this"
@@ -244,7 +249,7 @@
       "artifact"
     ],
     "additionalProperties": false,
-    "description": "assert over a JSON artifact's contents (dotted path + equals|gt|exists|absent|is_null)"
+    "description": "assert over a JSON artifact's contents (dotted path + equals|in|gt|exists|absent|is_null)"
   }
 },
 "additionalProperties": false
|
package/scripts/check-versions.ts
ADDED
|
@@ -0,0 +1,90 @@
+// Guards version lockstep across the npm package and the companion skill, so a
+// hand-edited release can't drift. Fails loud (exit 1) on any mismatch.
+//
+// npm run check:versions
+//
+// Invariants (the skill versions INDEPENDENTLY from the npm package — see
+// docs/maintenance.md — so we do NOT require the skill version to equal the
+// package version):
+// 1. npm self-consistency: package.json === package-lock.json (root + "" package).
+// 2. skill self-consistency: marketplace.json === skill plugin.json === SKILL.md `version:`.
+// 3. floor === tracks: SKILL.md bootstrap floor `@>=X.Y.Z` === `tracks-harness:` version.
+// 4. floor <= package: the harness version the skill demands must be one this repo
+// can publish (else the skill ships ahead of npm).
+import { readFileSync } from "node:fs";
+import { dirname, join } from "node:path";
+import { fileURLToPath, pathToFileURL } from "node:url";
+
+const REPO_ROOT = join(dirname(fileURLToPath(import.meta.url)), "..");
+const r = (p: string) => readFileSync(join(REPO_ROOT, p), "utf8");
+const json = (p: string) => JSON.parse(r(p)) as Record<string, any>;
+
+const SEMVER = /^\d+\.\d+\.\d+$/;
+/** Compare two X.Y.Z strings: <0 if a<b, 0 if equal, >0 if a>b. */
+function cmp(a: string, b: string): number {
+  const pa = a.split(".").map(Number);
+  const pb = b.split(".").map(Number);
+  for (let i = 0; i < 3; i++) if (pa[i] !== pb[i]) return pa[i] - pb[i];
+  return 0;
+}
+
+export function checkVersions(): { ok: boolean; errors: string[]; values: Record<string, string | undefined> } {
+  const errors: string[] = [];
+
+  const pkg = json("package.json").version as string;
+  const lock = json("package-lock.json");
+  const lockRoot = lock.version as string;
+  const lockPkg = lock.packages?.[""]?.version as string | undefined;
+
+  const market = json(".claude-plugin/marketplace.json").plugins?.[0]?.version as string | undefined;
+  const plugin = json(".claude/skills/cowork-harness/.claude-plugin/plugin.json").version as string | undefined;
+
+  const skillMd = r(".claude/skills/cowork-harness/SKILL.md");
+  const frontmatter = skillMd.split("---")[1] ?? "";
+  const skillVer = frontmatter.match(/^\s*version:\s*(\S+)\s*$/m)?.[1];
+  const tracks = skillMd.match(/tracks-harness:\s*cowork-harness\s+(\d+\.\d+\.\d+)/)?.[1];
+  const floor = skillMd.match(/cowork-harness@>=(\d+\.\d+\.\d+)/)?.[1];
+
+  const values = { pkg, lockRoot, lockPkg, market, plugin, skillVer, tracks, floor };
+
+  // 1. npm self-consistency
+  if (!SEMVER.test(pkg)) errors.push(`package.json version "${pkg}" is not X.Y.Z`);
+  if (lockRoot !== pkg) errors.push(`package-lock.json root version "${lockRoot}" != package.json "${pkg}"`);
+  if (lockPkg !== pkg) errors.push(`package-lock.json packages[""].version "${lockPkg}" != package.json "${pkg}"`);
+
+  // 2. skill self-consistency
+  const skillSet = new Set([market, plugin, skillVer]);
+  if (skillSet.size !== 1 || [...skillSet][0] === undefined) {
+    errors.push(
+      `skill version mismatch — marketplace.json=${market}, plugin.json=${plugin}, SKILL.md=${skillVer} (all three must agree)`,
+    );
+  }
+
+  // 3. floor === tracks-harness
+  if (!floor) errors.push(`could not find bootstrap floor "cowork-harness@>=X.Y.Z" in SKILL.md`);
+  if (!tracks) errors.push(`could not find "tracks-harness: cowork-harness X.Y.Z" in SKILL.md`);
+  if (floor && tracks && floor !== tracks) {
+    errors.push(`bootstrap floor "@>=${floor}" != tracks-harness "${tracks}" (keep them in lockstep)`);
+  }
+
+  // 4. floor <= package.json (the skill must not demand an unpublished/future harness)
+  if (floor && SEMVER.test(pkg) && cmp(floor, pkg) > 0) {
+    errors.push(`bootstrap floor "@>=${floor}" is ahead of package.json "${pkg}" — skill would lead npm`);
+  }
+
+  return { ok: errors.length === 0, errors, values };
+}
+
+function main(): void {
+  const { ok, errors, values } = checkVersions();
+  process.stdout.write(`version lockstep: ${JSON.stringify(values)}\n`);
+  if (ok) {
+    process.stdout.write("✓ all version strings are aligned\n");
+    return;
+  }
+  for (const e of errors) process.stderr.write(`::error::${e}\n`);
+  process.exitCode = 1;
+}
+
+// Run only when invoked directly (so a test can import checkVersions without side effects).
+if (import.meta.url === pathToFileURL(process.argv[1] ?? "").href) main();
|
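Invariants 3 and 4 of the script above can be exercised in isolation, without touching the repository files. The sketch below reuses the two `SKILL.md` regexes from the script together with a slightly more defensive variant of `cmp`. The function name `floorErrors`, its shortened error messages, and the sample `SKILL.md` text (including the shape of the bootstrap line) are hypothetical, made up for illustration.

```typescript
// Numeric X.Y.Z comparison, adapted from cmp() in check-versions.ts.
// Missing components are treated as 0 (an assumption of this sketch).
function cmp(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const d = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (d !== 0) return d;
  }
  return 0;
}

// Hypothetical helper: checks invariant 3 (floor === tracks) and
// invariant 4 (floor <= package) against an in-memory SKILL.md string.
function floorErrors(skillMd: string, pkg: string): string[] {
  const errors: string[] = [];
  const tracks = skillMd.match(/tracks-harness:\s*cowork-harness\s+(\d+\.\d+\.\d+)/)?.[1];
  const floor = skillMd.match(/cowork-harness@>=(\d+\.\d+\.\d+)/)?.[1];
  if (!floor || !tracks) errors.push("floor or tracks-harness not found");
  else if (floor !== tracks) errors.push(`floor ${floor} != tracks ${tracks}`);
  if (floor && cmp(floor, pkg) > 0) errors.push(`floor ${floor} is ahead of package ${pkg}`);
  return errors;
}

// Made-up SKILL.md content; the real file's layout may differ.
const sampleSkillMd = [
  "---",
  "name: cowork-harness",
  "version: 1.4.0",
  "tracks-harness: cowork-harness 0.3.0",
  "---",
  "Bootstrap: npx cowork-harness@>=0.3.0 list",
].join("\n");

console.log(floorErrors(sampleSkillMd, "0.3.0")); // no errors expected
console.log(floorErrors(sampleSkillMd, "0.2.0")); // floor is ahead of the package
```

The numeric comparison matters once a component reaches two digits: as plain strings, `"0.10.0"` sorts before `"0.9.0"`, whereas `cmp("0.10.0", "0.9.0")` is positive.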