@checkstack/satellite 0.4.0 → 0.5.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,65 @@
  # @checkstack/satellite
 
+ ## 0.5.0
+
+ ### Minor Changes
+
+ - 9dcc848: Layered OS-level script sandbox, secure and fail-closed by default (epic #247).
+
+ Script and shell health checks and the `run_shell` / `run_script` automation actions now run inside a layered OS-level sandbox by default. The sandbox lives in `core/backend-api/src/script-sandbox/` (the single source of truth) and is enforced inside the shared runners, so it applies wherever a job runs.
+
+ Layers:
+
+ - Resource caps (CPU / memory / PID / FD / file-size, via `prlimit` on capable Linux; ESM JS-heap cap via `--max-old-space-size`; portable wall-clock timeout) and an OOM-safe streaming output cap.
+ - Privilege drop via a NON-ROOT supervisor model: the shipped images run the supervisor as non-root uid `65532`, so every sandboxed script inherits non-root and can never be host-root; filesystem + network confinement is delivered by ROOTLESS `bwrap`/`nsjail` via unprivileged user namespaces. `enforced.privilege` is truthful (true only when the child cannot run as host-root). Runners no longer pass `uid`/`gid` to `Bun.spawn` (a silent no-op and a forward-compat hazard).
+ - Filesystem isolation (`scratch-only` / `scratch-plus-ro`) confining the child to its per-run scratch dir over a read-only base; the interpreter path is RO-bound so the runtime execs, and `TMPDIR` is pinned to the in-namespace tmpfs.
+ - Network egress control: `deny` (routeless loopback-only netns), `allowlist` (real plumbed egress via macvlan OR rootless slirp4netns + an in-kernel nftables filter), and an always-on metadata / link-local block (`169.254.0.0/16`, `fe80::/10`, `fc00::/7`). No-blackhole invariant: `enforced.network` is never true when egress is actually severed or unfiltered; unpluggable egress degrades to surfaced host net.
+ - Per-run fork-bomb containment via `RLIMIT_NPROC` inside the fresh per-run user+PID namespace; a centralized forbidden-env denylist (`LD_PRELOAD`, `LD_LIBRARY_PATH`, `DYLD_*`, `NODE_OPTIONS`, `BUN_*`, caller `PATH` overrides).
+ - A validated tuned seccomp profile (`deploy/seccomp/checkstack-userns.json`) and a live `clone(CLONE_NEWUSER|CLONE_NEWNET)` capability probe (not the static sysctl), shipped by default in both Dockerfiles, `docker-compose.yml`, and `deploy/k8s/checkstack-sandbox.yaml`.
+
+ Global policy and operator surface:
+
+ - The global sandbox policy lives in ONE durable row owned by `script-packages` (its `ConfigService` row in shared `plugin_configs`). A single process-wide provider serves every runner; the two script plugins no longer register competing providers. A dedicated admin-only `script-sandbox.manage` permission gates both reading and writing the policy. New `getSandboxPolicy` / `setSandboxPolicy` endpoints and a Settings -> Script Sandbox admin UI (`enabled`, `onUnavailable`, network/filesystem/privilege modes, allow list, metadata block, resource caps). The startup capability/readiness log is emitted in-process by `script-packages-backend` (no fragile init-order RPC self-loop), and on a host that cannot enforce a layer a one-time startup warning explains the two local-dev paths (Docker, or set the global policy to `degrade`).
+ - Satellite relay: the WS protocol carries the resolved policy in the `authenticated` message and a `sandbox_policy` push-on-change; a satellite caches the last relayed policy and resolves every run through it.
+
+ BREAKING CHANGES (platform in BETA, shipped as minor):
+
+ - Scripts run sandboxed by default. The shipped global default is FAIL-CLOSED (`onUnavailable: "fail"`): when a requested layer cannot be enforced the run is REFUSED (clean `exitCode: -1`, never an unsandboxed spawn) rather than silently degrading. Deployments on hosts that cannot enforce a layer (no bubblewrap, user namespaces blocked, no `/proc` unmask) must run the official images with the documented runtime flags (the bundled seccomp profile + `systempaths=unconfined`, or k8s `procMount: Unmasked`), or set the global policy to `degrade`. On macOS / restricted containers the strong layers degrade to the portable subset and are surfaced per run.
+ - Default network posture is deny-egress (`allowlist` with an empty allow list, which resolves to the routeless `deny` path). Scripts calling external endpoints fail until those destinations are allowlisted in the global default. The always-on metadata / link-local block applies even under looser modes.
+ - The per-action / per-check `sandbox` config override and the transport `ScriptRequest.sandbox` field are removed; policy is global-only, so an automation/check author can no longer weaken the sandbox on their own item. Stored configs carrying a stray `sandbox` key are tolerated (stripped on parse).
+ - The shared runners' `run()` no longer accepts a `sandbox` option; callers rely on the global policy provider.
+ - A satellite fails closed (most restrictive profile) until it receives the first relayed policy; a relay-read failure or an older core keeps it fail-closed. A relay failure can never loosen a satellite's sandbox.
+
+ State and scale: the global policy is a single durable Postgres row read identically on every pod. Capability detection is per-process, deterministic from the host kernel, and surfaced per run via the `EffectiveSandbox` report (a Linux pod and a macOS satellite may legitimately differ). `CHECKSTACK_SANDBOX_UID/GID` and macvlan addressing are genuinely per-host infrastructure, surfaced per run, not the queryable policy. The satellite's policy cache is satellite-local transport state. No new pod-local current-state.
+
+ This is a beta minor.
+
+ ### Patch Changes
+
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - @checkstack/backend-api@0.21.0
+ - @checkstack/common@0.13.0
+ - @checkstack/script-packages-backend@0.3.0
+ - @checkstack/satellite-common@0.8.0
+
+ ## 0.4.1
+
+ ### Patch Changes
+
+ - Updated dependencies [a57f7db]
+ - @checkstack/backend-api@0.20.0
+ - @checkstack/script-packages-backend@0.2.1
+
  ## 0.4.0
 
  ### Minor Changes
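The fail-closed rule in the 0.5.0 breaking changes (refuse with `exitCode: -1` under `onUnavailable: "fail"`, otherwise run with the missing layers surfaced) can be sketched as a small decision function. This is a minimal illustration, not the package's implementation: `LayerStatus`, `SpawnDecision`, and `decideSpawn` are hypothetical names, and only the `"fail"` / `"degrade"` values and the `-1` exit code come from the changelog.

```typescript
// Hypothetical names throughout; only "fail" / "degrade" and exit code -1
// are taken from the changelog text.
type OnUnavailable = "fail" | "degrade";

interface LayerStatus {
  name: string; // e.g. "network", "filesystem", "privilege"
  requested: boolean; // the policy asks for this layer
  available: boolean; // the host can actually enforce it
}

type SpawnDecision =
  | { kind: "run"; enforced: string[]; degraded: string[] }
  | { kind: "refuse"; exitCode: -1; missing: string[] };

function decideSpawn(
  layers: LayerStatus[],
  onUnavailable: OnUnavailable,
): SpawnDecision {
  const requested = layers.filter((layer) => layer.requested);
  const missing = requested
    .filter((layer) => !layer.available)
    .map((layer) => layer.name);

  if (missing.length > 0 && onUnavailable === "fail") {
    // Fail closed: refuse the run instead of spawning unsandboxed.
    return { kind: "refuse", exitCode: -1, missing };
  }

  // Either everything is enforceable, or the operator opted into "degrade":
  // run with what the host supports and report the rest per run.
  const enforced = requested
    .filter((layer) => layer.available)
    .map((layer) => layer.name);
  return { kind: "run", enforced, degraded: missing };
}

// Usage: a host that cannot enforce the network layer.
const hostLayers: LayerStatus[] = [
  { name: "network", requested: true, available: false },
  { name: "filesystem", requested: true, available: true },
];
const strictDecision = decideSpawn(hostLayers, "fail");
const lenientDecision = decideSpawn(hostLayers, "degrade");
```

Under these assumptions `strictDecision` is a refusal that names `network` as missing, while `lenientDecision` runs with `filesystem` enforced and `network` reported as degraded.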
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@checkstack/satellite",
-   "version": "0.4.0",
+   "version": "0.5.0",
    "license": "Elastic-2.0",
    "type": "module",
    "main": "src/index.ts",
@@ -11,9 +11,9 @@
      "lint:code": "eslint . --max-warnings 0"
    },
    "dependencies": {
-     "@checkstack/satellite-common": "0.6.0",
-     "@checkstack/backend-api": "0.18.0",
-     "@checkstack/script-packages-backend": "0.1.0",
+     "@checkstack/satellite-common": "0.7.0",
+     "@checkstack/backend-api": "0.20.0",
+     "@checkstack/script-packages-backend": "0.2.1",
      "@checkstack/common": "0.12.0"
    },
    "devDependencies": {
package/src/index.ts CHANGED
@@ -2,13 +2,15 @@ import type {
    SatelliteAssignment,
    ResultMessage,
  } from "@checkstack/satellite-common";
- import type {
-   ConnectedClient,
-   TransportClient,
-   CollectorRunContext,
+ import {
+   registerSandboxPolicyProvider,
+   type ConnectedClient,
+   type TransportClient,
+   type CollectorRunContext,
  } from "@checkstack/backend-api";
  import { resolveScriptPackagesDir } from "@checkstack/script-packages-backend";
  import { SatelliteClient } from "./satellite-client";
+ import { SatelliteSandboxPolicyCache } from "./sandbox-policy-cache";
  import { Scheduler } from "./scheduler";
  import { loadStrategies } from "./strategy-loader";
  import { buildRunContext } from "./run-context";
@@ -58,6 +60,27 @@ const logger = {
  // =============================================================================
 
  logger.info(`Starting Checkstack Satellite v${VERSION}`);
+
+ // Wire the process-wide GLOBAL sandbox policy provider for the satellite
+ // runtime. The script runners (shell + ESM) resolve the active policy through
+ // this provider and FAIL CLOSED if none is registered, so it MUST be set before
+ // any health-check script runs.
+ //
+ // The satellite has no ConfigService, so it cannot read the durable cluster
+ // policy directly. Instead the core RELAYS the resolved global policy over the
+ // already-authenticated WS channel: on connect (carried in the `authenticated`
+ // message) and on change (a `sandbox_policy` push). The cache holds the last
+ // relayed policy and the provider resolves through it.
+ //
+ // FAIL CLOSED UNTIL RELAY: before the first policy is received, the cache's
+ // provider returns the FAIL_CLOSED profile (deny egress, scratch filesystem +
+ // read-only managed packages, privilege drop) - NEVER the permissive shipped
+ // default. A satellite must never run a script with a looser policy than core
+ // relayed; before the first relay there is none, so it denies. Trust is
+ // established by the authenticated WS connection.
+ const sandboxPolicyCache = new SatelliteSandboxPolicyCache();
+ registerSandboxPolicyProvider(sandboxPolicyCache.toProvider());
+
  logger.info("Loading health check strategies...");
 
  const { healthCheckRegistry, collectorRegistry } = await loadStrategies({
@@ -292,6 +315,12 @@ const client = new SatelliteClient({
    onScriptPackagesLockfileHash: (lockfileHash) => {
      void scriptPackages.reconcile(lockfileHash);
    },
+   onSandboxPolicy: (policy) => {
+     // Cache the relayed cluster-wide policy; the runner's provider resolves
+     // through this cache. Fail-closed until the first relay arrives.
+     sandboxPolicyCache.set(policy);
+     logger.info("Applied relayed global sandbox policy");
+   },
    onDisconnect: () => {
      scheduler.stop();
    },
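The two delivery paths described in the comments above (the policy carried in the `authenticated` message on connect, and the `sandbox_policy` push on change) can be sketched as one dispatch function feeding the same callback. This is a hedged sketch rather than the client's actual code: `Policy` is a trimmed-down stand-in for `SandboxPolicy`, `relaySandboxPolicy` is a hypothetical helper, and the `type` discriminant is an assumption. The `sandboxPolicy` and `policy` field names are the ones read in the `satellite-client` hunks of this diff.

```typescript
// Hypothetical, trimmed-down stand-in for the SandboxPolicy type.
interface Policy {
  network: { mode: "deny" | "allowlist"; allow: string[] };
}

// Assumed wire shapes: the `type` discriminant is a guess; `sandboxPolicy`
// and `policy` are the field names that appear in the diff.
type ServerMessage =
  | { type: "authenticated"; sandboxPolicy?: Policy }
  | { type: "sandbox_policy"; policy: Policy };

function relaySandboxPolicy(
  msg: ServerMessage,
  onSandboxPolicy?: (policy: Policy) => void,
): void {
  switch (msg.type) {
    case "authenticated":
      // The field may be absent (for example from an older core); nothing is
      // relayed in that case, so the satellite stays fail-closed.
      if (msg.sandboxPolicy !== undefined) {
        onSandboxPolicy?.(msg.sandboxPolicy);
      }
      break;
    case "sandbox_policy":
      // Push-on-change: always carries a policy.
      onSandboxPolicy?.(msg.policy);
      break;
  }
}

// Usage: record every policy the callback receives.
const received: Policy[] = [];
const record = (policy: Policy): void => {
  received.push(policy);
};

relaySandboxPolicy({ type: "authenticated" }, record); // no policy attached
relaySandboxPolicy(
  { type: "sandbox_policy", policy: { network: { mode: "deny", allow: [] } } },
  record,
);
```

Routing both paths through one callback keeps the cache as the only place where the active policy changes, which matches how `onSandboxPolicy` is wired in `index.ts`.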
@@ -0,0 +1,56 @@
+ import { describe, expect, it } from "bun:test";
+ import {
+   FAIL_CLOSED_SANDBOX_PROFILE,
+   resolveDefaultSandboxProfile,
+   sandboxPolicySchema,
+ } from "@checkstack/backend-api";
+ import { SatelliteSandboxPolicyCache } from "./sandbox-policy-cache";
+
+ const RELAYED = sandboxPolicySchema.parse({
+   enabled: true,
+   onUnavailable: "degrade",
+   resources: { cpuSeconds: 42 },
+   filesystem: { mode: "scratch-plus-ro" },
+   network: { mode: "deny", allow: [], denyLinkLocalAndMetadata: true },
+   privilege: { mode: "drop-to-uid" },
+ });
+
+ describe("SatelliteSandboxPolicyCache fail-closed-until-relay", () => {
+   it("resolves the FAIL-CLOSED profile BEFORE any policy is relayed", async () => {
+     const cache = new SatelliteSandboxPolicyCache();
+     const policy = await cache.resolve();
+     expect(policy).toEqual(FAIL_CLOSED_SANDBOX_PROFILE);
+     // Crucially NOT the permissive shipped default.
+     expect(policy).not.toEqual(resolveDefaultSandboxProfile());
+     expect(policy.network.mode).toBe("deny");
+   });
+
+   it("resolves the CACHED relayed policy once one has been received", async () => {
+     const cache = new SatelliteSandboxPolicyCache();
+     cache.set(RELAYED);
+     const policy = await cache.resolve();
+     expect(policy).toEqual(RELAYED);
+     expect(policy.resources.cpuSeconds).toBe(42);
+   });
+
+   it("a later relay replaces the cached policy", async () => {
+     const cache = new SatelliteSandboxPolicyCache();
+     cache.set(RELAYED);
+     const next = sandboxPolicySchema.parse({
+       ...RELAYED,
+       network: { mode: "allowlist", allow: ["10.0.0.1"], denyLinkLocalAndMetadata: true },
+     });
+     cache.set(next);
+     const policy = await cache.resolve();
+     expect(policy.network.mode).toBe("allowlist");
+     expect(policy.network.allow).toEqual(["10.0.0.1"]);
+   });
+
+   it("the provider closure fails closed until the first relay, then returns the cache", async () => {
+     const cache = new SatelliteSandboxPolicyCache();
+     const provider = cache.toProvider();
+     expect(await provider()).toEqual(FAIL_CLOSED_SANDBOX_PROFILE);
+     cache.set(RELAYED);
+     expect(await provider()).toEqual(RELAYED);
+   });
+ });
@@ -0,0 +1,49 @@
+ import {
+   FAIL_CLOSED_SANDBOX_PROFILE,
+   type SandboxPolicy,
+   type SandboxPolicyProvider,
+ } from "@checkstack/backend-api";
+
+ /**
+  * Pod-local cache of the GLOBAL sandbox policy relayed to this satellite by the
+  * core over the authenticated WS channel.
+  *
+  * State-and-scale note: this is NOT a queryable global source of truth - the
+  * core's durable policy row is. It is genuinely satellite-local transport state
+  * (the policy this specific satellite was last told to enforce), exactly like
+  * the pod-local live-socket registry on the core side. The satellite has no
+  * database connection, so it cannot read the durable policy directly; it trusts
+  * the already-authenticated WS connection to relay it.
+  *
+  * SECURITY - FAIL CLOSED UNTIL RELAY: before the first policy is received,
+  * {@link resolve} returns {@link FAIL_CLOSED_SANDBOX_PROFILE} (deny egress,
+  * scratch filesystem + read-only managed packages, privilege drop) - NEVER the
+  * permissive shipped default. A satellite must never run a script with a looser
+  * policy than core relayed, and before the first relay there is no relayed
+  * policy, so it denies. The policy is delivered on connect (in the
+  * `authenticated` message) and on change (a `sandbox_policy` push), so the
+  * fail-closed window is only the brief interval before the first authenticated
+  * message is processed.
+  */
+ export class SatelliteSandboxPolicyCache {
+   private policy: SandboxPolicy | undefined;
+
+   /** Replace the cached policy with the latest relayed value. */
+   set(policy: SandboxPolicy): void {
+     this.policy = policy;
+   }
+
+   /**
+    * Resolve the policy to enforce for a run: the cached relayed policy, or the
+    * fail-closed profile if nothing has been relayed yet. Bound so it can be
+    * handed directly to {@link registerSandboxPolicyProvider} as the provider.
+    */
+   resolve = async (): Promise<SandboxPolicy> => {
+     return this.policy ?? FAIL_CLOSED_SANDBOX_PROFILE;
+   };
+
+   /** A {@link SandboxPolicyProvider} closure bound to this cache. */
+   toProvider(): SandboxPolicyProvider {
+     return this.resolve;
+   }
+ }
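Why `resolve` is declared as an arrow-function property rather than a method may be easier to see in isolation: the provider is handed around as a bare function, so it has to keep its `this` binding after being detached from the instance. The sketch below re-creates the pattern with local stand-ins, since the real `SandboxPolicy` type and `FAIL_CLOSED_SANDBOX_PROFILE` constant live in `@checkstack/backend-api` and their exact shapes are not shown in this diff. `Policy`, `FAIL_CLOSED`, `PolicyCache`, `modeForNextRun`, and `demoFailClosedUntilRelay` are all hypothetical names.

```typescript
// Hypothetical stand-ins for SandboxPolicy and FAIL_CLOSED_SANDBOX_PROFILE.
interface Policy {
  network: { mode: "deny" | "allowlist"; allow: string[] };
}
const FAIL_CLOSED: Policy = { network: { mode: "deny", allow: [] } };

class PolicyCache {
  private policy: Policy | undefined;

  set(policy: Policy): void {
    this.policy = policy;
  }

  // Arrow property: `this` is captured when the instance is constructed, so
  // the function still reads this cache after being passed around on its own.
  resolve = async (): Promise<Policy> => this.policy ?? FAIL_CLOSED;
}

// Stand-in for a runner that only ever sees the bare provider function.
async function modeForNextRun(
  provider: () => Promise<Policy>,
): Promise<string> {
  return (await provider()).network.mode;
}

async function demoFailClosedUntilRelay(): Promise<[string, string]> {
  const cache = new PolicyCache();
  const provider = cache.resolve; // detached from the instance

  const before = await modeForNextRun(provider); // nothing relayed yet
  cache.set({ network: { mode: "allowlist", allow: ["10.0.0.1"] } });
  const after = await modeForNextRun(provider); // sees the relayed policy

  return [before, after];
}
```

With these stand-ins the first element of the returned pair is the fail-closed `deny` mode and the second is the relayed `allowlist` mode, which mirrors the last test case in the accompanying test file.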
@@ -10,6 +10,7 @@ import type {
    ResultMessage,
    ScriptPackageSyncStateMessage,
  } from "@checkstack/satellite-common";
+ import type { SandboxPolicy } from "@checkstack/backend-api";
  import { ResultBuffer } from "./result-buffer";
 
  interface ManifestEntryWire {
@@ -31,6 +32,13 @@ interface SatelliteClientConfig {
     * `refresh_script_packages` push. The satellite reconciles to it.
     */
    onScriptPackagesLockfileHash?: (lockfileHash: string | null) => void;
+   /**
+    * Called with the relayed GLOBAL sandbox policy - on connect (carried in the
+    * `authenticated` message) and on a `sandbox_policy` push. The satellite
+    * caches it and resolves every script run through it. Until the first call
+    * the satellite FAILS CLOSED (denies egress).
+    */
+   onSandboxPolicy?: (policy: SandboxPolicy) => void;
    logger?: {
      info: (msg: string) => void;
      warn: (msg: string) => void;
@@ -249,6 +257,12 @@ export class SatelliteClient {
        msg.scriptPackagesLockfileHash,
      );
    }
+   // Sandbox policy relay (durable backstop): cache the policy carried on
+   // (re)connect so runs enforce the operator's cluster-wide policy. Until
+   // this arrives the satellite stays fail-closed (deny egress).
+   if (msg.sandboxPolicy !== undefined) {
+     this.config.onSandboxPolicy?.(msg.sandboxPolicy);
+   }
    break;
  }
 
@@ -280,6 +294,14 @@ export class SatelliteClient {
    break;
  }
 
+ case "sandbox_policy": {
+   // Push-on-change: replace the cached global sandbox policy so the next
+   // run enforces it immediately.
+   this.config.logger?.info("Received updated global sandbox policy");
+   this.config.onSandboxPolicy?.(msg.policy);
+   break;
+ }
+
  case "script_package_manifest": {
    // The pending callback is looked up by a message-supplied key. Validate
    // that what we got back is actually callable before invoking it, so an