@link-assistant/hive-mind 1.78.11 → 1.78.12

package/CHANGELOG.md CHANGED
@@ -1,5 +1,57 @@
  # @link-assistant/hive-mind

+ ## 1.78.12
+
+ ### Patch Changes
+
+ - 5f60c04: fix(isolation): default nested Docker daemon to fuse-overlayfs so multi-GB images fit on disk + add storage-driver/disk preflight diagnostics (#1914)
+
+ `--isolation docker` was reopened after PR #1915: native Docker isolation and
+ host-image passthrough now work, but the first isolated task on the >30 GB
+ `konard/hive-mind-dind` image still died with:
+
+ ```
+ failed to register layer: no space left on device
+ ```
+
+ even though most layers reported `Already exists` (the daemon was correctly
+ seeded — passthrough is working). The failure was during layer **registration**,
+ not download.
+
+ **Root cause (in this repo).** `Dockerfile.dind` baked `ENV
+ DIND_STORAGE_DRIVER="vfs"` (commit 44d2c29e). `vfs` performs **no copy-on-write**:
+ it materializes a full, independent copy of the entire filesystem for _every_
+ layer, so a multi-GB image's on-disk footprint becomes the _sum_ of all
+ cumulative layer sizes — many times the image size — and overflows the disk.
+ Worse, pinning the env var **defeated box-dind's storage-driver auto-detection**
+ (`overlay2 → fuse-overlayfs → vfs`, with graceful fallback): box would otherwise
+ have picked a copy-on-write driver here. `/dev/fuse` is present (the dind
+ container runs `--privileged`), the `fuse-overlayfs` binary ships in box-dind,
+ and `overlay` is in `/proc/filesystems` — so copy-on-write was available the
+ whole time but was being bypassed by the `vfs` pin.
+
+ **Fix.** `Dockerfile.dind` now pins `ENV DIND_STORAGE_DRIVER="fuse-overlayfs"` — a
+ copy-on-write driver that also works overlay-on-overlay (the compatibility reason
+ `vfs` was originally chosen; `overlay2` can fail on the overlay-backed hosts our
+ deploys run on). Under `fuse-overlayfs`, registering a 498 MB top layer on a
+ ~30 GB base costs ~498 MB instead of ~30 GB, so the image fits. Empirically
+ verified in the box-dind environment (`docs/case-studies/issue-1914/data/fuse-overlayfs-capability-proof.log`).
+
+ **Self-diagnosing preflight.** `src/isolation-runner.lib.mjs` gained two probes —
+ `checkDockerStorageDriver()` and `checkDockerDiskSpace()` — wired into
+ `preflightDockerIsolation()`. Before running an isolated task it now warns, with
+ an actionable remedy, when the nested daemon is on `vfs` (even if the image is
+ already present) or when free space at the Docker data root is below 40 GiB, so
+ the next operator hitting this gets a clear breadcrumb instead of a cryptic
+ `no space left on device`. Both probes are best-effort and never throw.
+
+ Added `tests/test-issue-1914-storage-driver-diagnostics.mjs` (34 assertions),
+ extended `tests/test-issue-1914-preflight-passthrough.mjs` and
+ `tests/test-docker-dind-variant.mjs`, refreshed `docs/DOCKER*.md`, and expanded
+ the `docs/case-studies/issue-1914` case study with the reopen timeline, refined
+ root-cause analysis, captured evidence, and an upstream observability request
+ (link-foundation/box#104: warn when the nested daemon lands on `vfs`).
+
  ## 1.78.11

  ### Patch Changes
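The disk arithmetic described in the 1.78.12 changelog entry can be illustrated with a small model. This is a minimal sketch, not code from the package: the layer sizes are made-up values, `vfsFootprintMB` and `cowFootprintMB` are hypothetical names, and it assumes each layer only adds data (no deletions or overwrites).

```javascript
// Hypothetical per-layer diff sizes in MB (illustrative, not measured from the real image).
const layerDiffsMB = [10000, 10000, 10000, 498];

// vfs model: every layer is stored as a full copy of the filesystem up to and
// including that layer, so the footprint is the sum of the cumulative sizes.
function vfsFootprintMB(diffs) {
  let cumulative = 0;
  let total = 0;
  for (const diff of diffs) {
    cumulative += diff;
    total += cumulative;
  }
  return total;
}

// Copy-on-write model: each layer stores only its own diff.
function cowFootprintMB(diffs) {
  return diffs.reduce((sum, diff) => sum + diff, 0);
}

console.log(vfsFootprintMB(layerDiffsMB)); // 90498
console.log(cowFootprintMB(layerDiffsMB)); // 30498
```

In this model the final 498 MB layer adds 30498 MB under the `vfs` assumption but only 498 MB under copy-on-write, which matches the shape of the "~498 MB instead of ~30 GB" claim in the changelog.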
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@link-assistant/hive-mind",
-   "version": "1.78.11",
+   "version": "1.78.12",
    "description": "AI-powered issue solver and hive mind for collaborative problem solving",
    "main": "src/hive.mjs",
    "type": "module",
package/src/isolation-runner.lib.mjs CHANGED
@@ -47,6 +47,12 @@ const DEFAULT_HOST_DOCKER_SOCK = '/var/run/host-docker.sock';
  // throwaway container — booting the dind image's dockerd entrypoint — purely to
  // check whether bash exists. See issue #1914.
  const DOCKER_ISOLATION_SHELL = 'sh';
+ // Free-space floor (GiB) below which the preflight warns that an impending
+ // isolation-image pull may fail with `no space left on device`. The Hive Mind
+ // isolation images are well over 30 GB extracted, so a host/nested daemon with
+ // less headroom than this cannot safely pull one. Diagnostic only — never
+ // blocks startup. See issue #1914.
+ const DOCKER_ISOLATION_LOW_DISK_GIB = 40;

  function normalizeProcessIds(value) {
    if (!value || typeof value !== 'object') return {};
@@ -87,12 +93,13 @@ function maybeAddMount(mounts, source, target, existsSync) {
  /**
   * Resolve the tag used for the Docker isolation image.
   *
-  * Defaults to `latest`, but operators can pin it (e.g. to the exact version
-  * already present on the host) via `HIVE_MIND_DOCKER_ISOLATION_IMAGE_TAG`.
-  * Pinning matters for Docker-in-Docker deployments: the nested daemon starts
-  * with an empty image store, so an unpinned `:latest` whose registry digest has
-  * drifted from the host copy forces a fresh multi-gigabyte pull on every task.
-  * A pinned tag lets a pre-seeded image be reused instead. See issue #1879.
+  * Release Docker images bake this env var from `HIVE_MIND_VERSION`, so a parent
+  * container started via `:latest` still launches child isolation containers from
+  * the same immutable release tag. Local/PR builds fall back to `latest`, and
+  * operators can override the tag explicitly when using custom images. Pinning
+  * matters for Docker-in-Docker deployments: the nested daemon starts with an
+  * empty image store, so a `:latest` digest drift from the host copy forces a
+  * fresh multi-gigabyte pull. See issue #1879.
   */
  export function resolveDockerIsolationImageTag({ env = process.env } = {}) {
    const explicit = String(env.HIVE_MIND_DOCKER_ISOLATION_IMAGE_TAG || '').trim();
@@ -618,6 +625,80 @@ export async function checkDockerImagePresent(image, verbose = false) {
    }
  }

+ /**
+  * Report the storage driver the (nested) Docker daemon is using.
+  *
+  * `vfs` performs NO copy-on-write — it stores a full copy of every image layer
+  * — so the multi-gigabyte Hive Mind images consume many times their real size
+  * on disk and the first isolated `docker run`/pull dies with
+  * `failed to register layer: no space left on device` (issue #1914 reopen).
+  * The preflight uses this to warn loudly when the daemon is on `vfs` instead of
+  * letting the disk silently overflow mid-task.
+  *
+  * Never throws: returns the lowercased driver name, or `null` when docker is
+  * unavailable / the daemon is unreachable.
+  *
+  * @param {boolean} [verbose] - Enable verbose logging
+  * @returns {Promise<string|null>} e.g. 'fuse-overlayfs', 'overlay2', 'vfs', or null
+  */
+ export async function checkDockerStorageDriver(verbose = false) {
+   try {
+     const result = await $({ mirror: false })`docker info --format ${'{{.Driver}}'}`;
+     const driver = (result.stdout?.toString() || '').trim().toLowerCase() || null;
+     if (verbose) console.log(`[VERBOSE] isolation-runner: docker storage driver: ${driver || '(unknown)'}`);
+     return driver;
+   } catch {
+     if (verbose) console.log('[VERBOSE] isolation-runner: docker info unavailable; storage driver unknown');
+     return null;
+   }
+ }
+
+ /**
+  * Report the free space (in GiB) on the Docker daemon's data root.
+  *
+  * The Hive Mind isolation images are multiple gigabytes; when the nested daemon
+  * has to pull one, it needs room for the extracted layers. This lets the
+  * preflight predict a `no space left on device` failure (issue #1914) instead
+  * of discovering it mid-pull. Resolves the daemon's real data root via
+  * `docker info` and falls back to `/var/lib/docker`, then reads `df -Pk`.
+  *
+  * Never throws: returns `{ availableGiB, dataRoot }`, or `null` when the
+  * information cannot be determined (no docker, no df, unparseable output).
+  *
+  * @param {boolean} [verbose] - Enable verbose logging
+  * @returns {Promise<{availableGiB: number, dataRoot: string}|null>}
+  */
+ export async function checkDockerDiskSpace(verbose = false) {
+   try {
+     let dataRoot = '/var/lib/docker';
+     try {
+       const info = await $({ mirror: false })`docker info --format ${'{{.DockerRootDir}}'}`;
+       const root = (info.stdout?.toString() || '').trim();
+       if (root) dataRoot = root;
+     } catch {
+       // Daemon unreachable: fall back to the conventional data root. If df then
+       // fails on it (e.g. the path does not exist) we return null below.
+     }
+
+     const df = await $({ mirror: false })`df -Pk ${dataRoot}`;
+     // `df -P` guarantees one logical line per filesystem (no wrapping). The last
+     // line is the data row: Filesystem 1024-blocks Used Available Capacity Mount
+     const lines = (df.stdout?.toString() || '').trim().split('\n');
+     const cols = (lines[lines.length - 1] || '').trim().split(/\s+/);
+     const availableKb = Number(cols[3]);
+     if (!Number.isFinite(availableKb)) {
+       if (verbose) console.log('[VERBOSE] isolation-runner: could not parse df output for Docker disk space');
+       return null;
+     }
+     const availableGiB = availableKb / (1024 * 1024);
+     if (verbose) console.log(`[VERBOSE] isolation-runner: Docker data root '${dataRoot}' has ${availableGiB.toFixed(1)} GiB free`);
+     return { availableGiB, dataRoot };
+   } catch {
+     if (verbose) console.log('[VERBOSE] isolation-runner: df unavailable; Docker disk space unknown');
+     return null;
+   }
+ }
+
  /**
   * Startup preflight for `--isolation docker`.
   *
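The `df -Pk` column parsing inside `checkDockerDiskSpace()` can be exercised on its own, without a Docker daemon. The following is a sketch under stated assumptions: `parseDfAvailableGiB` is a hypothetical helper name (the package inlines this logic), and the sample `df` output is invented for illustration.

```javascript
// Hypothetical helper mirroring the parsing steps shown in the hunk above.
// Returns free space in GiB, or null when the output cannot be parsed.
function parseDfAvailableGiB(dfOutput) {
  const lines = String(dfOutput || '').trim().split('\n');
  // With -P, the last line is the data row:
  // Filesystem 1024-blocks Used Available Capacity Mounted-on
  const cols = (lines[lines.length - 1] || '').trim().split(/\s+/);
  const availableKb = Number(cols[3]);
  if (!Number.isFinite(availableKb)) return null;
  return availableKb / (1024 * 1024); // KiB -> GiB
}

// Invented sample output (100 GiB filesystem, 30 GiB available).
const sample = [
  'Filesystem     1024-blocks     Used Available Capacity Mounted on',
  'overlay          104857600 73400320  31457280      70% /',
].join('\n');

console.log(parseDfAvailableGiB(sample)); // 30
console.log(parseDfAvailableGiB(''));     // null
```

A header-only output also yields `null`, because the fourth column is then the literal word `Available`, which does not convert to a finite number.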
@@ -637,42 +718,78 @@ export async function checkDockerImagePresent(image, verbose = false) {
   * blocks startup — a misconfigured passthrough should degrade to a slow first
   * task, not a dead bot.
   *
+  * It also surfaces the two root causes of the issue #1914 reopen
+  * (`failed to register layer: no space left on device`): a non-copy-on-write
+  * storage driver (`vfs`, which copies every layer in full) and a Docker data
+  * root with too little free space to hold the >30 GB image. Both are reported
+  * as loud, actionable warnings so the disk overflow is self-diagnosing at
+  * startup instead of surfacing mid-task.
+  *
   * @param {Object} [options]
   * @param {Object} [options.env] - Environment (defaults to process.env)
   * @param {Function} [options.existsSync] - fs.existsSync (injectable for tests)
   * @param {boolean} [options.verbose] - Enable verbose logging
   * @param {Object} [options.logger] - Logger with .log/.warn (defaults to console)
   * @param {Function} [options.checkImagePresent] - Image-presence probe (injectable for tests)
-  * @returns {Promise<{image: string, sock: string, socketMounted: boolean, imagePresent: boolean, isDind: boolean, ok: boolean, warnings: string[]}>}
+  * @param {Function} [options.checkStorageDriver] - Storage-driver probe (injectable for tests)
+  * @param {Function} [options.checkDiskSpace] - Disk-space probe (injectable for tests)
+  * @returns {Promise<{image: string, sock: string, socketMounted: boolean, imagePresent: boolean, isDind: boolean, storageDriver: (string|null), storageDriverOk: boolean, diskAvailableGiB: (number|null), ok: boolean, warnings: string[]}>}
   */
  export async function preflightDockerIsolation(options = {}) {
-   const { env = process.env, existsSync = fs.existsSync, verbose = false, logger = console, checkImagePresent = checkDockerImagePresent } = options;
+   const { env = process.env, existsSync = fs.existsSync, verbose = false, logger = console, checkImagePresent = checkDockerImagePresent, checkStorageDriver = checkDockerStorageDriver, checkDiskSpace = checkDockerDiskSpace } = options;

    const image = getDockerIsolationImage({ env });
    const sock = resolveHostDockerSock({ env });
    const isDind = shouldRunPrivilegedDockerIsolation(image, env);
    const socketMounted = Boolean(existsSync(sock));
    const imagePresent = Boolean(await checkImagePresent(image, verbose));
-
-   const result = { image, sock, socketMounted, imagePresent, isDind, ok: imagePresent, warnings: [] };
+   const storageDriver = await checkStorageDriver(verbose);
+   const disk = await checkDiskSpace(verbose);
+   const diskAvailableGiB = disk && Number.isFinite(disk.availableGiB) ? disk.availableGiB : null;
+   // Unknown driver (probe returned null) is treated as ok — we only flag the
+   // one driver known to overflow the disk, never block on missing information.
+   const storageDriverOk = storageDriver !== 'vfs';
+
+   const result = { image, sock, socketMounted, imagePresent, isDind, storageDriver, storageDriverOk, diskAvailableGiB, ok: imagePresent, warnings: [] };
    const info = typeof logger.log === 'function' ? logger.log.bind(logger) : () => {};
    const warn = typeof logger.warn === 'function' ? logger.warn.bind(logger) : info;

-   if (imagePresent) {
-     info(`✅ Docker isolation image '${image}' is already present locally — isolated tasks reuse it (no multi-GB pull). See issue #1914.`);
-     return result;
+   const preload = `node scripts/preload-dind-isolation-image.mjs --image ${image}`;
+
+   // Root Cause A of the issue #1914 reopen: a non-copy-on-write storage driver.
+   // `vfs` stores a full copy of every image layer, so the multi-GB images
+   // consume many times their size on disk and any layer write (pull, run,
+   // commit) can fail with `failed to register layer: no space left on device`.
+   // This is dangerous even when the image is already present — a task that
+   // commits or pulls more layers still overflows — so we warn independent of
+   // image presence.
+   if (storageDriver === 'vfs') {
+     result.warnings.push(`The Docker daemon backing '--isolation docker' is using the 'vfs' storage driver, which performs NO copy-on-write: ` + `it stores a full copy of every image layer, so the multi-GB Hive Mind images consume many times their size on disk and isolated tasks can fail with 'failed to register layer: no space left on device' (issue #1914). ` + `Switch to a copy-on-write driver: rebuild/redeploy with the current Dockerfile.dind (it defaults to 'fuse-overlayfs'), or for an already-running container add '-e DIND_STORAGE_DRIVER=fuse-overlayfs' to the bot container's 'docker run' and recreate it.`);
    }

-   // Image absent: the first isolated task will pull the full image. Explain the
-   // most likely cause and the exact fix instead of letting the operator first
-   // discover it as a surprise multi-gigabyte download mid-task.
-   const preload = `node scripts/preload-dind-isolation-image.mjs --image ${image}`;
-   if (isDind && !socketMounted) {
-     result.warnings.push(`Docker isolation image '${image}' is NOT in the nested Docker daemon and the host Docker socket is not mounted at ${sock}. ` + `box host-image passthrough cannot seed the nested daemon, so the FIRST isolated task will pull the full image (the Hive Mind images are multiple GB). ` + `Fix the deployment: add '-v /var/run/docker.sock:${sock}:ro' and '-e DIND_HOST_PASSTHROUGH_IMAGES="konard/hive-mind konard/hive-mind-dind"' to the bot container's 'docker run', or seed it now with: ${preload}`);
-   } else if (isDind && socketMounted) {
-     result.warnings.push(`Docker isolation image '${image}' is NOT in the nested Docker daemon even though the host Docker socket is mounted at ${sock}. ` + `box host-image passthrough may have skipped it (check DIND_HOST_PASSTHROUGH mode, the DIND_HOST_PASSTHROUGH_IMAGES allowlist, and that the host actually has '${image}' with a registry digest). ` + `The first isolated task will pull the full image. Seed it now with: ${preload}`);
-   } else {
-     result.warnings.push(`Docker isolation image '${image}' is not present locally; the first isolated task will pull it. ` + `If this host already has it under a different tag, pin HIVE_MIND_DOCKER_ISOLATION_IMAGE_TAG, or seed it with: ${preload}`);
+   if (!imagePresent) {
+     // Image absent: the first isolated task will pull the full image. Explain
+     // the most likely cause and the exact fix instead of letting the operator
+     // first discover it as a surprise multi-gigabyte download mid-task.
+     if (isDind && !socketMounted) {
+       result.warnings.push(`Docker isolation image '${image}' is NOT in the nested Docker daemon and the host Docker socket is not mounted at ${sock}. ` + `box host-image passthrough cannot seed the nested daemon, so the FIRST isolated task will pull the full image (the Hive Mind images are multiple GB). ` + `Fix the deployment: add '-v /var/run/docker.sock:${sock}:ro' and '-e DIND_HOST_PASSTHROUGH_IMAGES="konard/hive-mind konard/hive-mind-dind"' to the bot container's 'docker run', or seed it now with: ${preload}`);
+     } else if (isDind && socketMounted) {
+       result.warnings.push(`Docker isolation image '${image}' is NOT in the nested Docker daemon even though the host Docker socket is mounted at ${sock}. ` + `box host-image passthrough may have skipped it (check DIND_HOST_PASSTHROUGH mode, the DIND_HOST_PASSTHROUGH_IMAGES allowlist, and that the host actually has '${image}' with a registry digest). ` + `The first isolated task will pull the full image. Seed it now with: ${preload}`);
+     } else {
+       result.warnings.push(`Docker isolation image '${image}' is not present locally; the first isolated task will pull it. ` + `If this host already has it under a different tag, pin HIVE_MIND_DOCKER_ISOLATION_IMAGE_TAG, or seed it with: ${preload}`);
+     }
+
+     // Root Cause B of the issue #1914 reopen: too little disk for the pull. The
+     // image is well over 30 GB extracted; predict the `no space left on device`
+     // failure here rather than hitting it mid-pull.
+     if (diskAvailableGiB != null && diskAvailableGiB < DOCKER_ISOLATION_LOW_DISK_GIB) {
+       const root = disk?.dataRoot || 'the Docker data root';
+       result.warnings.push(`Only ~${diskAvailableGiB.toFixed(0)} GiB free on ${root} and the isolation image '${image}' is not present yet. ` + `The Hive Mind isolation image is well over 30 GB extracted, so the first isolated task's pull may fail with 'no space left on device' (issue #1914). ` + `Seed it via host passthrough (mount the host docker socket) or with '${preload}', and free space on the Docker data root.`);
+     }
+   }
+
+   if (imagePresent) {
+     info(`✅ Docker isolation image '${image}' is already present locally — isolated tasks reuse it (no multi-GB pull). See issue #1914.`);
    }
    for (const w of result.warnings) warn(`⚠️ ${w}`);
    return result;
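The branching in the new `preflightDockerIsolation()` body can be condensed into a small decision model. This is only a sketch: `preflightWarningCodes` and the short warning codes are hypothetical, the passthrough-specific message variants are collapsed into a single `image-absent` code, and the 40 GiB threshold is copied from `DOCKER_ISOLATION_LOW_DISK_GIB` in the diff.

```javascript
// Same threshold as DOCKER_ISOLATION_LOW_DISK_GIB in the diff above.
const LOW_DISK_GIB = 40;

// Hypothetical condensed model: returns short codes instead of full messages.
function preflightWarningCodes({ storageDriver, imagePresent, diskAvailableGiB }) {
  const codes = [];
  // The vfs warning fires regardless of whether the image is present.
  if (storageDriver === 'vfs') codes.push('vfs-storage-driver');
  if (!imagePresent) {
    codes.push('image-absent');
    // The low-disk warning is only evaluated when a pull is still pending,
    // and only when the probe returned a usable number.
    if (diskAvailableGiB != null && diskAvailableGiB < LOW_DISK_GIB) codes.push('low-disk');
  }
  return codes;
}

console.log(preflightWarningCodes({ storageDriver: 'vfs', imagePresent: true, diskAvailableGiB: 12 }));
// [ 'vfs-storage-driver' ]
console.log(preflightWarningCodes({ storageDriver: 'overlay2', imagePresent: false, diskAvailableGiB: 12 }));
// [ 'image-absent', 'low-disk' ]
console.log(preflightWarningCodes({ storageDriver: null, imagePresent: false, diskAvailableGiB: null }));
// [ 'image-absent' ]
```

The first call shows the behavioral change relative to 1.78.11: an already-present image no longer causes an early return, so the `vfs` warning is still emitted, while low free space alone produces no warning once the image is present.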