@link-assistant/hive-mind 1.78.10 → 1.78.12

package/CHANGELOG.md CHANGED
@@ -1,5 +1,96 @@
  # @link-assistant/hive-mind
 
+ ## 1.78.12
+
+ ### Patch Changes
+
+ - 5f60c04: fix(isolation): default nested Docker daemon to fuse-overlayfs so multi-GB images fit on disk + add storage-driver/disk preflight diagnostics (#1914)
+
+ Issue #1914 (`--isolation docker`) was reopened after PR #1915: native Docker
+ isolation and host-image passthrough now work, but the first isolated task on
+ the >30 GB `konard/hive-mind-dind` image still died with:
+
+ ```
+ failed to register layer: no space left on device
+ ```
+
+ even though most layers reported `Already exists` (the daemon was correctly
+ seeded — passthrough is working). The failure was during layer **registration**,
+ not download.
+
+ **Root cause (in this repo).** `Dockerfile.dind` baked `ENV
+ DIND_STORAGE_DRIVER="vfs"` (commit 44d2c29e). `vfs` performs **no copy-on-write**:
+ it materializes a full, independent copy of the entire filesystem for _every_
+ layer, so a multi-GB image's on-disk footprint becomes the _sum_ of all
+ cumulative layer sizes — many times the image size — and overflows the disk.
+ Worse, pinning the env var **defeated box-dind's storage-driver auto-detection**
+ (`overlay2 → fuse-overlayfs → vfs`, with graceful fallback): box would otherwise
+ have picked a copy-on-write driver here. `/dev/fuse` is present (the dind
+ container runs `--privileged`), the `fuse-overlayfs` binary ships in box-dind,
+ and `overlay` is in `/proc/filesystems` — so copy-on-write was available the
+ whole time but was being bypassed by the `vfs` pin.
+
+ **Fix.** `Dockerfile.dind` now pins `ENV DIND_STORAGE_DRIVER="fuse-overlayfs"` — a
+ copy-on-write driver that also works overlay-on-overlay (the compatibility reason
+ `vfs` was originally chosen; `overlay2` can fail on the overlay-backed hosts our
+ deploys run on). Under `fuse-overlayfs`, registering a 498 MB top layer on a
+ ~30 GB base costs ~498 MB instead of ~30 GB, so the image fits. Empirically
+ verified in the box-dind environment (`docs/case-studies/issue-1914/data/fuse-overlayfs-capability-proof.log`).
+
+ **Self-diagnosing preflight.** `src/isolation-runner.lib.mjs` gained two probes —
+ `checkDockerStorageDriver()` and `checkDockerDiskSpace()` — wired into
+ `preflightDockerIsolation()`. Before running an isolated task it now warns, with
+ an actionable remedy, when the nested daemon is on `vfs` (even if the image is
+ already present) or when the image is not yet present and free space at the
+ Docker data root is below 40 GiB, so the next operator hitting this gets a clear
+ breadcrumb instead of a cryptic `no space left on device`. Both probes are
+ best-effort and never throw.
+
+ Added `tests/test-issue-1914-storage-driver-diagnostics.mjs` (34 assertions),
+ extended `tests/test-issue-1914-preflight-passthrough.mjs` and
+ `tests/test-docker-dind-variant.mjs`, refreshed `docs/DOCKER*.md`, and expanded
+ the `docs/case-studies/issue-1914` case study with the reopen timeline, refined
+ root-cause analysis, captured evidence, and an upstream observability request
+ (link-foundation/box#104: warn when the nested daemon lands on `vfs`).
+
+ ## 1.78.11
+
+ ### Patch Changes
+
+ - 24fb17e: fix(retry): auto-resume on server-side 429 "Server is temporarily limiting requests" rate-limit errors (#1924)
+
+ A long-running solve session (177 turns, ~72 min) was thrown away when the Claude
+ CLI surfaced a **server-side temporary rate limit**:
+
+ ```
+ API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited
+ ```
+
+ The CLI reports this as a `result` event with `is_error: true` and
+ `api_error_status: 429`, and the HTTP response carries `x-should-retry: true`.
+ This is a transient throttle that clears on its own — distinct from an account
+ usage/quota limit (the message literally says "not your usage limit", and there
+ is no reset time to wait for).
+
+ Root cause: the error matched neither `classifyRetryableError` (no pattern for
+ the 429 throttle wording) nor `isUsageLimitError` (correctly, since it is not a
+ quota limit), so it fell through to a hard failure with exit code 1 and **no
+ auto-resume**, unlike every other transient class (overload 500/529, 503,
+ internal server error, request timeout, socket drops).
+
+ Fix: `classifyRetryableError` (in `src/tool-retry.lib.mjs`, the shared classifier
+ used by every tool wrapper — claude, codex, gemini, opencode, qwen, agent) now
+ recognises this throttle and marks it retryable (`isCapacity: false`, so no model
+ switch), so it retries with the session preserved (`--resume`) after a backoff.
+ `src/claude.lib.mjs` additionally detects the structured `api_error_status === 429`
+ directly (robust to wording changes) and logs a verbose diagnostic with the
+ `request_id`. The matcher is narrow so genuine account usage limits stay on the
+ usage-limit reset-time path.
+
+ Added `tests/test-issue-1924-rate-limit-retry.mjs` (18 assertions) and a full
+ case study with timeline, root-cause analysis, upstream references
+ (anthropics/claude-code#53915, #53922), and the captured logs under
+ `docs/case-studies/issue-1924`.
+
  ## 1.78.10
 
  ### Patch Changes
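The storage arithmetic behind the 1.78.12 entry can be sketched as follows. This is an illustrative model, not code from the package: `vfsFootprint` and `cowFootprint` are hypothetical helpers, the layer sizes are made-up round numbers chosen to resemble the "498 MB top layer on a ~30 GB base" case from the changelog, and the model assumes every layer only adds data.

```javascript
// Illustrative model only: hypothetical helpers, made-up layer sizes (in MB).
// Assumes every layer only adds data on top of the previous one.

// Under vfs each layer holds a full copy of the filesystem as it exists at
// that layer, so the footprint is the sum of the cumulative sizes.
function vfsFootprint(layerDeltasMb) {
  let cumulative = 0;
  let total = 0;
  for (const delta of layerDeltasMb) {
    cumulative += delta;
    total += cumulative;
  }
  return total;
}

// Under a copy-on-write driver each layer stores only its own delta.
function cowFootprint(layerDeltasMb) {
  return layerDeltasMb.reduce((sum, delta) => sum + delta, 0);
}

// Three 10 GB layers plus a 498 MB top layer.
const layers = [10000, 10000, 10000, 498];
console.log(vfsFootprint(layers)); // 90498
console.log(cowFootprint(layers)); // 30498
```

In this toy case the same image needs roughly three times the space under `vfs`, and the gap grows with the number of layers, which is consistent with the "many times the image size" wording in the changelog.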
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@link-assistant/hive-mind",
- "version": "1.78.10",
+ "version": "1.78.12",
  "description": "AI-powered issue solver and hive mind for collaborative problem solving",
  "main": "src/hive.mjs",
  "type": "module",
package/src/claude.lib.mjs CHANGED
@@ -648,6 +648,7 @@ export const executeClaudeCommand = async params => {
  let is503Error = false;
  let isInternalServerError = false;
  let isRequestTimeout = false;
+ let isRateLimitError = false; // Issue #1924: server-side 429 temporary rate limiting
  let apiMarkedNotRetryable = false;
  let resultNumTurns = 0;
  let stderrErrors = [];
@@ -977,6 +978,14 @@ export const executeClaudeCommand = async params => {
  isRequestTimeout = true;
  await log('⏱️ Detected request timeout from Claude CLI (will retry with --resume)', { verbose: true });
  }
+ // Issue #1924: Server-side temporary rate limiting (HTTP 429) — a transient
+ // throttle, not an account usage limit ("...not your usage limit..."), so retry
+ // with --resume. The message text is handled by classifyRetryableError; this also
+ // catches the structured api_error_status if the wording ever changes.
+ if (data.api_error_status === 429) {
+ isRateLimitError = true;
+ await log(`⚠️ Detected server-side rate limiting (429) from Claude CLI (will retry with --resume). request_id=${data.request_id || 'unknown'}`, { verbose: true });
+ }
  // Issue #1834: Detect corrupted extended-thinking-block 400 (un-resumable session).
  // Capture diagnostics (request id, content path) to aid debugging and upstream reports.
  if ((lastMessage.includes('thinking') || lastMessage.includes('redacted_thinking')) && lastMessage.includes('cannot be modified')) {
@@ -1174,7 +1183,7 @@ export const executeClaudeCommand = async params => {
  return await executeWithRetry();
  }
  // Issues #1331, #1353, #1472/#1475: Unified transient error retry (exponential backoff, session preservation)
- const isTransientError = isStartupTimeout || isActivityTimeout || isOverloadError || isInternalServerError || is503Error || isRequestTimeout || retryableLastError.isRetryable || (lastMessage.includes('API Error: 500') && (lastMessage.includes('Overloaded') || lastMessage.includes('Internal server error'))) || (lastMessage.includes('API Error: 529') && (lastMessage.includes('overloaded_error') || lastMessage.includes('Overloaded'))) || (lastMessage.includes('api_error') && lastMessage.includes('Overloaded')) || (lastMessage.includes('overloaded_error') && lastMessage.includes('Overloaded')) || lastMessage.includes('API Error: 503') || (lastMessage.includes('503') && (lastMessage.includes('upstream connect error') || lastMessage.includes('remote connection failure'))) || lastMessage === 'Request timed out' || lastMessage.includes('Request timed out');
+ const isTransientError = isStartupTimeout || isActivityTimeout || isOverloadError || isInternalServerError || is503Error || isRequestTimeout || isRateLimitError || retryableLastError.isRetryable || (lastMessage.includes('API Error: 500') && (lastMessage.includes('Overloaded') || lastMessage.includes('Internal server error'))) || (lastMessage.includes('API Error: 529') && (lastMessage.includes('overloaded_error') || lastMessage.includes('Overloaded'))) || (lastMessage.includes('api_error') && lastMessage.includes('Overloaded')) || (lastMessage.includes('overloaded_error') && lastMessage.includes('Overloaded')) || lastMessage.includes('API Error: 503') || (lastMessage.includes('503') && (lastMessage.includes('upstream connect error') || lastMessage.includes('remote connection failure'))) || lastMessage === 'Request timed out' || lastMessage.includes('Request timed out');
  if ((commandFailed || isTransientError) && isTransientError) {
  // Issue #1472/#1475: Startup/activity timeout → 30s–2min backoff; #1353: Request timeout → 5min–1hr; general → 2min–30min
  const isTimeoutRetry = isStartupTimeout || isActivityTimeout;
@@ -1208,7 +1217,7 @@ export const executeClaudeCommand = async params => {
  }
  if (retryCount < maxRetries) {
  const delay = Math.min(initialDelay * Math.pow(retryLimits.retryBackoffMultiplier, retryCount), maxDelay);
- const errorLabel = isStartupTimeout ? 'Stream startup timeout (Issue #1472/#1475)' : isActivityTimeout ? 'Stream activity timeout (Issue #1472)' : isRequestTimeout ? 'Request timeout' : retryableLastError.label || (isOverloadError || (lastMessage.includes('API Error: 500') && lastMessage.includes('Overloaded')) || (lastMessage.includes('API Error: 529') && lastMessage.includes('Overloaded')) ? `API overload (${lastMessage.includes('529') ? '529' : '500'})` : isInternalServerError || lastMessage.includes('Internal server error') ? 'Internal server error (500)' : '503 network error');
+ const errorLabel = isStartupTimeout ? 'Stream startup timeout (Issue #1472/#1475)' : isActivityTimeout ? 'Stream activity timeout (Issue #1472)' : isRequestTimeout ? 'Request timeout' : retryableLastError.label || (isOverloadError || (lastMessage.includes('API Error: 500') && lastMessage.includes('Overloaded')) || (lastMessage.includes('API Error: 529') && lastMessage.includes('Overloaded')) ? `API overload (${lastMessage.includes('529') ? '529' : '500'})` : isInternalServerError || lastMessage.includes('Internal server error') ? 'Internal server error (500)' : isRateLimitError ? 'Server rate limited (429)' : '503 network error');
  const notRetryableHint = apiMarkedNotRetryable ? ' (API says not retryable — will stop early if no progress)' : '';
  const delayLabel = delay >= 60000 ? `${Math.round(delay / 60000)} min` : `${Math.round(delay / 1000)}s`;
  const retryMode = isStartupTimeout ? ' (fresh start)' : ' (session preserved)';
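The capped exponential backoff in the hunk above can be tried in isolation with the sketch below. `backoffDelayMs` is a hypothetical helper that restates the `Math.min(initialDelay * Math.pow(multiplier, retryCount), maxDelay)` expression. The 2 min and 30 min bounds are taken from the "general → 2min–30min" comment in the diff, while the multiplier of 2 is an assumption, since the value of `retryLimits.retryBackoffMultiplier` is not shown. The code that picks the backoff tier is also not part of the diff, so treating the 429 throttle as a "general" retry is a guess.

```javascript
// Hypothetical helper restating the delay expression from the diff.
function backoffDelayMs(retryCount, { initialDelay, maxDelay, multiplier }) {
  return Math.min(initialDelay * Math.pow(multiplier, retryCount), maxDelay);
}

// Bounds from the "general → 2min–30min" comment; multiplier 2 is assumed.
const general = { initialDelay: 2 * 60000, maxDelay: 30 * 60000, multiplier: 2 };

for (let retry = 0; retry < 6; retry++) {
  const delay = backoffDelayMs(retry, general);
  console.log(`retry ${retry}: ${Math.round(delay / 60000)} min`);
}
// retry 0: 2 min
// retry 1: 4 min
// retry 2: 8 min
// retry 3: 16 min
// retry 4: 30 min (32 min capped at maxDelay)
// retry 5: 30 min
```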
package/src/isolation-runner.lib.mjs CHANGED
@@ -47,6 +47,12 @@ const DEFAULT_HOST_DOCKER_SOCK = '/var/run/host-docker.sock';
  // throwaway container — booting the dind image's dockerd entrypoint — purely to
  // check whether bash exists. See issue #1914.
  const DOCKER_ISOLATION_SHELL = 'sh';
+ // Free-space floor (GiB) below which the preflight warns that an impending
+ // isolation-image pull may fail with `no space left on device`. The Hive Mind
+ // isolation images are well over 30 GB extracted, so a host/nested daemon with
+ // less headroom than this cannot safely pull one. Diagnostic only — never
+ // blocks startup. See issue #1914.
+ const DOCKER_ISOLATION_LOW_DISK_GIB = 40;
 
  function normalizeProcessIds(value) {
  if (!value || typeof value !== 'object') return {};
@@ -87,12 +93,13 @@ function maybeAddMount(mounts, source, target, existsSync) {
  /**
  * Resolve the tag used for the Docker isolation image.
  *
- * Defaults to `latest`, but operators can pin it (e.g. to the exact version
- * already present on the host) via `HIVE_MIND_DOCKER_ISOLATION_IMAGE_TAG`.
- * Pinning matters for Docker-in-Docker deployments: the nested daemon starts
- * with an empty image store, so an unpinned `:latest` whose registry digest has
- * drifted from the host copy forces a fresh multi-gigabyte pull on every task.
- * A pinned tag lets a pre-seeded image be reused instead. See issue #1879.
+ * Release Docker images bake this env var from `HIVE_MIND_VERSION`, so a parent
+ * container started via `:latest` still launches child isolation containers from
+ * the same immutable release tag. Local/PR builds fall back to `latest`, and
+ * operators can override the tag explicitly when using custom images. Pinning
+ * matters for Docker-in-Docker deployments: the nested daemon starts with an
+ * empty image store, so a `:latest` digest drift from the host copy forces a
+ * fresh multi-gigabyte pull. See issue #1879.
  */
  export function resolveDockerIsolationImageTag({ env = process.env } = {}) {
  const explicit = String(env.HIVE_MIND_DOCKER_ISOLATION_IMAGE_TAG || '').trim();
@@ -618,6 +625,80 @@ export async function checkDockerImagePresent(image, verbose = false) {
  }
  }
 
+ /**
+ * Report the storage driver the (nested) Docker daemon is using.
+ *
+ * `vfs` performs NO copy-on-write — it stores a full copy of every image layer
+ * — so the multi-gigabyte Hive Mind images consume many times their real size
+ * on disk and the first isolated `docker run`/pull dies with
+ * `failed to register layer: no space left on device` (issue #1914 reopen).
+ * The preflight uses this to warn loudly when the daemon is on `vfs` instead of
+ * letting the disk silently overflow mid-task.
+ *
+ * Never throws: returns the lowercased driver name, or `null` when docker is
+ * unavailable / the daemon is unreachable.
+ *
+ * @param {boolean} [verbose] - Enable verbose logging
+ * @returns {Promise<string|null>} e.g. 'fuse-overlayfs', 'overlay2', 'vfs', or null
+ */
+ export async function checkDockerStorageDriver(verbose = false) {
+ try {
+ const result = await $({ mirror: false })`docker info --format ${'{{.Driver}}'}`;
+ const driver = (result.stdout?.toString() || '').trim().toLowerCase() || null;
+ if (verbose) console.log(`[VERBOSE] isolation-runner: docker storage driver: ${driver || '(unknown)'}`);
+ return driver;
+ } catch {
+ if (verbose) console.log('[VERBOSE] isolation-runner: docker info unavailable; storage driver unknown');
+ return null;
+ }
+ }
+
+ /**
+ * Report the free space (in GiB) on the Docker daemon's data root.
+ *
+ * The Hive Mind isolation images are multiple gigabytes; when the nested daemon
+ * has to pull one, it needs room for the extracted layers. This lets the
+ * preflight predict a `no space left on device` failure (issue #1914) instead
+ * of discovering it mid-pull. Resolves the daemon's real data root via
+ * `docker info` and falls back to `/var/lib/docker`, then reads `df -Pk`.
+ *
+ * Never throws: returns `{ availableGiB, dataRoot }`, or `null` when the
+ * information cannot be determined (no docker, no df, unparseable output).
+ *
+ * @param {boolean} [verbose] - Enable verbose logging
+ * @returns {Promise<{availableGiB: number, dataRoot: string}|null>}
+ */
+ export async function checkDockerDiskSpace(verbose = false) {
+ try {
+ let dataRoot = '/var/lib/docker';
+ try {
+ const info = await $({ mirror: false })`docker info --format ${'{{.DockerRootDir}}'}`;
+ const root = (info.stdout?.toString() || '').trim();
+ if (root) dataRoot = root;
+ } catch {
+ // Daemon unreachable: fall back to the conventional data root. If df then
+ // fails on it (e.g. the path does not exist) we return null below.
+ }
+
+ const df = await $({ mirror: false })`df -Pk ${dataRoot}`;
+ // `df -P` guarantees one logical line per filesystem (no wrapping). The last
+ // line is the data row: Filesystem 1024-blocks Used Available Capacity Mount
+ const lines = (df.stdout?.toString() || '').trim().split('\n');
+ const cols = (lines[lines.length - 1] || '').trim().split(/\s+/);
+ const availableKb = Number(cols[3]);
+ if (!Number.isFinite(availableKb)) {
+ if (verbose) console.log('[VERBOSE] isolation-runner: could not parse df output for Docker disk space');
+ return null;
+ }
+ const availableGiB = availableKb / (1024 * 1024);
+ if (verbose) console.log(`[VERBOSE] isolation-runner: Docker data root '${dataRoot}' has ${availableGiB.toFixed(1)} GiB free`);
+ return { availableGiB, dataRoot };
+ } catch {
+ if (verbose) console.log('[VERBOSE] isolation-runner: df unavailable; Docker disk space unknown');
+ return null;
+ }
+ }
+
  /**
  * Startup preflight for `--isolation docker`.
  *
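The `df -Pk` parsing step inside `checkDockerDiskSpace()` can be exercised without Docker or a shell. The sketch below pulls that step into a hypothetical pure function, `parseDfAvailableGiB`, and feeds it a made-up sample of POSIX-format `df` output; neither the helper name nor the sample device and sizes come from the package.

```javascript
// Hypothetical helper mirroring the parsing step in checkDockerDiskSpace().
// Takes the raw stdout of `df -Pk <path>` and returns free space in GiB,
// or null when the "Available" column cannot be read as a number.
function parseDfAvailableGiB(dfOutput) {
  const lines = String(dfOutput || '').trim().split('\n');
  // With -P the last line is the single data row for the filesystem.
  const cols = (lines[lines.length - 1] || '').trim().split(/\s+/);
  const availableKb = Number(cols[3]); // 4th column: Available, in 1024-byte blocks
  if (!Number.isFinite(availableKb)) return null;
  return availableKb / (1024 * 1024);
}

// Made-up sample output: 31457280 KiB available is exactly 30 GiB.
const sample = [
  'Filesystem     1024-blocks     Used Available Capacity Mounted on',
  '/dev/sda1        103079200 71303168  31457280      70% /',
].join('\n');

console.log(parseDfAvailableGiB(sample)); // 30
console.log(parseDfAvailableGiB('')); // null
```

A header-only output also yields `null`, because the fourth column is then the word `Available` rather than a number.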
@@ -637,42 +718,78 @@ export async function checkDockerImagePresent(image, verbose = false) {
  * blocks startup — a misconfigured passthrough should degrade to a slow first
  * task, not a dead bot.
  *
+ * It also surfaces the two root causes of the issue #1914 reopen
+ * (`failed to register layer: no space left on device`): a non-copy-on-write
+ * storage driver (`vfs`, which copies every layer in full) and a Docker data
+ * root with too little free space to hold the >30 GB image. Both are reported
+ * as loud, actionable warnings so the disk overflow is self-diagnosing at
+ * startup instead of surfacing mid-task.
+ *
  * @param {Object} [options]
  * @param {Object} [options.env] - Environment (defaults to process.env)
  * @param {Function} [options.existsSync] - fs.existsSync (injectable for tests)
  * @param {boolean} [options.verbose] - Enable verbose logging
  * @param {Object} [options.logger] - Logger with .log/.warn (defaults to console)
  * @param {Function} [options.checkImagePresent] - Image-presence probe (injectable for tests)
- * @returns {Promise<{image: string, sock: string, socketMounted: boolean, imagePresent: boolean, isDind: boolean, ok: boolean, warnings: string[]}>}
+ * @param {Function} [options.checkStorageDriver] - Storage-driver probe (injectable for tests)
+ * @param {Function} [options.checkDiskSpace] - Disk-space probe (injectable for tests)
+ * @returns {Promise<{image: string, sock: string, socketMounted: boolean, imagePresent: boolean, isDind: boolean, storageDriver: (string|null), storageDriverOk: boolean, diskAvailableGiB: (number|null), ok: boolean, warnings: string[]}>}
  */
  export async function preflightDockerIsolation(options = {}) {
- const { env = process.env, existsSync = fs.existsSync, verbose = false, logger = console, checkImagePresent = checkDockerImagePresent } = options;
+ const { env = process.env, existsSync = fs.existsSync, verbose = false, logger = console, checkImagePresent = checkDockerImagePresent, checkStorageDriver = checkDockerStorageDriver, checkDiskSpace = checkDockerDiskSpace } = options;
 
  const image = getDockerIsolationImage({ env });
  const sock = resolveHostDockerSock({ env });
  const isDind = shouldRunPrivilegedDockerIsolation(image, env);
  const socketMounted = Boolean(existsSync(sock));
  const imagePresent = Boolean(await checkImagePresent(image, verbose));
-
- const result = { image, sock, socketMounted, imagePresent, isDind, ok: imagePresent, warnings: [] };
+ const storageDriver = await checkStorageDriver(verbose);
+ const disk = await checkDiskSpace(verbose);
+ const diskAvailableGiB = disk && Number.isFinite(disk.availableGiB) ? disk.availableGiB : null;
+ // Unknown driver (probe returned null) is treated as ok — we only flag the
+ // one driver known to overflow the disk, never block on missing information.
+ const storageDriverOk = storageDriver !== 'vfs';
+
+ const result = { image, sock, socketMounted, imagePresent, isDind, storageDriver, storageDriverOk, diskAvailableGiB, ok: imagePresent, warnings: [] };
  const info = typeof logger.log === 'function' ? logger.log.bind(logger) : () => {};
  const warn = typeof logger.warn === 'function' ? logger.warn.bind(logger) : info;
 
- if (imagePresent) {
- info(`✅ Docker isolation image '${image}' is already present locally — isolated tasks reuse it (no multi-GB pull). See issue #1914.`);
- return result;
+ const preload = `node scripts/preload-dind-isolation-image.mjs --image ${image}`;
+
+ // Root Cause A of the issue #1914 reopen: a non-copy-on-write storage driver.
+ // `vfs` stores a full copy of every image layer, so the multi-GB images
+ // consume many times their size on disk and any layer write (pull, run,
+ // commit) can fail with `failed to register layer: no space left on device`.
+ // This is dangerous even when the image is already present — a task that
+ // commits or pulls more layers still overflows — so we warn independent of
+ // image presence.
+ if (storageDriver === 'vfs') {
+ result.warnings.push(`The Docker daemon backing '--isolation docker' is using the 'vfs' storage driver, which performs NO copy-on-write: ` + `it stores a full copy of every image layer, so the multi-GB Hive Mind images consume many times their size on disk and isolated tasks can fail with 'failed to register layer: no space left on device' (issue #1914). ` + `Switch to a copy-on-write driver: rebuild/redeploy with the current Dockerfile.dind (it defaults to 'fuse-overlayfs'), or for an already-running container add '-e DIND_STORAGE_DRIVER=fuse-overlayfs' to the bot container's 'docker run' and recreate it.`);
  }
 
- // Image absent: the first isolated task will pull the full image. Explain the
- // most likely cause and the exact fix instead of letting the operator first
- // discover it as a surprise multi-gigabyte download mid-task.
- const preload = `node scripts/preload-dind-isolation-image.mjs --image ${image}`;
- if (isDind && !socketMounted) {
- result.warnings.push(`Docker isolation image '${image}' is NOT in the nested Docker daemon and the host Docker socket is not mounted at ${sock}. ` + `box host-image passthrough cannot seed the nested daemon, so the FIRST isolated task will pull the full image (the Hive Mind images are multiple GB). ` + `Fix the deployment: add '-v /var/run/docker.sock:${sock}:ro' and '-e DIND_HOST_PASSTHROUGH_IMAGES="konard/hive-mind konard/hive-mind-dind"' to the bot container's 'docker run', or seed it now with: ${preload}`);
- } else if (isDind && socketMounted) {
- result.warnings.push(`Docker isolation image '${image}' is NOT in the nested Docker daemon even though the host Docker socket is mounted at ${sock}. ` + `box host-image passthrough may have skipped it (check DIND_HOST_PASSTHROUGH mode, the DIND_HOST_PASSTHROUGH_IMAGES allowlist, and that the host actually has '${image}' with a registry digest). ` + `The first isolated task will pull the full image. Seed it now with: ${preload}`);
- } else {
- result.warnings.push(`Docker isolation image '${image}' is not present locally; the first isolated task will pull it. ` + `If this host already has it under a different tag, pin HIVE_MIND_DOCKER_ISOLATION_IMAGE_TAG, or seed it with: ${preload}`);
+ if (!imagePresent) {
+ // Image absent: the first isolated task will pull the full image. Explain
+ // the most likely cause and the exact fix instead of letting the operator
+ // first discover it as a surprise multi-gigabyte download mid-task.
+ if (isDind && !socketMounted) {
+ result.warnings.push(`Docker isolation image '${image}' is NOT in the nested Docker daemon and the host Docker socket is not mounted at ${sock}. ` + `box host-image passthrough cannot seed the nested daemon, so the FIRST isolated task will pull the full image (the Hive Mind images are multiple GB). ` + `Fix the deployment: add '-v /var/run/docker.sock:${sock}:ro' and '-e DIND_HOST_PASSTHROUGH_IMAGES="konard/hive-mind konard/hive-mind-dind"' to the bot container's 'docker run', or seed it now with: ${preload}`);
+ } else if (isDind && socketMounted) {
+ result.warnings.push(`Docker isolation image '${image}' is NOT in the nested Docker daemon even though the host Docker socket is mounted at ${sock}. ` + `box host-image passthrough may have skipped it (check DIND_HOST_PASSTHROUGH mode, the DIND_HOST_PASSTHROUGH_IMAGES allowlist, and that the host actually has '${image}' with a registry digest). ` + `The first isolated task will pull the full image. Seed it now with: ${preload}`);
+ } else {
+ result.warnings.push(`Docker isolation image '${image}' is not present locally; the first isolated task will pull it. ` + `If this host already has it under a different tag, pin HIVE_MIND_DOCKER_ISOLATION_IMAGE_TAG, or seed it with: ${preload}`);
+ }
+
+ // Root Cause B of the issue #1914 reopen: too little disk for the pull. The
+ // image is well over 30 GB extracted; predict the `no space left on device`
+ // failure here rather than hitting it mid-pull.
+ if (diskAvailableGiB != null && diskAvailableGiB < DOCKER_ISOLATION_LOW_DISK_GIB) {
+ const root = disk?.dataRoot || 'the Docker data root';
+ result.warnings.push(`Only ~${diskAvailableGiB.toFixed(0)} GiB free on ${root} and the isolation image '${image}' is not present yet. ` + `The Hive Mind isolation image is well over 30 GB extracted, so the first isolated task's pull may fail with 'no space left on device' (issue #1914). ` + `Seed it via host passthrough (mount the host docker socket) or with '${preload}', and free space on the Docker data root.`);
+ }
+ }
+
+ if (imagePresent) {
+ info(`✅ Docker isolation image '${image}' is already present locally — isolated tasks reuse it (no multi-GB pull). See issue #1914.`);
  }
  for (const w of result.warnings) warn(`⚠️ ${w}`);
  return result;
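The branching in the new `preflightDockerIsolation()` body is easier to check when reduced to its conditions. The sketch below is a simplified restatement, not the package's code: `preflightWarningCodes` is a hypothetical function that returns short codes instead of the full warning strings and folds the three image-absent messages into one code. It shows that the `vfs` warning fires regardless of image presence, while the low-disk warning is only reachable when the image is absent.

```javascript
// Simplified, hypothetical restatement of the warning conditions in
// preflightDockerIsolation(); returns short codes, not the real messages.
const LOW_DISK_GIB = 40; // same value as DOCKER_ISOLATION_LOW_DISK_GIB in the diff

function preflightWarningCodes({ storageDriver, imagePresent, diskAvailableGiB }) {
  const codes = [];
  // Checked first and independently of image presence.
  if (storageDriver === 'vfs') codes.push('vfs-driver');
  if (!imagePresent) {
    codes.push('image-absent');
    // Unknown free space (null) never triggers a warning.
    if (diskAvailableGiB != null && diskAvailableGiB < LOW_DISK_GIB) codes.push('low-disk');
  }
  return codes;
}

console.log(preflightWarningCodes({ storageDriver: 'vfs', imagePresent: true, diskAvailableGiB: 12 }));
// [ 'vfs-driver' ]
console.log(preflightWarningCodes({ storageDriver: 'fuse-overlayfs', imagePresent: false, diskAvailableGiB: 12 }));
// [ 'image-absent', 'low-disk' ]
console.log(preflightWarningCodes({ storageDriver: null, imagePresent: false, diskAvailableGiB: null }));
// [ 'image-absent' ]
```

The first call illustrates a consequence of the nesting in the diff: with the image already present, 12 GiB of free space produces no low-disk warning.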
package/src/tool-retry.lib.mjs CHANGED
@@ -72,6 +72,22 @@ export const classifyRetryableError = value => {
  return { message, isRetryable: false, isCapacity: false, requiresFreshSession: true, label: 'Corrupted thinking blocks (un-resumable session)' };
  }
 
+ // Issue #1924: Server-side temporary rate limiting (HTTP 429), distinct from an
+ // account usage/quota limit. The Claude CLI surfaces this as a synthetic
+ // assistant/result message and an api_error_status of 429:
+ // "API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited"
+ // The response carries `x-should-retry: true` and the stream emits a
+ // `rate_limit_event` with `status: "rejected"`. Because the message explicitly
+ // says "not your usage limit", it is NOT a usage-limit reset-time situation and
+ // must NOT be routed through detectUsageLimit() (there is no reset time to wait
+ // for). It is a transient throttle that clears on its own, so it is safe to
+ // retry with the session preserved (--resume) after a backoff. Switching models
+ // does not help (the throttle is request-rate, not model capacity), so
+ // isCapacity is false.
+ if (lower.includes('temporarily limiting requests') || (lower.includes('rate limited') && lower.includes('not your usage limit')) || (lower.includes('rate_limit') && lower.includes('429'))) {
+ return { message, isRetryable: true, isCapacity: false, label: 'Server rate limited (429)' };
+ }
+
  if (lower.includes('api error: 503') || (lower.includes('503') && (lower.includes('upstream connect error') || lower.includes('remote connection failure')))) {
  return { message, isRetryable: true, isCapacity: false, label: '503 network error' };
  }
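The three-clause matcher added to `classifyRetryableError` can be probed on its own with the sketch below. `looksLikeServerRateLimit` is a hypothetical standalone copy of the condition. It assumes `lower` is simply the lowercased message text, since the line that derives it is not shown in the diff. The first sample string is the error quoted in the changelog; the other two are made up for illustration.

```javascript
// Hypothetical standalone copy of the matcher condition from the hunk above.
// Assumes `lower` in the original is the lowercased error message.
function looksLikeServerRateLimit(message) {
  const lower = String(message || '').toLowerCase();
  return (
    lower.includes('temporarily limiting requests') ||
    (lower.includes('rate limited') && lower.includes('not your usage limit')) ||
    (lower.includes('rate_limit') && lower.includes('429'))
  );
}

// Error text quoted in the changelog: matched by the first two clauses.
console.log(looksLikeServerRateLimit('API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited')); // true

// Made-up strings: one unrelated error, one that only the third clause catches.
console.log(looksLikeServerRateLimit('API Error: 503 upstream connect error')); // false
console.log(looksLikeServerRateLimit('rate_limit_error (429)')); // true
```

The third clause is the loosest of the three: any message containing both `rate_limit` and `429` matches. Whether a genuine usage-limit message could ever satisfy it cannot be determined from the diff alone.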