@link-assistant/hive-mind 1.78.10 → 1.78.12
- package/CHANGELOG.md +91 -0
- package/package.json +1 -1
- package/src/claude.lib.mjs +11 -2
- package/src/isolation-runner.lib.mjs +140 -23
- package/src/tool-retry.lib.mjs +16 -0
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,96 @@
 # @link-assistant/hive-mind
 
+## 1.78.12
+
+### Patch Changes
+
+- 5f60c04: fix(isolation): default nested Docker daemon to fuse-overlayfs so multi-GB images fit on disk + add storage-driver/disk preflight diagnostics (#1914)
+
+  `--isolation docker` was reopened after PR #1915: native Docker isolation and
+  host-image passthrough now work, but the first isolated task on the >30 GB
+  `konard/hive-mind-dind` image still died with:
+
+  ```
+  failed to register layer: no space left on device
+  ```
+
+  even though most layers reported `Already exists` (the daemon was correctly
+  seeded — passthrough is working). The failure was during layer **registration**,
+  not download.
+
+  **Root cause (in this repo).** `Dockerfile.dind` baked `ENV
+  DIND_STORAGE_DRIVER="vfs"` (commit 44d2c29e). `vfs` performs **no copy-on-write**:
+  it materializes a full, independent copy of the entire filesystem for _every_
+  layer, so a multi-GB image's on-disk footprint becomes the _sum_ of all
+  cumulative layer sizes — many times the image size — and overflows the disk.
+  Worse, pinning the env var **defeated box-dind's storage-driver auto-detection**
+  (`overlay2 → fuse-overlayfs → vfs`, with graceful fallback): box would otherwise
+  have picked a copy-on-write driver here. `/dev/fuse` is present (the dind
+  container runs `--privileged`), the `fuse-overlayfs` binary ships in box-dind,
+  and `overlay` is in `/proc/filesystems` — so copy-on-write was available the
+  whole time but was being bypassed by the `vfs` pin.
+
+  **Fix.** `Dockerfile.dind` now pins `ENV DIND_STORAGE_DRIVER="fuse-overlayfs"` — a
+  copy-on-write driver that also works overlay-on-overlay (the compatibility reason
+  `vfs` was originally chosen; `overlay2` can fail on the overlay-backed hosts our
+  deploys run on). Under `fuse-overlayfs`, registering a 498 MB top layer on a
+  ~30 GB base costs ~498 MB instead of ~30 GB, so the image fits. Empirically
+  verified in the box-dind environment (`docs/case-studies/issue-1914/data/fuse-overlayfs-capability-proof.log`).
+
+  **Self-diagnosing preflight.** `src/isolation-runner.lib.mjs` gained two probes —
+  `checkDockerStorageDriver()` and `checkDockerDiskSpace()` — wired into
+  `preflightDockerIsolation()`. Before running an isolated task it now warns, with
+  an actionable remedy, when the nested daemon is on `vfs` (even if the image is
+  already present) or when free space at the Docker data root is below 40 GiB, so
+  the next operator hitting this gets a clear breadcrumb instead of a cryptic
+  `no space left on device`. Both probes are best-effort and never throw.
+
+  Added `tests/test-issue-1914-storage-driver-diagnostics.mjs` (34 assertions),
+  extended `tests/test-issue-1914-preflight-passthrough.mjs` and
+  `tests/test-docker-dind-variant.mjs`, refreshed `docs/DOCKER*.md`, and expanded
+  the `docs/case-studies/issue-1914` case study with the reopen timeline, refined
+  root-cause analysis, captured evidence, and an upstream observability request
+  (link-foundation/box#104: warn when the nested daemon lands on `vfs`).
+
+## 1.78.11
+
+### Patch Changes
+
+- 24fb17e: fix(retry): auto-resume on server-side 429 "Server is temporarily limiting requests" rate-limit errors (#1924)
+
+  A long-running solve session (177 turns, ~72 min) was thrown away when the Claude
+  CLI surfaced a **server-side temporary rate limit**:
+
+  ```
+  API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited
+  ```
+
+  The CLI reports this as a `result` event with `is_error: true` and
+  `api_error_status: 429`, and the HTTP response carries `x-should-retry: true`.
+  This is a transient throttle that clears on its own — distinct from an account
+  usage/quota limit (the message literally says "not your usage limit", and there
+  is no reset time to wait for).
+
+  Root cause: the error matched neither `classifyRetryableError` (no pattern for
+  the 429 throttle wording) nor `isUsageLimitError` (correctly, since it is not a
+  quota limit), so it fell through to a hard failure with exit code 1 and **no
+  auto-resume**, unlike every other transient class (overload 500/529, 503,
+  internal server error, request timeout, socket drops).
+
+  Fix: `classifyRetryableError` (in `src/tool-retry.lib.mjs`, the shared classifier
+  used by every tool wrapper — claude, codex, gemini, opencode, qwen, agent) now
+  recognises this throttle and marks it retryable (`isCapacity: false`, so no model
+  switch), so it retries with the session preserved (`--resume`) after a backoff.
+  `src/claude.lib.mjs` additionally detects the structured `api_error_status === 429`
+  directly (robust to wording changes) and logs a verbose diagnostic with the
+  `request_id`. The matcher is narrow so genuine account usage limits stay on the
+  usage-limit reset-time path.
+
+  Added `tests/test-issue-1924-rate-limit-retry.mjs` (18 assertions) and a full
+  case study with timeline, root-cause analysis, upstream references
+  (anthropics/claude-code#53915, #53922), and the captured logs under
+  `docs/case-studies/issue-1924`.
+
 ## 1.78.10
 
 ### Patch Changes
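The disk-footprint argument in the 1.78.12 notes can be checked with a little arithmetic. The sketch below is illustrative only: the layer sizes are hypothetical (chosen so that a 498 MB top layer sits on a roughly 30 GB base, as in the changelog), the function names are not from the package, and it assumes each layer only adds files.

```javascript
// Hypothetical per-layer deltas in MB; only the 498 MB top layer and the
// ~30 GB base total come from the changelog.
const layerDeltasMb = [8000, 12000, 10000, 498];

// Copy-on-write driver: each layer stores only its own delta.
function cowFootprintMb(deltas) {
  return deltas.reduce((sum, delta) => sum + delta, 0);
}

// vfs: each layer is a full copy of everything below it plus its delta,
// so the footprint is the sum of the cumulative sizes.
function vfsFootprintMb(deltas) {
  let cumulative = 0;
  let total = 0;
  for (const delta of deltas) {
    cumulative += delta;
    total += cumulative;
  }
  return total;
}

console.log(`copy-on-write: ${cowFootprintMb(layerDeltasMb)} MB`);
console.log(`vfs:           ${vfsFootprintMb(layerDeltasMb)} MB`);
```

With these made-up sizes the copy-on-write footprint is 30498 MB and the `vfs` footprint is 88498 MB; registering the last layer alone costs 498 MB in the first case and 30498 MB in the second.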
package/package.json
CHANGED
package/src/claude.lib.mjs
CHANGED
@@ -648,6 +648,7 @@ export const executeClaudeCommand = async params => {
 let is503Error = false;
 let isInternalServerError = false;
 let isRequestTimeout = false;
+let isRateLimitError = false; // Issue #1924: server-side 429 temporary rate limiting
 let apiMarkedNotRetryable = false;
 let resultNumTurns = 0;
 let stderrErrors = [];
@@ -977,6 +978,14 @@ export const executeClaudeCommand = async params => {
   isRequestTimeout = true;
   await log('⏱️ Detected request timeout from Claude CLI (will retry with --resume)', { verbose: true });
 }
+// Issue #1924: Server-side temporary rate limiting (HTTP 429) — a transient
+// throttle, not an account usage limit ("...not your usage limit..."), so retry
+// with --resume. The message text is handled by classifyRetryableError; this also
+// catches the structured api_error_status if the wording ever changes.
+if (data.api_error_status === 429) {
+  isRateLimitError = true;
+  await log(`⚠️ Detected server-side rate limiting (429) from Claude CLI (will retry with --resume). request_id=${data.request_id || 'unknown'}`, { verbose: true });
+}
 // Issue #1834: Detect corrupted extended-thinking-block 400 (un-resumable session).
 // Capture diagnostics (request id, content path) to aid debugging and upstream reports.
 if ((lastMessage.includes('thinking') || lastMessage.includes('redacted_thinking')) && lastMessage.includes('cannot be modified')) {
@@ -1174,7 +1183,7 @@ export const executeClaudeCommand = async params => {
   return await executeWithRetry();
 }
 // Issues #1331, #1353, #1472/#1475: Unified transient error retry (exponential backoff, session preservation)
-const isTransientError = isStartupTimeout || isActivityTimeout || isOverloadError || isInternalServerError || is503Error || isRequestTimeout || retryableLastError.isRetryable || (lastMessage.includes('API Error: 500') && (lastMessage.includes('Overloaded') || lastMessage.includes('Internal server error'))) || (lastMessage.includes('API Error: 529') && (lastMessage.includes('overloaded_error') || lastMessage.includes('Overloaded'))) || (lastMessage.includes('api_error') && lastMessage.includes('Overloaded')) || (lastMessage.includes('overloaded_error') && lastMessage.includes('Overloaded')) || lastMessage.includes('API Error: 503') || (lastMessage.includes('503') && (lastMessage.includes('upstream connect error') || lastMessage.includes('remote connection failure'))) || lastMessage === 'Request timed out' || lastMessage.includes('Request timed out');
+const isTransientError = isStartupTimeout || isActivityTimeout || isOverloadError || isInternalServerError || is503Error || isRequestTimeout || isRateLimitError || retryableLastError.isRetryable || (lastMessage.includes('API Error: 500') && (lastMessage.includes('Overloaded') || lastMessage.includes('Internal server error'))) || (lastMessage.includes('API Error: 529') && (lastMessage.includes('overloaded_error') || lastMessage.includes('Overloaded'))) || (lastMessage.includes('api_error') && lastMessage.includes('Overloaded')) || (lastMessage.includes('overloaded_error') && lastMessage.includes('Overloaded')) || lastMessage.includes('API Error: 503') || (lastMessage.includes('503') && (lastMessage.includes('upstream connect error') || lastMessage.includes('remote connection failure'))) || lastMessage === 'Request timed out' || lastMessage.includes('Request timed out');
 if ((commandFailed || isTransientError) && isTransientError) {
   // Issue #1472/#1475: Startup/activity timeout → 30s–2min backoff; #1353: Request timeout → 5min–1hr; general → 2min–30min
   const isTimeoutRetry = isStartupTimeout || isActivityTimeout;
@@ -1208,7 +1217,7 @@ export const executeClaudeCommand = async params => {
 }
 if (retryCount < maxRetries) {
   const delay = Math.min(initialDelay * Math.pow(retryLimits.retryBackoffMultiplier, retryCount), maxDelay);
-  const errorLabel = isStartupTimeout ? 'Stream startup timeout (Issue #1472/#1475)' : isActivityTimeout ? 'Stream activity timeout (Issue #1472)' : isRequestTimeout ? 'Request timeout' : retryableLastError.label || (isOverloadError || (lastMessage.includes('API Error: 500') && lastMessage.includes('Overloaded')) || (lastMessage.includes('API Error: 529') && lastMessage.includes('Overloaded')) ? `API overload (${lastMessage.includes('529') ? '529' : '500'})` : isInternalServerError || lastMessage.includes('Internal server error') ? 'Internal server error (500)' : '503 network error');
+  const errorLabel = isStartupTimeout ? 'Stream startup timeout (Issue #1472/#1475)' : isActivityTimeout ? 'Stream activity timeout (Issue #1472)' : isRequestTimeout ? 'Request timeout' : retryableLastError.label || (isOverloadError || (lastMessage.includes('API Error: 500') && lastMessage.includes('Overloaded')) || (lastMessage.includes('API Error: 529') && lastMessage.includes('Overloaded')) ? `API overload (${lastMessage.includes('529') ? '529' : '500'})` : isInternalServerError || lastMessage.includes('Internal server error') ? 'Internal server error (500)' : isRateLimitError ? 'Server rate limited (429)' : '503 network error');
   const notRetryableHint = apiMarkedNotRetryable ? ' (API says not retryable — will stop early if no progress)' : '';
   const delayLabel = delay >= 60000 ? `${Math.round(delay / 60000)} min` : `${Math.round(delay / 1000)}s`;
   const retryMode = isStartupTimeout ? ' (fresh start)' : ' (session preserved)';
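The `delay` and `delayLabel` expressions in the last hunk amount to a capped exponential backoff. The sketch below reproduces them in isolation. The 2 min floor and 30 min cap follow the "general → 2min–30min" comment in the diff; the multiplier of 2 and the names `limits`, `backoffDelay` and `formatDelay` are assumptions for illustration, not the package's configuration.

```javascript
// Assumed limits: floor and cap follow the "general" range named in the diff
// comment; the multiplier is a guess, not the package's configured value.
const limits = { initialDelayMs: 2 * 60 * 1000, maxDelayMs: 30 * 60 * 1000, multiplier: 2 };

// Same shape as the `delay` expression in the hunk: exponential growth, capped.
function backoffDelay(retryCount, { initialDelayMs, maxDelayMs, multiplier }) {
  return Math.min(initialDelayMs * Math.pow(multiplier, retryCount), maxDelayMs);
}

// Same shape as `delayLabel`: minutes from one minute upward, seconds below that.
function formatDelay(delayMs) {
  return delayMs >= 60000 ? `${Math.round(delayMs / 60000)} min` : `${Math.round(delayMs / 1000)}s`;
}

const schedule = [0, 1, 2, 3, 4, 5].map(n => formatDelay(backoffDelay(n, limits)));
console.log(schedule.join(', '));
```

Under these assumed limits the fifth retry would be 32 min uncapped, so it and every later retry wait 30 min.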
package/src/isolation-runner.lib.mjs
CHANGED
@@ -47,6 +47,12 @@ const DEFAULT_HOST_DOCKER_SOCK = '/var/run/host-docker.sock';
 // throwaway container — booting the dind image's dockerd entrypoint — purely to
 // check whether bash exists. See issue #1914.
 const DOCKER_ISOLATION_SHELL = 'sh';
+// Free-space floor (GiB) below which the preflight warns that an impending
+// isolation-image pull may fail with `no space left on device`. The Hive Mind
+// isolation images are well over 30 GB extracted, so a host/nested daemon with
+// less headroom than this cannot safely pull one. Diagnostic only — never
+// blocks startup. See issue #1914.
+const DOCKER_ISOLATION_LOW_DISK_GIB = 40;
 
 function normalizeProcessIds(value) {
   if (!value || typeof value !== 'object') return {};
@@ -87,12 +93,13 @@ function maybeAddMount(mounts, source, target, existsSync) {
 /**
  * Resolve the tag used for the Docker isolation image.
  *
- *
- *
- *
- *
- *
- *
+ * Release Docker images bake this env var from `HIVE_MIND_VERSION`, so a parent
+ * container started via `:latest` still launches child isolation containers from
+ * the same immutable release tag. Local/PR builds fall back to `latest`, and
+ * operators can override the tag explicitly when using custom images. Pinning
+ * matters for Docker-in-Docker deployments: the nested daemon starts with an
+ * empty image store, so a `:latest` digest drift from the host copy forces a
+ * fresh multi-gigabyte pull. See issue #1879.
  */
 export function resolveDockerIsolationImageTag({ env = process.env } = {}) {
   const explicit = String(env.HIVE_MIND_DOCKER_ISOLATION_IMAGE_TAG || '').trim();
@@ -618,6 +625,80 @@ export async function checkDockerImagePresent(image, verbose = false) {
   }
 }
 
+/**
+ * Report the storage driver the (nested) Docker daemon is using.
+ *
+ * `vfs` performs NO copy-on-write — it stores a full copy of every image layer
+ * — so the multi-gigabyte Hive Mind images consume many times their real size
+ * on disk and the first isolated `docker run`/pull dies with
+ * `failed to register layer: no space left on device` (issue #1914 reopen).
+ * The preflight uses this to warn loudly when the daemon is on `vfs` instead of
+ * letting the disk silently overflow mid-task.
+ *
+ * Never throws: returns the lowercased driver name, or `null` when docker is
+ * unavailable / the daemon is unreachable.
+ *
+ * @param {boolean} [verbose] - Enable verbose logging
+ * @returns {Promise<string|null>} e.g. 'fuse-overlayfs', 'overlay2', 'vfs', or null
+ */
+export async function checkDockerStorageDriver(verbose = false) {
+  try {
+    const result = await $({ mirror: false })`docker info --format ${'{{.Driver}}'}`;
+    const driver = (result.stdout?.toString() || '').trim().toLowerCase() || null;
+    if (verbose) console.log(`[VERBOSE] isolation-runner: docker storage driver: ${driver || '(unknown)'}`);
+    return driver;
+  } catch {
+    if (verbose) console.log('[VERBOSE] isolation-runner: docker info unavailable; storage driver unknown');
+    return null;
+  }
+}
+
+/**
+ * Report the free space (in GiB) on the Docker daemon's data root.
+ *
+ * The Hive Mind isolation images are multiple gigabytes; when the nested daemon
+ * has to pull one, it needs room for the extracted layers. This lets the
+ * preflight predict a `no space left on device` failure (issue #1914) instead
+ * of discovering it mid-pull. Resolves the daemon's real data root via
+ * `docker info` and falls back to `/var/lib/docker`, then reads `df -Pk`.
+ *
+ * Never throws: returns `{ availableGiB, dataRoot }`, or `null` when the
+ * information cannot be determined (no docker, no df, unparseable output).
+ *
+ * @param {boolean} [verbose] - Enable verbose logging
+ * @returns {Promise<{availableGiB: number, dataRoot: string}|null>}
+ */
+export async function checkDockerDiskSpace(verbose = false) {
+  try {
+    let dataRoot = '/var/lib/docker';
+    try {
+      const info = await $({ mirror: false })`docker info --format ${'{{.DockerRootDir}}'}`;
+      const root = (info.stdout?.toString() || '').trim();
+      if (root) dataRoot = root;
+    } catch {
+      // Daemon unreachable: fall back to the conventional data root. If df then
+      // fails on it (e.g. the path does not exist) we return null below.
+    }
+
+    const df = await $({ mirror: false })`df -Pk ${dataRoot}`;
+    // `df -P` guarantees one logical line per filesystem (no wrapping). The last
+    // line is the data row: Filesystem 1024-blocks Used Available Capacity Mount
+    const lines = (df.stdout?.toString() || '').trim().split('\n');
+    const cols = (lines[lines.length - 1] || '').trim().split(/\s+/);
+    const availableKb = Number(cols[3]);
+    if (!Number.isFinite(availableKb)) {
+      if (verbose) console.log('[VERBOSE] isolation-runner: could not parse df output for Docker disk space');
+      return null;
+    }
+    const availableGiB = availableKb / (1024 * 1024);
+    if (verbose) console.log(`[VERBOSE] isolation-runner: Docker data root '${dataRoot}' has ${availableGiB.toFixed(1)} GiB free`);
+    return { availableGiB, dataRoot };
+  } catch {
+    if (verbose) console.log('[VERBOSE] isolation-runner: df unavailable; Docker disk space unknown');
+    return null;
+  }
+}
+
 /**
  * Startup preflight for `--isolation docker`.
  *
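The `df -Pk` parsing inside `checkDockerDiskSpace()` can be tried without a Docker daemon. The sketch below lifts only the parsing step into a pure function and feeds it a fabricated `df` output. The helper name `parseDfAvailableGiB` and the sample row are assumptions for illustration, not part of the package.

```javascript
// Hypothetical helper mirroring the parsing step of checkDockerDiskSpace():
// take the last line of `df -Pk` output and read its 4th column (Available, in KiB).
function parseDfAvailableGiB(dfOutput) {
  const lines = String(dfOutput || '').trim().split('\n');
  const cols = (lines[lines.length - 1] || '').trim().split(/\s+/);
  const availableKb = Number(cols[3]);
  // A header-only or empty output has no numeric column, so report "unknown".
  return Number.isFinite(availableKb) ? availableKb / (1024 * 1024) : null;
}

// Fabricated sample in the POSIX `df -P` layout; the numbers are made up.
const sample = [
  'Filesystem     1024-blocks     Used Available Capacity Mounted on',
  '/dev/sda1        103079200 71303168  31457280      70% /var/lib/docker',
].join('\n');

const freeGiB = parseDfAvailableGiB(sample);
console.log(freeGiB, freeGiB < 40 ? 'below the 40 GiB floor' : 'enough headroom');
```

Returning `null` rather than throwing matches the probe's best-effort contract: a preflight that cannot measure the disk says nothing instead of guessing.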
@@ -637,42 +718,78 @@ export async function checkDockerImagePresent(image, verbose = false) {
  * blocks startup — a misconfigured passthrough should degrade to a slow first
  * task, not a dead bot.
  *
+ * It also surfaces the two root causes of the issue #1914 reopen
+ * (`failed to register layer: no space left on device`): a non-copy-on-write
+ * storage driver (`vfs`, which copies every layer in full) and a Docker data
+ * root with too little free space to hold the >30 GB image. Both are reported
+ * as loud, actionable warnings so the disk overflow is self-diagnosing at
+ * startup instead of surfacing mid-task.
+ *
  * @param {Object} [options]
  * @param {Object} [options.env] - Environment (defaults to process.env)
  * @param {Function} [options.existsSync] - fs.existsSync (injectable for tests)
  * @param {boolean} [options.verbose] - Enable verbose logging
  * @param {Object} [options.logger] - Logger with .log/.warn (defaults to console)
  * @param {Function} [options.checkImagePresent] - Image-presence probe (injectable for tests)
- * @
+ * @param {Function} [options.checkStorageDriver] - Storage-driver probe (injectable for tests)
+ * @param {Function} [options.checkDiskSpace] - Disk-space probe (injectable for tests)
+ * @returns {Promise<{image: string, sock: string, socketMounted: boolean, imagePresent: boolean, isDind: boolean, storageDriver: (string|null), storageDriverOk: boolean, diskAvailableGiB: (number|null), ok: boolean, warnings: string[]}>}
  */
 export async function preflightDockerIsolation(options = {}) {
-  const { env = process.env, existsSync = fs.existsSync, verbose = false, logger = console, checkImagePresent = checkDockerImagePresent } = options;
+  const { env = process.env, existsSync = fs.existsSync, verbose = false, logger = console, checkImagePresent = checkDockerImagePresent, checkStorageDriver = checkDockerStorageDriver, checkDiskSpace = checkDockerDiskSpace } = options;
 
   const image = getDockerIsolationImage({ env });
   const sock = resolveHostDockerSock({ env });
   const isDind = shouldRunPrivilegedDockerIsolation(image, env);
   const socketMounted = Boolean(existsSync(sock));
   const imagePresent = Boolean(await checkImagePresent(image, verbose));
-
-  const
+  const storageDriver = await checkStorageDriver(verbose);
+  const disk = await checkDiskSpace(verbose);
+  const diskAvailableGiB = disk && Number.isFinite(disk.availableGiB) ? disk.availableGiB : null;
+  // Unknown driver (probe returned null) is treated as ok — we only flag the
+  // one driver known to overflow the disk, never block on missing information.
+  const storageDriverOk = storageDriver !== 'vfs';
+
+  const result = { image, sock, socketMounted, imagePresent, isDind, storageDriver, storageDriverOk, diskAvailableGiB, ok: imagePresent, warnings: [] };
   const info = typeof logger.log === 'function' ? logger.log.bind(logger) : () => {};
   const warn = typeof logger.warn === 'function' ? logger.warn.bind(logger) : info;
 
-
-
-
+  const preload = `node scripts/preload-dind-isolation-image.mjs --image ${image}`;
+
+  // Root Cause A of the issue #1914 reopen: a non-copy-on-write storage driver.
+  // `vfs` stores a full copy of every image layer, so the multi-GB images
+  // consume many times their size on disk and any layer write (pull, run,
+  // commit) can fail with `failed to register layer: no space left on device`.
+  // This is dangerous even when the image is already present — a task that
+  // commits or pulls more layers still overflows — so we warn independent of
+  // image presence.
+  if (storageDriver === 'vfs') {
+    result.warnings.push(`The Docker daemon backing '--isolation docker' is using the 'vfs' storage driver, which performs NO copy-on-write: ` + `it stores a full copy of every image layer, so the multi-GB Hive Mind images consume many times their size on disk and isolated tasks can fail with 'failed to register layer: no space left on device' (issue #1914). ` + `Switch to a copy-on-write driver: rebuild/redeploy with the current Dockerfile.dind (it defaults to 'fuse-overlayfs'), or for an already-running container add '-e DIND_STORAGE_DRIVER=fuse-overlayfs' to the bot container's 'docker run' and recreate it.`);
   }
 
-
-
-
-
-
-
-
-
-
-
+  if (!imagePresent) {
+    // Image absent: the first isolated task will pull the full image. Explain
+    // the most likely cause and the exact fix instead of letting the operator
+    // first discover it as a surprise multi-gigabyte download mid-task.
+    if (isDind && !socketMounted) {
+      result.warnings.push(`Docker isolation image '${image}' is NOT in the nested Docker daemon and the host Docker socket is not mounted at ${sock}. ` + `box host-image passthrough cannot seed the nested daemon, so the FIRST isolated task will pull the full image (the Hive Mind images are multiple GB). ` + `Fix the deployment: add '-v /var/run/docker.sock:${sock}:ro' and '-e DIND_HOST_PASSTHROUGH_IMAGES="konard/hive-mind konard/hive-mind-dind"' to the bot container's 'docker run', or seed it now with: ${preload}`);
+    } else if (isDind && socketMounted) {
+      result.warnings.push(`Docker isolation image '${image}' is NOT in the nested Docker daemon even though the host Docker socket is mounted at ${sock}. ` + `box host-image passthrough may have skipped it (check DIND_HOST_PASSTHROUGH mode, the DIND_HOST_PASSTHROUGH_IMAGES allowlist, and that the host actually has '${image}' with a registry digest). ` + `The first isolated task will pull the full image. Seed it now with: ${preload}`);
+    } else {
+      result.warnings.push(`Docker isolation image '${image}' is not present locally; the first isolated task will pull it. ` + `If this host already has it under a different tag, pin HIVE_MIND_DOCKER_ISOLATION_IMAGE_TAG, or seed it with: ${preload}`);
+    }
+
+    // Root Cause B of the issue #1914 reopen: too little disk for the pull. The
+    // image is well over 30 GB extracted; predict the `no space left on device`
+    // failure here rather than hitting it mid-pull.
+    if (diskAvailableGiB != null && diskAvailableGiB < DOCKER_ISOLATION_LOW_DISK_GIB) {
+      const root = disk?.dataRoot || 'the Docker data root';
+      result.warnings.push(`Only ~${diskAvailableGiB.toFixed(0)} GiB free on ${root} and the isolation image '${image}' is not present yet. ` + `The Hive Mind isolation image is well over 30 GB extracted, so the first isolated task's pull may fail with 'no space left on device' (issue #1914). ` + `Seed it via host passthrough (mount the host docker socket) or with '${preload}', and free space on the Docker data root.`);
+    }
+  }
+
+  if (imagePresent) {
+    info(`✅ Docker isolation image '${image}' is already present locally — isolated tasks reuse it (no multi-GB pull). See issue #1914.`);
   }
   for (const w of result.warnings) warn(`⚠️ ${w}`);
   return result;
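The order of the checks in `preflightDockerIsolation()` decides which warnings an operator sees. The sketch below is a simplified stand-in that keeps only that branching: it takes probe results as plain values and returns short warning tags instead of the full messages. The function name `summarizePreflight` and the tags are hypothetical; the 40 GiB floor is `DOCKER_ISOLATION_LOW_DISK_GIB` from the diff.

```javascript
const LOW_DISK_GIB = 40; // DOCKER_ISOLATION_LOW_DISK_GIB in the diff

// Hypothetical reduction of the preflight's branching to its decisions.
function summarizePreflight({ storageDriver, imagePresent, diskAvailableGiB }) {
  const warnings = [];
  // The vfs warning is raised whether or not the image is present.
  if (storageDriver === 'vfs') warnings.push('vfs-storage-driver');
  if (!imagePresent) {
    warnings.push('image-missing');
    // As the hunk reads, the low-disk warning sits inside the image-absent branch.
    if (diskAvailableGiB != null && diskAvailableGiB < LOW_DISK_GIB) warnings.push('low-disk');
  }
  // An unknown driver (null) counts as ok; only imagePresent drives `ok`.
  return { ok: imagePresent, storageDriverOk: storageDriver !== 'vfs', warnings };
}

console.log(summarizePreflight({ storageDriver: 'vfs', imagePresent: true, diskAvailableGiB: 10 }));
console.log(summarizePreflight({ storageDriver: null, imagePresent: false, diskAvailableGiB: 12 }));
```

In the first call the image is already present, so only the `vfs` warning appears even though the disk is nearly full; in the second the unknown driver is tolerated and both the missing-image and low-disk warnings fire.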
package/src/tool-retry.lib.mjs
CHANGED
@@ -72,6 +72,22 @@ export const classifyRetryableError = value => {
     return { message, isRetryable: false, isCapacity: false, requiresFreshSession: true, label: 'Corrupted thinking blocks (un-resumable session)' };
   }
 
+  // Issue #1924: Server-side temporary rate limiting (HTTP 429), distinct from an
+  // account usage/quota limit. The Claude CLI surfaces this as a synthetic
+  // assistant/result message and an api_error_status of 429:
+  // "API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited"
+  // The response carries `x-should-retry: true` and the stream emits a
+  // `rate_limit_event` with `status: "rejected"`. Because the message explicitly
+  // says "not your usage limit", it is NOT a usage-limit reset-time situation and
+  // must NOT be routed through detectUsageLimit() (there is no reset time to wait
+  // for). It is a transient throttle that clears on its own, so it is safe to
+  // retry with the session preserved (--resume) after a backoff. Switching models
+  // does not help (the throttle is request-rate, not model capacity), so
+  // isCapacity is false.
+  if (lower.includes('temporarily limiting requests') || (lower.includes('rate limited') && lower.includes('not your usage limit')) || (lower.includes('rate_limit') && lower.includes('429'))) {
+    return { message, isRetryable: true, isCapacity: false, label: 'Server rate limited (429)' };
+  }
+
   if (lower.includes('api error: 503') || (lower.includes('503') && (lower.includes('upstream connect error') || lower.includes('remote connection failure')))) {
     return { message, isRetryable: true, isCapacity: false, label: '503 network error' };
   }
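The new matcher in `classifyRetryableError` can be exercised on its own. The sketch below copies just the three-way condition into a stand-alone predicate; the name `isServerRateLimit` is hypothetical, and it assumes `lower` in the real classifier is the lower-cased error message. The usage-limit sentence in the example is invented to show a non-match and is not a captured CLI message.

```javascript
// Hypothetical stand-alone copy of the 429 throttle condition from the hunk above.
// Assumes the classifier lower-cases the message before matching.
function isServerRateLimit(message) {
  const lower = String(message || '').toLowerCase();
  return (
    lower.includes('temporarily limiting requests') ||
    (lower.includes('rate limited') && lower.includes('not your usage limit')) ||
    (lower.includes('rate_limit') && lower.includes('429'))
  );
}

// Message quoted in the changelog: should be treated as a transient throttle.
const throttle = 'API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited';
// Invented account-limit wording: should stay on the usage-limit path.
const usageLimit = 'Usage limit reached. Your limit will reset at 5pm.';

console.log(isServerRateLimit(throttle));   // true
console.log(isServerRateLimit(usageLimit)); // false
```

Requiring "not your usage limit" alongside "rate limited" is what keeps the second clause narrow: a bare "rate limited" does not match by itself.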