@link-assistant/hive-mind 2.0.0 → 2.0.2

package/CHANGELOG.md CHANGED
@@ -1,5 +1,91 @@
  # @link-assistant/hive-mind
 
+ ## 2.0.2
+
+ ### Patch Changes
+
+ - 19aea85: fix(retry): auto-resume on "Stream idle timeout - partial response received" (#1937)
+
+ A long-running solve session (391 turns, ~$34.11) had its streaming response
+ stall mid-answer. The Claude CLI surfaced it as a `result` event with
+ `is_error: true`, `subtype: "success"`, and:
+
+ ```
+ API Error: Stream idle timeout - partial response received
+ ```
+
+ Instead of retrying with the session preserved, the harness fell straight
+ through to the generic failure path and exited with code 1 after **zero
+ retries** — abandoning the whole session even though it had a valid session ID
+ and printed the exact `--resume` command needed to continue.
+
+ Root cause: the shared retry classifier `classifyRetryableError()`
+ (`src/tool-retry.lib.mjs`) had no branch for the stream-idle-timeout family, so
+ `isRetryable` was false, `isTransientError` evaluated to false, and the unified
+ exponential-backoff retry block was never entered.
+
+ This error is a transient transport-level stall (a slow/stuck server-sent-events
+ socket), not a request-content rejection — the on-disk session transcript stays
+ valid, which is why a manual `--resume` works. The fix adds one branch to
+ `classifyRetryableError()` returning
+ `{ isRetryable: true, isCapacity: false, label: 'Stream idle timeout (partial response)' }`,
+ so the existing retry block resumes the session with the same context after an
+ exponential backoff. Because the classifier is shared, this fixes the behaviour
+ for **all** tools (claude/codex/gemini/opencode/qwen/agent) at once.
+
+ Added `tests/test-issue-1937-stream-idle-timeout-retry.mjs` (17 assertions) and a
+ full case study with timeline, root-cause analysis, upstream references, and the
+ captured logs under `docs/case-studies/issue-1937`.
+
+ ## 2.0.1
+
+ ### Patch Changes
+
+ - 70e1542: fix(retry): treat 5-hour "session limit" and "weekly limit" 429s as account usage limits, not transient throttles (#1935)
+
+ A long-running solve session (588 turns, ~$70.62) hit Claude's **5-hour session
+ limit**. The Claude CLI surfaced it as a `result` event with `is_error: true`,
+ `api_error_status: 429`, and:
+
+ ```
+ You've hit your session limit · resets 4pm (UTC)
+ ```
+
+ Instead of being treated as an **account usage limit** (post a comment with the
+ reset time + wait until the exact reset moment), it was put through the transient
+ exponential-backoff retry loop:
+
+ ```
+ ⚠️ Server rate limited (429) detected. Retry 1/10 in 2 min (session preserved)...
+ Error: You've hit your session limit · resets 4pm (UTC)
+ ⚠️ Server rate limited (429) detected. Retry 2/10 in 4 min (session preserved)...
+ ```
+
+ Each retry re-hit the same limit because the quota only frees at the reset time —
+ so the harness burned ~10 futile retries and never told the user when the limit
+ resets.
+
+ Root cause (regression from #1924): `src/claude.lib.mjs` set
+ `isRateLimitError = true` for **every** structured `api_error_status === 429`,
+ without checking whether the message was an account usage limit. Claude reports
+ **both** a transient throttle ("...not your usage limit...") and account
+ session/weekly limits with `api_error_status: 429`, so the unconditional check
+ swept genuine usage limits into the transient-retry path — ahead of the
+ `detectUsageLimit()` reset-time wait, which was therefore never reached.
+
+ Fix: `src/claude.lib.mjs` now only flags a structured 429 as a transient rate
+ limit when the message is **not** a usage limit
+ (`api_error_status === 429 && !isUsageLimitError(lastMessage)`), so session/weekly
+ limits fall through to the usage-limit handler that immediately posts a comment
+ and waits until the exact reset time (auto-resuming there with
+ `--auto-continue-limit`). `src/usage-limit.lib.mjs` additionally recognises the
+ "hit your session limit" / "hit your weekly limit" phrasing as a backstop (the
+ reset-time regex already matched "resets 4pm").
+
+ Added `tests/test-issue-1935-session-limit-429.mjs` (15 assertions) and a full
+ case study with timeline, blame history (PR #1924), root-cause analysis, and the
+ captured logs under `docs/case-studies/issue-1935`.
+
  ## 2.0.0
 
  ### Major Changes
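The 2.0.1 entry says the usage-limit handler waits until the exact reset moment. As a rough illustration of that arithmetic, and not the package's actual parser in `src/usage-limit.lib.mjs`, the sketch below turns a `resets 4pm (UTC)` message into a wait duration. The function name `msUntilReset` and its regex are assumptions made for this example. It handles only the same-day hour form and returns `null` for dated forms such as `resets Jan 15, 8am (UTC)`.

```javascript
// Hypothetical helper for illustration; the real package logic may differ.
// Parses "resets 4pm (UTC)" style text and returns the number of milliseconds
// from `now` until the next occurrence of that UTC clock time.
function msUntilReset(message, now = new Date()) {
  const match = /resets\s+(\d{1,2})(?::(\d{2}))?\s*(am|pm)\s*\(UTC\)/i.exec(message);
  if (!match) return null; // no hour-only reset time found in the message

  let hour = Number(match[1]) % 12; // maps 12am to 0; 12pm becomes 12 after the shift below
  if (match[3].toLowerCase() === 'pm') hour += 12;
  const minute = match[2] ? Number(match[2]) : 0;

  const reset = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hour, minute));
  // If that clock time has already passed today, the next reset is tomorrow.
  if (reset.getTime() <= now.getTime()) reset.setUTCDate(reset.getUTCDate() + 1);
  return reset.getTime() - now.getTime();
}
```

Called at 13:30 UTC with the message quoted in the changelog, this gives a single wait of two and a half hours, in contrast to the repeated 2 min and 4 min backoff retries shown in the captured log.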
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@link-assistant/hive-mind",
- "version": "2.0.0",
+ "version": "2.0.2",
  "description": "AI-powered issue solver and hive mind for collaborative problem solving",
  "main": "src/hive.mjs",
  "type": "module",
package/src/claude.lib.mjs CHANGED
@@ -9,7 +9,7 @@ const path = (await use('path')).default;
  import { log, isENOSPC } from './lib.mjs';
  import { reportError } from './sentry.lib.mjs';
  import { timeouts, retryLimits, claudeCode, getClaudeEnv, getThinkingLevelToTokens, getTokensToThinkingLevel, supportsThinkingBudget, DEFAULT_MAX_THINKING_BUDGET, getMaxOutputTokensForModel } from './config.lib.mjs';
- import { detectUsageLimit, formatUsageLimitMessage } from './usage-limit.lib.mjs';
+ import { detectUsageLimit, formatUsageLimitMessage, isUsageLimitError } from './usage-limit.lib.mjs';
  import { createInteractiveHandler } from './interactive-mode.lib.mjs';
  import { setupBidirectionalHandler, finalizeBidirectionalHandler, validateBidirectionalModeConfig, attachStreamingInput } from './bidirectional-interactive.lib.mjs';
  import { initProgressMonitoring } from './solve.progress-monitoring.lib.mjs';
@@ -978,11 +978,12 @@ export const executeClaudeCommand = async params => {
  isRequestTimeout = true;
  await log('⏱️ Detected request timeout from Claude CLI (will retry with --resume)', { verbose: true });
  }
- // Issue #1924: Server-side temporary rate limiting (HTTP 429) a transient
- // throttle, not an account usage limit ("...not your usage limit..."), so retry
- // with --resume. The message text is handled by classifyRetryableError; this also
- // catches the structured api_error_status if the wording ever changes.
- if (data.api_error_status === 429) {
+ // Issue #1924: server-side temporary rate limiting (HTTP 429) is a transient
+ // throttle ("...not your usage limit..."), so retry with --resume. Issue #1935
+ // (regression from #1924): account usage limits ("session limit" / "weekly limit")
+ // ALSO arrive with api_error_status === 429 plus an explicit reset time, so the
+ // isUsageLimitError() guard routes those to the usage-limit handler below instead.
+ if (data.api_error_status === 429 && !isUsageLimitError(lastMessage)) {
  isRateLimitError = true;
  await log(`⚠️ Detected server-side rate limiting (429) from Claude CLI (will retry with --resume). request_id=${data.request_id || 'unknown'}`, { verbose: true });
  }
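The guard in the hunk above decides which handler a structured 429 reaches. A minimal sketch of that routing follows, assuming a simplified stand-in for `isUsageLimitError()` that knows only the four Claude phrases quoted in this diff. The names `looksLikeUsageLimit` and `routeResultEvent`, and the returned route labels, are hypothetical and exist only for this example.

```javascript
// Simplified stand-in for isUsageLimitError(): only phrases quoted in this diff.
const USAGE_LIMIT_PHRASES = [
  'session limit reached',
  'weekly limit reached',
  'hit your session limit',
  'hit your weekly limit',
];

function looksLikeUsageLimit(message) {
  const lower = String(message || '').toLowerCase();
  return USAGE_LIMIT_PHRASES.some(phrase => lower.includes(phrase));
}

// Hypothetical router mirroring the order of checks described in the changelog:
// a structured 429 counts as a transient throttle only when the message is
// not an account usage limit; usage limits go to the reset-time wait instead.
function routeResultEvent(data, lastMessage) {
  if (data.api_error_status === 429 && !looksLikeUsageLimit(lastMessage)) {
    return 'transient-rate-limit'; // exponential backoff, then resume
  }
  if (looksLikeUsageLimit(lastMessage)) {
    return 'usage-limit'; // post a comment and wait until the reset time
  }
  return 'other';
}
```

Without the `!looksLikeUsageLimit(...)` condition, the first branch would capture both kinds of 429, which is the regression the changelog attributes to #1924.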
package/src/tool-retry.lib.mjs CHANGED
@@ -43,6 +43,22 @@ export const classifyRetryableError = value => {
  return { message, isRetryable: true, isCapacity: false, label: 'Stream disconnected before completion' };
  }
 
+ // Issue #1937: Stream idle timeout. When the Anthropic streaming response stalls
+ // (no bytes for the SDK's idle window) after the model has already emitted part of
+ // its answer, the Claude CLI aborts the turn and surfaces a synthetic assistant /
+ // result message:
+ // "API Error: Stream idle timeout - partial response received"
+ // This is a transient network/streaming stall (a slow or stuck server-sent-events
+ // socket), not a request-content error, so the session is still valid and safe to
+ // resume. Before this branch classifyRetryableError() did not recognise it, so
+ // isRetryable was false and the whole solve session aborted with exit code 1 even
+ // though `--resume <sessionId>` could continue with the same context. Switching
+ // models does not help (the stall is in the response stream, not model capacity),
+ // so isCapacity is false → retry with the session preserved after a backoff.
+ if (lower.includes('stream idle timeout') || (lower.includes('idle timeout') && lower.includes('partial response'))) {
+ return { message, isRetryable: true, isCapacity: false, label: 'Stream idle timeout (partial response)' };
+ }
+
  // Issue #1881: Transient socket / network disconnects from the SDK's underlying fetch.
  // When the HTTP(S)/streaming socket drops mid-request, the Claude/Codex CLI surfaces a
  // synthetic assistant message such as:
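To show how a classifier result like the one added above could drive a resume-with-backoff loop, here is a small self-contained sketch. Only the `includes` condition and the returned object shape are taken from the hunk. The function names `classifyStreamIdle`, `backoffDelayMs`, and `runWithResume`, the `attempt` callback contract, and the default retry count and base delay are all assumptions for illustration and do not reflect the harness's real retry block.

```javascript
// Mirrors only the condition added for issue #1937; the fallback object is illustrative.
function classifyStreamIdle(message) {
  const lower = String(message).toLowerCase();
  if (lower.includes('stream idle timeout') || (lower.includes('idle timeout') && lower.includes('partial response'))) {
    return { message, isRetryable: true, isCapacity: false, label: 'Stream idle timeout (partial response)' };
  }
  return { message, isRetryable: false, isCapacity: false, label: null };
}

// Plain doubling backoff: base, 2x base, 4x base, ... (any cap or jitter is omitted).
function backoffDelayMs(retryIndex, baseDelayMs) {
  return baseDelayMs * 2 ** retryIndex;
}

// Hypothetical retry loop. `attempt(retryIndex)` is assumed to resolve to an
// object with an optional `error` string; a real caller would resume the
// preserved session inside it.
async function runWithResume(attempt, options = {}) {
  const maxRetries = options.maxRetries ?? 3;
  const baseDelayMs = options.baseDelayMs ?? 1000;
  const sleep = options.sleep ?? (ms => new Promise(resolve => setTimeout(resolve, ms)));

  for (let retry = 0; ; retry++) {
    const result = await attempt(retry);
    if (!result.error) return result;
    const { isRetryable } = classifyStreamIdle(result.error);
    // Non-retryable errors and exhausted retries both surface as failures.
    if (!isRetryable || retry >= maxRetries) throw new Error(result.error);
    await sleep(backoffDelayMs(retry, baseDelayMs));
  }
}
```

Because `isCapacity` stays `false`, a loop like this retries the same model with the same session rather than switching models, which matches the reasoning in the code comment above.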
package/src/usage-limit.lib.mjs CHANGED
@@ -48,6 +48,14 @@ export function isUsageLimitError(message) {
  // Provider-specific phrasings we've seen in the wild
  'session limit reached', // Claude
  'weekly limit reached', // Claude
+ // Issue #1935: Claude surfaces 5-hour / weekly account limits as
+ // "You've hit your session limit · resets 4pm (UTC)"
+ // "You've hit your weekly limit · resets Jan 15, 8am (UTC)"
+ // These arrive with api_error_status === 429 but are real usage limits with an
+ // explicit reset time, so detect the "hit your <window> limit" phrasing directly
+ // (independent of whether a parseable reset time is present in the message).
+ 'hit your session limit', // Claude 5-hour limit
+ 'hit your weekly limit', // Claude weekly limit
  'daily limit reached',
  'monthly limit reached',
  'billing hard limit',