npm - @link-assistant/hive-mind - Versions diffs - 2.0.0 → 2.0.2 - Mend

@link-assistant/hive-mind 2.0.0 → 2.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5) hide show

package/CHANGELOG.md CHANGED Viewed

@@ -1,5 +1,91 @@
 # @link-assistant/hive-mind
+## 2.0.2
+### Patch Changes
+- 19aea85: fix(retry): auto-resume on "Stream idle timeout - partial response received" (#1937)
+  A long-running solve session (391 turns, ~$34.11) had its streaming response
+  stall mid-answer. The Claude CLI surfaced it as a `result` event with
+  `is_error: true`, `subtype: "success"`, and:
+  ```
+  API Error: Stream idle timeout - partial response received
+  ```
+  Instead of retrying with the session preserved, the harness fell straight
+  through to the generic failure path and exited with code 1 after **zero
+  retries** — abandoning the whole session even though it had a valid session ID
+  and printed the exact `--resume` command needed to continue.
+  Root cause: the shared retry classifier `classifyRetryableError()`
+  (`src/tool-retry.lib.mjs`) had no branch for the stream-idle-timeout family, so
+  `isRetryable` was false, `isTransientError` evaluated to false, and the unified
+  exponential-backoff retry block was never entered.
+  This error is a transient transport-level stall (a slow/stuck server-sent-events
+  socket), not a request-content rejection — the on-disk session transcript stays
+  valid, which is why a manual `--resume` works. The fix adds one branch to
+  `classifyRetryableError()` returning
+  `{ isRetryable: true, isCapacity: false, label: 'Stream idle timeout (partial response)' }`,
+  so the existing retry block resumes the session with the same context after an
+  exponential backoff. Because the classifier is shared, this fixes the behaviour
+  for **all** tools (claude/codex/gemini/opencode/qwen/agent) at once.
+  Added `tests/test-issue-1937-stream-idle-timeout-retry.mjs` (17 assertions) and a
+  full case study with timeline, root-cause analysis, upstream references, and the
+  captured logs under `docs/case-studies/issue-1937`.
+## 2.0.1
+### Patch Changes
+- 70e1542: fix(retry): treat 5-hour "session limit" and "weekly limit" 429s as account usage limits, not transient throttles (#1935)
+  A long-running solve session (588 turns, ~$70.62) hit Claude's **5-hour session
+  limit**. The Claude CLI surfaced it as a `result` event with `is_error: true`,
+  `api_error_status: 429`, and:
+  ```
+  You've hit your session limit · resets 4pm (UTC)
+  ```
+  Instead of being treated as an **account usage limit** (post a comment with the
+  reset time + wait until the exact reset moment), it was put through the transient
+  exponential-backoff retry loop:
+  ```
+  ⚠️ Server rate limited (429) detected. Retry 1/10 in 2 min (session preserved)...
+     Error: You've hit your session limit · resets 4pm (UTC)
+  ⚠️ Server rate limited (429) detected. Retry 2/10 in 4 min (session preserved)...
+  ```
+  Each retry re-hit the same limit because the quota only frees at the reset time —
+  so the harness burned ~10 futile retries and never told the user when the limit
+  resets.
+  Root cause (regression from #1924): `src/claude.lib.mjs` set
+  `isRateLimitError = true` for **every** structured `api_error_status === 429`,
+  without checking whether the message was an account usage limit. Claude reports
+  **both** a transient throttle ("...not your usage limit...") and account
+  session/weekly limits with `api_error_status: 429`, so the unconditional check
+  swept genuine usage limits into the transient-retry path — ahead of the
+  `detectUsageLimit()` reset-time wait, which was therefore never reached.
+  Fix: `src/claude.lib.mjs` now only flags a structured 429 as a transient rate
+  limit when the message is **not** a usage limit
+  (`api_error_status === 429 && !isUsageLimitError(lastMessage)`), so session/weekly
+  limits fall through to the usage-limit handler that immediately posts a comment
+  and waits until the exact reset time (auto-resuming there with
+  `--auto-continue-limit`). `src/usage-limit.lib.mjs` additionally recognises the
+  "hit your session limit" / "hit your weekly limit" phrasing as a backstop (the
+  reset-time regex already matched "resets 4pm").
+  Added `tests/test-issue-1935-session-limit-429.mjs` (15 assertions) and a full
+  case study with timeline, blame history (PR #1924), root-cause analysis, and the
+  captured logs under `docs/case-studies/issue-1935`.
 ## 2.0.0
 ### Major Changes

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "@link-assistant/hive-mind",
-  "version": "2.0.0",
+  "version": "2.0.2",
   "description": "AI-powered issue solver and hive mind for collaborative problem solving",
   "main": "src/hive.mjs",
   "type": "module",

package/src/claude.lib.mjs CHANGED Viewed

@@ -9,7 +9,7 @@ const path = (await use('path')).default;
 import { log, isENOSPC } from './lib.mjs';
 import { reportError } from './sentry.lib.mjs';
 import { timeouts, retryLimits, claudeCode, getClaudeEnv, getThinkingLevelToTokens, getTokensToThinkingLevel, supportsThinkingBudget, DEFAULT_MAX_THINKING_BUDGET, getMaxOutputTokensForModel } from './config.lib.mjs';
-import { detectUsageLimit, formatUsageLimitMessage } from './usage-limit.lib.mjs';
+import { detectUsageLimit, formatUsageLimitMessage, isUsageLimitError } from './usage-limit.lib.mjs';
 import { createInteractiveHandler } from './interactive-mode.lib.mjs';
 import { setupBidirectionalHandler, finalizeBidirectionalHandler, validateBidirectionalModeConfig, attachStreamingInput } from './bidirectional-interactive.lib.mjs';
 import { initProgressMonitoring } from './solve.progress-monitoring.lib.mjs';
@@ -978,11 +978,12 @@ export const executeClaudeCommand = async params => {
                     isRequestTimeout = true;
                     await log('⏱️ Detected request timeout from Claude CLI (will retry with --resume)', { verbose: true });
                   }
-                  // Issue #1924: Server-side temporary rate limiting (HTTP 429) — a transient
-                  // throttle, not an account usage limit ("...not your usage limit..."), so retry
-                  // with --resume. The message text is handled by classifyRetryableError; this also
-                  // catches the structured api_error_status if the wording ever changes.
-                  if (data.api_error_status === 429) {
+                  // Issue #1924: server-side temporary rate limiting (HTTP 429) is a transient
+                  // throttle ("...not your usage limit..."), so retry with --resume. Issue #1935
+                  // (regression from #1924): account usage limits ("session limit" / "weekly limit")
+                  // ALSO arrive with api_error_status === 429 plus an explicit reset time, so the
+                  // isUsageLimitError() guard routes those to the usage-limit handler below instead.
+                  if (data.api_error_status === 429 && !isUsageLimitError(lastMessage)) {
                     isRateLimitError = true;
                     await log(`⚠️ Detected server-side rate limiting (429) from Claude CLI (will retry with --resume). request_id=${data.request_id || 'unknown'}`, { verbose: true });
                   }

package/src/tool-retry.lib.mjs CHANGED Viewed

@@ -43,6 +43,22 @@ export const classifyRetryableError = value => {
     return { message, isRetryable: true, isCapacity: false, label: 'Stream disconnected before completion' };
   }
+  // Issue #1937: Stream idle timeout. When the Anthropic streaming response stalls
+  // (no bytes for the SDK's idle window) after the model has already emitted part of
+  // its answer, the Claude CLI aborts the turn and surfaces a synthetic assistant /
+  // result message:
+  //   "API Error: Stream idle timeout - partial response received"
+  // This is a transient network/streaming stall (a slow or stuck server-sent-events
+  // socket), not a request-content error, so the session is still valid and safe to
+  // resume. Before this branch classifyRetryableError() did not recognise it, so
+  // isRetryable was false and the whole solve session aborted with exit code 1 even
+  // though `--resume <sessionId>` could continue with the same context. Switching
+  // models does not help (the stall is in the response stream, not model capacity),
+  // so isCapacity is false → retry with the session preserved after a backoff.
+  if (lower.includes('stream idle timeout') || (lower.includes('idle timeout') && lower.includes('partial response'))) {
+    return { message, isRetryable: true, isCapacity: false, label: 'Stream idle timeout (partial response)' };
+  }
   // Issue #1881: Transient socket / network disconnects from the SDK's underlying fetch.
   // When the HTTP(S)/streaming socket drops mid-request, the Claude/Codex CLI surfaces a
   // synthetic assistant message such as:

package/src/usage-limit.lib.mjs CHANGED Viewed

@@ -48,6 +48,14 @@ export function isUsageLimitError(message) {
     // Provider-specific phrasings we've seen in the wild
     'session limit reached', // Claude
     'weekly limit reached', // Claude
+    // Issue #1935: Claude surfaces 5-hour / weekly account limits as
+    //   "You've hit your session limit · resets 4pm (UTC)"
+    //   "You've hit your weekly limit · resets Jan 15, 8am (UTC)"
+    // These arrive with api_error_status === 429 but are real usage limits with an
+    // explicit reset time, so detect the "hit your <window> limit" phrasing directly
+    // (independent of whether a parseable reset time is present in the message).
+    'hit your session limit', // Claude 5-hour limit
+    'hit your weekly limit', // Claude weekly limit
     'daily limit reached',
     'monthly limit reached',
     'billing hard limit',