npm - @oneciel-ai/claude-any - Versions diffs - 0.1.34 → 0.1.35 - Mend

@oneciel-ai/claude-any 0.1.34 → 0.1.35

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

package/README.md CHANGED

@@ -48,7 +48,7 @@ arguments through unchanged.
 Credits: One Ciel LLC
-Current version: `0.1.34`
+Current version: `0.1.35`
 ## Why This Exists
@@ -381,6 +381,15 @@ steps under that larger model's supervision.
 ## Changelog
+### 0.1.35
+- **NVIDIA router context guard**: NVIDIA hosted now defaults to a 32K router
+  context window and LLM presets may tune that cap, reducing timeout-prone
+  payload growth in long Claude Code sessions.
+- **Upstream activity status**: the router records current request, retry,
+  success, and error state with estimated token/byte size so the statusline can
+  distinguish active upstream waits from idle sessions.
 ### 0.1.34
 - **Complete headless configuration path**: add `--ca-env-file`,

package/claude_any.py CHANGED

@@ -39,8 +39,9 @@ LOG_LEVEL_PATH = CONFIG_DIR / "log-level"
 REQUEST_DUMP_PATH = CONFIG_DIR / "requests.jsonl"
 RESPONSE_DUMP_PATH = CONFIG_DIR / "responses.jsonl"
 TOOL_CALL_LOG_PATH = CONFIG_DIR / "tool-calls.jsonl"
-RATE_LIMIT_STATE_PATH = CONFIG_DIR / "rate-limit-state.json"
-CHAT_MESSAGES_PATH = CONFIG_DIR / "chat-messages.jsonl"
+RATE_LIMIT_STATE_PATH = CONFIG_DIR / "rate-limit-state.json"
+ROUTER_ACTIVITY_PATH = CONFIG_DIR / "router-activity.json"
+CHAT_MESSAGES_PATH = CONFIG_DIR / "chat-messages.jsonl"
 CHAT_FILES_DIR = CONFIG_DIR / "chat-files"
 PLAN_ARTIFACTS_DIR = CONFIG_DIR / "plan-artifacts"
 PID_PATH = CONFIG_DIR / "router.pid"
@@ -84,7 +85,7 @@ PROVIDER_LABELS = {
     "self-hosted-nim": "Self Hosted NIM",
 }
 APP_NAME = "Claude Any"
-VERSION = "0.1.34"
+VERSION = "0.1.35"
 CREDITS = "Credits: One Ciel LLC"
 LOG_LEVELS = {"SILENT": 0, "ERROR": 1, "WARN": 2, "INFO": 3, "DEBUG": 4, "TRACE": 5}
@@ -712,17 +713,18 @@ DEFAULT_CONFIG: dict[str, Any] = {
             "stream_enabled": True,
             "stream_word_chunking": False,
         },
-        "nvidia-hosted": {
-            "base_url": "https://integrate.api.nvidia.com/v1",
-            "api_key": "not-used",
-            "current_model": "qwen/qwen3-coder-480b-a35b-instruct",
+        "nvidia-hosted": {
+            "base_url": "https://integrate.api.nvidia.com/v1",
+            "api_key": "not-used",
+            "current_model": "qwen/qwen3-coder-480b-a35b-instruct",
             "advisor_model": "",
             "custom_models": [],
             "native_compat": False,
-            "rate_limit_rpm": 40,
-            "rate_limit_status": True,
-            "max_output_tokens": 4096,
-            "temperature": 0.7,
+            "rate_limit_rpm": 40,
+            "rate_limit_status": True,
+            "context_window": 32768,
+            "max_output_tokens": 4096,
+            "temperature": 0.7,
             "top_p": 0.8,
             "request_timeout_ms": 300000,
             "stream_enabled": True,
@@ -773,14 +775,21 @@ def apply_config_migrations(cfg: dict[str, Any]) -> None:
             pcfg["native_compat"] = False
         migrations[marker] = True
-    marker = "default_timeout_5m_20260513"
-    if not migrations.get(marker):
-        for pcfg in (cfg.get("providers") or {}).values():
-            if not isinstance(pcfg, dict):
-                continue
-            if positive_int(pcfg.get("request_timeout_ms")) in (600000, 1800000):
-                pcfg["request_timeout_ms"] = 300000
-        migrations[marker] = True
+    marker = "default_timeout_5m_20260513"
+    if not migrations.get(marker):
+        for pcfg in (cfg.get("providers") or {}).values():
+            if not isinstance(pcfg, dict):
+                continue
+            if positive_int(pcfg.get("request_timeout_ms")) in (600000, 1800000):
+                pcfg["request_timeout_ms"] = 300000
+        migrations[marker] = True
+    marker = "nvidia_context_window_32k_20260513"
+    if not migrations.get(marker):
+        pcfg = cfg.get("providers", {}).get("nvidia-hosted", {})
+        if isinstance(pcfg, dict) and not positive_int(pcfg.get("context_window")):
+            pcfg["context_window"] = 32768
+        migrations[marker] = True
 _config_cache: dict[str, Any] | None = None
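The new `nvidia_context_window_32k_20260513` migration runs once: it fills in a 32K `context_window` only when nvidia-hosted has no positive value, then records its marker so a later user edit is never overwritten. The sketch below models that behaviour as a standalone function. `positive_int` is a hypothetical stand-in for the package helper, and keeping the markers under a top-level `"migrations"` key is an assumption, since the hunk does not show where `migrations` comes from.

```python
from typing import Any


def positive_int(value: Any) -> int:
    """Hypothetical stand-in for the package helper: a positive int, else 0."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        return 0
    return number if number > 0 else 0


def migrate_nvidia_context_window(cfg: dict[str, Any]) -> None:
    """One-shot migration modelled on 0.1.35: default nvidia-hosted to a
    32768-token context window unless a positive value is already set."""
    # ASSUMPTION: migration markers live under cfg["migrations"].
    migrations = cfg.setdefault("migrations", {})
    marker = "nvidia_context_window_32k_20260513"
    if migrations.get(marker):
        return  # already applied once; leave the setting alone from now on
    providers = cfg.get("providers") or {}  # tolerate a missing or null section
    pcfg = providers.get("nvidia-hosted")
    if isinstance(pcfg, dict) and not positive_int(pcfg.get("context_window")):
        pcfg["context_window"] = 32768
    migrations[marker] = True
```

Because the marker is recorded even when nothing changes, a user who later clears or raises `context_window` keeps that choice on every subsequent start.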
@@ -1174,8 +1183,9 @@ from pathlib import Path
 HOME = Path.home()
 CONFIG_DIR = Path(os.environ.get("CLAUDE_ANY_CONFIG_DIR") or (HOME / ".config" / "claude-any"))
 CONFIG_PATH = CONFIG_DIR / "config.json"
-STATE_PATH = CONFIG_DIR / "rate-limit-state.json"
-PALETTE = (203, 209, 215, 221, 229, 187, 151, 116, 111, 147, 183, 219)
+STATE_PATH = CONFIG_DIR / "rate-limit-state.json"
+ACTIVITY_PATH = CONFIG_DIR / "router-activity.json"
+PALETTE = (203, 209, 215, 221, 229, 187, 151, 116, 111, 147, 183, 219)
 def load_json(path, default):
@@ -1225,8 +1235,9 @@ def main():
         rpm = int(raw_rpm)
     except Exception:
         rpm = 40
-    state = load_json(STATE_PATH, {})
-    now = time.time()
+    state = load_json(STATE_PATH, {})
+    activity = load_json(ACTIVITY_PATH, {})
+    now = time.time()
     key = f"{provider}:__global__" if provider else ""
     entry = state.get(key) if key else None
     if not isinstance(entry, dict):
@@ -1288,9 +1299,25 @@ def main():
             rpm_text += " | server " + ", ".join(parts)
     if penalty_until > now:
         rpm_text += f" | wait {max(0.0, penalty_until - now):.0f}s"
-    elif last_wait >= 0.5 and 0.0 <= now - updated_at < 60.0:
-        rpm_text += f" | wait {last_wait:.1f}s"
-    print(f"{left} | {color(rpm_text)}")
+    elif last_wait >= 0.5 and 0.0 <= now - updated_at < 60.0:
+        rpm_text += f" | wait {last_wait:.1f}s"
+    if isinstance(activity, dict):
+        try:
+            age = now - float(activity.get("updated_at") or 0)
+        except Exception:
+            age = 999999
+        if 0 <= age < 180:
+            event = str(activity.get("event") or "")
+            if event == "retry":
+                rpm_text += f" | retry {activity.get('attempt')}/{activity.get('total')}"
+            elif event == "request":
+                tokens = activity.get("tokens")
+                rpm_text += f" | upstream {age:.0f}s"
+                if tokens:
+                    rpm_text += f" {tokens}tok"
+            elif event in ("success", "error"):
+                rpm_text += f" | {event} {age:.0f}s"
+    print(f"{left} | {color(rpm_text)}")
 if __name__ == "__main__":
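The statusline addition reads `router-activity.json` and appends a suffix only when the record is less than 180 seconds old. For illustration, the same branching is pulled out below into a standalone function; the name `activity_suffix` is hypothetical, and the script in the package inlines this logic in `main()` and catches any exception rather than just `TypeError`/`ValueError`.

```python
from typing import Any


def activity_suffix(activity: Any, now: float) -> str:
    """Turn a router-activity payload into the statusline suffix, or "" if stale."""
    if not isinstance(activity, dict):
        return ""
    try:
        age = now - float(activity.get("updated_at") or 0)
    except (TypeError, ValueError):
        age = 999999  # unreadable timestamp: treat the record as stale
    if not 0 <= age < 180:
        return ""  # older than three minutes: show nothing (idle session)
    event = str(activity.get("event") or "")
    if event == "retry":
        return f" | retry {activity.get('attempt')}/{activity.get('total')}"
    if event == "request":
        # For an in-flight request, age is how long the router has been waiting.
        text = f" | upstream {age:.0f}s"
        tokens = activity.get("tokens")
        if tokens:
            text += f" {tokens}tok"
        return text
    if event in ("success", "error"):
        return f" | {event} {age:.0f}s"
    return ""
```

For example, a request record written 42 seconds ago with a 12000-token estimate renders as ` | upstream 42s 12000tok`. Note that the 180-second window is shorter than the 300000 ms default `request_timeout_ms`, so a very slow request stops being shown before it times out.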
@@ -2834,16 +2861,34 @@ def native_anthropic_base_url(provider: str, pcfg: dict[str, Any]) -> str:
     return base
-def write_json(handler: BaseHTTPRequestHandler, obj: Any, status: int = 200) -> None:
-    body = json.dumps(obj).encode("utf-8")
-    handler.send_response(status)
-    handler.send_header("content-type", "application/json")
-    handler.send_header("content-length", str(len(body)))
-    handler.end_headers()
-    handler.wfile.write(body)
-def write_text_response(handler: BaseHTTPRequestHandler, text: str, status: int = 200, content_type: str = "text/plain; charset=utf-8") -> None:
+def write_json(handler: BaseHTTPRequestHandler, obj: Any, status: int = 200) -> None:
+    body = json.dumps(obj).encode("utf-8")
+    handler.send_response(status)
+    handler.send_header("content-type", "application/json")
+    handler.send_header("content-length", str(len(body)))
+    handler.end_headers()
+    handler.wfile.write(body)
+def write_router_activity(event: str, provider: str, model: str | None = None, **fields: Any) -> None:
+    try:
+        CONFIG_DIR.mkdir(parents=True, exist_ok=True)
+        data = {
+            "updated_at": time.time(),
+            "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
+            "event": event,
+            "provider": provider,
+            "model": model or "",
+        }
+        data.update(fields)
+        tmp = ROUTER_ACTIVITY_PATH.with_name(f"{ROUTER_ACTIVITY_PATH.name}.{os.getpid()}.{time.time_ns()}.tmp")
+        tmp.write_text(json.dumps(data, ensure_ascii=False, separators=(",", ":")), encoding="utf-8")
+        tmp.replace(ROUTER_ACTIVITY_PATH)
+    except Exception:
+        pass
+def write_text_response(handler: BaseHTTPRequestHandler, text: str, status: int = 200, content_type: str = "text/plain; charset=utf-8") -> None:
     body = text.encode("utf-8")
     handler.send_response(status)
     handler.send_header("content-type", content_type)
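`write_router_activity` writes to a uniquely named temp file in the same directory and then renames it over `router-activity.json`, so the statusline process never reads a half-written file. Below is a minimal sketch of that temp-file-then-replace pattern; the function name `write_json_atomically` is hypothetical. Unlike the router version, which swallows every exception so status tracking can never break a request, this sketch lets errors propagate and removes a leftover temp file.

```python
import json
import os
import time
from pathlib import Path
from typing import Any


def write_json_atomically(path: Path, data: dict[str, Any]) -> None:
    """Write JSON via a sibling temp file plus rename, so readers see old or new data, never a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory keeps the rename on one filesystem; pid + ns time keeps names unique.
    tmp = path.with_name(f"{path.name}.{os.getpid()}.{time.time_ns()}.tmp")
    try:
        tmp.write_text(json.dumps(data, ensure_ascii=False, separators=(",", ":")), encoding="utf-8")
        tmp.replace(path)  # os.replace semantics: overwrites the target; atomic on POSIX
    finally:
        if tmp.exists():
            tmp.unlink()  # only reached with a leftover temp file if the write or rename failed
```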
@@ -3556,11 +3601,20 @@ def cap_output_tokens_for_context(
     return max(1, min(configured, available))
-def ollama_context_limit_for_budget(pcfg: dict[str, Any]) -> int:
-    raw = pcfg.get("num_ctx", "auto")
-    if isinstance(raw, str) and raw.strip().lower() == "auto":
-        return positive_int(pcfg.get("num_ctx_max")) or 65536
-    return positive_int(raw) or positive_int(pcfg.get("num_ctx_max")) or 65536
+def ollama_context_limit_for_budget(pcfg: dict[str, Any]) -> int:
+    raw = pcfg.get("num_ctx", "auto")
+    if isinstance(raw, str) and raw.strip().lower() == "auto":
+        return positive_int(pcfg.get("num_ctx_max")) or 65536
+    return positive_int(raw) or positive_int(pcfg.get("num_ctx_max")) or 65536
+def openai_context_limit_for_budget(provider: str, pcfg: dict[str, Any]) -> int:
+    configured = positive_int(pcfg.get("context_window")) or positive_int(pcfg.get("max_model_len"))
+    if configured:
+        return configured
+    if provider == "nvidia-hosted":
+        return 32768
+    return 65536
 def compact_ollama_messages_for_budget(
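`openai_context_limit_for_budget` resolves the context limit in a fixed order: an explicit `context_window`, then `max_model_len`, then a provider default that is 32768 for nvidia-hosted and 65536 for every other OpenAI-compatible provider. The sketch below restates that order so it can be exercised in isolation; `positive_int` is again a hypothetical stand-in for the package helper.

```python
from typing import Any


def positive_int(value: Any) -> int:
    """Hypothetical stand-in for the package helper: a positive int, else 0."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        return 0
    return number if number > 0 else 0


def context_limit(provider: str, pcfg: dict[str, Any]) -> int:
    """Resolution order mirrored from openai_context_limit_for_budget."""
    configured = positive_int(pcfg.get("context_window")) or positive_int(pcfg.get("max_model_len"))
    if configured:
        return configured  # an explicit setting always wins
    return 32768 if provider == "nvidia-hosted" else 65536
```

A non-numeric `context_window` such as `"auto"` therefore falls through to `max_model_len`, and only when both are unset does the provider default apply.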
@@ -3695,10 +3749,10 @@ def ollama_chat_request(model: str, body: dict[str, Any], pcfg: dict[str, Any],
     return req
-def openai_compatible_chat_request(model: str, body: dict[str, Any], pcfg: dict[str, Any], stream: bool = False) -> dict[str, Any]:
-    messages = anthropic_messages_to_openai(body)
-    tools = anthropic_tools_to_ollama(body.get("tools"))
-    context_limit = positive_int(pcfg.get("context_window")) or positive_int(pcfg.get("max_model_len")) or 65536
+def openai_compatible_chat_request(provider: str, model: str, body: dict[str, Any], pcfg: dict[str, Any], stream: bool = False) -> dict[str, Any]:
+    messages = anthropic_messages_to_openai(body)
+    tools = anthropic_tools_to_ollama(body.get("tools"))
+    context_limit = openai_context_limit_for_budget(provider, pcfg)
     configured = configured_output_tokens(pcfg, body)
     reserve = positive_int(pcfg.get("context_reserve_tokens")) or 1024
     output_reserve = configured or positive_int(body.get("max_tokens")) or 4096
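This hunk shows the inputs to the request budget (the context limit, a `context_reserve_tokens` fallback of 1024, and an output reservation falling back to 4096) but not how `openai_compatible_chat_request` combines them. As a back-of-the-envelope sketch only, assuming the prompt gets whatever remains after both reservations:

```python
def prompt_budget(context_limit: int, reserve: int = 1024, output_reserve: int = 4096) -> int:
    """ASSUMPTION, not the package's code: prompt room = limit - reserve - output reservation."""
    return max(0, context_limit - reserve - output_reserve)
```

Under that assumption, with default settings the NVIDIA hosted prompt budget would drop from 60416 tokens (old 65536 fallback) to 27648 tokens (new 32768 default).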
@@ -4592,37 +4646,58 @@ def post_json_with_rate_retry(
     model: str,
     retry_notice: Callable[[str], None] | None = None,
 ) -> Any:
-    gateway_retries = positive_int(pcfg.get("gateway_retries")) or 2
-    max_attempts = max(1, gateway_retries + 1)
-    for attempt in range(max_attempts):
-        try:
-            data_bytes = json.dumps(req_body).encode("utf-8")
-            req = urllib.request.Request(url, data=data_bytes, headers=headers, method="POST")
-            with urllib.request.urlopen(req, timeout=timeout) as resp:
-                learn_router_rate_limit_headers(provider, pcfg, model, resp.headers)
-                return json.loads(resp.read().decode("utf-8"))
-        except urllib.error.HTTPError as exc:
-            raw = exc.read().decode("utf-8", errors="ignore")
-            learn_router_rate_limit_headers(provider, pcfg, model, exc.headers)
+    gateway_retries = positive_int(pcfg.get("gateway_retries")) or 2
+    max_attempts = max(1, gateway_retries + 1)
+    token_estimate = estimate_tokens(req_body)
+    byte_estimate = len(json.dumps(req_body, ensure_ascii=False).encode("utf-8"))
+    for attempt in range(max_attempts):
+        try:
+            write_router_activity(
+                "request",
+                provider,
+                model,
+                attempt=attempt + 1,
+                total=max_attempts,
+                tokens=token_estimate,
+                bytes=byte_estimate,
+                timeout=timeout,
+            )
+            router_log("INFO", f"upstream_request provider={provider} model={model} attempt={attempt + 1}/{max_attempts} tokens={token_estimate} bytes={byte_estimate} timeout={timeout}")
+            data_bytes = json.dumps(req_body).encode("utf-8")
+            req = urllib.request.Request(url, data=data_bytes, headers=headers, method="POST")
+            with urllib.request.urlopen(req, timeout=timeout) as resp:
+                learn_router_rate_limit_headers(provider, pcfg, model, resp.headers)
+                data = json.loads(resp.read().decode("utf-8"))
+                write_router_activity("success", provider, model, attempt=attempt + 1, tokens=token_estimate, bytes=byte_estimate)
+                return data
+        except urllib.error.HTTPError as exc:
+            raw = exc.read().decode("utf-8", errors="ignore")
+            learn_router_rate_limit_headers(provider, pcfg, model, exc.headers)
             if exc.code == 429 and attempt == 0:
                 wait = register_router_rate_limit_backoff(provider, pcfg, model, exc.headers.get("Retry-After"))
                 time.sleep(wait)
                 continue
-            if exc.code in UPSTREAM_RETRY_HTTP_CODES and attempt + 1 < max_attempts:
-                retry_no = attempt + 1
-                if retry_notice:
-                    retry_notice(upstream_retry_message(retry_no, gateway_retries))
-                time.sleep(upstream_retry_wait_seconds(retry_no))
-                continue
-            raise RuntimeError(upstream_http_error_message(exc, raw)) from exc
-        except (TimeoutError, urllib.error.URLError) as exc:
-            if retryable_timeout_exception(exc) and attempt + 1 < max_attempts:
-                retry_no = attempt + 1
-                if retry_notice:
-                    retry_notice(upstream_retry_message(retry_no, gateway_retries))
-                time.sleep(upstream_retry_wait_seconds(retry_no))
-                continue
-            raise RuntimeError(f"{type(exc).__name__}: {exc}") from exc
+            if exc.code in UPSTREAM_RETRY_HTTP_CODES and attempt + 1 < max_attempts:
+                retry_no = attempt + 1
+                write_router_activity("retry", provider, model, attempt=retry_no, total=gateway_retries, code=exc.code, tokens=token_estimate, bytes=byte_estimate)
+                router_log("WARN", f"upstream_retry provider={provider} model={model} attempt={retry_no}/{gateway_retries} code={exc.code} tokens={token_estimate} bytes={byte_estimate}")
+                if retry_notice:
+                    retry_notice(upstream_retry_message(retry_no, gateway_retries))
+                time.sleep(upstream_retry_wait_seconds(retry_no))
+                continue
+            write_router_activity("error", provider, model, code=exc.code, tokens=token_estimate, bytes=byte_estimate)
+            raise RuntimeError(upstream_http_error_message(exc, raw)) from exc
+        except (TimeoutError, urllib.error.URLError) as exc:
+            if retryable_timeout_exception(exc) and attempt + 1 < max_attempts:
+                retry_no = attempt + 1
+                write_router_activity("retry", provider, model, attempt=retry_no, total=gateway_retries, error=type(exc).__name__, tokens=token_estimate, bytes=byte_estimate)
+                router_log("WARN", f"upstream_retry provider={provider} model={model} attempt={retry_no}/{gateway_retries} error={type(exc).__name__} tokens={token_estimate} bytes={byte_estimate}")
+                if retry_notice:
+                    retry_notice(upstream_retry_message(retry_no, gateway_retries))
+                time.sleep(upstream_retry_wait_seconds(retry_no))
+                continue
+            write_router_activity("error", provider, model, error=type(exc).__name__, tokens=token_estimate, bytes=byte_estimate)
+            raise RuntimeError(f"{type(exc).__name__}: {exc}") from exc
     raise RuntimeError("upstream request failed")
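The retry loop now emits a `request` activity record before every attempt, a `retry` record for each retryable failure, and a final `success` or `error`. The simplified model below reproduces only that event sequence: HTTP status handling, the 429 backoff, sleeps and logging are omitted, the transport is an injected `send` callable, and the names `post_with_activity`, `send` and `record` are hypothetical. As in the diff, `request` events count against `max_attempts` while `retry` events count against `gateway_retries`, which is why the statusline shows `retry 1/2` rather than `1/3`.

```python
from typing import Any, Callable


def post_with_activity(
    send: Callable[[], Any],
    record: Callable[..., None],
    gateway_retries: int = 2,
) -> Any:
    """Model of the 0.1.35 activity events around a retrying upstream call."""
    max_attempts = max(1, gateway_retries + 1)
    for attempt in range(max_attempts):
        record("request", attempt=attempt + 1, total=max_attempts)
        try:
            data = send()
        except TimeoutError as exc:
            if attempt + 1 < max_attempts:
                # Retry numbers are reported against gateway_retries, not max_attempts.
                record("retry", attempt=attempt + 1, total=gateway_retries)
                continue
            record("error", error=type(exc).__name__)
            raise RuntimeError(f"{type(exc).__name__}: {exc}") from exc
        record("success", attempt=attempt + 1)
        return data
    raise RuntimeError("upstream request failed")
```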
@@ -4631,7 +4706,7 @@ def forward_openai_compatible_chat(handler: BaseHTTPRequestHandler, provider: st
     model = resolve_requested_model(provider, pcfg, body.get("model"))
     if provider == "nvidia-hosted":
         model = ncp_model_id_for_nvidia_hosted(model)
-    req_body = openai_compatible_chat_request(model, body, pcfg, stream=False)
+    req_body = openai_compatible_chat_request(provider, model, body, pcfg, stream=False)
     url = join_url(provider_upstream_request_base(provider, pcfg), "/chat/completions")
     waited, rpm_used, rpm_limit = apply_router_rate_limit(provider, pcfg, model)
     stream = bool(body.get("stream", True))
@@ -5112,7 +5187,7 @@ def status_lines() -> list[str]:
         *([f"keep_alive: {pcfg.get('keep_alive', 'default')}"] if provider in ("ollama", "ollama-cloud") else []),
         *([f"think: {bool(pcfg.get('think', False))}"] if provider in ("ollama", "ollama-cloud") else []),
         *([f"request_timeout_ms: {pcfg.get('request_timeout_ms', 'default')}"] if provider in ("ollama", "ollama-cloud") else []),
-        *([f"context_window: {pcfg.get('context_window', 'default')}"] if provider in ("vllm", "self-hosted-nim") else []),
+        *([f"context_window: {pcfg.get('context_window', 'default')}"] if provider in ("vllm", "nvidia-hosted", "self-hosted-nim") else []),
         *([f"context_reserve_tokens: {pcfg.get('context_reserve_tokens', 'default')}"] if provider in ("vllm", "self-hosted-nim") else []),
         *([f"max_output_tokens: {pcfg.get('max_output_tokens', 'default')}"] if provider in ("vllm", "nvidia-hosted", "self-hosted-nim") else []),
         *([f"request_timeout_ms: {pcfg.get('request_timeout_ms', 'default')}"] if provider in ("vllm", "nvidia-hosted", "self-hosted-nim") else []),
@@ -5421,9 +5496,9 @@ def provider_options_status(provider: str, pcfg: dict[str, Any]) -> str:
             if limit is not None:
                 suffix = f"{used}/{limit}" if limit > 0 else f"{used}/min(unlimited)"
                 parts.append(f"rpm_used={suffix}")
-    if provider in ("vllm", "self-hosted-nim"):
-        parts.insert(0, f"context_window={pcfg.get('context_window', 'default')}")
-        parts.insert(1, f"reserve={pcfg.get('context_reserve_tokens', 'default')}")
+    if provider in ("vllm", "nvidia-hosted", "self-hosted-nim"):
+        parts.insert(0, f"context_window={pcfg.get('context_window', 'default')}")
+        parts.insert(1, f"reserve={pcfg.get('context_reserve_tokens', 'default')}")
     if provider in ("vllm", "self-hosted-nim"):
         native_default = False if provider == "nvidia-hosted" else True
         parts.append(f"native={bool(pcfg.get('native_compat', native_default))}")
@@ -5741,10 +5816,10 @@ def apply_llm_preset_to_provider(provider: str, pcfg: dict[str, Any], preset_id:
                 f"native={native_default}",
             ],
         }
-        for token in tokens_by_preset[preset_id]:
-            if provider == "nvidia-hosted" and token.startswith(("context_window=", "reserve=", "native=")):
-                continue
-            apply_provider_option(provider, pcfg, token)
+        for token in tokens_by_preset[preset_id]:
+            if provider == "nvidia-hosted" and token.startswith("native="):
+                continue
+            apply_provider_option(provider, pcfg, token)
         if server_limit:
             requested_context = positive_int(pcfg.get("context_window"))
             if requested_context and requested_context > server_limit:
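In 0.1.34, LLM presets skipped `context_window=`, `reserve=` and `native=` tokens for nvidia-hosted; 0.1.35 skips only `native=`, which is what lets a preset tune the new 32K cap. A sketch of that filter follows; the function name and the `max_output=8192` token in the usage check are hypothetical. After the tokens are applied, the requested `context_window` is still compared against `server_limit` when one is known.

```python
def preset_tokens_to_apply(provider: str, tokens: list[str]) -> list[str]:
    """0.1.35 behaviour: nvidia-hosted drops only native= tokens from a preset."""
    if provider == "nvidia-hosted":
        return [t for t in tokens if not t.startswith("native=")]
    return list(tokens)
```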

package/docs/README.ja.md CHANGED

@@ -47,7 +47,7 @@ selects vLLM, NVIDIA hosted, or self-hosted NIM, and normal Claude Code arg
 Credits: One Ciel LLC
-Current version: `0.1.34`
+Current version: `0.1.35`
 ## Why It Was Made
@@ -351,6 +351,15 @@ Windows/Linux administration, cleanup scripts, periodic secu
 ## Changelog
+### 0.1.35
+- **NVIDIA router context guard**: lowered the NVIDIA hosted router context default
+  to 32K and allowed LLM presets to adjust this cap. This reduces timeouts caused
+  by payload bloat in long Claude Code sessions.
+- **Upstream activity status**: the router records the current request/retry/success/error
+  state and the estimated token/byte size, so the statusline can tell an upstream
+  wait apart from an idle session.
 ### 0.1.34
 - **Complete headless configuration path**: `--ca-env-file`, environment variable mapping, Advisor

package/docs/README.ko.md CHANGED

@@ -47,7 +47,7 @@ selects NVIDIA hosted or self-hosted NIM, and Claude Code's normal arguments
 Credits: One Ciel LLC
-Current version: `0.1.34`
+Current version: `0.1.35`
 ## Why We Made It
@@ -351,6 +351,15 @@ Windows event log review, cleanup of virus/ransomware intrusion attempts,
 ## Changelog
+### 0.1.35
+- **NVIDIA router context guard**: lowered the NVIDIA hosted router context default
+  to 32K and let LLM presets adjust this cap, reducing cases where the payload grows
+  and times out in long Claude Code sessions.
+- **Upstream activity status**: the router records the current request/retry/success/error
+  state and the estimated token/byte size, so the statusline can distinguish an
+  upstream wait from an idle state.
 ### 0.1.34
 - **Complete headless configuration path**: `--ca-env-file`, environment variable mapping, Advisor model,

package/docs/README.zh.md CHANGED

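@@ -47,7 +47,7 @@ NIM, and passes normal Claude Code arguments through unchanged.
 Credits: One Ciel LLC
-Current version: `0.1.34`
+Current version: `0.1.35`
 ## Why It Exists
@@ -337,6 +337,14 @@ Hermes-format models or some older Qwen tool templates.
 ## Changelog
+### 0.1.35
+- **NVIDIA router context guard**: the NVIDIA hosted router context default is now
+  32K, and LLM presets may adjust this cap, reducing timeouts triggered when the
+  payload grows in long Claude Code sessions.
+- **Upstream activity status**: the router records the current request/retry/success/error
+  state and the estimated token/byte size, so the statusline can tell whether it is
+  waiting on upstream or already idle.
 ### 0.1.34
 - **Complete headless configuration path**: adds `--ca-env-file`, environment variable mapping, Advisor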

package/docs/manual.md CHANGED

@@ -10,7 +10,7 @@ Code starts, while passing normal Claude Code arguments through unchanged.
 Credits: One Ciel LLC
-Current version: `0.1.34`
+Current version: `0.1.35`
 ## Install

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@oneciel-ai/claude-any",
-  "version": "0.1.34",
+  "version": "0.1.35",
   "description": "Claude Code provider selector for Anthropic, Ollama, Ollama Cloud, vLLM, NVIDIA hosted, and self-hosted NIM.",
   "license": "MIT",
   "author": "One Ciel LLC",