PyPI - cli-agent-runner - Versions diffs - 0.1.34__tar.gz → 0.1.36__tar.gz - Mend

cli-agent-runner 0.1.34__tar.gz → 0.1.36__tar.gz


Files changed (221)

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/CHANGELOG.md RENAMED

@@ -7,6 +7,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.1.36] - 2026-05-21
+### Added
+- New monitor detector `supervisor_stale` (notify) — alerts when the supervisor stops emitting events (stuck between rounds or dead), a blind spot the event stream and `detect_hung` cannot catch. Default ON; threshold derives from `round_timeout_s * 1.5`. Detector count 11 → 12.
+- `[monitor] supervisor_stale_threshold_s` config — override the derived staleness threshold (positive = seconds; 0 = disable; unset = derived).
+### Changed
+- `docs/runbook.md` documents the liveness-monitoring architecture: run `monitor --host` from a separate machine to detect supervisor silent-death AND host death (a same-host monitor dies with its host).
+## [0.1.35] - 2026-05-20
+### Removed
+- `claude_rate_limit_detector` plugin alias (0.1.20-era back-compat layer after the 0.1.23 rename to `claude_error_detector`). Hard-cut at both entry-point and config-mapping layers. See `docs/migrations/0.1.35.md` for the 1-line TOML migration.
 ## [0.1.34] - 2026-05-20
 ### Added

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: cli-agent-runner
-Version: 0.1.34
+Version: 0.1.36
 Summary: Restart-on-exit supervisor for autonomous CLI agents
 Project-URL: Homepage, https://github.com/wan9yu/cli-agent-runner
 Project-URL: Documentation, https://github.com/wan9yu/cli-agent-runner#readme
@@ -49,7 +49,7 @@ full disks, runaway memory.
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3: The Witness (monitor)           │  11 detectors + auto-stop
+│ Layer 3: The Witness (monitor)           │  12 detectors + auto-stop
 ├──────────────────────────────────────────┤
 │ Layer 2: The Loop (serve, ~120 LOC)      │  signal-trapping restart loop
 ├──────────────────────────────────────────┤
@@ -80,13 +80,13 @@ agent-runner monitor              # live anomaly detection
 Full walkthrough: [`docs/quickstart.md`](docs/quickstart.md).
-## 14 verbs
+## 16 verbs
 | Lifecycle | Observation |
 |---|---|
 | `init` / `install` / `uninstall` | `peek` — state snapshot |
 | `start` / `stop` / `kill` / `cancel` | `watch` — peek in a refresh loop |
-| `restart` / `status` | `monitor` — 11 detectors, alerts, auto-stop |
+| `restart` / `status` | `monitor` — 12 detectors, alerts, auto-stop |
 | `round` / `serve` / `upgrade` | `events` — query / stream events.jsonl |
 Verb reference: [`docs/commands.md`](docs/commands.md).
@@ -106,11 +106,11 @@ guards it. Highlights:
 Full list and rationale: [`docs/architecture.md`](docs/architecture.md).
-## Monitor: 11 detectors
+## Monitor: 12 detectors
 Notify only: `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`,
 `mem_pressure`, `smoke_fail_rate`, `network_fail`, `rate_limit_active`,
-`anomaly_repetitive_active`.
+`anomaly_repetitive_active`, `supervisor_stale`.
 **Auto-stop the service** (continuing is harmful):
 - `oauth_fail` — burning API quota on auth-rejected rounds

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/README.md RENAMED

@@ -12,7 +12,7 @@ full disks, runaway memory.
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3: The Witness (monitor)           │  11 detectors + auto-stop
+│ Layer 3: The Witness (monitor)           │  12 detectors + auto-stop
 ├──────────────────────────────────────────┤
 │ Layer 2: The Loop (serve, ~120 LOC)      │  signal-trapping restart loop
 ├──────────────────────────────────────────┤
@@ -43,13 +43,13 @@ agent-runner monitor              # live anomaly detection
 Full walkthrough: [`docs/quickstart.md`](docs/quickstart.md).
-## 14 verbs
+## 16 verbs
 | Lifecycle | Observation |
 |---|---|
 | `init` / `install` / `uninstall` | `peek` — state snapshot |
 | `start` / `stop` / `kill` / `cancel` | `watch` — peek in a refresh loop |
-| `restart` / `status` | `monitor` — 11 detectors, alerts, auto-stop |
+| `restart` / `status` | `monitor` — 12 detectors, alerts, auto-stop |
 | `round` / `serve` / `upgrade` | `events` — query / stream events.jsonl |
 Verb reference: [`docs/commands.md`](docs/commands.md).
@@ -69,11 +69,11 @@ guards it. Highlights:
 Full list and rationale: [`docs/architecture.md`](docs/architecture.md).
-## Monitor: 11 detectors
+## Monitor: 12 detectors
 Notify only: `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`,
 `mem_pressure`, `smoke_fail_rate`, `network_fail`, `rate_limit_active`,
-`anomaly_repetitive_active`.
+`anomaly_repetitive_active`, `supervisor_stale`.
 **Auto-stop the service** (continuing is harmful):
 - `oauth_fail` — burning API quota on auth-rejected rounds

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/README.zh.md RENAMED

@@ -20,9 +20,9 @@ supervisor restarts; this is the core pattern. 11 defenses are woven in, avoiding
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3: Witness (monitor)               │  9 detectors + auto-stop
+│ Layer 3: Witness (monitor)               │  12 detectors + auto-stop
 ├──────────────────────────────────────────┤
-│ Layer 2: Loop (serve, ~60 LOC shell)     │  traps signals, relaunches rounds in a loop
+│ Layer 2: Loop (serve, ~120 LOC shell)    │  traps signals, relaunches rounds in a loop
 ├──────────────────────────────────────────┤
 │ Layer 1: Round (round)                   │  runs the agent once, then exits
 └──────────────────────────────────────────┘
@@ -57,14 +57,14 @@ agent-runner monitor              # live anomaly detection, OAuth/disk critical
 Full walkthrough: [`docs/quickstart.md`](docs/quickstart.md).
-## 13 verbs
+## 16 verbs
 | Lifecycle | Observation |
 |---|---|
 | `init` / `install` / `uninstall` | `peek` - project state snapshot |
 | `start` / `stop` / `kill` / `cancel` | `watch` - peek in a refresh loop |
-| `restart` / `status` | `monitor` - 9 detectors + alerts + auto-stop |
-| `round` / `serve` | |
+| `restart` / `status` | `monitor` - 12 detectors + alerts + auto-stop |
+| `round` / `serve` / `upgrade` | `events` - query / stream events.jsonl |
 **The three stop verbs** have clearly layered semantics:
 - `stop` - graceful; waits for the current round to finish before exiting (most common)
@@ -95,11 +95,12 @@ agent-runner monitor              # live anomaly detection, OAuth/disk critical
 Full list + historical provenance: [`docs/architecture.md`](docs/architecture.md).
-## Monitor: 9 detectors
+## Monitor: 12 detectors
 **Notify only** (warning level; the service keeps running):
 `timeout_rate` / `hung` / `orphan_chain` / `disk_warning` /
-`mem_pressure` / `smoke_fail_rate` / `network_fail`
+`mem_pressure` / `smoke_fail_rate` / `network_fail` / `rate_limit_active` /
+`anomaly_repetitive_active` / `supervisor_stale`
 **Auto-stop** (critical level; continuing is net negative):

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/agent_runner/__init__.py RENAMED

@@ -20,11 +20,6 @@ _HOOK_GROUPS = (
 # Surfaced via peek --json `plugins.disabled` for operator visibility.
 _DISABLED_PLUGIN_NAMES: list[str] = []
-# Plugin name aliases for back-compat: old entry-point name -> canonical name.
-_PLUGIN_NAME_ALIASES: dict[str, str] = {
-    "claude_rate_limit_detector": "claude_error_detector",  # 0.1.20 -> 0.1.23 rename
-}
 def _load_plugins_from_group(group: str) -> None:
     """Discover and load entry_points in ``group``, isolating per-plugin failures.
@@ -98,9 +93,6 @@ def apply_plugin_disable(names: list[str]) -> None:
     if not names:
         return
-    # Translate aliases so old config names keep working
-    names = [_PLUGIN_NAME_ALIASES.get(n, n) for n in names]
     global _DISABLED_PLUGIN_NAMES
     _DISABLED_PLUGIN_NAMES = list(names)

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/agent_runner/_version.py RENAMED

@@ -18,7 +18,7 @@ version_tuple: tuple[int | str, ...]
 commit_id: str | None
 __commit_id__: str | None
-__version__ = version = '0.1.34'
-__version_tuple__ = version_tuple = (0, 1, 34)
+__version__ = version = '0.1.36'
+__version_tuple__ = version_tuple = (0, 1, 36)
 __commit_id__ = commit_id = None

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/agent_runner/api.py RENAMED

@@ -452,6 +452,7 @@ def _poll_once(project: str | Path, *, host: str | None) -> list[monitor.Alert]:
         metrics=metrics,
         log_tails=log_tails,
         round_timeout_s=cfg.runtime.round_timeout_s,
+        supervisor_stale_threshold_s=cfg.monitor.supervisor_stale_threshold_s,
         auth_fail_patterns=cfg.monitor.auth_fail_patterns,
         auth_fail_hint=cfg.monitor.auth_fail_hint,
         phases_overrides=cfg.phases.overrides if cfg.phases.overrides else None,

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/agent_runner/builtin_plugins/claude_rate_limit.py RENAMED

@@ -7,10 +7,9 @@ with computed reset_at_epoch. Supervisor consumes the event.
 Also emits agent_usage_recorded per-round with token/cost data from the
 claude result event (0.1.24+).
-Naming history: was `claude_rate_limit_detector` in 0.1.20 (single-purpose
-rate-limit detector). Renamed + generalized to multi-classification in 0.1.23.
-Old plugin name `claude_rate_limit_detector` retained as entry-point alias
-via pyproject.toml.
+Module name is historical: the original 0.1.20 single-purpose
+rate-limit detector was generalized to multi-classification in 0.1.23
+(class + entry-point renamed to `claude_error_detector`; module path kept).
 """
 from __future__ import annotations

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/agent_runner/config.py RENAMED

@@ -141,6 +141,12 @@ class MonitorConfig:
     anomaly_repetitive_threshold: int = 0  # 0 = disabled
     host_health: MonitorHostHealthConfig = field(default_factory=MonitorHostHealthConfig)
     round_progress_interval_s: int = 0  # 0 = disabled; >0 = emit round_progress every N seconds
+    supervisor_stale_threshold_s: int | None = None
+    """Staleness deadline for the supervisor_stale detector (seconds).
+    None (unset) → derived default round_timeout_s * 1.5.
+    Positive int → explicit threshold. 0 → disable the detector.
+    """
 @dataclass(frozen=True)
@@ -467,6 +473,14 @@ def load_config(toml_path: Path) -> Config:
             monitor_d.get("round_progress_interval_s", 0),
             field="monitor.round_progress_interval_s",
         ),
+        supervisor_stale_threshold_s=(
+            None
+            if monitor_d.get("supervisor_stale_threshold_s") is None
+            else _require_non_negative_int(
+                monitor_d["supervisor_stale_threshold_s"],
+                field="monitor.supervisor_stale_threshold_s",
+            )
+        ),
     )
     plugins_raw = dict(raw.get("plugins") or {})  # copy so we can pop
     disable = list(plugins_raw.pop("disable", []))

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/agent_runner/monitor.py RENAMED

@@ -1,6 +1,6 @@
 """Monitor — anomaly detectors over events + metrics + log tails.
-11 built-in detectors. Two trigger ``auto_action="stop_service"``:
+12 built-in detectors. Two trigger ``auto_action="stop_service"``:
   * oauth_fail  — auth pattern in short-exit logs (retrying burns API quota)
   * disk_critical — disk_used_pct > 95% (writing more risks corruption)
@@ -54,6 +54,7 @@ KNOWN_ALERT_KINDS: frozenset[str] = frozenset(
         "network_fail",
         "rate_limit_active",
         "anomaly_repetitive_active",
+        "supervisor_stale",
     }
 )
@@ -429,6 +430,39 @@ def detect_anomaly_repetitive_active(
     )
+def detect_supervisor_stale(
+    events: list[dict[str, Any]],
+    *,
+    now: datetime,
+    stale_threshold_s: int,
+) -> Alert | None:
+    """Alert when the most recent event is older than ``stale_threshold_s``.
+    Catches supervisor "silent-death": stuck between rounds (after round_end,
+    before the next round_start) emitting no events. The event stream cannot
+    distinguish that from a normal idle gap — only a deadline check can.
+    ``stale_threshold_s <= 0`` disables the check (caller resolves the
+    sentinel). Empty event list → no alert: that is "never started", not
+    silent-death, and there is no baseline to measure staleness against.
+    """
+    if stale_threshold_s <= 0 or not events:
+        return None
+    last_ts_str = max((e["ts"] for e in events if "ts" in e), default=None)
+    if last_ts_str is None:
+        return None
+    age_s = (now - parse_iso_ms(last_ts_str)).total_seconds()
+    if age_s <= stale_threshold_s:
+        return None
+    return _alert(
+        "supervisor_stale",
+        "warning",
+        f"No events for {int(age_s)}s (threshold {stale_threshold_s}s) — "
+        f"supervisor may be stuck or dead. Last event: {last_ts_str}.",
+        {"age_s": int(age_s), "threshold_s": stale_threshold_s, "last_ts": last_ts_str},
+    )
 # ---------------------------------------------------------------------------
 # State-tree assembly (Task 3.2)
 # ---------------------------------------------------------------------------
@@ -535,6 +569,7 @@ def run_all_detectors(
     metrics: list[dict[str, Any]],
     log_tails: dict[int, str],
     round_timeout_s: int = 1800,
+    supervisor_stale_threshold_s: int | None = None,
     now: datetime | None = None,
     auth_fail_patterns: list[str] | None = None,
     auth_fail_hint: str | None = None,
@@ -543,12 +578,17 @@ def run_all_detectors(
     disk_warning_pct: float = 90.0,
     disk_critical_pct: float = 95.0,
 ) -> list[Alert]:
-    """Run all 11 detectors; returns alerts (empty = healthy)."""
+    """Run all 12 detectors; returns alerts (empty = healthy)."""
     if now is None:
         now = datetime.now(UTC)
     compiled_auth_pats = (
         [re.compile(p, re.IGNORECASE) for p in auth_fail_patterns] if auth_fail_patterns else None
     )
+    effective_stale_s = (
+        int(round_timeout_s * 1.5)
+        if supervisor_stale_threshold_s is None
+        else supervisor_stale_threshold_s
+    )
     candidates = [
         detect_timeout_rate(events),
         detect_hung(
@@ -568,6 +608,7 @@ def run_all_detectors(
         detect_network_fail(events, log_tails),
         detect_rate_limit_active(events, now=now.timestamp()),
         detect_anomaly_repetitive_active(events),
+        detect_supervisor_stale(events, now=now, stale_threshold_s=effective_stale_s),
     ]
     return [a for a in candidates if a is not None]

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/docs/architecture.md RENAMED

@@ -65,13 +65,14 @@ surfacing everywhere.
 | `event_kind_registry` | Prevent events.emit() typos / unregistered kinds slipping past CI | `tests/invariants/test_event_kind_registry.py` |
 <!-- /gen:defenses-table -->
-## Monitor: 11 detectors
+## Monitor: 12 detectors
 Three categories by `auto_action`:
 **Notify only** (severity `warning`):
 `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`, `mem_pressure`,
-`smoke_fail_rate`, `network_fail`.
+`smoke_fail_rate`, `network_fail`, `rate_limit_active`,
+`anomaly_repetitive_active`, `supervisor_stale`.
 **Auto-stop service** (severity `critical`, `auto_action="stop_service"`):
 `oauth_fail`, `disk_critical`. Continuing in either state is harmful (burning
@@ -88,6 +89,7 @@ API quota / writing to a near-full disk).
 - `orphan_chain`
 - `rate_limit_active`
 - `smoke_fail_rate`
+- `supervisor_stale`
 - `timeout_rate`
 <!-- /gen:detector-list -->

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/docs/commands.md RENAMED

@@ -117,7 +117,7 @@ agent-runner events --kind transient_error_backoff_capped --tail
 ### `agent-runner monitor [--host SSH-ALIAS] [--interval N] [--json]`
-Anomaly-detection daemon. Runs the 11 detectors against the live state on every
+Anomaly-detection daemon. Runs the 12 detectors against the live state on every
 poll. Without `--host`, watches local logs at default 30s interval. With
 `--host`, watches a remote agent-runner over plain ssh at default 60s interval.
@@ -133,7 +133,7 @@ agent-runner monitor --json | jq -c        # pipe alerts to a downstream consume
 ## Chinese summary
-13 verbs: `init / install / uninstall / start / stop / kill / cancel / restart / status / round / serve / peek / watch / monitor`.
+16 verbs: `init / install / uninstall / start / stop / kill / cancel / restart / status / round / serve / upgrade / peek / watch / events / monitor`.
 The observation verbs (peek/watch/monitor) offer three symmetric views and all share the `--round / --log / --events / --select / --json` drill-down flags.

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/docs/configuration.md RENAMED

@@ -80,6 +80,7 @@ running with newly-set `dirty_action = "auto_commit"` is undefined).
 | `anomaly_repetitive_threshold` | `int` | 0 |
 | `host_health` | `MonitorHostHealthConfig` | MonitorHostHealthConfig(mem_avail_min_mb=200, disk_warning_pct=90.0, disk_critical_pct=95.0) |
 | `round_progress_interval_s` | `int` | 0 |
+| `supervisor_stale_threshold_s` | `int | None` | None |
 <!-- /gen:config-schema -->
 ### `vcs.dirty_action`
@@ -203,6 +204,7 @@ Unconfigured phases (and configs without `[phases]`) keep using the global
 [monitor]
 auto_stop_on = ["oauth_fail", "disk_critical"]
 round_progress_interval_s = 0  # 0 = disabled; set >0 to emit round_progress heartbeat events
+# supervisor_stale_threshold_s = 2700  # unset = round_timeout_s * 1.5; 0 = disable
 [monitor.host_health]
 mem_avail_min_mb = 200        # mem_pressure fires when mem_available_mb < this

cli_agent_runner-0.1.36/docs/migrations/0.1.35.md ADDED

@@ -0,0 +1,97 @@
+# 0.1.35 — `claude_rate_limit_detector` alias removed
+**Date**: 2026-05-20
+## What changed
+The `claude_rate_limit_detector` alias (introduced 0.1.20, kept as
+back-compat after the 0.1.23 rename to `claude_error_detector`) was
+hard-removed in 0.1.35 at **both** layers:
+1. **Entry-point level** (`pyproject.toml`): the old key is no longer
+   declared, so `importlib.metadata.entry_points(group=...)` no longer
+   returns it.
+2. **Config-level alias mapping** (`agent_runner/__init__.py`): the
+   `_PLUGIN_NAME_ALIASES` dict that previously auto-translated the old
+   name in `[plugins] disable/enable` was deleted. Stale TOML entries
+   now trigger the existing typo-catcher UserWarning.
+The underlying class (`ClaudeErrorDetector`, module
+`agent_runner.builtin_plugins.claude_rate_limit`) is unchanged; only
+the alias surface was removed.
+## Migration (one-line edit, only if you used the old name)
+Search your `agent-runner.toml` for `claude_rate_limit_detector`:
+```bash
+grep -nE 'claude_rate_limit_detector' agent-runner.toml \
+    && echo "switch to claude_error_detector" \
+    || echo "no migration needed"
+```
+If found, switch:
+```diff
+ [plugins]
+- disable = ["claude_rate_limit_detector"]
++ disable = ["claude_error_detector"]
+```
+Same for `enable = [...]` if you opted-in by name.
+The behavior is identical — same class, same hooks, same events emitted.
+## Migration (consumers of `importlib.metadata.entry_points`)
+If any of your code introspects the entry-point group:
+```python
+from importlib.metadata import entry_points
+hooks = entry_points(group="agent_runner.post_round_hooks")
+```
+The old key `claude_rate_limit_detector` is no longer in the result.
+Either filter against the canonical name (`claude_error_detector`) or
+iterate by `.value` (module:class target) which has been unchanged
+since 0.1.20.
+## Why hard-cut (no deprecation cycle)
+The 0.1.23 rename is 12 releases / ~6 days old (as measured by the rapid
+0.1.25-0.1.34 ship cadence). The Argus Gateway team — our only known
+production consumer — explicitly migrated away from the old name during
+their 2026-05-19/20 monitoring overhaul. Carrying an unused back-compat
+entry-point line indefinitely is debt; per project policy
+(`docs/thesis.md` — zero tech debt per release), we hard-cut.
+This matches the precedent set in 0.1.29 (legacy `rate_limit_*` event
+aliases removed) and 0.1.34 (`peek --select events.<kind>` selector
+removed): when consumers are migrated, the back-compat layer is debt
+that compounds.
+## What did NOT change
+- `claude_error_detector` (canonical entry-point key since 0.1.23) — unchanged
+- `gemini_error_detector` — unchanged
+- The plugin's class name (`ClaudeErrorDetector`) — unchanged
+- All events emitted by the plugin (`transient_error_detected` /
+  `agent_usage_recorded` / `anomaly_repetitive_tool`) — unchanged
+- Config schema — no new TOML keys
+- Public Python API — unchanged
+## Verification
+After upgrade:
+```bash
+agent-runner peek --json | jq '.plugins.post_round_hooks'
+```
+Expected output:
+```json
+["claude_error_detector", "gemini_error_detector"]
+```
+(Order may vary; the key thing is `claude_rate_limit_detector` is not present.)

cli_agent_runner-0.1.36/docs/migrations/0.1.36.md ADDED

@@ -0,0 +1,73 @@
+# Migrating to 0.1.36
+## TL;DR
+```bash
+pip install --upgrade cli-agent-runner==0.1.36
+```
+No action required. The new `supervisor_stale` detector is ON by default with
+a derived threshold and is notify-only — it never stops your service.
+## What changed
+0.1.36 adds a 12th monitor detector, `supervisor_stale`, that closes a
+liveness blind spot: a supervisor that hangs *between* rounds (after a round
+ends, before the next one starts) emits no events. The event stream cannot
+tell a permanent silence from a normal idle gap, and `detect_hung` only
+covers a round that *started* and then hung mid-execution. `supervisor_stale`
+watches the age of the most recent event and alerts when it exceeds a
+staleness deadline.
+## Default behavior (no action needed)
+- ON by default.
+- Threshold derives from `round_timeout_s * 1.5` — comfortably above the
+  longest legitimate inter-event gap (a round running to full timeout, plus
+  restart delay), so it does not false-positive on healthy systems.
+- Notify-only: it emits an alert, never an auto-stop. A stuck or dead
+  supervisor cannot honor an auto-stop anyway; the alert is for a human or an
+  external watchdog.
+## Tuning (optional)
+Set `[monitor] supervisor_stale_threshold_s` when the derived default does not
+fit your project's cadence:
+```toml
+[monitor]
+supervisor_stale_threshold_s = 3600   # explicit seconds
+# supervisor_stale_threshold_s = 0    # disable the detector
+# (unset)                             # derive round_timeout_s * 1.5
+```
+- **Very short rounds with occasional long legitimate gaps** (e.g. 2-minute
+  rounds plus a periodic maintenance pause): set a value higher than derived.
+- **Phase overrides that raise `round_timeout_s`** for some phase: the derived
+  threshold uses the *base* `round_timeout_s`, so a round in a longer-timeout
+  phase can exceed `base * 1.5`. Set
+  `supervisor_stale_threshold_s >= max_phase_timeout * 1.5`.
+## The liveness architecture (important)
+A monitor on the *same host* as the supervisor dies when that host dies — it
+cannot report its own host's death. For true liveness coverage, run the
+monitor from a **separate machine**:
+```bash
+# On your laptop / a second host, not on the supervised host:
+agent-runner monitor --host pi
+```
+That catches both failure modes: a stuck supervisor on a live host
+(`supervisor_stale`, events frozen) and a dead host or severed network (SSH
+poll fails → `monitor_remote_giveup`).
+## What did NOT change
+- The existing 11 detectors are unchanged.
+- `detect_hung` still covers the in-round hang case.
+- `round_timeout_s` is unchanged; the staleness threshold derives from it but
+  does not modify it.
+- No new core event kind: `supervisor_stale` is a monitor *alert kind*, surfaced
+  through the existing monitor alert path.

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/docs/plugins.md RENAMED

@@ -266,13 +266,16 @@ agent-runner ships two built-in `post_round_hooks` plugins registered
 automatically via their own entry-points: `claude_error_detector` (below)
 and `gemini_error_detector` (0.1.24+, parallel for gemini CLI).
-### `claude_error_detector` (0.1.23+, formerly `claude_rate_limit_detector`)
+### `claude_error_detector` (0.1.23+)
 **Entry-point group:** `agent_runner.post_round_hooks`
 **Module:** `agent_runner.builtin_plugins.claude_rate_limit`
-**Old name:** `claude_rate_limit_detector` retained as an alias in
-`pyproject.toml` so `[plugins] disable = ["claude_rate_limit_detector"]`
-still works for back-compat.
+Renamed from `claude_rate_limit_detector` in 0.1.23 when the detector
+was generalized from single-rate-limit to multi-classification. The
+old-name alias was kept as a `pyproject.toml` entry-point through 0.1.34
+and removed in 0.1.35. Operators still using `[plugins] disable =
+["claude_rate_limit_detector"]` must switch to `claude_error_detector`.
 After each round, scans the last 50 lines of the round's JSONL log for
 transient errors and usage data:

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/docs/runbook.md RENAMED

@@ -45,9 +45,10 @@ correctly (process still runs as your user, not root).
 ### Health check
 ```bash
-agent-runner status              # service running?
-agent-runner peek                # full state snapshot
-agent-runner peek --json | jq .defenses    # what's defended
+agent-runner status                                       # service running?
+agent-runner peek                                         # full state snapshot
+agent-runner peek --json | jq .defenses                   # what's defended
+agent-runner peek --json | jq .system.agent_process_count # orphan agent count (0.1.34+)
 journalctl --user -u agent-runner@<project> --since "1 hour ago"
 ```
@@ -252,6 +253,29 @@ API. Power profile:
   real state change. Verify the detector logic and thresholds before enabling
   `auto_stop` on a production remote.
+### Liveness monitoring: run monitor from a separate machine
+`agent-runner monitor` detects anomalies including `supervisor_stale` — the
+supervisor stopped emitting events because it is stuck between rounds or dead.
+But a monitor running on the *same host* as the supervisor dies when that host
+dies, so it cannot report its own host's death.
+For true liveness coverage, run the monitor from a **separate machine**:
+    # On your laptop / a second host, NOT on the supervised host:
+    agent-runner monitor --host pi
+This catches both failure modes:
+- Supervisor stuck on a live host → `supervisor_stale` alert (events frozen).
+- Host itself dead / network gone → SSH poll fails → `monitor_remote_giveup`.
+The `supervisor_stale` threshold defaults to `round_timeout_s * 1.5`. Override
+with `[monitor] supervisor_stale_threshold_s = N` for projects whose legitimate
+cadence — very short rounds with occasional long legitimate gaps, or phase
+overrides that raise `round_timeout_s` — does not fit the derived default. Set
+to `0` to disable the detector entirely.
 ## Live event stream (machine-readable)
 For machine consumption (parity comparisons, custom dashboards, automation
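
The liveness section added to `docs/runbook.md` notes that phase overrides raising `round_timeout_s` may not fit the derived default, because the derivation uses the base timeout only. A minimal sketch of the arithmetic, using the `max_phase_timeout * 1.5` rule of thumb stated in `docs/migrations/0.1.36.md`. `recommended_stale_threshold` and the example timeouts are hypothetical, chosen only for illustration.

```python
import math


def recommended_stale_threshold(base_timeout_s: int, phase_timeouts_s: list[int]) -> int:
    """Suggest a supervisor_stale_threshold_s covering the longest phase.

    Hypothetical helper: applies the 1.5x factor to the largest timeout
    among the base value and any per-phase overrides.
    """
    longest = max([base_timeout_s, *phase_timeouts_s])
    return math.ceil(longest * 1.5)


base = 1800            # base round_timeout_s (the default shown in monitor.py)
phases = [3600]        # assumed: one phase raises the timeout to an hour

print(int(base * 1.5))                              # 2700, the derived default
print(recommended_stale_threshold(base, phases))    # 5400, covers the long phase
```

With these example numbers a round in the one-hour phase could legitimately run past the derived 2700 s deadline, so an explicit `supervisor_stale_threshold_s = 5400` would avoid a false alert.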

{cli_agent_runner-0.1.34 → cli_agent_runner-0.1.36}/pyproject.toml RENAMED

@@ -45,9 +45,8 @@ Changelog = "https://github.com/wan9yu/cli-agent-runner/blob/main/CHANGELOG.md"
 agent-runner = "agent_runner.cli:main"
 [project.entry-points."agent_runner.post_round_hooks"]
-claude_rate_limit_detector = "agent_runner.builtin_plugins.claude_rate_limit:ClaudeErrorDetector"  # 0.1.20 alias
-claude_error_detector = "agent_runner.builtin_plugins.claude_rate_limit:ClaudeErrorDetector"      # 0.1.23 canonical
-gemini_error_detector = "agent_runner.builtin_plugins.gemini:GeminiErrorDetector"                 # 0.1.24
+claude_error_detector = "agent_runner.builtin_plugins.claude_rate_limit:ClaudeErrorDetector"
+gemini_error_detector = "agent_runner.builtin_plugins.gemini:GeminiErrorDetector"
 [project.optional-dependencies]
 dev = [
