PyPI version diff: cli-agent-runner 0.1.35 (tar.gz) → 0.1.36 (tar.gz)

Files changed (221)

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/CHANGELOG.md RENAMED

@@ -7,6 +7,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.1.36] - 2026-05-21
+### Added
+- New monitor detector `supervisor_stale` (notify) — alerts when the supervisor stops emitting events (stuck between rounds or dead), a blind spot the event stream and `detect_hung` cannot catch. Default ON; threshold derives from `round_timeout_s * 1.5`. Detector count 11 → 12.
+- `[monitor] supervisor_stale_threshold_s` config — override the derived staleness threshold (positive = seconds; 0 = disable; unset = derived).
+### Changed
+- `docs/runbook.md` documents the liveness-monitoring architecture: run `monitor --host` from a separate machine to detect supervisor silent-death AND host death (a same-host monitor dies with its host).
 ## [0.1.35] - 2026-05-20
 ### Removed

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: cli-agent-runner
-Version: 0.1.35
+Version: 0.1.36
 Summary: Restart-on-exit supervisor for autonomous CLI agents
 Project-URL: Homepage, https://github.com/wan9yu/cli-agent-runner
 Project-URL: Documentation, https://github.com/wan9yu/cli-agent-runner#readme
@@ -49,7 +49,7 @@ full disks, runaway memory.
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3: The Witness (monitor)           │  11 detectors + auto-stop
+│ Layer 3: The Witness (monitor)           │  12 detectors + auto-stop
 ├──────────────────────────────────────────┤
 │ Layer 2: The Loop (serve, ~120 LOC)      │  signal-trapping restart loop
 ├──────────────────────────────────────────┤
@@ -86,7 +86,7 @@ Full walkthrough: [`docs/quickstart.md`](docs/quickstart.md).
 |---|---|
 | `init` / `install` / `uninstall` | `peek` — state snapshot |
 | `start` / `stop` / `kill` / `cancel` | `watch` — peek in a refresh loop |
-| `restart` / `status` | `monitor` — 11 detectors, alerts, auto-stop |
+| `restart` / `status` | `monitor` — 12 detectors, alerts, auto-stop |
 | `round` / `serve` / `upgrade` | `events` — query / stream events.jsonl |
 Verb reference: [`docs/commands.md`](docs/commands.md).
@@ -106,11 +106,11 @@ guards it. Highlights:
 Full list and rationale: [`docs/architecture.md`](docs/architecture.md).
-## Monitor: 11 detectors
+## Monitor: 12 detectors
 Notify only: `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`,
 `mem_pressure`, `smoke_fail_rate`, `network_fail`, `rate_limit_active`,
-`anomaly_repetitive_active`.
+`anomaly_repetitive_active`, `supervisor_stale`.
 **Auto-stop the service** (continuing is harmful):
 - `oauth_fail` — burning API quota on auth-rejected rounds

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/README.md RENAMED

@@ -12,7 +12,7 @@ full disks, runaway memory.
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3: The Witness (monitor)           │  11 detectors + auto-stop
+│ Layer 3: The Witness (monitor)           │  12 detectors + auto-stop
 ├──────────────────────────────────────────┤
 │ Layer 2: The Loop (serve, ~120 LOC)      │  signal-trapping restart loop
 ├──────────────────────────────────────────┤
@@ -49,7 +49,7 @@ Full walkthrough: [`docs/quickstart.md`](docs/quickstart.md).
 |---|---|
 | `init` / `install` / `uninstall` | `peek` — state snapshot |
 | `start` / `stop` / `kill` / `cancel` | `watch` — peek in a refresh loop |
-| `restart` / `status` | `monitor` — 11 detectors, alerts, auto-stop |
+| `restart` / `status` | `monitor` — 12 detectors, alerts, auto-stop |
 | `round` / `serve` / `upgrade` | `events` — query / stream events.jsonl |
 Verb reference: [`docs/commands.md`](docs/commands.md).
@@ -69,11 +69,11 @@ guards it. Highlights:
 Full list and rationale: [`docs/architecture.md`](docs/architecture.md).
-## Monitor: 11 detectors
+## Monitor: 12 detectors
 Notify only: `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`,
 `mem_pressure`, `smoke_fail_rate`, `network_fail`, `rate_limit_active`,
-`anomaly_repetitive_active`.
+`anomaly_repetitive_active`, `supervisor_stale`.
 **Auto-stop the service** (continuing is harmful):
 - `oauth_fail` — burning API quota on auth-rejected rounds

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/README.zh.md RENAMED

@@ -20,7 +20,7 @@ supervisor restarts: this is the core pattern. 11 defenses are woven in between, avoiding
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3: Witness (monitor)               │  11 detectors + auto-stop
+│ Layer 3: Witness (monitor)               │  12 detectors + auto-stop
 ├──────────────────────────────────────────┤
 │ Layer 2: Loop (serve, ~120 LOC shell)    │  traps signals, relaunches rounds in a loop
 ├──────────────────────────────────────────┤
@@ -63,7 +63,7 @@ agent-runner monitor              # real-time anomaly detection, OAuth/disk critical
 |---|---|
 | `init` / `install` / `uninstall` | `peek` - project state snapshot |
 | `start` / `stop` / `kill` / `cancel` | `watch` - peek in a refresh loop |
-| `restart` / `status` | `monitor` - 11 detectors + alerts + auto-stop |
+| `restart` / `status` | `monitor` - 12 detectors + alerts + auto-stop |
 | `round` / `serve` / `upgrade` | `events` - query / stream events.jsonl |
 **The three stop verbs** have clearly layered semantics:
@@ -95,11 +95,12 @@ agent-runner monitor              # real-time anomaly detection, OAuth/disk critical
 Full list and historical origins: [`docs/architecture.md`](docs/architecture.md).
-## Monitor: 9 detectors
+## Monitor: 12 detectors
 **Notify only** (warning level, the service keeps running):
 `timeout_rate` / `hung` / `orphan_chain` / `disk_warning` /
-`mem_pressure` / `smoke_fail_rate` / `network_fail`
+`mem_pressure` / `smoke_fail_rate` / `network_fail` / `rate_limit_active` /
+`anomaly_repetitive_active` / `supervisor_stale`
 **Auto-stop the service** (critical level, continuing is a net negative):

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/agent_runner/_version.py RENAMED

@@ -18,7 +18,7 @@ version_tuple: tuple[int | str, ...]
 commit_id: str | None
 __commit_id__: str | None
-__version__ = version = '0.1.35'
-__version_tuple__ = version_tuple = (0, 1, 35)
+__version__ = version = '0.1.36'
+__version_tuple__ = version_tuple = (0, 1, 36)
 __commit_id__ = commit_id = None

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/agent_runner/api.py RENAMED

@@ -452,6 +452,7 @@ def _poll_once(project: str | Path, *, host: str | None) -> list[monitor.Alert]:
         metrics=metrics,
         log_tails=log_tails,
         round_timeout_s=cfg.runtime.round_timeout_s,
+        supervisor_stale_threshold_s=cfg.monitor.supervisor_stale_threshold_s,
         auth_fail_patterns=cfg.monitor.auth_fail_patterns,
         auth_fail_hint=cfg.monitor.auth_fail_hint,
         phases_overrides=cfg.phases.overrides if cfg.phases.overrides else None,

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/agent_runner/config.py RENAMED

@@ -141,6 +141,12 @@ class MonitorConfig:
     anomaly_repetitive_threshold: int = 0  # 0 = disabled
     host_health: MonitorHostHealthConfig = field(default_factory=MonitorHostHealthConfig)
     round_progress_interval_s: int = 0  # 0 = disabled; >0 = emit round_progress every N seconds
+    supervisor_stale_threshold_s: int | None = None
+    """Staleness deadline for the supervisor_stale detector (seconds).
+    None (unset) → derived default round_timeout_s * 1.5.
+    Positive int → explicit threshold. 0 → disable the detector.
+    """
 @dataclass(frozen=True)
@@ -467,6 +473,14 @@ def load_config(toml_path: Path) -> Config:
             monitor_d.get("round_progress_interval_s", 0),
             field="monitor.round_progress_interval_s",
         ),
+        supervisor_stale_threshold_s=(
+            None
+            if monitor_d.get("supervisor_stale_threshold_s") is None
+            else _require_non_negative_int(
+                monitor_d["supervisor_stale_threshold_s"],
+                field="monitor.supervisor_stale_threshold_s",
+            )
+        ),
     )
     plugins_raw = dict(raw.get("plugins") or {})  # copy so we can pop
     disable = list(plugins_raw.pop("disable", []))
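
Reviewer note on the config.py hunks above: the new field is tri-state (unset, zero, positive), and the derivation itself happens later in `run_all_detectors`. The sketch below folds both steps into one hypothetical helper, `resolve_stale_threshold`, purely for illustration; that name does not exist in the package, and the negative-value check only approximates what `_require_non_negative_int` appears to enforce.

```python
from typing import Optional


def resolve_stale_threshold(configured: Optional[int], round_timeout_s: int) -> int:
    """Resolve the tri-state config value to an effective threshold in seconds.

    None     -> derived default, int(round_timeout_s * 1.5)
    0        -> returned as-is; the detector treats <= 0 as "disabled"
    positive -> explicit threshold, used unchanged
    """
    if configured is None:
        return int(round_timeout_s * 1.5)
    if configured < 0:
        # Assumption: mirrors the non-negative validation seen in load_config.
        raise ValueError("supervisor_stale_threshold_s must be >= 0")
    return configured


print(resolve_stale_threshold(None, 1800))  # 2700 (derived)
print(resolve_stale_threshold(600, 1800))   # 600 (explicit)
print(resolve_stale_threshold(0, 1800))     # 0 (disabled downstream)
```

Keeping `None` distinct from `0` is what lets "unset" mean "derive a default" while `0` still means "turn the detector off".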

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/agent_runner/monitor.py RENAMED

@@ -1,6 +1,6 @@
 """Monitor — anomaly detectors over events + metrics + log tails.
-11 built-in detectors. Two trigger ``auto_action="stop_service"``:
+12 built-in detectors. Two trigger ``auto_action="stop_service"``:
   * oauth_fail  — auth pattern in short-exit logs (retrying burns API quota)
   * disk_critical — disk_used_pct > 95% (writing more risks corruption)
@@ -54,6 +54,7 @@ KNOWN_ALERT_KINDS: frozenset[str] = frozenset(
         "network_fail",
         "rate_limit_active",
         "anomaly_repetitive_active",
+        "supervisor_stale",
     }
 )
@@ -429,6 +430,39 @@ def detect_anomaly_repetitive_active(
     )
+def detect_supervisor_stale(
+    events: list[dict[str, Any]],
+    *,
+    now: datetime,
+    stale_threshold_s: int,
+) -> Alert | None:
+    """Alert when the most recent event is older than ``stale_threshold_s``.
+    Catches supervisor "silent-death": stuck between rounds (after round_end,
+    before the next round_start) emitting no events. The event stream cannot
+    distinguish that from a normal idle gap — only a deadline check can.
+    ``stale_threshold_s <= 0`` disables the check (caller resolves the
+    sentinel). Empty event list → no alert: that is "never started", not
+    silent-death, and there is no baseline to measure staleness against.
+    """
+    if stale_threshold_s <= 0 or not events:
+        return None
+    last_ts_str = max((e["ts"] for e in events if "ts" in e), default=None)
+    if last_ts_str is None:
+        return None
+    age_s = (now - parse_iso_ms(last_ts_str)).total_seconds()
+    if age_s <= stale_threshold_s:
+        return None
+    return _alert(
+        "supervisor_stale",
+        "warning",
+        f"No events for {int(age_s)}s (threshold {stale_threshold_s}s) — "
+        f"supervisor may be stuck or dead. Last event: {last_ts_str}.",
+        {"age_s": int(age_s), "threshold_s": stale_threshold_s, "last_ts": last_ts_str},
+    )
 # ---------------------------------------------------------------------------
 # State-tree assembly (Task 3.2)
 # ---------------------------------------------------------------------------
@@ -535,6 +569,7 @@ def run_all_detectors(
     metrics: list[dict[str, Any]],
     log_tails: dict[int, str],
     round_timeout_s: int = 1800,
+    supervisor_stale_threshold_s: int | None = None,
     now: datetime | None = None,
     auth_fail_patterns: list[str] | None = None,
     auth_fail_hint: str | None = None,
@@ -543,12 +578,17 @@ def run_all_detectors(
     disk_warning_pct: float = 90.0,
     disk_critical_pct: float = 95.0,
 ) -> list[Alert]:
-    """Run all 11 detectors; returns alerts (empty = healthy)."""
+    """Run all 12 detectors; returns alerts (empty = healthy)."""
     if now is None:
         now = datetime.now(UTC)
     compiled_auth_pats = (
         [re.compile(p, re.IGNORECASE) for p in auth_fail_patterns] if auth_fail_patterns else None
     )
+    effective_stale_s = (
+        int(round_timeout_s * 1.5)
+        if supervisor_stale_threshold_s is None
+        else supervisor_stale_threshold_s
+    )
     candidates = [
         detect_timeout_rate(events),
         detect_hung(
@@ -568,6 +608,7 @@ def run_all_detectors(
         detect_network_fail(events, log_tails),
         detect_rate_limit_active(events, now=now.timestamp()),
         detect_anomaly_repetitive_active(events),
+        detect_supervisor_stale(events, now=now, stale_threshold_s=effective_stale_s),
     ]
     return [a for a in candidates if a is not None]
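
Reviewer note on the monitor.py hunks above: the deadline check can be tried in isolation with the standalone sketch below. It is not the package's code. `parse_ts` is a hypothetical stand-in for the internal `parse_iso_ms` helper, `last_event_age_s` and `is_stale` are illustrative names, and the sketch compares parsed datetimes rather than raw timestamp strings, which is a deliberate difference from the hunk.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def parse_ts(ts: str) -> datetime:
    # Hypothetical stand-in for parse_iso_ms: accepts "2026-05-21T10:53:20.000Z".
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def last_event_age_s(events: list[dict[str, Any]], now: datetime) -> Optional[float]:
    """Age in seconds of the newest event, or None when no event carries a ts."""
    stamps = [parse_ts(e["ts"]) for e in events if "ts" in e]
    if not stamps:
        return None
    return (now - max(stamps)).total_seconds()


def is_stale(events: list[dict[str, Any]], now: datetime, stale_threshold_s: int) -> bool:
    """True only when the detector is enabled and the newest event is too old."""
    if stale_threshold_s <= 0:
        return False  # 0 (or negative) disables the check
    age = last_event_age_s(events, now)
    return age is not None and age > stale_threshold_s


now = datetime(2026, 5, 21, 12, 0, 0, tzinfo=timezone.utc)
old = [{"event": "round_end", "ts": "2026-05-21T10:53:20.000Z"}]
fresh = [{"event": "round_end", "ts": "2026-05-21T11:58:20.000Z"}]

print(last_event_age_s(old, now))   # 4000.0
print(is_stale(old, now, 2700))     # True
print(is_stale(fresh, now, 2700))   # False
print(is_stale([], now, 2700))      # False (never started, no baseline)
```

The timestamps match the ones used in the new unit tests, so the 4000 s and 100 s gaps can be checked against the test comments.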

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/docs/architecture.md RENAMED

@@ -65,13 +65,14 @@ surfacing everywhere.
 | `event_kind_registry` | Prevent events.emit() typos / unregistered kinds slipping past CI | `tests/invariants/test_event_kind_registry.py` |
 <!-- /gen:defenses-table -->
-## Monitor: 11 detectors
+## Monitor: 12 detectors
 Three categories by `auto_action`:
 **Notify only** (severity `warning`):
 `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`, `mem_pressure`,
-`smoke_fail_rate`, `network_fail`.
+`smoke_fail_rate`, `network_fail`, `rate_limit_active`,
+`anomaly_repetitive_active`, `supervisor_stale`.
 **Auto-stop service** (severity `critical`, `auto_action="stop_service"`):
 `oauth_fail`, `disk_critical`. Continuing in either state is harmful (burning
@@ -88,6 +89,7 @@ API quota / writing to a near-full disk).
 - `orphan_chain`
 - `rate_limit_active`
 - `smoke_fail_rate`
+- `supervisor_stale`
 - `timeout_rate`
 <!-- /gen:detector-list -->

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/docs/commands.md RENAMED

@@ -117,7 +117,7 @@ agent-runner events --kind transient_error_backoff_capped --tail
 ### `agent-runner monitor [--host SSH-ALIAS] [--interval N] [--json]`
-Anomaly-detection daemon. Runs the 11 detectors against the live state on every
+Anomaly-detection daemon. Runs the 12 detectors against the live state on every
 poll. Without `--host`, watches local logs at default 30s interval. With
 `--host`, watches a remote agent-runner over plain ssh at default 60s interval.

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/docs/configuration.md RENAMED

@@ -80,6 +80,7 @@ running with newly-set `dirty_action = "auto_commit"` is undefined).
 | `anomaly_repetitive_threshold` | `int` | 0 |
 | `host_health` | `MonitorHostHealthConfig` | MonitorHostHealthConfig(mem_avail_min_mb=200, disk_warning_pct=90.0, disk_critical_pct=95.0) |
 | `round_progress_interval_s` | `int` | 0 |
+| `supervisor_stale_threshold_s` | `int | None` | None |
 <!-- /gen:config-schema -->
 ### `vcs.dirty_action`
@@ -203,6 +204,7 @@ Unconfigured phases (and configs without `[phases]`) keep using the global
 [monitor]
 auto_stop_on = ["oauth_fail", "disk_critical"]
 round_progress_interval_s = 0  # 0 = disabled; set >0 to emit round_progress heartbeat events
+# supervisor_stale_threshold_s = 2700  # unset = round_timeout_s * 1.5; 0 = disable
 [monitor.host_health]
 mem_avail_min_mb = 200        # mem_pressure fires when mem_available_mb < this

cli_agent_runner-0.1.36/docs/migrations/0.1.36.md ADDED

@@ -0,0 +1,73 @@
+# Migrating to 0.1.36
+## TL;DR
+```bash
+pip install --upgrade cli-agent-runner==0.1.36
+```
+No action required. The new `supervisor_stale` detector is ON by default with
+a derived threshold and is notify-only — it never stops your service.
+## What changed
+0.1.36 adds a 12th monitor detector, `supervisor_stale`, that closes a
+liveness blind spot: a supervisor that hangs *between* rounds (after a round
+ends, before the next one starts) emits no events. The event stream cannot
+tell a permanent silence from a normal idle gap, and `detect_hung` only
+covers a round that *started* and then hung mid-execution. `supervisor_stale`
+watches the age of the most recent event and alerts when it exceeds a
+staleness deadline.
+## Default behavior (no action needed)
+- ON by default.
+- Threshold derives from `round_timeout_s * 1.5` — comfortably above the
+  longest legitimate inter-event gap (a round running to full timeout, plus
+  restart delay), so it does not false-positive on healthy systems.
+- Notify-only: it emits an alert, never an auto-stop. A stuck or dead
+  supervisor cannot honor an auto-stop anyway; the alert is for a human or an
+  external watchdog.
+## Tuning (optional)
+Set `[monitor] supervisor_stale_threshold_s` when the derived default does not
+fit your project's cadence:
+```toml
+[monitor]
+supervisor_stale_threshold_s = 3600   # explicit seconds
+# supervisor_stale_threshold_s = 0    # disable the detector
+# (unset)                             # derive round_timeout_s * 1.5
+```
+- **Very short rounds with occasional long legitimate gaps** (e.g. 2-minute
+  rounds plus a periodic maintenance pause): set a value higher than derived.
+- **Phase overrides that raise `round_timeout_s`** for some phase: the derived
+  threshold uses the *base* `round_timeout_s`, so a round in a longer-timeout
+  phase can exceed `base * 1.5`. Set
+  `supervisor_stale_threshold_s >= max_phase_timeout * 1.5`.
+## The liveness architecture (important)
+A monitor on the *same host* as the supervisor dies when that host dies — it
+cannot report its own host's death. For true liveness coverage, run the
+monitor from a **separate machine**:
+```bash
+# On your laptop / a second host, not on the supervised host:
+agent-runner monitor --host pi
+```
+That catches both failure modes: a stuck supervisor on a live host
+(`supervisor_stale`, events frozen) and a dead host or severed network (SSH
+poll fails → `monitor_remote_giveup`).
+## What did NOT change
+- The existing 11 detectors are unchanged.
+- `detect_hung` still covers the in-round hang case.
+- `round_timeout_s` is unchanged; the staleness threshold derives from it but
+  does not modify it.
+- No new core event kind: `supervisor_stale` is a monitor *alert kind*, surfaced
+  through the existing monitor alert path.
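
Reviewer note on the tuning advice in the migration guide above: the phase-override rule (`supervisor_stale_threshold_s >= max_phase_timeout * 1.5`) is simple arithmetic, sketched below. The helper name `suggested_stale_threshold` and the plain `dict` of per-phase timeouts are assumptions made for illustration; the package represents overrides with its own `PhaseOverride` type.

```python
import math


def suggested_stale_threshold(
    base_timeout_s: int,
    phase_timeouts_s: dict[str, int],
    factor: float = 1.5,
) -> int:
    """Smallest whole-second threshold covering the longest configured timeout."""
    longest = max([base_timeout_s, *phase_timeouts_s.values()])
    return math.ceil(longest * factor)


# Base timeout 1800 s derives 2700 s, but a "dev" phase may run for 3600 s,
# so a healthy dev round could outlast the derived threshold.
print(suggested_stale_threshold(1800, {}))             # 2700
print(suggested_stale_threshold(1800, {"dev": 3600}))  # 5400
```

With these numbers, setting `supervisor_stale_threshold_s = 5400` in `[monitor]` would satisfy the rule quoted in the guide.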

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/docs/runbook.md RENAMED

@@ -253,6 +253,29 @@ API. Power profile:
   real state change. Verify the detector logic and thresholds before enabling
   `auto_stop` on a production remote.
+### Liveness monitoring: run monitor from a separate machine
+`agent-runner monitor` detects anomalies including `supervisor_stale` — the
+supervisor stopped emitting events because it is stuck between rounds or dead.
+But a monitor running on the *same host* as the supervisor dies when that host
+dies, so it cannot report its own host's death.
+For true liveness coverage, run the monitor from a **separate machine**:
+    # On your laptop / a second host, NOT on the supervised host:
+    agent-runner monitor --host pi
+This catches both failure modes:
+- Supervisor stuck on a live host → `supervisor_stale` alert (events frozen).
+- Host itself dead / network gone → SSH poll fails → `monitor_remote_giveup`.
+The `supervisor_stale` threshold defaults to `round_timeout_s * 1.5`. Override
+with `[monitor] supervisor_stale_threshold_s = N` for projects whose legitimate
+cadence — very short rounds with occasional long legitimate gaps, or phase
+overrides that raise `round_timeout_s` — does not fit the derived default. Set
+to `0` to disable the detector entirely.
 ## Live event stream (machine-readable)
 For machine consumption (parity comparisons, custom dashboards, automation
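
Reviewer note on the runbook hunk above: the two failure modes it separates (stuck supervisor on a live host versus a dead or unreachable host) amount to a small decision table. The sketch below encodes that table as a pure function. `classify_liveness` and its return labels are hypothetical and are not part of agent-runner; in the real tool the unreachable case is described as surfacing via `monitor_remote_giveup`, and how that happens is not shown in this diff.

```python
from typing import Optional


def classify_liveness(
    ssh_ok: bool,
    last_event_age_s: Optional[float],
    threshold_s: int,
) -> str:
    """Classify one remote poll into the failure modes named in the runbook."""
    if not ssh_ok:
        # Host dead or network gone: only an off-host monitor can observe this.
        return "host_unreachable"
    if (
        threshold_s > 0
        and last_event_age_s is not None
        and last_event_age_s > threshold_s
    ):
        # Host answers, but the event stream is frozen past the deadline.
        return "supervisor_stale"
    return "healthy"


print(classify_liveness(False, None, 2700))    # host_unreachable
print(classify_liveness(True, 4000.0, 2700))   # supervisor_stale
print(classify_liveness(True, 100.0, 2700))    # healthy
```

A monitor on the supervised host can only ever reach the last two branches, which is the reason the runbook recommends running it from a separate machine.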

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/tests/invariants/test_architecture.py RENAMED

@@ -118,7 +118,7 @@ def test_given_api_types_when_inspected_then_all_frozen_dataclasses() -> None:
         assert cls.__dataclass_params__.frozen, f"{name} not frozen"
-def test_given_known_alert_kinds_when_inspected_then_matches_eleven_detectors() -> None:
+def test_given_known_alert_kinds_when_inspected_then_matches_twelve_detectors() -> None:
     from agent_runner.monitor import KNOWN_ALERT_KINDS
     expected = {
@@ -133,5 +133,6 @@ def test_given_known_alert_kinds_when_inspected_then_matches_eleven_detectors()
         "network_fail",
         "rate_limit_active",
         "anomaly_repetitive_active",
+        "supervisor_stale",
     }
     assert KNOWN_ALERT_KINDS == expected

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/tests/unit/test_api_observation.py RENAMED

@@ -107,9 +107,24 @@ def test_given_no_alerts_when_poll_once_then_returns_empty(
     tmp_git_repo: Path,
     monkeypatch: pytest.MonkeyPatch,
 ) -> None:
+    import dataclasses
     monkeypatch.setenv("HOME", str(tmp_git_repo))
     api.init(tmp_git_repo, force=False, commit=False)
     _seed_logs(tmp_git_repo)
+    # Disable supervisor_stale: seeded events use a fixed old timestamp; the
+    # detector would otherwise fire because now >> seed ts.
+    real_load = load_config
+    def patched_load(path):
+        cfg = real_load(path)
+        return dataclasses.replace(
+            cfg,
+            monitor=dataclasses.replace(cfg.monitor, supervisor_stale_threshold_s=0),
+        )
+    monkeypatch.setattr("agent_runner.api.load_config", patched_load)
     alerts = api._poll_once(tmp_git_repo, host=None)
     assert alerts == []

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/tests/unit/test_api_service.py RENAMED

@@ -205,3 +205,25 @@ def test_given_per_phase_override_when_poll_once_then_forwards_phases_overrides_
         "phases_overrides kwarg missing from run_all_detectors call"
     )
     assert call_kwargs["phases_overrides"] == {"dev": PhaseOverride(round_timeout_s=3600)}
+def test_poll_once_forwards_supervisor_stale_threshold(
+    tmp_git_repo: Path,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """_poll_once must forward cfg.monitor.supervisor_stale_threshold_s."""
+    api.init(tmp_git_repo, force=False, commit=False)
+    captured: list[dict] = []
+    def capturing_rad(**kwargs):
+        captured.append(kwargs)
+        return []
+    monkeypatch.setattr("agent_runner.monitor.run_all_detectors", capturing_rad)
+    api._poll_once(tmp_git_repo, host=None)
+    assert captured, "run_all_detectors was never called"
+    call_kwargs = captured[0]
+    assert "supervisor_stale_threshold_s" in call_kwargs

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/tests/unit/test_config.py RENAMED

@@ -1554,3 +1554,26 @@ def test_given_high_disk_critical_when_disk_used_below_then_warning_still_fires(
     alert = detect_disk_warning(metrics, threshold_pct=90.0, critical_pct=98.0)
     assert alert is not None
     assert alert.detector == "disk_warning"
+def test_given_no_supervisor_stale_field_then_default_none(tmp_path: Path) -> None:
+    toml = _write_toml(
+        tmp_path,
+        '[agent]\ncommand = ["true"]\nprompt_arg_template = ["{prompt}"]\n'
+        '[runtime]\nwork_dir = "."\nlog_dir = "/tmp/logs"\n'
+        '[prompt]\nfile = "p.md"\n',
+    )
+    cfg = load_config(toml)
+    assert cfg.monitor.supervisor_stale_threshold_s is None
+def test_given_supervisor_stale_threshold_set_then_loaded(tmp_path: Path) -> None:
+    toml = _write_toml(
+        tmp_path,
+        '[agent]\ncommand = ["true"]\nprompt_arg_template = ["{prompt}"]\n'
+        '[runtime]\nwork_dir = "."\nlog_dir = "/tmp/logs"\n'
+        '[prompt]\nfile = "p.md"\n'
+        "[monitor]\nsupervisor_stale_threshold_s = 600\n",
+    )
+    cfg = load_config(toml)
+    assert cfg.monitor.supervisor_stale_threshold_s == 600

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/tests/unit/test_docgen.py RENAMED

@@ -108,9 +108,9 @@ def test_given_render_alert_kinds_list_when_called_then_returns_bullet_list() ->
     from agent_runner._docgen import render_alert_kinds_list
     md = render_alert_kinds_list()
-    # Bullet list, alphabetised, 11 entries
+    # Bullet list, alphabetised, 12 entries
     bullets = [line for line in md.splitlines() if line.startswith("- ")]
-    assert len(bullets) == 11
+    assert len(bullets) == 12
     assert any("oauth_fail" in line for line in bullets)
     assert any("disk_critical" in line for line in bullets)

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/tests/unit/test_monitor_assembly.py RENAMED

@@ -79,6 +79,7 @@ def test_given_clean_history_when_run_all_detectors_then_no_alerts(
         metrics=metrics,
         log_tails=log_tails,
         round_timeout_s=1800,
+        supervisor_stale_threshold_s=0,  # disable: seeded events use a fixed old timestamp
     )
     assert alerts == []

cli_agent_runner-0.1.36/tests/unit/test_monitor_detect_supervisor_stale.py ADDED

@@ -0,0 +1,38 @@
+from __future__ import annotations
+from datetime import UTC, datetime
+from agent_runner.monitor import detect_supervisor_stale
+def _ev(ts: str, event: str = "round_end", **fields) -> dict:
+    return {"event": event, "ts": ts, **fields}
+NOW = datetime(2026, 5, 21, 12, 0, 0, tzinfo=UTC)
+def test_given_last_event_older_than_threshold_then_alerts() -> None:
+    # Last event 4000s before NOW, threshold 2700s -> stale.
+    events = [_ev("2026-05-21T10:53:20.000Z", round_num=5)]
+    alert = detect_supervisor_stale(events, now=NOW, stale_threshold_s=2700)
+    assert alert is not None
+    assert alert.detector == "supervisor_stale"
+    assert alert.severity == "warning"
+    assert alert.auto_action == "none"
+    assert "2700" in alert.message
+def test_given_last_event_within_threshold_then_no_alert() -> None:
+    # Last event 100s before NOW, threshold 2700s -> healthy.
+    events = [_ev("2026-05-21T11:58:20.000Z", round_num=5)]
+    assert detect_supervisor_stale(events, now=NOW, stale_threshold_s=2700) is None
+def test_given_empty_events_then_no_alert() -> None:
+    assert detect_supervisor_stale([], now=NOW, stale_threshold_s=2700) is None
+def test_given_threshold_zero_then_disabled_no_alert() -> None:
+    events = [_ev("2026-05-21T00:00:00.000Z", round_num=1)]  # very old
+    assert detect_supervisor_stale(events, now=NOW, stale_threshold_s=0) is None

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.36}/tests/unit/test_monitor_detectors.py RENAMED

@@ -24,7 +24,7 @@ def _ev(event: str, **fields) -> dict:
     return {"event": event, "ts": "2026-05-12T10:00:00.000Z", **fields}
-def test_given_known_alert_kinds_when_inspected_then_contains_all_eleven() -> None:
+def test_given_known_alert_kinds_when_inspected_then_contains_all_twelve() -> None:
     expected = {
         "timeout_rate",
         "hung",
@@ -37,6 +37,7 @@ def test_given_known_alert_kinds_when_inspected_then_contains_all_eleven() -> No
         "network_fail",
         "rate_limit_active",
         "anomaly_repetitive_active",
+        "supervisor_stale",
     }
     assert expected == KNOWN_ALERT_KINDS
@@ -308,3 +309,37 @@ def test_given_phase_not_in_override_when_detect_hung_then_uses_global() -> None
         phases_overrides={"warmup": PhaseOverride(round_timeout_s=300)},
     )
     assert out is None
+def test_given_no_stale_threshold_then_derives_from_round_timeout() -> None:
+    # round_timeout_s=1000 -> derived 1500s. Last event 1800s ago -> stale.
+    from agent_runner.monitor import run_all_detectors
+    events = [_ev("round_end", round_num=1)]  # ts 2026-05-12T10:00:00.000Z
+    now = datetime(2026, 5, 12, 10, 30, 0, tzinfo=UTC)  # 1800s later
+    alerts = run_all_detectors(
+        events=events,
+        metrics=[],
+        log_tails={},
+        round_timeout_s=1000,
+        now=now,
+    )
+    assert any(a.detector == "supervisor_stale" for a in alerts)
+def test_given_explicit_stale_threshold_then_used_over_derived() -> None:
+    # Explicit 3600s threshold; last event 1800s ago -> NOT stale even though
+    # derived (round_timeout 1000 * 1.5 = 1500) would have fired.
+    from agent_runner.monitor import run_all_detectors
+    events = [_ev("round_end", round_num=1)]
+    now = datetime(2026, 5, 12, 10, 30, 0, tzinfo=UTC)  # 1800s later
+    alerts = run_all_detectors(
+        events=events,
+        metrics=[],
+        log_tails={},
+        round_timeout_s=1000,
+        supervisor_stale_threshold_s=3600,
+        now=now,
+    )
+    assert not any(a.detector == "supervisor_stale" for a in alerts)