PyPI - cli-agent-runner - Versions diffs - 0.1.35__tar.gz → 0.1.37__tar.gz - Mend

cli-agent-runner 0.1.35__tar.gz → 0.1.37__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (222)

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/CHANGELOG.md RENAMED

@@ -7,6 +7,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.1.37] - 2026-05-22
+### Fixed
+- `upgrade` no longer crashes when run from a directory without `agent-runner.toml` — it upgrades the package and falls back to package-only mode.
+- `upgrade` handles PEP 668 externally-managed environments (Debian 12 etc.): retries pip with `--break-system-packages` (and `--user` for user-site installs) when not in a venv.
+### Changed
+- `upgrade` only stop/start-orchestrates the `systemd --user` service it installed. For a self-managed service (e.g. a systemd system unit) it does package-only upgrade + smoke and prints the restart command to run yourself — no more silent no-op, and no more `agent-runner start` suggestion (which could spawn a conflicting second supervisor).
+- New `--no-restart` flag forces package-only upgrade.
+### Added
+- New event `package_upgraded` (on-disk package changed; restart deferred to the operator), distinct from `service_upgraded` (the live service is now on the new version).
+## [0.1.36] - 2026-05-21
+### Added
+- New monitor detector `supervisor_stale` (notify) — alerts when the supervisor stops emitting events (stuck between rounds or dead), a blind spot the event stream and `detect_hung` cannot catch. Default ON; threshold derives from `round_timeout_s * 1.5`. Detector count 11 → 12.
+- `[monitor] supervisor_stale_threshold_s` config — override the derived staleness threshold (positive = seconds; 0 = disable; unset = derived).
+### Changed
+- `docs/runbook.md` documents the liveness-monitoring architecture: run `monitor --host` from a separate machine to detect supervisor silent-death AND host death (a same-host monitor dies with its host).
 ## [0.1.35] - 2026-05-20
 ### Removed

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: cli-agent-runner
-Version: 0.1.35
+Version: 0.1.37
 Summary: Restart-on-exit supervisor for autonomous CLI agents
 Project-URL: Homepage, https://github.com/wan9yu/cli-agent-runner
 Project-URL: Documentation, https://github.com/wan9yu/cli-agent-runner#readme
@@ -49,7 +49,7 @@ full disks, runaway memory.
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3: The Witness (monitor)           │  11 detectors + auto-stop
+│ Layer 3: The Witness (monitor)           │  12 detectors + auto-stop
 ├──────────────────────────────────────────┤
 │ Layer 2: The Loop (serve, ~120 LOC)      │  signal-trapping restart loop
 ├──────────────────────────────────────────┤
@@ -86,7 +86,7 @@ Full walkthrough: [`docs/quickstart.md`](docs/quickstart.md).
 |---|---|
 | `init` / `install` / `uninstall` | `peek` — state snapshot |
 | `start` / `stop` / `kill` / `cancel` | `watch` — peek in a refresh loop |
-| `restart` / `status` | `monitor` — 11 detectors, alerts, auto-stop |
+| `restart` / `status` | `monitor` — 12 detectors, alerts, auto-stop |
 | `round` / `serve` / `upgrade` | `events` — query / stream events.jsonl |
 Verb reference: [`docs/commands.md`](docs/commands.md).
@@ -106,11 +106,11 @@ guards it. Highlights:
 Full list and rationale: [`docs/architecture.md`](docs/architecture.md).
-## Monitor: 11 detectors
+## Monitor: 12 detectors
 Notify only: `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`,
 `mem_pressure`, `smoke_fail_rate`, `network_fail`, `rate_limit_active`,
-`anomaly_repetitive_active`.
+`anomaly_repetitive_active`, `supervisor_stale`.
 **Auto-stop the service** (continuing is harmful):
 - `oauth_fail` — burning API quota on auth-rejected rounds

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/README.md RENAMED

@@ -12,7 +12,7 @@ full disks, runaway memory.
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3: The Witness (monitor)           │  11 detectors + auto-stop
+│ Layer 3: The Witness (monitor)           │  12 detectors + auto-stop
 ├──────────────────────────────────────────┤
 │ Layer 2: The Loop (serve, ~120 LOC)      │  signal-trapping restart loop
 ├──────────────────────────────────────────┤
@@ -49,7 +49,7 @@ Full walkthrough: [`docs/quickstart.md`](docs/quickstart.md).
 |---|---|
 | `init` / `install` / `uninstall` | `peek` — state snapshot |
 | `start` / `stop` / `kill` / `cancel` | `watch` — peek in a refresh loop |
-| `restart` / `status` | `monitor` — 11 detectors, alerts, auto-stop |
+| `restart` / `status` | `monitor` — 12 detectors, alerts, auto-stop |
 | `round` / `serve` / `upgrade` | `events` — query / stream events.jsonl |
 Verb reference: [`docs/commands.md`](docs/commands.md).
@@ -69,11 +69,11 @@ guards it. Highlights:
 Full list and rationale: [`docs/architecture.md`](docs/architecture.md).
-## Monitor: 11 detectors
+## Monitor: 12 detectors
 Notify only: `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`,
 `mem_pressure`, `smoke_fail_rate`, `network_fail`, `rate_limit_active`,
-`anomaly_repetitive_active`.
+`anomaly_repetitive_active`, `supervisor_stale`.
 **Auto-stop the service** (continuing is harmful):
 - `oauth_fail` — burning API quota on auth-rejected rounds

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/README.zh.md RENAMED

@@ -20,7 +20,7 @@ supervisor 重启 —— 这是核心模式。中间穿插 11 条防御，避开
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3：Witness（monitor）              │  11 个检测器 + 自动停服
+│ Layer 3：Witness（monitor）              │  12 个检测器 + 自动停服
 ├──────────────────────────────────────────┤
 │ Layer 2：Loop（serve，~120 LOC 薄壳）    │  捕获信号，循环拉起 round
 ├──────────────────────────────────────────┤
@@ -63,7 +63,7 @@ agent-runner monitor              # 实时异常检测，OAuth/磁盘 critical
 |---|---|
 | `init` / `install` / `uninstall` | `peek` —— 项目状态快照 |
 | `start` / `stop` / `kill` / `cancel` | `watch` —— peek 在刷新循环里 |
-| `restart` / `status` | `monitor` —— 11 个检测器 + 告警 + 自动停服 |
+| `restart` / `status` | `monitor` —— 12 个检测器 + 告警 + 自动停服 |
 | `round` / `serve` / `upgrade` | `events` —— 查询 / 流式订阅 events.jsonl |
 **停服三动词**有清晰的语义分层：
@@ -95,11 +95,12 @@ agent-runner monitor              # 实时异常检测，OAuth/磁盘 critical
 完整列表 + 历史出处：[`docs/architecture.md`](docs/architecture.md)。
-## Monitor：9 个检测器
+## Monitor：12 个检测器
 **只告警**（warning 级，服务继续跑）：
 `timeout_rate` / `hung` / `orphan_chain` / `disk_warning` /
-`mem_pressure` / `smoke_fail_rate` / `network_fail`
+`mem_pressure` / `smoke_fail_rate` / `network_fail` / `rate_limit_active` /
+`anomaly_repetitive_active` / `supervisor_stale`
 **自动停服**（critical 级，继续是 net negative）：

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/agent_runner/_version.py RENAMED

@@ -18,7 +18,7 @@ version_tuple: tuple[int | str, ...]
 commit_id: str | None
 __commit_id__: str | None
-__version__ = version = '0.1.35'
-__version_tuple__ = version_tuple = (0, 1, 35)
+__version__ = version = '0.1.37'
+__version_tuple__ = version_tuple = (0, 1, 37)
 __commit_id__ = commit_id = None

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/agent_runner/api.py RENAMED

@@ -452,6 +452,7 @@ def _poll_once(project: str | Path, *, host: str | None) -> list[monitor.Alert]:
         metrics=metrics,
         log_tails=log_tails,
         round_timeout_s=cfg.runtime.round_timeout_s,
+        supervisor_stale_threshold_s=cfg.monitor.supervisor_stale_threshold_s,
         auth_fail_patterns=cfg.monitor.auth_fail_patterns,
         auth_fail_hint=cfg.monitor.auth_fail_hint,
         phases_overrides=cfg.phases.overrides if cfg.phases.overrides else None,

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/agent_runner/cli/upgrade_cmd.py RENAMED

@@ -18,7 +18,9 @@ import sys
 import time
 from pathlib import Path
+import agent_runner
 from agent_runner import __version__, api, events
+from agent_runner.api_types import ServiceMode
 from agent_runner.cli.common import cfg_from_args, fail, info
 from agent_runner.config import Config
@@ -28,8 +30,8 @@ def add_parser(sub, parent) -> None:
         "upgrade",
         parents=[parent],
         help=(
-            "Round-boundary upgrade: stop → pip install → smoke → start"
-            " (auto-rollback on smoke fail)"
+            "Package upgrade with service-mode gate: orchestrated stop/start"
+            " for systemd --user; package-only otherwise"
         ),
     )
     p.add_argument(
@@ -40,24 +42,69 @@ def add_parser(sub, parent) -> None:
         help="Pin a specific version (e.g. 0.1.13). Default: latest from PyPI. "
         "Use to roll back: `--target <previous-version>`.",
     )
+    p.add_argument(
+        "--no-restart",
+        action="store_true",
+        help="Upgrade the package + smoke only; do not stop/start the service "
+        "(you restart it yourself).",
+    )
     p.set_defaults(func=cmd)
 def cmd(args) -> int:
-    cfg = cfg_from_args(args)
-    return _run_upgrade(cfg, target=args.target, cfg_path=args.config)
+    cfg = _try_load_cfg(args)
+    return _run_upgrade(
+        cfg,
+        target=args.target,
+        cfg_path=args.config,
+        no_restart=getattr(args, "no_restart", False),
+    )
-def _pip_install(spec: str, *, force_reinstall: bool = False) -> subprocess.CompletedProcess:
-    """Invoke pip install with the given spec. Returns CompletedProcess (rc check by caller).
+def _try_load_cfg(args) -> Config | None:
+    """Load the project config if present; None when absent (package-only)."""
+    try:
+        return cfg_from_args(args)
+    except FileNotFoundError:
+        return None
+def _pip_env_flags() -> list[str]:
+    """Extra pip flags for the current install under PEP 668.
-    Uses ``sys.executable -m pip`` to match the smoke functions and guarantee
-    we install into the same interpreter we will smoke-test against.
+    Inside a venv: none (pip is unrestricted). Otherwise (system/user
+    interpreter on an externally-managed distro) the caller retries with these.
+    ``--user`` is added only when agent_runner lives in user-site, matching
+    where the existing install actually is.
+    """
+    import sys
+    if sys.prefix != sys.base_prefix:  # inside a venv → no PEP 668
+        return []
+    import site
+    flags = ["--break-system-packages"]
+    user_site = site.getusersitepackages()
+    if str(Path(agent_runner.__file__)).startswith(str(Path(user_site))):
+        flags.insert(0, "--user")
+    return flags
+def _pip_install(spec: str, *, force_reinstall: bool = False) -> subprocess.CompletedProcess:
+    """pip install --upgrade <spec>, retrying once with PEP668 flags on an
+    externally-managed environment. Returns CompletedProcess (rc check by caller).
     """
-    cmd = [sys.executable, "-m", "pip", "install", "--upgrade", spec]
+    base = [sys.executable, "-m", "pip", "install", "--upgrade", spec]
     if force_reinstall:
-        cmd.insert(4, "--force-reinstall")
-    return subprocess.run(cmd, capture_output=True, text=True, check=False)
+        base.insert(4, "--force-reinstall")
+    r = subprocess.run(base, capture_output=True, text=True, check=False)
+    if r.returncode == 0 or "externally-managed-environment" not in (r.stderr or ""):
+        return r
+    extra = _pip_env_flags()
+    if not extra:
+        return r
+    info(f"externally-managed env detected; retrying pip with {' '.join(extra)}")
+    return subprocess.run(base + extra, capture_output=True, text=True, check=False)
 def _smoke_version() -> tuple[int, str]:
@@ -93,18 +140,40 @@ def _smoke_peek(cfg_path: Path) -> tuple[int, str]:
     return 0, ""
-def _run_upgrade(cfg: Config, *, target: str | None, cfg_path: Path) -> int:
-    """Orchestrate the full upgrade flow.
-    Returns exit code (0 success, 1 user-recoverable, 2 critical).
-    """
+def _run_upgrade(
+    cfg: Config | None,
+    *,
+    target: str | None,
+    cfg_path: Path,
+    no_restart: bool = False,
+) -> int:
+    """Dispatch: full orchestration for the systemd --user service we installed;
+    package-only everywhere else."""
     if target is not None and not target.strip():
         return fail("--target must be a non-empty version string (e.g. 0.1.13)")
+    from_version = __version__
+    if _orchestrate_capable(cfg, no_restart):
+        return _orchestrated_upgrade(
+            cfg, target=target, cfg_path=cfg_path, from_version=from_version
+        )
+    return _package_only_upgrade(cfg, target=target, from_version=from_version)
+def _orchestrate_capable(cfg: Config | None, no_restart: bool) -> bool:
+    if cfg is None or no_restart:
+        return False
+    pname = api._resolve_project(cfg.runtime.work_dir)
+    return api.detect_service_mode(pname, log_dir=cfg.runtime.log_dir) == ServiceMode.SYSTEMD_USER
+def _orchestrated_upgrade(
+    cfg: Config, *, target: str | None, cfg_path: Path, from_version: str
+) -> int:
+    """Full stop → pip → smoke(--version + peek) → start → emit service_upgraded,
+    with auto-rollback on smoke failure. Only reached for the systemd --user
+    service agent-runner installed (api.start works there)."""
     log_dir = cfg.runtime.log_dir
     log_dir.mkdir(parents=True, exist_ok=True)
-    from_version = __version__
     t0 = time.monotonic()
     info("stopping service...")
@@ -155,14 +224,13 @@ def _run_upgrade(cfg: Config, *, target: str | None, cfg_path: Path) -> int:
             started_at=t0,
             cfg_path=cfg_path,
         )
     info(f"smoke OK (now at {to_version})")
     info("starting service...")
     t_start = time.monotonic()
     try:
         api.start(cfg.runtime.work_dir)
-    except Exception as e:  # noqa: BLE001 — new version installed but service stopped; no safe auto-rollback
+    except Exception as e:  # noqa: BLE001 — new version installed but service stopped
         return _rollback_failed(
             log_dir,
             to_version,
@@ -183,6 +251,65 @@ def _run_upgrade(cfg: Config, *, target: str | None, cfg_path: Path) -> int:
     return 0
+def _package_only_upgrade(cfg: Config | None, *, target: str | None, from_version: str) -> int:
+    """Upgrade the on-disk package + smoke (--version), with pip-level rollback.
+    Never touches the service — the operator restarts it. Used for any deployment
+    not managed as a systemd --user service (system unit, foreground, none, no
+    config, or --no-restart)."""
+    spec = "cli-agent-runner" if target is None else f"cli-agent-runner=={target}"
+    info(f"package-only upgrade (service not managed by agent-runner); installing {spec}...")
+    pip_result = _pip_install(spec)
+    if pip_result.returncode != 0:
+        return fail(
+            f"pip install failed (rc={pip_result.returncode}): "
+            f"{pip_result.stderr.strip()[:200]}; "
+            f"package unchanged, your service keeps running the current version"
+        )
+    rc_v, version_or_err = _smoke_version()
+    if rc_v != 0:
+        attempted = target or "latest"
+        info(f"smoke failed at {attempted} ({version_or_err}); reinstalling {from_version}...")
+        rb = _pip_install(f"cli-agent-runner=={from_version}", force_reinstall=True)
+        if rb.returncode != 0:
+            return fail(
+                f"package smoke failed AND rollback reinstall failed (rc={rb.returncode}): "
+                f"{rb.stderr.strip()[:200]}; run: "
+                f"pip install --force-reinstall cli-agent-runner=={from_version}"
+            )
+        return fail(
+            f"package smoke failed at {attempted}; reinstalled {from_version}; service untouched"
+        )
+    to_version = version_or_err
+    if cfg is not None:
+        log_dir = cfg.runtime.log_dir
+        log_dir.mkdir(parents=True, exist_ok=True)
+        events.emit(
+            log_dir,
+            events.PACKAGE_UPGRADED,
+            from_version=from_version,
+            to_version=to_version,
+            restart_deferred=True,
+        )
+    info(f"package upgraded {from_version} → {to_version}. Restart your supervisor to load it:")
+    info(_restart_hint(cfg))
+    return 0
+def _restart_hint(cfg: Config | None) -> str:
+    """Mode-correct restart command. Never suggests `agent-runner start`
+    (which would spawn a conflicting supervisor on a system-unit host)."""
+    if cfg is not None:
+        pname = api._resolve_project(cfg.runtime.work_dir)
+        if api.detect_service_mode(pname, log_dir=cfg.runtime.log_dir) == ServiceMode.SYSTEMD_USER:
+            return f"  systemctl --user restart {api.serve_unit_filename(pname)}"
+    return (
+        "  sudo systemctl restart <your-unit>   # if run by a systemd system unit\n"
+        "  (agent-runner can't know a service it didn't install; substitute your unit name)"
+    )
 def _rollback(
     cfg: Config,
     log_dir: Path,
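
The PEP 668 retry logic in `_pip_env_flags` above can be restated as a pure function, which makes the three cases (venv, system install, user-site install) easy to check in isolation. This is a minimal sketch, not the package's code: `pip_retry_flags` and its parameters are hypothetical names, and it swaps the diff's string-prefix test for `PurePosixPath.is_relative_to`, which compares whole path components.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def pip_retry_flags(
    prefix: str, base_prefix: str, package_file: str, user_site: str
) -> list[str]:
    """Extra pip flags to retry with on an externally-managed environment.

    Hypothetical helper: the caller passes sys.prefix, sys.base_prefix,
    the installed package's __file__, and site.getusersitepackages().
    """
    if prefix != base_prefix:
        # Inside a venv: PEP 668 does not apply, so no extra flags.
        return []
    flags = ["--break-system-packages"]
    # Add --user only when the existing install already lives in user-site.
    if PurePosixPath(package_file).is_relative_to(PurePosixPath(user_site)):
        flags.insert(0, "--user")
    return flags


USER_SITE = "/home/u/.local/lib/python3.12/site-packages"
in_venv = pip_retry_flags("/venv", "/usr", "/venv/lib/agent_runner/__init__.py", USER_SITE)
system = pip_retry_flags("/usr", "/usr", "/usr/lib/python3/dist-packages/agent_runner/__init__.py", USER_SITE)
user = pip_retry_flags("/usr", "/usr", USER_SITE + "/agent_runner/__init__.py", USER_SITE)
```

Passing the interpreter facts in as arguments, rather than reading `sys` and `site` inside the function, is only a choice for testability here; the diff's version reads them directly.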

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/agent_runner/config.py RENAMED

@@ -141,6 +141,12 @@ class MonitorConfig:
     anomaly_repetitive_threshold: int = 0  # 0 = disabled
     host_health: MonitorHostHealthConfig = field(default_factory=MonitorHostHealthConfig)
     round_progress_interval_s: int = 0  # 0 = disabled; >0 = emit round_progress every N seconds
+    supervisor_stale_threshold_s: int | None = None
+    """Staleness deadline for the supervisor_stale detector (seconds).
+    None (unset) → derived default round_timeout_s * 1.5.
+    Positive int → explicit threshold. 0 → disable the detector.
+    """
 @dataclass(frozen=True)
@@ -467,6 +473,14 @@ def load_config(toml_path: Path) -> Config:
             monitor_d.get("round_progress_interval_s", 0),
             field="monitor.round_progress_interval_s",
         ),
+        supervisor_stale_threshold_s=(
+            None
+            if monitor_d.get("supervisor_stale_threshold_s") is None
+            else _require_non_negative_int(
+                monitor_d["supervisor_stale_threshold_s"],
+                field="monitor.supervisor_stale_threshold_s",
+            )
+        ),
     )
     plugins_raw = dict(raw.get("plugins") or {})  # copy so we can pop
     disable = list(plugins_raw.pop("disable", []))
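
The tri-state handling of `supervisor_stale_threshold_s` in the hunk above (unset, zero, positive) can be illustrated with a small standalone sketch. It assumes the `[monitor]` table has already been parsed into a plain dict; `read_stale_threshold` is a hypothetical name, and its validation only approximates what `_require_non_negative_int` might do, since that helper's body is not shown in this diff.

```python
from __future__ import annotations

from typing import Any

FIELD = "supervisor_stale_threshold_s"


def read_stale_threshold(monitor_table: dict[str, Any]) -> int | None:
    """Return None when the key is unset, else a validated non-negative int.

    None means "derive from round_timeout_s later"; 0 means "disable".
    """
    raw = monitor_table.get(FIELD)
    if raw is None:
        return None
    # bool is a subclass of int in Python, so reject it explicitly (assumed rule).
    if isinstance(raw, bool) or not isinstance(raw, int) or raw < 0:
        raise ValueError(f"monitor.{FIELD} must be a non-negative integer, got {raw!r}")
    return raw


unset = read_stale_threshold({})
disabled = read_stale_threshold({FIELD: 0})
explicit = read_stale_threshold({FIELD: 2700})
```

Keeping `None` distinct from `0` at load time is what lets the detector later tell "use the derived default" apart from "turn the check off".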

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/agent_runner/events.py RENAMED

@@ -46,6 +46,7 @@ MONITOR_STARTED = "monitor_started"
 ORPHAN_IDEMPOTENT_SKIP = "orphan_idempotent_skip"
 ORPHAN_STASH_FAILED = "orphan_stash_failed"
 ORPHAN_STASHED = "orphan_stashed"
+PACKAGE_UPGRADED = "package_upgraded"
 PROMPT_OVERWRITTEN = "prompt_overwritten"
 ROUND_END = "round_end"
 ROUND_GRACE_KILL = "round_grace_kill"
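
Because `package_upgraded` means the on-disk package changed but the restart was left to the operator, a consumer of `events.jsonl` may want to list those pending restarts. The sketch below is illustrative only: the on-disk event schema is not shown in this diff, so the `kind` key is an assumption (made configurable via `kind_key`), as is the idea that `from_version`, `to_version`, and `restart_deferred` appear as top-level JSON fields.

```python
from __future__ import annotations

import json
from typing import Iterable


def deferred_restarts(
    lines: Iterable[str], kind_key: str = "kind"
) -> list[tuple[str | None, str | None]]:
    """Collect (from_version, to_version) for package_upgraded events
    whose restart was deferred. Field names are assumed, not confirmed."""
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        event = json.loads(line)
        if event.get(kind_key) == "package_upgraded" and event.get("restart_deferred"):
            found.append((event.get("from_version"), event.get("to_version")))
    return found


sample = [
    '{"kind": "round_end", "ts": "2026-05-22T10:00:00+00:00"}',
    '{"kind": "package_upgraded", "from_version": "0.1.35", "to_version": "0.1.37", "restart_deferred": true}',
    "",
]
pending = deferred_restarts(sample)
```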

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/agent_runner/monitor.py RENAMED

@@ -1,6 +1,6 @@
 """Monitor — anomaly detectors over events + metrics + log tails.
-11 built-in detectors. Two trigger ``auto_action="stop_service"``:
+12 built-in detectors. Two trigger ``auto_action="stop_service"``:
   * oauth_fail  — auth pattern in short-exit logs (retrying burns API quota)
   * disk_critical — disk_used_pct > 95% (writing more risks corruption)
@@ -54,6 +54,7 @@ KNOWN_ALERT_KINDS: frozenset[str] = frozenset(
         "network_fail",
         "rate_limit_active",
         "anomaly_repetitive_active",
+        "supervisor_stale",
     }
 )
@@ -429,6 +430,39 @@ def detect_anomaly_repetitive_active(
     )
+def detect_supervisor_stale(
+    events: list[dict[str, Any]],
+    *,
+    now: datetime,
+    stale_threshold_s: int,
+) -> Alert | None:
+    """Alert when the most recent event is older than ``stale_threshold_s``.
+    Catches supervisor "silent-death": stuck between rounds (after round_end,
+    before the next round_start) emitting no events. The event stream cannot
+    distinguish that from a normal idle gap — only a deadline check can.
+    ``stale_threshold_s <= 0`` disables the check (caller resolves the
+    sentinel). Empty event list → no alert: that is "never started", not
+    silent-death, and there is no baseline to measure staleness against.
+    """
+    if stale_threshold_s <= 0 or not events:
+        return None
+    last_ts_str = max((e["ts"] for e in events if "ts" in e), default=None)
+    if last_ts_str is None:
+        return None
+    age_s = (now - parse_iso_ms(last_ts_str)).total_seconds()
+    if age_s <= stale_threshold_s:
+        return None
+    return _alert(
+        "supervisor_stale",
+        "warning",
+        f"No events for {int(age_s)}s (threshold {stale_threshold_s}s) — "
+        f"supervisor may be stuck or dead. Last event: {last_ts_str}.",
+        {"age_s": int(age_s), "threshold_s": stale_threshold_s, "last_ts": last_ts_str},
+    )
 # ---------------------------------------------------------------------------
 # State-tree assembly (Task 3.2)
 # ---------------------------------------------------------------------------
@@ -535,6 +569,7 @@ def run_all_detectors(
     metrics: list[dict[str, Any]],
     log_tails: dict[int, str],
     round_timeout_s: int = 1800,
+    supervisor_stale_threshold_s: int | None = None,
     now: datetime | None = None,
     auth_fail_patterns: list[str] | None = None,
     auth_fail_hint: str | None = None,
@@ -543,12 +578,17 @@ def run_all_detectors(
     disk_warning_pct: float = 90.0,
     disk_critical_pct: float = 95.0,
 ) -> list[Alert]:
-    """Run all 11 detectors; returns alerts (empty = healthy)."""
+    """Run all 12 detectors; returns alerts (empty = healthy)."""
     if now is None:
         now = datetime.now(UTC)
     compiled_auth_pats = (
         [re.compile(p, re.IGNORECASE) for p in auth_fail_patterns] if auth_fail_patterns else None
     )
+    effective_stale_s = (
+        int(round_timeout_s * 1.5)
+        if supervisor_stale_threshold_s is None
+        else supervisor_stale_threshold_s
+    )
     candidates = [
         detect_timeout_rate(events),
         detect_hung(
@@ -568,6 +608,7 @@ def run_all_detectors(
         detect_network_fail(events, log_tails),
         detect_rate_limit_active(events, now=now.timestamp()),
         detect_anomaly_repetitive_active(events),
+        detect_supervisor_stale(events, now=now, stale_threshold_s=effective_stale_s),
     ]
     return [a for a in candidates if a is not None]
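
The deadline check behind `supervisor_stale`, together with the derived default of `round_timeout_s * 1.5`, can be exercised outside the package with a short sketch. The function names here are hypothetical, and `datetime.fromisoformat` stands in for the package's own `parse_iso_ms`, whose exact input format is not shown in this diff.

```python
from __future__ import annotations

from datetime import datetime, timezone


def effective_threshold_s(round_timeout_s: int, override_s: int | None) -> int:
    """Unset override -> derived default; otherwise use the override as given."""
    return int(round_timeout_s * 1.5) if override_s is None else override_s


def is_stale(events: list[dict], now: datetime, threshold_s: int) -> bool:
    """True when the newest event timestamp is older than threshold_s seconds."""
    if threshold_s <= 0 or not events:
        return False  # detector disabled, or no baseline to measure against
    stamps = [datetime.fromisoformat(e["ts"]) for e in events if "ts" in e]
    if not stamps:
        return False
    return (now - max(stamps)).total_seconds() > threshold_s


now = datetime(2026, 5, 22, 12, 0, 0, tzinfo=timezone.utc)
threshold = effective_threshold_s(1800, None)  # 1800 * 1.5 = 2700 seconds
old = is_stale([{"ts": "2026-05-22T11:00:00+00:00"}], now, threshold)    # 3600 s old
fresh = is_stale([{"ts": "2026-05-22T11:30:00+00:00"}], now, threshold)  # 1800 s old
```

With the default `round_timeout_s` of 1800, an hour of silence exceeds the 2700-second deadline while half an hour does not, which matches the commented value in the `docs/configuration.md` example.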

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/docs/architecture.md RENAMED

@@ -65,13 +65,14 @@ surfacing everywhere.
 | `event_kind_registry` | Prevent events.emit() typos / unregistered kinds slipping past CI | `tests/invariants/test_event_kind_registry.py` |
 <!-- /gen:defenses-table -->
-## Monitor: 11 detectors
+## Monitor: 12 detectors
 Three categories by `auto_action`:
 **Notify only** (severity `warning`):
 `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`, `mem_pressure`,
-`smoke_fail_rate`, `network_fail`.
+`smoke_fail_rate`, `network_fail`, `rate_limit_active`,
+`anomaly_repetitive_active`, `supervisor_stale`.
 **Auto-stop service** (severity `critical`, `auto_action="stop_service"`):
 `oauth_fail`, `disk_critical`. Continuing in either state is harmful (burning
@@ -88,6 +89,7 @@ API quota / writing to a near-full disk).
 - `orphan_chain`
 - `rate_limit_active`
 - `smoke_fail_rate`
+- `supervisor_stale`
 - `timeout_rate`
 <!-- /gen:detector-list -->
@@ -163,6 +165,7 @@ hook (vs ALL pre-round hooks), use `[plugins] disable = ["that_entry_point_name"
 - `orphan_idempotent_skip`
 - `orphan_stash_failed`
 - `orphan_stashed`
+- `package_upgraded`
 - `prompt_overwritten`
 - `round_end`
 - `round_grace_kill`

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/docs/commands.md RENAMED

@@ -24,7 +24,7 @@ are shared between `peek`, `watch`, and `monitor`.
 | `monitor` | Anomaly detection, narrate/events stream, or HTTP progress page |
 | `serve` | Long-running supervisor loop |
 | `round` | Run one round and exit |
-| `upgrade` | Round-boundary upgrade: stop → pip install → smoke → start (auto-rollback on smoke fail) |
+| `upgrade` | Package upgrade with service-mode gate: orchestrated stop/start for systemd --user; package-only otherwise |
 <!-- /gen:verb-table -->
 ## Lifecycle
@@ -76,6 +76,24 @@ Long-running supervisor loop. Traps SIGTERM (graceful stop), SIGINT (graceful),
 SIGUSR1 (cancel — forwards SIGINT to current round). Writes `serve.pid` and
 `round.pid`. `--once` runs a single round then exits (debug).
+### `agent-runner upgrade [--target VERSION] [--no-restart] [--config PATH]`
+Upgrade the agent-runner package. Behavior depends on the detected service mode:
+- **systemd --user service** (installed via `agent-runner install`): full
+  orchestrated flow — stop → pip install → smoke (`--version` + `peek`) →
+  start → emit `service_upgraded`. Auto-rollback on smoke failure.
+- **Anything else** (system unit, foreground, no config): package-only —
+  PEP 668-aware pip + `--version` smoke + pip-level rollback, emits
+  `package_upgraded`, prints the restart command. Never touches your running
+  service, never runs `sudo`.
+`--config` is optional: when omitted (or the file is absent), `upgrade` falls
+back to package-only mode automatically.
+`--no-restart` forces package-only even on a systemd --user host (upgrade the
+package now, restart your service yourself).
 ## Observation
 ### `agent-runner peek [flags]`
@@ -117,7 +135,7 @@ agent-runner events --kind transient_error_backoff_capped --tail
 ### `agent-runner monitor [--host SSH-ALIAS] [--interval N] [--json]`
-Anomaly-detection daemon. Runs the 11 detectors against the live state on every
+Anomaly-detection daemon. Runs the 12 detectors against the live state on every
 poll. Without `--host`, watches local logs at default 30s interval. With
 `--host`, watches a remote agent-runner over plain ssh at default 60s interval.
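
The mode selection described in the `agent-runner upgrade` section of this file reduces to a small decision table. The sketch below mirrors `_orchestrate_capable` from `upgrade_cmd.py` under stated assumptions: this `ServiceMode` is a stand-in enum (only the `SYSTEMD_USER` member is confirmed by the diff; `OTHER` is invented for illustration), and `choose_flow` is a hypothetical name.

```python
from enum import Enum


class ServiceMode(Enum):
    """Stand-in for agent_runner.api_types.ServiceMode (members assumed)."""

    SYSTEMD_USER = "systemd_user"
    OTHER = "other"  # system unit, foreground, or no service at all


def choose_flow(has_config: bool, no_restart: bool, mode: ServiceMode) -> str:
    """Pick the upgrade flow: orchestrated only for the systemd --user
    service with a config present and no --no-restart flag."""
    if not has_config or no_restart:
        return "package-only"
    if mode is ServiceMode.SYSTEMD_USER:
        return "orchestrated"
    return "package-only"


default_user_service = choose_flow(True, False, ServiceMode.SYSTEMD_USER)
forced_package_only = choose_flow(True, True, ServiceMode.SYSTEMD_USER)
missing_config = choose_flow(False, False, ServiceMode.SYSTEMD_USER)
system_unit = choose_flow(True, False, ServiceMode.OTHER)
```

Every path other than the first falls through to package-only, which is why the command never stops or starts a service it did not install.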

{cli_agent_runner-0.1.35 → cli_agent_runner-0.1.37}/docs/configuration.md RENAMED

@@ -80,6 +80,7 @@ running with newly-set `dirty_action = "auto_commit"` is undefined).
 | `anomaly_repetitive_threshold` | `int` | 0 |
 | `host_health` | `MonitorHostHealthConfig` | MonitorHostHealthConfig(mem_avail_min_mb=200, disk_warning_pct=90.0, disk_critical_pct=95.0) |
 | `round_progress_interval_s` | `int` | 0 |
+| `supervisor_stale_threshold_s` | `int | None` | None |
 <!-- /gen:config-schema -->
 ### `vcs.dirty_action`
@@ -203,6 +204,7 @@ Unconfigured phases (and configs without `[phases]`) keep using the global
 [monitor]
 auto_stop_on = ["oauth_fail", "disk_critical"]
 round_progress_interval_s = 0  # 0 = disabled; set >0 to emit round_progress heartbeat events
+# supervisor_stale_threshold_s = 2700  # unset = round_timeout_s * 1.5; 0 = disable
 [monitor.host_health]
 mem_avail_min_mb = 200        # mem_pressure fires when mem_available_mb < this
