PyPI - cli-agent-runner - Versions diffs - 0.1.41__tar.gz → 0.1.42__tar.gz - Mend

cli-agent-runner 0.1.41__tar.gz → 0.1.42__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (235)

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/CHANGELOG.md RENAMED

@@ -5,6 +5,21 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.1.42] - 2026-06-25
+### Added
+- `crash_loop` defense — serve stops after 5 consecutive *unknown* short crashes (non-zero exit, <60s, no classified transient), escalating the restart delay and recording the failure reason. Ends the respawn-forever crash loop; recoverable-slow failures (rate-limit / quota / 5xx / timeout) still ride the transient-error backoff unchanged.
+- `config_broken` defense — a permanent startup-battery failure now halts serve (distinct no-retry exit code `78`) instead of respawning a broken config every round.
+### Fixed
+- `vcs.dirty_action` no longer sweeps the runner's own `log_dir` bookkeeping when `log_dir` is inside `work_dir`: `auto_commit` excludes it from the commit (no more phantom `git_head` advance on a zero-work round) and `stash` excludes it from `git stash push -u` (logs no longer vanish). `.evolving/` and agent work are unaffected.
+### Removed
+- The inert `smoke_fail_rate` monitor alert (could never fire — superseded by the always-on `config_broken` stop). Monitor now ships 11 detectors.
+### Docs
+- `thesis.md`: the stuck-loop defense is described honestly as a notify-level, opt-in-to-auto-stop monitor detector (`anomaly_repetitive_active`), not a default hard-stop; fixed the `stuck_loop_detected` naming drift.
 ## [0.1.41] - 2026-06-07
 ### Added

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: cli-agent-runner
-Version: 0.1.41
+Version: 0.1.42
 Summary: Restart-on-exit supervisor for autonomous CLI agents
 Project-URL: Homepage, https://github.com/wan9yu/cli-agent-runner
 Project-URL: Documentation, https://github.com/wan9yu/cli-agent-runner#readme
@@ -49,7 +49,7 @@ full disks, runaway memory.
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3: The Witness (monitor)           │  12 detectors + auto-stop
+│ Layer 3: The Witness (monitor)           │  11 detectors + auto-stop
 ├──────────────────────────────────────────┤
 │ Layer 2: The Loop (serve, ~120 LOC)      │  signal-trapping restart loop
 ├──────────────────────────────────────────┤
@@ -86,14 +86,14 @@ Full walkthrough: [`docs/quickstart.md`](docs/quickstart.md).
 |---|---|
 | `init` / `install` / `uninstall` | `peek` — state snapshot |
 | `start` / `stop` / `kill` / `cancel` | `watch` — peek in a refresh loop |
-| `restart` / `status` | `monitor` — 12 detectors, alerts, auto-stop |
+| `restart` / `status` | `monitor` — 11 detectors, alerts, auto-stop |
 | `round` / `serve` / `upgrade` | `events` — query / stream events.jsonl |
 Verb reference: [`docs/commands.md`](docs/commands.md).
 ## Defenses (built in)
-11 named defenses, structured as data — see `agent-runner peek --select defenses`.
+12 named defenses, structured as data — see `agent-runner peek --select defenses`.
 Each carries the historical incident it codifies and the invariant test that
 guards it. Highlights:
@@ -106,7 +106,7 @@ guards it. Highlights:
 Full list and rationale: [`docs/architecture.md`](docs/architecture.md).
-## Monitor: 12 detectors
+## Monitor: 11 detectors
 Notify only: `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`,
 `mem_pressure`, `smoke_fail_rate`, `network_fail`, `rate_limit_active`,

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/README.md RENAMED

@@ -12,7 +12,7 @@ full disks, runaway memory.
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3: The Witness (monitor)           │  12 detectors + auto-stop
+│ Layer 3: The Witness (monitor)           │  11 detectors + auto-stop
 ├──────────────────────────────────────────┤
 │ Layer 2: The Loop (serve, ~120 LOC)      │  signal-trapping restart loop
 ├──────────────────────────────────────────┤
@@ -49,14 +49,14 @@ Full walkthrough: [`docs/quickstart.md`](docs/quickstart.md).
 |---|---|
 | `init` / `install` / `uninstall` | `peek` — state snapshot |
 | `start` / `stop` / `kill` / `cancel` | `watch` — peek in a refresh loop |
-| `restart` / `status` | `monitor` — 12 detectors, alerts, auto-stop |
+| `restart` / `status` | `monitor` — 11 detectors, alerts, auto-stop |
 | `round` / `serve` / `upgrade` | `events` — query / stream events.jsonl |
 Verb reference: [`docs/commands.md`](docs/commands.md).
 ## Defenses (built in)
-11 named defenses, structured as data — see `agent-runner peek --select defenses`.
+12 named defenses, structured as data — see `agent-runner peek --select defenses`.
 Each carries the historical incident it codifies and the invariant test that
 guards it. Highlights:
@@ -69,7 +69,7 @@ guards it. Highlights:
 Full list and rationale: [`docs/architecture.md`](docs/architecture.md).
-## Monitor: 12 detectors
+## Monitor: 11 detectors
 Notify only: `timeout_rate`, `hung`, `orphan_chain`, `disk_warning`,
 `mem_pressure`, `smoke_fail_rate`, `network_fail`, `rate_limit_active`,

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/README.zh.md RENAMED

@@ -6,7 +6,7 @@
 把任意 CLI agent（Claude Code、自研 agent、任何长跑命令）包装成可被
 systemd / launchd 拉起、能被远程观测的服务。**每轮跑完进程退出**，外层
-supervisor 重启 —— 这是核心模式。中间穿插 11 条防御，避开 production 上
+supervisor 重启 —— 这是核心模式。中间穿插 12 条防御，避开 production 上
 最容易翻车的几条路：
 - 轮卡死、Tool 调用空转 → 硬墙 timeout
@@ -20,7 +20,7 @@ supervisor 重启 —— 这是核心模式。中间穿插 11 条防御，避开
 ```
 ┌──────────────────────────────────────────┐
-│ Layer 3：Witness（monitor）              │  12 个检测器 + 自动停服
+│ Layer 3：Witness（monitor）              │  11 个检测器 + 自动停服
 ├──────────────────────────────────────────┤
 │ Layer 2：Loop（serve，~120 LOC 薄壳）    │  捕获信号，循环拉起 round
 ├──────────────────────────────────────────┤
@@ -63,7 +63,7 @@ agent-runner monitor              # 实时异常检测，OAuth/磁盘 critical
 |---|---|
 | `init` / `install` / `uninstall` | `peek` —— 项目状态快照 |
 | `start` / `stop` / `kill` / `cancel` | `watch` —— peek 在刷新循环里 |
-| `restart` / `status` | `monitor` —— 12 个检测器 + 告警 + 自动停服 |
+| `restart` / `status` | `monitor` —— 11 个检测器 + 告警 + 自动停服 |
 | `round` / `serve` / `upgrade` | `events` —— 查询 / 流式订阅 events.jsonl |
 **停服三动词**有清晰的语义分层：
@@ -73,7 +73,7 @@ agent-runner monitor              # 实时异常检测，OAuth/磁盘 critical
 动词参考：[`docs/commands.md`](docs/commands.md)。
-## 内置防御（11 条）
+## 内置防御（12 条）
 防御以数据形式定义在 `agent_runner/defenses.py`，可通过
 `agent-runner peek --select defenses` 直接拿到。每条防御自带：
@@ -95,7 +95,7 @@ agent-runner monitor              # 实时异常检测，OAuth/磁盘 critical
 完整列表 + 历史出处：[`docs/architecture.md`](docs/architecture.md)。
-## Monitor：12 个检测器
+## Monitor：11 个检测器
 **只告警**（warning 级，服务继续跑）：
 `timeout_rate` / `hung` / `orphan_chain` / `disk_warning` /

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/agent_runner/_emit.py RENAMED

@@ -45,6 +45,29 @@ def emit_max_rounds_reached(log_dir: Path, *, rounds_completed: int, max_rounds:
     emit(log_dir, MAX_ROUNDS_REACHED, rounds_completed=rounds_completed, max_rounds=max_rounds)
+def emit_config_broken(log_dir: Path, *, reason: str) -> None:
+    """Emit config_broken (serve stopped on a permanent startup-battery failure)."""
+    from agent_runner.events import CONFIG_BROKEN, emit
+    emit(log_dir, CONFIG_BROKEN, reason=reason)
+def emit_crash_loop(log_dir: Path, *, consecutive: int, exit_code: int, log_path: Path) -> None:
+    """Emit crash_loop (serve stopped after consecutive unknown short crashes).
+    Captures the failure reason — a redacted tail of the round log — so a
+    recurring unknown crash can later be classified into a transient bucket.
+    """
+    from agent_runner._redact import redact_secrets
+    from agent_runner.events import CRASH_LOOP, emit
+    try:
+        reason = redact_secrets(log_path.read_text(errors="replace")[-2000:])
+    except OSError:
+        reason = ""
+    emit(log_dir, CRASH_LOOP, consecutive=consecutive, exit_code=exit_code, reason=reason)
 def emit_stop_file_detected(
     log_dir: Path, *, stop_file: Path, content: str, rounds_completed: int
 ) -> None:

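Reviewer note: the failure-reason capture in `emit_crash_loop` (read the round log, keep the last 2000 characters, redact, fall back to an empty string on `OSError`) can be tried in isolation. This is a minimal sketch under stated assumptions: `fake_redact` is a hypothetical stand-in that only masks `token=...` values, since the real `agent_runner._redact.redact_secrets` is not part of this diff and may behave differently; `crash_reason` is likewise an illustrative name.

```python
import re
import tempfile
from pathlib import Path

TAIL_CHARS = 2000  # same slice length as emit_crash_loop in the hunk above


def fake_redact(text: str) -> str:
    """Hypothetical stand-in for redact_secrets: masks token=... values only."""
    return re.sub(r"(token=)\S+", r"\1***", text)


def crash_reason(log_path: Path) -> str:
    """Return the redacted tail of a round log, or "" if it cannot be read."""
    try:
        return fake_redact(log_path.read_text(errors="replace")[-TAIL_CHARS:])
    except OSError:
        return ""  # missing or unreadable log yields an empty reason


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "round-1.log"
    log.write_text("x" * 5000 + "\nfatal: token=abc123 rejected\n")
    reason = crash_reason(log)
    missing = crash_reason(Path(tmp) / "no-such.log")
```

Because the slice is taken before redaction, the stored reason is never longer than the slice plus whatever the redactor adds; in this sketch masking shortens it.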
{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/agent_runner/_version.py RENAMED

@@ -18,7 +18,7 @@ version_tuple: tuple[int | str, ...]
 commit_id: str | None
 __commit_id__: str | None
-__version__ = version = '0.1.41'
-__version_tuple__ = version_tuple = (0, 1, 41)
+__version__ = version = '0.1.42'
+__version_tuple__ = version_tuple = (0, 1, 42)
 __commit_id__ = commit_id = None

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/agent_runner/api.py RENAMED

@@ -18,7 +18,7 @@ import sysconfig
 import time
 from collections.abc import Iterator
 from pathlib import Path
-from typing import Any
+from typing import Any, Literal
 from agent_runner import events, lifecycle
 from agent_runner.api_types import (
@@ -45,6 +45,59 @@ from agent_runner.service_unit import (
     serve_unit_filename,
 )
+# Exit code for a permanent (no-retry) startup-battery failure. A broken config
+# does not self-heal between rounds, so serve STOPS rather than respawning it
+# forever. 78 = EX_CONFIG (sysexits) — avoids argparse's 2 and the generic 1.
+# Lives here (not runner.py) so serve_cmd can import it from the sanctioned api
+# facade without coupling to runner (runner imports api, not the reverse).
+PERMANENT_CONFIG_EXIT = 78
+# Crash-loop circuit breaker (b12). The serve loop escalates the restart delay
+# on consecutive UNKNOWN short crashes (non-zero exit, short duration, no
+# classified transient) and STOPS after CRASH_LOOP_THRESHOLD of them — the Run 6
+# ~100-empty-rounds scar. Recoverable-slow failures (rate limit / 5h quota / 5xx
+# / timeout) are already handled by the transient-error throttle and never reach
+# this path. A clean (exit 0), long, or classified-transient round resets the run.
+CRASH_LOOP_THRESHOLD = 5
+CRASH_LOOP_SHORT_EXIT_S = 60  # mirrors monitor.SHORT_EXIT_THRESHOLD_S
+CRASH_LOOP_MAX_DELAY_S = 1800  # cap the escalating restart delay (30 min)
+def post_round_decision(
+    *,
+    returncode: int,
+    duration_s: float,
+    throttle_active: bool,
+    consecutive: int,
+    restart_delay_s: int,
+) -> tuple[Literal["config_broken", "crash_loop", "continue"], int, int]:
+    """Restart policy after one round — keeps the serve loop a thin dispatcher.
+    Returns ``(action, delay_s, consecutive)`` where action is:
+    - ``"config_broken"`` — permanent startup failure (b18): stop.
+    - ``"crash_loop"`` — CRASH_LOOP_THRESHOLD consecutive unknown short crashes
+      (b12): stop. An unknown short crash is a non-zero, fast exit with no
+      classified transient (rate-limit/5xx/timeout are handled by the throttle).
+    - ``"continue"`` — sleep ``delay_s`` then run the next round.
+    A clean (exit 0), long, or transient round resets ``consecutive`` to 0; an
+    unknown short crash escalates the delay (restart × 2ⁿ, capped) until the stop.
+    """
+    if returncode == PERMANENT_CONFIG_EXIT:
+        return ("config_broken", 0, consecutive)
+    unknown_short_crash = (
+        returncode != 0 and duration_s < CRASH_LOOP_SHORT_EXIT_S and not throttle_active
+    )
+    if unknown_short_crash:
+        consecutive += 1
+        if consecutive >= CRASH_LOOP_THRESHOLD:
+            return ("crash_loop", 0, consecutive)
+        delay = min(restart_delay_s * 2**consecutive, CRASH_LOOP_MAX_DELAY_S)
+        return ("continue", delay, consecutive)
+    delay = restart_delay_s if returncode == 0 else restart_delay_s * 2
+    return ("continue", delay, 0)
 _PROJECT_NAME_RE = re.compile(r"^[A-Za-z0-9._-]+$")
 _LINGER_HINT = (
@@ -730,6 +783,8 @@ def check_self_terminated_sentinel(log_dir: Path) -> bool:
 from agent_runner._emit import (  # noqa: E402,F401 — intentional bottom re-export
     emit_agent_usage_recorded,
     emit_anomaly_repetitive_tool,
+    emit_config_broken,
+    emit_crash_loop,
     emit_fresh_eyes_round_triggered,
     emit_max_rounds_reached,
     emit_rate_limit_stop,

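Reviewer note: the escalation in `post_round_decision` is easiest to check by tracing it. The sketch below copies the function body and constants from the hunk above into a standalone script (return type loosened to a plain tuple) and feeds it identical unknown short crashes. The helper `trace` and the chosen inputs (exit code 1, 3 s rounds, base delays of 10 s and 300 s) are illustrative assumptions, not part of the package.

```python
from typing import List, Tuple

# Constants copied from the api.py hunk above.
PERMANENT_CONFIG_EXIT = 78
CRASH_LOOP_THRESHOLD = 5
CRASH_LOOP_SHORT_EXIT_S = 60
CRASH_LOOP_MAX_DELAY_S = 1800


def post_round_decision(
    *,
    returncode: int,
    duration_s: float,
    throttle_active: bool,
    consecutive: int,
    restart_delay_s: int,
) -> Tuple[str, int, int]:
    """Same logic as the diff: returns (action, delay_s, consecutive)."""
    if returncode == PERMANENT_CONFIG_EXIT:
        return ("config_broken", 0, consecutive)
    unknown_short_crash = (
        returncode != 0 and duration_s < CRASH_LOOP_SHORT_EXIT_S and not throttle_active
    )
    if unknown_short_crash:
        consecutive += 1
        if consecutive >= CRASH_LOOP_THRESHOLD:
            return ("crash_loop", 0, consecutive)
        delay = min(restart_delay_s * 2**consecutive, CRASH_LOOP_MAX_DELAY_S)
        return ("continue", delay, consecutive)
    delay = restart_delay_s if returncode == 0 else restart_delay_s * 2
    return ("continue", delay, 0)


def trace(restart_delay_s: int) -> List[Tuple[str, int, int]]:
    """Illustrative helper: repeat an unknown short crash until serve would stop."""
    steps = []
    consecutive = 0
    for _ in range(20):  # upper bound only; the breaker stops the loop earlier
        step = post_round_decision(
            returncode=1,
            duration_s=3.0,
            throttle_active=False,
            consecutive=consecutive,
            restart_delay_s=restart_delay_s,
        )
        steps.append(step)
        consecutive = step[2]
        if step[0] != "continue":
            break
    return steps


short_base = trace(restart_delay_s=10)
long_base = trace(restart_delay_s=300)
reset = post_round_decision(
    returncode=0, duration_s=3.0, throttle_active=False, consecutive=4, restart_delay_s=10
)
```

With a 10 s base delay the sleeps are 20, 40, 80 and 160 s before the fifth crash stops serve; with a 300 s base the third and fourth sleeps hit the 1800 s cap. Since `consecutive` is incremented before the delay is computed, the first unknown short crash already waits twice the base delay. A single clean round resets the counter to 0.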
{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/agent_runner/cli/serve_cmd.py RENAMED

@@ -23,12 +23,15 @@ from agent_runner._throttle import _check_throttle_state
 from agent_runner._throttle import reset_counters as _reset_counters
 from agent_runner.api import (
     check_self_terminated_sentinel,
+    emit_config_broken,
+    emit_crash_loop,
     emit_fresh_eyes_round_triggered,
     emit_max_rounds_reached,
     emit_rate_limit_stop,
     emit_round_substrate_after,
     emit_round_substrate_before,
     emit_stop_file_detected,
+    post_round_decision,
 )
 from agent_runner.cli.common import cfg_from_args
 from agent_runner.hooks import run_serve_startup_hooks
@@ -135,6 +138,7 @@ def cmd(args) -> int:
     stop_file = cfg.runtime.stop_file  # cache: same pattern as effective_max_rounds
     work_dir = cfg.runtime.work_dir
     rounds_completed = 0
+    consecutive_crashes = 0  # b12: consecutive UNKNOWN short crashes (crash-loop breaker)
     try:
         pid_file.write(os.getpid())
@@ -197,6 +201,7 @@ def cmd(args) -> int:
                     every_n=cfg.runtime.fresh_eyes_every_n,
                 )
             round_log_path = log_dir / f"round-{round_num}.log"
+            round_started = time.monotonic()
             with round_log_path.open("w") as f:
                 r = subprocess.run(
                     [
@@ -211,6 +216,7 @@ def cmd(args) -> int:
                     stdout=f,
                     stderr=subprocess.STDOUT,
                 )
+            round_duration_s = time.monotonic() - round_started
             atomic_relink(log_dir / ROUND_CURRENT_LINK, round_log_path)
             git_head_after = compute_git_head(work_dir)
             paths_hash_after = compute_paths_hash(work_dir, cfg.runtime.substrate_fingerprint_paths)
@@ -221,13 +227,28 @@ def cmd(args) -> int:
                 paths_hash=paths_hash_after,
             )
             rounds_completed += 1
+            # Restart policy (config_broken / crash_loop / continue) lives in the
+            # tested api.post_round_decision helper so this loop stays thin.
+            action, delay, consecutive_crashes = post_round_decision(
+                returncode=r.returncode,
+                duration_s=round_duration_s,
+                throttle_active=_check_throttle_state(log_dir) is not None,
+                consecutive=consecutive_crashes,
+                restart_delay_s=cfg.runtime.restart_delay_s,
+            )
+            if action == "config_broken":
+                emit_config_broken(log_dir, reason="startup battery permanent failure")
+                break
+            if action == "crash_loop":
+                emit_crash_loop(
+                    log_dir,
+                    consecutive=consecutive_crashes,
+                    exit_code=r.returncode,
+                    log_path=round_log_path,
+                )
+                break
             if args.once or stop["requested"]:
                 break
-            delay = (
-                cfg.runtime.restart_delay_s
-                if r.returncode == 0
-                else cfg.runtime.restart_delay_s * 2
-            )
             time.sleep(delay)
     finally:
         pid_file.unlink()

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/agent_runner/defenses.py RENAMED

@@ -83,8 +83,18 @@ def catalog(cfg: Config) -> list[Defense]:
         Defense(
             name="startup_smoke_check",
             value="6 checks (config / log_dir / agent_cli / git / prompt_file / prompt_smoke)",
-            codifies="R721 + #446 — _common.md frontmatter caused 4h/123-round silent burn",
-            guarded_by=None,
+            codifies=(
+                "R721 + #446 — _common.md frontmatter caused 4h/123-round silent burn; "
+                "now halts serve (config_broken) instead of respawning a broken config"
+            ),
+            guarded_by=Path("tests/unit/test_serve_config_broken.py"),
+            current_state="active",
+        ),
+        Defense(
+            name="crash_loop_breaker",
+            value="stop after 5 consecutive short crashes; exp-escalating delay",
+            codifies="Run 6 — crashing agent respawned ~100 empty rounds at a fixed 2x delay",
+            guarded_by=Path("tests/unit/test_serve_crash_loop.py"),
             current_state="active",
         ),
         Defense(

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/agent_runner/events.py RENAMED

@@ -32,6 +32,8 @@ ANOMALY_REPETITIVE_TOOL = "anomaly_repetitive_tool"
 AGENT_NETWORK_BLIP = "agent_network_blip"
 AGENT_SPAWN = "agent_spawn"
 AGENT_USAGE_RECORDED = "agent_usage_recorded"
+CONFIG_BROKEN = "config_broken"
+CRASH_LOOP = "crash_loop"
 DIRTY_COMMIT_FAILED = "dirty_commit_failed"
 DIRTY_DETECTED = "dirty_detected"
 FRESH_EYES_ROUND_TRIGGERED = "fresh_eyes_round_triggered"

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/agent_runner/monitor.py RENAMED

@@ -49,7 +49,6 @@ KNOWN_ALERT_KINDS: frozenset[str] = frozenset(
         "disk_warning",
         "disk_critical",
         "mem_pressure",
-        "smoke_fail_rate",
         "oauth_fail",
         "network_fail",
         "rate_limit_active",
@@ -265,29 +264,6 @@ def detect_mem_pressure(metrics: list[dict[str, Any]], *, threshold_mb: int = 20
     )
-def detect_smoke_fail_rate(
-    events: list[dict[str, Any]], *, window: int = 10, threshold: float = 0.1
-) -> Alert | None:
-    ends = [e for e in events if e.get("event") == "round_end"]
-    if len(ends) < window:
-        return None
-    recent_round_nums = [e.get("round_num") for e in ends[-window:]]
-    fails = sum(
-        1
-        for e in events
-        if e.get("event") == "smoke_check_failed" and e.get("round_num") in recent_round_nums
-    )
-    rate = fails / window
-    if rate < threshold:
-        return None
-    return _alert(
-        "smoke_fail_rate",
-        "warning",
-        f"{fails}/{window} recent rounds had smoke_check_failed",
-        {"rate": rate, "threshold": threshold, "hint": "Inspect events.jsonl for failure reasons"},
-    )
 def detect_oauth_fail(
     events: list[dict[str, Any]],
     log_tails: dict[int, str],
@@ -603,7 +579,6 @@ def run_all_detectors(
         ),
         detect_disk_critical(metrics, threshold_pct=disk_critical_pct),
         detect_mem_pressure(metrics, threshold_mb=mem_avail_min_mb),
-        detect_smoke_fail_rate(events),
         detect_oauth_fail(events, log_tails, patterns=compiled_auth_pats, hint=auth_fail_hint),
         detect_network_fail(events, log_tails),
         detect_rate_limit_active(events, now=now.timestamp()),

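Reviewer note: the changelog calls the removed `smoke_fail_rate` alert inert. The sketch below reproduces only the counting step of the deleted `detect_smoke_fail_rate` to show why. The event records are illustrative: they assume `round_end` carries a `round_num` (as the detector itself implied) and that `smoke_check_failed` carries only a `reason`, which matches the `events.emit(log_dir, "smoke_check_failed", reason=...)` call in `runner.py`. The function name `count_recent_smoke_fails` is hypothetical.

```python
from typing import Any, Dict, List


def count_recent_smoke_fails(events: List[Dict[str, Any]], window: int = 10) -> int:
    """Counting step of the removed detector, logic unchanged."""
    ends = [e for e in events if e.get("event") == "round_end"]
    recent_round_nums = [e.get("round_num") for e in ends[-window:]]
    return sum(
        1
        for e in events
        if e.get("event") == "smoke_check_failed" and e.get("round_num") in recent_round_nums
    )


# Ten failing rounds: every round logs a smoke failure, yet none has a round_num.
events: List[Dict[str, Any]] = []
for n in range(1, 11):
    events.append({"event": "smoke_check_failed", "reason": "prompt_file: missing"})
    events.append({"event": "round_end", "round_num": n})

fails = count_recent_smoke_fails(events)
rate = fails / 10
```

The lookup `e.get("round_num")` returns `None` for every `smoke_check_failed` record, and `None` is never among the recent round numbers, so the computed rate stays at 0 and never reaches the 0.1 threshold.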
{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/agent_runner/runner.py RENAMED

@@ -369,7 +369,7 @@ def run_one_round(cfg: Config, *, phase_override: str | None = None) -> RoundRes
                 file=sys.stderr,
             )
             events.emit(log_dir, "smoke_check_failed", reason=f"{r.name}: {r.reason}")
-        sys.exit(1)
+        sys.exit(api.PERMANENT_CONFIG_EXIT)
     # Concurrency lock (per-project)
     lock_path = log_dir / "agent-runner.lock"
@@ -521,6 +521,7 @@ def _run_one_round_inner(cfg: Config, *, phase_override: str | None = None) -> R
                 round_num=round_num,
                 phase=phase,
                 idempotency_s=cfg.vcs.stash_idempotency_s,
+                log_dir=cfg.runtime.log_dir,
             )
             if ref is not None:
                 context_store.write_orphan_state(
@@ -546,7 +547,9 @@ def _run_one_round_inner(cfg: Config, *, phase_override: str | None = None) -> R
             # Leave tree dirty for next round; dirty_detected already emitted
             pass
         elif action == "auto_commit":
-            err = vcs_state.try_auto_commit(cfg.runtime.work_dir, round_num, phase)
+            err = vcs_state.try_auto_commit(
+                cfg.runtime.work_dir, round_num, phase, log_dir=cfg.runtime.log_dir
+            )
             if err is not None:
                 events.emit(
                     log_dir,

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/agent_runner/vcs_state.py RENAMED

@@ -223,12 +223,18 @@ def stash_orphan(
     round_num: int,
     phase: str | None,
     idempotency_s: int = 5,
+    log_dir: Path | None = None,
 ) -> StashRef | None:
     """Stash dirty tree as ORPHAN entry, SHA-locked.
     Returns existing ref if a matching ORPHAN was created within ``idempotency_s``
     (R820 lesson — same-second multiple calls would otherwise pile up duplicate
     stashes). Returns None if tree is clean.
+    ``log_dir`` (when under ``repo``) is excluded from the stash so ``git stash
+    push -u`` does not sweep the runner's own bookkeeping (lock / pid / event
+    logs) out of the work tree. If only ``log_dir`` churned, nothing is stashed
+    and this returns None.
     """
     if not detect_dirty_files(repo):
         return None
@@ -238,7 +244,8 @@ def stash_orphan(
     ts = time.strftime("%Y-%m-%dT%H:%M:%S")
     phase_part = f" phase={phase}" if phase else ""
     msg = f"ORPHAN R{round_num}{phase_part} ts={ts}"
-    push = _git(repo, "stash", "push", "-u", "-m", msg, timeout=30)
+    exclude = _log_dir_exclude_pathspec(repo, log_dir)
+    push = _git(repo, "stash", "push", "-u", "-m", msg, *exclude, timeout=30)
     if push.returncode != 0:
         return None
     listing = _git(repo, "stash", "list", "-1", "--format=%H %s")
@@ -287,22 +294,63 @@ def pop_stash(repo: Path, sha: str) -> bool:
     return _git(repo, "stash", "pop", sel).returncode == 0
-def try_auto_commit(work_dir: Path, round_num: int, phase: str | None) -> str | None:
+def _log_dir_exclude_pathspec(root: Path, log_dir: Path | None) -> list[str]:
+    """Git pathspec args excluding the runner's own ``log_dir`` from an add/stash,
+    applied only when it lives inside the work tree AND is not already gitignored.
+    Empty otherwise: an outside or gitignored log_dir is skipped by git's own
+    handling, and folding an ignored path into a stash pathspec breaks untracked
+    capture (git refuses the ignored path).
+    Keeps supervisor bookkeeping (lock / pid / event logs) out of the agent's
+    dirty-tree handling: without it a zero-work round's log churn lands in a
+    commit (``git_head`` lies) or a ``git stash push -u`` (the logs vanish).
+    """
+    if log_dir is None:
+        return []
+    try:
+        rel = log_dir.resolve().relative_to(root.resolve()).as_posix()
+    except ValueError:
+        return []  # log_dir outside work_dir → nothing to exclude
+    if _git(root, "check-ignore", "-q", rel).returncode == 0:
+        return []  # already gitignored → git skips it; pathspec would misfire
+    return ["--", f":(exclude){rel}"]
+def try_auto_commit(
+    work_dir: Path,
+    round_num: int,
+    phase: str | None,
+    *,
+    log_dir: Path | None = None,
+) -> str | None:
     """Auto-commit dirty tree with hardcoded subject. Return None on success, error on failure.
     Subject: ``agent-runner auto-commit: R<N> <phase>`` (phase part omitted if None).
     Uses ``git -c commit.gpgsign=false`` to skip GPG; honors pre-commit hooks
     (no ``--no-verify``). DOES NOT push — local commit only.
+    ``log_dir`` (when under ``work_dir``) is excluded from the add so a zero-work
+    round that only churned the runner's own bookkeeping (lock/pid/event logs)
+    does not advance ``git_head``. The agent's work and ``.evolving/`` live
+    outside ``log_dir`` and are still committed. If nothing remains staged after
+    the exclusion, this is a no-op (returns None, leaves HEAD untouched).
     Callers (runner.py) emit ``dirty_commit_failed`` event when return value is not None.
     """
     phase_part = f" {phase}" if phase else ""
     subject = f"agent-runner auto-commit: R{round_num}{phase_part}"
-    add_result = _git(work_dir, "add", "-A")
+    exclude = _log_dir_exclude_pathspec(work_dir, log_dir)
+    add_result = _git(work_dir, "add", "-A", *exclude)
     if add_result.returncode != 0:
         return (add_result.stderr or "git add failed")[:200]
+    # Only the exclusion can leave nothing staged (a zero-work round that churned
+    # only log_dir); without it the tree was dirty so there is always something to
+    # commit. Skip the extra git call on the common (no-exclusion) path.
+    if exclude and _git(work_dir, "diff", "--cached", "--quiet").returncode == 0:
+        return None
     commit_result = _git(
         work_dir,
         "commit",

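Reviewer note: the path handling in `_log_dir_exclude_pathspec` can be exercised without a repository. The sketch below keeps only the path-resolution part and deliberately omits the `git check-ignore` step, so it is not equivalent to the real helper for a gitignored `log_dir`. The function name `exclude_pathspec` and the directory names are illustrative assumptions.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def exclude_pathspec(root: Path, log_dir: Optional[Path]) -> List[str]:
    """Path-only part of the helper in the diff; the check-ignore step is omitted."""
    if log_dir is None:
        return []
    try:
        rel = log_dir.resolve().relative_to(root.resolve()).as_posix()
    except ValueError:
        return []  # log_dir is outside the work tree: nothing to exclude
    return ["--", f":(exclude){rel}"]


with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp) / "repo"
    inside = exclude_pathspec(work_dir, work_dir / ".agent-runner" / "logs")
    outside = exclude_pathspec(work_dir, Path(tmp) / "elsewhere" / "logs")
    unset = exclude_pathspec(work_dir, None)
```

The non-empty result is appended to the git argument list, so the add becomes `git add -A -- :(exclude).agent-runner/logs` and the stash call gains the same trailing pathspec.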
{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/docs/architecture.md RENAMED

@@ -34,7 +34,7 @@ All three accept the same drill-down flags: `--round N`, `--log`, `--events N`,
 ## Defenses-as-data
-`agent_runner.defenses.catalog(cfg)` returns 11 structured `Defense` entries.
+`agent_runner.defenses.catalog(cfg)` returns 12 structured `Defense` entries.
 Each entry carries:
 - `name` — stable identifier
@@ -59,13 +59,14 @@ surfacing everywhere.
 | `sha_locked_stash` | §9 IMMUTABLE — batch drop by index breaks under concurrent stash | `tests/invariants/test_stash_uses_sha_not_index.py` |
 | `set_diff_classification` | R2110 — rotation-only diff via +-line scan misclassifies | `—` |
 | `critical_envs_injection` | Env injection via [agent.env] block — preset-supplied per CLI (e.g. DISABLE_AUTOUPDATER for claude prevents mid-loop self-updates) | `—` |
-| `startup_smoke_check` | R721 + #446 — _common.md frontmatter caused 4h/123-round silent burn | `—` |
+| `startup_smoke_check` | R721 + #446 — _common.md frontmatter caused 4h/123-round silent burn; now halts serve (config_broken) instead of respawning a broken config | `tests/unit/test_serve_config_broken.py` |
+| `crash_loop_breaker` | Run 6 — crashing agent respawned ~100 empty rounds at a fixed 2x delay | `tests/unit/test_serve_crash_loop.py` |
 | `flock_concurrency` | Architectural — prevent concurrent supervisors corrupting state | `—` |
 | `atomic_state_writes` | Data integrity — crashes never leave half-written state files | `tests/invariants/test_atomic_write_enforced.py` |
 | `event_kind_registry` | Prevent events.emit() typos / unregistered kinds slipping past CI | `tests/invariants/test_event_kind_registry.py` |
 <!-- /gen:defenses-table -->
-## Monitor: 12 detectors
+## Monitor: 11 detectors
 Three categories by `auto_action`:
@@ -88,7 +89,6 @@ API quota / writing to a near-full disk).
 - `oauth_fail` — **auto-stop**
 - `orphan_chain`
 - `rate_limit_active`
-- `smoke_fail_rate`
 - `supervisor_stale`
 - `timeout_rate`
 <!-- /gen:detector-list -->
@@ -151,6 +151,8 @@ hook (vs ALL pre-round hooks), use `[plugins] disable = ["that_entry_point_name"
 - `agent_spawn`
 - `agent_usage_recorded`
 - `anomaly_repetitive_tool`
+- `config_broken`
+- `crash_loop`
 - `dirty_commit_failed`
 - `dirty_detected`
 - `fresh_eyes_round_triggered`
@@ -192,4 +194,4 @@ hook (vs ALL pre-round hooks), use `[plugins] disable = ["that_entry_point_name"
 三层架构：Round（一轮 agent）/ Loop（serve 薄壳）/ Witness（monitor）。
 三视角对称：peek（快照）/ watch（快照循环）/ monitor（异常检测），共用下钻参数。
-防御以结构化目录形式存在（11 条），每条防御自描述「防的是哪条历史教训、被哪个 invariant test 守、当前状态」。
+防御以结构化目录形式存在（12 条），每条防御自描述「防的是哪条历史教训、被哪个 invariant test 守、当前状态」。

{cli_agent_runner-0.1.41 → cli_agent_runner-0.1.42}/docs/commands.md RENAMED

@@ -145,7 +145,7 @@ agent-runner events --kind transient_error_backoff_capped --tail
 ### `agent-runner monitor [--host SSH-ALIAS] [--interval N] [--mode MODE] [--port PORT] [--json]`
-Anomaly-detection daemon. Runs the 12 detectors against the live state on every
+Anomaly-detection daemon. Runs the 11 detectors against the live state on every
 poll. Without `--host`, watches local logs at default 30s interval. With
 `--host`, watches a remote agent-runner over plain ssh at default 60s interval.

cli_agent_runner-0.1.42/docs/migrations/0.1.42.md ADDED

@@ -0,0 +1,58 @@
+# Migrating to 0.1.42
+## TL;DR
+```bash
+pip install --upgrade cli-agent-runner==0.1.42
+```
+Two new always-on serve defenses (`crash_loop`, `config_broken`), one removed
+(inert) monitor alert, and an `auto_commit` scope fix. No config-schema change;
+no action required for a healthy deployment.
+## Behavior change: serve now STOPS on two harmful states (instead of respawning)
+Both fire in the always-on path (no `monitor` process required):
+- **`config_broken`** — if the startup battery fails (broken config: missing
+  prompt, non-git `work_dir`, agent CLI not on PATH, sub-500-byte prompt, …),
+  the round exits with the no-retry code `78` and serve emits `config_broken`
+  and stops. Previously the round exited `1` and serve respawned the broken
+  config forever. The specific cause is in the round's `smoke_check_failed`
+  event. Fix the config and restart.
+- **`crash_loop`** — after **5 consecutive** *unknown short crashes* (a round
+  that exits non-zero in under 60s with no classified transient error), serve
+  emits `crash_loop` (carrying `consecutive`, `exit_code`, and a redacted reason
+  tail) and stops, escalating the restart delay along the way. Previously such a
+  round respawned forever at a fixed 2× delay (the Run 6 ~100-empty-rounds
+  incident).
+Recoverable-slow failures are unaffected: rate-limit / 5h-quota / 5xx / timeout
+are classified as transient errors and still ride the existing
+`transient_error_*` backoff (`rate_limit_account` waits the server's exact
+`resetsAt`). They never count toward the crash-loop breaker.
+To watch for these: `grep -E '"event": "(crash_loop|config_broken)"' events-*.jsonl`.
+## Removed: the `smoke_fail_rate` monitor alert
+It could never fire (it matched on `round_num`, which `smoke_check_failed` never
+carried) and is now superseded by the always-on `config_broken` stop. If you
+subscribed to `smoke_fail_rate` (it never emitted), switch to `config_broken`.
+`monitor` now reports **11** detectors.
+## Fixed: dirty-tree handling no longer sweeps the runner's `log_dir`
+When `log_dir` is inside `work_dir`, **both** VCS dirty-actions now exclude the
+supervisor's own bookkeeping (lock, pid, `events-*.jsonl`, round logs):
+- `dirty_action = "auto_commit"` excludes it from the commit — previously the
+  per-round churn produced a non-empty commit even on a zero-work round,
+  advancing `git_head` and making the progress signal lie.
+- `dirty_action = "stash"` (the default) excludes it from `git stash push -u` —
+  previously the logs (and the events file being written) were swept into the
+  stash and vanished from the work tree each round.
+Your agent's work and the `.evolving/` ledger live outside `log_dir` and are
+unaffected. Default deployments (`log_dir` at `~/.agent-runner/{project}/logs`,
+outside `work_dir`) were never affected by either.

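Reviewer note: the `grep` one-liner in the migration guide has a straightforward Python counterpart for callers who want the parsed records. This is a sketch under assumptions: it relies on each line of an `events-*.jsonl` file being one JSON object with an `"event"` key (as the grep pattern suggests), the file name `events-2026-06-25.jsonl` and the sample field values are made up for the demonstration, and `find_stop_events` is a hypothetical helper, not part of the package.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List

STOP_EVENTS = {"crash_loop", "config_broken"}


def find_stop_events(log_dir: Path) -> List[Dict[str, Any]]:
    """Collect crash_loop / config_broken records from events-*.jsonl files."""
    found = []
    for path in sorted(log_dir.glob("events-*.jsonl")):
        for line in path.read_text(errors="replace").splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank or partially written lines
            if isinstance(record, dict) and record.get("event") in STOP_EVENTS:
                found.append(record)
    return found


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    sample = [
        {"event": "round_end", "round_num": 7},
        {"event": "crash_loop", "consecutive": 5, "exit_code": 1, "reason": "..."},
        {"event": "config_broken", "reason": "startup battery permanent failure"},
    ]
    lines = [json.dumps(r) for r in sample] + ["{truncated"]
    (log_dir / "events-2026-06-25.jsonl").write_text("\n".join(lines) + "\n")
    hits = find_stop_events(log_dir)
```

Parsing instead of pattern matching avoids depending on the exact spacing after `"event":` in the serialized line.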