nexo-brain 7.11.0 → 7.11.2

@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "nexo-brain",
3
- "version": "7.11.0",
3
+ "version": "7.11.2",
4
4
  "description": "Local cognitive runtime for Claude Code \u2014 persistent memory, overnight learning, doctor diagnostics, personal scripts, recovery-aware jobs, startup preflight, and optional dashboard/power helper.",
5
5
  "author": {
6
6
  "name": "NEXO Brain",
package/README.md CHANGED
@@ -18,7 +18,9 @@
18
18
 
19
19
  [Watch the overview video](https://nexo-brain.com/watch/) · [Watch on YouTube](https://www.youtube.com/watch?v=i2lkGhKyVqI) · [Open the infographic](https://nexo-brain.com/assets/nexo-brain-infographic-v5.png)
20
20
 
21
- Version `7.11.0` is the current packaged-runtime line. Minor release — introduces a runtime fingerprint that gates `mcp-restart-required.json`. `nexo update` now only forces connected MCP clients (Claude Code, Codex, Claude Desktop) to restart when at least one `.py` file the running server actually imports has changed. README-only, blog-only, changelog-only releases skip the restart entirely. Conservative fallback (#186): if the fingerprint can't be computed, the gate behaves like the legacy version-string check and writes the marker. Explicit opt-in escape hatch via `"force_restart": true` in `version.json`. Marker schema bumped to v2 with optional `from_fingerprint` / `to_fingerprint`. Full write-up in [`docs/runtime-fingerprint.md`](docs/runtime-fingerprint.md).
21
+ Version `7.11.2` is the current packaged-runtime line. Patch release with two reliability fixes in the same family ("components ignoring signals they should respect"): (1) a `STUCK CRON REAPER` added to `nexo-watchdog.sh` and (2) the Guardian/Enforcer now honors the `mcp-restart-required` marker. Previously the enforcer kept injecting `<system-reminder>` blocks asking the agent to call `nexo_*` tools while the MCP server was already returning `mcp_restart_required` for every call, so every ping was a guaranteed no-op. The new gate at the top of `HeadlessEnforcer._enqueue()` reads the marker file (cached per-instance, 30s TTL) and skips reminders that mention `nexo_` while the marker is present. Reminders that don't reference `nexo_*` (R23 deploy guards, R25 nora/maria read-only, etc.) still fire because they don't depend on the MCP. The watchdog reaper closes a sibling gap: the v5.8.1 fix taught the watchdog to leave running jobs alone (it had been killing `deep-sleep` mid-flight 2026-04-14..17). The same restraint silently let truly hung wrappers (e.g. headless `claude --bare` blocked on an MCP that flagged `mcp_restart_required`) block their own next tick for days (`morning-agent`, `followup-runner` and `orchestrator-v2` went silent 2026-04-24..27). The reaper sweeps every `cron_runs` row with `ended_at IS NULL` and reaps anything older than `stuck_after_seconds` (per-cron from `manifest.json`, fallback 12h global). Live wrapper → `SIGTERM` (the wrapper's existing trap closes the row at `exit 143`), 10s grace, then `SIGKILL` on wrapper + descendants. Orphan zombie row → cleaned in-band with `exit_code=137`. `cron_id='watchdog'` is a hard-coded skip so the watchdog never reaps itself. Generous defaults (deep-sleep 8h, sleep/evolution 4h) prevent any v5.8.1 regression. New observability: `summary.reaped` in `watchdog-status.json`, `REAPED:` header in the human report, `REAPED=N` in the final log line. 6 new tests; 3 existing watchdog tests stay green.
22
+
23
+ Previously in `7.11.1`: patch release that caches the runtime fingerprint by `(file_count, size_total, max_mtime)` signature so MCP startup and the per-tool-call `resolve_restart_required` skip the 263-file rehash when nothing on disk changed. ~11× speedup on the warm path (~40ms → ~3.7ms locally), ~10-20s/day saved across Claude Code / Codex / headless / deep-sleep / cron startups. A cache miss is always safe (it falls through to a full hash and self-repairs). Default `use_cache=False` keeps `plugins/update.py` on the ground-truth path around `git pull` / `npm update`. Builds on the v7.11.0 runtime fingerprint that gates `mcp-restart-required.json`. Full write-up in [`docs/runtime-fingerprint.md`](docs/runtime-fingerprint.md).
22
24
 
23
25
  Previously in `7.10.0`: minor release — **removes the LLM proxy override path that 7.9.28 → 7.9.34 introduced**. Background: 7.9.28 added two opt-in files at `~/.nexo/config/llm_endpoint.json` and `~/.nexo/config/auth_provider.json` that let a third-party orchestrator (NEXO Desktop) redirect every Anthropic SDK call from Brain to a custom proxy and resolve the bearer via a local helper, with concrete model names translated to wire aliases (`nexo-max`, `nexo-high`, `nexo-medium`, `nexo-low`, `nexo-mini`) and an `Idempotency-Key` per request. NEXO Desktop's commercial model has changed: Desktop is now a wrapper over the user's own Claude Code subscription (Max / Pro), with a separate Desktop licence. Brain calls go directly to `api.anthropic.com` using the user's existing OAuth (the one stored under `~/.claude/` and consumed by Claude Code spawns) or a plain `ANTHROPIC_API_KEY`. There is no NEXO bearer, no NEXO proxy, no NEXO credit accounting in this codebase. Every proxy symbol is gone from `call_model_raw.py` and `agent_runner.py`; the proxy-specific tests and `docs/api/override-files.md` are removed; any pre-existing override files on disk are simply ignored from this release forward.
24
26
 
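The sweep described in the `7.11.2` README text (open `cron_runs` rows compared against a per-cron threshold) could be sketched in Python roughly as follows. This is only an illustration: the table layout is a minimal assumption built from column names visible in this diff, `find_stuck_rows` and `demo` are hypothetical names, and the packaged logic is implemented in shell inside `nexo-watchdog.sh`.

```python
import sqlite3

DEFAULT_STUCK_SECONDS = 12 * 3600  # 12h global fallback, per the README text
SKIP_IDS = {"watchdog"}            # the watchdog never reaps itself


def find_stuck_rows(conn, thresholds):
    """Return (row_id, cron_id, age_secs) for open rows older than their threshold."""
    rows = conn.execute(
        """
        SELECT id, cron_id,
               CAST(strftime('%s','now') - strftime('%s', started_at) AS INTEGER)
        FROM cron_runs
        WHERE ended_at IS NULL
        ORDER BY id DESC
        """
    ).fetchall()
    stuck = []
    for row_id, cron_id, age in rows:
        if cron_id in SKIP_IDS:
            continue
        if age > thresholds.get(cron_id, DEFAULT_STUCK_SECONDS):
            stuck.append((row_id, cron_id, age))
    return stuck


def demo():
    """Build an in-memory table (assumed schema) and run the sweep once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cron_runs (id INTEGER PRIMARY KEY, cron_id TEXT, "
        "started_at TEXT, ended_at TEXT, exit_code INTEGER)"
    )
    samples = [
        ("deep-sleep", "-9 hours", None),        # open, past an 8h threshold
        ("morning-agent", "-5 minutes", None),   # open, well under the fallback
        ("watchdog", "-2 days", None),           # open, but always skipped
        ("followup-runner", "-3 days", "done"),  # already closed
    ]
    for cron_id, offset, ended in samples:
        conn.execute(
            "INSERT INTO cron_runs (cron_id, started_at, ended_at) "
            "VALUES (?, datetime('now', ?), ?)",
            (cron_id, offset, ended),
        )
    return find_stuck_rows(conn, {"deep-sleep": 28800})
```

Calling `demo()` should report only the `deep-sleep` row: the `morning-agent` row is younger than the fallback, the `watchdog` row is skipped by id, and the closed row never matches `ended_at IS NULL`.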
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "nexo-brain",
3
- "version": "7.11.0",
3
+ "version": "7.11.2",
4
4
  "mcpName": "io.github.wazionapps/nexo",
5
5
  "description": "NEXO Brain — Shared brain for AI agents. Persistent memory, semantic RAG, natural forgetting, metacognitive guard, trust scoring, 150+ MCP tools. Works with Claude Code, Codex, Claude Desktop & any MCP client. 100% local, free.",
6
6
  "homepage": "https://nexo-brain.com",
@@ -13,6 +13,7 @@
13
13
  "recovery_policy": "catchup",
14
14
  "idempotent": true,
15
15
  "max_catchup_age": 172800,
16
+ "stuck_after_seconds": 28800,
16
17
  "run_on_boot": true,
17
18
  "run_on_wake": true
18
19
  },
@@ -38,6 +39,7 @@
38
39
  "recovery_policy": "catchup",
39
40
  "idempotent": true,
40
41
  "max_catchup_age": 172800,
42
+ "stuck_after_seconds": 14400,
41
43
  "run_on_boot": true,
42
44
  "run_on_wake": true
43
45
  },
@@ -140,6 +142,7 @@
140
142
  "recovery_policy": "catchup",
141
143
  "idempotent": true,
142
144
  "max_catchup_age": 1209600,
145
+ "stuck_after_seconds": 14400,
143
146
  "run_on_boot": true,
144
147
  "run_on_wake": true
145
148
  },
@@ -295,6 +298,7 @@
295
298
  "recovery_policy": "run_once_on_wake",
296
299
  "idempotent": true,
297
300
  "max_catchup_age": 1200,
301
+ "stuck_after_seconds": 600,
298
302
  "run_on_boot": false,
299
303
  "run_on_wake": true
300
304
  },
@@ -308,6 +312,7 @@
308
312
  "recovery_policy": "run_once_on_wake",
309
313
  "idempotent": true,
310
314
  "max_catchup_age": 7200,
315
+ "stuck_after_seconds": 1800,
311
316
  "run_on_boot": false,
312
317
  "run_on_wake": true
313
318
  },
@@ -321,6 +326,7 @@
321
326
  "recovery_policy": "catchup",
322
327
  "idempotent": true,
323
328
  "max_catchup_age": 86400,
329
+ "stuck_after_seconds": 1800,
324
330
  "run_on_boot": false,
325
331
  "run_on_wake": true
326
332
  }
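To make the new `stuck_after_seconds` values easier to read, here is a small Python sketch of how a per-cron threshold might be resolved with the 12h fallback. It mirrors the validation in the Python snippet embedded in the watchdog (positive numeric values under a `crons` list keyed by `id`); `load_thresholds`, `threshold_for`, `human`, and the two `example-*` ids are hypothetical names used only for illustration.

```python
import json

DEFAULT_STUCK_SECONDS = 43200  # 12h, matching STUCK_DEFAULT_SECONDS in the watchdog


def load_thresholds(manifest_text):
    """Map cron id -> threshold in seconds, ignoring missing or non-positive values."""
    try:
        data = json.loads(manifest_text)
    except ValueError:
        return {}
    if not isinstance(data, dict):
        return {}
    out = {}
    for cron in data.get("crons", []):
        cid = cron.get("id")
        th = cron.get("stuck_after_seconds")
        if cid and isinstance(th, (int, float)) and th > 0:
            out[cid] = int(th)
    return out


def threshold_for(cron_id, thresholds):
    """Per-cron value when present, otherwise the global fallback."""
    return thresholds.get(cron_id, DEFAULT_STUCK_SECONDS)


def human(seconds):
    """Render a second count as hours and minutes."""
    hours, rem = divmod(seconds, 3600)
    return f"{hours}h{rem // 60:02d}m"


# Illustrative manifest fragment; only the field names come from the diff.
MANIFEST = json.dumps({"crons": [
    {"id": "deep-sleep", "stuck_after_seconds": 28800},
    {"id": "sleep", "stuck_after_seconds": 14400},
    {"id": "example-short-job", "stuck_after_seconds": 600},
    {"id": "example-no-threshold"},
]})
```

With this helper, `human(28800)` is `8h00m`, `human(14400)` is `4h00m`, `human(1800)` is `0h30m`, and `human(600)` is `0h10m`; a cron without the field resolves to the 43200-second fallback.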
@@ -2520,6 +2520,44 @@ class HeadlessEnforcer:
2520
2520
  # the per-rule tag collision check, and time-dedup at the call site.
2521
2521
  _LEGACY_TAG_PREFIXES = ("after:", "periodic_msgs:", "periodic_time:", "start:")
2522
2522
 
2523
+ @staticmethod
2524
+ def _mcp_restart_marker_path() -> "Path":
2525
+ """Resolve the path to the MCP restart-required marker on disk.
2526
+
2527
+ The marker is written by `plugins/update.py` when a `nexo update`
2528
+ actually changes runtime `.py` bytes (cf. v7.11.0 fingerprint
2529
+ gating). Honors the F0.6 runtime/operations/ canonical layout
2530
+ with a fall-back to the pre-F0.6 operations/ legacy layout so
2531
+ half-migrated installs are still detected correctly.
2532
+ """
2533
+ from pathlib import Path as _Path
2534
+ home = _Path(os.environ.get("NEXO_HOME", str(_Path.home() / ".nexo")))
2535
+ new = home / "runtime" / "operations" / "mcp-restart-required.json"
2536
+ if new.is_file():
2537
+ return new
2538
+ legacy = home / "operations" / "mcp-restart-required.json"
2539
+ return legacy if legacy.is_file() else new
2540
+
2541
+ def _mcp_restart_pending(self) -> bool:
2542
+ """Return True if the MCP server has a restart-required marker on disk.
2543
+
2544
+ Cached per-instance with a 30s TTL: the marker rarely changes mid-
2545
+ session (it's written by `nexo update` and cleared by the next
2546
+ client restart) but a TTL keeps long-lived enforcer instances from
2547
+ getting stuck on a stale negative cache if the operator runs
2548
+ `nexo update` mid-session without restarting.
2549
+ """
2550
+ cached_at = getattr(self, "_mcp_restart_pending_cache_at", 0.0)
2551
+ if (time.time() - cached_at) < 30.0:
2552
+ return getattr(self, "_mcp_restart_pending_cache", False)
2553
+ try:
2554
+ result = self._mcp_restart_marker_path().is_file()
2555
+ except Exception: # noqa: BLE001 — never block enforcement on path errors
2556
+ result = False
2557
+ self._mcp_restart_pending_cache = result
2558
+ self._mcp_restart_pending_cache_at = time.time()
2559
+ return result
2560
+
2523
2561
  def _enqueue(self, prompt: str, tag: str, rule_id: str = ""):
2524
2562
  """Enqueue an injection. Mirrors Desktop _enqueue for parity.
2525
2563
 
@@ -2535,6 +2573,21 @@ class HeadlessEnforcer:
2535
2573
  """
2536
2574
  if any(q["tag"] == tag for q in self.injection_queue):
2537
2575
  return
2576
+ # v7.11.2: suppress reminders that ask the agent to call nexo_*
2577
+ # tools while the MCP server has a restart-required marker on
2578
+ # disk. Without this gate every periodic ping ("Execute
2579
+ # nexo_session_diary_write", "Execute nexo_smart_startup",
2580
+ # nexo_guard_check pre-Edit, etc) returns mcp_restart_required
2581
+ # and the agent burns cycles on guaranteed no-ops. Reminders that
2582
+ # don't reference nexo_* (R23 deploy guards, R25 nora/maria
2583
+ # read-only, etc) still fire — they don't depend on the MCP.
2584
+ if "nexo_" in prompt and self._mcp_restart_pending():
2585
+ _logger.info(
2586
+ "SKIP: %s — mcp_restart_required marker present (rule_id=%s)",
2587
+ tag,
2588
+ rule_id or "?",
2589
+ )
2590
+ return
2538
2591
  legacy = tag.startswith(self._LEGACY_TAG_PREFIXES)
2539
2592
  if legacy:
2540
2593
  tool = tag.split(":")[-1].split("->")[-1]
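The behaviour of the new gate can be explored in isolation with a small stand-in. The sketch below is not the packaged class: `ReminderGate` and `should_enqueue` are hypothetical names, and the injectable `clock` (defaulting to `time.monotonic`, where the diff uses `time.time()`) is an assumption added so the 30s TTL can be exercised without sleeping.

```python
import time
from pathlib import Path


class ReminderGate:
    """Hypothetical stand-in for the gate at the top of HeadlessEnforcer._enqueue()."""

    TTL_SECONDS = 30.0

    def __init__(self, marker_path, clock=time.monotonic):
        self.marker_path = Path(marker_path)
        self.clock = clock
        self._cached = False
        self._cached_at = None  # None means the marker was never checked

    def restart_pending(self):
        """Return the cached answer while it is fresh, otherwise re-check the disk."""
        now = self.clock()
        if self._cached_at is not None and (now - self._cached_at) < self.TTL_SECONDS:
            return self._cached
        try:
            self._cached = self.marker_path.is_file()
        except OSError:
            # Never block enforcement on path errors.
            self._cached = False
        self._cached_at = now
        return self._cached

    def should_enqueue(self, prompt):
        """Suppress only reminders that mention nexo_* while a restart is pending."""
        return not ("nexo_" in prompt and self.restart_pending())
```

One consequence of the TTL, visible in this sketch and presumably in the real gate as well: a marker written right after a negative check is not noticed until the cached answer expires, so suppression can lag by up to 30 seconds. Prompts without `nexo_` never touch the disk at all because of the short-circuit.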
@@ -160,6 +160,98 @@ def restart_required_marker_path() -> Path:
160
160
  return paths.operations_dir() / "mcp-restart-required.json"
161
161
 
162
162
 
163
+ def fingerprint_cache_path() -> Path:
164
+ """Where the runtime fingerprint cache lives.
165
+
166
+ The cache lets `prime_process_fingerprint()` and `installed_runtime_fingerprint()`
167
+ skip hashing 200+ source files on every MCP startup / tool call when the
168
+ runtime tree on disk hasn't changed (same file count, same total size, same
169
+ max mtime). Invalidates automatically when any source byte changes.
170
+ """
171
+ return paths.operations_dir() / "fingerprint-cache.json"
172
+
173
+
174
+ def _runtime_tree_signature(src_dir: Path) -> tuple[int, int, float] | None:
175
+ """Cheap stat-only walk over the fingerprint-tracked tree.
176
+
177
+ Returns ``(file_count, size_total, max_mtime)`` or ``None`` when the source
178
+ tree cannot be traversed. This is the cache key — if it matches, the bytes
179
+ haven't changed in any way the fingerprint would care about.
180
+ """
181
+ try:
182
+ files = _iter_runtime_source_files(src_dir)
183
+ except Exception:
184
+ return None
185
+ if not files:
186
+ return None
187
+ count = 0
188
+ size_total = 0
189
+ max_mtime = 0.0
190
+ for path in files:
191
+ try:
192
+ st = path.stat()
193
+ except Exception:
194
+ return None
195
+ count += 1
196
+ size_total += int(st.st_size)
197
+ if st.st_mtime > max_mtime:
198
+ max_mtime = float(st.st_mtime)
199
+ return (count, size_total, max_mtime)
200
+
201
+
202
+ def _read_fingerprint_cache(src_dir: Path) -> str:
203
+ """Return cached fingerprint when the on-disk signature still matches.
204
+
205
+ Empty string means cache miss (corrupt, missing, or signature drifted).
206
+ Cache miss is always safe — caller falls through to a full hash.
207
+ """
208
+ cache_path = fingerprint_cache_path()
209
+ if not cache_path.is_file():
210
+ return ""
211
+ try:
212
+ payload = json.loads(cache_path.read_text(encoding="utf-8"))
213
+ except Exception:
214
+ return ""
215
+ if not isinstance(payload, dict):
216
+ return ""
217
+ if str(payload.get("src_dir") or "") != str(src_dir):
218
+ return ""
219
+ sig = _runtime_tree_signature(src_dir)
220
+ if sig is None:
221
+ return ""
222
+ try:
223
+ cached_count = int(payload.get("file_count"))
224
+ cached_size = int(payload.get("size_total"))
225
+ cached_mtime = float(payload.get("max_mtime"))
226
+ except (TypeError, ValueError):
227
+ return ""
228
+ if cached_count != sig[0] or cached_size != sig[1] or cached_mtime != sig[2]:
229
+ return ""
230
+ fingerprint = str(payload.get("fingerprint") or "").strip()
231
+ return fingerprint
232
+
233
+
234
+ def _write_fingerprint_cache(src_dir: Path, fingerprint: str) -> None:
235
+ """Persist the fingerprint+signature pair. Best-effort; failures don't propagate."""
236
+ if not fingerprint:
237
+ return
238
+ sig = _runtime_tree_signature(src_dir)
239
+ if sig is None:
240
+ return
241
+ payload = {
242
+ "fingerprint": fingerprint,
243
+ "src_dir": str(src_dir),
244
+ "file_count": sig[0],
245
+ "size_total": sig[1],
246
+ "max_mtime": sig[2],
247
+ "updated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
248
+ }
249
+ try:
250
+ _write_json_atomic(fingerprint_cache_path(), payload)
251
+ except Exception:
252
+ pass
253
+
254
+
163
255
  def _candidate_version_files(base: Path) -> list[Path]:
164
256
  return [
165
257
  base / "version.json",
@@ -225,7 +317,9 @@ def _iter_runtime_source_files(src_dir: Path) -> list[Path]:
225
317
  return out
226
318
 
227
319
 
228
- def compute_mcp_runtime_fingerprint(src_dir: Path | None = None) -> str:
320
+ def compute_mcp_runtime_fingerprint(
321
+ src_dir: Path | None = None, *, use_cache: bool = False
322
+ ) -> str:
229
323
  """Hash of every Python source file the running MCP can import.
230
324
 
231
325
  Returns a sha256 hex digest, or "" when the source tree cannot be located
@@ -240,6 +334,14 @@ def compute_mcp_runtime_fingerprint(src_dir: Path | None = None) -> str:
240
334
  * non-`.py` assets (docs, blogs, READMEs, JSON/YAML configs, templates,
241
335
  CHANGELOG, marketing files) — these never affect what the live MCP
242
336
  process executes
337
+
338
+ When ``use_cache=True`` (hot paths: server startup, every tool call) the
339
+ function consults ``fingerprint-cache.json``: if the on-disk tree
340
+ signature (file count + total size + max mtime) still matches the cached
341
+ one, the cached digest is returned without re-reading any byte. Cache miss
342
+ falls through to the normal full-hash path and writes a fresh entry. The
343
+ update flow keeps ``use_cache=False`` (default) so it always sees ground
344
+ truth around the pull/npm step.
243
345
  """
244
346
  if src_dir is None:
245
347
  candidates: list[Path] = []
@@ -267,6 +369,11 @@ def compute_mcp_runtime_fingerprint(src_dir: Path | None = None) -> str:
267
369
  if src_dir is None:
268
370
  return ""
269
371
 
372
+ if use_cache:
373
+ cached = _read_fingerprint_cache(src_dir)
374
+ if cached:
375
+ return cached
376
+
270
377
  files = _iter_runtime_source_files(src_dir)
271
378
  if not files:
272
379
  return ""
@@ -283,11 +390,19 @@ def compute_mcp_runtime_fingerprint(src_dir: Path | None = None) -> str:
283
390
  except Exception:
284
391
  return ""
285
392
  h.update(b"\n")
286
- return h.hexdigest()
393
+ digest = h.hexdigest()
394
+ if use_cache and digest:
395
+ _write_fingerprint_cache(src_dir, digest)
396
+ return digest
287
397
 
288
398
 
289
399
  def installed_runtime_fingerprint() -> str:
290
- """Fingerprint of whatever runtime source tree is on disk right now."""
400
+ """Fingerprint of whatever runtime source tree is on disk right now.
401
+
402
+ Hot path — runs on every MCP tool call via ``resolve_restart_required``.
403
+ Uses the disk-signature cache so a repeated call without any source
404
+ change is a few stat() syscalls instead of 200+ file reads.
405
+ """
291
406
  candidates: list[Path] = []
292
407
  try:
293
408
  root = active_runtime_root()
@@ -308,7 +423,7 @@ def installed_runtime_fingerprint() -> str:
308
423
  except Exception:
309
424
  pass
310
425
  for cand in candidates:
311
- fp = compute_mcp_runtime_fingerprint(cand)
426
+ fp = compute_mcp_runtime_fingerprint(cand, use_cache=True)
312
427
  if fp:
313
428
  return fp
314
429
  return ""
@@ -616,7 +731,7 @@ def prime_process_fingerprint() -> str:
616
731
  except Exception:
617
732
  pass
618
733
  for cand in candidates:
619
- fp = compute_mcp_runtime_fingerprint(cand)
734
+ fp = compute_mcp_runtime_fingerprint(cand, use_cache=True)
620
735
  if fp:
621
736
  PROCESS_FINGERPRINT = fp
622
737
  return PROCESS_FINGERPRINT
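The stat-signature idea behind the fingerprint cache can be tried on a throwaway directory. The following is a simplified sketch under stated assumptions: it keeps the cache in memory rather than in `fingerprint-cache.json`, walks `*.py` files with `rglob` instead of the package's `_iter_runtime_source_files`, and hashes file names plus bytes, which may differ from what the real fingerprint covers. `tree_signature`, `full_hash`, `CachedFingerprint`, and `demo` are hypothetical names.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def tree_signature(src_dir):
    """Stat-only cache key: (file_count, size_total, max_mtime) over *.py files."""
    count, size_total, max_mtime = 0, 0, 0.0
    for path in sorted(Path(src_dir).rglob("*.py")):
        st = path.stat()
        count += 1
        size_total += st.st_size
        max_mtime = max(max_mtime, st.st_mtime)
    return (count, size_total, max_mtime)


def full_hash(src_dir):
    """Expensive path: read every byte of every tracked file."""
    h = hashlib.sha256()
    for path in sorted(Path(src_dir).rglob("*.py")):
        h.update(path.name.encode())
        h.update(path.read_bytes())
        h.update(b"\n")
    return h.hexdigest()


class CachedFingerprint:
    def __init__(self):
        self.signature = None
        self.digest = ""
        self.full_hashes = 0  # counts how often the slow path ran

    def get(self, src_dir):
        sig = tree_signature(src_dir)
        if sig == self.signature and self.digest:
            return self.digest  # warm path: stat calls only
        self.digest = full_hash(src_dir)
        self.full_hashes += 1
        self.signature = sig
        return self.digest


def demo():
    with tempfile.TemporaryDirectory() as d:
        f = Path(d) / "server.py"
        f.write_text("x = 1\n")
        os.utime(f, (1000, 1000))   # pin the mtime so the run is deterministic
        cache = CachedFingerprint()
        first = cache.get(d)
        second = cache.get(d)        # unchanged tree, no rehash
        f.write_text("x = 22\n")     # size changes, so the signature changes
        os.utime(f, (2000, 2000))
        third = cache.get(d)
        return first == second, first != third, cache.full_hashes
```

A key of this shape has a known blind spot: an edit that preserves file count, total size, and the maximum mtime is not detected by the signature alone. That is consistent with the diff keeping `use_cache=False` as the default for the update flow, where ground truth matters most.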
@@ -530,6 +530,183 @@ json_escape() {
530
530
  echo "$1" | sed 's/\\/\\\\/g; s/"/\\"/g; s/ / /g' | tr '\n' ' '
531
531
  }
532
532
 
533
+ # ============================================================================
534
+ # STUCK CRON REAPER (v7.11.2)
535
+ # ============================================================================
536
+ # Mirror image of the v5.8.1 in-flight detection. The v5.8.1 fix taught the
537
+ # watchdog to leave running jobs alone when their cron_runs row was open
538
+ # (started_at present, ended_at NULL) — that closed the loop where the
539
+ # watchdog kept kickstart -k'ing deep-sleep mid-flight (2026-04-14..17).
540
+ #
541
+ # But the same restraint became the new failure mode: when a wrapper child
542
+ # truly hangs (e.g. headless `claude --bare` blocked on an MCP that flagged
543
+ # `mcp_restart_required`), the row stays open forever, no new tick can run
544
+ # (the next wrapper sees "Another instance running. Skipping"), and the
545
+ # watchdog's only response was WARN. Morning brief, followup runner, and
546
+ # orchestrator-v2 went silent for days because of this.
547
+ #
548
+ # The reaper closes that gap without bringing back the v5.8.1 bug:
549
+ # * Per-cron threshold via `stuck_after_seconds` in manifest.json.
550
+ # * Generous default (12h) so legitimate long jobs keep running.
551
+ # * Override deep-sleep to 8h, sleep/evolution to 4h — well above their
552
+ # real worst-case so the v5.8.1 incident cannot repeat.
553
+ # * Reaper sends SIGTERM to the wrapper — its trap (line 187) closes the
554
+ # cron_runs row exit_code=143 and propagates to the child. Only after
555
+ # a 10s grace does it escalate to SIGKILL on wrapper + descendants.
556
+ # * If no wrapper PID is alive (orphan row), the reaper just closes the
557
+ # row in-band with exit_code=137 so the next tick can run.
558
+ # ============================================================================
559
+
560
+ STUCK_DEFAULT_SECONDS="${STUCK_DEFAULT_SECONDS:-43200}" # 12h
561
+ STUCK_KILL_GRACE="${STUCK_KILL_GRACE:-10}"
562
+ TOTAL_REAPED=0
563
+
564
+ # Skip cron_ids that should never be reaped from inside a watchdog tick.
565
+ # 'watchdog' is us — reaping ourselves would be self-immolation.
566
+ STUCK_REAPER_SKIP="watchdog"
567
+
568
+ _build_stuck_thresholds_from_manifest() {
569
+ if [ ! -f "$MANIFEST_FILE" ]; then
570
+ return
571
+ fi
572
+ python3 - "$MANIFEST_FILE" <<'PY' 2>/dev/null
573
+ import json, sys
574
+ try:
575
+ with open(sys.argv[1]) as f:
576
+ data = json.load(f)
577
+ except Exception:
578
+ sys.exit(0)
579
+ for c in data.get('crons', []):
580
+ cid = c.get('id')
581
+ th = c.get('stuck_after_seconds')
582
+ if cid and isinstance(th, (int, float)) and th > 0:
583
+ print(f"{cid}|{int(th)}")
584
+ PY
585
+ }
586
+
587
+ STUCK_THRESHOLDS_RAW=""
588
+ _load_stuck_thresholds() {
589
+ STUCK_THRESHOLDS_RAW=$(_build_stuck_thresholds_from_manifest)
590
+ }
591
+
592
+ lookup_stuck_threshold() {
593
+ local cron_id="$1"
594
+ if [ -z "$STUCK_THRESHOLDS_RAW" ]; then
595
+ echo "$STUCK_DEFAULT_SECONDS"
596
+ return
597
+ fi
598
+ local line
599
+ line=$(echo "$STUCK_THRESHOLDS_RAW" | grep "^${cron_id}|" | head -1)
600
+ if [ -n "$line" ]; then
601
+ echo "$line" | cut -d'|' -f2
602
+ else
603
+ echo "$STUCK_DEFAULT_SECONDS"
604
+ fi
605
+ }
606
+
607
+ find_wrapper_pids() {
608
+ local cron_id="$1"
609
+ # Match the wrapper's exact arg slot: "nexo-cron-wrapper.sh CRON_ID "
610
+ # The trailing space prevents prefix collisions (e.g. "morning-agent" vs
611
+ # a hypothetical "morning-agent-v2").
612
+ pgrep -f "nexo-cron-wrapper\.sh ${cron_id} " 2>/dev/null
613
+ }
614
+
615
+ reap_stuck_cron_pids() {
616
+ local cron_id="$1"
617
+ local pids
618
+ pids=$(find_wrapper_pids "$cron_id")
619
+ if [ -z "$pids" ]; then
620
+ # No wrapper alive — caller should fall through to in-band row cleanup.
621
+ return 1
622
+ fi
623
+ log_repair "STUCK REAPER: SIGTERM to wrapper PIDs ($cron_id): $(echo "$pids" | tr '\n' ' ')"
624
+ for pid in $pids; do
625
+ kill -TERM "$pid" 2>/dev/null || true
626
+ done
627
+ # Grace period — the wrapper trap (TERM → forward to child → finalize_row)
628
+ # needs a few seconds to close the cron_runs row cleanly.
629
+ local waited=0
630
+ local still
631
+ while [ $waited -lt "$STUCK_KILL_GRACE" ]; do
632
+ sleep 1
633
+ waited=$((waited + 1))
634
+ still=$(find_wrapper_pids "$cron_id")
635
+ [ -z "$still" ] && break
636
+ done
637
+ # Escalate to SIGKILL for any survivor (wrapper + descendants).
638
+ local survivors
639
+ survivors=$(find_wrapper_pids "$cron_id")
640
+ if [ -n "$survivors" ]; then
641
+ log_repair "STUCK REAPER: SIGKILL escalation ($cron_id): $(echo "$survivors" | tr '\n' ' ')"
642
+ for pid in $survivors; do
643
+ # Kill descendants first so they don't get reparented to PID 1.
644
+ pkill -KILL -P "$pid" 2>/dev/null || true
645
+ kill -KILL "$pid" 2>/dev/null || true
646
+ done
647
+ sleep 1
648
+ fi
649
+ # Last sanity check.
650
+ if [ -n "$(find_wrapper_pids "$cron_id")" ]; then
651
+ log "STUCK REAPER: failed to kill wrapper for $cron_id (still alive after SIGKILL)"
652
+ return 2
653
+ fi
654
+ return 0
655
+ }
656
+
657
+ finalize_stuck_db_row() {
658
+ local row_id="$1"
659
+ local cron_id="$2"
660
+ [ ! -f "$DB_PATH" ] && return 1
661
+ sqlite3 "$DB_PATH" "
662
+ UPDATE cron_runs
663
+ SET ended_at = strftime('%Y-%m-%d %H:%M:%S','now'),
664
+ exit_code = 137,
665
+ summary = 'stuck row reaped by watchdog: wrapper PID gone',
666
+ error = 'Watchdog STUCK REAPER: orphan in-flight row cleaned up',
667
+ duration_secs = CAST(strftime('%s','now') - strftime('%s', started_at) AS REAL)
668
+ WHERE id = $row_id;
669
+ " 2>/dev/null
670
+ log_repair "STUCK REAPER: cleaned up zombie cron_runs row id=$row_id ($cron_id)"
671
+ }
672
+
673
+ run_stuck_reaper() {
674
+ [ ! -f "$DB_PATH" ] && return 0
675
+ _load_stuck_thresholds
676
+ local row_id cron_id age_secs threshold
677
+ while IFS='|' read -r row_id cron_id age_secs; do
678
+ [ -z "$row_id" ] && continue
679
+ [ -z "$cron_id" ] && continue
680
+ # Skip self and any explicitly-protected cron_ids.
681
+ case " $STUCK_REAPER_SKIP " in
682
+ *" $cron_id "*) continue ;;
683
+ esac
684
+ threshold=$(lookup_stuck_threshold "$cron_id")
685
+ if [ "$age_secs" -gt "$threshold" ]; then
686
+ log "STUCK REAPER: cron_id=$cron_id row_id=$row_id age=${age_secs}s threshold=${threshold}s — reaping"
687
+ if reap_stuck_cron_pids "$cron_id"; then
688
+ # Wrapper trap closes the row with exit 143; nothing else to do.
689
+ TOTAL_REAPED=$((TOTAL_REAPED + 1))
690
+ else
691
+ # No wrapper alive (orphan zombie row) — close it in-band so the
692
+ # next tick of this cron isn't blocked by "Another instance running".
693
+ finalize_stuck_db_row "$row_id" "$cron_id"
694
+ TOTAL_REAPED=$((TOTAL_REAPED + 1))
695
+ fi
696
+ fi
697
+ done < <(sqlite3 -separator '|' "$DB_PATH" "
698
+ SELECT id, cron_id, CAST(strftime('%s','now') - strftime('%s', started_at) AS INTEGER)
699
+ FROM cron_runs
700
+ WHERE ended_at IS NULL
701
+ ORDER BY id DESC;
702
+ " 2>/dev/null)
703
+ if [ "$TOTAL_REAPED" -gt 0 ]; then
704
+ log "STUCK REAPER: complete — reaped $TOTAL_REAPED stuck cron(s)"
705
+ fi
706
+ }
707
+
708
+ run_stuck_reaper
709
+
533
710
  # ============================================================================
534
711
  # RUN CHECKS
535
712
  # ============================================================================
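The SIGTERM, grace period, SIGKILL sequence used by `reap_stuck_cron_pids` is a common escalation pattern. A minimal Python sketch of the same idea, assuming a POSIX system, might look like this; `reap` and `spawn` are hypothetical helpers, and unlike the shell version this sketch does not kill descendants of the target process.

```python
import signal
import subprocess
import sys
import time


def reap(proc, grace_seconds=10.0, poll_interval=0.1):
    """Send SIGTERM, wait up to grace_seconds, then escalate to SIGKILL.

    Returns "terminated" if SIGTERM was enough, "killed" if SIGKILL was needed.
    """
    proc.send_signal(signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return "terminated"
        time.sleep(poll_interval)
    proc.kill()  # SIGKILL on POSIX
    proc.wait()
    return "killed"


def spawn(ignore_term):
    """Start a sleeping child; optionally make it ignore SIGTERM (a 'hung' job)."""
    setup = "signal.signal(signal.SIGTERM, signal.SIG_IGN);" if ignore_term else ""
    code = (
        f"import signal, sys, time; {setup} "
        "print('ready', flush=True); time.sleep(60)"
    )
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has finished its setup
    return proc
```

The exit codes quoted in the diff follow the usual shell convention of 128 plus the signal number: 143 for SIGTERM (15) and 137 for SIGKILL (9). Note also that in the shell code a failed kill (return code 2) takes the same `else` branch in `run_stuck_reaper` as the orphan case (return code 1), because `if` only distinguishes zero from non-zero; a caller that wants to treat the two differently would need to inspect `$?`.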
@@ -1023,6 +1200,7 @@ cat > "$STATUS_JSON" <<JSONEOF
1023
1200
  "warn": $TOTAL_WARN,
1024
1201
  "fail": $TOTAL_FAIL,
1025
1202
  "healed": $TOTAL_HEALED,
1203
+ "reaped": $TOTAL_REAPED,
1026
1204
  "overall": "$OVERALL"
1027
1205
  },
1028
1206
  "launch_agents": [
@@ -1047,7 +1225,7 @@ cat > "$REPORT_TXT" <<REPORTEOF
1047
1225
  ======================================================
1048
1226
  NEXO WATCHDOG REPORT — $TS
1049
1227
  ======================================================
1050
- PASS: $TOTAL_PASS | HEALED: $TOTAL_HEALED | WARN: $TOTAL_WARN | FAIL: $TOTAL_FAIL | TOTAL: $TOTAL
1228
+ PASS: $TOTAL_PASS | HEALED: $TOTAL_HEALED | WARN: $TOTAL_WARN | FAIL: $TOTAL_FAIL | REAPED: $TOTAL_REAPED | TOTAL: $TOTAL
1051
1229
  OVERALL: $OVERALL
1052
1230
  ======================================================
1053
1231
 
@@ -1261,4 +1439,4 @@ fi
1261
1439
  # ============================================================================
1262
1440
  # LOG SUMMARY
1263
1441
  # ============================================================================
1264
- log "Complete: PASS=$TOTAL_PASS HEALED=$TOTAL_HEALED WARN=$TOTAL_WARN FAIL=$TOTAL_FAIL"
1442
+ log "Complete: PASS=$TOTAL_PASS HEALED=$TOTAL_HEALED WARN=$TOTAL_WARN FAIL=$TOTAL_FAIL REAPED=$TOTAL_REAPED"
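Since the final log line gains a `REAPED=N` field, anything that scrapes it may want to tolerate both the old and the new shape. A hedged sketch of such a parser follows; `parse_summary` is a hypothetical helper, and the sample counter values are made up for illustration.

```python
import re


def parse_summary(line):
    """Extract KEY=N counters from a 'Complete: ...' watchdog log line."""
    return {key: int(value) for key, value in re.findall(r"\b([A-Z]+)=(\d+)", line)}


NEW_LINE = "Complete: PASS=41 HEALED=1 WARN=2 FAIL=0 REAPED=3"
OLD_LINE = "Complete: PASS=41 HEALED=1 WARN=2 FAIL=0"
```

Reading the counter with `parse_summary(line).get("REAPED", 0)` keeps a consumer working against logs written before 7.11.2, where the field is absent.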