nexo-brain 7.11.1 → 7.11.3
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +5 -1
- package/package.json +1 -1
- package/src/crons/manifest.json +6 -0
- package/src/enforcement_engine.py +53 -0
- package/src/runtime_versioning.py +1 -0
- package/src/scripts/nexo-watchdog.sh +180 -2
package/.claude-plugin/plugin.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "nexo-brain",
-  "version": "7.11.
+  "version": "7.11.3",
   "description": "Local cognitive runtime for Claude Code \u2014 persistent memory, overnight learning, doctor diagnostics, personal scripts, recovery-aware jobs, startup preflight, and optional dashboard/power helper.",
   "author": {
     "name": "NEXO Brain",
package/README.md
CHANGED
@@ -18,7 +18,11 @@
 
 [Watch the overview video](https://nexo-brain.com/watch/) · [Watch on YouTube](https://www.youtube.com/watch?v=i2lkGhKyVqI) · [Open the infographic](https://nexo-brain.com/assets/nexo-brain-infographic-v5.png)
 
-Version `7.11.
+Version `7.11.3` is the current packaged-runtime line. Patch release — root-cause fix for the `mcp_restart_required` lockup that v7.11.2 only masked at the enforcer layer. `_FINGERPRINT_EXCLUDE_DIRS` in `src/runtime_versioning.py` was missing `"versions"`, so `compute_mcp_runtime_fingerprint()` walked into `core/versions/<old>/**.py` whenever it was called against the live runtime root. `installed_runtime_fingerprint()` (which resolves through `active_runtime_root()` → `core/versions/<active>/`) returned a clean per-snapshot hash, while `prime_process_fingerprint()` (which starts from `Path(__file__).resolve().parent` → live `core/`) accumulated every retained snapshot. The two never matched after the second-ever `nexo update` on a host. Every update wrote `mcp-restart-required.json` and the marker could never be cleared by `_ack_current_client_if_restarted()` because the `installed_fp != process_fp` test always returned `True`. Every non-allowlisted MCP tool (`nexo_reminders`, `nexo_smart_startup`, `nexo_guard_check`, `nexo_task_open`, …) returned `{"error": "mcp_restart_required", "reason": "fingerprint_mismatch"}` indefinitely, even after the operator restarted the client (the new client connected to the same server with the same cached `PROCESS_FINGERPRINT`). v7.11.2 silenced the enforcer-side noise but the marker itself stayed stuck, so user-driven calls kept failing across sessions. Adding `"versions"` to `_FINGERPRINT_EXCLUDE_DIRS` restores parity: both fingerprint computations now hash the same set of files regardless of which entry path the caller starts from. 1 new regression test (`test_fingerprint_ignores_versions_subtree`) seeds three snapshot directories under `versions/` and asserts the fingerprint does not shift. The two existing exclude-dir tests now also cover `"versions"`. All 21 tests in `tests/test_runtime_fingerprint.py` stay green.
+
+Previously in `7.11.2`: patch release — two reliability fixes in the same family ("components ignoring signals they should respect"): (1) `STUCK CRON REAPER` added to `nexo-watchdog.sh` and (2) the Guardian/Enforcer now honors the `mcp-restart-required` marker. The watchdog reaper closes the v5.8.1 in-flight gap: truly hung wrappers (e.g. headless `claude --bare` blocked on an MCP that flagged `mcp_restart_required`) used to hold their slot for days. The reaper sweeps `cron_runs` rows with `ended_at IS NULL` past `stuck_after_seconds` (per-cron from `manifest.json`, fallback 12h global), SIGTERMs the wrapper (trap closes row at `exit 143`), grace 10s, SIGKILL on survivors. Generous defaults (deep-sleep 8h, sleep/evolution 4h) prevent any v5.8.1 regression. The enforcer gate skips `nexo_*`-mentioning reminders when the marker file is present (cached per-instance, 30s TTL); reminders that don't reference `nexo_*` still fire. 12 new tests; 3 existing watchdog tests + 52 existing enforcer tests stay green.
+
+Previously in `7.11.1`: patch release — caches the runtime fingerprint by `(file_count, size_total, max_mtime)` signature so MCP startup and the per-tool-call `resolve_restart_required` skip the 263-file rehash when nothing on disk changed. ~11× speedup warm path (~40ms → ~3.7ms locally), ~10-20s/day saved across Claude Code / Codex / headless / deep-sleep / cron startups. Cache miss is always safe (falls through to full hash and self-repairs). Default `use_cache=False` keeps `plugins/update.py` on the ground-truth path around `git pull` / `npm update`. Builds on the v7.11.0 runtime fingerprint that gates `mcp-restart-required.json`. Full write-up in [`docs/runtime-fingerprint.md`](docs/runtime-fingerprint.md).
 
 Previously in `7.10.0`: minor release — **removes the LLM proxy override path that 7.9.28 → 7.9.34 introduced**. Background: 7.9.28 added two opt-in files at `~/.nexo/config/llm_endpoint.json` and `~/.nexo/config/auth_provider.json` that let a third-party orchestrator (NEXO Desktop) redirect every Anthropic SDK call from Brain to a custom proxy and resolve the bearer via a local helper, with concrete model names translated to wire aliases (`nexo-max`, `nexo-high`, `nexo-medium`, `nexo-low`, `nexo-mini`) and an `Idempotency-Key` per request. NEXO Desktop's commercial model has changed: Desktop is now a wrapper over the user's own Claude Code subscription (Max / Pro), with a separate Desktop licence. Brain calls go directly to `api.anthropic.com` using the user's existing OAuth (the one stored under `~/.claude/` and consumed by Claude Code spawns) or a plain `ANTHROPIC_API_KEY`. There is no NEXO bearer, no NEXO proxy, no NEXO credit accounting in this codebase. Every proxy symbol is gone from `call_model_raw.py` and `agent_runner.py`; the proxy-specific tests and `docs/api/override-files.md` are removed; any pre-existing override files on disk are simply ignored from this release forward.
 
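The 7.11.3 README entry attributes the lockup to a directory walk that descended into `versions/` from one entry path but not the other. The following is a minimal Python sketch of that failure mode, not the package's code: the `fingerprint` helper, the SHA-256 choice, and every exclude name other than `"versions"` are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical exclude set; the README only confirms that "versions" was added.
EXCLUDE_DIRS = frozenset({"__pycache__", "versions"})


def fingerprint(root: Path, exclude_dirs=EXCLUDE_DIRS) -> str:
    """Hash every .py file under root, skipping excluded directory names."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root)
        # rel.parts[:-1] are the directory components of the relative path.
        if any(part in exclude_dirs for part in rel.parts[:-1]):
            continue
        digest.update(rel.as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def demo():
    """Return (stable_with_exclusion, stable_without_exclusion)."""
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "core"
        live.mkdir()
        (live / "server.py").write_text("print('live')\n")
        old_excludes = frozenset({"__pycache__"})  # pre-fix behaviour (assumed)
        before_fixed = fingerprint(live)
        before_buggy = fingerprint(live, old_excludes)

        # A retained snapshot appears under versions/, as after `nexo update`.
        snapshot = live / "versions" / "7.11.1"
        snapshot.mkdir(parents=True)
        (snapshot / "server.py").write_text("print('old')\n")

        return (
            fingerprint(live) == before_fixed,
            fingerprint(live, old_excludes) == before_buggy,
        )
```

With the exclusion in place the hash of the live root does not move when a snapshot directory appears; without it the hash shifts, which is the mismatch the release note describes.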
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "nexo-brain",
-  "version": "7.11.
+  "version": "7.11.3",
   "mcpName": "io.github.wazionapps/nexo",
   "description": "NEXO Brain — Shared brain for AI agents. Persistent memory, semantic RAG, natural forgetting, metacognitive guard, trust scoring, 150+ MCP tools. Works with Claude Code, Codex, Claude Desktop & any MCP client. 100% local, free.",
   "homepage": "https://nexo-brain.com",
package/src/crons/manifest.json
CHANGED
@@ -13,6 +13,7 @@
       "recovery_policy": "catchup",
       "idempotent": true,
       "max_catchup_age": 172800,
+      "stuck_after_seconds": 28800,
       "run_on_boot": true,
       "run_on_wake": true
     },
@@ -38,6 +39,7 @@
       "recovery_policy": "catchup",
       "idempotent": true,
       "max_catchup_age": 172800,
+      "stuck_after_seconds": 14400,
       "run_on_boot": true,
       "run_on_wake": true
     },
@@ -140,6 +142,7 @@
       "recovery_policy": "catchup",
       "idempotent": true,
       "max_catchup_age": 1209600,
+      "stuck_after_seconds": 14400,
       "run_on_boot": true,
       "run_on_wake": true
     },
@@ -295,6 +298,7 @@
       "recovery_policy": "run_once_on_wake",
       "idempotent": true,
       "max_catchup_age": 1200,
+      "stuck_after_seconds": 600,
       "run_on_boot": false,
       "run_on_wake": true
     },
@@ -308,6 +312,7 @@
       "recovery_policy": "run_once_on_wake",
       "idempotent": true,
       "max_catchup_age": 7200,
+      "stuck_after_seconds": 1800,
       "run_on_boot": false,
       "run_on_wake": true
     },
@@ -321,6 +326,7 @@
       "recovery_policy": "catchup",
       "idempotent": true,
       "max_catchup_age": 86400,
+      "stuck_after_seconds": 1800,
       "run_on_boot": false,
       "run_on_wake": true
     }
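The manifest hunks add a per-cron `stuck_after_seconds`, which the watchdog reads with a 12h global fallback. A rough Python sketch of that lookup follows. It mirrors the filtering done by the embedded Python in `nexo-watchdog.sh`, but `stuck_thresholds`, `lookup` and the `no-threshold-job` id are hypothetical names, and pairing `deep-sleep` with 28800 is inferred from the README's "deep-sleep 8h" remark because the hunks do not show cron ids.

```python
import json

DEFAULT_STUCK_SECONDS = 43200  # 12h, the STUCK_DEFAULT_SECONDS fallback


def stuck_thresholds(manifest_text: str) -> dict:
    """Map cron id -> threshold for entries with a positive numeric value."""
    try:
        data = json.loads(manifest_text)
    except ValueError:
        # An unreadable manifest yields no overrides; callers use the default.
        return {}
    thresholds = {}
    for cron in data.get("crons", []):
        cron_id = cron.get("id")
        value = cron.get("stuck_after_seconds")
        if cron_id and isinstance(value, (int, float)) and value > 0:
            thresholds[cron_id] = int(value)
    return thresholds


def lookup(thresholds: dict, cron_id: str) -> int:
    """Per-cron threshold if declared, otherwise the global default."""
    return thresholds.get(cron_id, DEFAULT_STUCK_SECONDS)


# Illustrative manifest; only the field names come from the diff.
SAMPLE = json.dumps({
    "crons": [
        {"id": "deep-sleep", "stuck_after_seconds": 28800},
        {"id": "no-threshold-job"},
    ]
})
```

Usage: `lookup(stuck_thresholds(SAMPLE), "deep-sleep")` gives the declared 8h value, while a cron without the field falls back to 12h.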
package/src/enforcement_engine.py
CHANGED
@@ -2520,6 +2520,44 @@ class HeadlessEnforcer:
     # the per-rule tag collision check, and time-dedup at the call site.
     _LEGACY_TAG_PREFIXES = ("after:", "periodic_msgs:", "periodic_time:", "start:")
 
+    @staticmethod
+    def _mcp_restart_marker_path() -> "Path":
+        """Resolve the path to the MCP restart-required marker on disk.
+
+        The marker is written by `plugins/update.py` when a `nexo update`
+        actually changes runtime `.py` bytes (cf. v7.11.0 fingerprint
+        gating). Honors the F0.6 runtime/operations/ canonical layout
+        with a fall-back to the pre-F0.6 operations/ legacy layout so
+        half-migrated installs are still detected correctly.
+        """
+        from pathlib import Path as _Path
+        home = _Path(os.environ.get("NEXO_HOME", str(_Path.home() / ".nexo")))
+        new = home / "runtime" / "operations" / "mcp-restart-required.json"
+        if new.is_file():
+            return new
+        legacy = home / "operations" / "mcp-restart-required.json"
+        return legacy if legacy.is_file() else new
+
+    def _mcp_restart_pending(self) -> bool:
+        """Return True if the MCP server has a restart-required marker on disk.
+
+        Cached per-instance with a 30s TTL: the marker rarely changes mid-
+        session (it's written by `nexo update` and cleared by the next
+        client restart) but a TTL keeps long-lived enforcer instances from
+        getting stuck on a stale negative cache if the operator runs
+        `nexo update` mid-session without restarting.
+        """
+        cached_at = getattr(self, "_mcp_restart_pending_cache_at", 0.0)
+        if (time.time() - cached_at) < 30.0:
+            return getattr(self, "_mcp_restart_pending_cache", False)
+        try:
+            result = self._mcp_restart_marker_path().is_file()
+        except Exception:  # noqa: BLE001 — never block enforcement on path errors
+            result = False
+        self._mcp_restart_pending_cache = result
+        self._mcp_restart_pending_cache_at = time.time()
+        return result
+
     def _enqueue(self, prompt: str, tag: str, rule_id: str = ""):
         """Enqueue an injection. Mirrors Desktop _enqueue for parity.
 
@@ -2535,6 +2573,21 @@ class HeadlessEnforcer:
         """
         if any(q["tag"] == tag for q in self.injection_queue):
             return
+        # v7.11.2: suppress reminders that ask the agent to call nexo_*
+        # tools while the MCP server has a restart-required marker on
+        # disk. Without this gate every periodic ping ("Execute
+        # nexo_session_diary_write", "Execute nexo_smart_startup",
+        # nexo_guard_check pre-Edit, etc) returns mcp_restart_required
+        # and the agent burns cycles on guaranteed no-ops. Reminders that
+        # don't reference nexo_* (R23 deploy guards, R25 nora/maria
+        # read-only, etc) still fire — they don't depend on the MCP.
+        if "nexo_" in prompt and self._mcp_restart_pending():
+            _logger.info(
+                "SKIP: %s — mcp_restart_required marker present (rule_id=%s)",
+                tag,
+                rule_id or "?",
+            )
+            return
         legacy = tag.startswith(self._LEGACY_TAG_PREFIXES)
         if legacy:
             tool = tag.split(":")[-1].split("->")[-1]
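The enforcer change combines three ideas: marker lookup with a legacy-path fallback, a 30s TTL cache, and a gate that only affects prompts mentioning `nexo_`. The sketch below isolates that behaviour so the TTL can be exercised with a fake clock. `RestartGate`, `should_enqueue` and the injectable `clock` are hypothetical and not part of `HeadlessEnforcer`. The sketch also tracks "never checked" with `None` rather than the `0.0` sentinel used in the diff, so that small fake clock values behave correctly.

```python
import tempfile
import time
from pathlib import Path

MARKER = "mcp-restart-required.json"


class RestartGate:
    """Hypothetical stand-in for the enforcer's marker check."""

    def __init__(self, home: Path, ttl: float = 30.0, clock=time.time):
        self._home = home
        self._ttl = ttl
        self._clock = clock
        self._cached = False
        self._cached_at = None  # None means no check has happened yet

    def marker_path(self) -> Path:
        # Prefer the canonical layout, fall back to the legacy one.
        new = self._home / "runtime" / "operations" / MARKER
        if new.is_file():
            return new
        legacy = self._home / "operations" / MARKER
        return legacy if legacy.is_file() else new

    def restart_pending(self) -> bool:
        now = self._clock()
        if self._cached_at is not None and (now - self._cached_at) < self._ttl:
            return self._cached
        try:
            result = self.marker_path().is_file()
        except OSError:
            result = False  # path errors never block enforcement
        self._cached, self._cached_at = result, now
        return result

    def should_enqueue(self, prompt: str) -> bool:
        # Only reminders that reference nexo_* tools are suppressed.
        return not ("nexo_" in prompt and self.restart_pending())


def demo():
    now = [1000.0]
    with tempfile.TemporaryDirectory() as tmp:
        home = Path(tmp)
        gate = RestartGate(home, clock=lambda: now[0])
        first = gate.should_enqueue("Execute nexo_smart_startup")
        marker = home / "operations" / MARKER
        marker.parent.mkdir(parents=True)
        marker.write_text("{}")
        within_ttl = gate.should_enqueue("Execute nexo_smart_startup")
        now[0] += 31  # move past the 30s TTL
        after_ttl = gate.should_enqueue("Execute nexo_smart_startup")
        unrelated = gate.should_enqueue("Run deploy guard")
        return first, within_ttl, after_ttl, unrelated
```

The second call still enqueues because the negative result is cached. Once the TTL lapses the marker is noticed and `nexo_` prompts are skipped, while prompts that do not mention `nexo_` pass through throughout.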
package/src/scripts/nexo-watchdog.sh
CHANGED
@@ -530,6 +530,183 @@ json_escape() {
     echo "$1" | sed 's/\\/\\\\/g; s/"/\\"/g; s/ / /g' | tr '\n' ' '
 }
 
+# ============================================================================
+# STUCK CRON REAPER (v7.11.2)
+# ============================================================================
+# Mirror image of the v5.8.1 in-flight detection. The v5.8.1 fix taught the
+# watchdog to leave running jobs alone when their cron_runs row was open
+# (started_at present, ended_at NULL) — that closed the loop where the
+# watchdog kept kickstart -k'ing deep-sleep mid-flight (2026-04-14..17).
+#
+# But the same restraint became the new failure mode: when a wrapper child
+# truly hangs (e.g. headless `claude --bare` blocked on an MCP that flagged
+# `mcp_restart_required`), the row stays open forever, no new tick can run
+# (the next wrapper sees "Another instance running. Skipping"), and the
+# watchdog's only response was WARN. Morning brief, followup runner, and
+# orchestrator-v2 went silent for days because of this.
+#
+# The reaper closes that gap without bringing back the v5.8.1 bug:
+# * Per-cron threshold via `stuck_after_seconds` in manifest.json.
+# * Generous default (12h) so legitimate long jobs keep running.
+# * Override deep-sleep to 8h, sleep/evolution to 4h — well above their
+# real worst-case so the v5.8.1 incident cannot repeat.
+# * Reaper sends SIGTERM to the wrapper — its trap (line 187) closes the
+# cron_runs row exit_code=143 and propagates to the child. Only after
+# a 10s grace does it escalate to SIGKILL on wrapper + descendants.
+# * If no wrapper PID is alive (orphan row), the reaper just closes the
+# row in-band with exit_code=137 so the next tick can run.
+# ============================================================================
+
+STUCK_DEFAULT_SECONDS="${STUCK_DEFAULT_SECONDS:-43200}" # 12h
+STUCK_KILL_GRACE="${STUCK_KILL_GRACE:-10}"
+TOTAL_REAPED=0
+
+# Skip cron_ids that should never be reaped from inside a watchdog tick.
+# 'watchdog' is us — reaping ourselves would be self-immolation.
+STUCK_REAPER_SKIP="watchdog"
+
+_build_stuck_thresholds_from_manifest() {
+    if [ ! -f "$MANIFEST_FILE" ]; then
+        return
+    fi
+    python3 - "$MANIFEST_FILE" <<'PY' 2>/dev/null
+import json, sys
+try:
+    with open(sys.argv[1]) as f:
+        data = json.load(f)
+except Exception:
+    sys.exit(0)
+for c in data.get('crons', []):
+    cid = c.get('id')
+    th = c.get('stuck_after_seconds')
+    if cid and isinstance(th, (int, float)) and th > 0:
+        print(f"{cid}|{int(th)}")
+PY
+}
+
+STUCK_THRESHOLDS_RAW=""
+_load_stuck_thresholds() {
+    STUCK_THRESHOLDS_RAW=$(_build_stuck_thresholds_from_manifest)
+}
+
+lookup_stuck_threshold() {
+    local cron_id="$1"
+    if [ -z "$STUCK_THRESHOLDS_RAW" ]; then
+        echo "$STUCK_DEFAULT_SECONDS"
+        return
+    fi
+    local line
+    line=$(echo "$STUCK_THRESHOLDS_RAW" | grep "^${cron_id}|" | head -1)
+    if [ -n "$line" ]; then
+        echo "$line" | cut -d'|' -f2
+    else
+        echo "$STUCK_DEFAULT_SECONDS"
+    fi
+}
+
+find_wrapper_pids() {
+    local cron_id="$1"
+    # Match the wrapper's exact arg slot: "nexo-cron-wrapper.sh CRON_ID "
+    # The trailing space prevents prefix collisions (e.g. "morning-agent" vs
+    # a hypothetical "morning-agent-v2").
+    pgrep -f "nexo-cron-wrapper\.sh ${cron_id} " 2>/dev/null
+}
+
+reap_stuck_cron_pids() {
+    local cron_id="$1"
+    local pids
+    pids=$(find_wrapper_pids "$cron_id")
+    if [ -z "$pids" ]; then
+        # No wrapper alive — caller should fall through to in-band row cleanup.
+        return 1
+    fi
+    log_repair "STUCK REAPER: SIGTERM to wrapper PIDs ($cron_id): $(echo "$pids" | tr '\n' ' ')"
+    for pid in $pids; do
+        kill -TERM "$pid" 2>/dev/null || true
+    done
+    # Grace period — the wrapper trap (TERM → forward to child → finalize_row)
+    # needs a few seconds to close the cron_runs row cleanly.
+    local waited=0
+    local still
+    while [ $waited -lt "$STUCK_KILL_GRACE" ]; do
+        sleep 1
+        waited=$((waited + 1))
+        still=$(find_wrapper_pids "$cron_id")
+        [ -z "$still" ] && break
+    done
+    # Escalate to SIGKILL for any survivor (wrapper + descendants).
+    local survivors
+    survivors=$(find_wrapper_pids "$cron_id")
+    if [ -n "$survivors" ]; then
+        log_repair "STUCK REAPER: SIGKILL escalation ($cron_id): $(echo "$survivors" | tr '\n' ' ')"
+        for pid in $survivors; do
+            # Kill descendants first so they don't get reparented to PID 1.
+            pkill -KILL -P "$pid" 2>/dev/null || true
+            kill -KILL "$pid" 2>/dev/null || true
+        done
+        sleep 1
+    fi
+    # Last sanity check.
+    if [ -n "$(find_wrapper_pids "$cron_id")" ]; then
+        log "STUCK REAPER: failed to kill wrapper for $cron_id (still alive after SIGKILL)"
+        return 2
+    fi
+    return 0
+}
+
+finalize_stuck_db_row() {
+    local row_id="$1"
+    local cron_id="$2"
+    [ ! -f "$DB_PATH" ] && return 1
+    sqlite3 "$DB_PATH" "
+        UPDATE cron_runs
+        SET ended_at = strftime('%Y-%m-%d %H:%M:%S','now'),
+            exit_code = 137,
+            summary = 'stuck row reaped by watchdog: wrapper PID gone',
+            error = 'Watchdog STUCK REAPER: orphan in-flight row cleaned up',
+            duration_secs = CAST(strftime('%s','now') - strftime('%s', started_at) AS REAL)
+        WHERE id = $row_id;
+    " 2>/dev/null
+    log_repair "STUCK REAPER: cleaned up zombie cron_runs row id=$row_id ($cron_id)"
+}
+
+run_stuck_reaper() {
+    [ ! -f "$DB_PATH" ] && return 0
+    _load_stuck_thresholds
+    local row_id cron_id age_secs threshold
+    while IFS='|' read -r row_id cron_id age_secs; do
+        [ -z "$row_id" ] && continue
+        [ -z "$cron_id" ] && continue
+        # Skip self and any explicitly-protected cron_ids.
+        case " $STUCK_REAPER_SKIP " in
+            *" $cron_id "*) continue ;;
+        esac
+        threshold=$(lookup_stuck_threshold "$cron_id")
+        if [ "$age_secs" -gt "$threshold" ]; then
+            log "STUCK REAPER: cron_id=$cron_id row_id=$row_id age=${age_secs}s threshold=${threshold}s — reaping"
+            if reap_stuck_cron_pids "$cron_id"; then
+                # Wrapper trap closes the row with exit 143; nothing else to do.
+                TOTAL_REAPED=$((TOTAL_REAPED + 1))
+            else
+                # No wrapper alive (orphan zombie row) — close it in-band so the
+                # next tick of this cron isn't blocked by "Another instance running".
+                finalize_stuck_db_row "$row_id" "$cron_id"
+                TOTAL_REAPED=$((TOTAL_REAPED + 1))
+            fi
+        fi
+    done < <(sqlite3 -separator '|' "$DB_PATH" "
+        SELECT id, cron_id, CAST(strftime('%s','now') - strftime('%s', started_at) AS INTEGER)
+        FROM cron_runs
+        WHERE ended_at IS NULL
+        ORDER BY id DESC;
+    " 2>/dev/null)
+    if [ "$TOTAL_REAPED" -gt 0 ]; then
+        log "STUCK REAPER: complete — reaped $TOTAL_REAPED stuck cron(s)"
+    fi
+}
+
+run_stuck_reaper
+
 # ============================================================================
 # RUN CHECKS
 # ============================================================================
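The reaper's sweep is an age query over open `cron_runs` rows followed by a per-cron threshold comparison; orphan rows are closed with `exit_code = 137`. A small Python sketch of the same two steps against an in-memory SQLite database is given below. The table definition is an assumption that contains only the columns the script touches, and `find_stuck`, `close_orphan` and the `quick-job` id are hypothetical.

```python
import sqlite3

DEFAULT_STUCK_SECONDS = 43200  # 12h fallback, as in the watchdog script

# Same shape as the query in run_stuck_reaper.
AGE_QUERY = """
    SELECT id, cron_id,
           CAST(strftime('%s','now') - strftime('%s', started_at) AS INTEGER)
    FROM cron_runs
    WHERE ended_at IS NULL
    ORDER BY id DESC
"""


def find_stuck(conn, thresholds):
    """Return ids of open rows older than their cron's threshold."""
    stuck = []
    for row_id, cron_id, age in conn.execute(AGE_QUERY).fetchall():
        limit = thresholds.get(cron_id, DEFAULT_STUCK_SECONDS)
        if age is not None and age > limit:
            stuck.append(row_id)
    return stuck


def close_orphan(conn, row_id):
    """Close a row whose wrapper is gone; 137 is 128 + SIGKILL (9)."""
    conn.execute(
        "UPDATE cron_runs "
        "SET ended_at = strftime('%Y-%m-%d %H:%M:%S','now'), exit_code = 137 "
        "WHERE id = ?",
        (row_id,),
    )


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed minimal schema, for illustration only.
    conn.execute(
        "CREATE TABLE cron_runs (id INTEGER PRIMARY KEY, cron_id TEXT, "
        "started_at TEXT, ended_at TEXT, exit_code INTEGER)"
    )
    conn.executemany(
        "INSERT INTO cron_runs (cron_id, started_at, ended_at) "
        "VALUES (?, datetime('now', ?), ?)",
        [
            ("deep-sleep", "-2 hours", None),                  # open, under 8h
            ("quick-job", "-2 hours", None),                   # open, over 30min
            ("quick-job", "-3 hours", "2000-01-01 00:00:00"),  # already closed
        ],
    )
    thresholds = {"deep-sleep": 28800, "quick-job": 1800}
    before = find_stuck(conn, thresholds)
    for row_id in before:
        close_orphan(conn, row_id)
    after = find_stuck(conn, thresholds)
    code = conn.execute("SELECT exit_code FROM cron_runs WHERE id = 2").fetchone()[0]
    return before, after, code
```

One design difference is deliberate: the sketch binds `row_id` as a query parameter, whereas the shell script interpolates `$row_id` into the SQL text.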
@@ -1023,6 +1200,7 @@ cat > "$STATUS_JSON" <<JSONEOF
     "warn": $TOTAL_WARN,
     "fail": $TOTAL_FAIL,
     "healed": $TOTAL_HEALED,
+    "reaped": $TOTAL_REAPED,
     "overall": "$OVERALL"
   },
   "launch_agents": [
@@ -1047,7 +1225,7 @@ cat > "$REPORT_TXT" <<REPORTEOF
 ======================================================
 NEXO WATCHDOG REPORT — $TS
 ======================================================
-PASS: $TOTAL_PASS | HEALED: $TOTAL_HEALED | WARN: $TOTAL_WARN | FAIL: $TOTAL_FAIL | TOTAL: $TOTAL
+PASS: $TOTAL_PASS | HEALED: $TOTAL_HEALED | WARN: $TOTAL_WARN | FAIL: $TOTAL_FAIL | REAPED: $TOTAL_REAPED | TOTAL: $TOTAL
 OVERALL: $OVERALL
 ======================================================
 
@@ -1261,4 +1439,4 @@ fi
 # ============================================================================
 # LOG SUMMARY
 # ============================================================================
-log "Complete: PASS=$TOTAL_PASS HEALED=$TOTAL_HEALED WARN=$TOTAL_WARN FAIL=$TOTAL_FAIL"
+log "Complete: PASS=$TOTAL_PASS HEALED=$TOTAL_HEALED WARN=$TOTAL_WARN FAIL=$TOTAL_FAIL REAPED=$TOTAL_REAPED"
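The `REAPED` counter reported above counts crons handled by `reap_stuck_cron_pids`, which sends SIGTERM, waits a grace period, and escalates to SIGKILL only for survivors. The Python sketch below shows the same escalation for a single child process on a POSIX system. It is an illustration under stated assumptions rather than a port of the script: `reap` is a hypothetical helper, it acts on a `Popen` handle instead of PIDs found with `pgrep`, and it does not kill descendants as the `pkill -P` step does.

```python
import signal
import subprocess
import sys


def reap(proc: subprocess.Popen, grace: float = 10.0) -> int:
    """Send SIGTERM, wait up to `grace` seconds, then SIGKILL if still alive."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()


def demo():
    # A child that exits on SIGTERM (default disposition).
    polite = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # A child that ignores SIGTERM, standing in for a truly hung wrapper.
    stubborn = subprocess.Popen(
        [
            sys.executable,
            "-c",
            "import signal, time; "
            "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
            "print('ready', flush=True); time.sleep(60)",
        ],
        stdout=subprocess.PIPE,
    )
    stubborn.stdout.readline()  # wait until the handler is installed
    polite_rc = reap(polite, grace=5.0)
    stubborn_rc = reap(stubborn, grace=1.0)
    stubborn.stdout.close()
    return polite_rc, stubborn_rc
```

On POSIX, `Popen.returncode` is the negated signal number when a child is killed by a signal, so the two cases report -15 and -9. The 143 and 137 values recorded by the watchdog are the shell's 128 + signal convention for the same events.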