nexo-brain 7.11.0 → 7.11.2
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +3 -1
- package/package.json +1 -1
- package/src/crons/manifest.json +6 -0
- package/src/enforcement_engine.py +53 -0
- package/src/runtime_versioning.py +120 -5
- package/src/scripts/nexo-watchdog.sh +180 -2
package/.claude-plugin/plugin.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "nexo-brain",
-  "version": "7.11.
+  "version": "7.11.2",
   "description": "Local cognitive runtime for Claude Code \u2014 persistent memory, overnight learning, doctor diagnostics, personal scripts, recovery-aware jobs, startup preflight, and optional dashboard/power helper.",
   "author": {
     "name": "NEXO Brain",
package/README.md
CHANGED
@@ -18,7 +18,9 @@
 
 [Watch the overview video](https://nexo-brain.com/watch/) · [Watch on YouTube](https://www.youtube.com/watch?v=i2lkGhKyVqI) · [Open the infographic](https://nexo-brain.com/assets/nexo-brain-infographic-v5.png)
 
-Version `7.11.
+Version `7.11.2` is the current packaged-runtime line. Patch release — two reliability fixes in the same family ("components ignoring signals they should respect"): (1) `STUCK CRON REAPER` added to `nexo-watchdog.sh` and (2) the Guardian/Enforcer now honors the `mcp-restart-required` marker. Previously the enforcer kept injecting `<system-reminder>` blocks asking the agent to call `nexo_*` tools while the MCP server was already returning `mcp_restart_required` for every call — every ping was a guaranteed no-op. The new gate at the top of `HeadlessEnforcer._enqueue()` reads the marker file (cached per-instance, 30s TTL) and skips reminders that mention `nexo_` while the marker is present. Reminders that don't reference `nexo_*` (R23 deploy guards, R25 nora/maria read-only, etc.) still fire — they don't depend on the MCP. The watchdog reaper closes a sibling gap: the v5.8.1 fix taught the watchdog to leave running jobs alone (it had been killing `deep-sleep` mid-flight 2026-04-14..17). The same restraint silently let truly hung wrappers — e.g. headless `claude --bare` blocked on an MCP that flagged `mcp_restart_required` — block their own next tick for days (`morning-agent`, `followup-runner` and `orchestrator-v2` went silent 2026-04-24..27). The reaper sweeps every `cron_runs` row with `ended_at IS NULL` and reaps anything older than `stuck_after_seconds` (per-cron from `manifest.json`, fallback 12h global). Live wrapper → `SIGTERM` (the wrapper's existing trap closes the row at `exit 143`), 10s grace, then `SIGKILL` on wrapper + descendants. Orphan zombi row → cleaned in-band with `exit_code=137`. `cron_id='watchdog'` is hard-coded skip so the watchdog never reaps itself. Generous defaults (deep-sleep 8h, sleep/evolution 4h) prevent any v5.8.1 regression. New observability: `summary.reaped` in `watchdog-status.json`, `REAPED:` header in the human report, `REAPED=N` in the final log line. 6 new tests; 3 existing watchdog tests stay green.
+
+Previously in `7.11.1`: patch release — caches the runtime fingerprint by `(file_count, size_total, max_mtime)` signature so MCP startup and the per-tool-call `resolve_restart_required` skip the 263-file rehash when nothing on disk changed. ~11× speedup warm path (~40ms → ~3.7ms locally), ~10-20s/day saved across Claude Code / Codex / headless / deep-sleep / cron startups. Cache miss is always safe (falls through to full hash and self-repairs). Default `use_cache=False` keeps `plugins/update.py` on the ground-truth path around `git pull` / `npm update`. Builds on the v7.11.0 runtime fingerprint that gates `mcp-restart-required.json`. Full write-up in [`docs/runtime-fingerprint.md`](docs/runtime-fingerprint.md).
 
 Previously in `7.10.0`: minor release — **removes the LLM proxy override path that 7.9.28 → 7.9.34 introduced**. Background: 7.9.28 added two opt-in files at `~/.nexo/config/llm_endpoint.json` and `~/.nexo/config/auth_provider.json` that let a third-party orchestrator (NEXO Desktop) redirect every Anthropic SDK call from Brain to a custom proxy and resolve the bearer via a local helper, with concrete model names translated to wire aliases (`nexo-max`, `nexo-high`, `nexo-medium`, `nexo-low`, `nexo-mini`) and an `Idempotency-Key` per request. NEXO Desktop's commercial model has changed: Desktop is now a wrapper over the user's own Claude Code subscription (Max / Pro), with a separate Desktop licence. Brain calls go directly to `api.anthropic.com` using the user's existing OAuth (the one stored under `~/.claude/` and consumed by Claude Code spawns) or a plain `ANTHROPIC_API_KEY`. There is no NEXO bearer, no NEXO proxy, no NEXO credit accounting in this codebase. Every proxy symbol is gone from `call_model_raw.py` and `agent_runner.py`; the proxy-specific tests and `docs/api/override-files.md` are removed; any pre-existing override files on disk are simply ignored from this release forward.
 
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "nexo-brain",
-  "version": "7.11.
+  "version": "7.11.2",
   "mcpName": "io.github.wazionapps/nexo",
   "description": "NEXO Brain — Shared brain for AI agents. Persistent memory, semantic RAG, natural forgetting, metacognitive guard, trust scoring, 150+ MCP tools. Works with Claude Code, Codex, Claude Desktop & any MCP client. 100% local, free.",
   "homepage": "https://nexo-brain.com",
package/src/crons/manifest.json
CHANGED
@@ -13,6 +13,7 @@
       "recovery_policy": "catchup",
       "idempotent": true,
       "max_catchup_age": 172800,
+      "stuck_after_seconds": 28800,
       "run_on_boot": true,
       "run_on_wake": true
     },
@@ -38,6 +39,7 @@
       "recovery_policy": "catchup",
       "idempotent": true,
       "max_catchup_age": 172800,
+      "stuck_after_seconds": 14400,
       "run_on_boot": true,
       "run_on_wake": true
     },
@@ -140,6 +142,7 @@
       "recovery_policy": "catchup",
       "idempotent": true,
       "max_catchup_age": 1209600,
+      "stuck_after_seconds": 14400,
       "run_on_boot": true,
       "run_on_wake": true
     },
@@ -295,6 +298,7 @@
       "recovery_policy": "run_once_on_wake",
       "idempotent": true,
       "max_catchup_age": 1200,
+      "stuck_after_seconds": 600,
       "run_on_boot": false,
       "run_on_wake": true
     },
@@ -308,6 +312,7 @@
       "recovery_policy": "run_once_on_wake",
       "idempotent": true,
       "max_catchup_age": 7200,
+      "stuck_after_seconds": 1800,
       "run_on_boot": false,
       "run_on_wake": true
     },
@@ -321,6 +326,7 @@
       "recovery_policy": "catchup",
       "idempotent": true,
       "max_catchup_age": 86400,
+      "stuck_after_seconds": 1800,
       "run_on_boot": false,
       "run_on_wake": true
     }
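The per-cron `stuck_after_seconds` values above are read by the watchdog, with a global fallback for entries that do not set one. The following is a minimal Python sketch of that lookup, not the shipped implementation (which lives in `nexo-watchdog.sh`). The function names and the sample manifest fragment are hypothetical; the `crons` list, the `id` key and the 12h fallback (43200 s) are taken from the watchdog change in this diff.

```python
import json

# Global fallback named in the release notes (12h).
DEFAULT_STUCK_SECONDS = 43200


def load_stuck_thresholds(manifest_text: str) -> dict[str, int]:
    """Map cron id -> stuck_after_seconds for entries that set a positive value."""
    data = json.loads(manifest_text)
    thresholds: dict[str, int] = {}
    for cron in data.get("crons", []):
        cron_id = cron.get("id")
        value = cron.get("stuck_after_seconds")
        if cron_id and isinstance(value, (int, float)) and value > 0:
            thresholds[cron_id] = int(value)
    return thresholds


def stuck_threshold(thresholds: dict[str, int], cron_id: str) -> int:
    """Per-cron threshold, or the global default when the manifest has none."""
    return thresholds.get(cron_id, DEFAULT_STUCK_SECONDS)


# Hypothetical manifest fragment, for illustration only.
SAMPLE_MANIFEST = """
{"crons": [
  {"id": "deep-sleep", "stuck_after_seconds": 28800},
  {"id": "no-override"}
]}
"""

thresholds = load_stuck_thresholds(SAMPLE_MANIFEST)
print(stuck_threshold(thresholds, "deep-sleep"))
print(stuck_threshold(thresholds, "no-override"))
```

A cron without an explicit value falls back to the default rather than being exempt, so every open `cron_runs` row is eventually eligible for reaping.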
package/src/enforcement_engine.py
CHANGED
@@ -2520,6 +2520,44 @@ class HeadlessEnforcer:
     # the per-rule tag collision check, and time-dedup at the call site.
     _LEGACY_TAG_PREFIXES = ("after:", "periodic_msgs:", "periodic_time:", "start:")
 
+    @staticmethod
+    def _mcp_restart_marker_path() -> "Path":
+        """Resolve the path to the MCP restart-required marker on disk.
+
+        The marker is written by `plugins/update.py` when a `nexo update`
+        actually changes runtime `.py` bytes (cf. v7.11.0 fingerprint
+        gating). Honors the F0.6 runtime/operations/ canonical layout
+        with a fall-back to the pre-F0.6 operations/ legacy layout so
+        half-migrated installs are still detected correctly.
+        """
+        from pathlib import Path as _Path
+        home = _Path(os.environ.get("NEXO_HOME", str(_Path.home() / ".nexo")))
+        new = home / "runtime" / "operations" / "mcp-restart-required.json"
+        if new.is_file():
+            return new
+        legacy = home / "operations" / "mcp-restart-required.json"
+        return legacy if legacy.is_file() else new
+
+    def _mcp_restart_pending(self) -> bool:
+        """Return True if the MCP server has a restart-required marker on disk.
+
+        Cached per-instance with a 30s TTL: the marker rarely changes mid-
+        session (it's written by `nexo update` and cleared by the next
+        client restart) but a TTL keeps long-lived enforcer instances from
+        getting stuck on a stale negative cache if the operator runs
+        `nexo update` mid-session without restarting.
+        """
+        cached_at = getattr(self, "_mcp_restart_pending_cache_at", 0.0)
+        if (time.time() - cached_at) < 30.0:
+            return getattr(self, "_mcp_restart_pending_cache", False)
+        try:
+            result = self._mcp_restart_marker_path().is_file()
+        except Exception:  # noqa: BLE001 — never block enforcement on path errors
+            result = False
+        self._mcp_restart_pending_cache = result
+        self._mcp_restart_pending_cache_at = time.time()
+        return result
+
     def _enqueue(self, prompt: str, tag: str, rule_id: str = ""):
         """Enqueue an injection. Mirrors Desktop _enqueue for parity.
 
@@ -2535,6 +2573,21 @@ class HeadlessEnforcer:
         """
         if any(q["tag"] == tag for q in self.injection_queue):
             return
+        # v7.11.2: suppress reminders that ask the agent to call nexo_*
+        # tools while the MCP server has a restart-required marker on
+        # disk. Without this gate every periodic ping ("Execute
+        # nexo_session_diary_write", "Execute nexo_smart_startup",
+        # nexo_guard_check pre-Edit, etc) returns mcp_restart_required
+        # and the agent burns cycles on guaranteed no-ops. Reminders that
+        # don't reference nexo_* (R23 deploy guards, R25 nora/maria
+        # read-only, etc) still fire — they don't depend on the MCP.
+        if "nexo_" in prompt and self._mcp_restart_pending():
+            _logger.info(
+                "SKIP: %s — mcp_restart_required marker present (rule_id=%s)",
+                tag,
+                rule_id or "?",
+            )
+            return
         legacy = tag.startswith(self._LEGACY_TAG_PREFIXES)
         if legacy:
             tool = tag.split(":")[-1].split("->")[-1]
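The gate added to `_enqueue()` combines a substring test on the prompt with a TTL-cached file-existence check. The standalone sketch below isolates that behaviour so it can be exercised without the enforcer. `MarkerGate`, its constructor parameters and the injectable `clock` are hypothetical names introduced for illustration; only the 30-second TTL, the `nexo_` substring test and the "treat path errors as no marker" rule come from the diff.

```python
import time
from pathlib import Path


class MarkerGate:
    """Illustrative stand-in for the enforcer's marker check (not the real class)."""

    def __init__(self, marker: Path, ttl: float = 30.0, clock=time.monotonic):
        self._marker = marker
        self._ttl = ttl
        self._clock = clock          # injectable so the TTL can be tested without sleeping
        self._cached = False
        self._cached_at = None       # None means "never checked"

    def restart_pending(self) -> bool:
        """True if the marker file exists; the answer is reused for `ttl` seconds."""
        now = self._clock()
        if self._cached_at is not None and (now - self._cached_at) < self._ttl:
            return self._cached
        try:
            result = self._marker.is_file()
        except OSError:
            result = False           # never block on path errors
        self._cached, self._cached_at = result, now
        return result

    def should_skip(self, prompt: str) -> bool:
        """Skip only reminders that mention nexo_ while the marker is present."""
        return "nexo_" in prompt and self.restart_pending()
```

One consequence of the TTL, visible in this sketch as well as in the diff, is that a marker written mid-session is picked up only after the cached negative answer expires, so up to 30 seconds of reminders may still be enqueued.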
package/src/runtime_versioning.py
CHANGED
@@ -160,6 +160,98 @@ def restart_required_marker_path() -> Path:
     return paths.operations_dir() / "mcp-restart-required.json"
 
 
+def fingerprint_cache_path() -> Path:
+    """Where the runtime fingerprint cache lives.
+
+    The cache lets `prime_process_fingerprint()` and `installed_runtime_fingerprint()`
+    skip hashing 200+ source files on every MCP startup / tool call when the
+    runtime tree on disk hasn't changed (same file count, same total size, same
+    max mtime). Invalidates automatically when any source byte changes.
+    """
+    return paths.operations_dir() / "fingerprint-cache.json"
+
+
+def _runtime_tree_signature(src_dir: Path) -> tuple[int, int, float] | None:
+    """Cheap stat-only walk over the fingerprint-tracked tree.
+
+    Returns ``(file_count, size_total, max_mtime)`` or ``None`` when the source
+    tree cannot be traversed. This is the cache key — if it matches, the bytes
+    haven't changed in any way the fingerprint would care about.
+    """
+    try:
+        files = _iter_runtime_source_files(src_dir)
+    except Exception:
+        return None
+    if not files:
+        return None
+    count = 0
+    size_total = 0
+    max_mtime = 0.0
+    for path in files:
+        try:
+            st = path.stat()
+        except Exception:
+            return None
+        count += 1
+        size_total += int(st.st_size)
+        if st.st_mtime > max_mtime:
+            max_mtime = float(st.st_mtime)
+    return (count, size_total, max_mtime)
+
+
+def _read_fingerprint_cache(src_dir: Path) -> str:
+    """Return cached fingerprint when the on-disk signature still matches.
+
+    Empty string means cache miss (corrupt, missing, or signature drifted).
+    Cache miss is always safe — caller falls through to a full hash.
+    """
+    cache_path = fingerprint_cache_path()
+    if not cache_path.is_file():
+        return ""
+    try:
+        payload = json.loads(cache_path.read_text(encoding="utf-8"))
+    except Exception:
+        return ""
+    if not isinstance(payload, dict):
+        return ""
+    if str(payload.get("src_dir") or "") != str(src_dir):
+        return ""
+    sig = _runtime_tree_signature(src_dir)
+    if sig is None:
+        return ""
+    try:
+        cached_count = int(payload.get("file_count"))
+        cached_size = int(payload.get("size_total"))
+        cached_mtime = float(payload.get("max_mtime"))
+    except (TypeError, ValueError):
+        return ""
+    if cached_count != sig[0] or cached_size != sig[1] or cached_mtime != sig[2]:
+        return ""
+    fingerprint = str(payload.get("fingerprint") or "").strip()
+    return fingerprint
+
+
+def _write_fingerprint_cache(src_dir: Path, fingerprint: str) -> None:
+    """Persist the fingerprint+signature pair. Best-effort; failures don't propagate."""
+    if not fingerprint:
+        return
+    sig = _runtime_tree_signature(src_dir)
+    if sig is None:
+        return
+    payload = {
+        "fingerprint": fingerprint,
+        "src_dir": str(src_dir),
+        "file_count": sig[0],
+        "size_total": sig[1],
+        "max_mtime": sig[2],
+        "updated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
+    }
+    try:
+        _write_json_atomic(fingerprint_cache_path(), payload)
+    except Exception:
+        pass
+
+
 def _candidate_version_files(base: Path) -> list[Path]:
     return [
         base / "version.json",
@@ -225,7 +317,9 @@ def _iter_runtime_source_files(src_dir: Path) -> list[Path]:
     return out
 
 
-def compute_mcp_runtime_fingerprint(
+def compute_mcp_runtime_fingerprint(
+    src_dir: Path | None = None, *, use_cache: bool = False
+) -> str:
     """Hash of every Python source file the running MCP can import.
 
     Returns a sha256 hex digest, or "" when the source tree cannot be located
@@ -240,6 +334,14 @@ def compute_mcp_runtime_fingerprint(src_dir: Path | None = None) -> str:
       * non-`.py` assets (docs, blogs, READMEs, JSON/YAML configs, templates,
         CHANGELOG, marketing files) — these never affect what the live MCP
         process executes
+
+    When ``use_cache=True`` (hot paths: server startup, every tool call) the
+    function consults ``fingerprint-cache.json``: if the on-disk tree
+    signature (file count + total size + max mtime) still matches the cached
+    one, the cached digest is returned without re-reading any byte. Cache miss
+    falls through to the normal full-hash path and writes a fresh entry. The
+    update flow keeps ``use_cache=False`` (default) so it always sees ground
+    truth around the pull/npm step.
     """
     if src_dir is None:
         candidates: list[Path] = []
@@ -267,6 +369,11 @@ def compute_mcp_runtime_fingerprint(src_dir: Path | None = None) -> str:
     if src_dir is None:
         return ""
 
+    if use_cache:
+        cached = _read_fingerprint_cache(src_dir)
+        if cached:
+            return cached
+
     files = _iter_runtime_source_files(src_dir)
     if not files:
         return ""
@@ -283,11 +390,19 @@ def compute_mcp_runtime_fingerprint(src_dir: Path | None = None) -> str:
         except Exception:
             return ""
         h.update(b"\n")
-
+    digest = h.hexdigest()
+    if use_cache and digest:
+        _write_fingerprint_cache(src_dir, digest)
+    return digest
 
 
 def installed_runtime_fingerprint() -> str:
-    """Fingerprint of whatever runtime source tree is on disk right now.
+    """Fingerprint of whatever runtime source tree is on disk right now.
+
+    Hot path — runs on every MCP tool call via ``resolve_restart_required``.
+    Uses the disk-signature cache so a repeated call without any source
+    change is a few stat() syscalls instead of 200+ file reads.
+    """
     candidates: list[Path] = []
     try:
         root = active_runtime_root()
@@ -308,7 +423,7 @@ def installed_runtime_fingerprint() -> str:
     except Exception:
         pass
     for cand in candidates:
-        fp = compute_mcp_runtime_fingerprint(cand)
+        fp = compute_mcp_runtime_fingerprint(cand, use_cache=True)
         if fp:
             return fp
     return ""
@@ -616,7 +731,7 @@ def prime_process_fingerprint() -> str:
     except Exception:
         pass
     for cand in candidates:
-        fp = compute_mcp_runtime_fingerprint(cand)
+        fp = compute_mcp_runtime_fingerprint(cand, use_cache=True)
         if fp:
             PROCESS_FINGERPRINT = fp
             return PROCESS_FINGERPRINT
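The caching scheme in `runtime_versioning.py` keys a stored digest on a cheap stat-only signature and falls back to a full hash on any mismatch. The sketch below reproduces that idea in a self-contained form. It is an approximation under stated assumptions: `tree_signature`, `full_hash` and `cached_fingerprint` are hypothetical names, the file selection is a plain `*.py` glob rather than the package's `_iter_runtime_source_files`, and the cache is written non-atomically.

```python
import hashlib
import json
from pathlib import Path


def tree_signature(files):
    """Stat-only summary of the tree: [file_count, size_total, max_mtime]."""
    stats = [p.stat() for p in files]
    return [len(stats), sum(s.st_size for s in stats), max(s.st_mtime for s in stats)]


def full_hash(files):
    """Ground-truth digest: reads every byte of every tracked file."""
    h = hashlib.sha256()
    for p in files:
        h.update(p.name.encode("utf-8"))
        h.update(b"\0")
        h.update(p.read_bytes())
        h.update(b"\n")
    return h.hexdigest()


def cached_fingerprint(src_dir: Path, cache_file: Path):
    """Return (digest, cache_hit). A miss always recomputes and rewrites the cache."""
    files = sorted(src_dir.rglob("*.py"))
    if not files:
        return "", False
    sig = tree_signature(files)
    try:
        payload = json.loads(cache_file.read_text(encoding="utf-8"))
        if payload["signature"] == sig and payload["fingerprint"]:
            return payload["fingerprint"], True
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing, corrupt, or malformed cache: fall through to a full hash
    digest = full_hash(files)
    try:
        cache_file.write_text(
            json.dumps({"signature": sig, "fingerprint": digest}), encoding="utf-8"
        )
    except OSError:
        pass  # best-effort write, mirroring the diff's "failures don't propagate"
    return digest, False
```

A stat-based key of this kind cannot see an edit that keeps the file count, the total size and the newest mtime all unchanged, which is presumably why the update flow in the diff keeps `use_cache=False` and always rehashes.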
package/src/scripts/nexo-watchdog.sh
CHANGED
@@ -530,6 +530,183 @@ json_escape() {
     echo "$1" | sed 's/\\/\\\\/g; s/"/\\"/g; s/ / /g' | tr '\n' ' '
 }
 
+# ============================================================================
+# STUCK CRON REAPER (v7.11.2)
+# ============================================================================
+# Mirror image of the v5.8.1 in-flight detection. The v5.8.1 fix taught the
+# watchdog to leave running jobs alone when their cron_runs row was open
+# (started_at present, ended_at NULL) — that closed the loop where the
+# watchdog kept kickstart -k'ing deep-sleep mid-flight (2026-04-14..17).
+#
+# But the same restraint became the new failure mode: when a wrapper child
+# truly hangs (e.g. headless `claude --bare` blocked on an MCP that flagged
+# `mcp_restart_required`), the row stays open forever, no new tick can run
+# (the next wrapper sees "Another instance running. Skipping"), and the
+# watchdog's only response was WARN. Morning brief, followup runner, and
+# orchestrator-v2 went silent for days because of this.
+#
+# The reaper closes that gap without bringing back the v5.8.1 bug:
+# * Per-cron threshold via `stuck_after_seconds` in manifest.json.
+# * Generous default (12h) so legitimate long jobs keep running.
+# * Override deep-sleep to 8h, sleep/evolution to 4h — well above their
+#   real worst-case so the v5.8.1 incident cannot repeat.
+# * Reaper sends SIGTERM to the wrapper — its trap (line 187) closes the
+#   cron_runs row exit_code=143 and propagates to the child. Only after
+#   a 10s grace does it escalate to SIGKILL on wrapper + descendants.
+# * If no wrapper PID is alive (orphan row), the reaper just closes the
+#   row in-band with exit_code=137 so the next tick can run.
+# ============================================================================
+
+STUCK_DEFAULT_SECONDS="${STUCK_DEFAULT_SECONDS:-43200}" # 12h
+STUCK_KILL_GRACE="${STUCK_KILL_GRACE:-10}"
+TOTAL_REAPED=0
+
+# Skip cron_ids that should never be reaped from inside a watchdog tick.
+# 'watchdog' is us — reaping ourselves would be self-immolation.
+STUCK_REAPER_SKIP="watchdog"
+
+_build_stuck_thresholds_from_manifest() {
+    if [ ! -f "$MANIFEST_FILE" ]; then
+        return
+    fi
+    python3 - "$MANIFEST_FILE" <<'PY' 2>/dev/null
+import json, sys
+try:
+    with open(sys.argv[1]) as f:
+        data = json.load(f)
+except Exception:
+    sys.exit(0)
+for c in data.get('crons', []):
+    cid = c.get('id')
+    th = c.get('stuck_after_seconds')
+    if cid and isinstance(th, (int, float)) and th > 0:
+        print(f"{cid}|{int(th)}")
+PY
+}
+
+STUCK_THRESHOLDS_RAW=""
+_load_stuck_thresholds() {
+    STUCK_THRESHOLDS_RAW=$(_build_stuck_thresholds_from_manifest)
+}
+
+lookup_stuck_threshold() {
+    local cron_id="$1"
+    if [ -z "$STUCK_THRESHOLDS_RAW" ]; then
+        echo "$STUCK_DEFAULT_SECONDS"
+        return
+    fi
+    local line
+    line=$(echo "$STUCK_THRESHOLDS_RAW" | grep "^${cron_id}|" | head -1)
+    if [ -n "$line" ]; then
+        echo "$line" | cut -d'|' -f2
+    else
+        echo "$STUCK_DEFAULT_SECONDS"
+    fi
+}
+
+find_wrapper_pids() {
+    local cron_id="$1"
+    # Match the wrapper's exact arg slot: "nexo-cron-wrapper.sh CRON_ID "
+    # The trailing space prevents prefix collisions (e.g. "morning-agent" vs
+    # a hypothetical "morning-agent-v2").
+    pgrep -f "nexo-cron-wrapper\.sh ${cron_id} " 2>/dev/null
+}
+
+reap_stuck_cron_pids() {
+    local cron_id="$1"
+    local pids
+    pids=$(find_wrapper_pids "$cron_id")
+    if [ -z "$pids" ]; then
+        # No wrapper alive — caller should fall through to in-band row cleanup.
+        return 1
+    fi
+    log_repair "STUCK REAPER: SIGTERM to wrapper PIDs ($cron_id): $(echo "$pids" | tr '\n' ' ')"
+    for pid in $pids; do
+        kill -TERM "$pid" 2>/dev/null || true
+    done
+    # Grace period — the wrapper trap (TERM → forward to child → finalize_row)
+    # needs a few seconds to close the cron_runs row cleanly.
+    local waited=0
+    local still
+    while [ $waited -lt "$STUCK_KILL_GRACE" ]; do
+        sleep 1
+        waited=$((waited + 1))
+        still=$(find_wrapper_pids "$cron_id")
+        [ -z "$still" ] && break
+    done
+    # Escalate to SIGKILL for any survivor (wrapper + descendants).
+    local survivors
+    survivors=$(find_wrapper_pids "$cron_id")
+    if [ -n "$survivors" ]; then
+        log_repair "STUCK REAPER: SIGKILL escalation ($cron_id): $(echo "$survivors" | tr '\n' ' ')"
+        for pid in $survivors; do
+            # Kill descendants first so they don't get reparented to PID 1.
+            pkill -KILL -P "$pid" 2>/dev/null || true
+            kill -KILL "$pid" 2>/dev/null || true
+        done
+        sleep 1
+    fi
+    # Last sanity check.
+    if [ -n "$(find_wrapper_pids "$cron_id")" ]; then
+        log "STUCK REAPER: failed to kill wrapper for $cron_id (still alive after SIGKILL)"
+        return 2
+    fi
+    return 0
+}
+
+finalize_stuck_db_row() {
+    local row_id="$1"
+    local cron_id="$2"
+    [ ! -f "$DB_PATH" ] && return 1
+    sqlite3 "$DB_PATH" "
+        UPDATE cron_runs
+        SET ended_at = strftime('%Y-%m-%d %H:%M:%S','now'),
+            exit_code = 137,
+            summary = 'stuck row reaped by watchdog: wrapper PID gone',
+            error = 'Watchdog STUCK REAPER: orphan in-flight row cleaned up',
+            duration_secs = CAST(strftime('%s','now') - strftime('%s', started_at) AS REAL)
+        WHERE id = $row_id;
+    " 2>/dev/null
+    log_repair "STUCK REAPER: cleaned up zombie cron_runs row id=$row_id ($cron_id)"
+}
+
+run_stuck_reaper() {
+    [ ! -f "$DB_PATH" ] && return 0
+    _load_stuck_thresholds
+    local row_id cron_id age_secs threshold
+    while IFS='|' read -r row_id cron_id age_secs; do
+        [ -z "$row_id" ] && continue
+        [ -z "$cron_id" ] && continue
+        # Skip self and any explicitly-protected cron_ids.
+        case " $STUCK_REAPER_SKIP " in
+            *" $cron_id "*) continue ;;
+        esac
+        threshold=$(lookup_stuck_threshold "$cron_id")
+        if [ "$age_secs" -gt "$threshold" ]; then
+            log "STUCK REAPER: cron_id=$cron_id row_id=$row_id age=${age_secs}s threshold=${threshold}s — reaping"
+            if reap_stuck_cron_pids "$cron_id"; then
+                # Wrapper trap closes the row with exit 143; nothing else to do.
+                TOTAL_REAPED=$((TOTAL_REAPED + 1))
+            else
+                # No wrapper alive (orphan zombie row) — close it in-band so the
+                # next tick of this cron isn't blocked by "Another instance running".
+                finalize_stuck_db_row "$row_id" "$cron_id"
+                TOTAL_REAPED=$((TOTAL_REAPED + 1))
+            fi
+        fi
+    done < <(sqlite3 -separator '|' "$DB_PATH" "
+        SELECT id, cron_id, CAST(strftime('%s','now') - strftime('%s', started_at) AS INTEGER)
+        FROM cron_runs
+        WHERE ended_at IS NULL
+        ORDER BY id DESC;
+    " 2>/dev/null)
+    if [ "$TOTAL_REAPED" -gt 0 ]; then
+        log "STUCK REAPER: complete — reaped $TOTAL_REAPED stuck cron(s)"
+    fi
+}
+
+run_stuck_reaper
+
 # ============================================================================
 # RUN CHECKS
 # ============================================================================
@@ -1023,6 +1200,7 @@ cat > "$STATUS_JSON" <<JSONEOF
     "warn": $TOTAL_WARN,
     "fail": $TOTAL_FAIL,
     "healed": $TOTAL_HEALED,
+    "reaped": $TOTAL_REAPED,
     "overall": "$OVERALL"
   },
   "launch_agents": [
@@ -1047,7 +1225,7 @@ cat > "$REPORT_TXT" <<REPORTEOF
 ======================================================
 NEXO WATCHDOG REPORT — $TS
 ======================================================
-PASS: $TOTAL_PASS | HEALED: $TOTAL_HEALED | WARN: $TOTAL_WARN | FAIL: $TOTAL_FAIL | TOTAL: $TOTAL
+PASS: $TOTAL_PASS | HEALED: $TOTAL_HEALED | WARN: $TOTAL_WARN | FAIL: $TOTAL_FAIL | REAPED: $TOTAL_REAPED | TOTAL: $TOTAL
 OVERALL: $OVERALL
 ======================================================
 
@@ -1261,4 +1439,4 @@ fi
 # ============================================================================
 # LOG SUMMARY
 # ============================================================================
-log "Complete: PASS=$TOTAL_PASS HEALED=$TOTAL_HEALED WARN=$TOTAL_WARN FAIL=$TOTAL_FAIL"
+log "Complete: PASS=$TOTAL_PASS HEALED=$TOTAL_HEALED WARN=$TOTAL_WARN FAIL=$TOTAL_FAIL REAPED=$TOTAL_REAPED"
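The reaper's escalation policy (SIGTERM, a bounded grace period, then SIGKILL) can be illustrated in isolation. The following is a simplified POSIX-only sketch for a single child process. It does not reproduce the watchdog's `pgrep`-based discovery or its handling of descendants, and `terminate_with_grace` is a hypothetical helper name. Note that `subprocess` reports death by signal N as `-N`, whereas a shell reports `128 + N`, which is where the `143` and `137` codes in the diff come from.

```python
import signal
import subprocess
import sys


def terminate_with_grace(proc: subprocess.Popen, grace_seconds: float = 10.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then escalate to SIGKILL.

    Returns the child's return code (negative signal number on POSIX).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    # A cooperative child exits on SIGTERM well within the grace period.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("return code:", terminate_with_grace(child, grace_seconds=5.0))
```

The grace period matters because a cooperative wrapper gets the chance to run its own cleanup, such as closing its `cron_runs` row, before the uncatchable signal is sent.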