nexo-brain 5.8.0 → 5.8.2

@@ -1,6 +1,6 @@
  {
    "name": "nexo-brain",
-   "version": "5.8.0",
+   "version": "5.8.2",
    "description": "Local cognitive runtime for Claude Code \u2014 persistent memory, overnight learning, doctor diagnostics, personal scripts, recovery-aware jobs, startup preflight, and optional dashboard/power helper.",
    "author": {
      "name": "NEXO Brain",
package/README.md CHANGED
@@ -18,7 +18,11 @@

  [Watch the overview video](https://nexo-brain.com/watch/) · [Watch on YouTube](https://www.youtube.com/watch?v=i2lkGhKyVqI) · [Open the infographic](https://nexo-brain.com/assets/nexo-brain-infographic-v5.png)

- Version `5.8.0` is the current packaged-runtime line: first-class `internal` and `owner` columns on `followups` and `reminders`. Migration #40 adds both fields with an idempotent one-shot backfill, so the "who does this task belong to?" classification moves from client-side regex (Desktop) to persistent storage every MCP client shares. Taxonomy is intentionally generic, `owner in {user, waiting, agent, shared}`, so third-party agents plugging into the shared Brain can render whatever assistant label they carry without inheriting NEXO branding. `nexo_reminder_create`, `nexo_reminder_update`, `nexo_followup_create`, and `nexo_followup_update` gain optional `internal` and `owner` parameters that win over the default heuristic.
+ Version `5.8.2` is the current packaged-runtime line: the Brain core no longer auto-classifies `followups` and `reminders` on behalf of agents. v5.8.0's `classify_task()` heuristic (NEXO-specific ID prefixes `NF-PROTOCOL-*` / `NF-DS-*` / `NF-AUDIT-*`, Spanish user-verbs `debes` / `revisar` / `firmar`, agent keywords `monitor` / `auditoría diaria` / `checkpoint`) was fine for NEXO's own DB but bled convention into every third-party agent plugged into the shared Brain. The core now persists `internal=0` and `owner=NULL` when the caller omits them, and clients that want automatic classification (NEXO Desktop does, via its `_legacyClassifyOwner` helpers) compute it themselves and pass the result. Migration #40 keeps the columns + indexes; rows already backfilled by v5.8.0 keep their values. `normalise_owner` still explicitly rejects the string `"nexo"` so legacy hardcoding cannot sneak back in.
+
+ Previously in `5.8.1`: closes a self-reinforcing `launchctl kickstart -k` loop in the watchdog that wedged deep-sleep Phase 2 between 2026-04-14 and 2026-04-17. The cron wrapper now INSERTs an in-flight row (`ended_at=NULL`) at start and traps SIGTERM/INT/HUP to close it with `exit_code=143` instead of vanishing from `cron_runs`. The watchdog interprets in-flight rows as "currently running" and only re-executes after verifying the worker process is dead. `extract.py` classifies CLI failures into transient (`overloaded_error`, rate-limit, timeout, signal — retried next run) and deterministic (skipped after `MAX_POISON_ATTEMPTS`), and passes a slim shared-context (200 head lines + metadata) instead of the full 400+ KB dump. A new `auto_update._heal_deep_sleep_runtime()` repairs existing installs silently on the next `nexo update`: poisoned checkpoints, stale locks, dangling `cron_runs` rows, and bloated `.watchdog-fails` counters.
+
+ Previously in `5.8.0`: first-class `internal` and `owner` columns on `followups` and `reminders`. Migration #40 adds both fields with an idempotent one-shot backfill, so the "who does this task belong to?" classification moves from client-side regex (Desktop) to persistent storage every MCP client shares. Taxonomy is intentionally generic — `owner in {user, waiting, agent, shared}` — so third-party agents plugging into the shared Brain can render whatever assistant label they carry without inheriting NEXO branding. `nexo_reminder_create`, `nexo_reminder_update`, `nexo_followup_create`, and `nexo_followup_update` gain optional `internal` and `owner` parameters that win over the default heuristic.

  Previously in `5.7.0`: `nexo update` now keeps Claude Code and Codex CLIs in lockstep with NEXO Brain itself. When the global `@anthropic-ai/claude-code` or `@openai/codex` packages are installed, the updater checks the npm registry and runs `npm install -g <pkg>@latest` in-line — so the terminal boot model stays aligned with the settings NEXO already wrote to `~/.claude/settings.json`. Packages the operator never installed are skipped silently. Pass `nexo update --no-clis` to keep the terminal CLIs pinned.
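
The in-flight row pattern described for `5.8.1` can be sketched in Python. This is only an illustration under stated assumptions: the real cron wrapper is not part of this diff and may well be a shell script, the `cron_id` column and the helper names `open_run`, `close_run`, `install_signal_traps`, and `is_in_flight` are hypothetical, and only `started_at`, `ended_at`, and `exit_code` are column names taken from the diff itself.

```python
import signal
import sqlite3
import sys

# Hypothetical minimal schema; the real cron_runs table has more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cron_runs (
    id INTEGER PRIMARY KEY,
    cron_id TEXT NOT NULL,
    started_at TEXT NOT NULL,
    ended_at TEXT,
    exit_code INTEGER
);
"""


def open_run(conn: sqlite3.Connection, cron_id: str) -> int:
    """Insert the in-flight row (ended_at stays NULL) before any work starts."""
    cur = conn.execute(
        "INSERT INTO cron_runs (cron_id, started_at) VALUES (?, datetime('now'))",
        (cron_id,),
    )
    conn.commit()
    return cur.lastrowid


def close_run(conn: sqlite3.Connection, run_id: int, exit_code: int) -> None:
    """Close the row once; a second call is a no-op because ended_at is set."""
    conn.execute(
        "UPDATE cron_runs SET ended_at = datetime('now'), exit_code = ? "
        "WHERE id = ? AND ended_at IS NULL",
        (exit_code, run_id),
    )
    conn.commit()


def install_signal_traps(conn: sqlite3.Connection, run_id: int) -> None:
    """On SIGTERM/SIGINT/SIGHUP, close the row with exit_code=143 and exit."""
    def _handler(signum, _frame):
        close_run(conn, run_id, 143)
        sys.exit(143)

    for name in ("SIGTERM", "SIGINT", "SIGHUP"):
        sig = getattr(signal, name, None)  # SIGHUP does not exist on Windows
        if sig is not None:
            signal.signal(sig, _handler)


def is_in_flight(conn: sqlite3.Connection, cron_id: str) -> bool:
    """Watchdog-side check: an open row means 'currently running'."""
    row = conn.execute(
        "SELECT 1 FROM cron_runs WHERE cron_id = ? AND ended_at IS NULL LIMIT 1",
        (cron_id,),
    ).fetchone()
    return row is not None
```

A watchdog built on this would still need to confirm that the worker process is dead before re-executing, as the release note says; the sketch covers only the bookkeeping side.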
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "nexo-brain",
-   "version": "5.8.0",
+   "version": "5.8.2",
    "mcpName": "io.github.wazionapps/nexo",
    "description": "NEXO Brain \u2014 Shared brain for AI agents. Persistent memory, semantic RAG, natural forgetting, metacognitive guard, trust scoring, 150+ MCP tools. Works with Claude Code, Codex, Claude Desktop & any MCP client. 100% local, free.",
    "homepage": "https://nexo-brain.com",
@@ -875,6 +875,135 @@ def _purge_zero_byte_db_files() -> list[Path]:
      return removed


+ def _heal_deep_sleep_runtime(dest: Path = NEXO_HOME) -> list[str]:
+     """Repair deep-sleep state that older runtimes left in a bad shape.
+
+     Runs on every ``auto_update`` post-sync. The bug it fixes: between
+     Brain 5.6.1 and 5.8.0 the cron wrapper only wrote to ``cron_runs`` at
+     end, so any wrapper killed by signal produced no row. The watchdog then
+     saw the cron as "missing cron_runs entry" and kickstart-k'd the live
+     worker — an infinite loop that wedged deep-sleep Phase 2 on the first
+     session of every batch. 5.8.1 fixes the loop at the source (wrapper
+     start-row + watchdog in-flight detection) but older runtimes that have
+     already been running the buggy loop need their residue cleaned up.
+
+     Returns the list of actions performed, for logging. Failures are
+     swallowed: this is best-effort healing, it must never block an update.
+     """
+     import sqlite3
+     import time as _time
+
+     actions: list[str] = []
+
+     deep_sleep_dir = dest / "operations" / "deep-sleep"
+     coord_dir = dest / "coordination"
+     data_db = dest / "data" / "nexo.db"
+     now = _time.time()
+
+     # (1) Drop poisoned checkpoints: the first retry that hit Anthropic's
+     # overloaded_error got cached as a permanent failure. Older
+     # extract.py re-used that checkpoint forever. New extract.py treats
+     # transient errors as retryable, but old poisoned checkpoints still
+     # claim 0 findings — purge them so the next deep-sleep retries cleanly.
+     if deep_sleep_dir.is_dir():
+         poisoned = 0
+         for checkpoint_dir in deep_sleep_dir.glob("*/checkpoints"):
+             if not checkpoint_dir.is_dir():
+                 continue
+             for entry in checkpoint_dir.glob("*.json"):
+                 try:
+                     content = entry.read_text()
+                 except OSError:
+                     continue
+                 if "overloaded_error" in content or '"error":{"type":"' in content:
+                     try:
+                         entry.unlink()
+                         poisoned += 1
+                     except OSError:
+                         pass
+         if poisoned:
+             actions.append(f"checkpoints-purged:{poisoned}")
+
+         # Drop debug-extract-*.txt scratch files older than 7 days.
+         stale_debug = 0
+         for entry in deep_sleep_dir.glob("debug-extract-*.txt"):
+             try:
+                 if now - entry.stat().st_mtime > 7 * 86400:
+                     entry.unlink()
+                     stale_debug += 1
+             except OSError:
+                 continue
+         if stale_debug:
+             actions.append(f"debug-scratch-purged:{stale_debug}")
+
+     # (2) Release stale deep-sleep locks so the next 04:30 run can acquire
+     # them. Locks older than 6h are always stale — a real run finishes
+     # in well under an hour.
+     lock_names = ("sleep.lock", "sleep-process.lock", "synthesis.lock")
+     released = 0
+     if coord_dir.is_dir():
+         for name in lock_names:
+             lock_path = coord_dir / name
+             if not lock_path.exists():
+                 continue
+             try:
+                 age = now - lock_path.stat().st_mtime
+             except OSError:
+                 continue
+             if age > 6 * 3600:
+                 try:
+                     lock_path.unlink()
+                     released += 1
+                 except OSError:
+                     pass
+     if released:
+         actions.append(f"stale-locks-released:{released}")
+
+     # (3) Close dangling cron_runs rows. Any row with ended_at IS NULL older
+     # than 6h is either a process killed by the old watchdog loop or a
+     # zombie left behind by a previous bad install. Close them with
+     # exit_code=143 + summary so the NEW watchdog treats the cron as
+     # "finished with error" rather than "in-flight forever".
+     if data_db.is_file():
+         try:
+             conn = sqlite3.connect(str(data_db), timeout=5)
+             try:
+                 cur = conn.execute(
+                     """
+                     UPDATE cron_runs
+                     SET ended_at = datetime('now'),
+                         exit_code = 143,
+                         error = 'healed by auto_update (pre-5.8.1 wrapper left row open)',
+                         duration_secs = CAST(
+                             strftime('%s','now') - strftime('%s', started_at) AS REAL
+                         )
+                     WHERE ended_at IS NULL
+                       AND strftime('%s','now') - strftime('%s', started_at) > 6 * 3600
+                     """
+                 )
+                 closed = cur.rowcount or 0
+                 conn.commit()
+                 if closed:
+                     actions.append(f"cron_runs-closed-dangling:{closed}")
+             finally:
+                 conn.close()
+         except Exception as exc:
+             actions.append(f"cron_runs-heal-warning:{exc.__class__.__name__}")
+
+     # (4) Remove .watchdog-fails registry entries older than 24h — the new
+     # in-flight detection makes stale counters obsolete.
+     fails_file = dest / "scripts" / ".watchdog-fails"
+     if fails_file.exists():
+         try:
+             if now - fails_file.stat().st_mtime > 24 * 3600:
+                 fails_file.unlink()
+                 actions.append("watchdog-fails-reset")
+         except OSError:
+             pass
+
+     return actions
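
Step (3) of the function above can be exercised in isolation against an in-memory SQLite database. The sketch below reuses the `UPDATE` statement from the diff verbatim. The table definition and the `demo_heal_dangling_rows` helper are assumptions for illustration, and it assumes `started_at` is stored as a text timestamp that SQLite's date functions can parse.

```python
import sqlite3

# Statement copied from the diff; closes rows left open for more than 6 hours.
HEAL_SQL = """
UPDATE cron_runs
SET ended_at = datetime('now'),
    exit_code = 143,
    error = 'healed by auto_update (pre-5.8.1 wrapper left row open)',
    duration_secs = CAST(
        strftime('%s','now') - strftime('%s', started_at) AS REAL
    )
WHERE ended_at IS NULL
  AND strftime('%s','now') - strftime('%s', started_at) > 6 * 3600
"""


def demo_heal_dangling_rows() -> tuple[int, int]:
    """Return (rows closed, rows still open) for one stale and one fresh run."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical minimal schema holding only the columns the UPDATE touches.
        conn.execute(
            "CREATE TABLE cron_runs ("
            "id INTEGER PRIMARY KEY, started_at TEXT, ended_at TEXT, "
            "exit_code INTEGER, error TEXT, duration_secs REAL)"
        )
        # One row opened 7 hours ago (stale) and one opened 10 minutes ago (live).
        conn.execute("INSERT INTO cron_runs (started_at) VALUES (datetime('now', '-7 hours'))")
        conn.execute("INSERT INTO cron_runs (started_at) VALUES (datetime('now', '-10 minutes'))")
        closed = conn.execute(HEAL_SQL).rowcount
        conn.commit()
        still_open = conn.execute(
            "SELECT COUNT(*) FROM cron_runs WHERE ended_at IS NULL"
        ).fetchone()[0]
        return closed, still_open
    finally:
        conn.close()
```

Only the 7-hour-old row crosses the 6-hour threshold, so the recent row stays in-flight; running the statement a second time would close nothing, which is what makes the heal step safe to repeat.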
+
+
  def _backup_dbs() -> str | None:
      """Snapshot all .db files before migration. Returns backup dir or None."""
      import sqlite3
@@ -2558,6 +2687,16 @@ def _run_runtime_post_sync(dest: Path = NEXO_HOME, progress_fn=None) -> tuple[bo
      except Exception as e:
          actions.append(f"client-sync-warning:{e}")

+     # Heal deep-sleep residue from older buggy runtimes. Idempotent + safe:
+     # no-op if the runtime is already clean.
+     try:
+         _emit_progress(progress_fn, "Healing deep-sleep runtime state...")
+         heal_actions = _heal_deep_sleep_runtime(dest)
+         for action in heal_actions:
+             actions.append(f"deep-sleep-heal:{action}")
+     except Exception as exc:
+         actions.append(f"deep-sleep-heal-warning:{exc.__class__.__name__}")
+
      _emit_progress(progress_fn, "Verifying runtime imports...")
      verify = subprocess.run(
          [sys.executable, "-c", "import server"],
@@ -1,132 +1,54 @@
- """NEXO DB — Task classification helpers (internal + owner).
-
- Introduced in migration #40. Every followup and reminder carries two
- classification attributes so clients (Desktop Home, dashboard, future
- agents) do not need to compute them with client-side regex:
-
-     internal (INTEGER 0/1):
-         1 if the task is bookkeeping the agent keeps for itself
-         (protocol enforcer, deep-sleep housekeeping, audit trail,
-         release gates, retroactive learnings). These are hidden from
-         normal user views by default.
-
-     owner (TEXT):
-         'user' — the user has to act (was 'Para ti' in Desktop).
-         'waiting' — blocked on an external response (was 'Esperando').
+ """NEXO DB — Task classification storage (internal + owner).
+
+ Migration #40 added ``internal`` and ``owner`` columns to ``followups`` and
+ ``reminders``. Agents creating or updating tasks pass these two fields
+ explicitly via the MCP tools (``nexo_followup_create``, ``nexo_reminder_create``
+ and their ``_update`` counterparts).
+
+ The Brain core does **not** classify tasks on behalf of agents. Up to and
+ including v5.8.1 the core shipped a Spanish-first regex heuristic
+ (``NF-PROTOCOL-*`` / ``NF-DS-*`` prefixes, user verbs like ``debes``,
+ ``revisar``, etc.) as a fallback for callers that left the fields blank.
+ That fallback bled NEXO-specific naming conventions into every deployment
+ of the shared Brain — third-party agents plugged into the same DB would
+ inherit classifications they never asked for. v5.8.2 removes it.
+
+ The module now exposes only:
+
+     VALID_OWNERS — the canonical set {user, waiting, agent, shared}.
+     normalise_owner — clamps an agent-supplied string to VALID_OWNERS
+         (or ``None`` for empty / invalid input so the
+         caller can decide whether to persist ``NULL``).
+     normalise_internal — coerces truthy / boolean / numeric agent input
+         into ``0`` / ``1`` (or ``None`` for empty input).
+
+     owner values:
+         'user' — the user has to act.
+         'waiting' — blocked on an external response.
          'agent' — the AI agent handles it autonomously. Intentionally
-             named 'agent' and NOT 'nexo' so non-NEXO deployments
-             render whatever label fits (e.g. 'Claude', 'Codex',
-             hotel-assistant name). The user-facing label is
-             resolved client-side.
-         'shared' — collaborative follow-up (was 'Seguimiento').
-         NULL — unclassified; clients fall back to the legacy
-             client-side heuristic for backward compat.
-
- Agents creating tasks via nexo_followup_create / nexo_reminder_create
- can override both fields explicitly. If they leave them blank, the
- Brain applies the heuristic below so a vanilla agent keeps sensible
- behaviour out of the box.
+             named ``agent`` (not ``nexo``) so deployments render
+             whatever assistant label fits client-side.
+         'shared' — collaborative follow-up.
+         NULL — unclassified; clients are free to apply whatever
+             fallback they want at render time.
+
+ Clients that want automatic classification (NEXO Desktop does, via its
+ ``_legacyClassifyOwner`` / ``_legacyIsInternalTaskId`` helpers) compute
+ ``owner``/``internal`` themselves and pass them to the create/update call.
  """

  from __future__ import annotations

- import re
-
- # Task-ID prefixes historically owned by NEXO's own automation. They are
- # kept as a default heuristic because they match the existing corpus of
- # 468+ followups and 40+ reminders. Any agent not following this naming
- # convention will simply not match these patterns and its tasks will
- # stay visible (internal=0) unless the agent sets internal=1 explicitly
- # on create — which is exactly what we want for a pluralistic ecosystem.
- _INTERNAL_ID_PATTERNS = [
-     re.compile(r"^NF-PROTOCOL[-_]", re.IGNORECASE),
-     re.compile(r"^NF-DS[-_]", re.IGNORECASE),
-     re.compile(r"^NF-AUDIT[-_]", re.IGNORECASE),
-     re.compile(r"^NF-OPPORTUNITY[-_]", re.IGNORECASE),
-     re.compile(r"^NF-RETRO[-_]", re.IGNORECASE),
-     re.compile(r"^R-RELEASE[-_]", re.IGNORECASE),
-     re.compile(r"^R-FU-NF-PROTOCOL[-_]", re.IGNORECASE),
-     re.compile(r"^R-FU-NF-DS[-_]", re.IGNORECASE),
-     re.compile(r"^R-FU-NF-AUDIT[-_]", re.IGNORECASE),
- ]
-
- # Spanish user-action verbs. The heuristic is Spanish-first because the
- # existing corpus is Spanish, but since every agent can override `owner`
- # explicitly on create, deployments in other languages are not blocked.
- _USER_VERB_RX = re.compile(
-     r"\b(francisco debe|debes|llamar|responder|revisar|validar|confirmar|"
-     r"decidir|aprobar|firmar|enviar email|mandar email|contestar|"
-     r"reuni[óo]n|reservar|comprar)\b",
-     re.IGNORECASE,
- )
-
- _WAITING_RX = re.compile(
-     r"\b(esperando|esperar|bloqueo|bloqueado|pendiente respuesta|"
-     r"pendiente de|en espera)\b",
-     re.IGNORECASE,
- )
-
- _AGENT_RX = re.compile(
-     r"\b(monitoreo|monitorizar|monitor|auditor[íi]a diaria|"
-     r"promoci[óo]n diaria|seguir|seguimiento 24|72h|checkpoint|runner|cron)\b",
-     re.IGNORECASE,
- )

  VALID_OWNERS = {"user", "waiting", "agent", "shared"}


- def is_internal_id(task_id: str | None) -> bool:
-     """Return True when the ID matches a known agent-internal prefix."""
-     tid = (task_id or "").strip()
-     if not tid:
-         return False
-     return any(pat.search(tid) for pat in _INTERNAL_ID_PATTERNS)
-
-
- def classify_owner(
-     task_id: str | None,
-     description: str | None,
-     category: str | None = None,
-     recurrence: str | None = None,
- ) -> str:
-     """Classify ownership into one of VALID_OWNERS using the legacy rules."""
-     tid = (task_id or "").strip()
-     desc = (description or "").strip()
-     cat = (category or "").strip().lower()
-     rec = (recurrence or "").strip()
-
-     if cat == "waiting" or _WAITING_RX.search(desc):
-         return "waiting"
-     if _USER_VERB_RX.search(desc) or tid.lower().startswith("nf-protocol-"):
-         return "user"
-     if rec or _AGENT_RX.search(desc):
-         return "agent"
-     return "shared"
-
-
- def classify_task(
-     task_id: str | None,
-     description: str | None,
-     category: str | None = None,
-     recurrence: str | None = None,
- ) -> tuple[int, str]:
-     """Compute (internal, owner) pair for a task.
-
-     Returns integers for internal so the SQLite column (INTEGER DEFAULT 0)
-     and the JSON round-trip stay consistent. Clients can truthy-check either
-     int or bool safely.
-     """
-     internal = 1 if is_internal_id(task_id) else 0
-     owner = classify_owner(task_id, description, category, recurrence)
-     return internal, owner
-
-
  def normalise_owner(value: str | None) -> str | None:
      """Accept owner overrides from agents and clamp to VALID_OWNERS.

      Returns None for empty input (so the DB keeps NULL / pre-existing value)
      and coerces invalid strings to None rather than silently persisting
-     garbage. Callers decide whether to fall back to classify_owner().
+     garbage.
      """
      if value is None:
          return None
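
The diff shows only the first two lines of `normalise_owner`, and `normalise_internal` not at all, so the following is a guess at the behaviour the docstrings describe rather than the package's actual implementation. Lower-casing and trimming the owner string, the set of accepted truthy strings, and the `resolve_classification` helper (which mirrors the defaulting now done inside `create_reminder` and `create_followup`) are all assumptions.

```python
from __future__ import annotations

VALID_OWNERS = {"user", "waiting", "agent", "shared"}

# Assumed spellings of "true"; the real module may accept a different set.
_TRUTHY_STRINGS = {"1", "true", "yes", "on"}


def normalise_owner(value: str | None) -> str | None:
    """Clamp an agent-supplied owner to VALID_OWNERS, else None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    # Anything outside the canonical set (including "nexo") becomes None.
    return text if text in VALID_OWNERS else None


def normalise_internal(value: object) -> int | None:
    """Coerce bool / numeric / string input to 0 or 1; None for empty input."""
    if value is None:
        return None
    if isinstance(value, bool):
        return int(value)
    if isinstance(value, (int, float)):
        return 1 if value else 0
    text = str(value).strip().lower()
    if not text:
        return None
    return 1 if text in _TRUTHY_STRINGS else 0


def resolve_classification(internal: object = None, owner: str | None = None) -> tuple[int, str | None]:
    """Hypothetical helper: omitted fields persist as internal=0, owner=NULL."""
    internal_value = normalise_internal(internal)
    if internal_value is None:
        internal_value = 0
    return internal_value, normalise_owner(owner)
```

Under these assumptions a caller that passes nothing gets `(0, None)`, and a client that classifies on its own side passes its result straight through, for example `resolve_classification(True, "agent")`.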
@@ -8,7 +8,7 @@ import sqlite3
  from typing import Any

  from db._core import get_db, now_epoch
- from db._classification import classify_task, normalise_internal, normalise_owner
+ from db._classification import normalise_internal, normalise_owner
  from db._fts import fts_upsert
  from db._hot_context import capture_context_event

@@ -256,18 +256,18 @@ def create_reminder(
      """Create a new reminder.

      Agents may pass `internal` (0/1, bool, or string) and `owner`
-     ('user'|'waiting'|'agent'|'shared') to override the default
-     classification. When omitted, classify_task() applies the legacy
-     heuristic so behaviour matches pre-migration #40.
+     ('user'|'waiting'|'agent'|'shared'). When omitted, the Brain persists
+     ``internal=0`` and ``owner=NULL``: the Brain core does not classify
+     tasks on behalf of agents. Clients that want automatic classification
+     compute it themselves and pass the result.
      """
      conn = get_db()
      now = now_epoch()

-     auto_internal, auto_owner = classify_task(id, description, category, None)
      internal_value = normalise_internal(internal)
      if internal_value is None:
-         internal_value = auto_internal
-     owner_value = normalise_owner(owner) or auto_owner
+         internal_value = 0
+     owner_value = normalise_owner(owner)

      columns = {str(row["name"]) for row in conn.execute("PRAGMA table_info(reminders)").fetchall()}
      payload: dict[str, object] = {
@@ -615,9 +615,10 @@ def create_followup(
  ) -> dict:
      """Create a new followup with optional reasoning and recurrence.

-     Agents may override the default classification via `internal` and
-     `owner`. Omitted values are filled by classify_task() using the
-     legacy heuristics so pre-migration callers keep working identically.
+     Agents may set `internal` and `owner` explicitly. Omitted values
+     persist as ``internal=0`` and ``owner=NULL``: the Brain core does not
+     classify tasks on behalf of agents. Clients that want automatic
+     classification compute it themselves and pass the result.
      """
      conn = get_db()
      now = now_epoch()
@@ -630,11 +631,10 @@ def create_followup(
              f"(scores: {', '.join(str(s['_similarity']) for s in similar[:3])}). Consider updating instead."
          )

-     auto_internal, auto_owner = classify_task(id, description, None, recurrence)
      internal_value = normalise_internal(internal)
      if internal_value is None:
-         internal_value = auto_internal
-     owner_value = normalise_owner(owner) or auto_owner
+         internal_value = 0
+     owner_value = normalise_owner(owner)

      columns = {str(row["name"]) for row in conn.execute("PRAGMA table_info(followups)").fetchall()}
      payload: dict[str, object] = {
package/src/db/_schema.py CHANGED
@@ -939,18 +939,11 @@ def _m39_hook_runs(conn):
  def _m40_classification_columns(conn):
      """Add internal (INTEGER 0/1) and owner (TEXT) to followups and reminders.

-     Background: before this migration, Desktop clients had to compute the
-     "who does this belong to" classification client-side using Spanish regex
-     on description and ID-prefix pattern matching (NF-PROTOCOL-*, NF-DS-*, …).
-     That logic was hardcoded to NEXO's own ID convention and Spanish-speaking
-     users. Any third-party agent plugging into the shared Brain would either
-     see every task as "Seguimiento" (owner=shared fallback) or, worse, have
-     its real user-facing tasks hidden by the Desktop 'internal' filter.
-
-     Fix: make both attributes first-class columns agents can set on create.
-     Vanilla agents that omit them get the legacy heuristic (classify_task)
-     applied on insert and during this one-shot backfill, so existing rows
-     preserve their current Desktop rendering.
+     Agents creating tasks via nexo_followup_create / nexo_reminder_create
+     can set both fields explicitly. The Brain core does not classify tasks
+     on behalf of agents; clients that want automatic classification
+     compute it themselves (NEXO Desktop does, via its legacy client-side
+     helpers) and pass the result.

      Values:
          internal: 0 (external, visible) or 1 (agent bookkeeping, hidden).
@@ -960,8 +953,10 @@ def _m40_classification_columns(conn):
              'NEXO'.

      Idempotent: _migrate_add_column is a no-op when the column exists,
-     _migrate_add_index likewise. The backfill only touches rows where
-     owner IS NULL, so re-running never overwrites agent-set values.
+     _migrate_add_index likewise. Pre-v5.8.2 versions of this migration
+     also ran a one-shot backfill using a Spanish-first regex heuristic;
+     v5.8.2 removed that heuristic so the core stays neutral across
+     deployments. Rows that were already backfilled keep their values.
      """
      _migrate_add_column(conn, "followups", "internal", "INTEGER DEFAULT 0")
      _migrate_add_column(conn, "followups", "owner", "TEXT DEFAULT NULL")
@@ -972,32 +967,6 @@ def _m40_classification_columns(conn):
      _migrate_add_index(conn, "idx_reminders_internal", "reminders", "internal")
      _migrate_add_index(conn, "idx_reminders_owner", "reminders", "owner")

-     from db._classification import classify_task
-
-     rows = conn.execute(
-         "SELECT id, description, recurrence FROM followups WHERE owner IS NULL"
-     ).fetchall()
-     for row in rows:
-         internal, owner = classify_task(
-             row["id"], row["description"], None, row["recurrence"]
-         )
-         conn.execute(
-             "UPDATE followups SET internal = ?, owner = ? WHERE id = ?",
-             (internal, owner, row["id"]),
-         )
-
-     rows = conn.execute(
-         "SELECT id, description, category FROM reminders WHERE owner IS NULL"
-     ).fetchall()
-     for row in rows:
-         internal, owner = classify_task(
-             row["id"], row["description"], row["category"], None
-         )
-         conn.execute(
-             "UPDATE reminders SET internal = ?, owner = ? WHERE id = ?",
-             (internal, owner, row["id"]),
-         )
-

  MIGRATIONS = [
      (1, "learnings_columns", _m1_learnings_columns),
@@ -38,6 +38,56 @@ except Exception:
  # still leaving enough headroom for legitimate long per-session extractions.
  CLAUDE_TIMEOUT = AUTOMATION_SUBPROCESS_TIMEOUT

+ # Poison detection: a session checkpoint records the number of failed attempts
+ # across runs. Once it reaches this limit we stop trying to extract findings
+ # from that session — repeated failures on the same session (deterministic
+ # JSON parse errors, unreadable transcripts) only burn API credits and stall
+ # the whole deep-sleep cycle behind the poisoned session. The session is still
+ # kept in the output (with the error) so synthesize.py can account for it.
+ MAX_POISON_ATTEMPTS = 3
+
+ # Transient error types worth retrying on the next deep-sleep run instead of
+ # being counted as a poisoned attempt. `overloaded_error` comes from the
+ # Anthropic API when it is under load and is the cause of the stuck
+ # deep-sleep between 2026-04-14 and 2026-04-17 — the first attempt hit it,
+ # the checkpoint flagged it as permanent failure, and later runs kept
+ # re-processing the same session forever.
+ TRANSIENT_ERROR_KINDS = {
+     "overloaded_error",
+     "rate_limit_error",
+     "api_error",
+     "timeout",
+     "signal",
+ }
+
+
+ def _classify_cli_result(result) -> tuple[str, str]:
+     """Return (kind, short_message) describing a failed automation backend call.
+
+     Kinds:
+       - "overloaded_error" / "rate_limit_error" / "api_error"
+           Anthropic API transient failure — do not poison the checkpoint.
+       - "signal" Claude CLI killed by external signal (SIGTERM / SIGKILL / exit>=128).
+       - "timeout" Subprocess hit CLAUDE_TIMEOUT — extremely long session.
+       - "json_parse" Claude responded, but output wasn't parseable JSON.
+       - "unknown" Fallback.
+     """
+     rc = getattr(result, "returncode", -1)
+     stderr = (getattr(result, "stderr", "") or "")[:800]
+     stdout = (getattr(result, "stdout", "") or "")[:800]
+     blob = f"{stderr}\n{stdout}".lower()
+     if "overloaded" in blob:
+         return "overloaded_error", "Anthropic API overloaded"
+     if "rate_limit" in blob or "rate-limit" in blob or "429" in blob:
+         return "rate_limit_error", "Anthropic rate-limit hit"
+     if '"type":"error"' in blob and '"api_error"' in blob:
+         return "api_error", "Anthropic API error"
+     if rc >= 128:
+         return "signal", f"killed by signal (exit {rc})"
+     if rc < 0:
+         return "signal", f"subprocess terminated (exit {rc})"
+     return "unknown", f"exit {rc}"
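
How the error kind might feed the poison counter is not visible in this part of the diff, so the following is only a sketch of one policy consistent with the comments above: transient kinds leave the attempt count alone, everything else increments it, and a session is skipped once the count reaches `MAX_POISON_ATTEMPTS`. The checkpoint keys `attempts` and `last_error` and both helper names are hypothetical.

```python
MAX_POISON_ATTEMPTS = 3

TRANSIENT_ERROR_KINDS = {
    "overloaded_error",
    "rate_limit_error",
    "api_error",
    "timeout",
    "signal",
}


def record_failure(checkpoint: dict, error_kind: str) -> dict:
    """Return an updated copy of the checkpoint after a failed extraction.

    Transient failures are remembered but not counted, so they are retried
    on the next run; deterministic failures move the session toward poisoned.
    """
    updated = dict(checkpoint)
    updated["last_error"] = error_kind
    if error_kind not in TRANSIENT_ERROR_KINDS:
        updated["attempts"] = int(checkpoint.get("attempts", 0)) + 1
    return updated


def should_skip(checkpoint: dict) -> bool:
    """True once deterministic failures have reached the poison limit."""
    return int(checkpoint.get("attempts", 0)) >= MAX_POISON_ATTEMPTS
```

With this policy an `overloaded_error` on the first attempt no longer freezes a session as a permanent failure, which is the failure mode the comments attribute to the 2026-04-14 to 2026-04-17 stall.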
+

  def extract_json_from_response(text: str) -> dict | None:
      """Parse JSON from Claude's response, handling markdown fences."""
@@ -104,20 +154,23 @@ def analyze_session(
      date_dir: Path,
      shared_context_file: Path | None,
      session_txt_map: dict[str, str] | None = None,
- ) -> dict | None:
+ ) -> tuple[dict | None, str | None]:
      """Send a session to the automation backend for extraction analysis.

-     The backend reads the small per-session file + shared context file.
-     Prompt is short; the heavy lifting is in the Read tool calls.
+     Returns (parsed_result, error_kind). `error_kind` is only set on failure.
+     See `_classify_cli_result` for possible values.
      """
      session_file = find_session_file(session_id, date_dir, session_txt_map=session_txt_map)
      if not session_file:
          print(f" No session file found for {session_id}", file=sys.stderr)
-         return None
+         return None, "missing_session_file"

      print(f" File: {session_file.name} ({session_file.stat().st_size / 1024:.0f} KB)")

-     # Build a short prompt — Claude reads the files itself
+     # Build a short prompt — Claude reads the files itself. We point at the
+     # slim shared context rather than the full 400+KB dump so the Claude CLI
+     # process doesn't have to stream hundreds of kilobytes of followups /
+     # learnings into its context window on every per-session extraction.
      shared_ctx_instruction = ""
      if shared_context_file and shared_context_file.exists():
          shared_ctx_instruction = f"\n\nAlso read the shared context (followups, learnings, DB state) at: {shared_context_file}"
@@ -148,8 +201,9 @@ def analyze_session(
          )

          if result.returncode != 0:
-             print(f" Automation backend error (exit {result.returncode}): {result.stderr[:300]}", file=sys.stderr)
-             return None
+             kind, message = _classify_cli_result(result)
+             print(f" Automation backend {kind} (exit {result.returncode}): {message}", file=sys.stderr)
+             return None, kind

          # Filter out stop hook contamination (e.g. "Post-mortem completo.")
          output = "\n".join(
@@ -185,18 +239,65 @@ def analyze_session(
              debug_file = DEEP_SLEEP_DIR / f"debug-extract-{session_id[:20]}.txt"
              debug_file.write_text(result.stdout[:5000])
              print(f" Failed to parse JSON. Raw output saved to {debug_file}", file=sys.stderr)
-             return None
+             return None, "json_parse"

-         return parsed
+         return parsed, None

      except AutomationBackendUnavailableError as exc:
          print(f" Automation backend unavailable: {exc}", file=sys.stderr)
-         return None
+         return None, "backend_unavailable"
      except subprocess.TimeoutExpired:
          print(f" Automation backend timeout ({CLAUDE_TIMEOUT}s)", file=sys.stderr)
+         return None, "timeout"
+
+
+ def _write_slim_shared_context(full_path: Path) -> Path:
+     """Generate (once per run) a slim version of shared-context.txt.
+
+     The full shared context can exceed 400KB — feeding that to every
+     per-session extraction means the Claude CLI subprocess spends most of its
+     context window on repeated DB metadata instead of the session transcript.
+     The slim version keeps the top-level structure + the first ~200 lines so
+     the model still has a summary of followups/learnings/diary samples.
+     """
+     slim_path = full_path.with_suffix(".slim.txt")
+     try:
+         raw = full_path.read_text(errors="replace")
+     except OSError:
+         return full_path
+     lines = raw.splitlines()
+     head = lines[:200]
+     header = [
+         "# Shared context (slim) — " + full_path.name,
+         f"# original_bytes={full_path.stat().st_size} original_lines={len(lines)}",
+         f"# trimmed_to=first_{len(head)}_lines",
+         "",
+     ]
+     try:
+         slim_path.write_text("\n".join(header + head), encoding="utf-8")
+     except OSError:
+         return full_path
+     return slim_path
+
+
+ def _load_checkpoint(path: Path) -> dict | None:
+     if not path.exists():
+         return None
+     try:
+         with path.open() as fh:
+             return json.load(fh)
+     except (json.JSONDecodeError, OSError):
          return None


+ def _save_checkpoint(path: Path, payload: dict) -> None:
+     try:
+         with path.open("w") as fh:
+             json.dump(payload, fh, indent=2, ensure_ascii=False)
+     except OSError as exc:
+         print(f" Warning: could not persist checkpoint {path}: {exc}", file=sys.stderr)
299
+
300
+
200
301
  def main():
201
302
  target_date = sys.argv[1] if len(sys.argv) > 1 else datetime.now().strftime("%Y-%m-%d")
202
303
 
@@ -237,12 +338,17 @@ def main():
237
338
  print(f"[extract] Output: {output_file}")
238
339
  return
239
340
 
240
- # Shared context file (followups, learnings, DB state)
241
- shared_context_file = date_dir / "shared-context.txt" if date_dir.exists() else None
242
- if shared_context_file and shared_context_file.exists():
243
- print(f"[extract] Shared context: {shared_context_file} ({shared_context_file.stat().st_size / 1024:.0f} KB)")
341
+ # Shared context file (followups, learnings, DB state).
342
+ # Use a slim copy for the per-session prompts so the Claude CLI doesn't
343
+ # re-read the full 400+KB dump for every single session.
344
+ full_shared_context = date_dir / "shared-context.txt" if date_dir.exists() else None
345
+ shared_context_file: Path | None = None
346
+ if full_shared_context and full_shared_context.exists():
347
+ shared_context_file = _write_slim_shared_context(full_shared_context)
348
+ full_kb = full_shared_context.stat().st_size / 1024
349
+ slim_kb = shared_context_file.stat().st_size / 1024
350
+ print(f"[extract] Shared context: {shared_context_file} ({slim_kb:.0f} KB slim, {full_kb:.0f} KB full)")
244
351
  else:
245
- shared_context_file = None
246
352
  print("[extract] No shared context file")
247
353
 
248
354
  print(f"[extract] Phase 2: Analyzing {len(session_files)} sessions for {target_date}")
@@ -255,6 +361,7 @@ def main():
255
361
  all_extractions = []
256
362
  total_findings = 0
257
363
  skipped = 0
364
+ poisoned = 0
258
365
  # Two attempts is enough: if a session's extraction fails twice, the cause is
259
366
  # almost always deterministic (JSON parse, schema violation) rather than transient,
260
367
  # so further retries just burn time. Skip and continue instead.
@@ -264,26 +371,42 @@ def main():
264
371
  sid_safe = _safe_session_slug(session_id)[:40]
265
372
  checkpoint_file = checkpoint_dir / f"{sid_safe}.json"
266
373
 
267
- # Resume: skip already-processed sessions
268
- if checkpoint_file.exists():
269
- try:
270
- with open(checkpoint_file) as f:
271
- cached = json.load(f)
272
- findings_count = len(cached.get("findings", []))
273
- total_findings += findings_count
274
- all_extractions.append(cached)
275
- skipped += 1
276
- print(f"[extract] Session {i + 1}/{len(session_files)}: {session_id} (cached, {findings_count} findings)")
277
- continue
278
- except (json.JSONDecodeError, KeyError):
279
- pass # Corrupted checkpoint, re-process
374
+ cached = _load_checkpoint(checkpoint_file)
375
+ cached_error_count = int((cached or {}).get("error_count", 0))
376
+ cached_last_error_kind = (cached or {}).get("last_error_kind", "")
377
+
378
+ # Successful prior checkpoint → reuse as-is
379
+ if cached and not cached.get("error") and cached.get("findings") is not None:
380
+ findings_count = len(cached.get("findings", []))
381
+ total_findings += findings_count
382
+ all_extractions.append(cached)
383
+ skipped += 1
384
+ print(f"[extract] Session {i + 1}/{len(session_files)}: {session_id} (cached, {findings_count} findings)")
385
+ continue
386
+
387
+ # Poisoned checkpoint → skip without burning API calls
388
+ if cached_error_count >= MAX_POISON_ATTEMPTS:
389
+ poisoned += 1
390
+ all_extractions.append(cached or {
391
+ "session_id": session_id,
392
+ "findings": [],
393
+ "error": "poisoned",
394
+ "error_count": cached_error_count,
395
+ "last_error_kind": cached_last_error_kind,
396
+ })
397
+ print(
398
+ f"[extract] Session {i + 1}/{len(session_files)}: {session_id} "
399
+ f"(poisoned, {cached_error_count} prior failures — skip)"
400
+ )
401
+ continue
280
402
 
281
403
  print(f"[extract] Session {i + 1}/{len(session_files)}: {session_id}")
282
404
 
283
- # Retry loop
405
+ # Retry loop within this run
284
406
  result = None
407
+ last_error_kind = ""
285
408
  for attempt in range(1, MAX_RETRIES + 1):
286
- result = analyze_session(
409
+ result, error_kind = analyze_session(
287
410
  session_id,
288
411
  date_dir,
289
412
  shared_context_file,
@@ -291,47 +414,77 @@ def main():
291
414
  )
292
415
  if result:
293
416
  break
417
+ last_error_kind = error_kind or "unknown"
294
418
  if attempt < MAX_RETRIES:
295
- print(f" -> Attempt {attempt}/{MAX_RETRIES} failed, retrying...")
419
+ print(f" -> Attempt {attempt}/{MAX_RETRIES} failed ({last_error_kind}), retrying...")
296
420
 
297
421
  if result:
298
422
  findings_count = len(result.get("findings", []))
299
423
  total_findings += findings_count
424
+ # Persist success and reset error_count so transient past failures
425
+ # don't keep counting against the session.
426
+ result.setdefault("session_id", session_id)
427
+ result["error_count"] = 0
428
+ result["last_error_kind"] = ""
300
429
  all_extractions.append(result)
301
- # Save checkpoint
302
- with open(checkpoint_file, "w") as f:
303
- json.dump(result, f, indent=2, ensure_ascii=False)
430
+ _save_checkpoint(checkpoint_file, result)
304
431
  print(f" -> {findings_count} findings extracted (checkpointed)")
305
432
  else:
306
- print(f" -> Failed after {MAX_RETRIES} attempts, marking as failed")
433
+ # Transient errors (API overloaded, rate-limit, timeout, killed
434
+ # by signal) should NOT increment the poison counter — they're
435
+ # not the session's fault. They also don't persist a fresh
436
+ # checkpoint, so the next deep-sleep run will retry cleanly.
437
+ transient = last_error_kind in TRANSIENT_ERROR_KINDS
438
+ if transient:
439
+ print(f" -> Transient failure ({last_error_kind}), will retry on next run.")
440
+ all_extractions.append({
441
+ "session_id": session_id,
442
+ "findings": [],
443
+ "error": "transient",
444
+ "error_count": cached_error_count,
445
+ "last_error_kind": last_error_kind,
446
+ })
447
+ # Do not touch the checkpoint — the next run gets a clean retry.
448
+ continue
449
+
450
+ new_count = cached_error_count + 1
451
+ state = "poisoned" if new_count >= MAX_POISON_ATTEMPTS else "failed"
452
+ print(
453
+ f" -> Deterministic failure #{new_count}/{MAX_POISON_ATTEMPTS} "
454
+ f"({last_error_kind}); marked as {state}."
455
+ )
307
456
  failed_entry = {
308
457
  "session_id": session_id,
309
458
  "findings": [],
310
- "error": f"Extraction failed after {MAX_RETRIES} attempts"
459
+ "error": state,
460
+ "error_count": new_count,
461
+ "last_error_kind": last_error_kind,
311
462
  }
312
463
  all_extractions.append(failed_entry)
313
- # Save failed checkpoint too (so we don't retry forever)
314
- with open(checkpoint_file, "w") as f:
315
- json.dump(failed_entry, f, indent=2, ensure_ascii=False)
464
+ _save_checkpoint(checkpoint_file, failed_entry)
465
+ if state == "poisoned":
466
+ poisoned += 1
316
467
 
317
468
  # Merge into output
318
469
  output = {
319
470
  "date": target_date,
320
471
  "sessions_analyzed": len(session_files),
321
- "sessions_succeeded": len([e for e in all_extractions if "error" not in e]),
472
+ "sessions_succeeded": len([e for e in all_extractions if not e.get("error")]),
322
473
  "sessions_cached": skipped,
474
+ "sessions_poisoned": poisoned,
323
475
  "total_findings": total_findings,
324
- "extractions": all_extractions
476
+ "extractions": all_extractions,
325
477
  }
326
478
 
327
479
  output_file = DEEP_SLEEP_DIR / f"{target_date}-extractions.json"
328
480
  with open(output_file, "w") as f:
329
481
  json.dump(output, f, indent=2, ensure_ascii=False)
330
482
 
331
- if skipped:
332
- print(f"\n[extract] Done. {total_findings} findings from {len(session_files)} sessions ({skipped} cached, {len(session_files) - skipped} new).")
333
- else:
334
- print(f"\n[extract] Done. {total_findings} findings from {len(session_files)} sessions.")
483
+ fresh_runs = len(session_files) - skipped - poisoned
484
+ print(
485
+ f"\n[extract] Done. {total_findings} findings from {len(session_files)} sessions "
486
+ f"({skipped} cached, {fresh_runs} fresh, {poisoned} poisoned)."
487
+ )
335
488
  print(f"[extract] Output: {output_file}")
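Note on the checkpoint bookkeeping in the hunks above: after the in-run retry loop, each session ends in one of four outcomes (success, transient failure, deterministic failure, poisoned). The following is a minimal sketch of that decision, not the package's code. The values of `MAX_POISON_ATTEMPTS` and `TRANSIENT_ERROR_KINDS` are assumptions, since the diff references both constants without showing their definitions. The function name `next_checkpoint_state` and the `persist` flag are hypothetical, introduced only for illustration.

```python
from __future__ import annotations

# Assumed values: the diff uses these names but never shows their definitions.
MAX_POISON_ATTEMPTS = 3
TRANSIENT_ERROR_KINDS = {"timeout"}


def next_checkpoint_state(cached_error_count: int, error_kind: str | None) -> dict:
    """Bookkeeping for one session once the in-run retries are exhausted.

    error_kind=None stands for a successful extraction. The "persist" key is
    a hypothetical flag saying whether the checkpoint file would be rewritten.
    """
    if error_kind is None:
        # Success resets the counter so earlier failures stop counting.
        return {"error": "", "error_count": 0, "persist": True}
    if error_kind in TRANSIENT_ERROR_KINDS:
        # Transient: keep the previous count and leave the checkpoint alone,
        # so the next run retries from a clean state.
        return {"error": "transient", "error_count": cached_error_count, "persist": False}
    # Deterministic failure: count it, and poison the session at the limit.
    new_count = cached_error_count + 1
    state = "poisoned" if new_count >= MAX_POISON_ATTEMPTS else "failed"
    return {"error": state, "error_count": new_count, "persist": True}
```

Under these assumed values, a session whose checkpoint already records two deterministic failures becomes `poisoned` on the third, while any number of timeouts leaves its count unchanged.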
336
489
 
337
490
 
@@ -5,6 +5,16 @@
5
5
  #
6
6
  # Wraps any cron command to automatically record start/end/exit_code/summary.
7
7
  # Used by sync.py when generating LaunchAgents from manifest.json.
8
+ #
9
+ # Two-phase recording (start → end), plus signal handling:
10
+ # 1. INSERT cron_runs row at start with ended_at=NULL so the watchdog can
11
+ # distinguish "currently running" from "missed / stuck". Without this,
12
+ # any job that exceeds the next watchdog tick (interval_seconds=1800 by
13
+ # default) looks stale and the watchdog may kickstart -k over it — which
14
+ # is exactly the loop that broke deep-sleep between 2026-04-14 and 2026-04-17.
15
+ # 2. UPDATE the row at end with ended_at + exit_code + summary.
16
+ # 3. Trap SIGTERM / SIGINT so wrappers killed mid-flight still close their
17
+ # row (exit_code=143 or 130) instead of leaving it NULL forever.
8
18
 
9
19
  set -uo pipefail
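Note on the two-phase recording described in the header comment above: the idea can be tried in isolation against an in-memory SQLite database. This is a sketch under assumptions, not the wrapper's code. The `cron_runs` schema below is a guess that includes only the columns named in the wrapper's SQL, and the helper names `start_run`, `finish_run`, and `in_flight` are hypothetical.

```python
from __future__ import annotations

import sqlite3

# Assumed minimal schema: only the columns that appear in the wrapper's SQL.
SCHEMA = """
CREATE TABLE cron_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    cron_id TEXT NOT NULL,
    started_at TEXT NOT NULL,
    ended_at TEXT,
    exit_code INTEGER,
    summary TEXT,
    error TEXT,
    duration_secs REAL
)
"""


def start_run(conn: sqlite3.Connection, cron_id: str, started_at: str) -> int:
    """Phase 1: insert the row with ended_at NULL, meaning "running"."""
    cur = conn.execute(
        "INSERT INTO cron_runs (cron_id, started_at, ended_at) VALUES (?, ?, NULL)",
        (cron_id, started_at),
    )
    conn.commit()
    return cur.lastrowid


def finish_run(conn: sqlite3.Connection, row_id: int, ended_at: str,
               exit_code: int, summary: str = "", error: str = "",
               duration_secs: float = 0.0) -> None:
    """Phase 2: close the same row instead of inserting a second one."""
    conn.execute(
        "UPDATE cron_runs SET ended_at=?, exit_code=?, summary=?, error=?, "
        "duration_secs=? WHERE id=?",
        (ended_at, exit_code, summary, error, duration_secs, row_id),
    )
    conn.commit()


def in_flight(conn: sqlite3.Connection, cron_id: str) -> list[int]:
    """IDs of rows that have started but not yet ended."""
    rows = conn.execute(
        "SELECT id FROM cron_runs WHERE cron_id=? AND ended_at IS NULL ORDER BY id",
        (cron_id,),
    ).fetchall()
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
row_id = start_run(conn, "deep-sleep", "2026-04-17 03:00:00")
running_before = in_flight(conn, "deep-sleep")
finish_run(conn, row_id, "2026-04-17 03:40:00", 0, summary="ok", duration_secs=2400.0)
running_after = in_flight(conn, "deep-sleep")
```

Between the two calls the row is visible as in-flight, which is the state a watchdog needs in order to tell a long-running job from a missed one.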
10
20
 
@@ -33,84 +43,159 @@ print(datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))
33
43
  PY
34
44
  )
35
45
 
36
- # Run the actual command, capture output
46
+ # Phase 1: INSERT row at start (ended_at NULL = "running").
47
+ # ROW_ID stays empty on DB failure; finalize_row then INSERTs a fresh row, or spools the run if that fails too.
48
+ ROW_ID=""
49
+ ROW_ID=$(python3 - "$DB" "$CRON_ID" "$STARTED_AT" <<'PY' 2>/dev/null
50
+ from __future__ import annotations
51
+ import sqlite3
52
+ import sys
53
+ db_path, cron_id, started_at = sys.argv[1:]
54
+ conn = sqlite3.connect(db_path)
55
+ try:
56
+ cur = conn.execute(
57
+ "INSERT INTO cron_runs (cron_id, started_at, ended_at) VALUES (?, ?, NULL)",
58
+ (cron_id, started_at),
59
+ )
60
+ conn.commit()
61
+ print(cur.lastrowid)
62
+ finally:
63
+ conn.close()
64
+ PY
65
+ )
66
+
37
67
  OUTPUT_FILE=$(mktemp)
38
- trap 'rm -f "$OUTPUT_FILE"' EXIT
39
- "$@" > "$OUTPUT_FILE" 2>&1
40
- EXIT_CODE=$?
41
- ENDED_AT=$(python3 - <<'PY'
68
+ EXIT_CODE=0
69
+ SIGNAL_NAME=""
70
+
71
+ # finalize_row DB writer — also used by signal traps.
72
+ # Reads $EXIT_CODE / $SIGNAL_NAME / $OUTPUT_FILE from the outer scope.
73
+ finalize_row() {
74
+ local ended_at duration summary error
75
+ ended_at=$(python3 - <<'PY'
42
76
  from datetime import datetime, timezone
43
77
  print(datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))
44
78
  PY
45
79
  )
46
- DURATION_SECS=$(python3 - <<PY
80
+ duration=$(python3 - <<PY
47
81
  start = float("$START_EPOCH")
48
82
  import time
49
83
  print(round(time.time() - start, 1))
50
84
  PY
51
85
  )
86
+ summary=$(tail -5 "$OUTPUT_FILE" 2>/dev/null | grep -v "^$" | tail -1 | head -c 500)
87
+ error=""
88
+ if [ "$EXIT_CODE" -ne 0 ]; then
89
+ if [ -n "$SIGNAL_NAME" ]; then
90
+ error="Killed by $SIGNAL_NAME (exit $EXIT_CODE)"
91
+ else
92
+ error=$(grep -i "error\|exception\|fail\|traceback" "$OUTPUT_FILE" 2>/dev/null | tail -1 | head -c 500)
93
+ fi
94
+ fi
52
95
 
53
- # Extract summary (last meaningful line, max 500 chars)
54
- SUMMARY=$(tail -5 "$OUTPUT_FILE" | grep -v "^$" | tail -1 | head -c 500)
55
-
56
- # Extract error if failed
57
- ERROR=""
58
- if [ $EXIT_CODE -ne 0 ]; then
59
- ERROR=$(grep -i "error\|exception\|fail\|traceback" "$OUTPUT_FILE" | tail -1 | head -c 500)
60
- fi
61
-
62
- if ! python3 - "$DB" "$CRON_ID" "$STARTED_AT" "$ENDED_AT" "$EXIT_CODE" "$SUMMARY" "$ERROR" "$DURATION_SECS" <<'PY'
96
+ # Update the row we inserted at start — or INSERT fresh if the start write failed.
97
+ if ! python3 - "$DB" "$ROW_ID" "$CRON_ID" "$STARTED_AT" "$ended_at" "$EXIT_CODE" "$summary" "$error" "$duration" <<'PY' 2>/dev/null
63
98
  from __future__ import annotations
64
-
65
99
  import sqlite3
66
100
  import sys
67
-
68
- db_path, cron_id, started_at, ended_at, exit_code, summary, error, duration_secs = sys.argv[1:]
101
+ db_path, row_id, cron_id, started_at, ended_at, exit_code, summary, error, duration_secs = sys.argv[1:]
69
102
  conn = sqlite3.connect(db_path)
70
103
  try:
71
- conn.execute(
72
- """
73
- INSERT INTO cron_runs (
74
- cron_id, started_at, ended_at, exit_code, summary, error, duration_secs
75
- ) VALUES (?, ?, ?, ?, ?, ?, ?)
76
- """,
77
- (
78
- cron_id,
79
- started_at,
80
- ended_at,
81
- int(exit_code),
82
- summary,
83
- error,
84
- float(duration_secs),
85
- ),
86
- )
104
+ if row_id:
105
+ conn.execute(
106
+ """
107
+ UPDATE cron_runs
108
+ SET ended_at=?, exit_code=?, summary=?, error=?, duration_secs=?
109
+ WHERE id=?
110
+ """,
111
+ (ended_at, int(exit_code), summary, error, float(duration_secs), int(row_id)),
112
+ )
113
+ else:
114
+ conn.execute(
115
+ """
116
+ INSERT INTO cron_runs (cron_id, started_at, ended_at, exit_code, summary, error, duration_secs)
117
+ VALUES (?, ?, ?, ?, ?, ?, ?)
118
+ """,
119
+ (cron_id, started_at, ended_at, int(exit_code), summary, error, float(duration_secs)),
120
+ )
87
121
  conn.commit()
88
122
  finally:
89
123
  conn.close()
90
124
  PY
91
- then
92
- mkdir -p "$SPOOL_DIR"
93
- SPOOL_FILE="$SPOOL_DIR/${CRON_ID}-$(date +%Y%m%d-%H%M%S)-$$.json"
94
- python3 - "$SPOOL_FILE" "$CRON_ID" "$STARTED_AT" "$ENDED_AT" "$EXIT_CODE" "$SUMMARY" "$ERROR" "$DURATION_SECS" <<'PY'
125
+ then
126
+ mkdir -p "$SPOOL_DIR"
127
+ local spool_file="$SPOOL_DIR/${CRON_ID}-$(date +%Y%m%d-%H%M%S)-$$.json"
128
+ python3 - "$spool_file" "$CRON_ID" "$STARTED_AT" "$ended_at" "$EXIT_CODE" "$summary" "$error" "$duration" <<'PY'
95
129
  from __future__ import annotations
96
-
97
130
  import json
98
131
  import sys
99
132
  from pathlib import Path
100
-
101
133
  spool_file, cron_id, started_at, ended_at, exit_code, summary, error, duration_secs = sys.argv[1:]
102
- payload = {
103
- "cron_id": cron_id,
104
- "started_at": started_at,
105
- "ended_at": ended_at,
106
- "exit_code": int(exit_code),
107
- "summary": summary,
108
- "error": error,
109
- "duration_secs": float(duration_secs),
110
- }
111
- Path(spool_file).write_text(json.dumps(payload, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
134
+ Path(spool_file).write_text(
135
+ json.dumps({
136
+ "cron_id": cron_id,
137
+ "started_at": started_at,
138
+ "ended_at": ended_at,
139
+ "exit_code": int(exit_code),
140
+ "summary": summary,
141
+ "error": error,
142
+ "duration_secs": float(duration_secs),
143
+ }, indent=2, ensure_ascii=False) + "\n",
144
+ encoding="utf-8",
145
+ )
112
146
  PY
113
- echo "[nexo-cron-wrapper] DB write failed; spooled run to $SPOOL_FILE" >&2
114
- fi
147
+ echo "[nexo-cron-wrapper] DB write failed; spooled run to $spool_file" >&2
148
+ fi
149
+ }
150
+
151
+ cleanup() {
152
+ rm -f "$OUTPUT_FILE"
153
+ }
154
+
155
+ CHILD_PID=""
156
+
157
+ on_signal() {
158
+ local sig="$1"
159
+ local code="$2"
160
+ SIGNAL_NAME="$sig"
161
+ EXIT_CODE="$code"
162
+ # Forward the signal to the child. Bash traps run AFTER the foreground
163
+ # command completes, which is why we launch the command in background
164
+ # and wait on its PID — otherwise a SIGTERM to the wrapper would be
165
+ # delivered only when the child finishes naturally, defeating the
166
+ # purpose of closing the cron_runs row on kill.
167
+ if [ -n "$CHILD_PID" ] && kill -0 "$CHILD_PID" 2>/dev/null; then
168
+ kill -TERM "$CHILD_PID" 2>/dev/null
169
+ # Brief grace period before escalating to SIGKILL so the child gets
170
+ # a chance to clean up on its own.
171
+ local waited=0
172
+ while [ $waited -lt 5 ] && kill -0 "$CHILD_PID" 2>/dev/null; do
173
+ sleep 1
174
+ waited=$((waited + 1))
175
+ done
176
+ kill -KILL "$CHILD_PID" 2>/dev/null
177
+ fi
178
+ finalize_row
179
+ cleanup
180
+ exit "$code"
181
+ }
182
+
183
+ trap cleanup EXIT
184
+ trap 'on_signal SIGTERM 143' TERM
185
+ trap 'on_signal SIGINT 130' INT
186
+ trap 'on_signal SIGHUP 129' HUP
187
+
188
+ "$@" > "$OUTPUT_FILE" 2>&1 &
189
+ CHILD_PID=$!
190
+
191
+ # `wait` is interruptible by signals — when the trap fires, wait returns
192
+ # immediately and on_signal() takes over. When the child finishes
193
+ # normally, wait yields its exit code and we fall through to finalize_row
194
+ # for the happy path.
195
+ wait "$CHILD_PID"
196
+ EXIT_CODE=$?
197
+ CHILD_PID=""
198
+
199
+ finalize_row
115
200
 
116
- exit $EXIT_CODE
201
+ exit "$EXIT_CODE"
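Note on how a reader of `cron_runs` can use the rows this wrapper writes: with `ended_at` left NULL while a job runs, the watchdog hunk that follows distinguishes in-flight rows from finished ones. The sketch below restates those branches as a pure function. It is a simplification under assumptions: the function name `classify_run` and the `RECOVER` label are hypothetical (`RECOVER` stands for "attempt catchup or re-execute; final status is HEALED or FAIL"), the status is assumed to start at `PASS`, and a monitor without a `proc_grep` pattern is modelled as `process_alive=True`, since the process cannot be proven dead.

```python
from __future__ import annotations


def classify_run(age: int, max_stale: int, ended: bool,
                 exit_code: int | None, process_alive: bool) -> str:
    """Map the latest cron_runs row to a watchdog outcome.

    age and max_stale are in seconds; ended is False while ended_at is empty.
    """
    very_stale = age > max_stale * 3
    if not ended:
        if very_stale and not process_alive:
            return "RECOVER"  # zombie row: the job died without closing its row
        if very_stale:
            return "WARN"     # long-running but alive: report it, never kill it
        return "PASS"         # ordinary in-flight run
    if very_stale:
        return "RECOVER"      # missed window: catchup or re-execute
    if exit_code not in (None, 0):
        return "FAIL"         # last run finished with an error
    if age > max_stale:
        return "WARN"         # slightly stale
    return "PASS"
```

For example, with `max_stale=1800`, an in-flight row that is 6000 seconds old yields `WARN` while the process is alive and `RECOVER` once it is gone.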
@@ -594,40 +594,72 @@ for monitor in "${MONITORS[@]}"; do
594
594
  run_info=$(cron_last_run_info "$cron_id" || true)
595
595
  if [ -n "$run_info" ]; then
596
596
  latest_run_has_record=true
597
- IFS='|' read -r age _ _ last_exit last_error last_summary <<< "$run_info"
597
+ IFS='|' read -r age _ last_ended last_exit last_error last_summary <<< "$run_info"
598
598
  age="${age:-999999}"
599
599
  stale_age=$(format_age "$age")
600
- if [ -n "$last_exit" ] && [ "$last_exit" != "0" ]; then
601
- latest_run_failed=true
602
- status="FAIL"
603
- details="${details}Last run exited ${last_exit}. "
604
- [ -n "$last_error" ] && details="${details}Error: ${last_error}. "
605
- fi
606
- if [ "$age" -gt $(( max_stale * 3 )) ]; then
607
- if [ "$recovery_policy" = "catchup" ]; then
608
- if try_request_catchup; then
609
- status="HEALED"
610
- details="${details}Self-healed: requested catchup for missed window (last run: $stale_age). "
611
- TOTAL_HEALED=$((TOTAL_HEALED + 1))
600
+
601
+ # In-flight detection: started_at present but ended_at empty means the
602
+ # wrapper is still running. Never kickstart -k over an in-flight row —
603
+ # that was the loop that broke deep-sleep between 2026-04-14 and
604
+ # 2026-04-17, when the watchdog kept killing the worker that was
605
+ # actually doing the job. Only intervene if the process is provably
606
+ # dead (zombie row) AND the run has exceeded 3x max_stale.
607
+ if [ -z "$last_ended" ]; then
608
+ if [ "$age" -gt $(( max_stale * 3 )) ] && [ -n "$proc_grep" ] && ! process_running "$proc_grep"; then
609
+ status="FAIL"
610
+ details="${details}In-flight for ${stale_age} but process '$proc_grep' is dead; stale row. "
611
+ if [ "$recovery_policy" = "catchup" ]; then
612
+ if try_request_catchup; then
613
+ status="HEALED"
614
+ details="${details}Self-healed: requested catchup for crashed in-flight run. "
615
+ TOTAL_HEALED=$((TOTAL_HEALED + 1))
616
+ fi
612
617
  else
613
- status="FAIL"
614
- details="${details}cron_runs stale: $stale_age (limit: $(format_age "$max_stale")). Catchup request failed. "
618
+ if try_reexecute_missed_cron "$plist_id"; then
619
+ status="HEALED"
620
+ details="${details}Self-healed: re-executed crashed in-flight run. "
621
+ TOTAL_HEALED=$((TOTAL_HEALED + 1))
622
+ fi
615
623
  fi
624
+ elif [ "$age" -gt $(( max_stale * 3 )) ]; then
625
+ [ "$status" = "PASS" ] && status="WARN"
626
+ details="${details}In-flight for ${stale_age} (long-running, process alive). "
616
627
  else
617
- if try_reexecute_missed_cron "$plist_id"; then
618
- status="HEALED"
619
- details="${details}Self-healed: re-executed missed cron (last run: $stale_age). "
620
- TOTAL_HEALED=$((TOTAL_HEALED + 1))
628
+ details="${details}In-flight (started ${stale_age}). "
629
+ fi
630
+ else
631
+ if [ -n "$last_exit" ] && [ "$last_exit" != "0" ]; then
632
+ latest_run_failed=true
633
+ status="FAIL"
634
+ details="${details}Last run exited ${last_exit}. "
635
+ [ -n "$last_error" ] && details="${details}Error: ${last_error}. "
636
+ fi
637
+ if [ "$age" -gt $(( max_stale * 3 )) ]; then
638
+ if [ "$recovery_policy" = "catchup" ]; then
639
+ if try_request_catchup; then
640
+ status="HEALED"
641
+ details="${details}Self-healed: requested catchup for missed window (last run: $stale_age). "
642
+ TOTAL_HEALED=$((TOTAL_HEALED + 1))
643
+ else
644
+ status="FAIL"
645
+ details="${details}cron_runs stale: $stale_age (limit: $(format_age "$max_stale")). Catchup request failed. "
646
+ fi
621
647
  else
622
- status="FAIL"
623
- details="${details}cron_runs stale: $stale_age (limit: $(format_age "$max_stale")). Re-execute failed. "
648
+ if try_reexecute_missed_cron "$plist_id"; then
649
+ status="HEALED"
650
+ details="${details}Self-healed: re-executed missed cron (last run: $stale_age). "
651
+ TOTAL_HEALED=$((TOTAL_HEALED + 1))
652
+ else
653
+ status="FAIL"
654
+ details="${details}cron_runs stale: $stale_age (limit: $(format_age "$max_stale")). Re-execute failed. "
655
+ fi
624
656
  fi
657
+ elif [ "$age" -gt "$max_stale" ]; then
658
+ [ "$status" = "PASS" ] && status="WARN"
659
+ details="${details}cron_runs slightly stale: $stale_age. "
660
+ elif [ -z "$details" ] && [ -n "$last_summary" ]; then
661
+ details="${details}Last run summary: ${last_summary}. "
625
662
  fi
626
- elif [ "$age" -gt "$max_stale" ]; then
627
- [ "$status" = "PASS" ] && status="WARN"
628
- details="${details}cron_runs slightly stale: $stale_age. "
629
- elif [ -z "$details" ] && [ -n "$last_summary" ]; then
630
- details="${details}Last run summary: ${last_summary}. "
631
663
  fi
632
664
  else
633
665
  stale_age="no cron_runs entry"