nexo-brain 5.8.0 → 5.8.2
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +5 -1
- package/package.json +1 -1
- package/src/auto_update.py +139 -0
- package/src/db/_classification.py +37 -115
- package/src/db/_reminders.py +13 -13
- package/src/db/_schema.py +9 -40
- package/src/scripts/deep-sleep/extract.py +198 -45
- package/src/scripts/nexo-cron-wrapper.sh +139 -54
- package/src/scripts/nexo-watchdog.sh +58 -26
package/.claude-plugin/plugin.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "nexo-brain",
-  "version": "5.8.
+  "version": "5.8.2",
   "description": "Local cognitive runtime for Claude Code \u2014 persistent memory, overnight learning, doctor diagnostics, personal scripts, recovery-aware jobs, startup preflight, and optional dashboard/power helper.",
   "author": {
     "name": "NEXO Brain",
package/README.md
CHANGED
@@ -18,7 +18,11 @@
 
 [Watch the overview video](https://nexo-brain.com/watch/) · [Watch on YouTube](https://www.youtube.com/watch?v=i2lkGhKyVqI) · [Open the infographic](https://nexo-brain.com/assets/nexo-brain-infographic-v5.png)
 
-Version `5.8.
+Version `5.8.2` is the current packaged-runtime line: the Brain core no longer auto-classifies `followups` and `reminders` on behalf of agents. v5.8.0's `classify_task()` heuristic (NEXO-specific ID prefixes `NF-PROTOCOL-*` / `NF-DS-*` / `NF-AUDIT-*`, Spanish user-verbs `debes` / `revisar` / `firmar`, agent keywords `monitor` / `auditoría diaria` / `checkpoint`) was fine for NEXO's own DB but bled convention into every third-party agent plugged into the shared Brain. The core now persists `internal=0` and `owner=NULL` when the caller omits them, and clients that want automatic classification (NEXO Desktop does, via its `_legacyClassifyOwner` helpers) compute it themselves and pass the result. Migration #40 keeps the columns + indexes; rows already backfilled by v5.8.0 keep their values. `normalise_owner` still explicitly rejects the string `"nexo"` so legacy hardcoding cannot sneak back in.
+
+Previously in `5.8.1`: closes a self-reinforcing `launchctl kickstart -k` loop in the watchdog that wedged deep-sleep Phase 2 between 2026-04-14 and 2026-04-17. The cron wrapper now INSERTs an in-flight row (`ended_at=NULL`) at start and traps SIGTERM/INT/HUP to close it with `exit_code=143` instead of vanishing from `cron_runs`. The watchdog interprets in-flight rows as "currently running" and only re-executes after verifying the worker process is dead. `extract.py` classifies CLI failures into transient (`overloaded_error`, rate-limit, timeout, signal — retried next run) and deterministic (skipped after `MAX_POISON_ATTEMPTS`), and passes a slim shared-context (200 head lines + metadata) instead of the full 400+ KB dump. A new `auto_update._heal_deep_sleep_runtime()` repairs existing installs silently on the next `nexo update`: poisoned checkpoints, stale locks, dangling `cron_runs` rows, and bloated `.watchdog-fails` counters.
+
+Previously in `5.8.0`: first-class `internal` and `owner` columns on `followups` and `reminders`. Migration #40 adds both fields with an idempotent one-shot backfill, so the "who does this task belong to?" classification moves from client-side regex (Desktop) to persistent storage every MCP client shares. Taxonomy is intentionally generic — `owner in {user, waiting, agent, shared}` — so third-party agents plugging into the shared Brain can render whatever assistant label they carry without inheriting NEXO branding. `nexo_reminder_create`, `nexo_reminder_update`, `nexo_followup_create`, and `nexo_followup_update` gain optional `internal` and `owner` parameters that win over the default heuristic.
 
 Previously in `5.7.0`: `nexo update` now keeps Claude Code and Codex CLIs in lockstep with NEXO Brain itself. When the global `@anthropic-ai/claude-code` or `@openai/codex` packages are installed, the updater checks the npm registry and runs `npm install -g <pkg>@latest` in-line — so the terminal boot model stays aligned with the settings NEXO already wrote to `~/.claude/settings.json`. Packages the operator never installed are skipped silently. Pass `nexo update --no-clis` to keep the terminal CLIs pinned.
 
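The `5.8.1` README entry describes the wrapper writing an in-flight row at start and closing it when the job ends or is terminated. The real wrapper is a shell script; the following is only a minimal Python sketch of the same pattern. The `cron_id` column, the `track_run` and `is_in_flight` helpers, and the `Terminated` exception (standing in for a trapped SIGTERM/INT/HUP) are assumptions made for illustration, not names from the package.

```python
import sqlite3
import time
from contextlib import contextmanager

# Simplified stand-in for the cron_runs table (cron_id is a hypothetical column).
SCHEMA = """
CREATE TABLE IF NOT EXISTS cron_runs (
    id INTEGER PRIMARY KEY,
    cron_id TEXT NOT NULL,
    started_at REAL NOT NULL,
    ended_at REAL,          -- NULL while the job is in flight
    exit_code INTEGER
)
"""


class Terminated(Exception):
    """Stand-in for a termination signal delivered to the wrapper."""


@contextmanager
def track_run(conn, cron_id):
    # Write the in-flight row before the job starts.
    cur = conn.execute(
        "INSERT INTO cron_runs (cron_id, started_at) VALUES (?, ?)",
        (cron_id, time.time()),
    )
    conn.commit()
    run_id = cur.lastrowid
    exit_code = 0
    try:
        yield run_id
    except Terminated:
        exit_code = 143  # conventional exit code for SIGTERM (128 + 15)
        raise
    except Exception:
        exit_code = 1
        raise
    finally:
        # Always close the row, so it never stays open after the job is gone.
        conn.execute(
            "UPDATE cron_runs SET ended_at = ?, exit_code = ? WHERE id = ?",
            (time.time(), exit_code, run_id),
        )
        conn.commit()


def is_in_flight(conn, cron_id):
    # A watchdog-style check: an open row means "currently running".
    row = conn.execute(
        "SELECT 1 FROM cron_runs WHERE cron_id = ? AND ended_at IS NULL LIMIT 1",
        (cron_id,),
    ).fetchone()
    return row is not None
```

A watchdog built on this idea would still need to confirm that the worker process is actually dead before re-running a job whose row is open, as the README entry notes.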
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "nexo-brain",
-  "version": "5.8.
+  "version": "5.8.2",
   "mcpName": "io.github.wazionapps/nexo",
   "description": "NEXO Brain \u2014 Shared brain for AI agents. Persistent memory, semantic RAG, natural forgetting, metacognitive guard, trust scoring, 150+ MCP tools. Works with Claude Code, Codex, Claude Desktop & any MCP client. 100% local, free.",
   "homepage": "https://nexo-brain.com",
package/src/auto_update.py
CHANGED
@@ -875,6 +875,135 @@ def _purge_zero_byte_db_files() -> list[Path]:
     return removed
 
 
+def _heal_deep_sleep_runtime(dest: Path = NEXO_HOME) -> list[str]:
+    """Repair deep-sleep state that older runtimes left in a bad shape.
+
+    Runs on every ``auto_update`` post-sync. The bug it fixes: between
+    Brain 5.6.1 and 5.8.0 the cron wrapper only wrote to ``cron_runs`` at
+    end, so any wrapper killed by signal produced no row. The watchdog then
+    saw the cron as "missing cron_runs entry" and kickstart-k'd the live
+    worker — an infinite loop that wedged deep-sleep Phase 2 on the first
+    session of every batch. 5.8.1 fixes the loop at the source (wrapper
+    start-row + watchdog in-flight detection) but older runtimes that have
+    already been running the buggy loop need their residue cleaned up.
+
+    Returns the list of actions performed, for logging. Failures are
+    swallowed: this is best-effort healing, it must never block an update.
+    """
+    import sqlite3
+    import time as _time
+
+    actions: list[str] = []
+
+    deep_sleep_dir = dest / "operations" / "deep-sleep"
+    coord_dir = dest / "coordination"
+    data_db = dest / "data" / "nexo.db"
+    now = _time.time()
+
+    # (1) Drop poisoned checkpoints: the first retry that hit Anthropic's
+    # overloaded_error got cached as a permanent failure. Older
+    # extract.py re-used that checkpoint forever. New extract.py treats
+    # transient errors as retryable, but old poisoned checkpoints still
+    # claim 0 findings — purge them so the next deep-sleep retries cleanly.
+    if deep_sleep_dir.is_dir():
+        poisoned = 0
+        for checkpoint_dir in deep_sleep_dir.glob("*/checkpoints"):
+            if not checkpoint_dir.is_dir():
+                continue
+            for entry in checkpoint_dir.glob("*.json"):
+                try:
+                    content = entry.read_text()
+                except OSError:
+                    continue
+                if "overloaded_error" in content or '"error":{"type":"' in content:
+                    try:
+                        entry.unlink()
+                        poisoned += 1
+                    except OSError:
+                        pass
+        if poisoned:
+            actions.append(f"checkpoints-purged:{poisoned}")
+
+        # Drop debug-extract-*.txt scratch files older than 7 days.
+        stale_debug = 0
+        for entry in deep_sleep_dir.glob("debug-extract-*.txt"):
+            try:
+                if now - entry.stat().st_mtime > 7 * 86400:
+                    entry.unlink()
+                    stale_debug += 1
+            except OSError:
+                continue
+        if stale_debug:
+            actions.append(f"debug-scratch-purged:{stale_debug}")
+
+    # (2) Release stale deep-sleep locks so the next 04:30 run can acquire
+    # them. Locks older than 6h are always stale — a real run finishes
+    # in well under an hour.
+    lock_names = ("sleep.lock", "sleep-process.lock", "synthesis.lock")
+    released = 0
+    if coord_dir.is_dir():
+        for name in lock_names:
+            lock_path = coord_dir / name
+            if not lock_path.exists():
+                continue
+            try:
+                age = now - lock_path.stat().st_mtime
+            except OSError:
+                continue
+            if age > 6 * 3600:
+                try:
+                    lock_path.unlink()
+                    released += 1
+                except OSError:
+                    pass
+    if released:
+        actions.append(f"stale-locks-released:{released}")
+
+    # (3) Close dangling cron_runs rows. Any row with ended_at IS NULL older
+    # than 6h is either a process killed by the old watchdog loop or a
+    # zombie left behind by a previous bad install. Close them with
+    # exit_code=143 + summary so the NEW watchdog treats the cron as
+    # "finished with error" rather than "in-flight forever".
+    if data_db.is_file():
+        try:
+            conn = sqlite3.connect(str(data_db), timeout=5)
+            try:
+                cur = conn.execute(
+                    """
+                    UPDATE cron_runs
+                    SET ended_at = datetime('now'),
+                        exit_code = 143,
+                        error = 'healed by auto_update (pre-5.8.1 wrapper left row open)',
+                        duration_secs = CAST(
+                            strftime('%s','now') - strftime('%s', started_at) AS REAL
+                        )
+                    WHERE ended_at IS NULL
+                      AND strftime('%s','now') - strftime('%s', started_at) > 6 * 3600
+                    """
+                )
+                closed = cur.rowcount or 0
+                conn.commit()
+                if closed:
+                    actions.append(f"cron_runs-closed-dangling:{closed}")
+            finally:
+                conn.close()
+        except Exception as exc:
+            actions.append(f"cron_runs-heal-warning:{exc.__class__.__name__}")
+
+    # (4) Remove .watchdog-fails registry entries older than 24h — the new
+    # in-flight detection makes stale counters obsolete.
+    fails_file = dest / "scripts" / ".watchdog-fails"
+    if fails_file.exists():
+        try:
+            if now - fails_file.stat().st_mtime > 24 * 3600:
+                fails_file.unlink()
+                actions.append("watchdog-fails-reset")
+        except OSError:
+            pass
+
+    return actions
+
+
 def _backup_dbs() -> str | None:
     """Snapshot all .db files before migration. Returns backup dir or None."""
     import sqlite3
@@ -2558,6 +2687,16 @@ def _run_runtime_post_sync(dest: Path = NEXO_HOME, progress_fn=None) -> tuple[bo
     except Exception as e:
         actions.append(f"client-sync-warning:{e}")
 
+    # Heal deep-sleep residue from older buggy runtimes. Idempotent + safe:
+    # no-op if the runtime is already clean.
+    try:
+        _emit_progress(progress_fn, "Healing deep-sleep runtime state...")
+        heal_actions = _heal_deep_sleep_runtime(dest)
+        for action in heal_actions:
+            actions.append(f"deep-sleep-heal:{action}")
+    except Exception as exc:
+        actions.append(f"deep-sleep-heal-warning:{exc.__class__.__name__}")
+
     _emit_progress(progress_fn, "Verifying runtime imports...")
     verify = subprocess.run(
         [sys.executable, "-c", "import server"],
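Step (3) of `_heal_deep_sleep_runtime` closes `cron_runs` rows that have been open for more than six hours. The effect of that `UPDATE` can be checked against an in-memory database. The sketch below is a hedged illustration: the table definition is an assumption reduced to the columns the statement touches, and `demo_heal` is a hypothetical helper, not part of the package.

```python
import sqlite3

# Same shape as the healing statement in the diff; only the error text is shortened.
HEAL_SQL = """
UPDATE cron_runs
SET ended_at = datetime('now'),
    exit_code = 143,
    error = 'healed (row left open)',
    duration_secs = CAST(
        strftime('%s','now') - strftime('%s', started_at) AS REAL
    )
WHERE ended_at IS NULL
  AND strftime('%s','now') - strftime('%s', started_at) > 6 * 3600
"""


def demo_heal():
    conn = sqlite3.connect(":memory:")
    # Assumed minimal schema: only the columns the UPDATE reads or writes.
    conn.execute(
        "CREATE TABLE cron_runs ("
        "id INTEGER PRIMARY KEY, started_at TEXT, ended_at TEXT, "
        "exit_code INTEGER, error TEXT, duration_secs REAL)"
    )
    # One row left open seven hours ago, one started ten minutes ago.
    conn.execute("INSERT INTO cron_runs (started_at) VALUES (datetime('now', '-7 hours'))")
    conn.execute("INSERT INTO cron_runs (started_at) VALUES (datetime('now', '-10 minutes'))")
    closed = conn.execute(HEAL_SQL).rowcount
    rows = conn.execute(
        "SELECT id, ended_at IS NULL, exit_code FROM cron_runs ORDER BY id"
    ).fetchall()
    conn.close()
    return closed, rows
```

Only the seven-hour-old row is closed, and the recent row stays in flight. Running the statement a second time matches nothing, which is why the caller can describe the heal as idempotent.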
package/src/db/_classification.py
CHANGED
@@ -1,132 +1,54 @@
-"""NEXO DB — Task classification
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+"""NEXO DB — Task classification storage (internal + owner).
+
+Migration #40 added ``internal`` and ``owner`` columns to ``followups`` and
+``reminders``. Agents creating or updating tasks pass these two fields
+explicitly via the MCP tools (``nexo_followup_create``, ``nexo_reminder_create``
+and their ``_update`` counterparts).
+
+The Brain core does **not** classify tasks on behalf of agents. Up to and
+including v5.8.1 the core shipped a Spanish-first regex heuristic
+(``NF-PROTOCOL-*`` / ``NF-DS-*`` prefixes, user verbs like ``debes``,
+``revisar``, etc.) as a fallback for callers that left the fields blank.
+That fallback bled NEXO-specific naming conventions into every deployment
+of the shared Brain — third-party agents plugged into the same DB would
+inherit classifications they never asked for. v5.8.2 removes it.
+
+The module now exposes only:
+
+    VALID_OWNERS — the canonical set {user, waiting, agent, shared}.
+    normalise_owner — clamps an agent-supplied string to VALID_OWNERS
+        (or ``None`` for empty / invalid input so the
+        caller can decide whether to persist ``NULL``).
+    normalise_internal — coerces truthy / boolean / numeric agent input
+        into ``0`` / ``1`` (or ``None`` for empty input).
+
+owner values:
+    'user' — the user has to act.
+    'waiting' — blocked on an external response.
     'agent' — the AI agent handles it autonomously. Intentionally
-        named
-
-
-
-
-
-
-
-
-can override both fields explicitly. If they leave them blank, the
-Brain applies the heuristic below so a vanilla agent keeps sensible
-behaviour out of the box.
+        named ``agent`` (not ``nexo``) so deployments render
+        whatever assistant label fits client-side.
+    'shared' — collaborative follow-up.
+    NULL — unclassified; clients are free to apply whatever
+        fallback they want at render time.
+
+Clients that want automatic classification (NEXO Desktop does, via its
+``_legacyClassifyOwner`` / ``_legacyIsInternalTaskId`` helpers) compute
+``owner``/``internal`` themselves and pass them to the create/update call.
 """
 
 from __future__ import annotations
 
-import re
-
-# Task-ID prefixes historically owned by NEXO's own automation. They are
-# kept as a default heuristic because they match the existing corpus of
-# 468+ followups and 40+ reminders. Any agent not following this naming
-# convention will simply not match these patterns and its tasks will
-# stay visible (internal=0) unless the agent sets internal=1 explicitly
-# on create — which is exactly what we want for a pluralistic ecosystem.
-_INTERNAL_ID_PATTERNS = [
-    re.compile(r"^NF-PROTOCOL[-_]", re.IGNORECASE),
-    re.compile(r"^NF-DS[-_]", re.IGNORECASE),
-    re.compile(r"^NF-AUDIT[-_]", re.IGNORECASE),
-    re.compile(r"^NF-OPPORTUNITY[-_]", re.IGNORECASE),
-    re.compile(r"^NF-RETRO[-_]", re.IGNORECASE),
-    re.compile(r"^R-RELEASE[-_]", re.IGNORECASE),
-    re.compile(r"^R-FU-NF-PROTOCOL[-_]", re.IGNORECASE),
-    re.compile(r"^R-FU-NF-DS[-_]", re.IGNORECASE),
-    re.compile(r"^R-FU-NF-AUDIT[-_]", re.IGNORECASE),
-]
-
-# Spanish user-action verbs. The heuristic is Spanish-first because the
-# existing corpus is Spanish, but since every agent can override `owner`
-# explicitly on create, deployments in other languages are not blocked.
-_USER_VERB_RX = re.compile(
-    r"\b(francisco debe|debes|llamar|responder|revisar|validar|confirmar|"
-    r"decidir|aprobar|firmar|enviar email|mandar email|contestar|"
-    r"reuni[óo]n|reservar|comprar)\b",
-    re.IGNORECASE,
-)
-
-_WAITING_RX = re.compile(
-    r"\b(esperando|esperar|bloqueo|bloqueado|pendiente respuesta|"
-    r"pendiente de|en espera)\b",
-    re.IGNORECASE,
-)
-
-_AGENT_RX = re.compile(
-    r"\b(monitoreo|monitorizar|monitor|auditor[íi]a diaria|"
-    r"promoci[óo]n diaria|seguir|seguimiento 24|72h|checkpoint|runner|cron)\b",
-    re.IGNORECASE,
-)
 
 VALID_OWNERS = {"user", "waiting", "agent", "shared"}
 
 
-def is_internal_id(task_id: str | None) -> bool:
-    """Return True when the ID matches a known agent-internal prefix."""
-    tid = (task_id or "").strip()
-    if not tid:
-        return False
-    return any(pat.search(tid) for pat in _INTERNAL_ID_PATTERNS)
-
-
-def classify_owner(
-    task_id: str | None,
-    description: str | None,
-    category: str | None = None,
-    recurrence: str | None = None,
-) -> str:
-    """Classify ownership into one of VALID_OWNERS using the legacy rules."""
-    tid = (task_id or "").strip()
-    desc = (description or "").strip()
-    cat = (category or "").strip().lower()
-    rec = (recurrence or "").strip()
-
-    if cat == "waiting" or _WAITING_RX.search(desc):
-        return "waiting"
-    if _USER_VERB_RX.search(desc) or tid.lower().startswith("nf-protocol-"):
-        return "user"
-    if rec or _AGENT_RX.search(desc):
-        return "agent"
-    return "shared"
-
-
-def classify_task(
-    task_id: str | None,
-    description: str | None,
-    category: str | None = None,
-    recurrence: str | None = None,
-) -> tuple[int, str]:
-    """Compute (internal, owner) pair for a task.
-
-    Returns integers for internal so the SQLite column (INTEGER DEFAULT 0)
-    and the JSON round-trip stay consistent. Clients can truthy-check either
-    int or bool safely.
-    """
-    internal = 1 if is_internal_id(task_id) else 0
-    owner = classify_owner(task_id, description, category, recurrence)
-    return internal, owner
-
-
 def normalise_owner(value: str | None) -> str | None:
     """Accept owner overrides from agents and clamp to VALID_OWNERS.
 
     Returns None for empty input (so the DB keeps NULL / pre-existing value)
     and coerces invalid strings to None rather than silently persisting
-    garbage.
+    garbage.
     """
     if value is None:
         return None
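The diff shows only the first lines of `normalise_owner` and none of `normalise_internal`, so their exact behaviour is not visible here. The following is a rough sketch of what the docstring describes, with made-up names (`normalise_owner_sketch`, `normalise_internal_sketch`). The whitespace trimming, the lower-casing, and the set of accepted "truthy" strings are all assumptions.

```python
VALID_OWNERS = {"user", "waiting", "agent", "shared"}


def normalise_owner_sketch(value):
    """Clamp an agent-supplied owner to VALID_OWNERS, else None."""
    if value is None:
        return None
    # Assumption: trim and lower-case before comparing.
    cleaned = str(value).strip().lower()
    return cleaned if cleaned in VALID_OWNERS else None


def normalise_internal_sketch(value):
    """Coerce boolean, numeric, or string input to 0/1; None for empty input."""
    if value is None:
        return None
    if isinstance(value, bool):
        return int(value)
    if isinstance(value, (int, float)):
        return 1 if value else 0
    text = str(value).strip().lower()
    if not text:
        return None
    # Assumption: these spellings count as true; anything else is 0.
    return 1 if text in {"1", "true", "yes", "on"} else 0
```

Under this reading, a legacy value such as `"nexo"` is not in the owner set and comes back as `None`, which is consistent with the README's note that the string is rejected.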
package/src/db/_reminders.py
CHANGED
@@ -8,7 +8,7 @@ import sqlite3
 from typing import Any
 
 from db._core import get_db, now_epoch
-from db._classification import
+from db._classification import normalise_internal, normalise_owner
 from db._fts import fts_upsert
 from db._hot_context import capture_context_event
 
@@ -256,18 +256,18 @@ def create_reminder(
     """Create a new reminder.
 
     Agents may pass `internal` (0/1, bool, or string) and `owner`
-    ('user'|'waiting'|'agent'|'shared')
-
-
+    ('user'|'waiting'|'agent'|'shared'). When omitted, the Brain persists
+    ``internal=0`` and ``owner=NULL`` — the Brain core does not classify
+    tasks on behalf of agents. Clients that want automatic classification
+    compute it themselves and pass the result.
     """
     conn = get_db()
     now = now_epoch()
 
-    auto_internal, auto_owner = classify_task(id, description, category, None)
     internal_value = normalise_internal(internal)
     if internal_value is None:
-        internal_value =
-    owner_value = normalise_owner(owner)
+        internal_value = 0
+    owner_value = normalise_owner(owner)
 
     columns = {str(row["name"]) for row in conn.execute("PRAGMA table_info(reminders)").fetchall()}
     payload: dict[str, object] = {
@@ -615,9 +615,10 @@ def create_followup(
 ) -> dict:
     """Create a new followup with optional reasoning and recurrence.
 
-    Agents may
-
-
+    Agents may set `internal` and `owner` explicitly. Omitted values
+    persist as ``internal=0`` and ``owner=NULL`` — the Brain core does not
+    classify tasks on behalf of agents. Clients that want automatic
+    classification compute it themselves and pass the result.
     """
     conn = get_db()
     now = now_epoch()
@@ -630,11 +631,10 @@ def create_followup(
             f"(scores: {', '.join(str(s['_similarity']) for s in similar[:3])}). Consider updating instead."
         )
 
-    auto_internal, auto_owner = classify_task(id, description, None, recurrence)
     internal_value = normalise_internal(internal)
     if internal_value is None:
-        internal_value =
-    owner_value = normalise_owner(owner)
+        internal_value = 0
+    owner_value = normalise_owner(owner)
 
     columns = {str(row["name"]) for row in conn.execute("PRAGMA table_info(followups)").fetchall()}
     payload: dict[str, object] = {
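After this change, `create_reminder` and `create_followup` store `internal=0` and `owner=NULL` unless the caller supplies values, and any automatic classification happens on the client. The sketch below illustrates that division of labour. Both function names, the `BOT-` prefix rule, and the "waiting on" keyword are invented for the example and do not come from NEXO Desktop or the Brain core.

```python
VALID_OWNERS = {"user", "waiting", "agent", "shared"}


def resolve_stored_values(internal=None, owner=None):
    """Mimic the create-time defaults: 0 and None (NULL) when omitted."""
    internal_value = 0 if internal is None else (1 if internal else 0)
    owner_value = owner if owner in VALID_OWNERS else None
    return internal_value, owner_value


def client_side_classify(task_id, description):
    """Hypothetical client policy, computed before calling the create tool."""
    internal = task_id.upper().startswith("BOT-")
    owner = "waiting" if "waiting on" in description.lower() else "user"
    return {"internal": internal, "owner": owner}


# A caller that supplies nothing gets the neutral defaults.
neutral = resolve_stored_values()

# A client that wants classification computes it and passes the result.
classified = resolve_stored_values(
    **client_side_classify("BOT-42", "Waiting on vendor reply")
)
```

The point of the design is that the policy lives entirely in `client_side_classify`; a different client can swap in its own rules, or none, without the storage layer changing.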
package/src/db/_schema.py
CHANGED
@@ -939,18 +939,11 @@ def _m39_hook_runs(conn):
 def _m40_classification_columns(conn):
     """Add internal (INTEGER 0/1) and owner (TEXT) to followups and reminders.
 
-
-
-    on
-
-
-    see every task as "Seguimiento" (owner=shared fallback) or, worse, have
-    its real user-facing tasks hidden by the Desktop 'internal' filter.
-
-    Fix: make both attributes first-class columns agents can set on create.
-    Vanilla agents that omit them get the legacy heuristic (classify_task)
-    applied on insert and during this one-shot backfill, so existing rows
-    preserve their current Desktop rendering.
+    Agents creating tasks via nexo_followup_create / nexo_reminder_create
+    can set both fields explicitly. The Brain core does not classify tasks
+    on behalf of agents — clients that want automatic classification
+    compute it themselves (NEXO Desktop does, via its legacy client-side
+    helpers) and pass the result.
 
     Values:
       internal: 0 (external, visible) or 1 (agent bookkeeping, hidden).
@@ -960,8 +953,10 @@ def _m40_classification_columns(conn):
         'NEXO'.
 
     Idempotent: _migrate_add_column is a no-op when the column exists,
-    _migrate_add_index likewise.
-
+    _migrate_add_index likewise. Pre-v5.8.2 versions of this migration
+    also ran a one-shot backfill using a Spanish-first regex heuristic;
+    v5.8.2 removed that heuristic so the core stays neutral across
+    deployments. Rows that were already backfilled keep their values.
     """
     _migrate_add_column(conn, "followups", "internal", "INTEGER DEFAULT 0")
     _migrate_add_column(conn, "followups", "owner", "TEXT DEFAULT NULL")
@@ -972,32 +967,6 @@ def _m40_classification_columns(conn):
     _migrate_add_index(conn, "idx_reminders_internal", "reminders", "internal")
     _migrate_add_index(conn, "idx_reminders_owner", "reminders", "owner")
 
-    from db._classification import classify_task
-
-    rows = conn.execute(
-        "SELECT id, description, recurrence FROM followups WHERE owner IS NULL"
-    ).fetchall()
-    for row in rows:
-        internal, owner = classify_task(
-            row["id"], row["description"], None, row["recurrence"]
-        )
-        conn.execute(
-            "UPDATE followups SET internal = ?, owner = ? WHERE id = ?",
-            (internal, owner, row["id"]),
-        )
-
-    rows = conn.execute(
-        "SELECT id, description, category FROM reminders WHERE owner IS NULL"
-    ).fetchall()
-    for row in rows:
-        internal, owner = classify_task(
-            row["id"], row["description"], row["category"], None
-        )
-        conn.execute(
-            "UPDATE reminders SET internal = ?, owner = ? WHERE id = ?",
-            (internal, owner, row["id"]),
-        )
-
 
 MIGRATIONS = [
     (1, "learnings_columns", _m1_learnings_columns),
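The migration docstring relies on `_migrate_add_column` being a no-op when the column already exists, but the helper's body is not part of this diff. One common way to get that property in SQLite is to consult `PRAGMA table_info` first. The sketch below shows that approach under the hypothetical name `add_column_if_missing`; it is not the package's implementation.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl):
    """Add a column unless it already exists; return True when it was added."""
    # Row layout of PRAGMA table_info: (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    # Identifiers cannot be bound as parameters, so callers must pass trusted names.
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


def demo_migration():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE followups (id TEXT PRIMARY KEY, description TEXT)")
    conn.execute("INSERT INTO followups VALUES ('F-1', 'existing row')")
    first = add_column_if_missing(conn, "followups", "internal", "INTEGER DEFAULT 0")
    second = add_column_if_missing(conn, "followups", "internal", "INTEGER DEFAULT 0")
    add_column_if_missing(conn, "followups", "owner", "TEXT DEFAULT NULL")
    row = conn.execute("SELECT internal, owner FROM followups WHERE id = 'F-1'").fetchone()
    conn.close()
    return first, second, row
```

Rows that existed before the column was added read back the declared default, so an install that never ran the removed backfill ends up with `internal=0` and `owner=NULL` throughout.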
@@ -38,6 +38,56 @@ except Exception:
|
|
|
38
38
|
# still leaving enough headroom for legitimate long per-session extractions.
|
|
39
39
|
CLAUDE_TIMEOUT = AUTOMATION_SUBPROCESS_TIMEOUT
|
|
40
40
|
|
|
41
|
+
# Poison detection: a session checkpoint records the number of failed attempts
|
|
42
|
+
# across runs. Once it reaches this limit we stop trying to extract findings
|
|
43
|
+
# from that session — repeated failures on the same session (deterministic
|
|
44
|
+
# JSON parse errors, unreadable transcripts) only burn API credits and stall
|
|
45
|
+
# the whole deep-sleep cycle behind the poisoned session. The session is still
|
|
46
|
+
# kept in the output (with the error) so synthesize.py can account for it.
|
|
47
|
+
MAX_POISON_ATTEMPTS = 3
|
|
48
|
+
|
|
49
|
+
# Transient error types worth retrying on the next deep-sleep run instead of
|
|
50
|
+
# being counted as a poisoned attempt. `overloaded_error` comes from the
|
|
51
|
+
# Anthropic API when it is under load and is the cause of the stuck
|
|
52
|
+
# deep-sleep between 2026-04-14 and 2026-04-17 — the first attempt hit it,
|
|
53
|
+
# the checkpoint flagged it as permanent failure, and later runs kept
|
|
54
|
+
# re-processing the same session forever.
|
|
55
|
+
TRANSIENT_ERROR_KINDS = {
|
|
56
|
+
"overloaded_error",
|
|
57
|
+
"rate_limit_error",
|
|
58
|
+
"api_error",
|
|
59
|
+
"timeout",
|
|
60
|
+
"signal",
|
|
61
|
+
}
|
|
62
|
+
|
|
63
|
+
|
|
64
|
+
def _classify_cli_result(result) -> tuple[str, str]:
|
|
65
|
+
"""Return (kind, short_message) describing a failed automation backend call.
|
|
66
|
+
|
|
67
|
+
Kinds:
|
|
68
|
+
- "overloaded_error" / "rate_limit_error" / "api_error"
|
|
69
|
+
Anthropic API transient failure — do not poison the checkpoint.
|
|
70
|
+
- "signal" Claude CLI killed by external signal (SIGTERM / SIGKILL / exit>=128).
|
|
71
|
+
- "timeout" Subprocess hit CLAUDE_TIMEOUT — extremely long session.
|
|
72
|
+
- "json_parse" Claude responded, but output wasn't parseable JSON.
|
|
73
|
+
- "unknown" Fallback.
|
|
74
|
+
"""
|
|
75
|
+
rc = getattr(result, "returncode", -1)
|
|
76
|
+
stderr = (getattr(result, "stderr", "") or "")[:800]
|
|
77
|
+
stdout = (getattr(result, "stdout", "") or "")[:800]
|
|
78
|
+
blob = f"{stderr}\n{stdout}".lower()
|
|
79
|
+
if "overloaded" in blob:
|
|
80
|
+
return "overloaded_error", "Anthropic API overloaded"
|
|
81
|
+
if "rate_limit" in blob or "rate-limit" in blob or "429" in blob:
|
|
82
|
+
return "rate_limit_error", "Anthropic rate-limit hit"
|
|
83
|
+
if '"type":"error"' in blob and '"api_error"' in blob:
|
|
84
|
+
return "api_error", "Anthropic API error"
|
|
85
|
+
if rc >= 128:
|
|
86
|
+
return "signal", f"killed by signal (exit {rc})"
|
|
87
|
+
if rc < 0:
|
|
88
|
+
return "signal", f"subprocess terminated (exit {rc})"
|
|
89
|
+
return "unknown", f"exit {rc}"
|
|
90
|
+
|
|
41
91
|
|
|
42
92
|
def extract_json_from_response(text: str) -> dict | None:
|
|
43
93
|
"""Parse JSON from Claude's response, handling markdown fences."""
|
|
@@ -104,20 +154,23 @@ def analyze_session(
|
|
|
104
154
|
date_dir: Path,
|
|
105
155
|
shared_context_file: Path | None,
|
|
106
156
|
session_txt_map: dict[str, str] | None = None,
|
|
107
|
-
) -> dict | None:
|
|
157
|
+
) -> tuple[dict | None, str | None]:
|
|
108
158
|
"""Send a session to the automation backend for extraction analysis.
|
|
109
159
|
|
|
110
|
-
|
|
111
|
-
|
|
160
|
+
Returns (parsed_result, error_kind). `error_kind` is only set on failure.
|
|
161
|
+
See `_classify_cli_result` for possible values.
|
|
112
162
|
"""
|
|
113
163
|
session_file = find_session_file(session_id, date_dir, session_txt_map=session_txt_map)
|
|
114
164
|
if not session_file:
|
|
115
165
|
print(f" No session file found for {session_id}", file=sys.stderr)
|
|
116
|
-
return None
|
|
166
|
+
return None, "missing_session_file"
|
|
117
167
|
|
|
118
168
|
print(f" File: {session_file.name} ({session_file.stat().st_size / 1024:.0f} KB)")
|
|
119
169
|
|
|
120
|
-
# Build a short prompt — Claude reads the files itself
|
|
170
|
+
# Build a short prompt — Claude reads the files itself. We point at the
|
|
171
|
+
# slim shared context rather than the full 400+KB dump so the Claude CLI
|
|
172
|
+
# process doesn't have to stream hundreds of kilobytes of followups /
|
|
173
|
+
# learnings into its context window on every per-session extraction.
|
|
121
174
|
shared_ctx_instruction = ""
|
|
122
175
|
if shared_context_file and shared_context_file.exists():
|
|
123
176
|
shared_ctx_instruction = f"\n\nAlso read the shared context (followups, learnings, DB state) at: {shared_context_file}"
|
|
@@ -148,8 +201,9 @@ def analyze_session(
|
|
|
148
201
|
)
|
|
149
202
|
|
|
150
203
|
if result.returncode != 0:
|
|
151
|
-
|
|
152
|
-
|
|
204
|
+
kind, message = _classify_cli_result(result)
|
|
205
|
+
print(f" Automation backend {kind} (exit {result.returncode}): {message}", file=sys.stderr)
|
|
206
|
+
return None, kind
|
|
153
207
|
|
|
154
208
|
# Filter out stop hook contamination (e.g. "Post-mortem completo.")
|
|
155
209
|
output = "\n".join(
|
|
@@ -185,18 +239,65 @@ def analyze_session(
             debug_file = DEEP_SLEEP_DIR / f"debug-extract-{session_id[:20]}.txt"
             debug_file.write_text(result.stdout[:5000])
             print(f" Failed to parse JSON. Raw output saved to {debug_file}", file=sys.stderr)
-            return None
+            return None, "json_parse"
 
-        return parsed
+        return parsed, None
 
     except AutomationBackendUnavailableError as exc:
         print(f" Automation backend unavailable: {exc}", file=sys.stderr)
-        return None
+        return None, "backend_unavailable"
     except subprocess.TimeoutExpired:
         print(f" Automation backend timeout ({CLAUDE_TIMEOUT}s)", file=sys.stderr)
+        return None, "timeout"
+
+
+def _write_slim_shared_context(full_path: Path) -> Path:
+    """Generate (once per run) a slim version of shared-context.txt.
+
+    The full shared context can exceed 400KB — feeding that to every
+    per-session extraction means the Claude CLI subprocess spends most of its
+    context window on repeated DB metadata instead of the session transcript.
+    The slim version keeps the top-level structure + the first ~200 lines so
+    the model still has a summary of followups/learnings/diary samples.
+    """
+    slim_path = full_path.with_suffix(".slim.txt")
+    try:
+        raw = full_path.read_text(errors="replace")
+    except OSError:
+        return full_path
+    lines = raw.splitlines()
+    head = lines[:200]
+    header = [
+        "# Shared context (slim) — " + full_path.name,
+        f"# original_bytes={full_path.stat().st_size} original_lines={len(lines)}",
+        f"# trimmed_to=first_{len(head)}_lines",
+        "",
+    ]
+    try:
+        slim_path.write_text("\n".join(header + head), encoding="utf-8")
+    except OSError:
+        return full_path
+    return slim_path
+
+
+def _load_checkpoint(path: Path) -> dict | None:
+    if not path.exists():
+        return None
+    try:
+        with path.open() as fh:
+            return json.load(fh)
+    except (json.JSONDecodeError, OSError):
         return None
 
 
+def _save_checkpoint(path: Path, payload: dict) -> None:
+    try:
+        with path.open("w") as fh:
+            json.dump(payload, fh, indent=2, ensure_ascii=False)
+    except OSError as exc:
+        print(f" Warning: could not persist checkpoint {path}: {exc}", file=sys.stderr)
+
+
 def main():
     target_date = sys.argv[1] if len(sys.argv) > 1 else datetime.now().strftime("%Y-%m-%d")
 
|
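Note on the checkpoint helpers in the hunk above: the loader treats a missing file and an unparseable file the same way, so a truncated write simply looks like "no checkpoint" on the next run. A small round-trip sketch of that behavior, assuming simplified stand-ins for the two helpers (the names `load_checkpoint` and `save_checkpoint` and the file name are illustrative):

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> dict | None:
    """Stand-in for the loader: missing or unreadable files yield None."""
    if not path.exists():
        return None
    try:
        with path.open() as fh:
            return json.load(fh)
    except (json.JSONDecodeError, OSError):
        return None


def save_checkpoint(path: Path, payload: dict) -> None:
    """Stand-in for the writer (without the warning-on-failure branch)."""
    with path.open("w") as fh:
        json.dump(payload, fh, indent=2, ensure_ascii=False)


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "session-abc.json"
    print(load_checkpoint(ckpt))                      # None: no checkpoint yet
    save_checkpoint(ckpt, {"session_id": "abc", "findings": [], "error_count": 0})
    print(load_checkpoint(ckpt)["session_id"])        # abc
    ckpt.write_text('{"session_id": "abc", "find')    # simulate a truncated write
    print(load_checkpoint(ckpt))                      # None: treated as missing
```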
@@ -237,12 +338,17 @@ def main():
         print(f"[extract] Output: {output_file}")
         return
 
-    # Shared context file (followups, learnings, DB state)
-
-
-
+    # Shared context file (followups, learnings, DB state).
+    # Use a slim copy for the per-session prompts so the Claude CLI doesn't
+    # re-read the full 400+KB dump for every single session.
+    full_shared_context = date_dir / "shared-context.txt" if date_dir.exists() else None
+    shared_context_file: Path | None = None
+    if full_shared_context and full_shared_context.exists():
+        shared_context_file = _write_slim_shared_context(full_shared_context)
+        full_kb = full_shared_context.stat().st_size / 1024
+        slim_kb = shared_context_file.stat().st_size / 1024
+        print(f"[extract] Shared context: {shared_context_file} ({slim_kb:.0f} KB slim, {full_kb:.0f} KB full)")
     else:
-        shared_context_file = None
         print("[extract] No shared context file")
 
     print(f"[extract] Phase 2: Analyzing {len(session_files)} sessions for {target_date}")
|
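Note on the slim copy used above: the trimming is purely positional (first ~200 lines plus a short header), so anything that appears later in `shared-context.txt` is not visible to the per-session prompt. A minimal sketch of the same idea, assuming a simplified helper named `write_slim_copy` (hypothetical, with a configurable `max_lines`), shows the resulting file name and size:

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def write_slim_copy(full_path: Path, max_lines: int = 200) -> Path:
    """Write <stem>.slim.txt next to full_path with a header and the first lines."""
    lines = full_path.read_text(errors="replace").splitlines()
    head = lines[:max_lines]
    header = [
        "# Shared context (slim): " + full_path.name,
        f"# original_lines={len(lines)}",
        f"# trimmed_to=first_{len(head)}_lines",
        "",
    ]
    # with_suffix() replaces only the last suffix: shared-context.txt -> shared-context.slim.txt
    slim_path = full_path.with_suffix(".slim.txt")
    slim_path.write_text("\n".join(header + head), encoding="utf-8")
    return slim_path


with tempfile.TemporaryDirectory() as tmp:
    full = Path(tmp) / "shared-context.txt"
    full.write_text("\n".join(f"line {n}" for n in range(1000)), encoding="utf-8")
    slim = write_slim_copy(full)
    print(slim.name)                                             # shared-context.slim.txt
    print(len(slim.read_text(encoding="utf-8").splitlines()))    # 204
```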
@@ -255,6 +361,7 @@ def main():
     all_extractions = []
     total_findings = 0
     skipped = 0
+    poisoned = 0
     # Two attempts is enough: if a session's extraction fails twice, the cause is
     # almost always deterministic (JSON parse, schema violation) rather than transient,
     # so further retries just burn time. Skip and continue instead.
@@ -264,26 +371,42 @@ def main():
         sid_safe = _safe_session_slug(session_id)[:40]
         checkpoint_file = checkpoint_dir / f"{sid_safe}.json"
 
-
-
-
-
-
-
-
-
-
-
-
-
-
+        cached = _load_checkpoint(checkpoint_file)
+        cached_error_count = int((cached or {}).get("error_count", 0))
+        cached_last_error_kind = (cached or {}).get("last_error_kind", "")
+
+        # Successful prior checkpoint → reuse as-is
+        if cached and not cached.get("error") and cached.get("findings") is not None:
+            findings_count = len(cached.get("findings", []))
+            total_findings += findings_count
+            all_extractions.append(cached)
+            skipped += 1
+            print(f"[extract] Session {i + 1}/{len(session_files)}: {session_id} (cached, {findings_count} findings)")
+            continue
+
+        # Poisoned checkpoint → skip without burning API calls
+        if cached_error_count >= MAX_POISON_ATTEMPTS:
+            poisoned += 1
+            all_extractions.append(cached or {
+                "session_id": session_id,
+                "findings": [],
+                "error": "poisoned",
+                "error_count": cached_error_count,
+                "last_error_kind": cached_last_error_kind,
+            })
+            print(
+                f"[extract] Session {i + 1}/{len(session_files)}: {session_id} "
+                f"(poisoned, {cached_error_count} prior failures — skip)"
+            )
+            continue
 
         print(f"[extract] Session {i + 1}/{len(session_files)}: {session_id}")
 
-        # Retry loop
+        # Retry loop within this run
         result = None
+        last_error_kind = ""
         for attempt in range(1, MAX_RETRIES + 1):
-            result = analyze_session(
+            result, error_kind = analyze_session(
                 session_id,
                 date_dir,
                 shared_context_file,
|
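Note on the checkpoint gate in the hunk above: each session takes one of three paths before any API call is made. The decision can be sketched as a pure function, assuming a value of 3 for `MAX_POISON_ATTEMPTS` (the real constant is defined outside this hunk) and a hypothetical helper name `checkpoint_action`:

```python
from __future__ import annotations

MAX_POISON_ATTEMPTS = 3  # assumed value; the real constant is not shown in this hunk


def checkpoint_action(cached: dict | None) -> str:
    """Return "reuse", "skip_poisoned", or "run" for one session's checkpoint."""
    if cached and not cached.get("error") and cached.get("findings") is not None:
        return "reuse"  # successful prior extraction
    if int((cached or {}).get("error_count", 0)) >= MAX_POISON_ATTEMPTS:
        return "skip_poisoned"  # failed too often on earlier runs
    return "run"  # no checkpoint, or a failure that may still be retried


print(checkpoint_action(None))                                                     # run
print(checkpoint_action({"findings": [], "error_count": 0}))                       # reuse
print(checkpoint_action({"findings": [], "error": "failed", "error_count": 1}))    # run
print(checkpoint_action({"findings": [], "error": "poisoned", "error_count": 3}))  # skip_poisoned
```

An empty `findings` list still counts as a successful checkpoint here, because the test is `is not None` rather than truthiness.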
@@ -291,47 +414,77 @@ def main():
             )
             if result:
                 break
+            last_error_kind = error_kind or "unknown"
             if attempt < MAX_RETRIES:
-                print(f" -> Attempt {attempt}/{MAX_RETRIES} failed, retrying...")
+                print(f" -> Attempt {attempt}/{MAX_RETRIES} failed ({last_error_kind}), retrying...")
 
         if result:
             findings_count = len(result.get("findings", []))
             total_findings += findings_count
+            # Persist success and reset error_count so transient past failures
+            # don't keep counting against the session.
+            result.setdefault("session_id", session_id)
+            result["error_count"] = 0
+            result["last_error_kind"] = ""
             all_extractions.append(result)
-
-            with open(checkpoint_file, "w") as f:
-                json.dump(result, f, indent=2, ensure_ascii=False)
+            _save_checkpoint(checkpoint_file, result)
             print(f" -> {findings_count} findings extracted (checkpointed)")
         else:
-
+            # Transient errors (API overloaded, rate-limit, timeout, killed
+            # by signal) should NOT increment the poison counter — they're
+            # not the session's fault. They also don't persist a fresh
+            # checkpoint, so the next deep-sleep run will retry cleanly.
+            transient = last_error_kind in TRANSIENT_ERROR_KINDS
+            if transient:
+                print(f" -> Transient failure ({last_error_kind}), will retry on next run.")
+                all_extractions.append({
+                    "session_id": session_id,
+                    "findings": [],
+                    "error": "transient",
+                    "error_count": cached_error_count,
+                    "last_error_kind": last_error_kind,
+                })
+                # Do not touch the checkpoint — the next run gets a clean retry.
+                continue
+
+            new_count = cached_error_count + 1
+            state = "poisoned" if new_count >= MAX_POISON_ATTEMPTS else "failed"
+            print(
+                f" -> Deterministic failure #{new_count}/{MAX_POISON_ATTEMPTS} "
+                f"({last_error_kind}); marked as {state}."
+            )
             failed_entry = {
                 "session_id": session_id,
                 "findings": [],
-                "error":
+                "error": state,
+                "error_count": new_count,
+                "last_error_kind": last_error_kind,
             }
             all_extractions.append(failed_entry)
-
-
-
+            _save_checkpoint(checkpoint_file, failed_entry)
+            if state == "poisoned":
+                poisoned += 1
 
     # Merge into output
     output = {
         "date": target_date,
         "sessions_analyzed": len(session_files),
-        "sessions_succeeded": len([e for e in all_extractions if "error"
+        "sessions_succeeded": len([e for e in all_extractions if not e.get("error")]),
         "sessions_cached": skipped,
+        "sessions_poisoned": poisoned,
         "total_findings": total_findings,
-        "extractions": all_extractions
+        "extractions": all_extractions,
     }
 
     output_file = DEEP_SLEEP_DIR / f"{target_date}-extractions.json"
     with open(output_file, "w") as f:
         json.dump(output, f, indent=2, ensure_ascii=False)
 
-
-
-
-
+    fresh_runs = len(session_files) - skipped - poisoned
+    print(
+        f"\n[extract] Done. {total_findings} findings from {len(session_files)} sessions "
+        f"({skipped} cached, {fresh_runs} fresh, {poisoned} poisoned)."
+    )
     print(f"[extract] Output: {output_file}")
 
 
|
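Note on the failure bookkeeping in the hunk above: transient failures leave both the counter and the checkpoint untouched, while deterministic failures increment the counter and are persisted. A minimal sketch of that rule, assuming illustrative values for `MAX_POISON_ATTEMPTS` and `TRANSIENT_ERROR_KINDS` (both are defined outside the hunks shown) and a hypothetical helper `record_failure`:

```python
from __future__ import annotations

# Assumed values for illustration; the real constants are not visible in this diff.
MAX_POISON_ATTEMPTS = 3
TRANSIENT_ERROR_KINDS = {"timeout", "api_overloaded", "rate_limit"}


def record_failure(session_id: str, prior_count: int, kind: str) -> tuple[dict, bool]:
    """Return (entry, persist). persist is False when the checkpoint must stay untouched."""
    if kind in TRANSIENT_ERROR_KINDS:
        entry = {
            "session_id": session_id,
            "findings": [],
            "error": "transient",
            "error_count": prior_count,  # unchanged: not the session's fault
            "last_error_kind": kind,
        }
        return entry, False
    new_count = prior_count + 1
    state = "poisoned" if new_count >= MAX_POISON_ATTEMPTS else "failed"
    entry = {
        "session_id": session_id,
        "findings": [],
        "error": state,
        "error_count": new_count,
        "last_error_kind": kind,
    }
    return entry, True


entry, persist = record_failure("abc", 2, "json_parse")
print(entry["error"], entry["error_count"], persist)  # poisoned 3 True
```

One consequence visible in the diff: `poisoned` counts both sessions skipped at the gate and sessions that become poisoned during the current run, so the `fresh` figure in the final summary line excludes the latter even though they were attempted.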
package/src/scripts/nexo-cron-wrapper.sh
CHANGED
@@ -5,6 +5,16 @@
 #
 # Wraps any cron command to automatically record start/end/exit_code/summary.
 # Used by sync.py when generating LaunchAgents from manifest.json.
+#
+# Two-phase recording (start → end):
+#   1. INSERT cron_runs row at start with ended_at=NULL so the watchdog can
+#      distinguish "currently running" from "missed / stuck". Without this,
+#      any job that exceeds the next watchdog tick (interval_seconds=1800 by
+#      default) looks stale and the watchdog may kickstart -k over it — which
+#      is exactly the loop that broke deep-sleep between 2026-04-14 and 2026-04-17.
+#   2. UPDATE the row at end with ended_at + exit_code + summary.
+#   3. Trap SIGTERM / SIGINT so wrappers killed mid-flight still close their
+#      row (exit_code=143 or 130) instead of leaving it NULL forever.
 
 set -uo pipefail
 
|
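Note on the two-phase recording described in the header comment above: the point of inserting the row at start is that a reader of `cron_runs` can tell a running job from a missed one by checking `ended_at IS NULL`. A minimal sketch with an in-memory SQLite database, assuming a reduced stand-in schema (the real table has more columns) and hypothetical helper names, cron id, and timestamps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced stand-in schema for illustration only.
conn.execute(
    "CREATE TABLE cron_runs (id INTEGER PRIMARY KEY, cron_id TEXT, "
    "started_at TEXT, ended_at TEXT, exit_code INTEGER)"
)


def start_run(cron_id: str, started_at: str) -> int:
    """Phase 1: insert an open row (ended_at NULL) and return its id."""
    cur = conn.execute(
        "INSERT INTO cron_runs (cron_id, started_at, ended_at) VALUES (?, ?, NULL)",
        (cron_id, started_at),
    )
    conn.commit()
    return cur.lastrowid


def end_run(row_id: int, ended_at: str, exit_code: int) -> None:
    """Phase 2: close the same row."""
    conn.execute(
        "UPDATE cron_runs SET ended_at=?, exit_code=? WHERE id=?",
        (ended_at, exit_code, row_id),
    )
    conn.commit()


def in_flight(cron_id: str) -> bool:
    """True while at least one row for cron_id has no ended_at."""
    row = conn.execute(
        "SELECT COUNT(*) FROM cron_runs WHERE cron_id=? AND ended_at IS NULL",
        (cron_id,),
    ).fetchone()
    return row[0] > 0


row_id = start_run("deep-sleep", "2026-04-18 03:00:00")
print(in_flight("deep-sleep"))  # True
end_run(row_id, "2026-04-18 03:42:10", 0)
print(in_flight("deep-sleep"))  # False
```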
@@ -33,84 +43,159 @@ print(datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))
 PY
 )
 
-#
+# Phase 1: INSERT row at start (ended_at NULL = "running").
+# ROW_ID empty on DB failure; spool-fallback at the end handles that.
+ROW_ID=""
+ROW_ID=$(python3 - "$DB" "$CRON_ID" "$STARTED_AT" <<'PY' 2>/dev/null
+from __future__ import annotations
+import sqlite3
+import sys
+db_path, cron_id, started_at = sys.argv[1:]
+conn = sqlite3.connect(db_path)
+try:
+    cur = conn.execute(
+        "INSERT INTO cron_runs (cron_id, started_at, ended_at) VALUES (?, ?, NULL)",
+        (cron_id, started_at),
+    )
+    conn.commit()
+    print(cur.lastrowid)
+finally:
+    conn.close()
+PY
+)
+
 OUTPUT_FILE=$(mktemp)
-
-"
-
-
+EXIT_CODE=0
+SIGNAL_NAME=""
+
+# finalize_row DB writer — also used by signal traps.
+# Reads $EXIT_CODE / $SIGNAL_NAME / $OUTPUT_FILE from the outer scope.
+finalize_row() {
+    local ended_at duration summary error
+    ended_at=$(python3 - <<'PY'
 from datetime import datetime, timezone
 print(datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))
 PY
 )
-
+    duration=$(python3 - <<PY
 start = float("$START_EPOCH")
 import time
 print(round(time.time() - start, 1))
 PY
 )
+    summary=$(tail -5 "$OUTPUT_FILE" 2>/dev/null | grep -v "^$" | tail -1 | head -c 500)
+    error=""
+    if [ "$EXIT_CODE" -ne 0 ]; then
+        if [ -n "$SIGNAL_NAME" ]; then
+            error="Killed by $SIGNAL_NAME (exit $EXIT_CODE)"
+        else
+            error=$(grep -i "error\|exception\|fail\|traceback" "$OUTPUT_FILE" 2>/dev/null | tail -1 | head -c 500)
+        fi
+    fi
 
-#
-
-
-# Extract error if failed
-ERROR=""
-if [ $EXIT_CODE -ne 0 ]; then
-    ERROR=$(grep -i "error\|exception\|fail\|traceback" "$OUTPUT_FILE" | tail -1 | head -c 500)
-fi
-
-if ! python3 - "$DB" "$CRON_ID" "$STARTED_AT" "$ENDED_AT" "$EXIT_CODE" "$SUMMARY" "$ERROR" "$DURATION_SECS" <<'PY'
+    # Update the row we inserted at start — or INSERT fresh if the start write failed.
+    if ! python3 - "$DB" "$ROW_ID" "$CRON_ID" "$STARTED_AT" "$ended_at" "$EXIT_CODE" "$summary" "$error" "$duration" <<'PY' 2>/dev/null
 from __future__ import annotations
-
 import sqlite3
 import sys
-
-db_path, cron_id, started_at, ended_at, exit_code, summary, error, duration_secs = sys.argv[1:]
+db_path, row_id, cron_id, started_at, ended_at, exit_code, summary, error, duration_secs = sys.argv[1:]
 conn = sqlite3.connect(db_path)
 try:
-
-
-
-
-
-
-
-
-
-
-
-
-            error,
-
-
-
+    if row_id:
+        conn.execute(
+            """
+            UPDATE cron_runs
+            SET ended_at=?, exit_code=?, summary=?, error=?, duration_secs=?
+            WHERE id=?
+            """,
+            (ended_at, int(exit_code), summary, error, float(duration_secs), int(row_id)),
+        )
+    else:
+        conn.execute(
+            """
+            INSERT INTO cron_runs (cron_id, started_at, ended_at, exit_code, summary, error, duration_secs)
+            VALUES (?, ?, ?, ?, ?, ?, ?)
+            """,
+            (cron_id, started_at, ended_at, int(exit_code), summary, error, float(duration_secs)),
+        )
     conn.commit()
 finally:
     conn.close()
 PY
-then
-
-
-
+    then
+        mkdir -p "$SPOOL_DIR"
+        local spool_file="$SPOOL_DIR/${CRON_ID}-$(date +%Y%m%d-%H%M%S)-$$.json"
+        python3 - "$spool_file" "$CRON_ID" "$STARTED_AT" "$ended_at" "$EXIT_CODE" "$summary" "$error" "$duration" <<'PY'
 from __future__ import annotations
-
 import json
 import sys
 from pathlib import Path
-
 spool_file, cron_id, started_at, ended_at, exit_code, summary, error, duration_secs = sys.argv[1:]
-
-
-
-
-
-
-
-
-
-
+Path(spool_file).write_text(
+    json.dumps({
+        "cron_id": cron_id,
+        "started_at": started_at,
+        "ended_at": ended_at,
+        "exit_code": int(exit_code),
+        "summary": summary,
+        "error": error,
+        "duration_secs": float(duration_secs),
+    }, indent=2, ensure_ascii=False) + "\n",
+    encoding="utf-8",
+)
 PY
-
-fi
+        echo "[nexo-cron-wrapper] DB write failed; spooled run to $spool_file" >&2
+    fi
+}
+
+cleanup() {
+    rm -f "$OUTPUT_FILE"
+}
+
+CHILD_PID=""
+
+on_signal() {
+    local sig="$1"
+    local code="$2"
+    SIGNAL_NAME="$sig"
+    EXIT_CODE="$code"
+    # Forward the signal to the child. Bash traps run AFTER the foreground
+    # command completes, which is why we launch the command in background
+    # and wait on its PID — otherwise a SIGTERM to the wrapper would be
+    # delivered only when the child finishes naturally, defeating the
+    # purpose of closing the cron_runs row on kill.
+    if [ -n "$CHILD_PID" ] && kill -0 "$CHILD_PID" 2>/dev/null; then
+        kill -TERM "$CHILD_PID" 2>/dev/null
+        # Brief grace period before escalating to SIGKILL so the child gets
+        # a chance to clean up on its own.
+        local waited=0
+        while [ $waited -lt 5 ] && kill -0 "$CHILD_PID" 2>/dev/null; do
+            sleep 1
+            waited=$((waited + 1))
+        done
+        kill -KILL "$CHILD_PID" 2>/dev/null
+    fi
+    finalize_row
+    cleanup
+    exit "$code"
+}
+
+trap cleanup EXIT
+trap 'on_signal SIGTERM 143' TERM
+trap 'on_signal SIGINT 130' INT
+trap 'on_signal SIGHUP 129' HUP
+
+"$@" > "$OUTPUT_FILE" 2>&1 &
+CHILD_PID=$!
+
+# `wait` is interruptible by signals — when the trap fires, wait returns
+# immediately and on_signal() takes over. When the child finishes
+# normally, wait yields its exit code and we fall through to finalize_row
+# for the happy path.
+wait "$CHILD_PID"
+EXIT_CODE=$?
+CHILD_PID=""
+
+finalize_row
 
-exit $EXIT_CODE
+exit "$EXIT_CODE"
|
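Note on `finalize_row` in the hunk above: when the database write fails, the run is written to a JSON spool file instead of being dropped. The fallback can be sketched in Python as follows, assuming a reduced stand-in schema and a hypothetical helper `finalize_run`; the spool file name here uses the process id only, whereas the wrapper also includes a timestamp:

```python
from __future__ import annotations

import json
import os
import sqlite3
import tempfile
from pathlib import Path


def finalize_run(db_path: Path, spool_dir: Path, run: dict) -> str:
    """Record a finished run in SQLite, or spool it as JSON if the DB write fails.

    Returns "db" on success, otherwise the path of the spool file.
    """
    try:
        conn = sqlite3.connect(str(db_path))
        try:
            conn.execute(
                "INSERT INTO cron_runs (cron_id, started_at, ended_at, exit_code) "
                "VALUES (?, ?, ?, ?)",
                (run["cron_id"], run["started_at"], run["ended_at"], run["exit_code"]),
            )
            conn.commit()
        finally:
            conn.close()
        return "db"
    except sqlite3.Error:
        spool_dir.mkdir(parents=True, exist_ok=True)
        spool_file = spool_dir / f"{run['cron_id']}-{os.getpid()}.json"
        spool_file.write_text(
            json.dumps(run, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
        )
        return str(spool_file)


with tempfile.TemporaryDirectory() as tmp:
    run = {
        "cron_id": "deep-sleep",
        "started_at": "2026-04-18 03:00:00",
        "ended_at": "2026-04-18 03:42:10",
        "exit_code": 143,
    }
    # The fresh database has no cron_runs table, so the INSERT fails and the
    # run is spooled instead of being lost.
    outcome = finalize_run(Path(tmp) / "empty.db", Path(tmp) / "spool", run)
    print(outcome.endswith(".json"))  # True
```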
package/src/scripts/nexo-watchdog.sh
CHANGED
@@ -594,40 +594,72 @@ for monitor in "${MONITORS[@]}"; do
         run_info=$(cron_last_run_info "$cron_id" || true)
         if [ -n "$run_info" ]; then
             latest_run_has_record=true
-            IFS='|' read -r age _
+            IFS='|' read -r age _ last_ended last_exit last_error last_summary <<< "$run_info"
             age="${age:-999999}"
             stale_age=$(format_age "$age")
-
-
-
-
-
-
-
-
-
-
-
-
+
+            # In-flight detection: started_at present but ended_at empty means the
+            # wrapper is still running. Never kickstart -k over an in-flight row —
+            # that was the loop that broke deep-sleep between 2026-04-14 and
+            # 2026-04-17, when the watchdog kept killing the worker that was
+            # actually doing the job. Only intervene if the process is provably
+            # dead (zombie row) AND the run has exceeded 3× max_stale.
+            if [ -z "$last_ended" ]; then
+                if [ "$age" -gt $(( max_stale * 3 )) ] && [ -n "$proc_grep" ] && ! process_running "$proc_grep"; then
+                    status="FAIL"
+                    details="${details}In-flight for ${stale_age} but process '$proc_grep' dead — stale row. "
+                    if [ "$recovery_policy" = "catchup" ]; then
+                        if try_request_catchup; then
+                            status="HEALED"
+                            details="${details}Self-healed: requested catchup for crashed in-flight run. "
+                            TOTAL_HEALED=$((TOTAL_HEALED + 1))
+                        fi
                     else
-
-
+                        if try_reexecute_missed_cron "$plist_id"; then
+                            status="HEALED"
+                            details="${details}Self-healed: re-executed crashed in-flight run. "
+                            TOTAL_HEALED=$((TOTAL_HEALED + 1))
+                        fi
                     fi
+                elif [ "$age" -gt $(( max_stale * 3 )) ]; then
+                    [ "$status" = "PASS" ] && status="WARN"
+                    details="${details}In-flight for ${stale_age} (long-running, process alive). "
                 else
-
-
-
-
+                    details="${details}In-flight (started ${stale_age}). "
+                fi
+            else
+                if [ -n "$last_exit" ] && [ "$last_exit" != "0" ]; then
+                    latest_run_failed=true
+                    status="FAIL"
+                    details="${details}Last run exited ${last_exit}. "
+                    [ -n "$last_error" ] && details="${details}Error: ${last_error}. "
+                fi
+                if [ "$age" -gt $(( max_stale * 3 )) ]; then
+                    if [ "$recovery_policy" = "catchup" ]; then
+                        if try_request_catchup; then
+                            status="HEALED"
+                            details="${details}Self-healed: requested catchup for missed window (last run: $stale_age). "
+                            TOTAL_HEALED=$((TOTAL_HEALED + 1))
+                        else
+                            status="FAIL"
+                            details="${details}cron_runs stale: $stale_age (limit: $(format_age "$max_stale")). Catchup request failed. "
+                        fi
                     else
-
-
+                        if try_reexecute_missed_cron "$plist_id"; then
+                            status="HEALED"
+                            details="${details}Self-healed: re-executed missed cron (last run: $stale_age). "
+                            TOTAL_HEALED=$((TOTAL_HEALED + 1))
+                        else
+                            status="FAIL"
+                            details="${details}cron_runs stale: $stale_age (limit: $(format_age "$max_stale")). Re-execute failed. "
+                        fi
                     fi
+                elif [ "$age" -gt "$max_stale" ]; then
+                    [ "$status" = "PASS" ] && status="WARN"
+                    details="${details}cron_runs slightly stale: $stale_age. "
+                elif [ -z "$details" ] && [ -n "$last_summary" ]; then
+                    details="${details}Last run summary: ${last_summary}. "
                 fi
-            elif [ "$age" -gt "$max_stale" ]; then
-                [ "$status" = "PASS" ] && status="WARN"
-                details="${details}cron_runs slightly stale: $stale_age. "
-            elif [ -z "$details" ] && [ -n "$last_summary" ]; then
-                details="${details}Last run summary: ${last_summary}. "
             fi
         else
             stale_age="no cron_runs entry"
|
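Note on the watchdog branches in the hunk above: the decision depends on whether the latest row is still open, how old it is relative to `max_stale`, and whether the process is still alive. A simplified sketch of that decision as a pure Python function follows, under stated assumptions: the helper name `classify_run` is hypothetical, `process_alive=None` stands for "no `proc_grep` configured", and the recovery step itself (catchup or re-execute, which may turn the status into `HEALED`) is reduced to an action label.

```python
from __future__ import annotations


def classify_run(
    age: int,
    max_stale: int,
    ended: bool,
    exit_code: int | None,
    process_alive: bool | None,
) -> tuple[str, str]:
    """Return (status_before_recovery, action) for the latest cron_runs row."""
    very_stale = age > max_stale * 3
    if not ended:
        # In-flight row: only intervene when the process is provably dead.
        if very_stale and process_alive is False:
            return "FAIL", "recover"
        if very_stale:
            return "WARN", "none"  # long-running but not provably dead
        return "PASS", "none"
    status = "FAIL" if exit_code not in (None, 0) else "PASS"
    if very_stale:
        return "FAIL", "recover"  # missed window: request catchup or re-execute
    if age > max_stale and status == "PASS":
        return "WARN", "none"  # slightly stale
    return status, "none"


print(classify_run(6000, 1800, ended=False, exit_code=None, process_alive=True))   # ('WARN', 'none')
print(classify_run(6000, 1800, ended=False, exit_code=None, process_alive=False))  # ('FAIL', 'recover')
```

The first call is the case the in-flight check exists for: a run that has outlived three watchdog intervals but whose process is still alive is reported, not restarted.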