nexo-brain 1.1.1 → 1.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,12 +1,12 @@
1
1
  # NEXO Brain — Your AI Gets a Brain
2
2
 
3
- [![npm v1.1.0](https://img.shields.io/npm/v/nexo-brain?label=npm&color=purple)](https://www.npmjs.com/package/nexo-brain)
3
+ [![npm v1.2.1](https://img.shields.io/npm/v/nexo-brain?label=npm&color=purple)](https://www.npmjs.com/package/nexo-brain)
4
4
  [![F1 0.588 on LoCoMo](https://img.shields.io/badge/LoCoMo_F1-0.588-brightgreen)](https://github.com/wazionapps/nexo/blob/main/benchmarks/locomo/results/)
5
5
  [![+55% vs GPT-4](https://img.shields.io/badge/vs_GPT--4-%2B55%25-blue)](https://github.com/snap-research/locomo/issues/33)
6
6
  [![GitHub stars](https://img.shields.io/github/stars/wazionapps/nexo?style=social)](https://github.com/wazionapps/nexo/stargazers)
7
7
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
8
8
 
9
- > **v1.1.0** — Context Continuity via auto-compaction hooks. PreCompact saves a full session checkpoint; PostCompact re-injects it so long sessions (8+ hours) feel like one continuous conversation. Plus: Cognitive Cortex, 30 Core Rules as DNA, Smart Startup, Context Packets, Auto-Prime. The first AI memory system with architectural inhibitory control — the agent reasons about whether to act before acting. Battle-tested from 6 months of production use, validated via multi-AI debate (Claude Opus + GPT-5.4 + Gemini 3.1 Pro).
9
+ > **v1.1.1** — Context Continuity via auto-compaction hooks. PreCompact saves a full session checkpoint; PostCompact re-injects it so long sessions (8+ hours) feel like one continuous conversation. Plus: Cognitive Cortex, 30 Core Rules as DNA, Smart Startup, Context Packets, Auto-Prime. The first AI memory system with architectural inhibitory control — the agent reasons about whether to act before acting. Battle-tested from 6 months of production use, validated via multi-AI debate (Claude Opus + GPT-5.4 + Gemini 3.1 Pro).
10
10
 
11
11
  **NEXO Brain transforms any MCP-compatible AI agent from a stateless assistant into a cognitive partner that remembers, learns, forgets, adapts, and builds a relationship with you over time.**
12
12
 
@@ -163,7 +163,7 @@ User message → Fast Path check → Simple chat? → Respond directly
163
163
 
164
164
  The Cortex was designed through a 3-way AI debate (Claude Opus 4.6 + GPT-5.4 + Gemini 3.1 Pro) and validated against 6 months of real production failures.
165
165
 
166
- ## Context Continuity (Auto-Compaction) (v1.1.0)
166
+ ## Context Continuity (Auto-Compaction) (v1.1.1)
167
167
 
168
168
  NEXO Brain automatically preserves session context when Claude Code compacts conversations. Using PreCompact and PostCompact hooks:
169
169
 
@@ -604,6 +604,23 @@ If NEXO Brain is useful to you, consider:
604
604
 
605
605
  ## Changelog
606
606
 
607
+ ### v1.2.1 — Stop Hook Hotfix (2026-03-27)
608
+ - **Fix**: v1.2.0 deleted the flag on approve, causing infinite block loops if session didn't close immediately
609
+ - **Fix**: Removed TTL on flag — it persists until SessionStart cleans it up next session
610
+ - **New**: Trivial sessions (<5 meaningful tool calls) skip post-mortem entirely and approve immediately
611
+ - SessionStart hook now cleans up `.postmortem-complete` flag on session start
612
+
613
+ ### v1.2.0 — Blocking Stop Hook (2026-03-27)
614
+ - **Fix**: Stop hook now uses `"decision": "block"` instead of `"approve"` to enforce post-mortem execution
615
+ - Previous behavior: hook injected `systemMessage` but AI had already responded — instructions were never processed
616
+ - New behavior: session close is blocked until AI completes self-critique, session diary, buffer entry, and followups
617
+ - Flag-based mechanism (`.postmortem-complete`) allows second close attempt to succeed
618
+ - Works for all NEXO users, not just specific setups
619
+
620
+ ### v1.1.1 — Multi-terminal fix (2026-03-27)
621
+ - **Fix**: PostCompact now reads the correct session's checkpoint in multi-terminal setups
622
+ - Changelog section added to README
623
+
607
624
  ### v1.1.0 — Context Continuity (2026-03-27)
608
625
  - **Context Continuity**: PreCompact/PostCompact hooks preserve session state across compaction events
609
626
  - New `session_checkpoints` SQLite table + migration #12
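The v1.2.0 and v1.2.1 entries describe a small state machine around the `.postmortem-complete` flag. The Python sketch below models that lifecycle in memory to show why keeping the flag on approve avoids the block loop. `HookState`, `stop_hook`, `finish_postmortem` and `session_start` are hypothetical names, and a boolean stands in for the flag file. It illustrates the described behavior and is not NEXO's own code.

```python
from dataclasses import dataclass

# Mirrors the "<5 meaningful tool calls" rule from the v1.2.1 entry.
TRIVIAL_THRESHOLD = 5


@dataclass
class HookState:
    """Hypothetical in-memory stand-in for the .postmortem-complete flag file."""
    flag_exists: bool = False


def stop_hook(state: HookState, meaningful_tool_calls: int) -> str:
    """Return the Stop hook decision as the v1.2.1 changelog describes it."""
    if meaningful_tool_calls < TRIVIAL_THRESHOLD:
        return "approve"  # trivial session: skip the post-mortem entirely
    if state.flag_exists:
        return "approve"  # the flag is NOT deleted here (that was the v1.2.0 bug)
    return "block"  # non-trivial session without a finished post-mortem


def finish_postmortem(state: HookState) -> None:
    """The AI creates the flag after self-critique, diary, buffer and followups."""
    state.flag_exists = True


def session_start(state: HookState) -> None:
    """SessionStart removes the flag so the next session starts clean."""
    state.flag_exists = False


if __name__ == "__main__":
    s = HookState()
    print(stop_hook(s, 12))  # block
    finish_postmortem(s)
    print(stop_hook(s, 12), stop_hook(s, 12))  # approve approve
```

Repeated close attempts after the post-mortem keep returning `approve`, which is the v1.2.1 fix; in v1.2.0 the flag was removed on the first approve, so a later Stop event in the same session would block again.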
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "nexo-brain",
3
- "version": "1.1.1",
3
+ "version": "1.2.1",
4
4
  "mcpName": "io.github.wazionapps/nexo",
5
5
  "description": "NEXO \u2014 Cognitive co-operator for Claude Code. Atkinson-Shiffrin memory, semantic RAG, trust scoring, and metacognitive error prevention.",
6
6
  "bin": {
@@ -8,7 +8,10 @@ NEXO_HOME="${NEXO_HOME:-$HOME/.nexo}"
8
8
  BRIEFING_FILE="$NEXO_HOME/coordination/session-briefing.txt"
9
9
  MAX_AGE_SECONDS=3600 # 1 hour cache
10
10
 
11
- mkdir -p "$NEXO_HOME/coordination"
11
+ mkdir -p "$NEXO_HOME/coordination" "$NEXO_HOME/operations"
12
+
13
+ # Clean up post-mortem flag from previous session
14
+ rm -f "$NEXO_HOME/operations/.postmortem-complete" 2>/dev/null
12
15
 
13
16
  # If briefing exists and is less than 1 hour old, skip regeneration
14
17
  if [ -f "$BRIEFING_FILE" ]; then
@@ -1,12 +1,30 @@
1
1
  #!/bin/bash
2
- # NEXO Stop hook — Full post-mortem and session closure.
3
- # Injects a systemMessage with mandatory self-critique instructions.
4
- # After emitting the hook response, writes a fallback buffer entry
5
- # and triggers intra-day reflection if enough sessions have accumulated.
2
+ # NEXO Stop hook (v7 BLOCKING post-mortem with trivial session detection)
3
+ #
4
+ # v5 bug: used "approve" + systemMessage; AI never processed post-mortem.
5
+ # v6 bug: used "block" but deleted flag on approve — caused infinite block loop.
6
+ # Also had TTL on flag that expired between close attempts.
7
+ # v7 fix: trivial sessions (<5 tool calls) approve immediately.
8
+ # Non-trivial sessions block until post-mortem is done.
9
+ # Flag has NO TTL and is NOT deleted on approve.
10
+ # SessionStart hook cleans up the flag for the next session.
11
+ #
12
+ # Flow:
13
+ # Trivial session (quick question, <5 meaningful tool calls):
14
+ # → APPROVE immediately, no post-mortem needed
15
+ #
16
+ # Non-trivial session:
17
+ # 1. User closes → hook checks flag → not found → BLOCK
18
+ # 2. AI executes post-mortem → creates flag
19
+ # 3. User closes again → hook sees flag → APPROVE
20
+ # 4. Next session start → SessionStart hook deletes flag
6
21
  set -euo pipefail
7
22
 
8
23
  NEXO_HOME="${NEXO_HOME:-$HOME/.nexo}"
9
24
  NEXO_NAME="${NEXO_NAME:-NEXO}"
25
+ FLAG_FILE="$NEXO_HOME/operations/.postmortem-complete"
26
+ TODAY=$(date +%Y-%m-%d)
27
+ TOOL_LOG="$NEXO_HOME/operations/tool-logs/${TODAY}.jsonl"
10
28
 
11
29
  # 0. Refresh diary draft with latest changes/decisions (best-effort)
12
30
  python3 -c "
@@ -38,19 +56,70 @@ except Exception:
38
56
  pass
39
57
  " 2>/dev/null || true
40
58
 
41
- # 1. Emit hook response (must be first output Claude Code reads this)
42
- cat << HOOKEOF
59
+ # 1. Detect trivial session: count meaningful tool calls from today's log
60
+ # A session with <5 tool calls (excluding Read/Grep/Glob/Bash/ToolSearch) is trivial
61
+ TOOL_COUNT=0
62
+ if [ -f "$TOOL_LOG" ]; then
63
+ TOOL_COUNT=$(python3 -c "
64
+ import json, sys
65
+ count = 0
66
+ for line in open('$TOOL_LOG'):
67
+ try:
68
+ d = json.loads(line)
69
+ t = d.get('tool_name', '')
70
+ if t and t not in ('Read', 'Grep', 'Glob', 'Bash', 'ToolSearch'):
71
+ count += 1
72
+ except:
73
+ pass
74
+ print(count)
75
+ " 2>/dev/null || echo "0")
76
+ fi
77
+
78
+ # Trivial session → approve immediately, write minimal buffer, skip post-mortem
79
+ if [ "$TOOL_COUNT" -lt 5 ]; then
80
+ BUFFER="$NEXO_HOME/brain/session_buffer.jsonl"
81
+ TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%S")
82
+ mkdir -p "$(dirname "$BUFFER")"
83
+ echo "{\"ts\":\"$TIMESTAMP\",\"tasks\":[\"trivial session\"],\"decisions\":[],\"user_patterns\":[],\"files_modified\":[],\"errors_resolved\":[],\"self_critique\":\"trivial session — no post-mortem needed\",\"mood\":\"neutral\",\"source\":\"hook-trivial\"}" >> "$BUFFER" 2>/dev/null
84
+
85
+ cat << 'HOOKEOF'
86
+ {
87
+ "decision": "approve"
88
+ }
89
+ HOOKEOF
90
+ exit 0
91
+ fi
92
+
93
+ # 2. Non-trivial session — check if post-mortem was already completed
94
+ # Flag has NO TTL — it persists until SessionStart cleans it up next session.
95
+ # IMPORTANT: do NOT delete flag here — that causes an infinite block loop
96
+ # if the session doesn't close immediately after approve.
97
+ POSTMORTEM_DONE=false
98
+ if [ -f "$FLAG_FILE" ]; then
99
+ POSTMORTEM_DONE=true
100
+ fi
101
+
102
+ if [ "$POSTMORTEM_DONE" = true ]; then
103
+ # Post-mortem was done — allow session to close
104
+ cat << 'HOOKEOF'
43
105
  {
44
- "decision": "approve",
45
- "systemMessage": "STOP HOOK — MANDATORY POST-MORTEM before ending (do NOT ask permission, do NOT skip):\n\n## 1. SELF-CRITIQUE (MANDATORY — write to session diary)\nAnswer these questions in the self_critique field of nexo_session_diary_write:\n- Did the user have to ask me for something I should have detected or done on my own?\n- Did I wait for the user to tell me something I could have verified proactively?\n- Are there systems/states I can check next session without being asked?\n- Did I repeat an error that already had a registered learning?\n- What would I do differently if I repeated this session?\nIf any answer is YES — write the specific rule that would prevent repetition.\nIf the session was flawless, write 'No self-critique — clean session.'\n\n## 2. SESSION BUFFER\nIf the session was NOT trivial, append ONE JSON line to ${NEXO_HOME}/brain/session_buffer.jsonl:\n{\"ts\":\"YYYY-MM-DDTHH:MM:SS\",\"tasks\":[...],\"decisions\":[...],\"user_patterns\":[...],\"files_modified\":[...],\"errors_resolved\":[...],\"self_critique\":\"short summary of what I should have done better\",\"mood\":\"focused|impatient|exploratory|frustrated|satisfied|neutral\",\"source\":\"claude\"}\n\n## 3. FOLLOWUPS\nIf there were deploys/cron changes/fixes — nexo_followup_create with verification date.\n\n## 4. PROACTIVE SEEDS\nBefore closing, think: what can I leave prepared so the next session starts doing useful work without the user asking? Create followups with date=tomorrow for proactive verifications.\n\nEntities, preferences, learnings — only if they appeared during the session."
106
+ "decision": "approve"
46
107
  }
47
108
  HOOKEOF
109
+ else
110
+ # Post-mortem NOT done — BLOCK session close and inject instructions
111
+ cat << HOOKEOF
112
+ {
113
+ "decision": "block",
114
+ "reason": "STOP HOOK — MANDATORY POST-MORTEM before ending (do NOT ask permission, do NOT skip):\n\n## 1. SELF-CRITIQUE (MANDATORY — write to session diary)\nAnswer these questions in the self_critique field of nexo_session_diary_write:\n- Did the user have to ask me for something I should have detected or done on my own?\n- Did I wait for the user to tell me something I could have verified proactively?\n- Are there systems/states I can check next session without being asked?\n- Did I repeat an error that already had a registered learning?\n- What would I do differently if I repeated this session?\nIf any answer is YES — write the specific rule that would prevent repetition.\nIf the session was flawless, write 'No self-critique — clean session.'\n\n## 2. SESSION BUFFER\nIf the session was NOT trivial, append ONE JSON line to ${NEXO_HOME}/brain/session_buffer.jsonl:\n{\"ts\":\"YYYY-MM-DDTHH:MM:SS\",\"tasks\":[...],\"decisions\":[...],\"user_patterns\":[...],\"files_modified\":[...],\"errors_resolved\":[...],\"self_critique\":\"short summary\",\"mood\":\"focused|impatient|exploratory|frustrated|satisfied|neutral\",\"source\":\"claude\"}\n\n## 3. FOLLOWUPS\nIf there were deploys/cron changes/fixes — nexo_followup_create with verification date.\n\n## 4. PROACTIVE SEEDS\nWhat can I leave prepared so the next session starts doing useful work without the user asking?\n\n## 5. MARK COMPLETE\nWhen ALL of the above is done, run:\nbash -c 'mkdir -p ${NEXO_HOME}/operations && date +%s > ${NEXO_HOME}/operations/.postmortem-complete'\nThen say goodbye. The user will close again and the hook will approve."
115
+ }
116
+ HOOKEOF
117
+ fi
48
118
 
49
- # 2. Direct session buffer fallback (Claude's MCP write is better but not guaranteed)
119
+ # 3. Direct session buffer fallback (runs for non-trivial sessions)
50
120
  BUFFER="$NEXO_HOME/brain/session_buffer.jsonl"
51
121
  TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%S")
52
122
 
53
- # Check if Claude already wrote to the buffer in the last 60 seconds
54
123
  SKIP_FALLBACK=false
55
124
  if [ -f "$BUFFER" ]; then
56
125
  LAST_SOURCE=$(python3 -c "
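The trivial-session check in the Stop hook runs an inline Python snippet over the day's tool log. A standalone version might look like this sketch. It assumes each JSONL line is an object with a `tool_name` key, as the hook's snippet implies, and it takes the path as an argument instead of interpolating `$TOOL_LOG` into the Python source. `count_meaningful_tool_calls` is a hypothetical name.

```python
import json
from pathlib import Path

# Tools the hook does not count as "meaningful" (copied from the inline script).
IGNORED_TOOLS = {"Read", "Grep", "Glob", "Bash", "ToolSearch"}


def count_meaningful_tool_calls(log_path: Path) -> int:
    """Count JSONL entries whose tool_name is set and not in IGNORED_TOOLS."""
    if not log_path.exists():
        return 0  # same as TOOL_COUNT=0 when the log file is missing
    count = 0
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines, like the hook's bare except
            tool = entry.get("tool_name", "") if isinstance(entry, dict) else ""
            if tool and tool not in IGNORED_TOOLS:
                count += 1
    return count
```

Because `TOOL_LOG` is named after the date (`${TODAY}.jsonl`), the count as written appears to cover every session logged that day rather than only the current one, so a short session that follows a busy one on the same day may not be treated as trivial.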
@@ -79,7 +148,6 @@ fi
79
148
 
80
149
  if [ "$SKIP_FALLBACK" = false ]; then
81
150
  mkdir -p "$(dirname "$BUFFER")"
82
- # Read current adaptive mode for the buffer entry
83
151
  ADAPTIVE_MODE="unknown"
84
152
  ADAPTIVE_FILE="$NEXO_HOME/brain/adaptive_state.json"
85
153
  if [ -f "$ADAPTIVE_FILE" ]; then
@@ -95,8 +163,7 @@ except:
95
163
  echo "{\"ts\":\"$TIMESTAMP\",\"tasks\":[\"session ended\"],\"decisions\":[],\"user_patterns\":[],\"files_modified\":[],\"errors_resolved\":[],\"self_critique\":\"hook-fallback, no self-critique captured\",\"mood\":\"unknown\",\"session_end_mode\":\"$ADAPTIVE_MODE\",\"source\":\"hook-fallback\"}" >> "$BUFFER" 2>/dev/null
96
164
  fi
97
165
 
98
- # 3. Intra-day reflection trigger
99
- # Check if buffer has >=3 sessions AND last reflection was >4h ago
166
+ # 4. Intra-day reflection trigger
100
167
  REFLECTION_SCRIPT="$NEXO_HOME/scripts/nexo-reflection.py"
101
168
  REFLECTION_STATE="$NEXO_HOME/coordination/reflection-log.json"
102
169
  TRIGGER_THRESHOLD=3
@@ -130,7 +197,6 @@ except:
130
197
  fi
131
198
 
132
199
  if [ "$SHOULD_REFLECT" = true ]; then
133
- # Find Python — prefer the one used by NEXO
134
200
  PYTHON=$(which python3 2>/dev/null || echo "/usr/bin/python3")
135
201
  nohup "$PYTHON" "$REFLECTION_SCRIPT" \
136
202
  >> "$NEXO_HOME/logs/reflection-stdout.log" \
@@ -9,12 +9,6 @@ from datetime import datetime, timedelta
9
9
  from db import get_db, find_similar_learnings, extract_keywords
10
10
 
11
11
 
12
- SCHEMA_CACHE_PATH = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
13
- "nexo-mcp", "schema_cache.json")
14
- # Fallback: same dir as db
15
- if not os.path.exists(SCHEMA_CACHE_PATH):
16
- SCHEMA_CACHE_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "schema_cache.json")
17
-
18
12
 
19
13
  def _load_schema_cache() -> dict:
20
14
  """Load cached DB schemas from schema_cache.json."""
@@ -117,7 +111,8 @@ def handle_guard_check(files: str = "", area: str = "", include_schemas: str = "
117
111
  ).fetchall()
118
112
  for r in rows:
119
113
  if r["id"] not in seen_ids:
120
- result["universal_rules"].append({"id": r["id"], "rule": r["title"]})
114
+ seen_ids.add(r["id"])
115
+ result["universal_rules"].append({"id": r["id"], "rule": r["title"], "category": r["category"]})
121
116
 
122
117
  # 4. DB schemas if files contain SQL keywords
123
118
  if include_schemas_bool and file_list:
@@ -141,16 +136,42 @@ def handle_guard_check(files: str = "", area: str = "", include_schemas: str = "
141
136
  elif "cloud_sql" in cache and table in cache["cloud_sql"]:
142
137
  result["schemas"][table] = cache["cloud_sql"][table]
143
138
 
144
- # 5. Check for blocking rules (5+ repetitions)
145
- for learning in result["learnings"]:
139
+ # 5. Check for blocking rules via two paths:
140
+ # (a) 5+ repetitions (existing behavior)
141
+ # (b) Learning contains NUNCA/NEVER/PROHIBIDO and matches semantically (aggressive mode)
142
+ import re
143
+ BLOCKING_KEYWORDS = re.compile(
144
+ r'\bNUNCA\b|\bNEVER\b|\bPROHIBIDO\b|\bNO\s+\w+\b|\bFORBIDDEN\b|\bBLOCKING\b|\bSIEMPRE\b|\bALWAYS\b',
145
+ re.IGNORECASE
146
+ )
147
+ # Check both learnings and universal_rules for blocking
148
+ all_candidates = [(l, "learning") for l in result["learnings"]] + \
149
+ [(u, "universal") for u in result["universal_rules"]]
150
+ blocking_seen = set()
151
+ for learning, source in all_candidates:
146
152
  lid = learning["id"]
153
+ if lid in blocking_seen:
154
+ continue
147
155
  rep_count = conn.execute(
148
156
  "SELECT COUNT(*) as cnt FROM error_repetitions WHERE original_learning_id = ?",
149
157
  (lid,)
150
158
  ).fetchone()["cnt"]
159
+
160
+ # Path (a): 5+ repetitions
151
161
  if rep_count >= 5:
162
+ blocking_seen.add(lid)
152
163
  result["blocking_rules"].append({
153
- "id": lid, "rule": learning["rule"], "repetitions": rep_count
164
+ "id": lid, "rule": learning["rule"], "repetitions": rep_count,
165
+ "reason": "repeated_error"
166
+ })
167
+ continue
168
+
169
+ # Path (b): Aggressive — learning TITLE contains prohibition keywords
170
+ if BLOCKING_KEYWORDS.search(learning["rule"]):
171
+ blocking_seen.add(lid)
172
+ result["blocking_rules"].append({
173
+ "id": lid, "rule": learning["rule"], "repetitions": rep_count,
174
+ "reason": "prohibition_keyword"
154
175
  })
155
176
 
156
177
  # 6. Area repetition rate
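The guard check now marks a learning or universal rule as blocking either when it has 5+ recorded repetitions or when its title contains a prohibition keyword, and it deduplicates ids that appear in both lists. The Python sketch below reproduces that decision order with plain data in place of the SQLite `error_repetitions` query; `classify_blocking` and its argument shapes are hypothetical. It reuses the diff's pattern verbatim, which makes one side effect easy to test: with `re.IGNORECASE`, `\bNO\s+\w+\b` matches any "no" followed by a word.

```python
import re

# Pattern copied from the guard check. With IGNORECASE, \bNO\s+\w+\b also
# matches everyday phrases such as "no longer".
BLOCKING_KEYWORDS = re.compile(
    r"\bNUNCA\b|\bNEVER\b|\bPROHIBIDO\b|\bNO\s+\w+\b|\bFORBIDDEN\b|\bBLOCKING\b|\bSIEMPRE\b|\bALWAYS\b",
    re.IGNORECASE,
)


def classify_blocking(candidates, repetitions):
    """Return blocking-rule dicts for (learning_id, rule_text) candidates.

    candidates: learnings first, then universal rules (ids may repeat).
    repetitions: mapping learning_id -> count from error_repetitions.
    """
    seen = set()
    blocking = []
    for lid, rule in candidates:
        if lid in seen:
            continue  # already reported via another list
        reps = repetitions.get(lid, 0)
        if reps >= 5:
            reason = "repeated_error"  # path (a)
        elif BLOCKING_KEYWORDS.search(rule):
            reason = "prohibition_keyword"  # path (b)
        else:
            continue
        seen.add(lid)
        blocking.append({"id": lid, "rule": rule, "repetitions": reps, "reason": reason})
    return blocking
```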
@@ -185,15 +206,6 @@ def handle_guard_check(files: str = "", area: str = "", include_schemas: str = "
185
206
  cog_top_k = 3
186
207
  cog_min_score = 0.65
187
208
 
188
- # Somatic risk lowers threshold further
189
- try:
190
- risk_result = cognitive.somatic_get_risk(file_list, area)
191
- if risk_result["max_risk"] > 0.5:
192
- cog_min_score = min(cog_min_score, 0.4)
193
- cog_top_k = max(cog_top_k, 5)
194
- except Exception:
195
- pass
196
-
197
209
  query_parts = []
198
210
  if file_list:
199
211
  query_parts.append(f"editing files: {', '.join(file_list[:5])}")
@@ -241,7 +253,11 @@ def handle_guard_check(files: str = "", area: str = "", include_schemas: str = "
241
253
  if result["blocking_rules"]:
242
254
  lines.append("BLOCKING RULES (resolve BEFORE writing):")
243
255
  for r in result["blocking_rules"]:
244
- lines.append(f" #{r['id']} ({r['repetitions']}x repeated): {r['rule']}")
256
+ reason = r.get("reason", "repeated_error")
257
+ if reason == "prohibition_keyword":
258
+ lines.append(f" #{r['id']} [PROHIBIT]: {r['rule']}")
259
+ else:
260
+ lines.append(f" #{r['id']} ({r['repetitions']}x repeated): {r['rule']}")
245
261
  lines.append("")
246
262
 
247
263
  if result["learnings"]:
@@ -0,0 +1,645 @@
1
+ #!/bin/bash
2
+ # ============================================================================
3
+ # NEXO Watchdog — Health monitor with two-level auto-repair
4
+ # ============================================================================
5
+ # Monitors all NEXO core LaunchAgents, cron jobs, and infrastructure.
6
+ # Level 1: Mechanical repair (launchctl bootstrap/kickstart, chmod)
7
+ # Level 2: Launches NEXO CLI for intelligent diagnosis and fix
8
+ #
9
+ # Install: Add to LaunchAgents for periodic execution (every 5 min recommended)
10
+ # ============================================================================
11
+ set -uo pipefail
12
+
13
+ # === PATHS ===
14
+ HOME_DIR="$HOME"
15
+ NEXO_DIR="$HOME_DIR/claude/nexo-mcp"
16
+ OPS_DIR="$HOME_DIR/claude/operations"
17
+ LOG_DIR="$HOME_DIR/claude/logs"
18
+ LOG="$LOG_DIR/watchdog.log"
19
+ STATUS_JSON="$OPS_DIR/watchdog-status.json"
20
+ REPORT_TXT="$OPS_DIR/watchdog-report.txt"
21
+ ALERT_FILE="$OPS_DIR/.watchdog-alert"
22
+ FAIL_COUNT_FILE="$HOME_DIR/claude/scripts/.watchdog-fails"
23
+ MAX_FAILS=3
24
+
25
+ mkdir -p "$LOG_DIR" "$OPS_DIR"
26
+
27
+ TS=$(date "+%Y-%m-%d %H:%M:%S")
28
+ TS_EPOCH=$(date +%s)
29
+
30
+ log() { echo "[$TS] $1" >> "$LOG"; }
31
+
32
+ # ============================================================================
33
+ # HELPER FUNCTIONS
34
+ # ============================================================================
35
+
36
+ UID_NUM=$(id -u)
37
+ REPAIR_LOG="$LOG_DIR/watchdog-repairs.log"
38
+ TOTAL_HEALED=0
39
+
40
+ log_repair() { echo "[$TS] REPAIR: $1" >> "$REPAIR_LOG"; log "REPAIR: $1"; }
41
+
42
+ is_loaded() {
43
+ launchctl list "$1" &>/dev/null
44
+ }
45
+
46
+ file_age() {
47
+ if [ -f "$1" ]; then
48
+ local mod_epoch
49
+ # macOS: stat -f %m, Linux: stat -c %Y
50
+ mod_epoch=$(stat -f %m "$1" 2>/dev/null || stat -c %Y "$1" 2>/dev/null || echo 0)
51
+ echo $(( TS_EPOCH - mod_epoch ))
52
+ else
53
+ echo 999999
54
+ fi
55
+ }
56
+
57
+ format_age() {
58
+ local secs=$1
59
+ if [ "$secs" -ge 999999 ]; then
60
+ echo "never"
61
+ elif [ "$secs" -ge 86400 ]; then
62
+ echo "$((secs / 86400))d $((secs % 86400 / 3600))h ago"
63
+ elif [ "$secs" -ge 3600 ]; then
64
+ echo "$((secs / 3600))h $((secs % 3600 / 60))m ago"
65
+ elif [ "$secs" -ge 60 ]; then
66
+ echo "$((secs / 60))m ago"
67
+ else
68
+ echo "${secs}s ago"
69
+ fi
70
+ }
71
+
72
+ check_errors() {
73
+ local logfile="$1"
74
+ if [ -f "$logfile" ] && [ -s "$logfile" ]; then
75
+ tail -50 "$logfile" 2>/dev/null | grep -cE "$ERROR_PATTERNS" 2>/dev/null || echo 0
76
+ else
77
+ echo 0
78
+ fi
79
+ }
80
+
81
+ process_running() {
82
+ if [ -n "$1" ]; then
83
+ pgrep -f "$1" > /dev/null 2>&1
84
+ else
85
+ return 1
86
+ fi
87
+ }
88
+
89
+ json_escape() {
90
+ echo "$1" | sed 's/\\/\\\\/g; s/"/\\"/g' | tr '\t\n' '  '
91
+ }
92
+
93
+ # ============================================================================
94
+ # AUTO-REPAIR FUNCTIONS
95
+ # ============================================================================
96
+
97
+ try_repair_launchagent() {
98
+ local plist_id="$1"
99
+ local proc_grep="$2"
100
+ local plist_file="$HOME_DIR/Library/LaunchAgents/${plist_id}.plist"
101
+
102
+ # Repair 1: Not loaded — try to bootstrap
103
+ if ! is_loaded "$plist_id"; then
104
+ if [ -f "$plist_file" ]; then
105
+ launchctl bootstrap "gui/$UID_NUM" "$plist_file" 2>/dev/null
106
+ sleep 1
107
+ if is_loaded "$plist_id"; then
108
+ log_repair "$plist_id: bootstrapped successfully"
109
+ return 0
110
+ fi
111
+ fi
112
+ return 1
113
+ fi
114
+
115
+ # Repair 2: Loaded but process not running (KeepAlive) — kickstart
116
+ if [ -n "$proc_grep" ] && ! process_running "$proc_grep"; then
117
+ launchctl kickstart "gui/$UID_NUM/$plist_id" 2>/dev/null
118
+ sleep 2
119
+ if process_running "$proc_grep"; then
120
+ log_repair "$plist_id: kickstarted process '$proc_grep'"
121
+ return 0
122
+ fi
123
+ fi
124
+
125
+ return 1
126
+ }
127
+
128
+ try_repair_cron() {
129
+ local script="$1"
130
+
131
+ if [ -f "$script" ] && [ ! -x "$script" ]; then
132
+ chmod +x "$script"
133
+ if [ -x "$script" ]; then
134
+ log_repair "$script: made executable"
135
+ return 0
136
+ fi
137
+ fi
138
+
139
+ return 1
140
+ }
141
+
142
+ try_repair_backup() {
143
+ local backup_script="$NEXO_DIR/backup_cron.sh"
144
+ if [ -x "$backup_script" ]; then
145
+ "$backup_script" 2>/dev/null
146
+ sleep 1
147
+ local newest
148
+ newest=$(ls -t "$NEXO_DIR/backups/nexo-"*.db 2>/dev/null | head -1)
149
+ if [ -n "$newest" ]; then
150
+ local age
151
+ age=$(file_age "$newest")
152
+ if [ "$age" -lt 60 ]; then
153
+ log_repair "backup_cron.sh: ran successfully, fresh backup created"
154
+ return 0
155
+ fi
156
+ fi
157
+ fi
158
+ return 1
159
+ }
160
+
161
+ # ============================================================================
162
+ # MONITOR REGISTRY — NEXO Core Services
163
+ # ============================================================================
164
+ # Format: NAME|PLIST_ID|LOG_STDOUT|LOG_STDERR|MAX_STALE_SECS|PROCESS_GREP|SCHEDULE_DESC
165
+ #
166
+ # Users can add custom monitors in ~/claude/config/watchdog-monitors.conf
167
+ # (same format, one per line, # for comments)
168
+ # ============================================================================
169
+ MONITORS=(
170
+ "Auto-Close Sessions|com.nexo.auto-close-sessions|$HOME_DIR/claude/coordination/auto-close-stdout.log|$HOME_DIR/claude/coordination/auto-close-stderr.log|900||Every 5 min"
171
+ "Catchup|com.nexo.catchup|$HOME_DIR/claude/logs/catchup-stdout.log|$HOME_DIR/claude/logs/catchup-stderr.log|0||RunAtLoad once"
172
+ "Cognitive Decay|com.nexo.cognitive-decay|$HOME_DIR/claude/logs/cognitive-decay-stdout.log|$HOME_DIR/claude/logs/cognitive-decay-stderr.log|90000||Daily 3:00 AM"
173
+ "Evolution|com.nexo.evolution|$HOME_DIR/claude/logs/evolution-stdout.log|$HOME_DIR/claude/logs/evolution-stderr.log|0||Weekly Sun 3:00 AM"
174
+ "GitHub Monitor|com.nexo.github-monitor|$HOME_DIR/claude/logs/github-monitor-stdout.log|$HOME_DIR/claude/logs/github-monitor-stderr.log|90000||Daily 8:00 AM"
175
+ "Immune|com.nexo.immune|$HOME_DIR/claude/coordination/immune-stdout.log|$HOME_DIR/claude/coordination/immune-stderr.log|3600||Every 30 min"
176
+ "Postmortem|com.nexo.postmortem|$HOME_DIR/claude/logs/postmortem-stdout.log|$HOME_DIR/claude/logs/postmortem-stderr.log|90000||Daily 23:30"
177
+ "Prevent Sleep|com.nexo.prevent-sleep|||0|caffeinate|KeepAlive"
178
+ "Self Audit|com.nexo.self-audit|$HOME_DIR/claude/logs/self-audit-stdout.log|$HOME_DIR/claude/logs/self-audit-stderr.log|90000||Daily 7:00 AM"
179
+ "Sleep|com.nexo.sleep|$HOME_DIR/claude/coordination/sleep-stdout.log|$HOME_DIR/claude/coordination/sleep-stderr.log|90000||Daily 4:00 AM"
180
+ "Synthesis|com.nexo.synthesis|$HOME_DIR/claude/coordination/synthesis-stdout.log|$HOME_DIR/claude/coordination/synthesis-stderr.log|10800||Every 2 hours"
181
+ )
182
+
183
+ # Load user-defined monitors if file exists
184
+ USER_MONITORS_FILE="$HOME_DIR/claude/config/watchdog-monitors.conf"
185
+ if [ -f "$USER_MONITORS_FILE" ]; then
186
+ while IFS= read -r line; do
187
+ [[ "$line" =~ ^[[:space:]]*# ]] && continue
188
+ [[ -z "$line" ]] && continue
189
+ MONITORS+=("$line")
190
+ done < "$USER_MONITORS_FILE"
191
+ fi
192
+
193
+ # Cron jobs to check (NAME|SCRIPT|CHECK_PATH|MAX_STALE_SECS|SCHEDULE)
194
+ CRON_MONITORS=(
195
+ "Backup Cron|$NEXO_DIR/backup_cron.sh|$NEXO_DIR/backups/|7200|Hourly"
196
+ )
197
+
198
+ # Error patterns to search in stderr logs (last 50 lines)
199
+ ERROR_PATTERNS="Traceback|Error:|CRITICAL|FATAL|ModuleNotFoundError|PermissionError|FileNotFoundError|ConnectionRefused|Errno"
200
+
201
+ # ============================================================================
202
+ # RUN CHECKS
203
+ # ============================================================================
204
+
205
+ TOTAL_PASS=0
206
+ TOTAL_WARN=0
207
+ TOTAL_FAIL=0
208
+ JSON_AGENTS=""
209
+ REPORT_LINES=""
210
+ FAILED_MONITORS=() # Track failed monitors for Level 2 repair
211
+
212
+ for monitor in "${MONITORS[@]}"; do
213
+ [[ "$monitor" =~ ^[[:space:]]*# ]] && continue
214
+ IFS='|' read -r name plist_id log_stdout log_stderr max_stale proc_grep schedule <<< "$monitor"
215
+
216
+ status="PASS"
217
+ details=""
218
+ loaded="unknown"
219
+ stale_age="n/a"
220
+ error_count=0
221
+ proc_alive="n/a"
222
+
223
+ # Check 1: LaunchAgent loaded?
224
+ if is_loaded "$plist_id"; then
225
+ loaded="yes"
226
+ else
227
+ loaded="no"
228
+ if try_repair_launchagent "$plist_id" "$proc_grep"; then
229
+ loaded="yes"
230
+ status="HEALED"
231
+ details="${details}Self-healed: bootstrapped. "
232
+ TOTAL_HEALED=$((TOTAL_HEALED + 1))
233
+ else
234
+ status="FAIL"
235
+ details="${details}Not loaded in launchctl (repair failed). "
236
+ fi
237
+ fi
238
+
239
+ # Check 2: Process alive? (only for KeepAlive / long-running)
240
+ if [ -n "$proc_grep" ]; then
241
+ if process_running "$proc_grep"; then
242
+ proc_alive="yes"
243
+ else
244
+ proc_alive="no"
245
+ if [ "$status" != "FAIL" ] && [ "$status" != "HEALED" ]; then
246
+ if try_repair_launchagent "$plist_id" "$proc_grep"; then
247
+ proc_alive="yes"
248
+ status="HEALED"
249
+ details="${details}Self-healed: kickstarted. "
250
+ TOTAL_HEALED=$((TOTAL_HEALED + 1))
251
+ else
252
+ status="WARN"
253
+ details="${details}Process '$proc_grep' not running (repair failed). "
254
+ fi
255
+ elif [ "$status" = "HEALED" ]; then
256
+ sleep 1
257
+ if process_running "$proc_grep"; then
258
+ proc_alive="yes"
259
+ else
260
+ details="${details}Process '$proc_grep' still not running after bootstrap. "
261
+ fi
262
+ fi
263
+ fi
264
+ fi
265
+
266
+ # Check 3: Log staleness
267
+ if [ -n "$log_stdout" ] && [ "$max_stale" -gt 0 ]; then
268
+ age=$(file_age "$log_stdout")
269
+ stale_age=$(format_age "$age")
270
+ if [ "$age" -gt $(( max_stale * 3 )) ]; then
271
+ status="FAIL"
272
+ details="${details}Log stale: $stale_age (limit: $(format_age "$max_stale")). "
273
+ elif [ "$age" -gt "$max_stale" ]; then
274
+ [ "$status" = "PASS" ] && status="WARN"
275
+ details="${details}Log slightly stale: $stale_age. "
276
+ fi
277
+ elif [ -n "$log_stdout" ]; then
278
+ if [ -f "$log_stdout" ]; then
279
+ age=$(file_age "$log_stdout")
280
+ stale_age=$(format_age "$age")
281
+ else
282
+ stale_age="no log file"
283
+ fi
284
+ fi
285
+
286
+ # Check 4: Errors in stderr log
287
+ if [ -n "$log_stderr" ]; then
288
+ error_count=$(check_errors "$log_stderr")
289
+ if [ "$error_count" -gt 5 ]; then
290
+ [ "$status" = "PASS" ] && status="WARN"
291
+ details="${details}${error_count} errors in recent stderr. "
292
+ fi
293
+ fi
294
+
295
+ [ -z "$details" ] && details="All checks passed"
296
+
297
+ case "$status" in
298
+ PASS|HEALED) TOTAL_PASS=$((TOTAL_PASS + 1)) ;;
299
+ WARN) TOTAL_WARN=$((TOTAL_WARN + 1)) ;;
300
+ FAIL)
301
+ TOTAL_FAIL=$((TOTAL_FAIL + 1))
302
+ FAILED_MONITORS+=("${name}|${plist_id}|${log_stdout}|${log_stderr}|${proc_grep}|${schedule}|${details}")
303
+ ;;
304
+ esac
305
+
306
+ # JSON
307
+ escaped_details=$(json_escape "$details")
308
+ json_item=" {\"name\":\"$name\",\"plist\":\"$plist_id\",\"status\":\"$status\",\"loaded\":\"$loaded\",\"process\":\"$proc_alive\",\"last_activity\":\"$stale_age\",\"stderr_errors\":$error_count,\"schedule\":\"$schedule\",\"details\":\"$escaped_details\"}"
309
+ [ -n "$JSON_AGENTS" ] && JSON_AGENTS="${JSON_AGENTS},
310
+ ${json_item}" || JSON_AGENTS="$json_item"
311
+
312
+ # Report
313
+ case "$status" in
314
+ PASS) icon="PASS" ;; HEALED) icon="HEAL" ;; WARN) icon="WARN" ;; FAIL) icon="FAIL" ;; *) icon="????" ;;
315
+ esac
316
+ REPORT_LINES="${REPORT_LINES} [${icon}] ${name} (${schedule})
317
+ Loaded: ${loaded} | Process: ${proc_alive} | Last: ${stale_age} | Errors: ${error_count}
318
+ ${details}
319
+ "
320
+ done
321
+
322
+ # --- Cron job checks ---
323
+ CRON_JSON=""
324
+ CRON_REPORT=""
325
+ for cron_entry in "${CRON_MONITORS[@]}"; do
326
+ IFS='|' read -r name script check_path max_stale schedule <<< "$cron_entry"
327
+
328
+ c_status="PASS"
329
+ c_details=""
330
+ age_str="n/a"
331
+
332
+ if [ ! -x "$script" ]; then
333
+ if try_repair_cron "$script"; then
334
+ c_status="HEALED"
335
+ c_details="Self-healed: made executable. "
336
+ TOTAL_HEALED=$((TOTAL_HEALED + 1))
337
+ else
338
+ c_status="FAIL"
339
+ c_details="Script not executable or missing (repair failed). "
340
+ fi
341
+ fi
342
+
343
+ if [ -d "$check_path" ]; then
344
+ newest=$(ls -t "$check_path" 2>/dev/null | head -1)
345
+ if [ -n "$newest" ]; then
346
+ age=$(file_age "${check_path}${newest}")
347
+ age_str=$(format_age "$age")
348
+ if [ "$age" -gt $(( max_stale * 3 )) ]; then
349
+ c_status="FAIL"
350
+ c_details="${c_details}Output stale: $age_str. "
351
+ elif [ "$age" -gt "$max_stale" ]; then
352
+ [ "$c_status" = "PASS" ] && c_status="WARN"
353
+ c_details="${c_details}Output slightly stale: $age_str. "
354
+ fi
355
+ else
356
+ c_status="WARN"
357
+ c_details="${c_details}No output files found. "
358
+ age_str="no files"
359
+ fi
360
+ elif [ -f "$check_path" ]; then
361
+ age=$(file_age "$check_path")
362
+ age_str=$(format_age "$age")
363
+ if [ "$age" -gt $(( max_stale * 3 )) ]; then
364
+ c_status="FAIL"
365
+ c_details="${c_details}Output stale: $age_str. "
366
+ elif [ "$age" -gt "$max_stale" ]; then
367
+ [ "$c_status" = "PASS" ] && c_status="WARN"
368
+ c_details="${c_details}Output slightly stale: $age_str. "
369
+ fi
370
+ fi
371
+
372
+ [ -z "$c_details" ] && c_details="All checks passed"
373
+
374
+ case "$c_status" in
375
+ PASS|HEALED) TOTAL_PASS=$((TOTAL_PASS + 1)) ;;
376
+ WARN) TOTAL_WARN=$((TOTAL_WARN + 1)) ;;
377
+ FAIL) TOTAL_FAIL=$((TOTAL_FAIL + 1)) ;;
378
+ esac
379
+
380
+ escaped_details=$(json_escape "$c_details")
381
+ cron_item=" {\"name\":\"$name\",\"script\":\"$(json_escape "$script")\",\"status\":\"$c_status\",\"last_output\":\"$age_str\",\"schedule\":\"$schedule\",\"details\":\"$escaped_details\"}"
382
+ [ -n "$CRON_JSON" ] && CRON_JSON="${CRON_JSON},
383
+ ${cron_item}" || CRON_JSON="$cron_item"
384
+
385
+ case "$c_status" in
386
+ PASS) icon="PASS" ;; HEALED) icon="HEAL" ;; WARN) icon="WARN" ;; FAIL) icon="FAIL" ;; *) icon="????" ;;
387
+ esac
388
+ CRON_REPORT="${CRON_REPORT} [${icon}] ${name} (${schedule})
389
+ Last output: ${age_str}
390
+ ${c_details}
391
+ "
392
+ done
393
+
394
+ # ============================================================================
395
+ # INFRASTRUCTURE CHECKS
396
+ # ============================================================================
397
+
398
+ # --- SQLite integrity ---
399
+ SQLITE_STATUS="PASS"
400
+ SQLITE_DETAIL=""
401
+ INTEGRITY=$(sqlite3 -cmd ".timeout 5000" "$NEXO_DIR/nexo.db" "PRAGMA integrity_check;" 2>/dev/null || echo "CORRUPT")
402
+ if [ "$INTEGRITY" != "ok" ]; then
403
+ SQLITE_STATUS="FAIL"
404
+ SQLITE_DETAIL="Integrity check: $INTEGRITY"
405
+ log "CRITICAL: SQLite integrity check failed: $INTEGRITY"
406
+ TOTAL_FAIL=$((TOTAL_FAIL + 1))
407
+ # Save corrupt copy before restoring
408
+ cp "$NEXO_DIR/nexo.db" "$NEXO_DIR/nexo.db.corrupt.$(date +%s)" 2>/dev/null
409
+ LATEST_BACKUP=$(ls -t "$NEXO_DIR/backups/nexo-"*.db 2>/dev/null | head -1)
410
+ if [ -n "$LATEST_BACKUP" ]; then
411
+ cp "$LATEST_BACKUP" "$NEXO_DIR/nexo.db"
412
+ log "RESTORED from $LATEST_BACKUP"
413
+ SQLITE_DETAIL="${SQLITE_DETAIL}. Restored from backup."
414
+ fi
415
+ else
416
+ SQLITE_DETAIL="Integrity OK"
417
+ TOTAL_PASS=$((TOTAL_PASS + 1))
418
+ fi
419
+
420
+ # --- Cognitive DB check ---
421
+ COG_STATUS="PASS"
422
+ COG_DETAIL=""
423
+ COG_DB="$NEXO_DIR/cognitive.db"
424
+ if [ -f "$COG_DB" ]; then
425
+ COG_INT=$(sqlite3 -cmd ".timeout 5000" "$COG_DB" "PRAGMA integrity_check;" 2>/dev/null || echo "CORRUPT")
426
+ if [ "$COG_INT" != "ok" ]; then
427
+ COG_STATUS="FAIL"
428
+ COG_DETAIL="Cognitive DB integrity: $COG_INT"
429
+ TOTAL_FAIL=$((TOTAL_FAIL + 1))
430
+ else
431
+ COG_DETAIL="Integrity OK"
432
+ TOTAL_PASS=$((TOTAL_PASS + 1))
433
+ fi
434
+ else
435
+ COG_STATUS="WARN"
436
+ COG_DETAIL="cognitive.db not found"
437
+ TOTAL_WARN=$((TOTAL_WARN + 1))
438
+ fi
439
+
440
+ # --- Backup freshness ---
441
+ BACKUP_STATUS="PASS"
442
+ BACKUP_DETAIL=""
443
+ LATEST_BACKUP=$(ls -t "$NEXO_DIR/backups/nexo-"*.db 2>/dev/null | head -1)
444
+ if [ -n "$LATEST_BACKUP" ]; then
445
+ BACKUP_AGE=$(file_age "$LATEST_BACKUP")
446
+ BACKUP_AGE_STR=$(format_age "$BACKUP_AGE")
447
+ if [ "$BACKUP_AGE" -gt 7200 ]; then
448
+ if try_repair_backup; then
449
+ BACKUP_STATUS="HEALED"
450
+ BACKUP_DETAIL="Self-healed: backup was stale ($BACKUP_AGE_STR), ran fresh backup"
451
+ TOTAL_HEALED=$((TOTAL_HEALED + 1))
452
+ TOTAL_PASS=$((TOTAL_PASS + 1))
453
+ else
454
+ BACKUP_STATUS="WARN"
455
+ BACKUP_DETAIL="Last backup: $BACKUP_AGE_STR (>2h, repair failed)"
456
+ TOTAL_WARN=$((TOTAL_WARN + 1))
457
+ fi
458
+ else
459
+ BACKUP_DETAIL="Last backup: $BACKUP_AGE_STR"
460
+ TOTAL_PASS=$((TOTAL_PASS + 1))
461
+ fi
462
+ else
463
+ BACKUP_STATUS="FAIL"
464
+ BACKUP_DETAIL="No backups found"
465
+ TOTAL_FAIL=$((TOTAL_FAIL + 1))
466
+ fi
467
+
468
+ # ============================================================================
469
+ # WRITE JSON STATUS
470
+ # ============================================================================
471
+ TOTAL=$((TOTAL_PASS + TOTAL_WARN + TOTAL_FAIL))
472
+ OVERALL="PASS"
473
+ [ "$TOTAL_WARN" -gt 0 ] && OVERALL="WARN"
474
+ [ "$TOTAL_FAIL" -gt 0 ] && OVERALL="FAIL"
475
+
476
+ cat > "$STATUS_JSON" <<JSONEOF
477
+ {
478
+ "timestamp": "$TS",
479
+ "summary": {
480
+ "total": $TOTAL,
481
+ "pass": $TOTAL_PASS,
482
+ "warn": $TOTAL_WARN,
483
+ "fail": $TOTAL_FAIL,
484
+ "healed": $TOTAL_HEALED,
485
+ "overall": "$OVERALL"
486
+ },
487
+ "launch_agents": [
488
+ $JSON_AGENTS
489
+ ],
490
+ "cron_jobs": [
491
+ $CRON_JSON
492
+ ],
493
+ "infrastructure": {
494
+ "sqlite": {"status": "$SQLITE_STATUS", "detail": "$(json_escape "$SQLITE_DETAIL")"},
495
+ "cognitive_db": {"status": "$COG_STATUS", "detail": "$(json_escape "$COG_DETAIL")"},
496
+ "backups": {"status": "$BACKUP_STATUS", "detail": "$(json_escape "$BACKUP_DETAIL")"}
497
+ }
498
+ }
499
+ JSONEOF
500
+
501
+ # ============================================================================
502
+ # WRITE HUMAN-READABLE REPORT
503
+ # ============================================================================
504
+ cat > "$REPORT_TXT" <<REPORTEOF
505
+ ======================================================
506
+ NEXO WATCHDOG REPORT — $TS
507
+ ======================================================
508
+ PASS: $TOTAL_PASS | HEALED: $TOTAL_HEALED | WARN: $TOTAL_WARN | FAIL: $TOTAL_FAIL | TOTAL: $TOTAL
509
+ OVERALL: $OVERALL
510
+ ======================================================
511
+
512
+ -- LaunchAgents (${#MONITORS[@]}) ---------------------
513
+ $REPORT_LINES
514
+ -- Cron Jobs ------------------------------------------
515
+ $CRON_REPORT
516
+ -- Infrastructure -------------------------------------
517
+ [$SQLITE_STATUS] SQLite nexo.db: $SQLITE_DETAIL
518
+ [$COG_STATUS] Cognitive DB: $COG_DETAIL
519
+ [$BACKUP_STATUS] Backups: $BACKUP_DETAIL
520
+
521
+ -- End of Report --------------------------------------
522
+ REPORTEOF
523
+
524
+ # ============================================================================
525
+ # ALERT FILE
526
+ # ============================================================================
527
+ if [ "$TOTAL_FAIL" -gt 0 ]; then
528
+ {
529
+ echo "timestamp=$TS"
530
+ echo "fail_count=$TOTAL_FAIL"
531
+ echo "warn_count=$TOTAL_WARN"
532
+ echo "failures:"
533
+ grep '\[FAIL\]' "$REPORT_TXT" | head -10 | sed 's/^/ /'
534
+ } > "$ALERT_FILE"
535
+ log "ALERT: $TOTAL_FAIL failures detected"
536
+ else
537
+ rm -f "$ALERT_FILE"
538
+ fi
539
+
540
+ # ============================================================================
541
+ # CONSECUTIVE FAILURE TRACKING
542
+ # ============================================================================
543
+ FAILS=$(cat "$FAIL_COUNT_FILE" 2>/dev/null || echo 0)
544
+ if [ "$TOTAL_FAIL" -gt 0 ]; then
545
+ FAILS=$((FAILS + 1))
546
+ echo "$FAILS" > "$FAIL_COUNT_FILE"
547
+ if [ "$FAILS" -ge "$MAX_FAILS" ]; then
548
+ log "ALERT: $FAILS consecutive runs with failures"
549
+ fi
550
+ else
551
+ echo "0" > "$FAIL_COUNT_FILE"
552
+ fi
553
+
554
+ # ============================================================================
555
+ # LEVEL 2 AUTO-REPAIR: Launch NEXO for intelligent diagnosis
556
+ # ============================================================================
557
+ REPAIR_LOCK="$HOME_DIR/claude/scripts/.watchdog-nexo-repair.lock"
558
+ REPAIR_COOLDOWN=1800 # 30 min between NEXO repair attempts
559
+
560
+ if [ "$TOTAL_FAIL" -gt 0 ]; then
561
+ LOCK_AGE=999999
562
+ SKIP_REPAIR=false
563
+ if [ -f "$REPAIR_LOCK" ]; then
564
+ LOCK_AGE=$(file_age "$REPAIR_LOCK")
565
+ if [ "$LOCK_AGE" -lt "$REPAIR_COOLDOWN" ]; then
566
+ log "NEXO repair skipped: cooldown (${LOCK_AGE}s < ${REPAIR_COOLDOWN}s)"
567
+ SKIP_REPAIR=true
568
+ fi
569
+ fi
570
+
571
+ if ! $SKIP_REPAIR; then
572
+ # Collect failure details from tracked FAILED_MONITORS array
573
+ FAIL_DETAILS=""
574
+ for failed in "${FAILED_MONITORS[@]}"; do
575
+ IFS='|' read -r m_name m_plist m_stdout m_stderr m_proc m_sched m_details <<< "$failed"
576
+ STDERR_TAIL=""
577
+ if [ -n "$m_stderr" ] && [ -f "$m_stderr" ]; then
578
+ STDERR_TAIL=$(tail -n 20 "$m_stderr" 2>/dev/null)
579
+ fi
580
+ STDOUT_TAIL=""
581
+ if [ -n "$m_stdout" ] && [ -f "$m_stdout" ]; then
582
+ STDOUT_TAIL=$(tail -n 10 "$m_stdout" 2>/dev/null)
583
+ fi
584
+ FAIL_DETAILS="${FAIL_DETAILS}
585
+ --- ${m_name} (${m_plist}) ---
586
+ Schedule: ${m_sched}
587
+ Failure reason: ${m_details}
588
+ Plist: ~/Library/LaunchAgents/${m_plist}.plist
589
+ Process grep: ${m_proc}
590
+ Stderr (last 20 lines):
591
+ ${STDERR_TAIL}
592
+ Stdout (last 10 lines):
593
+ ${STDOUT_TAIL}
594
+ "
595
+ done
596
+
597
+ if [ -n "$FAIL_DETAILS" ]; then
598
+ touch "$REPAIR_LOCK"
599
+ log "Launching NEXO Level 2 repair..."
600
+
601
+ REPAIR_PROMPT_FILE=$(mktemp /tmp/nexo-repair-XXXXXX)
602
+ cat > "$REPAIR_PROMPT_FILE" <<NEXOPROMPT
603
+ WATCHDOG LEVEL 2 REPAIR — Automated diagnostic session.
604
+
605
+ The watchdog detected services that failed mechanical repair (launchctl bootstrap/kickstart).
606
+ Your job: diagnose the root cause and fix it. Do NOT ask the user anything — resolve autonomously.
607
+
608
+ FAILURES:
609
+ ${FAIL_DETAILS}
610
+
611
+ STEPS:
612
+ 1. Read the plist file to understand the service configuration
613
+ 2. Check stderr/stdout logs for the actual error
614
+ 3. Fix the root cause (missing file, bad config, dependency issue, etc.)
615
+ 4. Reload the service and verify it is running
616
+ 5. Log what you did to ~/claude/logs/watchdog-repair-result.log
617
+
618
+ CONSTRAINTS:
619
+ - Do NOT modify CLAUDE.md or any protected file
620
+ - Do NOT start interactive conversations
621
+ - Keep it under 5 minutes
622
+ - Log what you did to ~/claude/logs/watchdog-repair-result.log
623
+ NEXOPROMPT
624
+
625
+ # Find claude CLI (may not be in PATH for cron/LaunchAgent)
626
+ CLAUDE_BIN=$(command -v claude 2>/dev/null || echo "$HOME_DIR/.claude/local/bin/claude")
627
+ if [ ! -x "$CLAUDE_BIN" ]; then
628
+ CLAUDE_BIN=$(find /usr/local/bin /opt/homebrew/bin "$HOME_DIR/.local/bin" "$HOME_DIR/.npm-global/bin" -maxdepth 1 -name claude \( -type f -o -type l \) 2>/dev/null | head -1)
629
+ fi
630
+
631
+ if [ -n "$CLAUDE_BIN" ] && [ -x "$CLAUDE_BIN" ]; then
632
+ nohup bash -c "\"$CLAUDE_BIN\" --print --dangerously-skip-permissions -p \"\$(cat '$REPAIR_PROMPT_FILE')\" >> '$LOG_DIR/watchdog-nexo-repair.log' 2>&1; rm -f '$REPAIR_PROMPT_FILE'" &
633
+ log "NEXO repair launched (PID: $!)"
634
+ else
635
+ log "NEXO repair ABORTED: claude CLI not found (checked PATH and common install dirs)"
636
+ rm -f "$REPAIR_PROMPT_FILE"
637
+ fi
638
+ fi
639
+ fi
640
+ fi
641
+
642
+ # ============================================================================
643
+ # LOG SUMMARY
644
+ # ============================================================================
645
+ log "Complete: PASS=$TOTAL_PASS HEALED=$TOTAL_HEALED WARN=$TOTAL_WARN FAIL=$TOTAL_FAIL"
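The cron-job checks in this script grade an output file by age: older than `max_stale` seconds is WARN, older than three times `max_stale` is FAIL, and a "slightly stale" result only downgrades a clean PASS. The sketch below restates that rule as a standalone Bash function so the thresholds can be exercised in isolation. `classify_staleness` is a hypothetical name, not part of nexo-brain, and ages are assumed to be whole seconds as returned by the script's own `file_age` helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of nexo-brain): grades an output age the
# same way the cron-job loop does.
#   $1 = current status (PASS, HEALED, WARN or FAIL)
#   $2 = age of the newest output, in seconds
#   $3 = max_stale threshold, in seconds
# Prints the resulting status.
classify_staleness() {
  local status="$1" age="$2" max_stale="$3"
  if [ "$age" -gt $(( max_stale * 3 )) ]; then
    # More than 3x the allowed age is always a hard failure.
    echo "FAIL"
  elif [ "$age" -gt "$max_stale" ]; then
    # Slightly stale only downgrades a clean PASS; other statuses are kept.
    if [ "$status" = "PASS" ]; then echo "WARN"; else echo "$status"; fi
  else
    echo "$status"
  fi
}

# Usage: a job expected hourly (3600 s) whose last output is 2 h old.
classify_staleness PASS 7200 3600   # prints WARN
```

Keeping the "never overwrite an earlier FAIL" guard inside the helper is what prevents a missing or non-executable script from being reported as a mere WARN.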
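Level 2 repair is rate-limited through the modification time of `$REPAIR_LOCK`: the script touches the lock just before launching the agent and skips repair while the lock is younger than `REPAIR_COOLDOWN` (1800 s). `file_age` is defined elsewhere in the script and is not shown here, so the sketch below assumes it returns seconds since last modification. `file_age_sketch` and `repair_allowed` are hypothetical names, and the GNU-then-BSD `stat` fallback is an assumption about portability, not the package's code.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the script's file_age: seconds since the file
# was last modified. Tries GNU stat (-c %Y) first, then BSD/macOS stat (-f %m).
file_age_sketch() {
  local mtime
  mtime=$(stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null) || return 1
  echo $(( $(date +%s) - mtime ))
}

# Returns 0 (repair allowed) when no lock exists or the lock is at least
# $2 seconds old; returns 1 while the cooldown is still active.
repair_allowed() {
  local lock="$1" cooldown="$2" age
  [ -f "$lock" ] || return 0
  age=$(file_age_sketch "$lock") || return 0
  [ "$age" -ge "$cooldown" ]
}

# Usage: a freshly created lock blocks repair; a lock backdated to 2000 does not.
lock=$(mktemp)
repair_allowed "$lock" 1800 || echo "skipped: cooldown"
touch -t 200001010000 "$lock"
repair_allowed "$lock" 1800 && echo "repair allowed"
rm -f "$lock"
```

Using the lock file's mtime rather than its contents means no state has to be parsed, and a crashed repair run cannot leave the watchdog blocked for longer than one cooldown window.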