openclaw-aegis 1.3.0 → 1.4.0

package/README.md CHANGED
@@ -1,27 +1,34 @@
1
- # OpenClaw Aegis
1
+ <p align="center">
2
+ <img src="assets/cover.jpg" alt="OpenClaw Aegis — Self-Healing Sidecar for OpenClaw Gateway" width="820" height="450" />
3
+ </p>
2
4
 
3
- **Self-healing sidecar for the OpenClaw Gateway.**
5
+ <p align="center">
6
+ <a href="https://www.npmjs.com/package/openclaw-aegis"><img src="https://img.shields.io/npm/v/openclaw-aegis" alt="npm" /></a>
7
+ <a href="https://github.com/Canary-Builds/openclaw-aegis/actions/workflows/ci.yml"><img src="https://img.shields.io/github/actions/workflow/status/Canary-Builds/openclaw-aegis/ci.yml?label=CI" alt="CI" /></a>
8
+ <a href="https://nodejs.org"><img src="https://img.shields.io/badge/node-%3E%3D18-brightgreen" alt="Node.js" /></a>
9
+ <a href="LICENSE"><img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License: MIT" /></a>
10
+ </p>
4
11
 
5
- Aegis monitors your OpenClaw gateway, detects failures in seconds, fixes them automatically, and alerts you through out-of-band channels that don't depend on the gateway being up.
12
+ ---
6
13
 
7
- [![npm](https://img.shields.io/npm/v/openclaw-aegis)](https://www.npmjs.com/package/openclaw-aegis)
8
- [![CI](https://github.com/Canary-Builds/openclaw-aegis/actions/workflows/ci.yml/badge.svg)](https://github.com/Canary-Builds/openclaw-aegis/actions/workflows/ci.yml)
9
- [![Node.js](https://img.shields.io/badge/node-%3E%3D18-brightgreen)](https://nodejs.org)
10
- [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
14
+ ## The Armor Your Gateway Deserves
11
15
 
12
- ---
16
+ When your OpenClaw gateway goes down, **everything goes dark** — Telegram, WhatsApp, all channels. Silent. No alerts, no warnings, nothing. If a bad config caused the crash, restarting won't help. The `.bak` files carry the same poison. You only find out hours later when someone asks why messages stopped.
13
17
 
14
- ## Why Aegis?
18
+ **Aegis doesn't let that happen.**
15
19
 
16
- When your OpenClaw gateway crashes, **everything goes dark**: Telegram, WhatsApp, all channels. If the crash was caused by a bad config, restarting doesn't help. The `.bak` files may contain the same poison. You only find out when you notice messages aren't arriving.
20
+ It stands between your gateway and disaster, a tireless sentinel that detects failures in seconds, diagnoses the root cause, repairs what it can, and alerts you through channels that bypass the gateway entirely.
17
21
 
18
- Aegis prevents this:
22
+ ### What It Does
19
23
 
20
- 1. **Detects** failures via 10 health probes (process, port, config, memory, CPU, disk, logs, network, WebSocket, HTTP)
21
- 2. **Diagnoses** the root cause using 6 failure pattern matchers
22
- 3. **Fixes** automatically: restores known-good config, clears stale PIDs, runs safe `doctor --fix`
23
- 4. **Alerts** you through channels that bypass the gateway entirely (ntfy, Telegram, WhatsApp, Slack, Discord, Email, Pushover, webhook)
24
- 5. **Responds** to bot commands: message `/health` on Telegram, WhatsApp, Slack, or Discord and get real-time status
24
+ | | |
25
+ |---|---|
26
+ | **Detects** | 10 health probes scan process, port, HTTP, config, WebSocket, TUN, memory, CPU, disk, and logs every 10 seconds |
27
+ | **Diagnoses** | 6 failure pattern matchers identify poison configs, stale PIDs, port conflicts, permission errors, corruption, and OOM kills |
28
+ | **Heals** | L1 restart, L2 targeted repair, L3 deep repair (network, dependencies, safe mode, disk), config rollback, all automatic |
29
+ | **Alerts** | 8 out-of-band providers (ntfy, Telegram, WhatsApp, Slack, Discord, Email, Pushover, webhook) that work even when the gateway is dead |
30
+ | **Responds** | Message `/health` on Telegram, WhatsApp, Slack, or Discord — Aegis replies with real-time status |
31
+ | **Remembers** | Full incident timeline, MTTR tracking, and an 18-endpoint REST API for dashboard integration |
25
32
 
26
33
  **Total downtime: ~15 seconds instead of hours.**
27
34
 
@@ -29,56 +36,45 @@ Aegis prevents this:
29
36
 
30
37
  ## Quick Start
31
38
 
39
+ Three commands. That's it.
40
+
32
41
  ```bash
33
- # Install
42
+ # Deploy the shield
34
43
  npm install -g openclaw-aegis
35
44
 
36
- # Configure (auto-detects your gateway)
45
+ # Auto-detect your gateway — zero questions asked
37
46
  aegis init --auto
38
47
 
39
- # Verify
48
+ # Confirm the shield is up
40
49
  aegis check
41
50
  ```
42
51
 
43
- Output:
44
52
  ```
45
53
  Health: HEALTHY (score: 10)
46
54
  Probes: 10 passed, 0 failed
47
55
  ```
48
56
 
49
- ---
50
-
51
- ## Commands
52
-
53
- | Command | Description |
54
- |---------|-------------|
55
- | `aegis init` | Interactive setup wizard |
56
- | `aegis init --auto` | Auto-detect everything, zero prompts |
57
- | `aegis check` | Run all 10 health probes once |
58
- | `aegis check --json` | JSON output for scripting |
59
- | `aegis status` | Health dashboard with per-probe details |
60
- | `aegis test-alert` | Send a test notification to all configured channels |
61
- | `aegis incidents` | Browse past incident logs |
62
- | `aegis incidents <id>` | Show full timeline for a specific incident |
63
- | `aegis serve` | Start REST API server + bot listeners |
57
+ Your gateway is now protected.
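
As a rough illustration of how the two-line `aegis check` summary above could be derived from individual probe results, here is a minimal TypeScript sketch. It is not taken from the Aegis source. The README does not define how the score is computed, so the sketch assumes it is simply the number of passing probes (which is consistent with the sample output), and the `UNHEALTHY` status, the `ProbeResult` type, and the function names are hypothetical.

```typescript
// Hypothetical shape for one probe outcome; not the Aegis API.
interface ProbeResult {
  name: string;
  passed: boolean;
}

interface HealthSummary {
  status: "HEALTHY" | "UNHEALTHY"; // "UNHEALTHY" is an assumed label
  score: number;
  passed: number;
  failed: number;
}

// Assumption: the score is the count of passing probes.
function summarize(results: ProbeResult[]): HealthSummary {
  const passed = results.filter((r) => r.passed).length;
  const failed = results.length - passed;
  return {
    status: failed === 0 ? "HEALTHY" : "UNHEALTHY",
    score: passed,
    passed,
    failed,
  };
}

// Render the summary in the same two-line layout as the sample output.
function formatSummary(s: HealthSummary): string {
  return (
    `Health: ${s.status} (score: ${s.score})\n` +
    `Probes: ${s.passed} passed, ${s.failed} failed`
  );
}

// The ten probe names listed in the README's "Detects" row.
const probeNames = [
  "process", "port", "HTTP", "config", "WebSocket",
  "TUN", "memory", "CPU", "disk", "logs",
];

const allPassing: ProbeResult[] = probeNames.map((name) => ({ name, passed: true }));
console.log(formatSummary(summarize(allPassing)));
```

With all ten probes passing, this prints the same two lines as the sample output; a weighted score would need a different `summarize`.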
64
58
 
65
59
  ---
66
60
 
67
- ## Documentation
61
+ ## Arsenal
68
62
 
69
- | Document | Description |
70
- |----------|-------------|
71
- | [Getting Started](docs/getting-started.md) | Installation, first setup, verification |
72
- | [Architecture](docs/architecture.md) | System design, probe pipeline, recovery tiers |
73
- | [Configuration](docs/configuration.md) | Full TOML reference with every option |
74
- | [Alerts](docs/alerts.md) | Setting up ntfy, Telegram, WhatsApp, Slack, Discord, Email, Pushover, webhooks |
75
- | [CLI Reference](docs/cli-reference.md) | Every command with examples and options |
76
- | [Contributing](docs/contributing.md) | Development setup, testing, PR process |
77
- | [Releasing](docs/releasing.md) | Version bumps, npm publish, GitHub releases |
63
+ | Command | What It Does |
64
+ |---------|-------------|
65
+ | `aegis init` | Interactive setup: walks you through everything |
66
+ | `aegis init --auto` | Zero-config setup: detects gateway, sets defaults |
67
+ | `aegis check` | Run all 10 probes, get a health verdict |
68
+ | `aegis check --json` | Machine-readable output for scripts and monitoring |
69
+ | `aegis status` | Live dashboard: every probe, color-coded |
70
+ | `aegis test-alert` | Fire a test alert to all configured channels |
71
+ | `aegis incidents` | Browse past battles — what failed, what was fixed |
72
+ | `aegis incidents <id>` | Full incident timeline with every recovery step |
73
+ | `aegis serve` | Start REST API + bot listeners for dashboard integration |
78
74
 
79
75
  ---
80
76
 
81
- ## How It Works
77
+ ## Defense Architecture
82
78
 
83
79
  ```
84
80
  OpenClaw Gateway Aegis Sidecar
@@ -87,11 +83,12 @@ OpenClaw Gateway Aegis Sidecar
87
83
  │ ~/.openclaw/ │◄────────►│ Config Guardian │
88
84
  │ openclaw.json │ │ Dead Man's Switch │
89
85
  │ logs/ │ │ Recovery Orchestrator │
90
- │ │ │ L1: Restart
86
+ │ │ │ L1: Quick Restart
91
87
  │ systemd/launchd │◄─────────│ L2: Targeted Repair │
88
+ │ │ │ L3: Deep Repair │
92
89
  │ │ │ L4: Human Alert │
93
90
  └─────────────────────┘ │ Alert Dispatcher │
94
- │ (8 alert providers)
91
+ │ (8 out-of-band providers)
95
92
  └──────────────────────────────┘
96
93
 
97
94
  Out-of-band
@@ -102,13 +99,46 @@ OpenClaw Gateway Aegis Sidecar
102
99
  Your phone
103
100
  ```
104
101
 
102
+ Alerts bypass the gateway entirely. If the gateway is down, Aegis talks directly to Telegram, Slack, Discord, and the rest. **No single point of failure.**
103
+
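
The out-of-band idea can be sketched as a fan-out in which each provider is contacted independently, so one unreachable channel never blocks the others. This is a minimal TypeScript sketch under stated assumptions, not the Aegis implementation: the `AlertProvider` interface, `dispatchAlert`, and the result shape are hypothetical names introduced for illustration.

```typescript
// Hypothetical provider contract; real providers would call ntfy, Slack, etc.
interface AlertProvider {
  name: string;
  send(message: string): Promise<void>;
}

interface DispatchResult {
  delivered: string[];
  failed: string[];
}

// Contact every provider in parallel. Each send is wrapped in its own
// try/catch, so a failing provider is recorded instead of aborting the rest.
async function dispatchAlert(
  providers: AlertProvider[],
  message: string,
): Promise<DispatchResult> {
  const outcomes = await Promise.all(
    providers.map(async (p) => {
      try {
        await p.send(message);
        return { name: p.name, ok: true };
      } catch (err) {
        return { name: p.name, ok: false };
      }
    }),
  );
  return {
    delivered: outcomes.filter((o) => o.ok).map((o) => o.name),
    failed: outcomes.filter((o) => !o.ok).map((o) => o.name),
  };
}
```

Calling `dispatchAlert` with one working and one failing provider resolves with the first name in `delivered` and the second in `failed`; nothing is thrown to the caller.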
104
+ ---
105
+
106
+ ## Recovery Cascade
107
+
108
+ When Aegis detects a problem, it doesn't just restart and pray:
109
+
110
+ **L1 — Quick Restart** (5s) — Pre-flight config check first. If config is clean, restart with exponential backoff. If config is poisoned, skip straight to L2.
111
+
112
+ **L2 — Targeted Repair** (30s-2min) — Diagnose the exact failure pattern and apply the right fix. Restore known-good config, delete stale PID files, fix permissions.
113
+
114
+ **L3 — Deep Repair** (30s-2min) — Riskier fixes when L2 isn't enough. Network repair (DNS flush, TUN reset), process resurrection (reinstall binary), dependency rebuild, safe mode boot, and disk cleanup.
115
+
116
+ **L4 — Human Alert** (instant) — When auto-recovery fails, Aegis sends a full incident report through every configured channel. You get the health score, what was tried, and why it failed.
117
+
118
+ Anti-flap protection, circuit breakers, and exponential backoff prevent crash loops. Aegis won't make things worse.
119
+
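
The escalation order described above can be sketched in a few lines of TypeScript. This is an illustrative model only, not the Aegis orchestrator: the hook names (`restart`, `targetedRepair`, `deepRepair`, `alertHuman`), the `maxRestarts` limit, and the backoff base and cap are all assumptions, and the sketch computes delays without actually sleeping.

```typescript
type Level = "L1" | "L2" | "L3" | "L4";

// Hypothetical hooks; each repair hook returns true if the gateway recovered.
interface CascadeInput {
  configIsClean: boolean;
  maxRestarts: number;
  restart: (waitMs: number) => boolean;
  targetedRepair: () => boolean;
  deepRepair: () => boolean;
  alertHuman: (tried: Level[]) => void;
}

// Exponential backoff: base * 2^attempt, capped. Base and cap are assumed values.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

function runCascade(input: CascadeInput): { resolvedAt: Level; tried: Level[] } {
  const tried: Level[] = [];

  // L1: only restart when the pre-flight config check passes;
  // a poisoned config skips straight to L2.
  if (input.configIsClean) {
    tried.push("L1");
    for (let attempt = 0; attempt < input.maxRestarts; attempt++) {
      if (input.restart(backoffDelayMs(attempt))) {
        return { resolvedAt: "L1", tried };
      }
    }
  }

  // L2: targeted repair for the diagnosed failure pattern.
  tried.push("L2");
  if (input.targetedRepair()) return { resolvedAt: "L2", tried };

  // L3: riskier deep repair.
  tried.push("L3");
  if (input.deepRepair()) return { resolvedAt: "L3", tried };

  // L4: hand over to a human, reporting the levels that were attempted.
  input.alertHuman(tried.slice());
  tried.push("L4");
  return { resolvedAt: "L4", tried };
}
```

With these assumed defaults the restart delays grow as 1 s, 2 s, 4 s, and so on up to a 30 s cap. Anti-flap protection and circuit breakers are left out of the sketch.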
120
+ ---
121
+
122
+ ## Documentation
123
+
124
+ | Document | Description |
125
+ |----------|-------------|
126
+ | [Getting Started](docs/getting-started.md) | Installation, first setup, verification |
127
+ | [Architecture](docs/architecture.md) | Probe pipeline, recovery tiers, system design |
128
+ | [Configuration](docs/configuration.md) | Full TOML reference — every knob and dial |
129
+ | [Alerts](docs/alerts.md) | Setup guides for all 8 providers |
130
+ | [CLI Reference](docs/cli-reference.md) | Every command with examples |
131
+ | [Contributing](docs/contributing.md) | Dev setup, testing, PR process |
132
+ | [Releasing](docs/releasing.md) | Version bumps, npm publish, GitHub releases |
133
+ | [Roadmap](docs/roadmap.md) | What's coming — L3 recovery, observability, fleet management |
134
+
105
135
  ---
106
136
 
107
137
  ## Requirements
108
138
 
109
- - Node.js >= 18
110
- - OpenClaw Gateway (any version with `openclaw gateway health` support)
111
- - Linux (systemd) or macOS (launchd)
139
+ - **Node.js** >= 18
140
+ - **OpenClaw Gateway** (any version with `openclaw gateway health`)
141
+ - **Linux** (systemd) or **macOS** (launchd)
112
142
 
113
143
  ---
114
144