@gravito/zenith 0.1.0-beta.1 → 1.0.0-beta.1
- package/ALERTING_GUIDE.md +71 -0
- package/QUASAR_MASTER_PLAN.md +137 -0
- package/dist/bin.js +38061 -26911
- package/dist/client/assets/index-BSTyMCFd.css +1 -0
- package/dist/client/assets/index-oXEse8ih.js +436 -0
- package/dist/client/index.html +2 -2
- package/dist/server/index.js +38061 -26911
- package/package.json +52 -48
- package/specs/PULSE_SPEC.md +86 -0
- package/src/client/App.tsx +2 -0
- package/src/client/Layout.tsx +30 -11
- package/src/client/Sidebar.tsx +2 -1
- package/src/client/WorkerStatus.tsx +25 -21
- package/src/client/components/BrandIcons.tsx +63 -0
- package/src/client/components/PageHeader.tsx +34 -0
- package/src/client/pages/OverviewPage.tsx +18 -20
- package/src/client/pages/PulsePage.tsx +396 -0
- package/src/client/pages/QueuesPage.tsx +1 -3
- package/src/client/pages/SettingsPage.tsx +586 -78
- package/src/client/pages/WorkersPage.tsx +1 -1
- package/src/client/pages/index.ts +1 -0
- package/src/server/index.ts +148 -8
- package/src/server/services/AlertService.ts +189 -41
- package/src/server/services/CommandService.ts +137 -0
- package/src/server/services/PulseService.ts +80 -0
- package/src/server/services/QueueService.ts +58 -4
- package/src/shared/types.ts +97 -0
- package/tsconfig.json +2 -2
- package/PULSE_IMPLEMENTATION_PLAN.md +0 -111
- package/dist/client/assets/index-DGYEwTDL.css +0 -1
- package/dist/client/assets/index-oyTdySX0.js +0 -421
@@ -0,0 +1,71 @@
# 🔔 Zenith Alerting Guide

This guide explains how to configure and manage the alerting system in Zenith to ensure your infrastructure and queues remain healthy.

---

## 🚀 Overview

Zenith's alerting engine is **Redis-Native** and **Stateless**.
* **Persistence**: Rules are stored in Redis (`gravito:zenith:alerts:rules`).
* **Evaluation**: The server evaluates all rules every 2 seconds against real-time metrics.
* **Delivery**: Alerts are dispatched via Slack Webhooks.

---

## 🛠️ Configuration Fields

When adding a new rule in **Settings > Alerting**, you will encounter these fields:

### 1. Rule Name
A descriptive label for the alert (e.g., `Critical Backlog`, `Agent Offline`). This name will appear in the Slack notification.

### 2. Type (Metric Category)
* **Queue Backlog**: Monitors the number of jobs in the `waiting` state.
* **High Failure Count**: Monitors the number of jobs in the `failed` state.
* **Worker Loss**: Monitors the total number of active worker nodes.
* **Node CPU (%)**: Monitors process-level CPU usage reported by Quasar Agents.
* **Node RAM (%)**: Monitors process-level RAM usage (RSS) relative to system total.

### 3. Threshold
The numeric value that triggers the alert.
* For **Backlog/Failure**: The number of jobs (e.g., `1000`).
* For **CPU/RAM**: The percentage (e.g., `90`).
* For **Worker Loss**: The *minimum* number of workers expected (e.g., alert triggers if count is `< 2`).
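
The threshold semantics above can be sketched as a single comparison function. This is a minimal illustration, not Zenith's actual `AlertService` code: the `AlertRule` shape, the type identifiers, and the choice of `>=` for the non-worker metrics are all assumptions. The guide only states that Worker Loss fires when the count falls below the threshold.

```typescript
// Hypothetical rule shape; field and type names are illustrative only.
type RuleType = "backlog" | "failure" | "worker_loss" | "node_cpu" | "node_ram";

interface AlertRule {
  name: string;
  type: RuleType;
  threshold: number;
  queue?: string;
}

// Returns true when the observed value should trigger the rule.
// Assumption: every metric except Worker Loss fires at value >= threshold;
// Worker Loss fires when the worker count drops below the expected minimum.
function isBreached(rule: AlertRule, value: number): boolean {
  if (rule.type === "worker_loss") {
    return value < rule.threshold;
  }
  return value >= rule.threshold;
}

const backlogRule: AlertRule = { name: "Critical Backlog", type: "backlog", threshold: 1000 };
const workerRule: AlertRule = { name: "Agent Offline", type: "worker_loss", threshold: 2 };

console.log(isBreached(backlogRule, 1200)); // true
console.log(isBreached(workerRule, 1)); // true
console.log(isBreached(workerRule, 2)); // false
```

Keeping the inverted comparison for Worker Loss in one place means every other rule type can share the same code path.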

### 4. Cooldown (Minutes)
**Crucial Concept**: The period the system "stays silent" after an alert is fired.
* **Logic**: Once a rule triggers and sends a notification, it enters a "lock" state for the duration of the cooldown.
* **Purpose**: Prevents "Alert Fatigue" and notification storms.
* **Example**: If set to `30`, and a backlog spike occurs, you get **one** notification. You won't get another one for the same rule for 30 minutes, even if the backlog remains high.
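
The lock behaviour can be sketched as follows. This is an assumption-laden illustration: `CooldownTracker` and `tryFire` are hypothetical names, and the in-memory `Map` stands in for wherever Zenith actually keeps the lock (a Redis key with a TTL would play the same role).

```typescript
// Hypothetical helper: remembers when each rule last sent a notification.
class CooldownTracker {
  private lastFired = new Map<string, number>();

  // Returns true if the rule may notify at `nowMs` and records the firing;
  // returns false while the rule is still inside its cooldown window.
  tryFire(ruleId: string, cooldownMinutes: number, nowMs: number): boolean {
    const last = this.lastFired.get(ruleId);
    const cooldownMs = cooldownMinutes * 60 * 1000;
    if (last !== undefined && nowMs - last < cooldownMs) {
      return false; // locked: suppress the repeat notification
    }
    this.lastFired.set(ruleId, nowMs);
    return true;
  }
}

const tracker = new CooldownTracker();
const minute = 60 * 1000;

console.log(tracker.tryFire("backlog", 30, 0)); // true: first breach notifies
console.log(tracker.tryFire("backlog", 30, 10 * minute)); // false: still locked
console.log(tracker.tryFire("backlog", 30, 30 * minute)); // true: cooldown elapsed
```

The clock is passed in as `nowMs` rather than read from `Date.now()` so the behaviour can be checked without waiting.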

### 5. Queue (Optional)
Name a single queue (e.g., `orders`, `emails`) to monitor. If left empty, the rule applies to the **total sum** across all queues.
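
A short sketch of how the optional queue field could select the value being compared. The `QueueDepths` snapshot type and `metricFor` function are hypothetical, and treating an unknown queue name as zero is an assumption.

```typescript
// Hypothetical snapshot: queue name -> number of jobs in the monitored state.
type QueueDepths = Record<string, number>;

// Picks the value a rule is evaluated against.
// With a queue name, only that queue counts; without one, all queues are summed.
function metricFor(depths: QueueDepths, queue?: string): number {
  if (queue) {
    return depths[queue] ?? 0; // assumption: an unknown queue counts as empty
  }
  return Object.values(depths).reduce((sum, n) => sum + n, 0);
}

const depths: QueueDepths = { orders: 700, emails: 400 };

console.log(metricFor(depths, "orders")); // 700
console.log(metricFor(depths)); // 1100
```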

---

## 🌊 Best Practices

### The "Instant Fire" Design
Zenith alerts are designed for **instant awareness**.
* If a threshold is met during a 2-second check, the alert fires **immediately**.
* It does **not** wait for the condition to persist for multiple minutes (Debouncing).
* **Pro Tip**: If you have frequent "tiny spikes" that resolve themselves in seconds, set your **Threshold** slightly higher than the spikes to avoid noise.

### Recommended Settings

| Scenario | Type | Threshold | Cooldown |
| :--- | :--- | :--- | :--- |
| **Critical Failure** | High Failure Count | 50 | 15m |
| **System Overload** | Node CPU | 90 | 30m |
| **Quiet Hours** | Queue Backlog | 5000 | 120m |
| **Fatal Shutdown** | Worker Loss | 1 | 10m |

---

## 🔗 Slack Integration
To receive notifications, ensure the `SLACK_WEBHOOK_URL` environment variable is set before starting the Zenith server.

```bash
export SLACK_WEBHOOK_URL=https://hooks.slack.com/services/Txxx/Bxxx/Xxxx
```
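
For orientation, here is a rough sketch of what delivery to that webhook might look like. Slack incoming webhooks accept a JSON body with a `text` field; everything else here (the `FiredAlert` shape, the message wording, the injected `post` function) is a hypothetical stand-in and not Zenith's actual implementation.

```typescript
// Hypothetical shape of a fired alert; not taken from Zenith's source.
interface FiredAlert {
  ruleName: string;
  value: number;
  threshold: number;
}

// Builds the JSON body for a Slack incoming webhook.
// The message wording is made up for illustration.
function buildSlackPayload(alert: FiredAlert): { text: string } {
  return {
    text: `[Zenith] ${alert.ruleName}: current value ${alert.value}, threshold ${alert.threshold}`,
  };
}

// The HTTP transport is injected so the sketch does not depend on a specific client.
type PostJson = (url: string, body: string) => Promise<boolean>;

// Sends the alert if a webhook URL is configured; resolves to false otherwise.
async function notifySlack(
  alert: FiredAlert,
  webhookUrl: string | undefined,
  post: PostJson
): Promise<boolean> {
  if (!webhookUrl) {
    return false; // SLACK_WEBHOOK_URL not set: nothing to deliver
  }
  return post(webhookUrl, JSON.stringify(buildSlackPayload(alert)));
}
```

In a real server, `post` would wrap an HTTP `POST` with a `Content-Type: application/json` header, and `webhookUrl` would come from the `SLACK_WEBHOOK_URL` environment variable.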
@@ -0,0 +1,137 @@
# 🌌 Project Quasar: Master Implementation Plan

**Version**: 1.0.0 (Unified)
**Target**: Zenith v1.0
**Context**: This document supersedes all previous "Pulse" plans. It is the single source of truth for the Quasar monitoring ecosystem.

---

## 1. Vision & Identity

**Quasar** is the comprehensive observability layer for the Gravito ecosystem. It unifies infrastructure monitoring (CPU/RAM), application insights (Queues/Slow Logs), and availability checks into a single stream.

> **Slogan**: *"The brightest signal in your infrastructure."*

---

## 2. Architecture & Deployment Matrix

We employ a "Right Tool for the Job" strategy for deployment:

| Ecosystem | Tool | Package | Strategy |
| :--- | :--- | :--- | :--- |
| **Node.js / Bun** | **SDK** | `@gravito/quasar` | **In-App Integration**. Directly imports into the app. Captures Event Loop, Heap, and Queues. |
| **Legacy / Polyglot** | **Agent** | `gravito/quasar-agent` | **Sidecar / Daemon**. Standalone Go binary. Captures OS-level metrics and external Queue states via Redis/API. |

### 🚀 Deployment Methods (Zero Friction)
1. **NPM**: `npm install @gravito/quasar` (For Node developers)
2. **Docker**: `image: gravito/quasar-agent:latest` (For Container/K8s/Laravel Sail)
3. **Shell**: `curl -sL get.gravito.dev/quasar | bash` (For Bare Metal/VM)

---

## 3. Data Protocol (The Quasar Schema)

All agents/SDKs report to Redis using this unified schema.

**Namespace**: `gravito:quasar:`

### A. Heartbeat (Infrastructure)
* **Key**: `gravito:quasar:node:{service_name}:{node_id}`
* **TTL**: 30 seconds
* **Metrics Philosophy**: Report **BOTH** Process and System metrics to isolate resource usage.
    * `process`: metrics for the specific service (RAM usage, CPU time).
    * `system`: metrics for the host OS (Load avg, Total RAM).

### B. Queues (Workload)
* **Key**: `gravito:quasar:queues:{service_name}`
* **TTL**: 30 seconds
* **Purpose**: Snapshots of queue depths from various drivers.
    * Gravito Stream (Native)
    * Laravel Horizon (Redis)
    * BullMQ (Redis)
    * AWS SQS (API)

### C. Insights (Performance)
* **Key**: `gravito:quasar:slow:{service_name}` (Stream)
* **Purpose**: Log requests or jobs that exceed performance thresholds.
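
The key patterns in this section can be expressed as small builder functions. The key formats and the 30-second TTL come from the schema above; the `HeartbeatPayload` field names and the `heartbeatWrite` helper are hypothetical, since the plan only says that `process` and `system` metrics are reported separately.

```typescript
const QUASAR_NAMESPACE = "gravito:quasar:";
const TTL_SECONDS = 30;

// Key builders following the patterns listed in sections A, B, and C.
function nodeKey(serviceName: string, nodeId: string): string {
  return `${QUASAR_NAMESPACE}node:${serviceName}:${nodeId}`;
}

function queuesKey(serviceName: string): string {
  return `${QUASAR_NAMESPACE}queues:${serviceName}`;
}

function slowKey(serviceName: string): string {
  return `${QUASAR_NAMESPACE}slow:${serviceName}`;
}

// Hypothetical payload: field names are illustrative, only the
// process/system split is taken from the plan.
interface HeartbeatPayload {
  process: { rssBytes: number; cpuPercent: number };
  system: { loadAvg1m: number; totalRamBytes: number };
}

// Describes the write an agent might issue for one heartbeat.
// Assumption: the payload is stored as a JSON string with an expiry.
function heartbeatWrite(serviceName: string, nodeId: string, payload: HeartbeatPayload) {
  return {
    key: nodeKey(serviceName, nodeId),
    value: JSON.stringify(payload),
    ttlSeconds: TTL_SECONDS,
  };
}

console.log(nodeKey("api", "node-1")); // gravito:quasar:node:api:node-1
console.log(queuesKey("api")); // gravito:quasar:queues:api
console.log(slowKey("api")); // gravito:quasar:slow:api
```

Because heartbeat keys expire after 30 seconds, a node that stops reporting simply disappears from a key scan, with no explicit "offline" message needed.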

---

## 4. Execution Roadmap

### Phase 1: Foundation & Application Monitoring (Pulse Node)
**Goal**: Establish the basic dashboard and Node.js SDK for monitoring application health (CPU/RAM).

- [x] **Define Schema**: Update `PULSE_SPEC.md` with new Redis key patterns (`gravito:quasar:node:*`) and payload structure.
- [x] **SDK Update**: Refactor `@gravito/quasar` (formerly pulse-node) to support:
    - [x] Automatic runtime detection (Node, Bun, Deno).
    - [x] System/Process split metrics.
    - [x] Correct Redis namespacing.
- [x] **Server Update**: Update Zenith's `PulseService` to scan new key patterns.
- [x] **UI Overhaul**: Redesign `PulsePage` in Zenith:
    - [x] Implement "Card" layout for nodes.
    - [x] Rich metrics visualization (CPU/RAM split bars).
    - [x] Add brand icons for runtimes (Node, Bun, Deno, PHP, Go, Python).
- [x] **Layout Optimization**: Compact Grid for Service Groups.

---

### Phase 2: Architecture Evolution - "The Brain-Hand Model" 🧠 🖐️
To support advanced features like **Queue Insights** (Phase 2) and **Remote Control** (Phase 3), we are adopting a bidirectional architecture.

* **Metric Transport (The Mouth)**: Agent sends metrics to Zenith (via shared Redis).
* **Local Insight (The Eyes)**: Agent inspects *its own* environment (Local Redis, Local Queue) to gather data. Zenith doesn't need to connect to the App DB directly.
* **Command execution (The Hand)**: Zenith publishes commands (Retry/Delete), and Agent listens and executes them locally.

#### Revised Phase 2: Application Insights (Queues) - **In Progress** 🟡
**Goal**: Enable Quasar Agent to "see" local queues and report their status.

- [x] **SDK Architecture**: Update `QuasarAgent` to handle **Dual Connections**:
    - `transport`: Connection to Zenith (for sending heartbeat).
    - `app`: Connection to Local App (for inspecting queues/bull/laravel).
- [x] **Probe Implementation**: Create `QueueProbe` interface and implementations:
    - `RedisListProbe`: Simple `LLEN` checks.
    - [x] `BullProbe` (Future): Check `bull:*:waiting`, etc.
    - [x] `LaravelProbe`: Check `queues:default`, `queues:reserved`, `queues:delayed`.
- [x] **SDK API**: Expose `.monitorQueue(name, type)` method.
- [x] **UI Update**: Update `NodeCard` to render a "Queues" section if queue data is present in payload.
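
To make the probe idea concrete, here is a minimal sketch of a `QueueProbe` interface with a list-based implementation. The names `QueueProbe` and `RedisListProbe` come from the checklist above, but their signatures are not given in the plan, so the shapes below are assumptions. `RedisLike` is a hypothetical one-method slice of a Redis client (clients such as ioredis expose a compatible `llen`), standing in for the agent's `app` connection.

```typescript
// Hypothetical minimal slice of a Redis client; only LLEN is needed here.
interface RedisLike {
  llen(key: string): Promise<number>;
}

// Assumed snapshot shape reported for one queue.
interface QueueSnapshot {
  name: string;
  waiting: number;
}

// Assumed probe contract: each probe knows how to measure one queue.
interface QueueProbe {
  probe(): Promise<QueueSnapshot>;
}

// Measures queue depth as the length of a Redis list.
class RedisListProbe implements QueueProbe {
  private redis: RedisLike;
  private name: string;
  private listKey: string;

  constructor(redis: RedisLike, name: string, listKey: string) {
    this.redis = redis;
    this.name = name;
    this.listKey = listKey;
  }

  async probe(): Promise<QueueSnapshot> {
    const waiting = await this.redis.llen(this.listKey);
    return { name: this.name, waiting };
  }
}

// In-memory stand-in for the local Redis connection, for demonstration only.
const fakeRedis: RedisLike = {
  llen: async (key: string) => (key === "queues:default" ? 3 : 0),
};

new RedisListProbe(fakeRedis, "default", "queues:default")
  .probe()
  .then((snapshot) => console.log(snapshot.name, snapshot.waiting));
```

Hiding each driver behind the same `probe()` contract lets the agent report Bull, Laravel, and plain list queues in one loop.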

### Phase 3: Remote Control (Command & Control) - **Completed** ✅
**Goal**: Allow Zenith to instruct Quasar to perform actions (Retry Job, Delete Job).

- [x] **Protocol**: Define Command Protocol (Redis Pub/Sub: `gravito:quasar:cmd:{service}:{node_id}`).
- [x] **Agent**: Implement `CommandListener` in SDK.
- [x] **Command Executors**: Implement `RetryJobExecutor` and `DeleteJobExecutor`.
- [x] **Security (Allowlist Strategy)**:
    - [x] Implement **Command Allowlist** inside Agent code (only `RETRY_JOB`, `DELETE_JOB` allowed).
    - [ ] (Future) Use **Redis ACL** (v6+) to restrict Agent's `transport` connection.
- [x] **Server**: Add `CommandService` and `/api/pulse/command` endpoint.
- [x] **UI**: Add "Retry/Delete" buttons in Zenith `PulsePage` for failed queue jobs.
- [x] **Documentation**: Created `ALERTING_GUIDE.md` for configuration best practices.
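
The channel pattern and allowlist can be sketched as below. The channel format and the two allowed command names come from the checklist; the JSON message shape (`type`, `jobId`) and the `parseCommand` helper are assumptions, since the plan does not specify the wire format.

```typescript
// Commands the agent is willing to execute, per the allowlist strategy.
const ALLOWED_COMMANDS = ["RETRY_JOB", "DELETE_JOB"] as const;
type AllowedCommand = (typeof ALLOWED_COMMANDS)[number];

// Pub/Sub channel for one node, following the documented pattern.
function commandChannel(service: string, nodeId: string): string {
  return `gravito:quasar:cmd:${service}:${nodeId}`;
}

// Hypothetical validated command; the real payload may carry more fields.
interface ParsedCommand {
  type: AllowedCommand;
  jobId: string;
}

// Parses a raw Pub/Sub message and rejects anything outside the allowlist.
// Returns null for malformed JSON, missing fields, or disallowed types.
function parseCommand(raw: string): ParsedCommand | null {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof msg !== "object" || msg === null) {
    return null;
  }
  const { type, jobId } = msg as { type?: unknown; jobId?: unknown };
  if (typeof type !== "string" || typeof jobId !== "string") {
    return null;
  }
  if ((ALLOWED_COMMANDS as readonly string[]).indexOf(type) === -1) {
    return null;
  }
  return { type: type as AllowedCommand, jobId };
}

console.log(commandChannel("api", "node-1")); // gravito:quasar:cmd:api:node-1
console.log(parseCommand('{"type":"RETRY_JOB","jobId":"42"}')); // accepted
console.log(parseCommand('{"type":"FLUSHALL","jobId":"42"}')); // null
```

Validating inside the agent means a compromised or buggy publisher cannot make it run anything beyond the two listed actions, which is the point of the allowlist.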

### Phase 4: Polyglot Agent - **In Progress** 🟡
* [x] Create `gravito-framework/quasar` repo (`quasar-go`).
* [x] Develop Go Agent core (utilizing `gopsutil`).
    * [x] System Probe (CPU/RAM)
    * [x] Agent heartbeat loop
    * [x] Config management (env vars)
* [x] Implement Queue Monitoring in Go Agent:
    * [x] Redis List Probe
    * [x] Laravel Queue Probe
* [x] Implement Remote Control in Go Agent:
    * [x] Command Listener (Pub/Sub)
    * [x] RETRY_JOB / DELETE_JOB Executors
* [x] **Laravel Deep Integration**:
    * [x] `LARAVEL_ACTION` Executor (runs `artisan` safely).
    * [x] Auto-discovery of Laravel project root via process inspection.
    * [x] Support for `retry-all`, `retry {id}`, and `restart` (graceful worker reload).
* [x] Docker & Makefile setup.
* [x] Binary Release pipeline (GitHub Actions).
* [x] Publish to Docker Hub (`carllee/quasar-go-agent`).

---

## 5. Security & Access
* **Auth**: Agents authenticate via a shared secret (`QUASAR_TOKEN`) if writing to a remote Redis.
* **Isolation**: Process metrics only report what they have access to. System metrics require readable `/proc` (in Docker).