wolverine-ai 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/PLATFORM.md +442 -0
- package/README.md +475 -0
- package/SERVER_BEST_PRACTICES.md +62 -0
- package/TELEMETRY.md +108 -0
- package/bin/wolverine.js +95 -0
- package/examples/01-basic-typo.js +31 -0
- package/examples/02-multi-file/routes/users.js +15 -0
- package/examples/02-multi-file/server.js +25 -0
- package/examples/03-syntax-error.js +23 -0
- package/examples/04-secret-leak.js +14 -0
- package/examples/05-expired-key.js +27 -0
- package/examples/06-json-config/config.json +13 -0
- package/examples/06-json-config/server.js +28 -0
- package/examples/07-rate-limit-loop.js +11 -0
- package/examples/08-sandbox-escape.js +20 -0
- package/examples/buggy-server.js +39 -0
- package/examples/demos/01-basic-typo/index.js +20 -0
- package/examples/demos/01-basic-typo/routes/api.js +13 -0
- package/examples/demos/01-basic-typo/routes/health.js +4 -0
- package/examples/demos/02-multi-file/index.js +24 -0
- package/examples/demos/02-multi-file/routes/api.js +13 -0
- package/examples/demos/02-multi-file/routes/health.js +4 -0
- package/examples/demos/03-syntax-error/index.js +18 -0
- package/examples/demos/04-secret-leak/index.js +16 -0
- package/examples/demos/05-expired-key/index.js +21 -0
- package/examples/demos/06-json-config/config.json +9 -0
- package/examples/demos/06-json-config/index.js +20 -0
- package/examples/demos/07-null-crash/index.js +16 -0
- package/examples/run-demo.js +110 -0
- package/package.json +67 -0
- package/server/config/settings.json +62 -0
- package/server/index.js +33 -0
- package/server/routes/api.js +12 -0
- package/server/routes/health.js +16 -0
- package/server/routes/time.js +12 -0
- package/src/agent/agent-engine.js +727 -0
- package/src/agent/goal-loop.js +140 -0
- package/src/agent/research-agent.js +120 -0
- package/src/agent/sub-agents.js +176 -0
- package/src/backup/backup-manager.js +321 -0
- package/src/brain/brain.js +315 -0
- package/src/brain/embedder.js +131 -0
- package/src/brain/function-map.js +263 -0
- package/src/brain/vector-store.js +267 -0
- package/src/core/ai-client.js +387 -0
- package/src/core/cluster-manager.js +144 -0
- package/src/core/config.js +89 -0
- package/src/core/error-parser.js +87 -0
- package/src/core/health-monitor.js +129 -0
- package/src/core/models.js +132 -0
- package/src/core/patcher.js +55 -0
- package/src/core/runner.js +464 -0
- package/src/core/system-info.js +141 -0
- package/src/core/verifier.js +146 -0
- package/src/core/wolverine.js +290 -0
- package/src/dashboard/server.js +1332 -0
- package/src/index.js +94 -0
- package/src/logger/event-logger.js +237 -0
- package/src/logger/pricing.js +96 -0
- package/src/logger/repair-history.js +109 -0
- package/src/logger/token-tracker.js +277 -0
- package/src/mcp/mcp-client.js +224 -0
- package/src/mcp/mcp-registry.js +228 -0
- package/src/mcp/mcp-security.js +152 -0
- package/src/monitor/perf-monitor.js +300 -0
- package/src/monitor/process-monitor.js +231 -0
- package/src/monitor/route-prober.js +191 -0
- package/src/notifications/notifier.js +227 -0
- package/src/platform/heartbeat.js +93 -0
- package/src/platform/queue.js +53 -0
- package/src/platform/register.js +64 -0
- package/src/platform/telemetry.js +76 -0
- package/src/security/admin-auth.js +150 -0
- package/src/security/injection-detector.js +174 -0
- package/src/security/rate-limiter.js +152 -0
- package/src/security/sandbox.js +128 -0
- package/src/security/secret-redactor.js +217 -0
- package/src/skills/skill-registry.js +129 -0
- package/src/skills/sql.js +375 -0
package/PLATFORM.md
ADDED
@@ -0,0 +1,442 @@

# Wolverine Platform — Multi-Server Analytics & Management

## Overview

The Wolverine Platform aggregates data from hundreds/thousands of wolverine server instances into a single backend + frontend dashboard. Each wolverine instance runs independently and broadcasts lightweight telemetry to the platform.

```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Wolverine #1 │   │ Wolverine #2 │   │ Wolverine #3 │  ... (N instances)
│ server:3000  │   │ server:4000  │   │ server:5000  │
│ dash:3001    │   │ dash:4001    │   │ dash:5001    │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       │ heartbeat        │ heartbeat        │ heartbeat
       │ (every 60s)      │ (every 60s)      │ (every 60s)
       ▼                  ▼                  ▼
┌─────────────────────────────────────────────────┐
│            Wolverine Platform Backend           │
│                                                 │
│  POST /api/v1/heartbeat    ← receive telemetry  │
│  GET  /api/v1/servers      ← list all instances │
│  GET  /api/v1/servers/:id  ← single instance    │
│  GET  /api/v1/analytics    ← aggregated stats   │
│  GET  /api/v1/alerts       ← active alerts      │
│  WS   /ws/live             ← real-time stream   │
│                                                 │
│  Database: PostgreSQL (time-series optimized)   │
│  Cache:    Redis (live state, pub/sub)          │
│  Queue:    Bull/BullMQ (alert processing)       │
└─────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│           Wolverine Platform Frontend           │
│                                                 │
│  Fleet overview        — all servers at a glance│
│  Per-server deep dive  — events, repairs, usage │
│  Cost analytics        — tokens, USD, by model  │
│  Alert management      — acknowledge, escalate  │
│  Uptime history        — SLA tracking over time │
└─────────────────────────────────────────────────┘
```

---

## Telemetry Protocol

### Heartbeat Payload

Each wolverine instance sends a heartbeat every **60 seconds** (configurable). This is the only outbound traffic — minimal network impact.

```json
POST /api/v1/heartbeat
Authorization: Bearer <PLATFORM_API_KEY>
Content-Type: application/json

{
  "instanceId": "wlv_a1b2c3d4",
  "version": "0.1.0",
  "timestamp": 1775073247574,

  "server": {
    "name": "my-api",
    "port": 3000,
    "uptime": 86400,
    "status": "healthy",
    "pid": 12345
  },

  "process": {
    "memoryMB": 128,
    "cpuPercent": 12,
    "peakMemoryMB": 256
  },

  "routes": {
    "total": 8,
    "healthy": 8,
    "unhealthy": 0,
    "slowest": { "path": "/api/search", "avgMs": 450 }
  },

  "repairs": {
    "total": 3,
    "successes": 2,
    "failures": 1,
    "lastRepair": {
      "error": "TypeError: Cannot read property 'id' of undefined",
      "resolution": "Added null check before accessing user.id",
      "tokens": 1820,
      "cost": 0.0045,
      "mode": "fast",
      "timestamp": 1775073200000
    }
  },

  "usage": {
    "totalTokens": 45000,
    "totalCost": 0.12,
    "totalCalls": 85,
    "byCategory": {
      "heal": { "tokens": 12000, "cost": 0.04, "calls": 5 },
      "chat": { "tokens": 25000, "cost": 0.05, "calls": 60 },
      "classify": { "tokens": 3000, "cost": 0.001, "calls": 15 },
      "develop": { "tokens": 5000, "cost": 0.03, "calls": 5 }
    }
  },

  "brain": {
    "totalMemories": 45,
    "namespaces": { "docs": 23, "functions": 12, "errors": 5, "fixes": 3, "learnings": 2 }
  },

  "backups": {
    "total": 8,
    "stable": 3,
    "verified": 2,
    "unstable": 3
  },

  "alerts": [
    {
      "type": "memory_leak",
      "message": "Memory growing: +50MB over 10 samples",
      "severity": "warn",
      "timestamp": 1775073100000
    }
  ]
}
```

### Design Principles

- **Infrequent**: 1 heartbeat per 60 seconds = 1440/day per instance
- **Small**: ~2KB per payload, gzipped < 500 bytes
- **Idempotent**: same heartbeat can be sent twice safely (upsert by instanceId + timestamp)
- **Offline-resilient**: if platform is down, wolverine queues heartbeats and replays on reconnect
- **No PII**: never send secrets, user data, or source code in heartbeats
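The idempotency rule can be sketched on the ingestion side as a dedupe keyed by `instanceId + timestamp`. This is only an illustration: the in-memory `Set` and the `acceptHeartbeat` helper are assumptions for the sketch; the real platform would express the same rule as a database upsert.

```javascript
// Sketch: idempotent heartbeat ingestion keyed by (instanceId, timestamp).
// A production backend would express this as an upsert (ON CONFLICT DO NOTHING)
// in PostgreSQL; the in-memory Set here only illustrates the rule.
const seen = new Set();

function acceptHeartbeat(hb) {
  const key = `${hb.instanceId}:${hb.timestamp}`;
  if (seen.has(key)) return false; // duplicate, safe to ignore
  seen.add(key);
  return true; // first delivery, process normally
}

// The same heartbeat can be sent twice without double-counting:
const hb = { instanceId: "wlv_a1b2c3d4", timestamp: 1775073247574 };
console.log(acceptHeartbeat(hb)); // true  (processed)
console.log(acceptHeartbeat(hb)); // false (deduplicated)
```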

---

## Platform Backend Architecture

### Database Schema (PostgreSQL)

```sql
-- Servers — one row per wolverine instance
CREATE TABLE servers (
  id             TEXT PRIMARY KEY,                 -- "wlv_a1b2c3d4"
  name           TEXT NOT NULL,
  version        TEXT,
  first_seen     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  last_heartbeat TIMESTAMPTZ NOT NULL,
  status         TEXT NOT NULL DEFAULT 'unknown',  -- healthy, degraded, down, unknown
  config         JSONB                             -- port, models, etc.
);

-- Time-series heartbeats — partitioned by day for scale
CREATE TABLE heartbeats (
  id                BIGSERIAL,
  server_id         TEXT NOT NULL REFERENCES servers(id),
  timestamp         TIMESTAMPTZ NOT NULL,
  uptime            INTEGER,
  memory_mb         INTEGER,
  cpu_percent       INTEGER,
  routes_total      INTEGER,
  routes_healthy    INTEGER,
  routes_unhealthy  INTEGER,
  tokens_total      INTEGER,
  cost_total        NUMERIC(10,6),
  repairs_total     INTEGER,
  repairs_successes INTEGER,
  payload           JSONB                          -- full heartbeat for deep queries
) PARTITION BY RANGE (timestamp);

-- Create daily partitions automatically (pg_partman or manual)
-- This allows dropping old data by partition instead of DELETE

-- Repairs — detailed log of every fix
CREATE TABLE repairs (
  id          BIGSERIAL PRIMARY KEY,
  server_id   TEXT NOT NULL REFERENCES servers(id),
  timestamp   TIMESTAMPTZ NOT NULL,
  error       TEXT,
  resolution  TEXT,
  success     BOOLEAN,
  mode        TEXT,           -- fast, agent, sub-agents
  model       TEXT,
  tokens      INTEGER,
  cost        NUMERIC(10,6),
  iteration   INTEGER,
  duration_ms INTEGER
);

-- Alerts — active and historical
CREATE TABLE alerts (
  id              BIGSERIAL PRIMARY KEY,
  server_id       TEXT NOT NULL REFERENCES servers(id),
  type            TEXT NOT NULL,  -- memory_leak, route_down, crash_loop, etc.
  message         TEXT,
  severity        TEXT,           -- info, warn, error, critical
  created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  acknowledged_at TIMESTAMPTZ,
  resolved_at     TIMESTAMPTZ,
  acknowledged_by TEXT
);

-- Usage aggregates — hourly rollups for fast analytics
CREATE TABLE usage_hourly (
  server_id          TEXT NOT NULL REFERENCES servers(id),
  hour               TIMESTAMPTZ NOT NULL,
  tokens_total       INTEGER DEFAULT 0,
  cost_total         NUMERIC(10,6) DEFAULT 0,
  calls_total        INTEGER DEFAULT 0,
  tokens_by_category JSONB,
  PRIMARY KEY (server_id, hour)
);

-- Indexes for common queries
CREATE INDEX idx_heartbeats_server_time ON heartbeats (server_id, timestamp DESC);
CREATE INDEX idx_repairs_server_time ON repairs (server_id, timestamp DESC);
CREATE INDEX idx_alerts_active ON alerts (server_id) WHERE resolved_at IS NULL;
CREATE INDEX idx_servers_status ON servers (status);
```
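The "manual" branch of the daily-partition comment above can be sketched as a small DDL generator, e.g. run once a day from a cron job. The `heartbeats_YYYY_MM_DD` naming convention is an assumption for illustration; pg_partman automates the same thing.

```javascript
// Sketch: generate the DDL for one daily partition of the heartbeats table.
// The heartbeats_YYYY_MM_DD naming convention is an assumption; pg_partman
// would automate this in production.
function dailyPartitionDDL(dateStr) {
  const from = new Date(`${dateStr}T00:00:00Z`);
  const to = new Date(from.getTime() + 24 * 60 * 60 * 1000);
  const day = (d) => d.toISOString().slice(0, 10); // YYYY-MM-DD
  const suffix = day(from).replace(/-/g, "_");
  return (
    `CREATE TABLE IF NOT EXISTS heartbeats_${suffix} PARTITION OF heartbeats\n` +
    `  FOR VALUES FROM ('${day(from)}') TO ('${day(to)}');`
  );
}

console.log(dailyPartitionDDL("2026-04-01"));
// CREATE TABLE IF NOT EXISTS heartbeats_2026_04_01 PARTITION OF heartbeats
//   FOR VALUES FROM ('2026-04-01') TO ('2026-04-02');
```

Dropping a day of data is then `DROP TABLE heartbeats_2026_04_01;`, which is the "instant, no vacuum" retention path the scaling notes below rely on.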

### API Endpoints

```
Authentication: Bearer token (PLATFORM_API_KEY)

POST  /api/v1/heartbeat                ← Receive heartbeat from wolverine instance
      → Upsert server, insert heartbeat, process alerts
      → Returns: { received: true, serverTime: "..." }

GET   /api/v1/servers                  ← List all instances
      → Query: ?status=healthy&sort=last_heartbeat&limit=50&offset=0
      → Returns: { servers: [...], total: 150, page: 1 }

GET   /api/v1/servers/:id              ← Single instance detail
      → Returns: full server state + recent heartbeats + repairs + alerts

GET   /api/v1/servers/:id/heartbeats   ← Heartbeat history
      → Query: ?from=2026-04-01&to=2026-04-02&interval=5m
      → Returns: time-series data for charting

GET   /api/v1/servers/:id/repairs      ← Repair history for one server
      → Query: ?limit=50&success=true
      → Returns: { repairs: [...], stats: { total, successes, avgTokens } }

GET   /api/v1/analytics                ← Fleet-wide aggregates
      → Query: ?period=24h or ?from=...&to=...
      → Returns: {
          totalServers, activeServers, totalRepairs, successRate,
          totalTokens, totalCost, tokensByCategory, costByModel,
          uptimePercent, avgResponseTime
        }

GET   /api/v1/analytics/cost           ← Cost breakdown
      → Query: ?period=7d&groupBy=server|model|category
      → Returns: cost time-series + breakdown

GET   /api/v1/alerts                   ← Active alerts across fleet
      → Query: ?severity=critical&acknowledged=false
      → Returns: { alerts: [...], total: 5 }

PATCH /api/v1/alerts/:id               ← Acknowledge/resolve alert
      → Body: { action: "acknowledge" | "resolve", by: "admin@..." }

WS    /ws/live                         ← Real-time WebSocket stream
      → Streams: heartbeats, alerts, repairs as they arrive
      → Subscribe: { subscribe: ["heartbeat", "alert", "repair"] }
      → Filter: { servers: ["wlv_a1b2c3d4"] }
```
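The ingestion endpoint's behavior (upsert server, insert heartbeat, return the ack shape) can be sketched as a pure handler. The `store` interface (`upsertServer`/`insertHeartbeat`) is a hypothetical stand-in for the PostgreSQL layer, and the Express wiring in the trailing comment assumes the recommended backend stack:

```javascript
// Sketch of the POST /api/v1/heartbeat behavior: upsert the server row,
// record the heartbeat, and return the ack shape from the endpoint table.
// The `store` interface is hypothetical; alert processing would be queued
// separately so it never blocks ingestion.
function handleHeartbeat(body, store, now = new Date()) {
  if (!body || !body.instanceId) {
    return { status: 400, json: { error: "instanceId required" } };
  }
  store.upsertServer({
    id: body.instanceId,
    name: body.server?.name,
    version: body.version,
    last_heartbeat: now,
    status: body.server?.status ?? "unknown",
  });
  store.insertHeartbeat({ server_id: body.instanceId, payload: body });
  return { status: 200, json: { received: true, serverTime: now.toISOString() } };
}

// Express wiring (assumed, matching the recommended stack):
// app.post("/api/v1/heartbeat", (req, res) => {
//   const out = handleHeartbeat(req.body, store);
//   res.status(out.status).json(out.json);
// });
```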

### Scaling Strategy

```
10 servers:      Single PostgreSQL, single Node.js backend
100 servers:     PostgreSQL with connection pooling (pgBouncer), Redis cache
1,000 servers:   Partitioned heartbeats table, read replicas, queue workers
10,000 servers:  TimescaleDB for time-series, horizontal API scaling, Kafka for ingestion
100,000+:        Sharded by server_id, dedicated ingestion pipeline, ClickHouse for analytics
```

**Key scaling decisions:**
- Heartbeats are **append-only** — no updates, only inserts → perfect for time-series DBs
- Hourly rollups in `usage_hourly` prevent expensive full-table scans for analytics
- Partitioned by day → drop old data by partition (instant, no vacuum)
- Redis caches the "current state" of each server (latest heartbeat) → fast fleet overview
- WebSocket uses Redis pub/sub → horizontal scaling of frontend connections
- Alert processing is async via job queue → doesn't block heartbeat ingestion

### Redis Structure

```
wolverine:server:{id}:state    ← Latest heartbeat (JSON, TTL 5min)
wolverine:server:{id}:uptime   ← Uptime counter (INCR every heartbeat)
wolverine:servers:active       ← Sorted set (score = last_heartbeat timestamp)
wolverine:alerts:active        ← Set of active alert IDs
wolverine:stats:fleet          ← Cached fleet-wide aggregates (TTL 30s)
wolverine:pubsub:heartbeats    ← Pub/sub channel for real-time streaming
wolverine:pubsub:alerts        ← Pub/sub channel for alert notifications
```
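The writes implied by one heartbeat against this key layout can be sketched as plain `[command, ...args]` tuples, so the scheme is testable without a live Redis. The command set (`SET ... EX`, `INCR`, `ZADD`, `PUBLISH`) is standard Redis; batching them into one pipeline is an assumption about the real backend:

```javascript
// Sketch: the Redis writes implied by one incoming heartbeat, expressed as
// [command, ...args] tuples. A real backend would issue these through a
// client pipeline (e.g. ioredis) in a single round trip.
function heartbeatRedisCommands(hb) {
  const id = hb.instanceId;
  return [
    // Latest state, expiring if the instance goes silent for 5 minutes
    ["SET", `wolverine:server:${id}:state`, JSON.stringify(hb), "EX", 300],
    // Liveness counter, bumped on every heartbeat
    ["INCR", `wolverine:server:${id}:uptime`],
    // Active-servers index, scored by last heartbeat timestamp
    ["ZADD", "wolverine:servers:active", hb.timestamp, id],
    // Fan out to real-time WebSocket subscribers
    ["PUBLISH", "wolverine:pubsub:heartbeats", JSON.stringify(hb)],
  ];
}
```

The sorted set makes "which servers went quiet?" a single `ZRANGEBYSCORE` over timestamps, which is how the fleet overview can stay fast at scale.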

---

## Platform Frontend

### Pages

**1. Fleet Overview**
- Grid/list of all server instances
- Color-coded status: green (healthy), yellow (degraded), red (down), gray (unknown)
- Sortable by: status, uptime, memory, cost, last repair
- Search/filter by name, status, tags
- Fleet-wide stats bar: total servers, active, repairs today, cost today

**2. Server Detail**
- Real-time stats: memory, CPU, uptime, routes
- Event timeline (same as local dashboard but from platform data)
- Repair history with resolution details + token cost
- Usage chart: tokens over time, cost over time
- Route health table with response time trends
- Backup status
- Brain stats

**3. Analytics**
- Fleet-wide token usage over time (by day/hour)
- Cost breakdown: by server, by model, by category
- Repair success rate over time
- Mean time to repair (MTTR) trend
- Most expensive servers / most repaired servers
- Uptime SLA tracking (99.9% target)
- Response time percentiles across fleet

**4. Alerts**
- Active alerts sorted by severity
- Acknowledge / resolve workflow
- Alert history with resolution notes
- Alert rules configuration (memory threshold, crash count, response time)

**5. Cost Management**
- Total spend by period (day/week/month)
- Per-server cost ranking
- Per-model cost ranking
- Projected monthly cost based on current usage
- Budget alerts (notify when approaching limit)
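One simple way to compute the "projected monthly cost" figure is a linear extrapolation from month-to-date spend. The formula is an assumption for illustration; the real dashboard may weight recent usage differently:

```javascript
// Sketch: linear projection of monthly spend from month-to-date cost.
// projected = (cost so far / days elapsed) * days in month
function projectMonthlyCost(costSoFar, now = new Date()) {
  const daysElapsed = now.getUTCDate(); // day-of-month, counting today
  const daysInMonth = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 0)
  ).getUTCDate();
  return (costSoFar / daysElapsed) * daysInMonth;
}

// $6 spent by April 10 projects to $18 for the 30-day month:
console.log(projectMonthlyCost(6, new Date("2026-04-10T12:00:00Z"))); // 18
```

Budget alerts then compare the projection, not just the running total, against the limit, so overspend is flagged early in the month.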

### Tech Stack Recommendation

```
Frontend:  Next.js + Tailwind + Recharts (or Tremor for dashboard components)
Backend:   Node.js + Express + PostgreSQL + Redis + BullMQ
Auth:      NextAuth.js or Clerk (team management)
Hosting:   Vercel (frontend) + Railway/Fly.io (backend) + Supabase (PostgreSQL)
WebSocket: Socket.io or native WS through the backend
```

---

## Wolverine Client Integration

### New env variables for the wolverine instance:

```env
# Platform telemetry (optional — wolverine works fine without it)
WOLVERINE_PLATFORM_URL=https://api.wolverine.dev
WOLVERINE_PLATFORM_KEY=wlvk_your_api_key_here
WOLVERINE_INSTANCE_NAME=my-api-prod
WOLVERINE_HEARTBEAT_INTERVAL_MS=60000
```

### Telemetry module to build in wolverine:

```
src/platform/
├── telemetry.js   ← Collects heartbeat data from all subsystems
├── heartbeat.js   ← Sends heartbeat to platform on interval
└── queue.js       ← Queues heartbeats when platform is unreachable
```

**telemetry.js** gathers data from:
- `processMonitor.getMetrics()` → memory, CPU
- `routeProber.getMetrics()` → route health
- `tokenTracker.getAnalytics()` → usage
- `repairHistory.getStats()` → repairs
- `backupManager.getStats()` → backups
- `brain.getStats()` → brain
- `notifier` → active alerts

**heartbeat.js** sends it:
- HTTP POST to platform every 60s
- Gzip compressed
- Timeout: 5s (don't block if platform is slow)
- On failure: queue locally, retry with exponential backoff
- On reconnect: replay queued heartbeats

**queue.js** handles offline resilience:
- Append to `.wolverine/heartbeat-queue.jsonl` when platform unreachable
- On next successful heartbeat, drain the queue (oldest first)
- Max queue size: 1440 entries (24 hours of heartbeats)
- After 24h, drop oldest entries (stale data isn't useful)
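The queue rules (cap at 1440 entries, drop oldest on overflow, drain oldest-first) can be sketched with an in-memory array. The `enqueue`/`drain` helpers are illustrative assumptions; the shipped `queue.js` persists to `.wolverine/heartbeat-queue.jsonl` instead:

```javascript
// Sketch of the queue.js rules: cap at 1440 entries (24h of heartbeats),
// drop oldest on overflow, replay oldest-first on reconnect.
const MAX_QUEUE = 1440;

function enqueue(queue, hb) {
  queue.push(hb);
  while (queue.length > MAX_QUEUE) queue.shift(); // stale data isn't useful
  return queue;
}

function drain(queue, send) {
  // Replay oldest-first; stop at the first failed send so order is preserved
  // and the remaining entries are retried on the next successful heartbeat.
  while (queue.length > 0) {
    if (!send(queue[0])) break;
    queue.shift();
  }
  return queue;
}
```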
---

## Security Considerations

- **Platform API key** per instance — revocable, rotatable
- **Secret redactor** runs on heartbeat payload before sending (no env values leak)
- **No source code** in heartbeats — only metrics, error messages (redacted), and stats
- **TLS only** — platform endpoint must be HTTPS
- **Rate limiting** on platform ingestion — max 1 heartbeat/second per instance
- **Tenant isolation** — a multi-tenant platform must scope data by organization
- **Audit log** — track who acknowledged/resolved alerts
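The redaction step can be sketched as a pattern scrub over the serialized payload. The patterns below are illustrative assumptions; the package's own `src/security/secret-redactor.js` is the authoritative implementation:

```javascript
// Minimal sketch of scrubbing secret-looking values from a heartbeat before
// it leaves the process. Patterns are illustrative, not exhaustive.
const SECRET_PATTERNS = [
  /\bsk-[A-Za-z0-9_-]{10,}\b/g,  // OpenAI-style keys
  /\bwlvk_[A-Za-z0-9]+\b/g,      // platform API keys
  /\bBearer\s+[A-Za-z0-9._-]+/g, // bearer tokens
];

function redactHeartbeat(hb) {
  // Redact over the serialized form so nested fields (e.g. repair error
  // messages) are covered too, then parse back to an object.
  let text = JSON.stringify(hb);
  for (const pattern of SECRET_PATTERNS) text = text.replace(pattern, "[REDACTED]");
  return JSON.parse(text);
}

const leaky = { repairs: { lastRepair: { error: "401 with key sk-abc123def456ghi" } } };
console.log(redactHeartbeat(leaky).repairs.lastRepair.error);
// 401 with key [REDACTED]
```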

---

## Implementation Priority

### Phase 1: Core (1-2 weeks)
1. Platform backend: heartbeat ingestion + server listing + basic API
2. Wolverine telemetry module: collect + send heartbeats
3. Frontend: fleet overview + server detail page
4. PostgreSQL schema + Redis caching

### Phase 2: Analytics (1 week)
1. Hourly usage rollups
2. Cost analytics page
3. Repair history aggregation
4. Uptime tracking

### Phase 3: Alerting (1 week)
1. Alert rules engine
2. Acknowledge/resolve workflow
3. Email/Slack/webhook notifications
4. Alert history

### Phase 4: Scale (ongoing)
1. TimescaleDB migration for heartbeats
2. Horizontal API scaling
3. WebSocket real-time streaming
4. Team management + RBAC