@unrdf/observability 26.4.2
- package/.eslintrc.cjs +10 -0
- package/IMPLEMENTATION-SUMMARY.md +478 -0
- package/LICENSE +21 -0
- package/README.md +482 -0
- package/capability-map.md +90 -0
- package/config/alert-rules.yml +269 -0
- package/config/prometheus.yml +136 -0
- package/dashboards/grafana-unrdf.json +798 -0
- package/dashboards/unrdf-workflow-dashboard.json +295 -0
- package/docs/OBSERVABILITY-PATTERNS.md +681 -0
- package/docs/OBSERVABILITY-RUNBOOK.md +554 -0
- package/examples/observability-demo.mjs +334 -0
- package/package.json +46 -0
- package/src/advanced-metrics.mjs +413 -0
- package/src/alerts/alert-manager.mjs +436 -0
- package/src/custom-events.mjs +558 -0
- package/src/distributed-tracing.mjs +352 -0
- package/src/exporters/grafana-exporter.mjs +415 -0
- package/src/index.mjs +61 -0
- package/src/metrics/workflow-metrics.mjs +346 -0
- package/src/receipts/anchor.mjs +155 -0
- package/src/receipts/index.mjs +62 -0
- package/src/receipts/merkle-tree.mjs +188 -0
- package/src/receipts/receipt-chain.mjs +209 -0
- package/src/receipts/receipt-schema.mjs +128 -0
- package/src/receipts/tamper-detection.mjs +219 -0
- package/test/advanced-metrics.test.mjs +302 -0
- package/test/custom-events.test.mjs +387 -0
- package/test/distributed-tracing.test.mjs +314 -0
- package/validation/observability-validation.mjs +366 -0
- package/vitest.config.mjs +25 -0
@@ -0,0 +1,554 @@

# UNRDF Observability Runbook

Operational procedures for monitoring, troubleshooting, and maintaining UNRDF observability infrastructure.

## Quick Reference

| Alert               | Severity | Response Time | Action                                  |
| ------------------- | -------- | ------------- | --------------------------------------- |
| ServiceDown         | Critical | 1 min         | [Service Recovery](#service-down)       |
| InjectionAttempt    | Critical | 2 min         | [Security Incident](#injection-attempt) |
| CriticalMemoryUsage | Critical | 5 min         | [Memory Issues](#high-memory-usage)     |
| HighP99Latency      | Critical | 5 min         | [Performance](#high-latency)            |
| LowSuccessRate      | Warning  | 10 min        | [Success Rate](#low-success-rate)       |
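
The table can also be kept as data so that tooling can route an alert to the right section. The sketch below is illustrative only: `QUICK_REFERENCE` and `responseDeadline` are hypothetical names and are not part of `@unrdf/observability`.

```javascript
// Hypothetical encoding of the Quick Reference table above.
const QUICK_REFERENCE = new Map([
  ['ServiceDown', { severity: 'critical', responseMinutes: 1, section: '#service-down' }],
  ['InjectionAttempt', { severity: 'critical', responseMinutes: 2, section: '#injection-attempt' }],
  ['CriticalMemoryUsage', { severity: 'critical', responseMinutes: 5, section: '#high-memory-usage' }],
  ['HighP99Latency', { severity: 'critical', responseMinutes: 5, section: '#high-latency' }],
  ['LowSuccessRate', { severity: 'warning', responseMinutes: 10, section: '#low-success-rate' }],
]);

// Returns the runbook section and the time by which a responder should act,
// or null when the alert is not covered by the table.
function responseDeadline(alertName, firedAt) {
  const entry = QUICK_REFERENCE.get(alertName);
  if (!entry) return null;
  const deadline = new Date(firedAt.getTime() + entry.responseMinutes * 60000);
  return { section: entry.section, severity: entry.severity, deadline };
}
```

Calling `responseDeadline('ServiceDown', new Date())` yields a deadline one minute after the alert fired.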

## Incident Response

### Service Down

**Alert**: `ServiceDown`
**Severity**: Critical
**Response Time**: 1 minute

#### Symptoms

- Prometheus shows `up{job=~"unrdf-.*"} == 0`
- No metrics received in last 60 seconds
- Health check endpoints unreachable
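
To turn the first symptom into a list of affected targets, the JSON returned by the Prometheus instant-query endpoint for `up` can be filtered. This is a minimal sketch that assumes the standard response shape (`status`, `data.resultType`, `data.result`); `findDownTargets` is a hypothetical helper name.

```javascript
// Lists scrape targets whose `up` sample is 0 in a Prometheus
// instant-query response for the `up` metric.
function findDownTargets(response) {
  if (response.status !== 'success' || response.data.resultType !== 'vector') {
    throw new Error('expected a successful instant-vector response');
  }
  return response.data.result
    // Each sample value is [unixTimestamp, "valueAsString"].
    .filter(({ value }) => Number(value[1]) === 0)
    .map(({ metric }) => ({ job: metric.job, instance: metric.instance }));
}
```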

#### Diagnosis

```bash
# Check service status
systemctl status unrdf-core

# Check logs
journalctl -u unrdf-core -n 100 --no-pager

# Check process
ps aux | grep unrdf

# Check port availability
netstat -tulpn | grep 9464
```

#### Resolution

**Option 1: Restart Service**

```bash
# Graceful restart
systemctl restart unrdf-core

# Verify recovery
curl http://localhost:9464/metrics
```

**Option 2: Check Configuration**

```bash
# Syntax-check the entry point (this does not validate runtime configuration)
node --check /path/to/unrdf/index.mjs

# Check environment variables
env | grep UNRDF
```

**Option 3: Resource Issues**

```bash
# Check disk space
df -h

# Check memory
free -h

# Check OOM killer
dmesg | grep -i "out of memory"
```

#### Prevention

- Set up health check monitoring (every 10s)
- Configure auto-restart on failure
- Set resource limits appropriately

### Injection Attempt

**Alert**: `InjectionAttempt`
**Severity**: Critical
**Response Time**: 2 minutes

#### Symptoms

- Alert: `increase(event_total{event_type="security.injection.attempt"}[5m]) > 0`
- Security events in Grafana dashboard
- Unusual query patterns

#### Diagnosis

```bash
# Query injection events (--data-urlencode keeps the braces and quotes intact)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=event_total{event_type="security.injection.attempt"}'

# Check event details via API (quote the URL so the shell does not treat & as a control operator)
curl "http://unrdf-api:3000/api/events?type=security.injection.attempt&limit=10"

# Review logs
grep "injection.attempt" /var/log/unrdf/security.log
```
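
PromQL selectors contain braces, quotes and brackets, so they need percent-encoding before they go into a query string. When the query is issued from Node rather than curl, the built-in `URL` class handles this. A small sketch, with `buildPromQueryUrl` as a hypothetical helper name:

```javascript
// Builds an instant-query URL for the Prometheus HTTP API and lets
// URLSearchParams percent-encode the PromQL expression.
function buildPromQueryUrl(baseUrl, promql) {
  const url = new URL('/api/v1/query', baseUrl);
  url.searchParams.set('query', promql);
  return url.toString();
}

const injectionQueryUrl = buildPromQueryUrl(
  'http://prometheus:9090',
  'event_total{event_type="security.injection.attempt"}',
);
console.log(injectionQueryUrl);
```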

#### Resolution

**Immediate Actions**:

1. **Block IP Address**:

   ```bash
   # Add to firewall
   sudo ufw deny from <ATTACKER_IP>

   # Or via API
   curl -X POST http://sidecar:3000/api/admin/blacklist \
     -H "Content-Type: application/json" \
     -d '{"ip": "<ATTACKER_IP>", "reason": "injection_attempt"}'
   ```

2. **Review Attack Payload**:

   ```javascript
   // Via custom events API
   const events = await getEventsByType('security.injection.attempt');
   events.forEach(e => {
     console.log('Attack type:', e.attributes['injection.type']);
     console.log('Payload hash:', e.attributes['injection.payload_hash']);
     console.log('IP:', e.attributes['injection.ip_address']);
   });
   ```

3. **Validate Input Sanitization**:

   ```bash
   # Run security audit
   npm run security:audit

   # Check SPARQL injection protection
   npm run test:security:sparql-injection
   ```
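
When many events arrive at once, grouping them by source address makes it easier to decide which IPs to block first. The sketch below assumes events shaped like those in step 2, with an `attributes` object carrying `injection.ip_address` and `injection.type`; `summarizeAttempts` is a hypothetical helper.

```javascript
// Groups injection events by source IP and returns the busiest sources first.
function summarizeAttempts(events) {
  const byIp = new Map();
  for (const event of events) {
    const ip = event.attributes['injection.ip_address'] ?? 'unknown';
    const entry = byIp.get(ip) ?? { ip, count: 0, types: new Set() };
    entry.count += 1;
    entry.types.add(event.attributes['injection.type']);
    byIp.set(ip, entry);
  }
  return [...byIp.values()]
    .map(({ ip, count, types }) => ({ ip, count, types: [...types].sort() }))
    .sort((a, b) => b.count - a.count);
}
```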

#### Prevention

- Enable strict input validation
- Use parameterized queries
- Implement rate limiting
- Monitor for abnormal patterns
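
As one way to approach the rate-limiting item, a sliding-window limiter keyed by client address can be sketched in a few lines. This is an in-memory illustration under assumed names (`createRateLimiter`, `limit`, `windowMs`), not the package's implementation, and it would need shared storage to work across several processes.

```javascript
// Sliding-window rate limiter: allows at most `limit` calls per key
// within any window of `windowMs` milliseconds.
function createRateLimiter({ limit, windowMs, now = Date.now }) {
  const hits = new Map(); // key -> timestamps still inside the window
  return function allow(key) {
    const t = now();
    const recent = (hits.get(key) ?? []).filter((ts) => t - ts < windowMs);
    const allowed = recent.length < limit;
    if (allowed) recent.push(t);
    hits.set(key, recent);
    return allowed;
  };
}
```

The `now` parameter exists only so that the clock can be replaced in tests.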

### High Memory Usage

**Alert**: `HighMemoryUsage` / `CriticalMemoryUsage`
**Severity**: Warning / Critical
**Response Time**: 5 minutes

#### Symptoms

- Alert: `resource_heap_used_bytes / resource_heap_total_bytes > 0.85`
- Slow performance
- Frequent GC activity

#### Diagnosis

```bash
# Check current memory usage
curl -s http://localhost:9464/metrics | grep resource_heap

# Take heap snapshot (needs a SIGUSR2 snapshot handler, e.g. Node started with --heapsnapshot-signal=SIGUSR2)
kill -SIGUSR2 <PID>
# Snapshot saved to: /tmp/heapsnapshot-<timestamp>.heapsnapshot

# Analyze with Chrome DevTools
# 1. Open Chrome DevTools
# 2. Memory tab > Load snapshot
```
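
The ratio from the alert expression can also be computed directly from the text returned by the `/metrics` endpoint. The sketch below is a simplified reader for the Prometheus text format that returns the first sample of a metric and ignores label values containing `}`; `readGauge` and `heapUsage` are hypothetical helpers.

```javascript
// Returns the first sample value for `name` in Prometheus text-format output.
function readGauge(metricsText, name) {
  for (const line of metricsText.split('\n')) {
    if (line.startsWith('#')) continue; // skip HELP and TYPE lines
    // Matches `name value` or `name{labels} value`.
    const m = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/);
    if (m && m[1] === name) return Number(m[3]);
  }
  return undefined;
}

// Computes used/total heap and compares it with the 0.85 alert threshold.
function heapUsage(metricsText, threshold = 0.85) {
  const used = readGauge(metricsText, 'resource_heap_used_bytes');
  const total = readGauge(metricsText, 'resource_heap_total_bytes');
  if (used === undefined || total === undefined || total === 0) return null;
  const ratio = used / total;
  return { ratio, overThreshold: ratio > threshold };
}
```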

#### Resolution

**Option 1: Immediate Relief**

```bash
# Trigger manual GC (if --expose-gc enabled)
curl -X POST http://localhost:3000/api/admin/gc

# Clear caches
curl -X POST http://localhost:3000/api/admin/clear-cache
```

**Option 2: Identify Memory Leaks**

```bash
# Run leak detection
npm run memory:leak-check

# Profile memory usage
node --inspect index.mjs
# Open chrome://inspect in Chrome and attach DevTools to the process
```

**Option 3: Increase Memory**

```bash
# Open an override file for the systemd service
sudo systemctl edit unrdf-core

# In the override file, raise the V8 old-space heap limit (value in MiB)
[Service]
Environment="NODE_OPTIONS=--max-old-space-size=4096"

# Restart
sudo systemctl restart unrdf-core
```

#### Prevention

- Set appropriate memory limits
- Monitor memory growth trends
- Implement cache eviction policies
- Use streaming for large datasets
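
A least-recently-used policy is one common choice for the cache eviction item. The following sketch relies on the insertion-order iteration of `Map`; `LruCache` is an illustrative class and not something the package is known to export.

```javascript
// Bounded cache that evicts the least recently used entry when full.
class LruCache {
  constructor(maxEntries) {
    if (!Number.isInteger(maxEntries) || maxEntries < 1) {
      throw new RangeError('maxEntries must be a positive integer');
    }
    this.maxEntries = maxEntries;
    this.map = new Map(); // iteration order: least recently used first
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key); // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      const oldestKey = this.map.keys().next().value;
      this.map.delete(oldestKey);
    }
  }

  get size() {
    return this.map.size;
  }
}
```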

### High Latency

**Alert**: `HighP95Latency` / `HighP99Latency`
**Severity**: Warning / Critical
**Response Time**: 5 minutes

#### Symptoms

- Alert: `latency_p95_ms > 1000`
- Slow query responses
- Timeout warnings in logs

#### Diagnosis

```bash
# Check latency percentiles
curl -s "http://prometheus:9090/api/v1/query?query=latency_p95_ms" | jq

# Identify slow operations
curl -s "http://prometheus:9090/api/v1/query?query=topk(10,%20latency_p95_ms)" | jq

# Check slow query events (quote the URL so the shell does not treat & as a control operator)
curl "http://localhost:3000/api/events?type=performance.slow_query&limit=20"
```
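
When raw durations are available, for example from the slow-query events, a percentile can be checked by hand. The sketch below uses the nearest-rank definition, which is an assumption made here for illustration; how `latency_p95_ms` is computed inside the package is not specified in this runbook.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point error in p / 100.
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}
```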

#### Resolution

**Option 1: Query Optimization**

```bash
# Analyze slow queries
npm run query:analyze

# Check query plan
curl -X POST http://localhost:3000/api/query/explain \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT * WHERE { ?s ?p ?o }"}'
```

**Option 2: Scale Resources**

```bash
# Increase worker threads. The variable must be set in the service's own
# environment; an `export` in an interactive shell is not seen by systemd.
sudo systemctl edit unrdf-core
# In the override file:
#   [Service]
#   Environment="UNRDF_WORKERS=4"

# Restart service
sudo systemctl restart unrdf-core
```

**Option 3: Enable Caching**

```bash
# Enable query cache
curl -X POST http://localhost:3000/api/admin/cache/enable

# Set cache TTL
curl -X POST http://localhost:3000/api/admin/cache/config \
  -H "Content-Type: application/json" \
  -d '{"ttl": 300}'
```

#### Prevention

- Optimize common query patterns
- Use query complexity limits
- Implement connection pooling
- Enable aggressive caching

### Low Success Rate

**Alert**: `LowSuccessRate` / `CriticalSuccessRate`
**Severity**: Warning / Critical
**Response Time**: 10 minutes

#### Symptoms

- Alert: `sum(rate(business_operations_total{result="success"}[5m])) / sum(rate(business_operations_total[5m])) < 0.95`
- Increased error rate
- User reports
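
The alert compares successful operations with all operations over the same window. The same arithmetic on plain counts looks like this; `successRate` and `isBelowTarget` are hypothetical helpers, and the `0.95` default mirrors the threshold in the alert expression.

```javascript
// Fraction of operations that succeeded, given counts keyed by result.
function successRate(countsByResult) {
  const total = Object.values(countsByResult).reduce((sum, n) => sum + n, 0);
  if (total === 0) return null; // no traffic: the ratio is undefined
  return (countsByResult.success ?? 0) / total;
}

// True when there was traffic and the success rate is under the target.
function isBelowTarget(countsByResult, target = 0.95) {
  const rate = successRate(countsByResult);
  return rate !== null && rate < target;
}
```

Treating "no traffic" as a separate case avoids paging on a division by zero during quiet periods.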

#### Diagnosis

```bash
# Check error distribution
curl -s "http://prometheus:9090/api/v1/query?query=sum%20by%20(error_type)%20(rate(business_failures_by_type%5B5m%5D))" | jq

# Review error logs
journalctl -u unrdf-core --since "10 minutes ago" | grep ERROR

# Check dependencies
curl http://localhost:3000/api/health/dependencies
```

#### Resolution

**Option 1: Identify Root Cause**

```bash
# Group errors by type
curl "http://prometheus:9090/api/v1/query?query=topk(5,%20sum%20by%20(error_type)%20(rate(business_failures_by_type%5B5m%5D)))"

# Check error traces (quote the URL so the shell does not treat & as a control operator)
curl "http://localhost:16686/api/traces?service=unrdf-core&limit=20"
```

**Option 2: Rollback Recent Changes**

```bash
# Check recent deployments
git log --since="1 hour ago" --oneline

# Rollback if needed
git revert <commit-hash>
npm run deploy
```

**Option 3: Circuit Breaker**

```bash
# Enable circuit breaker for failing dependencies
curl -X POST http://localhost:3000/api/admin/circuit-breaker/enable \
  -H "Content-Type: application/json" \
  -d '{"service": "external-api", "threshold": 0.5}'
```
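
For readers unfamiliar with the pattern, a circuit breaker with a failure-ratio threshold can be sketched as follows. This is not how the admin endpoint is implemented; every name and default here (`createCircuitBreaker`, `minCalls`, `cooldownMs`) is an assumption, and the counters cover all calls since the last state change rather than a rolling window.

```javascript
// Minimal circuit breaker: opens when the failure ratio reaches `threshold`
// after at least `minCalls` calls, then allows a trial call after `cooldownMs`.
function createCircuitBreaker({ threshold = 0.5, minCalls = 10, cooldownMs = 30000, now = Date.now } = {}) {
  let state = 'closed'; // 'closed' | 'open' | 'half-open'
  let successes = 0;
  let failures = 0;
  let openedAt = 0;

  function open() {
    state = 'open';
    openedAt = now();
    successes = 0;
    failures = 0;
  }

  return {
    // Ask before each call; moves open -> half-open once the cooldown elapsed.
    canRequest() {
      if (state === 'open' && now() - openedAt >= cooldownMs) state = 'half-open';
      return state !== 'open';
    },
    // Report the outcome of a call.
    record(ok) {
      if (state === 'half-open') {
        // The trial call decides: success closes the circuit, failure reopens it.
        if (ok) {
          state = 'closed';
          successes = 0;
          failures = 0;
        } else {
          open();
        }
        return;
      }
      if (ok) successes += 1;
      else failures += 1;
      const total = successes + failures;
      if (total >= minCalls && failures / total >= threshold) open();
    },
    get state() {
      return state;
    },
  };
}
```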

#### Prevention

- Comprehensive error handling
- Graceful degradation
- Dependency health checks
- Canary deployments

## Monitoring Checklist

### Daily Checks

- [ ] Review Grafana dashboard
- [ ] Check alert status in Alertmanager
- [ ] Verify metrics ingestion (no staleness)
- [ ] Review security events
- [ ] Check resource utilization trends

### Weekly Checks

- [ ] Analyze latency trends
- [ ] Review error patterns
- [ ] Check disk space on Prometheus/Jaeger
- [ ] Validate backup processes
- [ ] Review alert tuning

### Monthly Checks

- [ ] Review dashboard effectiveness
- [ ] Update alert thresholds based on trends
- [ ] Analyze cost/performance trade-offs
- [ ] Update runbook based on incidents
- [ ] Review sampling strategies

## Maintenance Procedures

### Prometheus Data Retention

```bash
# Check disk usage
df -h /var/lib/prometheus

# Remove already-deleted series from disk (requires --web.enable-admin-api)
curl -X POST http://prometheus:9090/api/v1/admin/tsdb/clean_tombstones

# Adjust retention. Prometheus has long taken these as startup flags;
# check whether your version also reads them from prometheus.yml.
#   --storage.tsdb.retention.time=30d
#   --storage.tsdb.retention.size=50GB
```

### Jaeger Storage Cleanup

```bash
# Clean traces (Cassandra)
# Note: CQL TRUNCATE accepts no WHERE clause, so it cannot filter by age.
# This removes ALL rows; age-based expiry is normally left to the schema TTL.
docker exec jaeger-cassandra cqlsh -e \
  "TRUNCATE jaeger_v1_dc1.traces;"

# Or use Jaeger's built-in cleanup
# (verify that this endpoint exists in your Jaeger version before relying on it)
curl -X DELETE "http://jaeger:16686/api/traces?service=unrdf-core&lookback=30d"
```

### OTEL Collector Restart

```bash
# Graceful restart
docker restart otel-collector

# Verify health
curl http://localhost:13133/
```

## Performance Tuning

### Sampling Strategy Adjustment

```javascript
// Update sampling rates based on load
const tracing = createDistributedTracing({
  sampling: {
    defaultRate: 0.001, // 0.1% for high-volume production
    errorRate: 1.0, // Always sample errors
    slowThreshold: 5000, // 5s threshold
    slowRate: 0.5, // 50% of slow operations
  },
});
```
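
One plausible reading of these four settings is a per-span decision in which errors take precedence over slowness and everything else falls back to the default rate. The sketch below illustrates that reading only; `shouldSample` is hypothetical, the span fields `error` and `durationMs` are assumed, and the behaviour of `createDistributedTracing` itself may differ.

```javascript
// Decides whether to keep a finished span under the sampling settings above.
// `random` is injectable so that the decision can be tested deterministically.
function shouldSample(span, sampling, random = Math.random) {
  let rate = sampling.defaultRate;
  if (span.error) {
    rate = sampling.errorRate; // errors take precedence
  } else if (span.durationMs >= sampling.slowThreshold) {
    rate = sampling.slowRate; // threshold assumed to be in milliseconds
  }
  return random() < rate;
}
```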

### Metric Aggregation

```yaml
# Prometheus recording rules for faster queries
groups:
  - name: unrdf_aggregates
    interval: 30s
    rules:
      - record: job:latency_p95_ms:5m
        expr: histogram_quantile(0.95, sum by (job, le) (rate(latency_operation_duration_ms_bucket[5m])))

      - record: job:success_rate:5m
        expr: |
          sum by (job) (rate(business_operations_total{result="success"}[5m])) /
          sum by (job) (rate(business_operations_total[5m]))
```

### Resource Optimization

```yaml
# Optimize Prometheus scrape interval
# High-priority: 10s
# Medium: 30s
# Low: 60s

# Reduce cardinality
# Drop unneeded series. Caution: `action: drop` removes whole series, not
# labels, and this regex matches every histogram series, including the
# *_bucket series read by the latency recording rule. Narrow it to metrics
# you do not query.
metric_relabel_configs:
  - source_labels: [__name__]
    regex: '.*_bucket|.*_sum|.*_count'
    action: drop
```

## Backup and Recovery

### Prometheus Backup

```bash
# Snapshot current data (requires --web.enable-admin-api)
curl -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot

# Backup snapshot
tar -czf prometheus-backup-$(date +%Y%m%d).tar.gz \
  /var/lib/prometheus/snapshots/

# Upload to S3
aws s3 cp prometheus-backup-$(date +%Y%m%d).tar.gz \
  s3://backups/prometheus/
```

### Jaeger Backup

```bash
# Export traces to storage (assumes a jaeger-tools export utility is installed in the container)
docker exec jaeger-collector \
  jaeger-tools export --service unrdf-core \
  --output /backups/traces-$(date +%Y%m%d).json
```

### Restore Procedure

```bash
# Stop Prometheus
systemctl stop prometheus

# Restore data. tar stored the paths without the leading slash, so extract
# at / to recreate /var/lib/prometheus/snapshots/
tar -xzf prometheus-backup-YYYYMMDD.tar.gz -C /

# Copy the snapshot's blocks into the data directory
# (<snapshot-name> is a placeholder for the directory created by the snapshot call)
cp -a /var/lib/prometheus/snapshots/<snapshot-name>/. /var/lib/prometheus/

# Start Prometheus
systemctl start prometheus
```

## Escalation

### L1 Support

- Check alert status
- Review runbook procedures
- Attempt standard remediation

### L2 Support (Escalate if)

- Alert persists >15 minutes
- Multiple cascading failures
- Security incidents
- Unknown root cause

### L3 Support (Escalate if)

- System-wide outage
- Data loss suspected
- Critical security breach
- Architecture-level issues

## Contact Information

```yaml
Teams:
  Platform:
    Slack: '#unrdf-platform'
    Pagerduty: '@unrdf-oncall'

  Security:
    Slack: '#security-incidents'
    Email: 'security@example.com'

  Infrastructure:
    Slack: '#infrastructure'
    Pagerduty: '@infra-oncall'
```

## Additional Resources

- [Observability Patterns](./OBSERVABILITY-PATTERNS.md)
- [Prometheus Querying](https://prometheus.io/docs/prometheus/latest/querying/basics/)
- [Jaeger UI Guide](https://www.jaegertracing.io/docs/latest/frontend-ui/)
- [Grafana Troubleshooting](https://grafana.com/docs/grafana/latest/troubleshooting/)