@unrdf/observability 26.4.2

@@ -0,0 +1,554 @@
+ # UNRDF Observability Runbook
+
+ Operational procedures for monitoring, troubleshooting, and maintaining UNRDF observability infrastructure.
+
+ ## Quick Reference
+
+ | Alert               | Severity | Response Time | Action                                  |
+ | ------------------- | -------- | ------------- | --------------------------------------- |
+ | ServiceDown         | Critical | 1 min         | [Service Recovery](#service-down)       |
+ | InjectionAttempt    | Critical | 2 min         | [Security Incident](#injection-attempt) |
+ | CriticalMemoryUsage | Critical | 5 min         | [Memory Issues](#high-memory-usage)     |
+ | HighP99Latency      | Critical | 5 min         | [Performance](#high-latency)            |
+ | LowSuccessRate      | Warning  | 10 min        | [Success Rate](#low-success-rate)       |
+
+ ## Incident Response
+
+ ### Service Down
+
+ **Alert**: `ServiceDown`
+ **Severity**: Critical
+ **Response Time**: 1 minute
+
+ #### Symptoms
+
+ - Prometheus shows `up{job=~"unrdf-.*"} == 0`
+ - No metrics received in the last 60 seconds
+ - Health check endpoints unreachable
+
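For scripted triage, the first symptom can be checked against the JSON that the Prometheus HTTP API returns for an instant query. The sketch below is a hypothetical helper (`findDownTargets` is not part of @unrdf/observability) that takes an already-parsed response to an `up` query and lists the targets reporting `0`. The sample payload, including the `unrdf-sidecar` job name, is invented for illustration.

```javascript
// Hypothetical helper: list targets whose `up` sample is 0.
// Input is the parsed JSON body of an instant query such as up{job=~"unrdf-.*"}.
function findDownTargets(response) {
  if (response.status !== 'success' || response.data.resultType !== 'vector') {
    throw new Error('expected a successful instant-vector response');
  }
  return response.data.result
    // Each sample value is [unixTimestamp, "valueAsString"].
    .filter((sample) => Number(sample.value[1]) === 0)
    .map((sample) => `${sample.metric.job}@${sample.metric.instance}`);
}

// Illustrative payload: invented values in the API's response shape.
const sampleResponse = {
  status: 'success',
  data: {
    resultType: 'vector',
    result: [
      { metric: { __name__: 'up', job: 'unrdf-core', instance: 'localhost:9464' }, value: [1700000000, '0'] },
      { metric: { __name__: 'up', job: 'unrdf-sidecar', instance: 'localhost:9465' }, value: [1700000000, '1'] },
    ],
  },
};

console.log(findDownTargets(sampleResponse));
```

An empty array means every matched target is up; a missing target (no sample at all) would need a separate `absent()` style check.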
+ #### Diagnosis
+
+ ```bash
+ # Check service status
+ systemctl status unrdf-core
+
+ # Check logs
+ journalctl -u unrdf-core -n 100 --no-pager
+
+ # Check process (pgrep does not match itself, unlike `ps aux | grep`)
+ pgrep -af unrdf
+
+ # Check port availability
+ netstat -tulpn | grep 9464
+ ```
+
+ #### Resolution
+
+ **Option 1: Restart Service**
+
+ ```bash
+ # Graceful restart
+ systemctl restart unrdf-core
+
+ # Verify recovery
+ curl http://localhost:9464/metrics
+ ```
+
+ **Option 2: Check Configuration**
+
+ ```bash
+ # Syntax-check the entry point (this does not validate configuration values)
+ node --check /path/to/unrdf/index.mjs
+
+ # Check environment variables
+ env | grep UNRDF
+ ```
+
+ **Option 3: Resource Issues**
+
+ ```bash
+ # Check disk space
+ df -h
+
+ # Check memory
+ free -h
+
+ # Check OOM killer
+ dmesg | grep -i "out of memory"
+ ```
+
+ #### Prevention
+
+ - Set up health check monitoring (every 10s)
+ - Configure auto-restart on failure
+ - Set resource limits appropriately
+
+ ### Injection Attempt
+
+ **Alert**: `InjectionAttempt`
+ **Severity**: Critical
+ **Response Time**: 2 minutes
+
+ #### Symptoms
+
+ - Alert: `increase(event_total{event_type="security.injection.attempt"}[5m]) > 0`
+ - Security events in Grafana dashboard
+ - Unusual query patterns
+
+ #### Diagnosis
+
+ ```bash
+ # Query injection events (-G with --data-urlencode keeps curl from globbing the braces)
+ curl -G "http://prometheus:9090/api/v1/query" \
+   --data-urlencode 'query=event_total{event_type="security.injection.attempt"}'
+
+ # Check event details via API (quote the URL so the shell does not interpret ? and &)
+ curl "http://unrdf-api:3000/api/events?type=security.injection.attempt&limit=10"
+
+ # Review logs (-F matches the string literally)
+ grep -F "injection.attempt" /var/log/unrdf/security.log
+ ```
+
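PromQL selectors contain braces and quotes, which are easy to mangle on the way through a shell or an HTTP client. A minimal sketch of percent-encoding the expression before it goes into the query string follows; `buildQueryUrl` is a hypothetical helper name, and the host is the one used in the commands above.

```javascript
// Hypothetical helper: build an instant-query URL with the PromQL expression
// percent-encoded, so braces and quotes reach Prometheus unchanged.
function buildQueryUrl(baseUrl, promql) {
  return `${baseUrl}/api/v1/query?query=${encodeURIComponent(promql)}`;
}

const injectionQueryUrl = buildQueryUrl(
  'http://prometheus:9090',
  'event_total{event_type="security.injection.attempt"}'
);
console.log(injectionQueryUrl);
```

The resulting URL can be passed to any HTTP client as a single quoted argument.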
+ #### Resolution
+
+ **Immediate Actions**:
+
+ 1. **Block IP Address**:
+
+ ```bash
+ # Add to firewall
+ sudo ufw deny from <ATTACKER_IP>
+
+ # Or via API
+ curl -X POST http://sidecar:3000/api/admin/blacklist \
+   -H "Content-Type: application/json" \
+   -d '{"ip": "<ATTACKER_IP>", "reason": "injection_attempt"}'
+ ```
+
+ 2. **Review Attack Payload**:
+
+ ```javascript
+ // Via custom events API
+ const events = await getEventsByType('security.injection.attempt');
+ events.forEach(e => {
+   console.log('Attack type:', e.attributes['injection.type']);
+   console.log('Payload hash:', e.attributes['injection.payload_hash']);
+   console.log('IP:', e.attributes['injection.ip_address']);
+ });
+ ```
+
+ 3. **Validate Input Sanitization**:
+
+ ```bash
+ # Run security audit
+ npm run security:audit
+
+ # Check SPARQL injection protection
+ npm run test:security:sparql-injection
+ ```
+
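When many events arrive at once, it may help to group them by source before deciding which addresses to block. The sketch below assumes event objects shaped like the ones printed in step 2 (an `attributes` map with `injection.type` and `injection.ip_address`). `summarizeInjectionEvents` is a hypothetical helper, and the sample events use invented values and documentation-range IP addresses.

```javascript
// Hypothetical helper: count injection events per source IP and collect the
// attack types seen from each one.
function summarizeInjectionEvents(events) {
  const summary = new Map();
  for (const e of events) {
    const ip = e.attributes['injection.ip_address'];
    const type = e.attributes['injection.type'];
    const entry = summary.get(ip) ?? { count: 0, types: new Set() };
    entry.count += 1;
    entry.types.add(type);
    summary.set(ip, entry);
  }
  // Most active sources first.
  return [...summary.entries()]
    .map(([ip, { count, types }]) => ({ ip, count, types: [...types].sort() }))
    .sort((a, b) => b.count - a.count);
}

// Invented sample events for illustration.
const sampleEvents = [
  { attributes: { 'injection.type': 'sparql', 'injection.ip_address': '203.0.113.7' } },
  { attributes: { 'injection.type': 'sparql', 'injection.ip_address': '203.0.113.7' } },
  { attributes: { 'injection.type': 'command', 'injection.ip_address': '198.51.100.4' } },
];
console.log(summarizeInjectionEvents(sampleEvents));
```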
+ #### Prevention
+
+ - Enable strict input validation
+ - Use parameterized queries
+ - Implement rate limiting
+ - Monitor for abnormal patterns
+
+ ### High Memory Usage
+
+ **Alert**: `HighMemoryUsage` / `CriticalMemoryUsage`
+ **Severity**: Warning / Critical
+ **Response Time**: 5 minutes
+
+ #### Symptoms
+
+ - Alert: `resource_heap_used_bytes / resource_heap_total_bytes > 0.85`
+ - Slow performance
+ - Frequent GC activity
+
+ #### Diagnosis
+
+ ```bash
+ # Check current memory usage
+ curl -s http://localhost:9464/metrics | grep resource_heap
+
+ # Take heap snapshot (only if the service handles SIGUSR2; without a handler the signal terminates the process)
+ kill -SIGUSR2 <PID>
+ # Snapshot saved to: /tmp/heapsnapshot-<timestamp>.heapsnapshot
+
+ # Analyze with Chrome DevTools
+ # 1. Open Chrome DevTools
+ # 2. Memory tab > Load snapshot
+ ```
+
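The alert condition is a plain ratio of two gauges, so it can be reproduced from the text that the metrics endpoint returns. The sketch below parses Prometheus text exposition output and computes heap used over heap total. `readMetric` and `heapUsageRatio` are hypothetical helpers, the sample numbers are invented, and the parser only returns the first sample it finds for a metric name.

```javascript
// Hypothetical helper: read the first sample of a metric from Prometheus
// text exposition output (labels, if present, are ignored).
function readMetric(text, name) {
  for (const line of text.split('\n')) {
    if (line.startsWith('#')) continue; // skip HELP/TYPE comment lines
    const match = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/);
    if (match && match[1] === name) return Number(match[3]);
  }
  return undefined;
}

// Ratio used by the alert expression: heap used / heap total.
function heapUsageRatio(metricsText) {
  const used = readMetric(metricsText, 'resource_heap_used_bytes');
  const total = readMetric(metricsText, 'resource_heap_total_bytes');
  if (used === undefined || total === undefined || total === 0) return undefined;
  return used / total;
}

// Invented sample output for illustration.
const sampleMetrics = [
  '# TYPE resource_heap_used_bytes gauge',
  'resource_heap_used_bytes 900',
  '# TYPE resource_heap_total_bytes gauge',
  'resource_heap_total_bytes 1000',
].join('\n');

const heapRatio = heapUsageRatio(sampleMetrics);
console.log(heapRatio, heapRatio > 0.85 ? 'above the 0.85 alert threshold' : 'ok');
```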
+ #### Resolution
+
+ **Option 1: Immediate Relief**
+
+ ```bash
+ # Trigger manual GC (if --expose-gc enabled)
+ curl -X POST http://localhost:3000/api/admin/gc
+
+ # Clear caches
+ curl -X POST http://localhost:3000/api/admin/clear-cache
+ ```
+
+ **Option 2: Identify Memory Leaks**
+
+ ```bash
+ # Run leak detection
+ npm run memory:leak-check
+
+ # Profile memory usage
+ node --inspect index.mjs
+ # Open chrome://inspect in Chrome to attach DevTools
+ ```
+
+ **Option 3: Increase Memory**
+
+ ```bash
+ # Update systemd service (opens a drop-in override in your editor)
+ sudo systemctl edit unrdf-core
+
+ # In the override file, raise the V8 old-space heap limit (value in MiB):
+ #   [Service]
+ #   Environment="NODE_OPTIONS=--max-old-space-size=4096"
+
+ # Restart
+ sudo systemctl restart unrdf-core
+ ```
+
+ #### Prevention
+
+ - Set appropriate memory limits
+ - Monitor memory growth trends
+ - Implement cache eviction policies
+ - Use streaming for large datasets
+
+ ### High Latency
+
+ **Alert**: `HighP95Latency` / `HighP99Latency`
+ **Severity**: Warning / Critical
+ **Response Time**: 5 minutes
+
+ #### Symptoms
+
+ - Alert: `latency_p95_ms > 1000`
+ - Slow query responses
+ - Timeout warnings in logs
+
+ #### Diagnosis
+
+ ```bash
+ # Check latency percentiles
+ curl -s "http://prometheus:9090/api/v1/query?query=latency_p95_ms" | jq
+
+ # Identify slow operations
+ curl -s "http://prometheus:9090/api/v1/query?query=topk(10,%20latency_p95_ms)" | jq
+
+ # Check slow query events (quote the URL so the shell does not treat & as a background operator)
+ curl "http://localhost:3000/api/events?type=performance.slow_query&limit=20"
+ ```
+
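To make the p95 threshold concrete, the sketch below computes a nearest-rank percentile over raw latency samples. This is an illustration of what a 95th percentile means, not how `latency_p95_ms` is produced: the package's own calculation is not shown in this runbook, and Prometheus's `histogram_quantile` estimates quantiles from buckets by interpolation instead. The sample latencies are invented.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Invented latencies in milliseconds.
const latenciesMs = [120, 80, 950, 200, 1500, 300, 110, 90, 400, 250];
const p95 = percentile(latenciesMs, 95);
console.log(`p95=${p95}ms`, p95 > 1000 ? 'would breach the 1000 ms threshold' : 'ok');
```

In this sample a single slow request is enough to push p95 over the threshold while the median stays low, which is why the diagnosis steps look for individual slow operations.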
+ #### Resolution
+
+ **Option 1: Query Optimization**
+
+ ```bash
+ # Analyze slow queries
+ npm run query:analyze
+
+ # Check query plan
+ curl -X POST http://localhost:3000/api/query/explain \
+   -H "Content-Type: application/json" \
+   -d '{"query": "SELECT * WHERE { ?s ?p ?o }"}'
+ ```
+
+ **Option 2: Scale Resources**
+
+ ```bash
+ # Increase worker threads. An `export` in your shell does not reach a
+ # systemd-managed service, so set the variable in the unit instead:
+ #   sudo systemctl edit unrdf-core
+ #   [Service]
+ #   Environment="UNRDF_WORKERS=4"
+
+ # Restart service
+ systemctl restart unrdf-core
+ ```
+
+ **Option 3: Enable Caching**
+
+ ```bash
+ # Enable query cache
+ curl -X POST http://localhost:3000/api/admin/cache/enable
+
+ # Set cache TTL
+ curl -X POST http://localhost:3000/api/admin/cache/config \
+   -H "Content-Type: application/json" \
+   -d '{"ttl": 300}'
+ ```
+
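The effect of a cache TTL can be illustrated with a small in-memory cache. The sketch below is not the service's cache implementation. It assumes the `ttl` value of 300 is in seconds (the runbook does not state the unit), and `TtlCache` is a hypothetical class with an injectable clock so that expiry is easy to observe.

```javascript
// Minimal TTL cache sketch. Entries expire ttlSeconds after they are written
// and are evicted lazily when read.
class TtlCache {
  constructor(ttlSeconds, now = () => Date.now()) {
    this.ttlMs = ttlSeconds * 1000;
    this.now = now; // injectable clock, returns milliseconds
    this.entries = new Map();
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

let fakeNowMs = 0;
const queryCache = new TtlCache(300, () => fakeNowMs);
queryCache.set('SELECT * WHERE { ?s ?p ?o }', ['row1']);
fakeNowMs = 299000;
console.log(queryCache.get('SELECT * WHERE { ?s ?p ?o }')); // still cached
fakeNowMs = 300000;
console.log(queryCache.get('SELECT * WHERE { ?s ?p ?o }')); // expired
```

A longer TTL lowers latency for repeated queries at the cost of serving staler results and holding more memory, which ties this option to the memory alerts above.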
+ #### Prevention
+
+ - Optimize common query patterns
+ - Use query complexity limits
+ - Implement connection pooling
+ - Enable aggressive caching
+
+ ### Low Success Rate
+
+ **Alert**: `LowSuccessRate` / `CriticalSuccessRate`
+ **Severity**: Warning / Critical
+ **Response Time**: 10 minutes
+
+ #### Symptoms
+
+ - Alert: `sum(rate(business_operations_total{result="success"}[5m])) / sum(rate(business_operations_total[5m])) < 0.95`
+ - Increased error rate
+ - User reports
+
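The alert compares the rate of successful operations with the rate of all operations. A small worked example of that arithmetic follows, assuming per-result rates have already been read from Prometheus. `successRate` is a hypothetical helper, and the `failure` and `timeout` result values and all numbers are invented.

```javascript
// Success ratio over all outcomes: add up the per-result rates first, then
// divide the success rate by that total.
function successRate(ratesByResult) {
  const total = Object.values(ratesByResult).reduce((sum, r) => sum + r, 0);
  if (total === 0) return undefined; // no traffic, so the ratio is undefined
  return (ratesByResult.success ?? 0) / total;
}

// Invented per-second rates keyed by the `result` label.
const sampleRates = { success: 45, failure: 4, timeout: 1 };
const currentSuccessRate = successRate(sampleRates);
console.log(currentSuccessRate, currentSuccessRate < 0.95 ? 'below the 0.95 threshold' : 'ok');
```

Dividing only after summing matters: a ratio computed per label value would compare the success series with itself and always come out as 1.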
+ #### Diagnosis
+
+ ```bash
+ # Check error distribution
+ curl -s "http://prometheus:9090/api/v1/query?query=sum%20by%20(error_type)%20(rate(business_failures_by_type%5B5m%5D))" | jq
+
+ # Review error logs
+ journalctl -u unrdf-core --since "10 minutes ago" | grep ERROR
+
+ # Check dependencies
+ curl http://localhost:3000/api/health/dependencies
+ ```
+
+ #### Resolution
+
+ **Option 1: Identify Root Cause**
+
+ ```bash
+ # Group errors by type
+ curl "http://prometheus:9090/api/v1/query?query=topk(5,%20sum%20by%20(error_type)%20(rate(business_failures_by_type%5B5m%5D)))"
+
+ # Check error traces (quote the URL so the shell does not treat & as a background operator)
+ curl "http://localhost:16686/api/traces?service=unrdf-core&limit=20"
+ ```
+
+ **Option 2: Rollback Recent Changes**
+
+ ```bash
+ # Check recent deployments
+ git log --since="1 hour ago" --oneline
+
+ # Rollback if needed
+ git revert <commit-hash>
+ npm run deploy
+ ```
+
+ **Option 3: Circuit Breaker**
+
+ ```bash
+ # Enable circuit breaker for failing dependencies
+ curl -X POST http://localhost:3000/api/admin/circuit-breaker/enable \
+   -H "Content-Type: application/json" \
+   -d '{"service": "external-api", "threshold": 0.5}'
+ ```
+
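For readers unfamiliar with the pattern, a circuit breaker stops calling a dependency once too many recent calls have failed. The sketch below is a generic illustration, not the service's implementation. It assumes `threshold` means a failure ratio over a sliding window of recent calls, and `CircuitBreaker` and `windowSize` are hypothetical names.

```javascript
// Minimal circuit-breaker sketch: opens when the failure ratio over the last
// `windowSize` recorded calls reaches `threshold`.
class CircuitBreaker {
  constructor({ threshold = 0.5, windowSize = 10 } = {}) {
    this.threshold = threshold;
    this.windowSize = windowSize;
    this.outcomes = []; // true = success, false = failure
    this.open = false;
  }

  record(success) {
    this.outcomes.push(success);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
    const failures = this.outcomes.filter((ok) => !ok).length;
    // Only judge a full window, so one early error does not trip the breaker.
    this.open =
      this.outcomes.length === this.windowSize &&
      failures / this.outcomes.length >= this.threshold;
    return this.open;
  }

  allowRequest() {
    return !this.open;
  }
}

const breaker = new CircuitBreaker({ threshold: 0.5, windowSize: 4 });
[true, false, true, false].forEach((ok) => breaker.record(ok));
console.log(breaker.allowRequest()); // false: 2 of the last 4 calls failed
```

Production breakers usually add a cool-down and a half-open state that lets a few probe requests through before closing again; that is omitted here for brevity.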
+ #### Prevention
+
+ - Comprehensive error handling
+ - Graceful degradation
+ - Dependency health checks
+ - Canary deployments
+
+ ## Monitoring Checklist
+
+ ### Daily Checks
+
+ - [ ] Review Grafana dashboard
+ - [ ] Check alert status in Alertmanager
+ - [ ] Verify metrics ingestion (no staleness)
+ - [ ] Review security events
+ - [ ] Check resource utilization trends
+
+ ### Weekly Checks
+
+ - [ ] Analyze latency trends
+ - [ ] Review error patterns
+ - [ ] Check disk space on Prometheus/Jaeger
+ - [ ] Validate backup processes
+ - [ ] Review alert tuning
+
+ ### Monthly Checks
+
+ - [ ] Review dashboard effectiveness
+ - [ ] Update alert thresholds based on trends
+ - [ ] Analyze cost/performance trade-offs
+ - [ ] Update runbook based on incidents
+ - [ ] Review sampling strategies
+
+ ## Maintenance Procedures
+
+ ### Prometheus Data Retention
+
+ ```bash
+ # Check disk usage
+ df -h /var/lib/prometheus
+
+ # Remove data for deleted series from disk (requires --web.enable-admin-api)
+ curl -X POST http://prometheus:9090/api/v1/admin/tsdb/clean_tombstones
+
+ # Adjust retention via Prometheus startup flags:
+ #   --storage.tsdb.retention.time=30d
+ #   --storage.tsdb.retention.size=50GB
+ ```
+
+ ### Jaeger Storage Cleanup
+
+ ```bash
+ # Clean old traces (Cassandra)
+ # NOTE: CQL TRUNCATE does not accept a WHERE clause, so this statement is rejected as written;
+ # without the clause it would remove every row in the table.
+ docker exec jaeger-cassandra cqlsh -e \
+   "TRUNCATE jaeger_v1_dc1.traces WHERE timestamp < dateOf(now()) - 604800000;"
+
+ # Or use Jaeger's built-in cleanup
+ curl -X DELETE "http://jaeger:16686/api/traces?service=unrdf-core&lookback=30d"
+ ```
+
+ ### OTEL Collector Restart
+
+ ```bash
+ # Graceful restart
+ docker restart otel-collector
+
+ # Verify health
+ curl http://localhost:13133/
+ ```
+
+ ## Performance Tuning
+
+ ### Sampling Strategy Adjustment
+
+ ```javascript
+ // Update sampling rates based on load
+ const tracing = createDistributedTracing({
+   sampling: {
+     defaultRate: 0.001, // 0.1% for high-volume production
+     errorRate: 1.0, // Always sample errors
+     slowThreshold: 5000, // 5s threshold
+     slowRate: 0.5, // 50% of slow operations
+   },
+ });
+ ```
+
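One way to read the four settings above is as a per-span decision: errors use `errorRate`, slow operations use `slowRate`, and everything else uses `defaultRate`. The sketch below spells that reading out. It is an assumption about the semantics, not the package's actual sampler: `shouldSample`, the span fields `error` and `durationMs`, and the rule that errors take precedence over slowness are all hypothetical.

```javascript
// Hypothetical decision function mirroring the config keys above.
// `random` is injectable so the decision can be made deterministic.
function shouldSample(span, sampling, random = Math.random) {
  let rate = sampling.defaultRate;
  if (span.error) {
    rate = sampling.errorRate;
  } else if (span.durationMs >= sampling.slowThreshold) {
    rate = sampling.slowRate;
  }
  // random() is in [0, 1), so a rate of 1.0 always samples.
  return random() < rate;
}

const samplingConfig = { defaultRate: 0.001, errorRate: 1.0, slowThreshold: 5000, slowRate: 0.5 };
console.log(shouldSample({ error: true, durationMs: 12 }, samplingConfig)); // true
```

With these values, roughly one in a thousand ordinary spans is kept, which is why error and slow spans get their own, much higher rates.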
+ ### Metric Aggregation
+
+ ```yaml
+ # Prometheus recording rules for faster queries
+ groups:
+   - name: unrdf_aggregates
+     interval: 30s
+     rules:
+       - record: job:latency_p95_ms:5m
+         expr: histogram_quantile(0.95, sum by (job, le) (rate(latency_operation_duration_ms_bucket[5m])))
+
+       - record: job:success_rate:5m
+         expr: |
+           sum by (job) (rate(business_operations_total{result="success"}[5m])) /
+           sum by (job) (rate(business_operations_total[5m]))
+ ```
+
+ ### Resource Optimization
+
+ ```bash
+ # Optimize Prometheus scrape interval
+ # High-priority: 10s
+ # Medium: 30s
+ # Low: 60s
+
+ # Reduce cardinality by dropping series you do not query (scrape_config fragment, YAML).
+ # As written, this regex drops every histogram series (_bucket, _sum, _count),
+ # including the buckets that histogram_quantile needs; narrow it before use.
+ metric_relabel_configs:
+   - source_labels: [__name__]
+     regex: '.*_bucket|.*_sum|.*_count'
+     action: drop
+ ```
+
+ ## Backup and Recovery
+
+ ### Prometheus Backup
+
+ ```bash
+ # Snapshot current data (requires --web.enable-admin-api); the response reports the snapshot name
+ SNAPSHOT=$(curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')
+
+ # Backup snapshot (archive its contents relative to the snapshot directory,
+ # so the archive unpacks directly into the Prometheus data directory)
+ tar -czf prometheus-backup-$(date +%Y%m%d).tar.gz \
+   -C "/var/lib/prometheus/snapshots/$SNAPSHOT" .
+
+ # Upload to S3
+ aws s3 cp prometheus-backup-$(date +%Y%m%d).tar.gz \
+   s3://backups/prometheus/
+ ```
+
+ ### Jaeger Backup
+
+ ```bash
+ # Export traces to storage
+ docker exec jaeger-collector \
+   jaeger-tools export --service unrdf-core \
+   --output /backups/traces-$(date +%Y%m%d).json
+ ```
+
+ ### Restore Procedure
+
+ ```bash
+ # Stop Prometheus
+ systemctl stop prometheus
+
+ # Restore data
+ tar -xzf prometheus-backup-YYYYMMDD.tar.gz -C /var/lib/prometheus/
+
+ # Start Prometheus
+ systemctl start prometheus
+ ```
+
+ ## Escalation
+
+ ### L1 Support
+
+ - Check alert status
+ - Review runbook procedures
+ - Attempt standard remediation
+
+ ### L2 Support (Escalate if)
+
+ - Alert persists >15 minutes
+ - Multiple cascading failures
+ - Security incidents
+ - Unknown root cause
+
+ ### L3 Support (Escalate if)
+
+ - System-wide outage
+ - Data loss suspected
+ - Critical security breach
+ - Architecture-level issues
+
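The escalation criteria above can be encoded as a small decision function, for example to pre-fill the support tier in an incident ticket. The sketch below is one possible encoding: `escalationLevel` and every field name on the incident object are hypothetical, and L3 conditions are checked first so that the highest applicable tier wins.

```javascript
// Hypothetical encoding of the escalation criteria above.
function escalationLevel(incident) {
  const needsL3 =
    incident.systemWideOutage ||
    incident.dataLossSuspected ||
    incident.criticalSecurityBreach ||
    incident.architectureLevelIssue;
  if (needsL3) return 'L3';

  const needsL2 =
    incident.alertAgeMinutes > 15 ||
    incident.cascadingFailures ||
    incident.securityIncident ||
    incident.rootCauseUnknown;
  if (needsL2) return 'L2';

  // Default tier: check alert status and attempt standard remediation.
  return 'L1';
}

console.log(escalationLevel({ alertAgeMinutes: 20 })); // prints L2
```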
+ ## Contact Information
+
+ ```yaml
+ Teams:
+   Platform:
+     Slack: '#unrdf-platform'
+     Pagerduty: '@unrdf-oncall'
+
+   Security:
+     Slack: '#security-incidents'
+     Email: 'security@example.com'
+
+   Infrastructure:
+     Slack: '#infrastructure'
+     Pagerduty: '@infra-oncall'
+ ```
+
+ ## Additional Resources
+
+ - [Observability Patterns](./OBSERVABILITY-PATTERNS.md)
+ - [Prometheus Querying](https://prometheus.io/docs/prometheus/latest/querying/basics/)
+ - [Jaeger UI Guide](https://www.jaegertracing.io/docs/latest/frontend-ui/)
+ - [Grafana Troubleshooting](https://grafana.com/docs/grafana/latest/troubleshooting/)