omgkit 2.22.11 → 2.23.0

---
name: Chaos Engineering and Resilience Testing
description: The agent implements chaos engineering practices for building resilient systems. Use when testing fault tolerance, designing game days, or validating system recovery.
category: testing
---

# Chaos Engineering and Resilience Testing

## Purpose

Chaos engineering is a discipline, pioneered at Netflix, of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.

Netflix's philosophy: **"The best way to avoid failure is to fail constantly."**

Key benefits:

- **Discover weaknesses** before they cause outages
- **Build confidence** in system resilience
- **Improve incident response** through practice
- **Validate recovery procedures** work as designed
- **Document system behavior** under stress

## Features

| Feature | Description | Tools |
|---------|-------------|-------|
| Fault Injection | Introduce failures deliberately | Chaos Monkey, Gremlin |
| Network Chaos | Simulate network issues | tc, Toxiproxy |
| Resource Exhaustion | CPU, memory, and disk stress | stress-ng, LitmusChaos |
| State Chaos | Corrupt or delete data | Custom scripts |
| Application Chaos | Kill processes, inject latency | Chaos Toolkit |
| Game Days | Coordinated chaos exercises | Runbooks |

## Chaos Engineering Principles

### The Scientific Method for Chaos

```
1. DEFINE steady state (normal system behavior)

2. HYPOTHESIZE that steady state continues during chaos

3. INTRODUCE real-world events (faults)

4. OBSERVE differences between control and experiment

5. CONCLUDE whether the hypothesis held
```
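
The five steps above can be sketched as a small experiment loop. This is a minimal illustration, not a framework API: the probe and fault callables (`measure_error_rate`, `inject_fault`, `remove_fault`) are hypothetical hooks you would wire to your own monitoring and injection tooling.

```python
import time

def run_experiment(measure_error_rate, inject_fault, remove_fault,
                   steady_threshold=0.01, duration_s=1.0, poll_s=0.2):
    """Run one chaos experiment following the five steps above."""
    # 1. DEFINE steady state: error rate below a measurable threshold
    if measure_error_rate() >= steady_threshold:
        return "aborted: not in steady state"

    # 2. HYPOTHESIZE that steady state continues, then
    # 3. INTRODUCE the fault
    inject_fault()
    violations = 0
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            # 4. OBSERVE: compare the experiment against the threshold
            if measure_error_rate() >= steady_threshold:
                violations += 1
            time.sleep(poll_s)
    finally:
        remove_fault()  # always clean up, even if observation fails

    # 5. CONCLUDE whether the hypothesis held
    return "hypothesis held" if violations == 0 else f"violated {violations} times"

# Stubbed example: a fault that does not move the error rate
result = run_experiment(lambda: 0.001, lambda: None, lambda: None)
print(result)  # hypothesis held
```

The `finally` block matters: fault removal must run even when the observation phase itself fails.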

### Netflix Principles

1. **Build a Hypothesis around Steady State Behavior**
   - Define measurable system outputs
   - Focus on overall system behavior, not internals

2. **Vary Real-World Events**
   - Hardware failures
   - Network partitions
   - Malformed requests
   - Traffic spikes

3. **Run Experiments in Production**
   - Non-production environments differ too much from production
   - Start with a minimal blast radius

4. **Automate Experiments to Run Continuously**
   - Manual testing doesn't scale
   - Continuous validation catches regressions

5. **Minimize Blast Radius**
   - Start small, expand carefully
   - Have kill switches ready
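
One way to keep a kill switch ready is to gate every fault injector behind a single flag that an operator or an automated abort check can flip. The sketch below is an assumption about structure, not any specific tool's API:

```python
import threading

class ChaosKillSwitch:
    """Global gate that every fault injector checks before acting."""

    def __init__(self):
        self._enabled = threading.Event()
        self._enabled.set()  # chaos allowed by default

    def trip(self):
        """Disable all chaos immediately (operator or automated abort)."""
        self._enabled.clear()

    def reset(self):
        self._enabled.set()

    def allows_chaos(self) -> bool:
        return self._enabled.is_set()

kill_switch = ChaosKillSwitch()

def maybe_inject(fault_fn):
    """Run a fault injector only while the kill switch allows it."""
    if kill_switch.allows_chaos():
        fault_fn()
        return True
    return False

injected = maybe_inject(lambda: None)   # True: switch is armed
kill_switch.trip()
blocked = maybe_inject(lambda: None)    # False: chaos disabled
```

Using a `threading.Event` keeps the check safe when injectors run on multiple threads; the same idea extends to a feature flag shared across hosts.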

## Tools and Frameworks

### Chaos Monkey (Netflix)

```yaml
# Chaos Monkey configuration
chaos_monkey:
  enabled: true
  probability: 1.0  # Always kill if selected
  schedule:
    start: "09:00"
    end: "17:00"
    timezone: "America/Los_Angeles"
  filters:
    - region: us-east-1
    - cluster: production
  exclusions:
    - app: critical-service
    - tag: chaos-exempt
```

### Gremlin (Enterprise)

```bash
# Gremlin CLI - CPU attack
gremlin attack cpu \
  --length 300 \
  --cores 2 \
  --percent 80

# Gremlin CLI - network latency
gremlin attack network latency \
  --length 300 \
  --delay 500 \
  --hosts "api.example.com"

# Gremlin CLI - process kill
gremlin attack process kill \
  --length 60 \
  --process "nginx" \
  --interval 10
```

### LitmusChaos (Kubernetes-Native)

```yaml
# LitmusChaos experiment - pod delete
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
---
# LitmusChaos experiment - network partition
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-chaos
spec:
  appinfo:
    appns: default
    applabel: "app=backend"
    appkind: deployment
  experiments:
    - name: pod-network-partition
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: NETWORK_INTERFACE
              value: 'eth0'
            - name: TARGET_PODS
              value: 'database-0'
```

### Chaos Toolkit (Open Source)

```json
{
  "title": "Database failover should be transparent",
  "description": "Verify that when the primary DB fails, the replica takes over",

  "steady-state-hypothesis": {
    "title": "Application responds normally",
    "probes": [
      {
        "name": "api-responds",
        "type": "probe",
        "provider": {
          "type": "http",
          "url": "http://api.example.com/health",
          "timeout": 3
        },
        "tolerance": {
          "status": 200
        }
      },
      {
        "name": "latency-acceptable",
        "type": "probe",
        "provider": {
          "type": "python",
          "module": "chaosprobe",
          "func": "check_p99_latency",
          "arguments": {
            "threshold_ms": 500
          }
        },
        "tolerance": true
      }
    ]
  },

  "method": [
    {
      "name": "terminate-primary-database",
      "type": "action",
      "provider": {
        "type": "python",
        "module": "chaosaws.rds.actions",
        "func": "failover_db_instance",
        "arguments": {
          "db_instance_identifier": "prod-primary"
        }
      },
      "pauses": {
        "after": 30
      }
    }
  ],

  "rollbacks": [
    {
      "name": "ensure-db-available",
      "type": "action",
      "provider": {
        "type": "python",
        "module": "chaosaws.rds.actions",
        "func": "wait_for_db_instance_available",
        "arguments": {
          "db_instance_identifier": "prod-primary"
        }
      }
    }
  ]
}
```

## Experiment Types

### 1. Infrastructure Chaos

```bash
#!/bin/bash
# Infrastructure chaos experiments

# Kill a random EC2 instance
kill_random_instance() {
  INSTANCE=$(aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=prod" \
    --query 'Reservations[].Instances[?State.Name==`running`].InstanceId' \
    --output text | shuf -n 1)

  echo "Terminating instance: $INSTANCE"
  aws ec2 terminate-instances --instance-ids "$INSTANCE"
}

# Detach an EBS volume
detach_volume() {
  VOLUME=$1
  aws ec2 detach-volume --volume-id "$VOLUME" --force
}

# Simulate an AZ failure
simulate_az_failure() {
  AZ=$1
  # Add a network ACL rule denying all traffic to/from the AZ's CIDR
  # (NACL_ID and AZ_CIDR must be set for the target AZ)
  aws ec2 create-network-acl-entry \
    --network-acl-id "$NACL_ID" \
    --rule-number 1 \
    --protocol -1 \
    --rule-action deny \
    --cidr-block "$AZ_CIDR"
}
```

### 2. Application Chaos

```python
# Application chaos experiments
import random
import time
from functools import wraps

import requests

class ChaosMiddleware:
    """Inject chaos into application requests (WSGI middleware)."""

    def __init__(self, app, config):
        self.app = app
        self.config = config

    def __call__(self, environ, start_response):
        # Random latency injection
        if random.random() < self.config.get('latency_probability', 0):
            delay = random.uniform(
                self.config.get('latency_min_ms', 100),
                self.config.get('latency_max_ms', 1000)
            ) / 1000
            time.sleep(delay)

        # Random error injection
        if random.random() < self.config.get('error_probability', 0):
            start_response('500 Internal Server Error', [])
            return [b'Chaos error injected']

        # Random timeout
        if random.random() < self.config.get('timeout_probability', 0):
            time.sleep(self.config.get('timeout_seconds', 30))

        return self.app(environ, start_response)

# Decorator for chaos injection
def inject_chaos(failure_rate=0.01, latency_ms=0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Inject failure
            if random.random() < failure_rate:
                raise Exception("Chaos failure injected")

            # Inject latency
            if latency_ms > 0:
                time.sleep(latency_ms / 1000)

            return func(*args, **kwargs)
        return wrapper
    return decorator

# Usage
@inject_chaos(failure_rate=0.05, latency_ms=100)
def external_api_call():
    return requests.get("https://api.external.com/data")
```
334
+
335
+ ### 3. Network Chaos
336
+
337
+ ```yaml
338
+ # Toxiproxy configuration
339
+ proxies:
340
+ - name: redis
341
+ listen: "0.0.0.0:6380"
342
+ upstream: "redis:6379"
343
+ enabled: true
344
+
345
+ - name: postgres
346
+ listen: "0.0.0.0:5433"
347
+ upstream: "postgres:5432"
348
+ enabled: true
349
+
350
+ toxics:
351
+ # Add 500ms latency to Redis
352
+ - name: redis_latency
353
+ proxy: redis
354
+ type: latency
355
+ attributes:
356
+ latency: 500
357
+ jitter: 100
358
+
359
+ # Drop 10% of Postgres connections
360
+ - name: postgres_timeout
361
+ proxy: postgres
362
+ type: timeout
363
+ attributes:
364
+ timeout: 5000
365
+
366
+ # Limit bandwidth to 1KB/s
367
+ - name: slow_bandwidth
368
+ proxy: redis
369
+ type: bandwidth
370
+ attributes:
371
+ rate: 1024
372
+ ```

```bash
# Linux Traffic Control (tc) for network chaos
# Note: run one netem example at a time; `tc qdisc add` fails if a root
# qdisc already exists (use `tc qdisc change` to modify one in place)

# Add 100ms latency with 20ms jitter
tc qdisc add dev eth0 root netem delay 100ms 20ms

# Add 5% packet loss
tc qdisc add dev eth0 root netem loss 5%

# Add packet corruption
tc qdisc add dev eth0 root netem corrupt 1%

# Combine effects
tc qdisc add dev eth0 root netem delay 50ms 10ms loss 1% corrupt 0.1%

# Remove chaos
tc qdisc del dev eth0 root
```
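
A forgotten netem qdisc keeps degrading the interface after the experiment ends, so it can help to pair the `add` and `del` commands in a context manager that guarantees cleanup. A sketch using Python's `subprocess` (the tc invocations mirror the ones above; requires root or CAP_NET_ADMIN to actually run):

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def netem(interface="eth0", spec="delay 100ms 20ms"):
    """Apply a netem qdisc for the duration of a with-block, then remove it."""
    subprocess.run(["tc", "qdisc", "add", "dev", interface, "root", "netem",
                    *spec.split()], check=True)
    try:
        yield
    finally:
        # Always remove the chaos, even if the experiment raises
        subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"],
                       check=False)

# Usage:
# with netem("eth0", "delay 50ms 10ms loss 1%"):
#     run_latency_sensitive_test()
```

`check=False` on the delete keeps cleanup best-effort: if the qdisc was already removed out-of-band, teardown should not raise.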

### 4. Resource Exhaustion

```bash
#!/bin/bash
# Resource exhaustion experiments

# CPU stress
stress_cpu() {
  CORES=$1
  DURATION=$2
  stress-ng --cpu "$CORES" --timeout "${DURATION}s"
}

# Memory stress
stress_memory() {
  PERCENT=$1
  DURATION=$2
  stress-ng --vm 1 --vm-bytes "${PERCENT}%" --timeout "${DURATION}s"
}

# Disk I/O stress
stress_disk() {
  stress-ng --hdd 2 --hdd-bytes 1G --timeout 60s
}

# Fill disk
fill_disk() {
  TARGET_DIR=$1
  dd if=/dev/zero of="${TARGET_DIR}/chaos-fill" bs=1M count=10000
}

# Fork bomb (careful! this exhausts the process table)
fork_bomb() {
  :(){ :|:& };:
}
```

## Game Day Planning

### Game Day Runbook Template

```markdown
# Game Day: [Name]

**Date:** [Date]
**Duration:** [X hours]
**Participants:** [Team members]

## Objectives
1. Validate [system] handles [failure type]
2. Test incident response procedures
3. Document recovery time

## Prerequisites
- [ ] Monitoring dashboards ready
- [ ] Communication channel established
- [ ] Rollback procedures documented
- [ ] Stakeholders notified

## Steady State Definition
- API latency p99 < 500ms
- Error rate < 0.1%
- All health checks passing

## Experiment Schedule

| Time | Action | Owner | Expected Impact |
|------|--------|-------|-----------------|
| 10:00 | Begin experiment | Lead | None |
| 10:05 | Kill primary DB | DB Admin | Failover starts |
| 10:10 | Verify failover | SRE | Latency spike |
| 10:15 | Restore primary | DB Admin | None |
| 10:30 | End experiment | Lead | System stable |

## Abort Criteria
- [ ] Error rate > 5%
- [ ] Customer complaints received
- [ ] Cascading failures detected
- [ ] Recovery taking > 15 minutes

## Rollback Procedure
1. Stop the experiment immediately
2. Restore from snapshot if needed
3. Scale up healthy instances
4. Page on-call if needed

## Post-Game Analysis
- What happened?
- What did we learn?
- What should we fix?
- Schedule the next game day
```

### Automated Game Day Execution

```python
# Automated game day orchestration
import asyncio
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import List, Callable

class AbortException(Exception):
    """Raised when an abort criterion is met mid-experiment."""

@dataclass
class GameDayStep:
    name: str
    action: Callable
    duration_seconds: int
    abort_check: Callable[[], bool]
    rollback: Callable

class GameDayOrchestrator:
    def __init__(self, steady_state_check: Callable[[], bool]):
        self.steady_state_check = steady_state_check
        self.steps: List[GameDayStep] = []
        self.aborted = False

    def add_step(self, step: GameDayStep):
        self.steps.append(step)

    async def run(self):
        print(f"[{datetime.now()}] Game Day Starting")

        # Verify steady state before starting
        if not self.steady_state_check():
            print("ERROR: System not in steady state. Aborting.")
            return

        executed_steps = []

        try:
            for step in self.steps:
                print(f"[{datetime.now()}] Executing: {step.name}")

                # Execute action
                await asyncio.to_thread(step.action)
                executed_steps.append(step)

                # Wait and monitor
                end_time = datetime.now() + timedelta(seconds=step.duration_seconds)
                while datetime.now() < end_time:
                    if step.abort_check():
                        raise AbortException(f"Abort triggered during {step.name}")
                    await asyncio.sleep(1)

                # Verify steady state
                if not self.steady_state_check():
                    print(f"WARN: Steady state violated after {step.name}")

        except AbortException as e:
            print(f"ABORT: {e}")
            self.aborted = True

        finally:
            # Roll back in reverse order
            print("Rolling back...")
            for step in reversed(executed_steps):
                try:
                    await asyncio.to_thread(step.rollback)
                except Exception as e:
                    print(f"Rollback failed for {step.name}: {e}")

        print(f"[{datetime.now()}] Game Day Complete")

# Usage
orchestrator = GameDayOrchestrator(
    steady_state_check=lambda: check_api_health() and check_error_rate() < 0.01
)

orchestrator.add_step(GameDayStep(
    name="Kill cache cluster node",
    action=lambda: kill_redis_node("redis-1"),
    duration_seconds=60,
    abort_check=lambda: get_error_rate() > 0.05,
    rollback=lambda: restore_redis_node("redis-1")
))

asyncio.run(orchestrator.run())
```

## Best Practices

### 1. Start Small

```yaml
# Chaos maturity levels
level_1:
  name: "Chaos Curious"
  experiments:
    - Kill a single non-critical pod
    - Add minor latency to internal calls
  environment: staging

level_2:
  name: "Chaos Aware"
  experiments:
    - Kill critical service pods
    - Network partition between services
  environment: production (low traffic)

level_3:
  name: "Chaos Native"
  experiments:
    - Multi-AZ failures
    - Database failovers
    - Cascading failures
  environment: production (all traffic)
```

### 2. Monitor Everything

```yaml
# Chaos observability requirements
metrics:
  - name: error_rate
    threshold: 0.1%
    source: prometheus
  - name: latency_p99
    threshold: 500ms
    source: prometheus
  - name: availability
    threshold: 99.9%
    source: synthetic_monitoring

alerts:
  - name: chaos_impact_high
    condition: error_rate > 1%
    action: abort_experiment
  - name: chaos_duration_exceeded
    condition: experiment_time > 10m
    action: notify_and_review
```
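
The `abort_experiment` alert above can also be enforced in code with a small watchdog that polls the metric source and aborts when the threshold is breached. A sketch with hypothetical hooks (`fetch_error_rate`, `abort_experiment` stand in for your metric query and kill switch):

```python
import time

def watch_and_abort(fetch_error_rate, abort_experiment,
                    threshold=0.01, max_duration_s=600, poll_s=5):
    """Poll a metric and abort the experiment when a threshold is breached."""
    start = time.monotonic()
    while time.monotonic() - start < max_duration_s:
        rate = fetch_error_rate()
        if rate > threshold:
            abort_experiment(f"error_rate {rate:.3%} > {threshold:.3%}")
            return "aborted"
        time.sleep(poll_s)
    return "completed"

# Stubbed example: the error rate spikes on the second poll
samples = iter([0.001, 0.02])
events = []
outcome = watch_and_abort(lambda: next(samples), events.append,
                          threshold=0.01, max_duration_s=60, poll_s=0)
print(outcome, events)  # aborted ['error_rate 2.000% > 1.000%']
```

In practice the same loop also enforces the `chaos_duration_exceeded` alert: hitting `max_duration_s` without an abort is the "completed" path.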

### 3. Document Everything

```markdown
## Experiment Report: Redis Failover

**Date:** 2024-01-15
**Duration:** 5 minutes
**Environment:** Production

### Hypothesis
When the Redis primary fails, the application will:
1. Detect the failure within 30 seconds
2. Reconnect to a replica within 60 seconds
3. Maintain < 1% error rate

### Results
| Metric | Expected | Actual | Status |
|--------|----------|--------|--------|
| Detection time | < 30s | 15s | PASS |
| Reconnect time | < 60s | 45s | PASS |
| Error rate | < 1% | 0.3% | PASS |

### Findings
- Sentinel failover worked as expected
- Connection pool reset caused a brief spike
- Alert fired correctly at 0.5% error rate

### Action Items
- [ ] Reduce connection pool timeout to 5s
- [ ] Add retry logic for transient failures
- [ ] Update runbook with actual metrics
```

## Use Cases

### 1. Validate Auto-Scaling

```python
# Test auto-scaling under load + chaos
async def test_autoscaling_resilience():
    # Generate load
    await generate_load(rps=10000)

    # Inject chaos: kill 50% of pods
    await kill_pods_by_percentage(0.5)

    # Verify the autoscaler responds
    await wait_for_condition(
        lambda: get_pod_count() >= MIN_PODS,
        timeout=120
    )

    # Verify steady state
    assert await get_error_rate() < 0.01
```
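
`wait_for_condition` is assumed throughout these use cases; a minimal asyncio version might look like this (the name and signature are this document's convention, not a library API — it accepts sync conditions like the lambdas above, and awaits the result if the condition returns a coroutine):

```python
import asyncio
import inspect
import time

async def wait_for_condition(condition, timeout=60, interval=1.0):
    """Poll `condition` until it returns truthy or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if inspect.isawaitable(result):
            result = await result  # support async conditions too
        if result:
            return True
        await asyncio.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Stubbed example: the condition becomes true on the third poll
polls = iter([False, False, True])
ok = asyncio.run(wait_for_condition(lambda: next(polls), timeout=5, interval=0))
```

Raising `TimeoutError` (rather than returning False) makes a failed resilience check fail the test loudly.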
678
+
679
+ ### 2. Test Circuit Breakers
680
+
681
+ ```python
682
+ # Verify circuit breaker opens under failure
683
+ async def test_circuit_breaker():
684
+ # Baseline: circuit closed
685
+ assert await get_circuit_state("payment-service") == "closed"
686
+
687
+ # Inject 100% failure rate
688
+ await inject_failure("payment-service", rate=1.0)
689
+
690
+ # Wait for circuit to open
691
+ await wait_for_condition(
692
+ lambda: get_circuit_state("payment-service") == "open",
693
+ timeout=30
694
+ )
695
+
696
+ # Verify fallback is used
697
+ response = await call_payment_service()
698
+ assert response.used_fallback == True
699
+ ```
700
+
701
+ ### 3. Disaster Recovery Drill
702
+
703
+ ```python
704
+ # Full region failover test
705
+ async def test_region_failover():
706
+ # Record baseline
707
+ baseline_metrics = await capture_metrics()
708
+
709
+ # Simulate region failure
710
+ await simulate_region_outage("us-east-1")
711
+
712
+ # Verify traffic shifts
713
+ await wait_for_condition(
714
+ lambda: get_traffic_in_region("us-west-2") > 0.95,
715
+ timeout=180
716
+ )
717
+
718
+ # Verify performance maintained
719
+ current_metrics = await capture_metrics()
720
+ assert current_metrics.latency_p99 < baseline_metrics.latency_p99 * 1.5
721
+ ```
722
+
723
+ ## Related Skills
724
+
725
+ - `devops/kubernetes` - Container orchestration chaos
726
+ - `devops/observability` - Monitoring chaos impact
727
+ - `testing/performance-testing` - Load testing integration
728
+ - `devops/feature-flags` - Chaos kill switches
729
+
730
+ ---
731
+
732
+ *Think Omega. Build Omega. Be Omega.*