omgkit 2.22.11 → 2.23.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +20 -8
- package/package.json +2 -2
- package/plugin/registry.yaml +3 -3
- package/plugin/skills/devops/dora-metrics/SKILL.md +852 -0
- package/plugin/skills/devops/feature-flags/SKILL.md +559 -0
- package/plugin/skills/methodology/stacked-diffs/SKILL.md +568 -0
- package/plugin/skills/testing/chaos-engineering/SKILL.md +732 -0

package/plugin/skills/testing/chaos-engineering/SKILL.md
@@ -0,0 +1,732 @@
---
name: Chaos Engineering and Resilience Testing
description: The agent implements chaos engineering practices for building resilient systems. Use when testing fault tolerance, designing game days, or validating system recovery.
category: testing
---

# Chaos Engineering and Resilience Testing

## Purpose

Chaos engineering is a discipline pioneered by Netflix that involves experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.

Netflix's philosophy: **"The best way to avoid failure is to fail constantly."**

Key benefits:
- **Discover weaknesses** before they cause outages
- **Build confidence** in system resilience
- **Improve incident response** through practice
- **Validate recovery procedures** work as designed
- **Document system behavior** under stress

## Features

| Feature | Description | Tools |
|---------|-------------|-------|
| Fault Injection | Introduce failures deliberately | Chaos Monkey, Gremlin |
| Network Chaos | Simulate network issues | tc, Toxiproxy |
| Resource Exhaustion | CPU, memory, disk stress | stress-ng, LitmusChaos |
| State Chaos | Corrupt or delete data | Custom scripts |
| Application Chaos | Kill processes, inject latency | Chaos Toolkit |
| Game Days | Coordinated chaos exercises | Runbooks |

## Chaos Engineering Principles

### The Scientific Method for Chaos

```
1. DEFINE steady state (normal system behavior)
        ↓
2. HYPOTHESIZE that steady state continues during chaos
        ↓
3. INTRODUCE real-world events (faults)
        ↓
4. OBSERVE differences between control and experiment
        ↓
5. CONCLUDE whether hypothesis held
```
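
In code, the loop is: measure the steady-state metric, inject the fault, keep measuring, then compare. A minimal sketch of that loop, assuming hypothetical `inject_fault` and `measure_error_rate` helpers supplied by your environment:

```python
# Minimal experiment loop following the five steps above.
# inject_fault() and measure_error_rate() are hypothetical placeholders.
import time

STEADY_STATE_MAX_ERROR_RATE = 0.001  # 1. DEFINE: error rate stays below 0.1%

def run_experiment(inject_fault, measure_error_rate, duration_s=60):
    # 2. HYPOTHESIZE: steady state holds while the fault is active
    assert measure_error_rate() <= STEADY_STATE_MAX_ERROR_RATE, "not in steady state"

    inject_fault()                        # 3. INTRODUCE a real-world fault

    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:    # 4. OBSERVE during the fault window
        samples.append(measure_error_rate())
        time.sleep(5)

    worst = max(samples)                  # 5. CONCLUDE: did the hypothesis hold?
    return worst <= STEADY_STATE_MAX_ERROR_RATE, worst
```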

### Netflix Principles

1. **Build a Hypothesis around Steady State Behavior**
   - Define measurable system outputs
   - Focus on overall system behavior, not internals

2. **Vary Real-World Events**
   - Hardware failures
   - Network partitions
   - Malformed requests
   - Traffic spikes

3. **Run Experiments in Production**
   - Non-production environments differ too much
   - Start with minimal blast radius

4. **Automate Experiments to Run Continuously**
   - Manual testing doesn't scale
   - Continuous validation catches regressions

5. **Minimize Blast Radius**
   - Start small, expand carefully
   - Have kill switches ready (see the sketch below)
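
A kill switch can be as simple as a flag that every injector checks before acting. A minimal sketch, assuming a file-based flag (a feature flag, environment variable, or config service works the same way; the path is hypothetical):

```python
# Global kill switch checked by every fault injector.
# The file-based flag and its path are illustrative assumptions.
import os

KILL_SWITCH_FILE = "/etc/chaos/DISABLED"  # hypothetical path

def chaos_enabled() -> bool:
    """Chaos may run only while the kill switch file is absent."""
    return not os.path.exists(KILL_SWITCH_FILE)

def guarded(inject):
    """Wrap a fault injector so flipping the switch stops new faults."""
    def wrapper(*args, **kwargs):
        if not chaos_enabled():
            print(f"kill switch active, skipping {inject.__name__}")
            return None
        return inject(*args, **kwargs)
    return wrapper

@guarded
def kill_random_pod():
    ...  # actual fault injection goes here
```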

## Tools and Frameworks

### Chaos Monkey (Netflix)

```yaml
# Chaos Monkey Configuration
chaos_monkey:
  enabled: true
  probability: 1.0  # Always kill if selected
  schedule:
    start: "09:00"
    end: "17:00"
    timezone: "America/Los_Angeles"
  filters:
    - region: us-east-1
    - cluster: production
  exclusions:
    - app: critical-service
    - tag: chaos-exempt
```
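
To make the schedule, filters, and probability concrete, here is a sketch of how a scheduler might interpret a config like this; it is our illustration, not Chaos Monkey's actual implementation:

```python
# Illustrative interpretation of the config above (not Chaos Monkey's
# real code): pick at most one eligible instance per scheduled run.
import random
from datetime import datetime

def pick_victim(instances, cfg):
    if not cfg["enabled"]:
        return None

    # Act only inside the configured business-hours window
    now = datetime.now().strftime("%H:%M")
    if not (cfg["schedule"]["start"] <= now <= cfg["schedule"]["end"]):
        return None

    # Keep instances matching every filter, drop any matching an exclusion
    eligible = [
        i for i in instances
        if all(i.get(k) == v for f in cfg["filters"] for k, v in f.items())
        and not any(i.get(k) == v for e in cfg["exclusions"] for k, v in e.items())
    ]
    if not eligible:
        return None

    # probability gates whether the selected instance is actually killed
    if random.random() < cfg["probability"]:
        return random.choice(eligible)
    return None
```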

### Gremlin (Enterprise)

```bash
# Gremlin CLI - CPU Attack
gremlin attack cpu \
  --length 300 \
  --cores 2 \
  --percent 80

# Gremlin CLI - Network Latency
gremlin attack network latency \
  --length 300 \
  --delay 500 \
  --hosts "api.example.com"

# Gremlin CLI - Process Kill
gremlin attack process kill \
  --length 60 \
  --process "nginx" \
  --interval 10
```

### LitmusChaos (Kubernetes-Native)

```yaml
# LitmusChaos Experiment - Pod Delete
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
---
# LitmusChaos Experiment - Network Partition
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-chaos
spec:
  appinfo:
    appns: default
    applabel: "app=backend"
    appkind: deployment
  experiments:
    - name: pod-network-partition
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: NETWORK_INTERFACE
              value: 'eth0'
            - name: TARGET_PODS
              value: 'database-0'
```
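
Once a ChaosEngine is applied, the run's verdict is written to a ChaosResult custom resource. A sketch of polling it with the official Kubernetes Python client; the `<engine>-<experiment>` result name and the `status.experimentStatus.verdict` path follow LitmusChaos conventions, but verify them against your Litmus version:

```python
# Poll the ChaosResult CR for a verdict (e.g. Pass / Fail / Awaited).
# Field paths follow LitmusChaos conventions; verify for your version.
from kubernetes import client, config

def chaos_verdict(engine: str, experiment: str, namespace: str = "default") -> str:
    config.load_kube_config()
    api = client.CustomObjectsApi()
    result = api.get_namespaced_custom_object(
        group="litmuschaos.io",
        version="v1alpha1",
        namespace=namespace,
        plural="chaosresults",
        name=f"{engine}-{experiment}",  # e.g. nginx-chaos-pod-delete
    )
    return result["status"]["experimentStatus"]["verdict"]

print(chaos_verdict("nginx-chaos", "pod-delete"))
```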

### Chaos Toolkit (Open Source)

```json
{
  "title": "Database failover should be transparent",
  "description": "Verify that when primary DB fails, replica takes over",

  "steady-state-hypothesis": {
    "title": "Application responds normally",
    "probes": [
      {
        "name": "api-responds",
        "type": "probe",
        "provider": {
          "type": "http",
          "url": "http://api.example.com/health",
          "timeout": 3
        },
        "tolerance": {
          "status": 200
        }
      },
      {
        "name": "latency-acceptable",
        "type": "probe",
        "provider": {
          "type": "python",
          "module": "chaosprobe",
          "func": "check_p99_latency",
          "arguments": {
            "threshold_ms": 500
          }
        },
        "tolerance": true
      }
    ]
  },

  "method": [
    {
      "name": "terminate-primary-database",
      "type": "action",
      "provider": {
        "type": "python",
        "module": "chaosaws.rds.actions",
        "func": "failover_db_instance",
        "arguments": {
          "db_instance_identifier": "prod-primary"
        }
      },
      "pauses": {
        "after": 30
      }
    }
  ],

  "rollbacks": [
    {
      "name": "ensure-db-available",
      "type": "action",
      "provider": {
        "type": "python",
        "module": "chaosaws.rds.actions",
        "func": "wait_for_db_instance_available",
        "arguments": {
          "db_instance_identifier": "prod-primary"
        }
      }
    }
  ]
}
```
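
Saved as `experiment.json`, this runs with the Chaos Toolkit CLI via `chaos run experiment.json`: the steady-state hypothesis is checked before and after the method executes, and the rollbacks run at the end of the experiment. The RDS actions come from the `chaosaws` extension referenced in the `module` fields above.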

## Experiment Types

### 1. Infrastructure Chaos

```bash
#!/bin/bash
# Infrastructure chaos experiments

# Kill random EC2 instance
kill_random_instance() {
  INSTANCE=$(aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=prod" \
    --query 'Reservations[].Instances[?State.Name==`running`].InstanceId' \
    --output text | shuf -n 1)

  echo "Terminating instance: $INSTANCE"
  aws ec2 terminate-instances --instance-ids "$INSTANCE"
}

# Detach EBS volume
detach_volume() {
  VOLUME=$1
  aws ec2 detach-volume --volume-id "$VOLUME" --force
}

# Simulate AZ failure by denying all traffic to the AZ's CIDR
# (NACL_ID and AZ_CIDR must be set for the target AZ)
simulate_az_failure() {
  AZ=$1
  # Add a deny-all network ACL entry covering the AZ's address range
  aws ec2 create-network-acl-entry \
    --network-acl-id "$NACL_ID" \
    --rule-number 1 \
    --protocol -1 \
    --rule-action deny \
    --cidr-block "$AZ_CIDR"
}
```

### 2. Application Chaos

```python
# Application chaos experiments
import random
import time
from functools import wraps

import requests

class ChaosMiddleware:
    """Inject chaos into application requests"""

    def __init__(self, app, config):
        self.app = app
        self.config = config

    def __call__(self, environ, start_response):
        # Random latency injection
        if random.random() < self.config.get('latency_probability', 0):
            delay = random.uniform(
                self.config.get('latency_min_ms', 100),
                self.config.get('latency_max_ms', 1000)
            ) / 1000
            time.sleep(delay)

        # Random error injection
        if random.random() < self.config.get('error_probability', 0):
            start_response('500 Internal Server Error', [])
            return [b'Chaos error injected']

        # Random timeout
        if random.random() < self.config.get('timeout_probability', 0):
            time.sleep(self.config.get('timeout_seconds', 30))

        return self.app(environ, start_response)

# Decorator for chaos injection
def inject_chaos(failure_rate=0.01, latency_ms=0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Inject failure
            if random.random() < failure_rate:
                raise Exception("Chaos failure injected")

            # Inject latency
            if latency_ms > 0:
                time.sleep(latency_ms / 1000)

            return func(*args, **kwargs)
        return wrapper
    return decorator

# Usage
@inject_chaos(failure_rate=0.05, latency_ms=100)
def external_api_call():
    return requests.get("https://api.external.com/data")
```

### 3. Network Chaos

```yaml
# Toxiproxy configuration
proxies:
  - name: redis
    listen: "0.0.0.0:6380"
    upstream: "redis:6379"
    enabled: true

  - name: postgres
    listen: "0.0.0.0:5433"
    upstream: "postgres:5432"
    enabled: true

toxics:
  # Add 500ms latency to Redis
  - name: redis_latency
    proxy: redis
    type: latency
    attributes:
      latency: 500
      jitter: 100

  # Stall 10% of Postgres connections, closing them after 5s
  - name: postgres_timeout
    proxy: postgres
    type: timeout
    toxicity: 0.1
    attributes:
      timeout: 5000

  # Limit Redis bandwidth to 1KB/s (rate is in KB/s)
  - name: slow_bandwidth
    proxy: redis
    type: bandwidth
    attributes:
      rate: 1
```

```bash
# Linux Traffic Control (tc) for network chaos
# Note: only one root qdisc can exist at a time; remove the previous
# rule (see below) before adding another.

# Add 100ms latency with 20ms jitter
tc qdisc add dev eth0 root netem delay 100ms 20ms

# Add 5% packet loss
tc qdisc add dev eth0 root netem loss 5%

# Add packet corruption
tc qdisc add dev eth0 root netem corrupt 1%

# Combine effects
tc qdisc add dev eth0 root netem delay 50ms 10ms loss 1% corrupt 0.1%

# Remove chaos
tc qdisc del dev eth0 root
```
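
In practice, toxics are usually added and removed at runtime through Toxiproxy's HTTP API (default port 8474) rather than baked into config. A sketch using `requests`; the endpoint shapes follow Toxiproxy's documented API, but verify them against your server version:

```python
# Add and remove a latency toxic via the Toxiproxy HTTP API.
# Host/port are assumed to be the default localhost:8474.
import requests

TOXIPROXY = "http://localhost:8474"

def add_latency(proxy: str, name: str, latency_ms: int, jitter_ms: int = 0):
    resp = requests.post(
        f"{TOXIPROXY}/proxies/{proxy}/toxics",
        json={
            "name": name,
            "type": "latency",
            "stream": "downstream",
            "toxicity": 1.0,  # apply to 100% of connections
            "attributes": {"latency": latency_ms, "jitter": jitter_ms},
        },
    )
    resp.raise_for_status()

def remove_toxic(proxy: str, name: str):
    requests.delete(f"{TOXIPROXY}/proxies/{proxy}/toxics/{name}").raise_for_status()

add_latency("redis", "redis_latency", latency_ms=500, jitter_ms=100)
# ... run the experiment, then clean up ...
remove_toxic("redis", "redis_latency")
```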

### 4. Resource Exhaustion

```bash
#!/bin/bash
# Resource exhaustion experiments

# CPU stress
stress_cpu() {
  CORES=$1
  DURATION=$2
  stress-ng --cpu $CORES --timeout ${DURATION}s
}

# Memory stress
stress_memory() {
  PERCENT=$1
  DURATION=$2
  stress-ng --vm 1 --vm-bytes ${PERCENT}% --timeout ${DURATION}s
}

# Disk I/O stress
stress_disk() {
  stress-ng --hdd 2 --hdd-bytes 1G --timeout 60s
}

# Fill disk
fill_disk() {
  TARGET_DIR=$1
  dd if=/dev/zero of=${TARGET_DIR}/chaos-fill bs=1M count=10000
}

# Fork bomb (careful! this can take down the host;
# never run it outside a disposable, isolated VM)
fork_bomb() {
  :(){ :|:& };:
}
```
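
Resource chaos is easier to keep safe with a watchdog that aborts the stressor when the host degrades too far. A sketch pairing `stress-ng` with `psutil`; the thresholds are illustrative:

```python
# Run stress-ng, but kill it early if memory pressure gets dangerous.
# Thresholds are illustrative; tune them for the host under test.
import subprocess
import time

import psutil

def guarded_cpu_stress(cores: int = 2, duration_s: int = 60,
                       max_mem_percent: float = 90.0):
    proc = subprocess.Popen(
        ["stress-ng", "--cpu", str(cores), "--timeout", f"{duration_s}s"]
    )
    try:
        while proc.poll() is None:
            if psutil.virtual_memory().percent > max_mem_percent:
                print("watchdog: memory pressure too high, aborting stress")
                proc.terminate()
                break
            time.sleep(1)
    finally:
        if proc.poll() is None:
            proc.kill()

guarded_cpu_stress(cores=2, duration_s=60)
```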

## Game Day Planning

### Game Day Runbook Template

```markdown
# Game Day: [Name]
**Date:** [Date]
**Duration:** [X hours]
**Participants:** [Team members]

## Objectives
1. Validate [system] handles [failure type]
2. Test incident response procedures
3. Document recovery time

## Prerequisites
- [ ] Monitoring dashboards ready
- [ ] Communication channel established
- [ ] Rollback procedures documented
- [ ] Stakeholders notified

## Steady State Definition
- API latency p99 < 500ms
- Error rate < 0.1%
- All health checks passing

## Experiment Schedule

| Time | Action | Owner | Expected Impact |
|------|--------|-------|-----------------|
| 10:00 | Begin experiment | Lead | None |
| 10:05 | Kill primary DB | DB Admin | Failover starts |
| 10:10 | Verify failover | SRE | Latency spike |
| 10:15 | Restore primary | DB Admin | None |
| 10:30 | End experiment | Lead | System stable |

## Abort Criteria
- [ ] Error rate > 5%
- [ ] Customer complaints received
- [ ] Cascading failures detected
- [ ] Recovery taking > 15 minutes

## Rollback Procedure
1. Stop experiment immediately
2. Restore from snapshot if needed
3. Scale up healthy instances
4. Page on-call if needed

## Post-Game Analysis
- What happened?
- What did we learn?
- What should we fix?
- Schedule next game day
```

### Automated Game Day Execution

```python
# Automated game day orchestration
import asyncio
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import List, Callable

class AbortException(Exception):
    """Raised when an abort criterion is met mid-experiment."""

@dataclass
class GameDayStep:
    name: str
    action: Callable
    duration_seconds: int
    abort_check: Callable[[], bool]
    rollback: Callable

class GameDayOrchestrator:
    def __init__(self, steady_state_check: Callable[[], bool]):
        self.steady_state_check = steady_state_check
        self.steps: List[GameDayStep] = []
        self.aborted = False

    def add_step(self, step: GameDayStep):
        self.steps.append(step)

    async def run(self):
        print(f"[{datetime.now()}] Game Day Starting")

        # Verify steady state before starting
        if not self.steady_state_check():
            print("ERROR: System not in steady state. Aborting.")
            return

        executed_steps = []

        try:
            for step in self.steps:
                print(f"[{datetime.now()}] Executing: {step.name}")

                # Execute action
                await asyncio.to_thread(step.action)
                executed_steps.append(step)

                # Wait and monitor
                end_time = datetime.now() + timedelta(seconds=step.duration_seconds)
                while datetime.now() < end_time:
                    if step.abort_check():
                        raise AbortException(f"Abort triggered during {step.name}")
                    await asyncio.sleep(1)

                # Verify steady state
                if not self.steady_state_check():
                    print(f"WARN: Steady state violated after {step.name}")

        except AbortException as e:
            print(f"ABORT: {e}")
            self.aborted = True

        finally:
            # Rollback in reverse order
            print("Rolling back...")
            for step in reversed(executed_steps):
                try:
                    await asyncio.to_thread(step.rollback)
                except Exception as e:
                    print(f"Rollback failed for {step.name}: {e}")

        print(f"[{datetime.now()}] Game Day Complete")

# Usage (check_api_health, kill_redis_node, etc. are environment-specific helpers)
orchestrator = GameDayOrchestrator(
    steady_state_check=lambda: check_api_health() and check_error_rate() < 0.01
)

orchestrator.add_step(GameDayStep(
    name="Kill cache cluster node",
    action=lambda: kill_redis_node("redis-1"),
    duration_seconds=60,
    abort_check=lambda: get_error_rate() > 0.05,
    rollback=lambda: restore_redis_node("redis-1")
))

asyncio.run(orchestrator.run())
```

## Best Practices

### 1. Start Small

```yaml
# Chaos maturity levels
level_1:
  name: "Chaos Curious"
  experiments:
    - Kill single non-critical pod
    - Add minor latency to internal calls
  environment: staging

level_2:
  name: "Chaos Aware"
  experiments:
    - Kill critical service pods
    - Network partition between services
  environment: production (low traffic)

level_3:
  name: "Chaos Native"
  experiments:
    - Multi-AZ failures
    - Database failovers
    - Cascading failures
  environment: production (all traffic)
```

### 2. Monitor Everything

```yaml
# Chaos observability requirements
metrics:
  - name: error_rate
    threshold: 0.1%
    source: prometheus
  - name: latency_p99
    threshold: 500ms
    source: prometheus
  - name: availability
    threshold: 99.9%
    source: synthetic_monitoring

alerts:
  - name: chaos_impact_high
    condition: error_rate > 1%
    action: abort_experiment
  - name: chaos_duration_exceeded
    condition: experiment_time > 10m
    action: notify_and_review
```
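
An abort condition like `error_rate > 1%` can be evaluated directly against Prometheus's HTTP query API. A sketch; the PromQL expression, metric names, and server URL are assumptions for your environment:

```python
# Evaluate an abort condition against Prometheus's /api/v1/query endpoint.
# The PromQL expression and server URL are environment-specific assumptions.
import requests

PROMETHEUS = "http://prometheus:9090"
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[1m]))'
    " / sum(rate(http_requests_total[1m]))"
)

def current_error_rate() -> float:
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query", params={"query": ERROR_RATE_QUERY}
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def should_abort(threshold: float = 0.01) -> bool:
    return current_error_rate() > threshold
```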

### 3. Document Everything

```markdown
## Experiment Report: Redis Failover

**Date:** 2024-01-15
**Duration:** 5 minutes
**Environment:** Production

### Hypothesis
When Redis primary fails, the application will:
1. Detect failure within 30 seconds
2. Reconnect to replica within 60 seconds
3. Maintain < 1% error rate

### Results
| Metric | Expected | Actual | Status |
|--------|----------|--------|--------|
| Detection time | < 30s | 15s | PASS |
| Reconnect time | < 60s | 45s | PASS |
| Error rate | < 1% | 0.3% | PASS |

### Findings
- Sentinel failover worked as expected
- Connection pool reset caused brief spike
- Alert fired correctly at 0.5% error rate

### Action Items
- [ ] Reduce connection pool timeout to 5s
- [ ] Add retry logic for transient failures
- [ ] Update runbook with actual metrics
```

## Use Cases

### 1. Validate Auto-Scaling

```python
# Test auto-scaling under load + chaos.
# generate_load, kill_pods_by_percentage, get_pod_count, and get_error_rate
# are environment-specific helpers; wait_for_condition is sketched below.
async def test_autoscaling_resilience():
    # Generate load
    await generate_load(rps=10000)

    # Inject chaos: kill 50% of pods
    await kill_pods_by_percentage(0.5)

    # Verify autoscaler responds
    await wait_for_condition(
        lambda: get_pod_count() >= MIN_PODS,
        timeout=120
    )

    # Verify steady state
    assert await get_error_rate() < 0.01
```
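
These use cases share a `wait_for_condition` helper that the skill leaves undefined; one minimal way to write it, assuming a synchronous predicate, is a polling loop with a timeout:

```python
# One possible wait_for_condition: poll a synchronous predicate until it
# returns True or the timeout expires.
import asyncio

async def wait_for_condition(predicate, timeout: float, interval: float = 2.0):
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        if predicate():
            return
        await asyncio.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")
```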

### 2. Test Circuit Breakers

```python
# Verify circuit breaker opens under failure
async def test_circuit_breaker():
    # Baseline: circuit closed
    assert await get_circuit_state("payment-service") == "closed"

    # Inject 100% failure rate
    await inject_failure("payment-service", rate=1.0)

    # Wait for circuit to open
    await wait_for_condition(
        lambda: get_circuit_state("payment-service") == "open",
        timeout=30
    )

    # Verify fallback is used
    response = await call_payment_service()
    assert response.used_fallback
```

### 3. Disaster Recovery Drill

```python
# Full region failover test
async def test_region_failover():
    # Record baseline
    baseline_metrics = await capture_metrics()

    # Simulate region failure
    await simulate_region_outage("us-east-1")

    # Verify traffic shifts
    await wait_for_condition(
        lambda: get_traffic_in_region("us-west-2") > 0.95,
        timeout=180
    )

    # Verify performance maintained
    current_metrics = await capture_metrics()
    assert current_metrics.latency_p99 < baseline_metrics.latency_p99 * 1.5
```

## Related Skills

- `devops/kubernetes` - Container orchestration chaos
- `devops/observability` - Monitoring chaos impact
- `testing/performance-testing` - Load testing integration
- `devops/feature-flags` - Chaos kill switches

---

*Think Omega. Build Omega. Be Omega.*