codymaster 4.4.4 → 4.5.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +33 -0
- package/README.md +29 -14
- package/commands/demo.md +1 -1
- package/dist/context-bus.js +70 -0
- package/dist/context-db.js +265 -0
- package/dist/continuity.js +12 -0
- package/dist/file-watcher.js +79 -0
- package/dist/index.js +152 -1
- package/dist/l0-indexer.js +158 -0
- package/dist/mcp-context-server.js +400 -0
- package/dist/migrate-json-to-sqlite.js +126 -0
- package/dist/skill-chain.js +19 -3
- package/dist/token-budget.js +108 -0
- package/dist/uri-resolver.js +203 -0
- package/package.json +7 -1
- package/skills/_shared/helpers.md +50 -14
- package/skills/cm-autopilot/SKILL.md +29 -0
- package/skills/cm-autopilot/scripts/autopilot.py +190 -0
- package/skills/cm-continuity/SKILL.md +90 -28
- package/skills/cm-quality-gate/SKILL.md +11 -1
- package/skills/cm-safe-deploy/SKILL.md +38 -2
- package/skills/cm-security-gate/SKILL.md +158 -34
- package/skills/cm-skill-chain/SKILL.md +47 -1
- package/skills/cm-start/SKILL.md +11 -2
- package/skills/cm-test-gate/SKILL.md +3 -0
- package/skills/boxme-git-config/SKILL.md +0 -56
- package/skills/boxme-local-dev/SKILL.md +0 -66
- package/skills/jobs-to-be-done/SKILL.md +0 -266
- package/skills/jobs-to-be-done/references/case-studies.md +0 -154
- package/skills/jobs-to-be-done/references/competitive-strategy.md +0 -280
- package/skills/jobs-to-be-done/references/diagnostics.md +0 -158
- package/skills/jobs-to-be-done/references/innovation-process.md +0 -392
- package/skills/jobs-to-be-done/references/organizational-change.md +0 -328
- package/skills/marketplace-report-crawler/SKILL.md +0 -176
- package/skills/marketplace-report-crawler/config/accounts.json +0 -41
- package/skills/marketplace-report-crawler/config/report-types.json +0 -422
- package/skills/marketplace-report-crawler/config/sessions.json +0 -3
- package/skills/marketplace-report-crawler/scripts/ab-wrapper.sh +0 -102
- package/skills/marketplace-report-crawler/scripts/browser-actions/lazada/lazada-actions.js +0 -114
- package/skills/marketplace-report-crawler/scripts/browser-actions/shopee/shopee-actions.js +0 -94
- package/skills/marketplace-report-crawler/scripts/browser-actions/tiktok/tiktok-actions.js +0 -272
- package/skills/marketplace-report-crawler/scripts/crawl-runner.js +0 -281
- package/skills/marketplace-report-crawler/scripts/session-check.sh +0 -72
- package/skills/marketplace-report-crawler/scripts/session-manager.sh +0 -349
- package/skills/marketplace-report-crawler/scripts/setup-folders.sh +0 -83
- package/skills/medical-research/SKILL.md +0 -194
- package/skills/medical-research/scripts/evidence_checker.py +0 -288
- package/skills/mom-test/SKILL.md +0 -267
- package/skills/mom-test/references/avoiding-bad-data.md +0 -221
- package/skills/mom-test/references/case-studies.md +0 -306
- package/skills/mom-test/references/commitment-advancement.md +0 -219
- package/skills/mom-test/references/finding-conversations.md +0 -251
- package/skills/mom-test/references/processing-learning.md +0 -256
- package/skills/mom-test/references/question-patterns.md +0 -198
- package/skills/pandasai-analytics/SKILL.md +0 -251
- package/skills/release-it/SKILL.md +0 -235
- package/skills/release-it/references/anti-patterns.md +0 -279
- package/skills/release-it/references/capacity-planning.md +0 -285
- package/skills/release-it/references/chaos-engineering.md +0 -325
- package/skills/release-it/references/deployment-strategies.md +0 -331
- package/skills/release-it/references/observability.md +0 -301
- package/skills/release-it/references/stability-patterns.md +0 -355
- package/skills/skill-creator-ultra/.agents/workflows/skill-audit.md +0 -37
- package/skills/skill-creator-ultra/.agents/workflows/skill-compare.md +0 -34
- package/skills/skill-creator-ultra/.agents/workflows/skill-export.md +0 -51
- package/skills/skill-creator-ultra/.agents/workflows/skill-generate.md +0 -39
- package/skills/skill-creator-ultra/.agents/workflows/skill-scaffold.md +0 -52
- package/skills/skill-creator-ultra/.agents/workflows/skill-simulate.md +0 -25
- package/skills/skill-creator-ultra/.agents/workflows/skill-stats.md +0 -31
- package/skills/skill-creator-ultra/.agents/workflows/skill-validate.md +0 -25
- package/skills/skill-creator-ultra/README.md +0 -1242
- package/skills/skill-creator-ultra/SKILL.md +0 -388
- package/skills/skill-creator-ultra/agents/analyzer.md +0 -274
- package/skills/skill-creator-ultra/agents/comparator.md +0 -202
- package/skills/skill-creator-ultra/agents/grader.md +0 -223
- package/skills/skill-creator-ultra/assets/eval_review.html +0 -146
- package/skills/skill-creator-ultra/eval-viewer/generate_review.py +0 -471
- package/skills/skill-creator-ultra/eval-viewer/viewer.html +0 -1325
- package/skills/skill-creator-ultra/examples/example_anthropic_frontend.md +0 -109
- package/skills/skill-creator-ultra/examples/example_anthropic_pdf.md +0 -116
- package/skills/skill-creator-ultra/examples/example_api_docs.md +0 -189
- package/skills/skill-creator-ultra/examples/example_db_migration.md +0 -253
- package/skills/skill-creator-ultra/examples/example_git_commit.md +0 -111
- package/skills/skill-creator-ultra/install.ps1 +0 -289
- package/skills/skill-creator-ultra/install.sh +0 -313
- package/skills/skill-creator-ultra/phases/phase1_interview.md +0 -202
- package/skills/skill-creator-ultra/phases/phase2_extract.md +0 -55
- package/skills/skill-creator-ultra/phases/phase3_detect.md +0 -57
- package/skills/skill-creator-ultra/phases/phase4_generate.md +0 -543
- package/skills/skill-creator-ultra/phases/phase5_test.md +0 -319
- package/skills/skill-creator-ultra/phases/phase6_eval.md +0 -301
- package/skills/skill-creator-ultra/phases/phase7_iterate.md +0 -103
- package/skills/skill-creator-ultra/phases/phase8_optimize.md +0 -113
- package/skills/skill-creator-ultra/resources/advanced_patterns.md +0 -499
- package/skills/skill-creator-ultra/resources/anti_patterns.md +0 -376
- package/skills/skill-creator-ultra/resources/blueprints.md +0 -498
- package/skills/skill-creator-ultra/resources/checklist.md +0 -243
- package/skills/skill-creator-ultra/resources/composition_cookbook.md +0 -291
- package/skills/skill-creator-ultra/resources/description_optimization.md +0 -90
- package/skills/skill-creator-ultra/resources/eval_guide.md +0 -133
- package/skills/skill-creator-ultra/resources/industry_questions.md +0 -189
- package/skills/skill-creator-ultra/resources/interview_questions.md +0 -200
- package/skills/skill-creator-ultra/resources/pattern_detection.md +0 -200
- package/skills/skill-creator-ultra/resources/prompt_engineering.md +0 -531
- package/skills/skill-creator-ultra/resources/schemas.md +0 -430
- package/skills/skill-creator-ultra/resources/script_integration.md +0 -593
- package/skills/skill-creator-ultra/resources/scripts_guide.md +0 -339
- package/skills/skill-creator-ultra/resources/skill_template.md +0 -124
- package/skills/skill-creator-ultra/resources/skill_writing_guide.md +0 -634
- package/skills/skill-creator-ultra/resources/versioning_guide.md +0 -193
- package/skills/skill-creator-ultra/scripts/ci_eval.py +0 -200
- package/skills/skill-creator-ultra/scripts/package_skill.py +0 -165
- package/skills/skill-creator-ultra/scripts/simulate_skill.py +0 -398
- package/skills/skill-creator-ultra/scripts/skill_audit.py +0 -611
- package/skills/skill-creator-ultra/scripts/skill_compare.py +0 -265
- package/skills/skill-creator-ultra/scripts/skill_export.py +0 -334
- package/skills/skill-creator-ultra/scripts/skill_scaffold.py +0 -403
- package/skills/skill-creator-ultra/scripts/skill_stats.py +0 -339
- package/skills/skill-creator-ultra/scripts/validate_skill.py +0 -411
- package/skills/tailwind-mastery/SKILL.md +0 -229
- package/skills/vercel-react-best-practices/AGENTS.md +0 -3373
- package/skills/vercel-react-best-practices/README.md +0 -123
- package/skills/vercel-react-best-practices/SKILL.md +0 -143
- package/skills/vercel-react-best-practices/rules/_sections.md +0 -46
- package/skills/vercel-react-best-practices/rules/_template.md +0 -28
- package/skills/vercel-react-best-practices/rules/advanced-event-handler-refs.md +0 -55
- package/skills/vercel-react-best-practices/rules/advanced-init-once.md +0 -42
- package/skills/vercel-react-best-practices/rules/advanced-use-latest.md +0 -39
- package/skills/vercel-react-best-practices/rules/async-api-routes.md +0 -38
- package/skills/vercel-react-best-practices/rules/async-defer-await.md +0 -80
- package/skills/vercel-react-best-practices/rules/async-dependencies.md +0 -51
- package/skills/vercel-react-best-practices/rules/async-parallel.md +0 -28
- package/skills/vercel-react-best-practices/rules/async-suspense-boundaries.md +0 -99
- package/skills/vercel-react-best-practices/rules/bundle-barrel-imports.md +0 -59
- package/skills/vercel-react-best-practices/rules/bundle-conditional.md +0 -31
- package/skills/vercel-react-best-practices/rules/bundle-defer-third-party.md +0 -49
- package/skills/vercel-react-best-practices/rules/bundle-dynamic-imports.md +0 -35
- package/skills/vercel-react-best-practices/rules/bundle-preload.md +0 -50
- package/skills/vercel-react-best-practices/rules/client-event-listeners.md +0 -74
- package/skills/vercel-react-best-practices/rules/client-localstorage-schema.md +0 -71
- package/skills/vercel-react-best-practices/rules/client-passive-event-listeners.md +0 -48
- package/skills/vercel-react-best-practices/rules/client-swr-dedup.md +0 -56
- package/skills/vercel-react-best-practices/rules/js-batch-dom-css.md +0 -107
- package/skills/vercel-react-best-practices/rules/js-cache-function-results.md +0 -80
- package/skills/vercel-react-best-practices/rules/js-cache-property-access.md +0 -28
- package/skills/vercel-react-best-practices/rules/js-cache-storage.md +0 -70
- package/skills/vercel-react-best-practices/rules/js-combine-iterations.md +0 -32
- package/skills/vercel-react-best-practices/rules/js-early-exit.md +0 -50
- package/skills/vercel-react-best-practices/rules/js-flatmap-filter.md +0 -60
- package/skills/vercel-react-best-practices/rules/js-hoist-regexp.md +0 -45
- package/skills/vercel-react-best-practices/rules/js-index-maps.md +0 -37
- package/skills/vercel-react-best-practices/rules/js-length-check-first.md +0 -49
- package/skills/vercel-react-best-practices/rules/js-min-max-loop.md +0 -82
- package/skills/vercel-react-best-practices/rules/js-set-map-lookups.md +0 -24
- package/skills/vercel-react-best-practices/rules/js-tosorted-immutable.md +0 -57
- package/skills/vercel-react-best-practices/rules/rendering-activity.md +0 -26
- package/skills/vercel-react-best-practices/rules/rendering-animate-svg-wrapper.md +0 -47
- package/skills/vercel-react-best-practices/rules/rendering-conditional-render.md +0 -40
- package/skills/vercel-react-best-practices/rules/rendering-content-visibility.md +0 -38
- package/skills/vercel-react-best-practices/rules/rendering-hoist-jsx.md +0 -46
- package/skills/vercel-react-best-practices/rules/rendering-hydration-no-flicker.md +0 -82
- package/skills/vercel-react-best-practices/rules/rendering-hydration-suppress-warning.md +0 -30
- package/skills/vercel-react-best-practices/rules/rendering-resource-hints.md +0 -85
- package/skills/vercel-react-best-practices/rules/rendering-script-defer-async.md +0 -68
- package/skills/vercel-react-best-practices/rules/rendering-svg-precision.md +0 -28
- package/skills/vercel-react-best-practices/rules/rendering-usetransition-loading.md +0 -75
- package/skills/vercel-react-best-practices/rules/rerender-defer-reads.md +0 -39
- package/skills/vercel-react-best-practices/rules/rerender-dependencies.md +0 -45
- package/skills/vercel-react-best-practices/rules/rerender-derived-state-no-effect.md +0 -40
- package/skills/vercel-react-best-practices/rules/rerender-derived-state.md +0 -29
- package/skills/vercel-react-best-practices/rules/rerender-functional-setstate.md +0 -74
- package/skills/vercel-react-best-practices/rules/rerender-lazy-state-init.md +0 -58
- package/skills/vercel-react-best-practices/rules/rerender-memo-with-default-value.md +0 -38
- package/skills/vercel-react-best-practices/rules/rerender-memo.md +0 -44
- package/skills/vercel-react-best-practices/rules/rerender-move-effect-to-event.md +0 -45
- package/skills/vercel-react-best-practices/rules/rerender-no-inline-components.md +0 -82
- package/skills/vercel-react-best-practices/rules/rerender-simple-expression-in-memo.md +0 -35
- package/skills/vercel-react-best-practices/rules/rerender-split-combined-hooks.md +0 -64
- package/skills/vercel-react-best-practices/rules/rerender-transitions.md +0 -40
- package/skills/vercel-react-best-practices/rules/rerender-use-deferred-value.md +0 -59
- package/skills/vercel-react-best-practices/rules/rerender-use-ref-transient-values.md +0 -73
- package/skills/vercel-react-best-practices/rules/server-after-nonblocking.md +0 -73
- package/skills/vercel-react-best-practices/rules/server-auth-actions.md +0 -96
- package/skills/vercel-react-best-practices/rules/server-cache-lru.md +0 -41
- package/skills/vercel-react-best-practices/rules/server-cache-react.md +0 -76
- package/skills/vercel-react-best-practices/rules/server-dedup-props.md +0 -65
- package/skills/vercel-react-best-practices/rules/server-hoist-static-io.md +0 -142
- package/skills/vercel-react-best-practices/rules/server-parallel-fetching.md +0 -83
- package/skills/vercel-react-best-practices/rules/server-serialization.md +0 -38
- package/skills/web-design-guidelines/SKILL.md +0 -39
@@ -1,285 +0,0 @@

# Capacity Planning

Capacity planning is the discipline of understanding how much load your system can handle, what breaks first, and how to scale before users experience degradation. It is not a one-time exercise -- it is a continuous practice that evolves as your system, traffic patterns, and infrastructure change.

## Performance Testing Taxonomy

Not all performance tests are equal. Each type answers a different question.
### Test Types

| Test Type | Question It Answers | Duration | Load Profile |
|-----------|---------------------|----------|--------------|
| **Load test** | Can the system handle expected peak traffic? | 30-60 minutes | Ramp to expected peak, hold steady |
| **Stress test** | Where does the system break? | Until failure | Ramp beyond expected peak until degradation or failure |
| **Soak test** | Does the system degrade over time? | 24-72 hours | Sustained load at 70-80% of capacity |
| **Spike test** | How does the system handle sudden bursts? | 15-30 minutes | Sudden jump from baseline to peak, then back |
| **Scalability test** | Does adding resources improve throughput linearly? | Variable | Measure throughput at different resource levels |
### Load Test Design

A good load test simulates real user behavior, not synthetic happy paths.

**Essential elements:**
- **Realistic user journeys:** Mix of browse, search, add to cart, checkout -- not just one endpoint
- **Think time:** Users do not fire requests as fast as possible; include realistic pauses between actions
- **Data variation:** Different users, different products, different search terms -- not the same request repeated
- **Ramp-up period:** Gradually increase load to avoid a cold-start stampede
- **Steady state period:** Hold at target load long enough to observe stabilization (minimum 15 minutes)
- **Ramp-down period:** Gradually decrease load to observe resource release
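The ramp-up / steady-state / ramp-down shape above can be sketched as a target-rate schedule for a load generator. This is a minimal illustration -- the function name, parameters, and default durations are invented for the example, not taken from any particular tool:

```python
def target_rps(t_s: float, peak_rps: float,
               ramp_up_s: float = 600, steady_s: float = 1800,
               ramp_down_s: float = 600) -> float:
    """Target request rate at elapsed time t_s seconds into the test."""
    if t_s < ramp_up_s:                       # gradual ramp avoids a cold-start stampede
        return peak_rps * t_s / ramp_up_s
    if t_s < ramp_up_s + steady_s:            # steady state: hold at target load
        return peak_rps
    t_down = t_s - ramp_up_s - steady_s
    if t_down < ramp_down_s:                  # ramp down to observe resource release
        return peak_rps * (1 - t_down / ramp_down_s)
    return 0.0
```

A driver loop would sample this function each second and schedule that many requests, independent of how fast responses come back.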
### Stress Test Design

The goal is to find the breaking point -- not to prove the system works under normal load.

**Key principles:**
- Increase load incrementally (e.g., 10% every 5 minutes) until you observe degradation
- Monitor all resources: CPU, memory, disk I/O, network, thread pools, connection pools, queue depths
- Record the exact load level when each metric crosses its threshold
- Document the failure mode: does the system degrade gracefully (latency increases, then errors) or fail catastrophically (crash, hang, data corruption)?
- Run the stress test multiple times to verify consistency
### Soak Test Design

Soak tests reveal problems that only manifest over time.

**What soak tests catch:**

| Problem | Mechanism | Detection |
|---------|-----------|-----------|
| **Memory leaks** | Gradual memory growth from unreleased objects | Memory usage trends upward over hours |
| **Connection leaks** | Connections borrowed from pool but never returned | Pool exhaustion after hours of operation |
| **File handle leaks** | Files opened but never closed | "Too many open files" errors after prolonged operation |
| **Log file growth** | Disk fills over extended operation | Disk utilization climbs throughout test |
| **Cache bloat** | Cache grows without eviction under sustained load | Memory or disk consumption increases monotonically |
| **Database bloat** | Temp tables, uncommitted transactions accumulate | Database performance degrades over test duration |

**Soak test requirements:**
- Run at 70-80% of measured capacity (not full stress -- you are testing endurance, not peak)
- Duration: minimum 24 hours, ideally 72 hours
- Monitor resource trends, not just snapshots -- a flat graph is healthy, a rising trend is a leak
- Compare start-of-test and end-of-test resource consumption
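The "flat graph is healthy, rising trend is a leak" heuristic can be automated with a least-squares slope over periodic resource samples. A sketch -- the sample data and thresholds are illustrative:

```python
def leak_slope(samples: list[float]) -> float:
    """Least-squares slope of resource usage per sample interval.
    Near zero => flat (healthy); clearly positive => suspected leak."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var

flat = [512.0, 513.1, 511.8, 512.4, 512.9, 511.5]     # MB: noise around a stable value
leaking = [512.0, 530.5, 548.9, 567.2, 585.0, 604.1]  # MB: ~18 MB growth per interval
```

In practice you would feed this hourly memory, connection-count, or file-handle samples and alert when the slope stays positive across the whole soak.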
---

## Resource Pool Management

Resource pools -- thread pools, connection pools, object pools -- are finite and shared. Mismanaging them is one of the most common causes of production failures.

### Connection Pool Sizing

The most common question: "How many connections do I need?"
**Formula:**
```
pool_size = peak_concurrent_requests × avg_hold_time / avg_request_time
```

**But in practice:**
- Measure actual concurrent active connections under peak load
- Set pool size to measured p99 concurrency + 20-30% headroom
- Set a maximum that protects the downstream resource (databases have their own connection limits)
- Too many connections: each consumes memory on both client and server, and database throughput degrades as the connection count grows
- Too few connections: requests queue waiting for a connection; latency increases; pool exhaustion looks like a database outage
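A back-of-the-envelope helper applying the p99-plus-headroom rule, and the fleet-wide cap from the anti-patterns below. Function names and the example numbers are illustrative:

```python
import math

def pool_size(p99_concurrency: int, headroom: float = 0.25) -> int:
    """Measured p99 concurrent active connections plus 20-30% headroom."""
    return math.ceil(p99_concurrency * (1 + headroom))

def per_instance_max(db_max_connections: int, reserved: int, app_instances: int) -> int:
    """Keep the whole fleet under the database limit, leaving connections
    reserved for admin tools and monitoring."""
    return (db_max_connections - reserved) // app_instances
```

For example, a measured p99 of 30 active connections with 25% headroom gives a pool of 38; a database allowing 100 connections with 10 reserved, shared by 3 app instances, caps each pool at 30.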
### Connection Pool Configuration

| Parameter | Purpose | Typical Value |
|-----------|---------|---------------|
| **Minimum pool size** | Connections maintained even when idle | 5-10 |
| **Maximum pool size** | Hard upper limit on connections | Based on measurement |
| **Checkout timeout** | How long to wait for a connection from the pool | 500ms - 2s |
| **Idle timeout** | How long an unused connection stays in the pool | 5-10 minutes |
| **Max lifetime** | Maximum age of a connection before forced recycling | 30-60 minutes |
| **Validation query** | Query to verify connection health before use | `SELECT 1` |
| **Validation interval** | How often idle connections are validated | 30-60 seconds |
### Connection Pool Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| **No checkout timeout** | Thread waits forever for a connection | Set checkout timeout to 1-2 seconds |
| **No max lifetime** | Stale connections cause intermittent errors | Recycle connections every 30-60 minutes |
| **Pool size = DB max connections** | Leaves no connections for admin, monitoring, or other services | Pool size = (DB max - reserved) / number of application instances |
| **Ignoring connection leaks** | Pool slowly drains until exhaustion | Monitor borrowed-vs-returned; log leaked connections |
| **Default pool size** | Either wastes resources or causes starvation | Size based on measured concurrency |
---

## Thread Pool Management

Thread pools control the concurrency of your application. Getting them right is critical for both throughput and stability.

### Thread Pool Sizing

**CPU-bound workloads:**
```
threads = number_of_cores
```

**I/O-bound workloads (most web applications):**
```
threads = number_of_cores × (1 + wait_time / service_time)
```

Example: 8 cores, requests spend 80% of time waiting on I/O:
```
threads = 8 × (1 + 80/20) = 8 × 5 = 40 threads
```
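The two sizing rules above can be wrapped in one helper. A sketch; the wait and service times must come from your own measurements:

```python
import math

def thread_pool_size(cores: int, wait_time: float = 0.0, service_time: float = 1.0) -> int:
    """CPU-bound (wait_time=0): threads = cores.
    I/O-bound: threads = cores * (1 + wait_time / service_time)."""
    return math.ceil(cores * (1 + wait_time / service_time))

# 8 cores, CPU-bound               -> 8 threads
# 8 cores, 80:20 wait:service I/O  -> 8 * (1 + 4) = 40 threads
```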
### Thread Pool Configuration

| Parameter | Purpose | Consideration |
|-----------|---------|---------------|
| **Core pool size** | Threads always kept alive | Handles normal load without thread creation overhead |
| **Maximum pool size** | Hard upper limit | Handles burst load; too high causes context-switching overhead |
| **Queue capacity** | Work queue between core and max | Bounded queue with rejection policy; never unbounded |
| **Keep-alive time** | How long excess threads survive when idle | 30-60 seconds; balances responsiveness and resource usage |
| **Rejection policy** | What happens when pool and queue are both full | Reject immediately (fail fast) or caller-runs (back-pressure) |
### Thread Pool Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| **Unbounded queue** | Memory grows until OOM; latency climbs invisibly | Use bounded queue; fail fast when full |
| **Single shared pool** | One slow operation starves all others | Separate pools per workload type |
| **Too many threads** | Context-switching overhead exceeds throughput gain | Measure throughput at different pool sizes; find the plateau |
| **No monitoring** | Thread starvation goes undetected until outage | Monitor active/idle/queued counts; alert on pool saturation |
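The "bounded queue, fail fast" fix can be sketched with a bounded work queue that rejects immediately when saturated. This is an illustration of the rejection policy, not a production executor:

```python
import queue

class BoundedWorkQueue:
    """Submit work to a bounded queue; reject (fail fast) when full."""
    def __init__(self, capacity: int):
        self._q = queue.Queue(maxsize=capacity)

    def submit(self, task) -> bool:
        try:
            self._q.put_nowait(task)   # never block the caller
            return True
        except queue.Full:             # saturated: reject instead of queuing unboundedly
            return False
```

A caller that receives `False` can shed load or apply back-pressure upstream, instead of letting latency climb invisibly in an unbounded queue.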
---

## The Universal Scalability Law

The Universal Scalability Law (USL), developed by Neil Gunther, models how system throughput changes as you add resources (servers, threads, cores).

### The Model

```
C(N) = N / (1 + σ(N-1) + κN(N-1))
```

Where:
- **N** = number of processors/servers/threads
- **σ** (sigma) = contention parameter: fraction of work that must be serialized
- **κ** (kappa) = coherence parameter: cost of keeping shared state consistent
- **C(N)** = relative capacity at N resources
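The model is straightforward to evaluate directly (fitting σ and κ to measured throughput is a job for a curve-fitting library). The parameter values below are illustrative:

```python
def usl_capacity(n: int, sigma: float, kappa: float) -> float:
    """Relative capacity C(N) = N / (1 + sigma*(N-1) + kappa*N*(N-1))."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# sigma = kappa = 0: ideal linear scaling, C(N) = N
# kappa > 0: capacity eventually peaks, then goes retrograde --
# adding resources reduces throughput
```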
### Key Insights

| Parameter | Effect | Example |
|-----------|--------|---------|
| **σ = 0, κ = 0** | Linear scalability (ideal but impossible) | Adding 10 servers = 10x throughput |
| **σ > 0, κ = 0** | Amdahl's Law: diminishing returns | Shared lock limits parallelism |
| **σ > 0, κ > 0** | Retrograde scalability: adding resources decreases throughput | Distributed cache coherence overhead exceeds throughput gain |

### Practical Application

1. **Measure throughput** at 1, 2, 4, 8, 16 resources
2. **Fit the USL curve** to find σ and κ
3. **Predict the scalability ceiling:** the point where adding more resources stops helping (or hurts)
4. **Identify the bottleneck:** high σ means contention (locks, serialization); high κ means coherence costs (cache invalidation, distributed consensus)
---

## Capacity Modeling

A capacity model documents the relationship between load, resources, and performance for each service in your system.

### Capacity Model Template

For each service, document:

| Dimension | Current Value | Limit | Action at Limit |
|-----------|---------------|-------|-----------------|
| **Requests/sec** | 500 RPS | 2,000 RPS | Scale horizontally |
| **CPU** | 40% avg, 70% peak | 80% sustained | Add instances |
| **Memory** | 2.5 GB / 4 GB | 3.5 GB | Increase instance size or optimize |
| **DB connections** | 30 active / 50 max | 45 active | Increase pool or add read replicas |
| **Disk I/O** | 200 IOPS | 3,000 IOPS (provisioned) | Upgrade storage tier |
| **Network** | 500 Mbps | 10 Gbps | Unlikely bottleneck |
### Bottleneck Resource

Every service has a bottleneck resource -- the resource that runs out first as load increases. The capacity of the service equals the capacity of its bottleneck.

**Finding the bottleneck:**
1. Run a stress test, increasing load gradually
2. Monitor all resources simultaneously
3. The first resource to hit its limit is the bottleneck
4. All capacity planning focuses on this resource

**Common bottlenecks by service type:**

| Service Type | Typical Bottleneck |
|--------------|--------------------|
| API services | Thread pool or connection pool |
| Data-heavy services | Database connections or query throughput |
| Compute-heavy services | CPU |
| File processing | Disk I/O or memory |
| Real-time services | Network bandwidth or connection count |
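The "first resource to hit its limit" step can be expressed as a small scan over utilization samples taken at increasing load levels. The resource names, limits, and sample values here are invented for illustration:

```python
def first_bottleneck(samples_by_load, limits):
    """samples_by_load: list of (load, {resource: utilization}) in increasing
    load order. Returns (load, resource) for the first limit crossed, else None."""
    for load, utilization in samples_by_load:
        for resource, value in utilization.items():
            if value >= limits[resource]:
                return load, resource
    return None

limits = {"cpu": 0.80, "db_connections": 45}
samples = [
    (500,  {"cpu": 0.40, "db_connections": 30}),
    (1000, {"cpu": 0.65, "db_connections": 47}),  # DB pool crosses its limit first
    (1500, {"cpu": 0.85, "db_connections": 50}),
]
```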
---

## Capacity Myths

### Myth 1: "The Cloud Is Infinitely Scalable"

Reality:
- Auto-scaling has lag time (1-5 minutes to provision and start new instances)
- Cold starts add latency to the first requests on new instances
- Cloud providers have account-level limits (instance count, API rate limits)
- Some resources do not scale horizontally (relational databases, stateful services)
- Scaling costs money -- infinite scale means infinite cost
### Myth 2: "We'll Just Add More Servers"

Reality:
- Adding servers only helps if the bottleneck is CPU or memory on the application tier
- If the bottleneck is the database, adding application servers makes it worse (more connections, more load on the same database)
- Network hops, serialization overhead, and coordination costs increase with more servers
- Horizontal scaling requires stateless design -- session affinity, local caches, and local file storage break horizontal scaling
### Myth 3: "Our Load Tests Pass, So We're Fine"

Reality:
- Load tests with synthetic data miss hot spots in production data
- Load tests rarely simulate realistic user behavior (think times, session patterns, edge cases)
- Load test environments rarely match production topology, network latency, or data volume
- Load tests find throughput limits but not endurance problems (those require soak tests)
- Passing a load test at 2x expected peak does not protect against 10x flash crowds
### Myth 4: "We Don't Need Capacity Planning -- We Have Auto-Scaling"

Reality:
- Auto-scaling reacts to load after it arrives; capacity planning anticipates load before it arrives
- Auto-scaling cannot protect against instant traffic spikes (Black Friday, viral events)
- Auto-scaling policies themselves need testing -- misconfigured policies can scale in the wrong direction or oscillate
- Cost management requires understanding baseline and peak capacity needs
---

## Performance Anti-Patterns

### Resource Contention

Multiple threads competing for the same resource (lock, connection, CPU core). Throughput plateaus or decreases as concurrency increases.

**Detection:** Throughput does not increase when adding threads/instances. CPU utilization is low despite high load. Thread dumps show threads waiting on locks.

**Fix:** Reduce lock scope. Use lock-free data structures. Partition data to reduce contention. Use read-write locks instead of exclusive locks.
### The Coordinated Omission Problem

Load testing tools that wait for a response before sending the next request undercount latency at high load. When the server slows down, the tool slows down with it, so measured throughput looks stable while massive latency increases go unrecorded.

**Detection:** Load test shows consistent throughput even as the system degrades. Real users report much worse latency than load tests measure.

**Fix:** Use load testing tools that support coordinated omission correction (e.g., wrk2, or Gatling in constant-throughput mode). Measure latency independently of throughput. Use open-loop load generators that send requests at a fixed rate regardless of response time.
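The correction idea (in the style of HdrHistogram-type recorders) back-fills the samples a stalled closed-loop generator never issued: a response that took longer than the intended send interval implies later requests would have queued behind it. A minimal sketch of that back-fill:

```python
def correct_coordinated_omission(latencies_ms, interval_ms):
    """For each recorded latency longer than the intended send interval,
    synthesize the latencies the omitted requests would have observed:
    latency - k*interval, while still positive."""
    corrected = []
    for lat in latencies_ms:
        corrected.append(lat)
        missing = lat - interval_ms
        while missing > 0:
            corrected.append(missing)
            missing -= interval_ms
    return corrected
```

A single 35ms stall at a 10ms send interval yields three synthesized samples (25, 15, 5), pulling the reported percentiles back toward what real users experienced.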
### N+1 Query Problem

Fetching a list of N items, then making one additional query for each item. Total queries = N + 1 instead of 1 or 2.

**Detection:** Database query count scales linearly with result set size. Response time increases linearly with page size.

**Fix:** Use eager loading / JOIN queries. Batch queries (`WHERE id IN (...)`). Implement the DataLoader pattern for GraphQL.
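The batched fix can be shown with `sqlite3` from the Python standard library: a single `WHERE id IN (...)` query replaces N per-item round trips. The schema and data are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace"), (3, "Edsger")])

author_ids = [1, 2, 3]

# N+1 pattern: one query per item (N round trips to the database)
n_plus_1 = [conn.execute("SELECT name FROM authors WHERE id = ?", (i,)).fetchone()[0]
            for i in author_ids]

# Batched fix: a single query for all items
placeholders = ",".join("?" * len(author_ids))
batched = [row[0] for row in conn.execute(
    f"SELECT name FROM authors WHERE id IN ({placeholders}) ORDER BY id",
    author_ids)]

assert n_plus_1 == batched  # same result, one round trip instead of N
```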
@@ -1,325 +0,0 @@

# Chaos Engineering

Chaos engineering is the discipline of experimenting on a system in order to build confidence in its ability to withstand turbulent conditions in production. It is not about breaking things for fun -- it is a rigorous, scientific approach to discovering weaknesses before they cause outages.

> **Safety note:** This reference describes chaos engineering *concepts and planning patterns*. All failure injection experiments must be performed by authorized engineers using dedicated chaos tooling (e.g., Gremlin, Litmus, AWS Fault Injection Simulator) with proper approvals, blast radius controls, monitoring, and rollback plans. Commands shown are for reference only -- never run them without authorization and safeguards.

The fundamental insight is simple: you cannot know how your system handles failure until it actually fails. Waiting for production incidents to discover weaknesses is reactive and expensive. Chaos engineering is proactive and controlled.
|
|
8
|
-
|
|
9
|
-
## Principles of Chaos Engineering

### 1. Define Steady State

Before you can detect abnormal behavior, you must define what normal looks like. Steady state is expressed as measurable business or system metrics that indicate the system is functioning correctly.

**Good steady state definitions:**

| Metric Type | Steady State Definition | Example |
|-------------|------------------------|---------|
| **Business metric** | Orders per minute within expected range | 100-150 orders/min during business hours |
| **Error rate** | Below defined threshold | < 0.1% 5xx errors |
| **Latency** | Within SLO bounds | p99 latency < 500ms |
| **Throughput** | Within expected range | 1000-2000 RPS |
| **Availability** | All critical paths responding | Health checks green on all services |

**Bad steady state definitions:**
- "The system is working" (not measurable)
- "No alerts firing" (absence of evidence is not evidence of absence)
- "CPU below 80%" (cause-based, not symptom-based)

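Steady state only works as an experiment gate if it can be checked mechanically. A small sketch of such a check -- the metric names and bounds below are illustrative, not prescribed by this document:

```python
# Steady state expressed as named metrics with acceptable (low, high) bounds.
STEADY_STATE = {
    "error_rate_pct": (0.0, 0.1),     # < 0.1% 5xx errors
    "p99_latency_ms": (0.0, 500.0),   # p99 latency < 500ms
    "orders_per_min": (100.0, 150.0), # expected business-hours range
}

def steady_state_breaches(metrics, bounds=STEADY_STATE):
    """Return the metrics that are missing or outside their bounds.

    An empty result means the system is in steady state; a non-empty
    result names exactly which symptom-based metric broke.
    """
    breaches = {}
    for name, (lo, hi) in bounds.items():
        value = metrics.get(name)
        if value is None or not (lo <= value <= hi):
            breaches[name] = value
    return breaches
```

The same function serves double duty: it verifies steady state before an experiment starts and evaluates the hypothesis after the failure is injected.
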
### 2. Formulate a Hypothesis

Every chaos experiment starts with a hypothesis: a prediction about what will happen when you inject a specific failure.

**Hypothesis format:**
```
"We believe that when [failure condition], the system will [expected behavior],
as measured by [steady state metric] remaining within [acceptable bounds]."
```

**Example hypotheses:**

| Failure | Hypothesis | Metric |
|---------|-----------|--------|
| Terminate one API instance (via chaos tooling) | System continues serving traffic with no user-visible errors | Error rate stays < 0.1% |
| Add 500ms latency to database (via chaos tooling) | Response time degrades but stays within SLO; circuit breaker does not trip | p99 < 2s; no circuit breaker events |
| Payment service returns 503 (via fault injection proxy) | Checkout shows graceful error; other features unaffected | Non-checkout error rate unchanged |
| Disk at 95% capacity (via chaos tooling) | Log rotation triggers; alerts fire; no service disruption | Disk drops below 90% within 10 minutes |

### 3. Introduce Real-World Failures

Chaos experiments should simulate failures that actually happen in production, not theoretical edge cases.

**Common failure types to simulate (via dedicated chaos tooling):**

| Category | Failures | Tooling Examples |
|----------|----------|-----------------|
| **Infrastructure** | Instance crash, disk failure, network partition | Gremlin, Litmus, AWS FIS |
| **Network** | Latency, packet loss, DNS failure | Toxiproxy, Istio fault injection, tc (traffic control) |
| **Application** | Memory pressure, CPU saturation, thread contention | stress-ng (controlled), Gremlin resource attacks |
| **Dependency** | Service unavailable, slow response, corrupt response | Toxiproxy, Envoy fault injection, mock services |
| **Cloud** | AZ failure, region degradation, API throttling | AWS FIS, GCP Fault Injection, Azure Chaos Studio |

### 4. Run in Production

Staging environments do not reproduce the complexity of production. They lack real user traffic, real data volumes, real concurrency patterns, and real interactions between services. Chaos experiments in staging build false confidence.

**But safely:**
- Start with non-production, then graduate to production
- Use the smallest blast radius possible
- Have an emergency stop mechanism to halt the experiment immediately
- Run during business hours when the team is available to respond
- Inform the on-call team before running experiments
- Never experiment during peak traffic or known risky periods

### 5. Automate and Run Continuously

A chaos experiment that runs once proves resilience at one point in time. Automated, recurring experiments prove resilience continuously.

**Automation maturity levels:**

| Level | Practice | Confidence |
|-------|----------|-----------|
| **Manual** | Engineer runs experiment by hand, observes results | Low -- depends on who runs it and when |
| **Scripted** | Experiment codified in a script, run on schedule | Medium -- repeatable but requires human analysis |
| **Automated** | Experiment runs automatically, evaluates steady state, reports results | High -- continuous verification |
| **Integrated** | Experiments run in CI/CD pipeline; failing experiment blocks deployment | Very high -- resilience is a deployment gate |

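At the Integrated level, experiment results become a deployment gate. A minimal sketch of such a gate; the result-dict shape is an assumption of this example, matching whatever your experiment runner emits:

```python
def resilience_gate(experiment_results):
    """CI/CD gate: a refuted hypothesis or failed recovery blocks deploy.

    `experiment_results` is a list of dicts like
    {"name": ..., "hypothesis_held": bool, "recovered": bool}.
    Returns a process exit code: 0 to allow deploy, 1 to block it.
    """
    failures = [r["name"] for r in experiment_results
                if not (r.get("hypothesis_held") and r.get("recovered"))]
    if failures:
        print("resilience gate FAILED:", ", ".join(failures))
        return 1
    print("resilience gate passed")
    return 0
```

Wired into a pipeline step (`sys.exit(resilience_gate(results))`), a regression in resilience fails the build the same way a failing unit test does.
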
---

## Chaos Experiment Design

### Experiment Template

```
Experiment: [Name]
Date: [When]
Team: [Who is running it]
Blast Radius: [What is affected]

Hypothesis:
When [failure condition], we expect [behavior],
as measured by [metric] remaining within [bounds].

Steady State:
- Metric 1: [current value, acceptable range]
- Metric 2: [current value, acceptable range]

Method:
1. Verify steady state
2. Inject [specific failure]
3. Observe for [duration]
4. Measure [metrics]
5. Remove failure injection
6. Verify recovery to steady state

Abort Conditions:
- [Metric] exceeds [threshold]
- On-call pages for [service]
- Customer-visible impact detected

Results:
- Hypothesis confirmed/refuted
- Observations: [what happened]
- Action items: [what to fix]
```

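The Method and Abort Conditions of the template map directly onto a small driver loop. A sketch with every tooling-specific action injected as a callable -- the names and return shape here are illustrative, not a real chaos platform's API:

```python
import time

class AbortExperiment(Exception):
    """Raised when an abort condition trips mid-experiment."""

def run_experiment(check_steady, inject, revert, observe_sec,
                   abort_conditions=(), poll_sec=1.0):
    """Drive one experiment: verify steady state, inject the failure,
    watch abort conditions while observing, then always revert and
    verify recovery to steady state.
    """
    if not check_steady():                        # step 1: verify steady state
        return {"ran": False, "reason": "not in steady state"}
    inject()                                      # step 2: inject the failure
    try:
        deadline = time.monotonic() + observe_sec
        while time.monotonic() < deadline:        # steps 3-4: observe, measure
            for condition in abort_conditions:
                if condition():
                    raise AbortExperiment(condition.__name__)
            time.sleep(poll_sec)
        hypothesis_held = check_steady()
    except AbortExperiment as exc:
        return {"ran": True, "aborted": True, "reason": str(exc)}
    finally:
        revert()                                  # step 5: always remove injection
    recovered = check_steady()                    # step 6: verify recovery
    return {"ran": True, "aborted": False,
            "hypothesis_held": hypothesis_held, "recovered": recovered}
```

The `finally` block is the important part: whether the hypothesis holds, an abort condition trips, or the runner itself crashes, the failure injection is removed.
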
### Blast Radius Management

Blast radius is the scope of impact if the experiment causes unexpected damage. Always minimize blast radius and expand gradually.

**Blast radius levels:**

| Level | Scope | When to Use |
|-------|-------|-------------|
| **Development** | Single developer's environment | First-time experiments, unproven hypotheses |
| **Staging** | Staging environment | Validating experiment mechanics before production |
| **Canary** | Small subset of production (1-5%) | First production experiment for a new failure type |
| **Single AZ** | One availability zone in production | Testing AZ failure resilience |
| **Full production** | All production traffic | Well-understood experiments that have been run many times |

**Blast radius controls:**
- **Targeting:** Limit experiment to specific instances, user segments, or traffic percentage
- **Duration:** Set maximum experiment duration; auto-revert after timeout
- **Emergency stop:** One-button (or automatic) experiment termination
- **Monitoring:** Real-time dashboards showing experiment impact on steady state metrics
- **Rollback:** Pre-planned steps to undo the experiment if things go wrong

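The Duration and Emergency stop controls combine naturally into one guard: a context manager that always reverts on exit and arms a watchdog that reverts even if the experiment overruns its window. A sketch, assuming `revert` is idempotent (safe to call twice):

```python
import threading
from contextlib import contextmanager

@contextmanager
def blast_radius_guard(revert, max_duration_sec):
    """Ensure an injected failure is always removed.

    `revert` runs once on normal exit (or on exception), and a watchdog
    timer runs it anyway if the experiment exceeds `max_duration_sec`.
    `revert` must be idempotent, since both paths may fire.
    """
    done = threading.Event()

    def timed_revert():
        if not done.is_set():
            revert()                  # auto-revert: experiment overran

    watchdog = threading.Timer(max_duration_sec, timed_revert)
    watchdog.start()
    try:
        yield
    finally:
        done.set()
        watchdog.cancel()
        revert()                      # normal-path revert
```

Usage is `with blast_radius_guard(remove_latency_toxic, 300): run_observation()`, where `remove_latency_toxic` is whatever your chaos tooling exposes as the undo step.
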
---

## Failure Injection Techniques

### Process-Level Failures

| Technique | What It Simulates | Chaos Tooling |
|-----------|-------------------|--------------|
| Instance termination | Crash | Gremlin process attack, Litmus pod-delete, Kubernetes pod eviction (with disruption budgets as guardrails) |
| Process freeze | Hang/unresponsive | Gremlin process attack (pause) |
| CPU saturation | Compute pressure | Gremlin CPU attack, Litmus pod-cpu-hog, stress-ng (controlled, authorized) |
| Memory pressure | Memory exhaustion | Gremlin memory attack, Litmus pod-memory-hog |
| Process flood | Process table exhaustion | Gremlin process attack with controlled parameters |

### Network-Level Failures

| Technique | What It Simulates | Chaos Tooling |
|-----------|-------------------|--------------|
| Latency injection | Slow network | Toxiproxy, Gremlin latency attack, Istio fault injection |
| Packet loss | Unreliable network | Gremlin packet loss attack, Toxiproxy, tc netem (authorized) |
| Network partition | Network split | Gremlin blackhole attack, Litmus pod-network-partition |
| DNS failure | DNS outage | Gremlin DNS attack, Litmus pod-dns-error |
| Bandwidth limit | Constrained network | Toxiproxy bandwidth limit, Gremlin bandwidth attack |

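As an example of how lightweight proxy-based injection is, adding a latency toxic through Toxiproxy's HTTP API amounts to one small JSON POST against a running toxiproxy server. This sketch only *builds* the request; the endpoint path and field names follow Toxiproxy's documented API, but verify them against the version you run, and only apply toxics under the authorization rules in the safety note above.

```python
import json

def latency_toxic(proxy_name, latency_ms, jitter_ms=0, toxicity=1.0):
    """Build the (path, body) pair for adding a downstream latency toxic
    to an existing Toxiproxy proxy, e.g. via
    POST http://localhost:8474{path} with {body}.
    """
    path = f"/proxies/{proxy_name}/toxics"
    body = {
        "name": f"{proxy_name}_latency",
        "type": "latency",
        "stream": "downstream",   # delay server -> client traffic
        "toxicity": toxicity,     # fraction of connections affected
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }
    return path, json.dumps(body)
```

Because the experiment is just proxy state, the revert step is equally small: a DELETE of the same toxic restores the dependency to normal.
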
### Dependency-Level Failures

| Technique | What It Simulates | Chaos Tooling |
|-----------|-------------------|--------------|
| Error injection proxy | Downstream errors | Toxiproxy, Envoy fault injection |
| Latency injection | Slow dependency | Toxiproxy, Istio fault injection |
| Connection limit | Pool exhaustion | Toxiproxy connection limit, Gremlin blackhole |
| Response corruption | Data integrity issues | Custom fault injection proxy |
| Certificate expiration | TLS failures | Expired test certificate in staging |

### Disk-Level Failures

| Technique | What It Simulates | Chaos Tooling |
|-----------|-------------------|--------------|
| Disk pressure | Disk full | Gremlin disk attack, Litmus disk-fill |
| Slow I/O | Storage degradation | Gremlin IO attack, dm-delay (authorized) |
| Read-only filesystem | Mount failure | Gremlin disk attack (read-only mode) |
| Data corruption | Integrity issues | Controlled corruption of test data in staging |

---

## Chaos Monkey and Netflix's Approach

Netflix pioneered chaos engineering with Chaos Monkey, which randomly terminates production instances during business hours using automated, authorized tooling with built-in safeguards.

### The Netflix Chaos Engineering Stack

| Tool | What It Does | Scope |
|------|-------------|-------|
| **Chaos Monkey** | Terminates random instances (automated, authorized) | Single instance |
| **Chaos Kong** | Simulates entire region failure | Region |
| **Latency Monkey** | Injects network latency | Network |
| **Conformity Monkey** | Finds instances not following best practices | Compliance |
| **Chaos Automation Platform (ChAP)** | Runs automated experiments with steady state comparison | Full stack |

### Key Lessons from Netflix

1. **Start small:** Chaos Monkey terminates one instance at a time. Only after years of practice did Netflix graduate to region-level chaos (Chaos Kong).
2. **Business hours only:** Run experiments when the team is available to respond. Late-night chaos is just an outage.
3. **Opt-out, not opt-in:** By default, all services are enrolled. Teams must explicitly justify opting out.
4. **No blame:** Finding a weakness is a success, not a failure. Teams that discover and fix problems through chaos engineering are celebrated.
5. **Invest in tooling:** Manual chaos is not sustainable. Invest in platforms that automate experiment execution, evaluation, and reporting.

---

## GameDay Exercises

A GameDay is a scheduled exercise where a team practices responding to a realistic failure scenario. It combines chaos engineering (inject failure) with incident response practice (detect, diagnose, mitigate, resolve).

### GameDay Structure

| Phase | Duration | Activities |
|-------|----------|-----------|
| **Preparation** | 1-2 weeks before | Define scenario, brief participants, set up monitoring |
| **Pre-game** | 30 minutes | Verify steady state, confirm all participants are ready |
| **Execution** | 1-3 hours | Inject failure, observe team response, take notes |
| **Post-game** | 1 hour | Debrief, identify what worked and what did not |
| **Follow-up** | 1-2 weeks after | File action items, track remediation, schedule next GameDay |

### GameDay Scenarios

| Scenario | Complexity | What It Tests |
|----------|-----------|---------------|
| Terminate a single instance | Low | Auto-healing, health checks, load balancing |
| Simulate database failover | Medium | Connection handling, read replica routing, data consistency |
| Full AZ failure | High | Multi-AZ architecture, DNS failover, stateful service recovery |
| Dependency outage (payment provider) | Medium | Circuit breakers, fallback behavior, user communication |
| Security incident (compromised credentials) | High | Credential rotation, access logging, incident response process |
| Data corruption | High | Backup restoration, data validation, recovery time |

### GameDay Ground Rules

1. **Safety first:** A facilitator can halt the exercise at any time if real customer impact occurs
2. **No blame:** The goal is learning, not grading individual performance
3. **Document everything:** Notes, timestamps, decisions, and outcomes
4. **Include diverse roles:** Engineers, SREs, product managers, customer support
5. **Realistic but controlled:** Use production systems but with blast radius limits
6. **Schedule during business hours:** The team should be fully staffed and alert
7. **Announce to stakeholders:** Customer support and leadership should know a GameDay is happening

### GameDay Facilitation Tips

- **The facilitator does not fix things.** Their job is to inject failures, observe the response, and take notes.
- **Start with known weaknesses.** The first GameDay should test a scenario the team suspects might fail -- the learning is in confirming the suspicion and practicing the response.
- **Increase difficulty over time.** First GameDay: terminate one instance. Tenth GameDay: simultaneous database failover + network partition + on-call engineer is unavailable.
- **Celebrate findings.** Every weakness discovered in a GameDay is a weakness that will not cause a real outage. This is a win.

---

## Building Confidence Through Controlled Failure

### The Confidence Curve

```
Confidence
 ▲
 │                              ╭───────────
 │                         ╭────╯
 │                    ╭────╯
 │               ╭────╯
 │          ╭────╯
 │     ╭────╯
 │╭────╯
 └──────────────────────────────────────────→ Experiments
  0     10    20    30    40    50    60
```

Each successful chaos experiment increases confidence that the system will survive that failure in production. Each failed experiment (where the hypothesis was disproven) reveals a weakness that, once fixed, increases actual resilience.

### Maturity Model

| Level | Practice | Organization |
|-------|----------|-------------|
| **Level 0: Reactive** | No chaos engineering; learn from production incidents | Firefighting culture |
| **Level 1: Exploratory** | Manual, ad-hoc experiments; individual teams | Curious early adopters |
| **Level 2: Systematic** | Regular GameDays; documented experiments; shared learnings | Engineering-wide practice |
| **Level 3: Automated** | Continuous automated experiments; integrated into CI/CD | Platform team provides tooling |
| **Level 4: Cultural** | Chaos engineering is expected for all services; opt-out requires justification | Resilience is a core engineering value |

### Getting Started

If you have never done chaos engineering, start here:

1. **Pick one service** -- the one you are most worried about
2. **Define its steady state** -- what metrics prove it is working?
3. **Form one hypothesis** -- "If we terminate one instance, the service will continue serving traffic"
4. **Run the experiment in staging** -- verify your tooling and monitoring work
5. **Run the experiment in production** -- with minimal blast radius and an emergency stop mechanism
6. **Document the results** -- what happened? Was the hypothesis confirmed?
7. **Fix what you found** -- if the hypothesis was disproven, fix the weakness
8. **Repeat** -- expand to more failure types, more services, more complex scenarios

### Common Objections and Responses

| Objection | Response |
|-----------|---------|
| "We can't break production on purpose!" | You are already breaking production accidentally. Chaos engineering lets you do it on your terms, when you are prepared. |
| "Our system isn't resilient enough for chaos" | That is exactly why you need chaos engineering -- to find and fix the weaknesses. Start small. |
| "We don't have time for this" | You have time for incident response, post-mortems, and customer apologies. Chaos engineering reduces all three. |
| "What if we cause an outage?" | Start with minimal blast radius in staging. Terminate one process. If that causes an outage, you have learned something invaluable. |
| "Management won't approve this" | Frame it as risk reduction, not risk creation. Show the cost of recent outages vs. the cost of preventive experiments. |

### Anti-Fragility

The ultimate goal of chaos engineering is not just resilience (surviving failure) but anti-fragility (getting stronger from failure). A system is anti-fragile when each failure makes it more resistant to future failures.

**How chaos engineering builds anti-fragility:**
- Each experiment reveals a weakness
- Each fix removes that weakness permanently
- Each automated experiment continuously verifies the fix
- Over time, the system has been tested against every common failure mode
- New failure modes are discovered faster because the team has built the muscles and tooling to find them

This is the Release It! philosophy in action: production-ready software is not software that never fails. It is software that has been designed, tested, and operated to handle failure gracefully -- because failure is not a possibility, it is a certainty.