codymaster 4.4.4 → 4.5.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (190)
  1. package/CHANGELOG.md +33 -0
  2. package/README.md +29 -14
  3. package/commands/demo.md +1 -1
  4. package/dist/context-bus.js +70 -0
  5. package/dist/context-db.js +265 -0
  6. package/dist/continuity.js +12 -0
  7. package/dist/file-watcher.js +79 -0
  8. package/dist/index.js +152 -1
  9. package/dist/l0-indexer.js +158 -0
  10. package/dist/mcp-context-server.js +400 -0
  11. package/dist/migrate-json-to-sqlite.js +126 -0
  12. package/dist/skill-chain.js +19 -3
  13. package/dist/token-budget.js +108 -0
  14. package/dist/uri-resolver.js +203 -0
  15. package/package.json +7 -1
  16. package/skills/_shared/helpers.md +50 -14
  17. package/skills/cm-autopilot/SKILL.md +29 -0
  18. package/skills/cm-autopilot/scripts/autopilot.py +190 -0
  19. package/skills/cm-continuity/SKILL.md +90 -28
  20. package/skills/cm-quality-gate/SKILL.md +11 -1
  21. package/skills/cm-safe-deploy/SKILL.md +38 -2
  22. package/skills/cm-security-gate/SKILL.md +158 -34
  23. package/skills/cm-skill-chain/SKILL.md +47 -1
  24. package/skills/cm-start/SKILL.md +11 -2
  25. package/skills/cm-test-gate/SKILL.md +3 -0
  26. package/skills/boxme-git-config/SKILL.md +0 -56
  27. package/skills/boxme-local-dev/SKILL.md +0 -66
  28. package/skills/jobs-to-be-done/SKILL.md +0 -266
  29. package/skills/jobs-to-be-done/references/case-studies.md +0 -154
  30. package/skills/jobs-to-be-done/references/competitive-strategy.md +0 -280
  31. package/skills/jobs-to-be-done/references/diagnostics.md +0 -158
  32. package/skills/jobs-to-be-done/references/innovation-process.md +0 -392
  33. package/skills/jobs-to-be-done/references/organizational-change.md +0 -328
  34. package/skills/marketplace-report-crawler/SKILL.md +0 -176
  35. package/skills/marketplace-report-crawler/config/accounts.json +0 -41
  36. package/skills/marketplace-report-crawler/config/report-types.json +0 -422
  37. package/skills/marketplace-report-crawler/config/sessions.json +0 -3
  38. package/skills/marketplace-report-crawler/scripts/ab-wrapper.sh +0 -102
  39. package/skills/marketplace-report-crawler/scripts/browser-actions/lazada/lazada-actions.js +0 -114
  40. package/skills/marketplace-report-crawler/scripts/browser-actions/shopee/shopee-actions.js +0 -94
  41. package/skills/marketplace-report-crawler/scripts/browser-actions/tiktok/tiktok-actions.js +0 -272
  42. package/skills/marketplace-report-crawler/scripts/crawl-runner.js +0 -281
  43. package/skills/marketplace-report-crawler/scripts/session-check.sh +0 -72
  44. package/skills/marketplace-report-crawler/scripts/session-manager.sh +0 -349
  45. package/skills/marketplace-report-crawler/scripts/setup-folders.sh +0 -83
  46. package/skills/medical-research/SKILL.md +0 -194
  47. package/skills/medical-research/scripts/evidence_checker.py +0 -288
  48. package/skills/mom-test/SKILL.md +0 -267
  49. package/skills/mom-test/references/avoiding-bad-data.md +0 -221
  50. package/skills/mom-test/references/case-studies.md +0 -306
  51. package/skills/mom-test/references/commitment-advancement.md +0 -219
  52. package/skills/mom-test/references/finding-conversations.md +0 -251
  53. package/skills/mom-test/references/processing-learning.md +0 -256
  54. package/skills/mom-test/references/question-patterns.md +0 -198
  55. package/skills/pandasai-analytics/SKILL.md +0 -251
  56. package/skills/release-it/SKILL.md +0 -235
  57. package/skills/release-it/references/anti-patterns.md +0 -279
  58. package/skills/release-it/references/capacity-planning.md +0 -285
  59. package/skills/release-it/references/chaos-engineering.md +0 -325
  60. package/skills/release-it/references/deployment-strategies.md +0 -331
  61. package/skills/release-it/references/observability.md +0 -301
  62. package/skills/release-it/references/stability-patterns.md +0 -355
  63. package/skills/skill-creator-ultra/.agents/workflows/skill-audit.md +0 -37
  64. package/skills/skill-creator-ultra/.agents/workflows/skill-compare.md +0 -34
  65. package/skills/skill-creator-ultra/.agents/workflows/skill-export.md +0 -51
  66. package/skills/skill-creator-ultra/.agents/workflows/skill-generate.md +0 -39
  67. package/skills/skill-creator-ultra/.agents/workflows/skill-scaffold.md +0 -52
  68. package/skills/skill-creator-ultra/.agents/workflows/skill-simulate.md +0 -25
  69. package/skills/skill-creator-ultra/.agents/workflows/skill-stats.md +0 -31
  70. package/skills/skill-creator-ultra/.agents/workflows/skill-validate.md +0 -25
  71. package/skills/skill-creator-ultra/README.md +0 -1242
  72. package/skills/skill-creator-ultra/SKILL.md +0 -388
  73. package/skills/skill-creator-ultra/agents/analyzer.md +0 -274
  74. package/skills/skill-creator-ultra/agents/comparator.md +0 -202
  75. package/skills/skill-creator-ultra/agents/grader.md +0 -223
  76. package/skills/skill-creator-ultra/assets/eval_review.html +0 -146
  77. package/skills/skill-creator-ultra/eval-viewer/generate_review.py +0 -471
  78. package/skills/skill-creator-ultra/eval-viewer/viewer.html +0 -1325
  79. package/skills/skill-creator-ultra/examples/example_anthropic_frontend.md +0 -109
  80. package/skills/skill-creator-ultra/examples/example_anthropic_pdf.md +0 -116
  81. package/skills/skill-creator-ultra/examples/example_api_docs.md +0 -189
  82. package/skills/skill-creator-ultra/examples/example_db_migration.md +0 -253
  83. package/skills/skill-creator-ultra/examples/example_git_commit.md +0 -111
  84. package/skills/skill-creator-ultra/install.ps1 +0 -289
  85. package/skills/skill-creator-ultra/install.sh +0 -313
  86. package/skills/skill-creator-ultra/phases/phase1_interview.md +0 -202
  87. package/skills/skill-creator-ultra/phases/phase2_extract.md +0 -55
  88. package/skills/skill-creator-ultra/phases/phase3_detect.md +0 -57
  89. package/skills/skill-creator-ultra/phases/phase4_generate.md +0 -543
  90. package/skills/skill-creator-ultra/phases/phase5_test.md +0 -319
  91. package/skills/skill-creator-ultra/phases/phase6_eval.md +0 -301
  92. package/skills/skill-creator-ultra/phases/phase7_iterate.md +0 -103
  93. package/skills/skill-creator-ultra/phases/phase8_optimize.md +0 -113
  94. package/skills/skill-creator-ultra/resources/advanced_patterns.md +0 -499
  95. package/skills/skill-creator-ultra/resources/anti_patterns.md +0 -376
  96. package/skills/skill-creator-ultra/resources/blueprints.md +0 -498
  97. package/skills/skill-creator-ultra/resources/checklist.md +0 -243
  98. package/skills/skill-creator-ultra/resources/composition_cookbook.md +0 -291
  99. package/skills/skill-creator-ultra/resources/description_optimization.md +0 -90
  100. package/skills/skill-creator-ultra/resources/eval_guide.md +0 -133
  101. package/skills/skill-creator-ultra/resources/industry_questions.md +0 -189
  102. package/skills/skill-creator-ultra/resources/interview_questions.md +0 -200
  103. package/skills/skill-creator-ultra/resources/pattern_detection.md +0 -200
  104. package/skills/skill-creator-ultra/resources/prompt_engineering.md +0 -531
  105. package/skills/skill-creator-ultra/resources/schemas.md +0 -430
  106. package/skills/skill-creator-ultra/resources/script_integration.md +0 -593
  107. package/skills/skill-creator-ultra/resources/scripts_guide.md +0 -339
  108. package/skills/skill-creator-ultra/resources/skill_template.md +0 -124
  109. package/skills/skill-creator-ultra/resources/skill_writing_guide.md +0 -634
  110. package/skills/skill-creator-ultra/resources/versioning_guide.md +0 -193
  111. package/skills/skill-creator-ultra/scripts/ci_eval.py +0 -200
  112. package/skills/skill-creator-ultra/scripts/package_skill.py +0 -165
  113. package/skills/skill-creator-ultra/scripts/simulate_skill.py +0 -398
  114. package/skills/skill-creator-ultra/scripts/skill_audit.py +0 -611
  115. package/skills/skill-creator-ultra/scripts/skill_compare.py +0 -265
  116. package/skills/skill-creator-ultra/scripts/skill_export.py +0 -334
  117. package/skills/skill-creator-ultra/scripts/skill_scaffold.py +0 -403
  118. package/skills/skill-creator-ultra/scripts/skill_stats.py +0 -339
  119. package/skills/skill-creator-ultra/scripts/validate_skill.py +0 -411
  120. package/skills/tailwind-mastery/SKILL.md +0 -229
  121. package/skills/vercel-react-best-practices/AGENTS.md +0 -3373
  122. package/skills/vercel-react-best-practices/README.md +0 -123
  123. package/skills/vercel-react-best-practices/SKILL.md +0 -143
  124. package/skills/vercel-react-best-practices/rules/_sections.md +0 -46
  125. package/skills/vercel-react-best-practices/rules/_template.md +0 -28
  126. package/skills/vercel-react-best-practices/rules/advanced-event-handler-refs.md +0 -55
  127. package/skills/vercel-react-best-practices/rules/advanced-init-once.md +0 -42
  128. package/skills/vercel-react-best-practices/rules/advanced-use-latest.md +0 -39
  129. package/skills/vercel-react-best-practices/rules/async-api-routes.md +0 -38
  130. package/skills/vercel-react-best-practices/rules/async-defer-await.md +0 -80
  131. package/skills/vercel-react-best-practices/rules/async-dependencies.md +0 -51
  132. package/skills/vercel-react-best-practices/rules/async-parallel.md +0 -28
  133. package/skills/vercel-react-best-practices/rules/async-suspense-boundaries.md +0 -99
  134. package/skills/vercel-react-best-practices/rules/bundle-barrel-imports.md +0 -59
  135. package/skills/vercel-react-best-practices/rules/bundle-conditional.md +0 -31
  136. package/skills/vercel-react-best-practices/rules/bundle-defer-third-party.md +0 -49
  137. package/skills/vercel-react-best-practices/rules/bundle-dynamic-imports.md +0 -35
  138. package/skills/vercel-react-best-practices/rules/bundle-preload.md +0 -50
  139. package/skills/vercel-react-best-practices/rules/client-event-listeners.md +0 -74
  140. package/skills/vercel-react-best-practices/rules/client-localstorage-schema.md +0 -71
  141. package/skills/vercel-react-best-practices/rules/client-passive-event-listeners.md +0 -48
  142. package/skills/vercel-react-best-practices/rules/client-swr-dedup.md +0 -56
  143. package/skills/vercel-react-best-practices/rules/js-batch-dom-css.md +0 -107
  144. package/skills/vercel-react-best-practices/rules/js-cache-function-results.md +0 -80
  145. package/skills/vercel-react-best-practices/rules/js-cache-property-access.md +0 -28
  146. package/skills/vercel-react-best-practices/rules/js-cache-storage.md +0 -70
  147. package/skills/vercel-react-best-practices/rules/js-combine-iterations.md +0 -32
  148. package/skills/vercel-react-best-practices/rules/js-early-exit.md +0 -50
  149. package/skills/vercel-react-best-practices/rules/js-flatmap-filter.md +0 -60
  150. package/skills/vercel-react-best-practices/rules/js-hoist-regexp.md +0 -45
  151. package/skills/vercel-react-best-practices/rules/js-index-maps.md +0 -37
  152. package/skills/vercel-react-best-practices/rules/js-length-check-first.md +0 -49
  153. package/skills/vercel-react-best-practices/rules/js-min-max-loop.md +0 -82
  154. package/skills/vercel-react-best-practices/rules/js-set-map-lookups.md +0 -24
  155. package/skills/vercel-react-best-practices/rules/js-tosorted-immutable.md +0 -57
  156. package/skills/vercel-react-best-practices/rules/rendering-activity.md +0 -26
  157. package/skills/vercel-react-best-practices/rules/rendering-animate-svg-wrapper.md +0 -47
  158. package/skills/vercel-react-best-practices/rules/rendering-conditional-render.md +0 -40
  159. package/skills/vercel-react-best-practices/rules/rendering-content-visibility.md +0 -38
  160. package/skills/vercel-react-best-practices/rules/rendering-hoist-jsx.md +0 -46
  161. package/skills/vercel-react-best-practices/rules/rendering-hydration-no-flicker.md +0 -82
  162. package/skills/vercel-react-best-practices/rules/rendering-hydration-suppress-warning.md +0 -30
  163. package/skills/vercel-react-best-practices/rules/rendering-resource-hints.md +0 -85
  164. package/skills/vercel-react-best-practices/rules/rendering-script-defer-async.md +0 -68
  165. package/skills/vercel-react-best-practices/rules/rendering-svg-precision.md +0 -28
  166. package/skills/vercel-react-best-practices/rules/rendering-usetransition-loading.md +0 -75
  167. package/skills/vercel-react-best-practices/rules/rerender-defer-reads.md +0 -39
  168. package/skills/vercel-react-best-practices/rules/rerender-dependencies.md +0 -45
  169. package/skills/vercel-react-best-practices/rules/rerender-derived-state-no-effect.md +0 -40
  170. package/skills/vercel-react-best-practices/rules/rerender-derived-state.md +0 -29
  171. package/skills/vercel-react-best-practices/rules/rerender-functional-setstate.md +0 -74
  172. package/skills/vercel-react-best-practices/rules/rerender-lazy-state-init.md +0 -58
  173. package/skills/vercel-react-best-practices/rules/rerender-memo-with-default-value.md +0 -38
  174. package/skills/vercel-react-best-practices/rules/rerender-memo.md +0 -44
  175. package/skills/vercel-react-best-practices/rules/rerender-move-effect-to-event.md +0 -45
  176. package/skills/vercel-react-best-practices/rules/rerender-no-inline-components.md +0 -82
  177. package/skills/vercel-react-best-practices/rules/rerender-simple-expression-in-memo.md +0 -35
  178. package/skills/vercel-react-best-practices/rules/rerender-split-combined-hooks.md +0 -64
  179. package/skills/vercel-react-best-practices/rules/rerender-transitions.md +0 -40
  180. package/skills/vercel-react-best-practices/rules/rerender-use-deferred-value.md +0 -59
  181. package/skills/vercel-react-best-practices/rules/rerender-use-ref-transient-values.md +0 -73
  182. package/skills/vercel-react-best-practices/rules/server-after-nonblocking.md +0 -73
  183. package/skills/vercel-react-best-practices/rules/server-auth-actions.md +0 -96
  184. package/skills/vercel-react-best-practices/rules/server-cache-lru.md +0 -41
  185. package/skills/vercel-react-best-practices/rules/server-cache-react.md +0 -76
  186. package/skills/vercel-react-best-practices/rules/server-dedup-props.md +0 -65
  187. package/skills/vercel-react-best-practices/rules/server-hoist-static-io.md +0 -142
  188. package/skills/vercel-react-best-practices/rules/server-parallel-fetching.md +0 -83
  189. package/skills/vercel-react-best-practices/rules/server-serialization.md +0 -38
  190. package/skills/web-design-guidelines/SKILL.md +0 -39
@@ -1,285 +0,0 @@
- # Capacity Planning
-
- Capacity planning is the discipline of understanding how much load your system can handle, what breaks first, and how to scale before users experience degradation. It is not a one-time exercise -- it is a continuous practice that evolves as your system, traffic patterns, and infrastructure change.
-
- ## Performance Testing Taxonomy
-
- Not all performance tests are equal. Each type answers a different question.
-
- ### Test Types
-
- | Test Type | Question It Answers | Duration | Load Profile |
- |-----------|-------------------|----------|-------------|
- | **Load test** | Can the system handle expected peak traffic? | 30-60 minutes | Ramp to expected peak, hold steady |
- | **Stress test** | Where does the system break? | Until failure | Ramp beyond expected peak until degradation or failure |
- | **Soak test** | Does the system degrade over time? | 24-72 hours | Sustained load at 70-80% of capacity |
- | **Spike test** | How does the system handle sudden bursts? | 15-30 minutes | Sudden jump from baseline to peak, then back |
- | **Scalability test** | Does adding resources improve throughput linearly? | Variable | Measure throughput at different resource levels |
-
- ### Load Test Design
-
- A good load test simulates real user behavior, not synthetic happy paths.
-
- **Essential elements:**
- - **Realistic user journeys:** Mix of browse, search, add to cart, checkout -- not just one endpoint
- - **Think time:** Users do not fire requests as fast as possible; include realistic pauses between actions
- - **Data variation:** Different users, different products, different search terms -- not the same request repeated
- - **Ramp-up period:** Gradually increase load to avoid a cold-start stampede
- - **Steady state period:** Hold at target load long enough to observe stabilization (minimum 15 minutes)
- - **Ramp-down period:** Gradually decrease load to observe resource release
-
- ### Stress Test Design
-
- The goal is to find the breaking point -- not to prove the system works under normal load.
-
- **Key principles:**
- - Increase load incrementally (e.g., 10% every 5 minutes) until you observe degradation
- - Monitor all resources: CPU, memory, disk I/O, network, thread pools, connection pools, queue depths
- - Record the exact load level when each metric crosses its threshold
- - Document the failure mode: does the system degrade gracefully (latency increases, then errors) or fail catastrophically (crash, hang, data corruption)?
- - Run the stress test multiple times to verify consistency
-
- ### Soak Test Design
-
- Soak tests reveal problems that only manifest over time.
-
- **What soak tests catch:**
-
- | Problem | Mechanism | Detection |
- |---------|-----------|-----------|
- | **Memory leaks** | Gradual memory growth from unreleased objects | Memory usage trends upward over hours |
- | **Connection leaks** | Connections borrowed from pool but never returned | Pool exhaustion after hours of operation |
- | **File handle leaks** | Files opened but never closed | "Too many open files" errors after prolonged operation |
- | **Log file growth** | Disk fills over extended operation | Disk utilization climbs throughout test |
- | **Cache bloat** | Cache grows without eviction under sustained load | Memory or disk consumption increases monotonically |
- | **Database bloat** | Temp tables, uncommitted transactions accumulate | Database performance degrades over test duration |
-
- **Soak test requirements:**
- - Run at 70-80% of measured capacity (not full stress -- you are testing endurance, not peak)
- - Duration: minimum 24 hours, ideally 72 hours
- - Monitor resource trends, not just snapshots -- a flat graph is healthy, a rising trend is a leak
- - Compare start-of-test and end-of-test resource consumption
-
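The trend check above ("a flat graph is healthy, a rising trend is a leak") can be automated against sampled metrics. A minimal sketch, assuming evenly spaced memory samples in MB; the sample values and the 5% growth threshold are illustrative:

```python
# Minimal sketch: automate the trend check with a least-squares slope.
# Sample values (MB) and the 5% growth threshold are illustrative.

def trend_slope(samples):
    """Least-squares slope over evenly spaced samples (units per sample)."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

def looks_like_leak(samples, rel_threshold=0.05):
    """Flag when projected growth over the run exceeds 5% of the mean."""
    projected_growth = trend_slope(samples) * (len(samples) - 1)
    return projected_growth > rel_threshold * (sum(samples) / len(samples))

flat = [2048, 2051, 2047, 2052, 2049, 2050]    # healthy: oscillates
rising = [2048, 2100, 2160, 2215, 2270, 2330]  # leak: climbs steadily
print(looks_like_leak(flat), looks_like_leak(rising))
```

Fitting a slope rather than eyeballing snapshots is exactly the "trends, not snapshots" requirement: the flat series passes even though it wobbles, while the rising series fails despite every individual sample looking unremarkable.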
- ---
-
- ## Resource Pool Management
-
- Resource pools -- thread pools, connection pools, object pools -- are finite and shared. Mismanaging them is one of the most common causes of production failures.
-
- ### Connection Pool Sizing
-
- The most common question: "How many connections do I need?"
-
- **Formula:**
- ```
- pool_size = peak_concurrent_requests × avg_hold_time / avg_request_time
- ```
-
- **But in practice:**
- - Measure actual concurrent active connections under peak load
- - Set pool size to measured p99 concurrency + 20-30% headroom
- - Set a maximum that protects the downstream resource (databases have their own connection limits)
- - Too many connections: each consumes memory on both client and server; past a point, adding connections degrades database performance rather than improving it
- - Too few connections: requests queue waiting for a connection; latency increases; pool exhaustion looks like a database outage
-
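The practical guidance above can be sketched as a calculation. A minimal sketch under illustrative assumptions (25% headroom, a hypothetical database limit of 100 connections with 10 reserved, 4 application instances -- none of these are recommendations):

```python
import math

# Minimal sketch of the sizing guidance above. The headroom, database
# limit, reserved count, and instance count are illustrative assumptions.

def pool_size(p99_concurrency, headroom=0.25, db_max=100, reserved=10, instances=4):
    """Measured p99 concurrency plus headroom, capped so that all
    application instances together stay under the database's limit."""
    wanted = math.ceil(p99_concurrency * (1 + headroom))
    per_instance_cap = (db_max - reserved) // instances
    return min(wanted, per_instance_cap)

print(pool_size(p99_concurrency=16))  # headroom wins: ceil(16 * 1.25) = 20
print(pool_size(p99_concurrency=30))  # cap wins: (100 - 10) // 4 = 22
```

The cap term encodes the anti-pattern fix from the table below this section: total connections across instances, plus a reserve for admin and monitoring, must fit inside the database's own limit.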
- ### Connection Pool Configuration
-
- | Parameter | Purpose | Typical Value |
- |-----------|---------|---------------|
- | **Minimum pool size** | Connections maintained even when idle | 5-10 |
- | **Maximum pool size** | Hard upper limit on connections | Based on measurement |
- | **Checkout timeout** | How long to wait for a connection from the pool | 500ms - 2s |
- | **Idle timeout** | How long an unused connection stays in the pool | 5-10 minutes |
- | **Max lifetime** | Maximum age of a connection before forced recycling | 30-60 minutes |
- | **Validation query** | Query to verify connection health before use | `SELECT 1` |
- | **Validation interval** | How often idle connections are validated | 30-60 seconds |
-
- ### Connection Pool Anti-Patterns
-
- | Anti-Pattern | Problem | Fix |
- |-------------|---------|-----|
- | **No checkout timeout** | Thread waits forever for a connection | Set checkout timeout to 1-2 seconds |
- | **No max lifetime** | Stale connections cause intermittent errors | Recycle connections every 30-60 minutes |
- | **Pool size = DB max connections** | Leaves no connections for admin, monitoring, or other services | Pool size = (DB max - reserved) / number of application instances |
- | **Ignoring connection leaks** | Pool slowly drains until exhaustion | Monitor borrowed-vs-returned; log leaked connections |
- | **Default pool size** | Either wastes resources or causes starvation | Size based on measured concurrency |
-
- ---
-
- ## Thread Pool Management
-
- Thread pools control the concurrency of your application. Getting them right is critical for both throughput and stability.
-
- ### Thread Pool Sizing
-
- **CPU-bound workloads:**
- ```
- threads = number_of_cores
- ```
-
- **I/O-bound workloads (most web applications):**
- ```
- threads = number_of_cores × (1 + wait_time / service_time)
- ```
-
- Example: 8 cores, requests spend 80% of time waiting on I/O:
- ```
- threads = 8 × (1 + 80/20) = 8 × 5 = 40 threads
- ```
-
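The I/O-bound formula can be checked directly. A minimal sketch reproducing the worked example above (the 8-core, 80%-wait numbers come from the example itself):

```python
import math

# Minimal sketch of the I/O-bound sizing formula above.

def io_bound_threads(cores, wait_time, service_time):
    """threads = cores * (1 + wait_time / service_time)"""
    return math.ceil(cores * (1 + wait_time / service_time))

# 8 cores, 80% of request time waiting on I/O, 20% on CPU:
print(io_bound_threads(cores=8, wait_time=80, service_time=20))  # 8 * 5 = 40
```

Note that wait_time and service_time only matter as a ratio, so percentages of request time work as well as absolute milliseconds.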
- ### Thread Pool Configuration
-
- | Parameter | Purpose | Consideration |
- |-----------|---------|---------------|
- | **Core pool size** | Threads always kept alive | Handles normal load without thread creation overhead |
- | **Maximum pool size** | Hard upper limit | Handles burst load; too high causes context-switching overhead |
- | **Queue capacity** | Work queue between core and max | Bounded queue with rejection policy; never unbounded |
- | **Keep-alive time** | How long excess threads survive when idle | 30-60 seconds; balances responsiveness and resource usage |
- | **Rejection policy** | What happens when pool and queue are both full | Reject immediately (fail fast) or caller-runs (back-pressure) |
-
- ### Thread Pool Anti-Patterns
-
- | Anti-Pattern | Problem | Fix |
- |-------------|---------|-----|
- | **Unbounded queue** | Memory grows until OOM; latency climbs invisibly | Use bounded queue; fail fast when full |
- | **Single shared pool** | One slow operation starves all others | Separate pools per workload type |
- | **Too many threads** | Context-switching overhead exceeds throughput gain | Measure throughput at different pool sizes; find the plateau |
- | **No monitoring** | Thread starvation goes undetected until outage | Monitor active/idle/queued counts; alert on pool saturation |
-
- ---
-
- ## The Universal Scalability Law
-
- The Universal Scalability Law (USL), developed by Neil Gunther, models how system throughput changes as you add resources (servers, threads, cores).
-
- ### The Model
-
- ```
- C(N) = N / (1 + σ(N-1) + κN(N-1))
- ```
-
- Where:
- - **N** = number of processors/servers/threads
- - **σ** (sigma) = contention parameter: fraction of work that must be serialized
- - **κ** (kappa) = coherence parameter: cost of keeping shared state consistent
- - **C(N)** = relative capacity at N resources
-
- ### Key Insights
-
- | Parameter | Effect | Example |
- |-----------|--------|---------|
- | **σ = 0, κ = 0** | Linear scalability (ideal but impossible) | Adding 10 servers = 10x throughput |
- | **σ > 0, κ = 0** | Amdahl's Law: diminishing returns | Shared lock limits parallelism |
- | **σ > 0, κ > 0** | Retrograde scalability: adding resources decreases throughput | Distributed cache coherence overhead exceeds throughput gain |
-
- ### Practical Application
-
- 1. **Measure throughput** at 1, 2, 4, 8, 16 resources
- 2. **Fit the USL curve** to find σ and κ
- 3. **Predict the scalability ceiling:** the point where adding more resources stops helping (or hurts)
- 4. **Identify the bottleneck:** high σ means contention (locks, serialization); high κ means coherence costs (cache invalidation, distributed consensus)
-
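Step 3 above (predicting the ceiling) falls straight out of the formula. A minimal sketch with illustrative σ and κ values, chosen only to make the retrograde regime visible:

```python
# Minimal sketch of the USL model above; sigma and kappa values are
# illustrative, chosen to show capacity peaking and then declining.

def usl_capacity(n, sigma, kappa):
    """C(N) = N / (1 + sigma*(N - 1) + kappa*N*(N - 1))"""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# With any coherence cost (kappa > 0), capacity peaks, then goes retrograde:
for n in (1, 8, 32, 128):
    print(n, round(usl_capacity(n, sigma=0.05, kappa=0.001), 1))
```

With these parameters relative capacity rises from 1 to roughly 9x around N = 32, then falls back toward 5x at N = 128: adding the last hundred servers makes the system slower, which is exactly the coherence-dominated regime in the Key Insights table.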
- ---
-
- ## Capacity Modeling
-
- A capacity model documents the relationship between load, resources, and performance for each service in your system.
-
- ### Capacity Model Template
-
- For each service, document:
-
- | Dimension | Current Value | Limit | Action at Limit |
- |-----------|--------------|-------|-----------------|
- | **Requests/sec** | 500 RPS | 2,000 RPS | Scale horizontally |
- | **CPU** | 40% avg, 70% peak | 80% sustained | Add instances |
- | **Memory** | 2.5 GB / 4 GB | 3.5 GB | Increase instance size or optimize |
- | **DB connections** | 30 active / 50 max | 45 active | Increase pool or add read replicas |
- | **Disk I/O** | 200 IOPS | 3,000 IOPS (provisioned) | Upgrade storage tier |
- | **Network** | 500 Mbps | 10 Gbps | Unlikely bottleneck |
-
- ### Bottleneck Resource
-
- Every service has a bottleneck resource -- the resource that runs out first as load increases. The capacity of the service equals the capacity of its bottleneck.
-
- **Finding the bottleneck:**
- 1. Run a stress test, increasing load gradually
- 2. Monitor all resources simultaneously
- 3. The first resource to hit its limit is the bottleneck
- 4. All capacity planning focuses on this resource
-
- **Common bottlenecks by service type:**
-
- | Service Type | Typical Bottleneck |
- |-------------|-------------------|
- | API services | Thread pool or connection pool |
- | Data-heavy services | Database connections or query throughput |
- | Compute-heavy services | CPU |
- | File processing | Disk I/O or memory |
- | Real-time services | Network bandwidth or connection count |
-
- ---
-
- ## Capacity Myths
-
- ### Myth 1: "The Cloud Is Infinitely Scalable"
-
- Reality:
- - Auto-scaling has lag time (1-5 minutes to provision and start new instances)
- - Cold starts add latency to the first requests on new instances
- - Cloud providers have account-level limits (instance count, API rate limits)
- - Some resources do not scale horizontally (relational databases, stateful services)
- - Scaling costs money -- infinite scale means infinite cost
-
- ### Myth 2: "We'll Just Add More Servers"
-
- Reality:
- - Adding servers only helps if the bottleneck is CPU or memory on the application tier
- - If the bottleneck is the database, adding application servers makes it worse (more connections, more load on the same database)
- - Network hops, serialization overhead, and coordination costs increase with more servers
- - Horizontal scaling requires stateless design -- session affinity, local caches, and local file storage break horizontal scaling
-
- ### Myth 3: "Our Load Tests Pass, So We're Fine"
-
- Reality:
- - Load tests with synthetic data miss hot spots in production data
- - Load tests rarely simulate realistic user behavior (think times, session patterns, edge cases)
- - Load test environments rarely match production topology, network latency, or data volume
- - Load tests find throughput limits but not endurance problems (need soak tests)
- - Passing a load test at 2x expected peak does not protect against 10x flash crowds
-
- ### Myth 4: "We Don't Need Capacity Planning -- We Have Auto-Scaling"
-
- Reality:
- - Auto-scaling reacts to load after it arrives; capacity planning anticipates load before it arrives
- - Auto-scaling cannot protect against instant traffic spikes (Black Friday, viral events)
- - Auto-scaling policies themselves need testing -- misconfigured policies can scale in the wrong direction or oscillate
- - Cost management requires understanding baseline and peak capacity needs
-
- ---
-
- ## Performance Anti-Patterns
-
- ### Resource Contention
-
- Multiple threads competing for the same resource (lock, connection, CPU core). Throughput plateaus or decreases as concurrency increases.
-
- **Detection:** Throughput does not increase when adding threads/instances. CPU utilization is low despite high load. Thread dumps show threads waiting on locks.
-
- **Fix:** Reduce lock scope. Use lock-free data structures. Partition data to reduce contention. Use read-write locks instead of exclusive locks.
-
- ### The Coordinated Omission Problem
-
- Load testing tools that wait for a response before sending the next request undercount latency at high load. When the server slows down, the tool also slows down, making the measured throughput look stable while actually masking massive latency increases.
-
- **Detection:** Load test shows consistent throughput even as the system degrades. Real users report much worse latency than load tests measure.
-
- **Fix:** Use load testing tools that support coordinated omission correction (e.g., wrk2, Gatling with constant throughput mode). Measure latency independently of throughput. Use open-loop load generators that send requests at a fixed rate regardless of response time.
-
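The open-loop fix can be made concrete with a toy queue model. A minimal sketch, assuming a single server handling one request at a time and simulated service times in milliseconds (not a real load generator):

```python
# Minimal sketch of open-loop measurement: requests are scheduled at a
# fixed rate and latency is measured from the *intended* send time, so a
# slow server cannot throttle the load generator and hide queueing delay.
# Service times (ms) are simulated, not measured.

def open_loop_latencies(service_times, interval):
    """Single-server queue; intended send times are 0, interval, 2*interval, ..."""
    latencies, free_at = [], 0
    for i, svc in enumerate(service_times):
        intended = i * interval
        start = max(intended, free_at)        # wait if the server is busy
        free_at = start + svc
        latencies.append(free_at - intended)  # includes time spent queued
    return latencies

# Service time is 3x the arrival interval: a closed-loop tool would report
# a flat ~30 ms per request, while the open-loop view exposes the queue
# growing by 20 ms per request.
print(open_loop_latencies([30, 30, 30, 30], interval=10))
```

Measuring from the intended send time is the core of the correction: the coordinated-omission bug is precisely that closed-loop tools measure from the actual (delayed) send time.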
- ### N+1 Query Problem
-
- Fetching a list of N items, then making one additional query for each item. Total queries = N + 1 instead of 1 or 2.
-
- **Detection:** Database query count scales linearly with result set size. Response time increases linearly with page size.
-
- **Fix:** Use eager loading / JOIN queries. Batch queries (`WHERE id IN (...)`). Implement DataLoader pattern for GraphQL.
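The batched fix can be sketched with the stdlib `sqlite3` module; the table and column names here are illustrative, not from any real schema:

```python
import sqlite3

# Minimal sketch of the batched `WHERE id IN (...)` fix.
# Table/column names are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO posts VALUES (1, 1, 'a'), (2, 2, 'b'), (3, 1, 'c');
""")

posts = db.execute("SELECT id, author_id, title FROM posts").fetchall()

# N+1 would issue one author lookup per post. Instead, batch them:
author_ids = {author_id for _, author_id, _ in posts}
placeholders = ",".join("?" * len(author_ids))
authors = dict(db.execute(
    f"SELECT id, name FROM authors WHERE id IN ({placeholders})",
    tuple(author_ids),
).fetchall())  # one query total, id -> name

print([(title, authors[aid]) for _, aid, title in posts])
```

Two queries replace N+1, and the query count no longer grows with page size, which removes the linear-scaling signature described under Detection.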
@@ -1,325 +0,0 @@
1
- # Chaos Engineering
2
-
3
- Chaos engineering is the discipline of experimenting on a system in order to build confidence in its ability to withstand turbulent conditions in production. It is not about breaking things for fun -- it is a rigorous, scientific approach to discovering weaknesses before they cause outages.
4
-
5
- > **Safety note:** This reference describes chaos engineering *concepts and planning patterns*. All failure injection experiments must be performed by authorized engineers using dedicated chaos tooling (e.g., Gremlin, Litmus, AWS Fault Injection Simulator) with proper approvals, blast radius controls, monitoring, and rollback plans. Commands shown are for reference only -- never run them without authorization and safeguards.
6
-
7
- The fundamental insight is simple: you cannot know how your system handles failure until it actually fails. Waiting for production incidents to discover weaknesses is reactive and expensive. Chaos engineering is proactive and controlled.
8
-
9
## Principles of Chaos Engineering

### 1. Define Steady State

Before you can detect abnormal behavior, you must define what normal looks like. Steady state is expressed as measurable business or system metrics that indicate the system is functioning correctly.

**Good steady state definitions:**

| Metric Type | Steady State Definition | Example |
|-------------|-------------------------|---------|
| **Business metric** | Orders per minute within expected range | 100-150 orders/min during business hours |
| **Error rate** | Below defined threshold | < 0.1% 5xx errors |
| **Latency** | Within SLO bounds | p99 latency < 500ms |
| **Throughput** | Within expected range | 1000-2000 RPS |
| **Availability** | All critical paths responding | Health checks green on all services |

**Bad steady state definitions:**
- "The system is working" (not measurable)
- "No alerts firing" (absence of evidence is not evidence of absence)
- "CPU below 80%" (cause-based, not symptom-based)

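A good steady-state definition is mechanically checkable. A minimal sketch, with illustrative (not prescriptive) metric names and bounds:

```python
# Steady state expressed as symptom-based metrics with acceptable bounds.
# The metric names and ranges below are illustrative assumptions.
STEADY_STATE = {
    "orders_per_min": (100.0, 150.0),   # business metric
    "error_rate_pct": (0.0, 0.1),       # < 0.1% 5xx errors
    "p99_latency_ms": (0.0, 500.0),     # within SLO
}

def check_steady_state(metrics: dict, bounds: dict = STEADY_STATE) -> list[str]:
    """Return a list of violations; an empty list means steady state holds."""
    violations = []
    for name, (lo, hi) in bounds.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: no data")  # missing data is NOT healthy
        elif not (lo <= value <= hi):
            violations.append(f"{name}: {value} outside [{lo}, {hi}]")
    return violations
```

Treating missing data as a violation avoids the "no alerts firing" trap: absence of signal is not evidence of health.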
### 2. Formulate a Hypothesis

Every chaos experiment starts with a hypothesis: a prediction about what will happen when you inject a specific failure.

**Hypothesis format:**
```
"We believe that when [failure condition], the system will [expected behavior],
as measured by [steady state metric] remaining within [acceptable bounds]."
```

**Example hypotheses:**

| Failure | Hypothesis | Metric |
|---------|-----------|--------|
| Terminate one API instance (via chaos tooling) | System continues serving traffic with no user-visible errors | Error rate stays < 0.1% |
| Add 500ms latency to database (via chaos tooling) | Response time degrades but stays within SLO; circuit breaker does not trip | p99 < 2s; no circuit breaker events |
| Payment service returns 503 (via fault injection proxy) | Checkout shows graceful error; other features unaffected | Non-checkout error rate unchanged |
| Disk at 95% capacity (via chaos tooling) | Log rotation triggers; alerts fire; no service disruption | Disk drops below 90% within 10 minutes |

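The format is regular enough to carry around as data rather than prose, which keeps experiment records uniform; a small sketch:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One row of the hypothesis table, rendered in the standard format."""
    failure: str    # e.g. "one API instance is terminated"
    behavior: str   # e.g. "continue serving traffic"
    metric: str     # e.g. "error rate"
    bounds: str     # e.g. "< 0.1%"

    def statement(self) -> str:
        return (f"We believe that when {self.failure}, "
                f"the system will {self.behavior}, as measured by "
                f"{self.metric} remaining within {self.bounds}.")
```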
### 3. Introduce Real-World Failures

Chaos experiments should simulate failures that actually happen in production, not theoretical edge cases.

**Common failure types to simulate (via dedicated chaos tooling):**

| Category | Failures | Tooling Examples |
|----------|----------|-----------------|
| **Infrastructure** | Instance crash, disk failure, network partition | Gremlin, Litmus, AWS FIS |
| **Network** | Latency, packet loss, DNS failure | Toxiproxy, Istio fault injection, tc (traffic control) |
| **Application** | Memory pressure, CPU saturation, thread contention | stress-ng (controlled), Gremlin resource attacks |
| **Dependency** | Service unavailable, slow response, corrupt response | Toxiproxy, Envoy fault injection, mock services |
| **Cloud** | AZ failure, region degradation, API throttling | AWS FIS, GCP Fault Injection, Azure Chaos Studio |

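Toxiproxy, for example, is driven over a small HTTP API (port 8474 by default): `POST /proxies` creates a proxy and `POST /proxies/{name}/toxics` attaches a fault. The request payloads can be sketched as plain data -- check the exact fields against your Toxiproxy version, and per the safety note above, never apply toxics without authorization:

```python
# Sketch of Toxiproxy request payloads (reference only -- applying a toxic
# is a fault injection and requires authorization and blast radius controls).

def proxy_payload(name: str, listen: str, upstream: str) -> dict:
    """Body for POST /proxies: route application traffic through Toxiproxy."""
    return {"name": name, "listen": listen, "upstream": upstream, "enabled": True}

def latency_toxic(ms: int, jitter_ms: int = 0, toxicity: float = 1.0) -> dict:
    """Body for POST /proxies/{name}/toxics: delay downstream responses."""
    return {
        "type": "latency",
        "stream": "downstream",
        "toxicity": toxicity,  # fraction of connections affected, 0.0-1.0
        "attributes": {"latency": ms, "jitter": jitter_ms},
    }
```

Removing the toxic (`DELETE` on the toxic resource) restores normal traffic, which makes Toxiproxy a convenient low-blast-radius tool for the dependency-level experiments described later.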
### 4. Run in Production

Staging environments do not reproduce the complexity of production. They lack real user traffic, real data volumes, real concurrency patterns, and real interactions between services. Chaos experiments in staging build false confidence.

**But safely:**
- Start with non-production, then graduate to production
- Use the smallest blast radius possible
- Have an emergency stop mechanism to halt the experiment immediately
- Run during business hours when the team is available to respond
- Inform the on-call team before running experiments
- Never experiment during peak traffic or known risky periods

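The emergency-stop and auto-revert safeguards above can be sketched as a guardrail wrapper: inject, watch an abort signal, and always revert, even on timeout or exception. Real chaos platforms (Gremlin, Litmus, AWS FIS) provide these controls natively; this is only a planning sketch.

```python
import time

def run_with_guardrails(inject, revert, should_abort, max_seconds, poll_seconds=0.01):
    """Inject a failure, watch an abort signal, and ALWAYS revert.

    `inject`/`revert` are experimenter-supplied callables that undo each
    other; `should_abort` is any zero-arg callable (an emergency-stop button,
    an alert check). The failure auto-reverts after `max_seconds` even if
    nobody presses stop.
    """
    deadline = time.monotonic() + max_seconds
    inject()
    try:
        while time.monotonic() < deadline:
            if should_abort():
                return "aborted"
            time.sleep(poll_seconds)
        return "completed"
    finally:
        revert()  # runs on abort, timeout, or any exception
```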
### 5. Automate and Run Continuously

A chaos experiment that runs once proves resilience at one point in time. Automated, recurring experiments prove resilience continuously.

**Automation maturity levels:**

| Level | Practice | Confidence |
|-------|----------|-----------|
| **Manual** | Engineer runs experiment by hand, observes results | Low -- depends on who runs it and when |
| **Scripted** | Experiment codified in a script, run on schedule | Medium -- repeatable but requires human analysis |
| **Automated** | Experiment runs automatically, evaluates steady state, reports results | High -- continuous verification |
| **Integrated** | Experiments run in CI/CD pipeline; failing experiment blocks deployment | Very high -- resilience is a deployment gate |

---

## Chaos Experiment Design

### Experiment Template

```
Experiment: [Name]
Date: [When]
Team: [Who is running it]
Blast Radius: [What is affected]

Hypothesis:
When [failure condition], we expect [behavior],
as measured by [metric] remaining within [bounds].

Steady State:
- Metric 1: [current value, acceptable range]
- Metric 2: [current value, acceptable range]

Method:
1. Verify steady state
2. Inject [specific failure]
3. Observe for [duration]
4. Measure [metrics]
5. Remove failure injection
6. Verify recovery to steady state

Abort Conditions:
- [Metric] exceeds [threshold]
- On-call pages for [service]
- Customer-visible impact detected

Results:
- Hypothesis confirmed/refuted
- Observations: [what happened]
- Action items: [what to fix]
```
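The Method steps of the template can be walked mechanically. A sketch, not chaos tooling: all four arguments are experimenter-supplied callables, and `observe` runs while the failure is active and returns True if the steady-state metrics stayed within bounds.

```python
def run_experiment(verify_steady_state, inject, remove, observe):
    """Walk the six Method steps of the experiment template; return a result dict."""
    if not verify_steady_state():              # 1. Verify steady state
        return {"status": "aborted", "reason": "no steady state before injection"}
    inject()                                   # 2. Inject the specific failure
    try:
        held = observe()                       # 3-4. Observe and measure
    finally:
        remove()                               # 5. Remove failure injection
    recovered = verify_steady_state()          # 6. Verify recovery to steady state
    return {"status": "done",
            "hypothesis_confirmed": bool(held) and recovered,
            "recovered": recovered}
```

Note the `try`/`finally`: removal of the injection must not depend on the observation succeeding, mirroring the abort-condition discipline in the template.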

### Blast Radius Management

Blast radius is the scope of impact if the experiment causes unexpected damage. Always minimize blast radius and expand gradually.

**Blast radius levels:**

| Level | Scope | When to Use |
|-------|-------|-------------|
| **Development** | Single developer's environment | First-time experiments, unproven hypotheses |
| **Staging** | Staging environment | Validating experiment mechanics before production |
| **Canary** | Small subset of production (1-5%) | First production experiment for a new failure type |
| **Single AZ** | One availability zone in production | Testing AZ failure resilience |
| **Full production** | All production traffic | Well-understood experiments that have been run many times |

**Blast radius controls:**
- **Targeting:** Limit experiment to specific instances, user segments, or traffic percentage
- **Duration:** Set maximum experiment duration; auto-revert after timeout
- **Emergency stop:** One-button (or automatic) experiment termination
- **Monitoring:** Real-time dashboards showing experiment impact on steady state metrics
- **Rollback:** Pre-planned steps to undo the experiment if things go wrong

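Targeting by traffic percentage is commonly done by hashing a stable identifier into buckets, so the same small user segment stays inside the blast radius for the whole experiment instead of churning between requests. A sketch (the function name and percentages are illustrative):

```python
import hashlib

def in_blast_radius(user_id: str, experiment: str, percent: float) -> bool:
    """Deterministically place `percent` of users inside the blast radius.

    Hashing user id + experiment name gives each user a stable bucket per
    experiment, so canary membership does not change between requests.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < percent / 100.0
```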
---

## Failure Injection Techniques

### Process-Level Failures

| Technique | What It Simulates | Chaos Tooling |
|-----------|-------------------|--------------|
| Instance termination | Crash | Gremlin process attack, Litmus pod-delete (with pod disruption budgets bounding the damage) |
| Process freeze | Hang/unresponsive | Gremlin process attack (pause), SIGSTOP (controlled, authorized) |
| CPU saturation | Compute pressure | Gremlin CPU attack, Litmus cpu-hog, stress-ng (controlled, authorized) |
| Memory pressure | Memory exhaustion | Gremlin memory attack, Litmus pod-memory-hog |
| Process flood | Process table exhaustion | Gremlin process attack with controlled parameters |

### Network-Level Failures

| Technique | What It Simulates | Chaos Tooling |
|-----------|-------------------|--------------|
| Latency injection | Slow network | Toxiproxy, Gremlin latency attack, Istio fault injection |
| Packet loss | Unreliable network | Gremlin packet loss attack, Toxiproxy, tc netem (authorized) |
| Network partition | Network split | Gremlin blackhole attack, Litmus pod-network-partition |
| DNS failure | DNS outage | Gremlin DNS attack, Litmus pod-dns-error |
| Bandwidth limit | Constrained network | Toxiproxy bandwidth limit, Gremlin bandwidth attack |

### Dependency-Level Failures

| Technique | What It Simulates | Chaos Tooling |
|-----------|-------------------|--------------|
| Error injection proxy | Downstream errors | Toxiproxy, Envoy fault injection |
| Latency injection | Slow dependency | Toxiproxy, Istio fault injection |
| Connection limit | Pool exhaustion | Toxiproxy connection limit, Gremlin blackhole |
| Response corruption | Data integrity issues | Custom fault injection proxy |
| Certificate expiration | TLS failures | Expired test certificate in staging |

### Disk-Level Failures

| Technique | What It Simulates | Chaos Tooling |
|-----------|-------------------|--------------|
| Disk pressure | Disk full | Gremlin disk attack, Litmus disk-fill |
| Slow I/O | Storage degradation | Gremlin IO attack, dm-delay (authorized) |
| Read-only filesystem | Mount failure | Gremlin disk attack (read-only mode) |
| Data corruption | Integrity issues | Controlled corruption of test data in staging |

---

## Chaos Monkey and Netflix's Approach

Netflix pioneered chaos engineering with Chaos Monkey, which randomly terminates production instances during business hours using automated, authorized tooling with built-in safeguards.

### The Netflix Chaos Engineering Stack

| Tool | What It Does | Scope |
|------|-------------|-------|
| **Chaos Monkey** | Terminates random instances (automated, authorized) | Single instance |
| **Chaos Kong** | Simulates entire region failure | Region |
| **Latency Monkey** | Injects network latency | Network |
| **Conformity Monkey** | Finds instances not following best practices | Compliance |
| **Chaos Automation Platform (ChAP)** | Runs automated experiments with steady state comparison | Full stack |

### Key Lessons from Netflix

1. **Start small:** Chaos Monkey terminates one instance at a time. Only after years of practice did Netflix graduate to region-level chaos (Chaos Kong).
2. **Business hours only:** Run experiments when the team is available to respond. Late-night chaos is just an outage.
3. **Opt-out, not opt-in:** By default, all services are enrolled. Teams must explicitly justify opting out.
4. **No blame:** Finding a weakness is a success, not a failure. Teams that discover and fix problems through chaos engineering are celebrated.
5. **Invest in tooling:** Manual chaos is not sustainable. Invest in platforms that automate experiment execution, evaluation, and reporting.

---

## GameDay Exercises

A GameDay is a scheduled exercise where a team practices responding to a realistic failure scenario. It combines chaos engineering (inject failure) with incident response practice (detect, diagnose, mitigate, resolve).

### GameDay Structure

| Phase | Duration | Activities |
|-------|----------|-----------|
| **Preparation** | 1-2 weeks before | Define scenario, brief participants, set up monitoring |
| **Pre-game** | 30 minutes | Verify steady state, confirm all participants are ready |
| **Execution** | 1-3 hours | Inject failure, observe team response, take notes |
| **Post-game** | 1 hour | Debrief, identify what worked and what did not |
| **Follow-up** | 1-2 weeks after | File action items, track remediation, schedule next GameDay |

### GameDay Scenarios

| Scenario | Complexity | What It Tests |
|----------|-----------|---------------|
| Terminate a single instance | Low | Auto-healing, health checks, load balancing |
| Simulate database failover | Medium | Connection handling, read replica routing, data consistency |
| Full AZ failure | High | Multi-AZ architecture, DNS failover, stateful service recovery |
| Dependency outage (payment provider) | Medium | Circuit breakers, fallback behavior, user communication |
| Security incident (compromised credentials) | High | Credential rotation, access logging, incident response process |
| Data corruption | High | Backup restoration, data validation, recovery time |

### GameDay Ground Rules

1. **Safety first:** A facilitator can halt the exercise at any time if real customer impact occurs
2. **No blame:** The goal is learning, not grading individual performance
3. **Document everything:** Notes, timestamps, decisions, and outcomes
4. **Include diverse roles:** Engineers, SREs, product managers, customer support
5. **Realistic but controlled:** Use production systems but with blast radius limits
6. **Schedule during business hours:** The team should be fully staffed and alert
7. **Announce to stakeholders:** Customer support and leadership should know a GameDay is happening

### GameDay Facilitation Tips

- **The facilitator does not fix things.** Their job is to inject failures, observe the response, and take notes.
- **Start with known weaknesses.** The first GameDay should test a scenario the team suspects might fail -- the learning is in confirming the suspicion and practicing the response.
- **Increase difficulty over time.** First GameDay: terminate one instance. Tenth GameDay: simultaneous database failover + network partition + on-call engineer is unavailable.
- **Celebrate findings.** Every weakness discovered in a GameDay is a weakness that will not cause a real outage. This is a win.

---

## Building Confidence Through Controlled Failure

### The Confidence Curve

```
Confidence

│                              ╭───────────
│                         ╭────╯
│                    ╭────╯
│               ╭────╯
│          ╭────╯
│     ╭────╯
│╭────╯
└──────────────────────────────────────────→ Experiments
 0    10   20   30   40   50   60
```

Each successful chaos experiment increases confidence that the system will survive that failure in production. Each failed experiment (where the hypothesis was disproven) reveals a weakness that, once fixed, increases actual resilience.

### Maturity Model

| Level | Practice | Organization |
|-------|----------|-------------|
| **Level 0: Reactive** | No chaos engineering; learn from production incidents | Firefighting culture |
| **Level 1: Exploratory** | Manual, ad-hoc experiments; individual teams | Curious early adopters |
| **Level 2: Systematic** | Regular GameDays; documented experiments; shared learnings | Engineering-wide practice |
| **Level 3: Automated** | Continuous automated experiments; integrated into CI/CD | Platform team provides tooling |
| **Level 4: Cultural** | Chaos engineering is expected for all services; opt-out requires justification | Resilience is a core engineering value |

### Getting Started

If you have never done chaos engineering, start here:

1. **Pick one service** -- the one you are most worried about
2. **Define its steady state** -- what metrics prove it is working?
3. **Form one hypothesis** -- "If we terminate one instance, the service will continue serving traffic"
4. **Run the experiment in staging** -- verify your tooling and monitoring work
5. **Run the experiment in production** -- with minimal blast radius and an emergency stop mechanism
6. **Document the results** -- what happened? Was the hypothesis confirmed?
7. **Fix what you found** -- if the hypothesis was disproven, fix the weakness
8. **Repeat** -- expand to more failure types, more services, more complex scenarios

### Common Objections and Responses

| Objection | Response |
|-----------|---------|
| "We can't break production on purpose!" | You are already breaking production accidentally. Chaos engineering lets you do it on your terms, when you are prepared. |
| "Our system isn't resilient enough for chaos" | That is exactly why you need chaos engineering -- to find and fix the weaknesses. Start small. |
| "We don't have time for this" | You have time for incident response, post-mortems, and customer apologies. Chaos engineering reduces all three. |
| "What if we cause an outage?" | Start with minimal blast radius in staging. Terminate one process. If that causes an outage, you have learned something invaluable. |
| "Management won't approve this" | Frame it as risk reduction, not risk creation. Show the cost of recent outages vs. the cost of preventive experiments. |

314
- ### Anti-Fragility
315
-
316
- The ultimate goal of chaos engineering is not just resilience (surviving failure) but anti-fragility (getting stronger from failure). A system is anti-fragile when each failure makes it more resistant to future failures.
317
-
318
- **How chaos engineering builds anti-fragility:**
319
- - Each experiment reveals a weakness
320
- - Each fix removes that weakness permanently
321
- - Each automated experiment continuously verifies the fix
322
- - Over time, the system has been tested against every common failure mode
323
- - New failure modes are discovered faster because the team has built the muscles and tooling to find them
324
-
325
- This is the Release It! philosophy in action: production-ready software is not software that never fails. It is software that has been designed, tested, and operated to handle failure gracefully -- because failure is not a possibility, it is a certainty.