@wazir-dev/cli 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (629)
  1. package/AGENTS.md +111 -0
  2. package/CHANGELOG.md +14 -0
  3. package/CONTRIBUTING.md +101 -0
  4. package/LICENSE +21 -0
  5. package/README.md +314 -0
  6. package/assets/composition-engine.mmd +34 -0
  7. package/assets/demo-script.sh +17 -0
  8. package/assets/logo-dark.svg +14 -0
  9. package/assets/logo.svg +14 -0
  10. package/assets/pipeline.mmd +39 -0
  11. package/assets/record-demo.sh +51 -0
  12. package/docs/README.md +51 -0
  13. package/docs/adapters/context-mode.md +60 -0
  14. package/docs/concepts/architecture.md +87 -0
  15. package/docs/concepts/artifact-model.md +60 -0
  16. package/docs/concepts/composition-engine.md +36 -0
  17. package/docs/concepts/indexing-and-recall.md +160 -0
  18. package/docs/concepts/observability.md +41 -0
  19. package/docs/concepts/roles-and-workflows.md +59 -0
  20. package/docs/concepts/terminology-policy.md +27 -0
  21. package/docs/getting-started/01-installation.md +78 -0
  22. package/docs/getting-started/02-first-run.md +102 -0
  23. package/docs/getting-started/03-adding-to-project.md +15 -0
  24. package/docs/getting-started/04-host-setup.md +15 -0
  25. package/docs/guides/ci-integration.md +15 -0
  26. package/docs/guides/creating-skills.md +15 -0
  27. package/docs/guides/expertise-module-authoring.md +15 -0
  28. package/docs/guides/hook-development.md +15 -0
  29. package/docs/guides/memory-and-learnings.md +34 -0
  30. package/docs/guides/multi-host-export.md +15 -0
  31. package/docs/guides/troubleshooting.md +101 -0
  32. package/docs/guides/writing-custom-roles.md +15 -0
  33. package/docs/plans/2026-03-15-cli-pipeline-integration-design.md +592 -0
  34. package/docs/plans/2026-03-15-cli-pipeline-integration-plan.md +598 -0
  35. package/docs/plans/2026-03-15-docs-enforcement-plan.md +238 -0
  36. package/docs/readmes/INDEX.md +99 -0
  37. package/docs/readmes/features/expertise/README.md +171 -0
  38. package/docs/readmes/features/exports/README.md +222 -0
  39. package/docs/readmes/features/hooks/README.md +103 -0
  40. package/docs/readmes/features/hooks/loop-cap-guard.md +133 -0
  41. package/docs/readmes/features/hooks/post-tool-capture.md +121 -0
  42. package/docs/readmes/features/hooks/post-tool-lint.md +130 -0
  43. package/docs/readmes/features/hooks/pre-compact-summary.md +122 -0
  44. package/docs/readmes/features/hooks/pre-tool-capture-route.md +100 -0
  45. package/docs/readmes/features/hooks/protected-path-write-guard.md +128 -0
  46. package/docs/readmes/features/hooks/session-start.md +119 -0
  47. package/docs/readmes/features/hooks/stop-handoff-harvest.md +125 -0
  48. package/docs/readmes/features/roles/README.md +157 -0
  49. package/docs/readmes/features/roles/clarifier.md +152 -0
  50. package/docs/readmes/features/roles/content-author.md +190 -0
  51. package/docs/readmes/features/roles/designer.md +193 -0
  52. package/docs/readmes/features/roles/executor.md +184 -0
  53. package/docs/readmes/features/roles/learner.md +210 -0
  54. package/docs/readmes/features/roles/planner.md +182 -0
  55. package/docs/readmes/features/roles/researcher.md +164 -0
  56. package/docs/readmes/features/roles/reviewer.md +184 -0
  57. package/docs/readmes/features/roles/specifier.md +162 -0
  58. package/docs/readmes/features/roles/verifier.md +215 -0
  59. package/docs/readmes/features/schemas/README.md +178 -0
  60. package/docs/readmes/features/skills/README.md +63 -0
  61. package/docs/readmes/features/skills/brainstorming.md +96 -0
  62. package/docs/readmes/features/skills/debugging.md +148 -0
  63. package/docs/readmes/features/skills/design.md +120 -0
  64. package/docs/readmes/features/skills/prepare-next.md +109 -0
  65. package/docs/readmes/features/skills/run-audit.md +159 -0
  66. package/docs/readmes/features/skills/scan-project.md +109 -0
  67. package/docs/readmes/features/skills/self-audit.md +176 -0
  68. package/docs/readmes/features/skills/tdd.md +137 -0
  69. package/docs/readmes/features/skills/using-skills.md +92 -0
  70. package/docs/readmes/features/skills/verification.md +120 -0
  71. package/docs/readmes/features/skills/writing-plans.md +104 -0
  72. package/docs/readmes/features/tooling/README.md +320 -0
  73. package/docs/readmes/features/workflows/README.md +186 -0
  74. package/docs/readmes/features/workflows/author.md +181 -0
  75. package/docs/readmes/features/workflows/clarify.md +154 -0
  76. package/docs/readmes/features/workflows/design-review.md +171 -0
  77. package/docs/readmes/features/workflows/design.md +169 -0
  78. package/docs/readmes/features/workflows/discover.md +162 -0
  79. package/docs/readmes/features/workflows/execute.md +173 -0
  80. package/docs/readmes/features/workflows/learn.md +167 -0
  81. package/docs/readmes/features/workflows/plan-review.md +165 -0
  82. package/docs/readmes/features/workflows/plan.md +170 -0
  83. package/docs/readmes/features/workflows/prepare-next.md +167 -0
  84. package/docs/readmes/features/workflows/review.md +169 -0
  85. package/docs/readmes/features/workflows/run-audit.md +191 -0
  86. package/docs/readmes/features/workflows/spec-challenge.md +159 -0
  87. package/docs/readmes/features/workflows/specify.md +160 -0
  88. package/docs/readmes/features/workflows/verify.md +177 -0
  89. package/docs/readmes/packages/README.md +50 -0
  90. package/docs/readmes/packages/ajv.md +117 -0
  91. package/docs/readmes/packages/context-mode.md +118 -0
  92. package/docs/readmes/packages/gray-matter.md +116 -0
  93. package/docs/readmes/packages/node-test.md +137 -0
  94. package/docs/readmes/packages/yaml.md +112 -0
  95. package/docs/reference/configuration-reference.md +159 -0
  96. package/docs/reference/expertise-index.md +52 -0
  97. package/docs/reference/git-flow.md +43 -0
  98. package/docs/reference/hooks.md +87 -0
  99. package/docs/reference/host-exports.md +50 -0
  100. package/docs/reference/launch-checklist.md +172 -0
  101. package/docs/reference/marketplace-listings.md +76 -0
  102. package/docs/reference/release-process.md +34 -0
  103. package/docs/reference/roles-reference.md +77 -0
  104. package/docs/reference/skills.md +33 -0
  105. package/docs/reference/templates.md +29 -0
  106. package/docs/reference/tooling-cli.md +94 -0
  107. package/docs/truth-claims.yaml +222 -0
  108. package/expertise/PROGRESS.md +63 -0
  109. package/expertise/README.md +18 -0
  110. package/expertise/antipatterns/PROGRESS.md +56 -0
  111. package/expertise/antipatterns/backend/api-design-antipatterns.md +1271 -0
  112. package/expertise/antipatterns/backend/auth-antipatterns.md +1195 -0
  113. package/expertise/antipatterns/backend/caching-antipatterns.md +622 -0
  114. package/expertise/antipatterns/backend/database-antipatterns.md +1038 -0
  115. package/expertise/antipatterns/backend/index.md +24 -0
  116. package/expertise/antipatterns/backend/microservices-antipatterns.md +850 -0
  117. package/expertise/antipatterns/code/architecture-antipatterns.md +919 -0
  118. package/expertise/antipatterns/code/async-antipatterns.md +622 -0
  119. package/expertise/antipatterns/code/code-smells.md +1186 -0
  120. package/expertise/antipatterns/code/dependency-antipatterns.md +1209 -0
  121. package/expertise/antipatterns/code/error-handling-antipatterns.md +1360 -0
  122. package/expertise/antipatterns/code/index.md +27 -0
  123. package/expertise/antipatterns/code/naming-and-abstraction.md +1118 -0
  124. package/expertise/antipatterns/code/state-management-antipatterns.md +1076 -0
  125. package/expertise/antipatterns/code/testing-antipatterns.md +1053 -0
  126. package/expertise/antipatterns/design/accessibility-antipatterns.md +1136 -0
  127. package/expertise/antipatterns/design/dark-patterns.md +1121 -0
  128. package/expertise/antipatterns/design/index.md +22 -0
  129. package/expertise/antipatterns/design/ui-antipatterns.md +1202 -0
  130. package/expertise/antipatterns/design/ux-antipatterns.md +680 -0
  131. package/expertise/antipatterns/frontend/css-layout-antipatterns.md +691 -0
  132. package/expertise/antipatterns/frontend/flutter-antipatterns.md +1827 -0
  133. package/expertise/antipatterns/frontend/index.md +23 -0
  134. package/expertise/antipatterns/frontend/mobile-antipatterns.md +573 -0
  135. package/expertise/antipatterns/frontend/react-antipatterns.md +1128 -0
  136. package/expertise/antipatterns/frontend/spa-antipatterns.md +1235 -0
  137. package/expertise/antipatterns/index.md +31 -0
  138. package/expertise/antipatterns/performance/index.md +20 -0
  139. package/expertise/antipatterns/performance/performance-antipatterns.md +1013 -0
  140. package/expertise/antipatterns/performance/premature-optimization.md +623 -0
  141. package/expertise/antipatterns/performance/scaling-antipatterns.md +785 -0
  142. package/expertise/antipatterns/process/ai-coding-antipatterns.md +853 -0
  143. package/expertise/antipatterns/process/code-review-antipatterns.md +656 -0
  144. package/expertise/antipatterns/process/deployment-antipatterns.md +920 -0
  145. package/expertise/antipatterns/process/index.md +23 -0
  146. package/expertise/antipatterns/process/technical-debt-antipatterns.md +647 -0
  147. package/expertise/antipatterns/security/index.md +20 -0
  148. package/expertise/antipatterns/security/secrets-antipatterns.md +849 -0
  149. package/expertise/antipatterns/security/security-theater.md +843 -0
  150. package/expertise/antipatterns/security/vulnerability-patterns.md +801 -0
  151. package/expertise/architecture/PROGRESS.md +70 -0
  152. package/expertise/architecture/data/caching-architecture.md +671 -0
  153. package/expertise/architecture/data/data-consistency.md +574 -0
  154. package/expertise/architecture/data/data-modeling.md +536 -0
  155. package/expertise/architecture/data/event-streams-and-queues.md +634 -0
  156. package/expertise/architecture/data/index.md +25 -0
  157. package/expertise/architecture/data/search-architecture.md +663 -0
  158. package/expertise/architecture/data/sql-vs-nosql.md +708 -0
  159. package/expertise/architecture/decisions/architecture-decision-records.md +640 -0
  160. package/expertise/architecture/decisions/build-vs-buy.md +616 -0
  161. package/expertise/architecture/decisions/index.md +23 -0
  162. package/expertise/architecture/decisions/monolith-to-microservices.md +790 -0
  163. package/expertise/architecture/decisions/technology-selection.md +616 -0
  164. package/expertise/architecture/distributed/cap-theorem-and-tradeoffs.md +800 -0
  165. package/expertise/architecture/distributed/circuit-breaker-bulkhead.md +741 -0
  166. package/expertise/architecture/distributed/consensus-and-coordination.md +796 -0
  167. package/expertise/architecture/distributed/distributed-systems-fundamentals.md +564 -0
  168. package/expertise/architecture/distributed/idempotency-and-retry.md +796 -0
  169. package/expertise/architecture/distributed/index.md +25 -0
  170. package/expertise/architecture/distributed/saga-pattern.md +797 -0
  171. package/expertise/architecture/foundations/architectural-thinking.md +460 -0
  172. package/expertise/architecture/foundations/coupling-and-cohesion.md +770 -0
  173. package/expertise/architecture/foundations/design-principles-solid.md +649 -0
  174. package/expertise/architecture/foundations/domain-driven-design.md +719 -0
  175. package/expertise/architecture/foundations/index.md +25 -0
  176. package/expertise/architecture/foundations/separation-of-concerns.md +472 -0
  177. package/expertise/architecture/foundations/twelve-factor-app.md +797 -0
  178. package/expertise/architecture/index.md +34 -0
  179. package/expertise/architecture/integration/api-design-graphql.md +638 -0
  180. package/expertise/architecture/integration/api-design-grpc.md +804 -0
  181. package/expertise/architecture/integration/api-design-rest.md +892 -0
  182. package/expertise/architecture/integration/index.md +25 -0
  183. package/expertise/architecture/integration/third-party-integration.md +795 -0
  184. package/expertise/architecture/integration/webhooks-and-callbacks.md +1152 -0
  185. package/expertise/architecture/integration/websockets-realtime.md +791 -0
  186. package/expertise/architecture/mobile-architecture/index.md +22 -0
  187. package/expertise/architecture/mobile-architecture/mobile-app-architecture.md +780 -0
  188. package/expertise/architecture/mobile-architecture/mobile-backend-for-frontend.md +670 -0
  189. package/expertise/architecture/mobile-architecture/offline-first.md +719 -0
  190. package/expertise/architecture/mobile-architecture/push-and-sync.md +782 -0
  191. package/expertise/architecture/patterns/cqrs-event-sourcing.md +717 -0
  192. package/expertise/architecture/patterns/event-driven.md +797 -0
  193. package/expertise/architecture/patterns/hexagonal-clean-architecture.md +870 -0
  194. package/expertise/architecture/patterns/index.md +27 -0
  195. package/expertise/architecture/patterns/layered-architecture.md +736 -0
  196. package/expertise/architecture/patterns/microservices.md +753 -0
  197. package/expertise/architecture/patterns/modular-monolith.md +692 -0
  198. package/expertise/architecture/patterns/monolith.md +626 -0
  199. package/expertise/architecture/patterns/plugin-architecture.md +735 -0
  200. package/expertise/architecture/patterns/serverless.md +780 -0
  201. package/expertise/architecture/scaling/database-scaling.md +615 -0
  202. package/expertise/architecture/scaling/feature-flags-and-rollouts.md +757 -0
  203. package/expertise/architecture/scaling/horizontal-vs-vertical.md +606 -0
  204. package/expertise/architecture/scaling/index.md +24 -0
  205. package/expertise/architecture/scaling/multi-tenancy.md +800 -0
  206. package/expertise/architecture/scaling/stateless-design.md +787 -0
  207. package/expertise/backend/embedded-firmware.md +625 -0
  208. package/expertise/backend/go.md +853 -0
  209. package/expertise/backend/index.md +24 -0
  210. package/expertise/backend/java-spring.md +448 -0
  211. package/expertise/backend/node-typescript.md +625 -0
  212. package/expertise/backend/python-fastapi.md +724 -0
  213. package/expertise/backend/rust.md +458 -0
  214. package/expertise/backend/solidity.md +711 -0
  215. package/expertise/composition-map.yaml +443 -0
  216. package/expertise/content/foundations/content-modeling.md +395 -0
  217. package/expertise/content/foundations/editorial-standards.md +449 -0
  218. package/expertise/content/foundations/index.md +24 -0
  219. package/expertise/content/foundations/microcopy.md +455 -0
  220. package/expertise/content/foundations/terminology-governance.md +509 -0
  221. package/expertise/content/index.md +34 -0
  222. package/expertise/content/patterns/accessibility-copy.md +518 -0
  223. package/expertise/content/patterns/index.md +24 -0
  224. package/expertise/content/patterns/notification-content.md +433 -0
  225. package/expertise/content/patterns/sample-content.md +486 -0
  226. package/expertise/content/patterns/state-copy.md +439 -0
  227. package/expertise/design/PROGRESS.md +58 -0
  228. package/expertise/design/disciplines/dark-mode-theming.md +577 -0
  229. package/expertise/design/disciplines/design-systems.md +595 -0
  230. package/expertise/design/disciplines/index.md +25 -0
  231. package/expertise/design/disciplines/information-architecture.md +800 -0
  232. package/expertise/design/disciplines/interaction-design.md +788 -0
  233. package/expertise/design/disciplines/responsive-design.md +552 -0
  234. package/expertise/design/disciplines/usability-testing.md +516 -0
  235. package/expertise/design/disciplines/user-research.md +792 -0
  236. package/expertise/design/foundations/accessibility-design.md +796 -0
  237. package/expertise/design/foundations/color-theory.md +797 -0
  238. package/expertise/design/foundations/iconography.md +795 -0
  239. package/expertise/design/foundations/index.md +26 -0
  240. package/expertise/design/foundations/motion-and-animation.md +653 -0
  241. package/expertise/design/foundations/rtl-design.md +585 -0
  242. package/expertise/design/foundations/spacing-and-layout.md +607 -0
  243. package/expertise/design/foundations/typography.md +800 -0
  244. package/expertise/design/foundations/visual-hierarchy.md +761 -0
  245. package/expertise/design/index.md +32 -0
  246. package/expertise/design/patterns/authentication-flows.md +474 -0
  247. package/expertise/design/patterns/content-consumption.md +789 -0
  248. package/expertise/design/patterns/data-display.md +618 -0
  249. package/expertise/design/patterns/e-commerce.md +1494 -0
  250. package/expertise/design/patterns/feedback-and-states.md +642 -0
  251. package/expertise/design/patterns/forms-and-input.md +819 -0
  252. package/expertise/design/patterns/gamification.md +801 -0
  253. package/expertise/design/patterns/index.md +31 -0
  254. package/expertise/design/patterns/microinteractions.md +449 -0
  255. package/expertise/design/patterns/navigation.md +800 -0
  256. package/expertise/design/patterns/notifications.md +705 -0
  257. package/expertise/design/patterns/onboarding.md +700 -0
  258. package/expertise/design/patterns/search-and-filter.md +601 -0
  259. package/expertise/design/patterns/settings-and-preferences.md +768 -0
  260. package/expertise/design/patterns/social-and-community.md +748 -0
  261. package/expertise/design/platforms/desktop-native.md +612 -0
  262. package/expertise/design/platforms/index.md +25 -0
  263. package/expertise/design/platforms/mobile-android.md +825 -0
  264. package/expertise/design/platforms/mobile-cross-platform.md +983 -0
  265. package/expertise/design/platforms/mobile-ios.md +699 -0
  266. package/expertise/design/platforms/tablet.md +794 -0
  267. package/expertise/design/platforms/web-dashboard.md +790 -0
  268. package/expertise/design/platforms/web-responsive.md +550 -0
  269. package/expertise/design/psychology/behavioral-nudges.md +449 -0
  270. package/expertise/design/psychology/cognitive-load.md +1191 -0
  271. package/expertise/design/psychology/error-psychology.md +778 -0
  272. package/expertise/design/psychology/index.md +22 -0
  273. package/expertise/design/psychology/persuasive-design.md +736 -0
  274. package/expertise/design/psychology/user-mental-models.md +623 -0
  275. package/expertise/design/tooling/open-pencil.md +266 -0
  276. package/expertise/frontend/angular.md +1073 -0
  277. package/expertise/frontend/desktop-electron.md +546 -0
  278. package/expertise/frontend/flutter.md +782 -0
  279. package/expertise/frontend/index.md +27 -0
  280. package/expertise/frontend/native-android.md +409 -0
  281. package/expertise/frontend/native-ios.md +490 -0
  282. package/expertise/frontend/react-native.md +1160 -0
  283. package/expertise/frontend/react.md +808 -0
  284. package/expertise/frontend/vue.md +1089 -0
  285. package/expertise/humanize/domain-rules-code.md +79 -0
  286. package/expertise/humanize/domain-rules-content.md +67 -0
  287. package/expertise/humanize/domain-rules-technical-docs.md +56 -0
  288. package/expertise/humanize/index.md +35 -0
  289. package/expertise/humanize/self-audit-checklist.md +87 -0
  290. package/expertise/humanize/sentence-patterns.md +218 -0
  291. package/expertise/humanize/vocabulary-blacklist.md +105 -0
  292. package/expertise/i18n/PROGRESS.md +65 -0
  293. package/expertise/i18n/advanced/accessibility-and-i18n.md +28 -0
  294. package/expertise/i18n/advanced/bidirectional-text-algorithm.md +38 -0
  295. package/expertise/i18n/advanced/complex-scripts.md +30 -0
  296. package/expertise/i18n/advanced/performance-and-i18n.md +27 -0
  297. package/expertise/i18n/advanced/testing-i18n.md +28 -0
  298. package/expertise/i18n/content/content-adaptation.md +23 -0
  299. package/expertise/i18n/content/locale-specific-formatting.md +23 -0
  300. package/expertise/i18n/content/machine-translation-integration.md +28 -0
  301. package/expertise/i18n/content/translation-management.md +29 -0
  302. package/expertise/i18n/foundations/date-time-calendars.md +67 -0
  303. package/expertise/i18n/foundations/i18n-architecture.md +272 -0
  304. package/expertise/i18n/foundations/locale-and-language-tags.md +79 -0
  305. package/expertise/i18n/foundations/numbers-currency-units.md +61 -0
  306. package/expertise/i18n/foundations/pluralization-and-gender.md +109 -0
  307. package/expertise/i18n/foundations/string-externalization.md +236 -0
  308. package/expertise/i18n/foundations/text-direction-bidi.md +241 -0
  309. package/expertise/i18n/foundations/unicode-and-encoding.md +86 -0
  310. package/expertise/i18n/index.md +38 -0
  311. package/expertise/i18n/platform/backend-i18n.md +31 -0
  312. package/expertise/i18n/platform/flutter-i18n.md +148 -0
  313. package/expertise/i18n/platform/native-android-i18n.md +36 -0
  314. package/expertise/i18n/platform/native-ios-i18n.md +36 -0
  315. package/expertise/i18n/platform/react-i18n.md +103 -0
  316. package/expertise/i18n/platform/web-css-i18n.md +81 -0
  317. package/expertise/i18n/rtl/arabic-specific.md +175 -0
  318. package/expertise/i18n/rtl/hebrew-specific.md +149 -0
  319. package/expertise/i18n/rtl/rtl-animations-and-transitions.md +111 -0
  320. package/expertise/i18n/rtl/rtl-forms-and-input.md +161 -0
  321. package/expertise/i18n/rtl/rtl-fundamentals.md +211 -0
  322. package/expertise/i18n/rtl/rtl-icons-and-images.md +181 -0
  323. package/expertise/i18n/rtl/rtl-layout-mirroring.md +252 -0
  324. package/expertise/i18n/rtl/rtl-navigation-and-gestures.md +107 -0
  325. package/expertise/i18n/rtl/rtl-testing-and-qa.md +147 -0
  326. package/expertise/i18n/rtl/rtl-typography.md +160 -0
  327. package/expertise/index.md +113 -0
  328. package/expertise/index.yaml +216 -0
  329. package/expertise/infrastructure/cloud-aws.md +597 -0
  330. package/expertise/infrastructure/cloud-gcp.md +599 -0
  331. package/expertise/infrastructure/cybersecurity.md +816 -0
  332. package/expertise/infrastructure/database-mongodb.md +447 -0
  333. package/expertise/infrastructure/database-postgres.md +400 -0
  334. package/expertise/infrastructure/devops-cicd.md +787 -0
  335. package/expertise/infrastructure/index.md +27 -0
  336. package/expertise/performance/PROGRESS.md +50 -0
  337. package/expertise/performance/backend/api-latency.md +1204 -0
  338. package/expertise/performance/backend/background-jobs.md +506 -0
  339. package/expertise/performance/backend/connection-pooling.md +1209 -0
  340. package/expertise/performance/backend/database-query-optimization.md +515 -0
  341. package/expertise/performance/backend/index.md +23 -0
  342. package/expertise/performance/backend/rate-limiting-and-throttling.md +971 -0
  343. package/expertise/performance/foundations/algorithmic-complexity.md +954 -0
  344. package/expertise/performance/foundations/caching-strategies.md +489 -0
  345. package/expertise/performance/foundations/concurrency-and-parallelism.md +847 -0
  346. package/expertise/performance/foundations/index.md +24 -0
  347. package/expertise/performance/foundations/measuring-and-profiling.md +440 -0
  348. package/expertise/performance/foundations/memory-management.md +964 -0
  349. package/expertise/performance/foundations/performance-budgets.md +1314 -0
  350. package/expertise/performance/index.md +31 -0
  351. package/expertise/performance/infrastructure/auto-scaling.md +1059 -0
  352. package/expertise/performance/infrastructure/cdn-and-edge.md +1081 -0
  353. package/expertise/performance/infrastructure/index.md +22 -0
  354. package/expertise/performance/infrastructure/load-balancing.md +1081 -0
  355. package/expertise/performance/infrastructure/observability.md +1079 -0
  356. package/expertise/performance/mobile/index.md +23 -0
  357. package/expertise/performance/mobile/mobile-animations.md +544 -0
  358. package/expertise/performance/mobile/mobile-memory-battery.md +416 -0
  359. package/expertise/performance/mobile/mobile-network.md +452 -0
  360. package/expertise/performance/mobile/mobile-rendering.md +599 -0
  361. package/expertise/performance/mobile/mobile-startup-time.md +505 -0
  362. package/expertise/performance/platform-specific/flutter-performance.md +647 -0
  363. package/expertise/performance/platform-specific/index.md +22 -0
  364. package/expertise/performance/platform-specific/node-performance.md +1307 -0
  365. package/expertise/performance/platform-specific/postgres-performance.md +1366 -0
  366. package/expertise/performance/platform-specific/react-performance.md +1403 -0
  367. package/expertise/performance/web/bundle-optimization.md +1239 -0
  368. package/expertise/performance/web/image-and-media.md +636 -0
  369. package/expertise/performance/web/index.md +24 -0
  370. package/expertise/performance/web/network-optimization.md +1133 -0
  371. package/expertise/performance/web/rendering-performance.md +1098 -0
  372. package/expertise/performance/web/ssr-and-hydration.md +918 -0
  373. package/expertise/performance/web/web-vitals.md +1374 -0
  374. package/expertise/quality/accessibility.md +985 -0
  375. package/expertise/quality/evidence-based-verification.md +499 -0
  376. package/expertise/quality/index.md +24 -0
  377. package/expertise/quality/ml-model-audit.md +614 -0
  378. package/expertise/quality/performance.md +600 -0
  379. package/expertise/quality/testing-api.md +891 -0
  380. package/expertise/quality/testing-mobile.md +496 -0
  381. package/expertise/quality/testing-web.md +849 -0
  382. package/expertise/security/PROGRESS.md +54 -0
  383. package/expertise/security/agentic-identity.md +540 -0
  384. package/expertise/security/compliance-frameworks.md +601 -0
  385. package/expertise/security/data/data-encryption.md +364 -0
  386. package/expertise/security/data/data-privacy-gdpr.md +692 -0
  387. package/expertise/security/data/database-security.md +1171 -0
  388. package/expertise/security/data/index.md +22 -0
  389. package/expertise/security/data/pii-handling.md +531 -0
  390. package/expertise/security/foundations/authentication.md +1041 -0
  391. package/expertise/security/foundations/authorization.md +603 -0
  392. package/expertise/security/foundations/cryptography.md +1001 -0
  393. package/expertise/security/foundations/index.md +25 -0
  394. package/expertise/security/foundations/owasp-top-10.md +1354 -0
  395. package/expertise/security/foundations/secrets-management.md +1217 -0
  396. package/expertise/security/foundations/secure-sdlc.md +700 -0
  397. package/expertise/security/foundations/supply-chain-security.md +698 -0
  398. package/expertise/security/index.md +31 -0
  399. package/expertise/security/infrastructure/cloud-security-aws.md +1296 -0
  400. package/expertise/security/infrastructure/cloud-security-gcp.md +1376 -0
  401. package/expertise/security/infrastructure/container-security.md +721 -0
  402. package/expertise/security/infrastructure/incident-response.md +1295 -0
  403. package/expertise/security/infrastructure/index.md +24 -0
  404. package/expertise/security/infrastructure/logging-and-monitoring.md +1618 -0
  405. package/expertise/security/infrastructure/network-security.md +1337 -0
  406. package/expertise/security/mobile/index.md +23 -0
  407. package/expertise/security/mobile/mobile-android-security.md +1218 -0
  408. package/expertise/security/mobile/mobile-binary-protection.md +1229 -0
  409. package/expertise/security/mobile/mobile-data-storage.md +1265 -0
  410. package/expertise/security/mobile/mobile-ios-security.md +1401 -0
  411. package/expertise/security/mobile/mobile-network-security.md +1520 -0
  412. package/expertise/security/smart-contract-security.md +594 -0
  413. package/expertise/security/testing/index.md +22 -0
  414. package/expertise/security/testing/penetration-testing.md +1258 -0
  415. package/expertise/security/testing/security-code-review.md +1765 -0
  416. package/expertise/security/testing/threat-modeling.md +1074 -0
  417. package/expertise/security/testing/vulnerability-scanning.md +1062 -0
  418. package/expertise/security/web/api-security.md +586 -0
  419. package/expertise/security/web/cors-and-headers.md +433 -0
  420. package/expertise/security/web/csrf.md +562 -0
  421. package/expertise/security/web/file-upload.md +1477 -0
  422. package/expertise/security/web/index.md +25 -0
  423. package/expertise/security/web/injection.md +1375 -0
  424. package/expertise/security/web/session-management.md +1101 -0
  425. package/expertise/security/web/xss.md +1158 -0
  426. package/exports/README.md +17 -0
  427. package/exports/hosts/claude/.claude/agents/clarifier.md +42 -0
  428. package/exports/hosts/claude/.claude/agents/content-author.md +63 -0
  429. package/exports/hosts/claude/.claude/agents/designer.md +55 -0
  430. package/exports/hosts/claude/.claude/agents/executor.md +55 -0
  431. package/exports/hosts/claude/.claude/agents/learner.md +51 -0
  432. package/exports/hosts/claude/.claude/agents/planner.md +53 -0
  433. package/exports/hosts/claude/.claude/agents/researcher.md +43 -0
  434. package/exports/hosts/claude/.claude/agents/reviewer.md +54 -0
  435. package/exports/hosts/claude/.claude/agents/specifier.md +47 -0
  436. package/exports/hosts/claude/.claude/agents/verifier.md +71 -0
  437. package/exports/hosts/claude/.claude/commands/author.md +42 -0
  438. package/exports/hosts/claude/.claude/commands/clarify.md +38 -0
  439. package/exports/hosts/claude/.claude/commands/design-review.md +46 -0
  440. package/exports/hosts/claude/.claude/commands/design.md +44 -0
  441. package/exports/hosts/claude/.claude/commands/discover.md +37 -0
  442. package/exports/hosts/claude/.claude/commands/execute.md +48 -0
  443. package/exports/hosts/claude/.claude/commands/learn.md +38 -0
  444. package/exports/hosts/claude/.claude/commands/plan-review.md +42 -0
  445. package/exports/hosts/claude/.claude/commands/plan.md +39 -0
  446. package/exports/hosts/claude/.claude/commands/prepare-next.md +37 -0
  447. package/exports/hosts/claude/.claude/commands/review.md +40 -0
  448. package/exports/hosts/claude/.claude/commands/run-audit.md +41 -0
  449. package/exports/hosts/claude/.claude/commands/spec-challenge.md +41 -0
  450. package/exports/hosts/claude/.claude/commands/specify.md +38 -0
  451. package/exports/hosts/claude/.claude/commands/verify.md +37 -0
  452. package/exports/hosts/claude/.claude/settings.json +34 -0
  453. package/exports/hosts/claude/CLAUDE.md +19 -0
  454. package/exports/hosts/claude/export.manifest.json +38 -0
  455. package/exports/hosts/claude/host-package.json +67 -0
  456. package/exports/hosts/codex/AGENTS.md +19 -0
  457. package/exports/hosts/codex/export.manifest.json +38 -0
  458. package/exports/hosts/codex/host-package.json +41 -0
  459. package/exports/hosts/cursor/.cursor/hooks.json +16 -0
  460. package/exports/hosts/cursor/.cursor/rules/wazir-core.mdc +19 -0
  461. package/exports/hosts/cursor/export.manifest.json +38 -0
  462. package/exports/hosts/cursor/host-package.json +42 -0
  463. package/exports/hosts/gemini/GEMINI.md +19 -0
  464. package/exports/hosts/gemini/export.manifest.json +38 -0
  465. package/exports/hosts/gemini/host-package.json +41 -0
  466. package/hooks/README.md +18 -0
  467. package/hooks/definitions/loop_cap_guard.yaml +21 -0
  468. package/hooks/definitions/post_tool_capture.yaml +24 -0
  469. package/hooks/definitions/pre_compact_summary.yaml +19 -0
  470. package/hooks/definitions/pre_tool_capture_route.yaml +19 -0
  471. package/hooks/definitions/protected_path_write_guard.yaml +19 -0
  472. package/hooks/definitions/session_start.yaml +19 -0
  473. package/hooks/definitions/stop_handoff_harvest.yaml +20 -0
  474. package/hooks/loop-cap-guard +17 -0
  475. package/hooks/post-tool-lint +36 -0
  476. package/hooks/protected-path-write-guard +17 -0
  477. package/hooks/session-start +41 -0
  478. package/llms-full.txt +2355 -0
  479. package/llms.txt +43 -0
  480. package/package.json +79 -0
  481. package/roles/README.md +20 -0
  482. package/roles/clarifier.md +42 -0
  483. package/roles/content-author.md +63 -0
  484. package/roles/designer.md +55 -0
  485. package/roles/executor.md +55 -0
  486. package/roles/learner.md +51 -0
  487. package/roles/planner.md +53 -0
  488. package/roles/researcher.md +43 -0
  489. package/roles/reviewer.md +54 -0
  490. package/roles/specifier.md +47 -0
  491. package/roles/verifier.md +71 -0
  492. package/schemas/README.md +24 -0
  493. package/schemas/accepted-learning.schema.json +20 -0
  494. package/schemas/author-artifact.schema.json +156 -0
  495. package/schemas/clarification.schema.json +19 -0
  496. package/schemas/design-artifact.schema.json +80 -0
  497. package/schemas/docs-claim.schema.json +18 -0
  498. package/schemas/export-manifest.schema.json +20 -0
  499. package/schemas/hook.schema.json +67 -0
  500. package/schemas/host-export-package.schema.json +18 -0
  501. package/schemas/implementation-plan.schema.json +19 -0
  502. package/schemas/proposed-learning.schema.json +19 -0
  503. package/schemas/research.schema.json +18 -0
  504. package/schemas/review.schema.json +29 -0
  505. package/schemas/run-manifest.schema.json +18 -0
  506. package/schemas/spec-challenge.schema.json +18 -0
  507. package/schemas/spec.schema.json +20 -0
  508. package/schemas/usage.schema.json +102 -0
  509. package/schemas/verification-proof.schema.json +29 -0
  510. package/schemas/wazir-manifest.schema.json +173 -0
  511. package/skills/README.md +40 -0
  512. package/skills/brainstorming/SKILL.md +77 -0
  513. package/skills/debugging/SKILL.md +50 -0
  514. package/skills/design/SKILL.md +61 -0
  515. package/skills/dispatching-parallel-agents/SKILL.md +128 -0
  516. package/skills/executing-plans/SKILL.md +70 -0
  517. package/skills/finishing-a-development-branch/SKILL.md +169 -0
  518. package/skills/humanize/SKILL.md +123 -0
  519. package/skills/init-pipeline/SKILL.md +124 -0
  520. package/skills/prepare-next/SKILL.md +20 -0
  521. package/skills/receiving-code-review/SKILL.md +123 -0
  522. package/skills/requesting-code-review/SKILL.md +105 -0
  523. package/skills/requesting-code-review/code-reviewer.md +108 -0
  524. package/skills/run-audit/SKILL.md +197 -0
  525. package/skills/scan-project/SKILL.md +41 -0
  526. package/skills/self-audit/SKILL.md +153 -0
  527. package/skills/subagent-driven-development/SKILL.md +154 -0
  528. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +26 -0
  529. package/skills/subagent-driven-development/implementer-prompt.md +102 -0
  530. package/skills/subagent-driven-development/spec-reviewer-prompt.md +61 -0
  531. package/skills/tdd/SKILL.md +23 -0
  532. package/skills/using-git-worktrees/SKILL.md +163 -0
  533. package/skills/using-skills/SKILL.md +95 -0
  534. package/skills/verification/SKILL.md +22 -0
  535. package/skills/wazir/SKILL.md +463 -0
  536. package/skills/writing-plans/SKILL.md +30 -0
  537. package/skills/writing-skills/SKILL.md +157 -0
  538. package/skills/writing-skills/anthropic-best-practices.md +122 -0
  539. package/skills/writing-skills/persuasion-principles.md +50 -0
  540. package/templates/README.md +20 -0
  541. package/templates/artifacts/README.md +10 -0
  542. package/templates/artifacts/accepted-learning.md +19 -0
  543. package/templates/artifacts/accepted-learning.template.json +12 -0
  544. package/templates/artifacts/author.md +74 -0
  545. package/templates/artifacts/author.template.json +19 -0
  546. package/templates/artifacts/clarification.md +21 -0
  547. package/templates/artifacts/clarification.template.json +12 -0
  548. package/templates/artifacts/execute-notes.md +19 -0
  549. package/templates/artifacts/implementation-plan.md +21 -0
  550. package/templates/artifacts/implementation-plan.template.json +11 -0
  551. package/templates/artifacts/learning-proposal.md +19 -0
  552. package/templates/artifacts/next-run-handoff.md +21 -0
  553. package/templates/artifacts/plan-review.md +19 -0
  554. package/templates/artifacts/proposed-learning.template.json +12 -0
  555. package/templates/artifacts/research.md +21 -0
  556. package/templates/artifacts/research.template.json +12 -0
  557. package/templates/artifacts/review-findings.md +19 -0
  558. package/templates/artifacts/review.template.json +11 -0
  559. package/templates/artifacts/run-manifest.template.json +8 -0
  560. package/templates/artifacts/spec-challenge.md +19 -0
  561. package/templates/artifacts/spec-challenge.template.json +11 -0
  562. package/templates/artifacts/spec.md +21 -0
  563. package/templates/artifacts/spec.template.json +12 -0
  564. package/templates/artifacts/verification-proof.md +19 -0
  565. package/templates/artifacts/verification-proof.template.json +11 -0
  566. package/templates/examples/accepted-learning.example.json +14 -0
  567. package/templates/examples/author.example.json +152 -0
  568. package/templates/examples/clarification.example.json +15 -0
  569. package/templates/examples/docs-claim.example.json +8 -0
  570. package/templates/examples/export-manifest.example.json +7 -0
  571. package/templates/examples/host-export-package.example.json +11 -0
  572. package/templates/examples/implementation-plan.example.json +17 -0
  573. package/templates/examples/proposed-learning.example.json +13 -0
  574. package/templates/examples/research.example.json +15 -0
  575. package/templates/examples/research.example.md +6 -0
  576. package/templates/examples/review.example.json +17 -0
  577. package/templates/examples/run-manifest.example.json +9 -0
  578. package/templates/examples/spec-challenge.example.json +14 -0
  579. package/templates/examples/spec.example.json +21 -0
  580. package/templates/examples/verification-proof.example.json +21 -0
  581. package/templates/examples/wazir-manifest.example.yaml +65 -0
  582. package/templates/task-definition-schema.md +99 -0
  583. package/tooling/README.md +20 -0
  584. package/tooling/src/adapters/context-mode.js +50 -0
  585. package/tooling/src/capture/command.js +376 -0
  586. package/tooling/src/capture/store.js +99 -0
  587. package/tooling/src/capture/usage.js +270 -0
  588. package/tooling/src/checks/branches.js +50 -0
  589. package/tooling/src/checks/brand-truth.js +110 -0
  590. package/tooling/src/checks/changelog.js +231 -0
  591. package/tooling/src/checks/command-registry.js +36 -0
  592. package/tooling/src/checks/commits.js +102 -0
  593. package/tooling/src/checks/docs-drift.js +103 -0
  594. package/tooling/src/checks/docs-truth.js +201 -0
  595. package/tooling/src/checks/runtime-surface.js +156 -0
  596. package/tooling/src/cli.js +116 -0
  597. package/tooling/src/command-options.js +56 -0
  598. package/tooling/src/commands/validate.js +320 -0
  599. package/tooling/src/doctor/command.js +91 -0
  600. package/tooling/src/export/command.js +77 -0
  601. package/tooling/src/export/compiler.js +498 -0
  602. package/tooling/src/guards/loop-cap-guard.js +52 -0
  603. package/tooling/src/guards/protected-path-write-guard.js +67 -0
  604. package/tooling/src/index/command.js +152 -0
  605. package/tooling/src/index/storage.js +1061 -0
  606. package/tooling/src/index/summarizers.js +261 -0
  607. package/tooling/src/loaders.js +18 -0
  608. package/tooling/src/project-root.js +22 -0
  609. package/tooling/src/recall/command.js +225 -0
  610. package/tooling/src/schema-validator.js +30 -0
  611. package/tooling/src/state-root.js +40 -0
  612. package/tooling/src/status/command.js +71 -0
  613. package/wazir.manifest.yaml +135 -0
  614. package/workflows/README.md +19 -0
  615. package/workflows/author.md +42 -0
  616. package/workflows/clarify.md +38 -0
  617. package/workflows/design-review.md +46 -0
  618. package/workflows/design.md +44 -0
  619. package/workflows/discover.md +37 -0
  620. package/workflows/execute.md +48 -0
  621. package/workflows/learn.md +38 -0
  622. package/workflows/plan-review.md +42 -0
  623. package/workflows/plan.md +39 -0
  624. package/workflows/prepare-next.md +37 -0
  625. package/workflows/review.md +40 -0
  626. package/workflows/run-audit.md +41 -0
  627. package/workflows/spec-challenge.md +41 -0
  628. package/workflows/specify.md +38 -0
  629. package/workflows/verify.md +37 -0
# Scaling Anti-Patterns -- Performance Anti-Patterns Module

> Scaling failures are among the most expensive incidents in production systems. They strike at peak traffic -- the worst possible moment -- and cascade into multi-hour outages that destroy revenue and user trust. Most scaling failures are not caused by unprecedented load but by well-known anti-patterns that were never addressed. GitHub's June 2025 outage, where a routine database migration cascaded into a platform-wide crisis, is a reminder that even the most sophisticated engineering organizations are not immune.

> **Domain:** Performance
> **Severity:** Critical
> **Applies to:** Backend, Infrastructure, Distributed Systems
> **Key metrics:** Requests per second capacity, p99 latency under load, error rate during traffic spikes, time to recover from overload, cost per request

---

## Table of Contents

1. [Vertical Scaling Only (Bigger Server Syndrome)](#1-vertical-scaling-only-bigger-server-syndrome)
2. [Not Designing for Horizontal Scaling](#2-not-designing-for-horizontal-scaling)
3. [Sticky Sessions Preventing Scale-Out](#3-sticky-sessions-preventing-scale-out)
4. [Storing State on Local Filesystem](#4-storing-state-on-local-filesystem)
5. [Not Using Connection Pooling](#5-not-using-connection-pooling)
6. [Single Database for Everything](#6-single-database-for-everything)
7. [Not Planning for Thundering Herd](#7-not-planning-for-thundering-herd)
8. [Ignoring Backpressure](#8-ignoring-backpressure)
9. [Unbounded Queues](#9-unbounded-queues)
10. [Not Load Testing Before Launch](#10-not-load-testing-before-launch)
11. [Hot Spots from Poor Sharding](#11-hot-spots-from-poor-sharding)
12. [Cross-Shard Transactions](#12-cross-shard-transactions)
13. [Scaling by Adding Complexity](#13-scaling-by-adding-complexity)
14. [Ignoring Cold Start Problems](#14-ignoring-cold-start-problems)
15. [Not Planning for Graceful Degradation](#15-not-planning-for-graceful-degradation)
16. [Monolithic Database Migrations at Scale](#16-monolithic-database-migrations-at-scale)
17. [Network Calls in Loops](#17-network-calls-in-loops)
18. [Fan-Out Without Fan-In Limits](#18-fan-out-without-fan-in-limits)
19. [Not Using Read Replicas When Read-Heavy](#19-not-using-read-replicas-when-read-heavy)
20. [Over-Provisioning vs Under-Provisioning](#20-over-provisioning-vs-under-provisioning)
21. [Root Cause Analysis](#root-cause-analysis)
22. [Self-Check Questions](#self-check-questions)
23. [Code Smell Quick Reference](#code-smell-quick-reference)
24. [Sources](#sources)

---

## 1. Vertical Scaling Only (Bigger Server Syndrome)

**Anti-pattern:** Responding to every capacity problem by upgrading to a larger server instance (more CPU, more RAM, bigger disk) instead of distributing load across multiple nodes.

**Why it happens:** Vertical scaling is the path of least resistance. No code changes required -- just resize the instance. Teams under pressure choose the fastest fix. Early-stage startups often lack the engineering capacity to design for horizontal scaling, so they throw hardware at the problem.

**Real-world incident:** Airbnb initially scaled their monolithic Ruby on Rails application by upgrading to progressively larger AWS EC2 instances. The strategy hit a wall when peak loads exceeded what any single instance could handle. High-end servers with 128 cores and 1TB of RAM cost disproportionately more -- often 5x the price of a machine with half the specs -- while delivering diminishing returns. Airbnb ultimately transitioned to a service-oriented architecture with horizontal scaling across regions.

**Why it fails:**
- Hardware has physical limits -- there is a largest server money can buy
- Cost scales superlinearly: doubling capacity often costs 3-5x more
- Creates a single point of failure -- if the one server goes down, everything goes down
- Maintenance windows require full downtime since there is no redundancy
- No geographic distribution possible

**The fix:**
- Design stateless application tiers from the start
- Use load balancers to distribute traffic across multiple instances
- Externalize state to shared stores (Redis, S3, managed databases)
- Adopt auto-scaling groups that add/remove instances based on load
- Vertically scale the database tier (where horizontal scaling is hardest), but horizontally scale everything else

**Detection signals:**
- Monthly infrastructure bills growing faster than revenue
- Single-instance CPU consistently above 70%
- Downtime during maintenance windows with no failover
- Maximum instance size already in use

---

## 2. Not Designing for Horizontal Scaling

**Anti-pattern:** Building applications that assume a single-process, single-machine deployment. In-memory caches, local file storage, process-level singletons, and reliance on local disk all prevent adding more instances behind a load balancer.

**Why it happens:** Local development environments are inherently single-machine. Developers build and test on one machine and never encounter multi-instance issues until production. Frameworks often default to in-process state management. The cost of distributed design feels premature when you have 100 users.

**Real-world incident:** A SaaS platform stored uploaded user avatars on the local filesystem of the web server. When they added a second server behind a load balancer, half of all avatar requests returned 404 errors because the file only existed on the original server. Emergency migration to S3 required a maintenance window and data reconciliation.

**Why it fails:**
- Adding instances behind a load balancer produces inconsistent behavior
- In-memory caches diverge across instances, causing stale data bugs
- File uploads saved locally become inaccessible from other instances
- Process-level locks and singletons cause race conditions in multi-instance deployments
- Cannot leverage auto-scaling since new instances lack the accumulated state

**The fix:**
- Follow the Twelve-Factor App methodology: treat servers as disposable
- Store all persistent data in external services (databases, object storage, caches)
- Use distributed caching (Redis, Memcached) instead of in-process caches
- Design all endpoints to be stateless -- any instance can handle any request
- Use distributed locks (Redis SETNX, ZooKeeper) instead of local mutexes

**Detection signals:**
- Code references to `/tmp`, `/var/data`, or other local paths for user data
- `HashMap` or `Dictionary` used as an application-level cache
- Singleton pattern used for rate limiters or session stores
- Tests only run against a single instance

---

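The distributed-lock fix above can be sketched with Redis's `SET key value NX EX ttl` pattern. This is a minimal illustration, not a production implementation: `RedisLock` and `FakeRedis` are hypothetical names, and the in-memory `FakeRedis` stands in for a redis-py style client so the snippet runs without a server. (A real deployment would make the owner-check-and-delete in `release` atomic, e.g. with a Lua script.)

```python
import uuid

class RedisLock:
    """Distributed-lock sketch using the SET NX EX pattern.
    `client` is assumed to expose redis-py style set()/get()/delete()."""
    def __init__(self, client, name, ttl_seconds=30):
        self.client, self.name, self.ttl = client, name, ttl_seconds
        self.token = str(uuid.uuid4())  # unique owner token

    def acquire(self):
        # SET key token NX EX ttl -- succeeds only if the key is absent
        return bool(self.client.set(self.name, self.token, nx=True, ex=self.ttl))

    def release(self):
        # Only the owner may release: the token check avoids deleting a
        # lock that expired and was re-acquired by another instance.
        if self.client.get(self.name) == self.token:
            self.client.delete(self.name)

# In-memory stand-in for redis so the sketch runs without a server.
class FakeRedis:
    def __init__(self): self.data = {}
    def set(self, k, v, nx=False, ex=None):
        if nx and k in self.data:
            return None
        self.data[k] = v
        return True
    def get(self, k): return self.data.get(k)
    def delete(self, k): self.data.pop(k, None)

r = FakeRedis()
a, b = RedisLock(r, "jobs:nightly"), RedisLock(r, "jobs:nightly")
print(a.acquire())  # True  -- first instance wins
print(b.acquire())  # False -- second instance is locked out
a.release()
print(b.acquire())  # True  -- lock is free again
```

Unlike a local mutex, every application instance sharing the same Redis sees the same lock, so only one of them runs the guarded work at a time.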
## 3. Sticky Sessions Preventing Scale-Out

**Anti-pattern:** Configuring load balancers to route all requests from a given user to the same backend server (session affinity), tying user state to a specific instance and defeating the purpose of horizontal scaling.

**Why it happens:** Server-side session storage (e.g., `HttpSession` in Java, session middleware in Express) stores user state in the process memory of whichever server handled the login. Without sticky sessions, the next request may hit a different server that has no knowledge of the session. Sticky sessions are the quick fix.

**Real-world incident:** A Kubernetes-based platform enabled sticky sessions to solve session consistency issues. When Horizontal Pod Autoscaler (HPA) scaled pods due to increased load, the new pods received zero traffic because all existing users were pinned to the original pods. The scaling event was effectively nullified -- new pods sat idle while overloaded pods continued to degrade. The team had to redesign session management using Redis before auto-scaling became functional.

**Why it fails:**
- New instances get no traffic until existing sessions expire, nullifying scale-out
- If a server fails, all pinned sessions are lost -- users experience errors or forced re-login
- Load becomes uneven: one server may handle 10x the traffic of another
- Rolling deployments are painful because draining sticky sessions takes time
- Cannot effectively use auto-scaling policies

**The fix:**
- Externalize session state to Redis, Memcached, or a database
- Use token-based authentication (JWT) where session data travels with the request
- Store only a session ID in a cookie; look up state from the shared store
- If sticky sessions are truly required (e.g., WebSocket connections), implement session replication as a fallback

**Detection signals:**
- Load balancer configuration includes `stickiness` or `affinity` settings
- Uneven request distribution visible in server metrics
- New instances show near-zero traffic after scaling events
- User complaints spike after server restarts

---

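The externalized-session pattern above can be sketched as follows. This is an illustration, not any framework's real API: only an opaque ID travels in the cookie, while the state lives in a store that every instance shares (a plain dict stands in for Redis here, and `SessionStore` is a hypothetical name).

```python
import secrets

class SessionStore:
    """Shared session store sketch: any instance behind the load
    balancer can resolve any session ID, so no affinity is needed."""
    def __init__(self, backend=None):
        # `backend` stands in for a shared Redis; a dict keeps it runnable.
        self.backend = backend if backend is not None else {}

    def create(self, data):
        sid = secrets.token_urlsafe(32)        # opaque ID goes in the cookie
        self.backend[f"session:{sid}"] = data  # state lives server-side, shared
        return sid

    def load(self, sid):
        return self.backend.get(f"session:{sid}")

store = SessionStore()
sid = store.create({"user_id": 42})

# A different instance sharing the same backend resolves the session too:
other_instance = SessionStore(backend=store.backend)
print(other_instance.load(sid))  # {'user_id': 42}
```

Because any instance can serve any request, the load balancer can stay affinity-free and new pods receive traffic immediately after a scaling event.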
## 4. Storing State on Local Filesystem

**Anti-pattern:** Writing application state -- uploads, generated reports, cache files, session data, or temp files -- to the local disk of a server instance, making it inaccessible to other instances and lost on instance termination.

**Why it happens:** Writing to disk is the simplest I/O operation in any language. It works perfectly in development. Cloud instances come with local storage by default. The distinction between ephemeral and persistent storage is not obvious to developers unfamiliar with cloud-native patterns.

**Real-world incident:** An e-commerce platform generated PDF invoices and stored them at `/var/invoices/` on the web server. During a holiday traffic spike, auto-scaling launched four new instances. Customers who generated invoices on instance A could not download them when their next request was routed to instance B. The team scrambled to implement an S3-backed solution while simultaneously handling peak traffic -- the worst possible time for an architectural change.

**Why it fails:**
- Data is lost when instances are terminated, recycled, or crash
- Other instances cannot access the files, breaking multi-instance deployments
- Auto-scaling and spot/preemptible instances are incompatible with local state
- Disk space is finite and unmonitored, leading to silent failures when full
- No built-in redundancy or backup

**The fix:**
- Use object storage (S3, GCS, Azure Blob) for all user-facing files
- Use managed databases or Redis for application state
- Treat local disk as ephemeral scratch space only
- Mount shared filesystems (EFS, Filestore) if POSIX semantics are required
- Implement upload-to-cloud patterns where files go directly to object storage from the client

**Detection signals:**
- Code writes to `/tmp`, `/var`, or custom local paths for persistent data
- `os.path`, `fs.writeFile`, or `File.write` used for user-generated content
- No object storage SDK in project dependencies
- Missing files reported after deployments or scaling events

---

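A minimal sketch of the object-storage fix, assuming the boto3 S3 client interface (`put_object` with `Bucket`/`Key`/`Body`). `store_invoice` and `FakeS3` are illustrative names; the stub lets the example run without AWS credentials, and in production the client would be `boto3.client("s3")`.

```python
def store_invoice(storage_client, bucket, invoice_id, pdf_bytes):
    """Persist a generated file to object storage instead of local disk,
    so every instance (and every future instance) can serve it."""
    key = f"invoices/{invoice_id}.pdf"
    storage_client.put_object(Bucket=bucket, Key=key, Body=pdf_bytes,
                              ContentType="application/pdf")
    return key  # store the key in the database, not a local path

# Stub with the same put_object signature, for demonstration only.
class FakeS3:
    def __init__(self): self.objects = {}
    def put_object(self, Bucket, Key, Body, **kwargs):
        self.objects[(Bucket, Key)] = Body

s3 = FakeS3()
key = store_invoice(s3, "acme-invoices", "inv-2024-001", b"%PDF-1.7 ...")
print(key)  # invoices/inv-2024-001.pdf
```

The instance that generated the invoice no longer matters: any server can build a download URL from the bucket and key, and terminating the instance loses nothing.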
## 5. Not Using Connection Pooling

**Anti-pattern:** Opening a new database connection for every request and closing it afterward, or failing to limit the total number of connections across application instances, leading to connection exhaustion under load.

**Why it happens:** Default ORM and driver configurations often open connections on demand without pooling. In development, the connection count is low enough that it never matters. When the application scales to multiple instances with multiple workers each, the connection count multiplies and overwhelms the database.

**Real-world incident:** A production PostgreSQL database experienced full connection pool exhaustion when multiple Celery workers, each running several concurrent processes, opened more connections than the database could handle. The database began rejecting all new connections, causing a complete application outage. The immediate fix required a superuser connection to identify and kill hundreds of idle connections that had leaked from application code. Long-term, the team deployed PgBouncer to multiplex client connections through a smaller pool of actual database connections.

**Why it fails:**
- Each connection consumes ~10MB of database server memory (PostgreSQL)
- Connection establishment takes 50-200ms of TCP + TLS handshake overhead
- At scale, `max_connections` is exhausted, and all new queries are rejected
- Leaked connections (not returned to pool) silently accumulate until crisis
- Multiple application instances multiply the problem (10 instances x 20 workers x 5 connections = 1000 connections)

**The fix:**
- Configure connection pooling in the application (HikariCP, SQLAlchemy pool, Knex pool)
- Deploy a connection pooler proxy (PgBouncer, ProxySQL, Amazon RDS Proxy)
- Set pool sizes based on: `pool_size = (total_db_connections) / (num_instances * workers_per_instance)`
- Use context managers or try/finally to guarantee connection release
- Monitor active vs idle connections with alerts at 70% utilization

**Detection signals:**
- `too many connections` or `connection pool exhausted` errors in logs
- Database CPU low but connection count near maximum
- Queries timing out waiting for an available connection
- Increasing latency correlated with instance count rather than query complexity

---

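The pool-sizing rule above can be turned into simple arithmetic. This is a sketch: `pool_size_per_worker` is an illustrative helper (the `reserve` parameter is an added assumption, keeping headroom for admin access), and the commented-out engine call assumes SQLAlchemy's documented `pool_size`/`max_overflow` parameters.

```python
def pool_size_per_worker(max_db_connections, num_instances,
                         workers_per_instance, reserve=10):
    """Divide the database's connection budget across every process that
    will open its own pool, keeping a reserve for admin/superuser access."""
    budget = max_db_connections - reserve
    return max(1, budget // (num_instances * workers_per_instance))

# PostgreSQL's default max_connections is 100; 10 app instances x 4 workers:
size = pool_size_per_worker(100, 10, 4)
print(size)  # 2 -- each worker's pool must stay tiny, or front the
             # database with PgBouncer to multiplex connections

# The number then feeds the pool configuration, e.g. with SQLAlchemy:
# engine = create_engine(url, pool_size=size, max_overflow=0,
#                        pool_pre_ping=True, pool_recycle=1800)
```

Running the arithmetic before launch makes the multiplication effect from the incident above visible on paper instead of in an outage.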
## 6. Single Database for Everything

**Anti-pattern:** Using one database instance for all services, all data types, and all workloads -- OLTP transactions, analytics queries, search, session storage, job queues, and audit logs all hitting the same server.

**Why it happens:** Starting with one database is rational. It simplifies the technology stack, avoids distributed transaction complexity, and makes joins trivial. The problem is that teams never revisit this decision as the application grows and workload characteristics diverge.

**Real-world incident:** GitHub experienced repeated outages in February 2020 traced directly to database infrastructure. Application logic changes to database query patterns rapidly increased load on database clusters. A heavy analytics query on the shared database caused lock contention that blocked OLTP transactions, degrading the entire platform. GitHub subsequently invested heavily in database partitioning and workload isolation.

**Why it fails:**
- An analytics query holding locks blocks all transactional writes
- One slow query can saturate CPU and starve all other workloads
- Scaling the single database means scaling for the most demanding workload, even if others are light
- Schema migrations affect all services simultaneously
- Backup and restore times grow with total data volume, increasing recovery time

**The fix:**
- Separate OLTP from OLAP workloads (dedicated analytics database or data warehouse)
- Use purpose-built data stores: Redis for sessions, Elasticsearch for search, a queue service for job queues
- Implement the database-per-service pattern for microservices
- Use Change Data Capture (CDC) to replicate data between specialized stores
- Start with logical separation (schemas) and graduate to physical separation as load demands

**Detection signals:**
- Single connection string used across all services and background jobs
- Mixed query patterns: sub-millisecond lookups alongside multi-second aggregations
- Lock wait timeouts correlating with batch job schedules
- Schema with 200+ tables where domains overlap

---

## 7. Not Planning for Thundering Herd

**Anti-pattern:** Allowing all clients to simultaneously retry, reconnect, or request the same resource at the same moment, creating a synchronized stampede that overwhelms backends that might otherwise recover.

**Why it happens:** Systems are designed for steady-state traffic patterns. When a cache expires, a service restarts, or an outage ends, all waiting clients rush in simultaneously. Fixed retry intervals ensure every client retries at the same time. Developers test with one client at a time and never simulate coordinated surges.

**Real-world incident:** Depot experienced a thundering herd event where database traffic suddenly spiked, CPU usage jumped to 100%, and the overload cascaded into a much larger outage. Every client with retry logic retried simultaneously with fixed intervals, hitting the recovering database with the exact same wave of traffic again and again. Similarly, IRCTC (Indian Railways) pre-loads train data before the 10 AM Tatkal booking window but still struggles because millions of seat booking writes spike at exactly 10:00:00.

**Why it fails:**
- Cache expiration causes all requests to hit the origin simultaneously
- Service recovery is prevented because the herd arrives faster than the system can stabilize
- Retry storms amplify failures: N clients failing and retrying creates 2N, then 4N load
- Database connection pools are exhausted instantly
- CDN or cache layer going cold triggers origin overload

**The fix:**
- Add jitter to all retry intervals: `delay = base_delay * 2^attempt + random(0, base_delay)`
- Implement cache stampede protection: lock-based recomputation where only one request rebuilds the cache
- Use staggered TTLs: add random variance to cache expiration times
- Implement retry budgets: limit total retries per time window across the fleet
- Use load shedding at the gateway: return 503 with `Retry-After` header during overload
- Deploy request coalescing: deduplicate identical in-flight requests

**Detection signals:**
- Traffic graphs show sharp spikes to 10x+ normal immediately after recovery
- Cache hit rate drops from 99% to 0% simultaneously across all keys
- All retry timers use fixed intervals without jitter
- No circuit breakers between services

---

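The jitter formula above, as a runnable sketch (`backoff_delay` is an illustrative name; the cap is an added assumption so retries never wait arbitrarily long):

```python
import random

def backoff_delay(attempt, base_delay=0.5, max_delay=30.0):
    """Exponential backoff with jitter:
    delay = base_delay * 2^attempt + random(0, base_delay), capped."""
    exponential = base_delay * (2 ** attempt)
    return min(max_delay, exponential + random.uniform(0, base_delay))

# Ten clients retrying after attempt 3 land at slightly different times
# instead of hitting the recovering backend in one synchronized wave:
print(sorted(round(backoff_delay(3), 3) for _ in range(10)))
```

The random component is what breaks the synchronization: fixed intervals guarantee every failed client returns in the same instant, while jitter spreads the herd across the whole retry window.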
## 8. Ignoring Backpressure

**Anti-pattern:** Accepting every incoming request or message without regard for downstream capacity, allowing producers to overwhelm consumers until the system collapses from resource exhaustion.

**Why it happens:** Developers focus on throughput -- accepting requests as fast as possible. Saying "no" to a request feels like a bug. Load balancers, API gateways, and message brokers accept work by default. There is no built-in "slow down" signal in HTTP. The system works fine under normal load, so the problem is invisible until a traffic spike.

**Real-world incident:** A data pipeline ingested events from hundreds of IoT devices. The ingestion API accepted all messages and pushed them to a processing queue. When downstream processors slowed due to a database bottleneck, the queue grew to 40GB over six hours; the ingestion process hit its memory limit and was OOM-killed. All buffered events were lost. The system had no mechanism to signal producers to slow down or to shed excess load.

**Why it fails:**
- Memory grows unboundedly as work accumulates faster than it is processed
- Latency increases for all requests, not just the excess
- OOM kills cause abrupt crashes with no graceful cleanup
- Recovery is slow because the backlog must be drained before normal operation
- Downstream services may fail under the sudden surge when processing resumes

**The fix:**
- Implement rate limiting at the API gateway (token bucket, sliding window)
- Use bounded buffers and reject or drop when full (return 429 or 503)
- Implement reactive streams / flow control (gRPC flow control, Kafka consumer pause)
- Monitor queue depth and alert when it exceeds a threshold
- Design producers to handle rejection: exponential backoff, dead-letter queues
- Use admission control: shed load early rather than accepting work you cannot complete

**Detection signals:**
- Queue depth metrics trending upward with no plateau
- Memory usage growing linearly over time under sustained load
- No rate limiting configured on public-facing endpoints
- No 429 or 503 responses in access logs -- every request is accepted

---

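The bounded-buffer-plus-rejection fix can be sketched with the standard library's `queue.Queue`. `admit` is an illustrative admission-control helper, with HTTP-style status codes standing in for what a real handler would return:

```python
import queue

def admit(work_queue, item):
    """Admission control sketch: accept work only while the bounded
    buffer has capacity, otherwise shed load immediately (a real HTTP
    handler would return 429/503 with a Retry-After header)."""
    try:
        work_queue.put_nowait(item)
        return 202   # accepted for asynchronous processing
    except queue.Full:
        return 429   # too many requests -- caller should back off

work = queue.Queue(maxsize=100)  # the bound is the backpressure signal

for i in range(105):
    status = admit(work, i)

print(work.qsize(), status)  # 100 429 -- the last five requests were shed
```

Rejecting the overflow keeps memory flat and latency bounded for the work that was accepted; without the bound, the same surge would have grown the buffer until the process was OOM-killed, as in the incident above.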
## 9. Unbounded Queues

**Anti-pattern:** Using queues with no maximum size, allowing unlimited messages to accumulate in memory or on disk when consumers cannot keep up, eventually exhausting system resources.

**Why it happens:** Most queue implementations default to unbounded. Setting a limit feels like an arbitrary constraint. "What if we lose messages?" is a common objection. Teams assume consumers will always keep up and never plan for the contrary.

**Real-world incident:** After an upgrade to version 2025.10, Authentik (an identity provider) experienced OOM kills on worker pods due to unbounded queue growth in the `authentik_tasks_task` queue. Stale tasks accumulated without limit, consuming all available memory. The interim fix was a CronJob to periodically purge stale tasks, but the root cause was the absence of any queue size bound or TTL on enqueued items. Separately, Wazuh's remote message control queue (introduced in v4.13.0) had no size limit, allowing unlimited memory consumption during high agent load, risking complete memory exhaustion of the management server.

**Why it fails:**
- Memory consumption grows silently until the process is OOM-killed
- The failure mode is catastrophic: instant crash with no graceful degradation
- Processing latency for new messages equals the time to drain the entire backlog
- Messages at the tail of a massive queue may be stale by the time they are processed
- Monitoring often only tracks throughput, not queue depth

**The fix:**
- Set explicit maximum queue sizes on all queues (`maxlen`, `capacity`, `x-max-length`)
- Define a rejection policy: drop oldest, drop newest, reject producer, or dead-letter
- Add TTL to messages so stale items are automatically discarded
- Monitor queue depth, enqueue rate, and dequeue rate with alerts
- Implement consumer auto-scaling tied to queue depth metrics
- Size queues based on: `max_depth = consumer_throughput * max_acceptable_latency`

**Detection signals:**
- Queue configuration shows no `maxlen`, `capacity`, or size limit
- Memory usage on queue hosts trends upward during load spikes
- No dead-letter queue configured
- Consumer count is static regardless of queue depth

---

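The sizing rule and a drop-oldest rejection policy from the fix list above, sketched with `collections.deque` (the function name is illustrative; the numbers are example inputs, not recommendations):

```python
from collections import deque

def max_queue_depth(consumer_throughput_per_s, max_acceptable_latency_s):
    """The deepest backlog that consumers can still drain within the
    latency budget: max_depth = consumer_throughput * max_latency."""
    return int(consumer_throughput_per_s * max_acceptable_latency_s)

depth = max_queue_depth(200, 30)  # consumers drain 200 msg/s, 30 s budget
print(depth)  # 6000

# A bounded buffer with a drop-oldest policy instead of unbounded growth:
events = deque(maxlen=depth)
for i in range(7000):
    events.append(i)           # once full, the stalest item is evicted

print(len(events), events[0])  # 6000 1000 -- depth capped, oldest dropped
```

Drop-oldest suits telemetry-style streams where stale items lose value anyway; for messages that must not be lost, reject the producer or route the overflow to a dead-letter queue instead.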
## 10. Not Load Testing Before Launch

**Anti-pattern:** Deploying to production without systematically testing the system under expected and peak traffic volumes, discovering capacity limits through real user impact instead of controlled experiments.

**Why it happens:** Load testing takes time to set up, requires realistic test data, and needs a production-like environment. Teams under deadline pressure skip it. "We can always scale up if needed" is the rationalization. Development environments give no indication of production-scale behavior.

**Real-world incident:** CodinGame experienced a "Reddit hug of death" that took their platform offline for 2 hours. They received as many new users in one day as during the previous two months. Post-mortem analysis revealed multiple failures that load testing would have caught: the RDS database was the main bottleneck with all data centralized and tangled, application servers had a memory leak that only manifested under heavy load, and the chat server process hit 100% CPU under concurrent connections. Industry data shows 80% of incidents are triggered by internal changes with insufficient testing.

**Why it fails:**
- True bottlenecks only appear under concurrent load (lock contention, connection limits, memory leaks)
- Capacity limits are unknown, making scaling decisions guesswork
- Performance regressions ship undetected when there is no baseline
- Third-party dependencies (payment processors, APIs) may rate-limit or fail under load
- Auto-scaling configurations are never validated -- minimum/maximum counts may be wrong

**The fix:**
- Establish a load testing practice with tools (k6, Locust, Gatling, Artillery)
- Test at 2x expected peak to find the breaking point, not just confirm the happy path
- Run soak tests (sustained load for hours) to detect memory leaks and connection exhaustion
- Include third-party dependencies in tests or mock them at realistic latencies
333
+ - Automate load tests in CI/CD to catch regressions per release
334
+ - Define performance budgets: maximum p99 latency, minimum throughput, maximum error rate
335
+
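One concrete way to act on the performance-budget item is a small CI gate that fails a build when load-test samples exceed the budget. The thresholds and the nearest-rank percentile method below are illustrative, not taken from any particular tool:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p percent of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

def budget_gate(latencies_ms, error_count, request_count,
                p99_budget_ms=250.0, max_error_rate=0.01):
    """Return True when the run stays inside the performance budget."""
    p99 = percentile(latencies_ms, 99)
    error_rate = error_count / request_count
    return p99 <= p99_budget_ms and error_rate <= max_error_rate

# 1000 requests: the second run has a slow tail that blows the p99 budget.
ok_run = [50.0] * 990 + [200.0] * 10
bad_run = [50.0] * 980 + [900.0] * 20
assert budget_gate(ok_run, error_count=2, request_count=1000)
assert not budget_gate(bad_run, error_count=2, request_count=1000)
```

Tools like k6 and Locust can emit these numbers per run; the gate simply turns them into a pass/fail signal.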
336
+ **Detection signals:**
337
+ - No load testing tools in project dependencies or CI/CD pipeline
338
+ - Production capacity limits are unknown -- "We will see"
339
+ - Performance metrics have no historical baseline
340
+ - First traffic spike causes unexpected failures
341
+
342
+ ---
343
+
344
+ ## 11. Hot Spots from Poor Sharding
345
+
346
+ **Anti-pattern:** Choosing a shard key that distributes data unevenly, causing one or a few shards to receive disproportionate read/write traffic while others sit idle.
347
+
348
+ **Why it happens:** Shard key selection requires understanding access patterns, data distribution, and growth projections. Teams often choose the most obvious key (user ID, tenant ID, timestamp) without analyzing the distribution. Some keys that appear uniform are actually highly skewed.
349
+
350
+ **Real-world incident:** A documented $2.4 million sharding project failed when, after implementation, one shard grew to 2,847,000 records while another had only 156,000. The root cause: enterprise customers had 10,000+ users while small customers had 1-5 users, and sharding by `customer_id` concentrated enterprise data on a few shards. In another case, an e-commerce platform sharded by product category, but the "electronics" category received 60% of all traffic, creating a persistent hot shard that required repeated hardware upgrades.
351
+
352
+ **Why it fails:**
353
+ - Hot shards become the bottleneck, capping system throughput at one shard's capacity
354
+ - Rebalancing data across shards is operationally expensive and risky
355
+ - Using timestamps as shard keys creates write-hot shards (all new data goes to one shard)
356
+ - Growth in one category or tenant can destabilize the entire cluster
357
+ - Monitoring may show "average" load as healthy while one shard is on fire
358
+
359
+ **The fix:**
360
+ - Analyze data distribution before choosing a shard key -- histogram the candidate key
361
+ - Use composite shard keys that combine a high-cardinality field with a distribution field
362
+ - Hash-based sharding (consistent hashing) provides uniform distribution at the cost of range query support
363
+ - Implement automatic rebalancing (as in MongoDB, CockroachDB, TiDB)
364
+ - Monitor per-shard metrics: CPU, IOPS, query latency, record count
365
+ - Consider virtual shards (more shards than nodes) to simplify rebalancing
366
+
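Histogramming a candidate shard key before committing to it takes only a few lines. The tenant sizes below are made up to mimic the incident above: one enterprise tenant dwarfs everyone else, and a composite key fixes the skew:

```python
import hashlib
from collections import Counter

def shard_for(key, num_shards=8):
    # Stable hash (md5) so shard assignment is deterministic across runs.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_shards

# Skewed tenant sizes: tenant-1 has 9000 rows, 1000 tenants have 1 each.
rows = [("tenant-1", i) for i in range(9000)] + \
       [(f"tenant-{t}", 0) for t in range(2, 1002)]

# Naive key: shard by tenant_id alone -> the big tenant pins one shard.
naive = Counter(shard_for(tenant) for tenant, _ in rows)

# Composite key: tenant_id plus a row id spreads the big tenant out.
composite = Counter(shard_for(f"{tenant}:{row_id}") for tenant, row_id in rows)

assert max(naive.values()) >= 9000     # one hot shard holds almost everything
assert max(composite.values()) < 2000  # composite key evens out the load
```

Running the same histogram against a sample of real production keys is cheap insurance before a migration that is expensive to undo.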
367
+ **Detection signals:**
368
+ - One shard's CPU or IOPS is 5x+ higher than other shards
369
+ - Record counts vary by more than 3x across shards
370
+ - Shard key is a timestamp or low-cardinality field (status, country code)
371
+ - No per-shard monitoring dashboards exist
372
+
373
+ ---
374
+
375
+ ## 12. Cross-Shard Transactions
376
+
377
+ **Anti-pattern:** Designing sharded systems that require frequent transactions spanning multiple shards, introducing distributed coordination overhead (two-phase commit) that negates the throughput gains of sharding.
378
+
379
+ **Why it happens:** Applications are sharded for write throughput, but business logic still requires atomic operations across entities that live on different shards. "Transfer $100 from Account A (Shard 1) to Account B (Shard 2)" is a natural requirement that is extremely difficult to implement correctly in a sharded system.
380
+
381
+ **Real-world context:** Two-phase commit (2PC) has been the standard protocol for cross-shard consistency, used in systems from Oracle and PostgreSQL to Google Spanner and Apache Kafka. However, distributed systems expert Daniel Abadi argues: "I see very little benefit in system architects making continued use of 2PC in sharded systems moving forward." The protocol blocks when a participant fails, and if the coordinator fails permanently during the commit phase, some participants will never resolve their transactions, leaving data in an inconsistent state.
382
+
383
+ **Why it fails:**
384
+ - 2PC adds a coordination round-trip to every transaction, increasing latency by 2-10x
385
+ - Locks must be held across shards for the duration of the protocol, reducing throughput
386
+ - Coordinator failure during commit leaves data in an indeterminate state
387
+ - Deadlocks across shards are difficult to detect and resolve
388
+ - Throughput drops to the speed of the slowest shard
389
+
390
+ **The fix:**
391
+ - Design the data model so that related entities co-locate on the same shard
392
+ - Use the Saga pattern: break distributed transactions into compensatable local transactions
393
+ - Accept eventual consistency where business rules allow (most do)
394
+ - Use change data capture (CDC) and event sourcing for cross-shard data synchronization
395
+ - If strong consistency is required, use databases with native distributed transactions (CockroachDB, Spanner)
396
+ - Minimize cross-shard operations by denormalizing frequently-joined data onto the same shard
397
+
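A minimal sketch of the Saga idea: every local transaction ships with a compensation, and a failure rolls back the completed steps in reverse order. The account objects here are stand-ins for rows on different shards, not a real database driver:

```python
class Account:
    def __init__(self, balance):
        self.balance = balance   # lives on its own shard in a real system

def run_saga(steps):
    """steps: list of (action, compensation). On failure, undo in reverse."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
        return True
    except Exception:
        for compensate in reversed(done):
            compensate()
        return False

a = Account(100)   # shard 1
b = Account(0)     # shard 2

def debit():  a.balance -= 100
def credit(): raise RuntimeError("shard 2 unavailable")  # simulated failure

ok = run_saga([
    (debit,  lambda: setattr(a, "balance", a.balance + 100)),
    (credit, lambda: setattr(b, "balance", b.balance - 100)),
])
assert not ok and a.balance == 100   # debit was compensated, no money lost
```

Unlike 2PC, no locks are held across shards while the saga runs; the trade-off is that intermediate states are briefly visible, so compensations must be business-safe.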
398
+ **Detection signals:**
399
+ - `BEGIN DISTRIBUTED TRANSACTION` or 2PC log entries in database logs
400
+ - Cross-shard query latency is 5x+ higher than single-shard latency
401
+ - Deadlock errors involving multiple shards
402
+ - Business logic requires joins across shard boundaries
403
+
404
+ ---
405
+
406
+ ## 13. Scaling by Adding Complexity
407
+
408
+ **Anti-pattern:** Responding to scaling challenges by introducing additional layers, services, caches, and technologies instead of first simplifying the existing architecture, removing unnecessary work, or optimizing hot paths.
409
+
410
+ **Why it happens:** Adding a cache in front of a slow query feels productive. Introducing a message queue between two services feels like proper engineering. Teams accumulate layers because each solves a proximate problem without addressing why the problem exists. Resume-driven development also plays a role -- engineers want to work with shiny distributed systems technologies.
411
+
412
+ **Real-world incident:** Pokemon Go's scaling crisis illustrates complexity backfiring. To scale the load-balancing layer, the team migrated to Google Cloud Load Balancer (GCLB) -- but the bottleneck was downstream, not at the load balancer, and the added capacity at that tier overwhelmed their backend stack. The migration prolonged the outage rather than fixing it. Adding capacity at the wrong layer amplified the failure.
413
+
414
+ **Why it fails:**
415
+ - Each new component adds latency, failure modes, and operational burden
416
+ - Caches create cache invalidation problems (one of the two hard things in computer science)
417
+ - More moving parts means more things that can fail simultaneously
418
+ - Debugging requires understanding interactions between N components instead of one
419
+ - Operational overhead grows: monitoring, alerting, upgrades, and on-call burden multiply
420
+
421
+ **The fix:**
422
+ - Before adding a component, ask: "Can we remove or simplify something instead?"
423
+ - Profile first: find the actual bottleneck before adding infrastructure
424
+ - Remove unnecessary middleware, ORM layers, and abstraction layers
425
+ - Optimize the hot path: 90% of load is often caused by 10% of code paths
426
+ - Evaluate whether the existing technology can be tuned before introducing a new one
427
+ - Apply the "boring technology" principle: use proven, well-understood tools
428
+
429
+ **Detection signals:**
430
+ - Architecture diagrams require a legend and multiple pages
431
+ - More infrastructure components than team members
432
+ - Incidents frequently involve interactions between components rather than individual failures
433
+ - Team cannot explain the full request path from client to database
434
+
435
+ ---
436
+
437
+ ## 14. Ignoring Cold Start Problems
438
+
439
+ **Anti-pattern:** Failing to account for initialization latency when new instances, containers, or serverless functions are launched, causing latency spikes and timeouts during scale-out events or after periods of low traffic.
440
+
441
+ **Why it happens:** Cold starts are invisible in steady-state monitoring. Functions and containers that are already warm respond in milliseconds. The problem only manifests during scaling events (new instances launching) or after idle periods (serverless environments recycling). Developers testing against warm environments never experience the issue.
442
+
443
+ **Real-world incident:** AWS Lambda cold starts typically add 100ms-2s to function execution time depending on runtime, dependencies, and code size. While cold starts affect less than 1% of requests in steady state, during traffic spikes every new concurrent invocation experiences a cold start. In event-driven architectures with functions calling other functions, the probability that at least one function in the chain is cold approaches 100%, causing cascading latency. One team reported that a chain of five Lambda functions experienced compounding cold starts that pushed end-to-end latency from 200ms (warm) to 8 seconds (all cold).
444
+
445
+ **Why it fails:**
446
+ - Health checks pass before the application is actually ready to serve traffic
447
+ - JVM-based services need time for JIT compilation -- first requests are 10-100x slower
448
+ - Dependency initialization (database connections, SDK clients, config loading) takes seconds
449
+ - Auto-scaling triggers bring up instances that immediately receive traffic before warming up
450
+ - Serverless environments recycle idle instances unpredictably
451
+
452
+ **The fix:**
453
+ - Implement readiness probes that verify the application can actually serve requests
454
+ - Use connection pre-warming: establish database and cache connections during startup
455
+ - For Lambda: use Provisioned Concurrency to keep functions initialized
456
+ - Pre-warm caches on startup by loading frequently-accessed data
457
+ - Use progressive traffic shifting: new instances receive traffic gradually, not all at once
458
+ - Minimize dependency count and use lazy initialization for non-critical paths
459
+
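The readiness-versus-liveness distinction behind the first two fixes can be shown with a toy service: the process is alive immediately, but the readiness probe passes only after expensive initialization completes. The connection strings are invented stand-ins:

```python
class Service:
    def __init__(self):
        self._ready = False
        self.pool = None

    def startup(self):
        # Expensive warm-up: connections, caches, JIT-style priming.
        self.pool = [f"conn-{i}" for i in range(10)]  # stand-in connections
        self._ready = True   # flip readiness only AFTER deps are warm

    def livez(self):
        return 200           # process is up -- do not restart it

    def readyz(self):
        # The load balancer should route traffic only when this returns 200.
        return 200 if self._ready else 503

svc = Service()
assert svc.livez() == 200 and svc.readyz() == 503   # alive but not ready
svc.startup()
assert svc.readyz() == 200                          # now safe to route traffic
```

An instance that reports ready before its pool exists will serve its first requests at cold-start latency, which is exactly the spike this section describes.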
460
+ **Detection signals:**
461
+ - p99 latency is 10x+ higher than p50
462
+ - Latency spikes correlate with scaling events or deployment times
463
+ - First request after idle period is significantly slower
464
+ - Startup logs show multi-second initialization sequences
465
+
466
+ ---
467
+
468
+ ## 15. Not Planning for Graceful Degradation
469
+
470
+ **Anti-pattern:** Building systems that either work at full capacity or fail completely, with no intermediate modes that maintain core functionality when subsystems are impaired.
471
+
472
+ **Why it happens:** Systems are designed for the happy path. Failure handling is an afterthought. "If the recommendation service is down, what do we show?" is a question that never gets asked during design. Feature flags and degradation modes require upfront investment that feels wasteful when everything is working.
473
+
474
+ **Real-world incident:** Pokemon Go's launch is a canonical example. Instead of degrading gracefully -- for example, disabling social features, reducing map detail, or limiting new registrations -- the entire system collapsed under unexpected load. Users could not log in at all. In contrast, Fastly's CDN is designed so that if an origin server is unavailable, it serves stale cached content rather than error pages, maintaining user experience while giving incident responders time to diagnose and fix the root cause.
475
+
476
+ **Why it fails:**
477
+ - Total outage of a non-critical subsystem takes down the entire application
478
+ - Users get error pages instead of reduced-functionality experiences
479
+ - Incident responders have no levers to shed load or disable features
480
+ - Recovery is all-or-nothing, making partial restoration impossible
481
+ - No fallback behavior has been designed or tested
482
+
483
+ **The fix:**
484
+ - Identify critical vs non-critical features and design fallbacks for non-critical ones
485
+ - Implement circuit breakers (Hystrix, Resilience4j, Polly) for all downstream dependencies
486
+ - Use feature flags to disable resource-intensive features during overload
487
+ - Serve stale/cached data when fresh data is unavailable
488
+ - Design load-shedding endpoints: drop excess requests at the edge, not deep in the stack
489
+ - Implement priority queues: process high-value requests first during degradation
490
+ - Test degradation modes regularly -- chaos engineering validates that fallbacks work
491
+
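A hand-rolled circuit breaker in the spirit of Hystrix or Resilience4j shows the core mechanic; the thresholds are illustrative, and production code would handle the half-open trial call more carefully:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()      # open: shed load, serve the fallback
            self.opened_at = None      # half-open: let one trial call through
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    raise ConnectionError("recommendations service down")

breaker = CircuitBreaker(failure_threshold=3)
results = [breaker.call(flaky, lambda: "cached recommendations")
           for _ in range(10)]
assert all(r == "cached recommendations" for r in results)
assert calls["n"] == 3   # after 3 failures the breaker stops hammering the dep
```

Users keep getting (stale) recommendations instead of error pages, and the failing dependency gets breathing room to recover.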
492
+ **Detection signals:**
493
+ - No circuit breakers in the codebase
494
+ - No feature flag system deployed
495
+ - Error pages are the only failure response -- no partial functionality
496
+ - Runbooks say "wait for recovery" instead of listing degradation steps
497
+
498
+ ---
499
+
500
+ ## 16. Monolithic Database Migrations at Scale
501
+
502
+ **Anti-pattern:** Running schema-altering DDL operations (adding columns, creating indexes, altering types) on large production tables using blocking operations that lock the table and halt all reads or writes for the duration of the migration.
503
+
504
+ **Why it happens:** ORMs generate migration files that use standard `ALTER TABLE` statements. These work fine on small tables. Developers test migrations on development databases with 1,000 rows, not production databases with 50 million rows. The migration that takes 200ms in development takes 45 minutes in production -- and locks the table the entire time.
505
+
506
+ **Real-world incident:** GitHub's June 2025 outage was triggered by a planned database migration that cascaded into a multi-hour incident disrupting repositories, pull requests, GitHub Actions, and dependent services. The migration triggered unanticipated load patterns on primary database clusters, causing cascading failures. In another documented case, adding an index to a 50-million-row table locked the entire table, blocking all reads and writes for 45 minutes, with downstream cost estimated at $5,600 per minute of downtime.
507
+
508
+ **Why it fails:**
509
+ - `ALTER TABLE` acquires exclusive locks on large tables, blocking all queries
510
+ - Index creation on large tables can take minutes to hours
511
+ - Failed migrations may leave the schema in an inconsistent state
512
+ - Rollback of a partially-applied migration can be more dangerous than the migration itself
513
+ - Multiple services depending on the same table are all affected simultaneously
514
+
515
+ **The fix:**
516
+ - Use online schema change tools: `pt-online-schema-change` (MySQL), `pg_repack` (PostgreSQL), `gh-ost` (GitHub's own tool)
517
+ - Add columns as nullable first, backfill data, then add constraints
518
+ - Create indexes with `CONCURRENTLY` (PostgreSQL) or equivalent non-blocking syntax
519
+ - Implement expand-contract migrations: add the new schema alongside the old, migrate data, then remove the old
520
+ - Test migrations against production-sized datasets before deploying
521
+ - Use feature flags to gradually shift traffic to new schema paths
522
+ - Schedule migrations during low-traffic windows with rollback plans
523
+
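The expand-contract sequence can be walked through end to end. Here sqlite3 stands in for the production database purely so the steps are runnable; on PostgreSQL the expand phase would also use `CREATE INDEX CONCURRENTLY` for any new indexes, and the NOT NULL constraint would be added only after validation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("a",), ("b",), ("c",)])

# Expand: add the column NULLABLE first -- a metadata-only change on most
# engines, so it takes no long table lock even on huge tables.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Backfill in small batches so no single transaction locks everything.
batch_size = 2
while True:
    rows = conn.execute(
        "SELECT id FROM users WHERE email IS NULL LIMIT ?",
        (batch_size,)).fetchall()
    if not rows:
        break
    conn.executemany(
        "UPDATE users SET email = 'user' || ? || '@example.com' WHERE id = ?",
        [(r[0], r[0]) for r in rows])
    conn.commit()

# Contract: only once the backfill is verified complete do you enforce
# constraints and delete the old code paths.
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL").fetchone()[0]
assert remaining == 0
```

Each phase is independently deployable and reversible, which is what makes the pattern safe on 50-million-row tables.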
524
+ **Detection signals:**
525
+ - Migration files contain raw `ALTER TABLE ... ADD COLUMN ... NOT NULL`
526
+ - No online schema change tooling in the deployment pipeline
527
+ - Migration testing only runs against seeded development databases
528
+ - Lock wait timeout errors during deployments
529
+
530
+ ---
531
+
532
+ ## 17. Network Calls in Loops
533
+
534
+ **Anti-pattern:** Making individual network requests (database queries, API calls, cache lookups) inside a loop, turning what should be a single batch operation into N sequential round-trips, each paying full network latency.
535
+
536
+ **Why it happens:** ORMs make it easy: `for order in orders: order.customer.name` triggers a query per iteration (the N+1 problem). REST APIs expose individual resources, so fetching related data requires one call per item. The code reads naturally and works correctly -- it is just catastrophically slow at scale.
537
+
538
+ **Real-world incident:** A developer documented reducing API response time from 30 seconds to under 1 second by eliminating N+1 queries. The endpoint listed 500 orders, and for each order, made a separate database query to fetch the customer -- 501 total queries. Each query took 2ms on the network, but 501 x 2ms = 1 second of pure network latency, plus database processing time. Replacing this with a single `WHERE customer_id IN (...)` query reduced the total to 2 queries and sub-100ms response time. At high traffic, the N+1 version generated 50,000+ queries per second from a single endpoint.
539
+
540
+ **Why it fails:**
541
+ - Each network call adds 1-5ms of latency (TCP round-trip), which multiplies by N
542
+ - Database connection pool is consumed by N concurrent connections for one user request
543
+ - Serialization/deserialization overhead multiplies by N
544
+ - Total latency grows linearly with data size, making the endpoint unusable as data grows
545
+ - Database query logs show thousands of nearly-identical queries
546
+
547
+ **The fix:**
548
+ - Use batch APIs: `GET /users?ids=1,2,3` instead of N individual calls
549
+ - Use eager loading in ORMs: `includes(:customer)` (Rails), `joinedload()` (SQLAlchemy), `.Include()` (EF Core)
550
+ - Implement DataLoader pattern for GraphQL (batches and deduplicates within a request)
551
+ - Replace loops with `WHERE ... IN (...)` queries
552
+ - Use database views or materialized views to pre-join data
553
+ - Add N+1 detection tools: Bullet (Rails), nplusone (Django), SQLAlchemy warnings
554
+
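The 501-query shape and its batched replacement are easy to see side by side. This sketch uses sqlite3 so both paths run against the same data; an ORM's eager loading generates the equivalent of the second query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer-{i}") for i in range(1, 6)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, (i % 5) + 1) for i in range(1, 501)])

orders = conn.execute("SELECT id, customer_id FROM orders").fetchall()

# N+1: one lookup per order -- 501 round-trips against the database.
names_slow = {}
for _, customer_id in orders:
    row = conn.execute("SELECT name FROM customers WHERE id = ?",
                       (customer_id,)).fetchone()
    names_slow[customer_id] = row[0]

# Batched: collect the ids, then fetch them all in ONE query.
ids = sorted({customer_id for _, customer_id in orders})
placeholders = ",".join("?" * len(ids))
names_fast = dict(conn.execute(
    f"SELECT id, name FROM customers WHERE id IN ({placeholders})", ids))

assert names_fast == names_slow   # identical result, 2 queries not 501
```

The batched path pays the network round-trip cost twice total instead of once per row, which is where the 30-second-to-sub-second improvement comes from.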
555
+ **Detection signals:**
556
+ - Database query logs show repeated queries with only the parameter changing
557
+ - Endpoint latency scales linearly with result set size
558
+ - ORM lazy-loading enabled with no eager-loading configuration
559
+ - API calls inside `for`, `forEach`, `map`, or `while` loops
560
+
561
+ ---
562
+
563
+ ## 18. Fan-Out Without Fan-In Limits
564
+
565
+ **Anti-pattern:** Designing systems where a single request triggers many parallel downstream requests (fan-out) without limiting how many can be in flight simultaneously, risking overwhelming downstream services and creating cascading failures.
566
+
567
+ **Why it happens:** Fan-out is a natural pattern: a social media timeline request fans out to N friend feeds, a search request fans out to N index shards, an API gateway fans out to N microservices. The pattern works at small N but becomes dangerous as N grows. Developers set fan-out based on the data model without considering the downstream capacity.
568
+
569
+ **Real-world context:** LinkedIn published research (Moolle, ICDE 2016) on fan-out control for scalable distributed data stores, documenting how unlimited fan-out in their social graph queries could overwhelm backend storage nodes. They found that keeping dependency chains shallow (1-2 levels) and limiting parallel requests per tier was essential for stability. Without fan-in limits, a single user request could generate thousands of backend queries, and a modest traffic spike would amplify into a backend-crushing storm.
570
+
571
+ **Why it fails:**
572
+ - N downstream calls means N chances for failure -- probability of at least one failure approaches 1
573
+ - Total latency is bounded by the slowest of N calls (tail latency amplification)
574
+ - Downstream services experience N x traffic amplification from fan-out
575
+ - Retry logic on fan-out calls multiplies the amplification effect
576
+ - One slow downstream service blocks the entire fan-in, wasting the fast responses
577
+
578
+ **The fix:**
579
+ - Set explicit concurrency limits on fan-out calls (semaphores, worker pools)
580
+ - Implement timeouts per fan-out call -- do not wait for stragglers
581
+ - Use hedged requests: send a second request after a timeout, take whichever finishes first
582
+ - Apply circuit breakers per downstream service
583
+ - Return partial results when some fan-out calls fail (graceful degradation)
584
+ - Isolate resource pools for high-fan-out calls so they cannot starve other workloads
585
+ - Monitor fan-out factor per endpoint and alert when it exceeds expected bounds
586
+
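Bounding the fan-out is a few lines with a semaphore plus a per-call timeout. The sketch below caps in-flight calls at 10 and converts stragglers into `None` partial results; `fetch_feed` is a stand-in for a real downstream call:

```python
import asyncio

async def fetch_feed(i):
    await asyncio.sleep(0.01)        # stand-in for a downstream HTTP call
    return i * 2

async def bounded_fan_out(coros, limit=10, per_call_timeout=1.0):
    sem = asyncio.Semaphore(limit)   # at most `limit` calls in flight
    async def guarded(coro):
        async with sem:
            try:
                return await asyncio.wait_for(coro, per_call_timeout)
            except asyncio.TimeoutError:
                return None          # partial result beats waiting on stragglers
    return await asyncio.gather(*(guarded(c) for c in coros))

results = asyncio.run(bounded_fan_out([fetch_feed(i) for i in range(100)]))
assert results == [i * 2 for i in range(100)]
```

Compare this with the bare `asyncio.gather(*[fetch(i) for i in items])` from the detection signals: same interface for the caller, but downstream services see at most 10 concurrent requests per upstream request instead of 100.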
587
+ **Detection signals:**
588
+ - `Promise.all()` or `asyncio.gather()` with unbounded arrays of calls
589
+ - No concurrency limit on parallel HTTP client calls
590
+ - Tail latency (p99) is much higher than median (p50) due to straggler effect
591
+ - Downstream services report traffic spikes correlated with upstream deployments
592
+
593
+ ---
594
+
595
+ ## 19. Not Using Read Replicas When Read-Heavy
596
+
597
+ **Anti-pattern:** Sending all database queries -- reads and writes -- to a single primary instance, even when the workload is 90%+ reads, leaving the primary overloaded with read traffic that could be served by replicas.
598
+
599
+ **Why it happens:** Using a single database endpoint is simpler. Application code does not need to distinguish between read and write connections. ORMs default to a single connection. Teams do not realize their workload is read-heavy because they have never measured the read/write ratio. Adding read replicas requires code changes to route queries.
600
+
601
+ **Real-world context:** AWS RDS documentation emphasizes that read replicas provide horizontal scaling by offloading read-intensive workloads from the primary instance. Most web applications are 80-95% reads. A primary database handling 10,000 queries/second where 9,500 are reads could offload those to 2-3 replicas, reducing primary load by 95% and freeing it for writes. However, RDS does not automatically route reads to replicas -- the application must explicitly direct read traffic to replica endpoints.
602
+
603
+ **Why it fails:**
604
+ - Primary database CPU is saturated by read queries, slowing write transactions
605
+ - Vertical scaling the primary is expensive and has limits
606
+ - Read-heavy endpoints (dashboards, feeds, search results) dominate the query mix
607
+ - The primary cannot be scaled horizontally for writes, making read offloading essential
608
+ - Failover promotes a replica to primary, but if no replicas exist, failover means downtime
609
+
610
+ **The fix:**
611
+ - Measure the read/write ratio -- if reads exceed 70%, add replicas
612
+ - Configure the ORM for read/write splitting: write to primary, read from replica
613
+ - Use a database proxy (ProxySQL, Amazon RDS Proxy, PgPool) for automatic routing
614
+ - Accept eventual consistency for read-replica queries (typical lag is under 1 second)
615
+ - Monitor replica lag (`ReplicaLag` metric) and route reads back to the primary when lag exceeds acceptable thresholds
616
+ - Size replica count based on: `num_replicas = ceil(read_qps / single_instance_read_capacity)`
617
+
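Read/write splitting can live in a thin routing layer. This toy router sends SELECTs round-robin across replica endpoints and everything else to the primary; the endpoint strings are invented, and a real deployment would use ProxySQL, RDS Proxy, or the ORM's router hooks instead:

```python
import itertools

class ReadWriteRouter:
    """Toy read/write splitter: SELECTs round-robin across replicas,
    all other statements hit the primary."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def endpoint_for(self, sql):
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

router = ReadWriteRouter("db-primary:5432",
                         ["db-replica-1:5432", "db-replica-2:5432"])
assert router.endpoint_for("SELECT * FROM orders") == "db-replica-1:5432"
assert router.endpoint_for("select 1") == "db-replica-2:5432"
assert router.endpoint_for("INSERT INTO orders VALUES (1)") == "db-primary:5432"
```

One caveat worth wiring in: reads issued immediately after a write by the same user may need to pin to the primary, since replica lag breaks read-your-writes consistency.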
618
+ **Detection signals:**
619
+ - Single database endpoint in application configuration
620
+ - Primary database CPU above 70% while query mix is majority SELECT
621
+ - No replica instances provisioned
622
+ - Read-heavy endpoints have higher latency than write endpoints
623
+
624
+ ---
625
+
626
+ ## 20. Over-Provisioning vs Under-Provisioning
627
+
628
+ **Anti-pattern:** Either allocating far more resources than needed (wasting money) or allocating too few (causing outages under load), rather than right-sizing based on data and implementing auto-scaling.
629
+
630
+ **Why it happens:** Over-provisioning is driven by fear -- "what if we get a traffic spike?" Under-provisioning is driven by cost pressure -- "can we run this cheaper?" Both are guesses. Without load testing data and auto-scaling, teams pick a static size and hope. Premature scaling was identified as a factor in 70% of tech startup failures (Startup Genome report).
631
+
632
+ **Real-world incident:** Groupon prioritized rapid customer acquisition and scaled infrastructure aggressively, spending heavily on capacity it did not need while the underlying business model was unsustainable. In contrast, Amazon grew methodically, focusing on dominating one market at a time and staying lean. On the under-provisioning side, Kubernetes environments frequently end up over-provisioned after an under-provisioning incident -- teams panic-scale to 3x capacity after an outage, then never right-size back down. Studies show organizations waste 30-35% of cloud spend on over-provisioned resources.
633
+
634
+ **Why it fails:**
635
+ - Over-provisioning wastes 30-35% of cloud spend on idle resources
636
+ - Under-provisioning causes outages during traffic spikes and degrades user experience
637
+ - Static provisioning cannot adapt to variable traffic patterns (day/night, weekday/weekend)
638
+ - Over-provisioned resources mask inefficient code -- there is no pressure to optimize
639
+ - Under-provisioned databases hit connection limits and IOPS caps under load
640
+
641
+ **The fix:**
642
+ - Implement auto-scaling based on actual metrics (CPU, memory, request queue depth)
643
+ - Right-size instances using utilization data: target 60-70% average CPU utilization
644
+ - Use spot/preemptible instances for fault-tolerant workloads (60-90% cost savings)
645
+ - Implement cost monitoring with alerts for spend anomalies
646
+ - Run regular right-sizing reviews using tools (AWS Compute Optimizer, GCP Recommender)
647
+ - Load test to determine actual capacity needs rather than guessing
648
+ - Use reserved instances or savings plans for predictable baseline load, spot for burst
649
+
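Right-sizing does not have to be guesswork. The proportional formula the Kubernetes Horizontal Pod Autoscaler uses gives a defensible starting point; the utilization numbers below are examples:

```python
import math

def desired_replicas(current, current_cpu, target_cpu=0.65):
    """Kubernetes-HPA-style sizing:
    desired = ceil(current * currentMetric / targetMetric)."""
    return max(1, math.ceil(current * current_cpu / target_cpu))

assert desired_replicas(10, 0.20) == 4   # over-provisioned: scale in
assert desired_replicas(4, 0.90) == 6    # under-provisioned: scale out
assert desired_replicas(6, 0.65) == 6    # already at target: no change
```

The same arithmetic works in reverse for a right-sizing review: instances averaging 20% CPU against a 65% target are carrying roughly 3x the capacity they need.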
650
+ **Detection signals:**
651
+ - Average CPU utilization below 20% (over-provisioned) or consistently above 85% (under-provisioned)
652
+ - No auto-scaling policies configured
653
+ - Instance sizes have not been reviewed in 6+ months
654
+ - Cloud bill growing faster than traffic or revenue
655
+
656
+ ---
657
+
658
+ ## Root Cause Analysis
659
+
660
+ Scaling anti-patterns cluster around five root causes:
661
+
662
+ ### 1. Single-Machine Mindset
663
+ **Anti-patterns:** #1, #2, #3, #4, #5, #19
664
+ **Root cause:** Designing for a single server because that is the development environment. State is stored locally, connections are opened per-request, and all traffic hits one database. The architecture works for one server and breaks for two.
665
+ **Systemic fix:** Adopt the Twelve-Factor App methodology. Treat servers as disposable. Externalize all state. Test with multiple instances from the beginning.
666
+
667
+ ### 2. No Capacity Planning
668
+ **Anti-patterns:** #10, #14, #20
669
+ **Root cause:** Capacity is unknown because it was never measured. Load testing is skipped, cold starts are untested, and provisioning is guesswork. The system's limits are discovered through production incidents.
670
+ **Systemic fix:** Make load testing a release gate. Measure capacity before every launch. Implement auto-scaling with validated thresholds.
671
+
672
+ ### 3. Unbounded Resource Consumption
673
+ **Anti-patterns:** #7, #8, #9, #18
674
+ **Root cause:** No limits on resource consumption -- queues, connections, fan-out, retries -- because setting limits feels like artificial constraints. Resources are consumed faster than they are released, and the system runs out.
675
+ **Systemic fix:** Every resource must have an explicit bound: queue depth, connection count, fan-out factor, retry budget. Design for rejection and backpressure from day one.
676
+
677
+ ### 4. Data Architecture Debt
678
+ **Anti-patterns:** #6, #11, #12, #16, #17, #19
679
+ **Root cause:** A single database handles all workloads. Schema changes are blocking. Shard keys are chosen without analysis. Queries are generated by ORM defaults. The data layer becomes the bottleneck that cannot be easily changed.
680
+ **Systemic fix:** Separate read and write paths. Choose shard keys based on access pattern analysis. Use online schema change tools. Audit ORM-generated queries.
681
+
682
+ ### 5. Complexity Over Simplification
683
+ **Anti-patterns:** #13, #15
684
+ **Root cause:** Adding components to solve scaling problems without first understanding the bottleneck. More layers means more failure modes, more latency, and more operational burden.
685
+ **Systemic fix:** Profile before scaling. Remove unnecessary layers. Design degradation modes that reduce functionality rather than adding infrastructure.
686
+
687
+ ---
688
+
689
+ ## Self-Check Questions
690
+
691
+ Use these questions during design reviews and architecture assessments:
692
+
693
+ ### Statelessness and Horizontal Scaling
694
+ - [ ] Can we add a second instance behind a load balancer with zero code changes?
695
+ - [ ] Is all user-facing state stored in an external service (database, Redis, S3)?
696
+ - [ ] Can any instance handle any request, or are requests tied to specific instances?
697
+ - [ ] Are we using sticky sessions? If so, do we have a plan to remove them?
698
+
699
+ ### Database and Data Layer
700
+ - [ ] Is the database handling mixed workloads (OLTP + analytics + search)?
701
+ - [ ] Do we have read replicas configured for read-heavy endpoints?
702
+ - [ ] What is our read/write ratio? Have we measured it?
703
+ - [ ] Are database migrations tested against production-sized datasets?
704
+ - [ ] Do we use online schema change tools for migrations?
705
+ - [ ] Is connection pooling configured with explicit pool sizes?
706
+
707
+ ### Queues and Backpressure
708
+ - [ ] Do all queues have explicit maximum sizes?
709
+ - [ ] What happens when a queue is full? (Drop? Reject? Dead-letter?)
710
+ - [ ] Do we have rate limiting on public-facing endpoints?
711
+ - [ ] What is the maximum queue depth we can tolerate before latency is unacceptable?
+
+ ### Failure and Degradation
+ - [ ] What happens when a downstream service is unavailable?
+ - [ ] Do we have circuit breakers for all external dependencies?
+ - [ ] Can we disable non-critical features during overload?
+ - [ ] Do retries include jitter and exponential backoff?
+ - [ ] Is there a retry budget to prevent a thundering herd?
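The last two questions can be sketched together: exponential backoff with full jitter, plus a simple token-based retry budget. All names here are illustrative, not a specific library's API; production code would typically reach for an existing resilience library.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base=0.1, cap=5.0, budget=None):
    """Exponential backoff with full jitter, plus an optional shared
    retry budget (a mutable token counter) so a fleet of clients
    cannot turn one outage into a synchronized retry storm."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # attempts exhausted
            if budget is not None:
                if budget["tokens"] <= 0:
                    raise                  # budget spent: fail fast
                budget["tokens"] -= 1
            # Full jitter: sleep a random amount up to the capped backoff,
            # so retries from many clients spread out instead of aligning.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The fixed-delay `sleep(5); retry()` pattern from the smell table fails on both counts: every client retries at the same moment, and nothing limits total retry volume.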
+
+ ### Capacity and Load
+ - [ ] Have we load-tested at 2x expected peak?
+ - [ ] Do we know where the system breaks under load?
+ - [ ] Are auto-scaling policies configured and validated?
+ - [ ] Is our provisioning based on data or guesswork?
+ - [ ] Do we know the cold start time for new instances?
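Answering the first two questions does not require heavy tooling to start. A minimal closed-loop driver that reports latency percentiles is enough to see where a handler begins to degrade; dedicated tools like k6 or Locust are the natural next step.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def measure(handler, requests=200, concurrency=20):
    """Drive `handler` with concurrent calls and report latency
    percentiles. Raise `concurrency` past expected peak to find
    where the knee in the latency curve sits."""
    latencies = []
    def one(i):
        t0 = time.perf_counter()
        handler(i)
        latencies.append(time.perf_counter() - t0)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one, range(requests)))
    latencies.sort()
    return {
        "p50": statistics.median(latencies),
        "p99": latencies[int(0.99 * (len(latencies) - 1))],
    }
```

Running the same measurement at 1x, 2x, and 4x expected peak turns "where does the system break?" from guesswork into a data point.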
+
+ ### Complexity
+ - [ ] Can every team member explain the full request path?
+ - [ ] Are we adding a component to solve a problem, or to avoid understanding one?
+ - [ ] Could we remove a layer instead of adding one?
+
+ ---
733
+
734
+ ## Code Smell Quick Reference
735
+
736
+ | Smell | Anti-Pattern | Severity |
737
+ |---|---|---|
738
+ | Single database connection string across all services | #6 Single DB | High |
739
+ | `session.sticky = true` in load balancer config | #3 Sticky Sessions | High |
740
+ | `fs.writeFile` / `File.write` for user uploads | #4 Local Filesystem | High |
741
+ | No `maxPoolSize` or pool config in DB connection | #5 No Connection Pool | Critical |
742
+ | `for item in items: db.query(item.id)` | #17 N+1 / Network Loops | High |
743
+ | `Promise.all(unboundedArray.map(fetch))` | #18 Unbounded Fan-Out | High |
744
+ | Queue instantiated with no `maxlen` / `capacity` | #9 Unbounded Queue | High |
745
+ | Retry with fixed delay: `sleep(5); retry()` | #7 Thundering Herd | Medium |
746
+ | No `429` or `503` responses in access logs | #8 No Backpressure | High |
747
+ | `ALTER TABLE` without `CONCURRENTLY` in migration | #16 Blocking Migration | Critical |
748
+ | Single DB endpoint; no replica endpoint in config | #19 No Read Replicas | Medium |
749
+ | No load test scripts or tools in repository | #10 No Load Testing | High |
750
+ | Auto-scaling min = max (fixed instance count) | #20 Static Provisioning | Medium |
751
+ | No circuit breaker library in dependencies | #15 No Degradation | High |
752
+ | Instance type is `*.4xlarge` or higher | #1 Vertical Only | Medium |
753
+ | Shard key is `created_at` or `timestamp` | #11 Hot Shards | High |
754
+ | `BEGIN DISTRIBUTED TRANSACTION` in query logs | #12 Cross-Shard Txn | Medium |
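Several smells above share one cure: bound the concurrency. As one example, the `Promise.all(unboundedArray.map(fetch))` smell (#18) maps to this semaphore-capped gather, sketched here in Python (`bounded_gather` is an illustrative helper, not a library function):

```python
import asyncio

async def bounded_gather(coro_fns, limit=10):
    """Run zero-argument coroutine factories with at most `limit`
    in flight, instead of launching the whole array at once."""
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:          # at most `limit` bodies execute concurrently
            return await fn()

    return await asyncio.gather(*(run(fn) for fn in coro_fns))

async def demo():
    async def fetch(i):          # stand-in for a real network call
        await asyncio.sleep(0.01)
        return i * 2
    # 100 jobs, never more than 10 in flight at once.
    return await bounded_gather([lambda i=i: fetch(i) for i in range(100)], limit=10)

results = asyncio.run(demo())
assert results[3] == 6
```

The same shape (a semaphore or worker pool in front of the fan-out) also addresses the N+1 loop smell when batching is not available.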
+
+ ---
+
+ ## Sources
+
+ - [GitHub Database Infrastructure Outages (Feb 2020)](https://devclass.com/2020/03/27/github-reveals-database-infrastructure-was-the-villain-behind-february-spate-of-outages-again/)
+ - [GitHub's June 2025 Outage: Database Migration Cascade](https://www.webpronews.com/githubs-june-2025-outage-how-a-routine-database-migration-cascaded-into-a-platform-wide-crisis/)
+ - [PostgreSQL Connection Pool Exhaustion -- Lessons from a Production Outage](https://www.c-sharpcorner.com/article/postgresql-connection-pool-exhaustion-lessons-from-a-production-outage/)
+ - [Distributed Systems Horror Stories: The Thundering Herd Problem (Encore)](https://encore.dev/blog/thundering-herd-problem)
+ - [Scaling Depot: Solving a Thundering Herd Problem](https://depot.dev/blog/planetscale-to-reduce-the-thundering-herd)
+ - [Thundering Herd Problem Explained (Medium)](https://medium.com/@work.dhairya.singla/the-thundering-herd-problem-explained-causes-examples-and-solutions-7166b7e26c0c)
+ - [Pod Auto Scaling and the Curse of Sticky Sessions (Medium)](https://medium.com/nerd-for-tech/how-session-stickiness-disrupts-pod-auto-scaling-in-kubernetes-17ece8e2ea4f)
+ - [Scaling Horizontally: Kubernetes, Sticky Sessions, and Redis](https://dev.to/deepak_mishra_35863517037/scaling-horizontally-kubernetes-sticky-sessions-and-redis-578o)
+ - [How CodinGame Survived a Reddit Hug of Death](https://www.codingame.com/blog/how-did-codingame-survive-reddit-hug-of/)
+ - [Everyone's Doing Database Sharding Wrong ($2M Failure)](https://medium.com/@jholt1055/everyones-doing-database-sharding-wrong-here-s-why-your-2m-sharding-project-will-fail-de7f52d944a4)
+ - [Challenges of Sharding: Data Hotspots and Imbalanced Shards](https://dohost.us/index.php/2025/10/03/challenges-of-sharding-data-hotspots-and-imbalanced-shards/)
+ - [Wazuh Memory Exhaustion in Unbounded Queue (GitHub Issue)](https://github.com/wazuh/wazuh/issues/31240)
+ - [Authentik Worker OOM from Unbounded Queue Growth (GitHub Issue)](https://github.com/goauthentik/authentik/issues/18915)
+ - [Understanding and Remediating Cold Starts: AWS Lambda (AWS Blog)](https://aws.amazon.com/blogs/compute/understanding-and-remediating-cold-starts-an-aws-lambda-perspective/)
+ - [Zero Downtime Migrations at Petabyte Scale (PlanetScale)](https://planetscale.com/blog/zero-downtime-migrations-at-petabyte-scale)
+ - [Solving the N+1 Query Problem: 30s to Under 1s (Medium)](https://medium.com/@nkangprecious26/solving-the-n-1-query-problem-how-i-reduced-api-response-time-from-30s-to-1s-1fcd819c34e6)
+ - [Moolle: Fan-out Control for Scalable Distributed Data Stores (LinkedIn, ICDE 2016)](https://content.linkedin.com/content/dam/engineering/site-assets/pdfs/ICDE16_industry_571.pdf)
+ - [It's Time to Move on from Two Phase Commit (Daniel Abadi)](http://dbmsmusings.blogspot.com/2019/01/its-time-to-move-on-from-two-phase.html)
+ - [Vertical vs Horizontal Scaling (CockroachDB)](https://www.cockroachlabs.com/blog/vertical-scaling-vs-horizontal-scaling/)
+ - [The 7 Deadly Sins of Startups: Premature Scaling (Medium)](https://medium.com/superteam/danger-the-7-deadly-sins-of-startups-premature-scaling-1d2a976e2540)
+ - [Kubernetes Overprovisioning: The Hidden Cost (DEV Community)](https://dev.to/naveens16/kubernetes-overprovisioning-the-hidden-cost-of-chasing-performance-and-how-to-escape-114k)
+ - [Design for Chaos: Fastly's Principles of Fault Isolation](https://www.fastly.com/blog/design-for-chaos-fastlys-principles-of-fault-isolation-and-graceful)
+ - [AWS Well-Architected: Implement Graceful Degradation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_mitigate_interaction_failure_graceful_degradation.html)
+ - [N+1 API Calls Detection (Sentry)](https://docs.sentry.io/product/issues/issue-details/performance-issues/n-one-api-calls/)
+ - [AWS RDS Read Replicas Documentation](https://aws.amazon.com/rds/features/read-replicas/)
+ - [Dan Luu's Post-Mortems Collection (GitHub)](https://github.com/danluu/post-mortems)