@wazir-dev/cli 1.0.0
This diff shows the content of a publicly available package version released to a supported registry. It is provided for informational purposes only and reflects the package contents as they appear in the public registry.
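Since this listing covers the initial 1.0.0 release of a scoped npm package, it can presumably be installed by name; a minimal sketch, assuming standard npm publication (the `-g` flag and any CLI entry point are assumptions, not confirmed by this listing):

```shell
# Install the CLI globally from the npm registry (assumes public access)
npm install -g @wazir-dev/cli

# Or run it once without a global install via npx
npx @wazir-dev/cli
```

Either form resolves the same published tarball; `npx` avoids polluting the global prefix when trying the tool out.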
- package/AGENTS.md +111 -0
- package/CHANGELOG.md +14 -0
- package/CONTRIBUTING.md +101 -0
- package/LICENSE +21 -0
- package/README.md +314 -0
- package/assets/composition-engine.mmd +34 -0
- package/assets/demo-script.sh +17 -0
- package/assets/logo-dark.svg +14 -0
- package/assets/logo.svg +14 -0
- package/assets/pipeline.mmd +39 -0
- package/assets/record-demo.sh +51 -0
- package/docs/README.md +51 -0
- package/docs/adapters/context-mode.md +60 -0
- package/docs/concepts/architecture.md +87 -0
- package/docs/concepts/artifact-model.md +60 -0
- package/docs/concepts/composition-engine.md +36 -0
- package/docs/concepts/indexing-and-recall.md +160 -0
- package/docs/concepts/observability.md +41 -0
- package/docs/concepts/roles-and-workflows.md +59 -0
- package/docs/concepts/terminology-policy.md +27 -0
- package/docs/getting-started/01-installation.md +78 -0
- package/docs/getting-started/02-first-run.md +102 -0
- package/docs/getting-started/03-adding-to-project.md +15 -0
- package/docs/getting-started/04-host-setup.md +15 -0
- package/docs/guides/ci-integration.md +15 -0
- package/docs/guides/creating-skills.md +15 -0
- package/docs/guides/expertise-module-authoring.md +15 -0
- package/docs/guides/hook-development.md +15 -0
- package/docs/guides/memory-and-learnings.md +34 -0
- package/docs/guides/multi-host-export.md +15 -0
- package/docs/guides/troubleshooting.md +101 -0
- package/docs/guides/writing-custom-roles.md +15 -0
- package/docs/plans/2026-03-15-cli-pipeline-integration-design.md +592 -0
- package/docs/plans/2026-03-15-cli-pipeline-integration-plan.md +598 -0
- package/docs/plans/2026-03-15-docs-enforcement-plan.md +238 -0
- package/docs/readmes/INDEX.md +99 -0
- package/docs/readmes/features/expertise/README.md +171 -0
- package/docs/readmes/features/exports/README.md +222 -0
- package/docs/readmes/features/hooks/README.md +103 -0
- package/docs/readmes/features/hooks/loop-cap-guard.md +133 -0
- package/docs/readmes/features/hooks/post-tool-capture.md +121 -0
- package/docs/readmes/features/hooks/post-tool-lint.md +130 -0
- package/docs/readmes/features/hooks/pre-compact-summary.md +122 -0
- package/docs/readmes/features/hooks/pre-tool-capture-route.md +100 -0
- package/docs/readmes/features/hooks/protected-path-write-guard.md +128 -0
- package/docs/readmes/features/hooks/session-start.md +119 -0
- package/docs/readmes/features/hooks/stop-handoff-harvest.md +125 -0
- package/docs/readmes/features/roles/README.md +157 -0
- package/docs/readmes/features/roles/clarifier.md +152 -0
- package/docs/readmes/features/roles/content-author.md +190 -0
- package/docs/readmes/features/roles/designer.md +193 -0
- package/docs/readmes/features/roles/executor.md +184 -0
- package/docs/readmes/features/roles/learner.md +210 -0
- package/docs/readmes/features/roles/planner.md +182 -0
- package/docs/readmes/features/roles/researcher.md +164 -0
- package/docs/readmes/features/roles/reviewer.md +184 -0
- package/docs/readmes/features/roles/specifier.md +162 -0
- package/docs/readmes/features/roles/verifier.md +215 -0
- package/docs/readmes/features/schemas/README.md +178 -0
- package/docs/readmes/features/skills/README.md +63 -0
- package/docs/readmes/features/skills/brainstorming.md +96 -0
- package/docs/readmes/features/skills/debugging.md +148 -0
- package/docs/readmes/features/skills/design.md +120 -0
- package/docs/readmes/features/skills/prepare-next.md +109 -0
- package/docs/readmes/features/skills/run-audit.md +159 -0
- package/docs/readmes/features/skills/scan-project.md +109 -0
- package/docs/readmes/features/skills/self-audit.md +176 -0
- package/docs/readmes/features/skills/tdd.md +137 -0
- package/docs/readmes/features/skills/using-skills.md +92 -0
- package/docs/readmes/features/skills/verification.md +120 -0
- package/docs/readmes/features/skills/writing-plans.md +104 -0
- package/docs/readmes/features/tooling/README.md +320 -0
- package/docs/readmes/features/workflows/README.md +186 -0
- package/docs/readmes/features/workflows/author.md +181 -0
- package/docs/readmes/features/workflows/clarify.md +154 -0
- package/docs/readmes/features/workflows/design-review.md +171 -0
- package/docs/readmes/features/workflows/design.md +169 -0
- package/docs/readmes/features/workflows/discover.md +162 -0
- package/docs/readmes/features/workflows/execute.md +173 -0
- package/docs/readmes/features/workflows/learn.md +167 -0
- package/docs/readmes/features/workflows/plan-review.md +165 -0
- package/docs/readmes/features/workflows/plan.md +170 -0
- package/docs/readmes/features/workflows/prepare-next.md +167 -0
- package/docs/readmes/features/workflows/review.md +169 -0
- package/docs/readmes/features/workflows/run-audit.md +191 -0
- package/docs/readmes/features/workflows/spec-challenge.md +159 -0
- package/docs/readmes/features/workflows/specify.md +160 -0
- package/docs/readmes/features/workflows/verify.md +177 -0
- package/docs/readmes/packages/README.md +50 -0
- package/docs/readmes/packages/ajv.md +117 -0
- package/docs/readmes/packages/context-mode.md +118 -0
- package/docs/readmes/packages/gray-matter.md +116 -0
- package/docs/readmes/packages/node-test.md +137 -0
- package/docs/readmes/packages/yaml.md +112 -0
- package/docs/reference/configuration-reference.md +159 -0
- package/docs/reference/expertise-index.md +52 -0
- package/docs/reference/git-flow.md +43 -0
- package/docs/reference/hooks.md +87 -0
- package/docs/reference/host-exports.md +50 -0
- package/docs/reference/launch-checklist.md +172 -0
- package/docs/reference/marketplace-listings.md +76 -0
- package/docs/reference/release-process.md +34 -0
- package/docs/reference/roles-reference.md +77 -0
- package/docs/reference/skills.md +33 -0
- package/docs/reference/templates.md +29 -0
- package/docs/reference/tooling-cli.md +94 -0
- package/docs/truth-claims.yaml +222 -0
- package/expertise/PROGRESS.md +63 -0
- package/expertise/README.md +18 -0
- package/expertise/antipatterns/PROGRESS.md +56 -0
- package/expertise/antipatterns/backend/api-design-antipatterns.md +1271 -0
- package/expertise/antipatterns/backend/auth-antipatterns.md +1195 -0
- package/expertise/antipatterns/backend/caching-antipatterns.md +622 -0
- package/expertise/antipatterns/backend/database-antipatterns.md +1038 -0
- package/expertise/antipatterns/backend/index.md +24 -0
- package/expertise/antipatterns/backend/microservices-antipatterns.md +850 -0
- package/expertise/antipatterns/code/architecture-antipatterns.md +919 -0
- package/expertise/antipatterns/code/async-antipatterns.md +622 -0
- package/expertise/antipatterns/code/code-smells.md +1186 -0
- package/expertise/antipatterns/code/dependency-antipatterns.md +1209 -0
- package/expertise/antipatterns/code/error-handling-antipatterns.md +1360 -0
- package/expertise/antipatterns/code/index.md +27 -0
- package/expertise/antipatterns/code/naming-and-abstraction.md +1118 -0
- package/expertise/antipatterns/code/state-management-antipatterns.md +1076 -0
- package/expertise/antipatterns/code/testing-antipatterns.md +1053 -0
- package/expertise/antipatterns/design/accessibility-antipatterns.md +1136 -0
- package/expertise/antipatterns/design/dark-patterns.md +1121 -0
- package/expertise/antipatterns/design/index.md +22 -0
- package/expertise/antipatterns/design/ui-antipatterns.md +1202 -0
- package/expertise/antipatterns/design/ux-antipatterns.md +680 -0
- package/expertise/antipatterns/frontend/css-layout-antipatterns.md +691 -0
- package/expertise/antipatterns/frontend/flutter-antipatterns.md +1827 -0
- package/expertise/antipatterns/frontend/index.md +23 -0
- package/expertise/antipatterns/frontend/mobile-antipatterns.md +573 -0
- package/expertise/antipatterns/frontend/react-antipatterns.md +1128 -0
- package/expertise/antipatterns/frontend/spa-antipatterns.md +1235 -0
- package/expertise/antipatterns/index.md +31 -0
- package/expertise/antipatterns/performance/index.md +20 -0
- package/expertise/antipatterns/performance/performance-antipatterns.md +1013 -0
- package/expertise/antipatterns/performance/premature-optimization.md +623 -0
- package/expertise/antipatterns/performance/scaling-antipatterns.md +785 -0
- package/expertise/antipatterns/process/ai-coding-antipatterns.md +853 -0
- package/expertise/antipatterns/process/code-review-antipatterns.md +656 -0
- package/expertise/antipatterns/process/deployment-antipatterns.md +920 -0
- package/expertise/antipatterns/process/index.md +23 -0
- package/expertise/antipatterns/process/technical-debt-antipatterns.md +647 -0
- package/expertise/antipatterns/security/index.md +20 -0
- package/expertise/antipatterns/security/secrets-antipatterns.md +849 -0
- package/expertise/antipatterns/security/security-theater.md +843 -0
- package/expertise/antipatterns/security/vulnerability-patterns.md +801 -0
- package/expertise/architecture/PROGRESS.md +70 -0
- package/expertise/architecture/data/caching-architecture.md +671 -0
- package/expertise/architecture/data/data-consistency.md +574 -0
- package/expertise/architecture/data/data-modeling.md +536 -0
- package/expertise/architecture/data/event-streams-and-queues.md +634 -0
- package/expertise/architecture/data/index.md +25 -0
- package/expertise/architecture/data/search-architecture.md +663 -0
- package/expertise/architecture/data/sql-vs-nosql.md +708 -0
- package/expertise/architecture/decisions/architecture-decision-records.md +640 -0
- package/expertise/architecture/decisions/build-vs-buy.md +616 -0
- package/expertise/architecture/decisions/index.md +23 -0
- package/expertise/architecture/decisions/monolith-to-microservices.md +790 -0
- package/expertise/architecture/decisions/technology-selection.md +616 -0
- package/expertise/architecture/distributed/cap-theorem-and-tradeoffs.md +800 -0
- package/expertise/architecture/distributed/circuit-breaker-bulkhead.md +741 -0
- package/expertise/architecture/distributed/consensus-and-coordination.md +796 -0
- package/expertise/architecture/distributed/distributed-systems-fundamentals.md +564 -0
- package/expertise/architecture/distributed/idempotency-and-retry.md +796 -0
- package/expertise/architecture/distributed/index.md +25 -0
- package/expertise/architecture/distributed/saga-pattern.md +797 -0
- package/expertise/architecture/foundations/architectural-thinking.md +460 -0
- package/expertise/architecture/foundations/coupling-and-cohesion.md +770 -0
- package/expertise/architecture/foundations/design-principles-solid.md +649 -0
- package/expertise/architecture/foundations/domain-driven-design.md +719 -0
- package/expertise/architecture/foundations/index.md +25 -0
- package/expertise/architecture/foundations/separation-of-concerns.md +472 -0
- package/expertise/architecture/foundations/twelve-factor-app.md +797 -0
- package/expertise/architecture/index.md +34 -0
- package/expertise/architecture/integration/api-design-graphql.md +638 -0
- package/expertise/architecture/integration/api-design-grpc.md +804 -0
- package/expertise/architecture/integration/api-design-rest.md +892 -0
- package/expertise/architecture/integration/index.md +25 -0
- package/expertise/architecture/integration/third-party-integration.md +795 -0
- package/expertise/architecture/integration/webhooks-and-callbacks.md +1152 -0
- package/expertise/architecture/integration/websockets-realtime.md +791 -0
- package/expertise/architecture/mobile-architecture/index.md +22 -0
- package/expertise/architecture/mobile-architecture/mobile-app-architecture.md +780 -0
- package/expertise/architecture/mobile-architecture/mobile-backend-for-frontend.md +670 -0
- package/expertise/architecture/mobile-architecture/offline-first.md +719 -0
- package/expertise/architecture/mobile-architecture/push-and-sync.md +782 -0
- package/expertise/architecture/patterns/cqrs-event-sourcing.md +717 -0
- package/expertise/architecture/patterns/event-driven.md +797 -0
- package/expertise/architecture/patterns/hexagonal-clean-architecture.md +870 -0
- package/expertise/architecture/patterns/index.md +27 -0
- package/expertise/architecture/patterns/layered-architecture.md +736 -0
- package/expertise/architecture/patterns/microservices.md +753 -0
- package/expertise/architecture/patterns/modular-monolith.md +692 -0
- package/expertise/architecture/patterns/monolith.md +626 -0
- package/expertise/architecture/patterns/plugin-architecture.md +735 -0
- package/expertise/architecture/patterns/serverless.md +780 -0
- package/expertise/architecture/scaling/database-scaling.md +615 -0
- package/expertise/architecture/scaling/feature-flags-and-rollouts.md +757 -0
- package/expertise/architecture/scaling/horizontal-vs-vertical.md +606 -0
- package/expertise/architecture/scaling/index.md +24 -0
- package/expertise/architecture/scaling/multi-tenancy.md +800 -0
- package/expertise/architecture/scaling/stateless-design.md +787 -0
- package/expertise/backend/embedded-firmware.md +625 -0
- package/expertise/backend/go.md +853 -0
- package/expertise/backend/index.md +24 -0
- package/expertise/backend/java-spring.md +448 -0
- package/expertise/backend/node-typescript.md +625 -0
- package/expertise/backend/python-fastapi.md +724 -0
- package/expertise/backend/rust.md +458 -0
- package/expertise/backend/solidity.md +711 -0
- package/expertise/composition-map.yaml +443 -0
- package/expertise/content/foundations/content-modeling.md +395 -0
- package/expertise/content/foundations/editorial-standards.md +449 -0
- package/expertise/content/foundations/index.md +24 -0
- package/expertise/content/foundations/microcopy.md +455 -0
- package/expertise/content/foundations/terminology-governance.md +509 -0
- package/expertise/content/index.md +34 -0
- package/expertise/content/patterns/accessibility-copy.md +518 -0
- package/expertise/content/patterns/index.md +24 -0
- package/expertise/content/patterns/notification-content.md +433 -0
- package/expertise/content/patterns/sample-content.md +486 -0
- package/expertise/content/patterns/state-copy.md +439 -0
- package/expertise/design/PROGRESS.md +58 -0
- package/expertise/design/disciplines/dark-mode-theming.md +577 -0
- package/expertise/design/disciplines/design-systems.md +595 -0
- package/expertise/design/disciplines/index.md +25 -0
- package/expertise/design/disciplines/information-architecture.md +800 -0
- package/expertise/design/disciplines/interaction-design.md +788 -0
- package/expertise/design/disciplines/responsive-design.md +552 -0
- package/expertise/design/disciplines/usability-testing.md +516 -0
- package/expertise/design/disciplines/user-research.md +792 -0
- package/expertise/design/foundations/accessibility-design.md +796 -0
- package/expertise/design/foundations/color-theory.md +797 -0
- package/expertise/design/foundations/iconography.md +795 -0
- package/expertise/design/foundations/index.md +26 -0
- package/expertise/design/foundations/motion-and-animation.md +653 -0
- package/expertise/design/foundations/rtl-design.md +585 -0
- package/expertise/design/foundations/spacing-and-layout.md +607 -0
- package/expertise/design/foundations/typography.md +800 -0
- package/expertise/design/foundations/visual-hierarchy.md +761 -0
- package/expertise/design/index.md +32 -0
- package/expertise/design/patterns/authentication-flows.md +474 -0
- package/expertise/design/patterns/content-consumption.md +789 -0
- package/expertise/design/patterns/data-display.md +618 -0
- package/expertise/design/patterns/e-commerce.md +1494 -0
- package/expertise/design/patterns/feedback-and-states.md +642 -0
- package/expertise/design/patterns/forms-and-input.md +819 -0
- package/expertise/design/patterns/gamification.md +801 -0
- package/expertise/design/patterns/index.md +31 -0
- package/expertise/design/patterns/microinteractions.md +449 -0
- package/expertise/design/patterns/navigation.md +800 -0
- package/expertise/design/patterns/notifications.md +705 -0
- package/expertise/design/patterns/onboarding.md +700 -0
- package/expertise/design/patterns/search-and-filter.md +601 -0
- package/expertise/design/patterns/settings-and-preferences.md +768 -0
- package/expertise/design/patterns/social-and-community.md +748 -0
- package/expertise/design/platforms/desktop-native.md +612 -0
- package/expertise/design/platforms/index.md +25 -0
- package/expertise/design/platforms/mobile-android.md +825 -0
- package/expertise/design/platforms/mobile-cross-platform.md +983 -0
- package/expertise/design/platforms/mobile-ios.md +699 -0
- package/expertise/design/platforms/tablet.md +794 -0
- package/expertise/design/platforms/web-dashboard.md +790 -0
- package/expertise/design/platforms/web-responsive.md +550 -0
- package/expertise/design/psychology/behavioral-nudges.md +449 -0
- package/expertise/design/psychology/cognitive-load.md +1191 -0
- package/expertise/design/psychology/error-psychology.md +778 -0
- package/expertise/design/psychology/index.md +22 -0
- package/expertise/design/psychology/persuasive-design.md +736 -0
- package/expertise/design/psychology/user-mental-models.md +623 -0
- package/expertise/design/tooling/open-pencil.md +266 -0
- package/expertise/frontend/angular.md +1073 -0
- package/expertise/frontend/desktop-electron.md +546 -0
- package/expertise/frontend/flutter.md +782 -0
- package/expertise/frontend/index.md +27 -0
- package/expertise/frontend/native-android.md +409 -0
- package/expertise/frontend/native-ios.md +490 -0
- package/expertise/frontend/react-native.md +1160 -0
- package/expertise/frontend/react.md +808 -0
- package/expertise/frontend/vue.md +1089 -0
- package/expertise/humanize/domain-rules-code.md +79 -0
- package/expertise/humanize/domain-rules-content.md +67 -0
- package/expertise/humanize/domain-rules-technical-docs.md +56 -0
- package/expertise/humanize/index.md +35 -0
- package/expertise/humanize/self-audit-checklist.md +87 -0
- package/expertise/humanize/sentence-patterns.md +218 -0
- package/expertise/humanize/vocabulary-blacklist.md +105 -0
- package/expertise/i18n/PROGRESS.md +65 -0
- package/expertise/i18n/advanced/accessibility-and-i18n.md +28 -0
- package/expertise/i18n/advanced/bidirectional-text-algorithm.md +38 -0
- package/expertise/i18n/advanced/complex-scripts.md +30 -0
- package/expertise/i18n/advanced/performance-and-i18n.md +27 -0
- package/expertise/i18n/advanced/testing-i18n.md +28 -0
- package/expertise/i18n/content/content-adaptation.md +23 -0
- package/expertise/i18n/content/locale-specific-formatting.md +23 -0
- package/expertise/i18n/content/machine-translation-integration.md +28 -0
- package/expertise/i18n/content/translation-management.md +29 -0
- package/expertise/i18n/foundations/date-time-calendars.md +67 -0
- package/expertise/i18n/foundations/i18n-architecture.md +272 -0
- package/expertise/i18n/foundations/locale-and-language-tags.md +79 -0
- package/expertise/i18n/foundations/numbers-currency-units.md +61 -0
- package/expertise/i18n/foundations/pluralization-and-gender.md +109 -0
- package/expertise/i18n/foundations/string-externalization.md +236 -0
- package/expertise/i18n/foundations/text-direction-bidi.md +241 -0
- package/expertise/i18n/foundations/unicode-and-encoding.md +86 -0
- package/expertise/i18n/index.md +38 -0
- package/expertise/i18n/platform/backend-i18n.md +31 -0
- package/expertise/i18n/platform/flutter-i18n.md +148 -0
- package/expertise/i18n/platform/native-android-i18n.md +36 -0
- package/expertise/i18n/platform/native-ios-i18n.md +36 -0
- package/expertise/i18n/platform/react-i18n.md +103 -0
- package/expertise/i18n/platform/web-css-i18n.md +81 -0
- package/expertise/i18n/rtl/arabic-specific.md +175 -0
- package/expertise/i18n/rtl/hebrew-specific.md +149 -0
- package/expertise/i18n/rtl/rtl-animations-and-transitions.md +111 -0
- package/expertise/i18n/rtl/rtl-forms-and-input.md +161 -0
- package/expertise/i18n/rtl/rtl-fundamentals.md +211 -0
- package/expertise/i18n/rtl/rtl-icons-and-images.md +181 -0
- package/expertise/i18n/rtl/rtl-layout-mirroring.md +252 -0
- package/expertise/i18n/rtl/rtl-navigation-and-gestures.md +107 -0
- package/expertise/i18n/rtl/rtl-testing-and-qa.md +147 -0
- package/expertise/i18n/rtl/rtl-typography.md +160 -0
- package/expertise/index.md +113 -0
- package/expertise/index.yaml +216 -0
- package/expertise/infrastructure/cloud-aws.md +597 -0
- package/expertise/infrastructure/cloud-gcp.md +599 -0
- package/expertise/infrastructure/cybersecurity.md +816 -0
- package/expertise/infrastructure/database-mongodb.md +447 -0
- package/expertise/infrastructure/database-postgres.md +400 -0
- package/expertise/infrastructure/devops-cicd.md +787 -0
- package/expertise/infrastructure/index.md +27 -0
- package/expertise/performance/PROGRESS.md +50 -0
- package/expertise/performance/backend/api-latency.md +1204 -0
- package/expertise/performance/backend/background-jobs.md +506 -0
- package/expertise/performance/backend/connection-pooling.md +1209 -0
- package/expertise/performance/backend/database-query-optimization.md +515 -0
- package/expertise/performance/backend/index.md +23 -0
- package/expertise/performance/backend/rate-limiting-and-throttling.md +971 -0
- package/expertise/performance/foundations/algorithmic-complexity.md +954 -0
- package/expertise/performance/foundations/caching-strategies.md +489 -0
- package/expertise/performance/foundations/concurrency-and-parallelism.md +847 -0
- package/expertise/performance/foundations/index.md +24 -0
- package/expertise/performance/foundations/measuring-and-profiling.md +440 -0
- package/expertise/performance/foundations/memory-management.md +964 -0
- package/expertise/performance/foundations/performance-budgets.md +1314 -0
- package/expertise/performance/index.md +31 -0
- package/expertise/performance/infrastructure/auto-scaling.md +1059 -0
- package/expertise/performance/infrastructure/cdn-and-edge.md +1081 -0
- package/expertise/performance/infrastructure/index.md +22 -0
- package/expertise/performance/infrastructure/load-balancing.md +1081 -0
- package/expertise/performance/infrastructure/observability.md +1079 -0
- package/expertise/performance/mobile/index.md +23 -0
- package/expertise/performance/mobile/mobile-animations.md +544 -0
- package/expertise/performance/mobile/mobile-memory-battery.md +416 -0
- package/expertise/performance/mobile/mobile-network.md +452 -0
- package/expertise/performance/mobile/mobile-rendering.md +599 -0
- package/expertise/performance/mobile/mobile-startup-time.md +505 -0
- package/expertise/performance/platform-specific/flutter-performance.md +647 -0
- package/expertise/performance/platform-specific/index.md +22 -0
- package/expertise/performance/platform-specific/node-performance.md +1307 -0
- package/expertise/performance/platform-specific/postgres-performance.md +1366 -0
- package/expertise/performance/platform-specific/react-performance.md +1403 -0
- package/expertise/performance/web/bundle-optimization.md +1239 -0
- package/expertise/performance/web/image-and-media.md +636 -0
- package/expertise/performance/web/index.md +24 -0
- package/expertise/performance/web/network-optimization.md +1133 -0
- package/expertise/performance/web/rendering-performance.md +1098 -0
- package/expertise/performance/web/ssr-and-hydration.md +918 -0
- package/expertise/performance/web/web-vitals.md +1374 -0
- package/expertise/quality/accessibility.md +985 -0
- package/expertise/quality/evidence-based-verification.md +499 -0
- package/expertise/quality/index.md +24 -0
- package/expertise/quality/ml-model-audit.md +614 -0
- package/expertise/quality/performance.md +600 -0
- package/expertise/quality/testing-api.md +891 -0
- package/expertise/quality/testing-mobile.md +496 -0
- package/expertise/quality/testing-web.md +849 -0
- package/expertise/security/PROGRESS.md +54 -0
- package/expertise/security/agentic-identity.md +540 -0
- package/expertise/security/compliance-frameworks.md +601 -0
- package/expertise/security/data/data-encryption.md +364 -0
- package/expertise/security/data/data-privacy-gdpr.md +692 -0
- package/expertise/security/data/database-security.md +1171 -0
- package/expertise/security/data/index.md +22 -0
- package/expertise/security/data/pii-handling.md +531 -0
- package/expertise/security/foundations/authentication.md +1041 -0
- package/expertise/security/foundations/authorization.md +603 -0
- package/expertise/security/foundations/cryptography.md +1001 -0
- package/expertise/security/foundations/index.md +25 -0
- package/expertise/security/foundations/owasp-top-10.md +1354 -0
- package/expertise/security/foundations/secrets-management.md +1217 -0
- package/expertise/security/foundations/secure-sdlc.md +700 -0
- package/expertise/security/foundations/supply-chain-security.md +698 -0
- package/expertise/security/index.md +31 -0
- package/expertise/security/infrastructure/cloud-security-aws.md +1296 -0
- package/expertise/security/infrastructure/cloud-security-gcp.md +1376 -0
- package/expertise/security/infrastructure/container-security.md +721 -0
- package/expertise/security/infrastructure/incident-response.md +1295 -0
- package/expertise/security/infrastructure/index.md +24 -0
- package/expertise/security/infrastructure/logging-and-monitoring.md +1618 -0
- package/expertise/security/infrastructure/network-security.md +1337 -0
- package/expertise/security/mobile/index.md +23 -0
- package/expertise/security/mobile/mobile-android-security.md +1218 -0
- package/expertise/security/mobile/mobile-binary-protection.md +1229 -0
- package/expertise/security/mobile/mobile-data-storage.md +1265 -0
- package/expertise/security/mobile/mobile-ios-security.md +1401 -0
- package/expertise/security/mobile/mobile-network-security.md +1520 -0
- package/expertise/security/smart-contract-security.md +594 -0
- package/expertise/security/testing/index.md +22 -0
- package/expertise/security/testing/penetration-testing.md +1258 -0
- package/expertise/security/testing/security-code-review.md +1765 -0
- package/expertise/security/testing/threat-modeling.md +1074 -0
- package/expertise/security/testing/vulnerability-scanning.md +1062 -0
- package/expertise/security/web/api-security.md +586 -0
- package/expertise/security/web/cors-and-headers.md +433 -0
- package/expertise/security/web/csrf.md +562 -0
- package/expertise/security/web/file-upload.md +1477 -0
- package/expertise/security/web/index.md +25 -0
- package/expertise/security/web/injection.md +1375 -0
- package/expertise/security/web/session-management.md +1101 -0
- package/expertise/security/web/xss.md +1158 -0
- package/exports/README.md +17 -0
- package/exports/hosts/claude/.claude/agents/clarifier.md +42 -0
- package/exports/hosts/claude/.claude/agents/content-author.md +63 -0
- package/exports/hosts/claude/.claude/agents/designer.md +55 -0
- package/exports/hosts/claude/.claude/agents/executor.md +55 -0
- package/exports/hosts/claude/.claude/agents/learner.md +51 -0
- package/exports/hosts/claude/.claude/agents/planner.md +53 -0
- package/exports/hosts/claude/.claude/agents/researcher.md +43 -0
- package/exports/hosts/claude/.claude/agents/reviewer.md +54 -0
- package/exports/hosts/claude/.claude/agents/specifier.md +47 -0
- package/exports/hosts/claude/.claude/agents/verifier.md +71 -0
- package/exports/hosts/claude/.claude/commands/author.md +42 -0
- package/exports/hosts/claude/.claude/commands/clarify.md +38 -0
- package/exports/hosts/claude/.claude/commands/design-review.md +46 -0
- package/exports/hosts/claude/.claude/commands/design.md +44 -0
- package/exports/hosts/claude/.claude/commands/discover.md +37 -0
- package/exports/hosts/claude/.claude/commands/execute.md +48 -0
- package/exports/hosts/claude/.claude/commands/learn.md +38 -0
- package/exports/hosts/claude/.claude/commands/plan-review.md +42 -0
- package/exports/hosts/claude/.claude/commands/plan.md +39 -0
- package/exports/hosts/claude/.claude/commands/prepare-next.md +37 -0
- package/exports/hosts/claude/.claude/commands/review.md +40 -0
- package/exports/hosts/claude/.claude/commands/run-audit.md +41 -0
- package/exports/hosts/claude/.claude/commands/spec-challenge.md +41 -0
- package/exports/hosts/claude/.claude/commands/specify.md +38 -0
- package/exports/hosts/claude/.claude/commands/verify.md +37 -0
- package/exports/hosts/claude/.claude/settings.json +34 -0
- package/exports/hosts/claude/CLAUDE.md +19 -0
- package/exports/hosts/claude/export.manifest.json +38 -0
- package/exports/hosts/claude/host-package.json +67 -0
- package/exports/hosts/codex/AGENTS.md +19 -0
- package/exports/hosts/codex/export.manifest.json +38 -0
- package/exports/hosts/codex/host-package.json +41 -0
- package/exports/hosts/cursor/.cursor/hooks.json +16 -0
- package/exports/hosts/cursor/.cursor/rules/wazir-core.mdc +19 -0
- package/exports/hosts/cursor/export.manifest.json +38 -0
- package/exports/hosts/cursor/host-package.json +42 -0
- package/exports/hosts/gemini/GEMINI.md +19 -0
- package/exports/hosts/gemini/export.manifest.json +38 -0
- package/exports/hosts/gemini/host-package.json +41 -0
- package/hooks/README.md +18 -0
- package/hooks/definitions/loop_cap_guard.yaml +21 -0
- package/hooks/definitions/post_tool_capture.yaml +24 -0
- package/hooks/definitions/pre_compact_summary.yaml +19 -0
- package/hooks/definitions/pre_tool_capture_route.yaml +19 -0
- package/hooks/definitions/protected_path_write_guard.yaml +19 -0
- package/hooks/definitions/session_start.yaml +19 -0
- package/hooks/definitions/stop_handoff_harvest.yaml +20 -0
- package/hooks/loop-cap-guard +17 -0
- package/hooks/post-tool-lint +36 -0
- package/hooks/protected-path-write-guard +17 -0
- package/hooks/session-start +41 -0
- package/llms-full.txt +2355 -0
- package/llms.txt +43 -0
- package/package.json +79 -0
- package/roles/README.md +20 -0
- package/roles/clarifier.md +42 -0
- package/roles/content-author.md +63 -0
- package/roles/designer.md +55 -0
- package/roles/executor.md +55 -0
- package/roles/learner.md +51 -0
- package/roles/planner.md +53 -0
- package/roles/researcher.md +43 -0
- package/roles/reviewer.md +54 -0
- package/roles/specifier.md +47 -0
- package/roles/verifier.md +71 -0
- package/schemas/README.md +24 -0
- package/schemas/accepted-learning.schema.json +20 -0
- package/schemas/author-artifact.schema.json +156 -0
- package/schemas/clarification.schema.json +19 -0
- package/schemas/design-artifact.schema.json +80 -0
- package/schemas/docs-claim.schema.json +18 -0
- package/schemas/export-manifest.schema.json +20 -0
- package/schemas/hook.schema.json +67 -0
- package/schemas/host-export-package.schema.json +18 -0
- package/schemas/implementation-plan.schema.json +19 -0
- package/schemas/proposed-learning.schema.json +19 -0
- package/schemas/research.schema.json +18 -0
- package/schemas/review.schema.json +29 -0
- package/schemas/run-manifest.schema.json +18 -0
- package/schemas/spec-challenge.schema.json +18 -0
- package/schemas/spec.schema.json +20 -0
- package/schemas/usage.schema.json +102 -0
- package/schemas/verification-proof.schema.json +29 -0
- package/schemas/wazir-manifest.schema.json +173 -0
- package/skills/README.md +40 -0
- package/skills/brainstorming/SKILL.md +77 -0
- package/skills/debugging/SKILL.md +50 -0
- package/skills/design/SKILL.md +61 -0
- package/skills/dispatching-parallel-agents/SKILL.md +128 -0
- package/skills/executing-plans/SKILL.md +70 -0
- package/skills/finishing-a-development-branch/SKILL.md +169 -0
- package/skills/humanize/SKILL.md +123 -0
- package/skills/init-pipeline/SKILL.md +124 -0
- package/skills/prepare-next/SKILL.md +20 -0
- package/skills/receiving-code-review/SKILL.md +123 -0
- package/skills/requesting-code-review/SKILL.md +105 -0
- package/skills/requesting-code-review/code-reviewer.md +108 -0
- package/skills/run-audit/SKILL.md +197 -0
- package/skills/scan-project/SKILL.md +41 -0
- package/skills/self-audit/SKILL.md +153 -0
- package/skills/subagent-driven-development/SKILL.md +154 -0
- package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +26 -0
- package/skills/subagent-driven-development/implementer-prompt.md +102 -0
- package/skills/subagent-driven-development/spec-reviewer-prompt.md +61 -0
- package/skills/tdd/SKILL.md +23 -0
- package/skills/using-git-worktrees/SKILL.md +163 -0
- package/skills/using-skills/SKILL.md +95 -0
- package/skills/verification/SKILL.md +22 -0
- package/skills/wazir/SKILL.md +463 -0
- package/skills/writing-plans/SKILL.md +30 -0
- package/skills/writing-skills/SKILL.md +157 -0
- package/skills/writing-skills/anthropic-best-practices.md +122 -0
- package/skills/writing-skills/persuasion-principles.md +50 -0
- package/templates/README.md +20 -0
- package/templates/artifacts/README.md +10 -0
- package/templates/artifacts/accepted-learning.md +19 -0
- package/templates/artifacts/accepted-learning.template.json +12 -0
- package/templates/artifacts/author.md +74 -0
- package/templates/artifacts/author.template.json +19 -0
- package/templates/artifacts/clarification.md +21 -0
- package/templates/artifacts/clarification.template.json +12 -0
- package/templates/artifacts/execute-notes.md +19 -0
- package/templates/artifacts/implementation-plan.md +21 -0
- package/templates/artifacts/implementation-plan.template.json +11 -0
- package/templates/artifacts/learning-proposal.md +19 -0
- package/templates/artifacts/next-run-handoff.md +21 -0
- package/templates/artifacts/plan-review.md +19 -0
- package/templates/artifacts/proposed-learning.template.json +12 -0
- package/templates/artifacts/research.md +21 -0
- package/templates/artifacts/research.template.json +12 -0
- package/templates/artifacts/review-findings.md +19 -0
- package/templates/artifacts/review.template.json +11 -0
- package/templates/artifacts/run-manifest.template.json +8 -0
- package/templates/artifacts/spec-challenge.md +19 -0
- package/templates/artifacts/spec-challenge.template.json +11 -0
- package/templates/artifacts/spec.md +21 -0
- package/templates/artifacts/spec.template.json +12 -0
- package/templates/artifacts/verification-proof.md +19 -0
- package/templates/artifacts/verification-proof.template.json +11 -0
- package/templates/examples/accepted-learning.example.json +14 -0
- package/templates/examples/author.example.json +152 -0
- package/templates/examples/clarification.example.json +15 -0
- package/templates/examples/docs-claim.example.json +8 -0
- package/templates/examples/export-manifest.example.json +7 -0
- package/templates/examples/host-export-package.example.json +11 -0
- package/templates/examples/implementation-plan.example.json +17 -0
- package/templates/examples/proposed-learning.example.json +13 -0
- package/templates/examples/research.example.json +15 -0
- package/templates/examples/research.example.md +6 -0
- package/templates/examples/review.example.json +17 -0
- package/templates/examples/run-manifest.example.json +9 -0
- package/templates/examples/spec-challenge.example.json +14 -0
- package/templates/examples/spec.example.json +21 -0
- package/templates/examples/verification-proof.example.json +21 -0
- package/templates/examples/wazir-manifest.example.yaml +65 -0
- package/templates/task-definition-schema.md +99 -0
- package/tooling/README.md +20 -0
- package/tooling/src/adapters/context-mode.js +50 -0
- package/tooling/src/capture/command.js +376 -0
- package/tooling/src/capture/store.js +99 -0
- package/tooling/src/capture/usage.js +270 -0
- package/tooling/src/checks/branches.js +50 -0
- package/tooling/src/checks/brand-truth.js +110 -0
- package/tooling/src/checks/changelog.js +231 -0
- package/tooling/src/checks/command-registry.js +36 -0
- package/tooling/src/checks/commits.js +102 -0
- package/tooling/src/checks/docs-drift.js +103 -0
- package/tooling/src/checks/docs-truth.js +201 -0
- package/tooling/src/checks/runtime-surface.js +156 -0
- package/tooling/src/cli.js +116 -0
- package/tooling/src/command-options.js +56 -0
- package/tooling/src/commands/validate.js +320 -0
- package/tooling/src/doctor/command.js +91 -0
- package/tooling/src/export/command.js +77 -0
- package/tooling/src/export/compiler.js +498 -0
- package/tooling/src/guards/loop-cap-guard.js +52 -0
- package/tooling/src/guards/protected-path-write-guard.js +67 -0
- package/tooling/src/index/command.js +152 -0
- package/tooling/src/index/storage.js +1061 -0
- package/tooling/src/index/summarizers.js +261 -0
- package/tooling/src/loaders.js +18 -0
- package/tooling/src/project-root.js +22 -0
- package/tooling/src/recall/command.js +225 -0
- package/tooling/src/schema-validator.js +30 -0
- package/tooling/src/state-root.js +40 -0
- package/tooling/src/status/command.js +71 -0
- package/wazir.manifest.yaml +135 -0
- package/workflows/README.md +19 -0
- package/workflows/author.md +42 -0
- package/workflows/clarify.md +38 -0
- package/workflows/design-review.md +46 -0
- package/workflows/design.md +44 -0
- package/workflows/discover.md +37 -0
- package/workflows/execute.md +48 -0
- package/workflows/learn.md +38 -0
- package/workflows/plan-review.md +42 -0
- package/workflows/plan.md +39 -0
- package/workflows/prepare-next.md +37 -0
- package/workflows/review.md +40 -0
- package/workflows/run-audit.md +41 -0
- package/workflows/spec-challenge.md +41 -0
- package/workflows/specify.md +38 -0
- package/workflows/verify.md +37 -0
# Scaling Anti-Patterns -- Performance Anti-Patterns Module

> Scaling failures are among the most expensive incidents in production systems. They strike at peak traffic -- the worst possible moment -- and cascade into multi-hour outages that destroy revenue and user trust. Most scaling failures are not caused by unprecedented load but by well-known anti-patterns that were never addressed. GitHub's June 2025 outage, where a routine database migration cascaded into a platform-wide crisis, is a reminder that even the most sophisticated engineering organizations are not immune.

> **Domain:** Performance
> **Severity:** Critical
> **Applies to:** Backend, Infrastructure, Distributed Systems
> **Key metrics:** Requests per second capacity, p99 latency under load, error rate during traffic spikes, time to recover from overload, cost per request

---

## Table of Contents

1. [Vertical Scaling Only (Bigger Server Syndrome)](#1-vertical-scaling-only-bigger-server-syndrome)
2. [Not Designing for Horizontal Scaling](#2-not-designing-for-horizontal-scaling)
3. [Sticky Sessions Preventing Scale-Out](#3-sticky-sessions-preventing-scale-out)
4. [Storing State on Local Filesystem](#4-storing-state-on-local-filesystem)
5. [Not Using Connection Pooling](#5-not-using-connection-pooling)
6. [Single Database for Everything](#6-single-database-for-everything)
7. [Not Planning for Thundering Herd](#7-not-planning-for-thundering-herd)
8. [Ignoring Backpressure](#8-ignoring-backpressure)
9. [Unbounded Queues](#9-unbounded-queues)
10. [Not Load Testing Before Launch](#10-not-load-testing-before-launch)
11. [Hot Spots from Poor Sharding](#11-hot-spots-from-poor-sharding)
12. [Cross-Shard Transactions](#12-cross-shard-transactions)
13. [Scaling by Adding Complexity](#13-scaling-by-adding-complexity)
14. [Ignoring Cold Start Problems](#14-ignoring-cold-start-problems)
15. [Not Planning for Graceful Degradation](#15-not-planning-for-graceful-degradation)
16. [Monolithic Database Migrations at Scale](#16-monolithic-database-migrations-at-scale)
17. [Network Calls in Loops](#17-network-calls-in-loops)
18. [Fan-Out Without Fan-In Limits](#18-fan-out-without-fan-in-limits)
19. [Not Using Read Replicas When Read-Heavy](#19-not-using-read-replicas-when-read-heavy)
20. [Over-Provisioning vs Under-Provisioning](#20-over-provisioning-vs-under-provisioning)
21. [Root Cause Analysis](#root-cause-analysis)
22. [Self-Check Questions](#self-check-questions)
23. [Code Smell Quick Reference](#code-smell-quick-reference)
24. [Sources](#sources)
---

## 1. Vertical Scaling Only (Bigger Server Syndrome)

**Anti-pattern:** Responding to every capacity problem by upgrading to a larger server instance (more CPU, more RAM, bigger disk) instead of distributing load across multiple nodes.

**Why it happens:** Vertical scaling is the path of least resistance. No code changes required -- just resize the instance. Teams under pressure choose the fastest fix. Early-stage startups often lack the engineering capacity to design for horizontal scaling, so they throw hardware at the problem.

**Real-world incident:** Airbnb initially scaled their monolithic Ruby on Rails application by upgrading to progressively larger AWS EC2 instances. The strategy hit a wall when peak loads exceeded what any single instance could handle. High-end servers with 128 cores and 1TB of RAM cost exponentially more -- often 5x the price of a machine with half the specs -- while delivering diminishing returns. Airbnb ultimately transitioned to a service-oriented architecture with horizontal scaling across regions.

**Why it fails:**
- Hardware has physical limits -- there is a largest server money can buy
- Cost scales superlinearly: doubling capacity often costs 3-5x more
- Creates a single point of failure -- if the one server goes down, everything goes down
- Maintenance windows require full downtime since there is no redundancy
- No geographic distribution possible

**The fix:**
- Design stateless application tiers from the start
- Use load balancers to distribute traffic across multiple instances
- Externalize state to shared stores (Redis, S3, managed databases)
- Adopt auto-scaling groups that add/remove instances based on load
- Vertically scale the database tier (where horizontal scaling is hardest) but horizontally scale everything else

**Detection signals:**
- Monthly infrastructure bills growing faster than revenue
- Single-instance CPU consistently above 70%
- Downtime during maintenance windows with no failover
- Maximum instance size already in use
---

## 2. Not Designing for Horizontal Scaling

**Anti-pattern:** Building applications that assume a single-process, single-machine deployment. In-memory caches, local file storage, process-level singletons, and reliance on local disk all prevent adding more instances behind a load balancer.

**Why it happens:** Local development environments are inherently single-machine. Developers build and test on one machine and never encounter multi-instance issues until production. Frameworks often default to in-process state management. The cost of distributed design feels premature when you have 100 users.

**Real-world incident:** A SaaS platform stored uploaded user avatars on the local filesystem of the web server. When they added a second server behind a load balancer, half of all avatar requests returned 404 errors because the file only existed on the original server. Emergency migration to S3 required a maintenance window and data reconciliation.

**Why it fails:**
- Adding instances behind a load balancer produces inconsistent behavior
- In-memory caches diverge across instances, causing stale data bugs
- File uploads saved locally become inaccessible from other instances
- Process-level locks and singletons cause race conditions in multi-instance deployments
- Cannot leverage auto-scaling since new instances lack the accumulated state

**The fix:**
- Follow the Twelve-Factor App methodology: treat servers as disposable
- Store all persistent data in external services (databases, object storage, caches)
- Use distributed caching (Redis, Memcached) instead of in-process caches
- Design all endpoints to be stateless -- any instance can handle any request
- Use distributed locks (Redis SETNX, ZooKeeper) instead of local mutexes

**Detection signals:**
- Code references to `/tmp`, `/var/data`, or other local paths for user data
- `HashMap` or `Dictionary` used as an application-level cache
- Singleton pattern used for rate limiters or session stores
- Tests only run against a single instance
---

## 3. Sticky Sessions Preventing Scale-Out

**Anti-pattern:** Configuring load balancers to route all requests from a given user to the same backend server (session affinity), tying user state to a specific instance and defeating the purpose of horizontal scaling.

**Why it happens:** Server-side session storage (e.g., `HttpSession` in Java, session middleware in Express) stores user state in the process memory of whichever server handled the login. Without sticky sessions, the next request may hit a different server that has no knowledge of the session. Sticky sessions are the quick fix.

**Real-world incident:** A Kubernetes-based platform enabled sticky sessions to solve session consistency issues. When the Horizontal Pod Autoscaler (HPA) scaled pods due to increased load, the new pods received zero traffic because all existing users were pinned to the original pods. The scaling event was effectively nullified -- new pods sat idle while overloaded pods continued to degrade. The team had to redesign session management using Redis before auto-scaling became functional.

**Why it fails:**
- New instances get no traffic until existing sessions expire, nullifying scale-out
- If a server fails, all pinned sessions are lost -- users experience errors or forced re-login
- Load becomes uneven: one server may handle 10x the traffic of another
- Rolling deployments are painful because draining sticky sessions takes time
- Cannot effectively use auto-scaling policies

**The fix:**
- Externalize session state to Redis, Memcached, or a database
- Use token-based authentication (JWT) where session data travels with the request
- Store only a session ID in a cookie; look up state from the shared store
- If sticky sessions are truly required (e.g., WebSocket connections), implement session replication as a fallback

**Detection signals:**
- Load balancer configuration includes `stickiness` or `affinity` settings
- Uneven request distribution visible in server metrics
- New instances show near-zero traffic after scaling events
- User complaints spike after server restarts
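The shared-store pattern from the fix list can be sketched in a few lines. A plain dict stands in for Redis so the example is self-contained, and the `SessionStore` class and handler names are illustrative, not from any particular framework:

```python
import secrets

# In production `SessionStore` would wrap a shared store (e.g. Redis) that
# every app instance can reach; a dict stands in here.
class SessionStore:
    def __init__(self):
        self._sessions = {}

    def create(self, user_id):
        # The cookie carries only this opaque ID; the state stays server-side
        # in the shared store, so no instance affinity is needed.
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = {"user_id": user_id}
        return session_id

    def get(self, session_id):
        return self._sessions.get(session_id)

store = SessionStore()  # one logical store shared by all instances

def handle_login(user_id):
    return store.create(user_id)

def handle_request(session_id):
    # Identical behavior whether instance A or instance B serves the request.
    session = store.get(session_id)
    return session["user_id"] if session else None
```

With state externalized like this, the load balancer can stay affinity-free and an autoscaler can route traffic to fresh pods immediately.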
---

## 4. Storing State on Local Filesystem

**Anti-pattern:** Writing application state -- uploads, generated reports, cache files, session data, or temp files -- to the local disk of a server instance, making it inaccessible to other instances and lost on instance termination.

**Why it happens:** Writing to disk is the simplest I/O operation in any language. It works perfectly in development. Cloud instances come with local storage by default. The distinction between ephemeral and persistent storage is not obvious to developers unfamiliar with cloud-native patterns.

**Real-world incident:** An e-commerce platform generated PDF invoices and stored them at `/var/invoices/` on the web server. During a holiday traffic spike, auto-scaling launched four new instances. Customers who generated invoices on instance A could not download them when their next request was routed to instance B. The team scrambled to implement an S3-backed solution while simultaneously handling peak traffic -- the worst possible time for an architectural change.

**Why it fails:**
- Data is lost when instances are terminated, recycled, or crash
- Other instances cannot access the files, breaking multi-instance deployments
- Auto-scaling and spot/preemptible instances are incompatible with local state
- Disk space is finite and unmonitored, leading to silent failures when full
- No built-in redundancy or backup

**The fix:**
- Use object storage (S3, GCS, Azure Blob) for all user-facing files
- Use managed databases or Redis for application state
- Treat local disk as ephemeral scratch space only
- Mount shared filesystems (EFS, Filestore) if POSIX semantics are required
- Implement upload-to-cloud patterns where files go directly to object storage from the client

**Detection signals:**
- Code writes to `/tmp`, `/var`, or custom local paths for persistent data
- `os.path`, `fs.writeFile`, or `File.write` used for user-generated content
- No object storage SDK in project dependencies
- Missing files reported after deployments or scaling events
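One way to enforce "local disk is scratch space only" is to route every persistent write through a small storage interface. `ObjectStore` below is an illustrative abstraction, not a real SDK; the in-memory implementation stands in for S3/GCS/Azure Blob so the sketch runs anywhere:

```python
# Application code targets put/get and never a local path, so swapping the
# in-memory stand-in for a real object store is a one-class change.
class ObjectStore:
    def put(self, key: str, data: bytes) -> str:
        raise NotImplementedError

    def get(self, key: str) -> bytes:
        raise NotImplementedError

class InMemoryStore(ObjectStore):
    """Stand-in for a cloud store; an S3 version would wrap boto3's
    put_object/get_object calls."""
    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data
        return key

    def get(self, key):
        return self._blobs[key]

def save_invoice(store: ObjectStore, order_id: str, pdf_bytes: bytes) -> str:
    # Any instance can serve this key later -- nothing is instance-local.
    return store.put(f"invoices/{order_id}.pdf", pdf_bytes)
```

Keeping the interface in front of business logic means the emergency "migrate to S3 mid-spike" scenario becomes a configuration change instead of a rewrite.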
---

## 5. Not Using Connection Pooling

**Anti-pattern:** Opening a new database connection for every request and closing it afterward, or failing to limit the total number of connections across application instances, leading to connection exhaustion under load.

**Why it happens:** Default ORM and driver configurations often open connections on demand without pooling. In development, the connection count is low enough that it never matters. When the application scales to multiple instances with multiple workers each, the connection count multiplies and overwhelms the database.

**Real-world incident:** A production PostgreSQL database experienced full connection pool exhaustion when multiple Celery workers, each running several concurrent processes, opened more connections than the database could handle. The database began rejecting all new connections, causing a complete application outage. The immediate fix required a superuser connection to identify and kill hundreds of idle connections that had leaked from application code. Long-term, the team deployed PgBouncer to multiplex client connections through a smaller pool of actual database connections.

**Why it fails:**
- Each connection consumes ~10MB of database server memory (PostgreSQL)
- Connection establishment takes 50-200ms of TCP + TLS handshake overhead
- At scale, `max_connections` is exhausted, and all new queries are rejected
- Leaked connections (not returned to pool) silently accumulate until crisis
- Multiple application instances multiply the problem (10 instances x 20 workers x 5 connections = 1000 connections)

**The fix:**
- Configure connection pooling in the application (HikariCP, SQLAlchemy pool, Knex pool)
- Deploy a connection pooler proxy (PgBouncer, ProxySQL, Amazon RDS Proxy)
- Set pool sizes based on: `pool_size = (total_db_connections) / (num_instances * workers_per_instance)`
- Use context managers or try/finally to guarantee connection release
- Monitor active vs idle connections with alerts at 70% utilization

**Detection signals:**
- `too many connections` or `connection pool exhausted` errors in logs
- Database CPU low but connection count near maximum
- Queries timing out waiting for an available connection
- Increasing latency correlated with instance count rather than query complexity
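The sizing formula and the guaranteed-release fix can be sketched as follows. This is a minimal illustration, not a replacement for HikariCP or SQLAlchemy's pool; `make_conn` stands in for a real driver call such as `psycopg2.connect`:

```python
import queue
from contextlib import contextmanager

def pool_size(total_db_connections, num_instances, workers_per_instance):
    # The sizing rule from the fix list: divide the database's connection
    # budget across every worker that will hold connections.
    return total_db_connections // (num_instances * workers_per_instance)

class ConnectionPool:
    """Minimal pool sketch: a bounded queue of reusable connections."""

    def __init__(self, make_conn, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(make_conn())

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks waiting for an idle connection instead of opening a new
        # one; raises queue.Empty if none frees up within the timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # returned even if the caller raised
```

The context manager is the important part: a leaked connection is impossible as long as callers go through `with pool.connection() as conn:`.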
---

## 6. Single Database for Everything

**Anti-pattern:** Using one database instance for all services, all data types, and all workloads -- OLTP transactions, analytics queries, search, session storage, job queues, and audit logs all hitting the same server.

**Why it happens:** Starting with one database is rational. It simplifies the technology stack, avoids distributed transaction complexity, and makes joins trivial. The problem is that teams never revisit this decision as the application grows and workload characteristics diverge.

**Real-world incident:** GitHub experienced repeated outages in February 2020 traced directly to database infrastructure. Application logic changes to database query patterns rapidly increased load on database clusters. A heavy analytics query on the shared database caused lock contention that blocked OLTP transactions, degrading the entire platform. GitHub subsequently invested heavily in database partitioning and workload isolation.

**Why it fails:**
- An analytics query holding locks blocks all transactional writes
- One slow query can saturate CPU and starve all other workloads
- Scaling the single database means scaling for the most demanding workload, even if others are light
- Schema migrations affect all services simultaneously
- Backup and restore times grow with total data volume, increasing recovery time

**The fix:**
- Separate OLTP from OLAP workloads (dedicated analytics database or data warehouse)
- Use purpose-built data stores: Redis for sessions, Elasticsearch for search, a queue service for job queues
- Implement the database-per-service pattern for microservices
- Use Change Data Capture (CDC) to replicate data between specialized stores
- Start with logical separation (schemas) and graduate to physical separation as load demands

**Detection signals:**
- Single connection string used across all services and background jobs
- Mixed query patterns: sub-millisecond lookups alongside multi-second aggregations
- Lock wait timeouts correlating with batch job schedules
- Schema with 200+ tables where domains overlap
---

## 7. Not Planning for Thundering Herd

**Anti-pattern:** Allowing all clients to simultaneously retry, reconnect, or request the same resource at the same moment, creating a synchronized stampede that overwhelms backends that might otherwise recover.

**Why it happens:** Systems are designed for steady-state traffic patterns. When a cache expires, a service restarts, or an outage ends, all waiting clients rush in simultaneously. Fixed retry intervals ensure every client retries at the same time. Developers test with one client at a time and never simulate coordinated surges.

**Real-world incident:** Depot experienced a thundering herd event where database traffic suddenly spiked, CPU usage jumped to 100%, and the overload cascaded into a much larger outage. Every client with retry logic retried simultaneously with fixed intervals, hitting the recovering database with the exact same wave of traffic again and again. Similarly, IRCTC (Indian Railways) pre-loads train data before the 10 AM Tatkal booking window but still struggles because millions of seat booking writes spike at exactly 10:00:00.

**Why it fails:**
- Cache expiration causes all requests to hit the origin simultaneously
- Service recovery is prevented because the herd arrives faster than the system can stabilize
- Retry storms amplify failures: N clients failing and retrying creates 2N, then 4N load
- Database connection pools are exhausted instantly
- CDN or cache layer going cold triggers origin overload

**The fix:**
- Add jitter to all retry intervals: `delay = base_delay * 2^attempt + random(0, base_delay)`
- Implement cache stampede protection: lock-based recomputation where only one request rebuilds the cache
- Use staggered TTLs: add random variance to cache expiration times
- Implement retry budgets: limit total retries per time window across the fleet
- Use load shedding at the gateway: return 503 with `Retry-After` header during overload
- Deploy request coalescing: deduplicate identical in-flight requests

**Detection signals:**
- Traffic graphs show sharp spikes to 10x+ normal immediately after recovery
- Cache hit rate drops from 99% to 0% simultaneously across all keys
- All retry timers use fixed intervals without jitter
- No circuit breakers between services
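The jitter formula and the staggered-TTL idea from the fix list can be written down directly. A minimal sketch; the default parameter values are examples, not recommendations:

```python
import random

def retry_delay(attempt, base_delay=0.5, max_delay=30.0, rng=random.random):
    # Exponential backoff plus jitter, per the formula above. Without the
    # random term, every client that failed together retries together.
    delay = base_delay * (2 ** attempt) + rng() * base_delay
    return min(delay, max_delay)

def staggered_ttl(base_ttl, variance=0.1, rng=random.random):
    # Spread cache expirations so an entire keyspace never goes cold at once.
    return base_ttl * (1 + variance * rng())
```

Injecting `rng` keeps both functions deterministic under test while using real randomness in production.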
---

## 8. Ignoring Backpressure

**Anti-pattern:** Accepting every incoming request or message without regard for downstream capacity, allowing producers to overwhelm consumers until the system collapses from resource exhaustion.

**Why it happens:** Developers focus on throughput -- accepting requests as fast as possible. Saying "no" to a request feels like a bug. Load balancers, API gateways, and message brokers accept work by default. There is no built-in "slow down" signal in HTTP. The system works fine under normal load, so the problem is invisible until a traffic spike.

**Real-world incident:** A data pipeline ingested events from hundreds of IoT devices. The ingestion API accepted all messages and pushed them to a processing queue. When downstream processors slowed due to a database bottleneck, the queue grew to 40GB over six hours until the process hit its memory limit and was OOM-killed. All buffered events were lost. The system had no mechanism to signal producers to slow down or to shed excess load.

**Why it fails:**
- Memory grows without bound as work accumulates faster than it is processed
- Latency increases for all requests, not just the excess
- OOM kills cause abrupt crashes with no graceful cleanup
- Recovery is slow because the backlog must be drained before normal operation
- Downstream services may fail under the sudden surge when processing resumes

**The fix:**
- Implement rate limiting at the API gateway (token bucket, sliding window)
- Use bounded buffers and reject or drop when full (return 429 or 503)
- Implement reactive streams / flow control (gRPC flow control, Kafka consumer pause)
- Monitor queue depth and alert when it exceeds a threshold
- Design producers to handle rejection: exponential backoff, dead-letter queues
- Use admission control: shed load early rather than accepting work you cannot complete

**Detection signals:**
- Queue depth metrics trending upward with no plateau
- Memory usage growing linearly over time under sustained load
- No rate limiting configured on public-facing endpoints
- No 429 or 503 responses in access logs -- every request is accepted
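Admission control -- the "shed load early" fix -- can be sketched with a counting semaphore. A minimal illustration; the class name and the choice of a 429 response are assumptions, not any specific framework's API:

```python
import threading

class AdmissionController:
    """Cap in-flight work; reject the excess immediately instead of queueing it."""

    def __init__(self, max_in_flight):
        self._slots = threading.Semaphore(max_in_flight)

    def handle(self, work):
        # Non-blocking acquire: if every slot is busy, shed the load now
        # with a 429 rather than buffering the request without bound.
        if not self._slots.acquire(blocking=False):
            return (429, "Too Many Requests")
        try:
            return (200, work())
        finally:
            self._slots.release()
```

Rejected callers are expected to back off and retry (ideally with jitter, as in the previous section), which is what turns the semaphore into an end-to-end backpressure signal.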
---

## 9. Unbounded Queues

**Anti-pattern:** Using queues with no maximum size, allowing unlimited messages to accumulate in memory or on disk when consumers cannot keep up, eventually exhausting system resources.

**Why it happens:** Most queue implementations default to unbounded. Setting a limit feels like an arbitrary constraint. "What if we lose messages?" is a common objection. Teams assume consumers will always keep up and never plan for the contrary.

**Real-world incident:** After an upgrade to version 2025.10, Authentik (an identity provider) experienced OOM kills on worker pods due to unbounded queue growth in the `authentik_tasks_task` queue. Stale tasks accumulated without limit, consuming all available memory. The interim fix was a CronJob to periodically purge stale tasks, but the root cause was the absence of any queue size bound or TTL on enqueued items. Separately, Wazuh's remote message control queue (introduced in v4.13.0) had no size limit, allowing unlimited memory consumption during high agent load, risking complete memory exhaustion of the management server.

**Why it fails:**
- Memory consumption grows silently until the process is OOM-killed
- The failure mode is catastrophic: instant crash with no graceful degradation
- Processing latency for new messages equals the time to drain the entire backlog
- Messages at the tail of a massive queue may be stale by the time they are processed
- Monitoring often only tracks throughput, not queue depth

**The fix:**
- Set explicit maximum queue sizes on all queues (`maxlen`, `capacity`, `x-max-length`)
- Define a rejection policy: drop oldest, drop newest, reject producer, or dead-letter
- Add TTL to messages so stale items are automatically discarded
- Monitor queue depth, enqueue rate, and dequeue rate with alerts
- Implement consumer auto-scaling tied to queue depth metrics
- Size queues based on: `max_depth = consumer_throughput * max_acceptable_latency`

**Detection signals:**
- Queue configuration shows no `maxlen`, `capacity`, or size limit
- Memory usage on queue hosts trends upward during load spikes
- No dead-letter queue configured
- Consumer count is static regardless of queue depth
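The sizing formula and a drop-oldest policy can be sketched with the standard library; `deque(maxlen=...)` evicts the oldest item on overflow. The throughput and latency numbers below are illustrative:

```python
from collections import deque

def max_depth(consumer_throughput_per_s, max_acceptable_latency_s):
    # The sizing rule from the fix list: the deepest queue you can tolerate
    # is what your consumers can drain within the latency budget.
    return int(consumer_throughput_per_s * max_acceptable_latency_s)

def make_bounded_queue(consumer_throughput_per_s, max_acceptable_latency_s):
    # deque(maxlen=...) gives a drop-oldest policy for free: appending to a
    # full deque evicts from the opposite end. For a reject-producer policy,
    # compare len() against maxlen before appending instead.
    return deque(maxlen=max_depth(consumer_throughput_per_s,
                                  max_acceptable_latency_s))
```

Drop-oldest suits telemetry-style workloads where a stale message is worthless; for must-not-lose workloads, pair a reject-producer policy with a dead-letter queue instead.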
---

## 10. Not Load Testing Before Launch

**Anti-pattern:** Deploying to production without systematically testing the system under expected and peak traffic volumes, discovering capacity limits through real user impact instead of controlled experiments.

**Why it happens:** Load testing takes time to set up, requires realistic test data, and needs a production-like environment. Teams under deadline pressure skip it. "We can always scale up if needed" is the rationalization. Development environments give no indication of production-scale behavior.

**Real-world incident:** CodinGame experienced a "Reddit hug of death" that took their platform offline for 2 hours. They received as many new users in one day as during the previous two months. Post-mortem analysis revealed multiple failures that load testing would have caught: the RDS database was the main bottleneck with all data centralized and tangled, application servers had a memory leak that only manifested under heavy load, and the chat server process hit 100% CPU under concurrent connections. Industry data shows 80% of incidents are triggered by internal changes with insufficient testing.

**Why it fails:**
- True bottlenecks only appear under concurrent load (lock contention, connection limits, memory leaks)
- Capacity limits are unknown, making scaling decisions guesswork
- Performance regressions ship undetected when there is no baseline
- Third-party dependencies (payment processors, APIs) may rate-limit or fail under load
- Auto-scaling configurations are never validated -- minimum/maximum counts may be wrong

**The fix:**
- Establish a load testing practice with tools (k6, Locust, Gatling, Artillery)
- Test at 2x expected peak to find the breaking point, not just confirm the happy path
- Run soak tests (sustained load for hours) to detect memory leaks and connection exhaustion
- Include third-party dependencies in tests or mock them at realistic latencies
- Automate load tests in CI/CD to catch regressions per release
- Define performance budgets: maximum p99 latency, minimum throughput, maximum error rate

**Detection signals:**
- No load testing tools in project dependencies or CI/CD pipeline
- Production capacity limits are unknown -- "We will see"
- Performance metrics have no historical baseline
- First traffic spike causes unexpected failures

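In practice this is built on k6 or Locust, but the core loop is simple enough to sketch. A hypothetical harness with a stubbed request function, enforcing a p99 budget the way a CI release gate would:

```python
import concurrent.futures
import random
import time

def fake_request() -> float:
    """Stand-in for a real HTTP call; sleeps for a simulated service latency."""
    latency = random.uniform(0.001, 0.005)
    time.sleep(latency)
    return latency

def load_test(concurrency: int, total_requests: int) -> dict:
    """Drive requests through a worker pool and report latency percentiles."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: fake_request(), range(total_requests)))
    return {
        "p50": latencies[len(latencies) // 2],
        "p99": latencies[int(len(latencies) * 0.99)],
        "max": latencies[-1],
    }

# Performance budget as a release gate: the build fails if p99 is over budget.
stats = load_test(concurrency=20, total_requests=500)
P99_BUDGET_S = 0.050
assert stats["p99"] <= P99_BUDGET_S, f"p99 {stats['p99']:.3f}s exceeds budget"
```
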
---

## 11. Hot Spots from Poor Sharding

**Anti-pattern:** Choosing a shard key that distributes data unevenly, causing one or a few shards to receive disproportionate read/write traffic while others sit idle.

**Why it happens:** Shard key selection requires understanding access patterns, data distribution, and growth projections. Teams often choose the most obvious key (user ID, tenant ID, timestamp) without analyzing the distribution. Some keys that appear uniform are actually highly skewed.

**Real-world incident:** A documented $2.4 million sharding project failed when, after implementation, one shard grew to 2,847,000 records while another had only 156,000. The root cause: enterprise customers had 10,000+ users while small customers had 1-5 users, and sharding by `customer_id` concentrated enterprise data on a few shards. In another case, an e-commerce platform sharded by product category, but the "electronics" category received 60% of all traffic, creating a persistent hot shard that required repeated hardware upgrades.

**Why it fails:**
- Hot shards become the bottleneck, capping system throughput at one shard's capacity
- Rebalancing data across shards is operationally expensive and risky
- Using timestamps as shard keys creates write-hot shards (all new data goes to one shard)
- Growth in one category or tenant can destabilize the entire cluster
- Monitoring may show "average" load as healthy while one shard is on fire

**The fix:**
- Analyze data distribution before choosing a shard key -- histogram the candidate key
- Use composite shard keys that combine a high-cardinality field with a distribution field
- Hash-based sharding (consistent hashing) provides uniform distribution at the cost of range query support
- Implement automatic rebalancing (as in MongoDB, CockroachDB, TiDB)
- Monitor per-shard metrics: CPU, IOPS, query latency, record count
- Consider virtual shards (more shards than nodes) to simplify rebalancing

**Detection signals:**
- One shard's CPU or IOPS is 5x+ higher than other shards
- Record counts vary by more than 3x across shards
- Shard key is a timestamp or low-cardinality field (status, country code)
- No per-shard monitoring dashboards exist

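"Histogram the candidate key" can be done before any data is moved. A toy model of the incident above -- one enterprise tenant, many small ones; the names and sizes are made up -- shows the skew, and how a composite key spreads it:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def shard_of(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable hash-based placement (md5 so results don't vary per process)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big") % num_shards

# Hypothetical tenant mix: one enterprise customer with 10,000 users,
# one hundred small customers with 5 users each.
rows = [("acme", u) for u in range(10_000)] + \
       [(f"tenant{i}", u) for i in range(100) for u in range(5)]

# Candidate 1: shard by customer_id alone -- all of "acme" lands on one shard.
by_customer = Counter(shard_of(cust) for cust, _ in rows)

# Candidate 2: composite key (customer_id, user_id) spreads the big tenant.
by_composite = Counter(shard_of(f"{cust}:{u}") for cust, u in rows)

print("by customer_id: ", sorted(by_customer.values()))
print("by composite key:", sorted(by_composite.values()))
```

The histogram makes the trade-off from the fix list concrete: the composite key is near-uniform but can no longer answer "all rows for acme" from a single shard.
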
---

## 12. Cross-Shard Transactions

**Anti-pattern:** Designing sharded systems that require frequent transactions spanning multiple shards, introducing distributed coordination overhead (two-phase commit) that negates the throughput gains of sharding.

**Why it happens:** Applications are sharded for write throughput, but business logic still requires atomic operations across entities that live on different shards. "Transfer $100 from Account A (Shard 1) to Account B (Shard 2)" is a natural requirement that is extremely difficult to implement correctly in a sharded system.

**Real-world context:** Two-phase commit (2PC) has been the standard protocol for cross-shard consistency, used in systems from Oracle and PostgreSQL to Google Spanner and Apache Kafka. However, distributed systems expert Daniel Abadi argues: "I see very little benefit in system architects making continued use of 2PC in sharded systems moving forward." The protocol blocks when a participant fails, and if the coordinator fails permanently during the commit phase, some participants will never resolve their transactions, leaving data in an inconsistent state.

**Why it fails:**
- 2PC adds a coordination round-trip to every transaction, increasing latency by 2-10x
- Locks must be held across shards for the duration of the protocol, reducing throughput
- Coordinator failure during commit leaves data in an indeterminate state
- Deadlocks across shards are difficult to detect and resolve
- Throughput drops to the speed of the slowest shard

**The fix:**
- Design the data model so that related entities co-locate on the same shard
- Use the Saga pattern: break distributed transactions into compensatable local transactions
- Accept eventual consistency where business rules allow (most do)
- Use change data capture (CDC) and event sourcing for cross-shard data synchronization
- If strong consistency is required, use databases with native distributed transactions (CockroachDB, Spanner)
- Minimize cross-shard operations by denormalizing frequently-joined data onto the same shard

**Detection signals:**
- `BEGIN DISTRIBUTED TRANSACTION` or 2PC log entries in database logs
- Cross-shard query latency is 5x+ higher than single-shard latency
- Deadlock errors involving multiple shards
- Business logic requires joins across shard boundaries

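The Saga pattern from the fix list reduces to "local transaction, then either the next local transaction or a compensating one." A toy sketch with in-memory dicts standing in for two shards (account names and the failure mode are illustrative):

```python
class SagaFailed(Exception):
    pass

def transfer(shard1, shard2, src, dst, amount):
    """Debit on shard 1, credit on shard 2, as two *local* transactions.

    If the credit step fails, the debit is undone by a compensating
    transaction instead of holding cross-shard locks under 2PC.
    """
    shard1[src] -= amount                  # step 1: local txn on shard 1
    try:
        if dst not in shard2:
            raise SagaFailed(f"unknown account {dst}")
        shard2[dst] += amount              # step 2: local txn on shard 2
    except SagaFailed:
        shard1[src] += amount              # compensating txn on shard 1
        raise

accounts_shard1 = {"A": 500}
accounts_shard2 = {"B": 100}

transfer(accounts_shard1, accounts_shard2, "A", "B", 100)
assert accounts_shard1 == {"A": 400} and accounts_shard2 == {"B": 200}

try:
    transfer(accounts_shard1, accounts_shard2, "A", "missing", 50)
except SagaFailed:
    pass
assert accounts_shard1 == {"A": 400}       # the debit was compensated
```

The window between step 1 and the compensation is visible to readers -- that is the eventual consistency the fix list asks you to accept, in exchange for never blocking on a coordinator.
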
---

## 13. Scaling by Adding Complexity

**Anti-pattern:** Responding to scaling challenges by introducing additional layers, services, caches, and technologies instead of first simplifying the existing architecture, removing unnecessary work, or optimizing hot paths.

**Why it happens:** Adding a cache in front of a slow query feels productive. Introducing a message queue between two services feels like proper engineering. Teams accumulate layers because each solves a proximate problem without addressing why the problem exists. Resume-driven development also plays a role -- engineers want to work with shiny distributed systems technologies.

**Real-world incident:** Pokemon Go's scaling crisis illustrates complexity backfiring. During their migration to Google Cloud load balancer (GCLB), the team added GCLB to scale the load balancing layer. But the additional capacity at the load balancing tier actually overwhelmed their backend stack -- the bottleneck was downstream, not at the load balancer. The migration prolonged the outage rather than fixing it. Adding capacity at the wrong layer amplified the failure.

**Why it fails:**
- Each new component adds latency, failure modes, and operational burden
- Caches create cache invalidation problems (one of the two hard things in computer science)
- More moving parts means more things that can fail simultaneously
- Debugging requires understanding interactions between N components instead of one
- Operational overhead grows: monitoring, alerting, upgrades, and on-call burden multiply

**The fix:**
- Before adding a component, ask: "Can we remove or simplify something instead?"
- Profile first: find the actual bottleneck before adding infrastructure
- Remove unnecessary middleware, ORM layers, and abstraction layers
- Optimize the hot path: 90% of load is often caused by 10% of code paths
- Evaluate whether the existing technology can be tuned before introducing a new one
- Apply the "boring technology" principle: use proven, well-understood tools

**Detection signals:**
- Architecture diagrams require a legend and multiple pages
- More infrastructure components than team members
- Incidents frequently involve interactions between components rather than individual failures
- Team cannot explain the full request path from client to database

---

## 14. Ignoring Cold Start Problems

**Anti-pattern:** Failing to account for initialization latency when new instances, containers, or serverless functions are launched, causing latency spikes and timeouts during scale-out events or after periods of low traffic.

**Why it happens:** Cold starts are invisible in steady-state monitoring. Functions and containers that are already warm respond in milliseconds. The problem only manifests during scaling events (new instances launching) or after idle periods (serverless environments recycling). Developers testing against warm environments never experience the issue.

**Real-world incident:** AWS Lambda cold starts typically add 100ms-2s to function execution time depending on runtime, dependencies, and code size. While cold starts affect less than 1% of requests in steady state, during traffic spikes every new concurrent invocation experiences a cold start. In event-driven architectures with functions calling other functions, the probability that at least one function in the chain is cold approaches 100%, causing cascading latency. One team reported that a chain of five Lambda functions experienced compounding cold starts that pushed end-to-end latency from 200ms (warm) to 8 seconds (all cold).

**Why it fails:**
- Health checks pass before the application is actually ready to serve traffic
- JVM-based services need time for JIT compilation -- first requests are 10-100x slower
- Dependency initialization (database connections, SDK clients, config loading) takes seconds
- Auto-scaling triggers bring up instances that immediately receive traffic before warming up
- Serverless environments recycle idle instances unpredictably

**The fix:**
- Implement readiness probes that verify the application can actually serve requests
- Use connection pre-warming: establish database and cache connections during startup
- For Lambda: use Provisioned Concurrency to keep functions initialized
- Pre-warm caches on startup by loading frequently-accessed data
- Use progressive traffic shifting: new instances receive traffic gradually, not all at once
- Minimize dependency count and use lazy initialization for non-critical paths

**Detection signals:**
- p99 latency is 10x+ higher than p50
- Latency spikes correlate with scaling events or deployment times
- First request after idle period is significantly slower
- Startup logs show multi-second initialization sequences

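The readiness-probe and pre-warming fixes come down to one rule: do not report ready until initialization has actually finished. A minimal sketch (the class and method names here are hypothetical, and the simulated init is shortened to milliseconds):

```python
import time

class AppInstance:
    """Separates 'process is up' (liveness) from 'can serve traffic' (readiness)."""

    def __init__(self):
        self._ready = False
        self.db_conn = None
        self.cache = {}

    def warm_up(self):
        """Run at startup, before the instance joins the load balancer pool."""
        self.db_conn = self._open_db_connection()   # connection pre-warming
        self.cache["hot_config"] = "preloaded"      # cache pre-warming
        self._ready = True

    def _open_db_connection(self):
        time.sleep(0.01)          # stands in for multi-second real-world init
        return object()

    def liveness(self) -> bool:
        return True               # process is alive as soon as it starts

    def readiness(self) -> bool:
        """What a readiness endpoint (e.g. /readyz) should return."""
        return self._ready and self.db_conn is not None

inst = AppInstance()
assert inst.liveness() and not inst.readiness()    # up, but must not get traffic
inst.warm_up()
assert inst.readiness()
```
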
---

## 15. Not Planning for Graceful Degradation

**Anti-pattern:** Building systems that either work at full capacity or fail completely, with no intermediate modes that maintain core functionality when subsystems are impaired.

**Why it happens:** Systems are designed for the happy path. Failure handling is an afterthought. "If the recommendation service is down, what do we show?" is a question that never gets asked during design. Feature flags and degradation modes require upfront investment that feels wasteful when everything is working.

**Real-world incident:** Pokemon Go's launch is a canonical example. Instead of degrading gracefully -- for example, disabling social features, reducing map detail, or limiting new registrations -- the entire system collapsed under unexpected load. Users could not log in at all. In contrast, Fastly's CDN is designed so that if an origin server is unavailable, it serves stale cached content rather than error pages, maintaining user experience while giving incident responders time to diagnose and fix the root cause.

**Why it fails:**
- Total outage of a non-critical subsystem takes down the entire application
- Users get error pages instead of reduced-functionality experiences
- Incident responders have no levers to shed load or disable features
- Recovery is all-or-nothing, making partial restoration impossible
- No fallback behavior has been designed or tested

**The fix:**
- Identify critical vs non-critical features and design fallbacks for non-critical ones
- Implement circuit breakers (Hystrix, Resilience4j, Polly) for all downstream dependencies
- Use feature flags to disable resource-intensive features during overload
- Serve stale/cached data when fresh data is unavailable
- Design load-shedding endpoints: drop excess requests at the edge, not deep in the stack
- Implement priority queues: process high-value requests first during degradation
- Test degradation modes regularly -- chaos engineering validates that fallbacks work

**Detection signals:**
- No circuit breakers in the codebase
- No feature flag system deployed
- Error pages are the only failure response -- no partial functionality
- Runbooks say "wait for recovery" instead of listing degradation steps

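The circuit breakers named above (Hystrix, Resilience4j, Polly) all share one core state machine; a stripped-down sketch with an injectable clock, serving stale data as the fallback while the breaker is open:

```python
import time

class CircuitBreaker:
    """Closed -> open after N consecutive failures; open -> half-open after
    a cool-down. While open, the fallback serves instead of the dependency."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()          # open: degrade, don't touch downstream
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0                  # success closes the breaker
        return result

calls = []
def flaky():
    calls.append(1)
    raise ConnectionError("recommendation service down")

breaker = CircuitBreaker(max_failures=2, reset_after=30.0, clock=lambda: 0.0)
for _ in range(5):
    assert breaker.call(flaky, lambda: "cached recommendations") == "cached recommendations"
assert len(calls) == 2    # after 2 failures the dependency is no longer hit
```

Users keep seeing (stale) recommendations instead of an error page, and the failing service gets breathing room to recover -- exactly the Fastly behavior described above.
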
---

## 16. Monolithic Database Migrations at Scale

**Anti-pattern:** Running schema-altering DDL operations (adding columns, creating indexes, altering types) on large production tables using blocking operations that lock the table and halt all reads or writes for the duration of the migration.

**Why it happens:** ORMs generate migration files that use standard `ALTER TABLE` statements. These work fine on small tables. Developers test migrations on development databases with 1,000 rows, not production databases with 50 million rows. The migration that takes 200ms in development takes 45 minutes in production -- and locks the table the entire time.

**Real-world incident:** GitHub's June 2025 outage was triggered by a planned database migration that cascaded into a multi-hour incident disrupting repositories, pull requests, GitHub Actions, and dependent services. The migration triggered unanticipated load patterns on primary database clusters, causing cascading failures. In another documented case, adding an index to a 50-million-row table locked the entire table, blocking all reads and writes for 45 minutes, with downstream cost estimated at $5,600 per minute of downtime.

**Why it fails:**
- `ALTER TABLE` acquires exclusive locks on large tables, blocking all queries
- Index creation on large tables can take minutes to hours
- Failed migrations may leave the schema in an inconsistent state
- Rollback of a partially-applied migration can be more dangerous than the migration itself
- Multiple services depending on the same table are all affected simultaneously

**The fix:**
- Use online schema change tools: `pt-online-schema-change` (MySQL), `pg_repack` (PostgreSQL), `gh-ost` (GitHub's own tool)
- Add columns as nullable first, backfill data, then add constraints
- Create indexes with `CONCURRENTLY` (PostgreSQL) or equivalent non-blocking syntax
- Implement expand-contract migrations: add the new schema alongside the old, migrate data, then remove the old
- Test migrations against production-sized datasets before deploying
- Use feature flags to gradually shift traffic to new schema paths
- Schedule migrations during low-traffic windows with rollback plans

**Detection signals:**
- Migration files contain raw `ALTER TABLE ... ADD COLUMN ... NOT NULL`
- No online schema change tooling in the deployment pipeline
- Migration testing only runs against seeded development databases
- Lock wait timeout errors during deployments

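The "nullable first, backfill, then constrain" sequence is easiest to see in code. A sketch against SQLite (standing in for the real database; in production the batched backfill would run through `gh-ost`, `pt-online-schema-change`, or a dedicated backfill job):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"u{i}@example.com",) for i in range(1000)])

# Expand: add the column as NULLABLE. The dangerous form is
# ADD COLUMN ... NOT NULL, which forces a locking table rewrite on many engines.
conn.execute("ALTER TABLE users ADD COLUMN email_domain TEXT")

# Backfill in small batches so no single transaction holds locks for long.
BATCH = 100
while True:
    rows = conn.execute(
        "SELECT id, email FROM users WHERE email_domain IS NULL LIMIT ?",
        (BATCH,)).fetchall()
    if not rows:
        break
    conn.executemany(
        "UPDATE users SET email_domain = ? WHERE id = ?",
        [(email.split("@")[1], uid) for uid, email in rows])
    conn.commit()

# Contract: verify the backfill is complete before enforcing the constraint
# (in the application layer or in a later migration).
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email_domain IS NULL").fetchone()[0]
assert remaining == 0
```
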
---

## 17. Network Calls in Loops

**Anti-pattern:** Making individual network requests (database queries, API calls, cache lookups) inside a loop, turning what should be a single batch operation into N sequential round-trips, each paying full network latency.

**Why it happens:** ORMs make it easy: `for order in orders: order.customer.name` triggers a query per iteration (the N+1 problem). REST APIs expose individual resources, so fetching related data requires one call per item. The code reads naturally and works correctly -- it is just catastrophically slow at scale.

**Real-world incident:** A developer documented reducing API response time from 30 seconds to under 1 second by eliminating N+1 queries. The endpoint listed 500 orders, and for each order, made a separate database query to fetch the customer -- 501 total queries. Each query took 2ms on the network, but 501 x 2ms = 1 second of pure network latency, plus database processing time. Replacing this with a single `WHERE customer_id IN (...)` query reduced the total to 2 queries and sub-100ms response time. At high traffic, the N+1 version generated 50,000+ queries per second from a single endpoint.

**Why it fails:**
- Each network call adds 1-5ms of latency (TCP round-trip), which multiplies by N
- Database connection pool is consumed by N concurrent connections for one user request
- Serialization/deserialization overhead multiplies by N
- Total latency grows linearly with data size, making the endpoint unusable as data grows
- Database query logs show thousands of nearly-identical queries

**The fix:**
- Use batch APIs: `GET /users?ids=1,2,3` instead of N individual calls
- Use eager loading in ORMs: `includes(:customer)` (Rails), `joinedload()` (SQLAlchemy), `.Include()` (EF Core)
- Implement DataLoader pattern for GraphQL (batches and deduplicates within a request)
- Replace loops with `WHERE ... IN (...)` queries
- Use database views or materialized views to pre-join data
- Add N+1 detection tools: Bullet (Rails), nplusone (Django), SQLAlchemy warnings

**Detection signals:**
- Database query logs show repeated queries with only the parameter changing
- Endpoint latency scales linearly with result set size
- ORM lazy-loading enabled with no eager-loading configuration
- API calls inside `for`, `forEach`, `map`, or `while` loops

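The 501-query shape and its batched replacement, side by side on a toy SQLite dataset (the table names mirror the incident above; the structure is what matters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer-{i}") for i in range(100)])
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)",
                 [(i % 100,) for i in range(500)])

orders = conn.execute("SELECT id, customer_id FROM orders").fetchall()

# N+1: one query per order -- 501 round-trips at network latency each.
names_slow = {}
for order_id, cust_id in orders:
    row = conn.execute("SELECT name FROM customers WHERE id = ?",
                       (cust_id,)).fetchone()
    names_slow[order_id] = row[0]

# Batched: one WHERE ... IN query, joined in memory -- 2 queries total.
ids = sorted({cust_id for _, cust_id in orders})
placeholders = ",".join("?" * len(ids))
by_id = dict(conn.execute(
    f"SELECT id, name FROM customers WHERE id IN ({placeholders})", ids))
names_fast = {order_id: by_id[cust_id] for order_id, cust_id in orders}

assert names_fast == names_slow   # same result, 2 queries instead of 501
```
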
---

## 18. Fan-Out Without Fan-In Limits

**Anti-pattern:** Designing systems where a single request triggers many parallel downstream requests (fan-out) without limiting how many can be in flight simultaneously, risking overwhelming downstream services and creating cascading failures.

**Why it happens:** Fan-out is a natural pattern: a social media timeline request fans out to N friend feeds, a search request fans out to N index shards, an API gateway fans out to N microservices. The pattern works at small N but becomes dangerous as N grows. Developers set fan-out based on the data model without considering the downstream capacity.

**Real-world context:** LinkedIn published research (Moolle, ICDE 2016) on fan-out control for scalable distributed data stores, documenting how unlimited fan-out in their social graph queries could overwhelm backend storage nodes. Their system found that keeping dependency chains shallow (1-2 levels) and limiting parallel requests per tier was essential for stability. Without fan-in limits, a single user request could generate thousands of backend queries, and a modest traffic spike would amplify into a backend-crushing storm.

**Why it fails:**
- N downstream calls means N chances for failure -- probability of at least one failure approaches 1
- Total latency is bounded by the slowest of N calls (tail latency amplification)
- Downstream services experience N x traffic amplification from fan-out
- Retry logic on fan-out calls multiplies the amplification effect
- One slow downstream service blocks the entire fan-in, wasting the fast responses

**The fix:**
- Set explicit concurrency limits on fan-out calls (semaphores, worker pools)
- Implement timeouts per fan-out call -- do not wait for stragglers
- Use hedged requests: send a second request after a timeout, take whichever finishes first
- Apply circuit breakers per downstream service
- Return partial results when some fan-out calls fail (graceful degradation)
- Isolate resource pools for high-fan-out calls so they cannot starve other workloads
- Monitor fan-out factor per endpoint and alert when it exceeds expected bounds

**Detection signals:**
- `Promise.all()` or `asyncio.gather()` with unbounded arrays of calls
- No concurrency limit on parallel HTTP client calls
- Tail latency (p99) is much higher than median (p50) due to straggler effect
- Downstream services report traffic spikes correlated with upstream deployments

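The concurrency-limit, per-call-timeout, and partial-results fixes combine naturally in `asyncio`. A sketch with a stubbed downstream call and one deliberate straggler:

```python
import asyncio

async def fetch_shard(shard_id: int) -> str:
    """Stand-in for a downstream call; shard 7 is a simulated straggler."""
    await asyncio.sleep(5.0 if shard_id == 7 else 0.01)
    return f"shard-{shard_id}-data"

async def fan_out(shard_ids, concurrency=8, timeout=0.5):
    """Bounded fan-out: at most `concurrency` calls in flight, a per-call
    timeout so stragglers cannot block fan-in, and partial results on failure."""
    sem = asyncio.Semaphore(concurrency)

    async def one(shard_id):
        async with sem:
            try:
                return await asyncio.wait_for(fetch_shard(shard_id), timeout)
            except asyncio.TimeoutError:
                return None                 # degrade: skip the straggler

    responses = await asyncio.gather(*(one(s) for s in shard_ids))
    return [r for r in responses if r is not None]

results = asyncio.run(fan_out(range(16)))
assert len(results) == 15    # shard 7 timed out; the other 15 still returned
```

Contrast with the detection signal above: a bare `asyncio.gather()` over an unbounded list would put all 16 calls in flight at once and block the fan-in on the 5-second straggler.
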
---

## 19. Not Using Read Replicas When Read-Heavy

**Anti-pattern:** Sending all database queries -- reads and writes -- to a single primary instance, even when the workload is 90%+ reads, leaving the primary overloaded with read traffic that could be served by replicas.

**Why it happens:** Using a single database endpoint is simpler. Application code does not need to distinguish between read and write connections. ORMs default to a single connection. Teams do not realize their workload is read-heavy because they have never measured the read/write ratio. Adding read replicas requires code changes to route queries.

**Real-world context:** AWS RDS documentation emphasizes that read replicas provide horizontal scaling by offloading read-intensive workloads from the primary instance. Most web applications are 80-95% reads. A primary database handling 10,000 queries/second where 9,500 are reads could offload those to 2-3 replicas, reducing primary load by 95% and freeing it for writes. However, RDS does not automatically route reads to replicas -- the application must explicitly direct read traffic to replica endpoints.

**Why it fails:**
- Primary database CPU is saturated by read queries, slowing write transactions
- Vertical scaling the primary is expensive and has limits
- Read-heavy endpoints (dashboards, feeds, search results) dominate the query mix
- The primary cannot be scaled horizontally for writes, making read offloading essential
- Failover promotes a replica to primary, but if no replicas exist, failover means downtime

**The fix:**
- Measure the read/write ratio -- if reads exceed 70%, add replicas
- Configure the ORM for read/write splitting: write to primary, read from replica
- Use a database proxy (ProxySQL, Amazon RDS Proxy, PgPool) for automatic routing
- Accept eventual consistency for read-replica queries (typical lag is under 1 second)
- Monitor replica lag (`ReplicaLag` metric) and fail over to primary if lag exceeds acceptable thresholds
- Size replica count based on: `num_replicas = ceil(read_qps / single_instance_read_capacity)`

**Detection signals:**
- Single database endpoint in application configuration
- Primary database CPU above 70% while query mix is majority SELECT
- No replica instances provisioned
- Read-heavy endpoints have higher latency than write endpoints

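Application-level read/write splitting can start as a thin router. A sketch with fake connections that just record where each statement went (a proxy such as ProxySQL or RDS Proxy does the production version of this, and the class names here are illustrative):

```python
class RoutingCursor:
    """Minimal read/write splitter: SELECTs go to a replica (round-robin),
    everything else goes to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self._next = 0

    def execute(self, sql: str, params=()):
        if sql.lstrip().lower().startswith("select"):
            replica = self.replicas[self._next % len(self.replicas)]
            self._next += 1
            return replica.execute(sql, params)
        return self.primary.execute(sql, params)

class FakeConn:
    """Records queries and returns its own name so routing is observable."""
    def __init__(self, name):
        self.name, self.queries = name, []
    def execute(self, sql, params=()):
        self.queries.append(sql)
        return self.name

primary = FakeConn("primary")
replicas = [FakeConn("replica-1"), FakeConn("replica-2")]
db = RoutingCursor(primary, replicas)

assert db.execute("SELECT * FROM orders") == "replica-1"
assert db.execute("SELECT * FROM users") == "replica-2"
assert db.execute("INSERT INTO orders VALUES (1)") == "primary"
```

A real router also needs a read-your-own-writes escape hatch (pin a session to the primary briefly after a write), since replica lag makes a just-written row momentarily invisible on replicas.
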
---

## 20. Over-Provisioning vs Under-Provisioning

**Anti-pattern:** Either allocating far more resources than needed (wasting money) or allocating too few (causing outages under load), rather than right-sizing based on data and implementing auto-scaling.

**Why it happens:** Over-provisioning is driven by fear -- "what if we get a traffic spike?" Under-provisioning is driven by cost pressure -- "can we run this cheaper?" Both are guesses. Without load testing data and auto-scaling, teams pick a static size and hope. Premature scaling was identified as a factor in 70% of tech startup failures (Startup Genome report).

**Real-world incident:** Groupon prioritized rapid customer acquisition and scaled infrastructure aggressively, spending heavily on capacity they did not need while their underlying business model was unsustainable. In contrast, Amazon grew methodically, focusing on dominating one market at a time and staying lean. On the under-provisioning side, Kubernetes environments frequently suffer from overprovisioning after an under-provisioning incident -- teams panic-scale to 3x capacity after an outage, then never right-size back down. Studies show organizations waste 30-35% of cloud spend on over-provisioned resources.

**Why it fails:**
- Over-provisioning wastes 30-35% of cloud spend on idle resources
- Under-provisioning causes outages during traffic spikes and degrades user experience
- Static provisioning cannot adapt to variable traffic patterns (day/night, weekday/weekend)
- Over-provisioned resources mask inefficient code -- there is no pressure to optimize
- Under-provisioned databases hit connection limits and IOPS caps under load

**The fix:**
- Implement auto-scaling based on actual metrics (CPU, memory, request queue depth)
- Right-size instances using utilization data: target 60-70% average CPU utilization
- Use spot/preemptible instances for fault-tolerant workloads (60-90% cost savings)
- Implement cost monitoring with alerts for spend anomalies
- Run regular right-sizing reviews using tools (AWS Compute Optimizer, GCP Recommender)
- Load test to determine actual capacity needs rather than guessing
- Use reserved instances or savings plans for predictable baseline load, spot for burst

**Detection signals:**
- Average CPU utilization below 20% (over-provisioned) or consistently above 85% (under-provisioned)
- No auto-scaling policies configured
- Instance sizes have not been reviewed in 6+ months
- Cloud bill growing faster than traffic or revenue

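Right-sizing and auto-scaling reduce to a simple proportional rule: scale so that average utilization lands in the target band. This is the same shape as target-tracking policies in AWS and GCP; the numbers here are illustrative:

```python
import math

def desired_capacity(current_instances: int, current_cpu_pct: float,
                     target_cpu_pct: float = 65.0,
                     min_instances: int = 2, max_instances: int = 20) -> int:
    """Proportional scaling toward the 60-70% utilization target band,
    clamped to explicit floor and ceiling instance counts."""
    raw = math.ceil(current_instances * current_cpu_pct / target_cpu_pct)
    return max(min_instances, min(max_instances, raw))

assert desired_capacity(4, 90.0) == 6     # running hot: scale out
assert desired_capacity(10, 20.0) == 4    # mostly idle: scale in
assert desired_capacity(2, 10.0) == 2     # floor prevents scaling to nothing
```

The floor and ceiling are the part teams skip: the floor keeps an idle service from flapping down to zero capacity before a spike, and the ceiling bounds the bill when a metric misfires.
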
---

## Root Cause Analysis

Scaling anti-patterns cluster around five root causes:

### 1. Single-Machine Mindset
**Anti-patterns:** #1, #2, #3, #4, #5, #19
**Root cause:** Designing for a single server because that is the development environment. State is stored locally, connections are opened per-request, and all traffic hits one database. The architecture works for one server and breaks for two.
**Systemic fix:** Adopt the Twelve-Factor App methodology. Treat servers as disposable. Externalize all state. Test with multiple instances from the beginning.

### 2. No Capacity Planning
**Anti-patterns:** #10, #14, #20
**Root cause:** Capacity is unknown because it was never measured. Load testing is skipped, cold starts are untested, and provisioning is guesswork. The system's limits are discovered through production incidents.
**Systemic fix:** Make load testing a release gate. Measure capacity before every launch. Implement auto-scaling with validated thresholds.

### 3. Unbounded Resource Consumption
**Anti-patterns:** #7, #8, #9, #18
**Root cause:** No limits on resource consumption -- queues, connections, fan-out, retries -- because setting limits feels like artificial constraints. Resources are consumed faster than they are released, and the system runs out.
**Systemic fix:** Every resource must have an explicit bound: queue depth, connection count, fan-out factor, retry budget. Design for rejection and backpressure from day one.

### 4. Data Architecture Debt
**Anti-patterns:** #6, #11, #12, #16, #17, #19
**Root cause:** A single database handles all workloads. Schema changes are blocking. Shard keys are chosen without analysis. Queries are generated by ORM defaults. The data layer becomes the bottleneck that cannot be easily changed.
**Systemic fix:** Separate read and write paths. Choose shard keys based on access pattern analysis. Use online schema change tools. Audit ORM-generated queries.

### 5. Complexity Over Simplification
**Anti-patterns:** #13, #15
**Root cause:** Adding components to solve scaling problems without first understanding the bottleneck. More layers means more failure modes, more latency, and more operational burden.
**Systemic fix:** Profile before scaling. Remove unnecessary layers. Design degradation modes that reduce functionality rather than adding infrastructure.

---
|
|
688
|
+
|
|
689
|
+
## Self-Check Questions
|
|
690
|
+
|
|
691
|
+
Use these questions during design reviews and architecture assessments:

### Statelessness and Horizontal Scaling

- [ ] Can we add a second instance behind a load balancer with zero code changes?
- [ ] Is all user-facing state stored in an external service (database, Redis, S3)?
- [ ] Can any instance handle any request, or are requests tied to specific instances?
- [ ] Are we using sticky sessions? If so, do we have a plan to remove them?
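
These questions reduce to one property: session state must live outside the instance. A minimal sketch, with a plain dict standing in for an external store such as Redis (all names hypothetical):

```python
# Stand-in for a shared external store; in production every instance
# would connect to the same Redis/database, not share process memory.
shared_sessions: dict[str, dict] = {}

class AppInstance:
    """A stateless instance: it holds no session data of its own, so any
    instance behind the load balancer can serve any request -- no sticky
    sessions required."""

    def __init__(self, store: dict):
        self.store = store

    def handle(self, session_id: str, key: str, value) -> dict:
        session = self.store.setdefault(session_id, {})
        session[key] = value
        return session
```

Because both instances read and write the same store, a request can land on either one mid-session without losing state.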

### Database and Data Layer

- [ ] Is the database handling mixed workloads (OLTP + analytics + search)?
- [ ] Do we have read replicas configured for read-heavy endpoints?
- [ ] What is our read/write ratio? Have we measured it?
- [ ] Are database migrations tested against production-sized datasets?
- [ ] Do we use online schema change tools for migrations?
- [ ] Is connection pooling configured with explicit pool sizes?
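
The pooling question is worth making concrete. Below is a toy pool with an explicit size and a checkout timeout -- a sketch of the behavior behind real settings such as SQLAlchemy's `pool_size`/`max_overflow` or a JDBC `maxPoolSize` (the class and defaults are hypothetical):

```python
import queue

class BoundedPool:
    """Toy connection pool: a fixed number of connections, and a checkout
    that fails loudly rather than letting connection counts grow until
    the database starts refusing them."""

    def __init__(self, make_conn, size: int):
        self._idle: queue.Queue = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(make_conn())

    def acquire(self, timeout: float = 1.0):
        # Raises queue.Empty after `timeout`; surface this as an error
        # or a 503 instead of queueing callers indefinitely.
        return self._idle.get(timeout=timeout)

    def release(self, conn) -> None:
        self._idle.put_nowait(conn)
```

The explicit `size` makes pool exhaustion an observable, tunable event rather than an unbounded pile-up of connections.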

### Queues and Backpressure

- [ ] Do all queues have explicit maximum sizes?
- [ ] What happens when a queue is full? (Drop? Reject? Dead-letter?)
- [ ] Do we have rate limiting on public-facing endpoints?
- [ ] What is the maximum queue depth we can tolerate before latency is unacceptable?

### Failure and Degradation

- [ ] What happens when a downstream service is unavailable?
- [ ] Do we have circuit breakers for all external dependencies?
- [ ] Can we disable non-critical features during overload?
- [ ] Do retries include jitter and exponential backoff?
- [ ] Is there a retry budget to prevent thundering herd?
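
The last two questions can be sketched in a few lines. The "full jitter" recipe follows the widely cited AWS Architecture Blog approach; the base, cap, and 10% budget ratio are illustrative values, not recommendations:

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Exponential backoff with full jitter: a delay drawn uniformly from
    [0, min(cap, base * 2**attempt)], so synchronized clients spread out
    instead of retrying in lockstep."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

class RetryBudget:
    """Allow retries only while they stay under a fixed fraction of total
    requests, so a failing dependency sees bounded extra load rather than
    a retry-driven thundering herd."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_retry(self) -> bool:
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False
```

With a 10% ratio, 100 recorded requests grant at most 10 retries; once the budget is spent, failures propagate immediately instead of amplifying load on the struggling dependency.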

### Capacity and Load

- [ ] Have we load tested at 2x expected peak?
- [ ] Do we know where the system breaks under load?
- [ ] Are auto-scaling policies configured and validated?
- [ ] Is our provisioning based on data or guesswork?
- [ ] Do we know the cold start time for new instances?

### Complexity

- [ ] Can every team member explain the full request path?
- [ ] Are we adding a component to solve a problem, or to avoid understanding one?
- [ ] Could we remove a layer instead of adding one?

---

## Code Smell Quick Reference

| Smell | Anti-Pattern | Severity |
|---|---|---|
| Single database connection string across all services | #6 Single DB | High |
| `session.sticky = true` in load balancer config | #3 Sticky Sessions | High |
| `fs.writeFile` / `File.write` for user uploads | #4 Local Filesystem | High |
| No `maxPoolSize` or pool config in DB connection | #5 No Connection Pool | Critical |
| `for item in items: db.query(item.id)` | #17 N+1 / Network Loops | High |
| `Promise.all(unboundedArray.map(fetch))` | #18 Unbounded Fan-Out | High |
| Queue instantiated with no `maxlen` / `capacity` | #9 Unbounded Queue | High |
| Retry with fixed delay: `sleep(5); retry()` | #7 Thundering Herd | Medium |
| No `429` or `503` responses in access logs | #8 No Backpressure | High |
| `ALTER TABLE` without `CONCURRENTLY` in migration | #16 Blocking Migration | Critical |
| Single DB endpoint; no replica endpoint in config | #19 No Read Replicas | Medium |
| No load test scripts or tools in repository | #10 No Load Testing | High |
| Auto-scaling min = max (fixed instance count) | #20 Static Provisioning | Medium |
| No circuit breaker library in dependencies | #15 No Degradation | High |
| Instance type is `*.4xlarge` or higher | #1 Vertical Only | Medium |
| Shard key is `created_at` or `timestamp` | #11 Hot Shards | High |
| `BEGIN DISTRIBUTED TRANSACTION` in query logs | #12 Cross-Shard Txn | Medium |
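
For the `Promise.all` smell, the fix is the same in any language: cap concurrency explicitly. An asyncio sketch of bounded fan-out (the `fetch` stub and the `limit` value are hypothetical; a semaphore bounds in-flight calls):

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for a real network call.
    await asyncio.sleep(0)
    return url

async def bounded_gather(urls: list[str], limit: int = 8) -> list[str]:
    """Run at most `limit` fetches concurrently, instead of launching one
    task per item the way Promise.all(urls.map(fetch)) would."""
    sem = asyncio.Semaphore(limit)

    async def one(url: str) -> str:
        async with sem:
            return await fetch(url)

    # gather preserves input order, so results line up with urls.
    return await asyncio.gather(*(one(u) for u in urls))
```

With the semaphore in place, a 10,000-item input produces at most `limit` simultaneous downstream calls instead of 10,000.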

---

## Sources

- [GitHub Database Infrastructure Outages (Feb 2020)](https://devclass.com/2020/03/27/github-reveals-database-infrastructure-was-the-villain-behind-february-spate-of-outages-again/)
- [GitHub's June 2025 Outage: Database Migration Cascade](https://www.webpronews.com/githubs-june-2025-outage-how-a-routine-database-migration-cascaded-into-a-platform-wide-crisis/)
- [PostgreSQL Connection Pool Exhaustion -- Lessons from a Production Outage](https://www.c-sharpcorner.com/article/postgresql-connection-pool-exhaustion-lessons-from-a-production-outage/)
- [Distributed Systems Horror Stories: The Thundering Herd Problem (Encore)](https://encore.dev/blog/thundering-herd-problem)
- [Scaling Depot: Solving a Thundering Herd Problem](https://depot.dev/blog/planetscale-to-reduce-the-thundering-herd)
- [Thundering Herd Problem Explained (Medium)](https://medium.com/@work.dhairya.singla/the-thundering-herd-problem-explained-causes-examples-and-solutions-7166b7e26c0c)
- [Pod Auto Scaling and the Curse of Sticky Sessions (Medium)](https://medium.com/nerd-for-tech/how-session-stickiness-disrupts-pod-auto-scaling-in-kubernetes-17ece8e2ea4f)
- [Scaling Horizontally: Kubernetes, Sticky Sessions, and Redis](https://dev.to/deepak_mishra_35863517037/scaling-horizontally-kubernetes-sticky-sessions-and-redis-578o)
- [How CodinGame Survived a Reddit Hug of Death](https://www.codingame.com/blog/how-did-codingame-survive-reddit-hug-of/)
- [Everyone's Doing Database Sharding Wrong ($2M Failure)](https://medium.com/@jholt1055/everyones-doing-database-sharding-wrong-here-s-why-your-2m-sharding-project-will-fail-de7f52d944a4)
- [Challenges of Sharding: Data Hotspots and Imbalanced Shards](https://dohost.us/index.php/2025/10/03/challenges-of-sharding-data-hotspots-and-imbalanced-shards/)
- [Wazuh Memory Exhaustion in Unbounded Queue (GitHub Issue)](https://github.com/wazuh/wazuh/issues/31240)
- [Authentik Worker OOM from Unbounded Queue Growth (GitHub Issue)](https://github.com/goauthentik/authentik/issues/18915)
- [Understanding and Remediating Cold Starts: AWS Lambda (AWS Blog)](https://aws.amazon.com/blogs/compute/understanding-and-remediating-cold-starts-an-aws-lambda-perspective/)
- [Zero Downtime Migrations at Petabyte Scale (PlanetScale)](https://planetscale.com/blog/zero-downtime-migrations-at-petabyte-scale)
- [Solving the N+1 Query Problem: 30s to Under 1s (Medium)](https://medium.com/@nkangprecious26/solving-the-n-1-query-problem-how-i-reduced-api-response-time-from-30s-to-1s-1fcd819c34e6)
- [Moolle: Fan-out Control for Scalable Distributed Data Stores (LinkedIn, ICDE 2016)](https://content.linkedin.com/content/dam/engineering/site-assets/pdfs/ICDE16_industry_571.pdf)
- [It's Time to Move on from Two Phase Commit (Daniel Abadi)](http://dbmsmusings.blogspot.com/2019/01/its-time-to-move-on-from-two-phase.html)
- [Vertical vs Horizontal Scaling (CockroachDB)](https://www.cockroachlabs.com/blog/vertical-scaling-vs-horizontal-scaling/)
- [The 7 Deadly Sins of Startups: Premature Scaling (Medium)](https://medium.com/superteam/danger-the-7-deadly-sins-of-startups-premature-scaling-1d2a976e2540)
- [Kubernetes Overprovisioning: The Hidden Cost (DEV Community)](https://dev.to/naveens16/kubernetes-overprovisioning-the-hidden-cost-of-chasing-performance-and-how-to-escape-114k)
- [Design for Chaos: Fastly's Principles of Fault Isolation](https://www.fastly.com/blog/design-for-chaos-fastlys-principles-of-fault-isolation-and-graceful)
- [AWS Well-Architected: Implement Graceful Degradation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_mitigate_interaction_failure_graceful_degradation.html)
- [N+1 API Calls Detection (Sentry)](https://docs.sentry.io/product/issues/issue-details/performance-issues/n-one-api-calls/)
- [AWS RDS Read Replicas Documentation](https://aws.amazon.com/rds/features/read-replicas/)
- [Dan Luu's Post-Mortems Collection (GitHub)](https://github.com/danluu/post-mortems)