@zigrivers/scaffold 3.16.0 → 3.17.0
- package/README.md +28 -0
- package/content/knowledge/backend/backend-fintech-broker-integration.md +244 -0
- package/content/knowledge/backend/backend-fintech-compliance.md +181 -0
- package/content/knowledge/backend/backend-fintech-data-modeling.md +210 -0
- package/content/knowledge/backend/backend-fintech-ledger.md +226 -0
- package/content/knowledge/backend/backend-fintech-observability.md +151 -0
- package/content/knowledge/backend/backend-fintech-order-lifecycle.md +213 -0
- package/content/knowledge/backend/backend-fintech-risk-management.md +150 -0
- package/content/knowledge/backend/backend-fintech-testing.md +197 -0
- package/content/knowledge/core/automated-review-tooling.md +10 -0
- package/content/knowledge/core/multi-service-api-contracts.md +634 -0
- package/content/knowledge/core/multi-service-architecture.md +492 -0
- package/content/knowledge/core/multi-service-auth.md +706 -0
- package/content/knowledge/core/multi-service-data-ownership.md +539 -0
- package/content/knowledge/core/multi-service-observability.md +545 -0
- package/content/knowledge/core/multi-service-resilience.md +710 -0
- package/content/knowledge/core/multi-service-task-decomposition.md +615 -0
- package/content/knowledge/core/multi-service-testing.md +728 -0
- package/content/methodology/backend-fintech.yml +46 -0
- package/content/methodology/custom-defaults.yml +6 -0
- package/content/methodology/deep.yml +6 -0
- package/content/methodology/multi-service-overlay.yml +103 -0
- package/content/methodology/mvp.yml +6 -0
- package/content/pipeline/architecture/service-ownership-map.md +83 -0
- package/content/pipeline/quality/cross-service-auth.md +96 -0
- package/content/pipeline/quality/cross-service-observability.md +104 -0
- package/content/pipeline/quality/integration-test-plan.md +106 -0
- package/content/pipeline/specification/inter-service-contracts.md +95 -0
- package/dist/cli/commands/adopt.cli-flags.test.js +20 -0
- package/dist/cli/commands/adopt.cli-flags.test.js.map +1 -1
- package/dist/cli/commands/adopt.d.ts.map +1 -1
- package/dist/cli/commands/adopt.js +11 -3
- package/dist/cli/commands/adopt.js.map +1 -1
- package/dist/cli/commands/complete.d.ts +1 -0
- package/dist/cli/commands/complete.d.ts.map +1 -1
- package/dist/cli/commands/complete.js +26 -8
- package/dist/cli/commands/complete.js.map +1 -1
- package/dist/cli/commands/dashboard.d.ts +1 -0
- package/dist/cli/commands/dashboard.d.ts.map +1 -1
- package/dist/cli/commands/dashboard.js +19 -6
- package/dist/cli/commands/dashboard.js.map +1 -1
- package/dist/cli/commands/decisions.d.ts +1 -0
- package/dist/cli/commands/decisions.d.ts.map +1 -1
- package/dist/cli/commands/decisions.js +18 -4
- package/dist/cli/commands/decisions.js.map +1 -1
- package/dist/cli/commands/info.d.ts +1 -0
- package/dist/cli/commands/info.d.ts.map +1 -1
- package/dist/cli/commands/info.js +25 -3
- package/dist/cli/commands/info.js.map +1 -1
- package/dist/cli/commands/init-from.test.d.ts +2 -0
- package/dist/cli/commands/init-from.test.d.ts.map +1 -0
- package/dist/cli/commands/init-from.test.js +315 -0
- package/dist/cli/commands/init-from.test.js.map +1 -0
- package/dist/cli/commands/init.d.ts +3 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +239 -129
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +20 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/cli/commands/next.d.ts +1 -0
- package/dist/cli/commands/next.d.ts.map +1 -1
- package/dist/cli/commands/next.js +40 -4
- package/dist/cli/commands/next.js.map +1 -1
- package/dist/cli/commands/next.test.js +151 -0
- package/dist/cli/commands/next.test.js.map +1 -1
- package/dist/cli/commands/reset.d.ts +1 -0
- package/dist/cli/commands/reset.d.ts.map +1 -1
- package/dist/cli/commands/reset.js +77 -29
- package/dist/cli/commands/reset.js.map +1 -1
- package/dist/cli/commands/rework.d.ts +1 -0
- package/dist/cli/commands/rework.d.ts.map +1 -1
- package/dist/cli/commands/rework.js +16 -2
- package/dist/cli/commands/rework.js.map +1 -1
- package/dist/cli/commands/run.d.ts +1 -0
- package/dist/cli/commands/run.d.ts.map +1 -1
- package/dist/cli/commands/run.js +65 -13
- package/dist/cli/commands/run.js.map +1 -1
- package/dist/cli/commands/run.test.js +192 -3
- package/dist/cli/commands/run.test.js.map +1 -1
- package/dist/cli/commands/skip.d.ts +1 -0
- package/dist/cli/commands/skip.d.ts.map +1 -1
- package/dist/cli/commands/skip.js +24 -7
- package/dist/cli/commands/skip.js.map +1 -1
- package/dist/cli/commands/status.d.ts +1 -0
- package/dist/cli/commands/status.d.ts.map +1 -1
- package/dist/cli/commands/status.js +51 -4
- package/dist/cli/commands/status.js.map +1 -1
- package/dist/cli/commands/status.test.js +128 -0
- package/dist/cli/commands/status.test.js.map +1 -1
- package/dist/cli/guards-coverage.test.d.ts +2 -0
- package/dist/cli/guards-coverage.test.d.ts.map +1 -0
- package/dist/cli/guards-coverage.test.js +26 -0
- package/dist/cli/guards-coverage.test.js.map +1 -0
- package/dist/cli/guards-integration.test.d.ts +2 -0
- package/dist/cli/guards-integration.test.d.ts.map +1 -0
- package/dist/cli/guards-integration.test.js +178 -0
- package/dist/cli/guards-integration.test.js.map +1 -0
- package/dist/cli/guards.d.ts +13 -0
- package/dist/cli/guards.d.ts.map +1 -0
- package/dist/cli/guards.js +70 -0
- package/dist/cli/guards.js.map +1 -0
- package/dist/cli/guards.test.d.ts +2 -0
- package/dist/cli/guards.test.d.ts.map +1 -0
- package/dist/cli/guards.test.js +136 -0
- package/dist/cli/guards.test.js.map +1 -0
- package/dist/cli/init-flag-families.d.ts +1 -1
- package/dist/cli/init-flag-families.d.ts.map +1 -1
- package/dist/cli/init-flag-families.js +4 -1
- package/dist/cli/init-flag-families.js.map +1 -1
- package/dist/cli/init-flag-families.test.js +10 -0
- package/dist/cli/init-flag-families.test.js.map +1 -1
- package/dist/cli/shutdown.d.ts +2 -3
- package/dist/cli/shutdown.d.ts.map +1 -1
- package/dist/cli/shutdown.js +14 -11
- package/dist/cli/shutdown.js.map +1 -1
- package/dist/cli/shutdown.test.js +2 -4
- package/dist/cli/shutdown.test.js.map +1 -1
- package/dist/config/schema.d.ts +12122 -288
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +74 -79
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +230 -1
- package/dist/config/schema.test.js.map +1 -1
- package/dist/config/validators/backend.d.ts +4 -0
- package/dist/config/validators/backend.d.ts.map +1 -0
- package/dist/config/validators/backend.js +14 -0
- package/dist/config/validators/backend.js.map +1 -0
- package/dist/config/validators/browser-extension.d.ts +4 -0
- package/dist/config/validators/browser-extension.d.ts.map +1 -0
- package/dist/config/validators/browser-extension.js +24 -0
- package/dist/config/validators/browser-extension.js.map +1 -0
- package/dist/config/validators/cli.d.ts +4 -0
- package/dist/config/validators/cli.d.ts.map +1 -0
- package/dist/config/validators/cli.js +14 -0
- package/dist/config/validators/cli.js.map +1 -0
- package/dist/config/validators/data-pipeline.d.ts +4 -0
- package/dist/config/validators/data-pipeline.d.ts.map +1 -0
- package/dist/config/validators/data-pipeline.js +14 -0
- package/dist/config/validators/data-pipeline.js.map +1 -0
- package/dist/config/validators/game.d.ts +4 -0
- package/dist/config/validators/game.d.ts.map +1 -0
- package/dist/config/validators/game.js +14 -0
- package/dist/config/validators/game.js.map +1 -0
- package/dist/config/validators/index.d.ts +7 -0
- package/dist/config/validators/index.d.ts.map +1 -0
- package/dist/config/validators/index.js +27 -0
- package/dist/config/validators/index.js.map +1 -0
- package/dist/config/validators/library.d.ts +4 -0
- package/dist/config/validators/library.d.ts.map +1 -0
- package/dist/config/validators/library.js +25 -0
- package/dist/config/validators/library.js.map +1 -0
- package/dist/config/validators/ml.d.ts +4 -0
- package/dist/config/validators/ml.d.ts.map +1 -0
- package/dist/config/validators/ml.js +31 -0
- package/dist/config/validators/ml.js.map +1 -0
- package/dist/config/validators/mobile-app.d.ts +4 -0
- package/dist/config/validators/mobile-app.d.ts.map +1 -0
- package/dist/config/validators/mobile-app.js +14 -0
- package/dist/config/validators/mobile-app.js.map +1 -0
- package/dist/config/validators/registry.test.d.ts +2 -0
- package/dist/config/validators/registry.test.d.ts.map +1 -0
- package/dist/config/validators/registry.test.js +26 -0
- package/dist/config/validators/registry.test.js.map +1 -0
- package/dist/config/validators/research.d.ts +4 -0
- package/dist/config/validators/research.d.ts.map +1 -0
- package/dist/config/validators/research.js +24 -0
- package/dist/config/validators/research.js.map +1 -0
- package/dist/config/validators/research.test.d.ts +2 -0
- package/dist/config/validators/research.test.d.ts.map +1 -0
- package/dist/config/validators/research.test.js +44 -0
- package/dist/config/validators/research.test.js.map +1 -0
- package/dist/config/validators/types.d.ts +19 -0
- package/dist/config/validators/types.d.ts.map +1 -0
- package/dist/config/validators/types.js +2 -0
- package/dist/config/validators/types.js.map +1 -0
- package/dist/config/validators/validators.test.d.ts +2 -0
- package/dist/config/validators/validators.test.d.ts.map +1 -0
- package/dist/config/validators/validators.test.js +25 -0
- package/dist/config/validators/validators.test.js.map +1 -0
- package/dist/config/validators/web-app.d.ts +4 -0
- package/dist/config/validators/web-app.d.ts.map +1 -0
- package/dist/config/validators/web-app.js +31 -0
- package/dist/config/validators/web-app.js.map +1 -0
- package/dist/core/assembly/context-gatherer.d.ts.map +1 -1
- package/dist/core/assembly/context-gatherer.js +4 -2
- package/dist/core/assembly/context-gatherer.js.map +1 -1
- package/dist/core/assembly/cross-reads.d.ts +58 -0
- package/dist/core/assembly/cross-reads.d.ts.map +1 -0
- package/dist/core/assembly/cross-reads.js +185 -0
- package/dist/core/assembly/cross-reads.js.map +1 -0
- package/dist/core/assembly/cross-reads.test.d.ts +2 -0
- package/dist/core/assembly/cross-reads.test.d.ts.map +1 -0
- package/dist/core/assembly/cross-reads.test.js +383 -0
- package/dist/core/assembly/cross-reads.test.js.map +1 -0
- package/dist/core/assembly/overlay-loader-structural.test.d.ts +2 -0
- package/dist/core/assembly/overlay-loader-structural.test.d.ts.map +1 -0
- package/dist/core/assembly/overlay-loader-structural.test.js +114 -0
- package/dist/core/assembly/overlay-loader-structural.test.js.map +1 -0
- package/dist/core/assembly/overlay-loader.d.ts +17 -3
- package/dist/core/assembly/overlay-loader.d.ts.map +1 -1
- package/dist/core/assembly/overlay-loader.js +75 -0
- package/dist/core/assembly/overlay-loader.js.map +1 -1
- package/dist/core/assembly/overlay-resolver.d.ts +2 -2
- package/dist/core/assembly/overlay-resolver.d.ts.map +1 -1
- package/dist/core/assembly/overlay-resolver.js.map +1 -1
- package/dist/core/assembly/overlay-resolver.test.js.map +1 -1
- package/dist/core/assembly/overlay-state-resolver.d.ts +5 -0
- package/dist/core/assembly/overlay-state-resolver.d.ts.map +1 -1
- package/dist/core/assembly/overlay-state-resolver.js +41 -1
- package/dist/core/assembly/overlay-state-resolver.js.map +1 -1
- package/dist/core/assembly/overlay-state-resolver.test.js +262 -0
- package/dist/core/assembly/overlay-state-resolver.test.js.map +1 -1
- package/dist/core/assembly/update-mode.d.ts +1 -0
- package/dist/core/assembly/update-mode.d.ts.map +1 -1
- package/dist/core/assembly/update-mode.js +17 -9
- package/dist/core/assembly/update-mode.js.map +1 -1
- package/dist/core/dependency/eligibility.d.ts +10 -1
- package/dist/core/dependency/eligibility.d.ts.map +1 -1
- package/dist/core/dependency/eligibility.js +19 -1
- package/dist/core/dependency/eligibility.js.map +1 -1
- package/dist/core/dependency/eligibility.test.js +82 -0
- package/dist/core/dependency/eligibility.test.js.map +1 -1
- package/dist/core/dependency/graph.d.ts +4 -1
- package/dist/core/dependency/graph.d.ts.map +1 -1
- package/dist/core/dependency/graph.js +7 -1
- package/dist/core/dependency/graph.js.map +1 -1
- package/dist/core/dependency/graph.test.js +29 -0
- package/dist/core/dependency/graph.test.js.map +1 -1
- package/dist/core/pipeline/global-steps.d.ts +7 -0
- package/dist/core/pipeline/global-steps.d.ts.map +1 -0
- package/dist/core/pipeline/global-steps.js +18 -0
- package/dist/core/pipeline/global-steps.js.map +1 -0
- package/dist/core/pipeline/resolver.d.ts +1 -0
- package/dist/core/pipeline/resolver.d.ts.map +1 -1
- package/dist/core/pipeline/resolver.js +51 -6
- package/dist/core/pipeline/resolver.js.map +1 -1
- package/dist/core/pipeline/types.d.ts +5 -1
- package/dist/core/pipeline/types.d.ts.map +1 -1
- package/dist/e2e/cross-service-references.test.d.ts +22 -0
- package/dist/e2e/cross-service-references.test.d.ts.map +1 -0
- package/dist/e2e/cross-service-references.test.js +188 -0
- package/dist/e2e/cross-service-references.test.js.map +1 -0
- package/dist/e2e/multi-service-pipeline.test.d.ts +10 -0
- package/dist/e2e/multi-service-pipeline.test.d.ts.map +1 -0
- package/dist/e2e/multi-service-pipeline.test.js +185 -0
- package/dist/e2e/multi-service-pipeline.test.js.map +1 -0
- package/dist/e2e/project-type-overlays.test.js +68 -0
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/e2e/service-execution.test.d.ts +15 -0
- package/dist/e2e/service-execution.test.d.ts.map +1 -0
- package/dist/e2e/service-execution.test.js +219 -0
- package/dist/e2e/service-execution.test.js.map +1 -0
- package/dist/e2e/service-manifest.test.d.ts +19 -0
- package/dist/e2e/service-manifest.test.d.ts.map +1 -0
- package/dist/e2e/service-manifest.test.js +166 -0
- package/dist/e2e/service-manifest.test.js.map +1 -0
- package/dist/project/__frozen-schemas__/schema-v3.9.2.d.ts +224 -224
- package/dist/project/frontmatter.d.ts.map +1 -1
- package/dist/project/frontmatter.js +11 -0
- package/dist/project/frontmatter.js.map +1 -1
- package/dist/project/frontmatter.test.js +71 -0
- package/dist/project/frontmatter.test.js.map +1 -1
- package/dist/state/completion.d.ts +1 -1
- package/dist/state/completion.d.ts.map +1 -1
- package/dist/state/completion.js +10 -8
- package/dist/state/completion.js.map +1 -1
- package/dist/state/decision-logger.d.ts +3 -2
- package/dist/state/decision-logger.d.ts.map +1 -1
- package/dist/state/decision-logger.js +12 -11
- package/dist/state/decision-logger.js.map +1 -1
- package/dist/state/ensure-v3-migration.d.ts +9 -0
- package/dist/state/ensure-v3-migration.d.ts.map +1 -0
- package/dist/state/ensure-v3-migration.js +35 -0
- package/dist/state/ensure-v3-migration.js.map +1 -0
- package/dist/state/lock-manager.d.ts +5 -4
- package/dist/state/lock-manager.d.ts.map +1 -1
- package/dist/state/lock-manager.js +11 -11
- package/dist/state/lock-manager.js.map +1 -1
- package/dist/state/rework-manager.d.ts +1 -2
- package/dist/state/rework-manager.d.ts.map +1 -1
- package/dist/state/rework-manager.js +4 -5
- package/dist/state/rework-manager.js.map +1 -1
- package/dist/state/state-manager.d.ts +25 -1
- package/dist/state/state-manager.d.ts.map +1 -1
- package/dist/state/state-manager.js +86 -12
- package/dist/state/state-manager.js.map +1 -1
- package/dist/state/state-manager.test.js +278 -0
- package/dist/state/state-manager.test.js.map +1 -1
- package/dist/state/state-migration-v3.d.ts +22 -0
- package/dist/state/state-migration-v3.d.ts.map +1 -0
- package/dist/state/state-migration-v3.js +82 -0
- package/dist/state/state-migration-v3.js.map +1 -0
- package/dist/state/state-migration-v3.test.d.ts +2 -0
- package/dist/state/state-migration-v3.test.d.ts.map +1 -0
- package/dist/state/state-migration-v3.test.js +196 -0
- package/dist/state/state-migration-v3.test.js.map +1 -0
- package/dist/state/state-migration.d.ts.map +1 -1
- package/dist/state/state-migration.js +11 -6
- package/dist/state/state-migration.js.map +1 -1
- package/dist/state/state-migration.test.js +47 -2
- package/dist/state/state-migration.test.js.map +1 -1
- package/dist/state/state-path-resolver.d.ts +23 -0
- package/dist/state/state-path-resolver.d.ts.map +1 -0
- package/dist/state/state-path-resolver.js +36 -0
- package/dist/state/state-path-resolver.js.map +1 -0
- package/dist/state/state-path-resolver.test.d.ts +2 -0
- package/dist/state/state-path-resolver.test.d.ts.map +1 -0
- package/dist/state/state-path-resolver.test.js +78 -0
- package/dist/state/state-path-resolver.test.js.map +1 -0
- package/dist/state/state-version-dispatch.d.ts +17 -0
- package/dist/state/state-version-dispatch.d.ts.map +1 -0
- package/dist/state/state-version-dispatch.js +27 -0
- package/dist/state/state-version-dispatch.js.map +1 -0
- package/dist/state/state-version-dispatch.test.d.ts +2 -0
- package/dist/state/state-version-dispatch.test.d.ts.map +1 -0
- package/dist/state/state-version-dispatch.test.js +40 -0
- package/dist/state/state-version-dispatch.test.js.map +1 -0
- package/dist/types/config.d.ts +25 -3
- package/dist/types/config.d.ts.map +1 -1
- package/dist/types/config.test.js +13 -1
- package/dist/types/config.test.js.map +1 -1
- package/dist/types/dependency.d.ts +5 -0
- package/dist/types/dependency.d.ts.map +1 -1
- package/dist/types/frontmatter.d.ts +5 -0
- package/dist/types/frontmatter.d.ts.map +1 -1
- package/dist/types/lock.d.ts +1 -1
- package/dist/types/lock.d.ts.map +1 -1
- package/dist/types/state.d.ts +1 -1
- package/dist/types/state.d.ts.map +1 -1
- package/dist/utils/artifact-path.d.ts +19 -0
- package/dist/utils/artifact-path.d.ts.map +1 -0
- package/dist/utils/artifact-path.js +95 -0
- package/dist/utils/artifact-path.js.map +1 -0
- package/dist/utils/artifact-path.test.d.ts +2 -0
- package/dist/utils/artifact-path.test.d.ts.map +1 -0
- package/dist/utils/artifact-path.test.js +138 -0
- package/dist/utils/artifact-path.test.js.map +1 -0
- package/dist/utils/errors.d.ts +1 -1
- package/dist/utils/errors.d.ts.map +1 -1
- package/dist/utils/errors.js +5 -2
- package/dist/utils/errors.js.map +1 -1
- package/dist/utils/user-errors.d.ts +46 -0
- package/dist/utils/user-errors.d.ts.map +1 -0
- package/dist/utils/user-errors.js +76 -0
- package/dist/utils/user-errors.js.map +1 -0
- package/dist/utils/user-errors.test.d.ts +2 -0
- package/dist/utils/user-errors.test.d.ts.map +1 -0
- package/dist/utils/user-errors.test.js +74 -0
- package/dist/utils/user-errors.test.js.map +1 -0
- package/dist/validation/index.d.ts.map +1 -1
- package/dist/validation/index.js +16 -0
- package/dist/validation/index.js.map +1 -1
- package/dist/validation/index.test.js +48 -0
- package/dist/validation/index.test.js.map +1 -1
- package/dist/validation/state-validator.d.ts +5 -2
- package/dist/validation/state-validator.d.ts.map +1 -1
- package/dist/validation/state-validator.js +18 -20
- package/dist/validation/state-validator.js.map +1 -1
- package/dist/validation/state-validator.test.js +31 -2
- package/dist/validation/state-validator.test.js.map +1 -1
- package/dist/wizard/copy/backend.d.ts.map +1 -1
- package/dist/wizard/copy/backend.js +12 -0
- package/dist/wizard/copy/backend.js.map +1 -1
- package/dist/wizard/flags.d.ts +1 -0
- package/dist/wizard/flags.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +5 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +45 -2
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +23 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +85 -47
- package/dist/wizard/wizard.js.map +1 -1
- package/dist/wizard/wizard.test.js +186 -1
- package/dist/wizard/wizard.test.js.map +1 -1
- package/package.json +1 -1
@@ -0,0 +1,710 @@
---
name: multi-service-resilience
description: Circuit breakers, bulkheads, timeout budgets, and failure isolation strategies
topics: [circuit-breakers, bulkheads, timeout-budgets, failure-isolation, retry-storms]
---

## Summary

In a multi-service system, any individual service will fail. The question is whether that failure stays contained or cascades into a full system outage. Resilience patterns exist to answer that question.

**Circuit breakers** wrap outbound calls and monitor failure rates. When failures exceed a threshold, the circuit opens and returns immediate failures instead of waiting for timeouts. Three states: closed (normal), open (failing fast), half-open (probing recovery).

**Bulkheads** partition resources so a misbehaving integration cannot consume all threads and connections. Dedicate separate concurrency limits per downstream service: a payments outage fills only the payments pool, not the global pool.

**Timeout budgets:** Every user-facing request has a total time budget (e.g., 2000ms). Propagate the absolute deadline via `x-request-deadline` header so all services share one clock. Timeouts must decrease with call depth: a chain of five services with 5000ms timeouts each can take 25 seconds to fail.

**Graceful degradation:** Design each feature for full, degraded, and unavailable modes. Use `Promise.allSettled` for parallel enrichment calls so one failure doesn't cancel all others. Return stale cache or placeholder data for non-critical enrichments when the upstream is unavailable.

**Retry storm prevention:** Add full jitter to exponential backoff. Coordinate retry budgets (allow at most 20% of traffic to be retries) in high-traffic services. Pair every retry block with a circuit breaker: retries should stop when the circuit opens.

**Observability for resilience:** Every circuit breaker, bulkhead, and retry must emit metrics (`circuit_breaker_state`, `bulkhead_rejected_calls_total`, `retry_attempts_total`) so you know when they activate.
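
The timeout-budget and retry rules in the summary come down to small pieces of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the 50ms reserve is an arbitrary choice, and it assumes the `x-request-deadline` header carries the absolute deadline as epoch milliseconds (the summary does not specify the encoding).

```typescript
// Hypothetical helpers; only the header name comes from the text above.
const DEADLINE_HEADER = 'x-request-deadline';

// Edge service: turn a total budget (ms) into an absolute deadline (epoch ms).
function startDeadline(totalBudgetMs: number, nowMs: number = Date.now()): number {
  return nowMs + totalBudgetMs;
}

// Downstream hop: time left before the shared deadline, minus a reserve kept
// for this service's own work. Returns undefined when no usable deadline was sent.
function remainingBudgetMs(
  headerValue: string | undefined,
  nowMs: number = Date.now(),
  reserveMs = 50,
): number | undefined {
  if (headerValue === undefined) return undefined;
  const deadline = Number(headerValue);
  if (!Number.isFinite(deadline)) return undefined;
  return Math.max(0, deadline - nowMs - reserveMs);
}

// Full jitter: pick a uniformly random delay between 0 and the capped
// exponential ceiling, so retrying clients do not synchronize.
function fullJitterDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 10_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// A 2000ms budget that started at t=1000 leaves 1450ms at t=1500 (50ms reserved).
const deadline = startDeadline(2000, 1000);
console.log(DEADLINE_HEADER, deadline);                    // x-request-deadline 3000
console.log(remainingBudgetMs(String(deadline), 1500));    // 1450
console.log(fullJitterDelayMs(3, 100, 10_000, () => 0.5)); // 400
```

Because each hop derives its timeout from the same absolute deadline, timeouts shrink with call depth without per-service tuning. A hop that computes a remaining budget of 0 can fail fast instead of calling further downstream.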

## Deep Guidance

## Circuit Breaker Pattern

A circuit breaker wraps outbound calls to a downstream service. It monitors failure rates and, when they exceed a threshold, stops forwarding calls, returning an immediate failure instead of waiting for timeouts to accumulate.

### States

A circuit breaker has three states:

**Closed (normal):** All calls pass through. Failures are counted. When the failure rate exceeds the threshold within the measurement window, the circuit opens.

**Open (tripped):** All calls fail immediately without attempting the downstream call. The caller receives a fast failure. After a configured timeout, the circuit transitions to half-open.

**Half-Open (probing):** A limited number of probe calls are allowed through to the downstream service. If they succeed, the circuit closes. If they fail, the circuit returns to open.

```typescript
type CircuitState = 'closed' | 'open' | 'half-open';

interface CircuitBreakerConfig {
  // Failure rate (0 to 1) within the window that opens the circuit
  failureThreshold: number;
  // Rolling window duration (ms) for counting failures
  windowMs: number;
  // Minimum number of calls in the window before the threshold applies
  // Prevents opening on 1 failure out of 1 call (100% failure rate)
  minimumCallCount: number;
  // How long to stay open before probing (ms)
  recoveryTimeoutMs: number;
  // How many probe calls in half-open state before deciding
  probeCount: number;
  // How many probes must succeed to close the circuit
  probeSuccessThreshold: number;
}

const DEFAULT_CIRCUIT_CONFIG: CircuitBreakerConfig = {
  failureThreshold: 0.5,     // 50% failure rate triggers open
  windowMs: 60_000,          // 1-minute rolling window
  minimumCallCount: 10,      // Need at least 10 calls before evaluating
  recoveryTimeoutMs: 30_000, // Wait 30s before probing
  probeCount: 3,
  probeSuccessThreshold: 2,
};

class CircuitBreaker {
  private state: CircuitState = 'closed';
  private callWindow: Array<{ timestamp: number; success: boolean }> = [];
  private probeAttempts = 0;
  private probeSuccesses = 0;
  private openedAt?: number;

  constructor(
    private readonly name: string,
    private readonly config: CircuitBreakerConfig,
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    this.pruneWindow();

    if (this.state === 'open') {
      // Fail fast until the recovery timeout has elapsed, then start probing
      const elapsed = Date.now() - (this.openedAt ?? 0);
      if (elapsed < this.config.recoveryTimeoutMs) {
        throw new CircuitOpenError(`Circuit ${this.name} is open: downstream unavailable`);
      }
      this.transitionTo('half-open');
    }

    if (this.state === 'half-open') {
      // Only a limited number of probe calls may reach the downstream service
      if (this.probeAttempts >= this.config.probeCount) {
        throw new CircuitOpenError(`Circuit ${this.name} probe limit reached: still open`);
      }
      this.probeAttempts++;
    }

    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  private recordSuccess() {
    this.callWindow.push({ timestamp: Date.now(), success: true });

    if (this.state === 'half-open') {
      this.probeSuccesses++;
      if (this.probeSuccesses >= this.config.probeSuccessThreshold) {
        this.transitionTo('closed');
      }
    }
  }

  private recordFailure() {
    this.callWindow.push({ timestamp: Date.now(), success: false });

    // Any failed probe sends the circuit straight back to open
    if (this.state === 'half-open') {
      this.transitionTo('open');
      return;
    }

    if (this.state === 'closed' && this.shouldOpen()) {
      this.transitionTo('open');
    }
  }

  private shouldOpen(): boolean {
    if (this.callWindow.length < this.config.minimumCallCount) return false;
    const failures = this.callWindow.filter(c => !c.success).length;
    return failures / this.callWindow.length >= this.config.failureThreshold;
  }

  private transitionTo(next: CircuitState) {
    this.state = next;
    if (next === 'open') {
      this.openedAt = Date.now();
      this.probeAttempts = 0;
      this.probeSuccesses = 0;
    } else if (next === 'closed') {
      this.callWindow = [];
      this.probeAttempts = 0;
      this.probeSuccesses = 0;
    }
  }

  // Drop call records that have aged out of the rolling window
  private pruneWindow() {
    const cutoff = Date.now() - this.config.windowMs;
    this.callWindow = this.callWindow.filter(c => c.timestamp > cutoff);
  }

  getState(): CircuitState { return this.state; }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}
```

**Trade-offs:**
- (+) Prevents cascading failures: a slow downstream service cannot hold the upstream's threads indefinitely.
- (+) Fast failure in open state gives the upstream service time to shed load and the downstream service time to recover.
- (+) Half-open probing enables automatic recovery without manual intervention.
- (-) The circuit can open on transient spikes, not just real outages, so `minimumCallCount` and `windowMs` need careful tuning for your traffic patterns.
- (-) State is per-instance by default. In a horizontally-scaled service, each replica has its own circuit state. Use Redis or a service mesh (Istio, Envoy) for cluster-wide state.
- (-) `CircuitOpenError` is a new error type callers must handle. Forgetting to catch it surfaces as an unhandled exception.
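
One way to make sure `CircuitOpenError` is always handled is to route breaker-protected calls through a small wrapper that substitutes a fallback value. This is a minimal sketch, not part of the original guidance: `withFallback` is a hypothetical helper, and the error class is repeated here only so the example is self-contained.

```typescript
// Stand-in for the CircuitOpenError class shown earlier in this section.
class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

// Run `call`; if the circuit is open, return `fallback` instead of failing the
// request. Any other error is rethrown so real bugs are not hidden.
async function withFallback<T>(call: () => Promise<T>, fallback: T): Promise<T> {
  try {
    return await call();
  } catch (error) {
    // Matching on `name` also works when an older compile target makes
    // `instanceof` unreliable for Error subclasses.
    if (error instanceof Error && error.name === 'CircuitOpenError') {
      return fallback;
    }
    throw error;
  }
}
```

A caller might wrap a non-critical enrichment call this way and return an empty list or stale cached data while the circuit is open. For critical paths such as payments, surfacing the error to the caller is usually the safer choice.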

**Configuration guidelines:**
- `failureThreshold: 0.5` is a reasonable default for most services. Lower it (0.25) for critical payments or auth paths. Raise it (0.75) for non-critical enrichment services where partial data is acceptable.
- `recoveryTimeoutMs: 30_000` gives a recovering service 30s of breathing room. Increase for services that are known to take longer to restart.
- `minimumCallCount: 10` prevents false opens on low-traffic periods. Reduce in environments where services handle fewer requests (dev, staging).
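
To see how these settings interact, the open/stay-closed decision can be written as a pure function and evaluated against the same traffic sample. The threshold values below are the ones suggested in the guidelines; the preset names and the sample window are made up for illustration.

```typescript
interface ThresholdConfig {
  failureThreshold: number; // failure rate between 0 and 1
  minimumCallCount: number; // calls required before the rate is evaluated
}

// Same decision rule as the breaker's shouldOpen(), over a list of call outcomes
// (true = success, false = failure).
function shouldOpen(outcomes: boolean[], config: ThresholdConfig): boolean {
  if (outcomes.length < config.minimumCallCount) return false;
  const failures = outcomes.filter(success => !success).length;
  return failures / outcomes.length >= config.failureThreshold;
}

// Hypothetical presets built from the suggested values
const criticalPath: ThresholdConfig = { failureThreshold: 0.25, minimumCallCount: 10 };
const defaultPath: ThresholdConfig = { failureThreshold: 0.5, minimumCallCount: 10 };
const enrichmentPath: ThresholdConfig = { failureThreshold: 0.75, minimumCallCount: 10 };

// 10 calls in the window, 3 of them failed: a 30% failure rate
const recentCalls = [false, false, false, true, true, true, true, true, true, true];

console.log(shouldOpen(recentCalls, criticalPath));   // true  (0.30 >= 0.25)
console.log(shouldOpen(recentCalls, defaultPath));    // false (0.30 <  0.50)
console.log(shouldOpen(recentCalls, enrichmentPath)); // false (0.30 <  0.75)
console.log(shouldOpen([false], defaultPath));        // false (1 call is below minimumCallCount)
```

The last line shows why `minimumCallCount` matters: a single failed call is a 100% failure rate, but it is not enough evidence to open the circuit.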

## Bulkhead Isolation

Bulkheads partition resources so that failures in one integration cannot exhaust the resources needed by other integrations. The name comes from ship design: watertight compartments prevent a single hull breach from sinking the entire vessel.

### Thread Pool Bulkhead

Dedicate a separate thread pool (or async concurrency limit) to each downstream service. A slow downstream service fills its own pool, not the global pool.

```typescript
class Semaphore {
  private permits: number;
  private waitQueue: Array<() => void> = [];

  constructor(maxConcurrency: number) {
    this.permits = maxConcurrency;
  }

  // Permits that can be acquired without waiting
  get availablePermits(): number {
    return this.permits;
  }

  // Callers currently waiting for a permit
  get queueLength(): number {
    return this.waitQueue.length;
  }

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    return new Promise<void>(resolve => {
      this.waitQueue.push(resolve);
    });
  }

  release(): void {
    if (this.waitQueue.length > 0) {
      // Hand the permit directly to the next waiter
      const next = this.waitQueue.shift()!;
      next();
    } else {
      this.permits++;
    }
  }
}

class BulkheadExecutor {
  private semaphore: Semaphore;
  private activeCount = 0;
  private rejectedCount = 0;

  constructor(
    private readonly name: string,
    private readonly maxConcurrency: number,
    private readonly maxQueueSize: number,
  ) {
    this.semaphore = new Semaphore(maxConcurrency);
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // Reject only when every permit is taken and the wait queue is full
    // (checking the queue alone would reject every call when maxQueueSize is 0)
    const full =
      this.semaphore.availablePermits === 0 &&
      this.semaphore.queueLength >= this.maxQueueSize;
    if (full) {
      this.rejectedCount++;
      throw new BulkheadFullError(
        `Bulkhead ${this.name} queue full (${this.maxConcurrency} concurrent + ${this.maxQueueSize} queued)`
      );
    }

    await this.semaphore.acquire();
    this.activeCount++;

    try {
      return await fn();
    } finally {
      this.activeCount--;
      this.semaphore.release();
    }
  }

  getMetrics() {
    return {
      name: this.name,
      active: this.activeCount,
      maxConcurrency: this.maxConcurrency,
      rejected: this.rejectedCount,
    };
  }
}

class BulkheadFullError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'BulkheadFullError';
  }
}

// Service clients are assumed to be defined elsewhere; these shapes are illustrative
declare const paymentsClient: { charge(orderId: string): Promise<unknown> };
declare const inventoryClient: { reserve(orderId: string): Promise<unknown> };

// Usage: each downstream gets its own bulkhead
const paymentsBulkhead = new BulkheadExecutor('payments-service', 10, 20);
const inventoryBulkhead = new BulkheadExecutor('inventory-service', 20, 40);
const notificationsBulkhead = new BulkheadExecutor('notifications-service', 5, 10);

async function chargeAndFulfill(orderId: string) {
  // Payments and inventory each have their own concurrency limit
  // A payments outage cannot consume inventory's capacity
  const [charge, reservation] = await Promise.all([
    paymentsBulkhead.execute(() => paymentsClient.charge(orderId)),
    inventoryBulkhead.execute(() => inventoryClient.reserve(orderId)),
  ]);
  return { charge, reservation };
}
```

### Queue-Based Bulkhead

For async workloads, use separate queues per integration. Each queue has its own concurrency and backpressure configuration.

```typescript
import { Queue, Worker } from 'bullmq';

// BullMQ example: separate queues isolate failure domains
// (`redis`, `processPayment`, and `sendNotification` are defined elsewhere)
const paymentQueue = new Queue('payment-processing', {
  connection: redis,
  defaultJobOptions: { attempts: 3, backoff: { type: 'exponential', delay: 1000 } },
});

const notificationQueue = new Queue('notification-dispatch', {
  connection: redis,
  defaultJobOptions: { attempts: 5, backoff: { type: 'exponential', delay: 500 } },
});

// Payment worker: limited to 5 concurrent jobs
const paymentWorker = new Worker('payment-processing', processPayment, {
  connection: redis,
  concurrency: 5,                            // Bulkhead: max 5 concurrent payment jobs
  limiter: { max: 100, duration: 60_000 },   // 100 jobs/minute rate limit
});

// Notification worker: separate concurrency, isolated from payment issues
const notificationWorker = new Worker('notification-dispatch', sendNotification, {
  connection: redis,
  concurrency: 20,                           // Notifications can be higher concurrency
});
```

**Trade-offs (bulkheads):**
- (+) Failure in one integration cannot starve others of thread/connection resources.
- (+) Per-integration metrics (`activeCount`, `rejectedCount`) make capacity problems visible before they cascade.
- (+) `BulkheadFullError` sheds load with a controlled rejection rather than an unbounded queue that grows until OOM.
- (-) Tuning concurrency limits requires load testing. Too low and you throttle unnecessarily; too high and the bulkhead doesn't protect.
- (-) Adds a queuing layer. A call that would have failed fast can now wait behind up to `maxQueueSize` queued calls before it runs; only calls that arrive when the queue is already full are rejected immediately. This can increase average latency even during partial failures.
- (-) Per-integration bulkheads multiply operational configuration. Document limits in service manifests.
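
Before load testing, Little's Law (average in-flight requests ≈ arrival rate × average latency) gives a rough first value for `maxConcurrency`. The sketch below is only a starting estimate: `estimateConcurrency`, the `headroom` multiplier, and the sample traffic numbers are illustrative assumptions, not measurements.

```typescript
// Little's Law: average in-flight requests = arrival rate × average time in system.
// `headroom` is a hypothetical safety multiplier to absorb bursts.
function estimateConcurrency(
  requestsPerSecond: number,
  latencyMs: number,
  headroom = 1.5,
): number {
  if (requestsPerSecond <= 0 || latencyMs <= 0 || headroom <= 0) {
    throw new RangeError('requestsPerSecond, latencyMs, and headroom must be positive');
  }
  const inFlight = (requestsPerSecond * latencyMs) / 1000;
  return Math.ceil(inFlight * headroom);
}

// Example (assumed numbers): 50 req/s to a downstream with 120ms typical latency.
// 50 × 0.12s = 6 in flight on average; with 1.5× headroom the estimate is 9.
const suggestedMaxConcurrency = estimateConcurrency(50, 120);
console.log(suggestedMaxConcurrency);
```

Treat the result as the value to start a load test from, then adjust it against the downstream's observed breaking point.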

## Timeout Budget Allocation

Timeouts must be set at every level of a call chain. A missing timeout is a resource leak waiting to become an outage.

### The Budget Model

Every user-facing request has a total time budget. Each hop in the call chain consumes part of that budget. The hops on the critical path (including their own processing time) must fit within the total budget; calls that run in parallel count once, at the slowest branch, and a sub-call is already included in its parent's allocation.

```
User SLA: 2,000ms (P99 target)

Entry path:
  API Gateway auth + routing:      50ms
  BFF / aggregator overhead:      100ms

Parallel downstream calls (slowest branch counts):
  Order Service:                  500ms
    └─ Inventory sub-call:        200ms  (within Order's 500ms budget)
  User Profile Service:           300ms  (parallel with Order)
  Product Catalog Service:        250ms  (parallel with Order)

Serialization + network:          100ms
P99 buffer / headroom:            500ms
                                ────
Critical path:                  1,250ms  (50 + 100 + 500 + 100 + 500)
Unallocated within the SLA:       750ms
```
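
Because the downstream calls run in parallel, only the slowest one contributes to end-to-end latency, so the critical path is shorter than the sum of every line item. The arithmetic can be checked with a small helper. This is a sketch: `criticalPathMs` and the `Hop` shape are illustrative names, and the figures are the ones from the budget listing.

```typescript
// A hop is either a single step or a group of steps that run in parallel.
type Hop =
  | { name: string; ms: number }
  | { name: string; parallel: Hop[] };

// Sequential hops add up; a parallel group costs as much as its slowest member.
function criticalPathMs(hops: Hop[]): number {
  return hops.reduce((total, hop) => {
    if ('parallel' in hop) {
      return total + Math.max(0, ...hop.parallel.map(h => criticalPathMs([h])));
    }
    return total + hop.ms;
  }, 0);
}

const budget: Hop[] = [
  { name: 'gateway', ms: 50 },
  { name: 'bff', ms: 100 },
  {
    name: 'downstream',
    parallel: [
      { name: 'order', ms: 500 }, // already includes the 200ms inventory sub-call
      { name: 'profile', ms: 300 },
      { name: 'catalog', ms: 250 },
    ],
  },
  { name: 'serialization', ms: 100 },
  { name: 'headroom', ms: 500 },
];

const slaMs = 2_000;
const usedMs = criticalPathMs(budget);
console.log(`critical path ${usedMs}ms, unallocated ${slaMs - usedMs}ms`);
```

With these inputs the critical path is 50 + 100 + 500 + 100 + 500 = 1,250ms, which leaves 750ms of the 2,000ms SLA unallocated.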

### Deadline Propagation

Pass the absolute deadline through the call chain so every service works against the same deadline, instead of per-hop timeouts that can stack:

```typescript
// Assumes Express-style middleware types
import type { Request, Response, NextFunction } from 'express';

const DEADLINE_HEADER = 'x-request-deadline';
const DEFAULT_TIMEOUT_MS = 5_000;

// Middleware: attach deadline to incoming requests that don't have one
function deadlineMiddleware(req: Request, res: Response, next: NextFunction) {
  if (!req.headers[DEADLINE_HEADER]) {
    const deadline = new Date(Date.now() + DEFAULT_TIMEOUT_MS).toISOString();
    req.headers[DEADLINE_HEADER] = deadline;
  }
  next();
}

// Helper: compute remaining timeout when making downstream calls
function getRemainingMs(req: Request, overhead = 10): number {
  const header = req.headers[DEADLINE_HEADER];
  const deadline = Array.isArray(header) ? header[0] : header;
  if (!deadline) return DEFAULT_TIMEOUT_MS;

  const deadlineMs = new Date(deadline).getTime();
  // Malformed header: fall back to the default rather than passing NaN as a timeout
  if (Number.isNaN(deadlineMs)) return DEFAULT_TIMEOUT_MS;

  const remaining = deadlineMs - Date.now() - overhead;
  if (remaining <= 0) throw new DeadlineExceededError('Request deadline already exceeded');
  return remaining;
}

// Usage: downstream call uses remaining budget, not a fixed timeout
async function callInventoryService(req: Request, orderId: string) {
  const timeout = getRemainingMs(req, 20); // subtract 20ms for network overhead

  return inventoryClient.getAvailability(orderId, {
    headers: { [DEADLINE_HEADER]: req.headers[DEADLINE_HEADER] },
    timeout,
  });
}

class DeadlineExceededError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'DeadlineExceededError';
  }
}
```

**Trade-offs (timeout budgets):**
- (+) Deadline propagation ensures the entire call chain fails fast rather than stacking per-hop timeouts.
- (+) Budget modeling makes timeout decisions explicit rather than scattered magic numbers across the codebase.
- (-) Requires all services to adopt the deadline header convention. A service that drops the header resets the budget for everything downstream of it, because the next hop attaches a fresh `DEFAULT_TIMEOUT_MS` deadline.
- (-) Clock skew across services introduces errors in remaining-time calculations, because an absolute deadline is compared against each host's wall clock. Keep hosts time-synchronized, add conservative overhead, and use a monotonic clock for elapsed-time measurements within a single service.
- (-) Budgets become stale as service performance changes. Revisit quarterly or after each major architectural change.

**Timeout anti-patterns:**
- **No timeout set**: Many HTTP clients default to no timeout (for example, axios, whose `timeout` defaults to `0`). An unresponsive downstream holds a connection indefinitely, exhausting the upstream's connection pool.
- **Identical timeout at every level**: A chain of 5 services each with a 5,000ms timeout can take 25,000ms to fail end-to-end. Timeouts must decrease with depth.
- **Timeout without circuit breaker**: Repeated timeouts still attempt the downstream call on every retry. Pair every timeout with a circuit breaker so repeated failures open the circuit.
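
One way to guard against the "no timeout set" case is to route every outbound call through a helper that enforces a limit. The sketch below uses `Promise.race`; `withTimeout` and `TimeoutError` are hypothetical names introduced for illustration. Note that it only stops waiting: it does not cancel the underlying work, so pass an `AbortSignal` to the client as well where the client supports one.

```typescript
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Usage: a call that takes 50ms, raced against a 10ms limit.
const slowCall = new Promise<string>(resolve => setTimeout(() => resolve('done'), 50));
withTimeout(slowCall, 10).catch(err => console.log((err as Error).name));
```

In a deadline-propagating service, the `ms` argument would typically come from `getRemainingMs(req)` rather than a fixed constant.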

## Failure Isolation Strategies

### Error Classification

Not all errors are the same. Classify errors before deciding on the response:

```typescript
type ErrorClass =
  | 'transient'      // Temporary: retry is likely to succeed
  | 'permanent'      // Client or data error: retrying won't help
  | 'overload'       // Server is busy: retry with backoff + jitter
  | 'timeout'        // Deadline exceeded: check deadline before retrying
  | 'circuit-open'   // Downstream unavailable: return fallback, don't retry
  | 'unknown';       // Unclassified: treat conservatively

function classifyError(error: unknown): ErrorClass {
  if (error instanceof CircuitOpenError) return 'circuit-open';
  if (error instanceof DeadlineExceededError) return 'timeout';
  // A full bulkhead is local overload, not a downstream network failure
  if (error instanceof BulkheadFullError) return 'overload';

  const status = (error as any)?.response?.status as number | undefined;
  if (!status) return 'transient'; // Network error, no response

  if (status === 429) return 'overload';
  if ([500, 502, 503, 504].includes(status)) return 'transient';
  if ([400, 401, 403, 404, 422].includes(status)) return 'permanent';
  if (status === 408) return 'timeout';

  return 'unknown';
}
```

### Fallback Strategies by Error Class

```typescript
async function getProductWithFallback(productId: string, req: Request): Promise<Product> {
  try {
    return await catalogBulkhead.execute(() =>
      catalogCircuit.execute(() =>
        callCatalogService(req, productId)
      )
    );
  } catch (error) {
    const errorClass = classifyError(error);

    switch (errorClass) {
      case 'circuit-open':
      case 'transient':
      case 'overload': {
        // Return cached version if available
        const cached = await cache.get<Product>(`product:${productId}`);
        if (cached) {
          logger.warn({ productId, errorClass }, 'Returning stale cached product');
          return { ...cached, _stale: true };
        }
        // Return a minimal placeholder, which is better than an error for browsing
        return { id: productId, name: 'Product temporarily unavailable', available: false };
      }

      case 'permanent':
        // 404 or 422: propagate, don't retry or cache
        throw error;

      case 'timeout':
        // Deadline already exceeded: fail fast, don't attempt fallback
        throw error;

      default:
        // Unknown: return placeholder and log for investigation
        logger.error({ productId, errorClass, error }, 'Unclassified product fetch error');
        return { id: productId, name: 'Product temporarily unavailable', available: false };
    }
  }
}
```

### Graceful Degradation Patterns

Design each feature for three modes: full, degraded, and unavailable.

```typescript
interface ProductPageData {
  product: Product;
  relatedProducts: Product[];   // Optional, degradable
  reviews: Review[];            // Optional, degradable
  inventory: InventoryStatus;   // Optional, degradable
}

async function getProductPageData(
  productId: string,
  req: Request
): Promise<ProductPageData> {
  // Core product data, non-degradable
  const product = await getProductWithFallback(productId, req);

  // Optional enrichments: fail independently, don't block the page
  const [relatedProducts, reviews, inventory] = await Promise.allSettled([
    getRelatedProducts(productId, req),
    getProductReviews(productId, req),
    getInventoryStatus(productId, req),
  ]);

  return {
    product,
    relatedProducts: relatedProducts.status === 'fulfilled' ? relatedProducts.value : [],
    reviews: reviews.status === 'fulfilled' ? reviews.value : [],
    inventory: inventory.status === 'fulfilled'
      ? inventory.value
      : { status: 'unknown', message: 'Inventory status temporarily unavailable' },
  };
}
```

**Trade-offs (graceful degradation):**
- (+) Users receive a working page rather than an error screen when a non-critical service is down.
- (+) `Promise.allSettled` is the correct primitive: unlike `Promise.all`, it waits for every call to settle and never rejects on the first failure.
- (-) Partial data requires careful UI handling. The frontend must know how to render each degraded state.
- (-) Stale cache fallbacks can show outdated prices or incorrect availability. Set explicit `_stale` flags so the UI can display a freshness warning.

## Retry Storm Prevention

Retry storms occur when many clients simultaneously retry after a shared failure, overwhelming a recovering service with a burst of traffic.

### Jitter and Backoff

```typescript
interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitter: 'full' | 'equal' | 'decorrelated';
}

const SERVICE_RETRY_CONFIG: RetryConfig = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 10_000,
  jitter: 'full',
};

// `prevDelayMs` is only used by decorrelated jitter; it defaults to the base delay
function computeDelay(
  attempt: number,
  config: RetryConfig,
  prevDelayMs: number = config.baseDelayMs,
): number {
  const exponential = config.baseDelayMs * Math.pow(2, attempt - 1);
  const capped = Math.min(exponential, config.maxDelayMs);

  switch (config.jitter) {
    case 'full':
      // Random value between 0 and capped: maximum spread
      return Math.random() * capped;

    case 'equal':
      // Half fixed, half random: more predictable average
      return capped / 2 + Math.random() * (capped / 2);

    case 'decorrelated':
      // Each delay is random within [baseDelay, prevDelay * 3], capped at maxDelayMs
      // Avoids correlated retry waves across clients
      return Math.min(
        config.maxDelayMs,
        config.baseDelayMs + Math.random() * (prevDelayMs * 3 - config.baseDelayMs),
      );
  }
}

async function withRetry<T>(
  fn: () => Promise<T>,
  config: RetryConfig = SERVICE_RETRY_CONFIG,
): Promise<T> {
  let lastError: unknown;
  let prevDelay = config.baseDelayMs;

  for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const errorClass = classifyError(error);

      // Never retry permanent or open-circuit errors; other classes are retried
      if (errorClass === 'permanent' || errorClass === 'circuit-open') {
        throw error;
      }

      if (attempt === config.maxAttempts) break;

      // Respect server-provided Retry-After header (rate limit responses).
      // Only the delay-seconds form is handled; an HTTP-date value falls back to backoff.
      const retryAfterHeader = (error as any)?.response?.headers?.['retry-after'];
      const retryAfterSeconds = Number.parseInt(String(retryAfterHeader ?? ''), 10);
      const delay = Number.isFinite(retryAfterSeconds) && retryAfterSeconds >= 0
        ? retryAfterSeconds * 1000
        : computeDelay(attempt, config, prevDelay);
      prevDelay = delay;

      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}
```
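
To make the exponential schedule concrete, the sketch below lists the capped ceiling for each attempt using the `SERVICE_RETRY_CONFIG` values (100ms base, 10,000ms cap). `backoffCeilingMs` is an illustrative helper introduced here; it computes only the upper bound that the jitter strategies sample from, not the jittered delay itself.

```typescript
// Upper bound of the delay after failed attempt number `attempt` (1-based),
// i.e. the value that full jitter samples below.
function backoffCeilingMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
}

const ceilings = [1, 2, 3, 4, 5, 6, 7, 8].map(a => backoffCeilingMs(a, 100, 10_000));
console.log(ceilings);
// [100, 200, 400, 800, 1600, 3200, 6400, 10000]
```

With `maxAttempts: 3` there are at most two sleeps, bounded by 100ms and 200ms, so the 10,000ms cap only comes into play if `maxAttempts` is raised to 8 or more.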

### Coordinated Retry Budgets

In high-traffic systems, coordinate retry behavior at the request level using retry budgets, which limit the ratio of retries to original requests:

```typescript
class RetryBudget {
  private tokens: number;
  private readonly maxTokens: number;
  private lastRefill: number;

  constructor(
    private readonly requestsPerSecond: number,
    // Allow at most 20% of traffic to be retries
    private readonly retryRatio: number = 0.2,
  ) {
    this.maxTokens = Math.ceil(requestsPerSecond * retryRatio);
    this.tokens = this.maxTokens;
    this.lastRefill = Date.now();
  }

  canRetry(): boolean {
    this.refill();
    // Tokens refill fractionally, so require a whole token to avoid going negative
    if (this.tokens >= 1) {
      this.tokens--;
      return true;
    }
    return false;
  }

  private refill() {
    const now = Date.now();
    const elapsed = now - this.lastRefill;
    const refillAmount = (elapsed / 1000) * this.requestsPerSecond * this.retryRatio;
    this.tokens = Math.min(this.maxTokens, this.tokens + refillAmount);
    this.lastRefill = now;
  }
}

const catalogRetryBudget = new RetryBudget(100); // 100 req/s, 20% retry budget = 20 retries/s

async function callCatalogWithBudget(productId: string): Promise<Product> {
  try {
    return await catalogClient.getProduct(productId);
  } catch (error) {
    if (classifyError(error) === 'transient' && catalogRetryBudget.canRetry()) {
      await new Promise(resolve => setTimeout(resolve, computeDelay(1, SERVICE_RETRY_CONFIG)));
      return catalogClient.getProduct(productId);
    }
    throw error;
  }
}
```

**Trade-offs (retry storm prevention):**
- (+) Full jitter is the most effective at preventing thundering herd: clients retry at random intervals rather than in synchronized waves.
- (+) Retry budgets cap the amplification factor: once the budget is exhausted, 100 failures cannot fan out into 300+ retry calls.
- (-) Full jitter means some retries happen very quickly (near zero delay), which may not be appropriate for services that need explicit recovery time. Use `equal` jitter for predictable minimum delays.
- (-) The `RetryBudget` above is per-instance, so a horizontally-scaled service allows its budget once per replica. For a cluster-wide budget, share the state across instances, for example with Redis counters with TTL.
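
A cluster-wide budget can be sketched as a fixed-window counter pair (original requests and retries) kept in shared storage. Everything named below is hypothetical: `CounterStore`, `InMemoryCounterStore`, and `SharedRetryBudget` are illustrative, and the in-memory store stands in for Redis so the example runs on its own. A Redis-backed store could implement `increment` with `INCR` plus an expiry so old windows disappear.

```typescript
// Hypothetical storage abstraction over a shared counter service.
interface CounterStore {
  increment(key: string, ttlSeconds: number): Promise<number>;
  get(key: string): Promise<number>;
}

// In-memory stand-in so the sketch runs without Redis (single process only; ignores TTL).
class InMemoryCounterStore implements CounterStore {
  private counts = new Map<string, number>();

  async increment(key: string, _ttlSeconds: number): Promise<number> {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }

  async get(key: string): Promise<number> {
    return this.counts.get(key) ?? 0;
  }
}

class SharedRetryBudget {
  constructor(
    private readonly store: CounterStore,
    private readonly service: string,
    private readonly retryRatio = 0.2,
    private readonly windowSeconds = 10,
    private readonly now: () => number = Date.now,
  ) {}

  // Keys are scoped to the current time window, so counts reset each window.
  private key(kind: 'requests' | 'retries'): string {
    const window = Math.floor(this.now() / (this.windowSeconds * 1000));
    return `retry-budget:${this.service}:${kind}:${window}`;
  }

  // Call once per original (non-retry) request.
  async recordRequest(): Promise<void> {
    await this.store.increment(this.key('requests'), this.windowSeconds * 2);
  }

  // Allows a retry while retries stay within retryRatio of requests in this window.
  // Denied attempts are counted too, which makes the budget slightly conservative.
  async tryAcquireRetry(): Promise<boolean> {
    const requests = await this.store.get(this.key('requests'));
    const retries = await this.store.increment(this.key('retries'), this.windowSeconds * 2);
    return retries <= requests * this.retryRatio;
  }
}
```

Each replica would call `recordRequest()` on every original call and `tryAcquireRetry()` before retrying, so the 20% ratio applies to the cluster's combined traffic rather than to each replica separately. The extra round trip to shared storage on every call is the cost of that coordination.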

## Observability for Resilience

Every resilience mechanism must emit metrics and structured logs so you know when it activates:

```typescript
// Metrics to expose (Prometheus or equivalent):
// circuit_breaker_state{name, state}            0=closed, 1=half-open, 2=open
// circuit_breaker_calls_total{name, result}     success, failure, short-circuited, rejected
// bulkhead_active_calls{name}                   current concurrency
// bulkhead_rejected_calls_total{name}           rejected due to full bulkhead
// retry_attempts_total{service, attempt_number} retry attempt distribution
// timeout_exceeded_total{service}               deadline exceeded events

function instrumentedCircuitCall<T>(
  circuit: CircuitBreaker,
  bulkhead: BulkheadExecutor,
  fn: () => Promise<T>,
  labels: { service: string },
): Promise<T> {
  const start = Date.now();

  return bulkhead.execute(() => circuit.execute(fn))
    .then(result => {
      metrics.increment('circuit_breaker_calls_total', { ...labels, result: 'success' });
      metrics.histogram('call_duration_ms', Date.now() - start, labels);
      return result;
    })
    .catch(error => {
      const result = error instanceof CircuitOpenError
        ? 'short-circuited'
        : error instanceof BulkheadFullError
          ? 'rejected'
          : 'failure';
      metrics.increment('circuit_breaker_calls_total', { ...labels, result });
      throw error;
    });
}
```

## Common Pitfalls

**Circuit breaker per-instance only.** In a 10-replica service, each replica has its own circuit state. One replica may have opened its circuit while the other nine keep sending traffic, defeating the protection. Fix: share circuit state via Redis or delegate circuit breaking to the service mesh (Envoy, Istio).

**Retrying without a circuit breaker.** Retry logic keeps hammering a failing service even after the failure is clearly systemic. Fix: wrap every retry block in a circuit breaker. When the circuit opens, retries stop.

**Missing `Promise.allSettled` in fan-outs.** Using `Promise.all` for parallel enrichment calls means a single enrichment failure rejects the whole batch and the results of the calls that succeeded are discarded. Fix: use `Promise.allSettled` and handle each result independently.

**Timeout but no deadline.** Setting a timeout on each call without propagating a shared deadline means the full call chain timeout is the sum of individual timeouts. A chain of three services each with 5,000ms timeouts can take 15 seconds to fail. Fix: attach a deadline at the entry point and propagate it through all downstream calls.

**Bulkhead sized by gut feel.** Setting `maxConcurrency: 10` without measuring the service's actual throughput leaves the limit arbitrary. Too low throttles legitimate traffic; too high doesn't protect. Fix: load test the downstream service to find its natural breaking point, then set the bulkhead just below it.

**Graceful degradation without cache warming.** Returning a cached fallback only works if the cache was populated. On a cold start or after a cache flush, there's nothing to fall back to. Fix: implement cache warming on startup and ensure TTLs are set conservatively so fallbacks are available during brief outages.
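
Cache warming on startup can be as small as loading a known set of hot keys and tolerating individual failures. The sketch below assumes a minimal cache shape: `FallbackCache`, `warmCache`, and the loader signature are hypothetical names, and the `product:` key prefix mirrors the fallback example. It fires all loads at once, so batch the IDs for large sets.

```typescript
// Hypothetical minimal cache shape for illustration.
interface FallbackCache<V> {
  set(key: string, value: V, ttlSeconds: number): Promise<void>;
}

// Loads each ID and stores it; one failed load does not stop the others.
async function warmCache<V>(
  ids: string[],
  load: (id: string) => Promise<V>,
  cache: FallbackCache<V>,
  ttlSeconds: number,
): Promise<{ warmed: number; failed: number }> {
  const results = await Promise.allSettled(
    ids.map(async id => cache.set(`product:${id}`, await load(id), ttlSeconds)),
  );
  const warmed = results.filter(r => r.status === 'fulfilled').length;
  return { warmed, failed: results.length - warmed };
}
```

Logging the returned `failed` count at startup shows how much of the fallback set is actually available before the first outage.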

## See Also

- [multi-service-api-contracts](./multi-service-api-contracts.md): Retry policies, idempotency keys, and timeout budget allocation
- [multi-service-observability](./multi-service-observability.md): Metrics, tracing, and alerting for resilience signals
- [multi-service-architecture](./multi-service-architecture.md): Circuit breaker context within service topology
- [multi-service-testing](./multi-service-testing.md): Chaos testing and fault injection strategies