@zigrivers/scaffold 3.16.0 → 3.18.0
- package/README.md +28 -0
- package/content/knowledge/backend/backend-fintech-broker-integration.md +244 -0
- package/content/knowledge/backend/backend-fintech-compliance.md +181 -0
- package/content/knowledge/backend/backend-fintech-data-modeling.md +210 -0
- package/content/knowledge/backend/backend-fintech-ledger.md +226 -0
- package/content/knowledge/backend/backend-fintech-observability.md +151 -0
- package/content/knowledge/backend/backend-fintech-order-lifecycle.md +213 -0
- package/content/knowledge/backend/backend-fintech-risk-management.md +150 -0
- package/content/knowledge/backend/backend-fintech-testing.md +197 -0
- package/content/knowledge/core/automated-review-tooling.md +10 -0
- package/content/knowledge/core/multi-service-api-contracts.md +634 -0
- package/content/knowledge/core/multi-service-architecture.md +492 -0
- package/content/knowledge/core/multi-service-auth.md +706 -0
- package/content/knowledge/core/multi-service-data-ownership.md +539 -0
- package/content/knowledge/core/multi-service-observability.md +545 -0
- package/content/knowledge/core/multi-service-resilience.md +710 -0
- package/content/knowledge/core/multi-service-task-decomposition.md +615 -0
- package/content/knowledge/core/multi-service-testing.md +728 -0
- package/content/methodology/backend-fintech.yml +46 -0
- package/content/methodology/custom-defaults.yml +6 -0
- package/content/methodology/deep.yml +6 -0
- package/content/methodology/multi-service-overlay.yml +103 -0
- package/content/methodology/mvp.yml +6 -0
- package/content/pipeline/architecture/service-ownership-map.md +83 -0
- package/content/pipeline/quality/cross-service-auth.md +96 -0
- package/content/pipeline/quality/cross-service-observability.md +104 -0
- package/content/pipeline/quality/integration-test-plan.md +106 -0
- package/content/pipeline/specification/inter-service-contracts.md +95 -0
- package/dist/cli/commands/adopt.cli-flags.test.js +20 -0
- package/dist/cli/commands/adopt.cli-flags.test.js.map +1 -1
- package/dist/cli/commands/adopt.d.ts.map +1 -1
- package/dist/cli/commands/adopt.js +11 -3
- package/dist/cli/commands/adopt.js.map +1 -1
- package/dist/cli/commands/complete.d.ts +1 -0
- package/dist/cli/commands/complete.d.ts.map +1 -1
- package/dist/cli/commands/complete.js +26 -8
- package/dist/cli/commands/complete.js.map +1 -1
- package/dist/cli/commands/dashboard.d.ts +1 -0
- package/dist/cli/commands/dashboard.d.ts.map +1 -1
- package/dist/cli/commands/dashboard.js +19 -6
- package/dist/cli/commands/dashboard.js.map +1 -1
- package/dist/cli/commands/decisions.d.ts +1 -0
- package/dist/cli/commands/decisions.d.ts.map +1 -1
- package/dist/cli/commands/decisions.js +18 -4
- package/dist/cli/commands/decisions.js.map +1 -1
- package/dist/cli/commands/info.d.ts +1 -0
- package/dist/cli/commands/info.d.ts.map +1 -1
- package/dist/cli/commands/info.js +25 -3
- package/dist/cli/commands/info.js.map +1 -1
- package/dist/cli/commands/init-from.test.d.ts +2 -0
- package/dist/cli/commands/init-from.test.d.ts.map +1 -0
- package/dist/cli/commands/init-from.test.js +315 -0
- package/dist/cli/commands/init-from.test.js.map +1 -0
- package/dist/cli/commands/init.d.ts +3 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +239 -129
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +20 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/cli/commands/next.d.ts +1 -0
- package/dist/cli/commands/next.d.ts.map +1 -1
- package/dist/cli/commands/next.js +40 -4
- package/dist/cli/commands/next.js.map +1 -1
- package/dist/cli/commands/next.test.js +153 -0
- package/dist/cli/commands/next.test.js.map +1 -1
- package/dist/cli/commands/reset.d.ts +1 -0
- package/dist/cli/commands/reset.d.ts.map +1 -1
- package/dist/cli/commands/reset.js +77 -29
- package/dist/cli/commands/reset.js.map +1 -1
- package/dist/cli/commands/rework.d.ts +1 -0
- package/dist/cli/commands/rework.d.ts.map +1 -1
- package/dist/cli/commands/rework.js +16 -2
- package/dist/cli/commands/rework.js.map +1 -1
- package/dist/cli/commands/run.d.ts +1 -0
- package/dist/cli/commands/run.d.ts.map +1 -1
- package/dist/cli/commands/run.js +65 -13
- package/dist/cli/commands/run.js.map +1 -1
- package/dist/cli/commands/run.test.js +255 -3
- package/dist/cli/commands/run.test.js.map +1 -1
- package/dist/cli/commands/skip.d.ts +1 -0
- package/dist/cli/commands/skip.d.ts.map +1 -1
- package/dist/cli/commands/skip.js +24 -7
- package/dist/cli/commands/skip.js.map +1 -1
- package/dist/cli/commands/status.d.ts +1 -0
- package/dist/cli/commands/status.d.ts.map +1 -1
- package/dist/cli/commands/status.js +51 -4
- package/dist/cli/commands/status.js.map +1 -1
- package/dist/cli/commands/status.test.js +130 -0
- package/dist/cli/commands/status.test.js.map +1 -1
- package/dist/cli/guards-coverage.test.d.ts +2 -0
- package/dist/cli/guards-coverage.test.d.ts.map +1 -0
- package/dist/cli/guards-coverage.test.js +26 -0
- package/dist/cli/guards-coverage.test.js.map +1 -0
- package/dist/cli/guards-integration.test.d.ts +2 -0
- package/dist/cli/guards-integration.test.d.ts.map +1 -0
- package/dist/cli/guards-integration.test.js +178 -0
- package/dist/cli/guards-integration.test.js.map +1 -0
- package/dist/cli/guards.d.ts +13 -0
- package/dist/cli/guards.d.ts.map +1 -0
- package/dist/cli/guards.js +70 -0
- package/dist/cli/guards.js.map +1 -0
- package/dist/cli/guards.test.d.ts +2 -0
- package/dist/cli/guards.test.d.ts.map +1 -0
- package/dist/cli/guards.test.js +136 -0
- package/dist/cli/guards.test.js.map +1 -0
- package/dist/cli/init-flag-families.d.ts +1 -1
- package/dist/cli/init-flag-families.d.ts.map +1 -1
- package/dist/cli/init-flag-families.js +4 -1
- package/dist/cli/init-flag-families.js.map +1 -1
- package/dist/cli/init-flag-families.test.js +10 -0
- package/dist/cli/init-flag-families.test.js.map +1 -1
- package/dist/cli/shutdown.d.ts +2 -3
- package/dist/cli/shutdown.d.ts.map +1 -1
- package/dist/cli/shutdown.js +14 -11
- package/dist/cli/shutdown.js.map +1 -1
- package/dist/cli/shutdown.test.js +2 -4
- package/dist/cli/shutdown.test.js.map +1 -1
- package/dist/config/schema.d.ts +12122 -288
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +74 -79
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +230 -1
- package/dist/config/schema.test.js.map +1 -1
- package/dist/config/validators/backend.d.ts +4 -0
- package/dist/config/validators/backend.d.ts.map +1 -0
- package/dist/config/validators/backend.js +14 -0
- package/dist/config/validators/backend.js.map +1 -0
- package/dist/config/validators/browser-extension.d.ts +4 -0
- package/dist/config/validators/browser-extension.d.ts.map +1 -0
- package/dist/config/validators/browser-extension.js +24 -0
- package/dist/config/validators/browser-extension.js.map +1 -0
- package/dist/config/validators/cli.d.ts +4 -0
- package/dist/config/validators/cli.d.ts.map +1 -0
- package/dist/config/validators/cli.js +14 -0
- package/dist/config/validators/cli.js.map +1 -0
- package/dist/config/validators/data-pipeline.d.ts +4 -0
- package/dist/config/validators/data-pipeline.d.ts.map +1 -0
- package/dist/config/validators/data-pipeline.js +14 -0
- package/dist/config/validators/data-pipeline.js.map +1 -0
- package/dist/config/validators/game.d.ts +4 -0
- package/dist/config/validators/game.d.ts.map +1 -0
- package/dist/config/validators/game.js +14 -0
- package/dist/config/validators/game.js.map +1 -0
- package/dist/config/validators/index.d.ts +7 -0
- package/dist/config/validators/index.d.ts.map +1 -0
- package/dist/config/validators/index.js +27 -0
- package/dist/config/validators/index.js.map +1 -0
- package/dist/config/validators/library.d.ts +4 -0
- package/dist/config/validators/library.d.ts.map +1 -0
- package/dist/config/validators/library.js +25 -0
- package/dist/config/validators/library.js.map +1 -0
- package/dist/config/validators/ml.d.ts +4 -0
- package/dist/config/validators/ml.d.ts.map +1 -0
- package/dist/config/validators/ml.js +31 -0
- package/dist/config/validators/ml.js.map +1 -0
- package/dist/config/validators/mobile-app.d.ts +4 -0
- package/dist/config/validators/mobile-app.d.ts.map +1 -0
- package/dist/config/validators/mobile-app.js +14 -0
- package/dist/config/validators/mobile-app.js.map +1 -0
- package/dist/config/validators/registry.test.d.ts +2 -0
- package/dist/config/validators/registry.test.d.ts.map +1 -0
- package/dist/config/validators/registry.test.js +26 -0
- package/dist/config/validators/registry.test.js.map +1 -0
- package/dist/config/validators/research.d.ts +4 -0
- package/dist/config/validators/research.d.ts.map +1 -0
- package/dist/config/validators/research.js +24 -0
- package/dist/config/validators/research.js.map +1 -0
- package/dist/config/validators/research.test.d.ts +2 -0
- package/dist/config/validators/research.test.d.ts.map +1 -0
- package/dist/config/validators/research.test.js +44 -0
- package/dist/config/validators/research.test.js.map +1 -0
- package/dist/config/validators/types.d.ts +19 -0
- package/dist/config/validators/types.d.ts.map +1 -0
- package/dist/config/validators/types.js +2 -0
- package/dist/config/validators/types.js.map +1 -0
- package/dist/config/validators/validators.test.d.ts +2 -0
- package/dist/config/validators/validators.test.d.ts.map +1 -0
- package/dist/config/validators/validators.test.js +25 -0
- package/dist/config/validators/validators.test.js.map +1 -0
- package/dist/config/validators/web-app.d.ts +4 -0
- package/dist/config/validators/web-app.d.ts.map +1 -0
- package/dist/config/validators/web-app.js +31 -0
- package/dist/config/validators/web-app.js.map +1 -0
- package/dist/core/assembly/context-gatherer.d.ts.map +1 -1
- package/dist/core/assembly/context-gatherer.js +4 -2
- package/dist/core/assembly/context-gatherer.js.map +1 -1
- package/dist/core/assembly/cross-reads.d.ts +61 -0
- package/dist/core/assembly/cross-reads.d.ts.map +1 -0
- package/dist/core/assembly/cross-reads.js +190 -0
- package/dist/core/assembly/cross-reads.js.map +1 -0
- package/dist/core/assembly/cross-reads.test.d.ts +2 -0
- package/dist/core/assembly/cross-reads.test.d.ts.map +1 -0
- package/dist/core/assembly/cross-reads.test.js +497 -0
- package/dist/core/assembly/cross-reads.test.js.map +1 -0
- package/dist/core/assembly/overlay-loader-structural.test.d.ts +2 -0
- package/dist/core/assembly/overlay-loader-structural.test.d.ts.map +1 -0
- package/dist/core/assembly/overlay-loader-structural.test.js +173 -0
- package/dist/core/assembly/overlay-loader-structural.test.js.map +1 -0
- package/dist/core/assembly/overlay-loader.d.ts +19 -3
- package/dist/core/assembly/overlay-loader.d.ts.map +1 -1
- package/dist/core/assembly/overlay-loader.js +135 -4
- package/dist/core/assembly/overlay-loader.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +204 -1
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/core/assembly/overlay-resolver.d.ts +9 -2
- package/dist/core/assembly/overlay-resolver.d.ts.map +1 -1
- package/dist/core/assembly/overlay-resolver.js +32 -1
- package/dist/core/assembly/overlay-resolver.js.map +1 -1
- package/dist/core/assembly/overlay-resolver.test.js +135 -17
- package/dist/core/assembly/overlay-resolver.test.js.map +1 -1
- package/dist/core/assembly/overlay-state-resolver.d.ts +9 -0
- package/dist/core/assembly/overlay-state-resolver.d.ts.map +1 -1
- package/dist/core/assembly/overlay-state-resolver.js +43 -2
- package/dist/core/assembly/overlay-state-resolver.js.map +1 -1
- package/dist/core/assembly/overlay-state-resolver.test.js +321 -0
- package/dist/core/assembly/overlay-state-resolver.test.js.map +1 -1
- package/dist/core/assembly/update-mode.d.ts +1 -0
- package/dist/core/assembly/update-mode.d.ts.map +1 -1
- package/dist/core/assembly/update-mode.js +17 -9
- package/dist/core/assembly/update-mode.js.map +1 -1
- package/dist/core/dependency/eligibility.d.ts +10 -1
- package/dist/core/dependency/eligibility.d.ts.map +1 -1
- package/dist/core/dependency/eligibility.js +19 -1
- package/dist/core/dependency/eligibility.js.map +1 -1
- package/dist/core/dependency/eligibility.test.js +82 -0
- package/dist/core/dependency/eligibility.test.js.map +1 -1
- package/dist/core/dependency/graph.d.ts +4 -1
- package/dist/core/dependency/graph.d.ts.map +1 -1
- package/dist/core/dependency/graph.js +7 -1
- package/dist/core/dependency/graph.js.map +1 -1
- package/dist/core/dependency/graph.test.js +48 -0
- package/dist/core/dependency/graph.test.js.map +1 -1
- package/dist/core/pipeline/global-steps.d.ts +7 -0
- package/dist/core/pipeline/global-steps.d.ts.map +1 -0
- package/dist/core/pipeline/global-steps.js +18 -0
- package/dist/core/pipeline/global-steps.js.map +1 -0
- package/dist/core/pipeline/resolver.d.ts +1 -0
- package/dist/core/pipeline/resolver.d.ts.map +1 -1
- package/dist/core/pipeline/resolver.js +54 -7
- package/dist/core/pipeline/resolver.js.map +1 -1
- package/dist/core/pipeline/resolver.test.js +51 -1
- package/dist/core/pipeline/resolver.test.js.map +1 -1
- package/dist/core/pipeline/types.d.ts +5 -1
- package/dist/core/pipeline/types.d.ts.map +1 -1
- package/dist/e2e/cross-service-references.test.d.ts +22 -0
- package/dist/e2e/cross-service-references.test.d.ts.map +1 -0
- package/dist/e2e/cross-service-references.test.js +230 -0
- package/dist/e2e/cross-service-references.test.js.map +1 -0
- package/dist/e2e/multi-service-pipeline.test.d.ts +10 -0
- package/dist/e2e/multi-service-pipeline.test.d.ts.map +1 -0
- package/dist/e2e/multi-service-pipeline.test.js +185 -0
- package/dist/e2e/multi-service-pipeline.test.js.map +1 -0
- package/dist/e2e/project-type-overlays.test.js +68 -0
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/e2e/service-execution.test.d.ts +15 -0
- package/dist/e2e/service-execution.test.d.ts.map +1 -0
- package/dist/e2e/service-execution.test.js +219 -0
- package/dist/e2e/service-execution.test.js.map +1 -0
- package/dist/e2e/service-manifest.test.d.ts +19 -0
- package/dist/e2e/service-manifest.test.d.ts.map +1 -0
- package/dist/e2e/service-manifest.test.js +166 -0
- package/dist/e2e/service-manifest.test.js.map +1 -0
- package/dist/project/__frozen-schemas__/schema-v3.9.2.d.ts +224 -224
- package/dist/project/frontmatter.d.ts.map +1 -1
- package/dist/project/frontmatter.js +11 -0
- package/dist/project/frontmatter.js.map +1 -1
- package/dist/project/frontmatter.test.js +71 -0
- package/dist/project/frontmatter.test.js.map +1 -1
- package/dist/state/completion.d.ts +1 -1
- package/dist/state/completion.d.ts.map +1 -1
- package/dist/state/completion.js +10 -8
- package/dist/state/completion.js.map +1 -1
- package/dist/state/decision-logger.d.ts +3 -2
- package/dist/state/decision-logger.d.ts.map +1 -1
- package/dist/state/decision-logger.js +12 -11
- package/dist/state/decision-logger.js.map +1 -1
- package/dist/state/ensure-v3-migration.d.ts +9 -0
- package/dist/state/ensure-v3-migration.d.ts.map +1 -0
- package/dist/state/ensure-v3-migration.js +35 -0
- package/dist/state/ensure-v3-migration.js.map +1 -0
- package/dist/state/lock-manager.d.ts +5 -4
- package/dist/state/lock-manager.d.ts.map +1 -1
- package/dist/state/lock-manager.js +11 -11
- package/dist/state/lock-manager.js.map +1 -1
- package/dist/state/rework-manager.d.ts +1 -2
- package/dist/state/rework-manager.d.ts.map +1 -1
- package/dist/state/rework-manager.js +4 -5
- package/dist/state/rework-manager.js.map +1 -1
- package/dist/state/state-manager.d.ts +25 -1
- package/dist/state/state-manager.d.ts.map +1 -1
- package/dist/state/state-manager.js +86 -12
- package/dist/state/state-manager.js.map +1 -1
- package/dist/state/state-manager.test.js +278 -0
- package/dist/state/state-manager.test.js.map +1 -1
- package/dist/state/state-migration-v3.d.ts +22 -0
- package/dist/state/state-migration-v3.d.ts.map +1 -0
- package/dist/state/state-migration-v3.js +82 -0
- package/dist/state/state-migration-v3.js.map +1 -0
- package/dist/state/state-migration-v3.test.d.ts +2 -0
- package/dist/state/state-migration-v3.test.d.ts.map +1 -0
- package/dist/state/state-migration-v3.test.js +196 -0
- package/dist/state/state-migration-v3.test.js.map +1 -0
- package/dist/state/state-migration.d.ts.map +1 -1
- package/dist/state/state-migration.js +11 -6
- package/dist/state/state-migration.js.map +1 -1
- package/dist/state/state-migration.test.js +47 -2
- package/dist/state/state-migration.test.js.map +1 -1
- package/dist/state/state-path-resolver.d.ts +23 -0
- package/dist/state/state-path-resolver.d.ts.map +1 -0
- package/dist/state/state-path-resolver.js +36 -0
- package/dist/state/state-path-resolver.js.map +1 -0
- package/dist/state/state-path-resolver.test.d.ts +2 -0
- package/dist/state/state-path-resolver.test.d.ts.map +1 -0
- package/dist/state/state-path-resolver.test.js +78 -0
- package/dist/state/state-path-resolver.test.js.map +1 -0
- package/dist/state/state-version-dispatch.d.ts +17 -0
- package/dist/state/state-version-dispatch.d.ts.map +1 -0
- package/dist/state/state-version-dispatch.js +27 -0
- package/dist/state/state-version-dispatch.js.map +1 -0
- package/dist/state/state-version-dispatch.test.d.ts +2 -0
- package/dist/state/state-version-dispatch.test.d.ts.map +1 -0
- package/dist/state/state-version-dispatch.test.js +40 -0
- package/dist/state/state-version-dispatch.test.js.map +1 -0
- package/dist/types/config.d.ts +33 -3
- package/dist/types/config.d.ts.map +1 -1
- package/dist/types/config.test.js +62 -1
- package/dist/types/config.test.js.map +1 -1
- package/dist/types/dependency.d.ts +9 -0
- package/dist/types/dependency.d.ts.map +1 -1
- package/dist/types/frontmatter.d.ts +5 -0
- package/dist/types/frontmatter.d.ts.map +1 -1
- package/dist/types/lock.d.ts +1 -1
- package/dist/types/lock.d.ts.map +1 -1
- package/dist/types/state.d.ts +1 -1
- package/dist/types/state.d.ts.map +1 -1
- package/dist/utils/artifact-path.d.ts +19 -0
- package/dist/utils/artifact-path.d.ts.map +1 -0
- package/dist/utils/artifact-path.js +95 -0
- package/dist/utils/artifact-path.js.map +1 -0
- package/dist/utils/artifact-path.test.d.ts +2 -0
- package/dist/utils/artifact-path.test.d.ts.map +1 -0
- package/dist/utils/artifact-path.test.js +138 -0
- package/dist/utils/artifact-path.test.js.map +1 -0
- package/dist/utils/errors.d.ts +3 -1
- package/dist/utils/errors.d.ts.map +1 -1
- package/dist/utils/errors.js +21 -2
- package/dist/utils/errors.js.map +1 -1
- package/dist/utils/errors.test.js +27 -1
- package/dist/utils/errors.test.js.map +1 -1
- package/dist/utils/user-errors.d.ts +46 -0
- package/dist/utils/user-errors.d.ts.map +1 -0
- package/dist/utils/user-errors.js +76 -0
- package/dist/utils/user-errors.js.map +1 -0
- package/dist/utils/user-errors.test.d.ts +2 -0
- package/dist/utils/user-errors.test.d.ts.map +1 -0
- package/dist/utils/user-errors.test.js +74 -0
- package/dist/utils/user-errors.test.js.map +1 -0
- package/dist/validation/index.d.ts.map +1 -1
- package/dist/validation/index.js +16 -0
- package/dist/validation/index.js.map +1 -1
- package/dist/validation/index.test.js +48 -0
- package/dist/validation/index.test.js.map +1 -1
- package/dist/validation/state-validator.d.ts +5 -2
- package/dist/validation/state-validator.d.ts.map +1 -1
- package/dist/validation/state-validator.js +18 -20
- package/dist/validation/state-validator.js.map +1 -1
- package/dist/validation/state-validator.test.js +31 -2
- package/dist/validation/state-validator.test.js.map +1 -1
- package/dist/wizard/copy/backend.d.ts.map +1 -1
- package/dist/wizard/copy/backend.js +12 -0
- package/dist/wizard/copy/backend.js.map +1 -1
- package/dist/wizard/flags.d.ts +1 -0
- package/dist/wizard/flags.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +5 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +45 -2
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +23 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +85 -47
- package/dist/wizard/wizard.js.map +1 -1
- package/dist/wizard/wizard.test.js +186 -1
- package/dist/wizard/wizard.test.js.map +1 -1
- package/package.json +1 -1
|
@@ -0,0 +1,545 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: multi-service-observability
|
|
3
|
+
description: Distributed tracing, correlation IDs, cross-service SLOs, and failure attribution
|
|
4
|
+
topics: [distributed-tracing, correlation-ids, cross-service-slos, failure-attribution]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## Summary
|
|
8
|
+
|
|
9
|
+
Observability in a multi-service system is a prerequisite for correct operation, not an optional enhancement. When a request crosses four service boundaries before returning an error, you cannot debug it without distributed tracing and correlation IDs.
|
|
10
|
+
|
|
11
|
+
**Three pillars for multi-service systems:**
|
|
12
|
+
- **Distributed tracing (W3C Trace Context):** Every request gets a `traceparent` header with a trace ID and span ID. Each service records spans. All spans for a single request share a trace ID, creating a complete picture of the request's journey. Use OpenTelemetry (vendor-neutral) and export to any backend (Jaeger, Tempo, Datadog).
|
|
13
|
+
- **Correlation IDs (`X-Correlation-ID`):** Business-level identifier for a workflow, persisted in the application database. Survives async boundaries that distributed traces don't bridge (jobs, scheduled tasks, multi-request workflows). Include in every log entry and outgoing message.
|
|
14
|
+
- **Structured logs (JSON):** Every log entry must include `correlationId`, `traceId`, `service`, `version`, and `level`. Ship to a central aggregation system (ELK, Loki, CloudWatch).
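As a minimal sketch of the structured-log requirement above, the following formats one JSON log line that carries the five required fields. The `LogEntry` shape, the `formatLogEntry` helper, and the extra `message` and `timestamp` fields are illustrative assumptions, not part of any logging library.

```typescript
// Hypothetical entry shape: the five required fields plus message and timestamp.
interface LogEntry {
  level: 'debug' | 'info' | 'warn' | 'error'
  service: string
  version: string
  correlationId: string
  traceId: string
  message: string
  timestamp: string
}

type LogContext = Pick<LogEntry, 'service' | 'version' | 'correlationId' | 'traceId'>

// Serializes one entry as a single JSON line, ready to ship to a log aggregator.
function formatLogEntry(
  ctx: LogContext,
  level: LogEntry['level'],
  message: string,
  now: Date = new Date(),
): string {
  const entry: LogEntry = { level, ...ctx, message, timestamp: now.toISOString() }
  return JSON.stringify(entry)
}

const line = formatLogEntry(
  {
    service: 'order-service',
    version: '1.4.2',
    correlationId: 'c-123',
    traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  },
  'error',
  'payment authorization failed',
  new Date('2024-01-01T00:00:00.000Z'),
)
console.log(line)
```

Making the context fields required in the type means a log call that omits `correlationId` or `traceId` fails at compile time instead of producing an unsearchable entry.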
|
|
15
|
+
|
|
16
|
+
**SLO strategy:** Define SLOs per service and per user-facing journey. Composite availability = product of all participating services' availabilities — a 5-service chain each at 99.9% yields ~99.5% composite. Alert on error budget burn rate (e.g., 14x sustainable rate in 1 hour), not hard thresholds.
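The burn-rate idea can be sketched as simple arithmetic. The function names, the 30-day period, and the 2% observed error rate below are assumptions chosen for illustration; only the 14x threshold comes from the text above.

```typescript
// Burn rate: how fast the error budget is being spent, relative to the rate
// that would exactly exhaust it at the end of the SLO period.
function burnRate(observedErrorRate: number, sloTarget: number): number {
  const errorBudget = 1 - sloTarget
  if (errorBudget <= 0 || errorBudget > 1) {
    throw new RangeError('sloTarget must be in the range [0, 1)')
  }
  return observedErrorRate / errorBudget
}

// Fraction of the whole period's budget consumed if `rate` is sustained for `windowHours`.
function budgetConsumed(rate: number, windowHours: number, periodHours: number): number {
  return (rate * windowHours) / periodHours
}

// Hypothetical paging rule: page when the 1-hour burn rate reaches the threshold.
function shouldPage(observedErrorRate: number, sloTarget: number, threshold = 14): boolean {
  return burnRate(observedErrorRate, sloTarget) >= threshold
}

// 99.9% SLO over 30 days (720 hours), with 2% of requests failing in the last hour.
const rate = burnRate(0.02, 0.999) // roughly 20x the sustainable rate
const spent = budgetConsumed(rate, 1, 720) // roughly 2.8% of the period's budget in one hour
console.log(rate.toFixed(1), (spent * 100).toFixed(1) + '%', shouldPage(0.02, 0.999))
```

Alerting on the ratio rather than on a fixed error percentage lets the same rule work for services with different SLO targets.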
|
|
17
|
+
|
|
18
|
+
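**Failure attribution:** Walk the span tree inward from the user-facing error to find the deepest span that recorded an error; that span marks where the failure originated, and the errors above it are usually consequences. Classify the origin as an infrastructure, dependency, or application failure.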
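One way to sketch the span-tree walk: given the spans of one trace, keep the errored spans that have no errored child. The `SpanRecord` shape and the service names are hypothetical; real tracing backends expose richer span models, and classifying the origin is left out here.

```typescript
// Hypothetical, simplified span record for a single trace.
interface SpanRecord {
  spanId: string
  parentSpanId?: string
  service: string
  name: string
  error: boolean
}

// Returns errored spans that have no errored children. Errors on the spans
// above them are treated as propagated consequences, not origins.
function findErrorOrigins(spans: SpanRecord[]): SpanRecord[] {
  const parentsOfErroredSpans = new Set<string>()
  for (const span of spans) {
    if (span.error && span.parentSpanId !== undefined) {
      parentsOfErroredSpans.add(span.parentSpanId)
    }
  }
  return spans.filter((span) => span.error && !parentsOfErroredSpans.has(span.spanId))
}

const traceSpans: SpanRecord[] = [
  { spanId: 'a', service: 'api-gateway', name: 'POST /orders', error: true },
  { spanId: 'b', parentSpanId: 'a', service: 'order-service', name: 'createOrder', error: true },
  { spanId: 'c', parentSpanId: 'b', service: 'inventory-service', name: 'reserve', error: false },
  { spanId: 'd', parentSpanId: 'b', service: 'payment-service', name: 'authorize', error: true },
]

const origins = findErrorOrigins(traceSpans)
console.log(origins.map((span) => `${span.service}:${span.name}`))
```

In this example the gateway and order-service spans are errored only because the payment call failed, so the payment span is the single origin reported.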
|
|
19
|
+
|
|
20
|
+
**OpenTelemetry Collector:** Route telemetry through a Collector (not directly from services to the backend) for backend-agnostic export, sampling, and buffering.
|
|
21
|
+
|
|
22
|
+
## Deep Guidance
|
|
23
|
+
|
|
24
|
+
## Distributed Tracing with W3C Trace Context
|
|
25
|
+
|
|
26
|
+
### The Problem Distributed Tracing Solves
|
|
27
|
+
|
|
28
|
+
A single user-facing request in a multi-service system might be handled by an API gateway, an auth service, an order service, an inventory service, and a payment service. Each service logs independently. Without distributed tracing, a single failed request leaves log entries scattered across five services with no way to correlate them. Debugging requires manual log archaeology across systems with imprecise time correlation.
|
|
29
|
+
|
|
30
|
+
Distributed tracing solves this by propagating a trace context through every service boundary. Each service records spans — units of work with start time, duration, tags, and relationships. All spans for a single request share a trace ID, creating a complete picture of the request's journey.
|
|
31
|
+
|
|
32
|
+
### W3C Trace Context Standard
|
|
33
|
+
|
|
34
|
+
The W3C Trace Context specification (https://www.w3.org/TR/trace-context/) defines two HTTP headers for propagating trace context:
|
|
35
|
+
|
|
36
|
+
**`traceparent`** — carries the trace ID, span ID, and sampling flags:
|
|
37
|
+
|
|
38
|
+
```
|
|
39
|
+
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ version
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ trace-id (16 bytes, hex)
                                                 ^^^^^^^^^^^^^^^^ parent-span-id (8 bytes, hex)
                                                                  ^^ flags (01 = sampled)
|
|
44
|
+
```
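The header layout above can be checked with a small parser. This is a sketch under assumptions: `parseTraceparent` and the `TraceParent` type are illustrative names, and the parser accepts only the four-field layout shown here, rejecting any longer form a future version might define.

```typescript
interface TraceParent {
  version: string
  traceId: string
  parentId: string
  sampled: boolean
}

// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

function parseTraceparent(header: string): TraceParent | undefined {
  const match = TRACEPARENT_PATTERN.exec(header.trim())
  if (!match) return undefined

  const version = match[1] ?? ''
  const traceId = match[2] ?? ''
  const parentId = match[3] ?? ''
  const flags = match[4] ?? ''

  // The specification treats version ff and all-zero IDs as invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return undefined
  }

  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 }
}

const parsedHeader = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
console.log(parsedHeader)
```

A service that logs `traceId` should take it from the parsed field rather than from the raw header, which also carries the version, parent ID, and flags.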
|
|
45
|
+
|
|
46
|
+
**`tracestate`** — carries vendor-specific key-value pairs alongside the standard header:
|
|
47
|
+
|
|
48
|
+
```
|
|
49
|
+
tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
**Why use W3C Trace Context instead of vendor-specific headers:**
|
|
53
|
+
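- (+) Interoperable: OpenTelemetry SDKs use it as their default propagation format, and major tracing backends and agents (AWS X-Ray, Google Cloud Trace, Datadog) can consume it, in some cases through their OpenTelemetry integrations.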
|
|
54
|
+
- (+) Future-proof: the standard is stable and broadly adopted.
|
|
55
|
+
- (-) Requires all services to propagate the headers correctly. A service that drops the headers breaks the trace chain.
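To make the propagation requirement concrete, here is a minimal sketch of what a service does by hand when no instrumentation library handles it: keep the trace ID, mint a new span ID, and forward `tracestate` unchanged. The helper names are hypothetical. In practice OpenTelemetry's HTTP instrumentation normally performs this step.

```typescript
import { randomBytes } from 'node:crypto'

// Builds the traceparent for an outgoing call: same version, trace ID, and
// flags as the incoming header, with a freshly generated 8-byte span ID.
function childTraceparent(incoming: string): string | undefined {
  const parts = incoming.trim().split('-')
  if (parts.length !== 4) return undefined
  const [version, traceId, , flags] = parts
  const newSpanId = randomBytes(8).toString('hex')
  return `${version}-${traceId}-${newSpanId}-${flags}`
}

// Returns the trace headers to attach to an outgoing request. An empty object
// means there was no usable incoming context to continue.
function outgoingTraceHeaders(
  incoming: Record<string, string | undefined>,
): Record<string, string> {
  const headers: Record<string, string> = {}
  const parent = incoming['traceparent']
  const child = parent ? childTraceparent(parent) : undefined
  if (child) {
    headers['traceparent'] = child
    const state = incoming['tracestate']
    if (state) headers['tracestate'] = state
  }
  return headers
}

const forwarded = outgoingTraceHeaders({
  traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
  tracestate: 'rojo=00f067aa0ba902b7,congo=t61rcWkgMzE',
})
console.log(forwarded)
```

A service that makes outgoing calls without attaching these headers starts a new, unrelated trace downstream, which is the broken chain the last bullet describes.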
|
|
56
|
+
|
|
57
|
+
### OpenTelemetry Integration
|
|
58
|
+
|
|
59
|
+
OpenTelemetry (OTel) is the CNCF-standard SDK for distributed tracing, metrics, and logs. It is the recommended approach — instrument once, export to any backend.
|
|
60
|
+
|
|
61
|
+
**Node.js setup:**
|
|
62
|
+
|
|
63
|
+
```typescript
|
|
64
|
+
// src/tracing.ts — initialize before requiring any other modules
|
|
65
|
+
import { NodeSDK } from '@opentelemetry/sdk-node'
|
|
66
|
+
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
|
|
67
|
+
import { Resource } from '@opentelemetry/resources'
|
|
68
|
+
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'
|
|
69
|
+
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
|
|
70
|
+
|
|
71
|
+
const sdk = new NodeSDK({
|
|
72
|
+
resource: new Resource({
|
|
73
|
+
[SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME ?? 'unknown-service',
|
|
74
|
+
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
|
|
75
|
+
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? 'development',
|
|
76
|
+
}),
|
|
77
|
+
traceExporter: new OTLPTraceExporter({
|
|
78
|
+
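url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ?? 'http://otel-collector:4318/v1/traces', // full per-signal URL; the generic OTEL_EXPORTER_OTLP_ENDPOINT is a base URL without /v1/traces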
|
|
79
|
+
}),
|
|
80
|
+
instrumentations: [
|
|
81
|
+
getNodeAutoInstrumentations({
|
|
82
|
+
'@opentelemetry/instrumentation-http': { enabled: true },
|
|
83
|
+
'@opentelemetry/instrumentation-express': { enabled: true },
|
|
84
|
+
'@opentelemetry/instrumentation-pg': { enabled: true },
|
|
85
|
+
}),
|
|
86
|
+
],
|
|
87
|
+
})
|
|
88
|
+
|
|
89
|
+
sdk.start()
|
|
90
|
+
|
|
91
|
+
// Graceful shutdown
|
|
92
|
+
process.on('SIGTERM', () => { sdk.shutdown().catch((err) => console.error('OpenTelemetry shutdown failed', err)) })
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
**Creating custom spans for business operations:**
|
|
96
|
+
|
|
97
|
+
```typescript
|
|
98
|
+
import { trace, context, SpanStatusCode } from '@opentelemetry/api'
|
|
99
|
+
|
|
100
|
+
const tracer = trace.getTracer('order-service', '1.0.0')
|
|
101
|
+
|
|
102
|
+
async function processOrder(orderId: string, items: OrderItem[]): Promise<Order> {
|
|
103
|
+
return tracer.startActiveSpan('processOrder', async (span) => {
|
|
104
|
+
span.setAttributes({
|
|
105
|
+
'order.id': orderId,
|
|
106
|
+
'order.item_count': items.length,
|
|
107
|
+
})
|
|
108
|
+
|
|
109
|
+
try {
|
|
110
|
+
const result = await doProcessOrder(orderId, items)
|
|
111
|
+
span.setStatus({ code: SpanStatusCode.OK })
|
|
112
|
+
return result
|
|
113
|
+
} catch (err) {
|
|
114
|
+
span.recordException(err as Error)
|
|
115
|
+
span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message })
|
|
116
|
+
throw err
|
|
117
|
+
} finally {
|
|
118
|
+
span.end()
|
|
119
|
+
}
|
|
120
|
+
})
|
|
121
|
+
}
|
|
122
|
+
```
|
|
123
|
+
|
|
124
|
+
**Trade-offs (OpenTelemetry auto-instrumentation):**
|
|
125
|
+
- (+) Automatic instrumentation for HTTP, gRPC, database drivers — no manual span creation needed for most cases.
|
|
126
|
+
- (+) Vendor-neutral: switch from Jaeger to Tempo to Datadog by changing the exporter config.
|
|
127
|
+
- (-) Auto-instrumentation adds startup latency (~200ms) — acceptable for long-running services, problematic for AWS Lambda cold starts.
|
|
128
|
+
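- (-) High-cardinality span attributes (user IDs, order IDs) can sharply increase storage and indexing costs. Limit how many attributes each span carries and how long their values can be (OpenTelemetry SDKs expose span limits for this), and avoid attaching unbounded identifiers you never query on.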
|
|
129
|
+
|
|
130
|
+
## Correlation ID Propagation
|
|
131
|
+
|
|
132
|
+
### Correlation IDs vs. Trace IDs
|
|
133
|
+
|
|
134
|
+
Correlation IDs and trace IDs serve different purposes:
|
|
135
|
+
|
|
136
|
+
- **Trace ID** (from W3C traceparent): used by distributed tracing systems to correlate spans. Auto-generated by the tracing SDK. Used by engineers debugging specific requests.
|
|
137
|
+
- **Correlation ID**: a business-level identifier tied to a user session or workflow, persisted in the application database for long-term audit and support. A single correlation ID may cover multiple traces when a workflow spans several requests or async operations.
|
|
138
|
+
|
|
139
|
+
Use both. The trace ID handles in-flight debugging; the correlation ID handles after-the-fact auditing and cross-referencing support tickets with log entries.
|
|
140
|
+
|
|
141
|
+
### Propagation Standards
|
|
142
|
+
|
|
143
|
+
**Incoming requests:** Extract the correlation ID from the `X-Correlation-ID` header. If absent, generate a new UUID. Always return it in the response.
|
|
144
|
+
|
|
145
|
+
**Outgoing requests:** Attach the correlation ID to every outgoing HTTP call, Kafka message, and async job.
|
|
146
|
+
|
|
147
|
+
**Logs:** Include the correlation ID in every log entry during request processing.
|
|
148
|
+
|
|
149
|
+
```typescript
|
|
150
|
+
// src/middleware/correlation.ts
|
|
151
|
+
import { randomUUID } from 'crypto'
|
|
152
|
+
import type { Request, Response, NextFunction } from 'express'
|
|
153
|
+
import { AsyncLocalStorage } from 'async_hooks'
|
|
154
|
+
|
|
155
|
+
export const correlationStore = new AsyncLocalStorage<{ correlationId: string; traceId?: string }>()
|
|
156
|
+
|
|
157
|
+
export function correlationMiddleware(req: Request, res: Response, next: NextFunction): void {
|
|
158
|
+
const correlationId = (req.headers['x-correlation-id'] as string) ?? randomUUID()
|
|
159
|
+
const traceId = (req.headers['traceparent'] as string | undefined)?.split('-')[1] // traceparent is version-traceid-parentid-flags; keep only the trace ID
|
|
160
|
+
|
|
161
|
+
res.setHeader('X-Correlation-ID', correlationId)
|
|
162
|
+
|
|
163
|
+
correlationStore.run({ correlationId, traceId }, () => {
|
|
164
|
+
next()
|
|
165
|
+
})
|
|
166
|
+
}
|
|
167
|
+
|
|
168
|
+
export function getCorrelationId(): string | undefined {
|
|
169
|
+
return correlationStore.getStore()?.correlationId
|
|
170
|
+
}
|
|
171
|
+
|
|
172
|
+
// Attach to outgoing HTTP calls
|
|
173
|
+
export function outboundHeaders(): Record<string, string> {
|
|
174
|
+
const store = correlationStore.getStore()
|
|
175
|
+
if (!store) return {}
|
|
176
|
+
return {
|
|
177
|
+
'X-Correlation-ID': store.correlationId,
|
|
178
|
+
}
|
|
179
|
+
}
|
|
180
|
+
```
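Node exposes header values as `string | string[] | undefined`, and a client-supplied correlation ID is untrusted input. A hedged sketch of a stricter resolver follows; the `resolveCorrelationId` name, the 128-character cap, and the allowed character set are assumptions to adjust to your own ID format.

```typescript
import { randomUUID } from 'node:crypto'

// Assumed limits: short, log-safe identifiers only.
const MAX_CORRELATION_ID_LENGTH = 128
const SAFE_CORRELATION_ID = /^[A-Za-z0-9._-]+$/

// Returns the incoming correlation ID when it is present and well-formed,
// otherwise generates a fresh UUID.
function resolveCorrelationId(raw: string | string[] | undefined): string {
  const candidate = Array.isArray(raw) ? raw[0] : raw
  if (candidate !== undefined) {
    const trimmed = candidate.trim()
    if (
      trimmed.length > 0 &&
      trimmed.length <= MAX_CORRELATION_ID_LENGTH &&
      SAFE_CORRELATION_ID.test(trimmed)
    ) {
      return trimmed
    }
  }
  return randomUUID()
}

console.log(resolveCorrelationId('order-flow-42'))
console.log(resolveCorrelationId(undefined))
```

Rejecting empty or malformed values matters because `??` alone only falls back on `null` or `undefined`, so an empty header would otherwise be logged as an empty correlation ID.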
|
|
181
|
+
|
|
182
|
+
**In structured logs (pino example):**
|
|
183
|
+
|
|
184
|
+
```typescript
|
|
185
|
+
import pino from 'pino'
|
|
186
|
+
import { getCorrelationId } from './middleware/correlation.js'
|
|
187
|
+
|
|
188
|
+
const baseLogger = pino({
|
|
189
|
+
level: process.env.LOG_LEVEL ?? 'info',
|
|
190
|
+
formatters: {
|
|
191
|
+
log(object) {
|
|
192
|
+
return {
|
|
193
|
+
...object,
|
|
194
|
+
correlationId: getCorrelationId(),
|
|
195
|
+
service: process.env.SERVICE_NAME,
|
|
196
|
+
version: process.env.SERVICE_VERSION,
|
|
197
|
+
}
|
|
198
|
+
},
|
|
199
|
+
},
|
|
200
|
+
})
|
|
201
|
+
|
|
202
|
+
export const logger = baseLogger
|
|
203
|
+
```
|
|
204
|
+
|
|
205
|
+
**In Kafka messages:**
|
|
206
|
+
|
|
207
|
+
```typescript
|
|
208
|
+
// Attach correlation context to message headers
|
|
209
|
+
await producer.send({
|
|
210
|
+
topic: 'order.placed',
|
|
211
|
+
messages: [{
|
|
212
|
+
key: orderId,
|
|
213
|
+
value: JSON.stringify(payload),
|
|
214
|
+
headers: {
|
|
215
|
+
'x-correlation-id': getCorrelationId() ?? '',
|
|
216
|
+
'x-source-service': process.env.SERVICE_NAME ?? '',
|
|
217
|
+
},
|
|
218
|
+
}],
|
|
219
|
+
})
|
|
220
|
+
```
|
|
221
|
+
|
|
222
|
+
**Consumer side — extract and propagate:**
|
|
223
|
+
|
|
224
|
+
```typescript
|
|
225
|
+
consumer.run({
|
|
226
|
+
eachMessage: async ({ message }) => {
|
|
227
|
+
const correlationId =
|
|
228
|
+
message.headers?.['x-correlation-id']?.toString() || randomUUID() // '||' so an empty header value also gets a fresh ID
|
|
229
|
+
|
|
230
|
+
await correlationStore.run({ correlationId }, async () => { // await so eachMessage resolves only after processing finishes
|
|
231
|
+
await processMessage(message)
|
|
232
|
+
})
|
|
233
|
+
},
|
|
234
|
+
})
|
|
235
|
+
```
|
|
236
|
+
|
|
237
|
+
**Trade-offs (correlation ID propagation):**
|
|
238
|
+
- (+) End-to-end request tracing across async boundaries that distributed tracing alone cannot bridge (async jobs, scheduled tasks, event chains spanning minutes or hours).
|
|
239
|
+
- (+) Customer support can reference a correlation ID in a ticket and engineers can filter all logs for that single workflow.
|
|
240
|
+
- (-) Every service must be updated to propagate the header. A service that drops it breaks the chain.
|
|
241
|
+
- (-) Adds a high-cardinality field to every log entry, which increases index size and storage cost in backends that index it. Plan retention so that older logs are pruned.
|
|
242
|
+
|
|
243
|
+
## Cross-Service SLO Definition and Error Budget Management
|
|
244
|
+
|
|
245
|
+
### Defining SLOs Across Services
|
|
246
|
+
|
|
247
|
+
A Service Level Objective (SLO) is a target for service reliability expressed as a percentage of requests that succeed within a defined latency. In a multi-service system, each service has its own SLOs, and user-facing operations have composite SLOs that depend on the SLOs of all participating services.
|
|
248
|
+
|
|
249
|
+
**Single-service SLO example:**
|
|
250
|
+
|
|
251
|
+
```yaml
|
|
252
|
+
# docs/slos/order-service.yml
|
|
253
|
+
service: order-service
|
|
254
|
+
slos:
|
|
255
|
+
- name: order_placement_availability
|
|
256
|
+
description: POST /orders returns 2xx or 422 (valid response, not an infra error)
|
|
257
|
+
target: 99.9% # about 43.2 minutes of downtime per 30-day window
|
|
258
|
+
window: 30d
|
|
259
|
+
indicator:
|
|
260
|
+
type: availability
|
|
261
|
+
good_events: http_requests_total{service="order-service", path="/orders", method="POST", status=~"2..|422"}
|
|
262
|
+
total_events: http_requests_total{service="order-service", path="/orders", method="POST"}
|
|
263
|
+
|
|
264
|
+
- name: order_placement_latency
|
|
265
|
+
description: POST /orders responds within 500ms at p99
|
|
266
|
+
target: 99%
|
|
267
|
+
window: 30d
|
|
268
|
+
indicator:
|
|
269
|
+
type: latency
|
|
270
|
+
threshold_ms: 500
|
|
271
|
+
percentile: 99
|
|
272
|
+
metric: http_request_duration_ms{service="order-service", path="/orders", method="POST"}
|
|
273
|
+
```
|
|
274
|
+
|
|
275
|
+
**Composite SLO for a user-facing flow:** When a user places an order, the request touches the API gateway, auth service, order service, inventory service, and payment service. If every service is on the critical path and failures are independent, the composite availability is the product of each service's availability:
|
|
276
|
+
|
|
277
|
+
```
|
|
278
|
+
P(order_success) = P(gateway) × P(auth) × P(order) × P(inventory) × P(payment)
|
|
279
|
+
= 0.9999 × 0.9999 × 0.9990 × 0.9995 × 0.9990
|
|
280
|
+
= 0.9973 (99.73% availability, ~2 hours downtime/month)
|
|
281
|
+
```
|
|
282
|
+
|
|
283
|
+
This means that a 99.9% target for the composite user experience requires each participating service to exceed 99.9% on its own: a single service at exactly 99.9% already consumes the entire composite error budget, leaving nothing for the others.
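
The composite calculation above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes independent failures and a strictly serial call path; the function names `compositeAvailability` and `downtimeMinutes` are illustrative and not part of any library.

```typescript
// Sketch: serial composite availability, assuming independent failures.

/** Multiply per-service availabilities (each a fraction in [0, 1]). */
function compositeAvailability(availabilities: number[]): number {
  return availabilities.reduce((product, a) => product * a, 1)
}

/** Expected downtime in minutes over a window for a given availability. */
function downtimeMinutes(availability: number, windowDays = 30): number {
  return (1 - availability) * windowDays * 24 * 60
}

// gateway, auth, order, inventory, payment (values from the example above)
const orderFlow = [0.9999, 0.9999, 0.999, 0.9995, 0.999]
const orderFlowAvailability = compositeAvailability(orderFlow)

console.log(orderFlowAvailability.toFixed(4))                   // 0.9973
console.log(Math.round(downtimeMinutes(orderFlowAvailability))) // 117 (minutes per 30 days)
```

Running the same function with a candidate set of per-service targets is a quick way to check whether a proposed journey-level SLO is reachable at all.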
|
|
284
|
+
|
|
285
|
+
**Practical SLO guidelines:**
|
|
286
|
+
- Define SLOs per service and per user-facing journey. Both are needed.
|
|
287
|
+
- Use 30-day rolling windows for error budgets. A rolling window avoids the abrupt budget reset that calendar-aligned (monthly or quarterly) windows produce.
|
|
288
|
+
- Alert on error budget burn rate (e.g., if you burn 5% of the monthly error budget in an hour, page on-call) rather than hard availability thresholds.
|
|
289
|
+
- SLOs should be stored in version control alongside the service code.
|
|
290
|
+
|
|
291
|
+
### Error Budget Management
|
|
292
|
+
|
|
293
|
+
An error budget is the allowed failure capacity derived from the SLO target: if the SLO is 99.9%, the error budget is 0.1% (about 43.2 minutes of downtime per 30-day window).
|
|
294
|
+
|
|
295
|
+
**Error budget policy decisions:**
|
|
296
|
+
|
|
297
|
+
| Error Budget Remaining | Allowed Action |
|
|
298
|
+
|------------------------|----------------|
|
|
299
|
+
| > 50% | Normal development velocity, feature work, experiments |
|
|
300
|
+
| 25–50% | Caution. Prefer reliability improvements over new features |
|
|
301
|
+
| 10–25% | Freeze risky deploys. Focus on reliability work |
|
|
302
|
+
| < 10% | Stop all non-critical deploys. Incident review required |
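
The policy table can be encoded so that deploy tooling or a dashboard can look up the current tier. The sketch below is one possible encoding, not a prescribed one: the names `budgetRemaining`, `policyFor`, and the action labels are hypothetical, and the handling of the exact boundaries (50%, 25%, 10%) is an assumption because the table does not say which tier owns them.

```typescript
// Sketch: compute the unspent share of the error budget and map it to a policy tier.
// Assumes sloTarget < 1; boundary ownership (>= vs >) is an assumption.

type BudgetAction =
  | 'normal-velocity'
  | 'prefer-reliability'
  | 'freeze-risky-deploys'
  | 'stop-non-critical-deploys'

/** Fraction of the error budget still unspent (negative when overspent). */
function budgetRemaining(goodEvents: number, totalEvents: number, sloTarget: number): number {
  if (totalEvents === 0) return 1 // no traffic, nothing spent
  const errorRatio = 1 - goodEvents / totalEvents
  const allowedErrorRatio = 1 - sloTarget
  return 1 - errorRatio / allowedErrorRatio
}

function policyFor(remainingFraction: number): BudgetAction {
  if (remainingFraction > 0.5) return 'normal-velocity'
  if (remainingFraction >= 0.25) return 'prefer-reliability'
  if (remainingFraction >= 0.1) return 'freeze-risky-deploys'
  return 'stop-non-critical-deploys'
}

// 1,000,000 requests with 400 failures against a 99.9% SLO:
// 1,000 failures are allowed, so about 60% of the budget is left.
const budgetLeft = budgetRemaining(999_600, 1_000_000, 0.999)
console.log(policyFor(budgetLeft)) // normal-velocity
```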
|
|
303
|
+
|
|
304
|
+
**Prometheus alert rule for error budget burn rate:**
|
|
305
|
+
|
|
306
|
+
```yaml
|
|
307
|
+
# alerts/slo-burn-rate.yml
|
|
308
|
+
groups:
|
|
309
|
+
- name: slo_burn_rate
|
|
310
|
+
rules:
|
|
311
|
+
- alert: HighErrorBudgetBurnRate
|
|
312
|
+
expr: |
|
|
313
|
+
(
|
|
314
|
+
sum by (service) (rate(http_requests_total{status=~"5.."}[1h])) /
|
|
315
|
+
sum by (service) (rate(http_requests_total[1h]))
|
|
316
|
+
) > (14.4 * (1 - 0.999))
|
|
317
|
+
for: 2m
|
|
318
|
+
labels:
|
|
319
|
+
severity: page
|
|
320
|
+
annotations:
|
|
321
|
+
summary: "{{ $labels.service }} burning error budget at 14x rate"
|
|
322
|
+
description: |
|
|
323
|
+
Service {{ $labels.service }} is burning its monthly error budget
|
|
324
|
+
at 14x the sustainable rate. At this rate, the full monthly budget
|
|
325
|
+
will be consumed in ~2 days (about 50 hours).
|
|
326
|
+
```
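
The threshold in the rule above is 14.4 times the allowed error ratio for a 99.9% SLO. The arithmetic behind that multiplier is easy to get wrong, so here is a small sketch of it. The helper names are illustrative assumptions, and the 30-day window matches the SLO definitions earlier in this document.

```typescript
// Sketch: burn-rate arithmetic behind an error budget alert threshold.

/** How many times faster than the sustainable pace the budget is being spent. */
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget)
}

/** Hours until the whole budget is gone if the burn rate stays constant. */
function hoursToExhaustion(rate: number, windowDays = 30): number {
  return (windowDays * 24) / rate
}

/** Fraction of the window's budget consumed in `hours` at a given burn rate. */
function budgetConsumed(rate: number, hours: number, windowDays = 30): number {
  return (rate * hours) / (windowDays * 24)
}

// 1.44% of requests failing against a 99.9% SLO
const observedBurnRate = burnRate(0.0144, 0.999)
console.log(observedBurnRate.toFixed(1))                                  // 14.4
console.log(hoursToExhaustion(observedBurnRate).toFixed(0))               // 50
console.log((budgetConsumed(observedBurnRate, 1) * 100).toFixed(0) + '%') // 2%
```

In other words, a sustained 14.4x burn spends about 2% of a 30-day budget per hour and exhausts it in roughly 50 hours.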
|
|
327
|
+
|
|
328
|
+
**Trade-offs (SLO-based alerting):**
|
|
329
|
+
- (+) Error budget burn rate alerts fire early (before the budget is exhausted) and reduce alert fatigue compared to hard threshold alerts.
|
|
330
|
+
- (+) Aligns engineering and product decisions: spending error budget on risky experiments is an explicit product trade-off.
|
|
331
|
+
- (-) Requires setting meaningful SLO targets — too lenient wastes budget, too strict makes every incident a budget crisis.
|
|
332
|
+
- (-) Composite SLOs across services require all participating services to instrument and report correctly.
|
|
333
|
+
|
|
334
|
+
## Failure Attribution and Root Cause Analysis
|
|
335
|
+
|
|
336
|
+
### Attributing Failures in Distributed Traces
|
|
337
|
+
|
|
338
|
+
When a distributed request fails, the trace shows which span failed and why. The root cause is typically the deepest span with an error status — but not always. Use a structured analysis approach.
|
|
339
|
+
|
|
340
|
+
**Steps for trace-based failure attribution:**
|
|
341
|
+
|
|
342
|
+
1. Identify the user-facing error (the outermost span with an error status).
|
|
343
|
+
2. Walk the span tree inward, following spans with an error status, until you reach the deepest one. This is usually the origin of the error.
|
|
344
|
+
3. Check if the origin span is a timeout, a 5xx from a downstream service, or an exception in application code.
|
|
345
|
+
4. Classify the failure: infrastructure (network, hardware), dependency (external API, database), or application (bug, unhandled edge case).
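
The steps above can be sketched as a walk over a span tree. This is an illustration only: the `SpanNode` shape, the `errorType` values, and the classification rules are hypothetical and do not correspond to an OpenTelemetry API. As noted above, the deepest error span is a starting point for analysis, not a guaranteed root cause.

```typescript
// Sketch: locate the likely origin of a failure in a span tree, then classify it.

interface SpanNode {
  spanName: string
  service: string
  error: boolean
  errorType?: string // e.g. 'Timeout', 'ServiceUnavailable', 'TypeError'
  children: SpanNode[]
}

type FailureClass = 'infrastructure' | 'dependency' | 'application'

/** Follow error spans inward and return the deepest one, or undefined if the root is healthy. */
function findOriginSpan(root: SpanNode): SpanNode | undefined {
  if (!root.error) return undefined
  for (const child of root.children) {
    const deeper = findOriginSpan(child)
    if (deeper) return deeper // first failing branch wins; others need manual review
  }
  return root // no failing child: the error originated here
}

/** Very rough classification; real rules depend on your own error taxonomy. */
function classifyFailure(span: SpanNode): FailureClass {
  if (span.errorType === 'Timeout' || span.errorType === 'ConnectionRefused') return 'infrastructure'
  if (span.errorType === 'ServiceUnavailable') return 'dependency'
  return 'application'
}

const failedOrder: SpanNode = {
  spanName: 'POST /orders', service: 'api-gateway', error: true, children: [{
    spanName: 'placeOrder', service: 'order-service', error: true, children: [
      { spanName: 'reserveStock', service: 'inventory-service', error: false, children: [] },
      { spanName: 'charge', service: 'payment-service', error: true, errorType: 'ServiceUnavailable', children: [] },
    ],
  }],
}

const originSpan = findOriginSpan(failedOrder)
console.log(originSpan?.service)                                       // payment-service
console.log(originSpan ? classifyFailure(originSpan) : 'no error found') // dependency
```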
|
|
346
|
+
|
|
347
|
+
**Span attributes to include for attribution:**
|
|
348
|
+
|
|
349
|
+
```typescript
|
|
350
|
+
// Good span attributes for failure attribution
|
|
351
|
+
span.setAttributes({
|
|
352
|
+
'http.method': 'POST',
|
|
353
|
+
'http.url': 'https://payment-service/charge',
|
|
354
|
+
'http.status_code': 503,
|
|
355
|
+
'error.type': 'ServiceUnavailable',
|
|
356
|
+
'error.message': 'payment-service: connection timeout after 2000ms',
|
|
357
|
+
'downstream.service': 'payment-service',
|
|
358
|
+
'retry.attempt': 2,
|
|
359
|
+
'retry.max': 3,
|
|
360
|
+
})
|
|
361
|
+
```
|
|
362
|
+
|
|
363
|
+
### Distributed Logging Aggregation
|
|
364
|
+
|
|
365
|
+
All services must ship logs to a central log aggregation system (ELK stack, Loki, CloudWatch Logs). Structured JSON logs with consistent fields are essential.
|
|
366
|
+
|
|
367
|
+
**Mandatory log fields (every log entry from every service):**
|
|
368
|
+
|
|
369
|
+
```typescript
|
|
370
|
+
interface LogEntry {
|
|
371
|
+
timestamp: string // ISO 8601
|
|
372
|
+
level: 'debug' | 'info' | 'warn' | 'error' | 'fatal'
|
|
373
|
+
service: string // service name from SERVICE_NAME env var
|
|
374
|
+
version: string // service version
|
|
375
|
+
correlationId?: string // propagated X-Correlation-ID
|
|
376
|
+
traceId?: string // from W3C traceparent
|
|
377
|
+
spanId?: string // current span ID
|
|
378
|
+
message: string
|
|
379
|
+
// Additional context fields as needed
|
|
380
|
+
[key: string]: unknown
|
|
381
|
+
}
|
|
382
|
+
```
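
The `traceId` field above comes from the W3C `traceparent` header, which carries four dash-separated hex fields: version, trace ID, parent span ID, and flags. Below is a minimal sketch of extracting the trace ID for logging. `parseTraceparent` and `TraceFields` are illustrative names, and the parser only accepts the four-field layout used by version `00`. Note that the span ID in the header identifies the caller's span, so the `spanId` log field for the current service should come from the tracing SDK instead.

```typescript
// Sketch: parse a W3C traceparent header (version 00 layout) into log-friendly fields.

interface TraceFields {
  traceId: string      // 32 lowercase hex characters
  parentSpanId: string // 16 lowercase hex characters, the caller's span
  sampled: boolean     // lowest bit of the flags field
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

function parseTraceparent(header: string | undefined): TraceFields | undefined {
  if (!header) return undefined
  const match = TRACEPARENT_RE.exec(header.trim())
  if (!match) return undefined
  const [, version, traceId, parentSpanId, flags] = match
  if (version === 'ff') return undefined // version ff is reserved as invalid
  // All-zero IDs are invalid and must be ignored
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return undefined
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 }
}

const parsedHeader = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
console.log(parsedHeader?.traceId) // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(parsedHeader?.sampled) // true
```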
|
|
383
|
+
|
|
384
|
+
**Log query patterns for failure attribution:**
|
|
385
|
+
|
|
386
|
+
```
|
|
387
|
+
# Find all log entries for a specific correlation ID across all services
|
|
388
|
+
correlationId = "550e8400-e29b-41d4-a716-446655440000"
|
|
389
|
+
|
|
390
|
+
# Find all errors in the order-placement flow in the last hour
|
|
391
|
+
level = "error" AND correlationId = "..." AND timestamp > now() - 1h
|
|
392
|
+
|
|
393
|
+
# Find timeout patterns across the payment-service
|
|
394
|
+
service = "payment-service" AND message:timeout AND level = "error"
|
|
395
|
+
| stats count by bin(1m)
|
|
396
|
+
```
|
|
397
|
+
|
|
398
|
+
### Cross-Service Dashboards
|
|
399
|
+
|
|
400
|
+
A cross-service dashboard gives the on-call engineer a single view of system health:
|
|
401
|
+
|
|
402
|
+
**Essential panels for a cross-service operations dashboard:**
|
|
403
|
+
|
|
404
|
+
```yaml
|
|
405
|
+
# Grafana dashboard structure (conceptual)
|
|
406
|
+
dashboard:
|
|
407
|
+
title: "Multi-Service Operations"
|
|
408
|
+
rows:
|
|
409
|
+
- title: "User-Facing Health"
|
|
410
|
+
panels:
|
|
411
|
+
- name: "Composite Availability (30m window)"
|
|
412
|
+
type: stat
|
|
413
|
+
query: |
|
|
414
|
+
avg(rate(http_requests_success[30m]) / rate(http_requests_total[30m]))
|
|
415
|
+
- name: "p99 Latency by Service"
|
|
416
|
+
type: timeseries
|
|
417
|
+
query: |
|
|
418
|
+
histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_ms_bucket[5m])))
|
|
419
|
+
|
|
420
|
+
- title: "Error Budget"
|
|
421
|
+
panels:
|
|
422
|
+
- name: "Error Budget Remaining (30d)"
|
|
423
|
+
type: gauge
|
|
424
|
+
thresholds: [10, 25, 50]
|
|
425
|
+
query: |
|
|
426
|
+
100 * (1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) /
|
|
427
|
+
sum(rate(http_requests_total[30d]))) / (1 - 0.999))
|
|
428
|
+
|
|
429
|
+
- title: "Service Dependencies"
|
|
430
|
+
panels:
|
|
431
|
+
- name: "Cross-Service Call Success Rate"
|
|
432
|
+
type: heatmap
|
|
433
|
+
description: "Source service (rows) calling destination service (columns)"
|
|
434
|
+
```
|
|
435
|
+
|
|
436
|
+
**Trade-offs (centralized dashboards):**
|
|
437
|
+
- (+) Single pane of glass during incidents — on-call does not need to check each service individually.
|
|
438
|
+
- (+) Error budget panels enforce SLO accountability.
|
|
439
|
+
- (-) Dashboard maintenance burden. When services are added or renamed, dashboards go stale.
|
|
440
|
+
- (-) A single cross-service dashboard can obscure service-specific details. Link to per-service dashboards from the cross-service dashboard rather than collapsing everything into one view.
|
|
441
|
+
|
|
442
|
+
## OpenTelemetry Collector Deployment
|
|
443
|
+
|
|
444
|
+
For production deployments, route telemetry through an OpenTelemetry Collector rather than exporting directly from services to the backend. The Collector acts as a buffer, processor, and router.
|
|
445
|
+
|
|
446
|
+
```yaml
|
|
447
|
+
# otel-collector-config.yml
|
|
448
|
+
receivers:
|
|
449
|
+
otlp:
|
|
450
|
+
protocols:
|
|
451
|
+
grpc:
|
|
452
|
+
endpoint: 0.0.0.0:4317
|
|
453
|
+
http:
|
|
454
|
+
endpoint: 0.0.0.0:4318
|
|
455
|
+
|
|
456
|
+
processors:
|
|
457
|
+
batch:
|
|
458
|
+
timeout: 10s
|
|
459
|
+
send_batch_size: 1024
|
|
460
|
+
memory_limiter:
|
|
461
|
+
check_interval: 1s
|
|
462
|
+
limit_mib: 400
|
|
463
|
+
spike_limit_mib: 100
|
|
464
|
+
resource:
|
|
465
|
+
attributes:
|
|
466
|
+
- action: insert
|
|
467
|
+
key: deployment.environment
|
|
468
|
+
value: "${DEPLOYMENT_ENVIRONMENT}"
|
|
469
|
+
|
|
470
|
+
exporters:
|
|
471
|
+
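jaeger: # removed from newer Collector releases; there, use an otlp exporter pointed at Jaeger's OTLP endpoint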
|
|
472
|
+
endpoint: jaeger-collector:14250
|
|
473
|
+
tls:
|
|
474
|
+
insecure: false
|
|
475
|
+
cert_file: /certs/collector.crt
|
|
476
|
+
key_file: /certs/collector.key
|
|
477
|
+
prometheus:
|
|
478
|
+
endpoint: "0.0.0.0:8889"
|
|
479
|
+
namespace: otel
|
|
480
|
+
loki:
|
|
481
|
+
endpoint: http://loki:3100/loki/api/v1/push
|
|
482
|
+
|
|
483
|
+
service:
|
|
484
|
+
pipelines:
|
|
485
|
+
traces:
|
|
486
|
+
receivers: [otlp]
|
|
487
|
+
processors: [memory_limiter, batch, resource]
|
|
488
|
+
exporters: [jaeger]
|
|
489
|
+
metrics:
|
|
490
|
+
receivers: [otlp]
|
|
491
|
+
processors: [memory_limiter, batch]
|
|
492
|
+
exporters: [prometheus]
|
|
493
|
+
logs:
|
|
494
|
+
receivers: [otlp]
|
|
495
|
+
processors: [memory_limiter, batch, resource]
|
|
496
|
+
exporters: [loki]
|
|
497
|
+
```
|
|
498
|
+
|
|
499
|
+
**Trade-offs (OTel Collector):**
|
|
500
|
+
- (+) Backend-agnostic. Switch from Jaeger to Tempo by changing the exporter — no service code changes.
|
|
501
|
+
- (+) The Collector can sample, filter, and enrich telemetry before export. Reduces storage costs.
|
|
502
|
+
- (+) The Collector can queue and retry exports during brief backend outages, which reduces data loss if Jaeger has a hiccup. The queue is bounded, so a long outage still drops data.
|
|
503
|
+
- (-) Adds one more component to operate. The Collector must be highly available or services lose telemetry.
|
|
504
|
+
- (-) Misconfigured sampling in the Collector can silently drop critical traces. Monitor Collector drop rate.
|
|
505
|
+
|
|
506
|
+
## Sampling Strategy
|
|
507
|
+
|
|
508
|
+
High-traffic services can generate millions of spans per minute. Sampling reduces storage costs at the expense of completeness.
|
|
509
|
+
|
|
510
|
+
**Head-based sampling:** The tracing SDK decides at the start of a trace whether to record it (based on a percentage, e.g., 1%). Simple, but the decision is made before the outcome is known, so most error traces are dropped along with the successes.
|
|
511
|
+
|
|
512
|
+
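**Tail-based sampling (recommended for production):** The Collector holds spans in memory for a fixed wait period (`decision_wait` below), then decides whether to keep the trace based on trace-level criteria (e.g., keep all error traces, keep 1% of success traces). All spans of a trace must reach the same Collector instance for the decision to see the whole trace.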
|
|
513
|
+
|
|
514
|
+
```yaml
|
|
515
|
+
# Tail-based sampling in OTel Collector
|
|
516
|
+
processors:
|
|
517
|
+
tail_sampling:
|
|
518
|
+
decision_wait: 10s
|
|
519
|
+
num_traces: 100000
|
|
520
|
+
expected_new_traces_per_sec: 1000
|
|
521
|
+
policies:
|
|
522
|
+
- name: errors-policy
|
|
523
|
+
type: status_code
|
|
524
|
+
status_code: {status_codes: [ERROR]}
|
|
525
|
+
- name: slow-traces-policy
|
|
526
|
+
type: latency
|
|
527
|
+
latency: {threshold_ms: 2000}
|
|
528
|
+
- name: probabilistic-policy
|
|
529
|
+
type: probabilistic
|
|
530
|
+
probabilistic: {sampling_percentage: 1}
|
|
531
|
+
```
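
The three policies above combine as a logical OR: a trace is kept if any policy selects it. The sketch below models that decision for a single buffered trace. It is an illustration of the logic, not the Collector's implementation; `FinishedSpan`, `SamplingOptions`, and `shouldKeepTrace` are hypothetical names, and the random source is injectable only to make the behavior testable.

```typescript
// Sketch: keep/drop decision for one buffered trace, mirroring the
// error, latency, and probabilistic policies in the config above.

interface FinishedSpan {
  startMs: number
  endMs: number
  isError: boolean
}

interface SamplingOptions {
  latencyThresholdMs: number // keep traces slower than this
  keepFraction: number       // share of remaining traces to keep (0.01 = 1%)
  random?: () => number      // defaults to Math.random
}

function shouldKeepTrace(spans: FinishedSpan[], opts: SamplingOptions): boolean {
  if (spans.length === 0) return false

  // Policy 1: any span with an error status keeps the whole trace
  if (spans.some((s) => s.isError)) return true

  // Policy 2: end-to-end duration above the latency threshold
  const traceStart = Math.min(...spans.map((s) => s.startMs))
  const traceEnd = Math.max(...spans.map((s) => s.endMs))
  if (traceEnd - traceStart > opts.latencyThresholdMs) return true

  // Policy 3: keep a small random share of everything else
  const random = opts.random ?? Math.random
  return random() < opts.keepFraction
}

const samplingDefaults: SamplingOptions = { latencyThresholdMs: 2000, keepFraction: 0.01 }
const fastSuccess: FinishedSpan[] = [{ startMs: 0, endMs: 120, isError: false }]
const slowSuccess: FinishedSpan[] = [{ startMs: 0, endMs: 2500, isError: false }]

console.log(shouldKeepTrace(slowSuccess, samplingDefaults))                              // true
console.log(shouldKeepTrace(fastSuccess, { ...samplingDefaults, random: () => 0.5 }))    // false
```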
|
|
532
|
+
|
|
533
|
+
## Common Pitfalls
|
|
534
|
+
|
|
535
|
+
**Missing header propagation.** A service receives a `traceparent` header but does not forward it in outgoing calls. The trace is broken at that service — the downstream spans appear as independent traces with no parent. Fix: instrument all HTTP clients, message producers, and async job dispatchers to propagate trace context.
|
|
536
|
+
|
|
537
|
+
**Log correlation without structured logs.** If services log plain text without the correlation ID field, log queries cannot aggregate across services. Fix: require structured JSON logs with `correlationId` and `traceId` as top-level fields in all services.
|
|
538
|
+
|
|
539
|
+
**SLOs without alerting.** Defining SLOs in YAML that nobody reads provides no operational benefit. Fix: SLO definitions must be backed by alerting rules that fire before the budget is exhausted. Treat unenforced SLOs as unfinished work.
|
|
540
|
+
|
|
541
|
+
**Dashboard sprawl.** Each service creates its own dashboard with different conventions, different time windows, and different color schemes. Nobody uses them during incidents because they cannot find the right one. Fix: establish a single cross-service dashboard as the on-call starting point with links to per-service detail dashboards.
|
|
542
|
+
|
|
543
|
+
**High-cardinality span attributes.** Adding user IDs or request payloads as span attributes creates millions of unique label combinations that explode trace storage costs. Fix: restrict span attributes to known-cardinality fields (service names, status codes, HTTP methods, boolean flags). Put user IDs in log fields, not span attributes.
|
|
544
|
+
|
|
545
|
+
**Tracing gaps in async flows.** A trace starts when an HTTP request arrives and ends when the response is sent. If that request enqueues a job that processes 30 minutes later, the trace does not capture the job processing. Fix: propagate the trace context in job metadata and start a new span in the worker that references the original trace through a span link (the OpenTelemetry counterpart of OpenTracing's `FOLLOWS_FROM` reference).
|