@wazir-dev/cli 1.0.0
- package/AGENTS.md +111 -0
- package/CHANGELOG.md +14 -0
- package/CONTRIBUTING.md +101 -0
- package/LICENSE +21 -0
- package/README.md +314 -0
- package/assets/composition-engine.mmd +34 -0
- package/assets/demo-script.sh +17 -0
- package/assets/logo-dark.svg +14 -0
- package/assets/logo.svg +14 -0
- package/assets/pipeline.mmd +39 -0
- package/assets/record-demo.sh +51 -0
- package/docs/README.md +51 -0
- package/docs/adapters/context-mode.md +60 -0
- package/docs/concepts/architecture.md +87 -0
- package/docs/concepts/artifact-model.md +60 -0
- package/docs/concepts/composition-engine.md +36 -0
- package/docs/concepts/indexing-and-recall.md +160 -0
- package/docs/concepts/observability.md +41 -0
- package/docs/concepts/roles-and-workflows.md +59 -0
- package/docs/concepts/terminology-policy.md +27 -0
- package/docs/getting-started/01-installation.md +78 -0
- package/docs/getting-started/02-first-run.md +102 -0
- package/docs/getting-started/03-adding-to-project.md +15 -0
- package/docs/getting-started/04-host-setup.md +15 -0
- package/docs/guides/ci-integration.md +15 -0
- package/docs/guides/creating-skills.md +15 -0
- package/docs/guides/expertise-module-authoring.md +15 -0
- package/docs/guides/hook-development.md +15 -0
- package/docs/guides/memory-and-learnings.md +34 -0
- package/docs/guides/multi-host-export.md +15 -0
- package/docs/guides/troubleshooting.md +101 -0
- package/docs/guides/writing-custom-roles.md +15 -0
- package/docs/plans/2026-03-15-cli-pipeline-integration-design.md +592 -0
- package/docs/plans/2026-03-15-cli-pipeline-integration-plan.md +598 -0
- package/docs/plans/2026-03-15-docs-enforcement-plan.md +238 -0
- package/docs/readmes/INDEX.md +99 -0
- package/docs/readmes/features/expertise/README.md +171 -0
- package/docs/readmes/features/exports/README.md +222 -0
- package/docs/readmes/features/hooks/README.md +103 -0
- package/docs/readmes/features/hooks/loop-cap-guard.md +133 -0
- package/docs/readmes/features/hooks/post-tool-capture.md +121 -0
- package/docs/readmes/features/hooks/post-tool-lint.md +130 -0
- package/docs/readmes/features/hooks/pre-compact-summary.md +122 -0
- package/docs/readmes/features/hooks/pre-tool-capture-route.md +100 -0
- package/docs/readmes/features/hooks/protected-path-write-guard.md +128 -0
- package/docs/readmes/features/hooks/session-start.md +119 -0
- package/docs/readmes/features/hooks/stop-handoff-harvest.md +125 -0
- package/docs/readmes/features/roles/README.md +157 -0
- package/docs/readmes/features/roles/clarifier.md +152 -0
- package/docs/readmes/features/roles/content-author.md +190 -0
- package/docs/readmes/features/roles/designer.md +193 -0
- package/docs/readmes/features/roles/executor.md +184 -0
- package/docs/readmes/features/roles/learner.md +210 -0
- package/docs/readmes/features/roles/planner.md +182 -0
- package/docs/readmes/features/roles/researcher.md +164 -0
- package/docs/readmes/features/roles/reviewer.md +184 -0
- package/docs/readmes/features/roles/specifier.md +162 -0
- package/docs/readmes/features/roles/verifier.md +215 -0
- package/docs/readmes/features/schemas/README.md +178 -0
- package/docs/readmes/features/skills/README.md +63 -0
- package/docs/readmes/features/skills/brainstorming.md +96 -0
- package/docs/readmes/features/skills/debugging.md +148 -0
- package/docs/readmes/features/skills/design.md +120 -0
- package/docs/readmes/features/skills/prepare-next.md +109 -0
- package/docs/readmes/features/skills/run-audit.md +159 -0
- package/docs/readmes/features/skills/scan-project.md +109 -0
- package/docs/readmes/features/skills/self-audit.md +176 -0
- package/docs/readmes/features/skills/tdd.md +137 -0
- package/docs/readmes/features/skills/using-skills.md +92 -0
- package/docs/readmes/features/skills/verification.md +120 -0
- package/docs/readmes/features/skills/writing-plans.md +104 -0
- package/docs/readmes/features/tooling/README.md +320 -0
- package/docs/readmes/features/workflows/README.md +186 -0
- package/docs/readmes/features/workflows/author.md +181 -0
- package/docs/readmes/features/workflows/clarify.md +154 -0
- package/docs/readmes/features/workflows/design-review.md +171 -0
- package/docs/readmes/features/workflows/design.md +169 -0
- package/docs/readmes/features/workflows/discover.md +162 -0
- package/docs/readmes/features/workflows/execute.md +173 -0
- package/docs/readmes/features/workflows/learn.md +167 -0
- package/docs/readmes/features/workflows/plan-review.md +165 -0
- package/docs/readmes/features/workflows/plan.md +170 -0
- package/docs/readmes/features/workflows/prepare-next.md +167 -0
- package/docs/readmes/features/workflows/review.md +169 -0
- package/docs/readmes/features/workflows/run-audit.md +191 -0
- package/docs/readmes/features/workflows/spec-challenge.md +159 -0
- package/docs/readmes/features/workflows/specify.md +160 -0
- package/docs/readmes/features/workflows/verify.md +177 -0
- package/docs/readmes/packages/README.md +50 -0
- package/docs/readmes/packages/ajv.md +117 -0
- package/docs/readmes/packages/context-mode.md +118 -0
- package/docs/readmes/packages/gray-matter.md +116 -0
- package/docs/readmes/packages/node-test.md +137 -0
- package/docs/readmes/packages/yaml.md +112 -0
- package/docs/reference/configuration-reference.md +159 -0
- package/docs/reference/expertise-index.md +52 -0
- package/docs/reference/git-flow.md +43 -0
- package/docs/reference/hooks.md +87 -0
- package/docs/reference/host-exports.md +50 -0
- package/docs/reference/launch-checklist.md +172 -0
- package/docs/reference/marketplace-listings.md +76 -0
- package/docs/reference/release-process.md +34 -0
- package/docs/reference/roles-reference.md +77 -0
- package/docs/reference/skills.md +33 -0
- package/docs/reference/templates.md +29 -0
- package/docs/reference/tooling-cli.md +94 -0
- package/docs/truth-claims.yaml +222 -0
- package/expertise/PROGRESS.md +63 -0
- package/expertise/README.md +18 -0
- package/expertise/antipatterns/PROGRESS.md +56 -0
- package/expertise/antipatterns/backend/api-design-antipatterns.md +1271 -0
- package/expertise/antipatterns/backend/auth-antipatterns.md +1195 -0
- package/expertise/antipatterns/backend/caching-antipatterns.md +622 -0
- package/expertise/antipatterns/backend/database-antipatterns.md +1038 -0
- package/expertise/antipatterns/backend/index.md +24 -0
- package/expertise/antipatterns/backend/microservices-antipatterns.md +850 -0
- package/expertise/antipatterns/code/architecture-antipatterns.md +919 -0
- package/expertise/antipatterns/code/async-antipatterns.md +622 -0
- package/expertise/antipatterns/code/code-smells.md +1186 -0
- package/expertise/antipatterns/code/dependency-antipatterns.md +1209 -0
- package/expertise/antipatterns/code/error-handling-antipatterns.md +1360 -0
- package/expertise/antipatterns/code/index.md +27 -0
- package/expertise/antipatterns/code/naming-and-abstraction.md +1118 -0
- package/expertise/antipatterns/code/state-management-antipatterns.md +1076 -0
- package/expertise/antipatterns/code/testing-antipatterns.md +1053 -0
- package/expertise/antipatterns/design/accessibility-antipatterns.md +1136 -0
- package/expertise/antipatterns/design/dark-patterns.md +1121 -0
- package/expertise/antipatterns/design/index.md +22 -0
- package/expertise/antipatterns/design/ui-antipatterns.md +1202 -0
- package/expertise/antipatterns/design/ux-antipatterns.md +680 -0
- package/expertise/antipatterns/frontend/css-layout-antipatterns.md +691 -0
- package/expertise/antipatterns/frontend/flutter-antipatterns.md +1827 -0
- package/expertise/antipatterns/frontend/index.md +23 -0
- package/expertise/antipatterns/frontend/mobile-antipatterns.md +573 -0
- package/expertise/antipatterns/frontend/react-antipatterns.md +1128 -0
- package/expertise/antipatterns/frontend/spa-antipatterns.md +1235 -0
- package/expertise/antipatterns/index.md +31 -0
- package/expertise/antipatterns/performance/index.md +20 -0
- package/expertise/antipatterns/performance/performance-antipatterns.md +1013 -0
- package/expertise/antipatterns/performance/premature-optimization.md +623 -0
- package/expertise/antipatterns/performance/scaling-antipatterns.md +785 -0
- package/expertise/antipatterns/process/ai-coding-antipatterns.md +853 -0
- package/expertise/antipatterns/process/code-review-antipatterns.md +656 -0
- package/expertise/antipatterns/process/deployment-antipatterns.md +920 -0
- package/expertise/antipatterns/process/index.md +23 -0
- package/expertise/antipatterns/process/technical-debt-antipatterns.md +647 -0
- package/expertise/antipatterns/security/index.md +20 -0
- package/expertise/antipatterns/security/secrets-antipatterns.md +849 -0
- package/expertise/antipatterns/security/security-theater.md +843 -0
- package/expertise/antipatterns/security/vulnerability-patterns.md +801 -0
- package/expertise/architecture/PROGRESS.md +70 -0
- package/expertise/architecture/data/caching-architecture.md +671 -0
- package/expertise/architecture/data/data-consistency.md +574 -0
- package/expertise/architecture/data/data-modeling.md +536 -0
- package/expertise/architecture/data/event-streams-and-queues.md +634 -0
- package/expertise/architecture/data/index.md +25 -0
- package/expertise/architecture/data/search-architecture.md +663 -0
- package/expertise/architecture/data/sql-vs-nosql.md +708 -0
- package/expertise/architecture/decisions/architecture-decision-records.md +640 -0
- package/expertise/architecture/decisions/build-vs-buy.md +616 -0
- package/expertise/architecture/decisions/index.md +23 -0
- package/expertise/architecture/decisions/monolith-to-microservices.md +790 -0
- package/expertise/architecture/decisions/technology-selection.md +616 -0
- package/expertise/architecture/distributed/cap-theorem-and-tradeoffs.md +800 -0
- package/expertise/architecture/distributed/circuit-breaker-bulkhead.md +741 -0
- package/expertise/architecture/distributed/consensus-and-coordination.md +796 -0
- package/expertise/architecture/distributed/distributed-systems-fundamentals.md +564 -0
- package/expertise/architecture/distributed/idempotency-and-retry.md +796 -0
- package/expertise/architecture/distributed/index.md +25 -0
- package/expertise/architecture/distributed/saga-pattern.md +797 -0
- package/expertise/architecture/foundations/architectural-thinking.md +460 -0
- package/expertise/architecture/foundations/coupling-and-cohesion.md +770 -0
- package/expertise/architecture/foundations/design-principles-solid.md +649 -0
- package/expertise/architecture/foundations/domain-driven-design.md +719 -0
- package/expertise/architecture/foundations/index.md +25 -0
- package/expertise/architecture/foundations/separation-of-concerns.md +472 -0
- package/expertise/architecture/foundations/twelve-factor-app.md +797 -0
- package/expertise/architecture/index.md +34 -0
- package/expertise/architecture/integration/api-design-graphql.md +638 -0
- package/expertise/architecture/integration/api-design-grpc.md +804 -0
- package/expertise/architecture/integration/api-design-rest.md +892 -0
- package/expertise/architecture/integration/index.md +25 -0
- package/expertise/architecture/integration/third-party-integration.md +795 -0
- package/expertise/architecture/integration/webhooks-and-callbacks.md +1152 -0
- package/expertise/architecture/integration/websockets-realtime.md +791 -0
- package/expertise/architecture/mobile-architecture/index.md +22 -0
- package/expertise/architecture/mobile-architecture/mobile-app-architecture.md +780 -0
- package/expertise/architecture/mobile-architecture/mobile-backend-for-frontend.md +670 -0
- package/expertise/architecture/mobile-architecture/offline-first.md +719 -0
- package/expertise/architecture/mobile-architecture/push-and-sync.md +782 -0
- package/expertise/architecture/patterns/cqrs-event-sourcing.md +717 -0
- package/expertise/architecture/patterns/event-driven.md +797 -0
- package/expertise/architecture/patterns/hexagonal-clean-architecture.md +870 -0
- package/expertise/architecture/patterns/index.md +27 -0
- package/expertise/architecture/patterns/layered-architecture.md +736 -0
- package/expertise/architecture/patterns/microservices.md +753 -0
- package/expertise/architecture/patterns/modular-monolith.md +692 -0
- package/expertise/architecture/patterns/monolith.md +626 -0
- package/expertise/architecture/patterns/plugin-architecture.md +735 -0
- package/expertise/architecture/patterns/serverless.md +780 -0
- package/expertise/architecture/scaling/database-scaling.md +615 -0
- package/expertise/architecture/scaling/feature-flags-and-rollouts.md +757 -0
- package/expertise/architecture/scaling/horizontal-vs-vertical.md +606 -0
- package/expertise/architecture/scaling/index.md +24 -0
- package/expertise/architecture/scaling/multi-tenancy.md +800 -0
- package/expertise/architecture/scaling/stateless-design.md +787 -0
- package/expertise/backend/embedded-firmware.md +625 -0
- package/expertise/backend/go.md +853 -0
- package/expertise/backend/index.md +24 -0
- package/expertise/backend/java-spring.md +448 -0
- package/expertise/backend/node-typescript.md +625 -0
- package/expertise/backend/python-fastapi.md +724 -0
- package/expertise/backend/rust.md +458 -0
- package/expertise/backend/solidity.md +711 -0
- package/expertise/composition-map.yaml +443 -0
- package/expertise/content/foundations/content-modeling.md +395 -0
- package/expertise/content/foundations/editorial-standards.md +449 -0
- package/expertise/content/foundations/index.md +24 -0
- package/expertise/content/foundations/microcopy.md +455 -0
- package/expertise/content/foundations/terminology-governance.md +509 -0
- package/expertise/content/index.md +34 -0
- package/expertise/content/patterns/accessibility-copy.md +518 -0
- package/expertise/content/patterns/index.md +24 -0
- package/expertise/content/patterns/notification-content.md +433 -0
- package/expertise/content/patterns/sample-content.md +486 -0
- package/expertise/content/patterns/state-copy.md +439 -0
- package/expertise/design/PROGRESS.md +58 -0
- package/expertise/design/disciplines/dark-mode-theming.md +577 -0
- package/expertise/design/disciplines/design-systems.md +595 -0
- package/expertise/design/disciplines/index.md +25 -0
- package/expertise/design/disciplines/information-architecture.md +800 -0
- package/expertise/design/disciplines/interaction-design.md +788 -0
- package/expertise/design/disciplines/responsive-design.md +552 -0
- package/expertise/design/disciplines/usability-testing.md +516 -0
- package/expertise/design/disciplines/user-research.md +792 -0
- package/expertise/design/foundations/accessibility-design.md +796 -0
- package/expertise/design/foundations/color-theory.md +797 -0
- package/expertise/design/foundations/iconography.md +795 -0
- package/expertise/design/foundations/index.md +26 -0
- package/expertise/design/foundations/motion-and-animation.md +653 -0
- package/expertise/design/foundations/rtl-design.md +585 -0
- package/expertise/design/foundations/spacing-and-layout.md +607 -0
- package/expertise/design/foundations/typography.md +800 -0
- package/expertise/design/foundations/visual-hierarchy.md +761 -0
- package/expertise/design/index.md +32 -0
- package/expertise/design/patterns/authentication-flows.md +474 -0
- package/expertise/design/patterns/content-consumption.md +789 -0
- package/expertise/design/patterns/data-display.md +618 -0
- package/expertise/design/patterns/e-commerce.md +1494 -0
- package/expertise/design/patterns/feedback-and-states.md +642 -0
- package/expertise/design/patterns/forms-and-input.md +819 -0
- package/expertise/design/patterns/gamification.md +801 -0
- package/expertise/design/patterns/index.md +31 -0
- package/expertise/design/patterns/microinteractions.md +449 -0
- package/expertise/design/patterns/navigation.md +800 -0
- package/expertise/design/patterns/notifications.md +705 -0
- package/expertise/design/patterns/onboarding.md +700 -0
- package/expertise/design/patterns/search-and-filter.md +601 -0
- package/expertise/design/patterns/settings-and-preferences.md +768 -0
- package/expertise/design/patterns/social-and-community.md +748 -0
- package/expertise/design/platforms/desktop-native.md +612 -0
- package/expertise/design/platforms/index.md +25 -0
- package/expertise/design/platforms/mobile-android.md +825 -0
- package/expertise/design/platforms/mobile-cross-platform.md +983 -0
- package/expertise/design/platforms/mobile-ios.md +699 -0
- package/expertise/design/platforms/tablet.md +794 -0
- package/expertise/design/platforms/web-dashboard.md +790 -0
- package/expertise/design/platforms/web-responsive.md +550 -0
- package/expertise/design/psychology/behavioral-nudges.md +449 -0
- package/expertise/design/psychology/cognitive-load.md +1191 -0
- package/expertise/design/psychology/error-psychology.md +778 -0
- package/expertise/design/psychology/index.md +22 -0
- package/expertise/design/psychology/persuasive-design.md +736 -0
- package/expertise/design/psychology/user-mental-models.md +623 -0
- package/expertise/design/tooling/open-pencil.md +266 -0
- package/expertise/frontend/angular.md +1073 -0
- package/expertise/frontend/desktop-electron.md +546 -0
- package/expertise/frontend/flutter.md +782 -0
- package/expertise/frontend/index.md +27 -0
- package/expertise/frontend/native-android.md +409 -0
- package/expertise/frontend/native-ios.md +490 -0
- package/expertise/frontend/react-native.md +1160 -0
- package/expertise/frontend/react.md +808 -0
- package/expertise/frontend/vue.md +1089 -0
- package/expertise/humanize/domain-rules-code.md +79 -0
- package/expertise/humanize/domain-rules-content.md +67 -0
- package/expertise/humanize/domain-rules-technical-docs.md +56 -0
- package/expertise/humanize/index.md +35 -0
- package/expertise/humanize/self-audit-checklist.md +87 -0
- package/expertise/humanize/sentence-patterns.md +218 -0
- package/expertise/humanize/vocabulary-blacklist.md +105 -0
- package/expertise/i18n/PROGRESS.md +65 -0
- package/expertise/i18n/advanced/accessibility-and-i18n.md +28 -0
- package/expertise/i18n/advanced/bidirectional-text-algorithm.md +38 -0
- package/expertise/i18n/advanced/complex-scripts.md +30 -0
- package/expertise/i18n/advanced/performance-and-i18n.md +27 -0
- package/expertise/i18n/advanced/testing-i18n.md +28 -0
- package/expertise/i18n/content/content-adaptation.md +23 -0
- package/expertise/i18n/content/locale-specific-formatting.md +23 -0
- package/expertise/i18n/content/machine-translation-integration.md +28 -0
- package/expertise/i18n/content/translation-management.md +29 -0
- package/expertise/i18n/foundations/date-time-calendars.md +67 -0
- package/expertise/i18n/foundations/i18n-architecture.md +272 -0
- package/expertise/i18n/foundations/locale-and-language-tags.md +79 -0
- package/expertise/i18n/foundations/numbers-currency-units.md +61 -0
- package/expertise/i18n/foundations/pluralization-and-gender.md +109 -0
- package/expertise/i18n/foundations/string-externalization.md +236 -0
- package/expertise/i18n/foundations/text-direction-bidi.md +241 -0
- package/expertise/i18n/foundations/unicode-and-encoding.md +86 -0
- package/expertise/i18n/index.md +38 -0
- package/expertise/i18n/platform/backend-i18n.md +31 -0
- package/expertise/i18n/platform/flutter-i18n.md +148 -0
- package/expertise/i18n/platform/native-android-i18n.md +36 -0
- package/expertise/i18n/platform/native-ios-i18n.md +36 -0
- package/expertise/i18n/platform/react-i18n.md +103 -0
- package/expertise/i18n/platform/web-css-i18n.md +81 -0
- package/expertise/i18n/rtl/arabic-specific.md +175 -0
- package/expertise/i18n/rtl/hebrew-specific.md +149 -0
- package/expertise/i18n/rtl/rtl-animations-and-transitions.md +111 -0
- package/expertise/i18n/rtl/rtl-forms-and-input.md +161 -0
- package/expertise/i18n/rtl/rtl-fundamentals.md +211 -0
- package/expertise/i18n/rtl/rtl-icons-and-images.md +181 -0
- package/expertise/i18n/rtl/rtl-layout-mirroring.md +252 -0
- package/expertise/i18n/rtl/rtl-navigation-and-gestures.md +107 -0
- package/expertise/i18n/rtl/rtl-testing-and-qa.md +147 -0
- package/expertise/i18n/rtl/rtl-typography.md +160 -0
- package/expertise/index.md +113 -0
- package/expertise/index.yaml +216 -0
- package/expertise/infrastructure/cloud-aws.md +597 -0
- package/expertise/infrastructure/cloud-gcp.md +599 -0
- package/expertise/infrastructure/cybersecurity.md +816 -0
- package/expertise/infrastructure/database-mongodb.md +447 -0
- package/expertise/infrastructure/database-postgres.md +400 -0
- package/expertise/infrastructure/devops-cicd.md +787 -0
- package/expertise/infrastructure/index.md +27 -0
- package/expertise/performance/PROGRESS.md +50 -0
- package/expertise/performance/backend/api-latency.md +1204 -0
- package/expertise/performance/backend/background-jobs.md +506 -0
- package/expertise/performance/backend/connection-pooling.md +1209 -0
- package/expertise/performance/backend/database-query-optimization.md +515 -0
- package/expertise/performance/backend/index.md +23 -0
- package/expertise/performance/backend/rate-limiting-and-throttling.md +971 -0
- package/expertise/performance/foundations/algorithmic-complexity.md +954 -0
- package/expertise/performance/foundations/caching-strategies.md +489 -0
- package/expertise/performance/foundations/concurrency-and-parallelism.md +847 -0
- package/expertise/performance/foundations/index.md +24 -0
- package/expertise/performance/foundations/measuring-and-profiling.md +440 -0
- package/expertise/performance/foundations/memory-management.md +964 -0
- package/expertise/performance/foundations/performance-budgets.md +1314 -0
- package/expertise/performance/index.md +31 -0
- package/expertise/performance/infrastructure/auto-scaling.md +1059 -0
- package/expertise/performance/infrastructure/cdn-and-edge.md +1081 -0
- package/expertise/performance/infrastructure/index.md +22 -0
- package/expertise/performance/infrastructure/load-balancing.md +1081 -0
- package/expertise/performance/infrastructure/observability.md +1079 -0
- package/expertise/performance/mobile/index.md +23 -0
- package/expertise/performance/mobile/mobile-animations.md +544 -0
- package/expertise/performance/mobile/mobile-memory-battery.md +416 -0
- package/expertise/performance/mobile/mobile-network.md +452 -0
- package/expertise/performance/mobile/mobile-rendering.md +599 -0
- package/expertise/performance/mobile/mobile-startup-time.md +505 -0
- package/expertise/performance/platform-specific/flutter-performance.md +647 -0
- package/expertise/performance/platform-specific/index.md +22 -0
- package/expertise/performance/platform-specific/node-performance.md +1307 -0
- package/expertise/performance/platform-specific/postgres-performance.md +1366 -0
- package/expertise/performance/platform-specific/react-performance.md +1403 -0
- package/expertise/performance/web/bundle-optimization.md +1239 -0
- package/expertise/performance/web/image-and-media.md +636 -0
- package/expertise/performance/web/index.md +24 -0
- package/expertise/performance/web/network-optimization.md +1133 -0
- package/expertise/performance/web/rendering-performance.md +1098 -0
- package/expertise/performance/web/ssr-and-hydration.md +918 -0
- package/expertise/performance/web/web-vitals.md +1374 -0
- package/expertise/quality/accessibility.md +985 -0
- package/expertise/quality/evidence-based-verification.md +499 -0
- package/expertise/quality/index.md +24 -0
- package/expertise/quality/ml-model-audit.md +614 -0
- package/expertise/quality/performance.md +600 -0
- package/expertise/quality/testing-api.md +891 -0
- package/expertise/quality/testing-mobile.md +496 -0
- package/expertise/quality/testing-web.md +849 -0
- package/expertise/security/PROGRESS.md +54 -0
- package/expertise/security/agentic-identity.md +540 -0
- package/expertise/security/compliance-frameworks.md +601 -0
- package/expertise/security/data/data-encryption.md +364 -0
- package/expertise/security/data/data-privacy-gdpr.md +692 -0
- package/expertise/security/data/database-security.md +1171 -0
- package/expertise/security/data/index.md +22 -0
- package/expertise/security/data/pii-handling.md +531 -0
- package/expertise/security/foundations/authentication.md +1041 -0
- package/expertise/security/foundations/authorization.md +603 -0
- package/expertise/security/foundations/cryptography.md +1001 -0
- package/expertise/security/foundations/index.md +25 -0
- package/expertise/security/foundations/owasp-top-10.md +1354 -0
- package/expertise/security/foundations/secrets-management.md +1217 -0
- package/expertise/security/foundations/secure-sdlc.md +700 -0
- package/expertise/security/foundations/supply-chain-security.md +698 -0
- package/expertise/security/index.md +31 -0
- package/expertise/security/infrastructure/cloud-security-aws.md +1296 -0
- package/expertise/security/infrastructure/cloud-security-gcp.md +1376 -0
- package/expertise/security/infrastructure/container-security.md +721 -0
- package/expertise/security/infrastructure/incident-response.md +1295 -0
- package/expertise/security/infrastructure/index.md +24 -0
- package/expertise/security/infrastructure/logging-and-monitoring.md +1618 -0
- package/expertise/security/infrastructure/network-security.md +1337 -0
- package/expertise/security/mobile/index.md +23 -0
- package/expertise/security/mobile/mobile-android-security.md +1218 -0
- package/expertise/security/mobile/mobile-binary-protection.md +1229 -0
- package/expertise/security/mobile/mobile-data-storage.md +1265 -0
- package/expertise/security/mobile/mobile-ios-security.md +1401 -0
- package/expertise/security/mobile/mobile-network-security.md +1520 -0
- package/expertise/security/smart-contract-security.md +594 -0
- package/expertise/security/testing/index.md +22 -0
- package/expertise/security/testing/penetration-testing.md +1258 -0
- package/expertise/security/testing/security-code-review.md +1765 -0
- package/expertise/security/testing/threat-modeling.md +1074 -0
- package/expertise/security/testing/vulnerability-scanning.md +1062 -0
- package/expertise/security/web/api-security.md +586 -0
- package/expertise/security/web/cors-and-headers.md +433 -0
- package/expertise/security/web/csrf.md +562 -0
- package/expertise/security/web/file-upload.md +1477 -0
- package/expertise/security/web/index.md +25 -0
- package/expertise/security/web/injection.md +1375 -0
- package/expertise/security/web/session-management.md +1101 -0
- package/expertise/security/web/xss.md +1158 -0
- package/exports/README.md +17 -0
- package/exports/hosts/claude/.claude/agents/clarifier.md +42 -0
- package/exports/hosts/claude/.claude/agents/content-author.md +63 -0
- package/exports/hosts/claude/.claude/agents/designer.md +55 -0
- package/exports/hosts/claude/.claude/agents/executor.md +55 -0
- package/exports/hosts/claude/.claude/agents/learner.md +51 -0
- package/exports/hosts/claude/.claude/agents/planner.md +53 -0
- package/exports/hosts/claude/.claude/agents/researcher.md +43 -0
- package/exports/hosts/claude/.claude/agents/reviewer.md +54 -0
- package/exports/hosts/claude/.claude/agents/specifier.md +47 -0
- package/exports/hosts/claude/.claude/agents/verifier.md +71 -0
- package/exports/hosts/claude/.claude/commands/author.md +42 -0
- package/exports/hosts/claude/.claude/commands/clarify.md +38 -0
- package/exports/hosts/claude/.claude/commands/design-review.md +46 -0
- package/exports/hosts/claude/.claude/commands/design.md +44 -0
- package/exports/hosts/claude/.claude/commands/discover.md +37 -0
- package/exports/hosts/claude/.claude/commands/execute.md +48 -0
- package/exports/hosts/claude/.claude/commands/learn.md +38 -0
- package/exports/hosts/claude/.claude/commands/plan-review.md +42 -0
- package/exports/hosts/claude/.claude/commands/plan.md +39 -0
- package/exports/hosts/claude/.claude/commands/prepare-next.md +37 -0
- package/exports/hosts/claude/.claude/commands/review.md +40 -0
- package/exports/hosts/claude/.claude/commands/run-audit.md +41 -0
- package/exports/hosts/claude/.claude/commands/spec-challenge.md +41 -0
- package/exports/hosts/claude/.claude/commands/specify.md +38 -0
- package/exports/hosts/claude/.claude/commands/verify.md +37 -0
- package/exports/hosts/claude/.claude/settings.json +34 -0
- package/exports/hosts/claude/CLAUDE.md +19 -0
- package/exports/hosts/claude/export.manifest.json +38 -0
- package/exports/hosts/claude/host-package.json +67 -0
- package/exports/hosts/codex/AGENTS.md +19 -0
- package/exports/hosts/codex/export.manifest.json +38 -0
- package/exports/hosts/codex/host-package.json +41 -0
- package/exports/hosts/cursor/.cursor/hooks.json +16 -0
- package/exports/hosts/cursor/.cursor/rules/wazir-core.mdc +19 -0
- package/exports/hosts/cursor/export.manifest.json +38 -0
- package/exports/hosts/cursor/host-package.json +42 -0
- package/exports/hosts/gemini/GEMINI.md +19 -0
- package/exports/hosts/gemini/export.manifest.json +38 -0
- package/exports/hosts/gemini/host-package.json +41 -0
- package/hooks/README.md +18 -0
- package/hooks/definitions/loop_cap_guard.yaml +21 -0
- package/hooks/definitions/post_tool_capture.yaml +24 -0
- package/hooks/definitions/pre_compact_summary.yaml +19 -0
- package/hooks/definitions/pre_tool_capture_route.yaml +19 -0
- package/hooks/definitions/protected_path_write_guard.yaml +19 -0
- package/hooks/definitions/session_start.yaml +19 -0
- package/hooks/definitions/stop_handoff_harvest.yaml +20 -0
- package/hooks/loop-cap-guard +17 -0
- package/hooks/post-tool-lint +36 -0
- package/hooks/protected-path-write-guard +17 -0
- package/hooks/session-start +41 -0
- package/llms-full.txt +2355 -0
- package/llms.txt +43 -0
- package/package.json +79 -0
- package/roles/README.md +20 -0
- package/roles/clarifier.md +42 -0
- package/roles/content-author.md +63 -0
- package/roles/designer.md +55 -0
- package/roles/executor.md +55 -0
- package/roles/learner.md +51 -0
- package/roles/planner.md +53 -0
- package/roles/researcher.md +43 -0
- package/roles/reviewer.md +54 -0
- package/roles/specifier.md +47 -0
- package/roles/verifier.md +71 -0
- package/schemas/README.md +24 -0
- package/schemas/accepted-learning.schema.json +20 -0
- package/schemas/author-artifact.schema.json +156 -0
- package/schemas/clarification.schema.json +19 -0
- package/schemas/design-artifact.schema.json +80 -0
- package/schemas/docs-claim.schema.json +18 -0
- package/schemas/export-manifest.schema.json +20 -0
- package/schemas/hook.schema.json +67 -0
- package/schemas/host-export-package.schema.json +18 -0
- package/schemas/implementation-plan.schema.json +19 -0
- package/schemas/proposed-learning.schema.json +19 -0
- package/schemas/research.schema.json +18 -0
- package/schemas/review.schema.json +29 -0
- package/schemas/run-manifest.schema.json +18 -0
- package/schemas/spec-challenge.schema.json +18 -0
- package/schemas/spec.schema.json +20 -0
- package/schemas/usage.schema.json +102 -0
- package/schemas/verification-proof.schema.json +29 -0
- package/schemas/wazir-manifest.schema.json +173 -0
- package/skills/README.md +40 -0
- package/skills/brainstorming/SKILL.md +77 -0
- package/skills/debugging/SKILL.md +50 -0
- package/skills/design/SKILL.md +61 -0
- package/skills/dispatching-parallel-agents/SKILL.md +128 -0
- package/skills/executing-plans/SKILL.md +70 -0
- package/skills/finishing-a-development-branch/SKILL.md +169 -0
- package/skills/humanize/SKILL.md +123 -0
- package/skills/init-pipeline/SKILL.md +124 -0
- package/skills/prepare-next/SKILL.md +20 -0
- package/skills/receiving-code-review/SKILL.md +123 -0
- package/skills/requesting-code-review/SKILL.md +105 -0
- package/skills/requesting-code-review/code-reviewer.md +108 -0
- package/skills/run-audit/SKILL.md +197 -0
- package/skills/scan-project/SKILL.md +41 -0
- package/skills/self-audit/SKILL.md +153 -0
- package/skills/subagent-driven-development/SKILL.md +154 -0
- package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +26 -0
- package/skills/subagent-driven-development/implementer-prompt.md +102 -0
- package/skills/subagent-driven-development/spec-reviewer-prompt.md +61 -0
- package/skills/tdd/SKILL.md +23 -0
- package/skills/using-git-worktrees/SKILL.md +163 -0
- package/skills/using-skills/SKILL.md +95 -0
- package/skills/verification/SKILL.md +22 -0
- package/skills/wazir/SKILL.md +463 -0
- package/skills/writing-plans/SKILL.md +30 -0
- package/skills/writing-skills/SKILL.md +157 -0
- package/skills/writing-skills/anthropic-best-practices.md +122 -0
- package/skills/writing-skills/persuasion-principles.md +50 -0
- package/templates/README.md +20 -0
- package/templates/artifacts/README.md +10 -0
- package/templates/artifacts/accepted-learning.md +19 -0
- package/templates/artifacts/accepted-learning.template.json +12 -0
- package/templates/artifacts/author.md +74 -0
- package/templates/artifacts/author.template.json +19 -0
- package/templates/artifacts/clarification.md +21 -0
- package/templates/artifacts/clarification.template.json +12 -0
- package/templates/artifacts/execute-notes.md +19 -0
- package/templates/artifacts/implementation-plan.md +21 -0
- package/templates/artifacts/implementation-plan.template.json +11 -0
- package/templates/artifacts/learning-proposal.md +19 -0
- package/templates/artifacts/next-run-handoff.md +21 -0
- package/templates/artifacts/plan-review.md +19 -0
- package/templates/artifacts/proposed-learning.template.json +12 -0
- package/templates/artifacts/research.md +21 -0
- package/templates/artifacts/research.template.json +12 -0
- package/templates/artifacts/review-findings.md +19 -0
- package/templates/artifacts/review.template.json +11 -0
- package/templates/artifacts/run-manifest.template.json +8 -0
- package/templates/artifacts/spec-challenge.md +19 -0
- package/templates/artifacts/spec-challenge.template.json +11 -0
- package/templates/artifacts/spec.md +21 -0
- package/templates/artifacts/spec.template.json +12 -0
- package/templates/artifacts/verification-proof.md +19 -0
- package/templates/artifacts/verification-proof.template.json +11 -0
- package/templates/examples/accepted-learning.example.json +14 -0
- package/templates/examples/author.example.json +152 -0
- package/templates/examples/clarification.example.json +15 -0
- package/templates/examples/docs-claim.example.json +8 -0
- package/templates/examples/export-manifest.example.json +7 -0
- package/templates/examples/host-export-package.example.json +11 -0
- package/templates/examples/implementation-plan.example.json +17 -0
- package/templates/examples/proposed-learning.example.json +13 -0
- package/templates/examples/research.example.json +15 -0
- package/templates/examples/research.example.md +6 -0
- package/templates/examples/review.example.json +17 -0
- package/templates/examples/run-manifest.example.json +9 -0
- package/templates/examples/spec-challenge.example.json +14 -0
- package/templates/examples/spec.example.json +21 -0
- package/templates/examples/verification-proof.example.json +21 -0
- package/templates/examples/wazir-manifest.example.yaml +65 -0
- package/templates/task-definition-schema.md +99 -0
- package/tooling/README.md +20 -0
- package/tooling/src/adapters/context-mode.js +50 -0
- package/tooling/src/capture/command.js +376 -0
- package/tooling/src/capture/store.js +99 -0
- package/tooling/src/capture/usage.js +270 -0
- package/tooling/src/checks/branches.js +50 -0
- package/tooling/src/checks/brand-truth.js +110 -0
- package/tooling/src/checks/changelog.js +231 -0
- package/tooling/src/checks/command-registry.js +36 -0
- package/tooling/src/checks/commits.js +102 -0
- package/tooling/src/checks/docs-drift.js +103 -0
- package/tooling/src/checks/docs-truth.js +201 -0
- package/tooling/src/checks/runtime-surface.js +156 -0
- package/tooling/src/cli.js +116 -0
- package/tooling/src/command-options.js +56 -0
- package/tooling/src/commands/validate.js +320 -0
- package/tooling/src/doctor/command.js +91 -0
- package/tooling/src/export/command.js +77 -0
- package/tooling/src/export/compiler.js +498 -0
- package/tooling/src/guards/loop-cap-guard.js +52 -0
- package/tooling/src/guards/protected-path-write-guard.js +67 -0
- package/tooling/src/index/command.js +152 -0
- package/tooling/src/index/storage.js +1061 -0
- package/tooling/src/index/summarizers.js +261 -0
- package/tooling/src/loaders.js +18 -0
- package/tooling/src/project-root.js +22 -0
- package/tooling/src/recall/command.js +225 -0
- package/tooling/src/schema-validator.js +30 -0
- package/tooling/src/state-root.js +40 -0
- package/tooling/src/status/command.js +71 -0
- package/wazir.manifest.yaml +135 -0
- package/workflows/README.md +19 -0
- package/workflows/author.md +42 -0
- package/workflows/clarify.md +38 -0
- package/workflows/design-review.md +46 -0
- package/workflows/design.md +44 -0
- package/workflows/discover.md +37 -0
- package/workflows/execute.md +48 -0
- package/workflows/learn.md +38 -0
- package/workflows/plan-review.md +42 -0
- package/workflows/plan.md +39 -0
- package/workflows/prepare-next.md +37 -0
- package/workflows/review.md +40 -0
- package/workflows/run-audit.md +41 -0
- package/workflows/spec-challenge.md +41 -0
- package/workflows/specify.md +38 -0
- package/workflows/verify.md +37 -0
|
@@ -0,0 +1,1079 @@
|
|
|
1
|
+
# Observability for Performance Engineering
|
|
2
|
+
|
|
3
|
+
> **Expertise Module** | Domain: Performance / Infrastructure
|
|
4
|
+
> Last updated: 2026-03-08
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## Table of Contents
|
|
9
|
+
|
|
10
|
+
1. [Overview](#overview)
|
|
11
|
+
2. [The Three Pillars: Metrics, Logs, Traces](#the-three-pillars-metrics-logs-traces)
|
|
12
|
+
3. [When to Use Each Pillar for Performance](#when-to-use-each-pillar-for-performance)
|
|
13
|
+
4. [OpenTelemetry](#opentelemetry)
|
|
14
|
+
5. [Distributed Tracing Systems](#distributed-tracing-systems)
|
|
15
|
+
6. [Metrics Systems: Prometheus and Datadog](#metrics-systems-prometheus-and-datadog)
|
|
16
|
+
7. [RED and USE Methods](#red-and-use-methods)
|
|
17
|
+
8. [SLOs, SLIs, and Error Budgets](#slos-slis-and-error-budgets)
|
|
18
|
+
9. [Sampling Strategies](#sampling-strategies)
|
|
19
|
+
10. [Alerting on Performance](#alerting-on-performance)
|
|
20
|
+
11. [Cost Management and Optimization](#cost-management-and-optimization)
|
|
21
|
+
12. [Common Bottlenecks](#common-bottlenecks)
|
|
22
|
+
13. [Anti-Patterns](#anti-patterns)
|
|
23
|
+
14. [Before/After: Observability-Driven Performance Fixes](#beforeafter-observability-driven-performance-fixes)
|
|
24
|
+
15. [Decision Tree: What Should I Monitor?](#decision-tree-what-should-i-monitor)
|
|
25
|
+
16. [Quick Reference](#quick-reference)
|
|
26
|
+
17. [Sources](#sources)
|
|
27
|
+
|
|
28
|
+
---
|
|
29
|
+
|
|
30
|
+
## Overview
|
|
31
|
+
|
|
32
|
+
Observability is the ability to understand a system's internal state from its external outputs.
|
|
33
|
+
For performance engineering, observability answers: "Why is my system slow, and where?"
|
|
34
|
+
|
|
35
|
+
**Key industry numbers:**
|
|
36
|
+
|
|
37
|
+
- The global observability market reached $28.5 billion in 2025 (Gartner/Research Nester).
|
|
38
|
+
- 15-25% of infrastructure budgets are allocated to observability (Gartner).
|
|
39
|
+
- Over 50% of observability spend goes to logs alone (ClickHouse TCO Report).
|
|
40
|
+
- 97% of organizations have experienced unexpected observability cost surprises (Grepr AI, 2026).
|
|
41
|
+
- 36% of enterprise clients spend over $1M/year on observability; 4% exceed $10M (Gartner).
|
|
42
|
+
|
|
43
|
+
Observability is not monitoring. Monitoring tells you *what* is broken. Observability tells you *why*.
|
|
44
|
+
|
|
45
|
+
---
|
|
46
|
+
|
|
47
|
+
## The Three Pillars: Metrics, Logs, Traces
|
|
48
|
+
|
|
49
|
+
### Metrics
|
|
50
|
+
|
|
51
|
+
Numeric measurements of system behavior over time. Stored as time series.
|
|
52
|
+
|
|
53
|
+
| Property | Detail |
|
|
54
|
+
|------------------|-----------------------------------------------------------|
|
|
55
|
+
| **Data type** | Counters, gauges, histograms, summaries |
|
|
56
|
+
| **Storage cost** | Low (~1-2 bytes per sample after compression in Prometheus TSDB) |
|
|
57
|
+
| **Query speed** | Fast (pre-aggregated, indexed by label) |
|
|
58
|
+
| **Best for** | Dashboards, alerting, trend analysis, capacity planning |
|
|
59
|
+
| **Cardinality** | Must be bounded (unbounded labels destroy performance) |
|
|
60
|
+
|
|
61
|
+
Typical performance metrics: request rate, error rate, p50/p95/p99 latency, CPU utilization,
|
|
62
|
+
memory usage, queue depth, connection pool saturation.
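
To make the cardinality row concrete, the sketch below estimates the worst-case number of time series one metric can produce as the product of its label cardinalities. The label names and counts are hypothetical and chosen only for illustration.

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())

# Hypothetical label sets for an HTTP latency metric.
bounded = {"service": 20, "endpoint": 50, "status": 5}
unbounded = {**bounded, "user_id": 100_000}  # a per-user label

print(series_count(bounded))    # 5000
print(series_count(unbounded))  # 500000000
```

A single unbounded label turns a few thousand series into hundreds of millions, which is why labels such as user IDs or request IDs belong in traces or logs rather than in metrics.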
|
|
63
|
+
|
|
64
|
+
### Logs
|
|
65
|
+
|
|
66
|
+
Discrete events with structured or unstructured text.
|
|
67
|
+
|
|
68
|
+
| Property | Detail |
|
|
69
|
+
|------------------|-----------------------------------------------------------|
|
|
70
|
+
| **Data type** | Text records with timestamps and metadata |
|
|
71
|
+
| **Storage cost** | High (can be 10-100x more than metrics at scale) |
|
|
72
|
+
| **Query speed** | Slow without indexing; fast with structured/indexed logs |
|
|
73
|
+
| **Best for** | Debugging, audit trails, error details, forensic analysis |
|
|
74
|
+
| **Volume risk** | Easily grows to TB/day in production systems |
|
|
75
|
+
|
|
76
|
+
Performance-relevant log patterns: slow query logs (>100ms), garbage collection pauses,
|
|
77
|
+
connection timeouts, circuit breaker state changes, retry exhaustion events.
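
As one way to produce the slow query pattern above, this sketch wraps a query in a timer and emits a structured JSON log record only when the call exceeds 100 ms. The logger name, field names, and the `timed_query` helper are assumptions made for this example, not part of any particular database driver.

```python
import json
import logging
import time
from contextlib import contextmanager

SLOW_QUERY_THRESHOLD_MS = 100  # matches the >100ms pattern above

logger = logging.getLogger("db.slowlog")  # hypothetical logger name

@contextmanager
def timed_query(sql: str, threshold_ms: float = SLOW_QUERY_THRESHOLD_MS):
    """Time the wrapped block and log a JSON record if it exceeds the threshold."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms > threshold_ms:
            logger.warning(json.dumps({
                "event": "slow_query",
                "duration_ms": round(elapsed_ms, 1),
                "threshold_ms": threshold_ms,
                "sql": sql,
            }))

logging.basicConfig(level=logging.WARNING)
with timed_query("SELECT * FROM orders WHERE customer_id = ?"):
    time.sleep(0.15)  # stand-in for a real query taking about 150 ms
```

Logging only above the threshold keeps volume bounded, and the JSON fields can be indexed and filtered by duration.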
|
|
78
|
+
|
|
79
|
+
### Traces
|
|
80
|
+
|
|
81
|
+
End-to-end records of a request's journey through distributed services.
|
|
82
|
+
|
|
83
|
+
| Property | Detail |
|
|
84
|
+
|------------------|-----------------------------------------------------------|
|
|
85
|
+
| **Data type** | Directed acyclic graphs (DAGs) of spans with timing data |
|
|
86
|
+
| **Storage cost** | Medium-high (each trace can contain 10-100+ spans) |
|
|
87
|
+
| **Query speed** | Moderate (requires trace ID lookup or attribute search) |
|
|
88
|
+
| **Best for** | Latency analysis, dependency mapping, bottleneck finding |
|
|
89
|
+
| **Sampling** | Almost always required at scale (1-10% typical) |
|
|
90
|
+
|
|
91
|
+
A single trace through a microservices system might contain 20-50 spans, each recording
|
|
92
|
+
service name, operation, duration, status, and custom attributes.
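
The span structure described above can be modeled as a small tree. The sketch below is a simplified illustration, not a tracing backend's data model: it assumes child spans run sequentially, and the service names and durations are invented. It finds the span with the largest self time, which is a first approximation of where a request spends its time.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    service: str
    operation: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

def self_time_ms(span: Span) -> float:
    """Time spent in the span itself, excluding child spans.
    Assumes children run sequentially (no overlap)."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)

def slowest_span(root: Span) -> Span:
    """Walk the span tree and return the span with the largest self time."""
    best = root
    stack = [root]
    while stack:
        span = stack.pop()
        if self_time_ms(span) > self_time_ms(best):
            best = span
        stack.extend(span.children)
    return best

# Hypothetical trace: gateway -> orders -> (postgres, payments)
trace = Span("gateway", "GET /checkout", 480, [
    Span("orders", "create_order", 450, [
        Span("postgres", "INSERT orders", 40),
        Span("payments", "charge", 380),
    ]),
])
print(slowest_span(trace).service)  # payments
```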
|
|
93
|
+
|
|
94
|
+
---
|
|
95
|
+
|
|
96
|
+
## When to Use Each Pillar for Performance
|
|
97
|
+
|
|
98
|
+
```
|
|
99
|
+
Question You're Asking --> Pillar to Use
|
|
100
|
+
──────────────────────────────────────────────────────────────
|
|
101
|
+
"Is latency increasing over time?" --> Metrics (histogram)
|
|
102
|
+
"What's the p99 latency right now?" --> Metrics (histogram quantile)
|
|
103
|
+
"Why was THIS request slow?" --> Traces (span waterfall)
|
|
104
|
+
"What error did the DB return?" --> Logs (structured error log)
|
|
105
|
+
"Which service is the bottleneck?" --> Traces (critical path analysis)
|
|
106
|
+
"Is CPU saturated?" --> Metrics (USE method)
|
|
107
|
+
"What happened during the outage?" --> Logs + Traces (correlated)
|
|
108
|
+
"Are we meeting our latency SLO?" --> Metrics (SLI tracking)
|
|
109
|
+
"What changed between deployments?" --> Metrics (before/after comparison)
|
|
110
|
+
"Why did GC pause spike?" --> Logs (GC log analysis)
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
**Rule of thumb:** Metrics for *detecting*, traces for *diagnosing*, logs for *explaining*.
|
|
114
|
+
|
|
115
|
+
---
|
|
116
|
+
|
|
117
|
+
## OpenTelemetry
|
|
118
|
+
|
|
119
|
+
OpenTelemetry (OTel) is the CNCF standard for telemetry collection. It provides vendor-neutral
|
|
120
|
+
APIs, SDKs, and a Collector for metrics, logs, and traces.
|
|
121
|
+
|
|
122
|
+
### Auto-Instrumentation
|
|
123
|
+
|
|
124
|
+
Auto-instrumentation injects telemetry collection without code changes. Available for:
|
|
125
|
+
Java (agent), Python (sitecustomize), Node.js (require hooks), .NET (startup hooks), Go (eBPF).
|
|
126
|
+
|
|
127
|
+
**Overhead benchmarks (from OTel official benchmarks and academic research):**
|
|
128
|
+
|
|
129
|
+
| Language | CPU Overhead | Latency Impact (p95) | Memory Overhead | Source |
|
|
130
|
+
|----------|--------------------|----------------------|-------------------|---------------------------------------|
|
|
131
|
+
| Java | 3-20% typical | 9-16% increase | 50-150 MB heap | OTel Java Instrumentation benchmarks |
|
|
132
|
+
| Go | 7-35% (varies) | 5-15% increase | 20-60 MB | Coroot OTel Go overhead study |
|
|
133
|
+
| Python | 5-15% | 10-25% increase | 30-80 MB | OTel Python SDK benchmarks |
|
|
134
|
+
| Node.js | 3-10% | 5-15% increase | 20-50 MB | OTel JS community benchmarks |
|
|
135
|
+
|
|
136
|
+
**Key findings from benchmarks:**
|
|
137
|
+
|
|
138
|
+
- Java agent: CPU overhead ranges from 3.6% (10% sampling) to 17.8% (100% sampling) of
|
|
139
|
+
additional CPU usage (Umeå University research, 2024).
|
|
140
|
+
- Go: ~35% CPU increase under full tracing load; ~7% CPU overhead for Redis operations
|
|
141
|
+
specifically (Coroot benchmark, 2024).
|
|
142
|
+
- Batch size impact: CPU overhead increases from 18.4% to 49.0% as batch size decreases,
|
|
143
|
+
making batch configuration critical (OTel specification benchmarks).
|
|
144
|
+
- Manual tracing consistently causes less overhead than automatic tracing (TechRxiv, 2024).
|
|
145
|
+
|
|
146
|
+
### Custom Spans
|
|
147
|
+
|
|
148
|
+
When auto-instrumentation is insufficient, create custom spans for business-critical paths:
|
|
149
|
+
|
|
150
|
+
```python
from opentelemetry import trace

tracer = trace.get_tracer("payment-service")

def process_payment(order):
    # validate() and charge() are application functions defined elsewhere.
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.amount", order.amount)
        span.set_attribute("order.currency", order.currency)

        # Each nested `with` opens a child span of process_payment.
        with tracer.start_as_current_span("validate_card"):
            validate(order.card)

        with tracer.start_as_current_span("charge_provider") as charge_span:
            result = charge(order)
            charge_span.set_attribute("payment.provider_latency_ms", result.latency)

        return result
```
|
|
169
|
+
|
|
170
|
+
**Guidelines for custom spans:**
|
|
171
|
+
|
|
172
|
+
- Instrument operations taking >1ms that cross boundaries (network, disk, queue).
|
|
173
|
+
- Add business attributes (order value, customer tier) for filtering.
|
|
174
|
+
- Avoid spans inside tight loops (creating a span costs ~1 microsecond, but at 1M iterations
|
|
175
|
+
that adds 1 second of pure overhead).
|
|
176
|
+
- Use span events for lightweight annotations instead of child spans.
|
|
177
|
+
|
|
178
|
+
### The OTel Collector
|
|
179
|
+
|
|
180
|
+
The Collector sits between instrumented applications and backends, providing:
|
|
181
|
+
|
|
182
|
+
- **Receivers**: Accept data via OTLP, Jaeger, Zipkin, Prometheus, and 80+ formats.
|
|
183
|
+
- **Processors**: Batch, filter, sample, transform, and enrich telemetry.
|
|
184
|
+
- **Exporters**: Send to any backend (Jaeger, Tempo, Prometheus, Datadog, etc.).
|
|
185
|
+
|
|
186
|
+
**Performance characteristics of the Collector:**
|
|
187
|
+
|
|
188
|
+
- Batching processor: groups spans into batches of 8192 (default), reducing export overhead
|
|
189
|
+
by 10-50x compared to per-span export.
|
|
190
|
+
- Memory limiter processor: prevents OOM by dropping data when memory exceeds threshold
|
|
191
|
+
(recommended: set to 80% of available memory).
|
|
192
|
+
- Typical resource usage: 0.5-2 CPU cores and 512 MB-2 GB RAM for 50,000 spans/second
|
|
193
|
+
throughput (OTel Collector benchmarks).
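
The idea behind batching can be sketched in a few lines of standard-library Python. This is not the Collector's implementation: the `SpanBatcher` class, its parameters, and the 5-second timeout are assumptions for illustration. It shows how buffering turns many per-span export calls into a few per-batch calls.

```python
import time

class SpanBatcher:
    """Buffers spans and hands them to `export` in batches, flushing when the
    batch is full or when `timeout_s` has passed since the last flush."""

    def __init__(self, export, max_batch_size=8192, timeout_s=5.0, clock=time.monotonic):
        self._export = export
        self._max = max_batch_size
        self._timeout = timeout_s
        self._clock = clock
        self._buffer = []
        self._last_flush = clock()

    def add(self, span):
        self._buffer.append(span)
        full = len(self._buffer) >= self._max
        stale = self._clock() - self._last_flush >= self._timeout
        if full or stale:
            self.flush()

    def flush(self):
        if self._buffer:
            self._export(self._buffer)  # one export call per batch
            self._buffer = []
        self._last_flush = self._clock()

calls = []
batcher = SpanBatcher(calls.append, max_batch_size=3, timeout_s=60)
for i in range(7):
    batcher.add({"span_id": i})
batcher.flush()
print([len(b) for b in calls])  # [3, 3, 1]
```

Seven spans result in three export calls instead of seven; with a batch size of 8192 the reduction in per-call overhead is correspondingly larger.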
|
|
194
|
+
|
|
195
|
+
### Reducing OTel Overhead
|
|
196
|
+
|
|
197
|
+
Strategies that reduce CPU overhead by 50-70% (OneUptime, 2026):
|
|
198
|
+
|
|
199
|
+
1. **Selective instrumentation**: Disable instrumentations you don't need
|
|
200
|
+
(e.g., `OTEL_INSTRUMENTATION_HTTP_ENABLED=false`).
|
|
201
|
+
2. **Increase batch size**: Larger batches amortize export cost; 8192+ spans per batch.
|
|
202
|
+
3. **Use sampling**: Sampling at 10% instead of 100% cuts CPU overhead by ~80% (Umeå research).
|
|
203
|
+
4. **Async exporters**: Never block the application thread on telemetry export.
|
|
204
|
+
5. **Filter at the Collector**: Drop low-value spans before export to reduce backend load.
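
The ~80% figure in the list above follows from the Java agent numbers quoted earlier in this section (17.8% CPU overhead at 100% sampling, 3.6% at 10%). A short check of that arithmetic, using only those two figures:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before

# CPU overhead figures quoted above for the Java agent.
full_sampling = 17.8  # % CPU overhead at 100% sampling
ten_percent = 3.6     # % CPU overhead at 10% sampling

print(f"{relative_reduction(full_sampling, ten_percent):.1%}")  # 79.8%
```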
|
|
205
|
+
|
|
206
|
+
---
|
|
207
|
+
|
|
208
|
+
## Distributed Tracing Systems
|
|
209
|
+
|
|
210
|
+
### Jaeger
|
|
211
|
+
|
|
212
|
+
Originally developed by Uber, now a CNCF graduated project.
|
|
213
|
+
|
|
214
|
+
- **Jaeger 2.0** (November 2024): Rebuilt on the OpenTelemetry Collector framework.
|
|
215
|
+
Single binary reduced image size from 40 MB to 30 MB. Native OTLP support eliminates
|
|
216
|
+
translation overhead. Adds tail-based sampling via OTel sampler (CNCF, 2024).
|
|
217
|
+
- **Architecture (v1)**: Agent (sidecar) buffers traces locally, preventing app slowdown if
|
|
218
|
+
the collector is unavailable. Collector handles ingestion and indexing.
|
|
219
|
+
- **Storage backends**: Cassandra, Elasticsearch, Kafka, Badger (local), ClickHouse.
|
|
220
|
+
- **Adaptive sampling**: Adjusts sampling rate per-service based on traffic volume.
|
|
221
|
+
- **Scale**: Uber processes billions of spans/day with Jaeger in production.
|
|
222
|
+
|
|
223
|
+
### Zipkin
|
|
224
|
+
|
|
225
|
+
The original open-source distributed tracing system, built at Twitter and modeled on Google's Dapper paper.
|
|
226
|
+
|
|
227
|
+
- **Architecture**: Direct reporting from services to Zipkin server (no sidecar agent).
|
|
228
|
+
Lower latency for trace visibility but higher risk of app impact if Zipkin is unavailable.
|
|
229
|
+
- **Overhead**: Slightly higher CPU and memory usage than OTel-based alternatives in
|
|
230
|
+
comparative benchmarks (Umeå University, 2024).
|
|
231
|
+
- **Storage**: Cassandra, Elasticsearch, MySQL.
|
|
232
|
+
- **Consideration**: Jaeger now supports Zipkin format, so migration path is clear.
|
|
233
|
+
|
|
234
|
+
### Grafana Tempo
|
|
235
|
+
|
|
236
|
+
Purpose-built for cost-effective trace storage.
|
|
237
|
+
|
|
238
|
+
- **Key differentiator**: Uses object storage (S3, GCS, Azure Blob) instead of databases,
|
|
239
|
+
reducing operational complexity and storage cost by 10-100x vs. Elasticsearch-backed
|
|
240
|
+
solutions for large trace volumes.
|
|
241
|
+
- **TraceQL**: Query language for searching traces by attributes, duration, and structure.
|
|
242
|
+
- **Integration**: Native integration with Grafana, Loki (logs), and Mimir (metrics) for
|
|
243
|
+
correlated observability.
|
|
244
|
+
- **Scale**: Designed for petabyte-scale trace storage with minimal indexing overhead.
|
|
245
|
+
|
|
246
|
+
### Choosing a Tracing Backend
|
|
247
|
+
|
|
248
|
+
```
Evaluation Criteria       Jaeger 2.0       Zipkin           Tempo
──────────────────────────────────────────────────────────────────────────────
OTel native               Yes (v2 core)    Via collector    Yes
Storage cost at scale     Medium           Medium           Low (object storage)
Operational complexity    Medium           Low              Low
Tail-based sampling       Yes (v2)         No               Via Collector
Query capability          Good             Basic            TraceQL (powerful)
Ecosystem integration     Broad            Broad            Grafana stack
Production maturity       Very high        Very high        High
```
|
|
259
|
+
|
|
260
|
+
---
|
|
261
|
+
|
|
262
|
+
## Metrics Systems: Prometheus and Datadog
|
|
263
|
+
|
|
264
|
+
### Prometheus
|
|
265
|
+
|
|
266
|
+
Open-source, pull-based metrics system. De facto standard for Kubernetes monitoring.
|
|
267
|
+
|
|
268
|
+
- **Data model**: Multi-dimensional time series identified by metric name and key/value labels.
|
|
269
|
+
- **Query language**: PromQL -- powerful, expressive, but steep learning curve.
|
|
270
|
+
- **Storage**: Local TSDB with ~1.3 bytes per sample (compressed). Retention typically 15-90 days.
|
|
271
|
+
- **Scrape interval**: Default 15 seconds. Lower intervals increase storage and CPU linearly.
|
|
272
|
+
- **Scalability limit**: Single Prometheus instance handles ~10M active time series;
|
|
273
|
+
beyond that, use Thanos or Cortex for federation.
|
|
274
|
+
- **Cost**: Free (open source). Infrastructure cost ~$0.03-0.06 per node/hour for managed
|
|
275
|
+
services (e.g., Grafana Cloud, Amazon Managed Prometheus).
|
|
276
|
+
- **Metric types**: Counter, Gauge, Histogram, Summary.
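
The storage and scrape-interval bullets above combine into a simple back-of-the-envelope estimate. The sketch below uses the ~1.3 bytes per sample figure from this list; the deployment size is hypothetical, and real usage also includes index and write-ahead-log overhead that this ignores.

```python
def tsdb_bytes(active_series: int, scrape_interval_s: float,
               retention_days: float, bytes_per_sample: float = 1.3) -> float:
    """Approximate on-disk size: series x samples per series x bytes per sample."""
    samples_per_series = retention_days * 86_400 / scrape_interval_s
    return active_series * samples_per_series * bytes_per_sample

# Hypothetical deployment: 1M active series, default 15s scrape, 15-day retention.
size = tsdb_bytes(1_000_000, scrape_interval_s=15, retention_days=15)
print(f"{size / 1e9:.0f} GB")  # 112 GB

# Halving the scrape interval doubles the sample count, and with it the storage.
print(tsdb_bytes(1_000_000, 7.5, 15) / size)  # 2.0
```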
|
|
277
|
+
|
|
278
|
+
**Performance-relevant PromQL examples:**
|
|
279
|
+
|
|
280
|
+
```promql
|
|
281
|
+
# p99 latency over 5 minutes
|
|
282
|
+
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
|
|
283
|
+
|
|
284
|
+
# Error rate as percentage
|
|
285
|
+
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
|
286
|
+
/ sum(rate(http_requests_total[5m])) * 100
|
|
287
|
+
|
|
288
|
+
# CPU saturation (runnable threads waiting)
|
|
289
|
+
rate(node_schedstat_waiting_seconds_total[5m])
|
|
290
|
+
```
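
`histogram_quantile` estimates a percentile from cumulative bucket counts rather than from raw observations. The following is a simplified Python sketch of that idea for classic histograms, assuming linear interpolation inside the matching bucket and a lower bound of 0 for the first bucket; the bucket boundaries and counts are invented for the example.

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for i, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - lower_count
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket if in_bucket else 0.0
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan

# Hypothetical latency buckets in seconds: le=0.1, 0.2, 0.5, +Inf
buckets = [(0.1, 50), (0.2, 90), (0.5, 99), (math.inf, 100)]
print(round(histogram_quantile(0.95, buckets), 3))  # 0.367
print(histogram_quantile(0.999, buckets))           # 0.5
```

The second result shows why bucket layout matters: any quantile that lands in the `+Inf` bucket is reported as the highest finite boundary, so the largest bucket should sit above the latency you need to resolve.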
|
|
291
|
+
|
|
292
|
+
### Datadog
|
|
293
|
+
|
|
294
|
+
Commercial full-stack observability platform.
|
|
295
|
+
|
|
296
|
+
- **Data model**: Push-based with 600+ pre-built integrations.
|
|
297
|
+
- **Custom metrics**: $0.05 per custom metric per month (at scale). Costs escalate rapidly
|
|
298
|
+
with high-cardinality metrics.
|
|
299
|
+
- **Strengths**: Anomaly detection, forecast monitoring, composite alerts, APM correlation.
|
|
300
|
+
- **APM pricing**: Based on traced hosts ($31-40/host/month) plus ingested spans
|
|
301
|
+
($0.10 per million after included volume).
|
|
302
|
+
- **Real-time**: 1-second granularity for infrastructure metrics (vs. 15s for Prometheus default).
|
|
303
|
+
|
|
304
|
+
### Prometheus vs. Datadog: Decision Factors
|
|
305
|
+
|
|
306
|
+
| Factor | Prometheus | Datadog |
|
|
307
|
+
|-------------------------|-----------------------------------|------------------------------------|
|
|
308
|
+
| Cost at 100 hosts | $0 (self-hosted) + infra | ~$3,100-4,000/month |
|
|
309
|
+
| Cost at 1000 hosts | $0 + significant infra | ~$31,000-40,000/month |
|
|
310
|
+
| Setup time | Hours (with Helm chart) | Minutes (agent install) |
|
|
311
|
+
| Custom metrics cost | Free | $0.05/metric/month |
|
|
312
|
+
| Vendor lock-in | None | High |
|
|
313
|
+
| Operational overhead | High (you manage everything) | Low (fully managed) |
|
|
314
|
+
| AI/ML features | None built-in | Anomaly detection, forecasting |
|
|
315
|
+
| Query language | PromQL (powerful, open) | Datadog query syntax (proprietary) |
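
The Datadog cost rows in the table are the per-host APM price range quoted earlier ($31-40/host/month) multiplied by host count. A minimal sketch of that arithmetic, which deliberately leaves out span ingestion and custom-metric charges:

```python
def datadog_host_cost(hosts: int, per_host_low: float = 31.0,
                      per_host_high: float = 40.0) -> tuple[float, float]:
    """Monthly host-based cost range, before span ingestion and custom metrics."""
    return hosts * per_host_low, hosts * per_host_high

print(datadog_host_cost(100))   # (3100.0, 4000.0)
print(datadog_host_cost(1000))  # (31000.0, 40000.0)
```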
|
|
316
|
+
|
|
317
|
+
---
|
|
318
|
+
|
|
319
|
+
## RED and USE Methods
|
|
320
|
+
|
|
321
|
+
### RED Method (for Services)
|
|
322
|
+
|
|
323
|
+
Developed by Tom Wilkie (Grafana Labs). Measures the user-facing behavior of every service.
|
|
324
|
+
|
|
325
|
+
**R**ate -- requests per second served by the service.
|
|
326
|
+
**E**rrors -- failed requests per second (HTTP 5xx, gRPC errors, exceptions).
|
|
327
|
+
**D**uration -- distribution of request latencies (p50, p95, p99).
|
|
328
|
+
|
|
329
|
+
```
|
|
330
|
+
Apply RED to:
|
|
331
|
+
- API gateways and load balancers
|
|
332
|
+
- Microservice endpoints
|
|
333
|
+
- Message queue consumers
|
|
334
|
+
- Database query paths
|
|
335
|
+
- External API calls
|
|
336
|
+
```
|
|
337
|
+
|
|
338
|
+
**Implementation example (Prometheus metrics):**
|
|
339
|
+
|
|
340
|
+
```promql
|
|
341
|
+
# Rate
|
|
342
|
+
sum(rate(http_requests_total[5m])) by (service)
|
|
343
|
+
|
|
344
|
+
# Errors
|
|
345
|
+
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
|
|
346
|
+
|
|
347
|
+
# Duration (p99)
|
|
348
|
+
histogram_quantile(0.99,
|
|
349
|
+
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
|
|
350
|
+
)
|
|
351
|
+
```
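
The same three RED signals can be computed directly from raw request records, which helps when validating what a dashboard reports. This sketch assumes an in-memory list of requests for one service and one time window; the `Request` type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    duration_ms: float
    status: int

def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate, Errors, and Duration (p99) for one service over one window."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = quantiles(durations, n=100, method="inclusive")[98]
    return {
        "rate_rps": len(requests) / window_s,
        "errors_rps": errors / window_s,
        "p99_ms": p99,
    }

# Hypothetical 60-second window: 588 fast successes, 12 slow server errors.
window = [Request(20, 200)] * 588 + [Request(900, 503)] * 12
print(red_summary(window, window_s=60))
# rate 10.0 rps, errors 0.2 rps, p99 900.0 ms
```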
|
|
352
|
+
|
|
353
|
+
### USE Method (for Resources)
|
|
354
|
+
|
|
355
|
+
Developed by Brendan Gregg. Measures the health of every physical/virtual resource.
|
|
356
|
+
|
|
357
|
+
**U**tilization -- percentage of resource capacity in use (0-100%).
|
|
358
|
+
**S**aturation -- work that cannot be served (queue depth, runnable threads waiting).
|
|
359
|
+
**E**rrors -- count of error events on the resource.
|
|
360
|
+
|
|
361
|
+
```
|
|
362
|
+
Apply USE to:
|
|
363
|
+
- CPU (utilization, run queue, hardware errors)
|
|
364
|
+
- Memory (usage %, swap activity, OOM events)
|
|
365
|
+
- Disk I/O (bandwidth utilization, I/O wait queue, read/write errors)
|
|
366
|
+
- Network (bandwidth %, TCP retransmits, dropped packets)
|
|
367
|
+
- Connection pools (active/max, waiters, timeouts)
|
|
368
|
+
- Thread pools (active/max, queue depth, rejections)
|
|
369
|
+
```
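
As an illustration of applying USE to one of the resources listed above, the sketch below maps a connection pool snapshot onto the three signals. The `PoolSnapshot` fields are assumptions for this example; real pool libraries expose similar counters under their own names.

```python
from dataclasses import dataclass

@dataclass
class PoolSnapshot:
    active: int    # connections currently checked out
    max_size: int  # pool capacity
    waiters: int   # callers blocked waiting for a connection
    timeouts: int  # acquisitions that gave up during the interval

def use_metrics(snap: PoolSnapshot) -> dict[str, float]:
    """Map one pool snapshot onto Utilization, Saturation, and Errors."""
    return {
        "utilization_pct": 100 * snap.active / snap.max_size,
        "saturation": snap.waiters,
        "errors": snap.timeouts,
    }

print(use_metrics(PoolSnapshot(active=18, max_size=20, waiters=7, timeouts=2)))
# {'utilization_pct': 90.0, 'saturation': 7, 'errors': 2}
```

Utilization alone would read as "90%, still has headroom"; the seven waiters show the pool is already saturated.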
|
|
370
|
+
|
|
371
|
+
### RED + USE Together
|
|
372
|
+
|
|
373
|
+
```
                   USE Method                   RED Method
                   (Resources)                  (Services)
               ┌─────────────────┐          ┌─────────────────┐
               │ CPU utilization │          │ Request rate    │
Infrastructure │ Memory saturat. │   App    │ Error rate      │
Layer          │ Disk I/O errors │   Layer  │ Latency p99     │
               └────────┬────────┘          └────────┬────────┘
                        │                            │
                        └─────────────┬──────────────┘
                                      │
                            ┌─────────▼──────────┐
                            │ Correlated View    │
                            │ "p99 latency spiked│
                            │  because CPU hit   │
                            │  95% utilization"  │
                            └────────────────────┘
```
|
|
391
|
+
|
|
392
|
+
---
|
|
393
|
+
|
|
394
|
+
## SLOs, SLIs, and Error Budgets
|
|
395
|
+
|
|
396
|
+
### Service Level Indicators (SLIs)
|
|
397
|
+
|
|
398
|
+
An SLI is a quantitative measure of a specific aspect of service performance. For performance
|
|
399
|
+
engineering, the most critical SLIs are:
|
|
400
|
+
|
|
401
|
+
| SLI Type | Example Measurement | Typical Target |
|
|
402
|
+
|-------------------|------------------------------------------|----------------------|
|
|
403
|
+
| Availability | Successful requests / total requests | 99.9% - 99.99% |
|
|
404
|
+
| Latency | % requests completing within threshold | 99.9% < 200ms |
|
|
405
|
+
| Throughput | Requests processed per second | > 10,000 rps |
|
|
406
|
+
| Error rate | Failed requests / total requests | < 0.1% |
|
|
407
|
+
| Saturation | Resource utilization below threshold | < 80% CPU |
|
|
408
|
+
|
|
409
|
+
**Best practice:** Measure SLIs at the load balancer or API gateway, not internally. The user's
|
|
410
|
+
experience is what matters, not what the server thinks happened.
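
A latency SLI of the kind shown in the table is a ratio of good events to total events. A minimal sketch, assuming request durations have already been collected at the gateway; the sample data and the choice to treat an empty window as compliant are assumptions.

```python
def latency_sli(durations_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the threshold."""
    if not durations_ms:
        return 1.0  # no traffic: treat the SLI as met (a policy choice)
    good = sum(1 for d in durations_ms if d < threshold_ms)
    return good / len(durations_ms)

# Hypothetical gateway measurements: 9,990 fast requests and 10 slow ones.
samples = [45.0] * 9_990 + [750.0] * 10
sli = latency_sli(samples, threshold_ms=200)
print(f"{sli:.2%}")  # 99.90%
print(sli >= 0.999)  # True
```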
|
|
411
|
+
|
|
412
|
+
### Service Level Objectives (SLOs)
|
|
413
|
+
|
|
414
|
+
An SLO is the target value for an SLI over a rolling time window.
|
|
415
|
+
|
|
416
|
+
**Common performance SLOs:**
|
|
417
|
+
|
|
418
|
+
```
|
|
419
|
+
SLO: 99.9% of HTTP requests complete in < 200ms over a 30-day window.
|
|
420
|
+
|
|
421
|
+
Meaning:
|
|
422
|
+
- Total requests in 30 days: ~130 million (at 50 rps)
|
|
423
|
+
- Allowed slow requests: ~130,000 (0.1% error budget)
|
|
424
|
+
- That's ~4,333 slow requests per day
|
|
425
|
+
- Or ~180 slow requests per hour
|
|
426
|
+
```
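
The request budget above can be reproduced with a few lines of arithmetic. This sketch uses the same inputs (50 rps, 99.9%, 30 days); the function name and result keys are chosen for this example.

```python
def slow_request_budget(rps: float, slo: float, window_days: int = 30) -> dict[str, float]:
    """How many requests may miss the latency threshold within the window."""
    total = rps * 86_400 * window_days
    allowed = total * (1 - slo)
    return {
        "total": total,
        "allowed": allowed,
        "per_day": allowed / window_days,
        "per_hour": allowed / (window_days * 24),
    }

budget = slow_request_budget(rps=50, slo=0.999)
print({k: round(v) for k, v in budget.items()})
# {'total': 129600000, 'allowed': 129600, 'per_day': 4320, 'per_hour': 180}
```

The exact values sit slightly below the rounded figures in the block above because the total is 129.6 million rather than 130 million.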
|
|
427
|
+
|
|
428
|
+
**SLO target selection guide:**
|
|
429
|
+
|
|
430
|
+
| SLO Target | Monthly Error Budget | Use Case |
|
|
431
|
+
|------------|----------------------|-----------------------------------------|
|
|
432
|
+
| 99% | 7.3 hours | Internal tools, batch processing |
|
|
433
|
+
| 99.5% | 3.6 hours | Non-critical customer-facing services |
|
|
434
|
+
| 99.9% | 43.8 minutes | Most production APIs and web apps |
|
|
435
|
+
| 99.95% | 21.9 minutes | Payment processing, auth services |
|
|
436
|
+
| 99.99% | 4.3 minutes | Core infrastructure, DNS, load balancer |
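
The budgets in this table correspond to an average calendar month of 43,800 minutes (365 days / 12). A short sketch that derives them, assuming that month length:

```python
MINUTES_PER_MONTH = 365 * 24 * 60 / 12  # 43,800: an average calendar month

def monthly_budget_minutes(slo_percent: float) -> float:
    """Allowed minutes of SLO violation per average month."""
    return MINUTES_PER_MONTH * (100 - slo_percent) / 100

for target in (99, 99.5, 99.9, 99.95, 99.99):
    minutes = monthly_budget_minutes(target)
    print(f"{target}% -> {minutes:.2f} min ({minutes / 60:.2f} h)")
```

For 99% this gives 438 minutes (7.3 hours), and for 99.99% it gives 4.38 minutes, which the table shows truncated to one decimal place.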
|
|
437
|
+
|
|
438
|
+
### Error Budgets
|
|
439
|
+
|
|
440
|
+
The error budget is the complement of the SLO: `error_budget = 1 - SLO_target`.
|
|
441
|
+
|
|
442
|
+
**How error budgets drive performance decisions:**
|
|
443
|
+
|
|
444
|
+
```
Error Budget State          Action
───────────────────────────────────────────────────────────────────────
Budget > 50% remaining      Ship features freely. Performance
                            improvements are optional.

Budget 20-50% remaining     Increase caution. Require performance
                            review for risky deployments.

Budget < 20% remaining      Freeze feature releases. All engineering
                            effort directed at reliability/performance.

Budget exhausted (0%)       Full stop on deploys. Incident-level
                            response. Postmortem required.
```
|
|
459
|
+
|
|
460
|
+
**Error budget calculation example:**
|
|
461
|
+
|
|
462
|
+
```
|
|
463
|
+
SLO: 99.9% availability over 30 days
|
|
464
|
+
Total minutes: 43,200
|
|
465
|
+
Error budget: 43,200 * 0.001 = 43.2 minutes of downtime allowed
|
|
466
|
+
|
|
467
|
+
Day 15: 20 minutes of downtime consumed
|
|
468
|
+
Remaining budget: 23.2 minutes (53.7% remaining)
|
|
469
|
+
Status: above the 50% threshold (ship freely), but one more incident moves it to CAUTION
|
|
470
|
+
```
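
The policy thresholds and the budget calculation can be combined into one function. This is a sketch under the thresholds stated in this section (50% and 20% remaining); the state labels are hypothetical names, and the 30 minutes of consumed downtime is an invented input.

```python
def budget_status(slo: float, window_minutes: float,
                  consumed_minutes: float) -> tuple[float, str]:
    """Return (fraction of error budget remaining, policy state)."""
    budget = window_minutes * (1 - slo)
    remaining = max(0.0, (budget - consumed_minutes) / budget)
    if remaining == 0:
        state = "EXHAUSTED"  # full stop on deploys
    elif remaining < 0.20:
        state = "FREEZE"     # freeze feature releases
    elif remaining <= 0.50:
        state = "CAUTION"    # review risky deployments
    else:
        state = "SHIP"       # ship features freely
    return remaining, state

remaining, state = budget_status(slo=0.999, window_minutes=43_200, consumed_minutes=30)
print(f"{remaining:.1%} {state}")  # 30.6% CAUTION
```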
|
|
471
|
+
|
|
472
|
+
**73% of organizations experienced an outage costing over $100,000 in the past year** (Nobl9
|
|
473
|
+
error budget guide, 2024). Error budgets provide a framework to balance feature velocity
|
|
474
|
+
against these risks.
|
|
475
|
+
|
|
476
|
+
---
|
|
477
|
+
|
|
478
|
+
## Sampling Strategies
|
|
479
|
+
|
|
480
|
+
At scale, collecting 100% of traces is neither affordable nor necessary. Sampling strategies
|
|
481
|
+
determine which traces to keep.
|
|
482
|
+
|
|
483
|
+
### Head-Based Sampling
|
|
484
|
+
|
|
485
|
+
Decision made at trace creation time (the "head" of the trace).
|
|
486
|
+
|
|
487
|
+
- **How it works**: A random number determines if the trace is sampled. The decision
|
|
488
|
+
propagates to all downstream services via trace context headers.
|
|
489
|
+
- **Pros**: Simple, low overhead, predictable cost, no buffering required.
|
|
490
|
+
- **Cons**: Cannot make decisions based on trace outcome (errors, latency). May miss
|
|
491
|
+
rare but important events.
|
|
492
|
+
- **Overhead**: Minimal -- just a random number comparison at span creation.
|
|
493
|
+
- **Typical rate**: 1-10% for high-throughput services (>1000 rps).
|
|
494
|
+
|
|
495
|
+
```
|
|
496
|
+
# OpenTelemetry head-based sampling configuration
|
|
497
|
+
OTEL_TRACES_SAMPLER=parentbased_traceidratio
|
|
498
|
+
OTEL_TRACES_SAMPLER_ARG=0.01 # 1% sampling
|
|
499
|
+
```
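
A ratio sampler keyed on the trace ID makes the same keep/drop decision in every service without coordination. The sketch below illustrates that idea in plain Python and is not the SDK's code: comparing the low 64 bits of the trace ID against a bound is an assumption about how such a sampler can be built. The `parentbased_` prefix in the configuration above additionally means a child span follows its parent's decision, which this sketch does not model.

```python
import random

TRACE_ID_BITS = 64  # compare on the low 64 bits of the 128-bit trace ID

def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio sampling: the decision depends only on the trace ID,
    so every service that sees the same ID reaches the same answer."""
    bound = round(ratio * (1 << TRACE_ID_BITS))
    return (trace_id & ((1 << TRACE_ID_BITS) - 1)) < bound

rng = random.Random(42)  # fixed seed so the sketch is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.01) for t in trace_ids)
print(kept / len(trace_ids))  # close to 0.01
```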

### Tail-Based Sampling

Decision made after the entire trace completes (the "tail").

- **How it works**: All spans are buffered in a collector. After a timeout (typically
  30-60 seconds), the complete trace is evaluated against policies.
- **Pros**: Can keep traces with errors, high latency, or specific attributes.
  Dramatically improves signal-to-noise ratio for debugging.
- **Cons**: Requires buffering all spans (high memory: 2-8 GB per collector typical).
  All spans of a trace must reach the same collector instance (routing complexity).
  During incidents, resource usage spikes as more traces match "keep" criteria.
- **Overhead**: 2-5x more infrastructure than head-based sampling.

```yaml
# OTel Collector tail sampling processor configuration
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}  # Keep all error traces
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 500}  # Keep traces > 500ms
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 1}  # 1% of normal traces
```
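The three policies above amount to a simple "keep if any policy matches" rule evaluated once the trace is complete. The sketch below illustrates that logic in plain Python; the `Span` dataclass and `keep_trace` function are hypothetical names for illustration and do not reflect the collector's internal implementation.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    start_ms: float
    end_ms: float
    is_error: bool = False


def keep_trace(
    spans: List[Span],
    latency_threshold_ms: float = 500.0,
    baseline_ratio: float = 0.01,
    rng: Optional[random.Random] = None,
) -> bool:
    """Decide, after the trace is complete, whether to keep it."""
    if not spans:
        return False
    # Policy 1: keep every trace that contains an error span.
    if any(span.is_error for span in spans):
        return True
    # Policy 2: keep traces whose end-to-end duration exceeds the threshold.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration_ms > latency_threshold_ms:
        return True
    # Policy 3: keep a small random baseline of everything else.
    rng = rng or random.Random()
    return rng.random() < baseline_ratio
```

Note that the decision needs every span of the trace, which is why all spans must be routed to the same collector instance.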

### Adaptive Sampling

Dynamically adjusts sampling rate based on system conditions.

- **Normal traffic**: Low sampling rate (0.1-1%) to minimize cost.
- **During incidents**: Automatically increases to 10-100% to capture diagnostic data.
- **Per-service**: Higher-throughput services sampled less; lower-traffic services at 100%.
- **Implementation**: Jaeger's adaptive sampling adjusts rates every 60 seconds based on
  observed traffic per service/endpoint.
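One way to picture the per-service behavior is a controller that aims for a fixed number of sampled traces per second and recomputes the rate from observed traffic. The sketch below is a simplified model under that assumption; `adjust_sampling_rate` and its parameters are hypothetical and this is not Jaeger's actual algorithm.

```python
def adjust_sampling_rate(
    observed_rps: float,
    target_traces_per_sec: float,
    min_rate: float = 0.001,
    max_rate: float = 1.0,
) -> float:
    """Pick a sampling rate that yields roughly the target sampled throughput."""
    if observed_rps <= 0:
        # No traffic observed: sample everything that does arrive.
        return max_rate
    desired = target_traces_per_sec / observed_rps
    # Clamp so that busy services never drop below the floor
    # and quiet services never exceed 100%.
    return max(min_rate, min(max_rate, desired))


# Usage: a busy service is sampled less, a quiet one at 100%.
print(adjust_sampling_rate(observed_rps=10_000, target_traces_per_sec=10))
print(adjust_sampling_rate(observed_rps=5, target_traces_per_sec=10))
```

Re-running this on a fixed interval (for example every 60 seconds) gives the "higher-throughput services sampled less" behavior described above.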

### Sampling Strategy Comparison

| Strategy       | Cost Reduction | Diagnostic Quality | Complexity | Best For                       |
|----------------|----------------|--------------------|------------|--------------------------------|
| None (100%)    | 0%             | Perfect            | None       | <100 rps total                 |
| Head 10%       | ~90%           | Poor (misses rare) | Low        | >1,000 rps, cost-sensitive     |
| Head 1%        | ~99%           | Very poor          | Low        | >10,000 rps, cost-critical     |
| Tail (errors)  | 70-95%         | Good for errors    | High       | Debugging-focused teams        |
| Tail (latency) | 70-95%         | Good for perf      | High       | Performance-focused teams      |
| Adaptive       | 80-99%         | Good               | Very high  | Large-scale production systems |

---

## Alerting on Performance

### The Alert Fatigue Problem

Teams that alert on raw thresholds (e.g., "p99 > 200ms") generate excessive alerts.
A brief spike during a deployment is not the same as a sustained degradation.

**Symptoms of alert fatigue:**

- Teams start ignoring alerts (>50% of alerts are false positives in many organizations).
- Mean time to acknowledge (MTTA) increases from minutes to hours.
- Real incidents get lost in noise.

### Burn Rate Alerts (Recommended)

Instead of alerting on instantaneous threshold breaches, alert on the *rate at which you're
consuming your error budget*. This is the approach recommended by Google SRE.

**How burn rate works:**

A burn rate of 1 means you will exactly exhaust your error budget at the end of the SLO window.
A burn rate of 10 means you will exhaust it in 1/10th of the window.

```
Burn Rate = (observed error rate) / (allowed error rate, i.e. 1 - SLO target)

Example:
  SLO: 99.9% over 30 days (error rate allowed: 0.1%)
  Observed error rate in last hour: 1%
  Burn rate = 1% / 0.1% = 10x

  At this rate, the 30-day error budget will be consumed in 3 days.
```
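The same arithmetic can be written as two small helper functions. This is a minimal sketch; the names `burn_rate` and `days_to_exhaustion` are illustrative and not taken from any monitoring library.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows."""
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, slo_window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return slo_window_days / rate


# Usage: 99.9% SLO, 1% of requests currently failing.
rate = burn_rate(observed_error_rate=0.01, slo_target=0.999)
print(f"burn rate {rate:.1f}x, budget gone in {days_to_exhaustion(rate):.1f} days")
```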

### Multi-Window, Multi-Burn-Rate Alerts

Google SRE recommends combining multiple windows for different severity levels.
The short window should be 1/12th of the long window.

| Severity | Burn Rate | Long Window | Short Window | Budget Consumed | Response      |
|----------|-----------|-------------|--------------|-----------------|---------------|
| P1       | 14.4x     | 1 hour      | 5 minutes    | 2% in 1 hour    | Page (5 min)  |
| P2       | 6x        | 6 hours     | 30 minutes   | 5% in 6 hours   | Page (30 min) |
| P3       | 3x        | 1 day       | 2 hours      | 10% in 1 day    | Ticket        |
| P4       | 1x        | 3 days      | 6 hours      | 10% in 3 days   | Review        |

**Implementation in Prometheus:**

```promql
# Fast burn rate alert (P1): 2% of budget consumed in 1 hour
# For 99.9% SLO (0.1% allowed error rate), burn rate 14.4x threshold
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
AND
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) > (14.4 * 0.001)
```
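The burn-rate thresholds in the severity table are not arbitrary: each one follows from the fraction of budget consumed and the length of the long window relative to a 30-day SLO window. A short sketch of that derivation, with hypothetical helper names:

```python
SLO_WINDOW_HOURS = 30 * 24  # assumes a 30-day SLO window


def burn_rate_threshold(
    budget_fraction: float,
    long_window_hours: float,
    slo_window_hours: float = SLO_WINDOW_HOURS,
) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in the long window."""
    return budget_fraction * slo_window_hours / long_window_hours


def short_window_minutes(long_window_hours: float) -> float:
    """Short window sized at 1/12th of the long window."""
    return long_window_hours * 60 / 12


tiers = [("P1", 0.02, 1), ("P2", 0.05, 6), ("P3", 0.10, 24), ("P4", 0.10, 72)]
for severity, budget, long_hours in tiers:
    threshold = round(burn_rate_threshold(budget, long_hours), 1)
    print(severity, f"{threshold}x", f"{short_window_minutes(long_hours):.0f} min")
```

For example, consuming 2% of a 720-hour budget within 1 hour requires a burn rate of 0.02 * 720 / 1 = 14.4.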

### Performance-Specific Alert Rules

```
Alert Type                  Threshold                      Window
──────────────────────────────────────────────────────────────────────────
Latency SLO burn            >6x burn rate                  6h + 30m
Error rate SLO burn         >14.4x burn rate               1h + 5m
CPU saturation              >90% sustained                 15m
Memory approaching OOM      >85% usage                     5m
Disk I/O saturation         >80% utilization               10m
Connection pool exhaustion  >90% active connections        5m
GC pause time               >500ms pause                   Immediate
Queue depth growth          Monotonic increase >10min      10m
Deployment regression       p99 >20% increase post-deploy  15m post-deploy
```

---

## Cost Management and Optimization

### Where Observability Cost Comes From

```
Cost Breakdown (typical enterprise):

  Logs:     50-60% of total observability spend
  Metrics:  15-25% of total observability spend
  Traces:   15-25% of total observability spend
  APM:      5-10% of total observability spend

Primary cost drivers:
  1. Data ingestion volume (GB/day)
  2. Data retention duration
  3. Metric cardinality (unique time series)
  4. Query/dashboard compute
```

### Cardinality Reduction

High-cardinality metrics are the single largest cost amplifier. Each unique combination of
label values creates a new time series.

**Example of cardinality explosion:**

```
Metric: http_request_duration_seconds
Labels: method, endpoint, status_code, user_id

Cardinality calculation:
  methods:      5 (GET, POST, PUT, DELETE, PATCH)
  endpoints:    50
  status_codes: 10
  user_ids:     100,000

Total series: 5 * 50 * 10 * 100,000 = 250,000,000 time series

After removing user_id:
Total series: 5 * 50 * 10 = 2,500 time series

Reduction: 99.999%
```
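The worst-case series count is simply the product of the distinct values per label, which makes it easy to estimate before adding a label. A minimal sketch, with a hypothetical `series_count` helper:

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_cardinalities.values())


labels = {"method": 5, "endpoint": 50, "status_code": 10, "user_id": 100_000}
with_user_id = series_count(labels)

# Drop the unbounded label and recompute.
without_user_id = series_count({k: v for k, v in labels.items() if k != "user_id"})

reduction = 1 - without_user_id / with_user_id
print(with_user_id, without_user_id, f"{reduction:.3%}")
```

This is an upper bound; real series counts are lower when some label combinations never occur.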

**Cardinality rules (denylist for metric labels):**

Never use as metric labels:
- `user_id`, `session_id`, `request_id` (unbounded)
- `container_id`, `pod_uid` (ephemeral, high churn)
- `url` with query strings (effectively unbounded)
- `trace_id` (belongs in traces, not metrics)
- `error_message` (use error codes instead)
- `timestamp` (already implicit in time series)

These identifiers belong in traces or logs, where they aid debugging without
overwhelming the metrics system.

### Log Volume Reduction

Strategies that achieve 20-40% log volume reduction (ClickHouse TCO Report):

1. **Structured logging**: JSON format enables selective field indexing. Parse and drop
   fields you never query against.
2. **Log levels in production**: WARN and above only for most services. DEBUG/TRACE
   only enabled dynamically during incidents.
3. **Log aggregation**: Collapse repeated events. "Connection refused" occurring 10,000
   times in 1 minute should be 1 log entry with `count=10000`.
4. **Sampling verbose logs**: Sample DEBUG logs at 1% in production.
5. **Edge filtering**: Filter at the agent/collector level before data leaves the host.
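As an illustration of strategy 4, DEBUG sampling can be done in-process with a standard `logging.Filter`. The sketch below uses only the Python standard library; the class name `DebugSamplingFilter` and the 1% default are assumptions for illustration.

```python
import logging
import random
from typing import Optional


class DebugSamplingFilter(logging.Filter):
    """Pass all records above DEBUG; pass DEBUG records with a fixed probability."""

    def __init__(self, sample_rate: float = 0.01, rng: Optional[random.Random] = None):
        super().__init__()
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True  # INFO and above are never sampled away
        return self._rng.random() < self.sample_rate


# Usage: attach the filter to a handler so sampling happens before output.
handler = logging.StreamHandler()
handler.addFilter(DebugSamplingFilter(sample_rate=0.01))
logging.getLogger("checkout").addHandler(handler)
```

Filtering on the handler rather than the logger keeps the decision close to the output sink, so other handlers can still receive every record if needed.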

### Trace Cost Optimization

Strategies that achieve 25-50% lower storage costs:

1. **Head-based sampling at 1-5%** for high-throughput services.
2. **Tail-based sampling** to keep only errors and slow traces.
3. **Span attribute trimming**: Remove large attributes (SQL queries > 500 chars, request
   bodies) or replace with hashes.
4. **Short retention**: 7-14 days for traces (vs. 30-90 for metrics).
5. **Object storage backends**: Tempo's S3-based storage is 10-100x cheaper per GB than
   Elasticsearch for trace data.
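Span attribute trimming (strategy 3) can be sketched as a function applied to a span's attribute dictionary before export. The helper name `trim_attributes`, the 500-character limit, and the replacement format are illustrative assumptions, not part of any tracing SDK.

```python
import hashlib

MAX_ATTR_CHARS = 500  # mirrors the "> 500 chars" guideline above


def trim_attributes(attributes: dict, max_chars: int = MAX_ATTR_CHARS) -> dict:
    """Replace oversized string attributes with a short, stable hash."""
    trimmed = {}
    for key, value in attributes.items():
        if isinstance(value, str) and len(value) > max_chars:
            digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
            # Keep the hash and original length so identical values still correlate.
            trimmed[key] = f"sha256:{digest} (len={len(value)})"
        else:
            trimmed[key] = value
    return trimmed


# Usage: a long SQL statement is replaced, small attributes are untouched.
attrs = {"db.statement": "SELECT " + "x, " * 400 + "y FROM t", "http.method": "GET"}
print(trim_attributes(attrs))
```

Hashing rather than dropping the value means two spans that ran the same query can still be grouped together.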

### Retention by Service Tier

```
Service Tier         Metrics Retention   Log Retention   Trace Retention
──────────────────────────────────────────────────────────────────────
Tier 1 (critical)    90 days             30 days         14 days
Tier 2 (standard)    30 days             14 days         7 days
Tier 3 (internal)    14 days             7 days          3 days
Tier 4 (dev/test)    7 days              3 days          1 day
```

---

## Common Bottlenecks

### 1. Observability Overhead Itself

The instrumentation meant to detect performance issues can *cause* them.

| Bottleneck                    | Impact                           | Mitigation                         |
|-------------------------------|----------------------------------|------------------------------------|
| OTel auto-instrumentation     | 3-20% CPU overhead (Java)        | Selective instrumentation          |
| Synchronous span export       | Blocks request processing        | Use async batch exporters          |
| High-frequency log writes     | Disk I/O contention              | Buffer + batch, async writes       |
| Collector as bottleneck       | Backpressure causes data loss    | Scale collectors horizontally      |
| Sidecar agent memory          | 50-200 MB per pod                | Right-size resource limits         |

### 2. High-Cardinality Metrics

- Prometheus: query latency increases linearly with series count. At >10M series,
  simple queries can take >10 seconds.
- Datadog: custom metric pricing means cardinality directly increases cost.
  100K custom metrics = $5,000/month on Datadog.
- Cortex/Mimir: ingestion rate drops when series cardinality causes index churn.

### 3. Log Volume

- At 1 TB/day ingestion, Elasticsearch clusters require 3-5 TB storage (with replication).
- Log search latency degrades from <1s to >30s as daily volume crosses 500 GB
  without proper index management.
- Splunk pricing at scale: $2-4 per GB ingested/day, making 1 TB/day = $730K-1.46M/year.

### 4. Trace Storage

- A single trace with 50 spans averages 5-15 KB.
- At 10,000 rps with 1% sampling: 100 traces/sec = 500-1,500 KB/sec = 43-130 GB/day.
- At 10,000 rps with 100% sampling: 10,000 traces/sec = 50-150 MB/sec = 4.3-13 TB/day.
- Elasticsearch storage for traces: ~$0.10-0.30/GB/month.
- S3 (Tempo): ~$0.023/GB/month (roughly 4-13x cheaper per GB stored).
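Daily trace volume is the product of request rate, sampling rate, average trace size, and seconds per day. A minimal sketch of that calculation, assuming decimal units (1 GB = 1,000,000 KB) and a hypothetical helper name:

```python
SECONDS_PER_DAY = 86_400
KB_PER_GB = 1_000_000  # decimal units


def trace_volume_gb_per_day(rps: float, sample_rate: float, trace_kb: float) -> float:
    """Stored trace volume per day for a given traffic level and sampling rate."""
    traces_per_sec = rps * sample_rate
    return traces_per_sec * trace_kb * SECONDS_PER_DAY / KB_PER_GB


# Usage: 10,000 rps at 1% and 100% sampling, for 5 KB and 15 KB traces.
for sample_rate in (0.01, 1.0):
    low = trace_volume_gb_per_day(10_000, sample_rate, 5)
    high = trace_volume_gb_per_day(10_000, sample_rate, 15)
    print(f"{sample_rate:.0%} sampling: {low:,.1f}-{high:,.1f} GB/day")
```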

---

## Anti-Patterns

### 1. Logging in Hot Paths

```
ANTI-PATTERN:
  for item in million_items:
      logger.debug(f"Processing item {item.id}")  # 1M log lines
      process(item)

FIX:
  logger.info(f"Processing {len(million_items)} items")
  for item in million_items:
      process(item)
  logger.info(f"Completed processing {len(million_items)} items")
```

**Impact**: Synchronous logging in tight loops can add 10-100x overhead. A `logger.debug()`
call costs ~1-5 microseconds even when DEBUG is disabled (due to string formatting and
level check). At 1M iterations, that's 1-5 seconds of pure logging overhead.
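When per-item DEBUG output is occasionally needed, the formatting cost can be avoided while DEBUG is off by passing arguments lazily and hoisting the level check out of the loop. A sketch using only the standard `logging` module; `process_all` and its parameters are hypothetical:

```python
import logging

logger = logging.getLogger("batch")


def process_all(items, process):
    """Process items with summary logs and optional, lazily formatted DEBUG logs."""
    logger.info("Processing %d items", len(items))
    # Check the level once instead of on every iteration.
    debug_enabled = logger.isEnabledFor(logging.DEBUG)
    for item in items:
        if debug_enabled:
            # %-style arguments are only formatted if the record is emitted.
            logger.debug("Processing item %s", item)
        process(item)
    logger.info("Completed processing %d items", len(items))


# Usage: with DEBUG disabled, the loop does no logging work per item.
results = []
process_all([1, 2, 3], results.append)
```

Unlike an f-string, the `%s` form defers string construction to the logging framework, so nothing is built for records that are filtered out.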

### 2. Unbounded Cardinality Labels

```
ANTI-PATTERN:
  http_requests_total{user_id="12345", path="/api/users/12345/orders"}

FIX:
  http_requests_total{user_tier="premium", path="/api/users/{id}/orders"}
```

**Impact**: Prometheus documentation explicitly warns against this. A single metric with
a `user_id` label creates one time series per user. At 1M users, that's 1M series from
one metric -- causing memory exhaustion, slow queries, and potentially out-of-memory crashes.
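Turning a raw request path into a bounded route template can be sketched with a small normalizer applied before the value is used as a label. The function name `normalize_path` and the choice to treat numeric and UUID segments as IDs are assumptions; where the web framework exposes the matched route, using that directly is usually more reliable.

```python
import re

# Matches a path segment that is all digits or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"/(?:\d+|[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})(?=/|$)"
)


def normalize_path(path: str) -> str:
    """Strip the query string and replace ID-like segments with a placeholder."""
    path = path.split("?", 1)[0]
    return _ID_SEGMENT.sub("/{id}", path)


# Usage: many concrete URLs collapse into one label value.
print(normalize_path("/api/users/12345/orders?page=2"))
```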

### 3. Not Sampling Traces

```
ANTI-PATTERN:
  # Collecting 100% of traces at 10,000 rps
  # = 864 million traces/day
  # = 4-13 TB/day storage
  # = $146K-438K/month on Elasticsearch

FIX:
  # Tail-based sampling: errors + slow + 1% baseline
  # = ~10-20 million traces/day (97-99% reduction)
  # = 50-300 GB/day storage
  # = $1.5K-9K/month on Elasticsearch
  # Or $345-2,070/month on S3/Tempo
```

### 4. Alerting on Raw Thresholds

```
ANTI-PATTERN:
  alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) > 0.2
  # Fires on every brief spike, creating alert fatigue

FIX:
  alert: LatencySLOBurnRateHigh
  expr: |
    (error_rate_6h / slo_error_rate) > 6
    AND
    (error_rate_30m / slo_error_rate) > 6
  # Only fires when error budget is being consumed at 6x rate
  # over both the 6-hour and 30-minute windows
```

### 5. Correlating Telemetry Without Shared Context

```
ANTI-PATTERN:
  # Logs, metrics, and traces with no shared identifiers
  # Debugging requires manual timestamp correlation across 3 systems

FIX:
  # Inject trace_id and span_id into log records
  # Use exemplars to link metrics to traces
  # Use OTel resource attributes for service identity
  ctx = span.get_span_context()
  log.info("Payment processed",
           extra={"trace_id": format(ctx.trace_id, "032x"),  # int -> 32-char hex
                  "span_id": format(ctx.span_id, "016x"),    # int -> 16-char hex
                  "order_id": order.id,
                  "duration_ms": elapsed})
```

### 6. Over-Instrumenting Everything

```
ANTI-PATTERN:
  # Spans for every function call, including utility functions
  with tracer.start_as_current_span("string_format"):
      result = f"{first} {last}"
  with tracer.start_as_current_span("list_append"):
      items.append(result)

FIX:
  # Spans only for meaningful operations (>1ms, I/O, cross-service)
  with tracer.start_as_current_span("fetch_user_profile"):
      profile = await db.query("SELECT * FROM users WHERE id = ?", user_id)
```

---

## Before/After: Observability-Driven Performance Fixes

### Case 1: Mystery Latency Spikes

**Before observability:**
- p99 latency: 2.3 seconds (SLO: 500ms)
- Team suspects "the database is slow" but cannot prove it
- Random restarts as mitigation strategy
- Mean time to resolution (MTTR): 4-8 hours

**After adding distributed tracing:**
- Trace waterfall reveals: 1.8 seconds spent in a downstream auth service
- Auth service making 3 sequential calls to a token validation endpoint
- Each call: 600ms (includes 400ms DNS resolution due to misconfigured resolver)
- Fix: Cache DNS + parallelize token validation
- p99 latency: 180ms (92% reduction)
- MTTR for similar issues: 15-30 minutes

### Case 2: Gradual Throughput Degradation

**Before observability:**
- Throughput drops from 5,000 rps to 2,000 rps over 2 weeks
- No clear correlation with any deployment
- Team adds more instances (cost +60%)

**After adding USE method metrics:**
- CPU utilization: 45% (not the bottleneck)
- Memory: 70% (not the bottleneck)
- Connection pool saturation: 98% (FOUND IT)
- Database connection pool of 20 connections shared across 50 threads
- Fix: Increase pool to 50, add connection pool metrics to dashboard
- Throughput restored to 5,000 rps. Removed extra instances (cost -38%)

### Case 3: Intermittent Error Spikes

**Before observability:**
- 0.5% error rate (SLO: 0.1%) but only during peak hours
- Errors appear random across services
- No correlation found in application logs

**After adding RED method + correlated logs/traces:**
- RED metrics show errors correlate with request rate > 3,000 rps
- Traces reveal: payment service timeout at exactly 30 seconds (default HTTP timeout)
- Correlated logs show: connection pool exhaustion in payment provider SDK
- Fix: Increase timeout, add circuit breaker, add bulkhead isolation
- Error rate: 0.02% (80% below SLO target)

---

## Decision Tree: What Should I Monitor?

```
START: "I need to monitor my system for performance"
│
├── Q1: "What type of system component?"
│   │
│   ├── Infrastructure (CPU, memory, disk, network)
│   │   └── USE Method
│   │       ├── Utilization: cpu_usage_percent, memory_used_bytes
│   │       ├── Saturation: cpu_runqueue_length, disk_io_queue
│   │       └── Errors: disk_errors_total, network_drops_total
│   │
│   ├── Service / API endpoint
│   │   └── RED Method
│   │       ├── Rate: http_requests_total (counter)
│   │       ├── Errors: http_errors_total or status 5xx rate
│   │       └── Duration: http_request_duration_seconds (histogram)
│   │
│   ├── Database
│   │   ├── Query latency (p50, p95, p99)
│   │   ├── Connection pool (active, idle, waiting, timeouts)
│   │   ├── Slow query log (queries > 100ms)
│   │   ├── Lock contention (lock wait time, deadlocks)
│   │   └── Replication lag (seconds behind primary)
│   │
│   ├── Message Queue (Kafka, RabbitMQ, SQS)
│   │   ├── Consumer lag (messages behind)
│   │   ├── Produce/consume rate (messages/sec)
│   │   ├── Processing duration per message
│   │   └── Dead letter queue depth
│   │
│   └── External Dependency
│       ├── Availability (success rate of outbound calls)
│       ├── Latency (p99 of outbound call duration)
│       ├── Circuit breaker state (open/closed/half-open)
│       └── Retry rate and exhaustion count
│
├── Q2: "What's my traffic volume?"
│   │
│   ├── < 100 rps   → Trace 100%, basic metrics, structured logs
│   ├── 100-1K rps  → Trace 10-50%, RED+USE metrics, warn+ logs
│   ├── 1K-10K rps  → Trace 1-10%, full metrics, aggregated logs
│   └── > 10K rps   → Trace 0.1-1% (tail-based), metrics only, sampled logs
│
├── Q3: "Do I have SLOs defined?"
│   │
│   ├── No  → Define SLOs first:
│   │         - Availability SLO (99.9% typical)
│   │         - Latency SLO (99% of requests < Xms)
│   │         - Set up error budget tracking
│   │         - Configure burn rate alerts
│   │
│   └── Yes → Monitor SLI metrics continuously
│             - Track error budget consumption
│             - Set multi-window burn rate alerts
│             - Review SLOs quarterly
│
└── Q4: "What's my observability budget?"
    │
    ├── Minimal ($0-500/mo)
    │   └── Prometheus + Grafana + Loki (self-hosted)
    │       Tempo for traces, head-based sampling
    │
    ├── Moderate ($500-5K/mo)
    │   └── Grafana Cloud or self-hosted with Thanos
    │       Tail-based sampling, 14-day retention
    │
    └── Enterprise ($5K+/mo)
        └── Datadog, New Relic, or Splunk
            Full APM, anomaly detection, 30+ day retention
            Or: self-hosted OTel + ClickHouse at scale
```

---

## Quick Reference

### Observability Stack Recommendations by Scale

| Scale               | Metrics            | Logs                 | Traces        | Cost/mo (approx)   |
|---------------------|--------------------|----------------------|---------------|--------------------|
| Startup (<10 svcs)  | Prometheus+Grafana | Loki                 | Jaeger        | $0-200 (self-host) |
| Growth (10-50 svcs) | Grafana Cloud      | Grafana Cloud Logs   | Tempo         | $500-3,000         |
| Scale (50-200 svcs) | Mimir/Thanos       | Loki/Elasticsearch   | Tempo         | $3,000-15,000      |
| Enterprise (200+)   | Datadog/Mimir      | Splunk/Elasticsearch | Tempo/Datadog | $15,000-100,000+   |

### Performance Monitoring Checklist

```
[ ] RED metrics on every service endpoint
[ ] USE metrics on every infrastructure resource
[ ] SLOs defined for latency, availability, and throughput
[ ] Error budget tracking with burn rate alerts
[ ] Distributed tracing with appropriate sampling
[ ] Structured logging with trace ID correlation
[ ] Dashboards: service overview, resource saturation, SLO status
[ ] Runbooks linked to every alert
[ ] Cardinality review (quarterly)
[ ] Observability cost review (monthly)
```

### Key Thresholds

```
Metric                     Warning          Critical
──────────────────────────────────────────────────────
CPU utilization            >70%             >90%
Memory utilization         >80%             >90%
Disk I/O utilization       >70%             >85%
Connection pool usage      >75%             >90%
Error budget burn rate     >3x              >14.4x
p99 latency vs SLO         >80% of target   >100% of target
GC pause time (JVM)        >200ms           >500ms
Thread pool queue depth    >100             >1000
```
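A threshold table like this maps directly onto a lookup-and-compare function. The sketch below encodes a few of the rows; the metric keys and the `classify` helper are hypothetical names chosen for illustration.

```python
# (warning, critical) thresholds; values above the bound trigger that level.
THRESHOLDS = {
    "cpu_utilization_pct": (70, 90),
    "memory_utilization_pct": (80, 90),
    "disk_io_utilization_pct": (70, 85),
    "connection_pool_usage_pct": (75, 90),
    "error_budget_burn_rate": (3, 14.4),
    "gc_pause_ms": (200, 500),
    "thread_pool_queue_depth": (100, 1000),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


# Usage: check a few sample readings.
print(classify("cpu_utilization_pct", 65))
print(classify("error_budget_burn_rate", 6))
print(classify("gc_pause_ms", 750))
```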

---

## Sources

- [OpenTelemetry Performance Benchmark Specification](https://opentelemetry.io/docs/specs/otel/performance-benchmark/)
- [OTel Component Performance Benchmarks](https://opentelemetry.io/blog/2023/perf-testing/)
- [OpenTelemetry Java Agent Performance](https://opentelemetry.io/docs/zero-code/java/agent/performance/)
- [OpenTelemetry for Go: Measuring the Overhead (Coroot)](https://coroot.com/blog/opentelemetry-for-go-measuring-the-overhead/)
- [Evaluating OpenTelemetry's Impact on Performance in Microservice Architectures (Umea University)](https://umu.diva-portal.org/smash/get/diva2:1877027/FULLTEXT01.pdf)
- [Performance Overhead and Optimization Strategies in OpenTelemetry (TechRxiv)](https://www.techrxiv.org/users/937157/articles/1334227-performance-overhead-and-optimization-strategies-in-opentelemetry)
- [How to Reduce OpenTelemetry Performance Overhead by 50% (OneUptime)](https://oneuptime.com/blog/post/2026-02-06-reduce-opentelemetry-performance-overhead-production/view)
- [OTelBench: Benchmark OpenTelemetry Infrastructure (Quesma/InfoQ)](https://www.infoq.com/news/2026/02/quesma-otel-bench-performance-ai/)
- [Jaeger v2 Released: OpenTelemetry in the Core (CNCF)](https://www.cncf.io/blog/2024/11/12/jaeger-v2-released-opentelemetry-in-the-core/)
- [Jaeger vs Zipkin vs Grafana Tempo (CoderSociety)](https://codersociety.com/blog/articles/jaeger-vs-zipkin-vs-tempo)
- [Grafana Tempo vs Jaeger (Last9)](https://last9.io/blog/grafana-tempo-vs-jaeger/)
- [The RED Method: How to Instrument Your Services (Grafana Labs)](https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/)
- [RED and USE Metrics for Monitoring (Better Stack)](https://betterstack.com/community/guides/monitoring/red-use-metrics/)
- [Monitoring Methodologies: RED and USE (The New Stack)](https://thenewstack.io/monitoring-methodologies-red-and-use/)
- [Three Pillars of Observability (IBM)](https://www.ibm.com/think/insights/observability-pillars)
- [Three Pillars of Observability (CrowdStrike)](https://www.crowdstrike.com/en-us/cybersecurity-101/observability/three-pillars-of-observability/)
- [Google SRE Workbook: Alerting on SLOs](https://sre.google/workbook/alerting-on-slos/)
- [Google SRE Workbook: Implementing SLOs](https://sre.google/workbook/implementing-slos/)
- [A Complete Guide to Error Budgets (Nobl9)](https://www.nobl9.com/resources/a-complete-guide-to-error-budgets-setting-up-slos-slis-and-slas-to-maintain-reliability)
- [Burn Rate Alerts (Datadog)](https://docs.datadoghq.com/service_management/service_level_objectives/burn_rate/)
- [Multi-Window Multi-Burn-Rate Alerts (Grafana Labs)](https://grafana.com/blog/how-to-implement-multi-window-multi-burn-rate-alerts-with-grafana-cloud/)
- [Alerting on SLOs Like Pros (SoundCloud)](https://developers.soundcloud.com/blog/alerting-on-slos/)
- [SLO/SLA-Driven Monitoring Requirements 2025 (Uptrace)](https://uptrace.dev/blog/sla-slo-monitoring-requirements)
- [OpenTelemetry Sampling Concepts](https://opentelemetry.io/docs/concepts/sampling/)
- [Tail Sampling with OpenTelemetry](https://opentelemetry.io/blog/2022/tail-sampling/)
- [Head-Based vs Tail-Based Sampling (CubeAPM)](https://cubeapm.com/blog/head-based-vs-tail-based-sampling/)
- [Mastering Distributed Tracing Sampling (Datadog)](https://www.datadoghq.com/architecture/mastering-distributed-tracing-data-volume-challenges-and-datadogs-approach-to-efficient-sampling/)
- [Observability TCO and Cost Reduction (ClickHouse)](https://clickhouse.com/resources/engineering/observability-tco-cost-reduction)
- [The High-Cardinality Trap (ClickHouse)](https://clickhouse.com/resources/engineering/high-cardinality-slow-observability-challenge)
- [Three Observability Anti-Patterns (Chronosphere)](https://chronosphere.io/learn/three-pesky-observability-anti-patterns-that-impact-developer-efficiency/)
- [Metric Cardinality Explained (Groundcover)](https://www.groundcover.com/learn/observability/metric-cardinality)
- [Prometheus vs Datadog Comparison 2024 (Squadcast)](https://medium.com/@squadcast/prometheus-vs-datadog-a-complete-comparison-guide-for-2024-7713d87d34a5)
- [Datadog vs Prometheus Comparison 2026 (Better Stack)](https://betterstack.com/community/comparisons/datadog-vs-prometheus/)
- [Hidden Costs in Observability 2026 (Grepr AI)](https://www.grepr.ai/blog/the-hidden-cost-in-observability)
- [How Much Should Observability Cost (Honeycomb)](https://www.honeycomb.io/blog/how-much-should-i-spend-on-observability-pt1)
- [Observability Trends 2026 (Elastic)](https://www.elastic.co/blog/2026-observability-trends-costs-business-impact)
- [OpenTelemetry Java Metrics Performance Comparison](https://opentelemetry.io/blog/2024/java-metric-systems-compared/)
- [OpenTelemetry and Grafana Labs 2025](https://grafana.com/blog/opentelemetry-and-grafana-labs-whats-new-and-whats-next-in-2025/)