@wazir-dev/cli 1.0.0

This diff shows the changes between publicly released versions of the package, as they appear in the supported public registries. It is provided for informational purposes only.
Files changed (629)
  1. package/AGENTS.md +111 -0
  2. package/CHANGELOG.md +14 -0
  3. package/CONTRIBUTING.md +101 -0
  4. package/LICENSE +21 -0
  5. package/README.md +314 -0
  6. package/assets/composition-engine.mmd +34 -0
  7. package/assets/demo-script.sh +17 -0
  8. package/assets/logo-dark.svg +14 -0
  9. package/assets/logo.svg +14 -0
  10. package/assets/pipeline.mmd +39 -0
  11. package/assets/record-demo.sh +51 -0
  12. package/docs/README.md +51 -0
  13. package/docs/adapters/context-mode.md +60 -0
  14. package/docs/concepts/architecture.md +87 -0
  15. package/docs/concepts/artifact-model.md +60 -0
  16. package/docs/concepts/composition-engine.md +36 -0
  17. package/docs/concepts/indexing-and-recall.md +160 -0
  18. package/docs/concepts/observability.md +41 -0
  19. package/docs/concepts/roles-and-workflows.md +59 -0
  20. package/docs/concepts/terminology-policy.md +27 -0
  21. package/docs/getting-started/01-installation.md +78 -0
  22. package/docs/getting-started/02-first-run.md +102 -0
  23. package/docs/getting-started/03-adding-to-project.md +15 -0
  24. package/docs/getting-started/04-host-setup.md +15 -0
  25. package/docs/guides/ci-integration.md +15 -0
  26. package/docs/guides/creating-skills.md +15 -0
  27. package/docs/guides/expertise-module-authoring.md +15 -0
  28. package/docs/guides/hook-development.md +15 -0
  29. package/docs/guides/memory-and-learnings.md +34 -0
  30. package/docs/guides/multi-host-export.md +15 -0
  31. package/docs/guides/troubleshooting.md +101 -0
  32. package/docs/guides/writing-custom-roles.md +15 -0
  33. package/docs/plans/2026-03-15-cli-pipeline-integration-design.md +592 -0
  34. package/docs/plans/2026-03-15-cli-pipeline-integration-plan.md +598 -0
  35. package/docs/plans/2026-03-15-docs-enforcement-plan.md +238 -0
  36. package/docs/readmes/INDEX.md +99 -0
  37. package/docs/readmes/features/expertise/README.md +171 -0
  38. package/docs/readmes/features/exports/README.md +222 -0
  39. package/docs/readmes/features/hooks/README.md +103 -0
  40. package/docs/readmes/features/hooks/loop-cap-guard.md +133 -0
  41. package/docs/readmes/features/hooks/post-tool-capture.md +121 -0
  42. package/docs/readmes/features/hooks/post-tool-lint.md +130 -0
  43. package/docs/readmes/features/hooks/pre-compact-summary.md +122 -0
  44. package/docs/readmes/features/hooks/pre-tool-capture-route.md +100 -0
  45. package/docs/readmes/features/hooks/protected-path-write-guard.md +128 -0
  46. package/docs/readmes/features/hooks/session-start.md +119 -0
  47. package/docs/readmes/features/hooks/stop-handoff-harvest.md +125 -0
  48. package/docs/readmes/features/roles/README.md +157 -0
  49. package/docs/readmes/features/roles/clarifier.md +152 -0
  50. package/docs/readmes/features/roles/content-author.md +190 -0
  51. package/docs/readmes/features/roles/designer.md +193 -0
  52. package/docs/readmes/features/roles/executor.md +184 -0
  53. package/docs/readmes/features/roles/learner.md +210 -0
  54. package/docs/readmes/features/roles/planner.md +182 -0
  55. package/docs/readmes/features/roles/researcher.md +164 -0
  56. package/docs/readmes/features/roles/reviewer.md +184 -0
  57. package/docs/readmes/features/roles/specifier.md +162 -0
  58. package/docs/readmes/features/roles/verifier.md +215 -0
  59. package/docs/readmes/features/schemas/README.md +178 -0
  60. package/docs/readmes/features/skills/README.md +63 -0
  61. package/docs/readmes/features/skills/brainstorming.md +96 -0
  62. package/docs/readmes/features/skills/debugging.md +148 -0
  63. package/docs/readmes/features/skills/design.md +120 -0
  64. package/docs/readmes/features/skills/prepare-next.md +109 -0
  65. package/docs/readmes/features/skills/run-audit.md +159 -0
  66. package/docs/readmes/features/skills/scan-project.md +109 -0
  67. package/docs/readmes/features/skills/self-audit.md +176 -0
  68. package/docs/readmes/features/skills/tdd.md +137 -0
  69. package/docs/readmes/features/skills/using-skills.md +92 -0
  70. package/docs/readmes/features/skills/verification.md +120 -0
  71. package/docs/readmes/features/skills/writing-plans.md +104 -0
  72. package/docs/readmes/features/tooling/README.md +320 -0
  73. package/docs/readmes/features/workflows/README.md +186 -0
  74. package/docs/readmes/features/workflows/author.md +181 -0
  75. package/docs/readmes/features/workflows/clarify.md +154 -0
  76. package/docs/readmes/features/workflows/design-review.md +171 -0
  77. package/docs/readmes/features/workflows/design.md +169 -0
  78. package/docs/readmes/features/workflows/discover.md +162 -0
  79. package/docs/readmes/features/workflows/execute.md +173 -0
  80. package/docs/readmes/features/workflows/learn.md +167 -0
  81. package/docs/readmes/features/workflows/plan-review.md +165 -0
  82. package/docs/readmes/features/workflows/plan.md +170 -0
  83. package/docs/readmes/features/workflows/prepare-next.md +167 -0
  84. package/docs/readmes/features/workflows/review.md +169 -0
  85. package/docs/readmes/features/workflows/run-audit.md +191 -0
  86. package/docs/readmes/features/workflows/spec-challenge.md +159 -0
  87. package/docs/readmes/features/workflows/specify.md +160 -0
  88. package/docs/readmes/features/workflows/verify.md +177 -0
  89. package/docs/readmes/packages/README.md +50 -0
  90. package/docs/readmes/packages/ajv.md +117 -0
  91. package/docs/readmes/packages/context-mode.md +118 -0
  92. package/docs/readmes/packages/gray-matter.md +116 -0
  93. package/docs/readmes/packages/node-test.md +137 -0
  94. package/docs/readmes/packages/yaml.md +112 -0
  95. package/docs/reference/configuration-reference.md +159 -0
  96. package/docs/reference/expertise-index.md +52 -0
  97. package/docs/reference/git-flow.md +43 -0
  98. package/docs/reference/hooks.md +87 -0
  99. package/docs/reference/host-exports.md +50 -0
  100. package/docs/reference/launch-checklist.md +172 -0
  101. package/docs/reference/marketplace-listings.md +76 -0
  102. package/docs/reference/release-process.md +34 -0
  103. package/docs/reference/roles-reference.md +77 -0
  104. package/docs/reference/skills.md +33 -0
  105. package/docs/reference/templates.md +29 -0
  106. package/docs/reference/tooling-cli.md +94 -0
  107. package/docs/truth-claims.yaml +222 -0
  108. package/expertise/PROGRESS.md +63 -0
  109. package/expertise/README.md +18 -0
  110. package/expertise/antipatterns/PROGRESS.md +56 -0
  111. package/expertise/antipatterns/backend/api-design-antipatterns.md +1271 -0
  112. package/expertise/antipatterns/backend/auth-antipatterns.md +1195 -0
  113. package/expertise/antipatterns/backend/caching-antipatterns.md +622 -0
  114. package/expertise/antipatterns/backend/database-antipatterns.md +1038 -0
  115. package/expertise/antipatterns/backend/index.md +24 -0
  116. package/expertise/antipatterns/backend/microservices-antipatterns.md +850 -0
  117. package/expertise/antipatterns/code/architecture-antipatterns.md +919 -0
  118. package/expertise/antipatterns/code/async-antipatterns.md +622 -0
  119. package/expertise/antipatterns/code/code-smells.md +1186 -0
  120. package/expertise/antipatterns/code/dependency-antipatterns.md +1209 -0
  121. package/expertise/antipatterns/code/error-handling-antipatterns.md +1360 -0
  122. package/expertise/antipatterns/code/index.md +27 -0
  123. package/expertise/antipatterns/code/naming-and-abstraction.md +1118 -0
  124. package/expertise/antipatterns/code/state-management-antipatterns.md +1076 -0
  125. package/expertise/antipatterns/code/testing-antipatterns.md +1053 -0
  126. package/expertise/antipatterns/design/accessibility-antipatterns.md +1136 -0
  127. package/expertise/antipatterns/design/dark-patterns.md +1121 -0
  128. package/expertise/antipatterns/design/index.md +22 -0
  129. package/expertise/antipatterns/design/ui-antipatterns.md +1202 -0
  130. package/expertise/antipatterns/design/ux-antipatterns.md +680 -0
  131. package/expertise/antipatterns/frontend/css-layout-antipatterns.md +691 -0
  132. package/expertise/antipatterns/frontend/flutter-antipatterns.md +1827 -0
  133. package/expertise/antipatterns/frontend/index.md +23 -0
  134. package/expertise/antipatterns/frontend/mobile-antipatterns.md +573 -0
  135. package/expertise/antipatterns/frontend/react-antipatterns.md +1128 -0
  136. package/expertise/antipatterns/frontend/spa-antipatterns.md +1235 -0
  137. package/expertise/antipatterns/index.md +31 -0
  138. package/expertise/antipatterns/performance/index.md +20 -0
  139. package/expertise/antipatterns/performance/performance-antipatterns.md +1013 -0
  140. package/expertise/antipatterns/performance/premature-optimization.md +623 -0
  141. package/expertise/antipatterns/performance/scaling-antipatterns.md +785 -0
  142. package/expertise/antipatterns/process/ai-coding-antipatterns.md +853 -0
  143. package/expertise/antipatterns/process/code-review-antipatterns.md +656 -0
  144. package/expertise/antipatterns/process/deployment-antipatterns.md +920 -0
  145. package/expertise/antipatterns/process/index.md +23 -0
  146. package/expertise/antipatterns/process/technical-debt-antipatterns.md +647 -0
  147. package/expertise/antipatterns/security/index.md +20 -0
  148. package/expertise/antipatterns/security/secrets-antipatterns.md +849 -0
  149. package/expertise/antipatterns/security/security-theater.md +843 -0
  150. package/expertise/antipatterns/security/vulnerability-patterns.md +801 -0
  151. package/expertise/architecture/PROGRESS.md +70 -0
  152. package/expertise/architecture/data/caching-architecture.md +671 -0
  153. package/expertise/architecture/data/data-consistency.md +574 -0
  154. package/expertise/architecture/data/data-modeling.md +536 -0
  155. package/expertise/architecture/data/event-streams-and-queues.md +634 -0
  156. package/expertise/architecture/data/index.md +25 -0
  157. package/expertise/architecture/data/search-architecture.md +663 -0
  158. package/expertise/architecture/data/sql-vs-nosql.md +708 -0
  159. package/expertise/architecture/decisions/architecture-decision-records.md +640 -0
  160. package/expertise/architecture/decisions/build-vs-buy.md +616 -0
  161. package/expertise/architecture/decisions/index.md +23 -0
  162. package/expertise/architecture/decisions/monolith-to-microservices.md +790 -0
  163. package/expertise/architecture/decisions/technology-selection.md +616 -0
  164. package/expertise/architecture/distributed/cap-theorem-and-tradeoffs.md +800 -0
  165. package/expertise/architecture/distributed/circuit-breaker-bulkhead.md +741 -0
  166. package/expertise/architecture/distributed/consensus-and-coordination.md +796 -0
  167. package/expertise/architecture/distributed/distributed-systems-fundamentals.md +564 -0
  168. package/expertise/architecture/distributed/idempotency-and-retry.md +796 -0
  169. package/expertise/architecture/distributed/index.md +25 -0
  170. package/expertise/architecture/distributed/saga-pattern.md +797 -0
  171. package/expertise/architecture/foundations/architectural-thinking.md +460 -0
  172. package/expertise/architecture/foundations/coupling-and-cohesion.md +770 -0
  173. package/expertise/architecture/foundations/design-principles-solid.md +649 -0
  174. package/expertise/architecture/foundations/domain-driven-design.md +719 -0
  175. package/expertise/architecture/foundations/index.md +25 -0
  176. package/expertise/architecture/foundations/separation-of-concerns.md +472 -0
  177. package/expertise/architecture/foundations/twelve-factor-app.md +797 -0
  178. package/expertise/architecture/index.md +34 -0
  179. package/expertise/architecture/integration/api-design-graphql.md +638 -0
  180. package/expertise/architecture/integration/api-design-grpc.md +804 -0
  181. package/expertise/architecture/integration/api-design-rest.md +892 -0
  182. package/expertise/architecture/integration/index.md +25 -0
  183. package/expertise/architecture/integration/third-party-integration.md +795 -0
  184. package/expertise/architecture/integration/webhooks-and-callbacks.md +1152 -0
  185. package/expertise/architecture/integration/websockets-realtime.md +791 -0
  186. package/expertise/architecture/mobile-architecture/index.md +22 -0
  187. package/expertise/architecture/mobile-architecture/mobile-app-architecture.md +780 -0
  188. package/expertise/architecture/mobile-architecture/mobile-backend-for-frontend.md +670 -0
  189. package/expertise/architecture/mobile-architecture/offline-first.md +719 -0
  190. package/expertise/architecture/mobile-architecture/push-and-sync.md +782 -0
  191. package/expertise/architecture/patterns/cqrs-event-sourcing.md +717 -0
  192. package/expertise/architecture/patterns/event-driven.md +797 -0
  193. package/expertise/architecture/patterns/hexagonal-clean-architecture.md +870 -0
  194. package/expertise/architecture/patterns/index.md +27 -0
  195. package/expertise/architecture/patterns/layered-architecture.md +736 -0
  196. package/expertise/architecture/patterns/microservices.md +753 -0
  197. package/expertise/architecture/patterns/modular-monolith.md +692 -0
  198. package/expertise/architecture/patterns/monolith.md +626 -0
  199. package/expertise/architecture/patterns/plugin-architecture.md +735 -0
  200. package/expertise/architecture/patterns/serverless.md +780 -0
  201. package/expertise/architecture/scaling/database-scaling.md +615 -0
  202. package/expertise/architecture/scaling/feature-flags-and-rollouts.md +757 -0
  203. package/expertise/architecture/scaling/horizontal-vs-vertical.md +606 -0
  204. package/expertise/architecture/scaling/index.md +24 -0
  205. package/expertise/architecture/scaling/multi-tenancy.md +800 -0
  206. package/expertise/architecture/scaling/stateless-design.md +787 -0
  207. package/expertise/backend/embedded-firmware.md +625 -0
  208. package/expertise/backend/go.md +853 -0
  209. package/expertise/backend/index.md +24 -0
  210. package/expertise/backend/java-spring.md +448 -0
  211. package/expertise/backend/node-typescript.md +625 -0
  212. package/expertise/backend/python-fastapi.md +724 -0
  213. package/expertise/backend/rust.md +458 -0
  214. package/expertise/backend/solidity.md +711 -0
  215. package/expertise/composition-map.yaml +443 -0
  216. package/expertise/content/foundations/content-modeling.md +395 -0
  217. package/expertise/content/foundations/editorial-standards.md +449 -0
  218. package/expertise/content/foundations/index.md +24 -0
  219. package/expertise/content/foundations/microcopy.md +455 -0
  220. package/expertise/content/foundations/terminology-governance.md +509 -0
  221. package/expertise/content/index.md +34 -0
  222. package/expertise/content/patterns/accessibility-copy.md +518 -0
  223. package/expertise/content/patterns/index.md +24 -0
  224. package/expertise/content/patterns/notification-content.md +433 -0
  225. package/expertise/content/patterns/sample-content.md +486 -0
  226. package/expertise/content/patterns/state-copy.md +439 -0
  227. package/expertise/design/PROGRESS.md +58 -0
  228. package/expertise/design/disciplines/dark-mode-theming.md +577 -0
  229. package/expertise/design/disciplines/design-systems.md +595 -0
  230. package/expertise/design/disciplines/index.md +25 -0
  231. package/expertise/design/disciplines/information-architecture.md +800 -0
  232. package/expertise/design/disciplines/interaction-design.md +788 -0
  233. package/expertise/design/disciplines/responsive-design.md +552 -0
  234. package/expertise/design/disciplines/usability-testing.md +516 -0
  235. package/expertise/design/disciplines/user-research.md +792 -0
  236. package/expertise/design/foundations/accessibility-design.md +796 -0
  237. package/expertise/design/foundations/color-theory.md +797 -0
  238. package/expertise/design/foundations/iconography.md +795 -0
  239. package/expertise/design/foundations/index.md +26 -0
  240. package/expertise/design/foundations/motion-and-animation.md +653 -0
  241. package/expertise/design/foundations/rtl-design.md +585 -0
  242. package/expertise/design/foundations/spacing-and-layout.md +607 -0
  243. package/expertise/design/foundations/typography.md +800 -0
  244. package/expertise/design/foundations/visual-hierarchy.md +761 -0
  245. package/expertise/design/index.md +32 -0
  246. package/expertise/design/patterns/authentication-flows.md +474 -0
  247. package/expertise/design/patterns/content-consumption.md +789 -0
  248. package/expertise/design/patterns/data-display.md +618 -0
  249. package/expertise/design/patterns/e-commerce.md +1494 -0
  250. package/expertise/design/patterns/feedback-and-states.md +642 -0
  251. package/expertise/design/patterns/forms-and-input.md +819 -0
  252. package/expertise/design/patterns/gamification.md +801 -0
  253. package/expertise/design/patterns/index.md +31 -0
  254. package/expertise/design/patterns/microinteractions.md +449 -0
  255. package/expertise/design/patterns/navigation.md +800 -0
  256. package/expertise/design/patterns/notifications.md +705 -0
  257. package/expertise/design/patterns/onboarding.md +700 -0
  258. package/expertise/design/patterns/search-and-filter.md +601 -0
  259. package/expertise/design/patterns/settings-and-preferences.md +768 -0
  260. package/expertise/design/patterns/social-and-community.md +748 -0
  261. package/expertise/design/platforms/desktop-native.md +612 -0
  262. package/expertise/design/platforms/index.md +25 -0
  263. package/expertise/design/platforms/mobile-android.md +825 -0
  264. package/expertise/design/platforms/mobile-cross-platform.md +983 -0
  265. package/expertise/design/platforms/mobile-ios.md +699 -0
  266. package/expertise/design/platforms/tablet.md +794 -0
  267. package/expertise/design/platforms/web-dashboard.md +790 -0
  268. package/expertise/design/platforms/web-responsive.md +550 -0
  269. package/expertise/design/psychology/behavioral-nudges.md +449 -0
  270. package/expertise/design/psychology/cognitive-load.md +1191 -0
  271. package/expertise/design/psychology/error-psychology.md +778 -0
  272. package/expertise/design/psychology/index.md +22 -0
  273. package/expertise/design/psychology/persuasive-design.md +736 -0
  274. package/expertise/design/psychology/user-mental-models.md +623 -0
  275. package/expertise/design/tooling/open-pencil.md +266 -0
  276. package/expertise/frontend/angular.md +1073 -0
  277. package/expertise/frontend/desktop-electron.md +546 -0
  278. package/expertise/frontend/flutter.md +782 -0
  279. package/expertise/frontend/index.md +27 -0
  280. package/expertise/frontend/native-android.md +409 -0
  281. package/expertise/frontend/native-ios.md +490 -0
  282. package/expertise/frontend/react-native.md +1160 -0
  283. package/expertise/frontend/react.md +808 -0
  284. package/expertise/frontend/vue.md +1089 -0
  285. package/expertise/humanize/domain-rules-code.md +79 -0
  286. package/expertise/humanize/domain-rules-content.md +67 -0
  287. package/expertise/humanize/domain-rules-technical-docs.md +56 -0
  288. package/expertise/humanize/index.md +35 -0
  289. package/expertise/humanize/self-audit-checklist.md +87 -0
  290. package/expertise/humanize/sentence-patterns.md +218 -0
  291. package/expertise/humanize/vocabulary-blacklist.md +105 -0
  292. package/expertise/i18n/PROGRESS.md +65 -0
  293. package/expertise/i18n/advanced/accessibility-and-i18n.md +28 -0
  294. package/expertise/i18n/advanced/bidirectional-text-algorithm.md +38 -0
  295. package/expertise/i18n/advanced/complex-scripts.md +30 -0
  296. package/expertise/i18n/advanced/performance-and-i18n.md +27 -0
  297. package/expertise/i18n/advanced/testing-i18n.md +28 -0
  298. package/expertise/i18n/content/content-adaptation.md +23 -0
  299. package/expertise/i18n/content/locale-specific-formatting.md +23 -0
  300. package/expertise/i18n/content/machine-translation-integration.md +28 -0
  301. package/expertise/i18n/content/translation-management.md +29 -0
  302. package/expertise/i18n/foundations/date-time-calendars.md +67 -0
  303. package/expertise/i18n/foundations/i18n-architecture.md +272 -0
  304. package/expertise/i18n/foundations/locale-and-language-tags.md +79 -0
  305. package/expertise/i18n/foundations/numbers-currency-units.md +61 -0
  306. package/expertise/i18n/foundations/pluralization-and-gender.md +109 -0
  307. package/expertise/i18n/foundations/string-externalization.md +236 -0
  308. package/expertise/i18n/foundations/text-direction-bidi.md +241 -0
  309. package/expertise/i18n/foundations/unicode-and-encoding.md +86 -0
  310. package/expertise/i18n/index.md +38 -0
  311. package/expertise/i18n/platform/backend-i18n.md +31 -0
  312. package/expertise/i18n/platform/flutter-i18n.md +148 -0
  313. package/expertise/i18n/platform/native-android-i18n.md +36 -0
  314. package/expertise/i18n/platform/native-ios-i18n.md +36 -0
  315. package/expertise/i18n/platform/react-i18n.md +103 -0
  316. package/expertise/i18n/platform/web-css-i18n.md +81 -0
  317. package/expertise/i18n/rtl/arabic-specific.md +175 -0
  318. package/expertise/i18n/rtl/hebrew-specific.md +149 -0
  319. package/expertise/i18n/rtl/rtl-animations-and-transitions.md +111 -0
  320. package/expertise/i18n/rtl/rtl-forms-and-input.md +161 -0
  321. package/expertise/i18n/rtl/rtl-fundamentals.md +211 -0
  322. package/expertise/i18n/rtl/rtl-icons-and-images.md +181 -0
  323. package/expertise/i18n/rtl/rtl-layout-mirroring.md +252 -0
  324. package/expertise/i18n/rtl/rtl-navigation-and-gestures.md +107 -0
  325. package/expertise/i18n/rtl/rtl-testing-and-qa.md +147 -0
  326. package/expertise/i18n/rtl/rtl-typography.md +160 -0
  327. package/expertise/index.md +113 -0
  328. package/expertise/index.yaml +216 -0
  329. package/expertise/infrastructure/cloud-aws.md +597 -0
  330. package/expertise/infrastructure/cloud-gcp.md +599 -0
  331. package/expertise/infrastructure/cybersecurity.md +816 -0
  332. package/expertise/infrastructure/database-mongodb.md +447 -0
  333. package/expertise/infrastructure/database-postgres.md +400 -0
  334. package/expertise/infrastructure/devops-cicd.md +787 -0
  335. package/expertise/infrastructure/index.md +27 -0
  336. package/expertise/performance/PROGRESS.md +50 -0
  337. package/expertise/performance/backend/api-latency.md +1204 -0
  338. package/expertise/performance/backend/background-jobs.md +506 -0
  339. package/expertise/performance/backend/connection-pooling.md +1209 -0
  340. package/expertise/performance/backend/database-query-optimization.md +515 -0
  341. package/expertise/performance/backend/index.md +23 -0
  342. package/expertise/performance/backend/rate-limiting-and-throttling.md +971 -0
  343. package/expertise/performance/foundations/algorithmic-complexity.md +954 -0
  344. package/expertise/performance/foundations/caching-strategies.md +489 -0
  345. package/expertise/performance/foundations/concurrency-and-parallelism.md +847 -0
  346. package/expertise/performance/foundations/index.md +24 -0
  347. package/expertise/performance/foundations/measuring-and-profiling.md +440 -0
  348. package/expertise/performance/foundations/memory-management.md +964 -0
  349. package/expertise/performance/foundations/performance-budgets.md +1314 -0
  350. package/expertise/performance/index.md +31 -0
  351. package/expertise/performance/infrastructure/auto-scaling.md +1059 -0
  352. package/expertise/performance/infrastructure/cdn-and-edge.md +1081 -0
  353. package/expertise/performance/infrastructure/index.md +22 -0
  354. package/expertise/performance/infrastructure/load-balancing.md +1081 -0
  355. package/expertise/performance/infrastructure/observability.md +1079 -0
  356. package/expertise/performance/mobile/index.md +23 -0
  357. package/expertise/performance/mobile/mobile-animations.md +544 -0
  358. package/expertise/performance/mobile/mobile-memory-battery.md +416 -0
  359. package/expertise/performance/mobile/mobile-network.md +452 -0
  360. package/expertise/performance/mobile/mobile-rendering.md +599 -0
  361. package/expertise/performance/mobile/mobile-startup-time.md +505 -0
  362. package/expertise/performance/platform-specific/flutter-performance.md +647 -0
  363. package/expertise/performance/platform-specific/index.md +22 -0
  364. package/expertise/performance/platform-specific/node-performance.md +1307 -0
  365. package/expertise/performance/platform-specific/postgres-performance.md +1366 -0
  366. package/expertise/performance/platform-specific/react-performance.md +1403 -0
  367. package/expertise/performance/web/bundle-optimization.md +1239 -0
  368. package/expertise/performance/web/image-and-media.md +636 -0
  369. package/expertise/performance/web/index.md +24 -0
  370. package/expertise/performance/web/network-optimization.md +1133 -0
  371. package/expertise/performance/web/rendering-performance.md +1098 -0
  372. package/expertise/performance/web/ssr-and-hydration.md +918 -0
  373. package/expertise/performance/web/web-vitals.md +1374 -0
  374. package/expertise/quality/accessibility.md +985 -0
  375. package/expertise/quality/evidence-based-verification.md +499 -0
  376. package/expertise/quality/index.md +24 -0
  377. package/expertise/quality/ml-model-audit.md +614 -0
  378. package/expertise/quality/performance.md +600 -0
  379. package/expertise/quality/testing-api.md +891 -0
  380. package/expertise/quality/testing-mobile.md +496 -0
  381. package/expertise/quality/testing-web.md +849 -0
  382. package/expertise/security/PROGRESS.md +54 -0
  383. package/expertise/security/agentic-identity.md +540 -0
  384. package/expertise/security/compliance-frameworks.md +601 -0
  385. package/expertise/security/data/data-encryption.md +364 -0
  386. package/expertise/security/data/data-privacy-gdpr.md +692 -0
  387. package/expertise/security/data/database-security.md +1171 -0
  388. package/expertise/security/data/index.md +22 -0
  389. package/expertise/security/data/pii-handling.md +531 -0
  390. package/expertise/security/foundations/authentication.md +1041 -0
  391. package/expertise/security/foundations/authorization.md +603 -0
  392. package/expertise/security/foundations/cryptography.md +1001 -0
  393. package/expertise/security/foundations/index.md +25 -0
  394. package/expertise/security/foundations/owasp-top-10.md +1354 -0
  395. package/expertise/security/foundations/secrets-management.md +1217 -0
  396. package/expertise/security/foundations/secure-sdlc.md +700 -0
  397. package/expertise/security/foundations/supply-chain-security.md +698 -0
  398. package/expertise/security/index.md +31 -0
  399. package/expertise/security/infrastructure/cloud-security-aws.md +1296 -0
  400. package/expertise/security/infrastructure/cloud-security-gcp.md +1376 -0
  401. package/expertise/security/infrastructure/container-security.md +721 -0
  402. package/expertise/security/infrastructure/incident-response.md +1295 -0
  403. package/expertise/security/infrastructure/index.md +24 -0
  404. package/expertise/security/infrastructure/logging-and-monitoring.md +1618 -0
  405. package/expertise/security/infrastructure/network-security.md +1337 -0
  406. package/expertise/security/mobile/index.md +23 -0
  407. package/expertise/security/mobile/mobile-android-security.md +1218 -0
  408. package/expertise/security/mobile/mobile-binary-protection.md +1229 -0
  409. package/expertise/security/mobile/mobile-data-storage.md +1265 -0
  410. package/expertise/security/mobile/mobile-ios-security.md +1401 -0
  411. package/expertise/security/mobile/mobile-network-security.md +1520 -0
  412. package/expertise/security/smart-contract-security.md +594 -0
  413. package/expertise/security/testing/index.md +22 -0
  414. package/expertise/security/testing/penetration-testing.md +1258 -0
  415. package/expertise/security/testing/security-code-review.md +1765 -0
  416. package/expertise/security/testing/threat-modeling.md +1074 -0
  417. package/expertise/security/testing/vulnerability-scanning.md +1062 -0
  418. package/expertise/security/web/api-security.md +586 -0
  419. package/expertise/security/web/cors-and-headers.md +433 -0
  420. package/expertise/security/web/csrf.md +562 -0
  421. package/expertise/security/web/file-upload.md +1477 -0
  422. package/expertise/security/web/index.md +25 -0
  423. package/expertise/security/web/injection.md +1375 -0
  424. package/expertise/security/web/session-management.md +1101 -0
  425. package/expertise/security/web/xss.md +1158 -0
  426. package/exports/README.md +17 -0
  427. package/exports/hosts/claude/.claude/agents/clarifier.md +42 -0
  428. package/exports/hosts/claude/.claude/agents/content-author.md +63 -0
  429. package/exports/hosts/claude/.claude/agents/designer.md +55 -0
  430. package/exports/hosts/claude/.claude/agents/executor.md +55 -0
  431. package/exports/hosts/claude/.claude/agents/learner.md +51 -0
  432. package/exports/hosts/claude/.claude/agents/planner.md +53 -0
  433. package/exports/hosts/claude/.claude/agents/researcher.md +43 -0
  434. package/exports/hosts/claude/.claude/agents/reviewer.md +54 -0
  435. package/exports/hosts/claude/.claude/agents/specifier.md +47 -0
  436. package/exports/hosts/claude/.claude/agents/verifier.md +71 -0
  437. package/exports/hosts/claude/.claude/commands/author.md +42 -0
  438. package/exports/hosts/claude/.claude/commands/clarify.md +38 -0
  439. package/exports/hosts/claude/.claude/commands/design-review.md +46 -0
  440. package/exports/hosts/claude/.claude/commands/design.md +44 -0
  441. package/exports/hosts/claude/.claude/commands/discover.md +37 -0
  442. package/exports/hosts/claude/.claude/commands/execute.md +48 -0
  443. package/exports/hosts/claude/.claude/commands/learn.md +38 -0
  444. package/exports/hosts/claude/.claude/commands/plan-review.md +42 -0
  445. package/exports/hosts/claude/.claude/commands/plan.md +39 -0
  446. package/exports/hosts/claude/.claude/commands/prepare-next.md +37 -0
  447. package/exports/hosts/claude/.claude/commands/review.md +40 -0
  448. package/exports/hosts/claude/.claude/commands/run-audit.md +41 -0
  449. package/exports/hosts/claude/.claude/commands/spec-challenge.md +41 -0
  450. package/exports/hosts/claude/.claude/commands/specify.md +38 -0
  451. package/exports/hosts/claude/.claude/commands/verify.md +37 -0
  452. package/exports/hosts/claude/.claude/settings.json +34 -0
  453. package/exports/hosts/claude/CLAUDE.md +19 -0
  454. package/exports/hosts/claude/export.manifest.json +38 -0
  455. package/exports/hosts/claude/host-package.json +67 -0
  456. package/exports/hosts/codex/AGENTS.md +19 -0
  457. package/exports/hosts/codex/export.manifest.json +38 -0
  458. package/exports/hosts/codex/host-package.json +41 -0
  459. package/exports/hosts/cursor/.cursor/hooks.json +16 -0
  460. package/exports/hosts/cursor/.cursor/rules/wazir-core.mdc +19 -0
  461. package/exports/hosts/cursor/export.manifest.json +38 -0
  462. package/exports/hosts/cursor/host-package.json +42 -0
  463. package/exports/hosts/gemini/GEMINI.md +19 -0
  464. package/exports/hosts/gemini/export.manifest.json +38 -0
  465. package/exports/hosts/gemini/host-package.json +41 -0
  466. package/hooks/README.md +18 -0
  467. package/hooks/definitions/loop_cap_guard.yaml +21 -0
  468. package/hooks/definitions/post_tool_capture.yaml +24 -0
  469. package/hooks/definitions/pre_compact_summary.yaml +19 -0
  470. package/hooks/definitions/pre_tool_capture_route.yaml +19 -0
  471. package/hooks/definitions/protected_path_write_guard.yaml +19 -0
  472. package/hooks/definitions/session_start.yaml +19 -0
  473. package/hooks/definitions/stop_handoff_harvest.yaml +20 -0
  474. package/hooks/loop-cap-guard +17 -0
  475. package/hooks/post-tool-lint +36 -0
  476. package/hooks/protected-path-write-guard +17 -0
  477. package/hooks/session-start +41 -0
  478. package/llms-full.txt +2355 -0
  479. package/llms.txt +43 -0
  480. package/package.json +79 -0
  481. package/roles/README.md +20 -0
  482. package/roles/clarifier.md +42 -0
  483. package/roles/content-author.md +63 -0
  484. package/roles/designer.md +55 -0
  485. package/roles/executor.md +55 -0
  486. package/roles/learner.md +51 -0
  487. package/roles/planner.md +53 -0
  488. package/roles/researcher.md +43 -0
  489. package/roles/reviewer.md +54 -0
  490. package/roles/specifier.md +47 -0
  491. package/roles/verifier.md +71 -0
  492. package/schemas/README.md +24 -0
  493. package/schemas/accepted-learning.schema.json +20 -0
  494. package/schemas/author-artifact.schema.json +156 -0
  495. package/schemas/clarification.schema.json +19 -0
  496. package/schemas/design-artifact.schema.json +80 -0
  497. package/schemas/docs-claim.schema.json +18 -0
  498. package/schemas/export-manifest.schema.json +20 -0
  499. package/schemas/hook.schema.json +67 -0
  500. package/schemas/host-export-package.schema.json +18 -0
  501. package/schemas/implementation-plan.schema.json +19 -0
  502. package/schemas/proposed-learning.schema.json +19 -0
  503. package/schemas/research.schema.json +18 -0
  504. package/schemas/review.schema.json +29 -0
  505. package/schemas/run-manifest.schema.json +18 -0
  506. package/schemas/spec-challenge.schema.json +18 -0
  507. package/schemas/spec.schema.json +20 -0
  508. package/schemas/usage.schema.json +102 -0
  509. package/schemas/verification-proof.schema.json +29 -0
  510. package/schemas/wazir-manifest.schema.json +173 -0
  511. package/skills/README.md +40 -0
  512. package/skills/brainstorming/SKILL.md +77 -0
  513. package/skills/debugging/SKILL.md +50 -0
  514. package/skills/design/SKILL.md +61 -0
  515. package/skills/dispatching-parallel-agents/SKILL.md +128 -0
  516. package/skills/executing-plans/SKILL.md +70 -0
  517. package/skills/finishing-a-development-branch/SKILL.md +169 -0
  518. package/skills/humanize/SKILL.md +123 -0
  519. package/skills/init-pipeline/SKILL.md +124 -0
  520. package/skills/prepare-next/SKILL.md +20 -0
  521. package/skills/receiving-code-review/SKILL.md +123 -0
  522. package/skills/requesting-code-review/SKILL.md +105 -0
  523. package/skills/requesting-code-review/code-reviewer.md +108 -0
  524. package/skills/run-audit/SKILL.md +197 -0
  525. package/skills/scan-project/SKILL.md +41 -0
  526. package/skills/self-audit/SKILL.md +153 -0
  527. package/skills/subagent-driven-development/SKILL.md +154 -0
  528. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +26 -0
  529. package/skills/subagent-driven-development/implementer-prompt.md +102 -0
  530. package/skills/subagent-driven-development/spec-reviewer-prompt.md +61 -0
  531. package/skills/tdd/SKILL.md +23 -0
  532. package/skills/using-git-worktrees/SKILL.md +163 -0
  533. package/skills/using-skills/SKILL.md +95 -0
  534. package/skills/verification/SKILL.md +22 -0
  535. package/skills/wazir/SKILL.md +463 -0
  536. package/skills/writing-plans/SKILL.md +30 -0
  537. package/skills/writing-skills/SKILL.md +157 -0
  538. package/skills/writing-skills/anthropic-best-practices.md +122 -0
  539. package/skills/writing-skills/persuasion-principles.md +50 -0
  540. package/templates/README.md +20 -0
  541. package/templates/artifacts/README.md +10 -0
  542. package/templates/artifacts/accepted-learning.md +19 -0
  543. package/templates/artifacts/accepted-learning.template.json +12 -0
  544. package/templates/artifacts/author.md +74 -0
  545. package/templates/artifacts/author.template.json +19 -0
  546. package/templates/artifacts/clarification.md +21 -0
  547. package/templates/artifacts/clarification.template.json +12 -0
  548. package/templates/artifacts/execute-notes.md +19 -0
  549. package/templates/artifacts/implementation-plan.md +21 -0
  550. package/templates/artifacts/implementation-plan.template.json +11 -0
  551. package/templates/artifacts/learning-proposal.md +19 -0
  552. package/templates/artifacts/next-run-handoff.md +21 -0
  553. package/templates/artifacts/plan-review.md +19 -0
  554. package/templates/artifacts/proposed-learning.template.json +12 -0
  555. package/templates/artifacts/research.md +21 -0
  556. package/templates/artifacts/research.template.json +12 -0
  557. package/templates/artifacts/review-findings.md +19 -0
  558. package/templates/artifacts/review.template.json +11 -0
  559. package/templates/artifacts/run-manifest.template.json +8 -0
  560. package/templates/artifacts/spec-challenge.md +19 -0
  561. package/templates/artifacts/spec-challenge.template.json +11 -0
  562. package/templates/artifacts/spec.md +21 -0
  563. package/templates/artifacts/spec.template.json +12 -0
  564. package/templates/artifacts/verification-proof.md +19 -0
  565. package/templates/artifacts/verification-proof.template.json +11 -0
  566. package/templates/examples/accepted-learning.example.json +14 -0
  567. package/templates/examples/author.example.json +152 -0
  568. package/templates/examples/clarification.example.json +15 -0
  569. package/templates/examples/docs-claim.example.json +8 -0
  570. package/templates/examples/export-manifest.example.json +7 -0
  571. package/templates/examples/host-export-package.example.json +11 -0
  572. package/templates/examples/implementation-plan.example.json +17 -0
  573. package/templates/examples/proposed-learning.example.json +13 -0
  574. package/templates/examples/research.example.json +15 -0
  575. package/templates/examples/research.example.md +6 -0
  576. package/templates/examples/review.example.json +17 -0
  577. package/templates/examples/run-manifest.example.json +9 -0
  578. package/templates/examples/spec-challenge.example.json +14 -0
  579. package/templates/examples/spec.example.json +21 -0
  580. package/templates/examples/verification-proof.example.json +21 -0
  581. package/templates/examples/wazir-manifest.example.yaml +65 -0
  582. package/templates/task-definition-schema.md +99 -0
  583. package/tooling/README.md +20 -0
  584. package/tooling/src/adapters/context-mode.js +50 -0
  585. package/tooling/src/capture/command.js +376 -0
  586. package/tooling/src/capture/store.js +99 -0
  587. package/tooling/src/capture/usage.js +270 -0
  588. package/tooling/src/checks/branches.js +50 -0
  589. package/tooling/src/checks/brand-truth.js +110 -0
  590. package/tooling/src/checks/changelog.js +231 -0
  591. package/tooling/src/checks/command-registry.js +36 -0
  592. package/tooling/src/checks/commits.js +102 -0
  593. package/tooling/src/checks/docs-drift.js +103 -0
  594. package/tooling/src/checks/docs-truth.js +201 -0
  595. package/tooling/src/checks/runtime-surface.js +156 -0
  596. package/tooling/src/cli.js +116 -0
  597. package/tooling/src/command-options.js +56 -0
  598. package/tooling/src/commands/validate.js +320 -0
  599. package/tooling/src/doctor/command.js +91 -0
  600. package/tooling/src/export/command.js +77 -0
  601. package/tooling/src/export/compiler.js +498 -0
  602. package/tooling/src/guards/loop-cap-guard.js +52 -0
  603. package/tooling/src/guards/protected-path-write-guard.js +67 -0
  604. package/tooling/src/index/command.js +152 -0
  605. package/tooling/src/index/storage.js +1061 -0
  606. package/tooling/src/index/summarizers.js +261 -0
  607. package/tooling/src/loaders.js +18 -0
  608. package/tooling/src/project-root.js +22 -0
  609. package/tooling/src/recall/command.js +225 -0
  610. package/tooling/src/schema-validator.js +30 -0
  611. package/tooling/src/state-root.js +40 -0
  612. package/tooling/src/status/command.js +71 -0
  613. package/wazir.manifest.yaml +135 -0
  614. package/workflows/README.md +19 -0
  615. package/workflows/author.md +42 -0
  616. package/workflows/clarify.md +38 -0
  617. package/workflows/design-review.md +46 -0
  618. package/workflows/design.md +44 -0
  619. package/workflows/discover.md +37 -0
  620. package/workflows/execute.md +48 -0
  621. package/workflows/learn.md +38 -0
  622. package/workflows/plan-review.md +42 -0
  623. package/workflows/plan.md +39 -0
  624. package/workflows/prepare-next.md +37 -0
  625. package/workflows/review.md +40 -0
  626. package/workflows/run-audit.md +41 -0
  627. package/workflows/spec-challenge.md +41 -0
  628. package/workflows/specify.md +38 -0
  629. package/workflows/verify.md +37 -0
@@ -0,0 +1,1079 @@
1
+ # Observability for Performance Engineering
2
+
3
+ > **Expertise Module** | Domain: Performance / Infrastructure
4
+ > Last updated: 2026-03-08
5
+
6
+ ---
7
+
8
+ ## Table of Contents
9
+
10
+ 1. [Overview](#overview)
11
+ 2. [The Three Pillars: Metrics, Logs, Traces](#the-three-pillars-metrics-logs-traces)
12
+ 3. [When to Use Each Pillar for Performance](#when-to-use-each-pillar-for-performance)
13
+ 4. [OpenTelemetry](#opentelemetry)
14
+ 5. [Distributed Tracing Systems](#distributed-tracing-systems)
15
+ 6. [Metrics Systems: Prometheus and Datadog](#metrics-systems-prometheus-and-datadog)
16
+ 7. [RED and USE Methods](#red-and-use-methods)
17
+ 8. [SLOs, SLIs, and Error Budgets](#slos-slis-and-error-budgets)
18
+ 9. [Sampling Strategies](#sampling-strategies)
19
+ 10. [Alerting on Performance](#alerting-on-performance)
20
+ 11. [Cost Management and Optimization](#cost-management-and-optimization)
21
+ 12. [Common Bottlenecks](#common-bottlenecks)
22
+ 13. [Anti-Patterns](#anti-patterns)
23
+ 14. [Before/After: Observability-Driven Performance Fixes](#beforeafter-observability-driven-performance-fixes)
24
+ 15. [Decision Tree: What Should I Monitor?](#decision-tree-what-should-i-monitor)
25
+ 16. [Quick Reference](#quick-reference)
26
+ 17. [Sources](#sources)
27
+
28
+ ---
29
+
30
+ ## Overview
31
+
32
+ Observability is the ability to understand a system's internal state from its external outputs.
33
+ For performance engineering, observability answers: "Why is my system slow, and where?"
34
+
35
+ **Key industry numbers:**
36
+
37
+ - The global observability market reached $28.5 billion in 2025 (Gartner/Research Nester).
38
+ - 15-25% of infrastructure budgets are allocated to observability (Gartner).
39
+ - Over 50% of observability spend goes to logs alone (ClickHouse TCO Report).
40
+ - 97% of organizations have experienced unexpected observability cost surprises (Grepr AI, 2026).
41
+ - 36% of enterprise clients spend over $1M/year on observability; 4% exceed $10M (Gartner).
42
+
43
+ Observability is not monitoring. Monitoring tells you *what* is broken. Observability tells you *why*.
44
+
45
+ ---
46
+
47
+ ## The Three Pillars: Metrics, Logs, Traces
48
+
49
+ ### Metrics
50
+
51
+ Numeric measurements of system behavior over time. Stored as time series.
52
+
53
+ | Property | Detail |
54
+ |------------------|-----------------------------------------------------------|
55
+ | **Data type** | Counters, gauges, histograms, summaries |
56
+ | **Storage cost** | Low (~1.3 bytes per compressed sample in Prometheus TSDB) |
57
+ | **Query speed** | Fast (pre-aggregated, indexed by label) |
58
+ | **Best for** | Dashboards, alerting, trend analysis, capacity planning |
59
+ | **Cardinality** | Must be bounded (unbounded labels destroy performance) |
60
+
61
+ Typical performance metrics: request rate, error rate, p50/p95/p99 latency, CPU utilization,
62
+ memory usage, queue depth, connection pool saturation.
63
+
64
+ ### Logs
65
+
66
+ Discrete events with structured or unstructured text.
67
+
68
+ | Property | Detail |
69
+ |------------------|-----------------------------------------------------------|
70
+ | **Data type** | Text records with timestamps and metadata |
71
+ | **Storage cost** | High (can be 10-100x more than metrics at scale) |
72
+ | **Query speed** | Slow without indexing; fast with structured/indexed logs |
73
+ | **Best for** | Debugging, audit trails, error details, forensic analysis |
74
+ | **Volume risk** | Easily grows to TB/day in production systems |
75
+
76
+ Performance-relevant log patterns: slow query logs (>100ms), garbage collection pauses,
77
+ connection timeouts, circuit breaker state changes, retry exhaustion events.
78
+
79
+ ### Traces
80
+
81
+ End-to-end records of a request's journey through distributed services.
82
+
83
+ | Property | Detail |
84
+ |------------------|-----------------------------------------------------------|
85
+ | **Data type** | Directed acyclic graphs (DAGs) of spans with timing data |
86
+ | **Storage cost** | Medium-high (each trace can contain 10-100+ spans) |
87
+ | **Query speed** | Moderate (requires trace ID lookup or attribute search) |
88
+ | **Best for** | Latency analysis, dependency mapping, bottleneck finding |
89
+ | **Sampling** | Almost always required at scale (1-10% typical) |
90
+
91
+ A single trace through a microservices system might contain 20-50 spans, each recording
92
+ service name, operation, duration, status, and custom attributes.
93
+
94
+ ---
95
+
96
+ ## When to Use Each Pillar for Performance
97
+
98
+ ```
99
+ Question You're Asking --> Pillar to Use
100
+ ──────────────────────────────────────────────────────────────
101
+ "Is latency increasing over time?" --> Metrics (histogram)
102
+ "What's the p99 latency right now?" --> Metrics (histogram quantile)
103
+ "Why was THIS request slow?" --> Traces (span waterfall)
104
+ "What error did the DB return?" --> Logs (structured error log)
105
+ "Which service is the bottleneck?" --> Traces (critical path analysis)
106
+ "Is CPU saturated?" --> Metrics (USE method)
107
+ "What happened during the outage?" --> Logs + Traces (correlated)
108
+ "Are we meeting our latency SLO?" --> Metrics (SLI tracking)
109
+ "What changed between deployments?" --> Metrics (before/after comparison)
110
+ "Why did GC pause spike?" --> Logs (GC log analysis)
111
+ ```
112
+
113
+ **Rule of thumb:** Metrics for *detecting*, traces for *diagnosing*, logs for *explaining*.
114
+
115
+ ---
116
+
117
+ ## OpenTelemetry
118
+
119
+ OpenTelemetry (OTel) is the CNCF standard for telemetry collection. It provides vendor-neutral
120
+ APIs, SDKs, and a Collector for metrics, logs, and traces.
121
+
122
+ ### Auto-Instrumentation
123
+
124
+ Auto-instrumentation injects telemetry collection without code changes. Available for:
125
+ Java (agent), Python (sitecustomize), Node.js (require hooks), .NET (startup hooks), Go (eBPF).
126
+
127
+ **Overhead benchmarks (from OTel official benchmarks and academic research):**
128
+
129
+ | Language | CPU Overhead | Latency Impact (p95) | Memory Overhead | Source |
130
+ |----------|--------------------|----------------------|-------------------|---------------------------------------|
131
+ | Java | 3-20% typical | 9-16% increase | 50-150 MB heap | OTel Java Instrumentation benchmarks |
132
+ | Go | 7-35% (varies) | 5-15% increase | 20-60 MB | Coroot OTel Go overhead study |
133
+ | Python | 5-15% | 10-25% increase | 30-80 MB | OTel Python SDK benchmarks |
134
+ | Node.js | 3-10% | 5-15% increase | 20-50 MB | OTel JS community benchmarks |
135
+
136
+ **Key findings from benchmarks:**
137
+
138
+ - Java agent: additional CPU usage ranges from 3.6% (at 10% sampling) to 17.8% (at 100%
139
+ sampling) (Umeå University research, 2024).
140
+ - Go: ~35% CPU increase under full tracing load; ~7% CPU overhead for Redis operations
141
+ specifically (Coroot benchmark, 2024).
142
+ - Batch size impact: CPU overhead increases from 18.4% to 49.0% as batch size decreases,
143
+ making batch configuration critical (OTel specification benchmarks).
144
+ - Manual tracing consistently causes less overhead than automatic tracing (TechRxiv, 2024).
145
+
146
+ ### Custom Spans
147
+
148
+ When auto-instrumentation is insufficient, create custom spans for business-critical paths:
149
+
150
+ ```python
151
+ from opentelemetry import trace
152
+
153
+ tracer = trace.get_tracer("payment-service")
154
+
155
+ def process_payment(order):
156
+     with tracer.start_as_current_span("process_payment") as span:
157
+         span.set_attribute("order.amount", order.amount)
158
+         span.set_attribute("order.currency", order.currency)
159
+
160
+         with tracer.start_as_current_span("validate_card"):  # child span
161
+             validate(order.card)
162
+
163
+         with tracer.start_as_current_span("charge_provider"):  # child span
164
+             result = charge(order)
165
+             span.set_attribute("payment.provider_latency_ms", result.latency)  # on parent span
166
+
167
+         return result
168
+ ```
169
+
170
+ **Guidelines for custom spans:**
171
+
172
+ - Instrument operations taking >1ms that cross boundaries (network, disk, queue).
173
+ - Add business attributes (order value, customer tier) for filtering.
174
+ - Avoid spans inside tight loops (creating a span costs ~1 microsecond, but at 1M iterations
175
+ that adds 1 second of pure overhead).
176
+ - Use span events for lightweight annotations instead of child spans.
177
+
178
+ ### The OTel Collector
179
+
180
+ The Collector sits between instrumented applications and backends, providing:
181
+
182
+ - **Receivers**: Accept data via OTLP, Jaeger, Zipkin, Prometheus, and 80+ formats.
183
+ - **Processors**: Batch, filter, sample, transform, and enrich telemetry.
184
+ - **Exporters**: Send to any backend (Jaeger, Tempo, Prometheus, Datadog, etc.).
185
+
186
+ **Performance characteristics of the Collector:**
187
+
188
+ - Batching processor: groups spans into batches of 8192 (default), reducing export overhead
189
+ by 10-50x compared to per-span export.
190
+ - Memory limiter processor: prevents OOM by dropping data when memory exceeds threshold
191
+ (recommended: set to 80% of available memory).
192
+ - Typical resource usage: 0.5-2 CPU cores and 512 MB-2 GB RAM for 50,000 spans/second
193
+ throughput (OTel Collector benchmarks).
194
+
195
+ ### Reducing OTel Overhead
196
+
197
+ Strategies that reduce CPU overhead by 50-70% (OneUptime, 2026):
198
+
199
+ 1. **Selective instrumentation**: Disable instrumentations you don't need
200
+ (e.g., `OTEL_INSTRUMENTATION_HTTP_ENABLED=false`).
201
+ 2. **Increase batch size**: Larger batches amortize export cost; 8192+ spans per batch.
202
+ 3. **Use sampling**: Sampling 10% of traces instead of 100% cuts CPU overhead by ~80% (Umeå research).
203
+ 4. **Async exporters**: Never block the application thread on telemetry export.
204
+ 5. **Filter at the Collector**: Drop low-value spans before export to reduce backend load.
205
+
206
+ ---
207
+
208
+ ## Distributed Tracing Systems
209
+
210
+ ### Jaeger
211
+
212
+ Originally developed by Uber, now a CNCF graduated project.
213
+
214
+ - **Jaeger 2.0** (November 2024): Rebuilt on the OpenTelemetry Collector framework.
215
+ Single binary reduced image size from 40 MB to 30 MB. Native OTLP support eliminates
216
+ translation overhead. Adds tail-based sampling via OTel sampler (CNCF, 2024).
217
+ - **Architecture**: In Jaeger v1, an agent (sidecar) buffers traces locally, preventing app
218
+ slowdown if the collector is unavailable; v2 drops the agent in favor of direct OTLP export. The collector handles ingestion and indexing.
219
+ - **Storage backends**: Cassandra, Elasticsearch, Kafka, Badger (local), ClickHouse.
220
+ - **Adaptive sampling**: Adjusts sampling rate per-service based on traffic volume.
221
+ - **Scale**: Uber processes billions of spans/day with Jaeger in production.
222
+
223
+ ### Zipkin
224
+
225
+ The original open-source distributed tracing system, built at Twitter and modeled on Google's Dapper paper.
226
+
227
+ - **Architecture**: Direct reporting from services to Zipkin server (no sidecar agent).
228
+ Lower latency for trace visibility but higher risk of app impact if Zipkin is unavailable.
229
+ - **Overhead**: Slightly higher CPU and memory usage than OTel-based alternatives in
230
+ comparative benchmarks (Umeå University, 2024).
231
+ - **Storage**: Cassandra, Elasticsearch, MySQL.
232
+ - **Consideration**: Jaeger now supports Zipkin format, so migration path is clear.
233
+
234
+ ### Grafana Tempo
235
+
236
+ Purpose-built for cost-effective trace storage.
237
+
238
+ - **Key differentiator**: Uses object storage (S3, GCS, Azure Blob) instead of databases,
239
+ reducing operational complexity and storage cost by 10-100x vs. Elasticsearch-backed
240
+ solutions for large trace volumes.
241
+ - **TraceQL**: Query language for searching traces by attributes, duration, and structure.
242
+ - **Integration**: Native integration with Grafana, Loki (logs), and Mimir (metrics) for
243
+ correlated observability.
244
+ - **Scale**: Designed for petabyte-scale trace storage with minimal indexing overhead.
245
+
246
+ ### Choosing a Tracing Backend
247
+
248
+ ```
249
+ Evaluation Criteria      Jaeger 2.0       Zipkin           Tempo
250
+ ─────────────────────────────────────────────────────────────────────
251
+ OTel native              Yes (v2 core)    Via collector    Yes
252
+ Storage cost at scale    Medium           Medium           Low (object storage)
253
+ Operational complexity   Medium           Low              Low
254
+ Tail-based sampling      Yes (v2)         No               Via Collector
255
+ Query capability         Good             Basic            TraceQL (powerful)
256
+ Ecosystem integration    Broad            Broad            Grafana stack
257
+ Production maturity      Very high        Very high        High
258
+ ```
259
+
260
+ ---
261
+
262
+ ## Metrics Systems: Prometheus and Datadog
263
+
264
+ ### Prometheus
265
+
266
+ Open-source, pull-based metrics system. De facto standard for Kubernetes monitoring.
267
+
268
+ - **Data model**: Multi-dimensional time series identified by metric name and key/value labels.
269
+ - **Query language**: PromQL -- powerful, expressive, but steep learning curve.
270
+ - **Storage**: Local TSDB with ~1.3 bytes per sample (compressed). Retention typically 15-90 days.
271
+ - **Scrape interval**: Commonly 15 seconds (the built-in default is 1 minute). Halving the interval roughly doubles sample volume, storage, and scrape CPU.
272
+ - **Scalability limit**: Single Prometheus instance handles ~10M active time series;
273
+ beyond that, use Thanos or Cortex for federation.
274
+ - **Cost**: Free (open source). Infrastructure cost ~$0.03-0.06 per node/hour for managed
275
+ services (e.g., Grafana Cloud, Amazon Managed Prometheus).
276
+ - **Metric types**: Counter, Gauge, Histogram, Summary.
277
+
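The storage and scrape-interval figures above combine into a back-of-the-envelope sizing estimate. The sketch below is only that: the function name is hypothetical, and it assumes the ~1.3 bytes/sample figure holds and ignores index, WAL, and compaction overhead.

```python
def tsdb_bytes_per_day(active_series, scrape_interval_s=15, bytes_per_sample=1.3):
    """Rough daily TSDB growth: series x samples per day x bytes per sample.

    Hypothetical helper for capacity planning; ignores index, WAL, and
    compaction overhead, so treat the result as a lower bound.
    """
    samples_per_series = 86_400 / scrape_interval_s  # seconds per day / interval
    return active_series * samples_per_series * bytes_per_sample


daily = tsdb_bytes_per_day(1_000_000)
print(f"{daily / 1e9:.2f} GB/day")  # 7.49 GB/day for 1M series at a 15s interval
```

Halving `scrape_interval_s` doubles the result, which is why short scrape intervals are worth reserving for the metrics that need them.
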
278
+ **Performance-relevant PromQL examples:**
279
+
280
+ ```promql
281
+ # p99 latency over 5 minutes
282
+ histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
283
+
284
+ # Error rate as percentage
285
+ sum(rate(http_requests_total{status=~"5.."}[5m]))
286
+ / sum(rate(http_requests_total[5m])) * 100
287
+
288
+ # CPU saturation (runnable threads waiting)
289
+ rate(node_schedstat_waiting_seconds_total[5m])
290
+ ```
291
+
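It helps to know what `histogram_quantile` does with the buckets: it finds the bucket containing the target rank and interpolates linearly inside it, so the result is an estimate bounded by bucket width. The Python sketch below illustrates that idea under stated assumptions. The function name and the `(upper_bound, cumulative_count)` input shape are illustrative, not Prometheus internals.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with an upper_bound of math.inf (assumed input shape).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound
```

Because the estimate cannot be finer than the bucket boundaries, buckets should be placed around the latency thresholds you care about (for example, the SLO threshold).
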
292
+ ### Datadog
293
+
294
+ Commercial full-stack observability platform.
295
+
296
+ - **Data model**: Push-based with 600+ pre-built integrations.
297
+ - **Custom metrics**: $0.05 per custom metric per month (at scale). Costs escalate rapidly
298
+ with high-cardinality metrics.
299
+ - **Strengths**: Anomaly detection, forecast monitoring, composite alerts, APM correlation.
300
+ - **APM pricing**: Based on traced hosts ($31-40/host/month) plus ingested spans
301
+ ($0.10 per million after included volume).
302
+ - **Real-time**: 1-second granularity for infrastructure metrics (vs. a typical 15s Prometheus scrape interval).
303
+
304
+ ### Prometheus vs. Datadog: Decision Factors
305
+
306
+ | Factor | Prometheus | Datadog |
307
+ |-------------------------|-----------------------------------|------------------------------------|
308
+ | Cost at 100 hosts | $0 (self-hosted) + infra | ~$3,100-4,000/month |
309
+ | Cost at 1000 hosts | $0 + significant infra | ~$31,000-40,000/month |
310
+ | Setup time | Hours (with Helm chart) | Minutes (agent install) |
311
+ | Custom metrics cost | Free | $0.05/metric/month |
312
+ | Vendor lock-in | None | High |
313
+ | Operational overhead | High (you manage everything) | Low (fully managed) |
314
+ | AI/ML features | None built-in | Anomaly detection, forecasting |
315
+ | Query language | PromQL (powerful, open) | Datadog query syntax (proprietary) |
316
+
317
+ ---
318
+
319
+ ## RED and USE Methods
320
+
321
+ ### RED Method (for Services)
322
+
323
+ Developed by Tom Wilkie (Grafana Labs). Measures the user-facing behavior of every service.
324
+
325
+ **R**ate -- requests per second served by the service.
326
+ **E**rrors -- failed requests per second (HTTP 5xx, gRPC errors, exceptions).
327
+ **D**uration -- distribution of request latencies (p50, p95, p99).
328
+
329
+ ```
330
+ Apply RED to:
331
+ - API gateways and load balancers
332
+ - Microservice endpoints
333
+ - Message queue consumers
334
+ - Database query paths
335
+ - External API calls
336
+ ```
337
+
338
+ **Implementation example (Prometheus metrics):**
339
+
340
+ ```promql
341
+ # Rate
342
+ sum(rate(http_requests_total[5m])) by (service)
343
+
344
+ # Errors
345
+ sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
346
+
347
+ # Duration (p99)
348
+ histogram_quantile(0.99,
349
+ sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
350
+ )
351
+ ```
352
+
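The same three RED signals can be derived from raw request records, which is useful for checking dashboards against a log sample. This is a minimal sketch: the `(status_code, duration_seconds)` record shape and the function name are assumptions, and it uses a nearest-rank percentile rather than the interpolated estimate a histogram backend produces.

```python
def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration for one service over a time window.

    requests: list of (status_code, duration_seconds) tuples (assumed shape).
    """
    total = len(requests)
    errors = sum(1 for status, _ in requests if 500 <= status <= 599)
    durations = sorted(duration for _, duration in requests)

    def percentile(p):
        # Nearest-rank percentile with integer ceiling division; None if empty.
        if not durations:
            return None
        rank = max(1, (p * len(durations) + 99) // 100)
        return durations[rank - 1]

    return {
        "rate_rps": total / window_seconds,
        "errors_rps": errors / window_seconds,
        "p50_s": percentile(50),
        "p99_s": percentile(99),
    }


sample = [(200, 0.05)] * 98 + [(500, 0.9), (503, 1.2)]
summary = red_summary(sample, window_seconds=10)
```

Here the median stays at 50 ms while the p99 lands on a failing 900 ms request, which is the gap between typical and tail latency that the Duration signal is meant to expose.
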
353
+ ### USE Method (for Resources)
354
+
355
+ Developed by Brendan Gregg. Measures the health of every physical/virtual resource.
356
+
357
+ **U**tilization -- percentage of resource capacity in use (0-100%).
358
+ **S**aturation -- work that cannot be served (queue depth, runnable threads waiting).
359
+ **E**rrors -- count of error events on the resource.
360
+
361
+ ```
362
+ Apply USE to:
363
+ - CPU (utilization, run queue, hardware errors)
364
+ - Memory (usage %, swap activity, OOM events)
365
+ - Disk I/O (bandwidth utilization, I/O wait queue, read/write errors)
366
+ - Network (bandwidth %, TCP retransmits, dropped packets)
367
+ - Connection pools (active/max, waiters, timeouts)
368
+ - Thread pools (active/max, queue depth, rejections)
369
+ ```
370
+
371
+ ### RED + USE Together
372
+
373
+ ```
374
+                     USE Method                   RED Method
375
+                     (Resources)                  (Services)
376
+                 ┌─────────────────┐          ┌─────────────────┐
377
+                 │ CPU utilization │          │ Request rate    │
378
+ Infrastructure  │ Memory saturat. │   App    │ Error rate      │
379
+ Layer           │ Disk I/O errors │  Layer   │ Latency p99     │
380
+                 └────────┬────────┘          └────────┬────────┘
381
+                          │                            │
382
+                          └──────────┬─────────────────┘
383
+
384
+                           ┌─────────▼──────────┐
385
+                           │  Correlated View   │
386
+                           │ "p99 latency spiked│
387
+                           │  because CPU hit   │
388
+                           │  95% utilization"  │
389
+                           └────────────────────┘
390
+ ```
391
+
392
+ ---
393
+
394
+ ## SLOs, SLIs, and Error Budgets
395
+
396
+ ### Service Level Indicators (SLIs)
397
+
398
+ An SLI is a quantitative measure of a specific aspect of service performance. For performance
399
+ engineering, the most critical SLIs are:
400
+
401
+ | SLI Type | Example Measurement | Typical Target |
402
+ |-------------------|------------------------------------------|----------------------|
403
+ | Availability | Successful requests / total requests | 99.9% - 99.99% |
404
+ | Latency | % requests completing within threshold | 99.9% < 200ms |
405
+ | Throughput | Requests processed per second | > 10,000 rps |
406
+ | Error rate | Failed requests / total requests | < 0.1% |
407
+ | Saturation | Resource utilization below threshold | < 80% CPU |
408
+
409
+ **Best practice:** Measure SLIs at the load balancer or API gateway, not internally. The user's
410
+ experience is what matters, not what the server thinks happened.
411
+
412
+ ### Service Level Objectives (SLOs)
413
+
414
+ An SLO is the target value for an SLI over a rolling time window.
415
+
416
+ **Common performance SLOs:**
417
+
418
+ ```
419
+ SLO: 99.9% of HTTP requests complete in < 200ms over a 30-day window.
420
+
421
+ Meaning:
422
+ - Total requests in 30 days: ~130 million (at 50 rps)
423
+ - Allowed slow requests: ~130,000 (0.1% error budget)
424
+ - That's ~4,333 slow requests per day
425
+ - Or ~180 slow requests per hour
426
+ ```
427
+
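The arithmetic behind that breakdown is easy to reproduce for other traffic levels and targets. A small sketch with a hypothetical helper name; the exact figures at 50 rps are 129.6 million requests and 129,600 allowed slow requests, which the example above rounds.

```python
def slow_request_budget(rps, slo_target, window_days=30):
    """Translate a request-based latency SLO into an allowed count of slow requests."""
    total_requests = rps * 86_400 * window_days
    allowed_slow = total_requests * (1 - slo_target)
    return {
        "total_requests": total_requests,
        "allowed_slow": allowed_slow,
        "per_day": allowed_slow / window_days,
        "per_hour": allowed_slow / (window_days * 24),
    }


budget = slow_request_budget(rps=50, slo_target=0.999)
```
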
428
+ **SLO target selection guide:**
429
+
430
+ | SLO Target | Monthly Error Budget | Use Case |
431
+ |------------|----------------------|-----------------------------------------|
432
+ | 99% | 7.3 hours | Internal tools, batch processing |
433
+ | 99.5% | 3.6 hours | Non-critical customer-facing services |
434
+ | 99.9% | 43.8 minutes | Most production APIs and web apps |
435
+ | 99.95% | 21.9 minutes | Payment processing, auth services |
436
+ | 99.99% | 4.3 minutes | Core infrastructure, DNS, load balancer |
437
+
438
+ ### Error Budgets
439
+
440
+ The error budget is the complement of the SLO: `error_budget = 1 - SLO_target`.
441
+
442
+ **How error budgets drive performance decisions:**
443
+
444
+ ```
445
+ Error Budget State          Action
446
+ ───────────────────────────────────────────────────────────────
447
+ Budget > 50% remaining      Ship features freely. Performance
448
+                             improvements are optional.
449
+
450
+ Budget 20-50% remaining     Increase caution. Require performance
451
+                             review for risky deployments.
452
+
453
+ Budget < 20% remaining      Freeze feature releases. All engineering
454
+                             effort directed at reliability/performance.
455
+
456
+ Budget exhausted (0%)       Full stop on deploys. Incident-level
457
+                             response. Postmortem required.
458
+ ```
459
+
460
+ **Error budget calculation example:**
461
+
462
+ ```
463
+ SLO: 99.9% availability over 30 days
464
+ Total minutes: 43,200
465
+ Error budget: 43,200 * 0.001 = 43.2 minutes of downtime allowed
466
+
467
+ Day 15: 20 minutes of downtime consumed
468
+ Remaining budget: 23.2 minutes (53.7% remaining)
469
+ Status: OK (> 50% remaining), but 46.3% of the budget is spent at the halfway mark -- watch the trend
470
+ ```
471
+
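The calculation and the policy tiers can be combined into one check that a deploy pipeline could call. This is a sketch, not a prescribed implementation: the function name and return shape are hypothetical, and the tier boundaries are taken from the policy table in this section.

```python
def error_budget_status(slo_target, window_days, downtime_minutes):
    """Return the time-based error budget and the matching policy tier."""
    total_minutes = window_days * 24 * 60
    budget_minutes = total_minutes * (1 - slo_target)
    remaining_minutes = budget_minutes - downtime_minutes
    remaining_pct = max(0.0, remaining_minutes / budget_minutes * 100)

    # Tier boundaries follow the policy table above (assumed cutoffs: 50%, 20%, 0%).
    if remaining_pct > 50:
        action = "ship freely"
    elif remaining_pct >= 20:
        action = "increase caution"
    elif remaining_pct > 0:
        action = "freeze feature releases"
    else:
        action = "full stop on deploys"

    return {
        "budget_minutes": budget_minutes,
        "remaining_minutes": remaining_minutes,
        "remaining_pct": remaining_pct,
        "action": action,
    }


status = error_budget_status(slo_target=0.999, window_days=30, downtime_minutes=20)
```

With the inputs from the example above, this yields a 43.2-minute budget with 23.2 minutes (about 53.7%) remaining.
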
472
+ **73% of organizations experienced an outage costing over $100,000 in the past year** (Nobl9
473
+ error budget guide, 2024). Error budgets provide a framework to balance feature velocity
474
+ against these risks.
475
+
476
+ ---
477
+
478
+ ## Sampling Strategies
479
+
480
+ At scale, collecting 100% of traces is neither affordable nor necessary. Sampling strategies
481
+ determine which traces to keep.
482
+
483
+ ### Head-Based Sampling
484
+
485
+ Decision made at trace creation time (the "head" of the trace).
486
+
487
+ - **How it works**: A random number determines if the trace is sampled. The decision
488
+ propagates to all downstream services via trace context headers.
489
+ - **Pros**: Simple, low overhead, predictable cost, no buffering required.
490
+ - **Cons**: Cannot make decisions based on trace outcome (errors, latency). May miss
491
+ rare but important events.
492
+ - **Overhead**: Minimal -- just a random number comparison at span creation.
493
+ - **Typical rate**: 1-10% for high-throughput services (>1000 rps).
494
+
495
+ ```
496
+ # OpenTelemetry head-based sampling configuration
497
+ OTEL_TRACES_SAMPLER=parentbased_traceidratio
498
+ OTEL_TRACES_SAMPLER_ARG=0.01 # 1% sampling
499
+ ```
500
+
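Ratio-based head sampling is usually made deterministic by deriving the decision from the trace ID itself, so that every service reaches the same verdict without coordination. The sketch below illustrates that technique in simplified form. It is not the SDK's code, and `should_sample` and `new_trace_id` are illustrative names.

```python
import random


def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide from the trace ID alone, so all services agree on the outcome."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    bound = int(ratio * (1 << 64))
    # Compare the low 64 bits of the 128-bit trace ID against the ratio bound.
    return (trace_id & ((1 << 64) - 1)) < bound


def new_trace_id() -> int:
    """Generate a random 128-bit trace ID (illustrative helper)."""
    return random.getrandbits(128)
```

Because the decision is a pure function of the trace ID, a downstream service that recomputes it, or simply honors the propagated sampled flag, never ends up with a partial trace.
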
501
+ ### Tail-Based Sampling
502
+
503
+ Decision made after the entire trace completes (the "tail").
504
+
505
+ - **How it works**: All spans are buffered in a collector. After a timeout (typically
506
+ 30-60 seconds), the complete trace is evaluated against policies.
507
+ - **Pros**: Can keep traces with errors, high latency, or specific attributes.
508
+ Dramatically improves signal-to-noise ratio for debugging.
509
+ - **Cons**: Requires buffering all spans (high memory: 2-8 GB per collector typical).
510
+ All spans of a trace must reach the same collector instance (routing complexity).
511
+ During incidents, resource usage spikes as more traces match "keep" criteria.
512
+ - **Overhead**: 2-5x more infrastructure than head-based sampling.
513
+
514
+ ```yaml
515
+ # OTel Collector tail sampling processor configuration
516
+ processors:
517
+   tail_sampling:
518
+     decision_wait: 30s
519
+     policies:
520
+       - name: errors
521
+         type: status_code
522
+         status_code: {status_codes: [ERROR]}       # Keep all error traces
523
+       - name: slow-requests
524
+         type: latency
525
+         latency: {threshold_ms: 500}               # Keep traces > 500ms
526
+       - name: baseline
527
+         type: probabilistic
528
+         probabilistic: {sampling_percentage: 1}    # 1% of normal traces
529
+ ```
530
+
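Conceptually, the three policies above are evaluated against each completed trace, and the trace is kept if any one of them matches. The Python sketch below mimics that logic for illustration only. The `Span` type and `keep_trace` function are hypothetical, and the real processor also handles buffering, timeouts, and cross-instance routing.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float
    is_error: bool = False


def keep_trace(spans, latency_threshold_ms=500, baseline_pct=1.0, rng=random.random):
    """Return the name of the first matching policy, or None to drop the trace."""
    if not spans:
        return None
    if any(span.is_error for span in spans):
        return "errors"
    # Trace duration: earliest span start to latest span end.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration_ms > latency_threshold_ms:
        return "slow-requests"
    if rng() * 100 < baseline_pct:
        return "baseline"
    return None
```

The decision needs every span of the trace, which is why all spans must be routed to the same collector instance before evaluation.
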
531
+ ### Adaptive Sampling
532
+
533
+ Dynamically adjusts sampling rate based on system conditions.
534
+
535
+ - **Normal traffic**: Low sampling rate (0.1-1%) to minimize cost.
536
+ - **During incidents**: Automatically increases to 10-100% to capture diagnostic data.
537
+ - **Per-service**: Higher-throughput services sampled less; lower-traffic services at 100%.
538
+ - **Implementation**: Jaeger's adaptive sampling adjusts rates every 60 seconds based on
539
+ observed traffic per service/endpoint.
540
+
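One simple way to picture the per-service behaviour is a controller that aims for a fixed number of sampled traces per second and derives the rate from observed traffic. The sketch below assumes exactly that; it is not Jaeger's algorithm, and `adaptive_rate`, `target_traces_per_sec`, and the clamp bounds are hypothetical.

```python
def adaptive_rate(observed_rps: float, target_traces_per_sec: float,
                  min_rate: float = 0.001, max_rate: float = 1.0) -> float:
    """Pick a sampling rate that yields roughly the target number of traces per second."""
    if observed_rps <= 0:
        return max_rate  # no traffic observed: sample everything that does arrive
    rate = target_traces_per_sec / observed_rps
    # Clamp so busy services never drop below min_rate and quiet ones cap at 100%.
    return max(min_rate, min(max_rate, rate))


if __name__ == "__main__":
    for service, rps in {"checkout": 10_000, "search": 1_000, "admin": 5}.items():
        print(service, adaptive_rate(rps, target_traces_per_sec=10))
```

Re-running this calculation on a fixed interval gives the "higher-throughput services sampled less, lower-traffic services at 100%" behaviour listed above.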
541
+ ### Sampling Strategy Comparison
542
+
543
+ | Strategy | Cost Reduction | Diagnostic Quality | Complexity | Best For |
544
+ |----------------|----------------|--------------------|------------|--------------------------------|
545
+ | None (100%) | 0% | Perfect | None | <100 rps total |
546
+ | Head 10% | ~90% | Poor (misses rare) | Low | >1,000 rps, cost-sensitive |
547
+ | Head 1% | ~99% | Very poor | Low | >10,000 rps, cost-critical |
548
+ | Tail (errors) | 70-95% | Good for errors | High | Debugging-focused teams |
549
+ | Tail (latency) | 70-95% | Good for perf | High | Performance-focused teams |
550
+ | Adaptive | 80-99% | Good | Very high | Large-scale production systems |
551
+
552
+ ---
553
+
554
+ ## Alerting on Performance
555
+
556
+ ### The Alert Fatigue Problem
557
+
558
+ Teams that alert on raw thresholds (e.g., "p99 > 200ms") generate excessive alerts.
559
+ A brief spike during a deployment is not the same as a sustained degradation.
560
+
561
+ **Symptoms of alert fatigue:**
562
+
563
+ - Teams start ignoring alerts (>50% of alerts are false positives in many organizations).
564
+ - Mean time to acknowledge (MTTA) increases from minutes to hours.
565
+ - Real incidents get lost in noise.
566
+
567
+ ### Burn Rate Alerts (Recommended)
568
+
569
+ Instead of alerting on instantaneous threshold breaches, alert on the *rate at which you're
570
+ consuming your error budget*. This is the approach recommended by Google SRE.
571
+
572
+ **How burn rate works:**
573
+
574
+ A burn rate of 1 means you will exactly exhaust your error budget at the end of the SLO window.
575
+ A burn rate of 10 means you will exhaust it in 1/10th of the window.
576
+
577
+ ```
578
+ Burn Rate = (observed error rate) / (SLO error rate)
579
+
580
+ Example:
581
+ SLO: 99.9% over 30 days (error rate allowed: 0.1%)
582
+ Observed error rate in last hour: 1%
583
+ Burn rate = 1% / 0.1% = 10x
584
+
585
+ At this rate, the 30-day error budget will be consumed in 3 days.
586
+ ```
587
+
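The same arithmetic as a small Python sketch. The helper names `burn_rate` and `days_to_exhaustion` are illustrative; the inputs are the figures from the example above.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return observed_error_rate / allowed


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    rate = burn_rate(observed_error_rate=0.01, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget exhausted in {days_to_exhaustion(rate):.1f} days")
```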
588
+ ### Multi-Window, Multi-Burn-Rate Alerts
589
+
590
+ Google SRE recommends combining multiple windows for different severity levels.
591
+ The short window should be 1/12th of the long window.
592
+
593
+ | Severity | Burn Rate | Long Window | Short Window | Budget Consumed | Response |
594
+ |----------|-----------|-------------|--------------|-----------------|-------------|
595
+ | P1 | 14.4x | 1 hour | 5 minutes | 2% in 1 hour | Page (5 min)|
596
+ | P2 | 6x | 6 hours | 30 minutes | 5% in 6 hours | Page (30 min)|
597
+ | P3 | 3x | 1 day | 2 hours | 10% in 1 day | Ticket |
598
+ | P4 | 1x | 3 days | 6 hours | 10% in 3 days | Review |
599
+
600
+ **Implementation in Prometheus:**
601
+
602
+ ```promql
603
+ # Fast burn rate alert (P1): 2% of budget consumed in 1 hour
604
+ # For 99.9% SLO (0.1% allowed error rate), burn rate 14.4x threshold
605
+ (
606
+ sum(rate(http_requests_total{status=~"5.."}[1h]))
607
+ / sum(rate(http_requests_total[1h]))
608
+ ) > (14.4 * 0.001)
609
+ AND
610
+ (
611
+ sum(rate(http_requests_total{status=~"5.."}[5m]))
612
+ / sum(rate(http_requests_total[5m]))
613
+ ) > (14.4 * 0.001)
614
+ ```
615
+
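The relationship between burn rate, window, and budget consumed in the table can be checked with a short Python sketch. It assumes a 30-day SLO window and a 99.9% SLO; `BurnRateRule`, `budget_consumed`, and `firing` are hypothetical names, and the error rates passed to `firing` would come from your metrics backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    severity: str
    burn_rate: float
    long_window_hours: float
    short_window_hours: float


RULES = [
    BurnRateRule("P1", 14.4, 1, 5 / 60),
    BurnRateRule("P2", 6.0, 6, 0.5),
    BurnRateRule("P3", 3.0, 24, 2),
    BurnRateRule("P4", 1.0, 72, 6),
]


def budget_consumed(rule: BurnRateRule, slo_window_hours: float = 30 * 24) -> float:
    """Fraction of the error budget spent if the burn rate holds for the long window."""
    return rule.burn_rate * rule.long_window_hours / slo_window_hours


def firing(rule: BurnRateRule, long_error_rate: float, short_error_rate: float,
           allowed_error_rate: float = 0.001) -> bool:
    """Alert only when both the long and the short window exceed the threshold."""
    threshold = rule.burn_rate * allowed_error_rate
    return long_error_rate > threshold and short_error_rate > threshold


if __name__ == "__main__":
    for rule in RULES:
        print(rule.severity, f"{budget_consumed(rule):.0%} of budget")
```

Requiring the short window to agree is what lets the alert reset quickly once the problem stops.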
616
+ ### Performance-Specific Alert Rules
617
+
618
+ ```
619
+ Alert Type                  Threshold                      Window
620
+ ─────────────────────────────────────────────────────────────────
621
+ Latency SLO burn            >6x burn rate                  6h + 30m
622
+ Error rate SLO burn         >14.4x burn rate               1h + 5m
623
+ CPU saturation              >90% sustained                 15m
624
+ Memory approaching OOM      >85% usage                     5m
625
+ Disk I/O saturation         >80% utilization               10m
626
+ Connection pool exhaustion  >90% active connections        5m
627
+ GC pause time               >500ms pause                   Immediate
628
+ Queue depth growth          Monotonic increase >10min      10m
629
+ Deployment regression       p99 >20% increase post-deploy  15m post-deploy
630
+ ```
631
+
632
+ ---
633
+
634
+ ## Cost Management and Optimization
635
+
636
+ ### Where Observability Cost Comes From
637
+
638
+ ```
639
+ Cost Breakdown (typical enterprise):
640
+
641
+ Logs: 50-60% of total observability spend
642
+ Metrics: 15-25% of total observability spend
643
+ Traces: 15-25% of total observability spend
644
+ APM: 5-10% of total observability spend
645
+
646
+ Primary cost drivers:
647
+ 1. Data ingestion volume (GB/day)
648
+ 2. Data retention duration
649
+ 3. Metric cardinality (unique time series)
650
+ 4. Query/dashboard compute
651
+ ```
652
+
653
+ ### Cardinality Reduction
654
+
655
+ High-cardinality metrics are the single largest cost amplifier. Each unique combination of
656
+ label values creates a new time series.
657
+
658
+ **Example of cardinality explosion:**
659
+
660
+ ```
661
+ Metric: http_request_duration_seconds
662
+ Labels: method, endpoint, status_code, user_id
663
+
664
+ Cardinality calculation:
665
+ methods: 5 (GET, POST, PUT, DELETE, PATCH)
666
+ endpoints: 50
667
+ status_codes: 10
668
+ user_ids: 100,000
669
+
670
+ Total series: 5 * 50 * 10 * 100,000 = 250,000,000 time series
671
+
672
+ After removing user_id:
673
+ Total series: 5 * 50 * 10 = 2,500 time series
674
+
675
+ Reduction: 99.999%
676
+ ```
677
+
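The calculation above is a product over per-label cardinalities, which a few lines of Python can reproduce. The function name `series_count` is illustrative, and the product is an upper bound: a real system only stores the label combinations that actually occur.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


labels = {"method": 5, "endpoint": 50, "status_code": 10, "user_id": 100_000}

before = series_count(labels)
after = series_count({name: n for name, n in labels.items() if name != "user_id"})
reduction = 1 - after / before

if __name__ == "__main__":
    print(f"with user_id:    {before:,} series")
    print(f"without user_id: {after:,} series")
    print(f"reduction:       {reduction:.3%}")
```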
678
+ **Cardinality rules (denylist for metric labels):**
679
+
680
+ Never use as metric labels:
681
+ - `user_id`, `session_id`, `request_id` (unbounded)
682
+ - `container_id`, `pod_uid` (ephemeral, high churn)
683
+ - `url` with query strings (effectively unbounded)
684
+ - `trace_id` (belongs in traces, not metrics)
685
+ - `error_message` (use error codes instead)
686
+ - `timestamp` (already implicit in time series)
687
+
688
+ These identifiers belong in traces or logs, where they aid debugging without
689
+ overwhelming the metrics system.
690
+
691
+ ### Log Volume Reduction
692
+
693
+ Strategies that achieve 20-40% log volume reduction (ClickHouse TCO Report):
694
+
695
+ 1. **Structured logging**: JSON format enables selective field indexing. Parse and drop
696
+ fields you never query against.
697
+ 2. **Log levels in production**: WARN and above only for most services. DEBUG/TRACE
698
+ only enabled dynamically during incidents.
699
+ 3. **Log aggregation**: Collapse repeated events. "Connection refused" occurring 10,000
700
+ times in 1 minute should be 1 log entry with `count=10000`.
701
+ 4. **Sampling verbose logs**: Sample DEBUG logs at 1% in production.
702
+ 5. **Edge filtering**: Filter at the agent/collector level before data leaves the host.
703
+
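Strategy 3 (collapsing repeated events) can be sketched as follows. This is a minimal illustration that assumes the messages for one time window have already been collected into a list; `collapse_repeats` is a hypothetical helper, not part of any logging library.

```python
from collections import Counter


def collapse_repeats(messages):
    """Collapse identical messages from one window into a single entry with a count."""
    counts = Counter(messages)  # keeps first-seen order of distinct messages
    return [{"message": msg, "count": n} for msg, n in counts.items()]


if __name__ == "__main__":
    window = ["Connection refused"] * 10_000 + ["Upstream timeout"]
    for entry in collapse_repeats(window):
        print(entry)
```

In practice this would run at the agent or collector, so the 10,000 raw lines never leave the host.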
704
+ ### Trace Cost Optimization
705
+
706
+ Strategies that achieve 25-50% lower storage costs:
707
+
708
+ 1. **Head-based sampling at 1-5%** for high-throughput services.
709
+ 2. **Tail-based sampling** to keep only errors and slow traces.
710
+ 3. **Span attribute trimming**: Remove large attributes (SQL queries > 500 chars, request
711
+ bodies) or replace with hashes.
712
+ 4. **Short retention**: 7-14 days for traces (vs. 30-90 for metrics).
713
+ 5. **Object storage backends**: Tempo's S3-based storage is 10-100x cheaper per GB than
714
+ Elasticsearch for trace data.
715
+
716
+ ### Retention by Service Tier
717
+
718
+ ```
719
+ Service Tier        Metrics Retention  Log Retention  Trace Retention
720
+ ──────────────────────────────────────────────────────────────────────
721
+ Tier 1 (critical)   90 days            30 days        14 days
722
+ Tier 2 (standard)   30 days            14 days        7 days
723
+ Tier 3 (internal)   14 days            7 days         3 days
724
+ Tier 4 (dev/test)   7 days             3 days         1 day
725
+ ```
726
+
727
+ ---
728
+
729
+ ## Common Bottlenecks
730
+
731
+ ### 1. Observability Overhead Itself
732
+
733
+ The instrumentation meant to detect performance issues can *cause* them.
734
+
735
+ | Bottleneck | Impact | Mitigation |
736
+ |-------------------------------|----------------------------------|------------------------------------|
737
+ | OTel auto-instrumentation | 3-20% CPU overhead (Java) | Selective instrumentation |
738
+ | Synchronous span export | Blocks request processing | Use async batch exporters |
739
+ | High-frequency log writes | Disk I/O contention | Buffer + batch, async writes |
740
+ | Collector as bottleneck | Backpressure causes data loss | Scale collectors horizontally |
741
+ | Sidecar agent memory | 50-200 MB per pod | Right-size resource limits |
742
+
743
+ ### 2. High-Cardinality Metrics
744
+
745
+ - Prometheus: query latency increases linearly with series count. At >10M series,
746
+ simple queries can take >10 seconds.
747
+ - Datadog: custom metric pricing means cardinality directly increases cost.
748
+ 100K custom metrics = $5,000/month on Datadog.
749
+ - Cortex/Mimir: ingestion rate drops when series cardinality causes index churn.
750
+
751
+ ### 3. Log Volume
752
+
753
+ - At 1 TB/day ingestion, Elasticsearch clusters require 3-5 TB of storage per day of retained logs (with replication).
754
+ - Log search latency degrades from <1s to >30s as daily volume crosses 500 GB
755
+ without proper index management.
756
+ - Splunk pricing at scale: $2-4 per GB ingested/day, making 1 TB/day = $730K-1.46M/year.
757
+
758
+ ### 4. Trace Storage
759
+
760
+ - A single trace with 50 spans averages 5-15 KB.
761
+ - At 10,000 rps with 1% sampling: 100 traces/sec = 500-1,500 KB/sec = 43-130 GB/day.
762
+ - At 10,000 rps with 100% sampling: 10,000 traces/sec = 50-150 MB/sec = 4.3-13 TB/day.
763
+ - Elasticsearch storage for traces: ~$0.10-0.30/GB/month.
764
+ - S3 (Tempo): ~$0.023/GB/month (roughly 4-13x cheaper on raw storage price).
765
+
766
+ ---
767
+
768
+ ## Anti-Patterns
769
+
770
+ ### 1. Logging in Hot Paths
771
+
772
+ ```
773
+ ANTI-PATTERN:
774
+ for item in million_items:
775
+     logger.debug(f"Processing item {item.id}") # 1M log lines
776
+     process(item)
777
+
778
+ FIX:
779
+ logger.info(f"Processing {len(million_items)} items")
780
+ for item in million_items:
781
+     process(item)
782
+ logger.info(f"Completed processing {len(million_items)} items")
783
+ ```
784
+
785
+ **Impact**: Synchronous logging in tight loops can add 10-100x overhead. A `logger.debug()`
786
+ call costs ~1-5 microseconds even when DEBUG is disabled (due to string formatting and
787
+ level check). At 1M iterations, that's 1-5 seconds of pure logging overhead.
788
+
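If per-item detail is still needed while debugging, the cost can be limited by checking the level once and letting the logger defer string formatting. The sketch below uses only the standard `logging` module; the function name `process_all` and the logger name `batch` are illustrative.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("batch")


def process_all(items, process):
    """Process items with summary logs; per-item logs only when DEBUG is enabled."""
    logger.info("Processing %d items", len(items))
    # Check the level once, outside the loop, instead of on every iteration.
    debug_enabled = logger.isEnabledFor(logging.DEBUG)
    for item in items:
        if debug_enabled:
            # %-style arguments are only formatted if the record is actually emitted.
            logger.debug("Processing item %s", item)
        process(item)
    logger.info("Completed processing %d items", len(items))
    return len(items)


if __name__ == "__main__":
    process_all(list(range(1_000_000)), lambda item: None)
```

With DEBUG disabled, the loop body pays for a single boolean check per item rather than an f-string build plus a logger call.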
789
+ ### 2. Unbounded Cardinality Labels
790
+
791
+ ```
792
+ ANTI-PATTERN:
793
+ http_requests_total{user_id="12345", path="/api/users/12345/orders"}
794
+
795
+ FIX:
796
+ http_requests_total{user_tier="premium", path="/api/users/{id}/orders"}
797
+ ```
798
+
799
+ **Impact**: Prometheus documentation explicitly warns against this. A single metric with
800
+ a `user_id` label creates one time series per user. At 1M users, that's 1M series from
801
+ one metric -- causing memory exhaustion, slow queries, and potential TSDB corruption.
802
+
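A common way to get from the raw path to the templated one is to normalize it before it is used as a label value. The sketch below is one possible approach, assuming IDs appear as purely numeric or UUID path segments; `normalize_path` is a hypothetical helper, and frameworks that expose the matched route template make it unnecessary.

```python
import re

# Matches a path segment that is all digits or a UUID, followed by "/" or end of string.
_ID_SEGMENT = re.compile(
    r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)"
)


def normalize_path(path: str) -> str:
    """Drop the query string and replace ID-like segments with a fixed placeholder."""
    path = path.split("?", 1)[0]
    return _ID_SEGMENT.sub("/{id}", path)


if __name__ == "__main__":
    print(normalize_path("/api/users/12345/orders"))
    print(normalize_path("/api/users/12345?expand=orders"))
```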
803
+ ### 3. Not Sampling Traces
804
+
805
+ ```
806
+ ANTI-PATTERN:
807
+ # Collecting 100% of traces at 10,000 rps
808
+ # = 864 million traces/day
809
+ # = 4-12 TB/day storage
810
+ # = $12K-108K/month on Elasticsearch (assuming 30-day retention at $0.10-0.30/GB/month)
811
+
812
+ FIX:
813
+ # Tail-based sampling: errors + slow + 1% baseline
814
+ # = ~10-20 million traces/day (97-99% reduction)
815
+ # = 50-300 GB/day storage
816
+ # = $150-2,700/month on Elasticsearch (assuming 30-day retention at $0.10-0.30/GB/month)
817
+ # Or $35-207/month on S3/Tempo (at ~$0.023/GB/month)
818
+ ```
819
+
820
+ ### 4. Alerting on Raw Thresholds
821
+
822
+ ```
823
+ ANTI-PATTERN:
824
+ alert: HighLatency
825
+ expr: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) > 0.2
826
+ # Fires on every brief spike, creating alert fatigue
827
+
828
+ FIX:
829
+ alert: LatencySLOBurnRateHigh
830
+ expr: |
831
+   (error_rate_6h / slo_error_rate) > 6
832
+   AND
833
+   (error_rate_30m / slo_error_rate) > 6
834
+ # Only fires when error budget is being consumed at 6x rate
835
+ ```
836
+
837
+ ### 5. Correlating Telemetry Without Shared Context
838
+
839
+ ```
840
+ ANTI-PATTERN:
841
+ # Logs, metrics, and traces with no shared identifiers
842
+ # Debugging requires manual timestamp correlation across 3 systems
843
+
844
+ FIX:
845
+ # Inject trace_id and span_id into log records
846
+ # Use exemplars to link metrics to traces
847
+ # Use OTel resource attributes for service identity
848
+ log.info("Payment processed",
849
+          extra={"trace_id": format(span.get_span_context().trace_id, "032x"),  # int -> 32-char hex
850
+                 "order_id": order.id,
851
+                 "duration_ms": elapsed})
852
+ ```
853
+
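Passing `extra=` on every call is easy to forget, so one option is a `logging.Filter` that stamps the current trace ID onto every record. The sketch below uses only the standard library and assumes the trace ID is held in a `contextvars.ContextVar`; in a real service it would be read from the active span instead. `current_trace_id`, `TraceIdFilter`, `build_logger`, and `demo` are hypothetical names.

```python
import contextvars
import io
import logging

# Assumption: request middleware stores the active trace ID here.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("payments")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.handlers = [handler]
    return logger


def demo() -> str:
    """Log one line with a trace ID set and return what was written."""
    buffer = io.StringIO()
    log = build_logger(buffer)
    token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    try:
        log.info("Payment processed")
    finally:
        current_trace_id.reset(token)
    return buffer.getvalue()


if __name__ == "__main__":
    print(demo(), end="")
```

Every log line then carries a `trace_id` field that can be searched for directly in the tracing backend.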
854
+ ### 6. Over-Instrumenting Everything
855
+
856
+ ```
857
+ ANTI-PATTERN:
858
+ # Spans for every function call, including utility functions
859
+ with tracer.start_span("string_format"):
860
+     result = f"{first} {last}"
861
+ with tracer.start_span("list_append"):
862
+     items.append(result)
863
+
864
+ FIX:
865
+ # Spans only for meaningful operations (>1ms, I/O, cross-service)
866
+ with tracer.start_span("fetch_user_profile"):
867
+     profile = await db.query("SELECT * FROM users WHERE id = ?", user_id)
868
+ ```
869
+
870
+ ---
871
+
872
+ ## Before/After: Observability-Driven Performance Fixes
873
+
874
+ ### Case 1: Mystery Latency Spikes
875
+
876
+ **Before observability:**
877
+ - p99 latency: 2.3 seconds (SLO: 500ms)
878
+ - Team suspects "the database is slow" but cannot prove it
879
+ - Random restarts as mitigation strategy
880
+ - Mean time to resolution (MTTR): 4-8 hours
881
+
882
+ **After adding distributed tracing:**
883
+ - Trace waterfall reveals: 1.8 seconds spent in a downstream auth service
884
+ - Auth service making 3 sequential calls to a token validation endpoint
885
+ - Each call: 600ms (includes 400ms DNS resolution due to misconfigured resolver)
886
+ - Fix: Cache DNS + parallelize token validation
887
+ - p99 latency: 180ms (92% reduction)
888
+ - MTTR for similar issues: 15-30 minutes
889
+
890
+ ### Case 2: Gradual Throughput Degradation
891
+
892
+ **Before observability:**
893
+ - Throughput drops from 5,000 rps to 2,000 rps over 2 weeks
894
+ - No clear correlation with any deployment
895
+ - Team adds more instances (cost +60%)
896
+
897
+ **After adding USE method metrics:**
898
+ - CPU utilization: 45% (not the bottleneck)
899
+ - Memory: 70% (not the bottleneck)
900
+ - Connection pool saturation: 98% (FOUND IT)
901
+ - Database connection pool of 20 connections shared across 50 threads
902
+ - Fix: Increase pool to 50, add connection pool metrics to dashboard
903
+ - Throughput restored to 5,000 rps. Removed extra instances (cost -38%)
904
+
905
+ ### Case 3: Intermittent Error Spikes
906
+
907
+ **Before observability:**
908
+ - 0.5% error rate (SLO: 0.1%) but only during peak hours
909
+ - Errors appear random across services
910
+ - No correlation found in application logs
911
+
912
+ **After adding RED method + correlated logs/traces:**
913
+ - RED metrics show errors correlate with request rate > 3,000 rps
914
+ - Traces reveal: payment service timeout at exactly 30 seconds (default HTTP timeout)
915
+ - Correlated logs show: connection pool exhaustion in payment provider SDK
916
+ - Fix: Increase timeout, add circuit breaker, add bulkhead isolation
917
+ - Error rate: 0.02% (80% below SLO target)
918
+
919
+ ---
920
+
921
+ ## Decision Tree: What Should I Monitor?
922
+
923
+ ```
924
+ START: "I need to monitor my system for performance"
925
+
926
+ ├── Q1: "What type of system component?"
927
+ │ │
928
+ │ ├── Infrastructure (CPU, memory, disk, network)
929
+ │ │ └── USE Method
930
+ │ │ ├── Utilization: cpu_usage_percent, memory_used_bytes
931
+ │ │ ├── Saturation: cpu_runqueue_length, disk_io_queue
932
+ │ │ └── Errors: disk_errors_total, network_drops_total
933
+ │ │
934
+ │ ├── Service / API endpoint
935
+ │ │ └── RED Method
936
+ │ │ ├── Rate: http_requests_total (counter)
937
+ │ │ ├── Errors: http_errors_total or status 5xx rate
938
+ │ │ └── Duration: http_request_duration_seconds (histogram)
939
+ │ │
940
+ │ ├── Database
941
+ │ │ ├── Query latency (p50, p95, p99)
942
+ │ │ ├── Connection pool (active, idle, waiting, timeouts)
943
+ │ │ ├── Slow query log (queries > 100ms)
944
+ │ │ ├── Lock contention (lock wait time, deadlocks)
945
+ │ │ └── Replication lag (seconds behind primary)
946
+ │ │
947
+ │ ├── Message Queue (Kafka, RabbitMQ, SQS)
948
+ │ │ ├── Consumer lag (messages behind)
949
+ │ │ ├── Produce/consume rate (messages/sec)
950
+ │ │ ├── Processing duration per message
951
+ │ │ └── Dead letter queue depth
952
+ │ │
953
+ │ └── External Dependency
954
+ │ ├── Availability (success rate of outbound calls)
955
+ │ ├── Latency (p99 of outbound call duration)
956
+ │ ├── Circuit breaker state (open/closed/half-open)
957
+ │ └── Retry rate and exhaustion count
958
+
959
+ ├── Q2: "What's my traffic volume?"
960
+ │ │
961
+ │ ├── < 100 rps → Trace 100%, basic metrics, structured logs
962
+ │ ├── 100-1K rps → Trace 10-50%, RED+USE metrics, warn+ logs
963
+ │ ├── 1K-10K rps → Trace 1-10%, full metrics, aggregated logs
964
+ │ └── > 10K rps → Trace 0.1-1% (tail-based), metrics only, sampled logs
965
+
966
+ ├── Q3: "Do I have SLOs defined?"
967
+ │ │
968
+ │ ├── No → Define SLOs first:
969
+ │ │ - Availability SLO (99.9% typical)
970
+ │ │ - Latency SLO (99% of requests < Xms)
971
+ │ │ - Set up error budget tracking
972
+ │ │ - Configure burn rate alerts
973
+ │ │
974
+ │ └── Yes → Monitor SLI metrics continuously
975
+ │ - Track error budget consumption
976
+ │ - Set multi-window burn rate alerts
977
+ │ - Review SLOs quarterly
978
+
979
+ └── Q4: "What's my observability budget?"
980
+
981
+ ├── Minimal ($0-500/mo)
982
+ │ └── Prometheus + Grafana + Loki (self-hosted)
983
+ │ Tempo for traces, head-based sampling
984
+
985
+ ├── Moderate ($500-5K/mo)
986
+ │ └── Grafana Cloud or self-hosted with Thanos
987
+ │ Tail-based sampling, 14-day retention
988
+
989
+ └── Enterprise ($5K+/mo)
990
+ └── Datadog, New Relic, or Splunk
991
+ Full APM, anomaly detection, 30+ day retention
992
+ Or: self-hosted OTel + ClickHouse at scale
993
+ ```
994
+
995
+ ---
996
+
997
+ ## Quick Reference
998
+
999
+ ### Observability Stack Recommendations by Scale
1000
+
1001
+ | Scale | Metrics | Logs | Traces | Cost/mo (approx) |
1002
+ |--------------------|-------------------|-------------------|------------------|--------------------|
1003
+ | Startup (<10 svcs) | Prometheus+Grafana| Loki | Jaeger | $0-200 (self-host) |
1004
+ | Growth (10-50 svcs)| Grafana Cloud | Grafana Cloud Logs| Tempo | $500-3,000 |
1005
+ | Scale (50-200 svcs)| Mimir/Thanos | Loki/Elasticsearch| Tempo | $3,000-15,000 |
1006
+ | Enterprise (200+) | Datadog/Mimir | Splunk/Elasticsearch| Tempo/Datadog | $15,000-100,000+ |
1007
+
1008
+ ### Performance Monitoring Checklist
1009
+
1010
+ ```
1011
+ [ ] RED metrics on every service endpoint
1012
+ [ ] USE metrics on every infrastructure resource
1013
+ [ ] SLOs defined for latency, availability, and throughput
1014
+ [ ] Error budget tracking with burn rate alerts
1015
+ [ ] Distributed tracing with appropriate sampling
1016
+ [ ] Structured logging with trace ID correlation
1017
+ [ ] Dashboards: service overview, resource saturation, SLO status
1018
+ [ ] Runbooks linked to every alert
1019
+ [ ] Cardinality review (quarterly)
1020
+ [ ] Observability cost review (monthly)
1021
+ ```
1022
+
1023
+ ### Key Thresholds
1024
+
1025
+ ```
1026
+ Metric                    Warning           Critical
1027
+ ──────────────────────────────────────────────────────
1028
+ CPU utilization           >70%              >90%
1029
+ Memory utilization        >80%              >90%
1030
+ Disk I/O utilization      >70%              >85%
1031
+ Connection pool usage     >75%              >90%
1032
+ Error budget burn rate    >3x               >14.4x
1033
+ p99 latency vs SLO        >80% of target    >100% of target
1034
+ GC pause time (JVM)       >200ms            >500ms
1035
+ Thread pool queue depth   >100              >1000
1036
+ ```
1037
+
1038
+ ---
1039
+
1040
+ ## Sources
1041
+
1042
+ - [OpenTelemetry Performance Benchmark Specification](https://opentelemetry.io/docs/specs/otel/performance-benchmark/)
1043
+ - [OTel Component Performance Benchmarks](https://opentelemetry.io/blog/2023/perf-testing/)
1044
+ - [OpenTelemetry Java Agent Performance](https://opentelemetry.io/docs/zero-code/java/agent/performance/)
1045
+ - [OpenTelemetry for Go: Measuring the Overhead (Coroot)](https://coroot.com/blog/opentelemetry-for-go-measuring-the-overhead/)
1046
+ - [Evaluating OpenTelemetry's Impact on Performance in Microservice Architectures (Umeå University)](https://umu.diva-portal.org/smash/get/diva2:1877027/FULLTEXT01.pdf)
1047
+ - [Performance Overhead and Optimization Strategies in OpenTelemetry (TechRxiv)](https://www.techrxiv.org/users/937157/articles/1334227-performance-overhead-and-optimization-strategies-in-opentelemetry)
1048
+ - [How to Reduce OpenTelemetry Performance Overhead by 50% (OneUptime)](https://oneuptime.com/blog/post/2026-02-06-reduce-opentelemetry-performance-overhead-production/view)
1049
+ - [OTelBench: Benchmark OpenTelemetry Infrastructure (Quesma/InfoQ)](https://www.infoq.com/news/2026/02/quesma-otel-bench-performance-ai/)
1050
+ - [Jaeger v2 Released: OpenTelemetry in the Core (CNCF)](https://www.cncf.io/blog/2024/11/12/jaeger-v2-released-opentelemetry-in-the-core/)
1051
+ - [Jaeger vs Zipkin vs Grafana Tempo (CoderSociety)](https://codersociety.com/blog/articles/jaeger-vs-zipkin-vs-tempo)
1052
+ - [Grafana Tempo vs Jaeger (Last9)](https://last9.io/blog/grafana-tempo-vs-jaeger/)
1053
+ - [The RED Method: How to Instrument Your Services (Grafana Labs)](https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/)
1054
+ - [RED and USE Metrics for Monitoring (Better Stack)](https://betterstack.com/community/guides/monitoring/red-use-metrics/)
1055
+ - [Monitoring Methodologies: RED and USE (The New Stack)](https://thenewstack.io/monitoring-methodologies-red-and-use/)
1056
+ - [Three Pillars of Observability (IBM)](https://www.ibm.com/think/insights/observability-pillars)
1057
+ - [Three Pillars of Observability (CrowdStrike)](https://www.crowdstrike.com/en-us/cybersecurity-101/observability/three-pillars-of-observability/)
1058
+ - [Google SRE Workbook: Alerting on SLOs](https://sre.google/workbook/alerting-on-slos/)
1059
+ - [Google SRE Workbook: Implementing SLOs](https://sre.google/workbook/implementing-slos/)
1060
+ - [A Complete Guide to Error Budgets (Nobl9)](https://www.nobl9.com/resources/a-complete-guide-to-error-budgets-setting-up-slos-slis-and-slas-to-maintain-reliability)
1061
+ - [Burn Rate Alerts (Datadog)](https://docs.datadoghq.com/service_management/service_level_objectives/burn_rate/)
1062
+ - [Multi-Window Multi-Burn-Rate Alerts (Grafana Labs)](https://grafana.com/blog/how-to-implement-multi-window-multi-burn-rate-alerts-with-grafana-cloud/)
1063
+ - [Alerting on SLOs Like Pros (SoundCloud)](https://developers.soundcloud.com/blog/alerting-on-slos/)
1064
+ - [SLO/SLA-Driven Monitoring Requirements 2025 (Uptrace)](https://uptrace.dev/blog/sla-slo-monitoring-requirements)
1065
+ - [OpenTelemetry Sampling Concepts](https://opentelemetry.io/docs/concepts/sampling/)
1066
+ - [Tail Sampling with OpenTelemetry](https://opentelemetry.io/blog/2022/tail-sampling/)
1067
+ - [Head-Based vs Tail-Based Sampling (CubeAPM)](https://cubeapm.com/blog/head-based-vs-tail-based-sampling/)
1068
+ - [Mastering Distributed Tracing Sampling (Datadog)](https://www.datadoghq.com/architecture/mastering-distributed-tracing-data-volume-challenges-and-datadogs-approach-to-efficient-sampling/)
1069
+ - [Observability TCO and Cost Reduction (ClickHouse)](https://clickhouse.com/resources/engineering/observability-tco-cost-reduction)
1070
+ - [The High-Cardinality Trap (ClickHouse)](https://clickhouse.com/resources/engineering/high-cardinality-slow-observability-challenge)
1071
+ - [Three Observability Anti-Patterns (Chronosphere)](https://chronosphere.io/learn/three-pesky-observability-anti-patterns-that-impact-developer-efficiency/)
1072
+ - [Metric Cardinality Explained (Groundcover)](https://www.groundcover.com/learn/observability/metric-cardinality)
1073
+ - [Prometheus vs Datadog Comparison 2024 (Squadcast)](https://medium.com/@squadcast/prometheus-vs-datadog-a-complete-comparison-guide-for-2024-7713d87d34a5)
1074
+ - [Datadog vs Prometheus Comparison 2026 (Better Stack)](https://betterstack.com/community/comparisons/datadog-vs-prometheus/)
1075
+ - [Hidden Costs in Observability 2026 (Grepr AI)](https://www.grepr.ai/blog/the-hidden-cost-in-observability)
1076
+ - [How Much Should Observability Cost (Honeycomb)](https://www.honeycomb.io/blog/how-much-should-i-spend-on-observability-pt1)
1077
+ - [Observability Trends 2026 (Elastic)](https://www.elastic.co/blog/2026-observability-trends-costs-business-impact)
1078
+ - [OpenTelemetry Java Metrics Performance Comparison](https://opentelemetry.io/blog/2024/java-metric-systems-compared/)
1079
+ - [OpenTelemetry and Grafana Labs 2025](https://grafana.com/blog/opentelemetry-and-grafana-labs-whats-new-and-whats-next-in-2025/)