aigroup-workflow 2.2.0 → 2.2.2

This diff compares the contents of publicly released package versions as they appear in their respective public registries. It is provided for informational purposes only.
Files changed (645)
  1. package/.claude/commands/fix-build.md +10 -5
  2. package/.claude/commands/init-project.md +13 -8
  3. package/.claude/commands/plan.md +15 -8
  4. package/.claude/commands/review.md +12 -6
  5. package/.claude/commands/tdd.md +11 -5
  6. package/.claude/commands/workflow-start.md +20 -11
  7. package/.claude/settings.json +28 -0
  8. package/.codex/agents/architect.toml +207 -0
  9. package/.codex/agents/build-error-resolver.toml +110 -0
  10. package/.codex/agents/code-reviewer.toml +233 -0
  11. package/.codex/agents/doc-updater.toml +103 -0
  12. package/.codex/agents/e2e-runner.toml +103 -0
  13. package/.codex/agents/get-current-datetime.toml +23 -0
  14. package/.codex/agents/init-architect.toml +181 -0
  15. package/.codex/agents/planner.toml +208 -0
  16. package/.codex/agents/refactor-cleaner.toml +81 -0
  17. package/.codex/agents/rust-reviewer.toml +90 -0
  18. package/.codex/agents/security-reviewer.toml +104 -0
  19. package/.codex/agents/tdd-guide.toml +87 -0
  20. package/AGENTS.md +2 -2
  21. package/CLAUDE.md +23 -1
  22. package/LICENSE +20 -20
  23. package/README.md +333 -333
  24. package/agents/a11y-architect.md +141 -141
  25. package/agents/architect.md +211 -211
  26. package/agents/build-error-resolver.md +114 -114
  27. package/agents/chief-of-staff.md +151 -151
  28. package/agents/code-architect.md +71 -71
  29. package/agents/code-explorer.md +69 -69
  30. package/agents/code-reviewer.md +237 -237
  31. package/agents/code-simplifier.md +47 -47
  32. package/agents/comment-analyzer.md +45 -45
  33. package/agents/conversation-analyzer.md +52 -52
  34. package/agents/cpp-build-resolver.md +90 -90
  35. package/agents/cpp-reviewer.md +72 -72
  36. package/agents/csharp-reviewer.md +101 -101
  37. package/agents/dart-build-resolver.md +201 -201
  38. package/agents/database-reviewer.md +91 -91
  39. package/agents/doc-updater.md +107 -107
  40. package/agents/docs-lookup.md +68 -68
  41. package/agents/e2e-runner.md +107 -107
  42. package/agents/flutter-reviewer.md +243 -243
  43. package/agents/gan-evaluator.md +209 -209
  44. package/agents/gan-generator.md +131 -131
  45. package/agents/gan-planner.md +99 -99
  46. package/agents/get-current-datetime.md +26 -26
  47. package/agents/go-build-resolver.md +94 -94
  48. package/agents/go-reviewer.md +76 -76
  49. package/agents/harness-optimizer.md +35 -35
  50. package/agents/healthcare-reviewer.md +83 -83
  51. package/agents/java-build-resolver.md +153 -153
  52. package/agents/java-reviewer.md +92 -92
  53. package/agents/kotlin-build-resolver.md +118 -118
  54. package/agents/kotlin-reviewer.md +159 -159
  55. package/agents/loop-operator.md +36 -36
  56. package/agents/opensource-forker.md +198 -198
  57. package/agents/opensource-packager.md +249 -249
  58. package/agents/opensource-sanitizer.md +188 -188
  59. package/agents/performance-optimizer.md +446 -446
  60. package/agents/planner.md +212 -212
  61. package/agents/pr-test-analyzer.md +45 -45
  62. package/agents/python-reviewer.md +98 -98
  63. package/agents/pytorch-build-resolver.md +120 -120
  64. package/agents/refactor-cleaner.md +85 -85
  65. package/agents/rust-build-resolver.md +148 -148
  66. package/agents/rust-reviewer.md +94 -94
  67. package/agents/security-reviewer.md +108 -108
  68. package/agents/seo-specialist.md +59 -59
  69. package/agents/silent-failure-hunter.md +50 -50
  70. package/agents/tdd-guide.md +91 -91
  71. package/agents/type-design-analyzer.md +41 -41
  72. package/agents/typescript-reviewer.md +112 -112
  73. package/cli/commands/update.mjs +1 -1
  74. package/cli/utils/scaffold.mjs +53 -0
  75. package/docs/rules/agents.md +166 -50
  76. package/docs/rules/cpp/coding-style.md +44 -44
  77. package/docs/rules/cpp/hooks.md +39 -39
  78. package/docs/rules/cpp/patterns.md +51 -51
  79. package/docs/rules/cpp/security.md +51 -51
  80. package/docs/rules/cpp/testing.md +44 -44
  81. package/docs/rules/csharp/coding-style.md +72 -72
  82. package/docs/rules/csharp/hooks.md +25 -25
  83. package/docs/rules/csharp/patterns.md +50 -50
  84. package/docs/rules/csharp/security.md +58 -58
  85. package/docs/rules/csharp/testing.md +46 -46
  86. package/docs/rules/dart/coding-style.md +159 -159
  87. package/docs/rules/dart/hooks.md +66 -66
  88. package/docs/rules/dart/patterns.md +261 -261
  89. package/docs/rules/dart/security.md +135 -135
  90. package/docs/rules/dart/testing.md +215 -215
  91. package/docs/rules/golang/coding-style.md +32 -32
  92. package/docs/rules/golang/hooks.md +17 -17
  93. package/docs/rules/golang/patterns.md +45 -45
  94. package/docs/rules/golang/security.md +34 -34
  95. package/docs/rules/golang/testing.md +31 -31
  96. package/docs/rules/java/coding-style.md +114 -114
  97. package/docs/rules/java/hooks.md +18 -18
  98. package/docs/rules/java/patterns.md +146 -146
  99. package/docs/rules/java/security.md +100 -100
  100. package/docs/rules/java/testing.md +131 -131
  101. package/docs/rules/kotlin/coding-style.md +86 -86
  102. package/docs/rules/kotlin/hooks.md +17 -17
  103. package/docs/rules/kotlin/patterns.md +146 -146
  104. package/docs/rules/kotlin/security.md +82 -82
  105. package/docs/rules/kotlin/testing.md +128 -128
  106. package/docs/rules/perl/coding-style.md +46 -46
  107. package/docs/rules/perl/hooks.md +22 -22
  108. package/docs/rules/perl/patterns.md +76 -76
  109. package/docs/rules/perl/security.md +69 -69
  110. package/docs/rules/perl/testing.md +54 -54
  111. package/docs/rules/php/coding-style.md +40 -40
  112. package/docs/rules/php/hooks.md +24 -24
  113. package/docs/rules/php/patterns.md +33 -33
  114. package/docs/rules/php/security.md +37 -37
  115. package/docs/rules/php/testing.md +39 -39
  116. package/docs/rules/python/coding-style.md +42 -42
  117. package/docs/rules/python/hooks.md +19 -19
  118. package/docs/rules/python/patterns.md +39 -39
  119. package/docs/rules/python/security.md +30 -30
  120. package/docs/rules/python/testing.md +38 -38
  121. package/docs/rules/rust/coding-style.md +151 -151
  122. package/docs/rules/rust/hooks.md +16 -16
  123. package/docs/rules/rust/patterns.md +168 -168
  124. package/docs/rules/rust/security.md +141 -141
  125. package/docs/rules/rust/testing.md +154 -154
  126. package/docs/rules/swift/coding-style.md +47 -47
  127. package/docs/rules/swift/hooks.md +20 -20
  128. package/docs/rules/swift/patterns.md +66 -66
  129. package/docs/rules/swift/security.md +33 -33
  130. package/docs/rules/swift/testing.md +45 -45
  131. package/docs/rules/typescript/coding-style.md +199 -199
  132. package/docs/rules/typescript/hooks.md +22 -22
  133. package/docs/rules/typescript/patterns.md +52 -52
  134. package/docs/rules/typescript/security.md +28 -28
  135. package/docs/rules/typescript/testing.md +18 -18
  136. package/docs/rules/web/coding-style.md +96 -96
  137. package/docs/rules/web/design-quality.md +62 -62
  138. package/docs/rules/web/hooks.md +120 -120
  139. package/docs/rules/web/patterns.md +79 -79
  140. package/docs/rules/web/performance.md +64 -64
  141. package/docs/rules/web/security.md +57 -57
  142. package/docs/rules/web/testing.md +55 -55
  143. package/docs/templates/README.md +36 -36
  144. package/docs/templates/ai-project-final.md +124 -124
  145. package/docs/templates/ai-project.md +105 -105
  146. package/docs/templates/api.md +157 -157
  147. package/docs/templates/bug.md +62 -62
  148. package/docs/templates/code-review.md +87 -87
  149. package/docs/templates/generic.md +116 -116
  150. package/docs/templates/implementation-plan.md +1 -1
  151. package/docs/templates/meeting.md +68 -68
  152. package/docs/templates/prd.md +98 -98
  153. package/docs/templates/ui.md +134 -134
  154. package/docs/workflow-pipeline.md +11 -10
  155. package/package.json +40 -39
  156. package/scripts/hooks/checks/orchestration-artifacts.cjs +28 -23
  157. package/scripts/hooks/checks/workflow-state.cjs +4 -5
  158. package/scripts/orchestration/lib/orchestrator.cjs +344 -117
  159. package/scripts/orchestration/lib/validate.cjs +145 -0
  160. package/scripts/orchestration/session.cjs +88 -44
  161. package/skills/SUPERPOWERS-LICENSE +21 -21
  162. package/skills/ai-ml/fine-tuning-expert/SKILL.md +162 -162
  163. package/skills/ai-ml/fine-tuning-expert/references/dataset-preparation.md +540 -540
  164. package/skills/ai-ml/fine-tuning-expert/references/deployment-optimization.md +673 -673
  165. package/skills/ai-ml/fine-tuning-expert/references/evaluation-metrics.md +597 -597
  166. package/skills/ai-ml/fine-tuning-expert/references/hyperparameter-tuning.md +565 -565
  167. package/skills/ai-ml/fine-tuning-expert/references/lora-peft.md +347 -347
  168. package/skills/ai-ml/ml-pipeline/SKILL.md +159 -159
  169. package/skills/ai-ml/ml-pipeline/references/experiment-tracking.md +833 -833
  170. package/skills/ai-ml/ml-pipeline/references/feature-engineering.md +631 -631
  171. package/skills/ai-ml/ml-pipeline/references/model-validation.md +978 -978
  172. package/skills/ai-ml/ml-pipeline/references/pipeline-orchestration.md +907 -907
  173. package/skills/ai-ml/ml-pipeline/references/training-pipelines.md +782 -782
  174. package/skills/ai-ml/rag-architect/SKILL.md +194 -194
  175. package/skills/ai-ml/rag-architect/references/chunking-strategies.md +878 -878
  176. package/skills/ai-ml/rag-architect/references/embedding-models.md +561 -561
  177. package/skills/ai-ml/rag-architect/references/rag-evaluation.md +833 -833
  178. package/skills/ai-ml/rag-architect/references/retrieval-optimization.md +795 -795
  179. package/skills/ai-ml/rag-architect/references/vector-databases.md +589 -589
  180. package/skills/ai-ml/spark-engineer/SKILL.md +148 -148
  181. package/skills/ai-ml/spark-engineer/references/partitioning-caching.md +543 -543
  182. package/skills/ai-ml/spark-engineer/references/performance-tuning.md +544 -544
  183. package/skills/ai-ml/spark-engineer/references/rdd-operations.md +599 -599
  184. package/skills/ai-ml/spark-engineer/references/spark-sql-dataframes.md +474 -474
  185. package/skills/ai-ml/spark-engineer/references/streaming-patterns.md +786 -786
  186. package/skills/backend/api-designer/SKILL.md +217 -217
  187. package/skills/backend/api-designer/references/error-handling.md +541 -541
  188. package/skills/backend/api-designer/references/openapi.md +824 -824
  189. package/skills/backend/api-designer/references/pagination.md +494 -494
  190. package/skills/backend/api-designer/references/rest-patterns.md +335 -335
  191. package/skills/backend/api-designer/references/versioning.md +391 -391
  192. package/skills/backend/architecture-designer/SKILL.md +117 -117
  193. package/skills/backend/architecture-designer/references/adr-template.md +116 -116
  194. package/skills/backend/architecture-designer/references/architecture-patterns.md +111 -111
  195. package/skills/backend/architecture-designer/references/database-selection.md +102 -102
  196. package/skills/backend/architecture-designer/references/nfr-checklist.md +112 -112
  197. package/skills/backend/architecture-designer/references/system-design.md +100 -100
  198. package/skills/backend/code-documenter/SKILL.md +147 -147
  199. package/skills/backend/code-documenter/references/api-docs-fastapi-django.md +166 -166
  200. package/skills/backend/code-documenter/references/api-docs-nestjs-express.md +220 -220
  201. package/skills/backend/code-documenter/references/coverage-reports.md +125 -125
  202. package/skills/backend/code-documenter/references/documentation-systems.md +333 -333
  203. package/skills/backend/code-documenter/references/interactive-api-docs.md +531 -531
  204. package/skills/backend/code-documenter/references/python-docstrings.md +121 -121
  205. package/skills/backend/code-documenter/references/typescript-jsdoc.md +145 -145
  206. package/skills/backend/code-documenter/references/user-guides-tutorials.md +530 -530
  207. package/skills/backend/debugging-wizard/SKILL.md +105 -105
  208. package/skills/backend/debugging-wizard/references/common-patterns.md +132 -132
  209. package/skills/backend/debugging-wizard/references/debugging-tools.md +140 -140
  210. package/skills/backend/debugging-wizard/references/quick-fixes.md +177 -177
  211. package/skills/backend/debugging-wizard/references/strategies.md +142 -142
  212. package/skills/backend/debugging-wizard/references/systematic-debugging.md +367 -367
  213. package/skills/backend/feature-forge/SKILL.md +98 -98
  214. package/skills/backend/feature-forge/references/acceptance-criteria.md +104 -104
  215. package/skills/backend/feature-forge/references/ears-syntax.md +99 -99
  216. package/skills/backend/feature-forge/references/interview-questions.md +150 -150
  217. package/skills/backend/feature-forge/references/pre-discovery-subagents.md +54 -54
  218. package/skills/backend/feature-forge/references/specification-template.md +103 -103
  219. package/skills/backend/fullstack-guardian/SKILL.md +105 -105
  220. package/skills/backend/fullstack-guardian/references/api-design-standards.md +307 -307
  221. package/skills/backend/fullstack-guardian/references/architecture-decisions.md +350 -350
  222. package/skills/backend/fullstack-guardian/references/backend-patterns.md +237 -237
  223. package/skills/backend/fullstack-guardian/references/common-patterns.md +134 -134
  224. package/skills/backend/fullstack-guardian/references/deliverables-checklist.md +354 -354
  225. package/skills/backend/fullstack-guardian/references/design-template.md +91 -91
  226. package/skills/backend/fullstack-guardian/references/error-handling.md +135 -135
  227. package/skills/backend/fullstack-guardian/references/frontend-patterns.md +340 -340
  228. package/skills/backend/fullstack-guardian/references/integration-patterns.md +333 -333
  229. package/skills/backend/fullstack-guardian/references/security-checklist.md +106 -106
  230. package/skills/backend/graphql-architect/SKILL.md +146 -146
  231. package/skills/backend/graphql-architect/references/federation.md +418 -418
  232. package/skills/backend/graphql-architect/references/migration-from-rest.md +1141 -1141
  233. package/skills/backend/graphql-architect/references/resolvers.md +425 -425
  234. package/skills/backend/graphql-architect/references/schema-design.md +393 -393
  235. package/skills/backend/graphql-architect/references/security.md +569 -569
  236. package/skills/backend/graphql-architect/references/subscriptions.md +510 -510
  237. package/skills/backend/legacy-modernizer/SKILL.md +137 -137
  238. package/skills/backend/legacy-modernizer/references/legacy-testing.md +381 -381
  239. package/skills/backend/legacy-modernizer/references/migration-strategies.md +423 -423
  240. package/skills/backend/legacy-modernizer/references/refactoring-patterns.md +395 -395
  241. package/skills/backend/legacy-modernizer/references/strangler-fig-pattern.md +281 -281
  242. package/skills/backend/legacy-modernizer/references/system-assessment.md +487 -487
  243. package/skills/backend/microservices-architect/SKILL.md +164 -164
  244. package/skills/backend/microservices-architect/references/communication.md +499 -499
  245. package/skills/backend/microservices-architect/references/data.md +721 -721
  246. package/skills/backend/microservices-architect/references/decomposition.md +344 -344
  247. package/skills/backend/microservices-architect/references/observability.md +805 -805
  248. package/skills/backend/microservices-architect/references/patterns.md +603 -603
  249. package/skills/database/database-optimizer/SKILL.md +147 -147
  250. package/skills/database/database-optimizer/references/index-strategies.md +331 -331
  251. package/skills/database/database-optimizer/references/monitoring-analysis.md +501 -501
  252. package/skills/database/database-optimizer/references/mysql-tuning.md +452 -452
  253. package/skills/database/database-optimizer/references/postgresql-tuning.md +413 -413
  254. package/skills/database/database-optimizer/references/query-optimization.md +251 -251
  255. package/skills/database/postgres-pro/SKILL.md +152 -152
  256. package/skills/database/postgres-pro/references/extensions.md +404 -404
  257. package/skills/database/postgres-pro/references/jsonb.md +321 -321
  258. package/skills/database/postgres-pro/references/maintenance.md +481 -481
  259. package/skills/database/postgres-pro/references/performance.md +265 -265
  260. package/skills/database/postgres-pro/references/replication.md +446 -446
  261. package/skills/database/sql-pro/SKILL.md +129 -129
  262. package/skills/database/sql-pro/references/database-design.md +402 -402
  263. package/skills/database/sql-pro/references/dialect-differences.md +419 -419
  264. package/skills/database/sql-pro/references/optimization.md +384 -384
  265. package/skills/database/sql-pro/references/query-patterns.md +285 -285
  266. package/skills/database/sql-pro/references/window-functions.md +328 -328
  267. package/skills/dotnet/csharp-developer/SKILL.md +125 -125
  268. package/skills/dotnet/csharp-developer/references/aspnet-core.md +394 -394
  269. package/skills/dotnet/csharp-developer/references/blazor.md +553 -553
  270. package/skills/dotnet/csharp-developer/references/entity-framework.md +409 -409
  271. package/skills/dotnet/csharp-developer/references/modern-csharp.md +248 -248
  272. package/skills/dotnet/csharp-developer/references/performance.md +498 -498
  273. package/skills/dotnet/dotnet-core-expert/SKILL.md +138 -138
  274. package/skills/dotnet/dotnet-core-expert/references/authentication.md +546 -546
  275. package/skills/dotnet/dotnet-core-expert/references/clean-architecture.md +455 -455
  276. package/skills/dotnet/dotnet-core-expert/references/cloud-native.md +548 -548
  277. package/skills/dotnet/dotnet-core-expert/references/entity-framework.md +440 -440
  278. package/skills/dotnet/dotnet-core-expert/references/minimal-apis.md +319 -319
  279. package/skills/frontend/angular-architect/SKILL.md +152 -152
  280. package/skills/frontend/angular-architect/references/components.md +297 -297
  281. package/skills/frontend/angular-architect/references/ngrx.md +401 -401
  282. package/skills/frontend/angular-architect/references/routing.md +361 -361
  283. package/skills/frontend/angular-architect/references/rxjs.md +319 -319
  284. package/skills/frontend/angular-architect/references/testing.md +405 -405
  285. package/skills/frontend/design-commands/design.md +91 -91
  286. package/skills/frontend/design-commands/handoff.md +97 -97
  287. package/skills/frontend/design-commands/prototype.md +120 -120
  288. package/skills/frontend/design-commands/spec.md +160 -160
  289. package/skills/frontend/design-commands/style.md +78 -78
  290. package/skills/frontend/flutter-expert/SKILL.md +138 -138
  291. package/skills/frontend/flutter-expert/references/bloc-state.md +259 -259
  292. package/skills/frontend/flutter-expert/references/gorouter-navigation.md +119 -119
  293. package/skills/frontend/flutter-expert/references/performance.md +99 -99
  294. package/skills/frontend/flutter-expert/references/project-structure.md +118 -118
  295. package/skills/frontend/flutter-expert/references/riverpod-state.md +130 -130
  296. package/skills/frontend/flutter-expert/references/widget-patterns.md +123 -123
  297. package/skills/frontend/nextjs-developer/SKILL.md +143 -143
  298. package/skills/frontend/nextjs-developer/references/app-router.md +311 -311
  299. package/skills/frontend/nextjs-developer/references/data-fetching.md +482 -482
  300. package/skills/frontend/nextjs-developer/references/deployment.md +545 -545
  301. package/skills/frontend/nextjs-developer/references/server-actions.md +462 -462
  302. package/skills/frontend/nextjs-developer/references/server-components.md +384 -384
  303. package/skills/frontend/react-expert/SKILL.md +149 -149
  304. package/skills/frontend/react-expert/references/hooks-patterns.md +162 -162
  305. package/skills/frontend/react-expert/references/migration-class-to-modern.md +1119 -1119
  306. package/skills/frontend/react-expert/references/performance.md +168 -168
  307. package/skills/frontend/react-expert/references/react-19-features.md +174 -174
  308. package/skills/frontend/react-expert/references/server-components.md +143 -143
  309. package/skills/frontend/react-expert/references/state-management.md +171 -171
  310. package/skills/frontend/react-expert/references/testing-react.md +174 -174
  311. package/skills/frontend/react-native-expert/SKILL.md +185 -185
  312. package/skills/frontend/react-native-expert/references/expo-router.md +187 -187
  313. package/skills/frontend/react-native-expert/references/list-optimization.md +204 -204
  314. package/skills/frontend/react-native-expert/references/platform-handling.md +188 -188
  315. package/skills/frontend/react-native-expert/references/project-structure.md +171 -171
  316. package/skills/frontend/react-native-expert/references/storage-hooks.md +173 -173
  317. package/skills/frontend/senior-frontend/SKILL.md +477 -477
  318. package/skills/frontend/senior-frontend/references/frontend_best_practices.md +806 -806
  319. package/skills/frontend/senior-frontend/references/nextjs_optimization_guide.md +724 -724
  320. package/skills/frontend/senior-frontend/references/react_patterns.md +746 -746
  321. package/skills/frontend/senior-frontend/scripts/bundle_analyzer.py +407 -407
  322. package/skills/frontend/senior-frontend/scripts/component_generator.py +329 -329
  323. package/skills/frontend/senior-frontend/scripts/frontend_scaffolder.py +1005 -1005
  324. package/skills/frontend/ui-ux-pro-max/SKILL.md +386 -386
  325. package/skills/frontend/ui-ux-pro-max/data/charts.csv +26 -26
  326. package/skills/frontend/ui-ux-pro-max/data/colors.csv +97 -97
  327. package/skills/frontend/ui-ux-pro-max/data/icons.csv +101 -101
  328. package/skills/frontend/ui-ux-pro-max/data/landing.csv +31 -31
  329. package/skills/frontend/ui-ux-pro-max/data/products.csv +96 -96
  330. package/skills/frontend/ui-ux-pro-max/data/react-performance.csv +45 -45
  331. package/skills/frontend/ui-ux-pro-max/data/stacks/astro.csv +54 -54
  332. package/skills/frontend/ui-ux-pro-max/data/stacks/flutter.csv +53 -53
  333. package/skills/frontend/ui-ux-pro-max/data/stacks/html-tailwind.csv +56 -56
  334. package/skills/frontend/ui-ux-pro-max/data/stacks/jetpack-compose.csv +53 -53
  335. package/skills/frontend/ui-ux-pro-max/data/stacks/nextjs.csv +53 -53
  336. package/skills/frontend/ui-ux-pro-max/data/stacks/nuxt-ui.csv +51 -51
  337. package/skills/frontend/ui-ux-pro-max/data/stacks/nuxtjs.csv +59 -59
  338. package/skills/frontend/ui-ux-pro-max/data/stacks/react-native.csv +52 -52
  339. package/skills/frontend/ui-ux-pro-max/data/stacks/react.csv +54 -54
  340. package/skills/frontend/ui-ux-pro-max/data/stacks/shadcn.csv +61 -61
  341. package/skills/frontend/ui-ux-pro-max/data/stacks/svelte.csv +54 -54
  342. package/skills/frontend/ui-ux-pro-max/data/stacks/swiftui.csv +51 -51
  343. package/skills/frontend/ui-ux-pro-max/data/stacks/vue.csv +50 -50
  344. package/skills/frontend/ui-ux-pro-max/data/styles.csv +68 -68
  345. package/skills/frontend/ui-ux-pro-max/data/typography.csv +57 -57
  346. package/skills/frontend/ui-ux-pro-max/data/ui-reasoning.csv +101 -101
  347. package/skills/frontend/ui-ux-pro-max/data/ux-guidelines.csv +99 -99
  348. package/skills/frontend/ui-ux-pro-max/data/web-interface.csv +31 -31
  349. package/skills/frontend/ui-ux-pro-max/scripts/core.py +253 -253
  350. package/skills/frontend/ui-ux-pro-max/scripts/design_system.py +1067 -1067
  351. package/skills/frontend/ui-ux-pro-max/scripts/search.py +114 -114
  352. package/skills/frontend/vue-expert/SKILL.md +98 -98
  353. package/skills/frontend/vue-expert/references/build-tooling.md +480 -480
  354. package/skills/frontend/vue-expert/references/components.md +448 -448
  355. package/skills/frontend/vue-expert/references/composition-api.md +299 -299
  356. package/skills/frontend/vue-expert/references/mobile-hybrid.md +636 -636
  357. package/skills/frontend/vue-expert/references/nuxt.md +669 -669
  358. package/skills/frontend/vue-expert/references/state-management.md +449 -449
  359. package/skills/frontend/vue-expert/references/typescript.md +584 -584
  360. package/skills/frontend/vue-expert-js/SKILL.md +167 -167
  361. package/skills/frontend/vue-expert-js/references/component-architecture.md +219 -219
  362. package/skills/frontend/vue-expert-js/references/composables-patterns.md +183 -183
  363. package/skills/frontend/vue-expert-js/references/jsdoc-typing.md +535 -535
  364. package/skills/frontend/vue-expert-js/references/state-management.md +249 -249
  365. package/skills/frontend/vue-expert-js/references/testing-patterns.md +237 -237
  366. package/skills/go-rust-cpp/cpp-pro/SKILL.md +115 -115
  367. package/skills/go-rust-cpp/cpp-pro/references/build-tooling.md +440 -440
  368. package/skills/go-rust-cpp/cpp-pro/references/concurrency.md +437 -437
  369. package/skills/go-rust-cpp/cpp-pro/references/memory-performance.md +397 -397
  370. package/skills/go-rust-cpp/cpp-pro/references/modern-cpp.md +304 -304
  371. package/skills/go-rust-cpp/cpp-pro/references/templates.md +357 -357
  372. package/skills/go-rust-cpp/golang-pro/SKILL.md +122 -122
  373. package/skills/go-rust-cpp/golang-pro/references/concurrency.md +329 -329
  374. package/skills/go-rust-cpp/golang-pro/references/generics.md +442 -442
  375. package/skills/go-rust-cpp/golang-pro/references/interfaces.md +432 -432
  376. package/skills/go-rust-cpp/golang-pro/references/project-structure.md +477 -477
  377. package/skills/go-rust-cpp/golang-pro/references/testing.md +451 -451
  378. package/skills/go-rust-cpp/rust-engineer/SKILL.md +167 -167
  379. package/skills/go-rust-cpp/rust-engineer/references/async.md +458 -458
  380. package/skills/go-rust-cpp/rust-engineer/references/error-handling.md +334 -334
  381. package/skills/go-rust-cpp/rust-engineer/references/ownership.md +278 -278
  382. package/skills/go-rust-cpp/rust-engineer/references/testing.md +470 -470
  383. package/skills/go-rust-cpp/rust-engineer/references/traits.md +413 -413
  384. package/skills/infra/cli-developer/SKILL.md +113 -113
  385. package/skills/infra/cli-developer/references/design-patterns.md +221 -221
  386. package/skills/infra/cli-developer/references/go-cli.md +540 -540
  387. package/skills/infra/cli-developer/references/node-cli.md +383 -383
  388. package/skills/infra/cli-developer/references/python-cli.md +422 -422
  389. package/skills/infra/cli-developer/references/ux-patterns.md +448 -448
  390. package/skills/infra/cloud-architect/SKILL.md +216 -216
  391. package/skills/infra/cloud-architect/references/aws.md +394 -394
  392. package/skills/infra/cloud-architect/references/azure.md +562 -562
  393. package/skills/infra/cloud-architect/references/cost.md +582 -582
  394. package/skills/infra/cloud-architect/references/gcp.md +633 -633
  395. package/skills/infra/cloud-architect/references/multi-cloud.md +483 -483
  396. package/skills/infra/devops-engineer/SKILL.md +144 -144
  397. package/skills/infra/devops-engineer/references/deployment-strategies.md +241 -241
  398. package/skills/infra/devops-engineer/references/docker-patterns.md +113 -113
  399. package/skills/infra/devops-engineer/references/github-actions.md +139 -139
  400. package/skills/infra/devops-engineer/references/incident-response.md +331 -331
  401. package/skills/infra/devops-engineer/references/kubernetes.md +154 -154
  402. package/skills/infra/devops-engineer/references/platform-engineering.md +417 -417
  403. package/skills/infra/devops-engineer/references/release-automation.md +527 -527
  404. package/skills/infra/devops-engineer/references/terraform-iac.md +141 -141
  405. package/skills/infra/kubernetes-specialist/SKILL.md +241 -241
  406. package/skills/infra/kubernetes-specialist/references/configuration.md +452 -452
  407. package/skills/infra/kubernetes-specialist/references/cost-optimization.md +458 -458
  408. package/skills/infra/kubernetes-specialist/references/custom-operators.md +563 -563
  409. package/skills/infra/kubernetes-specialist/references/gitops.md +530 -530
  410. package/skills/infra/kubernetes-specialist/references/helm-charts.md +912 -912
  411. package/skills/infra/kubernetes-specialist/references/multi-cluster.md +507 -507
  412. package/skills/infra/kubernetes-specialist/references/networking.md +447 -447
  413. package/skills/infra/kubernetes-specialist/references/service-mesh.md +459 -459
  414. package/skills/infra/kubernetes-specialist/references/storage.md +535 -535
  415. package/skills/infra/kubernetes-specialist/references/troubleshooting.md +414 -414
  416. package/skills/infra/kubernetes-specialist/references/workloads.md +377 -377
  417. package/skills/infra/mcp-developer/SKILL.md +143 -143
  418. package/skills/infra/mcp-developer/references/protocol.md +244 -244
  419. package/skills/infra/mcp-developer/references/python-sdk.md +367 -367
  420. package/skills/infra/mcp-developer/references/resources.md +554 -554
  421. package/skills/infra/mcp-developer/references/tools.md +480 -480
  422. package/skills/infra/mcp-developer/references/typescript-sdk.md +350 -350
  423. package/skills/infra/monitoring-expert/SKILL.md +176 -176
  424. package/skills/infra/monitoring-expert/references/alerting-rules.md +141 -141
  425. package/skills/infra/monitoring-expert/references/application-profiling.md +331 -331
  426. package/skills/infra/monitoring-expert/references/capacity-planning.md +344 -344
  427. package/skills/infra/monitoring-expert/references/dashboards.md +126 -126
  428. package/skills/infra/monitoring-expert/references/opentelemetry.md +123 -123
  429. package/skills/infra/monitoring-expert/references/performance-testing.md +269 -269
  430. package/skills/infra/monitoring-expert/references/prometheus-metrics.md +136 -136
  431. package/skills/infra/monitoring-expert/references/structured-logging.md +142 -142
  432. package/skills/infra/sre-engineer/SKILL.md +181 -181
  433. package/skills/infra/sre-engineer/references/automation-toil.md +492 -492
  434. package/skills/infra/sre-engineer/references/error-budget-policy.md +334 -334
  435. package/skills/infra/sre-engineer/references/incident-chaos.md +576 -576
  436. package/skills/infra/sre-engineer/references/monitoring-alerting.md +424 -424
  437. package/skills/infra/sre-engineer/references/slo-sli-management.md +238 -238
  438. package/skills/infra/terraform-engineer/SKILL.md +143 -143
  439. package/skills/infra/terraform-engineer/references/best-practices.md +583 -583
  440. package/skills/infra/terraform-engineer/references/module-patterns.md +297 -297
  441. package/skills/infra/terraform-engineer/references/providers.md +452 -452
  442. package/skills/infra/terraform-engineer/references/state-management.md +371 -371
  443. package/skills/infra/terraform-engineer/references/testing.md +486 -486
  444. package/skills/infra/websocket-engineer/SKILL.md +168 -168
  445. package/skills/infra/websocket-engineer/references/alternatives.md +391 -391
  446. package/skills/infra/websocket-engineer/references/patterns.md +400 -400
  447. package/skills/infra/websocket-engineer/references/protocol.md +195 -195
  448. package/skills/infra/websocket-engineer/references/scaling.md +333 -333
  449. package/skills/infra/websocket-engineer/references/security.md +474 -474
  450. package/skills/java/java-architect/SKILL.md +132 -132
  451. package/skills/java/java-architect/references/jpa-optimization.md +393 -393
  452. package/skills/java/java-architect/references/reactive-webflux.md +356 -356
  453. package/skills/java/java-architect/references/spring-boot-setup.md +269 -269
  454. package/skills/java/java-architect/references/spring-security.md +445 -445
  455. package/skills/java/java-architect/references/testing-patterns.md +500 -500
  456. package/skills/java/kotlin-specialist/SKILL.md +147 -147
  457. package/skills/java/kotlin-specialist/references/android-compose.md +419 -419
  458. package/skills/java/kotlin-specialist/references/coroutines-flow.md +276 -276
  459. package/skills/java/kotlin-specialist/references/dsl-idioms.md +421 -421
  460. package/skills/java/kotlin-specialist/references/ktor-server.md +426 -426
  461. package/skills/java/kotlin-specialist/references/multiplatform-kmp.md +380 -380
  462. package/skills/java/spring-boot-engineer/SKILL.md +195 -195
  463. package/skills/java/spring-boot-engineer/references/cloud.md +498 -498
  464. package/skills/java/spring-boot-engineer/references/data.md +381 -381
  465. package/skills/java/spring-boot-engineer/references/security.md +459 -459
  466. package/skills/java/spring-boot-engineer/references/testing.md +545 -545
  467. package/skills/java/spring-boot-engineer/references/web.md +295 -295
  468. package/skills/javascript/javascript-pro/SKILL.md +132 -132
  469. package/skills/javascript/javascript-pro/references/async-patterns.md +334 -334
  470. package/skills/javascript/javascript-pro/references/browser-apis.md +398 -398
  471. package/skills/javascript/javascript-pro/references/modern-syntax.md +272 -272
  472. package/skills/javascript/javascript-pro/references/modules.md +357 -357
  473. package/skills/javascript/javascript-pro/references/node-essentials.md +471 -471
  474. package/skills/javascript/nestjs-expert/SKILL.md +206 -206
  475. package/skills/javascript/nestjs-expert/references/authentication.md +166 -166
  476. package/skills/javascript/nestjs-expert/references/controllers-routing.md +111 -111
  477. package/skills/javascript/nestjs-expert/references/dtos-validation.md +153 -153
  478. package/skills/javascript/nestjs-expert/references/migration-from-express.md +1237 -1237
  479. package/skills/javascript/nestjs-expert/references/services-di.md +140 -140
  480. package/skills/javascript/nestjs-expert/references/testing-patterns.md +186 -186
  481. package/skills/javascript/typescript-pro/SKILL.md +145 -145
  482. package/skills/javascript/typescript-pro/references/advanced-types.md +259 -259
  483. package/skills/javascript/typescript-pro/references/configuration.md +445 -445
  484. package/skills/javascript/typescript-pro/references/patterns.md +484 -484
  485. package/skills/javascript/typescript-pro/references/type-guards.md +352 -352
  486. package/skills/javascript/typescript-pro/references/utility-types.md +329 -329
  487. package/skills/php/laravel-specialist/SKILL.md +262 -262
  488. package/skills/php/laravel-specialist/references/eloquent.md +351 -351
  489. package/skills/php/laravel-specialist/references/livewire.md +512 -512
  490. package/skills/php/laravel-specialist/references/queues.md +423 -423
  491. package/skills/php/laravel-specialist/references/routing.md +362 -362
  492. package/skills/php/laravel-specialist/references/testing.md +522 -522
  493. package/skills/php/php-pro/SKILL.md +206 -206
  494. package/skills/php/php-pro/references/async-patterns.md +412 -412
  495. package/skills/php/php-pro/references/laravel-patterns.md +377 -377
  496. package/skills/php/php-pro/references/modern-php-features.md +323 -323
  497. package/skills/php/php-pro/references/symfony-patterns.md +466 -466
  498. package/skills/php/php-pro/references/testing-quality.md +466 -466
  499. package/skills/product/competitive-analysis/SKILL.md +257 -257
  500. package/skills/product/meeting-notes/SKILL.md +266 -266
  501. package/skills/product/prd-template/SKILL.md +150 -150
  502. package/skills/product/stakeholder-update/SKILL.md +225 -225
  503. package/skills/product/user-research-synthesis/SKILL.md +235 -235
  504. package/skills/python/django-expert/SKILL.md +162 -162
  505. package/skills/python/django-expert/references/authentication.md +145 -145
  506. package/skills/python/django-expert/references/drf-serializers.md +148 -148
  507. package/skills/python/django-expert/references/models-orm.md +151 -151
  508. package/skills/python/django-expert/references/testing-django.md +204 -204
  509. package/skills/python/django-expert/references/viewsets-views.md +153 -153
  510. package/skills/python/fastapi-expert/SKILL.md +185 -185
  511. package/skills/python/fastapi-expert/references/async-sqlalchemy.md +146 -146
  512. package/skills/python/fastapi-expert/references/authentication.md +159 -159
  513. package/skills/python/fastapi-expert/references/endpoints-routing.md +142 -142
  514. package/skills/python/fastapi-expert/references/migration-from-django.md +996 -996
  515. package/skills/python/fastapi-expert/references/pydantic-v2.md +135 -135
  516. package/skills/python/fastapi-expert/references/testing-async.md +159 -159
  517. package/skills/python/pandas-pro/SKILL.md +178 -178
  518. package/skills/python/pandas-pro/references/aggregation-groupby.md +545 -545
  519. package/skills/python/pandas-pro/references/data-cleaning.md +500 -500
  520. package/skills/python/pandas-pro/references/dataframe-operations.md +420 -420
  521. package/skills/python/pandas-pro/references/merging-joining.md +596 -596
  522. package/skills/python/pandas-pro/references/performance-optimization.md +597 -597
  523. package/skills/python/python-pro/SKILL.md +177 -177
  524. package/skills/python/python-pro/references/async-patterns.md +356 -356
  525. package/skills/python/python-pro/references/packaging.md +460 -460
  526. package/skills/python/python-pro/references/standard-library.md +378 -378
  527. package/skills/python/python-pro/references/testing.md +404 -404
  528. package/skills/python/python-pro/references/type-system.md +290 -290
  529. package/skills/quality/chaos-engineer/SKILL.md +182 -182
  530. package/skills/quality/chaos-engineer/references/chaos-tools.md +511 -511
  531. package/skills/quality/chaos-engineer/references/experiment-design.md +229 -229
  532. package/skills/quality/chaos-engineer/references/game-days.md +434 -434
  533. package/skills/quality/chaos-engineer/references/infrastructure-chaos.md +348 -348
  534. package/skills/quality/chaos-engineer/references/kubernetes-chaos.md +432 -432
  535. package/skills/quality/code-reviewer/SKILL.md +119 -119
  536. package/skills/quality/code-reviewer/references/common-issues.md +142 -142
  537. package/skills/quality/code-reviewer/references/feedback-examples.md +144 -144
  538. package/skills/quality/code-reviewer/references/receiving-feedback.md +238 -238
  539. package/skills/quality/code-reviewer/references/report-template.md +109 -109
  540. package/skills/quality/code-reviewer/references/review-checklist.md +88 -88
  541. package/skills/quality/code-reviewer/references/spec-compliance-review.md +258 -258
  542. package/skills/quality/playwright-expert/SKILL.md +169 -169
  543. package/skills/quality/playwright-expert/references/api-mocking.md +140 -140
  544. package/skills/quality/playwright-expert/references/configuration.md +155 -155
  545. package/skills/quality/playwright-expert/references/debugging-flaky.md +150 -150
  546. package/skills/quality/playwright-expert/references/page-object-model.md +152 -152
  547. package/skills/quality/playwright-expert/references/selectors-locators.md +119 -119
  548. package/skills/quality/secure-code-guardian/SKILL.md +191 -191
  549. package/skills/quality/secure-code-guardian/references/authentication.md +136 -136
  550. package/skills/quality/secure-code-guardian/references/input-validation.md +146 -146
  551. package/skills/quality/secure-code-guardian/references/owasp-prevention.md +135 -135
  552. package/skills/quality/secure-code-guardian/references/security-headers.md +133 -133
  553. package/skills/quality/secure-code-guardian/references/xss-csrf.md +157 -157
  554. package/skills/quality/security-reviewer/SKILL.md +103 -103
  555. package/skills/quality/security-reviewer/references/infrastructure-security.md +268 -268
  556. package/skills/quality/security-reviewer/references/penetration-testing.md +268 -268
  557. package/skills/quality/security-reviewer/references/report-template.md +170 -170
  558. package/skills/quality/security-reviewer/references/sast-tools.md +117 -117
  559. package/skills/quality/security-reviewer/references/secret-scanning.md +125 -125
  560. package/skills/quality/security-reviewer/references/vulnerability-patterns.md +152 -152
  561. package/skills/quality/senior-qa/README.md +196 -196
  562. package/skills/quality/senior-qa/SKILL.md +399 -399
  563. package/skills/quality/senior-qa/references/qa_best_practices.md +964 -964
  564. package/skills/quality/senior-qa/references/test_automation_patterns.md +1009 -1009
  565. package/skills/quality/senior-qa/references/testing_strategies.md +649 -649
  566. package/skills/quality/senior-qa/scripts/coverage_analyzer.py +836 -836
  567. package/skills/quality/senior-qa/scripts/e2e_test_scaffolder.py +820 -820
  568. package/skills/quality/senior-qa/scripts/test_suite_generator.py +605 -605
  569. package/skills/quality/tdd-guide/HOW_TO_USE.md +313 -313
  570. package/skills/quality/tdd-guide/README.md +680 -680
  571. package/skills/quality/tdd-guide/SKILL.md +122 -122
  572. package/skills/quality/tdd-guide/assets/expected_output.json +77 -77
  573. package/skills/quality/tdd-guide/assets/sample_input_python.json +39 -39
  574. package/skills/quality/tdd-guide/assets/sample_input_typescript.json +36 -36
  575. package/skills/quality/tdd-guide/references/ci-integration.md +195 -195
  576. package/skills/quality/tdd-guide/references/framework-guide.md +206 -206
  577. package/skills/quality/tdd-guide/references/tdd-best-practices.md +128 -128
  578. package/skills/quality/tdd-guide/scripts/coverage_analyzer.py +434 -434
  579. package/skills/quality/tdd-guide/scripts/fixture_generator.py +440 -440
  580. package/skills/quality/tdd-guide/scripts/format_detector.py +384 -384
  581. package/skills/quality/tdd-guide/scripts/framework_adapter.py +428 -428
  582. package/skills/quality/tdd-guide/scripts/metrics_calculator.py +456 -456
  583. package/skills/quality/tdd-guide/scripts/output_formatter.py +354 -354
  584. package/skills/quality/tdd-guide/scripts/tdd_workflow.py +474 -474
  585. package/skills/quality/tdd-guide/scripts/test_generator.py +438 -438
  586. package/skills/quality/test-master/SKILL.md +94 -94
  587. package/skills/quality/test-master/references/automation-frameworks.md +294 -294
  588. package/skills/quality/test-master/references/e2e-testing.md +128 -128
  589. package/skills/quality/test-master/references/integration-testing.md +120 -120
  590. package/skills/quality/test-master/references/performance-testing.md +118 -118
  591. package/skills/quality/test-master/references/qa-methodology.md +247 -247
  592. package/skills/quality/test-master/references/security-testing.md +127 -127
  593. package/skills/quality/test-master/references/tdd-iron-laws.md +174 -174
  594. package/skills/quality/test-master/references/test-reports.md +104 -104
  595. package/skills/quality/test-master/references/testing-anti-patterns.md +231 -231
  596. package/skills/quality/test-master/references/unit-testing.md +113 -113
  597. package/skills/ruby/rails-expert/SKILL.md +154 -154
  598. package/skills/ruby/rails-expert/references/active-record.md +244 -244
  599. package/skills/ruby/rails-expert/references/api-development.md +401 -401
  600. package/skills/ruby/rails-expert/references/background-jobs.md +272 -272
  601. package/skills/ruby/rails-expert/references/hotwire-turbo.md +228 -228
  602. package/skills/ruby/rails-expert/references/rspec-testing.md +367 -367
  603. package/skills/swift/swift-expert/SKILL.md +163 -163
  604. package/skills/swift/swift-expert/references/async-concurrency.md +360 -360
  605. package/skills/swift/swift-expert/references/memory-performance.md +377 -377
  606. package/skills/swift/swift-expert/references/protocol-oriented.md +354 -354
  607. package/skills/swift/swift-expert/references/swiftui-patterns.md +291 -291
  608. package/skills/swift/swift-expert/references/testing-patterns.md +399 -399
  609. package/skills/workflow/brainstorming/SKILL.md +164 -164
  610. package/skills/workflow/brainstorming/scripts/frame-template.html +214 -214
  611. package/skills/workflow/brainstorming/scripts/helper.js +88 -88
  612. package/skills/workflow/brainstorming/scripts/server.cjs +354 -354
  613. package/skills/workflow/brainstorming/scripts/start-server.sh +148 -148
  614. package/skills/workflow/brainstorming/scripts/stop-server.sh +56 -56
  615. package/skills/workflow/brainstorming/spec-document-reviewer-prompt.md +49 -49
  616. package/skills/workflow/brainstorming/visual-companion.md +287 -287
  617. package/skills/workflow/documentation/SKILL.md +45 -45
  618. package/skills/workflow/entropy-management/SKILL.md +115 -115
  619. package/skills/workflow/executing-plans/SKILL.md +70 -70
  620. package/skills/workflow/finishing-a-development-branch/SKILL.md +200 -200
  621. package/skills/workflow/receiving-code-review/SKILL.md +213 -213
  622. package/skills/workflow/requesting-code-review/SKILL.md +105 -105
  623. package/skills/workflow/requesting-code-review/code-reviewer.md +146 -146
  624. package/skills/workflow/requirement-engineering/SKILL.md +111 -111
  625. package/skills/workflow/systematic-debugging/CREATION-LOG.md +119 -119
  626. package/skills/workflow/systematic-debugging/SKILL.md +296 -296
  627. package/skills/workflow/systematic-debugging/condition-based-waiting-example.ts +158 -158
  628. package/skills/workflow/systematic-debugging/condition-based-waiting.md +115 -115
  629. package/skills/workflow/systematic-debugging/defense-in-depth.md +122 -122
  630. package/skills/workflow/systematic-debugging/find-polluter.sh +63 -63
  631. package/skills/workflow/systematic-debugging/root-cause-tracing.md +169 -169
  632. package/skills/workflow/systematic-debugging/test-academic.md +14 -14
  633. package/skills/workflow/systematic-debugging/test-pressure-1.md +58 -58
  634. package/skills/workflow/systematic-debugging/test-pressure-2.md +68 -68
  635. package/skills/workflow/systematic-debugging/test-pressure-3.md +69 -69
  636. package/skills/workflow/using-git-worktrees/SKILL.md +218 -218
  637. package/skills/workflow/verification-before-completion/SKILL.md +139 -139
  638. package/skills/workflow/writing-plans/SKILL.md +151 -151
  639. package/skills/workflow/writing-plans/plan-document-reviewer-prompt.md +49 -49
  640. package/skills/workflow/writing-skills/SKILL.md +655 -655
  641. package/skills/workflow/writing-skills/anthropic-best-practices.md +1150 -1150
  642. package/skills/workflow/writing-skills/examples/CLAUDE_MD_TESTING.md +189 -189
  643. package/skills/workflow/writing-skills/persuasion-principles.md +187 -187
  644. package/skills/workflow/writing-skills/render-graphs.js +168 -168
  645. package/skills/workflow/writing-skills/testing-skills-with-subagents.md +384 -384
@@ -1,805 +1,805 @@
1
- # Observability in Microservices
2
-
3
- Comprehensive guide for monitoring, tracing, and debugging distributed systems.
4
-
5
- ## The Three Pillars
6
-
7
- ### 1. Metrics
8
-
9
- **Purpose:** Quantitative measurements of system behavior over time.
10
-
11
- **Categories:**
12
-
13
- **Business Metrics:**
14
- ```
15
- Examples:
16
- - Orders per minute
17
- - Revenue per hour
18
- - Active users
19
- - Conversion rate
20
- - Cart abandonment rate
21
-
22
- Why Important:
23
- - Align with business goals
24
- - Detect business anomalies
25
- - Inform scaling decisions
26
-
27
- Implementation:
28
- from prometheus_client import Counter, Histogram
29
-
30
- orders_total = Counter(
31
- 'orders_total',
32
- 'Total number of orders',
33
- ['status', 'payment_method']
34
- )
35
-
36
- order_value = Histogram(
37
- 'order_value_dollars',
38
- 'Order value in dollars',
39
- buckets=[10, 50, 100, 500, 1000, 5000]
40
- )
41
-
42
- # In code
43
- orders_total.labels(status='completed', payment_method='credit_card').inc()
44
- order_value.observe(order.total_amount)
45
- ```
46
-
47
- **System Metrics:**
48
- ```
49
- Infrastructure:
50
- - CPU usage
51
- - Memory usage
52
- - Disk I/O
53
- - Network throughput
54
-
55
- Application:
56
- - Request rate
57
- - Error rate
58
- - Request duration (latency)
59
- - Active connections
60
- - Thread pool utilization
61
-
62
- Database:
63
- - Query duration
64
- - Connection pool usage
65
- - Slow queries
66
- - Deadlocks
67
-
68
- Message Queue:
69
- - Queue depth
70
- - Message processing rate
71
- - Consumer lag
72
- - Dead letter queue size
73
- ```
74
-
75
- **The Four Golden Signals (Google SRE):**
76
- ```
77
- 1. Latency:
78
- - Time to serve requests
79
- - Track p50, p95, p99, p99.9
80
- - Separate success vs error latency
81
-
82
- request_duration = Histogram(
83
- 'http_request_duration_seconds',
84
- 'HTTP request duration',
85
- ['method', 'endpoint', 'status']
86
- )
87
-
88
- 2. Traffic:
89
- - Requests per second
90
- - Transactions per second
91
- - Concurrent users
92
-
93
- requests_total = Counter(
94
- 'http_requests_total',
95
- 'Total HTTP requests',
96
- ['method', 'endpoint', 'status']
97
- )
98
-
99
- 3. Errors:
100
- - Rate of failed requests
101
- - 4xx vs 5xx errors
102
- - Exception types
103
-
104
- errors_total = Counter(
105
- 'errors_total',
106
- 'Total errors',
107
- ['service', 'error_type']
108
- )
109
-
110
- 4. Saturation:
111
- - Resource utilization
112
- - Queue depth
113
- - Thread pool usage
114
-
115
- connection_pool_usage = Gauge(
116
- 'db_connection_pool_active',
117
- 'Active database connections'
118
- )
119
- ```
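The latency guidance above (track p50, p95, p99) can be illustrated with a minimal sketch that computes those percentiles from raw samples. It uses only the Python standard library. The helper name `latency_percentiles` and the sample data are assumptions for illustration. A Prometheus histogram estimates percentiles from bucket counts instead, so real values would differ slightly.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical latencies: 1 ms, 2 ms, ..., 100 ms
result = latency_percentiles(list(range(1, 101)))
```

The gap between p50 and p99 is the number to watch: a healthy median can hide a slow tail.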
120
-
121
- **RED Method (for services):**
122
- ```
123
- - Rate: Requests per second
124
- - Errors: Failed requests per second
125
- - Duration: Request latency distribution
126
-
127
- Perfect for microservices dashboards
128
- ```
129
-
130
- **USE Method (for resources):**
131
- ```
132
- - Utilization: Percentage of time resource busy
133
- - Saturation: Queue depth or waiting threads
134
- - Errors: Error count
135
-
136
- Perfect for infrastructure monitoring
137
- ```
138
-
139
- ### 2. Logs
140
-
141
- **Purpose:** Discrete event records with context.
142
-
143
- **Structured Logging:**
144
- ```json
145
- {
146
- "timestamp": "2025-12-14T15:30:45.123Z",
147
- "level": "INFO",
148
- "service": "order-service",
149
- "version": "1.2.3",
150
- "traceId": "abc123def456",
151
- "spanId": "span789",
152
- "userId": "user-123",
153
- "message": "Order created successfully",
154
- "orderId": "order-456",
155
- "totalAmount": 99.99,
156
- "currency": "USD",
157
- "duration_ms": 45,
158
- "endpoint": "/api/v1/orders",
159
- "method": "POST",
160
- "statusCode": 201
161
- }
162
- ```
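A record like the one above could be produced with the standard `logging` module and a small JSON formatter. This is a sketch under assumptions: the class name `JsonFormatter` and the `fields` key passed through `extra` are hypothetical conventions, not part of the original guide.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
        }
        # Structured fields passed as extra={"fields": {...}} are merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter("order-service", "1.2.3"))

logger = logging.getLogger("structured-logging-example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info(
    "Order created successfully",
    extra={"fields": {"orderId": "order-456", "statusCode": 201}},
)
entry = json.loads(stream.getvalue())
```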
163
-
164
- **Log Levels:**
165
- ```
166
- ERROR:
167
- - Application errors
168
- - Failed operations
169
- - Exceptions
170
- Use: Alerts, immediate attention
171
-
172
- WARN:
173
- - Degraded functionality
174
- - Retry attempts
175
- - Deprecated API usage
176
- Use: Investigation, potential issues
177
-
178
- INFO:
179
- - Business events (order created, user logged in)
180
- - System events (service started, configuration loaded)
181
- Use: Audit trail, business analytics
182
-
183
- DEBUG:
184
- - Detailed execution flow
185
- - Variable values
186
- - Function entry/exit
187
- Use: Development, troubleshooting
188
-
189
- TRACE:
190
- - Very detailed debugging
191
- Use: Deep troubleshooting (disabled in production usually)
192
- ```
193
-
194
- **Correlation IDs:**
195
- ```
196
- Request flow across services:
197
-
198
- Client Request → API Gateway
199
- ↓ (correlationId: corr-123)
200
- Order Service
201
- ↓ (correlationId: corr-123)
202
- Payment Service
203
- ↓ (correlationId: corr-123)
204
- Notification Service
205
-
206
- All logs include correlationId: corr-123
207
- Easy to trace entire request flow
208
-
209
- Implementation:
210
- import logging
211
- from contextvars import ContextVar
- from uuid import uuid4
212
-
213
- correlation_id_var = ContextVar('correlation_id', default=None)
214
-
215
- class CorrelationIdFilter(logging.Filter):
216
- def filter(self, record):
217
- record.correlation_id = correlation_id_var.get()
218
- return True
219
-
220
- # Middleware
221
- async def correlation_middleware(request, call_next):
222
- correlation_id = request.headers.get('X-Correlation-ID', str(uuid4()))
223
- correlation_id_var.set(correlation_id)
224
- response = await call_next(request)
225
- response.headers['X-Correlation-ID'] = correlation_id
226
- return response
227
- ```
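The filter above only takes effect once it is attached to a handler and the format string references `correlation_id`. A minimal wiring sketch follows, using the standard library only. The `handle_request` function stands in for the middleware and is a hypothetical name.

```python
import io
import logging
from contextvars import ContextVar
from uuid import uuid4

correlation_id_var = ContextVar("correlation_id", default=None)


class CorrelationIdFilter(logging.Filter):
    def filter(self, record):
        # Copy the current correlation ID onto every record.
        record.correlation_id = correlation_id_var.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
)

logger = logging.getLogger("correlation-example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def handle_request(incoming_id=None):
    """Stand-in for the middleware: set the ID, do the work, then reset."""
    token = correlation_id_var.set(incoming_id or str(uuid4()))
    try:
        logger.info("Order created")
    finally:
        correlation_id_var.reset(token)


handle_request("corr-123")
output = stream.getvalue()
```

Resetting the context variable in `finally` keeps an ID from leaking into unrelated work that reuses the same context.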
228
-
229
- **Log Aggregation:**
230
- ```
231
- Services → Log Shipper → Centralized Log Storage → Visualization
232
-
233
- Tools:
234
- - ELK Stack (Elasticsearch, Logstash, Kibana)
235
- - EFK Stack (Elasticsearch, Fluentd, Kibana)
236
- - Loki (from Grafana)
237
- - CloudWatch Logs (AWS)
238
- - Cloud Logging (GCP, formerly Stackdriver)
239
-
240
- Query Examples:
241
- # Find all errors for specific user
242
- service:"order-service" AND level:"ERROR" AND userId:"user-123"
243
-
244
- # Find slow requests
245
- service:"payment-service" AND duration_ms:>5000
246
-
247
- # Find requests with specific correlation ID
248
- correlationId:"corr-123"
249
- ```
250
-
251
- ### 3. Distributed Tracing
252
-
253
- **Purpose:** Visualize request flow across services, identify bottlenecks.
254
-
255
- **Concepts:**
256
-
257
- **Trace:**
258
- ```
259
- Entire request journey across all services
260
-
261
- Example: User places order
262
- Trace ID: trace-abc123
263
-
264
- Spans in trace:
265
- 1. api-gateway: /checkout (200ms)
266
- 2. order-service: createOrder (150ms)
267
- 3. payment-service: processPayment (80ms)
268
- 4. inventory-service: reserveItems (40ms)
269
- 5. notification-service: sendEmail (30ms)
270
-
271
- Total: 200ms (some parallel execution)
272
- ```
273
-
274
- **Span:**
275
- ```
276
- Single operation within a trace
277
-
278
- Span attributes:
279
- {
280
- "traceId": "trace-abc123",
281
- "spanId": "span-456",
282
- "parentSpanId": "span-123",
283
- "name": "POST /api/v1/orders",
284
- "startTime": "2025-12-14T15:30:45.000Z",
285
- "endTime": "2025-12-14T15:30:45.150Z",
286
- "duration": 150,
287
- "status": "OK",
288
- "attributes": {
289
- "http.method": "POST",
290
- "http.url": "/api/v1/orders",
291
- "http.status_code": 201,
292
- "user.id": "user-123",
293
- "order.id": "order-456",
294
- "order.total": 99.99
295
- },
296
- "events": [
297
- {
298
- "timestamp": "2025-12-14T15:30:45.050Z",
299
- "name": "Validating order items"
300
- },
301
- {
302
- "timestamp": "2025-12-14T15:30:45.100Z",
303
- "name": "Calling payment service"
304
- }
305
- ]
306
- }
307
- ```
308
-
309
- **Implementation (OpenTelemetry):**
310
- ```python
311
- from fastapi import FastAPI
- from opentelemetry import trace
312
- from opentelemetry.exporter.jaeger.thrift import JaegerExporter
313
- from opentelemetry.sdk.trace import TracerProvider
314
- from opentelemetry.sdk.trace.export import BatchSpanProcessor
315
- from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
316
-
317
- # Setup tracing
318
- provider = TracerProvider()
319
- jaeger_exporter = JaegerExporter(
320
- agent_host_name="jaeger",
321
- agent_port=6831
322
- )
323
- provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
324
- trace.set_tracer_provider(provider)
325
-
326
- # Instrument FastAPI
327
- app = FastAPI()
328
- FastAPIInstrumentor.instrument_app(app)
329
-
330
- # Manual span creation
331
- tracer = trace.get_tracer(__name__)
332
-
333
- async def create_order(order_data):
334
- with tracer.start_as_current_span("create_order") as span:
335
- span.set_attribute("order.items_count", len(order_data.items))
336
- span.set_attribute("order.total", order_data.total)
337
-
338
- # Database operation
339
- with tracer.start_as_current_span("db.insert_order"):
340
- order_id = await db.insert_order(order_data)
341
-
342
- # Call payment service
343
- with tracer.start_as_current_span("http.payment_service") as payment_span:
344
- payment_span.set_attribute("http.url", f"{PAYMENT_URL}/payments")
345
- result = await payment_service.charge(order_id, order_data.total)
346
-
347
- return order_id
348
- ```
349
-
350
- **Trace Visualization:**
351
- ```
352
- Jaeger UI shows:
353
-
354
- Timeline view:
355
- |-- api-gateway (200ms) ----------------------------------|
356
- |-- order-service (150ms) ------------------------|
357
- |-- db.insert_order (30ms) --|
358
- |-- payment-service (80ms) -----------------|
359
- |-- db.create_transaction (20ms) ----|
360
- |-- notification-service (30ms) ----------|
361
-
362
- Critical path highlighted
363
- Bottlenecks identified (payment-service taking 80ms)
364
- Parallel operations visible
365
- ```
366
-
367
- **Sampling Strategies:**
368
- ```
369
- Problem: Tracing every request is expensive
370
-
371
- Solutions:
372
-
373
- 1. Probabilistic Sampling:
374
- - Trace 1% of requests
375
- - Good for high-volume services
376
-
377
- 2. Rate Limiting Sampling:
378
- - Max 100 traces per second
379
- - Prevents overwhelming trace backend
380
-
381
- 3. Tail-Based Sampling:
382
- - Trace all errors
383
- - Trace slow requests (>5s)
384
- - Sample 1% of fast successful requests
385
-
386
- 4. Priority Sampling:
387
- - Always trace premium users
388
- - Always trace critical endpoints
389
- - Sample others
390
-
391
- Implementation:
392
- from opentelemetry.sdk.trace.sampling import (
393
- ParentBasedTraceIdRatioBased,
394
- ALWAYS_ON,
395
- ALWAYS_OFF
396
- )
397
-
398
- # Sample 1% of traces
399
- sampler = ParentBasedTraceIdRatioBased(0.01)
400
-
401
- # Or custom sampler
402
- class CustomSampler:
403
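- def should_sample(self, context, trace_id, name, kind=None, attributes=None, links=None, trace_state=None):
403
- # Always sample errors (only works if the attribute is already set at span start)
404
- if (attributes or {}).get("http.status_code", 0) >= 500:
405
- return ALWAYS_ON.should_sample(context, trace_id, name, kind, attributes, links, trace_state)
406
-
407
- # Always sample slow requests (duration is normally known only to a tail-based sampler)
408
- if (attributes or {}).get("duration_ms", 0) > 5000:
409
- return ALWAYS_ON.should_sample(context, trace_id, name, kind, attributes, links, trace_state)
410
-
411
- # Sample 1% of others
412
- return ParentBasedTraceIdRatioBased(0.01).should_sample(context, trace_id, name, kind, attributes, links, trace_state)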
414
- ```
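Status code and duration are known only after a request completes, so the tail-based rules listed above are usually applied once a trace has finished (for example in a collector) rather than when the first span starts. The decision itself can be sketched in plain Python. The function name `keep_trace` and its parameters are assumptions for illustration, not an OpenTelemetry API.

```python
import random


def keep_trace(status_code, duration_ms, sample_rate=0.01, rng=random.random):
    """Decide, after a trace has finished, whether to keep it."""
    if status_code >= 500:
        return True  # always keep errors
    if duration_ms > 5000:
        return True  # always keep slow requests (> 5 s)
    # Keep roughly `sample_rate` of the fast, successful requests.
    return rng() < sample_rate


kept_error = keep_trace(503, 12)
kept_slow = keep_trace(200, 6200)
# Inject a fixed "random" value to make the probabilistic branch deterministic.
kept_fast = keep_trace(200, 12, rng=lambda: 0.5)
```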
415
-
416
- ## Service Level Objectives (SLOs)
417
-
418
- ### Defining SLOs
419
-
420
- **SLI (Service Level Indicator):**
421
- ```
422
- Quantitative measure of service level
423
-
424
- Examples:
425
- - Request latency: p99 < 200ms
426
- - Availability: 99.9% of requests succeed
427
- - Throughput: Handle 10,000 requests/sec
428
- ```
429
-
430
- **SLO (Service Level Objective):**
431
- ```
432
- Target value for SLI
433
-
434
- Examples:
435
- - 99.9% of requests complete in < 200ms
436
- - 99.95% availability over 30 days
437
- - Zero data loss
438
-
439
- SLO Components:
440
- - Metric: What you measure (latency, availability)
441
- - Target: Threshold (99.9%, 200ms)
442
- - Time window: Evaluation period (30 days, weekly)
443
- ```
444
-
445
- **SLA (Service Level Agreement):**
446
- ```
447
- Contract with consequences if SLO not met
448
-
449
- Example:
450
- - SLO: 99.9% availability
451
- - SLA: If availability < 99.9%, customers get 10% credit
452
-
453
- SLA ≤ SLO (leave buffer for incidents)
454
- ```
455
-
456
- **Error Budget:**
457
- ```
458
- Allowed failure to meet SLO = (100% - SLO target)
459
-
460
- Example:
461
- SLO: 99.9% availability
462
- Error budget: 0.1% ≈ 43.8 minutes of downtime per average month (43.2 minutes over a 30-day window)
463
-
464
- Error budget consumed:
465
- - Outages
466
- - Slow responses
467
- - Failed requests
468
-
469
- When error budget exhausted:
470
- - Freeze feature deployments
471
- - Focus on reliability
472
- - Only critical fixes deployed
473
-
474
- Benefits:
475
- - Balances innovation vs stability
476
- - Data-driven deployment decisions
477
- - Aligns engineering priorities
478
- ```
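The downtime figure in the example above follows directly from the SLO target and the length of the window. A short sketch of the arithmetic, where the function name `error_budget_minutes` is an assumption for illustration:

```python
def error_budget_minutes(slo_percent, window_days):
    """Allowed downtime in minutes for an availability SLO over a window."""
    budget_fraction = 1 - slo_percent / 100
    return budget_fraction * window_days * 24 * 60


budget_30_days = error_budget_minutes(99.9, 30)            # about 43.2 minutes
budget_avg_month = error_budget_minutes(99.9, 365.25 / 12)  # about 43.8 minutes
budget_four_nines = error_budget_minutes(99.99, 30)         # about 4.3 minutes
```

Each extra nine cuts the budget by a factor of ten, which is why the target should be chosen deliberately rather than set as high as possible.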
479
-
480
- ### Implementing SLO Monitoring
481
-
482
- **Prometheus + Grafana:**
483
- ```
484
- # SLI: Availability
485
- availability_sli = (
486
- sum(rate(http_requests_total{status!~"5.."}[30d]))
487
- /
488
- sum(rate(http_requests_total[30d]))
489
- ) * 100
490
-
491
- # SLI: Latency
492
- latency_sli = histogram_quantile(
493
- 0.99,
494
- rate(http_request_duration_seconds_bucket[30d])
495
- )
496
-
497
- # Error budget remaining, as a fraction of the total budget (1 = untouched, 0 = exhausted)
498
- error_budget_remaining = 1 - (
499
- (1 - availability_sli / 100)
500
- /
501
- (1 - target_slo / 100)
502
- )
503
-
504
- Alert when error budget < 10%:
505
- alert: ErrorBudgetCritical
506
- expr: error_budget_remaining < 0.1
507
- annotations:
508
- summary: "Error budget critically low"
509
- description: "Only 10% error budget remaining. Freeze deployments."
510
- ```
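For an alert threshold such as "less than 10% left" to be meaningful, the remaining budget has to be expressed as a fraction of the total budget rather than as an absolute failure rate. A sketch of that calculation from raw request counts, where the function name and the example counts are hypothetical:

```python
def error_budget_remaining(good_requests, total_requests, slo_percent):
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing spent
    allowed_failure = 1 - slo_percent / 100
    observed_failure = 1 - good_requests / total_requests
    return 1 - observed_failure / allowed_failure


half_spent = error_budget_remaining(999_500, 1_000_000, 99.9)
exhausted = error_budget_remaining(999_000, 1_000_000, 99.9)
overspent = error_budget_remaining(998_000, 1_000_000, 99.9)
```

A negative result means the SLO has already been missed for the window.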
511
-
512
- ## Alerting Strategies
513
-
514
- ### Alert Levels
515
-
516
- **Critical (Page immediately):**
517
- ```
518
- Conditions:
519
- - Service completely down
520
- - Error rate > 50%
521
- - Data loss occurring
522
- - SLO burn rate critical
523
-
524
- Actions:
525
- - Page on-call engineer
526
- - Incident created automatically
527
- - Escalate if not acknowledged in 5 min
528
-
529
- Example:
530
- alert: ServiceDown
531
- expr: up{service="payment-service"} == 0
532
- for: 1m
533
- severity: critical
534
- ```
535
-
536
- **Warning (Investigate soon):**
537
- ```
538
- Conditions:
539
- - Elevated error rate (5-10%)
540
- - Latency degraded (p99 > 500ms)
541
- - Queue depth increasing
542
- - Error budget < 25%
543
-
544
- Actions:
545
- - Slack notification
546
- - Create ticket
547
- - Investigate during business hours
548
-
549
- Example:
550
- alert: HighErrorRate
551
- expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
552
- for: 10m
553
- severity: warning
554
- ```
555
-
556
- **Info (Awareness):**
557
- ```
558
- Conditions:
559
- - Deployment completed
560
- - Scaling event
561
- - Configuration changed
562
- - Capacity threshold reached
563
-
564
- Actions:
565
- - Log to monitoring system
566
- - Dashboard annotation
567
- - Optional Slack notification
568
- ```
569
-
570
- ### Alert Best Practices
571
-
572
- **Actionable Alerts:**
573
- ```
574
- Bad Alert:
575
- "High CPU usage"
576
-
577
- Good Alert:
578
- "CPU usage > 80% on order-service-pod-abc for 10 minutes
579
- Runbook: https://wiki.company.com/runbooks/high-cpu
580
- Likely cause: Memory leak or infinite loop
581
- Actions: 1) Check recent deployments 2) Review logs for exceptions 3) Consider rolling back"
582
-
583
- Include:
584
- ✓ What is wrong
585
- ✓ Why it matters
586
- ✓ How to investigate
587
- ✓ Runbook link
588
- ✓ Suggested actions
589
- ```
590
-
591
- **Avoid Alert Fatigue:**
592
- ```
593
- Problems:
594
- - Too many alerts
595
- - False positives
596
- - Non-actionable alerts
597
- - Duplicate alerts
598
-
599
- Solutions:
600
- - Alert on symptoms, not causes
601
- - Proper thresholds and durations
602
- - Alert aggregation (don't alert per pod, alert per service)
603
- - Regular alert review and tuning
604
- - Auto-resolve alerts
605
- - Silence during maintenance
606
-
607
- Good Practice:
608
- for: 5m # Don't alert on transient spikes
609
- group_by: [service] # Aggregate per service
610
- group_wait: 30s # Wait before sending
611
- group_interval: 5m # Batch notifications
612
- ```
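The effect of aggregating per service instead of per pod can be sketched in a few lines. This is an illustration only: the alert dictionaries and the `group_alerts` helper are hypothetical and do not reflect the internals of any alerting tool.

```python
from collections import defaultdict


def group_alerts(alerts, group_by="service"):
    """Collapse per-pod alerts into one notification per group key."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert[group_by]].append(alert["pod"])
    return {key: sorted(pods) for key, pods in grouped.items()}


# Hypothetical firing alerts: three pods, two services.
firing = [
    {"service": "order-service", "pod": "order-service-pod-b"},
    {"service": "order-service", "pod": "order-service-pod-a"},
    {"service": "payment-service", "pod": "payment-service-pod-a"},
]
notifications = group_alerts(firing)
```

Three firing pods become two notifications, each listing the affected pods.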
613
-
614
- ## Observability Stack
615
-
616
- ### Recommended Tools
617
-
618
- **Metrics:**
619
- ```
620
- Collection: Prometheus
621
- - Pull-based metrics
622
- - Time-series database
623
- - Powerful query language (PromQL)
624
- - Service discovery
625
-
626
- Visualization: Grafana
627
- - Beautiful dashboards
628
- - Alerting integration
629
- - Multiple data sources
630
- - Template variables
631
-
632
- Alternative: Datadog, New Relic, CloudWatch
633
- ```
634
-
635
- **Logs:**
636
- ```
637
- Aggregation: ELK Stack
638
- - Elasticsearch (storage & search)
639
- - Logstash / Fluentd (collection)
640
- - Kibana (visualization)
641
-
642
- Or: Loki (lightweight alternative)
643
- - Integrates with Grafana
644
- - Labels instead of full-text indexing
645
- - Lower resource usage
646
-
647
- Alternative: Splunk, Datadog, CloudWatch Logs
648
- ```
649
-
650
- **Tracing:**
651
- ```
652
- Backend: Jaeger or Zipkin
653
- - Trace storage
654
- - Trace visualization
655
- - Dependency graphs
656
- - Performance analysis
657
-
658
- Instrumentation: OpenTelemetry
659
- - Vendor-neutral standard
660
- - Auto-instrumentation for common frameworks
661
- - Manual instrumentation API
662
- - Export to any backend
663
-
664
- Alternative: Datadog APM, New Relic, Lightstep
665
- ```
666
-
667
- **All-in-One:**
668
- ```
669
- Observability platforms:
670
- - Datadog (metrics, logs, traces, RUM)
671
- - New Relic (APM, logs, infrastructure)
672
- - Dynatrace (auto-instrumentation, AI)
673
-
674
- Pros:
675
- - Unified experience
676
- - Correlated data
677
- - Easier setup
678
-
679
- Cons:
680
- - Vendor lock-in
681
- - Higher cost
682
- - Less flexibility
683
- ```
684
-
685
- ### Implementation Checklist
686
-
687
- **For Each Service:**
688
- ```
689
- ✓ Structured logging with correlation IDs
690
- ✓ Metrics exported (Prometheus format)
691
- ✓ Distributed tracing instrumented
692
- ✓ Health check endpoints (/health/live, /health/ready)
693
- ✓ Graceful shutdown handling
694
- ✓ Resource limits set (CPU, memory)
695
- ✓ Alerts configured for critical paths
696
- ✓ Dashboards created
697
- ✓ Runbooks documented
698
- ✓ On-call rotation established
699
- ```
700
-
701
- **For System-Wide:**
702
- ```
703
- ✓ Centralized log aggregation
704
- ✓ Distributed tracing backend
705
- ✓ Metrics aggregation and storage
706
- ✓ Unified dashboards (service overview)
707
- ✓ Alert routing configured
708
- ✓ Incident management process
709
- ✓ Post-mortem template
710
- ✓ SLO definitions and tracking
711
- ✓ Dependency mapping
712
- ✓ Chaos engineering experiments
713
- ```
714
-
715
- ## Troubleshooting Workflow
716
-
717
- **Incident Response:**
718
- ```
719
- 1. Detect (Alert fires)
720
- - Check dashboard
721
- - Verify alert is valid
722
- - Assess impact
723
-
724
- 2. Triage (Determine severity)
725
- - Critical: Page on-call
726
- - Warning: Create ticket
727
- - How many users affected?
728
- - What functionality broken?
729
-
730
- 3. Investigate (Find root cause)
731
- - Check recent deployments
732
- - Review logs (search by correlation ID)
733
- - Analyze traces (slow operations)
734
- - Check metrics (resource saturation)
735
- - Examine dependencies
736
-
737
- 4. Mitigate (Stop the bleeding)
738
- - Rollback deployment
739
- - Scale up resources
740
- - Failover to backup
741
- - Enable circuit breakers
742
- - Rate limit traffic
743
-
744
- 5. Resolve (Fix root cause)
745
- - Deploy fix
746
- - Verify resolution
747
- - Monitor for recurrence
748
-
749
- 6. Post-mortem (Learn and improve)
750
- - Timeline of events
751
- - Root cause analysis
752
- - Action items
753
- - Update runbooks
754
- ```
755
-
756
- **Using Traces to Debug:**
757
- ```
758
- Scenario: API returning 500 errors
759
-
760
- 1. Find failing trace:
761
- - Filter: status = error, service = api-gateway
762
- - Sort by timestamp (most recent)
763
-
764
- 2. Analyze span waterfall:
765
- - Identify which service failed (order-service returned 500)
766
- - Check error message in span
767
- - Review span attributes
768
-
769
- 3. Correlate with logs:
770
- - Extract trace ID from failed trace
771
- - Search logs: traceId:"trace-abc123"
772
- - Find exception stack trace
773
-
774
- 4. Check related metrics:
775
- - order-service error rate spiked 10 min ago
776
- - Corresponds with deployment
777
- - Likely cause: Bad deployment
778
-
779
- 5. Remediate:
780
- - Rollback order-service
781
- - Verify errors stopped
782
- - Create ticket for bug fix
783
- ```
784
-
785
- ## Summary
786
-
787
- Observability is non-negotiable in microservices:
788
-
789
- **Must-Haves:**
790
- - Structured logging with correlation IDs
791
- - Metrics (RED/USE methodology)
792
- - Distributed tracing (OpenTelemetry)
793
- - Centralized log aggregation
794
- - SLO tracking with error budgets
795
- - Actionable alerts with runbooks
796
-
797
- **Best Practices:**
798
- - Correlate metrics, logs, and traces
799
- - Define SLOs based on user experience
800
- - Alert on symptoms, not causes
801
- - Maintain runbooks for common issues
802
- - Regular post-mortems and learning
803
- - Practice incident response with game days
804
-
805
- Without observability, you're flying blind in production.
1
+ # Observability in Microservices
2
+
3
+ Comprehensive guide for monitoring, tracing, and debugging distributed systems.
4
+
5
+ ## The Three Pillars
6
+
7
+ ### 1. Metrics
8
+
9
+ **Purpose:** Quantitative measurements of system behavior over time.
10
+
11
+ **Categories:**
12
+
13
+ **Business Metrics:**
14
+ ```
15
+ Examples:
16
+ - Orders per minute
17
+ - Revenue per hour
18
+ - Active users
19
+ - Conversion rate
20
+ - Cart abandonment rate
21
+
22
+ Why Important:
23
+ - Align with business goals
24
+ - Detect business anomalies
25
+ - Inform scaling decisions
26
+
27
+ Implementation:
28
+ from prometheus_client import Counter, Gauge, Histogram
29
+
30
+ orders_total = Counter(
31
+ 'orders_total',
32
+ 'Total number of orders',
33
+ ['status', 'payment_method']
34
+ )
35
+
36
+ order_value = Histogram(
37
+ 'order_value_dollars',
38
+ 'Order value in dollars',
39
+ buckets=[10, 50, 100, 500, 1000, 5000]
40
+ )
41
+
42
+ # In code
43
+ orders_total.labels(status='completed', payment_method='credit_card').inc()
44
+ order_value.observe(order.total_amount)
45
+ ```
46
+
47
+ **System Metrics:**
48
+ ```
49
+ Infrastructure:
50
+ - CPU usage
51
+ - Memory usage
52
+ - Disk I/O
53
+ - Network throughput
54
+
55
+ Application:
56
+ - Request rate
57
+ - Error rate
58
+ - Request duration (latency)
59
+ - Active connections
60
+ - Thread pool utilization
61
+
62
+ Database:
63
+ - Query duration
64
+ - Connection pool usage
65
+ - Slow queries
66
+ - Deadlocks
67
+
68
+ Message Queue:
69
+ - Queue depth
70
+ - Message processing rate
71
+ - Consumer lag
72
+ - Dead letter queue size
73
+ ```
74
+
75
+ **The Four Golden Signals (Google SRE):**
76
+ ```
77
+ 1. Latency:
78
+ - Time to serve requests
79
+ - Track p50, p95, p99, p99.9
80
+ - Separate success vs error latency
81
+
82
+ request_duration = Histogram(
83
+ 'http_request_duration_seconds',
84
+ 'HTTP request duration',
85
+ ['method', 'endpoint', 'status']
86
+ )
87
+
88
+ 2. Traffic:
89
+ - Requests per second
90
+ - Transactions per second
91
+ - Concurrent users
92
+
93
+ requests_total = Counter(
94
+ 'http_requests_total',
95
+ 'Total HTTP requests',
96
+ ['method', 'endpoint', 'status']
97
+ )
98
+
99
+ 3. Errors:
100
+ - Rate of failed requests
101
+ - 4xx vs 5xx errors
102
+ - Exception types
103
+
104
+ errors_total = Counter(
105
+ 'errors_total',
106
+ 'Total errors',
107
+ ['service', 'error_type']
108
+ )
109
+
110
+ 4. Saturation:
111
+ - Resource utilization
112
+ - Queue depth
113
+ - Thread pool usage
114
+
115
+ connection_pool_usage = Gauge(
116
+ 'db_connection_pool_active',
117
+ 'Active database connections'
118
+ )
119
+ ```
120
+
121
+ **RED Method (for services):**
122
+ ```
123
+ - Rate: Requests per second
124
+ - Errors: Failed requests per second
125
+ - Duration: Request latency distribution
126
+
127
+ Perfect for microservices dashboards
128
+ ```
129
+
130
+ **USE Method (for resources):**
131
+ ```
132
+ - Utilization: Percentage of time resource busy
133
+ - Saturation: Queue depth or waiting threads
134
+ - Errors: Error count
135
+
136
+ Perfect for infrastructure monitoring
137
+ ```
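The RED method can be sketched in plain Python as a minimal illustration of what a dashboard would compute per time window. The `RequestSample` record and the `red_summary` helper are hypothetical names introduced here; in practice these figures would normally come from the metrics backend rather than application code.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestSample:
    """One observed request (hypothetical record shape for this sketch)."""
    duration_s: float
    status: int


def red_summary(samples, window_s):
    """Return rate, error rate, and latency percentiles for one time window."""
    if not samples:
        return {"rate": 0.0, "errors": 0.0, "p50": None, "p95": None, "p99": None}
    durations = sorted(s.duration_s for s in samples)
    failed = sum(1 for s in samples if s.status >= 500)
    # quantiles(n=100) returns the 99 cut points p1..p99; it needs >= 2 data points.
    if len(durations) >= 2:
        cuts = quantiles(durations, n=100, method="inclusive")
        p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    else:
        p50 = p95 = p99 = durations[0]
    return {
        "rate": len(samples) / window_s,   # Rate: requests per second
        "errors": failed / window_s,       # Errors: failed requests per second
        "p50": p50, "p95": p95, "p99": p99,  # Duration: latency distribution
    }


# 98 fast successes and 2 slow failures observed in a 60-second window
samples = [RequestSample(0.05, 200)] * 98 + [RequestSample(1.2, 500)] * 2
summary = red_summary(samples, window_s=60)
print(summary)
```

Note how the median stays low while p99 exposes the slow failures, which is why the guide recommends tracking several percentiles rather than an average.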
138
+
139
+ ### 2. Logs
140
+
141
+ **Purpose:** Discrete event records with context.
142
+
143
+ **Structured Logging:**
144
+ ```json
145
+ {
146
+ "timestamp": "2025-12-14T15:30:45.123Z",
147
+ "level": "INFO",
148
+ "service": "order-service",
149
+ "version": "1.2.3",
150
+ "traceId": "abc123def456",
151
+ "spanId": "span789",
152
+ "userId": "user-123",
153
+ "message": "Order created successfully",
154
+ "orderId": "order-456",
155
+ "totalAmount": 99.99,
156
+ "currency": "USD",
157
+ "duration_ms": 45,
158
+ "endpoint": "/api/v1/orders",
159
+ "method": "POST",
160
+ "statusCode": 201
161
+ }
162
+ ```
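One way to emit entries in this shape is a custom formatter on top of the standard `logging` module. The sketch below is an assumption about how a service might do it, not a prescribed implementation: `JsonFormatter` and the `fields` key passed through `extra` are names invented for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="order-service", version="1.2.3"))
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Order created successfully",
    extra={"fields": {"orderId": "order-456", "totalAmount": 99.99, "statusCode": 201}},
)
```

Keeping every entry on a single line makes it straightforward for a log shipper to parse and index the fields.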
163
+
164
+ **Log Levels:**
165
+ ```
166
+ ERROR:
167
+ - Application errors
168
+ - Failed operations
169
+ - Exceptions
170
+ Use: Alerts, immediate attention
171
+
172
+ WARN:
173
+ - Degraded functionality
174
+ - Retry attempts
175
+ - Deprecated API usage
176
+ Use: Investigation, potential issues
177
+
178
+ INFO:
179
+ - Business events (order created, user logged in)
180
+ - System events (service started, configuration loaded)
181
+ Use: Audit trail, business analytics
182
+
183
+ DEBUG:
184
+ - Detailed execution flow
185
+ - Variable values
186
+ - Function entry/exit
187
+ Use: Development, troubleshooting
188
+
189
+ TRACE:
190
+ - Very detailed debugging
191
+ Use: Deep troubleshooting (usually disabled in production)
192
+ ```
193
+
194
+ **Correlation IDs:**
195
+ ```
196
+ Request flow across services:
197
+
198
+ Client Request → API Gateway
199
+ ↓ (correlationId: corr-123)
200
+ Order Service
201
+ ↓ (correlationId: corr-123)
202
+ Payment Service
203
+ ↓ (correlationId: corr-123)
204
+ Notification Service
205
+
206
+ All logs include correlationId: corr-123
207
+ Easy to trace entire request flow
208
+
209
+ Implementation:
210
+ import logging
211
+ from contextvars import ContextVar
+ from uuid import uuid4
212
+
213
+ correlation_id_var = ContextVar('correlation_id', default=None)
214
+
215
+ class CorrelationIdFilter(logging.Filter):
216
+ def filter(self, record):
217
+ record.correlation_id = correlation_id_var.get()
218
+ return True
219
+
220
+ # Middleware
221
+ async def correlation_middleware(request, call_next):
222
+ correlation_id = request.headers.get('X-Correlation-ID', str(uuid4()))
223
+ correlation_id_var.set(correlation_id)
224
+ response = await call_next(request)
225
+ response.headers['X-Correlation-ID'] = correlation_id
226
+ return response
227
+ ```
228
+
229
+ **Log Aggregation:**
230
+ ```
231
+ Services → Log Shipper → Centralized Log Storage → Visualization
232
+
233
+ Tools:
234
+ - ELK Stack (Elasticsearch, Logstash, Kibana)
235
+ - EFK Stack (Elasticsearch, Fluentd, Kibana)
236
+ - Loki (from Grafana)
237
+ - CloudWatch Logs (AWS)
238
+ - Stackdriver (GCP)
239
+
240
+ Query Examples:
241
+ # Find all errors for specific user
242
+ service:"order-service" AND level:"ERROR" AND userId:"user-123"
243
+
244
+ # Find slow requests
245
+ service:"payment-service" AND duration_ms:>5000
246
+
247
+ # Find requests with specific correlation ID
248
+ correlationId:"corr-123"
249
+ ```
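The same three queries can be approximated over raw JSON log lines, which may help when testing log fields locally before they reach an aggregation backend. This is a rough sketch: `search_logs` is a hypothetical helper, and the sample log lines are made up for illustration.

```python
import json


def _matches(actual, want):
    """A criterion is either a plain value (equality) or a predicate function."""
    if callable(want):
        return actual is not None and want(actual)
    return actual == want


def search_logs(lines, **criteria):
    """Yield parsed JSON log entries whose fields satisfy every criterion."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if not isinstance(entry, dict):
            continue
        if all(_matches(entry.get(field), want) for field, want in criteria.items()):
            yield entry


raw = [
    '{"service": "order-service", "level": "ERROR", "userId": "user-123", "correlationId": "corr-123"}',
    '{"service": "payment-service", "level": "INFO", "duration_ms": 7200, "correlationId": "corr-123"}',
    '{"service": "payment-service", "level": "INFO", "duration_ms": 40, "correlationId": "corr-999"}',
]

# Find all errors for a specific user
errors = list(search_logs(raw, service="order-service", level="ERROR", userId="user-123"))
# Find slow requests
slow = list(search_logs(raw, service="payment-service", duration_ms=lambda ms: ms > 5000))
# Find every entry that belongs to one request
flow = list(search_logs(raw, correlationId="corr-123"))
print(len(errors), len(slow), len(flow))
```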
250
+
251
+ ### 3. Distributed Tracing
252
+
253
+ **Purpose:** Visualize request flow across services, identify bottlenecks.
254
+
255
+ **Concepts:**
256
+
257
+ **Trace:**
258
+ ```
259
+ Entire request journey across all services
260
+
261
+ Example: User places order
262
+ Trace ID: trace-abc123
263
+
264
+ Spans in trace:
265
+ 1. api-gateway: /checkout (200ms)
266
+ 2. order-service: createOrder (150ms)
267
+ 3. payment-service: processPayment (80ms)
268
+ 4. inventory-service: reserveItems (40ms)
269
+ 5. notification-service: sendEmail (30ms)
270
+
271
+ Total: 200ms (some parallel execution)
272
+ ```
273
+
274
+ **Span:**
275
+ ```
276
+ Single operation within a trace
277
+
278
+ Span attributes:
279
+ {
280
+ "traceId": "trace-abc123",
281
+ "spanId": "span-456",
282
+ "parentSpanId": "span-123",
283
+ "name": "POST /api/v1/orders",
284
+ "startTime": "2025-12-14T15:30:45.000Z",
285
+ "endTime": "2025-12-14T15:30:45.150Z",
286
+ "duration": 150,
287
+ "status": "OK",
288
+ "attributes": {
289
+ "http.method": "POST",
290
+ "http.url": "/api/v1/orders",
291
+ "http.status_code": 201,
292
+ "user.id": "user-123",
293
+ "order.id": "order-456",
294
+ "order.total": 99.99
295
+ },
296
+ "events": [
297
+ {
298
+ "timestamp": "2025-12-14T15:30:45.050Z",
299
+ "name": "Validating order items"
300
+ },
301
+ {
302
+ "timestamp": "2025-12-14T15:30:45.100Z",
303
+ "name": "Calling payment service"
304
+ }
305
+ ]
306
+ }
307
+ ```
308
+
309
+ **Implementation (OpenTelemetry):**
310
+ ```python
311
+ from opentelemetry import trace
312
+ from opentelemetry.exporter.jaeger.thrift import JaegerExporter
313
+ from opentelemetry.sdk.trace import TracerProvider
314
+ from opentelemetry.sdk.trace.export import BatchSpanProcessor
315
+ from fastapi import FastAPI
+ from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
316
+
317
+ # Setup tracing
318
+ provider = TracerProvider()
319
+ jaeger_exporter = JaegerExporter(
320
+ agent_host_name="jaeger",
321
+ agent_port=6831
322
+ )
323
+ provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
324
+ trace.set_tracer_provider(provider)
325
+
326
+ # Instrument FastAPI
327
+ app = FastAPI()
328
+ FastAPIInstrumentor.instrument_app(app)
329
+
330
+ # Manual span creation
331
+ tracer = trace.get_tracer(__name__)
332
+
333
+ async def create_order(order_data):
334
+ with tracer.start_as_current_span("create_order") as span:
335
+ span.set_attribute("order.items_count", len(order_data.items))
336
+ span.set_attribute("order.total", order_data.total)
337
+
338
+ # Database operation
339
+ with tracer.start_as_current_span("db.insert_order"):
340
+ order_id = await db.insert_order(order_data)
341
+
342
+ # Call payment service
343
+ with tracer.start_as_current_span("http.payment_service") as payment_span:
344
+ payment_span.set_attribute("http.url", f"{PAYMENT_URL}/payments")
345
+ result = await payment_service.charge(order_id, order_data.total)
346
+
347
+ return order_id
348
+ ```
349
+
350
+ **Trace Visualization:**
351
+ ```
352
+ Jaeger UI shows:
353
+
354
+ Timeline view:
355
+ |-- api-gateway (200ms) ----------------------------------|
356
+ |-- order-service (150ms) ------------------------|
357
+ |-- db.insert_order (30ms) --|
358
+ |-- payment-service (80ms) -----------------|
359
+ |-- db.create_transaction (20ms) ----|
360
+ |-- notification-service (30ms) ----------|
361
+
362
+ Critical path highlighted
363
+ Bottlenecks identified (payment-service taking 80ms)
364
+ Parallel operations visible
365
+ ```
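A timeline like the one above is derived from the `parentSpanId` links between spans. The following sketch shows one way to rebuild that nesting from a flat span list. The `render_trace` helper, the short span IDs, the `start_ms` offsets, and the exact parent/child relationships are assumptions made for this example; only the span names and durations come from the timeline.

```python
from collections import defaultdict


def render_trace(spans):
    """Return indented text lines showing each span under its parent."""
    children = defaultdict(list)
    for span in spans:
        children[span.get("parentSpanId")].append(span)

    lines = []

    def walk(parent_id, depth):
        # Siblings are ordered by start offset so the output reads like a waterfall.
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            lines.append(f"{'  ' * depth}{span['name']} ({span['duration_ms']}ms)")
            walk(span["spanId"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


spans = [
    {"spanId": "a", "parentSpanId": None, "name": "api-gateway", "start_ms": 0, "duration_ms": 200},
    {"spanId": "b", "parentSpanId": "a", "name": "order-service", "start_ms": 10, "duration_ms": 150},
    {"spanId": "c", "parentSpanId": "b", "name": "db.insert_order", "start_ms": 15, "duration_ms": 30},
    {"spanId": "d", "parentSpanId": "b", "name": "payment-service", "start_ms": 50, "duration_ms": 80},
    {"spanId": "e", "parentSpanId": "d", "name": "db.create_transaction", "start_ms": 60, "duration_ms": 20},
    {"spanId": "f", "parentSpanId": "b", "name": "notification-service", "start_ms": 130, "duration_ms": 30},
]
lines = render_trace(spans)
print("\n".join(lines))
```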
366
+
367
+ **Sampling Strategies:**
368
+ ```
369
+ Problem: Tracing every request is expensive
370
+
371
+ Solutions:
372
+
373
+ 1. Probabilistic Sampling:
374
+ - Trace 1% of requests
375
+ - Good for high-volume services
376
+
377
+ 2. Rate Limiting Sampling:
378
+ - Max 100 traces per second
379
+ - Prevents overwhelming trace backend
380
+
381
+ 3. Tail-Based Sampling:
382
+ - Trace all errors
383
+ - Trace slow requests (>5s)
384
+ - Sample 1% of fast successful requests
385
+
386
+ 4. Priority Sampling:
387
+ - Always trace premium users
388
+ - Always trace critical endpoints
389
+ - Sample others
390
+
391
+ Implementation:
392
+ from opentelemetry.sdk.trace.sampling import (
393
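+     ParentBasedTraceIdRatio,
394
+     ALWAYS_ON,
395
+     ALWAYS_OFF
396
+ )
397
+
398
+ # Sample 1% of traces
399
+ sampler = ParentBasedTraceIdRatio(0.01)
400
+
401
+ # Or custom sampler (sees only attributes set when the span starts)
402
+ class CustomSampler:
403
+     def should_sample(self, context, trace_id, name, kind=None, attributes=None, links=None, trace_state=None):
404
+         # Always sample errors
405
+         if (attributes or {}).get("http.status_code", 0) >= 500:
406
+             return ALWAYS_ON.should_sample(context, trace_id, name, kind, attributes, links, trace_state)
407
+
408
+         # Always sample slow requests
409
+         if (attributes or {}).get("duration_ms", 0) > 5000:
410
+             return ALWAYS_ON.should_sample(context, trace_id, name, kind, attributes, links, trace_state)
411
+
412
+         # Sample 1% of others
413
+         return ParentBasedTraceIdRatio(0.01).should_sample(context, trace_id, name, kind, attributes, links, trace_state)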
414
+ ```
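The difference between probabilistic and tail-based sampling can be shown without any tracing library. The sketch below is a simplified illustration, not how a collector implements it: `ratio_sample`, `tail_sample`, and the trace dictionary shape are hypothetical. The point is that a tail-based decision is made after the trace has finished, when its status and duration are known.

```python
import hashlib


def ratio_sample(trace_id, ratio):
    """Deterministically keep roughly `ratio` of traces, keyed on the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes to a number in [0, 1) so every service makes the same choice.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio


def tail_sample(trace, slow_ms=5000, ratio=0.01):
    """Decide on a finished trace: keep errors and slow traces, sample the rest."""
    if trace["status_code"] >= 500:
        return True
    if trace["duration_ms"] > slow_ms:
        return True
    return ratio_sample(trace["trace_id"], ratio)


failed = {"trace_id": "trace-abc123", "status_code": 500, "duration_ms": 120}
slow = {"trace_id": "trace-def456", "status_code": 200, "duration_ms": 7300}
print(tail_sample(failed), tail_sample(slow))
```

Hashing the trace ID, rather than drawing a random number, means every service that sees the same trace reaches the same decision, so sampled traces stay complete.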
415
+
416
+ ## Service Level Objectives (SLOs)
417
+
418
+ ### Defining SLOs
419
+
420
+ **SLI (Service Level Indicator):**
421
+ ```
422
+ Quantitative measure of service level
423
+
424
+ Examples:
425
+ - Request latency: p99 < 200ms
426
+ - Availability: 99.9% of requests succeed
427
+ - Throughput: Handle 10,000 requests/sec
428
+ ```
429
+
430
+ **SLO (Service Level Objective):**
431
+ ```
432
+ Target value for SLI
433
+
434
+ Examples:
435
+ - 99.9% of requests complete in < 200ms
436
+ - 99.95% availability over 30 days
437
+ - Zero data loss
438
+
439
+ SLO Components:
440
+ - Metric: What you measure (latency, availability)
441
+ - Target: Threshold (99.9%, 200ms)
442
+ - Time window: Evaluation period (30 days, weekly)
443
+ ```
444
+
445
+ **SLA (Service Level Agreement):**
446
+ ```
447
+ Contract with consequences if SLO not met
448
+
449
+ Example:
450
+ - SLO: 99.9% availability
451
+ - SLA: If availability < 99.9%, customers get 10% credit
452
+
453
+ SLA ≤ SLO (leave buffer for incidents)
454
+ ```
455
+
456
+ **Error Budget:**
457
+ ```
458
+ Allowed failure to meet SLO = (100% - SLO target)
459
+
460
+ Example:
461
+ SLO: 99.9% availability
462
+ Error budget: 0.1% = 43.8 minutes downtime per month
463
+
464
+ Error budget consumed:
465
+ - Outages
466
+ - Slow responses
467
+ - Failed requests
468
+
469
+ When error budget exhausted:
470
+ - Freeze feature deployments
471
+ - Focus on reliability
472
+ - Only critical fixes deployed
473
+
474
+ Benefits:
475
+ - Balances innovation vs stability
476
+ - Data-driven deployment decisions
477
+ - Aligns engineering priorities
478
+ ```
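The downtime figure follows directly from the SLO target and the length of the evaluation window. As a quick check of the arithmetic, here is a small sketch; `allowed_downtime_minutes` is a hypothetical helper name. The 43.8-minute figure corresponds to an average calendar month (365.25 / 12 days), while a strict 30-day window gives slightly less.

```python
def allowed_downtime_minutes(slo_percent, window_days):
    """Downtime in minutes that still meets an availability SLO over the window."""
    budget_fraction = 1 - slo_percent / 100
    return budget_fraction * window_days * 24 * 60


average_month = allowed_downtime_minutes(99.9, 365.25 / 12)
thirty_days = allowed_downtime_minutes(99.9, 30)
stricter = allowed_downtime_minutes(99.95, 30)

print(round(average_month, 1))  # 99.9% over an average calendar month
print(round(thirty_days, 1))    # 99.9% over a 30-day window
print(round(stricter, 1))       # 99.95% over a 30-day window
```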
479
+
480
+ ### Implementing SLO Monitoring
481
+
482
+ **Prometheus + Grafana:**
483
+ ```
484
+ # SLI: Availability
485
+ availability_sli = (
486
+ sum(rate(http_requests_total{status!~"5.."}[30d]))
487
+ /
488
+ sum(rate(http_requests_total[30d]))
489
+ ) * 100
490
+
491
+ # SLI: Latency
492
+ latency_sli = histogram_quantile(
493
+ 0.99,
494
+ rate(http_request_duration_seconds_bucket[30d])
495
+ )
496
+
497
+ # Error Budget
498
+ error_budget_remaining = 1 - (
499
+ (1 - (availability_sli / 100))
500
+ /
501
+ (1 - (target_slo / 100))
502
+ )
503
+
504
+ Alert when error budget < 10%:
505
+ alert: ErrorBudgetCritical
506
+ expr: error_budget_remaining < 0.1
507
+ annotations:
508
+ summary: "Error budget critically low"
509
+ description: "Only 10% error budget remaining. Freeze deployments."
510
+ ```
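For an alert threshold such as "less than 10% remaining" to be meaningful, the remaining budget has to be expressed as a fraction of the total budget, not as an absolute difference in availability. A minimal sketch of that calculation from request counts follows; the function name and the sample numbers are assumptions for illustration.

```python
def error_budget_remaining(good_requests, total_requests, slo_percent):
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    if slo_percent >= 100:
        raise ValueError("a 100% SLO leaves no error budget to track")
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failure = 1 - slo_percent / 100
    actual_failure = 1 - good_requests / total_requests
    return 1 - actual_failure / allowed_failure


# 600 failures out of 1,000,000 requests against a 99.9% SLO (1,000 failures allowed)
remaining = error_budget_remaining(999_400, 1_000_000, 99.9)
print(f"{remaining:.0%} of the error budget remains")

# Over budget: 1,500 failures when only 1,000 are allowed
overspent = error_budget_remaining(998_500, 1_000_000, 99.9)
print(overspent < 0)
```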
511
+
512
+ ## Alerting Strategies
513
+
514
+ ### Alert Levels
515
+
516
+ **Critical (Page immediately):**
517
+ ```
518
+ Conditions:
519
+ - Service completely down
520
+ - Error rate > 50%
521
+ - Data loss occurring
522
+ - SLO burn rate critical
523
+
524
+ Actions:
525
+ - Page on-call engineer
526
+ - Incident created automatically
527
+ - Escalate if not acknowledged in 5 min
528
+
529
+ Example:
530
+ alert: ServiceDown
531
+ expr: up{service="payment-service"} == 0
532
+ for: 1m
533
+ severity: critical
534
+ ```
535
+
536
+ **Warning (Investigate soon):**
537
+ ```
538
+ Conditions:
539
+ - Elevated error rate (5-10%)
540
+ - Latency degraded (p99 > 500ms)
541
+ - Queue depth increasing
542
+ - Error budget < 25%
543
+
544
+ Actions:
545
+ - Slack notification
546
+ - Create ticket
547
+ - Investigate during business hours
548
+
549
+ Example:
550
+ alert: HighErrorRate
551
+ expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
552
+ for: 10m
553
+ severity: warning
554
+ ```
555
+
556
+ **Info (Awareness):**
557
+ ```
558
+ Conditions:
559
+ - Deployment completed
560
+ - Scaling event
561
+ - Configuration changed
562
+ - Capacity threshold reached
563
+
564
+ Actions:
565
+ - Log to monitoring system
566
+ - Dashboard annotation
567
+ - Optional Slack notification
568
+ ```
569
+
570
+ ### Alert Best Practices
571
+
572
+ **Actionable Alerts:**
573
+ ```
574
+ Bad Alert:
575
+ "High CPU usage"
576
+
577
+ Good Alert:
578
+ "CPU usage > 80% on order-service-pod-abc for 10 minutes
579
+ Runbook: https://wiki.company.com/runbooks/high-cpu
580
+ Likely cause: Memory leak or infinite loop
581
+ Actions: 1) Check recent deployments 2) Review logs for exceptions 3) Consider rolling back"
582
+
583
+ Include:
584
+ ✓ What is wrong
585
+ ✓ Why it matters
586
+ ✓ How to investigate
587
+ ✓ Runbook link
588
+ ✓ Suggested actions
589
+ ```
590
+
591
+ **Avoid Alert Fatigue:**
592
+ ```
593
+ Problems:
594
+ - Too many alerts
595
+ - False positives
596
+ - Non-actionable alerts
597
+ - Duplicate alerts
598
+
599
+ Solutions:
600
+ - Alert on symptoms, not causes
601
+ - Proper thresholds and durations
602
+ - Alert aggregation (don't alert per pod, alert per service)
603
+ - Regular alert review and tuning
604
+ - Auto-resolve alerts
605
+ - Silence during maintenance
606
+
607
+ Good Practice:
608
+ for: 5m # Don't alert on transient spikes
609
+ group_by: [service] # Aggregate per service
610
+ group_wait: 30s # Wait before sending
611
+ group_interval: 5m # Batch notifications
612
+ ```
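The effect of aggregating per service instead of per pod can be illustrated with a short sketch. This is not how an alert router is implemented; `group_alerts` and the alert dictionaries are hypothetical, and the example only mirrors the idea behind `group_by: [service]`.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("service",)):
    """Collapse per-pod alerts into one notification per group key."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label) for label in group_by)
        groups[key].append(alert)
    return [
        {
            **dict(zip(group_by, key)),
            "count": len(members),
            "pods": sorted(a["pod"] for a in members),
        }
        for key, members in groups.items()
    ]


firing = [
    {"service": "order-service", "pod": "order-service-pod-abc", "alert": "HighCPU"},
    {"service": "order-service", "pod": "order-service-pod-def", "alert": "HighCPU"},
    {"service": "order-service", "pod": "order-service-pod-ghi", "alert": "HighCPU"},
    {"service": "payment-service", "pod": "payment-service-pod-xyz", "alert": "HighCPU"},
]
notifications = group_alerts(firing)
for note in notifications:
    print(note)
```

Four firing alerts become two notifications, one per service, each listing the affected pods.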
613
+
614
+ ## Observability Stack
615
+
616
+ ### Recommended Tools
617
+
618
+ **Metrics:**
619
+ ```
620
+ Collection: Prometheus
621
+ - Pull-based metrics
622
+ - Time-series database
623
+ - Powerful query language (PromQL)
624
+ - Service discovery
625
+
626
+ Visualization: Grafana
627
+ - Beautiful dashboards
628
+ - Alerting integration
629
+ - Multiple data sources
630
+ - Template variables
631
+
632
+ Alternative: Datadog, New Relic, CloudWatch
633
+ ```
634
+
635
+ **Logs:**
636
+ ```
637
+ Aggregation: ELK Stack
638
+ - Elasticsearch (storage & search)
639
+ - Logstash / Fluentd (collection)
640
+ - Kibana (visualization)
641
+
642
+ Or: Loki (lightweight alternative)
643
+ - Integrates with Grafana
644
+ - Labels instead of full-text indexing
645
+ - Lower resource usage
646
+
647
+ Alternative: Splunk, Datadog, CloudWatch Logs
648
+ ```
649
+
650
+ **Tracing:**
651
+ ```
652
+ Backend: Jaeger or Zipkin
653
+ - Trace storage
654
+ - Trace visualization
655
+ - Dependency graphs
656
+ - Performance analysis
657
+
658
+ Instrumentation: OpenTelemetry
659
+ - Vendor-neutral standard
660
+ - Auto-instrumentation for common frameworks
661
+ - Manual instrumentation API
662
+ - Export to any backend
663
+
664
+ Alternative: Datadog APM, New Relic, Lightstep
665
+ ```
666
+
667
+ **All-in-One:**
668
+ ```
669
+ Observability platforms:
670
+ - Datadog (metrics, logs, traces, RUM)
671
+ - New Relic (APM, logs, infrastructure)
672
+ - Dynatrace (auto-instrumentation, AI)
673
+
674
+ Pros:
675
+ - Unified experience
676
+ - Correlated data
677
+ - Easier setup
678
+
679
+ Cons:
680
+ - Vendor lock-in
681
+ - Higher cost
682
+ - Less flexibility
683
+ ```
684
+
685
+ ### Implementation Checklist
686
+
687
+ **For Each Service:**
688
+ ```
689
+ ✓ Structured logging with correlation IDs
690
+ ✓ Metrics exported (Prometheus format)
691
+ ✓ Distributed tracing instrumented
692
+ ✓ Health check endpoints (/health/live, /health/ready)
693
+ ✓ Graceful shutdown handling
694
+ ✓ Resource limits set (CPU, memory)
695
+ ✓ Alerts configured for critical paths
696
+ ✓ Dashboards created
697
+ ✓ Runbooks documented
698
+ ✓ On-call rotation established
699
+ ```
700
+
701
+ **For System-Wide:**
702
+ ```
703
+ ✓ Centralized log aggregation
704
+ ✓ Distributed tracing backend
705
+ ✓ Metrics aggregation and storage
706
+ ✓ Unified dashboards (service overview)
707
+ ✓ Alert routing configured
708
+ ✓ Incident management process
709
+ ✓ Post-mortem template
710
+ ✓ SLO definitions and tracking
711
+ ✓ Dependency mapping
712
+ ✓ Chaos engineering experiments
713
+ ```
714
+
715
+ ## Troubleshooting Workflow
716
+
717
+ **Incident Response:**
718
+ ```
719
+ 1. Detect (Alert fires)
720
+ - Check dashboard
721
+ - Verify alert is valid
722
+ - Assess impact
723
+
724
+ 2. Triage (Determine severity)
725
+ - Critical: Page on-call
726
+ - Warning: Create ticket
727
+ - How many users affected?
728
+ - What functionality broken?
729
+
730
+ 3. Investigate (Find root cause)
731
+ - Check recent deployments
732
+ - Review logs (search by correlation ID)
733
+ - Analyze traces (slow operations)
734
+ - Check metrics (resource saturation)
735
+ - Examine dependencies
736
+
737
+ 4. Mitigate (Stop the bleeding)
738
+ - Rollback deployment
739
+ - Scale up resources
740
+ - Failover to backup
741
+ - Enable circuit breakers
742
+ - Rate limit traffic
743
+
744
+ 5. Resolve (Fix root cause)
745
+ - Deploy fix
746
+ - Verify resolution
747
+ - Monitor for recurrence
748
+
749
+ 6. Post-mortem (Learn and improve)
750
+ - Timeline of events
751
+ - Root cause analysis
752
+ - Action items
753
+ - Update runbooks
754
+ ```
755
+
756
+ **Using Traces to Debug:**
757
+ ```
758
+ Scenario: API returning 500 errors
759
+
760
+ 1. Find failing trace:
761
+ - Filter: status = error, service = api-gateway
762
+ - Sort by timestamp (most recent)
763
+
764
+ 2. Analyze span waterfall:
765
+ - Identify which service failed (order-service returned 500)
766
+ - Check error message in span
767
+ - Review span attributes
768
+
769
+ 3. Correlate with logs:
770
+ - Extract trace ID from failed trace
771
+ - Search logs: traceId:"trace-abc123"
772
+ - Find exception stack trace
773
+
774
+ 4. Check related metrics:
775
+ - order-service error rate spiked 10 min ago
776
+ - Corresponds with deployment
777
+ - Likely cause: Bad deployment
778
+
779
+ 5. Remediate:
780
+ - Rollback order-service
781
+ - Verify errors stopped
782
+ - Create ticket for bug fix
783
+ ```
784
+
785
+ ## Summary
786
+
787
+ Observability is non-negotiable in microservices:
788
+
789
+ **Must-Haves:**
790
+ - Structured logging with correlation IDs
791
+ - Metrics (RED/USE methodology)
792
+ - Distributed tracing (OpenTelemetry)
793
+ - Centralized log aggregation
794
+ - SLO tracking with error budgets
795
+ - Actionable alerts with runbooks
796
+
797
+ **Best Practices:**
798
+ - Correlate metrics, logs, and traces
799
+ - Define SLOs based on user experience
800
+ - Alert on symptoms, not causes
801
+ - Maintain runbooks for common issues
802
+ - Regular post-mortems and learning
803
+ - Practice incident response with game days
804
+
805
+ Without observability, you're flying blind in production.