@umacloud/knowledge 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (418)
  1. package/00-governance/governance-capabilities.md +557 -0
  2. package/00-governance/knowledge-map.md +39 -0
  3. package/00-governance/maintenance-policy.md +76 -0
  4. package/00-governance/review-checklist.md +81 -0
  5. package/README.md +13 -0
  6. package/ai/01-standards/agent-development-complete.md +691 -0
  7. package/ai/01-standards/llm-application-complete.md +488 -0
  8. package/ai/01-standards/mlops-complete.md +798 -0
  9. package/ai/01-standards/prompt-engineering-complete.md +646 -0
  10. package/ai/01-standards/rag-architecture-complete.md +649 -0
  11. package/ai/02-playbooks/llm-evaluation-playbook.md +847 -0
  12. package/ai/03-checklists/ai-project-checklist.md +215 -0
  13. package/ai/04-antipatterns/ai-antipatterns.md +661 -0
  14. package/ai/05-cases/case-rag-production.md +147 -0
  15. package/ai/06-glossary/ai-glossary.md +162 -0
  16. package/ai/agent-evaluation-benchmark.md +53 -0
  17. package/ai/ai-agent-memory-context-management.md +41 -0
  18. package/ai/ai-cost-capacity-optimization-playbook.md +42 -0
  19. package/ai/ai-data-security-and-compliance-playbook.md +37 -0
  20. package/ai/ai-domain-index-and-checklist.md +40 -0
  21. package/ai/ai-governance-maturity-model.md +50 -0
  22. package/ai/ai-model-selection-and-routing-strategy.md +47 -0
  23. package/ai/ai-observability-and-oncall-runbook.md +52 -0
  24. package/ai/ai-rag-engineering-playbook.md +42 -0
  25. package/ai/ai-red-team-and-safety-evaluation.md +42 -0
  26. package/ai/ai-release-readiness-and-rollback-gate.md +42 -0
  27. package/ai/llm-agent-engineering-deep-dive.md +57 -0
  28. package/ai/prompt-and-tool-guardrails.md +52 -0
  29. package/api/01-standards/enterprise-api-standards.md +198 -0
  30. package/api/01-standards/rest-api-design-guide.md +63 -0
  31. package/api/02-playbooks/api-pagination-playbook.md +93 -0
  32. package/api/02-playbooks/graphql-production-playbook.md +176 -0
  33. package/api/03-checklists/api-review-checklist.md +55 -0
  34. package/api/04-antipatterns/api-antipatterns.md +112 -0
  35. package/architecture/01-standards/api-gateway-patterns.md +496 -0
  36. package/architecture/01-standards/cloud-native-patterns.md +644 -0
  37. package/architecture/01-standards/distributed-systems-patterns.md +591 -0
  38. package/architecture/01-standards/event-driven-architecture.md +595 -0
  39. package/architecture/01-standards/microservices-patterns-complete.md +968 -0
  40. package/architecture/01-standards/microservices-patterns.md +495 -0
  41. package/architecture/01-standards/system-design-interview.md +664 -0
  42. package/architecture/02-playbooks/microservices-patterns-playbook.md +137 -0
  43. package/architecture/02-playbooks/migration-playbook.md +780 -0
  44. package/architecture/02-playbooks/system-design-playbook.md +779 -0
  45. package/architecture/03-checklists/architecture-decision-checklist.md +297 -0
  46. package/architecture/04-antipatterns/architecture-antipatterns.md +417 -0
  47. package/architecture/05-cases/case-netflix-microservices.md +413 -0
  48. package/architecture/06-glossary/architecture-glossary.md +164 -0
  49. package/architecture/adr-template-and-examples.md +38 -0
  50. package/architecture/api-gateway-deep-dive.md +1291 -0
  51. package/architecture/configuration-management.md +1162 -0
  52. package/architecture/distributed-transactions.md +1220 -0
  53. package/architecture/microservices-complete.md +735 -0
  54. package/architecture/resilience-and-disaster-patterns.md +37 -0
  55. package/architecture/service-governance.md +1198 -0
  56. package/architecture/system-architecture-deep-dive.md +37 -0
  57. package/backend/01-standards/analytics-and-growth.md +65 -0
  58. package/backend/01-standards/api-and-error-conventions.md +120 -0
  59. package/backend/01-standards/application-layering-and-packaging.md +160 -0
  60. package/backend/01-standards/auth-implementation.md +104 -0
  61. package/backend/01-standards/backend-framework-idioms.md +74 -0
  62. package/backend/01-standards/background-jobs-and-async.md +66 -0
  63. package/backend/01-standards/caching-strategies-complete.md +390 -0
  64. package/backend/01-standards/config-and-observability.md +77 -0
  65. package/backend/01-standards/data-modeling-and-persistence.md +94 -0
  66. package/backend/01-standards/django-complete.md +1765 -0
  67. package/backend/01-standards/email-and-notifications.md +64 -0
  68. package/backend/01-standards/fastapi-complete.md +925 -0
  69. package/backend/01-standards/file-upload-and-storage.md +66 -0
  70. package/backend/01-standards/graphql-api-complete.md +416 -0
  71. package/backend/01-standards/llm-application-standard.md +78 -0
  72. package/backend/01-standards/message-queue-patterns.md +379 -0
  73. package/backend/01-standards/microservices-and-distributed.md +78 -0
  74. package/backend/01-standards/nestjs-complete.md +2167 -0
  75. package/backend/01-standards/payment-integration.md +80 -0
  76. package/backend/01-standards/rate-limiting-complete.md +451 -0
  77. package/backend/01-standards/realtime-and-websocket.md +65 -0
  78. package/backend/01-standards/search-and-filtering.md +64 -0
  79. package/backend/01-standards/spring-boot-complete.md +445 -0
  80. package/backend/02-playbooks/api-design-playbook.md +718 -0
  81. package/backend/02-playbooks/email-send-playbook.md +130 -0
  82. package/backend/02-playbooks/file-upload-s3-playbook.md +153 -0
  83. package/backend/02-playbooks/typescript-enterprise-playbook.md +133 -0
  84. package/backend/02-playbooks/websocket-realtime-playbook.md +154 -0
  85. package/backend/03-checklists/api-launch-checklist.md +189 -0
  86. package/backend/04-antipatterns/backend-antipatterns.md +1051 -0
  87. package/blockchain/01-standards/blockchain-basics.md +557 -0
  88. package/blockchain/01-standards/smart-contract-development.md +1315 -0
  89. package/cicd/01-standards/deployment-and-delivery-standard.md +96 -0
  90. package/cicd/01-standards/github-actions-complete.md +473 -0
  91. package/cicd/01-standards/release-and-store-submission.md +75 -0
  92. package/cicd/02-playbooks/cicd-pipeline-playbook.md +144 -0
  93. package/cicd/02-playbooks/release-management-playbook.md +605 -0
  94. package/cicd/03-checklists/pipeline-security-checklist.md +168 -0
  95. package/cicd/04-antipatterns/cicd-antipatterns.md +589 -0
  96. package/cicd/05-cases/case-deployment-automation.md +221 -0
  97. package/cicd/05-cases/case-gitops-transformation.md +212 -0
  98. package/cicd/06-glossary/cicd-glossary.md +114 -0
  99. package/cicd/cicd-blueprint-deep-dive.md +38 -0
  100. package/cicd/release-readiness-gate.md +37 -0
  101. package/cloud-native/01-standards/container-security.md +741 -0
  102. package/cloud-native/01-standards/kubernetes-complete.md +812 -0
  103. package/cloud-native/02-playbooks/api-gateway-playbook.md +155 -0
  104. package/cloud-native/02-playbooks/gitops-with-argocd.md +760 -0
  105. package/cloud-native/02-playbooks/k8s-troubleshooting-playbook.md +1942 -0
  106. package/cloud-native/02-playbooks/message-queue-playbook.md +129 -0
  107. package/cloud-native/02-playbooks/multicloud-governance.md +726 -0
  108. package/cloud-native/02-playbooks/serverless-patterns.md +788 -0
  109. package/cloud-native/02-playbooks/service-mesh-playbook.md +612 -0
  110. package/cloud-native/02-playbooks/terraform-iac-playbook.md +143 -0
  111. package/cloud-native/03-checklists/container-security-checklist.md +431 -0
  112. package/cloud-native/03-checklists/k8s-production-readiness-checklist.md +460 -0
  113. package/cloud-native/04-antipatterns/container-antipatterns.md +660 -0
  114. package/cloud-native/04-antipatterns/k8s-antipatterns.md +743 -0
  115. package/cloud-native/05-cases/case-k8s-migration.md +478 -0
  116. package/cloud-native/05-cases/case-k8s-scaling.md +642 -0
  117. package/cloud-native/05-cases/case-k8s-security-incident.md +397 -0
  118. package/cloud-native/06-glossary/cloud-native-glossary.md +337 -0
  119. package/cross-platform/01-standards/cross-platform-frameworks.md +83 -0
  120. package/cross-platform/01-standards/platform-selection-and-architecture.md +77 -0
  121. package/data/01-standards/elasticsearch-complete.md +2098 -0
  122. package/data/01-standards/postgresql-complete.md +1613 -0
  123. package/data/01-standards/redis-complete.md +1527 -0
  124. package/data/02-playbooks/database-optimization-playbook.md +403 -0
  125. package/data/02-playbooks/elasticsearch-production-playbook.md +132 -0
  126. package/data/03-checklists/database-launch-checklist.md +187 -0
  127. package/data/04-antipatterns/database-antipatterns.md +873 -0
  128. package/data/05-cases/case-database-migration.md +310 -0
  129. package/data/06-glossary/database-glossary.md +440 -0
  130. package/data/data-governance-and-modeling-deep-dive.md +39 -0
  131. package/data-engineering/01-standards/airflow-complete.md +523 -0
  132. package/data-engineering/01-standards/kafka-complete.md +1521 -0
  133. package/data-engineering/02-playbooks/spark-etl-playbook.md +496 -0
  134. package/data-engineering/03-checklists/pipeline-launch-checklist.md +194 -0
  135. package/data-engineering/04-antipatterns/data-pipeline-antipatterns.md +684 -0
  136. package/data-engineering/05-cases/case-real-time-pipeline.md +355 -0
  137. package/data-engineering/06-glossary/data-engineering-glossary.md +429 -0
  138. package/database/01-standards/database-schema-standards.md +147 -0
  139. package/database/02-playbooks/postgresql-optimization-quick.md +52 -0
  140. package/database/02-playbooks/postgresql-performance-optimization.md +58 -0
  141. package/database/02-playbooks/postgresql-production-playbook.md +146 -0
  142. package/database/02-playbooks/redis-caching-playbook.md +117 -0
  143. package/database/03-checklists/database-review-checklist.md +50 -0
  144. package/database/04-antipatterns/database-antipatterns.md +112 -0
  145. package/design/01-standards/ui-design-system-complete.md +423 -0
  146. package/design/02-playbooks/design-handoff-playbook.md +254 -0
  147. package/design/02-playbooks/design-review-playbook.md +388 -0
  148. package/design/03-checklists/design-review-checklist.md +246 -0
  149. package/design/04-antipatterns/design-antipatterns.md +378 -0
  150. package/design/05-cases/case-design-system-adoption.md +328 -0
  151. package/design/06-glossary/design-glossary.md +329 -0
  152. package/design/ui-full-lifecycle-cross-platform-playbook.md +571 -0
  153. package/design/ux-system-deep-dive.md +38 -0
  154. package/design-systems/00-craft-rules.md +71 -0
  155. package/design-systems/aesthetic-families.md +43 -0
  156. package/design-systems/anti-ai-slop.md +162 -0
  157. package/design-systems/bold-geometric.md +120 -0
  158. package/design-systems/brutalist-bold.md +103 -0
  159. package/design-systems/editorial-clean.md +109 -0
  160. package/design-systems/glass-aurora.md +108 -0
  161. package/design-systems/modern-minimal.md +145 -0
  162. package/design-systems/premium-luxury.md +106 -0
  163. package/design-systems/product-type-design-map.md +48 -0
  164. package/design-systems/soft-warm.md +123 -0
  165. package/design-systems/tech-utility.md +113 -0
  166. package/desktop/01-standards/desktop-app-standard.md +72 -0
  167. package/desktop/01-standards/desktop-design.md +71 -0
  168. package/development/00-governance/document-template.md +41 -0
  169. package/development/01-standards/api-versioning-strategies.md +432 -0
  170. package/development/01-standards/authentication-patterns-complete.md +479 -0
  171. package/development/01-standards/css-architecture-complete.md +550 -0
  172. package/development/01-standards/database-migration-strategies.md +484 -0
  173. package/development/01-standards/elasticsearch-complete.md +347 -0
  174. package/development/01-standards/git-complete.md +371 -0
  175. package/development/01-standards/golang-complete.md +1565 -0
  176. package/development/01-standards/graphql-complete.md +298 -0
  177. package/development/01-standards/javascript-bundlers-complete.md +469 -0
  178. package/development/01-standards/javascript-typescript-complete.md +528 -0
  179. package/development/01-standards/jest-complete.md +275 -0
  180. package/development/01-standards/linux-complete.md +234 -0
  181. package/development/01-standards/logging-observability-complete.md +526 -0
  182. package/development/01-standards/microservices-communication.md +502 -0
  183. package/development/01-standards/mongodb-complete.md +406 -0
  184. package/development/01-standards/oauth2-complete.md +285 -0
  185. package/development/01-standards/performance-optimization-complete.md +289 -0
  186. package/development/01-standards/playwright-complete.md +247 -0
  187. package/development/01-standards/postgresql-complete.md +456 -0
  188. package/development/01-standards/pytest-complete.md +340 -0
  189. package/development/01-standards/python-async-programming.md +902 -0
  190. package/development/01-standards/python-complete.md +956 -0
  191. package/development/01-standards/python-decorators-complete.md +799 -0
  192. package/development/01-standards/python-design-patterns.md +2854 -0
  193. package/development/01-standards/python-packaging-distribution.md +420 -0
  194. package/development/01-standards/python-testing-strategies.md +607 -0
  195. package/development/01-standards/python-web-frameworks-comparison.md +471 -0
  196. package/development/01-standards/redis-complete.md +317 -0
  197. package/development/01-standards/rest-api-complete.md +316 -0
  198. package/development/01-standards/rust-complete.md +578 -0
  199. package/development/01-standards/typescript-advanced-types.md +1513 -0
  200. package/development/01-standards/web-security-complete.md +292 -0
  201. package/development/02-playbooks/api-design-playbook.md +810 -0
  202. package/development/02-playbooks/database-migration-playbook.md +580 -0
  203. package/development/02-playbooks/debugging-playbook.md +692 -0
  204. package/development/02-playbooks/feature-delivery-playbook.md +430 -0
  205. package/development/02-playbooks/incident-hotfix-playbook.md +387 -0
  206. package/development/02-playbooks/performance-optimization-playbook.md +531 -0
  207. package/development/02-playbooks/performance-tuning-playbook.md +652 -0
  208. package/development/02-playbooks/refactor-playbook.md +403 -0
  209. package/development/02-playbooks/release-playbook.md +469 -0
  210. package/development/03-checklists/architecture-review-checklist.md +168 -0
  211. package/development/03-checklists/data-migration-checklist.md +157 -0
  212. package/development/03-checklists/oncall-handover-checklist.md +173 -0
  213. package/development/03-checklists/pr-checklist.md +158 -0
  214. package/development/03-checklists/production-readiness-checklist.md +190 -0
  215. package/development/03-checklists/release-readiness-checklist.md +154 -0
  216. package/development/03-checklists/security-review-checklist.md +182 -0
  217. package/development/04-antipatterns/api-antipatterns.md +657 -0
  218. package/development/04-antipatterns/architecture-antipatterns.md +686 -0
  219. package/development/04-antipatterns/backend-antipatterns.md +648 -0
  220. package/development/04-antipatterns/cicd-antipatterns.md +540 -0
  221. package/development/04-antipatterns/code-smell-antipatterns.md +571 -0
  222. package/development/04-antipatterns/data-antipatterns.md +658 -0
  223. package/development/04-antipatterns/database-antipatterns.md +578 -0
  224. package/development/04-antipatterns/frontend-antipatterns.md +635 -0
  225. package/development/04-antipatterns/reliability-antipatterns.md +700 -0
  226. package/development/04-antipatterns/security-antipatterns.md +747 -0
  227. package/development/05-cases/case-api-version-migration.md +428 -0
  228. package/development/05-cases/case-authorization-hardening.md +383 -0
  229. package/development/05-cases/case-bluegreen-rollback.md +466 -0
  230. package/development/05-cases/case-cache-snowball-protection.md +485 -0
  231. package/development/05-cases/case-ci-cd-pipeline.md +544 -0
  232. package/development/05-cases/case-database-scaling.md +500 -0
  233. package/development/05-cases/case-db-hotspot-optimization.md +487 -0
  234. package/development/05-cases/case-incident-mttr-reduction.md +563 -0
  235. package/development/05-cases/case-microservice-migration.md +375 -0
  236. package/development/05-cases/case-performance-optimization.md +406 -0
  237. package/development/05-cases/case-security-incident-response.md +345 -0
  238. package/development/06-glossary/full-stack-glossary.md +166 -0
  239. package/development/09-maturity/quarterly-audit-template.md +35 -0
  240. package/development/11-ui-excellence/ui-aesthetic-system.md +41 -0
  241. package/development/11-ui-excellence/ui-engineering-excellence.md +435 -0
  242. package/development/12-scenarios/development-scenarios-guide.md +565 -0
  243. package/development/13-implementation-assets/implementation-toolkit.md +282 -0
  244. package/development/13-implementation-assets/knowledge-gates-execution.md +43 -0
  245. package/development/14-full-lifecycle/software-lifecycle-gates.md +511 -0
  246. package/development/15-lifecycle-templates/project-templates-collection.md +791 -0
  247. package/development/api-contract-and-versioning-guide.md +36 -0
  248. package/development/api-governance-complete.md +43 -0
  249. package/development/backend-engineering-complete.md +43 -0
  250. package/development/code-review-quality-complete.md +43 -0
  251. package/development/concurrency-reliability-complete.md +43 -0
  252. package/development/database-engineering-complete.md +43 -0
  253. package/development/engineering-effectiveness-complete.md +43 -0
  254. package/development/engineering-standards-deep-dive.md +38 -0
  255. package/development/frontend-engineering-complete.md +43 -0
  256. package/development/performance-capacity-complete.md +43 -0
  257. package/development/refactor-migration-complete.md +42 -0
  258. package/development/refactoring-and-techdebt-playbook.md +37 -0
  259. package/development/security-in-development-complete.md +43 -0
  260. package/devops/01-standards/cicd-pipeline-complete.md +262 -0
  261. package/devops/01-standards/docker-complete.md +1490 -0
  262. package/devops/01-standards/github-actions-complete.md +337 -0
  263. package/devops/01-standards/kubernetes-complete.md +638 -0
  264. package/devops/01-standards/terraform-complete.md +2117 -0
  265. package/devops/02-playbooks/docker-compose-playbook.md +233 -0
  266. package/devops/02-playbooks/docker-k8s-production-playbook.md +186 -0
  267. package/devops/02-playbooks/docker-production-playbook.md +952 -0
  268. package/edge-iot/01-standards/edge-iot-complete.md +473 -0
  269. package/experts/architect/api-design.md +178 -0
  270. package/experts/architect/methodology.md +124 -0
  271. package/experts/architect/security.md +75 -0
  272. package/experts/backend-lead/methodology.md +216 -0
  273. package/experts/devops/methodology.md +160 -0
  274. package/experts/frontend-lead/methodology.md +178 -0
  275. package/experts/product-manager/industry/ecommerce.md +43 -0
  276. package/experts/product-manager/industry/saas.md +40 -0
  277. package/experts/product-manager/methodology.md +97 -0
  278. package/experts/qa-lead/methodology.md +123 -0
  279. package/experts/qa-lead/test-strategy.md +128 -0
  280. package/experts/uiux-designer/methodology.md +125 -0
  281. package/frontend/01-standards/accessibility-complete.md +532 -0
  282. package/frontend/01-standards/accessibility-standard.md +74 -0
  283. package/frontend/01-standards/admin-dashboard-and-crud.md +72 -0
  284. package/frontend/01-standards/design-tokens-complete.md +444 -0
  285. package/frontend/01-standards/forms-and-validation.md +77 -0
  286. package/frontend/01-standards/frontend-architecture-and-layering.md +119 -0
  287. package/frontend/01-standards/i18n-and-localization.md +65 -0
  288. package/frontend/01-standards/nextjs-complete.md +451 -0
  289. package/frontend/01-standards/react-complete.md +713 -0
  290. package/frontend/01-standards/react-hooks-complete-guide.md +1100 -0
  291. package/frontend/01-standards/react-hooks-complete.md +1171 -0
  292. package/frontend/01-standards/seo-and-web-vitals.md +77 -0
  293. package/frontend/01-standards/state-management-complete.md +444 -0
  294. package/frontend/01-standards/vue-complete.md +499 -0
  295. package/frontend/01-standards/vue3-complete.md +2002 -0
  296. package/frontend/01-standards/web-framework-best-practices.md +64 -0
  297. package/frontend/01-standards/web-performance-complete.md +495 -0
  298. package/frontend/02-playbooks/accessibility-a11y-playbook.md +161 -0
  299. package/frontend/02-playbooks/frontend-performance-playbook.md +707 -0
  300. package/frontend/02-playbooks/i18n-internationalization-playbook.md +120 -0
  301. package/frontend/02-playbooks/performance-optimization-playbook.md +163 -0
  302. package/frontend/02-playbooks/react-nextjs-production-playbook.md +167 -0
  303. package/frontend/02-playbooks/react-state-management-playbook.md +173 -0
  304. package/frontend/03-checklists/component-quality-checklist.md +166 -0
  305. package/frontend/03-checklists/frontend-launch-checklist.md +299 -0
  306. package/frontend/04-antipatterns/frontend-antipatterns.md +886 -0
  307. package/frontend/05-cases/case-performance-optimization.md +274 -0
  308. package/harmony/01-standards/harmonyos-arkts-standard.md +75 -0
  309. package/harmony/01-standards/harmonyos-design.md +65 -0
  310. package/high-quality-engineering-playbook.md +54 -0
  311. package/incident/01-standards/incident-response-complete.md +303 -0
  312. package/incident/02-playbooks/chaos-engineering-playbook.md +883 -0
  313. package/incident/02-playbooks/postmortem-playbook.md +398 -0
  314. package/incident/03-checklists/incident-readiness-checklist.md +181 -0
  315. package/incident/04-antipatterns/incident-antipatterns.md +490 -0
  316. package/incident/05-cases/case-cascade-failure.md +176 -0
  317. package/incident/06-glossary/incident-glossary.md +114 -0
  318. package/incident/postmortem-and-response-deep-dive.md +39 -0
  319. package/industries/ecommerce/ecommerce-complete.md +631 -0
  320. package/industries/education/education-complete.md +555 -0
  321. package/industries/fintech/fintech-complete.md +501 -0
  322. package/industries/gaming/gaming-complete.md +587 -0
  323. package/industries/healthcare/healthcare-complete.md +452 -0
  324. package/low-code/01-standards/low-code-complete.md +944 -0
  325. package/miniprogram/01-standards/ai-common-mistakes.md +61 -0
  326. package/miniprogram/01-standards/miniprogram-custom-navbar-capsule.md +77 -0
  327. package/miniprogram/01-standards/miniprogram-design.md +61 -0
  328. package/miniprogram/01-standards/miniprogram-standard.md +81 -0
  329. package/mobile/01-standards/android-material-design.md +70 -0
  330. package/mobile/01-standards/flutter-complete.md +384 -0
  331. package/mobile/01-standards/ios-design-hig.md +78 -0
  332. package/mobile/01-standards/mobile-app-standard.md +85 -0
  333. package/mobile/01-standards/react-native-complete.md +352 -0
  334. package/mobile/02-playbooks/mobile-cross-platform-playbook.md +175 -0
  335. package/mobile/02-playbooks/mobile-performance.md +473 -0
  336. package/mobile/03-checklists/mobile-release-checklist.md +234 -0
  337. package/mobile/04-antipatterns/mobile-antipatterns.md +798 -0
  338. package/mobile/05-cases/case-app-performance.md +500 -0
  339. package/mobile/05-cases/case-app-startup-optimization.md +218 -0
  340. package/mobile/06-glossary/mobile-glossary.md +484 -0
  341. package/observability/01-standards/observability-standards.md +103 -0
  342. package/observability/02-playbooks/prometheus-grafana-playbook.md +135 -0
  343. package/observability/02-playbooks/structured-logging-playbook.md +73 -0
  344. package/observability/03-checklists/observability-checklist.md +54 -0
  345. package/observability/04-antipatterns/observability-antipatterns.md +106 -0
  346. package/operations/01-standards/prometheus-monitoring-complete.md +1578 -0
  347. package/operations/02-playbooks/capacity-planning-playbook.md +620 -0
  348. package/operations/03-checklists/production-launch-checklist.md +365 -0
  349. package/operations/04-antipatterns/operations-antipatterns.md +664 -0
  350. package/operations/05-cases/case-sre-practices.md +581 -0
  351. package/operations/06-glossary/operations-glossary.md +120 -0
  352. package/operations/aiops-anomaly-detection.md +758 -0
  353. package/operations/capacity-planning.md +1061 -0
  354. package/operations/chaos-engineering.md +659 -0
  355. package/operations/incident-command-system.md +38 -0
  356. package/operations/observability-complete.md +442 -0
  357. package/operations/slo-sli-playbook.md +517 -0
  358. package/operations/sre-operations-deep-dive.md +39 -0
  359. package/package.json +8 -0
  360. package/performance/01-standards/performance-and-scalability.md +80 -0
  361. package/performance/01-standards/performance-standards.md +156 -0
  362. package/performance/02-playbooks/query-optimization-playbook.md +103 -0
  363. package/performance/03-checklists/performance-checklist.md +56 -0
  364. package/performance/04-antipatterns/performance-antipatterns.md +146 -0
  365. package/product/01-standards/product-management-complete.md +285 -0
  366. package/product/02-playbooks/feature-launch-playbook.md +207 -0
  367. package/product/02-playbooks/user-research-playbook.md +532 -0
  368. package/product/03-checklists/feature-launch-checklist.md +275 -0
  369. package/product/04-antipatterns/product-antipatterns.md +355 -0
  370. package/product/05-cases/case-mvp-to-scale.md +384 -0
  371. package/product/06-glossary/product-glossary.md +462 -0
  372. package/product/feature-prioritization-framework.md +40 -0
  373. package/product/kpi-and-metric-tree.md +37 -0
  374. package/product/product-discovery-and-prd-deep-dive.md +41 -0
  375. package/quantum/01-standards/quantum-complete.md +1186 -0
  376. package/security/01-standards/api-security-complete.md +511 -0
  377. package/security/01-standards/container-runtime-security.md +574 -0
  378. package/security/01-standards/data-protection-gdpr.md +543 -0
  379. package/security/01-standards/owasp-top10-complete.md +1890 -0
  380. package/security/01-standards/secure-coding-baseline.md +90 -0
  381. package/security/01-standards/supply-chain-security.md +441 -0
  382. package/security/01-standards/web-security-checklist.md +108 -0
  383. package/security/01-standards/zero-trust-architecture.md +521 -0
  384. package/security/02-playbooks/auth-sso-playbook.md +166 -0
  385. package/security/02-playbooks/incident-response-security-playbook.md +588 -0
  386. package/security/02-playbooks/owasp-api-security-playbook.md +129 -0
  387. package/security/02-playbooks/payment-integration-playbook.md +119 -0
  388. package/security/02-playbooks/penetration-testing-playbook.md +517 -0
  389. package/security/03-checklists/security-audit-checklist.md +356 -0
  390. package/security/04-antipatterns/security-coding-antipatterns.md +580 -0
  391. package/security/05-cases/case-log4shell-incident.md +537 -0
  392. package/security/05-cases/case-major-breaches.md +468 -0
  393. package/security/06-glossary/security-glossary.md +212 -0
  394. package/security/compliance-automation.md +993 -0
  395. package/security/container-security.md +680 -0
  396. package/security/devsecops-complete.md +426 -0
  397. package/security/sast-dast-sca.md +775 -0
  398. package/security/secrets-management.md +594 -0
  399. package/security/security-architecture-deep-dive.md +37 -0
  400. package/security/threat-modeling-stride-playbook.md +40 -0
  401. package/seed-templates/auth-system.md +59 -0
  402. package/seed-templates/blog-content.md +94 -0
  403. package/seed-templates/dashboard.md +89 -0
  404. package/seed-templates/docs-site.md +73 -0
  405. package/seed-templates/e-commerce.md +50 -0
  406. package/seed-templates/saas-landing.md +92 -0
  407. package/seed-templates/settings-page.md +51 -0
  408. package/testing/01-standards/test-strategy-and-layering.md +83 -0
  409. package/testing/01-standards/testing-strategy-complete.md +422 -0
  410. package/testing/01-standards/unit-testing-best-practices.md +118 -0
  411. package/testing/02-playbooks/e2e-testing-playbook.md +988 -0
  412. package/testing/02-playbooks/testing-strategy-playbook.md +126 -0
  413. package/testing/03-checklists/test-strategy-checklist.md +208 -0
  414. package/testing/04-antipatterns/testing-antipatterns.md +718 -0
  415. package/testing/05-cases/case-testing-transformation.md +300 -0
  416. package/testing/06-glossary/testing-glossary.md +110 -0
  417. package/testing/risk-based-test-matrix.md +36 -0
  418. package/testing/testing-strategy-deep-dive.md +37 -0
@@ -0,0 +1,1578 @@
+ ---
+ id: prometheus-monitoring-complete
+ title: Prometheus Monitoring Complete Guide
+ domain: operations
+ category: 01-standards
+ difficulty: intermediate
+ tags: [complete, grafana-dashboards, kubernetes-monitoring, monitoring, operations, prometheus, prometheus-core-concepts, promql-query-language]
+ quality_score: 70
+ last_updated: 2026-06-15
+ ---
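+ # Prometheus Monitoring Complete Guide
+
+ ## Overview
+
+ Observability is a core operational capability for modern distributed systems. It rests on three pillars:
+
+ 1. **Metrics**: aggregatable numeric time series data that answer "what state is the system in?"
+ 2. **Logs**: discrete event records that answer "what happened?"
+ 3. **Traces**: cross-service request paths that answer "which path did the request take?"
+
+ Prometheus focuses on collecting and querying metrics. It is a CNCF graduated project and has become the de facto standard for cloud-native monitoring. It uses a pull model, actively scraping the metrics endpoints that targets expose, and pairs with Alertmanager for alerting and Grafana for visualization.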
+
+ ### Core Architecture
+
+ ```
+ ┌───────────────┐    scrape     ┌──────────────┐
+ │ App/Exporter  │ ◄──────────── │  Prometheus  │
+ │ /metrics      │               │  Server      │
+ └───────────────┘               │  - TSDB      │
+                                 │  - PromQL    │
+ ┌───────────────┐    scrape     │  - Rules     │
+ │ Node Exporter │ ◄──────────── │              │
+ └───────────────┘               └──────┬───────┘
+                                        │
+                     ┌──────────────────┼──────────────────┐
+                     ▼                  ▼                  ▼
+             ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
+             │  Alertmanager  │ │    Grafana     │ │  Remote Write  │
+             │ Alert routing  │ │ Visualization  │ │ Remote storage │
+             └────────────────┘ └────────────────┘ └────────────────┘
+ ```
+
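To make the pull model concrete, here is a minimal Python sketch of the text a target might serve at its `/metrics` endpoint for Prometheus to scrape. The helper name `render_counter` and the sample values are assumptions for illustration only. Real services would normally use an official client library rather than formatting this text by hand.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. This is a hypothetical
    helper, not part of any Prometheus client library, and it does not escape
    special characters in label values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Labels are rendered as key="value" pairs inside curly braces.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


page = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "201"}, 83),
    ],
)
print(page)
```

Each scrape is an HTTP GET that returns a page like this. Prometheus attaches a timestamp and stores every line as a sample of its time series.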
+ ---
+
+ ## Prometheus Core Concepts
+
+ ### 1. Metric Types
+
+ #### Counter
+ A monotonically increasing value that can only go up or be reset to zero (for example, when the process restarts). Suited to totals such as requests served, errors, and bytes processed.
+
+ ```promql
+ # Sample metrics
+ http_requests_total{method="GET", handler="/api/users", status="200"} 1027
+ http_requests_total{method="POST", handler="/api/users", status="201"} 83
+
+ # Per-second request rate
+ rate(http_requests_total[5m])
+
+ # Increase over the window
+ increase(http_requests_total[1h])
+ ```
+
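The reset behavior is what makes `rate` and `increase` safe to use on counters. The Python sketch below shows the idea under simplifying assumptions: the sample data is made up, and the real Prometheus functions additionally extrapolate to the edges of the query window, which this version skips.

```python
def increase(samples):
    """Total counter growth across `samples`, treating any drop as a reset.

    `samples` is a list of (timestamp_seconds, value) pairs in time order.
    Simplified sketch: no extrapolation to the window boundaries.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up, so a decrease means the process restarted
        # and the counter began again from zero.
        total += cur if cur < prev else cur - prev
    return total


def rate(samples):
    """Average per-second growth over the time spanned by `samples`."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Hypothetical scrapes every 15 s; the counter resets between t=30 and t=45.
window = [(0, 1000), (15, 1030), (30, 1060), (45, 20), (60, 50)]
print(increase(window))  # 110.0
print(rate(window))      # about 1.83 requests per second
```

Subtracting the first raw value from the last would give a negative number here, which is why raw counter values are rarely graphed directly.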
+ #### Gauge
+ An instantaneous value that can go up or down arbitrarily. Suited to temperature, memory usage, current connection count, queue depth, and similar measurements.
+
+ ```promql
+ # Sample metrics
+ node_memory_MemAvailable_bytes 4294967296
+ go_goroutines 42
+ queue_depth{queue="orders"} 156
+
+ # Query the current value directly
+ node_memory_MemAvailable_bytes
+
+ # Compute the trend of change
+ delta(node_memory_MemAvailable_bytes[1h])
+ deriv(node_memory_MemAvailable_bytes[1h])
+ ```
+
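`delta` and `deriv` both describe how a gauge changes, but they answer different questions. As a rough Python sketch with made-up samples: `delta` compares the first and last values, while `deriv` fits a least-squares line through all samples and reports its slope per second. The real `delta` also extrapolates to the window boundaries, which is omitted here.

```python
def delta(samples):
    """Difference between the last and first gauge value in the window."""
    return samples[-1][1] - samples[0][1]


def deriv(samples):
    """Per-second slope of a least-squares line fitted through the samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    return covariance / variance


# Hypothetical available-memory readings (MiB) taken every 60 s;
# the last reading dips more sharply than the earlier trend.
readings = [(0, 4000), (60, 3940), (120, 3880), (180, 3700)]
print(delta(readings))  # -300
print(deriv(readings))  # about -1.6 MiB per second
```

Because `deriv` uses every sample, a single outlier at the end of the window moves it less than it moves `delta` divided by the window length (about -1.67 here).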
+ #### Histogram
+ Counts observations into configurable buckets and also records their sum and count. Use it for request latency, response size, and other cases that need quantiles.
+
+ ```promql
+ # Sample series (three groups are generated automatically: _bucket, _sum, _count)
+ http_request_duration_seconds_bucket{le="0.005"} 24054
+ http_request_duration_seconds_bucket{le="0.01"} 33444
+ http_request_duration_seconds_bucket{le="0.025"} 100392
+ http_request_duration_seconds_bucket{le="0.05"} 129389
+ http_request_duration_seconds_bucket{le="0.1"} 133988
+ http_request_duration_seconds_bucket{le="+Inf"} 144320
+ http_request_duration_seconds_sum 53423
+ http_request_duration_seconds_count 144320
+
+ # P99 latency (per series; wrap the rate in sum by (le) to combine instances)
+ histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
+
+ # Average latency
+ rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
+ ```
+
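To make the quantile estimate less of a black box, here is a minimal Python sketch of the linear interpolation that `histogram_quantile` applies to cumulative buckets. It runs on the raw sample counts listed above rather than on `rate()` output, assumes classic (non-native) histograms, and the function name simply mirrors the PromQL function for readability.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs sorted by bound."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the best
                # available answer is the highest finite upper bound.
                return buckets[-2][0]
            if count == lower_count:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


BUCKETS = [
    (0.005, 24054), (0.01, 33444), (0.025, 100392),
    (0.05, 129389), (0.1, 133988), (math.inf, 144320),
]

print(histogram_quantile(0.50, BUCKETS))
print(histogram_quantile(0.99, BUCKETS))  # 0.1: P99 falls in the +Inf bucket
```

The P99 result shows why bucket boundaries matter: once the quantile lands in the `+Inf` bucket, the estimate is capped at the largest finite boundary, so the buckets should extend past the latencies you care about.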
+ #### Summary
+ Computes quantiles directly on the client, so the quantiles cannot be aggregated across instances. Prefer Histogram unless you have a specific requirement.
+
+ ```promql
+ # Sample series
+ rpc_duration_seconds{quantile="0.5"} 0.023
+ rpc_duration_seconds{quantile="0.9"} 0.056
+ rpc_duration_seconds{quantile="0.99"} 0.148
+ rpc_duration_seconds_sum 1.7560473e+04
+ rpc_duration_seconds_count 2693
+ ```
+
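+ ### 2. Labels
+
+ Labels are the core dimensional data model in Prometheus: every unique combination of label values is a separate time series.
+
+ ```promql
+ # Good label design: low cardinality, high discriminating power
+ http_requests_total{method="GET", status="200", service="user-api"}
+ http_requests_total{method="POST", status="500", service="user-api"}
+
+ # Bad label design: high cardinality causes a series explosion
+ http_requests_total{user_id="abc123"} # user ID as a label → millions of series
+ http_requests_total{request_id="..."} # request ID as a label → unbounded series
+ http_requests_total{ip="10.0.0.1"} # IP address as a label → high cardinality
+ ```
+
+ **Cardinality guidelines:**
+ - Keep the number of distinct values per label (its cardinality) to a few hundred at most
+ - Do not put user IDs, IP addresses, request IDs, trace IDs, or similar values in labels
+ - Handle high-cardinality dimensions in a logging or tracing system instead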
+
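A quick way to see why one high-cardinality label is so costly: the worst-case series count for a metric is the product of the distinct values of each label. The Python sketch below uses made-up cardinalities purely for illustration.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on series for one metric: the product of per-label cardinalities."""
    return prod(label_cardinalities.values())


# Hypothetical numbers of distinct values per label.
GOOD = {"method": 5, "status": 10, "service": 20}
BAD = dict(GOOD, user_id=1_000_000)

print(max_series(GOOD))  # 1000
print(max_series(BAD))   # 1000000000
```

Real workloads rarely hit the full product because not every combination occurs, but the bound is useful when reviewing a new label before it ships.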
+ ### 3. Scrape Configuration
+
+ ```yaml
+ # prometheus.yml
+ global:
+   scrape_interval: 15s     # global scrape interval
+   evaluation_interval: 15s # rule evaluation interval
+   scrape_timeout: 10s      # scrape timeout
+
+ # Scrape target configuration
+ scrape_configs:
+   # Static targets
+   - job_name: "web-api"
+     metrics_path: /metrics # default: /metrics
+     scheme: https          # default: http
+     static_configs:
+       - targets:
+           - "api-server-1:8080"
+           - "api-server-2:8080"
+         labels:
+           env: production
+           team: backend
+
+   # Target with authentication
+   - job_name: "secure-service"
+     scheme: https # tls_config only takes effect over HTTPS
+     bearer_token_file: /etc/prometheus/token
+     tls_config:
+       ca_file: /etc/prometheus/ca.pem
+       insecure_skip_verify: false
+     static_configs:
+       - targets: ["secure-svc:9090"]
+
+   # Metric relabeling
+   - job_name: "node-exporter"
+     static_configs:
+       - targets: ["node1:9100", "node2:9100"]
+     metric_relabel_configs:
+       - source_labels: [__name__]
+         regex: "node_cpu_seconds_total"
+         action: keep
+       - source_labels: [mode]
+         regex: "idle"
+         action: drop
+ ```
+
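The `keep` and `drop` actions in `metric_relabel_configs` can be modeled roughly as follows. This Python sketch covers only those two actions, joins source labels with the default `;` separator, and uses Python's `re.fullmatch` to approximate the fully anchored RE2 matching that Prometheus applies; the helper name and sample data are hypothetical.

```python
import re


def apply_relabel(labels, rules):
    """Return the labels if the sample survives every rule, otherwise None."""
    for rule in rules:
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        matched = re.fullmatch(rule["regex"], value) is not None
        if rule["action"] == "keep" and not matched:
            return None  # keep: discard anything that does not match
        if rule["action"] == "drop" and matched:
            return None  # drop: discard anything that matches
    return labels


# The same two rules as the node-exporter job above.
RULES = [
    {"source_labels": ["__name__"], "regex": "node_cpu_seconds_total", "action": "keep"},
    {"source_labels": ["mode"], "regex": "idle", "action": "drop"},
]

samples = [
    {"__name__": "node_cpu_seconds_total", "mode": "user"},
    {"__name__": "node_cpu_seconds_total", "mode": "idle"},
    {"__name__": "node_memory_MemTotal_bytes"},
]
for sample in samples:
    print(sample, "->", "kept" if apply_relabel(sample, RULES) else "discarded")
```

Rules run in order, so the `keep` rule first narrows the job to a single metric and the `drop` rule then removes the `idle` mode from what is left.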
+ ### 4. Service Discovery
+
+ ```yaml
+ scrape_configs:
+   # Kubernetes service discovery
+   - job_name: "kubernetes-pods"
+     kubernetes_sd_configs:
+       - role: pod
+         namespaces:
+           names: ["production", "staging"]
+     relabel_configs:
+       # Scrape only Pods annotated with prometheus.io/scrape: "true"
+       - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
+         action: keep
+         regex: true
+       # Override the port from the prometheus.io/port annotation, keeping the host part
+       - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
+         action: replace
+         target_label: __address__
+         regex: ([^:]+)(?::\d+)?;(\d+)
+         replacement: $1:$2
+       # Keep the namespace and Pod name as target labels
+       - source_labels: [__meta_kubernetes_namespace]
+         target_label: namespace
+       - source_labels: [__meta_kubernetes_pod_name]
+         target_label: pod
+
+   # Consul service discovery
+   - job_name: "consul-services"
+     consul_sd_configs:
+       - server: "consul.service.consul:8500"
+         services: ["web", "api", "worker"]
+     relabel_configs:
+       - source_labels: [__meta_consul_tags]
+         regex: ".*,monitor,.*"
+         action: keep
+
+   # DNS service discovery
+   - job_name: "dns-services"
+     dns_sd_configs:
+       - names: ["_prometheus._tcp.example.com"]
+         type: SRV
+         refresh_interval: 30s
+
+   # File-based service discovery (well suited to dynamic environments)
+   - job_name: "file-sd"
+     file_sd_configs:
+       - files:
+           - "/etc/prometheus/targets/*.json"
+         refresh_interval: 5m
+ ```
+
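For file-based discovery, each JSON file holds a list of target groups with `targets` and optional `labels`. The Python sketch below shows one way an external script might produce such a file; the helper name, file name, and targets are assumptions for illustration. It writes to a temporary file first and renames it, so Prometheus never reads a half-written file.

```python
import json
import os
import tempfile


def write_target_file(path, groups):
    """Atomically write file_sd target groups as JSON."""
    directory = os.path.dirname(path) or "."
    # The .tmp suffix keeps the partial file outside the "*.json" glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems


GROUPS = [
    {
        "targets": ["api-server-1:8080", "api-server-2:8080"],
        "labels": {"env": "production", "team": "backend"},
    },
]

# Demo: write into a throwaway directory instead of /etc/prometheus/targets.
with tempfile.TemporaryDirectory() as target_dir:
    demo_path = os.path.join(target_dir, "web-api.json")
    write_target_file(demo_path, GROUPS)
    with open(demo_path) as handle:
        print(handle.read())
```

Prometheus also watches the files for changes, so `refresh_interval` mostly acts as a fallback re-read period.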
+ ---
+
+ ## PromQL Query Language
+
+ ### 1. Data Types
+
+ | Type | Description | Example |
+ |------|------|------|
+ | Instant vector | The single most recent sample of each series | `http_requests_total` |
+ | Range vector | The samples of each series over a time range | `http_requests_total[5m]` |
+ | Scalar | A single floating-point value | `42`, `3.14` |
+ | String | A string value (rarely used) | `"hello"` |
+
+ ### 2. Selectors and Matchers
+
+ ```promql
+ # Exact match
+ http_requests_total{method="GET"}
+
+ # Not equal
+ http_requests_total{status!="200"}
+
+ # Regex match (the regex is fully anchored)
+ http_requests_total{handler=~"/api/.*"}
+
+ # Negative regex match
+ http_requests_total{handler!~"/health|/ready"}
+
+ # Combined matchers
+ http_requests_total{method="GET", status=~"5..", service="user-api"}
+ ```
+
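The four matcher operators can be modeled in a few lines of Python. In this sketch a missing label is treated as an empty string and regexes are fully anchored, which is how PromQL behaves; Python's `re` stands in for RE2, and the function name and sample labels are made up.

```python
import re


def matches(labels, matchers):
    """Check a label set against (name, operator, value) matchers."""
    for name, operator, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like ""
        if operator == "=":
            ok = actual == value
        elif operator == "!=":
            ok = actual != value
        elif operator == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif operator == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {operator}")
        if not ok:
            return False
    return True


series = {"method": "GET", "status": "503", "handler": "/healthz"}

print(matches(series, [("status", "=~", "5..")]))
# Anchoring: "/healthz" is not excluded by "/health|/ready".
print(matches(series, [("handler", "!~", "/health|/ready")]))
```

The second call is the common surprise: because the pattern must match the whole value, excluding every path under `/health` needs `/health.*`.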
+ ### 3. Range Vectors and Offset
+
+ ```promql
+ # Samples from the last 5 minutes
+ http_requests_total[5m]
+
+ # Instant value as of 1 hour ago
+ http_requests_total offset 1h
+
+ # The 5-minute range that ended 1 hour ago
+ http_requests_total[5m] offset 1h
+
+ # Supported time units: ms s m h d w y
+ ```
+
+ ### 4. Aggregation Operators
+
+ ```promql
+ # Sum (aggregated by dimension)
+ sum(rate(http_requests_total[5m])) by (service)
+
+ # Average (take the rate first; averaging a raw counter is not meaningful)
+ avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
+
+ # Max / min
+ max(node_memory_MemAvailable_bytes) by (instance)
+ min(node_filesystem_avail_bytes) by (mountpoint)
+
+ # Count
+ count(up == 1) by (job)
+
+ # Standard deviation
+ stddev(rate(http_request_duration_seconds_sum[5m]))
+
+ # Quantile
+ quantile(0.95, rate(http_request_duration_seconds_sum[5m]))
+
+ # Top-K / Bottom-K
+ topk(5, rate(http_requests_total[5m]))
+ bottomk(3, node_filesystem_avail_bytes)
+
+ # without: aggregate after dropping the listed dimensions
+ sum without (instance) (rate(http_requests_total[5m]))
+
+ # count_values: count how many series carry each value
+ count_values("version", build_info)
+ ```
+
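The difference between `by` and `without` is only in how the grouping key is built. The Python sketch below models an instant vector as a list of `(labels, value)` pairs, leaving the metric name out because aggregation drops it anyway; the function name and data are illustrative.

```python
from collections import defaultdict


def aggregate_sum(series, by=None, without=None):
    """Sum values grouped by the labels kept under `by` or `without`."""
    groups = defaultdict(float)
    for labels, value in series:
        if by is not None:
            # by: keep only the listed labels
            key = tuple((name, labels[name]) for name in sorted(by) if name in labels)
        else:
            # without: keep every label except the listed ones
            dropped = set(without or ())
            key = tuple(sorted((n, v) for n, v in labels.items() if n not in dropped))
        groups[key] += value
    return dict(groups)


VECTOR = [
    ({"service": "user-api", "instance": "a:8080"}, 1.0),
    ({"service": "user-api", "instance": "b:8080"}, 2.0),
    ({"service": "order-api", "instance": "a:8080"}, 4.0),
]

print(aggregate_sum(VECTOR, by=["service"]))
print(aggregate_sum(VECTOR, without=["instance"]))
```

Both calls give the same groups here because `service` and `instance` are the only labels. Once a new label such as `region` appears, `without (instance)` keeps it while `by (service)` silently drops it.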
+ ### 5. Binary Operators
+
+ ```promql
+ # Arithmetic
+ node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
+ (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
+
+ # Comparison (filter)
+ http_requests_total > 1000
+ node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
+
+ # Comparison (bool mode, returns 0 or 1)
+ http_requests_total > bool 1000
+
+ # Vector matching
+ # one-to-one: each side has exactly one series per label set once "method" is ignored
+ method:http_requests:rate5m{method="GET"} / ignoring(method) sum without (method) (method:http_requests:rate5m)
+
+ # many-to-one: many CPU series per instance join one node_uname_info series
+ node_cpu_seconds_total * on(instance) group_left(nodename) node_uname_info
+ ```
+
+ ### 6. Common Functions
+
+ ```promql
+ # Rates (always use these on counters)
+ rate(http_requests_total[5m]) # average per-second rate; handles counter resets
+ irate(http_requests_total[5m]) # instantaneous rate from the last two samples in the window
+
+ # Increments
+ increase(http_requests_total[1h]) # increase over 1 hour
+ delta(temperature_celsius[1h]) # change of a gauge (may be negative)
+
+ # Histogram quantiles
+ histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
+
+ # Prediction
+ predict_linear(node_filesystem_avail_bytes[6h], 24*3600) # predicted free disk space 24 hours from now
+
+ # Absence detection
+ absent(up{job="api"}) # returns 1 when no matching series exists
+ absent_over_time(up{job="api"}[5m]) # returns 1 when there was no data in the last 5 minutes
+
+ # Time functions
+ time() # current Unix timestamp
+ timestamp(up) # timestamp of each sample
+ day_of_week() # day of the week in UTC (0 = Sunday)
+ hour() # current hour in UTC
+
+ # Label manipulation
+ label_replace(up, "host", "$1", "instance", "(.+):.+")
+ label_join(up, "full_name", "-", "job", "instance")
+
+ # Sorting
+ sort(node_memory_MemAvailable_bytes)
+ sort_desc(rate(http_requests_total[5m]))
+
+ # Clamping
+ clamp(cpu_usage, 0, 100)
+ clamp_min(value, 0)
+ clamp_max(value, 100)
+
+ # Aggregation over time (intended for gauges; count_over_time counts samples)
+ avg_over_time(node_load1[5m])
+ max_over_time(node_memory_MemAvailable_bytes[1h])
+ min_over_time(node_filesystem_avail_bytes[1h])
+ count_over_time(http_requests_total[5m])
+ ```
+
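`predict_linear` is a least-squares line fit over the samples in the window, extrapolated forward. The Python sketch below shows the arithmetic under one simplifying assumption that is stated in the code: it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the query evaluation time. The data is synthetic.

```python
def predict_linear(samples, seconds_ahead):
    """Fit v = intercept + slope * t by least squares and extrapolate forward."""
    count = len(samples)
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    # Assumption: treat the last sample's timestamp as "now".
    now = samples[-1][0]
    return intercept + slope * (now + seconds_ahead)


# Synthetic free-space readings: 1000 units at t=0, shrinking by 0.01 per second,
# sampled hourly over 6 hours.
FREE_SPACE = [(t, 1000 - 0.01 * t) for t in range(0, 21601, 3600)]

prediction = predict_linear(FREE_SPACE, 24 * 3600)
print(prediction < 0)  # True: the trend reaches zero within 24 hours
```

A negative prediction is exactly the condition a rule such as `predict_linear(...[6h], 24*3600) < 0` tests for.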
+ ### 7. Common Query Templates
+
+ ```promql
+ # === Error rate ===
+ # HTTP 5xx error rate (%)
+ sum(rate(http_requests_total{status=~"5.."}[5m]))
+   / sum(rate(http_requests_total[5m])) * 100
+
+ # Error rate per service (%)
+ sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
+   / sum by (service) (rate(http_requests_total[5m])) * 100
+
+ # === Latency quantiles ===
+ # P50 / P90 / P99 latency
+ histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
+ histogram_quantile(0.90, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
+ histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
+
+ # P99 latency per service
+ histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
+
+ # === Resource utilization ===
+ # CPU utilization (%)
+ 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
+
+ # Memory utilization (%)
+ (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
+
+ # Disk utilization (%)
+ (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
+   / node_filesystem_size_bytes) * 100
+
+ # Disk I/O utilization (%)
+ rate(node_disk_io_time_seconds_total[5m]) * 100
+
+ # === Throughput ===
+ # Requests per second (QPS)
+ sum(rate(http_requests_total[5m]))
+
+ # QPS per handler
+ sum by (handler) (rate(http_requests_total[5m]))
+
+ # Network throughput (MiB/s)
+ rate(node_network_receive_bytes_total{device!="lo"}[5m]) / 1024 / 1024
+ rate(node_network_transmit_bytes_total{device!="lo"}[5m]) / 1024 / 1024
+
+ # === Saturation ===
+ # Number of goroutines
+ go_goroutines{job="api"}
+
+ # File descriptor usage (%)
+ process_open_fds / process_max_fds * 100
+
+ # Connection pool usage (%)
+ db_pool_active_connections / db_pool_max_connections * 100
+ ```
+
+ ---
+
+ ## Alerting Rules
+
+ ### 1. Defining Alerting Rules
+
+ ```yaml
+ # rules/application-alerts.yml
+ groups:
+   - name: application.rules
+     interval: 30s # evaluation interval (optional; defaults to the global value)
+     rules:
+       # High error rate
+       - alert: HighErrorRate
+         expr: |
+           sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
+             / sum by (service) (rate(http_requests_total[5m])) > 0.05
+         for: 5m # must hold for 5 minutes before firing
+         labels:
+           severity: critical
+           team: backend
+         annotations:
+           summary: "Error rate too high for service {{ $labels.service }}"
+           description: "Error rate is {{ $value | humanizePercentage }}, above the 5% threshold for 5 minutes"
+           runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
+           dashboard: "https://grafana.example.com/d/svc-overview?var-service={{ $labels.service }}"
+
+       # High latency
+       - alert: HighLatencyP99
+         expr: |
+           histogram_quantile(0.99, sum by (le, service)
+             (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
+         for: 10m
+         labels:
+           severity: warning
+           team: backend
+         annotations:
+           summary: "P99 latency too high for service {{ $labels.service }}"
+           description: "P99 latency is {{ $value | humanizeDuration }}, above the 1-second threshold"
+
+       # Instance down
+       - alert: InstanceDown
+         expr: up == 0
+         for: 3m
+         labels:
+           severity: critical
+         annotations:
+           summary: "Instance {{ $labels.instance }} is unreachable"
+           description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 3 minutes"
+
+   - name: infrastructure.rules
+     rules:
+       # Disk space early warning
+       - alert: DiskSpaceRunningOut
+         expr: |
+           predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0
+         for: 30m
+         labels:
+           severity: warning
+         annotations:
+           summary: "Host {{ $labels.instance }} is about to run out of disk space"
+           description: "Mount point {{ $labels.mountpoint }} is predicted to run out of disk space within 24 hours"
+
+       # High memory usage
+       - alert: HighMemoryUsage
+         expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
+         for: 10m
+         labels:
+           severity: warning
+         annotations:
+           summary: "Memory usage on host {{ $labels.instance }} is above 90%"
+
+       # Sustained high CPU usage (rate, not irate, so short spikes do not reset the alert)
+       - alert: HighCPUUsage
+         expr: |
+           100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 85
+         for: 15m
+         labels:
+           severity: warning
+         annotations:
+           summary: "CPU usage on host {{ $labels.instance }} has stayed above 85%"
+
+       # Scrape target failing
+       - alert: PrometheusTargetMissing
+         expr: up == 0
+         for: 5m
+         labels:
+           severity: critical
+         annotations:
+           summary: "Prometheus scrape target missing: {{ $labels.job }}/{{ $labels.instance }}"
+ ```
+
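The effect of the `for` clause is easiest to see as a small state machine: an alert is `pending` from the first evaluation where the expression is true, becomes `firing` once it has stayed true for the whole `for` duration, and drops back to `inactive` as soon as one evaluation is false. The Python sketch below is a simplified model with made-up inputs, not the Prometheus rule evaluator.

```python
def alert_states(condition_per_eval, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None
    for index, condition in enumerate(condition_per_eval):
        now = index * interval_seconds
        if not condition:
            active_since = None  # a single false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_long_enough = now - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


# Evaluated every 30s with for: 5m. A single dip at the fourth evaluation
# restarts the pending period.
CONDITIONS = [True, True, True, False] + [True] * 12
for step, state in enumerate(alert_states(CONDITIONS, 300, 30)):
    print(f"t={step * 30:>3}s {state}")
```

This is also why a jumpy expression can keep an alert stuck in `pending`: every dip restarts the clock, so smoother inputs make `for` behave more predictably.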
+ ### 2. Alertmanager Configuration
+
+ ```yaml
+ # alertmanager.yml
+ global:
+   resolve_timeout: 5m
+   smtp_smarthost: "smtp.example.com:587"
+   smtp_from: "alertmanager@example.com"
+   smtp_auth_username: "alertmanager"
+   smtp_auth_password_file: /etc/alertmanager/smtp_password
+   slack_api_url_file: /etc/alertmanager/slack_webhook
+
+ # Routing tree
+ route:
+   receiver: "default-slack"
+   group_by: ["alertname", "service", "namespace"]
+   group_wait: 30s     # how long to buffer a new group before its first notification
+   group_interval: 5m  # how long to wait before notifying about changes to an already-notified group
+   repeat_interval: 4h # how long to wait before repeating a notification for unresolved alerts
+
+   routes:
+     # Critical alerts → PagerDuty
+     - match:
+         severity: critical
+       receiver: "pagerduty-critical"
+       group_wait: 10s
+       repeat_interval: 1h
+       continue: false
+
+     # Warning level → Slack
+     - match:
+         severity: warning
+       receiver: "team-slack"
+       group_wait: 1m
+       repeat_interval: 8h
+
+     # Route by team (only reached by alerts that matched neither route above)
+     - match_re:
+         team: "frontend|mobile"
+       receiver: "frontend-slack"
+     - match:
+         team: backend
+       receiver: "backend-slack"
+
+ # Inhibition rules
+ inhibit_rules:
+   # A firing critical alert inhibits warnings for the same service
+   - source_match:
+       severity: critical
+     target_match:
+       severity: warning
+     equal: ["alertname", "service"]
+
+   # Cluster-level alerts inhibit node-level alerts
+   - source_match:
+       scope: cluster
+     target_match:
+       scope: node
+     equal: ["cluster"]
+
+ # Receivers
+ receivers:
+   - name: "default-slack"
+     slack_configs:
+       - channel: "#alerts-default"
+         title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
+         text: >-
+           *Summary:* {{ .CommonAnnotations.summary }}
+           *Description:* {{ .CommonAnnotations.description }}
+           *Details:*
+           {{ range .Alerts }}
+           - *{{ .Labels.instance }}*: {{ .Annotations.description }}
+           {{ end }}
+         send_resolved: true
+
+   - name: "team-slack"
+     slack_configs:
+       - channel: "#alerts-team"
+         send_resolved: true
+
+   - name: "frontend-slack"
+     slack_configs:
+       - channel: "#alerts-frontend"
+         send_resolved: true
+
+   - name: "backend-slack"
+     slack_configs:
+       - channel: "#alerts-backend"
+         send_resolved: true
+
+   - name: "pagerduty-critical"
+     pagerduty_configs:
+       - routing_key_file: /etc/alertmanager/pagerduty_key
+         severity: critical
+         description: '{{ .CommonAnnotations.summary }}'
+         details:
+           firing: '{{ .Alerts.Firing | len }}'
+           resolved: '{{ .Alerts.Resolved | len }}'
+           dashboard: '{{ (index .Alerts 0).Annotations.dashboard }}'
+
+   - name: "webhook-custom"
+     webhook_configs:
+       - url: "https://hooks.example.com/alertmanager"
+         send_resolved: true
+         max_alerts: 10
+         http_config:
+           bearer_token_file: /etc/alertmanager/webhook_token
+ ```
+
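Route order matters because Alertmanager walks the tree depth-first and stops at the first matching sibling unless that route sets `continue: true`. The Python sketch below models only equality `match` conditions on a cut-down copy of the tree above (no `match_re`, grouping, or timing), so treat it as an illustration of the traversal rather than of Alertmanager itself.

```python
def route_alert(labels, route):
    """Return the receivers an alert reaches in a simplified routing tree."""
    receivers = []
    for child in route.get("routes", []):
        conditions = child.get("match", {})
        if all(labels.get(name) == value for name, value in conditions.items()):
            receivers.extend(route_alert(labels, child))
            if not child.get("continue", False):
                break  # the first match wins unless continue is set
    # If no child matched, the alert stays with this node's receiver.
    return receivers or [route["receiver"]]


TREE = {
    "receiver": "default-slack",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
        {"match": {"severity": "warning"}, "receiver": "team-slack"},
        {"match": {"team": "backend"}, "receiver": "backend-slack"},
    ],
}

print(route_alert({"severity": "warning", "team": "backend"}, TREE))  # ['team-slack']
print(route_alert({"team": "backend"}, TREE))                         # ['backend-slack']
print(route_alert({"severity": "info"}, TREE))                        # ['default-slack']
```

The first call shows the practical consequence: an alert labeled `severity: warning` never reaches `backend-slack`. To notify both, either set `continue: true` on the severity routes or nest the team routes under them.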
+ ### 3. Silences and Maintenance Windows
+
+ ```bash
+ # Create a silence (with amtool)
+ amtool silence add \
+   --alertmanager.url=http://localhost:9093 \
+   --author="ops-team" \
+   --comment="Planned maintenance window 2026-03-28 22:00-02:00" \
+   --duration=4h \
+   alertname="InstanceDown" instance=~"node-[12].*"
+
+ # List active silences
+ amtool silence query --alertmanager.url=http://localhost:9093
+
+ # Expire a silence
+ amtool silence expire <silence-id> --alertmanager.url=http://localhost:9093
+ ```
+
+ ---
+
+ ## Grafana Dashboards
+
+ ### 1. Data Source Configuration
+
+ ```yaml
+ # grafana/provisioning/datasources/prometheus.yml
+ apiVersion: 1
+ datasources:
+   - name: Prometheus
+     type: prometheus
+     access: proxy
+     url: http://prometheus:9090
+     isDefault: true
+     editable: false
+     jsonData:
+       timeInterval: "15s" # align with scrape_interval
+       httpMethod: POST    # use POST for large queries to avoid URL length limits
+       exemplarTraceIdDestinations:
+         - name: traceID
+           datasourceUid: tempo
+           urlDisplayLabel: "View in Tempo"
+ ```
+
+ ### 2. Panel Types and When to Use Them
+
+ | Panel type | Use case | Typical query |
+ |---------|---------|---------|
+ | Time Series | Trends over time | `rate(http_requests_total[5m])` |
+ | Stat | Single value | `sum(up{job="api"})` |
+ | Gauge | Percentages / thresholds | `(1 - node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)*100` |
+ | Bar Chart | Comparison across a dimension | `topk(10, sum by (handler)(rate(http_requests_total[5m])))` |
+ | Table | Multi-dimensional detail | Several metrics joined in one table |
+ | Heatmap | Latency distribution | `rate(http_request_duration_seconds_bucket[5m])` |
+ | Logs | Log panel | Used with a Loki data source |
+ | Node Graph | Topology | Used with a Tempo data source |
+
+ ### 3. Template Variables
+
+ ```
+ # Variable definitions (configured under dashboard Settings > Variables)
+
+ # Data source variable
+ Name: datasource
+ Type: Datasource
+ Query: prometheus
+
+ # Label-values variable
+ Name: namespace
+ Type: Query
+ Query: label_values(kube_pod_info, namespace)
+ Multi-value: true
+ Include All: true
+
+ # Dependent (cascading) variable; =~ is needed because $namespace is multi-value
+ Name: service
+ Type: Query
+ Query: label_values(http_requests_total{namespace=~"$namespace"}, service)
+
+ # Interval variable
+ Name: interval
+ Type: Interval
+ Values: 1m,5m,10m,30m,1h
+ Auto: true
+ Min interval: 1m
+
+ # Use in a panel query
+ sum by (service) (rate(http_requests_total{namespace=~"$namespace", service=~"$service"}[$interval]))
+ ```
+
+ ### 4. RED Method Dashboard (Service-Oriented)
+
+ The RED method tracks three dimensions: Rate, Errors, and Duration.
+
+ ```promql
+ # --- Rate (request rate) ---
+ # Total QPS
+ sum(rate(http_requests_total{service="$service"}[5m]))
+ # QPS by status code
+ sum by (status) (rate(http_requests_total{service="$service"}[5m]))
+
+ # --- Errors (error rate) ---
+ # Error rate as a percentage
+ sum(rate(http_requests_total{service="$service", status=~"5.."}[5m]))
+   / sum(rate(http_requests_total{service="$service"}[5m])) * 100
+
+ # --- Duration (latency distribution) ---
+ # P50 / P90 / P99
+ histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket{service="$service"}[5m])))
+ histogram_quantile(0.90, sum by (le) (rate(http_request_duration_seconds_bucket{service="$service"}[5m])))
+ histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="$service"}[5m])))
+ ```
+
+ ### 5. USE Method Dashboard (Resource-Oriented)
+
+ The USE method tracks Utilization, Saturation, and Errors.
+
+ ```promql
+ # --- CPU ---
+ # Utilization: CPU usage (%)
+ 100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle", instance="$instance"}[5m])) * 100
+ # Saturation: 1-minute load average per CPU core
+ # (aggregate away both cpu and mode so the label sets on each side match)
+ node_load1{instance="$instance"} / count without (cpu, mode) (node_cpu_seconds_total{mode="idle", instance="$instance"})
+
+ # --- Memory ---
+ # Utilization: memory usage (%)
+ (1 - node_memory_MemAvailable_bytes{instance="$instance"} / node_memory_MemTotal_bytes{instance="$instance"}) * 100
+ # Saturation: swap in use
+ node_memory_SwapTotal_bytes{instance="$instance"} - node_memory_SwapFree_bytes{instance="$instance"}
+
+ # --- Disk ---
+ # Utilization: disk space usage (%)
+ (1 - node_filesystem_avail_bytes{instance="$instance", fstype!~"tmpfs|overlay"}
+   / node_filesystem_size_bytes) * 100
+ # Saturation: disk I/O busy time (%)
+ rate(node_disk_io_time_seconds_total{instance="$instance"}[5m]) * 100
+
+ # --- Network ---
+ # Utilization: network bandwidth (bits/s)
+ rate(node_network_receive_bytes_total{instance="$instance", device!="lo"}[5m]) * 8
+ rate(node_network_transmit_bytes_total{instance="$instance", device!="lo"}[5m]) * 8
+ # Errors: network errors
+ rate(node_network_receive_errs_total{instance="$instance"}[5m])
+ rate(node_network_transmit_errs_total{instance="$instance"}[5m])
+ ```
+
+ ### 6. Four Golden Signals Dashboard (Google SRE)
+
+ ```promql
+ # 1. Latency: latency of successful requests vs. failed requests
+ # Note: these two queries need a status label on the histogram; the client
+ # examples in this document attach only method and handler to it
+ # P99 of successful requests
+ histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])))
+ # P99 of failed requests
+ histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])))
+
+ # 2. Traffic: load on the system
+ sum(rate(http_requests_total[5m]))
+
+ # 3. Errors: fraction of failed requests
+ sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
+
+ # 4. Saturation: utilization of the most constrained resource
+ # CPU saturation
+ avg(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))
+ # Memory saturation
+ 1 - avg(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
+ ```
+
+ ---
+
+ ## Application Instrumentation
+
+ ### 1. Python Client (prometheus_client)
+
+ ```python
+ # pip install prometheus-client
+ from prometheus_client import (
+     Counter, Gauge, Histogram, Summary,
+     start_http_server, generate_latest, CONTENT_TYPE_LATEST
+ )
+ import time
+
+ # Define metrics
+ REQUEST_COUNT = Counter(
+     "http_requests_total",
+     "Total HTTP requests",
+     ["method", "handler", "status"]
+ )
+ REQUEST_LATENCY = Histogram(
+     "http_request_duration_seconds",
+     "HTTP request latency",
+     ["method", "handler"],
+     buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
+ )
+ IN_PROGRESS = Gauge(
+     "http_requests_in_progress",
+     "Number of in-progress HTTP requests",
+     ["handler"]
+ )
+ DB_POOL_SIZE = Gauge(
+     "db_connection_pool_size",
+     "Database connection pool size",
+     ["pool"]
+ )
+
+ # Time a function with a decorator
+ @REQUEST_LATENCY.labels(method="GET", handler="/api/users").time()
+ def get_users():
+     pass
+
+ # Manual instrumentation
+ def handle_request(method, handler):
+     IN_PROGRESS.labels(handler=handler).inc()
+     start = time.time()
+     try:
+         # ... business logic
+         status = "200"
+     except Exception:
+         status = "500"
+         raise
+     finally:
+         duration = time.time() - start
+         REQUEST_COUNT.labels(method=method, handler=handler, status=status).inc()
+         REQUEST_LATENCY.labels(method=method, handler=handler).observe(duration)
+         IN_PROGRESS.labels(handler=handler).dec()
+
+ # Flask middleware integration
+ from flask import Flask, request, g
+ flask_app = Flask(__name__)
+
+ @flask_app.before_request
+ def before_request():
+     g.start_time = time.time()
+
+ @flask_app.after_request
+ def after_request(response):
+     latency = time.time() - g.start_time
+     REQUEST_COUNT.labels(
+         method=request.method,
+         handler=request.endpoint or "unknown",
+         status=response.status_code
+     ).inc()
+     REQUEST_LATENCY.labels(
+         method=request.method,
+         handler=request.endpoint or "unknown"
+     ).observe(latency)
+     return response
+
+ @flask_app.route("/metrics")
+ def metrics():
+     return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}
+
+ # FastAPI integration
+ from fastapi import FastAPI, Request
+ from starlette.middleware.base import BaseHTTPMiddleware
+ from prometheus_client import make_asgi_app
+
+ fastapi_app = FastAPI()
+
+ class MetricsMiddleware(BaseHTTPMiddleware):
+     async def dispatch(self, request: Request, call_next):
+         start = time.time()
+         response = await call_next(request)
+         duration = time.time() - start
+         # Label with the matched route template (e.g. /users/{id}) rather than
+         # the raw URL path, to keep the handler label low-cardinality
+         route = request.scope.get("route")
+         handler = route.path if route else "unmatched"
+         REQUEST_COUNT.labels(
+             method=request.method,
+             handler=handler,
+             status=response.status_code
+         ).inc()
+         REQUEST_LATENCY.labels(
+             method=request.method,
+             handler=handler
+         ).observe(duration)
+         return response
+
+ fastapi_app.add_middleware(MetricsMiddleware)
+ metrics_app = make_asgi_app()
+ fastapi_app.mount("/metrics", metrics_app)
+
+ # Standalone metrics server
+ if __name__ == "__main__":
+     start_http_server(8000)  # expose /metrics on port 8000 (served from a daemon thread)
+     while True:  # keep the main thread alive so the server keeps running
+         time.sleep(1)
+ ```
+
+ ### 2. Node.js Client (prom-client)
+
+ ```javascript
+ // npm install prom-client
+ const client = require("prom-client");
+
+ // Enable default metrics (process level: CPU, memory, GC, event loop, etc.)
+ const collectDefaultMetrics = client.collectDefaultMetrics;
+ collectDefaultMetrics({ prefix: "app_" });
+
+ // Custom metrics
+ const httpRequestsTotal = new client.Counter({
+   name: "http_requests_total",
+   help: "Total HTTP requests",
+   labelNames: ["method", "handler", "status"],
+ });
+
+ const httpRequestDuration = new client.Histogram({
+   name: "http_request_duration_seconds",
+   help: "HTTP request latency in seconds",
+   labelNames: ["method", "handler"],
+   buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
+ });
+
+ const activeConnections = new client.Gauge({
+   name: "http_active_connections",
+   help: "Number of active connections",
+ });
+
+ // Express middleware
+ const express = require("express");
+ const app = express();
+
+ app.use((req, res, next) => {
+   const end = httpRequestDuration.startTimer();
+   activeConnections.inc();
+
+   res.on("finish", () => {
+     // req.route is only set once routing has happened, so resolve labels here;
+     // using the route pattern instead of the raw path keeps cardinality low
+     const labels = {
+       method: req.method,
+       handler: req.route?.path || "unmatched",
+     };
+     end(labels);
+     httpRequestsTotal.inc({ ...labels, status: res.statusCode });
+     activeConnections.dec();
+   });
+
+   next();
+ });
+
+ // Metrics endpoint
+ app.get("/metrics", async (req, res) => {
+   res.set("Content-Type", client.register.contentType);
+   res.end(await client.register.metrics());
+ });
+
+ app.listen(3000);
+ ```
+
+ ### 3. Go Client (prometheus/client_golang)
+
+ ```go
+ package main
+
+ import (
+ 	"log"
+ 	"net/http"
+ 	"strconv"
+ 	"time"
+
+ 	"github.com/prometheus/client_golang/prometheus"
+ 	"github.com/prometheus/client_golang/prometheus/promauto"
+ 	"github.com/prometheus/client_golang/prometheus/promhttp"
+ )
+
+ var (
+ 	httpRequestsTotal = promauto.NewCounterVec(
+ 		prometheus.CounterOpts{
+ 			Name: "http_requests_total",
+ 			Help: "Total HTTP requests",
+ 		},
+ 		[]string{"method", "handler", "status"},
+ 	)
+
+ 	httpRequestDuration = promauto.NewHistogramVec(
+ 		prometheus.HistogramOpts{
+ 			Name:    "http_request_duration_seconds",
+ 			Help:    "HTTP request duration in seconds",
+ 			Buckets: prometheus.DefBuckets,
+ 		},
+ 		[]string{"method", "handler"},
+ 	)
+
+ 	activeRequests = promauto.NewGauge(
+ 		prometheus.GaugeOpts{
+ 			Name: "http_active_requests",
+ 			Help: "Number of active requests",
+ 		},
+ 	)
+ )
+
+ // Middleware
+ func metricsMiddleware(next http.Handler) http.Handler {
+ 	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+ 		start := time.Now()
+ 		activeRequests.Inc()
+ 		defer activeRequests.Dec()
+
+ 		rw := &responseWriter{ResponseWriter: w, statusCode: 200}
+ 		next.ServeHTTP(rw, r)
+
+ 		duration := time.Since(start).Seconds()
+ 		// Record the numeric status code ("200", "500") so that queries such as
+ 		// status=~"5.." match. r.URL.Path is fine for fixed routes; with path
+ 		// parameters, label with the route pattern to keep cardinality low.
+ 		status := strconv.Itoa(rw.statusCode)
+ 		httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, status).Inc()
+ 		httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
+ 	})
+ }
+
+ // responseWriter captures the status code written by the wrapped handler.
+ type responseWriter struct {
+ 	http.ResponseWriter
+ 	statusCode int
+ }
+
+ func (rw *responseWriter) WriteHeader(code int) {
+ 	rw.statusCode = code
+ 	rw.ResponseWriter.WriteHeader(code)
+ }
+
+ func main() {
+ 	mux := http.NewServeMux()
+ 	mux.Handle("/metrics", promhttp.Handler())
+ 	mux.HandleFunc("/api/users", handleUsers)
+
+ 	server := &http.Server{
+ 		Addr:    ":8080",
+ 		Handler: metricsMiddleware(mux),
+ 	}
+ 	log.Fatal(server.ListenAndServe())
+ }
+
+ func handleUsers(w http.ResponseWriter, r *http.Request) {
+ 	w.Write([]byte(`{"users": []}`))
+ }
+ ```
+
+ ---
+
+ ## Kubernetes Monitoring
+
+ ### 1. Core Components
+
+ ```bash
+ # One-step deployment of kube-prometheus-stack (Helm)
+ # Includes: Prometheus Operator + Prometheus + Alertmanager + Grafana + bundled rules and dashboards
+ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+ helm install monitoring prometheus-community/kube-prometheus-stack \
+   --namespace monitoring --create-namespace \
+   --set prometheus.prometheusSpec.retention=15d \
+   --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
+   --set prometheus.prometheusSpec.resources.requests.cpu=500m \
+   --set alertmanager.alertmanagerSpec.replicas=3
+ ```
+
+ ### 2. ServiceMonitor and PodMonitor
+
+ ```yaml
+ # ServiceMonitor: discover Pods through a Service and scrape their metrics
+ apiVersion: monitoring.coreos.com/v1
+ kind: ServiceMonitor
+ metadata:
+   name: user-api-monitor
+   namespace: monitoring
+   labels:
+     release: monitoring # must match the selector used by the Prometheus Operator
+ spec:
+   namespaceSelector:
+     matchNames: ["production"]
+   selector:
+     matchLabels:
+       app: user-api
+   endpoints:
+     - port: metrics # port name defined in the Service
+       interval: 15s
+       path: /metrics
+       scrapeTimeout: 10s
+       metricRelabelings:
+         - sourceLabels: [__name__]
+           regex: "go_.*"
+           action: drop # drop Go runtime metrics to reduce storage
+
+ ---
+ # PodMonitor: scrape Pods directly, no Service required
+ apiVersion: monitoring.coreos.com/v1
+ kind: PodMonitor
+ metadata:
+   name: batch-job-monitor
+   namespace: monitoring
+ spec:
+   namespaceSelector:
+     matchNames: ["batch"]
+   selector:
+     matchLabels:
+       app: batch-worker
+   podMetricsEndpoints:
+     - port: metrics
+       interval: 30s
+ ```
+
+ ### 3. Key Kubernetes Metrics
+
+ ```promql
+ # --- kube-state-metrics ---
+ # Pods that are neither Running nor Succeeded
+ kube_pod_status_phase{phase!="Running", phase!="Succeeded"} == 1
+ # Containers that restarted in the last 15 minutes
+ rate(kube_pod_container_status_restarts_total[15m]) > 0
+ # Deployment replica mismatch
+ kube_deployment_status_replicas_available != kube_deployment_spec_replicas
+ # HPA current replicas as a fraction of the configured maximum
+ kube_horizontalpodautoscaler_status_current_replicas / kube_horizontalpodautoscaler_spec_max_replicas
+ # Containers whose last termination was an OOM kill
+ # (this is a 0/1 gauge, so compare it directly instead of using increase())
+ kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
+
+ # --- node-exporter ---
+ # Node CPU / memory / disk (listed earlier)
+
+ # --- cAdvisor (built into the kubelet) ---
+ # Container CPU usage (cores)
+ sum by (pod, container) (rate(container_cpu_usage_seconds_total{container!="POD", container!=""}[5m]))
+ # Container memory usage
+ sum by (pod, container) (container_memory_working_set_bytes{container!="POD", container!=""})
+ # Container CPU throttling (%)
+ sum by (pod) (rate(container_cpu_cfs_throttled_periods_total[5m]))
+   / sum by (pod) (rate(container_cpu_cfs_periods_total[5m])) * 100
+
+ # --- Common alert expressions ---
+ # Pod in CrashLoopBackOff
+ kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
+ # Failed Job
+ kube_job_status_failed > 0
+ # PVC running out of space
+ kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.1
+ # Node NotReady
+ kube_node_status_condition{condition="Ready", status="true"} == 0
+ ```
+
+ ---
+
+ ## High Availability and Long-Term Storage
+
+ ### 1. Prometheus Federation
+
+ ```yaml
+ # The global Prometheus pulls aggregated metrics from the regional Prometheus servers
+ scrape_configs:
+   - job_name: "federate-region-cn"
+     honor_labels: true
+     metrics_path: /federate
+     params:
+       "match[]":
+         - '{__name__=~"job:.*"}' # pull only recording-rule output
+         - '{__name__=~"instance:.*"}'
+     static_configs:
+       - targets: ["prometheus-cn.internal:9090"]
+         labels:
+           region: cn
+
+   - job_name: "federate-region-us"
+     honor_labels: true
+     metrics_path: /federate
+     params:
+       "match[]":
+         - '{__name__=~"job:.*"}'
+         - '{__name__=~"instance:.*"}'
+     static_configs:
+       - targets: ["prometheus-us.internal:9090"]
+         labels:
+           region: us
+ ```
+
1199
+ ### 2. Thanos架构
1200
+
1201
+ ```
1202
+ ┌─────────────┐ ┌─────────────┐
1203
+ │ Prometheus A │ │ Prometheus B │ ← 各集群独立运行
1204
+ │ + Sidecar │ │ + Sidecar │
1205
+ └──────┬──────┘ └──────┬──────┘
1206
+ │ │
1207
+ ▼ ▼
1208
+ ┌────────────────────────┐
1209
+ │ 对象存储(S3/GCS) │ ← 长期存储
1210
+ └────────────────────────┘
1211
+ ▲ ▲
1212
+ │ │
1213
+ ┌──────┴──────┐ ┌──────┴──────┐
1214
+ │ Thanos Store│ │ Thanos │
1215
+ │ Gateway │ │ Compactor │ ← 降采样、压缩
1216
+ └──────┬──────┘ └─────────────┘
1217
+
1218
+ ┌──────┴──────┐
1219
+ │ Thanos Query│ ← 统一查询入口,去重
1220
+ └──────┬──────┘
1221
+
1222
+ ┌──────┴──────┐
1223
+ │ Grafana │
1224
+ └─────────────┘
1225
+ ```
+
+ ```yaml
+ # Thanos Sidecar (runs in the same Pod as Prometheus)
+ containers:
+   - name: thanos-sidecar
+     image: quay.io/thanos/thanos:v0.35.0
+     args:
+       - sidecar
+       - --tsdb.path=/prometheus/data
+       - --prometheus.url=http://localhost:9090
+       - --objstore.config-file=/etc/thanos/bucket.yml
+     volumeMounts:
+       - name: prometheus-data
+         mountPath: /prometheus/data
+       - name: thanos-config
+         mountPath: /etc/thanos
+
+ # bucket.yml (a separate file)
+ # Note: Thanos is not expected to expand ${...} in this file; render the
+ # placeholders before mounting it (for example with envsubst)
+ type: S3
+ config:
+   bucket: thanos-metrics
+   endpoint: s3.amazonaws.com
+   region: us-east-1
+   access_key: ${AWS_ACCESS_KEY_ID}
+   secret_key: ${AWS_SECRET_ACCESS_KEY}
+ ```
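The deduplication done by Thanos Query can be pictured as collapsing series that differ only in a replica label (set with `--query.replica-label`). The following is a deliberately simplified sketch of that idea: it works on label sets only, assumes the replica label is called `replica`, and ignores how Thanos merges the actual samples.

```python
def deduplicate(series: list[dict[str, str]], replica_label: str = "replica") -> list[dict[str, str]]:
    """Collapse series whose labels differ only in the replica label."""
    unique: dict[tuple, dict[str, str]] = {}
    for labels in series:
        # Identity of a series once the replica label is ignored.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        unique.setdefault(key, dict(key))
    return list(unique.values())


series = [
    {"__name__": "up", "job": "api", "replica": "a"},
    {"__name__": "up", "job": "api", "replica": "b"},
    {"__name__": "up", "job": "db", "replica": "a"},
]
deduped = deduplicate(series)
print(deduped)
```

For this to work in practice, each Prometheus replica needs a distinct value for the replica label in its `external_labels`.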
+
+ ### 3. VictoriaMetrics (High-Performance Alternative)
+
+ ```yaml
+ # Use VictoriaMetrics as a Prometheus remote-write target
+ # prometheus.yml
+ remote_write:
+   - url: http://victoriametrics:8428/api/v1/write
+     queue_config:
+       max_samples_per_send: 10000
+       capacity: 100000
+       max_shards: 30
+
+ # VictoriaMetrics deployment (single node); this is a shell command, not part of prometheus.yml
+ docker run -d \
+   --name victoriametrics \
+   -v vmdata:/victoria-data \
+   -p 8428:8428 \
+   victoriametrics/victoria-metrics:v1.101.0 \
+   -retentionPeriod=90d \
+   -storageDataPath=/victoria-data \
+   -memory.allowedPercent=60
+ ```
+
+ ### 4. Remote Storage Configuration
+
+ ```yaml
+ # prometheus.yml: remote read and write
+ remote_write:
+   - url: http://remote-storage:9201/write
+     remote_timeout: 30s
+     queue_config:
+       capacity: 50000
+       max_shards: 20
+       min_shards: 1
+       max_samples_per_send: 5000
+       batch_send_deadline: 5s
+       min_backoff: 30ms
+       max_backoff: 5s
+     write_relabel_configs:
+       - source_labels: [__name__]
+         regex: "go_.*"
+         action: drop   # do not ship Go runtime metrics
+
+ remote_read:
+   - url: http://remote-storage:9201/read
+     read_recent: false   # skip remote reads for time ranges the local TSDB fully covers
+     required_matchers:
+       job: "important-service"
+ ```
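Relabeling regexes in Prometheus are anchored at both ends, so `go_.*` drops `go_goroutines` but leaves a name such as `cgo_calls_total` alone. A minimal sketch of that matching behavior, using Python's `re.fullmatch` as a stand-in for Prometheus's RE2 engine (the two agree for simple patterns like this one; the helper name is hypothetical):

```python
import re


def is_dropped(metric_name: str, regex: str = "go_.*") -> bool:
    """Return True if a fully anchored relabel regex matches the metric name."""
    return re.fullmatch(regex, metric_name) is not None


for name in ["go_goroutines", "cgo_calls_total", "process_cpu_seconds_total"]:
    print(name, "dropped" if is_dropped(name) else "kept")
```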
+
+ ---
+
+ ## Best Practices
+
+ ### 1. Naming Conventions
+
+ ```
+ # Format: <namespace>_<name>_<unit>
+ # Use underscores as separators and lowercase names
+
+ # Good names
+ http_requests_total             # Counter: _total suffix
+ http_request_duration_seconds   # Histogram: base unit (seconds, not milliseconds)
+ node_memory_MemAvailable_bytes  # Gauge: base unit (bytes, not MB); node_exporter keeps the kernel field name
+ process_cpu_seconds_total       # Counter: CPU seconds
+ myapp_queue_depth               # Gauge: queue depth
+
+ # Bad names
+ http_requests                   # missing the _total suffix
+ request_latency_ms              # use seconds, not milliseconds
+ memory_usage_mb                 # use bytes, not MB
+ HttpRequestCount                # do not use camel case
+ http.requests.total             # do not use dots as separators
+ ```
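These rules are mechanical enough to lint in CI for metrics your own services define. The sketch below is one possible checker; the function name, the regex, and the list of non-base unit suffixes are illustrative assumptions, not an official Prometheus tool.

```python
import re

# Lowercase words separated by single underscores.
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Hypothetical, non-exhaustive list of suffixes that signal a non-base unit.
NON_BASE_UNITS = ("_ms", "_mb", "_kb", "_minutes", "_hours")


def naming_problems(name: str, metric_type: str) -> list[str]:
    """Return the naming-convention violations found for one metric."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("not lowercase snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter without _total suffix")
    if name.lower().endswith(NON_BASE_UNITS):
        problems.append("non-base unit")
    return problems


print(naming_problems("http_requests_total", "counter"))
print(naming_problems("request_latency_ms", "histogram"))
print(naming_problems("HttpRequestCount", "counter"))
```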
+
+ ### 2. Recording Rules (Precomputation)
+
+ ```yaml
+ # rules/recording-rules.yml
+ groups:
+   - name: http_recording_rules
+     interval: 30s
+     rules:
+       # Precompute the per-second request rate
+       - record: job:http_requests:rate5m
+         expr: sum by (job) (rate(http_requests_total[5m]))
+
+       # Precompute the error ratio
+       - record: job:http_errors:ratio5m
+         expr: |
+           sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
+           / sum by (job) (rate(http_requests_total[5m]))
+
+       # Precompute the latency quantile
+       - record: job:http_request_duration_seconds:p99_5m
+         expr: |
+           histogram_quantile(0.99,
+             sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
+
+       # Precompute per-instance CPU utilization
+       - record: instance:node_cpu:ratio
+         expr: |
+           1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
+
+   - name: resource_recording_rules
+     rules:
+       # Node memory utilization
+       - record: instance:node_memory_utilization:ratio
+         expr: |
+           1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
+
+       # Container CPU usage per namespace (cores)
+       - record: namespace:container_cpu_usage:sum
+         expr: |
+           sum by (namespace) (
+             rate(container_cpu_usage_seconds_total{container!="POD", container!=""}[5m])
+           )
+ ```
+
+ ### 3. Performance Tuning
+
+ ```yaml
+ # prometheus.yml global settings
+ global:
+   scrape_interval: 15s      # typically 15s-60s; avoid going below 10s
+   scrape_timeout: 10s       # must not exceed scrape_interval
+   evaluation_interval: 15s
+
+ # Storage settings (command-line flags)
+ # --storage.tsdb.retention.time=15d       # keep 15 days locally
+ # --storage.tsdb.retention.size=50GB      # or cap retention by size
+ # --storage.tsdb.wal-compression          # enable WAL compression
+ # --storage.tsdb.min-block-duration=2h    # minimum block duration
+ # --storage.tsdb.max-block-duration=36h   # maximum block duration (keep both at 2h when a Thanos Sidecar uploads blocks)
+ # --query.max-concurrency=20              # maximum concurrent queries
+ # --query.timeout=2m                      # query timeout
+ # --query.max-samples=50000000            # maximum samples a single query may load
+ ```
+
+ **Key self-monitoring metrics for performance:**
+
+ ```promql
+ # Health of Prometheus itself
+ prometheus_tsdb_head_series                                       # active time series (the core capacity metric)
+ rate(prometheus_tsdb_head_samples_appended_total[5m])             # samples ingested per second
+ prometheus_tsdb_compactions_failed_total                          # failed compactions
+ prometheus_engine_query_duration_seconds                          # query duration
+ rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m])   # scrapes rejected for exceeding sample_limit
+ prometheus_tsdb_storage_blocks_bytes                              # size of persisted blocks
+ ```
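The ingestion rate above feeds directly into disk sizing. The Prometheus storage documentation gives the rule of thumb `needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample`, with roughly 1-2 bytes per sample after compression. A small worked example, where the 100,000 samples/s figure is an assumed workload rather than a measurement:

```python
def needed_disk_bytes(retention_days: float, samples_per_second: float,
                      bytes_per_sample: float = 2.0) -> float:
    """Rule-of-thumb local TSDB size for a retention period and ingestion rate."""
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 15 days of retention at 100,000 samples per second.
estimate = needed_disk_bytes(retention_days=15, samples_per_second=100_000)
print(f"{estimate / 1e9:.1f} GB")
```

Treat the result as a lower bound for provisioning, since the WAL and compaction need extra headroom.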
+
+ ### 4. Label Cardinality Control
+
+ ```yaml
+ # Limit how much a single scrape may ingest
+ scrape_configs:
+   - job_name: "risky-service"
+     sample_limit: 5000              # the whole scrape fails if exceeded
+     target_limit: 100               # maximum number of targets
+     label_limit: 30                 # maximum number of labels per series
+     label_name_length_limit: 200    # maximum label name length
+     label_value_length_limit: 500   # maximum label value length
+
+     metric_relabel_configs:
+       # Drop high-cardinality metrics
+       - source_labels: [__name__]
+         regex: "expensive_metric_.*"
+         action: drop
+       # Drop specific labels (reduces cardinality)
+       - regex: "trace_id|span_id"
+         action: labeldrop
+ ```
+
+ ---
+
+ ## Common Pitfalls
+
+ ### 1. High-Cardinality Labels
+
+ **Problem:** Using high-cardinality dimensions (user ID, IP, request ID) as labels causes a time-series explosion, and memory and storage usage grow sharply.
+
+ ```promql
+ # Diagnosis: number of series per metric
+ topk(20, count by (__name__) ({__name__!=""}))
+
+ # Diagnosis: cardinality of a single label
+ count(count by (user_id) (http_requests_total))  # a large result indicates a problem
+ ```
+
+ **Solutions:**
+ - Remove high-cardinality labels and move that detail to logs or tracing
+ - Drop them at scrape time with metric_relabel_configs
+ - For unavoidable high-cardinality cases, pre-aggregate with recording rules
1447
+ ### 2. Missing指标(指标缺失)
1448
+
1449
+ **问题:** 服务刚启动时,Counter/Histogram尚未被触发过,查询返回空结果,导致告警规则失效。
1450
+
1451
+ ```python
1452
+ # Python: 初始化时预设标签组合
1453
+ REQUEST_COUNT = Counter("http_requests_total", "Total requests", ["method", "status"])
1454
+ # 启动时初始化所有预期的标签组合
1455
+ for method in ["GET", "POST", "PUT", "DELETE"]:
1456
+ for status in ["200", "400", "404", "500"]:
1457
+ REQUEST_COUNT.labels(method=method, status=status) # 初始值为0
1458
+ ```
1459
+
1460
+ ```promql
1461
+ # PromQL: 使用or向量填充
1462
+ (
1463
+ sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
1464
+ / sum by (service) (rate(http_requests_total[5m]))
1465
+ ) or (
1466
+ 0 * group by (service) (up{job="api"})
1467
+ )
1468
+ ```
1469
+
1470
+ ### 3. 告警风暴(Alert Storm)
1471
+
1472
+ **问题:** 级联故障导致大量告警同时触发,通知渠道被淹没。
1473
+
1474
+ **解决方案:**
1475
+ - 合理设置`group_by`和`group_wait`,将相关告警聚合
1476
+ - 使用`inhibit_rules`抑制低级别告警(见Alertmanager配置节)
1477
+ - 分层告警: 基础设施层 → 平台层 → 应用层,高层告警抑制低层
1478
+ - 设置合理的`for`持续时间,过滤瞬时抖动
1479
+ - 使用路由树将不同级别告警发送到不同渠道
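
To see why `group_by` reduces noise, it helps to picture grouping as bucketing alerts by the values of the chosen labels, with one notification per bucket. The sketch below imitates only that bucketing step; it is not Alertmanager code, and the alert names and labels are invented for illustration.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict[str, str]], group_by: list[str]) -> dict[tuple, list[dict[str, str]]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: dict[tuple, list[dict[str, str]]] = defaultdict(list)
    for labels in alerts:
        # Alerts missing a group_by label share the empty-string bucket for it.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighErrorRate", "cluster": "prod", "service": "api"},
    {"alertname": "HighErrorRate", "cluster": "prod", "service": "web"},
    {"alertname": "NodeDown", "cluster": "prod", "instance": "node-1"},
]
groups = group_alerts(alerts, ["alertname", "cluster"])
print(len(groups), "notifications instead of", len(alerts))
```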
+
+ ### 4. Storage Bloat
+
+ **Problem:** The number of metrics grows unchecked and storage cost keeps rising.
+
+ ```promql
+ # Diagnosis: series contributed by each job
+ count by (job) ({__name__!=""})
+
+ # Diagnosis: series contributed by each metric name
+ topk(10, count by (__name__) ({__name__!=""}))
+
+ # Diagnosis: samples ingested per scrape
+ scrape_samples_scraped
+ ```
+
+ **Solutions:**
+ - Audit metrics regularly and remove those no longer in use
+ - Drop useless metrics at scrape time with `metric_relabel_configs`
+ - Aggregate high-frequency metrics with recording rules, then drop the raw fine-grained data
+ - Set retention sensibly (by time or by size)
+ - Downsample long-term data with Thanos or VictoriaMetrics
+
+ ### 5. Misusing rate() and irate()
+
+ **Problem:** Applying rate() to a Gauge, or using irate() on a Counter where a smoothed trend is needed.
+
+ ```promql
+ # Wrong: rate() must not be used on a Gauge
+ rate(node_memory_MemAvailable_bytes[5m])    # wrong
+ # Right: use delta or deriv on a Gauge
+ delta(node_memory_MemAvailable_bytes[5m])   # right
+ deriv(node_memory_MemAvailable_bytes[5m])   # right
+
+ # Choosing between irate and rate
+ rate(http_requests_total[5m])    # smoothed average rate; suited to alerting and trends
+ irate(http_requests_total[5m])   # instantaneous rate; suited to live dashboards (prone to false positives in alerts)
+ ```
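The difference between the two functions is easiest to see on raw samples: rate() averages the counter increase over the whole window, while irate() looks only at the last two samples. The following sketch reimplements both in simplified form (it handles counter resets but skips the extrapolation to window boundaries that Prometheus performs); the function names and sample data are illustrative.

```python
def increase_with_resets(values: list[float]) -> float:
    """Total counter increase, treating any decrease as a reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase across all (timestamp, value) samples."""
    values = [v for _, v in samples]
    return increase_with_resets(values) / (samples[-1][0] - samples[0][0])


def simple_irate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return increase_with_resets([v0, v1]) / (t1 - t0)


# Steady 1 req/s, then a burst of 600 requests in the final 15 seconds.
samples = [(0, 100), (15, 115), (30, 130), (45, 145), (60, 745)]
print(simple_rate(samples), simple_irate(samples))
```

On this data the window average stays near 11 req/s while the instantaneous value jumps to 40 req/s, which is why irate() tends to flap when used in alert expressions.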
+
+ ### 6. histogram_quantile Precision Pitfall
+
+ **Problem:** Poorly chosen bucket boundaries make the computed quantile deviate significantly from the true value.
+
+ ```python
+ from prometheus_client import Histogram
+
+ # Risky: the default buckets may not fit your latency distribution
+ # Histogram.DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF)
+ # Histogram("http_duration_seconds", "...")  # implicitly uses Histogram.DEFAULT_BUCKETS
+
+ # Better: tailor the buckets to the actual SLO and latency distribution
+ REQUEST_DURATION = Histogram(
+     "http_duration_seconds", "...",
+     # If the SLO is P99 < 500ms, place denser buckets around 500ms
+     buckets=[0.01, 0.05, 0.1, 0.25, 0.4, 0.45, 0.5, 0.55, 0.6, 0.75, 1.0, 2.5],
+ )
+ ```
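The error comes from how histogram_quantile estimates a value: it finds the bucket containing the target rank and interpolates linearly inside it, so the estimate can be off by up to the bucket width. The sketch below is a simplified reimplementation for classic histograms, not the Prometheus source; the bucket data is invented, with every request assumed to take about 0.32s.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from sorted (upper_bound, cumulative_count) buckets.

    The last bucket must have upper bound float("inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # falls in +Inf bucket: report highest finite bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return float("nan")


# 1000 requests that all took about 0.32s, recorded with two bucket layouts.
coarse = [(0.25, 0), (0.5, 1000), (1.0, 1000), (float("inf"), 1000)]
fine = [(0.25, 0), (0.3, 0), (0.35, 1000), (0.5, 1000), (float("inf"), 1000)]
p99_coarse = histogram_quantile(0.99, coarse)
p99_fine = histogram_quantile(0.99, fine)
print(p99_coarse, p99_fine)
```

With the coarse layout the estimated P99 lands just under 0.5s, close enough to a 500ms SLO to cause false alarms, while the finer layout keeps the estimate near 0.35s.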
1535
+
1536
+ ---
1537
+
1538
+ ## Agent Checklist
1539
+
1540
+ 以下检查项供UmaDev Agent在监控相关任务中使用:
1541
+
1542
+ ### 埋点检查
1543
+ - [ ] 所有HTTP服务已暴露`/metrics`端点
1544
+ - [ ] 使用了正确的指标类型(Counter用于累计值, Gauge用于瞬时值, Histogram用于分布)
1545
+ - [ ] 指标命名符合`<namespace>_<name>_<unit>`规范,Counter以`_total`结尾
1546
+ - [ ] 无高基数标签(用户ID、IP、请求ID等不应作为标签)
1547
+ - [ ] Histogram bucket边界与SLO对齐,在关键阈值附近有足够精度
1548
+ - [ ] Counter在服务启动时预初始化所有标签组合,避免指标缺失
1549
+
1550
+ ### 告警检查
1551
+ - [ ] 每条告警规则有`for`持续时间(避免瞬时抖动误报)
1552
+ - [ ] 告警包含`severity`标签,按严重级别路由
1553
+ - [ ] 告警包含`summary`和`description`注解,附带runbook链接
1554
+ - [ ] 配置了抑制规则(critical抑制warning,集群级抑制节点级)
1555
+ - [ ] 告警分组合理,避免告警风暴
1556
+ - [ ] 已配置发送恢复通知(`send_resolved: true`)
1557
+
1558
+ ### 基础设施检查
1559
+ - [ ] Prometheus配置了合理的retention(时间或大小)
1560
+ - [ ] scrape_interval与Grafana仪表盘的`$__rate_interval`对齐
1561
+ - [ ] 生产环境Prometheus高可用(至少2副本或使用Thanos/VictoriaMetrics)
1562
+ - [ ] 使用recording rules预计算高频查询,降低查询延迟
1563
+ - [ ] 定期审计指标基数,`prometheus_tsdb_head_series`处于合理范围
1564
+ - [ ] Prometheus自身被监控(元监控),包括抓取延迟、查询耗时、存储使用
1565
+
1566
+ ### Kubernetes监控检查
1567
+ - [ ] 部署了kube-state-metrics、node-exporter
1568
+ - [ ] ServiceMonitor/PodMonitor已创建且selector正确匹配
1569
+ - [ ] 容器资源指标(CPU/内存/网络/磁盘)已采集
1570
+ - [ ] Pod重启、CrashLoopBackOff、OOMKilled已配置告警
1571
+ - [ ] PVC空间和节点磁盘空间有预测性告警(`predict_linear`)
1572
+
1573
+ ### 仪表盘检查
1574
+ - [ ] 核心服务有RED方法仪表盘(Rate/Errors/Duration)
1575
+ - [ ] 基础设施有USE方法仪表盘(Utilization/Saturation/Errors)
1576
+ - [ ] 仪表盘使用模板变量(namespace、service、instance)支持筛选
1577
+ - [ ] 关键面板有阈值标记(红/黄/绿)
1578
+ - [ ] 仪表盘已纳入版本管理(Grafana provisioning或dashboard-as-code)