@umacloud/knowledge 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (418)
  1. package/00-governance/governance-capabilities.md +557 -0
  2. package/00-governance/knowledge-map.md +39 -0
  3. package/00-governance/maintenance-policy.md +76 -0
  4. package/00-governance/review-checklist.md +81 -0
  5. package/README.md +13 -0
  6. package/ai/01-standards/agent-development-complete.md +691 -0
  7. package/ai/01-standards/llm-application-complete.md +488 -0
  8. package/ai/01-standards/mlops-complete.md +798 -0
  9. package/ai/01-standards/prompt-engineering-complete.md +646 -0
  10. package/ai/01-standards/rag-architecture-complete.md +649 -0
  11. package/ai/02-playbooks/llm-evaluation-playbook.md +847 -0
  12. package/ai/03-checklists/ai-project-checklist.md +215 -0
  13. package/ai/04-antipatterns/ai-antipatterns.md +661 -0
  14. package/ai/05-cases/case-rag-production.md +147 -0
  15. package/ai/06-glossary/ai-glossary.md +162 -0
  16. package/ai/agent-evaluation-benchmark.md +53 -0
  17. package/ai/ai-agent-memory-context-management.md +41 -0
  18. package/ai/ai-cost-capacity-optimization-playbook.md +42 -0
  19. package/ai/ai-data-security-and-compliance-playbook.md +37 -0
  20. package/ai/ai-domain-index-and-checklist.md +40 -0
  21. package/ai/ai-governance-maturity-model.md +50 -0
  22. package/ai/ai-model-selection-and-routing-strategy.md +47 -0
  23. package/ai/ai-observability-and-oncall-runbook.md +52 -0
  24. package/ai/ai-rag-engineering-playbook.md +42 -0
  25. package/ai/ai-red-team-and-safety-evaluation.md +42 -0
  26. package/ai/ai-release-readiness-and-rollback-gate.md +42 -0
  27. package/ai/llm-agent-engineering-deep-dive.md +57 -0
  28. package/ai/prompt-and-tool-guardrails.md +52 -0
  29. package/api/01-standards/enterprise-api-standards.md +198 -0
  30. package/api/01-standards/rest-api-design-guide.md +63 -0
  31. package/api/02-playbooks/api-pagination-playbook.md +93 -0
  32. package/api/02-playbooks/graphql-production-playbook.md +176 -0
  33. package/api/03-checklists/api-review-checklist.md +55 -0
  34. package/api/04-antipatterns/api-antipatterns.md +112 -0
  35. package/architecture/01-standards/api-gateway-patterns.md +496 -0
  36. package/architecture/01-standards/cloud-native-patterns.md +644 -0
  37. package/architecture/01-standards/distributed-systems-patterns.md +591 -0
  38. package/architecture/01-standards/event-driven-architecture.md +595 -0
  39. package/architecture/01-standards/microservices-patterns-complete.md +968 -0
  40. package/architecture/01-standards/microservices-patterns.md +495 -0
  41. package/architecture/01-standards/system-design-interview.md +664 -0
  42. package/architecture/02-playbooks/microservices-patterns-playbook.md +137 -0
  43. package/architecture/02-playbooks/migration-playbook.md +780 -0
  44. package/architecture/02-playbooks/system-design-playbook.md +779 -0
  45. package/architecture/03-checklists/architecture-decision-checklist.md +297 -0
  46. package/architecture/04-antipatterns/architecture-antipatterns.md +417 -0
  47. package/architecture/05-cases/case-netflix-microservices.md +413 -0
  48. package/architecture/06-glossary/architecture-glossary.md +164 -0
  49. package/architecture/adr-template-and-examples.md +38 -0
  50. package/architecture/api-gateway-deep-dive.md +1291 -0
  51. package/architecture/configuration-management.md +1162 -0
  52. package/architecture/distributed-transactions.md +1220 -0
  53. package/architecture/microservices-complete.md +735 -0
  54. package/architecture/resilience-and-disaster-patterns.md +37 -0
  55. package/architecture/service-governance.md +1198 -0
  56. package/architecture/system-architecture-deep-dive.md +37 -0
  57. package/backend/01-standards/analytics-and-growth.md +65 -0
  58. package/backend/01-standards/api-and-error-conventions.md +120 -0
  59. package/backend/01-standards/application-layering-and-packaging.md +160 -0
  60. package/backend/01-standards/auth-implementation.md +104 -0
  61. package/backend/01-standards/backend-framework-idioms.md +74 -0
  62. package/backend/01-standards/background-jobs-and-async.md +66 -0
  63. package/backend/01-standards/caching-strategies-complete.md +390 -0
  64. package/backend/01-standards/config-and-observability.md +77 -0
  65. package/backend/01-standards/data-modeling-and-persistence.md +94 -0
  66. package/backend/01-standards/django-complete.md +1765 -0
  67. package/backend/01-standards/email-and-notifications.md +64 -0
  68. package/backend/01-standards/fastapi-complete.md +925 -0
  69. package/backend/01-standards/file-upload-and-storage.md +66 -0
  70. package/backend/01-standards/graphql-api-complete.md +416 -0
  71. package/backend/01-standards/llm-application-standard.md +78 -0
  72. package/backend/01-standards/message-queue-patterns.md +379 -0
  73. package/backend/01-standards/microservices-and-distributed.md +78 -0
  74. package/backend/01-standards/nestjs-complete.md +2167 -0
  75. package/backend/01-standards/payment-integration.md +80 -0
  76. package/backend/01-standards/rate-limiting-complete.md +451 -0
  77. package/backend/01-standards/realtime-and-websocket.md +65 -0
  78. package/backend/01-standards/search-and-filtering.md +64 -0
  79. package/backend/01-standards/spring-boot-complete.md +445 -0
  80. package/backend/02-playbooks/api-design-playbook.md +718 -0
  81. package/backend/02-playbooks/email-send-playbook.md +130 -0
  82. package/backend/02-playbooks/file-upload-s3-playbook.md +153 -0
  83. package/backend/02-playbooks/typescript-enterprise-playbook.md +133 -0
  84. package/backend/02-playbooks/websocket-realtime-playbook.md +154 -0
  85. package/backend/03-checklists/api-launch-checklist.md +189 -0
  86. package/backend/04-antipatterns/backend-antipatterns.md +1051 -0
  87. package/blockchain/01-standards/blockchain-basics.md +557 -0
  88. package/blockchain/01-standards/smart-contract-development.md +1315 -0
  89. package/cicd/01-standards/deployment-and-delivery-standard.md +96 -0
  90. package/cicd/01-standards/github-actions-complete.md +473 -0
  91. package/cicd/01-standards/release-and-store-submission.md +75 -0
  92. package/cicd/02-playbooks/cicd-pipeline-playbook.md +144 -0
  93. package/cicd/02-playbooks/release-management-playbook.md +605 -0
  94. package/cicd/03-checklists/pipeline-security-checklist.md +168 -0
  95. package/cicd/04-antipatterns/cicd-antipatterns.md +589 -0
  96. package/cicd/05-cases/case-deployment-automation.md +221 -0
  97. package/cicd/05-cases/case-gitops-transformation.md +212 -0
  98. package/cicd/06-glossary/cicd-glossary.md +114 -0
  99. package/cicd/cicd-blueprint-deep-dive.md +38 -0
  100. package/cicd/release-readiness-gate.md +37 -0
  101. package/cloud-native/01-standards/container-security.md +741 -0
  102. package/cloud-native/01-standards/kubernetes-complete.md +812 -0
  103. package/cloud-native/02-playbooks/api-gateway-playbook.md +155 -0
  104. package/cloud-native/02-playbooks/gitops-with-argocd.md +760 -0
  105. package/cloud-native/02-playbooks/k8s-troubleshooting-playbook.md +1942 -0
  106. package/cloud-native/02-playbooks/message-queue-playbook.md +129 -0
  107. package/cloud-native/02-playbooks/multicloud-governance.md +726 -0
  108. package/cloud-native/02-playbooks/serverless-patterns.md +788 -0
  109. package/cloud-native/02-playbooks/service-mesh-playbook.md +612 -0
  110. package/cloud-native/02-playbooks/terraform-iac-playbook.md +143 -0
  111. package/cloud-native/03-checklists/container-security-checklist.md +431 -0
  112. package/cloud-native/03-checklists/k8s-production-readiness-checklist.md +460 -0
  113. package/cloud-native/04-antipatterns/container-antipatterns.md +660 -0
  114. package/cloud-native/04-antipatterns/k8s-antipatterns.md +743 -0
  115. package/cloud-native/05-cases/case-k8s-migration.md +478 -0
  116. package/cloud-native/05-cases/case-k8s-scaling.md +642 -0
  117. package/cloud-native/05-cases/case-k8s-security-incident.md +397 -0
  118. package/cloud-native/06-glossary/cloud-native-glossary.md +337 -0
  119. package/cross-platform/01-standards/cross-platform-frameworks.md +83 -0
  120. package/cross-platform/01-standards/platform-selection-and-architecture.md +77 -0
  121. package/data/01-standards/elasticsearch-complete.md +2098 -0
  122. package/data/01-standards/postgresql-complete.md +1613 -0
  123. package/data/01-standards/redis-complete.md +1527 -0
  124. package/data/02-playbooks/database-optimization-playbook.md +403 -0
  125. package/data/02-playbooks/elasticsearch-production-playbook.md +132 -0
  126. package/data/03-checklists/database-launch-checklist.md +187 -0
  127. package/data/04-antipatterns/database-antipatterns.md +873 -0
  128. package/data/05-cases/case-database-migration.md +310 -0
  129. package/data/06-glossary/database-glossary.md +440 -0
  130. package/data/data-governance-and-modeling-deep-dive.md +39 -0
  131. package/data-engineering/01-standards/airflow-complete.md +523 -0
  132. package/data-engineering/01-standards/kafka-complete.md +1521 -0
  133. package/data-engineering/02-playbooks/spark-etl-playbook.md +496 -0
  134. package/data-engineering/03-checklists/pipeline-launch-checklist.md +194 -0
  135. package/data-engineering/04-antipatterns/data-pipeline-antipatterns.md +684 -0
  136. package/data-engineering/05-cases/case-real-time-pipeline.md +355 -0
  137. package/data-engineering/06-glossary/data-engineering-glossary.md +429 -0
  138. package/database/01-standards/database-schema-standards.md +147 -0
  139. package/database/02-playbooks/postgresql-optimization-quick.md +52 -0
  140. package/database/02-playbooks/postgresql-performance-optimization.md +58 -0
  141. package/database/02-playbooks/postgresql-production-playbook.md +146 -0
  142. package/database/02-playbooks/redis-caching-playbook.md +117 -0
  143. package/database/03-checklists/database-review-checklist.md +50 -0
  144. package/database/04-antipatterns/database-antipatterns.md +112 -0
  145. package/design/01-standards/ui-design-system-complete.md +423 -0
  146. package/design/02-playbooks/design-handoff-playbook.md +254 -0
  147. package/design/02-playbooks/design-review-playbook.md +388 -0
  148. package/design/03-checklists/design-review-checklist.md +246 -0
  149. package/design/04-antipatterns/design-antipatterns.md +378 -0
  150. package/design/05-cases/case-design-system-adoption.md +328 -0
  151. package/design/06-glossary/design-glossary.md +329 -0
  152. package/design/ui-full-lifecycle-cross-platform-playbook.md +571 -0
  153. package/design/ux-system-deep-dive.md +38 -0
  154. package/design-systems/00-craft-rules.md +71 -0
  155. package/design-systems/aesthetic-families.md +43 -0
  156. package/design-systems/anti-ai-slop.md +162 -0
  157. package/design-systems/bold-geometric.md +120 -0
  158. package/design-systems/brutalist-bold.md +103 -0
  159. package/design-systems/editorial-clean.md +109 -0
  160. package/design-systems/glass-aurora.md +108 -0
  161. package/design-systems/modern-minimal.md +145 -0
  162. package/design-systems/premium-luxury.md +106 -0
  163. package/design-systems/product-type-design-map.md +48 -0
  164. package/design-systems/soft-warm.md +123 -0
  165. package/design-systems/tech-utility.md +113 -0
  166. package/desktop/01-standards/desktop-app-standard.md +72 -0
  167. package/desktop/01-standards/desktop-design.md +71 -0
  168. package/development/00-governance/document-template.md +41 -0
  169. package/development/01-standards/api-versioning-strategies.md +432 -0
  170. package/development/01-standards/authentication-patterns-complete.md +479 -0
  171. package/development/01-standards/css-architecture-complete.md +550 -0
  172. package/development/01-standards/database-migration-strategies.md +484 -0
  173. package/development/01-standards/elasticsearch-complete.md +347 -0
  174. package/development/01-standards/git-complete.md +371 -0
  175. package/development/01-standards/golang-complete.md +1565 -0
  176. package/development/01-standards/graphql-complete.md +298 -0
  177. package/development/01-standards/javascript-bundlers-complete.md +469 -0
  178. package/development/01-standards/javascript-typescript-complete.md +528 -0
  179. package/development/01-standards/jest-complete.md +275 -0
  180. package/development/01-standards/linux-complete.md +234 -0
  181. package/development/01-standards/logging-observability-complete.md +526 -0
  182. package/development/01-standards/microservices-communication.md +502 -0
  183. package/development/01-standards/mongodb-complete.md +406 -0
  184. package/development/01-standards/oauth2-complete.md +285 -0
  185. package/development/01-standards/performance-optimization-complete.md +289 -0
  186. package/development/01-standards/playwright-complete.md +247 -0
  187. package/development/01-standards/postgresql-complete.md +456 -0
  188. package/development/01-standards/pytest-complete.md +340 -0
  189. package/development/01-standards/python-async-programming.md +902 -0
  190. package/development/01-standards/python-complete.md +956 -0
  191. package/development/01-standards/python-decorators-complete.md +799 -0
  192. package/development/01-standards/python-design-patterns.md +2854 -0
  193. package/development/01-standards/python-packaging-distribution.md +420 -0
  194. package/development/01-standards/python-testing-strategies.md +607 -0
  195. package/development/01-standards/python-web-frameworks-comparison.md +471 -0
  196. package/development/01-standards/redis-complete.md +317 -0
  197. package/development/01-standards/rest-api-complete.md +316 -0
  198. package/development/01-standards/rust-complete.md +578 -0
  199. package/development/01-standards/typescript-advanced-types.md +1513 -0
  200. package/development/01-standards/web-security-complete.md +292 -0
  201. package/development/02-playbooks/api-design-playbook.md +810 -0
  202. package/development/02-playbooks/database-migration-playbook.md +580 -0
  203. package/development/02-playbooks/debugging-playbook.md +692 -0
  204. package/development/02-playbooks/feature-delivery-playbook.md +430 -0
  205. package/development/02-playbooks/incident-hotfix-playbook.md +387 -0
  206. package/development/02-playbooks/performance-optimization-playbook.md +531 -0
  207. package/development/02-playbooks/performance-tuning-playbook.md +652 -0
  208. package/development/02-playbooks/refactor-playbook.md +403 -0
  209. package/development/02-playbooks/release-playbook.md +469 -0
  210. package/development/03-checklists/architecture-review-checklist.md +168 -0
  211. package/development/03-checklists/data-migration-checklist.md +157 -0
  212. package/development/03-checklists/oncall-handover-checklist.md +173 -0
  213. package/development/03-checklists/pr-checklist.md +158 -0
  214. package/development/03-checklists/production-readiness-checklist.md +190 -0
  215. package/development/03-checklists/release-readiness-checklist.md +154 -0
  216. package/development/03-checklists/security-review-checklist.md +182 -0
  217. package/development/04-antipatterns/api-antipatterns.md +657 -0
  218. package/development/04-antipatterns/architecture-antipatterns.md +686 -0
  219. package/development/04-antipatterns/backend-antipatterns.md +648 -0
  220. package/development/04-antipatterns/cicd-antipatterns.md +540 -0
  221. package/development/04-antipatterns/code-smell-antipatterns.md +571 -0
  222. package/development/04-antipatterns/data-antipatterns.md +658 -0
  223. package/development/04-antipatterns/database-antipatterns.md +578 -0
  224. package/development/04-antipatterns/frontend-antipatterns.md +635 -0
  225. package/development/04-antipatterns/reliability-antipatterns.md +700 -0
  226. package/development/04-antipatterns/security-antipatterns.md +747 -0
  227. package/development/05-cases/case-api-version-migration.md +428 -0
  228. package/development/05-cases/case-authorization-hardening.md +383 -0
  229. package/development/05-cases/case-bluegreen-rollback.md +466 -0
  230. package/development/05-cases/case-cache-snowball-protection.md +485 -0
  231. package/development/05-cases/case-ci-cd-pipeline.md +544 -0
  232. package/development/05-cases/case-database-scaling.md +500 -0
  233. package/development/05-cases/case-db-hotspot-optimization.md +487 -0
  234. package/development/05-cases/case-incident-mttr-reduction.md +563 -0
  235. package/development/05-cases/case-microservice-migration.md +375 -0
  236. package/development/05-cases/case-performance-optimization.md +406 -0
  237. package/development/05-cases/case-security-incident-response.md +345 -0
  238. package/development/06-glossary/full-stack-glossary.md +166 -0
  239. package/development/09-maturity/quarterly-audit-template.md +35 -0
  240. package/development/11-ui-excellence/ui-aesthetic-system.md +41 -0
  241. package/development/11-ui-excellence/ui-engineering-excellence.md +435 -0
  242. package/development/12-scenarios/development-scenarios-guide.md +565 -0
  243. package/development/13-implementation-assets/implementation-toolkit.md +282 -0
  244. package/development/13-implementation-assets/knowledge-gates-execution.md +43 -0
  245. package/development/14-full-lifecycle/software-lifecycle-gates.md +511 -0
  246. package/development/15-lifecycle-templates/project-templates-collection.md +791 -0
  247. package/development/api-contract-and-versioning-guide.md +36 -0
  248. package/development/api-governance-complete.md +43 -0
  249. package/development/backend-engineering-complete.md +43 -0
  250. package/development/code-review-quality-complete.md +43 -0
  251. package/development/concurrency-reliability-complete.md +43 -0
  252. package/development/database-engineering-complete.md +43 -0
  253. package/development/engineering-effectiveness-complete.md +43 -0
  254. package/development/engineering-standards-deep-dive.md +38 -0
  255. package/development/frontend-engineering-complete.md +43 -0
  256. package/development/performance-capacity-complete.md +43 -0
  257. package/development/refactor-migration-complete.md +42 -0
  258. package/development/refactoring-and-techdebt-playbook.md +37 -0
  259. package/development/security-in-development-complete.md +43 -0
  260. package/devops/01-standards/cicd-pipeline-complete.md +262 -0
  261. package/devops/01-standards/docker-complete.md +1490 -0
  262. package/devops/01-standards/github-actions-complete.md +337 -0
  263. package/devops/01-standards/kubernetes-complete.md +638 -0
  264. package/devops/01-standards/terraform-complete.md +2117 -0
  265. package/devops/02-playbooks/docker-compose-playbook.md +233 -0
  266. package/devops/02-playbooks/docker-k8s-production-playbook.md +186 -0
  267. package/devops/02-playbooks/docker-production-playbook.md +952 -0
  268. package/edge-iot/01-standards/edge-iot-complete.md +473 -0
  269. package/experts/architect/api-design.md +178 -0
  270. package/experts/architect/methodology.md +124 -0
  271. package/experts/architect/security.md +75 -0
  272. package/experts/backend-lead/methodology.md +216 -0
  273. package/experts/devops/methodology.md +160 -0
  274. package/experts/frontend-lead/methodology.md +178 -0
  275. package/experts/product-manager/industry/ecommerce.md +43 -0
  276. package/experts/product-manager/industry/saas.md +40 -0
  277. package/experts/product-manager/methodology.md +97 -0
  278. package/experts/qa-lead/methodology.md +123 -0
  279. package/experts/qa-lead/test-strategy.md +128 -0
  280. package/experts/uiux-designer/methodology.md +125 -0
  281. package/frontend/01-standards/accessibility-complete.md +532 -0
  282. package/frontend/01-standards/accessibility-standard.md +74 -0
  283. package/frontend/01-standards/admin-dashboard-and-crud.md +72 -0
  284. package/frontend/01-standards/design-tokens-complete.md +444 -0
  285. package/frontend/01-standards/forms-and-validation.md +77 -0
  286. package/frontend/01-standards/frontend-architecture-and-layering.md +119 -0
  287. package/frontend/01-standards/i18n-and-localization.md +65 -0
  288. package/frontend/01-standards/nextjs-complete.md +451 -0
  289. package/frontend/01-standards/react-complete.md +713 -0
  290. package/frontend/01-standards/react-hooks-complete-guide.md +1100 -0
  291. package/frontend/01-standards/react-hooks-complete.md +1171 -0
  292. package/frontend/01-standards/seo-and-web-vitals.md +77 -0
  293. package/frontend/01-standards/state-management-complete.md +444 -0
  294. package/frontend/01-standards/vue-complete.md +499 -0
  295. package/frontend/01-standards/vue3-complete.md +2002 -0
  296. package/frontend/01-standards/web-framework-best-practices.md +64 -0
  297. package/frontend/01-standards/web-performance-complete.md +495 -0
  298. package/frontend/02-playbooks/accessibility-a11y-playbook.md +161 -0
  299. package/frontend/02-playbooks/frontend-performance-playbook.md +707 -0
  300. package/frontend/02-playbooks/i18n-internationalization-playbook.md +120 -0
  301. package/frontend/02-playbooks/performance-optimization-playbook.md +163 -0
  302. package/frontend/02-playbooks/react-nextjs-production-playbook.md +167 -0
  303. package/frontend/02-playbooks/react-state-management-playbook.md +173 -0
  304. package/frontend/03-checklists/component-quality-checklist.md +166 -0
  305. package/frontend/03-checklists/frontend-launch-checklist.md +299 -0
  306. package/frontend/04-antipatterns/frontend-antipatterns.md +886 -0
  307. package/frontend/05-cases/case-performance-optimization.md +274 -0
  308. package/harmony/01-standards/harmonyos-arkts-standard.md +75 -0
  309. package/harmony/01-standards/harmonyos-design.md +65 -0
  310. package/high-quality-engineering-playbook.md +54 -0
  311. package/incident/01-standards/incident-response-complete.md +303 -0
  312. package/incident/02-playbooks/chaos-engineering-playbook.md +883 -0
  313. package/incident/02-playbooks/postmortem-playbook.md +398 -0
  314. package/incident/03-checklists/incident-readiness-checklist.md +181 -0
  315. package/incident/04-antipatterns/incident-antipatterns.md +490 -0
  316. package/incident/05-cases/case-cascade-failure.md +176 -0
  317. package/incident/06-glossary/incident-glossary.md +114 -0
  318. package/incident/postmortem-and-response-deep-dive.md +39 -0
  319. package/industries/ecommerce/ecommerce-complete.md +631 -0
  320. package/industries/education/education-complete.md +555 -0
  321. package/industries/fintech/fintech-complete.md +501 -0
  322. package/industries/gaming/gaming-complete.md +587 -0
  323. package/industries/healthcare/healthcare-complete.md +452 -0
  324. package/low-code/01-standards/low-code-complete.md +944 -0
  325. package/miniprogram/01-standards/ai-common-mistakes.md +61 -0
  326. package/miniprogram/01-standards/miniprogram-custom-navbar-capsule.md +77 -0
  327. package/miniprogram/01-standards/miniprogram-design.md +61 -0
  328. package/miniprogram/01-standards/miniprogram-standard.md +81 -0
  329. package/mobile/01-standards/android-material-design.md +70 -0
  330. package/mobile/01-standards/flutter-complete.md +384 -0
  331. package/mobile/01-standards/ios-design-hig.md +78 -0
  332. package/mobile/01-standards/mobile-app-standard.md +85 -0
  333. package/mobile/01-standards/react-native-complete.md +352 -0
  334. package/mobile/02-playbooks/mobile-cross-platform-playbook.md +175 -0
  335. package/mobile/02-playbooks/mobile-performance.md +473 -0
  336. package/mobile/03-checklists/mobile-release-checklist.md +234 -0
  337. package/mobile/04-antipatterns/mobile-antipatterns.md +798 -0
  338. package/mobile/05-cases/case-app-performance.md +500 -0
  339. package/mobile/05-cases/case-app-startup-optimization.md +218 -0
  340. package/mobile/06-glossary/mobile-glossary.md +484 -0
  341. package/observability/01-standards/observability-standards.md +103 -0
  342. package/observability/02-playbooks/prometheus-grafana-playbook.md +135 -0
  343. package/observability/02-playbooks/structured-logging-playbook.md +73 -0
  344. package/observability/03-checklists/observability-checklist.md +54 -0
  345. package/observability/04-antipatterns/observability-antipatterns.md +106 -0
  346. package/operations/01-standards/prometheus-monitoring-complete.md +1578 -0
  347. package/operations/02-playbooks/capacity-planning-playbook.md +620 -0
  348. package/operations/03-checklists/production-launch-checklist.md +365 -0
  349. package/operations/04-antipatterns/operations-antipatterns.md +664 -0
  350. package/operations/05-cases/case-sre-practices.md +581 -0
  351. package/operations/06-glossary/operations-glossary.md +120 -0
  352. package/operations/aiops-anomaly-detection.md +758 -0
  353. package/operations/capacity-planning.md +1061 -0
  354. package/operations/chaos-engineering.md +659 -0
  355. package/operations/incident-command-system.md +38 -0
  356. package/operations/observability-complete.md +442 -0
  357. package/operations/slo-sli-playbook.md +517 -0
  358. package/operations/sre-operations-deep-dive.md +39 -0
  359. package/package.json +8 -0
  360. package/performance/01-standards/performance-and-scalability.md +80 -0
  361. package/performance/01-standards/performance-standards.md +156 -0
  362. package/performance/02-playbooks/query-optimization-playbook.md +103 -0
  363. package/performance/03-checklists/performance-checklist.md +56 -0
  364. package/performance/04-antipatterns/performance-antipatterns.md +146 -0
  365. package/product/01-standards/product-management-complete.md +285 -0
  366. package/product/02-playbooks/feature-launch-playbook.md +207 -0
  367. package/product/02-playbooks/user-research-playbook.md +532 -0
  368. package/product/03-checklists/feature-launch-checklist.md +275 -0
  369. package/product/04-antipatterns/product-antipatterns.md +355 -0
  370. package/product/05-cases/case-mvp-to-scale.md +384 -0
  371. package/product/06-glossary/product-glossary.md +462 -0
  372. package/product/feature-prioritization-framework.md +40 -0
  373. package/product/kpi-and-metric-tree.md +37 -0
  374. package/product/product-discovery-and-prd-deep-dive.md +41 -0
  375. package/quantum/01-standards/quantum-complete.md +1186 -0
  376. package/security/01-standards/api-security-complete.md +511 -0
  377. package/security/01-standards/container-runtime-security.md +574 -0
  378. package/security/01-standards/data-protection-gdpr.md +543 -0
  379. package/security/01-standards/owasp-top10-complete.md +1890 -0
  380. package/security/01-standards/secure-coding-baseline.md +90 -0
  381. package/security/01-standards/supply-chain-security.md +441 -0
  382. package/security/01-standards/web-security-checklist.md +108 -0
  383. package/security/01-standards/zero-trust-architecture.md +521 -0
  384. package/security/02-playbooks/auth-sso-playbook.md +166 -0
  385. package/security/02-playbooks/incident-response-security-playbook.md +588 -0
  386. package/security/02-playbooks/owasp-api-security-playbook.md +129 -0
  387. package/security/02-playbooks/payment-integration-playbook.md +119 -0
  388. package/security/02-playbooks/penetration-testing-playbook.md +517 -0
  389. package/security/03-checklists/security-audit-checklist.md +356 -0
  390. package/security/04-antipatterns/security-coding-antipatterns.md +580 -0
  391. package/security/05-cases/case-log4shell-incident.md +537 -0
  392. package/security/05-cases/case-major-breaches.md +468 -0
  393. package/security/06-glossary/security-glossary.md +212 -0
  394. package/security/compliance-automation.md +993 -0
  395. package/security/container-security.md +680 -0
  396. package/security/devsecops-complete.md +426 -0
  397. package/security/sast-dast-sca.md +775 -0
  398. package/security/secrets-management.md +594 -0
  399. package/security/security-architecture-deep-dive.md +37 -0
  400. package/security/threat-modeling-stride-playbook.md +40 -0
  401. package/seed-templates/auth-system.md +59 -0
  402. package/seed-templates/blog-content.md +94 -0
  403. package/seed-templates/dashboard.md +89 -0
  404. package/seed-templates/docs-site.md +73 -0
  405. package/seed-templates/e-commerce.md +50 -0
  406. package/seed-templates/saas-landing.md +92 -0
  407. package/seed-templates/settings-page.md +51 -0
  408. package/testing/01-standards/test-strategy-and-layering.md +83 -0
  409. package/testing/01-standards/testing-strategy-complete.md +422 -0
  410. package/testing/01-standards/unit-testing-best-practices.md +118 -0
  411. package/testing/02-playbooks/e2e-testing-playbook.md +988 -0
  412. package/testing/02-playbooks/testing-strategy-playbook.md +126 -0
  413. package/testing/03-checklists/test-strategy-checklist.md +208 -0
  414. package/testing/04-antipatterns/testing-antipatterns.md +718 -0
  415. package/testing/05-cases/case-testing-transformation.md +300 -0
  416. package/testing/06-glossary/testing-glossary.md +110 -0
  417. package/testing/risk-based-test-matrix.md +36 -0
  418. package/testing/testing-strategy-deep-dive.md +37 -0
@@ -0,0 +1,490 @@
+ ---
+ id: incident-antipatterns
+ title: Incident Management Anti-Patterns Guide
+ domain: incident
+ category: 04-antipatterns
+ difficulty: intermediate
+ tags: [alert, antipatterns, fatigue, heroism, incident, postmortem, 反模式, 告警疲劳]
+ quality_score: 70
+ last_updated: 2026-06-15
+ ---
+ # Incident Management Anti-Patterns Guide
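+
+ > Scope: SRE / DevOps / operations teams / engineering teams
+ > Constraint level: SHALL (these anti-patterns must be avoided when building incident management processes)
+ > Goal: Identify and eliminate common anti-patterns in incident management, and build an efficient, reliable incident response system.
+
+ ---
+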
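+ ## Anti-Pattern 1: Heroism
+
+ ### Description
+
+ The team depends on a handful of "hero" engineers to resolve every major incident. These people are always the ones woken in the middle of the night, always the first to respond, and always able to put out the fire. The organization takes their individual overwork for granted instead of reading it as a sign of a systemic problem.
+
+ ### Harm
+
+ - Exhausted heroes make worse decisions, which leads to bigger incidents
+ - Knowledge concentrates in a few people, creating a single point of failure
+ - Other team members get no chance to practice, so their skills do not grow
+ - When a hero leaves, the team's incident response capability falls off a cliff
+ - The appearance that "problems always get solved" masks systemic architecture and process flaws
+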
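+ ### Wrong Approach
+
+ ```
+ # BAD: the heroism pattern
+ Incident occurs → everyone's first reaction is "find Zhang San"
+   → Zhang San is unavailable? Wait for Zhang San to come online
+   → Zhang San is on call 25 days a month and never dares to take annual leave
+   → Everyone else watches and learns but never handles an incident alone
+   → Management: "Zhang San is so dependable, give him a better performance rating"
+ ```
+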
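+ ### Correct Approach
+
+ ```
+ # GOOD: systematic incident response
+ Incident occurs → the on-call engineer responds by following the Runbook
+   → Not resolved within 10 minutes → automatic escalation to second-line support
+   → Every operational step is recorded in the Runbook so that anyone can execute it
+   → After the incident, the postmortem updates the Runbook and fills in missing steps
+   → Monthly drills ensure that every engineer in the rotation can respond independently
+ ```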
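The escalation rule in the flow above can be expressed as a small function. This is a minimal sketch, not a real paging tool's API: the `Incident` class, the `current_tier` function, and the tier names are hypothetical, and only the 10-minute threshold comes from the flow itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Threshold taken from the flow above; everything else is illustrative.
ESCALATE_AFTER = timedelta(minutes=10)


@dataclass
class Incident:
    opened_at: datetime
    resolved: bool = False


def current_tier(incident: Incident, now: datetime) -> str:
    """Return which tier should own the incident at time `now`."""
    if incident.resolved:
        return "none"
    # Unresolved for 10 minutes or longer: hand over to second-line support.
    if now - incident.opened_at >= ESCALATE_AFTER:
        return "second-line"
    return "on-call"
```

Keeping the rule in code rather than in people's heads means escalation depends on elapsed time, not on whether a particular engineer happens to be reachable.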
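+
+ ### Detection Signals
+
+ - One person handled > 50% of P0/P1 incidents
+ - The on-call rotation always lists the same few people
+ - Someone says "only XX can fix this problem"
+ - Runbook coverage is < 50%, or Runbooks have not been updated for a long time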
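The first signal in the list above can be checked mechanically from incident records. The sketch below assumes each record is a `(severity, responder)` tuple; that shape and the function name `hero_responders` are assumptions for illustration, not part of any incident tracker.

```python
from collections import Counter


def hero_responders(incidents, threshold=0.5):
    """Return responders who handled more than `threshold` of P0/P1 incidents.

    `incidents` is an iterable of (severity, responder) tuples (assumed shape).
    """
    severe = [who for severity, who in incidents if severity in ("P0", "P1")]
    if not severe:
        return []
    counts = Counter(severe)
    # Strictly greater than the threshold, matching the "> 50%" signal.
    return sorted(who for who, n in counts.items() if n / len(severe) > threshold)
```

Running this over a quarter of incident history gives a concrete starting point for rebalancing the rotation.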
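+
+ ---
+
+ ## Anti-Pattern 2: Alert Fatigue
+
+ ### Description
+
+ The monitoring system generates a large volume of low-value or duplicate alerts. On-call engineers drown in the flood, grow numb to alerts, and start ignoring or even muting them, so genuinely severe problems get buried.
+
+ ### Harm
+
+ - Critical alerts are ignored, and time to detect (TTD) grows sharply
+ - On-call engineers suffer high stress and poor sleep, and turnover is high
+ - The team loses trust in the monitoring system and stops taking alerts seriously
+ - A vicious cycle forms: the more alerts are ignored → the more alerts go unhandled → the more alerts fire
+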
+ ### Wrong Approach
+
+ ```yaml
+ # BAD: example alert configuration
+ alerts:
+   - name: CPU_HIGH
+     condition: cpu > 50%                  # Threshold too low; normal fluctuation triggers it
+     severity: critical                    # Every alert is critical
+     channels: [sms, call, email, slack]   # Every channel is notified
+
+   - name: MEMORY_HIGH
+     condition: memory > 60%               # Same as above
+     severity: critical
+     channels: [sms, call, email, slack]
+
+   - name: DISK_USAGE
+     condition: disk > 70%                 # 70% does not need an alert at all
+     severity: critical
+     channels: [sms, call, email, slack]
+
+ # Result: on-call receives 200+ alerts a day, and 90% need no action
+ ```
+
100
+ ### 正确做法
101
+
102
+ ```yaml
103
+ # GOOD: 分级分层的告警策略
104
+ alerts:
105
+ - name: API_ERROR_RATE_CRITICAL
106
+ condition: error_rate_5m > 5%
107
+ severity: critical
108
+ channels: [pagerduty] # 仅电话告警
109
+ runbook: https://wiki/runbook/api-error-rate
110
+ auto_resolve: true
111
+
112
+ - name: API_ERROR_RATE_WARNING
113
+ condition: error_rate_5m > 1%
114
+ severity: warning
115
+ channels: [slack] # 仅消息通知
116
+ suppress_duration: 30m # 30 分钟内不重复
117
+
118
+ - name: DISK_USAGE_HIGH
119
+ condition: disk > 90%
120
+ severity: warning
121
+ channels: [slack]
122
+ predict_full_in: 24h # 预测 24 小时内满才告警
123
+
124
+ # 原则:
125
+ # - Critical: 需要立即人工介入,否则用户受影响
126
+ # - Warning: 需要关注但不紧急,工作时间处理
127
+ # - Info: 记录但不通知,用于排查和趋势分析
128
+ ```
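The `predict_full_in: 24h` key above is part of this document's illustrative config rather than a field of a specific monitoring product. One way such a check could work is linear extrapolation over recent disk usage samples. The following is a minimal sketch under that assumption; the function names and sample values are hypothetical.

```python
def hours_until_full(samples, capacity_pct=100.0):
    """Estimate hours until the disk is full by linear extrapolation.

    samples: list of (hour, used_pct) pairs, oldest first.
    Returns None when usage is flat or falling, since no fill time exists.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0 or u1 <= u0:
        return None
    rate_per_hour = (u1 - u0) / (t1 - t0)
    return (capacity_pct - u1) / rate_per_hour


def should_alert(samples, usage_threshold=90.0, predict_full_in_hours=24.0):
    """Alert only if usage is above the threshold AND the disk fills soon."""
    current = samples[-1][1]
    eta = hours_until_full(samples)
    return current > usage_threshold and eta is not None and eta <= predict_full_in_hours


fast_growth = [(0, 88.0), (6, 91.0)]    # +0.5 points per hour
slow_growth = [(0, 90.5), (24, 91.0)]   # barely moving
print(should_alert(fast_growth), should_alert(slow_growth))
```

Combining the static threshold with a growth prediction keeps a disk that sits steadily at 91% from generating alerts, while a fast-filling disk still gets attention.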
+
+ ### Detection signals
+
+ - On-call receives > 50 alerts per day
+ - > 70% of alerts need no human action (auto-resolved or false positive)
+ - Someone on the team has muted alert notifications
+ - "Alert storms" occur (one problem triggers 10+ alerts)
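Two of these signals, the share of alerts needing no action and alert storms, can be computed from an alert export. The sketch below assumes each alert record carries an `actionable` flag and a `cause` label; both fields are hypothetical and would have to come from triage data or alert grouping in a real system.

```python
from collections import Counter


def noise_ratio(alerts):
    """Fraction of alerts that needed no human action."""
    if not alerts:
        return 0.0
    return sum(1 for a in alerts if not a["actionable"]) / len(alerts)


def storm_causes(alerts, min_alerts=10):
    """Causes that produced at least `min_alerts` alerts (an alert storm)."""
    counts = Counter(a["cause"] for a in alerts)
    return sorted(cause for cause, n in counts.items() if n >= min_alerts)


# Hypothetical sample: one database outage fans out into 12 alerts,
# of which only the first needed a human.
alerts = [{"cause": "db-down", "actionable": i == 0} for i in range(12)]
alerts.append({"cause": "disk-full", "actionable": True})

print(f"noise ratio: {noise_ratio(alerts):.0%}")
print("storms:", storm_causes(alerts))
```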
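+
+ ---
+
+ ## Anti-pattern 3: No Postmortem
+
+ ### Description
+
+ Work stops once the incident is recovered, with no systematic postmortem. The team is busy with day-to-day development and considers postmortems "too time-consuming" or thinks "we already know the cause, no meeting needed". As a result, the same kind of incident keeps recurring.
+
+ ### Harm
+
+ - The same kind of incident recurs, and MTTR rises instead of falling
+ - The team cannot learn from failure and keeps falling into the same pit
+ - There is no driver for improvement, so system stability stagnates
+ - Newcomers cannot learn from past incidents
+
+ ### Wrong approach
+
+ ```
+ # BAD: a typical scene after recovery
+ 15:05 Incident recovered
+ 15:10 "OK, problem solved, everyone back to work"
+ 15:15 Everyone returns to their sprint tasks
+ (three weeks later the same problem happens again)
+ "Didn't we fix this last time?" "We applied a workaround; the root cause was never addressed"
+ ```
+
+ ### Correct approach
+
+ ```
+ # GOOD: standard process after recovery
+ 15:05  Incident recovered
+ 15:10  Create a postmortem issue and record a preliminary timeline
+ 15:15  Assign a postmortem owner and schedule the postmortem meeting (within 48h)
+ Day+2  Hold the postmortem meeting (60min)
+ Day+3  Publish the postmortem report; file action items as JIRA tickets
+ Week+1 Review action item progress
+ Week+4 Verify that the action items worked
+ ```
+
+ ### Detection signals
+
+ - Incidents at P1 or above have no corresponding postmortem report
+ - Postmortem reports have no action items, or nobody tracks them
+ - The same kind of incident recurs within 90 days
+ - The team does not know which incidents happened in the past year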
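The 90-day recurrence signal is easy to automate once incidents are tagged with a root-cause category. The sketch below assumes such a tag exists; the category names and dates are made up for illustration.

```python
from datetime import date, timedelta


def repeated_incidents(incidents, window_days=90):
    """Return root-cause categories that recurred within `window_days`.

    incidents: list of (date, category) tuples in any order.
    """
    window = timedelta(days=window_days)
    last_seen = {}
    repeats = set()
    for day, category in sorted(incidents):
        previous = last_seen.get(category)
        if previous is not None and day - previous <= window:
            repeats.add(category)
        last_seen[category] = day
    return sorted(repeats)


# Hypothetical history: the connection leak came back after 22 days,
# the certificate expiry only after about five months.
history = [
    (date(2024, 1, 10), "db-connection-leak"),
    (date(2024, 2, 1), "db-connection-leak"),
    (date(2024, 1, 5), "cert-expiry"),
    (date(2024, 6, 1), "cert-expiry"),
]
print(repeated_incidents(history))
```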
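+
+ ---
+
+ ## Anti-pattern 4: Blame Culture
+
+ ### Description
+
+ After an incident, the first reaction is to ask "who did it" rather than "why did it happen". People who make mistakes are publicly criticized, marked down in performance reviews, or made to write a self-criticism, so the team hides problems instead of surfacing them.
+
+ ### Harm
+
+ - Engineers conceal operational mistakes, and small problems snowball into major incidents
+ - Nobody dares to make necessary high-risk changes, so the system stagnates
+ - Postmortems turn into blame sessions and cannot get truthful information
+ - Near misses go completely unreported
+ - Strong engineers leave because of the culture
+
+ ### Wrong approach
+
+ ```
+ # BAD: a postmortem meeting in a blame culture
+ Manager: "Who deployed the code behind this incident?"
+ Engineer A: "...I did"
+ Manager: "Didn't you test before release? You lack even the basics"
+ Engineer A: "I did test, but..."
+ Manager: "From now on your code needs two reviewers before it can ship"
+ (Afterwards A becomes too cautious to ship any change, and everyone else learns to never volunteer)
+ ```
+
+ ### Correct approach
+
+ ```
+ # GOOD: a blameless postmortem meeting
+ Facilitator: "Let's understand what happened. A, can you walk us through the release?"
+ A: "I ran the deployment at 14:00 and everything looked fine in the test environment. After release we found
+     that production holds 100x the data of the test environment, so query performance was completely different."
+ Facilitator: "That tells us the data volume gap between test and production is too large.
+     This is a systemic gap; let's discuss how to close it."
+ Action item: build a staging environment with production-scale data
+ ```
+
+ ### Detection signals
+
+ - Postmortem reports contain individual names and critical language
+ - After an incident, someone is asked to "write a self-criticism" or "make a pledge"
+ - Engineers say "I don't dare deploy" or "let someone else do the release"
+ - Small incidents are hidden until they become big ones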
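+
+ ---
+
+ ## Anti-pattern 5: Manual Recovery
+
+ ### Description
+
+ Incident recovery depends entirely on manual work: SSH into servers by hand, run SQL by hand, restart services by hand, shift traffic by hand. The steps live in people's heads and are neither documented nor automated.
+
+ ### Harm
+
+ - Recovery takes long because people must recall and look up the steps
+ - The risk of operator error is high, and stress makes mistakes more likely
+ - Non-experts cannot perform recovery, so it depends on specific people
+ - The same recovery operation starts from scratch every time
+
+ ### Wrong approach
+
+ ```bash
+ # BAD: manual recovery procedure (exists only in someone's memory)
+ # 1. SSH to the production server (which one was it?)
+ ssh admin@prod-server-03
+
+ # 2. Check the logs (what is the log path?)
+ tail -f /var/log/app/error.log
+
+ # 3. Restart the service (hard kill or graceful?)
+ sudo systemctl restart app-service
+
+ # 4. Check whether it recovered (which metric?)
+ curl http://localhost:8080/health
+
+ # 5. If not recovered, roll back the database (to which version? what is the SQL?)
+ psql -c "UPDATE config SET value='old_value' WHERE key='feature_flag'"
+ ```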
+
+ ### Correct approach
+
+ ```yaml
+ # GOOD: automated recovery runbook
+ # runbook/api-service-recovery.yaml
+ name: API service recovery
+ trigger: API error rate > 5% for 5 minutes
+ steps:
+   - name: Automatic diagnosis
+     action: run_diagnostics
+     checks: [health, connectivity, resources, recent_deploys]
+
+   - name: Automatic restart (if the health check fails)
+     action: rolling_restart
+     target: api-service
+     canary: true
+     rollback_on_failure: true
+
+   - name: Automatic rollback (if the restart did not help and there was a recent deploy)
+     action: rollback_deployment
+     target: api-service
+     to: last_stable_version
+
+   - name: Automatic scale-up (if traffic is the cause)
+     action: scale_up
+     target: api-service
+     max_replicas: 20
+
+   - name: Human intervention (all of the above failed)
+     action: page_oncall
+     escalation: P1
+     context: "Automatic recovery failed; manual investigation required"
+ ```
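The runbook above is declarative; something has to walk its steps in order and stop once the service is healthy again. The sketch below shows that control flow only. It is not the executor of any particular tool: `run_runbook`, the step callables, and the in-memory `state` are hypothetical stand-ins, and the per-step conditions from the YAML are simplified to a single health check before each step.

```python
def run_runbook(steps, is_healthy):
    """Run (name, action) steps in order until the service is healthy.

    Returns the names of the steps that were actually executed.
    """
    executed = []
    for name, action in steps:
        if is_healthy():
            break  # recovered: skip the remaining, more drastic steps
        action()
        executed.append(name)
    return executed


# Simulated service: in this made-up scenario the rolling restart fixes it.
state = {"healthy": False, "paged": False}

steps = [
    ("run_diagnostics", lambda: None),
    ("rolling_restart", lambda: state.update(healthy=True)),
    ("rollback_deployment", lambda: None),
    ("scale_up", lambda: None),
    ("page_oncall", lambda: state.update(paged=True)),
]

print(run_runbook(steps, is_healthy=lambda: state["healthy"]))
print("paged on-call:", state["paged"])
```

Ordering the steps from least to most disruptive and ending with a page means a human is only woken when automation has run out of options.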
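+
+ ### Detection signals
+
+ - Incident recovery involves SSH sessions to production servers
+ - Recovery steps are undocumented and passed on by word of mouth
+ - In incidents with MTTR > 30 minutes, > 50% of the time goes to "recalling the steps"
+ - A recovery operation has caused a secondary incident through human error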
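+
+ ---
+
+ ## Anti-pattern 6: No Runbook
+
+ ### Description
+
+ The team has no standardized runbooks, so incident response depends entirely on each engineer's personal experience and on-the-spot judgment. Newcomers are helpless in an incident, and veterans work from memory at 3 a.m.
+
+ ### Harm
+
+ - Newcomers respond very slowly on call, multiplying MTTR
+ - For the same type of incident, handling and quality vary wildly from person to person
+ - Knowledge transfer breaks down, and team capability drops sharply after staff changes
+ - On-call shifts carry heavy psychological pressure, and people dread picking up alerts
+
+ ### Wrong approach
+
+ ```
+ # BAD: incident response without a runbook
+ 03:00 PagerDuty alert: database connection count spiking
+ On-call newcomer: "What do I do?"
+   → Digs through Slack history for similar cases
+   → Searches the internal wiki and finds nothing
+   → Phones a senior colleague and wakes them up
+   → The senior colleague guides the operation from memory
+   → Recovered after 40 minutes (25 of them spent finding people and a method)
+ ```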
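+
+ ### Correct approach
+
+ ```markdown
+ # GOOD: example of a standardized runbook
+ ## Runbook: abnormal database connection count
+
+ ### Trigger conditions
+ - Active database connections > 80% of max connections
+ - Application connection pool wait queue > 100
+
+ ### Quick diagnosis (finish within 2 minutes)
+ 1. Check the current connection count: `SELECT count(*) FROM pg_stat_activity;`
+ 2. Check connection sources: `SELECT client_addr, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;`
+ 3. Check long transactions: `SELECT pid, now()-xact_start, query FROM pg_stat_activity WHERE state='active' ORDER BY 2 DESC LIMIT 10;`
+
+ ### Recovery steps
+ **Scenario A: caused by slow queries**
+ 1. Kill queries running longer than 5 minutes (excluding the current session): `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='active' AND now()-query_start > interval '5 minutes' AND pid <> pg_backend_pid();`
+ 2. Verify that the connection count drops
+ 3. Record the killed queries and open an optimization issue
+
+ **Scenario B: connection leak**
+ 1. Rolling-restart the application instances (one at a time)
+ 2. Verify that the connection count drops
+ 3. Investigate the connection leak in the application code
+
+ **Scenario C: traffic surge**
+ 1. Temporarily raise the database max connections
+ 2. Enable queueing in the application connection pool
+ 3. Evaluate whether read replicas are needed to offload traffic
+
+ ### Escalation conditions
+ - Not recovered within 15 minutes after the steps above → escalate to P1
+ - Database primary unavailable → escalate to P0 immediately
+ ```
+
+ ### Detection signals
+
+ - Searching internal docs for the keyword "Runbook" returns nothing
+ - On-call handoffs rely on verbal explanation
+ - After a first on-call shift, a newcomer says "I had no idea what to do"
+ - The same alert is handled with different steps every time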
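+
+ ---
+
+ ## Anti-pattern 7: Information Silos
+
+ ### Description
+
+ During incident response, key information is scattered across channels and teams. Poor communication leads to duplicated investigation, missed clues, and delayed decisions.
+
+ ### Harm
+
+ - Several teams investigate the same root cause without knowing about each other
+ - Key information arrives late, slowing decisions
+ - The full timeline cannot be reconstructed afterwards
+ - Cross-team collaboration is inefficient
+
+ ### Wrong approach
+
+ ```
+ # BAD: information scattered across channels
+ - The frontend team discusses "page load timeouts" in #frontend
+ - The backend team discusses "slow API responses" in #backend
+ - DBAs discuss "high database load" in private chat
+ - SREs discuss "abnormal network latency" in #ops
+ - Nobody realizes these are different symptoms of the same incident
+ - The manager learns of the incident only 30 minutes later
+ ```
+
+ ### Correct approach
+
+ ```
+ # GOOD: centralized incident communication
+ 1. Create a dedicated channel #incident-2024-0042 as soon as the incident is confirmed
+ 2. Keep all related discussion in that channel
+ 3. The Incident Commander posts a status update every 10 minutes:
+    - Current status (investigating / identified / recovering / recovered)
+    - Scope of impact
+    - Current hypotheses and investigation direction
+    - Which teams are needed
+ 4. Sync automatically to the status page so users can check for themselves
+ 5. Archive the channel after the incident as postmortem material
+ ```
+
+ ### Detection signals
+
+ - Incident response requires talking in 3 or more channels at once
+ - Someone says "I didn't know it was already being handled"
+ - It turns out afterwards that a team knew the root cause early but did not pass it on
+ - Management learns of incidents only through informal channels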
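+
+ ---
+
+ ## Anti-pattern 8: Over-Escalation
+
+ ### Description
+
+ The team misjudges incident severity, escalating ordinary problems to P0/P1 or broadcasting every alert to everyone. Senior leaders and experts are interrupted constantly, and truly major incidents fail to get enough attention.
+
+ ### Harm
+
+ - Senior engineers and management are interrupted often, hurting their normal work
+ - "Cry wolf" effect: people react slowly when a real P0 hits
+ - The team develops an "escalate everything" dependency and avoids making its own decisions
+ - Severity inflation: the number of P0s far exceeds a reasonable range
+
+ ### Wrong approach
+
+ ```
+ # BAD: over-escalating alert configuration
+ Alert → immediately notify CTO + all directors + all Tech Leads + all SREs
+       → CTO receives 10 incident notifications a day
+       → CTO starts ignoring notifications
+       → A real P0 arrives and the CTO misses the notification
+
+ # BAD: vague escalation criteria
+ "If you're not sure, escalate to P0" → everything is P0
+ ```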
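+
+ ### Correct approach
+
+ ```markdown
+ # GOOD: clear incident severity levels and escalation criteria
+
+ ## Severity definitions
+ | Level | Definition | Examples | Notification scope |
+ |-------|------------|----------|--------------------|
+ | P0 | Core business completely unavailable | Full site outage / payment system down / data breach | Incident Commander + SRE + VP |
+ | P1 | Core business severely degraded | Some users cannot place orders / API error rate > 5% | Incident Commander + SRE |
+ | P2 | Non-core feature malfunction | Search recommendations broken / report delays | On-call + owning team |
+ | P3 | Minor anomaly or early warning | Single-node failure (auto-recovered) / capacity warning | Slack notification |
+
+ ## Escalation rules
+ - On-call engineer cannot recover a P2 within 15 minutes → escalate to P1
+ - P1 not recovered within 30 minutes → escalate to P0
+ - Impact spreads to other business lines → raise by one level
+ - When unsure of the level, handle as P2 first and decide within 10 minutes whether to escalate
+ ```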
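Escalation rules like these stay unambiguous when they are written down as a function rather than left to judgment at 3 a.m. The sketch below encodes the first three rules as stated; the function name, argument names, and the choice to apply the time-based rule before the impact rule are assumptions for illustration.

```python
LEVELS = ["P3", "P2", "P1", "P0"]  # ordered from least to most severe


def escalate(level, minutes_unresolved, impact_spread=False):
    """Return the severity level after applying the escalation rules.

    - P2 unresolved for 15 minutes becomes P1
    - P1 unresolved for 30 minutes becomes P0
    - impact spreading to other business lines raises the level by one
    """
    idx = LEVELS.index(level)
    if level == "P2" and minutes_unresolved >= 15:
        idx += 1
    elif level == "P1" and minutes_unresolved >= 30:
        idx += 1
    if impact_spread:
        idx += 1
    return LEVELS[min(idx, len(LEVELS) - 1)]  # never go beyond P0


print(escalate("P2", minutes_unresolved=20))
print(escalate("P1", minutes_unresolved=45))
print(escalate("P2", minutes_unresolved=5, impact_spread=True))
```

The time-based rules are evaluated once per call against the current level, so a P2 that has become a P1 starts a fresh 30-minute clock on the next evaluation instead of jumping straight to P0.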
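+
+ ### Detection signals
+
+ - More than 5 P0 incidents per month on average (healthy range: 0-2 per month)
+ - More than 50% of P0 incidents turn out to be P2/P3 once downgraded
+ - The CTO/VP is interrupted by incident notifications > 3 times per week
+ - The on-call engineer's first reaction is always "escalate first, ask later"
+
+ ---
+
+ ## Agent Checklist
+
+ - [ ] Assess the team's current incident management maturity against this document
+ - [ ] Identify the anti-patterns present and prioritize them by harm
+ - [ ] Draw up an improvement plan for each anti-pattern (measures that can land within 90 days)
+ - [ ] Set up a regular (quarterly) review to check whether anti-patterns have recurred
+ - [ ] Make this document required reading in SRE/on-call onboarding
+ - [ ] Regularly measure and reduce the alert noise ratio (alerts needing no action / total alerts)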
@@ -0,0 +1,176 @@
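+ ---
+ id: case-cascade-failure
+ title: "Cascading failure case: one service timeout triggers a site-wide avalanche"
+ domain: incident
+ category: 05-cases
+ difficulty: intermediate
+ tags: [agent, cascade, case, checklist, failure, incident, whys, 关键教训]
+ quality_score: 70
+ last_updated: 2026-06-15
+ ---
+ # Cascading failure case: one service timeout triggers a site-wide avalanche
+
+ ## Overview
+
+ In March 2024, during a major sales promotion on an e-commerce platform, slow responses from the product
+ recommendation service set off a cascading failure that ended with the whole site unavailable. The incident
+ ran 47 minutes from trigger to recovery (14:00 to 14:47) and caused a direct loss of about ¥3.8 million.
+ This case reviews the full failure chain, the emergency response, and the follow-up improvements.
+
+ ## System background
+
+ ```
+ User request → Nginx → API Gateway → Product service → [Recommendation service, Inventory service, Pricing service]
+
+ Recommendation service → Redis cache → Recommendation engine (ML model)
+ ```
+
+ - The product detail page calls the recommendation service synchronously to fetch "You may also like"
+ - Recommendation service connection pool size: 200; timeout: 5s
+ - API Gateway to product service timeout: 10s
+ - Normal QPS: product service 8,000; recommendation service 3,000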
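+
+ ## Incident timeline
+
+ ### T+0min - Trigger
+
+ ```
+ 14:00 Promotion starts; traffic jumps from 8,000 QPS to 25,000 QPS
+ 14:03 Recommendation engine ML inference latency rises from 50ms to 800ms
+       Root cause: the model service was not scaled out; GPU utilization at 100%
+ ```
+
+ ### T+5min - First stage of propagation
+
+ ```
+ 14:05 Recommendation service response time rises from 100ms to 4.5s
+       Redis cache hit rate drops from 85% to 30% (new promotion items were not pre-warmed)
+ 14:06 Recommendation service thread pool exhausted (all 200 threads blocked on ML calls)
+ 14:07 Recommendation service starts returning timeout errors, but the product service keeps waiting
+ ```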
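Why a latency jump from 100ms to 4.5s exhausts a 200-thread pool follows from Little's Law: sustainable throughput is roughly pool size divided by time per request. The sketch below plugs in the numbers from the timeline. Treating the 200 threads as a single pool serving all recommendation traffic is a simplifying assumption, since the case does not say how many instances there were.

```python
def max_throughput_rps(pool_size, latency_ms):
    """Upper bound on requests/second a fixed pool can sustain (Little's Law)."""
    return pool_size * 1000 / latency_ms


POOL_SIZE = 200  # threads, from the timeline

healthy = max_throughput_rps(POOL_SIZE, latency_ms=100)    # normal response time
degraded = max_throughput_rps(POOL_SIZE, latency_ms=4500)  # response time at 14:05

print(f"healthy:  {healthy:.0f} req/s")
print(f"degraded: {degraded:.0f} req/s")
print(f"capacity lost: {1 - degraded / healthy:.1%}")
```

With every thread held for 4.5s, the pool's ceiling falls to a few dozen requests per second, so any realistic promotion traffic queues up immediately and callers start hitting their own timeouts.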
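+
+ ### T+10min - Second stage of propagation
+
+ ```
+ 14:10 Product service thread pool starts to back up
+       Each request holds its thread until the 5s recommendation timeout fires
+       Effective concurrency drops from 500 to 50
+ 14:12 Product service response time jumps from 200ms to 8s
+ 14:13 API Gateway calls to the product service start timing out (10s limit)
+ 14:14 API Gateway connection pool exhausted
+ ```
+
+ ### T+15min - Site-wide avalanche
+
+ ```
+ 14:15 API Gateway cannot process any requests
+       Impact spreads from product detail to search, cart, orders, and every other service
+       Cause: all services share a single API Gateway instance
+ 14:17 Nginx 502 error rate reaches 98%
+ 14:18 Entire site unavailable
+ ```
+
+ ### T+20min - Alerting and response
+
+ ```
+ 14:20 P0 alert fires (site-wide availability < 10%)
+ 14:22 On-call engineer joins and starts investigating
+ 14:25 Initial assessment points to the recommendation service, but the root cause is still uncertain
+ 14:30 Incident Commander in place; war room opened
+ ```
+
+ ### T+35min - Diagnosis and recovery
+
+ ```
+ 14:35 Root cause confirmed: recommendation service timeouts dragged down the whole call chain
+ 14:37 Decision: degrade the recommendation service (return a static recommendation list)
+ 14:40 Feature flag turns off real-time recommendation calls
+ 14:42 Product service thread pool starts to drain
+ 14:45 API Gateway recovers; the site begins to recover
+ 14:47 Site-wide availability back to 99%
+ ```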
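+
+ ## Root cause analysis (5 Whys)
+
+ ```
+ Why did the whole site go down?
+ → The API Gateway connection pool was exhausted, so no service was reachable
+
+ Why was the API Gateway connection pool exhausted?
+ → The product service responded extremely slowly, so connections could not be released
+
+ Why did the product service respond so slowly?
+ → It waited synchronously for the recommendation service's 5s timeout
+
+ Why did the recommendation service time out?
+ → ML inference latency plus a cache that had not been pre-warmed
+
+ Why was there no circuit breaking or degradation?
+ → The recommendation service was treated as "non-core" yet embedded synchronously in the core path,
+   and no circuit breaker was configured
+ ```
+
+ ## Failure chain diagram
+
+ ```
+ [ML model not scaled out]
+
+ [Recommendation engine latency 800ms]
+
+ [Redis cache not pre-warmed, hit rate 30%] → [Recommendation service thread pool exhausted]
+
+ [Product service waits for timeouts, threads back up]
+
+ [API Gateway connection pool exhausted]
+
+ [All services site-wide unavailable]
+ ```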
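+
+ ## Improvement measures
+
+ ### Immediate fixes (within 1 week)
+
+ | Measure | Details |
+ |---------|---------|
+ | Circuit breaker for the recommendation service | Hystrix/Sentinel; trip automatically when the failure rate is > 50% |
+ | Shorter timeout | Recommendation service timeout reduced from 5s to 500ms |
+ | Async degradation | Make recommendation calls asynchronous; return a static list on timeout |
+ | Cache pre-warming | Automatically pre-warm recommendation caches for hot products 2 hours before a promotion |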
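To make the circuit-breaker row concrete, the sketch below trips when more than 50% of the last N calls failed and serves a static fallback while open. It is a simplified illustration in plain Python, not the Hystrix or Sentinel API: the class, its parameters, the cooldown-based reset (real breakers usually use a half-open probe), and the sample product IDs are all assumptions.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the failure rate over the last `window` calls exceeds `threshold`."""

    def __init__(self, window=10, threshold=0.5, cooldown_s=30.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.opened_at = None

    def _is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the breaker and start a fresh window.
            self.opened_at = None
            self.results.clear()
            return False
        return True

    def _record(self, ok):
        self.results.append(ok)
        window_full = len(self.results) == self.results.maxlen
        if window_full and self.results.count(False) / len(self.results) > self.threshold:
            self.opened_at = self.clock()

    def call(self, func, fallback):
        """Call func unless the breaker is open; use fallback on open or failure."""
        if self._is_open():
            return fallback()
        try:
            result = func()
        except Exception:
            self._record(False)
            return fallback()
        self._record(True)
        return result


STATIC_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]  # hypothetical static list


def failing_recommender():
    raise TimeoutError("recommendation service timed out")


breaker = CircuitBreaker(window=4, threshold=0.5)
for _ in range(5):
    items = breaker.call(failing_recommender, lambda: STATIC_RECOMMENDATIONS)
print(items, "| breaker open:", breaker.opened_at is not None)
```

Once the breaker is open, callers get the static list immediately instead of holding a thread for the full timeout, which is the property that would have stopped the thread pile-up in the product service.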
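+
+ ### Mid-term improvements (within 1 month)
+
+ | Measure | Details |
+ |---------|---------|
+ | Service isolation | Core services (product/order/payment) and non-core services get separate Gateways |
+ | Bulkhead pattern | A dedicated thread pool per downstream service, so they cannot affect each other |
+ | Tiered rate limiting | Guarantee bandwidth for core endpoints; non-core endpoints may be degraded |
+ | Routine load testing | Monthly full-chain load tests to validate capacity and degradation strategy |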
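The bulkhead row can be illustrated with a per-dependency concurrency cap. The table describes dedicated thread pools; the sketch below swaps in a semaphore, a lighter variant of the same idea, and rejects calls immediately when the compartment is full. The class name and the limit of 1 are chosen only to keep the demo small.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one downstream dependency."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, fallback):
        # Do not wait for a slot: a full compartment means the dependency
        # is already saturated, so fail fast with the fallback.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return func()
        finally:
            self._slots.release()


recommendation_bulkhead = Bulkhead(max_concurrent=1)


def slow_call_that_triggers_another():
    # While this call holds the only slot, a second call is rejected.
    return recommendation_bulkhead.call(lambda: "second call ran", lambda: "rejected")


print(recommendation_bulkhead.call(slow_call_that_triggers_another, lambda: "rejected"))
print(recommendation_bulkhead.call(lambda: "slot free again", lambda: "rejected"))
```

With one bulkhead per downstream service, a stalled recommendation service can consume at most its own slots, leaving threads available for inventory and pricing calls.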
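+
+ ### Long-term improvements (within 3 months)
+
+ | Measure | Details |
+ |---------|---------|
+ | Service mesh | Istio to manage timeout/retry/circuit-breaking policies uniformly |
+ | Chaos engineering | Inject faults regularly to verify system resilience |
+ | Capacity planning | Autoscale based on a traffic model; pre-scale automatically before promotions |
+ | SLO framework | Define SLOs for every service and let the error budget drive decisions |
+
+ ## Key lessons
+
+ 1. **Non-core services must not synchronously block the core path**: recommendations, ads, and other non-core features must be asynchronous or use very short timeouts
+ 2. **Timeouts must be validated end to end**: with a 5s downstream timeout under a 10s upstream timeout, a user can wait up to 10s, which is unacceptable
+ 3. **Shared infrastructure is the biggest single point of failure**: all services shared one Gateway, so one service took down everything
+ 4. **Circuit breaking is not optional**: every remote call needs circuit-breaker protection
+ 5. **Cache pre-warming is mandatory for promotions**: a cold cache under peak traffic is a time bomb
+ 6. **Alerts reached people too late**: 5 minutes passed between the avalanche (14:15) and the alert (14:20); alerting should fire at the first stage of propagation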
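Lesson 2 can be turned into a check that runs in CI or during a config review: walk the call chain from the outside in and flag any hop whose timeout exceeds what the user will tolerate, or any inner timeout that is not shorter than its caller's. In the sketch below, the 5s, 10s, and 500ms figures come from this case; the 2s user budget and the 1s gateway timeout are assumptions for illustration.

```python
def check_timeout_chain(hops, user_budget_s):
    """Validate timeouts from the outermost hop inward.

    hops: list of (name, timeout_s), outermost caller first.
    Returns a list of human-readable problems (empty if the chain is consistent).
    """
    problems = []
    if hops and hops[0][1] > user_budget_s:
        name, timeout = hops[0]
        problems.append(f"{name}: {timeout}s exceeds the {user_budget_s}s user budget")
    for (outer, outer_t), (inner, inner_t) in zip(hops, hops[1:]):
        if inner_t >= outer_t:
            problems.append(f"{inner}: {inner_t}s is not shorter than {outer} ({outer_t}s)")
    return problems


# Timeouts as configured during the incident.
incident_chain = [("gateway->product", 10.0), ("product->recommendation", 5.0)]
# A tightened chain using the 500ms recommendation timeout from the fixes.
tightened_chain = [("gateway->product", 1.0), ("product->recommendation", 0.5)]

print(check_timeout_chain(incident_chain, user_budget_s=2.0))
print(check_timeout_chain(tightened_chain, user_budget_s=2.0))
```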
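+
+ ## Agent Checklist
+
+ - [ ] Does the core path have synchronous dependencies on non-core services?
+ - [ ] Do all remote calls have reasonable timeouts (recommended < 1s)?
+ - [ ] Are circuit breakers configured for every downstream service?
+ - [ ] Is the bulkhead pattern (dedicated thread pools/connection pools) in place?
+ - [ ] Does the API Gateway isolate core services from non-core services?
+ - [ ] Is a cache pre-warming mechanism ready?
+ - [ ] Are degradation strategies defined and quickly switchable via feature flag?
+ - [ ] Do full-chain load tests cover cascading-failure scenarios?
+ - [ ] Can alert thresholds fire during the first stage of propagation?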