@umacloud/knowledge 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (418)
  1. package/00-governance/governance-capabilities.md +557 -0
  2. package/00-governance/knowledge-map.md +39 -0
  3. package/00-governance/maintenance-policy.md +76 -0
  4. package/00-governance/review-checklist.md +81 -0
  5. package/README.md +13 -0
  6. package/ai/01-standards/agent-development-complete.md +691 -0
  7. package/ai/01-standards/llm-application-complete.md +488 -0
  8. package/ai/01-standards/mlops-complete.md +798 -0
  9. package/ai/01-standards/prompt-engineering-complete.md +646 -0
  10. package/ai/01-standards/rag-architecture-complete.md +649 -0
  11. package/ai/02-playbooks/llm-evaluation-playbook.md +847 -0
  12. package/ai/03-checklists/ai-project-checklist.md +215 -0
  13. package/ai/04-antipatterns/ai-antipatterns.md +661 -0
  14. package/ai/05-cases/case-rag-production.md +147 -0
  15. package/ai/06-glossary/ai-glossary.md +162 -0
  16. package/ai/agent-evaluation-benchmark.md +53 -0
  17. package/ai/ai-agent-memory-context-management.md +41 -0
  18. package/ai/ai-cost-capacity-optimization-playbook.md +42 -0
  19. package/ai/ai-data-security-and-compliance-playbook.md +37 -0
  20. package/ai/ai-domain-index-and-checklist.md +40 -0
  21. package/ai/ai-governance-maturity-model.md +50 -0
  22. package/ai/ai-model-selection-and-routing-strategy.md +47 -0
  23. package/ai/ai-observability-and-oncall-runbook.md +52 -0
  24. package/ai/ai-rag-engineering-playbook.md +42 -0
  25. package/ai/ai-red-team-and-safety-evaluation.md +42 -0
  26. package/ai/ai-release-readiness-and-rollback-gate.md +42 -0
  27. package/ai/llm-agent-engineering-deep-dive.md +57 -0
  28. package/ai/prompt-and-tool-guardrails.md +52 -0
  29. package/api/01-standards/enterprise-api-standards.md +198 -0
  30. package/api/01-standards/rest-api-design-guide.md +63 -0
  31. package/api/02-playbooks/api-pagination-playbook.md +93 -0
  32. package/api/02-playbooks/graphql-production-playbook.md +176 -0
  33. package/api/03-checklists/api-review-checklist.md +55 -0
  34. package/api/04-antipatterns/api-antipatterns.md +112 -0
  35. package/architecture/01-standards/api-gateway-patterns.md +496 -0
  36. package/architecture/01-standards/cloud-native-patterns.md +644 -0
  37. package/architecture/01-standards/distributed-systems-patterns.md +591 -0
  38. package/architecture/01-standards/event-driven-architecture.md +595 -0
  39. package/architecture/01-standards/microservices-patterns-complete.md +968 -0
  40. package/architecture/01-standards/microservices-patterns.md +495 -0
  41. package/architecture/01-standards/system-design-interview.md +664 -0
  42. package/architecture/02-playbooks/microservices-patterns-playbook.md +137 -0
  43. package/architecture/02-playbooks/migration-playbook.md +780 -0
  44. package/architecture/02-playbooks/system-design-playbook.md +779 -0
  45. package/architecture/03-checklists/architecture-decision-checklist.md +297 -0
  46. package/architecture/04-antipatterns/architecture-antipatterns.md +417 -0
  47. package/architecture/05-cases/case-netflix-microservices.md +413 -0
  48. package/architecture/06-glossary/architecture-glossary.md +164 -0
  49. package/architecture/adr-template-and-examples.md +38 -0
  50. package/architecture/api-gateway-deep-dive.md +1291 -0
  51. package/architecture/configuration-management.md +1162 -0
  52. package/architecture/distributed-transactions.md +1220 -0
  53. package/architecture/microservices-complete.md +735 -0
  54. package/architecture/resilience-and-disaster-patterns.md +37 -0
  55. package/architecture/service-governance.md +1198 -0
  56. package/architecture/system-architecture-deep-dive.md +37 -0
  57. package/backend/01-standards/analytics-and-growth.md +65 -0
  58. package/backend/01-standards/api-and-error-conventions.md +120 -0
  59. package/backend/01-standards/application-layering-and-packaging.md +160 -0
  60. package/backend/01-standards/auth-implementation.md +104 -0
  61. package/backend/01-standards/backend-framework-idioms.md +74 -0
  62. package/backend/01-standards/background-jobs-and-async.md +66 -0
  63. package/backend/01-standards/caching-strategies-complete.md +390 -0
  64. package/backend/01-standards/config-and-observability.md +77 -0
  65. package/backend/01-standards/data-modeling-and-persistence.md +94 -0
  66. package/backend/01-standards/django-complete.md +1765 -0
  67. package/backend/01-standards/email-and-notifications.md +64 -0
  68. package/backend/01-standards/fastapi-complete.md +925 -0
  69. package/backend/01-standards/file-upload-and-storage.md +66 -0
  70. package/backend/01-standards/graphql-api-complete.md +416 -0
  71. package/backend/01-standards/llm-application-standard.md +78 -0
  72. package/backend/01-standards/message-queue-patterns.md +379 -0
  73. package/backend/01-standards/microservices-and-distributed.md +78 -0
  74. package/backend/01-standards/nestjs-complete.md +2167 -0
  75. package/backend/01-standards/payment-integration.md +80 -0
  76. package/backend/01-standards/rate-limiting-complete.md +451 -0
  77. package/backend/01-standards/realtime-and-websocket.md +65 -0
  78. package/backend/01-standards/search-and-filtering.md +64 -0
  79. package/backend/01-standards/spring-boot-complete.md +445 -0
  80. package/backend/02-playbooks/api-design-playbook.md +718 -0
  81. package/backend/02-playbooks/email-send-playbook.md +130 -0
  82. package/backend/02-playbooks/file-upload-s3-playbook.md +153 -0
  83. package/backend/02-playbooks/typescript-enterprise-playbook.md +133 -0
  84. package/backend/02-playbooks/websocket-realtime-playbook.md +154 -0
  85. package/backend/03-checklists/api-launch-checklist.md +189 -0
  86. package/backend/04-antipatterns/backend-antipatterns.md +1051 -0
  87. package/blockchain/01-standards/blockchain-basics.md +557 -0
  88. package/blockchain/01-standards/smart-contract-development.md +1315 -0
  89. package/cicd/01-standards/deployment-and-delivery-standard.md +96 -0
  90. package/cicd/01-standards/github-actions-complete.md +473 -0
  91. package/cicd/01-standards/release-and-store-submission.md +75 -0
  92. package/cicd/02-playbooks/cicd-pipeline-playbook.md +144 -0
  93. package/cicd/02-playbooks/release-management-playbook.md +605 -0
  94. package/cicd/03-checklists/pipeline-security-checklist.md +168 -0
  95. package/cicd/04-antipatterns/cicd-antipatterns.md +589 -0
  96. package/cicd/05-cases/case-deployment-automation.md +221 -0
  97. package/cicd/05-cases/case-gitops-transformation.md +212 -0
  98. package/cicd/06-glossary/cicd-glossary.md +114 -0
  99. package/cicd/cicd-blueprint-deep-dive.md +38 -0
  100. package/cicd/release-readiness-gate.md +37 -0
  101. package/cloud-native/01-standards/container-security.md +741 -0
  102. package/cloud-native/01-standards/kubernetes-complete.md +812 -0
  103. package/cloud-native/02-playbooks/api-gateway-playbook.md +155 -0
  104. package/cloud-native/02-playbooks/gitops-with-argocd.md +760 -0
  105. package/cloud-native/02-playbooks/k8s-troubleshooting-playbook.md +1942 -0
  106. package/cloud-native/02-playbooks/message-queue-playbook.md +129 -0
  107. package/cloud-native/02-playbooks/multicloud-governance.md +726 -0
  108. package/cloud-native/02-playbooks/serverless-patterns.md +788 -0
  109. package/cloud-native/02-playbooks/service-mesh-playbook.md +612 -0
  110. package/cloud-native/02-playbooks/terraform-iac-playbook.md +143 -0
  111. package/cloud-native/03-checklists/container-security-checklist.md +431 -0
  112. package/cloud-native/03-checklists/k8s-production-readiness-checklist.md +460 -0
  113. package/cloud-native/04-antipatterns/container-antipatterns.md +660 -0
  114. package/cloud-native/04-antipatterns/k8s-antipatterns.md +743 -0
  115. package/cloud-native/05-cases/case-k8s-migration.md +478 -0
  116. package/cloud-native/05-cases/case-k8s-scaling.md +642 -0
  117. package/cloud-native/05-cases/case-k8s-security-incident.md +397 -0
  118. package/cloud-native/06-glossary/cloud-native-glossary.md +337 -0
  119. package/cross-platform/01-standards/cross-platform-frameworks.md +83 -0
  120. package/cross-platform/01-standards/platform-selection-and-architecture.md +77 -0
  121. package/data/01-standards/elasticsearch-complete.md +2098 -0
  122. package/data/01-standards/postgresql-complete.md +1613 -0
  123. package/data/01-standards/redis-complete.md +1527 -0
  124. package/data/02-playbooks/database-optimization-playbook.md +403 -0
  125. package/data/02-playbooks/elasticsearch-production-playbook.md +132 -0
  126. package/data/03-checklists/database-launch-checklist.md +187 -0
  127. package/data/04-antipatterns/database-antipatterns.md +873 -0
  128. package/data/05-cases/case-database-migration.md +310 -0
  129. package/data/06-glossary/database-glossary.md +440 -0
  130. package/data/data-governance-and-modeling-deep-dive.md +39 -0
  131. package/data-engineering/01-standards/airflow-complete.md +523 -0
  132. package/data-engineering/01-standards/kafka-complete.md +1521 -0
  133. package/data-engineering/02-playbooks/spark-etl-playbook.md +496 -0
  134. package/data-engineering/03-checklists/pipeline-launch-checklist.md +194 -0
  135. package/data-engineering/04-antipatterns/data-pipeline-antipatterns.md +684 -0
  136. package/data-engineering/05-cases/case-real-time-pipeline.md +355 -0
  137. package/data-engineering/06-glossary/data-engineering-glossary.md +429 -0
  138. package/database/01-standards/database-schema-standards.md +147 -0
  139. package/database/02-playbooks/postgresql-optimization-quick.md +52 -0
  140. package/database/02-playbooks/postgresql-performance-optimization.md +58 -0
  141. package/database/02-playbooks/postgresql-production-playbook.md +146 -0
  142. package/database/02-playbooks/redis-caching-playbook.md +117 -0
  143. package/database/03-checklists/database-review-checklist.md +50 -0
  144. package/database/04-antipatterns/database-antipatterns.md +112 -0
  145. package/design/01-standards/ui-design-system-complete.md +423 -0
  146. package/design/02-playbooks/design-handoff-playbook.md +254 -0
  147. package/design/02-playbooks/design-review-playbook.md +388 -0
  148. package/design/03-checklists/design-review-checklist.md +246 -0
  149. package/design/04-antipatterns/design-antipatterns.md +378 -0
  150. package/design/05-cases/case-design-system-adoption.md +328 -0
  151. package/design/06-glossary/design-glossary.md +329 -0
  152. package/design/ui-full-lifecycle-cross-platform-playbook.md +571 -0
  153. package/design/ux-system-deep-dive.md +38 -0
  154. package/design-systems/00-craft-rules.md +71 -0
  155. package/design-systems/aesthetic-families.md +43 -0
  156. package/design-systems/anti-ai-slop.md +162 -0
  157. package/design-systems/bold-geometric.md +120 -0
  158. package/design-systems/brutalist-bold.md +103 -0
  159. package/design-systems/editorial-clean.md +109 -0
  160. package/design-systems/glass-aurora.md +108 -0
  161. package/design-systems/modern-minimal.md +145 -0
  162. package/design-systems/premium-luxury.md +106 -0
  163. package/design-systems/product-type-design-map.md +48 -0
  164. package/design-systems/soft-warm.md +123 -0
  165. package/design-systems/tech-utility.md +113 -0
  166. package/desktop/01-standards/desktop-app-standard.md +72 -0
  167. package/desktop/01-standards/desktop-design.md +71 -0
  168. package/development/00-governance/document-template.md +41 -0
  169. package/development/01-standards/api-versioning-strategies.md +432 -0
  170. package/development/01-standards/authentication-patterns-complete.md +479 -0
  171. package/development/01-standards/css-architecture-complete.md +550 -0
  172. package/development/01-standards/database-migration-strategies.md +484 -0
  173. package/development/01-standards/elasticsearch-complete.md +347 -0
  174. package/development/01-standards/git-complete.md +371 -0
  175. package/development/01-standards/golang-complete.md +1565 -0
  176. package/development/01-standards/graphql-complete.md +298 -0
  177. package/development/01-standards/javascript-bundlers-complete.md +469 -0
  178. package/development/01-standards/javascript-typescript-complete.md +528 -0
  179. package/development/01-standards/jest-complete.md +275 -0
  180. package/development/01-standards/linux-complete.md +234 -0
  181. package/development/01-standards/logging-observability-complete.md +526 -0
  182. package/development/01-standards/microservices-communication.md +502 -0
  183. package/development/01-standards/mongodb-complete.md +406 -0
  184. package/development/01-standards/oauth2-complete.md +285 -0
  185. package/development/01-standards/performance-optimization-complete.md +289 -0
  186. package/development/01-standards/playwright-complete.md +247 -0
  187. package/development/01-standards/postgresql-complete.md +456 -0
  188. package/development/01-standards/pytest-complete.md +340 -0
  189. package/development/01-standards/python-async-programming.md +902 -0
  190. package/development/01-standards/python-complete.md +956 -0
  191. package/development/01-standards/python-decorators-complete.md +799 -0
  192. package/development/01-standards/python-design-patterns.md +2854 -0
  193. package/development/01-standards/python-packaging-distribution.md +420 -0
  194. package/development/01-standards/python-testing-strategies.md +607 -0
  195. package/development/01-standards/python-web-frameworks-comparison.md +471 -0
  196. package/development/01-standards/redis-complete.md +317 -0
  197. package/development/01-standards/rest-api-complete.md +316 -0
  198. package/development/01-standards/rust-complete.md +578 -0
  199. package/development/01-standards/typescript-advanced-types.md +1513 -0
  200. package/development/01-standards/web-security-complete.md +292 -0
  201. package/development/02-playbooks/api-design-playbook.md +810 -0
  202. package/development/02-playbooks/database-migration-playbook.md +580 -0
  203. package/development/02-playbooks/debugging-playbook.md +692 -0
  204. package/development/02-playbooks/feature-delivery-playbook.md +430 -0
  205. package/development/02-playbooks/incident-hotfix-playbook.md +387 -0
  206. package/development/02-playbooks/performance-optimization-playbook.md +531 -0
  207. package/development/02-playbooks/performance-tuning-playbook.md +652 -0
  208. package/development/02-playbooks/refactor-playbook.md +403 -0
  209. package/development/02-playbooks/release-playbook.md +469 -0
  210. package/development/03-checklists/architecture-review-checklist.md +168 -0
  211. package/development/03-checklists/data-migration-checklist.md +157 -0
  212. package/development/03-checklists/oncall-handover-checklist.md +173 -0
  213. package/development/03-checklists/pr-checklist.md +158 -0
  214. package/development/03-checklists/production-readiness-checklist.md +190 -0
  215. package/development/03-checklists/release-readiness-checklist.md +154 -0
  216. package/development/03-checklists/security-review-checklist.md +182 -0
  217. package/development/04-antipatterns/api-antipatterns.md +657 -0
  218. package/development/04-antipatterns/architecture-antipatterns.md +686 -0
  219. package/development/04-antipatterns/backend-antipatterns.md +648 -0
  220. package/development/04-antipatterns/cicd-antipatterns.md +540 -0
  221. package/development/04-antipatterns/code-smell-antipatterns.md +571 -0
  222. package/development/04-antipatterns/data-antipatterns.md +658 -0
  223. package/development/04-antipatterns/database-antipatterns.md +578 -0
  224. package/development/04-antipatterns/frontend-antipatterns.md +635 -0
  225. package/development/04-antipatterns/reliability-antipatterns.md +700 -0
  226. package/development/04-antipatterns/security-antipatterns.md +747 -0
  227. package/development/05-cases/case-api-version-migration.md +428 -0
  228. package/development/05-cases/case-authorization-hardening.md +383 -0
  229. package/development/05-cases/case-bluegreen-rollback.md +466 -0
  230. package/development/05-cases/case-cache-snowball-protection.md +485 -0
  231. package/development/05-cases/case-ci-cd-pipeline.md +544 -0
  232. package/development/05-cases/case-database-scaling.md +500 -0
  233. package/development/05-cases/case-db-hotspot-optimization.md +487 -0
  234. package/development/05-cases/case-incident-mttr-reduction.md +563 -0
  235. package/development/05-cases/case-microservice-migration.md +375 -0
  236. package/development/05-cases/case-performance-optimization.md +406 -0
  237. package/development/05-cases/case-security-incident-response.md +345 -0
  238. package/development/06-glossary/full-stack-glossary.md +166 -0
  239. package/development/09-maturity/quarterly-audit-template.md +35 -0
  240. package/development/11-ui-excellence/ui-aesthetic-system.md +41 -0
  241. package/development/11-ui-excellence/ui-engineering-excellence.md +435 -0
  242. package/development/12-scenarios/development-scenarios-guide.md +565 -0
  243. package/development/13-implementation-assets/implementation-toolkit.md +282 -0
  244. package/development/13-implementation-assets/knowledge-gates-execution.md +43 -0
  245. package/development/14-full-lifecycle/software-lifecycle-gates.md +511 -0
  246. package/development/15-lifecycle-templates/project-templates-collection.md +791 -0
  247. package/development/api-contract-and-versioning-guide.md +36 -0
  248. package/development/api-governance-complete.md +43 -0
  249. package/development/backend-engineering-complete.md +43 -0
  250. package/development/code-review-quality-complete.md +43 -0
  251. package/development/concurrency-reliability-complete.md +43 -0
  252. package/development/database-engineering-complete.md +43 -0
  253. package/development/engineering-effectiveness-complete.md +43 -0
  254. package/development/engineering-standards-deep-dive.md +38 -0
  255. package/development/frontend-engineering-complete.md +43 -0
  256. package/development/performance-capacity-complete.md +43 -0
  257. package/development/refactor-migration-complete.md +42 -0
  258. package/development/refactoring-and-techdebt-playbook.md +37 -0
  259. package/development/security-in-development-complete.md +43 -0
  260. package/devops/01-standards/cicd-pipeline-complete.md +262 -0
  261. package/devops/01-standards/docker-complete.md +1490 -0
  262. package/devops/01-standards/github-actions-complete.md +337 -0
  263. package/devops/01-standards/kubernetes-complete.md +638 -0
  264. package/devops/01-standards/terraform-complete.md +2117 -0
  265. package/devops/02-playbooks/docker-compose-playbook.md +233 -0
  266. package/devops/02-playbooks/docker-k8s-production-playbook.md +186 -0
  267. package/devops/02-playbooks/docker-production-playbook.md +952 -0
  268. package/edge-iot/01-standards/edge-iot-complete.md +473 -0
  269. package/experts/architect/api-design.md +178 -0
  270. package/experts/architect/methodology.md +124 -0
  271. package/experts/architect/security.md +75 -0
  272. package/experts/backend-lead/methodology.md +216 -0
  273. package/experts/devops/methodology.md +160 -0
  274. package/experts/frontend-lead/methodology.md +178 -0
  275. package/experts/product-manager/industry/ecommerce.md +43 -0
  276. package/experts/product-manager/industry/saas.md +40 -0
  277. package/experts/product-manager/methodology.md +97 -0
  278. package/experts/qa-lead/methodology.md +123 -0
  279. package/experts/qa-lead/test-strategy.md +128 -0
  280. package/experts/uiux-designer/methodology.md +125 -0
  281. package/frontend/01-standards/accessibility-complete.md +532 -0
  282. package/frontend/01-standards/accessibility-standard.md +74 -0
  283. package/frontend/01-standards/admin-dashboard-and-crud.md +72 -0
  284. package/frontend/01-standards/design-tokens-complete.md +444 -0
  285. package/frontend/01-standards/forms-and-validation.md +77 -0
  286. package/frontend/01-standards/frontend-architecture-and-layering.md +119 -0
  287. package/frontend/01-standards/i18n-and-localization.md +65 -0
  288. package/frontend/01-standards/nextjs-complete.md +451 -0
  289. package/frontend/01-standards/react-complete.md +713 -0
  290. package/frontend/01-standards/react-hooks-complete-guide.md +1100 -0
  291. package/frontend/01-standards/react-hooks-complete.md +1171 -0
  292. package/frontend/01-standards/seo-and-web-vitals.md +77 -0
  293. package/frontend/01-standards/state-management-complete.md +444 -0
  294. package/frontend/01-standards/vue-complete.md +499 -0
  295. package/frontend/01-standards/vue3-complete.md +2002 -0
  296. package/frontend/01-standards/web-framework-best-practices.md +64 -0
  297. package/frontend/01-standards/web-performance-complete.md +495 -0
  298. package/frontend/02-playbooks/accessibility-a11y-playbook.md +161 -0
  299. package/frontend/02-playbooks/frontend-performance-playbook.md +707 -0
  300. package/frontend/02-playbooks/i18n-internationalization-playbook.md +120 -0
  301. package/frontend/02-playbooks/performance-optimization-playbook.md +163 -0
  302. package/frontend/02-playbooks/react-nextjs-production-playbook.md +167 -0
  303. package/frontend/02-playbooks/react-state-management-playbook.md +173 -0
  304. package/frontend/03-checklists/component-quality-checklist.md +166 -0
  305. package/frontend/03-checklists/frontend-launch-checklist.md +299 -0
  306. package/frontend/04-antipatterns/frontend-antipatterns.md +886 -0
  307. package/frontend/05-cases/case-performance-optimization.md +274 -0
  308. package/harmony/01-standards/harmonyos-arkts-standard.md +75 -0
  309. package/harmony/01-standards/harmonyos-design.md +65 -0
  310. package/high-quality-engineering-playbook.md +54 -0
  311. package/incident/01-standards/incident-response-complete.md +303 -0
  312. package/incident/02-playbooks/chaos-engineering-playbook.md +883 -0
  313. package/incident/02-playbooks/postmortem-playbook.md +398 -0
  314. package/incident/03-checklists/incident-readiness-checklist.md +181 -0
  315. package/incident/04-antipatterns/incident-antipatterns.md +490 -0
  316. package/incident/05-cases/case-cascade-failure.md +176 -0
  317. package/incident/06-glossary/incident-glossary.md +114 -0
  318. package/incident/postmortem-and-response-deep-dive.md +39 -0
  319. package/industries/ecommerce/ecommerce-complete.md +631 -0
  320. package/industries/education/education-complete.md +555 -0
  321. package/industries/fintech/fintech-complete.md +501 -0
  322. package/industries/gaming/gaming-complete.md +587 -0
  323. package/industries/healthcare/healthcare-complete.md +452 -0
  324. package/low-code/01-standards/low-code-complete.md +944 -0
  325. package/miniprogram/01-standards/ai-common-mistakes.md +61 -0
  326. package/miniprogram/01-standards/miniprogram-custom-navbar-capsule.md +77 -0
  327. package/miniprogram/01-standards/miniprogram-design.md +61 -0
  328. package/miniprogram/01-standards/miniprogram-standard.md +81 -0
  329. package/mobile/01-standards/android-material-design.md +70 -0
  330. package/mobile/01-standards/flutter-complete.md +384 -0
  331. package/mobile/01-standards/ios-design-hig.md +78 -0
  332. package/mobile/01-standards/mobile-app-standard.md +85 -0
  333. package/mobile/01-standards/react-native-complete.md +352 -0
  334. package/mobile/02-playbooks/mobile-cross-platform-playbook.md +175 -0
  335. package/mobile/02-playbooks/mobile-performance.md +473 -0
  336. package/mobile/03-checklists/mobile-release-checklist.md +234 -0
  337. package/mobile/04-antipatterns/mobile-antipatterns.md +798 -0
  338. package/mobile/05-cases/case-app-performance.md +500 -0
  339. package/mobile/05-cases/case-app-startup-optimization.md +218 -0
  340. package/mobile/06-glossary/mobile-glossary.md +484 -0
  341. package/observability/01-standards/observability-standards.md +103 -0
  342. package/observability/02-playbooks/prometheus-grafana-playbook.md +135 -0
  343. package/observability/02-playbooks/structured-logging-playbook.md +73 -0
  344. package/observability/03-checklists/observability-checklist.md +54 -0
  345. package/observability/04-antipatterns/observability-antipatterns.md +106 -0
  346. package/operations/01-standards/prometheus-monitoring-complete.md +1578 -0
  347. package/operations/02-playbooks/capacity-planning-playbook.md +620 -0
  348. package/operations/03-checklists/production-launch-checklist.md +365 -0
  349. package/operations/04-antipatterns/operations-antipatterns.md +664 -0
  350. package/operations/05-cases/case-sre-practices.md +581 -0
  351. package/operations/06-glossary/operations-glossary.md +120 -0
  352. package/operations/aiops-anomaly-detection.md +758 -0
  353. package/operations/capacity-planning.md +1061 -0
  354. package/operations/chaos-engineering.md +659 -0
  355. package/operations/incident-command-system.md +38 -0
  356. package/operations/observability-complete.md +442 -0
  357. package/operations/slo-sli-playbook.md +517 -0
  358. package/operations/sre-operations-deep-dive.md +39 -0
  359. package/package.json +8 -0
  360. package/performance/01-standards/performance-and-scalability.md +80 -0
  361. package/performance/01-standards/performance-standards.md +156 -0
  362. package/performance/02-playbooks/query-optimization-playbook.md +103 -0
  363. package/performance/03-checklists/performance-checklist.md +56 -0
  364. package/performance/04-antipatterns/performance-antipatterns.md +146 -0
  365. package/product/01-standards/product-management-complete.md +285 -0
  366. package/product/02-playbooks/feature-launch-playbook.md +207 -0
  367. package/product/02-playbooks/user-research-playbook.md +532 -0
  368. package/product/03-checklists/feature-launch-checklist.md +275 -0
  369. package/product/04-antipatterns/product-antipatterns.md +355 -0
  370. package/product/05-cases/case-mvp-to-scale.md +384 -0
  371. package/product/06-glossary/product-glossary.md +462 -0
  372. package/product/feature-prioritization-framework.md +40 -0
  373. package/product/kpi-and-metric-tree.md +37 -0
  374. package/product/product-discovery-and-prd-deep-dive.md +41 -0
  375. package/quantum/01-standards/quantum-complete.md +1186 -0
  376. package/security/01-standards/api-security-complete.md +511 -0
  377. package/security/01-standards/container-runtime-security.md +574 -0
  378. package/security/01-standards/data-protection-gdpr.md +543 -0
  379. package/security/01-standards/owasp-top10-complete.md +1890 -0
  380. package/security/01-standards/secure-coding-baseline.md +90 -0
  381. package/security/01-standards/supply-chain-security.md +441 -0
  382. package/security/01-standards/web-security-checklist.md +108 -0
  383. package/security/01-standards/zero-trust-architecture.md +521 -0
  384. package/security/02-playbooks/auth-sso-playbook.md +166 -0
  385. package/security/02-playbooks/incident-response-security-playbook.md +588 -0
  386. package/security/02-playbooks/owasp-api-security-playbook.md +129 -0
  387. package/security/02-playbooks/payment-integration-playbook.md +119 -0
  388. package/security/02-playbooks/penetration-testing-playbook.md +517 -0
  389. package/security/03-checklists/security-audit-checklist.md +356 -0
  390. package/security/04-antipatterns/security-coding-antipatterns.md +580 -0
  391. package/security/05-cases/case-log4shell-incident.md +537 -0
  392. package/security/05-cases/case-major-breaches.md +468 -0
  393. package/security/06-glossary/security-glossary.md +212 -0
  394. package/security/compliance-automation.md +993 -0
  395. package/security/container-security.md +680 -0
  396. package/security/devsecops-complete.md +426 -0
  397. package/security/sast-dast-sca.md +775 -0
  398. package/security/secrets-management.md +594 -0
  399. package/security/security-architecture-deep-dive.md +37 -0
  400. package/security/threat-modeling-stride-playbook.md +40 -0
  401. package/seed-templates/auth-system.md +59 -0
  402. package/seed-templates/blog-content.md +94 -0
  403. package/seed-templates/dashboard.md +89 -0
  404. package/seed-templates/docs-site.md +73 -0
  405. package/seed-templates/e-commerce.md +50 -0
  406. package/seed-templates/saas-landing.md +92 -0
  407. package/seed-templates/settings-page.md +51 -0
  408. package/testing/01-standards/test-strategy-and-layering.md +83 -0
  409. package/testing/01-standards/testing-strategy-complete.md +422 -0
  410. package/testing/01-standards/unit-testing-best-practices.md +118 -0
  411. package/testing/02-playbooks/e2e-testing-playbook.md +988 -0
  412. package/testing/02-playbooks/testing-strategy-playbook.md +126 -0
  413. package/testing/03-checklists/test-strategy-checklist.md +208 -0
  414. package/testing/04-antipatterns/testing-antipatterns.md +718 -0
  415. package/testing/05-cases/case-testing-transformation.md +300 -0
  416. package/testing/06-glossary/testing-glossary.md +110 -0
  417. package/testing/risk-based-test-matrix.md +36 -0
  418. package/testing/testing-strategy-deep-dive.md +37 -0
@@ -0,0 +1,563 @@
+ ---
+ id: case-incident-mttr-reduction
+ title: "Case Study: Reducing Mean Time to Recovery (MTTR) from 45 Minutes to 5 Minutes"
+ domain: development
+ category: 05-cases
+ difficulty: intermediate
+ tags: [agent, case, checklist, development, incident, mttr, reduction, runbook]
+ quality_score: 70
+ last_updated: 2026-06-15
+ ---
+ # Case Study: Reducing Mean Time to Recovery (MTTR) from 45 Minutes to 5 Minutes
+
+ ## Metadata
+
+ | Field | Value |
+ |------|------|
+ | Industry | On-demand delivery platform |
+ | System scale | 2 million orders per day across 80 cities; peak QPS 25,000 |
+ | Tech stack | Go + Java + MySQL + Redis + Kafka + Kubernetes |
+ | Number of services | 35 microservices |
+ | Team size | 40 backend engineers, 6 SREs |
+ | Improvement period | 12 weeks (2024-01 to 2024-03) |
+ | Primary goal | Reduce MTTR from 45 minutes to under 5 minutes |
+
+ ---
+
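+ ## 1. Background
+
+ ### 1.1 Business Characteristics
+
+ An on-demand delivery platform is extremely sensitive to recovery time:
+
+ - Riders are already on the road, so orders cannot be interrupted
+ - Merchants are accepting orders; any delay means late food preparation, which means bad reviews
+ - Customers are waiting for their meals and cancel if an order goes more than 15 minutes without being delivered
+ - During peak hours (11:00-13:00 and 17:00-20:00), each minute of downtime costs **CNY 50,000**
+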
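+ ### 1.2 Incident Recovery Baseline
+
+ P0/P1 incident statistics for the previous 6 months:
+
+ | Metric | Value |
+ |------|------|
+ | P0 incidents | 4 |
+ | P1 incidents | 12 |
+ | Average MTTD (time to detect) | 8 minutes |
+ | Average MTTI (time to identify) | 22 minutes |
+ | Average MTTR (time to recover) | 45 minutes |
+ | Longest recovery | 2 hours 15 minutes |
+ | Total incident loss | Estimated CNY 8.5 million over the half year |
+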
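+ ### 1.3 Recovery Bottleneck Analysis
+
+ Time breakdown of the recovery process across the 16 P0/P1 incidents:
+
+ ```
+ Typical recovery timeline (45 minutes):
+ ┌───────────────────────────────────────────────────────────────────────────────────────┐
+ │ 0min            8min            15min           30min           40min           45min │
+ │ ├───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤     │
+ │ │ Detect        │ Notify        │ Locate        │ Mitigate      │ Recover       │     │
+ │ │ 8min          │ 7min          │ 15min         │ 10min         │ 5min          │     │
+ │ ├───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤     │
+ │ │ Alert delay   │ Find people,  │ Dig through   │ Write SQL /   │ Verify +      │     │
+ │ │ and confirm   │ form team,    │ logs for      │ edit config / │ announce      │     │
+ │ │               │ sync context  │ root cause    │ restart       │               │     │
+ └───────────────────────────────────────────────────────────────────────────────────────┘
+
+ Bottlenecks by phase:
+ 1. Detect (8min): alert rules are coarse; incidents are confirmed only after users complain
+ 2. Notify (7min): the on-call engineer has to phone the relevant people, and information is relayed inefficiently
+ 3. Locate (15min): logs are scattered across 35 services with no unified tracing
+ 4. Mitigate (10min): no predefined mitigation actions; every decision is made on the spot
+ 5. Recover (5min): verification methods are inadequate
+ ```
+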
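To make the breakdown above easier to check, the following sketch recomputes the total and each phase's share of the 45-minute timeline. The phase names and durations are taken from the timeline; the `Phase` type and the `Breakdown` function are illustrative names introduced here, not part of the platform's tooling.

```go
package main

import "fmt"

// Phase is one stage of an incident timeline (hypothetical type for illustration).
type Phase struct {
	Name    string
	Minutes int
}

// Breakdown returns the total duration and each phase's percentage share of it.
func Breakdown(phases []Phase) (total int, share map[string]float64) {
	for _, p := range phases {
		total += p.Minutes
	}
	share = make(map[string]float64, len(phases))
	if total == 0 {
		return total, share
	}
	for _, p := range phases {
		share[p.Name] = float64(p.Minutes) / float64(total) * 100
	}
	return total, share
}

// baseline reproduces the phase durations from the 45-minute timeline.
var baseline = []Phase{
	{"detect", 8}, {"notify", 7}, {"locate", 15}, {"mitigate", 10}, {"recover", 5},
}

func main() {
	total, share := Breakdown(baseline)
	fmt.Printf("total: %d min\n", total)
	for _, p := range baseline {
		fmt.Printf("%-8s %2d min  %.1f%%\n", p.Name, p.Minutes, share[p.Name])
	}
}
```

Locating and mitigating together take 25 of the 45 minutes, so they are the largest contributors to the baseline MTTR.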
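+ ---
+
+ ## 2. Challenges
+
+ ### 2.1 Insufficient Observability
+
+ | Problem | Symptom |
+ |------|------|
+ | Inconsistent logs | Go services use zap, Java services use logback, and the formats differ |
+ | No distributed tracing | Requests flow through 35 services and cannot be traced end to end |
+ | Scattered metrics | Prometheus metric names follow no convention; each service defines its own |
+ | No correlation analysis | Alerts, logs, and metrics are not linked to one another |
+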
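+ ### 2.2 Missing Response Process
+
+ | Problem | Symptom |
+ |------|------|
+ | Incomplete on-call rotation | The on-call schedule is unclear, and the right owner often cannot be found |
+ | No escalation path | Nobody knows when to escalate or to whom |
+ | Inefficient communication | People are @-mentioned in WeChat groups, and key information gets buried |
+ | No mitigation plan | Every incident is handled by "figuring it out on the spot" |
+ | Post-mortems are a formality | Review meetings are held, but nobody tracks the action items |
+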
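+ ### 2.3 Organizational Challenges
+
+ 1. The 35 microservices belong to 7 teams, so locating a fault requires cross-team collaboration
+ 2. Some teams regard observability as "the SRE team's job" and are unwilling to invest time in it
+ 3. Every service is required to provide a runbook, but only 8 services have written one
+
+ ---
+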
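+ ## 3. Solution Design
+
+ ### 3.1 Target Timeline
+
+ ```
+ Target recovery timeline (5 minutes):
+ ┌──────────────────────────────────────────────────────────────────────┐
+ │ 0min            1min            2min            4min            5min │
+ │ ├───────────────┼───────────────┼───────────────┼───────────────┤    │
+ │ │ Detect        │ Locate        │ Mitigate      │ Recover       │    │
+ │ │ 1min          │ 1min          │ 2min          │ 1min          │    │
+ │ ├───────────────┼───────────────┼───────────────┼───────────────┤    │
+ │ │ Automatic     │ Automatic     │ Predefined    │ Automatic     │    │
+ │ │ alerting      │ correlation   │ plan execution│ verification  │    │
+ └──────────────────────────────────────────────────────────────────────┘
+ ```
+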
125
+ ### 3.2 四大支柱建设
126
+
127
+ ```
128
+ 支柱 1: 可观测性(Observability)
129
+ → 统一日志 + 链路追踪 + 指标标准化
130
+
131
+ 支柱 2: 告警体系(Alerting)
132
+ → 智能告警 + 自动升级 + 告警聚合
133
+
134
+ 支柱 3: 响应流程(Incident Response)
135
+ → 指挥官制度 + Runbook + 自动化止血
136
+
137
+ 支柱 4: 复盘改进(Post-mortem)
138
+ → 标准化模板 + 改进项追踪 + 指标度量
139
+ ```
140
+
+ ---
+
+ ## 4. Implementation Steps
+
+ ### 4.1 Pillar 1: Building Observability (Week 1-4)
+
+ #### Unified Logging
+
+ ```
+ Log standardization rules:
+ 1. All services emit logs in a single JSON format
+ 2. Required fields: timestamp, level, service, trace_id, span_id, message
+ 3. Business logs must also include: user_id, order_id, action, result
+ 4. Error logs must also include: error_code, error_message, stack_trace
+ ```
+
+ ```go
+ // Logging middleware for Go services.
+ // serviceName and WithLogger are assumed to be defined elsewhere in the service.
+ import (
+ 	"net/http"
+
+ 	"go.opentelemetry.io/otel/trace"
+ 	"go.uber.org/zap"
+ )
+
+ func LoggingMiddleware(next http.Handler) http.Handler {
+ 	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+ 		// Attach the current trace and span IDs to every log line of this request.
+ 		span := trace.SpanFromContext(r.Context())
+ 		logger := zap.L().With(
+ 			zap.String("trace_id", span.SpanContext().TraceID().String()),
+ 			zap.String("span_id", span.SpanContext().SpanID().String()),
+ 			zap.String("service", serviceName),
+ 			zap.String("method", r.Method),
+ 			zap.String("path", r.URL.Path),
+ 		)
+ 		ctx := WithLogger(r.Context(), logger)
+ 		next.ServeHTTP(w, r.WithContext(ctx))
+ 	})
+ }
+ ```
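The required-field rules lend themselves to an automated check, for example in CI or in a log pipeline. The following is a minimal sketch using only the standard library; `MissingFields` is a hypothetical helper, and treating `"error"` (lowercase) as the error level is an assumption, since the rules do not spell out the level values. Rule 3 (business logs) is left out because the rules do not say how a business log is marked.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// requiredFields lists the fields every log line must carry (rule 2).
var requiredFields = []string{"timestamp", "level", "service", "trace_id", "span_id", "message"}

// errorFields lists the extra fields required on error logs (rule 4).
var errorFields = []string{"error_code", "error_message", "stack_trace"}

// MissingFields decodes one JSON log line and returns the required fields it lacks.
// It returns an error if the line cannot be decoded into a JSON object.
func MissingFields(line []byte) ([]string, error) {
	var entry map[string]any
	if err := json.Unmarshal(line, &entry); err != nil {
		return nil, fmt.Errorf("log line is not a JSON object: %w", err)
	}

	want := requiredFields
	// Assumption: error logs are marked with level == "error".
	if level, _ := entry["level"].(string); level == "error" {
		want = append(append([]string{}, requiredFields...), errorFields...)
	}

	var missing []string
	for _, field := range want {
		if _, ok := entry[field]; !ok {
			missing = append(missing, field)
		}
	}
	return missing, nil
}
```

A line that returns an empty slice satisfies rules 2 and 4; anything else names exactly which fields a service still has to add.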
+
+ #### Distributed Tracing
+
+ ```
+ Architecture: OpenTelemetry SDK → OTel Collector → Jaeger/Tempo
+
+ Integration:
+ - Go services: instrumented with otelgrpc + otelhttp (automatic context injection)
+ - Java services: OpenTelemetry Java Agent (attached with -javaagent, no code changes)
+ - Kafka messages: propagate trace_id in message headers
+
+ Trace coverage for key scenarios:
+ 1. User places an order: APP → Gateway → Order → Payment → Dispatch → Notify (5 hops)
+ 2. Rider accepts an order: APP → Gateway → Dispatch → Assign → Push → Rider (5 hops)
+ 3. Merchant finishes preparing food: POS → Gateway → Order → Kitchen → Notify (4 hops)
+ ```
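To make the Kafka header propagation concrete, here is a hand-rolled sketch of writing and reading a W3C `traceparent` header (`version-traceid-parentid-flags`). In practice the OpenTelemetry propagators would normally do this; the `Header` type is a hypothetical stand-in for a Kafka client's record header, and hard-coding the flags to `01` (sampled) is an assumption made for brevity.

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// Header is a hypothetical stand-in for a Kafka client's record header type.
type Header struct {
	Key   string
	Value []byte
}

// Inject builds a traceparent header from a 32-hex-char trace ID and a
// 16-hex-char span ID. The trailing "01" marks the trace as sampled (assumption).
func Inject(traceID, spanID string) Header {
	return Header{Key: "traceparent", Value: []byte("00-" + traceID + "-" + spanID + "-01")}
}

// Extract finds the traceparent header and returns its trace ID and parent span ID.
func Extract(headers []Header) (traceID, spanID string, err error) {
	for _, h := range headers {
		if h.Key != "traceparent" {
			continue
		}
		parts := strings.Split(string(h.Value), "-")
		if len(parts) != 4 || !isHex(parts[1], 16) || !isHex(parts[2], 8) {
			return "", "", errors.New("malformed traceparent header")
		}
		return parts[1], parts[2], nil
	}
	return "", "", errors.New("traceparent header not found")
}

// isHex reports whether s is a hex string that decodes to exactly n bytes.
func isHex(s string, n int) bool {
	b, err := hex.DecodeString(s)
	return err == nil && len(b) == n
}
```

The producer calls `Inject` before publishing; the consumer calls `Extract` and starts its span as a child of the returned IDs, so the trace continues across the asynchronous hop.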
+
+ #### Metric Standardization
+
+ ```
+ Naming convention: {namespace}_{subsystem}_{name}_{unit}
+
+ Core metrics (every service must expose):
+ - http_request_duration_seconds      # request latency
+ - http_request_total                 # request count (labeled by code)
+ - grpc_server_handled_total          # gRPC request count
+ - db_query_duration_seconds          # database query latency
+ - redis_operation_duration_seconds   # Redis operation latency
+ - kafka_consumer_lag                 # Kafka consumer lag
+
+ Business metrics:
+ - order_created_total                # orders created
+ - order_dispatched_total             # orders dispatched
+ - order_delivered_total              # deliveries completed
+ - rider_online_count                 # riders online
+ - delivery_duration_seconds          # delivery duration
+ ```
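A naming convention is easier to keep when it is linted. The sketch below checks two things that the listed names suggest: lowercase snake_case, and a type-specific suffix (`_total` for counters, `_seconds` for durations). The `Kind` type and `LintMetricName` are illustrative, and the suffix rules are inferred from the examples rather than stated by the convention itself.

```go
package main

import (
	"regexp"
	"strings"
)

// snakeCase matches lowercase segments joined by single underscores,
// with at least two segments (for example "order_created_total").
var snakeCase = regexp.MustCompile(`^[a-z][a-z0-9]*(_[a-z0-9]+)+$`)

// Kind is a simplified metric type used only by this lint.
type Kind int

const (
	Counter Kind = iota
	Duration
	Gauge
)

// LintMetricName returns a description of the first problem found,
// or an empty string if the name passes.
func LintMetricName(name string, kind Kind) string {
	if !snakeCase.MatchString(name) {
		return "name must be lowercase snake_case with at least two segments"
	}
	switch kind {
	case Counter:
		if !strings.HasSuffix(name, "_total") {
			return "counter names should end in _total"
		}
	case Duration:
		if !strings.HasSuffix(name, "_seconds") {
			return "duration names should end in _seconds"
		}
	}
	// Gauges such as kafka_consumer_lag carry no suffix requirement here.
	return ""
}
```

Run against each service's registered metrics, a lint like this turns the convention into a build-time gate instead of a review comment.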
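+
+ #### Unified Dashboards
+
+ ```
+ Grafana dashboard tiers:
+ L1: Global overview (CTO/VP view)
+   - Site-wide QPS / error rate / P99 latency
+   - Order volume / delivery volume / cancellation rate
+   - Traffic lights: health status of each core service
+
+ L2: Service dashboards (one per microservice)
+   - The service's RED metrics (Rate/Errors/Duration)
+   - Health of the services it depends on
+   - Resource usage (CPU/Memory/Connections)
+
+ L3: Specialized dashboards
+   - Database performance (slow queries / connections / lock waits)
+   - Redis performance (hit rate / memory / connections)
+   - Kafka lag (consumer lag per topic)
+ ```
+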
+ ### 4.2 Pillar 2: Alerting (Week 5-7)
+
+ #### Alert Severity Levels
+
+ ```yaml
+ # Alert level definitions
+ alert_levels:
+   P0:
+     definition: "Core business unavailable or data corrupted"
+     examples:
+       - "Site-wide error rate > 10%"
+       - "Order system unavailable"
+       - "Payment success rate < 90%"
+     sla: "Respond within 5 minutes, mitigate within 15 minutes"
+     notify:
+       - "On-call SRE (phone)"
+       - "Service owner (phone)"
+       - "CTO (DingTalk)"
+       - "Incident chat group created automatically"
+
+   P1:
+     definition: "Core business severely degraded"
+     examples:
+       - "Order system P99 > 3s"
+       - "Rider order-acceptance success rate < 95%"
+       - "Delivery anomalies in a single city"
+     sla: "Respond within 10 minutes, mitigate within 30 minutes"
+     notify:
+       - "On-call SRE (DingTalk + phone)"
+       - "Service owner (DingTalk)"
+
+   P2:
+     definition: "Non-core feature malfunction"
+     examples:
+       - "Push service latency > 5s"
+       - "Report generation failed"
+     sla: "Respond within 30 minutes, fix within 4 hours"
+     notify:
+       - "On-call SRE (DingTalk)"
+ ```
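The SLA strings in the level definitions become checkable once they are expressed as durations. A minimal sketch, assuming response and mitigation times are measured from the moment the alert fires; `SLA`, `slaBySeverity`, and `CheckSLA` are illustrative names, and P2's "fix within 4 hours" is stored in the same `Mitigate` field for simplicity.

```go
package main

import (
	"fmt"
	"time"
)

// SLA holds the response and mitigation deadlines for one severity level.
type SLA struct {
	Respond  time.Duration
	Mitigate time.Duration
}

// slaBySeverity mirrors the sla fields of the P0/P1/P2 definitions.
var slaBySeverity = map[string]SLA{
	"P0": {Respond: 5 * time.Minute, Mitigate: 15 * time.Minute},
	"P1": {Respond: 10 * time.Minute, Mitigate: 30 * time.Minute},
	"P2": {Respond: 30 * time.Minute, Mitigate: 4 * time.Hour},
}

// CheckSLA reports which deadlines were missed. responded and mitigated are
// the elapsed times from alert firing to first response and to mitigation.
func CheckSLA(severity string, responded, mitigated time.Duration) (respondLate, mitigateLate bool, err error) {
	sla, ok := slaBySeverity[severity]
	if !ok {
		return false, false, fmt.Errorf("unknown severity %q", severity)
	}
	return responded > sla.Respond, mitigated > sla.Mitigate, nil
}
```

Feeding every closed incident through `CheckSLA` gives an SLA compliance rate per severity, which fits naturally into a monthly trend report.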
+
+ #### Example Alert Rules
+
+ ```yaml
+ # Prometheus alerting rules (notifications are routed by Alertmanager)
+ groups:
+   - name: order-service
+     rules:
+       - alert: OrderServiceHighErrorRate
+         expr: |
+           sum(rate(http_request_total{service="order-service",code=~"5.."}[2m]))
+           /
+           sum(rate(http_request_total{service="order-service"}[2m]))
+           > 0.01
+         for: 1m
+         labels:
+           severity: P0
+         annotations:
+           summary: "Order service error rate {{ $value | humanizePercentage }}"
+           runbook: "https://wiki.internal/runbook/order-service-high-error"
+
+       - alert: OrderServiceHighLatency
+         expr: |
+           histogram_quantile(0.99,
+             sum by (le) (rate(http_request_duration_seconds_bucket{service="order-service"}[2m]))
+           ) > 1
+         for: 2m
+         labels:
+           severity: P1
+         annotations:
+           summary: "Order service P99 latency {{ $value }}s"
+           runbook: "https://wiki.internal/runbook/order-service-high-latency"
+ ```
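The `for:` clause in these rules delays firing until the condition has held without interruption. The sketch below imitates that pending-to-firing behaviour over a series of evaluations; `Sample` and `FiringAt` are illustrative names and not Prometheus APIs, and times are plain seconds to keep the example short.

```go
package main

// Sample is one evaluation of the error-ratio expression.
type Sample struct {
	AtSec int64   // evaluation time, in seconds
	Ratio float64 // 5xx requests divided by all requests
}

// FiringAt returns the first evaluation time at which the alert would fire:
// the ratio must stay above threshold for at least holdSec without a gap.
// The second result is false if the alert never fires.
func FiringAt(samples []Sample, threshold float64, holdSec int64) (int64, bool) {
	pending := false
	var pendingSince int64
	for _, s := range samples {
		if s.Ratio <= threshold {
			pending = false // condition cleared, so the hold timer resets
			continue
		}
		if !pending {
			pending, pendingSince = true, s.AtSec
		}
		if s.AtSec-pendingSince >= holdSec {
			return s.AtSec, true
		}
	}
	return 0, false
}
```

A short spike that clears before the hold time never notifies anyone, which is one way the `for:` value trades detection delay against alert noise.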
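+
+ #### Smart Alert Aggregation
+
+ ```
+ Problem: a single incident can trigger 50+ alerts (Redis failure → cache misses → DB overload → API timeouts → ...)
+
+ Solution:
+ 1. Alert correlation: alerts sharing the same trace_id are grouped automatically
+ 2. Root-cause inference: sort by dependency topology; alerts closer to the start of the failure chain get higher priority
+ 3. Alert suppression: identical alerts from the same service within 5 minutes notify only once
+ 4. Automatic linking: alert → the service's Runbook → recent change records
+ ```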
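Rules 2 and 3 can be sketched in a few lines. This is an illustration under simplifying assumptions, not the platform's implementation: `Alert`, `Dedup`, and `RootCauses` are hypothetical names, the dependency map only lists direct dependencies, and each alerting service is assumed to appear once in the input.

```go
package main

// Alert is a minimal alert record; the field names are illustrative.
type Alert struct {
	Service string
	Name    string
	AtSec   int64 // firing time, in seconds
}

// Dedup drops repeats of the same (service, name) pair that arrive within
// windowSec of the last alert that was kept. Input must be sorted by AtSec.
func Dedup(alerts []Alert, windowSec int64) []Alert {
	lastKept := map[[2]string]int64{}
	var out []Alert
	for _, a := range alerts {
		key := [2]string{a.Service, a.Name}
		if t, seen := lastKept[key]; seen && a.AtSec-t < windowSec {
			continue
		}
		lastKept[key] = a.AtSec
		out = append(out, a)
	}
	return out
}

// RootCauses returns the alerting services whose own direct dependencies are
// all healthy. dependsOn maps a service to the services it calls.
func RootCauses(alerting []string, dependsOn map[string][]string) []string {
	unhealthy := map[string]bool{}
	for _, s := range alerting {
		unhealthy[s] = true
	}
	var roots []string
	for _, s := range alerting {
		explained := false
		for _, dep := range dependsOn[s] {
			if unhealthy[dep] {
				explained = true // a failing dependency explains this alert
				break
			}
		}
		if !explained {
			roots = append(roots, s)
		}
	}
	return roots
}
```

For the Redis example in the problem statement, the API and DB alerts are both explained by a failing dependency, so only Redis is reported as the root cause.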
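+
+ ### 4.3 Pillar 3: Incident Response (Week 8-10)
+
+ #### On-Call System
+
+ ```
+ On-call structure:
+ ├── L1 on-call (SRE, 24x7)
+ │     Responsible for: alert response + initial triage + opening the incident group + mitigation
+
+ ├── L2 on-call (service owners, on-call during working hours)
+ │     Responsible for: fault location + fix + coordination
+
+ └── L3 incident commander (Tech Lead rotation)
+       Responsible for: P0 incident command + decisions + external communication
+
+ On-call tooling:
+ - PagerDuty: automatic alert routing + phone escalation
+ - Incident bot: automatically creates a DingTalk group + adds the relevant people + syncs the timeline
+ ```
+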
+ #### Runbook Standardization
+
+ ```markdown
+ ## Runbook Template
+
+ ### Service Name
+ [service-name]
+
+ ### Common Failure Scenarios
+
+ #### Scenario 1: [failure description]
+ **Symptoms**: [user/system behavior]
+ **Possible causes**:
+ 1. [Cause A]
+ 2. [Cause B]
+
+ **Diagnostic steps**:
+ 1. Check [metric/log]: `[query command]`
+ 2. Check [dependency]: `[query command]`
+
+ **Mitigation options**:
+ - [ ] Option A: [steps] (estimated recovery time: X minutes)
+ - [ ] Option B: [steps] (estimated recovery time: X minutes)
+
+ **Recovery verification**:
+ - [ ] [Metric] back within normal range
+ - [ ] [Feature] verified working
+ ```
+
+ **Runbook coverage requirements**:
+ - All 35 microservices have a Runbook written within 4 weeks (assigned as Sprint tasks)
+ - Review standard: each Runbook covers at least 3 common failure scenarios
+ - Usability requirement: a new SRE can find the mitigation steps within 2 minutes
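The "at least 3 scenarios" review rule can be pre-checked automatically before a human review. A rough sketch, assuming each failure scenario is written as a level-4 Markdown heading as in the template; `CountScenarios` and `PassesReview` are hypothetical helpers, and headings inside code fences are not treated specially.

```go
package main

import (
	"bufio"
	"strings"
)

// minScenarios is the review threshold from the coverage requirements.
const minScenarios = 3

// CountScenarios counts level-4 Markdown headings ("#### ...") in a Runbook.
// Assumption: each failure scenario is exactly one such heading.
func CountScenarios(runbook string) int {
	count := 0
	scanner := bufio.NewScanner(strings.NewReader(runbook))
	for scanner.Scan() {
		if strings.HasPrefix(strings.TrimSpace(scanner.Text()), "#### ") {
			count++
		}
	}
	return count
}

// PassesReview reports whether the Runbook meets the minimum scenario count.
func PassesReview(runbook string) bool {
	return CountScenarios(runbook) >= minScenarios
}
```

A check like this only measures quantity; whether the mitigation steps are findable within 2 minutes still needs a human dry run.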
+
+ #### Automated Mitigation Tooling
+
+ ```go
+ // One-command mitigation CLI:
+ // automated wrappers around common mitigation actions.
+ // The Execute functions referenced below (circuitBreak, enableRateLimit, ...)
+ // are assumed to be implemented elsewhere in the tool.
+
+ import "context"
+
+ type HealAction struct {
+ 	Name        string
+ 	Description string
+ 	Execute     func(ctx context.Context, params map[string]string) error
+ }
+
+ var healActions = map[string]HealAction{
+ 	"circuit-break": {
+ 		Name:        "Circuit-break a downstream service",
+ 		Description: "Open the circuit for calls to the given downstream service and return a degraded response",
+ 		Execute:     circuitBreak,
+ 	},
+ 	"rate-limit": {
+ 		Name:        "Enable rate limiting",
+ 		Description: "Enable rate limiting on the given endpoint",
+ 		Execute:     enableRateLimit,
+ 	},
+ 	"rollback": {
+ 		Name:        "Version rollback",
+ 		Description: "Roll the given service back to the previous stable version",
+ 		Execute:     rollbackService,
+ 	},
+ 	"scale-up": {
+ 		Name:        "Emergency scale-up",
+ 		Description: "Double the replica count of the given service",
+ 		Execute:     scaleUp,
+ 	},
+ 	"failover-db": {
+ 		Name:        "Database primary/replica failover",
+ 		Description: "Switch database traffic to the replica",
+ 		Execute:     failoverDB,
+ 	},
+ }
+
+ // Usage
+ // heal-tool circuit-break --service=payment --downstream=risk-engine --duration=30m
+ // heal-tool rollback --service=order-service --version=v2.3.1
+ // heal-tool scale-up --service=dispatch-service --factor=2
+ ```
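How the command line reaches an action is not shown in the registry itself. One possible dispatch layer is sketched below: `ParseArgs` and `Run` are hypothetical helpers, the `HealAction` struct is repeated in reduced form so the sketch compiles on its own, and only the `--key=value` flag style from the usage lines is supported.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// HealAction repeats the shape of the registry entries in reduced form.
type HealAction struct {
	Name    string
	Execute func(ctx context.Context, params map[string]string) error
}

// ParseArgs splits arguments such as
//   rollback --service=order-service --version=v2.3.1
// into an action name and its --key=value parameters.
func ParseArgs(args []string) (action string, params map[string]string, err error) {
	if len(args) == 0 {
		return "", nil, fmt.Errorf("missing action name")
	}
	params = map[string]string{}
	for _, arg := range args[1:] {
		if !strings.HasPrefix(arg, "--") {
			return "", nil, fmt.Errorf("expected --key=value, got %q", arg)
		}
		parts := strings.SplitN(strings.TrimPrefix(arg, "--"), "=", 2)
		if len(parts) != 2 || parts[0] == "" {
			return "", nil, fmt.Errorf("expected --key=value, got %q", arg)
		}
		params[parts[0]] = parts[1]
	}
	return args[0], params, nil
}

// Run looks up the action in the registry and executes it with the parsed parameters.
func Run(ctx context.Context, registry map[string]HealAction, args []string) error {
	name, params, err := ParseArgs(args)
	if err != nil {
		return err
	}
	action, ok := registry[name]
	if !ok {
		return fmt.Errorf("unknown action %q", name)
	}
	return action.Execute(ctx, params)
}
```

A `main` function would call `Run(ctx, healActions, os.Args[1:])` with a context that carries a timeout, so a stuck mitigation cannot hang the operator's terminal.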
+
+ ### 4.4 Pillar 4: Post-mortem and Improvement (Week 11-12)
+
+ #### Post-mortem Template
+
+ ```markdown
+ ## Incident Post-mortem Report
+
+ ### Basic Information
+ - Severity: P[0/1/2]
+ - Start time: YYYY-MM-DD HH:MM
+ - Recovery time: YYYY-MM-DD HH:MM
+ - Impact duration: XX minutes
+ - Impact scope: [description]
+ - Business impact: [quantified loss]
+
+ ### Timeline
+ | Time | Event | Operator |
+ |------|------|--------|
+ | HH:MM | Alert fired | System |
+ | HH:MM | ... | ... |
+
+ ### Root Cause Analysis
+ - Direct cause: [...]
+ - Underlying cause: [...]
+ - 5 Whys analysis: [...]
+
+ ### Action Items
+ | No. | Action | Type | Owner | Deadline | Status |
+ |------|------|------|-------|----------|------|
+ | 1 | ... | Prevention/Detection/Response | @xxx | YYYY-MM-DD | TODO |
+
+ ### MTTR Breakdown
+ | Stage | Duration | Bottleneck | Improvement |
+ |------|------|------|----------|
+ | Detect | Xmin | ... | ... |
+ | Locate | Xmin | ... | ... |
+ | Mitigate | Xmin | ... | ... |
+ | Recover | Xmin | ... | ... |
+ ```
+
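+ #### Action-Item Tracking
+
+ ```
+ Tracking mechanism:
+ 1. All action items are logged in Jira (Label: incident-action-item)
+ 2. Progress is reviewed at the weekly SRE meeting
+ 3. Overdue action items are escalated automatically to the Tech Lead
+ 4. Monthly MTTR trend report
+ ```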
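Step 3 of the tracking mechanism, escalating overdue items, comes down to a date comparison. A minimal sketch follows; `ActionItem` and `Overdue` are hypothetical, the data would in practice come from the Jira label mentioned in step 1, and the status value `"DONE"` is an assumption (the template only shows `TODO`).

```go
package main

import (
	"fmt"
	"time"
)

// dateLayout matches the YYYY-MM-DD deadlines used in the post-mortem template.
const dateLayout = "2006-01-02"

// ActionItem mirrors one row of the post-mortem action-item table.
type ActionItem struct {
	Summary  string
	Owner    string
	Deadline string // YYYY-MM-DD
	Status   string // "TODO", "DONE", ... (values other than TODO are assumed)
}

// Overdue returns the items that are not done and whose deadline is before today.
// An item due today is not yet overdue.
func Overdue(items []ActionItem, today string) ([]ActionItem, error) {
	now, err := time.Parse(dateLayout, today)
	if err != nil {
		return nil, fmt.Errorf("bad date %q: %w", today, err)
	}
	var late []ActionItem
	for _, item := range items {
		due, err := time.Parse(dateLayout, item.Deadline)
		if err != nil {
			return nil, fmt.Errorf("bad deadline for %q: %w", item.Summary, err)
		}
		if item.Status != "DONE" && due.Before(now) {
			late = append(late, item)
		}
	}
	return late, nil
}
```

Running this daily and notifying the Tech Lead about the returned items would implement the automatic escalation without anyone having to remember the weekly review.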
+
+ ---
+
+ ## 5. Results
+
+ ### 5.1 MTTR Breakdown: Before vs. After
+
+ | Stage | Before | After | Change |
+ |------|--------|--------|------|
+ | MTTD (detect) | 8 min | 1 min (automatic alerting) | -87% |
+ | MTTI (locate) | 22 min | 2 min (distributed tracing + Runbook) | -91% |
+ | Mitigation | 10 min | 1.5 min (automated mitigation tooling) | -85% |
+ | Recovery verification | 5 min | 0.5 min (automated verification) | -90% |
+ | **Total MTTR** | **45 min** | **5 min** | **-89%** |
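The percentages in the table follow from (before - after) / before. The sketch below recomputes them from the table's own numbers; `StageDelta`, `ReductionPct`, and `Totals` are illustrative names.

```go
package main

// StageDelta pairs the before and after duration of one MTTR component, in minutes.
type StageDelta struct {
	Name          string
	Before, After float64
}

// rows holds the four stage rows of the comparison table.
var rows = []StageDelta{
	{"MTTD", 8, 1},
	{"MTTI", 22, 2},
	{"mitigation", 10, 1.5},
	{"verification", 5, 0.5},
}

// ReductionPct returns the relative reduction from before to after, in percent.
func ReductionPct(before, after float64) float64 {
	return (before - after) / before * 100
}

// Totals sums the before and after columns, giving the total MTTR on each side.
func Totals(stages []StageDelta) (before, after float64) {
	for _, s := range stages {
		before += s.Before
		after += s.After
	}
	return before, after
}
```

`Totals(rows)` gives 45 and 5 minutes, and `ReductionPct(45, 5)` is about 88.9, which the table rounds to -89%.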
+
+ ### 5.2 Incident Metrics
+
+ | Metric | Before (H1 2024) | After (H2 2024) |
+ |------|-------------------|-------------------|
+ | P0 incidents | 4 | 1 |
+ | P1 incidents | 12 | 5 |
+ | Mean MTTR | 45 min | 4.8 min |
+ | Longest recovery time | 135 min | 12 min |
+ | Total incident loss | CNY 8.5 million | CNY 1.2 million |
+ | Runbook coverage | 23% (8/35 services) | 100% (35/35 services) |
+ | Tracing coverage | 0% | 100% |
+
+ ### 5.3 Observability Metrics
+
+ | Metric | Before | After |
+ |------|--------|--------|
+ | Log standardization rate | 30% | 100% |
+ | Metric standardization rate | 40% | 100% |
+ | Tracing coverage | 0% | 100% |
+ | Alert noise (share of non-actionable alerts) | 60% | 12% |
+ | Alert → Runbook link rate | 0% | 95% |
+
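+ ---
+
+ ## 6. Lessons Learned
+
+ ### 6.1 What We Got Right
+
+ 1. **Observability is a prerequisite**: without unified logs and distributed tracing, fault location is guesswork. Tracing cut location time from 22 minutes to 2 minutes
+ 2. **Runbooks are the highest-ROI investment**: writing a Runbook takes 2 hours but saves 15 minutes per incident. Runbooks for 35 services cost 70 person-hours and saved 200+ person-hours within half a year
+ 3. **Automated mitigation**: predefined mitigation commands removed the decision time spent "figuring it out on the spot"
+ 4. **Alert aggregation matters**: collapsing 50+ alerts into a single root-cause alert stopped SREs from drowning in noise
+ 5. **Closed-loop tracking of post-mortem action items**: the weekly review makes sure action items ship instead of staying in the post-mortem document
+
+ ### 6.2 What We Got Wrong
+
+ 1. **Indecision over observability tooling**: we spent 2 weeks wavering between Jaeger and Tempo, when going live first mattered more than the choice
+ 2. **Uneven Runbook quality**: Runbooks written in the first 2 weeks were too terse ("restart the service"); quality improved only after review standards were set
+ 3. **Underestimated the log standardization effort**: unifying log formats across 35 services was estimated at 2 weeks and took 4 (a lot of legacy code had to change)
+ 4. **Unreasonable initial alert thresholds**: alert noise was heavy in the first 2 weeks and the team developed a "cry wolf" reaction to alerts; tuning the thresholds took another 3 weeks
+
+ ### 6.3 Key Takeaways
+
+ - MTTR = MTTD + MTTI + mitigation time + recovery verification; every stage needs optimizing
+ - Observability is not just SRE's job; it is every developer's job. Service owners know their services best
+ - More alerts are not better: high noise = no alerting. Alert precision matters more than coverage
+ - Runbooks must be executable and verifiable, not "reference documents"
+ - The value of a post-mortem lies not in finding the root cause but in the completion rate of its action items
+ - Incident recovery capability needs regular drills (Chaos Engineering), otherwise it rusts
+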
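+ ---
+
+ ## Agent Checklist
+
+ When an AI Agent assists with improving incident recovery capability, confirm each item:
+
+ ### Observability
+ - [ ] **Log standardization**: do all services use a unified log format with the required fields?
+ - [ ] **Distributed tracing**: is distributed tracing deployed (OpenTelemetry/Jaeger/Zipkin)?
+ - [ ] **Metric standardization**: do each service's Prometheus metric names follow a unified convention?
+ - [ ] **Unified dashboards**: are there tiered dashboards (global → service → specialized)?
+ - [ ] **Correlation**: can you jump quickly between alerts → logs → traces?
+
+ ### Alerting
+ - [ ] **Severity levels**: are P0/P1/P2 criteria and SLAs defined?
+ - [ ] **Alert routing**: are alerts routed automatically to the right on-call engineer?
+ - [ ] **Alert aggregation**: are multiple alerts from the same incident grouped automatically?
+ - [ ] **Alert noise**: is the share of non-actionable alerts < 20%?
+ - [ ] **Escalation**: is there automatic escalation when nobody responds to an alert?
+
+ ### Incident Response
+ - [ ] **On-call system**: is there an L1/L2/L3 on-call structure with clear responsibilities?
+ - [ ] **Runbook**: do all core services have an executable Runbook?
+ - [ ] **Mitigation tooling**: are common mitigation actions automated (circuit breaking / rate limiting / rollback / scale-up)?
+ - [ ] **Incident commander**: do P0 incidents have a clear commander and an information-sync mechanism?
+ - [ ] **Failure drills**: are failure drills run regularly (at least once a quarter)?
+
+ ### Post-mortem and Improvement
+ - [ ] **Post-mortem template**: is there a standardized incident post-mortem template?
+ - [ ] **MTTR breakdown**: does every post-mortem break down the duration and bottleneck of each stage?
+ - [ ] **Action tracking**: does every action item have an Owner, a Deadline, and a tracking mechanism?
+ - [ ] **Trend metrics**: is there a trend report for MTTR and incident counts?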