@umacloud/knowledge 1.0.1

Files changed (418)
  1. package/00-governance/governance-capabilities.md +557 -0
  2. package/00-governance/knowledge-map.md +39 -0
  3. package/00-governance/maintenance-policy.md +76 -0
  4. package/00-governance/review-checklist.md +81 -0
  5. package/README.md +13 -0
  6. package/ai/01-standards/agent-development-complete.md +691 -0
  7. package/ai/01-standards/llm-application-complete.md +488 -0
  8. package/ai/01-standards/mlops-complete.md +798 -0
  9. package/ai/01-standards/prompt-engineering-complete.md +646 -0
  10. package/ai/01-standards/rag-architecture-complete.md +649 -0
  11. package/ai/02-playbooks/llm-evaluation-playbook.md +847 -0
  12. package/ai/03-checklists/ai-project-checklist.md +215 -0
  13. package/ai/04-antipatterns/ai-antipatterns.md +661 -0
  14. package/ai/05-cases/case-rag-production.md +147 -0
  15. package/ai/06-glossary/ai-glossary.md +162 -0
  16. package/ai/agent-evaluation-benchmark.md +53 -0
  17. package/ai/ai-agent-memory-context-management.md +41 -0
  18. package/ai/ai-cost-capacity-optimization-playbook.md +42 -0
  19. package/ai/ai-data-security-and-compliance-playbook.md +37 -0
  20. package/ai/ai-domain-index-and-checklist.md +40 -0
  21. package/ai/ai-governance-maturity-model.md +50 -0
  22. package/ai/ai-model-selection-and-routing-strategy.md +47 -0
  23. package/ai/ai-observability-and-oncall-runbook.md +52 -0
  24. package/ai/ai-rag-engineering-playbook.md +42 -0
  25. package/ai/ai-red-team-and-safety-evaluation.md +42 -0
  26. package/ai/ai-release-readiness-and-rollback-gate.md +42 -0
  27. package/ai/llm-agent-engineering-deep-dive.md +57 -0
  28. package/ai/prompt-and-tool-guardrails.md +52 -0
  29. package/api/01-standards/enterprise-api-standards.md +198 -0
  30. package/api/01-standards/rest-api-design-guide.md +63 -0
  31. package/api/02-playbooks/api-pagination-playbook.md +93 -0
  32. package/api/02-playbooks/graphql-production-playbook.md +176 -0
  33. package/api/03-checklists/api-review-checklist.md +55 -0
  34. package/api/04-antipatterns/api-antipatterns.md +112 -0
  35. package/architecture/01-standards/api-gateway-patterns.md +496 -0
  36. package/architecture/01-standards/cloud-native-patterns.md +644 -0
  37. package/architecture/01-standards/distributed-systems-patterns.md +591 -0
  38. package/architecture/01-standards/event-driven-architecture.md +595 -0
  39. package/architecture/01-standards/microservices-patterns-complete.md +968 -0
  40. package/architecture/01-standards/microservices-patterns.md +495 -0
  41. package/architecture/01-standards/system-design-interview.md +664 -0
  42. package/architecture/02-playbooks/microservices-patterns-playbook.md +137 -0
  43. package/architecture/02-playbooks/migration-playbook.md +780 -0
  44. package/architecture/02-playbooks/system-design-playbook.md +779 -0
  45. package/architecture/03-checklists/architecture-decision-checklist.md +297 -0
  46. package/architecture/04-antipatterns/architecture-antipatterns.md +417 -0
  47. package/architecture/05-cases/case-netflix-microservices.md +413 -0
  48. package/architecture/06-glossary/architecture-glossary.md +164 -0
  49. package/architecture/adr-template-and-examples.md +38 -0
  50. package/architecture/api-gateway-deep-dive.md +1291 -0
  51. package/architecture/configuration-management.md +1162 -0
  52. package/architecture/distributed-transactions.md +1220 -0
  53. package/architecture/microservices-complete.md +735 -0
  54. package/architecture/resilience-and-disaster-patterns.md +37 -0
  55. package/architecture/service-governance.md +1198 -0
  56. package/architecture/system-architecture-deep-dive.md +37 -0
  57. package/backend/01-standards/analytics-and-growth.md +65 -0
  58. package/backend/01-standards/api-and-error-conventions.md +120 -0
  59. package/backend/01-standards/application-layering-and-packaging.md +160 -0
  60. package/backend/01-standards/auth-implementation.md +104 -0
  61. package/backend/01-standards/backend-framework-idioms.md +74 -0
  62. package/backend/01-standards/background-jobs-and-async.md +66 -0
  63. package/backend/01-standards/caching-strategies-complete.md +390 -0
  64. package/backend/01-standards/config-and-observability.md +77 -0
  65. package/backend/01-standards/data-modeling-and-persistence.md +94 -0
  66. package/backend/01-standards/django-complete.md +1765 -0
  67. package/backend/01-standards/email-and-notifications.md +64 -0
  68. package/backend/01-standards/fastapi-complete.md +925 -0
  69. package/backend/01-standards/file-upload-and-storage.md +66 -0
  70. package/backend/01-standards/graphql-api-complete.md +416 -0
  71. package/backend/01-standards/llm-application-standard.md +78 -0
  72. package/backend/01-standards/message-queue-patterns.md +379 -0
  73. package/backend/01-standards/microservices-and-distributed.md +78 -0
  74. package/backend/01-standards/nestjs-complete.md +2167 -0
  75. package/backend/01-standards/payment-integration.md +80 -0
  76. package/backend/01-standards/rate-limiting-complete.md +451 -0
  77. package/backend/01-standards/realtime-and-websocket.md +65 -0
  78. package/backend/01-standards/search-and-filtering.md +64 -0
  79. package/backend/01-standards/spring-boot-complete.md +445 -0
  80. package/backend/02-playbooks/api-design-playbook.md +718 -0
  81. package/backend/02-playbooks/email-send-playbook.md +130 -0
  82. package/backend/02-playbooks/file-upload-s3-playbook.md +153 -0
  83. package/backend/02-playbooks/typescript-enterprise-playbook.md +133 -0
  84. package/backend/02-playbooks/websocket-realtime-playbook.md +154 -0
  85. package/backend/03-checklists/api-launch-checklist.md +189 -0
  86. package/backend/04-antipatterns/backend-antipatterns.md +1051 -0
  87. package/blockchain/01-standards/blockchain-basics.md +557 -0
  88. package/blockchain/01-standards/smart-contract-development.md +1315 -0
  89. package/cicd/01-standards/deployment-and-delivery-standard.md +96 -0
  90. package/cicd/01-standards/github-actions-complete.md +473 -0
  91. package/cicd/01-standards/release-and-store-submission.md +75 -0
  92. package/cicd/02-playbooks/cicd-pipeline-playbook.md +144 -0
  93. package/cicd/02-playbooks/release-management-playbook.md +605 -0
  94. package/cicd/03-checklists/pipeline-security-checklist.md +168 -0
  95. package/cicd/04-antipatterns/cicd-antipatterns.md +589 -0
  96. package/cicd/05-cases/case-deployment-automation.md +221 -0
  97. package/cicd/05-cases/case-gitops-transformation.md +212 -0
  98. package/cicd/06-glossary/cicd-glossary.md +114 -0
  99. package/cicd/cicd-blueprint-deep-dive.md +38 -0
  100. package/cicd/release-readiness-gate.md +37 -0
  101. package/cloud-native/01-standards/container-security.md +741 -0
  102. package/cloud-native/01-standards/kubernetes-complete.md +812 -0
  103. package/cloud-native/02-playbooks/api-gateway-playbook.md +155 -0
  104. package/cloud-native/02-playbooks/gitops-with-argocd.md +760 -0
  105. package/cloud-native/02-playbooks/k8s-troubleshooting-playbook.md +1942 -0
  106. package/cloud-native/02-playbooks/message-queue-playbook.md +129 -0
  107. package/cloud-native/02-playbooks/multicloud-governance.md +726 -0
  108. package/cloud-native/02-playbooks/serverless-patterns.md +788 -0
  109. package/cloud-native/02-playbooks/service-mesh-playbook.md +612 -0
  110. package/cloud-native/02-playbooks/terraform-iac-playbook.md +143 -0
  111. package/cloud-native/03-checklists/container-security-checklist.md +431 -0
  112. package/cloud-native/03-checklists/k8s-production-readiness-checklist.md +460 -0
  113. package/cloud-native/04-antipatterns/container-antipatterns.md +660 -0
  114. package/cloud-native/04-antipatterns/k8s-antipatterns.md +743 -0
  115. package/cloud-native/05-cases/case-k8s-migration.md +478 -0
  116. package/cloud-native/05-cases/case-k8s-scaling.md +642 -0
  117. package/cloud-native/05-cases/case-k8s-security-incident.md +397 -0
  118. package/cloud-native/06-glossary/cloud-native-glossary.md +337 -0
  119. package/cross-platform/01-standards/cross-platform-frameworks.md +83 -0
  120. package/cross-platform/01-standards/platform-selection-and-architecture.md +77 -0
  121. package/data/01-standards/elasticsearch-complete.md +2098 -0
  122. package/data/01-standards/postgresql-complete.md +1613 -0
  123. package/data/01-standards/redis-complete.md +1527 -0
  124. package/data/02-playbooks/database-optimization-playbook.md +403 -0
  125. package/data/02-playbooks/elasticsearch-production-playbook.md +132 -0
  126. package/data/03-checklists/database-launch-checklist.md +187 -0
  127. package/data/04-antipatterns/database-antipatterns.md +873 -0
  128. package/data/05-cases/case-database-migration.md +310 -0
  129. package/data/06-glossary/database-glossary.md +440 -0
  130. package/data/data-governance-and-modeling-deep-dive.md +39 -0
  131. package/data-engineering/01-standards/airflow-complete.md +523 -0
  132. package/data-engineering/01-standards/kafka-complete.md +1521 -0
  133. package/data-engineering/02-playbooks/spark-etl-playbook.md +496 -0
  134. package/data-engineering/03-checklists/pipeline-launch-checklist.md +194 -0
  135. package/data-engineering/04-antipatterns/data-pipeline-antipatterns.md +684 -0
  136. package/data-engineering/05-cases/case-real-time-pipeline.md +355 -0
  137. package/data-engineering/06-glossary/data-engineering-glossary.md +429 -0
  138. package/database/01-standards/database-schema-standards.md +147 -0
  139. package/database/02-playbooks/postgresql-optimization-quick.md +52 -0
  140. package/database/02-playbooks/postgresql-performance-optimization.md +58 -0
  141. package/database/02-playbooks/postgresql-production-playbook.md +146 -0
  142. package/database/02-playbooks/redis-caching-playbook.md +117 -0
  143. package/database/03-checklists/database-review-checklist.md +50 -0
  144. package/database/04-antipatterns/database-antipatterns.md +112 -0
  145. package/design/01-standards/ui-design-system-complete.md +423 -0
  146. package/design/02-playbooks/design-handoff-playbook.md +254 -0
  147. package/design/02-playbooks/design-review-playbook.md +388 -0
  148. package/design/03-checklists/design-review-checklist.md +246 -0
  149. package/design/04-antipatterns/design-antipatterns.md +378 -0
  150. package/design/05-cases/case-design-system-adoption.md +328 -0
  151. package/design/06-glossary/design-glossary.md +329 -0
  152. package/design/ui-full-lifecycle-cross-platform-playbook.md +571 -0
  153. package/design/ux-system-deep-dive.md +38 -0
  154. package/design-systems/00-craft-rules.md +71 -0
  155. package/design-systems/aesthetic-families.md +43 -0
  156. package/design-systems/anti-ai-slop.md +162 -0
  157. package/design-systems/bold-geometric.md +120 -0
  158. package/design-systems/brutalist-bold.md +103 -0
  159. package/design-systems/editorial-clean.md +109 -0
  160. package/design-systems/glass-aurora.md +108 -0
  161. package/design-systems/modern-minimal.md +145 -0
  162. package/design-systems/premium-luxury.md +106 -0
  163. package/design-systems/product-type-design-map.md +48 -0
  164. package/design-systems/soft-warm.md +123 -0
  165. package/design-systems/tech-utility.md +113 -0
  166. package/desktop/01-standards/desktop-app-standard.md +72 -0
  167. package/desktop/01-standards/desktop-design.md +71 -0
  168. package/development/00-governance/document-template.md +41 -0
  169. package/development/01-standards/api-versioning-strategies.md +432 -0
  170. package/development/01-standards/authentication-patterns-complete.md +479 -0
  171. package/development/01-standards/css-architecture-complete.md +550 -0
  172. package/development/01-standards/database-migration-strategies.md +484 -0
  173. package/development/01-standards/elasticsearch-complete.md +347 -0
  174. package/development/01-standards/git-complete.md +371 -0
  175. package/development/01-standards/golang-complete.md +1565 -0
  176. package/development/01-standards/graphql-complete.md +298 -0
  177. package/development/01-standards/javascript-bundlers-complete.md +469 -0
  178. package/development/01-standards/javascript-typescript-complete.md +528 -0
  179. package/development/01-standards/jest-complete.md +275 -0
  180. package/development/01-standards/linux-complete.md +234 -0
  181. package/development/01-standards/logging-observability-complete.md +526 -0
  182. package/development/01-standards/microservices-communication.md +502 -0
  183. package/development/01-standards/mongodb-complete.md +406 -0
  184. package/development/01-standards/oauth2-complete.md +285 -0
  185. package/development/01-standards/performance-optimization-complete.md +289 -0
  186. package/development/01-standards/playwright-complete.md +247 -0
  187. package/development/01-standards/postgresql-complete.md +456 -0
  188. package/development/01-standards/pytest-complete.md +340 -0
  189. package/development/01-standards/python-async-programming.md +902 -0
  190. package/development/01-standards/python-complete.md +956 -0
  191. package/development/01-standards/python-decorators-complete.md +799 -0
  192. package/development/01-standards/python-design-patterns.md +2854 -0
  193. package/development/01-standards/python-packaging-distribution.md +420 -0
  194. package/development/01-standards/python-testing-strategies.md +607 -0
  195. package/development/01-standards/python-web-frameworks-comparison.md +471 -0
  196. package/development/01-standards/redis-complete.md +317 -0
  197. package/development/01-standards/rest-api-complete.md +316 -0
  198. package/development/01-standards/rust-complete.md +578 -0
  199. package/development/01-standards/typescript-advanced-types.md +1513 -0
  200. package/development/01-standards/web-security-complete.md +292 -0
  201. package/development/02-playbooks/api-design-playbook.md +810 -0
  202. package/development/02-playbooks/database-migration-playbook.md +580 -0
  203. package/development/02-playbooks/debugging-playbook.md +692 -0
  204. package/development/02-playbooks/feature-delivery-playbook.md +430 -0
  205. package/development/02-playbooks/incident-hotfix-playbook.md +387 -0
  206. package/development/02-playbooks/performance-optimization-playbook.md +531 -0
  207. package/development/02-playbooks/performance-tuning-playbook.md +652 -0
  208. package/development/02-playbooks/refactor-playbook.md +403 -0
  209. package/development/02-playbooks/release-playbook.md +469 -0
  210. package/development/03-checklists/architecture-review-checklist.md +168 -0
  211. package/development/03-checklists/data-migration-checklist.md +157 -0
  212. package/development/03-checklists/oncall-handover-checklist.md +173 -0
  213. package/development/03-checklists/pr-checklist.md +158 -0
  214. package/development/03-checklists/production-readiness-checklist.md +190 -0
  215. package/development/03-checklists/release-readiness-checklist.md +154 -0
  216. package/development/03-checklists/security-review-checklist.md +182 -0
  217. package/development/04-antipatterns/api-antipatterns.md +657 -0
  218. package/development/04-antipatterns/architecture-antipatterns.md +686 -0
  219. package/development/04-antipatterns/backend-antipatterns.md +648 -0
  220. package/development/04-antipatterns/cicd-antipatterns.md +540 -0
  221. package/development/04-antipatterns/code-smell-antipatterns.md +571 -0
  222. package/development/04-antipatterns/data-antipatterns.md +658 -0
  223. package/development/04-antipatterns/database-antipatterns.md +578 -0
  224. package/development/04-antipatterns/frontend-antipatterns.md +635 -0
  225. package/development/04-antipatterns/reliability-antipatterns.md +700 -0
  226. package/development/04-antipatterns/security-antipatterns.md +747 -0
  227. package/development/05-cases/case-api-version-migration.md +428 -0
  228. package/development/05-cases/case-authorization-hardening.md +383 -0
  229. package/development/05-cases/case-bluegreen-rollback.md +466 -0
  230. package/development/05-cases/case-cache-snowball-protection.md +485 -0
  231. package/development/05-cases/case-ci-cd-pipeline.md +544 -0
  232. package/development/05-cases/case-database-scaling.md +500 -0
  233. package/development/05-cases/case-db-hotspot-optimization.md +487 -0
  234. package/development/05-cases/case-incident-mttr-reduction.md +563 -0
  235. package/development/05-cases/case-microservice-migration.md +375 -0
  236. package/development/05-cases/case-performance-optimization.md +406 -0
  237. package/development/05-cases/case-security-incident-response.md +345 -0
  238. package/development/06-glossary/full-stack-glossary.md +166 -0
  239. package/development/09-maturity/quarterly-audit-template.md +35 -0
  240. package/development/11-ui-excellence/ui-aesthetic-system.md +41 -0
  241. package/development/11-ui-excellence/ui-engineering-excellence.md +435 -0
  242. package/development/12-scenarios/development-scenarios-guide.md +565 -0
  243. package/development/13-implementation-assets/implementation-toolkit.md +282 -0
  244. package/development/13-implementation-assets/knowledge-gates-execution.md +43 -0
  245. package/development/14-full-lifecycle/software-lifecycle-gates.md +511 -0
  246. package/development/15-lifecycle-templates/project-templates-collection.md +791 -0
  247. package/development/api-contract-and-versioning-guide.md +36 -0
  248. package/development/api-governance-complete.md +43 -0
  249. package/development/backend-engineering-complete.md +43 -0
  250. package/development/code-review-quality-complete.md +43 -0
  251. package/development/concurrency-reliability-complete.md +43 -0
  252. package/development/database-engineering-complete.md +43 -0
  253. package/development/engineering-effectiveness-complete.md +43 -0
  254. package/development/engineering-standards-deep-dive.md +38 -0
  255. package/development/frontend-engineering-complete.md +43 -0
  256. package/development/performance-capacity-complete.md +43 -0
  257. package/development/refactor-migration-complete.md +42 -0
  258. package/development/refactoring-and-techdebt-playbook.md +37 -0
  259. package/development/security-in-development-complete.md +43 -0
  260. package/devops/01-standards/cicd-pipeline-complete.md +262 -0
  261. package/devops/01-standards/docker-complete.md +1490 -0
  262. package/devops/01-standards/github-actions-complete.md +337 -0
  263. package/devops/01-standards/kubernetes-complete.md +638 -0
  264. package/devops/01-standards/terraform-complete.md +2117 -0
  265. package/devops/02-playbooks/docker-compose-playbook.md +233 -0
  266. package/devops/02-playbooks/docker-k8s-production-playbook.md +186 -0
  267. package/devops/02-playbooks/docker-production-playbook.md +952 -0
  268. package/edge-iot/01-standards/edge-iot-complete.md +473 -0
  269. package/experts/architect/api-design.md +178 -0
  270. package/experts/architect/methodology.md +124 -0
  271. package/experts/architect/security.md +75 -0
  272. package/experts/backend-lead/methodology.md +216 -0
  273. package/experts/devops/methodology.md +160 -0
  274. package/experts/frontend-lead/methodology.md +178 -0
  275. package/experts/product-manager/industry/ecommerce.md +43 -0
  276. package/experts/product-manager/industry/saas.md +40 -0
  277. package/experts/product-manager/methodology.md +97 -0
  278. package/experts/qa-lead/methodology.md +123 -0
  279. package/experts/qa-lead/test-strategy.md +128 -0
  280. package/experts/uiux-designer/methodology.md +125 -0
  281. package/frontend/01-standards/accessibility-complete.md +532 -0
  282. package/frontend/01-standards/accessibility-standard.md +74 -0
  283. package/frontend/01-standards/admin-dashboard-and-crud.md +72 -0
  284. package/frontend/01-standards/design-tokens-complete.md +444 -0
  285. package/frontend/01-standards/forms-and-validation.md +77 -0
  286. package/frontend/01-standards/frontend-architecture-and-layering.md +119 -0
  287. package/frontend/01-standards/i18n-and-localization.md +65 -0
  288. package/frontend/01-standards/nextjs-complete.md +451 -0
  289. package/frontend/01-standards/react-complete.md +713 -0
  290. package/frontend/01-standards/react-hooks-complete-guide.md +1100 -0
  291. package/frontend/01-standards/react-hooks-complete.md +1171 -0
  292. package/frontend/01-standards/seo-and-web-vitals.md +77 -0
  293. package/frontend/01-standards/state-management-complete.md +444 -0
  294. package/frontend/01-standards/vue-complete.md +499 -0
  295. package/frontend/01-standards/vue3-complete.md +2002 -0
  296. package/frontend/01-standards/web-framework-best-practices.md +64 -0
  297. package/frontend/01-standards/web-performance-complete.md +495 -0
  298. package/frontend/02-playbooks/accessibility-a11y-playbook.md +161 -0
  299. package/frontend/02-playbooks/frontend-performance-playbook.md +707 -0
  300. package/frontend/02-playbooks/i18n-internationalization-playbook.md +120 -0
  301. package/frontend/02-playbooks/performance-optimization-playbook.md +163 -0
  302. package/frontend/02-playbooks/react-nextjs-production-playbook.md +167 -0
  303. package/frontend/02-playbooks/react-state-management-playbook.md +173 -0
  304. package/frontend/03-checklists/component-quality-checklist.md +166 -0
  305. package/frontend/03-checklists/frontend-launch-checklist.md +299 -0
  306. package/frontend/04-antipatterns/frontend-antipatterns.md +886 -0
  307. package/frontend/05-cases/case-performance-optimization.md +274 -0
  308. package/harmony/01-standards/harmonyos-arkts-standard.md +75 -0
  309. package/harmony/01-standards/harmonyos-design.md +65 -0
  310. package/high-quality-engineering-playbook.md +54 -0
  311. package/incident/01-standards/incident-response-complete.md +303 -0
  312. package/incident/02-playbooks/chaos-engineering-playbook.md +883 -0
  313. package/incident/02-playbooks/postmortem-playbook.md +398 -0
  314. package/incident/03-checklists/incident-readiness-checklist.md +181 -0
  315. package/incident/04-antipatterns/incident-antipatterns.md +490 -0
  316. package/incident/05-cases/case-cascade-failure.md +176 -0
  317. package/incident/06-glossary/incident-glossary.md +114 -0
  318. package/incident/postmortem-and-response-deep-dive.md +39 -0
  319. package/industries/ecommerce/ecommerce-complete.md +631 -0
  320. package/industries/education/education-complete.md +555 -0
  321. package/industries/fintech/fintech-complete.md +501 -0
  322. package/industries/gaming/gaming-complete.md +587 -0
  323. package/industries/healthcare/healthcare-complete.md +452 -0
  324. package/low-code/01-standards/low-code-complete.md +944 -0
  325. package/miniprogram/01-standards/ai-common-mistakes.md +61 -0
  326. package/miniprogram/01-standards/miniprogram-custom-navbar-capsule.md +77 -0
  327. package/miniprogram/01-standards/miniprogram-design.md +61 -0
  328. package/miniprogram/01-standards/miniprogram-standard.md +81 -0
  329. package/mobile/01-standards/android-material-design.md +70 -0
  330. package/mobile/01-standards/flutter-complete.md +384 -0
  331. package/mobile/01-standards/ios-design-hig.md +78 -0
  332. package/mobile/01-standards/mobile-app-standard.md +85 -0
  333. package/mobile/01-standards/react-native-complete.md +352 -0
  334. package/mobile/02-playbooks/mobile-cross-platform-playbook.md +175 -0
  335. package/mobile/02-playbooks/mobile-performance.md +473 -0
  336. package/mobile/03-checklists/mobile-release-checklist.md +234 -0
  337. package/mobile/04-antipatterns/mobile-antipatterns.md +798 -0
  338. package/mobile/05-cases/case-app-performance.md +500 -0
  339. package/mobile/05-cases/case-app-startup-optimization.md +218 -0
  340. package/mobile/06-glossary/mobile-glossary.md +484 -0
  341. package/observability/01-standards/observability-standards.md +103 -0
  342. package/observability/02-playbooks/prometheus-grafana-playbook.md +135 -0
  343. package/observability/02-playbooks/structured-logging-playbook.md +73 -0
  344. package/observability/03-checklists/observability-checklist.md +54 -0
  345. package/observability/04-antipatterns/observability-antipatterns.md +106 -0
  346. package/operations/01-standards/prometheus-monitoring-complete.md +1578 -0
  347. package/operations/02-playbooks/capacity-planning-playbook.md +620 -0
  348. package/operations/03-checklists/production-launch-checklist.md +365 -0
  349. package/operations/04-antipatterns/operations-antipatterns.md +664 -0
  350. package/operations/05-cases/case-sre-practices.md +581 -0
  351. package/operations/06-glossary/operations-glossary.md +120 -0
  352. package/operations/aiops-anomaly-detection.md +758 -0
  353. package/operations/capacity-planning.md +1061 -0
  354. package/operations/chaos-engineering.md +659 -0
  355. package/operations/incident-command-system.md +38 -0
  356. package/operations/observability-complete.md +442 -0
  357. package/operations/slo-sli-playbook.md +517 -0
  358. package/operations/sre-operations-deep-dive.md +39 -0
  359. package/package.json +8 -0
  360. package/performance/01-standards/performance-and-scalability.md +80 -0
  361. package/performance/01-standards/performance-standards.md +156 -0
  362. package/performance/02-playbooks/query-optimization-playbook.md +103 -0
  363. package/performance/03-checklists/performance-checklist.md +56 -0
  364. package/performance/04-antipatterns/performance-antipatterns.md +146 -0
  365. package/product/01-standards/product-management-complete.md +285 -0
  366. package/product/02-playbooks/feature-launch-playbook.md +207 -0
  367. package/product/02-playbooks/user-research-playbook.md +532 -0
  368. package/product/03-checklists/feature-launch-checklist.md +275 -0
  369. package/product/04-antipatterns/product-antipatterns.md +355 -0
  370. package/product/05-cases/case-mvp-to-scale.md +384 -0
  371. package/product/06-glossary/product-glossary.md +462 -0
  372. package/product/feature-prioritization-framework.md +40 -0
  373. package/product/kpi-and-metric-tree.md +37 -0
  374. package/product/product-discovery-and-prd-deep-dive.md +41 -0
  375. package/quantum/01-standards/quantum-complete.md +1186 -0
  376. package/security/01-standards/api-security-complete.md +511 -0
  377. package/security/01-standards/container-runtime-security.md +574 -0
  378. package/security/01-standards/data-protection-gdpr.md +543 -0
  379. package/security/01-standards/owasp-top10-complete.md +1890 -0
  380. package/security/01-standards/secure-coding-baseline.md +90 -0
  381. package/security/01-standards/supply-chain-security.md +441 -0
  382. package/security/01-standards/web-security-checklist.md +108 -0
  383. package/security/01-standards/zero-trust-architecture.md +521 -0
  384. package/security/02-playbooks/auth-sso-playbook.md +166 -0
  385. package/security/02-playbooks/incident-response-security-playbook.md +588 -0
  386. package/security/02-playbooks/owasp-api-security-playbook.md +129 -0
  387. package/security/02-playbooks/payment-integration-playbook.md +119 -0
  388. package/security/02-playbooks/penetration-testing-playbook.md +517 -0
  389. package/security/03-checklists/security-audit-checklist.md +356 -0
  390. package/security/04-antipatterns/security-coding-antipatterns.md +580 -0
  391. package/security/05-cases/case-log4shell-incident.md +537 -0
  392. package/security/05-cases/case-major-breaches.md +468 -0
  393. package/security/06-glossary/security-glossary.md +212 -0
  394. package/security/compliance-automation.md +993 -0
  395. package/security/container-security.md +680 -0
  396. package/security/devsecops-complete.md +426 -0
  397. package/security/sast-dast-sca.md +775 -0
  398. package/security/secrets-management.md +594 -0
  399. package/security/security-architecture-deep-dive.md +37 -0
  400. package/security/threat-modeling-stride-playbook.md +40 -0
  401. package/seed-templates/auth-system.md +59 -0
  402. package/seed-templates/blog-content.md +94 -0
  403. package/seed-templates/dashboard.md +89 -0
  404. package/seed-templates/docs-site.md +73 -0
  405. package/seed-templates/e-commerce.md +50 -0
  406. package/seed-templates/saas-landing.md +92 -0
  407. package/seed-templates/settings-page.md +51 -0
  408. package/testing/01-standards/test-strategy-and-layering.md +83 -0
  409. package/testing/01-standards/testing-strategy-complete.md +422 -0
  410. package/testing/01-standards/unit-testing-best-practices.md +118 -0
  411. package/testing/02-playbooks/e2e-testing-playbook.md +988 -0
  412. package/testing/02-playbooks/testing-strategy-playbook.md +126 -0
  413. package/testing/03-checklists/test-strategy-checklist.md +208 -0
  414. package/testing/04-antipatterns/testing-antipatterns.md +718 -0
  415. package/testing/05-cases/case-testing-transformation.md +300 -0
  416. package/testing/06-glossary/testing-glossary.md +110 -0
  417. package/testing/risk-based-test-matrix.md +36 -0
  418. package/testing/testing-strategy-deep-dive.md +37 -0
@@ -0,0 +1,581 @@
+ ---
+ id: case-sre-practices
+ title: "Case Study: Putting SRE into Practice - The Complete Path from Principles to Engineering"
+ domain: operations
+ category: 05-cases
+ difficulty: intermediate
+ tags: [budget, case, engineering, operations, practices, sre, 为例, 优化]
+ quality_score: 70
+ last_updated: 2026-06-15
+ ---
+ # Case Study: Putting SRE into Practice - The Complete Path from Principles to Engineering
+
+ ## Metadata
+
+ | Field | Value |
+ |------|------|
+ | Industry | E-commerce platform |
+ | System scale | 50 million registered users, 8 million daily active users, 2 million orders per day |
+ | Tech stack | Go + Java + PostgreSQL + Redis + Kafka + Kubernetes |
+ | Team size | 60 backend engineers, 8 SREs, 10 QA engineers |
+ | Timeline | 12 months (2024-01 to 2024-12) |
+ | Primary goal | Build an SRE practice and raise availability from 99.5% to 99.95% |
+
+ ---
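+
+ ## 1. Background and Motivation
+
+ ### 1.1 Pain Points
+
+ An e-commerce platform ran into serious stability problems during a period of rapid growth:
+
+ - **Availability below target**: 18 P1 incidents in the previous 12 months, with 43 hours of cumulative downtime (availability of roughly 99.5%)
+ - **Long MTTR**: mean time to recovery was 2.4 hours, and the longest incident lasted 8 hours
+ - **Outdated operations model**: heavy reliance on manual operations, with the SRE team spending 70% of its time on repetitive work (toil)
+ - **No quantified targets**: the team had no shared standard for whether the system was "stable enough"
+ - **On-call fatigue**: frequent alert storms, with on-call engineers woken up 3-5 times per week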
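The availability figure in the first bullet follows directly from the downtime total. A minimal sketch of that arithmetic, assuming the 12 months are treated as a 365-day year; the function name is illustrative and not part of the case study:

```python
def availability_from_downtime(downtime_hours: float, period_hours: float) -> float:
    """Return availability as the fraction of the period the system was up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    return 1.0 - downtime_hours / period_hours


HOURS_PER_YEAR = 365 * 24  # assumption: 12 months treated as a 365-day year

availability = availability_from_downtime(43, HOURS_PER_YEAR)
print(f"{availability:.2%}")  # 99.51%, i.e. roughly 99.5%
```

The two headline numbers are also consistent with each other: 18 incidents at an average of 2.4 hours each come to about 43 hours.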
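+
+ ### 1.2 SRE Transformation Goals
+
+ The team decided to adopt the core ideas of Google SRE and address stability systematically, with a 12-month roadmap:
+
+ | Phase | Timeframe | Goal |
+ |------|------|------|
+ | Q1 | Months 1-3 | Establish the SLO framework and an observability foundation |
+ | Q2 | Months 4-6 | Put the error budget mechanism into operation and reduce toil |
+ | Q3 | Months 7-9 | Improve on-call and introduce automated remediation |
+ | Q4 | Months 10-12 | Chaos engineering and continuous improvement |
+
+ ---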
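+
+ ## 2. Putting Google SRE Core Principles into Practice
+
+ ### 2.1 Core Principles
+
+ The SRE team first aligned on five core Google SRE principles and interpreted each one for its own context:
+
+ | Google SRE principle | How this team applied it |
+ |----------------|--------------|
+ | **Embracing Risk** | 100% availability is not the goal; the error budget quantifies the acceptable level of risk |
+ | **Service Level Objectives (SLOs)** | Every core service defines SLIs, SLOs, and SLAs, and decisions are driven by data |
+ | **Eliminating Toil** | Toil stays at or below 50% of SRE time; automation is the core output of engineering work |
+ | **Monitoring and Observability** | Move from alert-based monitoring to SLO-based monitoring |
+ | **Simplicity** | Reject unnecessary complexity; every architecture change must justify its necessity |
+
+ ### 2.2 Organizational Changes
+
+ - The SRE team was split out of the operations department and reports directly to the CTO
+ - The SRE-to-developer ratio is roughly 1:8 (8 SREs for 60 developers)
+ - Each core service has a designated SRE as its stability owner
+ - An "embedded SRE" mechanism was established: SREs attend the product teams' sprint planning
+
+ ---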
+
+ ## 3. Setting SLOs in Practice (E-commerce API Example)
+
+ ### 3.1 SLI Definitions
+
+ Choose SLIs (Service Level Indicators) that directly reflect user experience:
+
+ ```yaml
+ # SLI definitions for the order service
+ order-service:
+   availability:
+     description: "Share of requests that receive a successful response"
+     formula: "count(status < 500) / count(total_requests)"
+     measurement: "Prometheus metrics at load balancer"
+
+   latency:
+     description: "Request response time (P99)"
+     # Shorthand: a real PromQL query applies histogram_quantile to rate() over the histogram's bucket series
+     formula: "histogram_quantile(0.99, request_duration_seconds)"
+     measurement: "Application-level metrics"
+
+   correctness:
+     description: "Data consistency after order creation"
+     formula: "count(order_verified) / count(order_created)"
+     measurement: "Asynchronous verification job, runs every 5 minutes"
+ ```
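To make the three formulas concrete, here is a small sketch that computes each SLI from raw samples. The function names and sample data are hypothetical; in the setup described above these values would come from Prometheus and the verification job rather than from in-memory lists. The percentile uses the nearest-rank method, which is simpler than the bucket interpolation that `histogram_quantile` performs.

```python
from typing import Sequence


def availability_sli(status_codes: Sequence[int]) -> float:
    """Fraction of requests whose HTTP status code is below 500."""
    if not status_codes:
        raise ValueError("no requests observed")
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


def p99_latency_sli(durations_seconds: Sequence[float]) -> float:
    """99th-percentile latency using the nearest-rank method."""
    if not durations_seconds:
        raise ValueError("no requests observed")
    ordered = sorted(durations_seconds)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def correctness_sli(orders_verified: int, orders_created: int) -> float:
    """Fraction of created orders that passed the consistency check."""
    if orders_created <= 0:
        raise ValueError("no orders created")
    return orders_verified / orders_created


# Illustrative data only, not taken from the case study
codes = [200] * 9990 + [404] * 5 + [503] * 5
latencies = [0.1] * 99 + [2.0]

print(availability_sli(codes))      # 0.9995 (404s count as served, 503s do not)
print(p99_latency_sli(latencies))   # 0.1
print(correctness_sli(1998, 2000))  # 0.999
```

Note that client errors such as 404 count as successful responses under the `status < 500` rule, which is why only the five 503s reduce availability here.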
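+
+ ### 3.2 Setting SLOs
+
+ SLOs are set from historical data and business requirements:
+
+ | Service | SLI | SLO target | Window | Rationale |
+ |------|-----|---------|---------|------|
+ | Order service | Availability | 99.95% | 30-day rolling | Business requirement: less than 21.6 minutes of downtime per month |
+ | Order service | P99 latency | < 500ms | 30-day rolling | UX research: conversion drops 12% above 500ms |
+ | Product service | Availability | 99.9% | 30-day rolling | Not on the transaction path, so tolerance is higher |
+ | Product service | P99 latency | < 200ms | 30-day rolling | Home page and search depend on it, so it is speed-sensitive |
+ | Payment service | Availability | 99.99% | 30-day rolling | Funds are at stake, so the bar is very high |
+ | Payment service | P99 latency | < 1000ms | 30-day rolling | Payments inherently take longer, and users expect that |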
111
+
+ ### 3.3 Implementing the SLO
+
+ ```python
+ # Prometheus query: 30-day availability of the order service
+ availability_30d = """
+ 1 - (
+   sum(rate(http_requests_total{service="order", code=~"5.."}[30d]))
+   /
+   sum(rate(http_requests_total{service="order"}[30d]))
+ )
+ """
+
+ # Grafana dashboard panels
+ # - Current SLO attainment (big number)
+ # - Error Budget remaining, in percent (progress bar)
+ # - Error Budget burn rate (trend line)
+ # - Projected time until the Error Budget is exhausted
+ ```
+
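The "budget remaining" and "projected exhaustion" panels listed above can be derived from plain request counts. The following is a sketch under stated assumptions: the function names, the request counts, and the constant 2%-per-day burn are all hypothetical, and a real dashboard would compute these from the Prometheus series instead.

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the Error Budget still unspent (negative once overspent)."""
    allowed_bad = (1 - slo) * total  # failures the SLO tolerates in the window
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


def days_until_exhausted(remaining: float, burn_per_day: float) -> float:
    """Days left if the budget keeps burning at burn_per_day (fraction of budget per day)."""
    if burn_per_day <= 0:
        return float("inf")
    return remaining / burn_per_day


# Hypothetical 30-day counts for the order service (SLO 99.95%)
remaining = budget_remaining(slo=0.9995, good=9_996_750, total=10_000_000)
print(f"budget remaining: {remaining:.0%}")
print(f"days until exhausted at 2%/day: {days_until_exhausted(remaining, 0.02):.1f}")
```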
+ ---
+
+ ## 4. The Error Budget Mechanism
+
+ ### 4.1 Calculating the Error Budget
+
+ The Error Budget is the amount of unreliability the SLO allows:
+
+ ```
+ Error Budget = 1 - SLO target
+
+ Order service Error Budget:
+   = 1 - 99.95%
+   = 0.05%
+   = 30 days × 24 hours × 60 minutes × 0.0005
+   = 21.6 minutes / month
+ ```
+
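The same arithmetic can be applied to every availability target in the SLO table. This is a minimal sketch; `error_budget_minutes` is an illustrative name, and the 30-day window matches the rolling window used for the SLOs.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability that an availability SLO allows per window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


# Availability targets from the SLO table
for service, slo in [("order", 99.95), ("product", 99.9), ("payment", 99.99)]:
    print(f"{service}: {error_budget_minutes(slo):.2f} min per 30 days")
```

This prints 21.60, 43.20, and 4.32 minutes respectively, which shows how quickly the budget shrinks with each additional nine.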
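+ ### 4.2 Error Budget Policy
+
+ Clear rules tie the remaining Error Budget to release policy:
+
+ | Budget remaining | Status | Policy |
+ |------------------|--------|--------|
+ | > 50% | Green | Normal release cadence; new feature development encouraged |
+ | 25% - 50% | Yellow | Fewer high-risk releases; larger share of canary (gradual) rollouts |
+ | 5% - 25% | Orange | Freeze non-critical releases; focus fully on stability fixes |
+ | < 5% | Red | Freeze all releases; SRE leads troubleshooting |
+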
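The policy table maps directly onto a small lookup function, which is one way to wire it into a release pipeline. This is a sketch, not the team's actual tooling: the function name is hypothetical, and because the table leaves the exact boundary values (50%, 25%, 5%) ambiguous, the code assumes each boundary falls into the less restrictive state except 50%, which is treated as yellow.

```python
def budget_status(remaining_pct: float) -> str:
    """Map the remaining Error Budget (in percent) to a release-policy state."""
    if remaining_pct > 50:
        return "green"   # normal release cadence
    if remaining_pct >= 25:
        return "yellow"  # fewer high-risk releases, more canary rollout
    if remaining_pct >= 5:
        return "orange"  # freeze non-critical releases
    return "red"         # freeze all releases


print(budget_status(35))  # yellow
print(budget_status(3))   # red
```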
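+ ### 4.3 Case Study: An Error Budget Decision Before a Major Sale
+
+ Two weeks before the 2024 618 sale (the June 18 shopping festival), the order service had 35% of its Error Budget left (yellow):
+
+ ```
+ Timeline:
+ Jun 1  - Error Budget at 35% remaining
+ Jun 3  - Product team asks to ship a new promotion rules engine
+ Jun 3  - SRE assessment: the feature touches the core order path, high risk
+ Jun 4  - Decision: postpone the release until after the sale
+        - Alternative: implement some promotion rules via Feature Flags and configuration
+ Jun 5  - Focus on fixing known stability issues; budget burn rate drops
+ Jun 18 - Zero P1 incidents during the sale; 28% of the budget remaining
+ ```
+
+ The decision avoided a potential incident during the sale and is a typical example of the Error Budget mechanism doing its job.
+
+ ---
+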
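+ ## 5. Toil Reduction
+
+ ### 5.1 Defining and Identifying Toil
+
+ Toil is work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that grows linearly as the service grows.
+
+ The team started with a toil audit, logging all SRE work over two weeks:
+
+ | Toil item | Frequency | Time per occurrence | Monthly total | Automation difficulty |
+ |-----------|-----------|---------------------|---------------|-----------------------|
+ | Manual scale-up/scale-down | 3 per day | 15 min | 22.5 h | Low |
+ | Certificate renewal | 8 per month | 30 min | 4 h | Low |
+ | Log cleanup | 1 per day | 10 min | 5 h | Low |
+ | Database slow-query handling | 5 per week | 45 min | 15 h | Medium |
+ | User data fixes | 3 per week | 60 min | 12 h | Medium |
+ | Configuration changes | 2 per day | 20 min | 20 h | Medium |
+ | Incident troubleshooting | 2 per week | 120 min | 16 h | High |
+
+ **Total toil: about 94.5 hours per month, reported as 73% of total SRE working time** (team capacity is 8 people × 160 h = 1,280 h per month; the 73% share is calculated per engineer actually performing the work rather than by spreading the 94.5 h evenly across all 8).
+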
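The monthly totals in the audit table can be reproduced from the frequency and duration columns. This sketch assumes 30 days and 4 weeks per month, which are the multipliers that match the table's figures; the item keys are shortened labels, not names from the original audit.

```python
# Occurrences per month for each frequency unit (assumed: 30 days, 4 weeks)
PER_MONTH = {"day": 30, "week": 4, "month": 1}

# (item, count, per, minutes per occurrence), taken from the audit table
AUDIT = [
    ("manual scaling", 3, "day", 15),
    ("certificate renewal", 8, "month", 30),
    ("log cleanup", 1, "day", 10),
    ("slow-query handling", 5, "week", 45),
    ("user data fixes", 3, "week", 60),
    ("configuration changes", 2, "day", 20),
    ("incident troubleshooting", 2, "week", 120),
]


def monthly_hours(count: int, per: str, minutes: int) -> float:
    """Hours per month spent on one toil item."""
    return count * PER_MONTH[per] * minutes / 60


hours = {item: monthly_hours(count, per, minutes) for item, count, per, minutes in AUDIT}
total_hours = sum(hours.values())
print(total_hours)  # 94.5
```

Sorting `hours` in descending order gives a quick view of which items dominate before difficulty is taken into account.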
+ ### 5.2 Toil Reduction Roadmap
+
+ Items are prioritized by "high frequency + low difficulty" first:
+
+ **Batch 1 (months 1-2) - quick automation**:
+
+ ```yaml
+ # HPA: automatic scale-up and scale-down
+ apiVersion: autoscaling/v2
+ kind: HorizontalPodAutoscaler
+ metadata:
+   name: order-service-hpa
+ spec:
+   scaleTargetRef:
+     apiVersion: apps/v1
+     kind: Deployment
+     name: order-service
+   minReplicas: 3
+   maxReplicas: 50
+   metrics:
+     - type: Resource
+       resource:
+         name: cpu
+         target:
+           type: Utilization
+           averageUtilization: 70
+   behavior:
+     scaleUp:
+       stabilizationWindowSeconds: 60
+       policies:
+         - type: Percent
+           value: 100
+           periodSeconds: 60
+     scaleDown:
+       stabilizationWindowSeconds: 300
+ ```
+
+ ```bash
+ # cert-manager: automated certificate management
+ kubectl apply -f - <<EOF
+ apiVersion: cert-manager.io/v1
+ kind: Certificate
+ metadata:
+   name: api-cert
+ spec:
+   secretName: api-tls
+   issuerRef:
+     name: letsencrypt-prod
+     kind: ClusterIssuer
+   dnsNames:
+     - api.example.com
+   renewBefore: 720h  # renew automatically 30 days before expiry
+ EOF
+ ```
+
+ **Batch 2 (months 3-4) - semi-automation**:
+
+ - Slow queries: automatic detection and index suggestions, with human confirmation before execution
+ - Configuration changes go through a GitOps flow and are applied automatically after PR review
+ - Log cleanup runs automatically as a CronJob
+
+ **Batch 3 (months 5-6) - intelligent automation**:
+
+ - User data fixes: build a self-service repair tool to reduce SRE involvement
+ - Incident troubleshooting: automate Runbooks (covered in a later section)
+
+ ### 5.3 Results
+
+ | Metric | Before | After | Change |
+ |--------|--------|-------|--------|
+ | Toil share | 73% | 28% | -45pp |
+ | Monthly toil hours | 94.5 h | 36 h | -62% |
+ | Share of SRE time on engineering projects | 27% | 72% | +45pp |
+
+ ---
+
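+ ## 6. On-Call Optimization
+
+ ### 6.1 Problems with the Existing On-Call Setup
+
+ - Alert volume: 200+ alerts per day on average, 80% of which needed no human intervention
+ - Alert fatigue: on-call engineers became desensitized, and critical alerts were missed
+ - Unclear escalation: no clarity on when to escalate or to whom
+ - Knowledge silos: incident-handling experience lived in individuals' heads and could not be passed on
+
+ ### 6.2 Alert Governance
+
+ **Alert severity levels**:
+
+ | Level | Condition | Response requirement | Notification |
+ |-------|-----------|----------------------|--------------|
+ | P0 / Critical | Error Budget burning at more than 10x the normal rate | Respond within 5 minutes | Phone + SMS + Slack |
+ | P1 / High | SLO attainment dropping while the budget is still sufficient | Respond within 15 minutes | SMS + Slack |
+ | P2 / Medium | Non-core service anomaly | Respond within 1 hour | Slack |
+ | P3 / Low | Early-warning signal | Handled the next business day | Slack (low-priority channel) |
+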
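The P0 condition depends on a burn-rate multiple, which can be sketched as follows. This assumes that the "normal rate" means a burn rate of 1.0, i.e. the rate at which the budget would be used up exactly at the end of the SLO window; the source does not define it, and the function names and error ratios are illustrative.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than budget-neutral the Error Budget is being spent."""
    return error_ratio / (1 - slo)


def is_p0(error_ratio: float, slo: float, threshold: float = 10.0) -> bool:
    """True when the burn rate exceeds the P0 threshold from the severity table."""
    return burn_rate(error_ratio, slo) > threshold


# Order service, SLO 99.95%: hypothetical observed error ratios
print(is_p0(error_ratio=0.006, slo=0.9995))  # True  (about 12x)
print(is_p0(error_ratio=0.001, slo=0.9995))  # False (about 2x)
```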
+ **Noise-reduction measures**:
+
+ ```yaml
+ # Alertmanager alert grouping
+ route:
+   group_by: ['service', 'alertname']
+   group_wait: 30s       # wait 30s to batch alerts of the same group
+   group_interval: 5m    # wait 5m before sending updates for the same group
+   repeat_interval: 4h   # re-notify about unresolved alerts every 4h
+
+ inhibit_rules:
+   # Cluster-level failures suppress pod-level alerts
+   - source_match:
+       severity: critical
+       scope: cluster
+     target_match:
+       severity: warning
+       scope: pod
+     equal: ['cluster']
+ ```
+
+ **After governance**: daily alerts dropped from 200+ to 15, with P0/P1 alerts making up less than 20%.
+
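+ ### 6.3 On-Call Rotation
+
+ ```
+ Rotation rules:
+ - Primary on-call + secondary on-call (two-person model)
+ - 7-day shifts, handoff on Mondays at 10:00
+ - The handoff meeting must cover:
+   - This week's alert statistics and trends
+   - Open issues and pending follow-ups
+   - Known risks (upcoming sales events or changes)
+   - Runbook updates
+ - On-call compensation: 1 extra day of compensatory leave per on-call shift
+ ```
+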
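A schedule that follows these rules can be generated with a few lines of code. This is a sketch under assumptions not stated in the source: the engineer names are placeholders, and the policy that each week's secondary becomes the next week's primary is one common choice, not something the rotation rules specify.

```python
from datetime import datetime, timedelta


def rotation(engineers, first_handoff, shifts):
    """Yield (handoff_time, primary, secondary) for consecutive 7-day shifts."""
    if len(engineers) < 2:
        raise ValueError("the two-person model needs at least 2 engineers")
    if first_handoff.weekday() != 0 or (first_handoff.hour, first_handoff.minute) != (10, 0):
        raise ValueError("handoff must be on a Monday at 10:00")
    n = len(engineers)
    for i in range(shifts):
        # Assumed policy: this week's secondary is next week's primary
        yield first_handoff + timedelta(weeks=i), engineers[i % n], engineers[(i + 1) % n]


# Hypothetical three-person pool; 2024-07-01 is a Monday
schedule = list(rotation(["sre-a", "sre-b", "sre-c"], datetime(2024, 7, 1, 10, 0), 4))
for handoff, primary, secondary in schedule:
    print(handoff.strftime("%Y-%m-%d %H:%M"), primary, secondary)
```

Rolling the secondary into the primary slot means every primary has already seen the previous week's incidents, which supports the handoff items listed above.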
+ ### 6.4 Runbook Automation
+
+ Common incident-handling procedures are written as executable Runbooks:
+
+ ```python
+ # Runbook example: automatic handling of Redis connection pool exhaustion.
+ # `Runbook` and the self.* helpers are assumed to come from the team's
+ # runbook framework; they are not defined in this excerpt.
+ class RedisConnectionPoolExhausted(Runbook):
+     trigger = "redis_connection_pool_usage > 90%"
+     severity = "P1"
+
+     def diagnose(self):
+         """Diagnostic steps."""
+         checks = [
+             self.check_redis_connections(),   # current connection count
+             self.check_slow_commands(),       # slow-command statistics
+             self.check_client_list(),         # distribution of client connections
+             self.check_recent_deployments(),  # recent deployments
+         ]
+         return self.analyze(checks)
+
+     def auto_remediate(self):
+         """Automatic remediation."""
+         # Step 1: kill connections idle for more than 300s.
+         # CLIENT KILL has no idle-time filter, so this hypothetical helper is
+         # expected to read the idle field from CLIENT LIST and issue
+         # CLIENT KILL ID <id> for each matching client.
+         self.kill_idle_clients(min_idle_seconds=300)
+
+         # Step 2: if usage is still above 80%, temporarily enlarge the pool
+         if self.check_pool_usage() > 80:
+             self.scale_pool(factor=1.5)
+
+         # Step 3: open a ticket to track the root cause
+         self.create_ticket(
+             title="Redis connection pool exhausted - root cause analysis needed",
+             assignee="on-call",
+             priority="high",
+         )
+
+     def escalate(self):
+         """Escalation path."""
+         return ["on-call-primary", "on-call-secondary", "sre-lead"]
+ ```
+
+ ---
+
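+ ## 7. Automated Remediation
+
+ ### 7.1 Remediation Framework
+
+ Automated remediation is organized into tiers:
+
+ | Level | Degree of automation | Examples |
+ |-------|----------------------|----------|
+ | L0 | Fully automatic | Pod restart, connection pool recycling, cache clearing |
+ | L1 | Semi-automatic (automatic diagnosis + human confirmation) | Database failover, traffic degradation |
+ | L2 | Assisted (diagnostic report + suggested actions) | Data inconsistency repair, capacity planning |
+ | L3 | Manual (Runbook-guided) | Architecture-level failures, data recovery |
+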
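One way to enforce these tiers in tooling is a gate that decides whether automation may run an action by itself. The sketch below is illustrative only: the action keys are derived from the table's examples, and the names `Level`, `LEVEL_BY_ACTION`, and `may_execute` are hypothetical.

```python
from enum import Enum


class Level(Enum):
    L0 = "fully automatic"
    L1 = "semi-automatic"
    L2 = "assisted"
    L3 = "manual"


# Classification taken from the "Examples" column of the table above
LEVEL_BY_ACTION = {
    "pod_restart": Level.L0,
    "connection_pool_recycle": Level.L0,
    "cache_clear": Level.L0,
    "database_failover": Level.L1,
    "traffic_degradation": Level.L1,
    "data_inconsistency_repair": Level.L2,
    "capacity_planning": Level.L2,
    "data_recovery": Level.L3,
}


def may_execute(action: str, human_approved: bool = False) -> bool:
    """Return True if the automation itself is allowed to carry out the action."""
    level = LEVEL_BY_ACTION.get(action, Level.L3)  # unknown actions default to manual
    if level is Level.L0:
        return True
    if level is Level.L1:
        return human_approved  # diagnosis is automatic, execution needs confirmation
    return False  # L2 and L3: a human performs the action


print(may_execute("pod_restart"))                             # True
print(may_execute("database_failover"))                       # False
print(may_execute("database_failover", human_approved=True))  # True
```

Defaulting unclassified actions to L3 keeps new failure types out of the automatic path until someone has reviewed them.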
+ ### 7.2 Self-Healing Implementation
+
+ ```yaml
+ # Kubernetes self-healing configuration (excerpt)
+ apiVersion: apps/v1
+ kind: Deployment
+ spec:
+   template:
+     spec:
+       containers:
+         - name: order-service
+           livenessProbe:
+             httpGet:
+               path: /healthz
+               port: 8080
+             initialDelaySeconds: 10
+             periodSeconds: 10
+             failureThreshold: 3  # restart after 3 consecutive failures
+           readinessProbe:
+             httpGet:
+               path: /readyz
+               port: 8080
+             periodSeconds: 5
+             failureThreshold: 2  # remove from the load balancer after 2 consecutive failures
+           resources:
+             requests:
+               memory: "512Mi"
+               cpu: "500m"
+             limits:
+               memory: "1Gi"
+               cpu: "1000m"
+       # Pod-level setting: restart containers automatically, including after OOMKilled
+       restartPolicy: Always
+ ```
+
+ ### 7.3 Automating Degradation
+
+ ```go
+ // Circuit breaker configuration for degradation.
+ // The metrics and alerting packages are project-internal and not shown here.
+ circuitBreaker := gobreaker.NewCircuitBreaker(gobreaker.Settings{
+     Name:        "payment-service",
+     MaxRequests: 5,                // requests allowed through in the half-open state
+     Interval:    10 * time.Second, // counting window while the breaker is closed
+     Timeout:     30 * time.Second, // how long the breaker stays open
+     ReadyToTrip: func(counts gobreaker.Counts) bool {
+         failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
+         // Trip once there are at least 10 requests and the failure ratio is 50% or more
+         return counts.Requests >= 10 && failureRatio >= 0.5
+     },
+     OnStateChange: func(name string, from, to gobreaker.State) {
+         metrics.CircuitBreakerState.WithLabelValues(name).Set(float64(to))
+         if to == gobreaker.StateOpen {
+             alerting.Notify(alerting.P1, fmt.Sprintf("Circuit breaker %s opened", name))
+         }
+     },
+ })
+ ```
+
+ ---
+
+ ## 8. Chaos Engineering in Practice
+
+ ### 8.1 Chaos Engineering Maturity Model
+
+ The team introduced Chaos Engineering incrementally:
+
+ | Stage | Months | Experiment scope | Tools |
+ |-------|--------|------------------|-------|
+ | Starting out | 7-8 | Non-production + single faults | LitmusChaos |
+ | Intermediate | 9-10 | Production + controlled faults | LitmusChaos + in-house tools |
+ | Mature | 11-12 | Production + Game Day | Full toolchain |
+
+ ### 8.2 Experiment Design
+
+ **Experiment 1: Redis primary node failure**
+
+ ```yaml
+ # LitmusChaos experiment definition
+ apiVersion: litmuschaos.io/v1alpha1
+ kind: ChaosEngine
+ metadata:
+   name: redis-failover-test
+ spec:
+   appinfo:
+     appns: production
+     applabel: app=redis-master
+   chaosServiceAccount: litmus-admin
+   experiments:
+     - name: pod-delete
+       spec:
+         components:
+           env:
+             - name: TOTAL_CHAOS_DURATION
+               value: '60'  # fault lasts 60 seconds
+             - name: CHAOS_INTERVAL
+               value: '10'
+             - name: FORCE
+               value: 'true'
+ ```
+
+ **Expected result**: Redis Sentinel completes the failover within 30 seconds, the application reconnects automatically, and the order service SLO is unaffected.
+
+ **Actual results**:
+
+ ```
+ First experiment (month 8):
+ - Failover took 45s (longer than expected)
+ - The application's connection pool did not handle the failover correctly, causing 120s of errors
+ - Order service availability dropped to 99.2% (short term)
+
+ Fixes:
+ - Tuned Sentinel down-after-milliseconds: 10000 → 5000
+ - Added Redis reconnect logic and connection pool health checks to the application
+ - Switched the Redis client to Sentinel mode
+
+ Second experiment (month 9):
+ - Failover took 12s
+ - The application recovered within 15s
+ - Order service availability stayed above 99.95%
+ ```
+
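To put the two runs in Error Budget terms, the error windows can be compared against the order service's 30-day budget. This is a rough upper-bound sketch: it assumes every request failed for the whole error window and that traffic is uniform, neither of which the source states, and `budget_consumed_pct` is an illustrative name.

```python
def budget_consumed_pct(outage_seconds: float, slo_percent: float = 99.95,
                        window_days: int = 30) -> float:
    """Share of the window's Error Budget used by a full outage of the given length."""
    budget_seconds = window_days * 24 * 3600 * (100 - slo_percent) / 100
    return 100 * outage_seconds / budget_seconds


print(f"first run (120s of errors): {budget_consumed_pct(120):.1f}%")  # 9.3%
print(f"second run (15s recovery): {budget_consumed_pct(15):.1f}%")    # 1.2%
```

Under these assumptions the first experiment alone would have used roughly a tenth of the monthly budget, which helps explain why the fixes were made before the experiment was repeated.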
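+ ### 8.3 Game Day
+
+ A Game Day (a failure drill involving the whole team) is held once per quarter:
+
+ ```
+ Game Day process:
+ 1. Preparation (1 week before)
+    - Define the failure scenario and its blast radius
+    - Prepare a rollback plan
+    - Notify all affected teams
+
+ 2. Execution (on the day)
+    09:30 - Team assembles; rules are explained
+    10:00 - Fault injected (the fault type is not disclosed)
+    10:00~12:00 - On-call team responds following the normal process
+    12:00 - Recovery confirmed
+
+ 3. Retrospective (same afternoon)
+    - Timeline review
+    - Issues found and improvement items
+    - Update Runbooks
+    - Assign action items
+ ```
+
+ **2024-Q4 Game Day results**:
+
+ - Scenario: network isolation of the availability zone hosting the primary database
+ - Found 3 previously unknown single points of failure
+ - Actual MTTR was 18 minutes, against an estimate of 30 minutes
+ - Identified 2 outdated steps in Runbooks
+
+ ---
+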
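+ ## 9. Results
+
+ ### 9.1 Key Metrics, Before and After
+
+ | Metric | Before | After | Change |
+ |--------|--------|-------|--------|
+ | Availability | 99.5% | 99.96% | +0.46pp |
+ | P1 incidents (per year) | 18 | 3 | -83% |
+ | MTTR | 2.4 h | 22 min | -85% |
+ | Toil share | 73% | 28% | -45pp |
+ | Alerts per day | 200+ | 15 | -92% |
+ | On-call night-time wake-ups | 3-5 per week | 0.3 per week | -92% |
+
+ ### 9.2 Key Lessons
+
+ 1. **SLOs come first**: without SLOs there is no Error Budget, and without an Error Budget there is no way to balance velocity against stability
+ 2. **Toil must be measurable to be managed**: audit first, then optimize, with data driving priorities
+ 3. **Introduce Chaos Engineering gradually**: start in non-production and move to production once confidence is established
+ 4. **Automation is an engineering investment**: the upfront cost is high, but the returns compound
+ 5. **Culture matters more than tools**: SRE is not just a toolchain but a shift in how the team thinks
+
+ ### 9.3 Next Steps
+
+ - AIOps: train anomaly-detection models on historical alert data
+ - Full-path load testing: automate end-to-end load tests ahead of major sales
+ - SRE platform: package the SRE toolchain as an internal PaaS
+
+ ---
+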
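+ ## Agent Checklist
+
+ - [ ] Covers an introduction to core Google SRE ideas
+ - [ ] Error Budget mechanism includes the calculation method and a real decision case
+ - [ ] Toil reduction includes audit data and a phased automation roadmap
+ - [ ] SLO setting uses an e-commerce API as a concrete case and includes SLI/SLO/SLA
+ - [ ] On-call optimization includes alert governance, the rotation policy, and Runbooks
+ - [ ] Automated remediation includes the tiered framework and code examples
+ - [ ] Chaos Engineering includes incremental adoption and the Game Day process
+ - [ ] All before/after data is clearly compared, and improvements are quantifiable
+ - [ ] Code examples use real tool syntax (Prometheus/K8s/Go/Python)
+ - [ ] File is longer than 250 lines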