@umacloud/knowledge 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (418)
  1. package/00-governance/governance-capabilities.md +557 -0
  2. package/00-governance/knowledge-map.md +39 -0
  3. package/00-governance/maintenance-policy.md +76 -0
  4. package/00-governance/review-checklist.md +81 -0
  5. package/README.md +13 -0
  6. package/ai/01-standards/agent-development-complete.md +691 -0
  7. package/ai/01-standards/llm-application-complete.md +488 -0
  8. package/ai/01-standards/mlops-complete.md +798 -0
  9. package/ai/01-standards/prompt-engineering-complete.md +646 -0
  10. package/ai/01-standards/rag-architecture-complete.md +649 -0
  11. package/ai/02-playbooks/llm-evaluation-playbook.md +847 -0
  12. package/ai/03-checklists/ai-project-checklist.md +215 -0
  13. package/ai/04-antipatterns/ai-antipatterns.md +661 -0
  14. package/ai/05-cases/case-rag-production.md +147 -0
  15. package/ai/06-glossary/ai-glossary.md +162 -0
  16. package/ai/agent-evaluation-benchmark.md +53 -0
  17. package/ai/ai-agent-memory-context-management.md +41 -0
  18. package/ai/ai-cost-capacity-optimization-playbook.md +42 -0
  19. package/ai/ai-data-security-and-compliance-playbook.md +37 -0
  20. package/ai/ai-domain-index-and-checklist.md +40 -0
  21. package/ai/ai-governance-maturity-model.md +50 -0
  22. package/ai/ai-model-selection-and-routing-strategy.md +47 -0
  23. package/ai/ai-observability-and-oncall-runbook.md +52 -0
  24. package/ai/ai-rag-engineering-playbook.md +42 -0
  25. package/ai/ai-red-team-and-safety-evaluation.md +42 -0
  26. package/ai/ai-release-readiness-and-rollback-gate.md +42 -0
  27. package/ai/llm-agent-engineering-deep-dive.md +57 -0
  28. package/ai/prompt-and-tool-guardrails.md +52 -0
  29. package/api/01-standards/enterprise-api-standards.md +198 -0
  30. package/api/01-standards/rest-api-design-guide.md +63 -0
  31. package/api/02-playbooks/api-pagination-playbook.md +93 -0
  32. package/api/02-playbooks/graphql-production-playbook.md +176 -0
  33. package/api/03-checklists/api-review-checklist.md +55 -0
  34. package/api/04-antipatterns/api-antipatterns.md +112 -0
  35. package/architecture/01-standards/api-gateway-patterns.md +496 -0
  36. package/architecture/01-standards/cloud-native-patterns.md +644 -0
  37. package/architecture/01-standards/distributed-systems-patterns.md +591 -0
  38. package/architecture/01-standards/event-driven-architecture.md +595 -0
  39. package/architecture/01-standards/microservices-patterns-complete.md +968 -0
  40. package/architecture/01-standards/microservices-patterns.md +495 -0
  41. package/architecture/01-standards/system-design-interview.md +664 -0
  42. package/architecture/02-playbooks/microservices-patterns-playbook.md +137 -0
  43. package/architecture/02-playbooks/migration-playbook.md +780 -0
  44. package/architecture/02-playbooks/system-design-playbook.md +779 -0
  45. package/architecture/03-checklists/architecture-decision-checklist.md +297 -0
  46. package/architecture/04-antipatterns/architecture-antipatterns.md +417 -0
  47. package/architecture/05-cases/case-netflix-microservices.md +413 -0
  48. package/architecture/06-glossary/architecture-glossary.md +164 -0
  49. package/architecture/adr-template-and-examples.md +38 -0
  50. package/architecture/api-gateway-deep-dive.md +1291 -0
  51. package/architecture/configuration-management.md +1162 -0
  52. package/architecture/distributed-transactions.md +1220 -0
  53. package/architecture/microservices-complete.md +735 -0
  54. package/architecture/resilience-and-disaster-patterns.md +37 -0
  55. package/architecture/service-governance.md +1198 -0
  56. package/architecture/system-architecture-deep-dive.md +37 -0
  57. package/backend/01-standards/analytics-and-growth.md +65 -0
  58. package/backend/01-standards/api-and-error-conventions.md +120 -0
  59. package/backend/01-standards/application-layering-and-packaging.md +160 -0
  60. package/backend/01-standards/auth-implementation.md +104 -0
  61. package/backend/01-standards/backend-framework-idioms.md +74 -0
  62. package/backend/01-standards/background-jobs-and-async.md +66 -0
  63. package/backend/01-standards/caching-strategies-complete.md +390 -0
  64. package/backend/01-standards/config-and-observability.md +77 -0
  65. package/backend/01-standards/data-modeling-and-persistence.md +94 -0
  66. package/backend/01-standards/django-complete.md +1765 -0
  67. package/backend/01-standards/email-and-notifications.md +64 -0
  68. package/backend/01-standards/fastapi-complete.md +925 -0
  69. package/backend/01-standards/file-upload-and-storage.md +66 -0
  70. package/backend/01-standards/graphql-api-complete.md +416 -0
  71. package/backend/01-standards/llm-application-standard.md +78 -0
  72. package/backend/01-standards/message-queue-patterns.md +379 -0
  73. package/backend/01-standards/microservices-and-distributed.md +78 -0
  74. package/backend/01-standards/nestjs-complete.md +2167 -0
  75. package/backend/01-standards/payment-integration.md +80 -0
  76. package/backend/01-standards/rate-limiting-complete.md +451 -0
  77. package/backend/01-standards/realtime-and-websocket.md +65 -0
  78. package/backend/01-standards/search-and-filtering.md +64 -0
  79. package/backend/01-standards/spring-boot-complete.md +445 -0
  80. package/backend/02-playbooks/api-design-playbook.md +718 -0
  81. package/backend/02-playbooks/email-send-playbook.md +130 -0
  82. package/backend/02-playbooks/file-upload-s3-playbook.md +153 -0
  83. package/backend/02-playbooks/typescript-enterprise-playbook.md +133 -0
  84. package/backend/02-playbooks/websocket-realtime-playbook.md +154 -0
  85. package/backend/03-checklists/api-launch-checklist.md +189 -0
  86. package/backend/04-antipatterns/backend-antipatterns.md +1051 -0
  87. package/blockchain/01-standards/blockchain-basics.md +557 -0
  88. package/blockchain/01-standards/smart-contract-development.md +1315 -0
  89. package/cicd/01-standards/deployment-and-delivery-standard.md +96 -0
  90. package/cicd/01-standards/github-actions-complete.md +473 -0
  91. package/cicd/01-standards/release-and-store-submission.md +75 -0
  92. package/cicd/02-playbooks/cicd-pipeline-playbook.md +144 -0
  93. package/cicd/02-playbooks/release-management-playbook.md +605 -0
  94. package/cicd/03-checklists/pipeline-security-checklist.md +168 -0
  95. package/cicd/04-antipatterns/cicd-antipatterns.md +589 -0
  96. package/cicd/05-cases/case-deployment-automation.md +221 -0
  97. package/cicd/05-cases/case-gitops-transformation.md +212 -0
  98. package/cicd/06-glossary/cicd-glossary.md +114 -0
  99. package/cicd/cicd-blueprint-deep-dive.md +38 -0
  100. package/cicd/release-readiness-gate.md +37 -0
  101. package/cloud-native/01-standards/container-security.md +741 -0
  102. package/cloud-native/01-standards/kubernetes-complete.md +812 -0
  103. package/cloud-native/02-playbooks/api-gateway-playbook.md +155 -0
  104. package/cloud-native/02-playbooks/gitops-with-argocd.md +760 -0
  105. package/cloud-native/02-playbooks/k8s-troubleshooting-playbook.md +1942 -0
  106. package/cloud-native/02-playbooks/message-queue-playbook.md +129 -0
  107. package/cloud-native/02-playbooks/multicloud-governance.md +726 -0
  108. package/cloud-native/02-playbooks/serverless-patterns.md +788 -0
  109. package/cloud-native/02-playbooks/service-mesh-playbook.md +612 -0
  110. package/cloud-native/02-playbooks/terraform-iac-playbook.md +143 -0
  111. package/cloud-native/03-checklists/container-security-checklist.md +431 -0
  112. package/cloud-native/03-checklists/k8s-production-readiness-checklist.md +460 -0
  113. package/cloud-native/04-antipatterns/container-antipatterns.md +660 -0
  114. package/cloud-native/04-antipatterns/k8s-antipatterns.md +743 -0
  115. package/cloud-native/05-cases/case-k8s-migration.md +478 -0
  116. package/cloud-native/05-cases/case-k8s-scaling.md +642 -0
  117. package/cloud-native/05-cases/case-k8s-security-incident.md +397 -0
  118. package/cloud-native/06-glossary/cloud-native-glossary.md +337 -0
  119. package/cross-platform/01-standards/cross-platform-frameworks.md +83 -0
  120. package/cross-platform/01-standards/platform-selection-and-architecture.md +77 -0
  121. package/data/01-standards/elasticsearch-complete.md +2098 -0
  122. package/data/01-standards/postgresql-complete.md +1613 -0
  123. package/data/01-standards/redis-complete.md +1527 -0
  124. package/data/02-playbooks/database-optimization-playbook.md +403 -0
  125. package/data/02-playbooks/elasticsearch-production-playbook.md +132 -0
  126. package/data/03-checklists/database-launch-checklist.md +187 -0
  127. package/data/04-antipatterns/database-antipatterns.md +873 -0
  128. package/data/05-cases/case-database-migration.md +310 -0
  129. package/data/06-glossary/database-glossary.md +440 -0
  130. package/data/data-governance-and-modeling-deep-dive.md +39 -0
  131. package/data-engineering/01-standards/airflow-complete.md +523 -0
  132. package/data-engineering/01-standards/kafka-complete.md +1521 -0
  133. package/data-engineering/02-playbooks/spark-etl-playbook.md +496 -0
  134. package/data-engineering/03-checklists/pipeline-launch-checklist.md +194 -0
  135. package/data-engineering/04-antipatterns/data-pipeline-antipatterns.md +684 -0
  136. package/data-engineering/05-cases/case-real-time-pipeline.md +355 -0
  137. package/data-engineering/06-glossary/data-engineering-glossary.md +429 -0
  138. package/database/01-standards/database-schema-standards.md +147 -0
  139. package/database/02-playbooks/postgresql-optimization-quick.md +52 -0
  140. package/database/02-playbooks/postgresql-performance-optimization.md +58 -0
  141. package/database/02-playbooks/postgresql-production-playbook.md +146 -0
  142. package/database/02-playbooks/redis-caching-playbook.md +117 -0
  143. package/database/03-checklists/database-review-checklist.md +50 -0
  144. package/database/04-antipatterns/database-antipatterns.md +112 -0
  145. package/design/01-standards/ui-design-system-complete.md +423 -0
  146. package/design/02-playbooks/design-handoff-playbook.md +254 -0
  147. package/design/02-playbooks/design-review-playbook.md +388 -0
  148. package/design/03-checklists/design-review-checklist.md +246 -0
  149. package/design/04-antipatterns/design-antipatterns.md +378 -0
  150. package/design/05-cases/case-design-system-adoption.md +328 -0
  151. package/design/06-glossary/design-glossary.md +329 -0
  152. package/design/ui-full-lifecycle-cross-platform-playbook.md +571 -0
  153. package/design/ux-system-deep-dive.md +38 -0
  154. package/design-systems/00-craft-rules.md +71 -0
  155. package/design-systems/aesthetic-families.md +43 -0
  156. package/design-systems/anti-ai-slop.md +162 -0
  157. package/design-systems/bold-geometric.md +120 -0
  158. package/design-systems/brutalist-bold.md +103 -0
  159. package/design-systems/editorial-clean.md +109 -0
  160. package/design-systems/glass-aurora.md +108 -0
  161. package/design-systems/modern-minimal.md +145 -0
  162. package/design-systems/premium-luxury.md +106 -0
  163. package/design-systems/product-type-design-map.md +48 -0
  164. package/design-systems/soft-warm.md +123 -0
  165. package/design-systems/tech-utility.md +113 -0
  166. package/desktop/01-standards/desktop-app-standard.md +72 -0
  167. package/desktop/01-standards/desktop-design.md +71 -0
  168. package/development/00-governance/document-template.md +41 -0
  169. package/development/01-standards/api-versioning-strategies.md +432 -0
  170. package/development/01-standards/authentication-patterns-complete.md +479 -0
  171. package/development/01-standards/css-architecture-complete.md +550 -0
  172. package/development/01-standards/database-migration-strategies.md +484 -0
  173. package/development/01-standards/elasticsearch-complete.md +347 -0
  174. package/development/01-standards/git-complete.md +371 -0
  175. package/development/01-standards/golang-complete.md +1565 -0
  176. package/development/01-standards/graphql-complete.md +298 -0
  177. package/development/01-standards/javascript-bundlers-complete.md +469 -0
  178. package/development/01-standards/javascript-typescript-complete.md +528 -0
  179. package/development/01-standards/jest-complete.md +275 -0
  180. package/development/01-standards/linux-complete.md +234 -0
  181. package/development/01-standards/logging-observability-complete.md +526 -0
  182. package/development/01-standards/microservices-communication.md +502 -0
  183. package/development/01-standards/mongodb-complete.md +406 -0
  184. package/development/01-standards/oauth2-complete.md +285 -0
  185. package/development/01-standards/performance-optimization-complete.md +289 -0
  186. package/development/01-standards/playwright-complete.md +247 -0
  187. package/development/01-standards/postgresql-complete.md +456 -0
  188. package/development/01-standards/pytest-complete.md +340 -0
  189. package/development/01-standards/python-async-programming.md +902 -0
  190. package/development/01-standards/python-complete.md +956 -0
  191. package/development/01-standards/python-decorators-complete.md +799 -0
  192. package/development/01-standards/python-design-patterns.md +2854 -0
  193. package/development/01-standards/python-packaging-distribution.md +420 -0
  194. package/development/01-standards/python-testing-strategies.md +607 -0
  195. package/development/01-standards/python-web-frameworks-comparison.md +471 -0
  196. package/development/01-standards/redis-complete.md +317 -0
  197. package/development/01-standards/rest-api-complete.md +316 -0
  198. package/development/01-standards/rust-complete.md +578 -0
  199. package/development/01-standards/typescript-advanced-types.md +1513 -0
  200. package/development/01-standards/web-security-complete.md +292 -0
  201. package/development/02-playbooks/api-design-playbook.md +810 -0
  202. package/development/02-playbooks/database-migration-playbook.md +580 -0
  203. package/development/02-playbooks/debugging-playbook.md +692 -0
  204. package/development/02-playbooks/feature-delivery-playbook.md +430 -0
  205. package/development/02-playbooks/incident-hotfix-playbook.md +387 -0
  206. package/development/02-playbooks/performance-optimization-playbook.md +531 -0
  207. package/development/02-playbooks/performance-tuning-playbook.md +652 -0
  208. package/development/02-playbooks/refactor-playbook.md +403 -0
  209. package/development/02-playbooks/release-playbook.md +469 -0
  210. package/development/03-checklists/architecture-review-checklist.md +168 -0
  211. package/development/03-checklists/data-migration-checklist.md +157 -0
  212. package/development/03-checklists/oncall-handover-checklist.md +173 -0
  213. package/development/03-checklists/pr-checklist.md +158 -0
  214. package/development/03-checklists/production-readiness-checklist.md +190 -0
  215. package/development/03-checklists/release-readiness-checklist.md +154 -0
  216. package/development/03-checklists/security-review-checklist.md +182 -0
  217. package/development/04-antipatterns/api-antipatterns.md +657 -0
  218. package/development/04-antipatterns/architecture-antipatterns.md +686 -0
  219. package/development/04-antipatterns/backend-antipatterns.md +648 -0
  220. package/development/04-antipatterns/cicd-antipatterns.md +540 -0
  221. package/development/04-antipatterns/code-smell-antipatterns.md +571 -0
  222. package/development/04-antipatterns/data-antipatterns.md +658 -0
  223. package/development/04-antipatterns/database-antipatterns.md +578 -0
  224. package/development/04-antipatterns/frontend-antipatterns.md +635 -0
  225. package/development/04-antipatterns/reliability-antipatterns.md +700 -0
  226. package/development/04-antipatterns/security-antipatterns.md +747 -0
  227. package/development/05-cases/case-api-version-migration.md +428 -0
  228. package/development/05-cases/case-authorization-hardening.md +383 -0
  229. package/development/05-cases/case-bluegreen-rollback.md +466 -0
  230. package/development/05-cases/case-cache-snowball-protection.md +485 -0
  231. package/development/05-cases/case-ci-cd-pipeline.md +544 -0
  232. package/development/05-cases/case-database-scaling.md +500 -0
  233. package/development/05-cases/case-db-hotspot-optimization.md +487 -0
  234. package/development/05-cases/case-incident-mttr-reduction.md +563 -0
  235. package/development/05-cases/case-microservice-migration.md +375 -0
  236. package/development/05-cases/case-performance-optimization.md +406 -0
  237. package/development/05-cases/case-security-incident-response.md +345 -0
  238. package/development/06-glossary/full-stack-glossary.md +166 -0
  239. package/development/09-maturity/quarterly-audit-template.md +35 -0
  240. package/development/11-ui-excellence/ui-aesthetic-system.md +41 -0
  241. package/development/11-ui-excellence/ui-engineering-excellence.md +435 -0
  242. package/development/12-scenarios/development-scenarios-guide.md +565 -0
  243. package/development/13-implementation-assets/implementation-toolkit.md +282 -0
  244. package/development/13-implementation-assets/knowledge-gates-execution.md +43 -0
  245. package/development/14-full-lifecycle/software-lifecycle-gates.md +511 -0
  246. package/development/15-lifecycle-templates/project-templates-collection.md +791 -0
  247. package/development/api-contract-and-versioning-guide.md +36 -0
  248. package/development/api-governance-complete.md +43 -0
  249. package/development/backend-engineering-complete.md +43 -0
  250. package/development/code-review-quality-complete.md +43 -0
  251. package/development/concurrency-reliability-complete.md +43 -0
  252. package/development/database-engineering-complete.md +43 -0
  253. package/development/engineering-effectiveness-complete.md +43 -0
  254. package/development/engineering-standards-deep-dive.md +38 -0
  255. package/development/frontend-engineering-complete.md +43 -0
  256. package/development/performance-capacity-complete.md +43 -0
  257. package/development/refactor-migration-complete.md +42 -0
  258. package/development/refactoring-and-techdebt-playbook.md +37 -0
  259. package/development/security-in-development-complete.md +43 -0
  260. package/devops/01-standards/cicd-pipeline-complete.md +262 -0
  261. package/devops/01-standards/docker-complete.md +1490 -0
  262. package/devops/01-standards/github-actions-complete.md +337 -0
  263. package/devops/01-standards/kubernetes-complete.md +638 -0
  264. package/devops/01-standards/terraform-complete.md +2117 -0
  265. package/devops/02-playbooks/docker-compose-playbook.md +233 -0
  266. package/devops/02-playbooks/docker-k8s-production-playbook.md +186 -0
  267. package/devops/02-playbooks/docker-production-playbook.md +952 -0
  268. package/edge-iot/01-standards/edge-iot-complete.md +473 -0
  269. package/experts/architect/api-design.md +178 -0
  270. package/experts/architect/methodology.md +124 -0
  271. package/experts/architect/security.md +75 -0
  272. package/experts/backend-lead/methodology.md +216 -0
  273. package/experts/devops/methodology.md +160 -0
  274. package/experts/frontend-lead/methodology.md +178 -0
  275. package/experts/product-manager/industry/ecommerce.md +43 -0
  276. package/experts/product-manager/industry/saas.md +40 -0
  277. package/experts/product-manager/methodology.md +97 -0
  278. package/experts/qa-lead/methodology.md +123 -0
  279. package/experts/qa-lead/test-strategy.md +128 -0
  280. package/experts/uiux-designer/methodology.md +125 -0
  281. package/frontend/01-standards/accessibility-complete.md +532 -0
  282. package/frontend/01-standards/accessibility-standard.md +74 -0
  283. package/frontend/01-standards/admin-dashboard-and-crud.md +72 -0
  284. package/frontend/01-standards/design-tokens-complete.md +444 -0
  285. package/frontend/01-standards/forms-and-validation.md +77 -0
  286. package/frontend/01-standards/frontend-architecture-and-layering.md +119 -0
  287. package/frontend/01-standards/i18n-and-localization.md +65 -0
  288. package/frontend/01-standards/nextjs-complete.md +451 -0
  289. package/frontend/01-standards/react-complete.md +713 -0
  290. package/frontend/01-standards/react-hooks-complete-guide.md +1100 -0
  291. package/frontend/01-standards/react-hooks-complete.md +1171 -0
  292. package/frontend/01-standards/seo-and-web-vitals.md +77 -0
  293. package/frontend/01-standards/state-management-complete.md +444 -0
  294. package/frontend/01-standards/vue-complete.md +499 -0
  295. package/frontend/01-standards/vue3-complete.md +2002 -0
  296. package/frontend/01-standards/web-framework-best-practices.md +64 -0
  297. package/frontend/01-standards/web-performance-complete.md +495 -0
  298. package/frontend/02-playbooks/accessibility-a11y-playbook.md +161 -0
  299. package/frontend/02-playbooks/frontend-performance-playbook.md +707 -0
  300. package/frontend/02-playbooks/i18n-internationalization-playbook.md +120 -0
  301. package/frontend/02-playbooks/performance-optimization-playbook.md +163 -0
  302. package/frontend/02-playbooks/react-nextjs-production-playbook.md +167 -0
  303. package/frontend/02-playbooks/react-state-management-playbook.md +173 -0
  304. package/frontend/03-checklists/component-quality-checklist.md +166 -0
  305. package/frontend/03-checklists/frontend-launch-checklist.md +299 -0
  306. package/frontend/04-antipatterns/frontend-antipatterns.md +886 -0
  307. package/frontend/05-cases/case-performance-optimization.md +274 -0
  308. package/harmony/01-standards/harmonyos-arkts-standard.md +75 -0
  309. package/harmony/01-standards/harmonyos-design.md +65 -0
  310. package/high-quality-engineering-playbook.md +54 -0
  311. package/incident/01-standards/incident-response-complete.md +303 -0
  312. package/incident/02-playbooks/chaos-engineering-playbook.md +883 -0
  313. package/incident/02-playbooks/postmortem-playbook.md +398 -0
  314. package/incident/03-checklists/incident-readiness-checklist.md +181 -0
  315. package/incident/04-antipatterns/incident-antipatterns.md +490 -0
  316. package/incident/05-cases/case-cascade-failure.md +176 -0
  317. package/incident/06-glossary/incident-glossary.md +114 -0
  318. package/incident/postmortem-and-response-deep-dive.md +39 -0
  319. package/industries/ecommerce/ecommerce-complete.md +631 -0
  320. package/industries/education/education-complete.md +555 -0
  321. package/industries/fintech/fintech-complete.md +501 -0
  322. package/industries/gaming/gaming-complete.md +587 -0
  323. package/industries/healthcare/healthcare-complete.md +452 -0
  324. package/low-code/01-standards/low-code-complete.md +944 -0
  325. package/miniprogram/01-standards/ai-common-mistakes.md +61 -0
  326. package/miniprogram/01-standards/miniprogram-custom-navbar-capsule.md +77 -0
  327. package/miniprogram/01-standards/miniprogram-design.md +61 -0
  328. package/miniprogram/01-standards/miniprogram-standard.md +81 -0
  329. package/mobile/01-standards/android-material-design.md +70 -0
  330. package/mobile/01-standards/flutter-complete.md +384 -0
  331. package/mobile/01-standards/ios-design-hig.md +78 -0
  332. package/mobile/01-standards/mobile-app-standard.md +85 -0
  333. package/mobile/01-standards/react-native-complete.md +352 -0
  334. package/mobile/02-playbooks/mobile-cross-platform-playbook.md +175 -0
  335. package/mobile/02-playbooks/mobile-performance.md +473 -0
  336. package/mobile/03-checklists/mobile-release-checklist.md +234 -0
  337. package/mobile/04-antipatterns/mobile-antipatterns.md +798 -0
  338. package/mobile/05-cases/case-app-performance.md +500 -0
  339. package/mobile/05-cases/case-app-startup-optimization.md +218 -0
  340. package/mobile/06-glossary/mobile-glossary.md +484 -0
  341. package/observability/01-standards/observability-standards.md +103 -0
  342. package/observability/02-playbooks/prometheus-grafana-playbook.md +135 -0
  343. package/observability/02-playbooks/structured-logging-playbook.md +73 -0
  344. package/observability/03-checklists/observability-checklist.md +54 -0
  345. package/observability/04-antipatterns/observability-antipatterns.md +106 -0
  346. package/operations/01-standards/prometheus-monitoring-complete.md +1578 -0
  347. package/operations/02-playbooks/capacity-planning-playbook.md +620 -0
  348. package/operations/03-checklists/production-launch-checklist.md +365 -0
  349. package/operations/04-antipatterns/operations-antipatterns.md +664 -0
  350. package/operations/05-cases/case-sre-practices.md +581 -0
  351. package/operations/06-glossary/operations-glossary.md +120 -0
  352. package/operations/aiops-anomaly-detection.md +758 -0
  353. package/operations/capacity-planning.md +1061 -0
  354. package/operations/chaos-engineering.md +659 -0
  355. package/operations/incident-command-system.md +38 -0
  356. package/operations/observability-complete.md +442 -0
  357. package/operations/slo-sli-playbook.md +517 -0
  358. package/operations/sre-operations-deep-dive.md +39 -0
  359. package/package.json +8 -0
  360. package/performance/01-standards/performance-and-scalability.md +80 -0
  361. package/performance/01-standards/performance-standards.md +156 -0
  362. package/performance/02-playbooks/query-optimization-playbook.md +103 -0
  363. package/performance/03-checklists/performance-checklist.md +56 -0
  364. package/performance/04-antipatterns/performance-antipatterns.md +146 -0
  365. package/product/01-standards/product-management-complete.md +285 -0
  366. package/product/02-playbooks/feature-launch-playbook.md +207 -0
  367. package/product/02-playbooks/user-research-playbook.md +532 -0
  368. package/product/03-checklists/feature-launch-checklist.md +275 -0
  369. package/product/04-antipatterns/product-antipatterns.md +355 -0
  370. package/product/05-cases/case-mvp-to-scale.md +384 -0
  371. package/product/06-glossary/product-glossary.md +462 -0
  372. package/product/feature-prioritization-framework.md +40 -0
  373. package/product/kpi-and-metric-tree.md +37 -0
  374. package/product/product-discovery-and-prd-deep-dive.md +41 -0
  375. package/quantum/01-standards/quantum-complete.md +1186 -0
  376. package/security/01-standards/api-security-complete.md +511 -0
  377. package/security/01-standards/container-runtime-security.md +574 -0
  378. package/security/01-standards/data-protection-gdpr.md +543 -0
  379. package/security/01-standards/owasp-top10-complete.md +1890 -0
  380. package/security/01-standards/secure-coding-baseline.md +90 -0
  381. package/security/01-standards/supply-chain-security.md +441 -0
  382. package/security/01-standards/web-security-checklist.md +108 -0
  383. package/security/01-standards/zero-trust-architecture.md +521 -0
  384. package/security/02-playbooks/auth-sso-playbook.md +166 -0
  385. package/security/02-playbooks/incident-response-security-playbook.md +588 -0
  386. package/security/02-playbooks/owasp-api-security-playbook.md +129 -0
  387. package/security/02-playbooks/payment-integration-playbook.md +119 -0
  388. package/security/02-playbooks/penetration-testing-playbook.md +517 -0
  389. package/security/03-checklists/security-audit-checklist.md +356 -0
  390. package/security/04-antipatterns/security-coding-antipatterns.md +580 -0
  391. package/security/05-cases/case-log4shell-incident.md +537 -0
  392. package/security/05-cases/case-major-breaches.md +468 -0
  393. package/security/06-glossary/security-glossary.md +212 -0
  394. package/security/compliance-automation.md +993 -0
  395. package/security/container-security.md +680 -0
  396. package/security/devsecops-complete.md +426 -0
  397. package/security/sast-dast-sca.md +775 -0
  398. package/security/secrets-management.md +594 -0
  399. package/security/security-architecture-deep-dive.md +37 -0
  400. package/security/threat-modeling-stride-playbook.md +40 -0
  401. package/seed-templates/auth-system.md +59 -0
  402. package/seed-templates/blog-content.md +94 -0
  403. package/seed-templates/dashboard.md +89 -0
  404. package/seed-templates/docs-site.md +73 -0
  405. package/seed-templates/e-commerce.md +50 -0
  406. package/seed-templates/saas-landing.md +92 -0
  407. package/seed-templates/settings-page.md +51 -0
  408. package/testing/01-standards/test-strategy-and-layering.md +83 -0
  409. package/testing/01-standards/testing-strategy-complete.md +422 -0
  410. package/testing/01-standards/unit-testing-best-practices.md +118 -0
  411. package/testing/02-playbooks/e2e-testing-playbook.md +988 -0
  412. package/testing/02-playbooks/testing-strategy-playbook.md +126 -0
  413. package/testing/03-checklists/test-strategy-checklist.md +208 -0
  414. package/testing/04-antipatterns/testing-antipatterns.md +718 -0
  415. package/testing/05-cases/case-testing-transformation.md +300 -0
  416. package/testing/06-glossary/testing-glossary.md +110 -0
  417. package/testing/risk-based-test-matrix.md +36 -0
  418. package/testing/testing-strategy-deep-dive.md +37 -0
@@ -0,0 +1,883 @@
+ ---
+ title: Chaos Engineering Playbook
+ version: 1.0.0
+ last_updated: 2026-03-28
+ owner: sre-team
+ tags: [chaos-engineering, Chaos-Monkey, LitmusChaos, Gremlin, fault-injection, resilience, gameday]
+ status: production
+ domain: incident
+ difficulty: intermediate
+ quality_score: 70
+ ---
+
+ # Developer: Excellent (11964948@qq.com)
+ # Function: End-to-end chaos engineering playbook
+ # Purpose: Guide teams in running chaos experiments safely to verify system resilience
+ # Created: 2026-03-28
+ # Last modified: 2026-03-28
+
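+ ## Goals
+
+ Establish a standardized chaos engineering practice that ensures:
+ - System resilience is verified through proactive fault injection rather than by waiting for failures to happen
+ - Every experiment has an explicit hypothesis, safety boundary, and rollback mechanism
+ - Experiment results are quantified and turned into actionable improvement items
+ - The team builds muscle memory for failure scenarios (GameDay drills)
+ - Chaos experiments in production are safe and controlled (minimal blast radius)
+
+ ## Applicable Scenarios
+
+ - Verifying high-availability architecture (multiple replicas, primary/replica failover, cross-availability-zone)
+ - Verifying degradation and circuit-breaking strategies
+ - Verifying autoscaling and self-healing capabilities
+ - Verifying monitoring and alerting coverage
+ - Verifying the effectiveness of emergency response plans
+ - Resilience acceptance before a new system goes live
+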
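+ ## Prerequisites
+
+ ### Environment Requirements
+
+ | Item | Requirement |
+ |------|------|
+ | Observability | Monitoring (Prometheus/Grafana), logging (ELK), and tracing (Jaeger) are deployed |
+ | Alerting | Alert rules for core services are configured and verified to fire |
+ | Container orchestration | Kubernetes 1.24+ (required by LitmusChaos) |
+ | Rollback capability | Services can roll back to the previous version within 1 minute |
+ | Team readiness | SRE, development, and operations have completed chaos engineering training |
+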
+ ### Toolchain Installation
+
+ ```bash
+ # LitmusChaos (Kubernetes-native, open source)
+ # Install the LitmusChaos control plane
+ kubectl apply -f https://litmuschaos.github.io/litmus/3.0.0/litmus-3.0.0.yaml
+ # Install the Chaos Runner
+ kubectl apply -f https://hub.litmuschaos.io/api/chaos/3.0.0/install
+
+ # Chaos Mesh (CNCF project, Kubernetes-native)
+ helm repo add chaos-mesh https://charts.chaos-mesh.org
+ helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --create-namespace \
+   --set chaosDaemon.runtime=containerd \
+   --set chaosDaemon.socketPath=/run/containerd/containerd.sock
+
+ # Gremlin (commercial; supports bare metal, containers, and cloud)
+ # Install the agent
+ curl -sL https://apt.gremlin.com/install.sh | sudo bash
+
+ # Toxiproxy (network fault injection)
+ docker run -d --name toxiproxy -p 8474:8474 -p 8000-9000:8000-9000 ghcr.io/shopify/toxiproxy
+
+ # stress-ng (resource stress testing)
+ sudo apt install stress-ng
+ ```
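Before scheduling an experiment, it can help to confirm that the CLIs used in the commands above are actually available on the machine that will drive it. The following is a minimal sketch; the helper name `missing_tools` is hypothetical and is not part of any of the tools listed.

```bash
# Print every command from the argument list that is not found on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
  local status=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}
```

A possible call is `missing_tools kubectl helm docker stress-ng`, which prints one missing command per line and prints nothing when all of them are installed.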
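+
+ ### Safety Prerequisites
+
+ - [ ] The chaos experiment charter has been approved by management
+ - [ ] The experiment scope (namespace/service/environment) is clearly defined
+ - [ ] Production experiments require confirmation from at least 2 people (four-eyes principle)
+ - [ ] The emergency stop mechanism (Kill Switch) has been verified to work
+ - [ ] The experiment window avoids business peaks (typically weekdays, 10:00-12:00)
+ - [ ] The customer support team has been notified in advance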
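Two items in the checklist above lend themselves to a small script: the experiment window and the Kill Switch. The sketch below assumes a LitmusChaos setup, where setting `engineState` to `stop` on a ChaosEngine aborts the running experiment. The function names, the `DRY_RUN` variable, and treating "weekdays, 10:00-12:00" as a hard rule are illustrative assumptions, not part of the playbook or of LitmusChaos.

```bash
# Return 0 if the given time falls inside the experiment window
# (weekdays, 10:00 up to but not including 12:00).
# $1: ISO day of week (1 = Monday .. 7 = Sunday), $2: hour of day (0-23).
in_experiment_window() {
  local dow="$1" hour="$2"
  [ "$dow" -ge 1 ] && [ "$dow" -le 5 ] && [ "$hour" -ge 10 ] && [ "$hour" -lt 12 ]
}

# Stop a LitmusChaos experiment by patching its ChaosEngine.
# With DRY_RUN=1 (the default here) the command is only printed.
# $1: ChaosEngine name, $2: namespace.
kill_switch() {
  local engine="$1" namespace="$2"
  local patch='{"spec":{"engineState":"stop"}}'
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "kubectl patch chaosengine $engine -n $namespace --type merge -p '$patch'"
  else
    kubectl patch chaosengine "$engine" -n "$namespace" --type merge -p "$patch"
  fi
}
```

The window check can be fed from the clock with `in_experiment_window "$(date +%u)" "$(date +%H)"`. Keeping the Kill Switch in dry-run mode by default is a deliberate choice in this sketch, so that rehearsing the checklist never touches a cluster by accident.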
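+
+ ---
+
+ ## 1. Experiment Design
+
+ ### 1.1 Chaos Experiment Methodology
+
+ ```yaml
+ Four-step chaos engineering method:
+   1. Establish a steady-state hypothesis:
+     Define quantitative metrics for normal system behavior
+     Examples:
+       - "API P99 latency < 500ms"
+       - "Order success rate > 99.5%"
+       - "Error rate < 0.1%"
+
+   2. Introduce real-world variables:
+     Inject disturbances that simulate real failure scenarios
+     Examples:
+       - Kill a Pod
+       - Inject network latency
+       - Exhaust CPU resources
+
+   3. Observe system behavior:
+     Compare how the steady-state metrics change during the experiment
+     Key questions:
+       - Does the system recover automatically?
+       - How long does recovery take?
+       - Do users notice the failure?
+
+   4. Verify or refute the hypothesis:
+     Hypothesis holds → system resilience verified
+     Hypothesis refuted → weakness found → create an improvement item
+ ```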
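Steps 1 and 4 of the method above can be sketched as a small check that compares observed metrics against the example thresholds (P99 latency < 500ms, order success rate > 99.5%, error rate < 0.1%). This is only an illustration: the function names are hypothetical, and the metric values are assumed to have been read from your monitoring system beforehand.

```bash
# Compare two decimal numbers with awk, because the shell's test
# builtin only handles integers.
# $1: left value, $2: operator (lt or gt), $3: right value.
num_cmp() {
  awk -v a="$1" -v op="$2" -v b="$3" 'BEGIN {
    if (op == "lt") exit !(a + 0 < b + 0)
    if (op == "gt") exit !(a + 0 > b + 0)
    exit 2
  }'
}

# Return 0 if all three example steady-state conditions hold.
# $1: API P99 latency in ms, $2: order success rate in %, $3: error rate in %.
steady_state_holds() {
  num_cmp "$1" lt 500 && num_cmp "$2" gt 99.5 && num_cmp "$3" lt 0.1
}

# Step 4: turn the observation into a verdict.
verdict() {
  if steady_state_holds "$@"; then
    echo "hypothesis holds: resilience verified"
  else
    echo "hypothesis refuted: weakness found, create an improvement item"
  fi
}
```

For example, `verdict 320 99.9 0.02` reports that the hypothesis holds, while `verdict 1200 99.9 0.02` reports it as refuted because the latency threshold is exceeded.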
+
+ ### 1.2 Fault Injection Types
+
+ ```yaml
+ Infrastructure layer:
+   Node failures:
+     - Node offline (cordon + drain)
+     - Node reboot
+     - Node network unreachable
+
+   Resource exhaustion:
+     - CPU at 100%
+     - Memory OOM
+     - Disk full
+     - File descriptor exhaustion
+
+ Application layer:
+   Pod/container failures:
+     - Random Pod kill
+     - Container OOM kill
+     - Container start failure (image pull failure)
+     - Readiness probe failure
+
+   Process failures:
+     - Main process kill
+     - JVM Full GC (Java)
+     - Thread pool exhaustion
+
+ Network layer:
+   Connectivity:
+     - Network partition (Pod A cannot reach Pod B)
+     - DNS resolution failure
+     - Port unreachable
+
+   Quality degradation:
+     - Latency injection (100ms ~ 5s)
+     - Packet loss (1% ~ 50%)
+     - Bandwidth limit
+     - Connection reset (RST)
+
+ Dependency layer:
+   Database:
+     - Primary down
+     - Increased replica lag
+     - Connection pool exhaustion
+     - Slow queries
+
+   Cache:
+     - Redis unreachable
+     - Cache flushed
+     - Increased cache latency
+
+   Message queue:
+     - Kafka broker down
+     - Consumer lag
+     - Topic unavailable
+
+   External services:
+     - Third-party API timeout
+     - Third-party API returns errors
+     - Third-party API rate limiting
+ ```
179
+
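+ ### 1.3 Experiment template
+
+ ```yaml
+ # Experiment card template
+ Experiment ID: CHAOS-2026-001
+ Experiment name: Order service single-Pod failure recovery verification
+ Experiment date: 2026-03-28 10:00-12:00
+ Owner: SRE-Zhang San
+ Participants: Dev-Li Si, Ops-Wang Wu
+
+ Steady-state hypothesis:
+   - Order API P99 latency < 500ms
+   - Order creation success rate > 99.5%
+   - Alert fires within 2 minutes
+
+ Fault type: Pod Kill (randomly kill 1/3 of the order service Pods)
+ Blast radius: order service only, at most 1 Pod affected
+ Duration: 5 minutes
+
+ Preconditions:
+   - Order service is running normally with 3 replicas
+   - Monitoring dashboards are open
+   - Rollback commands are prepared
+
+ Abort conditions (stop immediately if any one is triggered):
+   - Order creation success rate < 95%
+   - P99 latency > 2s sustained for 3 minutes
+   - Data inconsistency appears
+
+ Expected results:
+   - Kubernetes recreates the Pod automatically (< 60s)
+   - Requests are load-balanced to the surviving Pods in the meantime
+   - Users notice nothing, or only a brief latency increase
+
+ Actual results: [fill in after the experiment]
+ Improvement items: [fill in after the experiment]
+ ```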
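The abort conditions on the card are simple enough to check in code rather than by eye. A minimal sketch, assuming P99 is sampled once per minute and each reading summarizes the preceding minute (so three consecutive readings cover the 3-minute window); `should_abort` and its parameters are hypothetical names, not part of any tool mentioned here.

```python
def should_abort(success_rate_pct, p99_samples_s, data_inconsistent,
                 sample_interval_s=60):
    """Return True if any abort condition from the experiment card is met.

    p99_samples_s: recent P99 latency readings in seconds, oldest first.
    Assumption: each reading covers the preceding sample interval.
    """
    if data_inconsistent:
        return True
    if success_rate_pct < 95.0:
        return True
    # Number of consecutive readings needed to span 3 minutes.
    needed = 180 // sample_interval_s
    recent = p99_samples_s[-needed:]
    return len(recent) == needed and all(v > 2.0 for v in recent)
```

A single latency spike does not trigger the abort; only a full window above 2 s does, which mirrors the "sustained for 3 minutes" wording on the card.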
+
+ ---
+
+ ## 2. Tools in Practice
+
+ ### 2.1 LitmusChaos experiments
+
+ ```yaml
+ # Pod kill experiment
+ apiVersion: litmuschaos.io/v1alpha1
+ kind: ChaosEngine
+ metadata:
+   name: order-service-pod-kill
+   namespace: production
+ spec:
+   appinfo:
+     appns: production
+     applabel: app=order-service
+     appkind: deployment
+   engineState: active
+   chaosServiceAccount: litmus-admin
+   experiments:
+     - name: pod-delete
+       spec:
+         components:
+           env:
+             - name: TOTAL_CHAOS_DURATION
+               value: '300' # 5 minutes
+             - name: CHAOS_INTERVAL
+               value: '60' # kill once every 60 seconds
+             - name: FORCE
+               value: 'true'
+             - name: PODS_AFFECTED_PERC
+               value: '33' # kill 1/3 of the Pods
+         probe:
+           - name: order-api-health
+             type: httpProbe
+             httpProbe/inputs:
+               url: http://order-service.production:8080/health
+               method:
+                 get:
+                   criteria: ==
+                   responseCode: '200'
+             mode: Continuous
+             runProperties:
+               probeTimeout: 5
+               retry: 3
+               interval: 10
+ ```
+
+ ```yaml
+ # Network latency experiment
+ apiVersion: litmuschaos.io/v1alpha1
+ kind: ChaosEngine
+ metadata:
+   name: order-db-network-latency
+   namespace: production
+ spec:
+   appinfo:
+     appns: production
+     applabel: app=order-service
+     appkind: deployment
+   engineState: active
+   chaosServiceAccount: litmus-admin
+   experiments:
+     - name: pod-network-latency
+       spec:
+         components:
+           env:
+             - name: TOTAL_CHAOS_DURATION
+               value: '180' # 3 minutes
+             - name: NETWORK_LATENCY
+               value: '500' # inject 500ms of latency
+             - name: JITTER
+               value: '100' # jitter of ±100ms
+             - name: DESTINATION_IPS
+               value: '10.0.1.100' # database IP
+             - name: DESTINATION_PORTS
+               value: '5432' # PostgreSQL port
+ ```
+
+ ### 2.2 Chaos Mesh experiments
+
+ ```yaml
+ # CPU stress experiment
+ apiVersion: chaos-mesh.org/v1alpha1
+ kind: StressChaos
+ metadata:
+   name: order-service-cpu-stress
+   namespace: production
+ spec:
+   mode: one # affect 1 Pod
+   selector:
+     namespaces:
+       - production
+     labelSelectors:
+       app: order-service
+   stressors:
+     cpu:
+       workers: 4 # 4 CPU worker threads
+       load: 90 # 90% load per worker
+   duration: '5m'
+
+ ---
+ # Network partition experiment (order service → payment service unreachable)
+ apiVersion: chaos-mesh.org/v1alpha1
+ kind: NetworkChaos
+ metadata:
+   name: order-payment-partition
+   namespace: production
+ spec:
+   action: partition
+   mode: all
+   selector:
+     namespaces:
+       - production
+     labelSelectors:
+       app: order-service
+   direction: to
+   target:
+     mode: all
+     selector:
+       namespaces:
+         - production
+       labelSelectors:
+         app: payment-service
+   duration: '3m'
+
+ ---
+ # IO fault experiment (disk latency)
+ apiVersion: chaos-mesh.org/v1alpha1
+ kind: IOChaos
+ metadata:
+   name: order-db-io-latency
+   namespace: production
+ spec:
+   action: latency
+   mode: one
+   selector:
+     namespaces:
+       - production
+     labelSelectors:
+       app: postgresql
+   volumePath: /var/lib/postgresql/data
+   path: '*'
+   delay: '200ms'
+   percent: 80 # 80% of IO operations are affected
+   duration: '3m'
+ ```
+
+ ### 2.3 Gremlin experiments
+
+ ```bash
+ # CPU attack
+ gremlin attack cpu --length 300 --cores 2 --percent 90 \
+   --target-type kubernetes --namespace production --label app=order-service
+
+ # Network latency
+ gremlin attack network latency --length 180 --delay 500 --jitter 100 \
+   --target-type kubernetes --namespace production --label app=order-service \
+   --port 5432
+
+ # Process kill
+ gremlin attack process kill --length 60 --interval 30 \
+   --process java \
+   --target-type kubernetes --namespace production --label app=order-service
+
+ # DNS failure
+ gremlin attack dns --length 120 \
+   --domain payment-service.production.svc.cluster.local \
+   --target-type kubernetes --namespace production --label app=order-service
+
+ # Disk fill
+ gremlin attack disk fill --length 180 --dir /tmp --percent 95 \
+   --target-type kubernetes --namespace production --label app=order-service
+ ```
+
+ ### 2.4 Toxiproxy (network fault injection)
+
+ ```bash
+ # Create a proxy (in front of the Redis connection)
+ toxiproxy-cli create redis-proxy -l 0.0.0.0:6380 -u redis:6379
+
+ # Add latency
+ toxiproxy-cli toxic add redis-proxy -t latency -a latency=500 -a jitter=100
+
+ # Add a timeout: data stops flowing and the connection is closed after 3000 ms
+ # (Toxiproxy works at the TCP level, so this is a stalled connection, not packet loss)
+ toxiproxy-cli toxic add redis-proxy -t timeout -a timeout=3000
+
+ # Limit bandwidth
+ toxiproxy-cli toxic add redis-proxy -t bandwidth -a rate=10 # 10 KB/s
+
+ # Reset connections
+ toxiproxy-cli toxic add redis-proxy -t reset_peer -a timeout=2000
+
+ # List the status of all proxies
+ toxiproxy-cli list
+
+ # Remove a toxic (restore normal behavior)
+ toxiproxy-cli toxic remove redis-proxy -n latency_downstream
+ ```
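The same toxics can be added programmatically through Toxiproxy's HTTP API, which is convenient inside test suites. The sketch below only builds the request for `POST /proxies/{name}/toxics`; it assumes the API is reachable at `localhost:8474` (the port published by the `docker run` command earlier), and `build_add_toxic_request` is a hypothetical helper name.

```python
import json
import urllib.request

def build_add_toxic_request(proxy, toxic_type, attributes,
                            stream="downstream", toxicity=1.0,
                            api="http://localhost:8474"):
    """Build (but do not send) a request that adds a toxic to a proxy.

    The API base URL is an assumption; adjust it to your deployment.
    """
    payload = {
        "type": toxic_type,
        "stream": stream,        # "downstream" or "upstream"
        "toxicity": toxicity,    # probability the toxic applies to a connection
        "attributes": attributes,
    }
    return urllib.request.Request(
        url=f"{api}/proxies/{proxy}/toxics",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Equivalent of the latency example above: 500 ms latency with 100 ms jitter.
req = build_add_toxic_request("redis-proxy", "latency",
                              {"latency": 500, "jitter": 100})
# urllib.request.urlopen(req)  # uncomment to send to a running Toxiproxy
```

Building the request separately from sending it keeps the payload easy to inspect in a dry run before anything touches the proxy.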
+
+ ---
+
+ ## 3. Safe Execution
+
+ ### 3.1 Blast radius control
+
+ ```yaml
+ Principle of minimal impact:
+   Environment tiers:
+     Level 1 - Dev/test environments: any experiment, no approval needed
+     Level 2 - Staging environment: SRE approval required, aggressive experiments allowed
+     Level 3 - Production environment: approval by 2 people (four-eyes principle), tightly controlled
+
+   Gradual escalation:
+     Week 1: validate experiment scripts in the dev environment
+     Week 2: run the full set of experiments in Staging
+     Week 3: minimal-impact experiments in production (1 Pod / 1% of traffic)
+     Week 4: widen the production scope (based on the previous week's results)
+
+   Traffic isolation:
+     - Tag experiment traffic with a Feature Flag
+     - Affect only internal test traffic during the experiment
+     - Production user traffic takes the normal path
+
+   Kill Switch (emergency stop):
+     # LitmusChaos
+     kubectl patch chaosengine <name> -n production --type merge -p '{"spec":{"engineState":"stop"}}'
+
+     # Chaos Mesh
+     kubectl delete stresschaos <name> -n production
+
+     # Gremlin
+     gremlin halt
+
+     # Generic: delete all chaos experiments
+     kubectl delete chaosengine --all -n production
+     kubectl delete networkchaos,stresschaos,iochaos --all -n production
+ ```
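The environment tiers above translate naturally into a pre-flight gate. This is a minimal sketch under simplifying assumptions: approvals are counted as distinct names, the SRE role requirement for Staging is not checked, and `REQUIRED_APPROVALS` and `may_run_experiment` are hypothetical names.

```python
# Approvals required per environment, following the three tiers above.
REQUIRED_APPROVALS = {"dev": 0, "test": 0, "staging": 1, "production": 2}

def may_run_experiment(environment, approvers):
    """Return True if enough distinct people approved the experiment."""
    if environment not in REQUIRED_APPROVALS:
        # Unknown environments are rejected rather than silently allowed.
        raise ValueError(f"unknown environment: {environment!r}")
    # Normalize names so the same person is not counted twice.
    distinct = {name.strip().lower() for name in approvers if name.strip()}
    return len(distinct) >= REQUIRED_APPROVALS[environment]
```

Rejecting unknown environments by default keeps a typo such as `prod` from bypassing the four-eyes check.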
+
+ ### 3.2 Monitoring and alerting
+
+ ```yaml
+ Dashboards to watch during an experiment:
+   Golden Signals:
+     Latency:
+       - API P50 / P99 / P99.9 latency
+       - Database query latency
+       - Upstream service call latency
+
+     Traffic:
+       - Request QPS (total/successful/failed)
+       - Active connections
+
+     Errors:
+       - HTTP 5xx ratio
+       - Business error code ratio
+       - Timeout ratio
+
+     Saturation:
+       - CPU / memory / disk utilization
+       - Connection pool usage
+       - Message queue backlog
+
+   Grafana Dashboard setup:
+     # Create a dedicated chaos experiment dashboard
+     # with the following panels:
+     - Experiment status indicator (running/completed/aborted)
+     - Live steady-state comparison (pre-experiment baseline vs current)
+     - List of affected Pods/nodes
+     - Alert firing timeline
+
+   Alert rules (additionally enabled during experiments):
+     # Prometheus alert rules
+     groups:
+       - name: chaos-experiment-alerts
+         rules:
+           - alert: ChaosExperimentSLOBreach
+             expr: |
+               (sum(rate(http_requests_total{status=~"5.."}[1m]))
+                 / sum(rate(http_requests_total[1m]))) > 0.05
+             for: 2m
+             labels:
+               severity: critical
+               team: sre
+             annotations:
+               summary: "Error rate exceeded 5% during the chaos experiment; consider aborting"
+
+           - alert: ChaosExperimentLatencyBreach
+             expr: |
+               histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le)) > 2
+             for: 3m
+             labels:
+               severity: warning
+             annotations:
+               summary: "P99 latency exceeded 2s during the chaos experiment"
+ ```
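To make the latency rule less of a black box, here is a simplified sketch of how a quantile is estimated from cumulative histogram buckets by linear interpolation, which is the approach Prometheus documents for `histogram_quantile`. It is an illustration with made-up bucket counts, not a reimplementation of every Prometheus edge case.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound and ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Made-up request-duration buckets (seconds) for 1000 requests.
BUCKETS = [(0.5, 900), (1.0, 960), (2.0, 985), (5.0, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, BUCKETS)   # falls in the (2.0, 5.0] bucket
latency_breach = p99 > 2                  # same threshold as the alert rule
```

Because the estimate is interpolated within a bucket, its precision depends on the bucket boundaries; a bucket edge at or near the 2 s threshold makes the alert more trustworthy.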
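+
+ ### 3.3 Experiment execution flow
+
+ ```yaml
+ Before execution (T-30min):
+   - [ ] Confirm the experiment card is approved
+   - [ ] All participants are in place
+   - [ ] Monitoring dashboards are open
+   - [ ] Kill Switch commands are prepared and tested
+   - [ ] Record current steady-state metrics (baseline)
+   - [ ] Notify related teams (customer support/product)
+
+ During execution:
+   - [ ] Inject the fault as described on the experiment card
+   - [ ] Watch the monitoring dashboards continuously
+   - [ ] Record timestamps of key events
+   - [ ] If an abort condition is triggered, run the Kill Switch immediately
+   - [ ] Wait for the fault duration to elapse
+
+ After execution:
+   - [ ] Confirm the fault has been fully cleared
+   - [ ] Confirm the system has returned to steady state
+   - [ ] Collect monitoring data from the experiment window
+   - [ ] Compare metrics before and after the experiment
+   - [ ] Fill in the experiment result report
+ ```
+
+ ---
+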
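+ ## 4. Experiment Scenario Library
+
+ ### 4.1 Basic scenarios (beginner)
+
+ ```yaml
+ Scenario 1 - Single Pod failure:
+   Fault type: kill 1 Pod
+   Verification goal: K8s recreates the Pod automatically and traffic shifts automatically
+   Expected result: new Pod is Running within 30s, users notice nothing
+   Suitable for: the first chaos experiment
+
+ Scenario 2 - Cache outage:
+   Fault type: Redis unreachable for 60s
+   Verification goal: cache degradation → query the DB directly
+   Expected result: latency rises but the service stays available; the cache is refilled automatically after recovery
+   Note: watch whether the DB connection pool becomes saturated
+
+ Scenario 3 - External API timeout:
+   Fault type: third-party payment API delayed by 10s
+   Verification goal: circuit breaker opens and a friendly message is shown
+   Expected result: fast failure once the breaker opens, other features unaffected
+ ```
+
+ ### 4.2 Advanced scenarios (intermediate)
+
+ ```yaml
+ Scenario 4 - Database primary/replica switchover:
+   Fault type: kill the PostgreSQL primary node
+   Verification goal: automatic failover and automatic application reconnect
+   Expected result: failover < 30s, write interruption < 10s
+   Risk: a small number of transactions may be lost (depends on replication lag)
+
+ Scenario 5 - Network partition:
+   Fault type: availability zone A isolated from availability zone B
+   Verification goal: whether cross-AZ redundancy takes effect
+   Expected result: each availability zone serves traffic independently
+   Note: distributed locks and leader election may be affected
+
+ Scenario 6 - Cascading failure:
+   Fault type: order service CPU 100% → upstream timeouts → gateway queuing
+   Verification goal: circuit breaking, rate limiting, and degradation working together
+   Expected result: the failure is contained in the order service and does not spread system-wide
+ ```
+
+ ### 4.3 Expert scenarios (production-grade GameDay)
+
+ ```yaml
+ Scenario 7 - Full availability zone failure:
+   Fault type: simulate one AZ being completely unavailable
+   Verification goal: cross-AZ disaster recovery capability
+   Expected result: the remaining AZs absorb all traffic; performance drops slightly but the service stays available
+   Prerequisite: the architecture supports cross-AZ deployment
+
+ Scenario 8 - Multiple dependency failure:
+   Fault type: shut down Redis and Kafka at the same time
+   Verification goal: system behavior when several dependencies fail together
+   Expected result: core read/write paths stay available (degraded mode); async tasks pile up but are not lost
+
+ Scenario 9 - Traffic surge:
+   Fault type: 10x traffic surge (simulating a flash sale or marketing campaign)
+   Verification goal: autoscaling and rate limiting
+   Expected result: HPA scales out, rate limiting keeps the backend from being overwhelmed
+   Tools: k6 / Locust combined with chaos experiments
+ ```
+
+ ---
+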
+ ## 5. Result Analysis
+
+ ### 5.1 Experiment result template
+
+ ```markdown
+ # Chaos Experiment Result Report
+
+ ## Basic information
+ - Experiment ID: CHAOS-2026-001
+ - Experiment name: Order service single-Pod failure recovery verification
+ - Execution time: 2026-03-28 10:00-10:30
+ - Environment: Staging / Production
+ - Executed by: SRE-Zhang San
+
+ ## Steady-state hypothesis verification
+
+ | Hypothesis | Baseline | During experiment | After recovery | Verdict |
+ |------|--------|-----------|---------|------|
+ | P99 latency < 500ms | 230ms | 480ms (peak) | 240ms | Pass |
+ | Success rate > 99.5% | 99.98% | 99.2% (lowest) | 99.97% | Fail |
+ | Alert < 2min | - | 1.5min | - | Pass |
+
+ ## Timeline
+ | Time | Event |
+ |------|------|
+ | 10:00 | Experiment starts, Pod order-service-abc is killed |
+ | 10:00:05 | K8s detects the Pod is unhealthy |
+ | 10:00:15 | New Pod creation begins |
+ | 10:00:35 | New Pod enters the Running state |
+ | 10:00:45 | New Pod passes the readiness check and receives traffic |
+ | 10:01:30 | Alert fires (Pod Restart) |
+ | 10:05:00 | Metrics return to steady state |
+
+ ## Findings and improvements
+
+ ### Finding 1: Success rate briefly dropped below 99.5%
+ - Cause: requests in flight when the Pod was killed failed outright
+ - Impact: about 50 requests returned 502
+ - Improvement: configure a Pod preStop hook so the Pod is removed from the Service before shutting down gracefully on SIGTERM
+ - Priority: P1
+ - Owner: Dev-Li Si
+ - Due date: 2026-04-05
+
+ ### Finding 2: Alert fired slowly (1.5min)
+ - Cause: the alert evaluation interval is 1 minute
+ - Improvement: shorten the evaluation interval for Pod failure alerts to 30 seconds
+ - Priority: P2
+ - Owner: Ops-Wang Wu
+ - Due date: 2026-04-10
+ ```
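The durations quoted in the report (45 s until the new Pod serves traffic, 1.5 min until the alert fires) follow directly from the timeline. A small sketch of that arithmetic, using the sample timestamps from the template above; the dictionary keys and the `seconds_between` helper are illustrative names.

```python
from datetime import datetime

def seconds_between(start, end, fmt="%H:%M:%S"):
    """Elapsed whole seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Sample events from the timeline table (10:00 written out as 10:00:00).
TIMELINE = {
    "fault_injected": "10:00:00",
    "pod_ready": "10:00:45",
    "alert_fired": "10:01:30",
    "steady_state_restored": "10:05:00",
}

start = TIMELINE["fault_injected"]
time_to_ready_s = seconds_between(start, TIMELINE["pod_ready"])               # 45
time_to_alert_s = seconds_between(start, TIMELINE["alert_fired"])             # 90
time_to_steady_s = seconds_between(start, TIMELINE["steady_state_restored"])  # 300

# Compare against the card's targets: recreate < 60 s, alert within 2 minutes.
recovery_ok = time_to_ready_s < 60
alert_ok = time_to_alert_s < 120
```

Recording timestamps to the second during the experiment is what makes these comparisons possible afterwards.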
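+
+ ### 5.2 Resilience scorecard
+
+ ```yaml
+ Scoring dimensions (1-5 points each):
+
+   Automatic recovery:
+     5: Failure recovers automatically, users notice nothing
+     4: Automatic recovery, users see a brief delay (< 5s)
+     3: Automatic recovery, users see a noticeable delay (5-30s)
+     2: Manual intervention is required to recover
+     1: Cannot recover, a rollback is required
+
+   Fault isolation:
+     5: Failure is fully isolated, no other service is affected
+     4: Minor impact on neighboring services (latency increase)
+     3: Non-core features of some downstream services are affected
+     2: Multiple services are degraded
+     1: Cascading failure, the whole system is affected
+
+   Observability:
+     5: Failure detected within 30s
+     4: Detected within 1 minute
+     3: Detected within 5 minutes
+     2: Found only by manual inspection
+     1: Failure not covered by monitoring
+
+   Degraded experience:
+     5: Full functionality after degradation, only slightly slower
+     4: Non-core features unavailable, core features normal
+     3: Core features partially available
+     2: Core features significantly impaired
+     1: Service completely unavailable
+
+ Overall score:
+   18-20: Excellent (production ready)
+   14-17: Good (can go live, needs continuous improvement)
+   10-13: Pass (fix issues, then retest)
+   < 10: Fail (resilience design needs rework)
+ ```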
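With four dimensions scored 1 to 5, the total ranges from 4 to 20 and maps onto the bands above. A minimal sketch of that mapping; the dimension keys and the `resilience_grade` function are hypothetical names chosen for this example.

```python
DIMENSIONS = ("auto_recovery", "fault_isolation", "observability", "degraded_experience")

def resilience_grade(scores):
    """Return (total, grade) for a dict with one 1-5 score per dimension."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    total = sum(scores[name] for name in DIMENSIONS)
    if total >= 18:
        grade = "excellent"   # production ready
    elif total >= 14:
        grade = "good"        # can go live, keep improving
    elif total >= 10:
        grade = "pass"        # fix and retest
    else:
        grade = "fail"        # rework the resilience design
    return total, grade

example = resilience_grade({
    "auto_recovery": 4,
    "fault_isolation": 5,
    "observability": 4,
    "degraded_experience": 4,
})
```

Validating that every dimension is present avoids a partially filled scorecard being graded as a low but valid total.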
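+
+ ---
+
+ ## 6. GameDay Drills
+
+ ### 6.1 Organizing a GameDay
+
+ ```yaml
+ GameDay definition:
+   A dedicated day on which the team runs chaos experiments together.
+   The goal is to verify system resilience and team response in a real (or near-real) environment.
+
+ Frequency: once per quarter (production) / once per month (Staging)
+
+ Roles:
+   Experiment designer (SRE):
+     - Design experiment scenarios
+     - Prepare fault injection scripts
+     - Control the pace of the experiment
+
+   Observers (development + operations):
+     - Monitor system metrics
+     - Record abnormal behavior
+     - Carry out incident response
+
+   Referee (Tech Lead):
+     - Decide whether an experiment must be aborted
+     - Assess the quality of the team's response
+     - Summarize the experiment results
+
+   Scribe:
+     - Record the full timeline
+     - Record all findings
+     - Compile the experiment report
+
+ Agenda (half day):
+   09:00-09:30: Kickoff and review of previous improvement items
+   09:30-10:00: Scenario walkthrough and confirmation of safety measures
+   10:00-11:30: Experiment execution (3-4 scenarios)
+   11:30-12:00: Immediate retrospective and improvement items
+ ```
+
+ ### 6.2 GameDay evaluation form
+
+ ```yaml
+ Team response evaluation:
+ | Item | Score (1-5) | Notes |
+ |--------|----------|------|
+ | Speed of fault detection | | |
+ | Communication and collaboration | | |
+ | Speed of root cause identification | | |
+ | Correctness of recovery actions | | |
+ | Execution of the emergency plan | | |
+ | Decision quality | | |
+
+ System evaluation:
+ | Item | Score (1-5) | Notes |
+ |--------|----------|------|
+ | Automatic recovery | | |
+ | Effectiveness of degradation plans | | |
+ | Monitoring and alerting coverage | | |
+ | Fault isolation | | |
+ | Data consistency | | |
+ ```
+
+ ---
+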
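+ ## 7. Verification
+
+ ### 7.1 Chaos engineering maturity model
+
+ ```yaml
+ Level 1 - Getting started:
+   - Manual fault injection in dev/test environments
+   - Basic monitoring is in place
+   - The team understands chaos engineering concepts
+
+ Level 2 - Standardized:
+   - Chaos engineering tools are in use (LitmusChaos / Chaos Mesh)
+   - Experiments are templated and reproducible
+   - Regular experiments in Staging
+
+ Level 3 - Automated:
+   - Chaos experiments are integrated into CI/CD (Staging gate)
+   - Experiment reports are generated automatically
+   - Tickets are created automatically for improvement items
+
+ Level 4 - Production ready:
+   - Regular GameDays in production
+   - Automated chaos experiments (run automatically outside working hours)
+   - Resilience score is part of the release gate
+
+ Level 5 - Continuous chaos:
+   - Continuous chaos in production (Chaos as a Service)
+   - Adaptive fault injection (intensity adjusted to system health)
+   - Chaos experiments cover all core paths
+ ```
+
+ ### 7.2 Verification checklist
+
+ | Metric | Target |
+ |------|---------|
+ | Core service failure recovery time | < 60s (automatic) |
+ | Alert firing time | < 2 minutes |
+ | Circuit breaker activation time | < 30s |
+ | Degradation plan coverage | 100% of external dependencies |
+ | GameDay frequency | >= once per quarter |
+ | Improvement item closure rate | > 90% |
+
+ ---
+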
+ ## 8. Rollback
+
+ ### Experiment rollback
+
+ ```bash
+ # Emergency stop for all chaos experiments
+
+ # LitmusChaos - stop all experiments
+ kubectl get chaosengine -n production -o name | xargs -I {} kubectl patch {} -n production --type merge -p '{"spec":{"engineState":"stop"}}'
+
+ # Chaos Mesh - delete all experiments
+ kubectl delete networkchaos,stresschaos,iochaos,podchaos,dnschaos --all -n production
+
+ # Gremlin - global halt
+ gremlin halt
+
+ # Toxiproxy - remove all toxics and re-enable all proxies via the HTTP API
+ # (toxiproxy-cli toxic remove takes one toxic name at a time; the API is assumed
+ # to be on localhost:8474, the port published by the docker run command above)
+ curl -s -X POST http://localhost:8474/reset
+
+ # Verify that the system has recovered
+ kubectl get pods -n production
+ kubectl top pods -n production
+ curl -s http://api.target.com/health | jq .
+ ```
+
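+ ### Handling a real outage caused by an experiment
+
+ ```yaml
+ If a chaos experiment causes an unexpected real outage:
+
+   1. Stop the experiment immediately (Kill Switch)
+
+   2. Follow the incident response process:
+     - Assess the scope of impact
+     - Notify the relevant teams
+     - Carry out recovery actions
+
+   3. If the system cannot recover automatically:
+     - Manually restart the affected services
+       kubectl rollout restart deployment/<service> -n production
+     - If data is damaged, restore from backup
+
+   4. Post-incident analysis:
+     - Why did the experiment exceed the expected blast radius?
+     - Why did the safety mechanisms not take effect?
+     - Update the experiment safety boundaries
+ ```
+
+ ---
+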
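+ ## Agent Checklist
+
+ For automated agents to verify item by item while running the chaos engineering process:
+
+ - [ ] Chaos engineering tools are installed and verified to work (LitmusChaos / Chaos Mesh / Gremlin)
+ - [ ] The three pillars of observability are in place (monitoring/logging/tracing)
+ - [ ] The experiment card is filled in and approved
+ - [ ] The steady-state hypothesis is explicit and has quantitative metrics
+ - [ ] The blast radius is controlled (environment/namespace/number of Pods)
+ - [ ] The Kill Switch has been tested and works
+ - [ ] Abort conditions are clearly defined
+ - [ ] All participants are in place
+ - [ ] Monitoring dashboards are open and the baseline is recorded
+ - [ ] The experiment is executed as planned
+ - [ ] Key metrics are monitored continuously during the experiment
+ - [ ] The system returns to steady state after the experiment
+ - [ ] Experiment results are recorded (hypothesis verification/timeline/findings)
+ - [ ] Improvement items are created with an owner and a due date
+ - [ ] The resilience score has been calculated
+ - [ ] The experiment report has been archived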