@umacloud/knowledge 1.0.1

Files changed (418)
  1. package/00-governance/governance-capabilities.md +557 -0
  2. package/00-governance/knowledge-map.md +39 -0
  3. package/00-governance/maintenance-policy.md +76 -0
  4. package/00-governance/review-checklist.md +81 -0
  5. package/README.md +13 -0
  6. package/ai/01-standards/agent-development-complete.md +691 -0
  7. package/ai/01-standards/llm-application-complete.md +488 -0
  8. package/ai/01-standards/mlops-complete.md +798 -0
  9. package/ai/01-standards/prompt-engineering-complete.md +646 -0
  10. package/ai/01-standards/rag-architecture-complete.md +649 -0
  11. package/ai/02-playbooks/llm-evaluation-playbook.md +847 -0
  12. package/ai/03-checklists/ai-project-checklist.md +215 -0
  13. package/ai/04-antipatterns/ai-antipatterns.md +661 -0
  14. package/ai/05-cases/case-rag-production.md +147 -0
  15. package/ai/06-glossary/ai-glossary.md +162 -0
  16. package/ai/agent-evaluation-benchmark.md +53 -0
  17. package/ai/ai-agent-memory-context-management.md +41 -0
  18. package/ai/ai-cost-capacity-optimization-playbook.md +42 -0
  19. package/ai/ai-data-security-and-compliance-playbook.md +37 -0
  20. package/ai/ai-domain-index-and-checklist.md +40 -0
  21. package/ai/ai-governance-maturity-model.md +50 -0
  22. package/ai/ai-model-selection-and-routing-strategy.md +47 -0
  23. package/ai/ai-observability-and-oncall-runbook.md +52 -0
  24. package/ai/ai-rag-engineering-playbook.md +42 -0
  25. package/ai/ai-red-team-and-safety-evaluation.md +42 -0
  26. package/ai/ai-release-readiness-and-rollback-gate.md +42 -0
  27. package/ai/llm-agent-engineering-deep-dive.md +57 -0
  28. package/ai/prompt-and-tool-guardrails.md +52 -0
  29. package/api/01-standards/enterprise-api-standards.md +198 -0
  30. package/api/01-standards/rest-api-design-guide.md +63 -0
  31. package/api/02-playbooks/api-pagination-playbook.md +93 -0
  32. package/api/02-playbooks/graphql-production-playbook.md +176 -0
  33. package/api/03-checklists/api-review-checklist.md +55 -0
  34. package/api/04-antipatterns/api-antipatterns.md +112 -0
  35. package/architecture/01-standards/api-gateway-patterns.md +496 -0
  36. package/architecture/01-standards/cloud-native-patterns.md +644 -0
  37. package/architecture/01-standards/distributed-systems-patterns.md +591 -0
  38. package/architecture/01-standards/event-driven-architecture.md +595 -0
  39. package/architecture/01-standards/microservices-patterns-complete.md +968 -0
  40. package/architecture/01-standards/microservices-patterns.md +495 -0
  41. package/architecture/01-standards/system-design-interview.md +664 -0
  42. package/architecture/02-playbooks/microservices-patterns-playbook.md +137 -0
  43. package/architecture/02-playbooks/migration-playbook.md +780 -0
  44. package/architecture/02-playbooks/system-design-playbook.md +779 -0
  45. package/architecture/03-checklists/architecture-decision-checklist.md +297 -0
  46. package/architecture/04-antipatterns/architecture-antipatterns.md +417 -0
  47. package/architecture/05-cases/case-netflix-microservices.md +413 -0
  48. package/architecture/06-glossary/architecture-glossary.md +164 -0
  49. package/architecture/adr-template-and-examples.md +38 -0
  50. package/architecture/api-gateway-deep-dive.md +1291 -0
  51. package/architecture/configuration-management.md +1162 -0
  52. package/architecture/distributed-transactions.md +1220 -0
  53. package/architecture/microservices-complete.md +735 -0
  54. package/architecture/resilience-and-disaster-patterns.md +37 -0
  55. package/architecture/service-governance.md +1198 -0
  56. package/architecture/system-architecture-deep-dive.md +37 -0
  57. package/backend/01-standards/analytics-and-growth.md +65 -0
  58. package/backend/01-standards/api-and-error-conventions.md +120 -0
  59. package/backend/01-standards/application-layering-and-packaging.md +160 -0
  60. package/backend/01-standards/auth-implementation.md +104 -0
  61. package/backend/01-standards/backend-framework-idioms.md +74 -0
  62. package/backend/01-standards/background-jobs-and-async.md +66 -0
  63. package/backend/01-standards/caching-strategies-complete.md +390 -0
  64. package/backend/01-standards/config-and-observability.md +77 -0
  65. package/backend/01-standards/data-modeling-and-persistence.md +94 -0
  66. package/backend/01-standards/django-complete.md +1765 -0
  67. package/backend/01-standards/email-and-notifications.md +64 -0
  68. package/backend/01-standards/fastapi-complete.md +925 -0
  69. package/backend/01-standards/file-upload-and-storage.md +66 -0
  70. package/backend/01-standards/graphql-api-complete.md +416 -0
  71. package/backend/01-standards/llm-application-standard.md +78 -0
  72. package/backend/01-standards/message-queue-patterns.md +379 -0
  73. package/backend/01-standards/microservices-and-distributed.md +78 -0
  74. package/backend/01-standards/nestjs-complete.md +2167 -0
  75. package/backend/01-standards/payment-integration.md +80 -0
  76. package/backend/01-standards/rate-limiting-complete.md +451 -0
  77. package/backend/01-standards/realtime-and-websocket.md +65 -0
  78. package/backend/01-standards/search-and-filtering.md +64 -0
  79. package/backend/01-standards/spring-boot-complete.md +445 -0
  80. package/backend/02-playbooks/api-design-playbook.md +718 -0
  81. package/backend/02-playbooks/email-send-playbook.md +130 -0
  82. package/backend/02-playbooks/file-upload-s3-playbook.md +153 -0
  83. package/backend/02-playbooks/typescript-enterprise-playbook.md +133 -0
  84. package/backend/02-playbooks/websocket-realtime-playbook.md +154 -0
  85. package/backend/03-checklists/api-launch-checklist.md +189 -0
  86. package/backend/04-antipatterns/backend-antipatterns.md +1051 -0
  87. package/blockchain/01-standards/blockchain-basics.md +557 -0
  88. package/blockchain/01-standards/smart-contract-development.md +1315 -0
  89. package/cicd/01-standards/deployment-and-delivery-standard.md +96 -0
  90. package/cicd/01-standards/github-actions-complete.md +473 -0
  91. package/cicd/01-standards/release-and-store-submission.md +75 -0
  92. package/cicd/02-playbooks/cicd-pipeline-playbook.md +144 -0
  93. package/cicd/02-playbooks/release-management-playbook.md +605 -0
  94. package/cicd/03-checklists/pipeline-security-checklist.md +168 -0
  95. package/cicd/04-antipatterns/cicd-antipatterns.md +589 -0
  96. package/cicd/05-cases/case-deployment-automation.md +221 -0
  97. package/cicd/05-cases/case-gitops-transformation.md +212 -0
  98. package/cicd/06-glossary/cicd-glossary.md +114 -0
  99. package/cicd/cicd-blueprint-deep-dive.md +38 -0
  100. package/cicd/release-readiness-gate.md +37 -0
  101. package/cloud-native/01-standards/container-security.md +741 -0
  102. package/cloud-native/01-standards/kubernetes-complete.md +812 -0
  103. package/cloud-native/02-playbooks/api-gateway-playbook.md +155 -0
  104. package/cloud-native/02-playbooks/gitops-with-argocd.md +760 -0
  105. package/cloud-native/02-playbooks/k8s-troubleshooting-playbook.md +1942 -0
  106. package/cloud-native/02-playbooks/message-queue-playbook.md +129 -0
  107. package/cloud-native/02-playbooks/multicloud-governance.md +726 -0
  108. package/cloud-native/02-playbooks/serverless-patterns.md +788 -0
  109. package/cloud-native/02-playbooks/service-mesh-playbook.md +612 -0
  110. package/cloud-native/02-playbooks/terraform-iac-playbook.md +143 -0
  111. package/cloud-native/03-checklists/container-security-checklist.md +431 -0
  112. package/cloud-native/03-checklists/k8s-production-readiness-checklist.md +460 -0
  113. package/cloud-native/04-antipatterns/container-antipatterns.md +660 -0
  114. package/cloud-native/04-antipatterns/k8s-antipatterns.md +743 -0
  115. package/cloud-native/05-cases/case-k8s-migration.md +478 -0
  116. package/cloud-native/05-cases/case-k8s-scaling.md +642 -0
  117. package/cloud-native/05-cases/case-k8s-security-incident.md +397 -0
  118. package/cloud-native/06-glossary/cloud-native-glossary.md +337 -0
  119. package/cross-platform/01-standards/cross-platform-frameworks.md +83 -0
  120. package/cross-platform/01-standards/platform-selection-and-architecture.md +77 -0
  121. package/data/01-standards/elasticsearch-complete.md +2098 -0
  122. package/data/01-standards/postgresql-complete.md +1613 -0
  123. package/data/01-standards/redis-complete.md +1527 -0
  124. package/data/02-playbooks/database-optimization-playbook.md +403 -0
  125. package/data/02-playbooks/elasticsearch-production-playbook.md +132 -0
  126. package/data/03-checklists/database-launch-checklist.md +187 -0
  127. package/data/04-antipatterns/database-antipatterns.md +873 -0
  128. package/data/05-cases/case-database-migration.md +310 -0
  129. package/data/06-glossary/database-glossary.md +440 -0
  130. package/data/data-governance-and-modeling-deep-dive.md +39 -0
  131. package/data-engineering/01-standards/airflow-complete.md +523 -0
  132. package/data-engineering/01-standards/kafka-complete.md +1521 -0
  133. package/data-engineering/02-playbooks/spark-etl-playbook.md +496 -0
  134. package/data-engineering/03-checklists/pipeline-launch-checklist.md +194 -0
  135. package/data-engineering/04-antipatterns/data-pipeline-antipatterns.md +684 -0
  136. package/data-engineering/05-cases/case-real-time-pipeline.md +355 -0
  137. package/data-engineering/06-glossary/data-engineering-glossary.md +429 -0
  138. package/database/01-standards/database-schema-standards.md +147 -0
  139. package/database/02-playbooks/postgresql-optimization-quick.md +52 -0
  140. package/database/02-playbooks/postgresql-performance-optimization.md +58 -0
  141. package/database/02-playbooks/postgresql-production-playbook.md +146 -0
  142. package/database/02-playbooks/redis-caching-playbook.md +117 -0
  143. package/database/03-checklists/database-review-checklist.md +50 -0
  144. package/database/04-antipatterns/database-antipatterns.md +112 -0
  145. package/design/01-standards/ui-design-system-complete.md +423 -0
  146. package/design/02-playbooks/design-handoff-playbook.md +254 -0
  147. package/design/02-playbooks/design-review-playbook.md +388 -0
  148. package/design/03-checklists/design-review-checklist.md +246 -0
  149. package/design/04-antipatterns/design-antipatterns.md +378 -0
  150. package/design/05-cases/case-design-system-adoption.md +328 -0
  151. package/design/06-glossary/design-glossary.md +329 -0
  152. package/design/ui-full-lifecycle-cross-platform-playbook.md +571 -0
  153. package/design/ux-system-deep-dive.md +38 -0
  154. package/design-systems/00-craft-rules.md +71 -0
  155. package/design-systems/aesthetic-families.md +43 -0
  156. package/design-systems/anti-ai-slop.md +162 -0
  157. package/design-systems/bold-geometric.md +120 -0
  158. package/design-systems/brutalist-bold.md +103 -0
  159. package/design-systems/editorial-clean.md +109 -0
  160. package/design-systems/glass-aurora.md +108 -0
  161. package/design-systems/modern-minimal.md +145 -0
  162. package/design-systems/premium-luxury.md +106 -0
  163. package/design-systems/product-type-design-map.md +48 -0
  164. package/design-systems/soft-warm.md +123 -0
  165. package/design-systems/tech-utility.md +113 -0
  166. package/desktop/01-standards/desktop-app-standard.md +72 -0
  167. package/desktop/01-standards/desktop-design.md +71 -0
  168. package/development/00-governance/document-template.md +41 -0
  169. package/development/01-standards/api-versioning-strategies.md +432 -0
  170. package/development/01-standards/authentication-patterns-complete.md +479 -0
  171. package/development/01-standards/css-architecture-complete.md +550 -0
  172. package/development/01-standards/database-migration-strategies.md +484 -0
  173. package/development/01-standards/elasticsearch-complete.md +347 -0
  174. package/development/01-standards/git-complete.md +371 -0
  175. package/development/01-standards/golang-complete.md +1565 -0
  176. package/development/01-standards/graphql-complete.md +298 -0
  177. package/development/01-standards/javascript-bundlers-complete.md +469 -0
  178. package/development/01-standards/javascript-typescript-complete.md +528 -0
  179. package/development/01-standards/jest-complete.md +275 -0
  180. package/development/01-standards/linux-complete.md +234 -0
  181. package/development/01-standards/logging-observability-complete.md +526 -0
  182. package/development/01-standards/microservices-communication.md +502 -0
  183. package/development/01-standards/mongodb-complete.md +406 -0
  184. package/development/01-standards/oauth2-complete.md +285 -0
  185. package/development/01-standards/performance-optimization-complete.md +289 -0
  186. package/development/01-standards/playwright-complete.md +247 -0
  187. package/development/01-standards/postgresql-complete.md +456 -0
  188. package/development/01-standards/pytest-complete.md +340 -0
  189. package/development/01-standards/python-async-programming.md +902 -0
  190. package/development/01-standards/python-complete.md +956 -0
  191. package/development/01-standards/python-decorators-complete.md +799 -0
  192. package/development/01-standards/python-design-patterns.md +2854 -0
  193. package/development/01-standards/python-packaging-distribution.md +420 -0
  194. package/development/01-standards/python-testing-strategies.md +607 -0
  195. package/development/01-standards/python-web-frameworks-comparison.md +471 -0
  196. package/development/01-standards/redis-complete.md +317 -0
  197. package/development/01-standards/rest-api-complete.md +316 -0
  198. package/development/01-standards/rust-complete.md +578 -0
  199. package/development/01-standards/typescript-advanced-types.md +1513 -0
  200. package/development/01-standards/web-security-complete.md +292 -0
  201. package/development/02-playbooks/api-design-playbook.md +810 -0
  202. package/development/02-playbooks/database-migration-playbook.md +580 -0
  203. package/development/02-playbooks/debugging-playbook.md +692 -0
  204. package/development/02-playbooks/feature-delivery-playbook.md +430 -0
  205. package/development/02-playbooks/incident-hotfix-playbook.md +387 -0
  206. package/development/02-playbooks/performance-optimization-playbook.md +531 -0
  207. package/development/02-playbooks/performance-tuning-playbook.md +652 -0
  208. package/development/02-playbooks/refactor-playbook.md +403 -0
  209. package/development/02-playbooks/release-playbook.md +469 -0
  210. package/development/03-checklists/architecture-review-checklist.md +168 -0
  211. package/development/03-checklists/data-migration-checklist.md +157 -0
  212. package/development/03-checklists/oncall-handover-checklist.md +173 -0
  213. package/development/03-checklists/pr-checklist.md +158 -0
  214. package/development/03-checklists/production-readiness-checklist.md +190 -0
  215. package/development/03-checklists/release-readiness-checklist.md +154 -0
  216. package/development/03-checklists/security-review-checklist.md +182 -0
  217. package/development/04-antipatterns/api-antipatterns.md +657 -0
  218. package/development/04-antipatterns/architecture-antipatterns.md +686 -0
  219. package/development/04-antipatterns/backend-antipatterns.md +648 -0
  220. package/development/04-antipatterns/cicd-antipatterns.md +540 -0
  221. package/development/04-antipatterns/code-smell-antipatterns.md +571 -0
  222. package/development/04-antipatterns/data-antipatterns.md +658 -0
  223. package/development/04-antipatterns/database-antipatterns.md +578 -0
  224. package/development/04-antipatterns/frontend-antipatterns.md +635 -0
  225. package/development/04-antipatterns/reliability-antipatterns.md +700 -0
  226. package/development/04-antipatterns/security-antipatterns.md +747 -0
  227. package/development/05-cases/case-api-version-migration.md +428 -0
  228. package/development/05-cases/case-authorization-hardening.md +383 -0
  229. package/development/05-cases/case-bluegreen-rollback.md +466 -0
  230. package/development/05-cases/case-cache-snowball-protection.md +485 -0
  231. package/development/05-cases/case-ci-cd-pipeline.md +544 -0
  232. package/development/05-cases/case-database-scaling.md +500 -0
  233. package/development/05-cases/case-db-hotspot-optimization.md +487 -0
  234. package/development/05-cases/case-incident-mttr-reduction.md +563 -0
  235. package/development/05-cases/case-microservice-migration.md +375 -0
  236. package/development/05-cases/case-performance-optimization.md +406 -0
  237. package/development/05-cases/case-security-incident-response.md +345 -0
  238. package/development/06-glossary/full-stack-glossary.md +166 -0
  239. package/development/09-maturity/quarterly-audit-template.md +35 -0
  240. package/development/11-ui-excellence/ui-aesthetic-system.md +41 -0
  241. package/development/11-ui-excellence/ui-engineering-excellence.md +435 -0
  242. package/development/12-scenarios/development-scenarios-guide.md +565 -0
  243. package/development/13-implementation-assets/implementation-toolkit.md +282 -0
  244. package/development/13-implementation-assets/knowledge-gates-execution.md +43 -0
  245. package/development/14-full-lifecycle/software-lifecycle-gates.md +511 -0
  246. package/development/15-lifecycle-templates/project-templates-collection.md +791 -0
  247. package/development/api-contract-and-versioning-guide.md +36 -0
  248. package/development/api-governance-complete.md +43 -0
  249. package/development/backend-engineering-complete.md +43 -0
  250. package/development/code-review-quality-complete.md +43 -0
  251. package/development/concurrency-reliability-complete.md +43 -0
  252. package/development/database-engineering-complete.md +43 -0
  253. package/development/engineering-effectiveness-complete.md +43 -0
  254. package/development/engineering-standards-deep-dive.md +38 -0
  255. package/development/frontend-engineering-complete.md +43 -0
  256. package/development/performance-capacity-complete.md +43 -0
  257. package/development/refactor-migration-complete.md +42 -0
  258. package/development/refactoring-and-techdebt-playbook.md +37 -0
  259. package/development/security-in-development-complete.md +43 -0
  260. package/devops/01-standards/cicd-pipeline-complete.md +262 -0
  261. package/devops/01-standards/docker-complete.md +1490 -0
  262. package/devops/01-standards/github-actions-complete.md +337 -0
  263. package/devops/01-standards/kubernetes-complete.md +638 -0
  264. package/devops/01-standards/terraform-complete.md +2117 -0
  265. package/devops/02-playbooks/docker-compose-playbook.md +233 -0
  266. package/devops/02-playbooks/docker-k8s-production-playbook.md +186 -0
  267. package/devops/02-playbooks/docker-production-playbook.md +952 -0
  268. package/edge-iot/01-standards/edge-iot-complete.md +473 -0
  269. package/experts/architect/api-design.md +178 -0
  270. package/experts/architect/methodology.md +124 -0
  271. package/experts/architect/security.md +75 -0
  272. package/experts/backend-lead/methodology.md +216 -0
  273. package/experts/devops/methodology.md +160 -0
  274. package/experts/frontend-lead/methodology.md +178 -0
  275. package/experts/product-manager/industry/ecommerce.md +43 -0
  276. package/experts/product-manager/industry/saas.md +40 -0
  277. package/experts/product-manager/methodology.md +97 -0
  278. package/experts/qa-lead/methodology.md +123 -0
  279. package/experts/qa-lead/test-strategy.md +128 -0
  280. package/experts/uiux-designer/methodology.md +125 -0
  281. package/frontend/01-standards/accessibility-complete.md +532 -0
  282. package/frontend/01-standards/accessibility-standard.md +74 -0
  283. package/frontend/01-standards/admin-dashboard-and-crud.md +72 -0
  284. package/frontend/01-standards/design-tokens-complete.md +444 -0
  285. package/frontend/01-standards/forms-and-validation.md +77 -0
  286. package/frontend/01-standards/frontend-architecture-and-layering.md +119 -0
  287. package/frontend/01-standards/i18n-and-localization.md +65 -0
  288. package/frontend/01-standards/nextjs-complete.md +451 -0
  289. package/frontend/01-standards/react-complete.md +713 -0
  290. package/frontend/01-standards/react-hooks-complete-guide.md +1100 -0
  291. package/frontend/01-standards/react-hooks-complete.md +1171 -0
  292. package/frontend/01-standards/seo-and-web-vitals.md +77 -0
  293. package/frontend/01-standards/state-management-complete.md +444 -0
  294. package/frontend/01-standards/vue-complete.md +499 -0
  295. package/frontend/01-standards/vue3-complete.md +2002 -0
  296. package/frontend/01-standards/web-framework-best-practices.md +64 -0
  297. package/frontend/01-standards/web-performance-complete.md +495 -0
  298. package/frontend/02-playbooks/accessibility-a11y-playbook.md +161 -0
  299. package/frontend/02-playbooks/frontend-performance-playbook.md +707 -0
  300. package/frontend/02-playbooks/i18n-internationalization-playbook.md +120 -0
  301. package/frontend/02-playbooks/performance-optimization-playbook.md +163 -0
  302. package/frontend/02-playbooks/react-nextjs-production-playbook.md +167 -0
  303. package/frontend/02-playbooks/react-state-management-playbook.md +173 -0
  304. package/frontend/03-checklists/component-quality-checklist.md +166 -0
  305. package/frontend/03-checklists/frontend-launch-checklist.md +299 -0
  306. package/frontend/04-antipatterns/frontend-antipatterns.md +886 -0
  307. package/frontend/05-cases/case-performance-optimization.md +274 -0
  308. package/harmony/01-standards/harmonyos-arkts-standard.md +75 -0
  309. package/harmony/01-standards/harmonyos-design.md +65 -0
  310. package/high-quality-engineering-playbook.md +54 -0
  311. package/incident/01-standards/incident-response-complete.md +303 -0
  312. package/incident/02-playbooks/chaos-engineering-playbook.md +883 -0
  313. package/incident/02-playbooks/postmortem-playbook.md +398 -0
  314. package/incident/03-checklists/incident-readiness-checklist.md +181 -0
  315. package/incident/04-antipatterns/incident-antipatterns.md +490 -0
  316. package/incident/05-cases/case-cascade-failure.md +176 -0
  317. package/incident/06-glossary/incident-glossary.md +114 -0
  318. package/incident/postmortem-and-response-deep-dive.md +39 -0
  319. package/industries/ecommerce/ecommerce-complete.md +631 -0
  320. package/industries/education/education-complete.md +555 -0
  321. package/industries/fintech/fintech-complete.md +501 -0
  322. package/industries/gaming/gaming-complete.md +587 -0
  323. package/industries/healthcare/healthcare-complete.md +452 -0
  324. package/low-code/01-standards/low-code-complete.md +944 -0
  325. package/miniprogram/01-standards/ai-common-mistakes.md +61 -0
  326. package/miniprogram/01-standards/miniprogram-custom-navbar-capsule.md +77 -0
  327. package/miniprogram/01-standards/miniprogram-design.md +61 -0
  328. package/miniprogram/01-standards/miniprogram-standard.md +81 -0
  329. package/mobile/01-standards/android-material-design.md +70 -0
  330. package/mobile/01-standards/flutter-complete.md +384 -0
  331. package/mobile/01-standards/ios-design-hig.md +78 -0
  332. package/mobile/01-standards/mobile-app-standard.md +85 -0
  333. package/mobile/01-standards/react-native-complete.md +352 -0
  334. package/mobile/02-playbooks/mobile-cross-platform-playbook.md +175 -0
  335. package/mobile/02-playbooks/mobile-performance.md +473 -0
  336. package/mobile/03-checklists/mobile-release-checklist.md +234 -0
  337. package/mobile/04-antipatterns/mobile-antipatterns.md +798 -0
  338. package/mobile/05-cases/case-app-performance.md +500 -0
  339. package/mobile/05-cases/case-app-startup-optimization.md +218 -0
  340. package/mobile/06-glossary/mobile-glossary.md +484 -0
  341. package/observability/01-standards/observability-standards.md +103 -0
  342. package/observability/02-playbooks/prometheus-grafana-playbook.md +135 -0
  343. package/observability/02-playbooks/structured-logging-playbook.md +73 -0
  344. package/observability/03-checklists/observability-checklist.md +54 -0
  345. package/observability/04-antipatterns/observability-antipatterns.md +106 -0
  346. package/operations/01-standards/prometheus-monitoring-complete.md +1578 -0
  347. package/operations/02-playbooks/capacity-planning-playbook.md +620 -0
  348. package/operations/03-checklists/production-launch-checklist.md +365 -0
  349. package/operations/04-antipatterns/operations-antipatterns.md +664 -0
  350. package/operations/05-cases/case-sre-practices.md +581 -0
  351. package/operations/06-glossary/operations-glossary.md +120 -0
  352. package/operations/aiops-anomaly-detection.md +758 -0
  353. package/operations/capacity-planning.md +1061 -0
  354. package/operations/chaos-engineering.md +659 -0
  355. package/operations/incident-command-system.md +38 -0
  356. package/operations/observability-complete.md +442 -0
  357. package/operations/slo-sli-playbook.md +517 -0
  358. package/operations/sre-operations-deep-dive.md +39 -0
  359. package/package.json +8 -0
  360. package/performance/01-standards/performance-and-scalability.md +80 -0
  361. package/performance/01-standards/performance-standards.md +156 -0
  362. package/performance/02-playbooks/query-optimization-playbook.md +103 -0
  363. package/performance/03-checklists/performance-checklist.md +56 -0
  364. package/performance/04-antipatterns/performance-antipatterns.md +146 -0
  365. package/product/01-standards/product-management-complete.md +285 -0
  366. package/product/02-playbooks/feature-launch-playbook.md +207 -0
  367. package/product/02-playbooks/user-research-playbook.md +532 -0
  368. package/product/03-checklists/feature-launch-checklist.md +275 -0
  369. package/product/04-antipatterns/product-antipatterns.md +355 -0
  370. package/product/05-cases/case-mvp-to-scale.md +384 -0
  371. package/product/06-glossary/product-glossary.md +462 -0
  372. package/product/feature-prioritization-framework.md +40 -0
  373. package/product/kpi-and-metric-tree.md +37 -0
  374. package/product/product-discovery-and-prd-deep-dive.md +41 -0
  375. package/quantum/01-standards/quantum-complete.md +1186 -0
  376. package/security/01-standards/api-security-complete.md +511 -0
  377. package/security/01-standards/container-runtime-security.md +574 -0
  378. package/security/01-standards/data-protection-gdpr.md +543 -0
  379. package/security/01-standards/owasp-top10-complete.md +1890 -0
  380. package/security/01-standards/secure-coding-baseline.md +90 -0
  381. package/security/01-standards/supply-chain-security.md +441 -0
  382. package/security/01-standards/web-security-checklist.md +108 -0
  383. package/security/01-standards/zero-trust-architecture.md +521 -0
  384. package/security/02-playbooks/auth-sso-playbook.md +166 -0
  385. package/security/02-playbooks/incident-response-security-playbook.md +588 -0
  386. package/security/02-playbooks/owasp-api-security-playbook.md +129 -0
  387. package/security/02-playbooks/payment-integration-playbook.md +119 -0
  388. package/security/02-playbooks/penetration-testing-playbook.md +517 -0
  389. package/security/03-checklists/security-audit-checklist.md +356 -0
  390. package/security/04-antipatterns/security-coding-antipatterns.md +580 -0
  391. package/security/05-cases/case-log4shell-incident.md +537 -0
  392. package/security/05-cases/case-major-breaches.md +468 -0
  393. package/security/06-glossary/security-glossary.md +212 -0
  394. package/security/compliance-automation.md +993 -0
  395. package/security/container-security.md +680 -0
  396. package/security/devsecops-complete.md +426 -0
  397. package/security/sast-dast-sca.md +775 -0
  398. package/security/secrets-management.md +594 -0
  399. package/security/security-architecture-deep-dive.md +37 -0
  400. package/security/threat-modeling-stride-playbook.md +40 -0
  401. package/seed-templates/auth-system.md +59 -0
  402. package/seed-templates/blog-content.md +94 -0
  403. package/seed-templates/dashboard.md +89 -0
  404. package/seed-templates/docs-site.md +73 -0
  405. package/seed-templates/e-commerce.md +50 -0
  406. package/seed-templates/saas-landing.md +92 -0
  407. package/seed-templates/settings-page.md +51 -0
  408. package/testing/01-standards/test-strategy-and-layering.md +83 -0
  409. package/testing/01-standards/testing-strategy-complete.md +422 -0
  410. package/testing/01-standards/unit-testing-best-practices.md +118 -0
  411. package/testing/02-playbooks/e2e-testing-playbook.md +988 -0
  412. package/testing/02-playbooks/testing-strategy-playbook.md +126 -0
  413. package/testing/03-checklists/test-strategy-checklist.md +208 -0
  414. package/testing/04-antipatterns/testing-antipatterns.md +718 -0
  415. package/testing/05-cases/case-testing-transformation.md +300 -0
  416. package/testing/06-glossary/testing-glossary.md +110 -0
  417. package/testing/risk-based-test-matrix.md +36 -0
  418. package/testing/testing-strategy-deep-dive.md +37 -0
@@ -0,0 +1,659 @@
+ ---
+ id: chaos-engineering
+ title: chaos-engineering
+ domain: operations
+ category: chaos-engineering.md
+ difficulty: intermediate
+ tags: [chaos, engineering, operations, 分钟, 实施流程, 实验信息, 实验概况, 核心原则]
+ quality_score: 70
+ last_updated: 2026-06-15
+ ---
+ # Author: Excellent (11964948@qq.com)
+ # Purpose: Chaos engineering practice guide
+ # Role: Verify system resilience through proactive fault injection, finding and fixing latent problems early
+ # Created: 2026-03-20
+ # Last modified: 2026-03-20
+
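+ ## Goal
+ Establish a chaos engineering practice that uses controlled fault-injection experiments to verify system resilience under real production load, find and fix latent problems early, and improve service reliability.
+
+ ## Scope
+ - Critical production services (after thorough testing)
+ - Pre-production environments (Staging/Pre-prod)
+ - Performance testing environments
+ - Not applicable to: production databases without backups, critical systems that cannot be recovered
+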
+ ## Core Principles
+
+ ### 1. Build a Steady State Hypothesis
+ **Definition**: a measurable state that represents the system operating normally.
+
+ **Example**:
+ ```yaml
+ # Steady state hypothesis for the order service
+ steady_state_hypothesis:
+   title: "Order service handles requests normally"
+   probes:
+     - name: "Availability >= 99.9%"
+       type: prometheus
+       query: "slo:availability:ratio:5m{service='order-service'} >= 0.999"
+
+     - name: "P95 latency < 200ms"
+       type: prometheus
+       query: "slo:latency:p95{service='order-service'} < 0.2"
+
+     - name: "Error rate < 0.1%"
+       type: prometheus
+       query: "slo:error_rate:ratio:5m{service='order-service'} < 0.001"
+
+     - name: "Health check passes"
+       type: http
+       url: "https://order-service.example.com/health"
+       expected_status: 200
+ ```
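As an illustration of how such a hypothesis could be checked before and after an experiment, here is a minimal Python sketch. It assumes the three Prometheus values have already been fetched into a plain dictionary. The metric keys (`availability`, `latency_p95_seconds`, `error_rate`) and the `Probe` type are hypothetical names chosen for this example and are not part of the guide or of any tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Probe:
    """One steady-state check: a metric key plus a pass/fail predicate."""
    name: str
    metric: str  # hypothetical key into the measurements dictionary
    passes: Callable[[float], bool]


# Thresholds mirror the YAML example: 99.9% availability, 200 ms P95, 0.1% errors.
PROBES: List[Probe] = [
    Probe("Availability >= 99.9%", "availability", lambda v: v >= 0.999),
    Probe("P95 latency < 200ms", "latency_p95_seconds", lambda v: v < 0.2),
    Probe("Error rate < 0.1%", "error_rate", lambda v: v < 0.001),
]


def failed_probes(measurements: Dict[str, float]) -> List[str]:
    """Return the names of probes that fail; a missing metric counts as a failure."""
    failures = []
    for probe in PROBES:
        value = measurements.get(probe.metric)
        if value is None or not probe.passes(value):
            failures.append(probe.name)
    return failures


def steady_state_holds(measurements: Dict[str, float]) -> bool:
    """The hypothesis holds only when every probe passes."""
    return not failed_probes(measurements)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: an experiment should not proceed when the steady state cannot be observed.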
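+
+ ### 2. Simulate Real-world Events
+ **Fault types**:
+ - Infrastructure faults: server crashes, network partitions, disk failures
+ - Resource exhaustion: CPU spikes, memory leaks, full disks, exhausted connection pools
+ - Dependency faults: database outages, cache failures, third-party API timeouts
+ - Network faults: latency, packet loss, DNS failures, expired SSL certificates
+ - Application faults: process crashes, thrown exceptions, misconfiguration
+
+ ### 3. Run in Production
+ **Reasons**:
+ - Test environments cannot fully reproduce production traffic and dependencies
+ - Failures observed in production are the most realistic
+ - Problems surface early, so the team is not caught off guard by a real outage
+
+ **Prerequisites**:
+ - Comprehensive monitoring and alerting
+ - A fast rollback mechanism
+ - A team with on-call capability
+ - Thorough validation in a test environment
+
+ ### 4. Automate and Run Continuously
+ **Strategies**:
+ - Game Day: a chaos drill day held on a regular schedule (monthly or quarterly)
+ - Automated experiments: chaos tests integrated into the CI/CD pipeline
+ - Progressive scope expansion: single service -> cluster -> cross-region
+
+ ### 5. Minimize Blast Radius
+ **Control measures**:
+ - Staged experiments: test environment first, then production
+ - Traffic isolation: inject faults for only a subset of users or traffic
+ - Emergency stop: abort immediately if the experiment gets out of control
+ - Rollback mechanism: restore the normal state quickly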
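The traffic isolation and emergency stop measures can be sketched in a few lines of Python. This is an illustrative assumption about how an application-level injector might pick its targets, not a description of any particular chaos tool. The function names and the `kill_switch_on` flag are hypothetical.

```python
import hashlib


def in_blast_radius(user_id: str, experiment: str, percent: float) -> bool:
    """Deterministically place roughly `percent`% of users inside the experiment.

    Hashing the experiment name together with the user ID keeps each user's
    assignment stable across requests and independent between experiments.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100


def should_inject(user_id: str, experiment: str, percent: float,
                  kill_switch_on: bool) -> bool:
    """Emergency stop wins over everything: no injection once the switch is on."""
    return (not kill_switch_on) and in_blast_radius(user_id, experiment, percent)
```

Starting with a small `percent` and raising it only while the steady state holds keeps the affected share of traffic bounded at every step.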
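+
+ ## Implementation Process
+
+ ### Phase 1: Preparation
+
+ #### 1.1 Assess System Maturity
+ **Checklist**:
+ - [ ] Service monitoring coverage >= 90% (logs, metrics, traces)
+ - [ ] SLOs fully defined for core services
+ - [ ] Automated deployment and rollback processes
+ - [ ] Team on-call rotation and runbooks
+ - [ ] Disaster recovery (DR) has been rehearsed
+
+ **Maturity assessment**:
+ | Dimension | L1 (Initial) | L2 (Basic) | L3 (Mature) | L4 (Excellent) |
+ |------|-----------|-----------|-----------|-----------|
+ | Monitoring | Basic metrics | Full monitoring | SLO + alerting | AIOps |
+ | Deployment | Manual | Automated deployment | Blue-green / canary | Progressive delivery |
+ | Recovery | Manual recovery | Automated recovery | Self-healing systems | Proactive prevention |
+ | Chaos | None | Test environment | Production | Automated chaos |
+
+ **Admission criterion**: maturity must be at least L2 before chaos experiments run in production.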
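The guide does not say how the per-dimension levels combine into a single maturity figure. The Python sketch below assumes the overall level is the lowest level across the scored dimensions, which is a conservative reading and an assumption of this example. The dimension keys are hypothetical.

```python
from enum import IntEnum
from typing import Mapping


class Level(IntEnum):
    L1 = 1  # Initial
    L2 = 2  # Basic
    L3 = 3  # Mature
    L4 = 4  # Excellent


# Threshold taken from the admission criterion: at least L2 for production.
PRODUCTION_MIN_LEVEL = Level.L2


def overall_maturity(scores: Mapping[str, Level]) -> Level:
    """Assumption: the weakest dimension determines the overall maturity."""
    if not scores:
        raise ValueError("at least one dimension must be scored")
    return min(scores.values())


def production_chaos_allowed(scores: Mapping[str, Level]) -> bool:
    """Apply the admission criterion to the aggregated maturity level."""
    return overall_maturity(scores) >= PRODUCTION_MIN_LEVEL
```

Under this reading, a team at L4 for monitoring but L1 for deployment would still be held back, since a manual deployment process makes fast rollback unlikely.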
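+
+ #### 1.2 Select Experiment Targets
+ **Priority assessment**:
+ - Critical services (P0): orders, payments, authentication
+ - Failure-prone services: those with the most incidents in the past 3 months
+ - Newly launched services: verify architectural resilience
+ - Before large-scale changes: verify the impact of the change
+
+ **Service dependency graph analysis**:
+ ```
+ User -> CDN -> Load balancer -> Order service -> Database
+                              -> Inventory service -> Cache
+                              -> Payment service -> Payment gateway
+ ```
+
+ **Target selection example**:
+ ```yaml
+ # Ordered by priority
+ targets:
+   - service: "order-service"
+     priority: "P0"
+     reason: "Core business service; directly affects revenue"
+
+   - service: "payment-service"
+     priority: "P0"
+     reason: "Critical payment path; failures block order completion"
+
+   - service: "inventory-service"
+     priority: "P1"
+     reason: "Inventory service failures create overselling risk"
+ ```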
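+
+ ### Stage 2: Experiment Design
+
+ #### 2.1 Define the Experiment Hypothesis
+ **Template**:
+ ```
+ When <injected fault> occurs,
+ if <system behavior> is normal,
+ then <business impact> stays within the acceptable range
+ ```
+
+ **Example**:
+ ```
+ When one Pod of the order service goes down,
+ if autoscaling and load balancing work correctly,
+ then the request success rate stays above 99.9% and P95 latency does not exceed 250ms
+ ```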
+
+ #### 2.2 Choose the Fault Injection Type
+ **Chaos Mesh fault types**:
+ ```yaml
+ # Pod failure
+ apiVersion: chaos-mesh.org/v1alpha1
+ kind: PodChaos
+ metadata:
+   name: pod-failure
+   namespace: chaos-testing
+ spec:
+   action: pod-failure
+   mode: one
+   selector:
+     namespaces:
+       - production
+     labelSelectors:
+       app: order-service
+   duration: "5m"
+ ---
+ # Network delay
+ apiVersion: chaos-mesh.org/v1alpha1
+ kind: NetworkChaos
+ metadata:
+   name: network-delay
+   namespace: chaos-testing
+ spec:
+   action: delay
+   mode: all
+   selector:
+     namespaces:
+       - production
+     labelSelectors:
+       app: order-service
+   delay:
+     latency: "100ms"
+     correlation: "50"
+     jitter: "10ms"
+   duration: "10m"
+ ---
+ # CPU stress
+ apiVersion: chaos-mesh.org/v1alpha1
+ kind: StressChaos
+ metadata:
+   name: cpu-stress
+   namespace: chaos-testing
+ spec:
+   mode: one
+   selector:
+     namespaces:
+       - production
+     labelSelectors:
+       app: order-service
+   stressors:
+     cpu:
+       workers: 2
+       load: 80
+   duration: "5m"
+ ---
+ # DNS failure
+ apiVersion: chaos-mesh.org/v1alpha1
+ kind: DNSChaos
+ metadata:
+   name: dns-failure
+   namespace: chaos-testing
+ spec:
+   action: error
+   mode: all
+   selector:
+     namespaces:
+       - production
+     labelSelectors:
+       app: order-service
+   patterns:
+     - "mysql-*"
+     - "redis-*"
+   duration: "3m"
+ ```
+
+ **Litmus Chaos fault types**:
+ ```yaml
+ # Node failure
+ apiVersion: litmuschaos.io/v1alpha1
+ kind: ChaosEngine
+ metadata:
+   name: node-chaos
+   namespace: chaos-testing
+ spec:
+   appinfo:
+     appns: production
+     applabel: "app=order-service"
+   chaosServiceAccount: litmus-admin
+   experiments:
+     - name: node-cpu-hog
+       spec:
+         components:
+           env:
+             # node-cpu-hog reads NODE_CPU_CORE; CPU_CORES is the pod-cpu-hog tunable
+             - name: NODE_CPU_CORE
+               value: "2"
+             - name: TOTAL_CHAOS_DURATION
+               value: "60"
+
+     # Pod deletion
+     - name: pod-delete
+       spec:
+         components:
+           env:
+             - name: FORCE
+               value: "false"
+             - name: CHAOS_INTERVAL
+               value: "10"
+             - name: TOTAL_CHAOS_DURATION
+               value: "60"
+ ```
+
+ #### 2.3 Define Stop Conditions
+ **Automatic stop conditions**:
+ ```yaml
+ stop_conditions:
+   - name: "Availability drops below 99%"
+     type: prometheus
+     query: "slo:availability:ratio:5m{service='order-service'} < 0.99"
+     action: "abort"
+
+   - name: "P95 latency exceeds 500ms"
+     type: prometheus
+     query: "slo:latency:p95{service='order-service'} > 0.5"
+     action: "abort"
+
+   - name: "Error rate exceeds 1%"
+     type: prometheus
+     query: "slo:error_rate:ratio:5m{service='order-service'} > 0.01"
+     action: "abort"
+
+   - name: "Manual abort"
+     type: manual
+     action: "abort"
+ ```
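The automatic conditions amount to a polling loop that aborts on the first breach. The following sketch shows one possible shape of that loop, with the metric source and the abort action injected as callables; `watch`, `first_breach`, and the metric keys are hypothetical names rather than features of Chaos Mesh or Litmus.

```python
from typing import Callable, Dict, Optional

Metrics = Dict[str, float]

# Breach predicates mirroring the three Prometheus stop conditions above.
STOP_CONDITIONS: Dict[str, Callable[[Metrics], bool]] = {
    "availability below 99%": lambda m: m["availability"] < 0.99,
    "p95 latency above 500ms": lambda m: m["latency_p95_seconds"] > 0.5,
    "error rate above 1%": lambda m: m["error_rate"] > 0.01,
}


def first_breach(metrics: Metrics) -> Optional[str]:
    """Return the name of the first breached stop condition, or None."""
    for name, breached in STOP_CONDITIONS.items():
        if breached(metrics):
            return name
    return None


def watch(
    fetch_metrics: Callable[[], Metrics],
    abort: Callable[[str], None],
    max_polls: int,
    wait: Callable[[], None] = lambda: None,
) -> Optional[str]:
    """Poll up to max_polls times; on the first breach call abort(reason) and return it."""
    for _ in range(max_polls):
        reason = first_breach(fetch_metrics())
        if reason is not None:
            abort(reason)
            return reason
        wait()  # e.g. time.sleep(15) in a real run; a no-op by default
    return None
```

In practice `abort` would wrap whatever stops the experiment, for example deleting the chaos resource, and `wait` would sleep between polls.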
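+
+ **Manual stop procedure**:
+ 1. The monitoring team watches the SLO dashboard
+ 2. A metric crosses its threshold -> notify the experiment owner immediately
+ 3. The owner runs `kubectl delete -f experiment.yaml` to stop the experiment
+ 4. If the system does not recover on its own -> run the rollback procedure
+
+ ### Stage 3: Execution
+
+ #### 3.1 Environment Preparation
+ **Checklist**:
+ - [ ] Experiment plan reviewed (tech lead + SRE)
+ - [ ] Monitoring and alerting verified
+ - [ ] Rollback procedure tested
+ - [ ] Teams notified (on-call + related teams)
+ - [ ] Users notified (if the impact is significant)
+ - [ ] Low-traffic window selected (e.g., 2-5 AM)
+
+ #### 3.2 Baseline Measurement
+ **Steps**:
+ 1. Record baseline metrics for the 30 minutes before the experiment
+ 2. Confirm the system is in steady state (SLOs met)
+ 3. Save the baseline data for later comparison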
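The baseline figures can be derived from raw request samples collected during the 30-minute window. This is a minimal sketch assuming per-request latencies in milliseconds plus error and request counts; it uses the nearest-rank percentile definition, which may differ slightly from what a given monitoring backend reports. All names are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def summarize_baseline(latencies_ms: Sequence[float], errors: int, requests: int) -> Dict[str, float]:
    """Summarize a baseline window into the figures used in the baseline report."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return {
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "mean_ms": mean(latencies_ms),
        "error_rate": errors / requests,
    }
```

Saving this summary together with the window's start and end times gives the later analysis a fixed reference to compare against.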
+
+ **Baseline report template**:
+ ```markdown
+ # Chaos Experiment Baseline Report
+
+ ## Experiment Information
+ - Experiment name: Order service single-Pod failure test
+ - Time: 2026-03-20 02:00 - 02:30
+ - Owner: Zhang San
+
+ ## Baseline Metrics (30 minutes before the experiment)
+ - Availability: 99.95%
+ - P95 latency: 150ms
+ - P99 latency: 280ms
+ - Error rate: 0.05%
+ - QPS: 1,200
+
+ ## System State
+ - Pod count: 3
+ - CPU utilization: 45%
+ - Memory utilization: 60%
+ - Database connections: 80/100
+ ```
+
+ #### 3.3 Inject the Fault
+ **Procedure**:
+ ```bash
+ # 1. Apply the fault configuration
+ kubectl apply -f pod-failure.yaml
+
+ # 2. Observe the system response (the target Pods run in the production namespace)
+ watch -n 5 'kubectl get pods -n production -l app=order-service'
+
+ # 3. Monitor metrics
+ # Open the Grafana SLO dashboard: https://grafana.example.com/d/slo-dashboard
+
+ # 4. Keep observing for 5-10 minutes
+ # Record key events and metric changes
+
+ # 5. Stop the experiment
+ kubectl delete -f pod-failure.yaml
+
+ # 6. Observe the recovery
+ # Wait 5-10 minutes and confirm the system has returned to baseline
+ ```
+
+ #### 3.4 Real-time Monitoring
+ **Monitoring dimensions**:
+ - SLO metrics: availability, latency, error rate
+ - Resource metrics: CPU, memory, network, disk
+ - Business metrics: order volume, conversion rate, revenue
+ - Dependencies: database, cache, third-party APIs
+
+ **Alert configuration**:
+ ```yaml
+ # Experiment-specific alerts (stricter thresholds)
+ groups:
+   - name: chaos-experiment-alerts
+     rules:
+       - alert: ChaosExperimentSLOBreach
+         expr: slo:availability:ratio:5m{service="order-service"} < 0.99
+         for: 1m
+         labels:
+           severity: critical
+           experiment: "pod-failure"
+         annotations:
+           summary: "Chaos experiment caused an SLO violation"
+           description: "Stop the experiment immediately and inspect the system"
+ ```
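+
+ ### Stage 4: Analysis & Improvement
+
+ #### 4.1 Data Collection
+ **What to collect**:
+ - All metric data from the experiment window
+ - Logs (application, system, middleware)
+ - Trace data (distributed tracing)
+ - Event timeline (fault injection and recovery timestamps)
+ - Team observation notes
+
+ #### 4.2 Result Analysis
+ **Analysis dimensions**:
+ 1. **Steady-state verification**:
+    - Did the steady-state hypothesis hold during the experiment?
+    - Was any SLO violated? For how long?
+    - Did the system recover automatically?
+
+ 2. **Performance impact**:
+    - How much did latency increase?
+    - How much did throughput drop?
+    - How many users were affected?
+
+ 3. **Recovery time**:
+    - How long was the MTTR (Mean Time To Recovery)?
+    - Was recovery automatic or did it need manual intervention?
+    - Did any secondary failures occur during recovery?
+
+ 4. **Issues found**:
+    - Any unexpected system behavior?
+    - Any monitoring blind spots?
+    - Any missing or inaccurate runbooks?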
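Two of these questions, how long an SLO was violated and how long recovery took, can be answered directly from the metric timeline. The sketch below assumes evenly spaced `(timestamp, value)` samples of a higher-is-better metric such as availability; the function names are hypothetical. It measures recovery for a single experiment, and MTTR is then the mean of that figure across experiments or incidents.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]  # (timestamp, metric value, e.g. availability ratio)


def violation_duration(samples: List[Sample], threshold: float, interval: timedelta) -> timedelta:
    """Total time below threshold, counting each violating sample as one interval."""
    return interval * sum(1 for _, value in samples if value < threshold)


def time_to_recovery(samples: List[Sample], injected_at: datetime, threshold: float) -> Optional[timedelta]:
    """Time from injection to the first healthy sample after the last violation.

    Returns timedelta(0) if the metric never dropped below the threshold,
    and None if the data ends while the metric is still violating.
    """
    after = [(ts, v) for ts, v in sorted(samples) if ts >= injected_at]
    last_bad = None
    for index, (_, value) in enumerate(after):
        if value < threshold:
            last_bad = index
    if last_bad is None:
        return timedelta(0)
    if last_bad == len(after) - 1:
        return None
    return after[last_bad + 1][0] - injected_at
```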
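+
+ **Experiment report template**:
+ ```markdown
+ # Chaos Experiment Report
+
+ ## Overview
+ - Experiment name: Order service single-Pod failure test
+ - Execution time: 2026-03-20 02:00 - 02:15
+ - Executed by: Zhang San
+ - Environment: production
+
+ ## Experiment Design
+ - Fault type: Pod deletion
+ - Injection target: order-service-7d9f8c6b5-x9k2m
+ - Duration: 10 minutes
+ - Steady-state hypothesis: availability >= 99.9%, P95 latency < 200ms
+
+ ## Results
+ ### Steady-state Verification
+ - [x] Availability stayed >= 99.9% (actual: 99.92%)
+ - [x] P95 latency < 200ms (actual: 180ms)
+ - [x] Error rate < 0.1% (actual: 0.08%)
+
+ ### Performance Impact
+ - Latency increase: +30ms (150ms -> 180ms)
+ - Throughput drop: 5% (1,200 QPS -> 1,140 QPS)
+ - User impact: no user complaints
+
+ ### Recovery Time
+ - Pod restart time: 45 seconds
+ - Service recovery time: 60 seconds
+ - Recovery method: automatic restart by Kubernetes
+ - Secondary failures: none
+
+ ## Issues Found
+ ### Critical Issues
+ 1. Database connections were not released promptly, so the pool was exhausted when the new Pod started
+    - Impact: new Pod startup delayed by 15 seconds
+    - Fix: configure the connection pool to release connections automatically on timeout
+
+ ### Minor Issues
+ 1. The monitoring dashboard lacks annotations for Pod restart events
+    - Impact: hard to locate the cause of a failure quickly
+    - Fix: add Kubernetes event monitoring
+
+ ## Improvement Actions
+ | Priority | Action | Owner | Due date |
+ |----------|--------|-------|----------|
+ | P0 | Tune the database connection pool configuration | Li Si | 2026-03-25 |
+ | P1 | Add event annotations to the monitoring dashboard | Wang Wu | 2026-03-27 |
+ | P2 | Update the runbook | Zhang San | 2026-03-30 |
+
+ ## Conclusion
+ The experiment succeeded: the system self-heals from a single-Pod failure. It also exposed a database connection pool configuration problem that needs tuning.
+ Recommendation: add a scenario with two Pods failing simultaneously to the next experiment to verify resilience under extreme conditions.
+ ```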
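+
+ #### 4.3 Implement Improvements
+ **Improvement priorities**:
+ - P0 (urgent): issues that affect SLOs; fix immediately
+ - P1 (high): issues that affect system resilience; fix within 1 week
+ - P2 (medium): missing monitoring or documentation; complete within 2 weeks
+ - P3 (low): optimization suggestions; schedule into later iterations
+
+ **Verify the improvements**:
+ - Rerun the same experiment after the fix
+ - Compare the metrics before and after
+ - Confirm the issue is resolved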
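Comparing the two runs is mostly a matter of knowing which direction counts as better for each metric. A minimal sketch follows, with hypothetical metric names; the direction table is an assumption to adapt to the metrics actually recorded.

```python
from typing import Dict

# True if a larger value is better, False if a smaller value is better (assumed metric names).
HIGHER_IS_BETTER = {
    "availability": True,
    "latency_p95_ms": False,
    "error_rate": False,
    "recovery_seconds": False,
}


def compare_runs(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, str]:
    """Label each metric present in both runs as improved, regressed, or unchanged."""
    verdicts = {}
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        if metric not in before or metric not in after:
            continue  # skip metrics that were not recorded in both runs
        delta = after[metric] - before[metric]
        if delta == 0:
            verdicts[metric] = "unchanged"
        elif (delta > 0) == higher_is_better:
            verdicts[metric] = "improved"
        else:
            verdicts[metric] = "regressed"
    return verdicts
```

A real comparison would likely add a tolerance band so that measurement noise is not reported as a regression.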
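+
+ ## Experiment Scenario Library
+
+ ### Scenario 1: Single-service Pod Failure
+ **Goal**: verify autoscaling and load balancing
+ **Fault**: delete 1 Pod
+ **Expected**: traffic shifts automatically to the remaining Pods and SLOs hold
+ **Difficulty**: beginner
+
+ ### Scenario 2: Database Connection Pool Exhaustion
+ **Goal**: verify connection pool management and timeout handling
+ **Fault**: simulate a large number of slow queries to exhaust the pool
+ **Expected**: the application degrades or trips a circuit breaker automatically; core functions stay available
+ **Difficulty**: intermediate
+
+ ### Scenario 3: Cache Service Outage
+ **Goal**: verify the cache degradation strategy
+ **Fault**: Redis goes down
+ **Expected**: the application falls back to the database; performance drops but the service stays available
+ **Difficulty**: intermediate
+
+ ### Scenario 4: Network Partition
+ **Goal**: verify high availability across availability zones
+ **Fault**: simulate a network partition of a single availability zone
+ **Expected**: traffic shifts automatically to the other availability zones
+ **Difficulty**: advanced
+
+ ### Scenario 5: Third-party API Timeout
+ **Goal**: verify circuit breakers and the degradation strategy
+ **Fault**: the payment gateway API times out
+ **Expected**: a backup payment method is enabled or a degradation notice is shown
+ **Difficulty**: intermediate
+
+ ### Scenario 6: CPU Spike
+ **Goal**: verify resource isolation and automatic scale-out
+ **Fault**: inject CPU stress
+ **Expected**: the HPA scales out automatically and the service stays responsive
+ **Difficulty**: beginner
+
+ ### Scenario 7: Disk Full
+ **Goal**: verify disk monitoring and cleanup mechanisms
+ **Fault**: fill the disk rapidly
+ **Expected**: an alert fires and space is cleaned up or expanded automatically
+ **Difficulty**: intermediate
+
+ ### Scenario 8: DNS Failure
+ **Goal**: verify service discovery and fallback mechanisms
+ **Fault**: DNS resolution fails
+ **Expected**: the service uses the DNS cache or hard-coded fallback addresses
+ **Difficulty**: advanced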
+
+ ## Tool Selection
+
+ ### Chaos Mesh (Kubernetes-native)
+ **Strengths**:
+ - Rich set of fault types (Pod, network, IO, stress, DNS)
+ - Declarative configuration (YAML)
+ - Visual dashboard
+ - CI/CD integration
+
+ **Best suited for**: Kubernetes environments
+
+ **Installation**:
+ ```bash
+ # Install with Helm
+ helm repo add chaos-mesh https://charts.chaos-mesh.org
+ helm repo update
+ # --create-namespace creates chaos-testing if it does not exist yet
+ helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing --create-namespace
+ ```
+
+ ### Litmus Chaos (cloud-native)
+ **Strengths**:
+ - Large catalog of predefined experiments (100+)
+ - ChaosHub experiment marketplace
+ - Supports non-Kubernetes environments
+ - Strong analysis capabilities
+
+ **Best suited for**: hybrid environments, enterprise use
+
+ **Installation**:
+ ```bash
+ kubectl apply -f https://litmuschaos.github.io/litmus/2.13.0/litmus-2.13.0.yaml
+ ```
+
+ ### Gremlin (commercial)
+ **Strengths**:
+ - SaaS platform with nothing to maintain
+ - Rich set of fault types
+ - Detailed analysis reports
+ - Enterprise-grade support
+
+ **Best suited for**: enterprise use, fast adoption
+
+ ### Chaos Blade (open-sourced by Alibaba)
+ **Strengths**:
+ - Lightweight and easy to pick up
+ - Supports multiple languages (Java/Go/C++)
+ - Rich set of fault scenarios
+
+ **Best suited for**: application-level fault injection
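+
+ ## Security and Compliance
+
+ ### Security Checklist
+ - [ ] Experiment scope restricted (namespace and label selectors)
+ - [ ] Access control (RBAC)
+ - [ ] Experiment approval process
+ - [ ] Emergency stop mechanism
+ - [ ] Data masking (to avoid leaking sensitive information)
+
+ ### Compliance Requirements
+ - Data protection regulations (GDPR/CCPA): ensure experiments do not leak user data
+ - Financial regulation (where applicable): file a notification in advance
+ - Audit logs: record every experiment operation
+
+ ## Common Failure Modes
+
+ ### 1. Insufficient Preparation
+ - **Monitoring not verified**: blind spots surface during the experiment and the system state cannot be judged
+ - **No rollback plan**: the system cannot be restored and the experiment turns into a real outage
+ - **Team not notified**: on-call staff mistake the experiment for a real failure and start incident response
+
+ ### 2. Blast Radius Out of Control
+ - **Production database hit by mistake**: a misconfigured label selector extends the fault to the database
+ - **Scope too large**: several faults injected at once cause cascading failures
+ - **Too much traffic affected**: the injected fault reaches too many users
+
+ ### 3. Poor Experiment Design
+ - **Unclear hypothesis**: no way to tell whether the experiment succeeded
+ - **Missing stop conditions**: the fault lasts too long and causes unnecessary impact
+ - **Dependencies ignored**: only the service itself is tested, not the services it depends on
+
+ ### 4. Organizational Issues
+ - **No ongoing practice**: experiments stop after a one-off run and the capability decays
+ - **Issues not fixed**: problems found are left unresolved, so the next experiment fails again
+ - **No documentation**: experiment knowledge is not captured and cannot be reproduced after staff turnover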
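+
+ ## Acceptance Criteria
+
+ ### Functional Acceptance
+ - [ ] Chaos engineering platform deployed (Chaos Mesh/Litmus)
+ - [ ] Chaos experiments completed on at least 3 core services
+ - [ ] Experiment scenario library established (>= 10 scenarios)
+ - [ ] Experiment report template and process documented
+
+ ### Quality Acceptance
+ - [ ] Experiment success rate >= 80% (system meets the steady-state hypothesis)
+ - [ ] Fix rate for issues found >= 90%
+ - [ ] No production incidents caused by chaos experiments
+ - [ ] MTTR improved by >= 20% (compared with before the experiments)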
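The four quality criteria are simple ratios and can be computed from program-level counts. The sketch below uses a hypothetical `ProgramStats` record; how the counts are gathered is outside its scope, and treating "no issues found" as a 100% fix rate is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ProgramStats:
    experiments_run: int
    experiments_passed: int      # runs where the steady-state hypothesis held
    issues_found: int
    issues_fixed: int
    incidents_caused: int        # production incidents caused by chaos experiments
    mttr_before_minutes: float
    mttr_after_minutes: float


def quality_checks(stats: ProgramStats) -> Dict[str, bool]:
    """Evaluate each quality acceptance criterion and return pass/fail per criterion."""
    if stats.experiments_run <= 0 or stats.mttr_before_minutes <= 0:
        raise ValueError("experiments_run and mttr_before_minutes must be positive")
    success_rate = stats.experiments_passed / stats.experiments_run
    # Assumption: with no issues found there is nothing left unfixed.
    fix_rate = 1.0 if stats.issues_found == 0 else stats.issues_fixed / stats.issues_found
    mttr_gain = (stats.mttr_before_minutes - stats.mttr_after_minutes) / stats.mttr_before_minutes
    return {
        "success_rate": success_rate >= 0.80,
        "fix_rate": fix_rate >= 0.90,
        "no_incidents": stats.incidents_caused == 0,
        "mttr_improvement": mttr_gain >= 0.20,
    }
```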
+
+ ### Operational Acceptance
+ - [ ] Monthly Game Day practice established
+ - [ ] Team training coverage at 100%
+ - [ ] Chaos tests integrated into CI/CD
+ - [ ] Experiment report archival rate at 100%
+
+ ## References
+
+ ### Classic Books
+ - Chaos Engineering (O'Reilly)
+ - Building Secure and Reliable Systems (Google)
+ - Site Reliability Engineering (Google)
+
+ ### Open-source Tools
+ - Chaos Mesh: https://chaos-mesh.org/
+ - Litmus Chaos: https://litmuschaos.io/
+ - Chaos Blade: https://github.com/chaosblade-io/chaosblade
+ - Gremlin: https://www.gremlin.com/
+
+ ### Best Practices
+ - Principles of Chaos Engineering: https://principlesofchaos.org/
+ - Chaos Engineering at Netflix: https://netflixtechblog.com/tagged/chaos-engineering
+ - Amazon GameDay: https://aws.amazon.com/blogs/awscn/the-aws-game-day/
@@ -0,0 +1,38 @@
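+ ---
+ id: incident-command-system
+ title: incident-command-system
+ domain: operations
+ category: incident-command-system.md
+ difficulty: intermediate
+ tags: [command, incident, operations, system]
+ quality_score: 70
+ last_updated: 2026-06-15
+ ---
+ # Developer: Excellent (11964948@qq.com)
+
+ ## Incident Command System (ICS) Handbook
+
+ ### Goals
+ - Achieve unified command, fast damage control, and clear coordination during production incidents.
+
+ ### Role Definitions
+ - Incident commander: owns decisions and priority management.
+ - Technical lead: identifies the root cause and executes the recovery plan.
+ - Communications lead: keeps internal and external audiences informed of status and impact.
+ - Scribe: maintains the timeline and a record of key decisions.
+
+ ### Response Process
+ - Severity classification: assign a severity quickly based on user impact and business loss.
+ - First-round damage control: protect critical transaction paths and data consistency first.
+ - Divided execution: run diagnosis, recovery, and communication in parallel.
+ - Continuous updates: publish status updates at fixed intervals.
+ - Closure and postmortem: confirm recovery, contain residual risk, and produce a postmortem.
+
+ ### Execution Gates
+ - Incidents rated P1 or higher must activate the command roles.
+ - Every key decision must carry a timestamp and an owner.
+ - A first-draft postmortem must be submitted within 24 hours of the incident ending.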
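Two of these gates lend themselves to an automated check. The sketch below assumes severities are written as `P0`, `P1`, `P2`, and so on, with a lower number meaning a more severe incident, so "P1 or higher" covers P0 and P1; that convention and all function and parameter names are assumptions, not part of the handbook.

```python
from datetime import datetime, timedelta
from typing import List, Optional

POSTMORTEM_DEADLINE = timedelta(hours=24)


def severity_rank(severity: str) -> int:
    """'P0' -> 0, 'P1' -> 1; a lower rank means a more severe incident (assumed convention)."""
    if len(severity) < 2 or severity[0] != "P" or not severity[1:].isdigit():
        raise ValueError(f"unrecognized severity: {severity!r}")
    return int(severity[1:])


def gate_violations(
    severity: str,
    commander: Optional[str],
    resolved_at: datetime,
    postmortem_at: Optional[datetime],
    now: datetime,
) -> List[str]:
    """Return the execution gates that an incident currently violates."""
    violations = []
    if severity_rank(severity) <= 1 and not commander:
        violations.append("incident commander not assigned")
    # An unsubmitted postmortem only becomes a violation once the deadline has passed.
    submitted = postmortem_at if postmortem_at is not None else now
    if submitted - resolved_at > POSTMORTEM_DEADLINE:
        violations.append("first-draft postmortem not submitted within 24 hours")
    return violations
```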
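+
+ ### Common Failure Modes
+ - No single commander, which leads to conflicting decisions from multiple leads.
+ - No postmortem after recovery, so the same problems recur.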