@umacloud/knowledge 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (418)
  1. package/00-governance/governance-capabilities.md +557 -0
  2. package/00-governance/knowledge-map.md +39 -0
  3. package/00-governance/maintenance-policy.md +76 -0
  4. package/00-governance/review-checklist.md +81 -0
  5. package/README.md +13 -0
  6. package/ai/01-standards/agent-development-complete.md +691 -0
  7. package/ai/01-standards/llm-application-complete.md +488 -0
  8. package/ai/01-standards/mlops-complete.md +798 -0
  9. package/ai/01-standards/prompt-engineering-complete.md +646 -0
  10. package/ai/01-standards/rag-architecture-complete.md +649 -0
  11. package/ai/02-playbooks/llm-evaluation-playbook.md +847 -0
  12. package/ai/03-checklists/ai-project-checklist.md +215 -0
  13. package/ai/04-antipatterns/ai-antipatterns.md +661 -0
  14. package/ai/05-cases/case-rag-production.md +147 -0
  15. package/ai/06-glossary/ai-glossary.md +162 -0
  16. package/ai/agent-evaluation-benchmark.md +53 -0
  17. package/ai/ai-agent-memory-context-management.md +41 -0
  18. package/ai/ai-cost-capacity-optimization-playbook.md +42 -0
  19. package/ai/ai-data-security-and-compliance-playbook.md +37 -0
  20. package/ai/ai-domain-index-and-checklist.md +40 -0
  21. package/ai/ai-governance-maturity-model.md +50 -0
  22. package/ai/ai-model-selection-and-routing-strategy.md +47 -0
  23. package/ai/ai-observability-and-oncall-runbook.md +52 -0
  24. package/ai/ai-rag-engineering-playbook.md +42 -0
  25. package/ai/ai-red-team-and-safety-evaluation.md +42 -0
  26. package/ai/ai-release-readiness-and-rollback-gate.md +42 -0
  27. package/ai/llm-agent-engineering-deep-dive.md +57 -0
  28. package/ai/prompt-and-tool-guardrails.md +52 -0
  29. package/api/01-standards/enterprise-api-standards.md +198 -0
  30. package/api/01-standards/rest-api-design-guide.md +63 -0
  31. package/api/02-playbooks/api-pagination-playbook.md +93 -0
  32. package/api/02-playbooks/graphql-production-playbook.md +176 -0
  33. package/api/03-checklists/api-review-checklist.md +55 -0
  34. package/api/04-antipatterns/api-antipatterns.md +112 -0
  35. package/architecture/01-standards/api-gateway-patterns.md +496 -0
  36. package/architecture/01-standards/cloud-native-patterns.md +644 -0
  37. package/architecture/01-standards/distributed-systems-patterns.md +591 -0
  38. package/architecture/01-standards/event-driven-architecture.md +595 -0
  39. package/architecture/01-standards/microservices-patterns-complete.md +968 -0
  40. package/architecture/01-standards/microservices-patterns.md +495 -0
  41. package/architecture/01-standards/system-design-interview.md +664 -0
  42. package/architecture/02-playbooks/microservices-patterns-playbook.md +137 -0
  43. package/architecture/02-playbooks/migration-playbook.md +780 -0
  44. package/architecture/02-playbooks/system-design-playbook.md +779 -0
  45. package/architecture/03-checklists/architecture-decision-checklist.md +297 -0
  46. package/architecture/04-antipatterns/architecture-antipatterns.md +417 -0
  47. package/architecture/05-cases/case-netflix-microservices.md +413 -0
  48. package/architecture/06-glossary/architecture-glossary.md +164 -0
  49. package/architecture/adr-template-and-examples.md +38 -0
  50. package/architecture/api-gateway-deep-dive.md +1291 -0
  51. package/architecture/configuration-management.md +1162 -0
  52. package/architecture/distributed-transactions.md +1220 -0
  53. package/architecture/microservices-complete.md +735 -0
  54. package/architecture/resilience-and-disaster-patterns.md +37 -0
  55. package/architecture/service-governance.md +1198 -0
  56. package/architecture/system-architecture-deep-dive.md +37 -0
  57. package/backend/01-standards/analytics-and-growth.md +65 -0
  58. package/backend/01-standards/api-and-error-conventions.md +120 -0
  59. package/backend/01-standards/application-layering-and-packaging.md +160 -0
  60. package/backend/01-standards/auth-implementation.md +104 -0
  61. package/backend/01-standards/backend-framework-idioms.md +74 -0
  62. package/backend/01-standards/background-jobs-and-async.md +66 -0
  63. package/backend/01-standards/caching-strategies-complete.md +390 -0
  64. package/backend/01-standards/config-and-observability.md +77 -0
  65. package/backend/01-standards/data-modeling-and-persistence.md +94 -0
  66. package/backend/01-standards/django-complete.md +1765 -0
  67. package/backend/01-standards/email-and-notifications.md +64 -0
  68. package/backend/01-standards/fastapi-complete.md +925 -0
  69. package/backend/01-standards/file-upload-and-storage.md +66 -0
  70. package/backend/01-standards/graphql-api-complete.md +416 -0
  71. package/backend/01-standards/llm-application-standard.md +78 -0
  72. package/backend/01-standards/message-queue-patterns.md +379 -0
  73. package/backend/01-standards/microservices-and-distributed.md +78 -0
  74. package/backend/01-standards/nestjs-complete.md +2167 -0
  75. package/backend/01-standards/payment-integration.md +80 -0
  76. package/backend/01-standards/rate-limiting-complete.md +451 -0
  77. package/backend/01-standards/realtime-and-websocket.md +65 -0
  78. package/backend/01-standards/search-and-filtering.md +64 -0
  79. package/backend/01-standards/spring-boot-complete.md +445 -0
  80. package/backend/02-playbooks/api-design-playbook.md +718 -0
  81. package/backend/02-playbooks/email-send-playbook.md +130 -0
  82. package/backend/02-playbooks/file-upload-s3-playbook.md +153 -0
  83. package/backend/02-playbooks/typescript-enterprise-playbook.md +133 -0
  84. package/backend/02-playbooks/websocket-realtime-playbook.md +154 -0
  85. package/backend/03-checklists/api-launch-checklist.md +189 -0
  86. package/backend/04-antipatterns/backend-antipatterns.md +1051 -0
  87. package/blockchain/01-standards/blockchain-basics.md +557 -0
  88. package/blockchain/01-standards/smart-contract-development.md +1315 -0
  89. package/cicd/01-standards/deployment-and-delivery-standard.md +96 -0
  90. package/cicd/01-standards/github-actions-complete.md +473 -0
  91. package/cicd/01-standards/release-and-store-submission.md +75 -0
  92. package/cicd/02-playbooks/cicd-pipeline-playbook.md +144 -0
  93. package/cicd/02-playbooks/release-management-playbook.md +605 -0
  94. package/cicd/03-checklists/pipeline-security-checklist.md +168 -0
  95. package/cicd/04-antipatterns/cicd-antipatterns.md +589 -0
  96. package/cicd/05-cases/case-deployment-automation.md +221 -0
  97. package/cicd/05-cases/case-gitops-transformation.md +212 -0
  98. package/cicd/06-glossary/cicd-glossary.md +114 -0
  99. package/cicd/cicd-blueprint-deep-dive.md +38 -0
  100. package/cicd/release-readiness-gate.md +37 -0
  101. package/cloud-native/01-standards/container-security.md +741 -0
  102. package/cloud-native/01-standards/kubernetes-complete.md +812 -0
  103. package/cloud-native/02-playbooks/api-gateway-playbook.md +155 -0
  104. package/cloud-native/02-playbooks/gitops-with-argocd.md +760 -0
  105. package/cloud-native/02-playbooks/k8s-troubleshooting-playbook.md +1942 -0
  106. package/cloud-native/02-playbooks/message-queue-playbook.md +129 -0
  107. package/cloud-native/02-playbooks/multicloud-governance.md +726 -0
  108. package/cloud-native/02-playbooks/serverless-patterns.md +788 -0
  109. package/cloud-native/02-playbooks/service-mesh-playbook.md +612 -0
  110. package/cloud-native/02-playbooks/terraform-iac-playbook.md +143 -0
  111. package/cloud-native/03-checklists/container-security-checklist.md +431 -0
  112. package/cloud-native/03-checklists/k8s-production-readiness-checklist.md +460 -0
  113. package/cloud-native/04-antipatterns/container-antipatterns.md +660 -0
  114. package/cloud-native/04-antipatterns/k8s-antipatterns.md +743 -0
  115. package/cloud-native/05-cases/case-k8s-migration.md +478 -0
  116. package/cloud-native/05-cases/case-k8s-scaling.md +642 -0
  117. package/cloud-native/05-cases/case-k8s-security-incident.md +397 -0
  118. package/cloud-native/06-glossary/cloud-native-glossary.md +337 -0
  119. package/cross-platform/01-standards/cross-platform-frameworks.md +83 -0
  120. package/cross-platform/01-standards/platform-selection-and-architecture.md +77 -0
  121. package/data/01-standards/elasticsearch-complete.md +2098 -0
  122. package/data/01-standards/postgresql-complete.md +1613 -0
  123. package/data/01-standards/redis-complete.md +1527 -0
  124. package/data/02-playbooks/database-optimization-playbook.md +403 -0
  125. package/data/02-playbooks/elasticsearch-production-playbook.md +132 -0
  126. package/data/03-checklists/database-launch-checklist.md +187 -0
  127. package/data/04-antipatterns/database-antipatterns.md +873 -0
  128. package/data/05-cases/case-database-migration.md +310 -0
  129. package/data/06-glossary/database-glossary.md +440 -0
  130. package/data/data-governance-and-modeling-deep-dive.md +39 -0
  131. package/data-engineering/01-standards/airflow-complete.md +523 -0
  132. package/data-engineering/01-standards/kafka-complete.md +1521 -0
  133. package/data-engineering/02-playbooks/spark-etl-playbook.md +496 -0
  134. package/data-engineering/03-checklists/pipeline-launch-checklist.md +194 -0
  135. package/data-engineering/04-antipatterns/data-pipeline-antipatterns.md +684 -0
  136. package/data-engineering/05-cases/case-real-time-pipeline.md +355 -0
  137. package/data-engineering/06-glossary/data-engineering-glossary.md +429 -0
  138. package/database/01-standards/database-schema-standards.md +147 -0
  139. package/database/02-playbooks/postgresql-optimization-quick.md +52 -0
  140. package/database/02-playbooks/postgresql-performance-optimization.md +58 -0
  141. package/database/02-playbooks/postgresql-production-playbook.md +146 -0
  142. package/database/02-playbooks/redis-caching-playbook.md +117 -0
  143. package/database/03-checklists/database-review-checklist.md +50 -0
  144. package/database/04-antipatterns/database-antipatterns.md +112 -0
  145. package/design/01-standards/ui-design-system-complete.md +423 -0
  146. package/design/02-playbooks/design-handoff-playbook.md +254 -0
  147. package/design/02-playbooks/design-review-playbook.md +388 -0
  148. package/design/03-checklists/design-review-checklist.md +246 -0
  149. package/design/04-antipatterns/design-antipatterns.md +378 -0
  150. package/design/05-cases/case-design-system-adoption.md +328 -0
  151. package/design/06-glossary/design-glossary.md +329 -0
  152. package/design/ui-full-lifecycle-cross-platform-playbook.md +571 -0
  153. package/design/ux-system-deep-dive.md +38 -0
  154. package/design-systems/00-craft-rules.md +71 -0
  155. package/design-systems/aesthetic-families.md +43 -0
  156. package/design-systems/anti-ai-slop.md +162 -0
  157. package/design-systems/bold-geometric.md +120 -0
  158. package/design-systems/brutalist-bold.md +103 -0
  159. package/design-systems/editorial-clean.md +109 -0
  160. package/design-systems/glass-aurora.md +108 -0
  161. package/design-systems/modern-minimal.md +145 -0
  162. package/design-systems/premium-luxury.md +106 -0
  163. package/design-systems/product-type-design-map.md +48 -0
  164. package/design-systems/soft-warm.md +123 -0
  165. package/design-systems/tech-utility.md +113 -0
  166. package/desktop/01-standards/desktop-app-standard.md +72 -0
  167. package/desktop/01-standards/desktop-design.md +71 -0
  168. package/development/00-governance/document-template.md +41 -0
  169. package/development/01-standards/api-versioning-strategies.md +432 -0
  170. package/development/01-standards/authentication-patterns-complete.md +479 -0
  171. package/development/01-standards/css-architecture-complete.md +550 -0
  172. package/development/01-standards/database-migration-strategies.md +484 -0
  173. package/development/01-standards/elasticsearch-complete.md +347 -0
  174. package/development/01-standards/git-complete.md +371 -0
  175. package/development/01-standards/golang-complete.md +1565 -0
  176. package/development/01-standards/graphql-complete.md +298 -0
  177. package/development/01-standards/javascript-bundlers-complete.md +469 -0
  178. package/development/01-standards/javascript-typescript-complete.md +528 -0
  179. package/development/01-standards/jest-complete.md +275 -0
  180. package/development/01-standards/linux-complete.md +234 -0
  181. package/development/01-standards/logging-observability-complete.md +526 -0
  182. package/development/01-standards/microservices-communication.md +502 -0
  183. package/development/01-standards/mongodb-complete.md +406 -0
  184. package/development/01-standards/oauth2-complete.md +285 -0
  185. package/development/01-standards/performance-optimization-complete.md +289 -0
  186. package/development/01-standards/playwright-complete.md +247 -0
  187. package/development/01-standards/postgresql-complete.md +456 -0
  188. package/development/01-standards/pytest-complete.md +340 -0
  189. package/development/01-standards/python-async-programming.md +902 -0
  190. package/development/01-standards/python-complete.md +956 -0
  191. package/development/01-standards/python-decorators-complete.md +799 -0
  192. package/development/01-standards/python-design-patterns.md +2854 -0
  193. package/development/01-standards/python-packaging-distribution.md +420 -0
  194. package/development/01-standards/python-testing-strategies.md +607 -0
  195. package/development/01-standards/python-web-frameworks-comparison.md +471 -0
  196. package/development/01-standards/redis-complete.md +317 -0
  197. package/development/01-standards/rest-api-complete.md +316 -0
  198. package/development/01-standards/rust-complete.md +578 -0
  199. package/development/01-standards/typescript-advanced-types.md +1513 -0
  200. package/development/01-standards/web-security-complete.md +292 -0
  201. package/development/02-playbooks/api-design-playbook.md +810 -0
  202. package/development/02-playbooks/database-migration-playbook.md +580 -0
  203. package/development/02-playbooks/debugging-playbook.md +692 -0
  204. package/development/02-playbooks/feature-delivery-playbook.md +430 -0
  205. package/development/02-playbooks/incident-hotfix-playbook.md +387 -0
  206. package/development/02-playbooks/performance-optimization-playbook.md +531 -0
  207. package/development/02-playbooks/performance-tuning-playbook.md +652 -0
  208. package/development/02-playbooks/refactor-playbook.md +403 -0
  209. package/development/02-playbooks/release-playbook.md +469 -0
  210. package/development/03-checklists/architecture-review-checklist.md +168 -0
  211. package/development/03-checklists/data-migration-checklist.md +157 -0
  212. package/development/03-checklists/oncall-handover-checklist.md +173 -0
  213. package/development/03-checklists/pr-checklist.md +158 -0
  214. package/development/03-checklists/production-readiness-checklist.md +190 -0
  215. package/development/03-checklists/release-readiness-checklist.md +154 -0
  216. package/development/03-checklists/security-review-checklist.md +182 -0
  217. package/development/04-antipatterns/api-antipatterns.md +657 -0
  218. package/development/04-antipatterns/architecture-antipatterns.md +686 -0
  219. package/development/04-antipatterns/backend-antipatterns.md +648 -0
  220. package/development/04-antipatterns/cicd-antipatterns.md +540 -0
  221. package/development/04-antipatterns/code-smell-antipatterns.md +571 -0
  222. package/development/04-antipatterns/data-antipatterns.md +658 -0
  223. package/development/04-antipatterns/database-antipatterns.md +578 -0
  224. package/development/04-antipatterns/frontend-antipatterns.md +635 -0
  225. package/development/04-antipatterns/reliability-antipatterns.md +700 -0
  226. package/development/04-antipatterns/security-antipatterns.md +747 -0
  227. package/development/05-cases/case-api-version-migration.md +428 -0
  228. package/development/05-cases/case-authorization-hardening.md +383 -0
  229. package/development/05-cases/case-bluegreen-rollback.md +466 -0
  230. package/development/05-cases/case-cache-snowball-protection.md +485 -0
  231. package/development/05-cases/case-ci-cd-pipeline.md +544 -0
  232. package/development/05-cases/case-database-scaling.md +500 -0
  233. package/development/05-cases/case-db-hotspot-optimization.md +487 -0
  234. package/development/05-cases/case-incident-mttr-reduction.md +563 -0
  235. package/development/05-cases/case-microservice-migration.md +375 -0
  236. package/development/05-cases/case-performance-optimization.md +406 -0
  237. package/development/05-cases/case-security-incident-response.md +345 -0
  238. package/development/06-glossary/full-stack-glossary.md +166 -0
  239. package/development/09-maturity/quarterly-audit-template.md +35 -0
  240. package/development/11-ui-excellence/ui-aesthetic-system.md +41 -0
  241. package/development/11-ui-excellence/ui-engineering-excellence.md +435 -0
  242. package/development/12-scenarios/development-scenarios-guide.md +565 -0
  243. package/development/13-implementation-assets/implementation-toolkit.md +282 -0
  244. package/development/13-implementation-assets/knowledge-gates-execution.md +43 -0
  245. package/development/14-full-lifecycle/software-lifecycle-gates.md +511 -0
  246. package/development/15-lifecycle-templates/project-templates-collection.md +791 -0
  247. package/development/api-contract-and-versioning-guide.md +36 -0
  248. package/development/api-governance-complete.md +43 -0
  249. package/development/backend-engineering-complete.md +43 -0
  250. package/development/code-review-quality-complete.md +43 -0
  251. package/development/concurrency-reliability-complete.md +43 -0
  252. package/development/database-engineering-complete.md +43 -0
  253. package/development/engineering-effectiveness-complete.md +43 -0
  254. package/development/engineering-standards-deep-dive.md +38 -0
  255. package/development/frontend-engineering-complete.md +43 -0
  256. package/development/performance-capacity-complete.md +43 -0
  257. package/development/refactor-migration-complete.md +42 -0
  258. package/development/refactoring-and-techdebt-playbook.md +37 -0
  259. package/development/security-in-development-complete.md +43 -0
  260. package/devops/01-standards/cicd-pipeline-complete.md +262 -0
  261. package/devops/01-standards/docker-complete.md +1490 -0
  262. package/devops/01-standards/github-actions-complete.md +337 -0
  263. package/devops/01-standards/kubernetes-complete.md +638 -0
  264. package/devops/01-standards/terraform-complete.md +2117 -0
  265. package/devops/02-playbooks/docker-compose-playbook.md +233 -0
  266. package/devops/02-playbooks/docker-k8s-production-playbook.md +186 -0
  267. package/devops/02-playbooks/docker-production-playbook.md +952 -0
  268. package/edge-iot/01-standards/edge-iot-complete.md +473 -0
  269. package/experts/architect/api-design.md +178 -0
  270. package/experts/architect/methodology.md +124 -0
  271. package/experts/architect/security.md +75 -0
  272. package/experts/backend-lead/methodology.md +216 -0
  273. package/experts/devops/methodology.md +160 -0
  274. package/experts/frontend-lead/methodology.md +178 -0
  275. package/experts/product-manager/industry/ecommerce.md +43 -0
  276. package/experts/product-manager/industry/saas.md +40 -0
  277. package/experts/product-manager/methodology.md +97 -0
  278. package/experts/qa-lead/methodology.md +123 -0
  279. package/experts/qa-lead/test-strategy.md +128 -0
  280. package/experts/uiux-designer/methodology.md +125 -0
  281. package/frontend/01-standards/accessibility-complete.md +532 -0
  282. package/frontend/01-standards/accessibility-standard.md +74 -0
  283. package/frontend/01-standards/admin-dashboard-and-crud.md +72 -0
  284. package/frontend/01-standards/design-tokens-complete.md +444 -0
  285. package/frontend/01-standards/forms-and-validation.md +77 -0
  286. package/frontend/01-standards/frontend-architecture-and-layering.md +119 -0
  287. package/frontend/01-standards/i18n-and-localization.md +65 -0
  288. package/frontend/01-standards/nextjs-complete.md +451 -0
  289. package/frontend/01-standards/react-complete.md +713 -0
  290. package/frontend/01-standards/react-hooks-complete-guide.md +1100 -0
  291. package/frontend/01-standards/react-hooks-complete.md +1171 -0
  292. package/frontend/01-standards/seo-and-web-vitals.md +77 -0
  293. package/frontend/01-standards/state-management-complete.md +444 -0
  294. package/frontend/01-standards/vue-complete.md +499 -0
  295. package/frontend/01-standards/vue3-complete.md +2002 -0
  296. package/frontend/01-standards/web-framework-best-practices.md +64 -0
  297. package/frontend/01-standards/web-performance-complete.md +495 -0
  298. package/frontend/02-playbooks/accessibility-a11y-playbook.md +161 -0
  299. package/frontend/02-playbooks/frontend-performance-playbook.md +707 -0
  300. package/frontend/02-playbooks/i18n-internationalization-playbook.md +120 -0
  301. package/frontend/02-playbooks/performance-optimization-playbook.md +163 -0
  302. package/frontend/02-playbooks/react-nextjs-production-playbook.md +167 -0
  303. package/frontend/02-playbooks/react-state-management-playbook.md +173 -0
  304. package/frontend/03-checklists/component-quality-checklist.md +166 -0
  305. package/frontend/03-checklists/frontend-launch-checklist.md +299 -0
  306. package/frontend/04-antipatterns/frontend-antipatterns.md +886 -0
  307. package/frontend/05-cases/case-performance-optimization.md +274 -0
  308. package/harmony/01-standards/harmonyos-arkts-standard.md +75 -0
  309. package/harmony/01-standards/harmonyos-design.md +65 -0
  310. package/high-quality-engineering-playbook.md +54 -0
  311. package/incident/01-standards/incident-response-complete.md +303 -0
  312. package/incident/02-playbooks/chaos-engineering-playbook.md +883 -0
  313. package/incident/02-playbooks/postmortem-playbook.md +398 -0
  314. package/incident/03-checklists/incident-readiness-checklist.md +181 -0
  315. package/incident/04-antipatterns/incident-antipatterns.md +490 -0
  316. package/incident/05-cases/case-cascade-failure.md +176 -0
  317. package/incident/06-glossary/incident-glossary.md +114 -0
  318. package/incident/postmortem-and-response-deep-dive.md +39 -0
  319. package/industries/ecommerce/ecommerce-complete.md +631 -0
  320. package/industries/education/education-complete.md +555 -0
  321. package/industries/fintech/fintech-complete.md +501 -0
  322. package/industries/gaming/gaming-complete.md +587 -0
  323. package/industries/healthcare/healthcare-complete.md +452 -0
  324. package/low-code/01-standards/low-code-complete.md +944 -0
  325. package/miniprogram/01-standards/ai-common-mistakes.md +61 -0
  326. package/miniprogram/01-standards/miniprogram-custom-navbar-capsule.md +77 -0
  327. package/miniprogram/01-standards/miniprogram-design.md +61 -0
  328. package/miniprogram/01-standards/miniprogram-standard.md +81 -0
  329. package/mobile/01-standards/android-material-design.md +70 -0
  330. package/mobile/01-standards/flutter-complete.md +384 -0
  331. package/mobile/01-standards/ios-design-hig.md +78 -0
  332. package/mobile/01-standards/mobile-app-standard.md +85 -0
  333. package/mobile/01-standards/react-native-complete.md +352 -0
  334. package/mobile/02-playbooks/mobile-cross-platform-playbook.md +175 -0
  335. package/mobile/02-playbooks/mobile-performance.md +473 -0
  336. package/mobile/03-checklists/mobile-release-checklist.md +234 -0
  337. package/mobile/04-antipatterns/mobile-antipatterns.md +798 -0
  338. package/mobile/05-cases/case-app-performance.md +500 -0
  339. package/mobile/05-cases/case-app-startup-optimization.md +218 -0
  340. package/mobile/06-glossary/mobile-glossary.md +484 -0
  341. package/observability/01-standards/observability-standards.md +103 -0
  342. package/observability/02-playbooks/prometheus-grafana-playbook.md +135 -0
  343. package/observability/02-playbooks/structured-logging-playbook.md +73 -0
  344. package/observability/03-checklists/observability-checklist.md +54 -0
  345. package/observability/04-antipatterns/observability-antipatterns.md +106 -0
  346. package/operations/01-standards/prometheus-monitoring-complete.md +1578 -0
  347. package/operations/02-playbooks/capacity-planning-playbook.md +620 -0
  348. package/operations/03-checklists/production-launch-checklist.md +365 -0
  349. package/operations/04-antipatterns/operations-antipatterns.md +664 -0
  350. package/operations/05-cases/case-sre-practices.md +581 -0
  351. package/operations/06-glossary/operations-glossary.md +120 -0
  352. package/operations/aiops-anomaly-detection.md +758 -0
  353. package/operations/capacity-planning.md +1061 -0
  354. package/operations/chaos-engineering.md +659 -0
  355. package/operations/incident-command-system.md +38 -0
  356. package/operations/observability-complete.md +442 -0
  357. package/operations/slo-sli-playbook.md +517 -0
  358. package/operations/sre-operations-deep-dive.md +39 -0
  359. package/package.json +8 -0
  360. package/performance/01-standards/performance-and-scalability.md +80 -0
  361. package/performance/01-standards/performance-standards.md +156 -0
  362. package/performance/02-playbooks/query-optimization-playbook.md +103 -0
  363. package/performance/03-checklists/performance-checklist.md +56 -0
  364. package/performance/04-antipatterns/performance-antipatterns.md +146 -0
  365. package/product/01-standards/product-management-complete.md +285 -0
  366. package/product/02-playbooks/feature-launch-playbook.md +207 -0
  367. package/product/02-playbooks/user-research-playbook.md +532 -0
  368. package/product/03-checklists/feature-launch-checklist.md +275 -0
  369. package/product/04-antipatterns/product-antipatterns.md +355 -0
  370. package/product/05-cases/case-mvp-to-scale.md +384 -0
  371. package/product/06-glossary/product-glossary.md +462 -0
  372. package/product/feature-prioritization-framework.md +40 -0
  373. package/product/kpi-and-metric-tree.md +37 -0
  374. package/product/product-discovery-and-prd-deep-dive.md +41 -0
  375. package/quantum/01-standards/quantum-complete.md +1186 -0
  376. package/security/01-standards/api-security-complete.md +511 -0
  377. package/security/01-standards/container-runtime-security.md +574 -0
  378. package/security/01-standards/data-protection-gdpr.md +543 -0
  379. package/security/01-standards/owasp-top10-complete.md +1890 -0
  380. package/security/01-standards/secure-coding-baseline.md +90 -0
  381. package/security/01-standards/supply-chain-security.md +441 -0
  382. package/security/01-standards/web-security-checklist.md +108 -0
  383. package/security/01-standards/zero-trust-architecture.md +521 -0
  384. package/security/02-playbooks/auth-sso-playbook.md +166 -0
  385. package/security/02-playbooks/incident-response-security-playbook.md +588 -0
  386. package/security/02-playbooks/owasp-api-security-playbook.md +129 -0
  387. package/security/02-playbooks/payment-integration-playbook.md +119 -0
  388. package/security/02-playbooks/penetration-testing-playbook.md +517 -0
  389. package/security/03-checklists/security-audit-checklist.md +356 -0
  390. package/security/04-antipatterns/security-coding-antipatterns.md +580 -0
  391. package/security/05-cases/case-log4shell-incident.md +537 -0
  392. package/security/05-cases/case-major-breaches.md +468 -0
  393. package/security/06-glossary/security-glossary.md +212 -0
  394. package/security/compliance-automation.md +993 -0
  395. package/security/container-security.md +680 -0
  396. package/security/devsecops-complete.md +426 -0
  397. package/security/sast-dast-sca.md +775 -0
  398. package/security/secrets-management.md +594 -0
  399. package/security/security-architecture-deep-dive.md +37 -0
  400. package/security/threat-modeling-stride-playbook.md +40 -0
  401. package/seed-templates/auth-system.md +59 -0
  402. package/seed-templates/blog-content.md +94 -0
  403. package/seed-templates/dashboard.md +89 -0
  404. package/seed-templates/docs-site.md +73 -0
  405. package/seed-templates/e-commerce.md +50 -0
  406. package/seed-templates/saas-landing.md +92 -0
  407. package/seed-templates/settings-page.md +51 -0
  408. package/testing/01-standards/test-strategy-and-layering.md +83 -0
  409. package/testing/01-standards/testing-strategy-complete.md +422 -0
  410. package/testing/01-standards/unit-testing-best-practices.md +118 -0
  411. package/testing/02-playbooks/e2e-testing-playbook.md +988 -0
  412. package/testing/02-playbooks/testing-strategy-playbook.md +126 -0
  413. package/testing/03-checklists/test-strategy-checklist.md +208 -0
  414. package/testing/04-antipatterns/testing-antipatterns.md +718 -0
  415. package/testing/05-cases/case-testing-transformation.md +300 -0
  416. package/testing/06-glossary/testing-glossary.md +110 -0
  417. package/testing/risk-based-test-matrix.md +36 -0
  418. package/testing/testing-strategy-deep-dive.md +37 -0
@@ -0,0 +1,758 @@
+ ---
+ id: aiops-anomaly-detection
+ title: aiops-anomaly-detection
+ domain: operations
+ category: aiops-anomaly-detection.md
+ difficulty: intermediate
+ tags: [aiops, anomaly, detection, operations, 告警降噪, 实施架构, 常见失败模式, 异常检测算法]
+ quality_score: 70
+ last_updated: 2026-06-15
+ ---
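+ # Developer: Excellent (11964948@qq.com)
+ # Function: AIOps anomaly detection practice guide
+ # Purpose: Use machine learning and AI techniques to automate the detection and diagnosis of system anomalies
+ # Created: 2026-03-20
+ # Last modified: 2026-03-20
+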
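+ ## Goal
+ Build an AIOps anomaly detection system that uses machine learning algorithms to automatically identify system anomalies, reduce alert noise, speed up root cause localization, and enable intelligent operations.
+
+ ## Scope
+ - Time-series metric anomaly detection (CPU, memory, QPS, latency, etc.)
+ - Log anomaly detection (error logs, abnormal patterns)
+ - Application performance anomalies (APM)
+ - Business metric anomalies (order volume, conversion rate, revenue)
+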
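+ ## Core Concepts
+
+ ### What Is AIOps
+ **Definition**: Artificial Intelligence for IT Operations, the use of AI/ML techniques to strengthen IT operations.
+
+ **Core capabilities**:
+ 1. **Anomaly Detection**: automatically identify metrics that deviate from their normal pattern
+ 2. **Root Cause Analysis**: automatically locate the root cause of an anomaly
+ 3. **Alert Reduction**: intelligently merge and deduplicate alerts
+ 4. **Predictive Analytics**: forecast future trends and potential problems
+ 5. **Auto-Remediation**: automatically execute remediation actions
+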
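Alert reduction (capability 3 above) can be pictured as grouping alerts that share a fingerprint inside a time window. The sketch below is only an illustration under stated assumptions: the function name `group_alerts`, the alert fields `service`, `metric`, and `timestamp`, and the 300-second window are hypothetical choices for the example, not something the guide prescribes.

```python
def group_alerts(alerts, window_seconds=300):
    """Merge alerts that share a fingerprint and arrive within one window.

    `alerts` is a list of dicts with hypothetical keys:
    "service", "metric", and "timestamp" (seconds since epoch).
    """
    groups = []
    open_group = {}  # fingerprint -> the group currently accepting alerts
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        # Assumption: two alerts are "the same" if service and metric match.
        fingerprint = (alert["service"], alert["metric"])
        group = open_group.get(fingerprint)
        if group is not None and alert["timestamp"] - group["first_seen"] <= window_seconds:
            # Same problem, still inside the window: count it, do not re-notify.
            group["count"] += 1
            group["last_seen"] = alert["timestamp"]
        else:
            # First alert for this fingerprint, or the window has expired.
            group = {
                "fingerprint": fingerprint,
                "first_seen": alert["timestamp"],
                "last_seen": alert["timestamp"],
                "count": 1,
            }
            groups.append(group)
            open_group[fingerprint] = group
    return groups
```

Each returned group would correspond to one notification, with `count` recording how many raw alerts it absorbed. A production system would likely use a richer fingerprint (labels, severity) than this two-field tuple.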
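+ ### Anomaly Types
+
+ #### 1. Point Anomalies
+ **Definition**: A single data point deviates markedly from the normal range.
+
+ **Examples**:
+ - CPU usage suddenly spikes to 100%
+ - Request latency jumps to 10 seconds
+ - Error rate surges from 0.1% to 20%
+
+ **Detection methods**:
+ - Static thresholds (fixed thresholds)
+ - Statistical methods (Z-score, IQR)
+ - Machine learning (Isolation Forest, One-Class SVM)
+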
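+ #### 2. Contextual Anomalies
+ **Definition**: A value that is anomalous in a specific context but normal in others.
+
+ **Examples**:
+ - CPU at 50% at 3 a.m. (anomalous, because the normal level is < 10%)
+ - CPU at 50% during the day (normal)
+ - Order volume at 10x during a promotion (normal)
+ - Order volume at 10x outside a promotion (anomalous)
+
+ **Detection methods**:
+ - Time context (weekday/weekend, day/night)
+ - Seasonal decomposition (STL, TBATS)
+ - Context-aware models
+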
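The time-context idea can be sketched by keeping a separate baseline per hour of day and scoring each value only against its own hour. This is a minimal sketch under assumptions, not a replacement for STL or TBATS: the function names, the `(hour_of_day, value)` input shape, and the `min_std` floor are hypothetical choices made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

def build_hourly_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (mean, std)}."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}

def is_contextual_anomaly(hour, value, baseline, threshold=3.0, min_std=1.0):
    """Compare a value against the baseline of its own hour only."""
    if hour not in baseline:
        return False  # no context to judge against
    mu, sigma = baseline[hour]
    # min_std is an assumed floor so that a very flat hour does not
    # turn tiny fluctuations into huge scores.
    return abs(value - mu) / max(sigma, min_std) > threshold
```

With a history in which 3 a.m. CPU sits in single digits and 2 p.m. CPU sits around 50%, the same reading of 50% would be flagged at hour 3 and accepted at hour 14, which mirrors the examples above.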
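+ #### 3. Collective Anomalies
+ **Definition**: Each point looks normal on its own, but the pattern formed by a group of points is anomalous.
+
+ **Examples**:
+ - CPU rises slowly and continuously (memory leak)
+ - Error rate increases slightly but steadily (service degradation)
+ - Response time fluctuates more widely (resource contention)
+
+ **Detection methods**:
+ - Time-series segmentation
+ - Change point detection
+ - Sequential pattern mining
+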
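One common change point detection technique is a one-sided CUSUM, which accumulates small deviations until their sum crosses a limit. The sketch below is one possible illustration, not the guide's chosen method: the function name, the use of the first `baseline_size` points as the reference period, and the `slack` and `limit` values (both in units of the baseline standard deviation) are assumptions.

```python
from statistics import mean, pstdev

def cusum_upward_shift(values, baseline_size=20, slack=0.5, limit=5.0):
    """Return the index where a one-sided CUSUM first signals an upward drift, or None."""
    if len(values) <= baseline_size:
        return None  # nothing to test beyond the reference period
    baseline = values[:baseline_size]
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1.0  # assumed fallback for a perfectly flat baseline
    cumulative = 0.0
    for i in range(baseline_size, len(values)):
        deviation = (values[i] - mu) / sigma
        # Deviations smaller than `slack` are treated as noise and decay the sum.
        cumulative = max(0.0, cumulative + deviation - slack)
        if cumulative > limit:
            return i
    return None
```

Because the statistic sums evidence over time, a slow upward drift can trigger it even while each individual point still sits within three standard deviations of the baseline mean, which is the situation a collective anomaly describes.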
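+ ## Anomaly Detection Algorithms
+
+ ### 1. Statistical Methods
+
+ #### Z-Score
+ **Principle**: Measure how many standard deviations a data point lies from the mean.
+
+ **Formula**:
+ ```
+ Z = (X - μ) / σ
+ ```
+ - X: current value
+ - μ: historical mean
+ - σ: historical standard deviation
+
+ **Threshold**: |Z| > 3 is usually treated as anomalous
+
+ **Pros**: simple and fast to compute
+ **Cons**: assumes a normal distribution; sensitive to outliers, because extreme values distort the mean and standard deviation
+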
+ **Implementation**:
+ ```python
+ import numpy as np
+
+ def z_score_anomaly(data, threshold=3):
+     """Return indices of points whose |Z| exceeds the threshold."""
+     values = np.asarray(data, dtype=float)
+     std = values.std()
+     if std == 0:
+         # A constant series has no spread, so no point can be flagged.
+         return []
+     z_scores = (values - values.mean()) / std
+     return np.flatnonzero(np.abs(z_scores) > threshold).tolist()
+ ```
+
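The Z-score's sensitivity to outliers can be reduced by swapping the mean and standard deviation for the median and the median absolute deviation (MAD). The sketch below shows this modified Z-score as an alternative technique, not as part of the original guide; the function name is hypothetical, and the 0.6745 scaling constant and 3.5 cutoff follow the commonly cited Iglewicz and Hoaglin recommendation.

```python
from statistics import median

def modified_z_score_anomaly(data, threshold=3.5):
    """Flag indices using the median and MAD instead of the mean and std."""
    med = median(data)
    mad = median(abs(x - med) for x in data)
    if mad == 0:
        # More than half the points equal the median; the score is undefined.
        return []
    return [i for i, x in enumerate(data)
            if abs(0.6745 * (x - med) / mad) > threshold]
```

For a series such as `[10, 11, 9, 10, 12, 10, 11, 9, 10, 500, 480]`, the two extreme values inflate the standard deviation enough that a plain Z-score with threshold 3 flags nothing, while the median-based score still isolates both of them.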
+ #### IQR (Interquartile Range)
+ **Principle**: Use quartiles to identify outliers.
+
+ **Formula**:
+ ```
+ IQR = Q3 - Q1
+ lower bound = Q1 - 1.5 * IQR
+ upper bound = Q3 + 1.5 * IQR
+ ```
+
+ **Pros**: robust to outliers (does not assume a normal distribution)
+ **Cons**: not suitable for non-stationary time series
+
125
+ **实现**:
126
+ ```python
127
+ import numpy as np
128
+
129
+ def iqr_anomaly(data, k=1.5):
130
+ q1 = np.percentile(data, 25)
131
+ q3 = np.percentile(data, 75)
132
+ iqr = q3 - q1
133
+ lower = q1 - k * iqr
134
+ upper = q3 + k * iqr
135
+ anomalies = [i for i, x in enumerate(data) if x < lower or x > upper]
136
+ return anomalies
137
+ ```
+
+ ### 2. Machine Learning Methods
+
+ #### Isolation Forest
+ **Principle**: isolate data points with random splits; anomalies are easier to isolate, so their paths are shorter
+
+ **Pros**:
+ - No labeled data required (unsupervised)
+ - Works on high-dimensional data
+ - Computationally efficient
+
+ **Cons**:
+ - Requires tuning (contamination, n_estimators)
+ - Sensitive to the assumed proportion of anomalies
+
+ **Implementation**:
+ ```python
+ import pandas as pd
+ from sklearn.ensemble import IsolationForest
+
+ def isolation_forest_anomaly(data, contamination=0.1):
+     # data: 2-D array of shape (n_samples, n_features)
+     model = IsolationForest(contamination=contamination, random_state=42)
+     predictions = model.fit_predict(data)  # -1 = anomaly, 1 = normal
+     anomalies = [i for i, pred in enumerate(predictions) if pred == -1]
+     return anomalies
+
+ # Usage example: read CPU usage data (cpu_usage.csv is a sample file name)
+ df = pd.read_csv('cpu_usage.csv')
+ anomalies = isolation_forest_anomaly(df[['cpu_usage']].values)
+ print(f"Anomaly indices: {anomalies}")
+ ```
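
The "shorter path" principle can be illustrated without any library. The toy sketch below works on one-dimensional data only and is not how scikit-learn implements the algorithm: it repeatedly picks a random split point, keeps the side that contains the target, and counts the splits needed until the target stands alone. All names and the sample values are assumptions for the illustration.

```python
import random

def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed to isolate `target` within a 1-D sample."""
    points = list(values)
    depth = 0
    while len(points) > 1 and depth < max_depth:
        lo, hi = min(points), max(points)
        if lo == hi:
            break  # only duplicates of the target remain
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        points = [p for p in points if (p < split) == (target < split)]
        depth += 1
    return depth

def average_depth(values, target, n_trees=200, seed=0):
    """Average the isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(n_trees)) / n_trees

cpu = [10, 10.5, 11, 11.5, 12, 50]
outlier_depth = average_depth(cpu, 50)
inlier_depth = average_depth(cpu, 11)
```

The value 50 is usually separated by the very first split, while 11 has to be cut away from close neighbours on both sides, so its average depth is larger. Isolation Forest turns this average path length into an anomaly score.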
+
+ #### One-Class SVM
+ **Principle**: learn a boundary around normal data; anomalies fall outside the boundary
+
+ **Pros**:
+ - Works with small samples
+ - Handles non-linear boundaries (kernel functions)
+
+ **Cons**:
+ - High computational complexity
+ - Requires tuning (nu, kernel)
+
+ **Implementation**:
+ ```python
+ from sklearn.svm import OneClassSVM
+
+ def one_class_svm_anomaly(data, nu=0.1):
+     # data: 2-D array (n_samples, n_features); scale features first, the RBF kernel is scale-sensitive
+     model = OneClassSVM(nu=nu, kernel='rbf', gamma='auto')
+     predictions = model.fit_predict(data)  # -1 = anomaly, 1 = normal
+     anomalies = [i for i, pred in enumerate(predictions) if pred == -1]
+     return anomalies
+ ```
+
+ #### Autoencoder
+ **Principle**: train an autoencoder to reconstruct normal data; anomalous data yields a large reconstruction error
+
+ **Pros**:
+ - Handles complex patterns
+ - Learns relationships across multiple variables
+
+ **Cons**:
+ - Requires a large amount of training data
+ - Expensive to train
+
+ **Implementation**:
+ ```python
+ import numpy as np
+ from tensorflow import keras
+
+ def build_autoencoder(input_dim, encoding_dim=32):
+     # The sigmoid output layer assumes every feature is scaled to [0, 1]
+     model = keras.Sequential([
+         keras.layers.Input(shape=(input_dim,)),
+         keras.layers.Dense(64, activation='relu'),
+         keras.layers.Dense(encoding_dim, activation='relu'),
+         keras.layers.Dense(64, activation='relu'),
+         keras.layers.Dense(input_dim, activation='sigmoid')
+     ])
+     model.compile(optimizer='adam', loss='mse')
+     return model
+
+ def autoencoder_anomaly(train_data, test_data, threshold_percentile=95):
+     # train_data should contain (mostly) normal samples
+     model = build_autoencoder(train_data.shape[1])
+     model.fit(train_data, train_data, epochs=50, batch_size=32, verbose=0)
+
+     def reconstruction_error(x):
+         return np.mean(np.square(x - model.predict(x, verbose=0)), axis=1)
+
+     # Derive the threshold from the training errors, not from the data being scored;
+     # otherwise a fixed share of test points is always flagged
+     threshold = np.percentile(reconstruction_error(train_data), threshold_percentile)
+     test_error = reconstruction_error(test_data)
+     return np.flatnonzero(test_error > threshold).tolist()
+ ```
+
+ ### 3. Time Series Methods
+
+ #### ARIMA (Autoregressive Integrated Moving Average)
+ **Principle**: predict each value from historical data; a large prediction error is treated as an anomaly
+
+ **Pros**:
+ - Suits series that are stationary, or become stationary after differencing
+ - Highly interpretable
+
+ **Cons**:
+ - Requires manual tuning (p, d, q)
+ - Not suitable for non-linear patterns
+
+ **Implementation**:
+ ```python
+ import numpy as np
+ from statsmodels.tsa.arima.model import ARIMA
+
+ def arima_anomaly(data, order=(1, 1, 1), threshold=3):
+     data = np.asarray(data, dtype=float)
+     fitted = ARIMA(data, order=order).fit()
+
+     # In-sample one-step-ahead predictions and their residuals
+     residuals = data - fitted.fittedvalues
+
+     # The first d residuals are typically dominated by the differencing start-up, so skip them
+     d = order[1]
+     std = np.std(residuals[d:])
+     anomalies = [i for i in range(d, len(residuals)) if abs(residuals[i]) > threshold * std]
+     return anomalies
+ ```
+
+ #### Prophet (Facebook)
+ **Principle**: decompose the time series into trend, seasonality, and holiday effects
+
+ **Pros**:
+ - Handles seasonality automatically
+ - Supports holiday effects
+ - Robust to missing values
+
+ **Cons**:
+ - Needs enough history (at least 2 seasonal cycles)
+ - Not suitable for high-frequency data (second-level)
+
+ **Implementation**:
+ ```python
+ from prophet import Prophet  # the package was published as `fbprophet` before version 1.0
+
+ def prophet_anomaly(df, interval_width=0.99):
+     """
+     df: DataFrame with columns 'ds' (datetime) and 'y' (value), sorted by 'ds'
+     """
+     model = Prophet(interval_width=interval_width)
+     model.fit(df)
+
+     # In-sample forecast with its uncertainty interval
+     forecast = model.predict(df)
+
+     # Flag observations outside the interval; work on a copy so the caller's frame is untouched
+     result = df.copy()
+     result['yhat_lower'] = forecast['yhat_lower'].values
+     result['yhat_upper'] = forecast['yhat_upper'].values
+     anomalies = result[(result['y'] < result['yhat_lower']) | (result['y'] > result['yhat_upper'])]
+
+     return anomalies
+ ```
+
+ #### LSTM (Long Short-Term Memory)
+ **Principle**: use an RNN to learn the time series pattern; a large prediction error is treated as an anomaly
+
+ **Pros**:
+ - Captures long-term dependencies
+ - Handles multiple variables
+
+ **Cons**:
+ - Expensive to train
+ - Requires a large amount of data
+
+ **Implementation**:
+ ```python
+ import numpy as np
+ from tensorflow import keras
+
+ def build_lstm_model(sequence_length, n_features):
+     model = keras.Sequential([
+         keras.layers.Input(shape=(sequence_length, n_features)),
+         keras.layers.LSTM(64, return_sequences=True),
+         keras.layers.LSTM(32, return_sequences=False),
+         keras.layers.Dense(n_features)
+     ])
+     model.compile(optimizer='adam', loss='mse')
+     return model
+
+ def lstm_anomaly(data, sequence_length=10, threshold_percentile=95):
+     # data: 2-D array of shape (n_samples, n_features)
+     data = np.asarray(data, dtype=float)
+
+     # Prepare sliding windows: each window predicts the observation that follows it
+     X = np.array([data[i:i + sequence_length] for i in range(len(data) - sequence_length)])
+     y = data[sequence_length:]
+
+     # Train the model
+     model = build_lstm_model(sequence_length, data.shape[1])
+     model.fit(X, y, epochs=50, batch_size=32, verbose=0)
+
+     # Predict and compute the per-step error
+     predictions = model.predict(X, verbose=0)
+     mse = np.mean(np.square(y - predictions), axis=1)
+
+     # Detect anomalies; a percentile threshold always flags (100 - percentile)% of the points
+     threshold = np.percentile(mse, threshold_percentile)
+     anomalies = [i + sequence_length for i, error in enumerate(mse) if error > threshold]
+     return anomalies
+ ```
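+
+ ### 4. Multivariate Anomaly Detection
+
+ #### PCA (Principal Component Analysis)
+ **Principle**: reduce dimensionality and reconstruct the data; a large reconstruction error is treated as an anomaly
+
+ **Pros**:
+ - Works on high-dimensional data
+ - Computationally efficient
+
+ **Cons**:
+ - Assumes linear relationships
+ - The number of principal components must be chosen
+
+ **Implementation**:
+ ```python
+ import numpy as np
+ from sklearn.decomposition import PCA
+
+ def pca_anomaly(data, n_components=0.95, threshold_percentile=95):
+     # data: 2-D array (n_samples, n_features), ideally standardized beforehand
+     data = np.asarray(data, dtype=float)
+
+     # Fit PCA; a float n_components keeps enough components to explain that share of variance
+     pca = PCA(n_components=n_components)
+     reduced = pca.fit_transform(data)
+
+     # Reconstruct the data
+     reconstructed = pca.inverse_transform(reduced)
+
+     # Compute the reconstruction error
+     mse = np.mean(np.square(data - reconstructed), axis=1)
+
+     # Detect anomalies
+     threshold = np.percentile(mse, threshold_percentile)
+     anomalies = [i for i, error in enumerate(mse) if error > threshold]
+     return anomalies
+ ```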
+
+ ## Implementation Architecture
+
+ ### Data Collection Layer
+ ```
+ Prometheus (metrics)
+ -> Vector/Fluentd (logs)
+ -> OpenTelemetry (traces)
+ -> Kafka (data bus)
+ ```
+
+ ### Feature Engineering Layer
+ ```python
+ # Feature extraction (fragment): assumes cpu_usage, memory_usage, request_rate and error_rate
+ # are pandas Series sampled once per minute, and timestamp is their DatetimeIndex
+ features = {
+     # Raw metrics
+     'cpu_usage': cpu_usage,
+     'memory_usage': memory_usage,
+     'request_rate': request_rate,
+     'error_rate': error_rate,
+
+     # Rolling statistics (5 samples = 5 minutes at 1-minute resolution)
+     'cpu_usage_mean_5m': cpu_usage.rolling(5).mean(),
+     'cpu_usage_std_5m': cpu_usage.rolling(5).std(),
+     'cpu_usage_max_5m': cpu_usage.rolling(5).max(),
+
+     # Rate-of-change features
+     'cpu_usage_diff': cpu_usage.diff(),
+     'request_rate_pct_change': request_rate.pct_change(),
+
+     # Time features
+     'hour_of_day': timestamp.hour,
+     'day_of_week': timestamp.dayofweek,
+
+     # Cross features (zero denominators become NaN instead of inf)
+     'cpu_memory_ratio': cpu_usage / memory_usage.replace(0, float('nan')),
+     'error_per_request': error_rate / request_rate.replace(0, float('nan'))
+ }
+ ```
+
+ ### Model Training Layer
+ ```yaml
+ # Model training pipeline
+ pipeline:
+   - name: data_preprocessing
+     steps:
+       - handle_missing_values
+       - remove_outliers
+       - normalize_data
+
+   - name: feature_engineering
+     steps:
+       - extract_rolling_features
+       - extract_time_features
+       - extract_cross_features
+
+   - name: model_training
+     algorithm: isolation_forest
+     hyperparameters:
+       n_estimators: 100
+       contamination: 0.1
+       max_samples: 256
+
+   - name: model_evaluation
+     metrics:
+       - precision
+       - recall
+       - f1_score
+       - false_positive_rate
+ ```
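
The four metrics named in the `model_evaluation` step can be computed from a labeled evaluation window. The sketch below assumes the detector output and the ground truth are both given as sets of sample indices; the function name and the example numbers are hypothetical.

```python
def detection_metrics(predicted, actual, total):
    """predicted/actual: indices flagged by the model / labeled as true anomalies."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # flagged and truly anomalous
    fp = len(predicted - actual)   # flagged but normal
    fn = len(actual - predicted)   # missed anomalies
    tn = total - tp - fp - fn      # normal and not flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall,
            "f1_score": f1, "false_positive_rate": fpr}

# Hypothetical window of 100 samples: 4 flagged, 3 true anomalies, 2 in common
metrics = detection_metrics(predicted=[3, 7, 9, 12], actual=[3, 9, 15], total=100)
```

Because anomalies are rare, the false positive rate tends to look small even when precision is poor, which is why it helps to track both.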
+
+ ### Inference Service Layer
+ ```python
+ # Anomaly detection service
+ from fastapi import FastAPI
+ import joblib
+
+ app = FastAPI()
+ model = joblib.load('isolation_forest_model.pkl')
+
+ @app.post("/detect")
+ async def detect_anomaly(metrics: dict):
+     # Feature extraction; extract_features is a project-specific helper that is not shown here
+     features = extract_features(metrics)
+
+     # predict returns -1 for anomalies and 1 for normal points
+     prediction = model.predict([features])[0]
+
+     # decision_function is negative for anomalies and positive for normal points;
+     # it is a raw score, not a probability, so it is reported as "score" in both cases
+     score = float(model.decision_function([features])[0])
+
+     # Return the result
+     return {
+         "is_anomaly": bool(prediction == -1),
+         "score": score,
+         "timestamp": metrics['timestamp']
+     }
+ ```
+
+ ### Alert Integration Layer
+ ```yaml
+ # Alert rule configuration
+ groups:
+   - name: aiops_anomaly_alerts
+     rules:
+       - alert: AIAnomalyDetected
+         expr: aiops_anomaly_score{service="order-service"} > 0.8
+         for: 1m
+         labels:
+           severity: warning
+           source: aiops
+         annotations:
+           summary: "AI detected an anomaly"
+           description: "Anomaly detected in service {{ $labels.service }}, score {{ $value }}"
+ ```
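+
+ ## Root Cause Analysis
+
+ ### Causal Inference
+ **Method**: build a causal graph of the relationships between metrics
+
+ **Tools**:
+ - CausalImpact (Google)
+ - DoWhy (Microsoft)
+ - PCMCI (causal discovery)
+
+ **Implementation**:
+ ```python
+ import pandas as pd
+ # Assumes a Python port that exposes CausalImpact(data, pre_period, post_period),
+ # such as pycausalimpact; other ports may differ
+ from causalimpact import CausalImpact
+
+ def analyze_root_cause(target_metric, related_metrics, pre_period, post_period):
+     """
+     Analyze the root cause of an anomaly in the target metric
+     """
+     # Merge the data; the first column is the response, the rest act as control series
+     data = pd.concat([target_metric] + related_metrics, axis=1)
+
+     # Causal analysis: compare the post-period with what the pre-period would predict
+     impact = CausalImpact(data, pre_period, post_period)
+
+     # Output the result
+     print(impact.summary())
+     impact.plot()
+
+     return impact
+ ```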
+
+ ### Correlation Analysis
+ **Method**: discover correlations between anomalous metrics
+
+ **Implementation**:
+ ```python
+ import pandas as pd
+
+ def correlation_analysis(anomaly_metrics, threshold=0.8):
+     """
+     Analyze correlations between anomalous metrics (anomaly_metrics: DataFrame, one column per metric)
+     """
+     # Compute the correlation matrix
+     corr_matrix = anomaly_metrics.corr()
+
+     # Find highly correlated metric pairs (upper triangle only, so each pair appears once)
+     highly_correlated = []
+     for i in range(len(corr_matrix.columns)):
+         for j in range(i + 1, len(corr_matrix.columns)):
+             if abs(corr_matrix.iloc[i, j]) > threshold:
+                 highly_correlated.append({
+                     'metric1': corr_matrix.columns[i],
+                     'metric2': corr_matrix.columns[j],
+                     'correlation': corr_matrix.iloc[i, j]
+                 })
+
+     return highly_correlated
+ ```
+
+ ### Graph Analysis
+ **Method**: propagate the anomaly along the service dependency graph
+
+ **Implementation**:
+ ```python
+ import networkx as nx
+
+ def propagate_anomaly(dependency_graph, anomaly_service):
+     """
+     Propagate an anomaly through the dependency graph.
+     Assumes an edge u -> v means a failure in u can affect v.
+     """
+     G = nx.DiGraph(dependency_graph)
+
+     # Find the affected services (everything reachable from the anomalous service)
+     affected_services = nx.descendants(G, anomaly_service)
+
+     # Compute the impact path to each affected service
+     paths = {}
+     for service in affected_services:
+         path = nx.shortest_path(G, anomaly_service, service)
+         paths[service] = path
+
+     return {
+         'anomaly_source': anomaly_service,
+         'affected_services': list(affected_services),
+         'impact_paths': paths
+     }
+ ```
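
The dependency graph can also be read in the opposite direction: given the set of services that are currently anomalous, look for the ones whose own dependencies are all healthy. The sketch below is a simple heuristic under stated assumptions, not an established algorithm from the source; `depends_on` (service to the services it calls) and the service names are hypothetical.

```python
def root_cause_candidates(depends_on, anomalous):
    """Return anomalous services none of whose direct dependencies is anomalous."""
    anomalous = set(anomalous)
    return sorted(
        service for service in anomalous
        if not any(dep in anomalous for dep in depends_on.get(service, []))
    )

# Hypothetical call graph: each service maps to the services it calls
depends_on = {
    "gateway": ["order-service"],
    "order-service": ["payment-service", "inventory-service"],
    "payment-service": ["db"],
    "inventory-service": ["db"],
    "db": [],
}
candidates = root_cause_candidates(
    depends_on, {"gateway", "order-service", "payment-service"}
)
```

Here `gateway` and `order-service` are explained by an anomalous dependency, so only `payment-service` remains as a candidate. With cyclic dependencies or partial monitoring coverage the result is only a starting point for investigation.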
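+
+ ## Alert Noise Reduction
+
+ ### Alert Aggregation
+ **Strategies**:
+ - Time-window aggregation: merge identical alerts raised within 5 minutes
+ - Root-cause aggregation: merge related alerts based on root cause analysis
+ - Service aggregation: merge multiple alerts from the same service
+
+ **Implementation**:
+ ```python
+ def aggregate_alerts(alerts, time_window=300):
+     """
+     Aggregate alerts with the same (service, alert_name) that arrive within
+     time_window seconds of the previous one (timestamps are numeric seconds)
+     """
+     open_groups = {}  # key -> group that is currently accepting alerts
+     aggregated = []
+
+     # Process in time order so the window comparison is meaningful
+     for alert in sorted(alerts, key=lambda a: a['timestamp']):
+         key = (alert['service'], alert['alert_name'])
+         group = open_groups.get(key)
+
+         # Check the time window
+         if group is not None and alert['timestamp'] - group['last_seen'] < time_window:
+             group['count'] += 1
+             group['last_seen'] = alert['timestamp']
+             group['samples'].append(alert)
+         else:
+             # First alert for this key, or the window expired: start a new group
+             # instead of silently dropping the alert
+             group = {
+                 'service': alert['service'],
+                 'alert_name': alert['alert_name'],
+                 'count': 1,
+                 'first_seen': alert['timestamp'],
+                 'last_seen': alert['timestamp'],
+                 'samples': [alert]
+             }
+             open_groups[key] = group
+             aggregated.append(group)
+
+     return aggregated
+ ```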
+
+ ### Alert Prioritization
+ **Strategies**:
+ - By business impact: alerts from core services get higher priority
+ - By anomaly score: a higher anomaly score means higher priority
+ - By historical frequency: alerts that are frequently false positives get lower priority
+
+ **Implementation**:
+ ```python
+ def prioritize_alerts(alerts, service_priority, anomaly_scores, historical_fp_rate):
+     """
+     Rank alerts by priority. The weights assume all three inputs are
+     normalized to a comparable range such as [0, 1].
+     """
+     for alert in alerts:
+         # Business priority score
+         business_score = service_priority.get(alert['service'], 1)
+
+         # Anomaly score
+         anomaly_score = anomaly_scores.get(alert['id'], 0.5)
+
+         # Penalty for historical false positives
+         fp_penalty = historical_fp_rate.get(alert['alert_name'], 0)
+
+         # Combined score (stored on the alert itself)
+         alert['priority_score'] = (
+             business_score * 0.4 +
+             anomaly_score * 0.4 -
+             fp_penalty * 0.2
+         )
+
+     # Sort, highest priority first
+     sorted_alerts = sorted(alerts, key=lambda x: x['priority_score'], reverse=True)
+     return sorted_alerts
+ ```
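+
+ ## Common Failure Modes
+
+ ### 1. Too Many False Positives
+ **Causes**:
+ - The threshold is too sensitive (set too low)
+ - The training data contains anomalies
+ - Seasonality/periodicity is not taken into account
+
+ **Solutions**:
+ - Raise the threshold (use a higher percentile)
+ - Clean the training data
+ - Add seasonal decomposition
+
+ ### 2. Missing Critical Anomalies
+ **Causes**:
+ - The threshold is too lenient (set too high)
+ - The model underfits
+ - Some anomaly types are not covered
+
+ **Solutions**:
+ - Lower the threshold
+ - Increase model complexity
+ - Combine multiple detection algorithms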
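
Combining detectors can be as simple as voting over the indices each one flags. The sketch below assumes every detector returns a list of anomalous indices, as the functions in this document do; the function name, the detector names, and the index values are illustrative.

```python
from collections import Counter

def vote_anomalies(detections, min_votes=2):
    """detections: dict of detector name -> iterable of flagged indices."""
    votes = Counter()
    for indices in detections.values():
        votes.update(set(indices))  # each detector votes at most once per index
    return sorted(i for i, n in votes.items() if n >= min_votes)

# Hypothetical outputs of three detectors on the same series
detections = {
    "z_score": [3, 9],
    "iqr": [9, 12],
    "isolation_forest": [3, 9, 20],
}
majority = vote_anomalies(detections, min_votes=2)
union = vote_anomalies(detections, min_votes=1)
```

`min_votes=1` takes the union of all detectors and favours recall, which addresses missed anomalies. Raising `min_votes` trades recall for fewer false positives.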
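+
+ ### 3. Model Drift
+ **Causes**:
+ - Business patterns change
+ - The system architecture evolves
+ - The data distribution shifts
+
+ **Solutions**:
+ - Retrain the model regularly (weekly/monthly)
+ - Online learning (incremental updates)
+ - Monitor model performance metrics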
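
One way to notice a shift in the data distribution is the Population Stability Index (PSI), which compares how a reference sample and a recent sample fall into the same bins. The sketch below is a minimal version under stated assumptions: equal-width bins derived from the reference sample, and a small floor to avoid `log(0)`. The thresholds mentioned afterwards are a commonly quoted rule of thumb, not a value from this document.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample (expected) and a recent sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clip values outside the reference range
        # A small floor keeps the logarithm defined for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = list(range(100))          # hypothetical training-time feature values
recent = [v + 50 for v in reference]  # the same feature after a shift
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, recent)
```

A PSI below about 0.1 is often read as stable and above about 0.25 as a significant shift that may justify retraining. The cutoffs should be validated on the metrics being monitored.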
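+
+ ### 4. Insufficient Feature Engineering
+ **Causes**:
+ - Lack of domain knowledge
+ - Poor feature selection
+ - Too many feature dimensions
+
+ **Solutions**:
+ - Work with domain experts
+ - Use feature selection algorithms
+ - PCA/feature importance analysis
+
+ ### 5. High Computational Cost
+ **Causes**:
+ - The model is too complex
+ - The data volume is too large
+ - Strict real-time requirements
+
+ **Solutions**:
+ - Simplify the model (pruning/quantization)
+ - Data sampling/downsampling
+ - Distributed computing/edge computing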
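+
+ ## Acceptance Criteria
+
+ ### Functional Acceptance
+ - [ ] Anomaly detection models deployed (>= 3 algorithms)
+ - [ ] Anomaly detection covers >= 90% of core services
+ - [ ] Alert noise reduction is live
+ - [ ] Root cause analysis is available
+
+ ### Performance Acceptance
+ - [ ] False positive rate < 10%
+ - [ ] False negative rate < 5% (critical anomalies)
+ - [ ] Detection latency < 30 seconds
+ - [ ] Alert noise reduction rate >= 50%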
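
The document does not define how the alert noise reduction rate is measured. One plausible reading, shown here purely as an assumption, is the share of raw alerts that no longer reach on-call staff after aggregation and deduplication.

```python
def alert_reduction_rate(raw_alert_count, notified_alert_count):
    """Share of raw alerts suppressed before notification (assumed definition)."""
    if raw_alert_count == 0:
        return 0.0  # nothing to reduce
    return 1 - notified_alert_count / raw_alert_count

# Hypothetical day: 200 raw alerts collapsed into 80 notifications
rate = alert_reduction_rate(200, 80)
meets_target = rate >= 0.5
```

Whatever definition a team adopts, it helps to fix the counting window and the point of measurement (before or after routing) so that the number stays comparable from month to month.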
+
+ ### Operational Acceptance
+ - [ ] A regular model retraining process is in place
+ - [ ] Team training coverage is 100%
+ - [ ] The anomaly detection dashboard is live
+ - [ ] A monthly anomaly detection report is produced
+
+ ## Reference Resources
+
+ ### Open Source Tools
+ - Prometheus + Alertmanager: metrics collection and alerting
+ - Grafana: visualization
+ - ELK Stack: log analysis
+ - PyOD (Python Outlier Detection): anomaly detection algorithm library
+ - Facebook Prophet: time series forecasting
+ - TensorFlow/Keras: deep learning models
+
+ ### Cloud Services
+ - AWS CloudWatch Anomaly Detection
+ - Azure Monitor Anomaly Detector
+ - Google Cloud Monitoring Anomaly Detection
+ - Datadog Watchdog
+ - Dynatrace Davis
+
+ ### Learning Resources
+ - AIOps: Artificial Intelligence for IT Operations (O'Reilly)
+ - Time Series Analysis and Its Applications (Springer)
+ - Anomaly Detection for Monitoring (Datadog)
+ - Google SRE Book - Chapter on Monitoring Distributed Systems