@umacloud/knowledge 1.0.1
- package/00-governance/governance-capabilities.md +557 -0
- package/00-governance/knowledge-map.md +39 -0
- package/00-governance/maintenance-policy.md +76 -0
- package/00-governance/review-checklist.md +81 -0
- package/README.md +13 -0
- package/ai/01-standards/agent-development-complete.md +691 -0
- package/ai/01-standards/llm-application-complete.md +488 -0
- package/ai/01-standards/mlops-complete.md +798 -0
- package/ai/01-standards/prompt-engineering-complete.md +646 -0
- package/ai/01-standards/rag-architecture-complete.md +649 -0
- package/ai/02-playbooks/llm-evaluation-playbook.md +847 -0
- package/ai/03-checklists/ai-project-checklist.md +215 -0
- package/ai/04-antipatterns/ai-antipatterns.md +661 -0
- package/ai/05-cases/case-rag-production.md +147 -0
- package/ai/06-glossary/ai-glossary.md +162 -0
- package/ai/agent-evaluation-benchmark.md +53 -0
- package/ai/ai-agent-memory-context-management.md +41 -0
- package/ai/ai-cost-capacity-optimization-playbook.md +42 -0
- package/ai/ai-data-security-and-compliance-playbook.md +37 -0
- package/ai/ai-domain-index-and-checklist.md +40 -0
- package/ai/ai-governance-maturity-model.md +50 -0
- package/ai/ai-model-selection-and-routing-strategy.md +47 -0
- package/ai/ai-observability-and-oncall-runbook.md +52 -0
- package/ai/ai-rag-engineering-playbook.md +42 -0
- package/ai/ai-red-team-and-safety-evaluation.md +42 -0
- package/ai/ai-release-readiness-and-rollback-gate.md +42 -0
- package/ai/llm-agent-engineering-deep-dive.md +57 -0
- package/ai/prompt-and-tool-guardrails.md +52 -0
- package/api/01-standards/enterprise-api-standards.md +198 -0
- package/api/01-standards/rest-api-design-guide.md +63 -0
- package/api/02-playbooks/api-pagination-playbook.md +93 -0
- package/api/02-playbooks/graphql-production-playbook.md +176 -0
- package/api/03-checklists/api-review-checklist.md +55 -0
- package/api/04-antipatterns/api-antipatterns.md +112 -0
- package/architecture/01-standards/api-gateway-patterns.md +496 -0
- package/architecture/01-standards/cloud-native-patterns.md +644 -0
- package/architecture/01-standards/distributed-systems-patterns.md +591 -0
- package/architecture/01-standards/event-driven-architecture.md +595 -0
- package/architecture/01-standards/microservices-patterns-complete.md +968 -0
- package/architecture/01-standards/microservices-patterns.md +495 -0
- package/architecture/01-standards/system-design-interview.md +664 -0
- package/architecture/02-playbooks/microservices-patterns-playbook.md +137 -0
- package/architecture/02-playbooks/migration-playbook.md +780 -0
- package/architecture/02-playbooks/system-design-playbook.md +779 -0
- package/architecture/03-checklists/architecture-decision-checklist.md +297 -0
- package/architecture/04-antipatterns/architecture-antipatterns.md +417 -0
- package/architecture/05-cases/case-netflix-microservices.md +413 -0
- package/architecture/06-glossary/architecture-glossary.md +164 -0
- package/architecture/adr-template-and-examples.md +38 -0
- package/architecture/api-gateway-deep-dive.md +1291 -0
- package/architecture/configuration-management.md +1162 -0
- package/architecture/distributed-transactions.md +1220 -0
- package/architecture/microservices-complete.md +735 -0
- package/architecture/resilience-and-disaster-patterns.md +37 -0
- package/architecture/service-governance.md +1198 -0
- package/architecture/system-architecture-deep-dive.md +37 -0
- package/backend/01-standards/analytics-and-growth.md +65 -0
- package/backend/01-standards/api-and-error-conventions.md +120 -0
- package/backend/01-standards/application-layering-and-packaging.md +160 -0
- package/backend/01-standards/auth-implementation.md +104 -0
- package/backend/01-standards/backend-framework-idioms.md +74 -0
- package/backend/01-standards/background-jobs-and-async.md +66 -0
- package/backend/01-standards/caching-strategies-complete.md +390 -0
- package/backend/01-standards/config-and-observability.md +77 -0
- package/backend/01-standards/data-modeling-and-persistence.md +94 -0
- package/backend/01-standards/django-complete.md +1765 -0
- package/backend/01-standards/email-and-notifications.md +64 -0
- package/backend/01-standards/fastapi-complete.md +925 -0
- package/backend/01-standards/file-upload-and-storage.md +66 -0
- package/backend/01-standards/graphql-api-complete.md +416 -0
- package/backend/01-standards/llm-application-standard.md +78 -0
- package/backend/01-standards/message-queue-patterns.md +379 -0
- package/backend/01-standards/microservices-and-distributed.md +78 -0
- package/backend/01-standards/nestjs-complete.md +2167 -0
- package/backend/01-standards/payment-integration.md +80 -0
- package/backend/01-standards/rate-limiting-complete.md +451 -0
- package/backend/01-standards/realtime-and-websocket.md +65 -0
- package/backend/01-standards/search-and-filtering.md +64 -0
- package/backend/01-standards/spring-boot-complete.md +445 -0
- package/backend/02-playbooks/api-design-playbook.md +718 -0
- package/backend/02-playbooks/email-send-playbook.md +130 -0
- package/backend/02-playbooks/file-upload-s3-playbook.md +153 -0
- package/backend/02-playbooks/typescript-enterprise-playbook.md +133 -0
- package/backend/02-playbooks/websocket-realtime-playbook.md +154 -0
- package/backend/03-checklists/api-launch-checklist.md +189 -0
- package/backend/04-antipatterns/backend-antipatterns.md +1051 -0
- package/blockchain/01-standards/blockchain-basics.md +557 -0
- package/blockchain/01-standards/smart-contract-development.md +1315 -0
- package/cicd/01-standards/deployment-and-delivery-standard.md +96 -0
- package/cicd/01-standards/github-actions-complete.md +473 -0
- package/cicd/01-standards/release-and-store-submission.md +75 -0
- package/cicd/02-playbooks/cicd-pipeline-playbook.md +144 -0
- package/cicd/02-playbooks/release-management-playbook.md +605 -0
- package/cicd/03-checklists/pipeline-security-checklist.md +168 -0
- package/cicd/04-antipatterns/cicd-antipatterns.md +589 -0
- package/cicd/05-cases/case-deployment-automation.md +221 -0
- package/cicd/05-cases/case-gitops-transformation.md +212 -0
- package/cicd/06-glossary/cicd-glossary.md +114 -0
- package/cicd/cicd-blueprint-deep-dive.md +38 -0
- package/cicd/release-readiness-gate.md +37 -0
- package/cloud-native/01-standards/container-security.md +741 -0
- package/cloud-native/01-standards/kubernetes-complete.md +812 -0
- package/cloud-native/02-playbooks/api-gateway-playbook.md +155 -0
- package/cloud-native/02-playbooks/gitops-with-argocd.md +760 -0
- package/cloud-native/02-playbooks/k8s-troubleshooting-playbook.md +1942 -0
- package/cloud-native/02-playbooks/message-queue-playbook.md +129 -0
- package/cloud-native/02-playbooks/multicloud-governance.md +726 -0
- package/cloud-native/02-playbooks/serverless-patterns.md +788 -0
- package/cloud-native/02-playbooks/service-mesh-playbook.md +612 -0
- package/cloud-native/02-playbooks/terraform-iac-playbook.md +143 -0
- package/cloud-native/03-checklists/container-security-checklist.md +431 -0
- package/cloud-native/03-checklists/k8s-production-readiness-checklist.md +460 -0
- package/cloud-native/04-antipatterns/container-antipatterns.md +660 -0
- package/cloud-native/04-antipatterns/k8s-antipatterns.md +743 -0
- package/cloud-native/05-cases/case-k8s-migration.md +478 -0
- package/cloud-native/05-cases/case-k8s-scaling.md +642 -0
- package/cloud-native/05-cases/case-k8s-security-incident.md +397 -0
- package/cloud-native/06-glossary/cloud-native-glossary.md +337 -0
- package/cross-platform/01-standards/cross-platform-frameworks.md +83 -0
- package/cross-platform/01-standards/platform-selection-and-architecture.md +77 -0
- package/data/01-standards/elasticsearch-complete.md +2098 -0
- package/data/01-standards/postgresql-complete.md +1613 -0
- package/data/01-standards/redis-complete.md +1527 -0
- package/data/02-playbooks/database-optimization-playbook.md +403 -0
- package/data/02-playbooks/elasticsearch-production-playbook.md +132 -0
- package/data/03-checklists/database-launch-checklist.md +187 -0
- package/data/04-antipatterns/database-antipatterns.md +873 -0
- package/data/05-cases/case-database-migration.md +310 -0
- package/data/06-glossary/database-glossary.md +440 -0
- package/data/data-governance-and-modeling-deep-dive.md +39 -0
- package/data-engineering/01-standards/airflow-complete.md +523 -0
- package/data-engineering/01-standards/kafka-complete.md +1521 -0
- package/data-engineering/02-playbooks/spark-etl-playbook.md +496 -0
- package/data-engineering/03-checklists/pipeline-launch-checklist.md +194 -0
- package/data-engineering/04-antipatterns/data-pipeline-antipatterns.md +684 -0
- package/data-engineering/05-cases/case-real-time-pipeline.md +355 -0
- package/data-engineering/06-glossary/data-engineering-glossary.md +429 -0
- package/database/01-standards/database-schema-standards.md +147 -0
- package/database/02-playbooks/postgresql-optimization-quick.md +52 -0
- package/database/02-playbooks/postgresql-performance-optimization.md +58 -0
- package/database/02-playbooks/postgresql-production-playbook.md +146 -0
- package/database/02-playbooks/redis-caching-playbook.md +117 -0
- package/database/03-checklists/database-review-checklist.md +50 -0
- package/database/04-antipatterns/database-antipatterns.md +112 -0
- package/design/01-standards/ui-design-system-complete.md +423 -0
- package/design/02-playbooks/design-handoff-playbook.md +254 -0
- package/design/02-playbooks/design-review-playbook.md +388 -0
- package/design/03-checklists/design-review-checklist.md +246 -0
- package/design/04-antipatterns/design-antipatterns.md +378 -0
- package/design/05-cases/case-design-system-adoption.md +328 -0
- package/design/06-glossary/design-glossary.md +329 -0
- package/design/ui-full-lifecycle-cross-platform-playbook.md +571 -0
- package/design/ux-system-deep-dive.md +38 -0
- package/design-systems/00-craft-rules.md +71 -0
- package/design-systems/aesthetic-families.md +43 -0
- package/design-systems/anti-ai-slop.md +162 -0
- package/design-systems/bold-geometric.md +120 -0
- package/design-systems/brutalist-bold.md +103 -0
- package/design-systems/editorial-clean.md +109 -0
- package/design-systems/glass-aurora.md +108 -0
- package/design-systems/modern-minimal.md +145 -0
- package/design-systems/premium-luxury.md +106 -0
- package/design-systems/product-type-design-map.md +48 -0
- package/design-systems/soft-warm.md +123 -0
- package/design-systems/tech-utility.md +113 -0
- package/desktop/01-standards/desktop-app-standard.md +72 -0
- package/desktop/01-standards/desktop-design.md +71 -0
- package/development/00-governance/document-template.md +41 -0
- package/development/01-standards/api-versioning-strategies.md +432 -0
- package/development/01-standards/authentication-patterns-complete.md +479 -0
- package/development/01-standards/css-architecture-complete.md +550 -0
- package/development/01-standards/database-migration-strategies.md +484 -0
- package/development/01-standards/elasticsearch-complete.md +347 -0
- package/development/01-standards/git-complete.md +371 -0
- package/development/01-standards/golang-complete.md +1565 -0
- package/development/01-standards/graphql-complete.md +298 -0
- package/development/01-standards/javascript-bundlers-complete.md +469 -0
- package/development/01-standards/javascript-typescript-complete.md +528 -0
- package/development/01-standards/jest-complete.md +275 -0
- package/development/01-standards/linux-complete.md +234 -0
- package/development/01-standards/logging-observability-complete.md +526 -0
- package/development/01-standards/microservices-communication.md +502 -0
- package/development/01-standards/mongodb-complete.md +406 -0
- package/development/01-standards/oauth2-complete.md +285 -0
- package/development/01-standards/performance-optimization-complete.md +289 -0
- package/development/01-standards/playwright-complete.md +247 -0
- package/development/01-standards/postgresql-complete.md +456 -0
- package/development/01-standards/pytest-complete.md +340 -0
- package/development/01-standards/python-async-programming.md +902 -0
- package/development/01-standards/python-complete.md +956 -0
- package/development/01-standards/python-decorators-complete.md +799 -0
- package/development/01-standards/python-design-patterns.md +2854 -0
- package/development/01-standards/python-packaging-distribution.md +420 -0
- package/development/01-standards/python-testing-strategies.md +607 -0
- package/development/01-standards/python-web-frameworks-comparison.md +471 -0
- package/development/01-standards/redis-complete.md +317 -0
- package/development/01-standards/rest-api-complete.md +316 -0
- package/development/01-standards/rust-complete.md +578 -0
- package/development/01-standards/typescript-advanced-types.md +1513 -0
- package/development/01-standards/web-security-complete.md +292 -0
- package/development/02-playbooks/api-design-playbook.md +810 -0
- package/development/02-playbooks/database-migration-playbook.md +580 -0
- package/development/02-playbooks/debugging-playbook.md +692 -0
- package/development/02-playbooks/feature-delivery-playbook.md +430 -0
- package/development/02-playbooks/incident-hotfix-playbook.md +387 -0
- package/development/02-playbooks/performance-optimization-playbook.md +531 -0
- package/development/02-playbooks/performance-tuning-playbook.md +652 -0
- package/development/02-playbooks/refactor-playbook.md +403 -0
- package/development/02-playbooks/release-playbook.md +469 -0
- package/development/03-checklists/architecture-review-checklist.md +168 -0
- package/development/03-checklists/data-migration-checklist.md +157 -0
- package/development/03-checklists/oncall-handover-checklist.md +173 -0
- package/development/03-checklists/pr-checklist.md +158 -0
- package/development/03-checklists/production-readiness-checklist.md +190 -0
- package/development/03-checklists/release-readiness-checklist.md +154 -0
- package/development/03-checklists/security-review-checklist.md +182 -0
- package/development/04-antipatterns/api-antipatterns.md +657 -0
- package/development/04-antipatterns/architecture-antipatterns.md +686 -0
- package/development/04-antipatterns/backend-antipatterns.md +648 -0
- package/development/04-antipatterns/cicd-antipatterns.md +540 -0
- package/development/04-antipatterns/code-smell-antipatterns.md +571 -0
- package/development/04-antipatterns/data-antipatterns.md +658 -0
- package/development/04-antipatterns/database-antipatterns.md +578 -0
- package/development/04-antipatterns/frontend-antipatterns.md +635 -0
- package/development/04-antipatterns/reliability-antipatterns.md +700 -0
- package/development/04-antipatterns/security-antipatterns.md +747 -0
- package/development/05-cases/case-api-version-migration.md +428 -0
- package/development/05-cases/case-authorization-hardening.md +383 -0
- package/development/05-cases/case-bluegreen-rollback.md +466 -0
- package/development/05-cases/case-cache-snowball-protection.md +485 -0
- package/development/05-cases/case-ci-cd-pipeline.md +544 -0
- package/development/05-cases/case-database-scaling.md +500 -0
- package/development/05-cases/case-db-hotspot-optimization.md +487 -0
- package/development/05-cases/case-incident-mttr-reduction.md +563 -0
- package/development/05-cases/case-microservice-migration.md +375 -0
- package/development/05-cases/case-performance-optimization.md +406 -0
- package/development/05-cases/case-security-incident-response.md +345 -0
- package/development/06-glossary/full-stack-glossary.md +166 -0
- package/development/09-maturity/quarterly-audit-template.md +35 -0
- package/development/11-ui-excellence/ui-aesthetic-system.md +41 -0
- package/development/11-ui-excellence/ui-engineering-excellence.md +435 -0
- package/development/12-scenarios/development-scenarios-guide.md +565 -0
- package/development/13-implementation-assets/implementation-toolkit.md +282 -0
- package/development/13-implementation-assets/knowledge-gates-execution.md +43 -0
- package/development/14-full-lifecycle/software-lifecycle-gates.md +511 -0
- package/development/15-lifecycle-templates/project-templates-collection.md +791 -0
- package/development/api-contract-and-versioning-guide.md +36 -0
- package/development/api-governance-complete.md +43 -0
- package/development/backend-engineering-complete.md +43 -0
- package/development/code-review-quality-complete.md +43 -0
- package/development/concurrency-reliability-complete.md +43 -0
- package/development/database-engineering-complete.md +43 -0
- package/development/engineering-effectiveness-complete.md +43 -0
- package/development/engineering-standards-deep-dive.md +38 -0
- package/development/frontend-engineering-complete.md +43 -0
- package/development/performance-capacity-complete.md +43 -0
- package/development/refactor-migration-complete.md +42 -0
- package/development/refactoring-and-techdebt-playbook.md +37 -0
- package/development/security-in-development-complete.md +43 -0
- package/devops/01-standards/cicd-pipeline-complete.md +262 -0
- package/devops/01-standards/docker-complete.md +1490 -0
- package/devops/01-standards/github-actions-complete.md +337 -0
- package/devops/01-standards/kubernetes-complete.md +638 -0
- package/devops/01-standards/terraform-complete.md +2117 -0
- package/devops/02-playbooks/docker-compose-playbook.md +233 -0
- package/devops/02-playbooks/docker-k8s-production-playbook.md +186 -0
- package/devops/02-playbooks/docker-production-playbook.md +952 -0
- package/edge-iot/01-standards/edge-iot-complete.md +473 -0
- package/experts/architect/api-design.md +178 -0
- package/experts/architect/methodology.md +124 -0
- package/experts/architect/security.md +75 -0
- package/experts/backend-lead/methodology.md +216 -0
- package/experts/devops/methodology.md +160 -0
- package/experts/frontend-lead/methodology.md +178 -0
- package/experts/product-manager/industry/ecommerce.md +43 -0
- package/experts/product-manager/industry/saas.md +40 -0
- package/experts/product-manager/methodology.md +97 -0
- package/experts/qa-lead/methodology.md +123 -0
- package/experts/qa-lead/test-strategy.md +128 -0
- package/experts/uiux-designer/methodology.md +125 -0
- package/frontend/01-standards/accessibility-complete.md +532 -0
- package/frontend/01-standards/accessibility-standard.md +74 -0
- package/frontend/01-standards/admin-dashboard-and-crud.md +72 -0
- package/frontend/01-standards/design-tokens-complete.md +444 -0
- package/frontend/01-standards/forms-and-validation.md +77 -0
- package/frontend/01-standards/frontend-architecture-and-layering.md +119 -0
- package/frontend/01-standards/i18n-and-localization.md +65 -0
- package/frontend/01-standards/nextjs-complete.md +451 -0
- package/frontend/01-standards/react-complete.md +713 -0
- package/frontend/01-standards/react-hooks-complete-guide.md +1100 -0
- package/frontend/01-standards/react-hooks-complete.md +1171 -0
- package/frontend/01-standards/seo-and-web-vitals.md +77 -0
- package/frontend/01-standards/state-management-complete.md +444 -0
- package/frontend/01-standards/vue-complete.md +499 -0
- package/frontend/01-standards/vue3-complete.md +2002 -0
- package/frontend/01-standards/web-framework-best-practices.md +64 -0
- package/frontend/01-standards/web-performance-complete.md +495 -0
- package/frontend/02-playbooks/accessibility-a11y-playbook.md +161 -0
- package/frontend/02-playbooks/frontend-performance-playbook.md +707 -0
- package/frontend/02-playbooks/i18n-internationalization-playbook.md +120 -0
- package/frontend/02-playbooks/performance-optimization-playbook.md +163 -0
- package/frontend/02-playbooks/react-nextjs-production-playbook.md +167 -0
- package/frontend/02-playbooks/react-state-management-playbook.md +173 -0
- package/frontend/03-checklists/component-quality-checklist.md +166 -0
- package/frontend/03-checklists/frontend-launch-checklist.md +299 -0
- package/frontend/04-antipatterns/frontend-antipatterns.md +886 -0
- package/frontend/05-cases/case-performance-optimization.md +274 -0
- package/harmony/01-standards/harmonyos-arkts-standard.md +75 -0
- package/harmony/01-standards/harmonyos-design.md +65 -0
- package/high-quality-engineering-playbook.md +54 -0
- package/incident/01-standards/incident-response-complete.md +303 -0
- package/incident/02-playbooks/chaos-engineering-playbook.md +883 -0
- package/incident/02-playbooks/postmortem-playbook.md +398 -0
- package/incident/03-checklists/incident-readiness-checklist.md +181 -0
- package/incident/04-antipatterns/incident-antipatterns.md +490 -0
- package/incident/05-cases/case-cascade-failure.md +176 -0
- package/incident/06-glossary/incident-glossary.md +114 -0
- package/incident/postmortem-and-response-deep-dive.md +39 -0
- package/industries/ecommerce/ecommerce-complete.md +631 -0
- package/industries/education/education-complete.md +555 -0
- package/industries/fintech/fintech-complete.md +501 -0
- package/industries/gaming/gaming-complete.md +587 -0
- package/industries/healthcare/healthcare-complete.md +452 -0
- package/low-code/01-standards/low-code-complete.md +944 -0
- package/miniprogram/01-standards/ai-common-mistakes.md +61 -0
- package/miniprogram/01-standards/miniprogram-custom-navbar-capsule.md +77 -0
- package/miniprogram/01-standards/miniprogram-design.md +61 -0
- package/miniprogram/01-standards/miniprogram-standard.md +81 -0
- package/mobile/01-standards/android-material-design.md +70 -0
- package/mobile/01-standards/flutter-complete.md +384 -0
- package/mobile/01-standards/ios-design-hig.md +78 -0
- package/mobile/01-standards/mobile-app-standard.md +85 -0
- package/mobile/01-standards/react-native-complete.md +352 -0
- package/mobile/02-playbooks/mobile-cross-platform-playbook.md +175 -0
- package/mobile/02-playbooks/mobile-performance.md +473 -0
- package/mobile/03-checklists/mobile-release-checklist.md +234 -0
- package/mobile/04-antipatterns/mobile-antipatterns.md +798 -0
- package/mobile/05-cases/case-app-performance.md +500 -0
- package/mobile/05-cases/case-app-startup-optimization.md +218 -0
- package/mobile/06-glossary/mobile-glossary.md +484 -0
- package/observability/01-standards/observability-standards.md +103 -0
- package/observability/02-playbooks/prometheus-grafana-playbook.md +135 -0
- package/observability/02-playbooks/structured-logging-playbook.md +73 -0
- package/observability/03-checklists/observability-checklist.md +54 -0
- package/observability/04-antipatterns/observability-antipatterns.md +106 -0
- package/operations/01-standards/prometheus-monitoring-complete.md +1578 -0
- package/operations/02-playbooks/capacity-planning-playbook.md +620 -0
- package/operations/03-checklists/production-launch-checklist.md +365 -0
- package/operations/04-antipatterns/operations-antipatterns.md +664 -0
- package/operations/05-cases/case-sre-practices.md +581 -0
- package/operations/06-glossary/operations-glossary.md +120 -0
- package/operations/aiops-anomaly-detection.md +758 -0
- package/operations/capacity-planning.md +1061 -0
- package/operations/chaos-engineering.md +659 -0
- package/operations/incident-command-system.md +38 -0
- package/operations/observability-complete.md +442 -0
- package/operations/slo-sli-playbook.md +517 -0
- package/operations/sre-operations-deep-dive.md +39 -0
- package/package.json +8 -0
- package/performance/01-standards/performance-and-scalability.md +80 -0
- package/performance/01-standards/performance-standards.md +156 -0
- package/performance/02-playbooks/query-optimization-playbook.md +103 -0
- package/performance/03-checklists/performance-checklist.md +56 -0
- package/performance/04-antipatterns/performance-antipatterns.md +146 -0
- package/product/01-standards/product-management-complete.md +285 -0
- package/product/02-playbooks/feature-launch-playbook.md +207 -0
- package/product/02-playbooks/user-research-playbook.md +532 -0
- package/product/03-checklists/feature-launch-checklist.md +275 -0
- package/product/04-antipatterns/product-antipatterns.md +355 -0
- package/product/05-cases/case-mvp-to-scale.md +384 -0
- package/product/06-glossary/product-glossary.md +462 -0
- package/product/feature-prioritization-framework.md +40 -0
- package/product/kpi-and-metric-tree.md +37 -0
- package/product/product-discovery-and-prd-deep-dive.md +41 -0
- package/quantum/01-standards/quantum-complete.md +1186 -0
- package/security/01-standards/api-security-complete.md +511 -0
- package/security/01-standards/container-runtime-security.md +574 -0
- package/security/01-standards/data-protection-gdpr.md +543 -0
- package/security/01-standards/owasp-top10-complete.md +1890 -0
- package/security/01-standards/secure-coding-baseline.md +90 -0
- package/security/01-standards/supply-chain-security.md +441 -0
- package/security/01-standards/web-security-checklist.md +108 -0
- package/security/01-standards/zero-trust-architecture.md +521 -0
- package/security/02-playbooks/auth-sso-playbook.md +166 -0
- package/security/02-playbooks/incident-response-security-playbook.md +588 -0
- package/security/02-playbooks/owasp-api-security-playbook.md +129 -0
- package/security/02-playbooks/payment-integration-playbook.md +119 -0
- package/security/02-playbooks/penetration-testing-playbook.md +517 -0
- package/security/03-checklists/security-audit-checklist.md +356 -0
- package/security/04-antipatterns/security-coding-antipatterns.md +580 -0
- package/security/05-cases/case-log4shell-incident.md +537 -0
- package/security/05-cases/case-major-breaches.md +468 -0
- package/security/06-glossary/security-glossary.md +212 -0
- package/security/compliance-automation.md +993 -0
- package/security/container-security.md +680 -0
- package/security/devsecops-complete.md +426 -0
- package/security/sast-dast-sca.md +775 -0
- package/security/secrets-management.md +594 -0
- package/security/security-architecture-deep-dive.md +37 -0
- package/security/threat-modeling-stride-playbook.md +40 -0
- package/seed-templates/auth-system.md +59 -0
- package/seed-templates/blog-content.md +94 -0
- package/seed-templates/dashboard.md +89 -0
- package/seed-templates/docs-site.md +73 -0
- package/seed-templates/e-commerce.md +50 -0
- package/seed-templates/saas-landing.md +92 -0
- package/seed-templates/settings-page.md +51 -0
- package/testing/01-standards/test-strategy-and-layering.md +83 -0
- package/testing/01-standards/testing-strategy-complete.md +422 -0
- package/testing/01-standards/unit-testing-best-practices.md +118 -0
- package/testing/02-playbooks/e2e-testing-playbook.md +988 -0
- package/testing/02-playbooks/testing-strategy-playbook.md +126 -0
- package/testing/03-checklists/test-strategy-checklist.md +208 -0
- package/testing/04-antipatterns/testing-antipatterns.md +718 -0
- package/testing/05-cases/case-testing-transformation.md +300 -0
- package/testing/06-glossary/testing-glossary.md +110 -0
- package/testing/risk-based-test-matrix.md +36 -0
- package/testing/testing-strategy-deep-dive.md +37 -0
@@ -0,0 +1,398 @@
---
id: postmortem-playbook
title: Incident Postmortem Playbook
domain: incident
category: 02-playbooks
difficulty: intermediate
tags: [incident, playbook, postmortem, smart, 事故恢复确认, 制定行动项, 基本信息, 时间线构建]
quality_score: 70
last_updated: 2026-06-15
---
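# Incident Postmortem Playbook

> Applicable scenarios: production incident postmortems, major outage analysis, and building a closed loop for continuous improvement.
> Constraint level: every P0/P1 incident must have its postmortem completed within 72 hours of recovery; P2 incidents within 5 business days.
> Core principle: blameless culture. Focus on improving the system rather than assigning individual blame.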
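
The deadline rule above can be expressed as a small helper. This is a minimal sketch, not part of the playbook: the function name `postmortem_deadline` is hypothetical, "business days" is assumed to mean Monday to Friday with public holidays ignored, and P3 is treated as having no deadline because the playbook does not state one.

```python
from datetime import datetime, timedelta
from typing import Optional


def postmortem_deadline(severity: str, recovered_at: datetime) -> Optional[datetime]:
    """Return the latest completion time for the postmortem, or None if no deadline is stated."""
    if severity in ("P0", "P1"):
        # 72 hours, counted in wall-clock time from recovery.
        return recovered_at + timedelta(hours=72)
    if severity == "P2":
        # 5 business days; assumption: Monday-Friday, public holidays ignored.
        due = recovered_at
        remaining = 5
        while remaining > 0:
            due += timedelta(days=1)
            if due.weekday() < 5:  # 0-4 are Monday-Friday
                remaining -= 1
        return due
    if severity == "P3":
        return None  # the playbook states no deadline for P3
    raise ValueError(f"unknown severity: {severity!r}")


recovered = datetime(2024, 3, 8, 15, 5)  # a Friday, chosen only for illustration
print(postmortem_deadline("P1", recovered))  # 2024-03-11 15:05:00
print(postmortem_deadline("P2", recovered))  # 2024-03-15 15:05:00
```

A team that observes public holidays would likely need to feed a holiday calendar into the P2 branch.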

---

## Stage 0: Confirm Incident Recovery

Before starting the postmortem, confirm the following preconditions:

- [ ] The incident is fully resolved and user impact has been eliminated
- [ ] A temporary fix (workaround) is in place so the issue will not recur
- [ ] The incident severity has been confirmed by the on-call lead (P0 / P1 / P2 / P3)
- [ ] The incident notification has been sent to the relevant stakeholders
- [ ] Monitoring has returned to its normal baseline and alerts have cleared

---
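
## Stage 1: Meeting Preparation (1-2 Days Before the Postmortem)

### 1.1 Appoint the Postmortem Owner

- [ ] Appoint a postmortem facilitator who is not a directly responsible party
- [ ] The facilitator collects materials, organizes the meeting, and produces the report
- [ ] The facilitator has read this playbook and the blameless culture guidelines

### 1.2 Attendees

- [ ] Front-line engineers who responded to the incident (On-Call / First Responder)
- [ ] Owners and core developers of the affected systems
- [ ] The decision-maker during the incident (Incident Commander)
- [ ] Relevant SRE / DevOps / security team members
- [ ] (Optional) Product manager and customer support representative
- [ ] Attendance is kept to 5-10 people to avoid too many bystanders

### 1.3 Material Collection

- [ ] A draft incident timeline is prepared (from alert firing to full recovery)
- [ ] Monitoring screenshots, log excerpts, and alert records are collected
- [ ] Relevant Chat / Slack / Feishu conversation logs are exported
- [ ] Change records are pulled (deployments, configuration changes, database changes)
- [ ] Impact data is tallied (number of affected users, error rate, duration)
- [ ] Relevant on-call handover records are collected

### 1.4 Meeting Invitation

- [ ] The meeting time is booked (48-72 hours after recovery is recommended, while memories are fresh and emotions have settled)
- [ ] The meeting lasts 60-90 minutes (up to 120 minutes for a P0)
- [ ] The agenda is sent in advance so attendees have time to recall and prepare
- [ ] The meeting room / online link is confirmed

---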

## Stage 2: Building the Timeline

### 2.1 Timeline Template

The timeline is the factual foundation of the postmortem and must be accurate to the minute.

```markdown
| Time (UTC+8) | Event | Actor | Source |
|--------------|-------|-------|--------|
| 14:23 | Monitoring alert: API error rate > 5% | System | Grafana Alert |
| 14:25 | On-call engineer A receives the PagerDuty alert | A | PagerDuty |
| 14:28 | A starts investigating and checks the dashboard | A | Slack log |
| 14:35 | Database connection pool exhaustion identified | A | Logs |
| 14:38 | Attempted to restart the service instances | A | K8s operation log |
| 14:40 | Restart had no effect; escalated to a P1 incident | A | Incident Channel |
| 14:45 | DBA B joins the investigation | B | Slack log |
| 14:52 | Slow queries found to be saturating the connection pool | B | Slow query log |
| 15:00 | Slow queries killed as a stopgap; connection pool recovers | B | DB operation log |
| 15:05 | Service back to normal; alert cleared | System | Grafana |
| 15:10 | Customer communication completed; incident notification sent | PM | Email |
```

### 2.2 Timeline Principles

- [ ] Base the timeline on objective facts, without subjective judgment
- [ ] Label the source of each entry (monitoring / logs / chat records / personal recollection)
- [ ] Distinguish the "time detected" from the "time it actually occurred"
- [ ] Mark key decision points and the basis for each decision
- [ ] When recollections conflict, system logs take precedence
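
One way to apply the source-labeling and conflict rules is to record each entry with its source and keep the most trusted version of an event. This is a rough sketch under stated assumptions: `TimelineEntry`, `SOURCE_PRIORITY`, and `resolve_conflicts` are hypothetical names, and the full ranking of sources is an illustration; the playbook itself only says that system logs win over conflicting recollections.

```python
from dataclasses import dataclass

# Lower number = more trusted. Only "logs beat recollection" comes from the
# playbook; the rest of the ordering is an assumption made for this sketch.
SOURCE_PRIORITY = {"log": 0, "monitoring": 1, "chat": 2, "recollection": 3}


@dataclass(frozen=True)
class TimelineEntry:
    time: str    # zero-padded "HH:MM" on the day of the incident
    event: str
    actor: str
    source: str  # one of the keys of SOURCE_PRIORITY


def resolve_conflicts(entries):
    """Keep one entry per event, preferring the most trusted source, sorted by time."""
    best = {}
    for entry in entries:
        current = best.get(entry.event)
        if current is None or SOURCE_PRIORITY[entry.source] < SOURCE_PRIORITY[current.source]:
            best[entry.event] = entry
    # Zero-padded "HH:MM" strings sort correctly within a single day.
    return sorted(best.values(), key=lambda e: e.time)


entries = [
    TimelineEntry("14:50", "slow queries found", "B", "recollection"),
    TimelineEntry("14:52", "slow queries found", "B", "log"),
    TimelineEntry("14:23", "alert fired", "System", "monitoring"),
]
for entry in resolve_conflicts(entries):
    print(entry.time, entry.event, entry.source)
```

An incident that spans midnight would need full timestamps instead of `HH:MM` strings.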

### 2.3 Key Time Metrics

- [ ] **TTD (Time to Detect)**: time from fault onset to detection
- [ ] **TTE (Time to Engage)**: time from detection to the start of the response
- [ ] **TTR (Time to Restore)**: time from the start of the response to service restoration
- [ ] **TTN (Time to Notify)**: time from detection to notifying users/stakeholders
- [ ] **Total Impact Duration**: total time users were actually affected
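
The metrics above are differences between five timestamps. The sketch below computes them using the times from the sample timeline in 2.1; the function name `incident_metrics`, the calendar date, and the 14:20 onset are hypothetical, since the sample timeline records only times of day and does not record when the fault actually began. Total impact is assumed here to run from onset to restoration.

```python
from datetime import datetime


def incident_metrics(onset, detected, engaged, notified, restored):
    """Return the key time metrics in whole minutes."""
    def minutes(start, end):
        return int((end - start).total_seconds() // 60)

    return {
        "TTD": minutes(onset, detected),      # fault onset -> detection
        "TTE": minutes(detected, engaged),    # detection -> response starts
        "TTR": minutes(engaged, restored),    # response starts -> service restored
        "TTN": minutes(detected, notified),   # detection -> stakeholders notified
        "total_impact": minutes(onset, restored),  # assumption: impact spans onset -> restore
    }


def at(hhmm):
    # Hypothetical date; the sample timeline only gives times of day.
    return datetime.strptime("2024-03-08 " + hhmm, "%Y-%m-%d %H:%M")


metrics = incident_metrics(
    onset=at("14:20"),     # hypothetical: actual onset is not in the sample timeline
    detected=at("14:23"),  # monitoring alert
    engaged=at("14:28"),   # engineer A starts investigating
    notified=at("15:10"),  # incident notification sent
    restored=at("15:05"),  # service back to normal
)
print(metrics)
# {'TTD': 3, 'TTE': 5, 'TTR': 37, 'TTN': 47, 'total_impact': 45}
```

With the hypothetical 14:20 onset, total impact comes to 45 minutes, slightly longer than the 42 minutes between the first alert (14:23) and recovery (15:05).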

---

## Stage 3: Root Cause Analysis

### 3.1 The 5 Whys Method

Start from the direct cause and ask "why" five times in a row until a systemic root cause surfaces.

```markdown
**Incident symptom**: the API returned 503 for 42 minutes.

**Why 1**: Why did the API return 503?
→ Because the application server's connection pool was exhausted and no database connection could be obtained.

**Why 2**: Why was the connection pool exhausted?
→ Because a large number of slow queries held connections for a long time without releasing them.

**Why 3**: Why were there so many slow queries?
→ Because the newly released reporting feature contains SQL that does a full table scan, with no index.

**Why 4**: Why could full-table-scan SQL reach production?
→ Because Code Review included no SQL performance review and no EXPLAIN analysis.

**Why 5**: Why does the Code Review process lack SQL review?
→ Because no mandatory review gate or automated detection exists for database changes.

**Root cause**: there is no automated gate for database query performance, and relying on manual review is unreliable.
```
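
To make the root cause concrete, the sketch below shows what an automated full-table-scan check could look like. It is an illustration only: it uses SQLite's `EXPLAIN QUERY PLAN` from the Python standard library so that it runs anywhere, whereas the incident's actual database is not named in the playbook and would need its own `EXPLAIN` output parsing. The `find_full_scans` helper, the table columns, and the index name are hypothetical.

```python
import sqlite3


def find_full_scans(conn, sql):
    """Return the plan steps in which SQLite scans an entire table or index."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    # Each row ends with a human-readable detail string; steps that walk the
    # whole table start with "SCAN", index lookups start with "SEARCH".
    return [row[3] for row in rows if row[3].startswith("SCAN")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, created_at TEXT, body TEXT)")
query = "SELECT * FROM reports WHERE created_at >= '2024-03-01'"

before = find_full_scans(conn, query)  # no index yet: the query scans the table
conn.execute("CREATE INDEX idx_reports_created_at ON reports (created_at)")
after = find_full_scans(conn, query)   # with the index: no full scan remains

print("before:", before)
print("after:", after)
```

A CI gate built on this idea would fail the build whenever the returned list is non-empty for a query touching a large table.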

#### 5 Whys Principles

- [ ] Every "Why" answer must be based on facts, not speculation
- [ ] Keep asking until you reach the system/process level rather than stopping at an individual's actions
- [ ] There may be several causal chains; trace each one
- [ ] Avoid stopping at layer 1-2 (too superficial)
- [ ] Avoid going beyond 7 layers (too divergent)
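
The depth and evidence rules above lend themselves to a quick mechanical check before the meeting. The sketch below is one possible shape for it; the `Why` record and `review_chain` function are hypothetical, and whether an answer reaches the system/process level still needs human judgment.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str
    evidence: str  # where the answer comes from: log, metric, change record, ...


def review_chain(chain):
    """Return warnings for a 5 Whys chain; an empty list means no issue was found."""
    warnings = []
    if len(chain) <= 2:
        warnings.append("chain stops at layer 1-2: likely too superficial")
    if len(chain) > 7:
        warnings.append("chain exceeds 7 layers: likely too divergent")
    for number, why in enumerate(chain, start=1):
        if not why.evidence.strip():
            warnings.append(f"Why {number} has no supporting evidence")
    return warnings


chain = [
    Why("Why did the API return 503?", "Connection pool exhausted", "application logs"),
    Why("Why was the pool exhausted?", "Slow queries held connections", ""),
]
for warning in review_chain(chain):
    print(warning)
```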
|
|
140
|
+
|
|
141
|
+
### 3.2 Fishbone Diagram (Ishikawa Diagram)

Use it to lay out causes systematically across several dimensions. It suits complex, multi-factor incidents.

```
                    ┌─ No slow-query alerting
Environment ────────┤
                    └─ Database connection pool sized too small
                    ┌─ Full-table-scan SQL
Technology ─────────┤
                    └─ No indexing strategy
                                                                  ──→ API 503
                    ┌─ No SQL performance gate                        42 minutes
Process ────────────┤
                    └─ Code Review does not cover the database
                    ┌─ DBA not involved in the reporting feature review
People ─────────────┤
                    └─ Developers lack SQL optimization training
```

#### Fishbone Dimensions

| Dimension | What to analyze |
|-----------|-----------------|
| Technology | Code defects, architecture flaws, dependency problems, configuration errors |
| Process | Change management, review process, release process, monitoring coverage |
| People | Skill gaps, poor communication, handover omissions, insufficient training |
| Environment | Infrastructure, capacity, network, third-party dependencies |
| Tools | Monitoring blind spots, alert rules, missing automation, toolchain problems |

---

## Phase 4: Define Action Items (SMART)

### 4.1 SMART Criteria

Every action item must meet the SMART criteria:

| Dimension | Meaning | Example |
|-----------|---------|---------|
| **S**pecific | States exactly what to do | "Add an index on the created_at column of the reports table" |
| **M**easurable | The completion criterion can be measured | "Slow query drops from 15s to < 200ms" |
| **A**ssignable | Has a clear owner | "DBA Zhang San is responsible" |
| **R**ealistic | Achievable in the given time | "No architecture refactoring needed; adding an index is enough" |
| **T**ime-bound | Has a clear deadline | "Done by 2024-03-15" |
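
Part of the SMART check can be automated before an action item is accepted into the tracker. The sketch below is one possible shape for it; the `ActionItem` class and its field names are hypothetical, and the "Realistic" dimension is left to human review because it cannot be verified mechanically.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """One postmortem action item (hypothetical structure)."""
    title: str
    owner: Optional[str] = None        # Assignable
    due: Optional[date] = None         # Time-bound
    acceptance: Optional[str] = None   # Measurable
    issue: Optional[str] = None        # tracking issue key


def smart_gaps(item: ActionItem) -> list[str]:
    """Return the SMART dimensions that are visibly missing from an action item."""
    gaps = []
    if not item.title.strip():
        gaps.append("Specific: no description of what to do")
    if not (item.acceptance or "").strip():
        gaps.append("Measurable: no acceptance criterion")
    if not (item.owner or "").strip():
        gaps.append("Assignable: no owner")
    if item.due is None:
        gaps.append("Time-bound: no deadline")
    return gaps
```

An item such as "Strengthen training" with no owner, deadline or acceptance criterion returns three gaps, which is a useful prompt to rewrite it before the postmortem closes.
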

### 4.2 Action Item Template

```markdown
### Action Item #1: Add an index to the reports table

| Field | Value |
|-------|-------|
| Type | Fix |
| Priority | P0 - Immediate |
| Owner | Zhang San (DBA) |
| Due date | 2024-03-10 |
| Acceptance criterion | reports table query P99 < 200ms |
| Tracking issue | JIRA-1234 |
| Status | In progress |

### Action Item #2: Add automated SQL performance checks to CI

| Field | Value |
|-------|-------|
| Type | Prevent |
| Priority | P1 - This week |
| Owner | Li Si (SRE) |
| Due date | 2024-03-17 |
| Acceptance criterion | Full-table-scan SQL is blocked at the CI stage |
| Tracking issue | JIRA-1235 |
| Status | Not started |

### Action Item #3: Establish a database change review process

| Field | Value |
|-------|-------|
| Type | Prevent |
| Priority | P2 - This month |
| Owner | Wang Wu (Tech Lead) |
| Due date | 2024-03-31 |
| Acceptance criterion | PRs containing SQL changes must have DBA approval |
| Tracking issue | JIRA-1236 |
| Status | Not started |
```
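
Action Item #2 asks for full-table-scan SQL to be blocked in CI. As a rough illustration of how such a gate could work, the sketch below inspects query plans with SQLite's `EXPLAIN QUERY PLAN`. SQLite is only a stand-in here: the playbook does not say which database the incident involved, a real gate would use that engine's own EXPLAIN output, and the function names are hypothetical.

```python
import sqlite3


def full_scan_steps(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Return the query-plan steps that read a whole table or index."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    # Each row is (id, parent, notused, detail). SQLite describes full scans
    # with a detail string starting with "SCAN" and index lookups with "SEARCH".
    return [row[3] for row in rows if row[3].startswith("SCAN")]


def check_queries(conn: sqlite3.Connection, queries: list[str]) -> dict[str, list[str]]:
    """Map each offending query to its full-scan steps; an empty dict means the gate passes."""
    report = {}
    for sql in queries:
        steps = full_scan_steps(conn, sql)
        if steps:
            report[sql] = steps
    return report
```

A CI job could run `check_queries` against a schema-only copy of the database and fail the build when the result is non-empty. Small lookup tables are often fine to scan, so a real gate would likely need an allow-list.
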

### 4.3 Action Item Categories

| Type | Description | Time frame |
|------|-------------|------------|
| Fix | Directly fixes the technical problem behind this incident | 24-48 hours |
| Detect | Adds monitoring and alerting so similar problems are found sooner | 1 week |
| Prevent | Stops the same class of problem from recurring, at the process and architecture level | 2-4 weeks |
| Improve | Speeds up the response and reduces the blast radius | 1-3 months |

---

## Phase 5: Write the Postmortem Report

### 5.1 Postmortem Report Template

```markdown
# Incident Postmortem Report: [Incident title]

## Basic Information

| Field | Value |
|-------|-------|
| Incident ID | INC-2024-0042 |
| Severity | P1 |
| Impact duration | 42 minutes (14:23 - 15:05 UTC+8) |
| Impact scope | All API users, about 12,000 failed requests |
| Postmortem date | 2024-03-08 |
| Postmortem facilitator | [Name] |
| Report author | [Name] |

## Summary

One-sentence description of the incident: the newly launched reporting feature contained unoptimized database queries, which exhausted the connection pool and made the API service unavailable for 42 minutes.

## Timeline

[Full timeline table]

## Key Metrics

| Metric | Value | Target |
|--------|-------|--------|
| TTD (time to detect) | 2 minutes | < 5 minutes |
| TTE (time to engage) | 5 minutes | < 10 minutes |
| TTR (time to restore) | 35 minutes | < 30 minutes |
| Users affected | ~3,200 | - |
| Failed requests | ~12,000 | - |

## Root Cause Analysis

### 5 Whys
[5 Whys analysis]

### Contributing Factors
- Proximate cause: full-table-scan SQL went live
- Distal cause: no SQL performance review process
- Systemic factor: the connection pool has no dynamic scaling, and slow queries are not killed automatically

## Action Items

[Action item list]

## Lessons Learned

### What went well
- Monitoring alerts fired within 2 minutes; TTD met the target
- The team collaborated quickly and the DBA joined the investigation promptly

### What needs improvement
- No automated gate for database changes
- Connection pool configuration is too conservative and cannot scale elastically

### Where we got lucky
- The incident happened during working hours, so the response was fast
- The affected feature was reporting, not the core transaction path
```

---

## Phase 6: Action Item Tracking

### 6.1 Tracking Mechanism

- [ ] All action items are created as JIRA / GitHub issues and linked to the incident ID
- [ ] Action item progress is reviewed in the weekly meeting
- [ ] Overdue action items trigger an automatic escalation notice
- [ ] Completed action items require acceptance sign-off

### 6.2 Tracking Board

```
| Status | Action item | Owner | Due date | Progress |
|--------|-------------|-------|----------|----------|
| Done | Add index | Zhang San | 03-10 | 100% |
| In Progress | SQL detection | Li Si | 03-17 | 60% |
| Todo | Review process | Wang Wu | 03-31 | 0% |
```

### 6.3 Closure Confirmation

- [ ] All P0 action items are completed and accepted
- [ ] P1 action item completion rate >= 90%
- [ ] The postmortem report is archived in the knowledge base
- [ ] The effect of the action items has been validated in subsequent incidents
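
The overdue-escalation rule in 6.1 and the first two closure conditions in 6.3 are mechanical enough to compute from the tracking board. The following is a sketch under stated assumptions: the `TrackedItem` structure is hypothetical, "accepted" is modelled as a `verified` flag, and a postmortem with no P1 items is treated as meeting the 90% rule.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TrackedItem:
    """One row of the tracking board (hypothetical structure)."""
    name: str
    priority: str           # "P0", "P1", "P2"
    due: date
    done: bool = False
    verified: bool = False  # acceptance sign-off recorded


def overdue(items: list[TrackedItem], today: date) -> list[TrackedItem]:
    """Items past their due date and still open; candidates for escalation."""
    return [i for i in items if not i.done and i.due < today]


def ready_to_close(items: list[TrackedItem]) -> bool:
    """True when all P0 items are done and accepted and at least 90% of P1 items are done."""
    p0 = [i for i in items if i.priority == "P0"]
    p1 = [i for i in items if i.priority == "P1"]
    p0_ok = all(i.done and i.verified for i in p0)
    # Assumption: with no P1 items, the completion-rate rule is trivially met.
    p1_rate = sum(i.done for i in p1) / len(p1) if p1 else 1.0
    return p0_ok and p1_rate >= 0.9
```

The remaining two conditions (report archived, effect validated in later incidents) still need a human to confirm.
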
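
---

## Phase 7: Building a Blameless Culture

### 7.1 Blameless Culture Principles

| Principle | Description |
|-----------|-------------|
| People are not the root cause | Human error is the result of a system that allowed the error to happen; improve the system instead of punishing individuals |
| Transparency | Encourage proactive disclosure of mistakes and near misses; concealment is more dangerous than the mistake |
| Curiosity first | Say "I'd like to understand what happened at the time" instead of "Why did you do that?" |
| Psychological safety | Participants can share candidly without fear of negative consequences |
| Systems thinking | Focus on systemic factors such as process, tools, and environment rather than individual ability or attitude |
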
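### 7.2 Language Guidelines for Postmortem Meetings

| Avoid | Prefer |
|-------|--------|
| "Who made the mistake?" | "What happened at the time?" |
| "You should have known..." | "What information was available at the time?" |
| "This is such a basic mistake" | "What system improvement would prevent this?" |
| "Why wasn't it tested?" | "Which scenarios did the test strategy cover at the time?" |
| "It should never have been released" | "What checkpoints does the release process have?" |

### 7.3 Actions for Building a Blameless Culture

- [ ] No personal names appear in postmortem reports (use roles instead: On-Call A, DBA B)
- [ ] Encourage people to volunteer their "at the time I thought..." reasoning
- [ ] Management states explicitly that individuals will not be punished for incidents
- [ ] Hold regular "incident story sessions" to share near misses and lessons
- [ ] Include postmortem quality (not incident count) in team KPIs
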
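### 7.4 Detecting Warning Signs

The following signals indicate that the team has not yet built a genuinely blameless culture:

- Engineers hide operational mistakes or try to fix them quietly
- Postmortem meetings are tense and participants are reluctant to speak
- Action items are always "more training" rather than system improvements
- Incident count falls while severity rises (small incidents are hidden until they grow)
- "Nobody wants to deploy" or "nobody wants to be on call"

---
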
## Appendix: Postmortem Report Review Criteria

| Review item | Pass criterion |
|-------------|----------------|
| Timeline completeness | Every key event has a timestamp and a source label |
| Root cause depth | 5 Whys reaches at least level 3 |
| Action item quality | Every item meets the SMART criteria |
| Action item coverage | Includes fix, detect, and prevent items |
| Blameless culture | The report contains no personal blame |
| Data support | Impact scope is backed by quantified data |

---

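## Agent Checklist

- [ ] The postmortem for a P0/P1 incident starts within 72 hours
- [ ] The timeline is based on system logs rather than personal recollection alone
- [ ] 5 Whys reaches at least level 3 and traces back to the system/process level
- [ ] All action items meet the SMART criteria and are created as tracked issues
- [ ] The postmortem report is published to the team knowledge base and stakeholders are notified
- [ ] The postmortem follows blameless principles and contains no personal blame
- [ ] Completed action items have closed-loop acceptance sign-off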
@@ -0,0 +1,181 @@
---
id: incident-readiness-checklist
title: Incident Readiness Checklist
domain: incident
category: 03-checklists
difficulty: intermediate
tags: [agent, checklist, incident, readiness, overview, score-summary]
quality_score: 70
last_updated: 2026-06-15
---
# Incident Readiness Checklist

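## Overview

This checklist assesses how well prepared a team is before a production incident happens. Incident readiness determines how fast the team responds and recovers when a failure occurs. Run a full check once per quarter and spot-check the key items every month.

Scoring: each item is worth 0-2 points (0 = not implemented, 1 = partially implemented, 2 = fully ready), for a maximum total of 60 points.
Passing line: 48 points (80%). A score below 36 points (60%) should trigger an improvement plan immediately.

---
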
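## 1. Alerting System (12 points)

### 1.1 Alert Coverage

- [ ] Health-check alerts are configured for core services (HTTP probe / TCP probe)
- [ ] Alerts are configured for key business metrics (abnormal swings in order volume / payment success rate / login volume)
- [ ] Infrastructure alerts are configured (CPU / memory / disk / network)
- [ ] Dependency alerts are configured (database / cache / message queue / third-party APIs)

### 1.2 Alert Quality

- [ ] Alert severity levels are clearly defined (P0 critical / P1 high / P2 medium / P3 low)
- [ ] Alert thresholds are calibrated against historical data; false-positive rate < 5%
- [ ] Alert deduplication rules are configured (the same alert is not re-sent within 5 minutes)
- [ ] Alert noise is reviewed monthly and ineffective alerts are removed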
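
The deduplication rule in 1.2 (no repeat notification for the same alert within 5 minutes) can be sketched as a small in-memory filter. The class and method names below are hypothetical, and alerting platforms generally provide this as built-in configuration, so treat it as an illustration of the rule rather than something to deploy.

```python
from datetime import datetime, timedelta


class AlertDeduplicator:
    """Suppress repeat notifications for the same alert key inside a time window."""

    def __init__(self, window: timedelta = timedelta(minutes=5)):
        self.window = window
        self._last_sent: dict[str, datetime] = {}

    def should_notify(self, alert_key: str, now: datetime) -> bool:
        """Return True and record the send time unless the key was notified within the window."""
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.window:
            return False
        self._last_sent[alert_key] = now
        return True
```

The window is measured from the last notification that was actually sent, so a continuously firing alert still produces one notification every 5 minutes rather than being silenced indefinitely.
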

### 1.3 Alert Channels

- [ ] P0/P1 alerts reach responders over multiple channels: phone call + SMS + IM
- [ ] P2/P3 alerts are sent via IM / email
- [ ] Alert channels are redundant (a backup channel works when the primary fails)
- [ ] The most recent alert channel reachability test was within the last 30 days

---

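## 2. Runbook (10 points)

### 2.1 Coverage

- [ ] Every core service has a corresponding Runbook
- [ ] The top 10 most frequent historical failure scenarios each have a dedicated Runbook
- [ ] Each Runbook covers: symptoms, impact scope, troubleshooting steps, recovery actions, escalation conditions

### 2.2 Quality and Freshness

- [ ] Each Runbook was updated or confirmed still valid within the last 90 days
- [ ] Commands/scripts in the Runbook were recently verified to run
- [ ] A new team member can handle a P2 incident alone by following the Runbook
- [ ] Runbooks are stored on the team's shared platform (not in personal notes)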
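
The 90-day freshness rule lends itself to a scheduled check. The sketch below assumes a hypothetical mapping from Runbook name to the date it was last updated or confirmed; where that data comes from (wiki metadata, Git history, a spreadsheet) is left open.

```python
from datetime import date, timedelta


def stale_runbooks(
    last_reviewed: dict[str, date], today: date, max_age_days: int = 90
) -> list[str]:
    """Return the names of Runbooks not updated or confirmed within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)
```

The same function works for the other recency items in this checklist (30-day channel tests, 30-day rollback drills, 90-day restore drills) by changing `max_age_days`.
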

### 2.3 Drill Validation

- [ ] Key Runbooks are drilled at least once per quarter
- [ ] Drill results are recorded and fed back into Runbook updates
- [ ] Documentation defects found in drills are fixed within 5 working days

---

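## 3. On-Call Rotation (10 points)

### 3.1 Scheduling

- [ ] The On-Call schedule is published and covers the next 4 weeks
- [ ] Every time slot has at least a primary and a backup On-Call
- [ ] On-Call staff are able to handle P1 incidents for the service
- [ ] The shift handover process is clear and handover records are traceable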
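
The first two scheduling items can be checked automatically against the published schedule. This sketch assumes one shift per calendar day, which is a simplification; the `Shift` structure and the gap messages are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Shift:
    """One day of on-call duty (assumes daily shifts)."""
    day: date
    primary: str
    backup: str


def coverage_gaps(shifts: list[Shift], start: date, weeks: int = 4) -> list[tuple[date, str]]:
    """List the days in the next `weeks` weeks that lack a valid primary/backup pair."""
    by_day = {s.day: s for s in shifts}
    gaps = []
    for offset in range(weeks * 7):
        day = start + timedelta(days=offset)
        shift = by_day.get(day)
        if shift is None:
            gaps.append((day, "no shift published"))
        elif not shift.primary or not shift.backup:
            gaps.append((day, "missing primary or backup"))
        elif shift.primary == shift.backup:
            gaps.append((day, "primary and backup are the same person"))
    return gaps
```

An empty result means the schedule covers the full window with two distinct people per day. It says nothing about whether those people can handle a P1 incident, which remains a manual check.
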

### 3.2 Response Requirements

- [ ] On-Call staff have been explicitly told the P0 response time requirement of < 5 minutes
- [ ] The P1 response time requirement of < 15 minutes has been explicitly communicated
- [ ] On-Call staff confirm their communication devices are reachable 24 hours a day
- [ ] The escalation path is clear (On-Call → TL → Architect → CTO)

### 3.3 Tool Readiness

- [ ] On-Call staff have read-only access to production
- [ ] Emergency operation permissions can be obtained through approval within 5 minutes
- [ ] Access to the VPN, bastion host, monitoring dashboards and similar tools is pre-provisioned

---

## 4. Communication Channels (8 points)

### 4.1 Incident Communication

- [ ] A dedicated incident chat group/channel can be created in < 2 minutes
- [ ] Incident communication templates are prepared (impact scope / current status / next steps / ETA)
- [ ] An internal status page or broadcast mechanism is ready
- [ ] External customer notification templates and process are prepared

### 4.2 Collaboration Tools

- [ ] The Incident Commander role is clearly defined
- [ ] The War Room process for multi-team coordination is documented
- [ ] The Postmortem template and meeting process are standardized
- [ ] An incident timeline recording tool is chosen and the team is familiar with it

---

## 5. Rollback Capability (12 points)

### 5.1 Application Rollback

- [ ] Application deployment supports one-click rollback to the previous version
- [ ] A rollback takes < 5 minutes
- [ ] The most recent rollback drill was within the last 30 days
- [ ] Rollback can still complete without the original build artifacts (image retention policy)

### 5.2 Database Rollback

- [ ] Database migrations support Down/Rollback
- [ ] Destructive DDL (dropping tables/columns) has delayed execution and a confirmation step
- [ ] Data repair scripts have a template and a review process

### 5.3 Configuration Rollback

- [ ] The configuration center supports version diff and one-click rollback
- [ ] A Feature Flag can switch off a specific feature within 1 minute
- [ ] DNS / load balancer switchover procedures are documented and quick to execute

### 5.4 Traffic Control

- [ ] Circuit breakers are configured with reasonable thresholds
- [ ] Rate limiting is configured (core endpoints have their own limits)
- [ ] Canary releases support control by percentage / user group

---

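## 6. Backup Verification (8 points)

### 6.1 Backup Strategy

- [ ] Full database backups run at least once a day
- [ ] Incremental / binlog backups are enabled
- [ ] Backup storage is physically isolated from the primary database (cross-AZ or cross-region)
- [ ] Backup retention meets business requirements (typically ≥ 30 days)

### 6.2 Restore Verification

- [ ] The most recent backup restore drill was within the last 90 days
- [ ] Restore time has been measured and meets the RTO requirement
- [ ] Data integrity after restore has been verified (RPO check)
- [ ] Backup failures raise alerts, and there are no unhandled backup failures in the last 30 days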
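
A restore drill produces two numbers worth recording: how long the restore took (compared against the RTO) and how much data would have been lost (compared against the RPO). The sketch below shows that comparison; the function name, parameter names and the returned dictionary are hypothetical, and the RTO/RPO targets must come from your own business requirements.

```python
from datetime import datetime, timedelta


def restore_drill_result(
    restore_started: datetime,
    restore_finished: datetime,
    last_recoverable_point: datetime,
    failure_time: datetime,
    rto: timedelta,
    rpo: timedelta,
) -> dict[str, object]:
    """Compare a restore drill's measured values with the RTO and RPO targets."""
    restore_time = restore_finished - restore_started
    # Data written after the last recoverable point and before the failure is lost.
    data_loss_window = failure_time - last_recoverable_point
    return {
        "restore_time": restore_time,
        "data_loss_window": data_loss_window,
        "meets_rto": restore_time <= rto,
        "meets_rpo": data_loss_window <= rpo,
    }
```

This measures only the restore step. If the RTO is defined from the moment of failure, detection and decision time must be added on top.
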

---

## Score Summary

| Category | Max | Score | Status |
|----------|-----|-------|--------|
| Alerting System | 12 | __ | |
| Runbook | 10 | __ | |
| On-Call Rotation | 10 | __ | |
| Communication Channels | 8 | __ | |
| Rollback Capability | 12 | __ | |
| Backup Verification | 8 | __ | |
| **Total** | **60** | **__** | |
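
Once the table is filled in, the total and its status follow directly from the thresholds in the overview (48 to pass, below 36 to start an improvement plan immediately). A minimal sketch follows. The category keys are English renderings of the table rows, and the label for the 36-47 band is an assumption, since the checklist does not name it.

```python
CATEGORY_MAX = {
    "Alerting System": 12,
    "Runbook": 10,
    "On-Call Rotation": 10,
    "Communication Channels": 8,
    "Rollback Capability": 12,
    "Backup Verification": 8,
}
PASS_LINE = 48    # 80% of 60
URGENT_LINE = 36  # 60% of 60


def readiness_status(scores: dict[str, int]) -> tuple[int, str]:
    """Validate per-category scores and return (total, status)."""
    if set(scores) != set(CATEGORY_MAX):
        raise ValueError("scores must cover exactly the six categories")
    for name, value in scores.items():
        if not 0 <= value <= CATEGORY_MAX[name]:
            raise ValueError(f"{name}: score {value} outside 0..{CATEGORY_MAX[name]}")
    total = sum(scores.values())
    if total >= PASS_LINE:
        status = "pass"
    elif total < URGENT_LINE:
        status = "start improvement plan immediately"
    else:
        # Assumption: the checklist does not name the 36-47 band.
        status = "below passing line"
    return total, status
```
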
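
## Agent Checklist

- [ ] Does alert coverage include business metrics, not just infrastructure?
- [ ] Can a new team member execute the Runbooks independently?
- [ ] Does the On-Call schedule have primary/backup redundancy?
- [ ] Are incident communication templates prepared and familiar to the team?
- [ ] Has rollback capability been validated by a recent drill?
- [ ] Has a real backup restore drill been run within the last 90 days?
- [ ] Does the score reach the 80% passing line?
- [ ] Do failing items have an improvement plan and an Owner?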