@umacloud/knowledge 1.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/00-governance/governance-capabilities.md +557 -0
- package/00-governance/knowledge-map.md +39 -0
- package/00-governance/maintenance-policy.md +76 -0
- package/00-governance/review-checklist.md +81 -0
- package/README.md +13 -0
- package/ai/01-standards/agent-development-complete.md +691 -0
- package/ai/01-standards/llm-application-complete.md +488 -0
- package/ai/01-standards/mlops-complete.md +798 -0
- package/ai/01-standards/prompt-engineering-complete.md +646 -0
- package/ai/01-standards/rag-architecture-complete.md +649 -0
- package/ai/02-playbooks/llm-evaluation-playbook.md +847 -0
- package/ai/03-checklists/ai-project-checklist.md +215 -0
- package/ai/04-antipatterns/ai-antipatterns.md +661 -0
- package/ai/05-cases/case-rag-production.md +147 -0
- package/ai/06-glossary/ai-glossary.md +162 -0
- package/ai/agent-evaluation-benchmark.md +53 -0
- package/ai/ai-agent-memory-context-management.md +41 -0
- package/ai/ai-cost-capacity-optimization-playbook.md +42 -0
- package/ai/ai-data-security-and-compliance-playbook.md +37 -0
- package/ai/ai-domain-index-and-checklist.md +40 -0
- package/ai/ai-governance-maturity-model.md +50 -0
- package/ai/ai-model-selection-and-routing-strategy.md +47 -0
- package/ai/ai-observability-and-oncall-runbook.md +52 -0
- package/ai/ai-rag-engineering-playbook.md +42 -0
- package/ai/ai-red-team-and-safety-evaluation.md +42 -0
- package/ai/ai-release-readiness-and-rollback-gate.md +42 -0
- package/ai/llm-agent-engineering-deep-dive.md +57 -0
- package/ai/prompt-and-tool-guardrails.md +52 -0
- package/api/01-standards/enterprise-api-standards.md +198 -0
- package/api/01-standards/rest-api-design-guide.md +63 -0
- package/api/02-playbooks/api-pagination-playbook.md +93 -0
- package/api/02-playbooks/graphql-production-playbook.md +176 -0
- package/api/03-checklists/api-review-checklist.md +55 -0
- package/api/04-antipatterns/api-antipatterns.md +112 -0
- package/architecture/01-standards/api-gateway-patterns.md +496 -0
- package/architecture/01-standards/cloud-native-patterns.md +644 -0
- package/architecture/01-standards/distributed-systems-patterns.md +591 -0
- package/architecture/01-standards/event-driven-architecture.md +595 -0
- package/architecture/01-standards/microservices-patterns-complete.md +968 -0
- package/architecture/01-standards/microservices-patterns.md +495 -0
- package/architecture/01-standards/system-design-interview.md +664 -0
- package/architecture/02-playbooks/microservices-patterns-playbook.md +137 -0
- package/architecture/02-playbooks/migration-playbook.md +780 -0
- package/architecture/02-playbooks/system-design-playbook.md +779 -0
- package/architecture/03-checklists/architecture-decision-checklist.md +297 -0
- package/architecture/04-antipatterns/architecture-antipatterns.md +417 -0
- package/architecture/05-cases/case-netflix-microservices.md +413 -0
- package/architecture/06-glossary/architecture-glossary.md +164 -0
- package/architecture/adr-template-and-examples.md +38 -0
- package/architecture/api-gateway-deep-dive.md +1291 -0
- package/architecture/configuration-management.md +1162 -0
- package/architecture/distributed-transactions.md +1220 -0
- package/architecture/microservices-complete.md +735 -0
- package/architecture/resilience-and-disaster-patterns.md +37 -0
- package/architecture/service-governance.md +1198 -0
- package/architecture/system-architecture-deep-dive.md +37 -0
- package/backend/01-standards/analytics-and-growth.md +65 -0
- package/backend/01-standards/api-and-error-conventions.md +120 -0
- package/backend/01-standards/application-layering-and-packaging.md +160 -0
- package/backend/01-standards/auth-implementation.md +104 -0
- package/backend/01-standards/backend-framework-idioms.md +74 -0
- package/backend/01-standards/background-jobs-and-async.md +66 -0
- package/backend/01-standards/caching-strategies-complete.md +390 -0
- package/backend/01-standards/config-and-observability.md +77 -0
- package/backend/01-standards/data-modeling-and-persistence.md +94 -0
- package/backend/01-standards/django-complete.md +1765 -0
- package/backend/01-standards/email-and-notifications.md +64 -0
- package/backend/01-standards/fastapi-complete.md +925 -0
- package/backend/01-standards/file-upload-and-storage.md +66 -0
- package/backend/01-standards/graphql-api-complete.md +416 -0
- package/backend/01-standards/llm-application-standard.md +78 -0
- package/backend/01-standards/message-queue-patterns.md +379 -0
- package/backend/01-standards/microservices-and-distributed.md +78 -0
- package/backend/01-standards/nestjs-complete.md +2167 -0
- package/backend/01-standards/payment-integration.md +80 -0
- package/backend/01-standards/rate-limiting-complete.md +451 -0
- package/backend/01-standards/realtime-and-websocket.md +65 -0
- package/backend/01-standards/search-and-filtering.md +64 -0
- package/backend/01-standards/spring-boot-complete.md +445 -0
- package/backend/02-playbooks/api-design-playbook.md +718 -0
- package/backend/02-playbooks/email-send-playbook.md +130 -0
- package/backend/02-playbooks/file-upload-s3-playbook.md +153 -0
- package/backend/02-playbooks/typescript-enterprise-playbook.md +133 -0
- package/backend/02-playbooks/websocket-realtime-playbook.md +154 -0
- package/backend/03-checklists/api-launch-checklist.md +189 -0
- package/backend/04-antipatterns/backend-antipatterns.md +1051 -0
- package/blockchain/01-standards/blockchain-basics.md +557 -0
- package/blockchain/01-standards/smart-contract-development.md +1315 -0
- package/cicd/01-standards/deployment-and-delivery-standard.md +96 -0
- package/cicd/01-standards/github-actions-complete.md +473 -0
- package/cicd/01-standards/release-and-store-submission.md +75 -0
- package/cicd/02-playbooks/cicd-pipeline-playbook.md +144 -0
- package/cicd/02-playbooks/release-management-playbook.md +605 -0
- package/cicd/03-checklists/pipeline-security-checklist.md +168 -0
- package/cicd/04-antipatterns/cicd-antipatterns.md +589 -0
- package/cicd/05-cases/case-deployment-automation.md +221 -0
- package/cicd/05-cases/case-gitops-transformation.md +212 -0
- package/cicd/06-glossary/cicd-glossary.md +114 -0
- package/cicd/cicd-blueprint-deep-dive.md +38 -0
- package/cicd/release-readiness-gate.md +37 -0
- package/cloud-native/01-standards/container-security.md +741 -0
- package/cloud-native/01-standards/kubernetes-complete.md +812 -0
- package/cloud-native/02-playbooks/api-gateway-playbook.md +155 -0
- package/cloud-native/02-playbooks/gitops-with-argocd.md +760 -0
- package/cloud-native/02-playbooks/k8s-troubleshooting-playbook.md +1942 -0
- package/cloud-native/02-playbooks/message-queue-playbook.md +129 -0
- package/cloud-native/02-playbooks/multicloud-governance.md +726 -0
- package/cloud-native/02-playbooks/serverless-patterns.md +788 -0
- package/cloud-native/02-playbooks/service-mesh-playbook.md +612 -0
- package/cloud-native/02-playbooks/terraform-iac-playbook.md +143 -0
- package/cloud-native/03-checklists/container-security-checklist.md +431 -0
- package/cloud-native/03-checklists/k8s-production-readiness-checklist.md +460 -0
- package/cloud-native/04-antipatterns/container-antipatterns.md +660 -0
- package/cloud-native/04-antipatterns/k8s-antipatterns.md +743 -0
- package/cloud-native/05-cases/case-k8s-migration.md +478 -0
- package/cloud-native/05-cases/case-k8s-scaling.md +642 -0
- package/cloud-native/05-cases/case-k8s-security-incident.md +397 -0
- package/cloud-native/06-glossary/cloud-native-glossary.md +337 -0
- package/cross-platform/01-standards/cross-platform-frameworks.md +83 -0
- package/cross-platform/01-standards/platform-selection-and-architecture.md +77 -0
- package/data/01-standards/elasticsearch-complete.md +2098 -0
- package/data/01-standards/postgresql-complete.md +1613 -0
- package/data/01-standards/redis-complete.md +1527 -0
- package/data/02-playbooks/database-optimization-playbook.md +403 -0
- package/data/02-playbooks/elasticsearch-production-playbook.md +132 -0
- package/data/03-checklists/database-launch-checklist.md +187 -0
- package/data/04-antipatterns/database-antipatterns.md +873 -0
- package/data/05-cases/case-database-migration.md +310 -0
- package/data/06-glossary/database-glossary.md +440 -0
- package/data/data-governance-and-modeling-deep-dive.md +39 -0
- package/data-engineering/01-standards/airflow-complete.md +523 -0
- package/data-engineering/01-standards/kafka-complete.md +1521 -0
- package/data-engineering/02-playbooks/spark-etl-playbook.md +496 -0
- package/data-engineering/03-checklists/pipeline-launch-checklist.md +194 -0
- package/data-engineering/04-antipatterns/data-pipeline-antipatterns.md +684 -0
- package/data-engineering/05-cases/case-real-time-pipeline.md +355 -0
- package/data-engineering/06-glossary/data-engineering-glossary.md +429 -0
- package/database/01-standards/database-schema-standards.md +147 -0
- package/database/02-playbooks/postgresql-optimization-quick.md +52 -0
- package/database/02-playbooks/postgresql-performance-optimization.md +58 -0
- package/database/02-playbooks/postgresql-production-playbook.md +146 -0
- package/database/02-playbooks/redis-caching-playbook.md +117 -0
- package/database/03-checklists/database-review-checklist.md +50 -0
- package/database/04-antipatterns/database-antipatterns.md +112 -0
- package/design/01-standards/ui-design-system-complete.md +423 -0
- package/design/02-playbooks/design-handoff-playbook.md +254 -0
- package/design/02-playbooks/design-review-playbook.md +388 -0
- package/design/03-checklists/design-review-checklist.md +246 -0
- package/design/04-antipatterns/design-antipatterns.md +378 -0
- package/design/05-cases/case-design-system-adoption.md +328 -0
- package/design/06-glossary/design-glossary.md +329 -0
- package/design/ui-full-lifecycle-cross-platform-playbook.md +571 -0
- package/design/ux-system-deep-dive.md +38 -0
- package/design-systems/00-craft-rules.md +71 -0
- package/design-systems/aesthetic-families.md +43 -0
- package/design-systems/anti-ai-slop.md +162 -0
- package/design-systems/bold-geometric.md +120 -0
- package/design-systems/brutalist-bold.md +103 -0
- package/design-systems/editorial-clean.md +109 -0
- package/design-systems/glass-aurora.md +108 -0
- package/design-systems/modern-minimal.md +145 -0
- package/design-systems/premium-luxury.md +106 -0
- package/design-systems/product-type-design-map.md +48 -0
- package/design-systems/soft-warm.md +123 -0
- package/design-systems/tech-utility.md +113 -0
- package/desktop/01-standards/desktop-app-standard.md +72 -0
- package/desktop/01-standards/desktop-design.md +71 -0
- package/development/00-governance/document-template.md +41 -0
- package/development/01-standards/api-versioning-strategies.md +432 -0
- package/development/01-standards/authentication-patterns-complete.md +479 -0
- package/development/01-standards/css-architecture-complete.md +550 -0
- package/development/01-standards/database-migration-strategies.md +484 -0
- package/development/01-standards/elasticsearch-complete.md +347 -0
- package/development/01-standards/git-complete.md +371 -0
- package/development/01-standards/golang-complete.md +1565 -0
- package/development/01-standards/graphql-complete.md +298 -0
- package/development/01-standards/javascript-bundlers-complete.md +469 -0
- package/development/01-standards/javascript-typescript-complete.md +528 -0
- package/development/01-standards/jest-complete.md +275 -0
- package/development/01-standards/linux-complete.md +234 -0
- package/development/01-standards/logging-observability-complete.md +526 -0
- package/development/01-standards/microservices-communication.md +502 -0
- package/development/01-standards/mongodb-complete.md +406 -0
- package/development/01-standards/oauth2-complete.md +285 -0
- package/development/01-standards/performance-optimization-complete.md +289 -0
- package/development/01-standards/playwright-complete.md +247 -0
- package/development/01-standards/postgresql-complete.md +456 -0
- package/development/01-standards/pytest-complete.md +340 -0
- package/development/01-standards/python-async-programming.md +902 -0
- package/development/01-standards/python-complete.md +956 -0
- package/development/01-standards/python-decorators-complete.md +799 -0
- package/development/01-standards/python-design-patterns.md +2854 -0
- package/development/01-standards/python-packaging-distribution.md +420 -0
- package/development/01-standards/python-testing-strategies.md +607 -0
- package/development/01-standards/python-web-frameworks-comparison.md +471 -0
- package/development/01-standards/redis-complete.md +317 -0
- package/development/01-standards/rest-api-complete.md +316 -0
- package/development/01-standards/rust-complete.md +578 -0
- package/development/01-standards/typescript-advanced-types.md +1513 -0
- package/development/01-standards/web-security-complete.md +292 -0
- package/development/02-playbooks/api-design-playbook.md +810 -0
- package/development/02-playbooks/database-migration-playbook.md +580 -0
- package/development/02-playbooks/debugging-playbook.md +692 -0
- package/development/02-playbooks/feature-delivery-playbook.md +430 -0
- package/development/02-playbooks/incident-hotfix-playbook.md +387 -0
- package/development/02-playbooks/performance-optimization-playbook.md +531 -0
- package/development/02-playbooks/performance-tuning-playbook.md +652 -0
- package/development/02-playbooks/refactor-playbook.md +403 -0
- package/development/02-playbooks/release-playbook.md +469 -0
- package/development/03-checklists/architecture-review-checklist.md +168 -0
- package/development/03-checklists/data-migration-checklist.md +157 -0
- package/development/03-checklists/oncall-handover-checklist.md +173 -0
- package/development/03-checklists/pr-checklist.md +158 -0
- package/development/03-checklists/production-readiness-checklist.md +190 -0
- package/development/03-checklists/release-readiness-checklist.md +154 -0
- package/development/03-checklists/security-review-checklist.md +182 -0
- package/development/04-antipatterns/api-antipatterns.md +657 -0
- package/development/04-antipatterns/architecture-antipatterns.md +686 -0
- package/development/04-antipatterns/backend-antipatterns.md +648 -0
- package/development/04-antipatterns/cicd-antipatterns.md +540 -0
- package/development/04-antipatterns/code-smell-antipatterns.md +571 -0
- package/development/04-antipatterns/data-antipatterns.md +658 -0
- package/development/04-antipatterns/database-antipatterns.md +578 -0
- package/development/04-antipatterns/frontend-antipatterns.md +635 -0
- package/development/04-antipatterns/reliability-antipatterns.md +700 -0
- package/development/04-antipatterns/security-antipatterns.md +747 -0
- package/development/05-cases/case-api-version-migration.md +428 -0
- package/development/05-cases/case-authorization-hardening.md +383 -0
- package/development/05-cases/case-bluegreen-rollback.md +466 -0
- package/development/05-cases/case-cache-snowball-protection.md +485 -0
- package/development/05-cases/case-ci-cd-pipeline.md +544 -0
- package/development/05-cases/case-database-scaling.md +500 -0
- package/development/05-cases/case-db-hotspot-optimization.md +487 -0
- package/development/05-cases/case-incident-mttr-reduction.md +563 -0
- package/development/05-cases/case-microservice-migration.md +375 -0
- package/development/05-cases/case-performance-optimization.md +406 -0
- package/development/05-cases/case-security-incident-response.md +345 -0
- package/development/06-glossary/full-stack-glossary.md +166 -0
- package/development/09-maturity/quarterly-audit-template.md +35 -0
- package/development/11-ui-excellence/ui-aesthetic-system.md +41 -0
- package/development/11-ui-excellence/ui-engineering-excellence.md +435 -0
- package/development/12-scenarios/development-scenarios-guide.md +565 -0
- package/development/13-implementation-assets/implementation-toolkit.md +282 -0
- package/development/13-implementation-assets/knowledge-gates-execution.md +43 -0
- package/development/14-full-lifecycle/software-lifecycle-gates.md +511 -0
- package/development/15-lifecycle-templates/project-templates-collection.md +791 -0
- package/development/api-contract-and-versioning-guide.md +36 -0
- package/development/api-governance-complete.md +43 -0
- package/development/backend-engineering-complete.md +43 -0
- package/development/code-review-quality-complete.md +43 -0
- package/development/concurrency-reliability-complete.md +43 -0
- package/development/database-engineering-complete.md +43 -0
- package/development/engineering-effectiveness-complete.md +43 -0
- package/development/engineering-standards-deep-dive.md +38 -0
- package/development/frontend-engineering-complete.md +43 -0
- package/development/performance-capacity-complete.md +43 -0
- package/development/refactor-migration-complete.md +42 -0
- package/development/refactoring-and-techdebt-playbook.md +37 -0
- package/development/security-in-development-complete.md +43 -0
- package/devops/01-standards/cicd-pipeline-complete.md +262 -0
- package/devops/01-standards/docker-complete.md +1490 -0
- package/devops/01-standards/github-actions-complete.md +337 -0
- package/devops/01-standards/kubernetes-complete.md +638 -0
- package/devops/01-standards/terraform-complete.md +2117 -0
- package/devops/02-playbooks/docker-compose-playbook.md +233 -0
- package/devops/02-playbooks/docker-k8s-production-playbook.md +186 -0
- package/devops/02-playbooks/docker-production-playbook.md +952 -0
- package/edge-iot/01-standards/edge-iot-complete.md +473 -0
- package/experts/architect/api-design.md +178 -0
- package/experts/architect/methodology.md +124 -0
- package/experts/architect/security.md +75 -0
- package/experts/backend-lead/methodology.md +216 -0
- package/experts/devops/methodology.md +160 -0
- package/experts/frontend-lead/methodology.md +178 -0
- package/experts/product-manager/industry/ecommerce.md +43 -0
- package/experts/product-manager/industry/saas.md +40 -0
- package/experts/product-manager/methodology.md +97 -0
- package/experts/qa-lead/methodology.md +123 -0
- package/experts/qa-lead/test-strategy.md +128 -0
- package/experts/uiux-designer/methodology.md +125 -0
- package/frontend/01-standards/accessibility-complete.md +532 -0
- package/frontend/01-standards/accessibility-standard.md +74 -0
- package/frontend/01-standards/admin-dashboard-and-crud.md +72 -0
- package/frontend/01-standards/design-tokens-complete.md +444 -0
- package/frontend/01-standards/forms-and-validation.md +77 -0
- package/frontend/01-standards/frontend-architecture-and-layering.md +119 -0
- package/frontend/01-standards/i18n-and-localization.md +65 -0
- package/frontend/01-standards/nextjs-complete.md +451 -0
- package/frontend/01-standards/react-complete.md +713 -0
- package/frontend/01-standards/react-hooks-complete-guide.md +1100 -0
- package/frontend/01-standards/react-hooks-complete.md +1171 -0
- package/frontend/01-standards/seo-and-web-vitals.md +77 -0
- package/frontend/01-standards/state-management-complete.md +444 -0
- package/frontend/01-standards/vue-complete.md +499 -0
- package/frontend/01-standards/vue3-complete.md +2002 -0
- package/frontend/01-standards/web-framework-best-practices.md +64 -0
- package/frontend/01-standards/web-performance-complete.md +495 -0
- package/frontend/02-playbooks/accessibility-a11y-playbook.md +161 -0
- package/frontend/02-playbooks/frontend-performance-playbook.md +707 -0
- package/frontend/02-playbooks/i18n-internationalization-playbook.md +120 -0
- package/frontend/02-playbooks/performance-optimization-playbook.md +163 -0
- package/frontend/02-playbooks/react-nextjs-production-playbook.md +167 -0
- package/frontend/02-playbooks/react-state-management-playbook.md +173 -0
- package/frontend/03-checklists/component-quality-checklist.md +166 -0
- package/frontend/03-checklists/frontend-launch-checklist.md +299 -0
- package/frontend/04-antipatterns/frontend-antipatterns.md +886 -0
- package/frontend/05-cases/case-performance-optimization.md +274 -0
- package/harmony/01-standards/harmonyos-arkts-standard.md +75 -0
- package/harmony/01-standards/harmonyos-design.md +65 -0
- package/high-quality-engineering-playbook.md +54 -0
- package/incident/01-standards/incident-response-complete.md +303 -0
- package/incident/02-playbooks/chaos-engineering-playbook.md +883 -0
- package/incident/02-playbooks/postmortem-playbook.md +398 -0
- package/incident/03-checklists/incident-readiness-checklist.md +181 -0
- package/incident/04-antipatterns/incident-antipatterns.md +490 -0
- package/incident/05-cases/case-cascade-failure.md +176 -0
- package/incident/06-glossary/incident-glossary.md +114 -0
- package/incident/postmortem-and-response-deep-dive.md +39 -0
- package/industries/ecommerce/ecommerce-complete.md +631 -0
- package/industries/education/education-complete.md +555 -0
- package/industries/fintech/fintech-complete.md +501 -0
- package/industries/gaming/gaming-complete.md +587 -0
- package/industries/healthcare/healthcare-complete.md +452 -0
- package/low-code/01-standards/low-code-complete.md +944 -0
- package/miniprogram/01-standards/ai-common-mistakes.md +61 -0
- package/miniprogram/01-standards/miniprogram-custom-navbar-capsule.md +77 -0
- package/miniprogram/01-standards/miniprogram-design.md +61 -0
- package/miniprogram/01-standards/miniprogram-standard.md +81 -0
- package/mobile/01-standards/android-material-design.md +70 -0
- package/mobile/01-standards/flutter-complete.md +384 -0
- package/mobile/01-standards/ios-design-hig.md +78 -0
- package/mobile/01-standards/mobile-app-standard.md +85 -0
- package/mobile/01-standards/react-native-complete.md +352 -0
- package/mobile/02-playbooks/mobile-cross-platform-playbook.md +175 -0
- package/mobile/02-playbooks/mobile-performance.md +473 -0
- package/mobile/03-checklists/mobile-release-checklist.md +234 -0
- package/mobile/04-antipatterns/mobile-antipatterns.md +798 -0
- package/mobile/05-cases/case-app-performance.md +500 -0
- package/mobile/05-cases/case-app-startup-optimization.md +218 -0
- package/mobile/06-glossary/mobile-glossary.md +484 -0
- package/observability/01-standards/observability-standards.md +103 -0
- package/observability/02-playbooks/prometheus-grafana-playbook.md +135 -0
- package/observability/02-playbooks/structured-logging-playbook.md +73 -0
- package/observability/03-checklists/observability-checklist.md +54 -0
- package/observability/04-antipatterns/observability-antipatterns.md +106 -0
- package/operations/01-standards/prometheus-monitoring-complete.md +1578 -0
- package/operations/02-playbooks/capacity-planning-playbook.md +620 -0
- package/operations/03-checklists/production-launch-checklist.md +365 -0
- package/operations/04-antipatterns/operations-antipatterns.md +664 -0
- package/operations/05-cases/case-sre-practices.md +581 -0
- package/operations/06-glossary/operations-glossary.md +120 -0
- package/operations/aiops-anomaly-detection.md +758 -0
- package/operations/capacity-planning.md +1061 -0
- package/operations/chaos-engineering.md +659 -0
- package/operations/incident-command-system.md +38 -0
- package/operations/observability-complete.md +442 -0
- package/operations/slo-sli-playbook.md +517 -0
- package/operations/sre-operations-deep-dive.md +39 -0
- package/package.json +8 -0
- package/performance/01-standards/performance-and-scalability.md +80 -0
- package/performance/01-standards/performance-standards.md +156 -0
- package/performance/02-playbooks/query-optimization-playbook.md +103 -0
- package/performance/03-checklists/performance-checklist.md +56 -0
- package/performance/04-antipatterns/performance-antipatterns.md +146 -0
- package/product/01-standards/product-management-complete.md +285 -0
- package/product/02-playbooks/feature-launch-playbook.md +207 -0
- package/product/02-playbooks/user-research-playbook.md +532 -0
- package/product/03-checklists/feature-launch-checklist.md +275 -0
- package/product/04-antipatterns/product-antipatterns.md +355 -0
- package/product/05-cases/case-mvp-to-scale.md +384 -0
- package/product/06-glossary/product-glossary.md +462 -0
- package/product/feature-prioritization-framework.md +40 -0
- package/product/kpi-and-metric-tree.md +37 -0
- package/product/product-discovery-and-prd-deep-dive.md +41 -0
- package/quantum/01-standards/quantum-complete.md +1186 -0
- package/security/01-standards/api-security-complete.md +511 -0
- package/security/01-standards/container-runtime-security.md +574 -0
- package/security/01-standards/data-protection-gdpr.md +543 -0
- package/security/01-standards/owasp-top10-complete.md +1890 -0
- package/security/01-standards/secure-coding-baseline.md +90 -0
- package/security/01-standards/supply-chain-security.md +441 -0
- package/security/01-standards/web-security-checklist.md +108 -0
- package/security/01-standards/zero-trust-architecture.md +521 -0
- package/security/02-playbooks/auth-sso-playbook.md +166 -0
- package/security/02-playbooks/incident-response-security-playbook.md +588 -0
- package/security/02-playbooks/owasp-api-security-playbook.md +129 -0
- package/security/02-playbooks/payment-integration-playbook.md +119 -0
- package/security/02-playbooks/penetration-testing-playbook.md +517 -0
- package/security/03-checklists/security-audit-checklist.md +356 -0
- package/security/04-antipatterns/security-coding-antipatterns.md +580 -0
- package/security/05-cases/case-log4shell-incident.md +537 -0
- package/security/05-cases/case-major-breaches.md +468 -0
- package/security/06-glossary/security-glossary.md +212 -0
- package/security/compliance-automation.md +993 -0
- package/security/container-security.md +680 -0
- package/security/devsecops-complete.md +426 -0
- package/security/sast-dast-sca.md +775 -0
- package/security/secrets-management.md +594 -0
- package/security/security-architecture-deep-dive.md +37 -0
- package/security/threat-modeling-stride-playbook.md +40 -0
- package/seed-templates/auth-system.md +59 -0
- package/seed-templates/blog-content.md +94 -0
- package/seed-templates/dashboard.md +89 -0
- package/seed-templates/docs-site.md +73 -0
- package/seed-templates/e-commerce.md +50 -0
- package/seed-templates/saas-landing.md +92 -0
- package/seed-templates/settings-page.md +51 -0
- package/testing/01-standards/test-strategy-and-layering.md +83 -0
- package/testing/01-standards/testing-strategy-complete.md +422 -0
- package/testing/01-standards/unit-testing-best-practices.md +118 -0
- package/testing/02-playbooks/e2e-testing-playbook.md +988 -0
- package/testing/02-playbooks/testing-strategy-playbook.md +126 -0
- package/testing/03-checklists/test-strategy-checklist.md +208 -0
- package/testing/04-antipatterns/testing-antipatterns.md +718 -0
- package/testing/05-cases/case-testing-transformation.md +300 -0
- package/testing/06-glossary/testing-glossary.md +110 -0
- package/testing/risk-based-test-matrix.md +36 -0
- package/testing/testing-strategy-deep-dive.md +37 -0
@@ -0,0 +1,1942 @@
---
title: Kubernetes Troubleshooting Playbook
version: 1.0.0
last_updated: 2026-03-28
owner: platform-team
tags: [kubernetes, troubleshooting, diagnostics, pod, network, storage, node]
status: production
domain: cloud-native
difficulty: intermediate
quality_score: 70
---

# Author: Excellent (11964948@qq.com)
# Function: Kubernetes troubleshooting playbook covering all scenarios
# Purpose: Guide the diagnosis, root-cause analysis, and repair of K8s cluster failures
# Created: 2026-03-28
# Last modified: 2026-03-28
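
## Goals

Establish a standardized Kubernetes troubleshooting process that ensures:
- Faults are located quickly, with MTTR kept under 15 minutes
- Diagnostic procedures are reproducible and traceable
- Operators can independently fix 90% of common failures by following this playbook
- Root causes are archived, closing the loop for continuous improvement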

## Applicable Scenarios

- Pod lifecycle anomalies (CrashLoopBackOff / ImagePullBackOff / OOMKilled / Pending / Evicted)
- Network connectivity failures (Service unreachable / DNS resolution failure / blocked by NetworkPolicy / Ingress 5xx)
- Storage failures (PVC Pending / mount failure / permission problems)
- Node failures (NotReady / disk pressure / memory pressure / PID pressure)
- Resource bottlenecks (CPU throttling / memory leaks / HPA not taking effect)
- Deployment failures (stuck rolling update / rollback / ConfigMap or Secret updates not taking effect)

---

## 1. General Diagnostic Entry Point

Before investigating any failure, run the following commands to build a cluster-wide picture:

```bash
# Overall cluster health
kubectl get nodes -o wide
kubectl get cs                        # Control-plane component status (deprecated since 1.19; use the next command instead)
kubectl get --raw='/readyz?verbose'   # API server readiness check

# Quick scan for unhealthy resources in the current namespace
# Note: a Pod in CrashLoopBackOff still reports phase Running, so also check the STATUS column of `kubectl get pods`
kubectl get pods --field-selector=status.phase!=Running,status.phase!=Succeeded
kubectl get events --sort-by='.lastTimestamp' | tail -30

# Resource overview (requires metrics-server)
kubectl top nodes
kubectl top pods --sort-by=memory
```

> **Principle**: check Events first, then Describe, then Logs. As a rule of thumb, about 90% of failures can be located at the Events stage.
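
The Events, Describe, Logs order can be captured in a small helper. The sketch below is a dry run under stated assumptions: `triage_plan` is a hypothetical function name, the pod `web-7d9f` and namespace `payments` are made-up values, and the function only prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# triage_plan (hypothetical helper): print the three triage commands
# in the order the principle recommends. Nothing is executed.
triage_plan() {
  local pod="$1"
  local ns="${2:-default}"   # fall back to the "default" namespace
  # 1. Events that involve this object, sorted by last-seen time
  printf 'kubectl get events -n %s --field-selector involvedObject.name=%s --sort-by=.lastTimestamp\n' "$ns" "$pod"
  # 2. Full description (spec, container states, conditions)
  printf 'kubectl describe pod %s -n %s\n' "$pod" "$ns"
  # 3. Logs of the previous container instance
  printf 'kubectl logs %s -n %s --previous --tail=200\n' "$pod" "$ns"
}

triage_plan web-7d9f payments   # made-up pod and namespace
```

Printing rather than executing keeps the helper safe to try anywhere; piping its output to `sh` would run the commands against whatever kube context is currently active.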

---

## 2. Pod Troubleshooting

### 2.1 CrashLoopBackOff

**Symptoms**

- The Pod restarts repeatedly and the `RESTARTS` count keeps growing
- `kubectl get pods` shows the status `CrashLoopBackOff`
- The restart back-off delay grows exponentially from 10s up to a cap of 5min
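
To make the back-off growth concrete, the sketch below approximates the delay before each restart by doubling from 10s and capping at 300s. `crashloop_delay` is a hypothetical helper written for illustration; actual kubelet timing can differ slightly.

```bash
#!/usr/bin/env bash
# crashloop_delay (hypothetical helper): approximate wait, in seconds,
# before restart number N (0-based), doubling from 10s and capped at 300s.
crashloop_delay() {
  local restarts="$1"
  local delay=10
  local i=0
  while [ "$i" -lt "$restarts" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then
    delay=300   # 5-minute cap
  fi
  echo "$delay"
}

for n in 0 1 2 3 4 5 6; do
  printf 'restart %s: wait about %ss\n' "$n" "$(crashloop_delay "$n")"
done
```

The cap is reached after five doublings, which is why a Pod that has been crashing for a while appears to retry only once every five minutes. The kubelet resets the back-off once a container has run for 10 minutes without problems.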

**Diagnostic Commands**

```bash
# 1. Inspect Pod events
kubectl describe pod <pod-name> -n <namespace>

# 2. View logs of the current container
kubectl logs <pod-name> -n <namespace> --tail=100

# 3. View logs from the previous, crashed container (key step)
kubectl logs <pod-name> -n <namespace> --previous --tail=200

# 4. For a multi-container Pod, name the container
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

# 5. Inspect the exit code of the last terminated container
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
```

**Exit Code Quick Reference**

| Exit code | Meaning | Common causes |
|-----------|---------|---------------|
| 0 | Normal exit | One-shot task finished but restartPolicy=Always |
| 1 | Application error | Uncaught exception, configuration error |
| 126 | Permission denied | Binary is not executable |
| 127 | Command not found | Wrong entrypoint/cmd path |
| 137 | SIGKILL (128+9) | Most often OOM: killed by the cgroup for exceeding the memory limit |
| 139 | SIGSEGV (128+11) | Segmentation fault, bug in native code |
| 143 | SIGTERM (128+15) | Normal termination signal, but the process did not shut down gracefully |
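
The table can be turned into a lookup that also handles signals it does not list, since exit codes above 128 conventionally mean 128 plus the signal number. A minimal sketch; `explain_exit_code` is a hypothetical helper and its wording is illustrative.

```bash
#!/usr/bin/env bash
# explain_exit_code (hypothetical helper): map a container exit code to the
# meanings in the table. Codes above 128 are read as 128 + signal number.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "normal exit" ;;
    1)   echo "application error" ;;
    126) echo "command not executable" ;;
    127) echo "command not found" ;;
    137) echo "SIGKILL (128+9), often OOM" ;;
    139) echo "SIGSEGV (128+11)" ;;
    143) echo "SIGTERM (128+15)" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -lt 256 ]; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}

explain_exit_code 137
explain_exit_code 130   # 128 + 2 (SIGINT)
```

In practice the input would come from the `exitCode` field of `.status.containerStatuses[0].lastState.terminated`. For 137, the `reason` field in the same object shows whether the kill was actually an `OOMKilled`.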
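
**Root Cause Analysis**

1. **Application crash**: look for the exception stack trace in the `--previous` logs
2. **Missing configuration**: environment variables, ConfigMap, or Secret not mounted, or holding wrong values
3. **Dependency unavailable**: the connection to the database / message queue / external API fails
4. **Health check too strict**: the livenessProbe times out, so the container is killed repeatedly
5. **Startup ordering dependency**: the initContainer is not configured correctly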

**Solutions**

```yaml
# Scenario: an overly aggressive livenessProbe causes a restart loop
# Fix: increase initialDelaySeconds and timeoutSeconds
# (container-level field)
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30  # Give the application enough time to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1

# Scenario: a dependency is not ready yet
# Fix: wait for the dependency in an initContainer
# (Pod spec-level field, sibling of containers)
initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z db-service 5432; do echo waiting; sleep 2; done']
```

**Prevention**

- [ ] Every slow-starting service must configure a reasonable startupProbe
- [ ] livenessProbe initialDelaySeconds >= 1.5x the application's average startup time
- [ ] Critical dependencies are awaited with an initContainer readiness check
- [ ] The application degrades gracefully instead of crashing when a dependency is unavailable
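
The 1.5x rule for `initialDelaySeconds` is simple arithmetic that can be checked in a review script. The sketch below assumes whole-second values; `min_initial_delay` and `check_initial_delay` are hypothetical helpers, and the startup times are made-up examples.

```bash
#!/usr/bin/env bash
# min_initial_delay (hypothetical helper): smallest whole number of seconds
# that is >= 1.5x the average startup time.
min_initial_delay() {
  local avg_startup="$1"
  echo $(( (avg_startup * 3 + 1) / 2 ))   # integer ceil(1.5 * avg)
}

# check_initial_delay (hypothetical helper): compare a configured
# initialDelaySeconds against the 1.5x rule.
check_initial_delay() {
  local configured="$1"
  local avg_startup="$2"
  local required
  required="$(min_initial_delay "$avg_startup")"
  if [ "$configured" -ge "$required" ]; then
    echo "OK: $configured >= $required"
  else
    echo "TOO LOW: $configured < $required"
  fi
}

check_initial_delay 30 20   # assumed: probe delay 30s, average startup 20s
check_initial_delay 30 25   # assumed: same delay, slower startup of 25s
```

With a 30s delay, an application that starts in 20s on average just meets the rule, while one that needs 25s does not and would be a candidate for a longer delay or a startupProbe.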

---

### 2.2 ImagePullBackOff

**Symptoms**

- The Pod is stuck in `ImagePullBackOff` or `ErrImagePull`
- Events contain `Failed to pull image` or `unauthorized`
|
|
148
|
+
|
|
149
|
+
**诊断命令**
|
|
150
|
+
|
|
151
|
+
```bash
|
|
152
|
+
# 1. 查看详细错误信息
|
|
153
|
+
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Events:"
|
|
154
|
+
|
|
155
|
+
# 2. 检查镜像地址是否正确
|
|
156
|
+
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'
|
|
157
|
+
|
|
158
|
+
# 3. 检查 imagePullSecrets
|
|
159
|
+
kubectl get pod <pod-name> -o jsonpath='{.spec.imagePullSecrets}'
|
|
160
|
+
|
|
161
|
+
# 4. 检查 Secret 内容
|
|
162
|
+
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
|
|
163
|
+
|
|
164
|
+
# 5. 在节点上手动拉取验证
|
|
165
|
+
crictl pull <image-url>
|
|
166
|
+
```
|
|
167
|
+
|
|
168
|
+
**根因分析**
|
|
169
|
+
|
|
170
|
+
1. **镜像不存在**:tag 拼写错误、镜像已被删除
|
|
171
|
+
2. **认证失败**:imagePullSecret 过期或配置错误
|
|
172
|
+
3. **网络不通**:节点无法访问镜像仓库(防火墙 / 代理)
|
|
173
|
+
4. **仓库限流**:Docker Hub 匿名拉取限制(100次/6小时)
|
|
174
|
+
5. **镜像架构不匹配**:ARM 节点拉取 AMD64 镜像
|
|
175
|
+
|
|
176
|
+
**解决方案**
|
|
177
|
+
|
|
178
|
+
```bash
|
|
179
|
+
# 创建/更新 imagePullSecret
|
|
180
|
+
kubectl create secret docker-registry regcred \
|
|
181
|
+
--docker-server=registry.example.com \
|
|
182
|
+
--docker-username=user \
|
|
183
|
+
--docker-password=pass \
|
|
184
|
+
--docker-email=user@example.com \
|
|
185
|
+
-n <namespace> --dry-run=client -o yaml | kubectl apply -f -
|
|
186
|
+
|
|
187
|
+
# 配置 ServiceAccount 默认拉取凭证(推荐)
|
|
188
|
+
kubectl patch serviceaccount default -n <namespace> \
|
|
189
|
+
-p '{"imagePullSecrets": [{"name": "regcred"}]}'
|
|
190
|
+
|
|
191
|
+
# Docker Hub 限流解决:配置镜像代理
|
|
192
|
+
# 在 containerd 配置中添加 mirror
|
|
193
|
+
# /etc/containerd/config.toml
|
|
194
|
+
# [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
|
|
195
|
+
# endpoint = ["https://mirror.example.com"]
|
|
196
|
+
```
|
|
197
|
+
|
|
198
|
+
**预防措施**
|
|
199
|
+
|
|
200
|
+
- [ ] CI/CD 流水线中验证镜像推送成功后再触发部署
|
|
201
|
+
- [ ] 使用确定性 tag(如 sha256 digest),禁止线上使用 `latest`
|
|
202
|
+
- [ ] imagePullSecret 通过 Sealed Secrets 或 External Secrets Operator 管理
|
|
203
|
+
- [ ] 建立私有镜像仓库,避免公网依赖

---

### 2.3 OOMKilled

**Symptoms**

- Pod status `OOMKilled`, exit code 137
- `kubectl describe pod` shows `Reason: OOMKilled`
- `oom-kill` entries appear in the node's dmesg

**Diagnostic commands**

```bash
# 1. Confirm the OOM event
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# 2. Show current memory usage
kubectl top pod <pod-name> --containers

# 3. Show resource limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# 4. Node-level OOM events
kubectl get events --field-selector reason=OOMKilling -A

# 5. Read cgroup memory statistics (requires SSH to the node)
# cgroup v1 path shown; the exact path depends on the QoS class and cgroup driver.
# On cgroup v2 the peak is exposed as memory.peak instead.
cat /sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>/memory.max_usage_in_bytes
```

**Root cause analysis**

1. **Limits set too low**: the application's normal peak exceeds its memory limit
2. **Memory leak**: memory grows steadily the longer the application runs
3. **Traffic burst**: a spike in requests makes memory usage surge
4. **JVM off-heap memory**: a Java application's native memory is not covered by `-Xmx`
5. **Child process memory**: child processes started inside the container use more than expected

**Solutions**

```yaml
# 1. Raise limits (observed peak usage plus a buffer of about 20%)
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "768Mi"   # 1.5x the request

# 2. Java example: bound heap and off-heap memory
# An explicit -Xmx takes precedence over -XX:MaxRAMPercentage, so set only one of them.
# To size the heap from the container limit, drop -Xms/-Xmx and use -XX:MaxRAMPercentage=75.0.
env:
  - name: JAVA_OPTS
    value: >-
      -Xms256m -Xmx512m
      -XX:MaxMetaspaceSize=128m
      -XX:MaxDirectMemorySize=64m
      -XX:+UseContainerSupport

# 3. Go application: set GOMEMLIMIT
env:
  - name: GOMEMLIMIT
    value: "600MiB"   # ~80% of the limit
```

**Prevention**

- [ ] Every container must set memory requests and limits
- [ ] For production, establish a resource baseline through load testing; limits = P99 * 1.5
- [ ] Java applications must run with `-XX:+UseContainerSupport` (on by default in current JDKs)
- [ ] Configure a Prometheus alert that fires when container memory usage exceeds 80%
- [ ] Run memory profiling regularly (pprof / async-profiler)
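
The sizing rules above (limits = P99 * 1.5, GOMEMLIMIT at about 80% of the limit) can be sketched as integer arithmetic. `memory_limit_mib` and `gomemlimit_mib` are hypothetical helper names, and the P99 value of 512 MiB is an assumed measurement chosen only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sizing helpers; all values are in MiB.

# limits = P99 * 1.5, rounded up.
memory_limit_mib() {
  local p99=$1
  echo $(( (p99 * 3 + 1) / 2 ))
}

# GOMEMLIMIT at roughly 80% of the container limit, rounded down.
gomemlimit_mib() {
  local limit=$1
  echo $(( limit * 80 / 100 ))
}

p99=512   # assumed P99 memory usage from a load test
limit=$(memory_limit_mib "$p99")
echo "limits.memory: ${limit}Mi"
echo "GOMEMLIMIT: $(gomemlimit_mib "$limit")MiB"
```

Keeping the runtime's own ceiling below the cgroup limit leaves room for memory the runtime does not track, so the garbage collector reacts before the kernel OOM killer does.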

---

### 2.4 Pending

**Symptoms**

- The Pod stays in `Pending` for a long time
- The Events section of `kubectl describe pod` reports scheduling failures

**Diagnostic commands**

```bash
# 1. Show scheduling events
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 "Events:"

# 2. Show what is already allocated on each node
kubectl describe nodes | grep -A 5 "Allocated resources"

# 3. Show the Pod's resource requests
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources.requests}'

# 4. Show nodeSelector / affinity / tolerations
kubectl get pod <pod-name> -o yaml | grep -A 10 'nodeSelector\|affinity\|tolerations'

# 5. Check PVC binding status
kubectl get pvc -n <namespace>
```

**Root cause analysis**

1. **Insufficient resources**: no node in the cluster can satisfy the CPU/memory requests
2. **nodeSelector matches nothing**: the label selector does not match any node labels
3. **Affinity conflict**: the `requiredDuringScheduling` rules are too strict
4. **Untolerated taint**: a node has a taint but the Pod has no matching toleration
5. **PVC not bound**: the StorageClass cannot provision dynamically, or there are not enough PVs
6. **Pod count limit**: the node's maxPods is reached (default 110)
7. **ResourceQuota exceeded**: the namespace quota is used up (Pod creation is then usually rejected outright, so also check the events of the owning ReplicaSet or Job)

**Solutions**

```bash
# Scenario: insufficient resources -> show allocatable capacity per node
# (subtract the "Allocated resources" from kubectl describe node to get the remaining headroom)
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory,\
PODS:.status.allocatable.pods

# Scenario: blocked by a taint -> add a toleration
# Given a taint applied with: kubectl taint nodes <node> key=value:NoSchedule
# add this to the Pod spec:
# tolerations:
#   - key: "key"
#     operator: "Equal"
#     value: "value"
#     effect: "NoSchedule"

# Scenario: ResourceQuota exceeded
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota <name> -n <namespace>
```

**Prevention**

- [ ] Configure Cluster Autoscaler so nodes are added automatically when resources run short
- [ ] Assign a PriorityClass to non-critical workloads so they can be preempted when resources are tight
- [ ] Audit ResourceQuota and LimitRange settings regularly
- [ ] Monitor cluster utilization with the `kubectl-resource-capacity` plugin
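
Checking whether a request fits on a node means comparing CPU quantities written in different units (`500m` versus `4`). The sketch below normalizes them to millicores; `cpu_to_millicores` and `cpu_request_fits` are hypothetical helpers, and the node figures are made-up inputs rather than output from a real cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for comparing Kubernetes CPU quantities.

# Convert a CPU quantity ("500m", "2", "0.5") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
  esac
}

# Succeed if a request fits into what is left on a node.
# Arguments: request, node allocatable, sum of requests already on the node.
cpu_request_fits() {
  local request allocatable allocated
  request=$(cpu_to_millicores "$1")
  allocatable=$(cpu_to_millicores "$2")
  allocated=$(cpu_to_millicores "$3")
  [ "$request" -le $(( allocatable - allocated )) ]
}

if cpu_request_fits 500m 4 3600m; then
  echo "fits"
else
  echo "does not fit: the Pod would stay Pending on this node"
fi
```

The scheduler performs this comparison against requests, not actual usage, so a node that looks idle in `kubectl top` can still reject a Pod.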

---

### 2.5 Evicted

**Symptoms**

- Pod status `Evicted`, with many evicted Pods left behind
- Events show `The node was low on resource: ephemeral-storage / memory`

**Diagnostic commands**

```bash
# 1. List evicted Pods
kubectl get pods --field-selector=status.phase=Failed -A | grep Evicted

# 2. Show the eviction reason
kubectl get pod <evicted-pod> -o jsonpath='{.status.reason} {.status.message}'

# 3. Node resource pressure conditions
kubectl describe node <node-name> | grep -A 5 "Conditions:"

# 4. Bulk-delete evicted Pods
kubectl get pods -A --field-selector=status.phase=Failed | grep Evicted | \
  awk '{print "kubectl delete pod " $2 " -n " $1}' | sh
```

**Root cause analysis**

1. **Ephemeral storage limit exceeded**: the container writes logs / temporary files beyond its ephemeral-storage limit
2. **Node disk pressure**: free disk space on the node falls below the kubelet eviction threshold (defaults: `nodefs.available` < 10%, `imagefs.available` < 15%)
3. **Node memory pressure**: low system memory triggers kubelet eviction
4. **Image / container garbage buildup**: without garbage collection configured, disk space runs out

**Solutions**

```yaml
# Set ephemeral-storage limits
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"

# Configure kubelet garbage collection (node level)
# /var/lib/kubelet/config.yaml
# imageGCHighThresholdPercent: 85
# imageGCLowThresholdPercent: 80
# evictionHard:
#   memory.available: "100Mi"
#   nodefs.available: "10%"
#   imagefs.available: "15%"
```

**Prevention**

- [ ] Write logs to stdout and let the log collection system pick them up; do not write local files
- [ ] Configure ephemeral-storage limits to prevent disk abuse
- [ ] Set the node disk usage alert threshold to 70%
- [ ] Regularly clean up unused images and completed Jobs/Pods
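
Before piping a bulk delete into `sh`, it can help to list exactly what would be removed. The sketch below filters on the STATUS column instead of grepping the whole line, so a Pod that merely has "Evicted" in its name is not matched. `list_evicted` is a hypothetical helper, and it assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...).

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: reads `kubectl get pods -A` output on stdin
# and prints namespace/name for rows whose STATUS column is exactly "Evicted".
list_evicted() {
  awk 'NR > 1 && $4 == "Evicted" { print $1 "/" $2 }'
}

# Sample input standing in for real kubectl output.
list_evicted <<'EOF'
NAMESPACE   NAME      READY   STATUS    RESTARTS   AGE
prod        api-1     0/1     Evicted   0          2d
prod        api-2     1/1     Running   0          2d
dev         job-9     0/1     Evicted   0          5h
EOF
```

Against a cluster this would be used as `kubectl get pods -A --field-selector=status.phase=Failed | list_evicted`, reviewing the list before running the delete.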

---

## 3. Network troubleshooting

### 3.1 Service unreachable

**Symptoms**

- `curl <service-name>:<port>` from inside a Pod times out or the connection is refused
- External traffic cannot reach the backend Pods

**Diagnostic commands**

```bash
# 1. Check that the Service exists and its ports are correct
kubectl get svc <service-name> -n <namespace> -o wide

# 2. Check that the Endpoints contain backend Pods
kubectl get endpoints <service-name> -n <namespace>

# 3. Check that the Pod labels match the Service selector
kubectl get pods -n <namespace> -l <selector-key>=<selector-value>

# 4. Test connectivity from another Pod
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash
# Inside the debug Pod
curl -v <service-name>.<namespace>.svc.cluster.local:<port>
nslookup <service-name>.<namespace>.svc.cluster.local

# 5. Check iptables/ipvs rules (on the node)
iptables -t nat -nL KUBE-SERVICES | grep <service-cluster-ip>
ipvsadm -Ln | grep <service-cluster-ip>
```

**Root cause analysis**

1. **Empty Endpoints**: the Pod labels do not match the Service selector
2. **Pod not ready**: a failing readinessProbe removes the Pod from the Endpoints
3. **Port mapping error**: Service port / targetPort / containerPort are inconsistent
4. **kube-proxy failure**: iptables/ipvs rules are out of sync
5. **Container port not listening**: the application binds to 127.0.0.1 instead of 0.0.0.0

**Solutions**

```bash
# Verify the port mapping chain: Service.port -> Service.targetPort -> Container.containerPort
kubectl get svc <svc> -o jsonpath='{.spec.ports}'
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].ports}'

# Verify the address the application listens on
kubectl exec <pod> -- ss -tlnp
# Make sure it listens on 0.0.0.0:<port>, not 127.0.0.1:<port>

# Restart kube-proxy to refresh the rules
kubectl rollout restart daemonset kube-proxy -n kube-system
```

**Prevention**

- [ ] Verify that the Endpoints are not empty right after defining any Service
- [ ] The readinessProbe must accurately reflect service availability
- [ ] Applications should consistently listen on `0.0.0.0`
- [ ] When using a headless Service, confirm the DNS resolution behavior
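
The loopback-binding check can be automated by scanning `ss` output. The sketch below is a hypothetical filter, `loopback_listeners`, that assumes the `ss -tln` layout in which the local address is the fourth column; the sample input is illustrative, not captured from a real Pod.

```bash
#!/usr/bin/env bash
# Hypothetical filter: reads `ss -tln` output on stdin and prints listening
# sockets bound only to loopback, which a Service cannot reach.
loopback_listeners() {
  awk '$1 == "LISTEN" && ($4 ~ /^127\./ || $4 ~ /^\[::1\]/) { print $4 }'
}

# Sample input standing in for `kubectl exec <pod> -- ss -tln`.
loopback_listeners <<'EOF'
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       128     127.0.0.1:8080      0.0.0.0:*
LISTEN  0       128     0.0.0.0:9090        0.0.0.0:*
EOF
```

Any address it prints belongs to a port that works with `curl localhost` inside the Pod yet refuses connections arriving through the Service.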

---

### 3.2 DNS resolution failures

**Symptoms**

- `nslookup` / `dig` inside a Pod cannot resolve Service names
- Application logs show `Name or service not known` / `no such host`

**Diagnostic commands**

```bash
# 1. Check that CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 2. Read the CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# 3. Test DNS from inside a Pod
kubectl exec <pod> -- nslookup kubernetes.default.svc.cluster.local
kubectl exec <pod> -- cat /etc/resolv.conf

# 4. Query the CoreDNS ClusterIP directly
kubectl exec <pod> -- nslookup <service-name> $(kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}')

# 5. Inspect the CoreDNS ConfigMap
kubectl get cm coredns -n kube-system -o yaml
```

**Root cause analysis**

1. **CoreDNS Pod failure**: OOM or CrashLoopBackOff
2. **Wrong resolv.conf**: the Pod's DNS configuration does not point at CoreDNS
3. **ndots too high**: the default ndots=5 triggers extra search-domain queries that time out
4. **Upstream DNS unreachable**: the external DNS that CoreDNS forwards to cannot be reached
5. **CoreDNS overload**: a high volume of DNS queries causes latency or drops

**Solutions**

```yaml
# Optimize DNS lookups (fewer wasted search-domain queries)
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # default is 5; 2 reduces search-domain expansion
      # Flag option without a value (glibc resolver): avoids conntrack
      # conflicts caused by parallel A/AAAA queries
      - name: single-request-reopen

# Scale out CoreDNS (shell command, run separately)
# kubectl scale deployment coredns -n kube-system --replicas=3

# Enable caching in CoreDNS (confirm in the Corefile)
# .:53 {
#     cache 30
#     ...
# }
```

**Prevention**

- [ ] Run at least 2 CoreDNS replicas and configure a PDB to keep them available
- [ ] Configure NodeLocal DNSCache for workloads with a high DNS query rate
- [ ] Monitor CoreDNS QPS and latency metrics
- [ ] Enable connection pooling / DNS caching at the application level
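
To see why `ndots` matters, it helps to list the names a resolver tries for one lookup. The sketch below approximates glibc-style search-list expansion; `dns_candidates` is a hypothetical helper, and the search list is a typical value for a Pod in the `default` namespace, not read from a real `/etc/resolv.conf`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print, in order, the names a glibc-style resolver
# may try for a lookup, given ndots and a space-separated search list.
dns_candidates() {
  local name=$1 ndots=$2 search=$3 dots domain
  # A trailing dot marks a fully qualified name: no search expansion.
  case "$name" in
    *.) echo "$name"; return ;;
  esac
  dots=${name//[^.]/}                  # keep only the dots to count them
  if [ "${#dots}" -ge "$ndots" ]; then # enough dots: try the name as-is first
    echo "$name."
  fi
  for domain in $search; do
    echo "$name.$domain."
  done
  if [ "${#dots}" -lt "$ndots" ]; then # too few dots: as-is comes last
    echo "$name."
  fi
}

search="default.svc.cluster.local svc.cluster.local cluster.local"
echo "ndots=5:"; dns_candidates api.example.com 5 "$search"
echo "ndots=2:"; dns_candidates api.example.com 2 "$search"
```

With `ndots=5` the external name is tried against three cluster search domains before the real query; with `ndots=2` the real query goes first and normally ends the lookup.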

---

### 3.3 Blocked by NetworkPolicy

**Symptoms**

- Communication between Pods suddenly breaks
- `curl` times out although DNS resolution works
- Newly deployed Pods cannot reach existing services

**Diagnostic commands**

```bash
# 1. List every NetworkPolicy in the namespace
kubectl get networkpolicy -n <namespace>

# 2. Show NetworkPolicy details
kubectl describe networkpolicy <name> -n <namespace>

# 3. Check whether the Pod's labels are selected by a NetworkPolicy
kubectl get pod <pod-name> --show-labels

# 4. Verify by simulation (using a debug Pod)
kubectl run debug --rm -it --image=nicolaka/netshoot -n <namespace> -- bash
# Inside the debug Pod
curl -v --connect-timeout 3 <target-service>:<port>

# 5. Check whether the CNI plugin supports NetworkPolicy
kubectl get pods -n kube-system | grep -E 'calico|cilium|weave'
```

**Root cause analysis**

1. **Default-deny policy**: the namespace has a default-deny policy but no allow rule was added for the new Pod
2. **Label mismatch**: the NetworkPolicy's podSelector or namespaceSelector is wrong
3. **Missing port**: only some ports are allowed and the port actually in use was left out
4. **CNI without support**: the CNI does not implement NetworkPolicy (for example Flannel in its default mode)
5. **Missing egress rule**: only ingress was configured and egress was forgotten

**Solutions**

```yaml
# Allow access from Pods in a specific namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              tier: frontend
          podSelector:
            matchLabels:
              app: web
      ports:
        - protocol: TCP
          port: 8080
---
# Allow DNS egress (often forgotten, which breaks all name-based traffic)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

**Prevention**

- [ ] Every default-deny policy must ship with a matching allow-dns-egress policy
- [ ] After any NetworkPolicy change, verify connectivity in a staging environment
- [ ] Visualize network policies with `kubectl-np-viewer`
- [ ] Integrate NetworkPolicy compliance checks into CI

---

### 3.4 Ingress 502 / 504

**Symptoms**

- External requests through the Ingress return 502 Bad Gateway or 504 Gateway Timeout
- Accessing the Service ClusterIP directly works

**Diagnostic commands**

```bash
# 1. Check the Ingress configuration
kubectl get ingress <name> -n <namespace> -o yaml

# 2. Check the Ingress Controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller --tail=100

# 3. Check the backend Service and Endpoints
kubectl get svc <backend-svc> -n <namespace>
kubectl get endpoints <backend-svc> -n <namespace>

# 4. Test connectivity from the Ingress Controller to the backend
kubectl exec -n ingress-nginx <controller-pod> -- curl -v http://<service-cluster-ip>:<port>/

# 5. Inspect the Ingress Controller's nginx.conf
kubectl exec -n ingress-nginx <controller-pod> -- cat /etc/nginx/nginx.conf | grep -A 20 "upstream"
```

**Root cause analysis**

**502 Bad Gateway**:
1. All backend Pods are unavailable (0 Endpoints)
2. The backend is mid rolling update: old Pods have terminated and new Pods are not ready yet
3. The Service port does not match the Ingress backend port
4. The backend returns a malformed HTTP response

**504 Gateway Timeout**:
1. Backend processing time exceeds the Ingress Controller's proxy-read-timeout (default 60s)
2. Network latency or packet loss causes the connection to time out
3. The backend application is deadlocked or its thread pool is exhausted

**Solutions**

```yaml
# Adjust timeouts (nginx-ingress annotations)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
---
# Configure graceful termination to avoid 502s during rolling updates
# Set this in the Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]   # wait for the Ingress to stop routing traffic
```

**Prevention**

- [ ] The backend readinessProbe must be accurate so traffic reaches only ready Pods
- [ ] Configure a PodDisruptionBudget (for node drains) together with the Deployment's `maxUnavailable` (for rollouts) so that not every replica goes down at once
- [ ] Set a reasonable proxy-read-timeout for long-running endpoints
- [ ] Enable access logs on the Ingress Controller for post-incident analysis
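
The `preStop` sleep counts against `terminationGracePeriodSeconds`, so the grace period has to cover both the sleep and the application's own shutdown. A minimal sketch of that check follows; `grace_period_ok` is a hypothetical helper and the 30-second shutdown time is an assumed figure.

```bash
#!/usr/bin/env bash
# Hypothetical check: the grace period must cover the preStop sleep plus
# the time the application needs to drain after it receives SIGTERM.
# Arguments: terminationGracePeriodSeconds, preStop sleep, app shutdown time.
grace_period_ok() {
  local grace=$1 pre_stop=$2 shutdown=$3
  [ "$grace" -ge $(( pre_stop + shutdown )) ]
}

# 60s grace period and 15s preStop sleep as in the example,
# with an assumed 30s application shutdown.
if grace_period_ok 60 15 30; then
  echo "ok: the Pod can drain before SIGKILL"
else
  echo "too short: requests may be cut off with exit code 137"
fi
```

When the check fails, either raise `terminationGracePeriodSeconds` or shorten the drain, otherwise the kubelet sends SIGKILL while requests are still in flight.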

---

## 4. Storage troubleshooting

### 4.1 PVC Pending

**Symptoms**

- The PVC stays in `Pending` for a long time
- The Pod is Pending because its PVC is not bound

**Diagnostic commands**

```bash
# 1. Show PVC status and events
kubectl describe pvc <pvc-name> -n <namespace>

# 2. Check available PVs
kubectl get pv --sort-by='.spec.capacity.storage'

# 3. Check the StorageClass
kubectl get storageclass
kubectl describe storageclass <name>

# 4. Check CSI driver status
kubectl get csidrivers
kubectl get pods -n kube-system | grep csi

# 5. Read the provisioner logs
kubectl logs -n kube-system <csi-provisioner-pod> --tail=100
```

**Root cause analysis**

1. **StorageClass does not exist**: the StorageClass named in the PVC was never created
2. **Dynamic provisioning failed**: the CSI driver is failing or the cloud provider API quota is exhausted
3. **Insufficient capacity**: the storage pool does not have enough space
4. **accessMode mismatch**: the PVC requests ReadWriteMany but the StorageClass does not support it
5. **Topology constraint**: the PV is in availability zone A while the Pod was scheduled to zone B
6. **Static PV selector mismatch**: no PV matches the PVC's selector

**Solutions**

```bash
# Check which StorageClass is the default
kubectl get sc -o jsonpath='{range .items[*]}{.metadata.name}:{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}'

# Create a PV manually (static provisioning)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard
  hostPath:
    path: /data/manual-pv
EOF

# Cross-AZ issue: configure volumeBindingMode
# The StorageClass's volumeBindingMode should be WaitForFirstConsumer
```

**Prevention**

- [ ] Use the `WaitForFirstConsumer` binding mode to avoid topology conflicts
- [ ] Set the storage capacity alert threshold to 75%
- [ ] Deploy the CSI driver in HA with at least 2 replicas
- [ ] Regularly clean up PVs in the Released state

---

### 4.2 Mount failures

**Symptoms**

- The Pod is stuck in `ContainerCreating`
- Events show `FailedMount` / `Unable to attach or mount volumes`
- Timeout message `timeout expired waiting for volumes to attach/mount`

**Diagnostic commands**

```bash
# 1. Show mount events
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "FailedMount\|Unable to"

# 2. Check VolumeAttachment status
kubectl get volumeattachment

# 3. Check the CSI state of the node
kubectl get csinodes <node-name> -o yaml

# 4. Inspect devices on the node
lsblk
mount | grep <pv-name>

# 5. Detailed CSI driver logs
kubectl logs -n kube-system <csi-node-pod> -c <driver-container> --tail=100
```

**Root cause analysis**

1. **Multi-node attach conflict**: a ReadWriteOnce PV is still held by another node
2. **Device busy**: the previous Pod did not unmount cleanly, leaving the device busy
3. **Incompatible CSI driver version**: the CSI driver on the node was not updated
4. **Cloud disk across availability zones**: EBS/Disk volumes cannot be attached across AZs
5. **subPath error**: the path given in subPath does not exist

**Solutions**

```bash
# Force-detach a stuck VolumeAttachment
# Caution: first confirm the volume is no longer mounted on the old node
kubectl delete volumeattachment <name> --force --grace-period=0

# Move the Pod to the availability zone where the PV lives
# Pin it with nodeAffinity
# or use storage that works across AZs (such as EFS / NFS)

# When the subPath does not exist, use subPathExpr + an initContainer
# initContainers:
#   - name: init-data-dir
#     image: busybox
#     command: ['mkdir', '-p', '/data/subdir']
#     volumeMounts:
#       - name: data
#         mountPath: /data
```

**Prevention**

- [ ] Use ReadWriteOnce volumes only with a Deployment at replicas=1 or with a StatefulSet
- [ ] Use ReadWriteMany (NFS / CephFS / EFS) when several Pods need shared storage
- [ ] Set terminationGracePeriodSeconds on Pods so volumes unmount gracefully
- [ ] Back up cloud disks regularly with snapshots

---

### 4.3 Permission problems

**Symptoms**

- Container logs show `Permission denied`
- File writes fail although the mount itself succeeded
- The application cannot access storage when running as a non-root user

**Diagnostic commands**

```bash
# 1. Show file permissions inside the container
kubectl exec <pod> -- ls -la /data/

# 2. Show the securityContext configuration
kubectl get pod <pod> -o jsonpath='{.spec.securityContext}'
kubectl get pod <pod> -o jsonpath='{.spec.containers[0].securityContext}'

# 3. Show the UID/GID the container process runs as
kubectl exec <pod> -- id

# 4. Show filesystem information for the mount point
kubectl exec <pod> -- stat /data/
kubectl exec <pod> -- df -h /data/
```

**Root cause analysis**

1. **UID mismatch**: the container process runs as UID 1000 but the files are owned by root
2. **fsGroup not set**: files on the PV do not belong to the Pod's group
3. **readOnly mount**: the volumeMount sets `readOnly: true`
4. **NFS root_squash**: the NFS server maps root to nobody
5. **SELinux / AppArmor**: a security module blocks file access

**Solutions**

```yaml
# Use securityContext to align UID/GID
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000                           # group ownership of mounted volumes
    fsGroupChangePolicy: "OnRootMismatch"   # change ownership only on mismatch, which speeds up startup
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true        # enforce a read-only root filesystem
      volumeMounts:
        - name: data
          mountPath: /data
        - name: tmp
          mountPath: /tmp                   # writable temporary directory

# Use an initContainer to fix permissions
initContainers:
  - name: fix-permissions
    image: busybox:1.36
    command: ['sh', '-c', 'chown -R 1000:1000 /data']
    volumeMounts:
      - name: data
        mountPath: /data
    securityContext:
      runAsUser: 0   # the initContainer runs as root to fix permissions
```

**Prevention**

- [ ] Configure `securityContext.fsGroup` on every Pod
- [ ] Use the `USER` instruction in the Dockerfile to set an explicit non-root user
- [ ] On the NFS server, configure `no_root_squash` (only for trusted clients, since it lets client root act as root on the export) or use a deterministic UID mapping
- [ ] Mount sensitive volumes readOnly

---

## 5. Node troubleshooting

### 5.1 NotReady

**Symptoms**

- `kubectl get nodes` shows a node as `NotReady`
- Pods on that node change to `Unknown` or are evicted

**Diagnostic commands**

```bash
# 1. Show node conditions
kubectl describe node <node-name> | grep -A 20 "Conditions:"

# 2. Show node events
kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'

# 3. Check kubelet status (SSH to the node)
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago" --no-pager | tail -50

# 4. Check the container runtime
systemctl status containerd
crictl ps

# 5. Check network connectivity (node to API Server)
curl -k https://<api-server>:6443/healthz

# 6. Check certificate expiry
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
```

**Root cause analysis**

1. **kubelet process failure**: kubelet crashed or ran out of memory
2. **Container runtime failure**: containerd / docker does not respond
3. **Network partition**: the node lost contact with the API Server
4. **Expired certificate**: the kubelet client certificate or the CA has expired
5. **Disk full**: a full root partition / container runtime partition stops kubelet from working
6. **Kernel panic**: the node's kernel crashed

**Solutions**

```bash
# Restart kubelet
systemctl restart kubelet

# Restart the container runtime
systemctl restart containerd

# Free disk space
crictl rmi --prune              # remove unused images
journalctl --vacuum-size=500M   # trim journal logs

# Expired certificates -> renew
# Note: on kubeadm clusters this renews the control-plane certificates; the kubelet
# client certificate normally rotates on its own and is not covered by this command.
kubeadm certs renew all
systemctl restart kubelet

# Node cannot be recovered -> drain safely, then remove
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s
kubectl delete node <node-name>
```

**Prevention**

- [ ] Configure kubelet monitoring and automatic restart (systemd watchdog)
- [ ] Alert 30 days before certificates expire
- [ ] Set the node disk usage alert threshold to 70%
- [ ] Give critical workloads a PDB and multiple replicas spread across nodes

---

### 5.2 Disk pressure (DiskPressure)

**Symptoms**

- Node Conditions show `DiskPressure=True`
- New Pods cannot be scheduled onto the node
- Existing Pods may be evicted

**Diagnostic commands**

```bash
# 1. Confirm the disk pressure condition
kubectl describe node <node-name> | grep DiskPressure

# 2. Check disk usage on the node (SSH to the node)
df -h
du -sh /var/lib/containerd/*
du -sh /var/log/*
du -sh /var/lib/kubelet/*

# 3. Show space used by container images
crictl images | sort -k 4 -h

# 4. Show container log sizes (-L follows the symlinks in /var/log/containers)
find /var/log/containers -name "*.log" -exec ls -lhL {} \; | sort -k5 -h | tail -20
```

**Root cause analysis**

1. **Unbounded container logs**: the application writes heavily to stdout and the log files balloon
2. **Image buildup**: old images that were never cleaned up fill the disk
3. **emptyDir abuse**: a Pod writes large amounts of temporary data to its emptyDir
4. **System log buildup**: journal / syslog has no rotation configured

**Solutions**

```bash
# Immediate cleanup
crictl rmi --prune
journalctl --vacuum-size=200M
find /var/log -name "*.gz" -mtime +7 -delete

# Cap the length of a single container log line in containerd
# (this limits line size, not file size; file size and rotation are set in kubelet below)
# /etc/containerd/config.toml
# [plugins."io.containerd.grpc.v1.cri"]
#   max_container_log_line_size = 16384

# Configure kubelet log rotation
# containerLogMaxSize: "50Mi"
# containerLogMaxFiles: 3
```

**Prevention**

- [ ] Configure containerLogMaxSize and containerLogMaxFiles in kubelet
- [ ] Configure alerts for imagefs and nodefs
- [ ] Run an image garbage collection CronJob regularly
- [ ] Keep container data on a dedicated disk partition
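
The kubelet rotation settings bound log usage per container, which allows a rough worst-case estimate per node. The sketch below uses a hypothetical helper, `max_log_usage_mib`; the container count of 110 is an assumption (one container per Pod at the default maxPods), and real usage can briefly overshoot because rotation runs periodically.

```bash
#!/usr/bin/env bash
# Hypothetical estimate of worst-case container log disk usage on a node.
# Arguments: containerLogMaxSize in MiB, containerLogMaxFiles, containers per node.
max_log_usage_mib() {
  local max_size_mib=$1 max_files=$2 containers=$3
  echo $(( max_size_mib * max_files * containers ))
}

# 50Mi x 3 files as in the kubelet settings, assuming 110 containers.
echo "worst-case log usage: $(max_log_usage_mib 50 3 110) MiB"
```

Comparing this figure with the size of the partition holding `/var/log` shows whether the rotation settings alone can keep the node out of DiskPressure.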

---

### 5.3 Memory pressure (MemoryPressure)

**Symptoms**

- Node Conditions show `MemoryPressure=True`
- Low-priority Pods are evicted
- The system OOM killer starts killing processes

**Diagnostic commands**

```bash
# 1. Show the node's memory condition
kubectl describe node <node-name> | grep -A 3 "MemoryPressure"
kubectl top node <node-name>

# 2. Node memory details (SSH to the node)
free -h
cat /proc/meminfo | head -20

# 3. Show the top memory-consuming processes
ps aux --sort=-%mem | head -10

# 4. Show Pod memory usage
kubectl top pods --sort-by=memory -A | head -20

# 5. Check for OOM events
dmesg | grep -i "oom\|out of memory" | tail -20
```

**Root cause analysis**

1. **Overcommit**: the sum of limits (or of actual usage) far exceeds the node's memory because requests are set too low
2. **Memory leak**: one Pod's memory keeps growing
3. **Insufficient system reservation**: system-reserved / kube-reserved are not configured
4. **Cache not released**: the kernel cache is very large; it is reclaimable but reclaim has not been triggered

**Solutions**

```bash
# Drop caches manually (temporary measure, run as root)
sync && echo 3 > /proc/sys/vm/drop_caches

# Configure kubelet system reservations
# /var/lib/kubelet/config.yaml
# systemReserved:
#   cpu: "500m"
#   memory: "1Gi"
# kubeReserved:
#   cpu: "500m"
#   memory: "512Mi"
# evictionHard:
#   memory.available: "200Mi"
# evictionSoft:
#   memory.available: "500Mi"
# evictionSoftGracePeriod:
#   memory.available: "1m"
```

**Prevention**

- [ ] Keep the sum of memory requests on a node at or below 85% of allocatable
- [ ] Configure system-reserved and kube-reserved
- [ ] Alert when memory usage exceeds 80%
- [ ] Calibrate requests with VPA or regular load tests
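
The reservations above determine how much memory Pods can actually be given: allocatable is capacity minus system-reserved, kube-reserved, and the hard eviction threshold. A minimal sketch with hypothetical helpers follows; the 16 GiB node size is an assumed example.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; all values are in MiB.

# allocatable = capacity - system-reserved - kube-reserved - hard eviction threshold
node_allocatable_mib() {
  local capacity=$1 system_reserved=$2 kube_reserved=$3 eviction_hard=$4
  echo $(( capacity - system_reserved - kube_reserved - eviction_hard ))
}

# Request budget at 85% of allocatable, as recommended in the checklist.
request_budget_mib() {
  local allocatable=$1
  echo $(( allocatable * 85 / 100 ))
}

# Assumed 16 GiB node with the reservations from the kubelet example.
alloc=$(node_allocatable_mib 16384 1024 512 200)
echo "allocatable: ${alloc} MiB"
echo "request budget (85%): $(request_budget_mib "$alloc") MiB"
```

The result should match the memory value under `Allocatable` in `kubectl describe node`, which makes it a quick way to confirm that the kubelet picked up the reservation settings.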
|
|
1121
|
+
|
|
1122
|
+
---
|
|
1123
|
+
|
|
1124
|
+
### 5.4 PID Pressure (PIDPressure)

**Symptoms**

- Node Conditions show `PIDPressure=True`
- New processes / containers cannot be created
- `cannot allocate memory` errors appear (the real cause is PID exhaustion)

**Diagnostic commands**

```bash
# 1. Check the node's PID condition
kubectl describe node <node-name> | grep PIDPressure

# 2. Current PID usage (threads consume PIDs too, so count both)
cat /proc/sys/kernel/pid_max
ls /proc | grep -E '^[0-9]+$' | wc -l   # processes
ps -eLf --no-headers | wc -l            # threads

# 3. Main process PID of each container (inspect further under /proc/<pid>)
for cid in $(crictl ps -q); do
  name=$(crictl inspect "$cid" | jq -r '.status.labels."io.kubernetes.pod.name"')
  pid=$(crictl inspect "$cid" | jq '.info.pid')
  echo "$name: PID=$pid"
done

# 4. Inspect the process tree to find a fork bomb
ps auxf | head -100
```

**Root cause analysis**

1. **Fork bomb**: an application bug spawns child processes without bound
2. **pidsLimit not set**: a single Pod can exhaust every PID on the node
3. **pid_max too low**: the kernel default of 32768 is not enough on high-density nodes
4. **Zombie processes**: the PID 1 process does not reap its children

**Solutions**

```bash
# Raise pid_max (persist it under /etc/sysctl.d/ to survive reboots)
sysctl -w kernel.pid_max=65536

# Configure kubelet PID limits
# /var/lib/kubelet/config.yaml
# podPidsLimit: 1024   # at most 1024 PIDs per Pod
# evictionHard:
#   pid.available: "10%"

# Use tini as the container's init process to reap zombies
# Dockerfile
# RUN apk add --no-cache tini
# ENTRYPOINT ["tini", "--"]
# CMD ["your-app"]
```

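The `pid.available: "10%"` eviction threshold is easier to reason about with the node's current numbers next to it. A minimal sketch, assuming the used count comes from `ps -eLf --no-headers | wc -l` and the maximum from `/proc/sys/kernel/pid_max`; both function names are hypothetical.

```bash
# Hypothetical helper: percentage of the PID space currently in use (rounded down).
pid_usage_pct() {
  local used=$1 max=$2
  if [ "$max" -le 0 ]; then
    echo "pid_max must be positive" >&2
    return 1
  fi
  echo $(( used * 100 / max ))
}

# Hypothetical helper: percentage of the PID space still available,
# the figure to compare against the kubelet's pid.available threshold.
pid_available_pct() {
  local used_pct
  used_pct=$(pid_usage_pct "$1" "$2") || return 1
  echo $(( 100 - used_pct ))
}

# On a node (not run here):
#   pid_available_pct "$(ps -eLf --no-headers | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
```

A result at or below 10 means the node is at the eviction threshold configured above.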
**Prevention**

- [ ] Run every container with tini / dumb-init as PID 1
- [ ] Set podPidsLimit in the kubelet configuration
- [ ] Monitor node PID utilization
- [ ] Review application code for fork-bomb risk

---

## 6. Resource Troubleshooting

### 6.1 CPU Throttling

**Symptoms**

- Application latency rises, but there is no OOM
- `container_cpu_cfs_throttled_seconds_total` keeps growing in Prometheus
- Application logs show nothing unusual, yet performance degrades

**Diagnostic commands**

```bash
# 1. Compare CPU usage with limits
kubectl top pod <pod-name> --containers

# 2. Read cgroup CPU statistics (on the node; cgroup v1 path shown, the exact path varies by cgroup driver and QoS class)
cat /sys/fs/cgroup/cpu/kubepods/pod<uid>/<cid>/cpu.stat
# nr_throttled: number of throttled periods
# throttled_time: total time throttled, in nanoseconds (throttled_usec on cgroup v2)

# 3. Prometheus query
# rate(container_cpu_cfs_throttled_periods_total[5m])
#   / rate(container_cpu_cfs_periods_total[5m]) > 0.25

# 4. Check the CPU requests/limits configuration
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}'
```

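The `cpu.stat` counters from step 2 can be turned into the same ratio that the Prometheus query in step 3 computes. A minimal sketch, assuming the file contains `nr_periods` and `nr_throttled` lines (both cgroup v1 and v2 expose them); `throttle_pct` is a hypothetical helper name. Unlike `rate(...)`, these counters are cumulative since the cgroup was created, so the result is a lifetime average, not a 5-minute window.

```bash
# Hypothetical helper: reads cpu.stat content on stdin and prints the
# percentage of CFS periods in which the cgroup was throttled.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", throttled * 100 / periods
      else print "0.0"
    }'
}

# On a node (not run here):
#   throttle_pct < /sys/fs/cgroup/cpu/kubepods/pod<uid>/<cid>/cpu.stat
```

A value above 25.0 corresponds to the `> 0.25` threshold used in the query.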
**Root cause analysis**

1. **CPU limits too low**: peak CPU demand exceeds the limit
2. **Bursty computation**: short CPU bursts from GC, JIT compilation, and similar work get throttled
3. **limits/requests ratio too low**: not enough burstable headroom
4. **Multithreaded applications**: more threads than the CPU limit covers, so they compete for the same time slices

**Solutions**

```yaml
# Option 1: raise limits (suggested: limits = 2x to 5x requests)
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2000m"  # allows a 4x burst

# Option 2: remove CPU limits (controversial, but reportedly recommended by engineers at Google and Zalando)
# Set requests only and rely on them for fair scheduling
resources:
  requests:
    cpu: "500m"
  # no limits

# Option 3: make Java applications container-aware
# (JAVA_OPTS only takes effect if the start script passes it to the JVM)
env:
  - name: JAVA_OPTS
    value: "-XX:+UseContainerSupport -XX:ActiveProcessorCount=2"
```

**Prevention**

- [ ] Monitor the CPU throttling ratio and alert above 25%
- [ ] Set requests from a load-tested CPU baseline, and set limits to 2-5x requests
- [ ] Consider omitting CPU limits for latency-sensitive services
- [ ] Avoid CPU limits equal to requests (as in Guaranteed QoS, this leaves no burst headroom)

---

### 6.2 Memory Leaks

**Symptoms**

- Pod memory usage grows steadily and roughly linearly
- OOMKilled occurs periodically
- Memory is normal right after a restart, then climbs again

**Diagnostic commands**

```bash
# 1. Watch the memory growth trend
kubectl top pod <pod-name> --containers
# Run several times at 1-minute intervals and compare

# 2. Prometheus query (trend over the past 24 hours)
# container_memory_working_set_bytes{pod="<pod-name>"}

# 3. Profile inside the container
# Go application (requires a pprof endpoint listening on port 6060)
kubectl port-forward <pod> 6060:6060
go tool pprof http://localhost:6060/debug/pprof/heap

# Java application
kubectl exec <pod> -- jcmd 1 GC.heap_dump /tmp/dump.hprof
kubectl cp <pod>:/tmp/dump.hprof ./dump.hprof

# Python application
# tracemalloc must run inside the application process; a separate `python -c`
# process cannot see the app's allocations. Enable it at startup instead
# (this triggers a rolling restart):
kubectl set env deployment/<name> PYTHONTRACEMALLOC=10

# Node.js application
kubectl exec <pod> -- kill -USR2 1  # only if a heapdump signal handler is enabled
```

**Root cause analysis**

1. **Unbounded caches**: in-memory caches have no maxSize or TTL
2. **Connection pool leaks**: database/HTTP connections are not released properly
3. **Event listener leaks**: listeners are registered but never removed
4. **Global variable accumulation**: global lists/maps are appended to indefinitely
5. **Goroutine leaks** (Go applications): goroutines never exit

**Solutions**

```bash
# Temporary measure: schedule periodic restarts
# Use a CronJob or kubectl rollout restart

# Diagnose Go goroutine leaks (quote the URL so the shell does not expand "?")
kubectl exec <pod> -- curl -s 'localhost:6060/debug/pprof/goroutine?debug=2'

# Java heap analysis
# Download the dump, then analyze it with Eclipse MAT / VisualVM

# Memory alerts plus automatic restarts
# Implement with VPA or a custom controller
```

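When Prometheus is not at hand, the growth trend can be estimated from a few manual samples. A rough sketch, assuming one memory reading per line (in bytes, oldest first), for instance values copied from repeated `kubectl top pod` runs; `growth_pct` is a hypothetical helper name.

```bash
# Hypothetical helper: reads memory samples on stdin (one number per line,
# oldest first) and prints the percentage growth from the first to the last.
growth_pct() {
  awk '
    NR == 1 { first = $1 }
            { last = $1 }
    END {
      if (NR < 2 || first <= 0) { print "n/a"; exit }
      printf "%.1f\n", (last - first) * 100 / first
    }'
}
```

Growth that stays positive across restarts-free sampling windows, rather than one large number, is the pattern that matches the leak symptoms above.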
**Prevention**

- [ ] Every in-memory cache must set maxSize and a TTL
- [ ] Run memory profiling regularly and include it in CI
- [ ] Configure a Prometheus warning when memory grows more than 20% in 24 hours
- [ ] Enable the pprof endpoint in Go applications and JMX in Java applications

---

### 6.3 HPA Not Taking Effect

**Symptoms**

- Load increases but the Pod count does not
- `kubectl get hpa` shows `<unknown>` under `TARGETS`, or the metrics do not update
- Manual scaling works but autoscaling does not react

**Diagnostic commands**

```bash
# 1. Check HPA status
kubectl get hpa <name> -n <namespace> -o wide
kubectl describe hpa <name> -n <namespace>

# 2. Check that metrics-server is healthy
kubectl top pods  # a failure here means metrics-server is unhealthy
kubectl get pods -n kube-system | grep metrics-server
kubectl logs -n kube-system -l k8s-app=metrics-server --tail=50

# 3. Check that the Pod sets resources.requests
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources.requests}'

# 4. Check custom metrics (if custom metrics are used)
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/<ns>/pods/*/http_requests_per_second" | jq .

# 5. View HPA events
kubectl get events --field-selector involvedObject.name=<hpa-name> --sort-by='.lastTimestamp'
```

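**Root cause analysis**

1. **metrics-server is not deployed or is unhealthy**
2. **Pods do not set CPU/memory requests** (percentage-based HPA targets need requests as the baseline)
3. **Metric collection latency** (the HPA controller re-evaluates every 15s by default, on top of the metrics pipeline's own scrape interval)
4. **Stabilization windows**: scale-down waits 5 minutes by default; in `autoscaling/v2`, scale-up has no delay by default unless `behavior.scaleUp` sets one
5. **Custom metrics adapter is unhealthy**
6. **minReplicas = maxReplicas**: a misconfiguration that leaves no room to scale
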
**Solutions**

```yaml
# A correct HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # scale-up stabilization window
      policies:
        - type: Percent
          value: 100  # at most double per period
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # scale-down stabilization window: 5 minutes
      policies:
        - type: Percent
          value: 10  # remove at most 10% per period
          periodSeconds: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # target CPU utilization
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

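To sanity-check what the HPA should be doing, it helps to apply the controller's documented formula by hand: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below does that with integer arithmetic. `hpa_desired_replicas` is a hypothetical helper name, and it deliberately ignores the controller's tolerance band and the `behavior` policies, so treat the output as an estimate.

```bash
# Hypothetical helper: estimate the replica count the HPA formula asks for.
# Usage: hpa_desired_replicas <current_replicas> <current_metric> <target_metric> [min] [max]
# Metric values are whole numbers, e.g. utilization percentages.
hpa_desired_replicas() {
  local current=$1 metric=$2 target=$3 min=${4:-1} max=${5:-0}
  if [ "$target" -le 0 ]; then
    echo "target must be positive" >&2
    return 1
  fi
  # Integer ceiling of current * metric / target
  local desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then
    desired=$min
  fi
  # A max of 0 means "no upper bound given"
  if [ "$max" -gt 0 ] && [ "$desired" -gt "$max" ]; then
    desired=$max
  fi
  echo "$desired"
}
```

With the configuration above (target 70%, 2 to 20 replicas), 4 replicas running at 90% CPU give `hpa_desired_replicas 4 90 70 2 20`, which prints 6.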
**Prevention**

- [ ] Every Pod managed by an HPA must set CPU requests (and memory requests if memory is used as a metric)
- [ ] Deploy metrics-server in HA mode (2 replicas + PDB)
- [ ] Set the HPA target utilization to 60-80% to leave burst headroom
- [ ] Alert on HPA events to catch metric collection failures early

---

## 7. Deployment Troubleshooting

### 7.1 Rolling Update Stuck

**Symptoms**

- `kubectl rollout status` waits indefinitely
- New-version Pods never become ready, and old-version Pods are kept
- The Deployment's `READY` column is incomplete (for example, 2/3)

**Diagnostic commands**

```bash
# 1. Check rollout status
kubectl rollout status deployment/<name> -n <namespace> --timeout=10s

# 2. Check ReplicaSet status
kubectl get rs -n <namespace> -l app=<name>

# 3. Check old and new Pod status
kubectl get pods -n <namespace> -l app=<name> -o wide

# 4. View Deployment events
kubectl describe deployment <name> -n <namespace>

# 5. Investigate the new Pod
kubectl describe pod <new-pod> -n <namespace>
kubectl logs <new-pod> -n <namespace> --tail=100
```

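**Root cause analysis**

1. **New-version Pods in CrashLoopBackOff**: code bug or configuration error
2. **readinessProbe failing**: the health check does not pass
3. **Insufficient resources**: new Pods cannot be scheduled
4. **Image pull failure**: the new image does not exist
5. **maxUnavailable=0 with maxSurge=0**: a misconfiguration that leaves no room to roll (API validation normally rejects this combination)
6. **Blocked by a PDB**: a PodDisruptionBudget does not allow old Pods to be evicted (this applies to evictions such as node drains; the Deployment controller itself does not consult PDBs)
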
**Solutions**

```bash
# Roll back to the previous revision
kubectl rollout undo deployment/<name> -n <namespace>

# Roll back to a specific revision
kubectl rollout history deployment/<name> -n <namespace>
kubectl rollout undo deployment/<name> --to-revision=<N> -n <namespace>

# Pause the rollout to investigate
kubectl rollout pause deployment/<name> -n <namespace>
# Resume once the investigation is done
kubectl rollout resume deployment/<name> -n <namespace>
```

```yaml
# A reasonable rolling update strategy
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%  # allow up to 25% above the desired count
      maxUnavailable: 25%  # allow up to 25% unavailable
  minReadySeconds: 10  # wait 10 seconds after a Pod is ready before continuing
  progressDeadlineSeconds: 600  # mark the rollout as failed after 10 minutes without progress
```

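Percentage values for `maxSurge` and `maxUnavailable` are converted to absolute Pod counts, with `maxSurge` rounded up and `maxUnavailable` rounded down. The sketch below shows that arithmetic; `rolling_bounds` is a hypothetical helper name and takes the percentages as plain integers.

```bash
# Hypothetical helper: Pod-count bounds during a rolling update.
# Usage: rolling_bounds <replicas> <max_surge_pct> <max_unavailable_pct>
rolling_bounds() {
  local replicas=$1 surge_pct=$2 unavailable_pct=$3
  # maxSurge percentages round up
  local surge=$(( (replicas * surge_pct + 99) / 100 ))
  # maxUnavailable percentages round down
  local unavailable=$(( replicas * unavailable_pct / 100 ))
  echo "max_total=$(( replicas + surge )) min_available=$(( replicas - unavailable ))"
}
```

For a small Deployment the rounding matters: `rolling_bounds 3 25 25` yields `max_total=4 min_available=3`, so the rollout can only proceed by surging one new Pod and waiting for it to become ready.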
**Prevention**

- [ ] Set progressDeadlineSeconds (default 600s)
- [ ] Validate new versions in staging before release
- [ ] Use Argo Rollouts for canary releases
- [ ] Configure PDBs so that minAvailable is less than replicas

---

### 7.2 Rollback Procedure

**Full rollback workflow**

```bash
# Step 1: Confirm the current revision and history
kubectl rollout history deployment/<name> -n <namespace>

# Step 2: Inspect a specific revision
kubectl rollout history deployment/<name> --revision=<N> -n <namespace>

# Step 3: Roll back (run one of the two)
kubectl rollout undo deployment/<name> -n <namespace>  # to the previous revision
kubectl rollout undo deployment/<name> --to-revision=<N> -n <namespace>  # to a specific revision

# Step 4: Verify the rollback
kubectl rollout status deployment/<name> -n <namespace>
kubectl get pods -n <namespace> -l app=<name>

# Step 5: Verify application availability
kubectl exec <test-pod> -- curl -s http://<service>:<port>/healthz
```

**StatefulSet rollback notes**

```bash
# kubectl rollout undo also accepts StatefulSets, but a Pod stuck in a broken
# state is not replaced automatically; delete it manually after reverting
kubectl rollout undo statefulset/<name>
kubectl get statefulset <name> -o jsonpath='{.spec.updateStrategy}'

# Partitioned rollout (controlled by partition)
kubectl patch statefulset <name> -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":3}}}}'
# Only Pods with an ordinal >= partition are updated, starting from the highest ordinal

# Restore the full rollout
kubectl patch statefulset <name> -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
```

**Prevention**

- [ ] Keep a sufficient revisionHistoryLimit (default 10)
- [ ] Record the current revision number before each deployment
- [ ] Verify core functionality immediately after a rollback
- [ ] Run an RCA afterwards and fix the root cause

---

### 7.3 ConfigMap / Secret Updates Not Taking Effect

**Symptoms**

- A ConfigMap or Secret was changed but Pod behavior did not change
- Environment variables still hold the old values
- Configuration file contents are not updated

**Diagnostic commands**

```bash
# 1. Confirm the ConfigMap was updated
kubectl get cm <name> -n <namespace> -o yaml

# 2. Check the actual values inside the Pod
kubectl exec <pod> -- env | grep <key>  # environment variables
kubectl exec <pod> -- cat /config/<file>  # mounted files

# 3. Check the Pod creation time (environment variables are injected only at creation)
kubectl get pod <pod> -o jsonpath='{.metadata.creationTimestamp}'

# 4. Check how the volume is mounted
kubectl get pod <pod> -o jsonpath='{.spec.volumes}' | jq .
```

**Root cause analysis**

1. **Environment variables are not hot-reloaded**: env / envFrom values are injected only when the Pod is created
2. **Volume mounts update with a delay**: the kubelet sync period defaults to 60s, so propagation can take 1-2 minutes
3. **subPath mounts never update**: a known Kubernetes limitation
4. **The application does not watch for file changes**: the file is updated but the application never reloads it
5. **Immutable ConfigMap**: `immutable: true` is set

**Solutions**

```bash
# Option 1: restart the Pods so the new environment variables take effect
kubectl rollout restart deployment/<name> -n <namespace>

# Option 2: use a hash annotation to trigger rolling updates automatically
# Add to the Deployment's Pod template:
# metadata:
#   annotations:
#     checksum/config: {{ sha256sum of configmap }}
# CI/CD computes the hash, so a ConfigMap change triggers a rolling update

# Option 3: restart automatically with Reloader
# Install stakater/Reloader
# Then annotate the Deployment:
# metadata:
#   annotations:
#     reloader.stakater.com/auto: "true"
```

```yaml
# Best practice: immutable ConfigMap + versioned name
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v2  # versioned name
immutable: true  # prevents accidental changes
data:
  app.yaml: |
    key: new-value

# The Deployment references the new version
# volumes:
#   - name: config
#     configMap:
#       name: app-config-v2  # referencing the new ConfigMap triggers a rolling update
```

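The checksum annotation from Option 2 can be produced outside Helm as well. A minimal sketch, assuming GNU coreutils `sha256sum` is available in the CI image and that the configuration lives in local files; `config_checksum` is a hypothetical helper name.

```bash
# Hypothetical helper: print one SHA-256 hex digest over the concatenated
# contents of the given configuration files.
config_checksum() {
  cat "$@" | sha256sum | cut -d' ' -f1
}

# In CI (not run here), the digest can be written into the Pod template so that
# any content change produces a new template and therefore a rolling update:
#   kubectl patch deployment <name> -n <namespace> -p \
#     "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"checksum/config\":\"$(config_checksum app.yaml)\"}}}}}"
```

Because the digest depends only on file contents, re-running the pipeline with unchanged configuration leaves the annotation, and the running Pods, untouched.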
**Prevention**

- [ ] Always pair environment-variable configuration with rollout restart
- [ ] Use immutable ConfigMaps with versioned names in production
- [ ] Install Reloader or have CI inject a checksum annotation
- [ ] Avoid subPath mounts for configuration that needs hot reloading

---

## 8. Logs and Monitoring

### 8.1 Advanced kubectl logs Usage

```bash
# Basics: view Pod logs
kubectl logs <pod> -n <namespace>

# View logs from the previous container instance (essential after a crash)
kubectl logs <pod> --previous

# Select a container in a multi-container Pod
kubectl logs <pod> -c <container>

# Follow in real time
kubectl logs <pod> -f --tail=100

# Filter by time
kubectl logs <pod> --since=1h
kubectl logs <pod> --since-time='2026-03-28T10:00:00Z'

# Aggregate logs from all replicas
# (with a selector, only the last 10 lines per Pod are shown unless --tail is set)
kubectl logs -l app=<name> --all-containers --max-log-requests=10

# Write to a file
kubectl logs <pod> --all-containers > /tmp/pod-logs.txt
```

### 8.2 Resource Monitoring with kubectl top

```bash
# Node resource usage
kubectl top nodes
kubectl top nodes --sort-by=cpu
kubectl top nodes --sort-by=memory

# Pod resource usage
kubectl top pods -n <namespace> --sort-by=memory
kubectl top pods -n <namespace> --containers  # broken down by container
kubectl top pods -A --sort-by=cpu | head -21  # cluster-wide top 20 by CPU (plus the header line)
```

### 8.3 Key Information in kubectl describe

```bash
# Pod: focus on Events, State, Conditions
kubectl describe pod <pod> -n <namespace>

# Node: focus on Conditions, Allocated resources, Events
kubectl describe node <node>

# Service: focus on Endpoints
kubectl describe svc <service> -n <namespace>

# PVC: focus on Events (provisioning status)
kubectl describe pvc <pvc> -n <namespace>
```

### 8.4 Troubleshooting with kubectl events

```bash
# All events in a namespace (sorted by time)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Warning events across the cluster
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'

# Events for a specific resource
kubectl get events --field-selector involvedObject.name=<pod-name>

# Watch events in real time
kubectl get events -w -n <namespace>

# Filter by reason
kubectl get events --field-selector reason=FailedScheduling -A
kubectl get events --field-selector reason=OOMKilling -A
kubectl get events --field-selector reason=BackOff -A
```

### 8.5 Prometheus Alert Rule Reference

```yaml
# Key alert rules (in a PrometheusRule CRD, these groups go under spec.groups)
groups:
  - name: kubernetes-pod-alerts
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

      - alert: PodOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} was OOM killed"

      - alert: HighCPUThrottling
        # Ratio of throttled periods to total periods (both are period counters)
        expr: >
          rate(container_cpu_cfs_throttled_periods_total[5m])
          / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} high CPU throttling"

      - alert: PVCAlmostFull
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} is 85% full"

      - alert: NodeMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} under memory pressure"

      - alert: NodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} under disk pressure"

      - alert: HPAMaxedOut
        expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} at max replicas"
```

---

## 9. Network Debugging Toolbox

### 9.1 Debugging with kubectl exec

```bash
# Open a shell in the Pod to run debug commands
kubectl exec -it <pod> -- /bin/sh

# If the container has no shell (distroless images),
# use an ephemeral container (K8s 1.25+)
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>

# Debug on the node (shares the node's network namespace)
kubectl debug node/<node-name> -it --image=nicolaka/netshoot
```

### 9.2 Node-Level Network Debugging with nsenter

```bash
# Get the container's PID
crictl inspect <container-id> | jq '.info.pid'

# Enter the container's network namespace (starts a shell inside it)
nsenter -t <pid> -n

# Run commands inside the container's network namespace
nsenter -t <pid> -n ip addr
nsenter -t <pid> -n ss -tlnp
nsenter -t <pid> -n iptables -L -n
nsenter -t <pid> -n ip route
```

### 9.3 Packet Capture with tcpdump

```bash
# Capture inside the Pod (if tcpdump is installed)
kubectl exec <pod> -- tcpdump -i eth0 -nn -c 100 port 8080

# Capture from an ephemeral container (named "debugger" so it can be referenced below)
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container> -c debugger -- \
  tcpdump -i eth0 -nn -w /tmp/capture.pcap -c 1000

# Download the capture for analysis (kubectl cp needs the container to still be running)
kubectl cp <pod>:/tmp/capture.pcap ./capture.pcap -c debugger

# Capture a specific Pod's traffic on the node
# First get the Pod IP
POD_IP=$(kubectl get pod <pod> -o jsonpath='{.status.podIP}')
# Then, on the node
tcpdump -i any host "$POD_IP" -nn -c 200
```

### 9.4 Connectivity Tests with curl / wget

```bash
# Test from a debug Pod
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash

# Test Service connectivity
curl -v http://<service>.<namespace>.svc.cluster.local:<port>/healthz
curl -v --connect-timeout 3 --max-time 10 http://<service>:<port>/

# Test HTTPS / TLS
curl -vk https://<ingress-host>/
openssl s_client -connect <ingress-host>:443 -servername <host>

# Test TCP port connectivity
nc -zv <host> <port>
nmap -p <port> <host>

# Test DNS resolution
dig <service>.<namespace>.svc.cluster.local @<coredns-ip>
dig +short +trace <external-domain>

# Trace the route
traceroute -n <target-ip>
mtr -n <target-ip>
```

### 9.5 Common Debug Images

| Image | Purpose | Tools |
|------|------|--------|
| `nicolaka/netshoot` | Swiss Army knife for network debugging | tcpdump, dig, curl, iperf, nmap, strace |
| `busybox:1.36` | Lightweight debugging | sh, nc, wget, ping, nslookup |
| `curlimages/curl` | HTTP testing | curl |
| `alpine:3.19` | General-purpose debugging | sh; install any tool with apk |
| `ubuntu:22.04` | Full debugging environment | install any tool with apt |

---

## 10. Troubleshooting Decision Tree

```
Pod abnormal
├── Pending
│   ├── Events: FailedScheduling → check resources / affinity / taints
│   ├── Events: no PV found → check PVC / StorageClass
│   └── No Events → check API Server / Scheduler logs
├── CrashLoopBackOff
│   ├── Exit 137 → OOMKilled → raise limits / look for a memory leak
│   ├── Exit 1 → application error → read the --previous logs
│   ├── Exit 127 → command not found → check entrypoint/CMD
│   └── No logs → check image / securityContext
├── ImagePullBackOff
│   ├── unauthorized → check imagePullSecret
│   ├── not found → check the image tag
│   └── timeout → check network / image registry
├── ContainerCreating (timed out)
│   ├── FailedMount → check PV / VolumeAttachment
│   └── NetworkNotReady → check the CNI plugin
├── Running but not working
│   ├── readinessProbe failing → check probe configuration
│   ├── CPU Throttling → adjust limits
│   └── Network unreachable → follow the network troubleshooting flow
└── Evicted
    ├── ephemeral-storage → clean up disk / set limits
    └── memory → check node memory pressure
```

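The exit codes in the CrashLoopBackOff branch follow the usual shell convention: a code above 128 means the process was killed by signal (code − 128), so 137 is signal 9 (SIGKILL, which is what the OOM killer sends). A small sketch of that mapping; `exit_code_hint` is a hypothetical helper name and the hint texts are illustrative.

```bash
# Hypothetical helper: translate a container exit code into a first hint.
exit_code_hint() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    # 128 + N means the process was terminated by signal N
    echo "killed by signal $(( code - 128 ))"
  elif [ "$code" -eq 127 ]; then
    echo "command not found"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error"
  fi
}
```

The exit code itself can be read from the `Last State` block of `kubectl describe pod <pod>`.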
---

## Agent Checklist

The following checklist is for AI agents working on Kubernetes troubleshooting:

### Information gathering

- [ ] Run `kubectl get pods -o wide` to confirm Pod status and node placement
- [ ] Run `kubectl get events --sort-by='.lastTimestamp'` to get the latest events
- [ ] Run `kubectl describe pod/node/svc` to get detailed status
- [ ] Determine the blast radius: single Pod / whole Deployment / whole node / whole cluster

### Pod failures

- [ ] CrashLoopBackOff: read the `--previous` logs and confirm the exit code
- [ ] ImagePullBackOff: verify the image address, imagePullSecret, and network connectivity
- [ ] OOMKilled: compare `kubectl top` usage with the configured limits
- [ ] Pending: check scheduling events, node resources, affinity rules, and PVC status
- [ ] Evicted: check node Conditions to tell disk pressure from memory pressure

### Network failures

- [ ] Use `kubectl get endpoints` to verify the Service has backends
- [ ] Use a debug Pod to test DNS resolution and TCP connectivity
- [ ] Check whether a NetworkPolicy is blocking traffic
- [ ] Ingress 502/504: check the Ingress Controller logs and backend health

### Storage failures

- [ ] PVC Pending: check the StorageClass, CSI driver, and capacity quota
- [ ] Mount failures: check VolumeAttachment and accessMode
- [ ] Permission problems: compare the container UID with file ownership and check fsGroup

### Node failures

- [ ] NotReady: check kubelet and containerd status
- [ ] DiskPressure: check disk usage; clean up images and logs
- [ ] MemoryPressure: check memory usage and confirm the system-reserved configuration
- [ ] PIDPressure: check PID usage and podPidsLimit

### Resource problems

- [ ] CPU Throttling: inspect the cfs_throttled metrics and assess whether limits are reasonable
- [ ] Memory leak: watch the memory growth trend and profile the application
- [ ] HPA not taking effect: verify metrics-server, Pod requests, and the HPA configuration

### Deployment failures

- [ ] Rolling update stuck: check new Pod status; run rollout undo if necessary
- [ ] ConfigMap/Secret update not taking effect: confirm the mount method; run rollout restart if necessary

### Fix verification

- [ ] After the fix, confirm Pods are back to Running/Ready
- [ ] Verify business functionality (health checks pass, end-to-end requests succeed)
- [ ] Check for secondary problems (such as inconsistent configuration after a rollback)
- [ ] Record the root cause and the fix, and update the operations docs