blockmine 1.24.0 → 1.25.0
- package/CHANGELOG.md +32 -0
- package/README.en.md +427 -0
- package/README.md +40 -0
- package/backend/cli.js +1 -1
- package/backend/src/ai/plugin-assistant-system-prompt.md +664 -5
- package/backend/src/api/routes/bots.js +13 -0
- package/backend/src/api/routes/servers.js +14 -2
- package/backend/src/core/BotProcess.js +98 -2
- package/backend/src/core/PluginLoader.js +83 -3
- package/backend/src/core/PluginManager.js +75 -5
- package/backend/src/core/services/BotLifecycleService.js +186 -2
- package/backend/src/server.js +11 -1
- package/frontend/dist/assets/browser-ponyfill-DN7pwmHT.js +2 -0
- package/frontend/dist/assets/index-LSy71uwm.js +11261 -0
- package/frontend/dist/assets/index-SfhKxI4-.css +32 -0
- package/frontend/dist/flags/en.svg +32 -0
- package/frontend/dist/flags/ru.svg +5 -0
- package/frontend/dist/index.html +2 -2
- package/frontend/dist/locales/en/admin.json +100 -0
- package/frontend/dist/locales/en/api-keys.json +58 -0
- package/frontend/dist/locales/en/bots.json +110 -0
- package/frontend/dist/locales/en/common.json +47 -0
- package/frontend/dist/locales/en/configuration.json +22 -0
- package/frontend/dist/locales/en/console.json +10 -0
- package/frontend/dist/locales/en/dashboard.json +85 -0
- package/frontend/dist/locales/en/dialogs.json +70 -0
- package/frontend/dist/locales/en/event-graphs.json +50 -0
- package/frontend/dist/locales/en/graph-store.json +70 -0
- package/frontend/dist/locales/en/login.json +34 -0
- package/frontend/dist/locales/en/management.json +114 -0
- package/frontend/dist/locales/en/minecraft-viewer.json +27 -0
- package/frontend/dist/locales/en/nodes.json +1077 -0
- package/frontend/dist/locales/en/permissions.json +50 -0
- package/frontend/dist/locales/en/plugin-detail.json +49 -0
- package/frontend/dist/locales/en/plugins.json +110 -0
- package/frontend/dist/locales/en/proxies.json +81 -0
- package/frontend/dist/locales/en/servers.json +39 -0
- package/frontend/dist/locales/en/setup.json +17 -0
- package/frontend/dist/locales/en/sidebar.json +27 -0
- package/frontend/dist/locales/en/tasks.json +62 -0
- package/frontend/dist/locales/en/visual-editor.json +219 -0
- package/frontend/dist/locales/en/websocket.json +86 -0
- package/frontend/dist/locales/ru/admin.json +100 -0
- package/frontend/dist/locales/ru/api-keys.json +58 -0
- package/frontend/dist/locales/ru/bots.json +110 -0
- package/frontend/dist/locales/ru/common.json +49 -0
- package/frontend/dist/locales/ru/configuration.json +22 -0
- package/frontend/dist/locales/ru/console.json +10 -0
- package/frontend/dist/locales/ru/dashboard.json +85 -0
- package/frontend/dist/locales/ru/dialogs.json +70 -0
- package/frontend/dist/locales/ru/event-graphs.json +50 -0
- package/frontend/dist/locales/ru/graph-store.json +70 -0
- package/frontend/dist/locales/ru/login.json +34 -0
- package/frontend/dist/locales/ru/management.json +114 -0
- package/frontend/dist/locales/ru/minecraft-viewer.json +27 -0
- package/frontend/dist/locales/ru/nodes.json +1077 -0
- package/frontend/dist/locales/ru/permissions.json +50 -0
- package/frontend/dist/locales/ru/plugin-detail.json +49 -0
- package/frontend/dist/locales/ru/plugins.json +110 -0
- package/frontend/dist/locales/ru/proxies.json +81 -0
- package/frontend/dist/locales/ru/servers.json +39 -0
- package/frontend/dist/locales/ru/setup.json +17 -0
- package/frontend/dist/locales/ru/sidebar.json +27 -0
- package/frontend/dist/locales/ru/tasks.json +62 -0
- package/frontend/dist/locales/ru/visual-editor.json +221 -0
- package/frontend/dist/locales/ru/websocket.json +86 -0
- package/frontend/dist/monacoeditorwork/css.worker.bundle.js +7 -7
- package/frontend/dist/monacoeditorwork/html.worker.bundle.js +7 -7
- package/frontend/dist/monacoeditorwork/json.worker.bundle.js +7 -7
- package/frontend/dist/monacoeditorwork/ts.worker.bundle.js +3 -3
- package/frontend/package.json +4 -0
- package/package.json +1 -1
- package/screen/3dviewer.png +0 -0
- package/screen/console.png +0 -0
- package/screen/dashboard.png +0 -0
- package/screen/graph_collabe.png +0 -0
- package/screen/graph_live_debug.png +0 -0
- package/screen/language_selector.png +0 -0
- package/screen/management_command.png +0 -0
- package/screen/node_debug_trace.png +0 -0
- package/screen/plugin_обзор.png +0 -0
- package/screen/websocket.png +0 -0
- package/screen/настройки_отдельных_команд_кажду_команлду_можно_настраивать.png +0 -0
- package/screen/планировщик_можно_задавать_действия_по_времени.png +0 -0
- package/.claude/agents/README.md +0 -469
- package/.claude/agents/auth-route-debugger.md +0 -118
- package/.claude/agents/auth-route-tester.md +0 -93
- package/.claude/agents/auto-error-resolver.md +0 -97
- package/.claude/agents/build-optimizer.md +0 -236
- package/.claude/agents/code-architect.md +0 -34
- package/.claude/agents/code-architecture-reviewer.md +0 -83
- package/.claude/agents/code-explorer.md +0 -51
- package/.claude/agents/code-refactor-master.md +0 -94
- package/.claude/agents/code-reviewer.md +0 -46
- package/.claude/agents/cost-optimizer.md +0 -134
- package/.claude/agents/deployment-orchestrator.md +0 -113
- package/.claude/agents/documentation-architect.md +0 -82
- package/.claude/agents/frontend-error-fixer.md +0 -77
- package/.claude/agents/iac-code-generator.md +0 -71
- package/.claude/agents/incident-responder.md +0 -346
- package/.claude/agents/infrastructure-architect.md +0 -31
- package/.claude/agents/kubernetes-specialist.md +0 -56
- package/.claude/agents/migration-planner.md +0 -181
- package/.claude/agents/network-architect.md +0 -196
- package/.claude/agents/plan-reviewer.md +0 -52
- package/.claude/agents/refactor-planner.md +0 -63
- package/.claude/agents/security-scanner.md +0 -102
- package/.claude/agents/web-research-specialist.md +0 -78
- package/.claude/commands/cost-analysis.md +0 -315
- package/.claude/commands/dev-docs-update.md +0 -55
- package/.claude/commands/dev-docs.md +0 -51
- package/.claude/commands/feature-dev.md +0 -125
- package/.claude/commands/incident-debug.md +0 -247
- package/.claude/commands/infra-plan.md +0 -81
- package/.claude/commands/migration-plan.md +0 -478
- package/.claude/commands/route-research-for-testing.md +0 -37
- package/.claude/commands/security-review.md +0 -66
- package/.claude/hooks/CONFIG.md +0 -448
- package/.claude/hooks/README.md +0 -163
- package/.claude/hooks/SKILL_ACTIVATION_COMPLETE.md +0 -226
- package/.claude/hooks/WINDOWS_HOOKS_README.md +0 -151
- package/.claude/hooks/add-skill-activation-banners.ts +0 -132
- package/.claude/hooks/comprehensive-skill-test.ts +0 -1315
- package/.claude/hooks/error-handling-reminder.sh +0 -12
- package/.claude/hooks/error-handling-reminder.ts +0 -222
- package/.claude/hooks/k8s-manifest-validator.sh +0 -56
- package/.claude/hooks/package-lock.json +0 -556
- package/.claude/hooks/package.json +0 -16
- package/.claude/hooks/post-tool-use-tracker.ps1 +0 -174
- package/.claude/hooks/post-tool-use-tracker.sh +0 -183
- package/.claude/hooks/security-policy-check.sh +0 -247
- package/.claude/hooks/skill-activation-prompt.ps1 +0 -10
- package/.claude/hooks/skill-activation-prompt.sh +0 -10
- package/.claude/hooks/skill-activation-prompt.ts +0 -141
- package/.claude/hooks/stop-build-check-enhanced.sh +0 -130
- package/.claude/hooks/terraform-validator.sh +0 -53
- package/.claude/hooks/test-input.json +0 -7
- package/.claude/hooks/test-skill-activation.ts +0 -427
- package/.claude/hooks/trigger-build-resolver.sh +0 -79
- package/.claude/hooks/tsc-check.sh +0 -173
- package/.claude/hooks/tsconfig.json +0 -19
- package/.claude/settings.json +0 -59
- package/.claude/settings.local.json +0 -67
- package/.claude/skills/README.md +0 -507
- package/.claude/skills/api-engineering/SKILL.md +0 -63
- package/.claude/skills/api-engineering/resources/api-versioning.md +0 -88
- package/.claude/skills/api-engineering/resources/graphql-patterns.md +0 -106
- package/.claude/skills/api-engineering/resources/rate-limiting.md +0 -118
- package/.claude/skills/api-engineering/resources/rest-api-design.md +0 -105
- package/.claude/skills/backend-dev-guidelines/SKILL.md +0 -306
- package/.claude/skills/backend-dev-guidelines/resources/architecture-overview.md +0 -451
- package/.claude/skills/backend-dev-guidelines/resources/async-and-errors.md +0 -307
- package/.claude/skills/backend-dev-guidelines/resources/complete-examples.md +0 -638
- package/.claude/skills/backend-dev-guidelines/resources/configuration.md +0 -275
- package/.claude/skills/backend-dev-guidelines/resources/database-patterns.md +0 -224
- package/.claude/skills/backend-dev-guidelines/resources/middleware-guide.md +0 -213
- package/.claude/skills/backend-dev-guidelines/resources/routing-and-controllers.md +0 -756
- package/.claude/skills/backend-dev-guidelines/resources/sentry-and-monitoring.md +0 -336
- package/.claude/skills/backend-dev-guidelines/resources/services-and-repositories.md +0 -789
- package/.claude/skills/backend-dev-guidelines/resources/testing-guide.md +0 -235
- package/.claude/skills/backend-dev-guidelines/resources/validation-patterns.md +0 -754
- package/.claude/skills/budget-and-cost-management/SKILL.md +0 -850
- package/.claude/skills/build-engineering/SKILL.md +0 -431
- package/.claude/skills/build-engineering/resources/artifact-repositories.md +0 -72
- package/.claude/skills/build-engineering/resources/build-caching.md +0 -96
- package/.claude/skills/build-engineering/resources/build-pipelines.md +0 -105
- package/.claude/skills/build-engineering/resources/build-security.md +0 -95
- package/.claude/skills/build-engineering/resources/build-systems.md +0 -389
- package/.claude/skills/build-engineering/resources/compilation-optimization.md +0 -201
- package/.claude/skills/build-engineering/resources/dependency-management.md +0 -73
- package/.claude/skills/build-engineering/resources/monorepo-builds.md +0 -110
- package/.claude/skills/build-engineering/resources/performance-optimization.md +0 -113
- package/.claude/skills/build-engineering/resources/reproducible-builds.md +0 -82
- package/.claude/skills/cloud-engineering/SKILL.md +0 -675
- package/.claude/skills/cloud-engineering/resources/aws-patterns.md +0 -742
- package/.claude/skills/cloud-engineering/resources/azure-patterns.md +0 -714
- package/.claude/skills/cloud-engineering/resources/cleared-cloud-environments.md +0 -987
- package/.claude/skills/cloud-engineering/resources/cloud-cost-optimization.md +0 -757
- package/.claude/skills/cloud-engineering/resources/cloud-networking.md +0 -1058
- package/.claude/skills/cloud-engineering/resources/cloud-security-tools.md +0 -1530
- package/.claude/skills/cloud-engineering/resources/cloud-security.md +0 -990
- package/.claude/skills/cloud-engineering/resources/gcp-patterns.md +0 -758
- package/.claude/skills/cloud-engineering/resources/migration-strategies.md +0 -820
- package/.claude/skills/cloud-engineering/resources/multi-cloud-strategies.md +0 -670
- package/.claude/skills/cloud-engineering/resources/oci-patterns.md +0 -1198
- package/.claude/skills/cloud-engineering/resources/serverless-patterns.md +0 -795
- package/.claude/skills/cloud-engineering/resources/well-architected-frameworks.md +0 -966
- package/.claude/skills/cybersecurity/SKILL.md +0 -409
- package/.claude/skills/cybersecurity/resources/security-architecture.md +0 -266
- package/.claude/skills/database-engineering/SKILL.md +0 -61
- package/.claude/skills/database-engineering/resources/backup-and-recovery.md +0 -72
- package/.claude/skills/database-engineering/resources/database-replication.md +0 -63
- package/.claude/skills/database-engineering/resources/postgresql-fundamentals.md +0 -70
- package/.claude/skills/database-engineering/resources/query-optimization.md +0 -68
- package/.claude/skills/devsecops/SKILL.md +0 -374
- package/.claude/skills/devsecops/resources/ci-cd-security.md +0 -204
- package/.claude/skills/devsecops/resources/compliance-automation.md +0 -530
- package/.claude/skills/devsecops/resources/compliance-frameworks.md +0 -2322
- package/.claude/skills/devsecops/resources/container-security.md +0 -915
- package/.claude/skills/devsecops/resources/cspm-integration.md +0 -1440
- package/.claude/skills/devsecops/resources/policy-enforcement.md +0 -619
- package/.claude/skills/devsecops/resources/secrets-management.md +0 -755
- package/.claude/skills/devsecops/resources/security-monitoring.md +0 -146
- package/.claude/skills/devsecops/resources/security-scanning.md +0 -887
- package/.claude/skills/devsecops/resources/security-testing.md +0 -203
- package/.claude/skills/devsecops/resources/supply-chain-security.md +0 -518
- package/.claude/skills/devsecops/resources/vulnerability-management.md +0 -481
- package/.claude/skills/devsecops/resources/zero-trust-architecture.md +0 -177
- package/.claude/skills/documentation-as-code/SKILL.md +0 -323
- package/.claude/skills/documentation-as-code/resources/api-documentation.md +0 -90
- package/.claude/skills/documentation-as-code/resources/changelog-management.md +0 -79
- package/.claude/skills/documentation-as-code/resources/diagram-generation.md +0 -44
- package/.claude/skills/documentation-as-code/resources/docs-as-code-workflow.md +0 -99
- package/.claude/skills/documentation-as-code/resources/documentation-automation.md +0 -68
- package/.claude/skills/documentation-as-code/resources/documentation-sites.md +0 -79
- package/.claude/skills/documentation-as-code/resources/markdown-best-practices.md +0 -162
- package/.claude/skills/documentation-as-code/resources/openapi-specification.md +0 -77
- package/.claude/skills/documentation-as-code/resources/readme-engineering.md +0 -60
- package/.claude/skills/documentation-as-code/resources/technical-writing-guide.md +0 -202
- package/.claude/skills/engineering-management/SKILL.md +0 -356
- package/.claude/skills/engineering-management/resources/career-ladders.md +0 -609
- package/.claude/skills/engineering-management/resources/hiring-and-assessment.md +0 -555
- package/.claude/skills/engineering-management/resources/one-on-one-guides.md +0 -609
- package/.claude/skills/engineering-management/resources/resource-planning.md +0 -557
- package/.claude/skills/engineering-management/resources/team-organization-patterns.md +0 -491
- package/.claude/skills/engineering-management/resources/technical-interviews.md +0 -474
- package/.claude/skills/engineering-operations-management/SKILL.md +0 -817
- package/.claude/skills/error-tracking/SKILL.md +0 -379
- package/.claude/skills/frontend-design/SKILL.md +0 -42
- package/.claude/skills/frontend-dev-guidelines/SKILL.md +0 -403
- package/.claude/skills/frontend-dev-guidelines/resources/common-patterns.md +0 -331
- package/.claude/skills/frontend-dev-guidelines/resources/complete-examples.md +0 -872
- package/.claude/skills/frontend-dev-guidelines/resources/component-patterns.md +0 -502
- package/.claude/skills/frontend-dev-guidelines/resources/data-fetching.md +0 -767
- package/.claude/skills/frontend-dev-guidelines/resources/file-organization.md +0 -502
- package/.claude/skills/frontend-dev-guidelines/resources/loading-and-error-states.md +0 -501
- package/.claude/skills/frontend-dev-guidelines/resources/performance.md +0 -406
- package/.claude/skills/frontend-dev-guidelines/resources/routing-guide.md +0 -364
- package/.claude/skills/frontend-dev-guidelines/resources/styling-guide.md +0 -428
- package/.claude/skills/frontend-dev-guidelines/resources/typescript-standards.md +0 -418
- package/.claude/skills/general-it-engineering/SKILL.md +0 -393
- package/.claude/skills/general-it-engineering/resources/asset-management.md +0 -712
- package/.claude/skills/general-it-engineering/resources/automation-orchestration.md +0 -817
- package/.claude/skills/general-it-engineering/resources/business-continuity.md +0 -786
- package/.claude/skills/general-it-engineering/resources/change-management.md +0 -715
- package/.claude/skills/general-it-engineering/resources/enterprise-monitoring.md +0 -729
- package/.claude/skills/general-it-engineering/resources/help-desk-operations.md +0 -738
- package/.claude/skills/general-it-engineering/resources/incident-service-management.md +0 -834
- package/.claude/skills/general-it-engineering/resources/it-governance.md +0 -753
- package/.claude/skills/general-it-engineering/resources/itil-framework.md +0 -503
- package/.claude/skills/general-it-engineering/resources/service-management.md +0 -669
- package/.claude/skills/infrastructure-architecture/SKILL.md +0 -328
- package/.claude/skills/infrastructure-architecture/resources/architecture-decision-records.md +0 -505
- package/.claude/skills/infrastructure-architecture/resources/architecture-patterns.md +0 -528
- package/.claude/skills/infrastructure-architecture/resources/capacity-planning.md +0 -453
- package/.claude/skills/infrastructure-architecture/resources/cleared-environment-architecture.md +0 -773
- package/.claude/skills/infrastructure-architecture/resources/cost-architecture.md +0 -499
- package/.claude/skills/infrastructure-architecture/resources/data-architecture.md +0 -501
- package/.claude/skills/infrastructure-architecture/resources/disaster-recovery.md +0 -535
- package/.claude/skills/infrastructure-architecture/resources/migration-architecture.md +0 -512
- package/.claude/skills/infrastructure-architecture/resources/multi-region-design.md +0 -608
- package/.claude/skills/infrastructure-architecture/resources/reference-architectures.md +0 -562
- package/.claude/skills/infrastructure-architecture/resources/security-architecture.md +0 -538
- package/.claude/skills/infrastructure-architecture/resources/system-design-principles.md +0 -489
- package/.claude/skills/infrastructure-architecture/resources/workload-classification.md +0 -1000
- package/.claude/skills/infrastructure-strategy/SKILL.md +0 -924
- package/.claude/skills/network-engineering/SKILL.md +0 -385
- package/.claude/skills/network-engineering/resources/dns-management.md +0 -738
- package/.claude/skills/network-engineering/resources/load-balancing.md +0 -820
- package/.claude/skills/network-engineering/resources/network-architecture.md +0 -546
- package/.claude/skills/network-engineering/resources/network-security.md +0 -921
- package/.claude/skills/network-engineering/resources/network-troubleshooting.md +0 -749
- package/.claude/skills/network-engineering/resources/routing-switching.md +0 -373
- package/.claude/skills/network-engineering/resources/sdn-networking.md +0 -695
- package/.claude/skills/network-engineering/resources/service-mesh-networking.md +0 -777
- package/.claude/skills/network-engineering/resources/tcp-ip-protocols.md +0 -444
- package/.claude/skills/network-engineering/resources/vpn-connectivity.md +0 -672
- package/.claude/skills/node-development/SKILL.md +0 -317
- package/.claude/skills/observability-engineering/SKILL.md +0 -101
- package/.claude/skills/observability-engineering/resources/apm-tools.md +0 -97
- package/.claude/skills/observability-engineering/resources/correlation-strategies.md +0 -87
- package/.claude/skills/observability-engineering/resources/distributed-tracing.md +0 -98
- package/.claude/skills/observability-engineering/resources/logs-aggregation.md +0 -118
- package/.claude/skills/observability-engineering/resources/observability-cost-optimization.md +0 -141
- package/.claude/skills/observability-engineering/resources/opentelemetry.md +0 -110
- package/.claude/skills/platform-engineering/SKILL.md +0 -555
- package/.claude/skills/platform-engineering/resources/architecture-overview.md +0 -600
- package/.claude/skills/platform-engineering/resources/container-orchestration.md +0 -916
- package/.claude/skills/platform-engineering/resources/cost-optimization.md +0 -634
- package/.claude/skills/platform-engineering/resources/developer-platforms.md +0 -670
- package/.claude/skills/platform-engineering/resources/gitops-automation.md +0 -650
- package/.claude/skills/platform-engineering/resources/infrastructure-as-code.md +0 -778
- package/.claude/skills/platform-engineering/resources/infrastructure-standards.md +0 -708
- package/.claude/skills/platform-engineering/resources/multi-tenancy.md +0 -602
- package/.claude/skills/platform-engineering/resources/platform-security.md +0 -711
- package/.claude/skills/platform-engineering/resources/resource-management.md +0 -592
- package/.claude/skills/platform-engineering/resources/service-mesh.md +0 -628
- package/.claude/skills/release-engineering/SKILL.md +0 -393
- package/.claude/skills/release-engineering/resources/artifact-management.md +0 -108
- package/.claude/skills/release-engineering/resources/build-optimization.md +0 -84
- package/.claude/skills/release-engineering/resources/ci-cd-pipelines.md +0 -411
- package/.claude/skills/release-engineering/resources/deployment-strategies.md +0 -197
- package/.claude/skills/release-engineering/resources/pipeline-security.md +0 -62
- package/.claude/skills/release-engineering/resources/progressive-delivery.md +0 -83
- package/.claude/skills/release-engineering/resources/release-automation.md +0 -68
- package/.claude/skills/release-engineering/resources/release-orchestration.md +0 -77
- package/.claude/skills/release-engineering/resources/rollback-strategies.md +0 -66
- package/.claude/skills/release-engineering/resources/versioning-strategies.md +0 -59
- package/.claude/skills/route-tester/SKILL.md +0 -392
- package/.claude/skills/skill-developer/ADVANCED.md +0 -197
- package/.claude/skills/skill-developer/HOOK_MECHANISMS.md +0 -306
- package/.claude/skills/skill-developer/PATTERNS_LIBRARY.md +0 -152
- package/.claude/skills/skill-developer/SKILL.md +0 -430
- package/.claude/skills/skill-developer/SKILL_RULES_REFERENCE.md +0 -315
- package/.claude/skills/skill-developer/TRIGGER_TYPES.md +0 -305
- package/.claude/skills/skill-developer/TROUBLESHOOTING.md +0 -514
- package/.claude/skills/skill-rules.json +0 -2989
- package/.claude/skills/sre/SKILL.md +0 -464
- package/.claude/skills/sre/resources/alerting-best-practices.md +0 -282
- package/.claude/skills/sre/resources/capacity-planning.md +0 -226
- package/.claude/skills/sre/resources/chaos-engineering.md +0 -193
- package/.claude/skills/sre/resources/disaster-recovery.md +0 -232
- package/.claude/skills/sre/resources/incident-management.md +0 -436
- package/.claude/skills/sre/resources/observability-stack.md +0 -240
- package/.claude/skills/sre/resources/on-call-runbooks.md +0 -167
- package/.claude/skills/sre/resources/performance-optimization.md +0 -108
- package/.claude/skills/sre/resources/reliability-patterns.md +0 -183
- package/.claude/skills/sre/resources/slo-sli-sla.md +0 -464
- package/.claude/skills/sre/resources/toil-reduction.md +0 -145
- package/.claude/skills/systems-engineering/SKILL.md +0 -648
- package/.claude/skills/systems-engineering/resources/automation-patterns.md +0 -771
- package/.claude/skills/systems-engineering/resources/configuration-management.md +0 -998
- package/.claude/skills/systems-engineering/resources/linux-administration.md +0 -672
- package/.claude/skills/systems-engineering/resources/networking-fundamentals.md +0 -982
- package/.claude/skills/systems-engineering/resources/performance-tuning.md +0 -871
- package/.claude/skills/systems-engineering/resources/powershell-scripting.md +0 -482
- package/.claude/skills/systems-engineering/resources/security-hardening.md +0 -739
- package/.claude/skills/systems-engineering/resources/shell-scripting.md +0 -915
- package/.claude/skills/systems-engineering/resources/storage-management.md +0 -628
- package/.claude/skills/systems-engineering/resources/system-monitoring.md +0 -787
- package/.claude/skills/systems-engineering/resources/troubleshooting-guide.md +0 -753
- package/.claude/skills/systems-engineering/resources/windows-administration.md +0 -738
- package/.claude/skills/technical-leadership/SKILL.md +0 -728
- package/backend/docs/SECRETS_DOCUMENTATION.md +0 -327
- package/frontend/dist/assets/index-BC-NbKXi.css +0 -32
- package/frontend/dist/assets/index-DqJXZMHY.js +0 -11266
@@ -1,232 +0,0 @@
-# Disaster Recovery
-
-Backup strategies, RTO/RPO definitions, failover procedures, disaster recovery testing, and multi-region architectures.
-
-## Table of Contents
-
-- [RTO and RPO](#rto-and-rpo)
-- [Backup Strategies](#backup-strategies)
-- [Failover Procedures](#failover-procedures)
-- [DR Testing](#dr-testing)
-- [Multi-Region Architecture](#multi-region-architecture)
-
-## RTO and RPO
-
-**Definitions:**
-```
-RTO (Recovery Time Objective):
-Maximum acceptable downtime
-Example: 4 hours
-
-RPO (Recovery Point Objective):
-Maximum acceptable data loss
-Example: 1 hour (last backup)
-```
-
-**RTO/RPO Tiers:**
-```yaml
-tier_1_critical:
-  rto: 1 hour
-  rpo: 15 minutes
-  cost: High
-  examples: [payment processing, critical APIs]
-
-tier_2_important:
-  rto: 4 hours
-  rpo: 1 hour
-  cost: Medium
-  examples: [main application, databases]
-
-tier_3_standard:
-  rto: 24 hours
-  rpo: 24 hours
-  cost: Low
-  examples: [internal tools, analytics]
-```
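As a rough illustration of how the removed tier table above could be applied, the sketch below encodes the three tiers as data and checks a drill's measured downtime and data-loss window against a tier's targets. The `Tier` class, the `TIERS` mapping, and the `meets_objectives` helper are hypothetical names introduced for this example; they are not part of the package.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data-loss window


# Targets copied from the removed tier table; the structure is illustrative.
TIERS = {
    "tier_1_critical": Tier(rto=timedelta(hours=1), rpo=timedelta(minutes=15)),
    "tier_2_important": Tier(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "tier_3_standard": Tier(rto=timedelta(hours=24), rpo=timedelta(hours=24)),
}


def meets_objectives(tier_name: str, downtime: timedelta, data_loss: timedelta) -> bool:
    """Return True when both measured values are at or under the tier's targets."""
    tier = TIERS[tier_name]
    return downtime <= tier.rto and data_loss <= tier.rpo
```

For instance, a drill with 3 hours of downtime and 30 minutes of lost data would pass for `tier_2_important` but fail for `tier_1_critical`.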
-
-## Backup Strategies
-
-**3-2-1 Rule:**
-```
-3 copies of data
-2 different media types
-1 offsite backup
-```
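The 3-2-1 rule in the removed text above can be expressed as a simple check over a backup inventory. This is a minimal sketch that assumes every copy, including the live data, is listed; `BackupCopy` and `satisfies_3_2_1` are hypothetical names, and the location and media strings are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    location: str  # placeholder label, e.g. "primary-db" or "s3-archive"
    media: str     # placeholder media type, e.g. "disk" or "object-storage"
    offsite: bool  # True when the copy lives outside the primary site


def satisfies_3_2_1(copies: List[BackupCopy]) -> bool:
    """Check for >= 3 copies, >= 2 distinct media types, and >= 1 offsite copy."""
    media_types = {copy.media for copy in copies}
    has_offsite = any(copy.offsite for copy in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite
```

Some teams count only backups and exclude the live data from the three copies; under that reading the inventory passed in would simply omit the primary.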
-
-**Database Backups:**
-```yaml
-# PostgreSQL backup with WAL archiving
-postgresql_backup:
-  full_backup:
-    frequency: daily
-    retention: 30 days
-    command: |
-      pg_basebackup -h localhost -D /backup/$(date +%Y%m%d) -Ft -z -Xs
-
-  wal_archiving:
-    enabled: true
-    archive_command: |
-      aws s3 cp %p s3://backups/wal/%f
-    restore_command: |
-      aws s3 cp s3://backups/wal/%f %p
-
-  point_in_time_recovery:
-    enabled: true
-    max_recovery_window: 7 days
-```
-
-**Automated Backups (Velero for Kubernetes):**
-```yaml
-apiVersion: velero.io/v1
-kind: Schedule
-metadata:
-  name: daily-backup
-spec:
-  schedule: "0 1 * * *" # 1 AM daily
-  template:
-    includedNamespaces:
-    - production
-    includedResources:
-    - "*"
-    snapshotVolumes: true
-    ttl: 720h # 30 days
-```
-
-## Failover Procedures
-
-**Database Failover:**
-```yaml
-# Automated failover with Patroni
-patroni:
-  name: postgres01
-  scope: postgres-cluster
-
-  bootstrap:
-    dcs:
-      ttl: 30
-      loop_wait: 10
-      retry_timeout: 10
-      maximum_lag_on_failover: 1048576
-
-  postgresql:
-    parameters:
-      max_connections: 100
-      shared_buffers: 256MB
-
-# Failover process:
-# 1. Patroni detects primary failure
-# 2. Initiates leader election
-# 3. Promotes replica to primary
-# 4. Updates DNS/load balancer
-# 5. Old primary rejoins as replica
-```
-
-**Application Failover:**
-```yaml
-# Multi-region with Route53 health checks
-route53_failover:
-  primary:
-    region: us-east-1
-    health_check:
-      protocol: HTTPS
-      path: /health
-      interval: 30
-      failure_threshold: 3
-
-  secondary:
-    region: us-west-2
-    failover_mode: automatic
-    activate_when: primary_unhealthy
-```
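The health-check settings in the removed block above imply a lower bound on how quickly a failover can begin. The sketch below works through that arithmetic, assuming `interval` is in seconds and that the primary is marked unhealthy after `failure_threshold` consecutive failed checks. The DNS TTL parameter is an assumption added for illustration; the removed config does not specify one.

```python
def detection_seconds(interval_s: int, failure_threshold: int) -> int:
    """Approximate time for consecutive failed checks to mark the primary unhealthy."""
    return interval_s * failure_threshold


def estimated_failover_seconds(interval_s: int, failure_threshold: int, dns_ttl_s: int) -> int:
    """Detection time plus the time cached DNS answers may keep pointing at the primary."""
    return detection_seconds(interval_s, failure_threshold) + dns_ttl_s


# Values from the removed config: interval 30, failure_threshold 3.
detect = detection_seconds(30, 3)
# Hypothetical 60-second record TTL, chosen only for the example.
total = estimated_failover_seconds(30, 3, dns_ttl_s=60)
print(f"detection: {detect}s, estimated failover: {total}s")
```

With these inputs detection alone takes about 90 seconds, which is small next to the 1-hour tier 1 RTO but worth including when budgeting a drill.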
-
-## DR Testing
-
-**DR Drill Schedule:**
-```yaml
-quarterly_dr_drill:
-  week_1:
-    - Review DR plan
-    - Update runbooks
-    - Verify backup integrity
-
-  week_2:
-    - Tabletop exercise
-    - Walk through procedures
-    - Identify gaps
-
-  week_3:
-    - Partial failover test
-    - Test database recovery
-    - Verify monitoring
-
-  week_4:
-    - Full DR drill
-    - Complete failover
-    - Document lessons learned
-```
-
-**DR Test Checklist:**
-```markdown
-- [ ] Backup restoration successful
-- [ ] RTO met (< 4 hours)
-- [ ] RPO met (< 1 hour data loss)
-- [ ] All services operational
-- [ ] Monitoring functional
-- [ ] Logs accessible
-- [ ] Team communication effective
-- [ ] Runbooks accurate
-- [ ] Action items documented
-```
-
-## Multi-Region Architecture
-
-**Active-Passive:**
-```
-Primary Region (Active) ─────┐
-  - Handles all traffic      │
-  - Database writes          │ Replication
-                             │
-Secondary Region (Passive) ◄─┘
-  - Standby for failover
-  - Read replicas only
-  - Activated manually
-```
-
-**Active-Active:**
-```
-Region 1 (Active) ◄─────► Region 2 (Active)
-  - 50% traffic             - 50% traffic
-  - Full read/write         - Full read/write
-  - Bi-directional sync     - Bi-directional sync
-  - Auto-failover           - Auto-failover
-```
-
-**Implementation:**
-```yaml
-# Kubernetes multi-region with Cilium Cluster Mesh
-clusters:
-  us-east-1:
-    role: primary
-    services:
-      - api-service
-      - database (primary)
-      - cache
-
-  us-west-2:
-    role: secondary
-    services:
-      - api-service (read-only)
-      - database (replica)
-      - cache
-
-failover:
-  automatic: true
-  health_check_interval: 30s
-  failover_threshold: 3
-```
-
----
-
-**Related Resources:**
-- [incident-management.md](incident-management.md)
-- [reliability-patterns.md](reliability-patterns.md)
@@ -1,436 +0,0 @@
|
|
|
1
|
-
# Incident Management
|
|
2
|
-
|
|
3
|
-
Incident response procedures, severity levels, escalation paths, communication protocols, postmortems, and on-call processes.
|
|
4
|
-
|
|
5
|
-
## Table of Contents
|
|
6
|
-
|
|
7
|
-
- [Incident Lifecycle](#incident-lifecycle)
|
|
8
|
-
- [Severity Levels](#severity-levels)
|
|
9
|
-
- [Roles and Responsibilities](#roles-and-responsibilities)
|
|
10
|
-
- [Response Procedures](#response-procedures)
|
|
11
|
-
- [Communication](#communication)
|
|
12
|
-
- [Postmortems](#postmortems)
|
|
13
|
-
- [Best Practices](#best-practices)
|
|
14
|
-
|
|
15
|
-
## Incident Lifecycle
|
|
16
|
-
|
|
17
|
-
```
|
|
18
|
-
Detect → Respond → Mitigate → Resolve → Learn
|
|
19
|
-
↓ ↓ ↓ ↓ ↓
|
|
20
|
-
Alert Triage Fix/Workaround Root Postmortem
|
|
21
|
-
Cause
|
|
22
|
-
```
|
|
23
|
-
|
|
24
|
-
## Severity Levels
|
|
25
|
-
|
|
26
|
-
```yaml
|
|
27
|
-
SEV1 - Critical:
|
|
28
|
-
impact: Complete service outage or data loss
|
|
29
|
-
response_time: Immediate (< 15 minutes)
|
|
30
|
-
escalation: Page on-call + management
|
|
31
|
-
examples:
|
|
32
|
-
- Production database down
|
|
33
|
-
- Payment processing failed
|
|
34
|
-
- Security breach
|
|
35
|
-
- Data loss
|
|
36
|
-
|
|
37
|
-
SEV2 - High:
|
|
38
|
-
impact: Major functionality impaired
|
|
39
|
-
response_time: 30 minutes
|
|
40
|
-
escalation: Page on-call
|
|
41
|
-
examples:
|
|
42
|
-
- API latency severely degraded
|
|
43
|
-
- Single region outage
|
|
44
|
-
- Critical feature unavailable
|
|
45
|
-
|
|
46
|
-
SEV3 - Medium:
|
|
47
|
-
impact: Minor functionality impaired
|
|
48
|
-
response_time: 2 hours
|
|
49
|
-
escalation: Notify on-call via Slack
|
|
50
|
-
examples:
|
|
51
|
-
- Non-critical feature degraded
|
|
52
|
-
- Elevated error rates
|
|
53
|
-
- Performance slowdown
|
|
54
|
-
|
|
55
|
-
SEV4 - Low:
|
|
56
|
-
impact: Minimal user impact
|
|
57
|
-
response_time: Next business day
|
|
58
|
-
escalation: Create ticket
|
|
59
|
-
examples:
|
|
60
|
-
- UI cosmetic issues
|
|
61
|
-
- Internal tool problems
|
|
62
|
-
- Low-priority bugs
|
|
63
|
-
```
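
The severity table above maps naturally onto a lookup that tooling can use to compute a first-response deadline. The Python sketch below is one way to encode it. `RESPONSE_TARGETS`, `ESCALATION`, and `response_deadline` are hypothetical names, and SEV4 ("next business day") is represented as `None` because it has no fixed duration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response targets copied from the severity table above.
# SEV4 ("next business day") has no fixed duration, so it maps to None.
RESPONSE_TARGETS: dict[str, Optional[timedelta]] = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=2),
    "SEV4": None,
}

# Escalation actions copied from the same table.
ESCALATION: dict[str, str] = {
    "SEV1": "Page on-call + management",
    "SEV2": "Page on-call",
    "SEV3": "Notify on-call via Slack",
    "SEV4": "Create ticket",
}


def response_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable first-response time, or None for SEV4."""
    target = RESPONSE_TARGETS[severity]  # raises KeyError for unknown severities
    return None if target is None else detected_at + target
```

For example, a SEV1 detected at 14:05 would need a first response by 14:20.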
|
|
64
|
-
|
|
65
|
-
## Roles and Responsibilities
|
|
66
|
-
|
|
67
|
-
**Incident Commander:**
|
|
68
|
-
- Owns the incident response
|
|
69
|
-
- Makes decisions
|
|
70
|
-
- Coordinates team
|
|
71
|
-
- Manages communication
|
|
72
|
-
|
|
73
|
-
**Technical Lead:**
|
|
74
|
-
- Diagnoses root cause
|
|
75
|
-
- Implements fixes
|
|
76
|
-
- Validates resolution
|
|
77
|
-
|
|
78
|
-
**Communications Lead:**
|
|
79
|
-
- Updates status page
|
|
80
|
-
- Notifies stakeholders
|
|
81
|
-
- Manages customer communication
|
|
82
|
-
|
|
83
|
-
**Scribe:**
|
|
84
|
-
- Documents timeline
|
|
85
|
-
- Records decisions
|
|
86
|
-
- Tracks action items
|
|
87
|
-
|
|
88
|
-
## Response Procedures
|
|
89
|
-
|
|
90
|
-
### Detection and Triage
|
|
91
|
-
|
|
92
|
-
```yaml
|
|
93
|
-
# incident-response.yaml
|
|
94
|
-
1_detection:
|
|
95
|
-
- Alert triggers
|
|
96
|
-
- Customer report
|
|
97
|
-
- Monitoring system
|
|
98
|
-
- Team member notice
|
|
99
|
-
|
|
100
|
-
2_initial_triage:
|
|
101
|
-
- Assess severity
|
|
102
|
-
- Determine impact
|
|
103
|
-
- Page appropriate team
|
|
104
|
-
- Create incident channel
|
|
105
|
-
|
|
106
|
-
3_form_response_team:
|
|
107
|
-
- Incident Commander
|
|
108
|
-
- Technical Lead(s)
|
|
109
|
-
- Communications Lead
|
|
110
|
-
- Subject Matter Experts
|
|
111
|
-
```
|
|
112
|
-
|
|
113
|
-
### Incident Command Structure
|
|
114
|
-
|
|
115
|
-
```bash
|
|
116
|
-
# Create the incident Slack channel (slash command run in Slack, not in a shell)
|
|
117
|
-
/incident create SEV1 "API Gateway Down"
|
|
118
|
-
|
|
119
|
-
# Auto-creates:
|
|
120
|
-
# - #incident-2024-001
|
|
121
|
-
# - Zoom bridge
|
|
122
|
-
# - Status page placeholder
|
|
123
|
-
# - Timeline doc
|
|
124
|
-
```
|
|
125
|
-
|
|
126
|
-
### Response Playbooks
|
|
127
|
-
|
|
128
|
-
**Database Outage:**
|
|
129
|
-
```yaml
|
|
130
|
-
playbook: database-outage
|
|
131
|
-
severity: SEV1
|
|
132
|
-
|
|
133
|
-
steps:
|
|
134
|
-
1_immediate:
|
|
135
|
-
- Check database health metrics
|
|
136
|
-
- Verify connectivity
|
|
137
|
-
- Check for locks/blocking queries
|
|
138
|
-
- Review recent changes
|
|
139
|
-
|
|
140
|
-
2_diagnosis:
|
|
141
|
-
- Check replication lag
|
|
142
|
-
- Review error logs
|
|
143
|
-
- Verify disk space
|
|
144
|
-
- Check connection pool
|
|
145
|
-
|
|
146
|
-
3_mitigation:
|
|
147
|
-
- Failover to replica
|
|
148
|
-
- Kill blocking queries
|
|
149
|
-
- Restart if necessary
|
|
150
|
-
- Scale resources
|
|
151
|
-
|
|
152
|
-
4_communication:
|
|
153
|
-
- Update status page
|
|
154
|
-
- Notify customers
|
|
155
|
-
- Inform stakeholders
|
|
156
|
-
```
|
|
157
|
-
|
|
158
|
-
**API Latency Degradation:**
|
|
159
|
-
```yaml
|
|
160
|
-
playbook: api-latency
|
|
161
|
-
severity: SEV2
|
|
162
|
-
|
|
163
|
-
steps:
|
|
164
|
-
1_gather_data:
|
|
165
|
-
- Check p95/p99 latency
|
|
166
|
-
- Review error rates
|
|
167
|
-
- Examine slow query logs
|
|
168
|
-
- Check downstream services
|
|
169
|
-
|
|
170
|
-
2_common_causes:
|
|
171
|
-
- Database slow queries
|
|
172
|
-
- Increased traffic
|
|
173
|
-
- Downstream dependency
|
|
174
|
-
- Resource exhaustion
|
|
175
|
-
- Code deployment
|
|
176
|
-
|
|
177
|
-
3_quick_fixes:
|
|
178
|
-
- Scale up instances
|
|
179
|
-
- Enable/adjust caching
|
|
180
|
-
- Rate limit traffic
|
|
181
|
-
- Rollback deployment
|
|
182
|
-
```
|
|
183
|
-
|
|
184
|
-
## Communication
|
|
185
|
-
|
|
186
|
-
### Status Page Updates
|
|
187
|
-
|
|
188
|
-
```yaml
|
|
189
|
-
# Incident timeline
|
|
190
|
-
14:05: Investigating - We're investigating reports of API errors
|
|
191
|
-
14:15: Identified - Database connection issues identified
|
|
192
|
-
14:30: Monitoring - Failover completed, monitoring recovery
|
|
193
|
-
15:00: Resolved - All services restored, investigating root cause
|
|
194
|
-
```
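
The timeline above moves through four status labels in a fixed order. The Python sketch below checks that a series of status-page updates follows that order. `STATUS_ORDER` and `is_valid_update_sequence` are hypothetical names, and the forward-only rule is an assumption: a real incident may legitimately move back from Monitoring to Investigating.

```python
# Status-page states in the order used by the timeline above.
STATUS_ORDER = ["Investigating", "Identified", "Monitoring", "Resolved"]


def is_valid_update_sequence(statuses: list[str]) -> bool:
    """True if every status is known and the sequence never moves backwards."""
    try:
        positions = [STATUS_ORDER.index(s) for s in statuses]
    except ValueError:
        return False  # unknown status label
    # Repeating a status is allowed (e.g. two "Investigating" updates in a row).
    return all(a <= b for a, b in zip(positions, positions[1:]))
```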
|
|
195
|
-
|
|
196
|
-
### Customer Communication Template
|
|
197
|
-
|
|
198
|
-
```markdown
|
|
199
|
-
Subject: [RESOLVED] API Service Disruption - Jan 15, 2024
|
|
200
|
-
|
|
201
|
-
Dear Customers,
|
|
202
|
-
|
|
203
|
-
SUMMARY:
|
|
204
|
-
Between 14:00 and 15:00 UTC today, our API service experienced elevated
|
|
205
|
-
error rates affecting approximately 10% of requests.
|
|
206
|
-
|
|
207
|
-
IMPACT:
|
|
208
|
-
- API errors for 10% of requests
|
|
209
|
-
- Average latency increased from 200ms to 2s
|
|
210
|
-
- No data loss occurred
|
|
211
|
-
|
|
212
|
-
ROOT CAUSE:
|
|
213
|
-
Database connection pool exhaustion due to traffic spike
|
|
214
|
-
|
|
215
|
-
RESOLUTION:
|
|
216
|
-
- Scaled database connection pools
|
|
217
|
-
- Implemented better connection management
|
|
218
|
-
- Added auto-scaling triggers
|
|
219
|
-
|
|
220
|
-
PREVENTION:
|
|
221
|
-
- Enhanced monitoring and alerting
|
|
222
|
-
- Implemented circuit breakers
|
|
223
|
-
- Scheduled capacity review
|
|
224
|
-
|
|
225
|
-
We apologize for the disruption. Please contact support@example.com
|
|
226
|
-
with any questions.
|
|
227
|
-
|
|
228
|
-
Status page: https://status.example.com/incidents/2024-001
|
|
229
|
-
```
|
|
230
|
-
|
|
231
|
-
### Internal Communication
|
|
232
|
-
|
|
233
|
-
```markdown
|
|
234
|
-
# Incident Slack Update Template
|
|
235
|
-
:rotating_light: **SEV1 INCIDENT** :rotating_light:
|
|
236
|
-
|
|
237
|
-
**Status:** Investigating
|
|
238
|
-
**Impact:** API returning 500 errors
|
|
239
|
-
**Started:** 14:05 UTC
|
|
240
|
-
**Incident Commander:** @alice
|
|
241
|
-
**Bridge:** https://zoom.us/j/123456789
|
|
242
|
-
**Channel:** #incident-2024-001
|
|
243
|
-
|
|
244
|
-
**Timeline:**
|
|
245
|
-
14:05 - Alert triggered for high error rate
|
|
246
|
-
14:07 - Incident declared SEV1
|
|
247
|
-
14:10 - Team assembled
|
|
248
|
-
14:15 - Root cause identified
|
|
249
|
-
|
|
250
|
-
**Next Update:** 14:30 UTC or sooner if status changes
|
|
251
|
-
```
|
|
252
|
-
|
|
253
|
-
## Postmortems
|
|
254
|
-
|
|
255
|
-
### Blameless Postmortem Template
|
|
256
|
-
|
|
257
|
-
```markdown
|
|
258
|
-
# Postmortem: API Outage - January 15, 2024
|
|
259
|
-
|
|
260
|
-
## Incident Summary
|
|
261
|
-
**Date:** 2024-01-15
|
|
262
|
-
**Duration:** 55 minutes (14:05 - 15:00 UTC)
|
|
263
|
-
**Severity:** SEV1
|
|
264
|
-
**Impact:** 10% error rate, 500k affected requests
|
|
265
|
-
**Root Cause:** Database connection pool exhaustion
|
|
266
|
-
|
|
267
|
-
## Timeline (UTC)
|
|
268
|
-
- 14:00: Traffic begins increasing (2x normal)
|
|
269
|
-
- 14:05: Alert: High API error rate
|
|
270
|
-
- 14:07: Incident declared SEV1
|
|
271
|
-
- 14:10: Incident team assembled
|
|
272
|
-
- 14:15: Root cause identified (connection pool exhausted)
|
|
273
|
-
- 14:20: Mitigation started (scale connection pool)
|
|
274
|
-
- 14:30: Error rates declining
|
|
275
|
-
- 14:45: Monitoring recovery
|
|
276
|
-
- 15:00: Incident resolved
|
|
277
|
-
|
|
278
|
-
## Root Cause Analysis
|
|
279
|
-
|
|
280
|
-
### What Happened
|
|
281
|
-
A marketing campaign drove 2x normal traffic. Our database connection
|
|
282
|
-
pool had a static size of 100 connections, which was exhausted. API
|
|
283
|
-
servers could not acquire database connections, resulting in errors.
|
|
284
|
-
|
|
285
|
-
### Why It Happened
|
|
286
|
-
1. No auto-scaling for database connection pools
|
|
287
|
-
2. Connection pool not sized for peak traffic
|
|
288
|
-
3. No circuit breaker to fail fast
|
|
289
|
-
4. Insufficient load testing
|
|
290
|
-
|
|
291
|
-
### Contributing Factors
|
|
292
|
-
- Marketing campaign not coordinated with engineering
|
|
293
|
-
- Connection pool metrics not monitored
|
|
294
|
-
- No alerts on connection pool saturation
|
|
295
|
-
|
|
296
|
-
## Impact
|
|
297
|
-
- 500,000 failed API requests
|
|
298
|
-
- 10% error rate for 55 minutes
|
|
299
|
-
- Estimated revenue impact: $5,000
|
|
300
|
-
- Customer complaints: 50
|
|
301
|
-
|
|
302
|
-
## What Went Well
|
|
303
|
-
- Fast detection (within 5 minutes)
|
|
304
|
-
- Clear escalation path
|
|
305
|
-
- Good team communication
|
|
306
|
-
- Status page updated regularly
|
|
307
|
-
- Fix deployed quickly
|
|
308
|
-
|
|
309
|
-
## What Went Wrong
|
|
310
|
-
- No advance warning of traffic spike
|
|
311
|
-
- Connection pool not monitored
|
|
312
|
-
- Manual scaling required
|
|
313
|
-
- Customer notification delayed 10 minutes
|
|
314
|
-
|
|
315
|
-
## Action Items
|
|
316
|
-
|
|
317
|
-
### Immediate (This Week)
|
|
318
|
-
- [ ] Implement connection pool auto-scaling (@alice, 2024-01-17)
|
|
319
|
-
- [ ] Add connection pool metrics to dashboards (@bob, 2024-01-18)
|
|
320
|
-
- [ ] Create alerts for pool saturation (@charlie, 2024-01-19)
|
|
321
|
-
|
|
322
|
-
### Short-term (This Month)
|
|
323
|
-
- [ ] Implement circuit breakers (@alice, 2024-01-25)
|
|
324
|
-
- [ ] Load test at 3x normal traffic (@bob, 2024-01-30)
|
|
325
|
-
- [ ] Create runbook for connection issues (@charlie, 2024-01-30)
|
|
326
|
-
|
|
327
|
-
### Long-term (This Quarter)
|
|
328
|
-
- [ ] Improve engineering/marketing coordination (@dave, 2024-03-31)
|
|
329
|
-
- [ ] Implement capacity planning process (@eve, 2024-03-31)
|
|
330
|
-
- [ ] Auto-notification system for incidents (@frank, 2024-03-31)
|
|
331
|
-
|
|
332
|
-
## Lessons Learned
|
|
333
|
-
1. Static resource limits are a failure point
|
|
334
|
-
2. Cross-team coordination is essential for major campaigns
|
|
335
|
-
3. Observability gaps can hide brewing problems
|
|
336
|
-
4. Circuit breakers prevent cascading failures
|
|
337
|
-
|
|
338
|
-
## Related Incidents
|
|
339
|
-
- INC-2023-089: Similar connection pool issue (resolved)
|
|
340
|
-
- INC-2023-112: Traffic spike from campaign (different cause)
|
|
341
|
-
|
|
342
|
-
## Appendix
|
|
343
|
-
- [Grafana Dashboard](https://grafana.example.com/d/incident-2024-001)
|
|
344
|
-
- [Logs](https://logs.example.com/incident-2024-001)
|
|
345
|
-
- [Slack Channel](https://slack.com/archives/incident-2024-001)
|
|
346
|
-
```
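
The headline figures in the example postmortem can be cross-checked with a little arithmetic. The Python sketch below derives the duration from the timeline and the traffic volume implied by the stated error rate. It assumes the 10% error rate held uniformly over the whole incident, which the postmortem does not state.

```python
from datetime import datetime

# Figures taken from the example postmortem above.
start = datetime(2024, 1, 15, 14, 5)   # alert fired
end = datetime(2024, 1, 15, 15, 0)     # incident resolved
failed_requests = 500_000
error_rate = 0.10

duration_min = (end - start).total_seconds() / 60
# If 10% of requests failed, total traffic is failed / error_rate.
total_requests = failed_requests / error_rate
avg_rps = total_requests / (duration_min * 60)

print(f"duration: {duration_min:.0f} min")
print(f"implied total requests: {total_requests:,.0f}")
print(f"implied average load: {avg_rps:,.0f} req/s")
```

This gives 55 minutes, about 5 million total requests, and an average load of roughly 1,515 requests per second. Figures like these are useful for sizing the 3x load test listed in the action items.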
|
|
347
|
-
|
|
348
|
-
### Postmortem Review Process
|
|
349
|
-
|
|
350
|
-
```yaml
|
|
351
|
-
postmortem_process:
|
|
352
|
-
1_draft:
|
|
353
|
-
owner: Incident Commander
|
|
354
|
-
deadline: 2 business days
|
|
355
|
-
content:
|
|
356
|
-
- Timeline
|
|
357
|
-
- Root cause
|
|
358
|
-
- Impact
|
|
359
|
-
- Action items
|
|
360
|
-
|
|
361
|
-
2_review:
|
|
362
|
-
participants:
|
|
363
|
-
- Incident team
|
|
364
|
-
- Engineering leadership
|
|
365
|
-
- Related teams
|
|
366
|
-
format: Meeting (30-60 min)
|
|
367
|
-
goals:
|
|
368
|
-
- Validate accuracy
|
|
369
|
-
- Identify additional learnings
|
|
370
|
-
- Prioritize action items
|
|
371
|
-
|
|
372
|
-
3_publish:
|
|
373
|
-
distribution:
|
|
374
|
-
- All engineering
|
|
375
|
-
- Product team
|
|
376
|
-
- Customer support
|
|
377
|
-
- Public (if appropriate)
|
|
378
|
-
|
|
379
|
-
4_followup:
|
|
380
|
-
- Track action items in project management tool
|
|
381
|
-
- Review progress in weekly meetings
|
|
382
|
-
- Update on completion
|
|
383
|
-
```
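
The draft deadline above is given in business days, which is easy to miscount across a weekend. The Python sketch below computes such a deadline. `add_business_days` is a hypothetical helper, public holidays are ignored, and counting from the resolution date is an assumption, since the process does not say when the clock starts.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Monday to Friday only).

    Public holidays are ignored; that is a simplifying assumption.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current
```

For an incident resolved on Monday 2024-01-15, the draft would be due on Wednesday 2024-01-17. For one resolved on Friday 2024-01-19, it would be due on Tuesday 2024-01-23.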
|
|
384
|
-
|
|
385
|
-
## Best Practices
|
|
386
|
-
|
|
387
|
-
### 1. Blameless Culture
|
|
388
|
-
|
|
389
|
-
```
|
|
390
|
-
Focus on systems and processes, not individuals
|
|
391
|
-
Ask "how" and "why", not "who"
|
|
392
|
-
Encourage sharing mistakes openly
|
|
393
|
-
```
|
|
394
|
-
|
|
395
|
-
### 2. Clear Severity Definitions
|
|
396
|
-
|
|
397
|
-
```yaml
|
|
398
|
-
# Document and communicate
|
|
399
|
-
# Train team on criteria
|
|
400
|
-
# Review severity in retrospective
|
|
401
|
-
```
|
|
402
|
-
|
|
403
|
-
### 3. Designated Roles
|
|
404
|
-
|
|
405
|
-
```
|
|
406
|
-
Never have incident response without clear roles
|
|
407
|
-
Incident Commander makes all decisions
|
|
408
|
-
Scribe documents everything
|
|
409
|
-
```
|
|
410
|
-
|
|
411
|
-
### 4. Practice Incidents
|
|
412
|
-
|
|
413
|
-
```yaml
|
|
414
|
-
# Run incident simulations quarterly
|
|
415
|
-
chaos_engineering:
|
|
416
|
-
- Simulate database failure
|
|
417
|
-
- Test failover procedures
|
|
418
|
-
- Practice communication
|
|
419
|
-
- Time the response
|
|
420
|
-
```
|
|
421
|
-
|
|
422
|
-
### 5. Action Item Follow-Through
|
|
423
|
-
|
|
424
|
-
```
|
|
425
|
-
Assign owners and deadlines
|
|
426
|
-
Track in project management
|
|
427
|
-
Report progress weekly
|
|
428
|
-
Revisit at the postmortem review meeting
|
|
429
|
-
```
|
|
430
|
-
|
|
431
|
-
---
|
|
432
|
-
|
|
433
|
-
**Related Resources:**
|
|
434
|
-
- [on-call-runbooks.md](on-call-runbooks.md) - Diagnostic procedures
|
|
435
|
-
- [observability-stack.md](observability-stack.md) - Monitoring and detection
|
|
436
|
-
- [alerting-best-practices.md](alerting-best-practices.md) - Alert configuration
|