aigroup-workflow 2.2.1 → 2.2.2
- package/.claude/commands/fix-build.md +10 -5
- package/.claude/commands/init-project.md +13 -8
- package/.claude/commands/plan.md +15 -8
- package/.claude/commands/review.md +12 -6
- package/.claude/commands/tdd.md +11 -5
- package/.claude/commands/workflow-start.md +20 -11
- package/.claude/settings.json +28 -0
- package/.codex/agents/architect.toml +207 -0
- package/.codex/agents/build-error-resolver.toml +110 -0
- package/.codex/agents/code-reviewer.toml +233 -0
- package/.codex/agents/doc-updater.toml +103 -0
- package/.codex/agents/e2e-runner.toml +103 -0
- package/.codex/agents/get-current-datetime.toml +23 -0
- package/.codex/agents/init-architect.toml +181 -0
- package/.codex/agents/planner.toml +208 -0
- package/.codex/agents/refactor-cleaner.toml +81 -0
- package/.codex/agents/rust-reviewer.toml +90 -0
- package/.codex/agents/security-reviewer.toml +104 -0
- package/.codex/agents/tdd-guide.toml +87 -0
- package/AGENTS.md +2 -2
- package/CLAUDE.md +23 -1
- package/LICENSE +20 -20
- package/README.md +333 -333
- package/agents/a11y-architect.md +141 -141
- package/agents/architect.md +211 -211
- package/agents/build-error-resolver.md +114 -114
- package/agents/chief-of-staff.md +151 -151
- package/agents/code-architect.md +71 -71
- package/agents/code-explorer.md +69 -69
- package/agents/code-reviewer.md +237 -237
- package/agents/code-simplifier.md +47 -47
- package/agents/comment-analyzer.md +45 -45
- package/agents/conversation-analyzer.md +52 -52
- package/agents/cpp-build-resolver.md +90 -90
- package/agents/cpp-reviewer.md +72 -72
- package/agents/csharp-reviewer.md +101 -101
- package/agents/dart-build-resolver.md +201 -201
- package/agents/database-reviewer.md +91 -91
- package/agents/doc-updater.md +107 -107
- package/agents/docs-lookup.md +68 -68
- package/agents/e2e-runner.md +107 -107
- package/agents/flutter-reviewer.md +243 -243
- package/agents/gan-evaluator.md +209 -209
- package/agents/gan-generator.md +131 -131
- package/agents/gan-planner.md +99 -99
- package/agents/get-current-datetime.md +26 -26
- package/agents/go-build-resolver.md +94 -94
- package/agents/go-reviewer.md +76 -76
- package/agents/harness-optimizer.md +35 -35
- package/agents/healthcare-reviewer.md +83 -83
- package/agents/java-build-resolver.md +153 -153
- package/agents/java-reviewer.md +92 -92
- package/agents/kotlin-build-resolver.md +118 -118
- package/agents/kotlin-reviewer.md +159 -159
- package/agents/loop-operator.md +36 -36
- package/agents/opensource-forker.md +198 -198
- package/agents/opensource-packager.md +249 -249
- package/agents/opensource-sanitizer.md +188 -188
- package/agents/performance-optimizer.md +446 -446
- package/agents/planner.md +212 -212
- package/agents/pr-test-analyzer.md +45 -45
- package/agents/python-reviewer.md +98 -98
- package/agents/pytorch-build-resolver.md +120 -120
- package/agents/refactor-cleaner.md +85 -85
- package/agents/rust-build-resolver.md +148 -148
- package/agents/rust-reviewer.md +94 -94
- package/agents/security-reviewer.md +108 -108
- package/agents/seo-specialist.md +59 -59
- package/agents/silent-failure-hunter.md +50 -50
- package/agents/tdd-guide.md +91 -91
- package/agents/type-design-analyzer.md +41 -41
- package/agents/typescript-reviewer.md +112 -112
- package/cli/commands/update.mjs +1 -1
- package/cli/utils/scaffold.mjs +53 -0
- package/docs/rules/agents.md +166 -50
- package/docs/rules/cpp/coding-style.md +44 -44
- package/docs/rules/cpp/hooks.md +39 -39
- package/docs/rules/cpp/patterns.md +51 -51
- package/docs/rules/cpp/security.md +51 -51
- package/docs/rules/cpp/testing.md +44 -44
- package/docs/rules/csharp/coding-style.md +72 -72
- package/docs/rules/csharp/hooks.md +25 -25
- package/docs/rules/csharp/patterns.md +50 -50
- package/docs/rules/csharp/security.md +58 -58
- package/docs/rules/csharp/testing.md +46 -46
- package/docs/rules/dart/coding-style.md +159 -159
- package/docs/rules/dart/hooks.md +66 -66
- package/docs/rules/dart/patterns.md +261 -261
- package/docs/rules/dart/security.md +135 -135
- package/docs/rules/dart/testing.md +215 -215
- package/docs/rules/golang/coding-style.md +32 -32
- package/docs/rules/golang/hooks.md +17 -17
- package/docs/rules/golang/patterns.md +45 -45
- package/docs/rules/golang/security.md +34 -34
- package/docs/rules/golang/testing.md +31 -31
- package/docs/rules/java/coding-style.md +114 -114
- package/docs/rules/java/hooks.md +18 -18
- package/docs/rules/java/patterns.md +146 -146
- package/docs/rules/java/security.md +100 -100
- package/docs/rules/java/testing.md +131 -131
- package/docs/rules/kotlin/coding-style.md +86 -86
- package/docs/rules/kotlin/hooks.md +17 -17
- package/docs/rules/kotlin/patterns.md +146 -146
- package/docs/rules/kotlin/security.md +82 -82
- package/docs/rules/kotlin/testing.md +128 -128
- package/docs/rules/perl/coding-style.md +46 -46
- package/docs/rules/perl/hooks.md +22 -22
- package/docs/rules/perl/patterns.md +76 -76
- package/docs/rules/perl/security.md +69 -69
- package/docs/rules/perl/testing.md +54 -54
- package/docs/rules/php/coding-style.md +40 -40
- package/docs/rules/php/hooks.md +24 -24
- package/docs/rules/php/patterns.md +33 -33
- package/docs/rules/php/security.md +37 -37
- package/docs/rules/php/testing.md +39 -39
- package/docs/rules/python/coding-style.md +42 -42
- package/docs/rules/python/hooks.md +19 -19
- package/docs/rules/python/patterns.md +39 -39
- package/docs/rules/python/security.md +30 -30
- package/docs/rules/python/testing.md +38 -38
- package/docs/rules/rust/coding-style.md +151 -151
- package/docs/rules/rust/hooks.md +16 -16
- package/docs/rules/rust/patterns.md +168 -168
- package/docs/rules/rust/security.md +141 -141
- package/docs/rules/rust/testing.md +154 -154
- package/docs/rules/swift/coding-style.md +47 -47
- package/docs/rules/swift/hooks.md +20 -20
- package/docs/rules/swift/patterns.md +66 -66
- package/docs/rules/swift/security.md +33 -33
- package/docs/rules/swift/testing.md +45 -45
- package/docs/rules/typescript/coding-style.md +199 -199
- package/docs/rules/typescript/hooks.md +22 -22
- package/docs/rules/typescript/patterns.md +52 -52
- package/docs/rules/typescript/security.md +28 -28
- package/docs/rules/typescript/testing.md +18 -18
- package/docs/rules/web/coding-style.md +96 -96
- package/docs/rules/web/design-quality.md +62 -62
- package/docs/rules/web/hooks.md +120 -120
- package/docs/rules/web/patterns.md +79 -79
- package/docs/rules/web/performance.md +64 -64
- package/docs/rules/web/security.md +57 -57
- package/docs/rules/web/testing.md +55 -55
- package/docs/templates/README.md +36 -36
- package/docs/templates/ai-project-final.md +124 -124
- package/docs/templates/ai-project.md +105 -105
- package/docs/templates/api.md +157 -157
- package/docs/templates/bug.md +62 -62
- package/docs/templates/code-review.md +87 -87
- package/docs/templates/generic.md +116 -116
- package/docs/templates/implementation-plan.md +1 -1
- package/docs/templates/meeting.md +68 -68
- package/docs/templates/prd.md +98 -98
- package/docs/templates/ui.md +134 -134
- package/docs/workflow-pipeline.md +5 -5
- package/package.json +40 -39
- package/skills/SUPERPOWERS-LICENSE +21 -21
- package/skills/ai-ml/fine-tuning-expert/SKILL.md +162 -162
- package/skills/ai-ml/fine-tuning-expert/references/dataset-preparation.md +540 -540
- package/skills/ai-ml/fine-tuning-expert/references/deployment-optimization.md +673 -673
- package/skills/ai-ml/fine-tuning-expert/references/evaluation-metrics.md +597 -597
- package/skills/ai-ml/fine-tuning-expert/references/hyperparameter-tuning.md +565 -565
- package/skills/ai-ml/fine-tuning-expert/references/lora-peft.md +347 -347
- package/skills/ai-ml/ml-pipeline/SKILL.md +159 -159
- package/skills/ai-ml/ml-pipeline/references/experiment-tracking.md +833 -833
- package/skills/ai-ml/ml-pipeline/references/feature-engineering.md +631 -631
- package/skills/ai-ml/ml-pipeline/references/model-validation.md +978 -978
- package/skills/ai-ml/ml-pipeline/references/pipeline-orchestration.md +907 -907
- package/skills/ai-ml/ml-pipeline/references/training-pipelines.md +782 -782
- package/skills/ai-ml/rag-architect/SKILL.md +194 -194
- package/skills/ai-ml/rag-architect/references/chunking-strategies.md +878 -878
- package/skills/ai-ml/rag-architect/references/embedding-models.md +561 -561
- package/skills/ai-ml/rag-architect/references/rag-evaluation.md +833 -833
- package/skills/ai-ml/rag-architect/references/retrieval-optimization.md +795 -795
- package/skills/ai-ml/rag-architect/references/vector-databases.md +589 -589
- package/skills/ai-ml/spark-engineer/SKILL.md +148 -148
- package/skills/ai-ml/spark-engineer/references/partitioning-caching.md +543 -543
- package/skills/ai-ml/spark-engineer/references/performance-tuning.md +544 -544
- package/skills/ai-ml/spark-engineer/references/rdd-operations.md +599 -599
- package/skills/ai-ml/spark-engineer/references/spark-sql-dataframes.md +474 -474
- package/skills/ai-ml/spark-engineer/references/streaming-patterns.md +786 -786
- package/skills/backend/api-designer/SKILL.md +217 -217
- package/skills/backend/api-designer/references/error-handling.md +541 -541
- package/skills/backend/api-designer/references/openapi.md +824 -824
- package/skills/backend/api-designer/references/pagination.md +494 -494
- package/skills/backend/api-designer/references/rest-patterns.md +335 -335
- package/skills/backend/api-designer/references/versioning.md +391 -391
- package/skills/backend/architecture-designer/SKILL.md +117 -117
- package/skills/backend/architecture-designer/references/adr-template.md +116 -116
- package/skills/backend/architecture-designer/references/architecture-patterns.md +111 -111
- package/skills/backend/architecture-designer/references/database-selection.md +102 -102
- package/skills/backend/architecture-designer/references/nfr-checklist.md +112 -112
- package/skills/backend/architecture-designer/references/system-design.md +100 -100
- package/skills/backend/code-documenter/SKILL.md +147 -147
- package/skills/backend/code-documenter/references/api-docs-fastapi-django.md +166 -166
- package/skills/backend/code-documenter/references/api-docs-nestjs-express.md +220 -220
- package/skills/backend/code-documenter/references/coverage-reports.md +125 -125
- package/skills/backend/code-documenter/references/documentation-systems.md +333 -333
- package/skills/backend/code-documenter/references/interactive-api-docs.md +531 -531
- package/skills/backend/code-documenter/references/python-docstrings.md +121 -121
- package/skills/backend/code-documenter/references/typescript-jsdoc.md +145 -145
- package/skills/backend/code-documenter/references/user-guides-tutorials.md +530 -530
- package/skills/backend/debugging-wizard/SKILL.md +105 -105
- package/skills/backend/debugging-wizard/references/common-patterns.md +132 -132
- package/skills/backend/debugging-wizard/references/debugging-tools.md +140 -140
- package/skills/backend/debugging-wizard/references/quick-fixes.md +177 -177
- package/skills/backend/debugging-wizard/references/strategies.md +142 -142
- package/skills/backend/debugging-wizard/references/systematic-debugging.md +367 -367
- package/skills/backend/feature-forge/SKILL.md +98 -98
- package/skills/backend/feature-forge/references/acceptance-criteria.md +104 -104
- package/skills/backend/feature-forge/references/ears-syntax.md +99 -99
- package/skills/backend/feature-forge/references/interview-questions.md +150 -150
- package/skills/backend/feature-forge/references/pre-discovery-subagents.md +54 -54
- package/skills/backend/feature-forge/references/specification-template.md +103 -103
- package/skills/backend/fullstack-guardian/SKILL.md +105 -105
- package/skills/backend/fullstack-guardian/references/api-design-standards.md +307 -307
- package/skills/backend/fullstack-guardian/references/architecture-decisions.md +350 -350
- package/skills/backend/fullstack-guardian/references/backend-patterns.md +237 -237
- package/skills/backend/fullstack-guardian/references/common-patterns.md +134 -134
- package/skills/backend/fullstack-guardian/references/deliverables-checklist.md +354 -354
- package/skills/backend/fullstack-guardian/references/design-template.md +91 -91
- package/skills/backend/fullstack-guardian/references/error-handling.md +135 -135
- package/skills/backend/fullstack-guardian/references/frontend-patterns.md +340 -340
- package/skills/backend/fullstack-guardian/references/integration-patterns.md +333 -333
- package/skills/backend/fullstack-guardian/references/security-checklist.md +106 -106
- package/skills/backend/graphql-architect/SKILL.md +146 -146
- package/skills/backend/graphql-architect/references/federation.md +418 -418
- package/skills/backend/graphql-architect/references/migration-from-rest.md +1141 -1141
- package/skills/backend/graphql-architect/references/resolvers.md +425 -425
- package/skills/backend/graphql-architect/references/schema-design.md +393 -393
- package/skills/backend/graphql-architect/references/security.md +569 -569
- package/skills/backend/graphql-architect/references/subscriptions.md +510 -510
- package/skills/backend/legacy-modernizer/SKILL.md +137 -137
- package/skills/backend/legacy-modernizer/references/legacy-testing.md +381 -381
- package/skills/backend/legacy-modernizer/references/migration-strategies.md +423 -423
- package/skills/backend/legacy-modernizer/references/refactoring-patterns.md +395 -395
- package/skills/backend/legacy-modernizer/references/strangler-fig-pattern.md +281 -281
- package/skills/backend/legacy-modernizer/references/system-assessment.md +487 -487
- package/skills/backend/microservices-architect/SKILL.md +164 -164
- package/skills/backend/microservices-architect/references/communication.md +499 -499
- package/skills/backend/microservices-architect/references/data.md +721 -721
- package/skills/backend/microservices-architect/references/decomposition.md +344 -344
- package/skills/backend/microservices-architect/references/observability.md +805 -805
- package/skills/backend/microservices-architect/references/patterns.md +603 -603
- package/skills/database/database-optimizer/SKILL.md +147 -147
- package/skills/database/database-optimizer/references/index-strategies.md +331 -331
- package/skills/database/database-optimizer/references/monitoring-analysis.md +501 -501
- package/skills/database/database-optimizer/references/mysql-tuning.md +452 -452
- package/skills/database/database-optimizer/references/postgresql-tuning.md +413 -413
- package/skills/database/database-optimizer/references/query-optimization.md +251 -251
- package/skills/database/postgres-pro/SKILL.md +152 -152
- package/skills/database/postgres-pro/references/extensions.md +404 -404
- package/skills/database/postgres-pro/references/jsonb.md +321 -321
- package/skills/database/postgres-pro/references/maintenance.md +481 -481
- package/skills/database/postgres-pro/references/performance.md +265 -265
- package/skills/database/postgres-pro/references/replication.md +446 -446
- package/skills/database/sql-pro/SKILL.md +129 -129
- package/skills/database/sql-pro/references/database-design.md +402 -402
- package/skills/database/sql-pro/references/dialect-differences.md +419 -419
- package/skills/database/sql-pro/references/optimization.md +384 -384
- package/skills/database/sql-pro/references/query-patterns.md +285 -285
- package/skills/database/sql-pro/references/window-functions.md +328 -328
- package/skills/dotnet/csharp-developer/SKILL.md +125 -125
- package/skills/dotnet/csharp-developer/references/aspnet-core.md +394 -394
- package/skills/dotnet/csharp-developer/references/blazor.md +553 -553
- package/skills/dotnet/csharp-developer/references/entity-framework.md +409 -409
- package/skills/dotnet/csharp-developer/references/modern-csharp.md +248 -248
- package/skills/dotnet/csharp-developer/references/performance.md +498 -498
- package/skills/dotnet/dotnet-core-expert/SKILL.md +138 -138
- package/skills/dotnet/dotnet-core-expert/references/authentication.md +546 -546
- package/skills/dotnet/dotnet-core-expert/references/clean-architecture.md +455 -455
- package/skills/dotnet/dotnet-core-expert/references/cloud-native.md +548 -548
- package/skills/dotnet/dotnet-core-expert/references/entity-framework.md +440 -440
- package/skills/dotnet/dotnet-core-expert/references/minimal-apis.md +319 -319
- package/skills/frontend/angular-architect/SKILL.md +152 -152
- package/skills/frontend/angular-architect/references/components.md +297 -297
- package/skills/frontend/angular-architect/references/ngrx.md +401 -401
- package/skills/frontend/angular-architect/references/routing.md +361 -361
- package/skills/frontend/angular-architect/references/rxjs.md +319 -319
- package/skills/frontend/angular-architect/references/testing.md +405 -405
- package/skills/frontend/design-commands/design.md +91 -91
- package/skills/frontend/design-commands/handoff.md +97 -97
- package/skills/frontend/design-commands/prototype.md +120 -120
- package/skills/frontend/design-commands/spec.md +160 -160
- package/skills/frontend/design-commands/style.md +78 -78
- package/skills/frontend/flutter-expert/SKILL.md +138 -138
- package/skills/frontend/flutter-expert/references/bloc-state.md +259 -259
- package/skills/frontend/flutter-expert/references/gorouter-navigation.md +119 -119
- package/skills/frontend/flutter-expert/references/performance.md +99 -99
- package/skills/frontend/flutter-expert/references/project-structure.md +118 -118
- package/skills/frontend/flutter-expert/references/riverpod-state.md +130 -130
- package/skills/frontend/flutter-expert/references/widget-patterns.md +123 -123
- package/skills/frontend/nextjs-developer/SKILL.md +143 -143
- package/skills/frontend/nextjs-developer/references/app-router.md +311 -311
- package/skills/frontend/nextjs-developer/references/data-fetching.md +482 -482
- package/skills/frontend/nextjs-developer/references/deployment.md +545 -545
- package/skills/frontend/nextjs-developer/references/server-actions.md +462 -462
- package/skills/frontend/nextjs-developer/references/server-components.md +384 -384
- package/skills/frontend/react-expert/SKILL.md +149 -149
- package/skills/frontend/react-expert/references/hooks-patterns.md +162 -162
- package/skills/frontend/react-expert/references/migration-class-to-modern.md +1119 -1119
- package/skills/frontend/react-expert/references/performance.md +168 -168
- package/skills/frontend/react-expert/references/react-19-features.md +174 -174
- package/skills/frontend/react-expert/references/server-components.md +143 -143
- package/skills/frontend/react-expert/references/state-management.md +171 -171
- package/skills/frontend/react-expert/references/testing-react.md +174 -174
- package/skills/frontend/react-native-expert/SKILL.md +185 -185
- package/skills/frontend/react-native-expert/references/expo-router.md +187 -187
- package/skills/frontend/react-native-expert/references/list-optimization.md +204 -204
- package/skills/frontend/react-native-expert/references/platform-handling.md +188 -188
- package/skills/frontend/react-native-expert/references/project-structure.md +171 -171
- package/skills/frontend/react-native-expert/references/storage-hooks.md +173 -173
- package/skills/frontend/senior-frontend/SKILL.md +477 -477
- package/skills/frontend/senior-frontend/references/frontend_best_practices.md +806 -806
- package/skills/frontend/senior-frontend/references/nextjs_optimization_guide.md +724 -724
- package/skills/frontend/senior-frontend/references/react_patterns.md +746 -746
- package/skills/frontend/senior-frontend/scripts/bundle_analyzer.py +407 -407
- package/skills/frontend/senior-frontend/scripts/component_generator.py +329 -329
- package/skills/frontend/senior-frontend/scripts/frontend_scaffolder.py +1005 -1005
- package/skills/frontend/ui-ux-pro-max/SKILL.md +386 -386
- package/skills/frontend/ui-ux-pro-max/data/charts.csv +26 -26
- package/skills/frontend/ui-ux-pro-max/data/colors.csv +97 -97
- package/skills/frontend/ui-ux-pro-max/data/icons.csv +101 -101
- package/skills/frontend/ui-ux-pro-max/data/landing.csv +31 -31
- package/skills/frontend/ui-ux-pro-max/data/products.csv +96 -96
- package/skills/frontend/ui-ux-pro-max/data/react-performance.csv +45 -45
- package/skills/frontend/ui-ux-pro-max/data/stacks/astro.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/flutter.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/html-tailwind.csv +56 -56
- package/skills/frontend/ui-ux-pro-max/data/stacks/jetpack-compose.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/nextjs.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/nuxt-ui.csv +51 -51
- package/skills/frontend/ui-ux-pro-max/data/stacks/nuxtjs.csv +59 -59
- package/skills/frontend/ui-ux-pro-max/data/stacks/react-native.csv +52 -52
- package/skills/frontend/ui-ux-pro-max/data/stacks/react.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/shadcn.csv +61 -61
- package/skills/frontend/ui-ux-pro-max/data/stacks/svelte.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/swiftui.csv +51 -51
- package/skills/frontend/ui-ux-pro-max/data/stacks/vue.csv +50 -50
- package/skills/frontend/ui-ux-pro-max/data/styles.csv +68 -68
- package/skills/frontend/ui-ux-pro-max/data/typography.csv +57 -57
- package/skills/frontend/ui-ux-pro-max/data/ui-reasoning.csv +101 -101
- package/skills/frontend/ui-ux-pro-max/data/ux-guidelines.csv +99 -99
- package/skills/frontend/ui-ux-pro-max/data/web-interface.csv +31 -31
- package/skills/frontend/ui-ux-pro-max/scripts/core.py +253 -253
- package/skills/frontend/ui-ux-pro-max/scripts/design_system.py +1067 -1067
- package/skills/frontend/ui-ux-pro-max/scripts/search.py +114 -114
- package/skills/frontend/vue-expert/SKILL.md +98 -98
- package/skills/frontend/vue-expert/references/build-tooling.md +480 -480
- package/skills/frontend/vue-expert/references/components.md +448 -448
- package/skills/frontend/vue-expert/references/composition-api.md +299 -299
- package/skills/frontend/vue-expert/references/mobile-hybrid.md +636 -636
- package/skills/frontend/vue-expert/references/nuxt.md +669 -669
- package/skills/frontend/vue-expert/references/state-management.md +449 -449
- package/skills/frontend/vue-expert/references/typescript.md +584 -584
- package/skills/frontend/vue-expert-js/SKILL.md +167 -167
- package/skills/frontend/vue-expert-js/references/component-architecture.md +219 -219
- package/skills/frontend/vue-expert-js/references/composables-patterns.md +183 -183
- package/skills/frontend/vue-expert-js/references/jsdoc-typing.md +535 -535
- package/skills/frontend/vue-expert-js/references/state-management.md +249 -249
- package/skills/frontend/vue-expert-js/references/testing-patterns.md +237 -237
- package/skills/go-rust-cpp/cpp-pro/SKILL.md +115 -115
- package/skills/go-rust-cpp/cpp-pro/references/build-tooling.md +440 -440
- package/skills/go-rust-cpp/cpp-pro/references/concurrency.md +437 -437
- package/skills/go-rust-cpp/cpp-pro/references/memory-performance.md +397 -397
- package/skills/go-rust-cpp/cpp-pro/references/modern-cpp.md +304 -304
- package/skills/go-rust-cpp/cpp-pro/references/templates.md +357 -357
- package/skills/go-rust-cpp/golang-pro/SKILL.md +122 -122
- package/skills/go-rust-cpp/golang-pro/references/concurrency.md +329 -329
- package/skills/go-rust-cpp/golang-pro/references/generics.md +442 -442
- package/skills/go-rust-cpp/golang-pro/references/interfaces.md +432 -432
- package/skills/go-rust-cpp/golang-pro/references/project-structure.md +477 -477
- package/skills/go-rust-cpp/golang-pro/references/testing.md +451 -451
- package/skills/go-rust-cpp/rust-engineer/SKILL.md +167 -167
- package/skills/go-rust-cpp/rust-engineer/references/async.md +458 -458
- package/skills/go-rust-cpp/rust-engineer/references/error-handling.md +334 -334
- package/skills/go-rust-cpp/rust-engineer/references/ownership.md +278 -278
- package/skills/go-rust-cpp/rust-engineer/references/testing.md +470 -470
- package/skills/go-rust-cpp/rust-engineer/references/traits.md +413 -413
- package/skills/infra/cli-developer/SKILL.md +113 -113
- package/skills/infra/cli-developer/references/design-patterns.md +221 -221
- package/skills/infra/cli-developer/references/go-cli.md +540 -540
- package/skills/infra/cli-developer/references/node-cli.md +383 -383
- package/skills/infra/cli-developer/references/python-cli.md +422 -422
- package/skills/infra/cli-developer/references/ux-patterns.md +448 -448
- package/skills/infra/cloud-architect/SKILL.md +216 -216
- package/skills/infra/cloud-architect/references/aws.md +394 -394
- package/skills/infra/cloud-architect/references/azure.md +562 -562
- package/skills/infra/cloud-architect/references/cost.md +582 -582
- package/skills/infra/cloud-architect/references/gcp.md +633 -633
- package/skills/infra/cloud-architect/references/multi-cloud.md +483 -483
- package/skills/infra/devops-engineer/SKILL.md +144 -144
- package/skills/infra/devops-engineer/references/deployment-strategies.md +241 -241
- package/skills/infra/devops-engineer/references/docker-patterns.md +113 -113
- package/skills/infra/devops-engineer/references/github-actions.md +139 -139
- package/skills/infra/devops-engineer/references/incident-response.md +331 -331
- package/skills/infra/devops-engineer/references/kubernetes.md +154 -154
- package/skills/infra/devops-engineer/references/platform-engineering.md +417 -417
- package/skills/infra/devops-engineer/references/release-automation.md +527 -527
- package/skills/infra/devops-engineer/references/terraform-iac.md +141 -141
- package/skills/infra/kubernetes-specialist/SKILL.md +241 -241
- package/skills/infra/kubernetes-specialist/references/configuration.md +452 -452
- package/skills/infra/kubernetes-specialist/references/cost-optimization.md +458 -458
- package/skills/infra/kubernetes-specialist/references/custom-operators.md +563 -563
- package/skills/infra/kubernetes-specialist/references/gitops.md +530 -530
- package/skills/infra/kubernetes-specialist/references/helm-charts.md +912 -912
- package/skills/infra/kubernetes-specialist/references/multi-cluster.md +507 -507
- package/skills/infra/kubernetes-specialist/references/networking.md +447 -447
- package/skills/infra/kubernetes-specialist/references/service-mesh.md +459 -459
- package/skills/infra/kubernetes-specialist/references/storage.md +535 -535
- package/skills/infra/kubernetes-specialist/references/troubleshooting.md +414 -414
- package/skills/infra/kubernetes-specialist/references/workloads.md +377 -377
- package/skills/infra/mcp-developer/SKILL.md +143 -143
- package/skills/infra/mcp-developer/references/protocol.md +244 -244
- package/skills/infra/mcp-developer/references/python-sdk.md +367 -367
- package/skills/infra/mcp-developer/references/resources.md +554 -554
- package/skills/infra/mcp-developer/references/tools.md +480 -480
- package/skills/infra/mcp-developer/references/typescript-sdk.md +350 -350
- package/skills/infra/monitoring-expert/SKILL.md +176 -176
- package/skills/infra/monitoring-expert/references/alerting-rules.md +141 -141
- package/skills/infra/monitoring-expert/references/application-profiling.md +331 -331
- package/skills/infra/monitoring-expert/references/capacity-planning.md +344 -344
- package/skills/infra/monitoring-expert/references/dashboards.md +126 -126
- package/skills/infra/monitoring-expert/references/opentelemetry.md +123 -123
- package/skills/infra/monitoring-expert/references/performance-testing.md +269 -269
- package/skills/infra/monitoring-expert/references/prometheus-metrics.md +136 -136
- package/skills/infra/monitoring-expert/references/structured-logging.md +142 -142
- package/skills/infra/sre-engineer/SKILL.md +181 -181
- package/skills/infra/sre-engineer/references/automation-toil.md +492 -492
- package/skills/infra/sre-engineer/references/error-budget-policy.md +334 -334
- package/skills/infra/sre-engineer/references/incident-chaos.md +576 -576
- package/skills/infra/sre-engineer/references/monitoring-alerting.md +424 -424
- package/skills/infra/sre-engineer/references/slo-sli-management.md +238 -238
- package/skills/infra/terraform-engineer/SKILL.md +143 -143
- package/skills/infra/terraform-engineer/references/best-practices.md +583 -583
- package/skills/infra/terraform-engineer/references/module-patterns.md +297 -297
- package/skills/infra/terraform-engineer/references/providers.md +452 -452
- package/skills/infra/terraform-engineer/references/state-management.md +371 -371
- package/skills/infra/terraform-engineer/references/testing.md +486 -486
- package/skills/infra/websocket-engineer/SKILL.md +168 -168
- package/skills/infra/websocket-engineer/references/alternatives.md +391 -391
- package/skills/infra/websocket-engineer/references/patterns.md +400 -400
- package/skills/infra/websocket-engineer/references/protocol.md +195 -195
- package/skills/infra/websocket-engineer/references/scaling.md +333 -333
- package/skills/infra/websocket-engineer/references/security.md +474 -474
- package/skills/java/java-architect/SKILL.md +132 -132
- package/skills/java/java-architect/references/jpa-optimization.md +393 -393
- package/skills/java/java-architect/references/reactive-webflux.md +356 -356
- package/skills/java/java-architect/references/spring-boot-setup.md +269 -269
- package/skills/java/java-architect/references/spring-security.md +445 -445
- package/skills/java/java-architect/references/testing-patterns.md +500 -500
- package/skills/java/kotlin-specialist/SKILL.md +147 -147
- package/skills/java/kotlin-specialist/references/android-compose.md +419 -419
- package/skills/java/kotlin-specialist/references/coroutines-flow.md +276 -276
- package/skills/java/kotlin-specialist/references/dsl-idioms.md +421 -421
- package/skills/java/kotlin-specialist/references/ktor-server.md +426 -426
- package/skills/java/kotlin-specialist/references/multiplatform-kmp.md +380 -380
- package/skills/java/spring-boot-engineer/SKILL.md +195 -195
- package/skills/java/spring-boot-engineer/references/cloud.md +498 -498
- package/skills/java/spring-boot-engineer/references/data.md +381 -381
- package/skills/java/spring-boot-engineer/references/security.md +459 -459
- package/skills/java/spring-boot-engineer/references/testing.md +545 -545
- package/skills/java/spring-boot-engineer/references/web.md +295 -295
- package/skills/javascript/javascript-pro/SKILL.md +132 -132
- package/skills/javascript/javascript-pro/references/async-patterns.md +334 -334
- package/skills/javascript/javascript-pro/references/browser-apis.md +398 -398
- package/skills/javascript/javascript-pro/references/modern-syntax.md +272 -272
- package/skills/javascript/javascript-pro/references/modules.md +357 -357
- package/skills/javascript/javascript-pro/references/node-essentials.md +471 -471
- package/skills/javascript/nestjs-expert/SKILL.md +206 -206
- package/skills/javascript/nestjs-expert/references/authentication.md +166 -166
- package/skills/javascript/nestjs-expert/references/controllers-routing.md +111 -111
- package/skills/javascript/nestjs-expert/references/dtos-validation.md +153 -153
- package/skills/javascript/nestjs-expert/references/migration-from-express.md +1237 -1237
- package/skills/javascript/nestjs-expert/references/services-di.md +140 -140
- package/skills/javascript/nestjs-expert/references/testing-patterns.md +186 -186
- package/skills/javascript/typescript-pro/SKILL.md +145 -145
- package/skills/javascript/typescript-pro/references/advanced-types.md +259 -259
- package/skills/javascript/typescript-pro/references/configuration.md +445 -445
- package/skills/javascript/typescript-pro/references/patterns.md +484 -484
- package/skills/javascript/typescript-pro/references/type-guards.md +352 -352
- package/skills/javascript/typescript-pro/references/utility-types.md +329 -329
- package/skills/php/laravel-specialist/SKILL.md +262 -262
- package/skills/php/laravel-specialist/references/eloquent.md +351 -351
- package/skills/php/laravel-specialist/references/livewire.md +512 -512
- package/skills/php/laravel-specialist/references/queues.md +423 -423
- package/skills/php/laravel-specialist/references/routing.md +362 -362
- package/skills/php/laravel-specialist/references/testing.md +522 -522
- package/skills/php/php-pro/SKILL.md +206 -206
- package/skills/php/php-pro/references/async-patterns.md +412 -412
- package/skills/php/php-pro/references/laravel-patterns.md +377 -377
- package/skills/php/php-pro/references/modern-php-features.md +323 -323
- package/skills/php/php-pro/references/symfony-patterns.md +466 -466
- package/skills/php/php-pro/references/testing-quality.md +466 -466
- package/skills/product/competitive-analysis/SKILL.md +257 -257
- package/skills/product/meeting-notes/SKILL.md +266 -266
- package/skills/product/prd-template/SKILL.md +150 -150
- package/skills/product/stakeholder-update/SKILL.md +225 -225
- package/skills/product/user-research-synthesis/SKILL.md +235 -235
- package/skills/python/django-expert/SKILL.md +162 -162
- package/skills/python/django-expert/references/authentication.md +145 -145
- package/skills/python/django-expert/references/drf-serializers.md +148 -148
- package/skills/python/django-expert/references/models-orm.md +151 -151
- package/skills/python/django-expert/references/testing-django.md +204 -204
- package/skills/python/django-expert/references/viewsets-views.md +153 -153
- package/skills/python/fastapi-expert/SKILL.md +185 -185
- package/skills/python/fastapi-expert/references/async-sqlalchemy.md +146 -146
- package/skills/python/fastapi-expert/references/authentication.md +159 -159
- package/skills/python/fastapi-expert/references/endpoints-routing.md +142 -142
- package/skills/python/fastapi-expert/references/migration-from-django.md +996 -996
- package/skills/python/fastapi-expert/references/pydantic-v2.md +135 -135
- package/skills/python/fastapi-expert/references/testing-async.md +159 -159
- package/skills/python/pandas-pro/SKILL.md +178 -178
- package/skills/python/pandas-pro/references/aggregation-groupby.md +545 -545
- package/skills/python/pandas-pro/references/data-cleaning.md +500 -500
- package/skills/python/pandas-pro/references/dataframe-operations.md +420 -420
- package/skills/python/pandas-pro/references/merging-joining.md +596 -596
- package/skills/python/pandas-pro/references/performance-optimization.md +597 -597
- package/skills/python/python-pro/SKILL.md +177 -177
- package/skills/python/python-pro/references/async-patterns.md +356 -356
- package/skills/python/python-pro/references/packaging.md +460 -460
- package/skills/python/python-pro/references/standard-library.md +378 -378
- package/skills/python/python-pro/references/testing.md +404 -404
- package/skills/python/python-pro/references/type-system.md +290 -290
- package/skills/quality/chaos-engineer/SKILL.md +182 -182
- package/skills/quality/chaos-engineer/references/chaos-tools.md +511 -511
- package/skills/quality/chaos-engineer/references/experiment-design.md +229 -229
- package/skills/quality/chaos-engineer/references/game-days.md +434 -434
- package/skills/quality/chaos-engineer/references/infrastructure-chaos.md +348 -348
- package/skills/quality/chaos-engineer/references/kubernetes-chaos.md +432 -432
- package/skills/quality/code-reviewer/SKILL.md +119 -119
- package/skills/quality/code-reviewer/references/common-issues.md +142 -142
- package/skills/quality/code-reviewer/references/feedback-examples.md +144 -144
- package/skills/quality/code-reviewer/references/receiving-feedback.md +238 -238
- package/skills/quality/code-reviewer/references/report-template.md +109 -109
- package/skills/quality/code-reviewer/references/review-checklist.md +88 -88
- package/skills/quality/code-reviewer/references/spec-compliance-review.md +258 -258
- package/skills/quality/playwright-expert/SKILL.md +169 -169
- package/skills/quality/playwright-expert/references/api-mocking.md +140 -140
- package/skills/quality/playwright-expert/references/configuration.md +155 -155
- package/skills/quality/playwright-expert/references/debugging-flaky.md +150 -150
- package/skills/quality/playwright-expert/references/page-object-model.md +152 -152
- package/skills/quality/playwright-expert/references/selectors-locators.md +119 -119
- package/skills/quality/secure-code-guardian/SKILL.md +191 -191
- package/skills/quality/secure-code-guardian/references/authentication.md +136 -136
- package/skills/quality/secure-code-guardian/references/input-validation.md +146 -146
- package/skills/quality/secure-code-guardian/references/owasp-prevention.md +135 -135
- package/skills/quality/secure-code-guardian/references/security-headers.md +133 -133
- package/skills/quality/secure-code-guardian/references/xss-csrf.md +157 -157
- package/skills/quality/security-reviewer/SKILL.md +103 -103
- package/skills/quality/security-reviewer/references/infrastructure-security.md +268 -268
- package/skills/quality/security-reviewer/references/penetration-testing.md +268 -268
- package/skills/quality/security-reviewer/references/report-template.md +170 -170
- package/skills/quality/security-reviewer/references/sast-tools.md +117 -117
- package/skills/quality/security-reviewer/references/secret-scanning.md +125 -125
- package/skills/quality/security-reviewer/references/vulnerability-patterns.md +152 -152
- package/skills/quality/senior-qa/README.md +196 -196
- package/skills/quality/senior-qa/SKILL.md +399 -399
- package/skills/quality/senior-qa/references/qa_best_practices.md +964 -964
- package/skills/quality/senior-qa/references/test_automation_patterns.md +1009 -1009
- package/skills/quality/senior-qa/references/testing_strategies.md +649 -649
- package/skills/quality/senior-qa/scripts/coverage_analyzer.py +836 -836
- package/skills/quality/senior-qa/scripts/e2e_test_scaffolder.py +820 -820
- package/skills/quality/senior-qa/scripts/test_suite_generator.py +605 -605
- package/skills/quality/tdd-guide/HOW_TO_USE.md +313 -313
- package/skills/quality/tdd-guide/README.md +680 -680
- package/skills/quality/tdd-guide/SKILL.md +122 -122
- package/skills/quality/tdd-guide/assets/expected_output.json +77 -77
- package/skills/quality/tdd-guide/assets/sample_input_python.json +39 -39
- package/skills/quality/tdd-guide/assets/sample_input_typescript.json +36 -36
- package/skills/quality/tdd-guide/references/ci-integration.md +195 -195
- package/skills/quality/tdd-guide/references/framework-guide.md +206 -206
- package/skills/quality/tdd-guide/references/tdd-best-practices.md +128 -128
- package/skills/quality/tdd-guide/scripts/coverage_analyzer.py +434 -434
- package/skills/quality/tdd-guide/scripts/fixture_generator.py +440 -440
- package/skills/quality/tdd-guide/scripts/format_detector.py +384 -384
- package/skills/quality/tdd-guide/scripts/framework_adapter.py +428 -428
- package/skills/quality/tdd-guide/scripts/metrics_calculator.py +456 -456
- package/skills/quality/tdd-guide/scripts/output_formatter.py +354 -354
- package/skills/quality/tdd-guide/scripts/tdd_workflow.py +474 -474
- package/skills/quality/tdd-guide/scripts/test_generator.py +438 -438
- package/skills/quality/test-master/SKILL.md +94 -94
- package/skills/quality/test-master/references/automation-frameworks.md +294 -294
- package/skills/quality/test-master/references/e2e-testing.md +128 -128
- package/skills/quality/test-master/references/integration-testing.md +120 -120
- package/skills/quality/test-master/references/performance-testing.md +118 -118
- package/skills/quality/test-master/references/qa-methodology.md +247 -247
- package/skills/quality/test-master/references/security-testing.md +127 -127
- package/skills/quality/test-master/references/tdd-iron-laws.md +174 -174
- package/skills/quality/test-master/references/test-reports.md +104 -104
- package/skills/quality/test-master/references/testing-anti-patterns.md +231 -231
- package/skills/quality/test-master/references/unit-testing.md +113 -113
- package/skills/ruby/rails-expert/SKILL.md +154 -154
- package/skills/ruby/rails-expert/references/active-record.md +244 -244
- package/skills/ruby/rails-expert/references/api-development.md +401 -401
- package/skills/ruby/rails-expert/references/background-jobs.md +272 -272
- package/skills/ruby/rails-expert/references/hotwire-turbo.md +228 -228
- package/skills/ruby/rails-expert/references/rspec-testing.md +367 -367
- package/skills/swift/swift-expert/SKILL.md +163 -163
- package/skills/swift/swift-expert/references/async-concurrency.md +360 -360
- package/skills/swift/swift-expert/references/memory-performance.md +377 -377
- package/skills/swift/swift-expert/references/protocol-oriented.md +354 -354
- package/skills/swift/swift-expert/references/swiftui-patterns.md +291 -291
- package/skills/swift/swift-expert/references/testing-patterns.md +399 -399
- package/skills/workflow/brainstorming/SKILL.md +164 -164
- package/skills/workflow/brainstorming/scripts/frame-template.html +214 -214
- package/skills/workflow/brainstorming/scripts/helper.js +88 -88
- package/skills/workflow/brainstorming/scripts/server.cjs +354 -354
- package/skills/workflow/brainstorming/scripts/start-server.sh +148 -148
- package/skills/workflow/brainstorming/scripts/stop-server.sh +56 -56
- package/skills/workflow/brainstorming/spec-document-reviewer-prompt.md +49 -49
- package/skills/workflow/brainstorming/visual-companion.md +287 -287
- package/skills/workflow/documentation/SKILL.md +45 -45
- package/skills/workflow/entropy-management/SKILL.md +115 -115
- package/skills/workflow/executing-plans/SKILL.md +70 -70
- package/skills/workflow/finishing-a-development-branch/SKILL.md +200 -200
- package/skills/workflow/receiving-code-review/SKILL.md +213 -213
- package/skills/workflow/requesting-code-review/SKILL.md +105 -105
- package/skills/workflow/requesting-code-review/code-reviewer.md +146 -146
- package/skills/workflow/requirement-engineering/SKILL.md +111 -111
- package/skills/workflow/systematic-debugging/CREATION-LOG.md +119 -119
- package/skills/workflow/systematic-debugging/SKILL.md +296 -296
- package/skills/workflow/systematic-debugging/condition-based-waiting-example.ts +158 -158
- package/skills/workflow/systematic-debugging/condition-based-waiting.md +115 -115
- package/skills/workflow/systematic-debugging/defense-in-depth.md +122 -122
- package/skills/workflow/systematic-debugging/find-polluter.sh +63 -63
- package/skills/workflow/systematic-debugging/root-cause-tracing.md +169 -169
- package/skills/workflow/systematic-debugging/test-academic.md +14 -14
- package/skills/workflow/systematic-debugging/test-pressure-1.md +58 -58
- package/skills/workflow/systematic-debugging/test-pressure-2.md +68 -68
- package/skills/workflow/systematic-debugging/test-pressure-3.md +69 -69
- package/skills/workflow/using-git-worktrees/SKILL.md +218 -218
- package/skills/workflow/verification-before-completion/SKILL.md +139 -139
- package/skills/workflow/writing-plans/SKILL.md +151 -151
- package/skills/workflow/writing-plans/plan-document-reviewer-prompt.md +49 -49
- package/skills/workflow/writing-skills/SKILL.md +655 -655
- package/skills/workflow/writing-skills/anthropic-best-practices.md +1150 -1150
- package/skills/workflow/writing-skills/examples/CLAUDE_MD_TESTING.md +189 -189
- package/skills/workflow/writing-skills/persuasion-principles.md +187 -187
- package/skills/workflow/writing-skills/render-graphs.js +168 -168
- package/skills/workflow/writing-skills/testing-skills-with-subagents.md +384 -384
@@ -1,434 +1,434 @@
-# Game Day Planning & Execution
-
-## Game Day Planning Template
-
-```yaml
-game_day:
-  name: "Database Failover Drill"
-  date: "2025-01-15"
-  time: "10:00-12:00 PST"
-  environment: "staging" # Start in staging
-
-  objectives:
-    - "Verify RDS failover to standby in under 2 minutes"
-    - "Validate application auto-reconnect logic"
-    - "Test monitoring and alerting effectiveness"
-    - "Practice incident response procedures"
-
-  participants:
-    facilitator: "chaos-engineer@company.com"
-    observers:
-      - "sre-team@company.com"
-      - "dev-team@company.com"
-    responders:
-      - "on-call-engineer@company.com"
-      - "database-admin@company.com"
-    stakeholders:
-      - "engineering-manager@company.com"
-
-  scenarios:
-    - name: "Primary database instance failure"
-      duration_minutes: 30
-      steps:
-        - action: "Force RDS instance reboot with failover"
-          expected: "Failover to standby in <2 min"
-      success_criteria:
-        - "Downtime < 2 minutes"
-        - "No data loss"
-        - "Alerts fired correctly"
-
-    - name: "Network partition to database"
-      duration_minutes: 20
-      steps:
-        - action: "Block network traffic to RDS security group"
-          expected: "Application switches to read replica"
-      success_criteria:
-        - "Read-only mode activated"
-        - "User-facing error messages clear"
-
-  communication_plan:
-    announcement_channel: "#game-day-announcements"
-    war_room: "Zoom link: https://..."
-    status_updates_every: "5 minutes"
-    escalation_contacts:
-      - name: "VP Engineering"
-        phone: "+1-555-0100"
-        threshold: "downtime > 5 minutes"
-
-  rollback_plan:
-    automatic_rollback_triggers:
-      - "production traffic affected"
-      - "customer complaints received"
-      - "error_rate > 10%"
-    manual_rollback_command: "aws rds reboot-db-instance --db-instance-identifier primary --force-failover"
-    rollback_time_limit_seconds: 60
-
-  success_metrics:
-    - metric: "RTO (Recovery Time Objective)"
-      target: "< 2 minutes"
-      measurement: "time between failure and full recovery"
-    - metric: "Alert accuracy"
-      target: "100%"
-      measurement: "all expected alerts fired"
-    - metric: "Team response time"
-      target: "< 5 minutes"
-      measurement: "time to acknowledge incident"
-
-  post_mortem:
-    scheduled_for: "2025-01-16 14:00"
-    template: "game-day-retro.md"
-    required_attendees: "all participants"
-```
-
-## Game Day Runbook
-
-```markdown
-# Database Failover Game Day Runbook
-
-**Date**: January 15, 2025
-**Duration**: 2 hours
-**Environment**: Staging
-
-## Pre-Game Checklist (T-30 min)
-
-- [ ] Verify all participants joined war room
-- [ ] Confirm monitoring dashboards accessible
-- [ ] Test rollback procedures work
-- [ ] Announce game day start in #engineering
-- [ ] Verify staging environment healthy
-- [ ] Set up screen recording for timeline
-- [ ] Prepare incident timeline spreadsheet
-
-## Timeline
-
-### 10:00 - Introduction (10 min)
-- Facilitator explains objectives
-- Review scenarios and success criteria
-- Confirm roles and communication channels
-- Remind everyone: this is a learning exercise
-
-### 10:10 - Scenario 1: Primary DB Failure (30 min)
-
-**T+0 (10:10)** - Inject failure
-```bash
-aws rds reboot-db-instance \
-  --db-instance-identifier staging-primary \
-  --force-failover
-```
-
-**Expected Timeline**:
-- T+0: Reboot initiated
-- T+30s: Primary becomes unavailable
-- T+60s: DNS updated to standby
-- T+90s: Application reconnects
-- T+120s: Full recovery
-
-**Observer Tasks**:
-- [ ] Record exact time of failure injection
-- [ ] Monitor application error logs
-- [ ] Track alert notifications
-- [ ] Document team response actions
-- [ ] Screenshot dashboard states
-
-**Questions to Answer**:
-- How long until first alert?
-- Did application auto-reconnect?
-- Were customers impacted?
-- What manual interventions needed?
-
-### 10:40 - Debrief Scenario 1 (10 min)
-- What went well?
-- What could improve?
-- Any surprises?
-- Action items identified
-
-### 10:50 - Scenario 2: Network Partition (20 min)
-
-**T+0 (10:50)** - Inject failure
-```bash
-# Block database security group ingress
-aws ec2 revoke-security-group-ingress \
-  --group-id sg-xxxxx \
-  --protocol tcp \
-  --port 5432 \
-  --cidr 10.0.0.0/16
-```
-
-**Expected Behavior**:
-- Connection timeouts occur
-- Circuit breaker opens
-- Read-only mode activates
-- Clear error messages shown
-
-**Observer Tasks**:
-- [ ] Monitor circuit breaker state
-- [ ] Verify read-replica failover
-- [ ] Check user-facing error messages
-- [ ] Track degraded service duration
-
-### 11:10 - Debrief Scenario 2 (10 min)
-
-### 11:20 - Scenario 3: Surprise! (20 min)
-
-**Facilitator Note**: Don't announce this scenario details beforehand.
-Test true incident response capability.
-
-**Hidden Scenario**: Combination failure
-1. Database connection pool leak
-2. Simultaneous cache invalidation
-
-```python
-# Connection leak simulator
-import psycopg2
-connections = []
-for i in range(100):
-    conn = psycopg2.connect(DATABASE_URL)
-    connections.append(conn)
-    # Intentionally don't close
-```
-
-**Observer Tasks**:
-- [ ] How long to identify root cause?
-- [ ] Communication effectiveness
-- [ ] Cross-team coordination
-- [ ] Escalation decisions
-
-### 11:40 - Final Debrief & Wrap-up (20 min)
-
-**Debrief Questions**:
-1. What worked well?
-2. What didn't work?
-3. What surprised us?
-4. What are our top 3 action items?
-5. When should we run this again?
-
-## Post-Game Checklist
-
-- [ ] Restore all services to normal state
-- [ ] Verify no lingering issues
-- [ ] Collect all observer notes
-- [ ] Export metrics and dashboards
-- [ ] Schedule post-mortem meeting
-- [ ] Send thank-you to participants
-- [ ] Create action item tickets
-- [ ] Update runbooks based on learnings
-```
-
-## Game Day Observation Template
-
-```python
-from dataclasses import dataclass, field
-from datetime import datetime
-from typing import List
-
-@dataclass
-class GameDayObservation:
-    timestamp: datetime
-    observer: str
-    scenario: str
-    observation: str
-    category: str # technical, process, communication, surprise
-    severity: str # info, concern, critical
-    photo_url: str = ""
-
-@dataclass
-class GameDayMetrics:
-    scenario_name: str
-    start_time: datetime
-    end_time: datetime
-
-    # Technical metrics
-    time_to_detect_seconds: float
-    time_to_respond_seconds: float
-    time_to_recover_seconds: float
-    error_rate_peak: float
-    alerts_fired: List[str] = field(default_factory=list)
-    alerts_missed: List[str] = field(default_factory=list)
-
-    # Team metrics
-    responders_involved: int
-    escalations_needed: int
-    communication_gaps: List[str] = field(default_factory=list)
-
-    # Success criteria
-    met_rto: bool = False
-    met_rpo: bool = False
-    zero_customer_impact: bool = False
-
-    def calculate_mttr(self) -> float:
-        """Mean Time To Recovery"""
-        return (self.end_time - self.start_time).total_seconds()
-
-    def success_rate(self) -> float:
-        """Percentage of success criteria met"""
-        criteria = [
-            self.met_rto,
-            self.met_rpo,
-            self.zero_customer_impact,
-            len(self.alerts_missed) == 0
-        ]
-        return sum(criteria) / len(criteria) * 100
-
-# Example usage
-metrics = GameDayMetrics(
-    scenario_name="Database Failover",
-    start_time=datetime(2025, 1, 15, 10, 10, 0),
-    end_time=datetime(2025, 1, 15, 10, 12, 30),
-    time_to_detect_seconds=15.0,
-    time_to_respond_seconds=45.0,
-    time_to_recover_seconds=150.0,
-    error_rate_peak=0.05,
-    alerts_fired=["DatabaseConnectionError", "HighLatency"],
-    alerts_missed=["FailoverInitiated"],
-    responders_involved=3,
-    escalations_needed=0,
-    met_rto=True,
-    met_rpo=True,
-    zero_customer_impact=True
-)
-
-print(f"MTTR: {metrics.calculate_mttr()}s")
-print(f"Success Rate: {metrics.success_rate()}%")
-```
-
-## Surprise Scenarios Library
-
-```yaml
-# Keep these secret until game day!
-surprise_scenarios:
-  - name: "Cascading Failure"
-    description: "Primary failure triggers secondary issue"
-    injection:
-      - "Database failover (expected)"
-      - "Cache eviction due to new primary IP (surprise!)"
-    learning_goals:
-      - "Do we understand our dependencies?"
-      - "Can we handle multiple simultaneous issues?"
-
-  - name: "Monitoring Blind Spot"
-    description: "Failure that doesn't trigger alerts"
-    injection:
-      - "Gradual connection pool leak"
-      - "No immediate alerts fire"
-    learning_goals:
-      - "How do we discover issues without alerts?"
-      - "Do we have adequate monitoring coverage?"
-
-  - name: "Documentation Failure"
-    description: "Runbook is outdated or incorrect"
-    setup:
-      - "Modify runbook to have incorrect commands"
-      - "Or remove runbook entirely"
-    learning_goals:
-      - "Can team problem-solve without docs?"
-      - "How quickly can we update documentation?"
-
-  - name: "Key Person Unavailable"
-    description: "Subject matter expert is unreachable"
-    setup:
-      - "Ask SME to not respond for 15 minutes"
-    learning_goals:
-      - "Is knowledge properly distributed?"
-      - "Can team succeed without specific person?"
-
-  - name: "Partial Degradation"
-    description: "Service works but slowly"
-    injection:
-      - "Add 5 second latency instead of complete failure"
-    learning_goals:
-      - "Do we detect performance degradation?"
-      - "What are our latency SLOs?"
-```
-
-## Post-Game Report Template
-
-```markdown
-# Game Day Report: Database Failover
-
-**Date**: January 15, 2025
-**Participants**: 12
-**Duration**: 2 hours
-**Environment**: Staging
-
-## Executive Summary
-
-Conducted database failover game day to test RDS high availability and
-application resilience. Successfully failed over database in 2.5 minutes
-(target: 2 min). Discovered 3 critical gaps in monitoring and 2 process
-improvements needed.
-
-## Metrics
-
-| Metric | Target | Actual | Status |
-|--------|--------|--------|--------|
-| Time to Detect | < 30s | 15s | PASS |
-| Time to Respond | < 5min | 4min 20s | PASS |
-| Time to Recover | < 2min | 2min 30s | FAIL |
-| Alert Accuracy | 100% | 66% | FAIL |
-| Zero Customer Impact | Yes | Yes | PASS |
-
-## What Went Well
-
-1. Team responded quickly (4m 20s vs 5m target)
-2. Runbooks were accurate and helpful
-3. Communication was clear and frequent
-4. No customer impact during any scenario
-5. Application auto-reconnect worked perfectly
-
-## What Didn't Go Well
-
-1. Missing alert for failover initiation
-2. Took 30s longer than target to recover
-3. Connection pool exhaustion not detected
-4. Dashboard didn't show replica lag clearly
-5. Escalation contacts list was outdated
-
-## Surprises
-
-1. Cache invalidation cascaded from DB failover (unexpected)
-2. Read replica had 45s replication lag we didn't know about
-3. Application retried too aggressively during failover
-4. Team found a workaround we hadn't documented
-
-## Action Items
-
-| Action | Owner | Due Date | Priority |
-|--------|-------|----------|----------|
-| Add alert for RDS failover events | @sre-team | Jan 20 | P0 |
-| Update dashboard with replica lag | @platform | Jan 22 | P1 |
-| Document cache invalidation behavior | @dev-team | Jan 25 | P1 |
-| Add connection pool monitoring | @sre-team | Jan 27 | P0 |
-| Update escalation contact list | @manager | Jan 18 | P2 |
-| Tune application retry backoff | @dev-team | Feb 1 | P1 |
-
-## Lessons Learned
-
-1. **Monitoring Gaps**: We had blind spots in replica monitoring
-2. **Cascading Effects**: DB changes affect cache in non-obvious ways
-3. **Team Knowledge**: Cross-training is working well
-4. **Documentation**: Runbooks saved time, keep them updated
-
-## Next Game Day
-
-**Proposed Date**: March 15, 2025
-**Scenario**: Multi-region failover
-**Scope**: Production (with safeguards)
-
-## Appendix
-
-- Full timeline spreadsheet: [link]
-- Screen recordings: [link]
-- Metrics dashboard export: [link]
-- Raw observation notes: [link]
-```
-
-## Quick Reference
-
-| Phase | Duration | Key Activities |
-|-------|----------|----------------|
-| Planning | 2 weeks | Define scenarios, invite participants |
-| Pre-game | 30 min | Setup, verify environment, brief team |
-| Execution | 2 hours | Run scenarios, observe, document |
-| Debrief | 30 min | Immediate learnings, quick wins |
-| Post-mortem | 1 week later | Detailed analysis, action items |
-| Follow-up | 1 month | Verify improvements, plan next game day |
+# Game Day Planning & Execution
+
+## Game Day Planning Template
+
+```yaml
+game_day:
+  name: "Database Failover Drill"
+  date: "2025-01-15"
+  time: "10:00-12:00 PST"
+  environment: "staging" # Start in staging
+
+  objectives:
+    - "Verify RDS failover to standby in under 2 minutes"
+    - "Validate application auto-reconnect logic"
+    - "Test monitoring and alerting effectiveness"
+    - "Practice incident response procedures"
+
+  participants:
+    facilitator: "chaos-engineer@company.com"
+    observers:
+      - "sre-team@company.com"
+      - "dev-team@company.com"
+    responders:
+      - "on-call-engineer@company.com"
+      - "database-admin@company.com"
+    stakeholders:
+      - "engineering-manager@company.com"
+
+  scenarios:
+    - name: "Primary database instance failure"
+      duration_minutes: 30
+      steps:
+        - action: "Force RDS instance reboot with failover"
+          expected: "Failover to standby in <2 min"
+      success_criteria:
+        - "Downtime < 2 minutes"
+        - "No data loss"
+        - "Alerts fired correctly"
+
+    - name: "Network partition to database"
+      duration_minutes: 20
+      steps:
+        - action: "Block network traffic to RDS security group"
+          expected: "Application switches to read replica"
+      success_criteria:
+        - "Read-only mode activated"
+        - "User-facing error messages clear"
+
+  communication_plan:
+    announcement_channel: "#game-day-announcements"
+    war_room: "Zoom link: https://..."
+    status_updates_every: "5 minutes"
+    escalation_contacts:
+      - name: "VP Engineering"
+        phone: "+1-555-0100"
+        threshold: "downtime > 5 minutes"
+
+  rollback_plan:
+    automatic_rollback_triggers:
+      - "production traffic affected"
+      - "customer complaints received"
+      - "error_rate > 10%"
+    manual_rollback_command: "aws rds reboot-db-instance --db-instance-identifier primary --force-failover"
+    rollback_time_limit_seconds: 60
+
+  success_metrics:
+    - metric: "RTO (Recovery Time Objective)"
+      target: "< 2 minutes"
+      measurement: "time between failure and full recovery"
+    - metric: "Alert accuracy"
+      target: "100%"
+      measurement: "all expected alerts fired"
+    - metric: "Team response time"
+      target: "< 5 minutes"
+      measurement: "time to acknowledge incident"
+
+  post_mortem:
+    scheduled_for: "2025-01-16 14:00"
+    template: "game-day-retro.md"
+    required_attendees: "all participants"
+```
+
+## Game Day Runbook
+
+```markdown
+# Database Failover Game Day Runbook
+
+**Date**: January 15, 2025
+**Duration**: 2 hours
+**Environment**: Staging
+
+## Pre-Game Checklist (T-30 min)
+
+- [ ] Verify all participants joined war room
+- [ ] Confirm monitoring dashboards accessible
+- [ ] Test rollback procedures work
+- [ ] Announce game day start in #engineering
+- [ ] Verify staging environment healthy
+- [ ] Set up screen recording for timeline
+- [ ] Prepare incident timeline spreadsheet
+
+## Timeline
+
+### 10:00 - Introduction (10 min)
+- Facilitator explains objectives
+- Review scenarios and success criteria
+- Confirm roles and communication channels
+- Remind everyone: this is a learning exercise
+
+### 10:10 - Scenario 1: Primary DB Failure (30 min)
+
+**T+0 (10:10)** - Inject failure
+```bash
+aws rds reboot-db-instance \
+  --db-instance-identifier staging-primary \
+  --force-failover
+```
+
+**Expected Timeline**:
+- T+0: Reboot initiated
+- T+30s: Primary becomes unavailable
+- T+60s: DNS updated to standby
+- T+90s: Application reconnects
+- T+120s: Full recovery
+
+**Observer Tasks**:
+- [ ] Record exact time of failure injection
+- [ ] Monitor application error logs
+- [ ] Track alert notifications
+- [ ] Document team response actions
+- [ ] Screenshot dashboard states
+
+**Questions to Answer**:
+- How long until first alert?
+- Did application auto-reconnect?
+- Were customers impacted?
+- What manual interventions needed?
+
+### 10:40 - Debrief Scenario 1 (10 min)
+- What went well?
+- What could improve?
+- Any surprises?
+- Action items identified
+
+### 10:50 - Scenario 2: Network Partition (20 min)
+
+**T+0 (10:50)** - Inject failure
+```bash
+# Block database security group ingress
+aws ec2 revoke-security-group-ingress \
+  --group-id sg-xxxxx \
+  --protocol tcp \
+  --port 5432 \
+  --cidr 10.0.0.0/16
|
|
155
|
+
```
|
|
156
|
+
|
|
157
|
+
**Expected Behavior**:
|
|
158
|
+
- Connection timeouts occur
|
|
159
|
+
- Circuit breaker opens
|
|
160
|
+
- Read-only mode activates
|
|
161
|
+
- Clear error messages shown
|
|
162
|
+
|
|
163
|
+
**Observer Tasks**:
|
|
164
|
+
- [ ] Monitor circuit breaker state
|
|
165
|
+
- [ ] Verify read-replica failover
|
|
166
|
+
- [ ] Check user-facing error messages
|
|
167
|
+
- [ ] Track degraded service duration
|
|
168
|
+
|
|
169
|
+
### 11:10 - Debrief Scenario 2 (10 min)
|
|
170
|
+
|
|
171
|
+
### 11:20 - Scenario 3: Surprise! (20 min)
|
|
172
|
+
|
|
173
|
+
**Facilitator Note**: Don't announce this scenario details beforehand.
|
|
174
|
+
Test true incident response capability.
|
|
175
|
+
|
|
176
|
+
**Hidden Scenario**: Combination failure
|
|
177
|
+
1. Database connection pool leak
|
|
178
|
+
2. Simultaneous cache invalidation
|
|
179
|
+
|
|
180
|
+
```python
|
|
181
|
+
# Connection leak simulator
|
|
182
|
+
import psycopg2
|
|
183
|
+
connections = []
|
|
184
|
+
for i in range(100):
|
|
185
|
+
conn = psycopg2.connect(DATABASE_URL)
|
|
186
|
+
connections.append(conn)
|
|
187
|
+
# Intentionally don't close
|
|
188
|
+
```
|
|
189
|
+
|
|
190
|
+
**Observer Tasks**:
|
|
191
|
+
- [ ] How long to identify root cause?
|
|
192
|
+
- [ ] Communication effectiveness
|
|
193
|
+
- [ ] Cross-team coordination
|
|
194
|
+
- [ ] Escalation decisions
|
|
195
|
+
|
|
196
|
+
### 11:40 - Final Debrief & Wrap-up (20 min)
|
|
197
|
+
|
|
198
|
+
**Debrief Questions**:
|
|
199
|
+
1. What worked well?
|
|
200
|
+
2. What didn't work?
|
|
201
|
+
3. What surprised us?
|
|
202
|
+
4. What are our top 3 action items?
|
|
203
|
+
5. When should we run this again?
|
|
204
|
+
|
|
205
|
+
## Post-Game Checklist
|
|
206
|
+
|
|
207
|
+
- [ ] Restore all services to normal state
|
|
208
|
+
- [ ] Verify no lingering issues
|
|
209
|
+
- [ ] Collect all observer notes
|
|
210
|
+
- [ ] Export metrics and dashboards
|
|
211
|
+
- [ ] Schedule post-mortem meeting
|
|
212
|
+
- [ ] Send thank-you to participants
|
|
213
|
+
- [ ] Create action item tickets
|
|
214
|
+
- [ ] Update runbooks based on learnings
|
|
215
|
+
```
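
Observers can compare what they record against the Scenario 1 expected timeline from the runbook. The sketch below is one way to do that, assuming observers note a wall-clock time per milestone; `EXPECTED_OFFSETS` copies the runbook's T+ values, while `timeline_drift` and the sample timestamps are hypothetical and purely illustrative.

```python
from datetime import datetime

# Expected offsets (seconds after injection) from the Scenario 1 timeline
EXPECTED_OFFSETS = {
    "Reboot initiated": 0,
    "Primary becomes unavailable": 30,
    "DNS updated to standby": 60,
    "Application reconnects": 90,
    "Full recovery": 120,
}


def timeline_drift(injected_at: datetime, observed: dict[str, datetime]) -> dict[str, float]:
    """Return seconds late (positive) or early (negative) per observed milestone."""
    drift = {}
    for milestone, seen_at in observed.items():
        actual_offset = (seen_at - injected_at).total_seconds()
        drift[milestone] = actual_offset - EXPECTED_OFFSETS[milestone]
    return drift


# Illustrative observer notes, not real measurements
injected = datetime(2025, 1, 15, 10, 10, 0)
observed = {
    "Primary becomes unavailable": datetime(2025, 1, 15, 10, 10, 25),
    "Full recovery": datetime(2025, 1, 15, 10, 12, 30),
}
for milestone, seconds in timeline_drift(injected, observed).items():
    print(f"{milestone}: {seconds:+.0f}s vs expected")
```

A positive drift on "Full recovery" is the number to carry into the RTO row of the post-game report.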

## Game Day Observation Template

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class GameDayObservation:
    timestamp: datetime
    observer: str
    scenario: str
    observation: str
    category: str  # technical, process, communication, surprise
    severity: str  # info, concern, critical
    photo_url: str = ""


@dataclass
class GameDayMetrics:
    scenario_name: str
    start_time: datetime
    end_time: datetime

    # Technical metrics
    time_to_detect_seconds: float
    time_to_respond_seconds: float
    time_to_recover_seconds: float
    error_rate_peak: float
    alerts_fired: List[str] = field(default_factory=list)
    alerts_missed: List[str] = field(default_factory=list)

    # Team metrics (defaults are needed here: dataclass fields without
    # defaults cannot follow fields that have them)
    responders_involved: int = 0
    escalations_needed: int = 0
    communication_gaps: List[str] = field(default_factory=list)

    # Success criteria
    met_rto: bool = False
    met_rpo: bool = False
    zero_customer_impact: bool = False

    def calculate_mttr(self) -> float:
        """Time to recovery for this scenario, in seconds (one sample toward MTTR)"""
        return (self.end_time - self.start_time).total_seconds()

    def success_rate(self) -> float:
        """Percentage of success criteria met"""
        criteria = [
            self.met_rto,
            self.met_rpo,
            self.zero_customer_impact,
            len(self.alerts_missed) == 0
        ]
        return sum(criteria) / len(criteria) * 100


# Example usage
metrics = GameDayMetrics(
    scenario_name="Database Failover",
    start_time=datetime(2025, 1, 15, 10, 10, 0),
    end_time=datetime(2025, 1, 15, 10, 12, 30),
    time_to_detect_seconds=15.0,
    time_to_respond_seconds=45.0,
    time_to_recover_seconds=150.0,
    error_rate_peak=0.05,
    alerts_fired=["DatabaseConnectionError", "HighLatency"],
    alerts_missed=["FailoverInitiated"],
    responders_involved=3,
    escalations_needed=0,
    met_rto=False,  # 150 s recovery exceeds the 2-minute RTO target
    met_rpo=True,
    zero_customer_impact=True
)

print(f"MTTR: {metrics.calculate_mttr()}s")
print(f"Success Rate: {metrics.success_rate()}%")
```
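
The `alerts_fired` and `alerts_missed` lists also give an alert accuracy figure, and recovery times from several scenarios can be averaged into a true mean. This is a small sketch, assuming `alerts_fired` contains only alerts that were expected to fire; `alert_accuracy` and `mean_time_to_recover` are hypothetical helpers, and the second recovery time in the usage lines is made up for illustration.

```python
from statistics import mean


def alert_accuracy(fired: list[str], missed: list[str]) -> float:
    """Percentage of expected alerts that actually fired."""
    expected = len(fired) + len(missed)
    if expected == 0:
        return 100.0  # nothing was expected, so nothing was missed
    return len(fired) / expected * 100


def mean_time_to_recover(recovery_seconds: list[float]) -> float:
    """Average recovery time across several scenarios, in seconds."""
    return mean(recovery_seconds)


accuracy = alert_accuracy(["DatabaseConnectionError", "HighLatency"], ["FailoverInitiated"])
print(f"Alert accuracy: {accuracy:.1f}%")
print(f"Mean recovery: {mean_time_to_recover([150.0, 90.0])}s")
```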

## Surprise Scenarios Library

```yaml
# Keep these secret until game day!
surprise_scenarios:
  - name: "Cascading Failure"
    description: "Primary failure triggers secondary issue"
    injection:
      - "Database failover (expected)"
      - "Cache eviction due to new primary IP (surprise!)"
    learning_goals:
      - "Do we understand our dependencies?"
      - "Can we handle multiple simultaneous issues?"

  - name: "Monitoring Blind Spot"
    description: "Failure that doesn't trigger alerts"
    injection:
      - "Gradual connection pool leak"
      - "No immediate alerts fire"
    learning_goals:
      - "How do we discover issues without alerts?"
      - "Do we have adequate monitoring coverage?"

  - name: "Documentation Failure"
    description: "Runbook is outdated or incorrect"
    setup:
      - "Modify runbook to have incorrect commands"
      - "Or remove runbook entirely"
    learning_goals:
      - "Can team problem-solve without docs?"
      - "How quickly can we update documentation?"

  - name: "Key Person Unavailable"
    description: "Subject matter expert is unreachable"
    setup:
      - "Ask SME to not respond for 15 minutes"
    learning_goals:
      - "Is knowledge properly distributed?"
      - "Can team succeed without specific person?"

  - name: "Partial Degradation"
    description: "Service works but slowly"
    injection:
      - "Add 5 second latency instead of complete failure"
    learning_goals:
      - "Do we detect performance degradation?"
      - "What are our latency SLOs?"
```
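
A facilitator may want to sanity-check the library and draw a scenario without reading the entries aloud. The sketch below assumes the YAML has already been loaded into a list of dictionaries (two entries are inlined here to keep it self-contained); `validate_scenario` and `pick_scenario` are hypothetical helpers.

```python
import random

REQUIRED_KEYS = {"name", "description", "learning_goals"}


def validate_scenario(scenario: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry is well formed."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(scenario))]
    if "injection" not in scenario and "setup" not in scenario:
        problems.append("needs either 'injection' or 'setup'")
    return problems


def pick_scenario(scenarios: list[dict], rng: random.Random) -> dict:
    """Choose one well-formed scenario at random."""
    valid = [s for s in scenarios if not validate_scenario(s)]
    return rng.choice(valid)


library = [
    {
        "name": "Partial Degradation",
        "description": "Service works but slowly",
        "injection": ["Add 5 second latency instead of complete failure"],
        "learning_goals": ["Do we detect performance degradation?", "What are our latency SLOs?"],
    },
    {
        "name": "Key Person Unavailable",
        "description": "Subject matter expert is unreachable",
        "setup": ["Ask SME to not respond for 15 minutes"],
        "learning_goals": ["Is knowledge properly distributed?"],
    },
]

chosen = pick_scenario(library, random.Random())
print(chosen["name"])
```

Passing the random generator in explicitly keeps the draw reproducible when a seed is supplied, which helps when rehearsing the facilitation.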

## Post-Game Report Template

```markdown
# Game Day Report: Database Failover

**Date**: January 15, 2025
**Participants**: 12
**Duration**: 2 hours
**Environment**: Staging

## Executive Summary

We conducted a database failover game day to test RDS high availability and
application resilience. The database failed over in 2.5 minutes, missing the
2-minute target. The exercise uncovered 3 critical monitoring gaps and 2
needed process improvements.

## Metrics

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Time to Detect | < 30s | 15s | PASS |
| Time to Respond | < 5min | 4min 20s | PASS |
| Time to Recover | < 2min | 2min 30s | FAIL |
| Alert Accuracy | 100% | 66% | FAIL |
| Zero Customer Impact | Yes | Yes | PASS |

## What Went Well

1. Team responded quickly (4min 20s vs 5min target)
2. Runbooks were accurate and helpful
3. Communication was clear and frequent
4. No customer impact during any scenario
5. Application auto-reconnect worked perfectly

## What Didn't Go Well

1. Missing alert for failover initiation
2. Recovery took 30s longer than target
3. Connection pool exhaustion not detected
4. Dashboard didn't show replica lag clearly
5. Escalation contacts list was outdated

## Surprises

1. Cache invalidation cascaded from DB failover (unexpected)
2. Read replica had 45s replication lag we didn't know about
3. Application retried too aggressively during failover
4. Team found a workaround we hadn't documented

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add alert for RDS failover events | @sre-team | Jan 20 | P0 |
| Update dashboard with replica lag | @platform | Jan 22 | P1 |
| Document cache invalidation behavior | @dev-team | Jan 25 | P1 |
| Add connection pool monitoring | @sre-team | Jan 27 | P0 |
| Update escalation contact list | @manager | Jan 18 | P2 |
| Tune application retry backoff | @dev-team | Feb 1 | P1 |

## Lessons Learned

1. **Monitoring Gaps**: We had blind spots in replica monitoring
2. **Cascading Effects**: DB changes affect cache in non-obvious ways
3. **Team Knowledge**: Cross-training is working well
4. **Documentation**: Runbooks saved time; keep them updated

## Next Game Day

**Proposed Date**: March 15, 2025
**Scenario**: Multi-region failover
**Scope**: Production (with safeguards)

## Appendix

- Full timeline spreadsheet: [link]
- Screen recordings: [link]
- Metrics dashboard export: [link]
- Raw observation notes: [link]
```
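
The PASS/FAIL column for the time-based rows of the Metrics table can be derived rather than filled in by hand. The sketch below assumes every target is a strict upper bound in seconds, as the "<" targets in the template suggest; `status` and `metric_row` are hypothetical helpers, and values are printed in seconds rather than the mixed units used in the template.

```python
def status(actual_seconds: float, target_seconds: float) -> str:
    """PASS when the actual value is strictly below the target."""
    return "PASS" if actual_seconds < target_seconds else "FAIL"


def metric_row(name: str, actual_seconds: float, target_seconds: float) -> str:
    """Format one Markdown table row for the Metrics section."""
    result = status(actual_seconds, target_seconds)
    return f"| {name} | < {target_seconds:g}s | {actual_seconds:g}s | {result} |"


rows = [
    metric_row("Time to Detect", 15, 30),
    metric_row("Time to Respond", 260, 300),
    metric_row("Time to Recover", 150, 120),
]
print("\n".join(rows))
```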

## Quick Reference

| Phase | Duration | Key Activities |
|-------|----------|----------------|
| Planning | 2 weeks | Define scenarios, invite participants |
| Pre-game | 30 min | Setup, verify environment, brief team |
| Execution | 2 hours | Run scenarios, observe, document |
| Debrief | 30 min | Immediate learnings, quick wins |
| Post-mortem | 1 week later | Detailed analysis, action items |
| Follow-up | 1 month | Verify improvements, plan next game day |
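
The phase durations in the table translate directly into calendar dates once a game day is chosen. This is a rough sketch, assuming "1 month" is approximated as 30 days and that planning starts exactly two weeks before the event; `phase_dates` is a hypothetical helper.

```python
from datetime import date, timedelta


def phase_dates(game_day: date) -> dict[str, date]:
    """Derive key dates around a game day from the Quick Reference durations."""
    return {
        "planning_starts": game_day - timedelta(weeks=2),
        "game_day": game_day,
        "post_mortem": game_day + timedelta(weeks=1),
        "follow_up": game_day + timedelta(days=30),  # "1 month" approximated as 30 days
    }


for phase, when in phase_dates(date(2025, 1, 15)).items():
    print(f"{phase}: {when.isoformat()}")
```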