codex-genesis-harness 0.1.4 → 0.1.6
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.codebase/ARCHITECTURE_REVIEW_COMPLETE.md +216 -216
- package/.codebase/CURRENT_STATE.md +9 -7
- package/.codebase/FILE_NAMING_CLARIFICATION.md +161 -161
- package/.codebase/HARNESS_COMPLETENESS_AUDIT.md +613 -613
- package/.codebase/IMPLEMENTATION_COMPLETE.md +429 -429
- package/.codebase/IMPLEMENTATION_HANDOFF.md +351 -351
- package/.codebase/IMPROVEMENTS_SUMMARY.md +419 -419
- package/.codebase/PHASE3_SKILLS_NAMING_COMPLETE.md +292 -292
- package/.codebase/PHASE_DEPENDENCY_MAP.md +486 -486
- package/.codebase/QUICK_START_SPEC_IMPACT.md +456 -456
- package/.codebase/README.md +139 -139
- package/.codebase/RECOVERY_POINTS.md +438 -438
- package/.codebase/state.json +37 -0
- package/.codex/skills/genesis-api-sync/SKILL.md +354 -354
- package/.codex/skills/genesis-api-sync/checklists/api-sync-checklist.md +101 -101
- package/.codex/skills/genesis-api-sync/templates/api-change-template.md +257 -257
- package/.codex/skills/genesis-debug-guide/SKILL.md +479 -479
- package/.codex/skills/genesis-debug-guide/checklists/flaky-test-investigation.md +339 -339
- package/.codex/skills/genesis-debug-guide/checklists/production-bug-debug.md +210 -210
- package/.codex/skills/genesis-debug-guide/checklists/test-failure-debug.md +158 -158
- package/.codex/skills/genesis-debug-guide/observability/debug-commands.md +365 -365
- package/.codex/skills/genesis-debug-guide/playbooks/unit-test-failures.md +289 -289
- package/.codex/skills/genesis-debug-guide/templates/debug-investigation-log.md +288 -288
- package/.codex/skills/genesis-docs-automation/SKILL.md +1003 -1003
- package/.codex/skills/genesis-docs-automation/checklists/docs-validation.md +359 -359
- package/.codex/skills/genesis-docs-automation/checklists/spec-alignment.md +312 -312
- package/.codex/skills/genesis-docs-automation/observability/docs-tracking.md +382 -382
- package/.codex/skills/genesis-docs-automation/playbooks/auto-update-flow.md +851 -851
- package/.codex/skills/genesis-docs-automation/playbooks/changelog-generation.md +491 -491
- package/.codex/skills/genesis-docs-automation/templates/changelog-entry-template.md +187 -187
- package/.codex/skills/genesis-docs-automation/templates/handoff-template.md +297 -297
- package/.codex/skills/genesis-harness/SKILL.md +1427 -1418
- package/.codex/skills/genesis-harness/agents/openai.yaml +7 -7
- package/.codex/skills/genesis-harness/checklists/bug-fix-qa.md +169 -169
- package/.codex/skills/genesis-harness/checklists/new-feature-qa.md +157 -157
- package/.codex/skills/genesis-harness/checklists/refactor-qa.md +216 -216
- package/.codex/skills/genesis-harness/checklists/requirements-validation.md +211 -211
- package/.codex/skills/genesis-harness/references/planning-schema.md +35 -35
- package/.codex/skills/genesis-harness/references/quality-rubric.md +21 -21
- package/.codex/skills/genesis-harness/references/research-rubric.md +41 -41
- package/.codex/skills/genesis-harness/references/workflows.md +33 -33
- package/.codex/skills/genesis-harness/resources/agents-template.md +27 -27
- package/.codex/skills/genesis-harness/resources/api-docs-template.md +32 -32
- package/.codex/skills/genesis-harness/resources/architecture-template.md +30 -30
- package/.codex/skills/genesis-harness/resources/audit-template.md +26 -26
- package/.codex/skills/genesis-harness/resources/bug-template.md +34 -34
- package/.codex/skills/genesis-harness/resources/change-impact-matrix-template.md +204 -204
- package/.codex/skills/genesis-harness/resources/check-template.md +21 -21
- package/.codex/skills/genesis-harness/resources/conventions-template.md +42 -42
- package/.codex/skills/genesis-harness/resources/decision-template.md +33 -33
- package/.codex/skills/genesis-harness/resources/design-template.md +26 -26
- package/.codex/skills/genesis-harness/resources/escalation-template.md +21 -21
- package/.codex/skills/genesis-harness/resources/feature-template.md +49 -49
- package/.codex/skills/genesis-harness/resources/foundation-phase-template.md +131 -131
- package/.codex/skills/genesis-harness/resources/integrations-template.md +32 -32
- package/.codex/skills/genesis-harness/resources/journeys-template.md +13 -13
- package/.codex/skills/genesis-harness/resources/lessons-learned-template.md +12 -12
- package/.codex/skills/genesis-harness/resources/observability-template.md +34 -34
- package/.codex/skills/genesis-harness/resources/phase-00-foundation-template.md +76 -76
- package/.codex/skills/genesis-harness/resources/phase-template.md +34 -34
- package/.codex/skills/genesis-harness/resources/pitfalls-template.md +22 -22
- package/.codex/skills/genesis-harness/resources/planning-tree-template.md +39 -39
- package/.codex/skills/genesis-harness/resources/post-implementation-guide.md +347 -347
- package/.codex/skills/genesis-harness/resources/project-template.md +38 -38
- package/.codex/skills/genesis-harness/resources/quality-score-template.md +11 -11
- package/.codex/skills/genesis-harness/resources/requirements-template.md +26 -26
- package/.codex/skills/genesis-harness/resources/research-template.md +26 -26
- package/.codex/skills/genesis-harness/resources/review-template.md +22 -22
- package/.codex/skills/genesis-harness/resources/spec-changelog-template.md +6 -6
- package/.codex/skills/genesis-harness/resources/stack-template.md +33 -33
- package/.codex/skills/genesis-harness/resources/verification-template.md +26 -26
- package/.codex/skills/genesis-harness/scripts/check-architecture-boundaries.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/check-docs-sync.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/check-no-debug-logs.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/check-required-planning-files.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/check-spec-changelog.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/check-task-tracking.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/compact-context.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/create-adr.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/create-bug.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/create-feature.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/detect-stack.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/init-planning.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/list-changed-files.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/offload-log.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/run-verification.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/run-verify-loop.sh +0 -0
- package/.codex/skills/genesis-harness/scripts/update-state.sh +0 -0
- package/.codex/skills/genesis-mvp-planning/SKILL.md +114 -0
- package/.codex/skills/genesis-mvp-planning/agents/openai.yaml +6 -0
- package/.codex/skills/genesis-mvp-planning/checklists/mvp-readiness.md +18 -0
- package/.codex/skills/genesis-mvp-planning/examples/5-phase-roadmap-example.md +43 -0
- package/.codex/skills/genesis-mvp-planning/templates/phase-1-core.md +17 -0
- package/.codex/skills/genesis-mvp-planning/templates/phase-2-auth.md +17 -0
- package/.codex/skills/genesis-mvp-planning/templates/phase-3-features.md +17 -0
- package/.codex/skills/genesis-mvp-planning/templates/phase-4-integrations.md +17 -0
- package/.codex/skills/genesis-mvp-planning/templates/phase-5-readiness.md +17 -0
- package/.codex/skills/genesis-new-design/agents/openai.yaml +3 -3
- package/.codex/skills/genesis-observability-automation/checklists/.gitkeep +0 -0
- package/.codex/skills/genesis-observability-automation/observability/.gitkeep +0 -0
- package/.codex/skills/genesis-observability-automation/playbooks/.gitkeep +0 -0
- package/.codex/skills/genesis-observability-automation/templates/.gitkeep +0 -0
- package/.codex/skills/genesis-release-orchestration/SKILL.md +653 -653
- package/.codex/skills/genesis-release-orchestration/checklists/post-deployment-verification.md +274 -274
- package/.codex/skills/genesis-release-orchestration/checklists/pre-release-validation.md +220 -220
- package/.codex/skills/genesis-release-orchestration/observability/release-tracking.md +253 -253
- package/.codex/skills/genesis-release-orchestration/playbooks/canary-deployment-orchestration.md +472 -472
- package/.codex/skills/genesis-release-orchestration/playbooks/semantic-versioning-automation.md +494 -494
- package/.codex/skills/genesis-release-orchestration/templates/deployment-strategy-template.md +303 -303
- package/.codex/skills/genesis-release-orchestration/templates/release-runbook-template.md +420 -420
- package/.codex/skills/genesis-research-first/SKILL.md +237 -237
- package/.codex/skills/genesis-research-first/templates/.gitkeep +0 -0
- package/.codex/skills/genesis-spec-propagation/SKILL.md +534 -534
- package/.codex/skills/genesis-spec-propagation/checklists/phase-update-verification.md +384 -384
- package/.codex/skills/genesis-spec-propagation/checklists/spec-change-detection.md +257 -257
- package/.codex/skills/genesis-spec-propagation/observability/propagation-tracking.md +373 -373
- package/.codex/skills/genesis-spec-propagation/playbooks/breaking-change-propagation.md +692 -692
- package/.codex/skills/genesis-spec-propagation/playbooks/feature-change-propagation.md +434 -434
- package/.codex/skills/genesis-spec-propagation/templates/migration-guide-template.md +407 -407
- package/.codex/skills/genesis-state-machine/SKILL.md +34 -0
- package/.codex/skills/genesis-upgrade-design/agents/openai.yaml +3 -3
- package/.codex/skills/spec-impact-engine/SKILL.md +504 -504
- package/.codex/skills/spec-impact-engine/detect-spec-changes.sh +0 -0
- package/.codex-plugin/plugin.json +24 -24
- package/CHANGELOG.md +42 -0
- package/LICENSE +22 -22
- package/README.EN.md +784 -719
- package/README.VI.md +776 -712
- package/README.md +113 -253
- package/VERSION +2 -2
- package/bin/genesis-harness.js +90 -87
- package/package.json +68 -43
- package/scripts/README.md +342 -342
- package/scripts/compact-context.sh +0 -0
- package/scripts/contract_integrity_gate.js +83 -0
- package/scripts/detect-changes.sh +0 -0
- package/scripts/healing_telemetry.js +118 -0
- package/scripts/install.sh +4 -1
- package/scripts/offload-log.sh +0 -0
- package/scripts/prompt_sentinel.js +84 -0
- package/scripts/run-evals.sh +1 -0
- package/scripts/run-verify-loop.sh +11 -0
- package/scripts/spec_visual_sync.js +157 -0
- package/scripts/test_generator.js +142 -0
- package/scripts/transition_state.sh +67 -0
- package/scripts/uninstall.sh +1 -0
- package/scripts/validation_gates.sh +85 -0
- package/scripts/verify.sh +5 -0
- package/tests/unit/contract_integrity_gate.test.js +74 -0
- package/tests/unit/healing_telemetry.test.js +58 -0
- package/tests/unit/prompt_sentinel.test.js +50 -0
- package/tests/unit/spec_visual_sync.test.js +77 -0
- package/tests/unit/test_generator.test.js +62 -0
package/.codex/skills/genesis-release-orchestration/templates/deployment-strategy-template.md
CHANGED
@@ -1,303 +1,303 @@
-# Deployment Strategy Template
-
-**Release**: v[X.Y.Z]
-**Risk Score**: [N/10]
-**Strategy Selected**: [Blue-Green|Canary|Rolling]
-**Approval Date**: [YYYY-MM-DD]
-**Deployment Window**: [YYYY-MM-DD HH:MM-HH:MM UTC]
-
----
-
-## Strategy Selection Criteria
-
-**Risk Score 1-2 (LOW)** → Rolling Deployment
-**Risk Score 3-5 (MEDIUM)** → Blue-Green Deployment
-**Risk Score 6-8 (HIGH)** → Canary Deployment
-**Risk Score 9-10 (CRITICAL)** → Canary + Manual approval at each stage
-
----
-
-## Blue-Green Deployment Strategy
-
-**When to use**: Medium-risk releases (risk 3-5)
-
-### Overview
-
-Two identical production environments:
-- **Blue** = Current production (v2.5.3)
-- **Green** = New version (v3.0.0)
-
-Traffic is routed to one at a time. If issues occur, switch back instantly.
-
-### Timeline
-
-```
-Stage 1: Deploy to Green (5-10 min)
-- Deploy v3.0.0 to green environment
-- Run health checks
-- Verify all systems ready
-- Keep blue (v2.5.3) running
-
-Stage 2: Validate Green (5-10 min)
-- Run smoke tests against green
-- Verify response format correct (if breaking changes)
-- Validate database migrations applied
-- Confirm green ready for traffic
-
-Stage 3: Switch Traffic (1-2 min)
-- Update load balancer: 100% traffic → green (v3.0.0)
-- Blue (v2.5.3) stops receiving traffic but stays running
-
-Stage 4: Monitor Green (1-24 hours)
-- Monitor error rate, latency, resources
-- If issues found: Instant rollback (switch back to blue)
-- If stable after 1-24 hours: Decommission blue, keep only green
-
-Rollback (if needed): <5 seconds
-- Switch traffic: green → blue (instant)
-- No data loss (traffic never split)
-```
-
-### Deployment Commands
-
-```bash
-# 1. Deploy to green environment
-kubectl apply -f deployment-v3.0.0-green.yaml -n production
-
-# 2. Wait for green ready
-kubectl rollout status deployment/myapp-green --timeout=10m
-
-# 3. Validate green environment
-curl -f https://green-staging.myapp.example.com/health
-curl -f https://green-staging.myapp.example.com/api/users/1
-
-# 4. Switch traffic: Load Balancer route 100% to green
-aws elbv2 modify-rule --rule-arn arn:aws:elasticloadbalancing:... \
-  --actions Type=forward,TargetGroups="[{TargetGroupArn=arn:...green:...,Weight=100}]"
-
-# 5. Monitor for 1-24 hours
-watch -n 10 'kubectl top pods -n production'
-
-# 6. If rollback needed: Switch back to blue (instant)
-aws elbv2 modify-rule --rule-arn arn:aws:elasticloadbalancing:... \
-  --actions Type=forward,TargetGroups="[{TargetGroupArn=arn:...blue:...,Weight=100}]"
-```
-
-### Pros & Cons
-
-**Pros**:
-- ✅ Zero-downtime deployment
-- ✅ Instant rollback (< 5 seconds)
-- ✅ Can run parallel tests
-- ✅ No traffic splitting complexity
-
-**Cons**:
-- ❌ Double infrastructure cost (need 2x environments)
-- ❌ Data consistency challenges (2 databases)
-- ❌ Not suitable for frequent deployments
-
----
-
-## Canary Deployment Strategy
-
-**When to use**: High-risk releases (risk 6-8+)
-
-### Overview
-
-Gradually roll out new version to small % of traffic, increasing as confidence grows.
-
-- Stage 1: 1% traffic (1 hour monitoring)
-- Stage 2: 10% traffic (2 hours monitoring)
-- Stage 3: 50% traffic (4 hours monitoring)
-- Stage 4: 100% traffic (24+ hours monitoring)
-
-If issues at any stage: Rollback all traffic to previous version.
-
-### Timeline
-
-```
-Stage 1: 1% Traffic Canary (1 hour)
-- Deploy v3.0.0 alongside v2.5.3
-- Route 1% of traffic to v3.0.0
-- Monitor error rate, latency, resource usage
-- Go/No-go decision after 1 hour
-
-Stage 2: 10% Traffic Canary (2 hours)
-- Increase to 10% traffic if Stage 1 passed
-- Monitor for 2 hours (covers lunch rush if applicable)
-- Go/No-go decision after 2 hours
-
-Stage 3: 50% Traffic Canary (4 hours)
-- Increase to 50% traffic if Stage 2 passed
-- Monitor for 4 hours (covers peak traffic periods)
-- Go/No-go decision after 4 hours
-
-Stage 4: 100% Traffic - Full Rollout (24+ hours)
-- Route 100% traffic to v3.0.0
-- Keep v2.5.3 running for 24 hours (instant rollback)
-- Continuous monitoring for 24 hours
-- After 24 hours stable: Decommission v2.5.3
-
-Total deployment: 8+ hours (Stage 1→4)
-Rollback window: 24 hours post-complete deployment
-```
-
-### Deployment Commands
-
-```bash
-# Stage 1: Deploy and route 1% traffic
-kubectl apply -f deployment-v3.0.0.yaml -n production
-kubectl get pods -w # Verify ready
-
-# Route 1% traffic to v3.0.0, 99% to v2.5.3
-aws elbv2 modify-rule --rule-arn <arn> \
-  --actions Type=forward,TargetGroups="[{TargetGroupArn=<v3-tg>,Weight=1},{TargetGroupArn=<v2-tg>,Weight=99}]"
-
-# Monitor Stage 1 (1 hour)
-# ... check error rate, latency, etc.
-
-# Stage 2: Increase to 10%
-aws elbv2 modify-rule --rule-arn <arn> \
-  --actions Type=forward,TargetGroups="[{TargetGroupArn=<v3-tg>,Weight=10},{TargetGroupArn=<v2-tg>,Weight=90}]"
-
-# Stage 3: Increase to 50%
-aws elbv2 modify-rule --rule-arn <arn> \
-  --actions Type=forward,TargetGroups="[{TargetGroupArn=<v3-tg>,Weight=50},{TargetGroupArn=<v2-tg>,Weight=50}]"
-
-# Stage 4: 100% traffic (full rollout)
-aws elbv2 modify-rule --rule-arn <arn> \
-  --actions Type=forward,TargetGroups="[{TargetGroupArn=<v3-tg>,Weight=100}]"
-
-# Rollback (if needed at any stage): Instant switch to previous version
-aws elbv2 modify-rule --rule-arn <arn> \
-  --actions Type=forward,TargetGroups="[{TargetGroupArn=<v2-tg>,Weight=100}]"
-```
-
-### Go/No-Go Criteria (At Each Stage)
-
-**GO** if:
-- ✅ Error rate <1% (target <0.1%)
-- ✅ Latency within baseline (±10%)
-- ✅ No critical issues
-- ✅ Consumer feedback positive
-- ✅ Team lead approves
-
-**NO-GO** (Pause) if:
-- ⚠️ Error rate 1-5% (investigate, pause 30 min)
-- ⚠️ Latency spike >50% (investigate, may indicate load issue)
-- ⚠️ Resource exhaustion (tune, retry)
-
-**ROLLBACK** if:
-- ❌ Error rate >5%
-- ❌ Complete service unavailability
-- ❌ Data corruption
-- ❌ Security vulnerability
-
-### Pros & Cons
-
-**Pros**:
-- ✅ Low risk (catch issues early)
-- ✅ Minimal blast radius at each stage
-- ✅ Customer feedback integrated
-- ✅ Gradual confidence increase
-
-**Cons**:
-- ❌ Slow deployment (8+ hours)
-- ❌ Requires careful monitoring
-- ❌ Split traffic may hide issues
-- ❌ Complex state management (need both versions running)
-
----
-
-## Rolling Deployment Strategy
-
-**When to use**: Low-risk releases (risk 1-2)
-
-### Overview
-
-Replace instances one-by-one, keeping service available throughout.
-
-- Wave 1: Update 25% of instances (1 hour)
-- Wave 2: Update 50% of instances (1 hour)
-- Wave 3: Update 75% of instances (1 hour)
-- Wave 4: Update 100% of instances (1 hour)
-
-If an issue is detected: the rollout stops and changes can be reversed.
-
-### Timeline
-
-```
-Wave 1: 25% Instances (1 hour)
-- Cordon off 25% of instances
-- Deploy v3.0.0 to cordoned instances
-- Verify health checks pass
-- Return instances to rotation
-- Monitor error rate (should be ~0%)
-
-Wave 2: 50% Instances (1 hour)
-- Repeat for next 25%
-
-Wave 3: 75% Instances (1 hour)
-- Repeat for next 25%
-
-Wave 4: 100% Instances (1 hour)
-- Deploy to final 25%
-- All instances now v3.0.0
-
-Total deployment: 4 hours
-Rollback: Can stop deployment mid-wave and roll back in-progress instances
-```
-
-### Pros & Cons
-
-**Pros**:
-- ✅ Fast (4 hours vs 8+ hours)
-- ✅ Zero-downtime (service always available)
-- ✅ Simple implementation
-- ✅ Low infrastructure cost
-
-**Cons**:
-- ❌ Higher risk (old + new running simultaneously)
-- ❌ Harder to roll back (partially deployed state)
-- ❌ Can't instantly revert (need to re-deploy)
-
----
-
-## Monitoring Metrics for All Strategies
-
-**Critical Metrics** (check every 5-10 minutes):
-
-| Metric | Target | Alert |
-|--------|--------|-------|
-| Error Rate (5xx) | <0.1% | >1% |
-| Latency P95 | <200ms | >300ms |
-| Latency P99 | <500ms | >750ms |
-| CPU Usage | <70% | >80% |
-| Memory Usage | <80% | >85% |
-| Database Connections | Stable | Growing |
-| Request Timeout Rate | <0.01% | >0.05% |
-
----
-
-## Rollback Procedure (Universal for All Strategies)
-
-```bash
-# Immediate: Switch traffic back to previous version
-aws elbv2 modify-rule --rule-arn <arn> \
-  --actions Type=forward,TargetGroups="[{TargetGroupArn=<prev-tg>,Weight=100}]"
-
-# Verify previous version handling traffic
-curl -f http://[prod-url]/health
-
-# Notify stakeholders
-# Email: ops-team@company.com
-# Slack: #incident-response
-# Status page: Updated
-
-# Start investigation
-echo "Rollback complete at $(date)" >> INCIDENT.md
-```
-
----
-
-**DEPLOYMENT STRATEGY COMPLETE**
+# Deployment Strategy Template
+
+**Release**: v[X.Y.Z]
+**Risk Score**: [N/10]
+**Strategy Selected**: [Blue-Green|Canary|Rolling]
+**Approval Date**: [YYYY-MM-DD]
+**Deployment Window**: [YYYY-MM-DD HH:MM-HH:MM UTC]
+
+---
+
+## Strategy Selection Criteria
+
+**Risk Score 1-2 (LOW)** → Rolling Deployment
+**Risk Score 3-5 (MEDIUM)** → Blue-Green Deployment
+**Risk Score 6-8 (HIGH)** → Canary Deployment
+**Risk Score 9-10 (CRITICAL)** → Canary + Manual approval at each stage
+
+---
+
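The selection criteria above amount to a lookup from risk score to strategy. A minimal sketch of that lookup follows; the function name `select_strategy` is hypothetical and not part of the package, and the rejection of scores outside 1-10 is an assumption the template does not state.

```bash
# Hypothetical helper (not part of the package): map a 1-10 risk score
# to the strategy named in the selection criteria.
select_strategy() {
  score="$1"
  # Assumption: only whole numbers from 1 to 10 are valid scores.
  case "$score" in
    [1-9]|10) ;;
    *) echo "invalid risk score: $score" >&2; return 1 ;;
  esac
  if [ "$score" -le 2 ]; then
    echo "Rolling"
  elif [ "$score" -le 5 ]; then
    echo "Blue-Green"
  elif [ "$score" -le 8 ]; then
    echo "Canary"
  else
    echo "Canary + manual approval at each stage"
  fi
}

select_strategy 4   # prints: Blue-Green
```

Keeping the bands in one function would make it harder for the header field and the chosen section of the template to drift apart.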
+## Blue-Green Deployment Strategy
+
+**When to use**: Medium-risk releases (risk 3-5)
+
+### Overview
+
+Two identical production environments:
+- **Blue** = Current production (v2.5.3)
+- **Green** = New version (v3.0.0)
+
+Traffic is routed to one at a time. If issues occur, switch back instantly.
+
+### Timeline
+
+```
+Stage 1: Deploy to Green (5-10 min)
+- Deploy v3.0.0 to green environment
+- Run health checks
+- Verify all systems ready
+- Keep blue (v2.5.3) running
+
+Stage 2: Validate Green (5-10 min)
+- Run smoke tests against green
+- Verify response format correct (if breaking changes)
+- Validate database migrations applied
+- Confirm green ready for traffic
+
+Stage 3: Switch Traffic (1-2 min)
+- Update load balancer: 100% traffic → green (v3.0.0)
+- Blue (v2.5.3) stops receiving traffic but stays running
+
+Stage 4: Monitor Green (1-24 hours)
+- Monitor error rate, latency, resources
+- If issues found: Instant rollback (switch back to blue)
+- If stable after 1-24 hours: Decommission blue, keep only green
+
+Rollback (if needed): <5 seconds
+- Switch traffic: green → blue (instant)
+- No data loss (traffic never split)
+```
+
+### Deployment Commands
+
+```bash
+# 1. Deploy to green environment
+kubectl apply -f deployment-v3.0.0-green.yaml -n production
+
+# 2. Wait for green ready
+kubectl rollout status deployment/myapp-green --timeout=10m
+
+# 3. Validate green environment
+curl -f https://green-staging.myapp.example.com/health
+curl -f https://green-staging.myapp.example.com/api/users/1
+
+# 4. Switch traffic: Load Balancer route 100% to green
+aws elbv2 modify-rule --rule-arn arn:aws:elasticloadbalancing:... \
+  --actions Type=forward,TargetGroups="[{TargetGroupArn=arn:...green:...,Weight=100}]"
+
+# 5. Monitor for 1-24 hours
+watch -n 10 'kubectl top pods -n production'
+
+# 6. If rollback needed: Switch back to blue (instant)
+aws elbv2 modify-rule --rule-arn arn:aws:elasticloadbalancing:... \
+  --actions Type=forward,TargetGroups="[{TargetGroupArn=arn:...blue:...,Weight=100}]"
+```
+
+### Pros & Cons
+
+**Pros**:
+- ✅ Zero-downtime deployment
+- ✅ Instant rollback (< 5 seconds)
+- ✅ Can run parallel tests
+- ✅ No traffic splitting complexity
+
+**Cons**:
+- ❌ Double infrastructure cost (need 2x environments)
+- ❌ Data consistency challenges (2 databases)
+- ❌ Not suitable for frequent deployments
+
+---
+
+## Canary Deployment Strategy
+
+**When to use**: High-risk releases (risk 6-8+)
+
+### Overview
+
+Gradually roll out new version to small % of traffic, increasing as confidence grows.
+
+- Stage 1: 1% traffic (1 hour monitoring)
+- Stage 2: 10% traffic (2 hours monitoring)
+- Stage 3: 50% traffic (4 hours monitoring)
+- Stage 4: 100% traffic (24+ hours monitoring)
+
+If issues at any stage: Rollback all traffic to previous version.
+
+### Timeline
+
+```
+Stage 1: 1% Traffic Canary (1 hour)
+- Deploy v3.0.0 alongside v2.5.3
+- Route 1% of traffic to v3.0.0
+- Monitor error rate, latency, resource usage
+- Go/No-go decision after 1 hour
+
+Stage 2: 10% Traffic Canary (2 hours)
+- Increase to 10% traffic if Stage 1 passed
+- Monitor for 2 hours (covers lunch rush if applicable)
+- Go/No-go decision after 2 hours
+
+Stage 3: 50% Traffic Canary (4 hours)
+- Increase to 50% traffic if Stage 2 passed
+- Monitor for 4 hours (covers peak traffic periods)
+- Go/No-go decision after 4 hours
+
+Stage 4: 100% Traffic - Full Rollout (24+ hours)
+- Route 100% traffic to v3.0.0
+- Keep v2.5.3 running for 24 hours (instant rollback)
+- Continuous monitoring for 24 hours
+- After 24 hours stable: Decommission v2.5.3
+
+Total deployment: 8+ hours (Stage 1→4)
+Rollback window: 24 hours post-complete deployment
+```
+
+### Deployment Commands
+
+```bash
+# Stage 1: Deploy and route 1% traffic
+kubectl apply -f deployment-v3.0.0.yaml -n production
+kubectl get pods -w # Verify ready
+
+# Route 1% traffic to v3.0.0, 99% to v2.5.3
+aws elbv2 modify-rule --rule-arn <arn> \
+  --actions Type=forward,TargetGroups="[{TargetGroupArn=<v3-tg>,Weight=1},{TargetGroupArn=<v2-tg>,Weight=99}]"
+
+# Monitor Stage 1 (1 hour)
+# ... check error rate, latency, etc.
+
+# Stage 2: Increase to 10%
+aws elbv2 modify-rule --rule-arn <arn> \
+  --actions Type=forward,TargetGroups="[{TargetGroupArn=<v3-tg>,Weight=10},{TargetGroupArn=<v2-tg>,Weight=90}]"
+
+# Stage 3: Increase to 50%
+aws elbv2 modify-rule --rule-arn <arn> \
+  --actions Type=forward,TargetGroups="[{TargetGroupArn=<v3-tg>,Weight=50},{TargetGroupArn=<v2-tg>,Weight=50}]"
+
+# Stage 4: 100% traffic (full rollout)
+aws elbv2 modify-rule --rule-arn <arn> \
+  --actions Type=forward,TargetGroups="[{TargetGroupArn=<v3-tg>,Weight=100}]"
+
+# Rollback (if needed at any stage): Instant switch to previous version
+aws elbv2 modify-rule --rule-arn <arn> \
+  --actions Type=forward,TargetGroups="[{TargetGroupArn=<v2-tg>,Weight=100}]"
+```
+
+### Go/No-Go Criteria (At Each Stage)
+
+**GO** if:
+- ✅ Error rate <1% (target <0.1%)
+- ✅ Latency within baseline (±10%)
+- ✅ No critical issues
+- ✅ Consumer feedback positive
+- ✅ Team lead approves
+
+**NO-GO** (Pause) if:
+- ⚠️ Error rate 1-5% (investigate, pause 30 min)
+- ⚠️ Latency spike >50% (investigate, may indicate load issue)
+- ⚠️ Resource exhaustion (tune, retry)
+
+**ROLLBACK** if:
+- ❌ Error rate >5%
+- ❌ Complete service unavailability
+- ❌ Data corruption
+- ❌ Security vulnerability
+
+### Pros & Cons
+
+**Pros**:
+- ✅ Low risk (catch issues early)
+- ✅ Minimal blast radius at each stage
+- ✅ Customer feedback integrated
+- ✅ Gradual confidence increase
+
+**Cons**:
+- ❌ Slow deployment (8+ hours)
+- ❌ Requires careful monitoring
+- ❌ Split traffic may hide issues
+- ❌ Complex state management (need both versions running)
+
+---
+
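The error-rate bands in the canary Go/No-Go criteria can be expressed as a small classifier. This is only a sketch: `canary_decision` is a hypothetical name, it takes the observed error rate as a percentage, and it checks the error-rate criterion alone. The other GO conditions (latency, critical issues, feedback, team-lead approval) still need a human or a separate check.

```bash
# Hypothetical helper (not part of the package): classify an observed
# error rate, given in percent, using the canary Go/No-Go thresholds.
canary_decision() {
  awk -v e="$1" 'BEGIN {
    e += 0                              # force numeric comparison
    if (e > 5)       print "ROLLBACK"   # error rate above 5%
    else if (e >= 1) print "PAUSE"      # error rate in the 1-5% band
    else             print "GO"         # error rate below 1%
  }'
}

canary_decision 0.4   # prints: GO
```

`awk` is used here because POSIX shell arithmetic is integer-only, while error rates are typically fractional.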
+## Rolling Deployment Strategy
+
+**When to use**: Low-risk releases (risk 1-2)
+
+### Overview
+
+Replace instances one-by-one, keeping service available throughout.
+
+- Wave 1: Update 25% of instances (1 hour)
+- Wave 2: Update 50% of instances (1 hour)
+- Wave 3: Update 75% of instances (1 hour)
+- Wave 4: Update 100% of instances (1 hour)
+
+If an issue is detected: the rollout stops and changes can be reversed.
+
+### Timeline
+
+```
+Wave 1: 25% Instances (1 hour)
+- Cordon off 25% of instances
+- Deploy v3.0.0 to cordoned instances
+- Verify health checks pass
+- Return instances to rotation
+- Monitor error rate (should be ~0%)
+
+Wave 2: 50% Instances (1 hour)
+- Repeat for next 25%
+
+Wave 3: 75% Instances (1 hour)
+- Repeat for next 25%
+
+Wave 4: 100% Instances (1 hour)
+- Deploy to final 25%
+- All instances now v3.0.0
+
+Total deployment: 4 hours
+Rollback: Can stop deployment mid-wave and roll back in-progress instances
+```
+
+### Pros & Cons
+
+**Pros**:
+- ✅ Fast (4 hours vs 8+ hours)
+- ✅ Zero-downtime (service always available)
+- ✅ Simple implementation
+- ✅ Low infrastructure cost
+
+**Cons**:
+- ❌ Higher risk (old + new running simultaneously)
+- ❌ Harder to roll back (partially deployed state)
+- ❌ Can't instantly revert (need to re-deploy)
+
+---
+
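The rolling section gives its waves as cumulative percentages (25%, 50%, 75%, 100%) but no commands. As a rough sketch, the cumulative instance count per wave could be computed as below; `wave_targets` is a hypothetical helper, and rounding fractional instances up is an assumption the template does not specify.

```bash
# Hypothetical helper (not part of the package): print how many instances
# should be on the new version after each of the four waves.
# Assumption: fractional instance counts are rounded up.
wave_targets() {
  total="$1"
  for pct in 25 50 75 100; do
    # Integer ceiling of total * pct / 100.
    echo "wave ${pct}%: $(( (total * pct + 99) / 100 ))/${total} instances"
  done
}

wave_targets 10
```

With 10 instances this yields 3, 5, 8 and 10 updated instances after the four waves.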
+## Monitoring Metrics for All Strategies
+
+**Critical Metrics** (check every 5-10 minutes):
+
+| Metric | Target | Alert |
+|--------|--------|-------|
+| Error Rate (5xx) | <0.1% | >1% |
+| Latency P95 | <200ms | >300ms |
+| Latency P99 | <500ms | >750ms |
+| CPU Usage | <70% | >80% |
+| Memory Usage | <80% | >85% |
+| Database Connections | Stable | Growing |
+| Request Timeout Rate | <0.01% | >0.05% |
+
+---
+
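Each numeric row of the metrics table pairs a target with an alert threshold, which leaves a band in between. A minimal sketch of checking one reading against a row follows; `metric_status` and the `WATCH` label for the in-between band are hypothetical, since the template names only the target and alert columns.

```bash
# Hypothetical helper (not part of the package): compare one reading with
# a row of the metrics table. Arguments: value, target, alert threshold
# (same unit, lower is better).
metric_status() {
  awk -v v="$1" -v target="$2" -v alert="$3" 'BEGIN {
    v += 0; target += 0; alert += 0          # force numeric comparison
    if (v > alert)       print "ALERT"       # past the alert threshold
    else if (v < target) print "OK"          # within target
    else                 print "WATCH"       # assumed label for the gap
  }'
}

metric_status 250 200 300   # Latency P95 of 250ms; prints: WATCH
```

The "Database Connections" row is qualitative (Stable vs. Growing) and would need a trend check rather than a threshold.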
+## Rollback Procedure (Universal for All Strategies)
+
+```bash
+# Immediate: Switch traffic back to previous version
+aws elbv2 modify-rule --rule-arn <arn> \
+  --actions Type=forward,TargetGroups="[{TargetGroupArn=<prev-tg>,Weight=100}]"
+
+# Verify previous version handling traffic
+curl -f http://[prod-url]/health
+
+# Notify stakeholders
+# Email: ops-team@company.com
+# Slack: #incident-response
+# Status page: Updated
+
+# Start investigation
+echo "Rollback complete at $(date)" >> INCIDENT.md
+```
+
+---
+
+**DEPLOYMENT STRATEGY COMPLETE**