codex-genesis-harness 0.1.4 → 0.1.6

Files changed (153)
  1. package/.codebase/ARCHITECTURE_REVIEW_COMPLETE.md +216 -216
  2. package/.codebase/CURRENT_STATE.md +9 -7
  3. package/.codebase/FILE_NAMING_CLARIFICATION.md +161 -161
  4. package/.codebase/HARNESS_COMPLETENESS_AUDIT.md +613 -613
  5. package/.codebase/IMPLEMENTATION_COMPLETE.md +429 -429
  6. package/.codebase/IMPLEMENTATION_HANDOFF.md +351 -351
  7. package/.codebase/IMPROVEMENTS_SUMMARY.md +419 -419
  8. package/.codebase/PHASE3_SKILLS_NAMING_COMPLETE.md +292 -292
  9. package/.codebase/PHASE_DEPENDENCY_MAP.md +486 -486
  10. package/.codebase/QUICK_START_SPEC_IMPACT.md +456 -456
  11. package/.codebase/README.md +139 -139
  12. package/.codebase/RECOVERY_POINTS.md +438 -438
  13. package/.codebase/state.json +37 -0
  14. package/.codex/skills/genesis-api-sync/SKILL.md +354 -354
  15. package/.codex/skills/genesis-api-sync/checklists/api-sync-checklist.md +101 -101
  16. package/.codex/skills/genesis-api-sync/templates/api-change-template.md +257 -257
  17. package/.codex/skills/genesis-debug-guide/SKILL.md +479 -479
  18. package/.codex/skills/genesis-debug-guide/checklists/flaky-test-investigation.md +339 -339
  19. package/.codex/skills/genesis-debug-guide/checklists/production-bug-debug.md +210 -210
  20. package/.codex/skills/genesis-debug-guide/checklists/test-failure-debug.md +158 -158
  21. package/.codex/skills/genesis-debug-guide/observability/debug-commands.md +365 -365
  22. package/.codex/skills/genesis-debug-guide/playbooks/unit-test-failures.md +289 -289
  23. package/.codex/skills/genesis-debug-guide/templates/debug-investigation-log.md +288 -288
  24. package/.codex/skills/genesis-docs-automation/SKILL.md +1003 -1003
  25. package/.codex/skills/genesis-docs-automation/checklists/docs-validation.md +359 -359
  26. package/.codex/skills/genesis-docs-automation/checklists/spec-alignment.md +312 -312
  27. package/.codex/skills/genesis-docs-automation/observability/docs-tracking.md +382 -382
  28. package/.codex/skills/genesis-docs-automation/playbooks/auto-update-flow.md +851 -851
  29. package/.codex/skills/genesis-docs-automation/playbooks/changelog-generation.md +491 -491
  30. package/.codex/skills/genesis-docs-automation/templates/changelog-entry-template.md +187 -187
  31. package/.codex/skills/genesis-docs-automation/templates/handoff-template.md +297 -297
  32. package/.codex/skills/genesis-harness/SKILL.md +1427 -1418
  33. package/.codex/skills/genesis-harness/agents/openai.yaml +7 -7
  34. package/.codex/skills/genesis-harness/checklists/bug-fix-qa.md +169 -169
  35. package/.codex/skills/genesis-harness/checklists/new-feature-qa.md +157 -157
  36. package/.codex/skills/genesis-harness/checklists/refactor-qa.md +216 -216
  37. package/.codex/skills/genesis-harness/checklists/requirements-validation.md +211 -211
  38. package/.codex/skills/genesis-harness/references/planning-schema.md +35 -35
  39. package/.codex/skills/genesis-harness/references/quality-rubric.md +21 -21
  40. package/.codex/skills/genesis-harness/references/research-rubric.md +41 -41
  41. package/.codex/skills/genesis-harness/references/workflows.md +33 -33
  42. package/.codex/skills/genesis-harness/resources/agents-template.md +27 -27
  43. package/.codex/skills/genesis-harness/resources/api-docs-template.md +32 -32
  44. package/.codex/skills/genesis-harness/resources/architecture-template.md +30 -30
  45. package/.codex/skills/genesis-harness/resources/audit-template.md +26 -26
  46. package/.codex/skills/genesis-harness/resources/bug-template.md +34 -34
  47. package/.codex/skills/genesis-harness/resources/change-impact-matrix-template.md +204 -204
  48. package/.codex/skills/genesis-harness/resources/check-template.md +21 -21
  49. package/.codex/skills/genesis-harness/resources/conventions-template.md +42 -42
  50. package/.codex/skills/genesis-harness/resources/decision-template.md +33 -33
  51. package/.codex/skills/genesis-harness/resources/design-template.md +26 -26
  52. package/.codex/skills/genesis-harness/resources/escalation-template.md +21 -21
  53. package/.codex/skills/genesis-harness/resources/feature-template.md +49 -49
  54. package/.codex/skills/genesis-harness/resources/foundation-phase-template.md +131 -131
  55. package/.codex/skills/genesis-harness/resources/integrations-template.md +32 -32
  56. package/.codex/skills/genesis-harness/resources/journeys-template.md +13 -13
  57. package/.codex/skills/genesis-harness/resources/lessons-learned-template.md +12 -12
  58. package/.codex/skills/genesis-harness/resources/observability-template.md +34 -34
  59. package/.codex/skills/genesis-harness/resources/phase-00-foundation-template.md +76 -76
  60. package/.codex/skills/genesis-harness/resources/phase-template.md +34 -34
  61. package/.codex/skills/genesis-harness/resources/pitfalls-template.md +22 -22
  62. package/.codex/skills/genesis-harness/resources/planning-tree-template.md +39 -39
  63. package/.codex/skills/genesis-harness/resources/post-implementation-guide.md +347 -347
  64. package/.codex/skills/genesis-harness/resources/project-template.md +38 -38
  65. package/.codex/skills/genesis-harness/resources/quality-score-template.md +11 -11
  66. package/.codex/skills/genesis-harness/resources/requirements-template.md +26 -26
  67. package/.codex/skills/genesis-harness/resources/research-template.md +26 -26
  68. package/.codex/skills/genesis-harness/resources/review-template.md +22 -22
  69. package/.codex/skills/genesis-harness/resources/spec-changelog-template.md +6 -6
  70. package/.codex/skills/genesis-harness/resources/stack-template.md +33 -33
  71. package/.codex/skills/genesis-harness/resources/verification-template.md +26 -26
  72. package/.codex/skills/genesis-harness/scripts/check-architecture-boundaries.sh +0 -0
  73. package/.codex/skills/genesis-harness/scripts/check-docs-sync.sh +0 -0
  74. package/.codex/skills/genesis-harness/scripts/check-no-debug-logs.sh +0 -0
  75. package/.codex/skills/genesis-harness/scripts/check-required-planning-files.sh +0 -0
  76. package/.codex/skills/genesis-harness/scripts/check-spec-changelog.sh +0 -0
  77. package/.codex/skills/genesis-harness/scripts/check-task-tracking.sh +0 -0
  78. package/.codex/skills/genesis-harness/scripts/compact-context.sh +0 -0
  79. package/.codex/skills/genesis-harness/scripts/create-adr.sh +0 -0
  80. package/.codex/skills/genesis-harness/scripts/create-bug.sh +0 -0
  81. package/.codex/skills/genesis-harness/scripts/create-feature.sh +0 -0
  82. package/.codex/skills/genesis-harness/scripts/detect-stack.sh +0 -0
  83. package/.codex/skills/genesis-harness/scripts/init-planning.sh +0 -0
  84. package/.codex/skills/genesis-harness/scripts/list-changed-files.sh +0 -0
  85. package/.codex/skills/genesis-harness/scripts/offload-log.sh +0 -0
  86. package/.codex/skills/genesis-harness/scripts/run-verification.sh +0 -0
  87. package/.codex/skills/genesis-harness/scripts/run-verify-loop.sh +0 -0
  88. package/.codex/skills/genesis-harness/scripts/update-state.sh +0 -0
  89. package/.codex/skills/genesis-mvp-planning/SKILL.md +114 -0
  90. package/.codex/skills/genesis-mvp-planning/agents/openai.yaml +6 -0
  91. package/.codex/skills/genesis-mvp-planning/checklists/mvp-readiness.md +18 -0
  92. package/.codex/skills/genesis-mvp-planning/examples/5-phase-roadmap-example.md +43 -0
  93. package/.codex/skills/genesis-mvp-planning/templates/phase-1-core.md +17 -0
  94. package/.codex/skills/genesis-mvp-planning/templates/phase-2-auth.md +17 -0
  95. package/.codex/skills/genesis-mvp-planning/templates/phase-3-features.md +17 -0
  96. package/.codex/skills/genesis-mvp-planning/templates/phase-4-integrations.md +17 -0
  97. package/.codex/skills/genesis-mvp-planning/templates/phase-5-readiness.md +17 -0
  98. package/.codex/skills/genesis-new-design/agents/openai.yaml +3 -3
  99. package/.codex/skills/genesis-observability-automation/checklists/.gitkeep +0 -0
  100. package/.codex/skills/genesis-observability-automation/observability/.gitkeep +0 -0
  101. package/.codex/skills/genesis-observability-automation/playbooks/.gitkeep +0 -0
  102. package/.codex/skills/genesis-observability-automation/templates/.gitkeep +0 -0
  103. package/.codex/skills/genesis-release-orchestration/SKILL.md +653 -653
  104. package/.codex/skills/genesis-release-orchestration/checklists/post-deployment-verification.md +274 -274
  105. package/.codex/skills/genesis-release-orchestration/checklists/pre-release-validation.md +220 -220
  106. package/.codex/skills/genesis-release-orchestration/observability/release-tracking.md +253 -253
  107. package/.codex/skills/genesis-release-orchestration/playbooks/canary-deployment-orchestration.md +472 -472
  108. package/.codex/skills/genesis-release-orchestration/playbooks/semantic-versioning-automation.md +494 -494
  109. package/.codex/skills/genesis-release-orchestration/templates/deployment-strategy-template.md +303 -303
  110. package/.codex/skills/genesis-release-orchestration/templates/release-runbook-template.md +420 -420
  111. package/.codex/skills/genesis-research-first/SKILL.md +237 -237
  112. package/.codex/skills/genesis-research-first/templates/.gitkeep +0 -0
  113. package/.codex/skills/genesis-spec-propagation/SKILL.md +534 -534
  114. package/.codex/skills/genesis-spec-propagation/checklists/phase-update-verification.md +384 -384
  115. package/.codex/skills/genesis-spec-propagation/checklists/spec-change-detection.md +257 -257
  116. package/.codex/skills/genesis-spec-propagation/observability/propagation-tracking.md +373 -373
  117. package/.codex/skills/genesis-spec-propagation/playbooks/breaking-change-propagation.md +692 -692
  118. package/.codex/skills/genesis-spec-propagation/playbooks/feature-change-propagation.md +434 -434
  119. package/.codex/skills/genesis-spec-propagation/templates/migration-guide-template.md +407 -407
  120. package/.codex/skills/genesis-state-machine/SKILL.md +34 -0
  121. package/.codex/skills/genesis-upgrade-design/agents/openai.yaml +3 -3
  122. package/.codex/skills/spec-impact-engine/SKILL.md +504 -504
  123. package/.codex/skills/spec-impact-engine/detect-spec-changes.sh +0 -0
  124. package/.codex-plugin/plugin.json +24 -24
  125. package/CHANGELOG.md +42 -0
  126. package/LICENSE +22 -22
  127. package/README.EN.md +784 -719
  128. package/README.VI.md +776 -712
  129. package/README.md +113 -253
  130. package/VERSION +2 -2
  131. package/bin/genesis-harness.js +90 -87
  132. package/package.json +68 -43
  133. package/scripts/README.md +342 -342
  134. package/scripts/compact-context.sh +0 -0
  135. package/scripts/contract_integrity_gate.js +83 -0
  136. package/scripts/detect-changes.sh +0 -0
  137. package/scripts/healing_telemetry.js +118 -0
  138. package/scripts/install.sh +4 -1
  139. package/scripts/offload-log.sh +0 -0
  140. package/scripts/prompt_sentinel.js +84 -0
  141. package/scripts/run-evals.sh +1 -0
  142. package/scripts/run-verify-loop.sh +11 -0
  143. package/scripts/spec_visual_sync.js +157 -0
  144. package/scripts/test_generator.js +142 -0
  145. package/scripts/transition_state.sh +67 -0
  146. package/scripts/uninstall.sh +1 -0
  147. package/scripts/validation_gates.sh +85 -0
  148. package/scripts/verify.sh +5 -0
  149. package/tests/unit/contract_integrity_gate.test.js +74 -0
  150. package/tests/unit/healing_telemetry.test.js +58 -0
  151. package/tests/unit/prompt_sentinel.test.js +50 -0
  152. package/tests/unit/spec_visual_sync.test.js +77 -0
  153. package/tests/unit/test_generator.test.js +62 -0
package/.codex/skills/genesis-release-orchestration/checklists/post-deployment-verification.md +274 -274
@@ -1,274 +1,274 @@
- # Post-Deployment Verification Checklist
-
- **Purpose**: Verify deployment success and stability post-deployment
- **Duration**: 10-15 minutes per stage
- **Risk**: Critical - must verify before next canary stage or production acceptance
-
- ---
-
- ## Section 1: Deployment Completion Verification (5 min)
-
- - [ ] **Deployment completed successfully**
- - All pods/containers running: `kubectl get pods` shows all Ready
- - New version deployed: Verify version in logs/status endpoint
- - Previous version stopped: Old containers terminated
- - Deployment time recorded: Under <30 min target
- - No deployment errors: Check CI/CD logs for failures
-
- - [ ] **Database state consistent**
- - Migrations completed: No pending migrations
- - Data integrity: Sample data queries return expected values
- - Rollback point saved: Previous database state available if needed
- - No connection errors: Application connects to database successfully
-
- - [ ] **Configuration deployed correctly**
- - Environment-specific config loaded: Correct database, API keys
- - Feature flags set correctly: New features enabled/disabled as intended
- - Secrets loaded: No "secret not found" errors in logs
- - Logging configured: Logs at appropriate level (not DEBUG in prod)
-
- ---
-
- ## Section 2: Health Check Validation (5 min)
-
- - [ ] **Liveness checks passing**
- - Endpoint: GET /health returns 200 OK
- - Response time: <500ms
- - Response body includes version: Confirm v2.5.0 deployed
- - Check frequency: Every 10 seconds (configurable)
-
- - [ ] **Readiness checks passing**
- - Endpoint: GET /ready returns 200 OK
- - Database connection: Verified in response
- - Cache connection: Verified (if used)
- - External service connectivity: Verified
- - All dependencies healthy
-
- - [ ] **Service discovery updated**
- - Load balancer sees new instances: In rotation
- - DNS resolves to new IPs: No stale DNS
- - Service mesh updated: Istio/Envoy routes to new version
- - No connection refused errors: Services can reach each other
-
- ---
-
- ## Section 3: Smoke Tests & Critical Workflows (10 min)
-
- - [ ] **Critical workflow #1: User authentication**
- - Login successful: User can authenticate
- - Session created: Token generated and valid
- - Session persists: Token valid across requests
- - Logout works: Session properly cleared
-
- - [ ] **Critical workflow #2: API response format** (if breaking changes)
- - Response structure matches new format: { data: {...} } not { user: {...} }
- - Required fields present: All documented fields in response
- - Optional fields handled: Backward compatibility (if version allows)
- - Error responses use new format: Consistent across all endpoints
-
- - [ ] **Critical workflow #3: Database writes & reads**
- - Create operation works: New data persisted
- - Read operation works: Can retrieve created data
- - Update operation works: Data can be modified
- - Delete operation works: Data removal works
- - No data corruption: Integrity constraints honored
-
- - [ ] **Critical workflow #4: Third-party integrations**
- - External API calls succeed: Payment processor, email service, etc.
- - Webhooks received: Third parties can call back into service
- - Error handling: Graceful failure if third party unavailable
- - Retry logic: Automatic retries working for transient failures
-
- - [ ] **Critical workflow #5: Backward compatibility** (for minor versions)
- - Old API endpoints still work: If not broken
- - Old response format accepted: Where supported
- - Deprecated endpoints return warning: Clear guidance to migrate
- - Migration timeline respected: Old version still functional
-
- ---
-
- ## Section 4: Performance & Baseline Metrics (5 min)
-
- - [ ] **Response time metrics**
- - P50 latency: Baseline ± 10% (healthy)
- - P95 latency: Baseline ± 10% (acceptable)
- - P99 latency: Baseline ± 20% (watch but acceptable)
- - Median within SLA: <200ms for UI endpoints
-
- - [ ] **Error rate metrics**
- - 5xx errors: <0.1% (critical threshold 1%)
- - 4xx errors: Normal range (expected user errors)
- - Timeout errors: <0.01%
- - No error spike: Consistent with pre-deployment baseline
-
- - [ ] **Resource usage metrics**
- - CPU usage: <70% (healthy, room for spike)
- - Memory usage: <80% (healthy, no memory leaks)
- - Disk usage: <80% (logs not filling disk)
- - Network saturation: <60% (room for growth)
-
- - [ ] **Throughput metrics**
- - Requests per second: Matching pre-deployment baseline
- - No request queue buildup: Processing without delays
- - Connections active: Stable count (not growing indefinitely)
- - Cache hit rate: Expected % (if applicable)
-
- ---
-
- ## Section 5: Error Log Analysis (5 min)
-
- - [ ] **No critical errors**
- - Exception rate: 0 or expected baseline
- - Stack traces: Review any new errors (expected after deployment)
- - Database errors: No connection failures or constraint violations
- - Timeout errors: <0.01%
-
- - [ ] **No security alerts**
- - Authentication failures: Normal rate (not spike)
- - Authorization failures: Expected for denied access
- - Suspicious patterns: No signs of attack or abuse
- - Invalid requests: Normal rate (not DDoS)
-
- - [ ] **Error messages understandable**
- - Error descriptions: Clear and actionable
- - Error codes: Match documentation (if breaking changes)
- - Stack traces: Available for engineering review
- - Correlation IDs: Present for request tracing
-
- ---
-
- ## Section 6: Consumer Compatibility Check (5 min)
-
- - [ ] **Consumer health verified** (if breaking changes)
- - Known clients connected: Monitoring shows active sessions
- - No authentication errors: Clients authenticated successfully
- - API response handling: Clients processing new format correctly
- - Migration status: Clients on supported versions
-
- - [ ] **No consumer errors**
- - Consumer-specific endpoints: Working (if any)
- - Third-party clients: No connection refused
- - Mobile apps: No crashes reported
- - Web clients: Working across browsers
-
- - [ ] **Feedback mechanisms working**
- - Error reporting: Clients can report issues
- - Support channels: Accessible (email, Slack, support portal)
- - Status page: Updated and accessible
- - Incident escalation: Path clear if issues arise
-
- ---
-
- ## Section 7: Deployment Strategy Progression (5 min)
-
- **If Canary Deployment**:
-
- - [ ] **Stage decision made** (if canary, proceed to next stage?)
- - Errors acceptable: <1% error rate (go)
- - Performance stable: Within baseline (go)
- - No critical issues: Team agrees safe to proceed (go)
- - Consumer feedback positive: No complaints (go)
-
- **Decision**:
- - [ ] GO to next stage: 10% traffic
- - [ ] PAUSE: Monitor additional 30 minutes
- - [ ] ROLLBACK: Critical issue found, revert to previous version
-
- **If Blue-Green Deployment**:
-
- - [ ] **Switch decision made**
- - All health checks passing on new (blue) environment
- - Load balancer can switch: DNS/LB config tested
- - Old (green) environment: Still running, ready for instant rollback
- - Consumer requests: Ready to flow to new environment
-
- **Decision**:
- - [ ] SWITCH: Route traffic from green to blue
- - [ ] PAUSE: Monitor additional 30 minutes
- - [ ] ROLLBACK: Critical issue found, stay on green
-
- **If Rolling Deployment**:
-
- - [ ] **Next wave ready**
- - Current wave: All healthy and verified
- - Next wave: Ready for deployment
- - Health checks: Configured and passing
- - No errors in current wave: Safe to proceed
-
- **Decision**:
- - [ ] PROCEED: Deploy next wave
- - [ ] PAUSE: Monitor additional 15 minutes
- - [ ] ROLLBACK: Issue found, stop rolling deployment
-
- ---
-
- ## Section 8: Incident & Rollback Preparation (5 min)
-
- - [ ] **Rollback plan confirmed** (always keep ready)
- - Previous version available: Docker image, config, DB snapshot
- - Rollback steps documented: Clear procedure to revert
- - Team trained: Everyone knows rollback procedure
- - Estimated time: <5 minutes to rollback
-
- - [ ] **Incident response ready**
- - Escalation path: Clear who to notify if issues
- - On-call team: Available for next 1-2 hours
- - Communication channels: Slack, PagerDuty, etc.
- - Status page: Updated if incident occurs
-
- - [ ] **Monitoring alert thresholds**
- - Error rate alert: >5% triggers page
- - Latency alert: >2s P95 triggers page
- - Resource alert: CPU >80% triggers warning
- - Dependency alert: External service down triggers alert
-
- ---
-
- ## Red Flags - Consider Rollback
-
- 🚨 **WARNING - Evaluate rollback**:
-
- - [ ] Error rate >1% (watch)
- - [ ] Latency spike >50% above baseline (watch)
- - [ ] Any 5xx errors in logs (investigate)
- - [ ] Database connection failures (investigate)
- - [ ] Consumer complaints (investigate)
-
- ❌ **MUST ROLLBACK - Immediate action**:
-
- - [ ] Error rate >5% (ROLLBACK)
- - [ ] Complete service unavailability (ROLLBACK)
- - [ ] Data corruption detected (ROLLBACK)
- - [ ] Security vulnerability discovered (ROLLBACK)
- - [ ] Database rollback failed (CRITICAL - escalate)
-
- **If ROLLBACK needed**:
- 1. Execute rollback procedure immediately
- 2. Notify stakeholders: Team, consumers, support
- 3. Investigate root cause
- 4. Fix issue
- 5. Re-test before attempting deployment again
-
- ---
-
- ## Sign-Off Template
-
- ```
- DEPLOYMENT: v2.5.0 → Canary Stage 1 (1% traffic)
- DATE: 2026-05-31 14:30 UTC
- VERIFIED BY: [Name]
- DURATION: 1 hour monitoring
-
- Health checks: ✓ PASS
- Error rate: ✓ PASS (0.08%)
- Response time: ✓ PASS (baseline +5%)
- Critical workflows: ✓ PASS
- Consumers: ✓ PASS
-
- Issues found: None critical
-
- DECISION: ✅ PROCEED TO STAGE 2
- Next: Deploy to 10% traffic
- Monitor: 2 hours
- Go/no-go decision: 16:30 UTC
- ```
+ # Post-Deployment Verification Checklist
+
+ **Purpose**: Verify deployment success and stability post-deployment
+ **Duration**: 10-15 minutes per stage
+ **Risk**: Critical - must verify before next canary stage or production acceptance
+
+ ---
+
+ ## Section 1: Deployment Completion Verification (5 min)
+
+ - [ ] **Deployment completed successfully**
+ - All pods/containers running: `kubectl get pods` shows all Ready
+ - New version deployed: Verify version in logs/status endpoint
+ - Previous version stopped: Old containers terminated
+ - Deployment time recorded: Under <30 min target
+ - No deployment errors: Check CI/CD logs for failures
+
+ - [ ] **Database state consistent**
+ - Migrations completed: No pending migrations
+ - Data integrity: Sample data queries return expected values
+ - Rollback point saved: Previous database state available if needed
+ - No connection errors: Application connects to database successfully
+
+ - [ ] **Configuration deployed correctly**
+ - Environment-specific config loaded: Correct database, API keys
+ - Feature flags set correctly: New features enabled/disabled as intended
+ - Secrets loaded: No "secret not found" errors in logs
+ - Logging configured: Logs at appropriate level (not DEBUG in prod)
+
+ ---
+
+ ## Section 2: Health Check Validation (5 min)
+
+ - [ ] **Liveness checks passing**
+ - Endpoint: GET /health returns 200 OK
+ - Response time: <500ms
+ - Response body includes version: Confirm v2.5.0 deployed
+ - Check frequency: Every 10 seconds (configurable)
+
+ - [ ] **Readiness checks passing**
+ - Endpoint: GET /ready returns 200 OK
+ - Database connection: Verified in response
+ - Cache connection: Verified (if used)
+ - External service connectivity: Verified
+ - All dependencies healthy
+
+ - [ ] **Service discovery updated**
+ - Load balancer sees new instances: In rotation
+ - DNS resolves to new IPs: No stale DNS
+ - Service mesh updated: Istio/Envoy routes to new version
+ - No connection refused errors: Services can reach each other
+
+ ---
+
+ ## Section 3: Smoke Tests & Critical Workflows (10 min)
+
+ - [ ] **Critical workflow #1: User authentication**
+ - Login successful: User can authenticate
+ - Session created: Token generated and valid
+ - Session persists: Token valid across requests
+ - Logout works: Session properly cleared
+
+ - [ ] **Critical workflow #2: API response format** (if breaking changes)
+ - Response structure matches new format: { data: {...} } not { user: {...} }
+ - Required fields present: All documented fields in response
+ - Optional fields handled: Backward compatibility (if version allows)
+ - Error responses use new format: Consistent across all endpoints
+
+ - [ ] **Critical workflow #3: Database writes & reads**
+ - Create operation works: New data persisted
+ - Read operation works: Can retrieve created data
+ - Update operation works: Data can be modified
+ - Delete operation works: Data removal works
+ - No data corruption: Integrity constraints honored
+
+ - [ ] **Critical workflow #4: Third-party integrations**
+ - External API calls succeed: Payment processor, email service, etc.
+ - Webhooks received: Third parties can call back into service
+ - Error handling: Graceful failure if third party unavailable
+ - Retry logic: Automatic retries working for transient failures
+
+ - [ ] **Critical workflow #5: Backward compatibility** (for minor versions)
+ - Old API endpoints still work: If not broken
+ - Old response format accepted: Where supported
+ - Deprecated endpoints return warning: Clear guidance to migrate
+ - Migration timeline respected: Old version still functional
+
+ ---
+
+ ## Section 4: Performance & Baseline Metrics (5 min)
+
+ - [ ] **Response time metrics**
+ - P50 latency: Baseline ± 10% (healthy)
+ - P95 latency: Baseline ± 10% (acceptable)
+ - P99 latency: Baseline ± 20% (watch but acceptable)
+ - Median within SLA: <200ms for UI endpoints
+
+ - [ ] **Error rate metrics**
+ - 5xx errors: <0.1% (critical threshold 1%)
+ - 4xx errors: Normal range (expected user errors)
+ - Timeout errors: <0.01%
+ - No error spike: Consistent with pre-deployment baseline
+
+ - [ ] **Resource usage metrics**
+ - CPU usage: <70% (healthy, room for spike)
+ - Memory usage: <80% (healthy, no memory leaks)
+ - Disk usage: <80% (logs not filling disk)
+ - Network saturation: <60% (room for growth)
+
+ - [ ] **Throughput metrics**
+ - Requests per second: Matching pre-deployment baseline
+ - No request queue buildup: Processing without delays
+ - Connections active: Stable count (not growing indefinitely)
+ - Cache hit rate: Expected % (if applicable)
+
+ ---
+
+ ## Section 5: Error Log Analysis (5 min)
+
+ - [ ] **No critical errors**
+ - Exception rate: 0 or expected baseline
+ - Stack traces: Review any new errors (expected after deployment)
+ - Database errors: No connection failures or constraint violations
+ - Timeout errors: <0.01%
+
+ - [ ] **No security alerts**
+ - Authentication failures: Normal rate (not spike)
+ - Authorization failures: Expected for denied access
+ - Suspicious patterns: No signs of attack or abuse
+ - Invalid requests: Normal rate (not DDoS)
+
+ - [ ] **Error messages understandable**
+ - Error descriptions: Clear and actionable
+ - Error codes: Match documentation (if breaking changes)
+ - Stack traces: Available for engineering review
+ - Correlation IDs: Present for request tracing
+
+ ---
+
+ ## Section 6: Consumer Compatibility Check (5 min)
+
+ - [ ] **Consumer health verified** (if breaking changes)
+ - Known clients connected: Monitoring shows active sessions
+ - No authentication errors: Clients authenticated successfully
+ - API response handling: Clients processing new format correctly
+ - Migration status: Clients on supported versions
+
+ - [ ] **No consumer errors**
+ - Consumer-specific endpoints: Working (if any)
+ - Third-party clients: No connection refused
+ - Mobile apps: No crashes reported
+ - Web clients: Working across browsers
+
+ - [ ] **Feedback mechanisms working**
+ - Error reporting: Clients can report issues
+ - Support channels: Accessible (email, Slack, support portal)
+ - Status page: Updated and accessible
+ - Incident escalation: Path clear if issues arise
+
+ ---
+
+ ## Section 7: Deployment Strategy Progression (5 min)
+
+ **If Canary Deployment**:
+
+ - [ ] **Stage decision made** (if canary, proceed to next stage?)
+ - Errors acceptable: <1% error rate (go)
+ - Performance stable: Within baseline (go)
+ - No critical issues: Team agrees safe to proceed (go)
+ - Consumer feedback positive: No complaints (go)
+
+ **Decision**:
+ - [ ] GO to next stage: 10% traffic
+ - [ ] PAUSE: Monitor additional 30 minutes
+ - [ ] ROLLBACK: Critical issue found, revert to previous version
+
+ **If Blue-Green Deployment**:
+
+ - [ ] **Switch decision made**
+ - All health checks passing on new (blue) environment
+ - Load balancer can switch: DNS/LB config tested
+ - Old (green) environment: Still running, ready for instant rollback
+ - Consumer requests: Ready to flow to new environment
+
+ **Decision**:
+ - [ ] SWITCH: Route traffic from green to blue
+ - [ ] PAUSE: Monitor additional 30 minutes
+ - [ ] ROLLBACK: Critical issue found, stay on green
+
+ **If Rolling Deployment**:
+
+ - [ ] **Next wave ready**
+ - Current wave: All healthy and verified
+ - Next wave: Ready for deployment
+ - Health checks: Configured and passing
+ - No errors in current wave: Safe to proceed
+
+ **Decision**:
+ - [ ] PROCEED: Deploy next wave
+ - [ ] PAUSE: Monitor additional 15 minutes
+ - [ ] ROLLBACK: Issue found, stop rolling deployment
+
+ ---
+
+ ## Section 8: Incident & Rollback Preparation (5 min)
+
+ - [ ] **Rollback plan confirmed** (always keep ready)
+ - Previous version available: Docker image, config, DB snapshot
+ - Rollback steps documented: Clear procedure to revert
+ - Team trained: Everyone knows rollback procedure
+ - Estimated time: <5 minutes to rollback
+
+ - [ ] **Incident response ready**
+ - Escalation path: Clear who to notify if issues
+ - On-call team: Available for next 1-2 hours
+ - Communication channels: Slack, PagerDuty, etc.
+ - Status page: Updated if incident occurs
+
+ - [ ] **Monitoring alert thresholds**
+ - Error rate alert: >5% triggers page
+ - Latency alert: >2s P95 triggers page
+ - Resource alert: CPU >80% triggers warning
+ - Dependency alert: External service down triggers alert
+
+ ---
+
+ ## Red Flags - Consider Rollback
+
+ 🚨 **WARNING - Evaluate rollback**:
+
+ - [ ] Error rate >1% (watch)
+ - [ ] Latency spike >50% above baseline (watch)
+ - [ ] Any 5xx errors in logs (investigate)
+ - [ ] Database connection failures (investigate)
+ - [ ] Consumer complaints (investigate)
+
+ ❌ **MUST ROLLBACK - Immediate action**:
+
+ - [ ] Error rate >5% (ROLLBACK)
+ - [ ] Complete service unavailability (ROLLBACK)
+ - [ ] Data corruption detected (ROLLBACK)
+ - [ ] Security vulnerability discovered (ROLLBACK)
+ - [ ] Database rollback failed (CRITICAL - escalate)
+
+ **If ROLLBACK needed**:
+ 1. Execute rollback procedure immediately
+ 2. Notify stakeholders: Team, consumers, support
+ 3. Investigate root cause
+ 4. Fix issue
+ 5. Re-test before attempting deployment again
+
+ ---
+
+ ## Sign-Off Template
+
+ ```
+ DEPLOYMENT: v2.5.0 → Canary Stage 1 (1% traffic)
+ DATE: 2026-05-31 14:30 UTC
+ VERIFIED BY: [Name]
+ DURATION: 1 hour monitoring
+
+ Health checks: ✓ PASS
+ Error rate: ✓ PASS (0.08%)
+ Response time: ✓ PASS (baseline +5%)
+ Critical workflows: ✓ PASS
+ Consumers: ✓ PASS
+
+ Issues found: None critical
+
+ DECISION: ✅ PROCEED TO STAGE 2
+ Next: Deploy to 10% traffic
+ Monitor: 2 hours
+ Go/no-go decision: 16:30 UTC
+ ```
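
Review note: the "Red Flags" thresholds in this checklist (error rate >1% to watch, >5% to roll back, latency spike >50% above baseline) are concrete enough to express as a small decision helper. The following is a minimal sketch and is not part of the package. The `StageMetrics` type, the `evaluate_stage` function, the use of P95 as the latency measure, and the mapping of WARNING conditions to a PAUSE decision are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    """Metrics observed during one verification window (hypothetical shape)."""
    error_rate_pct: float          # failed requests as a percentage
    p95_latency_ms: float          # observed P95 latency (assumed measure)
    baseline_p95_ms: float         # pre-deployment P95 latency
    service_available: bool = True
    data_corruption: bool = False


def evaluate_stage(m: StageMetrics) -> str:
    """Return "ROLLBACK", "PAUSE", or "GO" from the checklist's thresholds."""
    if m.baseline_p95_ms <= 0:
        raise ValueError("baseline_p95_ms must be positive")

    # MUST ROLLBACK: error rate >5%, full outage, or data corruption.
    if m.error_rate_pct > 5.0 or not m.service_available or m.data_corruption:
        return "ROLLBACK"

    # WARNING: error rate >1% or latency more than 50% above baseline.
    # Treating a warning as PAUSE is an assumption of this sketch.
    increase_pct = (m.p95_latency_ms - m.baseline_p95_ms) / m.baseline_p95_ms * 100
    if m.error_rate_pct > 1.0 or increase_pct > 50.0:
        return "PAUSE"

    return "GO"


# Values taken from the sign-off template: 0.08% errors, baseline +5%.
print(evaluate_stage(StageMetrics(0.08, 210.0, 200.0)))  # prints "GO"
```

A helper like this only covers the numeric red flags. The remaining items (security findings, consumer complaints, a failed database rollback) still need a human decision.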