agentic-team-templates 0.3.0
- package/README.md +280 -0
- package/bin/cli.js +5 -0
- package/package.json +47 -0
- package/src/index.js +521 -0
- package/templates/_shared/code-quality.md +162 -0
- package/templates/_shared/communication.md +114 -0
- package/templates/_shared/core-principles.md +62 -0
- package/templates/_shared/git-workflow.md +165 -0
- package/templates/_shared/security-fundamentals.md +173 -0
- package/templates/blockchain/.cursorrules/defi-patterns.md +520 -0
- package/templates/blockchain/.cursorrules/gas-optimization.md +339 -0
- package/templates/blockchain/.cursorrules/overview.md +130 -0
- package/templates/blockchain/.cursorrules/security.md +318 -0
- package/templates/blockchain/.cursorrules/smart-contracts.md +364 -0
- package/templates/blockchain/.cursorrules/testing.md +415 -0
- package/templates/blockchain/.cursorrules/web3-integration.md +538 -0
- package/templates/blockchain/CLAUDE.md +389 -0
- package/templates/cli-tools/.cursorrules/architecture.md +412 -0
- package/templates/cli-tools/.cursorrules/arguments.md +406 -0
- package/templates/cli-tools/.cursorrules/distribution.md +546 -0
- package/templates/cli-tools/.cursorrules/error-handling.md +455 -0
- package/templates/cli-tools/.cursorrules/overview.md +136 -0
- package/templates/cli-tools/.cursorrules/testing.md +537 -0
- package/templates/cli-tools/.cursorrules/user-experience.md +545 -0
- package/templates/cli-tools/CLAUDE.md +356 -0
- package/templates/data-engineering/.cursorrules/data-modeling.md +367 -0
- package/templates/data-engineering/.cursorrules/data-quality.md +455 -0
- package/templates/data-engineering/.cursorrules/overview.md +85 -0
- package/templates/data-engineering/.cursorrules/performance.md +339 -0
- package/templates/data-engineering/.cursorrules/pipeline-design.md +280 -0
- package/templates/data-engineering/.cursorrules/security.md +460 -0
- package/templates/data-engineering/.cursorrules/testing.md +452 -0
- package/templates/data-engineering/CLAUDE.md +974 -0
- package/templates/devops-sre/.cursorrules/capacity-planning.md +653 -0
- package/templates/devops-sre/.cursorrules/change-management.md +584 -0
- package/templates/devops-sre/.cursorrules/chaos-engineering.md +651 -0
- package/templates/devops-sre/.cursorrules/disaster-recovery.md +641 -0
- package/templates/devops-sre/.cursorrules/incident-management.md +565 -0
- package/templates/devops-sre/.cursorrules/observability.md +714 -0
- package/templates/devops-sre/.cursorrules/overview.md +230 -0
- package/templates/devops-sre/.cursorrules/postmortems.md +588 -0
- package/templates/devops-sre/.cursorrules/runbooks.md +760 -0
- package/templates/devops-sre/.cursorrules/slo-sli.md +617 -0
- package/templates/devops-sre/.cursorrules/toil-reduction.md +567 -0
- package/templates/devops-sre/CLAUDE.md +1007 -0
- package/templates/documentation/.cursorrules/adr.md +277 -0
- package/templates/documentation/.cursorrules/api-documentation.md +411 -0
- package/templates/documentation/.cursorrules/code-comments.md +253 -0
- package/templates/documentation/.cursorrules/maintenance.md +260 -0
- package/templates/documentation/.cursorrules/overview.md +82 -0
- package/templates/documentation/.cursorrules/readme-standards.md +306 -0
- package/templates/documentation/CLAUDE.md +120 -0
- package/templates/fullstack/.cursorrules/api-contracts.md +331 -0
- package/templates/fullstack/.cursorrules/architecture.md +298 -0
- package/templates/fullstack/.cursorrules/overview.md +109 -0
- package/templates/fullstack/.cursorrules/shared-types.md +348 -0
- package/templates/fullstack/.cursorrules/testing.md +386 -0
- package/templates/fullstack/CLAUDE.md +349 -0
- package/templates/ml-ai/.cursorrules/data-engineering.md +483 -0
- package/templates/ml-ai/.cursorrules/deployment.md +601 -0
- package/templates/ml-ai/.cursorrules/model-development.md +538 -0
- package/templates/ml-ai/.cursorrules/monitoring.md +658 -0
- package/templates/ml-ai/.cursorrules/overview.md +131 -0
- package/templates/ml-ai/.cursorrules/security.md +637 -0
- package/templates/ml-ai/.cursorrules/testing.md +678 -0
- package/templates/ml-ai/CLAUDE.md +1136 -0
- package/templates/mobile/.cursorrules/navigation.md +246 -0
- package/templates/mobile/.cursorrules/offline-first.md +302 -0
- package/templates/mobile/.cursorrules/overview.md +71 -0
- package/templates/mobile/.cursorrules/performance.md +345 -0
- package/templates/mobile/.cursorrules/testing.md +339 -0
- package/templates/mobile/CLAUDE.md +233 -0
- package/templates/platform-engineering/.cursorrules/ci-cd.md +778 -0
- package/templates/platform-engineering/.cursorrules/developer-experience.md +632 -0
- package/templates/platform-engineering/.cursorrules/infrastructure-as-code.md +600 -0
- package/templates/platform-engineering/.cursorrules/kubernetes.md +710 -0
- package/templates/platform-engineering/.cursorrules/observability.md +747 -0
- package/templates/platform-engineering/.cursorrules/overview.md +215 -0
- package/templates/platform-engineering/.cursorrules/security.md +855 -0
- package/templates/platform-engineering/.cursorrules/testing.md +878 -0
- package/templates/platform-engineering/CLAUDE.md +850 -0
- package/templates/utility-agent/.cursorrules/action-control.md +284 -0
- package/templates/utility-agent/.cursorrules/context-management.md +186 -0
- package/templates/utility-agent/.cursorrules/hallucination-prevention.md +253 -0
- package/templates/utility-agent/.cursorrules/overview.md +78 -0
- package/templates/utility-agent/.cursorrules/token-optimization.md +369 -0
- package/templates/utility-agent/CLAUDE.md +513 -0
- package/templates/web-backend/.cursorrules/api-design.md +255 -0
- package/templates/web-backend/.cursorrules/authentication.md +309 -0
- package/templates/web-backend/.cursorrules/database-patterns.md +298 -0
- package/templates/web-backend/.cursorrules/error-handling.md +366 -0
- package/templates/web-backend/.cursorrules/overview.md +69 -0
- package/templates/web-backend/.cursorrules/security.md +358 -0
- package/templates/web-backend/.cursorrules/testing.md +395 -0
- package/templates/web-backend/CLAUDE.md +366 -0
- package/templates/web-frontend/.cursorrules/accessibility.md +296 -0
- package/templates/web-frontend/.cursorrules/component-patterns.md +204 -0
- package/templates/web-frontend/.cursorrules/overview.md +72 -0
- package/templates/web-frontend/.cursorrules/performance.md +325 -0
- package/templates/web-frontend/.cursorrules/state-management.md +227 -0
- package/templates/web-frontend/.cursorrules/styling.md +271 -0
- package/templates/web-frontend/.cursorrules/testing.md +311 -0
- package/templates/web-frontend/CLAUDE.md +399 -0
@@ -0,0 +1,1007 @@
# DevOps/SRE Development Guide

Staff-level guidelines for building and operating reliable, scalable production systems with a focus on operational excellence.

---

## Overview

This guide applies to:

- Site Reliability Engineering (SRE) practices
- Production operations and incident management
- Monitoring, alerting, and observability systems
- Capacity planning and performance engineering
- Disaster recovery and business continuity
- Toil reduction and automation
- Change management and safe deployments

### Key Principles

1. **Reliability is a Feature** - Users don't distinguish between "the app is slow" and "the app is broken"
2. **Error Budgets Over Perfection** - 100% reliability is the wrong target; balance reliability with velocity
3. **Automate Toil Away** - If you're doing it manually more than twice, automate it
4. **Observability First** - You can't fix what you can't measure
5. **Blameless Culture** - Incidents are learning opportunities, not blame games

### Technology Stack

| Layer | Primary | Alternatives |
|-------|---------|--------------|
| Metrics | Prometheus + Grafana | Datadog, New Relic, InfluxDB |
| Logging | Loki, ELK Stack | Splunk, Datadog Logs |
| Tracing | Jaeger, Tempo | Zipkin, X-Ray, Honeycomb |
| Alerting | Alertmanager, PagerDuty | OpsGenie, VictorOps |
| Incident Management | PagerDuty, Incident.io | OpsGenie, Squadcast |
| Status Pages | Statuspage, Instatus | Cachet, Better Uptime |
| Chaos Engineering | Chaos Mesh, Litmus | Gremlin, AWS FIS |
| Load Testing | k6, Locust | Gatling, JMeter |
| Feature Flags | LaunchDarkly, Unleash | Split, Flagsmith |

---

## SRE Fundamentals

### The SRE Hierarchy of Needs

```
                    ┌─────────────────┐
                    │   Continuous    │  ← Experimentation, A/B testing
                    │   Improvement   │
                ┌───┴─────────────────┴───┐
                │        Release          │  ← Safe, frequent deployments
                │      Engineering        │
            ┌───┴─────────────────────────┴───┐
            │         Observability           │  ← Metrics, logs, traces
            │                                 │
        ┌───┴─────────────────────────────────┴───┐
        │          Incident Response              │  ← Detection, mitigation
        │                                         │
    ┌───┴─────────────────────────────────────────┴───┐
    │            Monitoring/Alerting                  │  ← Know when things break
    │                                                 │
┌───┴─────────────────────────────────────────────────┴───┐
│                      Reliability                        │  ← Core availability
└─────────────────────────────────────────────────────────┘
```

### SLOs, SLIs, and Error Budgets

**SLI (Service Level Indicator)**: A quantitative measure of service behavior

```yaml
# Example SLIs
availability_sli:
  description: "Proportion of successful requests"
  formula: "successful_requests / total_requests"

latency_sli:
  description: "Proportion of requests faster than threshold"
  formula: "requests_under_500ms / total_requests"

throughput_sli:
  description: "Requests processed per second"
  formula: "count(requests) / time_window"
```
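
The three formulas above can be computed directly from request counts and durations. The following is a minimal sketch, not part of the package: the function names, the 500 ms default, and the choice to report 1.0 when there is no traffic are all assumptions made for illustration.

```python
from typing import Sequence


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """successful_requests / total_requests."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic is treated as fully available
    return successful_requests / total_requests


def latency_sli(durations_ms: Sequence[float], threshold_ms: float = 500.0) -> float:
    """Proportion of requests strictly faster than the threshold."""
    if not durations_ms:
        return 1.0  # assumption: same convention as availability_sli
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return fast / len(durations_ms)


def throughput_sli(request_count: int, window_seconds: float) -> float:
    """count(requests) / time_window, in requests per second."""
    return request_count / window_seconds


if __name__ == "__main__":
    print(availability_sli(999, 1000))
    print(latency_sli([100, 200, 600, 499]))
    print(throughput_sli(1200, 60))
```

In practice these ratios would usually come from a metrics backend rather than raw lists; the sketch only makes the arithmetic explicit.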

**SLO (Service Level Objective)**: Target value for an SLI

```yaml
# Example SLOs
api_availability:
  sli: availability_sli
  target: 99.9%
  window: 30 days

api_latency:
  sli: latency_sli
  target: 99%
  threshold: 500ms
  window: 30 days
```

**Error Budget**: The allowed amount of unreliability

```python
# Error budget calculation
slo_target = 0.999  # 99.9%
window_minutes = 30 * 24 * 60  # 30 days

error_budget_minutes = window_minutes * (1 - slo_target)
# = 43.2 minutes of downtime allowed per month

# If we've used 30 minutes, we have 13.2 minutes remaining
# If budget exhausted → slow down deployments, focus on reliability
```
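
The calculation above generalizes to any SLO target and window. This is a small sketch under the same time-based (downtime minutes) model; the helper names `error_budget_minutes` and `budget_remaining` are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total allowed downtime in minutes for the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> tuple[float, float]:
    """Return (minutes remaining, fraction of the budget remaining)."""
    total = error_budget_minutes(slo_target, window_days)
    remaining = total - downtime_minutes
    return remaining, remaining / total


if __name__ == "__main__":
    minutes, fraction = budget_remaining(0.999, downtime_minutes=30)
    print(f"{minutes:.1f} minutes left ({fraction:.0%} of budget)")
```

A negative result means the budget is overspent, which the policy in the next section treats as exhausted.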

### Error Budget Policy

```yaml
# error-budget-policy.yaml
thresholds:
  - level: healthy
    budget_remaining: ">50%"
    actions:
      - "Normal development velocity"
      - "Experimental features allowed"
      - "Risk-tolerant deployments"

  - level: caution
    budget_remaining: "25-50%"
    actions:
      - "Review recent changes for reliability impact"
      - "Increase testing coverage"
      - "Limit risky deployments"

  - level: critical
    budget_remaining: "10-25%"
    actions:
      - "Reliability improvements prioritized"
      - "Feature freeze for non-critical work"
      - "Mandatory rollback plans"

  - level: exhausted
    budget_remaining: "<10%"
    actions:
      - "Full feature freeze"
      - "All hands on reliability"
      - "Post-incident review required for any deploy"
```
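
The policy bands can be turned into a lookup so that tooling reports the current level automatically. A minimal sketch follows; the function name is hypothetical, and because the policy lists 25% in two bands, assigning exact boundary values to the less severe band is an assumption.

```python
def budget_level(remaining_fraction: float) -> str:
    """Map the fraction of error budget remaining (0.0 to 1.0) to a policy level.

    Assumption: a value exactly on a shared boundary (25%) falls into the
    less severe band; 50% is "caution" because "healthy" requires >50%,
    and 10% is "critical" because "exhausted" requires <10%.
    """
    if remaining_fraction > 0.50:
        return "healthy"
    if remaining_fraction >= 0.25:
        return "caution"
    if remaining_fraction >= 0.10:
        return "critical"
    return "exhausted"


if __name__ == "__main__":
    # 13.2 of 43.2 minutes remaining, as in the error budget example
    print(budget_level(13.2 / 43.2))
```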

---

## Monitoring & Alerting

### The Four Golden Signals

```yaml
# Monitor these for every service
golden_signals:
  latency:
    description: "Time to service a request"
    metrics:
      - http_request_duration_seconds_histogram
    alerts:
      - p50 > 200ms
      - p99 > 1000ms

  traffic:
    description: "Demand on the system"
    metrics:
      - http_requests_total
    alerts:
      - sudden_drop > 50%
      - sudden_spike > 200%

  errors:
    description: "Rate of failed requests"
    metrics:
      - http_requests_total{status=~"5.."}
    alerts:
      - error_rate > 1%
      - error_rate > 5% (critical)

  saturation:
    description: "How full the system is"
    metrics:
      - cpu_usage_percent
      - memory_usage_percent
      - disk_usage_percent
    alerts:
      - cpu > 80%
      - memory > 85%
      - disk > 90%
```

### Alert Quality Guidelines

```yaml
# Good alerts are:
alert_quality_checklist:
  actionable: "Every alert should have a clear action to take"
  urgent: "If it can wait until morning, it shouldn't page"
  relevant: "Every alert should reflect real impact; noisy alerts cause fatigue that burns out on-call engineers"

# Alert severity levels
severity_definitions:
  critical:
    description: "Service is down or severely degraded"
    response_time: "Immediate (page on-call)"
    examples:
      - "API error rate > 10%"
      - "Database unreachable"
      - "All pods in CrashLoopBackOff"

  warning:
    description: "Service degradation or approaching limits"
    response_time: "Within 1 hour (Slack notification)"
    examples:
      - "Disk usage > 80%"
      - "Error rate > 1%"
      - "Latency p99 > 2s"

  info:
    description: "Notable events, no immediate action needed"
    response_time: "Next business day"
    examples:
      - "Deployment completed"
      - "Certificate expires in 30 days"
      - "Unusual traffic pattern"
```

### Prometheus Alerting Rules

```yaml
# prometheus-alerts.yaml
groups:
  - name: api-server
    rules:
      # High Error Rate
      - alert: APIHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api-server",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="api-server"}[5m]))
          > 0.01
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "API error rate above 1%"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbooks/api-high-error-rate"

      # High Latency
      - alert: APIHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="api-server"}[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "API p99 latency above 1 second"
          description: "p99 latency is {{ $value | humanizeDuration }}"
          runbook_url: "https://wiki.example.com/runbooks/api-high-latency"

      # SLO Burn Rate (Multi-window)
      - alert: APIAvailabilitySLOBreach
        expr: |
          (
            # Fast burn (last 1h)
            sum(rate(http_requests_total{job="api-server",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="api-server"}[1h]))
            > (14.4 * 0.001)  # 14.4x burn rate for 1h window
          )
          and
          (
            # Slow burn (last 6h)
            sum(rate(http_requests_total{job="api-server",status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{job="api-server"}[6h]))
            > (6 * 0.001)  # 6x burn rate for 6h window
          )
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "API availability SLO at risk"
          description: "Error budget burn rate indicates SLO breach within window"
```
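
The factors 14.4 and 6 in the burn-rate rule are not explained in the guide. One reading, offered here as an assumption rather than the author's stated derivation, is that they are the burn rates at which 2% of a 30-day error budget is spent in 1 hour and 5% in 6 hours. The sketch below reproduces that arithmetic; the function names are hypothetical.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window, as in the SLO examples


def burn_rate(budget_fraction: float, alert_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window.

    A burn rate of 1 spends exactly the whole budget over the full SLO window.
    """
    return budget_fraction * SLO_WINDOW_HOURS / alert_window_hours


def error_ratio_threshold(rate: float, slo_target: float = 0.999) -> float:
    """Error ratio corresponding to a burn rate, i.e. rate * (1 - SLO)."""
    return rate * (1 - slo_target)


if __name__ == "__main__":
    fast = burn_rate(0.02, 1)  # 2% of budget in 1 hour
    slow = burn_rate(0.05, 6)  # 5% of budget in 6 hours
    print(fast, error_ratio_threshold(fast))
    print(slow, error_ratio_threshold(slow))
```

Under this reading the rule's `14.4 * 0.001` and `6 * 0.001` are the error ratios for a 99.9% SLO; a different SLO target would change the second factor, not the burn rates.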

---

## Incident Management

### Incident Severity Levels

```yaml
severity_levels:
  sev1:
    name: "Critical"
    description: "Complete service outage or data loss"
    response_time: "Immediate"
    communication: "Status page, exec notification, all-hands war room"
    examples:
      - "Production database down"
      - "Security breach in progress"
      - "Payment processing completely failed"

  sev2:
    name: "Major"
    description: "Significant degradation affecting many users"
    response_time: "15 minutes"
    communication: "Status page, stakeholder notification"
    examples:
      - "50% of API requests failing"
      - "Search functionality broken"
      - "Mobile app unable to sync"

  sev3:
    name: "Minor"
    description: "Limited impact, workaround available"
    response_time: "1 hour"
    communication: "Internal Slack channel"
    examples:
      - "Admin panel slow"
      - "Export feature broken"
      - "Non-critical background jobs failing"

  sev4:
    name: "Low"
    description: "Minimal impact, cosmetic issues"
    response_time: "Next business day"
    communication: "Ticket created"
    examples:
      - "UI alignment issues"
      - "Log formatting errors"
      - "Dev environment issues"
```

### Incident Response Process

```
┌─────────────────────────────────────────────────────────────────┐
│                       INCIDENT LIFECYCLE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐   ┌─────────┐   ┌──────────┐   ┌──────────────┐    │
│  │ Detect  │──▶│ Respond │──▶│ Mitigate │──▶│   Resolve    │    │
│  └─────────┘   └─────────┘   └──────────┘   └──────────────┘    │
│       │             │             │                │            │
│       ▼             ▼             ▼                ▼            │
│  - Alerting    - Page on-call  - Stop bleeding  - Root cause    │
│  - Monitoring  - Declare       - Rollback       - Fix forward   │
│  - User report - Assign roles  - Scale up       - Deploy fix    │
│                - War room      - Failover                       │
│                                                                 │
│                                │                                │
│                                ▼                                │
│                       ┌──────────────────┐                      │
│                       │    Postmortem    │                      │
│                       └──────────────────┘                      │
│                                │                                │
│                                ▼                                │
│   - Timeline  - Root cause  - Action items  - Share learnings   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Incident Commander Role

```yaml
incident_commander_responsibilities:
  coordination:
    - "Single point of contact for incident"
    - "Assign roles (communications, technical lead, scribe)"
    - "Make decisions when consensus isn't reached"
    - "Escalate when needed"

  communication:
    - "Regular status updates (every 15-30 min)"
    - "Stakeholder management"
    - "Status page updates"
    - "Executive briefings for Sev1"

  process:
    - "Start incident channel/war room"
    - "Track timeline of events"
    - "Ensure postmortem is scheduled"
    - "Close out incident when resolved"

# Incident channel template
slack_channel_template:
  name: "inc-{date}-{short-description}"
  topic: "SEV{level} | IC: @{commander} | Status: {status}"
  pinned_messages:
    - "Incident summary and current status"
    - "Timeline of events"
    - "Runbook links"
```
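
The channel template above can be filled in programmatically when an incident is declared. The sketch below is illustrative only: the helper names are hypothetical, the ISO date format is an assumption (the template does not specify one), and the slug rules (lowercase, hyphens, 80-character cap) are assumptions about typical chat channel naming limits.

```python
import re
from datetime import date


def incident_channel_name(day: date, short_description: str) -> str:
    """Render "inc-{date}-{short-description}" as a channel-safe slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", short_description.lower()).strip("-")
    name = f"inc-{day.isoformat()}-{slug}"
    return name[:80].rstrip("-")  # assumption: 80-character name limit


def incident_topic(level: int, commander: str, status: str) -> str:
    """Render "SEV{level} | IC: @{commander} | Status: {status}"."""
    return f"SEV{level} | IC: @{commander} | Status: {status}"


if __name__ == "__main__":
    print(incident_channel_name(date(2025, 1, 15), "API High Error Rate!"))
    print(incident_topic(2, "alice", "Mitigating"))
```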

---

## On-Call Best Practices

### On-Call Rotation

```yaml
oncall_structure:
  rotation_length: "1 week"
  handoff_day: "Monday 9am local time"
  coverage: "24/7"

  roles:
    primary:
      responsibilities:
        - "First responder to all pages"
        - "Initial triage and escalation"
        - "Document all incidents"
      response_sla: "15 minutes"

    secondary:
      responsibilities:
        - "Backup if primary unavailable"
        - "Help with prolonged incidents"
        - "Escalation point"
      response_sla: "30 minutes"

  escalation_path:
    - "Primary on-call"
    - "Secondary on-call"
    - "Team lead"
    - "Engineering manager"
    - "VP Engineering"

handoff_checklist:
  - "Review active incidents"
  - "Check pending alerts"
  - "Verify pager is working"
  - "Review recent deployments"
  - "Check error budget status"
```

### On-Call Health

```yaml
oncall_health_metrics:
  targets:
    pages_per_shift: "< 10"
    pages_per_night: "< 2"
    mean_time_to_acknowledge: "< 5 minutes"
    mean_time_to_resolve: "< 1 hour"
    false_positive_rate: "< 10%"

burnout_prevention:
  - "Compensatory time off after heavy on-call"
  - "No back-to-back on-call weeks"
  - "Follow-the-sun rotation for global teams"
  - "On-call load balancing across team"
  - "Regular review of alert quality"
```
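
A rotation review can compare a shift's observed numbers against the targets above. This is a minimal sketch; the dictionary keys with unit suffixes and the function name are hypothetical, and every target is treated as a strict upper bound because each is written with `<`.

```python
# Upper bounds taken from oncall_health_metrics.targets; units are encoded
# in the (hypothetical) key names: minutes for times, a 0-1 ratio for rates.
TARGETS = {
    "pages_per_shift": 10,
    "pages_per_night": 2,
    "mean_time_to_acknowledge_min": 5,
    "mean_time_to_resolve_min": 60,
    "false_positive_rate": 0.10,
}


def breached_targets(observed: dict[str, float]) -> list[str]:
    """Return the names of targets that the observed values fail to meet."""
    return sorted(
        name for name, limit in TARGETS.items() if observed[name] >= limit
    )


if __name__ == "__main__":
    shift = {
        "pages_per_shift": 12,
        "pages_per_night": 1,
        "mean_time_to_acknowledge_min": 4,
        "mean_time_to_resolve_min": 75,
        "false_positive_rate": 0.05,
    }
    print(breached_targets(shift))
```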

---

## Runbooks

### Runbook Template

````markdown
# Runbook: [Alert/Issue Name]

## Overview
Brief description of what this runbook addresses.

## Severity
- **Impact**: [What breaks when this happens]
- **Urgency**: [How quickly must this be resolved]

## Prerequisites
- Access to: [systems, dashboards, tools]
- Permissions: [required roles/access]

## Symptoms
- Alert: `AlertName` fires
- Users report: [symptoms]
- Dashboards show: [metrics]

## Diagnosis Steps
1. Check [specific metric/log]
   ```bash
   kubectl logs -l app=api-server --tail=100
   ```
2. Verify [dependency/connection]
3. Check recent changes
   ```bash
   kubectl rollout history deployment/api-server
   ```

## Resolution Steps

### Quick Mitigation (stop the bleeding)
1. Scale up if capacity issue:
   ```bash
   kubectl scale deployment/api-server --replicas=10
   ```
2. Rollback if recent deployment:
   ```bash
   kubectl rollout undo deployment/api-server
   ```

### Root Cause Fix
1. [Step-by-step fix instructions]
2. [Verification commands]

## Escalation
- If not resolved in 30 minutes, escalate to: [team/person]
- For data loss scenarios, immediately notify: [person]

## Prevention
- Related improvements: [links to tickets]
- Monitoring gaps: [what to add]

## History
| Date | Author | Change |
|------|--------|--------|
| 2025-01-15 | @engineer | Initial version |
````

---

## Capacity Planning

### Capacity Metrics

```yaml
capacity_dimensions:
  compute:
    metrics:
      - cpu_utilization
      - memory_utilization
      - pod_count
    thresholds:
      warning: 70%
      critical: 85%

  storage:
    metrics:
      - disk_usage_percent
      - iops_utilization
      - throughput_utilization
    thresholds:
      warning: 75%
      critical: 90%

  network:
    metrics:
      - bandwidth_utilization
      - connection_count
      - packet_loss_rate
    thresholds:
      warning: 60%
      critical: 80%

  database:
    metrics:
      - connection_pool_usage
      - query_latency_p99
      - replication_lag
    thresholds:
      warning: 70%
      critical: 85%
```

### Load Testing Strategy

```yaml
load_testing:
  types:
    smoke:
      description: "Verify system handles minimal load"
      duration: "5 minutes"
      users: "10"
      frequency: "Every deployment"

    load:
      description: "Test expected production load"
      duration: "30 minutes"
      users: "Expected peak * 1.5"
      frequency: "Weekly"

    stress:
      description: "Find breaking point"
      duration: "Until failure"
      users: "Ramp until errors"
      frequency: "Monthly"

    soak:
      description: "Test sustained load over time"
      duration: "24 hours"
      users: "Expected average"
      frequency: "Before major releases"

# k6 example
k6_load_test: |
  import http from 'k6/http';
  import { check, sleep } from 'k6';

  export const options = {
    stages: [
      { duration: '5m', target: 100 },   // Ramp up
      { duration: '30m', target: 100 },  // Stay at peak
      { duration: '5m', target: 0 },     // Ramp down
    ],
    thresholds: {
      http_req_duration: ['p(99)<500'],
      http_req_failed: ['rate<0.01'],
    },
  };

  export default function () {
    const res = http.get('https://api.example.com/health');
    check(res, {
      'status is 200': (r) => r.status === 200,
      'response time < 500ms': (r) => r.timings.duration < 500,
    });
    sleep(1);
  }
```

---

## Change Management

### Deployment Safety

```yaml
deployment_checklist:
  pre_deploy:
    - "All tests passing in CI"
    - "Code reviewed and approved"
    - "Feature flags in place for risky changes"
    - "Rollback plan documented"
    - "Monitoring dashboards open"
    - "On-call engineer aware"

  during_deploy:
    - "Watch error rates during rollout"
    - "Monitor latency metrics"
    - "Check application logs for errors"
    - "Verify health checks passing"

  post_deploy:
    - "Smoke test critical paths"
    - "Compare metrics to baseline"
    - "Check for error rate increases"
    - "Update deployment log"

# Progressive delivery stages
progressive_delivery:
  canary:
    traffic_percentage: 5%
    duration: "15 minutes"
    success_criteria:
      - error_rate < baseline * 1.1
      - latency_p99 < baseline * 1.2

  partial:
    traffic_percentage: 25%
    duration: "30 minutes"
    success_criteria:
      - error_rate < baseline * 1.05
      - latency_p99 < baseline * 1.1

  majority:
    traffic_percentage: 75%
    duration: "1 hour"
    success_criteria:
      - error_rate ≈ baseline
      - latency_p99 ≈ baseline

  full:
    traffic_percentage: 100%
    bake_time: "24 hours"
```
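
The per-stage success criteria above amount to comparing the new version's metrics against baseline multipliers. A minimal sketch of such a promotion gate follows. The names are hypothetical, and because the guide does not quantify "≈ baseline" for the majority stage, the 2% tolerance used here is purely an assumption.

```python
# (error-rate multiplier, p99-latency multiplier) per stage.
STAGE_LIMITS = {
    "canary": (1.10, 1.20),
    "partial": (1.05, 1.10),
    "majority": (1.02, 1.02),  # assumption: "≈ baseline" read as within 2%
}


def stage_passes(stage: str, error_rate: float, latency_p99: float,
                 baseline_error_rate: float, baseline_latency_p99: float) -> bool:
    """True if both metrics stay under the stage's baseline multipliers."""
    error_mult, latency_mult = STAGE_LIMITS[stage]
    return (
        error_rate < baseline_error_rate * error_mult
        and latency_p99 < baseline_latency_p99 * latency_mult
    )


if __name__ == "__main__":
    # Baseline: 1% errors, 210 ms p99. New version: 1.05% errors, 240 ms p99.
    print(stage_passes("canary", 0.0105, 240, 0.01, 210))
    print(stage_passes("partial", 0.0108, 240, 0.01, 210))
```

A gate like this would typically run at the end of each stage's `duration` before traffic is increased.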

### Rollback Procedures

```yaml
rollback_triggers:
  automatic:
    - "Error rate > 5% for 5 minutes"
    - "Latency p99 > 3x baseline for 10 minutes"
    - "Health check failures > 50%"

  manual:
    - "User-reported critical bugs"
    - "Security vulnerability discovered"
    - "Data corruption detected"

rollback_commands:
  kubernetes:
    immediate: |
      kubectl rollout undo deployment/api-server
    to_specific_version: |
      kubectl rollout undo deployment/api-server --to-revision=42
    verify: |
      kubectl rollout status deployment/api-server

  argocd:
    immediate: |
      argocd app rollback api-server
    sync_to_previous: |
      argocd app sync api-server --revision HEAD~1
```
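
The automatic triggers above are "sustained" conditions: a threshold must be exceeded continuously for a duration, not just once. The sketch below shows one way to evaluate that over time-ordered samples. The function names and the sample format are hypothetical, and it only decides whether to roll back; it does not run any rollback command.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def sustained_above(samples: Sequence[Sample], threshold: float,
                    duration_s: float) -> bool:
    """True if every sample in the trailing `duration_s` exceeds the threshold.

    Walks backwards from the newest sample until the first value at or below
    the threshold, then checks that the unbroken streak is long enough.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    streak_start = None
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        streak_start = ts
    return streak_start is not None and latest_ts - streak_start >= duration_s


def should_roll_back(error_rate_samples: Sequence[Sample],
                     p99_samples: Sequence[Sample],
                     baseline_p99: float,
                     health_check_failure_ratio: float) -> bool:
    """Combine the three automatic triggers from rollback_triggers."""
    return (
        sustained_above(error_rate_samples, 0.05, 5 * 60)
        or sustained_above(p99_samples, 3 * baseline_p99, 10 * 60)
        or health_check_failure_ratio > 0.50
    )


if __name__ == "__main__":
    errors = [(t, 0.06) for t in range(0, 301, 60)]  # 6% for 5 minutes
    print(should_roll_back(errors, [], baseline_p99=0.2,
                           health_check_failure_ratio=0.0))
```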

---

## Disaster Recovery

### RTO and RPO Definitions

```yaml
recovery_objectives:
  tier1_critical:
    services:
      - "Payment processing"
      - "User authentication"
      - "Core API"
    rto: "15 minutes"  # Recovery Time Objective
    rpo: "0 minutes"   # Recovery Point Objective (no data loss)
    strategy: "Active-active multi-region"

  tier2_important:
    services:
      - "Search functionality"
      - "Notifications"
      - "Analytics ingestion"
    rto: "1 hour"
    rpo: "15 minutes"
    strategy: "Warm standby with automated failover"

  tier3_standard:
    services:
      - "Admin dashboard"
      - "Reporting"
      - "Batch processing"
    rto: "4 hours"
    rpo: "1 hour"
    strategy: "Cold standby with manual failover"
```
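
To make the objectives concrete, a drill result can be compared against its tier: RTO bounds how long recovery may take, and RPO bounds how much recent data may be lost. The sketch below copies the values from the tiers above; the function name and the result shape are assumptions chosen for illustration.

```python
from datetime import timedelta

# Objectives copied from the recovery tiers above.
OBJECTIVES = {
    "tier1_critical": {"rto": timedelta(minutes=15), "rpo": timedelta(0)},
    "tier2_important": {"rto": timedelta(hours=1), "rpo": timedelta(minutes=15)},
    "tier3_standard": {"rto": timedelta(hours=4), "rpo": timedelta(hours=1)},
}


def evaluate_drill(tier, recovery_time, data_loss_window):
    """Compare one drill's measured results against the tier's objectives.

    recovery_time: how long it took to restore service.
    data_loss_window: age of the newest data that could be restored.
    """
    target = OBJECTIVES[tier]
    return {
        "rto_met": recovery_time <= target["rto"],
        "rpo_met": data_loss_window <= target["rpo"],
    }
```

For example, a tier 2 drill that recovers in 40 minutes but loses 20 minutes of data meets its RTO and misses its RPO.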

### Backup Strategy

```yaml
backup_strategy:
  databases:
    type: "Continuous replication + daily snapshots"
    retention:
      continuous: "7 days"
      daily: "30 days"
      weekly: "1 year"
    testing: "Monthly restore test"
    location: "Cross-region"

  object_storage:
    type: "Cross-region replication"
    versioning: "Enabled"
    retention: "Per data classification"

  configuration:
    type: "GitOps (versioned in Git)"
    backup: "Repository mirroring"

  secrets:
    type: "Vault replication"
    backup: "Encrypted offline backup monthly"
```
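
The daily and weekly retention windows for database snapshots can be expressed as a small pruning rule. This is a sketch under stated assumptions: the weekly snapshot is taken to be the one from Sunday, "1 year" is approximated as 365 days, and `keep_snapshot` is a hypothetical helper rather than a feature of any backup tool.

```python
from datetime import date, timedelta

DAILY_RETENTION = timedelta(days=30)
WEEKLY_RETENTION = timedelta(days=365)  # assumption: "1 year" as 365 days
WEEKLY_SNAPSHOT_WEEKDAY = 6             # assumption: Sunday (Monday is 0)


def keep_snapshot(snapshot_day, today):
    """Return True if a snapshot taken on `snapshot_day` should be retained."""
    age = today - snapshot_day
    if age <= DAILY_RETENTION:
        return True  # every daily snapshot is kept for 30 days
    is_weekly = snapshot_day.weekday() == WEEKLY_SNAPSHOT_WEEKDAY
    return is_weekly and age <= WEEKLY_RETENTION
```

Running this rule as part of the monthly restore test is one way to confirm that the snapshots the policy promises actually exist.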

### DR Testing

```yaml
dr_testing_schedule:
  tabletop_exercise:
    frequency: "Quarterly"
    participants: "All on-call engineers"
    scope: "Walk through DR procedures"

  component_failover:
    frequency: "Monthly"
    scope: "Individual service failover"
    examples:
      - "Database failover to replica"
      - "Redis cluster failover"
      - "Load balancer failover"

  regional_failover:
    frequency: "Twice a year"
    scope: "Full region evacuation"
    preparation:
      - "Notify stakeholders"
      - "Schedule maintenance window"
      - "Pre-position support staff"

  chaos_game_day:
    frequency: "Quarterly"
    scope: "Inject failures in production"
    examples:
      - "Kill random pods"
      - "Inject network latency"
      - "Simulate AZ failure"
```

---

## Postmortems

### Blameless Postmortem Template

```markdown
# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD
**Authors**: [Names]
**Status**: Draft | In Review | Complete
**Severity**: SEV1 | SEV2 | SEV3

## Summary

One paragraph summary of what happened and the impact.

## Impact

- **Duration**: X hours Y minutes
- **Users affected**: X% of users / Y users
- **Revenue impact**: $X (if applicable)
- **SLO impact**: X% of monthly error budget consumed

## Timeline (all times UTC)

| Time | Event |
|------|-------|
| 14:00 | Deployment started |
| 14:05 | First alerts fired |
| 14:10 | On-call acknowledged |
| 14:15 | Incident declared |
| 14:30 | Root cause identified |
| 14:35 | Rollback initiated |
| 14:40 | Service recovered |
| 15:00 | Incident closed |

## Root Cause

Detailed technical explanation of what went wrong and why.

## Contributing Factors

- Factor 1: [What made this worse or possible]
- Factor 2: [Process/tooling gaps]
- Factor 3: [Environmental conditions]

## What Went Well

- Quick detection (5 minutes to alert)
- Clear runbooks available
- Effective communication in war room

## What Went Poorly

- Rollback took longer than expected
- Initial diagnosis went down the wrong path
- Status page update was delayed

## Action Items

| Action | Type | Owner | Due Date | Status |
|--------|------|-------|----------|--------|
| Add pre-deploy smoke tests | Prevent | @eng1 | 2025-02-01 | TODO |
| Improve rollback automation | Mitigate | @eng2 | 2025-02-15 | TODO |
| Add metric for early detection | Detect | @eng3 | 2025-02-01 | TODO |
| Update runbook with lessons | Process | @eng4 | 2025-01-20 | DONE |

## Lessons Learned

What should the broader organization learn from this incident?

## Appendix

- Links to dashboards
- Relevant logs
- Related incidents
```
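
The timeline in the template is enough to derive the response metrics a postmortem usually reports. The sketch below uses the sample timeline above and treats the deployment start as the start of the incident, which is an assumption; the metric names and helper functions are illustrative.

```python
from datetime import datetime

# Sample timeline from the template above (all times UTC, same day).
TIMELINE = {
    "Deployment started": "14:00",
    "First alerts fired": "14:05",
    "On-call acknowledged": "14:10",
    "Service recovered": "14:40",
}


def minutes_between(start, end):
    """Whole minutes from `start` to `end`, both "HH:MM" on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def response_metrics(timeline):
    """Derive detection, acknowledgement, and recovery times in minutes."""
    started = timeline["Deployment started"]  # assumed incident start
    alerted = timeline["First alerts fired"]
    return {
        "time_to_detect": minutes_between(started, alerted),
        "time_to_acknowledge": minutes_between(alerted, timeline["On-call acknowledged"]),
        "time_to_recover": minutes_between(started, timeline["Service recovered"]),
    }
```

For the sample timeline this yields 5 minutes to detect, 5 minutes to acknowledge, and 40 minutes to recover, which matches the "5 minutes to alert" note under What Went Well.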

---

## Staff Engineer Responsibilities

### Technical Leadership

- Define and evolve reliability standards across the organization
- Make build vs. buy decisions for tooling
- Establish SLO frameworks and error budget policies
- Mentor engineers on operational excellence
- Drive adoption of SRE practices

### Cross-Team Enablement

- Design observability standards that work across all services
- Create reusable runbook templates and incident response procedures
- Build automation that reduces toil organization-wide
- Establish on-call best practices and health metrics
- Lead chaos engineering initiatives

### Operational Excellence

- Own the incident management process
- Drive postmortem quality and follow-through
- Reduce mean time to detection and recovery
- Eliminate recurring incidents through systemic fixes
- Balance reliability investments with feature velocity

### Strategic Thinking

- Align reliability investments with business priorities
- Plan capacity for growth projections
- Design disaster recovery strategies
- Evaluate emerging technologies for operational improvement
- Manage technical debt in operational tooling

---

## Definition of Done

### Reliability Feature

- [ ] SLOs defined with measurable SLIs
- [ ] Alerts configured with runbooks
- [ ] Dashboards created for key metrics
- [ ] Load tested to expected capacity
- [ ] Failure modes documented
- [ ] DR procedures tested

### Incident Response

- [ ] Incident severity correctly assessed
- [ ] Timeline accurately documented
- [ ] Stakeholders appropriately notified
- [ ] Root cause identified (not just symptoms)
- [ ] Postmortem completed within 5 business days
- [ ] Action items tracked to completion

### On-Call Improvement

- [ ] Alert has clear action to take
- [ ] Runbook is accurate and tested
- [ ] False positive rate < 10%
- [ ] Alert fires with enough time to act
- [ ] Escalation path is clear
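
The false positive threshold in the checklist above can be checked from alert history. This is a minimal sketch that assumes each fired alert has been labeled as actionable or not during review; the function names and the labeling process are assumptions.

```python
def false_positive_rate(total_fired, actionable):
    """Fraction of fired alerts that required no action."""
    if total_fired == 0:
        return 0.0  # no alerts fired, so there is nothing to count as noise
    return (total_fired - actionable) / total_fired


def meets_alert_quality_bar(total_fired, actionable, threshold=0.10):
    """True if the false positive rate is strictly below the threshold."""
    return false_positive_rate(total_fired, actionable) < threshold
```

An alert that fired 40 times with 37 actionable pages has a 7.5% false positive rate and passes; one at exactly 10% does not, because the checklist asks for a rate below 10%.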

---

## Common Pitfalls

### 1. Alert Fatigue

❌ **Wrong**: Alert on everything "just in case"

✅ **Right**: Every alert must be actionable, urgent, and relevant

### 2. SLOs as Targets, Not Limits

❌ **Wrong**: "We must hit exactly 99.9%"

✅ **Right**: SLOs define acceptable reliability; use error budget for velocity
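
As a worked example of the error budget idea, an availability SLO can be converted into allowed downtime and an incident into a share of that budget. The sketch assumes a 30-day window and a purely time-based availability SLI; both are simplifications, and the function names are illustrative.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    return (1 - slo) * window_days * 24 * 60


def budget_consumed(slo, downtime_minutes, window_days=30):
    """Fraction of the window's error budget used by `downtime_minutes`."""
    return downtime_minutes / error_budget_minutes(slo, window_days)
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime, so a 40-minute outage consumes roughly 93% of the monthly budget. Unspent budget is what makes room for riskier releases.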

### 3. Blame Culture

❌ **Wrong**: "Who caused this outage?"

✅ **Right**: "What systemic factors allowed this to happen?"

### 4. Manual Heroics

❌ **Wrong**: Relying on engineer availability to keep systems running

✅ **Right**: Automate recovery, build self-healing systems

### 5. Postmortem Theater

❌ **Wrong**: Write postmortem, create action items, never follow up

✅ **Right**: Track action items to completion, measure improvement

---

## Resources

- [Google SRE Book](https://sre.google/sre-book/table-of-contents/)
- [Google SRE Workbook](https://sre.google/workbook/table-of-contents/)
- [The Art of SLOs](https://sre.google/resources/practices-and-processes/art-of-slos/)
- [Incident Management for Operations](https://www.pagerduty.com/resources/learn/incident-management/)
- [Chaos Engineering Principles](https://principlesofchaos.org/)
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)