agentic-team-templates 0.3.0

Files changed (103)
  1. package/README.md +280 -0
  2. package/bin/cli.js +5 -0
  3. package/package.json +47 -0
  4. package/src/index.js +521 -0
  5. package/templates/_shared/code-quality.md +162 -0
  6. package/templates/_shared/communication.md +114 -0
  7. package/templates/_shared/core-principles.md +62 -0
  8. package/templates/_shared/git-workflow.md +165 -0
  9. package/templates/_shared/security-fundamentals.md +173 -0
  10. package/templates/blockchain/.cursorrules/defi-patterns.md +520 -0
  11. package/templates/blockchain/.cursorrules/gas-optimization.md +339 -0
  12. package/templates/blockchain/.cursorrules/overview.md +130 -0
  13. package/templates/blockchain/.cursorrules/security.md +318 -0
  14. package/templates/blockchain/.cursorrules/smart-contracts.md +364 -0
  15. package/templates/blockchain/.cursorrules/testing.md +415 -0
  16. package/templates/blockchain/.cursorrules/web3-integration.md +538 -0
  17. package/templates/blockchain/CLAUDE.md +389 -0
  18. package/templates/cli-tools/.cursorrules/architecture.md +412 -0
  19. package/templates/cli-tools/.cursorrules/arguments.md +406 -0
  20. package/templates/cli-tools/.cursorrules/distribution.md +546 -0
  21. package/templates/cli-tools/.cursorrules/error-handling.md +455 -0
  22. package/templates/cli-tools/.cursorrules/overview.md +136 -0
  23. package/templates/cli-tools/.cursorrules/testing.md +537 -0
  24. package/templates/cli-tools/.cursorrules/user-experience.md +545 -0
  25. package/templates/cli-tools/CLAUDE.md +356 -0
  26. package/templates/data-engineering/.cursorrules/data-modeling.md +367 -0
  27. package/templates/data-engineering/.cursorrules/data-quality.md +455 -0
  28. package/templates/data-engineering/.cursorrules/overview.md +85 -0
  29. package/templates/data-engineering/.cursorrules/performance.md +339 -0
  30. package/templates/data-engineering/.cursorrules/pipeline-design.md +280 -0
  31. package/templates/data-engineering/.cursorrules/security.md +460 -0
  32. package/templates/data-engineering/.cursorrules/testing.md +452 -0
  33. package/templates/data-engineering/CLAUDE.md +974 -0
  34. package/templates/devops-sre/.cursorrules/capacity-planning.md +653 -0
  35. package/templates/devops-sre/.cursorrules/change-management.md +584 -0
  36. package/templates/devops-sre/.cursorrules/chaos-engineering.md +651 -0
  37. package/templates/devops-sre/.cursorrules/disaster-recovery.md +641 -0
  38. package/templates/devops-sre/.cursorrules/incident-management.md +565 -0
  39. package/templates/devops-sre/.cursorrules/observability.md +714 -0
  40. package/templates/devops-sre/.cursorrules/overview.md +230 -0
  41. package/templates/devops-sre/.cursorrules/postmortems.md +588 -0
  42. package/templates/devops-sre/.cursorrules/runbooks.md +760 -0
  43. package/templates/devops-sre/.cursorrules/slo-sli.md +617 -0
  44. package/templates/devops-sre/.cursorrules/toil-reduction.md +567 -0
  45. package/templates/devops-sre/CLAUDE.md +1007 -0
  46. package/templates/documentation/.cursorrules/adr.md +277 -0
  47. package/templates/documentation/.cursorrules/api-documentation.md +411 -0
  48. package/templates/documentation/.cursorrules/code-comments.md +253 -0
  49. package/templates/documentation/.cursorrules/maintenance.md +260 -0
  50. package/templates/documentation/.cursorrules/overview.md +82 -0
  51. package/templates/documentation/.cursorrules/readme-standards.md +306 -0
  52. package/templates/documentation/CLAUDE.md +120 -0
  53. package/templates/fullstack/.cursorrules/api-contracts.md +331 -0
  54. package/templates/fullstack/.cursorrules/architecture.md +298 -0
  55. package/templates/fullstack/.cursorrules/overview.md +109 -0
  56. package/templates/fullstack/.cursorrules/shared-types.md +348 -0
  57. package/templates/fullstack/.cursorrules/testing.md +386 -0
  58. package/templates/fullstack/CLAUDE.md +349 -0
  59. package/templates/ml-ai/.cursorrules/data-engineering.md +483 -0
  60. package/templates/ml-ai/.cursorrules/deployment.md +601 -0
  61. package/templates/ml-ai/.cursorrules/model-development.md +538 -0
  62. package/templates/ml-ai/.cursorrules/monitoring.md +658 -0
  63. package/templates/ml-ai/.cursorrules/overview.md +131 -0
  64. package/templates/ml-ai/.cursorrules/security.md +637 -0
  65. package/templates/ml-ai/.cursorrules/testing.md +678 -0
  66. package/templates/ml-ai/CLAUDE.md +1136 -0
  67. package/templates/mobile/.cursorrules/navigation.md +246 -0
  68. package/templates/mobile/.cursorrules/offline-first.md +302 -0
  69. package/templates/mobile/.cursorrules/overview.md +71 -0
  70. package/templates/mobile/.cursorrules/performance.md +345 -0
  71. package/templates/mobile/.cursorrules/testing.md +339 -0
  72. package/templates/mobile/CLAUDE.md +233 -0
  73. package/templates/platform-engineering/.cursorrules/ci-cd.md +778 -0
  74. package/templates/platform-engineering/.cursorrules/developer-experience.md +632 -0
  75. package/templates/platform-engineering/.cursorrules/infrastructure-as-code.md +600 -0
  76. package/templates/platform-engineering/.cursorrules/kubernetes.md +710 -0
  77. package/templates/platform-engineering/.cursorrules/observability.md +747 -0
  78. package/templates/platform-engineering/.cursorrules/overview.md +215 -0
  79. package/templates/platform-engineering/.cursorrules/security.md +855 -0
  80. package/templates/platform-engineering/.cursorrules/testing.md +878 -0
  81. package/templates/platform-engineering/CLAUDE.md +850 -0
  82. package/templates/utility-agent/.cursorrules/action-control.md +284 -0
  83. package/templates/utility-agent/.cursorrules/context-management.md +186 -0
  84. package/templates/utility-agent/.cursorrules/hallucination-prevention.md +253 -0
  85. package/templates/utility-agent/.cursorrules/overview.md +78 -0
  86. package/templates/utility-agent/.cursorrules/token-optimization.md +369 -0
  87. package/templates/utility-agent/CLAUDE.md +513 -0
  88. package/templates/web-backend/.cursorrules/api-design.md +255 -0
  89. package/templates/web-backend/.cursorrules/authentication.md +309 -0
  90. package/templates/web-backend/.cursorrules/database-patterns.md +298 -0
  91. package/templates/web-backend/.cursorrules/error-handling.md +366 -0
  92. package/templates/web-backend/.cursorrules/overview.md +69 -0
  93. package/templates/web-backend/.cursorrules/security.md +358 -0
  94. package/templates/web-backend/.cursorrules/testing.md +395 -0
  95. package/templates/web-backend/CLAUDE.md +366 -0
  96. package/templates/web-frontend/.cursorrules/accessibility.md +296 -0
  97. package/templates/web-frontend/.cursorrules/component-patterns.md +204 -0
  98. package/templates/web-frontend/.cursorrules/overview.md +72 -0
  99. package/templates/web-frontend/.cursorrules/performance.md +325 -0
  100. package/templates/web-frontend/.cursorrules/state-management.md +227 -0
  101. package/templates/web-frontend/.cursorrules/styling.md +271 -0
  102. package/templates/web-frontend/.cursorrules/testing.md +311 -0
  103. package/templates/web-frontend/CLAUDE.md +399 -0
@@ -0,0 +1,1007 @@
+ # DevOps/SRE Development Guide
+
+ Staff-level guidelines for building and operating reliable, scalable production systems with a focus on operational excellence.
+
+ ---
+
+ ## Overview
+
+ This guide applies to:
+
+ - Site Reliability Engineering (SRE) practices
+ - Production operations and incident management
+ - Monitoring, alerting, and observability systems
+ - Capacity planning and performance engineering
+ - Disaster recovery and business continuity
+ - Toil reduction and automation
+ - Change management and safe deployments
+
+ ### Key Principles
+
+ 1. **Reliability is a Feature** - Users don't distinguish between "the app is slow" and "the app is broken"
+ 2. **Error Budgets Over Perfection** - 100% reliability is the wrong target; balance reliability with velocity
+ 3. **Automate Toil Away** - If you're doing it manually more than twice, automate it
+ 4. **Observability First** - You can't fix what you can't measure
+ 5. **Blameless Culture** - Incidents are learning opportunities, not blame games
+
+ ### Technology Stack
+
+ | Layer | Primary | Alternatives |
+ |-------|---------|--------------|
+ | Metrics | Prometheus + Grafana | Datadog, New Relic, InfluxDB |
+ | Logging | Loki, ELK Stack | Splunk, Datadog Logs |
+ | Tracing | Jaeger, Tempo | Zipkin, X-Ray, Honeycomb |
+ | Alerting | Alertmanager, PagerDuty | OpsGenie, VictorOps |
+ | Incident Management | PagerDuty, Incident.io | OpsGenie, Squadcast |
+ | Status Pages | Statuspage, Instatus | Cachet, Better Uptime |
+ | Chaos Engineering | Chaos Mesh, Litmus | Gremlin, AWS FIS |
+ | Load Testing | k6, Locust | Gatling, JMeter |
+ | Feature Flags | LaunchDarkly, Unleash | Split, Flagsmith |
+
+ ---
+
+ ## SRE Fundamentals
+
+ ### The SRE Hierarchy of Needs
+
+ ```
+                     ┌─────────────────┐
+                     │   Continuous    │  ← Experimentation, A/B testing
+                     │   Improvement   │
+                 ┌───┴─────────────────┴───┐
+                 │         Release         │  ← Safe, frequent deployments
+                 │       Engineering       │
+             ┌───┴─────────────────────────┴───┐
+             │          Observability          │  ← Metrics, logs, traces
+             │                                 │
+         ┌───┴─────────────────────────────────┴───┐
+         │            Incident Response            │  ← Detection, mitigation
+         │                                         │
+     ┌───┴─────────────────────────────────────────┴───┐
+     │               Monitoring/Alerting               │  ← Know when things break
+     │                                                 │
+ ┌───┴─────────────────────────────────────────────────┴───┐
+ │                       Reliability                       │  ← Core availability
+ └─────────────────────────────────────────────────────────┘
+ ```
+
+ ### SLOs, SLIs, and Error Budgets
+
+ **SLI (Service Level Indicator)**: A quantitative measure of service behavior
+
+ ```yaml
+ # Example SLIs
+ availability_sli:
+   description: "Proportion of successful requests"
+   formula: "successful_requests / total_requests"
+
+ latency_sli:
+   description: "Proportion of requests faster than threshold"
+   formula: "requests_under_500ms / total_requests"
+
+ throughput_sli:
+   description: "Requests processed per second"
+   formula: "count(requests) / time_window"
+ ```
+
+ **SLO (Service Level Objective)**: Target value for an SLI
+
+ ```yaml
+ # Example SLOs
+ api_availability:
+   sli: availability_sli
+   target: 99.9%
+   window: 30 days
+
+ api_latency:
+   sli: latency_sli
+   target: 99%
+   threshold: 500ms
+   window: 30 days
+ ```
+
+ **Error Budget**: The allowed amount of unreliability
+
+ ```python
+ # Error budget calculation
+ slo_target = 0.999  # 99.9%
+ window_minutes = 30 * 24 * 60  # 30 days
+
+ error_budget_minutes = window_minutes * (1 - slo_target)
+ # = 43.2 minutes of downtime allowed per month
+
+ # If we've used 30 minutes, we have 13.2 minutes remaining
+ # If budget exhausted → slow down deployments, focus on reliability
+ ```
+
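The arithmetic in the block above can be wrapped in a small helper. This is a minimal sketch and not part of the package: `BudgetStatus` and `error_budget_status` are hypothetical names, and the sketch assumes downtime is tracked in minutes over the same window the SLO uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    total_minutes: float       # downtime allowed over the whole window
    remaining_minutes: float   # downtime still available (negative once overspent)
    remaining_fraction: float  # remaining / total, e.g. 0.25 means 25% left


def error_budget_status(slo_target: float, window_days: int,
                        downtime_minutes: float) -> BudgetStatus:
    """Compute a time-based error budget for an availability SLO (hypothetical helper)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    total = window_minutes * (1 - slo_target)
    remaining = total - downtime_minutes
    return BudgetStatus(total, remaining, remaining / total)


# Same numbers as the example: 99.9% over 30 days with 30 minutes already used.
status = error_budget_status(slo_target=0.999, window_days=30, downtime_minutes=30)
print(f"{status.total_minutes:.1f} min total, "
      f"{status.remaining_minutes:.1f} min remaining")
```

Returning the remaining fraction alongside the raw minutes makes it easy to compare against percentage-based policy thresholds.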
+ ### Error Budget Policy
+
+ ```yaml
+ # error-budget-policy.yaml
+ thresholds:
+   - level: healthy
+     budget_remaining: ">50%"
+     actions:
+       - "Normal development velocity"
+       - "Experimental features allowed"
+       - "Risk-tolerant deployments"
+
+   - level: caution
+     budget_remaining: "25-50%"
+     actions:
+       - "Review recent changes for reliability impact"
+       - "Increase testing coverage"
+       - "Limit risky deployments"
+
+   - level: critical
+     budget_remaining: "10-25%"
+     actions:
+       - "Reliability improvements prioritized"
+       - "Feature freeze for non-critical work"
+       - "Mandatory rollback plans"
+
+   - level: exhausted
+     budget_remaining: "<10%"
+     actions:
+       - "Full feature freeze"
+       - "All hands on reliability"
+       - "Post-incident review required for any deploy"
+ ```
+
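The policy thresholds above translate directly into a lookup. The sketch below is illustrative only: `budget_level` is a hypothetical name, and because the policy lists 25% under both "caution" and "critical", the sketch assumes each boundary value belongs to the less severe level.

```python
def budget_level(remaining_fraction: float) -> str:
    """Map the fraction of error budget remaining (0.0-1.0) to a policy level.

    Assumption: boundary values (exactly 25% or 10%) fall into the less
    severe level, since the policy text leaves them ambiguous.
    """
    if remaining_fraction > 0.50:
        return "healthy"
    if remaining_fraction >= 0.25:
        return "caution"
    if remaining_fraction >= 0.10:
        return "critical"
    return "exhausted"


for fraction in (0.80, 0.40, 0.15, 0.05):
    print(f"{fraction:.0%} remaining -> {budget_level(fraction)}")
```

An overspent budget (a negative fraction) also maps to "exhausted", which matches the intent of a full feature freeze.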
+ ---
+
+ ## Monitoring & Alerting
+
+ ### The Four Golden Signals
+
+ ```yaml
+ # Monitor these for every service
+ golden_signals:
+   latency:
+     description: "Time to service a request"
+     metrics:
+       - http_request_duration_seconds_histogram
+     alerts:
+       - p50 > 200ms
+       - p99 > 1000ms
+
+   traffic:
+     description: "Demand on the system"
+     metrics:
+       - http_requests_total
+     alerts:
+       - sudden_drop > 50%
+       - sudden_spike > 200%
+
+   errors:
+     description: "Rate of failed requests"
+     metrics:
+       - http_requests_total{status=~"5.."}
+     alerts:
+       - error_rate > 1%
+       - error_rate > 5% (critical)
+
+   saturation:
+     description: "How full the system is"
+     metrics:
+       - cpu_usage_percent
+       - memory_usage_percent
+       - disk_usage_percent
+     alerts:
+       - cpu > 80%
+       - memory > 85%
+       - disk > 90%
+ ```
+
+ ### Alert Quality Guidelines
+
+ ```yaml
+ # Good alerts are:
+ alert_quality_checklist:
+   actionable: "Every alert should have a clear action to take"
+   urgent: "If it can wait until morning, it shouldn't page"
+   relevant: "Alert fatigue kills on-call engineers"
+
+ # Alert severity levels
+ severity_definitions:
+   critical:
+     description: "Service is down or severely degraded"
+     response_time: "Immediate (page on-call)"
+     examples:
+       - "API error rate > 10%"
+       - "Database unreachable"
+       - "All pods in CrashLoopBackOff"
+
+   warning:
+     description: "Service degradation or approaching limits"
+     response_time: "Within 1 hour (Slack notification)"
+     examples:
+       - "Disk usage > 80%"
+       - "Error rate > 1%"
+       - "Latency p99 > 2s"
+
+   info:
+     description: "Notable events, no action needed"
+     response_time: "Next business day"
+     examples:
+       - "Deployment completed"
+       - "Certificate expires in 30 days"
+       - "Unusual traffic pattern"
+ ```
+
+ ### Prometheus Alerting Rules
+
+ ```yaml
+ # prometheus-alerts.yaml
+ groups:
+   - name: api-server
+     rules:
+       # High Error Rate
+       - alert: APIHighErrorRate
+         expr: |
+           sum(rate(http_requests_total{job="api-server",status=~"5.."}[5m]))
+           /
+           sum(rate(http_requests_total{job="api-server"}[5m]))
+           > 0.01
+         for: 5m
+         labels:
+           severity: warning
+           team: backend
+         annotations:
+           summary: "API error rate above 1%"
+           description: "Error rate is {{ $value | humanizePercentage }}"
+           runbook_url: "https://wiki.example.com/runbooks/api-high-error-rate"
+
+       # High Latency
+       - alert: APIHighLatency
+         expr: |
+           histogram_quantile(0.99,
+             sum(rate(http_request_duration_seconds_bucket{job="api-server"}[5m])) by (le)
+           ) > 1
+         for: 10m
+         labels:
+           severity: warning
+           team: backend
+         annotations:
+           summary: "API p99 latency above 1 second"
+           description: "p99 latency is {{ $value | humanizeDuration }}"
+           runbook_url: "https://wiki.example.com/runbooks/api-high-latency"
+
+       # SLO Burn Rate (Multi-window)
+       - alert: APIAvailabilitySLOBreach
+         expr: |
+           (
+             # Fast burn (last 1h)
+             sum(rate(http_requests_total{job="api-server",status=~"5.."}[1h]))
+             /
+             sum(rate(http_requests_total{job="api-server"}[1h]))
+             > (14.4 * 0.001)  # 14.4x burn rate for 1h window
+           )
+           and
+           (
+             # Slow burn (last 6h)
+             sum(rate(http_requests_total{job="api-server",status=~"5.."}[6h]))
+             /
+             sum(rate(http_requests_total{job="api-server"}[6h]))
+             > (6 * 0.001)  # 6x burn rate for 6h window
+           )
+         for: 2m
+         labels:
+           severity: critical
+           team: backend
+         annotations:
+           summary: "API availability SLO at risk"
+           description: "Error budget burn rate indicates SLO breach within window"
+ ```
+
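The `14.4 * 0.001` and `6 * 0.001` thresholds in the burn-rate rule are burn-rate multiples of the 0.1% error budget implied by a 99.9% SLO. The sketch below shows that arithmetic; the function names are hypothetical and the 30-day window is assumed from the SLO examples earlier in the guide.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_target)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget spent by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


def should_page(ratio_1h: float, ratio_6h: float, slo_target: float = 0.999) -> bool:
    """Mirror the alert expression: both windows must exceed their thresholds."""
    return (burn_rate(ratio_1h, slo_target) > 14.4
            and burn_rate(ratio_6h, slo_target) > 6)


# Burning at 14.4x for one hour spends about 2% of a 30-day budget;
# burning at 6x for six hours spends about 5%.
print(f"{budget_consumed(14.4, 1):.1%}", f"{budget_consumed(6, 6):.1%}")
print(should_page(ratio_1h=0.02, ratio_6h=0.01))
```

Requiring both windows to exceed their thresholds means a short spike that has already subsided in the longer window does not page.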
+ ---
+
+ ## Incident Management
+
+ ### Incident Severity Levels
+
+ ```yaml
+ severity_levels:
+   sev1:
+     name: "Critical"
+     description: "Complete service outage or data loss"
+     response_time: "Immediate"
+     communication: "Status page, exec notification, all-hands war room"
+     examples:
+       - "Production database down"
+       - "Security breach in progress"
+       - "Payment processing completely failed"
+
+   sev2:
+     name: "Major"
+     description: "Significant degradation affecting many users"
+     response_time: "15 minutes"
+     communication: "Status page, stakeholder notification"
+     examples:
+       - "50% of API requests failing"
+       - "Search functionality broken"
+       - "Mobile app unable to sync"
+
+   sev3:
+     name: "Minor"
+     description: "Limited impact, workaround available"
+     response_time: "1 hour"
+     communication: "Internal Slack channel"
+     examples:
+       - "Admin panel slow"
+       - "Export feature broken"
+       - "Non-critical background jobs failing"
+
+   sev4:
+     name: "Low"
+     description: "Minimal impact, cosmetic issues"
+     response_time: "Next business day"
+     communication: "Ticket created"
+     examples:
+       - "UI alignment issues"
+       - "Log formatting errors"
+       - "Dev environment issues"
+ ```
+
+ ### Incident Response Process
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                       INCIDENT LIFECYCLE                        │
+ ├─────────────────────────────────────────────────────────────────┤
+ │                                                                 │
+ │  ┌─────────┐   ┌─────────┐   ┌──────────┐   ┌──────────────┐    │
+ │  │ Detect  │──▶│ Respond │──▶│ Mitigate │──▶│   Resolve    │    │
+ │  └─────────┘   └─────────┘   └──────────┘   └──────────────┘    │
+ │       │             │             │                │            │
+ │       ▼             ▼             ▼                ▼            │
+ │  - Alerting     - Page on-call  - Stop bleeding  - Root cause   │
+ │  - Monitoring   - Declare       - Rollback       - Fix forward  │
+ │  - User report  - Assign roles  - Scale up       - Deploy fix   │
+ │                 - War room      - Failover                      │
+ │                                                                 │
+ │                                │                                │
+ │                                ▼                                │
+ │                      ┌──────────────────┐                       │
+ │                      │    Postmortem    │                       │
+ │                      └──────────────────┘                       │
+ │                                │                                │
+ │                                ▼                                │
+ │  - Timeline   - Root cause   - Action items   - Share learnings │
+ │                                                                 │
+ └─────────────────────────────────────────────────────────────────┘
+ ```
+
+ ### Incident Commander Role
+
+ ```yaml
+ incident_commander_responsibilities:
+   coordination:
+     - "Single point of contact for incident"
+     - "Assign roles (communications, technical lead, scribe)"
+     - "Make decisions when consensus isn't reached"
+     - "Escalate when needed"
+
+   communication:
+     - "Regular status updates (every 15-30 min)"
+     - "Stakeholder management"
+     - "Status page updates"
+     - "Executive briefings for Sev1"
+
+   process:
+     - "Start incident channel/war room"
+     - "Track timeline of events"
+     - "Ensure postmortem is scheduled"
+     - "Close out incident when resolved"
+
+ # Incident channel template
+ slack_channel_template:
+   name: "inc-{date}-{short-description}"
+   topic: "SEV{level} | IC: @{commander} | Status: {status}"
+   pinned_messages:
+     - "Incident summary and current status"
+     - "Timeline of events"
+     - "Runbook links"
+ ```
+
+ ---
+
+ ## On-Call Best Practices
+
+ ### On-Call Rotation
+
+ ```yaml
+ oncall_structure:
+   rotation_length: "1 week"
+   handoff_day: "Monday 9am local time"
+   coverage: "24/7"
+
+   roles:
+     primary:
+       responsibilities:
+         - "First responder to all pages"
+         - "Initial triage and escalation"
+         - "Document all incidents"
+       response_sla: "15 minutes"
+
+     secondary:
+       responsibilities:
+         - "Backup if primary unavailable"
+         - "Help with prolonged incidents"
+         - "Escalation point"
+       response_sla: "30 minutes"
+
+   escalation_path:
+     - "Primary on-call"
+     - "Secondary on-call"
+     - "Team lead"
+     - "Engineering manager"
+     - "VP Engineering"
+
+   handoff_checklist:
+     - "Review active incidents"
+     - "Check pending alerts"
+     - "Verify pager is working"
+     - "Review recent deployments"
+     - "Check error budget status"
+ ```
+
+ ### On-Call Health
+
+ ```yaml
+ oncall_health_metrics:
+   targets:
+     pages_per_shift: "< 10"
+     pages_per_night: "< 2"
+     mean_time_to_acknowledge: "< 5 minutes"
+     mean_time_to_resolve: "< 1 hour"
+     false_positive_rate: "< 10%"
+
+   burnout_prevention:
+     - "Compensatory time off after heavy on-call"
+     - "No back-to-back on-call weeks"
+     - "Follow-the-sun rotation for global teams"
+     - "On-call load balancing across team"
+     - "Regular review of alert quality"
+ ```
+
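Most of the targets above can be checked mechanically at the end of a shift. The following is a rough sketch under stated assumptions: the `Page` record and `shift_health` function are hypothetical, page data is assumed to be exported from whatever paging tool is in use, and the per-night target is left out because it needs timestamps this sketch does not model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Page:
    minutes_to_ack: float
    minutes_to_resolve: float
    false_positive: bool


def shift_health(pages: list[Page]) -> dict[str, bool]:
    """Return, per target, whether the shift stayed within it (True = within target)."""
    if not pages:
        # A shift with no pages trivially meets every target.
        return {"pages_per_shift": True, "mtta": True,
                "mttr": True, "false_positive_rate": True}
    fp_rate = sum(p.false_positive for p in pages) / len(pages)
    return {
        "pages_per_shift": len(pages) < 10,
        "mtta": mean(p.minutes_to_ack for p in pages) < 5,
        "mttr": mean(p.minutes_to_resolve for p in pages) < 60,
        "false_positive_rate": fp_rate < 0.10,
    }


example_shift = [Page(3, 40, False), Page(4, 50, True)]
print(shift_health(example_shift))
```

In this example one of two pages was a false positive, so only the false-positive target is missed; that is the signal to review alert quality.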
+ ---
+
+ ## Runbooks
+
+ ### Runbook Template
+
+ ````markdown
+ # Runbook: [Alert/Issue Name]
+
+ ## Overview
+ Brief description of what this runbook addresses.
+
+ ## Severity
+ - **Impact**: [What breaks when this happens]
+ - **Urgency**: [How quickly must this be resolved]
+
+ ## Prerequisites
+ - Access to: [systems, dashboards, tools]
+ - Permissions: [required roles/access]
+
+ ## Symptoms
+ - Alert: `AlertName` fires
+ - Users report: [symptoms]
+ - Dashboards show: [metrics]
+
+ ## Diagnosis Steps
+ 1. Check [specific metric/log]
+    ```bash
+    kubectl logs -l app=api-server --tail=100
+    ```
+ 2. Verify [dependency/connection]
+ 3. Check recent changes
+    ```bash
+    kubectl rollout history deployment/api-server
+    ```
+
+ ## Resolution Steps
+
+ ### Quick Mitigation (stop the bleeding)
+ 1. Scale up if capacity issue:
+    ```bash
+    kubectl scale deployment/api-server --replicas=10
+    ```
+ 2. Rollback if recent deployment:
+    ```bash
+    kubectl rollout undo deployment/api-server
+    ```
+
+ ### Root Cause Fix
+ 1. [Step-by-step fix instructions]
+ 2. [Verification commands]
+
+ ## Escalation
+ - If not resolved in 30 minutes, escalate to: [team/person]
+ - For data loss scenarios, immediately notify: [person]
+
+ ## Prevention
+ - Related improvements: [links to tickets]
+ - Monitoring gaps: [what to add]
+
+ ## History
+ | Date | Author | Change |
+ |------|--------|--------|
+ | 2025-01-15 | @engineer | Initial version |
+ ````
+
+ ---
+
+ ## Capacity Planning
+
+ ### Capacity Metrics
+
+ ```yaml
+ capacity_dimensions:
+   compute:
+     metrics:
+       - cpu_utilization
+       - memory_utilization
+       - pod_count
+     thresholds:
+       warning: 70%
+       critical: 85%
+
+   storage:
+     metrics:
+       - disk_usage_percent
+       - iops_utilization
+       - throughput_utilization
+     thresholds:
+       warning: 75%
+       critical: 90%
+
+   network:
+     metrics:
+       - bandwidth_utilization
+       - connection_count
+       - packet_loss_rate
+     thresholds:
+       warning: 60%
+       critical: 80%
+
+   database:
+     metrics:
+       - connection_pool_usage
+       - query_latency_p99
+       - replication_lag
+     thresholds:
+       warning: 70%
+       critical: 85%
+ ```
+
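The warning and critical thresholds above can be applied with a simple lookup. This is a minimal sketch: `THRESHOLDS` and `capacity_state` are hypothetical names, utilization is assumed to be expressed as a fraction between 0 and 1, and the thresholds are assumed to be inclusive. Metrics that are not utilization ratios (for example `query_latency_p99` or `replication_lag`) would first need to be normalized against their own limits.

```python
# (warning, critical) utilization fractions, copied from the capacity table
THRESHOLDS = {
    "compute": (0.70, 0.85),
    "storage": (0.75, 0.90),
    "network": (0.60, 0.80),
    "database": (0.70, 0.85),
}


def capacity_state(dimension: str, utilization: float) -> str:
    """Classify a utilization fraction as 'ok', 'warning', or 'critical'."""
    warning, critical = THRESHOLDS[dimension]
    if utilization >= critical:
        return "critical"
    if utilization >= warning:
        return "warning"
    return "ok"


for dimension, value in [("compute", 0.72), ("storage", 0.72), ("network", 0.80)]:
    print(dimension, value, capacity_state(dimension, value))
```

The same 72% reading is a warning for compute but still fine for storage, which is why the thresholds are kept per dimension rather than global.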
+ ### Load Testing Strategy
+
+ ```yaml
+ load_testing:
+   types:
+     smoke:
+       description: "Verify system handles minimal load"
+       duration: "5 minutes"
+       users: "10"
+       frequency: "Every deployment"
+
+     load:
+       description: "Test expected production load"
+       duration: "30 minutes"
+       users: "Expected peak * 1.5"
+       frequency: "Weekly"
+
+     stress:
+       description: "Find breaking point"
+       duration: "Until failure"
+       users: "Ramp until errors"
+       frequency: "Monthly"
+
+     soak:
+       description: "Test sustained load over time"
+       duration: "24 hours"
+       users: "Expected average"
+       frequency: "Before major releases"
+
+ # k6 example
+ k6_load_test: |
+   import http from 'k6/http';
+   import { check, sleep } from 'k6';
+
+   export const options = {
+     stages: [
+       { duration: '5m', target: 100 },   // Ramp up
+       { duration: '30m', target: 100 },  // Stay at peak
+       { duration: '5m', target: 0 },     // Ramp down
+     ],
+     thresholds: {
+       http_req_duration: ['p(99)<500'],
+       http_req_failed: ['rate<0.01'],
+     },
+   };
+
+   export default function () {
+     const res = http.get('https://api.example.com/health');
+     check(res, {
+       'status is 200': (r) => r.status === 200,
+       'response time < 500ms': (r) => r.timings.duration < 500,
+     });
+     sleep(1);
+   }
+ ```
+
+ ---
+
+ ## Change Management
+
+ ### Deployment Safety
+
+ ```yaml
+ deployment_checklist:
+   pre_deploy:
+     - "All tests passing in CI"
+     - "Code reviewed and approved"
+     - "Feature flags in place for risky changes"
+     - "Rollback plan documented"
+     - "Monitoring dashboards open"
+     - "On-call engineer aware"
+
+   during_deploy:
+     - "Watch error rates during rollout"
+     - "Monitor latency metrics"
+     - "Check application logs for errors"
+     - "Verify health checks passing"
+
+   post_deploy:
+     - "Smoke test critical paths"
+     - "Compare metrics to baseline"
+     - "Check for error rate increases"
+     - "Update deployment log"
+
+ # Progressive delivery stages
+ progressive_delivery:
+   canary:
+     traffic_percentage: 5%
+     duration: "15 minutes"
+     success_criteria:
+       - error_rate < baseline * 1.1
+       - latency_p99 < baseline * 1.2
+
+   partial:
+     traffic_percentage: 25%
+     duration: "30 minutes"
+     success_criteria:
+       - error_rate < baseline * 1.05
+       - latency_p99 < baseline * 1.1
+
+   majority:
+     traffic_percentage: 75%
+     duration: "1 hour"
+     success_criteria:
+       - error_rate ≈ baseline
+       - latency_p99 ≈ baseline
+
+   full:
+     traffic_percentage: 100%
+     bake_time: "24 hours"
+ ```
+
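The canary and partial success criteria above compare live metrics against a baseline multiplied by a tolerance. A gate for those two stages might be sketched as follows; `Stage`, `STAGES`, and `stage_passes` are hypothetical names, and the "majority" stage is omitted because its "≈ baseline" criterion needs a tolerance the guide does not specify.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    traffic_percent: int
    max_error_ratio: float    # allowed error_rate relative to baseline
    max_latency_ratio: float  # allowed latency_p99 relative to baseline


STAGES = [
    Stage("canary", 5, 1.10, 1.20),
    Stage("partial", 25, 1.05, 1.10),
]


def stage_passes(stage: Stage, baseline_error_rate: float, baseline_p99_ms: float,
                 error_rate: float, p99_ms: float) -> bool:
    """True when both success criteria for the stage hold."""
    return (error_rate < baseline_error_rate * stage.max_error_ratio
            and p99_ms < baseline_p99_ms * stage.max_latency_ratio)


# Baseline: 1% errors, 200 ms p99. Observed: 1.05% errors, 230 ms p99.
for stage in STAGES:
    ok = stage_passes(stage, 0.01, 200.0, error_rate=0.0105, p99_ms=230.0)
    print(stage.name, "promote" if ok else "hold")
```

The same observation can pass the canary gate and fail the partial gate, because tolerances tighten as the traffic share grows.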
+ ### Rollback Procedures
+
+ ```yaml
+ rollback_triggers:
+   automatic:
+     - "Error rate > 5% for 5 minutes"
+     - "Latency p99 > 3x baseline for 10 minutes"
+     - "Health check failures > 50%"
+
+   manual:
+     - "User-reported critical bugs"
+     - "Security vulnerability discovered"
+     - "Data corruption detected"
+
+ rollback_commands:
+   kubernetes:
+     immediate: |
+       kubectl rollout undo deployment/api-server
+     to_specific_version: |
+       kubectl rollout undo deployment/api-server --to-revision=42
+     verify: |
+       kubectl rollout status deployment/api-server
+
+   argocd:
+     immediate: |
+       argocd app rollback api-server
+     sync_to_previous: |
+       argocd app sync api-server --revision HEAD~1
+ ```
+
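The automatic triggers above are "threshold held for a duration" conditions. One way to evaluate them is sketched below; the function names are hypothetical, and the sketch assumes one metric sample per minute with the most recent sample last.

```python
def breached_for(samples: list[float], threshold: float, minutes: int) -> bool:
    """True if each of the last `minutes` one-per-minute samples exceeds `threshold`."""
    if len(samples) < minutes:
        return False  # not enough history to claim a sustained breach
    return all(sample > threshold for sample in samples[-minutes:])


def should_rollback(error_rates: list[float], p99_latencies: list[float],
                    baseline_p99: float, failed_health_fraction: float) -> bool:
    """Evaluate the three automatic rollback triggers; any one is sufficient."""
    return (
        breached_for(error_rates, 0.05, 5)                    # error rate > 5% for 5 min
        or breached_for(p99_latencies, 3 * baseline_p99, 10)  # p99 > 3x baseline for 10 min
        or failed_health_fraction > 0.50                      # health check failures > 50%
    )


print(should_rollback(
    error_rates=[0.01, 0.06, 0.07, 0.08, 0.06, 0.09],
    p99_latencies=[120.0] * 10,
    baseline_p99=100.0,
    failed_health_fraction=0.0,
))
```

Requiring every sample in the window to breach avoids rolling back on a single noisy data point, at the cost of reacting a few minutes later.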
+ ---
+
+ ## Disaster Recovery
+
+ ### RTO and RPO Definitions
+
+ ```yaml
+ recovery_objectives:
+   tier1_critical:
+     services:
+       - "Payment processing"
+       - "User authentication"
+       - "Core API"
+     rto: "15 minutes"  # Recovery Time Objective
+     rpo: "0 minutes"   # Recovery Point Objective (no data loss)
+     strategy: "Active-active multi-region"
+
+   tier2_important:
+     services:
+       - "Search functionality"
+       - "Notifications"
+       - "Analytics ingestion"
+     rto: "1 hour"
+     rpo: "15 minutes"
+     strategy: "Warm standby with automated failover"
+
+   tier3_standard:
+     services:
+       - "Admin dashboard"
+       - "Reporting"
+       - "Batch processing"
+     rto: "4 hours"
+     rpo: "1 hour"
+     strategy: "Cold standby with manual failover"
+ ```
+
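After a DR test or a real failover, the measured recovery time and data-loss window can be compared against these objectives. The sketch below is illustrative: `OBJECTIVES` and `meets_objectives` are hypothetical names, and the measured durations are assumed to come from the incident timeline.

```python
from datetime import timedelta

# (RTO, RPO) per tier, copied from the recovery objectives above
OBJECTIVES = {
    "tier1_critical": (timedelta(minutes=15), timedelta(0)),
    "tier2_important": (timedelta(hours=1), timedelta(minutes=15)),
    "tier3_standard": (timedelta(hours=4), timedelta(hours=1)),
}


def meets_objectives(tier: str, recovery_time: timedelta,
                     data_loss_window: timedelta) -> bool:
    """True when both the recovery time and the data-loss window are within the tier's limits."""
    rto, rpo = OBJECTIVES[tier]
    return recovery_time <= rto and data_loss_window <= rpo


# A tier-2 service restored in 40 minutes with 10 minutes of lost writes.
print(meets_objectives("tier2_important",
                       recovery_time=timedelta(minutes=40),
                       data_loss_window=timedelta(minutes=10)))
```

For tier 1 the RPO is zero, so any measurable data loss fails the check even when recovery itself was fast.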
+ ### Backup Strategy
+
+ ```yaml
+ backup_strategy:
+   databases:
+     type: "Continuous replication + daily snapshots"
+     retention:
+       continuous: "7 days"
+       daily: "30 days"
+       weekly: "1 year"
+     testing: "Monthly restore test"
+     location: "Cross-region"
+
+   object_storage:
+     type: "Cross-region replication"
+     versioning: "Enabled"
+     retention: "Per data classification"
+
+   configuration:
+     type: "GitOps (versioned in Git)"
+     backup: "Repository mirroring"
+
+   secrets:
+     type: "Vault replication"
+     backup: "Encrypted offline backup monthly"
+ ```
+
+ ### DR Testing
+
+ ```yaml
+ dr_testing_schedule:
+   tabletop_exercise:
+     frequency: "Quarterly"
+     participants: "All on-call engineers"
+     scope: "Walk through DR procedures"
+
+   component_failover:
+     frequency: "Monthly"
+     scope: "Individual service failover"
+     examples:
+       - "Database failover to replica"
+       - "Redis cluster failover"
+       - "Load balancer failover"
+
+   regional_failover:
+     frequency: "Bi-annually"
+     scope: "Full region evacuation"
+     preparation:
+       - "Notify stakeholders"
+       - "Schedule maintenance window"
+       - "Pre-position support staff"
+
+   chaos_game_day:
+     frequency: "Quarterly"
+     scope: "Inject failures in production"
+     examples:
+       - "Kill random pods"
+       - "Inject network latency"
+       - "Simulate AZ failure"
+ ```
+
+ ---
+
+ ## Postmortems
+
+ ### Blameless Postmortem Template
+
+ ```markdown
+ # Postmortem: [Incident Title]
+
+ **Date**: YYYY-MM-DD
+ **Authors**: [Names]
+ **Status**: Draft | In Review | Complete
+ **Severity**: SEV1 | SEV2 | SEV3
+
+ ## Summary
+
+ One paragraph summary of what happened and the impact.
+
+ ## Impact
+
+ - **Duration**: X hours Y minutes
+ - **Users affected**: X% of users / Y users
+ - **Revenue impact**: $X (if applicable)
+ - **SLO impact**: X% of monthly error budget consumed
+
+ ## Timeline (all times UTC)
+
+ | Time | Event |
+ |------|-------|
+ | 14:00 | Deployment started |
+ | 14:05 | First alerts fired |
+ | 14:10 | On-call acknowledged |
+ | 14:15 | Incident declared |
+ | 14:30 | Root cause identified |
+ | 14:35 | Rollback initiated |
+ | 14:40 | Service recovered |
+ | 15:00 | Incident closed |
+
+ ## Root Cause
+
+ Detailed technical explanation of what went wrong and why.
+
+ ## Contributing Factors
+
+ - Factor 1: [What made this worse or possible]
+ - Factor 2: [Process/tooling gaps]
+ - Factor 3: [Environmental conditions]
+
+ ## What Went Well
+
+ - Quick detection (5 minutes to alert)
+ - Clear runbooks available
+ - Effective communication in war room
+
+ ## What Went Poorly
+
+ - Rollback took longer than expected
+ - Initial diagnosis went down wrong path
+ - Status page update was delayed
+
+ ## Action Items
+
+ | Action | Type | Owner | Due Date | Status |
+ |--------|------|-------|----------|--------|
+ | Add pre-deploy smoke tests | Prevent | @eng1 | 2025-02-01 | TODO |
+ | Improve rollback automation | Mitigate | @eng2 | 2025-02-15 | TODO |
+ | Add metric for early detection | Detect | @eng3 | 2025-02-01 | TODO |
+ | Update runbook with lessons | Process | @eng4 | 2025-01-20 | DONE |
+
+ ## Lessons Learned
+
+ What should the broader organization learn from this incident?
+
+ ## Appendix
+
+ - Links to dashboards
+ - Relevant logs
+ - Related incidents
+ ```
+
+ ---
+
+ ## Staff Engineer Responsibilities
+
+ ### Technical Leadership
+
+ - Define and evolve reliability standards across the organization
+ - Make build vs. buy decisions for tooling
+ - Establish SLO frameworks and error budget policies
+ - Mentor engineers on operational excellence
+ - Drive adoption of SRE practices
+
+ ### Cross-Team Enablement
+
+ - Design observability standards that work across all services
+ - Create reusable runbook templates and incident response procedures
+ - Build automation that reduces toil organization-wide
+ - Establish on-call best practices and health metrics
+ - Lead chaos engineering initiatives
+
+ ### Operational Excellence
+
+ - Own the incident management process
+ - Drive postmortem quality and follow-through
+ - Reduce mean time to detection and recovery
+ - Eliminate recurring incidents through systemic fixes
+ - Balance reliability investments with feature velocity
+
+ ### Strategic Thinking
+
+ - Align reliability investments with business priorities
+ - Plan capacity for growth projections
+ - Design disaster recovery strategies
+ - Evaluate emerging technologies for operational improvement
+ - Manage technical debt in operational tooling
+
+ ---
+
+ ## Definition of Done
+
+ ### Reliability Feature
+
+ - [ ] SLOs defined with measurable SLIs
+ - [ ] Alerts configured with runbooks
+ - [ ] Dashboards created for key metrics
+ - [ ] Load tested to expected capacity
+ - [ ] Failure modes documented
+ - [ ] DR procedures tested
+
+ ### Incident Response
+
+ - [ ] Incident severity correctly assessed
+ - [ ] Timeline accurately documented
+ - [ ] Stakeholders appropriately notified
+ - [ ] Root cause identified (not just symptoms)
+ - [ ] Postmortem completed within 5 business days
+ - [ ] Action items tracked to completion
+
+ ### On-Call Improvement
+
+ - [ ] Alert has clear action to take
+ - [ ] Runbook is accurate and tested
+ - [ ] False positive rate < 10%
+ - [ ] Alert fires with enough time to act
+ - [ ] Escalation path is clear
+
+ ---
+
+ ## Common Pitfalls
+
+ ### 1. Alert Fatigue
+
+ ❌ **Wrong**: Alert on everything "just in case"
+
+ ✅ **Right**: Every alert must be actionable, urgent, and relevant
+
+ ### 2. SLOs as Targets, Not Limits
+
+ ❌ **Wrong**: "We must hit exactly 99.9%"
+
+ ✅ **Right**: SLOs define acceptable reliability; use error budget for velocity
+
+ ### 3. Blame Culture
+
+ ❌ **Wrong**: "Who caused this outage?"
+
+ ✅ **Right**: "What systemic factors allowed this to happen?"
+
+ ### 4. Manual Heroics
+
+ ❌ **Wrong**: Relying on engineer availability to keep systems running
+
+ ✅ **Right**: Automate recovery, build self-healing systems
+
+ ### 5. Postmortem Theater
+
+ ❌ **Wrong**: Write postmortem, create action items, never follow up
+
+ ✅ **Right**: Track action items to completion, measure improvement
+
+ ---
+
+ ## Resources
+
+ - [Google SRE Book](https://sre.google/sre-book/table-of-contents/)
+ - [Google SRE Workbook](https://sre.google/workbook/table-of-contents/)
+ - [The Art of SLOs](https://sre.google/resources/practices-and-processes/art-of-slos/)
+ - [Incident Management for Operations](https://www.pagerduty.com/resources/learn/incident-management/)
+ - [Chaos Engineering Principles](https://principlesofchaos.org/)
+ - [OpenTelemetry Documentation](https://opentelemetry.io/docs/)