specweave 0.3.13 → 0.4.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +506 -17
- package/README.md +100 -58
- package/bin/install-all.sh +9 -2
- package/bin/install-hooks.sh +57 -0
- package/bin/specweave.js +16 -0
- package/dist/adapters/adapter-base.d.ts +21 -0
- package/dist/adapters/adapter-base.d.ts.map +1 -1
- package/dist/adapters/adapter-base.js +28 -0
- package/dist/adapters/adapter-base.js.map +1 -1
- package/dist/adapters/adapter-interface.d.ts +41 -0
- package/dist/adapters/adapter-interface.d.ts.map +1 -1
- package/dist/adapters/claude/adapter.d.ts +36 -0
- package/dist/adapters/claude/adapter.d.ts.map +1 -1
- package/dist/adapters/claude/adapter.js +135 -0
- package/dist/adapters/claude/adapter.js.map +1 -1
- package/dist/adapters/copilot/adapter.d.ts +25 -0
- package/dist/adapters/copilot/adapter.d.ts.map +1 -1
- package/dist/adapters/copilot/adapter.js +112 -0
- package/dist/adapters/copilot/adapter.js.map +1 -1
- package/dist/adapters/cursor/adapter.d.ts +36 -0
- package/dist/adapters/cursor/adapter.d.ts.map +1 -1
- package/dist/adapters/cursor/adapter.js +140 -0
- package/dist/adapters/cursor/adapter.js.map +1 -1
- package/dist/adapters/generic/adapter.d.ts +25 -0
- package/dist/adapters/generic/adapter.d.ts.map +1 -1
- package/dist/adapters/generic/adapter.js +111 -0
- package/dist/adapters/generic/adapter.js.map +1 -1
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +103 -1
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/plugin.d.ts +37 -0
- package/dist/cli/commands/plugin.d.ts.map +1 -0
- package/dist/cli/commands/plugin.js +296 -0
- package/dist/cli/commands/plugin.js.map +1 -0
- package/dist/core/agent-model-manager.d.ts +52 -0
- package/dist/core/agent-model-manager.d.ts.map +1 -0
- package/dist/core/agent-model-manager.js +120 -0
- package/dist/core/agent-model-manager.js.map +1 -0
- package/dist/core/cost-tracker.d.ts +108 -0
- package/dist/core/cost-tracker.d.ts.map +1 -0
- package/dist/core/cost-tracker.js +281 -0
- package/dist/core/cost-tracker.js.map +1 -0
- package/dist/core/model-selector.d.ts +57 -0
- package/dist/core/model-selector.d.ts.map +1 -0
- package/dist/core/model-selector.js +115 -0
- package/dist/core/model-selector.js.map +1 -0
- package/dist/core/phase-detector.d.ts +62 -0
- package/dist/core/phase-detector.d.ts.map +1 -0
- package/dist/core/phase-detector.js +229 -0
- package/dist/core/phase-detector.js.map +1 -0
- package/dist/core/plugin-detector.d.ts +96 -0
- package/dist/core/plugin-detector.d.ts.map +1 -0
- package/dist/core/plugin-detector.js +349 -0
- package/dist/core/plugin-detector.js.map +1 -0
- package/dist/core/plugin-loader.d.ts +111 -0
- package/dist/core/plugin-loader.d.ts.map +1 -0
- package/dist/core/plugin-loader.js +319 -0
- package/dist/core/plugin-loader.js.map +1 -0
- package/dist/core/plugin-manager.d.ts +144 -0
- package/dist/core/plugin-manager.d.ts.map +1 -0
- package/dist/core/plugin-manager.js +393 -0
- package/dist/core/plugin-manager.js.map +1 -0
- package/dist/core/schemas/plugin-manifest.schema.json +253 -0
- package/dist/core/types/plugin.d.ts +252 -0
- package/dist/core/types/plugin.d.ts.map +1 -0
- package/dist/core/types/plugin.js +48 -0
- package/dist/core/types/plugin.js.map +1 -0
- package/dist/integrations/jira/jira-mapper.d.ts +2 -2
- package/dist/integrations/jira/jira-mapper.js +2 -2
- package/dist/types/cost-tracking.d.ts +43 -0
- package/dist/types/cost-tracking.d.ts.map +1 -0
- package/dist/types/cost-tracking.js +8 -0
- package/dist/types/cost-tracking.js.map +1 -0
- package/dist/types/model-selection.d.ts +53 -0
- package/dist/types/model-selection.d.ts.map +1 -0
- package/dist/types/model-selection.js +12 -0
- package/dist/types/model-selection.js.map +1 -0
- package/dist/utils/cost-reporter.d.ts +58 -0
- package/dist/utils/cost-reporter.d.ts.map +1 -0
- package/dist/utils/cost-reporter.js +224 -0
- package/dist/utils/cost-reporter.js.map +1 -0
- package/dist/utils/pricing-constants.d.ts +70 -0
- package/dist/utils/pricing-constants.d.ts.map +1 -0
- package/dist/utils/pricing-constants.js +71 -0
- package/dist/utils/pricing-constants.js.map +1 -0
- package/package.json +13 -9
- package/src/adapters/adapter-base.ts +33 -0
- package/src/adapters/adapter-interface.ts +46 -0
- package/src/adapters/claude/adapter.ts +164 -0
- package/src/adapters/copilot/adapter.ts +138 -0
- package/src/adapters/cursor/adapter.ts +170 -0
- package/src/adapters/generic/adapter.ts +137 -0
- package/src/agents/architect/AGENT.md +3 -0
- package/src/agents/code-reviewer.md +156 -0
- package/src/agents/data-scientist/AGENT.md +181 -0
- package/src/agents/database-optimizer/AGENT.md +147 -0
- package/src/agents/devops/AGENT.md +3 -0
- package/src/agents/diagrams-architect/AGENT.md +3 -0
- package/src/agents/docs-writer/AGENT.md +3 -0
- package/src/agents/kubernetes-architect/AGENT.md +142 -0
- package/src/agents/ml-engineer/AGENT.md +150 -0
- package/src/agents/mlops-engineer/AGENT.md +201 -0
- package/src/agents/network-engineer/AGENT.md +149 -0
- package/src/agents/observability-engineer/AGENT.md +213 -0
- package/src/agents/payment-integration/AGENT.md +35 -0
- package/src/agents/performance/AGENT.md +3 -0
- package/src/agents/performance-engineer/AGENT.md +153 -0
- package/src/agents/pm/AGENT.md +3 -0
- package/src/agents/qa-lead/AGENT.md +3 -0
- package/src/agents/security/AGENT.md +3 -0
- package/src/agents/sre/AGENT.md +3 -0
- package/src/agents/tdd-orchestrator/AGENT.md +169 -0
- package/src/agents/tech-lead/AGENT.md +3 -0
- package/src/commands/specweave.costs.md +261 -0
- package/src/commands/specweave.increment.md +48 -4
- package/src/commands/specweave.ml-pipeline.md +292 -0
- package/src/commands/specweave.monitor-setup.md +501 -0
- package/src/commands/specweave.slo-implement.md +1055 -0
- package/src/commands/specweave.sync-github.md +1 -1
- package/src/commands/specweave.tdd-cycle.md +199 -0
- package/src/commands/specweave.tdd-green.md +842 -0
- package/src/commands/specweave.tdd-red.md +135 -0
- package/src/commands/specweave.tdd-refactor.md +165 -0
- package/src/hooks/post-increment-plugin-detect.sh +142 -0
- package/src/hooks/post-task-completion.sh +53 -11
- package/src/hooks/pre-task-plugin-detect.sh +96 -0
- package/src/skills/SKILLS-INDEX.md +18 -10
- package/src/skills/billing-automation/SKILL.md +559 -0
- package/src/skills/distributed-tracing/SKILL.md +438 -0
- package/src/skills/e2e-playwright/README.md +1 -1
- package/src/skills/e2e-playwright/package.json +1 -1
- package/src/skills/gitops-workflow/SKILL.md +285 -0
- package/src/skills/gitops-workflow/references/argocd-setup.md +134 -0
- package/src/skills/gitops-workflow/references/sync-policies.md +131 -0
- package/src/skills/grafana-dashboards/SKILL.md +369 -0
- package/src/skills/helm-chart-scaffolding/SKILL.md +544 -0
- package/src/skills/helm-chart-scaffolding/assets/Chart.yaml.template +42 -0
- package/src/skills/helm-chart-scaffolding/assets/values.yaml.template +185 -0
- package/src/skills/helm-chart-scaffolding/references/chart-structure.md +500 -0
- package/src/skills/helm-chart-scaffolding/scripts/validate-chart.sh +244 -0
- package/src/skills/k8s-manifest-generator/SKILL.md +511 -0
- package/src/skills/k8s-manifest-generator/assets/configmap-template.yaml +296 -0
- package/src/skills/k8s-manifest-generator/assets/deployment-template.yaml +203 -0
- package/src/skills/k8s-manifest-generator/assets/service-template.yaml +171 -0
- package/src/skills/k8s-manifest-generator/references/deployment-spec.md +753 -0
- package/src/skills/k8s-manifest-generator/references/service-spec.md +724 -0
- package/src/skills/k8s-security-policies/SKILL.md +334 -0
- package/src/skills/k8s-security-policies/assets/network-policy-template.yaml +177 -0
- package/src/skills/k8s-security-policies/references/rbac-patterns.md +187 -0
- package/src/skills/ml-pipeline-workflow/SKILL.md +245 -0
- package/src/skills/paypal-integration/SKILL.md +467 -0
- package/src/skills/pci-compliance/SKILL.md +466 -0
- package/src/skills/prometheus-configuration/SKILL.md +392 -0
- package/src/skills/slo-implementation/SKILL.md +329 -0
- package/src/skills/stripe-integration/SKILL.md +442 -0
- package/src/skills/tdd-workflow/SKILL.md +378 -0
- package/src/templates/README.md.template +1 -1
- package/src/skills/bmad-method-expert/SKILL.md +0 -626
- package/src/skills/bmad-method-expert/scripts/analyze-project.js +0 -318
- package/src/skills/bmad-method-expert/scripts/check-setup.js +0 -208
- package/src/skills/bmad-method-expert/scripts/generate-template.js +0 -1149
- package/src/skills/bmad-method-expert/scripts/validate-documents.js +0 -340
- package/src/skills/context-optimizer/SKILL.md +0 -588
- package/src/skills/figma-designer/SKILL.md +0 -149
- package/src/skills/figma-implementer/SKILL.md +0 -148
- package/src/skills/figma-mcp-connector/SKILL.md +0 -136
- package/src/skills/figma-to-code/SKILL.md +0 -128
- package/src/skills/spec-kit-expert/SKILL.md +0 -1010
|
@@ -0,0 +1,1055 @@
# SLO Implementation Guide

You are an SLO (Service Level Objective) expert specializing in implementing reliability standards and error budget-based engineering practices. Design comprehensive SLO frameworks, establish meaningful SLIs, and create monitoring systems that balance reliability with feature velocity.

## Context
The user needs to implement SLOs to establish reliability targets, measure service performance, and make data-driven decisions about reliability vs. feature development. Focus on practical SLO implementation that aligns with business objectives.

## Requirements
$ARGUMENTS

## Instructions

### 1. SLO Foundation

Establish SLO fundamentals and framework:

**SLO Framework Designer**
```python
import numpy as np
from datetime import datetime, timedelta
from typing import Dict, List, Optional

class SLOFramework:
    def __init__(self, service_name: str):
        self.service = service_name
        self.slos = []
        self.error_budget = None

    def design_slo_framework(self):
        """
        Design comprehensive SLO framework
        """
        framework = {
            'service_context': self._analyze_service_context(),
            'user_journeys': self._identify_user_journeys(),
            'sli_candidates': self._identify_sli_candidates(),
            'slo_targets': self._calculate_slo_targets(),
            'error_budgets': self._define_error_budgets(),
            'measurement_strategy': self._design_measurement_strategy()
        }

        return self._generate_slo_specification(framework)

    def _analyze_service_context(self):
        """Analyze service characteristics for SLO design"""
        return {
            'service_tier': self._determine_service_tier(),
            'user_expectations': self._assess_user_expectations(),
            'business_impact': self._evaluate_business_impact(),
            'technical_constraints': self._identify_constraints(),
            'dependencies': self._map_dependencies()
        }

    def _determine_service_tier(self):
        """Determine appropriate service tier and SLO targets"""
        tiers = {
            'critical': {
                'description': 'Revenue-critical or safety-critical services',
                'availability_target': 99.95,
                'latency_p99': 100,
                'error_rate': 0.001,
                'examples': ['payment processing', 'authentication']
            },
            'essential': {
                'description': 'Core business functionality',
                'availability_target': 99.9,
                'latency_p99': 500,
                'error_rate': 0.01,
                'examples': ['search', 'product catalog']
            },
            'standard': {
                'description': 'Standard features',
                'availability_target': 99.5,
                'latency_p99': 1000,
                'error_rate': 0.05,
                'examples': ['recommendations', 'analytics']
            },
            'best_effort': {
                'description': 'Non-critical features',
                'availability_target': 99.0,
                'latency_p99': 2000,
                'error_rate': 0.1,
                'examples': ['batch processing', 'reporting']
            }
        }

        # Analyze service characteristics to determine tier
        characteristics = self._analyze_service_characteristics()
        recommended_tier = self._match_tier(characteristics, tiers)

        return {
            'recommended': recommended_tier,
            'rationale': self._explain_tier_selection(characteristics),
            'all_tiers': tiers
        }

    def _identify_user_journeys(self):
        """Map critical user journeys for SLI selection"""
        journeys = []

        # Example user journey mapping
        journey_template = {
            'name': 'User Login',
            'description': 'User authenticates and accesses dashboard',
            'steps': [
                {
                    'step': 'Load login page',
                    'sli_type': 'availability',
                    'threshold': '< 2s load time'
                },
                {
                    'step': 'Submit credentials',
                    'sli_type': 'latency',
                    'threshold': '< 500ms response'
                },
                {
                    'step': 'Validate authentication',
                    'sli_type': 'error_rate',
                    'threshold': '< 0.1% auth failures'
                },
                {
                    'step': 'Load dashboard',
                    'sli_type': 'latency',
                    'threshold': '< 3s full render'
                }
            ],
            'critical_path': True,
            'business_impact': 'high'
        }

        # Register the example journey so the returned list is not empty
        journeys.append(journey_template)
        return journeys
```
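
To make the tier targets above concrete, the sketch below converts each `availability_target` into the downtime it permits. The 30-day window and the helper name `allowed_downtime_minutes` are assumptions made for illustration; neither appears in the framework itself.

```python
# Availability targets copied from the tier table above.
TIER_TARGETS = {
    'critical': 99.95,
    'essential': 99.9,
    'standard': 99.5,
    'best_effort': 99.0,
}

def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime a target permits over the window (hypothetical helper)."""
    window_minutes = window_days * 24 * 60
    # Round to one decimal to hide floating-point noise in the subtraction.
    return round(window_minutes * (1 - availability_target / 100), 1)

if __name__ == "__main__":
    for tier, target in TIER_TARGETS.items():
        print(f"{tier}: {allowed_downtime_minutes(target)} minutes per 30 days")
```

Under these assumptions the jump from `essential` to `critical` halves the allowed downtime, which is one way to sanity-check whether a service really needs the stricter tier.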

### 2. SLI Selection and Measurement

Choose and implement appropriate SLIs:

**SLI Implementation**
```python
class SLIImplementation:
    def __init__(self):
        self.sli_types = {
            'availability': AvailabilitySLI,
            'latency': LatencySLI,
            'error_rate': ErrorRateSLI,
            'throughput': ThroughputSLI,
            'quality': QualitySLI
        }

    def implement_slis(self, service_type):
        """Implement SLIs based on service type"""
        if service_type == 'api':
            return self._api_slis()
        elif service_type == 'web':
            return self._web_slis()
        elif service_type == 'batch':
            return self._batch_slis()
        elif service_type == 'streaming':
            return self._streaming_slis()

    def _api_slis(self):
        """SLIs for API services"""
        return {
            'availability': {
                'definition': 'Percentage of successful requests',
                'formula': 'successful_requests / total_requests * 100',
                'implementation': '''
# Prometheus query for API availability
api_availability = """
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
"""

# Implementation
class APIAvailabilitySLI:
    def __init__(self, prometheus_client):
        self.prom = prometheus_client

    def calculate(self, time_range='5m'):
        query = f"""
        sum(rate(http_requests_total{{status!~"5.."}}[{time_range}])) /
        sum(rate(http_requests_total[{time_range}])) * 100
        """
        result = self.prom.query(query)
        return float(result[0]['value'][1])

    def calculate_with_exclusions(self, time_range='5m'):
        """Calculate availability excluding certain endpoints"""
        query = f"""
        sum(rate(http_requests_total{{
            status!~"5..",
            endpoint!~"/health|/metrics"
        }}[{time_range}])) /
        sum(rate(http_requests_total{{
            endpoint!~"/health|/metrics"
        }}[{time_range}])) * 100
        """
        return self.prom.query(query)
'''
            },
            'latency': {
                'definition': 'Percentage of requests faster than threshold',
                'formula': 'fast_requests / total_requests * 100',
                'implementation': '''
# Latency SLI with multiple thresholds
class LatencySLI:
    def __init__(self, thresholds_ms):
        self.thresholds = thresholds_ms  # e.g., {'p50': 100, 'p95': 500, 'p99': 1000}

    def calculate_latency_sli(self, time_range='5m'):
        slis = {}

        for percentile, threshold in self.thresholds.items():
            query = f"""
            sum(rate(http_request_duration_seconds_bucket{{
                le="{threshold/1000}"
            }}[{time_range}])) /
            sum(rate(http_request_duration_seconds_count[{time_range}])) * 100
            """

            slis[f'latency_{percentile}'] = {
                'value': self.execute_query(query),
                'threshold': threshold,
                'unit': 'ms'
            }

        return slis

    def calculate_user_centric_latency(self):
        """Calculate latency from user perspective"""
        # Include client-side metrics
        query = """
        histogram_quantile(0.95,
            sum(rate(user_request_duration_bucket[5m])) by (le)
        )
        """
        return self.execute_query(query)
'''
            },
            'error_rate': {
                'definition': 'Percentage of requests that did not end in an error',
                'formula': '(1 - error_requests / total_requests) * 100',
                'implementation': '''
class ErrorRateSLI:
    def calculate_error_rate(self, time_range='5m'):
        """Calculate error rate with categorization"""

        # Different error categories
        error_categories = {
            'client_errors': 'status=~"4.."',
            'server_errors': 'status=~"5.."',
            'timeout_errors': 'status="504"',
            'business_errors': 'error_type="business_logic"'
        }

        results = {}
        for category, filter_expr in error_categories.items():
            query = f"""
            sum(rate(http_requests_total{{{filter_expr}}}[{time_range}])) /
            sum(rate(http_requests_total[{time_range}])) * 100
            """
            results[category] = self.execute_query(query)

        # Overall success rate (only 5xx responses count as errors; 4xx are excluded)
        overall_query = f"""
        (1 - sum(rate(http_requests_total{{status=~"5.."}}[{time_range}])) /
        sum(rate(http_requests_total[{time_range}]))) * 100
        """
        results['overall_success_rate'] = self.execute_query(overall_query)

        return results
'''
            }
        }
```
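
The `formula` fields above reduce to simple ratios. As a minimal sketch, assuming raw request counts and per-request latencies are already available in memory rather than coming from Prometheus, the same SLIs could be computed as follows. The function names and the "no traffic counts as 100%" convention are assumptions for illustration.

```python
from typing import Sequence

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """successful_requests / total_requests * 100, as in the 'availability' formula."""
    if total_requests == 0:
        return 100.0  # assumption: a window with no traffic meets the objective
    return successful_requests / total_requests * 100

def latency_sli(durations_ms: Sequence[float], threshold_ms: float) -> float:
    """fast_requests / total_requests * 100, where 'fast' means at or under the threshold."""
    if not durations_ms:
        return 100.0  # same no-traffic assumption as above
    fast_requests = sum(1 for d in durations_ms if d <= threshold_ms)
    return fast_requests / len(durations_ms) * 100

if __name__ == "__main__":
    print(availability_sli(999, 1000))
    print(latency_sli([50, 100, 200, 600], threshold_ms=500))
```

The "at or under" comparison matches how Prometheus histogram buckets work, since the `le` label means "less than or equal to".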

### 3. Error Budget Calculation

Implement error budget tracking:

**Error Budget Manager**
```python
class ErrorBudgetManager:
    def __init__(self, slo_target: float, window_days: int):
        self.slo_target = slo_target
        self.window_days = window_days
        self.error_budget_minutes = self._calculate_total_budget()

    def _calculate_total_budget(self):
        """Calculate total error budget in minutes"""
        total_minutes = self.window_days * 24 * 60
        allowed_downtime_ratio = 1 - (self.slo_target / 100)
        return total_minutes * allowed_downtime_ratio

    def calculate_error_budget_status(self, start_date, end_date):
        """Calculate current error budget status"""
        # Get actual performance (assumed to be uptime in minutes)
        actual_uptime = self._get_actual_uptime(start_date, end_date)

        # Calculate consumed budget: every minute of downtime spends budget
        total_time = (end_date - start_date).total_seconds() / 60
        consumed_minutes = total_time - actual_uptime

        # Calculate remaining budget
        remaining_budget = self.error_budget_minutes - consumed_minutes

        # Burn rate: share of budget spent relative to share of the window elapsed.
        # A value of 1.0 means the budget runs out exactly at the end of the window.
        window_minutes = self.window_days * 24 * 60
        consumed_fraction = consumed_minutes / self.error_budget_minutes
        burn_rate = consumed_fraction / (total_time / window_minutes)

        # Project exhaustion at the current burn rate
        if burn_rate > 0:
            days_until_exhaustion = max(
                0.0, (1 - consumed_fraction) * self.window_days / burn_rate
            )
        else:
            days_until_exhaustion = float('inf')

        return {
            'total_budget_minutes': self.error_budget_minutes,
            'consumed_minutes': consumed_minutes,
            'remaining_minutes': remaining_budget,
            'burn_rate': burn_rate,
            'budget_percentage_remaining': (remaining_budget / self.error_budget_minutes) * 100,
            'projected_exhaustion_days': days_until_exhaustion,
            'status': self._determine_status(remaining_budget, burn_rate)
        }

    def _determine_status(self, remaining_budget, burn_rate):
        """Determine error budget status"""
        if remaining_budget <= 0:
            return 'exhausted'
        elif burn_rate > 2:
            return 'critical'
        elif burn_rate > 1.5:
            return 'warning'
        elif burn_rate > 1:
            return 'attention'
        else:
            return 'healthy'

    def generate_burn_rate_alerts(self):
        """Generate multi-window burn rate alerts"""
        # Budget figures assume a 30-day window: burn_rate * hours / 720
        return {
            'fast_burn': {
                'description': '14.4x burn rate over 1 hour',
                'condition': 'burn_rate >= 14.4 AND window = 1h',
                'action': 'page',
                'budget_consumed': '2% in 1 hour'
            },
            'slow_burn': {
                'description': '3x burn rate over 6 hours',
                'condition': 'burn_rate >= 3 AND window = 6h',
                'action': 'ticket',
                'budget_consumed': '2.5% in 6 hours'
            }
        }
```
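
The budget figures attached to burn-rate thresholds follow from simple arithmetic: a burn rate of 1 spends exactly the whole budget over the SLO window, so the fraction spent during an alert window is the burn rate multiplied by the alert window and divided by the SLO window. The sketch below works this through, assuming a 30-day (720-hour) window; the function names are hypothetical.

```python
def total_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total error budget in minutes for an SLO target given in percent."""
    return window_days * 24 * 60 * (1 - slo_target / 100)

def budget_fraction_consumed(burn_rate: float, alert_window_hours: float,
                             slo_window_days: int = 30) -> float:
    """Fraction of the whole budget spent if burn_rate holds for the alert window."""
    slo_window_hours = slo_window_days * 24
    return burn_rate * alert_window_hours / slo_window_hours

if __name__ == "__main__":
    print(f"{total_budget_minutes(99.9):.1f} minutes")    # 43.2 minutes
    print(f"{budget_fraction_consumed(14.4, 1):.1%}")      # 2.0%
    print(f"{budget_fraction_consumed(3, 6):.1%}")         # 2.5%
```

Reading the formula the other way gives the threshold needed for a chosen budget share: spending 2% of a 30-day budget in one hour requires a burn rate of 0.02 × 720 = 14.4.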

### 4. SLO Monitoring Setup

Implement comprehensive SLO monitoring:

**SLO Monitoring Implementation**
```yaml
# Prometheus recording rules for SLO
groups:
  - name: slo_rules
    interval: 30s
    rules:
      # Request rate
      - record: service:request_rate
        expr: |
          sum(rate(http_requests_total[5m])) by (service, method, route)

      # Success rate
      - record: service:success_rate_5m
        expr: |
          (
            sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) * 100

      # Multi-window success rates
      - record: service:success_rate_30m
        expr: |
          (
            sum(rate(http_requests_total{status!~"5.."}[30m])) by (service)
            /
            sum(rate(http_requests_total[30m])) by (service)
          ) * 100

      - record: service:success_rate_1h
        expr: |
          (
            sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
            /
            sum(rate(http_requests_total[1h])) by (service)
          ) * 100

      # Latency percentiles
      - record: service:latency_p50_5m
        expr: |
          histogram_quantile(0.50,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
          )

      - record: service:latency_p95_5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
          )

      - record: service:latency_p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
          )

      # Error budget burn rate
      - record: service:error_budget_burn_rate_1h
        expr: |
          (
            1 - (
              sum(increase(http_requests_total{status!~"5.."}[1h])) by (service)
              /
              sum(increase(http_requests_total[1h])) by (service)
            )
          ) / (1 - 0.999) # 99.9% SLO
```

**Alert Configuration**
```yaml
# Multi-window multi-burn-rate alerts
# Note: these alerts reference 5m, 30m, and 6h burn-rate series. Define recording
# rules for those windows analogous to service:error_budget_burn_rate_1h.
groups:
  - name: slo_alerts
    rules:
      # Fast burn alert (2% budget in 1 hour)
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            service:error_budget_burn_rate_5m{service="api"} > 14.4
            AND
            service:error_budget_burn_rate_1h{service="api"} > 14.4
          )
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Fast error budget burn for {{ $labels.service }}"
          description: |
            Service {{ $labels.service }} is burning error budget at 14.4x rate.
            Current burn rate: {{ $value }}x
            This will exhaust 2% of monthly budget in 1 hour.

      # Slow burn alert (2.5% of a 30-day budget in 6 hours)
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            service:error_budget_burn_rate_30m{service="api"} > 3
            AND
            service:error_budget_burn_rate_6h{service="api"} > 3
          )
        for: 15m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Slow error budget burn for {{ $labels.service }}"
          description: |
            Service {{ $labels.service }} is burning error budget at 3x rate.
            Current burn rate: {{ $value }}x
            This will exhaust 2.5% of monthly budget in 6 hours.
```
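
Each alert above fires only when both a short and a long window exceed the same burn-rate threshold: the long window shows that the burn is significant, and the short window shows that it is still happening. A minimal Python sketch of that condition, with a hypothetical function name and hand-picked sample values:

```python
def multiwindow_alert(short_window_burn: float, long_window_burn: float,
                      threshold: float) -> bool:
    """True only when both windows exceed the threshold (the PromQL AND above)."""
    return short_window_burn > threshold and long_window_burn > threshold

if __name__ == "__main__":
    # Sustained incident: both windows are hot, so the alert fires.
    print(multiwindow_alert(20.0, 15.0, threshold=14.4))
    # Incident already over: the long window is still elevated, the short one has recovered.
    print(multiwindow_alert(1.0, 15.0, threshold=14.4))
    # Brief spike: the short window is hot but the long window barely moved.
    print(multiwindow_alert(20.0, 2.0, threshold=14.4))
```

Requiring both windows trades a little detection delay for fewer pages on spikes that resolve on their own and faster alert reset once an incident ends.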

### 5. SLO Dashboard

Create comprehensive SLO dashboards:

**Grafana Dashboard Configuration**
```python
def create_slo_dashboard():
    """Generate Grafana dashboard for SLO monitoring"""
    return {
        "dashboard": {
            "title": "Service SLO Dashboard",
            "panels": [
                {
                    "title": "SLO Summary",
                    "type": "stat",
                    "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
                    "targets": [{
                        "expr": "service:success_rate_30d{service=\"$service\"}",
                        "legendFormat": "30-day SLO"
                    }],
                    "fieldConfig": {
                        "defaults": {
                            "thresholds": {
                                "mode": "absolute",
                                "steps": [
                                    {"color": "red", "value": None},
                                    {"color": "yellow", "value": 99.5},
                                    {"color": "green", "value": 99.9}
                                ]
                            },
                            "unit": "percent"
                        }
                    }
                },
                {
                    "title": "Error Budget Status",
                    "type": "gauge",
                    "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
                    "targets": [{
                        "expr": '''
                            100 * (
                                1 - (
                                    (1 - service:success_rate_30d{service="$service"}/100) /
                                    (1 - $slo_target/100)
                                )
                            )
                        ''',
                        "legendFormat": "Remaining Budget"
                    }],
                    "fieldConfig": {
                        "defaults": {
                            "min": 0,
                            "max": 100,
                            "thresholds": {
                                "mode": "absolute",
                                "steps": [
                                    {"color": "red", "value": None},
                                    {"color": "yellow", "value": 20},
                                    {"color": "green", "value": 50}
                                ]
                            },
                            "unit": "percent"
                        }
                    }
                },
                {
                    "title": "Burn Rate Trend",
                    "type": "graph",
                    "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
                    "targets": [
                        {
                            "expr": "service:error_budget_burn_rate_1h{service=\"$service\"}",
                            "legendFormat": "1h burn rate"
                        },
                        {
                            "expr": "service:error_budget_burn_rate_6h{service=\"$service\"}",
                            "legendFormat": "6h burn rate"
                        },
                        {
                            "expr": "service:error_budget_burn_rate_24h{service=\"$service\"}",
                            "legendFormat": "24h burn rate"
                        }
                    ],
                    "yaxes": [{
                        "format": "short",
                        "label": "Burn Rate (x)",
                        "min": 0
                    }],
                    "alert": {
                        "conditions": [{
                            "evaluator": {"params": [14.4], "type": "gt"},
                            "operator": {"type": "and"},
                            "query": {"params": ["A", "5m", "now"]},
                            "type": "query"
                        }],
                        "name": "High burn rate detected"
                    }
                }
            ]
        }
    }
```
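
The expression behind the "Error Budget Status" gauge can be checked by hand. This sketch mirrors the same formula in plain Python, assuming both inputs are percentages; `remaining_budget_percent` is a hypothetical name used only for illustration.

```python
def remaining_budget_percent(success_rate: float, slo_target: float) -> float:
    """Remaining error budget as a percentage; negative once the budget is overspent."""
    observed_error_ratio = 1 - success_rate / 100   # errors actually seen
    allowed_error_ratio = 1 - slo_target / 100      # errors the SLO permits
    return 100 * (1 - observed_error_ratio / allowed_error_ratio)

if __name__ == "__main__":
    # Half of the allowed errors used against a 99.9% target: about 50% remains.
    print(round(remaining_budget_percent(99.95, 99.9), 2))
    # Exactly at target: the budget is fully spent.
    print(round(remaining_budget_percent(99.9, 99.9), 2))
    # Below target: the result goes negative (the gauge's min of 0 would clamp it).
    print(round(remaining_budget_percent(99.8, 99.9), 2))
```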

### 6. SLO Reporting

Generate SLO reports and reviews:

**SLO Report Generator**
```python
class SLOReporter:
    def __init__(self, metrics_client):
        self.metrics = metrics_client

    def generate_monthly_report(self, service, month):
        """Generate comprehensive monthly SLO report"""
        report_data = {
            'service': service,
            'period': month,
            'slo_performance': self._calculate_slo_performance(service, month),
            'incidents': self._analyze_incidents(service, month),
            'error_budget': self._analyze_error_budget(service, month),
            'trends': self._analyze_trends(service, month),
            'recommendations': self._generate_recommendations(service, month)
        }

        return self._format_report(report_data)

    def _calculate_slo_performance(self, service, month):
        """Calculate SLO performance metrics"""
        slos = {}

        # Availability SLO
        # Note: `month` is interpolated as a PromQL range, so it must be a duration such as "30d"
        availability_query = f"""
        avg_over_time(
            service:success_rate_5m{{service="{service}"}}[{month}]
        )
        """
        # Query once and reuse the value for both fields
        availability = self.metrics.query(availability_query)
        slos['availability'] = {
            'target': 99.9,
            'actual': availability,
            'met': availability >= 99.9
        }

        # Latency SLO
        latency_query = f"""
        quantile_over_time(0.95,
            service:latency_p95_5m{{service="{service}"}}[{month}]
        )
        """
        # The recorded latency is in seconds; convert to milliseconds
        latency_p95_ms = self.metrics.query(latency_query) * 1000
        slos['latency_p95'] = {
            'target': 500,  # ms
            'actual': latency_p95_ms,
            'met': latency_p95_ms <= 500
        }

        return slos

    def _format_report(self, data):
        """Format report as HTML"""
        return f"""
<!DOCTYPE html>
<html>
<head>
    <title>SLO Report - {data['service']} - {data['period']}</title>
    <style>
        body {{ font-family: Arial, sans-serif; margin: 40px; }}
        .summary {{ background: #f0f0f0; padding: 20px; border-radius: 8px; }}
        .metric {{ margin: 20px 0; }}
        .good {{ color: green; }}
        .bad {{ color: red; }}
        table {{ border-collapse: collapse; width: 100%; }}
        th, td {{ border: 1px solid #ddd; padding: 8px; text-align: left; }}
        .chart {{ margin: 20px 0; }}
    </style>
</head>
<body>
    <h1>SLO Report: {data['service']}</h1>
    <h2>Period: {data['period']}</h2>

    <div class="summary">
        <h3>Executive Summary</h3>
        <p>Service reliability: {data['slo_performance']['availability']['actual']:.2f}%</p>
        <p>Error budget remaining: {data['error_budget']['remaining_percentage']:.1f}%</p>
        <p>Number of incidents: {len(data['incidents'])}</p>
    </div>

    <div class="metric">
        <h3>SLO Performance</h3>
        <table>
            <tr>
                <th>SLO</th>
                <th>Target</th>
                <th>Actual</th>
                <th>Status</th>
            </tr>
            {self._format_slo_table_rows(data['slo_performance'])}
        </table>
    </div>

    <div class="incidents">
        <h3>Incident Analysis</h3>
        {self._format_incident_analysis(data['incidents'])}
    </div>

    <div class="recommendations">
        <h3>Recommendations</h3>
        {self._format_recommendations(data['recommendations'])}
    </div>
</body>
|
|
683
|
+
</html>
|
|
684
|
+
"""
|
|
685
|
+
```
|
|
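
The report template calls `_format_slo_table_rows`, which the source does not show. The sketch below is one way such a helper might look, assuming its input has the shape returned by `_calculate_slo_performance` (a dict of entries with `target`, `actual`, and `met`). The free-function name `format_slo_table_rows`, the "Met"/"Missed" labels, and the HTML escaping step are assumptions added for illustration.

```python
from html import escape


def format_slo_table_rows(slo_performance):
    """Render one <tr> per SLO (hypothetical stand-in for _format_slo_table_rows)."""
    rows = []
    for name, slo in slo_performance.items():
        # Reuse the "good"/"bad" CSS classes defined in the report's <style> block.
        css_class = "good" if slo["met"] else "bad"
        status = "Met" if slo["met"] else "Missed"
        rows.append(
            "<tr>"
            f"<td>{escape(name)}</td>"
            f"<td>{slo['target']}</td>"
            f"<td>{slo['actual']:.2f}</td>"
            f'<td class="{css_class}">{status}</td>'
            "</tr>"
        )
    return "\n".join(rows)


# Illustrative values only, not measurements from a real service.
example = {
    "availability": {"target": 99.9, "actual": 99.95, "met": True},
    "latency_p95": {"target": 500, "actual": 612.4, "met": False},
}
print(format_slo_table_rows(example))
```

Escaping the SLO name is a defensive choice: service and SLO names often come from configuration, and the report interpolates them straight into HTML.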

### 7. SLO-Based Decision Making

Implement SLO-driven engineering decisions:

**SLO Decision Framework**
```python
class SLODecisionFramework:
    def __init__(self, error_budget_policy):
        self.policy = error_budget_policy

    def make_release_decision(self, service, release_risk):
        """Make release decisions based on error budget"""
        budget_status = self.get_error_budget_status(service)

        # Rows: error budget status; columns: release risk level
        decision_matrix = {
            'healthy': {
                'low_risk': 'approve',
                'medium_risk': 'approve',
                'high_risk': 'review'
            },
            'attention': {
                'low_risk': 'approve',
                'medium_risk': 'review',
                'high_risk': 'defer'
            },
            'warning': {
                'low_risk': 'review',
                'medium_risk': 'defer',
                'high_risk': 'block'
            },
            'critical': {
                'low_risk': 'defer',
                'medium_risk': 'block',
                'high_risk': 'block'
            },
            'exhausted': {
                'low_risk': 'block',
                'medium_risk': 'block',
                'high_risk': 'block'
            }
        }

        decision = decision_matrix[budget_status['status']][release_risk]

        return {
            'decision': decision,
            'rationale': self._explain_decision(budget_status, release_risk),
            'conditions': self._get_approval_conditions(decision, budget_status),
            'alternative_actions': self._suggest_alternatives(decision, budget_status)
        }

    def prioritize_reliability_work(self, service):
        """Prioritize reliability improvements based on SLO gaps"""
        slo_gaps = self.analyze_slo_gaps(service)

        priorities = []
        for gap in slo_gaps:
            priority_score = self.calculate_priority_score(gap)

            priorities.append({
                'issue': gap['issue'],
                'impact': gap['impact'],
                'effort': gap['estimated_effort'],
                'priority_score': priority_score,
                'recommended_actions': self.recommend_actions(gap)
            })

        # Highest priority first
        return sorted(priorities, key=lambda x: x['priority_score'], reverse=True)

    def calculate_toil_budget(self, team_size, slo_performance):
        """Calculate how much toil is acceptable based on SLOs"""
        # If meeting SLOs, can afford more toil
        # If not meeting SLOs, need to reduce toil

        # Google SRE guidance treats 50% of each engineer's time as the
        # upper bound for toil; it is used here as the baseline.
        base_toil_percentage = 50

        if slo_performance >= 100:
            # Exceeding SLO, can take on more toil (this goes above the 50% guideline)
            toil_budget = base_toil_percentage + 10
        elif slo_performance >= 99:
            # Meeting SLO
            toil_budget = base_toil_percentage
        else:
            # Not meeting SLO, reduce toil
            toil_budget = base_toil_percentage - (100 - slo_performance) * 5

        # Apply the 20% floor before deriving hours so all three fields agree
        toil_budget = max(toil_budget, 20)

        return {
            'toil_percentage': toil_budget,
            'toil_hours_per_week': (toil_budget / 100) * 40 * team_size,
            'automation_hours_per_week': ((100 - toil_budget) / 100) * 40 * team_size
        }
```
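
`make_release_decision` relies on `get_error_budget_status`, which the source does not define. The sketch below shows one way the status key could be derived from the remaining error budget. The function name `classify_error_budget` and the threshold values (50, 20, and 10 percent) are assumptions for illustration, not values taken from the source.

```python
def classify_error_budget(allowed_failures, observed_failures):
    """Map error budget consumption to one of the decision-matrix status keys.

    Thresholds are illustrative assumptions; tune them to the team's policy.
    """
    if allowed_failures <= 0:
        raise ValueError("allowed_failures must be positive")

    # Percentage of the budget still unspent; negative when overspent.
    remaining = 100.0 * (allowed_failures - observed_failures) / allowed_failures

    if remaining <= 0:
        status = "exhausted"
    elif remaining < 10:
        status = "critical"
    elif remaining < 20:
        status = "warning"
    elif remaining < 50:
        status = "attention"
    else:
        status = "healthy"

    return {"status": status, "remaining_percentage": max(remaining, 0.0)}


# A 99.9% target over 1,000,000 requests allows 1,000 failures.
print(classify_error_budget(allowed_failures=1000, observed_failures=850))
```

The returned dict uses the same `status` key that the decision matrix indexes on, so it could be passed through unchanged.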

### 8. SLO Templates

Provide SLO templates for common services:

**SLO Template Library**
```python
class SLOTemplates:
    @staticmethod
    def get_api_service_template():
        """SLO template for API services"""
        return {
            'name': 'API Service SLO Template',
            'slos': [
                {
                    'name': 'availability',
                    'description': 'The proportion of successful requests',
                    'sli': {
                        'type': 'ratio',
                        'good_events': 'requests with status != 5xx',
                        'total_events': 'all requests'
                    },
                    'objectives': [
                        {'window': '30d', 'target': 99.9}
                    ]
                },
                {
                    'name': 'latency',
                    'description': 'The proportion of fast requests',
                    'sli': {
                        'type': 'ratio',
                        'good_events': 'requests faster than 500ms',
                        'total_events': 'all requests'
                    },
                    'objectives': [
                        {'window': '30d', 'target': 95.0}
                    ]
                }
            ]
        }

    @staticmethod
    def get_data_pipeline_template():
        """SLO template for data pipelines"""
        return {
            'name': 'Data Pipeline SLO Template',
            'slos': [
                {
                    'name': 'freshness',
                    'description': 'Data is processed within SLA',
                    'sli': {
                        'type': 'ratio',
                        'good_events': 'batches processed within 30 minutes',
                        'total_events': 'all batches'
                    },
                    'objectives': [
                        {'window': '7d', 'target': 99.0}
                    ]
                },
                {
                    'name': 'completeness',
                    'description': 'All expected data is processed',
                    'sli': {
                        'type': 'ratio',
                        'good_events': 'records successfully processed',
                        'total_events': 'all records'
                    },
                    'objectives': [
                        {'window': '7d', 'target': 99.95}
                    ]
                }
            ]
        }
```
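
Every SLI in these templates is a ratio of good events to total events, compared against a percentage target. A minimal sketch of that evaluation follows; the function name `evaluate_ratio_slo`, the event counts, and the choice to treat a window with no traffic as meeting the SLO are assumptions for illustration.

```python
def evaluate_ratio_slo(good_events, total_events, target_percent):
    """Evaluate a ratio SLI (good / total) against a percentage target."""
    if total_events == 0:
        # No traffic in the window: treat the SLO as met with the budget untouched.
        return {"sli_percent": 100.0, "met": True, "budget_consumed_percent": 0.0}

    sli_percent = 100.0 * good_events / total_events
    # The error budget is the number of bad events the target tolerates.
    allowed_bad = total_events * (100.0 - target_percent) / 100.0
    actual_bad = total_events - good_events
    if allowed_bad > 0:
        consumed = 100.0 * actual_bad / allowed_bad
    else:
        # A 100% target leaves no budget at all.
        consumed = 0.0 if actual_bad == 0 else float("inf")
    return {
        "sli_percent": sli_percent,
        "met": sli_percent >= target_percent,
        "budget_consumed_percent": consumed,
    }


# Objective taken from the API availability template: 99.9% over 30 days.
objective = {"window": "30d", "target": 99.9}
result = evaluate_ratio_slo(good_events=999_500, total_events=1_000_000,
                            target_percent=objective["target"])
print(result)
```

With 500 failures out of 1,000,000 requests the SLI is 99.95%, which meets the 99.9% target while spending roughly half of the 1,000-failure budget.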

### 9. SLO Automation

Automate SLO management:

**SLO Automation Tools**
```python
class SLOAutomation:
    def __init__(self):
        self.config = self.load_slo_config()

    def auto_generate_slos(self, service_discovery):
        """Automatically generate SLOs for discovered services"""
        services = service_discovery.get_all_services()
        generated_slos = []

        for service in services:
            # Analyze service characteristics
            characteristics = self.analyze_service(service)

            # Select appropriate template
            template = self.select_template(characteristics)

            # Customize based on observed behavior
            customized_slo = self.customize_slo(template, service)

            generated_slos.append(customized_slo)

        return generated_slos

    def implement_progressive_slos(self, service):
        """Implement progressively stricter SLOs"""
        return {
            'phase1': {
                'duration': '1 month',
                'target': 99.0,
                'description': 'Baseline establishment'
            },
            'phase2': {
                'duration': '2 months',
                'target': 99.5,
                'description': 'Initial improvement'
            },
            'phase3': {
                'duration': '3 months',
                'target': 99.9,
                'description': 'Production readiness'
            },
            'phase4': {
                'duration': 'ongoing',
                'target': 99.95,
                'description': 'Excellence'
            }
        }

    def create_slo_as_code(self):
        """Define SLOs as code"""
        return '''
# slo_definitions.yaml
apiVersion: slo.dev/v1
kind: ServiceLevelObjective
metadata:
  name: api-availability
  namespace: production
spec:
  service: api-service
  description: API service availability SLO

  indicator:
    type: ratio
    counter:
      metric: http_requests_total
      filters:
        - status_code != 5xx
    total:
      metric: http_requests_total

  objectives:
    - displayName: 30-day rolling window
      window: 30d
      target: 0.999

  alerting:
    burnRates:
      - severity: critical
        shortWindow: 5m
        longWindow: 1h
        burnRate: 14.4
      - severity: warning
        shortWindow: 30m
        longWindow: 6h
        burnRate: 3

  annotations:
    runbook: https://runbooks.example.com/api-availability
    dashboard: https://grafana.example.com/d/api-slo
'''
```
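
Each `burnRates` entry pairs a burn-rate threshold with two windows. Burn rate is the observed error ratio divided by the error ratio the SLO allows (1 minus the target), so a rate of 14.4 sustained for one hour spends 14.4 × 1 / 720 = 2% of a 30-day budget. The sketch below works through that arithmetic; the function names and sample error ratios are assumptions for illustration.

```python
def burn_rate(error_ratio, target):
    """Burn rate = observed error ratio / error ratio the SLO allows (1 - target)."""
    return error_ratio / (1.0 - target)


def should_alert(long_window_error_ratio, short_window_error_ratio, target, threshold):
    """Fire only when both windows exceed the threshold.

    The long window shows the burn is sustained; the short window confirms
    it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, target) >= threshold
        and burn_rate(short_window_error_ratio, target) >= threshold
    )


def budget_consumed_fraction(rate, alert_window_hours, slo_window_hours):
    """Fraction of the error budget spent if `rate` holds for the alert window."""
    return rate * alert_window_hours / slo_window_hours


target = 0.999  # same value as `target: 0.999` in the SLO definition
print(burn_rate(0.0144, target))
print(budget_consumed_fraction(14.4, alert_window_hours=1, slo_window_hours=30 * 24))
# Long window is burning fast, but the short window has already recovered.
print(should_alert(0.02, 0.0005, target, threshold=14.4))
```

Requiring both windows is what keeps a short spike from paging and stops an alert from lingering after the error rate has dropped.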

### 10. SLO Culture and Governance

Establish SLO culture:

**SLO Governance Framework**
```python
class SLOGovernance:
    def establish_slo_culture(self):
        """Establish SLO-driven culture"""
        return {
            'principles': [
                'SLOs are a shared responsibility',
                'Error budgets drive prioritization',
                'Reliability is a feature',
                'Measure what matters to users'
            ],
            'practices': {
                'weekly_reviews': self.weekly_slo_review_template(),
                'incident_retrospectives': self.slo_incident_template(),
                'quarterly_planning': self.quarterly_slo_planning(),
                'stakeholder_communication': self.stakeholder_report_template()
            },
            'roles': {
                'slo_owner': {
                    'responsibilities': [
                        'Define and maintain SLO definitions',
                        'Monitor SLO performance',
                        'Lead SLO reviews',
                        'Communicate with stakeholders'
                    ]
                },
                'engineering_team': {
                    'responsibilities': [
                        'Implement SLI measurements',
                        'Respond to SLO breaches',
                        'Improve reliability',
                        'Participate in reviews'
                    ]
                },
                'product_owner': {
                    'responsibilities': [
                        'Balance features vs reliability',
                        'Approve error budget usage',
                        'Set business priorities',
                        'Communicate with customers'
                    ]
                }
            }
        }

    def create_slo_review_process(self):
        """Create structured SLO review process"""
        return '''
# Weekly SLO Review Template

## Agenda (30 minutes)

### 1. SLO Performance Review (10 min)
- Current SLO status for all services
- Error budget consumption rate
- Trend analysis

### 2. Incident Review (10 min)
- Incidents impacting SLOs
- Root cause analysis
- Action items

### 3. Decision Making (10 min)
- Release approvals/deferrals
- Resource allocation
- Priority adjustments

## Review Checklist

- [ ] All SLOs reviewed
- [ ] Burn rates analyzed
- [ ] Incidents discussed
- [ ] Action items assigned
- [ ] Decisions documented

## Output Template

### Service: [Service Name]
- **SLO Status**: [Green/Yellow/Red]
- **Error Budget**: [XX%] remaining
- **Key Issues**: [List]
- **Actions**: [List with owners]
- **Decisions**: [List]
'''
```
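
The per-service "Output Template" in the review process can be filled in mechanically once the error budget is known. A minimal sketch, assuming a hypothetical helper named `render_service_review`; the Green/Yellow/Red thresholds (50% and 20% remaining) are illustrative assumptions, since the source does not define them.

```python
def render_service_review(service, remaining_budget_percent, issues, actions, decisions):
    """Fill in the per-service output template from the weekly review."""
    # Colour thresholds are illustrative assumptions, not part of the template.
    if remaining_budget_percent >= 50:
        status = "Green"
    elif remaining_budget_percent >= 20:
        status = "Yellow"
    else:
        status = "Red"

    def as_list(items):
        # Collapse a list into one line; show "None" when it is empty.
        return "; ".join(items) if items else "None"

    return "\n".join([
        f"### Service: {service}",
        f"- **SLO Status**: {status}",
        f"- **Error Budget**: {remaining_budget_percent:.0f}% remaining",
        f"- **Key Issues**: {as_list(issues)}",
        f"- **Actions**: {as_list(actions)}",
        f"- **Decisions**: {as_list(decisions)}",
    ])


# Sample inputs are invented for the example.
print(render_service_review(
    service="api-service",
    remaining_budget_percent=35,
    issues=["Elevated 5xx rate during deploys"],
    actions=["Add canary stage (owner: platform team)"],
    decisions=["Defer high-risk releases this week"],
))
```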

## Output Format

1. **SLO Framework**: Comprehensive SLO design and objectives
2. **SLI Implementation**: Code and queries for measuring SLIs
3. **Error Budget Tracking**: Calculations and burn rate monitoring
4. **Monitoring Setup**: Prometheus rules and Grafana dashboards
5. **Alert Configuration**: Multi-window multi-burn-rate alerts
6. **Reporting Templates**: Monthly reports and reviews
7. **Decision Framework**: SLO-based engineering decisions
8. **Automation Tools**: SLO-as-code and auto-generation
9. **Governance Process**: Culture and review processes

Focus on creating meaningful SLOs that balance reliability with feature velocity, providing clear signals for engineering decisions and fostering a culture of reliability.