aigent-team 0.1.0
- package/LICENSE +21 -0
- package/README.md +253 -0
- package/dist/chunk-N3RYHWTR.js +267 -0
- package/dist/cli.js +576 -0
- package/dist/index.d.ts +234 -0
- package/dist/index.js +27 -0
- package/package.json +67 -0
- package/templates/shared/git-workflow.md +44 -0
- package/templates/shared/project-conventions.md +48 -0
- package/templates/teams/ba/agent.yaml +25 -0
- package/templates/teams/ba/references/acceptance-criteria.md +87 -0
- package/templates/teams/ba/references/api-contract-design.md +110 -0
- package/templates/teams/ba/references/requirements-analysis.md +83 -0
- package/templates/teams/ba/references/user-story-mapping.md +73 -0
- package/templates/teams/ba/skill.md +85 -0
- package/templates/teams/be/agent.yaml +34 -0
- package/templates/teams/be/conventions.md +102 -0
- package/templates/teams/be/references/api-design.md +91 -0
- package/templates/teams/be/references/async-processing.md +86 -0
- package/templates/teams/be/references/auth-security.md +58 -0
- package/templates/teams/be/references/caching.md +79 -0
- package/templates/teams/be/references/database.md +65 -0
- package/templates/teams/be/references/error-handling.md +106 -0
- package/templates/teams/be/references/observability.md +83 -0
- package/templates/teams/be/references/review-checklist.md +50 -0
- package/templates/teams/be/references/testing.md +100 -0
- package/templates/teams/be/review-checklist.md +54 -0
- package/templates/teams/be/skill.md +71 -0
- package/templates/teams/devops/agent.yaml +35 -0
- package/templates/teams/devops/conventions.md +133 -0
- package/templates/teams/devops/references/ci-cd.md +218 -0
- package/templates/teams/devops/references/cost-optimization.md +218 -0
- package/templates/teams/devops/references/disaster-recovery.md +199 -0
- package/templates/teams/devops/references/docker.md +237 -0
- package/templates/teams/devops/references/infrastructure-as-code.md +238 -0
- package/templates/teams/devops/references/kubernetes.md +397 -0
- package/templates/teams/devops/references/monitoring.md +224 -0
- package/templates/teams/devops/references/review-checklist.md +149 -0
- package/templates/teams/devops/references/security.md +225 -0
- package/templates/teams/devops/review-checklist.md +72 -0
- package/templates/teams/devops/skill.md +131 -0
- package/templates/teams/fe/agent.yaml +28 -0
- package/templates/teams/fe/conventions.md +80 -0
- package/templates/teams/fe/references/accessibility.md +92 -0
- package/templates/teams/fe/references/component-architecture.md +87 -0
- package/templates/teams/fe/references/css-styling.md +89 -0
- package/templates/teams/fe/references/forms.md +73 -0
- package/templates/teams/fe/references/performance.md +104 -0
- package/templates/teams/fe/references/review-checklist.md +51 -0
- package/templates/teams/fe/references/security.md +90 -0
- package/templates/teams/fe/references/state-management.md +117 -0
- package/templates/teams/fe/references/testing.md +112 -0
- package/templates/teams/fe/review-checklist.md +53 -0
- package/templates/teams/fe/skill.md +68 -0
- package/templates/teams/lead/agent.yaml +18 -0
- package/templates/teams/lead/references/cross-team-coordination.md +68 -0
- package/templates/teams/lead/references/quality-gates.md +64 -0
- package/templates/teams/lead/references/task-decomposition.md +69 -0
- package/templates/teams/lead/skill.md +83 -0
- package/templates/teams/qa/agent.yaml +32 -0
- package/templates/teams/qa/conventions.md +130 -0
- package/templates/teams/qa/references/ci-integration.md +337 -0
- package/templates/teams/qa/references/e2e-testing.md +292 -0
- package/templates/teams/qa/references/mocking.md +249 -0
- package/templates/teams/qa/references/performance-testing.md +288 -0
- package/templates/teams/qa/references/review-checklist.md +143 -0
- package/templates/teams/qa/references/security-testing.md +271 -0
- package/templates/teams/qa/references/test-data.md +275 -0
- package/templates/teams/qa/references/test-strategy.md +192 -0
- package/templates/teams/qa/review-checklist.md +53 -0
- package/templates/teams/qa/skill.md +131 -0
package/templates/teams/devops/references/kubernetes.md
@@ -0,0 +1,397 @@

# Kubernetes Reference

## Namespace Strategy

```
Namespaces:
├── kube-system         # Cluster components (do not deploy here)
├── monitoring          # Prometheus, Grafana, Loki
├── ingress             # Ingress controllers
├── cert-manager        # TLS certificate management
├── external-secrets    # Secrets operator
├── app-dev             # Application workloads — dev
├── app-staging         # Application workloads — staging
└── app-production      # Application workloads — production
```

### Rules

- One namespace per environment per application domain.
- Apply `ResourceQuota` and `LimitRange` to every namespace.
- Apply `NetworkPolicy` default-deny to every namespace.
- Label namespaces consistently: `team`, `environment`, `managed-by`.
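The labeling rule above can be sketched as a Namespace manifest (label values here are illustrative, not prescribed by this reference):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: app-production
  labels:
    team: platform          # illustrative values — use your own taxonomy
    environment: production
    managed-by: terraform
```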

### ResourceQuota Example

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: default-quota
  namespace: app-production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services: "20"
    persistentvolumeclaims: "30"
```

---

## Resource Management

### Requests and Limits

```yaml
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi
```

### Sizing Guidelines

| Workload Type | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| API server | 250m | 1000m | 256Mi | 512Mi |
| Worker/consumer | 500m | 2000m | 512Mi | 1Gi |
| Batch job | 1000m | 4000m | 1Gi | 4Gi |
| Sidecar (proxy) | 50m | 200m | 64Mi | 128Mi |

### Rules

- **Always set requests.** The scheduler uses requests for placement.
- **Always set limits for memory.** OOM-killed pods are better than node exhaustion.
- **CPU limits are debated.** Set them to prevent runaway, but be aware of throttling. A 4:1 limit-to-request ratio is a reasonable starting point.
- Use VPA (Vertical Pod Autoscaler) recommendations to right-size after running for 7+ days.
- Start generous, then tighten based on metrics.

### LimitRange (Namespace Defaults)

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - default:
        cpu: 500m
        memory: 256Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      type: Container
```

---

## Probes

### Liveness Probe

Answers: "Is the process hung?" — restarts the container if it fails.

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3
```

### Readiness Probe

Answers: "Can the pod serve traffic?" — removes the pod from Service endpoints if it fails.

```yaml
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
```

### Startup Probe

Answers: "Has the app finished starting?" — disables liveness/readiness until it succeeds.

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 30  # 30 x 5s = 150s max startup time
```

### Probe Rules

- **Every container** must have readiness and liveness probes.
- Use a startup probe for slow-starting apps (JVM, ML model loading).
- Liveness should check internal health only (not downstream deps).
- Readiness should check ability to serve (including critical deps like DB).
- Never make liveness depend on external services — cascading restarts.
- Separate endpoints: `/healthz` (liveness), `/readyz` (readiness).
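The endpoint split above can be sketched with two stdlib-only handlers (a hypothetical service; the `db_ok`/`cache_warmed` checks are illustrative):

```python
# Sketch of the /healthz vs /readyz split (hypothetical service, stdlib only).
# Liveness never looks at downstream dependencies; readiness does.

def healthz() -> int:
    """Liveness: is the process responsive at all? 200 unless hung."""
    return 200

def readyz(db_ok: bool, cache_warmed: bool) -> int:
    """Readiness: can this pod serve traffic right now?"""
    return 200 if db_ok and cache_warmed else 503
```

A pod failing `/readyz` is removed from Service endpoints but not restarted; only a failing `/healthz` triggers a restart.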

---

## Pod Disruption Budgets

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: user-api-pdb
spec:
  minAvailable: 1  # OR maxUnavailable: 1
  selector:
    matchLabels:
      app: user-api
```

### Rules

- Every production Deployment with 2+ replicas must have a PDB.
- Use `minAvailable` for critical services.
- Use `maxUnavailable: 1` for batch workers.
- A PDB prevents node drains from killing all pods simultaneously.

---

## Security Context

### Pod Level

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 65534  # nobody
  runAsGroup: 65534
  fsGroup: 65534
  seccompProfile:
    type: RuntimeDefault
```

### Container Level

```yaml
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
```

### Rules

- `runAsNonRoot: true` — always. No exceptions in production.
- `readOnlyRootFilesystem: true` — mount `emptyDir` for temp writes.
- `allowPrivilegeEscalation: false` — always.
- Drop ALL capabilities, add back only what is strictly needed.
- Use `seccompProfile: RuntimeDefault` at minimum.

---

## Network Policies

### Default Deny All

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: app-production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```

### Allow Specific Traffic

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-user-api
spec:
  podSelector:
    matchLabels:
      app: user-api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - port: 5432
    - to:  # Allow DNS
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```

### Rules

- Start with default-deny in every namespace.
- Explicitly allow only required traffic paths.
- Always allow DNS egress (port 53) or pods cannot resolve services.
- Document network flows in architecture diagrams.

---

## Secrets Management

### External Secrets Operator (Preferred)

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: user-api-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: user-api-secrets
  data:
    - secretKey: database-url
      remoteRef:
        key: secret/data/user-api
        property: database_url
```

### Rules

- Never store secrets in plain K8s Secrets manifests in git.
- Use External Secrets Operator, Sealed Secrets, or CSI Secrets Store.
- Rotate secrets on a schedule (90 days max for credentials).
- Audit secret access via cloud provider audit logs.
- Mount secrets as files, not env vars (env vars leak in crash dumps and `kubectl describe`).
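The files-not-env-vars rule, sketched as a pod spec fragment (volume name and mount path are illustrative):

```yaml
# Pod spec fragment — secrets arrive as files under /etc/secrets, not env vars.
containers:
  - name: user-api
    volumeMounts:
      - name: secrets
        mountPath: /etc/secrets
        readOnly: true
volumes:
  - name: secrets
    secret:
      secretName: user-api-secrets
```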

---

## Image Pull Policy

| Tag Type | Policy | Rationale |
|---|---|---|
| Git SHA (`abc1234`) | `IfNotPresent` | Immutable, no need to re-pull |
| Semver (`1.2.3`) | `IfNotPresent` | Immutable (if you follow semver) |
| `:latest` | `Always` | Mutable — but don't use `:latest` |
| Branch (`main`) | `Always` | Mutable — only for dev |

---

## Deployment Template

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-api
  labels:
    app: user-api
    version: v1
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: user-api
  template:
    metadata:
      labels:
        app: user-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: user-api
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: user-api
          image: registry.company.com/user-api:abc1234
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: user-api
```

package/templates/teams/devops/references/monitoring.md
@@ -0,0 +1,224 @@

# Monitoring and Observability Reference

## Three Pillars

### 1. Metrics (Prometheus / Grafana)

Numeric measurements aggregated over time. Best for dashboards and alerting.

- **Counter**: Monotonically increasing (requests_total, errors_total).
- **Gauge**: Point-in-time value (active_connections, queue_depth).
- **Histogram**: Distribution of values (request_duration_seconds).

### 2. Logs (Loki / ELK / CloudWatch)

Timestamped text records of discrete events. Best for debugging specific incidents.

- Structured JSON only — no unstructured text logs.
- Include: timestamp, level, service, trace_id, message, context fields.
- Exclude: PII, secrets, full request/response bodies.

### 3. Traces (Jaeger / Tempo / X-Ray)

End-to-end request flow across services. Best for understanding latency and dependencies.

- Instrument all service boundaries (HTTP, gRPC, message queues).
- Propagate trace context headers (`traceparent` / W3C Trace Context).
- Sample at 1-10% in production (100% in dev/staging).
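For reference, a `traceparent` header looks like this (example IDs taken from the W3C Trace Context spec):

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

Fields, left to right: version (`00`), trace-id (16 bytes, hex), parent span id (8 bytes, hex), trace flags (`01` = sampled).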

---

## Golden Signals Dashboard

Every service must have a dashboard showing the four golden signals:

| Signal | Metric | Example PromQL |
|---|---|---|
| **Latency** | Request duration (p50, p90, p99) | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
| **Traffic** | Requests per second | `rate(http_requests_total[5m])` |
| **Errors** | Error rate (%) | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100` |
| **Saturation** | CPU/memory/queue utilization | `container_memory_working_set_bytes / container_spec_memory_limit_bytes * 100` |

### Dashboard Requirements

- Every service has a dedicated Grafana dashboard.
- Dashboard definition stored in Git (JSON model or Grafonnet).
- Include: golden signals, resource usage, key business metrics.
- Time range selectors: last 1h, 6h, 24h, 7d.
- Variable selectors: environment, namespace, pod.

---

## Alert Severity Levels

| Level | Criteria | Response Time | Notification | Example |
|---|---|---|---|---|
| **P1 — Critical** | Customer-facing outage, data loss risk | 5 min | Page on-call, auto-create incident | API error rate > 10%, database down |
| **P2 — High** | Degraded performance, partial outage | 15 min | Page on-call | p99 latency > 5s, disk > 90% |
| **P3 — Warning** | Approaching threshold, non-critical issue | 1 hour | Slack channel | Disk > 80%, certificate expiry < 14d |
| **P4 — Info** | Anomaly, needs investigation during business hours | Next business day | Slack channel | Unusual traffic spike, dependency slow |

### Alert Rules

```yaml
# Prometheus alert example
groups:
  - name: user-api
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{service="user-api", status=~"5.."}[5m])
          / rate(http_requests_total{service="user-api"}[5m]) > 0.05
        for: 5m
        labels:
          severity: P1
          team: platform
        annotations:
          summary: "user-api error rate > 5%"
          runbook: "https://wiki.company.com/runbooks/user-api-errors"
          dashboard: "https://grafana.company.com/d/user-api"
```

### Alerting Principles

- **Alert on symptoms, not causes.** Alert on "error rate high," not "pod restarted."
- **Alert on SLO burn rate.** If you are burning through your error budget faster than expected, page.
- **Every alert must have a runbook.** No runbook = not a valid alert.
- **Every alert must be actionable.** If the response is "wait and see," it should be a dashboard, not an alert.
- **Deduplicate.** One page per incident, not one per pod.
- **Test alerts.** Fire test alerts monthly to verify routing.
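One way to codify the burn-rate principle is a fast-burn Prometheus rule. This is a sketch for a hypothetical 99.9% SLO on `user-api`; the 14.4x factor is the common "spend 2% of a 30-day budget in one hour" heuristic, not something this reference mandates:

```yaml
- alert: ErrorBudgetFastBurn
  expr: |
    rate(http_requests_total{service="user-api", status=~"5.."}[1h])
    / rate(http_requests_total{service="user-api"}[1h])
    > 14.4 * 0.001   # burn rate 14.4x on a 0.1% error budget
  for: 5m
  labels:
    severity: P1
```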

---

## On-Call Runbook Requirements

Every runbook must contain:

```markdown
# Runbook: <Alert Name>

## Alert Description
What this alert means and why it fires.

## Impact
What is the customer-facing impact?

## Quick Mitigation
Step-by-step actions to restore service (within 5 minutes):
1. ...
2. ...
3. ...

## Diagnosis
How to determine root cause:
- Dashboard link
- Log query
- Trace search

## Resolution
Longer-term fix steps.

## Escalation
Who to contact if mitigation fails.

## History
| Date | Cause | Resolution | Duration |
|------|-------|------------|----------|
```

### Runbook Rules

- Stored in Git alongside alert definitions.
- Reviewed and updated after every incident.
- Must be executable by anyone on the on-call rotation.
- Include exact commands, not vague instructions.
- Link to relevant dashboards and log queries.

---

## Structured Logging Standards

### Log Format (JSON)

```json
{
  "timestamp": "2025-01-15T10:30:00.123Z",
  "level": "error",
  "service": "user-api",
  "version": "1.2.3",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Failed to fetch user profile",
  "error": "connection timeout",
  "user_id": "usr_REDACTED",
  "duration_ms": 5000,
  "method": "GET",
  "path": "/api/v1/users/123",
  "status_code": 504
}
```

### Log Levels

| Level | When to Use | Example |
|---|---|---|
| `error` | Operation failed, needs attention | Database query failed, external API error |
| `warn` | Recoverable issue, may need attention | Retry succeeded, cache miss, slow query |
| `info` | Significant business events | Request served, user created, job completed |
| `debug` | Diagnostic detail (disabled in prod) | Query parameters, intermediate state |

### Rules

- JSON format only — no printf-style logs.
- Include `trace_id` in every log line for correlation.
- Never log: passwords, tokens, PII, credit card numbers, full request bodies.
- Redact or hash user identifiers in logs.
- Log at request boundaries: start, end, error.
- Use log sampling for high-volume debug logs.
- Set retention: 7 days hot, 30 days warm, 90 days cold (adjust per compliance).
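A stdlib-only sketch of a formatter emitting the shape above, trimmed to a few fields (a real service would add the remaining fields and inject `trace_id` from request context rather than relying on the `"unknown"` fallback shown here):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str, version: str):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # UTC timestamp with milliseconds, e.g. 2025-01-15T10:30:00.123Z
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)
            ) + f".{int(record.msecs):03d}Z",
            "level": record.levelname.lower(),
            "service": self.service,
            "version": self.version,
            # Expect trace_id to be attached via `extra=` or a logging.Filter
            "trace_id": getattr(record, "trace_id", "unknown"),
            "message": record.getMessage(),
        }
        return json.dumps(entry)
```

Attach it with `handler.setFormatter(JsonFormatter("user-api", "1.2.3"))` on a `StreamHandler`.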

---

## SLO Framework

### Defining SLOs

| Service | SLI | SLO | Error Budget (30d) |
|---|---|---|---|
| User API | Successful requests / total requests | 99.9% | 43.2 min downtime |
| Payment API | Successful requests / total requests | 99.99% | 4.3 min downtime |
| Background Jobs | Jobs completed / jobs submitted | 99.5% | 3.6 hours delay |
| Dashboard | Page load < 3s | 95% | 36 hours of slow loads |

### Error Budget Policy

- **Budget remaining > 50%**: Ship freely, experiment.
- **Budget remaining 20-50%**: Caution, increase review rigor.
- **Budget remaining < 20%**: Freeze features, focus on reliability.
- **Budget exhausted**: All engineering effort on reliability until budget recovers.
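The error-budget column in the table above is just `(1 - SLO) * window`; a quick sanity check:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60

# 99.9% over 30 days -> 43.2 minutes; 99.99% -> 4.32 minutes
print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9999), 2))  # 4.32
```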

---

## Metrics Naming Convention

```
<namespace>_<subsystem>_<name>_<unit>
```

Examples:
- `http_server_request_duration_seconds`
- `http_server_requests_total`
- `db_pool_connections_active`
- `queue_messages_pending`

### Rules

- Use `_total` suffix for counters.
- Use `_seconds`, `_bytes`, `_ratio` for units.
- Use snake_case.
- Prefix with service/subsystem namespace.
- Keep cardinality low — avoid high-cardinality labels (user_id, request_id).