aigent-team 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71)
  1. package/LICENSE +21 -0
  2. package/README.md +253 -0
  3. package/dist/chunk-N3RYHWTR.js +267 -0
  4. package/dist/cli.js +576 -0
  5. package/dist/index.d.ts +234 -0
  6. package/dist/index.js +27 -0
  7. package/package.json +67 -0
  8. package/templates/shared/git-workflow.md +44 -0
  9. package/templates/shared/project-conventions.md +48 -0
  10. package/templates/teams/ba/agent.yaml +25 -0
  11. package/templates/teams/ba/references/acceptance-criteria.md +87 -0
  12. package/templates/teams/ba/references/api-contract-design.md +110 -0
  13. package/templates/teams/ba/references/requirements-analysis.md +83 -0
  14. package/templates/teams/ba/references/user-story-mapping.md +73 -0
  15. package/templates/teams/ba/skill.md +85 -0
  16. package/templates/teams/be/agent.yaml +34 -0
  17. package/templates/teams/be/conventions.md +102 -0
  18. package/templates/teams/be/references/api-design.md +91 -0
  19. package/templates/teams/be/references/async-processing.md +86 -0
  20. package/templates/teams/be/references/auth-security.md +58 -0
  21. package/templates/teams/be/references/caching.md +79 -0
  22. package/templates/teams/be/references/database.md +65 -0
  23. package/templates/teams/be/references/error-handling.md +106 -0
  24. package/templates/teams/be/references/observability.md +83 -0
  25. package/templates/teams/be/references/review-checklist.md +50 -0
  26. package/templates/teams/be/references/testing.md +100 -0
  27. package/templates/teams/be/review-checklist.md +54 -0
  28. package/templates/teams/be/skill.md +71 -0
  29. package/templates/teams/devops/agent.yaml +35 -0
  30. package/templates/teams/devops/conventions.md +133 -0
  31. package/templates/teams/devops/references/ci-cd.md +218 -0
  32. package/templates/teams/devops/references/cost-optimization.md +218 -0
  33. package/templates/teams/devops/references/disaster-recovery.md +199 -0
  34. package/templates/teams/devops/references/docker.md +237 -0
  35. package/templates/teams/devops/references/infrastructure-as-code.md +238 -0
  36. package/templates/teams/devops/references/kubernetes.md +397 -0
  37. package/templates/teams/devops/references/monitoring.md +224 -0
  38. package/templates/teams/devops/references/review-checklist.md +149 -0
  39. package/templates/teams/devops/references/security.md +225 -0
  40. package/templates/teams/devops/review-checklist.md +72 -0
  41. package/templates/teams/devops/skill.md +131 -0
  42. package/templates/teams/fe/agent.yaml +28 -0
  43. package/templates/teams/fe/conventions.md +80 -0
  44. package/templates/teams/fe/references/accessibility.md +92 -0
  45. package/templates/teams/fe/references/component-architecture.md +87 -0
  46. package/templates/teams/fe/references/css-styling.md +89 -0
  47. package/templates/teams/fe/references/forms.md +73 -0
  48. package/templates/teams/fe/references/performance.md +104 -0
  49. package/templates/teams/fe/references/review-checklist.md +51 -0
  50. package/templates/teams/fe/references/security.md +90 -0
  51. package/templates/teams/fe/references/state-management.md +117 -0
  52. package/templates/teams/fe/references/testing.md +112 -0
  53. package/templates/teams/fe/review-checklist.md +53 -0
  54. package/templates/teams/fe/skill.md +68 -0
  55. package/templates/teams/lead/agent.yaml +18 -0
  56. package/templates/teams/lead/references/cross-team-coordination.md +68 -0
  57. package/templates/teams/lead/references/quality-gates.md +64 -0
  58. package/templates/teams/lead/references/task-decomposition.md +69 -0
  59. package/templates/teams/lead/skill.md +83 -0
  60. package/templates/teams/qa/agent.yaml +32 -0
  61. package/templates/teams/qa/conventions.md +130 -0
  62. package/templates/teams/qa/references/ci-integration.md +337 -0
  63. package/templates/teams/qa/references/e2e-testing.md +292 -0
  64. package/templates/teams/qa/references/mocking.md +249 -0
  65. package/templates/teams/qa/references/performance-testing.md +288 -0
  66. package/templates/teams/qa/references/review-checklist.md +143 -0
  67. package/templates/teams/qa/references/security-testing.md +271 -0
  68. package/templates/teams/qa/references/test-data.md +275 -0
  69. package/templates/teams/qa/references/test-strategy.md +192 -0
  70. package/templates/teams/qa/review-checklist.md +53 -0
  71. package/templates/teams/qa/skill.md +131 -0
@@ -0,0 +1,397 @@
1
+ # Kubernetes Reference
2
+
3
+ ## Namespace Strategy
4
+
5
+ ```
6
+ Namespaces:
7
+ ├── kube-system # Cluster components (do not deploy here)
8
+ ├── monitoring # Prometheus, Grafana, Loki
9
+ ├── ingress # Ingress controllers
10
+ ├── cert-manager # TLS certificate management
11
+ ├── external-secrets # Secrets operator
12
+ ├── app-dev # Application workloads — dev
13
+ ├── app-staging # Application workloads — staging
14
+ └── app-production # Application workloads — production
15
+ ```
16
+
17
+ ### Rules
18
+
19
+ - One namespace per environment per application domain.
20
+ - Apply `ResourceQuota` and `LimitRange` to every namespace.
21
+ - Apply `NetworkPolicy` default-deny to every namespace.
22
+ - Label namespaces consistently: `team`, `environment`, `managed-by`.
23
+
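+ A minimal sketch of a namespace that follows the labeling rule above. The label values (`platform`, `terraform`) are placeholders chosen for illustration, not values prescribed by this reference.
+
+ ```yaml
+ apiVersion: v1
+ kind: Namespace
+ metadata:
+   name: app-production
+   labels:
+     team: platform            # assumed team name
+     environment: production
+     managed-by: terraform     # assumed tooling
+ ```
+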
24
+ ### ResourceQuota Example
25
+
26
+ ```yaml
+ apiVersion: v1
+ kind: ResourceQuota
+ metadata:
+   name: default-quota
+   namespace: app-production
+ spec:
+   hard:
+     requests.cpu: "20"
+     requests.memory: 40Gi
+     limits.cpu: "40"
+     limits.memory: 80Gi
+     pods: "100"
+     services: "20"
+     persistentvolumeclaims: "30"
+ ```
42
+
43
+ ---
44
+
45
+ ## Resource Management
46
+
47
+ ### Requests and Limits
48
+
49
+ ```yaml
+ resources:
+   requests:
+     cpu: 250m
+     memory: 256Mi
+   limits:
+     cpu: 1000m
+     memory: 512Mi
+ ```
58
+
59
+ ### Sizing Guidelines
60
+
61
+ | Workload Type | CPU Request | CPU Limit | Memory Request | Memory Limit |
62
+ |---|---|---|---|---|
63
+ | API server | 250m | 1000m | 256Mi | 512Mi |
64
+ | Worker/consumer | 500m | 2000m | 512Mi | 1Gi |
65
+ | Batch job | 1000m | 4000m | 1Gi | 4Gi |
66
+ | Sidecar (proxy) | 50m | 200m | 64Mi | 128Mi |
67
+
68
+ ### Rules
69
+
70
+ - **Always set requests.** The scheduler uses requests for placement.
71
+ - **Always set limits for memory.** OOM-killed pods are better than node
72
+ exhaustion.
73
+ - **CPU limits are debated.** Set them to prevent runaway, but be aware of
74
+ throttling. A 4:1 limit-to-request ratio is a reasonable starting point.
75
+ - Use VPA (Vertical Pod Autoscaler) recommendations to right-size after
76
+ running for 7+ days.
77
+ - Start generous, then tighten based on metrics.
78
+
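+ One way to collect the VPA recommendations mentioned above without letting VPA evict or resize pods is recommendation-only mode. This is a sketch that assumes the VPA components are installed in the cluster and that the target is the `user-api` Deployment used elsewhere in this reference; the object name is hypothetical.
+
+ ```yaml
+ apiVersion: autoscaling.k8s.io/v1
+ kind: VerticalPodAutoscaler
+ metadata:
+   name: user-api-vpa          # hypothetical name
+   namespace: app-production
+ spec:
+   targetRef:
+     apiVersion: apps/v1
+     kind: Deployment
+     name: user-api
+   updatePolicy:
+     updateMode: "Off"         # compute recommendations only, never apply them
+ ```
+
+ The recommendations are written to the object's status, where they can be compared with the current requests before tightening them by hand.
+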
79
+ ### LimitRange (Namespace Defaults)
80
+
81
+ ```yaml
+ apiVersion: v1
+ kind: LimitRange
+ metadata:
+   name: default-limits
+ spec:
+   limits:
+     - default:
+         cpu: 500m
+         memory: 256Mi
+       defaultRequest:
+         cpu: 100m
+         memory: 128Mi
+       type: Container
+ ```
96
+
97
+ ---
98
+
99
+ ## Probes
100
+
101
+ ### Liveness Probe
102
+
103
+ Answers: "Is the process hung?" — restarts the container if it fails.
104
+
105
+ ```yaml
+ livenessProbe:
+   httpGet:
+     path: /healthz
+     port: 8080
+   initialDelaySeconds: 15
+   periodSeconds: 20
+   timeoutSeconds: 5
+   failureThreshold: 3
+ ```
115
+
116
+ ### Readiness Probe
117
+
118
+ Answers: "Can the pod serve traffic?" — removes from Service endpoints if it fails.
119
+
120
+ ```yaml
+ readinessProbe:
+   httpGet:
+     path: /readyz
+     port: 8080
+   initialDelaySeconds: 5
+   periodSeconds: 10
+   timeoutSeconds: 3
+   failureThreshold: 3
+ ```
130
+
131
+ ### Startup Probe
132
+
133
+ Answers: "Has the app finished starting?" — disables liveness/readiness until it succeeds.
134
+
135
+ ```yaml
+ startupProbe:
+   httpGet:
+     path: /healthz
+     port: 8080
+   periodSeconds: 5
+   failureThreshold: 30 # 30 x 5s = 150s max startup time
+ ```
143
+
144
+ ### Probe Rules
145
+
146
+ - **Every container** must have readiness and liveness probes.
147
+ - Use startup probe for slow-starting apps (JVM, ML model loading).
148
+ - Liveness should check internal health only (not downstream deps).
149
+ - Readiness should check ability to serve (including critical deps like DB).
150
+ - Never make liveness depend on external services: a dependency outage would then trigger cascading restarts.
151
+ - Separate endpoints: `/healthz` (liveness), `/readyz` (readiness).
152
+
153
+ ---
154
+
155
+ ## Pod Disruption Budgets
156
+
157
+ ```yaml
+ apiVersion: policy/v1
+ kind: PodDisruptionBudget
+ metadata:
+   name: user-api-pdb
+ spec:
+   minAvailable: 1 # OR maxUnavailable: 1
+   selector:
+     matchLabels:
+       app: user-api
+ ```
168
+
169
+ ### Rules
170
+
171
+ - Every production Deployment with 2+ replicas must have a PDB.
172
+ - Use `minAvailable` for critical services.
173
+ - Use `maxUnavailable: 1` for batch workers.
174
+ - PDB prevents node drains from killing all pods simultaneously.
175
+
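+ For the batch-worker rule above, a `maxUnavailable` budget might look like the following sketch. The `email-worker` name and label are hypothetical and stand in for any worker Deployment.
+
+ ```yaml
+ apiVersion: policy/v1
+ kind: PodDisruptionBudget
+ metadata:
+   name: email-worker-pdb      # hypothetical worker
+ spec:
+   maxUnavailable: 1           # a drain may evict at most one worker pod at a time
+   selector:
+     matchLabels:
+       app: email-worker
+ ```
+
+ `minAvailable` and `maxUnavailable` are mutually exclusive within a single PDB, so pick one per budget.
+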
176
+ ---
177
+
178
+ ## Security Context
179
+
180
+ ### Pod Level
181
+
182
+ ```yaml
+ securityContext:
+   runAsNonRoot: true
+   runAsUser: 65534 # nobody
+   runAsGroup: 65534
+   fsGroup: 65534
+   seccompProfile:
+     type: RuntimeDefault
+ ```
191
+
192
+ ### Container Level
193
+
194
+ ```yaml
+ securityContext:
+   allowPrivilegeEscalation: false
+   readOnlyRootFilesystem: true
+   capabilities:
+     drop: ["ALL"]
+ ```
201
+
202
+ ### Rules
203
+
204
+ - `runAsNonRoot: true` — always. No exceptions in production.
205
+ - `readOnlyRootFilesystem: true` — mount `emptyDir` for temp writes.
206
+ - `allowPrivilegeEscalation: false` — always.
207
+ - Drop ALL capabilities, add back only what is strictly needed.
208
+ - Use `seccompProfile: RuntimeDefault` at minimum.
209
+
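+ The "drop ALL, add back only what is strictly needed" rule could look like this for a container that has to bind a port below 1024. Whether a given workload actually needs `NET_BIND_SERVICE` is an assumption made for illustration; most services listening on 8080 need no added capabilities at all.
+
+ ```yaml
+ securityContext:
+   allowPrivilegeEscalation: false
+   readOnlyRootFilesystem: true
+   capabilities:
+     drop: ["ALL"]
+     add: ["NET_BIND_SERVICE"]   # assumed need: binding to a port < 1024
+ ```
+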
210
+ ---
211
+
212
+ ## Network Policies
213
+
214
+ ### Default Deny All
215
+
216
+ ```yaml
+ apiVersion: networking.k8s.io/v1
+ kind: NetworkPolicy
+ metadata:
+   name: default-deny-all
+   namespace: app-production
+ spec:
+   podSelector: {}
+   policyTypes:
+     - Ingress
+     - Egress
+ ```
228
+
229
+ ### Allow Specific Traffic
230
+
231
+ ```yaml
+ apiVersion: networking.k8s.io/v1
+ kind: NetworkPolicy
+ metadata:
+   name: allow-user-api
+ spec:
+   podSelector:
+     matchLabels:
+       app: user-api
+   policyTypes:
+     - Ingress
+     - Egress
+   ingress:
+     - from:
+         - podSelector:
+             matchLabels:
+               app: api-gateway
+       ports:
+         - port: 8080
+   egress:
+     - to:
+         - podSelector:
+             matchLabels:
+               app: postgres
+       ports:
+         - port: 5432
+     - to: # Allow DNS
+         - namespaceSelector: {}
+       ports:
+         - port: 53
+           protocol: UDP
+         - port: 53
+           protocol: TCP
+ ```
265
+
266
+ ### Rules
267
+
268
+ - Start with default-deny in every namespace.
269
+ - Explicitly allow only required traffic paths.
270
+ - Always allow DNS egress (port 53) or pods cannot resolve services.
271
+ - Document network flows in architecture diagrams.
272
+
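+ The DNS rule above can be narrowed so that port 53 is only reachable on the cluster DNS pods rather than in every namespace. This is a sketch: the policy name is hypothetical, and the `k8s-app: kube-dns` pod label is an assumption that holds for many CoreDNS installations but should be verified against the cluster (setups using a node-local DNS cache need a different rule).
+
+ ```yaml
+ apiVersion: networking.k8s.io/v1
+ kind: NetworkPolicy
+ metadata:
+   name: allow-dns-egress      # hypothetical name
+   namespace: app-production
+ spec:
+   podSelector: {}             # applies to every pod in the namespace
+   policyTypes:
+     - Egress
+   egress:
+     - to:
+         # namespaceSelector and podSelector in ONE list item are ANDed together
+         - namespaceSelector:
+             matchLabels:
+               kubernetes.io/metadata.name: kube-system
+           podSelector:
+             matchLabels:
+               k8s-app: kube-dns   # assumed DNS pod label
+       ports:
+         - port: 53
+           protocol: UDP
+         - port: 53
+           protocol: TCP
+ ```
+
+ Note that a bare `podSelector` in a `from` or `to` entry, as in the `allow-user-api` policy, only matches pods in the policy's own namespace.
+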
273
+ ---
274
+
275
+ ## Secrets Management
276
+
277
+ ### External Secrets Operator (Preferred)
278
+
279
+ ```yaml
+ apiVersion: external-secrets.io/v1beta1
+ kind: ExternalSecret
+ metadata:
+   name: user-api-secrets
+ spec:
+   refreshInterval: 1h
+   secretStoreRef:
+     name: vault-backend
+     kind: ClusterSecretStore
+   target:
+     name: user-api-secrets
+   data:
+     - secretKey: database-url
+       remoteRef:
+         key: secret/data/user-api
+         property: database_url
+ ```
297
+
298
+ ### Rules
299
+
300
+ - Never store secrets in plain K8s Secrets manifests in git.
301
+ - Use ExternalSecrets Operator, Sealed Secrets, or CSI Secrets Store.
302
+ - Rotate secrets on a schedule (90 days max for credentials).
303
+ - Audit secret access via cloud provider audit logs.
304
+ - Mount secrets as files, not env vars (env vars are inherited by child
305
+ processes and tend to leak into crash dumps and debug output).
306
+
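+ Mounting the synced Secret as files, as the last rule recommends, might look like this pod spec fragment. The mount path `/etc/secrets` is an assumption; the Secret name matches the `ExternalSecret` target above.
+
+ ```yaml
+ containers:
+   - name: user-api
+     volumeMounts:
+       - name: secrets
+         mountPath: /etc/secrets   # assumed path
+         readOnly: true
+ volumes:
+   - name: secrets
+     secret:
+       secretName: user-api-secrets
+ ```
+
+ Each key in the Secret becomes a file, so the application would read the connection string from `/etc/secrets/database-url`.
+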
307
+ ---
308
+
309
+ ## Image Pull Policy
310
+
311
+ | Tag Type | Policy | Rationale |
312
+ |---|---|---|
313
+ | Git SHA (`abc1234`) | `IfNotPresent` | Immutable, no need to re-pull |
314
+ | Semver (`1.2.3`) | `IfNotPresent` | Immutable (if you follow semver) |
315
+ | `:latest` | `Always` | Mutable — but don't use `:latest` |
316
+ | Branch (`main`) | `Always` | Mutable — only for dev |
317
+
318
+ ---
319
+
320
+ ## Deployment Template
321
+
322
+ ```yaml
+ apiVersion: apps/v1
+ kind: Deployment
+ metadata:
+   name: user-api
+   labels:
+     app: user-api
+     version: v1
+ spec:
+   replicas: 3
+   strategy:
+     type: RollingUpdate
+     rollingUpdate:
+       maxSurge: 1
+       maxUnavailable: 0
+   selector:
+     matchLabels:
+       app: user-api
+   template:
+     metadata:
+       labels:
+         app: user-api
+       annotations:
+         prometheus.io/scrape: "true"
+         prometheus.io/port: "8080"
+         prometheus.io/path: "/metrics"
+     spec:
+       serviceAccountName: user-api
+       securityContext:
+         runAsNonRoot: true
+         seccompProfile:
+           type: RuntimeDefault
+       containers:
+         - name: user-api
+           image: registry.company.com/user-api:abc1234
+           imagePullPolicy: IfNotPresent
+           ports:
+             - containerPort: 8080
+           resources:
+             requests:
+               cpu: 250m
+               memory: 256Mi
+             limits:
+               cpu: 1000m
+               memory: 512Mi
+           securityContext:
+             allowPrivilegeEscalation: false
+             readOnlyRootFilesystem: true
+             capabilities:
+               drop: ["ALL"]
+           livenessProbe:
+             httpGet:
+               path: /healthz
+               port: 8080
+             initialDelaySeconds: 15
+             periodSeconds: 20
+           readinessProbe:
+             httpGet:
+               path: /readyz
+               port: 8080
+             initialDelaySeconds: 5
+             periodSeconds: 10
+           volumeMounts:
+             - name: tmp
+               mountPath: /tmp
+       volumes:
+         - name: tmp
+           emptyDir: {}
+       topologySpreadConstraints:
+         - maxSkew: 1
+           topologyKey: topology.kubernetes.io/zone
+           whenUnsatisfiable: DoNotSchedule
+           labelSelector:
+             matchLabels:
+               app: user-api
+ ```
@@ -0,0 +1,224 @@
1
+ # Monitoring and Observability Reference
2
+
3
+ ## Three Pillars
4
+
5
+ ### 1. Metrics (Prometheus / Grafana)
6
+
7
+ Numeric measurements aggregated over time. Best for dashboards and alerting.
8
+
9
+ - **Counter**: Monotonically increasing (requests_total, errors_total).
10
+ - **Gauge**: Point-in-time value (active_connections, queue_depth).
11
+ - **Histogram**: Distribution of values (request_duration_seconds).
12
+
13
+ ### 2. Logs (Loki / ELK / CloudWatch)
14
+
15
+ Timestamped text records of discrete events. Best for debugging specific
16
+ incidents.
17
+
18
+ - Structured JSON only — no unstructured text logs.
19
+ - Include: timestamp, level, service, trace_id, message, context fields.
20
+ - Exclude: PII, secrets, full request/response bodies.
21
+
22
+ ### 3. Traces (Jaeger / Tempo / X-Ray)
23
+
24
+ End-to-end request flow across services. Best for understanding latency
25
+ and dependencies.
26
+
27
+ - Instrument all service boundaries (HTTP, gRPC, message queues).
28
+ - Propagate trace context headers (`traceparent` / W3C Trace Context).
29
+ - Sample at 1-10% in production (100% in dev/staging).
30
+
31
+ ---
32
+
33
+ ## Golden Signals Dashboard
34
+
35
+ Every service must have a dashboard showing the four golden signals:
36
+
37
+ | Signal | Metric | Example PromQL |
38
+ |---|---|---|
39
+ | **Latency** | Request duration (p50, p90, p99) | `histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))` |
40
+ | **Traffic** | Requests per second | `rate(http_requests_total[5m])` |
41
+ | **Errors** | Error rate (%) | `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100` |
42
+ | **Saturation** | CPU/memory/queue utilization | `container_memory_working_set_bytes / container_spec_memory_limit_bytes * 100` |
43
+
44
+ ### Dashboard Requirements
45
+
46
+ - Every service has a dedicated Grafana dashboard.
47
+ - Dashboard definition stored in Git (JSON model or Grafonnet).
48
+ - Include: golden signals, resource usage, key business metrics.
49
+ - Time range selectors: last 1h, 6h, 24h, 7d.
50
+ - Variable selectors: environment, namespace, pod.
51
+
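+ Keeping dashboard definitions in Git pairs naturally with Grafana's file-based dashboard provisioning. The sketch below assumes the dashboard JSON files from the repository are mounted into the Grafana container at `/var/lib/grafana/dashboards`; the provider name and folder are placeholders.
+
+ ```yaml
+ apiVersion: 1
+ providers:
+   - name: services             # placeholder provider name
+     folder: Services           # placeholder Grafana folder
+     type: file
+     disableDeletion: true
+     allowUiUpdates: false      # changes go through Git, not the UI
+     options:
+       path: /var/lib/grafana/dashboards   # assumed mount path
+ ```
+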
52
+ ---
53
+
54
+ ## Alert Severity Levels
55
+
56
+ | Level | Criteria | Response Time | Notification | Example |
57
+ |---|---|---|---|---|
58
+ | **P1 — Critical** | Customer-facing outage, data loss risk | 5 min | Page on-call, auto-create incident | API error rate > 10%, database down |
59
+ | **P2 — High** | Degraded performance, partial outage | 15 min | Page on-call | p99 latency > 5s, disk > 90% |
60
+ | **P3 — Warning** | Approaching threshold, non-critical issue | 1 hour | Slack channel | Disk > 80%, certificate expiry < 14d |
61
+ | **P4 — Info** | Anomaly, needs investigation during business hours | Next business day | Slack channel | Unusual traffic spike, dependency slow |
62
+
63
+ ### Alert Rules
64
+
65
+ ```yaml
+ # Prometheus alert example
+ groups:
+   - name: user-api
+     rules:
+       - alert: HighErrorRate
+         # sum() on both sides drops the status label; without it the division
+         # only matches 5xx series against themselves and always yields 1.
+         expr: |
+           sum(rate(http_requests_total{service="user-api", status=~"5.."}[5m]))
+             / sum(rate(http_requests_total{service="user-api"}[5m])) > 0.05
+         for: 5m
+         labels:
+           severity: P1
+           team: platform
+         annotations:
+           summary: "user-api error rate > 5%"
+           runbook: "https://wiki.company.com/runbooks/user-api-errors"
+           dashboard: "https://grafana.company.com/d/user-api"
+ ```
83
+
84
+ ### Alerting Principles
85
+
86
+ - **Alert on symptoms, not causes.** Alert on "error rate high," not "pod restarted."
87
+ - **Alert on SLO burn rate.** If you are burning through your error budget
88
+ faster than expected, page.
89
+ - **Every alert must have a runbook.** No runbook = not a valid alert.
90
+ - **Every alert must be actionable.** If the response is "wait and see,"
91
+ it should be a dashboard, not an alert.
92
+ - **Deduplicate.** One page per incident, not one per pod.
93
+ - **Test alerts.** Fire test alerts monthly to verify routing.
94
+
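+ The burn-rate principle above can be sketched as a multi-window alert. This example assumes a 99.9% availability SLO (error budget ratio 0.001) and uses a burn-rate factor of 14.4, which corresponds to spending 2% of a 30-day budget in one hour (14.4 x 1h / 720h). The alert name, thresholds, and windows are illustrative choices, not values defined by this reference.
+
+ ```yaml
+ groups:
+   - name: user-api-slo
+     rules:
+       - alert: ErrorBudgetFastBurn   # hypothetical alert name
+         # Both the long (1h) and short (5m) windows must exceed the threshold,
+         # so the alert fires quickly and also resolves quickly after recovery.
+         expr: |
+           (
+             sum(rate(http_requests_total{service="user-api", status=~"5.."}[1h]))
+               / sum(rate(http_requests_total{service="user-api"}[1h]))
+           ) > (14.4 * 0.001)
+           and
+           (
+             sum(rate(http_requests_total{service="user-api", status=~"5.."}[5m]))
+               / sum(rate(http_requests_total{service="user-api"}[5m]))
+           ) > (14.4 * 0.001)
+         labels:
+           severity: P1
+           team: platform
+         annotations:
+           summary: "user-api is burning its 30d error budget at more than 14.4x"
+           runbook: "https://wiki.company.com/runbooks/user-api-errors"
+ ```
+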
95
+ ---
96
+
97
+ ## On-Call Runbook Requirements
98
+
99
+ Every runbook must contain:
100
+
101
+ ```markdown
102
+ # Runbook: <Alert Name>
103
+
104
+ ## Alert Description
105
+ What this alert means and why it fires.
106
+
107
+ ## Impact
108
+ What is the customer-facing impact?
109
+
110
+ ## Quick Mitigation
111
+ Step-by-step actions to restore service (within 5 minutes):
112
+ 1. ...
113
+ 2. ...
114
+ 3. ...
115
+
116
+ ## Diagnosis
117
+ How to determine root cause:
118
+ - Dashboard link
119
+ - Log query
120
+ - Trace search
121
+
122
+ ## Resolution
123
+ Longer-term fix steps.
124
+
125
+ ## Escalation
126
+ Who to contact if mitigation fails.
127
+
128
+ ## History
129
+ | Date | Cause | Resolution | Duration |
130
+ |------|-------|------------|----------|
131
+ ```
132
+
133
+ ### Runbook Rules
134
+
135
+ - Stored in Git alongside alert definitions.
136
+ - Reviewed and updated after every incident.
137
+ - Must be executable by anyone on the on-call rotation.
138
+ - Include exact commands, not vague instructions.
139
+ - Link to relevant dashboards and log queries.
140
+
141
+ ---
142
+
143
+ ## Structured Logging Standards
144
+
145
+ ### Log Format (JSON)
146
+
147
+ ```json
+ {
+   "timestamp": "2025-01-15T10:30:00.123Z",
+   "level": "error",
+   "service": "user-api",
+   "version": "1.2.3",
+   "trace_id": "abc123def456",
+   "span_id": "789ghi",
+   "message": "Failed to fetch user profile",
+   "error": "connection timeout",
+   "user_id": "usr_REDACTED",
+   "duration_ms": 5000,
+   "method": "GET",
+   "path": "/api/v1/users/123",
+   "status_code": 504
+ }
+ ```
164
+
165
+ ### Log Levels
166
+
167
+ | Level | When to Use | Example |
168
+ |---|---|---|
169
+ | `error` | Operation failed, needs attention | Database query failed, external API error |
170
+ | `warn` | Recoverable issue, may need attention | Retry succeeded, cache miss, slow query |
171
+ | `info` | Significant business events | Request served, user created, job completed |
172
+ | `debug` | Diagnostic detail (disabled in prod) | Query parameters, intermediate state |
173
+
174
+ ### Rules
175
+
176
+ - JSON format only — no printf-style logs.
177
+ - Include `trace_id` in every log line for correlation.
178
+ - Never log: passwords, tokens, PII, credit card numbers, full request bodies.
179
+ - Redact or hash user identifiers in logs.
180
+ - Log at request boundaries: start, end, error.
181
+ - Use log sampling for high-volume debug logs.
182
+ - Set retention: 7 days hot, 30 days warm, 90 days cold (adjust per compliance).
183
+
184
+ ---
185
+
186
+ ## SLO Framework
187
+
188
+ ### Defining SLOs
189
+
190
+ | Service | SLI | SLO | Error Budget (30d) |
191
+ |---|---|---|---|
192
+ | User API | Successful requests / total requests | 99.9% | 43.2 min downtime |
193
+ | Payment API | Successful requests / total requests | 99.99% | 4.3 min downtime |
194
+ | Background Jobs | Jobs completed / jobs submitted | 99.5% | 3.6 hours delay |
195
+ | Dashboard | Page load < 3s | 95% | 36 hours of slow loads |
196
+
197
+ ### Error Budget Policy
198
+
199
+ - **Budget remaining > 50%**: Ship freely, experiment.
200
+ - **Budget remaining 20-50%**: Caution, increase review rigor.
201
+ - **Budget remaining < 20%**: Freeze features, focus on reliability.
202
+ - **Budget exhausted**: All engineering effort on reliability until budget recovers.
203
+
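+ To apply the policy thresholds above, the remaining budget has to be measurable. One possible sketch is a Prometheus recording rule for the User API, assuming its 99.9% SLO and the `http_requests_total` metric used in the earlier examples; the recorded series name is hypothetical.
+
+ ```yaml
+ groups:
+   - name: user-api-error-budget
+     rules:
+       - record: slo:error_budget_remaining:ratio   # hypothetical series name
+         # 1 = budget untouched, 0 = budget exhausted, negative = SLO violated.
+         expr: |
+           1 - (
+             sum(increase(http_requests_total{service="user-api", status=~"5.."}[30d]))
+               / sum(increase(http_requests_total{service="user-api"}[30d]))
+           ) / (1 - 0.999)
+         labels:
+           service: user-api
+ ```
+
+ A value of 0.5 or 0.2 then maps directly onto the 50% and 20% tiers of the policy. A 30-day range query needs at least 30 days of retention and can be expensive to evaluate.
+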
204
+ ---
205
+
206
+ ## Metrics Naming Convention
207
+
208
+ ```
209
+ <namespace>_<subsystem>_<name>_<unit>
210
+ ```
211
+
212
+ Examples:
213
+ - `http_server_request_duration_seconds`
214
+ - `http_server_requests_total`
215
+ - `db_pool_connections_active`
216
+ - `queue_messages_pending`
217
+
218
+ ### Rules
219
+
220
+ - Use `_total` suffix for counters.
221
+ - Use `_seconds`, `_bytes`, `_ratio` for units.
222
+ - Use snake_case.
223
+ - Prefix with service/subsystem namespace.
224
+ - Keep cardinality low — avoid high-cardinality labels (user_id, request_id).
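+
+ Where a high-cardinality label cannot be removed at the source right away, it can be dropped at scrape time as a stopgap. This is a sketch of a Prometheus scrape config; the job name and target are placeholders.
+
+ ```yaml
+ scrape_configs:
+   - job_name: user-api                  # placeholder job
+     static_configs:
+       - targets: ["user-api:8080"]      # placeholder target
+     metric_relabel_configs:
+       - action: labeldrop
+         regex: user_id|request_id       # label names to remove before ingestion
+ ```
+
+ Dropping a label can make previously distinct series collide, so fixing the instrumentation remains the better long-term option.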