agentic-team-templates 0.3.0
- package/README.md +280 -0
- package/bin/cli.js +5 -0
- package/package.json +47 -0
- package/src/index.js +521 -0
- package/templates/_shared/code-quality.md +162 -0
- package/templates/_shared/communication.md +114 -0
- package/templates/_shared/core-principles.md +62 -0
- package/templates/_shared/git-workflow.md +165 -0
- package/templates/_shared/security-fundamentals.md +173 -0
- package/templates/blockchain/.cursorrules/defi-patterns.md +520 -0
- package/templates/blockchain/.cursorrules/gas-optimization.md +339 -0
- package/templates/blockchain/.cursorrules/overview.md +130 -0
- package/templates/blockchain/.cursorrules/security.md +318 -0
- package/templates/blockchain/.cursorrules/smart-contracts.md +364 -0
- package/templates/blockchain/.cursorrules/testing.md +415 -0
- package/templates/blockchain/.cursorrules/web3-integration.md +538 -0
- package/templates/blockchain/CLAUDE.md +389 -0
- package/templates/cli-tools/.cursorrules/architecture.md +412 -0
- package/templates/cli-tools/.cursorrules/arguments.md +406 -0
- package/templates/cli-tools/.cursorrules/distribution.md +546 -0
- package/templates/cli-tools/.cursorrules/error-handling.md +455 -0
- package/templates/cli-tools/.cursorrules/overview.md +136 -0
- package/templates/cli-tools/.cursorrules/testing.md +537 -0
- package/templates/cli-tools/.cursorrules/user-experience.md +545 -0
- package/templates/cli-tools/CLAUDE.md +356 -0
- package/templates/data-engineering/.cursorrules/data-modeling.md +367 -0
- package/templates/data-engineering/.cursorrules/data-quality.md +455 -0
- package/templates/data-engineering/.cursorrules/overview.md +85 -0
- package/templates/data-engineering/.cursorrules/performance.md +339 -0
- package/templates/data-engineering/.cursorrules/pipeline-design.md +280 -0
- package/templates/data-engineering/.cursorrules/security.md +460 -0
- package/templates/data-engineering/.cursorrules/testing.md +452 -0
- package/templates/data-engineering/CLAUDE.md +974 -0
- package/templates/devops-sre/.cursorrules/capacity-planning.md +653 -0
- package/templates/devops-sre/.cursorrules/change-management.md +584 -0
- package/templates/devops-sre/.cursorrules/chaos-engineering.md +651 -0
- package/templates/devops-sre/.cursorrules/disaster-recovery.md +641 -0
- package/templates/devops-sre/.cursorrules/incident-management.md +565 -0
- package/templates/devops-sre/.cursorrules/observability.md +714 -0
- package/templates/devops-sre/.cursorrules/overview.md +230 -0
- package/templates/devops-sre/.cursorrules/postmortems.md +588 -0
- package/templates/devops-sre/.cursorrules/runbooks.md +760 -0
- package/templates/devops-sre/.cursorrules/slo-sli.md +617 -0
- package/templates/devops-sre/.cursorrules/toil-reduction.md +567 -0
- package/templates/devops-sre/CLAUDE.md +1007 -0
- package/templates/documentation/.cursorrules/adr.md +277 -0
- package/templates/documentation/.cursorrules/api-documentation.md +411 -0
- package/templates/documentation/.cursorrules/code-comments.md +253 -0
- package/templates/documentation/.cursorrules/maintenance.md +260 -0
- package/templates/documentation/.cursorrules/overview.md +82 -0
- package/templates/documentation/.cursorrules/readme-standards.md +306 -0
- package/templates/documentation/CLAUDE.md +120 -0
- package/templates/fullstack/.cursorrules/api-contracts.md +331 -0
- package/templates/fullstack/.cursorrules/architecture.md +298 -0
- package/templates/fullstack/.cursorrules/overview.md +109 -0
- package/templates/fullstack/.cursorrules/shared-types.md +348 -0
- package/templates/fullstack/.cursorrules/testing.md +386 -0
- package/templates/fullstack/CLAUDE.md +349 -0
- package/templates/ml-ai/.cursorrules/data-engineering.md +483 -0
- package/templates/ml-ai/.cursorrules/deployment.md +601 -0
- package/templates/ml-ai/.cursorrules/model-development.md +538 -0
- package/templates/ml-ai/.cursorrules/monitoring.md +658 -0
- package/templates/ml-ai/.cursorrules/overview.md +131 -0
- package/templates/ml-ai/.cursorrules/security.md +637 -0
- package/templates/ml-ai/.cursorrules/testing.md +678 -0
- package/templates/ml-ai/CLAUDE.md +1136 -0
- package/templates/mobile/.cursorrules/navigation.md +246 -0
- package/templates/mobile/.cursorrules/offline-first.md +302 -0
- package/templates/mobile/.cursorrules/overview.md +71 -0
- package/templates/mobile/.cursorrules/performance.md +345 -0
- package/templates/mobile/.cursorrules/testing.md +339 -0
- package/templates/mobile/CLAUDE.md +233 -0
- package/templates/platform-engineering/.cursorrules/ci-cd.md +778 -0
- package/templates/platform-engineering/.cursorrules/developer-experience.md +632 -0
- package/templates/platform-engineering/.cursorrules/infrastructure-as-code.md +600 -0
- package/templates/platform-engineering/.cursorrules/kubernetes.md +710 -0
- package/templates/platform-engineering/.cursorrules/observability.md +747 -0
- package/templates/platform-engineering/.cursorrules/overview.md +215 -0
- package/templates/platform-engineering/.cursorrules/security.md +855 -0
- package/templates/platform-engineering/.cursorrules/testing.md +878 -0
- package/templates/platform-engineering/CLAUDE.md +850 -0
- package/templates/utility-agent/.cursorrules/action-control.md +284 -0
- package/templates/utility-agent/.cursorrules/context-management.md +186 -0
- package/templates/utility-agent/.cursorrules/hallucination-prevention.md +253 -0
- package/templates/utility-agent/.cursorrules/overview.md +78 -0
- package/templates/utility-agent/.cursorrules/token-optimization.md +369 -0
- package/templates/utility-agent/CLAUDE.md +513 -0
- package/templates/web-backend/.cursorrules/api-design.md +255 -0
- package/templates/web-backend/.cursorrules/authentication.md +309 -0
- package/templates/web-backend/.cursorrules/database-patterns.md +298 -0
- package/templates/web-backend/.cursorrules/error-handling.md +366 -0
- package/templates/web-backend/.cursorrules/overview.md +69 -0
- package/templates/web-backend/.cursorrules/security.md +358 -0
- package/templates/web-backend/.cursorrules/testing.md +395 -0
- package/templates/web-backend/CLAUDE.md +366 -0
- package/templates/web-frontend/.cursorrules/accessibility.md +296 -0
- package/templates/web-frontend/.cursorrules/component-patterns.md +204 -0
- package/templates/web-frontend/.cursorrules/overview.md +72 -0
- package/templates/web-frontend/.cursorrules/performance.md +325 -0
- package/templates/web-frontend/.cursorrules/state-management.md +227 -0
- package/templates/web-frontend/.cursorrules/styling.md +271 -0
- package/templates/web-frontend/.cursorrules/testing.md +311 -0
- package/templates/web-frontend/CLAUDE.md +399 -0
|
@@ -0,0 +1,850 @@
|
|
|
1
|
+
# Platform Engineering Development Guide
|
|
2
|
+
|
|
3
|
+
Staff-level guidelines for building and operating internal developer platforms, infrastructure automation, and reliability engineering.
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## Overview
|
|
8
|
+
|
|
9
|
+
This guide applies to:
|
|
10
|
+
|
|
11
|
+
- Infrastructure as Code (Terraform, Pulumi, CDK)
|
|
12
|
+
- Kubernetes and container orchestration
|
|
13
|
+
- CI/CD pipelines and GitOps
|
|
14
|
+
- Internal Developer Platforms (IDPs)
|
|
15
|
+
- Observability systems (metrics, logs, traces)
|
|
16
|
+
- Service mesh and networking
|
|
17
|
+
- Security and compliance automation
|
|
18
|
+
|
|
19
|
+
### Key Principles
|
|
20
|
+
|
|
21
|
+
1. **Platform as Product** - Your internal customers are developers; treat the platform like a product
|
|
22
|
+
2. **Self-Service First** - Enable teams to move fast without the platform team becoming a bottleneck
|
|
23
|
+
3. **Reliability Engineering** - Define SLOs, measure SLIs, maintain error budgets
|
|
24
|
+
4. **Security by Default** - Bake security into the golden path rather than bolting it on afterward
|
|
25
|
+
5. **Cost Consciousness** - FinOps is everyone's responsibility
|
|
26
|
+
|
|
27
|
+
### Technology Stack
|
|
28
|
+
|
|
29
|
+
| Layer | Technology |
|
|
30
|
+
|-------|------------|
|
|
31
|
+
| IaC | Terraform, Pulumi, AWS CDK |
|
|
32
|
+
| Container Orchestration | Kubernetes, EKS/GKE/AKS |
|
|
33
|
+
| GitOps | Argo CD, Flux |
|
|
34
|
+
| CI/CD | GitHub Actions, GitLab CI, Tekton |
|
|
35
|
+
| Observability | Prometheus, Grafana, Loki, Tempo, Jaeger |
|
|
36
|
+
| Service Mesh | Istio, Linkerd, Cilium |
|
|
37
|
+
| Policy | OPA/Gatekeeper, Kyverno |
|
|
38
|
+
| Secrets | Vault, External Secrets, SOPS |
|
|
39
|
+
|
|
40
|
+
---
|
|
41
|
+
|
|
42
|
+
## Infrastructure as Code
|
|
43
|
+
|
|
44
|
+
### Module Structure
|
|
45
|
+
|
|
46
|
+
```
|
|
47
|
+
modules/
|
|
48
|
+
├── networking/
|
|
49
|
+
│ ├── vpc/
|
|
50
|
+
│ │ ├── main.tf
|
|
51
|
+
│ │ ├── variables.tf
|
|
52
|
+
│ │ ├── outputs.tf
|
|
53
|
+
│ │ └── README.md
|
|
54
|
+
│ └── dns/
|
|
55
|
+
├── compute/
|
|
56
|
+
│ ├── eks-cluster/
|
|
57
|
+
│ └── node-groups/
|
|
58
|
+
├── data/
|
|
59
|
+
│ ├── rds/
|
|
60
|
+
│ └── redis/
|
|
61
|
+
└── security/
|
|
62
|
+
├── iam-roles/
|
|
63
|
+
└── kms/
|
|
64
|
+
|
|
65
|
+
environments/
|
|
66
|
+
├── dev/
|
|
67
|
+
│ ├── main.tf
|
|
68
|
+
│ ├── terraform.tfvars
|
|
69
|
+
│ └── backend.tf
|
|
70
|
+
├── staging/
|
|
71
|
+
└── production/
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
### Terraform Best Practices
|
|
75
|
+
|
|
76
|
+
```hcl
|
|
77
|
+
# Always constrain Terraform and provider versions to a known-compatible range
|
|
78
|
+
terraform {
|
|
79
|
+
required_version = ">= 1.5.0"
|
|
80
|
+
|
|
81
|
+
required_providers {
|
|
82
|
+
aws = {
|
|
83
|
+
source = "hashicorp/aws"
|
|
84
|
+
version = "~> 5.0"
|
|
85
|
+
}
|
|
86
|
+
}
|
|
87
|
+
|
|
88
|
+
backend "s3" {
|
|
89
|
+
bucket = "company-terraform-state"
|
|
90
|
+
key = "env/production/terraform.tfstate"
|
|
91
|
+
region = "us-east-1"
|
|
92
|
+
encrypt = true
|
|
93
|
+
dynamodb_table = "terraform-locks"
|
|
94
|
+
}
|
|
95
|
+
}
|
|
96
|
+
|
|
97
|
+
# Use locals for computed values
|
|
98
|
+
locals {
|
|
99
|
+
common_tags = {
|
|
100
|
+
Environment = var.environment
|
|
101
|
+
ManagedBy = "terraform"
|
|
102
|
+
Team = var.team
|
|
103
|
+
CostCenter = var.cost_center
|
|
104
|
+
}
|
|
105
|
+
}
|
|
106
|
+
|
|
107
|
+
# Meaningful resource naming
|
|
108
|
+
resource "aws_eks_cluster" "main" {
|
|
109
|
+
name = "${var.project}-${var.environment}"
|
|
110
|
+
role_arn = aws_iam_role.cluster.arn
|
|
111
|
+
version = var.kubernetes_version
|
|
112
|
+
|
|
113
|
+
vpc_config {
|
|
114
|
+
subnet_ids = var.private_subnet_ids
|
|
115
|
+
endpoint_private_access = true
|
|
116
|
+
endpoint_public_access = var.environment != "production"
|
|
117
|
+
}
|
|
118
|
+
|
|
119
|
+
tags = local.common_tags
|
|
120
|
+
}
|
|
121
|
+
```
|
|
122
|
+
|
|
123
|
+
### State Management
|
|
124
|
+
|
|
125
|
+
- **Remote State**: Always use remote backends (S3, GCS, Terraform Cloud)
|
|
126
|
+
- **State Locking**: Enable DynamoDB/GCS locking to prevent concurrent modifications
|
|
127
|
+
- **State Isolation**: Separate state files per environment
|
|
128
|
+
- **Sensitive Data**: Never store secrets in state; use external secret managers
|
|
129
|
+
|
|
130
|
+
### Drift Detection
|
|
131
|
+
|
|
132
|
+
```yaml
# GitHub Actions workflow for drift detection
name: Terraform Drift Detection

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours

jobs:
  drift-detection:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, production]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Terraform Plan
        # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present (drift)
        run: |
          cd environments/${{ matrix.environment }}
          terraform init
          terraform plan -detailed-exitcode -out=plan.out
        continue-on-error: true
        id: plan

      - name: Alert on Drift
        # Exit code 2 means drift; checking only for step failure would also fire on plan errors.
        # The exitcode output comes from the setup-terraform wrapper (enabled by default).
        if: steps.plan.outputs.exitcode == '2'
        run: |
          # Send alert to Slack/PagerDuty
          echo "Drift detected in ${{ matrix.environment }}"
```
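
`terraform plan -detailed-exitcode` distinguishes three outcomes, and an alerting step is only useful if it tells drift apart from a failed plan. The following is a minimal Go sketch of that classification, for example inside a custom drift reporter. The `PlanResult` type and `ClassifyPlanExit` function are hypothetical names introduced here for illustration; only the exit-code meanings come from Terraform.

```go
package main

import "fmt"

// PlanResult is a hypothetical label for the outcome of
// `terraform plan -detailed-exitcode`.
type PlanResult string

const (
	NoChanges PlanResult = "no-changes" // exit code 0: plan succeeded, empty diff
	PlanError PlanResult = "error"      // exit code 1: plan failed
	Drift     PlanResult = "drift"      // exit code 2: plan succeeded, non-empty diff
)

// ClassifyPlanExit maps a plan exit code to a PlanResult.
// Any other code is reported as an error so it is never mistaken for drift.
func ClassifyPlanExit(code int) (PlanResult, error) {
	switch code {
	case 0:
		return NoChanges, nil
	case 1:
		return PlanError, nil
	case 2:
		return Drift, nil
	default:
		return "", fmt.Errorf("unexpected terraform exit code %d", code)
	}
}
```

A reporter built on this would page on `Drift`, open a ticket on `PlanError` (the check itself is broken), and stay silent on `NoChanges`.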
|
|
164
|
+
|
|
165
|
+
---
|
|
166
|
+
|
|
167
|
+
## Kubernetes Patterns
|
|
168
|
+
|
|
169
|
+
### Resource Management
|
|
170
|
+
|
|
171
|
+
```yaml
|
|
172
|
+
apiVersion: apps/v1
|
|
173
|
+
kind: Deployment
|
|
174
|
+
metadata:
|
|
175
|
+
name: api-server
|
|
176
|
+
labels:
|
|
177
|
+
app.kubernetes.io/name: api-server
|
|
178
|
+
app.kubernetes.io/component: backend
|
|
179
|
+
app.kubernetes.io/managed-by: helm
|
|
180
|
+
spec:
|
|
181
|
+
replicas: 3
|
|
182
|
+
selector:
|
|
183
|
+
matchLabels:
|
|
184
|
+
app.kubernetes.io/name: api-server
|
|
185
|
+
template:
|
|
186
|
+
metadata:
|
|
187
|
+
labels:
|
|
188
|
+
app.kubernetes.io/name: api-server
|
|
189
|
+
annotations:
|
|
190
|
+
prometheus.io/scrape: "true"
|
|
191
|
+
prometheus.io/port: "9090"
|
|
192
|
+
spec:
|
|
193
|
+
# Always set resource limits
|
|
194
|
+
containers:
|
|
195
|
+
- name: api-server
|
|
196
|
+
image: company/api-server:v1.2.3
|
|
197
|
+
resources:
|
|
198
|
+
requests:
|
|
199
|
+
cpu: "100m"
|
|
200
|
+
memory: "256Mi"
|
|
201
|
+
limits:
|
|
202
|
+
cpu: "500m"
|
|
203
|
+
memory: "512Mi"
|
|
204
|
+
|
|
205
|
+
# Always configure probes
|
|
206
|
+
livenessProbe:
|
|
207
|
+
httpGet:
|
|
208
|
+
path: /healthz
|
|
209
|
+
port: 8080
|
|
210
|
+
initialDelaySeconds: 10
|
|
211
|
+
periodSeconds: 10
|
|
212
|
+
readinessProbe:
|
|
213
|
+
httpGet:
|
|
214
|
+
path: /ready
|
|
215
|
+
port: 8080
|
|
216
|
+
initialDelaySeconds: 5
|
|
217
|
+
periodSeconds: 5
|
|
218
|
+
|
|
219
|
+
# Security context
|
|
220
|
+
securityContext:
|
|
221
|
+
runAsNonRoot: true
|
|
222
|
+
runAsUser: 1000
|
|
223
|
+
readOnlyRootFilesystem: true
|
|
224
|
+
allowPrivilegeEscalation: false
|
|
225
|
+
capabilities:
|
|
226
|
+
drop:
|
|
227
|
+
- ALL
|
|
228
|
+
|
|
229
|
+
# Pod-level security
|
|
230
|
+
securityContext:
|
|
231
|
+
fsGroup: 1000
|
|
232
|
+
|
|
233
|
+
# Spread across nodes
|
|
234
|
+
topologySpreadConstraints:
|
|
235
|
+
- maxSkew: 1
|
|
236
|
+
topologyKey: topology.kubernetes.io/zone
|
|
237
|
+
whenUnsatisfiable: ScheduleAnyway
|
|
238
|
+
labelSelector:
|
|
239
|
+
matchLabels:
|
|
240
|
+
app.kubernetes.io/name: api-server
|
|
241
|
+
```
|
|
242
|
+
|
|
243
|
+
### Helm Chart Structure
|
|
244
|
+
|
|
245
|
+
```
|
|
246
|
+
charts/
|
|
247
|
+
└── api-server/
|
|
248
|
+
├── Chart.yaml
|
|
249
|
+
├── values.yaml
|
|
250
|
+
├── values-dev.yaml
|
|
251
|
+
├── values-staging.yaml
|
|
252
|
+
├── values-production.yaml
|
|
253
|
+
├── templates/
|
|
254
|
+
│ ├── _helpers.tpl
|
|
255
|
+
│ ├── deployment.yaml
|
|
256
|
+
│ ├── service.yaml
|
|
257
|
+
│ ├── hpa.yaml
|
|
258
|
+
│ ├── pdb.yaml
|
|
259
|
+
│ ├── networkpolicy.yaml
|
|
260
|
+
│ └── servicemonitor.yaml
|
|
261
|
+
└── tests/
|
|
262
|
+
└── test-connection.yaml
|
|
263
|
+
```
|
|
264
|
+
|
|
265
|
+
### Network Policies
|
|
266
|
+
|
|
267
|
+
```yaml
# Default deny all ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress

---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: api-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Traffic from any pod in the ingress controller's namespace...
        - namespaceSelector:
            matchLabels:
              # Set automatically on every namespace (Kubernetes 1.21+);
              # a bare "name" label only matches if someone added it by hand
              kubernetes.io/metadata.name: ingress-nginx
        # ...or from frontend pods in this namespace
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: frontend
      ports:
        - protocol: TCP
          port: 8080
```
|
|
304
|
+
|
|
305
|
+
---
|
|
306
|
+
|
|
307
|
+
## CI/CD & GitOps
|
|
308
|
+
|
|
309
|
+
### Pipeline Architecture
|
|
310
|
+
|
|
311
|
+
```yaml
|
|
312
|
+
# GitHub Actions - Production-grade pipeline
|
|
313
|
+
name: CI/CD Pipeline
|
|
314
|
+
|
|
315
|
+
on:
|
|
316
|
+
push:
|
|
317
|
+
branches: [main, develop]
|
|
318
|
+
pull_request:
|
|
319
|
+
branches: [main]
|
|
320
|
+
|
|
321
|
+
env:
|
|
322
|
+
REGISTRY: ghcr.io
|
|
323
|
+
IMAGE_NAME: ${{ github.repository }}
|
|
324
|
+
|
|
325
|
+
jobs:
|
|
326
|
+
# Stage 1: Validate
|
|
327
|
+
validate:
|
|
328
|
+
runs-on: ubuntu-latest
|
|
329
|
+
steps:
|
|
330
|
+
- uses: actions/checkout@v4
|
|
331
|
+
|
|
332
|
+
- name: Lint Dockerfile
|
|
333
|
+
uses: hadolint/hadolint-action@v3.1.0
|
|
334
|
+
|
|
335
|
+
- name: Lint Kubernetes manifests
|
|
336
|
+
run: |
|
|
337
|
+
helm lint ./charts/*
|
|
338
|
+
kubeval ./manifests/*.yaml
|
|
339
|
+
|
|
340
|
+
- name: Security scan
|
|
341
|
+
uses: aquasecurity/trivy-action@master
|
|
342
|
+
with:
|
|
343
|
+
scan-type: 'fs'
|
|
344
|
+
severity: 'CRITICAL,HIGH'
|
|
345
|
+
|
|
346
|
+
# Stage 2: Test
|
|
347
|
+
test:
|
|
348
|
+
needs: validate
|
|
349
|
+
runs-on: ubuntu-latest
|
|
350
|
+
steps:
|
|
351
|
+
- uses: actions/checkout@v4
|
|
352
|
+
|
|
353
|
+
- name: Run tests
|
|
354
|
+
run: make test
|
|
355
|
+
|
|
356
|
+
- name: Upload coverage
|
|
357
|
+
uses: codecov/codecov-action@v3
|
|
358
|
+
|
|
359
|
+
# Stage 3: Build
|
|
360
|
+
build:
|
|
361
|
+
needs: test
|
|
362
|
+
runs-on: ubuntu-latest
|
|
363
|
+
outputs:
|
|
364
|
+
image-digest: ${{ steps.build.outputs.digest }}
|
|
365
|
+
steps:
|
|
366
|
+
- uses: actions/checkout@v4
|
|
367
|
+
|
|
368
|
+
- name: Build and push
|
|
369
|
+
id: build
|
|
370
|
+
uses: docker/build-push-action@v5
|
|
371
|
+
with:
|
|
372
|
+
context: .
|
|
373
|
+
push: true
|
|
374
|
+
tags: |
|
|
375
|
+
${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
|
|
376
|
+
${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
|
|
377
|
+
cache-from: type=gha
|
|
378
|
+
cache-to: type=gha,mode=max
|
|
379
|
+
|
|
380
|
+
- name: Sign image
|
|
381
|
+
run: cosign sign --yes ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
|
|
382
|
+
|
|
383
|
+
# Stage 4: Deploy to staging
|
|
384
|
+
deploy-staging:
|
|
385
|
+
needs: build
|
|
386
|
+
runs-on: ubuntu-latest
|
|
387
|
+
environment: staging
|
|
388
|
+
steps:
|
|
389
|
+
- name: Update GitOps repo
|
|
390
|
+
run: |
|
|
391
|
+
# Update image tag in GitOps repository
|
|
392
|
+
# Argo CD will detect and sync
|
|
393
|
+
|
|
394
|
+
# Stage 5: Deploy to production
|
|
395
|
+
deploy-production:
|
|
396
|
+
needs: deploy-staging
|
|
397
|
+
runs-on: ubuntu-latest
|
|
398
|
+
environment: production
|
|
399
|
+
if: github.ref == 'refs/heads/main'
|
|
400
|
+
steps:
|
|
401
|
+
- name: Update GitOps repo
|
|
402
|
+
run: |
|
|
403
|
+
# Update image tag with approval gate
|
|
404
|
+
```
|
|
405
|
+
|
|
406
|
+
### Argo CD Application
|
|
407
|
+
|
|
408
|
+
```yaml
|
|
409
|
+
apiVersion: argoproj.io/v1alpha1
|
|
410
|
+
kind: Application
|
|
411
|
+
metadata:
|
|
412
|
+
name: api-server
|
|
413
|
+
namespace: argocd
|
|
414
|
+
finalizers:
|
|
415
|
+
- resources-finalizer.argocd.argoproj.io
|
|
416
|
+
spec:
|
|
417
|
+
project: default
|
|
418
|
+
|
|
419
|
+
source:
|
|
420
|
+
repoURL: https://github.com/company/gitops-repo.git
|
|
421
|
+
targetRevision: HEAD
|
|
422
|
+
path: apps/api-server/overlays/production
|
|
423
|
+
|
|
424
|
+
destination:
|
|
425
|
+
server: https://kubernetes.default.svc
|
|
426
|
+
namespace: production
|
|
427
|
+
|
|
428
|
+
syncPolicy:
|
|
429
|
+
automated:
|
|
430
|
+
prune: true
|
|
431
|
+
selfHeal: true
|
|
432
|
+
syncOptions:
|
|
433
|
+
- CreateNamespace=true
|
|
434
|
+
- PruneLast=true
|
|
435
|
+
retry:
|
|
436
|
+
limit: 5
|
|
437
|
+
backoff:
|
|
438
|
+
duration: 5s
|
|
439
|
+
factor: 2
|
|
440
|
+
maxDuration: 3m
|
|
441
|
+
```
|
|
442
|
+
|
|
443
|
+
---
|
|
444
|
+
|
|
445
|
+
## Observability
|
|
446
|
+
|
|
447
|
+
### Metrics (Prometheus)
|
|
448
|
+
|
|
449
|
+
```yaml
|
|
450
|
+
# ServiceMonitor for automatic scraping
|
|
451
|
+
apiVersion: monitoring.coreos.com/v1
|
|
452
|
+
kind: ServiceMonitor
|
|
453
|
+
metadata:
|
|
454
|
+
name: api-server
|
|
455
|
+
labels:
|
|
456
|
+
release: prometheus
|
|
457
|
+
spec:
|
|
458
|
+
selector:
|
|
459
|
+
matchLabels:
|
|
460
|
+
app.kubernetes.io/name: api-server
|
|
461
|
+
endpoints:
|
|
462
|
+
- port: metrics
|
|
463
|
+
interval: 30s
|
|
464
|
+
path: /metrics
|
|
465
|
+
```
|
|
466
|
+
|
|
467
|
+
### SLO Definition
|
|
468
|
+
|
|
469
|
+
```yaml
# Sloth SLO definition
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-server-slo
spec:
  service: "api-server"
  labels:
    team: platform
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "99.9% of requests should be successful"
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{job="api-server",status=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total{job="api-server"}[{{.window}}]))
      alerting:
        name: ApiServerHighErrorRate
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning

    - name: "requests-latency"
      objective: 99.0
      description: "99% of requests should be faster than 500ms"
      sli:
        events:
          # The le="0.5" bucket counts requests that met the target, so the
          # error events are all requests minus that bucket.
          errorQuery: |
            (
              sum(rate(http_request_duration_seconds_count{job="api-server"}[{{.window}}]))
              -
              sum(rate(http_request_duration_seconds_bucket{job="api-server",le="0.5"}[{{.window}}]))
            )
          totalQuery: sum(rate(http_request_duration_seconds_count{job="api-server"}[{{.window}}]))
```
|
|
504
|
+
|
|
505
|
+
### Logging (Structured)
|
|
506
|
+
|
|
507
|
+
```go
|
|
508
|
+
// Always use structured logging
|
|
509
|
+
logger.Info("request processed",
|
|
510
|
+
"method", r.Method,
|
|
511
|
+
"path", r.URL.Path,
|
|
512
|
+
"status", status,
|
|
513
|
+
"duration_ms", duration.Milliseconds(),
|
|
514
|
+
"request_id", requestID,
|
|
515
|
+
"user_id", userID,
|
|
516
|
+
)
|
|
517
|
+
```
|
|
518
|
+
|
|
519
|
+
### Distributed Tracing
|
|
520
|
+
|
|
521
|
+
```go
// OpenTelemetry instrumentation.
// tracer, db, and query are assumed to be defined elsewhere in the package.
func handleRequest(w http.ResponseWriter, r *http.Request) {
	ctx, span := tracer.Start(r.Context(), "handleRequest",
		trace.WithAttributes(
			attribute.String("http.method", r.Method),
			attribute.String("http.url", r.URL.String()),
		),
	)
	defer span.End()

	// Pass context to downstream calls so child spans join the same trace
	rows, err := db.QueryContext(ctx, query)
	if err != nil {
		// Record the failure on the span and stop handling the request
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		http.Error(w, "internal server error", http.StatusInternalServerError)
		return
	}
	defer rows.Close()
}
```
|
|
540
|
+
|
|
541
|
+
---
|
|
542
|
+
|
|
543
|
+
## Security
|
|
544
|
+
|
|
545
|
+
### Policy as Code (OPA/Gatekeeper)
|
|
546
|
+
|
|
547
|
+
```yaml
|
|
548
|
+
# Require resource limits
|
|
549
|
+
apiVersion: constraints.gatekeeper.sh/v1beta1
|
|
550
|
+
kind: K8sRequiredResources
|
|
551
|
+
metadata:
|
|
552
|
+
name: require-resource-limits
|
|
553
|
+
spec:
|
|
554
|
+
match:
|
|
555
|
+
kinds:
|
|
556
|
+
- apiGroups: [""]
|
|
557
|
+
kinds: ["Pod"]
|
|
558
|
+
namespaces:
|
|
559
|
+
- production
|
|
560
|
+
- staging
|
|
561
|
+
parameters:
|
|
562
|
+
limits:
|
|
563
|
+
- cpu
|
|
564
|
+
- memory
|
|
565
|
+
requests:
|
|
566
|
+
- cpu
|
|
567
|
+
- memory
|
|
568
|
+
```
|
|
569
|
+
|
|
570
|
+
### Secrets Management
|
|
571
|
+
|
|
572
|
+
```yaml
|
|
573
|
+
# External Secrets Operator
|
|
574
|
+
apiVersion: external-secrets.io/v1beta1
|
|
575
|
+
kind: ExternalSecret
|
|
576
|
+
metadata:
|
|
577
|
+
name: api-secrets
|
|
578
|
+
spec:
|
|
579
|
+
refreshInterval: 1h
|
|
580
|
+
secretStoreRef:
|
|
581
|
+
name: vault-backend
|
|
582
|
+
kind: ClusterSecretStore
|
|
583
|
+
target:
|
|
584
|
+
name: api-secrets
|
|
585
|
+
creationPolicy: Owner
|
|
586
|
+
data:
|
|
587
|
+
- secretKey: database-url
|
|
588
|
+
remoteRef:
|
|
589
|
+
key: secret/data/api-server
|
|
590
|
+
property: database_url
|
|
591
|
+
- secretKey: api-key
|
|
592
|
+
remoteRef:
|
|
593
|
+
key: secret/data/api-server
|
|
594
|
+
property: api_key
|
|
595
|
+
```
|
|
596
|
+
|
|
597
|
+
### Supply Chain Security
|
|
598
|
+
|
|
599
|
+
```yaml
|
|
600
|
+
# Kyverno policy - require signed images
|
|
601
|
+
apiVersion: kyverno.io/v1
|
|
602
|
+
kind: ClusterPolicy
|
|
603
|
+
metadata:
|
|
604
|
+
name: verify-image-signature
|
|
605
|
+
spec:
|
|
606
|
+
validationFailureAction: Enforce
|
|
607
|
+
background: false
|
|
608
|
+
rules:
|
|
609
|
+
- name: verify-signature
|
|
610
|
+
match:
|
|
611
|
+
any:
|
|
612
|
+
- resources:
|
|
613
|
+
kinds:
|
|
614
|
+
- Pod
|
|
615
|
+
verifyImages:
|
|
616
|
+
- imageReferences:
|
|
617
|
+
- "ghcr.io/company/*"
|
|
618
|
+
attestors:
|
|
619
|
+
- entries:
|
|
620
|
+
- keyless:
|
|
621
|
+
subject: "https://github.com/company/*"
|
|
622
|
+
issuer: "https://token.actions.githubusercontent.com"
|
|
623
|
+
```
|
|
624
|
+
|
|
625
|
+
---
|
|
626
|
+
|
|
627
|
+
## Developer Experience
|
|
628
|
+
|
|
629
|
+
### Golden Paths
|
|
630
|
+
|
|
631
|
+
Provide opinionated, well-supported paths for common tasks:
|
|
632
|
+
|
|
633
|
+
```yaml
|
|
634
|
+
# Backstage Software Template
|
|
635
|
+
apiVersion: scaffolder.backstage.io/v1beta3
|
|
636
|
+
kind: Template
|
|
637
|
+
metadata:
|
|
638
|
+
name: microservice-template
|
|
639
|
+
title: Production Microservice
|
|
640
|
+
description: Create a production-ready microservice with all platform integrations
|
|
641
|
+
spec:
|
|
642
|
+
owner: platform-team
|
|
643
|
+
type: service
|
|
644
|
+
|
|
645
|
+
parameters:
|
|
646
|
+
- title: Service Information
|
|
647
|
+
required:
|
|
648
|
+
- name
|
|
649
|
+
- owner
|
|
650
|
+
properties:
|
|
651
|
+
name:
|
|
652
|
+
title: Service Name
|
|
653
|
+
type: string
|
|
654
|
+
pattern: '^[a-z0-9-]+$'
|
|
655
|
+
owner:
|
|
656
|
+
title: Owner Team
|
|
657
|
+
type: string
|
|
658
|
+
ui:field: OwnerPicker
|
|
659
|
+
|
|
660
|
+
- title: Infrastructure
|
|
661
|
+
properties:
|
|
662
|
+
database:
|
|
663
|
+
title: Database
|
|
664
|
+
type: string
|
|
665
|
+
enum: [none, postgresql, mysql]
|
|
666
|
+
cache:
|
|
667
|
+
title: Cache
|
|
668
|
+
type: string
|
|
669
|
+
enum: [none, redis, memcached]
|
|
670
|
+
|
|
671
|
+
steps:
|
|
672
|
+
- id: fetch
|
|
673
|
+
name: Fetch Template
|
|
674
|
+
action: fetch:template
|
|
675
|
+
input:
|
|
676
|
+
url: ./skeleton
|
|
677
|
+
values:
|
|
678
|
+
name: ${{ parameters.name }}
|
|
679
|
+
owner: ${{ parameters.owner }}
|
|
680
|
+
|
|
681
|
+
- id: publish
|
|
682
|
+
name: Publish to GitHub
|
|
683
|
+
action: publish:github
|
|
684
|
+
input:
|
|
685
|
+
repoUrl: github.com?repo=${{ parameters.name }}&owner=company
|
|
686
|
+
|
|
687
|
+
- id: register
|
|
688
|
+
name: Register in Catalog
|
|
689
|
+
action: catalog:register
|
|
690
|
+
input:
|
|
691
|
+
repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
|
|
692
|
+
catalogInfoPath: /catalog-info.yaml
|
|
693
|
+
```
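
The `pattern` on the service name is what keeps generated repositories and Kubernetes resources consistently named, so it can be useful to apply the same check outside the portal, for example in a CLI or a CI step. A minimal Go sketch, assuming the same expression as the template; `ValidServiceName` is a hypothetical helper name.

```go
package main

import "regexp"

// serviceNamePattern mirrors the `pattern` used for the template's name parameter.
var serviceNamePattern = regexp.MustCompile(`^[a-z0-9-]+$`)

// ValidServiceName reports whether name consists only of lowercase letters,
// digits, and hyphens, and is not empty.
func ValidServiceName(name string) bool {
	return serviceNamePattern.MatchString(name)
}
```

`ValidServiceName("payments-api")` is true, while `"Payments_API"` and the empty string are rejected. Note that the pattern as written also accepts names such as `-` or `-api`; a stricter expression may be worth considering if names must start with a letter.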
|
|
694
|
+
|
|
695
|
+
### Self-Service Portal
|
|
696
|
+
|
|
697
|
+
Key capabilities to provide:
|
|
698
|
+
|
|
699
|
+
- **Environment Provisioning**: Spin up dev/preview environments on demand
|
|
700
|
+
- **Database Access**: Request read replicas or sanitized snapshots
|
|
701
|
+
- **Secret Management**: Self-service secret rotation
|
|
702
|
+
- **Monitoring Dashboards**: Auto-generated per-service dashboards
|
|
703
|
+
- **Cost Visibility**: Per-team/per-service cost attribution
|
|
704
|
+
|
|
705
|
+
---
|
|
706
|
+
|
|
707
|
+
## Testing Infrastructure
|
|
708
|
+
|
|
709
|
+
### Terraform Testing
|
|
710
|
+
|
|
711
|
+
```go
// tests/vpc_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestVpcModule(t *testing.T) {
	// Retry automatically on known transient Terraform errors
	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../modules/networking/vpc",
		Vars: map[string]interface{}{
			"environment": "test",
			"cidr_block":  "10.0.0.0/16",
		},
	})

	// Tear down the real infrastructure even if an assertion fails
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	vpcId := terraform.Output(t, terraformOptions, "vpc_id")
	assert.NotEmpty(t, vpcId)
}
```
|
|
737
|
+
|
|
738
|
+
### Kubernetes Testing
|
|
739
|
+
|
|
740
|
+
```yaml
|
|
741
|
+
# Helm chart test
|
|
742
|
+
apiVersion: v1
|
|
743
|
+
kind: Pod
|
|
744
|
+
metadata:
|
|
745
|
+
name: "{{ include "api-server.fullname" . }}-test"
|
|
746
|
+
annotations:
|
|
747
|
+
"helm.sh/hook": test
|
|
748
|
+
spec:
|
|
749
|
+
containers:
|
|
750
|
+
- name: test
|
|
751
|
+
image: curlimages/curl:latest
|
|
752
|
+
command: ['curl']
|
|
753
|
+
args:
|
|
754
|
+
- '--fail'
|
|
755
|
+
- '--silent'
|
|
756
|
+
- 'http://{{ include "api-server.fullname" . }}:{{ .Values.service.port }}/healthz'
|
|
757
|
+
restartPolicy: Never
|
|
758
|
+
```
|
|
759
|
+
|
|
760
|
+
### Chaos Engineering
|
|
761
|
+
|
|
762
|
+
```yaml
|
|
763
|
+
# Chaos Mesh - Pod failure experiment
|
|
764
|
+
apiVersion: chaos-mesh.org/v1alpha1
|
|
765
|
+
kind: PodChaos
|
|
766
|
+
metadata:
|
|
767
|
+
name: api-server-pod-failure
|
|
768
|
+
namespace: chaos-testing
|
|
769
|
+
spec:
|
|
770
|
+
action: pod-failure
|
|
771
|
+
mode: one
|
|
772
|
+
duration: "30s"
|
|
773
|
+
selector:
|
|
774
|
+
namespaces:
|
|
775
|
+
- staging
|
|
776
|
+
labelSelectors:
|
|
777
|
+
app.kubernetes.io/name: api-server
|
|
778
|
+
scheduler:
|
|
779
|
+
cron: "@every 2h"
|
|
780
|
+
```
|
|
781
|
+
|
|
782
|
+
---
|
|
783
|
+
|
|
784
|
+
## Definition of Done
|
|
785
|
+
|
|
786
|
+
### Infrastructure Change
|
|
787
|
+
|
|
788
|
+
- [ ] IaC passes linting and validation
|
|
789
|
+
- [ ] Plan reviewed and approved
|
|
790
|
+
- [ ] Changes tested in non-production first
|
|
791
|
+
- [ ] Rollback procedure documented
|
|
792
|
+
- [ ] Monitoring/alerting in place
|
|
793
|
+
- [ ] Runbook updated
|
|
794
|
+
- [ ] Cost impact assessed
|
|
795
|
+
- [ ] Security review completed
|
|
796
|
+
|
|
797
|
+
### Platform Feature
|
|
798
|
+
|
|
799
|
+
- [ ] Self-service capable (no manual intervention needed)
|
|
800
|
+
- [ ] Documentation complete (how-to, troubleshooting)
|
|
801
|
+
- [ ] Golden path integrated
|
|
802
|
+
- [ ] Metrics exposed for SLOs
|
|
803
|
+
- [ ] Tested with real workloads
|
|
804
|
+
- [ ] Feedback collected from users
|
|
805
|
+
- [ ] Support runbook created
|
|
806
|
+
|
|
807
|
+
---
|
|
808
|
+
|
|
809
|
+
## Common Pitfalls
|
|
810
|
+
|
|
811
|
+
### 1. Building a Ticketing System
|
|
812
|
+
|
|
813
|
+
❌ **Wrong**: Developers file tickets for every infrastructure change
|
|
814
|
+
|
|
815
|
+
✅ **Right**: Build self-service automation; platform team handles the platform, not tickets
|
|
816
|
+
|
|
817
|
+
### 2. Ignoring Developer Experience
|
|
818
|
+
|
|
819
|
+
❌ **Wrong**: Complex, undocumented processes that require tribal knowledge
|
|
820
|
+
|
|
821
|
+
✅ **Right**: Golden paths with sensible defaults, clear documentation, quick feedback loops
|
|
822
|
+
|
|
823
|
+
### 3. Over-Engineering
|
|
824
|
+
|
|
825
|
+
❌ **Wrong**: Kubernetes cluster for a team of 5 running 3 services
|
|
826
|
+
|
|
827
|
+
✅ **Right**: Right-size infrastructure to actual needs; complexity has ongoing costs
|
|
828
|
+
|
|
829
|
+
### 4. No Error Budgets
|
|
830
|
+
|
|
831
|
+
❌ **Wrong**: "Five nines or nothing" with no measurement
|
|
832
|
+
|
|
833
|
+
✅ **Right**: Define SLOs, measure SLIs, use error budgets to balance reliability and velocity
|
|
834
|
+
|
|
835
|
+
### 5. Secrets in Git
|
|
836
|
+
|
|
837
|
+
❌ **Wrong**: Committing `.env` files or hardcoding credentials
|
|
838
|
+
|
|
839
|
+
✅ **Right**: External secret management (Vault, AWS Secrets Manager) with dynamic injection
|
|
840
|
+
|
|
841
|
+
---
|
|
842
|
+
|
|
843
|
+
## Resources
|
|
844
|
+
|
|
845
|
+
- [Platform Engineering Maturity Model](https://platformengineering.org/maturity-model)
|
|
846
|
+
- [Terraform Best Practices](https://www.terraform-best-practices.com/)
|
|
847
|
+
- [Kubernetes Patterns](https://k8spatterns.io/)
|
|
848
|
+
- [Site Reliability Engineering (Google)](https://sre.google/sre-book/table-of-contents/)
|
|
849
|
+
- [The Platform Engineering Guide](https://platformengineering.org/)
|
|
850
|
+
- [CNCF Landscape](https://landscape.cncf.io/)
|