agentic-team-templates 0.3.0

Files changed (103)
  1. package/README.md +280 -0
  2. package/bin/cli.js +5 -0
  3. package/package.json +47 -0
  4. package/src/index.js +521 -0
  5. package/templates/_shared/code-quality.md +162 -0
  6. package/templates/_shared/communication.md +114 -0
  7. package/templates/_shared/core-principles.md +62 -0
  8. package/templates/_shared/git-workflow.md +165 -0
  9. package/templates/_shared/security-fundamentals.md +173 -0
  10. package/templates/blockchain/.cursorrules/defi-patterns.md +520 -0
  11. package/templates/blockchain/.cursorrules/gas-optimization.md +339 -0
  12. package/templates/blockchain/.cursorrules/overview.md +130 -0
  13. package/templates/blockchain/.cursorrules/security.md +318 -0
  14. package/templates/blockchain/.cursorrules/smart-contracts.md +364 -0
  15. package/templates/blockchain/.cursorrules/testing.md +415 -0
  16. package/templates/blockchain/.cursorrules/web3-integration.md +538 -0
  17. package/templates/blockchain/CLAUDE.md +389 -0
  18. package/templates/cli-tools/.cursorrules/architecture.md +412 -0
  19. package/templates/cli-tools/.cursorrules/arguments.md +406 -0
  20. package/templates/cli-tools/.cursorrules/distribution.md +546 -0
  21. package/templates/cli-tools/.cursorrules/error-handling.md +455 -0
  22. package/templates/cli-tools/.cursorrules/overview.md +136 -0
  23. package/templates/cli-tools/.cursorrules/testing.md +537 -0
  24. package/templates/cli-tools/.cursorrules/user-experience.md +545 -0
  25. package/templates/cli-tools/CLAUDE.md +356 -0
  26. package/templates/data-engineering/.cursorrules/data-modeling.md +367 -0
  27. package/templates/data-engineering/.cursorrules/data-quality.md +455 -0
  28. package/templates/data-engineering/.cursorrules/overview.md +85 -0
  29. package/templates/data-engineering/.cursorrules/performance.md +339 -0
  30. package/templates/data-engineering/.cursorrules/pipeline-design.md +280 -0
  31. package/templates/data-engineering/.cursorrules/security.md +460 -0
  32. package/templates/data-engineering/.cursorrules/testing.md +452 -0
  33. package/templates/data-engineering/CLAUDE.md +974 -0
  34. package/templates/devops-sre/.cursorrules/capacity-planning.md +653 -0
  35. package/templates/devops-sre/.cursorrules/change-management.md +584 -0
  36. package/templates/devops-sre/.cursorrules/chaos-engineering.md +651 -0
  37. package/templates/devops-sre/.cursorrules/disaster-recovery.md +641 -0
  38. package/templates/devops-sre/.cursorrules/incident-management.md +565 -0
  39. package/templates/devops-sre/.cursorrules/observability.md +714 -0
  40. package/templates/devops-sre/.cursorrules/overview.md +230 -0
  41. package/templates/devops-sre/.cursorrules/postmortems.md +588 -0
  42. package/templates/devops-sre/.cursorrules/runbooks.md +760 -0
  43. package/templates/devops-sre/.cursorrules/slo-sli.md +617 -0
  44. package/templates/devops-sre/.cursorrules/toil-reduction.md +567 -0
  45. package/templates/devops-sre/CLAUDE.md +1007 -0
  46. package/templates/documentation/.cursorrules/adr.md +277 -0
  47. package/templates/documentation/.cursorrules/api-documentation.md +411 -0
  48. package/templates/documentation/.cursorrules/code-comments.md +253 -0
  49. package/templates/documentation/.cursorrules/maintenance.md +260 -0
  50. package/templates/documentation/.cursorrules/overview.md +82 -0
  51. package/templates/documentation/.cursorrules/readme-standards.md +306 -0
  52. package/templates/documentation/CLAUDE.md +120 -0
  53. package/templates/fullstack/.cursorrules/api-contracts.md +331 -0
  54. package/templates/fullstack/.cursorrules/architecture.md +298 -0
  55. package/templates/fullstack/.cursorrules/overview.md +109 -0
  56. package/templates/fullstack/.cursorrules/shared-types.md +348 -0
  57. package/templates/fullstack/.cursorrules/testing.md +386 -0
  58. package/templates/fullstack/CLAUDE.md +349 -0
  59. package/templates/ml-ai/.cursorrules/data-engineering.md +483 -0
  60. package/templates/ml-ai/.cursorrules/deployment.md +601 -0
  61. package/templates/ml-ai/.cursorrules/model-development.md +538 -0
  62. package/templates/ml-ai/.cursorrules/monitoring.md +658 -0
  63. package/templates/ml-ai/.cursorrules/overview.md +131 -0
  64. package/templates/ml-ai/.cursorrules/security.md +637 -0
  65. package/templates/ml-ai/.cursorrules/testing.md +678 -0
  66. package/templates/ml-ai/CLAUDE.md +1136 -0
  67. package/templates/mobile/.cursorrules/navigation.md +246 -0
  68. package/templates/mobile/.cursorrules/offline-first.md +302 -0
  69. package/templates/mobile/.cursorrules/overview.md +71 -0
  70. package/templates/mobile/.cursorrules/performance.md +345 -0
  71. package/templates/mobile/.cursorrules/testing.md +339 -0
  72. package/templates/mobile/CLAUDE.md +233 -0
  73. package/templates/platform-engineering/.cursorrules/ci-cd.md +778 -0
  74. package/templates/platform-engineering/.cursorrules/developer-experience.md +632 -0
  75. package/templates/platform-engineering/.cursorrules/infrastructure-as-code.md +600 -0
  76. package/templates/platform-engineering/.cursorrules/kubernetes.md +710 -0
  77. package/templates/platform-engineering/.cursorrules/observability.md +747 -0
  78. package/templates/platform-engineering/.cursorrules/overview.md +215 -0
  79. package/templates/platform-engineering/.cursorrules/security.md +855 -0
  80. package/templates/platform-engineering/.cursorrules/testing.md +878 -0
  81. package/templates/platform-engineering/CLAUDE.md +850 -0
  82. package/templates/utility-agent/.cursorrules/action-control.md +284 -0
  83. package/templates/utility-agent/.cursorrules/context-management.md +186 -0
  84. package/templates/utility-agent/.cursorrules/hallucination-prevention.md +253 -0
  85. package/templates/utility-agent/.cursorrules/overview.md +78 -0
  86. package/templates/utility-agent/.cursorrules/token-optimization.md +369 -0
  87. package/templates/utility-agent/CLAUDE.md +513 -0
  88. package/templates/web-backend/.cursorrules/api-design.md +255 -0
  89. package/templates/web-backend/.cursorrules/authentication.md +309 -0
  90. package/templates/web-backend/.cursorrules/database-patterns.md +298 -0
  91. package/templates/web-backend/.cursorrules/error-handling.md +366 -0
  92. package/templates/web-backend/.cursorrules/overview.md +69 -0
  93. package/templates/web-backend/.cursorrules/security.md +358 -0
  94. package/templates/web-backend/.cursorrules/testing.md +395 -0
  95. package/templates/web-backend/CLAUDE.md +366 -0
  96. package/templates/web-frontend/.cursorrules/accessibility.md +296 -0
  97. package/templates/web-frontend/.cursorrules/component-patterns.md +204 -0
  98. package/templates/web-frontend/.cursorrules/overview.md +72 -0
  99. package/templates/web-frontend/.cursorrules/performance.md +325 -0
  100. package/templates/web-frontend/.cursorrules/state-management.md +227 -0
  101. package/templates/web-frontend/.cursorrules/styling.md +271 -0
  102. package/templates/web-frontend/.cursorrules/testing.md +311 -0
  103. package/templates/web-frontend/CLAUDE.md +399 -0
@@ -0,0 +1,850 @@
+ # Platform Engineering Development Guide
+
+ Staff-level guidelines for building and operating internal developer platforms, infrastructure automation, and reliability engineering.
+
+ ---
+
+ ## Overview
+
+ This guide applies to:
+
+ - Infrastructure as Code (Terraform, Pulumi, CDK)
+ - Kubernetes and container orchestration
+ - CI/CD pipelines and GitOps
+ - Internal Developer Platforms (IDPs)
+ - Observability systems (metrics, logs, traces)
+ - Service mesh and networking
+ - Security and compliance automation
+
+ ### Key Principles
+
+ 1. **Platform as Product** - Your internal customers are developers; treat the platform like a product
+ 2. **Self-Service First** - Enable teams to move fast without the platform team becoming a bottleneck
+ 3. **Reliability Engineering** - Define SLOs, measure SLIs, maintain error budgets
+ 4. **Security by Default** - Bake security into the golden path rather than bolting it on afterward
+ 5. **Cost Consciousness** - FinOps is everyone's responsibility
+
+ ### Technology Stack
+
+ | Layer | Technology |
+ |-------|------------|
+ | IaC | Terraform, Pulumi, AWS CDK |
+ | Container Orchestration | Kubernetes, EKS/GKE/AKS |
+ | GitOps | Argo CD, Flux |
+ | CI/CD | GitHub Actions, GitLab CI, Tekton |
+ | Observability | Prometheus, Grafana, Loki, Tempo, Jaeger |
+ | Service Mesh | Istio, Linkerd, Cilium |
+ | Policy | OPA/Gatekeeper, Kyverno |
+ | Secrets | Vault, External Secrets, SOPS |
+
+ ---
+
+ ## Infrastructure as Code
+
+ ### Module Structure
+
+ ```
+ modules/
+ ├── networking/
+ │   ├── vpc/
+ │   │   ├── main.tf
+ │   │   ├── variables.tf
+ │   │   ├── outputs.tf
+ │   │   └── README.md
+ │   └── dns/
+ ├── compute/
+ │   ├── eks-cluster/
+ │   └── node-groups/
+ ├── data/
+ │   ├── rds/
+ │   └── redis/
+ └── security/
+     ├── iam-roles/
+     └── kms/
+
+ environments/
+ ├── dev/
+ │   ├── main.tf
+ │   ├── terraform.tfvars
+ │   └── backend.tf
+ ├── staging/
+ └── production/
+ ```
+
+ ### Terraform Best Practices
+
+ ```hcl
+ # Always constrain Terraform and provider versions
+ terraform {
+   required_version = ">= 1.5.0"
+
+   required_providers {
+     aws = {
+       source  = "hashicorp/aws"
+       version = "~> 5.0"
+     }
+   }
+
+   backend "s3" {
+     bucket         = "company-terraform-state"
+     key            = "env/production/terraform.tfstate"
+     region         = "us-east-1"
+     encrypt        = true
+     dynamodb_table = "terraform-locks"
+   }
+ }
+
+ # Use locals for computed values
+ locals {
+   common_tags = {
+     Environment = var.environment
+     ManagedBy   = "terraform"
+     Team        = var.team
+     CostCenter  = var.cost_center
+   }
+ }
+
+ # Meaningful resource naming
+ resource "aws_eks_cluster" "main" {
+   name     = "${var.project}-${var.environment}"
+   role_arn = aws_iam_role.cluster.arn
+   version  = var.kubernetes_version
+
+   vpc_config {
+     subnet_ids              = var.private_subnet_ids
+     endpoint_private_access = true
+     endpoint_public_access  = var.environment != "production"
+   }
+
+   tags = local.common_tags
+ }
+ ```
+
+ ### State Management
+
+ - **Remote State**: Always use remote backends (S3, GCS, Terraform Cloud)
+ - **State Locking**: Enable DynamoDB/GCS locking to prevent concurrent modifications
+ - **State Isolation**: Separate state files per environment
+ - **Sensitive Data**: Never store secrets in state; use external secret managers
+
+ ### Drift Detection
+
+ ```yaml
+ # GitHub Actions workflow for drift detection
+ name: Terraform Drift Detection
+
+ on:
+   schedule:
+     - cron: '0 */6 * * *' # Every 6 hours
+
+ jobs:
+   drift-detection:
+     runs-on: ubuntu-latest
+     strategy:
+       matrix:
+         environment: [dev, staging, production]
+     steps:
+       - uses: actions/checkout@v4
+       - uses: hashicorp/setup-terraform@v3
+
+       - name: Terraform Plan
+         run: |
+           cd environments/${{ matrix.environment }}
+           terraform init
+           terraform plan -detailed-exitcode -out=plan.out
+         continue-on-error: true
+         id: plan
+
+       - name: Alert on Drift
+         # With -detailed-exitcode, exit 2 means changes are present and exit 1 means an error;
+         # both mark the plan step as failed, so this step fires in either case.
+         if: steps.plan.outcome == 'failure'
+         run: |
+           # Send alert to Slack/PagerDuty
+           echo "Drift detected in ${{ matrix.environment }}"
+ ```
+
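The workflow above relies on `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are present. A minimal Python sketch of how an alerting script might separate real drift from a failed plan; `PlanResult`, `classify_plan_exit_code`, and `alert_message` are hypothetical names introduced for illustration, not part of Terraform or GitHub Actions.

```python
from enum import Enum
from typing import Optional


class PlanResult(Enum):
    NO_CHANGES = "no_changes"
    ERROR = "error"
    DRIFT = "drift"


def classify_plan_exit_code(exit_code: int) -> PlanResult:
    """Map a `terraform plan -detailed-exitcode` exit status to a result."""
    if exit_code == 0:
        return PlanResult.NO_CHANGES
    if exit_code == 2:
        return PlanResult.DRIFT
    # Exit code 1 is the documented error status; anything unexpected
    # is treated as an error too, so it is never reported as drift.
    return PlanResult.ERROR


def alert_message(environment: str, exit_code: int) -> Optional[str]:
    """Return the alert text to send, or None when nothing needs attention."""
    result = classify_plan_exit_code(exit_code)
    if result is PlanResult.DRIFT:
        return f"Drift detected in {environment}"
    if result is PlanResult.ERROR:
        return f"Drift check failed in {environment} (plan error)"
    return None


if __name__ == "__main__":
    for env, code in [("dev", 0), ("staging", 1), ("production", 2)]:
        print(env, alert_message(env, code))
```

Keeping the two non-zero cases apart avoids paging someone for "drift" when the plan could not run at all, for example because credentials expired.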
+ ---
+
+ ## Kubernetes Patterns
+
+ ### Resource Management
+
+ ```yaml
+ apiVersion: apps/v1
+ kind: Deployment
+ metadata:
+   name: api-server
+   labels:
+     app.kubernetes.io/name: api-server
+     app.kubernetes.io/component: backend
+     app.kubernetes.io/managed-by: helm
+ spec:
+   replicas: 3
+   selector:
+     matchLabels:
+       app.kubernetes.io/name: api-server
+   template:
+     metadata:
+       labels:
+         app.kubernetes.io/name: api-server
+       annotations:
+         prometheus.io/scrape: "true"
+         prometheus.io/port: "9090"
+     spec:
+       # Always set resource limits
+       containers:
+         - name: api-server
+           image: company/api-server:v1.2.3
+           resources:
+             requests:
+               cpu: "100m"
+               memory: "256Mi"
+             limits:
+               cpu: "500m"
+               memory: "512Mi"
+
+           # Always configure probes
+           livenessProbe:
+             httpGet:
+               path: /healthz
+               port: 8080
+             initialDelaySeconds: 10
+             periodSeconds: 10
+           readinessProbe:
+             httpGet:
+               path: /ready
+               port: 8080
+             initialDelaySeconds: 5
+             periodSeconds: 5
+
+           # Security context
+           securityContext:
+             runAsNonRoot: true
+             runAsUser: 1000
+             readOnlyRootFilesystem: true
+             allowPrivilegeEscalation: false
+             capabilities:
+               drop:
+                 - ALL
+
+       # Pod-level security
+       securityContext:
+         fsGroup: 1000
+
+       # Spread across zones
+       topologySpreadConstraints:
+         - maxSkew: 1
+           topologyKey: topology.kubernetes.io/zone
+           whenUnsatisfiable: ScheduleAnyway
+           labelSelector:
+             matchLabels:
+               app.kubernetes.io/name: api-server
+ ```
+
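The requests and limits in the manifest above use Kubernetes quantity strings (`100m` CPU is 0.1 core, `256Mi` is 256 × 1024² bytes). A small Python sketch of a pre-merge check that every container declares both and that no request exceeds its limit. The helper names are hypothetical, and the parser handles only a common subset of quantity suffixes, not the full Kubernetes grammar.

```python
# Subset of Kubernetes memory suffixes: binary (Ki, Mi, Gi, Ti) and decimal (k, M, G, T).
_MEMORY_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_cpu_millicores(value: str) -> int:
    """Convert a CPU quantity such as "100m" or "0.5" to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(round(float(value) * 1000))


def parse_memory_bytes(value: str) -> int:
    """Convert a memory quantity such as "256Mi" or "512M" to bytes."""
    # Try two-letter suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(_MEMORY_UNITS, key=len, reverse=True):
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * _MEMORY_UNITS[suffix])
    return int(value)  # plain byte count


def check_container(resources: dict) -> list:
    """Return a list of problems found in a container's `resources` mapping."""
    problems = []
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    parsers = (("cpu", parse_cpu_millicores), ("memory", parse_memory_bytes))
    for name, parse in parsers:
        if name not in requests:
            problems.append(f"missing {name} request")
        if name not in limits:
            problems.append(f"missing {name} limit")
        if name in requests and name in limits:
            if parse(requests[name]) > parse(limits[name]):
                problems.append(f"{name} request exceeds limit")
    return problems


if __name__ == "__main__":
    api_server = {
        "requests": {"cpu": "100m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "512Mi"},
    }
    print(check_container(api_server))
```

In a cluster, an admission policy enforces the same rule; a script like this only gives earlier feedback in CI.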
+ ### Helm Chart Structure
+
+ ```
+ charts/
+ └── api-server/
+     ├── Chart.yaml
+     ├── values.yaml
+     ├── values-dev.yaml
+     ├── values-staging.yaml
+     ├── values-production.yaml
+     ├── templates/
+     │   ├── _helpers.tpl
+     │   ├── deployment.yaml
+     │   ├── service.yaml
+     │   ├── hpa.yaml
+     │   ├── pdb.yaml
+     │   ├── networkpolicy.yaml
+     │   └── servicemonitor.yaml
+     └── tests/
+         └── test-connection.yaml
+ ```
+
+ ### Network Policies
+
+ ```yaml
+ # Default deny all ingress
+ apiVersion: networking.k8s.io/v1
+ kind: NetworkPolicy
+ metadata:
+   name: default-deny-ingress
+   namespace: production
+ spec:
+   podSelector: {}
+   policyTypes:
+     - Ingress
+
+ ---
+ # Allow specific traffic
+ apiVersion: networking.k8s.io/v1
+ kind: NetworkPolicy
+ metadata:
+   name: api-server-ingress
+   namespace: production
+ spec:
+   podSelector:
+     matchLabels:
+       app.kubernetes.io/name: api-server
+   policyTypes:
+     - Ingress
+   ingress:
+     - from:
+         # Assumes the namespace carries a `name` label; Kubernetes 1.21+ also sets
+         # `kubernetes.io/metadata.name` on every namespace automatically.
+         - namespaceSelector:
+             matchLabels:
+               name: ingress-nginx
+         - podSelector:
+             matchLabels:
+               app.kubernetes.io/name: frontend
+       ports:
+         - protocol: TCP
+           port: 8080
+ ```
+
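The two entries under `from` in `api-server-ingress` are alternatives: traffic is admitted from any pod in a namespace labelled `name: ingress-nginx`, or from `frontend` pods in the policy's own namespace, and only on TCP 8080. The Python sketch below models that evaluation so the rule can be reasoned about offline. It is a deliberate simplification with hypothetical names (`Source`, `peer_allows`, `ingress_allowed`): it covers `matchLabels` only and ignores `ipBlock`, `matchExpressions`, protocols, and named ports.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Where a connection comes from (hypothetical model, not a Kubernetes type)."""
    namespace: str
    namespace_labels: dict = field(default_factory=dict)
    pod_labels: dict = field(default_factory=dict)


def matches(selector: dict, labels: dict) -> bool:
    """True when every matchLabels pair is present; an empty selector matches all."""
    wanted = selector.get("matchLabels", {})
    return all(labels.get(key) == value for key, value in wanted.items())


def peer_allows(peer: dict, source: Source, policy_namespace: str) -> bool:
    ns_selector = peer.get("namespaceSelector")
    pod_selector = peer.get("podSelector")
    if ns_selector is not None:
        if not matches(ns_selector, source.namespace_labels):
            return False
    elif source.namespace != policy_namespace:
        # Without a namespaceSelector, a podSelector only covers the policy's namespace.
        return False
    if pod_selector is not None and not matches(pod_selector, source.pod_labels):
        return False
    return ns_selector is not None or pod_selector is not None


def ingress_allowed(rules: list, source: Source, port: int, policy_namespace: str) -> bool:
    """Traffic is allowed when any rule matches both the port and the source."""
    for rule in rules:
        ports = [entry["port"] for entry in rule.get("ports", [])]
        if ports and port not in ports:
            continue
        peers = rule.get("from")
        if peers is None or any(peer_allows(p, source, policy_namespace) for p in peers):
            return True
    return False


# Mirrors the ingress rule of the api-server-ingress policy.
API_SERVER_INGRESS = [
    {
        "from": [
            {"namespaceSelector": {"matchLabels": {"name": "ingress-nginx"}}},
            {"podSelector": {"matchLabels": {"app.kubernetes.io/name": "frontend"}}},
        ],
        "ports": [{"protocol": "TCP", "port": 8080}],
    }
]

if __name__ == "__main__":
    frontend = Source("production", {}, {"app.kubernetes.io/name": "frontend"})
    print(ingress_allowed(API_SERVER_INGRESS, frontend, 8080, "production"))
```

Note that a `frontend` pod in another namespace is rejected, because the bare `podSelector` entry never reaches outside `production`.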
+ ---
+
+ ## CI/CD & GitOps
+
+ ### Pipeline Architecture
+
+ ```yaml
+ # GitHub Actions - Production-grade pipeline
+ name: CI/CD Pipeline
+
+ on:
+   push:
+     branches: [main, develop]
+   pull_request:
+     branches: [main]
+
+ env:
+   REGISTRY: ghcr.io
+   IMAGE_NAME: ${{ github.repository }}
+
+ jobs:
+   # Stage 1: Validate
+   validate:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Lint Dockerfile
+         uses: hadolint/hadolint-action@v3.1.0
+
+       - name: Lint Kubernetes manifests
+         run: |
+           helm lint ./charts/*
+           kubeval ./manifests/*.yaml
+
+       - name: Security scan
+         uses: aquasecurity/trivy-action@master
+         with:
+           scan-type: 'fs'
+           severity: 'CRITICAL,HIGH'
+
+   # Stage 2: Test
+   test:
+     needs: validate
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Run tests
+         run: make test
+
+       - name: Upload coverage
+         uses: codecov/codecov-action@v3
+
+   # Stage 3: Build
+   build:
+     needs: test
+     runs-on: ubuntu-latest
+     outputs:
+       image-digest: ${{ steps.build.outputs.digest }}
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Build and push
+         id: build
+         uses: docker/build-push-action@v5
+         with:
+           context: .
+           push: true
+           tags: |
+             ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
+             ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
+           cache-from: type=gha
+           cache-to: type=gha,mode=max
+
+       - name: Sign image
+         run: cosign sign --yes ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
+
+   # Stage 4: Deploy to staging
+   deploy-staging:
+     needs: build
+     runs-on: ubuntu-latest
+     environment: staging
+     steps:
+       - name: Update GitOps repo
+         run: |
+           # Update image tag in GitOps repository
+           # Argo CD will detect and sync
+
+   # Stage 5: Deploy to production
+   deploy-production:
+     needs: deploy-staging
+     runs-on: ubuntu-latest
+     environment: production
+     if: github.ref == 'refs/heads/main'
+     steps:
+       - name: Update GitOps repo
+         run: |
+           # Update image tag with approval gate
+ ```
+
+ ### Argo CD Application
+
+ ```yaml
+ apiVersion: argoproj.io/v1alpha1
+ kind: Application
+ metadata:
+   name: api-server
+   namespace: argocd
+   finalizers:
+     - resources-finalizer.argocd.argoproj.io
+ spec:
+   project: default
+
+   source:
+     repoURL: https://github.com/company/gitops-repo.git
+     targetRevision: HEAD
+     path: apps/api-server/overlays/production
+
+   destination:
+     server: https://kubernetes.default.svc
+     namespace: production
+
+   syncPolicy:
+     automated:
+       prune: true
+       selfHeal: true
+     syncOptions:
+       - CreateNamespace=true
+       - PruneLast=true
+     retry:
+       limit: 5
+       backoff:
+         duration: 5s
+         factor: 2
+         maxDuration: 3m
+ ```
+
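To see what the `retry` block above implies in practice, the Python sketch below computes the wait before each retry, assuming the usual exponential scheme: the base `duration` is multiplied by `factor` after every failed attempt and capped at `maxDuration`. The function names are hypothetical, and the exact timing inside the Argo CD controller may differ slightly.

```python
def parse_duration_seconds(value: str) -> int:
    """Parse simple durations such as "5s", "3m", or "1h" into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]


def retry_delays(limit: int, duration: str, factor: int, max_duration: str) -> list:
    """Return the delay in seconds before each of `limit` retries."""
    base = parse_duration_seconds(duration)
    cap = parse_duration_seconds(max_duration)
    return [min(base * factor**attempt, cap) for attempt in range(limit)]


if __name__ == "__main__":
    delays = retry_delays(limit=5, duration="5s", factor=2, max_duration="3m")
    print(delays, "total:", sum(delays), "seconds")
```

With the values from the manifest, the delays are 5, 10, 20, 40, and 80 seconds, so a persistently failing sync is abandoned after roughly two and a half minutes of waiting; the 3-minute cap only comes into play at higher retry limits.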
+ ---
+
+ ## Observability
+
+ ### Metrics (Prometheus)
+
+ ```yaml
+ # ServiceMonitor for automatic scraping
+ apiVersion: monitoring.coreos.com/v1
+ kind: ServiceMonitor
+ metadata:
+   name: api-server
+   labels:
+     release: prometheus
+ spec:
+   selector:
+     matchLabels:
+       app.kubernetes.io/name: api-server
+   endpoints:
+     - port: metrics
+       interval: 30s
+       path: /metrics
+ ```
+
+ ### SLO Definition
+
+ ```yaml
+ # Sloth SLO definition
+ apiVersion: sloth.slok.dev/v1
+ kind: PrometheusServiceLevel
+ metadata:
+   name: api-server-slo
+ spec:
+   service: "api-server"
+   labels:
+     team: platform
+   slos:
+     - name: "requests-availability"
+       objective: 99.9
+       description: "99.9% of requests should be successful"
+       sli:
+         events:
+           errorQuery: sum(rate(http_requests_total{job="api-server",status=~"5.."}[{{.window}}]))
+           totalQuery: sum(rate(http_requests_total{job="api-server"}[{{.window}}]))
+       alerting:
+         name: ApiServerHighErrorRate
+         pageAlert:
+           labels:
+             severity: critical
+         ticketAlert:
+           labels:
+             severity: warning
+
+     - name: "requests-latency"
+       objective: 99.0
+       description: "99% of requests should be faster than 500ms"
+       sli:
+         events:
+           # Error events are the requests slower than 500ms: all requests minus those in the le="0.5" bucket
+           errorQuery: sum(rate(http_request_duration_seconds_count{job="api-server"}[{{.window}}])) - sum(rate(http_request_duration_seconds_bucket{job="api-server",le="0.5"}[{{.window}}]))
+           totalQuery: sum(rate(http_request_duration_seconds_count{job="api-server"}[{{.window}}]))
+ ```
+
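An objective translates directly into an error budget: 99.9% leaves 0.1% of events (or of time) that may fail before the SLO is missed. The Python sketch below works through that arithmetic. The 30-day window is an assumption made for the example, and the function names are hypothetical rather than anything Sloth exposes.

```python
def error_budget_fraction(objective_percent: float) -> float:
    """Fraction of events allowed to fail, e.g. 99.9 -> 0.001."""
    return (100.0 - objective_percent) / 100.0


def allowed_downtime_minutes(objective_percent: float, window_days: int = 30) -> float:
    """Full-outage minutes the budget covers over the window."""
    return error_budget_fraction(objective_percent) * window_days * 24 * 60


def budget_remaining(objective_percent: float, total_events: int, error_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_errors = error_budget_fraction(objective_percent) * total_events
    return 1.0 - error_events / allowed_errors


def burn_rate(objective_percent: float, observed_error_ratio: float) -> float:
    """How many times faster than sustainable the budget is being consumed."""
    return observed_error_ratio / error_budget_fraction(objective_percent)


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(99.9), 1))          # about 43.2 minutes in 30 days
    print(round(budget_remaining(99.9, 1_000_000, 250), 2))  # about 0.75 of the budget left
    print(round(burn_rate(99.9, 0.0144), 1))                 # about 14.4x
```

A burn rate of 1 spends the budget exactly by the end of the window; sustained values well above 1 are what page-level alerts are meant to catch.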
+ ### Logging (Structured)
+
+ ```go
+ // Always use structured logging
+ logger.Info("request processed",
+ 	"method", r.Method,
+ 	"path", r.URL.Path,
+ 	"status", status,
+ 	"duration_ms", duration.Milliseconds(),
+ 	"request_id", requestID,
+ 	"user_id", userID,
+ )
+ ```
+
+ ### Distributed Tracing
+
+ ```go
+ // OpenTelemetry instrumentation
+ // tracer, db, and query are assumed to be defined elsewhere in the package.
+ func handleRequest(w http.ResponseWriter, r *http.Request) {
+ 	ctx, span := tracer.Start(r.Context(), "handleRequest",
+ 		trace.WithAttributes(
+ 			attribute.String("http.method", r.Method),
+ 			attribute.String("http.url", r.URL.String()),
+ 		),
+ 	)
+ 	defer span.End()
+
+ 	// Pass context to downstream calls
+ 	rows, err := db.QueryContext(ctx, query)
+ 	if err != nil {
+ 		// Record the failure on the span and stop handling the request
+ 		span.RecordError(err)
+ 		span.SetStatus(codes.Error, err.Error())
+ 		http.Error(w, "internal server error", http.StatusInternalServerError)
+ 		return
+ 	}
+ 	defer rows.Close()
+ }
+ ```
+
+ ---
+
+ ## Security
+
+ ### Policy as Code (OPA/Gatekeeper)
+
+ ```yaml
+ # Require resource limits
+ apiVersion: constraints.gatekeeper.sh/v1beta1
+ kind: K8sRequiredResources
+ metadata:
+   name: require-resource-limits
+ spec:
+   match:
+     kinds:
+       - apiGroups: [""]
+         kinds: ["Pod"]
+     namespaces:
+       - production
+       - staging
+   parameters:
+     limits:
+       - cpu
+       - memory
+     requests:
+       - cpu
+       - memory
+ ```
+
+ ### Secrets Management
+
+ ```yaml
+ # External Secrets Operator
+ apiVersion: external-secrets.io/v1beta1
+ kind: ExternalSecret
+ metadata:
+   name: api-secrets
+ spec:
+   refreshInterval: 1h
+   secretStoreRef:
+     name: vault-backend
+     kind: ClusterSecretStore
+   target:
+     name: api-secrets
+     creationPolicy: Owner
+   data:
+     - secretKey: database-url
+       remoteRef:
+         key: secret/data/api-server
+         property: database_url
+     - secretKey: api-key
+       remoteRef:
+         key: secret/data/api-server
+         property: api_key
+ ```
+
+ ### Supply Chain Security
+
+ ```yaml
+ # Kyverno policy - require signed images
+ apiVersion: kyverno.io/v1
+ kind: ClusterPolicy
+ metadata:
+   name: verify-image-signature
+ spec:
+   validationFailureAction: Enforce
+   background: false
+   rules:
+     - name: verify-signature
+       match:
+         any:
+           - resources:
+               kinds:
+                 - Pod
+       verifyImages:
+         - imageReferences:
+             - "ghcr.io/company/*"
+           attestors:
+             - entries:
+                 - keyless:
+                     subject: "https://github.com/company/*"
+                     issuer: "https://token.actions.githubusercontent.com"
+ ```
+
+ ---
+
+ ## Developer Experience
+
+ ### Golden Paths
+
+ Provide opinionated, well-supported paths for common tasks:
+
+ ```yaml
+ # Backstage Software Template
+ apiVersion: scaffolder.backstage.io/v1beta3
+ kind: Template
+ metadata:
+   name: microservice-template
+   title: Production Microservice
+   description: Create a production-ready microservice with all platform integrations
+ spec:
+   owner: platform-team
+   type: service
+
+   parameters:
+     - title: Service Information
+       required:
+         - name
+         - owner
+       properties:
+         name:
+           title: Service Name
+           type: string
+           pattern: '^[a-z0-9-]+$'
+         owner:
+           title: Owner Team
+           type: string
+           ui:field: OwnerPicker
+
+     - title: Infrastructure
+       properties:
+         database:
+           title: Database
+           type: string
+           enum: [none, postgresql, mysql]
+         cache:
+           title: Cache
+           type: string
+           enum: [none, redis, memcached]
+
+   steps:
+     - id: fetch
+       name: Fetch Template
+       action: fetch:template
+       input:
+         url: ./skeleton
+         values:
+           name: ${{ parameters.name }}
+           owner: ${{ parameters.owner }}
+
+     - id: publish
+       name: Publish to GitHub
+       action: publish:github
+       input:
+         repoUrl: github.com?repo=${{ parameters.name }}&owner=company
+
+     - id: register
+       name: Register in Catalog
+       action: catalog:register
+       input:
+         repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
+         catalogInfoPath: /catalog-info.yaml
+ ```
+
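The `pattern` on the service name is the template's only guard before that name becomes a repository and, typically, a Kubernetes resource name. The Python sketch below applies the same pattern and, as an optional stricter check, the DNS label rule Kubernetes uses for many names (lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters). Both helper names are hypothetical, and whether to tighten the template's pattern is a design choice rather than something the template requires.

```python
import re

# The pattern from the Backstage template above.
SERVICE_NAME_PATTERN = re.compile(r"^[a-z0-9-]+$")

# DNS label form: must start and end with an alphanumeric character.
DNS_LABEL_PATTERN = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_service_name(name: str) -> bool:
    """True when the name satisfies the template's pattern."""
    return SERVICE_NAME_PATTERN.fullmatch(name) is not None


def is_dns_label(name: str) -> bool:
    """True when the name is also usable as a DNS label (max 63 characters)."""
    return len(name) <= 63 and DNS_LABEL_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["payments-api", "Payments", "-", "api-"]:
        print(candidate, is_valid_service_name(candidate), is_dns_label(candidate))
```

Names such as `-` or `api-` pass the template's pattern but fail the DNS label check, which is the gap a stricter pattern would close.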
+ ### Self-Service Portal
+
+ Key capabilities to provide:
+
+ - **Environment Provisioning**: Spin up dev/preview environments on demand
+ - **Database Access**: Request read replicas or sanitized snapshots
+ - **Secret Management**: Self-service secret rotation
+ - **Monitoring Dashboards**: Auto-generated per-service dashboards
+ - **Cost Visibility**: Per-team/per-service cost attribution
+
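Cost attribution usually comes down to grouping billing line items by the tags applied at provisioning time, such as the `Team` and `CostCenter` keys in the Terraform `common_tags` local. A minimal Python sketch follows; the line-item shape (`cost_usd`, `tags`) is a hypothetical format chosen for the example, not the schema of any particular billing export.

```python
from collections import defaultdict


def cost_by_tag(line_items: list, tag: str = "Team") -> dict:
    """Sum cost per tag value; resources without the tag land in "untagged"."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag, "untagged")
        totals[owner] += item["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    items = [
        {"resource": "eks-nodegroup-a", "cost_usd": 120.0, "tags": {"Team": "payments"}},
        {"resource": "rds-main", "cost_usd": 30.5, "tags": {"Team": "payments"}},
        {"resource": "redis-cache", "cost_usd": 80.0, "tags": {"Team": "search"}},
        {"resource": "orphaned-volume", "cost_usd": 12.0, "tags": {}},
    ]
    print(cost_by_tag(items))
```

Tracking the "untagged" bucket separately makes missing tags visible, which is often the first thing a cost dashboard needs to surface.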
+ ---
+
+ ## Testing Infrastructure
+
+ ### Terraform Testing
+
+ ```go
+ // tests/vpc_test.go
+ package test
+
+ import (
+ 	"testing"
+ 	"github.com/gruntwork-io/terratest/modules/terraform"
+ 	"github.com/stretchr/testify/assert"
+ )
+
+ func TestVpcModule(t *testing.T) {
+ 	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
+ 		TerraformDir: "../modules/networking/vpc",
+ 		Vars: map[string]interface{}{
+ 			"environment": "test",
+ 			"cidr_block":  "10.0.0.0/16",
+ 		},
+ 	})
+
+ 	defer terraform.Destroy(t, terraformOptions)
+ 	terraform.InitAndApply(t, terraformOptions)
+
+ 	vpcId := terraform.Output(t, terraformOptions, "vpc_id")
+ 	assert.NotEmpty(t, vpcId)
+ }
+ ```
+
+ ### Kubernetes Testing
+
+ ```yaml
+ # Helm chart test
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   name: "{{ include "api-server.fullname" . }}-test"
+   annotations:
+     "helm.sh/hook": test
+ spec:
+   containers:
+     - name: test
+       image: curlimages/curl:latest
+       command: ['curl']
+       args:
+         - '--fail'
+         - '--silent'
+         - 'http://{{ include "api-server.fullname" . }}:{{ .Values.service.port }}/healthz'
+   restartPolicy: Never
+ ```
+
+ ### Chaos Engineering
+
+ ```yaml
+ # Chaos Mesh - Pod failure experiment
+ apiVersion: chaos-mesh.org/v1alpha1
+ kind: PodChaos
+ metadata:
+   name: api-server-pod-failure
+   namespace: chaos-testing
+ spec:
+   action: pod-failure
+   mode: one
+   duration: "30s"
+   selector:
+     namespaces:
+       - staging
+     labelSelectors:
+       app.kubernetes.io/name: api-server
+   scheduler:
+     cron: "@every 2h"
+ ```
+
+ ---
+
+ ## Definition of Done
+
+ ### Infrastructure Change
+
+ - [ ] IaC passes linting and validation
+ - [ ] Plan reviewed and approved
+ - [ ] Changes tested in non-production first
+ - [ ] Rollback procedure documented
+ - [ ] Monitoring/alerting in place
+ - [ ] Runbook updated
+ - [ ] Cost impact assessed
+ - [ ] Security review completed
+
+ ### Platform Feature
+
+ - [ ] Self-service capable (no manual intervention needed)
+ - [ ] Documentation complete (how-to, troubleshooting)
+ - [ ] Golden path integrated
+ - [ ] Metrics exposed for SLOs
+ - [ ] Tested with real workloads
+ - [ ] Feedback collected from users
+ - [ ] Support runbook created
+
+ ---
+
+ ## Common Pitfalls
+
+ ### 1. Building a Ticketing System
+
+ ❌ **Wrong**: Developers file tickets for every infrastructure change
+
+ ✅ **Right**: Build self-service automation so the platform team maintains the platform instead of working a ticket queue
+
+ ### 2. Ignoring Developer Experience
+
+ ❌ **Wrong**: Complex, undocumented processes that require tribal knowledge
+
+ ✅ **Right**: Golden paths with sensible defaults, clear documentation, quick feedback loops
+
+ ### 3. Over-Engineering
+
+ ❌ **Wrong**: Kubernetes cluster for a team of 5 running 3 services
+
+ ✅ **Right**: Right-size infrastructure to actual needs; complexity has ongoing costs
+
+ ### 4. No Error Budgets
+
+ ❌ **Wrong**: "Five nines or nothing" with no measurement
+
+ ✅ **Right**: Define SLOs, measure SLIs, use error budgets to balance reliability and velocity
+
+ ### 5. Secrets in Git
+
+ ❌ **Wrong**: Committing `.env` files or hardcoding credentials
+
+ ✅ **Right**: External secret management (Vault, AWS Secrets Manager) with dynamic injection
+
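A lightweight guard against the last pitfall is to scan changed files for credential-shaped strings before they are committed. The Python sketch below shows the idea with three illustrative patterns. The pattern set and the `scan_text` helper are assumptions made for this example; a dedicated secret scanner covers far more cases and should be preferred in practice.

```python
import re

# Illustrative patterns only; real scanners ship much larger rule sets.
PATTERNS = {
    # AWS access key IDs are 20 characters and commonly start with "AKIA".
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    # A quoted literal of 8+ characters assigned to a secret-looking name.
    "hardcoded_secret": re.compile(
        r"(?i)\b(?:password|passwd|secret|api_key)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_text(text: str) -> list:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


if __name__ == "__main__":
    sample = (
        'region = "us-east-1"\n'
        'aws_key = "AKIAIOSFODNN7EXAMPLE"\n'
        'password = "hunter2hunter2"\n'
    )
    for lineno, name in scan_text(sample):
        print(f"line {lineno}: possible {name}")
```

Run as a pre-commit hook or CI step, a check like this fails fast; it complements external secret management rather than replacing it.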
+ ---
+
+ ## Resources
+
+ - [Platform Engineering Maturity Model](https://platformengineering.org/maturity-model)
+ - [Terraform Best Practices](https://www.terraform-best-practices.com/)
+ - [Kubernetes Patterns](https://k8spatterns.io/)
+ - [Site Reliability Engineering (Google)](https://sre.google/sre-book/table-of-contents/)
+ - [The Platform Engineering Guide](https://platformengineering.org/)
+ - [CNCF Landscape](https://landscape.cncf.io/)