aigent-team 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +253 -0
- package/dist/chunk-N3RYHWTR.js +267 -0
- package/dist/cli.js +576 -0
- package/dist/index.d.ts +234 -0
- package/dist/index.js +27 -0
- package/package.json +67 -0
- package/templates/shared/git-workflow.md +44 -0
- package/templates/shared/project-conventions.md +48 -0
- package/templates/teams/ba/agent.yaml +25 -0
- package/templates/teams/ba/references/acceptance-criteria.md +87 -0
- package/templates/teams/ba/references/api-contract-design.md +110 -0
- package/templates/teams/ba/references/requirements-analysis.md +83 -0
- package/templates/teams/ba/references/user-story-mapping.md +73 -0
- package/templates/teams/ba/skill.md +85 -0
- package/templates/teams/be/agent.yaml +34 -0
- package/templates/teams/be/conventions.md +102 -0
- package/templates/teams/be/references/api-design.md +91 -0
- package/templates/teams/be/references/async-processing.md +86 -0
- package/templates/teams/be/references/auth-security.md +58 -0
- package/templates/teams/be/references/caching.md +79 -0
- package/templates/teams/be/references/database.md +65 -0
- package/templates/teams/be/references/error-handling.md +106 -0
- package/templates/teams/be/references/observability.md +83 -0
- package/templates/teams/be/references/review-checklist.md +50 -0
- package/templates/teams/be/references/testing.md +100 -0
- package/templates/teams/be/review-checklist.md +54 -0
- package/templates/teams/be/skill.md +71 -0
- package/templates/teams/devops/agent.yaml +35 -0
- package/templates/teams/devops/conventions.md +133 -0
- package/templates/teams/devops/references/ci-cd.md +218 -0
- package/templates/teams/devops/references/cost-optimization.md +218 -0
- package/templates/teams/devops/references/disaster-recovery.md +199 -0
- package/templates/teams/devops/references/docker.md +237 -0
- package/templates/teams/devops/references/infrastructure-as-code.md +238 -0
- package/templates/teams/devops/references/kubernetes.md +397 -0
- package/templates/teams/devops/references/monitoring.md +224 -0
- package/templates/teams/devops/references/review-checklist.md +149 -0
- package/templates/teams/devops/references/security.md +225 -0
- package/templates/teams/devops/review-checklist.md +72 -0
- package/templates/teams/devops/skill.md +131 -0
- package/templates/teams/fe/agent.yaml +28 -0
- package/templates/teams/fe/conventions.md +80 -0
- package/templates/teams/fe/references/accessibility.md +92 -0
- package/templates/teams/fe/references/component-architecture.md +87 -0
- package/templates/teams/fe/references/css-styling.md +89 -0
- package/templates/teams/fe/references/forms.md +73 -0
- package/templates/teams/fe/references/performance.md +104 -0
- package/templates/teams/fe/references/review-checklist.md +51 -0
- package/templates/teams/fe/references/security.md +90 -0
- package/templates/teams/fe/references/state-management.md +117 -0
- package/templates/teams/fe/references/testing.md +112 -0
- package/templates/teams/fe/review-checklist.md +53 -0
- package/templates/teams/fe/skill.md +68 -0
- package/templates/teams/lead/agent.yaml +18 -0
- package/templates/teams/lead/references/cross-team-coordination.md +68 -0
- package/templates/teams/lead/references/quality-gates.md +64 -0
- package/templates/teams/lead/references/task-decomposition.md +69 -0
- package/templates/teams/lead/skill.md +83 -0
- package/templates/teams/qa/agent.yaml +32 -0
- package/templates/teams/qa/conventions.md +130 -0
- package/templates/teams/qa/references/ci-integration.md +337 -0
- package/templates/teams/qa/references/e2e-testing.md +292 -0
- package/templates/teams/qa/references/mocking.md +249 -0
- package/templates/teams/qa/references/performance-testing.md +288 -0
- package/templates/teams/qa/references/review-checklist.md +143 -0
- package/templates/teams/qa/references/security-testing.md +271 -0
- package/templates/teams/qa/references/test-data.md +275 -0
- package/templates/teams/qa/references/test-strategy.md +192 -0
- package/templates/teams/qa/review-checklist.md +53 -0
- package/templates/teams/qa/skill.md +131 -0
|
@@ -0,0 +1,199 @@ package/templates/teams/devops/references/disaster-recovery.md

# Disaster Recovery Reference

## RPO / RTO Tiers

| Tier | RPO (Data Loss) | RTO (Downtime) | Strategy | Example Services |
|---|---|---|---|---|
| **Tier 1 — Critical** | < 1 min | < 15 min | Active-active multi-region, synchronous replication | Payment processing, auth |
| **Tier 2 — High** | < 15 min | < 1 hour | Warm standby, async replication, automated failover | User API, core business logic |
| **Tier 3 — Standard** | < 4 hours | < 4 hours | Pilot light, daily snapshots, semi-automated recovery | Internal tools, reporting |
| **Tier 4 — Low** | < 24 hours | < 24 hours | Backup/restore from cold storage | Dev environments, archives |

### Assigning Tiers

- Product and engineering jointly classify each service.
- Classification is reviewed quarterly.
- The tier drives backup frequency, replication strategy, and testing cadence.
- Document tier assignments in the service catalog.

---

## Backup Verification

### Backup Schedule

| Data Type | Frequency | Retention | Storage | Encryption |
|---|---|---|---|---|
| Database (relational) | Continuous WAL + daily snapshot | 30 days | Cross-region S3/GCS | AES-256 |
| Database (NoSQL) | Hourly snapshot | 14 days | Cross-region S3/GCS | AES-256 |
| Object storage | Cross-region replication | Versioned, 90-day lifecycle | Secondary region | SSE-S3/KMS |
| Configuration/secrets | Git + daily Vault snapshots | 90 days | Separate account | Vault seal |
| Kubernetes state | etcd snapshot every 6 hours | 7 days | S3 with Object Lock | AES-256 |

### Verification Rules

1. **Automated restore tests.** Weekly: pick a random Tier 1/2 backup, restore
   it to an isolated environment, and run validation queries.
2. **Measure actual RPO.** Compare the latest backup timestamp to the current
   time. Alert if the gap exceeds the tier's RPO.
3. **Measure actual RTO.** Time the restore process end-to-end. Alert if it
   exceeds the tier's RTO.
4. **Verify data integrity.** Run checksums, row counts, and application-level
   consistency checks after restore.
5. **Test cross-region.** Monthly: restore from a backup in the secondary region.

### Backup Anti-Patterns

- Backups that have never been restored are not backups — they are hopes.
- Backing up the database but not the schema migrations.
- Storing backups in the same account/region as the primary.
- No monitoring of backup job success/failure.
- Relying solely on the cloud provider's point-in-time recovery without testing it.

---

## Failover Testing

### Failover Types

| Type | Scope | Frequency | Duration |
|---|---|---|---|
| Component failover | Single service (kill a pod or instance) | Weekly (automated) | Minutes |
| Zone failover | Simulated AZ loss | Monthly | 1-2 hours |
| Region failover | Full region switchover | Quarterly | 2-4 hours |
| Dependency failover | External service unavailable | Monthly | 1 hour |

### Failover Procedure Template

```markdown
## Failover: <Component/Zone/Region>

### Pre-Flight
- [ ] Notify stakeholders (Slack #incidents)
- [ ] Confirm monitoring is healthy (baseline)
- [ ] Confirm the rollback procedure is ready
- [ ] Set a maintenance window in PagerDuty

### Execution
1. <Step-by-step actions to trigger failover>
2. Observe the traffic shift in dashboards
3. Verify service health in the secondary

### Validation
- [ ] All health checks passing
- [ ] Error rate within SLO
- [ ] Latency within SLO
- [ ] No data loss detected
- [ ] Customer-facing functionality verified

### Rollback
1. <Steps to revert to primary>

### Post-Test
- [ ] Document the actual RTO (time from trigger to healthy)
- [ ] Document any issues encountered
- [ ] Create tickets for improvements
- [ ] Update runbooks
```

---

## Chaos Engineering

### Progressive Approach

```
Level 1: Kill a pod (weekly, automated)
Level 2: Inject latency into a dependency (monthly)
Level 3: Simulate an AZ failure (monthly)
Level 4: Simulate a region failure (quarterly)
Level 5: Simulate total dependency loss (quarterly)
```

### Chaos Experiments

| Experiment | Tool | What It Tests |
|---|---|---|
| Pod kill | Chaos Mesh, Litmus | Auto-restart, PDB, replica count |
| Network latency | tc, Chaos Mesh | Timeout handling, circuit breakers |
| Network partition | iptables, Chaos Mesh | Split-brain handling, retries |
| Disk fill | Stress tools | Alerting, log rotation, eviction |
| CPU stress | stress-ng | Autoscaling, request throttling |
| DNS failure | CoreDNS mutation | Caching, fallback behavior |
| External API failure | Toxiproxy | Circuit breakers, graceful degradation |

### Rules

- Always run chaos experiments during business hours with the team present.
- Start in staging. Graduate to production only with confidence.
- Define abort conditions before starting.
- Never run chaos without functioning monitoring and alerting.
- Document findings and create remediation tickets.
- Chaos engineering is not "break things" — it is "verify resilience hypotheses."

---

## Runbook Requirements

Every Tier 1 and Tier 2 service must have runbooks for:

| Scenario | Content |
|---|---|
| Service down | Diagnosis steps, restart procedure, failover steps |
| Database failover | Promotion steps, connection string update, data verification |
| Dependency outage | Graceful degradation, circuit breaker verification |
| Data corruption | Isolation, backup identification, restore procedure |
| Security incident | Containment, investigation, communication |
| Region failover | DNS switch, traffic routing, data sync verification |

### Runbook Quality Checklist

- [ ] Written for someone who has never seen the service before.
- [ ] Includes exact commands, not "restart the service."
- [ ] Includes the expected output of each command.
- [ ] Links to dashboards, logs, and relevant architecture docs.
- [ ] Tested by someone other than the author.
- [ ] Updated within the last 90 days.
- [ ] Includes the estimated time to execute.
- [ ] Includes escalation contacts with phone numbers.
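Parts of this checklist can be linted automatically; a sketch that assumes each runbook carries a `Last updated: YYYY-MM-DD` line and a fixed set of headings (that convention, the section names, and the helper are our assumptions, not part of this package):

```python
import re
from datetime import date

# Illustrative required headings; adjust to your runbook template
REQUIRED_SECTIONS = ("## Diagnosis", "## Procedure", "## Escalation")

def runbook_issues(text: str, today: date, max_age_days: int = 90) -> list[str]:
    """Flag missing sections and staleness beyond the 90-day freshness rule."""
    issues = [s for s in REQUIRED_SECTIONS if s not in text]
    m = re.search(r"Last updated:\s*(\d{4})-(\d{2})-(\d{2})", text)
    if not m:
        issues.append("no 'Last updated' line")
    elif (today - date(*map(int, m.groups()))).days > max_age_days:
        issues.append("stale: updated more than 90 days ago")
    return issues
```

Run it over every runbook in CI and fail the build on a non-empty result.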

---

## Recovery Playbooks

### Database Recovery

```
1. Identify the failure (monitoring alert, health check)
2. Assess: is it a replica or a primary failure?
   ├── Replica: remove from pool, investigate, rebuild from snapshot
   └── Primary:
       a. Promote a replica to primary (automated or manual)
       b. Update connection endpoints
       c. Verify data consistency
       d. Rebuild the old primary as a new replica
3. Post-recovery: verify application health, check replication lag
4. Postmortem within 48 hours
```

### Complete Environment Recovery

```
1. Provision infrastructure (terraform apply — from git)
2. Restore databases (from the latest verified backup)
3. Deploy applications (from the container registry — images already built)
4. Restore configuration (from git + the secrets manager)
5. Verify: health checks, smoke tests, data integrity
6. Update DNS / traffic routing
7. Monitor for 1 hour before declaring recovery complete
```

### Key Principle

Everything needed to rebuild from scratch must live in version control or
automated backups. If you lose the primary region, you should be able to
reconstruct the entire environment from:
- Git (infrastructure code, application code, configs, dashboards, alerts)
- Container registry (built images)
- Backup storage (databases, state files)
- Secrets manager (credentials)
@@ -0,0 +1,237 @@ package/templates/teams/devops/references/docker.md

# Docker Reference

## Base Image Selection

| Language | Dev/Build Stage | Production Stage | Target Size |
|---|---|---|---|
| Go | `golang:1.22-alpine` | `gcr.io/distroless/static-debian12` | < 20 MB |
| Node.js | `node:20-alpine` | `node:20-alpine` or `gcr.io/distroless/nodejs20-debian12` | < 150 MB |
| Python | `python:3.12-slim` | `python:3.12-slim` | < 200 MB |
| Java | `eclipse-temurin:21-jdk-alpine` | `eclipse-temurin:21-jre-alpine` or distroless | < 250 MB |
| Rust | `rust:1.77-alpine` | `gcr.io/distroless/cc-debian12` | < 30 MB |

### Rules

- Never use `:latest`. Pin major.minor at minimum.
- Prefer Alpine or distroless in production; fall back to Debian slim if Alpine
  causes musl compatibility issues.
- Distroless means no shell and no package manager. Best for compiled languages.
- Rebuild base images weekly to pick up security patches.

---

## Multi-Stage Build Pattern

```dockerfile
# ---- Build stage ----
FROM golang:1.22-alpine AS build
WORKDIR /src

COPY go.mod go.sum ./
RUN go mod download

COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /app ./cmd/server

# ---- Production stage ----
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
USER nonroot:nonroot
EXPOSE 8080
ENTRYPOINT ["/app"]
```

### Key Points

- Build dependencies stay in the build stage — they never leak into production.
- Copy dependency manifests first, then source code (layer caching).
- Strip debug symbols from compiled binaries (`-ldflags="-s -w"` for Go).
- Use `COPY --from=build` to pull in only the final artifact.

---

## Security Hardening

### Non-Root Execution

```dockerfile
# In an Alpine-based final stage, create an unprivileged user and switch to it
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser:appgroup
```

Or, with distroless, use the built-in `nonroot` user:
```dockerfile
USER nonroot:nonroot
```

### Read-Only Filesystem

In Kubernetes, set `readOnlyRootFilesystem: true` in the security context.
If the app needs to write temp files, mount an `emptyDir` at the specific path.

### Drop All Capabilities

```yaml
securityContext:
  capabilities:
    drop: ["ALL"]
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
```

### No Secrets in Images

- Never `COPY .env` or `COPY credentials`.
- Never use `ARG` or `ENV` for secrets — they persist in image layers.
- Pass secrets at runtime via mounted volumes or environment variables sourced
  from a secrets manager.
- Use `docker history` to verify no secrets leaked into layers.
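The `docker history` check is easy to automate; a sketch that flags secret-looking `ENV`/`ARG` assignments, fed with the lines produced by `docker history --no-trunc --format '{{.CreatedBy}}' <image>` (the patterns and helper name are illustrative):

```python
import re

# Heuristic: ENV/ARG layer commands whose variable name looks like a credential
SECRET_PATTERN = re.compile(
    r"(?:ENV|ARG)\s+\w*(?:SECRET|TOKEN|PASSWORD|API_?KEY)\w*=", re.IGNORECASE
)

def suspicious_layers(created_by_lines: list[str]) -> list[str]:
    """Return layer commands that appear to bake a secret into the image."""
    return [line for line in created_by_lines if SECRET_PATTERN.search(line)]
```

A non-empty result should fail the build; it is a heuristic, so review hits rather than trusting them blindly.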

---

## .dockerignore

Every project must have a `.dockerignore`:

```
.git
.github
.gitignore
.env*
*.md
docs/
node_modules/
__pycache__/
*.pyc
.pytest_cache/
coverage/
.nyc_output/
dist/
build/
*.log
docker-compose*.yml
Dockerfile*
.dockerignore
terraform/
k8s/
helm/
.vscode/
.idea/
```

### Why It Matters

- Reduces build-context size (faster builds).
- Prevents secrets (`.env`) from entering the image.
- Avoids cache-busting from irrelevant file changes.

---

## Layer Ordering

Order instructions from least frequently changed to most frequently changed:

```dockerfile
# 1. Base image (changes rarely)
FROM node:20-alpine

# 2. System dependencies (change rarely)
RUN apk add --no-cache dumb-init

# 3. Working directory
WORKDIR /app

# 4. Dependency manifest (changes sometimes)
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# 5. Application code (changes often)
COPY src/ ./src/

# 6. Runtime config
USER node
EXPOSE 3000
CMD ["dumb-init", "node", "src/index.js"]
```

### Layer Cache Tips

- Use separate `COPY` instructions for dependency files vs. source code.
- Use `--mount=type=cache` for package-manager caches (BuildKit):
  ```dockerfile
  RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
  ```
- Combine logically related `RUN` commands to reduce layers:
  ```dockerfile
  RUN apt-get update && apt-get install -y --no-install-recommends \
        curl ca-certificates && \
      rm -rf /var/lib/apt/lists/*
  ```

---

## Health Checks

```dockerfile
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD ["/app", "healthcheck"]
```

Or, for HTTP services:
```dockerfile
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/healthz || exit 1
```

### Rules

- Define a health check in either the Dockerfile (`HEALTHCHECK`) or Kubernetes
  probes, not both — Kubernetes ignores Docker health checks and uses its own
  probes.
- Use a dedicated health endpoint, not the main route.
- The health check should verify the app can serve traffic, not just that the
  process is alive.

---

## Image Size Targets

| Category | Target | Action if Exceeded |
|---|---|---|
| Static binary (Go/Rust) | < 30 MB | Check for embedded assets, strip symbols |
| Node.js | < 150 MB | Audit `node_modules`, omit dev dependencies |
| Python | < 200 MB | Remove build deps, use a slim base |
| Java | < 250 MB | Use a JRE not a JDK, build a jlink custom runtime |
| General | < 500 MB | Investigate — build deps likely leaked |

### Checking Size

```bash
# Size of the final image
docker images myapp:latest --format "{{.Size}}"

# Layer-by-layer breakdown
docker history myapp:latest

# Deep analysis
dive myapp:latest
```

---

## Build Best Practices

- Enable BuildKit: `DOCKER_BUILDKIT=1`.
- Tag with the git SHA for traceability: `myapp:abc1234`.
- Also tag with semver when releasing: `myapp:1.2.3`.
- Scan images before pushing:
  ```bash
  trivy image myapp:abc1234
  ```
- Push to a private registry (ECR, GCR, ACR, Harbor). Never deploy from
  Docker Hub in production.
- Set `imagePullPolicy: IfNotPresent` in K8s when using immutable SHA tags.
- Use `.dockerignore` in every project — no exceptions.
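The size targets above can be enforced as a CI gate; a sketch that parses the human-readable value printed by `docker images --format "{{.Size}}"` (the helper names are ours, and Docker prints decimal units such as `MB`):

```python
import re

UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9}

def size_bytes(human: str) -> float:
    """Convert a Docker-style size string such as '142MB' or '1.2GB' to bytes."""
    m = re.fullmatch(r"([\d.]+)\s*(B|kB|MB|GB)", human.strip())
    if not m:
        raise ValueError(f"unparseable size: {human!r}")
    return float(m.group(1)) * UNITS[m.group(2)]

def within_target(human: str, target_mb: int) -> bool:
    """Check an image size against a target expressed in megabytes."""
    return size_bytes(human) <= target_mb * 10**6
```

A CI step would capture the `docker images` output for the freshly built tag and fail the job when `within_target` returns `False`.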
@@ -0,0 +1,238 @@ package/templates/teams/devops/references/infrastructure-as-code.md

# Infrastructure as Code Reference

## Terraform Directory Structure

```
infra/
├── modules/                  # Reusable modules (internal registry)
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── eks-cluster/
│   ├── rds/
│   └── s3-bucket/
├── environments/
│   ├── dev/
│   │   ├── main.tf           # Module calls with dev params
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars  # Dev-specific values
│   │   ├── backend.tf        # Remote state config for dev
│   │   └── providers.tf
│   ├── staging/
│   └── production/
├── global/                   # Shared resources (IAM, DNS zones)
│   ├── iam/
│   ├── dns/
│   └── ecr/
└── scripts/
    ├── plan.sh
    ├── apply.sh
    └── import.sh
```

### Rules

- One state file per environment per component. Never share state across envs.
- `global/` resources are applied once and referenced via `terraform_remote_state`
  or SSM parameters.
- Modules live in a separate repo or in `modules/`, with semantic versions.

---

## State Management

### Remote Backend (AWS Example)

```hcl
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "environments/production/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

### State Rules

1. **Always remote.** Never commit `.tfstate` to git.
2. **Always locked.** DynamoDB (AWS), GCS (GCP), or Terraform Cloud.
3. **Always encrypted.** Enable server-side encryption on the bucket.
4. **State per component.** Split large configs: `network`, `compute`, `data`,
   `monitoring`. Cross-reference with `terraform_remote_state`.
5. **Never hand-edit state.** Use `terraform state mv`, `terraform import`, and
   `terraform state rm`.

### State Key Convention

```
environments/{env}/{component}/terraform.tfstate
```

Examples: `environments/production/network/terraform.tfstate`,
`environments/staging/eks/terraform.tfstate`.
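A small helper keeps generated keys consistent with this convention; a sketch (the environment whitelist is an assumption based on the directory layout above):

```python
VALID_ENVS = {"dev", "staging", "production"}

def state_key(env: str, component: str) -> str:
    """Build a key following environments/{env}/{component}/terraform.tfstate."""
    if env not in VALID_ENVS:
        raise ValueError(f"unknown environment: {env}")
    return f"environments/{env}/{component}/terraform.tfstate"
```

Scripts that template `backend.tf` files can call this instead of hand-writing keys.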

---

## Plan-Review-Apply Cycle

```
Developer pushes IaC change
  → CI runs `terraform fmt -check`
  → CI runs `terraform validate`
  → CI runs `tflint` / `checkov` / `tfsec`
  → CI runs `terraform plan` and posts the output to the PR
  → Reviewer approves the plan output (not just the code)
  → Merge triggers `terraform apply` with the saved plan file
  → Apply output posted to the PR / Slack
```

### Plan Safety

- Always save the plan to a file: `terraform plan -out=tfplan`.
- Apply that exact plan: `terraform apply tfplan`.
- Never run `terraform apply` without a saved plan in CI.
- Require plan-output review for any change touching production.

### Drift Detection

- Schedule a weekly `terraform plan` in CI (no apply).
- Alert if drift is detected (a non-empty plan on unchanged code).
- Investigate and reconcile — do not auto-apply drift corrections.
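The weekly drift check can key off `terraform plan -detailed-exitcode`, which exits 0 when the plan is empty, 2 when changes are present, and 1 on error; a sketch (the wrapper and the `-lock=false` choice for a read-only check are ours):

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`:
#   0 = no changes, 1 = error, 2 = changes present (drift on unchanged code)
def classify_plan_exit(code: int) -> str:
    return {0: "clean", 2: "drift"}.get(code, "error")

def check_drift(workdir: str) -> str:
    # Read-only plan; -lock=false avoids blocking concurrent real applies
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workdir, capture_output=True, text=True,
    )
    return classify_plan_exit(result.returncode)
```

A scheduled CI job would run `check_drift` per component and alert on `"drift"` or `"error"`.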

---

## Module Versioning

### Semantic Versioning

```hcl
module "vpc" {
  source = "git::https://github.com/company/tf-modules.git//vpc?ref=v2.1.0"
}
```

- **Major** (v2 → v3): breaking changes — a variable removed, a resource recreated.
- **Minor** (v2.1 → v2.2): new features — a new optional variable or output.
- **Patch** (v2.1.0 → v2.1.1): bug fixes — no interface change.

### Rules

- Pin module versions in all environments. Never use `ref=main`.
- Upgrade dev first, validate, then promote to staging, then production.
- Module changes require a CHANGELOG entry.
- Modules must have a `variables.tf` with a description and type for every input.
- Modules must have an `outputs.tf` exposing the values consumers need.
+
|
|
132
|
+
---
|
|
133
|
+
|
|
134
|
+
## Import Before Recreate
|
|
135
|
+
|
|
136
|
+
When Terraform wants to destroy and recreate a critical resource:
|
|
137
|
+
|
|
138
|
+
1. **Stop.** Review the plan carefully.
|
|
139
|
+
2. Check if `lifecycle { prevent_destroy = true }` should be set.
|
|
140
|
+
3. If migrating existing infrastructure into Terraform:
|
|
141
|
+
```bash
|
|
142
|
+
terraform import aws_db_instance.main my-database-id
|
|
143
|
+
```
|
|
144
|
+
4. After import, run `plan` — adjust config until plan shows no changes.
|
|
145
|
+
5. For state moves (renaming resources):
|
|
146
|
+
```bash
|
|
147
|
+
terraform state mv aws_s3_bucket.old aws_s3_bucket.new
|
|
148
|
+
```
|
|
149
|
+
6. Use `moved` blocks (Terraform 1.1+) for refactoring:
|
|
150
|
+
```hcl
|
|
151
|
+
moved {
|
|
152
|
+
from = aws_s3_bucket.old
|
|
153
|
+
to = aws_s3_bucket.new
|
|
154
|
+
}
|
|
155
|
+
```

---

## Tagging Policy

Every cloud resource must carry these tags:

| Tag | Example | Purpose |
|---|---|---|
| `Environment` | `production` | Cost allocation, access control |
| `Team` | `platform` | Ownership |
| `Service` | `user-api` | Dependency mapping |
| `ManagedBy` | `terraform` | Drift tracking |
| `CostCenter` | `eng-platform` | Finance reporting |
| `Repository` | `github.com/co/infra` | Code traceability |

### Enforcement

```hcl
# In the provider or module
default_tags {
  tags = {
    Environment = var.environment
    Team        = var.team
    ManagedBy   = "terraform"
    Repository  = var.repository
  }
}
```

- Use AWS Config rules, GCP Organization Policies, or OPA/Sentinel to reject
  untagged resources.
- Add a CI lint step that checks every resource block for the required tags.
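A sketch of such a lint step, operating on a resource's effective tag map after the `default_tags` merge (the helper is ours; authoritative enforcement would still come from AWS Config, Organization Policies, or OPA/Sentinel as noted above):

```python
# Required tags, mirroring the tagging policy table above
REQUIRED_TAGS = {"Environment", "Team", "Service", "ManagedBy",
                 "CostCenter", "Repository"}

def missing_tags(effective_tags: dict) -> set:
    """Return the required tags absent from a resource's effective tag map."""
    return REQUIRED_TAGS - effective_tags.keys()
```

CI would compute each resource's merged tags (for example from `terraform plan -json` output) and fail when any resource yields a non-empty set.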

---

## Resource Naming Convention

```
{company}-{environment}-{region}-{service}-{resource_type}[-{qualifier}]
```

Examples:
- `acme-prod-use1-userapi-rds-primary`
- `acme-stg-euw1-platform-eks`
- `acme-dev-use1-shared-vpc`

### Rules

- Lowercase, hyphens only (no underscores — some cloud providers reject them).
- Max 63 characters (the DNS label limit).
- Abbreviate the region: `use1` = us-east-1, `euw1` = eu-west-1.
- Use variables and `locals` to construct names — never hardcode them.

```hcl
locals {
  name_prefix = "${var.company}-${var.environment}-${var.region_short}-${var.service}"
}

resource "aws_s3_bucket" "assets" {
  bucket = "${local.name_prefix}-assets"
}
```
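The convention can also be checked mechanically; a sketch of a validator for CI (the helper name and regex are ours):

```python
import re

# Lowercase alphanumeric segments separated by single hyphens, starting with a letter
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")

def valid_resource_name(name: str) -> bool:
    """Check the convention: lowercase, hyphens only, at most 63 characters."""
    return len(name) <= 63 and bool(NAME_RE.fullmatch(name))
```

Useful as a guard in name-generating scripts, or as a lint over names extracted from plan output.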

---

## Terraform Style Guide

- Run `terraform fmt` on every save. Enforce it in CI.
- One resource type per file for large configs, or group by logical domain.
- Variables: always include `description`, `type`, and a `default` (if optional).
- Outputs: always include `description`.
- Use `count` for simple toggles and `for_each` for collections.
- Avoid `depends_on` unless absolutely necessary — prefer implicit dependencies.
- Keep provider versions pinned with `~>` (the pessimistic constraint):
  ```hcl
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  ```