aigent-team 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71)
  1. package/LICENSE +21 -0
  2. package/README.md +253 -0
  3. package/dist/chunk-N3RYHWTR.js +267 -0
  4. package/dist/cli.js +576 -0
  5. package/dist/index.d.ts +234 -0
  6. package/dist/index.js +27 -0
  7. package/package.json +67 -0
  8. package/templates/shared/git-workflow.md +44 -0
  9. package/templates/shared/project-conventions.md +48 -0
  10. package/templates/teams/ba/agent.yaml +25 -0
  11. package/templates/teams/ba/references/acceptance-criteria.md +87 -0
  12. package/templates/teams/ba/references/api-contract-design.md +110 -0
  13. package/templates/teams/ba/references/requirements-analysis.md +83 -0
  14. package/templates/teams/ba/references/user-story-mapping.md +73 -0
  15. package/templates/teams/ba/skill.md +85 -0
  16. package/templates/teams/be/agent.yaml +34 -0
  17. package/templates/teams/be/conventions.md +102 -0
  18. package/templates/teams/be/references/api-design.md +91 -0
  19. package/templates/teams/be/references/async-processing.md +86 -0
  20. package/templates/teams/be/references/auth-security.md +58 -0
  21. package/templates/teams/be/references/caching.md +79 -0
  22. package/templates/teams/be/references/database.md +65 -0
  23. package/templates/teams/be/references/error-handling.md +106 -0
  24. package/templates/teams/be/references/observability.md +83 -0
  25. package/templates/teams/be/references/review-checklist.md +50 -0
  26. package/templates/teams/be/references/testing.md +100 -0
  27. package/templates/teams/be/review-checklist.md +54 -0
  28. package/templates/teams/be/skill.md +71 -0
  29. package/templates/teams/devops/agent.yaml +35 -0
  30. package/templates/teams/devops/conventions.md +133 -0
  31. package/templates/teams/devops/references/ci-cd.md +218 -0
  32. package/templates/teams/devops/references/cost-optimization.md +218 -0
  33. package/templates/teams/devops/references/disaster-recovery.md +199 -0
  34. package/templates/teams/devops/references/docker.md +237 -0
  35. package/templates/teams/devops/references/infrastructure-as-code.md +238 -0
  36. package/templates/teams/devops/references/kubernetes.md +397 -0
  37. package/templates/teams/devops/references/monitoring.md +224 -0
  38. package/templates/teams/devops/references/review-checklist.md +149 -0
  39. package/templates/teams/devops/references/security.md +225 -0
  40. package/templates/teams/devops/review-checklist.md +72 -0
  41. package/templates/teams/devops/skill.md +131 -0
  42. package/templates/teams/fe/agent.yaml +28 -0
  43. package/templates/teams/fe/conventions.md +80 -0
  44. package/templates/teams/fe/references/accessibility.md +92 -0
  45. package/templates/teams/fe/references/component-architecture.md +87 -0
  46. package/templates/teams/fe/references/css-styling.md +89 -0
  47. package/templates/teams/fe/references/forms.md +73 -0
  48. package/templates/teams/fe/references/performance.md +104 -0
  49. package/templates/teams/fe/references/review-checklist.md +51 -0
  50. package/templates/teams/fe/references/security.md +90 -0
  51. package/templates/teams/fe/references/state-management.md +117 -0
  52. package/templates/teams/fe/references/testing.md +112 -0
  53. package/templates/teams/fe/review-checklist.md +53 -0
  54. package/templates/teams/fe/skill.md +68 -0
  55. package/templates/teams/lead/agent.yaml +18 -0
  56. package/templates/teams/lead/references/cross-team-coordination.md +68 -0
  57. package/templates/teams/lead/references/quality-gates.md +64 -0
  58. package/templates/teams/lead/references/task-decomposition.md +69 -0
  59. package/templates/teams/lead/skill.md +83 -0
  60. package/templates/teams/qa/agent.yaml +32 -0
  61. package/templates/teams/qa/conventions.md +130 -0
  62. package/templates/teams/qa/references/ci-integration.md +337 -0
  63. package/templates/teams/qa/references/e2e-testing.md +292 -0
  64. package/templates/teams/qa/references/mocking.md +249 -0
  65. package/templates/teams/qa/references/performance-testing.md +288 -0
  66. package/templates/teams/qa/references/review-checklist.md +143 -0
  67. package/templates/teams/qa/references/security-testing.md +271 -0
  68. package/templates/teams/qa/references/test-data.md +275 -0
  69. package/templates/teams/qa/references/test-strategy.md +192 -0
  70. package/templates/teams/qa/review-checklist.md +53 -0
  71. package/templates/teams/qa/skill.md +131 -0
--- /dev/null
+++ package/templates/teams/devops/references/disaster-recovery.md
@@ -0,0 +1,199 @@
# Disaster Recovery Reference

## RPO / RTO Tiers

| Tier | RPO (Data Loss) | RTO (Downtime) | Strategy | Example Services |
|---|---|---|---|---|
| **Tier 1 — Critical** | < 1 min | < 15 min | Active-active multi-region, synchronous replication | Payment processing, auth |
| **Tier 2 — High** | < 15 min | < 1 hour | Warm standby, async replication, automated failover | User API, core business logic |
| **Tier 3 — Standard** | < 4 hours | < 4 hours | Pilot light, daily snapshots, semi-automated recovery | Internal tools, reporting |
| **Tier 4 — Low** | < 24 hours | < 24 hours | Backup/restore from cold storage | Dev environments, archives |

### Assigning Tiers

- Product and engineering jointly classify each service.
- Classification is reviewed quarterly.
- Tier drives backup frequency, replication strategy, and testing cadence.
- Document tier assignment in the service catalog.

---

## Backup Verification

### Backup Schedule

| Data Type | Frequency | Retention | Storage | Encryption |
|---|---|---|---|---|
| Database (relational) | Continuous WAL + daily snapshot | 30 days | Cross-region S3/GCS | AES-256 |
| Database (NoSQL) | Hourly snapshot | 14 days | Cross-region S3/GCS | AES-256 |
| Object storage | Cross-region replication | Versioned, 90-day lifecycle | Secondary region | SSE-S3/KMS |
| Configuration/secrets | Git + Vault snapshots daily | 90 days | Separate account | Vault seal |
| Kubernetes state | etcd snapshot every 6 hours | 7 days | S3 with Object Lock | AES-256 |

### Verification Rules

1. **Automated restore tests.** Weekly: pick a random Tier 1/2 backup, restore
   to an isolated environment, run validation queries.
2. **Measure actual RPO.** Compare the latest backup timestamp to the current
   time. Alert if the gap exceeds the tier RPO.
3. **Measure actual RTO.** Time the restore process end-to-end. Alert if it
   exceeds the tier RTO.
4. **Verify data integrity.** Run checksums, row counts, and application-level
   consistency checks after restore.
5. **Test cross-region.** Monthly: restore from a backup in the secondary region.

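Rule 2 reduces to a timestamp comparison against the tier budgets from the
table above; a minimal sketch (the `RPO_BUDGET` mapping and `rpo_breached`
name are illustrative, not part of the package):

```python
from datetime import datetime, timedelta, timezone

# Tier RPO budgets, taken from the RPO / RTO Tiers table above.
RPO_BUDGET = {
    1: timedelta(minutes=1),
    2: timedelta(minutes=15),
    3: timedelta(hours=4),
    4: timedelta(hours=24),
}

def rpo_breached(latest_backup, tier, now=None):
    """True if the gap since the last successful backup exceeds the tier RPO."""
    now = now or datetime.now(timezone.utc)
    return (now - latest_backup) > RPO_BUDGET[tier]
```

A monitoring job would feed `latest_backup` from backup-job metadata and page
when this returns `True`.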
### Backup Anti-Patterns

- Backups that have never been restored are not backups — they are hopes.
- Backing up only the database but not the schema migrations.
- Storing backups in the same account/region as the primary.
- No monitoring on backup job success/failure.
- Relying solely on cloud provider point-in-time recovery without testing it.

---

## Failover Testing

### Failover Types

| Type | Scope | Frequency | Duration |
|---|---|---|---|
| Component failover | Single service (kill pod, instance) | Weekly (automated) | Minutes |
| Zone failover | Simulate AZ loss | Monthly | 1-2 hours |
| Region failover | Full region switchover | Quarterly | 2-4 hours |
| Dependency failover | External service unavailable | Monthly | 1 hour |

### Failover Procedure Template

```markdown
## Failover: <Component/Zone/Region>

### Pre-Flight
- [ ] Notify stakeholders (Slack #incidents)
- [ ] Confirm monitoring is healthy (baseline)
- [ ] Confirm rollback procedure is ready
- [ ] Set maintenance window in PagerDuty

### Execution
1. <Step-by-step actions to trigger failover>
2. Observe traffic shift in dashboards
3. Verify service health in secondary

### Validation
- [ ] All health checks passing
- [ ] Error rate within SLO
- [ ] Latency within SLO
- [ ] No data loss detected
- [ ] Customer-facing functionality verified

### Rollback
1. <Steps to revert to primary>

### Post-Test
- [ ] Document actual RTO (time from trigger to healthy)
- [ ] Document any issues encountered
- [ ] Create tickets for improvements
- [ ] Update runbooks
```

---

101
+ ## Chaos Engineering
102
+
103
+ ### Progressive Approach
104
+
105
+ ```
106
+ Level 1: Kill a pod (weekly, automated)
107
+ Level 2: Inject latency to a dependency (monthly)
108
+ Level 3: Simulate AZ failure (monthly)
109
+ Level 4: Simulate region failure (quarterly)
110
+ Level 5: Simulate total dependency loss (quarterly)
111
+ ```
112
+
113
+ ### Chaos Experiments
114
+
115
+ | Experiment | Tool | What It Tests |
116
+ |---|---|---|
117
+ | Pod kill | Chaos Mesh, Litmus | Auto-restart, PDB, replica count |
118
+ | Network latency | tc, Chaos Mesh | Timeout handling, circuit breakers |
119
+ | Network partition | iptables, Chaos Mesh | Split-brain handling, retries |
120
+ | Disk fill | Stress tools | Alerting, log rotation, eviction |
121
+ | CPU stress | stress-ng | Autoscaling, request throttling |
122
+ | DNS failure | CoreDNS mutation | Caching, fallback behavior |
123
+ | External API failure | Toxiproxy | Circuit breaker, graceful degradation |
124
+
125
+ ### Rules
126
+
127
+ - Always run chaos experiments during business hours with the team present.
128
+ - Start in staging. Graduate to production only with confidence.
129
+ - Define abort conditions before starting.
130
+ - Never run chaos without functioning monitoring and alerting.
131
+ - Document findings and create remediation tickets.
132
+ - Chaos engineering is not "break things" — it is "verify resilience hypotheses."
133
+
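Level 1 above can be automated with a Chaos Mesh `Schedule`; a sketch under
assumed names (the namespace, label selector, and cron slot are illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-kill
  namespace: chaos-testing
spec:
  schedule: "0 10 * * 2"       # Tuesdays 10:00 — business hours, team present
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one                  # kill a single randomly selected matching pod
    selector:
      namespaces: ["staging"]
      labelSelectors:
        app: user-api
```

Run it in staging first, per the rules above, and pair it with an abort
condition in the alerting system.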
---

## Runbook Requirements

Every Tier 1 and Tier 2 service must have runbooks for:

| Scenario | Content |
|---|---|
| Service down | Diagnosis steps, restart procedure, failover steps |
| Database failover | Promotion steps, connection string update, data verification |
| Dependency outage | Graceful degradation, circuit breaker verification |
| Data corruption | Isolation, backup identification, restore procedure |
| Security incident | Containment, investigation, communication |
| Region failover | DNS switch, traffic routing, data sync verification |

### Runbook Quality Checklist

- [ ] Written for someone who has never seen this service before.
- [ ] Includes exact commands, not "restart the service."
- [ ] Includes expected output for each command.
- [ ] Links to dashboards, logs, and relevant architecture docs.
- [ ] Tested by someone other than the author.
- [ ] Updated within the last 90 days.
- [ ] Includes estimated time to execute.
- [ ] Includes escalation contacts with phone numbers.

---

## Recovery Playbooks

### Database Recovery

```
1. Identify failure (monitoring alert, health check)
2. Assess: is it a replica or primary failure?
   ├── Replica: remove from pool, investigate, rebuild from snapshot
   └── Primary:
       a. Promote replica to primary (automated or manual)
       b. Update connection endpoints
       c. Verify data consistency
       d. Rebuild old primary as new replica
3. Post-recovery: verify application health, check replication lag
4. Postmortem within 48 hours
```

### Complete Environment Recovery

```
1. Provision infrastructure (terraform apply — from git)
2. Restore databases (from latest verified backup)
3. Deploy applications (from container registry — images already built)
4. Restore configuration (from git + secrets manager)
5. Verify: health checks, smoke tests, data integrity
6. Update DNS / traffic routing
7. Monitor for 1 hour before declaring recovery complete
```

### Key Principle

Everything needed to rebuild from scratch must be in version control or
automated backups. If you lose the primary region, you should be able to
reconstruct the entire environment from:

- Git (infrastructure code, application code, configs, dashboards, alerts)
- Container registry (built images)
- Backup storage (databases, state files)
- Secrets manager (credentials)
--- /dev/null
+++ package/templates/teams/devops/references/docker.md
@@ -0,0 +1,237 @@
# Docker Reference

## Base Image Selection

| Language | Dev/Build Stage | Production Stage | Target Size |
|---|---|---|---|
| Go | `golang:1.22-alpine` | `gcr.io/distroless/static-debian12` | < 20 MB |
| Node.js | `node:20-alpine` | `node:20-alpine` or `gcr.io/distroless/nodejs20-debian12` | < 150 MB |
| Python | `python:3.12-slim` | `python:3.12-slim` | < 200 MB |
| Java | `eclipse-temurin:21-jdk-alpine` | `eclipse-temurin:21-jre-alpine` or distroless | < 250 MB |
| Rust | `rust:1.77-alpine` | `gcr.io/distroless/cc-debian12` | < 30 MB |

### Rules

- Never use `:latest`. Pin major.minor at minimum.
- Prefer Alpine or distroless for production. Debian slim if Alpine causes
  musl compatibility issues.
- Distroless = no shell, no package manager. Best for compiled languages.
- Rebuild base images weekly to pick up security patches.

---

## Multi-Stage Build Pattern

```dockerfile
# ---- Build stage ----
FROM golang:1.22-alpine AS build
WORKDIR /src

COPY go.mod go.sum ./
RUN go mod download

COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /app ./cmd/server

# ---- Production stage ----
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
USER nonroot:nonroot
EXPOSE 8080
ENTRYPOINT ["/app"]
```

### Key Points

- Build dependencies stay in the build stage — never leak into production.
- Copy dependency manifests first, then source code (layer caching).
- Strip debug symbols in compiled binaries (`-ldflags="-s -w"` for Go).
- Use `COPY --from=build` to pull only the final artifact.

---

## Security Hardening

### Non-Root Execution

```dockerfile
# Create an unprivileged user and group in the final stage.
# (A user created only in the build stage does not exist in the
# production stage unless /etc/passwd is copied across.)
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser:appgroup
```

Or with distroless, use the built-in `nonroot` user:

```dockerfile
USER nonroot:nonroot
```

### Read-Only Filesystem

In Kubernetes, set `readOnlyRootFilesystem: true` in the security context.
If the app needs to write temp files, mount an `emptyDir` at the specific path.

### Drop All Capabilities

```yaml
securityContext:
  capabilities:
    drop: ["ALL"]
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
```

### No Secrets in Images

- Never `COPY .env` or `COPY credentials`.
- Never use `ARG` or `ENV` for secrets — they persist in image layers.
- Pass secrets at runtime via mounted volumes or environment variables from
  a secrets manager.
- Use `docker history` to verify no secrets leaked into layers.

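When a build step genuinely needs a credential (for example a private registry
token), BuildKit secret mounts expose it to a single `RUN` step without it
ever entering a layer; a sketch, with `npm_token` as an illustrative secret id:

```dockerfile
# syntax=docker/dockerfile:1
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./
# The secret is mounted at /run/secrets/npm_token for this RUN step only
# and is never written to an image layer.
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN="$(cat /run/secrets/npm_token)" npm ci
```

Built with `docker build --secret id=npm_token,src=./token.txt .`; how the
token reaches the package manager (here an `NPM_TOKEN` environment variable)
depends on the project's `.npmrc`.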
---

## .dockerignore

Every project must have a `.dockerignore`:

```
.git
.github
.gitignore
.env*
*.md
docs/
node_modules/
__pycache__/
*.pyc
.pytest_cache/
coverage/
.nyc_output/
dist/
build/
*.log
docker-compose*.yml
Dockerfile*
.dockerignore
terraform/
k8s/
helm/
.vscode/
.idea/
```

### Why It Matters

- Reduces build context size (faster builds).
- Prevents secrets (`.env`) from entering the image.
- Avoids cache-busting from irrelevant file changes.

---

134
+ ## Layer Ordering
135
+
136
+ Order instructions from least-frequently-changed to most-frequently-changed:
137
+
138
+ ```dockerfile
139
+ # 1. Base image (changes rarely)
140
+ FROM node:20-alpine
141
+
142
+ # 2. System dependencies (changes rarely)
143
+ RUN apk add --no-cache dumb-init
144
+
145
+ # 3. Working directory
146
+ WORKDIR /app
147
+
148
+ # 4. Dependency manifest (changes sometimes)
149
+ COPY package.json package-lock.json ./
150
+ RUN npm ci --production
151
+
152
+ # 5. Application code (changes often)
153
+ COPY src/ ./src/
154
+
155
+ # 6. Runtime config
156
+ USER node
157
+ EXPOSE 3000
158
+ CMD ["dumb-init", "node", "src/index.js"]
159
+ ```
160
+
161
+ ### Layer Cache Tips
162
+
163
+ - Separate `COPY` for dependency files vs source code.
164
+ - Use `--mount=type=cache` for package manager caches (BuildKit):
165
+ ```dockerfile
166
+ RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
167
+ ```
168
+ - Combine `RUN` commands that are logically related to reduce layers:
169
+ ```dockerfile
170
+ RUN apt-get update && apt-get install -y --no-install-recommends \
171
+ curl ca-certificates && \
172
+ rm -rf /var/lib/apt/lists/*
173
+ ```
174
+
175
+ ---
176
+
## Health Checks

```dockerfile
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD ["/app", "healthcheck"]
```

Or for HTTP services:

```dockerfile
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/healthz || exit 1
```

### Rules

- Define health checking in one place: `HEALTHCHECK` for plain Docker/Compose,
  or liveness/readiness probes when running on Kubernetes (Kubernetes ignores
  the Dockerfile `HEALTHCHECK` and runs only its own probes).
- Use a dedicated health endpoint, not the main route.
- Health check should verify the app can serve traffic, not just that the
  process is alive.

---

## Image Size Targets

| Category | Target | Action if Exceeded |
|---|---|---|
| Static binary (Go/Rust) | < 30 MB | Check for embedded assets, strip symbols |
| Node.js | < 150 MB | Audit `node_modules`, use `--production` |
| Python | < 200 MB | Remove build deps, use slim base |
| Java | < 250 MB | Use JRE not JDK, jlink custom runtime |
| General | < 500 MB | Investigate — likely build deps leaked |

### Checking Size

```bash
# Size of final image
docker images myapp:latest --format "{{.Size}}"

# Layer-by-layer breakdown
docker history myapp:latest

# Deep analysis
dive myapp:latest
```

---

## Build Best Practices

- Enable BuildKit: `DOCKER_BUILDKIT=1`
- Tag with git SHA for traceability: `myapp:abc1234`
- Also tag with semver if releasing: `myapp:1.2.3`
- Scan images before push:
  ```bash
  trivy image myapp:abc1234
  ```
- Push to a private registry (ECR, GCR, ACR, Harbor). Never deploy from
  Docker Hub in production.
- Set `imagePullPolicy: IfNotPresent` in K8s when using immutable SHA tags.
- Use `.dockerignore` in every project — no exceptions.
--- /dev/null
+++ package/templates/teams/devops/references/infrastructure-as-code.md
@@ -0,0 +1,238 @@
# Infrastructure as Code Reference

## Terraform Directory Structure

```
infra/
├── modules/                    # Reusable modules (internal registry)
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── eks-cluster/
│   ├── rds/
│   └── s3-bucket/
├── environments/
│   ├── dev/
│   │   ├── main.tf             # Module calls with dev params
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars    # Dev-specific values
│   │   ├── backend.tf          # Remote state config for dev
│   │   └── providers.tf
│   ├── staging/
│   └── production/
├── global/                     # Shared resources (IAM, DNS zones)
│   ├── iam/
│   ├── dns/
│   └── ecr/
└── scripts/
    ├── plan.sh
    ├── apply.sh
    └── import.sh
```

### Rules

- One state file per environment per component. Never share state across envs.
- `global/` resources are applied once and referenced via `terraform_remote_state`
  or SSM parameters.
- Modules live in a separate repo or `modules/` directory with semantic versions.

---

## State Management

### Remote Backend (AWS Example)

```hcl
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "environments/production/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

### State Rules

1. **Always remote.** Never commit `.tfstate` to git.
2. **Always locked.** DynamoDB (AWS), GCS (GCP), or Terraform Cloud.
3. **Always encrypted.** Enable server-side encryption on the bucket.
4. **State per component.** Split large configs: `network`, `compute`, `data`,
   `monitoring`. Cross-reference with `terraform_remote_state`.
5. **Never hand-edit state.** Use `terraform state mv`, `terraform import`,
   `terraform state rm`.

### State Key Convention

```
environments/{env}/{component}/terraform.tfstate
```

Examples: `environments/production/network/terraform.tfstate`,
`environments/staging/eks/terraform.tfstate`.

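The convention is mechanical enough to generate rather than hand-write; a
minimal sketch (the `state_key` helper and environment set are illustrative):

```python
# Environments mirror the infra/environments/ directory layout.
KNOWN_ENVS = {"dev", "staging", "production"}

def state_key(env, component):
    """Build the S3 backend key for one environment/component pair."""
    if env not in KNOWN_ENVS:
        raise ValueError(f"unknown environment: {env}")
    return f"environments/{env}/{component}/terraform.tfstate"
```

Generating keys this way (for example in a `backend.tf` templating script)
prevents two components from silently sharing a state file.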
---

## Plan-Review-Apply Cycle

```
Developer pushes IaC change
→ CI runs `terraform fmt -check`
→ CI runs `terraform validate`
→ CI runs `tflint` / `checkov` / `tfsec`
→ CI runs `terraform plan` and posts output to PR
→ Reviewer approves plan output (not just code)
→ Merge triggers `terraform apply` with saved plan file
→ Apply output posted to PR / Slack
```

### Plan Safety

- Always save the plan to a file: `terraform plan -out=tfplan`
- Apply the exact plan: `terraform apply tfplan`
- Never run `terraform apply` without a saved plan in CI.
- Require plan output review for any change touching production.

### Drift Detection

- Schedule weekly `terraform plan` in CI (no apply).
- Alert if drift detected (non-empty plan on unchanged code).
- Investigate and reconcile — do not auto-apply drift corrections.

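`terraform plan -detailed-exitcode` encodes the result in the exit status
(0 = no changes, 1 = error, 2 = changes present), which is what the weekly
drift job should branch on; a sketch (function names are illustrative):

```python
import subprocess

def classify_plan_exit(code):
    """Map `terraform plan -detailed-exitcode` statuses to outcomes."""
    # 0: plan succeeded, no diff; 1: plan errored; 2: succeeded with a diff.
    return {0: "clean", 1: "error", 2: "drift"}.get(code, "error")

def check_drift(workdir):
    """Run a read-only plan against unchanged code; 'drift' means the
    real infrastructure no longer matches the committed config."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir, capture_output=True,
    )
    return classify_plan_exit(proc.returncode)
```

The CI job alerts on `"drift"` and fails on `"error"`; per the rules above it
never applies automatically.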
---

## Module Versioning

### Semantic Versioning

```hcl
module "vpc" {
  source = "git::https://github.com/company/tf-modules.git//vpc?ref=v2.1.0"
}
```

- **Major** (v2→v3): Breaking changes — variable removed, resource recreated.
- **Minor** (v2.1→v2.2): New feature — new optional variable, new output.
- **Patch** (v2.1.0→v2.1.1): Bug fix — no interface change.

### Rules

- Pin module versions in all environments. Never use `ref=main`.
- Upgrade dev first, validate, then promote to staging, then production.
- Module changes require a CHANGELOG entry.
- Modules must have `variables.tf` with descriptions and types for every input.
- Modules must have `outputs.tf` exposing values consumers need.

---

## Import Before Recreate

When Terraform wants to destroy and recreate a critical resource:

1. **Stop.** Review the plan carefully.
2. Check if `lifecycle { prevent_destroy = true }` should be set.
3. If migrating existing infrastructure into Terraform:
   ```bash
   terraform import aws_db_instance.main my-database-id
   ```
4. After import, run `plan` — adjust config until plan shows no changes.
5. For state moves (renaming resources):
   ```bash
   terraform state mv aws_s3_bucket.old aws_s3_bucket.new
   ```
6. Use `moved` blocks (Terraform 1.1+) for refactoring:
   ```hcl
   moved {
     from = aws_s3_bucket.old
     to   = aws_s3_bucket.new
   }
   ```

---

## Tagging Policy

Every cloud resource must carry these tags:

| Tag | Example | Purpose |
|---|---|---|
| `Environment` | `production` | Cost allocation, access control |
| `Team` | `platform` | Ownership |
| `Service` | `user-api` | Dependency mapping |
| `ManagedBy` | `terraform` | Drift tracking |
| `CostCenter` | `eng-platform` | Finance reporting |
| `Repository` | `github.com/co/infra` | Code traceability |

### Enforcement

```hcl
# default_tags is a provider-level argument (AWS provider shown)
provider "aws" {
  default_tags {
    tags = {
      Environment = var.environment
      Team        = var.team
      ManagedBy   = "terraform"
      Repository  = var.repository
    }
  }
}
```

- Use AWS Config rules, GCP Organization Policies, or OPA/Sentinel to reject
  untagged resources.
- CI lint step checks every resource block for required tags.

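The CI lint step can run over `terraform show -json tfplan` output, where each
planned resource exposes its final attributes under
`resource_changes[].change.after`; a minimal sketch (the `missing_tags` helper
is illustrative, not a complete plan-JSON walker — e.g. it ignores
provider-level `default_tags`):

```python
import json

# Required tags, matching the Tagging Policy table above.
REQUIRED_TAGS = {"Environment", "Team", "Service", "ManagedBy",
                 "CostCenter", "Repository"}

def missing_tags(plan_json):
    """Return {resource_address: [missing tag names]} for planned resources."""
    plan = json.loads(plan_json)
    problems = {}
    for change in plan.get("resource_changes", []):
        after = (change.get("change") or {}).get("after") or {}
        tags = after.get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            problems[change["address"]] = sorted(missing)
    return problems
```

A non-empty result fails the pipeline before `terraform apply` runs.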
---

## Resource Naming Convention

```
{company}-{environment}-{region}-{service}-{resource_type}[-{qualifier}]
```

Examples:
- `acme-prod-use1-userapi-rds-primary`
- `acme-stg-euw1-platform-eks`
- `acme-dev-use1-shared-vpc`

### Rules

- Lowercase, hyphens only (no underscores — some cloud providers reject them).
- Max 63 characters (DNS label limit).
- Region abbreviated: `use1` = us-east-1, `euw1` = eu-west-1.
- Use variables and `locals` to construct names — never hardcode.

```hcl
locals {
  name_prefix = "${var.company}-${var.environment}-${var.region_short}-${var.service}"
}

resource "aws_s3_bucket" "assets" {
  bucket = "${local.name_prefix}-assets"
}
```

---

## Terraform Style Guide

- `terraform fmt` on every save. Enforce in CI.
- One resource type per file for large configs, or group by logical domain.
- Variables: always include `description`, `type`, and `default` (if optional).
- Outputs: always include `description`.
- Use `count` for simple toggles, `for_each` for collections.
- Avoid `depends_on` unless absolutely necessary — implicit dependencies preferred.
- Keep provider versions pinned with `~>` (pessimistic constraint), inside the
  `terraform` block:
  ```hcl
  terraform {
    required_providers {
      aws = {
        source  = "hashicorp/aws"
        version = "~> 5.0"
      }
    }
  }
  ```