aigent-team 0.1.0

Files changed (71)
  1. package/LICENSE +21 -0
  2. package/README.md +253 -0
  3. package/dist/chunk-N3RYHWTR.js +267 -0
  4. package/dist/cli.js +576 -0
  5. package/dist/index.d.ts +234 -0
  6. package/dist/index.js +27 -0
  7. package/package.json +67 -0
  8. package/templates/shared/git-workflow.md +44 -0
  9. package/templates/shared/project-conventions.md +48 -0
  10. package/templates/teams/ba/agent.yaml +25 -0
  11. package/templates/teams/ba/references/acceptance-criteria.md +87 -0
  12. package/templates/teams/ba/references/api-contract-design.md +110 -0
  13. package/templates/teams/ba/references/requirements-analysis.md +83 -0
  14. package/templates/teams/ba/references/user-story-mapping.md +73 -0
  15. package/templates/teams/ba/skill.md +85 -0
  16. package/templates/teams/be/agent.yaml +34 -0
  17. package/templates/teams/be/conventions.md +102 -0
  18. package/templates/teams/be/references/api-design.md +91 -0
  19. package/templates/teams/be/references/async-processing.md +86 -0
  20. package/templates/teams/be/references/auth-security.md +58 -0
  21. package/templates/teams/be/references/caching.md +79 -0
  22. package/templates/teams/be/references/database.md +65 -0
  23. package/templates/teams/be/references/error-handling.md +106 -0
  24. package/templates/teams/be/references/observability.md +83 -0
  25. package/templates/teams/be/references/review-checklist.md +50 -0
  26. package/templates/teams/be/references/testing.md +100 -0
  27. package/templates/teams/be/review-checklist.md +54 -0
  28. package/templates/teams/be/skill.md +71 -0
  29. package/templates/teams/devops/agent.yaml +35 -0
  30. package/templates/teams/devops/conventions.md +133 -0
  31. package/templates/teams/devops/references/ci-cd.md +218 -0
  32. package/templates/teams/devops/references/cost-optimization.md +218 -0
  33. package/templates/teams/devops/references/disaster-recovery.md +199 -0
  34. package/templates/teams/devops/references/docker.md +237 -0
  35. package/templates/teams/devops/references/infrastructure-as-code.md +238 -0
  36. package/templates/teams/devops/references/kubernetes.md +397 -0
  37. package/templates/teams/devops/references/monitoring.md +224 -0
  38. package/templates/teams/devops/references/review-checklist.md +149 -0
  39. package/templates/teams/devops/references/security.md +225 -0
  40. package/templates/teams/devops/review-checklist.md +72 -0
  41. package/templates/teams/devops/skill.md +131 -0
  42. package/templates/teams/fe/agent.yaml +28 -0
  43. package/templates/teams/fe/conventions.md +80 -0
  44. package/templates/teams/fe/references/accessibility.md +92 -0
  45. package/templates/teams/fe/references/component-architecture.md +87 -0
  46. package/templates/teams/fe/references/css-styling.md +89 -0
  47. package/templates/teams/fe/references/forms.md +73 -0
  48. package/templates/teams/fe/references/performance.md +104 -0
  49. package/templates/teams/fe/references/review-checklist.md +51 -0
  50. package/templates/teams/fe/references/security.md +90 -0
  51. package/templates/teams/fe/references/state-management.md +117 -0
  52. package/templates/teams/fe/references/testing.md +112 -0
  53. package/templates/teams/fe/review-checklist.md +53 -0
  54. package/templates/teams/fe/skill.md +68 -0
  55. package/templates/teams/lead/agent.yaml +18 -0
  56. package/templates/teams/lead/references/cross-team-coordination.md +68 -0
  57. package/templates/teams/lead/references/quality-gates.md +64 -0
  58. package/templates/teams/lead/references/task-decomposition.md +69 -0
  59. package/templates/teams/lead/skill.md +83 -0
  60. package/templates/teams/qa/agent.yaml +32 -0
  61. package/templates/teams/qa/conventions.md +130 -0
  62. package/templates/teams/qa/references/ci-integration.md +337 -0
  63. package/templates/teams/qa/references/e2e-testing.md +292 -0
  64. package/templates/teams/qa/references/mocking.md +249 -0
  65. package/templates/teams/qa/references/performance-testing.md +288 -0
  66. package/templates/teams/qa/references/review-checklist.md +143 -0
  67. package/templates/teams/qa/references/security-testing.md +271 -0
  68. package/templates/teams/qa/references/test-data.md +275 -0
  69. package/templates/teams/qa/references/test-strategy.md +192 -0
  70. package/templates/teams/qa/review-checklist.md +53 -0
  71. package/templates/teams/qa/skill.md +131 -0
@@ -0,0 +1,133 @@
1
+ ## Infrastructure as Code
2
+
3
+ - **Terraform** is the default for cloud infrastructure. Use Pulumi/CDK when the team has a strong preference and TypeScript expertise.
4
+ - Directory structure:
5
+ ```
6
+ infra/
7
+ ├── modules/ # Reusable modules (vpc, rds, eks, etc.)
8
+ │ ├── vpc/
9
+ │ │ ├── main.tf
10
+ │ │ ├── variables.tf
11
+ │ │ └── outputs.tf
12
+ │ └── ...
13
+ ├── environments/
14
+ │ ├── dev/
15
+ │ │ ├── main.tf # Calls modules with dev-specific values
16
+ │ │ ├── terraform.tfvars
17
+ │ │ └── backend.tf # Remote state config for dev
18
+ │ ├── staging/
19
+ │ └── production/
20
+ └── global/ # Shared resources (IAM, DNS, ECR)
21
+ ```
22
+ - **State management**: Remote backend (S3+DynamoDB or GCS) per environment per component. Workspace-based isolation is acceptable for small projects but explicit directory separation is preferred for production.
23
+ - **Plan → Review → Apply** cycle. Always run `terraform plan -out=plan.tfplan` first, then apply that saved plan with `terraform apply plan.tfplan`. Never run a bare `terraform apply`, which re-plans and may apply changes nobody reviewed.
24
+ - **Import before recreate**: If Terraform wants to destroy and recreate a stateful resource (database, S3 bucket), stop and investigate. Use `terraform import` or `lifecycle { prevent_destroy = true }` for critical resources.
25
+ - **Module versioning**: Pin module versions in production. Tag releases. Never point to `main` branch for module sources.
26
+
27
+ ## Docker Standards
28
+
29
+ - **Base images**: Use specific version tags, never `latest`. Prefer slim variants:
30
+ - Node.js: `node:22-slim` or `node:22-alpine`
31
+ - Python: `python:3.12-slim`
32
+ - Go: Build with `golang:1.22`, run on `gcr.io/distroless/static-debian12`
33
+ - Java: `eclipse-temurin:21-jre-jammy`
34
+ - **Multi-stage builds** mandatory. Builder stage has dev dependencies and compiles. Runtime stage has only production dependencies and the built artifact.
35
+ - **Security**:
36
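+ - Create and use a non-root user: `RUN adduser --disabled-password --no-create-home appuser`, followed by a separate `USER appuser` instruction. `USER` is a Dockerfile instruction, not a shell command, so it cannot be chained with `&&`. These `adduser` flags are the Debian form; Alpine's BusyBox `adduser` uses `-D -H` instead.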
37
+ - No `sudo`, no `curl` in final image (install in builder stage, copy binary only)
38
+ - `COPY --chown=appuser:appuser` for application files
39
+ - Read-only root filesystem where possible: `readOnlyRootFilesystem: true` in K8s securityContext
40
+ - **Health checks**: Every container has `HEALTHCHECK` instruction or K8s liveness/readiness probes.
41
+ - **Image size targets**: Node.js app <200MB, Go app <50MB, Python app <300MB. Measure with `docker images`.
42
+ - **.dockerignore** must include: `.git`, `node_modules`, `__pycache__`, `.env*`, `*.md`, `test/`, `docs/`, `.vscode/`, `.idea/`.
43
+
44
+ ## Kubernetes Standards
45
+
46
+ - **Namespaces**: One namespace per service per environment. Format: `{service}-{env}` (e.g., `api-production`, `worker-staging`).
47
+ - **Resource management**:
48
+ - Always set `resources.requests` (scheduler guarantee) and `resources.limits` (OOM/CPU throttle ceiling)
49
+ - Requests = p50 actual usage. Limits = p99 + 30% headroom. Profile with real traffic, not guesses.
50
+ - CPU limits are controversial — set for burst workloads, omit for latency-sensitive services (CPU throttling causes latency spikes). Always set memory limits.
51
+ - **Probes**:
52
+ - `livenessProbe`: Is the process alive? Failure = K8s restarts the pod. Use `/health/live`. Don't check dependencies here.
53
+ - `readinessProbe`: Can the pod serve traffic? Failure = removed from Service endpoints. Use `/health/ready`. Check DB connection, cache availability.
54
+ - `startupProbe`: Is the app still starting? Use for slow-starting apps (JVM warmup, large model loading). Prevents premature liveness kills.
55
+ - Intervals: liveness every 10s, readiness every 5s, startup every 5s with failureThreshold 30.
56
+ - **Pod Disruption Budget**: `minAvailable: 50%` or `maxUnavailable: 1` for all production deployments. Prevents K8s from draining all pods simultaneously during node upgrades.
57
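+ - **Security context** on every pod. `fsGroup` is valid only in the pod-level `securityContext`, while `capabilities` and `readOnlyRootFilesystem` are valid only in a container's `securityContext` (`runAsNonRoot` and `runAsUser` work at either level). The snippet lists the required settings together; split them across the two levels in a real manifest:
58
+ ```yaml
59
+ securityContext:
60
+   runAsNonRoot: true
61
+   runAsUser: 65534
62
+   fsGroup: 65534                 # pod-level only
63
+   capabilities:                  # container-level only
64
+     drop: [ALL]
65
+   readOnlyRootFilesystem: true   # container-level only
66
+ ```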
67
+ - **Network Policies**: Default deny all ingress. Explicitly allow only the traffic paths you need. Services that don't need to talk to each other must not be able to.
68
+ - **Secrets**: Use `ExternalSecrets` operator or `Vault` sidecar. Never store secrets in K8s Secrets manifests in git (even if base64 encoded — base64 is not encryption).
69
+ - **Image pull policy**: `IfNotPresent` for tagged images. Never `Always` in production (causes unnecessary registry traffic). Never use `latest` tag.
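
The sizing rule under Resource management (requests = p50, limits = p99 + 30% headroom) can be sketched in a few lines. This is a minimal illustration, not part of the package: the function names are hypothetical, the percentile uses the nearest-rank method, and the samples are assumed to be usage measurements in a single unit (millicores or MiB) taken under real traffic.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence (pct in 0..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def size_resources(samples: Sequence[float], headroom: float = 0.30) -> Dict[str, float]:
    """Apply the rule above: request = p50, limit = p99 plus headroom."""
    return {
        "request": percentile(samples, 50),
        "limit": percentile(samples, 99) * (1 + headroom),
    }
```

Per the CPU-limit guidance above, the `limit` value would only be applied to CPU for burst workloads; for memory it is always set.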
70
+
71
+ ## CI/CD Pipeline Standards
72
+
73
+ - **Pipeline stages** (in order):
74
+ 1. **Lint** — code formatting, linting, type checking. Fast, catches obvious issues.
75
+ 2. **Unit test** — runs in parallel with lint. Fails fast.
76
+ 3. **Build** — Docker image build. Uses cache from previous builds.
77
+ 4. **Security scan** — SAST, dependency audit, container image scan. Blocks on critical/high.
78
+ 5. **Integration test** — runs against built image with test dependencies.
79
+ 6. **Deploy staging** — automatic on merge to main.
80
+ 7. **E2E test** — runs against staging.
81
+ 8. **Deploy production** — manual approval gate. Canary rollout.
82
+ - **Speed targets**: Total pipeline <15 minutes. Lint+unit <3 min. Build <5 min. Integration <5 min. Optimize with parallelization and caching.
83
+ - **Caching strategy**:
84
+ - Docker layer cache — mount BuildKit cache, or use registry cache (`--cache-from`/`--cache-to`)
85
+ - Dependency cache — cache `node_modules` (by lockfile hash), `.pip-cache`, `GOMODCACHE`
86
+ - Test cache — Vitest/Jest cache, Go test cache
87
+ - **Artifact tagging**: `{branch}-{short-sha}-{build-number}` (e.g., `main-a1b2c3d-42`). Production deploys also get semantic version tags.
88
+ - **Branch protection**: Main branch requires — CI passing, 1+ approval, no force push, signed commits preferred.
89
+
90
+ ## Monitoring & Alerting
91
+
92
+ - **Three pillars** — all services must have all three:
93
+ 1. **Metrics**: Request rate, error rate, latency (p50/p95/p99), saturation (CPU, memory, disk, connections)
94
+ 2. **Logs**: Structured JSON, shipped to centralized aggregator, retained 30 days (hot) + 90 days (cold)
95
+ 3. **Traces**: Distributed traces across service boundaries, sampled at 10% in production (100% for errors)
96
+ - **Dashboard per service**: At minimum — request rate, error rate, latency percentiles, resource utilization. Golden signals: Latency, Traffic, Errors, Saturation.
97
+ - **Alert on symptoms, not causes**:
98
+ - GOOD: "Error rate >1% for 5 minutes" — this means users are affected
99
+ - BAD: "CPU >80%" — this might be fine during a deploy
100
+ - GOOD: "p99 latency >2s for 10 minutes" — users are experiencing slow responses
101
+ - BAD: "Memory >70%" — this might be normal for a JVM app
102
+ - **Alert severity**:
103
+ - **P1/Critical**: Pages on-call immediately. User-facing outage. Example: error rate >5%, complete service unavailability.
104
+ - **P2/High**: Slack notification to team. Degraded performance. Example: error rate >1%, p99 >2x baseline.
105
+ - **P3/Medium**: Ticket created. Non-urgent. Example: disk usage >80%, certificate expiring in 14 days.
106
+ - **P4/Low**: Dashboard only. Informational. Example: cost anomaly, deprecation warning.
107
+ - **On-call requirements**: Runbook for every P1/P2 alert. Runbook includes — what the alert means, how to diagnose, common fixes, escalation path.
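
As a rough sketch of how the symptom-based rules above could be evaluated, the following maps a series of per-minute error rates to a severity. The function names are hypothetical, the thresholds come from the examples in the list (>5% for P1, >1% for P2), and applying the same 5-minute window to both levels is an assumption.

```python
from typing import Optional, Sequence


def breached_for(rates: Sequence[float], threshold: float, minutes: int) -> bool:
    """True if the most recent `minutes` per-minute samples all exceed threshold."""
    if minutes <= 0 or len(rates) < minutes:
        return False
    return all(rate > threshold for rate in rates[-minutes:])


def error_rate_severity(rates: Sequence[float], minutes: int = 5) -> Optional[str]:
    """Return "P1", "P2", or None for a series of error-rate fractions (0.01 = 1%)."""
    if breached_for(rates, 0.05, minutes):  # assumed window for the P1 example
        return "P1"
    if breached_for(rates, 0.01, minutes):
        return "P2"
    return None
```

Requiring every sample in the window to exceed the threshold keeps a single bad minute from paging anyone, which is the point of alerting on sustained symptoms.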
108
+
109
+ ## Security
110
+
111
+ - **Least privilege**: IAM roles with minimum required permissions. No `*` in resource ARN. No `AdministratorAccess` on service roles.
112
+ - **Network security**: VPC with private subnets for all application workloads. Public subnets only for load balancers and bastion hosts. Security groups = allow specific ports from specific sources only.
113
+ - **Secrets rotation**: Database passwords rotated quarterly. API keys rotated on employee offboarding. TLS certificates auto-renewed (cert-manager or ACM).
114
+ - **Image security**: Scan images in CI (Trivy). Block deployments of images with critical vulnerabilities. Use signed images (cosign/Notary) in production.
115
+ - **Audit logging**: CloudTrail/GCP Audit Log enabled. Log all IAM changes, security group changes, and production access.
116
+
117
+ ## Disaster Recovery
118
+
119
+ - **RPO/RTO** defined per service tier:
120
+ - Tier 1 (critical): RPO <1 hour, RTO <15 minutes. Multi-AZ, automated failover, real-time replication.
121
+ - Tier 2 (important): RPO <4 hours, RTO <1 hour. Multi-AZ, manual failover, periodic replication.
122
+ - Tier 3 (internal): RPO <24 hours, RTO <4 hours. Single-AZ, restore from backup.
123
+ - **Backup verification**: Monthly restore test to verify backups are usable. Document restore time and compare against RTO.
124
+ - **Failover testing**: Quarterly chaos testing — kill a node, kill a database replica, simulate AZ failure. Verify automated recovery works.
125
+ - **Runbooks**: Every critical service has a disaster recovery runbook with step-by-step restore procedure, verified by the last person who ran it.
126
+
127
+ ## Cost Management
128
+
129
+ - **Tagging policy**: All resources tagged with `project`, `environment`, `team`, `cost-center`. Untagged resources get flagged in weekly report.
130
+ - **Right-sizing review**: Monthly review of instance utilization. Any instance with <30% avg CPU utilization gets downsized.
131
+ - **Reserved capacity**: 1-year commitments for steady-state workloads (60-70% of base). Spot instances for batch processing and stateless burst.
132
+ - **Storage lifecycle**: S3/GCS objects transitioned to infrequent access after 30 days, glacier/archive after 90 days. Set lifecycle policies on every bucket.
133
+ - **Budget alerts**: Alert at 80% and 100% of the monthly budget. Run daily anomaly detection (>20% deviation from the rolling 7-day average).
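
The daily anomaly rule in the budget alerts bullet (>20% deviation from the rolling 7-day average) could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, daily spend is supplied as a plain list, and fewer than seven days of history is treated as "not enough data" rather than an anomaly.

```python
from typing import Sequence


def is_cost_anomaly(
    previous_days: Sequence[float],
    today: float,
    threshold: float = 0.20,
    window: int = 7,
) -> bool:
    """Flag today's spend if it deviates from the rolling average by more than threshold."""
    if len(previous_days) < window:
        return False  # assumption: do not alert without a full baseline
    recent = previous_days[-window:]
    baseline = sum(recent) / window
    if baseline == 0:
        return today > 0
    return abs(today - baseline) / baseline > threshold
```

The check is symmetric, so a sudden drop in spend is flagged as well; that can indicate an outage or a broken billing export.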
@@ -0,0 +1,218 @@
1
+ # CI/CD Reference
2
+
3
+ ## Pipeline Stages
4
+
5
+ ```
6
+ ┌─────────┐ ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐
7
+ │ Lint │──▶│ Test │──▶│ Build │──▶│ Scan │──▶│ Deploy │──▶│ Verify │
8
+ └─────────┘ └──────────┘ └─────────┘ └──────────┘ └──────────┘ └────────┘
9
+ ```
10
+
11
+ ### Stage Details
12
+
13
+ | Stage | What | Tools | Fail Behavior |
14
+ |---|---|---|---|
15
+ | **Lint** | Code format, style, IaC validation | eslint, ruff, terraform fmt, hadolint | Block merge |
16
+ | **Test** | Unit, integration, contract tests | jest, pytest, go test | Block merge |
17
+ | **Build** | Compile, Docker build, asset bundling | Docker, webpack, go build | Block merge |
18
+ | **Scan** | Vulnerability scan, SAST, secrets detection | Trivy, Semgrep, gitleaks, Snyk | Block merge (critical/high) |
19
+ | **Deploy** | Push to environment | ArgoCD sync, helm upgrade, kubectl apply | Auto-rollback |
20
+ | **Verify** | Smoke tests, synthetic checks | curl, k6, playwright | Alert + auto-rollback |
21
+
22
+ ---
23
+
24
+ ## Speed Targets
25
+
26
+ | Metric | Target | Action if Exceeded |
27
+ |---|---|---|
28
+ | Lint + Test | < 5 min | Parallelize, use test splitting |
29
+ | Full pipeline (to staging) | < 10 min | Cache aggressively, optimize build |
30
+ | Full pipeline (to production) | < 15 min | Investigate — likely a build or test issue |
31
+ | Rollback | < 2 min | Must be automated, not a new pipeline run |
32
+ | Docker build | < 3 min | Fix layer ordering, use BuildKit cache |
33
+
34
+ ### Speed Optimization
35
+
36
+ - Run lint, test, and scan in parallel where possible.
37
+ - Use job-level caching for dependencies.
38
+ - Split test suites across parallel runners.
39
+ - Only build what changed (monorepo: use path filters).
40
+
41
+ ---
42
+
43
+ ## Caching Strategies
44
+
45
+ ### Dependency Caching
46
+
47
+ ```yaml
48
+ # GitHub Actions example
49
+ - uses: actions/cache@v4
50
+   with:
51
+     path: |
52
+       ~/.npm
53
+       node_modules
54
+     key: deps-${{ hashFiles('package-lock.json') }}
55
+     restore-keys: deps-
56
+ ```
57
+
58
+ ### Docker Layer Caching
59
+
60
+ ```yaml
61
+ # GitHub Actions with BuildKit
62
+ - uses: docker/build-push-action@v5
63
+   with:
64
+     cache-from: type=gha
65
+     cache-to: type=gha,mode=max
66
+ ```
67
+
68
+ ### What to Cache
69
+
70
+ | Asset | Cache Key | TTL |
71
+ |---|---|---|
72
+ | npm/yarn/pnpm | `package-lock.json` hash | Until lockfile changes |
73
+ | pip | `requirements.txt` hash | Until lockfile changes |
74
+ | Go modules | `go.sum` hash | Until lockfile changes |
75
+ | Docker layers | BuildKit GHA cache | 7 days / LRU |
76
+ | Terraform providers | `.terraform.lock.hcl` hash | Until lockfile changes |
77
+ | Test fixtures | Commit SHA | Per commit |
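
For CI systems without a built-in equivalent of `hashFiles`, the "lockfile hash" cache key in the table above can be derived by hand. The sketch below is illustrative only: the function names and the `deps` prefix are assumptions, and the digest is not claimed to match what any particular CI provider computes.

```python
import hashlib
from pathlib import Path


def cache_key(prefix: str, lockfile_contents: bytes) -> str:
    """Build a cache key that changes only when the lockfile contents change."""
    digest = hashlib.sha256(lockfile_contents).hexdigest()
    return f"{prefix}-{digest}"


def cache_key_for_file(prefix: str, lockfile: Path) -> str:
    """Convenience wrapper that reads the lockfile from disk."""
    return cache_key(prefix, lockfile.read_bytes())
```

A call such as `cache_key_for_file("deps", Path("package-lock.json"))` yields a key that stays stable across builds until the lockfile is edited, which matches the "Until lockfile changes" TTL in the table.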
78
+
79
+ ---
80
+
81
+ ## Artifact Tagging
82
+
83
+ ### Image Tags
84
+
85
+ Every build produces an image tagged with:
86
+
87
+ ```
88
+ registry.company.com/service-name:<git-sha-short>
89
+ ```
90
+
91
+ Additionally, on release:
92
+
93
+ ```
94
+ registry.company.com/service-name:<semver>
95
+ registry.company.com/service-name:<semver>-<git-sha-short>
96
+ ```
97
+
98
+ ### Rules
99
+
100
+ - Primary tag is **always the git SHA** (7+ characters). This is the source of truth.
101
+ - Semver tags are applied on release branches or tags.
102
+ - Never overwrite a tag. Tags are immutable.
103
+ - Never use `:latest` in production pipelines.
104
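+ - Store build metadata as image labels (declare `GIT_SHA`, `BUILD_DATE`, and `REPO_URL` with `ARG` and pass them via `--build-arg`, otherwise the labels are empty):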
105
+ ```dockerfile
106
+ LABEL org.opencontainers.image.revision="${GIT_SHA}" \
107
+ org.opencontainers.image.created="${BUILD_DATE}" \
108
+ org.opencontainers.image.source="${REPO_URL}"
109
+ ```
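
The tagging rules above can be expressed as a small helper that produces the full tag list for a build. This is a sketch: the function name is hypothetical, the short SHA is fixed at 7 characters, and the semver check accepts only plain `MAJOR.MINOR.PATCH` versions, which is an assumption about the release format.

```python
import re
from typing import List, Optional

SHA_RE = re.compile(r"[0-9a-f]{7,40}")
SEMVER_RE = re.compile(r"\d+\.\d+\.\d+")


def image_tags(repository: str, git_sha: str, semver: Optional[str] = None) -> List[str]:
    """Return the image references for one build: SHA tag first, semver tags on release."""
    if not SHA_RE.fullmatch(git_sha):
        raise ValueError("git_sha must be 7-40 lowercase hex characters")
    short_sha = git_sha[:7]
    tags = [f"{repository}:{short_sha}"]  # primary tag, the source of truth
    if semver is not None:
        if not SEMVER_RE.fullmatch(semver):
            raise ValueError("semver must look like MAJOR.MINOR.PATCH")
        tags.append(f"{repository}:{semver}")
        tags.append(f"{repository}:{semver}-{short_sha}")
    return tags
```

The helper never emits `latest`, and because every tag embeds either the SHA or a release version, no existing tag has to be overwritten.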
110
+
111
+ ---
112
+
113
+ ## Branch Protection
114
+
115
+ ### Required for `main` / `master`
116
+
117
+ - Require PR with at least 1 approval.
118
+ - Require status checks to pass (lint, test, scan).
119
+ - Require up-to-date branch before merging.
120
+ - Require signed commits (optional but recommended).
121
+ - No force pushes.
122
+ - No direct pushes (all changes via PR).
123
+
124
+ ### Branch Strategy
125
+
126
+ ```
127
+ main ─────────────────────────────────────────── (always deployable)
128
+ └── feature/TICKET-123-add-caching ────── PR ──▶ merge
129
+ └── fix/TICKET-456-oom-error ──────────── PR ──▶ merge
130
+ └── release/1.2.0 ─── tag v1.2.0 ──▶ deploy
131
+ ```
132
+
133
+ - Trunk-based development preferred: short-lived branches (< 2 days).
134
+ - Release branches only if you need hotfix capability on older versions.
135
+ - Delete branches after merge.
136
+
137
+ ---
138
+
139
+ ## Rollback Mechanisms
140
+
141
+ ### Strategy 1: Redeploy Previous Artifact (Preferred)
142
+
143
+ ```bash
144
+ # ArgoCD
145
+ argocd app set myapp -p image.tag=<previous-sha>
146
+ argocd app sync myapp
147
+
148
+ # Helm
149
+ helm rollback myapp <previous-revision>
150
+
151
+ # kubectl
152
+ kubectl rollout undo deployment/myapp
153
+ ```
154
+
155
+ - Fastest method: the previous image already exists in the registry.
156
+ - No new build needed.
157
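+ - Record the rollback target in Git afterwards: imperative commands such as `helm rollback` and `kubectl rollout undo` change the cluster without updating the repository.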
158
+
159
+ ### Strategy 2: Git Revert
160
+
161
+ ```bash
162
+ git revert <bad-commit>
163
+ git push origin main
164
+ # Pipeline runs automatically, deploys the revert
165
+ ```
166
+
167
+ - Preferred when the code change itself is the problem.
168
+ - Creates audit trail in Git.
169
+ - Slower than artifact redeploy (requires full pipeline).
170
+
171
+ ### Strategy 3: Feature Flag Disable
172
+
173
+ ```
174
+ Toggle off the feature flag in LaunchDarkly / Unleash / Flipt
175
+ ```
176
+
177
+ - Fastest for feature-level issues.
178
+ - No deployment needed.
179
+ - Requires the change to be behind a flag.
180
+
181
+ ### Rollback Rules
182
+
183
+ - Rollback must be possible within 2 minutes.
184
+ - Every deployment must record which artifact was deployed.
185
+ - Keep last 10 revisions in Helm / ArgoCD.
186
+ - Test rollback procedure quarterly.
187
+ - Automated rollback on failed verify stage (smoke tests).
188
+
189
+ ---
190
+
191
+ ## Pipeline Security
192
+
193
+ - Secrets injected via CI provider's secrets manager, never in pipeline files.
194
+ - Pin action versions to SHA, not tags (supply chain attack mitigation):
195
+ ```yaml
196
+ uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
197
+ ```
198
+ - Use OIDC for cloud authentication — no long-lived credentials.
199
+ - Scan pipeline definitions with `actionlint` (GitHub) or equivalent.
200
+ - Audit who can modify pipeline files (same rigor as production access).
201
+
202
+ ---
203
+
204
+ ## Environments and Promotion
205
+
206
+ ```
207
+ PR Branch → dev (auto-deploy on push)
208
+ main → staging (auto-deploy on merge)
209
+ tag/v* → production (manual approval gate, then auto-deploy)
210
+ ```
211
+
212
+ ### Environment Parity
213
+
214
+ - Same Docker image across all environments (only config changes).
215
+ - Same K8s manifests with env-specific values via Helm values or Kustomize overlays.
216
+ - Same pipeline stages for all environments (skip nothing for staging).
217
+ - Production deploy requires approval from at least one person who did not
218
+ author the change.
@@ -0,0 +1,218 @@
1
+ # Cost Optimization Reference
2
+
3
+ ## Right-Sizing
4
+
5
+ ### Process
6
+
7
+ 1. **Collect data.** 14-30 days of CPU and memory utilization metrics.
8
+ 2. **Identify waste.** Instances/pods where p95 utilization < 30%.
9
+ 3. **Resize.** Adjust resource requests/limits or instance type.
10
+ 4. **Validate.** Monitor for 7 days post-change. Watch for OOMs and throttling.
11
+ 5. **Repeat.** Monthly right-sizing review.
12
+
13
+ ### Kubernetes Right-Sizing
14
+
15
+ ```
16
+ # Current allocation vs actual usage
17
+ Allocated CPU: 4000m → p95 usage: 800m → Right-size to: 1000m request, 2000m limit
18
+ Allocated Memory: 4Gi → p95 usage: 1.2Gi → Right-size to: 1.5Gi request, 2Gi limit
19
+ ```
20
+
21
+ - Use VPA (Vertical Pod Autoscaler) in recommendation mode for data.
22
+ - Use Goldilocks or Kubecost for visibility.
23
+ - Set HPA (Horizontal Pod Autoscaler) based on CPU, memory, or custom metrics.
24
+ - Review HPA min/max replicas quarterly — over-provisioned minimums waste money.
25
+
26
+ ### Compute Right-Sizing
27
+
28
+ | Signal | Action |
29
+ |---|---|
30
+ | CPU p95 < 20% | Downsize instance or reduce CPU request |
31
+ | Memory p95 < 40% | Downsize instance or reduce memory request |
32
+ | CPU p95 > 80% | Upsize or add HPA |
33
+ | Memory p95 > 80% | Upsize — memory pressure causes OOMs |
34
+ | GPU utilization < 50% | Consider time-sharing or smaller GPU |
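
The CPU and memory rows of the table above translate directly into a lookup. The sketch below is illustrative: the function name is hypothetical, inputs are p95 utilization as percentages of the current allocation, and the GPU row is left out.

```python
from typing import List


def right_sizing_actions(cpu_p95: float, memory_p95: float) -> List[str]:
    """Map p95 utilization percentages to the actions from the signal table."""
    actions = []
    if cpu_p95 < 20:
        actions.append("Downsize instance or reduce CPU request")
    elif cpu_p95 > 80:
        actions.append("Upsize or add HPA")
    if memory_p95 < 40:
        actions.append("Downsize instance or reduce memory request")
    elif memory_p95 > 80:
        actions.append("Upsize (memory pressure causes OOMs)")
    return actions
```

An empty list means both signals sit inside the acceptable band and the workload can be left alone until the next monthly review.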
35
+
36
+ ---
37
+
38
+ ## Reserved Capacity
39
+
40
+ ### When to Reserve
41
+
42
+ | Workload Pattern | Strategy | Savings |
43
+ |---|---|---|
44
+ | Steady 24/7 (databases, core APIs) | Reserved Instances / Committed Use | 30-60% |
45
+ | Steady daytime only | Reserved + scheduled scaling | 20-40% |
46
+ | Spiky / batch | Spot/Preemptible + on-demand fallback | 50-80% |
47
+ | Unpredictable | On-demand (do not reserve) | 0% |
48
+
49
+ ### Rules
50
+
51
+ - Reserve only after 3+ months of stable usage data.
52
+ - Start with 1-year reservations. Use 3-year only for well-understood workloads.
53
+ - Prefer convertible reservations over standard (flexibility to change instance type).
54
+ - Review reservation coverage monthly. Unused reservations are pure waste.
55
+ - Use AWS Savings Plans (compute flexibility) over RIs where possible.
56
+
57
+ ### Spot / Preemptible Instances
58
+
59
+ - Use for: CI runners, batch jobs, stateless workers, dev/test environments.
60
+ - Never use for: databases, stateful services, single-replica production.
61
+ - Always implement graceful shutdown handling (SIGTERM).
62
+ - Use multiple instance types and AZs for spot diversification.
63
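+ - Leave `maxPrice` unset unless you need a hard cap. AWS Spot no longer uses bidding: the default maximum is the on-demand price, and you pay the current Spot price.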
64
+
65
+ ---
66
+
67
+ ## Storage Lifecycle
68
+
69
+ ### S3 / Object Storage Lifecycle
70
+
71
+ ```
72
+ Day 0-30: Standard storage (frequent access)
73
+ Day 30-90: Infrequent Access (IA)
74
+ Day 90-365: Glacier / Archive
75
+ Day 365+: Delete or Deep Archive (per compliance)
76
+ ```
77
+
78
+ ```hcl
79
+ resource "aws_s3_bucket_lifecycle_configuration" "assets" {
80
+   bucket = aws_s3_bucket.assets.id
81
+
82
+   rule {
83
+     id     = "archive-old-assets"
84
+     status = "Enabled"
85
+
86
+     transition {
87
+       days          = 30
88
+       storage_class = "STANDARD_IA"
89
+     }
90
+     transition {
91
+       days          = 90
92
+       storage_class = "GLACIER"
93
+     }
94
+     expiration {
95
+       days = 365
96
+     }
97
+   }
98
+ }
99
+ ```
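
To sanity-check a lifecycle policy before applying it, the schedule above can be modeled as a function of object age. This is a sketch with a hypothetical function name; it mirrors the 30/90/365-day boundaries of the Terraform rule and returns `"EXPIRED"` as a placeholder for objects the rule would delete.

```python
def storage_class_for_age(age_days: int) -> str:
    """Return the storage class an object of the given age should be in."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days < 30:
        return "STANDARD"
    if age_days < 90:
        return "STANDARD_IA"
    if age_days < 365:
        return "GLACIER"
    return "EXPIRED"  # placeholder: the expiration rule deletes the object
```

Where compliance requires retention beyond a year, the last branch would return a deep-archive class instead, matching the "Delete or Deep Archive" line in the schedule.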
100
+
101
+ ### Storage Cost Traps
102
+
103
+ | Trap | Fix |
104
+ |---|---|
105
+ | Unattached EBS volumes | Weekly scan and delete, tag with expiry |
106
+ | Old snapshots | Lifecycle policy, delete snapshots older than retention |
107
+ | Unused Elastic IPs | Release unattached EIPs (they cost money when idle) |
108
+ | Oversized volumes | Right-size based on usage, use gp3 over gp2 |
109
+ | Log retention forever | Set TTL: 7d hot, 30d warm, 90d cold |
110
+ | Uncompressed backups | Enable compression before archiving |
111
+
112
+ ---
113
+
114
+ ## Data Transfer Costs
115
+
116
+ ### Cost Hierarchy (AWS, typical)
117
+
118
+ ```
119
+ Free: Inbound from internet, same-AZ within VPC
120
+ Cheap: Cross-AZ within region (~$0.01/GB)
121
+ Moderate: Cross-region (~$0.02/GB)
122
+ Expensive: Outbound to internet (~$0.09/GB)
123
+ ```
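
A back-of-the-envelope estimate based on the hierarchy above shows where transfer spend concentrates. The rates below are the approximate figures quoted in the block, not current list prices, and the path names and function name are made up for this sketch.

```python
from typing import Dict

# Approximate USD per GB, taken from the cost hierarchy above.
RATES_USD_PER_GB = {
    "same_az": 0.0,
    "cross_az": 0.01,
    "cross_region": 0.02,
    "internet_egress": 0.09,
}


def monthly_transfer_cost(gb_by_path: Dict[str, float]) -> float:
    """Sum the estimated cost for each traffic path; unknown paths raise KeyError."""
    return sum(RATES_USD_PER_GB[path] * gb for path, gb in gb_by_path.items())
```

For example, 1,000 GB of cross-AZ traffic plus 500 GB of internet egress comes to roughly $10 + $45 = $55 per month at these rates, so egress dominates even at half the volume.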
124
+
125
+ ### Optimization Strategies
126
+
127
+ - **Keep traffic in-AZ.** Co-locate services that communicate heavily.
128
+ - **Use VPC endpoints.** Avoid NAT Gateway charges for AWS service access
129
+ (S3, DynamoDB, ECR).
130
+ - **CDN for static assets.** CloudFront/Cloudflare reduces origin egress.
131
+ - **Compress data in transit.** gzip/brotli for HTTP, compression for backups.
132
+ - **Cache aggressively.** Redis/Memcached for repeated external API calls.
133
+ - **Use Private Link** for cross-account or cross-VPC communication.
134
+
135
+ ### NAT Gateway Cost Alert
136
+
137
+ NAT Gateways charge per GB processed. Common cost traps:
138
+ - Docker image pulls through NAT (use VPC endpoint for ECR).
139
+ - Log shipping through NAT (use VPC endpoint for CloudWatch/S3).
140
+ - S3 access through NAT (use S3 gateway endpoint — it is free).
141
+
142
+ ---
143
+
144
+ ## Budget Alerts
145
+
146
+ ### Alert Thresholds
147
+
148
+ | Threshold | Action |
149
+ |---|---|
150
+ | 50% of monthly budget | Info notification to Slack |
151
+ | 75% of monthly budget | Warning to team lead |
152
+ | 90% of monthly budget | Alert to engineering manager |
153
+ | 100% of monthly budget | Escalate — investigate immediately |
154
+ | Anomaly (> 20% day-over-day) | Alert to on-call |
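
The percentage rows of the threshold table can be checked with a simple ordered lookup. The sketch below uses a hypothetical function name, treats a threshold as crossed when spend is at or above it (an assumption), and leaves the anomaly row out.

```python
from typing import Optional

# (threshold percent, action) pairs from the table above, highest first.
BUDGET_ACTIONS = [
    (100, "Escalate and investigate immediately"),
    (90, "Alert to engineering manager"),
    (75, "Warning to team lead"),
    (50, "Info notification to Slack"),
]


def budget_action(spend: float, monthly_budget: float) -> Optional[str]:
    """Return the action for the highest threshold crossed, or None below 50%."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    percent_used = spend * 100 / monthly_budget
    for threshold, action in BUDGET_ACTIONS:
        if percent_used >= threshold:
            return action
    return None
```

Ordering the pairs from highest to lowest means only the most severe applicable action is returned.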
155
+
156
+ ### Implementation
157
+
158
+ ```hcl
159
+ resource "aws_budgets_budget" "monthly" {
160
+   name         = "monthly-total"
161
+   budget_type  = "COST"
162
+   limit_amount = "50000"
163
+   limit_unit   = "USD"
164
+   time_unit    = "MONTHLY"
165
+
166
+   notification {
167
+     comparison_operator       = "GREATER_THAN"
168
+     threshold                 = 75
169
+     threshold_type            = "PERCENTAGE"
170
+     notification_type         = "ACTUAL"
171
+     subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
172
+   }
173
+ }
174
+ ```
175
+
176
+ ### Per-Team Budgets
177
+
178
+ - Tag all resources with `CostCenter` and `Team`.
179
+ - Create per-team budget alerts.
180
+ - Share cost dashboards with engineering leads weekly.
181
+ - Include cost delta in deploy notifications.
182
+
183
+ ---
184
+
185
+ ## Tagging for Cost Tracking
186
+
187
+ ### Required Tags (reiterated from IaC reference)
188
+
189
+ | Tag | Purpose for Cost |
190
+ |---|---|
191
+ | `Environment` | Split prod vs dev/staging spend |
192
+ | `Team` | Ownership and accountability |
193
+ | `Service` | Per-service cost tracking |
194
+ | `CostCenter` | Finance allocation |
195
+
196
+ ### Cost Allocation Reports
197
+
198
+ - Enable AWS Cost and Usage Reports (CUR) or GCP billing export.
199
+ - Visualize in Grafana, Kubecost, or cloud-native tools.
200
+ - Track: cost per service, cost per environment, cost per customer (if multi-tenant).
201
+ - Review weekly in engineering stand-up.
202
+
203
+ ---
204
+
205
+ ## Quick Wins Checklist
206
+
207
+ - [ ] Delete unattached EBS volumes and unused EIPs.
208
+ - [ ] Enable S3 lifecycle policies on all buckets.
209
+ - [ ] Use gp3 instead of gp2 for EBS volumes (20% cheaper, better performance).
210
+ - [ ] Add VPC endpoints for S3, ECR, CloudWatch, DynamoDB.
211
+ - [ ] Right-size RDS instances (check Performance Insights).
212
+ - [ ] Stop or terminate dev/staging resources outside business hours.
213
+ - [ ] Use Graviton/ARM instances where possible (typically up to ~20% cheaper than comparable x86 instances).
214
+ - [ ] Enable Spot for CI runners and batch jobs.
215
+ - [ ] Compress and set TTL on all log streams.
216
+ - [ ] Review and delete old container images in ECR (lifecycle policy).
217
+ - [ ] Consolidate underutilized EKS clusters.
218
+ - [ ] Enable S3 Intelligent-Tiering for unpredictable access patterns.