npm - aigent-team - Versions diffs - 0.1.0 - Mend

aigent-team 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (71)

package/LICENSE +21 -0
package/README.md +253 -0
package/dist/chunk-N3RYHWTR.js +267 -0
package/dist/cli.js +576 -0
package/dist/index.d.ts +234 -0
package/dist/index.js +27 -0
package/package.json +67 -0
package/templates/shared/git-workflow.md +44 -0
package/templates/shared/project-conventions.md +48 -0
package/templates/teams/ba/agent.yaml +25 -0
package/templates/teams/ba/references/acceptance-criteria.md +87 -0
package/templates/teams/ba/references/api-contract-design.md +110 -0
package/templates/teams/ba/references/requirements-analysis.md +83 -0
package/templates/teams/ba/references/user-story-mapping.md +73 -0
package/templates/teams/ba/skill.md +85 -0
package/templates/teams/be/agent.yaml +34 -0
package/templates/teams/be/conventions.md +102 -0
package/templates/teams/be/references/api-design.md +91 -0
package/templates/teams/be/references/async-processing.md +86 -0
package/templates/teams/be/references/auth-security.md +58 -0
package/templates/teams/be/references/caching.md +79 -0
package/templates/teams/be/references/database.md +65 -0
package/templates/teams/be/references/error-handling.md +106 -0
package/templates/teams/be/references/observability.md +83 -0
package/templates/teams/be/references/review-checklist.md +50 -0
package/templates/teams/be/references/testing.md +100 -0
package/templates/teams/be/review-checklist.md +54 -0
package/templates/teams/be/skill.md +71 -0
package/templates/teams/devops/agent.yaml +35 -0
package/templates/teams/devops/conventions.md +133 -0
package/templates/teams/devops/references/ci-cd.md +218 -0
package/templates/teams/devops/references/cost-optimization.md +218 -0
package/templates/teams/devops/references/disaster-recovery.md +199 -0
package/templates/teams/devops/references/docker.md +237 -0
package/templates/teams/devops/references/infrastructure-as-code.md +238 -0
package/templates/teams/devops/references/kubernetes.md +397 -0
package/templates/teams/devops/references/monitoring.md +224 -0
package/templates/teams/devops/references/review-checklist.md +149 -0
package/templates/teams/devops/references/security.md +225 -0
package/templates/teams/devops/review-checklist.md +72 -0
package/templates/teams/devops/skill.md +131 -0
package/templates/teams/fe/agent.yaml +28 -0
package/templates/teams/fe/conventions.md +80 -0
package/templates/teams/fe/references/accessibility.md +92 -0
package/templates/teams/fe/references/component-architecture.md +87 -0
package/templates/teams/fe/references/css-styling.md +89 -0
package/templates/teams/fe/references/forms.md +73 -0
package/templates/teams/fe/references/performance.md +104 -0
package/templates/teams/fe/references/review-checklist.md +51 -0
package/templates/teams/fe/references/security.md +90 -0
package/templates/teams/fe/references/state-management.md +117 -0
package/templates/teams/fe/references/testing.md +112 -0
package/templates/teams/fe/review-checklist.md +53 -0
package/templates/teams/fe/skill.md +68 -0
package/templates/teams/lead/agent.yaml +18 -0
package/templates/teams/lead/references/cross-team-coordination.md +68 -0
package/templates/teams/lead/references/quality-gates.md +64 -0
package/templates/teams/lead/references/task-decomposition.md +69 -0
package/templates/teams/lead/skill.md +83 -0
package/templates/teams/qa/agent.yaml +32 -0
package/templates/teams/qa/conventions.md +130 -0
package/templates/teams/qa/references/ci-integration.md +337 -0
package/templates/teams/qa/references/e2e-testing.md +292 -0
package/templates/teams/qa/references/mocking.md +249 -0
package/templates/teams/qa/references/performance-testing.md +288 -0
package/templates/teams/qa/references/review-checklist.md +143 -0
package/templates/teams/qa/references/security-testing.md +271 -0
package/templates/teams/qa/references/test-data.md +275 -0
package/templates/teams/qa/references/test-strategy.md +192 -0
package/templates/teams/qa/review-checklist.md +53 -0
package/templates/teams/qa/skill.md +131 -0

package/templates/teams/devops/conventions.md ADDED

@@ -0,0 +1,133 @@
+## Infrastructure as Code
+- **Terraform** is the default for cloud infrastructure. Pulumi/CDK when team has strong preference and TypeScript expertise.
+- Directory structure:
+  ```
+  infra/
+  ├── modules/              # Reusable modules (vpc, rds, eks, etc.)
+  │   ├── vpc/
+  │   │   ├── main.tf
+  │   │   ├── variables.tf
+  │   │   └── outputs.tf
+  │   └── ...
+  ├── environments/
+  │   ├── dev/
+  │   │   ├── main.tf       # Calls modules with dev-specific values
+  │   │   ├── terraform.tfvars
+  │   │   └── backend.tf    # Remote state config for dev
+  │   ├── staging/
+  │   └── production/
+  └── global/               # Shared resources (IAM, DNS, ECR)
+  ```
+- **State management**: Remote backend (S3+DynamoDB or GCS) per environment per component. Workspace-based isolation is acceptable for small projects but explicit directory separation is preferred for production.
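+- **Plan → Review → Apply** cycle. Always run `terraform plan -out=plan.tfplan` first, review the output, then apply exactly that plan with `terraform apply plan.tfplan`. Never run a bare `terraform apply`, which re-plans and can apply changes nobody reviewed.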
+- **Import before recreate**: If Terraform wants to destroy and recreate a stateful resource (database, S3 bucket), stop and investigate. Use `terraform import` or `lifecycle { prevent_destroy = true }` for critical resources.
+- **Module versioning**: Pin module versions in production. Tag releases. Never point to `main` branch for module sources.
+## Docker Standards
+- **Base images**: Use specific version tags, never `latest`. Prefer slim variants:
+  - Node.js: `node:22-slim` or `node:22-alpine`
+  - Python: `python:3.12-slim`
+  - Go: Build with `golang:1.22`, run on `gcr.io/distroless/static-debian12`
+  - Java: `eclipse-temurin:21-jre-jammy`
+- **Multi-stage builds** mandatory. Builder stage has dev dependencies and compiles. Runtime stage has only production dependencies and the built artifact.
+- **Security**:
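+  - Create and use a non-root user: `RUN adduser --disabled-password --no-create-home appuser`, followed by `USER appuser` as its own instruction (`USER` is a Dockerfile instruction, not a shell command, so it cannot be chained with `&&`). On Alpine (BusyBox) images the equivalent `adduser` flags are `-D -H`.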
+  - No `sudo`, no `curl` in final image (install in builder stage, copy binary only)
+  - `COPY --chown=appuser:appuser` for application files
+  - Read-only root filesystem where possible: `readOnlyRootFilesystem: true` in K8s securityContext
+- **Health checks**: Every container has `HEALTHCHECK` instruction or K8s liveness/readiness probes.
+- **Image size targets**: Node.js app <200MB, Go app <50MB, Python app <300MB. Measure with `docker images`.
+- **.dockerignore** must include: `.git`, `node_modules`, `__pycache__`, `.env*`, `*.md`, `test/`, `docs/`, `.vscode/`, `.idea/`.
+## Kubernetes Standards
+- **Namespaces**: One namespace per service per environment. Format: `{service}-{env}` (e.g., `api-production`, `worker-staging`).
+- **Resource management**:
+  - Always set `resources.requests` (scheduler guarantee) and memory `resources.limits` (OOM ceiling). CPU `resources.limits` (throttle ceiling) depend on the workload, as described below.
+  - Requests = p50 actual usage. Limits = p99 + 30% headroom. Profile with real traffic, not guesses.
+  - CPU limits are controversial — set for burst workloads, omit for latency-sensitive services (CPU throttling causes latency spikes). Always set memory limits.
+- **Probes**:
+  - `livenessProbe`: Is the process alive? Failure = K8s restarts the pod. Use `/health/live`. Don't check dependencies here.
+  - `readinessProbe`: Can the pod serve traffic? Failure = removed from Service endpoints. Use `/health/ready`. Check DB connection, cache availability.
+  - `startupProbe`: Is the app still starting? Use for slow-starting apps (JVM warmup, large model loading). Prevents premature liveness kills.
+  - Intervals: liveness every 10s, readiness every 5s, startup every 5s with failureThreshold 30.
+- **Pod Disruption Budget**: `minAvailable: 50%` or `maxUnavailable: 1` for all production deployments. Prevents K8s from draining all pods simultaneously during node upgrades.
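+- **Security context** on every pod (`fsGroup` is pod-level only; `capabilities` and `readOnlyRootFilesystem` are container-level only):
+  ```yaml
+  spec:
+    securityContext:            # pod-level
+      runAsNonRoot: true
+      runAsUser: 65534
+      fsGroup: 65534
+    containers:
+      - name: app               # placeholder container name
+        securityContext:        # container-level
+          capabilities:
+            drop: [ALL]
+          readOnlyRootFilesystem: true
+  ```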
+- **Network Policies**: Default deny all ingress. Explicitly allow only the traffic paths you need. Services that don't need to talk to each other must not be able to.
+- **Secrets**: Use `ExternalSecrets` operator or `Vault` sidecar. Never store secrets in K8s Secrets manifests in git (even if base64 encoded — base64 is not encryption).
+- **Image pull policy**: `IfNotPresent` for tagged images. Never `Always` in production (causes unnecessary registry traffic). Never use `latest` tag.
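The sizing rule in the resource-management bullets above (requests = p50, limits = p99 + 30% headroom) can be sketched in Python. This is a minimal illustration and not part of the package: the function names `percentile` and `size_resources`, the nearest-rank percentile method, and the sample values are all assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def size_resources(samples, headroom_pct=30):
    """Return (request, limit): request = p50 usage, limit = p99 usage plus headroom."""
    request = percentile(samples, 50)
    limit = math.ceil(percentile(samples, 99) * (100 + headroom_pct) / 100)
    return request, limit


# Hypothetical memory usage samples in MiB, e.g. exported from a metrics system.
usage_mib = [200, 210, 220, 230, 240, 250, 260, 270, 280, 400]
request, limit = size_resources(usage_mib)
print(f"requests.memory={request}Mi limits.memory={limit}Mi")
```

With only ten samples the p99 is simply the maximum, which is why the convention asks for profiling under real traffic over a longer period.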
+## CI/CD Pipeline Standards
+- **Pipeline stages** (in order):
+  1. **Lint** — code formatting, linting, type checking. Fast, catches obvious issues.
+  2. **Unit test** — runs in parallel with lint. Fails fast.
+  3. **Build** — Docker image build. Uses cache from previous builds.
+  4. **Security scan** — SAST, dependency audit, container image scan. Blocks on critical/high.
+  5. **Integration test** — runs against built image with test dependencies.
+  6. **Deploy staging** — automatic on merge to main.
+  7. **E2E test** — runs against staging.
+  8. **Deploy production** — manual approval gate. Canary rollout.
+- **Speed targets**: Total pipeline <15 minutes. Lint+unit <3 min. Build <5 min. Integration <5 min. Optimize with parallelization and caching.
+- **Caching strategy**:
+  - Docker layer cache — mount BuildKit cache, or use registry cache (`--cache-from`/`--cache-to`)
+  - Dependency cache — cache `node_modules` (by lockfile hash), `.pip-cache`, `GOMODCACHE`
+  - Test cache — Vitest/Jest cache, Go test cache
+- **Artifact tagging**: `{branch}-{short-sha}-{build-number}` (e.g., `main-a1b2c3d-42`). Production deploys also get semantic version tags.
+- **Branch protection**: Main branch requires — CI passing, 1+ approval, no force push, signed commits preferred.
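The `{branch}-{short-sha}-{build-number}` tagging scheme above can be sketched as a small helper. The function name `artifact_tag` is hypothetical, and the branch sanitization step is an assumption added here because Docker tags cannot contain `/`; the source does not say how branch names such as `feature/x` are handled.

```python
import re


def artifact_tag(branch, commit_sha, build_number):
    """Build a `{branch}-{short-sha}-{build-number}` image tag."""
    # Docker tags allow only letters, digits, "_", "." and "-",
    # so any other run of characters (for example "/") becomes "-".
    safe_branch = re.sub(r"[^A-Za-z0-9_.-]+", "-", branch).strip("-.")
    if not safe_branch:
        raise ValueError("branch name has no usable characters")
    short_sha = commit_sha[:7].lower()
    if not re.fullmatch(r"[0-9a-f]{7}", short_sha):
        raise ValueError("commit SHA must start with 7 hex characters")
    return f"{safe_branch}-{short_sha}-{build_number}"


print(artifact_tag("main", "a1b2c3d4e5f60718293a4b5c6d7e8f9012345678", 42))
```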
+## Monitoring & Alerting
+- **Three pillars** — all services must have all three:
+  1. **Metrics**: Request rate, error rate, latency (p50/p95/p99), saturation (CPU, memory, disk, connections)
+  2. **Logs**: Structured JSON, shipped to centralized aggregator, retained 30 days (hot) + 90 days (cold)
+  3. **Traces**: Distributed traces across service boundaries, sampled at 10% in production (100% for errors)
+- **Dashboard per service**: At minimum — request rate, error rate, latency percentiles, resource utilization. Golden signals: Latency, Traffic, Errors, Saturation.
+- **Alert on symptoms, not causes**:
+  - GOOD: "Error rate >1% for 5 minutes" — this means users are affected
+  - BAD: "CPU >80%" — this might be fine during a deploy
+  - GOOD: "p99 latency >2s for 10 minutes" — users are experiencing slow responses
+  - BAD: "Memory >70%" — this might be normal for a JVM app
+- **Alert severity**:
+  - **P1/Critical**: Pages on-call immediately. User-facing outage. Example: error rate >5%, complete service unavailability.
+  - **P2/High**: Slack notification to team. Degraded performance. Example: error rate >1%, p99 >2x baseline.
+  - **P3/Medium**: Ticket created. Non-urgent. Example: disk usage >80%, certificate expiring in 14 days.
+  - **P4/Low**: Dashboard only. Informational. Example: cost anomaly, deprecation warning.
+- **On-call requirements**: Runbook for every P1/P2 alert. Runbook includes — what the alert means, how to diagnose, common fixes, escalation path.
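The "error rate above a threshold for N minutes" pattern used in the alert examples above can be sketched as follows. This assumes one error-rate sample per minute expressed as a fraction; the function names are hypothetical, and applying the same 5-minute window to both the P1 (>5%) and P2 (>1%) examples is an assumption, since the source only gives a duration for the 1% case.

```python
def breached_for(samples, threshold, window):
    """True if the most recent `window` samples all exceed `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        return False  # not enough history to call it sustained
    return all(value > threshold for value in samples[-window:])


def severity(error_rates, window=5):
    """Map per-minute error rates to the P1/P2 examples, or None if no alert fires."""
    if breached_for(error_rates, 0.05, window):
        return "P1"
    if breached_for(error_rates, 0.01, window):
        return "P2"
    return None


print(severity([0.002, 0.06, 0.07, 0.08, 0.06, 0.09]))
```

Requiring every sample in the window to breach keeps a single spike, such as a deploy blip, from paging anyone.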
+## Security
+- **Least privilege**: IAM roles with minimum required permissions. No `*` in resource ARN. No `AdministratorAccess` on service roles.
+- **Network security**: VPC with private subnets for all application workloads. Public subnets only for load balancers and bastion hosts. Security groups = allow specific ports from specific sources only.
+- **Secrets rotation**: Database passwords rotated quarterly. API keys rotated on employee offboarding. TLS certificates auto-renewed (cert-manager or ACM).
+- **Image security**: Scan images in CI (Trivy). Block deployments of images with critical vulnerabilities. Use signed images (cosign/Notary) in production.
+- **Audit logging**: CloudTrail/GCP Audit Log enabled. Log all IAM changes, security group changes, and production access.
+## Disaster Recovery
+- **RPO/RTO** defined per service tier:
+  - Tier 1 (critical): RPO <1 hour, RTO <15 minutes. Multi-AZ, automated failover, real-time replication.
+  - Tier 2 (important): RPO <4 hours, RTO <1 hour. Multi-AZ, manual failover, periodic replication.
+  - Tier 3 (internal): RPO <24 hours, RTO <4 hours. Single-AZ, restore from backup.
+- **Backup verification**: Monthly restore test to verify backups are usable. Document restore time and compare against RTO.
+- **Failover testing**: Quarterly chaos testing — kill a node, kill a database replica, simulate AZ failure. Verify automated recovery works.
+- **Runbooks**: Every critical service has a disaster recovery runbook with step-by-step restore procedure, verified by the last person who ran it.
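The backup-verification step above ("document restore time and compare against RTO") could be automated along these lines. The tier targets are copied from the list above; the names `TIERS` and `restore_test_passes` and the measured values are hypothetical.

```python
from datetime import timedelta

# RPO/RTO ceilings per service tier, as listed in the conventions above.
TIERS = {
    1: {"rpo": timedelta(hours=1), "rto": timedelta(minutes=15)},
    2: {"rpo": timedelta(hours=4), "rto": timedelta(hours=1)},
    3: {"rpo": timedelta(hours=24), "rto": timedelta(hours=4)},
}


def restore_test_passes(tier, restore_time, data_loss_window):
    """True if a measured restore stayed under both the RTO and the RPO for the tier."""
    target = TIERS[tier]
    return restore_time < target["rto"] and data_loss_window < target["rpo"]


# Hypothetical result of a monthly restore test for a tier 1 service.
print(restore_test_passes(1, timedelta(minutes=12), timedelta(minutes=5)))
```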
+## Cost Management
+- **Tagging policy**: All resources tagged with `project`, `environment`, `team`, `cost-center`. Untagged resources get flagged in weekly report.
+- **Right-sizing review**: Monthly review of instance utilization. Any instance with <30% avg CPU utilization gets downsized.
+- **Reserved capacity**: 1-year commitments for steady-state workloads, covering roughly 60-70% of baseline usage. Spot instances for batch processing and stateless burst.
+- **Storage lifecycle**: S3/GCS objects transitioned to infrequent access after 30 days, glacier/archive after 90 days. Set lifecycle policies on every bucket.
+- **Budget alerts**: Alert at 80% and 100% of the monthly budget. Run daily anomaly detection (>20% deviation from the rolling 7-day average).
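The daily anomaly rule above (more than 20% deviation from the rolling 7-day average) can be sketched as a single function. The name `is_cost_anomaly`, the choice to skip detection until seven days of history exist, and the cost figures are assumptions.

```python
def is_cost_anomaly(daily_costs, today, max_deviation=0.20, window=7):
    """Flag `today` if it deviates from the mean of the last `window` days by more than max_deviation."""
    if len(daily_costs) < window:
        return False  # not enough history for a rolling average
    baseline = sum(daily_costs[-window:]) / window
    if baseline == 0:
        return today > 0
    return abs(today - baseline) / baseline > max_deviation


# Hypothetical daily spend in USD for the previous seven days.
history = [100, 100, 100, 100, 100, 100, 100]
print(is_cost_anomaly(history, 125))
```

Using the absolute deviation flags sudden drops as well as spikes; a drop can indicate a broken workload rather than a saving.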

package/templates/teams/devops/references/ci-cd.md ADDED

@@ -0,0 +1,218 @@
+# CI/CD Reference
+## Pipeline Stages
+```
+┌─────────┐   ┌──────────┐   ┌─────────┐   ┌──────────┐   ┌──────────┐   ┌────────┐
+│  Lint   │──▶│   Test   │──▶│  Build  │──▶│   Scan   │──▶│  Deploy  │──▶│ Verify │
+└─────────┘   └──────────┘   └─────────┘   └──────────┘   └──────────┘   └────────┘
+```
+### Stage Details
+| Stage | What | Tools | Fail Behavior |
+|---|---|---|---|
+| **Lint** | Code format, style, IaC validation | eslint, ruff, terraform fmt, hadolint | Block merge |
+| **Test** | Unit, integration, contract tests | jest, pytest, go test | Block merge |
+| **Build** | Compile, Docker build, asset bundling | Docker, webpack, go build | Block merge |
+| **Scan** | Vulnerability scan, SAST, secrets detection | Trivy, Semgrep, gitleaks, Snyk | Block merge (critical/high) |
+| **Deploy** | Push to environment | ArgoCD sync, helm upgrade, kubectl apply | Auto-rollback |
+| **Verify** | Smoke tests, synthetic checks | curl, k6, playwright | Alert + auto-rollback |
+---
+## Speed Targets
+| Metric | Target | Action if Exceeded |
+|---|---|---|
+| Lint + Test | < 5 min | Parallelize, use test splitting |
+| Full pipeline (to staging) | < 10 min | Cache aggressively, optimize build |
+| Full pipeline (to production) | < 15 min | Investigate — likely a build or test issue |
+| Rollback | < 2 min | Must be automated, not a new pipeline run |
+| Docker build | < 3 min | Fix layer ordering, use BuildKit cache |
+### Speed Optimization
+- Run lint, test, and scan in parallel where possible.
+- Use job-level caching for dependencies.
+- Split test suites across parallel runners.
+- Only build what changed (monorepo: use path filters).
+---
+## Caching Strategies
+### Dependency Caching
+```yaml
+# GitHub Actions example
+- uses: actions/cache@v4
+  with:
+    path: |
+      ~/.npm
+      node_modules
+    key: deps-${{ hashFiles('package-lock.json') }}
+    restore-keys: deps-
+```
+### Docker Layer Caching
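+```yaml
+# GitHub Actions with BuildKit
+# The gha cache backend needs a Buildx builder, so set one up first.
+- uses: docker/setup-buildx-action@v3
+- uses: docker/build-push-action@v5
+  with:
+    cache-from: type=gha
+    cache-to: type=gha,mode=max
+```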
+### What to Cache
+| Asset | Cache Key | TTL |
+|---|---|---|
+| npm/yarn/pnpm | `package-lock.json` hash | Until lockfile changes |
+| pip | `requirements.txt` hash | Until lockfile changes |
+| Go modules | `go.sum` hash | Until lockfile changes |
+| Docker layers | BuildKit GHA cache | 7 days / LRU |
+| Terraform providers | `.terraform.lock.hcl` hash | Until lockfile changes |
+| Test fixtures | Commit SHA | Per commit |
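The "cache key = lockfile hash" idea in the table above can be illustrated with a short sketch. The function name `cache_key` and the lockfile contents are hypothetical, and the digest is a plain SHA-256 of the file bytes, which is not claimed to be byte-identical to what a CI provider's built-in hashing function produces.

```python
import hashlib


def cache_key(prefix, lockfile_bytes):
    """Derive a cache key that changes only when the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{digest}"


# Hypothetical lockfile contents before and after a dependency bump.
before = b'{"lockfileVersion": 3, "packages": {"left-pad": "1.3.0"}}'
after = b'{"lockfileVersion": 3, "packages": {"left-pad": "1.3.1"}}'

print(cache_key("deps", before))
print(cache_key("deps", before) == cache_key("deps", after))
```

Because the key is derived from content, unrelated commits reuse the same cache, and any lockfile change produces a new key.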
+---
+## Artifact Tagging
+### Image Tags
+Every build produces an image tagged with:
+```
+registry.company.com/service-name:<git-sha-short>
+```
+Additionally, on release:
+```
+registry.company.com/service-name:<semver>
+registry.company.com/service-name:<semver>-<git-sha-short>
+```
+### Rules
+- Primary tag is **always the git SHA** (7+ characters). This is the source of truth.
+- Semver tags are applied on release branches or tags.
+- Never overwrite a tag. Tags are immutable.
+- Never use `:latest` in production pipelines.
+- Store build metadata as image labels:
+  ```dockerfile
+  LABEL org.opencontainers.image.revision="${GIT_SHA}" \
+        org.opencontainers.image.created="${BUILD_DATE}" \
+        org.opencontainers.image.source="${REPO_URL}"
+  ```
+---
+## Branch Protection
+### Required for `main` / `master`
+- Require PR with at least 1 approval.
+- Require status checks to pass (lint, test, scan).
+- Require up-to-date branch before merging.
+- Require signed commits (optional but recommended).
+- No force pushes.
+- No direct pushes (all changes via PR).
+### Branch Strategy
+```
+main ─────────────────────────────────────────── (always deployable)
+  └── feature/TICKET-123-add-caching ────── PR ──▶ merge
+  └── fix/TICKET-456-oom-error ──────────── PR ──▶ merge
+  └── release/1.2.0 ─── tag v1.2.0 ──▶ deploy
+```
+- Trunk-based development preferred: short-lived branches (< 2 days).
+- Release branches only if you need hotfix capability on older versions.
+- Delete branches after merge.
+---
+## Rollback Mechanisms
+### Strategy 1: Redeploy Previous Artifact (Preferred)
+```bash
+# ArgoCD
+argocd app set myapp -p image.tag=<previous-sha>
+argocd app sync myapp
+# Helm
+helm rollback myapp <previous-revision>
+# kubectl
+kubectl rollout undo deployment/myapp
+```
+- Fastest method: the previous image already exists in the registry.
+- No new build needed.
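+- Rollback target is tracked in Git history when the image tag is changed through a commit; `helm rollback` and `kubectl rollout undo` change the cluster without one, so record the rollback separately.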
+### Strategy 2: Git Revert
+```bash
+git revert <bad-commit>
+git push origin main
+# Pipeline runs automatically, deploys the revert
+```
+- Preferred when the code change itself is the problem.
+- Creates audit trail in Git.
+- Slower than artifact redeploy (requires full pipeline).
+### Strategy 3: Feature Flag Disable
+```
+Toggle off the feature flag in LaunchDarkly / Unleash / Flipt
+```
+- Fastest for feature-level issues.
+- No deployment needed.
+- Requires the change to be behind a flag.
+### Rollback Rules
+- Rollback must be possible within 2 minutes.
+- Every deployment must record which artifact was deployed.
+- Keep last 10 revisions in Helm / ArgoCD.
+- Test rollback procedure quarterly.
+- Automated rollback on failed verify stage (smoke tests).
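Two of the rollback rules above (record which artifact was deployed, keep the last 10 revisions) can be sketched with a bounded history. The class `DeployHistory` and the tag values are hypothetical; in practice Helm or ArgoCD keeps this history for you.

```python
from collections import deque


class DeployHistory:
    """Bounded record of deployed image tags, newest last."""

    def __init__(self, keep=10):
        self._revisions = deque(maxlen=keep)  # oldest entries fall off automatically

    def record(self, image_tag):
        self._revisions.append(image_tag)

    def current(self):
        return self._revisions[-1]

    def rollback_target(self):
        """Return the artifact deployed immediately before the current one."""
        if len(self._revisions) < 2:
            raise LookupError("no previous revision to roll back to")
        return self._revisions[-2]

    def revisions(self):
        return list(self._revisions)


history = DeployHistory(keep=10)
for n in range(12):
    history.record(f"sha{n:04d}")  # hypothetical image tags
print(history.current(), "->", history.rollback_target())
```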
+---
+## Pipeline Security
+- Secrets injected via CI provider's secrets manager, never in pipeline files.
+- Pin action versions to SHA, not tags (supply chain attack mitigation):
+  ```yaml
+  uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
+  ```
+- Use OIDC for cloud authentication — no long-lived credentials.
+- Scan pipeline definitions with `actionlint` (GitHub) or equivalent.
+- Audit who can modify pipeline files (same rigor as production access).
+---
+## Environments and Promotion
+```
+PR Branch → dev (auto-deploy on push)
+main      → staging (auto-deploy on merge)
+tag/v*    → production (manual approval gate, then auto-deploy)
+```
+### Environment Parity
+- Same Docker image across all environments (only config changes).
+- Same K8s manifests with env-specific values via Helm values or Kustomize overlays.
+- Same pipeline stages for all environments (skip nothing for staging).
+- Production deploy requires approval from at least one person who did not
+  author the change.
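The promotion mapping and the approval rule above can be sketched as two small functions. This assumes full Git ref names (`refs/heads/...`, `refs/tags/v...`) as input, which is an interpretation of the `tag/v*` shorthand in the diagram; the function names are hypothetical.

```python
import re


def target_environment(git_ref):
    """Return (environment, needs_manual_approval) for a pushed Git ref."""
    if re.fullmatch(r"refs/tags/v.+", git_ref):
        return "production", True
    if git_ref == "refs/heads/main":
        return "staging", False
    if git_ref.startswith("refs/heads/"):
        return "dev", False
    raise ValueError(f"no promotion rule for {git_ref!r}")


def approval_is_valid(author, approvers):
    """A production deploy needs at least one approver who did not author the change."""
    return any(approver != author for approver in approvers)


print(target_environment("refs/tags/v1.2.0"))
print(approval_is_valid("alice", ["alice", "bob"]))
```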

package/templates/teams/devops/references/cost-optimization.md ADDED

@@ -0,0 +1,218 @@
+# Cost Optimization Reference
+## Right-Sizing
+### Process
+1. **Collect data.** 14-30 days of CPU and memory utilization metrics.
+2. **Identify waste.** Instances/pods where p95 utilization < 30%.
+3. **Resize.** Adjust resource requests/limits or instance type.
+4. **Validate.** Monitor for 7 days post-change. Watch for OOMs and throttling.
+5. **Repeat.** Monthly right-sizing review.
+### Kubernetes Right-Sizing
+```
+# Current allocation vs actual usage
+Allocated CPU: 4000m → p95 usage: 800m → Right-size to: 1000m request, 2000m limit
+Allocated Memory: 4Gi → p95 usage: 1.2Gi → Right-size to: 1.5Gi request, 2Gi limit
+```
+- Use VPA (Vertical Pod Autoscaler) in recommendation mode for data.
+- Use Goldilocks or Kubecost for visibility.
+- Set HPA (Horizontal Pod Autoscaler) based on CPU, memory, or custom metrics.
+- Review HPA min/max replicas quarterly — over-provisioned minimums waste money.
+### Compute Right-Sizing
+| Signal | Action |
+|---|---|
+| CPU p95 < 20% | Downsize instance or reduce CPU request |
+| Memory p95 < 40% | Downsize instance or reduce memory request |
+| CPU p95 > 80% | Upsize or add HPA |
+| Memory p95 > 80% | Upsize — memory pressure causes OOMs |
+| GPU utilization < 50% | Consider time-sharing or smaller GPU |
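The CPU and memory rows of the signal table above translate directly into a lookup. This is a sketch with a hypothetical function name; it takes p95 utilization as a percentage and leaves out the GPU row.

```python
def right_sizing_actions(cpu_p95, memory_p95):
    """Map p95 utilization percentages to the actions in the table above."""
    actions = []
    if cpu_p95 < 20:
        actions.append("downsize instance or reduce CPU request")
    elif cpu_p95 > 80:
        actions.append("upsize or add HPA")
    if memory_p95 < 40:
        actions.append("downsize instance or reduce memory request")
    elif memory_p95 > 80:
        actions.append("upsize memory")
    return actions


# Hypothetical workload: CPU mostly idle, memory within the healthy band.
print(right_sizing_actions(cpu_p95=15, memory_p95=60))
```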
+---
+## Reserved Capacity
+### When to Reserve
+| Workload Pattern | Strategy | Savings |
+|---|---|---|
+| Steady 24/7 (databases, core APIs) | Reserved Instances / Committed Use | 30-60% |
+| Steady daytime only | Reserved + scheduled scaling | 20-40% |
+| Spiky / batch | Spot/Preemptible + on-demand fallback | 50-80% |
+| Unpredictable | On-demand (do not reserve) | 0% |
+### Rules
+- Reserve only after 3+ months of stable usage data.
+- Start with 1-year reservations. Use 3-year only for well-understood workloads.
+- Prefer convertible reservations over standard (flexibility to change instance type).
+- Review reservation coverage monthly. Unused reservations are pure waste.
+- Use AWS Savings Plans (compute flexibility) over RIs where possible.
+### Spot / Preemptible Instances
+- Use for: CI runners, batch jobs, stateless workers, dev/test environments.
+- Never use for: databases, stateful services, single-replica production.
+- Always implement graceful shutdown handling (SIGTERM).
+- Use multiple instance types and AZs for spot diversification.
+- Leave `maxPrice` unset, or cap it at the on-demand price (the AWS default). Spot pricing no longer works as an auction, so a higher cap does not win capacity.
+---
+## Storage Lifecycle
+### S3 / Object Storage Lifecycle
+```
+Day 0-30:   Standard storage (frequent access)
+Day 30-90:  Infrequent Access (IA)
+Day 90-365: Glacier / Archive
+Day 365+:   Delete or Deep Archive (per compliance)
+```
+```hcl
+resource "aws_s3_bucket_lifecycle_configuration" "assets" {
+  bucket = aws_s3_bucket.assets.id
+  rule {
+    id     = "archive-old-assets"
+    status = "Enabled"
+    filter {} # Empty filter: the rule applies to every object in the bucket
+    transition {
+      days          = 30
+      storage_class = "STANDARD_IA"
+    }
+    transition {
+      days          = 90
+      storage_class = "GLACIER"
+    }
+    expiration {
+      days = 365
+    }
+  }
+}
+```
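To sanity-check the lifecycle schedule above, the transitions can be expressed as a function of object age. This is an illustrative sketch with a hypothetical function name; the thresholds mirror the 30, 90, and 365 day values in the Terraform rule, and `"EXPIRED"` is a label used here, not an S3 storage class.

```python
def storage_class_for_age(age_days):
    """Return the storage class an object should be in after `age_days` days."""
    if age_days < 0:
        raise ValueError("age cannot be negative")
    if age_days < 30:
        return "STANDARD"
    if age_days < 90:
        return "STANDARD_IA"
    if age_days < 365:
        return "GLACIER"
    return "EXPIRED"  # deleted by the expiration rule


for age in (0, 45, 200, 400):
    print(age, storage_class_for_age(age))
```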
+### Storage Cost Traps
+| Trap | Fix |
+|---|---|
+| Unattached EBS volumes | Weekly scan and delete, tag with expiry |
+| Old snapshots | Lifecycle policy, delete snapshots older than retention |
+| Unused Elastic IPs | Release unattached EIPs (they cost money when idle) |
+| Oversized volumes | Right-size based on usage, use gp3 over gp2 |
+| Log retention forever | Set TTL: 7d hot, 30d warm, 90d cold |
+| Uncompressed backups | Enable compression before archiving |
+---
+## Data Transfer Costs
+### Cost Hierarchy (AWS, typical)
+```
+Free:        Inbound from internet, same-AZ within VPC
+Cheap:       Cross-AZ within region (~$0.01/GB)
+Moderate:    Cross-region (~$0.02/GB)
+Expensive:   Outbound to internet (~$0.09/GB)
+```
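The hierarchy above makes it easy to estimate a monthly transfer bill. The sketch below uses the approximate per-GB rates from the block above; the names `RATES_USD_PER_GB` and `monthly_transfer_cost` and the traffic volumes are hypothetical, and real pricing varies by region and volume tier.

```python
# Approximate USD per GB, taken from the cost hierarchy above.
RATES_USD_PER_GB = {
    "same_az": 0.0,
    "cross_az": 0.01,
    "cross_region": 0.02,
    "internet_egress": 0.09,
}


def monthly_transfer_cost(gb_by_path):
    """Sum the estimated cost for a mapping of traffic path to GB transferred."""
    return sum(RATES_USD_PER_GB[path] * gb for path, gb in gb_by_path.items())


# Hypothetical month: 1 TB of cross-AZ chatter and 500 GB served to the internet.
traffic = {"same_az": 5000, "cross_az": 1000, "internet_egress": 500}
print(f"estimated transfer cost: ${monthly_transfer_cost(traffic):.2f}")
```

In this example the 500 GB of internet egress costs more than four times the 1,000 GB of cross-AZ traffic, which is why CDN offload usually comes first.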
+### Optimization Strategies
+- **Keep traffic in-AZ.** Co-locate services that communicate heavily.
+- **Use VPC endpoints.** Avoid NAT Gateway charges for AWS service access
+  (S3, DynamoDB, ECR).
+- **CDN for static assets.** CloudFront/Cloudflare reduces origin egress.
+- **Compress data in transit.** gzip/brotli for HTTP, compression for backups.
+- **Cache aggressively.** Redis/Memcached for repeated external API calls.
+- **Use Private Link** for cross-account or cross-VPC communication.
+### NAT Gateway Cost Alert
+NAT Gateways charge per GB processed. Common cost traps:
+- Docker image pulls through NAT (use VPC endpoint for ECR).
+- Log shipping through NAT (use VPC endpoint for CloudWatch/S3).
+- S3 access through NAT (use S3 gateway endpoint — it is free).
+---
+## Budget Alerts
+### Alert Thresholds
+| Threshold | Action |
+|---|---|
+| 50% of monthly budget | Info notification to Slack |
+| 75% of monthly budget | Warning to team lead |
+| 90% of monthly budget | Alert to engineering manager |
+| 100% of monthly budget | Escalate — investigate immediately |
+| Anomaly (> 20% day-over-day) | Alert to on-call |
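The percentage rows of the threshold table above can be sketched as an ordered lookup that returns the most severe action reached. The names are hypothetical, and treating a threshold as reached at exactly that percentage (`>=`) is an assumption.

```python
# (threshold percent, action), most severe first, from the table above.
BUDGET_ACTIONS = [
    (100, "escalate and investigate immediately"),
    (90, "alert engineering manager"),
    (75, "warn team lead"),
    (50, "info notification to Slack"),
]


def budget_action(spend, budget):
    """Return the action for the highest threshold reached, or None below 50%."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used_pct = spend / budget * 100
    for threshold, action in BUDGET_ACTIONS:
        if used_pct >= threshold:
            return action
    return None


print(budget_action(spend=40000, budget=50000))
```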
+### Implementation
+```hcl
+resource "aws_budgets_budget" "monthly" {
+  name         = "monthly-total"
+  budget_type  = "COST"
+  limit_amount = "50000"
+  limit_unit   = "USD"
+  time_unit    = "MONTHLY"
+  notification {
+    comparison_operator       = "GREATER_THAN"
+    threshold                 = 75
+    threshold_type            = "PERCENTAGE"
+    notification_type         = "ACTUAL"
+    subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
+  }
+}
+```
+### Per-Team Budgets
+- Tag all resources with `CostCenter` and `Team`.
+- Create per-team budget alerts.
+- Share cost dashboards with engineering leads weekly.
+- Include cost delta in deploy notifications.
+---
+## Tagging for Cost Tracking
+### Required Tags (reiterated from IaC reference)
+| Tag | Purpose for Cost |
+|---|---|
+| `Environment` | Split prod vs dev/staging spend |
+| `Team` | Ownership and accountability |
+| `Service` | Per-service cost tracking |
+| `CostCenter` | Finance allocation |
+### Cost Allocation Reports
+- Enable AWS Cost and Usage Reports (CUR) or GCP billing export.
+- Visualize in Grafana, Kubecost, or cloud-native tools.
+- Track: cost per service, cost per environment, cost per customer (if multi-tenant).
+- Review weekly in engineering stand-up.
+---
+## Quick Wins Checklist
+- [ ] Delete unattached EBS volumes and unused EIPs.
+- [ ] Enable S3 lifecycle policies on all buckets.
+- [ ] Use gp3 instead of gp2 for EBS volumes (20% cheaper, better performance).
+- [ ] Add VPC endpoints for S3, ECR, CloudWatch, DynamoDB.
+- [ ] Right-size RDS instances (check Performance Insights).
+- [ ] Stop or terminate dev/staging resources outside business hours.
+- [ ] Use Graviton/ARM instances where possible (20% cheaper).
+- [ ] Enable Spot for CI runners and batch jobs.
+- [ ] Compress and set TTL on all log streams.
+- [ ] Review and delete old container images in ECR (lifecycle policy).
+- [ ] Consolidate underutilized EKS clusters.
+- [ ] Enable S3 Intelligent-Tiering for unpredictable access patterns.