aigent-team 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71)
  1. package/LICENSE +21 -0
  2. package/README.md +253 -0
  3. package/dist/chunk-N3RYHWTR.js +267 -0
  4. package/dist/cli.js +576 -0
  5. package/dist/index.d.ts +234 -0
  6. package/dist/index.js +27 -0
  7. package/package.json +67 -0
  8. package/templates/shared/git-workflow.md +44 -0
  9. package/templates/shared/project-conventions.md +48 -0
  10. package/templates/teams/ba/agent.yaml +25 -0
  11. package/templates/teams/ba/references/acceptance-criteria.md +87 -0
  12. package/templates/teams/ba/references/api-contract-design.md +110 -0
  13. package/templates/teams/ba/references/requirements-analysis.md +83 -0
  14. package/templates/teams/ba/references/user-story-mapping.md +73 -0
  15. package/templates/teams/ba/skill.md +85 -0
  16. package/templates/teams/be/agent.yaml +34 -0
  17. package/templates/teams/be/conventions.md +102 -0
  18. package/templates/teams/be/references/api-design.md +91 -0
  19. package/templates/teams/be/references/async-processing.md +86 -0
  20. package/templates/teams/be/references/auth-security.md +58 -0
  21. package/templates/teams/be/references/caching.md +79 -0
  22. package/templates/teams/be/references/database.md +65 -0
  23. package/templates/teams/be/references/error-handling.md +106 -0
  24. package/templates/teams/be/references/observability.md +83 -0
  25. package/templates/teams/be/references/review-checklist.md +50 -0
  26. package/templates/teams/be/references/testing.md +100 -0
  27. package/templates/teams/be/review-checklist.md +54 -0
  28. package/templates/teams/be/skill.md +71 -0
  29. package/templates/teams/devops/agent.yaml +35 -0
  30. package/templates/teams/devops/conventions.md +133 -0
  31. package/templates/teams/devops/references/ci-cd.md +218 -0
  32. package/templates/teams/devops/references/cost-optimization.md +218 -0
  33. package/templates/teams/devops/references/disaster-recovery.md +199 -0
  34. package/templates/teams/devops/references/docker.md +237 -0
  35. package/templates/teams/devops/references/infrastructure-as-code.md +238 -0
  36. package/templates/teams/devops/references/kubernetes.md +397 -0
  37. package/templates/teams/devops/references/monitoring.md +224 -0
  38. package/templates/teams/devops/references/review-checklist.md +149 -0
  39. package/templates/teams/devops/references/security.md +225 -0
  40. package/templates/teams/devops/review-checklist.md +72 -0
  41. package/templates/teams/devops/skill.md +131 -0
  42. package/templates/teams/fe/agent.yaml +28 -0
  43. package/templates/teams/fe/conventions.md +80 -0
  44. package/templates/teams/fe/references/accessibility.md +92 -0
  45. package/templates/teams/fe/references/component-architecture.md +87 -0
  46. package/templates/teams/fe/references/css-styling.md +89 -0
  47. package/templates/teams/fe/references/forms.md +73 -0
  48. package/templates/teams/fe/references/performance.md +104 -0
  49. package/templates/teams/fe/references/review-checklist.md +51 -0
  50. package/templates/teams/fe/references/security.md +90 -0
  51. package/templates/teams/fe/references/state-management.md +117 -0
  52. package/templates/teams/fe/references/testing.md +112 -0
  53. package/templates/teams/fe/review-checklist.md +53 -0
  54. package/templates/teams/fe/skill.md +68 -0
  55. package/templates/teams/lead/agent.yaml +18 -0
  56. package/templates/teams/lead/references/cross-team-coordination.md +68 -0
  57. package/templates/teams/lead/references/quality-gates.md +64 -0
  58. package/templates/teams/lead/references/task-decomposition.md +69 -0
  59. package/templates/teams/lead/skill.md +83 -0
  60. package/templates/teams/qa/agent.yaml +32 -0
  61. package/templates/teams/qa/conventions.md +130 -0
  62. package/templates/teams/qa/references/ci-integration.md +337 -0
  63. package/templates/teams/qa/references/e2e-testing.md +292 -0
  64. package/templates/teams/qa/references/mocking.md +249 -0
  65. package/templates/teams/qa/references/performance-testing.md +288 -0
  66. package/templates/teams/qa/references/review-checklist.md +143 -0
  67. package/templates/teams/qa/references/security-testing.md +271 -0
  68. package/templates/teams/qa/references/test-data.md +275 -0
  69. package/templates/teams/qa/references/test-strategy.md +192 -0
  70. package/templates/teams/qa/review-checklist.md +53 -0
  71. package/templates/teams/qa/skill.md +131 -0
package/templates/teams/devops/references/review-checklist.md
@@ -0,0 +1,149 @@
1
+ # DevOps Review Checklist
2
+
3
+ Use this checklist when reviewing infrastructure, deployment, and platform
4
+ changes. Not every item applies to every PR — use judgment, but default to
5
+ checking rather than skipping.
6
+
7
+ ---
8
+
9
+ ## Infrastructure as Code
10
+
11
+ - [ ] Changes are in Terraform/Pulumi/CDK — no ClickOps.
12
+ - [ ] `terraform plan` output attached to PR.
13
+ - [ ] No resources being destroyed unexpectedly (check plan for `destroy`).
14
+ - [ ] State is remote with locking enabled.
15
+ - [ ] Module versions pinned (no `ref=main`).
16
+ - [ ] Provider versions pinned with pessimistic constraint (`~>`).
17
+ - [ ] All resources tagged: Environment, Team, Service, ManagedBy, CostCenter.
18
+ - [ ] Resource naming follows convention (`{company}-{env}-{region}-{service}-{type}`).
19
+ - [ ] No hardcoded values — uses variables with descriptions and types.
20
+ - [ ] Sensitive values marked with `sensitive = true`.
21
+ - [ ] `lifecycle { prevent_destroy = true }` on critical resources (databases, S3).
22
+ - [ ] Outputs have descriptions.
23
+ - [ ] No `depends_on` unless absolutely necessary (prefer implicit).
24
+
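The naming convention `{company}-{env}-{region}-{service}-{type}` can be checked mechanically. The following is a minimal Python sketch, not part of the package. It assumes segments are lowercase alphanumerics, that the environment is one of `dev`, `staging`, or `prod`, and that regions use a compact code such as `use1` (a hyphenated region like `us-east-1` would be ambiguous under a hyphen-separated convention). `parse_resource_name` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Assumed segment rules (not stated in the checklist): lowercase letters and
# digits only, so a hyphen always marks a field boundary.
NAME_PATTERN = re.compile(
    r"^(?P<company>[a-z0-9]+)"
    r"-(?P<env>dev|staging|prod)"
    r"-(?P<region>[a-z0-9]+)"
    r"-(?P<service>[a-z0-9]+)"
    r"-(?P<type>[a-z0-9]+)$"
)


def parse_resource_name(name: str) -> Optional[Dict[str, str]]:
    """Return the five name fields, or None if the name breaks the convention."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None
```

A reviewer could run this over the resource names in a plan and flag every name for which it returns `None`.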
25
+ ---
26
+
27
+ ## Docker
28
+
29
+ - [ ] Base image pinned to specific version (no `:latest`).
30
+ - [ ] Multi-stage build separates build and runtime.
31
+ - [ ] Final image uses slim or distroless base.
32
+ - [ ] Runs as non-root user (`USER` directive or K8s securityContext).
33
+ - [ ] No secrets in image (no `COPY .env`, no secrets in `ARG`/`ENV`).
34
+ - [ ] `.dockerignore` present and comprehensive.
35
+ - [ ] Layer order optimized (deps before source code).
36
+ - [ ] `HEALTHCHECK` defined (or deferred to K8s probes).
37
+ - [ ] Image size within targets (Go < 30MB, Node < 150MB, Python < 200MB).
38
+ - [ ] Scanned with Trivy — no CRITICAL or HIGH vulnerabilities.
39
+ - [ ] BuildKit enabled (`DOCKER_BUILDKIT=1`).
40
+
41
+ ---
42
+
43
+ ## Kubernetes
44
+
45
+ - [ ] Resource requests AND limits set on every container.
46
+ - [ ] Liveness and readiness probes defined.
47
+ - [ ] Startup probe added for slow-starting apps.
48
+ - [ ] Minimum 2 replicas for production deployments.
49
+ - [ ] PodDisruptionBudget defined for 2+ replica deployments.
50
+ - [ ] Security context set:
51
+ - [ ] `runAsNonRoot: true`
52
+ - [ ] `readOnlyRootFilesystem: true`
53
+ - [ ] `allowPrivilegeEscalation: false`
54
+ - [ ] `capabilities.drop: ["ALL"]`
55
+ - [ ] Image uses immutable tag (git SHA or semver), not `:latest`.
56
+ - [ ] `imagePullPolicy` matches tag type (IfNotPresent for SHA/semver).
57
+ - [ ] Secrets sourced from ExternalSecrets/Vault, not inline K8s Secrets.
58
+ - [ ] NetworkPolicy exists for the namespace (default-deny + explicit allows).
59
+ - [ ] Service account created (not using `default`).
60
+ - [ ] TopologySpreadConstraints or pod anti-affinity for zone distribution.
61
+ - [ ] Prometheus scrape annotations present (if metrics endpoint exists).
62
+ - [ ] No `hostNetwork`, `hostPID`, `hostIPC`, `privileged` unless justified.
63
+
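The security context items above lend themselves to an automated check. The sketch below is an illustration in Python, assuming the container spec has already been parsed into a dictionary (for example from a manifest loaded as YAML). It inspects only the container-level `securityContext`; settings inherited from the pod level are not considered. `security_context_findings` is a hypothetical helper name.

```python
from typing import List

# Field values required by the checklist for every container.
REQUIRED_SECURITY_CONTEXT = {
    "runAsNonRoot": True,
    "readOnlyRootFilesystem": True,
    "allowPrivilegeEscalation": False,
}


def security_context_findings(container: dict) -> List[str]:
    """List the checklist items a single container spec fails to meet."""
    ctx = container.get("securityContext") or {}
    findings = []
    for key, expected in REQUIRED_SECURITY_CONTEXT.items():
        # A missing key counts as a failure: the checklist wants it explicit.
        if ctx.get(key) is not expected:
            findings.append(f"{key} must be {str(expected).lower()}")
    dropped = (ctx.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        findings.append('capabilities.drop must include "ALL"')
    return findings
```

An empty list means the container satisfies all four security context items.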
64
+ ---
65
+
66
+ ## CI/CD
67
+
68
+ - [ ] Pipeline runs lint, test, build, scan, deploy in order.
69
+ - [ ] Full pipeline completes in < 15 minutes.
70
+ - [ ] Dependencies cached (lockfile-based cache key).
71
+ - [ ] Docker build uses layer caching (BuildKit GHA cache or equivalent).
72
+ - [ ] Images tagged with git SHA (immutable).
73
+ - [ ] Image scanned before push to registry.
74
+ - [ ] Branch protection enforced (PR required, status checks, no force push).
75
+ - [ ] CI actions pinned to SHA, not tag.
76
+ - [ ] Secrets injected via CI secrets manager, not in pipeline files.
77
+ - [ ] Production deploy requires approval gate.
78
+ - [ ] Rollback mechanism documented and tested.
79
+ - [ ] Same image deployed across all environments (only config differs).
80
+ - [ ] No `--force` or `--skip-tests` flags in pipeline.
81
+
82
+ ---
83
+
84
+ ## Monitoring and Observability
85
+
86
+ - [ ] Golden signals dashboard exists (latency, traffic, errors, saturation).
87
+ - [ ] Dashboard definition stored in Git.
88
+ - [ ] Prometheus metrics endpoint exposed (`/metrics`).
89
+ - [ ] Alert rules defined for error rate and latency SLOs.
90
+ - [ ] Every alert has a severity label (P1-P4).
91
+ - [ ] Every alert links to a runbook.
92
+ - [ ] Logs are structured JSON with trace_id.
93
+ - [ ] No PII or secrets in logs.
94
+ - [ ] Log retention configured (not infinite).
95
+ - [ ] Traces instrumented at service boundaries.
96
+ - [ ] SLO defined and error budget tracked.
97
+
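For the "structured JSON with trace_id" item, a minimal Python sketch using only the standard `logging` module is shown here. The field names other than `trace_id` (`timestamp`, `level`, `logger`, `message`) are assumptions chosen for illustration, and `JsonFormatter` is a hypothetical class name.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is expected to arrive via the `extra` argument;
            # it is null when the caller did not supply one.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)
```

With the formatter attached to a handler, a call such as `logger.info("order created", extra={"trace_id": trace_id})` carries the trace ID into the JSON output.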
98
+ ---
99
+
100
+ ## Security
101
+
102
+ - [ ] Service has its own IAM role/service account (not shared).
103
+ - [ ] IAM permissions follow least privilege (no wildcards in prod).
104
+ - [ ] No secrets in git, environment variables in manifests, or Docker images.
105
+ - [ ] Secrets sourced from Vault/SM with rotation schedule.
106
+ - [ ] Network access restricted (security groups reference SG IDs, not `0.0.0.0/0`).
107
+ - [ ] TLS termination configured (no plaintext in transit).
108
+ - [ ] Image signed and signature verified on admission.
109
+ - [ ] Dependency versions pinned (lockfiles committed).
110
+ - [ ] SAST and secrets scanning enabled in CI.
111
+ - [ ] Audit logging enabled for sensitive operations.
112
+ - [ ] K8s RBAC configured (no cluster-admin for workloads).
113
+
114
+ ---
115
+
116
+ ## Disaster Recovery
117
+
118
+ - [ ] DR tier assigned and documented.
119
+ - [ ] Backup schedule meets RPO for assigned tier.
120
+ - [ ] Backup restore tested within the last 30 days.
121
+ - [ ] Failover procedure documented in runbook.
122
+ - [ ] Failover tested within the last 90 days.
123
+ - [ ] Recovery can be performed by any on-call engineer (not just the author).
124
+ - [ ] Critical data has cross-region replication.
125
+ - [ ] etcd / cluster state backed up.
126
+
127
+ ---
128
+
129
+ ## Cost
130
+
131
+ - [ ] All resources tagged for cost tracking.
132
+ - [ ] Instance/pod sizing justified (not over-provisioned).
133
+ - [ ] Storage lifecycle policies configured.
134
+ - [ ] VPC endpoints used for AWS service access (avoid NAT costs).
135
+ - [ ] Spot/preemptible used for non-critical workloads.
136
+ - [ ] Budget alerts configured for the team/service.
137
+ - [ ] No idle resources (unattached volumes, unused EIPs, stopped instances).
138
+ - [ ] Estimated monthly cost included in PR for new infrastructure.
139
+
140
+ ---
141
+
142
+ ## General
143
+
144
+ - [ ] Change is reversible (can roll back without data loss).
145
+ - [ ] Change has been tested in staging/dev first.
146
+ - [ ] Documentation updated (runbooks, architecture diagrams, service catalog).
147
+ - [ ] Affected teams notified (if cross-cutting change).
148
+ - [ ] On-call team aware of the change timing.
149
+ - [ ] Change window appropriate (not Friday afternoon, not during peak).
package/templates/teams/devops/references/security.md
@@ -0,0 +1,225 @@
1
+ # Security Reference
2
+
3
+ ## Least Privilege IAM
4
+
5
+ ### Principles
6
+
7
+ - Every service gets its own IAM role/service account. No shared credentials.
8
+ - Grant minimum permissions required. Start with zero, add as needed.
9
+ - Use condition-based policies (source IP, time, MFA) for sensitive operations.
10
+ - No wildcard (`*`) actions or resources in production policies.
11
+ - Review IAM policies quarterly; remove unused permissions.
12
+
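The "no wildcard actions or resources" principle can be linted before a policy reaches production. The Python sketch below assumes the policy is available as a parsed JSON document. It flags `*` and service-wide actions such as `s3:*`, and a bare `*` resource. Treating prefix-scoped ARNs (for example a bucket path ending in `/*`) as acceptable is an assumption of this sketch, and `wildcard_findings` is a hypothetical helper name.

```python
from typing import List


def _as_list(value) -> list:
    """IAM allows a single string or a list for Action and Resource."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def wildcard_findings(policy: dict) -> List[str]:
    """Report Allow statements that use wildcard actions or resources."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be unwrapped
        statements = [statements]
    findings = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action")):
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: wildcard action {action!r}")
        for resource in _as_list(stmt.get("Resource")):
            if resource == "*":
                findings.append(f"statement {index}: wildcard resource '*'")
    return findings
```

An empty result does not prove least privilege; it only rules out the most obvious over-grants.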
13
+ ### IAM Pattern (AWS Example)
14
+
15
+ ```hcl
+ # Per-service role with scoped permissions
+ resource "aws_iam_role" "user_api" {
+   name               = "${local.name_prefix}-user-api"
+   assume_role_policy = data.aws_iam_policy_document.eks_assume.json
+ }
+
+ resource "aws_iam_policy" "user_api" {
+   name = "${local.name_prefix}-user-api"
+   policy = jsonencode({
+     Version = "2012-10-17"
+     Statement = [
+       {
+         # Object access limited to this service's upload prefix
+         Effect   = "Allow"
+         Action   = ["s3:GetObject", "s3:PutObject"]
+         Resource = "arn:aws:s3:::${var.bucket_name}/user-uploads/*"
+       },
+       {
+         # Read-only access to secrets under this service's path
+         Effect   = "Allow"
+         Action   = ["secretsmanager:GetSecretValue"]
+         Resource = "arn:aws:secretsmanager:*:*:secret:user-api/*"
+       }
+     ]
+   })
+ }
+
+ # Attach the policy so the role actually receives the permissions
+ resource "aws_iam_role_policy_attachment" "user_api" {
+   role       = aws_iam_role.user_api.name
+   policy_arn = aws_iam_policy.user_api.arn
+ }
+ ```
41
+
42
+ ### Service Account Mapping (Kubernetes)
43
+
44
+ ```yaml
+ apiVersion: v1
+ kind: ServiceAccount
+ metadata:
+   name: user-api
+   annotations:
+     # AWS account IDs are 12 digits; this one is a placeholder
+     eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/user-api
+     # GKE: iam.gke.io/gcp-service-account: user-api@project.iam.gserviceaccount.com
+ ```
53
+
54
+ ### Human Access
55
+
56
+ - Use SSO (Okta, Azure AD) for all human access. No local IAM users.
57
+ - Require MFA for console and CLI access.
58
+ - Store break-glass credentials in a physical safe or a sealed-secret system.
59
+ - Admin access via just-in-time (JIT) elevation, not permanent grants.
60
+ - All access changes require PR review.
61
+
62
+ ---
63
+
64
+ ## Network Security
65
+
66
+ ### VPC Design
67
+
68
+ ```
+ VPC (10.0.0.0/16)
+ ├── Public Subnets (10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20)
+ │   └── Load Balancers, NAT Gateways, Bastion (if needed)
+ ├── Private Subnets (10.0.48.0/20, 10.0.64.0/20, 10.0.80.0/20)
+ │   └── Application workloads (EKS nodes, EC2, ECS)
+ └── Data Subnets (10.0.96.0/20, 10.0.112.0/20, 10.0.128.0/20)
+     └── Databases, caches, message brokers
+ ```
77
+
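The subnet layout above follows from carving the VPC into consecutive /20 blocks and assigning three to each tier. A short Python sketch with the standard `ipaddress` module reproduces it. The tier names and the `carve_subnets` helper are illustrative assumptions.

```python
import ipaddress

VPC = ipaddress.ip_network("10.0.0.0/16")
TIERS = ("public", "private", "data")
AZ_COUNT = 3  # one subnet per availability zone in each tier


def carve_subnets(vpc=VPC, new_prefix=20):
    """Assign consecutive blocks of the VPC to each tier, AZ_COUNT per tier."""
    blocks = vpc.subnets(new_prefix=new_prefix)  # yields blocks in address order
    return {tier: [next(blocks) for _ in range(AZ_COUNT)] for tier in TIERS}


if __name__ == "__main__":
    for tier, nets in carve_subnets().items():
        print(tier, ", ".join(str(net) for net in nets))
```

A /16 holds sixteen /20 blocks, so this layout uses nine and leaves seven free for future tiers or additional zones.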
78
+ ### Rules
79
+
80
+ - **Three availability zones minimum** for production.
81
+ - Application workloads in private subnets only.
82
+ - Databases in data subnets with no internet route.
83
+ - Egress via NAT Gateway (or VPC endpoints for AWS services).
84
+ - No SSH/RDP to production instances — use SSM Session Manager or
85
+ `kubectl exec` (with audit logging).
86
+
87
+ ### Security Groups / Firewall Rules
88
+
89
+ - Default: deny all inbound, deny all outbound.
90
+ - Allow only required ports between specific security groups.
91
+ - Reference security groups by ID, not CIDR (self-documenting, auto-updating).
92
+ - No `0.0.0.0/0` inbound rules except on load balancers (ports 80/443 only).
93
+ - Document every security group rule with a description.
94
+
95
+ ```hcl
+ resource "aws_security_group_rule" "api_from_alb" {
+   type                     = "ingress"
+   from_port                = 8080
+   to_port                  = 8080
+   protocol                 = "tcp"
+   source_security_group_id = aws_security_group.alb.id
+   security_group_id        = aws_security_group.api.id
+   description              = "Allow ALB to reach API on port 8080"
+ }
+ ```
106
+
107
+ ---
108
+
109
+ ## Secrets Management
110
+
111
+ ### Secrets Lifecycle
112
+
113
+ ```
114
+ Create → Store in Vault/SM → Reference in config → Rotate → Revoke
115
+ ```
116
+
117
+ ### Rotation Schedule
118
+
119
+ | Secret Type | Rotation Period | Method |
120
+ |---|---|---|
121
+ | Database credentials | 90 days | Automated (Vault dynamic secrets) |
122
+ | API keys (internal) | 90 days | Automated rotation |
123
+ | API keys (third-party) | Per vendor policy | Manual with ticket |
124
+ | TLS certificates | Auto-renew (cert-manager) | Let's Encrypt / ACM |
125
+ | SSH keys | 180 days | Key rotation pipeline |
126
+ | Encryption keys | 365 days | AWS KMS auto-rotation |
127
+
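The fixed rotation periods in the table can be turned into a simple overdue check. This Python sketch covers only the rows with a numeric period; third-party API keys and auto-renewed TLS certificates are deliberately left out. The dictionary keys and the `rotation_due` helper are hypothetical names.

```python
from datetime import date, timedelta

# Maximum secret age in days, taken from the rotation schedule table.
ROTATION_DAYS = {
    "database_credentials": 90,
    "internal_api_key": 90,
    "ssh_key": 180,
    "encryption_key": 365,
}


def rotation_due(secret_type: str, last_rotated: date, today: date) -> bool:
    """True when the secret has reached or passed its rotation period."""
    max_age = timedelta(days=ROTATION_DAYS[secret_type])
    return today - last_rotated >= max_age
```

An unknown secret type raises `KeyError`, which is intentional here: a secret without a defined rotation period should be surfaced rather than silently passed.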
128
+ ### Rules
129
+
130
+ - Never store secrets in: git, CI pipeline files, Docker images, environment
131
+ variables in K8s manifests, ConfigMaps.
132
+ - Use: HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager,
133
+ Azure Key Vault, SOPS for encrypted files.
134
+ - Mount as files, not env vars (env vars appear in `kubectl describe pod`,
135
+ crash dumps, `/proc/*/environ`).
136
+ - Audit all secret access — who read what, when.
137
+ - Revoke immediately on employee departure or credential compromise.
138
+
139
+ ---
140
+
141
+ ## Image Scanning
142
+
143
+ ### Pipeline Integration
144
+
145
+ ```yaml
+ # Scan in CI before pushing to registry
+ - name: Scan image
+   run: |
+     trivy image --exit-code 1 --severity CRITICAL,HIGH \
+       --ignore-unfixed myapp:${{ github.sha }}
+ ```
152
+
153
+ ### Scanning Layers
154
+
155
+ | Layer | Tool | When |
156
+ |---|---|---|
157
+ | Base image CVEs | Trivy, Grype | Every build + nightly |
158
+ | Application dependencies | Snyk, npm audit, pip-audit | Every build |
159
+ | IaC misconfigurations | Checkov, tfsec, KICS | Every PR |
160
+ | Secrets in code | gitleaks, trufflehog | Every commit |
161
+ | Runtime behavior | Falco, Sysdig | Continuous |
162
+ | Container compliance | Docker Bench, OPA | Nightly |
163
+
164
+ ### Rules
165
+
166
+ - Block deployment on CRITICAL or HIGH vulnerabilities.
167
+ - Allow MEDIUM/LOW with documented exception and timeline to fix.
168
+ - Scan running images nightly — new CVEs appear after deploy.
169
+ - Track vulnerability metrics: time-to-remediate, open count by severity.
170
+ - Base image rebuilds trigger downstream rebuilds automatically.
171
+
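The blocking rule above can also be applied to a saved scan report, for instance when collecting metrics on open findings by severity. This Python sketch assumes a report produced with Trivy's JSON output format, where a top-level `Results` list holds entries whose `Vulnerabilities` items carry a `Severity` field. Verify the layout against the Trivy version in use. `severity_counts` and `should_block` are hypothetical helper names.

```python
from collections import Counter

BLOCKING = {"CRITICAL", "HIGH"}


def severity_counts(report: dict) -> Counter:
    """Count findings by severity across all scan targets in the report."""
    counts = Counter()
    for result in report.get("Results") or []:
        # Targets without findings may omit the key or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def should_block(report: dict) -> bool:
    """Apply the rule: block deployment on any CRITICAL or HIGH finding."""
    counts = severity_counts(report)
    return any(counts[severity] > 0 for severity in BLOCKING)
```

The counts returned by `severity_counts` can feed the "open count by severity" metric, while `should_block` mirrors the exit-code gate used in CI.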
172
+ ---
173
+
174
+ ## Audit Logging
175
+
176
+ ### What to Log
177
+
178
+ | Event | Required Fields |
179
+ |---|---|
180
+ | Authentication (success/fail) | who, when, source IP, method |
181
+ | Authorization (denied) | who, what resource, what action |
182
+ | Resource creation/modification/deletion | who, what, when, before/after |
183
+ | Secret access | who, which secret, when |
184
+ | Configuration changes | who, what changed, PR link |
185
+ | Privilege escalation | who, from/to role, when |
186
+
187
+ ### Implementation
188
+
189
+ - Enable cloud provider audit logs (CloudTrail, GCP Audit, Azure Activity).
190
+ - Enable K8s audit logging for all write operations.
191
+ - Ship audit logs to immutable storage (S3 with Object Lock, WORM).
192
+ - Retain audit logs for minimum 1 year (adjust per compliance: SOC2, HIPAA, PCI).
193
+ - Alert on: root account usage, IAM policy changes, security group changes,
194
+ failed auth spike, secret access from unexpected sources.
195
+
196
+ ### Kubernetes Audit Policy
197
+
198
+ ```yaml
+ apiVersion: audit.k8s.io/v1
+ kind: Policy
+ # Rules are evaluated in order; the first matching rule sets the level.
+ rules:
+   # Metadata only, so secret and configmap payloads never reach the log
+   - level: Metadata
+     resources:
+       - group: ""
+         resources: ["secrets", "configmaps"]
+   # Full request and response for interactive access to pods
+   - level: RequestResponse
+     resources:
+       - group: ""
+         resources: ["pods/exec", "pods/portforward"]
+     verbs: ["create"]
+   # Metadata for every other write operation
+   - level: Metadata
+     verbs: ["create", "update", "patch", "delete"]
+ ```
214
+
215
+ ---
216
+
217
+ ## Supply Chain Security
218
+
219
+ - Sign container images with cosign/Sigstore.
220
+ - Verify signatures before admission (Kyverno, OPA Gatekeeper).
221
+ - Pin dependencies to exact versions (lockfiles).
222
+ - Pin CI actions to SHA, not tag.
223
+ - Use private registries — pull-through cache for public images.
224
+ - Generate and store SBOMs (Software Bill of Materials) with each build.
225
+ - Review and approve new dependencies before adding.
package/templates/teams/devops/review-checklist.md
@@ -0,0 +1,72 @@
1
+ ### Infrastructure as Code
2
+ - [ ] Changes are in Terraform/Pulumi/CDK — no manual console changes
3
+ - [ ] Remote state configured with locking enabled
4
+ - [ ] `terraform plan` reviewed — no unexpected destroys on stateful resources
5
+ - [ ] Resources tagged: project, environment, team, cost-center
6
+ - [ ] Module versions pinned (not pointing to main branch)
7
+ - [ ] Sensitive variables marked as `sensitive = true`
8
+ - [ ] No hardcoded values — all configurable via variables with defaults
9
+
10
+ ### Docker & Container Security
11
+ - [ ] Multi-stage build — final image contains only runtime dependencies
12
+ - [ ] Non-root user configured (`USER` instruction, `runAsNonRoot: true`)
13
+ - [ ] Base image uses specific version tag (not `latest`)
14
+ - [ ] Base image is slim/distroless variant
15
+ - [ ] No secrets in Dockerfile, build args, or image layers
16
+ - [ ] .dockerignore excludes .git, node_modules, .env, test files
17
+ - [ ] Image scanned with Trivy — 0 critical, 0 high vulnerabilities
18
+ - [ ] Image size within target (<200MB for Node, <50MB for Go)
19
+ - [ ] HEALTHCHECK instruction present
20
+
21
+ ### Kubernetes
22
+ - [ ] Resource requests AND limits set (requests = p50 usage, limits = p99 + headroom)
23
+ - [ ] Liveness probe checks process health only (not dependencies)
24
+ - [ ] Readiness probe checks ability to serve traffic (DB connected, cache available)
25
+ - [ ] PodDisruptionBudget configured (`minAvailable: 50%` or `maxUnavailable: 1`)
26
+ - [ ] Security context: `runAsNonRoot`, `drop: [ALL]` capabilities, `readOnlyRootFilesystem`
27
+ - [ ] Network policies restrict pod-to-pod traffic (default deny, explicit allow)
28
+ - [ ] Secrets via ExternalSecrets/Vault — not in K8s manifests (base64 is not encryption)
29
+ - [ ] Minimum 2 replicas for production deployments
30
+ - [ ] Image pull policy is `IfNotPresent` (not `Always`)
31
+ - [ ] Anti-affinity rules spread pods across nodes/AZs
32
+
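The sizing rule "requests = p50 usage, limits = p99 + headroom" can be computed from observed usage samples. The Python sketch below uses a nearest-rank percentile and assumes CPU samples in millicores and 20% headroom; the headroom value is an assumption, not something the checklist specifies. `percentile` and `suggest_cpu` are hypothetical helper names.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_cpu(samples_millicores: Sequence[int], headroom: float = 0.2) -> Dict[str, str]:
    """Derive a CPU request (p50) and limit (p99 plus headroom) in millicores."""
    request = percentile(samples_millicores, 50)
    limit = math.ceil(percentile(samples_millicores, 99) * (1 + headroom))
    return {"request": f"{request}m", "limit": f"{limit}m"}
```

The same approach applies to memory with byte-based samples, though memory limits usually deserve more generous headroom because exceeding them kills the container.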
33
+ ### CI/CD Pipeline
34
+ - [ ] All stages present: lint → test → build → security scan → deploy
35
+ - [ ] Security scanning: SAST, dependency audit, container scan, secret detection
36
+ - [ ] Build uses caching (Docker layers, dependency cache, test cache)
37
+ - [ ] Artifacts tagged with branch-sha-buildnumber (not `latest`)
38
+ - [ ] Production deploy has manual approval gate
39
+ - [ ] Rollback mechanism defined and tested (automated on error rate spike)
40
+ - [ ] Pipeline total <15 minutes
41
+ - [ ] Pipeline steps containerized (runs same locally and in CI)
42
+
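A `branch-sha-buildnumber` tag needs sanitizing, because branch names may contain characters such as `/` that image tags do not allow. The Python sketch below assumes a 7-character short SHA and the common image tag constraints (letters, digits, `_`, `.`, `-`, at most 128 characters, not starting with `.` or `-`). `artifact_tag` and the `detached` fallback are hypothetical.

```python
import re

MAX_TAG_LENGTH = 128


def artifact_tag(branch: str, sha: str, build_number: int) -> str:
    """Build an immutable artifact tag of the form branch-sha-buildnumber."""
    # Replace runs of disallowed characters and trim leading/trailing separators.
    safe_branch = re.sub(r"[^A-Za-z0-9_.-]+", "-", branch).strip("-.").lower()
    safe_branch = safe_branch or "detached"  # hypothetical fallback name
    suffix = f"-{sha[:7]}-{build_number}"
    # Truncate the branch part, never the SHA or build number.
    return safe_branch[: MAX_TAG_LENGTH - len(suffix)] + suffix
```

Keeping the SHA and build number intact under truncation preserves the property that matters: the tag still identifies exactly one build.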
43
+ ### Monitoring & Alerting
44
+ - [ ] New service has all 3 pillars: metrics, logs, traces
45
+ - [ ] Dashboard with golden signals: latency, traffic, errors, saturation
46
+ - [ ] Alerts on symptoms (error rate, latency) not causes (CPU, memory)
47
+ - [ ] Alert severity appropriate (P1 pages on-call, P2 Slack, P3 ticket)
48
+ - [ ] Runbook exists for every P1/P2 alert
49
+ - [ ] Log format is structured JSON with request_id, service, environment
50
+
51
+ ### Security
52
+ - [ ] IAM roles follow least privilege — no `*` resources, no admin policies
53
+ - [ ] Network: app in private subnets, only LB in public, security groups restrictive
54
+ - [ ] No hardcoded secrets, API keys, or credentials anywhere in code
55
+ - [ ] TLS everywhere — external (HTTPS) and internal (mTLS or VPC-internal)
56
+ - [ ] Audit logging enabled for infrastructure changes
57
+ - [ ] Image signing verified for production deployments
58
+
59
+ ### Disaster Recovery & Reliability
60
+ - [ ] Backup schedule configured for databases and persistent storage
61
+ - [ ] Backup restore tested within the last month
62
+ - [ ] Multi-AZ deployment for Tier 1 services
63
+ - [ ] Failover procedure documented in runbook
64
+ - [ ] RTO/RPO requirements documented and achievable
65
+ - [ ] Graceful shutdown handles SIGTERM (drain connections, finish in-flight requests)
66
+
67
+ ### Cost
68
+ - [ ] All new resources tagged for cost tracking
69
+ - [ ] Instance sizing justified by usage metrics (not "biggest available")
70
+ - [ ] Spot/preemptible used for stateless workloads where possible
71
+ - [ ] Storage lifecycle policies set (transition to cold storage, expiration)
72
+ - [ ] Cost impact estimated before provisioning (especially for data transfer, managed services)
package/templates/teams/devops/skill.md
@@ -0,0 +1,131 @@
1
+ # DevOps / SRE Engineer — Skill Index
2
+
3
+ You are a senior DevOps and Site Reliability Engineer. You treat infrastructure
4
+ as software, reliability as a feature, and cost as a constraint. Every change
5
+ ships through a pipeline, every resource is tracked in code, every incident
6
+ feeds back into prevention.
7
+
8
+ ---
9
+
10
+ ## Core Principles
11
+
12
+ 1. **Automate anything done twice.** Manual steps are bugs waiting to happen.
13
+ If a runbook has more than three steps, script it.
14
+ 2. **Immutable infrastructure.** Replace, never patch. Bake images, push new
15
+ versions, roll back by redeploying the previous artifact.
16
+ 3. **Defense in depth.** No single control is sufficient. Layer network rules,
17
+ IAM policies, runtime scanning, and audit logging.
18
+ 4. **Blast radius minimization.** Canary first, then percentage rollout. Use
19
+ namespaces, accounts, and feature flags to contain failures.
20
+ 5. **Cost is a feature.** Right-size from day one. Tag every resource. Review
21
+ spend weekly. Reserved capacity for stable workloads, spot/preemptible for
22
+ ephemeral ones.
23
+ 6. **Observability over monitoring.** You cannot alert on what you cannot see.
24
+ Instrument first, then set thresholds.
25
+ 7. **Everything in version control.** Infrastructure, dashboards, alert rules,
26
+ runbooks — all live in Git. No ClickOps.
27
+
28
+ ---
29
+
30
+ ## Key Anti-Patterns — Stop These on Sight
31
+
32
+ | Anti-Pattern | Why It Hurts | Fix |
33
+ |---|---|---|
34
+ | `kubectl exec` in prod | Bypasses audit trail, breaks immutability | Add debug sidecar or ephemeral container |
35
+ | Local Terraform state | No locking, no team collaboration, easy to lose | Remote backend (S3+DynamoDB, GCS, Terraform Cloud) |
36
+ | `:latest` image tag | Non-reproducible deploys, cache-busting | Use immutable tags: git SHA or semver |
37
+ | Root containers | A container breakout yields root on the host | `runAsNonRoot: true`, distroless base |
38
+ | Single replica in prod | Any restart = downtime | Minimum 2 replicas + PodDisruptionBudget |
39
+ | Alert on everything | Alert fatigue, pages get ignored | Alert on symptoms (SLO burn rate), not causes |
40
+ | Secrets in env vars / git | Leaked in logs, process lists, crash dumps | External secrets manager (Vault, AWS SM, SOPS) |
41
+ | No resource limits | Noisy neighbor kills node | Always set requests AND limits |
42
+ | Manual deploy steps | Unrepeatable and error-prone ("it works on my machine") | Full CI/CD; humans approve, pipelines execute |
43
+
44
+ ---
45
+
46
+ ## Decision Framework — Deployment Strategy
47
+
48
+ ```
+ Is downtime acceptable?
+ ├─ Yes → Recreate (simplest, for dev/staging)
+ └─ No
+    ├─ Need instant rollback?
+    │  ├─ Yes → Blue/Green (two full environments)
+    │  └─ No
+    │     ├─ Want gradual traffic shift?
+    │     │  ├─ Yes → Canary (progressive delivery)
+    │     │  └─ No → Rolling Update (K8s default)
+    │     └─ Need feature-level control?
+    │        └─ Yes → Feature Flags + Rolling
+    └─ Multi-region?
+       └─ Yes → Blue/Green per region, sequential rollout
+ ```
63
+
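The decision framework can be expressed as a small function. The tree does not state how its sibling questions rank against each other, so this Python sketch encodes one possible reading: multi-region is checked first, then instant rollback, gradual shift, and feature-level control. That ordering and the `choose_strategy` name are assumptions for illustration.

```python
def choose_strategy(
    downtime_ok: bool,
    instant_rollback: bool = False,
    gradual_shift: bool = False,
    feature_control: bool = False,
    multi_region: bool = False,
) -> str:
    """Pick a deployment strategy following one reading of the decision tree."""
    if downtime_ok:
        return "Recreate"
    if multi_region:
        return "Blue/Green per region, sequential rollout"
    if instant_rollback:
        return "Blue/Green"
    if gradual_shift:
        return "Canary"
    if feature_control:
        return "Feature Flags + Rolling"
    return "Rolling Update"
```

For example, `choose_strategy(False, gradual_shift=True)` selects Canary, while a call with no requirements beyond zero downtime falls through to Rolling Update.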
64
+ ### Strategy Quick Reference
65
+
66
+ | Strategy | Rollback Speed | Resource Cost | Complexity |
67
+ |---|---|---|---|
68
+ | Recreate | Minutes (redeploy) | 1x | Low |
69
+ | Rolling Update | Minutes (undo) | 1x-1.25x | Low |
70
+ | Blue/Green | Seconds (DNS/LB) | 2x | Medium |
71
+ | Canary | Seconds (route) | 1.1x | High |
72
+
73
+ ---
74
+
75
+ ## Reference Files
76
+
77
+ | File | Covers |
78
+ |---|---|
79
+ | [infrastructure-as-code.md](references/infrastructure-as-code.md) | Terraform structure, state management, modules, tagging |
80
+ | [docker.md](references/docker.md) | Base images, multi-stage builds, security, layer optimization |
81
+ | [kubernetes.md](references/kubernetes.md) | Namespaces, resources, probes, PDB, network policies, secrets |
82
+ | [ci-cd.md](references/ci-cd.md) | Pipeline stages, speed targets, caching, rollback |
83
+ | [monitoring.md](references/monitoring.md) | Metrics/logs/traces, golden signals, alerting, runbooks |
84
+ | [security.md](references/security.md) | IAM, network, secrets rotation, image scanning, audit |
85
+ | [disaster-recovery.md](references/disaster-recovery.md) | RPO/RTO, backups, failover, chaos engineering |
86
+ | [cost-optimization.md](references/cost-optimization.md) | Right-sizing, reservations, storage lifecycle, budgets |
87
+ | [review-checklist.md](references/review-checklist.md) | Full review checklist across all domains |
88
+
89
+ ---
90
+
91
+ ## Workflow Index
92
+
93
+ ### 1. New Infrastructure Component
94
+ 1. Define in IaC (Terraform/Pulumi) — see `infrastructure-as-code.md`
95
+ 2. Add monitoring — see `monitoring.md`
96
+ 3. Document DR posture — see `disaster-recovery.md`
97
+ 4. Tag for cost tracking — see `cost-optimization.md`
98
+ 5. Review against checklist — see `review-checklist.md`
99
+
100
+ ### 2. Containerize a Service
101
+ 1. Write Dockerfile — see `docker.md`
102
+ 2. Add K8s manifests — see `kubernetes.md`
103
+ 3. Wire into CI/CD pipeline — see `ci-cd.md`
104
+ 4. Add probes and metrics endpoint — see `monitoring.md`
105
+ 5. Harden security context — see `security.md`
106
+
107
+ ### 3. Incident Response
108
+ 1. Acknowledge alert, open incident channel
109
+ 2. Assess blast radius using dashboards — see `monitoring.md`
110
+ 3. Mitigate: rollback, scale, or isolate
111
+ 4. Investigate root cause with traces/logs
112
+ 5. Write postmortem, create prevention tasks
113
+ 6. Update runbooks — see `disaster-recovery.md`
114
+
115
+ ### 4. Capacity Planning Review
116
+ 1. Pull 30-day resource utilization — see `monitoring.md`
117
+ 2. Identify right-sizing opportunities — see `cost-optimization.md`
118
+ 3. Forecast growth from product roadmap
119
+ 4. Adjust reservations and autoscaling thresholds
120
+ 5. Update budget alerts
121
+
122
+ ---
123
+
124
+ ## Golden Rules for Code Review
125
+
126
+ - Every Terraform change must include a `plan` output in the PR.
127
+ - No Dockerfile change may increase image size without justification.
128
+ - Every K8s manifest must set resource requests, limits, and probes.
129
+ - No CI/CD change may push pipeline duration beyond 15 minutes.
130
+ - Every alert rule must link to a runbook.
131
+ - Every secret must reference an external secrets manager, never inline.
package/templates/teams/fe/agent.yaml
@@ -0,0 +1,28 @@
1
+ id: fe
+ name: Frontend Agent
+ description: >
+   Senior frontend engineer agent. Expert in rendering performance,
+   component architecture, state management, accessibility, and modern SSR/SSG.
+ role: fe
+ techStack:
+   languages: [TypeScript, JavaScript, CSS, HTML]
+   frameworks: [React, Next.js, Vue, Angular, Svelte]
+   libraries: [Tailwind CSS, Zustand, Jotai, TanStack Query, React Hook Form, Zod, Radix UI]
+   buildTools: [Vite, Turbopack, Webpack, esbuild, SWC, Storybook]
+ tools:
+   allowed: [Read, Write, Edit, Bash, Grep, Glob]
+ globs:
+   - "**/*.tsx"
+   - "**/*.jsx"
+   - "**/*.css"
+   - "**/*.scss"
+   - "**/*.vue"
+   - "**/*.svelte"
+   - "src/components/**/*"
+   - "src/pages/**/*"
+   - "src/app/**/*"
+   - "src/hooks/**/*"
+   - "src/stores/**/*"
+ sharedKnowledge:
+   - project-conventions
+   - git-workflow