aigent-team 0.1.0
- package/LICENSE +21 -0
- package/README.md +253 -0
- package/dist/chunk-N3RYHWTR.js +267 -0
- package/dist/cli.js +576 -0
- package/dist/index.d.ts +234 -0
- package/dist/index.js +27 -0
- package/package.json +67 -0
- package/templates/shared/git-workflow.md +44 -0
- package/templates/shared/project-conventions.md +48 -0
- package/templates/teams/ba/agent.yaml +25 -0
- package/templates/teams/ba/references/acceptance-criteria.md +87 -0
- package/templates/teams/ba/references/api-contract-design.md +110 -0
- package/templates/teams/ba/references/requirements-analysis.md +83 -0
- package/templates/teams/ba/references/user-story-mapping.md +73 -0
- package/templates/teams/ba/skill.md +85 -0
- package/templates/teams/be/agent.yaml +34 -0
- package/templates/teams/be/conventions.md +102 -0
- package/templates/teams/be/references/api-design.md +91 -0
- package/templates/teams/be/references/async-processing.md +86 -0
- package/templates/teams/be/references/auth-security.md +58 -0
- package/templates/teams/be/references/caching.md +79 -0
- package/templates/teams/be/references/database.md +65 -0
- package/templates/teams/be/references/error-handling.md +106 -0
- package/templates/teams/be/references/observability.md +83 -0
- package/templates/teams/be/references/review-checklist.md +50 -0
- package/templates/teams/be/references/testing.md +100 -0
- package/templates/teams/be/review-checklist.md +54 -0
- package/templates/teams/be/skill.md +71 -0
- package/templates/teams/devops/agent.yaml +35 -0
- package/templates/teams/devops/conventions.md +133 -0
- package/templates/teams/devops/references/ci-cd.md +218 -0
- package/templates/teams/devops/references/cost-optimization.md +218 -0
- package/templates/teams/devops/references/disaster-recovery.md +199 -0
- package/templates/teams/devops/references/docker.md +237 -0
- package/templates/teams/devops/references/infrastructure-as-code.md +238 -0
- package/templates/teams/devops/references/kubernetes.md +397 -0
- package/templates/teams/devops/references/monitoring.md +224 -0
- package/templates/teams/devops/references/review-checklist.md +149 -0
- package/templates/teams/devops/references/security.md +225 -0
- package/templates/teams/devops/review-checklist.md +72 -0
- package/templates/teams/devops/skill.md +131 -0
- package/templates/teams/fe/agent.yaml +28 -0
- package/templates/teams/fe/conventions.md +80 -0
- package/templates/teams/fe/references/accessibility.md +92 -0
- package/templates/teams/fe/references/component-architecture.md +87 -0
- package/templates/teams/fe/references/css-styling.md +89 -0
- package/templates/teams/fe/references/forms.md +73 -0
- package/templates/teams/fe/references/performance.md +104 -0
- package/templates/teams/fe/references/review-checklist.md +51 -0
- package/templates/teams/fe/references/security.md +90 -0
- package/templates/teams/fe/references/state-management.md +117 -0
- package/templates/teams/fe/references/testing.md +112 -0
- package/templates/teams/fe/review-checklist.md +53 -0
- package/templates/teams/fe/skill.md +68 -0
- package/templates/teams/lead/agent.yaml +18 -0
- package/templates/teams/lead/references/cross-team-coordination.md +68 -0
- package/templates/teams/lead/references/quality-gates.md +64 -0
- package/templates/teams/lead/references/task-decomposition.md +69 -0
- package/templates/teams/lead/skill.md +83 -0
- package/templates/teams/qa/agent.yaml +32 -0
- package/templates/teams/qa/conventions.md +130 -0
- package/templates/teams/qa/references/ci-integration.md +337 -0
- package/templates/teams/qa/references/e2e-testing.md +292 -0
- package/templates/teams/qa/references/mocking.md +249 -0
- package/templates/teams/qa/references/performance-testing.md +288 -0
- package/templates/teams/qa/references/review-checklist.md +143 -0
- package/templates/teams/qa/references/security-testing.md +271 -0
- package/templates/teams/qa/references/test-data.md +275 -0
- package/templates/teams/qa/references/test-strategy.md +192 -0
- package/templates/teams/qa/review-checklist.md +53 -0
- package/templates/teams/qa/skill.md +131 -0
@@ -0,0 +1,149 @@
# DevOps Review Checklist

Use this checklist when reviewing infrastructure, deployment, and platform
changes. Not every item applies to every PR — use judgment, but default to
checking rather than skipping.

---

## Infrastructure as Code

- [ ] Changes are in Terraform/Pulumi/CDK — no ClickOps.
- [ ] `terraform plan` output attached to PR.
- [ ] No resources being destroyed unexpectedly (check plan for `destroy`).
- [ ] State is remote with locking enabled.
- [ ] Module versions pinned (no `ref=main`).
- [ ] Provider versions pinned with pessimistic constraint (`~>`).
- [ ] All resources tagged: Environment, Team, Service, ManagedBy, CostCenter.
- [ ] Resource naming follows convention (`{company}-{env}-{region}-{service}-{type}`).
- [ ] No hardcoded values — uses variables with descriptions and types.
- [ ] Sensitive values marked with `sensitive = true`.
- [ ] `lifecycle { prevent_destroy = true }` on critical resources (databases, S3).
- [ ] Outputs have descriptions.
- [ ] No `depends_on` unless absolutely necessary (prefer implicit).
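The "no unexpected destroys" item above can be partly automated. The following is a minimal sketch, assuming the plan has been exported with `terraform show -json` so that each entry under `resource_changes` carries an `address` and a `change.actions` list. The function name `find_destroys` and the sample plan are hypothetical and exist only for illustration. A replacement shows up as a delete combined with a create, so it is reported as well.

```python
import json


def find_destroys(plan: dict) -> list[str]:
    """Return addresses of resources whose planned actions include "delete"."""
    destroyed = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            destroyed.append(change["address"])
    return destroyed


# Hypothetical, heavily trimmed plan document used only for illustration.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_role.user_api", "change": {"actions": ["update"]}}
  ]
}
""")

# A reviewer (or a CI step) would fail the check when this list is not empty.
print(find_destroys(sample_plan))
```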

---

## Docker

- [ ] Base image pinned to specific version (no `:latest`).
- [ ] Multi-stage build separates build and runtime.
- [ ] Final image uses slim or distroless base.
- [ ] Runs as non-root user (`USER` directive or K8s securityContext).
- [ ] No secrets in image (no `COPY .env`, no secrets in `ARG`/`ENV`).
- [ ] `.dockerignore` present and comprehensive.
- [ ] Layer order optimized (deps before source code).
- [ ] `HEALTHCHECK` defined (or deferred to K8s probes).
- [ ] Image size within targets (Go < 30MB, Node < 150MB, Python < 200MB).
- [ ] Scanned with Trivy — no CRITICAL or HIGH vulnerabilities.
- [ ] BuildKit enabled (`DOCKER_BUILDKIT=1`).
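The base-image pinning item above lends itself to a quick lint. This is a rough sketch, not a full Dockerfile parser: it assumes one `FROM` instruction per line, and the function name `unpinned_base_images` and the sample Dockerfile are made up for illustration. Images referenced through build arguments (for example `${BASE}`) are not resolved and would be reported as untagged.

```python
def unpinned_base_images(dockerfile: str) -> list[str]:
    """Return FROM images that have no tag, use :latest, and are not digest-pinned."""
    stage_names = set()
    flagged = []
    for raw in dockerfile.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        alias = args[2] if len(args) >= 3 and args[1].upper() == "AS" else None
        # Earlier build stages and the empty "scratch" image need no tag.
        if image not in stage_names and image != "scratch" and "@sha256:" not in image:
            last_segment = image.rsplit("/", 1)[-1]  # ignore registry host:port
            tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
            if tag in ("", "latest"):
                flagged.append(image)
        if alias:
            stage_names.add(alias)
    return flagged


# Hypothetical Dockerfile used only for illustration.
sample = """\
FROM node:20.11-slim AS build
FROM node:latest AS tools
FROM build AS test
FROM gcr.io/distroless/nodejs20-debian12
"""

print(unpinned_base_images(sample))
```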

---

## Kubernetes

- [ ] Resource requests AND limits set on every container.
- [ ] Liveness and readiness probes defined.
- [ ] Startup probe added for slow-starting apps.
- [ ] Minimum 2 replicas for production deployments.
- [ ] PodDisruptionBudget defined for 2+ replica deployments.
- [ ] Security context set:
  - [ ] `runAsNonRoot: true`
  - [ ] `readOnlyRootFilesystem: true`
  - [ ] `allowPrivilegeEscalation: false`
  - [ ] `capabilities.drop: ["ALL"]`
- [ ] Image uses immutable tag (git SHA or semver), not `:latest`.
- [ ] `imagePullPolicy` matches tag type (IfNotPresent for SHA/semver).
- [ ] Secrets sourced from ExternalSecrets/Vault, not inline K8s Secrets.
- [ ] NetworkPolicy exists for the namespace (default-deny + explicit allows).
- [ ] Service account created (not using `default`).
- [ ] TopologySpreadConstraints or pod anti-affinity for zone distribution.
- [ ] Prometheus scrape annotations present (if metrics endpoint exists).
- [ ] No `hostNetwork`, `hostPID`, `hostIPC`, `privileged` unless justified.
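The security-context and resource items above can be checked against a container spec once the manifest has been loaded into a dictionary. The sketch below assumes that loading step has already happened (for example with a YAML parser) and only inspects container-level settings; `runAsNonRoot` set at pod level is not considered here. The function name `security_context_gaps` and the sample container are hypothetical.

```python
def security_context_gaps(container: dict) -> list[str]:
    """List the checklist settings a single container spec is missing."""
    ctx = container.get("securityContext", {})
    gaps = []
    if ctx.get("runAsNonRoot") is not True:
        gaps.append("runAsNonRoot: true")
    if ctx.get("readOnlyRootFilesystem") is not True:
        gaps.append("readOnlyRootFilesystem: true")
    if ctx.get("allowPrivilegeEscalation") is not False:
        gaps.append("allowPrivilegeEscalation: false")
    if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
        gaps.append('capabilities.drop: ["ALL"]')
    resources = container.get("resources", {})
    for key in ("requests", "limits"):
        if not resources.get(key):
            gaps.append(f"resources.{key}")
    return gaps


# Hypothetical container spec that satisfies every item checked above.
hardened = {
    "name": "user-api",
    "securityContext": {
        "runAsNonRoot": True,
        "readOnlyRootFilesystem": True,
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
    },
    "resources": {
        "requests": {"cpu": "100m", "memory": "128Mi"},
        "limits": {"cpu": "500m", "memory": "256Mi"},
    },
}

print(security_context_gaps(hardened))
print(security_context_gaps({"name": "bare"}))
```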

---

## CI/CD

- [ ] Pipeline runs lint, test, build, scan, deploy in order.
- [ ] Full pipeline completes in < 15 minutes.
- [ ] Dependencies cached (lockfile-based cache key).
- [ ] Docker build uses layer caching (BuildKit GHA cache or equivalent).
- [ ] Images tagged with git SHA (immutable).
- [ ] Image scanned before push to registry.
- [ ] Branch protection enforced (PR required, status checks, no force push).
- [ ] CI actions pinned to SHA, not tag.
- [ ] Secrets injected via CI secrets manager, not in pipeline files.
- [ ] Production deploy requires approval gate.
- [ ] Rollback mechanism documented and tested.
- [ ] Same image deployed across all environments (only config differs).
- [ ] No `--force` or `--skip-tests` flags in pipeline.
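For the "CI actions pinned to SHA, not tag" item above, a line-based scan is often enough. This sketch assumes GitHub Actions workflow syntax, where steps reference actions as `uses: owner/repo@ref`, and treats a 40-character hexadecimal ref as a commit SHA. The function name `unpinned_actions` and the sample workflow (including its placeholder hash) are illustrative only.

```python
import re

USES_LINE = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that are not pinned to a full commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_LINE.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        # Local actions and container references carry no git ref to pin.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        _, _, version = ref.partition("@")
        if not FULL_SHA.match(version):
            unpinned.append(ref)
    return unpinned


# Hypothetical workflow fragment; the 40-character hash is a placeholder.
sample_workflow = """\
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@0123456789abcdef0123456789abcdef01234567  # pinned
  - name: Local action
    uses: ./.github/actions/build
"""

print(unpinned_actions(sample_workflow))
```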

---

## Monitoring and Observability

- [ ] Golden signals dashboard exists (latency, traffic, errors, saturation).
- [ ] Dashboard definition stored in Git.
- [ ] Prometheus metrics endpoint exposed (`/metrics`).
- [ ] Alert rules defined for error rate and latency SLOs.
- [ ] Every alert has a severity label (P1-P4).
- [ ] Every alert links to a runbook.
- [ ] Logs are structured JSON with trace_id.
- [ ] No PII or secrets in logs.
- [ ] Log retention configured (not infinite).
- [ ] Traces instrumented at service boundaries.
- [ ] SLO defined and error budget tracked.
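As an illustration of the "structured JSON with trace_id" item above, here is a small sketch built on Python's standard `logging` module. The field names other than `trace_id`, the logger name `user-api`, and the sample trace value are assumptions; a real service would normally take the trace ID from its tracing library rather than pass it by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id is supplied per call through the `extra` argument.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


# Write to an in-memory stream so the example is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("user-api")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("order created", extra={"trace_id": "4bf92f3577b34da6"})
print(stream.getvalue().strip())
```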

---

## Security

- [ ] Service has its own IAM role/service account (not shared).
- [ ] IAM permissions follow least privilege (no wildcards in prod).
- [ ] No secrets in git, environment variables in manifests, or Docker images.
- [ ] Secrets sourced from Vault/SM with rotation schedule.
- [ ] Network access restricted (security groups reference SG IDs, not `0.0.0.0/0`).
- [ ] TLS termination configured (no plaintext in transit).
- [ ] Image signed and signature verified on admission.
- [ ] Dependency versions pinned (lockfiles committed).
- [ ] SAST and secrets scanning enabled in CI.
- [ ] Audit logging enabled for sensitive operations.
- [ ] K8s RBAC configured (no cluster-admin for workloads).

---

## Disaster Recovery

- [ ] DR tier assigned and documented.
- [ ] Backup schedule meets RPO for assigned tier.
- [ ] Backup restore tested within the last 30 days.
- [ ] Failover procedure documented in runbook.
- [ ] Failover tested within the last 90 days.
- [ ] Recovery can be performed by any on-call engineer (not just the author).
- [ ] Critical data has cross-region replication.
- [ ] etcd / cluster state backed up.

---

## Cost

- [ ] All resources tagged for cost tracking.
- [ ] Instance/pod sizing justified (not over-provisioned).
- [ ] Storage lifecycle policies configured.
- [ ] VPC endpoints used for AWS service access (avoid NAT costs).
- [ ] Spot/preemptible used for non-critical workloads.
- [ ] Budget alerts configured for the team/service.
- [ ] No idle resources (unattached volumes, unused EIPs, stopped instances).
- [ ] Estimated monthly cost included in PR for new infrastructure.

---

## General

- [ ] Change is reversible (can roll back without data loss).
- [ ] Change has been tested in staging/dev first.
- [ ] Documentation updated (runbooks, architecture diagrams, service catalog).
- [ ] Affected teams notified (if cross-cutting change).
- [ ] On-call team aware of the change timing.
- [ ] Change window appropriate (not Friday afternoon, not during peak).
@@ -0,0 +1,225 @@
# Security Reference

## Least Privilege IAM

### Principles

- Every service gets its own IAM role/service account. No shared credentials.
- Grant minimum permissions required. Start with zero, add as needed.
- Use condition-based policies (source IP, time, MFA) for sensitive operations.
- No wildcard (`*`) actions or resources in production policies.
- Review IAM policies quarterly; remove unused permissions.

### IAM Pattern (AWS Example)

```hcl
# Per-service role with scoped permissions
resource "aws_iam_role" "user_api" {
  name = "${local.name_prefix}-user-api"
  assume_role_policy = data.aws_iam_policy_document.eks_assume.json
}

resource "aws_iam_policy" "user_api" {
  name = "${local.name_prefix}-user-api"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["s3:GetObject", "s3:PutObject"]
        Resource = "arn:aws:s3:::${var.bucket_name}/user-uploads/*"
      },
      {
        Effect = "Allow"
        Action = ["secretsmanager:GetSecretValue"]
        Resource = "arn:aws:secretsmanager:*:*:secret:user-api/*"
      }
    ]
  })
}
```
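The "no wildcard actions or resources" principle can be checked mechanically on a rendered policy document. The sketch below is one possible interpretation, not an AWS tool: it flags `Allow` statements whose action is `*` or `service:*`, or whose resource is exactly `*`, and it deliberately leaves scoped ARN patterns such as a trailing `/*` alone. The function name `wildcard_findings`, the bucket name, and both sample policies are hypothetical.

```python
def wildcard_findings(policy: dict) -> list[str]:
    """Report Allow statements that use blanket wildcard actions or resources."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]
    findings = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            values = statement.get(field, [])
            if isinstance(values, str):
                values = [values]
            for value in values:
                blanket_action = field == "Action" and value.endswith(":*")
                if value == "*" or blanket_action:
                    findings.append(f"Statement[{index}].{field}: {value}")
    return findings


# Mirrors the scoped policy above, with a made-up bucket name.
scoped = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
         "Resource": "arn:aws:s3:::example-bucket/user-uploads/*"},
        {"Effect": "Allow", "Action": ["secretsmanager:GetSecretValue"],
         "Resource": "arn:aws:secretsmanager:*:*:secret:user-api/*"},
    ],
}
too_broad = {"Statement": {"Effect": "Allow", "Action": "s3:*", "Resource": "*"}}

print(wildcard_findings(scoped))
print(wildcard_findings(too_broad))
```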

### Service Account Mapping (Kubernetes)

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: user-api
  annotations:
    # Placeholder account ID; AWS account IDs are 12 digits.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/user-api
    # GKE: iam.gke.io/gcp-service-account: user-api@project.iam.gserviceaccount.com
```

### Human Access

- Use SSO (Okta, Azure AD) for all human access. No local IAM users.
- Require MFA for console and CLI access.
- Break-glass accounts stored in physical safe or sealed-secret system.
- Admin access via just-in-time (JIT) elevation, not permanent grants.
- All access changes require PR review.

---

## Network Security

### VPC Design

```
VPC (10.0.0.0/16)
├── Public Subnets (10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20)
│   └── Load Balancers, NAT Gateways, Bastion (if needed)
├── Private Subnets (10.0.48.0/20, 10.0.64.0/20, 10.0.80.0/20)
│   └── Application workloads (EKS nodes, EC2, ECS)
└── Data Subnets (10.0.96.0/20, 10.0.112.0/20, 10.0.128.0/20)
    └── Databases, caches, message brokers
```
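The CIDR plan above can be sanity-checked with Python's standard `ipaddress` module. This sketch copies the nine subnet ranges from the diagram and verifies that they all sit inside the VPC range and do not overlap; the variable names are illustrative. It also counts how many /20 blocks of the /16 remain unallocated for later growth.

```python
import ipaddress
from itertools import combinations

vpc = ipaddress.ip_network("10.0.0.0/16")

# Subnet ranges copied from the VPC design above, one per availability zone.
tiers = {
    "public": ["10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20"],
    "private": ["10.0.48.0/20", "10.0.64.0/20", "10.0.80.0/20"],
    "data": ["10.0.96.0/20", "10.0.112.0/20", "10.0.128.0/20"],
}

subnets = [ipaddress.ip_network(cidr) for cidrs in tiers.values() for cidr in cidrs]

# Every subnet must fall inside the VPC range, and no two may overlap.
all_inside = all(subnet.subnet_of(vpc) for subnet in subnets)
any_overlap = any(a.overlaps(b) for a, b in combinations(subnets, 2))

# A /16 splits into 2 ** (20 - 16) = 16 blocks of size /20.
unused_blocks = 2 ** (20 - 16) - len(subnets)

print(all_inside, any_overlap, unused_blocks, subnets[0].num_addresses)
```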

### Rules

- **Three availability zones minimum** for production.
- Application workloads in private subnets only.
- Databases in data subnets with no internet route.
- Egress via NAT Gateway (or VPC endpoints for AWS services).
- No SSH/RDP to production instances — use SSM Session Manager or
  `kubectl exec` (with audit logging).

### Security Groups / Firewall Rules

- Default: deny all inbound, deny all outbound.
- Allow only required ports between specific security groups.
- Reference security groups by ID, not CIDR (self-documenting, auto-updating).
- No `0.0.0.0/0` inbound rules except on load balancers (ports 80/443 only).
- Document every security group rule with a description.

```hcl
resource "aws_security_group_rule" "api_from_alb" {
  type = "ingress"
  from_port = 8080
  to_port = 8080
  protocol = "tcp"
  source_security_group_id = aws_security_group.alb.id
  security_group_id = aws_security_group.api.id
  description = "Allow ALB to reach API on port 8080"
}
```

---

## Secrets Management

### Secrets Lifecycle

```
Create → Store in Vault/SM → Reference in config → Rotate → Revoke
```

### Rotation Schedule

| Secret Type | Rotation Period | Method |
|---|---|---|
| Database credentials | 90 days | Automated (Vault dynamic secrets) |
| API keys (internal) | 90 days | Automated rotation |
| API keys (third-party) | Per vendor policy | Manual with ticket |
| TLS certificates | Auto-renew (cert-manager) | Let's Encrypt / ACM |
| SSH keys | 180 days | Key rotation pipeline |
| Encryption keys | 365 days | AWS KMS auto-rotation |
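The fixed rotation periods in the table above translate into a simple due-date check. The sketch below covers only the rows with a fixed number of days; third-party API keys and TLS certificates follow vendor policy or auto-renewal and are left out. The dictionary keys and the function name `rotation_due` are hypothetical labels, not identifiers used by any secrets manager.

```python
from datetime import date, timedelta

# Rotation periods in days, taken from the fixed-interval rows of the table.
ROTATION_DAYS = {
    "database_credentials": 90,
    "internal_api_key": 90,
    "ssh_key": 180,
    "encryption_key": 365,
}


def rotation_due(secret_type: str, last_rotated: date, today: date) -> bool:
    """True when the secret has reached or passed its rotation period."""
    period = timedelta(days=ROTATION_DAYS[secret_type])
    return today - last_rotated >= period


# 2024-01-01 to 2024-04-01 is 91 days, so a 90-day secret is due.
print(rotation_due("database_credentials", date(2024, 1, 1), date(2024, 4, 1)))
# 2024-01-01 to 2024-06-01 is 152 days, so a 180-day secret is not yet due.
print(rotation_due("ssh_key", date(2024, 1, 1), date(2024, 6, 1)))
```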

### Rules

- Never store secrets in: git, CI pipeline files, Docker images, environment
  variables in K8s manifests, ConfigMaps.
- Use: HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager,
  Azure Key Vault, SOPS for encrypted files.
- Mount as files, not env vars (env vars appear in `kubectl describe pod`,
  crash dumps, `/proc/*/environ`).
- Audit all secret access — who read what, when.
- Revoke immediately on employee departure or credential compromise.

---

## Image Scanning

### Pipeline Integration

```yaml
# Scan in CI before pushing to registry
- name: Scan image
  run: |
    trivy image --exit-code 1 --severity CRITICAL,HIGH \
      --ignore-unfixed myapp:${{ github.sha }}
```

### Scanning Layers

| Layer | Tool | When |
|---|---|---|
| Base image CVEs | Trivy, Grype | Every build + nightly |
| Application dependencies | Snyk, npm audit, pip-audit | Every build |
| IaC misconfigurations | Checkov, tfsec, KICS | Every PR |
| Secrets in code | gitleaks, trufflehog | Every commit |
| Runtime behavior | Falco, Sysdig | Continuous |
| Container compliance | Docker Bench, OPA | Nightly |

### Rules

- Block deployment on CRITICAL or HIGH vulnerabilities.
- Allow MEDIUM/LOW with documented exception and timeline to fix.
- Scan running images nightly — new CVEs appear after deploy.
- Track vulnerability metrics: time-to-remediate, open count by severity.
- Base image rebuilds trigger downstream rebuilds automatically.

---

## Audit Logging

### What to Log

| Event | Required Fields |
|---|---|
| Authentication (success/fail) | who, when, source IP, method |
| Authorization (denied) | who, what resource, what action |
| Resource creation/modification/deletion | who, what, when, before/after |
| Secret access | who, which secret, when |
| Configuration changes | who, what changed, PR link |
| Privilege escalation | who, from/to role, when |

### Implementation

- Enable cloud provider audit logs (CloudTrail, GCP Audit, Azure Activity).
- Enable K8s audit logging for all write operations.
- Ship audit logs to immutable storage (S3 with Object Lock, WORM).
- Retain audit logs for minimum 1 year (adjust per compliance: SOC2, HIPAA, PCI).
- Alert on: root account usage, IAM policy changes, security group changes,
  failed auth spike, secret access from unexpected sources.

### Kubernetes Audit Policy

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec", "pods/portforward"]
    verbs: ["create"]
  - level: Metadata
    verbs: ["create", "update", "patch", "delete"]
```
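Rule order matters in the policy above: Kubernetes compares each request against the rules in sequence and the first matching rule sets the audit level. The sketch below models that first-match behaviour in a simplified form, assuming exact matches on API group, resource name, and verb only; the real matcher supports more fields (users, namespaces, wildcards). The function name `audit_level` is hypothetical, and returning `None` when nothing matches is a choice made for this sketch.

```python
# The three rules from the policy above, expressed as plain dictionaries.
AUDIT_RULES = [
    {"level": "Metadata",
     "resources": [{"group": "", "resources": ["secrets", "configmaps"]}]},
    {"level": "RequestResponse",
     "resources": [{"group": "", "resources": ["pods/exec", "pods/portforward"]}],
     "verbs": ["create"]},
    {"level": "Metadata",
     "verbs": ["create", "update", "patch", "delete"]},
]


def audit_level(verb: str, group: str, resource: str, rules=AUDIT_RULES):
    """Return the level of the first rule matching the request, else None."""
    for rule in rules:
        # A rule without "verbs" matches every verb.
        if "verbs" in rule and verb not in rule["verbs"]:
            continue
        # A rule without "resources" matches every resource.
        if "resources" in rule and not any(
            group == entry["group"] and resource in entry["resources"]
            for entry in rule["resources"]
        ):
            continue
        return rule["level"]
    return None  # no rule matched in this simplified model


print(audit_level("get", "", "secrets"))
print(audit_level("create", "", "pods/exec"))
print(audit_level("delete", "apps", "deployments"))
print(audit_level("get", "apps", "deployments"))
```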

---

## Supply Chain Security

- Sign container images with cosign/Sigstore.
- Verify signatures before admission (Kyverno, OPA Gatekeeper).
- Pin dependencies to exact versions (lockfiles).
- Pin CI actions to SHA, not tag.
- Use private registries — pull-through cache for public images.
- Generate and store SBOMs (Software Bill of Materials) with each build.
- Review and approve new dependencies before adding.
@@ -0,0 +1,72 @@
### Infrastructure as Code
- [ ] Changes are in Terraform/Pulumi/CDK — no manual console changes
- [ ] Remote state configured with locking enabled
- [ ] `terraform plan` reviewed — no unexpected destroys on stateful resources
- [ ] Resources tagged: project, environment, team, cost-center
- [ ] Module versions pinned (not pointing to main branch)
- [ ] Sensitive variables marked as `sensitive = true`
- [ ] No hardcoded values — all configurable via variables with defaults

### Docker & Container Security
- [ ] Multi-stage build — final image contains only runtime dependencies
- [ ] Non-root user configured (`USER` instruction, `runAsNonRoot: true`)
- [ ] Base image uses specific version tag (not `latest`)
- [ ] Base image is slim/distroless variant
- [ ] No secrets in Dockerfile, build args, or image layers
- [ ] .dockerignore excludes .git, node_modules, .env, test files
- [ ] Image scanned with Trivy — 0 critical, 0 high vulnerabilities
- [ ] Image size within target (<200MB for Node, <50MB for Go)
- [ ] HEALTHCHECK instruction present

### Kubernetes
- [ ] Resource requests AND limits set (requests = p50 usage, limits = p99 + headroom)
- [ ] Liveness probe checks process health only (not dependencies)
- [ ] Readiness probe checks ability to serve traffic (DB connected, cache available)
- [ ] PodDisruptionBudget configured (`minAvailable: 50%` or `maxUnavailable: 1`)
- [ ] Security context: `runAsNonRoot`, `drop: [ALL]` capabilities, `readOnlyRootFilesystem`
- [ ] Network policies restrict pod-to-pod traffic (default deny, explicit allow)
- [ ] Secrets via ExternalSecrets/Vault — not in K8s manifests (base64 is not encryption)
- [ ] Minimum 2 replicas for production deployments
- [ ] Image pull policy is `IfNotPresent` (not `Always`)
- [ ] Anti-affinity rules spread pods across nodes/AZs
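The sizing rule in the first Kubernetes item above (requests at p50 usage, limits at p99 plus headroom) can be worked through with the standard `statistics` module. This is a sketch under assumptions: the samples are invented CPU readings in millicores, the 20% headroom is an arbitrary illustrative value, and the function name `suggest_cpu` is hypothetical.

```python
import math
from statistics import quantiles


def suggest_cpu(samples_millicores: list[float], headroom: float = 0.2) -> dict:
    """Derive a CPU request (p50) and limit (p99 plus headroom) from usage samples."""
    # n=100 yields 99 cut points; index 49 is p50 and index 98 is p99.
    cuts = quantiles(samples_millicores, n=100, method="inclusive")
    p50, p99 = cuts[49], cuts[98]
    return {
        "request_m": math.ceil(p50),
        "limit_m": math.ceil(p99 * (1 + headroom)),
    }


# Invented usage samples with one spike, for illustration only.
samples = [120, 130, 125, 140, 135, 128, 132, 400, 126, 131]

# The spike barely moves the request but dominates the limit.
print(suggest_cpu(samples))
```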

### CI/CD Pipeline
- [ ] All stages present: lint → test → build → security scan → deploy
- [ ] Security scanning: SAST, dependency audit, container scan, secret detection
- [ ] Build uses caching (Docker layers, dependency cache, test cache)
- [ ] Artifacts tagged with branch-sha-buildnumber (not `latest`)
- [ ] Production deploy has manual approval gate
- [ ] Rollback mechanism defined and tested (automated on error rate spike)
- [ ] Pipeline total <15 minutes
- [ ] Pipeline steps containerized (runs same locally and in CI)

### Monitoring & Alerting
- [ ] New service has all 3 pillars: metrics, logs, traces
- [ ] Dashboard with golden signals: latency, traffic, errors, saturation
- [ ] Alerts on symptoms (error rate, latency) not causes (CPU, memory)
- [ ] Alert severity appropriate (P1 pages on-call, P2 Slack, P3 ticket)
- [ ] Runbook exists for every P1/P2 alert
- [ ] Log format is structured JSON with request_id, service, environment
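One common way to alert on symptoms, as the items above ask, is to compare the observed error ratio with the error budget implied by an SLO; the ratio of the two is often called the burn rate. The sketch below shows the arithmetic and maps the result onto the P1/P2/P3 routing from the checklist. The thresholds (10, 2, and 1) are illustrative assumptions, not values from this checklist, and both function names are hypothetical.

```python
import math


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    error_budget = 1.0 - slo_target  # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_ratio / error_budget


def route(rate: float) -> str:
    """Map a burn rate to a severity; the thresholds are assumptions."""
    if rate >= 10:
        return "P1: page on-call"
    if rate >= 2:
        return "P2: Slack"
    if rate > 1:
        return "P3: ticket"
    return "ok"


# 0.5% of requests failing against a 99.9% SLO burns budget about 5x too fast.
rate = burn_rate(error_ratio=0.005, slo_target=0.999)
print(round(rate, 2), route(rate))
```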

### Security
- [ ] IAM roles follow least privilege — no `*` resources, no admin policies
- [ ] Network: app in private subnets, only LB in public, security groups restrictive
- [ ] No hardcoded secrets, API keys, or credentials anywhere in code
- [ ] TLS everywhere — external (HTTPS) and internal (mTLS or VPC-internal)
- [ ] Audit logging enabled for infrastructure changes
- [ ] Image signing verified for production deployments

### Disaster Recovery & Reliability
- [ ] Backup schedule configured for databases and persistent storage
- [ ] Backup restore tested within the last month
- [ ] Multi-AZ deployment for Tier 1 services
- [ ] Failover procedure documented in runbook
- [ ] RTO/RPO requirements documented and achievable
- [ ] Graceful shutdown handles SIGTERM (drain connections, finish in-flight requests)
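The graceful-shutdown item above boils down to: on SIGTERM, stop accepting new work and let in-flight work finish. The sketch below shows that pattern with Python's standard `signal` module. The `serve` loop is a stand-in for a real server, and all names are hypothetical. Python only allows signal handlers to be installed from the main thread, hence the guard.

```python
import signal
import threading

# Set once SIGTERM arrives; the serving loop checks it between requests.
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request; work already in progress is not interrupted."""
    shutting_down.set()


# Handlers can only be registered from the main thread.
if threading.current_thread() is threading.main_thread():
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(requests: list[str]) -> list[str]:
    """Process requests until shutdown is requested, finishing the current one."""
    completed = []
    for request in requests:
        if shutting_down.is_set():
            break  # stop taking new requests; nothing is cut off mid-flight
        completed.append(f"handled {request}")
    return completed


print(serve(["a", "b"]))
```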

### Cost
- [ ] All new resources tagged for cost tracking
- [ ] Instance sizing justified by usage metrics (not "biggest available")
- [ ] Spot/preemptible used for stateless workloads where possible
- [ ] Storage lifecycle policies set (transition to cold storage, expiration)
- [ ] Cost impact estimated before provisioning (especially for data transfer, managed services)
@@ -0,0 +1,131 @@
# DevOps / SRE Engineer — Skill Index

You are a senior DevOps and Site Reliability Engineer. You treat infrastructure
as software, reliability as a feature, and cost as a constraint. Every change
ships through a pipeline, every resource is tracked in code, every incident
feeds back into prevention.

---

## Core Principles

1. **Automate anything done twice.** Manual steps are bugs waiting to happen.
   If a runbook has more than three steps, script it.
2. **Immutable infrastructure.** Replace, never patch. Bake images, push new
   versions, roll back by redeploying the previous artifact.
3. **Defense in depth.** No single control is sufficient. Layer network rules,
   IAM policies, runtime scanning, and audit logging.
4. **Blast radius minimization.** Canary first, then percentage rollout. Use
   namespaces, accounts, and feature flags to contain failures.
5. **Cost is a feature.** Right-size from day one. Tag every resource. Review
   spend weekly. Reserved capacity for stable workloads, spot/preemptible for
   ephemeral ones.
6. **Observability over monitoring.** You cannot alert on what you cannot see.
   Instrument first, then set thresholds.
7. **Everything in version control.** Infrastructure, dashboards, alert rules,
   runbooks — all live in Git. No ClickOps.

---

## Key Anti-Patterns — Stop These on Sight

| Anti-Pattern | Why It Hurts | Fix |
|---|---|---|
| `kubectl exec` in prod | Bypasses audit trail, breaks immutability | Add debug sidecar or ephemeral container |
| Local Terraform state | No locking, no team collaboration, easy to lose | Remote backend (S3+DynamoDB, GCS, Terraform Cloud) |
| `:latest` image tag | Non-reproducible deploys, cache-busting | Use immutable tags: git SHA or semver |
| Root containers | Full host escape on exploit | `runAsNonRoot: true`, distroless base |
| Single replica in prod | Any restart = downtime | Minimum 2 replicas + PodDisruptionBudget |
| Alert on everything | Alert fatigue, pages get ignored | Alert on symptoms (SLO burn rate), not causes |
| Secrets in env vars / git | Leaked in logs, process lists, crash dumps | External secrets manager (Vault, AWS SM, SOPS) |
| No resource limits | Noisy neighbor kills node | Always set requests AND limits |
| Manual deploy steps | "It works on my machine" | Full CI/CD, no human in the deploy path |

---

## Decision Framework — Deployment Strategy

```
Is downtime acceptable?
├─ Yes → Recreate (simplest, for dev/staging)
└─ No
   ├─ Need instant rollback?
   │  ├─ Yes → Blue/Green (two full environments)
   │  └─ No
   │     ├─ Want gradual traffic shift?
   │     │  ├─ Yes → Canary (progressive delivery)
   │     │  └─ No → Rolling Update (K8s default)
   │     └─ Need feature-level control?
   │        └─ Yes → Feature Flags + Rolling
   └─ Multi-region?
      └─ Yes → Blue/Green per region, sequential rollout
```
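The main path of the decision tree above can be written as a small function. This is a simplified sketch: it encodes only the downtime, instant-rollback, and gradual-shift questions, and it leaves out the feature-flag and multi-region branches. The function and parameter names are hypothetical.

```python
def choose_strategy(
    downtime_ok: bool,
    instant_rollback: bool = False,
    gradual_shift: bool = False,
) -> str:
    """Walk the main branches of the deployment decision tree."""
    if downtime_ok:
        return "Recreate"        # simplest option, suited to dev/staging
    if instant_rollback:
        return "Blue/Green"      # two full environments
    if gradual_shift:
        return "Canary"          # progressive delivery
    return "Rolling Update"      # the Kubernetes default


print(choose_strategy(downtime_ok=True))
print(choose_strategy(downtime_ok=False, instant_rollback=True))
print(choose_strategy(downtime_ok=False, gradual_shift=True))
print(choose_strategy(downtime_ok=False))
```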

### Strategy Quick Reference

| Strategy | Rollback Speed | Resource Cost | Complexity |
|---|---|---|---|
| Recreate | Minutes (redeploy) | 1x | Low |
| Rolling Update | Minutes (undo) | 1x-1.25x | Low |
| Blue/Green | Seconds (DNS/LB) | 2x | Medium |
| Canary | Seconds (route) | 1.1x | High |

---

## Reference Files

| File | Covers |
|---|---|
| [infrastructure-as-code.md](references/infrastructure-as-code.md) | Terraform structure, state management, modules, tagging |
| [docker.md](references/docker.md) | Base images, multi-stage builds, security, layer optimization |
| [kubernetes.md](references/kubernetes.md) | Namespaces, resources, probes, PDB, network policies, secrets |
| [ci-cd.md](references/ci-cd.md) | Pipeline stages, speed targets, caching, rollback |
| [monitoring.md](references/monitoring.md) | Metrics/logs/traces, golden signals, alerting, runbooks |
| [security.md](references/security.md) | IAM, network, secrets rotation, image scanning, audit |
| [disaster-recovery.md](references/disaster-recovery.md) | RPO/RTO, backups, failover, chaos engineering |
| [cost-optimization.md](references/cost-optimization.md) | Right-sizing, reservations, storage lifecycle, budgets |
| [review-checklist.md](references/review-checklist.md) | Full review checklist across all domains |

---

## Workflow Index

### 1. New Infrastructure Component
1. Define in IaC (Terraform/Pulumi) — see `infrastructure-as-code.md`
2. Add monitoring — see `monitoring.md`
3. Document DR posture — see `disaster-recovery.md`
4. Tag for cost tracking — see `cost-optimization.md`
5. Review against checklist — see `review-checklist.md`

### 2. Containerize a Service
1. Write Dockerfile — see `docker.md`
2. Add K8s manifests — see `kubernetes.md`
3. Wire into CI/CD pipeline — see `ci-cd.md`
4. Add probes and metrics endpoint — see `monitoring.md`
5. Harden security context — see `security.md`

### 3. Incident Response
1. Acknowledge alert, open incident channel
2. Assess blast radius using dashboards — see `monitoring.md`
3. Mitigate: rollback, scale, or isolate
4. Investigate root cause with traces/logs
5. Write postmortem, create prevention tasks
6. Update runbooks — see `disaster-recovery.md`

### 4. Capacity Planning Review
1. Pull 30-day resource utilization — see `monitoring.md`
2. Identify right-sizing opportunities — see `cost-optimization.md`
3. Forecast growth from product roadmap
4. Adjust reservations and autoscaling thresholds
5. Update budget alerts

---

## Golden Rules for Code Review

- Every Terraform change must include a `plan` output in the PR.
- Every Dockerfile change must not increase image size without justification.
- Every K8s manifest must set resource requests, limits, and probes.
- Every CI/CD change must not increase pipeline duration beyond 15 minutes.
- Every alert rule must link to a runbook.
- Every secret must reference an external secrets manager, never inline.
@@ -0,0 +1,28 @@
id: fe
name: Frontend Agent
description: >
  Senior frontend engineer agent. Expert in rendering performance,
  component architecture, state management, accessibility, and modern SSR/SSG.
role: fe
techStack:
  languages: [TypeScript, JavaScript, CSS, HTML]
  frameworks: [React, Next.js, Vue, Angular, Svelte]
  libraries: [Tailwind CSS, Zustand, Jotai, TanStack Query, React Hook Form, Zod, Radix UI]
  buildTools: [Vite, Turbopack, Webpack, esbuild, SWC, Storybook]
tools:
  allowed: [Read, Write, Edit, Bash, Grep, Glob]
globs:
  - "**/*.tsx"
  - "**/*.jsx"
  - "**/*.css"
  - "**/*.scss"
  - "**/*.vue"
  - "**/*.svelte"
  - "src/components/**/*"
  - "src/pages/**/*"
  - "src/app/**/*"
  - "src/hooks/**/*"
  - "src/stores/**/*"
sharedKnowledge:
  - project-conventions
  - git-workflow