sdtk-ops-kit 0.2.0

Files changed (51)
  1. package/README.md +146 -0
  2. package/assets/manifest/toolkit-bundle.manifest.json +187 -0
  3. package/assets/manifest/toolkit-bundle.sha256.txt +36 -0
  4. package/assets/toolkit/toolkit/AGENTS.md +65 -0
  5. package/assets/toolkit/toolkit/SDTKOPS_TOOLKIT.md +166 -0
  6. package/assets/toolkit/toolkit/install.ps1 +138 -0
  7. package/assets/toolkit/toolkit/scripts/install-claude-skills.ps1 +81 -0
  8. package/assets/toolkit/toolkit/scripts/install-codex-skills.ps1 +127 -0
  9. package/assets/toolkit/toolkit/scripts/uninstall-claude-skills.ps1 +65 -0
  10. package/assets/toolkit/toolkit/scripts/uninstall-codex-skills.ps1 +53 -0
  11. package/assets/toolkit/toolkit/sdtk-spec.config.json +6 -0
  12. package/assets/toolkit/toolkit/sdtk-spec.config.profiles.example.json +12 -0
  13. package/assets/toolkit/toolkit/skills/ops-backup/SKILL.md +93 -0
  14. package/assets/toolkit/toolkit/skills/ops-backup/references/backup-script-patterns.md +108 -0
  15. package/assets/toolkit/toolkit/skills/ops-ci-cd/SKILL.md +88 -0
  16. package/assets/toolkit/toolkit/skills/ops-ci-cd/references/pipeline-examples.md +113 -0
  17. package/assets/toolkit/toolkit/skills/ops-compliance/SKILL.md +105 -0
  18. package/assets/toolkit/toolkit/skills/ops-container/SKILL.md +95 -0
  19. package/assets/toolkit/toolkit/skills/ops-container/references/k8s-manifest-patterns.md +116 -0
  20. package/assets/toolkit/toolkit/skills/ops-cost/SKILL.md +88 -0
  21. package/assets/toolkit/toolkit/skills/ops-debug/SKILL.md +311 -0
  22. package/assets/toolkit/toolkit/skills/ops-debug/references/root-cause-tracing.md +138 -0
  23. package/assets/toolkit/toolkit/skills/ops-deploy/SKILL.md +102 -0
  24. package/assets/toolkit/toolkit/skills/ops-discover/SKILL.md +102 -0
  25. package/assets/toolkit/toolkit/skills/ops-incident/SKILL.md +113 -0
  26. package/assets/toolkit/toolkit/skills/ops-incident/references/communication-templates.md +34 -0
  27. package/assets/toolkit/toolkit/skills/ops-incident/references/postmortem-template.md +69 -0
  28. package/assets/toolkit/toolkit/skills/ops-incident/references/runbook-template.md +69 -0
  29. package/assets/toolkit/toolkit/skills/ops-infra-plan/SKILL.md +123 -0
  30. package/assets/toolkit/toolkit/skills/ops-infra-plan/references/iac-patterns.md +141 -0
  31. package/assets/toolkit/toolkit/skills/ops-monitor/SKILL.md +110 -0
  32. package/assets/toolkit/toolkit/skills/ops-monitor/references/alert-rules.md +80 -0
  33. package/assets/toolkit/toolkit/skills/ops-monitor/references/slo-templates.md +83 -0
  34. package/assets/toolkit/toolkit/skills/ops-parallel/SKILL.md +177 -0
  35. package/assets/toolkit/toolkit/skills/ops-plan/SKILL.md +169 -0
  36. package/assets/toolkit/toolkit/skills/ops-security-infra/SKILL.md +126 -0
  37. package/assets/toolkit/toolkit/skills/ops-security-infra/references/cicd-security-pipeline.md +55 -0
  38. package/assets/toolkit/toolkit/skills/ops-security-infra/references/security-headers.md +24 -0
  39. package/assets/toolkit/toolkit/skills/ops-verify/SKILL.md +180 -0
  40. package/bin/sdtk-ops.js +14 -0
  41. package/package.json +46 -0
  42. package/src/commands/generate.js +12 -0
  43. package/src/commands/help.js +53 -0
  44. package/src/commands/init.js +86 -0
  45. package/src/commands/runtime.js +201 -0
  46. package/src/index.js +65 -0
  47. package/src/lib/args.js +107 -0
  48. package/src/lib/errors.js +41 -0
  49. package/src/lib/powershell.js +65 -0
  50. package/src/lib/scope.js +58 -0
  51. package/src/lib/toolkit-payload.js +123 -0
@@ -0,0 +1,88 @@
1
+ <!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
2
+ ---
3
+ name: ops-ci-cd
4
+ description: CI/CD pipeline design and management. Use when setting up or modifying build, test, and deployment pipelines -- covers pipeline stages, branch strategies, artifact management, and secret handling.
5
+ ---
6
+
7
+ # Ops CI CD
8
+
9
+ ## Overview
10
+
11
+ A pipeline should make the safe path the default path. It must build the artifact, test it, scan it, promote it, and leave an auditable trail of what ran and what was deployed.
12
+
13
+ ## Standard Pipeline Flow
14
+
15
+ Use this default flow:
16
+
17
+ ```text
18
+ commit
19
+ -> build
20
+ -> unit test
21
+ -> lint and SAST
22
+ -> integration test
23
+ -> security scan
24
+ -> build artifact
25
+ -> deploy staging
26
+ -> smoke test
27
+ -> deploy production
28
+ -> health check
29
+ ```
30
+
31
+ Do not collapse build, scan, and deploy into one opaque job.
32
+
33
+ ## Branch And Promotion Strategy
34
+
35
+ Prefer:
36
+ - trunk-based development with short-lived branches
37
+ - automatic promotion to staging after required checks pass
38
+ - manual or policy-gated promotion to production
39
+ - immutable artifact promotion instead of rebuilding per environment
40
+
41
+ ## Pipeline Config Patterns
42
+
43
+ Keep pipeline definitions:
44
+ - small enough to review
45
+ - explicit about dependencies between jobs
46
+ - strict about required checks
47
+ - separate from infrastructure design details that belong in `ops-infra-plan`
48
+
49
+ Use `./references/pipeline-examples.md` for GitHub Actions and GitLab CI examples.
50
+
51
+ ## <HARD-GATE>
52
+
53
+ NEVER:
54
+ - hardcode secrets in pipeline files
55
+ - pass long-lived cloud credentials through plain environment variables when OIDC or equivalent short-lived auth is available
56
+ - skip secret rotation or access auditing
57
+
58
+ ALWAYS:
59
+ - use platform secret stores
60
+ - scope credentials to the minimum required permissions
61
+ - rotate and review access on a defined schedule
62
+
63
+ ## Artifact Management
64
+
65
+ Pipeline outputs should be:
66
+ - versioned with semver and git SHA
67
+ - immutable once published
68
+ - stored with retention rules
69
+ - traceable from deployment record back to commit and build
70
+
71
+ ## Pipeline Optimization
72
+
73
+ Optimize only after correctness:
74
+ - cache dependencies and build layers
75
+ - parallelize independent jobs
76
+ - use matrix builds only when they produce clear value
77
+ - skip unaffected jobs based on changed files when dependency rules are reliable
78
+
79
+ ## Common Mistakes
80
+
81
+ | Mistake | Why it fails |
82
+ |---------|--------------|
83
+ | Secrets committed in YAML | Rotation and audit control are lost immediately |
84
+ | No staging environment | Production becomes the first integration test |
85
+ | Manual deploy steps outside pipeline | Audit trail and repeatability disappear |
86
+ | Skip security scanning to save time | Vulnerabilities ship by default |
87
+ | No build cache or job parallelism | Pipeline slows down until teams bypass it |
88
+
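As a reader-side illustration of the "versioned with semver and git SHA" rule in `ops-ci-cd/SKILL.md` above (not part of the package), the following minimal sketch builds such a tag. The helper name `artifact_tag`, the `VERSION-SHORTSHA` layout, and the 7-character SHA prefix are assumptions chosen for illustration.

```python
import re

# Hypothetical helper: the name and the tag layout are illustrative assumptions.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")
GIT_SHA = re.compile(r"^[0-9a-f]{7,40}$")


def artifact_tag(version: str, git_sha: str, sha_len: int = 7) -> str:
    """Combine a semver version and a git SHA into a single artifact tag."""
    if not SEMVER.match(version):
        raise ValueError(f"not a plain MAJOR.MINOR.PATCH version: {version!r}")
    if not GIT_SHA.match(git_sha):
        raise ValueError(f"not a lowercase hex git SHA: {git_sha!r}")
    return f"{version}-{git_sha[:sha_len]}"


if __name__ == "__main__":
    print(artifact_tag("1.2.3", "abcd1234ef567890abcd1234ef567890abcd1234"))
```

Rejecting values such as `latest` at tag-construction time is one way to keep every published artifact traceable back to a commit; whether a registry also enforces immutability is a separate setting.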
@@ -0,0 +1,113 @@
1
+ <!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
2
+
3
+ # Pipeline Examples
4
+
5
+ ## GitHub Actions
6
+
7
+ ```yaml
8
+ name: Production Deployment
9
+
10
+ on:
11
+   push:
12
+     branches: [main]
13
+
14
+ jobs:
15
+   security-scan:
16
+     runs-on: ubuntu-latest
17
+     steps:
18
+       - uses: actions/checkout@v4
19
+       - name: Run dependency and image scan
20
+         run: ./scripts/security-scan.sh
21
+
22
+   test:
23
+     runs-on: ubuntu-latest
24
+     steps:
25
+       - uses: actions/checkout@v4
26
+       - name: Install dependencies
27
+         run: ./scripts/install.sh
28
+       - name: Run tests
29
+         run: ./scripts/test.sh
30
+
31
+   build:
32
+     runs-on: ubuntu-latest
33
+     needs: [security-scan, test]
34
+     steps:
35
+       - uses: actions/checkout@v4
36
+       - name: Build artifact
37
+         run: ./scripts/build.sh
38
+       - name: Publish image
39
+         run: ./scripts/publish-image.sh
40
+
41
+   deploy:
42
+     runs-on: ubuntu-latest
43
+     needs: [build]
44
+     environment: production
45
+     permissions:
46
+       id-token: write
47
+       contents: read
48
+     steps:
49
+       - uses: actions/checkout@v4
50
+       - name: Blue-Green deploy
51
+         run: |
52
+           kubectl set image deployment/app app=registry.example.com/app:${{ github.sha }}
53
+           kubectl rollout status deployment/app
54
+           kubectl patch service app-service -p '{"spec":{"selector":{"version":"green"}}}'
55
+       - name: Health check
56
+         run: curl -fsS https://app.example.com/health
57
+ ```
58
+
59
+ ## GitLab CI
60
+
61
+ ```yaml
62
+ stages:
63
+   - build
63
+   - test
64
+   - security
65
+   - deploy_staging
66
+   - smoke
67
+   - deploy_prod
68
+
69
+ variables:
70
+   IMAGE_TAG: "$CI_COMMIT_SHORT_SHA"
71
+
72
+ build:
73
+   stage: build
74
+   script:
75
+     - ./scripts/build.sh
76
+     - ./scripts/publish-image.sh "$IMAGE_TAG"
77
+
78
+ test:
79
+   stage: test
80
+   script:
81
+     - ./scripts/test.sh
82
+
83
+ security_scan:
84
+   stage: security
85
+   script:
86
+     - ./scripts/security-scan.sh
87
+
88
+ deploy_staging:
89
+   stage: deploy_staging
90
+   script:
91
+     - ./scripts/deploy.sh staging "$IMAGE_TAG"
92
+
93
+ smoke_test:
94
+   stage: smoke
95
+   script:
96
+     - ./scripts/smoke-test.sh staging
97
+
98
+ deploy_prod:
99
+   stage: deploy_prod
100
+   when: manual
101
+   script:
102
+     - ./scripts/deploy.sh production "$IMAGE_TAG"
103
+     - ./scripts/health-check.sh production
105
+ ```
106
+
107
+ ## Pattern Notes
108
+
109
+ - keep security scanning before production deploy
110
+ - promote the same artifact between environments
111
+ - prefer platform identity federation over static cloud keys
112
+ - keep deployment commands explicit enough to review and replay
113
+
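The pattern note "keep security scanning before production deploy" in `pipeline-examples.md` above can be checked mechanically against a job graph. The sketch below is a reader-side illustration, not part of the package: the `NEEDS` mapping is transcribed by hand from the `needs:` keys of the GitHub Actions example, and the helper names `upstream` and `missing_gates` are hypothetical.

```python
# Job graph transcribed from the `needs:` keys of the GitHub Actions example.
NEEDS = {
    "security-scan": [],
    "test": [],
    "build": ["security-scan", "test"],
    "deploy": ["build"],
}


def upstream(job, needs):
    """Return every job that must finish before `job` can start."""
    seen = set()
    stack = list(needs[job])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(needs[current])
    return seen


def missing_gates(job, required, needs):
    """List required gate jobs that `job` does not depend on, even transitively."""
    return sorted(set(required) - upstream(job, needs))


if __name__ == "__main__":
    # An empty list means every required gate runs before deploy.
    print(missing_gates("deploy", ["security-scan", "test"], NEEDS))
```

A check like this only inspects declared dependencies; it does not prove that the scan step itself fails the job on findings.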
@@ -0,0 +1,105 @@
1
+ ---
2
+ name: ops-compliance
3
+ description: Compliance and audit readiness. Use when setting up compliance scanning, audit logging, or preparing evidence for regulatory audits -- covers policy-as-code patterns and evidence collection.
4
+ ---
5
+
6
+ # Ops Compliance
7
+
8
+ ## Overview
9
+
10
+ Compliance automation should make controls testable and evidence easy to retrieve. SDTK-OPS covers the tooling and operational patterns that support compliance work. It does not replace legal interpretation, certification strategy, or third-party auditor management.
11
+
12
+ ## The Iron Law
13
+
14
+ ```
15
+ NO AUDIT READINESS WITHOUT IMMUTABLE LOGS AND AUTOMATED EVIDENCE COLLECTION
16
+ ```
17
+
18
+ ## When to Use
19
+
20
+ Use for:
21
+ - audit logging design
22
+ - evidence collection workflows
23
+ - policy-as-code adoption
24
+ - compliance scanning in CI/CD
25
+ - audit readiness reviews before external assessment
26
+
27
+ ## Regulatory Framework Awareness
28
+
29
+ | Framework | Focus Area | Key Operational Requirements |
30
+ |-----------|------------|------------------------------|
31
+ | SOC 2 | trust services criteria | access control, change management, logging, availability evidence |
32
+ | GDPR | personal data handling | data access controls, deletion workflows, audit trail, breach evidence |
33
+ | HIPAA | protected health information | access auditing, retention, encryption, incident evidence |
34
+ | PCI-DSS | payment card data security | segmentation, access control, logging, vulnerability management |
35
+
36
+ These are awareness anchors only. They are not legal or certification guidance.
37
+
38
+ ## Audit Logging Requirements
39
+
40
+ Every audit trail should answer:
41
+ - **WHO**
42
+   - which identity performed the action
43
+ - **WHAT**
44
+   - what action or change happened
45
+ - **WHEN**
46
+   - timestamp in a consistent standard
47
+ - **WHERE**
48
+   - account, region, environment, or system boundary
49
+ - **WHETHER**
50
+   - outcome, such as allowed, denied, success, or failure
51
+
52
+ ## Implementation Patterns
53
+
54
+ Use a layered pattern:
55
+ - cloud-provider audit logs for infrastructure and identity changes
56
+ - application audit logs in structured JSON for user and system events
57
+ - immutable or write-once log storage for retention-sensitive records
58
+ - restricted access to log management and export paths
59
+
60
+ ## <HARD-GATE>
61
+
62
+ Do not claim audit readiness until all are true:
63
+ - audit logs are immutable or strongly tamper-resistant
64
+ - retention is at least 1 year or stricter if the policy requires it
65
+ - tamper monitoring exists for critical logs
66
+ - access to log management is restricted and reviewable
67
+
68
+ ## Policy As Code
69
+
70
+ Policy should be executable:
71
+ - define guardrails as code
72
+ - run them in CI/CD
73
+ - fail deployment when policies are violated
74
+
75
+ Awareness tools include:
76
+ - OPA and Rego
77
+ - cloud-native policy engines
78
+ - IaC scanners such as Checkov or tfsec
79
+
80
+ ## Evidence Collection Automation
81
+
82
+ | Control Area | Evidence Type | Collection Method | Frequency |
83
+ |--------------|---------------|-------------------|-----------|
84
+ | Access review | privileged access report | scheduled export from IAM source | monthly |
85
+ | Change management | deployment and approval log | CI/CD pipeline export | per release |
86
+ | Backup control | restore drill report | runbook output and stored artifact | quarterly |
87
+ | Logging control | retention and tamper status | scripted control check | monthly |
88
+ | Vulnerability management | scan summary and remediation status | scanner export | weekly |
89
+
90
+ ## Common Mistakes
91
+
92
+ | Mistake | Why it fails |
93
+ |---------|--------------|
94
+ | Manual evidence collection | Audit prep becomes a scramble and gaps are hidden |
95
+ | Treat compliance as a yearly project | Controls drift the rest of the year |
96
+ | Keep policy only in documents | Violations continue because nothing enforces them |
97
+ | Store audit logs with normal mutable application data | Tampering and accidental deletion become easier |
98
+
99
+ ## Execution Handoff
100
+
101
+ After compliance controls are defined:
102
+ - implement logging and retention changes through `ops-infra-plan`
103
+ - route CI/CD policy enforcement through `ops-ci-cd`
104
+ - validate collected evidence with `ops-verify`
105
+
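The WHO / WHAT / WHEN / WHERE / WHETHER questions in `ops-compliance/SKILL.md` above map naturally onto a structured JSON record. The following is a minimal reader-side sketch, not part of the package: the field names, the helper `build_audit_event`, and the choice of UTC ISO 8601 timestamps are assumptions for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical field names, one per question in the audit-trail checklist.
REQUIRED_FIELDS = ("who", "what", "when", "where", "whether")


def build_audit_event(who, what, where, whether, when=None):
    """Serialize one audit event as a JSON line, rejecting empty fields."""
    if when is None:
        when = datetime.now(timezone.utc)
    event = {
        "who": who,
        "what": what,
        # Normalize to UTC so every record uses one timestamp standard.
        # Pass timezone-aware datetimes; naive ones are read as local time.
        "when": when.astimezone(timezone.utc).isoformat(),
        "where": where,
        "whether": whether,
    }
    missing = [name for name in REQUIRED_FIELDS if not event[name]]
    if missing:
        raise ValueError(f"audit event is missing: {', '.join(missing)}")
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    print(build_audit_event(
        who="user:alice",
        what="iam.role.update",
        where="prod/eu-west-1",
        whether="denied",
    ))
```

Structured output alone does not make a log immutable; the storage and access controls named in the hard gate still have to be provided by the log platform.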
@@ -0,0 +1,95 @@
1
+ <!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
2
+ ---
3
+ name: ops-container
4
+ description: Container operations. Use when building Docker images, writing Dockerfiles, creating Kubernetes manifests, or managing container lifecycle -- covers best practices, security scanning, and orchestration patterns.
5
+ ---
6
+
7
+ # Ops Container
8
+
9
+ ## Overview
10
+
11
+ Container workflows fail when image build, runtime security, and orchestration design are treated as separate concerns. Build lean images, run them with explicit security posture, and define manifests that expose health and resource boundaries.
12
+
13
+ ## When to Use
14
+
15
+ Use for:
16
+ - Dockerfile authoring or review
17
+ - image build and tagging strategy
18
+ - Kubernetes deployment manifest design
19
+ - local multi-service orchestration
20
+ - container registry policy updates
21
+
22
+ ## Dockerfile Best Practices
23
+
24
+ Use these defaults:
25
+ - multi-stage builds to separate build tools from runtime image
26
+ - specific base image tags, never floating `latest`
27
+ - minimal runtime packages
28
+ - `.dockerignore` to exclude git metadata, secrets, and local caches
29
+ - deterministic dependency install steps for repeatable builds
30
+
31
+ ## Security Defaults
32
+
33
+ Container security starts in the image:
34
+ - run as non-root user
35
+ - pin base images to explicit versions
36
+ - scan images before push
37
+ - keep secrets out of the image and build args
38
+ - remove unused packages and shells where possible
39
+
40
+ ## <HARD-GATE>
41
+
42
+ Do not publish or deploy the image unless all are true:
43
+ - multi-stage build is used where build tooling is separate from runtime
44
+ - container runs as non-root unless a reviewed exception exists
45
+ - vulnerability scan passes with no CRITICAL or HIGH findings
46
+ - base image tag is specific and reviewable
47
+
48
+ ## Kubernetes Patterns
49
+
50
+ For production-oriented workloads, define:
51
+ - Deployment with explicit resource requests and limits
52
+ - readiness, liveness, and startup probes
53
+ - rolling update settings that limit simultaneous disruption
54
+ - PodDisruptionBudget for critical workloads
55
+ - Service type matched to actual exposure need
56
+ - ConfigMap and Secret references instead of baked config
57
+
58
+ Use `./references/k8s-manifest-patterns.md` for concrete manifest examples.
59
+
60
+ ## Health Probes
61
+
62
+ | Probe | Purpose | Failure Action |
63
+ |-------|---------|----------------|
64
+ | Liveness | Detect deadlocked or unrecoverable process | restart container |
65
+ | Readiness | Control whether traffic reaches the pod | remove pod from service endpoints |
66
+ | Startup | Protect slow boot from early liveness failure | delay other probe enforcement until startup completes |
67
+
68
+ ## Docker Compose Pattern
69
+
70
+ For local or small deployments:
71
+ - define each service explicitly
72
+ - isolate shared environment variables in files or secret stores
73
+ - declare volumes and networks on purpose
74
+ - use health checks so service startup order is evidence-based, not timing-based
75
+
76
+ Do not treat Docker Compose defaults as production architecture.
77
+
78
+ ## Container Registry Discipline
79
+
80
+ Use a tagging and retention policy:
81
+ - tag images with semver and git SHA
82
+ - enable scan-on-push where the registry supports it
83
+ - expire unneeded intermediate tags
84
+ - keep a clear mapping from deployed revision to immutable image digest
85
+
86
+ ## Common Mistakes
87
+
88
+ | Mistake | Why it fails |
89
+ |---------|--------------|
90
+ | Use `latest` everywhere | Rollback and provenance become guesswork |
91
+ | Run as root by default | One container escape becomes much worse |
92
+ | Bake secrets into Dockerfile or image | Rotation and access control break immediately |
93
+ | Skip resource limits | Noisy neighbors and eviction behavior become unpredictable |
94
+ | Omit probes | Broken containers look healthy until users discover them |
95
+
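The hard gate in `ops-container/SKILL.md` above ("vulnerability scan passes with no CRITICAL or HIGH findings") reduces to a simple filter once scan results are available as data. This is a reader-side sketch, not part of the package: the shape of each finding (a dict with `id` and `severity` keys) is an assumption and does not correspond to any particular scanner's report format.

```python
# Severities that block publish or deploy, per the hard gate.
BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(findings):
    """Return the findings whose severity blocks the image."""
    return [f for f in findings if f["severity"].upper() in BLOCKING]


def scan_gate_passes(findings):
    """True only when no CRITICAL or HIGH finding is present."""
    return not blocking_findings(findings)


if __name__ == "__main__":
    # Hypothetical scan result; real scanners use their own report schemas.
    report = [
        {"id": "EXAMPLE-0001", "severity": "LOW"},
        {"id": "EXAMPLE-0002", "severity": "high"},
    ]
    for finding in blocking_findings(report):
        print("blocking:", finding["id"])
    print("gate passes:", scan_gate_passes(report))
```

In a pipeline, the gate would typically exit non-zero when `scan_gate_passes` is false so that the publish step never runs.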
@@ -0,0 +1,116 @@
1
+ <!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
2
+
3
+ # Kubernetes Manifest Patterns
4
+
5
+ ## Deployment
6
+
7
+ ```yaml
8
+ apiVersion: apps/v1
9
+ kind: Deployment
10
+ metadata:
11
+   name: app-api
12
+ spec:
13
+   replicas: 3
14
+   strategy:
15
+     type: RollingUpdate
16
+     rollingUpdate:
17
+       maxUnavailable: 1
18
+       maxSurge: 1
19
+   selector:
20
+     matchLabels:
21
+       app: app-api
22
+   template:
23
+     metadata:
24
+       labels:
25
+         app: app-api
26
+     spec:
27
+       containers:
28
+         - name: api
29
+           image: registry.example.com/app-api:1.2.3-abcd123
30
+           ports:
31
+             - containerPort: 8080
32
+           resources:
33
+             requests:
34
+               cpu: "250m"
35
+               memory: "256Mi"
36
+             limits:
37
+               cpu: "500m"
38
+               memory: "512Mi"
39
+           readinessProbe:
40
+             httpGet:
41
+               path: /health/ready
42
+               port: 8080
43
+             initialDelaySeconds: 5
44
+             periodSeconds: 10
45
+           livenessProbe:
46
+             httpGet:
47
+               path: /health/live
48
+               port: 8080
49
+             initialDelaySeconds: 15
50
+             periodSeconds: 20
51
+           startupProbe:
52
+             httpGet:
53
+               path: /health/startup
54
+               port: 8080
55
+             failureThreshold: 30
56
+             periodSeconds: 5
57
+           envFrom:
58
+             - configMapRef:
59
+                 name: app-api-config
60
+             - secretRef:
61
+                 name: app-api-secrets
62
+           securityContext:
63
+             runAsNonRoot: true
64
+             allowPrivilegeEscalation: false
65
+ ```
66
+
67
+ ## Service
68
+
69
+ ```yaml
70
+ apiVersion: v1
71
+ kind: Service
72
+ metadata:
73
+   name: app-api
74
+ spec:
75
+   type: ClusterIP
76
+   selector:
77
+     app: app-api
78
+   ports:
79
+     - name: http
80
+       port: 80
81
+       targetPort: 8080
82
+ ```
83
+
84
+ ## ConfigMap
85
+
86
+ ```yaml
87
+ apiVersion: v1
88
+ kind: ConfigMap
89
+ metadata:
90
+   name: app-api-config
91
+ data:
92
+   APP_ENV: production
93
+   LOG_LEVEL: info
94
+   FEATURE_FLAG_X: "false"
95
+ ```
96
+
97
+ ## PodDisruptionBudget
98
+
99
+ ```yaml
100
+ apiVersion: policy/v1
101
+ kind: PodDisruptionBudget
102
+ metadata:
103
+   name: app-api
104
+ spec:
105
+   minAvailable: 2
106
+   selector:
107
+     matchLabels:
108
+       app: app-api
109
+ ```
110
+
111
+ ## Pattern Notes
112
+
113
+ - keep resource limits and probes in the first deployable version
114
+ - route config and secrets through references, not inline literals
115
+ - match `maxUnavailable` and `minAvailable` so rollout and maintenance rules do not fight each other
116
+
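The last pattern note in `k8s-manifest-patterns.md` above, matching `maxUnavailable` and `minAvailable`, comes down to simple arithmetic on the replica count. The sketch below is a reader-side illustration, not part of the package. It assumes all three values are plain integers (percentage values are not handled), and the function name `check_disruption_settings` is hypothetical.

```python
def check_disruption_settings(replicas, max_unavailable, min_available):
    """Return a list of conflicts between rollout and PDB settings."""
    problems = []
    # Lowest number of available pods a rolling update may leave running.
    rollout_floor = replicas - max_unavailable
    if rollout_floor < min_available:
        problems.append(
            f"rollout may leave {rollout_floor} pods available, "
            f"below the PDB minAvailable of {min_available}"
        )
    # Voluntary evictions the PDB allows when all replicas are healthy.
    if replicas - min_available < 1:
        problems.append(
            "PDB allows no voluntary evictions, so node drains would block"
        )
    return problems


if __name__ == "__main__":
    # Values from the Deployment and PodDisruptionBudget examples.
    print(check_disruption_settings(replicas=3, max_unavailable=1, min_available=2))
```

With the example values (3 replicas, `maxUnavailable: 1`, `minAvailable: 2`) the rollout floor equals the budget floor, so the check reports no conflict.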
@@ -0,0 +1,88 @@
1
+ <!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
2
+ ---
3
+ name: ops-cost
4
+ description: Cloud cost analysis and optimization. Use when reviewing infrastructure costs, right-sizing resources, or implementing cost reduction strategies -- covers cost allocation, tagging, reserved capacity, and budget alerts.
5
+ ---
6
+
7
+ # Ops Cost
8
+
9
+ ## Overview
10
+
11
+ Cost optimization is not a one-time cleanup. It is an operating loop: make spend visible, find waste, right-size safely, buy commitments with evidence, and review the result every month.
12
+
13
+ ## Five-Step Framework
14
+
15
+ 1. **Visibility**
16
+    - tag all resources by owner, environment, service, and cost center
17
+ 2. **Identify Waste**
18
+    - find unused, idle, or obviously oversized resources
19
+ 3. **Right-Size**
20
+    - adjust resources based on measured utilization
21
+ 4. **Reserved Capacity**
22
+    - buy longer commitments only after usage patterns are stable
23
+ 5. **Continuous Optimization**
24
+    - review, clean up, and compare month over month
25
+
26
+ ## Waste Types
27
+
28
+ | Waste Type | Detection Rule | Typical Response |
29
+ |------------|----------------|------------------|
30
+ | Unused resources | CPU below 5% for 14 or more days | delete or stop |
31
+ | Over-provisioned resources | CPU or memory below 30% average | downsize after validation |
32
+ | Unattached storage | no active attachment or mount | delete after review |
33
+ | Old snapshots | beyond retention with no restore purpose | expire automatically |
34
+ | Idle load balancers | no meaningful traffic or backend use | consolidate or remove |
35
+
36
+ ## Right-Sizing
37
+
38
+ Use a 30-day utilization window before changing production sizing:
39
+ - review CPU, memory, storage, and throughput
40
+ - recommend smaller instance families or lower replica counts where safe
41
+ - validate with load testing or staged rollout before changing production
42
+
43
+ Route structural changes through `ops-infra-plan` and rollout changes through `ops-deploy`.
44
+
45
+ ## Purchase Types
46
+
47
+ | Purchase Type | Savings | Commitment | Risk |
48
+ |---------------|---------|------------|------|
49
+ | On-demand | baseline | none | low |
50
+ | Spot or preemptible | 60% to 90% | none | high interruption risk |
51
+ | 1-year reserved | medium | 1 year | medium |
52
+ | 3-year reserved | high | 3 years | higher lock-in risk |
53
+ | Savings plan or equivalent | medium to high | usage commitment | medium |
54
+
55
+ ## <HARD-GATE>
56
+
57
+ Do not call cost management operationally sound until all are true:
58
+ - all production resources are tagged
59
+ - monthly cost review is documented
60
+ - unused resource cleanup runs at least monthly
61
+ - budget alerts are configured per account or environment
62
+
63
+ ## Cost Report Template
64
+
65
+ Use a recurring report with:
66
+ - spending by service
67
+ - month-over-month comparison
68
+ - optimization opportunities
69
+ - reserved-capacity recommendations
70
+ - action items with owners and due dates
71
+
72
+ ## Common Mistakes
73
+
74
+ | Mistake | Why it fails |
75
+ |---------|--------------|
76
+ | No tagging strategy | Costs cannot be assigned or optimized correctly |
77
+ | Buy reserved capacity without usage data | Long-term commitment locks in the wrong shape |
78
+ | Ignore data transfer costs | Network spend grows outside compute reviews |
79
+ | Run one audit and stop | Waste returns as systems change |
80
+ | Optimize cost without performance validation | Reliability drops and rollback follows |
81
+
82
+ ## Execution Handoff
83
+
84
+ After a cost review:
85
+ - send architecture changes to `ops-infra-plan`
86
+ - deploy right-sizing changes with `ops-deploy`
87
+ - confirm savings and service health with `ops-verify`
88
+
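The first two detection rules in the Waste Types table of `ops-cost/SKILL.md` above can be expressed as a small classifier. This is a reader-side sketch, not part of the package: it assumes the input is a list of daily average CPU percentages with the newest value last, it ignores memory, and the function name `classify_cpu_usage` is hypothetical.

```python
from statistics import mean


def classify_cpu_usage(daily_cpu_pct):
    """Classify a resource from its daily average CPU percentages.

    Thresholds follow the Waste Types table: below 5% for 14 or more
    days counts as unused, below 30% average as over-provisioned.
    """
    last_14 = daily_cpu_pct[-14:]
    if len(last_14) == 14 and all(value < 5 for value in last_14):
        return "unused"
    if daily_cpu_pct and mean(daily_cpu_pct) < 30:
        return "over-provisioned"
    return "ok"


if __name__ == "__main__":
    # Hypothetical 30-day windows, matching the right-sizing guidance.
    print(classify_cpu_usage([2.0] * 30))
    print(classify_cpu_usage([20.0] * 30))
    print(classify_cpu_usage([60.0] * 30))
```

A classification is only a candidate list; the skill text still routes any actual downsizing through validation, `ops-infra-plan`, and `ops-deploy`.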