sdtk-ops-kit 0.2.0
- package/README.md +146 -0
- package/assets/manifest/toolkit-bundle.manifest.json +187 -0
- package/assets/manifest/toolkit-bundle.sha256.txt +36 -0
- package/assets/toolkit/toolkit/AGENTS.md +65 -0
- package/assets/toolkit/toolkit/SDTKOPS_TOOLKIT.md +166 -0
- package/assets/toolkit/toolkit/install.ps1 +138 -0
- package/assets/toolkit/toolkit/scripts/install-claude-skills.ps1 +81 -0
- package/assets/toolkit/toolkit/scripts/install-codex-skills.ps1 +127 -0
- package/assets/toolkit/toolkit/scripts/uninstall-claude-skills.ps1 +65 -0
- package/assets/toolkit/toolkit/scripts/uninstall-codex-skills.ps1 +53 -0
- package/assets/toolkit/toolkit/sdtk-spec.config.json +6 -0
- package/assets/toolkit/toolkit/sdtk-spec.config.profiles.example.json +12 -0
- package/assets/toolkit/toolkit/skills/ops-backup/SKILL.md +93 -0
- package/assets/toolkit/toolkit/skills/ops-backup/references/backup-script-patterns.md +108 -0
- package/assets/toolkit/toolkit/skills/ops-ci-cd/SKILL.md +88 -0
- package/assets/toolkit/toolkit/skills/ops-ci-cd/references/pipeline-examples.md +113 -0
- package/assets/toolkit/toolkit/skills/ops-compliance/SKILL.md +105 -0
- package/assets/toolkit/toolkit/skills/ops-container/SKILL.md +95 -0
- package/assets/toolkit/toolkit/skills/ops-container/references/k8s-manifest-patterns.md +116 -0
- package/assets/toolkit/toolkit/skills/ops-cost/SKILL.md +88 -0
- package/assets/toolkit/toolkit/skills/ops-debug/SKILL.md +311 -0
- package/assets/toolkit/toolkit/skills/ops-debug/references/root-cause-tracing.md +138 -0
- package/assets/toolkit/toolkit/skills/ops-deploy/SKILL.md +102 -0
- package/assets/toolkit/toolkit/skills/ops-discover/SKILL.md +102 -0
- package/assets/toolkit/toolkit/skills/ops-incident/SKILL.md +113 -0
- package/assets/toolkit/toolkit/skills/ops-incident/references/communication-templates.md +34 -0
- package/assets/toolkit/toolkit/skills/ops-incident/references/postmortem-template.md +69 -0
- package/assets/toolkit/toolkit/skills/ops-incident/references/runbook-template.md +69 -0
- package/assets/toolkit/toolkit/skills/ops-infra-plan/SKILL.md +123 -0
- package/assets/toolkit/toolkit/skills/ops-infra-plan/references/iac-patterns.md +141 -0
- package/assets/toolkit/toolkit/skills/ops-monitor/SKILL.md +110 -0
- package/assets/toolkit/toolkit/skills/ops-monitor/references/alert-rules.md +80 -0
- package/assets/toolkit/toolkit/skills/ops-monitor/references/slo-templates.md +83 -0
- package/assets/toolkit/toolkit/skills/ops-parallel/SKILL.md +177 -0
- package/assets/toolkit/toolkit/skills/ops-plan/SKILL.md +169 -0
- package/assets/toolkit/toolkit/skills/ops-security-infra/SKILL.md +126 -0
- package/assets/toolkit/toolkit/skills/ops-security-infra/references/cicd-security-pipeline.md +55 -0
- package/assets/toolkit/toolkit/skills/ops-security-infra/references/security-headers.md +24 -0
- package/assets/toolkit/toolkit/skills/ops-verify/SKILL.md +180 -0
- package/bin/sdtk-ops.js +14 -0
- package/package.json +46 -0
- package/src/commands/generate.js +12 -0
- package/src/commands/help.js +53 -0
- package/src/commands/init.js +86 -0
- package/src/commands/runtime.js +201 -0
- package/src/index.js +65 -0
- package/src/lib/args.js +107 -0
- package/src/lib/errors.js +41 -0
- package/src/lib/powershell.js +65 -0
- package/src/lib/scope.js +58 -0
- package/src/lib/toolkit-payload.js +123 -0
@@ -0,0 +1,88 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
---
name: ops-ci-cd
description: CI/CD pipeline design and management. Use when setting up or modifying build, test, and deployment pipelines -- covers pipeline stages, branch strategies, artifact management, and secret handling.
---

# Ops CI CD

## Overview

A pipeline should make the safe path the default path. It must build the artifact, test it, scan it, promote it, and leave an auditable trail of what ran and what was deployed.

## Standard Pipeline Flow

Use this default flow:

```text
commit
  -> build
  -> unit test
  -> lint and SAST
  -> integration test
  -> security scan
  -> build artifact
  -> deploy staging
  -> smoke test
  -> deploy production
  -> health check
```

Do not collapse build, scan, and deploy into one opaque job.

## Branch And Promotion Strategy

Prefer:
- trunk-based development with short-lived branches
- automatic promotion to staging after required checks pass
- manual or policy-gated promotion to production
- immutable artifact promotion instead of rebuilding per environment

## Pipeline Config Patterns

Keep pipeline definitions:
- small enough to review
- explicit about dependencies between jobs
- strict about required checks
- separate from infrastructure design details that belong in `ops-infra-plan`

Use `./references/pipeline-examples.md` for GitHub Actions and GitLab CI examples.

## <HARD-GATE>

NEVER:
- hardcode secrets in pipeline files
- pass long-lived cloud credentials through plain environment variables when OIDC or equivalent short-lived auth is available
- skip secret rotation or access auditing

ALWAYS:
- use platform secret stores
- scope credentials to the minimum required permissions
- rotate and review access on a defined schedule
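
One way to back the first NEVER rule with automation is a pre-merge check that flags credential-looking literals in pipeline files. The sketch below is only an illustration: the patterns and the `find_hardcoded_secrets` helper are assumptions made for this example, not part of SDTK-OPS, and a dedicated secret scanner would cover far more cases.

```python
import re

# Hypothetical patterns for illustration only; real scanners ship far larger rule sets.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"(?i)\b(password|secret|token|api_key)\s*[:=]\s*['\"]?[A-Za-z0-9/+_\-]{12,}"),
]


def find_hardcoded_secrets(text: str) -> list[int]:
    """Return 1-based line numbers that look like they contain a literal secret."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # References to a platform secret store, such as ${{ secrets.NAME }}, are allowed.
        if "${{ secrets." in line:
            continue
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append(number)
    return hits
```

A check like this can fail the pipeline when the returned list is non-empty. It complements secret stores and rotation rather than replacing them.
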

## Artifact Management

Pipeline outputs should be:
- versioned with semver and git SHA
- immutable once published
- stored with retention rules
- traceable from deployment record back to commit and build
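
As a small illustration of versioning with semver and git SHA, the sketch below builds a tag such as `1.2.3-abcd123` and rejects malformed input. The `artifact_tag` helper and the seven-character short SHA are assumptions for this example, not an SDTK-OPS convention.

```python
import re

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")      # plain MAJOR.MINOR.PATCH only
GIT_SHA = re.compile(r"^[0-9a-f]{7,40}$")    # abbreviated or full commit hash


def artifact_tag(version: str, git_sha: str, sha_length: int = 7) -> str:
    """Combine a semver version and a commit SHA into one traceable tag."""
    if not SEMVER.match(version):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    if not GIT_SHA.match(git_sha):
        raise ValueError(f"not a git commit SHA: {git_sha!r}")
    return f"{version}-{git_sha[:sha_length]}"
```

Because the tag embeds the commit, a deployment record that stores the tag can be traced back to the exact source revision.
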

## Pipeline Optimization

Optimize only after correctness:
- cache dependencies and build layers
- parallelize independent jobs
- use matrix builds only when they produce clear value
- skip unaffected jobs based on changed files when dependency rules are reliable
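
Skipping unaffected jobs can be sketched as a mapping from jobs to path patterns. The job names, paths, and the `jobs_to_run` helper below are hypothetical. The fallback of running every job when a changed file matches no rule is one possible way to stay safe when the dependency rules are incomplete.

```python
from fnmatch import fnmatch

# Hypothetical mapping from job name to the path patterns that affect it.
JOB_PATHS = {
    "api-tests": ["services/api/*", "libs/shared/*"],
    "web-tests": ["services/web/*", "libs/shared/*"],
    "docs-build": ["docs/*"],
}


def jobs_to_run(changed_files, job_paths=JOB_PATHS):
    """Select the jobs affected by the changed files; run everything when unsure."""
    all_patterns = [p for patterns in job_paths.values() for p in patterns]
    for path in changed_files:
        # A file outside every rule means the rules cannot be trusted for this change.
        if not any(fnmatch(path, pattern) for pattern in all_patterns):
            return set(job_paths)
    return {
        job
        for job, patterns in job_paths.items()
        if any(fnmatch(path, pattern) for path in changed_files for pattern in patterns)
    }
```

Note that `fnmatch` lets `*` match across directory separators, so `services/api/*` also covers nested files.
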

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Secrets committed in YAML | Rotation and audit control are lost immediately |
| No staging environment | Production becomes the first integration test |
| Manual deploy steps outside pipeline | Audit trail and repeatability disappear |
| Skip security scanning to save time | Vulnerabilities ship by default |
| No build cache or job parallelism | Pipeline slows down until teams bypass it |

@@ -0,0 +1,113 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->

# Pipeline Examples

## GitHub Actions

```yaml
name: Production Deployment

on:
  push:
    branches: [main]

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run dependency and image scan
        run: ./scripts/security-scan.sh

  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: ./scripts/install.sh
      - name: Run tests
        run: ./scripts/test.sh

  build:
    runs-on: ubuntu-latest
    needs: [security-scan, test]
    steps:
      - uses: actions/checkout@v4
      - name: Build artifact
        run: ./scripts/build.sh
      - name: Publish image
        run: ./scripts/publish-image.sh

  deploy:
    runs-on: ubuntu-latest
    needs: [build]
    environment: production
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Blue-Green deploy
        run: |
          # Roll the new image out to the idle (green) Deployment, then switch traffic.
          # `app-green` is an assumed name for that Deployment.
          kubectl set image deployment/app-green app=registry.example.com/app:${{ github.sha }}
          kubectl rollout status deployment/app-green
          kubectl patch service app-service -p '{"spec":{"selector":{"version":"green"}}}'
      - name: Health check
        run: curl -fsS https://app.example.com/health
```

## GitLab CI

```yaml
stages:
  - build
  - test
  - security
  - deploy_staging
  - smoke
  - deploy_prod

variables:
  IMAGE_TAG: "$CI_COMMIT_SHORT_SHA"

build:
  stage: build
  script:
    - ./scripts/build.sh
    - ./scripts/publish-image.sh "$IMAGE_TAG"

test:
  stage: test
  script:
    - ./scripts/test.sh

security_scan:
  stage: security
  script:
    - ./scripts/security-scan.sh

deploy_staging:
  stage: deploy_staging
  script:
    - ./scripts/deploy.sh staging "$IMAGE_TAG"

smoke_test:
  stage: smoke
  script:
    - ./scripts/smoke-test.sh staging

deploy_prod:
  stage: deploy_prod
  when: manual
  script:
    - ./scripts/deploy.sh production "$IMAGE_TAG"
    - ./scripts/health-check.sh production
```

## Pattern Notes

- keep security scanning before production deploy
- promote the same artifact between environments
- prefer platform identity federation over static cloud keys
- keep deployment commands explicit enough to review and replay

@@ -0,0 +1,105 @@
---
name: ops-compliance
description: Compliance and audit readiness. Use when setting up compliance scanning, audit logging, or preparing evidence for regulatory audits -- covers policy-as-code patterns and evidence collection.
---

# Ops Compliance

## Overview

Compliance automation should make controls testable and evidence easy to retrieve. SDTK-OPS covers the tooling and operational patterns that support compliance work. It does not replace legal interpretation, certification strategy, or third-party auditor management.

## The Iron Law

```
NO AUDIT READINESS WITHOUT IMMUTABLE LOGS AND AUTOMATED EVIDENCE COLLECTION
```

## When to Use

Use for:
- audit logging design
- evidence collection workflows
- policy-as-code adoption
- compliance scanning in CI/CD
- audit readiness reviews before external assessment

## Regulatory Framework Awareness

| Framework | Focus Area | Key Operational Requirements |
|-----------|------------|------------------------------|
| SOC 2 | trust services criteria | access control, change management, logging, availability evidence |
| GDPR | personal data handling | data access controls, deletion workflows, audit trail, breach evidence |
| HIPAA | protected health information | access auditing, retention, encryption, incident evidence |
| PCI-DSS | payment card data security | segmentation, access control, logging, vulnerability management |

These are awareness anchors only. They are not legal or certification guidance.

## Audit Logging Requirements

Every audit trail should answer:
- **WHO**
  - which identity performed the action
- **WHAT**
  - what action or change happened
- **WHEN**
  - timestamp in a consistent standard
- **WHERE**
  - account, region, environment, or system boundary
- **WHETHER**
  - outcome, such as allowed, denied, success, or failure
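
The five questions above map naturally onto a structured record. The sketch below is a minimal illustration, assuming JSON output and UTC ISO 8601 timestamps as the "consistent standard"; the `audit_record` helper and its field names are hypothetical, not an SDTK-OPS schema.

```python
import json
from datetime import datetime, timezone

ALLOWED_OUTCOMES = {"allowed", "denied", "success", "failure"}


def audit_record(who, what, where, outcome, when=None):
    """Serialize one audit event that answers who, what, when, where, and whether."""
    if outcome not in ALLOWED_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    # Default to the current time; callers should pass timezone-aware datetimes.
    when = when or datetime.now(timezone.utc)
    record = {
        "who": who,
        "what": what,
        "when": when.astimezone(timezone.utc).isoformat(),
        "where": where,
        "whether": outcome,
    }
    return json.dumps(record, sort_keys=True)
```

Keeping every event in one shape makes later evidence queries, such as listing all denied actions in production, a simple filter rather than a log-parsing exercise.
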

## Implementation Patterns

Use a layered pattern:
- cloud-provider audit logs for infrastructure and identity changes
- application audit logs in structured JSON for user and system events
- immutable or write-once log storage for retention-sensitive records
- restricted access to log management and export paths

## <HARD-GATE>

Do not claim audit readiness until all are true:
- audit logs are immutable or strongly tamper-resistant
- retention is at least 1 year, or longer if policy requires it
- tamper monitoring exists for critical logs
- access to log management is restricted and reviewable
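
The gate above can be expressed as a scripted control check. The sketch below is an assumption-laden illustration: the `LogControlStatus` fields would have to be filled from whatever the logging platform actually reports, and the 365-day default simply mirrors the one-year minimum stated in the gate.

```python
from dataclasses import dataclass


@dataclass
class LogControlStatus:
    """Hypothetical snapshot of the logging controls for one environment."""
    immutable_storage: bool
    retention_days: int
    tamper_monitoring: bool
    access_restricted: bool


def audit_readiness_gaps(status, required_retention_days=365):
    """Return the unmet gate conditions; an empty list means the gate passes."""
    gaps = []
    if not status.immutable_storage:
        gaps.append("audit logs are not immutable or tamper-resistant")
    if status.retention_days < required_retention_days:
        gaps.append(f"retention is {status.retention_days} days, below {required_retention_days}")
    if not status.tamper_monitoring:
        gaps.append("no tamper monitoring for critical logs")
    if not status.access_restricted:
        gaps.append("log management access is not restricted")
    return gaps
```

Returning the list of gaps, rather than a single boolean, gives the reviewer the evidence needed to close each item.
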

## Policy As Code

Policy should be executable:
- define guardrails as code
- run them in CI/CD
- fail deployment when policies are violated

Awareness tools include:
- OPA and Rego
- cloud-native policy engines
- IaC scanners such as Checkov or tfsec

## Evidence Collection Automation

| Control Area | Evidence Type | Collection Method | Frequency |
|--------------|---------------|-------------------|-----------|
| Access review | privileged access report | scheduled export from IAM source | monthly |
| Change management | deployment and approval log | CI/CD pipeline export | per release |
| Backup control | restore drill report | runbook output and stored artifact | quarterly |
| Logging control | retention and tamper status | scripted control check | monthly |
| Vulnerability management | scan summary and remediation status | scanner export | weekly |

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Manual evidence collection | Audit prep becomes a scramble and gaps are hidden |
| Treat compliance as a yearly project | Controls drift the rest of the year |
| Keep policy only in documents | Violations continue because nothing enforces them |
| Store audit logs with normal mutable application data | Tampering and accidental deletion become easier |

## Execution Handoff

After compliance controls are defined:
- implement logging and retention changes through `ops-infra-plan`
- route CI/CD policy enforcement through `ops-ci-cd`
- validate collected evidence with `ops-verify`

@@ -0,0 +1,95 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
---
name: ops-container
description: Container operations. Use when building Docker images, writing Dockerfiles, creating Kubernetes manifests, or managing container lifecycle -- covers best practices, security scanning, and orchestration patterns.
---

# Ops Container

## Overview

Container workflows fail when image build, runtime security, and orchestration design are treated as separate concerns. Build lean images, run them with explicit security posture, and define manifests that expose health and resource boundaries.

## When to Use

Use for:
- Dockerfile authoring or review
- image build and tagging strategy
- Kubernetes deployment manifest design
- local multi-service orchestration
- container registry policy updates

## Dockerfile Best Practices

Use these defaults:
- multi-stage builds to separate build tools from runtime image
- specific base image tags, never floating `latest`
- minimal runtime packages
- `.dockerignore` to exclude git metadata, secrets, and local caches
- deterministic dependency install steps for repeatable builds

## Security Defaults

Container security starts in the image:
- run as non-root user
- pin base images to explicit versions
- scan images before push
- keep secrets out of the image and build args
- remove unused packages and shells where possible

## <HARD-GATE>

Do not publish or deploy the image unless all are true:
- multi-stage build is used where build tooling is separate from runtime
- container runs as non-root unless a reviewed exception exists
- vulnerability scan passes with no CRITICAL or HIGH findings
- base image tag is specific and reviewable
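
Three of the four gate conditions lend themselves to an automated pre-publish check. The sketch below is a hypothetical illustration: `publish_blockers` and its inputs are assumptions, the scan findings are taken as a plain list of severity strings, and the multi-stage build condition is left to review because it cannot be derived from these inputs.

```python
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def base_image_is_pinned(image_ref: str) -> bool:
    """True when the reference carries a digest or a tag other than `latest`."""
    if "@sha256:" in image_ref:
        return True
    # Inspect only the last path component so a registry port is not mistaken for a tag.
    name = image_ref.rsplit("/", 1)[-1]
    if ":" not in name:
        return False
    tag = name.split(":", 1)[1]
    return tag not in ("", "latest")


def publish_blockers(base_image, runs_as_root, findings, root_exception_reviewed=False):
    """Return the reasons the image must not be published; empty means the gate passes."""
    blockers = []
    if not base_image_is_pinned(base_image):
        blockers.append("base image tag is not specific")
    if runs_as_root and not root_exception_reviewed:
        blockers.append("container runs as root without a reviewed exception")
    blocking = [s for s in findings if s.upper() in BLOCKING_SEVERITIES]
    if blocking:
        blockers.append(f"{len(blocking)} CRITICAL or HIGH findings")
    return blockers
```
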

## Kubernetes Patterns

For production-oriented workloads, define:
- Deployment with explicit resource requests and limits
- readiness, liveness, and startup probes
- rolling update settings that limit simultaneous disruption
- PodDisruptionBudget for critical workloads
- Service type matched to actual exposure need
- ConfigMap and Secret references instead of baked config

Use `./references/k8s-manifest-patterns.md` for concrete manifest examples.

## Health Probes

| Probe | Purpose | Failure Action |
|-------|---------|----------------|
| Liveness | Detect deadlocked or unrecoverable process | restart container |
| Readiness | Control whether traffic reaches the pod | remove pod from service endpoints |
| Startup | Protect slow boot from early liveness failure | delay other probe enforcement until startup completes |

## Docker Compose Pattern

For local or small deployments:
- define each service explicitly
- isolate shared environment variables in files or secret stores
- declare volumes and networks on purpose
- use health checks so service startup order is evidence-based, not timing-based

Do not treat Docker Compose defaults as production architecture.

## Container Registry Discipline

Use a tagging and retention policy:
- tag images with semver and git SHA
- enable scan-on-push where the registry supports it
- expire unneeded intermediate tags
- keep a clear mapping from deployed revision to immutable image digest

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Use `latest` everywhere | Rollback and provenance become guesswork |
| Run as root by default | One container escape becomes much worse |
| Bake secrets into Dockerfile or image | Rotation and access control break immediately |
| Skip resource limits | Noisy neighbors and eviction behavior become unpredictable |
| Omit probes | Broken containers look healthy until users discover them |

@@ -0,0 +1,116 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->

# Kubernetes Manifest Patterns

## Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: app-api
  template:
    metadata:
      labels:
        app: app-api
    spec:
      containers:
        - name: api
          image: registry.example.com/app-api:1.2.3-abcd123
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8080
            failureThreshold: 30
            periodSeconds: 5
          envFrom:
            - configMapRef:
                name: app-api-config
            - secretRef:
                name: app-api-secrets
          securityContext:
            runAsNonRoot: true
            allowPrivilegeEscalation: false
```

## Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app-api
spec:
  type: ClusterIP
  selector:
    app: app-api
  ports:
    - name: http
      port: 80
      targetPort: 8080
```

## ConfigMap

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-api-config
data:
  APP_ENV: production
  LOG_LEVEL: info
  FEATURE_FLAG_X: "false"
```

## PodDisruptionBudget

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: app-api
```

## Pattern Notes

- keep resource limits and probes in the first deployable version
- route config and secrets through references, not inline literals
- match `maxUnavailable` and `minAvailable` so rollout and maintenance rules do not fight each other
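
The last note comes down to simple arithmetic, sketched below for absolute values only (percentage settings are not handled). Both helpers are hypothetical. The check treats the PodDisruptionBudget floor as a target the rollout should also respect, which is a design convention assumed here rather than something Kubernetes enforces during a rolling update.

```python
def rollout_respects_pdb(replicas: int, max_unavailable: int, min_available: int) -> bool:
    """True when a rolling update never drops below the PodDisruptionBudget floor."""
    return replicas - max_unavailable >= min_available


def allowed_evictions(replicas: int, min_available: int) -> int:
    """Number of pods that can be evicted voluntarily at once; 0 blocks node drains."""
    return max(replicas - min_available, 0)
```

With the manifests above (`replicas: 3`, `maxUnavailable: 1`, `minAvailable: 2`), the rollout keeps two pods available and one voluntary eviction is allowed at a time.
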
@@ -0,0 +1,88 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
---
name: ops-cost
description: Cloud cost analysis and optimization. Use when reviewing infrastructure costs, right-sizing resources, or implementing cost reduction strategies -- covers cost allocation, tagging, reserved capacity, and budget alerts.
---

# Ops Cost

## Overview

Cost optimization is not a one-time cleanup. It is an operating loop: make spend visible, find waste, right-size safely, buy commitments with evidence, and review the result every month.

## Five-Step Framework

1. **Visibility**
   - tag all resources by owner, environment, service, and cost center
2. **Identify Waste**
   - find unused, idle, or obviously oversized resources
3. **Right-Size**
   - adjust resources based on measured utilization
4. **Reserved Capacity**
   - buy longer commitments only after usage patterns are stable
5. **Continuous Optimization**
   - review, clean up, and compare month over month

## Waste Types

| Waste Type | Detection Rule | Typical Response |
|------------|----------------|------------------|
| Unused resources | CPU below 5% for 14 or more days | delete or stop |
| Over-provisioned resources | CPU or memory below 30% average | downsize after validation |
| Unattached storage | no active attachment or mount | delete after review |
| Old snapshots | beyond retention with no restore purpose | expire automatically |
| Idle load balancers | no meaningful traffic or backend use | consolidate or remove |
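
The first two detection rules in the table can be applied to utilization data with a few lines of code. The sketch below is illustrative only: `ResourceUsage` and `classify_waste` are hypothetical names, and the averages are assumed to come from whatever monitoring source is in use.

```python
from dataclasses import dataclass


@dataclass
class ResourceUsage:
    """Hypothetical utilization summary for one compute resource."""
    name: str
    avg_cpu_percent: float
    avg_memory_percent: float
    observed_days: int


def classify_waste(usage: ResourceUsage) -> str:
    """Apply the table's compute rules: unused first, then over-provisioned."""
    if usage.observed_days >= 14 and usage.avg_cpu_percent < 5:
        return "unused"
    if usage.avg_cpu_percent < 30 or usage.avg_memory_percent < 30:
        return "over-provisioned"
    return "ok"
```

A classification is a prompt for review, not an instruction to delete: the typical responses in the table still call for validation before any change.
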

## Right-Sizing

Use a 30-day utilization window before changing production sizing:
- review CPU, memory, storage, and throughput
- recommend smaller instance families or lower replica counts where safe
- validate with load testing or staged rollout before changing production

Route structural changes through `ops-infra-plan` and rollout changes through `ops-deploy`.

## Purchase Types

| Purchase Type | Savings | Commitment | Risk |
|---------------|---------|------------|------|
| On-demand | baseline | none | low |
| Spot or preemptible | 60% to 90% | none | high interruption risk |
| 1-year reserved | medium | 1 year | medium |
| 3-year reserved | high | 3 years | higher lock-in risk |
| Savings plan or equivalent | medium to high | usage commitment | medium |

## <HARD-GATE>

Do not call cost management operationally sound until all are true:
- all production resources are tagged
- monthly cost review is documented
- unused resource cleanup runs at least monthly
- budget alerts are configured per account or environment

## Cost Report Template

Use a recurring report with:
- spending by service
- month-over-month comparison
- optimization opportunities
- reserved-capacity recommendations
- action items with owners and due dates

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| No tagging strategy | Costs cannot be assigned or optimized correctly |
| Buy reserved capacity without usage data | Long-term commitment locks in the wrong shape |
| Ignore data transfer costs | Network spend grows outside compute reviews |
| Run one audit and stop | Waste returns as systems change |
| Optimize cost without performance validation | Reliability drops and rollback follows |

## Execution Handoff

After a cost review:
- send architecture changes to `ops-infra-plan`
- deploy right-sizing changes with `ops-deploy`
- confirm savings and service health with `ops-verify`