sdtk-ops-kit 0.2.0

Files changed (51)
  1. package/README.md +146 -0
  2. package/assets/manifest/toolkit-bundle.manifest.json +187 -0
  3. package/assets/manifest/toolkit-bundle.sha256.txt +36 -0
  4. package/assets/toolkit/toolkit/AGENTS.md +65 -0
  5. package/assets/toolkit/toolkit/SDTKOPS_TOOLKIT.md +166 -0
  6. package/assets/toolkit/toolkit/install.ps1 +138 -0
  7. package/assets/toolkit/toolkit/scripts/install-claude-skills.ps1 +81 -0
  8. package/assets/toolkit/toolkit/scripts/install-codex-skills.ps1 +127 -0
  9. package/assets/toolkit/toolkit/scripts/uninstall-claude-skills.ps1 +65 -0
  10. package/assets/toolkit/toolkit/scripts/uninstall-codex-skills.ps1 +53 -0
  11. package/assets/toolkit/toolkit/sdtk-spec.config.json +6 -0
  12. package/assets/toolkit/toolkit/sdtk-spec.config.profiles.example.json +12 -0
  13. package/assets/toolkit/toolkit/skills/ops-backup/SKILL.md +93 -0
  14. package/assets/toolkit/toolkit/skills/ops-backup/references/backup-script-patterns.md +108 -0
  15. package/assets/toolkit/toolkit/skills/ops-ci-cd/SKILL.md +88 -0
  16. package/assets/toolkit/toolkit/skills/ops-ci-cd/references/pipeline-examples.md +113 -0
  17. package/assets/toolkit/toolkit/skills/ops-compliance/SKILL.md +105 -0
  18. package/assets/toolkit/toolkit/skills/ops-container/SKILL.md +95 -0
  19. package/assets/toolkit/toolkit/skills/ops-container/references/k8s-manifest-patterns.md +116 -0
  20. package/assets/toolkit/toolkit/skills/ops-cost/SKILL.md +88 -0
  21. package/assets/toolkit/toolkit/skills/ops-debug/SKILL.md +311 -0
  22. package/assets/toolkit/toolkit/skills/ops-debug/references/root-cause-tracing.md +138 -0
  23. package/assets/toolkit/toolkit/skills/ops-deploy/SKILL.md +102 -0
  24. package/assets/toolkit/toolkit/skills/ops-discover/SKILL.md +102 -0
  25. package/assets/toolkit/toolkit/skills/ops-incident/SKILL.md +113 -0
  26. package/assets/toolkit/toolkit/skills/ops-incident/references/communication-templates.md +34 -0
  27. package/assets/toolkit/toolkit/skills/ops-incident/references/postmortem-template.md +69 -0
  28. package/assets/toolkit/toolkit/skills/ops-incident/references/runbook-template.md +69 -0
  29. package/assets/toolkit/toolkit/skills/ops-infra-plan/SKILL.md +123 -0
  30. package/assets/toolkit/toolkit/skills/ops-infra-plan/references/iac-patterns.md +141 -0
  31. package/assets/toolkit/toolkit/skills/ops-monitor/SKILL.md +110 -0
  32. package/assets/toolkit/toolkit/skills/ops-monitor/references/alert-rules.md +80 -0
  33. package/assets/toolkit/toolkit/skills/ops-monitor/references/slo-templates.md +83 -0
  34. package/assets/toolkit/toolkit/skills/ops-parallel/SKILL.md +177 -0
  35. package/assets/toolkit/toolkit/skills/ops-plan/SKILL.md +169 -0
  36. package/assets/toolkit/toolkit/skills/ops-security-infra/SKILL.md +126 -0
  37. package/assets/toolkit/toolkit/skills/ops-security-infra/references/cicd-security-pipeline.md +55 -0
  38. package/assets/toolkit/toolkit/skills/ops-security-infra/references/security-headers.md +24 -0
  39. package/assets/toolkit/toolkit/skills/ops-verify/SKILL.md +180 -0
  40. package/bin/sdtk-ops.js +14 -0
  41. package/package.json +46 -0
  42. package/src/commands/generate.js +12 -0
  43. package/src/commands/help.js +53 -0
  44. package/src/commands/init.js +86 -0
  45. package/src/commands/runtime.js +201 -0
  46. package/src/index.js +65 -0
  47. package/src/lib/args.js +107 -0
  48. package/src/lib/errors.js +41 -0
  49. package/src/lib/powershell.js +65 -0
  50. package/src/lib/scope.js +58 -0
  51. package/src/lib/toolkit-payload.js +123 -0
@@ -0,0 +1,311 @@
<!-- Based on superpowers by Jesse Vincent (MIT License, 2025). Adapted for SDTK-OPS. -->
---
name: ops-debug
description: Systematic infrastructure debugging. Use when encountering deployment failure, service outage, network issue, or unexpected infrastructure behavior -- binary search to root cause, evidence-first, no guessing fixes.
---

# Ops Debug

## Overview

Random fixes waste time and create new bugs. Quick patches mask underlying issues.

**Core principle:** ALWAYS find root cause before attempting fixes. Symptom fixes are failure.

**Violating the letter of this process is violating the spirit of debugging.**

## The Iron Law

```
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
```

If you have not completed Phase 1, you cannot propose fixes.

## When to Use

Use for ANY infrastructure or operations issue:
- Deployment failures
- Service outages
- Unexpected behavior
- Performance problems
- Build or release failures
- Integration issues
- Network or DNS problems

**Use this ESPECIALLY when:**
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You have already tried multiple fixes
- Previous fix did not work
- You do not fully understand the issue

**Do not skip when:**
- Issue seems simple (simple bugs have root causes too)
- You are in a hurry (rushing guarantees rework)
- Someone wants it fixed NOW (systematic is faster than thrashing)

## The Four Phases

You MUST complete each phase before proceeding to the next.

### Phase 1: Root Cause Investigation

**BEFORE attempting ANY fix:**

1. **Read Error Messages Carefully**
   - Do not skip past errors or warnings
   - They often contain the exact clue you need
   - Read stack traces completely
   - Note line numbers, file paths, error codes, probe failures, and status codes

2. **Reproduce Consistently**
   - Can you trigger it reliably?
   - What are the exact steps?
   - Does it happen every time?
   - If not reproducible -> gather more data, do not guess

3. **Check Recent Changes**
   - What changed that could cause this?
   - Git diff, recent commits, manifest changes
   - New dependencies, config changes, secret rotations
   - Environmental differences between healthy and broken systems

4. **Gather Evidence in Multi-Component Systems**

   **WHEN system has multiple components (load balancer -> service -> database, CI -> artifact -> deploy target):**

   **BEFORE proposing fixes, add diagnostic instrumentation:**
   ```
   For EACH component boundary:
   - Log what data enters component
   - Log what data exits component
   - Verify environment and config propagation
   - Check state at each layer

   Run once to gather evidence showing WHERE it breaks
   THEN analyze evidence to identify failing component
   THEN investigate that specific component
   ```

   **Infrastructure example (LB -> App -> DB -> DNS -> Network):**
   ```bash
   # Layer 1: Load balancer or ingress
   curl -Ik https://api.example.com/health

   # Layer 2: Application workload
   kubectl get pods -n production
   kubectl logs deploy/api -n production --tail=100

   # Layer 3: Database connectivity
   kubectl exec deploy/api -n production -- sh -c 'nc -zv db.internal 5432'

   # Layer 4: DNS
   dig +short api.example.com

   # Layer 5: Network controls
   kubectl get networkpolicy -n production
   ```

   **This reveals:** Whether the break happens at traffic entry, application runtime, database reachability, DNS resolution, or network controls.

5. **Trace Data Flow**

   **WHEN error is deep in the stack or spread across layers:**

   See `./references/root-cause-tracing.md` in this directory for the complete backward tracing technique.

   **Quick version:**
   - Where does the bad value or bad state originate?
   - What called this with bad input?
   - Keep tracing up until you find the source
   - Fix at source, not at symptom

### Phase 2: Pattern Analysis

**Find the pattern before fixing:**

1. **Find Working Examples**
   - Locate similar working infrastructure or runbooks in the same codebase
   - What works that is similar to what is broken?

2. **Compare Against References**
   - If implementing a pattern, read the reference implementation COMPLETELY
   - Do not skim
   - Understand the pattern fully before applying it

3. **Identify Differences**
   - What is different between working and broken?
   - List every difference, however small
   - Do not assume "that cannot matter"

4. **Understand Dependencies**
   - What other components does this need?
   - What settings, config, secrets, or environment?
   - What assumptions does it make?

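The difference-hunting steps above can be sketched as a plain `diff` of captured environment dumps. The variable values below are inline samples (hypothetical), standing in for dumps you would capture from the working and broken systems with something like `kubectl exec deploy/api -- env | sort`:

```bash
#!/usr/bin/env bash
# Sketch: list EVERY difference between a working and a broken environment.
# The env dumps here are inline samples; capture real ones per environment.
set -euo pipefail

working=$(printf '%s\n' "DATABASE_URL=postgres://db.internal:5432/app" "LOG_LEVEL=info" "REPLICAS=3" | sort)
broken=$(printf '%s\n' "LOG_LEVEL=info" "REPLICAS=3" | sort)

# diff exits non-zero when inputs differ; capture the output instead of failing.
drift=$(diff <(printf '%s\n' "$working") <(printf '%s\n' "$broken") || true)

if [ -n "$drift" ]; then
  echo "Differences found -- investigate each one:"
  echo "$drift"
else
  echo "No differences in captured variables."
fi
```

Every line of drift output is a candidate difference; none of them gets dismissed as "that cannot matter" until checked.
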
### Phase 3: Hypothesis and Testing

**Scientific method:**

1. **Form Single Hypothesis**
   - State clearly: "I think X is the root cause because Y"
   - Write it down
   - Be specific, not vague

2. **Test Minimally**
   - Make the SMALLEST possible change to test the hypothesis
   - One variable at a time
   - Do not fix multiple things at once

3. **Verify Before Continuing**
   - Did it work? Yes -> Phase 4
   - Did it fail? Form a NEW hypothesis
   - DO NOT add more fixes on top

4. **When You Do Not Know**
   - Say "I do not understand X"
   - Do not pretend to know
   - Ask for help
   - Research more

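The single-variable discipline above can be sketched as follows. `run_check` and the timeout value are hypothetical stand-ins for a real reproduction command and the one setting under test:

```bash
#!/usr/bin/env bash
# Sketch: test ONE hypothesis by changing ONE variable.
# run_check stands in for your real reproduction command (hypothetical).
set -euo pipefail

run_check() {
  # Pretend the service only starts when the connect timeout is at least 5s.
  local timeout="$1"
  if [ "$timeout" -ge 5 ]; then echo PASS; else echo FAIL; fi
}

# Hypothesis: "the 2s connect timeout is the root cause, because the DB
# handshake takes ~3s". Change only that variable, nothing else.
before=$(run_check 2)   # reproduces the failure
after=$(run_check 5)    # the single-variable change

echo "before=$before after=$after"
```

If `after` still fails, that is a disproven hypothesis, not an invitation to stack a second change on top.
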
### Phase 4: Implementation

**Fix the root cause, not the symptom:**

1. **Create a Repeatable Failing Check**
   - Simplest possible reproduction
   - Automated test if possible
   - Diagnostic command or small script if no framework exists
   - MUST have before fixing

2. **Implement Single Fix**
   - Address the root cause identified
   - ONE change at a time
   - No "while I'm here" improvements
   - No bundled refactoring

3. **Verify Fix**
   - Check passes now?
   - No other health checks broken?
   - Issue actually resolved?

4. **If Fix Does Not Work**
   - STOP
   - Count: How many fixes have you tried?
   - If < 3: Return to Phase 1 and re-analyze with new information
   - **If >= 3: STOP and question the architecture (step 5 below)**
   - DO NOT attempt Fix #4 without architectural discussion

5. **If 3+ Fixes Failed: Question Architecture**

   **Pattern indicating architectural problem:**
   - Each fix reveals new shared state, coupling, or drift in a different place
   - Fixes require broad refactoring to implement
   - Each fix creates new symptoms elsewhere

   **STOP and question fundamentals:**
   - Is this pattern fundamentally sound?
   - Are we "sticking with it through sheer inertia"?
   - Should we refactor architecture vs. continue fixing symptoms?

   **Discuss with the user before attempting more fixes**

   This is NOT a failed hypothesis. This is a wrong architecture.

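Step 1 above (a repeatable failing check) can be sketched even without a test framework. `health_check` and the `CONFIG_FIXED` flag are hypothetical placeholders for the real diagnostic command and the real root-cause fix:

```bash
#!/usr/bin/env bash
# Sketch of a repeatable failing check when no test framework exists.
# health_check is a stand-in for your real diagnostic (a curl against a
# health endpoint, a kubectl probe, etc.); CONFIG_FIXED models the fix.
set -euo pipefail

CONFIG_FIXED=${CONFIG_FIXED:-no}

health_check() {
  # Fails until the root-cause fix (here: the CONFIG_FIXED flag) is applied.
  [ "$CONFIG_FIXED" = "yes" ]
}

if health_check; then
  echo "check: PASS"
else
  echo "check: FAIL (expected before the fix -- this proves the reproduction)"
fi
```

Run it once to see red, apply the single fix, run it again to see green; the same script then becomes the verification evidence.
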
## Infrastructure Debugging Patterns

| Symptom | Investigation Path |
|---------|--------------------|
| 5xx errors spike | deployment timing, pod health, resource limits, DB connections |
| High latency | DNS resolution, network hops, DB query time, container CPU throttling |
| Connection refused | security groups, network policies, service mesh, port bindings |
| Intermittent failures | DNS TTL, load balancer health checks, resource saturation, race conditions |
| Container restart loop | OOM kills, liveness probe config, startup dependencies |
| Certificate errors | expiry dates, chain completeness, SANs, renewal automation |

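As one example of turning a table row into evidence, restart-loop triage can start from captured `kubectl get pods` output. The pod names and the restart threshold below are illustrative samples, not real cluster data:

```bash
#!/usr/bin/env bash
# Sketch: flag restart-looping pods from captured `kubectl get pods` output.
# The heredoc is a canned sample; in practice pipe in live output, e.g.
#   kubectl get pods -n production --no-headers
set -euo pipefail

sample=$(cat <<'EOF'
api-6f7b9c-abcde   1/1   Running            14   22m
api-6f7b9c-fghij   0/1   CrashLoopBackOff   31   22m
worker-5d4c-xyz    1/1   Running            0    3d
EOF
)

# Column 4 is the restart count; anything above the threshold is suspect.
looping=$(echo "$sample" | awk '$4 > 5 {print $1 " restarts=" $4}')
echo "$looping"
```

The flagged pods then feed Phase 1: read their events and previous-container logs before proposing any fix.
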
## Red Flags - STOP and Follow Process

If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run checks"
- "Skip the failing check, I will verify manually"
- "It is probably X, let me fix that"
- "I do not fully understand but this might work"
- "Pattern says X but I will adapt it differently"
- "Here are the main problems:" followed by fixes without investigation
- Proposing solutions before tracing data flow
- **"One more fix attempt" (when already tried 2+)**
- **Each fix reveals new problem in different place**

**ALL of these mean: STOP. Return to Phase 1.**

**If 3+ fixes failed:** Question the architecture (see Phase 4, step 5)

## Signals You Are Doing It Wrong

**Watch for these redirections:**
- "Is that not happening?" - You assumed without verifying
- "Will it show us...?" - You should have added evidence gathering
- "Stop guessing" - You are proposing fixes without understanding
- "Think harder about the system" - Question fundamentals, not just symptoms
- "We are stuck?" - Your approach is not working

**When you see these:** STOP. Return to Phase 1.

## Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "Issue is simple, do not need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I will write the check after confirming the fix works" | Unverified fixes do not stick. Prove the failure first. |
| "Multiple fixes at once saves time" | You cannot isolate what worked. It creates new bugs. |
| "Reference is too long, I will adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms is not the same as understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, do not fix again. |

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Skip reproduction and jump to fixes | You never prove what actually changed |
| Bundle multiple changes into one test | You lose causality and create new uncertainty |
| Stop tracing at the first visible failure | The real trigger stays in place |
| Keep patching after repeated failed fixes | The architecture problem remains unchallenged |

## Quick Reference

| Phase | Key Activities | Success Criteria |
|-------|---------------|------------------|
| **1. Root Cause** | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| **2. Pattern** | Find working examples, compare | Identify differences |
| **3. Hypothesis** | Form theory, test minimally | Confirmed or new hypothesis |
| **4. Implementation** | Create failing check, fix, verify | Issue resolved, checks pass |

## When Process Reveals "No Root Cause"

If systematic investigation reveals the issue is truly environmental, timing-dependent, or external:

1. You completed the process
2. Document what you investigated
3. Implement appropriate handling (retry, timeout, circuit breaker, error message)
4. Add monitoring and logging for future investigation

**But:** most "no root cause" cases are incomplete investigation.

## Supporting Techniques

- **`./references/root-cause-tracing.md`** - Trace failures backward through layers to find the original trigger

**Related skills:**
- **`ops-verify`** - Verify the fix worked before claiming success

## Real-World Impact

From debugging sessions:
- Systematic investigation minimizes thrash
- First-time fix rate improves when evidence replaces guessing
- New bugs introduced by speculative patches drop sharply
@@ -0,0 +1,138 @@
<!-- Based on superpowers by Jesse Vincent (MIT License, 2025). Adapted for SDTK-OPS. -->

# Root Cause Tracing

## Overview

Failures often appear deep in an operational stack: a load balancer returns 502, a pod restarts, or a deployment step times out. The instinct is to fix where the error appears, but that only treats the symptom.

**Core principle:** Trace backward through the call chain until you find the original trigger, then fix at the source.

## When to Use

Use this technique when:
- the error appears deep in execution rather than at the entry point
- multiple layers are involved
- it is unclear where invalid data or state originated
- you need to find which system, pipeline step, or configuration change triggered the failure

## The Tracing Process

### 1. Observe the Symptom

```
Error: https://api.example.com returns 502 through the load balancer
```

### 2. Find the Immediate Cause

Ask: what directly causes this symptom?

```
Load balancer target health checks are failing
```

### 3. Ask: What Called This?

Trace one level back at a time:

```
Load balancer target marked unhealthy
-> readiness probe failing on application pods
-> application cannot open a database connection
-> connection string secret missing in deployment environment
```

### 4. Keep Tracing Up

Ask what introduced the bad value or broken state:

- Was the secret renamed?
- Was the manifest updated only in one environment?
- Did a pipeline step skip secret injection?
- Did an operator apply the wrong values file?

### 5. Find the Original Trigger

Example original trigger:

```yaml
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: api-secrets-prod
        key: database_url
```

But the deployment applied to staging uses `api-secrets-staging`, and the values file still points to the prod secret name. The readiness failure is a symptom. The wrong values reference is the source.

## Adding Stack Traces And Instrumentation

When you cannot trace manually, add instrumentation at the boundary you suspect:

```bash
echo "=== ingress ==="
curl -Ik https://api.example.com/health

echo "=== workload ==="
kubectl get pods -n production
kubectl logs deploy/api -n production --tail=100

echo "=== runtime env ==="
kubectl exec deploy/api -n production -- env | grep DATABASE_URL || true

echo "=== dependency reachability ==="
kubectl exec deploy/api -n production -- sh -c 'nc -zv db.internal 5432'
```

Analyze the output in order:
- traffic entry
- workload health
- environment propagation
- downstream dependency reachability

## Real Example: Missing Secret Reference

**Symptom:** public endpoint returns 502

**Trace chain:**
1. ingress reports targets unhealthy
2. readiness probe on the app fails
3. app logs show database connection failure
4. `DATABASE_URL` is unset inside the running pod
5. Helm values file points to the wrong secret name

**Root cause:** wrong configuration at deployment source

**Fix:** correct the values file and redeploy

**Also add defense in depth:**
- validate required secrets before rollout
- fail the pipeline if mandatory env vars are absent
- alert on repeated readiness failures
- log explicit startup errors for missing config

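The first two defense-in-depth items can be sketched as a pre-rollout guard. The variable names below are illustrative; wire the function into the deploy pipeline step with the names your service actually requires:

```bash
#!/usr/bin/env bash
# Sketch: fail fast before rollout if mandatory variables/secrets are absent.
set -euo pipefail

require_env() {
  local missing=0 name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

DATABASE_URL="postgres://db.internal:5432/app"   # sample value for the sketch
if require_env DATABASE_URL API_TOKEN; then
  echo "preflight: OK"
else
  echo "preflight: BLOCKED -- do not roll out"
fi
```

A blocked preflight turns a silent readiness-probe failure into an explicit pipeline error at the source.
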
## Key Principle

```
Found immediate cause
-> Can trace one level up?
-> Keep tracing
-> Is this the source?
-> Fix at source
-> Add validation at each layer
```

**NEVER fix only where the error appears.** Trace back to the original trigger.

## Tracing Tips

- Log before the dangerous operation, not after it fails
- Capture enough context to compare healthy vs broken paths
- Include environment, target, region, namespace, revision, or release identifiers
- Save the exact command output you used so `ops-verify` can confirm the fix later

## Real-World Impact

Root-cause tracing turns "the load balancer is broken" into a precise failure chain. That reduces guesswork, shortens incidents, and prevents the same class of outage from recurring.
@@ -0,0 +1,102 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025) and gstack by Garry Tan (MIT License, 2026). Adapted for SDTK-OPS. -->
---
name: ops-deploy
description: Deployment execution. Use when deploying infrastructure changes or application releases to any environment -- covers deployment strategies, health check gates, traffic management, and rollback procedures.
---

# Ops Deploy

## Overview

Deployment is an operational change, not a hopeful push. Every rollout needs a pre-flight baseline, a traffic strategy, explicit health gates, and a rollback procedure that is ready before the first change lands.

## The Iron Law

```
NO DEPLOYMENT WITHOUT HEALTH CHECK GATES AND ROLLBACK PROCEDURE
```

## When to Use

Use for:
- application releases
- infrastructure rollouts that can affect service health
- database-linked releases
- traffic routing changes
- staged promotions between environments

## Pre-Flight Checks

Before starting deployment, confirm:
- target revision or artifact is fixed and identifiable
- target environment is healthy before the change
- rollback access, previous revision, and credentials are available
- deployment window avoids known peak traffic or freeze periods
- required backup or snapshot is complete if stateful systems are involved

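A minimal sketch of recording the pre-flight baseline, so later health gates have something concrete to compare against. The metric values are stubs; in practice pull them from your metrics system and `kubectl rollout history`:

```bash
#!/usr/bin/env bash
# Sketch: capture the pre-deploy baseline to a file (stubbed values).
set -euo pipefail

baseline_file=$(mktemp)
{
  echo "revision=42"        # current workload revision (stub)
  echo "error_rate=0.4"     # percent 5xx over the last 15m (stub)
  echo "p95_ms=180"         # p95 latency in ms (stub)
  echo "captured_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
} > "$baseline_file"

echo "baseline written to $baseline_file"
cat "$baseline_file"
```

Keeping the baseline as a file means the gate and the post-deploy `ops-verify` step read the same numbers the operator saw before the change.
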
## Deployment Strategies

| Strategy | When To Use | Rollback Method | Time To Roll Back |
|----------|-------------|-----------------|-------------------|
| Rolling | low-risk stateless changes with healthy autoscaling | roll back deployment revision or image tag | minutes |
| Blue-Green | high-confidence traffic cutover with fast reversal | switch traffic back to blue | seconds to minutes |
| Canary | risky changes that need live traffic sampling | stop promotion and route all traffic to stable version | seconds to minutes |

## Deployment Execution Flow

Execute in this order:
1. pre-flight baseline
2. deploy to target slice or environment
3. health check gate
4. smoke test for critical user path
5. stability window of 15-30 minutes
6. declare complete or roll back

Do not compress these stages into one command sequence with no pause for evidence.

## <HARD-GATE>

Do not advance to the next stage unless all are true:
- health endpoint returns 200 or equivalent healthy state
- error rate is not materially above baseline
- p95 latency is not materially above baseline
- no crash loops, failed probes, or repeated restarts are present

If any gate fails, stop promotion and execute rollback.

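The "not materially above baseline" conditions can be made mechanical with a small numeric check. The 1.5x multiplier and the sample numbers below are illustrative, not recommendations:

```bash
#!/usr/bin/env bash
# Sketch: numeric health gate comparing current metrics to the baseline.
set -euo pipefail

gate_ok() {
  # $1 = baseline value, $2 = current value, $3 = allowed multiplier.
  # awk handles the floating-point comparison; exit 0 means within bounds.
  awk -v b="$1" -v c="$2" -v m="$3" 'BEGIN { exit !(c <= b * m) }'
}

baseline_err=0.4; current_err=0.5     # percent 5xx (sample values)
baseline_p95=180; current_p95=450     # ms (sample values)

if gate_ok "$baseline_err" "$current_err" 1.5 && gate_ok "$baseline_p95" "$current_p95" 1.5; then
  echo "gate: PASS -- continue rollout"
else
  echo "gate: FAIL -- stop promotion and roll back"
fi
```

With the sample numbers the latency gate fails, so the rollout stops: exactly the pause-for-evidence the execution flow requires.
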
## Database Migration Safety

For releases that touch data shape:
- ship backward-compatible schema first
- deploy code that can run against both old and new shape
- migrate data in a controlled step
- remove old schema only after rollback is no longer needed

Never combine irreversible schema change with first traffic cutover unless a specific rollback path exists and has been reviewed.

## Rollback Procedures

| Strategy | Rollback Trigger | Rollback Action | Verification |
|----------|------------------|-----------------|--------------|
| Rolling | error spike or unhealthy pods | roll back workload revision | old revision healthy and stable |
| Blue-Green | failed smoke test after cutover | repoint traffic to blue | blue health and traffic recovery confirmed |
| Canary | canary metrics outside threshold | stop promotion and restore stable routing | stable slice absorbs full traffic cleanly |

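One way to keep the rollback action unambiguous during an incident is to encode the table as a lookup, so the on-call operator runs one command instead of deciding under pressure. The deployment name, namespace, and traffic commands in the strings are illustrative placeholders:

```bash
#!/usr/bin/env bash
# Sketch: map deployment strategy to its rollback action (illustrative commands).
set -euo pipefail

rollback_action() {
  case "$1" in
    rolling)    echo "kubectl rollout undo deploy/api -n production" ;;
    blue-green) echo "repoint traffic to the blue stack" ;;
    canary)     echo "stop promotion; route 100% of traffic to stable" ;;
    *)          echo "unknown strategy: $1" >&2; return 1 ;;
  esac
}

rollback_action rolling
```

Whatever action runs, the verification column still applies: rollback is not done until the restored version is confirmed healthy.
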
## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| No health checks between stages | Bad release reaches more users before anyone notices |
| Skip staging because production is "simple" | Hidden config drift appears at the worst time |
| Deploy during peak traffic | Blast radius increases and rollback gets noisier |
| Combine application deploy and risky schema change | Root cause and rollback path become unclear |
| Declare success immediately after deploy command returns | Runtime failure often appears during warmup and live traffic |

## Completion Discipline

A deployment is complete only after:
- the stability window passes
- rollback remains possible or has been intentionally closed
- the exact evidence is recorded
- `ops-verify` confirms the post-deploy state
@@ -0,0 +1,102 @@
---
name: ops-discover
description: Discover SDTK-OPS skills. Use when starting a conversation or unsure which skill to use -- lists all operations skills with trigger conditions and priority order.
---

# Ops Discover

## Overview

Use this index skill when the request is operational but the correct SDTK-OPS entry point is unclear. Start with the highest-priority foundational skill that matches the situation, then add domain-specific skills only as needed.

## When to Use

Use for:
- the first turn of an operations task
- unclear routing between planning, debugging, deployment, monitoring, or security
- deciding which skill chain should run first

## Priority Order

Choose in this order:
1. `ops-verify`
2. `ops-debug`
3. `ops-plan` or `ops-infra-plan`
4. the domain-specific skill
5. `ops-parallel`

Use `ops-parallel` only when the work can safely split after the primary skill choice is clear.

## Core Process

| Skill | Trigger Condition |
|-------|-------------------|
| `ops-verify` | you need evidence before claiming an operational task is complete |
| `ops-debug` | a deployment, service, network, DNS, or runtime issue needs root-cause analysis |
| `ops-plan` | the team needs a step-by-step operational change plan |
| `ops-parallel` | two or more independent operations tasks can proceed without shared state |

## Deployment

| Skill | Trigger Condition |
|-------|-------------------|
| `ops-infra-plan` | you are designing infrastructure, networking, IAM, or resource topology before provisioning |
| `ops-deploy` | you are executing a rollout, promotion, cutover, or rollback |
| `ops-container` | you are building images, Dockerfiles, manifests, or container runtime patterns |
| `ops-ci-cd` | you are setting up or modifying pipelines, artifact flow, or deployment gates |

## Operations

| Skill | Trigger Condition |
|-------|-------------------|
| `ops-monitor` | you need SLOs, alerts, dashboards, logs, traces, or observability coverage |
| `ops-incident` | a production incident is active or incident procedures need to be established |
| `ops-backup` | backup, restore, disaster recovery, RTO, or RPO design is in scope |
| `ops-cost` | you are reviewing spend, right-sizing, or budget controls |

## Security And Compliance

| Skill | Trigger Condition |
|-------|-------------------|
| `ops-security-infra` | infrastructure hardening, secrets, network policy, or security scanning is the main concern |
| `ops-compliance` | audit readiness, evidence collection, retention, or policy enforcement is the main concern |

## Discovery

| Skill | Trigger Condition |
|-------|-------------------|
| `ops-discover` | you need help choosing among SDTK-OPS skills or sequencing them |

## Common Workflow Chains

| Situation | Suggested Chain |
|-----------|-----------------|
| New service rollout | `ops-infra-plan` -> `ops-container` -> `ops-ci-cd` -> `ops-deploy` -> `ops-monitor` -> `ops-verify` |
| Production incident | `ops-incident` -> `ops-debug` -> `ops-deploy` -> `ops-verify` |
| Backup hardening | `ops-backup` -> `ops-infra-plan` -> `ops-monitor` -> `ops-verify` |
| Cost review | `ops-cost` -> `ops-infra-plan` -> `ops-deploy` -> `ops-verify` |
| Security hardening | `ops-security-infra` -> `ops-infra-plan` -> `ops-ci-cd` -> `ops-verify` |

## Red Flags

| Shortcut Or Smell | Correct Skill |
|-------------------|---------------|
| "Looks healthy enough" | `ops-verify` |
| "Try one more quick fix" | `ops-debug` |
| "Provision first and document later" | `ops-infra-plan` |
| "Just push it to production" | `ops-deploy` |
| "We can add the pipeline checks later" | `ops-ci-cd` |
| "No time for alerts right now" | `ops-monitor` |
| "We will write the post-mortem only if leadership asks" | `ops-incident` |
| "Backups exist, so restore testing can wait" | `ops-backup` |
| "We do not know who owns this spend" | `ops-cost` |
| "Secrets in env files are good enough" | `ops-security-infra` |
| "We will gather audit evidence manually at the end of the year" | `ops-compliance` |

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Start with a domain skill before choosing the primary control skill | Planning, debugging, or verification gaps get missed |
| Use `ops-parallel` before clarifying the main workstream | Parallelism amplifies confusion instead of speeding delivery |
| Treat `ops-discover` as the workflow itself | It is the routing layer, not the execution skill |