sdtk-ops-kit 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +146 -0
- package/assets/manifest/toolkit-bundle.manifest.json +187 -0
- package/assets/manifest/toolkit-bundle.sha256.txt +36 -0
- package/assets/toolkit/toolkit/AGENTS.md +65 -0
- package/assets/toolkit/toolkit/SDTKOPS_TOOLKIT.md +166 -0
- package/assets/toolkit/toolkit/install.ps1 +138 -0
- package/assets/toolkit/toolkit/scripts/install-claude-skills.ps1 +81 -0
- package/assets/toolkit/toolkit/scripts/install-codex-skills.ps1 +127 -0
- package/assets/toolkit/toolkit/scripts/uninstall-claude-skills.ps1 +65 -0
- package/assets/toolkit/toolkit/scripts/uninstall-codex-skills.ps1 +53 -0
- package/assets/toolkit/toolkit/sdtk-spec.config.json +6 -0
- package/assets/toolkit/toolkit/sdtk-spec.config.profiles.example.json +12 -0
- package/assets/toolkit/toolkit/skills/ops-backup/SKILL.md +93 -0
- package/assets/toolkit/toolkit/skills/ops-backup/references/backup-script-patterns.md +108 -0
- package/assets/toolkit/toolkit/skills/ops-ci-cd/SKILL.md +88 -0
- package/assets/toolkit/toolkit/skills/ops-ci-cd/references/pipeline-examples.md +113 -0
- package/assets/toolkit/toolkit/skills/ops-compliance/SKILL.md +105 -0
- package/assets/toolkit/toolkit/skills/ops-container/SKILL.md +95 -0
- package/assets/toolkit/toolkit/skills/ops-container/references/k8s-manifest-patterns.md +116 -0
- package/assets/toolkit/toolkit/skills/ops-cost/SKILL.md +88 -0
- package/assets/toolkit/toolkit/skills/ops-debug/SKILL.md +311 -0
- package/assets/toolkit/toolkit/skills/ops-debug/references/root-cause-tracing.md +138 -0
- package/assets/toolkit/toolkit/skills/ops-deploy/SKILL.md +102 -0
- package/assets/toolkit/toolkit/skills/ops-discover/SKILL.md +102 -0
- package/assets/toolkit/toolkit/skills/ops-incident/SKILL.md +113 -0
- package/assets/toolkit/toolkit/skills/ops-incident/references/communication-templates.md +34 -0
- package/assets/toolkit/toolkit/skills/ops-incident/references/postmortem-template.md +69 -0
- package/assets/toolkit/toolkit/skills/ops-incident/references/runbook-template.md +69 -0
- package/assets/toolkit/toolkit/skills/ops-infra-plan/SKILL.md +123 -0
- package/assets/toolkit/toolkit/skills/ops-infra-plan/references/iac-patterns.md +141 -0
- package/assets/toolkit/toolkit/skills/ops-monitor/SKILL.md +110 -0
- package/assets/toolkit/toolkit/skills/ops-monitor/references/alert-rules.md +80 -0
- package/assets/toolkit/toolkit/skills/ops-monitor/references/slo-templates.md +83 -0
- package/assets/toolkit/toolkit/skills/ops-parallel/SKILL.md +177 -0
- package/assets/toolkit/toolkit/skills/ops-plan/SKILL.md +169 -0
- package/assets/toolkit/toolkit/skills/ops-security-infra/SKILL.md +126 -0
- package/assets/toolkit/toolkit/skills/ops-security-infra/references/cicd-security-pipeline.md +55 -0
- package/assets/toolkit/toolkit/skills/ops-security-infra/references/security-headers.md +24 -0
- package/assets/toolkit/toolkit/skills/ops-verify/SKILL.md +180 -0
- package/bin/sdtk-ops.js +14 -0
- package/package.json +46 -0
- package/src/commands/generate.js +12 -0
- package/src/commands/help.js +53 -0
- package/src/commands/init.js +86 -0
- package/src/commands/runtime.js +201 -0
- package/src/index.js +65 -0
- package/src/lib/args.js +107 -0
- package/src/lib/errors.js +41 -0
- package/src/lib/powershell.js +65 -0
- package/src/lib/scope.js +58 -0
- package/src/lib/toolkit-payload.js +123 -0
@@ -0,0 +1,311 @@
<!-- Based on superpowers by Jesse Vincent (MIT License, 2025). Adapted for SDTK-OPS. -->
---
name: ops-debug
description: Systematic infrastructure debugging. Use when encountering deployment failure, service outage, network issue, or unexpected infrastructure behavior -- binary search to root cause, evidence-first, no guessing fixes.
---

# Ops Debug

## Overview

Random fixes waste time and create new bugs. Quick patches mask underlying issues.

**Core principle:** ALWAYS find root cause before attempting fixes. Symptom fixes are failure.

**Violating the letter of this process is violating the spirit of debugging.**

## The Iron Law

```
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
```

If you have not completed Phase 1, you cannot propose fixes.

## When to Use

Use for ANY infrastructure or operations issue:
- Deployment failures
- Service outages
- Unexpected behavior
- Performance problems
- Build or release failures
- Integration issues
- Network or DNS problems

**Use this ESPECIALLY when:**
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You have already tried multiple fixes
- Previous fix did not work
- You do not fully understand the issue

**Do not skip when:**
- Issue seems simple (simple bugs have root causes too)
- You are in a hurry (rushing guarantees rework)
- Someone wants it fixed NOW (systematic is faster than thrashing)

## The Four Phases

You MUST complete each phase before proceeding to the next.

### Phase 1: Root Cause Investigation

**BEFORE attempting ANY fix:**

1. **Read Error Messages Carefully**
   - Do not skip past errors or warnings
   - They often contain the exact clue you need
   - Read stack traces completely
   - Note line numbers, file paths, error codes, probe failures, and status codes

2. **Reproduce Consistently**
   - Can you trigger it reliably?
   - What are the exact steps?
   - Does it happen every time?
   - If not reproducible -> gather more data, do not guess

3. **Check Recent Changes**
   - What changed that could cause this?
   - Git diff, recent commits, manifest changes
   - New dependencies, config changes, secret rotations
   - Environmental differences between healthy and broken systems

4. **Gather Evidence in Multi-Component Systems**

   **WHEN system has multiple components (load balancer -> service -> database, CI -> artifact -> deploy target):**

   **BEFORE proposing fixes, add diagnostic instrumentation:**
   ```
   For EACH component boundary:
   - Log what data enters component
   - Log what data exits component
   - Verify environment and config propagation
   - Check state at each layer

   Run once to gather evidence showing WHERE it breaks
   THEN analyze evidence to identify failing component
   THEN investigate that specific component
   ```

   **Infrastructure example (LB -> App -> DB -> DNS -> Network):**
   ```bash
   # Layer 1: Load balancer or ingress
   curl -Ik https://api.example.com/health

   # Layer 2: Application workload
   kubectl get pods -n production
   kubectl logs deploy/api -n production --tail=100

   # Layer 3: Database connectivity
   kubectl exec deploy/api -n production -- sh -c 'nc -zv db.internal 5432'

   # Layer 4: DNS
   dig +short api.example.com

   # Layer 5: Network controls
   kubectl get networkpolicy -n production
   ```

   **This reveals:** Whether the break happens at traffic entry, application runtime, database reachability, DNS resolution, or network controls.

5. **Trace Data Flow**

   **WHEN error is deep in the stack or spread across layers:**

   See `./references/root-cause-tracing.md` in this directory for the complete backward tracing technique.

   **Quick version:**
   - Where does the bad value or bad state originate?
   - What called this with bad input?
   - Keep tracing up until you find the source
   - Fix at source, not at symptom
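
The "environmental differences" check in step 3 can be sketched as a diff of environment dumps taken from a healthy and a broken instance. This is a minimal sketch, not part of the toolkit: `env_diff` and the file names are illustrative, and the dumps would come from something like `kubectl exec <pod> -- env` (pod names are placeholders).

```bash
#!/usr/bin/env bash
# Compare two "KEY=VALUE" environment dumps and print only the lines
# that differ between the healthy and the broken system.
env_diff() {
  healthy="$1"; broken="$2"
  # Sort both dumps so ordering differences do not show up as changes,
  # then keep only the added/removed lines from the diff output.
  diff <(sort "$healthy") <(sort "$broken") | grep '^[<>]' || echo "no differences"
}
```

Every line the function prints is a concrete difference worth explaining before proposing a fix.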

### Phase 2: Pattern Analysis

**Find the pattern before fixing:**

1. **Find Working Examples**
   - Locate similar working infrastructure or runbooks in the same codebase
   - What works that is similar to what is broken?

2. **Compare Against References**
   - If implementing a pattern, read the reference implementation COMPLETELY
   - Do not skim
   - Understand the pattern fully before applying it

3. **Identify Differences**
   - What is different between working and broken?
   - List every difference, however small
   - Do not assume "that cannot matter"

4. **Understand Dependencies**
   - What other components does this need?
   - What settings, config, secrets, or environment?
   - What assumptions does it make?

### Phase 3: Hypothesis and Testing

**Scientific method:**

1. **Form Single Hypothesis**
   - State clearly: "I think X is the root cause because Y"
   - Write it down
   - Be specific, not vague

2. **Test Minimally**
   - Make the SMALLEST possible change to test the hypothesis
   - One variable at a time
   - Do not fix multiple things at once

3. **Verify Before Continuing**
   - Did it work? Yes -> Phase 4
   - Did it fail? Form a NEW hypothesis
   - DO NOT add more fixes on top

4. **When You Do Not Know**
   - Say "I do not understand X"
   - Do not pretend to know
   - Ask for help
   - Research more

### Phase 4: Implementation

**Fix the root cause, not the symptom:**

1. **Create a Repeatable Failing Check**
   - Simplest possible reproduction
   - Automated test if possible
   - Diagnostic command or small script if no framework exists
   - MUST have before fixing

2. **Implement Single Fix**
   - Address the root cause identified
   - ONE change at a time
   - No "while I'm here" improvements
   - No bundled refactoring

3. **Verify Fix**
   - Check passes now?
   - No other health checks broken?
   - Issue actually resolved?

4. **If Fix Does Not Work**
   - STOP
   - Count: How many fixes have you tried?
   - If < 3: Return to Phase 1 and re-analyze with new information
   - **If >= 3: STOP and question the architecture (step 5 below)**
   - DO NOT attempt Fix #4 without architectural discussion

5. **If 3+ Fixes Failed: Question Architecture**

   **Pattern indicating architectural problem:**
   - Each fix reveals new shared state, coupling, or drift in a different place
   - Fixes require broad refactoring to implement
   - Each fix creates new symptoms elsewhere

   **STOP and question fundamentals:**
   - Is this pattern fundamentally sound?
   - Are we "sticking with it through sheer inertia"?
   - Should we refactor architecture vs. continue fixing symptoms?

   **Discuss with the user before attempting more fixes**

   This is NOT a failed hypothesis. This is a wrong architecture.
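
The "repeatable failing check" in Phase 4, step 1 can be as small as a runner that executes one diagnostic command several times and reports whether the failure is consistent. A hedged sketch; `repro` and the default run count are illustrative, not part of the toolkit:

```bash
#!/usr/bin/env bash
# Run a diagnostic command N times (default 3) and classify the failure
# as reproducible, intermittent, or not reproducible.
repro() {
  runs="${RUNS:-3}"; fails=0
  for _ in $(seq "$runs"); do
    "$@" >/dev/null 2>&1 || fails=$((fails + 1))
  done
  if   [ "$fails" -eq "$runs" ]; then echo "reproducible ($fails/$runs failed)"
  elif [ "$fails" -gt 0 ];       then echo "intermittent ($fails/$runs failed)"
  else                                echo "not reproducible (0/$runs failed)"
  fi
}

# Example (URL is a placeholder):
#   repro curl -fsS https://api.example.com/health
```

"Intermittent" is itself evidence: it points toward DNS TTLs, health-check flapping, saturation, or races rather than a deterministic code path.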

## Infrastructure Debugging Patterns

| Symptom | Investigation Path |
|---------|--------------------|
| 5xx errors spike | deployment timing, pod health, resource limits, DB connections |
| High latency | DNS resolution, network hops, DB query time, container CPU throttling |
| Connection refused | security groups, network policies, service mesh, port bindings |
| Intermittent failures | DNS TTL, load balancer health checks, resource saturation, race conditions |
| Container restart loop | OOM kills, liveness probe config, startup dependencies |
| Certificate errors | expiry dates, chain completeness, SANs, renewal automation |
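
For the restart-loop row, the container status already names the trigger. A sketch of mapping the reported termination reason to a first investigation step; `triage_restart` and the mapping text are illustrative, and the reason string would come from something like `kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'` (pod name is a placeholder):

```bash
#!/usr/bin/env bash
# Map a container's last termination reason to the first thing to check.
triage_restart() {
  case "$1" in
    OOMKilled) echo "check memory limits and look for a leak" ;;
    Error)     echo "read container logs for the crash before the restart" ;;
    Completed) echo "check why a long-running process exited cleanly" ;;
    *)         echo "inspect probe config and startup dependencies" ;;
  esac
}
```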

## Red Flags - STOP and Follow Process

If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run checks"
- "Skip the failing check, I will verify manually"
- "It is probably X, let me fix that"
- "I do not fully understand but this might work"
- "Pattern says X but I will adapt it differently"
- "Here are the main problems:" followed by fixes without investigation
- Proposing solutions before tracing data flow
- **"One more fix attempt" (when already tried 2+)**
- **Each fix reveals new problem in different place**

**ALL of these mean: STOP. Return to Phase 1.**

**If 3+ fixes failed:** Question the architecture (see Phase 4, step 5)

## Signals You Are Doing It Wrong

**Watch for these redirections:**
- "Is that not happening?" - You assumed without verifying
- "Will it show us...?" - You should have added evidence gathering
- "Stop guessing" - You are proposing fixes without understanding
- "Think harder about the system" - Question fundamentals, not just symptoms
- "We are stuck?" - Your approach is not working

**When you see these:** STOP. Return to Phase 1.

## Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "Issue is simple, do not need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I will write the check after confirming the fix works" | Unverified fixes do not stick. Prove the failure first. |
| "Multiple fixes at once saves time" | You cannot isolate what worked. It creates new bugs. |
| "Reference is too long, I will adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms is not the same as understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, do not fix again. |

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Skip reproduction and jump to fixes | You never prove what actually changed |
| Bundle multiple changes into one test | You lose causality and create new uncertainty |
| Stop tracing at the first visible failure | The real trigger stays in place |
| Keep patching after repeated failed fixes | The architecture problem remains unchallenged |

## Quick Reference

| Phase | Key Activities | Success Criteria |
|-------|---------------|------------------|
| **1. Root Cause** | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| **2. Pattern** | Find working examples, compare | Identify differences |
| **3. Hypothesis** | Form theory, test minimally | Confirmed or new hypothesis |
| **4. Implementation** | Create failing check, fix, verify | Issue resolved, checks pass |

## When Process Reveals "No Root Cause"

If systematic investigation reveals the issue is truly environmental, timing-dependent, or external:

1. You completed the process
2. Document what you investigated
3. Implement appropriate handling (retry, timeout, circuit breaker, error message)
4. Add monitoring and logging for future investigation

**But:** most "no root cause" cases are incomplete investigation.

## Supporting Techniques

- **`./references/root-cause-tracing.md`** - Trace failures backward through layers to find the original trigger

**Related skills:**
- **`ops-verify`** - Verify the fix worked before claiming success

## Real-World Impact

From debugging sessions:
- Systematic investigation minimizes thrash
- First-time fix rate improves when evidence replaces guessing
- New bugs introduced by speculative patches drop sharply
@@ -0,0 +1,138 @@
<!-- Based on superpowers by Jesse Vincent (MIT License, 2025). Adapted for SDTK-OPS. -->

# Root Cause Tracing

## Overview

Failures often appear deep in an operational stack: a load balancer returns 502, a pod restarts, or a deployment step times out. The instinct is to fix where the error appears, but that only treats the symptom.

**Core principle:** Trace backward through the call chain until you find the original trigger, then fix at the source.

## When to Use

Use this technique when:
- the error appears deep in execution rather than at the entry point
- multiple layers are involved
- it is unclear where invalid data or state originated
- you need to find which system, pipeline step, or configuration change triggered the failure

## The Tracing Process

### 1. Observe the Symptom

```
Error: https://api.example.com returns 502 through the load balancer
```

### 2. Find the Immediate Cause

Ask: what directly causes this symptom?

```
Load balancer target health checks are failing
```

### 3. Ask: What Called This?

Trace one level back at a time:

```
Load balancer target marked unhealthy
-> readiness probe failing on application pods
-> application cannot open a database connection
-> connection string secret missing in deployment environment
```

### 4. Keep Tracing Up

Ask what introduced the bad value or broken state:

- Was the secret renamed?
- Was the manifest updated only in one environment?
- Did a pipeline step skip secret injection?
- Did an operator apply the wrong values file?

### 5. Find the Original Trigger

Example original trigger:

```yaml
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: api-secrets-prod
        key: database_url
```

But the deployment applied to staging uses `api-secrets-staging`, and the values file still points to the prod secret name. The readiness failure is a symptom. The wrong values reference is the source.

## Adding Stack Traces And Instrumentation

When you cannot trace manually, add instrumentation at the boundary you suspect:

```bash
echo "=== ingress ==="
curl -Ik https://api.example.com/health

echo "=== workload ==="
kubectl get pods -n production
kubectl logs deploy/api -n production --tail=100

echo "=== runtime env ==="
kubectl exec deploy/api -n production -- env | grep DATABASE_URL || true

echo "=== dependency reachability ==="
kubectl exec deploy/api -n production -- sh -c 'nc -zv db.internal 5432'
```

Analyze the output in order:
- traffic entry
- workload health
- environment propagation
- downstream dependency reachability

## Real Example: Missing Secret Reference

**Symptom:** public endpoint returns 502

**Trace chain:**
1. ingress reports targets unhealthy
2. readiness probe on the app fails
3. app logs show database connection failure
4. `DATABASE_URL` is unset inside the running pod
5. Helm values file points to the wrong secret name

**Root cause:** wrong configuration at deployment source

**Fix:** correct the values file and redeploy

**Also add defense in depth:**
- validate required secrets before rollout
- fail the pipeline if mandatory env vars are absent
- alert on repeated readiness failures
- log explicit startup errors for missing config
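
The "fail the pipeline if mandatory env vars are absent" item can be sketched as a pre-rollout guard. The function name and variable names are illustrative, not part of the toolkit:

```bash
#!/usr/bin/env bash
# Fail fast when mandatory variables are missing from the environment.
require_env() {
  missing=0
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required env var: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example pipeline usage (variable names are placeholders):
#   require_env DATABASE_URL REDIS_URL || exit 1
```

Run before rollout, this turns the silent missing-secret failure in the example above into an explicit pipeline error.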

## Key Principle

```
Found immediate cause
-> Can trace one level up?
-> Keep tracing
-> Is this the source?
-> Fix at source
-> Add validation at each layer
```

**NEVER fix only where the error appears.** Trace back to the original trigger.

## Tracing Tips

- Log before the dangerous operation, not after it fails
- Capture enough context to compare healthy vs broken paths
- Include environment, target, region, namespace, revision, or release identifiers
- Save the exact command output you used so `ops-verify` can confirm the fix later
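
The last tip can be sketched as a wrapper that keeps a timestamped copy of each diagnostic command's output. `capture` and the evidence file naming are illustrative assumptions, not toolkit conventions:

```bash
#!/usr/bin/env bash
# Run a command, stream its output, and keep a timestamped copy on disk.
capture() {
  file="evidence-$(date +%Y%m%d-%H%M%S).log"
  # Record when and what was run, then the command's combined output.
  { echo "# $(date -u) :: $*"; "$@"; } 2>&1 | tee "$file"
  echo "saved: $file" >&2
}

# Example (namespace is a placeholder):
#   capture kubectl get pods -n production
```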

## Real-World Impact

Root-cause tracing turns "the load balancer is broken" into a precise failure chain. That reduces guesswork, shortens incidents, and prevents the same class of outage from recurring.
@@ -0,0 +1,102 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025) and gstack by Garry Tan (MIT License, 2026). Adapted for SDTK-OPS. -->
---
name: ops-deploy
description: Deployment execution. Use when deploying infrastructure changes or application releases to any environment -- covers deployment strategies, health check gates, traffic management, and rollback procedures.
---

# Ops Deploy

## Overview

Deployment is an operational change, not a hopeful push. Every rollout needs a pre-flight baseline, a traffic strategy, explicit health gates, and a rollback procedure that is ready before the first change lands.

## The Iron Law

```
NO DEPLOYMENT WITHOUT HEALTH CHECK GATES AND ROLLBACK PROCEDURE
```

## When to Use

Use for:
- application releases
- infrastructure rollouts that can affect service health
- database-linked releases
- traffic routing changes
- staged promotions between environments

## Pre-Flight Checks

Before starting deployment, confirm:
- target revision or artifact is fixed and identifiable
- target environment is healthy before the change
- rollback access, previous revision, and credentials are available
- deployment window avoids known peak traffic or freeze periods
- required backup or snapshot is complete if stateful systems are involved
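
The "fixed and identifiable" check can be sketched as a guard that refuses mutable image references. A sketch under one assumption: a container image is considered identifiable only when pinned by digest; the function name is illustrative:

```bash
#!/usr/bin/env bash
# Refuse to deploy image references that are not pinned to an immutable digest.
pinned() {
  case "$1" in
    *@sha256:*) return 0 ;;  # digest-pinned: the artifact cannot change under us
    *) echo "refusing mutable image reference: $1" >&2; return 1 ;;
  esac
}

# Example (registry path is a placeholder):
#   pinned "registry.example.com/api@sha256:0123abc..." || exit 1
```

A tag like `:latest` can point at a different artifact between pre-flight and rollback, which defeats both the baseline and the rollback procedure.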

## Deployment Strategies

| Strategy | When To Use | Rollback Method | Time To Roll Back |
|----------|-------------|-----------------|-------------------|
| Rolling | low-risk stateless changes with healthy autoscaling | roll back deployment revision or image tag | minutes |
| Blue-Green | high-confidence traffic cutover with fast reversal | switch traffic back to blue | seconds to minutes |
| Canary | risky changes that need live traffic sampling | stop promotion and route all traffic to stable version | seconds to minutes |

## Deployment Execution Flow

Execute in this order:
1. pre-flight baseline
2. deploy to target slice or environment
3. health check gate
4. smoke test for critical user path
5. stability window of 15-30 minutes
6. declare complete or roll back

Do not compress these stages into one command sequence with no pause for evidence.
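
The health check gate in step 3 can be sketched as an explicit decision function. Everything here is illustrative: the function name, the 1-percentage-point error budget, and the commands in the comment (deployment name and URL are placeholders). The float comparison goes through awk because shell arithmetic is integer-only:

```bash
#!/usr/bin/env bash
# Decide whether a rollout may advance past the health check gate.
# Inputs: HTTP status of the health endpoint, current and baseline error rates (%).
gate_ok() {
  status="$1"; err_now="$2"; err_base="$3"
  [ "$status" = "200" ] || { echo "gate failed: health returned $status"; return 1; }
  # Allow at most 1 percentage point above baseline (threshold is illustrative).
  awk -v now="$err_now" -v base="$err_base" 'BEGIN { exit !(now <= base + 1.0) }' \
    || { echo "gate failed: error rate $err_now% vs baseline $err_base%"; return 1; }
  echo "gate passed"
}

# Sketch of the surrounding flow:
#   kubectl rollout status deploy/api -n production --timeout=120s
#   status=$(curl -s -o /dev/null -w '%{http_code}' https://api.example.com/health)
#   gate_ok "$status" "$err_now" "$err_base" || kubectl rollout undo deploy/api -n production
```

Making the gate a named function forces the pause for evidence that the note above demands: the next stage runs only on its explicit "gate passed".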

## <HARD-GATE>

Do not advance to the next stage unless all are true:
- health endpoint returns 200 or equivalent healthy state
- error rate is not materially above baseline
- p95 latency is not materially above baseline
- no crash loops, failed probes, or repeated restarts are present

If any gate fails, stop promotion and execute rollback.

## Database Migration Safety

For releases that touch data shape:
- ship backward-compatible schema first
- deploy code that can run against both old and new shape
- migrate data in a controlled step
- remove old schema only after rollback is no longer needed

Never combine irreversible schema change with first traffic cutover unless a specific rollback path exists and has been reviewed.

## Rollback Procedures

| Strategy | Rollback Trigger | Rollback Action | Verification |
|----------|------------------|-----------------|--------------|
| Rolling | error spike or unhealthy pods | roll back workload revision | old revision healthy and stable |
| Blue-Green | failed smoke test after cutover | repoint traffic to blue | blue health and traffic recovery confirmed |
| Canary | canary metrics outside threshold | stop promotion and restore stable routing | stable slice absorbs full traffic cleanly |

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| No health checks between stages | Bad release reaches more users before anyone notices |
| Skip staging because production is "simple" | Hidden config drift appears at the worst time |
| Deploy during peak traffic | Blast radius increases and rollback gets noisier |
| Combine application deploy and risky schema change | Root cause and rollback path become unclear |
| Declare success immediately after deploy command returns | Runtime failure often appears during warmup and live traffic |

## Completion Discipline

A deployment is complete only after:
- the stability window passes
- rollback remains possible or has been intentionally closed
- the exact evidence is recorded
- `ops-verify` confirms the post-deploy state
@@ -0,0 +1,102 @@
---
name: ops-discover
description: Discover SDTK-OPS skills. Use when starting a conversation or unsure which skill to use -- lists all operations skills with trigger conditions and priority order.
---

# Ops Discover

## Overview

Use this index skill when the request is operational but the correct SDTK-OPS entry point is unclear. Start with the highest-priority foundational skill that matches the situation, then add domain-specific skills only as needed.

## When to Use

Use for:
- the first turn of an operations task
- unclear routing between planning, debugging, deployment, monitoring, or security
- deciding which skill chain should run first

## Priority Order

Choose in this order:
1. `ops-verify`
2. `ops-debug`
3. `ops-plan` or `ops-infra-plan`
4. the domain-specific skill
5. `ops-parallel`

Use `ops-parallel` only when the work can safely split after the primary skill choice is clear.
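
The priority order above can be read as a first-match decision procedure. A sketch only: the `route` function and its yes/no flags are invented for illustration and are not part of the toolkit:

```bash
#!/usr/bin/env bash
# First-match routing that mirrors the priority order above.
# Arguments are yes/no answers, checked in priority order.
route() {
  claiming_done="$1"   # about to claim an operational task is complete?
  broken_now="$2"      # is something failing right now?
  unplanned="$3"       # is the change still unplanned?
  if   [ "$claiming_done" = yes ]; then echo "ops-verify"
  elif [ "$broken_now" = yes ];    then echo "ops-debug"
  elif [ "$unplanned" = yes ];     then echo "ops-plan"
  else echo "domain-specific skill"
  fi
}
```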

## Core Process

| Skill | Trigger Condition |
|-------|-------------------|
| `ops-verify` | you need evidence before claiming an operational task is complete |
| `ops-debug` | a deployment, service, network, DNS, or runtime issue needs root-cause analysis |
| `ops-plan` | the team needs a step-by-step operational change plan |
| `ops-parallel` | two or more independent operations tasks can proceed without shared state |

## Deployment

| Skill | Trigger Condition |
|-------|-------------------|
| `ops-infra-plan` | you are designing infrastructure, networking, IAM, or resource topology before provisioning |
| `ops-deploy` | you are executing a rollout, promotion, cutover, or rollback |
| `ops-container` | you are building images, Dockerfiles, manifests, or container runtime patterns |
| `ops-ci-cd` | you are setting up or modifying pipelines, artifact flow, or deployment gates |

## Operations

| Skill | Trigger Condition |
|-------|-------------------|
| `ops-monitor` | you need SLOs, alerts, dashboards, logs, traces, or observability coverage |
| `ops-incident` | a production incident is active or incident procedures need to be established |
| `ops-backup` | backup, restore, disaster recovery, RTO, or RPO design is in scope |
| `ops-cost` | you are reviewing spend, right-sizing, or budget controls |

## Security And Compliance

| Skill | Trigger Condition |
|-------|-------------------|
| `ops-security-infra` | infrastructure hardening, secrets, network policy, or security scanning is the main concern |
| `ops-compliance` | audit readiness, evidence collection, retention, or policy enforcement is the main concern |

## Discovery

| Skill | Trigger Condition |
|-------|-------------------|
| `ops-discover` | you need help choosing among SDTK-OPS skills or sequencing them |

## Common Workflow Chains

| Situation | Suggested Chain |
|-----------|-----------------|
| New service rollout | `ops-infra-plan` -> `ops-container` -> `ops-ci-cd` -> `ops-deploy` -> `ops-monitor` -> `ops-verify` |
| Production incident | `ops-incident` -> `ops-debug` -> `ops-deploy` -> `ops-verify` |
| Backup hardening | `ops-backup` -> `ops-infra-plan` -> `ops-monitor` -> `ops-verify` |
| Cost review | `ops-cost` -> `ops-infra-plan` -> `ops-deploy` -> `ops-verify` |
| Security hardening | `ops-security-infra` -> `ops-infra-plan` -> `ops-ci-cd` -> `ops-verify` |

## Red Flags

| Shortcut Or Smell | Correct Skill |
|-------------------|---------------|
| "Looks healthy enough" | `ops-verify` |
| "Try one more quick fix" | `ops-debug` |
| "Provision first and document later" | `ops-infra-plan` |
| "Just push it to production" | `ops-deploy` |
| "We can add the pipeline checks later" | `ops-ci-cd` |
| "No time for alerts right now" | `ops-monitor` |
| "We will write the post-mortem only if leadership asks" | `ops-incident` |
| "Backups exist, so restore testing can wait" | `ops-backup` |
| "We do not know who owns this spend" | `ops-cost` |
| "Secrets in env files are good enough" | `ops-security-infra` |
| "We will gather audit evidence manually at the end of the year" | `ops-compliance` |

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Start with a domain skill before choosing the primary control skill | Planning, debugging, or verification gaps get missed |
| Use `ops-parallel` before clarifying the main workstream | Parallelism amplifies confusion instead of speeding delivery |
| Treat `ops-discover` as the workflow itself | It is the routing layer, not the execution skill |
|