sdtk-ops-kit 0.2.0
- package/README.md +146 -0
- package/assets/manifest/toolkit-bundle.manifest.json +187 -0
- package/assets/manifest/toolkit-bundle.sha256.txt +36 -0
- package/assets/toolkit/toolkit/AGENTS.md +65 -0
- package/assets/toolkit/toolkit/SDTKOPS_TOOLKIT.md +166 -0
- package/assets/toolkit/toolkit/install.ps1 +138 -0
- package/assets/toolkit/toolkit/scripts/install-claude-skills.ps1 +81 -0
- package/assets/toolkit/toolkit/scripts/install-codex-skills.ps1 +127 -0
- package/assets/toolkit/toolkit/scripts/uninstall-claude-skills.ps1 +65 -0
- package/assets/toolkit/toolkit/scripts/uninstall-codex-skills.ps1 +53 -0
- package/assets/toolkit/toolkit/sdtk-spec.config.json +6 -0
- package/assets/toolkit/toolkit/sdtk-spec.config.profiles.example.json +12 -0
- package/assets/toolkit/toolkit/skills/ops-backup/SKILL.md +93 -0
- package/assets/toolkit/toolkit/skills/ops-backup/references/backup-script-patterns.md +108 -0
- package/assets/toolkit/toolkit/skills/ops-ci-cd/SKILL.md +88 -0
- package/assets/toolkit/toolkit/skills/ops-ci-cd/references/pipeline-examples.md +113 -0
- package/assets/toolkit/toolkit/skills/ops-compliance/SKILL.md +105 -0
- package/assets/toolkit/toolkit/skills/ops-container/SKILL.md +95 -0
- package/assets/toolkit/toolkit/skills/ops-container/references/k8s-manifest-patterns.md +116 -0
- package/assets/toolkit/toolkit/skills/ops-cost/SKILL.md +88 -0
- package/assets/toolkit/toolkit/skills/ops-debug/SKILL.md +311 -0
- package/assets/toolkit/toolkit/skills/ops-debug/references/root-cause-tracing.md +138 -0
- package/assets/toolkit/toolkit/skills/ops-deploy/SKILL.md +102 -0
- package/assets/toolkit/toolkit/skills/ops-discover/SKILL.md +102 -0
- package/assets/toolkit/toolkit/skills/ops-incident/SKILL.md +113 -0
- package/assets/toolkit/toolkit/skills/ops-incident/references/communication-templates.md +34 -0
- package/assets/toolkit/toolkit/skills/ops-incident/references/postmortem-template.md +69 -0
- package/assets/toolkit/toolkit/skills/ops-incident/references/runbook-template.md +69 -0
- package/assets/toolkit/toolkit/skills/ops-infra-plan/SKILL.md +123 -0
- package/assets/toolkit/toolkit/skills/ops-infra-plan/references/iac-patterns.md +141 -0
- package/assets/toolkit/toolkit/skills/ops-monitor/SKILL.md +110 -0
- package/assets/toolkit/toolkit/skills/ops-monitor/references/alert-rules.md +80 -0
- package/assets/toolkit/toolkit/skills/ops-monitor/references/slo-templates.md +83 -0
- package/assets/toolkit/toolkit/skills/ops-parallel/SKILL.md +177 -0
- package/assets/toolkit/toolkit/skills/ops-plan/SKILL.md +169 -0
- package/assets/toolkit/toolkit/skills/ops-security-infra/SKILL.md +126 -0
- package/assets/toolkit/toolkit/skills/ops-security-infra/references/cicd-security-pipeline.md +55 -0
- package/assets/toolkit/toolkit/skills/ops-security-infra/references/security-headers.md +24 -0
- package/assets/toolkit/toolkit/skills/ops-verify/SKILL.md +180 -0
- package/bin/sdtk-ops.js +14 -0
- package/package.json +46 -0
- package/src/commands/generate.js +12 -0
- package/src/commands/help.js +53 -0
- package/src/commands/init.js +86 -0
- package/src/commands/runtime.js +201 -0
- package/src/index.js +65 -0
- package/src/lib/args.js +107 -0
- package/src/lib/errors.js +41 -0
- package/src/lib/powershell.js +65 -0
- package/src/lib/scope.js +58 -0
- package/src/lib/toolkit-payload.js +123 -0
@@ -0,0 +1,113 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
---
name: ops-incident
description: Incident response and management. Use when a production incident occurs or when establishing incident response procedures -- covers severity classification, response coordination, post-mortem facilitation, and on-call design.
---

# Ops Incident

## Overview

Incident response must turn chaos into a structured workflow: classify severity, assign roles, build a timeline, test one hypothesis path at a time, stabilize the system, and capture systemic follow-up before memory fades.

## The Iron Law

```
NO INCIDENT RESOLVED WITHOUT A TIMELINE, IMPACT ASSESSMENT, AND ACTION ITEMS WITHIN 48 HOURS
```

## Severity Classification Matrix

| Level | Name | Criteria | Response Time | Update Cadence | Escalation |
|-------|------|----------|---------------|----------------|------------|
| SEV1 | Critical | full service outage, data loss risk, security breach | under 5 min | every 15 min | leadership immediately |
| SEV2 | Major | degraded service for more than 25% of users, key feature down | under 15 min | every 30 min | engineering manager within 15 min |
| SEV3 | Moderate | minor feature broken, workaround available | under 1 hour | every 2 hours | team lead next standup |
| SEV4 | Low | cosmetic issue, no user impact, tech debt trigger | next business day | daily | backlog triage |

Escalate severity when:
- impact scope doubles
- no root cause is identified after 30 minutes for SEV1 or 2 hours for SEV2
- a paying customer is blocked (minimum SEV2)
- any data integrity concern appears (immediate SEV1)

## The Four-Phase Process

### 1. Detection And Declaration

- acknowledge the page or signal
- classify severity using the matrix
- declare the incident and open a timeline immediately
- state current impact, suspected scope, and next update time

### 2. Structured Response

Assign roles:
- **Incident Commander**
  - owns timeline, severity, decisions, and cadence
- **Technical Lead**
  - drives diagnosis and remediation with `ops-debug`
- **Scribe**
  - records timestamps, evidence, commands, and decisions
- **Communications Lead**
  - sends stakeholder updates on schedule

Response rules:
- timebox each hypothesis path to 15 minutes before re-evaluating
- fix the bleeding first, then optimize
- keep one source of truth for status and impact

### 3. Resolution And Stabilization

- apply the lowest-risk effective mitigation
- verify recovery through metrics, not visual guesswork
- monitor for 15 to 30 minutes after recovery
- do not close the incident until the service is stable

### 4. Post-Mortem

- schedule the review within 48 hours
- document impact, timeline, root cause, contributing factors, and action items
- convert learning into runbook, alert, test, or design changes

## <HARD-GATE>

Incident review and post-mortem must stay blameless:
- say "the system allowed this failure mode"
- do not frame the review as "which person caused it"
- protect psychological safety so the real causes are visible

## Operational Metrics

| Metric | Target |
|--------|--------|
| MTTD | under 5 minutes |
| MTTR for SEV1 | under 30 minutes |
| Post-mortems within 48 hours | 100% |
| Action item completion | 90% |
| Repeat incidents | 0 for the same unresolved cause |

## Reference Files

Use:
- `./references/runbook-template.md`
- `./references/postmortem-template.md`
- `./references/communication-templates.md`

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Start fixing before declaring severity and roles | Communication and triage fragment immediately |
| Chase multiple hypotheses at once | The team loses evidence and ownership |
| Close the incident when errors drop briefly | Latent instability returns and the timeline becomes unclear |
| Write a blame document instead of a post-mortem | Systemic fixes are replaced by defensiveness |
| Skip action item ownership and due dates | The same incident class repeats |

## Execution Handoff

During incident response:
- use `ops-debug` for diagnosis
- use `ops-monitor` to confirm recovery against live signals
- use `ops-verify` before declaring the system stable
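The severity matrix and escalation triggers in this skill are mechanical enough to encode. A minimal sketch of that logic follows; the function and constant names are illustrative, not part of sdtk-ops-kit:

```python
# Illustrative encoding of the severity matrix and escalation triggers
# described in this skill. Names are hypothetical, not toolkit APIs.
SEVERITY_MATRIX = {
    "SEV1": {"response": "under 5 min", "cadence": "every 15 min"},
    "SEV2": {"response": "under 15 min", "cadence": "every 30 min"},
    "SEV3": {"response": "under 1 hour", "cadence": "every 2 hours"},
    "SEV4": {"response": "next business day", "cadence": "daily"},
}

def escalate(current: str, *, scope_doubled: bool = False,
             paying_customer_blocked: bool = False,
             data_integrity_risk: bool = False) -> str:
    """Apply the escalation triggers to a current severity level."""
    level = int(current[3])          # "SEV3" -> 3
    if data_integrity_risk:          # data integrity concern: immediate SEV1
        return "SEV1"
    if paying_customer_blocked:      # paying customer blocked: minimum SEV2
        level = min(level, 2)
    if scope_doubled:                # impact scope doubled: move up one level
        level = max(level - 1, 1)
    return f"SEV{level}"
```

A scribe or bot could run this on each timeline update so the declared severity never lags the escalation rules.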
@@ -0,0 +1,34 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->

# Communication Templates

```markdown
# SEV1 Initial Notification
**Subject**: [SEV1] [Service Name] - [Brief Impact Description]

**Current Status**: We are investigating an issue affecting [service or feature].
**Impact**: [X]% of users are experiencing [symptom].
**Next Update**: In 15 minutes or sooner if the situation changes materially.

---

# SEV1 Status Update
**Subject**: [SEV1 UPDATE] [Service Name] - [Current State]

**Status**: [Investigating / Identified / Mitigating / Resolved]
**Current Understanding**: [what we know about the cause]
**Actions Taken**: [what has been done so far]
**Next Steps**: [what happens next]
**Next Update**: In 15 minutes.

---

# Incident Resolved
**Subject**: [RESOLVED] [Service Name] - [Brief Description]

**Resolution**: [what fixed the issue]
**Duration**: [start time] to [end time] ([total duration])
**Impact Summary**: [who was affected and how]
**Follow-up**: Post-mortem scheduled for [date]. Action items will be tracked in [link].
```
@@ -0,0 +1,69 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->

# Postmortem Template

```markdown
# Post-Mortem: [Incident Title]

**Date**: YYYY-MM-DD
**Severity**: SEV[1-4]
**Duration**: [start time] to [end time] ([total duration])
**Author**: [name]
**Status**: [Draft / Review / Final]

## Executive Summary
[2-3 sentences on what happened, who was affected, and how it was resolved]

## Impact
- **Users affected**: [number or percentage]
- **Revenue impact**: [estimated or N/A]
- **SLO budget consumed**: [X% of monthly error budget]
- **Support tickets created**: [count]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:02 | Monitoring alert fires |
| 14:05 | On-call engineer acknowledges page |
| 14:08 | Incident declared and roles assigned |
| 14:12 | Root-cause hypothesis recorded |
| 14:18 | Mitigation started |
| 14:23 | Metrics begin returning to baseline |
| 14:30 | Incident resolved |
| 14:45 | All-clear communicated |

## Root Cause Analysis
### What Happened
[Detailed technical explanation of the failure chain]

### Contributing Factors
1. **Immediate cause**: [the direct trigger]
2. **Underlying cause**: [why the trigger was possible]
3. **Systemic cause**: [what process or design gap allowed it]

### 5 Whys
1. Why did the service fail? [answer]
2. Why did that happen? [answer]
3. Why was that possible? [answer]
4. Why was the guardrail missing? [answer]
5. Why did the system permit the failure mode? [root issue]

## What Went Well
- [things that helped detection or response]
- [tools or practices that reduced impact]

## What Went Poorly
- [gaps that slowed detection or resolution]
- [runbooks, alerts, or ownership issues]

## Action Items
| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [action] | [owner] | P1 | YYYY-MM-DD | Not Started |
| 2 | [action] | [owner] | P1 | YYYY-MM-DD | Not Started |
| 3 | [action] | [owner] | P2 | YYYY-MM-DD | Not Started |

## Lessons Learned
[Key takeaways that should change design, monitoring, or procedure]
```
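The example timeline in the post-mortem template already contains the numbers that the Duration header and the MTTD/MTTR targets refer to; deriving them should be arithmetic, not estimation. A small sketch (hypothetical helper, not shipped with the toolkit):

```python
from datetime import datetime

# Timestamps taken from the template's example timeline (UTC, HH:MM).
TIMELINE = [
    ("14:02", "Monitoring alert fires"),
    ("14:05", "On-call engineer acknowledges page"),
    ("14:08", "Incident declared and roles assigned"),
    ("14:18", "Mitigation started"),
    ("14:30", "Incident resolved"),
]

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Duration for the post-mortem header: first alert to resolution.
duration = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
# Time-to-acknowledge, useful against the response-time targets.
time_to_ack = minutes_between(TIMELINE[0][0], TIMELINE[1][0])
```

For the example timeline this yields a 28-minute incident with a 3-minute acknowledgement, which is the kind of number the Operational Metrics table expects to track.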
@@ -0,0 +1,69 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->

# Incident Runbook Template

````markdown
# Runbook: [Service Or Failure Scenario]

## Quick Reference
- **Service**: [service name and repo link]
- **Owner Team**: [team name and contact channel]
- **On-Call Rotation**: [schedule link or contact path]
- **Dashboards**: [monitoring links]
- **Last Tested**: [date of drill or last validation]

## Detection
- **Alert**: [alert name and system]
- **Symptoms**: [what users and metrics look like]
- **False Positive Check**: [how to confirm it is real]

## Diagnosis
1. Check service health: `[command]`
2. Review error rate and latency dashboards: `[links]`
3. Check recent deployments or config changes: `[command or link]`
4. Review dependency health: `[links or commands]`

## Remediation

### Option A: Rollback
```bash
# Identify the last known good revision
[rollback-history-command]

# Roll back to the prior revision
[rollback-command]

# Verify the rollback finished
[status-command]
```

### Option B: Restart
```bash
# Use a rolling restart where possible
[restart-command]

# Monitor progress
[status-command]
```

### Option C: Scale Up
```bash
# Increase capacity if the issue is load-related
[scale-command]

# Verify capacity and error rate
[verify-command]
```

## Verification
- [ ] Error rate returned to baseline
- [ ] Latency is within SLO
- [ ] No new alerts firing for 10 minutes
- [ ] User-facing functionality manually verified

## Communication
- Internal: [incident channel or update path]
- External: [status page or customer update path]
- Follow-up: Create post-mortem within 48 hours
````
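The Verification checklist in the runbook template can double as a machine-checkable gate before the incident is closed. A hedged sketch, assuming thresholds a team would choose for itself (none of these names or tolerances come from the toolkit):

```python
def verified(error_rate: float, baseline_error_rate: float,
             p99_latency_ms: float, slo_latency_ms: float,
             quiet_minutes: float) -> bool:
    """Mirror the runbook checklist: error rate back at baseline,
    latency within SLO, and no new alerts for 10 minutes.
    The 5% tolerance over baseline is an assumed value."""
    return (
        error_rate <= baseline_error_rate * 1.05
        and p99_latency_ms <= slo_latency_ms
        and quiet_minutes >= 10
    )
```

The manual "user-facing functionality verified" item stays human; the gate only automates the signals that monitoring can attest to.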
@@ -0,0 +1,123 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025) and gstack by Garry Tan (MIT License, 2026). Adapted for SDTK-OPS. -->
---
name: ops-infra-plan
description: Infrastructure architecture planning. Use when designing cloud resources, networking, security groups, IAM policies, or IaC modules -- produces a reviewable infrastructure plan before provisioning.
---

# Ops Infra Plan

## Overview

Provisioning without a reviewed design creates hidden coupling, rollback gaps, and security drift. Write the infrastructure plan first, make the dependencies explicit, then provision only what the approved plan covers.

## The Iron Law

```
NO INFRASTRUCTURE PROVISIONING WITHOUT A REVIEWED PLAN
```

## When to Use

Use for:
- new service or environment setup
- network topology changes
- security group, firewall, or IAM changes
- database, cache, queue, or storage provisioning
- scaling architecture updates
- disaster recovery design

## Required Plan Sections

Every infrastructure plan must include:
- **Architecture Diagram (ASCII)**
  - show traffic entry, workload tier, stateful dependencies, and operator control points
- **Resource Inventory**
  - resource name, purpose, environment, owner, critical dependency
- **Networking Design**
  - ingress and egress paths, DNS, load balancing, private connectivity, access boundaries
- **Security Design**
  - identities, secrets flow, encryption posture, least-privilege access, audit boundaries
- **Capacity Estimates**
  - CPU, memory, storage, throughput, and auto-scaling limits
- **Multi-Environment Strategy**
  - dev, staging, and production separation with configuration and data isolation notes
- **Rollback Plan**
  - rollback trigger, exact rollback action, verification evidence, and decision owner
- **Dependency Order**
  - what must exist before provisioning the next layer

## ASCII Diagram Pattern

Use an explicit diagram even for small changes:

```text
Internet
   |
DNS -> Load Balancer
   |
App Service
   |
Database / Cache / Queue
   |
Backups / Logs / Metrics
```

If a system boundary or dependency matters during rollout, it belongs in the diagram.

## Resource Inventory Template

Use a table like this:

| Resource | Purpose | Environment | Owner | Depends On |
|----------|---------|-------------|-------|------------|
| app-lb | Public traffic entry | production | ops | DNS, TLS cert |
| app-service | Stateless workload | production | ops | image, secrets, DB |
| app-db | Primary relational store | production | ops | subnet, backup policy |

## IaC Patterns

Prefer reusable modules and explicit inputs over hand-built resources. Common patterns:
- network foundation first: VPC or equivalent, subnets, routes, security boundaries
- stateless compute behind a load balancer with health checks and auto-scaling
- stateful services with backup, encryption, and maintenance windows defined up front
- alarms and dashboards defined in the same plan as the resource they guard
- environment-specific values separated from reusable module logic

Use `./references/iac-patterns.md` for concrete Terraform-style examples adapted from the source material.

## Review Checklist

Review the plan for:
- naming consistency across resources, modules, and environments
- security groups or firewall rules scoped to least privilege
- encryption at rest and in transit
- backup schedule and restore ownership
- observability coverage for each critical resource
- cost estimate, scaling bounds, and obvious waste
- rollback path that works after partial apply

## <HARD-GATE>

Do not provision until all four are true:
- written plan exists and matches the intended scope
- resource inventory is complete
- rollback procedure exists for the first failed step and for partial apply
- security review covers network access, identities, secrets, and encryption

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Provision first, document later | The live system becomes the only source of truth |
| Copy production config into dev | Cost, permissions, and blast radius drift immediately |
| Hardcode IPs, hostnames, or secrets | Reuse fails and rotation becomes dangerous |
| Plan resources without capacity notes | Scaling and cost failures surface during rollout |
| Skip rollback because IaC is "reversible" | Partial apply and data changes still need explicit recovery |

## Execution Handoff

After the plan is approved:
- provision in dependency order
- verify each applied layer before moving forward
- use `ops-verify` before marking any provisioning step complete
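The "provision in dependency order" handoff step can be derived mechanically from the inventory's Depends On column rather than eyeballed. A sketch using the example table (the table's `DB` entry is written as `app-db` here; the helper is illustrative, not part of the toolkit):

```python
from graphlib import TopologicalSorter

# "Depends On" column from the example resource inventory. External
# prerequisites (DNS, TLS cert, image, ...) appear only as dependencies.
DEPENDS_ON = {
    "app-lb": ["DNS", "TLS cert"],
    "app-service": ["image", "secrets", "app-db"],
    "app-db": ["subnet", "backup policy"],
}

# static_order() yields each dependency before its dependents, i.e. a
# safe provisioning order; a circular plan raises CycleError instead of
# silently producing a bad sequence.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

Running this at review time catches a cyclic or missing dependency before the first apply, which is exactly when the rollback plan is cheapest.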
@@ -0,0 +1,141 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025) and gstack by Garry Tan (MIT License, 2026). Adapted for SDTK-OPS. -->

# IaC Patterns

## Overview

These examples capture common infrastructure-as-code patterns for a networked service with autoscaling, load balancing, and operational safeguards. Treat provider-specific syntax as illustrative. The planning pattern is portable even when the resource names differ.

## Network And Compute Pattern

```hcl
module "network" {
  source = "./modules/network"

  name               = "app-prod"
  cidr_block         = "10.20.0.0/16"
  public_subnets     = ["10.20.1.0/24", "10.20.2.0/24"]
  private_subnets    = ["10.20.11.0/24", "10.20.12.0/24"]
  enable_nat_gateway = true
}

resource "aws_launch_template" "app" {
  name_prefix   = "app-prod-"
  image_id      = var.image_id
  instance_type = var.instance_type

  user_data = base64encode(templatefile("${path.module}/bootstrap.sh", {
    env = "production"
  }))
}

resource "aws_autoscaling_group" "app" {
  name                      = "app-prod"
  desired_capacity          = 3
  min_size                  = 3
  max_size                  = 6
  vpc_zone_identifier       = module.network.private_subnet_ids
  target_group_arns         = [aws_lb_target_group.app.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}
```

Pattern notes:
- keep shared network concerns in a module
- treat image version and instance size as explicit inputs
- set min and max capacity in the same change that introduces autoscaling
- attach compute to health-checked traffic entry, not direct public access

## Load Balancer Pattern

```hcl
resource "aws_lb" "app" {
  name               = "app-prod"
  internal           = false
  load_balancer_type = "application"
  subnets            = module.network.public_subnet_ids
  security_groups    = [aws_security_group.lb.id]
}

resource "aws_lb_target_group" "app" {
  name     = "app-prod"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = module.network.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 30
    timeout             = 5
  }
}
```

Pattern notes:
- define health checks next to the traffic entry resource
- use explicit thresholds instead of provider defaults
- keep listener, target, and security policy changes in the same reviewed plan

## Database Pattern

```hcl
resource "aws_db_subnet_group" "app" {
  name       = "app-prod"
  subnet_ids = module.network.private_subnet_ids
}

resource "aws_db_instance" "app" {
  identifier              = "app-prod"
  engine                  = "postgres"
  instance_class          = "db.t3.medium"
  allocated_storage       = 100
  storage_encrypted       = true
  backup_retention_period = 7
  multi_az                = true
  db_subnet_group_name    = aws_db_subnet_group.app.name
  vpc_security_group_ids  = [aws_security_group.db.id]
}
```

Pattern notes:
- keep stateful resources private by default
- define backup retention and encryption in the first version
- make high availability an explicit decision, not an accident

## Alarm Pattern

```hcl
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "app-prod-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 75
  alarm_description   = "CPU high for two consecutive periods"
}
```

Pattern notes:
- define an alarm when you define the critical resource
- use thresholds the team can explain and review
- connect alerting design to rollback and incident response decisions

## Planning Checklist

Before applying an IaC change, confirm:
- all provider-specific examples have a cloud-agnostic rationale
- network, identity, and secret dependencies are listed
- rollback covers both failed create and partial update
- health checks, alarms, and backups are defined with the resource
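The CIDR inputs to a network module like the one above can be sanity-checked before plan review. A sketch with Python's `ipaddress` module, using the example values (the check itself is an illustration, not part of the toolkit):

```python
import ipaddress

# VPC and subnet CIDRs from the network module example.
vpc = ipaddress.ip_network("10.20.0.0/16")
subnets = [ipaddress.ip_network(c) for c in [
    "10.20.1.0/24", "10.20.2.0/24",    # public
    "10.20.11.0/24", "10.20.12.0/24",  # private
]]

# Every subnet must sit inside the VPC CIDR...
all_inside = all(s.subnet_of(vpc) for s in subnets)

# ...and no two subnets may overlap.
no_overlap = all(not a.overlaps(b)
                 for i, a in enumerate(subnets)
                 for b in subnets[i + 1:])
```

A mis-typed subnet (say, `10.21.1.0/24`) fails the containment check at plan time instead of surfacing as an unroutable tier after apply.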
@@ -0,0 +1,110 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
---
name: ops-monitor
description: Observability and monitoring setup. Use when designing monitoring, defining SLOs/SLIs, configuring alerts, or building dashboards -- covers the three pillars of observability and Golden Signals framework.
---

# Ops Monitor

## Overview

Monitoring exists to detect user-impacting problems early, explain what is happening, and guide a safe response. A service is not production-ready until metrics, logs, traces, alerts, and response expectations are all defined together.

## The Iron Law

```
NO SERVICE IN PRODUCTION WITHOUT MONITORING AND ALERTING
```

## When to Use

Use for:
- new production services
- dashboard and alert design
- SLO and SLI definition
- noisy alert cleanup
- monitoring gaps discovered during incidents

## The Three Pillars

| Pillar | Purpose | Key Question |
|--------|---------|--------------|
| Metrics | trends, alerting, SLO tracking | Is the system healthy over time? |
| Logs | event details, audit trail, debugging context | What happened at a specific time? |
| Traces | request flow across services | Where is the latency or failure path? |

## Golden Signals

| Signal | What To Measure | Example Metrics |
|--------|-----------------|-----------------|
| Latency | request duration for successful and failed work | `http_request_duration_seconds_bucket`, `grpc_server_handling_seconds_bucket` |
| Traffic | request volume or work rate | `http_requests_total`, `rpc_requests_total`, queue throughput |
| Errors | failure rate by type and severity | `http_requests_total{status=~"5.."}`, timeout counters, business error counters |
| Saturation | how close the system is to a limit | CPU, memory, queue depth, disk, connection pool usage |

## SLO And SLI Framework

Define:
- **SLI**
  - what you measure for user-visible behavior
  - examples: availability, latency, correctness
- **SLO**
  - the target and window for that SLI
  - examples: `99.95% availability over 30d`, `99% of requests under 300ms over 30d`

Use `./references/slo-templates.md` for YAML templates that combine availability, latency, correctness, burn-rate alerts, and error-budget policy.

## Error Budget Policy

Use a simple decision policy:
- budget remaining above 50%: normal development
- budget remaining 25% to 50%: reliability review before risky changes
- budget remaining below 25%: prioritize reliability work
- budget exhausted: freeze non-critical deploys until reviewed

## Alert Design

Alerts should:
- page on symptoms, not on every possible internal cause
- link to a runbook for first response
- define at least two severities where appropriate
- be tuned so false positives stay below 15%
- point to the dashboard or query needed for triage

An alert that does not change operator behavior should be removed or rewritten.

## Dashboard Design

| Dashboard | Audience | Required Content |
|-----------|----------|------------------|
| Executive | incident leads, managers | SLO status, error budget, active incidents, high-level trends |
| Service | on-call engineers | Golden Signals, dependency health, deploy markers, recent alerts |
| Debug | responders and service owners | detailed metrics, logs, traces, and breakdowns by instance, region, or endpoint |

## <HARD-GATE>

Before a service is treated as production-ready, all must be true:
- SLIs are defined for user-visible behavior
- SLOs are set with explicit targets and windows
- burn-rate alerts are configured
- a Golden Signals dashboard exists
- the on-call team knows which alerts fire and what they mean

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Alert on every low-level cause | Operators get noise instead of signal |
| Track uptime only | User-visible latency and correctness failures are missed |
| Dashboard with no SLO context | Teams see metrics but not reliability risk |
| No runbook links in alerts | Response time expands during real incidents |
| Collect logs and traces without correlation keys | Cross-system debugging stays slow and manual |

## Execution Handoff

After monitoring is defined:
- validate alerts with a controlled test or drill
- route incident handling through `ops-incident`
- use `ops-debug` for root-cause investigation
- use `ops-verify` before claiming monitoring coverage is complete
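The SLO targets and the error budget policy in the ops-monitor skill reduce to simple arithmetic. A sketch with illustrative names; the thresholds mirror the decision policy bullets, and the 30-day window matches the example SLOs:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime for an availability SLO over the window.
    E.g. 99.95% over 30 days allows 30 * 24 * 60 * 0.0005 = 21.6 minutes."""
    return window_days * 24 * 60 * (1 - slo)

def budget_policy(remaining_fraction: float) -> str:
    """Map remaining error budget to the skill's decision policy."""
    if remaining_fraction <= 0:
        return "freeze non-critical deploys"
    if remaining_fraction < 0.25:
        return "prioritize reliability work"
    if remaining_fraction <= 0.50:
        return "reliability review before risky changes"
    return "normal development"
```

Wiring `budget_policy` into a weekly report (or a deploy gate) is what turns the written policy into behavior rather than a document.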