sdtk-ops-kit 0.2.0

Files changed (51)
  1. package/README.md +146 -0
  2. package/assets/manifest/toolkit-bundle.manifest.json +187 -0
  3. package/assets/manifest/toolkit-bundle.sha256.txt +36 -0
  4. package/assets/toolkit/toolkit/AGENTS.md +65 -0
  5. package/assets/toolkit/toolkit/SDTKOPS_TOOLKIT.md +166 -0
  6. package/assets/toolkit/toolkit/install.ps1 +138 -0
  7. package/assets/toolkit/toolkit/scripts/install-claude-skills.ps1 +81 -0
  8. package/assets/toolkit/toolkit/scripts/install-codex-skills.ps1 +127 -0
  9. package/assets/toolkit/toolkit/scripts/uninstall-claude-skills.ps1 +65 -0
  10. package/assets/toolkit/toolkit/scripts/uninstall-codex-skills.ps1 +53 -0
  11. package/assets/toolkit/toolkit/sdtk-spec.config.json +6 -0
  12. package/assets/toolkit/toolkit/sdtk-spec.config.profiles.example.json +12 -0
  13. package/assets/toolkit/toolkit/skills/ops-backup/SKILL.md +93 -0
  14. package/assets/toolkit/toolkit/skills/ops-backup/references/backup-script-patterns.md +108 -0
  15. package/assets/toolkit/toolkit/skills/ops-ci-cd/SKILL.md +88 -0
  16. package/assets/toolkit/toolkit/skills/ops-ci-cd/references/pipeline-examples.md +113 -0
  17. package/assets/toolkit/toolkit/skills/ops-compliance/SKILL.md +105 -0
  18. package/assets/toolkit/toolkit/skills/ops-container/SKILL.md +95 -0
  19. package/assets/toolkit/toolkit/skills/ops-container/references/k8s-manifest-patterns.md +116 -0
  20. package/assets/toolkit/toolkit/skills/ops-cost/SKILL.md +88 -0
  21. package/assets/toolkit/toolkit/skills/ops-debug/SKILL.md +311 -0
  22. package/assets/toolkit/toolkit/skills/ops-debug/references/root-cause-tracing.md +138 -0
  23. package/assets/toolkit/toolkit/skills/ops-deploy/SKILL.md +102 -0
  24. package/assets/toolkit/toolkit/skills/ops-discover/SKILL.md +102 -0
  25. package/assets/toolkit/toolkit/skills/ops-incident/SKILL.md +113 -0
  26. package/assets/toolkit/toolkit/skills/ops-incident/references/communication-templates.md +34 -0
  27. package/assets/toolkit/toolkit/skills/ops-incident/references/postmortem-template.md +69 -0
  28. package/assets/toolkit/toolkit/skills/ops-incident/references/runbook-template.md +69 -0
  29. package/assets/toolkit/toolkit/skills/ops-infra-plan/SKILL.md +123 -0
  30. package/assets/toolkit/toolkit/skills/ops-infra-plan/references/iac-patterns.md +141 -0
  31. package/assets/toolkit/toolkit/skills/ops-monitor/SKILL.md +110 -0
  32. package/assets/toolkit/toolkit/skills/ops-monitor/references/alert-rules.md +80 -0
  33. package/assets/toolkit/toolkit/skills/ops-monitor/references/slo-templates.md +83 -0
  34. package/assets/toolkit/toolkit/skills/ops-parallel/SKILL.md +177 -0
  35. package/assets/toolkit/toolkit/skills/ops-plan/SKILL.md +169 -0
  36. package/assets/toolkit/toolkit/skills/ops-security-infra/SKILL.md +126 -0
  37. package/assets/toolkit/toolkit/skills/ops-security-infra/references/cicd-security-pipeline.md +55 -0
  38. package/assets/toolkit/toolkit/skills/ops-security-infra/references/security-headers.md +24 -0
  39. package/assets/toolkit/toolkit/skills/ops-verify/SKILL.md +180 -0
  40. package/bin/sdtk-ops.js +14 -0
  41. package/package.json +46 -0
  42. package/src/commands/generate.js +12 -0
  43. package/src/commands/help.js +53 -0
  44. package/src/commands/init.js +86 -0
  45. package/src/commands/runtime.js +201 -0
  46. package/src/index.js +65 -0
  47. package/src/lib/args.js +107 -0
  48. package/src/lib/errors.js +41 -0
  49. package/src/lib/powershell.js +65 -0
  50. package/src/lib/scope.js +58 -0
  51. package/src/lib/toolkit-payload.js +123 -0
@@ -0,0 +1,113 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
---
name: ops-incident
description: Incident response and management. Use when a production incident occurs or when establishing incident response procedures -- covers severity classification, response coordination, post-mortem facilitation, and on-call design.
---

# Ops Incident

## Overview

Incident response must turn chaos into a structured workflow: classify severity, assign roles, build a timeline, test one hypothesis path at a time, stabilize the system, and capture systemic follow-up before memory fades.

## The Iron Law

```
NO INCIDENT RESOLVED WITHOUT A TIMELINE, IMPACT ASSESSMENT, AND ACTION ITEMS WITHIN 48 HOURS
```

## Severity Classification Matrix

| Level | Name | Criteria | Response Time | Update Cadence | Escalation |
|-------|------|----------|---------------|----------------|------------|
| SEV1 | Critical | full service outage, data loss risk, security breach | under 5 min | every 15 min | leadership immediately |
| SEV2 | Major | degraded service for more than 25% of users, key feature down | under 15 min | every 30 min | engineering manager within 15 min |
| SEV3 | Moderate | minor feature broken, workaround available | under 1 hour | every 2 hours | team lead next standup |
| SEV4 | Low | cosmetic issue, no user impact, tech debt trigger | next business day | daily | backlog triage |

Escalate severity when:
- impact scope doubles
- no root cause is identified after 30 minutes for SEV1 or 2 hours for SEV2
- a paying customer is blocked, minimum SEV2
- any data integrity concern appears, immediate SEV1

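The matrix and escalation rules above are mechanical enough to sketch as a lookup. A minimal illustration; the function and field names (`impact_pct`, `paying_customer_blocked`, ...) are hypothetical, not part of the toolkit:

```python
# Hypothetical sketch of the severity matrix as a lookup.
# Field names are illustrative, not part of the toolkit.
def classify_severity(full_outage: bool, data_risk: bool, impact_pct: float,
                      paying_customer_blocked: bool = False) -> str:
    if full_outage or data_risk:
        # full outage, data loss risk, or security breach => SEV1
        return "SEV1"
    if impact_pct > 25 or paying_customer_blocked:
        # degraded service for >25% of users; blocked paying customer is minimum SEV2
        return "SEV2"
    if impact_pct > 0:
        # minor feature broken, workaround available
        return "SEV3"
    # cosmetic issue, no user impact
    return "SEV4"
```

Encoding the matrix this way also makes the escalation triggers testable: a data-integrity flag always wins, regardless of impact percentage.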
## The Four-Phase Process

### 1. Detection And Declaration

- acknowledge the page or signal
- classify severity using the matrix
- declare the incident and open a timeline immediately
- state current impact, suspected scope, and next update time

### 2. Structured Response

Assign roles:
- **Incident Commander**
  - owns timeline, severity, decisions, and cadence
- **Technical Lead**
  - drives diagnosis and remediation with `ops-debug`
- **Scribe**
  - records timestamps, evidence, commands, and decisions
- **Communications Lead**
  - sends stakeholder updates on schedule

Response rules:
- timebox each hypothesis path to 15 minutes before re-evaluating
- fix the bleeding first, then optimize
- keep one source of truth for status and impact

### 3. Resolution And Stabilization

- apply the lowest-risk effective mitigation
- verify recovery through metrics, not by visual guesswork
- monitor for 15 to 30 minutes after recovery
- do not close the incident until the service is stable

### 4. Post-Mortem

- schedule the review within 48 hours
- document impact, timeline, root cause, contributing factors, and action items
- convert learning into runbook, alert, test, or design changes

## <HARD-GATE>

Incident review and post-mortem must stay blameless:
- say "the system allowed this failure mode"
- do not frame the review as "which person caused it"
- protect psychological safety so the real causes are visible

## Operational Metrics

| Metric | Target |
|--------|--------|
| MTTD | under 5 minutes |
| MTTR for SEV1 | under 30 minutes |
| Post-mortems within 48 hours | 100% |
| Action item completion | 90% |
| Repeat incidents | 0 for the same unresolved cause |

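The MTTD and MTTR targets above can be checked mechanically from a recorded incident timeline. A minimal sketch using illustrative timestamps in the shape of the post-mortem template's timeline table; note that measuring from the alert understates true MTTD when detection lags the fault:

```python
from datetime import datetime

# Illustrative timeline (UTC), in the shape of the post-mortem template's table.
timeline = {
    "alert_fired":  "14:02",
    "acknowledged": "14:05",
    "resolved":     "14:30",
}

def minutes_between(start: str, end: str) -> int:
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

mtta = minutes_between(timeline["alert_fired"], timeline["acknowledged"])
mttr = minutes_between(timeline["alert_fired"], timeline["resolved"])
assert mttr < 30  # inside the SEV1 MTTR target above
```

Computing these from the scribe's timeline, rather than from memory, is what makes the "post-mortem within 48 hours" target cheap to hit.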
## Reference Files

Use:
- `./references/runbook-template.md`
- `./references/postmortem-template.md`
- `./references/communication-templates.md`

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Start fixing before declaring severity and roles | Communication and triage fragment immediately |
| Chase multiple hypotheses at once | The team loses evidence and ownership |
| Close the incident when errors drop briefly | Latent instability returns and the timeline becomes unclear |
| Write a blame document instead of a post-mortem | Systemic fixes are replaced by defensiveness |
| Skip action item ownership and due dates | The same incident class repeats |

## Execution Handoff

During incident response:
- use `ops-debug` for diagnosis
- use `ops-monitor` to confirm recovery against live signals
- use `ops-verify` before declaring the system stable

@@ -0,0 +1,34 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->

# Communication Templates

```markdown
# SEV1 Initial Notification
**Subject**: [SEV1] [Service Name] - [Brief Impact Description]

**Current Status**: We are investigating an issue affecting [service or feature].
**Impact**: [X]% of users are experiencing [symptom].
**Next Update**: In 15 minutes or sooner if the situation changes materially.

---

# SEV1 Status Update
**Subject**: [SEV1 UPDATE] [Service Name] - [Current State]

**Status**: [Investigating / Identified / Mitigating / Resolved]
**Current Understanding**: [what we know about the cause]
**Actions Taken**: [what has been done so far]
**Next Steps**: [what happens next]
**Next Update**: In 15 minutes.

---

# Incident Resolved
**Subject**: [RESOLVED] [Service Name] - [Brief Description]

**Resolution**: [what fixed the issue]
**Duration**: [start time] to [end time] ([total duration])
**Impact Summary**: [who was affected and how]
**Follow-up**: Post-mortem scheduled for [date]. Action items will be tracked in [link].
```

@@ -0,0 +1,69 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->

# Postmortem Template

```markdown
# Post-Mortem: [Incident Title]

**Date**: YYYY-MM-DD
**Severity**: SEV[1-4]
**Duration**: [start time] to [end time] ([total duration])
**Author**: [name]
**Status**: [Draft / Review / Final]

## Executive Summary
[2-3 sentences on what happened, who was affected, and how it was resolved]

## Impact
- **Users affected**: [number or percentage]
- **Revenue impact**: [estimated or N/A]
- **SLO budget consumed**: [X% of monthly error budget]
- **Support tickets created**: [count]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:02 | Monitoring alert fires |
| 14:05 | On-call engineer acknowledges page |
| 14:08 | Incident declared and roles assigned |
| 14:12 | Root-cause hypothesis recorded |
| 14:18 | Mitigation started |
| 14:23 | Metrics begin returning to baseline |
| 14:30 | Incident resolved |
| 14:45 | All-clear communicated |

## Root Cause Analysis
### What Happened
[Detailed technical explanation of the failure chain]

### Contributing Factors
1. **Immediate cause**: [the direct trigger]
2. **Underlying cause**: [why the trigger was possible]
3. **Systemic cause**: [what process or design gap allowed it]

### 5 Whys
1. Why did the service fail? [answer]
2. Why did that happen? [answer]
3. Why was that possible? [answer]
4. Why was the guardrail missing? [answer]
5. Why did the system permit the failure mode? [root issue]

## What Went Well
- [things that helped detection or response]
- [tools or practices that reduced impact]

## What Went Poorly
- [gaps that slowed detection or resolution]
- [runbooks, alerts, or ownership issues]

## Action Items
| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [action] | [owner] | P1 | YYYY-MM-DD | Not Started |
| 2 | [action] | [owner] | P1 | YYYY-MM-DD | Not Started |
| 3 | [action] | [owner] | P2 | YYYY-MM-DD | Not Started |

## Lessons Learned
[Key takeaways that should change design, monitoring, or procedure]
```

@@ -0,0 +1,69 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->

# Incident Runbook Template

````markdown
# Runbook: [Service Or Failure Scenario]

## Quick Reference
- **Service**: [service name and repo link]
- **Owner Team**: [team name and contact channel]
- **On-Call Rotation**: [schedule link or contact path]
- **Dashboards**: [monitoring links]
- **Last Tested**: [date of drill or last validation]

## Detection
- **Alert**: [alert name and system]
- **Symptoms**: [what users and metrics look like]
- **False Positive Check**: [how to confirm it is real]

## Diagnosis
1. Check service health: `[command]`
2. Review error rate and latency dashboards: `[links]`
3. Check recent deployments or config changes: `[command or link]`
4. Review dependency health: `[links or commands]`

## Remediation

### Option A: Rollback
```bash
# Identify the last known good revision
[rollback-history-command]

# Roll back to the prior revision
[rollback-command]

# Verify the rollback finished
[status-command]
```

### Option B: Restart
```bash
# Use a rolling restart where possible
[restart-command]

# Monitor progress
[status-command]
```

### Option C: Scale Up
```bash
# Increase capacity if the issue is load-related
[scale-command]

# Verify capacity and error rate
[verify-command]
```

## Verification
- [ ] Error rate returned to baseline
- [ ] Latency is within SLO
- [ ] No new alerts firing for 10 minutes
- [ ] User-facing functionality manually verified

## Communication
- Internal: [incident channel or update path]
- External: [status page or customer update path]
- Follow-up: Create post-mortem within 48 hours
````

@@ -0,0 +1,123 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025) and gstack by Garry Tan (MIT License, 2026). Adapted for SDTK-OPS. -->
---
name: ops-infra-plan
description: Infrastructure architecture planning. Use when designing cloud resources, networking, security groups, IAM policies, or IaC modules -- produces a reviewable infrastructure plan before provisioning.
---

# Ops Infra Plan

## Overview

Provisioning without a reviewed design creates hidden coupling, rollback gaps, and security drift. Write the infrastructure plan first, make the dependencies explicit, then provision only what the approved plan covers.

## The Iron Law

```
NO INFRASTRUCTURE PROVISIONING WITHOUT A REVIEWED PLAN
```

## When to Use

Use for:
- new service or environment setup
- network topology changes
- security group, firewall, or IAM changes
- database, cache, queue, or storage provisioning
- scaling architecture updates
- disaster recovery design

## Required Plan Sections

Every infrastructure plan must include:
- **Architecture Diagram (ASCII)**
  - show traffic entry, workload tier, stateful dependencies, and operator control points
- **Resource Inventory**
  - resource name, purpose, environment, owner, critical dependency
- **Networking Design**
  - ingress and egress paths, DNS, load balancing, private connectivity, access boundaries
- **Security Design**
  - identities, secrets flow, encryption posture, least-privilege access, audit boundaries
- **Capacity Estimates**
  - CPU, memory, storage, throughput, and auto-scaling limits
- **Multi-Environment Strategy**
  - dev, staging, and production separation with configuration and data isolation notes
- **Rollback Plan**
  - rollback trigger, exact rollback action, verification evidence, and decision owner
- **Dependency Order**
  - what must exist before provisioning the next layer

## ASCII Diagram Pattern

Use an explicit diagram even for small changes:

```text
Internet
    |
DNS -> Load Balancer
    |
App Service
    |
Database / Cache / Queue
    |
Backups / Logs / Metrics
```

If a system boundary or dependency matters during rollout, it belongs in the diagram.

## Resource Inventory Template

Use a table like this:

| Resource | Purpose | Environment | Owner | Depends On |
|----------|---------|-------------|-------|------------|
| app-lb | Public traffic entry | production | ops | DNS, TLS cert |
| app-service | Stateless workload | production | ops | image, secrets, DB |
| app-db | Primary relational store | production | ops | subnet, backup policy |

## IaC Patterns

Prefer reusable modules and explicit inputs over hand-built resources. Common patterns:
- network foundation first: VPC or equivalent, subnets, routes, security boundaries
- stateless compute behind a load balancer with health checks and auto-scaling
- stateful services with backup, encryption, and maintenance windows defined up front
- alarms and dashboards defined in the same plan as the resource they guard
- environment-specific values separated from reusable module logic

Use `./references/iac-patterns.md` for concrete Terraform-style examples adapted from the source material.

## Review Checklist

Review the plan for:
- naming consistency across resources, modules, and environments
- security groups or firewall rules scoped to least privilege
- encryption at rest and in transit
- backup schedule and restore ownership
- observability coverage for each critical resource
- cost estimate, scaling bounds, and obvious waste
- rollback path that works after partial apply

## <HARD-GATE>

Do not provision until all four are true:
- written plan exists and matches the intended scope
- resource inventory is complete
- rollback procedure exists for the first failed step and for partial apply
- security review covers network access, identities, secrets, and encryption

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Provision first, document later | The live system becomes the only source of truth |
| Copy production config into dev | Cost, permissions, and blast radius drift immediately |
| Hardcode IPs, hostnames, or secrets | Reuse fails and rotation becomes dangerous |
| Plan resources without capacity notes | Scaling and cost failures surface during rollout |
| Skip rollback because IaC is "reversible" | Partial apply and data changes still need explicit recovery |

## Execution Handoff

After the plan is approved:
- provision in dependency order
- verify each applied layer before moving forward
- use `ops-verify` before marking any provisioning step complete

@@ -0,0 +1,141 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025) and gstack by Garry Tan (MIT License, 2026). Adapted for SDTK-OPS. -->

# IaC Patterns

## Overview

These examples capture common infrastructure-as-code patterns for a networked service with autoscaling, load balancing, and operational safeguards. Treat provider-specific syntax as illustrative. The planning pattern is portable even when the resource names differ.

## Network And Compute Pattern

```hcl
module "network" {
  source = "./modules/network"

  name               = "app-prod"
  cidr_block         = "10.20.0.0/16"
  public_subnets     = ["10.20.1.0/24", "10.20.2.0/24"]
  private_subnets    = ["10.20.11.0/24", "10.20.12.0/24"]
  enable_nat_gateway = true
}

resource "aws_launch_template" "app" {
  name_prefix   = "app-prod-"
  image_id      = var.image_id
  instance_type = var.instance_type

  user_data = base64encode(templatefile("${path.module}/bootstrap.sh", {
    env = "production"
  }))
}

resource "aws_autoscaling_group" "app" {
  name                      = "app-prod"
  desired_capacity          = 3
  min_size                  = 3
  max_size                  = 6
  vpc_zone_identifier       = module.network.private_subnet_ids
  target_group_arns         = [aws_lb_target_group.app.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}
```

Pattern notes:
- keep shared network concerns in a module
- treat image version and instance size as explicit inputs
- set min and max capacity in the same change that introduces autoscaling
- attach compute to health-checked traffic entry, not direct public access

## Load Balancer Pattern

```hcl
resource "aws_lb" "app" {
  name               = "app-prod"
  internal           = false
  load_balancer_type = "application"
  subnets            = module.network.public_subnet_ids
  security_groups    = [aws_security_group.lb.id]
}

resource "aws_lb_target_group" "app" {
  name     = "app-prod"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = module.network.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 30
    timeout             = 5
  }
}
```

Pattern notes:
- define health checks next to the traffic entry resource
- use explicit thresholds instead of provider defaults
- keep listener, target, and security policy changes in the same reviewed plan

## Database Pattern

```hcl
resource "aws_db_subnet_group" "app" {
  name       = "app-prod"
  subnet_ids = module.network.private_subnet_ids
}

resource "aws_db_instance" "app" {
  identifier              = "app-prod"
  engine                  = "postgres"
  instance_class          = "db.t3.medium"
  allocated_storage       = 100
  storage_encrypted       = true
  backup_retention_period = 7
  multi_az                = true
  db_subnet_group_name    = aws_db_subnet_group.app.name
  vpc_security_group_ids  = [aws_security_group.db.id]
}
```

Pattern notes:
- keep stateful resources private by default
- define backup retention and encryption in the first version
- make high availability an explicit decision, not an accident

## Alarm Pattern

```hcl
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "app-prod-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 75
  alarm_description   = "CPU high for two consecutive periods"
}
```

Pattern notes:
- define an alarm when you define the critical resource
- use thresholds the team can explain and review
- connect alerting design to rollback and incident response decisions

## Planning Checklist

Before applying an IaC change, confirm:
- all provider-specific examples have a cloud-agnostic rationale
- network, identity, and secret dependencies are listed
- rollback covers both failed create and partial update
- health checks, alarms, and backups are defined with the resource

@@ -0,0 +1,110 @@
<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
---
name: ops-monitor
description: Observability and monitoring setup. Use when designing monitoring, defining SLOs/SLIs, configuring alerts, or building dashboards -- covers the three pillars of observability and Golden Signals framework.
---

# Ops Monitor

## Overview

Monitoring exists to detect user-impacting problems early, explain what is happening, and guide a safe response. A service is not production-ready until metrics, logs, traces, alerts, and response expectations are all defined together.

## The Iron Law

```
NO SERVICE IN PRODUCTION WITHOUT MONITORING AND ALERTING
```

## When to Use

Use for:
- new production services
- dashboard and alert design
- SLO and SLI definition
- noisy alert cleanup
- monitoring gaps discovered during incidents

## The Three Pillars

| Pillar | Purpose | Key Question |
|--------|---------|--------------|
| Metrics | trends, alerting, SLO tracking | Is the system healthy over time? |
| Logs | event details, audit trail, debugging context | What happened at a specific time? |
| Traces | request flow across services | Where is the latency or failure path? |

## Golden Signals

| Signal | What To Measure | Example Metrics |
|--------|-----------------|-----------------|
| Latency | request duration for successful and failed work | `http_request_duration_seconds_bucket`, `grpc_server_handling_seconds_bucket` |
| Traffic | request volume or work rate | `http_requests_total`, `rpc_requests_total`, queue throughput |
| Errors | failure rate by type and severity | `http_requests_total{status=~"5.."}`, timeout counters, business error counters |
| Saturation | how close the system is to a limit | CPU, memory, queue depth, disk, connection pool usage |

## SLO And SLI Framework

Define:
- **SLI**
  - what you measure for user-visible behavior
  - examples: availability, latency, correctness
- **SLO**
  - the target and window for that SLI
  - examples: `99.95% availability over 30d`, `99% of requests under 300ms over 30d`

Use `./references/slo-templates.md` for YAML templates that combine availability, latency, correctness, burn-rate alerts, and error-budget policy.

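The SLI/SLO split is a few lines of arithmetic once the good/total event counts exist. A minimal sketch with illustrative request counts (the numbers are not from the toolkit); the same ratio drives the error-budget policy and burn-rate alerts:

```python
# Illustrative counts for an availability SLI over a 30d window.
good_requests = 999_600
total_requests = 1_000_000

sli = good_requests / total_requests    # measured availability
slo = 0.9995                            # target: 99.95% over 30d
error_budget = 1 - slo                  # fraction of requests allowed to fail
budget_used = (1 - sli) / error_budget  # share of the budget consumed so far

assert sli >= slo  # the SLO currently holds
print(f"SLI {sli:.4%}, error budget used {budget_used:.0%}")
```

Dividing observed error rate by the budgeted error rate over a short window gives the burn rate that the multiwindow alerts in `./references/slo-templates.md` threshold on.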
## Error Budget Policy

Use a simple decision policy:
- budget remaining above 50%: normal development
- budget remaining 25% to 50%: reliability review before risky changes
- budget remaining below 25%: prioritize reliability work
- budget exhausted: freeze non-critical deploys until reviewed

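The decision policy above is mechanical enough to encode directly, which keeps the deploy-freeze decision out of debate during an incident. The thresholds are the document's; the function name is a hypothetical sketch:

```python
def budget_policy(remaining: float) -> str:
    """Map remaining error-budget fraction (0.0-1.0) to the policy tier above."""
    if remaining <= 0:
        return "freeze non-critical deploys until reviewed"
    if remaining < 0.25:
        return "prioritize reliability work"
    if remaining <= 0.50:
        return "reliability review before risky changes"
    return "normal development"
```

Wiring this into a deploy pipeline check makes the policy enforceable rather than advisory.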
## Alert Design

Alerts should:
- page on symptoms, not on every possible internal cause
- link to a runbook for first response
- define at least two severities where appropriate
- be tuned so false positives stay below 15%
- point to the dashboard or query needed for triage

An alert that does not change operator behavior should be removed or rewritten.

## Dashboard Design

| Dashboard | Audience | Required Content |
|-----------|----------|------------------|
| Executive | incident leads, managers | SLO status, error budget, active incidents, high-level trends |
| Service | on-call engineers | Golden Signals, dependency health, deploy markers, recent alerts |
| Debug | responders and service owners | detailed metrics, logs, traces, and breakdowns by instance, region, or endpoint |

## <HARD-GATE>

Before a service is treated as production-ready, all must be true:
- SLIs are defined for user-visible behavior
- SLOs are set with explicit targets and windows
- burn-rate alerts are configured
- a Golden Signals dashboard exists
- the on-call team knows which alerts fire and what they mean

## Common Mistakes

| Mistake | Why it fails |
|---------|--------------|
| Alert on every low-level cause | Operators get noise instead of signal |
| Track uptime only | User-visible latency and correctness failures are missed |
| Dashboard with no SLO context | Teams see metrics but not reliability risk |
| No runbook links in alerts | Response time expands during real incidents |
| Collect logs and traces without correlation keys | Cross-system debugging stays slow and manual |

## Execution Handoff

After monitoring is defined:
- validate alerts with a controlled test or drill
- route incident handling through `ops-incident`
- use `ops-debug` for root-cause investigation
- use `ops-verify` before claiming monitoring coverage is complete
