npm - sdtk-ops-kit - Versions diffs - 0.2.0 - Mend

sdtk-ops-kit 0.2.0

Files changed (51)

package/assets/toolkit/toolkit/skills/ops-monitor/references/alert-rules.md ADDED

@@ -0,0 +1,80 @@
+<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
+# Alert Rules
+## Prometheus Rules Example
+```yaml
+groups:
+  - name: application.rules
+    rules:
+      - alert: HighErrorRate
+        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "High error rate detected"
+          description: "Error rate is above threshold for the last 5 minutes"
+          runbook: "docs/runbooks/high-error-rate.md"
+      - alert: HighResponseTime
+        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High response time detected"
+          description: "95th percentile latency is above 500ms"
+          runbook: "docs/runbooks/high-latency.md"
+  - name: infrastructure.rules
+    rules:
+      - alert: HighCPUUsage
+        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High CPU usage detected"
+          description: "CPU usage is above 80% for 5 minutes"
+          runbook: "docs/runbooks/high-cpu.md"
+      - alert: HighMemoryUsage
+        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "High memory usage detected"
+          description: "Memory usage is above 90%"
+          runbook: "docs/runbooks/high-memory.md"
+      - alert: DiskSpaceLow
+        expr: 100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) > 85
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Low disk space"
+          description: "Disk usage is above 85%"
+          runbook: "docs/runbooks/disk-space-low.md"
+      - alert: ServiceDown
+        expr: up == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Service is down"
+          description: "A monitored target has been unavailable for more than 1 minute"
+          runbook: "docs/runbooks/service-down.md"
+```
+## Design Notes
+- alert on user impact and resource exhaustion, not every internal event
+- attach a runbook to every page-worthy alert
+- use warning and critical levels where the system benefits from early warning
+- deduplicate rules that describe the same symptom
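
Each rule above pairs a threshold with a `for:` duration, so a condition has to hold continuously before the alert fires. The following is a minimal Python sketch of that pending-to-firing behaviour, not Prometheus code: the `PendingAlert` class and the sample values are hypothetical, and the 80% threshold and 5-minute window are taken from the `HighCPUUsage` rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Hypothetical model of one rule with a threshold and a `for:` duration."""
    threshold: float
    for_seconds: int
    active_since: Optional[int] = None  # timestamp when the condition first held

    def evaluate(self, now: int, value: float) -> str:
        # Condition not met: any pending state is reset.
        if value <= self.threshold:
            self.active_since = None
            return "inactive"
        # Condition met: remember when it started holding.
        if self.active_since is None:
            self.active_since = now
        # Fire only once the condition has held for the full duration.
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


cpu = PendingAlert(threshold=80.0, for_seconds=300)
# (timestamp in seconds, CPU usage percent); illustrative samples only.
samples = [(0, 85.0), (60, 90.0), (120, 70.0), (180, 88.0), (480, 91.0)]
states = [cpu.evaluate(ts, value) for ts, value in samples]
print(states)
```

The dip at 120 seconds resets the timer, which is why a short recovery keeps a rule from paging.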

package/assets/toolkit/toolkit/skills/ops-monitor/references/slo-templates.md ADDED

@@ -0,0 +1,83 @@
+<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
+# SLO Templates
+## SLI Definition Template
+```yaml
+service: service-name
+owner: owning-team
+review_cadence: monthly
+slis:
+  availability:
+    description: "Successful responses to valid requests"
+    metric: |
+      sum(rate(http_requests_total{service="service-name", status!~"5.."}[5m]))
+      /
+      sum(rate(http_requests_total{service="service-name"}[5m]))
+    good_event: "HTTP status below 500"
+    valid_event: "Any user-visible request excluding health checks"
+  latency:
+    description: "Requests served within the required threshold"
+    metric: |
+      histogram_quantile(0.99,
+        sum(rate(http_request_duration_seconds_bucket{service="service-name"}[5m]))
+        by (le)
+      )
+    threshold: "300ms at p99"
+  correctness:
+    description: "Requests returning correct business results"
+    metric: "1 - (business_logic_errors_total / requests_total)"
+    good_event: "No business logic error"
+```
+## SLO Definition Template
+```yaml
+slos:
+  - sli: availability
+    target: 99.95%
+    window: 30d
+    error_budget: "21.6 minutes per month"
+    burn_rate_alerts:
+      - severity: critical
+        short_window: 5m
+        long_window: 1h
+        burn_rate: 14.4x
+      - severity: warning
+        short_window: 30m
+        long_window: 6h
+        burn_rate: 6x
+  - sli: latency
+    target: 99.0%
+    window: 30d
+    error_budget: "7.2 hours per month"
+  - sli: correctness
+    target: 99.99%
+    window: 30d
+```
+## Error Budget Policy Template
+```yaml
+error_budget_policy:
+  budget_remaining_above_50pct: "Normal feature development"
+  budget_remaining_25_to_50pct: "Reliability review before risky changes"
+  budget_remaining_below_25pct: "Prioritize reliability work until budget recovers"
+  budget_exhausted: "Freeze all non-critical deploys and require leadership review"
+```
+## Review Checklist
+Before approving an SLO definition, confirm:
+- the SLI measures user-visible behavior
+- the target matches business criticality
+- the window is explicit
+- burn-rate alerts exist for fast and slow budget exhaustion
+- the team can explain what action follows each budget state
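
The error budgets and burn rates in the SLO template follow from simple arithmetic over the 30-day window. The sketch below reproduces the numbers as a sanity check; the function names are hypothetical and it assumes a time-based budget over a fixed 30-day window.

```python
WINDOW_DAYS = 30
WINDOW_HOURS = WINDOW_DAYS * 24        # 720 hours
WINDOW_MINUTES = WINDOW_HOURS * 60     # 43,200 minutes


def error_budget_minutes(target_pct: float) -> float:
    """Minutes of allowed unreliability in the window for a given SLO target."""
    return WINDOW_MINUTES * (100.0 - target_pct) / 100.0


def budget_spent_fraction(burn_rate: float, sustained_hours: float) -> float:
    """Fraction of the whole window's budget consumed at a sustained burn rate."""
    return burn_rate * sustained_hours / WINDOW_HOURS


print(f"availability 99.95%: {error_budget_minutes(99.95):.1f} minutes")
print(f"latency 99.0%: {error_budget_minutes(99.0) / 60:.1f} hours")
print(f"14.4x for 1h spends {budget_spent_fraction(14.4, 1):.0%} of the budget")
print(f"6x for 6h spends {budget_spent_fraction(6, 6):.0%} of the budget")
```

Under these assumptions the results match the template: 21.6 minutes and 7.2 hours of budget, with the critical alert corresponding to roughly 2% of the monthly budget spent in one hour and the warning alert to roughly 5% spent in six hours.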

package/assets/toolkit/toolkit/skills/ops-parallel/SKILL.md ADDED

@@ -0,0 +1,177 @@
+<!-- Based on superpowers by Jesse Vincent (MIT License, 2025). Adapted for SDTK-OPS. -->
+---
+name: ops-parallel
+description: Parallel operations dispatch. Use when facing 2+ independent infrastructure tasks that can be worked on without shared state or sequential dependencies.
+---
+# Ops Parallel
+## Overview
+You delegate tasks to specialized agents with isolated context. By precisely crafting their instructions and context, you ensure they stay focused and succeed at their task. They should never inherit your session's context or history. You construct exactly what they need. This also preserves your own context for coordination work.
+When you have multiple unrelated operational problems or task slices, investigating them sequentially wastes time. Each investigation is independent and can happen in parallel.
+**Core principle:** Dispatch one agent per independent problem domain. Let them work concurrently.
+## When to Use
+```dot
+digraph when_to_use {
+    "Multiple failures?" [shape=diamond];
+    "Are they independent?" [shape=diamond];
+    "Single agent investigates all" [shape=box];
+    "One agent per problem domain" [shape=box];
+    "Can they work in parallel?" [shape=diamond];
+    "Sequential agents" [shape=box];
+    "Parallel dispatch" [shape=box];
+    "Multiple failures?" -> "Are they independent?" [label="yes"];
+    "Are they independent?" -> "Single agent investigates all" [label="no - related"];
+    "Are they independent?" -> "Can they work in parallel?" [label="yes"];
+    "Can they work in parallel?" -> "Parallel dispatch" [label="yes"];
+    "Can they work in parallel?" -> "Sequential agents" [label="no - shared state"];
+}
+```
+**Use when:**
+- multiple infrastructure failures have different root causes
+- multiple subsystems need work independently
+- each problem can be understood without context from the others
+- no shared state exists between investigations
+**Do not use when:**
+- failures are related (fixing one might fix others)
+- you need to understand the full system state first
+- agents would interfere with each other
+## The Pattern
+### 1. Identify Independent Domains
+Group failures by what is broken:
+- monitoring alert thresholds
+- CI/CD deployment gates
+- backup retention procedures
+Each domain is independent. Fixing monitoring thresholds should not affect backup retention.
+### 2. Create Focused Agent Tasks
+Each agent gets:
+- **Specific scope:** one subsystem, environment, or operational workflow
+- **Clear goal:** solve exactly one task slice
+- **Constraints:** do not change unrelated code or infrastructure
+- **Expected output:** summary of what was found and what changed
+### 3. Dispatch in Parallel
+```text
+Task("Tune monitoring alerts for checkout service")
+Task("Fix CI deployment gate for staging rollouts")
+Task("Update backup retention procedure and verification notes")
+```
+All three run concurrently.
+### 4. Review and Integrate
+When agents return:
+- read each summary
+- verify fixes do not conflict
+- run the top-level verification for the combined result
+- integrate all accepted changes
+## Agent Prompt Structure
+Good agent prompts are:
+1. **Focused** - one clear problem domain
+2. **Self-contained** - all context needed to understand the problem
+3. **Specific about output** - what the agent should return
+```markdown
+Fix the monitoring configuration for checkout alerts:
+1. alert fires on every rollout even when service recovers within 30 seconds
+2. paging threshold should reflect sustained error rate, not one failed probe
+3. do not change CI/CD or backup files
+Your task:
+1. read the current alert config and related runbook
+2. identify the root cause of noisy alerts
+3. fix only the monitoring slice
+4. return a short summary of root cause and changes
+Do NOT refactor unrelated infrastructure.
+```
+## Common Mistakes
+**Bad:** "Fix all infra issues" - agent gets lost
+**Good:** "Fix checkout monitoring alerts" - focused scope
+**Bad:** "Fix the pipeline" - no context
+**Good:** include failing step names, exact files, and constraints
+**Bad:** no constraints - agent may refactor everything
+**Good:** "Do NOT touch unrelated systems"
+**Bad:** vague output - you do not know what changed
+**Good:** "Return summary of root cause and changes"
+## When NOT to Use
+**Related failures:** fixing one might fix others, so investigate together first
+**Need full context:** understanding requires seeing the whole system
+**Exploratory debugging:** you do not know what is broken yet
+**Shared state:** agents would interfere by editing the same files or environments
+## Real Example from Session
+**Scenario:** three independent operations tasks after a reliability review
+**Tasks:**
+- monitoring setup for checkout alerts needs tuning
+- CI/CD pipeline rollout gate is missing a health checkpoint
+- backup procedure lacks a restore verification note
+**Decision:** independent domains. Monitoring, CI/CD, and backup procedures can be investigated separately.
+**Dispatch:**
+```
+Agent 1 -> Fix monitoring alert configuration
+Agent 2 -> Fix CI/CD rollout gate
+Agent 3 -> Fix backup verification procedure
+```
+**Results:**
+- Agent 1: tuned thresholds and reduced noisy paging
+- Agent 2: added a health gate before promotion
+- Agent 3: documented restore verification evidence
+**Integration:** all fixes were independent, no conflicts, combined verification succeeded
+**Time saved:** three problem domains advanced in parallel instead of one by one
+## Key Benefits
+1. **Parallelization** - multiple investigations happen simultaneously
+2. **Focus** - each agent has narrow scope and less context to track
+3. **Independence** - agents do not interfere with each other
+4. **Speed** - independent problems advance concurrently, so elapsed time approaches that of the slowest task rather than the sum of all tasks
+## Verification
+After agents return:
+1. **Review each summary** - understand what changed
+2. **Check for conflicts** - did agents edit the same files or assumptions?
+3. **Run top-level verification** - use `ops-verify` on the combined result
+4. **Spot check** - agents can make systematic errors
+## Real-World Impact
+From debugging sessions:
+- multiple independent failures were investigated concurrently
+- all investigations completed faster than a sequential pass
+- focused prompts reduced confusion and rework
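
The dispatch-then-review pattern in this skill can be sketched in Python as below. This is an illustration only: `run_agent` is a hypothetical placeholder for whatever actually dispatches an agent, and the file paths are invented. It shows concurrent dispatch followed by the "did agents edit the same files?" conflict check.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical task slices and the files each agent reports touching.
TASKS = {
    "Tune monitoring alerts for checkout service": ["monitoring/alerts.yaml"],
    "Fix CI deployment gate for staging rollouts": ["ci/deploy.yaml"],
    "Update backup retention procedure and verification notes": ["runbooks/backup.md"],
}


def run_agent(task):
    """Placeholder for a real agent call; returns a summary and touched files."""
    return {"task": task, "summary": f"completed: {task}", "files": TASKS[task]}


def find_conflicts(results):
    """Return (task_a, task_b, shared_files) for every overlapping pair."""
    conflicts = []
    for a, b in combinations(results, 2):
        shared = set(a["files"]) & set(b["files"])
        if shared:
            conflicts.append((a["task"], b["task"], shared))
    return conflicts


# One worker per independent domain; results come back in task order.
with ThreadPoolExecutor(max_workers=len(TASKS)) as pool:
    results = list(pool.map(run_agent, TASKS))

conflicts = find_conflicts(results)
print("conflicts:", conflicts)
```

An empty conflict list is only the file-level check; the combined result would still go through top-level verification as the skill describes.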

package/assets/toolkit/toolkit/skills/ops-plan/SKILL.md ADDED

@@ -0,0 +1,169 @@
+<!-- Based on superpowers by Jesse Vincent (MIT License, 2025) and gstack by Garry Tan (MIT License, 2026). Adapted for SDTK-OPS. -->
+---
+name: ops-plan
+description: Infrastructure and operations planning. Use when planning infrastructure changes, deployment strategies, or operational procedures before execution -- numbered steps, dependency ordering, rollback strategy per step.
+---
+# Ops Plan
+## Overview
+Write infrastructure and operations plans assuming the implementer has zero context for the system, environment, or operational history. Document exactly what changes, in what order, how to verify each step, and how to roll it back safely.
+Keep the plan reviewable, explicit, and small enough to execute without improvisation.
+## Scope Check
+If the request covers multiple independent systems, environments, or operational initiatives, suggest splitting it into separate plans. One plan should produce one coherent, reviewable operational outcome.
+Challenge scope before planning:
+- What already exists that partially solves this?
+- What is the minimum change that achieves the goal safely?
+- Which work is required now, and which work can be deferred?
+## File Structure And Affected Systems
+Before defining tasks, map out what will be created or modified and what each item owns.
+Include:
+- manifests, IaC modules, runbooks, scripts, or policy files
+- affected services, environments, regions, or accounts
+- external dependencies such as DNS, secrets, CI/CD, or databases
+Lock in decomposition here. Each task should have one clear operational responsibility.
+## Bite-Sized Task Granularity
+Each step should be one operational action that can be verified independently.
+Examples:
+- update one manifest or values file
+- validate one configuration change
+- apply one change to staging
+- verify one health gate
+- record one rollback checkpoint
+Do not write giant plan steps like "deploy all infrastructure" or "complete migration".
+## Plan Document Header
+Every plan MUST start with this header:
+```markdown
+# [Change Name] Operations Plan
+> For implementers: execute steps in order. Do not mark any step complete until `ops-verify` confirms the expected evidence.
+**Goal:** [One sentence describing what this change achieves]
+**Architecture:** [2-3 sentences about the approach]
+**Affected Systems:** [Services, environments, accounts, regions, pipelines]
+**Rollback Strategy:** [One sentence summary of rollback posture]
+**Risk Level:** [LOW | MEDIUM | HIGH]
+---
+```
+## Required Sections For Infrastructure Plans
+Every infrastructure plan must include:
+- **Resource Sizing**
+  - CPU, memory, storage estimates
+  - auto-scaling boundaries
+- **Networking**
+  - DNS changes
+  - ingress or load balancer updates
+  - security groups, network policies, firewall rules
+- **Security**
+  - IAM roles
+  - secrets handling
+  - network policy or access boundary changes
+- **Rollback Checklist per step**
+  - use this exact shape:
+    | Step | Rollback Action | Verification |
+    |------|-----------------|--------------|
+    | 1 | Revert manifest to previous revision | Health endpoint returns 200 |
+- **Migration Safety**
+  - backward compatibility
+  - feature flags
+  - dual-write, dual-read, or phased rollout needs
+## Assumption Tracking
+Record assumptions in this exact table format:
+| # | Assumption | Verified | Risk if wrong |
+|---|------------|----------|---------------|
+| A1 | Example assumption | No | Medium |
+Do not bury assumptions in prose. If an assumption can break rollout or rollback, it must be tracked explicitly.
+## Infrastructure Review Lens
+Before approving the plan, check:
+- dependency order across systems and environments
+- blast radius if one step fails
+- rollback feasibility after each step
+- health checks, smoke checks, and stability windows
+- happy path, no-op path, failure path, and rollback path
+- observability coverage during and after the change
+- operator-visible evidence for each important milestone
+## Task Structure
+Use this shape for each task:
+````markdown
+### Task N: [Change Slice]
+**Files / Systems:**
+- Modify: `exact/path/to/file`
+- Environment: `staging|production|shared`
+- Verify: `exact command or evidence`
+**Rollback Checklist:**
+| Step | Rollback Action | Verification |
+|------|-----------------|--------------|
+| N | [Exact rollback action] | [Exact evidence] |
+- [ ] **Step 1: Prepare the change**
+  - update the target manifest, script, or runbook
+- [ ] **Step 2: Validate locally or in dry-run mode**
+  - run the exact validation command
+- [ ] **Step 3: Apply to the target environment**
+  - perform one bounded operational change
+- [ ] **Step 4: Verify expected state**
+  - record the exact health or status evidence
+- [ ] **Step 5: Record rollback checkpoint**
+  - confirm the rollback command and previous good state
+````
+## Common Mistakes
+| Mistake | Why it fails |
+|---------|--------------|
+| "Implement the infrastructure" as one step | No safe checkpoint, no isolated rollback |
+| Missing rollback action per task | Recovery becomes guesswork during incidents |
+| No resource sizing or network notes | Hidden capacity and connectivity risks surface late |
+| Security assumptions left implicit | Secrets, IAM, or access drift breaks rollout |
+| Verification only at the end | You lose the exact step where the system broke |
+| Migration plan ignores backward compatibility | Deploy succeeds but runtime traffic fails |
+## Execution Handoff
+After saving the plan:
+- review assumptions, dependencies, and rollback notes
+- use `ops-parallel` only for truly independent slices
+- invoke `ops-verify` before marking any step complete
+The plan is not complete until the implementer can execute it without guessing.
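
The dependency ordering this skill asks for can be checked mechanically. The sketch below uses Python's standard `graphlib` module; the step names are hypothetical, and treating rollback as the reverse of execution order is one common convention, not something the skill prescribes.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical plan steps mapped to the steps they depend on.
steps = {
    "update-manifest": set(),
    "validate-dry-run": {"update-manifest"},
    "apply-staging": {"validate-dry-run"},
    "verify-health-gate": {"apply-staging"},
    "record-rollback-checkpoint": {"verify-health-gate"},
}


def execution_order(graph):
    """Return steps so that every dependency comes before its dependents."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"plan has a circular dependency: {exc.args[1]}") from exc


order = execution_order(steps)
rollback_order = list(reversed(order))  # assumption: undo in reverse order

for number, step in enumerate(order, start=1):
    print(number, step)
```

A plan whose steps cannot be ordered this way has a circular dependency and is not yet safe to execute step by step.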

package/assets/toolkit/toolkit/skills/ops-security-infra/SKILL.md ADDED

@@ -0,0 +1,126 @@
+<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
+---
+name: ops-security-infra
+description: Infrastructure security. Use when hardening infrastructure, managing secrets, defining network policies, or setting up security scanning in CI/CD -- covers STRIDE for infrastructure, secrets management, and detection-as-code.
+---
+# Ops Security Infra
+## Overview
+Infrastructure security must be designed into identity, network, secrets, logging, and delivery paths at the same time. The goal is not perfect theoretical safety. The goal is to remove obvious exposure, narrow trust boundaries, and make security controls reviewable and repeatable.
+## When to Use
+Use for:
+- infrastructure hardening reviews
+- IAM and access-boundary changes
+- secrets-management design
+- network policy and ingress review
+- CI/CD security scanning design
+- detection rule delivery and coverage tracking
+## STRIDE For Infrastructure
+| Threat | Infrastructure Example | Primary Control |
+|--------|------------------------|-----------------|
+| Spoofing | compromised IAM role or forged workload identity | MFA, short-lived credentials, strong workload identity |
+| Tampering | unauthorized infrastructure-as-code or config change | peer review, protected branches, signed change path |
+| Repudiation | operator denies a risky change | immutable audit logging |
+| Information Disclosure | secrets exposed in config, logs, or state files | secret manager, redaction, encryption |
+| Denial of Service | public endpoints or shared resources overwhelmed | rate limiting, WAF, scaling limits, network controls |
+| Elevation of Privilege | broad admin roles or wildcard policies | least-privilege IAM and isolated admin paths |
+## Network Security
+Use these defaults:
+- security groups and firewall rules default deny
+- no `0.0.0.0/0` access except for intentionally public load balancers or edge endpoints
+- private services stay private by default
+- Kubernetes network policies start from deny-all and then allow only required traffic
+- management interfaces live behind stronger access controls than the workload path
+## Secrets Management
+| Option | When To Use | Security Level |
+|--------|-------------|----------------|
+| Cloud KMS | key management and envelope encryption | high |
+| Secret Manager | application and infrastructure secrets at runtime | high |
+| HashiCorp Vault | complex multi-platform secret workflows and dynamic credentials | high |
+| Sealed Secrets | Kubernetes-native encrypted secret delivery | medium to high |
+Rules:
+- never hardcode secrets in repo files
+- never log secrets or raw tokens
+- prefer short-lived credentials over long-lived static keys
+- rotate secrets on a fixed schedule and on compromise
+## <HARD-GATE>
+NEVER:
+- store secrets in git
+- log secrets
+- leave credentials unrotated beyond 90 days without reviewed exception
+ALWAYS:
+- prefer short-lived credentials such as OIDC or STS where available
+- audit secret access
+- review who can decrypt or retrieve production secrets
+## CI CD Security Pipeline
+The pipeline should scan:
+- static code issues with Semgrep or equivalent
+- dependency and image vulnerabilities with Trivy or equivalent
+- secret exposure with Gitleaks or equivalent
+Use `./references/cicd-security-pipeline.md` for a GitHub Actions example adapted from the source material.
+## Detection As Code
+Treat security detections as code:
+- keep rules in version control
+- validate them in CI before deployment
+- map each rule to MITRE ATT&CK coverage where that model is relevant
+- record known false positives and data-source dependencies
+- deploy through a controlled pipeline, not ad hoc console edits
+The goal is awareness and repeatability, not a specific SIEM vendor.
+## Security Hardening Checklist
+Review these items:
+1. default-deny network posture
+2. encryption at rest
+3. encryption in transit
+4. secrets in a manager, not config files
+5. least-privilege IAM
+6. security scanning in CI/CD
+7. immutable or strongly protected audit logging
+8. container image scanning
+9. SSH key rotation or stronger admin access controls
+10. MFA for privileged infrastructure access
+## Common Mistakes
+| Mistake | Why it fails |
+|---------|--------------|
+| Overly permissive IAM | One compromise turns into full-environment access |
+| Secrets in environment dumps or logs | Detection becomes recovery plus disclosure response |
+| Security treated as a final review step | Core architecture assumptions stay unsafe |
+| No audit logging for privileged actions | Investigation and compliance both fail |
+| Security scans exist but do not block anything | Vulnerable changes keep shipping |
+## Reference Files
+Use:
+- `./references/cicd-security-pipeline.md`
+- `./references/security-headers.md`
+## Execution Handoff
+After defining the security change:
+- send infrastructure boundary changes to `ops-infra-plan`
+- route CI/CD controls through `ops-ci-cd`
+- verify the final control state with `ops-verify`
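
Two of the rules in this skill lend themselves to automated checks: no `0.0.0.0/0` access outside intentionally public endpoints, and no credential left unrotated beyond 90 days without a reviewed exception. The following is a rough Python sketch under assumed data shapes; the rule and credential records, field names, and dates are all hypothetical.

```python
import ipaddress
from datetime import date, timedelta

MAX_CREDENTIAL_AGE = timedelta(days=90)


def is_world_open(cidr):
    """True for 0.0.0.0/0 or ::/0, i.e. a network with prefix length 0."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def open_rule_violations(rules):
    """Names of world-open rules not marked as intentionally public."""
    return [
        r["name"] for r in rules
        if is_world_open(r["cidr"]) and not r.get("public_edge", False)
    ]


def stale_credentials(creds, today):
    """Names of credentials past the age limit without a reviewed exception."""
    return [
        c["name"] for c in creds
        if today - c["rotated_on"] > MAX_CREDENTIAL_AGE
        and not c.get("reviewed_exception", False)
    ]


# Hypothetical inventory, for illustration only.
rules = [
    {"name": "edge-lb-https", "cidr": "0.0.0.0/0", "public_edge": True},
    {"name": "db-admin", "cidr": "0.0.0.0/0"},
    {"name": "app-internal", "cidr": "10.0.0.0/16"},
]
creds = [
    {"name": "ci-deploy-key", "rotated_on": date(2025, 1, 15)},
    {"name": "app-db-password", "rotated_on": date(2025, 5, 1)},
    {"name": "legacy-vendor-token", "rotated_on": date(2024, 11, 1),
     "reviewed_exception": True},
]

print(open_rule_violations(rules))
print(stale_credentials(creds, today=date(2025, 6, 1)))
```

Checks like these fit the detection-as-code guidance: kept in version control and run in CI, not performed by hand.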

package/assets/toolkit/toolkit/skills/ops-security-infra/references/cicd-security-pipeline.md ADDED

@@ -0,0 +1,55 @@
+<!-- Based on agency-agents by AgentLand Contributors (MIT License, 2025). Adapted for SDTK-OPS. -->
+# CI CD Security Pipeline
+```yaml
+name: Security Scan
+on:
+  pull_request:
+    branches: [main]
+jobs:
+  sast:
+    name: Static Analysis
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Run Semgrep SAST
+        uses: semgrep/semgrep-action@v1
+        with:
+          config: >-
+            p/owasp-top-ten
+            p/cwe-top-25
+  dependency_scan:
+    name: Dependency Audit
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Run Trivy vulnerability scanner
+        uses: aquasecurity/trivy-action@master
+        with:
+          scan-type: fs
+          severity: CRITICAL,HIGH
+          exit-code: "1"
+  secrets_scan:
+    name: Secrets Detection
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - name: Run Gitleaks
+        uses: gitleaks/gitleaks-action@v2
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+```
+## Pattern Notes
+- fail the pipeline on critical findings, not just log them
+- keep secrets in the platform secret store
+- expand beyond GitHub Actions only if the team actually uses another runner
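
The first pattern note, failing the pipeline instead of only logging, comes down to turning findings into a non-zero exit code. Here is a minimal Python sketch of that gate; the findings format and IDs are hypothetical and do not represent the output of Trivy, Semgrep, or Gitleaks.

```python
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}  # mirrors `severity: CRITICAL,HIGH` above


def gate_exit_code(findings, blocking=BLOCKING_SEVERITIES):
    """Return 1 when any finding has a blocking severity, otherwise 0."""
    blocked = [f for f in findings if f["severity"].upper() in blocking]
    for finding in blocked:
        print(f"BLOCKING {finding['severity']}: {finding['id']}")
    return 1 if blocked else 0


# Hypothetical findings, not real scanner output.
findings = [
    {"id": "EXAMPLE-0001", "severity": "MEDIUM"},
    {"id": "EXAMPLE-0002", "severity": "CRITICAL"},
]
exit_code = gate_exit_code(findings)
print("exit code:", exit_code)
```

A CI step would pass this value to `sys.exit` so the job fails, which is the role `exit-code: "1"` plays in the Trivy step.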