npm - @intentsolutionsio/fairdb-ops-manager - Versions diffs - 1.0.0 - Mend

@intentsolutionsio/fairdb-ops-manager 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26)

package/.claude-plugin/plugin.json +22 -0
package/LICENSE +21 -0
package/README.md +609 -0
package/agents/fairdb-incident-responder.md +365 -0
package/agents/fairdb-ops-auditor.md +525 -0
package/agents/fairdb-setup-wizard.md +393 -0
package/commands/daily-health-check.md +225 -0
package/commands/incident-p0-database-down.md +318 -0
package/commands/incident-p0-disk-full.md +344 -0
package/commands/sop-001-vps-setup.md +84 -0
package/commands/sop-002-postgres-install.md +104 -0
package/commands/sop-003-backup-setup.md +160 -0
package/package.json +45 -0
package/scripts/backup-status.sh +122 -0
package/scripts/pg-health-check.sh +74 -0
package/scripts/sop-checklist.sh +354 -0
package/skills/skill-adapter/assets/README.md +5 -0
package/skills/skill-adapter/assets/config-template.json +32 -0
package/skills/skill-adapter/assets/skill-schema.json +28 -0
package/skills/skill-adapter/assets/test-data.json +27 -0
package/skills/skill-adapter/references/README.md +4 -0
package/skills/skill-adapter/references/best-practices.md +69 -0
package/skills/skill-adapter/references/examples.md +73 -0
package/skills/skill-adapter/scripts/README.md +11 -0
package/skills/skill-adapter/scripts/helper-template.sh +42 -0
package/skills/skill-adapter/scripts/validation.sh +32 -0

package/commands/incident-p0-database-down.md ADDED

@@ -0,0 +1,318 @@
+---
+name: incident-p0-database-down
+description: Emergency response procedure for SOP-201 P0 - Database Down (Critical)
+model: sonnet
+---
+# SOP-201: P0 - Database Down (CRITICAL)
+🚨 **EMERGENCY INCIDENT RESPONSE**
+You are responding to a **P0 CRITICAL incident**: PostgreSQL database is down.
+## Severity: P0 - CRITICAL
+- **Impact:** ALL customers affected
+- **Response Time:** IMMEDIATE
+- **Resolution Target:** <15 minutes
+## Your Mission
+Guide rapid diagnosis and recovery with:
+- Systematic troubleshooting steps
+- Clear commands for each check
+- Fast recovery procedures
+- Customer communication templates
+- Post-incident documentation
+## IMMEDIATE ACTIONS (First 60 seconds)
+### 1. Verify the Issue
+```bash
+# Is PostgreSQL running?
+sudo systemctl status postgresql
+# Can we connect?
+sudo -u postgres psql -c "SELECT 1;"
+# Check recent logs
+sudo tail -100 /var/log/postgresql/postgresql-16-main.log
+```
+### 2. Alert Stakeholders
+**Post to incident channel IMMEDIATELY:**
+```
+🚨 P0 INCIDENT - Database Down
+Time: [TIMESTAMP]
+Server: VPS-XXX
+Impact: All customers unable to connect
+Status: Investigating
+ETA: TBD
+```
+## DIAGNOSTIC PROTOCOL
+### Check 1: Service Status
+```bash
+sudo systemctl status postgresql
+sudo systemctl status pgbouncer  # If installed
+```
+**Possible states:**
+- `inactive (dead)` → Service stopped
+- `failed` → Service crashed
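+- `active (running)` → Service running but not responding
+- `active (exited)` → On Debian/Ubuntu, `postgresql` is a wrapper unit; check the cluster unit with `sudo systemctl status postgresql@16-main`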
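The states listed above map fairly directly onto later steps in this runbook. A minimal sketch of that mapping, assuming the state string comes from `systemctl is-active postgresql`; the function name `next_step_for_state` is hypothetical and not part of the package:

```shell
# Map a systemd unit state to a suggested next step in this runbook.
# Hypothetical helper; the argument is expected to be the output of
# `systemctl is-active postgresql` (e.g. "active", "inactive", "failed").
next_step_for_state() {
    case "$1" in
        inactive) echo "stopped: try Recovery 1 (simple restart)" ;;
        failed)   echo "crashed: run Check 4 (logs) before restarting" ;;
        active)   echo "running but unresponsive: run Check 2 (processes, ports)" ;;
        *)        echo "unknown state '$1': inspect systemctl status output" ;;
    esac
}
```

On the affected server this could be called as `next_step_for_state "$(systemctl is-active postgresql)"`.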
+### Check 2: Process Status
+```bash
+# Check for PostgreSQL processes
+ps aux | grep postgres
+# Check listening ports
+sudo ss -tlnp | grep 5432
+sudo ss -tlnp | grep 6432  # pgBouncer
+```
+### Check 3: Disk Space
+```bash
+df -h /var/lib/postgresql
+```
+⚠️ **If disk is full (100%):**
+- This is likely the cause!
+- Jump to the "Recovery 3: Disk Full Emergency" section
+### Check 4: Log Analysis
+```bash
+# Check for errors in PostgreSQL log
+sudo grep -i "error\|fatal\|panic" /var/log/postgresql/postgresql-16-main.log | tail -50
+# Check system logs
+sudo journalctl -u postgresql -n 100 --no-pager
+# Check for OOM (Out of Memory) kills
+sudo grep -i "killed process" /var/log/syslog | grep postgres
+```
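A quick severity tally can help pick a recovery path before reading through pages of matches. A minimal sketch, assuming the stock log format in which each entry carries its severity as `ERROR:`, `FATAL:` or `PANIC:`; `summarize_pg_log` is a hypothetical helper, not a script shipped in this package:

```shell
# Count ERROR, FATAL and PANIC entries in PostgreSQL log text read from stdin.
# Hypothetical helper; assumes the severity appears as " ERROR: ", " FATAL: "
# or " PANIC: " after the timestamp and PID prefix.
summarize_pg_log() {
    awk '
        / ERROR: / { e++ }
        / FATAL: / { f++ }
        / PANIC: / { p++ }
        END { printf "ERROR=%d FATAL=%d PANIC=%d\n", e, f, p }
    '
}
```

For example: `sudo tail -1000 /var/log/postgresql/postgresql-16-main.log | summarize_pg_log`. Any non-zero PANIC count points toward Recovery 5 rather than a plain restart.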
+### Check 5: Configuration Issues
+```bash
+# Test PostgreSQL config: -C prints one setting and exits, failing if the config cannot be parsed
+sudo -u postgres /usr/lib/postgresql/16/bin/postgres -C shared_buffers -D /etc/postgresql/16/main
+# Check for lock files
+ls -la /var/run/postgresql/
+ls -la /var/lib/postgresql/16/main/postmaster.pid
+```
+## RECOVERY PROCEDURES
+### Recovery 1: Simple Service Restart
+**If service is stopped but no obvious errors:**
+```bash
+# Start PostgreSQL
+sudo systemctl start postgresql
+# Check status
+sudo systemctl status postgresql
+# Test connection
+sudo -u postgres psql -c "SELECT version();"
+# Monitor logs
+sudo tail -f /var/log/postgresql/postgresql-16-main.log
+```
+**✅ If successful:** Jump to the "POST-RECOVERY ACTIONS" section
+### Recovery 2: Remove Stale PID File
+**If error mentions "postmaster.pid already exists":**
+```bash
+# Stop PostgreSQL (if running)
+sudo systemctl stop postgresql
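+# Confirm no postgres process is still alive (removing the PID file under a live server risks corruption)
+pgrep -a postgres
+# Remove stale PID file
+sudo rm /var/lib/postgresql/16/main/postmaster.pid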
+# Start PostgreSQL
+sudo systemctl start postgresql
+# Verify
+sudo systemctl status postgresql
+sudo -u postgres psql -c "SELECT 1;"
+```
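Before deleting `postmaster.pid`, it is worth confirming that the PID recorded in it is really dead. A minimal sketch, assuming a Linux host and relying on the fact that the first line of `postmaster.pid` holds the postmaster's PID; `pid_file_is_stale` is a hypothetical helper name:

```shell
# Report whether a postmaster.pid file is stale, i.e. the PID recorded on
# its first line no longer belongs to a live process.
# Hypothetical helper. Returns 0 when stale, 1 when the process is alive
# or the file cannot be read or parsed.
pid_file_is_stale() {
    pid=$(head -n 1 "$1" 2>/dev/null) || return 1
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # missing or non-numeric PID: do not guess
    esac
    # kill -0 sends no signal; it only tests that the process exists.
    # The /proc check covers processes owned by another user (Linux only).
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        return 1
    fi
    return 0
}
```

The file is readable only by `postgres`, so run the check as that user or as root. It does not confirm that a live PID is actually PostgreSQL, so treat a "not stale" answer as a prompt to look at `ps` output rather than as proof.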
+### Recovery 3: Disk Full Emergency
+**If disk is 100% full:**
+```bash
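+# Find largest directories (no shell glob: it would expand before sudo applies)
+sudo du -h --max-depth=1 /var/lib/postgresql/16/main | sort -rh | head -10
+# Option A: Clear old logs
+sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete
+# Option B: VACUUM FULL (needs a running server and temporary free space for each table copy,
+# so run it only after Option A has freed some room)
+sudo -u postgres vacuumdb --all --full
+# Option C: Remove old WAL files (DANGER!)
+# Only if you have confirmed backups! Replace the segment name below with the
+# "REDO WAL file" reported by pg_controldata; every older segment is removed.
+sudo -u postgres pg_archivecleanup /var/lib/postgresql/16/main/pg_wal 000000010000000000000010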
+# Check space
+df -h /var/lib/postgresql
+# Start PostgreSQL
+sudo systemctl start postgresql
+```
+### Recovery 4: Configuration Fix
+**If config test fails:**
+```bash
+# Restore backup config
+sudo cp /etc/postgresql/16/main/postgresql.conf.backup /etc/postgresql/16/main/postgresql.conf
+sudo cp /etc/postgresql/16/main/pg_hba.conf.backup /etc/postgresql/16/main/pg_hba.conf
+# Start PostgreSQL
+sudo systemctl start postgresql
+```
+### Recovery 5: Database Corruption (WORST CASE)
+**If logs show corruption errors:**
+```bash
+# Stop PostgreSQL
+sudo systemctl stop postgresql
+# Run filesystem check (if safe to do so)
+# sudo fsck /dev/sdX  # Only if unmounted!
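+# Try single-user mode recovery (Debian keeps the config under /etc, so pass config_file)
+sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D /var/lib/postgresql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf postgres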
+# If that fails, restore from backup (SOP-204)
+```
+⚠️ **At this point, escalate to backup restoration procedure!**
+## POST-RECOVERY ACTIONS
+### 1. Verify Full Functionality
+```bash
+# Test connections
+sudo -u postgres psql -c "SELECT version();"
+# Check all databases
+sudo -u postgres psql -c "\l"
+# Test customer database access (example)
+sudo -u postgres psql -d customer_db_001 -c "SELECT 1;"
+# Check active connections
+sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
+# Run health check
+/opt/fairdb/scripts/pg-health-check.sh
+```
+### 2. Update Incident Status
+```
+✅ RESOLVED - Database Restored
+Resolution Time: [X minutes]
+Root Cause: [Brief description]
+Recovery Method: [Which recovery procedure used]
+Customer Impact: [Duration of outage]
+Follow-up: [Post-mortem scheduled]
+```
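The `Resolution Time` field can be filled in from two timestamps instead of by hand. A minimal sketch, assuming both values are Unix epoch seconds and using the 15-minute target stated at the top of this SOP; `resolution_summary` is a hypothetical helper name:

```shell
# Print the outage length in whole minutes (rounded down) and whether the
# 15-minute P0 resolution target was met.
# Hypothetical helper; $1 is the start time and $2 the end time, both in
# Unix epoch seconds.
resolution_summary() {
    minutes=$(( ($2 - $1) / 60 ))
    if [ "$minutes" -lt 15 ]; then
        echo "Resolution Time: ${minutes} minutes (within 15-minute target)"
    else
        echo "Resolution Time: ${minutes} minutes (target missed, see ESCALATION CRITERIA)"
    fi
}
```

One way to use it is to record `START=$(date +%s)` when the incident is declared and call `resolution_summary "$START" "$(date +%s)"` once service is verified.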
+### 3. Customer Communication
+**Template:**
+```
+Subject: [RESOLVED] Database Service Interruption
+Dear FairDB Customer,
+We experienced a brief service interruption affecting database
+connectivity from [START_TIME] to [END_TIME] ([DURATION]).
+The issue has been fully resolved and all services are operational.
+Root Cause: [Brief explanation]
+Resolution: [What we did]
+Prevention: [Steps to prevent recurrence]
+We apologize for any inconvenience. If you continue to experience
+issues, please contact support@fairdb.io.
+- FairDB Operations Team
+```
+### 4. Document Incident
+Create incident report at `/opt/fairdb/incidents/YYYY-MM-DD-database-down.md`:
+```markdown
+# Incident Report: Database Down
+**Incident ID:** INC-YYYYMMDD-001
+**Severity:** P0 - Critical
+**Date:** YYYY-MM-DD
+**Duration:** X minutes
+## Timeline
+- HH:MM - Issue detected
+- HH:MM - Investigation started
+- HH:MM - Root cause identified
+- HH:MM - Resolution implemented
+- HH:MM - Service restored
+- HH:MM - Verified functionality
+## Root Cause
+[Detailed explanation]
+## Impact
+- Customers affected: X
+- Downtime: X minutes
+- Data loss: None / [describe if any]
+## Resolution
+[Detailed steps taken]
+## Prevention
+[Action items to prevent recurrence]
+## Follow-up Tasks
+- [ ] Review monitoring alerts
+- [ ] Update runbooks
+- [ ] Implement preventive measures
+- [ ] Schedule post-mortem meeting
+```
+## ESCALATION CRITERIA
+Escalate if:
+- ❌ Cannot restore service within 15 minutes
+- ❌ Data corruption suspected
+- ❌ Backup restoration required
+- ❌ Multiple VPS affected
+- ❌ Security incident suspected
+**Escalation contacts:** [Document your escalation chain]
+## START RESPONSE
+Begin by asking:
+1. "What symptoms are you seeing? (Can't connect, service down, etc.)"
+2. "When did the issue start?"
+3. "Are you on the affected server now?"
+Then immediately execute Diagnostic Protocol starting with Check 1.
+**Remember:** Speed is critical. Every minute counts. Stay calm, work systematically.

package/commands/incident-p0-disk-full.md ADDED

@@ -0,0 +1,344 @@
+---
+name: incident-p0-disk-full
+description: Emergency response for SOP-203 P0 - Disk Space Emergency
+model: sonnet
+---
+# SOP-203: P0 - Disk Space Emergency
+🚨 **CRITICAL: Disk Space Above 95% (or at 100%)**
+You are responding to a **disk space emergency** that threatens database operations.
+## Severity: P0 - CRITICAL
+- **Impact:** Database writes failing, potential data loss
+- **Response Time:** IMMEDIATE
+- **Resolution Target:** <30 minutes
+## IMMEDIATE DANGER SIGNS
+If disk is at 100%:
+- ❌ PostgreSQL cannot write data
+- ❌ WAL files cannot be created
+- ❌ Transactions will fail
+- ❌ Database may crash
+- ❌ Backups will fail
+**Act NOW to free space!**
+## RAPID ASSESSMENT
+### 1. Check Current Usage
+```bash
+# Overall disk usage
+df -h
+# PostgreSQL data directory (owned by postgres, so sudo is required)
+sudo du -sh /var/lib/postgresql/16/main
+# Find largest directories (no shell glob: it would expand before sudo applies)
+sudo du -h --max-depth=1 /var/lib/postgresql/16/main | sort -rh | head -10
+# Find largest files
+sudo find /var/lib/postgresql/16/main -type f -size +100M -exec ls -lh {} \; | sort -k5 -rh | head -20
+```
+### 2. Identify Culprits
+```bash
+# Check log sizes
+du -sh /var/log/postgresql/
+# Check WAL directory (total size and number of entries)
+sudo du -sh /var/lib/postgresql/16/main/pg_wal/
+sudo ls -1 /var/lib/postgresql/16/main/pg_wal/ | wc -l
+# Check for temp files
+du -sh /tmp/
+find /tmp -type f -size +10M -ls
+# Database sizes
+sudo -u postgres psql -c "
+SELECT
+    datname,
+    pg_size_pretty(pg_database_size(datname)) AS size,
+    pg_database_size(datname) AS size_bytes
+FROM pg_database
+ORDER BY size_bytes DESC;"
+```
+## EMERGENCY SPACE RECOVERY
+### Priority 1: Clear Old Logs (SAFEST)
+```bash
+# PostgreSQL logs older than 7 days
+sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete
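+# Compress rotated logs (skip the active log, which PostgreSQL still has open)
+sudo find /var/log/postgresql/ -name "*.log.[0-9]*" ! -name "*.gz" -exec gzip {} +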
+# Clear syslog/journal
+sudo journalctl --vacuum-time=7d
+# Check space recovered
+df -h
+```
+**Expected recovery:** 1-5 GB
+### Priority 2: Archive Old WAL Files
+⚠️ **ONLY if you have confirmed backups!**
+```bash
+# Check WAL retention settings
+sudo -u postgres psql -c "SHOW wal_keep_size;"
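+# List the oldest WAL files (names sort chronologically)
+sudo ls -lh /var/lib/postgresql/16/main/pg_wal/ | head -50
+# Archive WAL files (pgBackRest will help)
+sudo -u postgres pgbackrest --stanza=main --type=full backup
+# Clean WAL segments older than the last checkpoint's REDO file (CAREFUL!)
+# pg_archivecleanup removes only files older than the name it is given,
+# so passing the oldest segment would remove nothing.
+REDO_WAL=$(sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata /var/lib/postgresql/16/main \
+    | awk -F': +' '/REDO WAL file/ {print $2}')
+sudo -u postgres pg_archivecleanup /var/lib/postgresql/16/main/pg_wal "$REDO_WAL"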
+# Check space
+df -h
+```
+**Expected recovery:** 5-20 GB
+### Priority 3: Vacuum Databases
+```bash
+# Plain vacuum (marks dead rows reusable inside tables; rarely returns space to the OS)
+sudo -u postgres vacuumdb --all --analyze
+# Check largest tables
+sudo -u postgres psql -c "
+SELECT
+    schemaname,
+    tablename,
+    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
+FROM pg_tables
+WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
+ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
+LIMIT 10;"
+# Full vacuum on bloated tables (SLOW, locks table)
+sudo -u postgres psql -d [database] -c "VACUUM FULL [table_name];"
+# Check space
+df -h
+```
+**Expected recovery:** Variable, depends on bloat
+### Priority 4: Remove Temp Files
+```bash
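+# Clear PostgreSQL temp files (only while PostgreSQL is stopped; they live under base/pgsql_tmp)
+sudo find /var/lib/postgresql/16/main/base/pgsql_tmp -type f -name 'pgsql_tmp*' -delete
+# Clear system temp files older than one day (a blanket rm -rf /tmp/* can break running services)
+sudo find /tmp -xdev -type f -mtime +1 -delete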
+# Clear old backups (if local copies exist)
+ls -lh /opt/fairdb/backups/
+# Delete old local backups if remote backups are confirmed
+df -h
+```
+### Priority 5: Drop Old/Unused Databases (DANGER!)
+⚠️ **ONLY with customer approval!**
+```bash
+# List databases with current session activity
+# (pg_stat_activity only covers connected sessions, not historical access)
+sudo -u postgres psql -c "
+SELECT
+    datname,
+    pg_size_pretty(pg_database_size(datname)) AS size,
+    (SELECT max(query_start) FROM pg_stat_activity WHERE datname = d.datname) AS last_activity
+FROM pg_database d
+WHERE datname NOT IN ('template0', 'template1', 'postgres')
+ORDER BY pg_database_size(datname) DESC;"
+# A NULL last_activity means no session is connected right now, not that the database is unused
+# BEFORE DROPPING: Backup!
+sudo -u postgres pg_dump [database_name] | gzip > /opt/fairdb/backups/emergency-backup-[database_name].sql.gz
+# Drop database (IRREVERSIBLE!)
+sudo -u postgres psql -c "DROP DATABASE [database_name];"
+```
+## LONG-TERM SOLUTIONS
+### Option 1: Increase Disk Size
+**Contabo/VPS Provider:**
+1. Log into provider control panel
+2. Upgrade storage plan
+3. Resize disk partition
+4. Expand filesystem
+```bash
+# After resize, expand filesystem
+sudo resize2fs /dev/sda1  # Adjust device as needed
+# Verify
+df -h
+```
+### Option 2: Move Data to External Volume
+```bash
+# Create new volume/mount point
+# Move PostgreSQL data directory
+sudo systemctl stop postgresql
+sudo rsync -av /var/lib/postgresql/ /mnt/new-volume/postgresql/
+sudo mv /var/lib/postgresql /var/lib/postgresql.old
+sudo ln -s /mnt/new-volume/postgresql /var/lib/postgresql
+sudo systemctl start postgresql
+```
+### Option 3: Offload Old Data
+- Archive old customer databases
+- Export historical data to cold storage
+- Implement data retention policies
+### Option 4: Optimize Storage
+```sql
+-- Use lz4 for newly stored values in a column (PostgreSQL 14+)
+ALTER TABLE [table_name] ALTER COLUMN [column_name] SET COMPRESSION lz4;
+-- Existing values are not recompressed by this command
+-- Rewrite table to reclaim dead space (SLOW, locks table)
+VACUUM FULL [table_name];
+-- Set autovacuum more aggressively
+ALTER TABLE [table_name] SET (autovacuum_vacuum_scale_factor = 0.05);
+```
+## MONITORING & PREVENTION
+### Set Up Disk Monitoring
+Add to cron (`crontab -e`):
+```bash
+# Check disk space every hour
+0 * * * * /opt/fairdb/scripts/check-disk-space.sh
+```
+**Create script** `/opt/fairdb/scripts/check-disk-space.sh`:
+```bash
+#!/bin/bash
+THRESHOLD=80
+USAGE=$(df -h /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
+if [ "$USAGE" -gt "$THRESHOLD" ]; then
+    echo "WARNING: Disk usage at ${USAGE}%" | mail -s "FairDB Disk Warning" your-email@example.com
+fi
+```
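The parsing and threshold steps of the script above can be split into small functions so they can be exercised without touching a real disk. A minimal sketch, assuming input produced by `df -P` (POSIX output format, which keeps each filesystem on one line); `usage_percent` and `over_threshold` are hypothetical helper names:

```shell
# Read `df -P` output on stdin and print the usage percentage (digits only)
# of the first filesystem listed. -P forces one line per filesystem, so a
# long device name cannot wrap and shift the columns.
usage_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 when usage ($1, an integer) is strictly above the threshold ($2).
over_threshold() {
    [ "$1" -gt "$2" ]
}
```

In the cron script this might read `USAGE=$(df -P /var/lib/postgresql | usage_percent)` followed by `if over_threshold "$USAGE" "$THRESHOLD"; then ... fi`.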
+### Configure Log Rotation
+Edit `/etc/logrotate.d/postgresql`:
+```
+/var/log/postgresql/*.log {
+    daily
+    rotate 7
+    compress
+    delaycompress
+    notifempty
+    missingok
+}
+```
+### Implement Database Quotas
+```sql
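+-- PostgreSQL has no per-database size limit setting (max_database_size does not exist),
+-- so enforce quotas by monitoring. List databases larger than 10 GB:
+SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
+FROM pg_database
+WHERE pg_database_size(datname) > 10::bigint * 1024 * 1024 * 1024;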
+```
+## POST-RECOVERY ACTIONS
+### 1. Verify Database Health
+```bash
+# Check PostgreSQL status
+sudo systemctl status postgresql
+# Test connections
+sudo -u postgres psql -c "SELECT 1;"
+# Run health check
+/opt/fairdb/scripts/pg-health-check.sh
+```
+### 2. Document Incident
+```markdown
+# Disk Space Emergency - YYYY-MM-DD
+## Initial State
+- Disk usage: X%
+- Free space: XGB
+- Affected services: [list]
+## Actions Taken
+- [List each action with space recovered]
+## Final State
+- Disk usage: X%
+- Free space: XGB
+- Time to resolution: X minutes
+## Root Cause
+[Why did disk fill up?]
+## Prevention
+- [ ] Implement monitoring
+- [ ] Set up log rotation
+- [ ] Schedule regular cleanups
+- [ ] Consider storage upgrade
+```
+### 3. Implement Monitoring
+```bash
+# Install monitoring script (run-parts skips file names containing a dot, so drop the .sh suffix)
+sudo install -m 755 /opt/fairdb/scripts/check-disk-space.sh /etc/cron.hourly/check-disk-space
+# Set up alerts
+# (Configure email/Slack notifications)
+```
+## DECISION TREE
+```
+Disk at 100%?
+├─ Yes → Priority 1 & 2 (Logs + WAL) IMMEDIATELY
+│   ├─ Space freed? → Continue to monitoring
+│   └─ Still full? → Priority 3 (Vacuum) + Consider Priority 5
+│
+└─ No → Disk at 85-99%?
+    ├─ Priority 1 (Logs) + Schedule Priority 3 (Vacuum)
+    └─ Plan long-term solution (resize disk)
+```
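The decision tree can also be expressed as a small function that turns a usage percentage into the recommended priorities. A minimal sketch; `recommend_actions` is a hypothetical helper, and the "keep monitoring" branch below 85% is an assumption, since the tree does not cover that range:

```shell
# Print the recovery priorities suggested by the decision tree for a given
# disk usage percentage (integer, no % sign).
# Hypothetical helper; the below-85% branch is an assumption.
recommend_actions() {
    if [ "$1" -ge 100 ]; then
        echo "Priority 1 + 2 now; if still full, Priority 3 and consider Priority 5"
    elif [ "$1" -ge 85 ]; then
        echo "Priority 1 now; schedule Priority 3; plan long-term resize"
    else
        echo "No emergency action; keep monitoring"
    fi
}
```

For example, `recommend_actions 92` prints the 85-99% branch.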
+## START RESPONSE
+Ask user:
+1. "What is the current disk usage? (run `df -h`)"
+2. "Is PostgreSQL still running?"
+3. "When did this start happening?"
+Then immediately execute Rapid Assessment and Emergency Space Recovery procedures.
+**Remember:** Time is critical. Database writes are failing. Act fast but safely!