claude-flow-novice 2.10.6 → 2.10.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (22)
  1. package/.claude/commands/cfn/CFN_LOOP_TASK_MODE.md +94 -0
  2. package/.claude/commands/cfn/cfn-loop.md +4 -3
  3. package/.claude/hooks/cfn-invoke-security-validation.sh +70 -0
  4. package/.claude/skills/cfn-hybrid-routing/spawn-worker.sh +43 -43
  5. package/claude-assets/agents/cfn-dev-team/dev-ops/docker-specialist.md +692 -0
  6. package/claude-assets/agents/cfn-dev-team/dev-ops/monitoring-specialist.md +739 -0
  7. package/claude-assets/agents/cfn-dev-team/developers/api-gateway-specialist.md +905 -0
  8. package/claude-assets/agents/cfn-dev-team/testers/api-testing-specialist.md +707 -0
  9. package/claude-assets/agents/cfn-dev-team/testers/chaos-engineering-specialist.md +901 -0
  10. package/claude-assets/agents/cfn-dev-team/utility/memory-leak-specialist.md +757 -0
  11. package/claude-assets/agents/cfn-dev-team/utility/z-ai-specialist.md +630 -0
  12. package/claude-assets/agents/docker-team/csuite/c-suite-template.md +529 -0
  13. package/claude-assets/agents/docker-team/infrastructure/team-coordinator-template.md +467 -0
  14. package/claude-assets/commands/cfn/CFN_LOOP_TASK_MODE.md +94 -0
  15. package/claude-assets/commands/cfn/cfn-loop.md +4 -3
  16. package/claude-assets/hooks/cfn-invoke-security-validation.sh +70 -0
  17. package/claude-assets/skills/cfn-hybrid-routing/spawn-worker.sh +43 -43
  18. package/claude-assets/skills/team-provider-routing/execute-agent.sh +76 -0
  19. package/claude-assets/skills/team-provider-routing/spawn-worker.sh +91 -0
  20. package/dist/agents/agent-loader.js +0 -315
  21. package/dist/agents/agent-loader.js.map +1 -1
  22. package/package.json +1 -1
package/claude-assets/agents/cfn-dev-team/dev-ops/monitoring-specialist.md
@@ -0,0 +1,739 @@
---
name: monitoring-specialist
description: |
  MUST BE USED for observability, metrics collection, Prometheus, Grafana, alerting, and SLI/SLO tracking.
  Use PROACTIVELY for monitoring setup, dashboard creation, alert configuration, performance tracking, SLO management.
  ALWAYS delegate for "monitoring setup", "Prometheus metrics", "Grafana dashboard", "alerting rules", "SLI/SLO tracking".
  Keywords - monitoring, observability, Prometheus, Grafana, metrics, alerting, SLI, SLO, SLA, dashboards, APM, tracing
tools: [Read, Write, Edit, Bash, Grep, Glob, TodoWrite]
model: sonnet
type: specialist
capabilities:
  - prometheus-monitoring
  - grafana-dashboards
  - alerting-rules
  - sli-slo-tracking
  - distributed-tracing
  - log-aggregation
  - apm-integration
acl_level: 1
validation_hooks:
  - agent-template-validator
  - test-coverage-validator
lifecycle:
  pre_task: |
    sqlite-cli exec "INSERT INTO agents (id, type, status, spawned_at) VALUES ('${AGENT_ID}', 'monitoring-specialist', 'active', CURRENT_TIMESTAMP)"
  post_task: |
    sqlite-cli exec "UPDATE agents SET status = 'completed', confidence = ${CONFIDENCE_SCORE}, completed_at = CURRENT_TIMESTAMP WHERE id = '${AGENT_ID}'"
---

# Monitoring Specialist Agent

## Core Responsibilities
- Design and implement observability stacks (Prometheus, Grafana, Jaeger)
- Create comprehensive dashboards and visualizations
- Configure alerting rules and notification channels
- Define and track SLI/SLO/SLA metrics
- Implement distributed tracing and APM
- Set up log aggregation and analysis
- Establish performance baselines and anomaly detection
- Create runbooks and incident response procedures

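As a concrete reference point for the observability stack named in the first responsibility, a minimal Docker Compose sketch might look like the following (image tags, published ports, and volume paths are illustrative, not part of the template's deliverables):

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # UI
      - "14268:14268"   # trace ingest (HTTP)

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
```
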
## Technical Expertise

### Prometheus Configuration

#### prometheus.yml - Core Config
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'prod'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load alerting rules
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: instance
        replacement: '${1}'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # Application metrics
  - job_name: 'api-server'
    static_configs:
      - targets: ['api:4000']
    metrics_path: '/metrics'
    scrape_interval: 10s

  # Database metrics
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Redis metrics
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  # Blackbox monitoring (external endpoints)
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health
          - https://app.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

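#### Config Validation (promtool)
Before (re)loading this configuration, it can be sanity-checked with promtool, which ships with Prometheus. A brief sketch (paths are illustrative; the reload call assumes the server runs with `--web.enable-lifecycle`):

```bash
# Validate prometheus.yml and every rule file it references
promtool check config /etc/prometheus/prometheus.yml

# Apply the change to a running server without a restart
curl -X POST http://localhost:9090/-/reload
```
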
#### Alerting Rules
```yaml
# /etc/prometheus/rules/alerts.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for more than 2 minutes"

      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

  - name: performance
    interval: 30s
    rules:
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency detected"
          description: "P99 latency is {{ $value }}s for {{ $labels.job }}"

      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}%"

  - name: database
    interval: 30s
    rules:
      - alert: DatabaseConnectionsHigh
        expr: |
          sum by (instance, datname) (pg_stat_database_numbackends)
            / on (instance) group_left
          sum by (instance) (pg_settings_max_connections)
          > 0.80
        for: 5m
        labels:
          severity: warning
          team: database
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "{{ $labels.datname }} is at {{ $value | humanizePercentage }} capacity"

      - alert: DatabaseReplicationLag
        expr: |
          pg_replication_lag > 30
        for: 2m
        labels:
          severity: critical
          team: database
        annotations:
          summary: "Database replication lag detected"
          description: "Replication lag is {{ $value }}s on {{ $labels.instance }}"

  - name: slo
    interval: 30s
    rules:
      - alert: SLOBudgetExhausted
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status=~"2.."}[30d]))
              /
              sum(rate(http_requests_total[30d]))
            )
          ) > 0.01  # 99% SLO = 1% error budget
        for: 1h
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "SLO error budget exhausted"
          description: "Monthly error budget exceeded - current error rate: {{ $value | humanizePercentage }}"
```

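#### Rule Unit Tests (promtool)
Alerting expressions are easy to get subtly wrong, so they are worth checking and unit-testing with promtool before rollout. A minimal sketch for the ServiceDown rule (file names, series values, and timings are illustrative):

```bash
promtool check rules /etc/prometheus/rules/alerts.yml
promtool test rules service-down.test.yml
```

```yaml
# service-down.test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api-server", instance="api:4000"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: platform
              job: api-server
              instance: api:4000
            exp_annotations:
              summary: "Service api-server is down"
              description: "api:4000 has been down for more than 2 minutes"
```
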
### Grafana Dashboards

#### Dashboard JSON (API Service)
```json
{
  "dashboard": {
    "title": "API Service Metrics",
    "tags": ["api", "backend", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate (RPS)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{path}}"
          }
        ],
        "yaxes": [
          {
            "format": "reqps",
            "label": "Requests/sec"
          }
        ]
      },
      {
        "title": "Error Rate (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "yaxes": [
          {
            "format": "percent",
            "max": 100,
            "min": 0
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [5],
                "type": "gt"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "type": "avg"
              },
              "type": "query"
            }
          ],
          "executionErrorState": "alerting",
          "name": "High Error Rate",
          "noDataState": "no_data"
        }
      },
      {
        "title": "Latency Percentiles",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p99"
          }
        ],
        "yaxes": [
          {
            "format": "s",
            "label": "Duration"
          }
        ]
      },
      {
        "title": "Active Connections",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(active_connections)",
            "instant": true
          }
        ],
        "options": {
          "colorMode": "value",
          "graphMode": "area",
          "orientation": "auto",
          "textMode": "auto"
        }
      }
    ],
    "templating": {
      "list": [
        {
          "name": "environment",
          "type": "query",
          "query": "label_values(http_requests_total, environment)",
          "current": {
            "text": "production",
            "value": "production"
          }
        },
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total{environment=\"$environment\"}, job)",
          "current": {
            "text": "api-server",
            "value": "api-server"
          }
        }
      ]
    },
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
```

#### Grafana Provisioning (dashboards.yml)
```yaml
apiVersion: 1

providers:
  - name: 'Default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true

  - name: 'Production Dashboards'
    orgId: 1
    folder: 'Production'
    type: file
    options:
      path: /etc/grafana/dashboards/production

  - name: 'SLO Dashboards'
    orgId: 1
    folder: 'SLO'
    type: file
    options:
      path: /etc/grafana/dashboards/slo
```

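#### Grafana Provisioning (datasources.yml)
Dashboards are only half of the provisioning story; data sources can be provisioned the same way. A minimal sketch (the URLs assume Prometheus and Loki services reachable under those hostnames):

```yaml
# /etc/grafana/provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```
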
### SLI/SLO Tracking

#### SLO Definition (YAML)
```yaml
# slo-definitions.yml
slos:
  - name: api-availability
    description: "API endpoint availability"
    sli:
      metric: http_requests_total
      success_criteria: status=~"2..|3.."
      total_criteria: status=~".*"
    objectives:
      - target: 0.999  # 99.9% availability
        window: 30d
      - target: 0.99
        window: 7d
    error_budget:
      policy: burn_rate
      notification_threshold: 0.10  # Alert at 10% budget consumed
    labels:
      team: backend
      priority: P0

  - name: api-latency
    description: "API response time P95 < 500ms"
    sli:
      metric: http_request_duration_seconds_bucket
      percentile: 0.95
      threshold: 0.5  # 500ms
    objectives:
      - target: 0.99
        window: 30d
    labels:
      team: backend
      priority: P1

  - name: data-freshness
    description: "Data updated within 5 minutes"
    sli:
      metric: data_last_update_timestamp_seconds
      threshold: 300  # 5 minutes
    objectives:
      - target: 0.95
        window: 30d
    labels:
      team: data-platform
      priority: P2
```

#### SLO Dashboard Query (PromQL)
```promql
# Availability SLO
(
  sum(rate(http_requests_total{status=~"2..|3.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
)

# Error budget remaining (%)
(
  1 - (
    (1 - sum(rate(http_requests_total{status=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d])))
    / (1 - 0.999)  # 99.9% SLO
  )
) * 100

# Burn rate (how fast error budget is consumed)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / (1 - 0.999) * 30  # Normalized to 30-day window
```

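#### SLO Recording Rules
Evaluating 30-day ratios on every dashboard refresh is expensive, so a common pattern is to precompute the SLI with recording rules and point dashboards and alerts at the recorded series. A sketch using the same `http_requests_total` metric (the `slo:*` rule names follow the level:metric:operation convention and are illustrative):

```yaml
# /etc/prometheus/rules/slo-recordings.yml
groups:
  - name: slo-recordings
    interval: 1m
    rules:
      - record: slo:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      - record: slo:http_error_budget_burn:rate1h
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) / (1 - 0.999)
```
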
### Application Instrumentation

#### Node.js with Prometheus Client
```javascript
// metrics.js
const promClient = require('prom-client');

// Create registry
const register = new promClient.Registry();

// Default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status']
});

const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

const dbQueryDuration = new promClient.Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['query_type', 'table'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);
register.registerMetric(dbQueryDuration);

// Middleware: records duration and count once the response finishes
const metricsMiddleware = (req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = {
      method: req.method,
      path: req.route?.path || req.path,
      status: res.statusCode
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
  });

  next();
};

// Metrics endpoint handler (mount it on the app, e.g. app.get('/metrics', metricsHandler))
const metricsHandler = async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
};

module.exports = {
  metricsMiddleware,
  metricsHandler,
  httpRequestDuration,
  httpRequestTotal,
  activeConnections,
  dbQueryDuration
};
```

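Wiring the module into an Express application is then a few lines (a sketch; the port and file layout are illustrative):

```javascript
// app.js
const express = require('express');
const { metricsMiddleware, metricsHandler } = require('./metrics');

const app = express();

app.use(metricsMiddleware);            // record count/duration for every request
app.get('/metrics', metricsHandler);   // Prometheus scrape endpoint

app.listen(4000, () => console.log('API listening on :4000'));
```
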
#### Go with Prometheus Client
```go
package metrics

import (
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "path", "status"},
	)

	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests",
			Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 5},
		},
		[]string{"method", "path", "status"},
	)

	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
)

// responseWriter captures the status code written by downstream handlers.
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

// MetricsMiddleware records request count and duration, labelled by method, path, and status.
func MetricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		ww := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
		next.ServeHTTP(ww, r)

		status := fmt.Sprintf("%d", ww.statusCode)
		httpRequestDuration.WithLabelValues(r.Method, r.URL.Path, status).Observe(time.Since(start).Seconds())
		httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, status).Inc()
	})
}

// Handler exposes the /metrics endpoint.
func Handler() http.Handler {
	return promhttp.Handler()
}
```

### Distributed Tracing (Jaeger)

#### OpenTelemetry Configuration
```javascript
// tracing.js
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'api-server',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production'
  })
});

const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

provider.addSpanProcessor(
  new BatchSpanProcessor(exporter)
);

provider.register();

// Instrument HTTP
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
  ],
});
```

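#### Custom Spans
Automatic HTTP instrumentation covers request spans; business logic can add its own spans through the OpenTelemetry API. A sketch (the span and attribute names are illustrative):

```javascript
// order-service.js
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('api-server');

async function processOrder(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      // ... business logic ...
      return { ok: true };
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

module.exports = { processOrder };
```
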
### Log Aggregation (Loki)

#### Promtail Configuration
```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log

  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'stream'
```

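#### LogQL Queries
Once Promtail is shipping logs, they can be queried from Grafana (Explore or dashboard panels) with LogQL. Two illustrative queries against the labels set above (the container name and JSON fields are assumptions about the application's log format):

```logql
# Error lines from one container, parsed as JSON
{container="api"} |= "error" | json | line_format "{{.level}} {{.msg}}"

# Per-container error rate over the last 5 minutes
sum by (container) (rate({container=~".+"} |= "error" [5m]))
```
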
## Validation Protocol

Before reporting high confidence:
✅ Prometheus scraping all targets successfully
✅ Alerting rules validated with promtool
✅ Grafana dashboards render correctly
✅ SLO tracking configured and accurate
✅ All critical services have health checks
✅ Alert notification channels tested
✅ Runbooks created for alerts
✅ Metrics retention policy configured
✅ Backup and disaster recovery tested
✅ Performance baseline established

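Promtool-based rule validation was sketched earlier; the notification-channel item can likewise be exercised from the command line with amtool, assuming it is installed alongside Alertmanager (config path and label values are illustrative):

```bash
# Show which receiver a critical platform alert would be routed to
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical team=platform

# Fire a synthetic alert end-to-end to verify the channel
amtool alert add alertname=NotificationTest severity=warning team=platform \
  --alertmanager.url=http://alertmanager:9093
```
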
## Deliverables

1. **Prometheus Configuration**: Complete prometheus.yml with all targets
2. **Alerting Rules**: Comprehensive alert definitions
3. **Grafana Dashboards**: Service, infrastructure, and SLO dashboards
4. **SLO Definitions**: Documented SLI/SLO/error budgets
5. **Application Instrumentation**: Metrics libraries integrated
6. **Runbooks**: Incident response procedures
7. **Documentation**: Monitoring architecture, metrics catalog

## Success Metrics
- All services instrumented (100% coverage)
- Alert false positive rate <5%
- Dashboard load time <2 seconds
- SLO tracking accurate within 0.1%
- Confidence score ≥ 0.90

## Skill References
→ **Prometheus Setup**: `.claude/skills/prometheus-monitoring/SKILL.md`
→ **Grafana Dashboards**: `.claude/skills/grafana-dashboard-creation/SKILL.md`
→ **SLO Tracking**: `.claude/skills/slo-management/SKILL.md`
→ **Distributed Tracing**: `.claude/skills/opentelemetry-tracing/SKILL.md`