claude-flow-novice 2.16.0 → 2.16.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (154)
  1. package/.claude/cfn-extras/skills/GOOGLE_SHEETS_SKILLS_README.md +1 -1
  2. package/.claude/cfn-extras/skills/google-sheets-api-coordinator/SKILL.md +1 -1
  3. package/.claude/cfn-extras/skills/google-sheets-formula-builder/SKILL.md +1 -1
  4. package/.claude/cfn-extras/skills/google-sheets-progress/SKILL.md +1 -1
  5. package/.claude/commands/CFN_LOOP_FRONTEND.md +1 -1
  6. package/.claude/commands/cfn-loop-cli.md +124 -46
  7. package/.claude/commands/cfn-loop-frontend.md +1 -1
  8. package/.claude/commands/cfn-loop-task.md +2 -2
  9. package/.claude/commands/deprecated/cfn-loop.md +2 -2
  10. package/.claude/hooks/cfn-invoke-post-edit.sh +31 -5
  11. package/.claude/hooks/cfn-post-edit.config.json +9 -2
  12. package/.claude/root-claude-distribute/CFN-CLAUDE.md +1 -1
  13. package/.claude/skills/cfn-backlog-management/SKILL.md +1 -1
  14. package/.claude/skills/cfn-loop-orchestration/NORTH_STAR_INDEX.md +1 -1
  15. package/claude-assets/agents/cfn-dev-team/analysts/root-cause-analyst.md +2 -2
  16. package/claude-assets/agents/cfn-dev-team/architecture/base-template-generator.md +1 -1
  17. package/claude-assets/agents/cfn-dev-team/coordinators/cfn-frontend-coordinator.md +2 -2
  18. package/claude-assets/agents/cfn-dev-team/coordinators/handoff-coordinator.md +1 -1
  19. package/claude-assets/agents/cfn-dev-team/dev-ops/devops-engineer.md +1 -1
  20. package/claude-assets/agents/cfn-dev-team/dev-ops/docker-specialist.md +2 -2
  21. package/claude-assets/agents/cfn-dev-team/dev-ops/github-commit-agent.md +2 -2
  22. package/claude-assets/agents/cfn-dev-team/dev-ops/kubernetes-specialist.md +1 -1
  23. package/claude-assets/agents/cfn-dev-team/developers/api-gateway-specialist.md +1 -1
  24. package/claude-assets/agents/cfn-dev-team/developers/data/data-engineer.md +1 -1
  25. package/claude-assets/agents/cfn-dev-team/developers/database/database-architect.md +1 -1
  26. package/claude-assets/agents/cfn-dev-team/developers/frontend/typescript-specialist.md +1 -1
  27. package/claude-assets/agents/cfn-dev-team/developers/frontend/ui-designer.md +1 -1
  28. package/claude-assets/agents/cfn-dev-team/developers/graphql-specialist.md +1 -1
  29. package/claude-assets/agents/cfn-dev-team/documentation/pseudocode.md +1 -1
  30. package/claude-assets/agents/cfn-dev-team/product-owners/accessibility-advocate-persona.md +1 -1
  31. package/claude-assets/agents/cfn-dev-team/product-owners/cto-agent.md +1 -1
  32. package/claude-assets/agents/cfn-dev-team/product-owners/power-user-persona.md +1 -1
  33. package/claude-assets/agents/cfn-dev-team/reviewers/quality/security-specialist.md +1 -1
  34. package/claude-assets/agents/cfn-dev-team/testers/api-testing-specialist.md +1 -1
  35. package/claude-assets/agents/cfn-dev-team/testers/chaos-engineering-specialist.md +1 -1
  36. package/claude-assets/agents/cfn-dev-team/testers/contract-tester.md +1 -1
  37. package/claude-assets/agents/cfn-dev-team/testers/e2e/playwright-tester.md +1 -1
  38. package/claude-assets/agents/cfn-dev-team/testers/integration-tester.md +1 -1
  39. package/claude-assets/agents/cfn-dev-team/testers/load-testing-specialist.md +1 -1
  40. package/claude-assets/agents/cfn-dev-team/testers/mutation-testing-specialist.md +1 -1
  41. package/claude-assets/agents/cfn-dev-team/testers/unit/tdd-london-unit-swarm.md +1 -1
  42. package/claude-assets/agents/cfn-dev-team/utility/agent-builder.md +11 -0
  43. package/claude-assets/agents/cfn-dev-team/utility/analyst.md +1 -1
  44. package/claude-assets/agents/cfn-dev-team/utility/claude-code-expert.md +1 -1
  45. package/claude-assets/agents/cfn-dev-team/utility/epic-creator.md +1 -1
  46. package/claude-assets/agents/cfn-dev-team/utility/memory-leak-specialist.md +1 -1
  47. package/claude-assets/agents/cfn-dev-team/utility/researcher.md +1 -1
  48. package/claude-assets/agents/cfn-dev-team/utility/z-ai-specialist.md +1 -1
  49. package/claude-assets/agents/custom/cfn-docker-expert.md +1 -0
  50. package/claude-assets/agents/custom/cfn-loops-cli-expert.md +326 -17
  51. package/claude-assets/agents/custom/cfn-redis-operations.md +529 -529
  52. package/claude-assets/agents/custom/cfn-system-expert.md +1 -1
  53. package/claude-assets/agents/custom/trigger-dev-expert.md +369 -0
  54. package/claude-assets/agents/docker-team/micro-sprint-planner.md +747 -747
  55. package/claude-assets/agents/project-only-agents/npm-package-specialist.md +1 -1
  56. package/claude-assets/cfn-extras/skills/GOOGLE_SHEETS_SKILLS_README.md +1 -1
  57. package/claude-assets/cfn-extras/skills/google-sheets-api-coordinator/SKILL.md +1 -1
  58. package/claude-assets/cfn-extras/skills/google-sheets-formula-builder/SKILL.md +1 -1
  59. package/claude-assets/cfn-extras/skills/google-sheets-progress/SKILL.md +1 -1
  60. package/claude-assets/commands/CFN_LOOP_FRONTEND.md +1 -1
  61. package/claude-assets/commands/cfn-loop-cli.md +124 -46
  62. package/claude-assets/commands/cfn-loop-frontend.md +1 -1
  63. package/claude-assets/commands/cfn-loop-task.md +2 -2
  64. package/claude-assets/commands/deprecated/cfn-loop.md +2 -2
  65. package/claude-assets/hooks/GIT-HOOKS-USAGE-EXAMPLES.md +116 -0
  66. package/claude-assets/hooks/README-GIT-HOOKS.md +443 -0
  67. package/claude-assets/hooks/cfn-invoke-post-edit.sh +31 -5
  68. package/claude-assets/hooks/cfn-post-edit.config.json +9 -2
  69. package/claude-assets/hooks/install-git-hooks.sh +243 -0
  70. package/claude-assets/hooks/subagent-start.sh +98 -0
  71. package/claude-assets/hooks/subagent-stop.sh +93 -0
  72. package/claude-assets/hooks/validators/credential-scanner.sh +172 -0
  73. package/claude-assets/root-claude-distribute/CFN-CLAUDE.md +1 -1
  74. package/claude-assets/skills/cfn-backlog-management/SKILL.md +1 -1
  75. package/claude-assets/skills/cfn-dependency-ingestion/SKILL.md +41 -13
  76. package/claude-assets/skills/cfn-dependency-ingestion/ingest.sh +237 -0
  77. package/claude-assets/skills/cfn-dependency-ingestion/manifests/cli-mode-dependencies.txt +73 -0
  78. package/claude-assets/skills/cfn-dependency-ingestion/manifests/shared-dependencies.txt +57 -0
  79. package/claude-assets/skills/cfn-dependency-ingestion/manifests/trigger-dev-dependencies.txt +82 -0
  80. package/claude-assets/skills/cfn-dependency-ingestion/manifests/trigger-mode-dependencies.txt +80 -0
  81. package/claude-assets/skills/cfn-environment-sanitization/sanitize-environment.sh +14 -4
  82. package/claude-assets/skills/cfn-loop-orchestration/NORTH_STAR_INDEX.md +1 -1
  83. package/claude-assets/skills/cfn-provider-routing/SKILL.md +23 -0
  84. package/claude-assets/skills/docker-build/build.sh +1 -1
  85. package/dist/agent/skill-mcp-selector.js +2 -1
  86. package/dist/agent/skill-mcp-selector.js.map +1 -1
  87. package/dist/agents/agent-loader.js +165 -146
  88. package/dist/agents/agent-loader.js.map +1 -1
  89. package/dist/cli/agent-executor.js +470 -26
  90. package/dist/cli/agent-executor.js.map +1 -1
  91. package/dist/cli/agent-prompt-builder.js +2 -2
  92. package/dist/cli/agent-prompt-builder.js.map +1 -1
  93. package/dist/cli/agent-spawn.js +7 -4
  94. package/dist/cli/agent-spawn.js.map +1 -1
  95. package/dist/cli/agent-spawner.js +51 -4
  96. package/dist/cli/agent-spawner.js.map +1 -1
  97. package/dist/cli/agent-token-manager.js +2 -1
  98. package/dist/cli/agent-token-manager.js.map +1 -1
  99. package/dist/cli/anthropic-client.js +117 -11
  100. package/dist/cli/anthropic-client.js.map +1 -1
  101. package/dist/cli/cfn-context.js +2 -1
  102. package/dist/cli/cfn-context.js.map +1 -1
  103. package/dist/cli/cfn-metrics.js +2 -1
  104. package/dist/cli/cfn-metrics.js.map +1 -1
  105. package/dist/cli/cfn-redis.js +2 -1
  106. package/dist/cli/cfn-redis.js.map +1 -1
  107. package/dist/cli/cli-agent-context.js +2 -0
  108. package/dist/cli/cli-agent-context.js.map +1 -1
  109. package/dist/cli/config-manager.js +4 -252
  110. package/dist/cli/config-manager.js.map +1 -1
  111. package/dist/cli/conversation-fork-cleanup.js +2 -1
  112. package/dist/cli/conversation-fork-cleanup.js.map +1 -1
  113. package/dist/cli/conversation-fork.js +2 -1
  114. package/dist/cli/conversation-fork.js.map +1 -1
  115. package/dist/cli/coordination/agent-messaging.js +415 -0
  116. package/dist/cli/coordination/agent-messaging.js.map +1 -0
  117. package/dist/cli/coordination/wait-for-threshold.js +232 -0
  118. package/dist/cli/coordination/wait-for-threshold.js.map +1 -0
  119. package/dist/cli/iteration-history.js +2 -1
  120. package/dist/cli/iteration-history.js.map +1 -1
  121. package/dist/cli/process-lifecycle.js +5 -1
  122. package/dist/cli/process-lifecycle.js.map +1 -1
  123. package/dist/cli/spawn-agent-cli.js +41 -6
  124. package/dist/cli/spawn-agent-cli.js.map +1 -1
  125. package/dist/coordination/redis-waiting-mode.js +4 -0
  126. package/dist/coordination/redis-waiting-mode.js.map +1 -1
  127. package/dist/lib/artifact-registry.js +4 -0
  128. package/dist/lib/artifact-registry.js.map +1 -1
  129. package/dist/lib/connection-pool.js +390 -0
  130. package/dist/lib/connection-pool.js.map +1 -0
  131. package/dist/lib/environment-contract.js +258 -0
  132. package/dist/lib/environment-contract.js.map +1 -0
  133. package/dist/lib/query-optimizer.js +388 -0
  134. package/dist/lib/query-optimizer.js.map +1 -0
  135. package/dist/lib/result-cache.js +285 -0
  136. package/dist/lib/result-cache.js.map +1 -0
  137. package/dist/mcp/auth-middleware.js +2 -1
  138. package/dist/mcp/auth-middleware.js.map +1 -1
  139. package/dist/mcp/playwright-mcp-server-auth.js +2 -1
  140. package/dist/mcp/playwright-mcp-server-auth.js.map +1 -1
  141. package/package.json +3 -1
  142. package/scripts/build-agent-image.sh +1 -1
  143. package/scripts/cost-allocation-tracker.sh +632 -0
  144. package/scripts/docker-rebuild-all-agents.sh +2 -2
  145. package/scripts/reorganize-tests.sh +280 -0
  146. package/scripts/trigger-dev-setup.sh +12 -0
  147. package/tests/README.md +45 -0
  148. package/.claude/commands/cost-savings-status.md +0 -34
  149. package/.claude/commands/metrics-summary.md +0 -58
  150. package/claude-assets/agents/cfn-dev-team/dev-ops/monitoring-specialist.md +0 -768
  151. package/claude-assets/agents/custom/test-mcp-access.md +0 -24
  152. package/claude-assets/commands/cost-savings-status.md +0 -34
  153. package/claude-assets/commands/metrics-summary.md +0 -58
  154. package/tests/test-memory-leak-task-mode.sh +0 -435
@@ -1,768 +0,0 @@
- ---
- name: monitoring-specialist
- description: MUST BE USED for observability, metrics collection, Prometheus, Grafana, alerting, and SLI/SLO tracking. Use PROACTIVELY for monitoring setup, dashboard creation, alert configuration, performance tracking, SLO management. ALWAYS delegate for "monitoring setup", "Prometheus metrics", "Grafana dashboard", "alerting rules", "SLI/SLO tracking". Keywords - monitoring, observability, Prometheus, Grafana, metrics, alerting, SLI, SLO, SLA, dashboards, APM, tracing
- tools: [Read, Write, Edit, Bash, Grep, Glob, TodoWrite]
- model: sonnet
- type: specialist
- capabilities:
- - prometheus-monitoring
- - grafana-dashboards
- - alerting-rules
- - sli-slo-tracking
- - distributed-tracing
- - log-aggregation
- - apm-integration
- acl_level: 1
- validation_hooks:
- - agent-template-validator
- - test-coverage-validator
- ---
-
- # Monitoring Specialist Agent
-
- ## Core Responsibilities
- - Design and implement observability stacks (Prometheus, Grafana, Jaeger)
- - Create comprehensive dashboards and visualizations
- - Configure alerting rules and notification channels
- - Define and track SLI/SLO/SLA metrics
- - Implement distributed tracing and APM
- - Set up log aggregation and analysis
- - Establish performance baselines and anomaly detection
- - Create runbooks and incident response procedures
-
- ## Technical Expertise
-
- ### Environment Variables
-
- Multi-worktree and Docker Compose deployments support the following environment variables:
-
- | Variable | Default | Description |
- |----------|---------|-------------|
- | `CFN_PROMETHEUS_PORT` | 9091 | Prometheus HTTP port |
- | `CFN_GRAFANA_PORT` | 3000 | Grafana HTTP port |
- | `CFN_ALERTMANAGER_PORT` | 9093 | Alertmanager HTTP port |
- | `CFN_JAEGER_PORT` | 16686 | Jaeger UI port |
- | `CFN_LOKI_PORT` | 3100 | Loki HTTP port |
- | `COMPOSE_PROJECT_NAME` | (auto) | Docker Compose project prefix for container/network names |
- | `REDIS_HOST` | redis | Redis service hostname (use service discovery name) |
-
- **Usage in Configurations:**
- - Use `${CFN_PROMETHEUS_PORT:-9091}` pattern for port references
- - Use service names (redis, postgres) instead of container names (cfn-redis, cfn-postgres)
- - Prefix container names with `${COMPOSE_PROJECT_NAME}-` for multi-worktree isolation
-
- ### Prometheus Configuration
-
- #### prometheus.yml - Core Config
- ```yaml
- global:
-   scrape_interval: 15s
-   evaluation_interval: 15s
-   external_labels:
-     cluster: 'production'
-     environment: 'prod'
-
- # Alertmanager configuration
- alerting:
-   alertmanagers:
-     - static_configs:
-         - targets:
-             - alertmanager:9093
-
- # Load alerting rules
- rule_files:
-   - '/etc/prometheus/rules/*.yml'
-
- # Scrape configurations
- scrape_configs:
-   # Prometheus self-monitoring
-   - job_name: 'prometheus'
-     static_configs:
-       - targets: ['localhost:${CFN_PROMETHEUS_PORT:-9091}']
-
-   # Node Exporter (system metrics)
-   - job_name: 'node-exporter'
-     static_configs:
-       - targets:
-           - 'node1:9100'
-           - 'node2:9100'
-           - 'node3:9100'
-     relabel_configs:
-       - source_labels: [__address__]
-         regex: '([^:]+):\d+'
-         target_label: instance
-         replacement: '${1}'
-
-   # Kubernetes service discovery
-   - job_name: 'kubernetes-pods'
-     kubernetes_sd_configs:
-       - role: pod
-     relabel_configs:
-       - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
-         action: keep
-         regex: true
-       - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
-         action: replace
-         target_label: __metrics_path__
-         regex: (.+)
-       - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
-         action: replace
-         regex: ([^:]+)(?::\d+)?;(\d+)
-         replacement: $1:$2
-         target_label: __address__
-
-   # Application metrics
-   - job_name: 'api-server'
-     static_configs:
-       - targets: ['api:4000']
-     metrics_path: '/metrics'
-     scrape_interval: 10s
-
-   # Database metrics
-   - job_name: 'postgres'
-     static_configs:
-       - targets: ['postgres-exporter:9187']
-
-   # Cache metrics
-   - job_name: 'memcached'
-     static_configs:
-       - targets: ['memcached-exporter:9150']
-
-   # Blackbox monitoring (external endpoints)
-   - job_name: 'blackbox'
-     metrics_path: /probe
-     params:
-       module: [http_2xx]
-     static_configs:
-       - targets:
-           - https://api.example.com/health
-           - https://app.example.com
-     relabel_configs:
-       - source_labels: [__address__]
-         target_label: __param_target
-       - source_labels: [__param_target]
-         target_label: instance
-       - target_label: __address__
-         replacement: blackbox-exporter:9115
- ```
-
- #### Alerting Rules
- ```yaml
- # /etc/prometheus/rules/alerts.yml
- groups:
-   - name: availability
-     interval: 30s
-     rules:
-       - alert: ServiceDown
-         expr: up == 0
-         for: 2m
-         labels:
-           severity: critical
-           team: platform
-         annotations:
-           summary: "Service {{ $labels.job }} is down"
-           description: "{{ $labels.instance }} has been down for more than 2 minutes"
-
-       - alert: HighErrorRate
-         expr: |
-           (
-             rate(http_requests_total{status=~"5.."}[5m])
-             /
-             rate(http_requests_total[5m])
-           ) > 0.05
-         for: 5m
-         labels:
-           severity: warning
-           team: backend
-         annotations:
-           summary: "High error rate detected"
-           description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
-
-   - name: performance
-     interval: 30s
-     rules:
-       - alert: HighLatency
-         expr: |
-           histogram_quantile(0.99,
-             rate(http_request_duration_seconds_bucket[5m])
-           ) > 1
-         for: 10m
-         labels:
-           severity: warning
-           team: backend
-         annotations:
-           summary: "High latency detected"
-           description: "P99 latency is {{ $value }}s for {{ $labels.job }}"
-
-       - alert: HighMemoryUsage
-         expr: |
-           (
-             node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
-           ) / node_memory_MemTotal_bytes > 0.90
-         for: 5m
-         labels:
-           severity: warning
-           team: platform
-         annotations:
-           summary: "High memory usage on {{ $labels.instance }}"
-           description: "Memory usage is {{ $value | humanizePercentage }}"
-
-       - alert: HighCPUUsage
-         expr: |
-           100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
-         for: 10m
-         labels:
-           severity: warning
-           team: platform
-         annotations:
-           summary: "High CPU usage on {{ $labels.instance }}"
-           description: "CPU usage is {{ $value | humanize }}%"
-
-   - name: database
-     interval: 30s
-     rules:
-       - alert: DatabaseConnectionsHigh
-         expr: |
-           pg_stat_database_numbackends / pg_settings_max_connections > 0.80
-         for: 5m
-         labels:
-           severity: warning
-           team: database
-         annotations:
-           summary: "Database connection pool nearly exhausted"
-           description: "{{ $labels.datname }} is at {{ $value | humanizePercentage }} capacity"
-
-       - alert: DatabaseReplicationLag
-         expr: |
-           pg_replication_lag > 30
-         for: 2m
-         labels:
-           severity: critical
-           team: database
-         annotations:
-           summary: "Database replication lag detected"
-           description: "Replication lag is {{ $value }}s on {{ $labels.instance }}"
-
-   - name: slo
-     interval: 30s
-     rules:
-       - alert: SLOBudgetExhausted
-         expr: |
-           (
-             1 - (
-               sum(rate(http_requests_total{status=~"2.."}[30d]))
-               /
-               sum(rate(http_requests_total[30d]))
-             )
-           ) > 0.01 # 99% SLO = 1% error budget
-         for: 1h
-         labels:
-           severity: critical
-           team: sre
-         annotations:
-           summary: "SLO error budget exhausted"
-           description: "Monthly error budget exceeded - current error rate: {{ $value | humanizePercentage }}"
- ```
-
- ### Grafana Dashboards
-
- #### Dashboard JSON (API Service)
- ```json
- {
-   "dashboard": {
-     "title": "API Service Metrics",
-     "tags": ["api", "backend", "production"],
-     "timezone": "browser",
-     "panels": [
-       {
-         "title": "Request Rate (RPS)",
-         "type": "graph",
-         "targets": [
-           {
-             "expr": "rate(http_requests_total[5m])",
-             "legendFormat": "{{method}} {{path}}"
-           }
-         ],
-         "yaxes": [
-           {
-             "format": "reqps",
-             "label": "Requests/sec"
-           }
-         ]
-       },
-       {
-         "title": "Error Rate (%)",
-         "type": "graph",
-         "targets": [
-           {
-             "expr": "(rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])) * 100",
-             "legendFormat": "Error Rate"
-           }
-         ],
-         "yaxes": [
-           {
-             "format": "percent",
-             "max": 100,
-             "min": 0
-           }
-         ],
-         "alert": {
-           "conditions": [
-             {
-               "evaluator": {
-                 "params": [5],
-                 "type": "gt"
-               },
-               "query": {
-                 "params": ["A", "5m", "now"]
-               },
-               "reducer": {
-                 "type": "avg"
-               },
-               "type": "query"
-             }
-           ],
-           "executionErrorState": "alerting",
-           "name": "High Error Rate",
-           "noDataState": "no_data"
-         }
-       },
-       {
-         "title": "Latency Percentiles",
-         "type": "graph",
-         "targets": [
-           {
-             "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
-             "legendFormat": "p50"
-           },
-           {
-             "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
-             "legendFormat": "p95"
-           },
-           {
-             "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
-             "legendFormat": "p99"
-           }
-         ],
-         "yaxes": [
-           {
-             "format": "s",
-             "label": "Duration"
-           }
-         ]
-       },
-       {
-         "title": "Active Connections",
-         "type": "stat",
-         "targets": [
-           {
-             "expr": "sum(active_connections)",
-             "instant": true
-           }
-         ],
-         "options": {
-           "colorMode": "value",
-           "graphMode": "area",
-           "orientation": "auto",
-           "textMode": "auto"
-         }
-       }
-     ],
-     "templating": {
-       "list": [
-         {
-           "name": "environment",
-           "type": "query",
-           "query": "label_values(http_requests_total, environment)",
-           "current": {
-             "text": "production",
-             "value": "production"
-           }
-         },
-         {
-           "name": "service",
-           "type": "query",
-           "query": "label_values(http_requests_total{environment=\"$environment\"}, job)",
-           "current": {
-             "text": "api-server",
-             "value": "api-server"
-           }
-         }
-       ]
-     },
-     "time": {
-       "from": "now-6h",
-       "to": "now"
-     },
-     "refresh": "30s"
-   }
- }
- ```
-
- #### Grafana Provisioning (dashboards.yml)
- ```yaml
- apiVersion: 1
-
- providers:
-   - name: 'Default'
-     orgId: 1
-     folder: ''
-     type: file
-     disableDeletion: false
-     updateIntervalSeconds: 10
-     allowUiUpdates: true
-     options:
-       path: /etc/grafana/provisioning/dashboards
-       foldersFromFilesStructure: true
-
-   - name: 'Production Dashboards'
-     orgId: 1
-     folder: 'Production'
-     type: file
-     options:
-       path: /etc/grafana/dashboards/production
-
-   - name: 'SLO Dashboards'
-     orgId: 1
-     folder: 'SLO'
-     type: file
-     options:
-       path: /etc/grafana/dashboards/slo
- ```
-
- ### SLI/SLO Tracking
-
- #### SLO Definition (YAML)
- ```yaml
- # slo-definitions.yml
- slos:
-   - name: api-availability
-     description: "API endpoint availability"
-     sli:
-       metric: http_requests_total
-       success_criteria: status=~"2..|3.."
-       total_criteria: status=~".*"
-     objectives:
-       - target: 0.999 # 99.9% availability
-         window: 30d
-       - target: 0.99
-         window: 7d
-     error_budget:
-       policy: burn_rate
-       notification_threshold: 0.10 # Alert at 10% budget consumed
-     labels:
-       team: backend
-       priority: P0
-
-   - name: api-latency
-     description: "API response time P95 < 500ms"
-     sli:
-       metric: http_request_duration_seconds_bucket
-       percentile: 0.95
-       threshold: 0.5 # 500ms
-     objectives:
-       - target: 0.99
-         window: 30d
-     labels:
-       team: backend
-       priority: P1
-
-   - name: data-freshness
-     description: "Data updated within 5 minutes"
-     sli:
-       metric: data_last_update_timestamp_seconds
-       threshold: 300 # 5 minutes
-     objectives:
-       - target: 0.95
-         window: 30d
-     labels:
-       team: data-platform
-       priority: P2
- ```
-
- #### SLO Dashboard Query (PromQL)
- ```promql
- # Availability SLO
- (
-   sum(rate(http_requests_total{status=~"2..|3.."}[30d]))
-   /
-   sum(rate(http_requests_total[30d]))
- )
-
- # Error budget remaining (%)
- (
-   1 - (
-     (1 - sum(rate(http_requests_total{status=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d])))
-     / (1 - 0.999) # 99.9% SLO
-   )
- ) * 100
-
- # Burn rate (how fast error budget is consumed)
- (
-   sum(rate(http_requests_total{status=~"5.."}[1h]))
-   /
-   sum(rate(http_requests_total[1h]))
- ) / (1 - 0.999) * 30 # Normalized to 30-day window
- ```
-
- ### Application Instrumentation
-
- #### Node.js with Prometheus Client
- ```javascript
- // metrics.js
- const promClient = require('prom-client');
-
- // Create registry
- const register = new promClient.Registry();
-
- // Default metrics (CPU, memory, etc.)
- promClient.collectDefaultMetrics({ register });
-
- // Custom metrics
- const httpRequestDuration = new promClient.Histogram({
-   name: 'http_request_duration_seconds',
-   help: 'Duration of HTTP requests in seconds',
-   labelNames: ['method', 'path', 'status'],
-   buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
- });
-
- const httpRequestTotal = new promClient.Counter({
-   name: 'http_requests_total',
-   help: 'Total number of HTTP requests',
-   labelNames: ['method', 'path', 'status']
- });
-
- const activeConnections = new promClient.Gauge({
-   name: 'active_connections',
-   help: 'Number of active connections'
- });
-
- const dbQueryDuration = new promClient.Histogram({
-   name: 'db_query_duration_seconds',
-   help: 'Database query duration',
-   labelNames: ['query_type', 'table'],
-   buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]
- });
-
- register.registerMetric(httpRequestDuration);
- register.registerMetric(httpRequestTotal);
- register.registerMetric(activeConnections);
- register.registerMetric(dbQueryDuration);
-
- // Middleware
- const metricsMiddleware = (req, res, next) => {
-   const start = Date.now();
-
-   res.on('finish', () => {
-     const duration = (Date.now() - start) / 1000;
-     const labels = {
-       method: req.method,
-       path: req.route?.path || req.path,
-       status: res.statusCode
-     };
-
-     httpRequestDuration.observe(labels, duration);
-     httpRequestTotal.inc(labels);
-   });
-
-   next();
- };
-
- // Metrics endpoint
- app.get('/metrics', async (req, res) => {
-   res.set('Content-Type', register.contentType);
-   res.end(await register.metrics());
- });
-
- module.exports = {
-   metricsMiddleware,
-   httpRequestDuration,
-   httpRequestTotal,
-   activeConnections,
-   dbQueryDuration
- };
- ```
-
- #### Go with Prometheus Client
- ```go
- package metrics
-
- import (
-     "github.com/prometheus/client_golang/prometheus"
-     "github.com/prometheus/client_golang/prometheus/promauto"
-     "github.com/prometheus/client_golang/prometheus/promhttp"
-     "net/http"
- )
-
- var (
-     httpRequestsTotal = promauto.NewCounterVec(
-         prometheus.CounterOpts{
-             Name: "http_requests_total",
-             Help: "Total number of HTTP requests",
-         },
-         []string{"method", "path", "status"},
-     )
-
-     httpRequestDuration = promauto.NewHistogramVec(
-         prometheus.HistogramOpts{
-             Name: "http_request_duration_seconds",
-             Help: "Duration of HTTP requests",
-             Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 5},
-         },
-         []string{"method", "path", "status"},
-     )
-
-     activeConnections = promauto.NewGauge(
-         prometheus.GaugeOpts{
-             Name: "active_connections",
-             Help: "Number of active connections",
-         },
-     )
- )
-
- // Middleware
- func MetricsMiddleware(next http.Handler) http.Handler {
-     return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-         timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, r.URL.Path, ""))
-
-         ww := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
-         next.ServeHTTP(ww, r)
-
-         timer.ObserveDuration()
-         httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, fmt.Sprintf("%d", ww.statusCode)).Inc()
-     })
- }
-
- // Metrics handler
- func Handler() http.Handler {
-     return promhttp.Handler()
- }
- ```
-
- ### Distributed Tracing (Jaeger)
-
- #### OpenTelemetry Configuration
- ```javascript
- // tracing.js
- const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
- const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
- const { Resource } = require('@opentelemetry/resources');
- const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
-
- const provider = new NodeTracerProvider({
-   resource: new Resource({
-     [SemanticResourceAttributes.SERVICE_NAME]: 'api-server',
-     [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production'
-   })
- });
-
- const exporter = new JaegerExporter({
-   endpoint: 'http://jaeger:14268/api/traces',
- });
-
- provider.addSpanProcessor(
-   new BatchSpanProcessor(exporter)
- );
-
- provider.register();
-
- // Instrument HTTP
- const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
- const { registerInstrumentations } = require('@opentelemetry/instrumentation');
-
- registerInstrumentations({
-   instrumentations: [
-     new HttpInstrumentation(),
-   ],
- });
- ```
-
- ### Log Aggregation (Loki)
-
- #### Promtail Configuration
- ```yaml
- server:
-   http_listen_port: 9080
-   grpc_listen_port: 0
-
- positions:
-   filename: /tmp/positions.yaml
-
- clients:
-   - url: http://loki:3100/loki/api/v1/push
-
- scrape_configs:
-   - job_name: system
-     static_configs:
-       - targets:
-           - localhost
-         labels:
-           job: varlogs
-           __path__: /var/log/*log
-
-   - job_name: containers
-     docker_sd_configs:
-       - host: unix:///var/run/docker.sock
-         refresh_interval: 5s
-     relabel_configs:
-       - source_labels: ['__meta_docker_container_name']
-         target_label: 'container'
-       - source_labels: ['__meta_docker_container_log_stream']
-         target_label: 'stream'
- ```
-
- ## Validation Protocol
-
- Before reporting high confidence:
- ✅ Prometheus scraping all targets successfully
- ✅ Alerting rules validated with promtool
- ✅ Grafana dashboards render correctly
- ✅ SLO tracking configured and accurate
- ✅ All critical services have health checks
- ✅ Alert notification channels tested
- ✅ Runbooks created for alerts
- ✅ Metrics retention policy configured
- ✅ Backup and disaster recovery tested
- ✅ Performance baseline established
-
- ## Deliverables
-
- 1. **Prometheus Configuration**: Complete prometheus.yml with all targets
- 2. **Alerting Rules**: Comprehensive alert definitions
- 3. **Grafana Dashboards**: Service, infrastructure, and SLO dashboards
- 4. **SLO Definitions**: Documented SLI/SLO/error budgets
- 5. **Application Instrumentation**: Metrics libraries integrated
- 6. **Runbooks**: Incident response procedures
- 7. **Documentation**: Monitoring architecture, metrics catalog
-
- ## Success Metrics
- - All services instrumented (100% coverage)
- - Alert false positive rate <5%
- - Dashboard load time <2 seconds
- - SLO tracking accurate within 0.1%
- - Confidence score ≥ 0.90
-
- ## Completion Protocol
-
- Complete your work and provide a structured response with:
- - Confidence score (0.0-1.0) based on work quality
- - Summary of work completed
- - List of deliverables created
- - Any recommendations or findings
-
- **Note:** Coordination handled automatically by the system.
-
- ## Skill References
-
- ### Test-Driven Development
- → **JSON Validation**: `.claude/skills/json-validation/SKILL.md` - Defensive AGENT_SUCCESS_CRITERIA parsing with injection prevention
- → **Test Runner**: `.claude/skills/cfn-test-runner/SKILL.md` - Unified test execution with benchmarking and regression detection
-
- ### Monitoring & Observability
- → **Prometheus Setup**: `.claude/skills/prometheus-monitoring/SKILL.md`
- → **Grafana Dashboards**: `.claude/skills/grafana-dashboard-creation/SKILL.md`
- → **SLO Tracking**: `.claude/skills/slo-management/SKILL.md`
- → **Distributed Tracing**: `.claude/skills/opentelemetry-tracing/SKILL.md`
-
- ### Coordination & Data Analysis
- → **Redis Data Extraction**: `.claude/skills/cfn-redis-data-extraction/SKILL.md` - Extract and analyze CFN Loop coordination data