@neyugn/agent-kits 0.1.0

Files changed (158)
  1. package/LICENSE +21 -0
  2. package/README.md +514 -0
  3. package/README.vi.md +410 -0
  4. package/README.zh.md +410 -0
  5. package/dist/cli.d.ts +1 -0
  6. package/dist/cli.js +422 -0
  7. package/kits/coder/ARCHITECTURE.md +289 -0
  8. package/kits/coder/agents/ai-engineer.md +344 -0
  9. package/kits/coder/agents/backend-specialist.md +270 -0
  10. package/kits/coder/agents/cloud-architect.md +363 -0
  11. package/kits/coder/agents/code-reviewer.md +284 -0
  12. package/kits/coder/agents/data-engineer.md +401 -0
  13. package/kits/coder/agents/database-specialist.md +251 -0
  14. package/kits/coder/agents/debugger.md +209 -0
  15. package/kits/coder/agents/devops-engineer.md +281 -0
  16. package/kits/coder/agents/documentation-writer.md +296 -0
  17. package/kits/coder/agents/frontend-specialist.md +298 -0
  18. package/kits/coder/agents/i18n-specialist.md +348 -0
  19. package/kits/coder/agents/integration-specialist.md +314 -0
  20. package/kits/coder/agents/mobile-developer.md +271 -0
  21. package/kits/coder/agents/multi-tenant-architect.md +281 -0
  22. package/kits/coder/agents/orchestrator.md +263 -0
  23. package/kits/coder/agents/performance-analyst.md +327 -0
  24. package/kits/coder/agents/project-planner.md +277 -0
  25. package/kits/coder/agents/queue-specialist.md +282 -0
  26. package/kits/coder/agents/realtime-specialist.md +267 -0
  27. package/kits/coder/agents/security-auditor.md +253 -0
  28. package/kits/coder/agents/test-engineer.md +315 -0
  29. package/kits/coder/agents/ux-researcher.md +388 -0
  30. package/kits/coder/rules/.cursorrules +287 -0
  31. package/kits/coder/rules/CLAUDE.md +287 -0
  32. package/kits/coder/rules/CODEX.md +287 -0
  33. package/kits/coder/rules/GEMINI.md +287 -0
  34. package/kits/coder/scripts/checklist.py +318 -0
  35. package/kits/coder/scripts/kit_status.py +292 -0
  36. package/kits/coder/scripts/skills_manager.py +243 -0
  37. package/kits/coder/scripts/verify_all.py +391 -0
  38. package/kits/coder/skills/accessibility-patterns/SKILL.md +372 -0
  39. package/kits/coder/skills/accessibility-patterns/scripts/a11y_checker.py +211 -0
  40. package/kits/coder/skills/ai-rag-patterns/SKILL.md +444 -0
  41. package/kits/coder/skills/api-patterns/SKILL.md +316 -0
  42. package/kits/coder/skills/api-patterns/assets/.gitkeep +1 -0
  43. package/kits/coder/skills/api-patterns/references/deep-dive.md +21 -0
  44. package/kits/coder/skills/api-patterns/scripts/api_validator.py +253 -0
  45. package/kits/coder/skills/api-patterns/scripts/validate.py +56 -0
  46. package/kits/coder/skills/auth-patterns/SKILL.md +267 -0
  47. package/kits/coder/skills/aws-patterns/SKILL.md +576 -0
  48. package/kits/coder/skills/brainstorming/SKILL.md +370 -0
  49. package/kits/coder/skills/brainstorming/assets/.gitkeep +1 -0
  50. package/kits/coder/skills/brainstorming/references/deep-dive.md +21 -0
  51. package/kits/coder/skills/brainstorming/scripts/validate.py +56 -0
  52. package/kits/coder/skills/clean-code/SKILL.md +240 -0
  53. package/kits/coder/skills/clean-code/assets/.gitkeep +1 -0
  54. package/kits/coder/skills/clean-code/references/deep-dive.md +21 -0
  55. package/kits/coder/skills/clean-code/scripts/lint_runner.py +186 -0
  56. package/kits/coder/skills/clean-code/scripts/validate.py +56 -0
  57. package/kits/coder/skills/database-design/SKILL.md +255 -0
  58. package/kits/coder/skills/database-design/assets/.gitkeep +1 -0
  59. package/kits/coder/skills/database-design/references/deep-dive.md +21 -0
  60. package/kits/coder/skills/database-design/scripts/schema_validator.py +272 -0
  61. package/kits/coder/skills/database-design/scripts/validate.py +56 -0
  62. package/kits/coder/skills/docker-patterns/SKILL.md +240 -0
  63. package/kits/coder/skills/documentation-templates/SKILL.md +441 -0
  64. package/kits/coder/skills/e2e-testing/SKILL.md +457 -0
  65. package/kits/coder/skills/flutter-patterns/SKILL.md +330 -0
  66. package/kits/coder/skills/frontend-design/SKILL.md +127 -0
  67. package/kits/coder/skills/github-actions/SKILL.md +349 -0
  68. package/kits/coder/skills/gitlab-ci-patterns/SKILL.md +466 -0
  69. package/kits/coder/skills/graphql-patterns/SKILL.md +558 -0
  70. package/kits/coder/skills/i18n-localization/SKILL.md +345 -0
  71. package/kits/coder/skills/i18n-localization/scripts/i18n_checker.py +267 -0
  72. package/kits/coder/skills/kubernetes-patterns/SKILL.md +357 -0
  73. package/kits/coder/skills/mermaid-diagrams/SKILL.md +351 -0
  74. package/kits/coder/skills/mobile-design/SKILL.md +305 -0
  75. package/kits/coder/skills/monitoring-observability/SKILL.md +458 -0
  76. package/kits/coder/skills/multi-tenancy/SKILL.md +317 -0
  77. package/kits/coder/skills/multi-tenancy/assets/.gitkeep +1 -0
  78. package/kits/coder/skills/multi-tenancy/references/deep-dive.md +21 -0
  79. package/kits/coder/skills/multi-tenancy/scripts/validate.py +56 -0
  80. package/kits/coder/skills/nodejs-best-practices/SKILL.md +220 -0
  81. package/kits/coder/skills/performance-profiling/SKILL.md +333 -0
  82. package/kits/coder/skills/performance-profiling/assets/.gitkeep +1 -0
  83. package/kits/coder/skills/performance-profiling/references/deep-dive.md +21 -0
  84. package/kits/coder/skills/performance-profiling/scripts/validate.py +56 -0
  85. package/kits/coder/skills/plan-writing/SKILL.md +360 -0
  86. package/kits/coder/skills/plan-writing/assets/.gitkeep +1 -0
  87. package/kits/coder/skills/plan-writing/references/deep-dive.md +21 -0
  88. package/kits/coder/skills/plan-writing/scripts/validate.py +56 -0
  89. package/kits/coder/skills/postgres-patterns/SKILL.md +361 -0
  90. package/kits/coder/skills/prompt-engineering/SKILL.md +277 -0
  91. package/kits/coder/skills/queue-patterns/SKILL.md +359 -0
  92. package/kits/coder/skills/queue-patterns/assets/.gitkeep +1 -0
  93. package/kits/coder/skills/queue-patterns/references/deep-dive.md +21 -0
  94. package/kits/coder/skills/queue-patterns/scripts/validate.py +56 -0
  95. package/kits/coder/skills/react-native-patterns/SKILL.md +393 -0
  96. package/kits/coder/skills/react-patterns/SKILL.md +319 -0
  97. package/kits/coder/skills/realtime-patterns/SKILL.md +506 -0
  98. package/kits/coder/skills/realtime-patterns/assets/.gitkeep +1 -0
  99. package/kits/coder/skills/realtime-patterns/references/deep-dive.md +21 -0
  100. package/kits/coder/skills/realtime-patterns/scripts/validate.py +56 -0
  101. package/kits/coder/skills/redis-patterns/SKILL.md +484 -0
  102. package/kits/coder/skills/security-fundamentals/SKILL.md +363 -0
  103. package/kits/coder/skills/security-fundamentals/assets/.gitkeep +1 -0
  104. package/kits/coder/skills/security-fundamentals/references/deep-dive.md +21 -0
  105. package/kits/coder/skills/security-fundamentals/scripts/security_scan.py +326 -0
  106. package/kits/coder/skills/security-fundamentals/scripts/validate.py +56 -0
  107. package/kits/coder/skills/seo-patterns/SKILL.md +262 -0
  108. package/kits/coder/skills/seo-patterns/scripts/seo_checker.py +211 -0
  109. package/kits/coder/skills/systematic-debugging/SKILL.md +478 -0
  110. package/kits/coder/skills/systematic-debugging/assets/.gitkeep +1 -0
  111. package/kits/coder/skills/systematic-debugging/references/deep-dive.md +21 -0
  112. package/kits/coder/skills/systematic-debugging/scripts/validate.py +56 -0
  113. package/kits/coder/skills/tailwind-patterns/SKILL.md +395 -0
  114. package/kits/coder/skills/terraform-patterns/SKILL.md +470 -0
  115. package/kits/coder/skills/testing-patterns/SKILL.md +285 -0
  116. package/kits/coder/skills/testing-patterns/assets/.gitkeep +1 -0
  117. package/kits/coder/skills/testing-patterns/references/deep-dive.md +21 -0
  118. package/kits/coder/skills/testing-patterns/scripts/test_runner.py +219 -0
  119. package/kits/coder/skills/testing-patterns/scripts/validate.py +56 -0
  120. package/kits/coder/skills/typescript-patterns/SKILL.md +417 -0
  121. package/kits/coder/skills/ui-ux-pro-max/SKILL.md +364 -0
  122. package/kits/coder/skills/ui-ux-pro-max/data/charts.csv +26 -0
  123. package/kits/coder/skills/ui-ux-pro-max/data/colors.csv +97 -0
  124. package/kits/coder/skills/ui-ux-pro-max/data/icons.csv +101 -0
  125. package/kits/coder/skills/ui-ux-pro-max/data/landing.csv +31 -0
  126. package/kits/coder/skills/ui-ux-pro-max/data/products.csv +97 -0
  127. package/kits/coder/skills/ui-ux-pro-max/data/prompts.csv +24 -0
  128. package/kits/coder/skills/ui-ux-pro-max/data/react-performance.csv +45 -0
  129. package/kits/coder/skills/ui-ux-pro-max/data/stacks/flutter.csv +53 -0
  130. package/kits/coder/skills/ui-ux-pro-max/data/stacks/html-tailwind.csv +56 -0
  131. package/kits/coder/skills/ui-ux-pro-max/data/stacks/nextjs.csv +53 -0
  132. package/kits/coder/skills/ui-ux-pro-max/data/stacks/nuxt-ui.csv +51 -0
  133. package/kits/coder/skills/ui-ux-pro-max/data/stacks/nuxtjs.csv +59 -0
  134. package/kits/coder/skills/ui-ux-pro-max/data/stacks/react-native.csv +52 -0
  135. package/kits/coder/skills/ui-ux-pro-max/data/stacks/react.csv +54 -0
  136. package/kits/coder/skills/ui-ux-pro-max/data/stacks/shadcn.csv +61 -0
  137. package/kits/coder/skills/ui-ux-pro-max/data/stacks/svelte.csv +54 -0
  138. package/kits/coder/skills/ui-ux-pro-max/data/stacks/swiftui.csv +51 -0
  139. package/kits/coder/skills/ui-ux-pro-max/data/stacks/vue.csv +50 -0
  140. package/kits/coder/skills/ui-ux-pro-max/data/styles.csv +59 -0
  141. package/kits/coder/skills/ui-ux-pro-max/data/typography.csv +58 -0
  142. package/kits/coder/skills/ui-ux-pro-max/data/ui-reasoning.csv +101 -0
  143. package/kits/coder/skills/ui-ux-pro-max/data/ux-guidelines.csv +100 -0
  144. package/kits/coder/skills/ui-ux-pro-max/data/web-interface.csv +31 -0
  145. package/kits/coder/skills/ui-ux-pro-max/scripts/__pycache__/core.cpython-314.pyc +0 -0
  146. package/kits/coder/skills/ui-ux-pro-max/scripts/__pycache__/design_system.cpython-314.pyc +0 -0
  147. package/kits/coder/skills/ui-ux-pro-max/scripts/core.py +257 -0
  148. package/kits/coder/skills/ui-ux-pro-max/scripts/design_system.py +488 -0
  149. package/kits/coder/skills/ui-ux-pro-max/scripts/search.py +76 -0
  150. package/kits/coder/workflows/.gitkeep +20 -0
  151. package/kits/coder/workflows/create.md +152 -0
  152. package/kits/coder/workflows/debug.md +223 -0
  153. package/kits/coder/workflows/deploy.md +283 -0
  154. package/kits/coder/workflows/orchestrate.md +243 -0
  155. package/kits/coder/workflows/plan.md +134 -0
  156. package/kits/coder/workflows/test.md +237 -0
  157. package/kits/coder/workflows/ui-ux-pro-max.md +109 -0
  158. package/package.json +49 -0
@@ -0,0 +1,458 @@
1
+ ---
2
+ name: monitoring-observability
3
+ description: Production monitoring, observability, and SRE patterns. Use when designing monitoring systems, implementing SLI/SLO, configuring alerting, or building observability infrastructure with Prometheus, Grafana, and modern tools.
4
+ allowed-tools: Read, Write, Edit, Glob, Grep
5
+ version: 2.0
6
+ ---
7
+
8
+ # Monitoring & Observability - SRE Patterns
9
+
10
+ > **Philosophy:** Observability is not about collecting metrics—it's about understanding system behavior.
11
+
12
+ ---
13
+
14
+ ## When to Use This Skill
15
+
16
+ | ✅ Use | ❌ Don't Use |
17
+ | -------------------------------- | ------------------------------- |
18
+ | Designing monitoring systems | Single ad-hoc dashboard |
19
+ | Defining SLI/SLO/SLA | Application feature development |
20
+ | Configuring alerting strategy | Local development debugging |
21
+ | Building observability pipelines | No access to telemetry data |
22
+ | Incident response workflow | Static reporting only |
23
+
24
+ ---
25
+
26
+ ## Core Rules (Non-Negotiable)
27
+
28
+ 1. **Four Golden Signals** - Latency, Traffic, Errors, Saturation
29
+ 2. **SLO-based alerting** - Alert on symptoms, not causes
30
+ 3. **No secrets in logs** - Redact sensitive data
31
+ 4. **Structured logging** - JSON, not unstructured text
32
+ 5. **Correlation required** - Link metrics, logs, traces
33
+
34
+ ---
35
+
36
+ ## Three Pillars of Observability
37
+
38
+ ```
39
+ ┌─────────────────────────────────────────────────────────────┐
40
+ │                        OBSERVABILITY                        │
41
+ ├─────────────────┬─────────────────┬─────────────────────────┤
42
+ │     METRICS     │      LOGS       │         TRACES          │
43
+ │                 │                 │                         │
44
+ │ • Aggregated    │ • Discrete      │ • Request-scoped        │
45
+ │ • Time-series   │ • Event-based   │ • Distributed           │
46
+ │ • Low overhead  │ • High detail   │ • Causality chain       │
47
+ │                 │                 │                         │
48
+ │ Prometheus      │ Loki/ELK        │ Jaeger/Zipkin           │
49
+ │ Victoria        │ Splunk          │ X-Ray                   │
50
+ │ DataDog         │ CloudWatch      │ OpenTelemetry           │
51
+ └─────────────────┴─────────────────┴─────────────────────────┘
52
+ ```
53
+
54
+ ---
55
+
56
+ ## Four Golden Signals
57
+
58
+ | Signal | What to Measure | Example Metrics |
59
+ | -------------- | -------------------------- | ----------------------------------- |
60
+ | **Latency** | Time to serve a request | `http_request_duration_seconds` |
61
+ | **Traffic** | Demand on your system | `http_requests_total` |
62
+ | **Errors** | Rate of failed requests | `http_requests_total{status=~"5.."}` |
63
+ | **Saturation** | Fullness of your resources | `container_memory_usage_bytes` |
64
+
65
+ ### RED Method (Request-focused)
66
+
67
+ - **Rate** - Requests per second
68
+ - **Errors** - Failed requests per second
69
+ - **Duration** - Time per request
70
+
71
+ ### USE Method (Resource-focused)
72
+
73
+ - **Utilization** - % time resource is busy
74
+ - **Saturation** - Queue length / pending work
75
+ - **Errors** - Error events count
76
+
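The RED method above can be illustrated with a small Python sketch that derives the three numbers from raw request records. The `Request` record, its field names, and the rule that a status of 500 or above counts as a failure are assumptions made for this example, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request; field names are illustrative."""
    duration_s: float
    status: int


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute Rate, Errors and Duration for requests seen in one window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    total = len(requests)
    # Assumption: only server-side failures (5xx) count as errors.
    failed = sum(1 for r in requests if r.status >= 500)
    mean_duration = sum(r.duration_s for r in requests) / total if total else 0.0
    return {
        "rate_per_s": total / window_s,
        "errors_per_s": failed / window_s,
        "mean_duration_s": mean_duration,
    }


sample = [Request(0.1, 200), Request(0.3, 200), Request(0.2, 500), Request(0.2, 200)]
metrics = red_metrics(sample, window_s=2.0)
print(metrics)
```

In practice a mean hides tail latency, so a percentile from a histogram is usually a better Duration signal than the average used here.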
77
+ ---
78
+
79
+ ## SLI/SLO/SLA Framework
80
+
81
+ ### Definitions
82
+
83
+ | Term | Definition | Example |
84
+ | ------- | ------------------------------------- | ------------------------------------ |
85
+ | **SLI** | Measurable indicator of service level | Share of requests served in < 200ms  |
86
+ | **SLO** | Target value for an SLI | 99% of requests < 200ms over 30 days |
87
+ | **SLA** | Contractual commitment with penalties | 99.9% availability or refund |
88
+
89
+ ### Error Budget
90
+
91
+ ```python
92
+ # Error budget calculation
93
+ slo = 0.999 # 99.9%
94
+ window_days = 30
95
+ total_minutes = window_days * 24 * 60
96
+
97
+ error_budget_minutes = total_minutes * (1 - slo)
98
+ # 43.2 minutes of allowed downtime per month
99
+ ```
100
+
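Building on the calculation above, a minimal sketch of tracking how much of the budget is left. The function name and the idea of feeding it observed downtime minutes are assumptions for illustration; a real setup would more likely derive this from request-based SLIs.

```python
def error_budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Return the unspent fraction of the error budget (negative once overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    budget_minutes = total_minutes * (1 - slo)
    return 1 - downtime_minutes / budget_minutes


# 10.8 minutes of downtime against a 43.2-minute budget spends a quarter of it.
remaining = error_budget_remaining(slo=0.999, window_days=30, downtime_minutes=10.8)
print(f"{remaining:.0%} of the error budget remains")
```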
101
+ ### Burn Rate Alerting
102
+
103
+ ```yaml
104
+ # Fast burn: 2% budget in 1 hour
105
+ - alert: HighErrorRate
106
+   expr: |
107
+     (
108
+       sum(rate(http_requests_total{status=~"5.."}[1h]))
109
+       /
110
+       sum(rate(http_requests_total[1h]))
111
+     ) > 0.001 * 14.4 # 14.4x burn rate
112
+   for: 2m
113
+   labels:
114
+     severity: critical
115
+ ```
116
+
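The `14.4` factor in the rule above follows from the comment "2% budget in 1 hour" if the SLO window is 30 days, as in the error budget example. The sketch below shows that arithmetic; the function names are hypothetical.

```python
def burn_rate_threshold(budget_fraction: float, slo_window_hours: float,
                        alert_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio at which the given burn rate is reached."""
    return (1 - slo) * burn_rate


# 2% of a 30-day budget spent in 1 hour: 0.02 * 720 / 1 = 14.4
burn_rate = burn_rate_threshold(0.02, slo_window_hours=30 * 24, alert_window_hours=1)
# For a 99.9% SLO this is an error ratio of 0.001 * 14.4 = 0.0144
threshold = error_ratio_threshold(0.999, burn_rate)
print(burn_rate, threshold)
```

A burn rate of 1 means the budget lasts exactly the SLO window, so 14.4 means the whole budget would be gone in about 50 hours.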
117
+ ---
118
+
119
+ ## Prometheus Patterns
120
+
121
+ ### Essential Metrics
122
+
123
+ ```yaml
124
+ # Counter: Only goes up
125
+ http_requests_total{method="GET", status="200"}
126
+
127
+ # Gauge: Can go up or down
128
+ current_connections
129
+ memory_usage_bytes
130
+
131
+ # Histogram: Buckets for distribution
132
+ http_request_duration_seconds_bucket{le="0.1"}
133
+ http_request_duration_seconds_bucket{le="0.5"}
134
+ http_request_duration_seconds_bucket{le="1"}
135
+
136
+ # Summary: Pre-calculated quantiles
137
+ http_request_duration_seconds{quantile="0.99"}
138
+ ```
139
+
140
+ ### PromQL Patterns
141
+
142
+ ```promql
143
+ # Rate of change (per-second)
144
+ rate(http_requests_total[5m])
145
+
146
+ # Increase over time window
147
+ increase(http_requests_total[1h])
148
+
149
+ # 99th percentile from histogram
150
+ histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
151
+
152
+ # Error rate percentage (aggregate both sides so their label sets match)
153
+ 100 * (
154
+   sum(rate(http_requests_total{status=~"5.."}[5m]))
155
+   /
156
+   sum(rate(http_requests_total[5m]))
157
+ )
158
+
159
+ # Top 5 by label
160
+ topk(5, sum by (endpoint) (rate(http_requests_total[5m])))
161
+
162
+ # Aggregation across instances
163
+ sum without(instance) (rate(http_requests_total[5m]))
164
+
165
+ # Prediction (linear regression)
166
+ predict_linear(disk_free_bytes[1h], 3600 * 4)
167
+ ```
168
+
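To make the `histogram_quantile` pattern above less opaque, here is a simplified Python sketch of the idea behind it: find the bucket containing the target rank and interpolate linearly inside it. It assumes cumulative bucket counts, a lowest bucket that starts at 0, and a final `+Inf` bucket. The bucket counts in the example are made up, and the real Prometheus function handles more edge cases.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must include a +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into +Inf: report the highest finite bound.
                return ordered[-2][0] if len(ordered) > 1 else math.nan
            if count == lower_count:
                # Empty bucket at rank 0: avoid dividing by zero.
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical counts for the le="0.1", "0.5", "1" and "+Inf" buckets.
observed = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.5, observed)
p95 = histogram_quantile(0.95, observed)
print(p50, p95)
```

The interpolation explains why bucket boundaries matter: the estimate can never be more precise than the width of the bucket the quantile lands in.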
169
+ ### Recording Rules
170
+
171
+ ```yaml
172
+ groups:
173
+   - name: sli_recording_rules
174
+     rules:
175
+       - record: job:http_request_latency_seconds:p99
176
+         expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
177
+
178
+       - record: job:http_error_rate:ratio
179
+         expr: |
180
+           sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
181
+           /
182
+           sum by (job) (rate(http_requests_total[5m]))
183
+ ```
184
+
185
+ ---
186
+
187
+ ## Alerting Strategy
188
+
189
+ ### Alert Priority Levels
190
+
191
+ | Level | Response Time | Channel | Example |
192
+ | ------------ | ------------- | ---------------- | --------------------------- |
193
+ | **Critical** | Immediate | PagerDuty + Call | Service down, data loss |
194
+ | **Warning** | 15-30 min | Slack | High latency, disk 80% |
195
+ | **Info** | Next business day | Email/Ticket | Certificate expiring in 30d |
196
+
197
+ ### Alerting Best Practices
198
+
199
+ ```yaml
200
+ groups:
201
+   - name: slo_alerts
202
+     rules:
203
+       # ✅ Good: Alert on symptoms (SLO breach)
204
+       - alert: HighLatencySLOBreach
205
+         expr: |
206
+           job:http_request_latency_seconds:p99 > 0.5
207
+         for: 5m
208
+         labels:
209
+           severity: warning
210
+         annotations:
211
+           summary: "P99 latency exceeds 500ms SLO"
212
+           runbook_url: "https://wiki/runbooks/high-latency"
213
+
214
+       # ❌ Bad: Alert on cause (CPU high)
215
+       # - alert: HighCPU
216
+       #   expr: node_cpu_usage > 80
217
+       #   # CPU can be high without user impact
218
+ ```
219
+
220
+ ### Reducing Alert Noise
221
+
222
+ | Problem | Solution |
223
+ | ---------------- | ------------------------------------ |
224
+ | Flapping alerts | Increase `for` duration |
225
+ | Too many alerts | Alert on SLOs, not individual causes |
226
+ | Duplicate alerts | Use `group_by` and aggregation |
227
+ | Weekend pages | Time-based routing, error budgets |
228
+ | Alert storms | Implement alerting hierarchy |
229
+
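The first row of the table, increasing the `for` duration to stop flapping, can be illustrated with a toy model. The sketch treats `for` as a number of consecutive evaluations in which the condition must hold. That is a simplification of how Prometheus measures it in wall-clock time, and the function name is hypothetical.

```python
def firing_states(condition: list[bool], hold: int) -> list[bool]:
    """For each evaluation, report whether the alert would be firing.

    The alert fires only after the condition has been true for `hold`
    consecutive evaluations, and resets as soon as it is false.
    """
    if hold < 1:
        raise ValueError("hold must be at least 1")
    streak = 0
    states = []
    for is_true in condition:
        streak = streak + 1 if is_true else 0
        states.append(streak >= hold)
    return states


flapping = [True, False, True, False, True, True, True, False]
noisy = firing_states(flapping, hold=1)   # fires on every blip
steady = firing_states(flapping, hold=3)  # fires only on the sustained breach
print(sum(noisy), sum(steady))
```

The trade-off is detection delay: a longer hold suppresses blips but also postpones the page for a real outage.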
230
+ ---
231
+
232
+ ## Structured Logging
233
+
234
+ ### Log Levels
235
+
236
+ | Level | Use Case | Example |
237
+ | --------- | ------------------------------------- | ----------------------------- |
238
+ | **ERROR** | Unhandled failures requiring action | Database connection failed |
239
+ | **WARN** | Concerning but handled situations | Retry succeeded on attempt 3 |
240
+ | **INFO** | Business-significant events | User registered, order placed |
241
+ | **DEBUG** | Technical details for troubleshooting | Query executed in 50ms |
242
+
243
+ ### Structured Log Format
244
+
245
+ ```typescript
246
+ const log = {
247
+   timestamp: "2024-01-15T10:30:00Z",
248
+   level: "INFO",
249
+   service: "order-service",
250
+   traceId: "abc123",
251
+   spanId: "def456",
252
+   userId: "user_789",
253
+   event: "order.created",
254
+   orderId: "order_123",
255
+   total: 99.99,
256
+   items: 3,
257
+   latencyMs: 45,
258
+ };
259
+ ```
260
+
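As a rough sketch of how the format above might be produced while honoring the "no secrets in logs" rule, the following uses only Python's standard `logging` and `json` modules. The `SENSITIVE_KEYS` set, the `fields` attribute passed through `extra`, and the hard-coded service name are assumptions for this example.

```python
import json
import logging
from datetime import datetime, timezone

# Assumption: keys treated as secrets; a real list would be project-specific.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key"}


def redact(fields: dict) -> dict:
    """Replace values of sensitive keys before they reach the log output."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-service",
            "event": record.getMessage(),
        }
        # Structured fields arrive via `extra={"fields": {...}}`.
        entry.update(redact(getattr(record, "fields", {})))
        return json.dumps(entry)


logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order.created", extra={"fields": {"orderId": "order_123", "token": "s3cr3t"}})
```

Redacting by key name is only a first line of defense, since secrets embedded in free-text messages would still pass through.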
261
+ ### Log Correlation Pattern
262
+
263
+ ```typescript
264
+ // Propagate trace context through all logs (assumes an Express-style `app`, a `logger` with `.child()`, and a `uuid()` helper)
265
+ app.use((req, res, next) => {
266
+   req.logger = logger.child({
267
+     traceId: req.headers["x-trace-id"] || uuid(),
268
+     spanId: uuid(),
269
+     requestId: req.id,
270
+     userId: req.user?.id,
271
+   });
272
+   next();
273
+ });
274
+ ```
275
+
276
+ ---
277
+
278
+ ## Distributed Tracing
279
+
280
+ ### OpenTelemetry Setup
281
+
282
+ ```typescript
283
+ import { NodeSDK } from "@opentelemetry/sdk-node";
284
+ import { JaegerExporter } from "@opentelemetry/exporter-jaeger";
285
+ import { Resource } from "@opentelemetry/resources";
286
+
287
+ const sdk = new NodeSDK({
288
+   resource: new Resource({
289
+     "service.name": "order-service",
290
+     "service.version": "1.0.0",
291
+   }),
292
+   traceExporter: new JaegerExporter({
293
+     endpoint: "http://jaeger:14268/api/traces",
294
+   }),
295
+ });
296
+
297
+ sdk.start();
298
+ ```
299
+
300
+ ### Span Attributes
301
+
302
+ ```typescript
303
+ import { trace, SpanStatusCode } from "@opentelemetry/api";
304
+
305
+ const tracer = trace.getTracer("order-service");
306
+
307
+ async function processOrder(orderId: string) {
308
+   return tracer.startActiveSpan("processOrder", async (span) => {
309
+     span.setAttribute("order.id", orderId);
310
+     span.setAttribute("order.total", 99.99);
311
+
312
+     try {
313
+       // ... business logic
314
+       span.setStatus({ code: SpanStatusCode.OK });
315
+     } catch (error) {
316
+       span.setStatus({
317
+         code: SpanStatusCode.ERROR,
318
+         message: error instanceof Error ? error.message : String(error),
319
+       });
320
+       span.recordException(error as Error);
321
+       throw error;
322
+     } finally {
323
+       span.end();
324
+     }
325
+   });
326
+ }
327
+ ```
328
+
329
+ ---
330
+
331
+ ## Grafana Dashboard Patterns
332
+
333
+ ### Dashboard Structure
334
+
335
+ ```
336
+ ├── Overview (Business KPIs)
337
+ │   ├── Revenue / Orders
338
+ │   ├── Active Users
339
+ │   └── Error Rate Summary
340
+
341
+ ├── Service Health (Per Service)
342
+ │   ├── Four Golden Signals
343
+ │   ├── SLI/SLO Status
344
+ │   └── Resource Utilization
345
+
346
+ ├── Infrastructure
347
+ │   ├── Node Metrics
348
+ │   ├── Container Stats
349
+ │   └── Database Performance
350
+
351
+ └── Debugging
352
+     ├── Trace Explorer
353
+     ├── Log Viewer
354
+     └── Error Breakdown
355
+ ```
356
+
357
+ ### Variable Templates
358
+
359
+ ```yaml
360
+ # Environment selector
361
+ - name: environment
362
+   type: query
363
+   query: label_values(up, environment)
364
+
365
+ # Service filter
366
+ - name: service
367
+   type: query
368
+   query: label_values(http_requests_total{environment="$environment"}, service)
369
+ ```
370
+
371
+ ---
372
+
373
+ ## Incident Response Workflow
374
+
375
+ ```
376
+ ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
377
+ │   Detect    │───▷│   Triage    │───▷│  Mitigate   │
378
+ │    Alert    │    │  Severity   │    │  Rollback   │
379
+ └─────────────┘    └─────────────┘    └─────────────┘
380
+
381
+
382
+ ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
383
+ │   Review    │◁───│   Resolve   │◁───│ Communicate │
384
+ │ Postmortem  │    │     Fix     │    │   Status    │
385
+ └─────────────┘    └─────────────┘    └─────────────┘
386
+ ```
387
+
388
+ ### Runbook Template
389
+
390
+ ```markdown
391
+ # Alert: [Alert Name]
392
+
393
+ ## Impact
394
+
395
+ - What services are affected?
396
+ - What is the user impact?
397
+
398
+ ## Quick Diagnosis
399
+
400
+ 1. Check dashboard: [link]
401
+ 2. Check recent deployments: [link]
402
+ 3. Check upstream dependencies
403
+
404
+ ## Mitigation Steps
405
+
406
+ 1. If caused by deployment → Rollback
407
+ 2. If caused by traffic → Scale up / rate limit
408
+ 3. If caused by dependency → Failover
409
+
410
+ ## Escalation
411
+
412
+ - On-call: @team-oncall
413
+ - Escalation: @team-lead
414
+ ```
415
+
416
+ ---
417
+
418
+ ## Anti-Patterns
419
+
420
+ | ❌ Don't | ✅ Do |
421
+ | ------------------------------- | ----------------------------------- |
422
+ | Alert on causes (CPU, memory) | Alert on symptoms (latency, errors) |
423
+ | Log everything at INFO | Use appropriate log levels |
424
+ | Unstructured log messages | JSON structured logging |
425
+ | Alert without runbook | Every alert has a runbook |
426
+ | Collect metrics without purpose | Define SLIs first, then instrument |
427
+ | Secret values in logs | Redact sensitive data |
428
+ | High-cardinality labels | Bounded label values |
429
+
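The last row, bounded label values instead of high-cardinality labels, is often the hardest to apply in practice. One possible approach, sketched below, is to collapse identifier-like URL path segments before using the path as a metric label. The `:id` placeholder, the regular expression, and the function name are assumptions for illustration.

```python
import re

# Assumption: numeric and UUID-shaped segments are identifiers, not route names.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def endpoint_label(path: str) -> str:
    """Collapse identifier-like path segments so the label set stays bounded."""
    segments = [":id" if _ID_SEGMENT.match(segment) else segment
                for segment in path.split("/")]
    return "/".join(segments)


print(endpoint_label("/users/42/orders/7"))
print(endpoint_label("/orders/123e4567-e89b-12d3-a456-426614174000"))
```

Where the web framework exposes the matched route template, using that directly is usually more reliable than pattern matching on the raw path.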
430
+ ---
431
+
432
+ ## Production Checklist
433
+
434
+ Before production:
435
+
436
+ - [ ] Four Golden Signals instrumented?
437
+ - [ ] SLIs/SLOs defined per service?
438
+ - [ ] Error budget tracking enabled?
439
+ - [ ] Structured logging implemented?
440
+ - [ ] Trace context propagating?
441
+ - [ ] Alerting hierarchy defined?
442
+ - [ ] Runbooks for all critical alerts?
443
+ - [ ] On-call rotation configured?
444
+
445
+ ---
446
+
447
+ ## Related Skills
448
+
449
+ | Need | Skill |
450
+ | ------------------- | ----------------------- |
451
+ | Kubernetes ops | `kubernetes-patterns` |
452
+ | CI/CD pipelines | `github-actions` |
453
+ | Performance tuning | `performance-profiling` |
454
+ | Security monitoring | `security-fundamentals` |
455
+
456
+ ---
457
+
458
+ > **Remember:** Good observability lets you answer questions you haven't thought of yet. Build for unknown-unknowns, not just known issues.