@neyugn/agent-kits 0.1.0
- package/LICENSE +21 -0
- package/README.md +514 -0
- package/README.vi.md +410 -0
- package/README.zh.md +410 -0
- package/dist/cli.d.ts +1 -0
- package/dist/cli.js +422 -0
- package/kits/coder/ARCHITECTURE.md +289 -0
- package/kits/coder/agents/ai-engineer.md +344 -0
- package/kits/coder/agents/backend-specialist.md +270 -0
- package/kits/coder/agents/cloud-architect.md +363 -0
- package/kits/coder/agents/code-reviewer.md +284 -0
- package/kits/coder/agents/data-engineer.md +401 -0
- package/kits/coder/agents/database-specialist.md +251 -0
- package/kits/coder/agents/debugger.md +209 -0
- package/kits/coder/agents/devops-engineer.md +281 -0
- package/kits/coder/agents/documentation-writer.md +296 -0
- package/kits/coder/agents/frontend-specialist.md +298 -0
- package/kits/coder/agents/i18n-specialist.md +348 -0
- package/kits/coder/agents/integration-specialist.md +314 -0
- package/kits/coder/agents/mobile-developer.md +271 -0
- package/kits/coder/agents/multi-tenant-architect.md +281 -0
- package/kits/coder/agents/orchestrator.md +263 -0
- package/kits/coder/agents/performance-analyst.md +327 -0
- package/kits/coder/agents/project-planner.md +277 -0
- package/kits/coder/agents/queue-specialist.md +282 -0
- package/kits/coder/agents/realtime-specialist.md +267 -0
- package/kits/coder/agents/security-auditor.md +253 -0
- package/kits/coder/agents/test-engineer.md +315 -0
- package/kits/coder/agents/ux-researcher.md +388 -0
- package/kits/coder/rules/.cursorrules +287 -0
- package/kits/coder/rules/CLAUDE.md +287 -0
- package/kits/coder/rules/CODEX.md +287 -0
- package/kits/coder/rules/GEMINI.md +287 -0
- package/kits/coder/scripts/checklist.py +318 -0
- package/kits/coder/scripts/kit_status.py +292 -0
- package/kits/coder/scripts/skills_manager.py +243 -0
- package/kits/coder/scripts/verify_all.py +391 -0
- package/kits/coder/skills/accessibility-patterns/SKILL.md +372 -0
- package/kits/coder/skills/accessibility-patterns/scripts/a11y_checker.py +211 -0
- package/kits/coder/skills/ai-rag-patterns/SKILL.md +444 -0
- package/kits/coder/skills/api-patterns/SKILL.md +316 -0
- package/kits/coder/skills/api-patterns/assets/.gitkeep +1 -0
- package/kits/coder/skills/api-patterns/references/deep-dive.md +21 -0
- package/kits/coder/skills/api-patterns/scripts/api_validator.py +253 -0
- package/kits/coder/skills/api-patterns/scripts/validate.py +56 -0
- package/kits/coder/skills/auth-patterns/SKILL.md +267 -0
- package/kits/coder/skills/aws-patterns/SKILL.md +576 -0
- package/kits/coder/skills/brainstorming/SKILL.md +370 -0
- package/kits/coder/skills/brainstorming/assets/.gitkeep +1 -0
- package/kits/coder/skills/brainstorming/references/deep-dive.md +21 -0
- package/kits/coder/skills/brainstorming/scripts/validate.py +56 -0
- package/kits/coder/skills/clean-code/SKILL.md +240 -0
- package/kits/coder/skills/clean-code/assets/.gitkeep +1 -0
- package/kits/coder/skills/clean-code/references/deep-dive.md +21 -0
- package/kits/coder/skills/clean-code/scripts/lint_runner.py +186 -0
- package/kits/coder/skills/clean-code/scripts/validate.py +56 -0
- package/kits/coder/skills/database-design/SKILL.md +255 -0
- package/kits/coder/skills/database-design/assets/.gitkeep +1 -0
- package/kits/coder/skills/database-design/references/deep-dive.md +21 -0
- package/kits/coder/skills/database-design/scripts/schema_validator.py +272 -0
- package/kits/coder/skills/database-design/scripts/validate.py +56 -0
- package/kits/coder/skills/docker-patterns/SKILL.md +240 -0
- package/kits/coder/skills/documentation-templates/SKILL.md +441 -0
- package/kits/coder/skills/e2e-testing/SKILL.md +457 -0
- package/kits/coder/skills/flutter-patterns/SKILL.md +330 -0
- package/kits/coder/skills/frontend-design/SKILL.md +127 -0
- package/kits/coder/skills/github-actions/SKILL.md +349 -0
- package/kits/coder/skills/gitlab-ci-patterns/SKILL.md +466 -0
- package/kits/coder/skills/graphql-patterns/SKILL.md +558 -0
- package/kits/coder/skills/i18n-localization/SKILL.md +345 -0
- package/kits/coder/skills/i18n-localization/scripts/i18n_checker.py +267 -0
- package/kits/coder/skills/kubernetes-patterns/SKILL.md +357 -0
- package/kits/coder/skills/mermaid-diagrams/SKILL.md +351 -0
- package/kits/coder/skills/mobile-design/SKILL.md +305 -0
- package/kits/coder/skills/monitoring-observability/SKILL.md +458 -0
- package/kits/coder/skills/multi-tenancy/SKILL.md +317 -0
- package/kits/coder/skills/multi-tenancy/assets/.gitkeep +1 -0
- package/kits/coder/skills/multi-tenancy/references/deep-dive.md +21 -0
- package/kits/coder/skills/multi-tenancy/scripts/validate.py +56 -0
- package/kits/coder/skills/nodejs-best-practices/SKILL.md +220 -0
- package/kits/coder/skills/performance-profiling/SKILL.md +333 -0
- package/kits/coder/skills/performance-profiling/assets/.gitkeep +1 -0
- package/kits/coder/skills/performance-profiling/references/deep-dive.md +21 -0
- package/kits/coder/skills/performance-profiling/scripts/validate.py +56 -0
- package/kits/coder/skills/plan-writing/SKILL.md +360 -0
- package/kits/coder/skills/plan-writing/assets/.gitkeep +1 -0
- package/kits/coder/skills/plan-writing/references/deep-dive.md +21 -0
- package/kits/coder/skills/plan-writing/scripts/validate.py +56 -0
- package/kits/coder/skills/postgres-patterns/SKILL.md +361 -0
- package/kits/coder/skills/prompt-engineering/SKILL.md +277 -0
- package/kits/coder/skills/queue-patterns/SKILL.md +359 -0
- package/kits/coder/skills/queue-patterns/assets/.gitkeep +1 -0
- package/kits/coder/skills/queue-patterns/references/deep-dive.md +21 -0
- package/kits/coder/skills/queue-patterns/scripts/validate.py +56 -0
- package/kits/coder/skills/react-native-patterns/SKILL.md +393 -0
- package/kits/coder/skills/react-patterns/SKILL.md +319 -0
- package/kits/coder/skills/realtime-patterns/SKILL.md +506 -0
- package/kits/coder/skills/realtime-patterns/assets/.gitkeep +1 -0
- package/kits/coder/skills/realtime-patterns/references/deep-dive.md +21 -0
- package/kits/coder/skills/realtime-patterns/scripts/validate.py +56 -0
- package/kits/coder/skills/redis-patterns/SKILL.md +484 -0
- package/kits/coder/skills/security-fundamentals/SKILL.md +363 -0
- package/kits/coder/skills/security-fundamentals/assets/.gitkeep +1 -0
- package/kits/coder/skills/security-fundamentals/references/deep-dive.md +21 -0
- package/kits/coder/skills/security-fundamentals/scripts/security_scan.py +326 -0
- package/kits/coder/skills/security-fundamentals/scripts/validate.py +56 -0
- package/kits/coder/skills/seo-patterns/SKILL.md +262 -0
- package/kits/coder/skills/seo-patterns/scripts/seo_checker.py +211 -0
- package/kits/coder/skills/systematic-debugging/SKILL.md +478 -0
- package/kits/coder/skills/systematic-debugging/assets/.gitkeep +1 -0
- package/kits/coder/skills/systematic-debugging/references/deep-dive.md +21 -0
- package/kits/coder/skills/systematic-debugging/scripts/validate.py +56 -0
- package/kits/coder/skills/tailwind-patterns/SKILL.md +395 -0
- package/kits/coder/skills/terraform-patterns/SKILL.md +470 -0
- package/kits/coder/skills/testing-patterns/SKILL.md +285 -0
- package/kits/coder/skills/testing-patterns/assets/.gitkeep +1 -0
- package/kits/coder/skills/testing-patterns/references/deep-dive.md +21 -0
- package/kits/coder/skills/testing-patterns/scripts/test_runner.py +219 -0
- package/kits/coder/skills/testing-patterns/scripts/validate.py +56 -0
- package/kits/coder/skills/typescript-patterns/SKILL.md +417 -0
- package/kits/coder/skills/ui-ux-pro-max/SKILL.md +364 -0
- package/kits/coder/skills/ui-ux-pro-max/data/charts.csv +26 -0
- package/kits/coder/skills/ui-ux-pro-max/data/colors.csv +97 -0
- package/kits/coder/skills/ui-ux-pro-max/data/icons.csv +101 -0
- package/kits/coder/skills/ui-ux-pro-max/data/landing.csv +31 -0
- package/kits/coder/skills/ui-ux-pro-max/data/products.csv +97 -0
- package/kits/coder/skills/ui-ux-pro-max/data/prompts.csv +24 -0
- package/kits/coder/skills/ui-ux-pro-max/data/react-performance.csv +45 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/flutter.csv +53 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/html-tailwind.csv +56 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/nextjs.csv +53 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/nuxt-ui.csv +51 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/nuxtjs.csv +59 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/react-native.csv +52 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/react.csv +54 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/shadcn.csv +61 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/svelte.csv +54 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/swiftui.csv +51 -0
- package/kits/coder/skills/ui-ux-pro-max/data/stacks/vue.csv +50 -0
- package/kits/coder/skills/ui-ux-pro-max/data/styles.csv +59 -0
- package/kits/coder/skills/ui-ux-pro-max/data/typography.csv +58 -0
- package/kits/coder/skills/ui-ux-pro-max/data/ui-reasoning.csv +101 -0
- package/kits/coder/skills/ui-ux-pro-max/data/ux-guidelines.csv +100 -0
- package/kits/coder/skills/ui-ux-pro-max/data/web-interface.csv +31 -0
- package/kits/coder/skills/ui-ux-pro-max/scripts/__pycache__/core.cpython-314.pyc +0 -0
- package/kits/coder/skills/ui-ux-pro-max/scripts/__pycache__/design_system.cpython-314.pyc +0 -0
- package/kits/coder/skills/ui-ux-pro-max/scripts/core.py +257 -0
- package/kits/coder/skills/ui-ux-pro-max/scripts/design_system.py +488 -0
- package/kits/coder/skills/ui-ux-pro-max/scripts/search.py +76 -0
- package/kits/coder/workflows/.gitkeep +20 -0
- package/kits/coder/workflows/create.md +152 -0
- package/kits/coder/workflows/debug.md +223 -0
- package/kits/coder/workflows/deploy.md +283 -0
- package/kits/coder/workflows/orchestrate.md +243 -0
- package/kits/coder/workflows/plan.md +134 -0
- package/kits/coder/workflows/test.md +237 -0
- package/kits/coder/workflows/ui-ux-pro-max.md +109 -0
- package/package.json +49 -0
@@ -0,0 +1,458 @@
---
name: monitoring-observability
description: Production monitoring, observability, and SRE patterns. Use when designing monitoring systems, implementing SLI/SLO, configuring alerting, or building observability infrastructure with Prometheus, Grafana, and modern tools.
allowed-tools: Read, Write, Edit, Glob, Grep
version: 2.0
---

# Monitoring & Observability - SRE Patterns

> **Philosophy:** Observability is not about collecting metrics; it's about understanding system behavior.

---

## When to Use This Skill

| ✅ Use                           | ❌ Don't Use                    |
| -------------------------------- | ------------------------------- |
| Designing monitoring systems     | Single ad-hoc dashboard         |
| Defining SLI/SLO/SLA             | Application feature development |
| Configuring alerting strategy    | Local development debugging     |
| Building observability pipelines | No access to telemetry data     |
| Incident response workflow       | Static reporting only           |

---

## Core Rules (Non-Negotiable)

1. **Four Golden Signals** - Latency, Traffic, Errors, Saturation
2. **SLO-based alerting** - Alert on symptoms, not causes
3. **No secrets in logs** - Redact sensitive data
4. **Structured logging** - JSON, not unstructured text
5. **Correlation required** - Link metrics, logs, traces

---

## Three Pillars of Observability

```
┌─────────────────────────────────────────────────────────────┐
│                        OBSERVABILITY                        │
├─────────────────┬─────────────────┬─────────────────────────┤
│     METRICS     │      LOGS       │         TRACES          │
│                 │                 │                         │
│  • Aggregated   │  • Discrete     │  • Request-scoped       │
│  • Time-series  │  • Event-based  │  • Distributed          │
│  • Low overhead │  • High detail  │  • Causality chain      │
│                 │                 │                         │
│  Prometheus     │  Loki/ELK       │  Jaeger/Zipkin          │
│  VictoriaMetrics│  Splunk         │  X-Ray                  │
│  Datadog        │  CloudWatch     │  OpenTelemetry          │
└─────────────────┴─────────────────┴─────────────────────────┘
```

---

## Four Golden Signals

| Signal         | What to Measure            | Example Metrics                      |
| -------------- | -------------------------- | ------------------------------------ |
| **Latency**    | Time to serve a request    | `http_request_duration_seconds`      |
| **Traffic**    | Demand on your system      | `http_requests_total`                |
| **Errors**     | Rate of failed requests    | `http_requests_total{status=~"5.."}` |
| **Saturation** | Fullness of your resources | `container_memory_usage_bytes`       |

### RED Method (Request-focused)

- **Rate** - Requests per second
- **Errors** - Failed requests per second
- **Duration** - Time per request

### USE Method (Resource-focused)

- **Utilization** - % time resource is busy
- **Saturation** - Queue length / pending work
- **Errors** - Error events count
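
The RED method above can be sketched from raw request records. This is a minimal illustration, not a replacement for a metrics library: the `Request` record shape, the `red_metrics` helper, and the choice to count status codes of 500 and above as errors are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for illustration)."""

    duration_s: float
    status: int


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration over a fixed observation window."""
    total = len(requests)
    # Assumption: only 5xx responses count as failed requests.
    errors = sum(1 for r in requests if r.status >= 500)
    mean_duration = sum(r.duration_s for r in requests) / total if total else 0.0
    return {
        "rate_per_s": total / window_s,
        "errors_per_s": errors / window_s,
        "mean_duration_s": mean_duration,
    }


sample = [Request(0.10, 200), Request(0.30, 200), Request(0.20, 503), Request(0.40, 200)]
metrics = red_metrics(sample, window_s=2.0)
```

In practice Duration is usually tracked as a distribution (histogram) rather than a mean, because averages hide tail latency.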

---

## SLI/SLO/SLA Framework

### Definitions

| Term    | Definition                            | Example                                |
| ------- | ------------------------------------- | -------------------------------------- |
| **SLI** | Measurable indicator of service level | Fraction of requests served in < 200ms |
| **SLO** | Target value for an SLI               | 99% of requests < 200ms over 30 days   |
| **SLA** | Contractual commitment with penalties | 99.9% availability or refund           |
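
To make the SLI/SLO distinction concrete, the sketch below measures a latency SLI and compares it against a target. The function names and the decision to treat an empty window as compliant are assumptions for illustration; the 200ms threshold and 99% target come from the table above.

```python
def latency_sli(durations_ms: list[float], threshold_ms: float = 200.0) -> float:
    """SLI: fraction of requests served faster than the threshold."""
    if not durations_ms:
        return 1.0  # Assumption: no traffic counts as meeting the objective.
    good = sum(1 for d in durations_ms if d < threshold_ms)
    return good / len(durations_ms)


def meets_slo(sli: float, target: float = 0.99) -> bool:
    """SLO: the measured SLI must be at or above the target."""
    return sli >= target


# 1000 requests in the window, 15 of them slower than 200ms.
durations = [120.0] * 985 + [450.0] * 15
sli = latency_sli(durations)
compliant = meets_slo(sli)
```

The SLI is the measurement (here 98.5% of requests were fast enough); the SLO is the statement that this number should stay at or above 99% over the window.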

### Error Budget

```python
# Error budget calculation
slo = 0.999  # 99.9%
window_days = 30
total_minutes = window_days * 24 * 60

error_budget_minutes = total_minutes * (1 - slo)
# 43.2 minutes of allowed downtime per month
```

### Burn Rate Alerting

```yaml
# Fast burn: 2% budget in 1 hour
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 0.001 * 14.4  # 14.4x burn rate
  for: 2m
  labels:
    severity: critical
```
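
The 14.4 multiplier in the rule above follows from the numbers in its comment: spending 2% of a 30-day budget within 1 hour means burning 0.02 × 720 times faster than the sustainable pace. A small sketch of that arithmetic, with hypothetical helper names:

```python
def burn_rate(budget_fraction: float, alert_window_h: float, slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the error budget within `alert_window_h`."""
    slo_window_h = slo_window_days * 24
    return budget_fraction * slo_window_h / alert_window_h


def error_ratio_threshold(slo: float, rate: float) -> float:
    """Error ratio at which the alert should fire: (1 - SLO) * burn rate."""
    return (1 - slo) * rate


# 2% of the budget in 1 hour over a 30-day window: about 14.4x.
fast_burn = burn_rate(budget_fraction=0.02, alert_window_h=1)
# For a 99.9% SLO this is an error ratio of about 1.44%.
threshold = error_ratio_threshold(slo=0.999, rate=fast_burn)
```

The same helper gives thresholds for slower windows, for example 5% of the budget in 6 hours works out to a burn rate of about 6.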

---

## Prometheus Patterns

### Essential Metrics

```yaml
# Counter: Only goes up
http_requests_total{method="GET", status="200"}

# Gauge: Can go up or down
current_connections
memory_usage_bytes

# Histogram: Buckets for distribution
http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="0.5"}
http_request_duration_seconds_bucket{le="1"}

# Summary: Pre-calculated quantiles
http_request_duration_seconds{quantile="0.99"}
```
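
Histogram buckets are cumulative: an observation increments every bucket whose `le` bound is at or above the value, plus the implicit `+Inf` bucket. The toy class below mirrors that layout in plain Python; it is an illustrative sketch, not the Prometheus client library, and the class and method names are assumptions.

```python
import math


class Histogram:
    """Toy histogram with cumulative `le` buckets, mirroring the Prometheus layout."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds) + [math.inf]  # the +Inf bucket always exists
        self.counts = [0] * len(self.bounds)
        self.total = 0.0  # plays the role of the `_sum` series

    def observe(self, value: float) -> None:
        self.total += value
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.counts[i] += 1  # every bucket with le >= value is incremented

    def buckets(self) -> dict[float, int]:
        return dict(zip(self.bounds, self.counts))


h = Histogram([0.1, 0.5, 1.0])
for duration in (0.05, 0.3, 0.3, 0.8, 2.0):
    h.observe(duration)
cumulative = h.buckets()
```

Because the counts are cumulative, the `+Inf` bucket always equals the total number of observations.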

### PromQL Patterns

```promql
# Rate of change (per-second)
rate(http_requests_total[5m])

# Increase over time window
increase(http_requests_total[1h])

# 99th percentile from histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate percentage
# (sum both sides: without aggregation, the division only matches series
# with identical labels, so 5xx series would be divided by themselves)
100 * (
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
)

# Top 5 by label
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

# Aggregation across instances
sum without(instance) (rate(http_requests_total[5m]))

# Prediction (linear regression): expected value 4 hours from now
predict_linear(disk_free_bytes[1h], 3600 * 4)
```
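
`histogram_quantile` estimates a quantile by finding the bucket that contains the target rank and interpolating linearly inside it. The function below is a simplified sketch of that idea over cumulative `(upper_bound, count)` pairs; it ignores edge cases the real implementation handles (such as negative bounds), and the function name is reused here only for readability.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    Simplified sketch of linear interpolation within the target bucket.
    The highest bucket is expected to be +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


observed = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
median = histogram_quantile(0.5, observed)
p99 = histogram_quantile(0.99, observed)
```

Here the 99th percentile falls in the `+Inf` bucket, so the estimate is capped at the highest finite bound. This is why bucket boundaries should bracket the latency targets you care about.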

### Recording Rules

```yaml
groups:
  - name: sli_recording_rules
    rules:
      - record: job:http_request_latency_seconds:p99
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

      - record: job:http_error_rate:ratio
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```

---

## Alerting Strategy

### Alert Priority Levels

| Level        | Response Time     | Channel          | Example                     |
| ------------ | ----------------- | ---------------- | --------------------------- |
| **Critical** | Immediate         | PagerDuty + Call | Service down, data loss     |
| **Warning**  | 15-30 min         | Slack            | High latency, disk 80%      |
| **Info**     | Next business day | Email/Ticket     | Certificate expiring in 30d |

### Alerting Best Practices

```yaml
groups:
  - name: slo_alerts
    rules:
      # ✅ Good: Alert on symptoms (SLO breach)
      - alert: HighLatencySLOBreach
        expr: |
          job:http_request_latency_seconds:p99 > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency exceeds 500ms SLO"
          runbook_url: "https://wiki/runbooks/high-latency"

      # ❌ Bad: Alert on cause (CPU high)
      # - alert: HighCPU
      #   expr: node_cpu_usage > 80
      #   # CPU can be high without user impact
```

### Reducing Alert Noise

| Problem          | Solution                             |
| ---------------- | ------------------------------------ |
| Flapping alerts  | Increase `for` duration              |
| Too many alerts  | Alert on SLOs, not individual causes |
| Duplicate alerts | Use `group_by` and aggregation       |
| Weekend pages    | Time-based routing, error budgets    |
| Alert storms     | Implement alerting hierarchy         |
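
The first row of the table can be illustrated with a small simulation. It models the `for` clause as a count of evaluation intervals the condition must keep holding before the alert fires. This is a simplified model written for illustration, and the `firing_steps` helper is hypothetical.

```python
def firing_steps(breaches: list[bool], for_steps: int) -> list[bool]:
    """Return, per evaluation, whether the alert is firing.

    Simplified model: the alert is pending from the first breach and fires once
    the condition has held for `for_steps` further consecutive evaluations.
    Any evaluation where the condition is false resets the pending state.
    """
    firing = []
    consecutive = 0
    for breached in breaches:
        consecutive = consecutive + 1 if breached else 0
        firing.append(consecutive > for_steps)
    return firing


# A condition that flaps three times before becoming a sustained breach.
flapping = [True, False, True, False, True, True, True, True]
without_hold = firing_steps(flapping, for_steps=0)
with_hold = firing_steps(flapping, for_steps=2)
```

With no hold the alert fires on every blip; with a hold of two intervals it fires only once the breach is sustained, at the cost of detecting real incidents slightly later.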

---

## Structured Logging

### Log Levels

| Level     | Use Case                              | Example                       |
| --------- | ------------------------------------- | ----------------------------- |
| **ERROR** | Unhandled failures requiring action   | Database connection failed    |
| **WARN**  | Concerning but handled situations     | Retry succeeded on attempt 3  |
| **INFO**  | Business-significant events           | User registered, order placed |
| **DEBUG** | Technical details for troubleshooting | Query executed in 50ms        |

### Structured Log Format

```typescript
const log = {
  timestamp: "2024-01-15T10:30:00Z",
  level: "INFO",
  service: "order-service",
  traceId: "abc123",
  spanId: "def456",
  userId: "user_789",
  event: "order.created",
  orderId: "order_123",
  total: 99.99,
  items: 3,
  latencyMs: 45,
};
```
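
The format above combines with the "no secrets in logs" rule when redaction happens inside the formatter, so no call site can bypass it. The sketch below uses only the Python standard library. The `SENSITIVE_KEYS` list and the convention of passing structured fields under `extra={"fields": ...}` are assumptions for this example, not an established API.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Hypothetical deny-list; a real service would maintain its own.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object and redact sensitive fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": record.name,
            "event": record.getMessage(),
        }
        # Structured fields arrive via `extra={"fields": {...}}` (a convention of this sketch).
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("order.created", extra={"fields": {"orderId": "order_123", "token": "s3cr3t"}})
line = stream.getvalue().strip()
```

Key-based redaction only catches fields with known names; secrets embedded in free-text messages still need care at the call site.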

### Log Correlation Pattern

```typescript
import { randomUUID } from "node:crypto";

// Propagate trace context through all logs.
// Assumes an Express-style `app` and a logger exposing `.child()` (e.g. pino).
app.use((req, res, next) => {
  req.logger = logger.child({
    traceId: req.headers["x-trace-id"] || randomUUID(),
    spanId: randomUUID(),
    requestId: req.id,
    userId: req.user?.id,
  });
  next();
});
```

---

## Distributed Tracing

### OpenTelemetry Setup

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
// Note: `@opentelemetry/exporter-jaeger` is deprecated upstream; Jaeger accepts
// OTLP natively, so an OTLP trace exporter is the usual choice for new setups.
import { JaegerExporter } from "@opentelemetry/exporter-jaeger";
import { Resource } from "@opentelemetry/resources";

const sdk = new NodeSDK({
  resource: new Resource({
    "service.name": "order-service",
    "service.version": "1.0.0",
  }),
  traceExporter: new JaegerExporter({
    endpoint: "http://jaeger:14268/api/traces",
  }),
});

sdk.start();
```

### Span Attributes

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

async function processOrder(orderId: string) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    span.setAttribute("order.id", orderId);
    span.setAttribute("order.total", 99.99);

    try {
      // ... business logic
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      // `error` is `unknown` under strict TypeScript, so normalize it first.
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err.message,
      });
      span.recordException(err);
      throw error;
    } finally {
      span.end();
    }
  });
}
```

---

## Grafana Dashboard Patterns

### Dashboard Structure

```
├── Overview (Business KPIs)
│   ├── Revenue / Orders
│   ├── Active Users
│   └── Error Rate Summary
│
├── Service Health (Per Service)
│   ├── Four Golden Signals
│   ├── SLI/SLO Status
│   └── Resource Utilization
│
├── Infrastructure
│   ├── Node Metrics
│   ├── Container Stats
│   └── Database Performance
│
└── Debugging
    ├── Trace Explorer
    ├── Log Viewer
    └── Error Breakdown
```

### Variable Templates

```yaml
# Environment selector
- name: environment
  type: query
  query: label_values(up, environment)

# Service filter
- name: service
  type: query
  query: label_values(http_requests_total{environment="$environment"}, service)
```

---

## Incident Response Workflow

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Detect    │───▷│   Triage    │───▷│  Mitigate   │
│   Alert     │    │  Severity   │    │  Rollback   │
└─────────────┘    └─────────────┘    └─────────────┘
                                             │
                                             ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Review    │◁───│   Resolve   │◁───│ Communicate │
│ Postmortem  │    │     Fix     │    │   Status    │
└─────────────┘    └─────────────┘    └─────────────┘
```

### Runbook Template

```markdown
# Alert: [Alert Name]

## Impact

- What services are affected?
- What is the user impact?

## Quick Diagnosis

1. Check dashboard: [link]
2. Check recent deployments: [link]
3. Check upstream dependencies

## Mitigation Steps

1. If caused by deployment → Rollback
2. If caused by traffic → Scale up / rate limit
3. If caused by dependency → Failover

## Escalation

- On-call: @team-oncall
- Escalation: @team-lead
```

---

## Anti-Patterns

| ❌ Don't                        | ✅ Do                               |
| ------------------------------- | ----------------------------------- |
| Alert on causes (CPU, memory)   | Alert on symptoms (latency, errors) |
| Log everything at INFO          | Use appropriate log levels          |
| Unstructured log messages       | JSON structured logging             |
| Alert without runbook           | Every alert has a runbook           |
| Collect metrics without purpose | Define SLIs first, then instrument  |
| Secret values in logs           | Redact sensitive data               |
| High-cardinality labels         | Bounded label values                |
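
The last row deserves an example, since raw request paths are a common source of unbounded label values. The sketch below collapses numeric and UUID path segments into a placeholder before the path is used as a label. The `endpoint_label` helper and the `:id` placeholder are illustrative choices, not a standard.

```python
import re

# Matches a purely numeric segment or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def endpoint_label(path: str) -> str:
    """Collapse numeric and UUID path segments so the label set stays bounded."""
    # Drop the query string, then normalize each segment independently.
    segments = path.split("?", 1)[0].strip("/").split("/")
    normalized = [":id" if _ID_SEGMENT.match(s) else s for s in segments]
    return "/" + "/".join(normalized)


paths = ("/users/1/orders/42", "/users/2/orders/77?page=3")
labels = {endpoint_label(p) for p in paths}
```

Both paths map to the same label value, so the number of time series grows with the number of routes rather than the number of users or orders. Where the web framework exposes the matched route template, using that directly is usually more reliable than pattern matching.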

---

## Production Checklist

Before production:

- [ ] Four Golden Signals instrumented?
- [ ] SLIs/SLOs defined per service?
- [ ] Error budget tracking enabled?
- [ ] Structured logging implemented?
- [ ] Trace context propagating?
- [ ] Alerting hierarchy defined?
- [ ] Runbooks for all critical alerts?
- [ ] On-call rotation configured?

---

## Related Skills

| Need                | Skill                   |
| ------------------- | ----------------------- |
| Kubernetes ops      | `kubernetes-patterns`   |
| CI/CD pipelines     | `github-actions`        |
| Performance tuning  | `performance-profiling` |
| Security monitoring | `security-fundamentals` |

---

> **Remember:** Good observability lets you answer questions you haven't thought of yet. Build for unknown-unknowns, not just known issues.