@techwavedev/agi-agent-kit 1.1.7 → 1.2.1
Potentially problematic release.
This version of @techwavedev/agi-agent-kit might be problematic.
- package/CHANGELOG.md +82 -1
- package/README.md +190 -12
- package/bin/init.js +30 -2
- package/package.json +6 -3
- package/templates/base/AGENTS.md +54 -23
- package/templates/base/README.md +325 -0
- package/templates/base/directives/memory_integration.md +95 -0
- package/templates/base/execution/memory_manager.py +309 -0
- package/templates/base/execution/session_boot.py +218 -0
- package/templates/base/execution/session_init.py +320 -0
- package/templates/base/skill-creator/SKILL_skillcreator.md +23 -36
- package/templates/base/skill-creator/scripts/init_skill.py +18 -135
- package/templates/skills/ec/README.md +31 -0
- package/templates/skills/ec/aws/SKILL.md +1020 -0
- package/templates/skills/ec/aws/defaults.yaml +13 -0
- package/templates/skills/ec/aws/references/common_patterns.md +80 -0
- package/templates/skills/ec/aws/references/mcp_servers.md +98 -0
- package/templates/skills/ec/aws-terraform/SKILL.md +349 -0
- package/templates/skills/ec/aws-terraform/references/best_practices.md +394 -0
- package/templates/skills/ec/aws-terraform/references/checkov_reference.md +337 -0
- package/templates/skills/ec/aws-terraform/scripts/configure_mcp.py +150 -0
- package/templates/skills/ec/confluent-kafka/SKILL.md +655 -0
- package/templates/skills/ec/confluent-kafka/references/ansible_playbooks.md +792 -0
- package/templates/skills/ec/confluent-kafka/references/ec_deployment.md +579 -0
- package/templates/skills/ec/confluent-kafka/references/kraft_migration.md +490 -0
- package/templates/skills/ec/confluent-kafka/references/troubleshooting.md +778 -0
- package/templates/skills/ec/confluent-kafka/references/upgrade_7x_to_8x.md +488 -0
- package/templates/skills/ec/confluent-kafka/scripts/kafka_health_check.py +435 -0
- package/templates/skills/ec/confluent-kafka/scripts/upgrade_preflight.py +568 -0
- package/templates/skills/ec/confluent-kafka/scripts/validate_config.py +455 -0
- package/templates/skills/ec/consul/SKILL.md +427 -0
- package/templates/skills/ec/consul/references/acl_setup.md +168 -0
- package/templates/skills/ec/consul/references/ha_config.md +196 -0
- package/templates/skills/ec/consul/references/troubleshooting.md +267 -0
- package/templates/skills/ec/consul/references/upgrades.md +213 -0
- package/templates/skills/ec/consul/scripts/consul_health_report.py +530 -0
- package/templates/skills/ec/consul/scripts/consul_status.py +264 -0
- package/templates/skills/ec/consul/scripts/generate_values.py +170 -0
- package/templates/skills/ec/documentation/SKILL.md +351 -0
- package/templates/skills/ec/documentation/references/best_practices.md +201 -0
- package/templates/skills/ec/documentation/scripts/analyze_code.py +307 -0
- package/templates/skills/ec/documentation/scripts/detect_changes.py +460 -0
- package/templates/skills/ec/documentation/scripts/generate_changelog.py +312 -0
- package/templates/skills/ec/documentation/scripts/sync_docs.py +272 -0
- package/templates/skills/ec/documentation/scripts/update_skill_docs.py +366 -0
- package/templates/skills/ec/gitlab/SKILL.md +529 -0
- package/templates/skills/ec/gitlab/references/agent_installation.md +416 -0
- package/templates/skills/ec/gitlab/references/api_reference.md +508 -0
- package/templates/skills/ec/gitlab/references/gitops_flux.md +465 -0
- package/templates/skills/ec/gitlab/references/troubleshooting.md +518 -0
- package/templates/skills/ec/gitlab/scripts/generate_agent_values.py +329 -0
- package/templates/skills/ec/gitlab/scripts/gitlab_agent_status.py +414 -0
- package/templates/skills/ec/jira/SKILL.md +484 -0
- package/templates/skills/ec/jira/references/jql_reference.md +148 -0
- package/templates/skills/ec/jira/scripts/add_comment.py +91 -0
- package/templates/skills/ec/jira/scripts/bulk_log_work.py +124 -0
- package/templates/skills/ec/jira/scripts/create_ticket.py +162 -0
- package/templates/skills/ec/jira/scripts/get_ticket.py +191 -0
- package/templates/skills/ec/jira/scripts/jira_client.py +383 -0
- package/templates/skills/ec/jira/scripts/log_work.py +154 -0
- package/templates/skills/ec/jira/scripts/search_tickets.py +104 -0
- package/templates/skills/ec/jira/scripts/update_comment.py +67 -0
- package/templates/skills/ec/jira/scripts/update_ticket.py +161 -0
- package/templates/skills/ec/karpenter/SKILL.md +301 -0
- package/templates/skills/ec/karpenter/references/ec2nodeclasses.md +421 -0
- package/templates/skills/ec/karpenter/references/migration.md +396 -0
- package/templates/skills/ec/karpenter/references/nodepools.md +400 -0
- package/templates/skills/ec/karpenter/references/troubleshooting.md +359 -0
- package/templates/skills/ec/karpenter/scripts/generate_ec2nodeclass.py +187 -0
- package/templates/skills/ec/karpenter/scripts/generate_nodepool.py +245 -0
- package/templates/skills/ec/karpenter/scripts/karpenter_status.py +359 -0
- package/templates/skills/ec/opensearch/SKILL.md +720 -0
- package/templates/skills/ec/opensearch/references/ml_neural_search.md +576 -0
- package/templates/skills/ec/opensearch/references/operator.md +532 -0
- package/templates/skills/ec/opensearch/references/query_dsl.md +532 -0
- package/templates/skills/ec/opensearch/scripts/configure_mcp.py +148 -0
- package/templates/skills/ec/victoriametrics/SKILL.md +598 -0
- package/templates/skills/ec/victoriametrics/references/kubernetes.md +531 -0
- package/templates/skills/ec/victoriametrics/references/prometheus_migration.md +333 -0
- package/templates/skills/ec/victoriametrics/references/troubleshooting.md +442 -0
- package/templates/skills/knowledge/SKILLS_CATALOG.md +274 -4
- package/templates/skills/knowledge/intelligent-routing/SKILL.md +237 -164
- package/templates/skills/knowledge/parallel-agents/SKILL.md +345 -73
- package/templates/skills/knowledge/plugin-discovery/SKILL.md +582 -0
- package/templates/skills/knowledge/plugin-discovery/scripts/platform_setup.py +1083 -0
- package/templates/skills/knowledge/design-md/README.md +0 -34
- package/templates/skills/knowledge/design-md/SKILL.md +0 -193
- package/templates/skills/knowledge/design-md/examples/DESIGN.md +0 -154
- package/templates/skills/knowledge/notebooklm-mcp/SKILL.md +0 -71
- package/templates/skills/knowledge/notebooklm-mcp/assets/example_asset.txt +0 -24
- package/templates/skills/knowledge/notebooklm-mcp/references/api_reference.md +0 -34
- package/templates/skills/knowledge/notebooklm-mcp/scripts/example.py +0 -19
- package/templates/skills/knowledge/react-components/README.md +0 -36
- package/templates/skills/knowledge/react-components/SKILL.md +0 -53
- package/templates/skills/knowledge/react-components/examples/gold-standard-card.tsx +0 -80
- package/templates/skills/knowledge/react-components/package-lock.json +0 -231
- package/templates/skills/knowledge/react-components/package.json +0 -16
- package/templates/skills/knowledge/react-components/resources/architecture-checklist.md +0 -15
- package/templates/skills/knowledge/react-components/resources/component-template.tsx +0 -37
- package/templates/skills/knowledge/react-components/resources/stitch-api-reference.md +0 -14
- package/templates/skills/knowledge/react-components/resources/style-guide.json +0 -27
- package/templates/skills/knowledge/react-components/scripts/fetch-stitch.sh +0 -30
- package/templates/skills/knowledge/react-components/scripts/validate.js +0 -68
- package/templates/skills/knowledge/self-update/SKILL.md +0 -60
- package/templates/skills/knowledge/self-update/scripts/update_kit.py +0 -103
- package/templates/skills/knowledge/stitch-loop/README.md +0 -54
- package/templates/skills/knowledge/stitch-loop/SKILL.md +0 -235
- package/templates/skills/knowledge/stitch-loop/examples/SITE.md +0 -73
- package/templates/skills/knowledge/stitch-loop/examples/next-prompt.md +0 -25
- package/templates/skills/knowledge/stitch-loop/resources/baton-schema.md +0 -61
- package/templates/skills/knowledge/stitch-loop/resources/site-template.md +0 -104
@@ -0,0 +1,442 @@

# VictoriaMetrics Troubleshooting Reference

Complete diagnostic and troubleshooting guide.

## Diagnostic Endpoints

### Health & Status

```bash
# Health check
curl http://localhost:8428/health

# Metrics (self-monitoring)
curl http://localhost:8428/metrics

# TSDB status
curl http://localhost:8428/api/v1/status/tsdb

# Active queries
curl http://localhost:8428/api/v1/status/active_queries

# Top queries by duration
curl http://localhost:8428/api/v1/status/top_queries

# Flags
curl http://localhost:8428/flags
```

### Cluster-Specific

```bash
# vmstorage status
curl http://vmstorage:8482/-/healthy

# vmselect status
curl http://vmselect:8481/health

# vminsert status
curl http://vminsert:8480/health

# Storage nodes from vmselect
curl http://vmselect:8481/api/v1/status/vmstorage
```

---

## Common Issues & Solutions

### Out of Memory (OOM)

**Symptoms:**

- Process killed by OOM killer
- Slow queries
- `vm_slow_row_inserts_total` increasing

**Diagnosis:**

```bash
# Check memory usage
curl -s http://localhost:8428/metrics | grep process_resident_memory
curl -s http://localhost:8428/metrics | grep vm_slow

# Check active series
curl http://localhost:8428/api/v1/status/tsdb
```
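
The raw byte count from `process_resident_memory_bytes` is easier to judge as a share of the memory limit. The sketch below is a hypothetical helper, not part of VictoriaMetrics: the function name `rss_percent` is made up for illustration, and it assumes `awk` is available and that the metric appears as an unlabelled line in the `/metrics` output.

```bash
# Hypothetical helper: print resident memory as a percentage of a limit.
# Reads Prometheus-format metrics text on stdin; the limit in bytes is $1.
rss_percent() {
  local limit_bytes="$1"
  awk -v limit="$limit_bytes" '
    # Remember the value of the unlabelled resident-memory line.
    $1 == "process_resident_memory_bytes" { rss = $2 + 0; found = 1 }
    END {
      # Fail if the metric is missing or the limit is not positive.
      if (!found || limit <= 0) exit 1
      printf "%.1f\n", rss / limit * 100
    }'
}
```

One possible invocation against a node with an 8 GiB limit: `curl -s http://localhost:8428/metrics | rss_percent "$((8 * 1024 * 1024 * 1024))"`.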

**Solutions:**

```bash
# 1. Reduce memory usage
-memory.allowedPercent=50  # Default is 60

# 2. Limit concurrent queries
-search.maxConcurrentRequests=8

# 3. Add RAM or scale out
```

---

### High Cardinality

**Symptoms:**

- Slow ingestion
- High memory usage
- `vm_hourly_series_limit_*` alerts

**Diagnosis:**

```bash
# Check cardinality
curl 'http://localhost:8428/api/v1/status/tsdb?topN=20'

# Metric with most series
curl 'http://localhost:8428/api/v1/status/tsdb?topN=10&focusLabel=__name__'

# High churn labels
curl 'http://localhost:8428/api/v1/status/tsdb?topN=10&focusLabel=pod'
```

**Solutions:**

```yaml
# 1. Drop high-cardinality labels in vmagent
relabel_configs:
  - action: labeldrop
    regex: (pod_template_hash|controller_revision_hash)

# 2. Set cardinality limits (command-line flags, not YAML keys)
# -storage.maxHourlySeries=5000000
# -storage.maxDailySeries=10000000

# 3. Use streaming aggregation to reduce cardinality
```

---

### Slow Queries

**Symptoms:**

- Query timeouts
- High latency in Grafana
- `vm_slow_*` metrics increasing

**Diagnosis:**

```bash
# Check current queries
curl http://localhost:8428/api/v1/status/active_queries

# Check top slow queries
curl http://localhost:8428/api/v1/status/top_queries

# Profile a query (URL-encode the query so curl does not glob the braces)
curl -G 'http://localhost:8428/api/v1/query' \
  --data-urlencode 'query=your_query{}' \
  -d 'trace=1'
```

**Solutions:**

```bash
# 1. Optimize query
# Use specific labels, limit time range

# 2. Increase workers for heavy queries
-search.maxWorkersPerQuery=16

# 3. Tune the response cache (recent data inside this offset bypasses it)
-search.cacheTimestampOffset=10m

# 4. Scale vmselect horizontally
```

---

### Disk Space Issues

**Symptoms:**

- Alerts on low disk space
- Insert failures
- Forced merge failures

**Diagnosis:**

```bash
# Check disk space
curl -s http://localhost:8428/metrics | grep vm_free_disk_space
curl -s http://localhost:8428/metrics | grep vm_data_size

# Check partition usage
df -h /var/lib/victoriametrics
```
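
To turn the `vm_free_disk_space_bytes` reading into a pass/fail check, something like the following could be used. This is a sketch under assumptions: `free_disk_status` is a hypothetical name, the threshold is interpreted in GiB with a default of 20 taken from the alert table in this reference, and each metric line is assumed to end with the sample value (no trailing timestamp).

```bash
# Hypothetical helper: report the smallest vm_free_disk_space_bytes value in GiB
# and flag it when it falls below a threshold (GiB, default 20).
# Reads Prometheus-format metrics text on stdin.
free_disk_status() {
  local threshold_gib="${1:-20}"
  awk -v threshold="$threshold_gib" '
    # Match the metric with or without labels, but not longer metric names.
    $1 ~ /^vm_free_disk_space_bytes([{]|$)/ {
      value = $NF + 0
      if (!seen || value < min) { min = value; seen = 1 }
    }
    END {
      if (!seen) { print "UNKNOWN"; exit 2 }
      gib = min / (1024 * 1024 * 1024)
      status = (gib < threshold) ? "LOW" : "OK"
      printf "%s %.1f GiB free\n", status, gib
      # Exit status 1 signals low disk space to the caller.
      exit (status == "LOW" ? 1 : 0)
    }'
}
```

It can be fed from the same endpoint as the commands above, for example `curl -s http://localhost:8428/metrics | free_disk_status 20`; the exit status is 0 for OK, 1 for LOW, and 2 when the metric is absent.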

**Solutions:**

```bash
# 1. Reduce retention
-retentionPeriod=30d

# 2. Enable downsampling (VictoriaMetrics Enterprise feature)
-downsampling.period=7d:1m,30d:5m,90d:1h

# 3. Delete unwanted series (URL-encode the selector so curl does not glob it)
curl -X POST 'http://localhost:8428/api/v1/admin/tsdb/delete_series' \
  --data-urlencode 'match[]={__name__="unwanted_metric"}'

# 4. Force merge to reclaim space
curl http://localhost:8428/internal/force_merge
```

---

### Replication Issues (Cluster)

**Symptoms:**

- Query returns incomplete data
- vmstorage nodes out of sync
- Alerts on replica lag

**Diagnosis:**

```bash
# Check storage node health
for i in 0 1 2; do
  curl http://vmstorage-$i:8482/-/healthy
done

# Check replication status from vmselect
curl http://vmselect:8481/api/v1/status/vmstorage
```

**Solutions:**

```bash
# 1. Set replicationFactor to the number of copies to keep;
#    it must not exceed the number of vmstorage nodes
-replicationFactor=2

# 2. Enable deduplication
-dedup.minScrapeInterval=15s

# 3. Check network connectivity between nodes
```

---

### Ingestion Delays

**Symptoms:**

- Data not appearing immediately
- Lag in dashboards

**Diagnosis:**

```bash
# Check pending rows
curl -s http://localhost:8428/metrics | grep vm_pending_rows

# Check insert rate
curl -s http://localhost:8428/metrics | grep vm_rows_inserted_total
```
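
`vm_rows_inserted_total` is a counter, so a single reading says little; the insert rate comes from the difference between two readings. A minimal sketch of that arithmetic follows. The function names `sum_metric` and `insert_rate` are hypothetical, and the code assumes each metric line ends with the sample value.

```bash
# Hypothetical helper: sum every series of metric NAME ($1)
# from Prometheus-format metrics text on stdin.
sum_metric() {
  awk -v name="$1" '
    # Strip the label set so labelled and unlabelled series both match.
    { key = $1; sub(/[{].*$/, "", key) }
    key == name { total += $NF + 0 }
    END { printf "%.0f\n", total }'
}

# Hypothetical helper: rows per second between two counter sums.
# Arguments: BEFORE AFTER SECONDS.
insert_rate() {
  awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN {
    # Reject a non-positive interval or a counter that went backwards (restart).
    if (dt <= 0 || b < a) exit 1
    printf "%.1f\n", (b - a) / dt
  }'
}
```

A possible way to use it: capture `curl -s http://localhost:8428/metrics | sum_metric vm_rows_inserted_total` twice, ten seconds apart, and pass both sums plus `10` to `insert_rate`.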

**Solutions:**

```bash
# 1. Force flush (for testing)
curl http://localhost:8428/internal/force_flush

# 2. Reduce flush interval
-inmemoryDataFlushInterval=5s

# 3. Scale vminsert for cluster
```

---

### Connection Issues

**Symptoms:**

- "connection refused" errors
- Timeouts from vmagent

**Diagnosis:**

```bash
# Test connectivity
curl -v http://victoriametrics:8428/health

# Check listening ports
netstat -tlnp | grep victoria

# Check firewall/security groups
```

**Solutions:**

```bash
# 1. Check bind address
-httpListenAddr=0.0.0.0:8428

# 2. Check Kubernetes service
kubectl get svc -n monitoring

# 3. Check network policies
kubectl get networkpolicy -n monitoring
```

---

### Data Loss Concerns

**Symptoms:**

- Missing data points
- Gaps in graphs

**Diagnosis:**

```bash
# Check for unclean shutdowns
grep -i "unclean\|crash\|oom" /var/log/victoriametrics.log

# Check insert success
curl -s http://localhost:8428/metrics | grep vm_http_request_errors_total
```

**Solutions:**

```bash
# 1. Ensure graceful shutdown
kill -INT $(pidof victoria-metrics)

# 2. Configure proper liveness probes in K8s
# 3. Regular backups with vmbackup
```

---

## Log Analysis

### Key Log Patterns

```bash
# Errors
grep -i "error\|warn\|fatal" /var/log/victoriametrics.log

# Slow operations
grep -i "slow" /var/log/victoriametrics.log

# OOM warnings
grep -i "memory\|oom\|cannot allocate" /var/log/victoriametrics.log

# Disk issues
grep -i "disk\|storage\|no space" /var/log/victoriametrics.log
```
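
When the log is large, a per-category count gives a quicker first impression than four separate grep runs. The sketch below is one way to do that in a single pass; `summarize_log` is a hypothetical name, and the patterns simply mirror the grep expressions above, so one line may be counted in several categories.

```bash
# Hypothetical helper: count log lines per category from a log on stdin.
# Matching is case-insensitive, like grep -i in the commands above.
summarize_log() {
  awk '
    { line = tolower($0) }
    line ~ /error|warn|fatal/           { errors++ }
    line ~ /slow/                       { slow++ }
    line ~ /memory|oom|cannot allocate/ { memory++ }
    line ~ /disk|storage|no space/      { disk++ }
    END {
      # Unset counters print as 0.
      printf "errors=%d slow=%d memory=%d disk=%d\n", errors, slow, memory, disk
    }'
}
```

For example: `summarize_log < /var/log/victoriametrics.log`.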

### Log Level

```bash
# Set the minimum log level (INFO is the default and the most verbose)
-loggerLevel=INFO  # Options: INFO, WARN, ERROR, FATAL, PANIC
```

---

## Profiling

### CPU Profile

```bash
# Quote the URL so the shell does not interpret the "?"
curl 'http://localhost:8428/debug/pprof/profile?seconds=30' > cpu.prof
go tool pprof cpu.prof
```

### Memory Profile

```bash
curl http://localhost:8428/debug/pprof/heap > heap.prof
go tool pprof heap.prof
```

### Goroutine Dump

```bash
curl 'http://localhost:8428/debug/pprof/goroutine?debug=1'
```

---

## Recovery Procedures

### Restore from Backup

```bash
# Stop VictoriaMetrics
systemctl stop victoriametrics

# Clear data directory
rm -rf /var/lib/victoriametrics/*

# Restore
vmrestore \
  -src=s3://bucket/backups/20240120 \
  -storageDataPath=/var/lib/victoriametrics

# Start VictoriaMetrics
systemctl start victoriametrics
```

### Repair Corrupted Data

```bash
# Stop VictoriaMetrics
systemctl stop victoriametrics

# Remove cache (safe)
rm -rf /var/lib/victoriametrics/cache

# Last resort if the index is corrupted: remove indexdb.
# Caution: take a backup first; series that were indexed only in
# indexdb may no longer be queryable afterwards.
rm -rf /var/lib/victoriametrics/indexdb

# Start VictoriaMetrics
systemctl start victoriametrics
```

---

## Metrics to Monitor

### Critical Metrics

| Metric                                   | Alert Threshold | Action                                     |
| ---------------------------------------- | --------------- | ------------------------------------------ |
| `vm_free_disk_space_bytes`               | < 20GB          | Expand disk or reduce retention            |
| `process_resident_memory_bytes`          | > 85% limit     | Add RAM or reduce `-memory.allowedPercent` |
| `rate(vm_slow_row_inserts_total[5m])`    | > 0             | RAM too low for cardinality                |
| `rate(vm_http_request_errors_total[5m])` | > 0             | Check logs for errors                      |
| `vm_active_merges`                       | High sustained  | Disk I/O bottleneck                        |

### Key Dashboard Queries

```promql
# Ingestion rate
rate(vm_rows_inserted_total[5m])

# Query rate
rate(vm_http_requests_total{path=~"/api/v1/query.*"}[5m])

# Memory usage
process_resident_memory_bytes / 1024 / 1024 / 1024

# Disk usage
vm_data_size_bytes / 1024 / 1024 / 1024

# Active time series
vm_cache_entries{type="storage/hour_metric_ids"}
```