loki-mode 5.42.2 → 5.46.0

@@ -0,0 +1,527 @@
1
+ # Metrics Guide
2
+
3
+ Prometheus and OpenMetrics monitoring for Loki Mode (v5.38.0).
4
+
5
+ ## Overview
6
+
7
+ Loki Mode exposes a `/metrics` endpoint that returns production-ready metrics in Prometheus/OpenMetrics text format. This enables integration with:
8
+
9
+ - Prometheus
10
+ - Grafana
11
+ - Datadog
12
+ - New Relic
13
+ - Elastic APM
14
+ - Any OpenMetrics-compatible monitoring system
15
+
16
+ ## Quick Start
17
+
18
+ ```bash
19
+ # Enable metrics endpoint
20
+ export LOKI_METRICS_ENABLED=true
21
+
22
+ # Start Loki Mode
23
+ loki start ./prd.md
24
+
25
+ # View metrics
26
+ curl http://localhost:57374/metrics
27
+
28
+ # Or use CLI
29
+ loki metrics
30
+ ```
31
+
32
+ ## Metrics Endpoint
33
+
34
+ ```
35
+ GET http://localhost:57374/metrics
36
+ Content-Type: text/plain; version=0.0.4
37
+ ```
38
+
39
+ Returns metrics in the Prometheus text exposition format (the `version=0.0.4` content type). No authentication is required by default; in production, put the endpoint behind a reverse proxy that enforces authentication.
40
+
41
+ ## Available Metrics
42
+
43
+ ### Session Metrics
44
+
45
+ | Metric | Type | Description |
46
+ |--------|------|-------------|
47
+ | `loki_session_status` | gauge | Current session status: 0=stopped, 1=running, 2=paused |
48
+ | `loki_iteration_current` | gauge | Current iteration number |
49
+ | `loki_iteration_max` | gauge | Maximum configured iterations (from LOKI_MAX_ITERATIONS) |
50
+ | `loki_uptime_seconds` | gauge | Seconds since session started |
51
+
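As a small illustration of how the session gauges can be read together, the sketch below maps the status code to a name and derives iteration progress. The status codes come from the table above; the function name and output format are hypothetical.

```python
# Status codes as listed in the Session Metrics table; the rest is illustrative.
SESSION_STATUS = {0: "stopped", 1: "running", 2: "paused"}


def describe_session(status: float, current: float, maximum: float) -> str:
    """Render the session gauges as a one-line summary."""
    name = SESSION_STATUS.get(int(status), "unknown")
    # Guard against a zero or missing iteration limit before dividing.
    percent = (current / maximum * 100) if maximum > 0 else 0.0
    return f"{name}: iteration {int(current)}/{int(maximum)} ({percent:.0f}%)"


print(describe_session(1, 12, 50))
```

The zero guard matters because `loki_iteration_max` may be unset or 0 before a session starts, which would otherwise raise a division error.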
52
+ ### Task Metrics
53
+
54
+ | Metric | Type | Labels | Description |
55
+ |--------|------|--------|-------------|
56
+ | `loki_tasks_total` | gauge | `status` | Number of tasks by status: pending, in_progress, completed, failed |
57
+
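The `status` label means `loki_tasks_total` appears once per status in the endpoint output. A minimal parsing sketch follows, assuming each sample line carries only the `status` label; the sample text and helper names are hypothetical, not captured output.

```python
import re

# Matches lines such as: loki_tasks_total{status="pending"} 3
# Assumption: `status` is the only label on the sample line.
TASK_LINE = re.compile(r'^loki_tasks_total\{status="([^"]+)"\}\s+(\S+)')


def tasks_by_status(metrics_text: str) -> dict:
    """Collect task counts keyed by status from exposition-format text."""
    counts = {}
    for line in metrics_text.splitlines():
        match = TASK_LINE.match(line)
        if match:
            counts[match.group(1)] = float(match.group(2))
    return counts


def failed_ratio(counts: dict) -> float:
    """Share of failed tasks across all statuses (0.0 when there are no tasks)."""
    total = sum(counts.values())
    return counts.get("failed", 0.0) / total if total else 0.0


# Hypothetical sample for illustration only.
SAMPLE = """\
# TYPE loki_tasks_total gauge
loki_tasks_total{status="pending"} 3
loki_tasks_total{status="in_progress"} 1
loki_tasks_total{status="completed"} 5
loki_tasks_total{status="failed"} 1
"""

print(tasks_by_status(SAMPLE))
print(failed_ratio(tasks_by_status(SAMPLE)))
```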
58
+ ### Agent Metrics
59
+
60
+ | Metric | Type | Description |
61
+ |--------|------|-------------|
62
+ | `loki_agents_active` | gauge | Number of currently active agents |
63
+ | `loki_agents_total` | gauge | Total number of registered agents |
64
+
65
+ ### Cost Metrics
66
+
67
+ | Metric | Type | Description |
68
+ |--------|------|-------------|
69
+ | `loki_cost_usd` | gauge | Estimated total session cost in USD |
70
+
71
+ ### Event Metrics
72
+
73
+ | Metric | Type | Description |
74
+ |--------|------|-------------|
75
+ | `loki_events_total` | counter | Total number of events recorded in events.jsonl |
76
+
77
+ ## Data Sources
78
+
79
+ Metrics are derived from `.loki/` flat files:
80
+
81
+ | File | Metrics |
82
+ |------|---------|
83
+ | `dashboard-state.json` | session_status, iteration_current, iteration_max, tasks_total, agents_active |
84
+ | `loki.pid` | session_status (PID alive check fallback), uptime_seconds |
85
+ | `state/agents.json` | agents_total |
86
+ | `metrics/efficiency/*.json` | cost_usd |
87
+ | `events.jsonl` | events_total (line count) |
88
+
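To make two of the derivations in the table concrete, here is a rough sketch of a line count over `events.jsonl` and a PID liveness check against `loki.pid`. It illustrates the idea only and is not Loki Mode's actual implementation; the function names are hypothetical, and the liveness check assumes a POSIX system.

```python
import os
from pathlib import Path


def count_events(loki_dir: Path) -> int:
    """Count non-empty lines in events.jsonl (0 if the file is missing)."""
    events = loki_dir / "events.jsonl"
    if not events.exists():
        return 0
    with events.open("r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def pid_alive(loki_dir: Path) -> bool:
    """Return True if the PID recorded in loki.pid refers to a live process."""
    try:
        pid = int((loki_dir / "loki.pid").read_text().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable, or malformed PID file
    try:
        os.kill(pid, 0)  # POSIX only: signal 0 probes the process without signalling it
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Calling `count_events(Path(".loki"))` and `pid_alive(Path(".loki"))` from a project directory would give values comparable to `loki_events_total` and the fallback for `loki_session_status`.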
89
+ ## CLI Usage
90
+
91
+ ```bash
92
+ # Fetch all metrics
93
+ loki metrics
94
+
95
+ # Filter specific metric
96
+ loki metrics | grep loki_cost_usd
97
+
98
+ # Watch metrics in real-time
99
+ watch -n 5 loki metrics
100
+
101
+ # Custom dashboard host/port
102
+ loki metrics --host 192.168.1.100 --port 8080
103
+ ```
104
+
105
+ ## Prometheus Configuration
106
+
107
+ ### Basic Scrape Config
108
+
109
+ Add to `prometheus.yml`:
110
+
111
+ ```yaml
+ scrape_configs:
+   - job_name: 'loki-mode'
+     scrape_interval: 15s
+     static_configs:
+       - targets: ['localhost:57374']
+         labels:
+           environment: 'production'
+           project: 'my-app'
120
+ ```
121
+
122
+ ### With TLS/HTTPS
123
+
124
+ ```yaml
+ scrape_configs:
+   - job_name: 'loki-mode'
+     scheme: https
+     tls_config:
+       insecure_skip_verify: true  # For self-signed certs
+     static_configs:
+       - targets: ['localhost:57374']
132
+ ```
133
+
134
+ ### With Authentication (via reverse proxy)
135
+
136
+ ```yaml
+ scrape_configs:
+   - job_name: 'loki-mode'
+     scheme: https
+     bearer_token: 'loki_xxx...'
+     static_configs:
+       - targets: ['dashboard.example.com:443']
143
+ ```
144
+
145
+ ### Service Discovery (Kubernetes)
146
+
147
+ ```yaml
+ scrape_configs:
+   - job_name: 'loki-mode'
+     kubernetes_sd_configs:
+       - role: pod
+         namespaces:
+           names:
+             - loki
+     relabel_configs:
+       - source_labels: [__meta_kubernetes_pod_label_app]
+         action: keep
+         regex: loki-mode
+       - source_labels: [__meta_kubernetes_pod_ip]
+         target_label: __address__
+         replacement: $1:57374
162
+ ```
163
+
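The second relabel rule in the Kubernetes config rewrites `__address__` from the pod IP. Prometheus applies the default regex `(.*)`, fully anchored, and substitutes `$1` in the replacement. The sketch below imitates that single step in Python to show the resulting scrape address; `relabel_replace` is a hypothetical helper, not part of any Prometheus client library.

```python
import re
from typing import Optional


def relabel_replace(source_value: str, regex: str = "(.*)",
                    replacement: str = "$1") -> Optional[str]:
    """Imitate a Prometheus `replace` relabel step on a single source label."""
    # Prometheus anchors the regex at both ends, so fullmatch is the closest analogue.
    match = re.fullmatch(regex, source_value)
    if match is None:
        return None  # on no match, Prometheus leaves the target label unchanged
    # Translate $1-style references into Python's \g<1> template syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return match.expand(template)


print(relabel_replace("10.0.0.5", replacement="$1:57374"))
```

For a pod with IP `10.0.0.5`, the rule yields the target `10.0.0.5:57374`, the default metrics port.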
164
+ ## Grafana Integration
165
+
166
+ ### Add Prometheus Data Source
167
+
168
+ 1. Navigate to Configuration > Data Sources
169
+ 2. Click "Add data source"
170
+ 3. Select "Prometheus"
171
+ 4. URL: `http://prometheus-server:9090`
172
+ 5. Save & Test
173
+
174
+ ### Create Dashboard
175
+
176
+ Import the Loki Mode dashboard template or create custom panels:
177
+
178
+ #### Panel 1: Session Status
179
+
180
+ - **Type:** Stat
181
+ - **Query:** `loki_session_status`
182
+ - **Value Mappings:**
+   - 0 = Stopped (Red)
+   - 1 = Running (Green)
+   - 2 = Paused (Yellow)
186
+
187
+ #### Panel 2: Iteration Progress
188
+
189
+ - **Type:** Gauge
190
+ - **Query:** `loki_iteration_current / loki_iteration_max * 100`
191
+ - **Unit:** Percent (0-100)
192
+ - **Thresholds:** 0-50 (yellow), 50-100 (green)
193
+
194
+ #### Panel 3: Task Distribution
195
+
196
+ - **Type:** Pie chart
197
+ - **Query:** `loki_tasks_total`
198
+ - **Legend:** `{{status}}`
199
+
200
+ #### Panel 4: Agent Activity
201
+
202
+ - **Type:** Time series
203
+ - **Query:** `loki_agents_active`
204
+ - **Legend:** Active Agents
205
+
206
+ #### Panel 5: Cost Tracking
207
+
208
+ - **Type:** Stat
209
+ - **Query:** `loki_cost_usd`
210
+ - **Unit:** Currency (USD)
211
+ - **Decimals:** 2
212
+
213
+ #### Panel 6: Event Rate
214
+
215
+ - **Type:** Time series
216
+ - **Query:** `rate(loki_events_total[5m])`
217
+ - **Legend:** Events per second
218
+
219
+ #### Panel 7: Uptime
220
+
221
+ - **Type:** Stat
222
+ - **Query:** `loki_uptime_seconds`
223
+ - **Unit:** Duration (seconds)
224
+
225
+ ### Example PromQL Queries
226
+
227
+ ```promql
228
+ # Session is running
229
+ loki_session_status == 1
230
+
231
+ # Iteration progress percentage
232
+ loki_iteration_current / loki_iteration_max * 100
233
+
234
+ # Total pending + in-progress tasks
235
+ loki_tasks_total{status="pending"} + loki_tasks_total{status="in_progress"}
236
+
237
+ # Cost per hour (loki_cost_usd is a gauge, so use delta() rather than rate())
+ delta(loki_cost_usd[1h])
239
+
240
+ # Event rate (events per minute)
241
+ rate(loki_events_total[5m]) * 60
242
+
243
+ # Tasks completed over the last 10 minutes (gauge, so delta() rather than rate())
+ delta(loki_tasks_total{status="completed"}[10m])
+
+ # Failed task ratio (aggregate both sides so the label sets match)
+ sum(loki_tasks_total{status="failed"}) / sum(loki_tasks_total)
248
+ ```
249
+
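Because `loki_cost_usd` is a gauge holding the running session total, a cost-per-hour figure comes down to the change in value divided by elapsed time. The sketch below works through that arithmetic on two samples; the function name and sample values are hypothetical.

```python
def cost_per_hour(samples: list) -> float:
    """Estimate hourly spend from (unix_seconds, loki_cost_usd) pairs, oldest first."""
    if len(samples) < 2:
        return 0.0  # a rate needs at least two observations
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return 0.0
    # USD per second, scaled to one hour.
    return (c1 - c0) / elapsed * 3600


# Hypothetical samples: cost rose from 1.00 to 1.50 USD over 30 minutes.
print(cost_per_hour([(0, 1.00), (1800, 1.50)]))
```

A rise of 0.50 USD in 30 minutes extrapolates to roughly 1.00 USD per hour.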
250
+ ## Datadog Integration
251
+
252
+ ### Configure OpenMetrics Check
253
+
254
+ Create `/etc/datadog-agent/conf.d/openmetrics.d/loki_mode.yaml`:
255
+
256
+ ```yaml
+ instances:
+   - prometheus_url: http://localhost:57374/metrics
+     namespace: loki
+     metrics:
+       - loki_session_status
+       - loki_iteration_current
+       - loki_iteration_max
+       - loki_tasks_total
+       - loki_agents_active
+       - loki_agents_total
+       - loki_cost_usd
+       - loki_events_total
+       - loki_uptime_seconds
+     tags:
+       - environment:production
+       - service:loki-mode
273
+ ```
274
+
275
+ Restart Datadog Agent:
276
+
277
+ ```bash
278
+ sudo systemctl restart datadog-agent
279
+ ```
280
+
281
+ ### Datadog Dashboards
282
+
283
+ View metrics in Datadog:
284
+ - Navigate to Dashboards > New Dashboard
285
+ - Add widgets with queries like `loki.loki_session_status`, `loki.loki_cost_usd` (the check prefixes each metric name with the configured `namespace`)
286
+ - Set up monitors for cost thresholds and session failures
287
+
288
+ ## Alerting
289
+
290
+ ### Prometheus Alert Rules
291
+
292
+ Create `loki_alerts.yml`:
293
+
294
+ ```yaml
+ groups:
+   - name: loki-mode
+     interval: 30s
+     rules:
+       - alert: LokiSessionDown
+         expr: loki_session_status == 0
+         for: 5m
+         labels:
+           severity: warning
+         annotations:
+           summary: "Loki Mode session is not running"
+           description: "Session has been stopped for more than 5 minutes"
+
+       - alert: LokiBudgetWarning
+         expr: loki_cost_usd > 4.00
+         labels:
+           severity: warning
+         annotations:
+           summary: "Loki Mode cost approaching budget limit"
+           description: "Current cost: ${{ $value }}"
+
+       - alert: LokiBudgetCritical
+         expr: loki_cost_usd > 4.50
+         labels:
+           severity: critical
+         annotations:
+           summary: "Loki Mode cost is within 10% of budget"
+           description: "Current cost: ${{ $value }}, budget: $5.00"
+
+       - alert: LokiStagnation
+         expr: changes(loki_iteration_current[30m]) == 0 and loki_session_status == 1
+         for: 10m
+         labels:
+           severity: critical
+         annotations:
+           summary: "Loki Mode iteration not progressing"
+           description: "No iteration progress in 30 minutes"
+
+       - alert: LokiHighFailureRate
+         expr: sum(loki_tasks_total{status="failed"}) / sum(loki_tasks_total) > 0.1
+         for: 5m
+         labels:
+           severity: warning
+         annotations:
+           summary: "High task failure rate"
+           description: "{{ $value | humanizePercentage }} of tasks are failing"
+
+       - alert: LokiTooManyAgents
+         expr: loki_agents_active > 50
+         for: 10m
+         labels:
+           severity: warning
+         annotations:
+           summary: "Too many active agents"
+           description: "{{ $value }} agents active, may indicate runaway spawning"
350
+ ```
351
+
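Before deploying the rules, it can help to check the conditions against sample values. The sketch below re-implements the budget thresholds (4.00 and 4.50 USD) and the stagnation condition in plain Python; it is an illustrative approximation of the alert expressions, and the function names are hypothetical.

```python
WARNING_USD = 4.00   # threshold used by LokiBudgetWarning
CRITICAL_USD = 4.50  # threshold used by LokiBudgetCritical


def budget_severity(cost_usd: float) -> str:
    """Return the highest budget severity that a given cost would trigger."""
    if cost_usd > CRITICAL_USD:
        return "critical"
    if cost_usd > WARNING_USD:
        return "warning"
    return "ok"


def is_stagnant(iteration_samples: list, session_status: float) -> bool:
    """Approximate `changes(loki_iteration_current[30m]) == 0 and loki_session_status == 1`."""
    # changes() counts value transitions between consecutive samples in the window.
    changed = any(a != b for a, b in zip(iteration_samples, iteration_samples[1:]))
    return session_status == 1 and not changed


print(budget_severity(4.25))
print(is_stagnant([7, 7, 7], session_status=1))
```

Both comparisons are strict, so a cost of exactly 4.50 USD still reports `warning`, matching the `>` operators in the rules.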
352
+ ### Grafana Alerts
353
+
354
+ Configure alerts in Grafana panels:
355
+
356
+ 1. Edit panel
357
+ 2. Navigate to Alert tab
358
+ 3. Create alert rule:
+    - **Condition:** `WHEN last() OF query(A, 5m, now) IS ABOVE 4.5`
+    - **Evaluate:** Every 1m for 5m
+    - **Send to:** Slack, PagerDuty, Email
362
+
363
+ ## Environment Variables
364
+
365
+ | Variable | Default | Description |
366
+ |----------|---------|-------------|
367
+ | `LOKI_METRICS_ENABLED` | `false` | Enable `/metrics` endpoint |
368
+ | `LOKI_METRICS_PORT` | `57374` | Port for metrics endpoint (same as dashboard) |
369
+ | `LOKI_METRICS_PATH` | `/metrics` | Endpoint path |
370
+
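A client script can derive the scrape URL from the same variables. The sketch below applies the defaults from the table; treating only the literal value `true` as enabled is an assumption about how Loki Mode parses the flag, and `metrics_url` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def metrics_url(env: Mapping[str, str] = os.environ,
                host: str = "localhost") -> Optional[str]:
    """Build the metrics URL from LOKI_METRICS_* variables, or None if disabled."""
    # Assumption: only the value "true" (case-insensitive) enables the endpoint.
    enabled = env.get("LOKI_METRICS_ENABLED", "false").strip().lower() == "true"
    if not enabled:
        return None
    port = int(env.get("LOKI_METRICS_PORT", "57374"))
    path = env.get("LOKI_METRICS_PATH", "/metrics")
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{host}:{port}{path}"


print(metrics_url({"LOKI_METRICS_ENABLED": "true"}))
```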
371
+ ## Best Practices
372
+
373
+ ### Production Deployment
374
+
375
+ 1. Enable metrics in production:
+    ```bash
+    export LOKI_METRICS_ENABLED=true
+    ```
379
+
380
+ 2. Secure endpoint with reverse proxy authentication
381
+ 3. Set up Prometheus scraping with appropriate interval (15-30s)
382
+ 4. Create Grafana dashboards for visualization
383
+ 5. Configure alerts for budget, stagnation, and failures
384
+ 6. Monitor metrics retention and storage
385
+
386
+ ### Performance
387
+
388
+ - Metrics endpoint is lightweight (reads flat files, no DB queries)
389
+ - Scrape interval of 15-30 seconds recommended
390
+ - Metrics are cached for 2 seconds to avoid excessive file reads
391
+ - No impact on Loki Mode execution performance
392
+
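The 2-second cache mentioned above can be pictured as a small time-to-live wrapper around whatever renders the metrics text. This is a generic sketch of the technique, not Loki Mode's code; the `TTLCache` class and its `loader` and `clock` parameters are hypothetical.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Cache one computed value and recompute it once the TTL has elapsed."""

    def __init__(self, loader: Callable[[], str], ttl_seconds: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._value: Optional[str] = None
        self._expires_at = float("-inf")  # forces a load on first access

    def get(self) -> str:
        now = self._clock()
        if now >= self._expires_at:
            # Expired or never loaded: re-read the underlying files.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

With a 15-second scrape interval, each Prometheus scrape triggers a fresh read, while bursts of requests within the 2-second window (for example several `loki metrics` calls) share one read.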
393
+ ### Monitoring
394
+
395
+ - Track `loki_cost_usd` to prevent budget overruns
396
+ - Alert on `loki_session_status == 0` for unexpected stops
397
+ - Monitor `loki_tasks_total{status="failed"}` for quality issues
398
+ - Watch `loki_agents_active` for agent spawning issues
399
+ - Track `loki_iteration_current` for progress
400
+
401
+ ## Troubleshooting
402
+
403
+ ### Metrics Endpoint Returns Empty
404
+
405
+ ```bash
406
+ # Check LOKI_METRICS_ENABLED is set
407
+ echo $LOKI_METRICS_ENABLED
408
+
409
+ # Verify LOKI_DIR is set (required for dashboard)
410
+ echo $LOKI_DIR
411
+
412
+ # Check dashboard-state.json exists and is updating
413
+ ls -la .loki/dashboard-state.json
414
+ watch -n 2 cat .loki/dashboard-state.json
415
+
416
+ # Check dashboard is running
417
+ loki dashboard status
418
+ curl http://localhost:57374/health
419
+ ```
420
+
421
+ ### Metrics Show Zero Values
422
+
423
+ ```bash
424
+ # Ensure a Loki session is running
425
+ loki status
426
+
427
+ # Check dashboard-state.json is being updated (every 2 seconds)
428
+ stat .loki/dashboard-state.json
429
+
430
+ # Verify metrics files exist
431
+ ls -la .loki/metrics/efficiency/
432
+
433
+ # Check events.jsonl exists
434
+ ls -la .loki/events.jsonl
435
+ ```
436
+
437
+ ### Connection Refused
438
+
439
+ ```bash
440
+ # Verify dashboard is running on expected port
441
+ curl http://localhost:57374/health
442
+
443
+ # Check if another process is using port 57374
444
+ lsof -ti:57374
445
+
446
+ # Restart dashboard
447
+ loki dashboard stop
448
+ loki dashboard start
449
+ ```
450
+
451
+ ### Prometheus Cannot Scrape
452
+
453
+ ```bash
454
+ # Test endpoint manually
455
+ curl http://localhost:57374/metrics
456
+
457
+ # Check Prometheus targets page
458
+ open http://prometheus-server:9090/targets
459
+
460
+ # Verify network connectivity from Prometheus to Loki dashboard
461
+ # (firewall, security groups, etc.)
462
+
463
+ # Check Prometheus logs
464
+ kubectl logs -f prometheus-server-xyz
465
+ ```
466
+
467
+ ## Examples
468
+
469
+ ### Cost Budget Monitoring
470
+
471
+ ```bash
472
+ # Set up budget alert
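+ cat > /tmp/budget_check.sh <<'EOF'
+ #!/bin/bash
+ # Match only the sample line, not any "# HELP" / "# TYPE" comment lines
+ COST=$(curl -s http://localhost:57374/metrics | awk '/^loki_cost_usd/ {print $2}')
+ # Stop the session once the reported cost passes 4.50 USD
+ if [ -n "$COST" ] && (( $(echo "$COST > 4.5" | bc -l) )); then
+   echo "CRITICAL: Cost $COST exceeds budget!"
+   loki stop
+ fi
+ EOF
+ chmod +x /tmp/budget_check.sh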
481
+
482
+ # Run every 5 minutes
483
+ crontab -e
484
+ # Add: */5 * * * * /tmp/budget_check.sh
485
+ ```
486
+
487
+ ### Custom Metrics Export
488
+
489
+ ```python
+ import json
+
+ import requests
+
+ def get_loki_metrics(url="http://localhost:57374/metrics"):
+     """Fetch the metrics endpoint and return {sample_name: value}."""
+     response = requests.get(url, timeout=5)
+     response.raise_for_status()
+     metrics = {}
+     for line in response.text.splitlines():
+         # Skip "# HELP" / "# TYPE" comment lines; keep only loki_* samples
+         if line.startswith("loki_"):
+             # Labeled samples keep their labels in the key,
+             # e.g. 'loki_tasks_total{status="failed"}'
+             parts = line.split()
+             metric_name = parts[0]
+             metric_value = float(parts[1]) if len(parts) > 1 else 0.0
+             metrics[metric_name] = metric_value
+     return metrics
+
+ metrics = get_loki_metrics()
+ print(json.dumps(metrics, indent=2))
506
+ ```
507
+
508
+ ### Slack Notification on High Cost
509
+
510
+ ```bash
+ # Append a Slack receiver to the Alertmanager config
+ # (if the file already defines `receivers:`, merge this entry into that list instead)
+ cat >> /etc/alertmanager/alertmanager.yml <<'EOF'
+ receivers:
+   - name: slack
+     slack_configs:
+       - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
+         channel: '#loki-alerts'
+         text: 'Loki Mode: {{ .CommonAnnotations.description }}'
+ EOF
520
+ ```
521
+
522
+ ## See Also
523
+
524
+ - [Audit Logging](audit-logging.md) - Track agent actions
525
+ - [Dashboard Guide](dashboard-guide.md) - Web dashboard
526
+ - [Enterprise Features](../wiki/Enterprise-Features.md) - Complete enterprise guide
527
+ - [Prometheus Metrics](../wiki/Prometheus-Metrics.md) - Detailed wiki documentation