detectkit 0.2.8__tar.gz → 0.3.1__tar.gz
This diff shows the contents of publicly available package versions as they appear in their respective public registries, and is provided for informational purposes only.
- {detectkit-0.2.8/detectkit.egg-info → detectkit-0.3.1}/PKG-INFO +20 -12
- {detectkit-0.2.8 → detectkit-0.3.1}/README.md +19 -11
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/alerting/orchestrator.py +165 -1
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/cli/commands/run.py +29 -13
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/config/metric_config.py +13 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/database/internal_tables.py +122 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/database/tables.py +10 -1
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/orchestration/task_manager.py +2 -0
- {detectkit-0.2.8 → detectkit-0.3.1/detectkit.egg-info}/PKG-INFO +20 -12
- {detectkit-0.2.8 → detectkit-0.3.1}/pyproject.toml +1 -1
- {detectkit-0.2.8 → detectkit-0.3.1}/LICENSE +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/MANIFEST.in +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/alerting/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/alerting/channels/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/alerting/channels/base.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/alerting/channels/email.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/alerting/channels/factory.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/alerting/channels/mattermost.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/alerting/channels/slack.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/alerting/channels/telegram.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/alerting/channels/webhook.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/cli/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/cli/commands/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/cli/commands/init.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/cli/commands/test_alert.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/cli/main.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/config/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/config/profile.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/config/project_config.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/config/validator.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/core/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/core/interval.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/core/models.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/database/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/database/clickhouse_manager.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/database/manager.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/detectors/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/detectors/base.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/detectors/factory.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/detectors/statistical/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/detectors/statistical/iqr.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/detectors/statistical/mad.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/detectors/statistical/manual_bounds.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/detectors/statistical/zscore.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/loaders/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/loaders/metric_loader.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/loaders/query_template.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/orchestration/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/utils/__init__.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit/utils/stats.py +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit.egg-info/SOURCES.txt +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit.egg-info/dependency_links.txt +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit.egg-info/entry_points.txt +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit.egg-info/requires.txt +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/detectkit.egg-info/top_level.txt +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/requirements.txt +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/setup.cfg +0 -0
- {detectkit-0.2.8 → detectkit-0.3.1}/setup.py +0 -0
PKG-INFO

```diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: detectkit
-Version: 0.
+Version: 0.3.1
 Summary: Metric monitoring with automatic anomaly detection
 Author: detectkit team
 License: MIT
@@ -68,12 +68,19 @@ Dynamic: license-file
 
 ## Status
 
-✅ **Production Ready** - Version 0.
+✅ **Production Ready** - Version 0.3.0
 
 Published to PyPI: https://pypi.org/project/detectkit/
 
 Complete rewrite with modern architecture and full documentation (2025).
 
+### What's New in v0.3.0
+
+🎯 **Alert Cooldown** - Prevent alert spam from persistent anomalies
+- Configure minimum time between alerts (`alert_cooldown: "30min"`)
+- Automatic recovery detection (`cooldown_reset_on_recovery: true`)
+- Stops duplicate alerts during long-running issues
+
 ## Features
 
 - ✅ **Pure numpy arrays** - No pandas dependency in core logic
@@ -225,13 +232,14 @@ This project is currently in active development. Contributions are welcome once
 
 ## Changelog
 
-
-
-
-
--
--
--
--
-
-
+See [CHANGELOG.md](CHANGELOG.md) for complete version history.
+
+### Recent Releases
+
+**[0.3.0]** (2025-11-10) - Alert cooldown system, spam prevention
+**[0.2.8]** (2025-11-10) - Fix incomplete interval detection
+**[0.2.7]** (2025-11-10) - Add _dtk_metrics table
+**[0.2.0]** (2025-11-06) - Detector preprocessing and value weighting
+**[0.1.0]** (2025-11-03) - Initial release
+
+[Full changelog →](CHANGELOG.md)
```
README.md

```diff
@@ -6,12 +6,19 @@
 
 ## Status
 
-✅ **Production Ready** - Version 0.
+✅ **Production Ready** - Version 0.3.0
 
 Published to PyPI: https://pypi.org/project/detectkit/
 
 Complete rewrite with modern architecture and full documentation (2025).
 
+### What's New in v0.3.0
+
+🎯 **Alert Cooldown** - Prevent alert spam from persistent anomalies
+- Configure minimum time between alerts (`alert_cooldown: "30min"`)
+- Automatic recovery detection (`cooldown_reset_on_recovery: true`)
+- Stops duplicate alerts during long-running issues
+
 ## Features
 
 - ✅ **Pure numpy arrays** - No pandas dependency in core logic
@@ -163,13 +170,14 @@ This project is currently in active development. Contributions are welcome once
 
 ## Changelog
 
-
-
-
-
--
--
--
--
-
-
+See [CHANGELOG.md](CHANGELOG.md) for complete version history.
+
+### Recent Releases
+
+**[0.3.0]** (2025-11-10) - Alert cooldown system, spam prevention
+**[0.2.8]** (2025-11-10) - Fix incomplete interval detection
+**[0.2.7]** (2025-11-10) - Add _dtk_metrics table
+**[0.2.0]** (2025-11-06) - Detector preprocessing and value weighting
+**[0.1.0]** (2025-11-03) - Initial release
+
+[Full changelog →](CHANGELOG.md)
```
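The `alert_cooldown` setting shown above accepts either a duration string or a plain number of seconds. As a rough illustration of that contract, here is a hypothetical parser (`cooldown_seconds` and its unit table are assumptions; detectkit's actual `Interval` class is not shown in this diff and may behave differently):

```python
import re

# Hypothetical helper: converts a cooldown setting ("30min", "2h", or a plain
# number of seconds) into seconds. Illustrates the documented accepted forms.
_UNITS = {"s": 1, "sec": 1, "min": 60, "h": 3600, "d": 86400}

def cooldown_seconds(value):
    if isinstance(value, int):          # e.g. 1800 -> 1800 seconds
        return value
    m = re.fullmatch(r"(\d+)\s*([a-z]+)", value.strip().lower())
    if not m:
        raise ValueError(f"unrecognized interval: {value!r}")
    amount, unit = int(m.group(1)), m.group(2)
    return amount * _UNITS[unit]

print(cooldown_seconds("30min"))  # 1800
print(cooldown_seconds(1800))     # 1800
```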
detectkit/alerting/orchestrator.py

```diff
@@ -73,6 +73,8 @@ class AlertOrchestrator:
         interval: Interval,
         conditions: Optional[AlertConditions] = None,
         timezone_display: str = "UTC",
+        internal=None,  # InternalTablesManager (optional, for cooldown tracking)
+        alert_config=None,  # AlertConfig (optional, for cooldown settings)
     ):
         """
         Initialize alert orchestrator.
@@ -82,11 +84,15 @@ class AlertOrchestrator:
             interval: Metric interval
             conditions: Alert conditions (defaults to AlertConditions())
             timezone_display: Timezone for alert display (default: UTC)
+            internal: InternalTablesManager instance (optional, for cooldown tracking)
+            alert_config: AlertConfig instance (optional, for cooldown settings)
         """
         self.metric_name = metric_name
         self.interval = interval
         self.conditions = conditions or AlertConditions()
         self.timezone_display = timezone_display
+        self.internal = internal
+        self.alert_config = alert_config
 
     def should_alert(
         self,
@@ -106,11 +112,16 @@ class AlertOrchestrator:
         Logic:
         1. Check if enough detectors triggered (min_detectors)
         2. Check consecutive anomalies with direction matching
-        3.
+        3. Check alert cooldown (if configured)
+        4. Return decision and formatted AlertData
         """
         if not recent_detections:
             return False, None
 
+        # NEW: Check cooldown FIRST (before expensive checks)
+        if self._is_in_cooldown():
+            return False, None
+
         # Group detections by timestamp
         detections_by_time = self._group_by_timestamp(recent_detections)
 
@@ -316,6 +327,15 @@ class AlertOrchestrator:
                 print(f"Error sending alert via {channel_name}: {e}")
                 results[channel_name] = False
 
+        # NEW: Update alert timestamp after sending (for cooldown tracking)
+        if any(results.values()) and self.internal:
+            # At least one channel succeeded - update timestamp
+            self.internal.update_alert_timestamp(
+                metric_name=self.metric_name,
+                timestamp=datetime.utcnow(),
+                increment_count=True
+            )
+
         return results
 
     def get_last_complete_point(self, now: Optional[datetime] = None) -> datetime:
@@ -357,6 +377,150 @@ class AlertOrchestrator:
 
         return datetime.fromtimestamp(last_complete_seconds, tz=timezone.utc)
 
+    def _is_in_cooldown(self) -> bool:
+        """
+        Check if alert is currently in cooldown period.
+
+        Returns:
+            True if in cooldown (should NOT send alert), False otherwise
+
+        Logic:
+        1. If alert_cooldown not configured → return False (no cooldown)
+        2. Get last_alert_sent timestamp from database
+        3. If never sent → return False (no cooldown)
+        4. Calculate elapsed time since last alert
+        5. If cooldown_reset_on_recovery=True:
+           - Check if recovery happened since last alert
+           - If yes → return False (cooldown reset)
+        6. If elapsed < cooldown_interval → return True (in cooldown)
+        7. Otherwise → return False (cooldown expired)
+        """
+        # No cooldown configured
+        if not self.alert_config or not self.alert_config.alert_cooldown:
+            return False
+
+        # No internal manager (can't check cooldown)
+        if not self.internal:
+            return False
+
+        # Get last alert timestamp
+        last_sent = self.internal.get_last_alert_timestamp(self.metric_name)
+
+        if not last_sent:
+            return False  # Never sent alert before
+
+        # Parse cooldown interval
+        from detectkit.core.interval import Interval
+        cooldown_interval = Interval(self.alert_config.alert_cooldown)
+        cooldown_seconds = cooldown_interval.seconds
+
+        # Calculate elapsed time
+        now = datetime.utcnow()
+        elapsed = (now - last_sent).total_seconds()
+
+        # Check recovery reset (if enabled)
+        if self.alert_config.cooldown_reset_on_recovery:
+            # Check if recovery happened since last alert
+            has_recovery = self._check_recovery_since_last_alert(last_sent)
+
+            if has_recovery:
+                return False  # Cooldown reset by recovery
+
+        # Check if still in cooldown
+        return elapsed < cooldown_seconds
+
+    def _check_recovery_since_last_alert(
+        self,
+        last_alert_timestamp: datetime
+    ) -> bool:
+        """
+        Check if recovery happened since last alert was sent.
+
+        Recovery means: consecutive anomalies count dropped below threshold,
+        indicating the metric returned to normal state.
+
+        Args:
+            last_alert_timestamp: Timestamp when last alert was sent
+
+        Returns:
+            True if recovery detected, False otherwise
+
+        Logic:
+        1. Load detections created after last_alert_timestamp
+        2. Count consecutive anomalies using same logic as should_alert()
+        3. If consecutive < required → recovery happened
+        4. If consecutive >= required → still in anomaly state
+        """
+        if not self.internal:
+            return False
+
+        # Get last complete point
+        last_point = self.get_last_complete_point()
+
+        # Load detections created AFTER last alert
+        # We need enough points to check consecutive anomalies
+        num_points = self.conditions.consecutive_anomalies + 5  # +5 for margin
+
+        recent_detections = self.internal.get_recent_detections(
+            metric_name=self.metric_name,
+            last_point=last_point,
+            num_points=num_points,
+            created_after=last_alert_timestamp  # Only detections AFTER last alert
+        )
+
+        if not recent_detections:
+            # No new detections → assume recovery
+            return True
+
+        # Convert to DetectionRecord format
+        detection_records = []
+        for det in recent_detections:
+            # Group has multiple detectors per timestamp
+            for i in range(len(det["detector_ids"])):
+                # Parse detection metadata
+                try:
+                    import json
+                    metadata = json.loads(det["detector_params_list"][i])
+                except:
+                    metadata = {}
+
+                # Determine direction
+                value = det["value"]
+                conf_lower = det["confidence_lowers"][i]
+                conf_upper = det["confidence_uppers"][i]
+
+                if value < conf_lower:
+                    direction = "down"
+                elif value > conf_upper:
+                    direction = "up"
+                else:
+                    direction = "none"
+
+                record = DetectionRecord(
+                    timestamp=np.datetime64(det["timestamp"]),
+                    detector_name=det["detector_names"][i],
+                    detector_id=det["detector_ids"][i],
+                    detector_params=det["detector_params_list"][i],
+                    value=value,
+                    is_anomaly=det["is_anomaly_flags"][i],
+                    confidence_lower=conf_lower,
+                    confidence_upper=conf_upper,
+                    direction=direction,
+                    severity=0.0,  # Not used for recovery check
+                    detection_metadata=metadata
+                )
+                detection_records.append(record)
+
+        # Count consecutive anomalies (same logic as should_alert)
+        consecutive = self._count_consecutive_anomalies(
+            detections=detection_records,
+            min_detectors=self.conditions.min_detectors,
+            direction=self.conditions.direction
+        )
+
+        # Recovery = consecutive dropped below threshold
+        return consecutive < self.conditions.consecutive_anomalies
+
     def __repr__(self) -> str:
         """String representation."""
         return (
```
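The cooldown decision added in `_is_in_cooldown()` can be sketched in isolation. The function below is a hypothetical stand-in that takes the database-backed inputs (last alert time, recovery state) as plain arguments instead of querying the tasks table:

```python
from datetime import datetime, timedelta

# Sketch of the cooldown decision: suppress the alert only when a cooldown is
# configured, an alert was sent before, no recovery reset applies, and the
# elapsed time is still shorter than the cooldown window.
def in_cooldown(last_sent, now, cooldown_seconds, recovered, reset_on_recovery=True):
    if cooldown_seconds is None:         # no cooldown configured
        return False
    if last_sent is None:                # never alerted before
        return False
    if reset_on_recovery and recovered:  # metric returned to normal -> reset
        return False
    elapsed = (now - last_sent).total_seconds()
    return elapsed < cooldown_seconds

now = datetime(2025, 11, 10, 12, 0, 0)
ten_min_ago = now - timedelta(minutes=10)
print(in_cooldown(ten_min_ago, now, 1800, recovered=False))  # True: 600s < 1800s
print(in_cooldown(ten_min_ago, now, 1800, recovered=True))   # False: recovery resets
print(in_cooldown(None, now, 1800, recovered=False))         # False: never alerted
```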
detectkit/cli/commands/run.py

```diff
@@ -369,17 +369,26 @@ def find_metrics_by_tag(metrics_dir: Path, tag: str) -> List[Path]:
 
     matching_metrics = []
 
-
-
-
-
-
-
-
-
-
-
-
+    # Search both .yml and .yaml extensions (consistent with find_metric_by_name)
+    for pattern in ["**/*.yml", "**/*.yaml"]:
+        for metric_file in metrics_dir.glob(pattern):
+            try:
+                with open(metric_file) as f:
+                    config = yaml.safe_load(f)
+
+                if config and "tags" in config:
+                    if tag in config["tags"]:
+                        matching_metrics.append(metric_file)
+            except Exception as e:
+                # Warn about unparseable files but continue searching
+                click.echo(
+                    click.style(
+                        f"Warning: Skipping {metric_file.relative_to(metrics_dir.parent)}: {e}",
+                        fg="yellow"
+                    ),
+                    err=True
+                )
+                continue
 
     return matching_metrics
 
@@ -406,8 +415,15 @@ def find_metric_by_name(metrics_dir: Path, name: str) -> Optional[Path]:
 
             if config and config.get("name") == name:
                 return metric_file
-        except Exception:
-            #
+        except Exception as e:
+            # Warn about unparseable files but continue searching
+            click.echo(
+                click.style(
+                    f"Warning: Skipping {metric_file.relative_to(metrics_dir.parent)}: {e}",
+                    fg="yellow"
+                ),
+                err=True
+            )
             continue
 
     return None
```
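The two-pattern glob introduced in `find_metrics_by_tag()` can be demonstrated without the project's YAML configs. The sketch below is a simplified stand-in (`find_by_tag` and its one-line comma-separated "tag files" are assumptions chosen to avoid a YAML dependency); it keeps only the dual-extension search and the skip-and-continue error handling:

```python
from pathlib import Path
import tempfile

# Simplified stand-in: glob both **/*.yml and **/*.yaml, collecting files
# whose (plain-text, comma-separated) tag list contains the requested tag.
def find_by_tag(metrics_dir: Path, tag: str):
    matches = []
    for pattern in ["**/*.yml", "**/*.yaml"]:
        for metric_file in metrics_dir.glob(pattern):
            try:
                tags = metric_file.read_text().strip().split(",")
            except OSError:
                continue  # skip unreadable files, keep searching
            if tag in tags:
                matches.append(metric_file.name)
    return sorted(matches)

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "cpu.yml").write_text("infra,core")
    (root / "revenue.yaml").write_text("business")
    print(find_by_tag(root, "infra"))  # ['cpu.yml']
```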
detectkit/config/metric_config.py

```diff
@@ -141,6 +141,8 @@ class AlertConfig(BaseModel):
         no_data_alert: Whether to alert when data is missing
         template_single: Custom template for single anomaly alert
         template_consecutive: Custom template for consecutive anomalies alert
+        alert_cooldown: Minimum interval between alerts (e.g., "30min", 1800 seconds)
+        cooldown_reset_on_recovery: Whether to reset cooldown when anomaly recovers
     """
 
     enabled: bool = Field(default=True, description="Enable alerting")
@@ -168,6 +170,17 @@ class AlertConfig(BaseModel):
     template_consecutive: Optional[str] = Field(
         default=None, description="Custom template for consecutive anomalies"
     )
+    alert_cooldown: Optional[Union[str, int]] = Field(
+        default=None,
+        description="Minimum interval between alerts (e.g., '30min', 1800). "
+        "If None, no cooldown is applied (alerts sent every time conditions are met)."
+    )
+    cooldown_reset_on_recovery: bool = Field(
+        default=True,
+        description="Reset cooldown timer when anomaly recovers to normal. "
+        "Only applies if alert_cooldown is set. "
+        "True = cooldown resets on recovery, False = strict cooldown independent of recovery."
+    )
 
     @field_validator("consecutive_anomalies")
     @classmethod
```
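A minimal stand-in for the two new `AlertConfig` fields (the real class is a pydantic `BaseModel`; the `CooldownSettings` dataclass below is hypothetical and only mirrors the declared types and defaults):

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical mirror of the two new config fields: alert_cooldown accepts a
# duration string or seconds (None = no cooldown), and recovery reset is on
# by default.
@dataclass
class CooldownSettings:
    alert_cooldown: Optional[Union[str, int]] = None  # e.g. "30min" or 1800
    cooldown_reset_on_recovery: bool = True

default = CooldownSettings()
strict = CooldownSettings(alert_cooldown="30min", cooldown_reset_on_recovery=False)
print(default.alert_cooldown)             # None -> no cooldown applied
print(strict.cooldown_reset_on_recovery)  # False -> strict cooldown
```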
detectkit/database/internal_tables.py

```diff
@@ -841,3 +841,125 @@ class InternalTablesManager:
             key_columns={"metric_name": metric_config.name},
             data=data
         )
+
+    def get_last_alert_timestamp(
+        self,
+        metric_name: str
+    ) -> Optional[datetime]:
+        """
+        Get timestamp of last sent alert for a metric.
+
+        Used for alert cooldown tracking - prevents sending alerts
+        too frequently for the same metric.
+
+        Args:
+            metric_name: Metric identifier
+
+        Returns:
+            Timestamp of last sent alert, or None if never sent
+
+        Example:
+            >>> last_sent = internal.get_last_alert_timestamp("cpu_usage")
+            >>> if last_sent:
+            ...     elapsed = (datetime.utcnow() - last_sent).total_seconds()
+            ...     print(f"Last alert sent {elapsed}s ago")
+        """
+        full_table_name = self._manager.get_full_table_name(
+            TABLE_TASKS, use_internal=True
+        )
+
+        # Query for pipeline task (detector_id="pipeline", process_type="pipeline")
+        query = f"""
+            SELECT last_alert_sent
+            FROM {full_table_name}
+            WHERE metric_name = %(metric_name)s
+              AND detector_id = 'pipeline'
+              AND process_type = 'pipeline'
+            LIMIT 1
+        """
+
+        results = self._manager.execute_query(
+            query,
+            params={"metric_name": metric_name}
+        )
+
+        if not results or not results[0]["last_alert_sent"]:
+            return None
+
+        last_sent = results[0]["last_alert_sent"]
+
+        # Normalize to naive datetime if needed
+        if hasattr(last_sent, 'tzinfo') and last_sent.tzinfo is not None:
+            last_sent = last_sent.replace(tzinfo=None)
+
+        return last_sent
+
+    def update_alert_timestamp(
+        self,
+        metric_name: str,
+        timestamp: datetime,
+        increment_count: bool = True
+    ) -> int:
+        """
+        Update last_alert_sent timestamp and optionally increment alert_count.
+
+        Called after successfully sending an alert to track cooldown state.
+
+        Args:
+            metric_name: Metric identifier
+            timestamp: Timestamp when alert was sent (typically datetime.utcnow())
+            increment_count: Whether to increment alert_count (default: True)
+
+        Returns:
+            Number of rows updated (typically 1)
+
+        Example:
+            >>> # After sending alert
+            >>> internal.update_alert_timestamp(
+            ...     "cpu_usage",
+            ...     datetime.utcnow(),
+            ...     increment_count=True
+            ... )
+        """
+        full_table_name = self._manager.get_full_table_name(
+            TABLE_TASKS, use_internal=True
+        )
+
+        # Normalize timestamp to naive if needed
+        if hasattr(timestamp, 'tzinfo') and timestamp.tzinfo is not None:
+            timestamp = timestamp.replace(tzinfo=None)
+
+        if increment_count:
+            # Update with alert_count increment
+            update_query = f"""
+                ALTER TABLE {full_table_name}
+                UPDATE
+                    last_alert_sent = %(timestamp)s,
+                    alert_count = alert_count + 1,
+                    updated_at = %(timestamp)s
+                WHERE metric_name = %(metric_name)s
+                  AND detector_id = 'pipeline'
+                  AND process_type = 'pipeline'
+            """
+        else:
+            # Update without alert_count increment
+            update_query = f"""
+                ALTER TABLE {full_table_name}
+                UPDATE
+                    last_alert_sent = %(timestamp)s,
+                    updated_at = %(timestamp)s
+                WHERE metric_name = %(metric_name)s
+                  AND detector_id = 'pipeline'
+                  AND process_type = 'pipeline'
+            """
+
+        self._manager.execute_query(
+            update_query,
+            params={
+                "metric_name": metric_name,
+                "timestamp": timestamp
+            }
+        )
+
+        # ClickHouse ALTER TABLE UPDATE is async, return 1 (optimistic)
+        return 1
```
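The round-trip between the two new methods can be sketched with an in-memory stand-in (`CooldownStore` below is hypothetical; the real methods read and write the `last_alert_sent` and `alert_count` columns of the ClickHouse tasks table):

```python
from datetime import datetime

# In-memory stand-in for the cooldown-tracking methods: get the last alert
# timestamp for a metric, and update it after a successful send, optionally
# incrementing the alert counter.
class CooldownStore:
    def __init__(self):
        self._rows = {}  # metric_name -> {"last_alert_sent": ..., "alert_count": int}

    def get_last_alert_timestamp(self, metric_name):
        row = self._rows.get(metric_name)
        return row["last_alert_sent"] if row else None

    def update_alert_timestamp(self, metric_name, timestamp, increment_count=True):
        row = self._rows.setdefault(
            metric_name, {"last_alert_sent": None, "alert_count": 0}
        )
        row["last_alert_sent"] = timestamp
        if increment_count:
            row["alert_count"] += 1
        return 1  # mirrors the optimistic row count of the real method

store = CooldownStore()
print(store.get_last_alert_timestamp("cpu_usage"))  # None: never alerted
store.update_alert_timestamp("cpu_usage", datetime(2025, 11, 10, 12, 0))
print(store._rows["cpu_usage"]["alert_count"])      # 1
```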
detectkit/database/tables.py

```diff
@@ -97,12 +97,15 @@ def get_tasks_table_model() -> TableModel:
     - last_processed_timestamp: Last successfully processed timestamp
     - error_message: Error message if failed (nullable)
     - timeout_seconds: Task timeout in seconds
+    - last_alert_sent: Timestamp of last sent alert (nullable, for cooldown tracking)
+    - alert_count: Number of alerts sent for this metric (for statistics)
 
     Primary Key: (metric_name, detector_id, process_type)
 
-    This table serves
+    This table serves multiple purposes:
     1. Locking: Only one process can run for a given (metric, detector, type)
     2. Resume: Stores last_processed_timestamp to resume from interruptions
+    3. Alert cooldown: Tracks last_alert_sent timestamp to prevent alert spam
     """
     return TableModel(
         columns=[
@@ -119,6 +122,12 @@ def get_tasks_table_model() -> TableModel:
             ),
             ColumnDefinition("error_message", "Nullable(String)", nullable=True),
            ColumnDefinition("timeout_seconds", "Int32"),
+            ColumnDefinition(
+                "last_alert_sent",
+                "Nullable(DateTime64(3, 'UTC'))",
+                nullable=True
+            ),
+            ColumnDefinition("alert_count", "UInt32", default="0"),
         ],
         primary_key=["metric_name", "detector_id", "process_type"],
         engine="MergeTree",
```
detectkit/orchestration/task_manager.py

```diff
@@ -577,6 +577,8 @@ class TaskManager:
                 consecutive_anomalies=alerting_config.consecutive_anomalies,
             ),
             timezone_display="UTC",
+            internal=self.internal,  # For cooldown tracking
+            alert_config=alerting_config,  # For cooldown settings
         )
 
         # Get last complete point
```
detectkit.egg-info/PKG-INFO

```diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: detectkit
-Version: 0.
+Version: 0.3.1
 Summary: Metric monitoring with automatic anomaly detection
 Author: detectkit team
 License: MIT
@@ -68,12 +68,19 @@ Dynamic: license-file
 
 ## Status
 
-✅ **Production Ready** - Version 0.
+✅ **Production Ready** - Version 0.3.0
 
 Published to PyPI: https://pypi.org/project/detectkit/
 
 Complete rewrite with modern architecture and full documentation (2025).
 
+### What's New in v0.3.0
+
+🎯 **Alert Cooldown** - Prevent alert spam from persistent anomalies
+- Configure minimum time between alerts (`alert_cooldown: "30min"`)
+- Automatic recovery detection (`cooldown_reset_on_recovery: true`)
+- Stops duplicate alerts during long-running issues
+
 ## Features
 
 - ✅ **Pure numpy arrays** - No pandas dependency in core logic
@@ -225,13 +232,14 @@ This project is currently in active development. Contributions are welcome once
 
 ## Changelog
 
-
-
-
-
--
--
--
--
-
-
+See [CHANGELOG.md](CHANGELOG.md) for complete version history.
+
+### Recent Releases
+
+**[0.3.0]** (2025-11-10) - Alert cooldown system, spam prevention
+**[0.2.8]** (2025-11-10) - Fix incomplete interval detection
+**[0.2.7]** (2025-11-10) - Add _dtk_metrics table
+**[0.2.0]** (2025-11-06) - Detector preprocessing and value weighting
+**[0.1.0]** (2025-11-03) - Initial release
+
+[Full changelog →](CHANGELOG.md)
```