detectkit 0.8.2__tar.gz → 0.10.0__tar.gz
- {detectkit-0.8.2 → detectkit-0.10.0}/MANIFEST.in +1 -0
- {detectkit-0.8.2/detectkit.egg-info → detectkit-0.10.0}/PKG-INFO +6 -1
- {detectkit-0.8.2 → detectkit-0.10.0}/README.md +5 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/__init__.py +1 -1
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/channels/base.py +86 -14
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/channels/email.py +1 -1
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/channels/webhook.py +20 -8
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/orchestrator/_decision.py +24 -1
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/orchestrator/_recovery.py +5 -0
- detectkit-0.10.0/detectkit/cli/assets/claude/CLAUDE.section.md +54 -0
- detectkit-0.10.0/detectkit/cli/assets/claude/rules/alerting.md +192 -0
- detectkit-0.10.0/detectkit/cli/assets/claude/rules/cli.md +138 -0
- detectkit-0.10.0/detectkit/cli/assets/claude/rules/detectors.md +193 -0
- detectkit-0.10.0/detectkit/cli/assets/claude/rules/metrics.md +147 -0
- detectkit-0.10.0/detectkit/cli/assets/claude/rules/overview.md +104 -0
- detectkit-0.10.0/detectkit/cli/assets/claude/rules/project.md +203 -0
- detectkit-0.10.0/detectkit/cli/assets/claude/skills/dtk-new-metric/SKILL.md +159 -0
- detectkit-0.10.0/detectkit/cli/assets/claude/skills/dtk-setup-project/SKILL.md +174 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/cli/commands/clean.py +1 -2
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/cli/commands/init.py +53 -22
- detectkit-0.10.0/detectkit/cli/commands/init_claude.py +180 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/cli/commands/test_alert.py +23 -2
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/cli/main.py +32 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/internal_tables/_maintenance.py +2 -3
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/internal_tables/_tasks.py +1 -1
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/orchestration/task_manager/manager.py +1 -1
- {detectkit-0.8.2 → detectkit-0.10.0/detectkit.egg-info}/PKG-INFO +6 -1
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit.egg-info/SOURCES.txt +10 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/pyproject.toml +4 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/LICENSE +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/channels/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/channels/factory.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/channels/mattermost.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/channels/slack.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/channels/telegram.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/orchestrator/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/orchestrator/_base.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/orchestrator/_cooldown.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/orchestrator/_dispatch.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/orchestrator/_types.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/alerting/orchestrator/orchestrator.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/cli/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/cli/_output.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/cli/commands/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/cli/commands/run.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/cli/commands/unlock.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/config/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/config/metric_config.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/config/profile.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/config/project_config.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/config/validator.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/core/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/core/interval.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/core/models.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/clickhouse_manager.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/internal_tables/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/internal_tables/_alert_states.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/internal_tables/_base.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/internal_tables/_datapoints.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/internal_tables/_detections.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/internal_tables/_metrics.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/internal_tables/_schema.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/internal_tables/manager.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/manager.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/database/tables.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/detectors/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/detectors/base.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/detectors/factory.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/detectors/seasonality.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/detectors/statistical/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/detectors/statistical/_windowed.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/detectors/statistical/iqr.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/detectors/statistical/mad.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/detectors/statistical/manual_bounds.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/detectors/statistical/zscore.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/loaders/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/loaders/metric_loader.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/loaders/query_template.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/orchestration/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/orchestration/error_dispatch.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/orchestration/task_manager/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/orchestration/task_manager/_alert_step.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/orchestration/task_manager/_base.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/orchestration/task_manager/_detect_step.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/orchestration/task_manager/_load_step.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/orchestration/task_manager/_types.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/utils/__init__.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/utils/datetime_utils.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/utils/env_interpolation.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/utils/json_utils.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit/utils/stats.py +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit.egg-info/dependency_links.txt +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit.egg-info/entry_points.txt +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit.egg-info/requires.txt +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/detectkit.egg-info/top_level.txt +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/requirements.txt +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/setup.cfg +0 -0
- {detectkit-0.8.2 → detectkit-0.10.0}/setup.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: detectkit
-Version: 0.
+Version: 0.10.0
 Summary: Metric monitoring with automatic anomaly detection
 Author: detectkit team
 License: MIT
@@ -85,6 +85,7 @@ Dynamic: license-file
 - **Database agnostic** — ClickHouse, PostgreSQL, MySQL
 - **Idempotent** — resume from interruptions, no duplicate processing
 - **CLI** — `dtk init`, `dtk run --select`, `dtk unlock`, `dtk clean`, tag-based selectors
+- **AI-native onboarding** — `dtk init-claude` sets up Claude Code context (CLAUDE.md + rules + a metric-scaffolding skill) so an assistant can help you build metrics out of the box
 
 ## Installation
 
@@ -108,6 +109,10 @@ pip install detectkit[all-db] # All databases
 dtk init my_monitoring
 cd my_monitoring
 
+# Optional: set up Claude Code context so an AI assistant can help you
+# write metrics, tune detectors and configure alerts (re-run after upgrades)
+dtk init-claude
+
 # Configure database in profiles.yml, then:
 dtk run --select cpu_usage
 dtk run --select tag:critical
@@ -19,6 +19,7 @@
 - **Database agnostic** — ClickHouse, PostgreSQL, MySQL
 - **Idempotent** — resume from interruptions, no duplicate processing
 - **CLI** — `dtk init`, `dtk run --select`, `dtk unlock`, `dtk clean`, tag-based selectors
+- **AI-native onboarding** — `dtk init-claude` sets up Claude Code context (CLAUDE.md + rules + a metric-scaffolding skill) so an assistant can help you build metrics out of the box
 
 ## Installation
 
@@ -42,6 +43,10 @@ pip install detectkit[all-db] # All databases
 dtk init my_monitoring
 cd my_monitoring
 
+# Optional: set up Claude Code context so an AI assistant can help you
+# write metrics, tune detectors and configure alerts (re-run after upgrades)
+dtk init-claude
+
 # Configure database in profiles.yml, then:
 dtk run --select cpu_usage
 dtk run --select tag:critical
@@ -4,7 +4,7 @@ detectk - Anomaly Detection for Time-Series Metrics
 A Python library for data analysts and engineers to monitor metrics with automatic anomaly detection.
 """
 
-__version__ = "0.
+__version__ = "0.10.0"
 
 from detectkit.core.interval import Interval
 from detectkit.core.models import ColumnDefinition, TableModel
@@ -36,6 +36,15 @@ class AlertData:
             as ``{project_name}`` in templates and as a ``[name] `` prefix
             in the default error title. Lets multiple projects share the
             same alert channel without ambiguity.
+
+    Alert-rule fields (``min_detectors``, ``direction_policy``,
+    ``consecutive_required``, ``detector_count``) describe *why the alert
+    fired* — the configured quorum/direction/consecutive thresholds plus
+    the observed number of agreeing detectors. They are filled by the
+    orchestrator from :class:`AlertConditions` and are deliberately kept
+    distinct from the observed ``direction``/``consecutive_count`` above so
+    templates can contrast "required vs actual". They default to ``None``
+    so direct-API callers (and non-anomaly alerts) still render cleanly.
     """
 
     metric_name: str
@@ -58,6 +67,11 @@ class AlertData:
     description: str | None = None
     mentions: list[str] = field(default_factory=list)
     project_name: str | None = None
+    # Alert rule (the parameters the alert fired with) — see class docstring.
+    min_detectors: int | None = None
+    direction_policy: str | None = None
+    consecutive_required: int | None = None
+    detector_count: int = 1
 
 
 class BaseAlertChannel(ABC):
@@ -123,10 +137,17 @@ class BaseAlertChannel(ABC):
         - {value} / {value_display}
         - {confidence_lower}
         - {confidence_upper}
+        - {confidence_interval} — "[lower, upper]" or "N/A"
+        - {expected_range} — one-sided aware: ">= lo", "<= hi",
+          "[lo, hi]" or "N/A" (renders one-sided detector bounds cleanly)
         - {detector_name}
-        - {
+        - {detector_count} — observed detectors that agreed (the quorum)
+        - {direction} — observed/locked direction of the anomaly
+        - {direction_policy} — configured direction rule ("same"/"any"/...)
+        - {min_detectors} — configured quorum threshold (the rule)
+        - {consecutive_count} — observed consecutive points
+        - {consecutive_required} — configured consecutive threshold (rule)
         - {severity}
-        - {consecutive_count}
         - {status}
 
         Args:
@@ -177,6 +198,40 @@ class BaseAlertChannel(ABC):
         else:
             confidence_str = "N/A"
 
+        # One-sided-aware expected range. A NaN/inf bound means "no bound on
+        # that side" (e.g. ManualBounds with only ``lower_bound`` set), so we
+        # render ">= lo" / "<= hi" instead of the confusing "[7.00, nan]".
+        def _bounded(b: Any) -> bool:
+            return b is not None and not (isinstance(b, float) and (math.isnan(b) or math.isinf(b)))
+
+        lo_ok = _bounded(alert_data.confidence_lower)
+        hi_ok = _bounded(alert_data.confidence_upper)
+        if lo_ok and hi_ok:
+            expected_range = (
+                f"[{alert_data.confidence_lower:.2f}, {alert_data.confidence_upper:.2f}]"
+            )
+        elif lo_ok:
+            expected_range = f">= {alert_data.confidence_lower:.2f}"
+        elif hi_ok:
+            expected_range = f"<= {alert_data.confidence_upper:.2f}"
+        else:
+            expected_range = "N/A"
+
+        # Alert-rule display values. The orchestrator fills these from the
+        # configured AlertConditions; for direct-API/non-anomaly callers that
+        # leave them unset we fall back to the observed counts so the default
+        # templates never render a bare "None".
+        detector_count = alert_data.detector_count
+        min_detectors = (
+            alert_data.min_detectors if alert_data.min_detectors is not None else detector_count
+        )
+        consecutive_required = (
+            alert_data.consecutive_required
+            if alert_data.consecutive_required is not None
+            else alert_data.consecutive_count
+        )
+        direction_policy = alert_data.direction_policy or alert_data.direction
+
         # Display-safe value: stays usable even when value is None/NaN (no-data).
         raw_value = alert_data.value
         if raw_value is None or (isinstance(raw_value, float) and math.isnan(raw_value)):
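The one-sided range rendering added in this hunk can be tried outside the class. The block below is a minimal standalone sketch, not part of the package: `is_bounded` and `format_expected_range` are hypothetical names that mirror the `_bounded` helper and the `expected_range` branches shown in the hunk.

```python
import math
from typing import Optional


def is_bounded(bound: Optional[float]) -> bool:
    """True when the bound is a usable number (not None, NaN or infinite)."""
    return bound is not None and not (
        isinstance(bound, float) and (math.isnan(bound) or math.isinf(bound))
    )


def format_expected_range(lower: Optional[float], upper: Optional[float]) -> str:
    """Render a two-sided, one-sided or missing range as display text."""
    lo_ok, hi_ok = is_bounded(lower), is_bounded(upper)
    if lo_ok and hi_ok:
        return f"[{lower:.2f}, {upper:.2f}]"
    if lo_ok:
        return f">= {lower:.2f}"  # only a lower bound is set
    if hi_ok:
        return f"<= {upper:.2f}"  # only an upper bound is set
    return "N/A"


print(format_expected_range(7.0, float("nan")))
```

With a lower bound of 7.0 and a NaN upper bound, the sketch prints `>= 7.00` rather than `[7.00, nan]`.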
@@ -221,11 +276,16 @@ class BaseAlertChannel(ABC):
             confidence_lower=alert_data.confidence_lower,
             confidence_upper=alert_data.confidence_upper,
             confidence_interval=confidence_str,
+            expected_range=expected_range,
             detector_name=alert_data.detector_name,
+            detector_count=detector_count,
             detector_params=alert_data.detector_params,
             direction=alert_data.direction,
+            direction_policy=direction_policy,
+            min_detectors=min_detectors,
             severity=alert_data.severity,
             consecutive_count=alert_data.consecutive_count,
+            consecutive_required=consecutive_required,
             status=status,
             error_type=alert_data.error_type or "",
             error_message=alert_data.error_message or "",
@@ -312,12 +372,19 @@ class BaseAlertChannel(ABC):
             Default template string
         """
         return (
-            "
+            "⚠ Alert: {metric_name}\n"
             "{description_line}"
-            "
-            "
-            "
-            "
+            "Quorum {detector_count}/{min_detectors} · "
+            "direction {direction} (policy {direction_policy}) · "
+            "consecutive {consecutive_count}/{consecutive_required}\n"
+            "Rule: min_detectors={min_detectors} · "
+            "direction={direction_policy} · consecutive={consecutive_required}\n"
+            "\n"
+            "Latest point (evidence):\n"
+            "· Time: {timestamp}\n"
+            "· Value: {value_display} | Expected: {expected_range}\n"
+            "· Severity: {severity:.2f}\n"
+            "Detectors: {detector_name}\n"
             "Parameters: {detector_params}"
             "{mentions_line}"
         )
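To see how the new "required vs actual" placeholders read once rendered, here is a small sketch that formats only the quorum line of the template with plain `str.format`. It assumes the fallbacks described in the `@@ -177,6 +198,40 @@` hunk (an unset rule value falls back to the observed one); `RULE_TEMPLATE` and `render_rule_line` are illustrative names, not package API.

```python
from typing import Optional

# The first added line group of the default template, copied from the hunk.
RULE_TEMPLATE = (
    "Quorum {detector_count}/{min_detectors} · "
    "direction {direction} (policy {direction_policy}) · "
    "consecutive {consecutive_count}/{consecutive_required}"
)


def render_rule_line(
    detector_count: int,
    direction: str,
    consecutive_count: int,
    min_detectors: Optional[int] = None,
    direction_policy: Optional[str] = None,
    consecutive_required: Optional[int] = None,
) -> str:
    """Fill the quorum line, falling back to observed values for unset rule fields."""
    return RULE_TEMPLATE.format(
        detector_count=detector_count,
        min_detectors=min_detectors if min_detectors is not None else detector_count,
        direction=direction,
        direction_policy=direction_policy or direction,
        consecutive_count=consecutive_count,
        consecutive_required=(
            consecutive_required if consecutive_required is not None else consecutive_count
        ),
    )


print(render_rule_line(2, "up", 3, min_detectors=2, direction_policy="same", consecutive_required=3))
print(render_rule_line(1, "down", 1))  # rule fields unset, observed values are reused
```

The second call shows why a direct-API caller never sees a bare `None` in the message: each rule field is replaced by its observed counterpart.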
@@ -330,12 +397,17 @@ class BaseAlertChannel(ABC):
             Default recovery template string
         """
         return (
-            "
+            "✅ Alert cleared: {metric_name}\n"
             "{description_line}"
-            "
-            "
-            "
-            "
+            "The alert condition no longer holds — "
+            "the metric is back within expected bounds.\n"
+            "Rule: min_detectors={min_detectors} · "
+            "direction={direction_policy} · consecutive={consecutive_required}\n"
+            "\n"
+            "Latest point:\n"
+            "· Time: {timestamp}\n"
+            "· Value: {value_display} | Expected: {expected_range}\n"
+            "Detectors: {detector_name}"
             "{mentions_line}"
         )
 
@@ -348,7 +420,7 @@ class BaseAlertChannel(ABC):
         Returns:
             Default title template string
         """
-        return "
+        return "⚠ Alert: {metric_name}"
 
     def get_default_recovery_title_template(self) -> str:
         """
@@ -357,7 +429,7 @@ class BaseAlertChannel(ABC):
         Returns:
             Default recovery title template string
         """
-        return "
+        return "✅ Alert cleared: {metric_name}"
 
     def get_default_no_data_template(self) -> str:
         """
@@ -54,7 +54,7 @@ class EmailChannel(BaseAlertChannel):
         smtp_username: str | None = None,
         smtp_password: str | None = None,
         use_tls: bool = True,
-        subject_template: str = "
+        subject_template: str = "⚠ Alert: {metric_name}",
         template: str | None = None,
         **kwargs,
     ):
@@ -155,10 +155,17 @@ class WebhookChannel(BaseAlertChannel):
         """
         return (
             "{description_line}"
-            "
-            "
-            "
-            "
+            "Quorum {detector_count}/{min_detectors} · "
+            "direction {direction} (policy {direction_policy}) · "
+            "consecutive {consecutive_count}/{consecutive_required}\n"
+            "Rule: min_detectors={min_detectors} · "
+            "direction={direction_policy} · consecutive={consecutive_required}\n"
+            "\n"
+            "Latest point (evidence):\n"
+            "· Time: {timestamp}\n"
+            "· Value: {value_display} | Expected: {expected_range}\n"
+            "· Severity: {severity:.2f}\n"
+            "Detectors: {detector_name}\n"
             "Parameters: {detector_params}"
             "{mentions_line}"
         )
@@ -171,10 +178,15 @@ class WebhookChannel(BaseAlertChannel):
         """
         return (
             "{description_line}"
-            "
-            "
-            "
-            "
+            "The alert condition no longer holds — "
+            "the metric is back within expected bounds.\n"
+            "Rule: min_detectors={min_detectors} · "
+            "direction={direction_policy} · consecutive={consecutive_required}\n"
+            "\n"
+            "Latest point:\n"
+            "· Time: {timestamp}\n"
+            "· Value: {value_display} | Expected: {expected_range}\n"
+            "Detectors: {detector_name}"
             "{mentions_line}"
         )
 
@@ -207,6 +207,23 @@ class _DecisionMixin(_OrchestratorBase):
             detector_params = primary.detector_params
             combined_metadata = primary.detection_metadata
 
+        # Observed direction shown in the message. For "same"/"up"/"down" the
+        # caller passes the locked/policy direction. For "any" it passes None
+        # because the quorum may combine directions — collapse to the shared
+        # side only when every quorum member agrees, otherwise label it
+        # "mixed" so the message never claims an agreement that did not happen
+        # (e.g. one up + one down satisfying min_detectors=2).
+        if direction:
+            observed_direction = direction
+        else:
+            quorum_dirs = {d.direction for d in anomalies if d.direction in ("up", "down")}
+            if len(quorum_dirs) == 1:
+                observed_direction = next(iter(quorum_dirs))
+            elif len(quorum_dirs) >= 2:
+                observed_direction = "mixed"
+            else:
+                observed_direction = primary.direction
+
         return AlertData(
             metric_name=self.metric_name,
             timestamp=primary.timestamp,
@@ -216,12 +233,18 @@ class _DecisionMixin(_OrchestratorBase):
             confidence_upper=primary.confidence_upper,
             detector_name=detector_name,
             detector_params=detector_params,
-            direction=
+            direction=observed_direction,
             severity=max_severity,
             detection_metadata=combined_metadata,
             consecutive_count=consecutive_count,
             description=self.description,
             mentions=self.mentions,
+            # Alert rule the message foregrounds: configured thresholds plus
+            # the observed quorum size that satisfied them.
+            min_detectors=self.conditions.min_detectors,
+            direction_policy=self.conditions.direction,
+            consecutive_required=self.conditions.consecutive_anomalies,
+            detector_count=len(anomalies),
         )
 
     def should_alert_no_data(
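The "observed direction" rule introduced in `_decision.py` can be exercised in isolation. The function below is a hedged re-statement of the added branch, written as a free function over plain strings; `observed_direction` and its parameter names are illustrative, and the real method works on detection objects rather than strings.

```python
from typing import Iterable, Optional


def observed_direction(
    locked_direction: Optional[str],
    quorum_directions: Iterable[Optional[str]],
    primary_direction: Optional[str],
) -> Optional[str]:
    """Pick the direction to show in the alert message."""
    if locked_direction:
        # "same"/"up"/"down" policies: the caller already knows the side.
        return locked_direction
    # "any" policy: look at which sides the quorum members actually took.
    sides = {d for d in quorum_directions if d in ("up", "down")}
    if len(sides) == 1:
        return next(iter(sides))
    if len(sides) >= 2:
        return "mixed"  # e.g. one up + one down satisfied the quorum
    return primary_direction  # no usable direction among the members


print(observed_direction(None, ["up", "down"], "up"))
```

The call above prints `mixed`, which is the case the hunk's comment calls out: a quorum met by disagreeing detectors is labeled as such instead of borrowing one member's direction.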
@@ -176,4 +176,9 @@ class _RecoveryMixin(_OrchestratorBase):
             is_recovery=True,
             description=self.description,
             mentions=self.mentions,
+            # Echo the rule that had fired so the recovery message names the
+            # same alert condition that just cleared.
+            min_detectors=self.conditions.min_detectors,
+            direction_policy=self.conditions.direction,
+            consecutive_required=self.conditions.consecutive_anomalies,
         )
@@ -0,0 +1,54 @@
+## detectkit — metric anomaly monitoring
+
+This workspace contains one or more **detectkit** projects. detectkit is a
+dbt-like Python tool for monitoring time-series metrics: each metric is a SQL
+query plus one or more anomaly **detectors** defined in YAML, run through a
+`load → detect → alert` pipeline with the `dtk` CLI. A directory is a detectkit
+project when it contains a `detectkit_project.yml` file.
+
+**Help the user operate detectkit**: create and edit metrics, tune detectors,
+configure alerting and channels, run the pipeline, and debug why an alert did
+(or didn't) fire. Stay numpy/SQL/YAML-first and follow the project's existing
+conventions.
+
+### Where to look (read the matching file before answering)
+
+The full, authoritative reference lives in `.claude/rules/detectkit/`. These
+files are generated by `dtk init-claude` and track the installed detectkit
+version — **read the relevant one on demand** instead of guessing:
+
+| If the task is about… | Read |
+|---|---|
+| What detectkit is, the pipeline, internal tables, glossary | `.claude/rules/detectkit/overview.md` |
+| `dtk` commands, selectors, backfills, locks, cleanup | `.claude/rules/detectkit/cli.md` |
+| `detectkit_project.yml`, `profiles.yml`, DB connections, channels | `.claude/rules/detectkit/project.md` |
+| A metric YAML: query, interval, seasonality, loading | `.claude/rules/detectkit/metrics.md` |
+| Choosing/tuning detectors, preprocessing, trends, seasonality | `.claude/rules/detectkit/detectors.md` |
+| Alert rules (quorum/direction/consecutive), cooldown, recovery, templates | `.claude/rules/detectkit/alerting.md` |
+
+### Set up & scaffold (skills)
+
+- **First-time setup** — use the **`dtk-setup-project`** skill to configure the
+database connection in `profiles.yml` (the `dtk init` placeholder ships example
+values that need your real connection details) and, optionally, a first alert
+channel.
+- **A new metric** — use the **`dtk-new-metric`** skill; it walks the config out
+to a YAML file that validates and is ready to run.
+
+### Gotchas that bite (keep these in mind)
+
+- **Every loading query MUST filter its time range** on `{{ dtk_start_time }}`
+and `{{ dtk_end_time }}` (rendered as `'YYYY-MM-DD HH:MM:SS'`, so quote them).
+Without it, incremental/batched loading cannot work.
+- **Metric `name` must be unique** across the whole project — it is the
+database key, not the filename. Keep filename and `name` in sync.
+- **Changing a detector parameter changes the detector's identity** and
+recomputes its detections from scratch; the old rows are orphaned. After
+retuning a live metric, run `dtk clean --select <metric>` to prune them.
+- **`alert_cooldown` defaults to `null`** = a persisting anomaly re-alerts on
+*every* `dtk run`. Always set a cooldown for production metrics.
+- The pipeline is **idempotent**: it resumes from the last saved timestamp.
+Don't reprocess history unless you mean to (`--full-refresh` / `--from`).
+
+> Generated by `dtk init-claude`. Re-run it after upgrading detectkit to refresh
+> these instructions and the files under `.claude/rules/detectkit/`.
@@ -0,0 +1,192 @@
+# detectkit — Alerting
+
+detectkit is **alert-centric**: the *alert* is the primary entity and a detector
+anomaly is secondary evidence a rule interprets (the same anomaly means
+different things under different rules). Configure alerting per metric under
+`alerting:`. Channels themselves are defined in `profiles.yml` (see
+`project.md`).
+
+```yaml
+alerting:
+  enabled: true
+  channels: [mattermost_ops]
+  min_detectors: 1
+  direction: "same"
+  consecutive_anomalies: 3
+  alert_cooldown: "30min"
+```
+
+## The alert rule (quorum × direction × consecutive)
+
+At the alert step, detectkit looks at the most recent detections and applies one
+combined contract:
+
+1. **Quorum** — at each timestamp, group all detectors' anomalies. The point
+satisfies the quorum when at least `min_detectors` of them match the
+`direction` policy.
+2. **Consecutive** — an alert fires only when the latest `consecutive_anomalies`
+timestamps each satisfy the quorum **and** are grid-adjacent (exactly one
+`interval` apart). A missing detection row between two anomalies breaks the
+chain.
+
+### `min_detectors` (default 1)
+
+How many detectors must qualify at **every** point in the chain. `1` = any one
+detector (high recall); `N` = all must agree (high precision).
+
+### `direction` (default `"same"`)
+
+Which anomalies count toward the quorum:
+
+- `"same"` — at the latest point, ≥`min_detectors` detectors must agree on **one**
+direction (up and down counted separately — disagreement is not consensus).
+The winning direction is **locked for the whole chain**. Ties: more detectors
+win, then the more severe side.
+- `"any"` — every anomaly counts regardless of direction (1 up + 1 down
+satisfies `min_detectors: 2`).
+- `"up"` — only anomalies above the interval count (others ignored, never block).
+- `"down"` — only anomalies below the interval count.
+
+Pick by meaning: `"up"` for CPU/error rate (high is bad), `"down"` for cache hit
+rate/uptime (low is bad), `"any"` for single-detector "any deviation matters",
+`"same"` for multi-detector consensus.
+
+### `consecutive_anomalies` (default 3)
+
+Grid-adjacent quorum points required before alerting. `1` = alert immediately
+(critical metrics); `3` = balanced; `5+` = noisy metrics. Gaps in the detection
+grid break the chain.
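The grid-adjacency requirement can be pictured as a walk backwards from the newest detection timestamp, one `interval` at a time. The sketch below is an illustration under stated assumptions, not detectkit's implementation: `chain_fires` is a hypothetical helper, and it takes the set of timestamps that already satisfied the quorum as input.

```python
from datetime import datetime, timedelta
from typing import Iterable


def chain_fires(
    quorum_timestamps: Iterable[datetime],
    latest_timestamp: datetime,
    interval: timedelta,
    required: int,
) -> bool:
    """True when the newest `required` grid points all satisfied the quorum."""
    quorum = set(quorum_timestamps)
    # Step back one interval at a time; a missing row or a non-quorum point
    # is simply absent from the set and breaks the chain.
    return all(latest_timestamp - i * interval in quorum for i in range(required))


step = timedelta(minutes=10)
t0 = datetime(2026, 4, 11, 10, 0)
print(chain_fires([t0, t0 + step, t0 + 2 * step], t0 + 2 * step, step, 3))
print(chain_fires([t0, t0 + 2 * step, t0 + 3 * step], t0 + 3 * step, step, 3))
```

The first call has three adjacent quorum points ending at the latest timestamp; the second has a gap at 10:10, so only two points are adjacent and the chain is too short.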
+
+### Worked example (two detectors A, B; `min_detectors: 2`)
+
+| `direction` | A | B | Result |
+|---|---|---|---|
+| `same` | up | down | no alert (disagreement) |
+| `same` | up | up | quorum; "up" locked for the chain |
+| `up` | up | down | no quorum (only one "up", needs 2) |
+| `down` | up | up | no quorum ("up" ignored) |
+| `any` | up | down | quorum (every anomaly counts) |
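The table above can be reproduced with a few lines of counting. This is a simplified sketch for a single timestamp, assuming each detector reports `"up"` or `"down"`; `quorum_direction` is a hypothetical name, and the documented tie-break on severity for `"same"` is deliberately left out.

```python
from collections import Counter
from typing import Optional, Sequence


def quorum_direction(
    policy: str, directions: Sequence[str], min_detectors: int
) -> Optional[str]:
    """Return the side that reaches quorum ("up", "down" or "any"), else None."""
    counts = Counter(d for d in directions if d in ("up", "down"))
    if policy == "any":
        # Every anomaly counts, whatever its direction.
        return "any" if sum(counts.values()) >= min_detectors else None
    if policy in ("up", "down"):
        # Anomalies on the other side are ignored, they never block.
        return policy if counts[policy] >= min_detectors else None
    if policy == "same":
        # Up and down are counted separately; the larger side must reach quorum.
        side, votes = max(counts.items(), key=lambda kv: kv[1], default=(None, 0))
        return side if votes >= min_detectors else None
    raise ValueError(f"unknown direction policy: {policy!r}")


for policy, a, b in [
    ("same", "up", "down"),
    ("same", "up", "up"),
    ("up", "up", "down"),
    ("down", "up", "up"),
    ("any", "up", "down"),
]:
    print(policy, a, b, quorum_direction(policy, [a, b], min_detectors=2))
```

Each printed row corresponds to one row of the worked example, with `None` standing for "no quorum".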
+
+## Cooldown (spam control) — **set it in production**
+
+`alert_cooldown` defaults to **`null` = no cooldown**, meaning a persisting
+anomaly re-alerts on **every** `dtk run` (e.g. every cron tick). Always set a
+cooldown for production metrics.
+
+```yaml
+alert_cooldown: "30min" # or seconds: 1800
+cooldown_reset_on_recovery: true # default — reset the timer when the metric recovers
+```
+
+- With `cooldown_reset_on_recovery: true` (recommended): alert on first
+occurrence, suppress duplicates while it persists, alert again on a fresh
+incident after recovery.
+- With `false` (strict): an absolute minimum time between any alerts, regardless
+of recovery — for very noisy metrics.
+- No-data and anomaly alerts **share** the same cooldown state within an alert
+block. State lives in `_dtk_alert_states`.
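One way to read the two cooldown modes is as a single gate evaluated before each send. The sketch below is an interpretation, not the package's code: `cooldown_allows_alert` and its parameters are hypothetical, and it assumes that a persisting anomaly may alert again once the cooldown has fully elapsed, which the text above does not state explicitly.

```python
from datetime import datetime, timedelta
from typing import Optional


def cooldown_allows_alert(
    now: datetime,
    last_alert_at: Optional[datetime],
    cooldown: Optional[timedelta],
    recovered_since_last_alert: bool,
    reset_on_recovery: bool = True,
) -> bool:
    """Decide whether the cooldown gate lets a new alert through."""
    if cooldown is None or last_alert_at is None:
        return True  # no cooldown configured, or nothing has been sent yet
    if reset_on_recovery and recovered_since_last_alert:
        return True  # a recovery reset the timer, so this is a fresh incident
    # Strict mode, or the same incident still persisting: wait out the timer.
    return now - last_alert_at >= cooldown


sent = datetime(2026, 4, 11, 12, 0)
ten_min_later = sent + timedelta(minutes=10)
half_hour = timedelta(minutes=30)
print(cooldown_allows_alert(ten_min_later, sent, half_hour, recovered_since_last_alert=False))
print(cooldown_allows_alert(ten_min_later, sent, half_hour, recovered_since_last_alert=True))
```

Ten minutes into a 30-minute cooldown, the persisting incident is suppressed, while a new incident that follows a recovery is let through when `reset_on_recovery` is left at its default.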
|
88
|
+
|
|
89
|
+
## Recovery notifications
|
|
90
|
+
|
|
91
|
+
```yaml
|
|
92
|
+
notify_on_recovery: true # default false
|
|
93
|
+
template_recovery: null # optional custom body
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
Sends one notification per incident when the metric returns to normal after an
|
|
97
|
+
alert fired. **Direction-aware**: after a "down" alert, a fresh "up" anomaly
|
|
98
|
+
does not block recovery (the original condition no longer holds). Independent of
|
|
99
|
+
`alert_cooldown` (recovery always sends once per incident). Default body is
|
|
100
|
+
alert-centric (`✅ Alert cleared: <metric>`).
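A rough sketch of the direction-aware, once-per-incident rule follows. It assumes the incident remembers the direction of the alert that opened it; `Incident` and `take_recovery` are hypothetical names, and how detectkit treats the `any` policy here is not covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    direction: str               # direction of the alert that opened the incident
    recovery_sent: bool = False  # guards the "once per incident" rule


def take_recovery(incident: Optional[Incident], current_direction: Optional[str]) -> bool:
    """Return True exactly once, when the original condition no longer holds."""
    if incident is None or incident.recovery_sent:
        return False
    if current_direction == incident.direction:
        # The same condition is still anomalous: not recovered yet.
        return False
    # Either no anomaly at all, or an anomaly in the opposite direction.
    incident.recovery_sent = True
    return True
```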

## No-data alerts

```yaml
no_data_alert: true          # default false
template_no_data: null       # optional custom body
```

Fires when the **last complete interval** (now floored to a boundary, minus one
interval) has no datapoint, or the row's value is `NULL`/`NaN`. `min_detectors`
and `consecutive_anomalies` do **not** apply (it's a single binary signal).
Honors `alert_cooldown` and `suppress_until`. Webhook channels render it amber.
Use for cron loaders where source absence is a real failure; **don't** enable on
naturally sparse metrics.
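The "last complete interval" arithmetic can be sketched as below. It assumes interval boundaries are aligned to the Unix epoch in UTC, which may differ from detectkit's own alignment; `last_complete_interval` and `is_no_data` are hypothetical helpers.

```python
import math
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_complete_interval(now: datetime, interval: timedelta) -> datetime:
    """Start of the last complete interval: floor `now`, then step back one interval."""
    floored = now - (now - EPOCH) % interval
    return floored - interval


def is_no_data(points: dict, start: datetime) -> bool:
    """True if the interval has no datapoint, or its value is None/NaN."""
    value = points.get(start)
    return value is None or (isinstance(value, float) and math.isnan(value))
```

For a 10-minute interval and a run at 18:07:30 UTC, the interval checked is the one starting at 17:50:00, since the 18:00:00 interval is still in progress.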

## Temporary suppression

```yaml
suppress_until: "2026-04-11 18:00:00"   # UTC; default null
```

Load and detect keep running; only alerting is paused until that time, then it
auto-resumes (no second edit needed). To turn alerting off permanently, use
`enabled: false`.
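As a minimal sketch, the suppression window reduces to a UTC timestamp comparison. `alerting_paused` is a hypothetical helper, and resuming exactly at the stated time is an assumption rather than documented behaviour.

```python
from datetime import datetime, timezone
from typing import Optional


def alerting_paused(suppress_until: Optional[str], now: datetime) -> bool:
    """True while `now` is before the configured UTC `suppress_until` timestamp."""
    if suppress_until is None:
        return False
    until = datetime.strptime(suppress_until, "%Y-%m-%d %H:%M:%S")
    until = until.replace(tzinfo=timezone.utc)  # the config value is UTC
    return now < until
```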

## Mentions

```yaml
mentions: [oncall_engineer, here]   # plain names, no @
```

Channel-agnostic: you write plain usernames and each channel renders them
natively. Special broadcast keywords: `here`, `channel`, `all`. Available as
`{mentions}` / `{mentions_line}` template variables (appended automatically if
not placed in a template). Slack `@username` is display-only; use Slack user
IDs (`U…`) for real pings.
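To illustrate why Slack needs user IDs, here is a sketch of how a Slack-style renderer might turn plain mentions into message markup. It is not detectkit's renderer: `render_slack_mentions` is hypothetical, mapping `all` to Slack's `<!everyone>` is an assumption, and the ID check is a heuristic.

```python
import re

# Slack's own markup for broadcast mentions; the "all" mapping is an assumption.
SLACK_BROADCAST = {"here": "<!here>", "channel": "<!channel>", "all": "<!everyone>"}
# Heuristic: Slack user IDs start with U or W followed by upper-case letters/digits.
SLACK_USER_ID = re.compile(r"^[UW][A-Z0-9]+$")


def render_slack_mentions(mentions):
    """Render plain mention names as Slack message markup (sketch)."""
    parts = []
    for name in mentions:
        if name in SLACK_BROADCAST:
            parts.append(SLACK_BROADCAST[name])
        elif SLACK_USER_ID.match(name):
            parts.append(f"<@{name}>")  # real ping
        else:
            parts.append(f"@{name}")    # display-only text, pings nobody
    return " ".join(parts)
```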

## Multiple alert configs per metric

`alerting:` may be a **list** of independent blocks, each with its own channels,
timezone, template, and rule. Blocks are evaluated and sent independently:

```yaml
alerting:
  - {enabled: true, channels: [mattermost_ops], consecutive_anomalies: 3}
  - {enabled: true, channels: [slack_critical], consecutive_anomalies: 1, direction: "up"}
```

Each block's state is keyed by a hash of its functional fields; editing those
fields or removing a block orphans its `_dtk_alert_states` row (prune with
`dtk clean`). Disabling with `enabled: false` keeps the hash, so a paused alert
is never treated as orphaned.
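The keying idea can be sketched as follows. Which fields count as "functional" is not listed here, so this sketch simply assumes every field except `enabled`; `state_key` and the truncated SHA-256 digest are illustrative choices, not detectkit's actual scheme.

```python
import hashlib
import json


def state_key(block: dict) -> str:
    """Stable key for an alert block, ignoring the `enabled` toggle (sketch)."""
    functional = {k: v for k, v in block.items() if k != "enabled"}
    # sort_keys makes the key independent of field order in the YAML.
    payload = json.dumps(functional, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Toggling `enabled` leaves the key unchanged, while changing `consecutive_anomalies` produces a new key and leaves the old state row behind.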

## Templates

Defaults are alert-centric. Override with:
- `template_single`: alerts with `consecutive_count` ≤ 1.
- `template_consecutive`: streaks (`consecutive_count` > 1); falls back to `template_single`.
- `template_recovery`, `template_no_data`: recovery / no-data bodies.

Templates are plain `{var}` strings (or Jinja2 `.j2` files under `templates_dir`
referenced by path). Key variables:

| Variable | Meaning |
|---|---|
| `{metric_name}`, `{description}` / `{description_line}` | identity |
| `{timestamp}`, `{timezone}` | when (display tz via `alerting.timezone`, default UTC) |
| `{value}` / `{value_display}` | metric value (`value_display` is NaN-safe) |
| `{confidence_lower}` / `{confidence_upper}` / `{confidence_interval}` | bounds |
| `{expected_range}` | one-sided-aware band (`>= 7.00`, `<= 1.10`, `[lo, hi]`, `N/A`) |
| `{detector_name}`, `{detector_count}` | who fired (`"N detectors"` for multi) |
| `{min_detectors}` / `{direction_policy}` / `{consecutive_required}` | the configured rule |
| `{direction}`, `{consecutive_count}`, `{severity}` | observed values |
| `{status}` | `ANOMALY` / `RECOVERED` / `NO_DATA` / `ERROR` |
| `{mentions}` / `{mentions_line}` | formatted mentions |

> For no-data/error alerts there is no numeric value, so avoid `{value:.2f}` in
> those templates (detectkit falls back to the default template rather than
> crashing, but write kind-appropriate templates).
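The selection and fallback rules can be illustrated with plain `str.format`. This is a sketch only: `DEFAULT_TEMPLATE`, `pick_template` and `render` are hypothetical, and the default string is made up for the example rather than taken from detectkit.

```python
# Made-up default body for illustration; detectkit ships its own defaults.
DEFAULT_TEMPLATE = "{status}: {metric_name} = {value_display}"


def pick_template(cfg: dict, consecutive_count: int) -> str:
    """Choose template_consecutive for streaks, else template_single."""
    single = cfg.get("template_single") or DEFAULT_TEMPLATE
    if consecutive_count > 1:
        return cfg.get("template_consecutive") or single
    return single


def render(template: str, variables: dict) -> str:
    """Render a `{var}` template, falling back to the default on format errors."""
    try:
        return template.format(**variables)
    except (KeyError, IndexError, ValueError, TypeError):
        # e.g. "{value:.2f}" when value is None in a no-data alert
        return DEFAULT_TEMPLATE.format(**variables)
```

Rendering `"{metric_name}: {value:.2f}"` with `value=None` raises inside `format`, so the sketch returns the default body instead of crashing.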

## Test, tune, debug

```bash
dtk test-alert <metric>   # mock alert through the real channels, using this rule
```

- **Too many alerts** → raise `consecutive_anomalies`, raise detector
  `threshold`, use `min_detectors: 2`, add seasonality, or set a `direction`.
- **No alerts** → check `enabled: true`, channels exist in `profiles.yml`,
  detections exist (`dtk run --steps detect`), the quorum/consecutive thresholds
  aren't too high, and `direction` isn't filtering the move out.
- **Wrong direction** (alerting when CPU drops) → set `direction: "up"`.
- Aim for **< 5 alerts/day/team** to avoid fatigue.