detectkit 0.9.0__tar.gz → 0.11.0__tar.gz
- {detectkit-0.9.0 → detectkit-0.11.0}/MANIFEST.in +1 -0
- {detectkit-0.9.0/detectkit.egg-info → detectkit-0.11.0}/PKG-INFO +9 -2
- {detectkit-0.9.0 → detectkit-0.11.0}/README.md +5 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/__init__.py +1 -1
- detectkit-0.11.0/detectkit/cli/assets/claude/CLAUDE.section.md +54 -0
- detectkit-0.11.0/detectkit/cli/assets/claude/rules/alerting.md +192 -0
- detectkit-0.11.0/detectkit/cli/assets/claude/rules/cli.md +138 -0
- detectkit-0.11.0/detectkit/cli/assets/claude/rules/detectors.md +193 -0
- detectkit-0.11.0/detectkit/cli/assets/claude/rules/metrics.md +147 -0
- detectkit-0.11.0/detectkit/cli/assets/claude/rules/overview.md +104 -0
- detectkit-0.11.0/detectkit/cli/assets/claude/rules/project.md +201 -0
- detectkit-0.11.0/detectkit/cli/assets/claude/skills/dtk-new-metric/SKILL.md +159 -0
- detectkit-0.11.0/detectkit/cli/assets/claude/skills/dtk-setup-project/SKILL.md +179 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/clean.py +1 -2
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/init.py +195 -74
- detectkit-0.11.0/detectkit/cli/commands/init_claude.py +180 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/main.py +43 -3
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/config/profile.py +39 -3
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/core/models.py +11 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/__init__.py +6 -0
- detectkit-0.11.0/detectkit/database/_sql_manager.py +398 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/clickhouse_manager.py +38 -16
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_alert_states.py +14 -29
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_datapoints.py +6 -5
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_detections.py +9 -11
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_maintenance.py +7 -10
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_schema.py +5 -1
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_tasks.py +1 -1
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/manager.py +73 -0
- detectkit-0.11.0/detectkit/database/mysql_manager.py +132 -0
- detectkit-0.11.0/detectkit/database/postgres_manager.py +118 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/tables.py +3 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/manager.py +1 -1
- {detectkit-0.9.0 → detectkit-0.11.0/detectkit.egg-info}/PKG-INFO +9 -2
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit.egg-info/SOURCES.txt +13 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit.egg-info/requires.txt +3 -1
- {detectkit-0.9.0 → detectkit-0.11.0}/pyproject.toml +9 -1
- {detectkit-0.9.0 → detectkit-0.11.0}/LICENSE +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/base.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/email.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/factory.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/mattermost.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/slack.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/telegram.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/webhook.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_base.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_cooldown.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_decision.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_dispatch.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_recovery.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_types.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/orchestrator.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/_output.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/run.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/test_alert.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/unlock.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/config/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/config/metric_config.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/config/project_config.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/config/validator.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/core/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/core/interval.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_base.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_metrics.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/manager.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/base.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/factory.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/seasonality.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/_windowed.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/iqr.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/mad.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/manual_bounds.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/zscore.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/loaders/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/loaders/metric_loader.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/loaders/query_template.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/error_dispatch.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/_alert_step.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/_base.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/_detect_step.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/_load_step.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/_types.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/utils/__init__.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/utils/datetime_utils.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/utils/env_interpolation.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/utils/json_utils.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/utils/stats.py +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit.egg-info/dependency_links.txt +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit.egg-info/entry_points.txt +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/detectkit.egg-info/top_level.txt +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/requirements.txt +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/setup.cfg +0 -0
- {detectkit-0.9.0 → detectkit-0.11.0}/setup.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: detectkit
-Version: 0.
+Version: 0.11.0
 Summary: Metric monitoring with automatic anomaly detection
 Author: detectkit team
 License: MIT
@@ -61,7 +61,9 @@ Requires-Dist: black>=23.0; extra == "dev"
 Requires-Dist: mypy>=1.0; extra == "dev"
 Requires-Dist: ruff>=0.1.0; extra == "dev"
 Provides-Extra: integration
-Requires-Dist: testcontainers[clickhouse]>=4.0; extra == "integration"
+Requires-Dist: testcontainers[clickhouse,mysql,postgres]>=4.0; extra == "integration"
+Requires-Dist: psycopg2-binary>=2.9.0; extra == "integration"
+Requires-Dist: pymysql>=1.0.0; extra == "integration"
 Dynamic: license-file
 
 # detectkit
@@ -85,6 +87,7 @@ Dynamic: license-file
 - **Database agnostic** — ClickHouse, PostgreSQL, MySQL
 - **Idempotent** — resume from interruptions, no duplicate processing
 - **CLI** — `dtk init`, `dtk run --select`, `dtk unlock`, `dtk clean`, tag-based selectors
+- **AI-native onboarding** — `dtk init-claude` sets up Claude Code context (CLAUDE.md + rules + a metric-scaffolding skill) so an assistant can help you build metrics out of the box
 
 ## Installation
 
@@ -108,6 +111,10 @@ pip install detectkit[all-db] # All databases
 dtk init my_monitoring
 cd my_monitoring
 
+# Optional: set up Claude Code context so an AI assistant can help you
+# write metrics, tune detectors and configure alerts (re-run after upgrades)
+dtk init-claude
+
 # Configure database in profiles.yml, then:
 dtk run --select cpu_usage
 dtk run --select tag:critical
@@ -19,6 +19,7 @@
 - **Database agnostic** — ClickHouse, PostgreSQL, MySQL
 - **Idempotent** — resume from interruptions, no duplicate processing
 - **CLI** — `dtk init`, `dtk run --select`, `dtk unlock`, `dtk clean`, tag-based selectors
+- **AI-native onboarding** — `dtk init-claude` sets up Claude Code context (CLAUDE.md + rules + a metric-scaffolding skill) so an assistant can help you build metrics out of the box
 
 ## Installation
 
@@ -42,6 +43,10 @@ pip install detectkit[all-db] # All databases
 dtk init my_monitoring
 cd my_monitoring
 
+# Optional: set up Claude Code context so an AI assistant can help you
+# write metrics, tune detectors and configure alerts (re-run after upgrades)
+dtk init-claude
+
 # Configure database in profiles.yml, then:
 dtk run --select cpu_usage
 dtk run --select tag:critical
@@ -4,7 +4,7 @@ detectk - Anomaly Detection for Time-Series Metrics
 A Python library for data analysts and engineers to monitor metrics with automatic anomaly detection.
 """
 
-__version__ = "0.
+__version__ = "0.11.0"
 
 from detectkit.core.interval import Interval
 from detectkit.core.models import ColumnDefinition, TableModel
@@ -0,0 +1,54 @@
+## detectkit — metric anomaly monitoring
+
+This workspace contains one or more **detectkit** projects. detectkit is a
+dbt-like Python tool for monitoring time-series metrics: each metric is a SQL
+query plus one or more anomaly **detectors** defined in YAML, run through a
+`load → detect → alert` pipeline with the `dtk` CLI. A directory is a detectkit
+project when it contains a `detectkit_project.yml` file.
+
+**Help the user operate detectkit**: create and edit metrics, tune detectors,
+configure alerting and channels, run the pipeline, and debug why an alert did
+(or didn't) fire. Stay numpy/SQL/YAML-first and follow the project's existing
+conventions.
+
+### Where to look (read the matching file before answering)
+
+The full, authoritative reference lives in `.claude/rules/detectkit/`. These
+files are generated by `dtk init-claude` and track the installed detectkit
+version — **read the relevant one on demand** instead of guessing:
+
+| If the task is about… | Read |
+|---|---|
+| What detectkit is, the pipeline, internal tables, glossary | `.claude/rules/detectkit/overview.md` |
+| `dtk` commands, selectors, backfills, locks, cleanup | `.claude/rules/detectkit/cli.md` |
+| `detectkit_project.yml`, `profiles.yml`, DB connections, channels | `.claude/rules/detectkit/project.md` |
+| A metric YAML: query, interval, seasonality, loading | `.claude/rules/detectkit/metrics.md` |
+| Choosing/tuning detectors, preprocessing, trends, seasonality | `.claude/rules/detectkit/detectors.md` |
+| Alert rules (quorum/direction/consecutive), cooldown, recovery, templates | `.claude/rules/detectkit/alerting.md` |
+
+### Set up & scaffold (skills)
+
+- **First-time setup** — use the **`dtk-setup-project`** skill to configure the
+database connection in `profiles.yml` (the `dtk init` placeholder ships example
+values that need your real connection details) and, optionally, a first alert
+channel.
+- **A new metric** — use the **`dtk-new-metric`** skill; it walks the config out
+to a YAML file that validates and is ready to run.
+
+### Gotchas that bite (keep these in mind)
+
+- **Every loading query MUST filter its time range** on `{{ dtk_start_time }}`
+and `{{ dtk_end_time }}` (rendered as `'YYYY-MM-DD HH:MM:SS'`, so quote them).
+Without it, incremental/batched loading cannot work.
+- **Metric `name` must be unique** across the whole project — it is the
+database key, not the filename. Keep filename and `name` in sync.
+- **Changing a detector parameter changes the detector's identity** and
+recomputes its detections from scratch; the old rows are orphaned. After
+retuning a live metric, run `dtk clean --select <metric>` to prune them.
+- **`alert_cooldown` defaults to `null`** = a persisting anomaly re-alerts on
+*every* `dtk run`. Always set a cooldown for production metrics.
+- The pipeline is **idempotent**: it resumes from the last saved timestamp.
+Don't reprocess history unless you mean to (`--full-refresh` / `--from`).
+
+> Generated by `dtk init-claude`. Re-run it after upgrading detectkit to refresh
+> these instructions and the files under `.claude/rules/detectkit/`.
@@ -0,0 +1,192 @@
+# detectkit — Alerting
+
+detectkit is **alert-centric**: the *alert* is the primary entity and a detector
+anomaly is secondary evidence a rule interprets (the same anomaly means
+different things under different rules). Configure alerting per metric under
+`alerting:`. Channels themselves are defined in `profiles.yml` (see
+`project.md`).
+
+```yaml
+alerting:
+  enabled: true
+  channels: [mattermost_ops]
+  min_detectors: 1
+  direction: "same"
+  consecutive_anomalies: 3
+  alert_cooldown: "30min"
+```
+
+## The alert rule (quorum × direction × consecutive)
+
+At the alert step, detectkit looks at the most recent detections and applies one
+combined contract:
+
+1. **Quorum** — at each timestamp, group all detectors' anomalies. The point
+satisfies the quorum when at least `min_detectors` of them match the
+`direction` policy.
+2. **Consecutive** — an alert fires only when the latest `consecutive_anomalies`
+timestamps each satisfy the quorum **and** are grid-adjacent (exactly one
+`interval` apart). A missing detection row between two anomalies breaks the
+chain.
+
+### `min_detectors` (default 1)
+
+How many detectors must qualify at **every** point in the chain. `1` = any one
+detector (high recall); `N` = all must agree (high precision).
+
+### `direction` (default `"same"`)
+
+Which anomalies count toward the quorum:
+
+- `"same"` — at the latest point, ≥`min_detectors` detectors must agree on **one**
+direction (up and down counted separately — disagreement is not consensus).
+The winning direction is **locked for the whole chain**. Ties: more detectors
+win, then the more severe side.
+- `"any"` — every anomaly counts regardless of direction (1 up + 1 down
+satisfies `min_detectors: 2`).
+- `"up"` — only anomalies above the interval count (others ignored, never block).
+- `"down"` — only anomalies below the interval count.
+
+Pick by meaning: `"up"` for CPU/error rate (high is bad), `"down"` for cache hit
+rate/uptime (low is bad), `"any"` for single-detector "any deviation matters",
+`"same"` for multi-detector consensus.
+
+### `consecutive_anomalies` (default 3)
+
+Grid-adjacent quorum points required before alerting. `1` = alert immediately
+(critical metrics); `3` = balanced; `5+` = noisy metrics. Gaps in the detection
+grid break the chain.
+
+### Worked example (two detectors A, B; `min_detectors: 2`)
+
+| `direction` | A | B | Result |
+|---|---|---|---|
+| `same` | up | down | no alert (disagreement) |
+| `same` | up | up | quorum; "up" locked for the chain |
+| `up` | up | down | no quorum (only one "up", needs 2) |
+| `down` | up | up | no quorum ("up" ignored) |
+| `any` | up | down | quorum (every anomaly counts) |
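
As a rough illustration of the rule described above, the per-point quorum and the consecutive chain could be sketched in Python as follows. This is not detectkit's implementation: the function names `quorum_direction` and `should_alert`, the dict-of-lists input shape, and resolving an exact up/down tie to "up" (the rules text breaks ties by severity, which this sketch does not model) are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def quorum_direction(anomalies, min_detectors, policy):
    """Return the direction that satisfies the quorum at one timestamp, else None.

    `anomalies` holds one "up"/"down" entry per detector that flagged the point.
    """
    ups = anomalies.count("up")
    downs = anomalies.count("down")
    if policy == "any":  # every anomaly counts, whatever its direction
        return "any" if ups + downs >= min_detectors else None
    if policy == "up":  # "down" anomalies are ignored, they never block
        return "up" if ups >= min_detectors else None
    if policy == "down":
        return "down" if downs >= min_detectors else None
    # policy == "same": each side is counted separately, the larger side wins.
    # Assumption: an exact tie resolves to "up" (no severity data in this sketch).
    winner = "up" if ups >= downs else "down"
    return winner if max(ups, downs) >= min_detectors else None


def should_alert(points, interval, min_detectors=1, policy="same", consecutive=3):
    """`points` maps timestamp -> list of anomaly directions at that timestamp."""
    if not points:
        return False
    latest = max(points)
    locked = quorum_direction(points[latest], min_detectors, policy)
    if locked is None:
        return False
    # Under "same", the direction chosen at the latest point is locked for the chain.
    chain_policy = locked if policy == "same" else policy
    for step in range(1, consecutive):
        ts = latest - step * interval
        if ts not in points:  # a missing detection row breaks the chain
            return False
        if quorum_direction(points[ts], min_detectors, chain_policy) is None:
            return False
    return True
```

Calling `quorum_direction` with `min_detectors=2` on the five rows of the worked example table gives the same outcomes as the "Result" column.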
+
+## Cooldown (spam control) — **set it in production**
+
+`alert_cooldown` defaults to **`null` = no cooldown**, meaning a persisting
+anomaly re-alerts on **every** `dtk run` (e.g. every cron tick). Always set a
+cooldown for production metrics.
+
+```yaml
+alert_cooldown: "30min" # or seconds: 1800
+cooldown_reset_on_recovery: true # default — reset the timer when the metric recovers
+```
+
+- With `cooldown_reset_on_recovery: true` (recommended): alert on first
+occurrence, suppress duplicates while it persists, alert again on a fresh
+incident after recovery.
+- With `false` (strict): an absolute minimum time between any alerts, regardless
+of recovery — for very noisy metrics.
+- No-data and anomaly alerts **share** the same cooldown state within an alert
+block. State lives in `_dtk_alert_states`.
+
+## Recovery notifications
+
+```yaml
+notify_on_recovery: true # default false
+template_recovery: null # optional custom body
+```
+
+Sends one notification per incident when the metric returns to normal after an
+alert fired. **Direction-aware**: after a "down" alert, a fresh "up" anomaly
+does not block recovery (the original condition no longer holds). Independent of
+`alert_cooldown` (recovery always sends once per incident). Default body is
+alert-centric (`✅ Alert cleared: <metric>`).
+
+## No-data alerts
+
+```yaml
+no_data_alert: true # default false
+template_no_data: null # optional custom body
+```
+
+Fires when the **last complete interval** (now floored to a boundary, minus one
+interval) has no datapoint, or the row's value is `NULL`/`NaN`. `min_detectors`
+and `consecutive_anomalies` do **not** apply (it's a single binary signal).
+Honors `alert_cooldown` and `suppress_until`. Webhook channels render it amber.
+Use for cron loaders where source absence is a real failure; **don't** enable on
+naturally sparse metrics.
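
The "now floored to a boundary, minus one interval" arithmetic can be sketched in Python as below. The function name `last_complete_interval` is hypothetical, and aligning interval boundaries to the Unix epoch in UTC is an assumption of this sketch; the rules text does not say how detectkit anchors its grid.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the interval grid is anchored at the Unix epoch (UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_complete_interval(now, interval):
    """Start of the most recent interval that has fully elapsed before `now`."""
    whole_intervals = (now - EPOCH) // interval  # timedelta // timedelta -> int
    floored = EPOCH + whole_intervals * interval  # `now` floored to a boundary
    return floored - interval


now = datetime(2026, 4, 11, 18, 7, 30, tzinfo=timezone.utc)
print(last_complete_interval(now, timedelta(minutes=10)))
```

With a 10-minute interval and `now` at 18:07:30 UTC, the floored boundary is 18:00:00, so the interval checked for missing data starts at 17:50:00.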
+
+## Temporary suppression
+
+```yaml
+suppress_until: "2026-04-11 18:00:00" # UTC; default null
+```
+
+Load and detect keep running; only alerting is paused until that time, then it
+auto-resumes (no second edit needed). For permanent off, use `enabled: false`.
+
+## Mentions
+
+```yaml
+mentions: [oncall_engineer, here] # plain names, no @
+```
+
+Channel-agnostic: you write plain usernames and each channel renders them
+natively. Special broadcast keywords: `here`, `channel`, `all`. Available as
+`{mentions}` / `{mentions_line}` template variables (appended automatically if
+not placed in a template). Slack `@username` is display-only — use Slack user
+IDs (`U…`) for real pings.
+
+## Multiple alert configs per metric
+
+`alerting:` may be a **list** of independent blocks, each with its own channels,
+timezone, template, and rule — evaluated and sent independently:
+
+```yaml
+alerting:
+  - {enabled: true, channels: [mattermost_ops], consecutive_anomalies: 3}
+  - {enabled: true, channels: [slack_critical], consecutive_anomalies: 1, direction: "up"}
+```
+
+Each block's state is keyed by a hash of its functional fields; editing those
+fields or removing a block orphans its `_dtk_alert_states` row (prune with
+`dtk clean`). Disabling with `enabled: false` keeps the hash, so a paused alert
+is never treated as orphaned.
+
+## Templates
+
+Defaults are alert-centric. Override with:
+- `template_single` — alerts with `consecutive_count` ≤ 1.
+- `template_consecutive` — streaks (`> 1`); falls back to `template_single`.
+- `template_recovery`, `template_no_data` — recovery / no-data bodies.
+
+Templates are plain `{var}` strings (or Jinja2 `.j2` files under `templates_dir`
+referenced by path). Key variables:
+
+| Variable | Meaning |
+|---|---|
+| `{metric_name}`, `{description}` / `{description_line}` | identity |
+| `{timestamp}`, `{timezone}` | when (display tz via `alerting.timezone`, default UTC) |
+| `{value}` / `{value_display}` | metric value (`value_display` is NaN-safe) |
+| `{confidence_lower}` / `{confidence_upper}` / `{confidence_interval}` | bounds |
+| `{expected_range}` | one-sided-aware band (`>= 7.00`, `<= 1.10`, `[lo, hi]`, `N/A`) |
+| `{detector_name}`, `{detector_count}` | who fired (`"N detectors"` for multi) |
+| `{min_detectors}` / `{direction_policy}` / `{consecutive_required}` | the configured rule |
+| `{direction}`, `{consecutive_count}`, `{severity}` | observed values |
+| `{status}` | `ANOMALY` / `RECOVERED` / `NO_DATA` / `ERROR` |
+| `{mentions}` / `{mentions_line}` | formatted mentions |
+
+> For no-data/error alerts there is no numeric value — avoid `{value:.2f}` in
+> those templates (detectkit falls back to the default template rather than
+> crashing, but write kind-appropriate templates).
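
To see why a numeric format spec is a problem when there is no value, here is a small Python sketch. It assumes the plain `{var}` templates behave like Python's `str.format`, which the `{value:.2f}` syntax suggests but the text above does not confirm; the `render` helper and the sample values are invented for illustration.

```python
def render(template, **variables):
    """Render a `{var}` template; return None when formatting fails.

    Hypothetical helper: it only mimics the "fall back rather than crash"
    behaviour described above, not detectkit's actual renderer.
    """
    try:
        return template.format(**variables)
    except (TypeError, ValueError, KeyError):
        return None


anomaly_body = "{metric_name}: {value:.2f}"
no_data_body = "{metric_name}: {value_display}"

print(render(anomaly_body, metric_name="cpu_usage", value=93.456))    # numeric value formats fine
print(render(anomaly_body, metric_name="cpu_usage", value=None))      # ".2f" cannot format None
print(render(no_data_body, metric_name="cpu_usage", value_display="N/A"))
```

A template written for the no-data case would use a display-oriented variable such as `{value_display}` and never reach the failing branch.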
+
+## Test, tune, debug
+
+```bash
+dtk test-alert <metric> # mock alert through the real channels, using this rule
+```
+
+- **Too many alerts** → raise `consecutive_anomalies`, raise detector
+`threshold`, use `min_detectors: 2`, add seasonality, or set a `direction`.
+- **No alerts** → check `enabled: true`, channels exist in `profiles.yml`,
+detections exist (`dtk run --steps detect`), the quorum/consecutive thresholds
+aren't too high, and `direction` isn't filtering the move out.
+- **Wrong direction** (alerting when CPU drops) → set `direction: "up"`.
+- Aim for **< 5 alerts/day/team** to avoid fatigue.
@@ -0,0 +1,138 @@
+# detectkit — CLI (`dtk`)
+
+Run all commands from a project directory (the one containing
+`detectkit_project.yml`). `dtk --help` and `dtk <command> --help` always work.
+
+## Commands
+
+| Command | Purpose |
+|---|---|
+| `dtk init <name>` | Scaffold a new project directory |
+| `dtk init-claude` | (Re)generate this Claude context (CLAUDE.md + `.claude/rules/detectkit/` + skills) |
+| `dtk run --select <sel>` | Run the load → detect → alert pipeline |
+| `dtk test-alert <metric>` | Send a mock alert to the metric's channels |
+| `dtk unlock --select <sel>` | Clear a stuck pipeline lock |
+| `dtk clean --select <sel>` | Prune internal data that no longer matches the config |
+| `dtk --version` | Show installed detectkit version |
+
+## Selectors (`--select` / `-s`)
+
+Used by `run`, `unlock`, and `clean` (drift mode). Three forms:
+
+- **Metric name** — `--select cpu_usage`. Searches the root `metrics/` dir only.
+Do **not** add `.yml` (it is appended). This matches the metric **file**, but
+every operation is keyed by the metric `name` inside the YAML.
+- **Path / glob** — `--select "metrics/critical/*.yml"`, `--select "api_*"`,
+`--select "metrics/**/*.yml"`. Searches recursively via glob; keep `.yml`.
+- **Tag** — `--select tag:critical`. Searches recursively for metrics whose
+`tags:` list contains that tag.
+
+`--select "*"` selects everything. `--exclude / -e` removes matches
+(`--select "*" --exclude "metrics/staging/*"`). Metric names must be unique
+across the project; duplicates raise an error listing the conflicting files.
+
+## `dtk run`
+
+```bash
+dtk run --select <sel> [--steps load,detect,alert] [--from DATE] [--to DATE] \
+[--full-refresh] [--force] [--profile NAME]
+```
+
+- `--steps` — subset/order of `load`, `detect`, `alert` (default all). Examples:
+`--steps load` (verify the query), `--steps detect` (rerun detection only),
+`--steps detect,alert` (skip load).
+- `--from DATE` / `--to DATE` — `YYYY-MM-DD` or `YYYY-MM-DD HH:MM:SS`, UTC.
+Affects only the `load` step. `--from` overrides the metric's
+`loading_start_time`; `--to` defaults to now.
+- `--full-refresh` — **destructive**: deletes the metric's datapoints and
+detections, then reloads from `loading_start_time`/`--from`. Use after
+changing the query or to recompute detections over history.
+- `--force` — ignore a held lock and run anyway (also releases it on exit).
+Risky with concurrent runs; usually `dtk unlock` is the better recovery.
+- `--profile` — override the project's default profile (e.g. run against staging).
+
+## `dtk test-alert <metric>`
+
+Sends a mock alert (fake value/CI/severity) through the metric's configured
+channels, using that alert config's own rule (`min_detectors` / `direction` /
+`consecutive_anomalies`), so the preview matches a real firing. Use it to
+verify webhook URLs, channel permissions, and custom templates.
+
+## `dtk unlock --select <sel>`
+
+Every run records a `running` lock in `_dtk_tasks` and clears it on exit. If a
+run is killed mid-flight (commonly the **DB restarting mid-run**), the lock is
+left behind and later non-`--force` runs fail with
+`Failed to acquire lock … Use --force`. `dtk unlock` clears it immediately
+without running the pipeline. (Stuck locks also auto-expire after ~1 hour, so
+the next normal run recovers on its own.)
+
+## `dtk clean`
+
+Editing metrics over time leaves stale rows in the internal tables. `dtk clean`
+removes that drift. **Both modes dry-run by default** — pass `--execute` to
+actually delete.
+
+- **Drift mode** — `dtk clean --select <sel>`: for each still-existing metric,
+deletes `_dtk_detections` rows for `detector_id`s the config no longer
+produces (you changed a detector param / `seasonality_components`, or removed
+a detector), and `_dtk_alert_states` rows for alert blocks the config no
+longer produces. Datapoints are never touched (keyed only by timestamp).
+- **GC mode** — `dtk clean --orphaned-metrics`: deletes all internal rows for
+metric names present in the DB but no longer defined by any YAML (renamed or
+deleted metrics). Ignores `--select`; asks for confirmation on `--execute`
+unless `-y/--yes`; refuses to run if the project defines no metrics or configs
+fail to parse (so a wrong directory can't wipe valid data).
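
The drift that `dtk clean --select` prunes can be pictured with a small Python sketch. How detectkit derives a `detector_id` is not shown in this diff, so the hashing scheme below (SHA-256 over the detector type and its sorted parameters, truncated to 16 hex characters) and both function names are purely hypothetical. The point is only that any parameter change yields a new ID and leaves the old rows behind.

```python
import hashlib
import json


def detector_id(detector_type, params):
    """Hypothetical identity: a hash over the detector type and its sorted params."""
    payload = json.dumps({"type": detector_type, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def orphaned_ids(stored_ids, config_detectors):
    """IDs found in stored detection rows that the current config no longer produces."""
    current = {detector_id(d["type"], d.get("params", {})) for d in config_detectors}
    return set(stored_ids) - current


old_id = detector_id("mad", {"threshold": 3.0, "window_size": 288})
retuned = [{"type": "mad", "params": {"threshold": 3.5, "window_size": 288}}]
new_id = detector_id("mad", retuned[0]["params"])

# Rows written under the old threshold no longer match any configured detector.
print(orphaned_ids({old_id, new_id}, retuned) == {old_id})
```

In this picture, drift mode corresponds to deleting rows whose ID is in the orphaned set, while rows for the retuned detector are kept.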
+
+## Common workflows
+
+```bash
+# First run of a metric
+dtk run --select my_metric
+
+# Cron loop (every interval)
+dtk run --select "*"
+dtk run --select "tag:critical"
+
+# Backfill history
+dtk run --select my_metric --from "2024-01-01"
+dtk run --select my_metric --from "2024-01-01" --to "2024-02-01"
+
+# Reprocess after config changes
+dtk run --select my_metric --full-refresh # query changed → reload
+dtk run --select my_metric --steps detect --full-refresh # detector changed → recompute detections
+dtk clean --select my_metric # then prune orphaned old detector/alert rows
+dtk clean --select my_metric --execute
+
+# Debug
+dtk run --select my_metric --steps load # does the query return data?
+dtk run --select my_metric --steps detect # does the detector fire?
+dtk test-alert my_metric # do the channels work?
+
+# Recover a stuck lock
+dtk unlock --select my_metric
+```
+
+## Scheduling
+
+detectkit has no built-in scheduler — drive `dtk run` from cron / systemd
+timers / Windows Task Scheduler. Always `cd` into the project first:
+
+```cron
+*/10 * * * * cd /path/to/project && dtk run --select "*" >> /var/log/detectkit.log 2>&1
+```
+
+Pair scheduling with `error_alerting` (in `detectkit_project.yml`) so in-process
+failures page someone; cron monitoring covers `dtk run` not running at all.
+
+## Troubleshooting
+
+- **"Metric not found"** — selector doesn't match. Use the bare name
+(`cpu_usage`, not `cpu_usage.yml`) for root metrics; check `ls metrics/`.
+- **"Failed to acquire lock"** — a crashed run left a lock. `dtk unlock --select <m>`
+(or wait ~1h for auto-expiry).
+- **"Connection refused"** — check `profiles.yml` and DB connectivity.
+- **"No data loaded"** — run the query manually with sample dates; verify the
+`{{ dtk_start_time }}` / `{{ dtk_end_time }}` filter.
+- **All points "insufficient_data"** — not enough history before `min_samples`;
+lower `min_samples`, or backfill more history with `--from`.
@@ -0,0 +1,193 @@
|
|
|
1
|
+
# detectkit — Detectors
|
|
2
|
+
|
|
3
|
+
A metric's `detectors:` is a list; each entry runs independently and writes its
|
|
4
|
+
own detection rows. A detector flags points whose value falls outside an
|
|
5
|
+
expected confidence interval it computes from history.
|
|
6
|
+
|
|
7
|
+
```yaml
|
|
8
|
+
detectors:
|
|
9
|
+
- type: mad
|
|
10
|
+
params:
|
|
11
|
+
threshold: 3.0
|
|
12
|
+
window_size: 288
|
|
13
|
+
```
|
|
14
|
+
|
|
15
|
+
## Choosing a detector
|
|
16
|
+
|
|
17
|
+
| Detector | Use when | Robust to outliers | Seasonality |
|
|
18
|
+
|---|---|---|---|
|
|
19
|
+
| `manual_bounds` | You know the acceptable bounds (SLA, hard limit) | n/a | no |
|
|
20
|
+
| `mad` | General-purpose default; outliers / non-normal data | yes | yes |
|
|
21
|
+
| `zscore` | Clean, normally distributed data | no | yes |
|
|
22
|
+
| `iqr` | Skewed distributions, percentile metrics (p95/p99) | yes | yes |
|
|
23
|
+
|
|
24
|
+
Quick decision: known bounds → `manual_bounds`; seasonal → `mad` with
|
|
25
|
+
`seasonality_components`; normal & clean → `zscore`; skewed/heavy-tailed →
|
|
26
|
+
`mad` or `iqr`; unsure → `mad`.
|
|
27
|
+
|
|
28
|
+
You can combine detectors — e.g. a `manual_bounds` hard cap plus a `mad`
|
|
29
|
+
pattern detector. The alerting `min_detectors` quorum then decides how many
|
|
30
|
+
must agree (see `alerting.md`).
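The quorum idea can be pictured as a simple count. This is only a sketch: `quorum_reached` is a hypothetical helper, and the real alerting logic (documented in `alerting.md`) may apply further conditions.

```python
def quorum_reached(flags: dict, min_detectors: int) -> bool:
    """True when at least `min_detectors` detectors flagged the point.

    `flags` maps a detector name to whether it flagged the current point.
    """
    return sum(1 for flagged in flags.values() if flagged) >= min_detectors


# A hard cap that did not trip plus a pattern detector that did.
flags = {"manual_bounds": False, "mad": True}
```

With these flags, a quorum of 1 is met and a quorum of 2 is not.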
|
|
31
|
+
|
|
32
|
+
## `manual_bounds`
|
|
33
|
+
|
|
34
|
+
Fixed thresholds with no window, so detection is instant (no warm-up). Of the shared options, only `input_type` is supported.
|
|
35
|
+
|
|
36
|
+
```yaml
|
|
37
|
+
- type: manual_bounds
|
|
38
|
+
params:
|
|
39
|
+
upper_bound: 90.0 # alert when value > 90
|
|
40
|
+
lower_bound: 0.8 # alert when value < 0.8 (use either or both)
|
|
41
|
+
```
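The comparison implied by the comments above can be sketched as follows. `out_of_bounds` is a hypothetical helper; the strict inequalities follow the YAML comments (`value > 90`, `value < 0.8`), and either bound may be omitted.

```python
from typing import Optional


def out_of_bounds(value: float,
                  lower_bound: Optional[float] = None,
                  upper_bound: Optional[float] = None) -> bool:
    """Flag a value strictly outside whichever bounds are configured."""
    if upper_bound is not None and value > upper_bound:
        return True
    if lower_bound is not None and value < lower_bound:
        return True
    return False
```

A value exactly on a bound is not flagged under this reading.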
|
|
42
|
+
|
|
43
|
+
## Windowed detectors (`mad`, `zscore`, `iqr`)
|
|
44
|
+
|
|
45
|
+
These three **share one implementation** and accept an identical parameter set.
|
|
46
|
+
Each computes statistics over a trailing window (current point excluded) and an
|
|
47
|
+
expected interval for the current point.
|
|
48
|
+
|
|
49
|
+
```yaml
|
|
50
|
+
- type: mad # same params for zscore / iqr
|
|
51
|
+
params:
|
|
52
|
+
    # --- core (hashed into detector_id when set to a non-default value) ---
|
|
53
|
+
threshold: 3.0 # defaults: mad 3.0, zscore 3.0, iqr 1.5
|
|
54
|
+
window_size: 100 # trailing window in points
|
|
55
|
+
min_samples: 30 # min valid points in window before detection runs
|
|
56
|
+
seasonality_components: null # e.g. ["hour"] or [["hour","day_of_week"]]
|
|
57
|
+
min_samples_per_group: 10 # defaults: mad 10, zscore 3, iqr 4
|
|
58
|
+
input_type: values # values | changes | absolute_changes | log_changes
|
|
59
|
+
smoothing: null # null | ema | sma
|
|
60
|
+
smoothing_alpha: 0.3 # EMA factor, 0 < a <= 1
|
|
61
|
+
smoothing_window: 10 # SMA window in points
|
|
62
|
+
window_weights: null # null (uniform) | exponential | linear
|
|
63
|
+
half_life: null # exponential half-life: int points or "3d"/"12h"
|
|
64
|
+
detrend: null # null | linear
|
|
65
|
+
# --- execution (NOT hashed) ---
|
|
66
|
+
start_time: "2024-01-01 00:00:00" # when detection begins
|
|
67
|
+
batch_size: 500
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
**Threshold semantics:**
|
|
71
|
+
- `mad` is scaled by the normal-consistency constant 1.4826, so `threshold` is
|
|
72
|
+
in **σ-equivalents** — `3.0` ≈ 3-sigma, just like `zscore`. Lower = more
|
|
73
|
+
sensitive.
|
|
74
|
+
- `zscore`: `threshold` = number of standard deviations (3.0 ≈ 99.7%).
|
|
75
|
+
- `iqr`: `threshold` = Tukey fence multiplier (1.5 = standard outliers, 3.0 =
|
|
76
|
+
extreme only).
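To make the three threshold meanings concrete, here is a minimal sketch of how each expected interval could be computed from a trailing window. It uses only the Python standard library; the function names are hypothetical, the use of the sample standard deviation and the default quantile method are assumptions, and detectkit's own implementation may differ in detail.

```python
import statistics

MAD_SCALE = 1.4826  # normal-consistency constant mentioned above


def mad_interval(window, threshold=3.0):
    """median +/- threshold * 1.4826 * MAD (threshold in sigma-equivalents)."""
    med = statistics.median(window)
    mad = statistics.median([abs(x - med) for x in window])
    half_width = threshold * MAD_SCALE * mad
    return med - half_width, med + half_width


def zscore_interval(window, threshold=3.0):
    """mean +/- threshold standard deviations (sample std is an assumption)."""
    mean = statistics.fmean(window)
    std = statistics.stdev(window)
    return mean - threshold * std, mean + threshold * std


def iqr_interval(window, threshold=1.5):
    """Tukey fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(window, n=4)
    spread = q3 - q1
    return q1 - threshold * spread, q3 + threshold * spread


# Trailing window; the current point is not part of it.
window = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
low, high = mad_interval(window)
```

A current value of 50 falls far outside `(low, high)` and would be flagged; a value of 11 would not.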
|
|
77
|
+
|
|
78
|
+
**Window sizing:** for non-seasonal metrics 100–500 points; for seasonal
|
|
79
|
+
metrics size the window to contain several full cycles (10-min data: 4320–8640
|
|
80
|
+
≈ 30–60 days; hourly: 672–2016 ≈ 4–12 weeks; daily: 60–90). `min_samples` ≈
|
|
81
|
+
10–30% of `window_size`.
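The point counts above are just span divided by grid step. A small sketch for checking a candidate `window_size`; `points_for` and `min_samples_range` are hypothetical helpers, not part of detectkit.

```python
from datetime import timedelta


def points_for(span: timedelta, step: timedelta) -> int:
    """Number of grid points that fit in `span` at one point per `step`."""
    return span // step


def min_samples_range(window_size: int) -> tuple:
    """The 10-30% rule of thumb as (low, high), using integer arithmetic."""
    return window_size // 10, (3 * window_size) // 10


ten_min_30d = points_for(timedelta(days=30), timedelta(minutes=10))
ten_min_60d = points_for(timedelta(days=60), timedelta(minutes=10))
```

Here `ten_min_30d` is 4320 and `ten_min_60d` is 8640, matching the 10-minute figures quoted above.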
|
|
82
|
+
|
|
83
|
+
## Seasonality grouping
|
|
84
|
+
|
|
85
|
+
Statistics are computed within seasonality buckets so peak vs off-peak get
|
|
86
|
+
different expected ranges. Component names must match the metric's seasonality
|
|
87
|
+
feature names (built-in `seasonality_columns` or query-provided — see
|
|
88
|
+
`metrics.md`).
|
|
89
|
+
|
|
90
|
+
```yaml
|
|
91
|
+
seasonality_components:
|
|
92
|
+
- "hour" # 24 separate per-hour adjustments
|
|
93
|
+
# - ["hour", "day_of_week"] # one combined group per hour+day (168 groups)
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
- `["hour"]` — single component (24 groups).
|
|
97
|
+
- `["hour", "day_of_week"]` — two *separate* adjustments.
|
|
98
|
+
- `[["hour", "day_of_week"]]` — one *combined* group per pair.
|
|
99
|
+
|
|
100
|
+
`min_samples_per_group` is the floor of points required per bucket.
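The difference between a single component and a combined group is easiest to see by counting buckets. The sketch below only illustrates the bucketing; how detectkit turns per-bucket statistics into adjustments is not shown, and the feature encoding (`day_of_week` as Monday = 0) is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def bucket_key(ts: datetime, components: list) -> tuple:
    """Bucket key for one grouping, e.g. ["hour"] or ["hour", "day_of_week"]."""
    features = {"hour": ts.hour, "day_of_week": ts.weekday()}
    return tuple(features[name] for name in components)


def group_medians(points, components):
    """Median value per seasonality bucket."""
    groups = defaultdict(list)
    for ts, value in points:
        groups[bucket_key(ts, components)].append(value)
    return {key: median(values) for key, values in groups.items()}


# Two weeks of hourly data with a higher level during 09:00-17:59.
start = datetime(2024, 1, 1)
points = []
for i in range(24 * 14):
    ts = start + timedelta(hours=i)
    points.append((ts, 100.0 if 9 <= ts.hour < 18 else 20.0))

by_hour = group_medians(points, ["hour"])                  # 24 buckets
combined = group_medians(points, ["hour", "day_of_week"])  # 168 buckets
```

With 168 combined buckets, two weeks of hourly data leaves only two points per bucket, which is why `min_samples_per_group` matters more as groups get finer.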
|
|
101
|
+
|
|
102
|
+
## Preprocessing — `input_type`
|
|
103
|
+
|
|
104
|
+
Detect on transformed values (applied after smoothing):
|
|
105
|
+
|
|
106
|
+
- `values` (default) — raw values; absolute thresholds meaningful (CPU%, latency).
|
|
107
|
+
- `absolute_changes` — `v[t] - v[t-1]`; sudden jumps/drops matter.
|
|
108
|
+
- `changes` — `(v[t] - v[t-1]) / v[t-1]`; relative moves (traffic, revenue).
|
|
109
|
+
- `log_changes` — `log1p(v[t]) - log1p(v[t-1])`; near-symmetric for big % moves,
|
|
110
|
+
tolerates zeros (values must be > −1).
|
|
111
|
+
|
|
112
|
+
The first point has no predecessor → NaN; the detect context pulls one extra
|
|
113
|
+
point to compensate.
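The four transforms can be written out directly from the formulas above. This is a sketch with a hypothetical `transform` function; returning NaN for a relative change from a zero predecessor is an assumption, not documented behaviour.

```python
import math

INPUT_TYPES = ("values", "changes", "absolute_changes", "log_changes")


def transform(values, input_type="values"):
    """Apply one input_type transform; the first change-based point is NaN."""
    if input_type not in INPUT_TYPES:
        raise ValueError(f"unknown input_type: {input_type!r}")
    if input_type == "values":
        return list(values)
    out = [math.nan]  # no predecessor for the first point
    for prev, cur in zip(values, values[1:]):
        if input_type == "absolute_changes":
            out.append(cur - prev)
        elif input_type == "changes":
            # Assumption: undefined relative change becomes NaN.
            out.append((cur - prev) / prev if prev != 0 else math.nan)
        else:  # log_changes
            out.append(math.log1p(cur) - math.log1p(prev))
    return out
```

For example, `transform([100, 110, 99], "changes")` yields NaN, then roughly +0.1 and -0.1, while `log_changes` still works when a value is zero.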
|
|
114
|
+
|
|
115
|
+
## Smoothing
|
|
116
|
+
|
|
117
|
+
Reduce noise before detection: `smoothing: sma` (`smoothing_window`) or
|
|
118
|
+
`smoothing: ema` (`smoothing_alpha`, higher = less smoothing). Trade-off: less
|
|
119
|
+
noise but reduced sensitivity to short spikes.
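A minimal sketch of the two smoothers, to show why a higher `smoothing_alpha` means less smoothing. Seeding the EMA with the first value and averaging over a partial window at the start of the SMA are assumptions; detectkit may handle the warm-up differently.

```python
def ema(values, alpha=0.3):
    """Exponential moving average: s[t] = alpha * v[t] + (1 - alpha) * s[t-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out


def sma(values, window=10):
    """Trailing simple moving average over up to `window` points."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


spike = [1.0, 1.0, 1.0, 11.0, 1.0]
smoothed = ema(spike, alpha=0.3)
```

With `alpha=0.3` the spike of 11 is damped to about 4 in `smoothed`; with `alpha=1.0` the series passes through unchanged.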
|
|
120
|
+
|
|
121
|
+
## Trending metrics (avoid alert spam)
|
|
122
|
+
|
|
123
|
+
A gradual trend drifts a uniform window's interval behind the current level, so
|
|
124
|
+
every point starts to look "above" (uptrend) or "below" (downtrend) → false alerts. Two shared params fix this:
|
|
125
|
+
|
|
126
|
+
- `window_weights: exponential` + `half_life: "3d"` — recent points weigh more,
|
|
127
|
+
so the interval follows the new normal. `half_life` is the age at which a
|
|
128
|
+
point's weight halves (int = points; `"3d"`/`"12h"` converted via the grid
|
|
129
|
+
step; default `window_size/20`). `linear` weighting is also available.
|
|
130
|
+
- `detrend: linear` — removes a robust in-window linear trend before computing
|
|
131
|
+
statistics; gradual drift no longer pulls the metric out of its interval,
|
|
132
|
+
while sharp deviations are still caught.
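The weighting side can be sketched in a few lines: convert `half_life` to points, build weights that halve every `half_life` points of age, and measure how much effective data remains. The helper names are hypothetical, only the `d` and `h` suffixes shown above are parsed, and rounding to the nearest point is an assumption.

```python
import re


def half_life_points(half_life, step_seconds: int) -> int:
    """An int is already in points; "3d" / "12h" is divided by the grid step."""
    if isinstance(half_life, int):
        return half_life
    match = re.fullmatch(r"(\d+)([dh])", half_life)
    if match is None:
        raise ValueError(f"unsupported half_life: {half_life!r}")
    seconds = int(match.group(1)) * {"d": 86400, "h": 3600}[match.group(2)]
    return max(1, round(seconds / step_seconds))


def exponential_weights(window_size: int, half_life: int) -> list:
    """Weights ordered oldest to newest; the newest point has age 0 and weight 1."""
    return [0.5 ** ((window_size - 1 - i) / half_life) for i in range(window_size)]


def kish_ess(weights) -> float:
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


hl = half_life_points("3d", step_seconds=600)  # 10-minute grid
weights = exponential_weights(8640, hl)
ess = kish_ess(weights)
```

On a 10-minute grid `"3d"` is 432 points, and the effective sample size of an 8640-point window drops well below 8640, which is the trade-off noted for short half-lives.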
|
|
133
|
+
|
|
134
|
+
**Recommended recipe for trending, seasonal metrics:**
|
|
135
|
+
```yaml
|
|
136
|
+
seasonality_columns: [hour]
|
|
137
|
+
detectors:
|
|
138
|
+
- type: mad
|
|
139
|
+
params:
|
|
140
|
+
window_size: 8640
|
|
141
|
+
min_samples: 1000
|
|
142
|
+
seasonality_components: ["hour"]
|
|
143
|
+
window_weights: exponential
|
|
144
|
+
half_life: "3d"
|
|
145
|
+
detrend: linear # optional, on top of weighting
|
|
146
|
+
```
|
|
147
|
+
Trade-off: a shorter `half_life` adapts faster but also "accepts" a real
|
|
148
|
+
sustained degradation as the new normal sooner. (`weight_decay` is a deprecated
|
|
149
|
+
alias for `half_life`; prefer `half_life`.)
|
|
150
|
+
|
|
151
|
+
## Feature compatibility
|
|
152
|
+
|
|
153
|
+
| Feature | mad | zscore | iqr | manual_bounds |
|
|
154
|
+
|---|---|---|---|---|
|
|
155
|
+
| `input_type` | ✅ | ✅ | ✅ | ✅ |
|
|
156
|
+
| `smoothing` | ✅ | ✅ | ✅ | ❌ |
|
|
157
|
+
| `window_weights` / `half_life` | ✅ | ✅ | ✅ | ❌ |
|
|
158
|
+
| `detrend` | ✅ | ✅ | ✅ | ❌ |
|
|
159
|
+
| `seasonality_components` | ✅ | ✅ | ✅ | ❌ |
|
|
160
|
+
|
|
161
|
+
`manual_bounds` has no window, so window-based features don't apply.
|
|
162
|
+
|
|
163
|
+
## Detector identity & recomputation
|
|
164
|
+
|
|
165
|
+
Every parameter that affects results (threshold, window_size, min_samples,
|
|
166
|
+
seasonality_components, min_samples_per_group, input_type, smoothing*,
|
|
167
|
+
window_weights, half_life, detrend) is hashed into the `detector_id` — only
|
|
168
|
+
non-default values participate. Execution params (`start_time`, `batch_size`)
|
|
169
|
+
are **not** hashed.
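The identity rule can be illustrated with a sketch. Everything here is hypothetical except the rule itself: the hash function, the id length, and the JSON canonicalisation are assumptions, and `STATED_DEFAULTS` lists only the `mad` defaults quoted earlier in this file.

```python
import hashlib
import json

# Defaults quoted above for `mad`; other defaults are not assumed here.
STATED_DEFAULTS = {"threshold": 3.0, "min_samples_per_group": 10, "input_type": "values"}
EXECUTION_PARAMS = {"start_time", "batch_size"}  # never hashed


def detector_id(detector_type: str, params: dict) -> str:
    """Hash the detector type plus every non-default, non-execution parameter."""
    hashed = {
        key: value
        for key, value in params.items()
        if key not in EXECUTION_PARAMS
        and not (key in STATED_DEFAULTS and STATED_DEFAULTS[key] == value)
    }
    payload = json.dumps({"type": detector_type, "params": hashed}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


base = detector_id("mad", {})
```

Under this scheme, spelling out a default (`threshold: 3.0`) or changing `batch_size` leaves the id unchanged, while `threshold: 2.5` produces a new one.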
|
|
170
|
+
|
|
171
|
+
Changing any hashed parameter creates a new `detector_id` and recomputes that
|
|
172
|
+
detector's detections from scratch on the next run; old rows stay under the
|
|
173
|
+
previous id. To recompute over history immediately:
|
|
174
|
+
`dtk run --select <m> --steps detect --full-refresh`. To prune the orphaned old
|
|
175
|
+
rows: `dtk clean --select <m> --execute`.
|
|
176
|
+
|
|
177
|
+
Parameters are validated when the detector is constructed at the start of the
|
|
178
|
+
`detect` step (per run, not at YAML load) — a typo like `input_type: "diff"`
|
|
179
|
+
fails fast with a clear error.
|
|
180
|
+
|
|
181
|
+
## Tuning
|
|
182
|
+
|
|
183
|
+
- Too many false positives → raise `threshold`, add seasonality, add
|
|
184
|
+
`window_weights`/`detrend` for trends, or raise `consecutive_anomalies` in
|
|
185
|
+
alerting.
|
|
186
|
+
- Missing real anomalies → lower `threshold`, lower `window_size`, lower
|
|
187
|
+
`consecutive_anomalies`/`min_detectors`.
|
|
188
|
+
- All "insufficient_data" → lower `min_samples` or backfill more history.
|
|
189
|
+
|
|
190
|
+
Detection metadata records what the detector saw (`global_median`,
|
|
191
|
+
`adjusted_median`, `ess` = Kish effective sample size when weighting,
|
|
192
|
+
`trend_slope_per_point` when detrending, and a `preprocessing` block) — useful
|
|
193
|
+
for debugging.
|