detectkit 0.9.0__tar.gz → 0.11.0__tar.gz

Files changed (103)
  1. {detectkit-0.9.0 → detectkit-0.11.0}/MANIFEST.in +1 -0
  2. {detectkit-0.9.0/detectkit.egg-info → detectkit-0.11.0}/PKG-INFO +9 -2
  3. {detectkit-0.9.0 → detectkit-0.11.0}/README.md +5 -0
  4. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/__init__.py +1 -1
  5. detectkit-0.11.0/detectkit/cli/assets/claude/CLAUDE.section.md +54 -0
  6. detectkit-0.11.0/detectkit/cli/assets/claude/rules/alerting.md +192 -0
  7. detectkit-0.11.0/detectkit/cli/assets/claude/rules/cli.md +138 -0
  8. detectkit-0.11.0/detectkit/cli/assets/claude/rules/detectors.md +193 -0
  9. detectkit-0.11.0/detectkit/cli/assets/claude/rules/metrics.md +147 -0
  10. detectkit-0.11.0/detectkit/cli/assets/claude/rules/overview.md +104 -0
  11. detectkit-0.11.0/detectkit/cli/assets/claude/rules/project.md +201 -0
  12. detectkit-0.11.0/detectkit/cli/assets/claude/skills/dtk-new-metric/SKILL.md +159 -0
  13. detectkit-0.11.0/detectkit/cli/assets/claude/skills/dtk-setup-project/SKILL.md +179 -0
  14. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/clean.py +1 -2
  15. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/init.py +195 -74
  16. detectkit-0.11.0/detectkit/cli/commands/init_claude.py +180 -0
  17. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/main.py +43 -3
  18. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/config/profile.py +39 -3
  19. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/core/models.py +11 -0
  20. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/__init__.py +6 -0
  21. detectkit-0.11.0/detectkit/database/_sql_manager.py +398 -0
  22. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/clickhouse_manager.py +38 -16
  23. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_alert_states.py +14 -29
  24. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_datapoints.py +6 -5
  25. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_detections.py +9 -11
  26. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_maintenance.py +7 -10
  27. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_schema.py +5 -1
  28. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_tasks.py +1 -1
  29. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/manager.py +73 -0
  30. detectkit-0.11.0/detectkit/database/mysql_manager.py +132 -0
  31. detectkit-0.11.0/detectkit/database/postgres_manager.py +118 -0
  32. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/tables.py +3 -0
  33. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/manager.py +1 -1
  34. {detectkit-0.9.0 → detectkit-0.11.0/detectkit.egg-info}/PKG-INFO +9 -2
  35. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit.egg-info/SOURCES.txt +13 -0
  36. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit.egg-info/requires.txt +3 -1
  37. {detectkit-0.9.0 → detectkit-0.11.0}/pyproject.toml +9 -1
  38. {detectkit-0.9.0 → detectkit-0.11.0}/LICENSE +0 -0
  39. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/__init__.py +0 -0
  40. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/__init__.py +0 -0
  41. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/base.py +0 -0
  42. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/email.py +0 -0
  43. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/factory.py +0 -0
  44. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/mattermost.py +0 -0
  45. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/slack.py +0 -0
  46. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/telegram.py +0 -0
  47. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/channels/webhook.py +0 -0
  48. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/__init__.py +0 -0
  49. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_base.py +0 -0
  50. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_cooldown.py +0 -0
  51. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_decision.py +0 -0
  52. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_dispatch.py +0 -0
  53. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_recovery.py +0 -0
  54. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/_types.py +0 -0
  55. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/alerting/orchestrator/orchestrator.py +0 -0
  56. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/__init__.py +0 -0
  57. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/_output.py +0 -0
  58. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/__init__.py +0 -0
  59. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/run.py +0 -0
  60. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/test_alert.py +0 -0
  61. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/cli/commands/unlock.py +0 -0
  62. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/config/__init__.py +0 -0
  63. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/config/metric_config.py +0 -0
  64. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/config/project_config.py +0 -0
  65. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/config/validator.py +0 -0
  66. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/core/__init__.py +0 -0
  67. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/core/interval.py +0 -0
  68. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/__init__.py +0 -0
  69. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_base.py +0 -0
  70. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/_metrics.py +0 -0
  71. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/database/internal_tables/manager.py +0 -0
  72. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/__init__.py +0 -0
  73. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/base.py +0 -0
  74. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/factory.py +0 -0
  75. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/seasonality.py +0 -0
  76. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/__init__.py +0 -0
  77. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/_windowed.py +0 -0
  78. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/iqr.py +0 -0
  79. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/mad.py +0 -0
  80. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/manual_bounds.py +0 -0
  81. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/detectors/statistical/zscore.py +0 -0
  82. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/loaders/__init__.py +0 -0
  83. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/loaders/metric_loader.py +0 -0
  84. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/loaders/query_template.py +0 -0
  85. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/__init__.py +0 -0
  86. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/error_dispatch.py +0 -0
  87. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/__init__.py +0 -0
  88. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/_alert_step.py +0 -0
  89. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/_base.py +0 -0
  90. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/_detect_step.py +0 -0
  91. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/_load_step.py +0 -0
  92. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/orchestration/task_manager/_types.py +0 -0
  93. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/utils/__init__.py +0 -0
  94. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/utils/datetime_utils.py +0 -0
  95. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/utils/env_interpolation.py +0 -0
  96. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/utils/json_utils.py +0 -0
  97. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit/utils/stats.py +0 -0
  98. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit.egg-info/dependency_links.txt +0 -0
  99. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit.egg-info/entry_points.txt +0 -0
  100. {detectkit-0.9.0 → detectkit-0.11.0}/detectkit.egg-info/top_level.txt +0 -0
  101. {detectkit-0.9.0 → detectkit-0.11.0}/requirements.txt +0 -0
  102. {detectkit-0.9.0 → detectkit-0.11.0}/setup.cfg +0 -0
  103. {detectkit-0.9.0 → detectkit-0.11.0}/setup.py +0 -0
@@ -2,6 +2,7 @@ include README.md
2
2
  include LICENSE
3
3
  include requirements.txt
4
4
  recursive-include detectkit *.py
5
+ recursive-include detectkit/cli/assets *.md
5
6
  recursive-exclude tests *
6
7
  recursive-exclude * __pycache__
7
8
  recursive-exclude * *.pyc
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: detectkit
3
- Version: 0.9.0
3
+ Version: 0.11.0
4
4
  Summary: Metric monitoring with automatic anomaly detection
5
5
  Author: detectkit team
6
6
  License: MIT
@@ -61,7 +61,9 @@ Requires-Dist: black>=23.0; extra == "dev"
61
61
  Requires-Dist: mypy>=1.0; extra == "dev"
62
62
  Requires-Dist: ruff>=0.1.0; extra == "dev"
63
63
  Provides-Extra: integration
64
- Requires-Dist: testcontainers[clickhouse]>=4.0; extra == "integration"
64
+ Requires-Dist: testcontainers[clickhouse,mysql,postgres]>=4.0; extra == "integration"
65
+ Requires-Dist: psycopg2-binary>=2.9.0; extra == "integration"
66
+ Requires-Dist: pymysql>=1.0.0; extra == "integration"
65
67
  Dynamic: license-file
66
68
 
67
69
  # detectkit
@@ -85,6 +87,7 @@ Dynamic: license-file
85
87
  - **Database agnostic** — ClickHouse, PostgreSQL, MySQL
86
88
  - **Idempotent** — resume from interruptions, no duplicate processing
87
89
  - **CLI** — `dtk init`, `dtk run --select`, `dtk unlock`, `dtk clean`, tag-based selectors
90
+ - **AI-native onboarding** — `dtk init-claude` sets up Claude Code context (CLAUDE.md + rules + a metric-scaffolding skill) so an assistant can help you build metrics out of the box
88
91
 
89
92
  ## Installation
90
93
 
@@ -108,6 +111,10 @@ pip install detectkit[all-db] # All databases
108
111
  dtk init my_monitoring
109
112
  cd my_monitoring
110
113
 
114
+ # Optional: set up Claude Code context so an AI assistant can help you
115
+ # write metrics, tune detectors and configure alerts (re-run after upgrades)
116
+ dtk init-claude
117
+
111
118
  # Configure database in profiles.yml, then:
112
119
  dtk run --select cpu_usage
113
120
  dtk run --select tag:critical
@@ -19,6 +19,7 @@
19
19
  - **Database agnostic** — ClickHouse, PostgreSQL, MySQL
20
20
  - **Idempotent** — resume from interruptions, no duplicate processing
21
21
  - **CLI** — `dtk init`, `dtk run --select`, `dtk unlock`, `dtk clean`, tag-based selectors
22
+ - **AI-native onboarding** — `dtk init-claude` sets up Claude Code context (CLAUDE.md + rules + a metric-scaffolding skill) so an assistant can help you build metrics out of the box
22
23
 
23
24
  ## Installation
24
25
 
@@ -42,6 +43,10 @@ pip install detectkit[all-db] # All databases
42
43
  dtk init my_monitoring
43
44
  cd my_monitoring
44
45
 
46
+ # Optional: set up Claude Code context so an AI assistant can help you
47
+ # write metrics, tune detectors and configure alerts (re-run after upgrades)
48
+ dtk init-claude
49
+
45
50
  # Configure database in profiles.yml, then:
46
51
  dtk run --select cpu_usage
47
52
  dtk run --select tag:critical
@@ -4,7 +4,7 @@ detectk - Anomaly Detection for Time-Series Metrics
4
4
  A Python library for data analysts and engineers to monitor metrics with automatic anomaly detection.
5
5
  """
6
6
 
7
- __version__ = "0.9.0"
7
+ __version__ = "0.11.0"
8
8
 
9
9
  from detectkit.core.interval import Interval
10
10
  from detectkit.core.models import ColumnDefinition, TableModel
@@ -0,0 +1,54 @@
1
+ ## detectkit — metric anomaly monitoring
2
+
3
+ This workspace contains one or more **detectkit** projects. detectkit is a
4
+ dbt-like Python tool for monitoring time-series metrics: each metric is a SQL
5
+ query plus one or more anomaly **detectors** defined in YAML, run through a
6
+ `load → detect → alert` pipeline with the `dtk` CLI. A directory is a detectkit
7
+ project when it contains a `detectkit_project.yml` file.
8
+
9
+ **Help the user operate detectkit**: create and edit metrics, tune detectors,
10
+ configure alerting and channels, run the pipeline, and debug why an alert did
11
+ (or didn't) fire. Stay numpy/SQL/YAML-first and follow the project's existing
12
+ conventions.
13
+
14
+ ### Where to look (read the matching file before answering)
15
+
16
+ The full, authoritative reference lives in `.claude/rules/detectkit/`. These
17
+ files are generated by `dtk init-claude` and track the installed detectkit
18
+ version — **read the relevant one on demand** instead of guessing:
19
+
20
+ | If the task is about… | Read |
21
+ |---|---|
22
+ | What detectkit is, the pipeline, internal tables, glossary | `.claude/rules/detectkit/overview.md` |
23
+ | `dtk` commands, selectors, backfills, locks, cleanup | `.claude/rules/detectkit/cli.md` |
24
+ | `detectkit_project.yml`, `profiles.yml`, DB connections, channels | `.claude/rules/detectkit/project.md` |
25
+ | A metric YAML: query, interval, seasonality, loading | `.claude/rules/detectkit/metrics.md` |
26
+ | Choosing/tuning detectors, preprocessing, trends, seasonality | `.claude/rules/detectkit/detectors.md` |
27
+ | Alert rules (quorum/direction/consecutive), cooldown, recovery, templates | `.claude/rules/detectkit/alerting.md` |
28
+
29
+ ### Set up & scaffold (skills)
30
+
31
+ - **First-time setup** — use the **`dtk-setup-project`** skill to configure the
32
+ database connection in `profiles.yml` (the `dtk init` placeholder ships example
33
+ values that need your real connection details) and, optionally, a first alert
34
+ channel.
35
+ - **A new metric** — use the **`dtk-new-metric`** skill; it walks the config out
36
+ to a YAML file that validates and is ready to run.
37
+
38
+ ### Gotchas that bite (keep these in mind)
39
+
40
+ - **Every loading query MUST filter its time range** on `{{ dtk_start_time }}`
41
+ and `{{ dtk_end_time }}` (rendered as `'YYYY-MM-DD HH:MM:SS'`, so quote them).
42
+ Without it, incremental/batched loading cannot work.
43
+ - **Metric `name` must be unique** across the whole project — it is the
44
+ database key, not the filename. Keep filename and `name` in sync.
45
+ - **Changing a detector parameter changes the detector's identity** and
46
+ recomputes its detections from scratch; the old rows are orphaned. After
47
+ retuning a live metric, run `dtk clean --select <metric>` to prune them.
48
+ - **`alert_cooldown` defaults to `null`** = a persisting anomaly re-alerts on
49
+ *every* `dtk run`. Always set a cooldown for production metrics.
50
+ - The pipeline is **idempotent**: it resumes from the last saved timestamp.
51
+ Don't reprocess history unless you mean to (`--full-refresh` / `--from`).
52
+
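The first gotcha above (every loading query must filter on the two time placeholders, and must quote them) can be illustrated with a minimal sketch. This is not detectkit's renderer: `render_query` is a hypothetical stand-in, the table and column names are made up, and the half-open `>=` / `<` range is an assumption.

```python
import re
from datetime import datetime

# Hypothetical loading query; table and column names are made up.
# The placeholders are quoted because they render as plain 'YYYY-MM-DD HH:MM:SS' text.
QUERY = """
SELECT ts, avg(cpu) AS value
FROM metrics_raw
WHERE ts >= '{{ dtk_start_time }}' AND ts < '{{ dtk_end_time }}'
GROUP BY ts
"""


def render_query(template: str, start: datetime, end: datetime) -> str:
    """Substitute the two time placeholders (stand-in for the real renderer)."""
    values = {
        "dtk_start_time": start.strftime("%Y-%m-%d %H:%M:%S"),
        "dtk_end_time": end.strftime("%Y-%m-%d %H:%M:%S"),
    }
    # Replace only the placeholders we know; anything else is left untouched.
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: values.get(m.group(1), m.group(0)),
        template,
    )


sql = render_query(QUERY, datetime(2024, 1, 1), datetime(2024, 1, 2))
```

Because each batch renders its own start and end, a query without this filter would return the same rows for every batch, which is why incremental loading depends on it.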
53
+ > Generated by `dtk init-claude`. Re-run it after upgrading detectkit to refresh
54
+ > these instructions and the files under `.claude/rules/detectkit/`.
@@ -0,0 +1,192 @@
1
+ # detectkit — Alerting
2
+
3
+ detectkit is **alert-centric**: the *alert* is the primary entity and a detector
4
+ anomaly is secondary evidence a rule interprets (the same anomaly means
5
+ different things under different rules). Configure alerting per metric under
6
+ `alerting:`. Channels themselves are defined in `profiles.yml` (see
7
+ `project.md`).
8
+
9
+ ```yaml
10
+ alerting:
11
+ enabled: true
12
+ channels: [mattermost_ops]
13
+ min_detectors: 1
14
+ direction: "same"
15
+ consecutive_anomalies: 3
16
+ alert_cooldown: "30min"
17
+ ```
18
+
19
+ ## The alert rule (quorum × direction × consecutive)
20
+
21
+ At the alert step, detectkit looks at the most recent detections and applies one
22
+ combined contract:
23
+
24
+ 1. **Quorum** — at each timestamp, group all detectors' anomalies. The point
25
+ satisfies the quorum when at least `min_detectors` of them match the
26
+ `direction` policy.
27
+ 2. **Consecutive** — an alert fires only when the latest `consecutive_anomalies`
28
+ timestamps each satisfy the quorum **and** are grid-adjacent (exactly one
29
+ `interval` apart). A missing detection row between two anomalies breaks the
30
+ chain.
31
+
32
+ ### `min_detectors` (default 1)
33
+
34
+ How many detectors must qualify at **every** point in the chain. `1` = any one
35
+ detector (high recall); `N` = all must agree (high precision).
36
+
37
+ ### `direction` (default `"same"`)
38
+
39
+ Which anomalies count toward the quorum:
40
+
41
+ - `"same"` — at the latest point, ≥`min_detectors` detectors must agree on **one**
42
+ direction (up and down counted separately — disagreement is not consensus).
43
+ The winning direction is **locked for the whole chain**. Ties: more detectors
44
+ win, then the more severe side.
45
+ - `"any"` — every anomaly counts regardless of direction (1 up + 1 down
46
+ satisfies `min_detectors: 2`).
47
+ - `"up"` — only anomalies above the interval count (others ignored, never block).
48
+ - `"down"` — only anomalies below the interval count.
49
+
50
+ Pick by meaning: `"up"` for CPU/error rate (high is bad), `"down"` for cache hit
51
+ rate/uptime (low is bad), `"any"` for single-detector "any deviation matters",
52
+ `"same"` for multi-detector consensus.
53
+
54
+ ### `consecutive_anomalies` (default 3)
55
+
56
+ Grid-adjacent quorum points required before alerting. `1` = alert immediately
57
+ (critical metrics); `3` = balanced; `5+` = noisy metrics. Gaps in the detection
58
+ grid break the chain.
59
+
60
+ ### Worked example (two detectors A, B; `min_detectors: 2`)
61
+
62
+ | `direction` | A | B | Result |
63
+ |---|---|---|---|
64
+ | `same` | up | down | no alert (disagreement) |
65
+ | `same` | up | up | quorum; "up" locked for the chain |
66
+ | `up` | up | down | no quorum (only one "up", needs 2) |
67
+ | `down` | up | up | no quorum ("up" ignored) |
68
+ | `any` | up | down | quorum (every anomaly counts) |
69
+
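The combined rule above (quorum per timestamp, then a grid-adjacent chain) can be sketched in a few lines. This is an illustrative model under stated assumptions, not detectkit's implementation: `qualifying` and `should_alert` are hypothetical names, detections are passed as a plain dict, and the severity tie-break for `"same"` is simplified to prefer `"up"`.

```python
from datetime import datetime, timedelta


def qualifying(directions, policy, locked=None):
    """Count anomalies at one timestamp that count toward the quorum.

    directions: list of "up"/"down", one per detector that flagged the point.
    Returns (count, direction_label).
    """
    ups, downs = directions.count("up"), directions.count("down")
    if policy == "any":
        return ups + downs, "any"
    if policy in ("up", "down"):
        return (ups if policy == "up" else downs), policy
    # policy == "same": use the locked direction, else the larger side.
    # Simplification: an exact tie goes to "up"; the documented rule uses severity.
    side = locked or ("up" if ups >= downs else "down")
    return (ups if side == "up" else downs), side


def should_alert(points, interval, min_detectors, policy, consecutive):
    """points: {timestamp: [directions]}; checks the latest `consecutive` points."""
    if not points:
        return False
    latest = max(points)
    locked = None
    for i in range(consecutive):
        ts = latest - i * interval       # grid-adjacent: exactly one interval apart
        if ts not in points:             # a missing detection row breaks the chain
            return False
        count, side = qualifying(points[ts], policy, locked)
        if count < min_detectors:
            return False
        if policy == "same":
            locked = side                # direction chosen at the latest point is kept
    return True


t0, step = datetime(2024, 1, 1), timedelta(minutes=10)
chain = {t0 + i * step: ["up", "up"] for i in range(3)}
fires = should_alert(chain, step, min_detectors=2, policy="same", consecutive=3)
```

Feeding the rows of the worked example through `should_alert` with `consecutive=1` reproduces the table's results.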
70
+ ## Cooldown (spam control) — **set it in production**
71
+
72
+ `alert_cooldown` defaults to **`null` = no cooldown**, meaning a persisting
73
+ anomaly re-alerts on **every** `dtk run` (e.g. every cron tick). Always set a
74
+ cooldown for production metrics.
75
+
76
+ ```yaml
77
+ alert_cooldown: "30min" # or seconds: 1800
78
+ cooldown_reset_on_recovery: true # default — reset the timer when the metric recovers
79
+ ```
80
+
81
+ - With `cooldown_reset_on_recovery: true` (recommended): alert on first
82
+ occurrence, suppress duplicates while it persists, alert again on a fresh
83
+ incident after recovery.
84
+ - With `false` (strict): an absolute minimum time between any alerts, regardless
85
+ of recovery — for very noisy metrics.
86
+ - No-data and anomaly alerts **share** the same cooldown state within an alert
87
+ block. State lives in `_dtk_alert_states`.
88
+
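The two cooldown modes above reduce to one small decision. The sketch below is an assumption-laden model, not the library's code: `may_send` and its arguments are hypothetical, and real state would come from `_dtk_alert_states`.

```python
from datetime import datetime, timedelta
from typing import Optional


def may_send(now: datetime,
             last_alert_at: Optional[datetime],
             recovered_since_last: bool,
             cooldown: Optional[timedelta],
             reset_on_recovery: bool = True) -> bool:
    """Return True when an alert is allowed to go out."""
    if cooldown is None or last_alert_at is None:
        return True   # no cooldown configured, or nothing sent yet
    if reset_on_recovery and recovered_since_last:
        return True   # fresh incident after recovery: the timer was reset
    return now - last_alert_at >= cooldown


sent_at = datetime(2024, 1, 1, 12, 0)
ten_min_later = sent_at + timedelta(minutes=10)
half_hour = timedelta(minutes=30)
```

With `cooldown=None` every call returns `True`, which mirrors the re-alert-on-every-run behaviour the section warns about.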
89
+ ## Recovery notifications
90
+
91
+ ```yaml
92
+ notify_on_recovery: true # default false
93
+ template_recovery: null # optional custom body
94
+ ```
95
+
96
+ Sends one notification per incident when the metric returns to normal after an
97
+ alert fired. **Direction-aware**: after a "down" alert, a fresh "up" anomaly
98
+ does not block recovery (the original condition no longer holds). Independent of
99
+ `alert_cooldown` (recovery always sends once per incident). Default body is
100
+ alert-centric (`✅ Alert cleared: <metric>`).
101
+
102
+ ## No-data alerts
103
+
104
+ ```yaml
105
+ no_data_alert: true # default false
106
+ template_no_data: null # optional custom body
107
+ ```
108
+
109
+ Fires when the **last complete interval** (now floored to a boundary, minus one
110
+ interval) has no datapoint, or the row's value is `NULL`/`NaN`. `min_detectors`
111
+ and `consecutive_anomalies` do **not** apply (it's a single binary signal).
112
+ Honors `alert_cooldown` and `suppress_until`. Webhook channels render it amber.
113
+ Use for cron loaders where source absence is a real failure; **don't** enable on
114
+ naturally sparse metrics.
115
+
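The "last complete interval" described above (now floored to a boundary, minus one interval) can be computed as follows. This is a sketch under assumptions: the function names are hypothetical and the grid is assumed to be aligned to the Unix epoch in UTC.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_complete_interval(now: datetime, interval: timedelta) -> datetime:
    """Start of the last complete interval: floor `now` to the grid, minus one step."""
    steps = (now - EPOCH) // interval      # whole intervals elapsed since the epoch
    return EPOCH + steps * interval - interval


def is_no_data(value: Optional[float]) -> bool:
    """A missing row (None here), NULL, or NaN all count as no data."""
    return value is None or math.isnan(value)


now = datetime(2024, 1, 1, 12, 7, 30, tzinfo=timezone.utc)
expected_ts = last_complete_interval(now, timedelta(minutes=10))
```

Here the interval still in progress (starting 12:00) is skipped and the check targets the one starting at 11:50.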
116
+ ## Temporary suppression
117
+
118
+ ```yaml
119
+ suppress_until: "2026-04-11 18:00:00" # UTC; default null
120
+ ```
121
+
122
+ Load and detect keep running; only alerting is paused until that time, then it
123
+ auto-resumes (no second edit needed). For permanent off, use `enabled: false`.
124
+
125
+ ## Mentions
126
+
127
+ ```yaml
128
+ mentions: [oncall_engineer, here] # plain names, no @
129
+ ```
130
+
131
+ Channel-agnostic: you write plain usernames and each channel renders them
132
+ natively. Special broadcast keywords: `here`, `channel`, `all`. Available as
133
+ `{mentions}` / `{mentions_line}` template variables (appended automatically if
134
+ not placed in a template). Slack `@username` is display-only — use Slack user
135
+ IDs (`U…`) for real pings.
136
+
137
+ ## Multiple alert configs per metric
138
+
139
+ `alerting:` may be a **list** of independent blocks, each with its own channels,
140
+ timezone, template, and rule — evaluated and sent independently:
141
+
142
+ ```yaml
143
+ alerting:
144
+ - {enabled: true, channels: [mattermost_ops], consecutive_anomalies: 3}
145
+ - {enabled: true, channels: [slack_critical], consecutive_anomalies: 1, direction: "up"}
146
+ ```
147
+
148
+ Each block's state is keyed by a hash of its functional fields; editing those
149
+ fields or removing a block orphans its `_dtk_alert_states` row (prune with
150
+ `dtk clean`). Disabling with `enabled: false` keeps the hash, so a paused alert
151
+ is never treated as orphaned.
152
+
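One way to picture the state key described above is a stable hash over a block's functional fields with `enabled` left out. The sketch below is purely illustrative: the text does not list which fields are functional or which hash is used, so `alert_block_key`, the excluded-field set, and SHA-256 are all assumptions.

```python
import hashlib
import json

# Assumption: only `enabled` is excluded; the real set of functional fields may differ.
NON_FUNCTIONAL = {"enabled"}


def alert_block_key(block: dict) -> str:
    """Stable key for an alert block, ignoring non-functional fields."""
    functional = {k: v for k, v in block.items() if k not in NON_FUNCTIONAL}
    # sort_keys makes the key independent of the order fields appear in the YAML.
    payload = json.dumps(functional, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


active = {"enabled": True, "channels": ["mattermost_ops"], "consecutive_anomalies": 3}
paused = {**active, "enabled": False}
retuned = {**active, "consecutive_anomalies": 5}
```

Under this model `paused` keeps the same key as `active`, while `retuned` gets a new key and leaves the old state row orphaned.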
153
+ ## Templates
154
+
155
+ Defaults are alert-centric. Override with:
156
+ - `template_single` — alerts with `consecutive_count` ≤ 1.
157
+ - `template_consecutive` — streaks (`> 1`); falls back to `template_single`.
158
+ - `template_recovery`, `template_no_data` — recovery / no-data bodies.
159
+
160
+ Templates are plain `{var}` strings (or Jinja2 `.j2` files under `templates_dir`
161
+ referenced by path). Key variables:
162
+
163
+ | Variable | Meaning |
164
+ |---|---|
165
+ | `{metric_name}`, `{description}` / `{description_line}` | identity |
166
+ | `{timestamp}`, `{timezone}` | when (display tz via `alerting.timezone`, default UTC) |
167
+ | `{value}` / `{value_display}` | metric value (`value_display` is NaN-safe) |
168
+ | `{confidence_lower}` / `{confidence_upper}` / `{confidence_interval}` | bounds |
169
+ | `{expected_range}` | one-sided-aware band (`>= 7.00`, `<= 1.10`, `[lo, hi]`, `N/A`) |
170
+ | `{detector_name}`, `{detector_count}` | who fired (`"N detectors"` for multi) |
171
+ | `{min_detectors}` / `{direction_policy}` / `{consecutive_required}` | the configured rule |
172
+ | `{direction}`, `{consecutive_count}`, `{severity}` | observed values |
173
+ | `{status}` | `ANOMALY` / `RECOVERED` / `NO_DATA` / `ERROR` |
174
+ | `{mentions}` / `{mentions_line}` | formatted mentions |
175
+
176
+ > For no-data/error alerts there is no numeric value — avoid `{value:.2f}` in
177
+ > those templates (detectkit falls back to the default template rather than
178
+ > crashing, but write kind-appropriate templates).
179
+
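The fallback mentioned in the note above can be modelled with Python's own `str.format`, which plain `{var}` templates resemble. This is a sketch, not detectkit's renderer: `render` is a hypothetical helper and the variable values are invented.

```python
def render(template: str, default: str, variables: dict) -> str:
    """Render a `{var}` template; fall back to `default` if formatting fails."""
    try:
        return template.format(**variables)
    except (KeyError, ValueError, TypeError):
        # e.g. `{value:.2f}` applied to a missing or non-numeric value
        return default.format(**variables)


custom = "{metric_name}: {value:.2f}"
default = "{status}: {metric_name} ({value_display})"

no_data = {"metric_name": "cpu_usage", "value": None,
           "value_display": "N/A", "status": "NO_DATA"}
anomaly = {"metric_name": "cpu_usage", "value": 1.2345,
           "value_display": "1.23", "status": "ANOMALY"}

no_data_body = render(custom, default, no_data)
anomaly_body = render(custom, default, anomaly)
```

Using `{value_display}` instead of a numeric format spec avoids the fallback altogether, since it is described as NaN-safe.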
180
+ ## Test, tune, debug
181
+
182
+ ```bash
183
+ dtk test-alert <metric> # mock alert through the real channels, using this rule
184
+ ```
185
+
186
+ - **Too many alerts** → raise `consecutive_anomalies`, raise detector
187
+ `threshold`, use `min_detectors: 2`, add seasonality, or set a `direction`.
188
+ - **No alerts** → check `enabled: true`, channels exist in `profiles.yml`,
189
+ detections exist (`dtk run --steps detect`), the quorum/consecutive thresholds
190
+ aren't too high, and `direction` isn't filtering the move out.
191
+ - **Wrong direction** (alerting when CPU drops) → set `direction: "up"`.
192
+ - Aim for **< 5 alerts/day/team** to avoid fatigue.
@@ -0,0 +1,138 @@
1
+ # detectkit — CLI (`dtk`)
2
+
3
+ Run all commands from a project directory (the one containing
4
+ `detectkit_project.yml`). `dtk --help` and `dtk <command> --help` always work.
5
+
6
+ ## Commands
7
+
8
+ | Command | Purpose |
9
+ |---|---|
10
+ | `dtk init <name>` | Scaffold a new project directory |
11
+ | `dtk init-claude` | (Re)generate this Claude context (CLAUDE.md + `.claude/rules/detectkit/` + skills) |
12
+ | `dtk run --select <sel>` | Run the load → detect → alert pipeline |
13
+ | `dtk test-alert <metric>` | Send a mock alert to the metric's channels |
14
+ | `dtk unlock --select <sel>` | Clear a stuck pipeline lock |
15
+ | `dtk clean --select <sel>` | Prune internal data that no longer matches the config |
16
+ | `dtk --version` | Show installed detectkit version |
17
+
18
+ ## Selectors (`--select` / `-s`)
19
+
20
+ Used by `run`, `unlock`, and `clean` (drift mode). Three forms:
21
+
22
+ - **Metric name** — `--select cpu_usage`. Searches the root `metrics/` dir only.
23
+ Do **not** add `.yml` (it is appended). This matches the metric **file**, but
24
+ every operation is keyed by the metric `name` inside the YAML.
25
+ - **Path / glob** — `--select "metrics/critical/*.yml"`, `--select "api_*"`,
26
+ `--select "metrics/**/*.yml"`. Searches recursively via glob; keep `.yml`.
27
+ - **Tag** — `--select tag:critical`. Searches recursively for metrics whose
28
+ `tags:` list contains that tag.
29
+
30
+ `--select "*"` selects everything. `--exclude / -e` removes matches
31
+ (`--select "*" --exclude "metrics/staging/*"`). Metric names must be unique
32
+ across the project; duplicates raise an error listing the conflicting files.
33
+
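The three selector forms can be told apart by shape alone. The heuristic below is an assumption made for illustration (the actual resolution logic is not shown here), and `classify_selector` is a hypothetical name.

```python
from typing import Tuple


def classify_selector(selector: str) -> Tuple[str, str]:
    """Guess which selector form a `--select` value uses."""
    if selector.startswith("tag:"):
        return ("tag", selector[len("tag:"):])
    # Assumption: glob characters, a path separator, or a .yml suffix mean path/glob.
    if any(ch in selector for ch in "*?[/") or selector.endswith(".yml"):
        return ("glob", selector)
    return ("name", selector)


forms = [classify_selector(s) for s in
         ("cpu_usage", "tag:critical", "api_*", "metrics/critical/*.yml")]
```

A bare name falls through to the last branch, which matches the advice not to append `.yml` to it.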
34
+ ## `dtk run`
35
+
36
+ ```bash
37
+ dtk run --select <sel> [--steps load,detect,alert] [--from DATE] [--to DATE] \
38
+ [--full-refresh] [--force] [--profile NAME]
39
+ ```
40
+
41
+ - `--steps` — subset/order of `load`, `detect`, `alert` (default all). Examples:
42
+ `--steps load` (verify the query), `--steps detect` (rerun detection only),
43
+ `--steps detect,alert` (skip load).
44
+ - `--from DATE` / `--to DATE` — `YYYY-MM-DD` or `YYYY-MM-DD HH:MM:SS`, UTC.
45
+ Affects only the `load` step. `--from` overrides the metric's
46
+ `loading_start_time`; `--to` defaults to now.
47
+ - `--full-refresh` — **destructive**: deletes the metric's datapoints and
48
+ detections, then reloads from `loading_start_time`/`--from`. Use after
49
+ changing the query or to recompute detections over history.
50
+ - `--force` — ignore a held lock and run anyway (also releases it on exit).
51
+ Risky with concurrent runs; usually `dtk unlock` is the better recovery.
52
+ - `--profile` — override the project's default profile (e.g. run against staging).
53
+
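The two accepted `--from` / `--to` formats can be parsed as shown below. This is a sketch of the documented formats only; `parse_cli_date` is a hypothetical helper, not the CLI's parser.

```python
from datetime import datetime, timezone


def parse_cli_date(text: str) -> datetime:
    """Parse `YYYY-MM-DD` or `YYYY-MM-DD HH:MM:SS` and treat it as UTC."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"expected YYYY-MM-DD or YYYY-MM-DD HH:MM:SS, got {text!r}")


day = parse_cli_date("2024-01-01")
moment = parse_cli_date("2024-01-01 06:30:00")
```

A date without a time resolves to midnight UTC, so `--from "2024-01-01"` starts at the beginning of that day.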
54
+ ## `dtk test-alert <metric>`
55
+
56
+ Sends a mock alert (fake value/CI/severity) through the metric's configured
57
+ channels, using that alert config's own rule (`min_detectors` / `direction` /
58
+ `consecutive_anomalies`), so the preview matches a real firing. Use it to
59
+ verify webhook URLs, channel permissions, and custom templates.
60
+
61
+ ## `dtk unlock --select <sel>`
62
+
63
+ Every run records a `running` lock in `_dtk_tasks` and clears it on exit. If a
64
+ run is killed mid-flight (commonly the **DB restarting mid-run**), the lock is
65
+ left behind and later non-`--force` runs fail with
66
+ `Failed to acquire lock … Use --force`. `dtk unlock` clears it immediately
67
+ without running the pipeline. (Stuck locks also auto-expire after ~1 hour, so
68
+ the next normal run recovers on its own.)
69
+
70
+ ## `dtk clean`
71
+
72
+ Editing metrics over time leaves stale rows in the internal tables. `dtk clean`
73
+ removes that drift. **Both modes dry-run by default** — pass `--execute` to
74
+ actually delete.
75
+
76
+ - **Drift mode** — `dtk clean --select <sel>`: for each still-existing metric,
77
+ deletes `_dtk_detections` rows for `detector_id`s the config no longer
78
+ produces (you changed a detector param / `seasonality_components`, or removed
79
+ a detector), and `_dtk_alert_states` rows for alert blocks the config no
80
+ longer produces. Datapoints are never touched (keyed only by timestamp).
81
+ - **GC mode** — `dtk clean --orphaned-metrics`: deletes all internal rows for
82
+ metric names present in the DB but no longer defined by any YAML (renamed or
83
+ deleted metrics). Ignores `--select`; asks for confirmation on `--execute`
84
+ unless `-y/--yes`; refuses to run if the project defines no metrics or configs
85
+ fail to parse (so a wrong directory can't wipe valid data).
86
+
87
+ ## Common workflows
88
+
89
+ ```bash
90
+ # First run of a metric
91
+ dtk run --select my_metric
92
+
93
+ # Cron loop (every interval)
94
+ dtk run --select "*"
95
+ dtk run --select "tag:critical"
96
+
97
+ # Backfill history
98
+ dtk run --select my_metric --from "2024-01-01"
99
+ dtk run --select my_metric --from "2024-01-01" --to "2024-02-01"
100
+
101
+ # Reprocess after config changes
102
+ dtk run --select my_metric --full-refresh # query changed → reload
103
+ dtk run --select my_metric --steps detect --full-refresh # detector changed → recompute detections
104
+ dtk clean --select my_metric # then prune orphaned old detector/alert rows
105
+ dtk clean --select my_metric --execute
106
+
107
+ # Debug
108
+ dtk run --select my_metric --steps load # does the query return data?
109
+ dtk run --select my_metric --steps detect # does the detector fire?
110
+ dtk test-alert my_metric # do the channels work?
111
+
112
+ # Recover a stuck lock
113
+ dtk unlock --select my_metric
114
+ ```
115
+
116
+ ## Scheduling
117
+
118
+ detectkit has no built-in scheduler — drive `dtk run` from cron / systemd
119
+ timers / Windows Task Scheduler. Always `cd` into the project first:
120
+
121
+ ```cron
122
+ */10 * * * * cd /path/to/project && dtk run --select "*" >> /var/log/detectkit.log 2>&1
123
+ ```
124
+
125
+ Pair scheduling with `error_alerting` (in `detectkit_project.yml`) so in-process
126
+ failures page someone; cron monitoring covers `dtk run` not running at all.
127
+
128
+ ## Troubleshooting
129
+
130
+ - **"Metric not found"** — selector doesn't match. Use the bare name
131
+ (`cpu_usage`, not `cpu_usage.yml`) for root metrics; check `ls metrics/`.
132
+ - **"Failed to acquire lock"** — a crashed run left a lock. `dtk unlock --select <m>`
133
+ (or wait ~1h for auto-expiry).
134
+ - **"Connection refused"** — check `profiles.yml` and DB connectivity.
135
+ - **"No data loaded"** — run the query manually with sample dates; verify the
136
+ `{{ dtk_start_time }}` / `{{ dtk_end_time }}` filter.
137
+ - **All points "insufficient_data"** — not enough history before `min_samples`;
138
+ lower `min_samples`, or backfill more history with `--from`.
@@ -0,0 +1,193 @@
1
+ # detectkit — Detectors
2
+
3
+ A metric's `detectors:` is a list; each entry runs independently and writes its
4
+ own detection rows. A detector flags points whose value falls outside an
5
+ expected confidence interval it computes from history.
6
+
7
+ ```yaml
8
+ detectors:
9
+ - type: mad
10
+ params:
11
+ threshold: 3.0
12
+ window_size: 288
13
+ ```
14
+
15
+ ## Choosing a detector
16
+
17
+ | Detector | Use when | Robust to outliers | Seasonality |
18
+ |---|---|---|---|
19
+ | `manual_bounds` | You know the acceptable bounds (SLA, hard limit) | n/a | no |
20
+ | `mad` | General-purpose default; outliers / non-normal data | yes | yes |
21
+ | `zscore` | Clean, normally distributed data | no | yes |
22
+ | `iqr` | Skewed distributions, percentile metrics (p95/p99) | yes | yes |
23
+
24
+ Quick decision: known bounds → `manual_bounds`; seasonal → `mad` with
25
+ `seasonality_components`; normal & clean → `zscore`; skewed/heavy-tailed →
26
+ `mad` or `iqr`; unsure → `mad`.
27
+
28
+ You can combine detectors — e.g. a `manual_bounds` hard cap plus a `mad`
29
+ pattern detector. The alerting `min_detectors` quorum then decides how many
30
+ must agree (see `alerting.md`).
31
+
32
+ ## `manual_bounds`
33
+
34
+ Fixed thresholds with no window, so detection starts immediately (no warm-up). Of the optional features, only `input_type` is supported.
35
+
36
+ ```yaml
37
+ - type: manual_bounds
38
+ params:
39
+ upper_bound: 90.0 # alert when value > 90
40
+ lower_bound: 0.8 # alert when value < 0.8 (use either or both)
41
+ ```
42
+
43
+ ## Windowed detectors (`mad`, `zscore`, `iqr`)
44
+
45
+ These three **share one implementation** and accept an identical parameter set.
46
+ Each computes statistics over a trailing window (current point excluded) and an
47
+ expected interval for the current point.
48
+
49
+ ```yaml
50
+ - type: mad # same params for zscore / iqr
51
+ params:
52
+ # --- core (all participate in the detector_id hash) ---
53
+ threshold: 3.0 # defaults: mad 3.0, zscore 3.0, iqr 1.5
54
+ window_size: 100 # trailing window in points
55
+ min_samples: 30 # min valid points in window before detection runs
56
+ seasonality_components: null # e.g. ["hour"] or [["hour","day_of_week"]]
57
+ min_samples_per_group: 10 # defaults: mad 10, zscore 3, iqr 4
58
+ input_type: values # values | changes | absolute_changes | log_changes
59
+ smoothing: null # null | ema | sma
60
+ smoothing_alpha: 0.3 # EMA factor, 0 < a <= 1
61
+ smoothing_window: 10 # SMA window in points
62
+ window_weights: null # null (uniform) | exponential | linear
63
+ half_life: null # exponential half-life: int points or "3d"/"12h"
64
+ detrend: null # null | linear
65
+ # --- execution (NOT hashed) ---
66
+ start_time: "2024-01-01 00:00:00" # when detection begins
67
+ batch_size: 500
68
+ ```
69
+
70
+ **Threshold semantics:**
71
+ - `mad` is scaled by the normal-consistency constant 1.4826, so `threshold` is
72
+ in **σ-equivalents** — `3.0` ≈ 3-sigma, just like `zscore`. Lower = more
73
+ sensitive.
74
+ - `zscore`: `threshold` = number of standard deviations (±3.0 covers ≈ 99.7% of normally distributed values).
75
+ - `iqr`: `threshold` = Tukey fence multiplier (1.5 = standard outliers, 3.0 =
76
+ extreme only).
77
+
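As a rough illustration of the three threshold rules above, the following Python sketch computes an expected interval from a trailing window using only the standard library. It is not detectkit's implementation: the function names are hypothetical, the quartile method and the use of the population standard deviation are assumptions, and seasonality, weighting and detrending are ignored.

```python
import statistics

MAD_SCALE = 1.4826  # normal-consistency constant: scaled MAD ~ sigma for normal data


def mad_interval(window, threshold=3.0):
    """Median +/- threshold * scaled MAD (threshold in sigma-equivalents)."""
    med = statistics.median(window)
    mad = statistics.median(abs(x - med) for x in window)
    half_width = threshold * MAD_SCALE * mad
    return med - half_width, med + half_width


def zscore_interval(window, threshold=3.0):
    """Mean +/- threshold standard deviations (population std is an assumption)."""
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    return mean - threshold * std, mean + threshold * std


def iqr_interval(window, threshold=1.5):
    """Tukey fences: [Q1 - k*IQR, Q3 + k*IQR] (quartile method is an assumption)."""
    q1, _, q3 = statistics.quantiles(window, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - threshold * spread, q3 + threshold * spread


# Trailing window that already contains one old outlier (50).
window = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
current = 40  # the point being checked; it is not part of the window

for name, fn in [("mad", mad_interval), ("zscore", zscore_interval), ("iqr", iqr_interval)]:
    low, high = fn(window)
    print(name, "anomaly" if not (low <= current <= high) else "normal")
```

The old outlier inflates the standard deviation, so the `zscore` interval becomes wide enough to accept the new spike, while the median-based `mad` and quartile-based `iqr` intervals stay tight. This matches the "Robust to outliers" column in the table above.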
78
+ **Window sizing:** for non-seasonal metrics 100–500 points; for seasonal
79
+ metrics size the window to contain several full cycles (10-min data: 4320–8640
80
+ ≈ 30–60 days; hourly: 672–2016 ≈ 1–3 weeks; daily: 60–90). `min_samples` ≈
81
+ 10–30% of `window_size`.
82
+
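The point counts above are just the desired history span divided by the metric's grid step. A small Python sketch of that arithmetic follows; the helper names `points_for` and `min_samples_range` are hypothetical and not part of detectkit.

```python
from datetime import timedelta


def points_for(span: timedelta, step: timedelta) -> int:
    """Number of grid points that fit into `span` at the given `step`."""
    return span // step


def min_samples_range(window_size: int) -> tuple[int, int]:
    """The 10-30% rule of thumb for `min_samples`, as (low, high)."""
    return window_size // 10, window_size * 3 // 10


ten_min = timedelta(minutes=10)
print(points_for(timedelta(days=30), ten_min))  # window_size for 30 days of 10-min data
print(points_for(timedelta(days=60), ten_min))  # window_size for 60 days of 10-min data
print(min_samples_range(points_for(timedelta(days=30), ten_min)))
```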
83
+ ## Seasonality grouping
84
+
85
+ Statistics are computed within seasonality buckets so peak vs off-peak get
86
+ different expected ranges. Component names must match the metric's seasonality
87
+ feature names (built-in `seasonality_columns` or query-provided — see
88
+ `metrics.md`).
89
+
90
+ ```yaml
91
+ seasonality_components:
92
+ - "hour" # 24 separate per-hour adjustments
93
+ # - ["hour", "day_of_week"] # one combined group per hour+day (168 groups)
94
+ ```
95
+
96
+ - `["hour"]` — single component (24 groups).
97
+ - `["hour", "day_of_week"]` — two *separate* adjustments.
98
+ - `[["hour", "day_of_week"]]` — one *combined* group per pair.
99
+
100
+ `min_samples_per_group` is the minimum number of points required per bucket.
101
+
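The difference between the flat and nested forms can be pictured as the set of bucket keys a timestamp maps to. The Python sketch below is only an illustration: `FEATURES`, `group_keys` and the Monday-is-0 numbering for `day_of_week` are assumptions, not detectkit's actual feature extraction.

```python
from datetime import datetime

# Hypothetical feature extractors keyed by component name.
FEATURES = {
    "hour": lambda ts: ts.hour,
    "day_of_week": lambda ts: ts.weekday(),  # Monday = 0 (assumed numbering)
}


def group_keys(ts, components):
    """Return one bucket key per entry in `seasonality_components`.

    A plain string is a single-feature grouping; a nested list is one
    combined grouping over all of its features.
    """
    keys = []
    for comp in components:
        names = [comp] if isinstance(comp, str) else list(comp)
        keys.append(tuple(FEATURES[name](ts) for name in names))
    return keys


ts = datetime(2024, 1, 1, 14, 30)  # a Monday afternoon
print(group_keys(ts, ["hour"]))                   # one grouping, 24 possible buckets
print(group_keys(ts, ["hour", "day_of_week"]))    # two separate groupings (24 and 7 buckets)
print(group_keys(ts, [["hour", "day_of_week"]]))  # one combined grouping, 24 * 7 = 168 buckets
```

Combined groups split the window into many more buckets, so each bucket sees fewer points; that is where `min_samples_per_group` and a larger `window_size` start to matter.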
102
+ ## Preprocessing — `input_type`
103
+
104
+ Detect on transformed values (applied after smoothing):
105
+
106
+ - `values` (default) — raw values; absolute thresholds meaningful (CPU%, latency).
107
+ - `absolute_changes` — `v[t] - v[t-1]`; sudden jumps/drops matter.
108
+ - `changes` — `(v[t] - v[t-1]) / v[t-1]`; relative moves (traffic, revenue).
109
+ - `log_changes` — `log1p(v[t]) - log1p(v[t-1])`; near-symmetric for big % moves,
110
+ tolerates zeros (values must be > −1).
111
+
112
+ For the change-based types the first point has no predecessor → NaN; the detect
113
+ context pulls one extra point to compensate.
114
+
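The four transforms can be written out directly from the formulas above. This is a minimal Python sketch, not the library's code; the function name `transform` and the choice to return NaN when the previous value is zero are assumptions.

```python
import math

INPUT_TYPES = {"values", "absolute_changes", "changes", "log_changes"}


def transform(values, input_type="values"):
    """Apply an `input_type` transform to a list of numbers."""
    if input_type not in INPUT_TYPES:
        raise ValueError(f"unknown input_type: {input_type!r}")
    if input_type == "values":
        return list(values)

    out = [math.nan]  # the first point has no predecessor
    for prev, cur in zip(values, values[1:]):
        if input_type == "absolute_changes":
            out.append(cur - prev)
        elif input_type == "changes":
            # Relative change is undefined for prev == 0 (NaN here is an assumption).
            out.append((cur - prev) / prev if prev != 0 else math.nan)
        else:  # log_changes: tolerates zeros because log1p(0) == 0
            out.append(math.log1p(cur) - math.log1p(prev))
    return out


series = [100, 200, 100]
print(transform(series, "changes"))      # doubling and halving give +1.0 and -0.5
print(transform(series, "log_changes"))  # the same moves are equal in size, opposite in sign
```

The last two lines show why `log_changes` is described as near-symmetric: a doubling followed by a halving produces relative changes of different magnitude but log changes that cancel exactly.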
115
+ ## Smoothing
116
+
117
+ Reduce noise before detection: `smoothing: sma` (`smoothing_window`) or
118
+ `smoothing: ema` (`smoothing_alpha`, higher = less smoothing). Trade-off: less
119
+ noise but reduced sensitivity to short spikes.
120
+
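To make the trade-off concrete, here is a small Python sketch of both smoothers applied to a series with a single spike. The functions `sma` and `ema` are hypothetical illustrations; in particular, how detectkit treats the first points of the series (partial SMA windows, EMA seed value) is an assumption here.

```python
def sma(values, window):
    """Simple moving average over the last `window` points (shorter at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def ema(values, alpha):
    """Exponential moving average; alpha = 1 leaves the series unchanged."""
    if not 0 < alpha <= 1:
        raise ValueError("smoothing_alpha must satisfy 0 < alpha <= 1")
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out


series = [10, 10, 10, 40, 10, 10]
print(sma(series, window=3))
print(ema(series, alpha=0.3))
```

In both outputs the spike at index 3 is pulled well below 40 and then bleeds into the following points, which is the reduced sensitivity to short spikes mentioned above.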
121
+ ## Trending metrics (avoid alert spam)
122
+
123
+ A gradual trend drifts a uniform window's interval behind the current level, so
124
+ every point starts to look "below" → false alerts. Two shared params fix this:
125
+
126
+ - `window_weights: exponential` + `half_life: "3d"` — recent points weigh more,
127
+ so the interval follows the new normal. `half_life` is the age at which a
128
+ point's weight halves (int = points; `"3d"`/`"12h"` converted via the grid
129
+ step; default `window_size/20`). `linear` weighting is also available.
130
+ - `detrend: linear` — removes a robust in-window linear trend before computing
131
+ statistics; gradual drift no longer pulls the metric out of its interval,
132
+ while sharp deviations are still caught.
133
+
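The weighting scheme can be sketched in a few lines of Python. The formula below follows from the stated definition (a point's weight halves every `half_life`), and the Kish effective sample size is the quantity reported as `ess` in the detection metadata; the function names and the exact string parsing are assumptions rather than detectkit's code.

```python
import re
from datetime import timedelta

UNITS = {"d": timedelta(days=1), "h": timedelta(hours=1)}


def half_life_points(half_life, step, window_size=None):
    """Convert `half_life` (int points, "3d"/"12h", or None) to a number of points."""
    if half_life is None:
        return window_size / 20  # documented default
    if isinstance(half_life, int):
        return float(half_life)
    match = re.fullmatch(r"(\d+)([dh])", half_life)
    if match is None:
        raise ValueError(f"unsupported half_life: {half_life!r}")
    return int(match.group(1)) * UNITS[match.group(2)] / step


def exponential_weights(n, half_life_pts):
    """Weights for a window of n points, oldest first; the newest point has weight 1."""
    return [0.5 ** (age / half_life_pts) for age in range(n - 1, -1, -1)]


def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


step = timedelta(minutes=10)
hl = half_life_points("3d", step)
weights = exponential_weights(8640, hl)
print(hl, round(kish_ess(weights)))
```

The effective sample size comes out far below the nominal `window_size`, which is worth keeping in mind when choosing `min_samples` and `min_samples_per_group` together with a short `half_life`.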
134
+ **Recommended recipe for trending, seasonal metrics:**
135
+ ```yaml
136
+ seasonality_columns: [hour]
137
+ detectors:
138
+ - type: mad
139
+ params:
140
+ window_size: 8640
141
+ min_samples: 1000
142
+ seasonality_components: ["hour"]
143
+ window_weights: exponential
144
+ half_life: "3d"
145
+ detrend: linear # optional, on top of weighting
146
+ ```
147
+ Trade-off: a shorter `half_life` adapts faster but also "accepts" a real
148
+ sustained degradation as the new normal sooner. (`weight_decay` is a deprecated
149
+ alias for `half_life`; prefer `half_life`.)
150
+
151
+ ## Feature compatibility
152
+
153
+ | Feature | mad | zscore | iqr | manual_bounds |
154
+ |---|---|---|---|---|
155
+ | `input_type` | ✅ | ✅ | ✅ | ✅ |
156
+ | `smoothing` | ✅ | ✅ | ✅ | ❌ |
157
+ | `window_weights` / `half_life` | ✅ | ✅ | ✅ | ❌ |
158
+ | `detrend` | ✅ | ✅ | ✅ | ❌ |
159
+ | `seasonality_components` | ✅ | ✅ | ✅ | ❌ |
160
+
161
+ `manual_bounds` has no window, so window-based features don't apply.
162
+
163
+ ## Detector identity & recomputation
164
+
165
+ Every parameter that affects results (threshold, window_size, min_samples,
166
+ seasonality_components, min_samples_per_group, input_type, smoothing*,
167
+ window_weights, half_life, detrend) is hashed into the `detector_id` — only
168
+ non-default values participate. Execution params (`start_time`, `batch_size`)
169
+ are **not** hashed.
170
+
171
+ Changing any hashed parameter creates a new `detector_id` and recomputes that
172
+ detector's detections from scratch on the next run; old rows stay under the
173
+ previous id. To recompute over history immediately:
174
+ `dtk run --select <m> --steps detect --full-refresh`. To prune the orphaned old
175
+ rows: `dtk clean --select <m> --execute`.
176
+
177
+ Parameters are validated when the detector is constructed at the start of the
178
+ `detect` step (per run, not at YAML load) — a typo like `input_type: "diff"`
179
+ fails fast with a clear error.
180
+
181
+ ## Tuning
182
+
183
+ - Too many false positives → raise `threshold`, add seasonality, add
184
+ `window_weights`/`detrend` for trends, or raise `consecutive_anomalies` in
185
+ alerting.
186
+ - Missing real anomalies → lower `threshold`, lower `window_size`, lower
187
+ `consecutive_anomalies`/`min_detectors`.
188
+ - All "insufficient_data" → lower `min_samples` or backfill more history.
189
+
190
+ Detection metadata records what the detector saw (`global_median`,
191
+ `adjusted_median`, `ess` = Kish effective sample size when weighting,
192
+ `trend_slope_per_point` when detrending, and a `preprocessing` block) — useful
193
+ for debugging.