anomaly-pipeline 0.1.27__py3-none-any.whl → 0.1.61__py3-none-any.whl
- anomaly_pipeline/__init__.py +73 -1
- anomaly_pipeline/helpers/DB_scan.py +144 -10
- anomaly_pipeline/helpers/MAD.py +45 -0
- anomaly_pipeline/helpers/Preprocessing.py +274 -73
- anomaly_pipeline/helpers/STD.py +64 -0
- anomaly_pipeline/helpers/__init__.py +13 -1
- anomaly_pipeline/helpers/evaluation_info.py +25 -17
- anomaly_pipeline/helpers/evaluation_plots.py +636 -30
- anomaly_pipeline/helpers/ewma.py +105 -7
- anomaly_pipeline/helpers/fb_prophet.py +150 -2
- anomaly_pipeline/helpers/{help_info.py → help_anomaly.py} +194 -89
- anomaly_pipeline/helpers/iso_forest_general.py +5 -3
- anomaly_pipeline/helpers/iso_forest_timeseries.py +195 -23
- anomaly_pipeline/helpers/percentile.py +46 -3
- anomaly_pipeline/main.py +158 -39
- anomaly_pipeline/pipeline.py +106 -34
- anomaly_pipeline-0.1.61.dist-info/METADATA +275 -0
- anomaly_pipeline-0.1.61.dist-info/RECORD +24 -0
- anomaly_pipeline-0.1.27.dist-info/METADATA +0 -15
- anomaly_pipeline-0.1.27.dist-info/RECORD +0 -24
- {anomaly_pipeline-0.1.27.dist-info → anomaly_pipeline-0.1.61.dist-info}/WHEEL +0 -0
- {anomaly_pipeline-0.1.27.dist-info → anomaly_pipeline-0.1.61.dist-info}/entry_points.txt +0 -0
- {anomaly_pipeline-0.1.27.dist-info → anomaly_pipeline-0.1.61.dist-info}/top_level.txt +0 -0
anomaly_pipeline/__init__.py
CHANGED
@@ -1,2 +1,74 @@
+"""----------------------------------------------------------------------------
+help_anomaly
+----------------------------------------------------------------------------
+
+For a nice overview of the anomaly pipeline run the following lines of code:
+
+>> from anomaly_pipeline.helpers.help_anomaly import help_anomaly
+>> help_anomaly()
+
+You can see information about specific models used in the anomaly pipeline with any of the following commands:
+
+>> help_anomaly('percentile')
+>> help_anomaly('iqr')
+>> help_anomaly('mad')
+>> help_anomaly('std')
+>> help_anomaly('ewma')
+>> help_anomaly('prophet')
+>> help_anomaly('dbscan')
+>> help_anomaly('iso') # For information on isolation forest
+
+
+----------------------------------------------------------------------------
+Functional Overview
+----------------------------------------------------------------------------
+The pipeline takes raw master data, partitions it into groups by unique ID, applies a suite of 8 different anomaly detection methods, and then flags observations as anomalies where at least half of the models consider the observation an anomaly.
+
+The master data DataFrame that you pass into the anomaly detection pipeline needs to have at least 3 columns - unique ID, date, and a target variable. The unique ID can be defined by multiple columns.
+
+Core Execution Stages
+1. Preprocessing & Interpolation
+Before modeling, the function interpolates target variable values for missing dates
+Fill gaps in the variable column to prevent model crashes.
+
+2. Statistical Baseline Models (Local Execution)
+The pipeline first runs four computationally light models sequentially on each group:
+- Percentile & IQR: Non-parametric bounds detection.
+- SD (Standard Deviation) & MAD (Median Absolute Deviation): Variance-based detection.
+
+3. Parallel Machine Learning Suite (process_group)
+To maximize performance, the pipeline uses joblib.Parallel to run intensive models across all available CPU cores. The process_group utility acts as a router, sending data to the correct engine based on the model key:
+- FB (Prophet): Walk-forward time-series forecasting.
+- EWMA: Exponentially weighted moving averages.
+- ISF (Isolation Forest): Unsupervised isolation of anomalies.
+- DBSCAN: Density-based spatial clustering.
+
+4. Majority Voting (Ensemble Logic)
+The power of this pipeline lies in its Consensus Model. After all models finish, the pipeline calculates:
+- Anomaly_Votes: The sum of flags across all 8 methods.
+- is_Anomaly: A final boolean set to True only if at least 4 models agree that the point is an outlier.
+
+Key Output Columns
+refresh_date: The timestamp of when the pipeline was executed.
+Anomaly_Votes: Total count of models that flagged the row.
+is_Anomaly: The final "Gold Standard" anomaly flag.
+Individual Model Flags: Columns like is_FB_anomaly, is_IQR_anomaly, etc., for granular auditing.
+
+Usage Context
+Use run_pipeline when you need a highly reliable, automated output. By combining statistical, forecasting, and clustering models, the pipeline reduces "false positives" often generated by single-model approaches.
+"""
+
 from .main import timeseries_anomaly_detection
-
+
+from .helpers import (help_anomaly,
+                      get_example_df,
+                      evaluation_info,
+                      anomaly_overview_plot,
+                      anomaly_percentile_plot,
+                      anomaly_sd_plot,
+                      anomaly_mad_plot,
+                      anomaly_iqr_plot,
+                      anomaly_ewma_plot,
+                      anomaly_fb_plot,
+                      anomaly_dbscan_plot,
+                      anomaly_isolation_forest_plot)
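The majority-voting rule described in the new module docstring (sum the eight per-model flags into `Anomaly_Votes`, then set `is_Anomaly` when at least 4 agree) can be illustrated with a small dependency-free sketch. This is not the package's implementation: only `is_FB_anomaly`, `is_IQR_anomaly`, and `is_DBSCAN_anomaly` appear in this diff, so the other column names, the `majority_vote` helper, and the choice to ignore missing flags are assumptions made for illustration.

```python
from typing import Dict, List

# Flag columns following the is_<MODEL>_anomaly pattern. Only the FB, IQR and
# DBSCAN names are confirmed by the diff; the rest are hypothetical.
MODEL_FLAGS: List[str] = [
    "is_Percentile_anomaly", "is_IQR_anomaly", "is_SD_anomaly", "is_MAD_anomaly",
    "is_FB_anomaly", "is_EWMA_anomaly", "is_ISF_anomaly", "is_DBSCAN_anomaly",
]


def majority_vote(row: Dict[str, object], min_votes: int = 4) -> Dict[str, object]:
    """Count model flags that are exactly True and apply the vote threshold.

    Assumption: a missing or NaN flag (a model that failed for this group)
    counts as "no vote" instead of being coerced to True.
    """
    votes = sum(1 for name in MODEL_FLAGS if row.get(name) is True)
    return {"Anomaly_Votes": votes, "is_Anomaly": votes >= min_votes}


# Example row: the four statistical models flag the point, the others do not.
row = {name: False for name in MODEL_FLAGS}
for name in MODEL_FLAGS[:4]:
    row[name] = True

result = majority_vote(row)
```

With four of eight flags set, the row reaches the threshold; with three it would not.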
anomaly_pipeline/helpers/DB_scan.py
CHANGED
@@ -33,6 +33,7 @@ def get_dynamic_lags(series: pd.Series) -> list:
 
     return dynamic_lags
 
+
 def find_optimal_epsilon(X_scaled: np.ndarray, k: int) -> float:
     """
     Finds the optimal epsilon by calculating the distance to the k-th nearest neighbor
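The body of `find_optimal_epsilon` is not shown in this diff, but its docstring, together with the DBSCAN docstring added further down (epsilon set at the 95th percentile of k-th-neighbor distances), suggests the general idea. The following is a dependency-free sketch of that idea, not the package's code: `kth_neighbor_distances`, `percentile`, and `estimate_epsilon` are hypothetical names, and the linear-interpolation percentile is an assumption.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def kth_neighbor_distances(points: List[Point], k: int) -> List[float]:
    """Distance from each point to its k-th nearest neighbor.

    The point itself counts as the first neighbor (distance 0), which mirrors
    querying a fitted neighbor index with its own training data.
    """
    result = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[k - 1])
    return result


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def estimate_epsilon(points: List[Point], k: int, q: float = 95.0) -> float:
    """Pick epsilon as the q-th percentile of the k-distances."""
    return percentile(kth_neighbor_distances(points, k), q)


# Four tightly packed points plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
k_dists = kth_neighbor_distances(points, k=2)
eps = estimate_epsilon(points, k=2)
```

On this toy data the dense points all have a k-distance of 1.0, while the isolated point sits far above it, so the percentile cut lands between the two.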
@@ -71,16 +72,133 @@ def detect_time_series_anomalies_dbscan(
     eval_period,
 ):
 
+    """# 🌀 DBSCAN Walk-Forward Anomaly Detection
+---
+
+The `detect_time_series_anomalies_dbscan` function implements a **density-based clustering** approach for time-series anomaly detection. It utilizes an **iterative walk-forward validation** strategy to identify data points that exist in "low-density" regions of the feature space.
+
+## 📋 Functional Overview
+This function transforms a univariate time series into a high-dimensional feature space using **dynamic lags** and **rolling statistics**. It then applies the **DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) algorithm to distinguish between dense clusters of "normal" behavior and sparse "noise" points (anomalies).
+
+## 🧠 Core Logic & Helper Utilities
+
+### 1. Dynamic Feature Engineering (`get_dynamic_lags`)
+Instead of using fixed lags, the function uses the **Autocorrelation Function (ACF)** to find the 10 most significant seasonal patterns in the data.
+* **Baseline:** Always includes lags 1, 2, and 3 to capture immediate momentum.
+* **Significance:** Uses a 75% confidence interval ($\\\\alpha=0.25$) to identify meaningful historical dependencies.
+
+### 2. Automated Parameter Tuning (`find_optimal_epsilon`)
+DBSCAN is highly sensitive to the **Epsilon ($\\\\epsilon$)** parameter (the neighborhood radius).
+* **Proxy Elbow Method:** The function automatically calculates $\\\\epsilon$ by analyzing the distance to the $k$-th nearest neighbor for all training points.
+* **Density Threshold:** It sets $\\\\epsilon$ at the **95th percentile** of these distances, ensuring that 95% of training data is considered "dense" while the most isolated 5% are candidates for noise.
+
+### 3. Walk-Forward Iteration
+For the initial training period, all points are evaluated using DBSCAN fitted on the same training data.
+For each period in the `eval_period`:
+* **Feature Construction:** Builds a matrix containing the variable, its dynamic lags, rolling means, rolling standard deviations, and a linear trend component.
+* **Scaling:** Fits a `StandardScaler` **only on training data** to prevent data leakage.
+* **Novelty Detection:** Since DBSCAN cannot "predict" on new points, the function uses a **Nearest Neighbors proxy**. If the distance from a new test point to its $k$-th neighbor in the training set is greater than the trained $\\\\epsilon$, it is flagged as an anomaly.
+
+## 📤 Key Output Columns
+* **`dbscan_score`**: The distance from the point to the $\\\\epsilon$ boundary (positive values indicate anomalies).
+* **`is_DBSCAN_anomaly`**: A boolean flag identifying outliers.
+* **Generated Features**: Includes all dynamic lags (`lagX`) and rolling statistics (`roll_mean_W`) used during the fit.
+
+## 💡 Usage Context
+DBSCAN is exceptionally powerful for detecting **contextual anomalies**—points that might look "normal" in value but are "weird" given their recent history or seasonal context. Because it is density-based, it can find anomalies in non-linear or multi-modal distributions where simple percentile or Z-score methods would fail.
+
+---
+### ⚠️ Performance Note
+This model is computationally more intensive than statistical methods due to the iterative re-fitting of the `NearestNeighbors` and `DBSCAN` models. It is best suited for high-priority metrics where accuracy is more critical than processing speed."""
+
     group[date_column] = pd.to_datetime(group[date_column])
     group = group.copy().sort_values(date_column).reset_index(drop=True)
+    group['set'] = np.where(np.arange(len(group)) >= len(group) - eval_period, 'TEST', 'TRAIN')
 
     # --- Default DBSCAN Parameters ---
     # These parameters often need tuning, but these are reasonable starting points:
     DEFAULT_EPS = 0.5 # Neighborhood radius (critical parameter)
 
     try:
-
+        all_results = []
+
+        # ===================================================================
+        # STEP 1: Evaluate all points in the initial TRAIN period
+        # ===================================================================
+
+        # Get the cutoff date for initial train period
+        initial_cutoff_date = group[group['set'] == 'TRAIN'][date_column].max()
+
+        # Prepare the full group with features
+        model_group_initial = group.copy()
+
+        # Get train set to determine lags
+        train_initial = model_group_initial[model_group_initial['set'] == 'TRAIN'].copy()
+        lags = get_dynamic_lags(train_initial[variable])
+
+        # Create lag features and rolling stats for the entire DF
+        rolling_stats_features = []
+        for lag in lags:
+            model_group_initial[f'lag{lag}'] = model_group_initial[variable].shift(lag)
+
+        for w in [int(np.ceil(max(lags)/4)), int(np.ceil(max(lags)/2)), int(max(lags))]:
+            if w >= 3:
+                rolling_stats_features.extend([f'roll_mean_{w}', f'roll_std_{w}'])
+                model_group_initial[f'roll_mean_{w}'] = model_group_initial[variable].shift(1).rolling(w).mean()
+                model_group_initial[f'roll_std_{w}'] = model_group_initial[variable].shift(1).rolling(w).std()
+
+        model_group_initial['trend'] = model_group_initial.index
+        model_group_initial = model_group_initial.copy().dropna()
 
+        # Get just the initial train set
+        train_initial = model_group_initial[model_group_initial['set'] == 'TRAIN'].copy()
+
+        # Identify all model features (lags, rolling stats, trend, and the variable itself)
+        features = [f'lag{i}' for i in lags] + rolling_stats_features + ['trend'] + [variable]
+
+        # Fit the scaler on the training data
+        scaler = StandardScaler()
+        scaler.fit(train_initial[features])
+        train_scaled = scaler.transform(train_initial[features])
+
+        # Determine min_samples based on feature space dimension
+        min_samples = max(2 * len(features), 3)
+
+        # Find optimal epsilon
+        calculated_eps = find_optimal_epsilon(train_scaled, k=min_samples)
+
+        # --- DBSCAN MODEL on initial training data ---
+        dbscan_model = DBSCAN(
+            eps=calculated_eps,
+            min_samples=min_samples,
+            n_jobs=-1
+        )
+
+        # Fit DBSCAN on the scaled training data
+        cluster_labels = dbscan_model.fit_predict(train_scaled)
+
+        # For training points, use DBSCAN labels (-1 = noise/anomaly)
+        train_initial['is_DBSCAN_anomaly'] = (cluster_labels == -1)
+
+        # Calculate scores for training points
+        # Use distance to k-th nearest neighbor as score
+        neigh = NearestNeighbors(n_neighbors=min_samples)
+        neigh.fit(train_scaled)
+        distances, indices = neigh.kneighbors(train_scaled)
+        k_distance = distances[:, min_samples - 1]
+
+        train_initial['dbscan_score'] = k_distance - calculated_eps
+        train_initial['dbscan_score_high'] = 0
+
+        # Select relevant columns
+        train_initial_result = train_initial[[variable, date_column, 'dbscan_score',
+                                              'dbscan_score_high', 'is_DBSCAN_anomaly']]
+        all_results.append(train_initial_result)
+
+        # ===================================================================
+        # STEP 2: Walk-forward evaluation for TEST period (one-step-ahead)
+        # ===================================================================
+
         for t in list(range(eval_period - 1, -1, -1)):
 
             try:
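The feature-construction step in this hunk builds lag columns with `shift(lag)` and rolling statistics with `shift(1).rolling(w)`, so each row only sees values strictly before it. A dependency-free sketch of the same alignment is given below. The helper names (`lagged`, `rolling_mean_of_past`, `rolling_windows`) are hypothetical and `None` stands in for the NaN values pandas would produce; the package itself uses pandas for this.

```python
import math
from statistics import mean
from typing import List, Optional


def lagged(values: List[float], n: int) -> List[Optional[float]]:
    """Equivalent of Series.shift(n): the value observed n steps earlier."""
    size = len(values)
    return [None] * min(n, size) + list(values[: max(size - n, 0)])


def rolling_mean_of_past(values: List[float], window: int) -> List[Optional[float]]:
    """Equivalent of Series.shift(1).rolling(window).mean().

    Each position averages the `window` values strictly before it, so the
    current observation never leaks into its own feature.
    """
    out: List[Optional[float]] = []
    for i in range(len(values)):
        past = values[max(i - window, 0): i]
        out.append(mean(past) if len(past) == window else None)
    return out


def rolling_windows(lags: List[int]) -> List[int]:
    """Window sizes derived from the largest lag, keeping only sizes >= 3."""
    longest = max(lags)
    candidates = [math.ceil(longest / 4), math.ceil(longest / 2), int(longest)]
    return [w for w in candidates if w >= 3]


series = [1, 2, 3, 4, 5]
lag2 = lagged(series, 2)
roll3 = rolling_mean_of_past(series, 3)
windows = rolling_windows([1, 2, 3, 12])
```

Rows whose features are still `None` correspond to the rows that `dropna()` removes in the hunk above.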
@@ -155,27 +273,42 @@ def detect_time_series_anomalies_dbscan(
 
                 # Flag as anomaly if the k-distance is greater than the trained eps threshold
                 test['dbscan_score'] = k_distance - calculated_eps
-                test['
+                test['dbscan_score_high'] = 0
+                test['is_DBSCAN_anomaly'] = np.where(test['dbscan_score'] > 0, True, False)
 
-                test = test[[variable, date_column, 'dbscan_score', 'is_DBSCAN_anomaly']]
-
+                test = test[[variable, date_column, 'dbscan_score', 'dbscan_score_high', 'is_DBSCAN_anomaly']]
+                all_results.append(test)
 
             except Exception as e:
                 print(f"Error in iteration {t}: {e}")
                 pass
 
+        # ===================================================================
+        # STEP 3: Combine all results and merge back to original group
+        # ===================================================================
+
         try:
-
-
+            all_results_df = pd.concat(all_results, ignore_index=True)
+
+            # Merge back to original group
+            group = group.merge(
+                all_results_df[[variable, date_column, 'dbscan_score',
+                                'dbscan_score_high', 'is_DBSCAN_anomaly']],
+                on=[variable, date_column],
+                how='left'
+            )
+
+            # Fill any remaining NaNs with False for boolean column
             # group["is_DBSCAN_anomaly"] = group["is_DBSCAN_anomaly"].fillna(False)
-
-
+
+        except Exception as e:
+            print(f"Error in concatenating results: {e}")
             group['dbscan_score'] = np.nan
+            group['dbscan_score_high'] = np.nan
             group["is_DBSCAN_anomaly"] = np.nan
 
     except Exception as e:
         # Fallback error handling
-        # Replace key_series with group for robustness if key_series is not defined
         try:
             group_id_cols = group.select_dtypes(include=['object', 'string']).columns.tolist()
             group_id = " ".join(group[group_id_cols].reset_index(drop=True).iloc[0].astype(str).to_list())
@@ -183,6 +316,7 @@ def detect_time_series_anomalies_dbscan(
             group_id = "Unknown Group ID"
         print(f'DBSCAN Anomaly Detection failed for {group_id}. Error: {e}')
         group['dbscan_score'] = np.nan
+        group['dbscan_score_high'] = np.nan
         group["is_DBSCAN_anomaly"] = np.nan
 
-    return group
+    return group
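The walk-forward hunks above score each TEST point as `k_distance - calculated_eps` and flag it when the score is positive, because a fitted DBSCAN model cannot label unseen points. A dependency-free sketch of that nearest-neighbor proxy follows; `kth_distance_to_train` and `score_new_point` are hypothetical helpers, and the training set, `k`, and `eps` values are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def kth_distance_to_train(point: Point, train: List[Point], k: int) -> float:
    """Distance from a new point to its k-th nearest training point."""
    dists = sorted(math.dist(point, q) for q in train)
    return dists[k - 1]


def score_new_point(point: Point, train: List[Point], k: int, eps: float) -> Tuple[float, bool]:
    """Return (score, is_anomaly), where score = k-distance minus eps.

    A positive score means the point lies outside the eps-neighborhood of
    its k-th nearest training neighbor, i.e. in a low-density region.
    """
    score = kth_distance_to_train(point, train, k) - eps
    return score, score > 0


# Illustrative training cloud and threshold (not taken from the package).
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
inside_score, inside_flag = score_new_point((0.5, 0.5), train, k=2, eps=1.5)
outside_score, outside_flag = score_new_point((5.0, 5.0), train, k=2, eps=1.5)
```

The score keeps the sign convention of `dbscan_score` in the diff: negative inside the dense region, positive for candidate anomalies.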
anomaly_pipeline/helpers/MAD.py
CHANGED
@@ -5,6 +5,51 @@ from .Preprocessing import classify
 
 
 def detect_outliers_mad(group, variable, date_column, mad_threshold, mad_scale_factor, eval_period):
+
+    """
+# 🛡️ MAD Anomaly Detection Model
+---
+
+Median Absolute Deviation with Expanding Window
+
+The detect_outliers_mad function is a non-parametric outlier detection tool.
+Unlike methods based on the standard deviation, this model uses the Median and MAD,
+making it significantly more robust against data that contains extreme outliers or non-normal distributions.
+
+## 📋 Functional Overview
+
+The function identifies anomalies by calculating how far a data point deviates from the median.
+It utilizes an expanding window approach to ensure that as the dataset grows,
+the definition of "normal" behavior adapts dynamically to the historical context.
+
+## 🧠 Core Logic Stages
+
+1. Preprocessing & Validation
+Sample Size Check: Requires a minimum of 10 data points. If the group is too small, it returns an empty DataFrame to avoid biased statistical results.
+Deep Copy: Operates on a group.copy() to ensure the original input data remains untouched.
+
+2. Initial Training Block
+Baseline Calculation: For the first part of the series (pre-evaluation period), it establishes a static baseline.
+The MAD Formula: > It calculates the Median Absolute Deviation: MAD = median(|x_i - median(x)|).
+Thresholding: It uses a mad_scale_factor (default 0.6745) to make the MAD comparable to a standard deviation for a normal distribution.
+Bounds:
+MAD_high: Median + (Threshold x Scale)$
+MAD_low: max(Median - (Threshold x Scale), 0)$
+
+3. Expanding Window Evaluation
+Incremental Testing: For each point in the evaluation period, the function recalculates the Median and MAD using all data available up to that point.
+Real-time Simulation: This simulates a "production" environment where each new weekly point is tested against the entirety of its known history.
+Zero-Variance Handling: If MAD is 0 (all historical values are identical), the bounds collapse to the median value to avoid division errors.
+
+## 📤 Key Output Columns
+
+## 💡 Usage Context
+
+The MAD model is the "gold standard" for univariate outlier detection in robust statistics. It is highly recommended for:
+- Data with large, extreme spikes that would skew a Mean-based (SD) model.
+- Datasets that are not normally distributed.
+- Scenarios where you need a conservative, reliable boundary that isn't easily shifted by a single bad data point."""
+
     n = len(group)
     if n < 10:
         return pd.DataFrame(columns=group.columns)
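The MAD docstring above describes bounds of the form Median ± (Threshold x Scale) recomputed over an expanding window. The sketch below is one possible reading of that description, not the code of `detect_outliers_mad`, whose body is not shown in this diff. Two points are assumptions: "Scale" is taken to be `MAD / mad_scale_factor` (the usual normal-consistency scaling), and each evaluated point is compared against bounds built only from the values before it. `mad_bounds` and `expanding_flags` are hypothetical names.

```python
from statistics import median
from typing import List, Tuple


def mad_bounds(history: List[float], mad_threshold: float,
               mad_scale_factor: float = 0.6745) -> Tuple[float, float]:
    """Return (MAD_low, MAD_high) for a block of historical values."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        # Zero-variance history: both bounds collapse to the median.
        return med, med
    scale = mad / mad_scale_factor  # assumption: normal-consistent scaled MAD
    return max(med - mad_threshold * scale, 0), med + mad_threshold * scale


def expanding_flags(values: List[float], eval_period: int,
                    mad_threshold: float) -> List[bool]:
    """Flag each of the last `eval_period` points against its own history."""
    flags = []
    start = len(values) - eval_period
    for i in range(start, len(values)):
        low, high = mad_bounds(values[:i], mad_threshold)  # history before point i
        flags.append(values[i] < low or values[i] > high)
    return flags


history = [10, 12, 11, 13, 12, 11, 10, 12, 11, 12]
low, high = mad_bounds(history, mad_threshold=3.0)
flags = expanding_flags(history + [12, 40], eval_period=2, mad_threshold=3.0)
```

In this toy series the first evaluated value (12) falls inside the bounds and the spike (40) falls outside them.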