anomaly-pipeline 0.1.27__py3-none-any.whl → 0.1.61__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,2 +1,74 @@
+ """----------------------------------------------------------------------------
+ help_anomaly
+ ----------------------------------------------------------------------------
+
+ For a nice overview of the anomaly pipeline, run the following lines of code:
+
+ >> from anomaly_pipeline.helpers.help_anomaly import help_anomaly
+ >> help_anomaly()
+
+ You can see information about specific models used in the anomaly pipeline with any of the following commands:
+
+ >> help_anomaly('percentile')
+ >> help_anomaly('iqr')
+ >> help_anomaly('mad')
+ >> help_anomaly('std')
+ >> help_anomaly('ewma')
+ >> help_anomaly('prophet')
+ >> help_anomaly('dbscan')
+ >> help_anomaly('iso') # For information on isolation forest
+
+
+ ----------------------------------------------------------------------------
+ Functional Overview
+ ----------------------------------------------------------------------------
+ The pipeline takes raw master data, partitions it into groups by unique ID, applies a suite of 8 different anomaly detection methods, and then flags an observation as an anomaly when at least half of the models consider it anomalous.
+
+ The master data DataFrame that you pass into the anomaly detection pipeline needs to have at least 3 columns: a unique ID, a date, and a target variable. The unique ID can be defined by multiple columns.
+
+ Core Execution Stages
+ 1. Preprocessing & Interpolation
+ Before modeling, the function interpolates target variable values for missing dates,
+ filling gaps in the variable column to prevent model crashes.
+
+ 2. Statistical Baseline Models (Local Execution)
+ The pipeline first runs four computationally light models sequentially on each group:
+ - Percentile & IQR: Non-parametric bounds detection.
+ - SD (Standard Deviation) & MAD (Median Absolute Deviation): Variance-based detection.
+
+ 3. Parallel Machine Learning Suite (process_group)
+ To maximize performance, the pipeline uses joblib.Parallel to run intensive models across all available CPU cores. The process_group utility acts as a router, sending data to the correct engine based on the model key:
+ - FB (Prophet): Walk-forward time-series forecasting.
+ - EWMA: Exponentially weighted moving averages.
+ - ISF (Isolation Forest): Unsupervised isolation of anomalies.
+ - DBSCAN: Density-based spatial clustering.
+
+ 4. Majority Voting (Ensemble Logic)
+ The power of this pipeline lies in its Consensus Model. After all models finish, the pipeline calculates:
+ - Anomaly_Votes: The sum of flags across all 8 methods.
+ - is_Anomaly: A final boolean set to True only if at least 4 models agree that the point is an outlier.
+
+ Key Output Columns
+ refresh_date: The timestamp of when the pipeline was executed.
+ Anomaly_Votes: Total count of models that flagged the row.
+ is_Anomaly: The final "Gold Standard" anomaly flag.
+ Individual Model Flags: Columns like is_FB_anomaly, is_IQR_anomaly, etc., for granular auditing.
+
+ Usage Context
+ Use run_pipeline when you need a highly reliable, automated output. By combining statistical, forecasting, and clustering models, the pipeline reduces "false positives" often generated by single-model approaches.
+ """
+
  from .main import timeseries_anomaly_detection
- from .helpers import help_info
+
+ from .helpers import (help_anomaly,
+ get_example_df,
+ evaluation_info,
+ anomaly_overview_plot,
+ anomaly_percentile_plot,
+ anomaly_sd_plot,
+ anomaly_mad_plot,
+ anomaly_iqr_plot,
+ anomaly_ewma_plot,
+ anomaly_fb_plot,
+ anomaly_dbscan_plot,
+ anomaly_isolation_forest_plot)
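With the expanded public API above, an interactive session might look like the sketch below. Only the imported names are confirmed by this diff; the zero-argument call to get_example_df is an assumption, and the parameters of timeseries_anomaly_detection are not shown here, so those calls are left as comments.

```python
from anomaly_pipeline import help_anomaly, get_example_df

help_anomaly()            # overview of the whole pipeline
help_anomaly('dbscan')    # details for a specific model

df = get_example_df()     # demo data (zero-argument call is an assumption)
# results = timeseries_anomaly_detection(...)   # signature not shown in this diff
# anomaly_overview_plot(results)                # plot helpers imported above
```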
@@ -33,6 +33,7 @@ def get_dynamic_lags(series: pd.Series) -> list:

  return dynamic_lags

+
  def find_optimal_epsilon(X_scaled: np.ndarray, k: int) -> float:
  """
  Finds the optimal epsilon by calculating the distance to the k-th nearest neighbor
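This hunk shows only the tail of get_dynamic_lags and the signature of find_optimal_epsilon. Based on the description in the DBSCAN docstring added further down (ACF significance at a 75% confidence interval, lags 1-3 always kept, at most 10 lags), a hedged sketch of such a lag-selection routine could look like the following; it is an illustration, not the package's actual implementation, and the nlags cap is an assumption.

```python
import pandas as pd
from statsmodels.tsa.stattools import acf

def dynamic_lags_sketch(series: pd.Series, max_lags: int = 10) -> list:
    """Keep lags whose ACF confidence interval (alpha=0.25) excludes zero; always include 1-3."""
    clean = series.dropna()
    nlags = min(len(clean) // 2, 52)  # illustrative cap, not taken from the diff
    acf_vals, confint = acf(clean, nlags=nlags, alpha=0.25)
    significant = [
        lag for lag in range(1, nlags + 1)
        if confint[lag, 0] > 0 or confint[lag, 1] < 0  # interval does not contain zero
    ]
    # Rank significant lags by absolute autocorrelation and keep the strongest ones.
    strongest = sorted(significant, key=lambda lag: abs(acf_vals[lag]), reverse=True)[:max_lags]
    return sorted(set([1, 2, 3] + strongest))
```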
@@ -71,16 +72,133 @@ def detect_time_series_anomalies_dbscan(
  eval_period,
  ):

+ """# 🌀 DBSCAN Walk-Forward Anomaly Detection
+ ---
+
+ The `detect_time_series_anomalies_dbscan` function implements a **density-based clustering** approach for time-series anomaly detection. It utilizes an **iterative walk-forward validation** strategy to identify data points that exist in "low-density" regions of the feature space.
+
+ ## 📋 Functional Overview
+ This function transforms a univariate time series into a high-dimensional feature space using **dynamic lags** and **rolling statistics**. It then applies the **DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) algorithm to distinguish between dense clusters of "normal" behavior and sparse "noise" points (anomalies).
+
+ ## 🧠 Core Logic & Helper Utilities
+
+ ### 1. Dynamic Feature Engineering (`get_dynamic_lags`)
+ Instead of using fixed lags, the function uses the **Autocorrelation Function (ACF)** to find the 10 most significant seasonal patterns in the data.
+ * **Baseline:** Always includes lags 1, 2, and 3 to capture immediate momentum.
+ * **Significance:** Uses a 75% confidence interval ($\alpha=0.25$) to identify meaningful historical dependencies.
+
+ ### 2. Automated Parameter Tuning (`find_optimal_epsilon`)
+ DBSCAN is highly sensitive to the **Epsilon ($\epsilon$)** parameter (the neighborhood radius).
+ * **Proxy Elbow Method:** The function automatically calculates $\epsilon$ by analyzing the distance to the $k$-th nearest neighbor for all training points.
+ * **Density Threshold:** It sets $\epsilon$ at the **95th percentile** of these distances, ensuring that 95% of training data is considered "dense" while the most isolated 5% are candidates for noise.
+
+ ### 3. Walk-Forward Iteration
+ For the initial training period, all points are evaluated using DBSCAN fitted on the same training data.
+ For each period in the `eval_period`:
+ * **Feature Construction:** Builds a matrix containing the variable, its dynamic lags, rolling means, rolling standard deviations, and a linear trend component.
+ * **Scaling:** Fits a `StandardScaler` **only on training data** to prevent data leakage.
+ * **Novelty Detection:** Since DBSCAN cannot "predict" on new points, the function uses a **Nearest Neighbors proxy**. If the distance from a new test point to its $k$-th neighbor in the training set is greater than the trained $\epsilon$, it is flagged as an anomaly.
+
+ ## 📤 Key Output Columns
+ * **`dbscan_score`**: The distance from the point to the $\epsilon$ boundary (positive values indicate anomalies).
+ * **`is_DBSCAN_anomaly`**: A boolean flag identifying outliers.
+ * **Generated Features**: Includes all dynamic lags (`lagX`) and rolling statistics (`roll_mean_W`) used during the fit.
+
+ ## 💡 Usage Context
+ DBSCAN is exceptionally powerful for detecting **contextual anomalies**—points that might look "normal" in value but are "weird" given their recent history or seasonal context. Because it is density-based, it can find anomalies in non-linear or multi-modal distributions where simple percentile or Z-score methods would fail.
+
+ ---
+ ### ⚠️ Performance Note
+ This model is computationally more intensive than statistical methods due to the iterative re-fitting of the `NearestNeighbors` and `DBSCAN` models. It is best suited for high-priority metrics where accuracy is more critical than processing speed."""
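The two helper behaviours described above (the 95th-percentile k-distance heuristic and the nearest-neighbors proxy used to score unseen points) can be summarised in a short scikit-learn sketch. It mirrors the docstring's description rather than the exact code below, so treat it as illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_optimal_epsilon_sketch(X_train_scaled: np.ndarray, k: int) -> float:
    """Proxy elbow method: epsilon = 95th percentile of the k-th nearest-neighbor distances."""
    neigh = NearestNeighbors(n_neighbors=k).fit(X_train_scaled)
    distances, _ = neigh.kneighbors(X_train_scaled)   # column 0 is the point itself (distance 0)
    return float(np.percentile(distances[:, k - 1], 95))

def flag_new_points(X_train_scaled: np.ndarray, X_test_scaled: np.ndarray, k: int) -> np.ndarray:
    """DBSCAN cannot predict on unseen rows, so use k-distance to the training set as a proxy."""
    eps = find_optimal_epsilon_sketch(X_train_scaled, k)
    neigh = NearestNeighbors(n_neighbors=k).fit(X_train_scaled)
    test_dist, _ = neigh.kneighbors(X_test_scaled)    # distances from each test point to training points
    return test_dist[:, k - 1] > eps                  # True where the point lies in a low-density region
```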
+
  group[date_column] = pd.to_datetime(group[date_column])
  group = group.copy().sort_values(date_column).reset_index(drop=True)
+ group['set'] = np.where(np.arange(len(group)) >= len(group) - eval_period, 'TEST', 'TRAIN')

  # --- Default DBSCAN Parameters ---
  # These parameters often need tuning, but these are reasonable starting points:
  DEFAULT_EPS = 0.5 # Neighborhood radius (critical parameter)

  try:
- test_anom = []
+ all_results = []
+
+ # ===================================================================
+ # STEP 1: Evaluate all points in the initial TRAIN period
+ # ===================================================================
+
+ # Get the cutoff date for initial train period
+ initial_cutoff_date = group[group['set'] == 'TRAIN'][date_column].max()
+
+ # Prepare the full group with features
+ model_group_initial = group.copy()
+
+ # Get train set to determine lags
+ train_initial = model_group_initial[model_group_initial['set'] == 'TRAIN'].copy()
+ lags = get_dynamic_lags(train_initial[variable])
+
+ # Create lag features and rolling stats for the entire DF
+ rolling_stats_features = []
+ for lag in lags:
+ model_group_initial[f'lag{lag}'] = model_group_initial[variable].shift(lag)
+
+ for w in [int(np.ceil(max(lags)/4)), int(np.ceil(max(lags)/2)), int(max(lags))]:
+ if w >= 3:
+ rolling_stats_features.extend([f'roll_mean_{w}', f'roll_std_{w}'])
+ model_group_initial[f'roll_mean_{w}'] = model_group_initial[variable].shift(1).rolling(w).mean()
+ model_group_initial[f'roll_std_{w}'] = model_group_initial[variable].shift(1).rolling(w).std()
+
+ model_group_initial['trend'] = model_group_initial.index
+ model_group_initial = model_group_initial.copy().dropna()

+ # Get just the initial train set
+ train_initial = model_group_initial[model_group_initial['set'] == 'TRAIN'].copy()
+
+ # Identify all model features (lags, rolling stats, trend, and the variable itself)
+ features = [f'lag{i}' for i in lags] + rolling_stats_features + ['trend'] + [variable]
+
+ # Fit the scaler on the training data
+ scaler = StandardScaler()
+ scaler.fit(train_initial[features])
+ train_scaled = scaler.transform(train_initial[features])
+
+ # Determine min_samples based on feature space dimension
+ min_samples = max(2 * len(features), 3)
+
+ # Find optimal epsilon
+ calculated_eps = find_optimal_epsilon(train_scaled, k=min_samples)
+
+ # --- DBSCAN MODEL on initial training data ---
+ dbscan_model = DBSCAN(
+ eps=calculated_eps,
+ min_samples=min_samples,
+ n_jobs=-1
+ )
+
+ # Fit DBSCAN on the scaled training data
+ cluster_labels = dbscan_model.fit_predict(train_scaled)
+
+ # For training points, use DBSCAN labels (-1 = noise/anomaly)
+ train_initial['is_DBSCAN_anomaly'] = (cluster_labels == -1)
+
+ # Calculate scores for training points
+ # Use distance to k-th nearest neighbor as score
+ neigh = NearestNeighbors(n_neighbors=min_samples)
+ neigh.fit(train_scaled)
+ distances, indices = neigh.kneighbors(train_scaled)
+ k_distance = distances[:, min_samples - 1]
+
+ train_initial['dbscan_score'] = k_distance - calculated_eps
+ train_initial['dbscan_score_high'] = 0
+
+ # Select relevant columns
+ train_initial_result = train_initial[[variable, date_column, 'dbscan_score',
+ 'dbscan_score_high', 'is_DBSCAN_anomaly']]
+ all_results.append(train_initial_result)
+
+ # ===================================================================
+ # STEP 2: Walk-forward evaluation for TEST period (one-step-ahead)
+ # ===================================================================
+
  for t in list(range(eval_period - 1, -1, -1)):

  try:
@@ -155,27 +273,42 @@ def detect_time_series_anomalies_dbscan(

  # Flag as anomaly if the k-distance is greater than the trained eps threshold
  test['dbscan_score'] = k_distance - calculated_eps
- test['is_DBSCAN_anomaly'] = np.where(k_distance > calculated_eps, True, False)
+ test['dbscan_score_high'] = 0
+ test['is_DBSCAN_anomaly'] = np.where(test['dbscan_score'] > 0, True, False)

- test = test[[variable, date_column, 'dbscan_score', 'is_DBSCAN_anomaly']]
- test_anom.append(test)
+ test = test[[variable, date_column, 'dbscan_score', 'dbscan_score_high', 'is_DBSCAN_anomaly']]
+ all_results.append(test)

  except Exception as e:
  print(f"Error in iteration {t}: {e}")
  pass

+ # ===================================================================
+ # STEP 3: Combine all results and merge back to original group
+ # ===================================================================
+
  try:
- test_anom = pd.concat(test_anom)
- group = group.merge(test_anom[[variable, date_column, 'dbscan_score', 'is_DBSCAN_anomaly']], on=[variable, date_column], how='left')
+ all_results_df = pd.concat(all_results, ignore_index=True)
+
+ # Merge back to original group
+ group = group.merge(
+ all_results_df[[variable, date_column, 'dbscan_score',
+ 'dbscan_score_high', 'is_DBSCAN_anomaly']],
+ on=[variable, date_column],
+ how='left'
+ )
+
+ # Fill any remaining NaNs with False for boolean column
  # group["is_DBSCAN_anomaly"] = group["is_DBSCAN_anomaly"].fillna(False)
- except:
- print("Error in DBSCAN process")
+
+ except Exception as e:
+ print(f"Error in concatenating results: {e}")
  group['dbscan_score'] = np.nan
+ group['dbscan_score_high'] = np.nan
  group["is_DBSCAN_anomaly"] = np.nan

  except Exception as e:
  # Fallback error handling
- # Replace key_series with group for robustness if key_series is not defined
  try:
  group_id_cols = group.select_dtypes(include=['object', 'string']).columns.tolist()
  group_id = " ".join(group[group_id_cols].reset_index(drop=True).iloc[0].astype(str).to_list())
@@ -183,6 +316,7 @@ def detect_time_series_anomalies_dbscan(
  group_id = "Unknown Group ID"
  print(f'DBSCAN Anomaly Detection failed for {group_id}. Error: {e}')
  group['dbscan_score'] = np.nan
+ group['dbscan_score_high'] = np.nan
  group["is_DBSCAN_anomaly"] = np.nan

- return group
+ return group
@@ -5,6 +5,51 @@ from .Preprocessing import classify


  def detect_outliers_mad(group, variable, date_column, mad_threshold, mad_scale_factor, eval_period):
+
+ """
+ # 🛡️ MAD Anomaly Detection Model
+ ---
+
+ Median Absolute Deviation with Expanding Window
+
+ The detect_outliers_mad function is a non-parametric outlier detection tool.
+ Unlike methods based on the standard deviation, this model uses the Median and MAD,
+ making it significantly more robust against data that contains extreme outliers or non-normal distributions.
+
+ ## 📋 Functional Overview
+
+ The function identifies anomalies by calculating how far a data point deviates from the median.
+ It utilizes an expanding window approach to ensure that as the dataset grows,
+ the definition of "normal" behavior adapts dynamically to the historical context.
+
+ ## 🧠 Core Logic Stages
+
+ 1. Preprocessing & Validation
+ Sample Size Check: Requires a minimum of 10 data points. If the group is too small, it returns an empty DataFrame to avoid biased statistical results.
+ Deep Copy: Operates on a group.copy() to ensure the original input data remains untouched.
+
+ 2. Initial Training Block
+ Baseline Calculation: For the first part of the series (pre-evaluation period), it establishes a static baseline.
+ The MAD Formula: It calculates the Median Absolute Deviation: MAD = median(|x_i - median(x)|).
+ Thresholding: It uses a mad_scale_factor (default 0.6745) to make the MAD comparable to a standard deviation for a normal distribution.
+ Bounds:
+ MAD_high: Median + Threshold x (MAD / Scale)
+ MAD_low: max(Median - Threshold x (MAD / Scale), 0)
+
+ 3. Expanding Window Evaluation
+ Incremental Testing: For each point in the evaluation period, the function recalculates the Median and MAD using all data available up to that point.
+ Real-time Simulation: This simulates a "production" environment where each new weekly point is tested against the entirety of its known history.
+ Zero-Variance Handling: If MAD is 0 (all historical values are identical), the bounds collapse to the median value to avoid division errors.
+
+ ## 📤 Key Output Columns
+
+ ## 💡 Usage Context
+
+ The MAD model is the "gold standard" for univariate outlier detection in robust statistics. It is highly recommended for:
+ - Data with large, extreme spikes that would skew a Mean-based (SD) model.
+ - Datasets that are not normally distributed.
+ - Scenarios where you need a conservative, reliable boundary that isn't easily shifted by a single bad data point."""
+
  n = len(group)
  if n < 10:
  return pd.DataFrame(columns=group.columns)
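To make the expanding-window MAD bounds above concrete, here is a minimal pandas sketch for scoring one new point against its history. The bound arithmetic assumes the conventional MAD-to-sigma conversion implied by the 0.6745 scale factor, and the default threshold of 3.0 is illustrative only; the package's exact formula may differ.

```python
import pandas as pd

def mad_bounds_sketch(history: pd.Series, mad_threshold: float = 3.0,
                      mad_scale_factor: float = 0.6745):
    """Expanding-window MAD bounds for the next observation, given all history seen so far."""
    median = history.median()
    mad = (history - median).abs().median()      # MAD = median(|x_i - median(x)|)
    if mad == 0:                                 # zero-variance history: bounds collapse to the median
        return median, median
    sigma_est = mad / mad_scale_factor           # 0.6745 makes MAD comparable to a standard deviation
    mad_high = median + mad_threshold * sigma_est
    mad_low = max(median - mad_threshold * sigma_est, 0)
    return mad_low, mad_high

# A new observation is flagged as an anomaly when it falls outside [mad_low, mad_high].
```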