anomaly-pipeline 0.1.27__py3-none-any.whl → 0.1.61__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,275 @@
+ Metadata-Version: 2.4
+ Name: anomaly_pipeline
+ Version: 0.1.61
+ Summary: Ensemble framework for detecting outliers in grouped time-series data
+ Classifier: Programming Language :: Python :: 3
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ Requires-Dist: pandas
+ Requires-Dist: numpy<2
+ Requires-Dist: joblib
+ Requires-Dist: prophet
+ Requires-Dist: scikit-learn
+ Requires-Dist: google-cloud-bigquery
+ Requires-Dist: google-cloud-storage
+ Requires-Dist: statsmodels
+ Requires-Dist: plotly
+ Requires-Dist: pandas-gbq
+ Requires-Dist: gcsfs
+ Dynamic: classifier
+ Dynamic: description
+ Dynamic: description-content-type
+ Dynamic: requires-dist
+ Dynamic: requires-python
+ Dynamic: summary
+
+ # anomaly-pipeline
+
+ anomaly-pipeline is an ensemble framework for detecting outliers in grouped time-series data. It automates the entire workflow, from data cleaning and calendar interpolation through running eight different detection algorithms to generating visual diagnostic reports.
+
+ ## Key Capabilities
+
+ - Ensemble Scoring: Combines 8 models (statistical + ML) to produce a robust Anomaly_Score and a final is_Anomaly consensus.
+ - Hierarchical Processing: Natively handles grouped data (e.g., detecting anomalies per Region, Product, or Channel).
+ - Automated Preprocessing: Fills missing dates via linear interpolation and automatically filters out low-quality groups.
+ - Parallel Execution: Leverages joblib for multi-core processing of large datasets.
+ - Visual Analytics: Generates pie charts, stacked bar plots, and detailed group-level time-series breakdowns.
+
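The calendar-interpolation step described above can be sketched with plain pandas (a minimal illustration only; the package's actual preprocessing in `Preprocessing.py` may differ in details):

```python
import pandas as pd

# Toy weekly series (W-MON) with one missing week (2024-01-15)
df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-22"]),
     "sales": [100.0, 110.0, 130.0]}
)

# Reindex onto a complete weekly calendar, then fill gaps linearly
full = df.set_index("timestamp").asfreq("W-MON")
full["sales"] = full["sales"].interpolate(method="linear")
print(full)  # the missing week is filled with 120.0
```

The same idea generalizes to any pandas frequency string passed as `freq` (e.g. `"D"` or `"M"`).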
+ ## Included Models
+
+ The pipeline utilizes an ensemble of the following methodologies:
+
+ - Statistical: Percentile (5th/95th), Standard Deviation (SD), Median Absolute Deviation (MAD), and Interquartile Range (IQR).
+
+ - Time-Series Specific: EWMA (Exponentially Weighted Moving Average) and FB Prophet (walk-forward validation).
+
+ - Machine Learning: Isolation Forest (general & time-series optimized) and DBSCAN.
+
+ ## Detailed Functionality
+
+ - Robust Input Validation: Clear error messaging for missing parameters or incorrect data types.
+
+ - Quality Control: Automatically generates a Success Report and an Exclusion Report (identifying groups dropped due to low history or heavy interpolation).
+
+ - Visual Suite: Automated rendering of Pie Charts (Summary), Stacked Bars (Distribution), and Top-5 Anomaly Heatmaps.
+
+ ## 🚀 Quick Start
+
+ ```bash
+ pip install anomaly-pipeline
+ ```
+
+ ```python
+ import pandas as pd
+ from anomaly_pipeline import timeseries_anomaly_detection
+
+ # Load your data
+ df = pd.read_csv("your_data.csv")
+
+ # Run the pipeline
+ anomaly_df, success_report, exclusion_report = timeseries_anomaly_detection(
+     master_data=df,
+     group_columns=['category', 'region'],
+     variable='sales',
+     date_column='timestamp',
+     freq='W-MON',
+     eval_period=1  # Evaluate the most recent record
+ )
+ ```
+ ## 📊 Visualizing Results & Deep Dives
+
+ If a specific group shows a high anomaly rate, use the evaluation_info function to render detailed diagnostic plots for that group.
+
+ ```python
+ from anomaly_pipeline import evaluation_info
+
+ # Define the group values to inspect (must match the order in group_columns)
+ group_columns = ['category', 'region']
+ group_values = ['appliances', 'TX']
+ variable = 'sales'
+ date_column = 'timestamp'
+
+ # Filter the results for this group
+ mask = anomaly_df[group_columns].eq(group_values).all(axis=1)
+ group_df = anomaly_df[mask]
+
+ # Generate detailed diagnostic plots
+ evaluation_info(group_df,
+                 group_columns,
+                 variable,
+                 date_column,
+                 eval_period=1
+ )
+ ```
+
+ The Evaluation Dashboard provides:
+
+ - Model Breakdown: Individual charts for FB Prophet, EWMA, and Isolation Forest with confidence intervals.
+
+ - Ensemble View: A summary highlighting where multiple models overlap.
+
+ - Statistical Thresholds: Visual markers for IQR, MAD, percentile, and SD limits.
+
+ ## Input Parameters
+
+ ### Mandatory
+
+ master_data: Input DataFrame containing variables, dates, and group identifiers.
+
+ group_columns: A list of column names defining the granularity of the time series. Ex: for store-level sales data, to find anomalous sales values per store, set group_columns = ["store"].
+
+ variable: The column name containing the time-series value being analyzed. Ex: for sales it is "sales"; for ad requests it is "ad_requests".
+
+ date_column: The column name containing the timestamp.
+
+ ### Optional (with defaults)
+
+ freq: Pandas frequency string for calendar interpolation. Default: "W-MON" (weekly, starting Monday). For monthly data use "M"; for daily data use "D".
+
+ min_records: Minimum history required per group. Default is None; if None, it is derived from freq (one year of history plus eval_period). Ex: if freq is weekly and eval_period is 1, min_records = 52 + 1 = 53.
+
+ max_records: Maximum history to retain per group. Default is None; if a value N is provided, only the most recent N records are kept.
+
+ contamination (float): Expected proportion of outliers in the data (0 to 0.5). Defaults to 0.03.
+
+ random_state (int): Seed for reproducibility in stochastic models. Defaults to 42.
+
+ alpha (float): Smoothing factor for trend calculations. Defaults to 0.3.
+
+ sigma (float): Standard-deviation multiplier for thresholding. Defaults to 1.5.
+
+ eval_period: The number of trailing records in each group to evaluate for anomalies. Defaults to 1.
+
+ prophet_CI (float): The confidence level for the Prophet prediction interval (0 to 1). Defaults to 0.9.
+
+ mad_scale_factor (float): A constant used to make the MAD comparable to the standard deviation. Default is 0.6745.
+
+ mad_threshold (float): The sensitivity dial: how many adjusted MADs a data point must be away from the median to be flagged as an anomaly. Default is 2.
+
+ ## Output Columns
+
+ All output values are computed at the group_columns level.
+
+ MIN_value
+ The minimum historical "variable" value. Fixed for train data; varies for test data (computed over history up to t−1).
+ ________________________________________
+ MAX_value
+ The maximum historical "variable" value. Fixed for train data; varies for test data (computed over history up to t−1).
+ ________________________________________
+ Percentile_low / Percentile_high
+ The 5th and 95th percentile "variable" values.
+ Used to detect unusually low or unusually high "variable" values. Fixed for train data; varies for test data (computed over history up to t−1).
+ ________________________________________
+ Percentile_anomaly
+ Flags based on percentile limits:
+ • Low → value < Percentile_low
+ • High → value > Percentile_high
+ • None → within the range
+ ________________________________________
+
+ Mean / SD (Standard Deviation)
+ The average "variable" and its standard deviation based on historical data. Fixed for train data; varies for test data (computed over history up to t−1).
+ ________________________________________
+ SD2_low / SD2_high
+ Two-standard-deviation control limits:
+ • SD2_low = mean − 2×SD (floored at 0)
+ • SD2_high = mean + 2×SD
+ ________________________________________
+ SD_anomaly
+ Flags based on SD2 limits:
+ • Low → value < SD2_low
+ • High → value > SD2_high
+ • None → within the range
+ ________________________________________
+ Median / MAD (Median Absolute Deviation)
+ Median of "variable" and the median of absolute deviations from the median. Fixed for train data; varies for test data (computed over history up to t−1).
+ Used for robust anomaly detection when data contains outliers.
+ ________________________________________
+ MAD_low / MAD_high
+ MAD-based limits:
+ • MAD_low = median − 2 × MAD / 0.6745 (floored at 0)
+ • MAD_high = median + 2 × MAD / 0.6745
+ ________________________________________
+ MAD_anomaly
+ Flags based on MAD limits:
+ • Low → value < MAD_low
+ • High → value > MAD_high
+ • None → within the range
+ ________________________________________
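A worked numeric sketch of the MAD limits above, using the default mad_threshold = 2 and mad_scale_factor = 0.6745 (illustrative only; the sample data is made up):

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 40.0])

median = np.median(x)                        # 12.0
mad = np.median(np.abs(x - median))          # 1.0
adjusted_mad = mad / 0.6745                  # ≈ 1.4826, comparable to an SD

mad_low = max(median - 2 * adjusted_mad, 0)  # floored at 0
mad_high = median + 2 * adjusted_mad

print(mad_low, mad_high)  # 40.0 falls above mad_high → flagged High
```

Note how the single extreme value (40.0) barely moves the median or the MAD, which is exactly why the MAD limits stay useful when the history itself contains outliers.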
+ Q1 / Q3 / IQR (Interquartile Range)
+ • Q1: 25th percentile
+ • Q3: 75th percentile
+ • IQR = Q3 − Q1
+ Used to detect unusually low or high "variable" values.
+ ________________________________________
+ IQR_low / IQR_high
+ IQR-based limits:
+ • IQR_low = Q1 − 1.5 × IQR (floored at 0)
+ • IQR_high = Q3 + 1.5 × IQR
+ ________________________________________
+ IQR_anomaly
+ Flags based on IQR limits:
+ • Low → value < IQR_low
+ • High → value > IQR_high
+ • None → within the range
+ ________________________________________
+ is_Percentile_anomaly / is_SD_anomaly / is_MAD_anomaly / is_IQR_anomaly
+ Boolean indicators stating whether each method classified the value as an anomaly (low or high).
+ ________________________________________
+ Alpha
+ Smoothing factor used in EWMA. Higher values give more weight to recent observations.
+ ________________________________________
+ EWMA_forecast
+ Expected value estimated using the EWMA model.
+ ________________________________________
+ EWMA_STD
+ Rolling standard deviation of residuals around the EWMA forecast.
+ ________________________________________
+ EWMA_high
+ Upper anomaly threshold (EWMA_forecast + sigma × EWMA_STD).
+ ________________________________________
+ EWMA_low
+ Lower anomaly threshold (EWMA_forecast − sigma × EWMA_STD).
+ ________________________________________
+ Is_EWMA_anomaly
+ Boolean flag indicating whether the observed value falls outside the EWMA bounds.
+ ________________________________________
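The EWMA columns above can be sketched with pandas `ewm` (a minimal illustration; the one-step-ahead shift and the rolling window length are assumptions, not the package's exact implementation in `ewma.py`):

```python
import pandas as pd

alpha, sigma = 0.3, 1.5  # package defaults
s = pd.Series([100.0, 101.0, 100.0, 101.0, 100.0, 150.0])

# One-step-ahead EWMA forecast: smooth, then shift by one period
ewma_forecast = s.ewm(alpha=alpha, adjust=False).mean().shift(1)

# Rolling standard deviation of residuals around the forecast
resid = s - ewma_forecast
ewma_std = resid.rolling(window=3, min_periods=2).std()

ewma_high = ewma_forecast + sigma * ewma_std
ewma_low = ewma_forecast - sigma * ewma_std

is_ewma_anomaly = (s > ewma_high) | (s < ewma_low)
print(is_ewma_anomaly.iloc[-1])  # True: the jump to 150 breaches the upper bound
```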
+ FB_forecast
+ Expected value estimated using the Prophet model.
+ ________________________________________
+ FB_low
+ Lower confidence bound of the Prophet forecast.
+ ________________________________________
+ FB_high
+ Upper confidence bound of the Prophet forecast.
+ ________________________________________
+ FB_residual
+ Difference between the observed value and the Prophet forecast.
+ ________________________________________
+ FB_anomaly
+ Raw anomaly indicator based on Prophet confidence bounds.
+ ________________________________________
+ Is_FB_anomaly
+ Boolean flag indicating a Prophet-detected anomaly.
+ ________________________________________
+ isolation_forest_score
+ Score from the Isolation Forest model indicating anomaly severity. Typical range: −0.5 to +0.5.
+ • Higher scores = more normal
+ • Lower scores = more anomalous
+ ________________________________________
+ is_IsoForest_anomaly
+ Boolean flag based on Isolation Forest model output:
+ • True → model predicts anomaly (prediction = −1)
+ • False → model predicts normal (prediction = 1)
+ ________________________________________
+ dbscan_score
+ Cluster label or distance score produced by DBSCAN (−1 indicates noise/anomaly).
+ ________________________________________
+ is_DBSCAN_anomaly
+ Boolean flag indicating a DBSCAN-detected anomaly.
+ ________________________________________
+ Anomaly_Votes
+ Count of anomaly-detection methods that agree a point is anomalous. Ranges from 0 to 8.
+ ________________________________________
+ is_Anomaly
+ Final ensemble decision:
+ • True → value flagged anomalous by 4 or more methods
+ • False → fewer than 4 methods indicate anomaly
+
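The consensus logic behind Anomaly_Votes and is_Anomaly can be illustrated with a minimal sketch (the boolean column names match the output table above; the sample votes are made up):

```python
import pandas as pd

# Hypothetical per-row boolean votes from the 8 detectors
votes = pd.DataFrame({
    "is_Percentile_anomaly": [True, False],
    "is_SD_anomaly":         [True, False],
    "is_MAD_anomaly":        [True, False],
    "is_IQR_anomaly":        [True, True],
    "Is_EWMA_anomaly":       [False, True],
    "Is_FB_anomaly":         [True, False],
    "is_IsoForest_anomaly":  [False, False],
    "is_DBSCAN_anomaly":     [False, True],
})

votes["Anomaly_Votes"] = votes.sum(axis=1)         # 0..8 methods agreeing
votes["is_Anomaly"] = votes["Anomaly_Votes"] >= 4  # consensus threshold
print(votes[["Anomaly_Votes", "is_Anomaly"]])      # 5 votes → True, 3 votes → False
```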
@@ -0,0 +1,24 @@
+ anomaly_pipeline/__init__.py,sha256=tuToyXjdvPfTX_xDghxGZ78ZdcHmG16JWUMgS-Sv5jU,3568
+ anomaly_pipeline/main.py,sha256=0dVdNB5yu11rAxQtJtF1XLF7KzcC6ez5NOo3ZGfklqg,7973
+ anomaly_pipeline/pipeline.py,sha256=94zsCAVKvjeScV2dbYucezAZ22jUZIFyq0ztVaWEDUo,15131
+ anomaly_pipeline/helpers/DB_scan.py,sha256=aPel8hjoG5Am_T0OuFTlOqfxTgaa2aY-7tztGhX54eI,15884
+ anomaly_pipeline/helpers/IQR.py,sha256=VlYU6Yf-4KQmVroLvzwd220jn5BUNJEchsVE4_KxKm4,2824
+ anomaly_pipeline/helpers/MAD.py,sha256=qSzdQMGkd-ynFSqoOydg76YWHdSqWM2e7mSi9QSawcY,5821
+ anomaly_pipeline/helpers/Preprocessing.py,sha256=WiW7WjeKXxipyuA1vW4kpZbGQn7h68Rfm3aWHCHW6Hs,14165
+ anomaly_pipeline/helpers/STD.py,sha256=uVG3lU1k65TbKpNQWHS_rjsjfP9QFVeS23_GDeksagY,4448
+ anomaly_pipeline/helpers/__init__.py,sha256=LGVuwJR2Bx-xh5pdatp7Riiv9NpetsqBkABX9m9xyUc,364
+ anomaly_pipeline/helpers/baseline.py,sha256=h9t_LWcAw17P9qmoRQMceukGzOOr-gFLuHfVbipQB7M,3824
+ anomaly_pipeline/helpers/cluster_functions.py,sha256=Nhk2YdKVynrKywEILg_5B2xD4zrCZ_ICWw3oOdTDHuA,13040
+ anomaly_pipeline/helpers/evaluation_info.py,sha256=yzhvpQiMCP0f1Njrn0he6KKlqRQMDnNEVo5U_2H31jU,6531
+ anomaly_pipeline/helpers/evaluation_plots.py,sha256=DcQqA2DNEjhDpUTj_-lpmw1rMYIcnulNU3ASmewE1cA,46110
+ anomaly_pipeline/helpers/ewma.py,sha256=TRrshZWS06EA7H5vutA-GbO2BG9c0MFxWvcPC86uzE8,8586
+ anomaly_pipeline/helpers/fb_prophet.py,sha256=Z12LsLl1lP6-urP422awEUCuHvHz6tZmRsOeEqLSbGY,10387
+ anomaly_pipeline/helpers/help_anomaly.py,sha256=fIPVgrvfgUZ49AAc6e_b7InPOzVhWsF4lsmfl3lxtds,41173
+ anomaly_pipeline/helpers/iso_forest_general.py,sha256=UAEt41lGGt5MhqYyOB7_8e7kRGT5HijdJX5WA9SMAhU,2427
+ anomaly_pipeline/helpers/iso_forest_timeseries.py,sha256=Q-chpPmO4FkRBKRaZhIQdl-xISHfyFelmDC9V5_8PIQ,14562
+ anomaly_pipeline/helpers/percentile.py,sha256=IGk9DrlIrf-rKOQnIS72-cP9meRfAP6NAZv1UIktm9k,5436
+ anomaly_pipeline-0.1.61.dist-info/METADATA,sha256=75RO3V9iOX78_hxU4l_RfV7uo9g04FNMH_fh37CndSk,11669
+ anomaly_pipeline-0.1.61.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ anomaly_pipeline-0.1.61.dist-info/entry_points.txt,sha256=c7aMFN_VdyQk_gKp9S2-bz4AF3eBActUectAElnEdMo,92
+ anomaly_pipeline-0.1.61.dist-info/top_level.txt,sha256=3QhrLt05iNbxIQhnAA0vmIkRQje4Hc_STGY_Tukx3Vg,17
+ anomaly_pipeline-0.1.61.dist-info/RECORD,,
@@ -1,15 +0,0 @@
- Metadata-Version: 2.4
- Name: anomaly_pipeline
- Version: 0.1.27
- Requires-Dist: pandas
- Requires-Dist: numpy<2
- Requires-Dist: joblib
- Requires-Dist: prophet
- Requires-Dist: scikit-learn
- Requires-Dist: google-cloud-bigquery
- Requires-Dist: google-cloud-storage
- Requires-Dist: statsmodels
- Requires-Dist: plotly
- Requires-Dist: pandas-gbq
- Requires-Dist: gcsfs
- Dynamic: requires-dist
@@ -1,24 +0,0 @@
- anomaly_pipeline/__init__.py,sha256=ED-UPADjbdS8xjK41KmWVYcFIn6q_cN-SwBx-dRI-nM,77
- anomaly_pipeline/main.py,sha256=khiatXxr01XYHB8SrIfyTnlaCu008MA6ORGiI_2Tjr4,2925
- anomaly_pipeline/pipeline.py,sha256=3Lf9b0Vok-mqWDLhhZeN9emgx5i30stPrU8XOmKpmEw,11204
- anomaly_pipeline/helpers/DB_scan.py,sha256=80PLlubpcwY6dOUx5rm569hvFlGNa1rtvjs74US9oIk,8134
- anomaly_pipeline/helpers/IQR.py,sha256=VlYU6Yf-4KQmVroLvzwd220jn5BUNJEchsVE4_KxKm4,2824
- anomaly_pipeline/helpers/MAD.py,sha256=XDG8r9o1JNi7YZ2NKwNzqmu_Oyz2OPP2rThCuw8WZhs,3377
- anomaly_pipeline/helpers/Preprocessing.py,sha256=VsAohcAW1wTKDdNAF1xNF4j4I2gyZ8rOC1HjyK0NpGk,3933
- anomaly_pipeline/helpers/STD.py,sha256=SZ1UaS_Aa5ay6qWNzKpBXpQIloUuPlliOrfr7yHba4k,2769
- anomaly_pipeline/helpers/__init__.py,sha256=aDAAxiNAusL4rwcn9XbkUIApp3i02UXolB_CWvbbY_0,32
- anomaly_pipeline/helpers/baseline.py,sha256=h9t_LWcAw17P9qmoRQMceukGzOOr-gFLuHfVbipQB7M,3824
- anomaly_pipeline/helpers/cluster_functions.py,sha256=Nhk2YdKVynrKywEILg_5B2xD4zrCZ_ICWw3oOdTDHuA,13040
- anomaly_pipeline/helpers/evaluation_info.py,sha256=SXa1LkznNQXTOcFCbryRmRJMSNC_Fa2CU-HhFnyTIKY,6219
- anomaly_pipeline/helpers/evaluation_plots.py,sha256=xfyVlE7B4E376EL4AF8A4T5kUfqzPShGOSy548psT6M,21230
- anomaly_pipeline/helpers/ewma.py,sha256=YprdcvR17EQ4X9pJo5OusaD3jNaaoHvQLHRHHt25CGk,3562
- anomaly_pipeline/helpers/fb_prophet.py,sha256=-ivBIgMBPT4DG-hbGXPMB1-aiEBfLw2LQvy6eXKzELQ,3182
- anomaly_pipeline/helpers/help_info.py,sha256=QuRd206KQ8etRnlODH9Ek_zmXUvHSBwVQtukqf0iKSc,37012
- anomaly_pipeline/helpers/iso_forest_general.py,sha256=nonZl2wcLyHe0E50mqQUw_IB3tuMochmZKQNd0xMFVk,2350
- anomaly_pipeline/helpers/iso_forest_timeseries.py,sha256=SWf6g0mwLohIRdMvGfMCAcfWi5FPPokiV7dM8Un5qpE,5900
- anomaly_pipeline/helpers/percentile.py,sha256=eLk0PgY7m7z7VKTLfXg8ykKii0ciAJvlGOYXpv84mOE,2523
- anomaly_pipeline-0.1.27.dist-info/METADATA,sha256=YIIJMpsDchA8L2Jp0T4wBXpxwcL5r-UiJ35gLP6BRCs,371
- anomaly_pipeline-0.1.27.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
- anomaly_pipeline-0.1.27.dist-info/entry_points.txt,sha256=c7aMFN_VdyQk_gKp9S2-bz4AF3eBActUectAElnEdMo,92
- anomaly_pipeline-0.1.27.dist-info/top_level.txt,sha256=3QhrLt05iNbxIQhnAA0vmIkRQje4Hc_STGY_Tukx3Vg,17
- anomaly_pipeline-0.1.27.dist-info/RECORD,,