PyNomaly 0.3.4__tar.gz → 0.3.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
pynomaly-0.3.5/LICENSE ADDED
@@ -0,0 +1,13 @@
Copyright 2017 Valentino Constantinou.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
@@ -0,0 +1,503 @@
Metadata-Version: 2.4
Name: PyNomaly
Version: 0.3.5
Summary: A Python 3 implementation of LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].
Home-page: https://github.com/vc1492a/PyNomaly
Download-URL: https://github.com/vc1492a/PyNomaly/archive/0.3.5.tar.gz
Author: Valentino Constantinou
Author-email: vc@valentino.io
License: Apache License, Version 2.0
Keywords: outlier,anomaly,detection,machine,learning,probability
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: python-utils
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: download-url
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary

# PyNomaly

PyNomaly is a Python 3 implementation of LoOP (Local Outlier Probabilities).
LoOP is a local density based outlier detection method by Kriegel, Kröger, Schubert, and Zimek which provides outlier
scores in the range of [0,1] that are directly interpretable as the probability of a sample being an outlier.

PyNomaly is a core library of [deepchecks](https://github.com/deepchecks/deepchecks), [OmniDocBench](https://github.com/opendatalab/OmniDocBench) and [pysad](https://github.com/selimfirat/pysad).

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![PyPi](https://img.shields.io/badge/pypi-0.3.5-blue.svg)](https://pypi.python.org/pypi/PyNomaly/0.3.5)
[![Total Downloads](https://static.pepy.tech/badge/pynomaly)](https://pepy.tech/projects/pynomaly)
[![Monthly Downloads](https://static.pepy.tech/badge/pynomaly/month)](https://pepy.tech/projects/pynomaly)
![Tests](https://github.com/vc1492a/PyNomaly/actions/workflows/tests.yml/badge.svg)
[![Coverage Status](https://coveralls.io/repos/github/vc1492a/PyNomaly/badge.svg?branch=main)](https://coveralls.io/github/vc1492a/PyNomaly?branch=main)
[![JOSS](http://joss.theoj.org/papers/f4d2cfe680768526da7c1f6a2c103266/status.svg)](http://joss.theoj.org/papers/f4d2cfe680768526da7c1f6a2c103266)

The outlier score of each sample is called the Local Outlier Probability.
It measures the local deviation of the density of a given sample with
respect to its neighbors, as the Local Outlier Factor (LOF) does, but provides normalized
outlier scores in the range [0,1]. These outlier scores are directly interpretable
as the probability of an object being an outlier. Since Local Outlier Probabilities provides scores in the
range [0,1], practitioners are free to interpret the results according to the application.

Like LOF, it is local in that the anomaly score depends on how isolated the sample is
with respect to the surrounding neighborhood. Locality is given by the k-nearest neighbors,
whose distances are used to estimate the local density. By comparing the local density of a sample to the
local densities of its neighbors, one can identify samples that lie in regions of lower
density than their neighbors and thus identify samples that may be outliers according to their Local
Outlier Probability.

The authors' 2009 paper detailing LoOP's theory, formulation, and application is provided by
Ludwig-Maximilians University Munich - Institute for Informatics:
[LoOP: Local Outlier Probabilities](http://www.dbs.ifi.lmu.de/Publikationen/Papers/LoOP1649.pdf).

## Implementation

This Python 3 implementation uses Numpy and the formulas outlined in
[LoOP: Local Outlier Probabilities](http://www.dbs.ifi.lmu.de/Publikationen/Papers/LoOP1649.pdf)
to calculate the Local Outlier Probability of each sample.

## Dependencies
- Python 3.8 - 3.13
- numpy >= 1.16.3
- python-utils >= 2.3.0
- (optional) numba >= 0.45.1

Numba just-in-time (JIT) compiles the function which calculates the Euclidean
distance between observations, providing a reduction in computation time
(significant when a large number of observations are scored). Numba is not a
requirement, and PyNomaly may still be used solely with numpy if desired
(details below).

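To illustrate what JIT compilation buys here, consider a standalone sketch (not PyNomaly's internal code; the `euclidean` function and the Numba fallback shim are illustrative) of a compiled Euclidean distance that still runs when Numba is absent:

```python
from math import sqrt

try:
    from numba import njit  # optional: compiles to machine code on first call
except ImportError:
    def njit(func):  # no-op fallback so the code also runs without numba
        return func

@njit
def euclidean(x1, y1, z1, x2, y2, z2):
    # Euclidean distance between two 3-dimensional points
    return sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)

print(euclidean(0.0, 0.0, 0.0, 3.0, 4.0, 0.0))  # 5.0
```

The first call pays a one-time compilation cost; subsequent calls run at native speed, which is where the savings accumulate when many observations are scored.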
## Quick Start

First install the package from the Python Package Index:

```shell
pip install PyNomaly # or pip3 install ... if you're using both Python 3 and 2.
```

Alternatively, you can use conda to install the package from conda-forge:

```shell
conda install conda-forge::pynomaly
```

Then you can do something like this:

```python
from PyNomaly import loop
m = loop.LocalOutlierProbability(data).fit()
scores = m.local_outlier_probabilities
print(scores)
```

where *data* is an NxM (N rows, M columns; 2-dimensional) set of data as either a Pandas DataFrame or Numpy array.

LocalOutlierProbability sets the *extent* (an integer value of 1, 2, or 3) and *n_neighbors* (must be greater than 0) parameters with the default
values of 3 and 10, respectively. You're free to set these parameters on your own as below:

```python
from PyNomaly import loop
m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20).fit()
scores = m.local_outlier_probabilities
print(scores)
```

This implementation of LoOP also includes an optional *cluster_labels* parameter. This is useful in cases where regions
of varying density occur within the same set of data. When using *cluster_labels*, the Local Outlier Probability of a
sample is calculated with respect to its cluster assignment.

```python
from PyNomaly import loop
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.6, min_samples=50).fit(data)
m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20, cluster_labels=list(db.labels_)).fit()
scores = m.local_outlier_probabilities
print(scores)
```

**NOTE**: Unless your data is all on the same scale, it may be a good idea to normalize your data with z-scores or another
normalization scheme prior to using LoOP, especially when working with multiple dimensions of varying scale.
Users must also appropriately handle missing values prior to using LoOP, as LoOP does not support Pandas
DataFrames or Numpy arrays with missing values.

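As a sketch of the kind of preprocessing meant here (using numpy directly; scikit-learn's `StandardScaler` is an alternative), z-scoring each column puts features of very different scales on equal footing before scoring:

```python
import numpy as np

data = np.array([
    [1.0, 500.0],
    [2.0, 700.0],
    [3.0, 600.0],
    [2.5, 650.0],
])

# z-score each column: subtract the column mean, divide by the column standard deviation
normalized = (data - data.mean(axis=0)) / data.std(axis=0)
print(normalized.round(2))
```

The `normalized` array, rather than `data`, would then be passed to `LocalOutlierProbability`.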
### Utilizing Numba and Progress Bars

It may be helpful to use just-in-time (JIT) compilation in cases where a large number of
observations are scored. Numba, a JIT compiler for Python, may be used
with PyNomaly by setting `use_numba=True`:

```python
from PyNomaly import loop
m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20, use_numba=True, progress_bar=True).fit()
scores = m.local_outlier_probabilities
print(scores)
```

Numba must be installed to use JIT compilation and improve the
speed of multiple calls to `LocalOutlierProbability()`, and PyNomaly has been
tested with Numba version 0.45.1. An example of the speed difference that can
be realized by using Numba is available in `examples/numba_speed_diff.py`.

You may also choose to print progress bars _with or without_ the use of numba
by passing `progress_bar=True` to `LocalOutlierProbability()` as above.

### Choosing Parameters

The *extent* parameter controls the sensitivity of the scoring in practice. The parameter corresponds to
the statistical notion of an outlier defined as an object deviating more than a given lambda (*extent*)
times the standard deviation from the mean. A value of 2 implies outliers deviating more than 2 standard deviations
from the mean, and corresponds to 95.0% in the empirical "three-sigma" rule. The appropriate value should be selected
according to the level of sensitivity needed for the input data and application. The question to ask is whether it is
more reasonable to assume outliers in your data are 1, 2, or 3 standard deviations from the mean, and to select the
value most appropriate to your data and application.

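One way to frame that choice is via the tail mass implied by the empirical 68-95-99.7 rule. The sketch below is purely illustrative; the `pick_extent` helper is not part of the PyNomaly API:

```python
# Approximate share of normally distributed data OUTSIDE lambda standard
# deviations of the mean (the tails of the empirical 68-95-99.7 rule).
tail_mass = {1: 0.32, 2: 0.05, 3: 0.003}

def pick_extent(expected_outlier_fraction):
    # choose the smallest extent whose tail mass does not exceed
    # the fraction of outliers you expect in your data
    for lam in (1, 2, 3):
        if tail_mass[lam] <= expected_outlier_fraction:
            return lam
    return 3

print(pick_extent(0.05))  # 2: roughly the most extreme ~5% are treated as outliers
```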
The *n_neighbors* parameter defines the number of neighbors to consider about
each sample (the neighborhood size) when determining its Local Outlier Probability with respect to the density
of the sample's defined neighborhood. The ideal number of neighbors to consider is dependent on the
input data. However, the notion of an outlier implies it would be considered as such regardless of the number
of neighbors considered. One potential approach is to use a number of different neighborhood sizes and average
the results for each observation; those observations which rank highly across varying neighborhood sizes are
more than likely outliers. Another approach is to
select a value proportional to the number of observations, such as an odd-valued integer close to the square root
of the number of observations in your data (*sqrt(n_observations)*).

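The averaging approach can be sketched as follows, with made-up score arrays standing in for the `m.local_outlier_probabilities` you would obtain at each neighborhood size:

```python
# Stand-in scores from, e.g., loop.LocalOutlierProbability(data, n_neighbors=k).fit()
# for k in (5, 10, 15); each inner list holds one score per observation.
scores_by_k = [
    [0.10, 0.05, 0.92, 0.20],  # k = 5
    [0.15, 0.02, 0.88, 0.25],  # k = 10
    [0.05, 0.08, 0.95, 0.15],  # k = 15
]

# Average the scores per observation across neighborhood sizes.
n_obs = len(scores_by_k[0])
avg_scores = [
    sum(run[i] for run in scores_by_k) / len(scores_by_k)
    for i in range(n_obs)
]
print(avg_scores)  # observation 2 scores highly at every k: a likely outlier
```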
## Iris Data Example

We'll be using the well-known Iris dataset to show LoOP's capabilities. There are a few things you'll need for this
example beyond the standard prerequisites listed above:
- matplotlib 2.0.0 or greater
- PyDataset 0.2.0 or greater
- scikit-learn 0.18.1 or greater

First, let's import the packages and libraries we will need for this example.

```python
from PyNomaly import loop
import pandas as pd
from pydataset import data
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
```

Now let's create two sets of Iris data for scoring; one with clustering and the other without.

```python
# import the data and remove any non-numeric columns
iris = pd.DataFrame(data('iris').drop(columns=['Species']))
```

Next, let's cluster the data using DBSCAN and generate two sets of scores. In both cases, we will use the default
values for both *extent* (3) and *n_neighbors* (10).

```python
db = DBSCAN(eps=0.9, min_samples=10).fit(iris)
m = loop.LocalOutlierProbability(iris).fit()
scores_noclust = m.local_outlier_probabilities
m_clust = loop.LocalOutlierProbability(iris, cluster_labels=list(db.labels_)).fit()
scores_clust = m_clust.local_outlier_probabilities
```

Organize the data into two separate Pandas DataFrames.

```python
iris_clust = pd.DataFrame(iris.copy())
iris_clust['scores'] = scores_clust
iris_clust['labels'] = db.labels_
iris['scores'] = scores_noclust
```
And finally, let's visualize the scores provided by LoOP in both cases (with and without clustering).

```python
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris['Sepal.Width'], iris['Petal.Width'], iris['Sepal.Length'],
           c=iris['scores'], cmap='seismic', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal.Length')
plt.show()
plt.clf()
plt.cla()
plt.close()

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris_clust['Sepal.Width'], iris_clust['Petal.Width'], iris_clust['Sepal.Length'],
           c=iris_clust['scores'], cmap='seismic', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal.Length')
plt.show()
plt.clf()
plt.cla()
plt.close()

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris_clust['Sepal.Width'], iris_clust['Petal.Width'], iris_clust['Sepal.Length'],
           c=iris_clust['labels'], cmap='Set1', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal.Length')
plt.show()
plt.clf()
plt.cla()
plt.close()
```

Your results should look like the following:

**LoOP Scores without Clustering**
![LoOP Scores without Clustering](https://github.com/vc1492a/PyNomaly/blob/main/images/scores.png)

**LoOP Scores with Clustering**
![LoOP Scores with Clustering](https://github.com/vc1492a/PyNomaly/blob/main/images/scores_clust.png)

**DBSCAN Cluster Assignments**
![DBSCAN Cluster Assignments](https://github.com/vc1492a/PyNomaly/blob/main/images/cluster_assignments.png)


Note the differences between using LocalOutlierProbability with and without clustering. In the example without clustering, samples are
scored according to the distribution of the entire data set. In the example with clustering, each sample is scored
according to the distribution of each cluster. Which approach is suitable depends on the use case.

**NOTE**: Data was not normalized in this example, but it's probably a good idea to do so in practice.

## Using Numpy

When using numpy, make sure to use 2-dimensional arrays in tabular format:

```python
data = np.array([
    [43.3, 30.2, 90.2],
    [62.9, 58.3, 49.3],
    [55.2, 56.2, 134.2],
    [48.6, 80.3, 50.3],
    [67.1, 60.0, 55.9],
    [421.5, 90.3, 50.0]
])

scores = loop.LocalOutlierProbability(data, n_neighbors=3).fit().local_outlier_probabilities
print(scores)
```

The shape of the input array corresponds to the rows (observations) and columns (features) in the data:

```python
print(data.shape)
# (6, 3), which matches the number of observations and features in the above example
```

Similar to the above:

```python
data = np.random.rand(100, 5)
scores = loop.LocalOutlierProbability(data).fit().local_outlier_probabilities
print(scores)
```

## Specifying a Distance Matrix

PyNomaly provides the ability to specify a distance matrix so that any
distance metric can be used (a neighbor index matrix must also be provided).
This can be useful when wanting to use a distance other than the Euclidean.

Note that in order to maintain alignment with the LoOP definition of closest neighbors,
an additional neighbor is added when using [scikit-learn's NearestNeighbors](https://scikit-learn.org/1.5/modules/neighbors.html), since `NearestNeighbors`
includes the point itself when calculating the closest neighbors (whereas the LoOP method does not include distances to the point itself).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from PyNomaly import loop

data = np.array([
    [43.3, 30.2, 90.2],
    [62.9, 58.3, 49.3],
    [55.2, 56.2, 134.2],
    [48.6, 80.3, 50.3],
    [67.1, 60.0, 55.9],
    [421.5, 90.3, 50.0]
])

# Generate distance and neighbor matrices
n_neighbors = 3  # the number of neighbors according to the LoOP definition
neigh = NearestNeighbors(n_neighbors=n_neighbors + 1, metric='hamming')
neigh.fit(data)
d, idx = neigh.kneighbors(data, return_distance=True)

# Remove self-distances - you MUST do this to preserve the results intended by the definition of LoOP
idx = np.delete(idx, 0, 1)
d = np.delete(d, 0, 1)

# Fit and return scores
m = loop.LocalOutlierProbability(distance_matrix=d, neighbor_matrix=idx, n_neighbors=n_neighbors).fit()
scores = m.local_outlier_probabilities
```

The visualization below shows the results for a few known distance metrics:

**LoOP Scores by Distance Metric**
![LoOP Scores by Distance Metric](https://github.com/vc1492a/PyNomaly/blob/main/images/scores_by_distance_metric.png)

## Streaming Data

PyNomaly also contains an implementation of Hamlet et al.'s modifications
to the original LoOP approach [[4](http://www.tandfonline.com/doi/abs/10.1080/23742917.2016.1226651?journalCode=tsec20)],
which may be used for applications involving streaming data or where rapid calculations may be necessary.
First, the standard LoOP algorithm is used on "training" data, with certain attributes of the fitted data
stored from the original LoOP approach. Then, as new points are considered, these fitted attributes are
called upon when calculating the score of the incoming streaming data, due to the use of averages from the initial
fit, such as a global value for the expected value of the probabilistic distance. Despite the potential
for increased error when compared to the standard approach, it may be effective in streaming applications where
refitting the standard approach over all points could be computationally expensive.

While the iris dataset is not streaming data, we'll use it in this example by taking the first 120 observations
as training data and the remaining 30 observations as a stream, scoring each observation
individually.

Split the data.
```python
iris = iris.sample(frac=1) # shuffle data
iris_train = iris.iloc[:, 0:4].head(120)
iris_test = iris.iloc[:, 0:4].tail(30)
```

Fit to each set.
```python
m = loop.LocalOutlierProbability(iris).fit()
scores_noclust = m.local_outlier_probabilities
iris['scores'] = scores_noclust

m_train = loop.LocalOutlierProbability(iris_train, n_neighbors=10)
m_train.fit()
iris_train_scores = m_train.local_outlier_probabilities
```

Score the streaming data one observation at a time.
```python
iris_test_scores = []
for index, row in iris_test.iterrows():
    array = np.array([row['Sepal.Length'], row['Sepal.Width'], row['Petal.Length'], row['Petal.Width']])
    iris_test_scores.append(m_train.stream(array))
iris_test_scores = np.array(iris_test_scores)
```

Concatenate the scores and assess.

```python
iris['stream_scores'] = np.hstack((iris_train_scores, iris_test_scores))
# iris['scores'] from the earlier example
rmse = np.sqrt(((iris['scores'] - iris['stream_scores']) ** 2).mean(axis=None))
print(rmse)
```

The root mean squared error (RMSE) between the two approaches is approximately 0.199 (your scores will vary depending on the data and specification).
The plot below shows the scores from the stream approach.

```python
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris['Sepal.Width'], iris['Petal.Width'], iris['Sepal.Length'],
           c=iris['stream_scores'], cmap='seismic', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal.Length')
plt.show()
plt.clf()
plt.cla()
plt.close()
```

**LoOP Scores using Stream Approach with n=10**
![LoOP Scores using Stream Approach with n=10](https://github.com/vc1492a/PyNomaly/blob/main/images/scores_stream.png)

### Notes
When calculating the LoOP score of incoming data, the original fitted scores are not updated.
In some applications, it may be beneficial to refit the data periodically. The stream functionality
also assumes that either data or a distance matrix (or value) will be used in both fitting
and streaming, with no changes in specification between steps.

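A minimal sketch of such a periodic-refit policy follows. The `PeriodicRefit` class and the stub fit function are illustrative, not part of PyNomaly; in practice `fit_fn` would wrap something like `loop.LocalOutlierProbability(np.array(points), n_neighbors=10).fit()`:

```python
class PeriodicRefit:
    """Buffer streamed points and rebuild the model every `refit_every` observations."""

    def __init__(self, fit_fn, refit_every=100):
        self.fit_fn = fit_fn
        self.refit_every = refit_every
        self.buffer = []
        self.model = None

    def observe(self, point):
        self.buffer.append(point)
        # refit on everything seen so far: on the first point, then periodically
        if self.model is None or len(self.buffer) % self.refit_every == 0:
            self.model = self.fit_fn(self.buffer)
        return self.model

fits = []
def fake_fit(points):  # stub standing in for an actual LoOP fit
    fits.append(len(points))
    return f"model over {len(points)} points"

stream = PeriodicRefit(fake_fit, refit_every=3)
for i in range(7):
    stream.observe([float(i)])
print(fits)  # refits happened at the 1st, 3rd and 6th observations
```

Between refits, incoming points would be scored with the current model's `stream` method; the refit cadence trades accuracy against the cost of refitting.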
## Contributing

Please use the issue tracker to report any erroneous behavior or desired
feature requests.

If you would like to contribute to development, please fork the repository and make
any changes to a branch which corresponds to an open issue. Hot fixes
and bug fixes can be represented by branches with the prefix `fix/` versus
`feature/` for new capabilities or code improvements. Pull requests will
then be made from these branches into the repository's `dev` branch
prior to being pulled into `main`.

### Commit Messages and Releases

**Your commit messages are important** - here's why.

PyNomaly leverages [release-please](https://github.com/googleapis/release-please-action) to help automate the release process using the [Conventional Commits](https://www.conventionalcommits.org/) specification. When pull requests are opened to the `main` branch, release-please will collate the git commit messages and prepare an organized changelog and release notes. This process is possible because of the Conventional Commits specification.

Conventional Commits provides an easy set of rules for creating an explicit commit history, which makes it easier to build automated tools on top of. This convention dovetails with SemVer by describing the features, fixes, and breaking changes made in commit messages. You can check out examples [here](https://www.conventionalcommits.org/en/v1.0.0/#examples). Make a best effort to use the specification when contributing to PyNomaly code, as it dramatically eases the documentation around releases and their features, breaking changes, bug fixes and documentation updates.

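For instance, commit messages following the specification might look like the following (illustrative examples, not taken from this repository's history):

```
feat: add support for a user-supplied distance matrix
fix: remove self-distances before fitting the neighbor matrix
docs: clarify how to choose the extent parameter
feat!: drop support for Python 3.7
```

The `feat`/`fix` prefixes map to minor/patch version bumps under SemVer, while a `!` (or a `BREAKING CHANGE:` footer) signals a major bump.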
### Tests
When contributing, please ensure to run the unit tests and add additional tests as
necessary if adding new functionality. To run the unit tests, use `pytest`:

```shell
python3 -m pytest --cov=PyNomaly -s -v
```

To run the tests with Numba enabled, simply set the flag `NUMBA` in `test_loop.py`
to `True`. Note that a drop in coverage is expected due to portions of the code
being compiled upon code execution.

## Versioning
[Semantic versioning](http://semver.org/) is used for this project. If contributing, please conform to semantic
versioning guidelines when submitting a pull request.

## License
This project is licensed under the Apache 2.0 license.

## Research
If citing PyNomaly, use the following:

```
@article{Constantinou2018,
  doi = {10.21105/joss.00845},
  url = {https://doi.org/10.21105/joss.00845},
  year = {2018},
  month = {oct},
  publisher = {The Open Journal},
  volume = {3},
  number = {30},
  pages = {845},
  author = {Valentino Constantinou},
  title = {{PyNomaly}: Anomaly detection using Local Outlier Probabilities ({LoOP}).},
  journal = {Journal of Open Source Software}
}
```

## References
1. Breunig M., Kriegel H.-P., Ng R., Sander J. LOF: Identifying Density-based Local Outliers. ACM SIGMOD International Conference on Management of Data (2000). [PDF](http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf).
2. Kriegel H., Kröger P., Schubert E., Zimek A. LoOP: Local Outlier Probabilities. 18th ACM Conference on Information and Knowledge Management, CIKM (2009). [PDF](http://www.dbs.ifi.lmu.de/Publikationen/Papers/LoOP1649.pdf).
3. Goldstein M., Uchida S. A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data. PLoS ONE 11(4): e0152173 (2016).
4. Hamlet C., Straub J., Russell M., Kerlin S. An incremental and approximate local outlier probability algorithm for intrusion detection and its evaluation. Journal of Cyber Security Technology (2016). [DOI](http://www.tandfonline.com/doi/abs/10.1080/23742917.2016.1226651?journalCode=tsec20).

## Acknowledgements
- The authors of LoOP (Local Outlier Probabilities)
    - Hans-Peter Kriegel
    - Peer Kröger
    - Erich Schubert
    - Arthur Zimek
- [NASA Jet Propulsion Laboratory](https://jpl.nasa.gov/)
- [Kyle Hundman](https://github.com/khundman)
- [Ian Colwell](https://github.com/iancolwell)

@@ -0,0 +1,18 @@
# Authors: Valentino Constantinou <vc@valentino.io>
# License: Apache 2.0

from PyNomaly.loop import (
    LocalOutlierProbability,
    PyNomalyError,
    ValidationError,
    ClusterSizeError,
    MissingValuesError,
)

__all__ = [
    "LocalOutlierProbability",
    "PyNomalyError",
    "ValidationError",
    "ClusterSizeError",
    "MissingValuesError",
]