outliertree 0.1.2 → 0.3.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +14 -0
- data/NOTICE.txt +1 -1
- data/README.md +11 -10
- data/ext/outliertree/ext.cpp +104 -105
- data/ext/outliertree/extconf.rb +1 -1
- data/lib/outliertree/result.rb +3 -3
- data/lib/outliertree/version.rb +1 -1
- data/vendor/outliertree/README.md +77 -40
- data/vendor/outliertree/src/Makevars.in +4 -0
- data/vendor/outliertree/src/Makevars.win +4 -0
- data/vendor/outliertree/src/RcppExports.cpp +20 -9
- data/vendor/outliertree/src/Rwrapper.cpp +256 -57
- data/vendor/outliertree/src/cat_outlier.cpp +6 -6
- data/vendor/outliertree/src/clusters.cpp +114 -9
- data/vendor/outliertree/src/fit_model.cpp +505 -308
- data/vendor/outliertree/src/misc.cpp +165 -4
- data/vendor/outliertree/src/outlier_tree.hpp +159 -51
- data/vendor/outliertree/src/outliertree-win.def +3 -0
- data/vendor/outliertree/src/predict.cpp +33 -0
- data/vendor/outliertree/src/split.cpp +124 -20
- metadata +10 -8
- data/vendor/outliertree/src/Makevars +0 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2851b4b56b23141bc9f1ef5b3c448fb75d785ce0e7b38580113001898ce18e2e
+  data.tar.gz: 817325392325bc61f1dea1363096678fe9fd578ec6026301e173447edc522752
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e1bc84c131959bb7260100b4aa1e345ae09330600252299084efe5d40fbff8d8d9d9aadf78c6384a0b6a158a1b3bbcdeff9de3ce71fca90ce938003125b37898
+  data.tar.gz: cad988456c492f101bc71334997217a48d6588c4e2e5feee333bc074f3b14b62557d0e64429fca787705244b701eca7d685aa17f090dad94578fce172812e5a0
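The `checksums.yaml` entries above are SHA-256 and SHA-512 digests of the two archives inside the gem. As a minimal sketch of how such a digest could be checked with Ruby's standard `digest` library: the helper name `digest_matches?` and the throwaway demo file are assumptions for illustration and are not part of the gem.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: true when the file's hex digest equals the expected value.
# Pass Digest::SHA512 as the algorithm to check the SHA512 entries instead.
def digest_matches?(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex.downcase
end

# Demonstration on a temporary file rather than a real metadata.gz
Tempfile.create("checksum-demo") do |file|
  file.write("abc")
  file.flush

  expected = Digest::SHA256.hexdigest("abc")
  puts digest_matches?(file.path, expected)
  puts digest_matches?(file.path, "0" * 64)
end
```

To check a real release, the `.gem` file would first need to be unpacked so that `metadata.gz` and `data.tar.gz` are available as separate files.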
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,17 @@
+## 0.3.0 (2022-06-13)
+
+- Updated OutlierTree to 1.8.1
+- Dropped support for Ruby < 2.7
+
+## 0.2.1 (2021-05-23)
+
+- Improved performance
+
+## 0.2.0 (2021-05-17)
+
+- Updated to Rice 4
+- Dropped support for Ruby < 2.6
+
 ## 0.1.2 (2021-02-08)
 
 - Fixed error with missing numeric values
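The changelog records two minimum-version bumps (Ruby 2.6 in 0.2.0, Ruby 2.7 in 0.3.0). The gemspec itself is not shown in this diff, so the following is only a sketch of how such a constraint can be evaluated with RubyGems' own version classes; the helper name `ruby_supported?` is hypothetical.

```ruby
require "rubygems"

# Hypothetical helper: does a Ruby version string satisfy a requirement string?
def ruby_supported?(ruby_version, requirement = ">= 2.7")
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(ruby_version))
end

# Check the interpreter running this script against the 0.3.0 constraint
puts ruby_supported?(RUBY_VERSION)
```

In a gemspec, the same kind of requirement string is normally assigned to `required_ruby_version`, which lets RubyGems refuse installation on older interpreters.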
data/NOTICE.txt
CHANGED
data/README.md
CHANGED
@@ -1,4 +1,4 @@
-# OutlierTree
+# OutlierTree Ruby
 
 :deciduous_tree: [OutlierTree](https://github.com/david-cortes/outliertree) - explainable outlier/anomaly detection - for Ruby
 
@@ -8,16 +8,16 @@ Produces human-readable explanations for why values are detected as outliers
 Price (2.50) looks low given Department is Books and Sale is false
 ```
 
-:evergreen_tree: Check out [IsoTree](https://github.com/ankane/isotree) for an alternative approach that uses Isolation Forest
+:evergreen_tree: Check out [IsoTree](https://github.com/ankane/isotree-ruby) for an alternative approach that uses Isolation Forest
 
-[![Build Status](https://github.com/ankane/outliertree/workflows/build/badge.svg?branch=master)](https://github.com/ankane/outliertree/actions)
+[![Build Status](https://github.com/ankane/outliertree-ruby/workflows/build/badge.svg?branch=master)](https://github.com/ankane/outliertree-ruby/actions)
 
 ## Installation
 
 Add this line to your application’s Gemfile:
 
 ```ruby
-gem
+gem "outliertree"
 ```
 
 ## Getting Started
@@ -28,7 +28,8 @@ Prep your data
 data = [
   {department: "Books", sale: false, price: 2.50},
   {department: "Books", sale: true, price: 3.00},
-  {department: "Movies", sale: false, price: 5.00}
+  {department: "Movies", sale: false, price: 5.00},
+  # ...
 ]
 ```
 
@@ -108,22 +109,22 @@ bundle install
 
 ## History
 
-View the [changelog](https://github.com/ankane/outliertree/blob/master/CHANGELOG.md)
+View the [changelog](https://github.com/ankane/outliertree-ruby/blob/master/CHANGELOG.md)
 
 ## Contributing
 
 Everyone is encouraged to help improve this project. Here are a few ways you can help:
 
-- [Report bugs](https://github.com/ankane/outliertree/issues)
-- Fix bugs and [submit pull requests](https://github.com/ankane/outliertree/pulls)
+- [Report bugs](https://github.com/ankane/outliertree-ruby/issues)
+- Fix bugs and [submit pull requests](https://github.com/ankane/outliertree-ruby/pulls)
 - Write, clarify, or fix documentation
 - Suggest or add new features
 
 To get started with development:
 
 ```sh
-git clone --recursive https://github.com/ankane/outliertree.git
-cd outliertree
+git clone --recursive https://github.com/ankane/outliertree-ruby.git
+cd outliertree-ruby
 bundle install
 bundle exec rake compile
 bundle exec rake test
data/ext/outliertree/ext.cpp
CHANGED
@@ -2,12 +2,8 @@
|
|
2
2
|
#include <outlier_tree.hpp>
|
3
3
|
|
4
4
|
// rice
|
5
|
-
#include <rice/
|
6
|
-
#include <rice/
|
7
|
-
#include <rice/Module.hpp>
|
8
|
-
#include <rice/Object.hpp>
|
9
|
-
#include <rice/String.hpp>
|
10
|
-
#include <rice/Symbol.hpp>
|
5
|
+
#include <rice/rice.hpp>
|
6
|
+
#include <rice/stl.hpp>
|
11
7
|
|
12
8
|
using Rice::Array;
|
13
9
|
using Rice::Hash;
|
@@ -18,74 +14,77 @@ using Rice::Symbol;
|
|
18
14
|
using Rice::define_class_under;
|
19
15
|
using Rice::define_module;
|
20
16
|
|
21
|
-
|
22
|
-
Object to_ruby<std::vector<char>>(std::vector<char> const & x)
|
17
|
+
namespace Rice::detail
|
23
18
|
{
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
|
19
|
+
template<typename T>
|
20
|
+
class To_Ruby<std::vector<T>>
|
21
|
+
{
|
22
|
+
public:
|
23
|
+
VALUE convert(std::vector<T> const & x)
|
24
|
+
{
|
25
|
+
auto a = rb_ary_new2(x.size());
|
26
|
+
for (const auto& v : x) {
|
27
|
+
rb_ary_push(a, To_Ruby<T>().convert(v));
|
28
|
+
}
|
29
|
+
return a;
|
30
|
+
}
|
31
|
+
};
|
30
32
|
|
31
|
-
template<>
|
32
|
-
|
33
|
-
{
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
}
|
33
|
+
template<>
|
34
|
+
struct Type<ColType>
|
35
|
+
{
|
36
|
+
static bool verify()
|
37
|
+
{
|
38
|
+
return true;
|
39
|
+
}
|
40
|
+
};
|
40
41
|
|
41
|
-
template<>
|
42
|
-
|
43
|
-
{
|
44
|
-
|
45
|
-
|
46
|
-
|
47
|
-
|
48
|
-
|
49
|
-
|
42
|
+
template<>
|
43
|
+
class To_Ruby<ColType>
|
44
|
+
{
|
45
|
+
public:
|
46
|
+
VALUE convert(ColType const & x)
|
47
|
+
{
|
48
|
+
switch (x) {
|
49
|
+
case Numeric: return Symbol("numeric");
|
50
|
+
case Categorical: return Symbol("categorical");
|
51
|
+
case Ordinal: return Symbol("ordinal");
|
52
|
+
case NoType: return Symbol("no_type");
|
53
|
+
}
|
54
|
+
throw std::runtime_error("Unknown column type");
|
55
|
+
}
|
56
|
+
};
|
50
57
|
|
51
|
-
template<>
|
52
|
-
|
53
|
-
{
|
54
|
-
|
55
|
-
|
56
|
-
|
57
|
-
|
58
|
-
|
59
|
-
}
|
58
|
+
template<>
|
59
|
+
struct Type<SplitType>
|
60
|
+
{
|
61
|
+
static bool verify()
|
62
|
+
{
|
63
|
+
return true;
|
64
|
+
}
|
65
|
+
};
|
60
66
|
|
61
|
-
template<>
|
62
|
-
|
63
|
-
{
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
|
74
|
-
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
|
79
|
-
|
80
|
-
|
81
|
-
|
82
|
-
case NotInSubset: return Symbol("not_in_subset");
|
83
|
-
case SingleCateg: return Symbol("single_categ");
|
84
|
-
case SubTrees: return Symbol("sub_trees");
|
85
|
-
case IsNa: return Symbol("is_na");
|
86
|
-
case Root: return Symbol("root");
|
87
|
-
}
|
88
|
-
throw std::runtime_error("Unknown split type");
|
67
|
+
template<>
|
68
|
+
class To_Ruby<SplitType>
|
69
|
+
{
|
70
|
+
public:
|
71
|
+
VALUE convert(SplitType const & x)
|
72
|
+
{
|
73
|
+
switch (x) {
|
74
|
+
case LessOrEqual: return Symbol("less_or_equal");
|
75
|
+
case Greater: return Symbol("greater");
|
76
|
+
case Equal: return Symbol("equal");
|
77
|
+
case NotEqual: return Symbol("not_equal");
|
78
|
+
case InSubset: return Symbol("in_subset");
|
79
|
+
case NotInSubset: return Symbol("not_in_subset");
|
80
|
+
case SingleCateg: return Symbol("single_categ");
|
81
|
+
case SubTrees: return Symbol("sub_trees");
|
82
|
+
case IsNa: return Symbol("is_na");
|
83
|
+
case Root: return Symbol("root");
|
84
|
+
}
|
85
|
+
throw std::runtime_error("Unknown split type");
|
86
|
+
}
|
87
|
+
};
|
89
88
|
}
|
90
89
|
|
91
90
|
extern "C"
|
@@ -95,55 +94,55 @@ void Init_ext()
|
|
95
94
|
Module rb_mExt = define_module_under(rb_mOutlierTree, "Ext");
|
96
95
|
|
97
96
|
define_class_under<Cluster>(rb_mExt, "Cluster")
|
98
|
-
.define_method("upper_lim",
|
99
|
-
.define_method("display_lim_high",
|
100
|
-
.define_method("perc_below",
|
101
|
-
.define_method("display_lim_low",
|
102
|
-
.define_method("perc_above",
|
103
|
-
.define_method("display_mean",
|
104
|
-
.define_method("display_sd",
|
105
|
-
.define_method("cluster_size",
|
106
|
-
.define_method("split_point",
|
107
|
-
.define_method("split_subset",
|
108
|
-
.define_method("split_lev",
|
109
|
-
.define_method("split_type",
|
110
|
-
.define_method("column_type",
|
111
|
-
.define_method("has_na_branch",
|
112
|
-
.define_method("col_num",
|
97
|
+
.define_method("upper_lim", [](Cluster& self) { return self.upper_lim; })
|
98
|
+
.define_method("display_lim_high", [](Cluster& self) { return self.display_lim_high; })
|
99
|
+
.define_method("perc_below", [](Cluster& self) { return self.perc_below; })
|
100
|
+
.define_method("display_lim_low", [](Cluster& self) { return self.display_lim_low; })
|
101
|
+
.define_method("perc_above", [](Cluster& self) { return self.perc_above; })
|
102
|
+
.define_method("display_mean", [](Cluster& self) { return self.display_mean; })
|
103
|
+
.define_method("display_sd", [](Cluster& self) { return self.display_sd; })
|
104
|
+
.define_method("cluster_size", [](Cluster& self) { return self.cluster_size; })
|
105
|
+
.define_method("split_point", [](Cluster& self) { return self.split_point; })
|
106
|
+
.define_method("split_subset", [](Cluster& self) { return self.split_subset; })
|
107
|
+
.define_method("split_lev", [](Cluster& self) { return self.split_lev; })
|
108
|
+
.define_method("split_type", [](Cluster& self) { return self.split_type; })
|
109
|
+
.define_method("column_type", [](Cluster& self) { return self.column_type; })
|
110
|
+
.define_method("has_na_branch", [](Cluster& self) { return self.has_NA_branch; })
|
111
|
+
.define_method("col_num", [](Cluster& self) { return self.col_num; });
|
113
112
|
|
114
113
|
define_class_under<ClusterTree>(rb_mExt, "ClusterTree")
|
115
|
-
.define_method("parent_branch",
|
116
|
-
.define_method("parent",
|
117
|
-
.define_method("all_branches",
|
118
|
-
.define_method("column_type",
|
119
|
-
.define_method("col_num",
|
120
|
-
.define_method("split_point",
|
121
|
-
.define_method("split_subset",
|
122
|
-
.define_method("split_lev",
|
114
|
+
.define_method("parent_branch", [](ClusterTree& self) { return self.parent_branch; })
|
115
|
+
.define_method("parent", [](ClusterTree& self) { return self.parent; })
|
116
|
+
.define_method("all_branches", [](ClusterTree& self) { return self.all_branches; })
|
117
|
+
.define_method("column_type", [](ClusterTree& self) { return self.column_type; })
|
118
|
+
.define_method("col_num", [](ClusterTree& self) { return self.col_num; })
|
119
|
+
.define_method("split_point", [](ClusterTree& self) { return self.split_point; })
|
120
|
+
.define_method("split_subset", [](ClusterTree& self) { return self.split_subset; })
|
121
|
+
.define_method("split_lev", [](ClusterTree& self) { return self.split_lev; });
|
123
122
|
|
124
123
|
define_class_under<ModelOutputs>(rb_mExt, "ModelOutputs")
|
125
|
-
.define_method("outlier_scores_final",
|
126
|
-
.define_method("outlier_columns_final",
|
127
|
-
.define_method("outlier_clusters_final",
|
128
|
-
.define_method("outlier_trees_final",
|
129
|
-
.define_method("outlier_depth_final",
|
130
|
-
.define_method("outlier_decimals_distr",
|
131
|
-
.define_method("min_decimals_col",
|
124
|
+
.define_method("outlier_scores_final", [](ModelOutputs& self) { return self.outlier_scores_final; })
|
125
|
+
.define_method("outlier_columns_final", [](ModelOutputs& self) { return self.outlier_columns_final; })
|
126
|
+
.define_method("outlier_clusters_final", [](ModelOutputs& self) { return self.outlier_clusters_final; })
|
127
|
+
.define_method("outlier_trees_final", [](ModelOutputs& self) { return self.outlier_trees_final; })
|
128
|
+
.define_method("outlier_depth_final", [](ModelOutputs& self) { return self.outlier_depth_final; })
|
129
|
+
.define_method("outlier_decimals_distr", [](ModelOutputs& self) { return self.outlier_decimals_distr; })
|
130
|
+
.define_method("min_decimals_col", [](ModelOutputs& self) { return self.min_decimals_col; })
|
132
131
|
.define_method(
|
133
132
|
"all_clusters",
|
134
|
-
|
133
|
+
[](ModelOutputs& self, size_t i, size_t j) {
|
135
134
|
return self.all_clusters[i][j];
|
136
135
|
})
|
137
136
|
.define_method(
|
138
137
|
"all_trees",
|
139
|
-
|
138
|
+
[](ModelOutputs& self, size_t i, size_t j) {
|
140
139
|
return self.all_trees[i][j];
|
141
140
|
});
|
142
141
|
|
143
142
|
rb_mExt
|
144
|
-
.
|
143
|
+
.define_singleton_function(
|
145
144
|
"fit_outliers_models",
|
146
|
-
|
145
|
+
[](Hash options) {
|
147
146
|
ModelOutputs model_outputs;
|
148
147
|
|
149
148
|
// data
|
@@ -219,9 +218,9 @@ void Init_ext()
|
|
219
218
|
);
|
220
219
|
return model_outputs;
|
221
220
|
})
|
222
|
-
.
|
221
|
+
.define_singleton_function(
|
223
222
|
"find_new_outliers",
|
224
|
-
|
223
|
+
[](ModelOutputs& model_outputs, Hash options) {
|
225
224
|
// data
|
226
225
|
size_t nrows = options.get<size_t, Symbol>("nrows");
|
227
226
|
size_t ncols_numeric = options.get<size_t, Symbol>("ncols_numeric");
|
data/ext/outliertree/extconf.rb
CHANGED
data/lib/outliertree/result.rb
CHANGED
@@ -22,7 +22,7 @@ module OutlierTree
|
|
22
22
|
if outl_col < @numeric_columns.size
|
23
23
|
column = @numeric_columns[outl_col]
|
24
24
|
value = df[column][row]
|
25
|
-
|
25
|
+
_decimals = model_outputs.outlier_decimals_distr[row]
|
26
26
|
else
|
27
27
|
column = @categorical_columns[outl_col - @numeric_columns.size]
|
28
28
|
value = df[column][row]
|
@@ -94,11 +94,11 @@ module OutlierTree
|
|
94
94
|
private
|
95
95
|
|
96
96
|
def add_condition(row, split_type, cluster)
|
97
|
-
|
97
|
+
_coldecim = 0
|
98
98
|
case cluster.column_type
|
99
99
|
when :numeric
|
100
100
|
cond_col = @numeric_columns[cluster.col_num]
|
101
|
-
|
101
|
+
_coldecim = model_outputs.min_decimals_col[cluster.col_num]
|
102
102
|
else
|
103
103
|
cond_col = @categorical_columns[cluster.col_num]
|
104
104
|
end
|
data/lib/outliertree/version.rb
CHANGED
data/vendor/outliertree/README.md
CHANGED
@@ -1,47 +1,60 @@
|
|
1
1
|
# OutlierTree
|
2
2
|
|
3
|
-
Explainable outlier/anomaly detection based on smart decision tree grouping, similar in spirit to the GritBot software developed by RuleQuest research. Written in C++ with interfaces for R and Python. Supports columns of types numeric, categorical, binary/boolean, and ordinal, and can handle missing values in all of them. Ideal as a sanity checker in exploratory data analysis.
|
4
|
-
|
5
|
-
# How it works
|
6
|
-
|
7
|
-
Will try to fit decision trees that try to "predict" values for each column based on the values of each other column. Along the way, each time a split is evaluated, it will take the observations that fall into each branch as a homogeneous cluster in which it will search for outliers in the 1-d distribution of the column being predicted. Outliers are determined according to confidence intervals on this 1-d distribution, and need to have a large gap with respect to the next observation in sorted order to be flagged as outliers. Since outliers are searched for in a decision tree branch, it will know the conditions that make it a rare observation compared to others that meet the same conditions, and the conditions will always be correlated with the target variable (as it's being predicted from them).
|
8
|
-
|
9
|
-
As such, it will only be able to detect outliers that can be described through a decision tree logic, and unlike other methods such as [Isolation Forests](https://github.com/david-cortes/isotree), will not be able to assign an outlier score to each observation, nor to detect outliers that are just overall rare, but will always provide a human-readable justification when it flags an outlier.
|
10
|
-
|
11
|
-
Procedure is described in more detail in [Explainable outlier detection through decision tree conditioning](http://arxiv.org/abs/2001.00636).
|
3
|
+
Explainable outlier/anomaly detection based on smart decision tree grouping, similar in spirit to the GritBot software developed by RuleQuest research. Written in C++ with interfaces for R and Python (additional Ruby wrapper can be found [here](https://github.com/ankane/outliertree/)). Supports columns of types numeric, categorical, binary/boolean, and ordinal, and can handle missing values in all of them. Ideal as a sanity checker in exploratory data analysis.
|
12
4
|
|
13
5
|
# Example outputs
|
14
6
|
|
15
|
-
Example outliers from [hypothyroid dataset](http://archive.ics.uci.edu/ml/datasets/thyroid+disease):
|
7
|
+
Example outliers from the [hypothyroid dataset](http://archive.ics.uci.edu/ml/datasets/thyroid+disease):
|
16
8
|
```
|
17
|
-
row [
|
18
|
-
distribution: 95.122% <= 42.
|
9
|
+
row [1138] - suspicious column: [age] - suspicious value: [75.00]
|
10
|
+
distribution: 95.122% <= 42.00 - [mean: 31.46] - [sd: 5.28] - [norm. obs: 39]
|
19
11
|
given:
|
20
|
-
[pregnant] = [
|
12
|
+
[pregnant] = [TRUE]
|
21
13
|
|
22
14
|
|
23
|
-
row [
|
24
|
-
distribution: 99.951% <= 7.
|
15
|
+
row [2230] - suspicious column: [T3] - suspicious value: [10.60]
|
16
|
+
distribution: 99.951% <= 7.10 - [mean: 1.98] - [sd: 0.75] - [norm. obs: 2050]
|
25
17
|
given:
|
26
|
-
[query
|
18
|
+
[query.hyperthyroid] = [FALSE]
|
19
|
+
|
20
|
+
row [745] - suspicious column: [TT4] - suspicious value: [239.00]
|
21
|
+
distribution: 98.571% <= 177.00 - [mean: 135.23] - [sd: 12.57] - [norm. obs: 69]
|
22
|
+
given:
|
23
|
+
[FTI] between (97.96, 128.12] (value: 112.74)
|
24
|
+
[T4U] > [1.12] (value: 2.12)
|
25
|
+
[age] > [55.00] (value: 87.00)
|
27
26
|
```
|
28
27
|
(i.e. it's saying that it's abnormal to be pregnant at the age of 75, or to not be classified as hyperthyroidal when having very high thyroid hormone levels)
|
29
28
|
(this dataset is also bundled into the R package - e.g. `data(hypothyroid)`)
|
30
29
|
|
31
30
|
|
32
|
-
Example
|
31
|
+
Example outliers from the [Titanic dataset](https://www.kaggle.com/c/titanic):
|
33
32
|
```
|
34
|
-
row [
|
35
|
-
distribution: 97.849% <= 15.
|
33
|
+
row [1147] - suspicious column: [Fare] - suspicious value: [29.12]
|
34
|
+
distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
|
36
35
|
given:
|
37
36
|
[Pclass] = [3]
|
38
37
|
[SibSp] = [0]
|
39
38
|
[Embarked] = [Q]
|
39
|
+
|
40
|
+
row [897] - suspicious column: [Fare] - suspicious value: [0.00]
|
41
|
+
distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
|
42
|
+
given:
|
43
|
+
[Pclass] = [3]
|
44
|
+
[SibSp] = [0]
|
40
45
|
```
|
41
|
-
(i.e. it's saying that the
|
46
|
+
(i.e. it's saying that the first person paid too much for the kind of accommodation he had, and the second person should not have gotten it for free)
|
42
47
|
|
43
48
|
_Note that it can also produce other types of conditions such as 'between' (for numeric intervals) or 'in' (for categorical subsets)_
|
44
49
|
|
50
|
+
# How it works
|
51
|
+
|
52
|
+
Will try to fit decision trees that try to "predict" values for each column based on the values of each other column. Along the way, each time a split is evaluated, it will take the observations that fall into each branch as a homogeneous cluster in which it will search for outliers in the 1-d distribution of the column being predicted. Outliers are determined according to confidence intervals on this 1-d distribution, and need to have a large gap with respect to the next observation in sorted order to be flagged as outliers. Since outliers are searched for in a decision tree branch, it will know the conditions that make it a rare observation compared to others that meet the same conditions, and the conditions will always be correlated with the target variable (as it's being predicted from them).
|
53
|
+
|
54
|
+
As such, it will only be able to detect outliers that can be described through a decision tree logic, and unlike other methods such as [Isolation Forests](https://github.com/david-cortes/isotree), will not be able to assign an outlier score to each observation, nor to detect outliers that are just overall rare, but will always provide a human-readable justification when it flags an outlier.
|
55
|
+
|
56
|
+
Procedure is described in more detail in [Explainable outlier detection through decision tree conditioning](http://arxiv.org/abs/2001.00636).
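The conditioning idea described above can be illustrated with a deliberately simplified sketch. This is not the library's algorithm: it uses a single categorical condition instead of fitted decision trees, and a leave-one-out mean and standard deviation instead of the confidence intervals from the paper. All names (`conditional_outliers`, `z_limit`, `gap_sds`) are assumptions made for illustration.

```ruby
def mean(xs)
  xs.sum / xs.size.to_f
end

def sd(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / xs.size.to_f)
end

# Toy version: group rows by one column, then flag a target value when it is
# far from the rest of its group AND separated from the nearest other value
# by a large gap.
def conditional_outliers(rows, target:, given:, z_limit: 3.0, gap_sds: 2.0)
  rows.group_by { |row| row[given] }.flat_map do |level, group|
    next [] if group.size < 4 # too few rows to describe a distribution

    group.each_with_index.filter_map do |row, i|
      others = group.each_with_index.reject { |_, j| j == i }.map { |r, _| r[target].to_f }
      value = row[target].to_f
      center = mean(others)
      spread = sd(others)
      gap = others.map { |x| (x - value).abs }.min
      next unless (value - center).abs > z_limit * spread && gap > gap_sds * spread

      direction = value > center ? "high" : "low"
      "#{target} (#{value}) looks #{direction} given #{given} is #{level}"
    end
  end
end

rows = [3.0, 3.2, 2.9, 3.1, 30.0].map { |price| { department: "Books", price: price } } +
       [5.0, 5.5, 4.5, 5.2, 4.8, 5.1].map { |price| { department: "Movies", price: price } }
puts conditional_outliers(rows, target: :price, given: :department)
```

The gap requirement is what keeps a value at the edge of a continuous spread from being flagged: it must also stand apart from its nearest neighbour within the group.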
|
57
|
+
|
45
58
|
# Installation
|
46
59
|
|
47
60
|
* For R:
|
@@ -54,18 +67,38 @@ install.packages("outliertree")
|
|
54
67
|
```
|
55
68
|
pip install outliertree
|
56
69
|
```
|
57
|
-
|
70
|
+
or if that fails:
|
71
|
+
```
|
72
|
+
pip install --no-use-pep517 outliertree
|
73
|
+
```
|
74
|
+
** *
|
58
75
|
|
59
|
-
**Note for macOS users:** on macOS, the Python version of this package
|
76
|
+
**Note for macOS users:** on macOS, the Python version of this package might compile **without** multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:
|
60
77
|
```
|
61
|
-
|
78
|
+
brew install libomp
|
79
|
+
```
|
80
|
+
And then reinstall this package: `pip install --force-reinstall outliertree`.
|
81
|
+
|
82
|
+
** *
|
83
|
+
**IMPORTANT:** the setup script will try to add the compilation flag `-march=native`. This instructs the compiler to tune the package for the CPU on which it is being installed, but the result might not be usable on other computers. If building a binary wheel of this package or putting it into a Docker image that will be used on different machines, this can be overridden by manually supplying compilation `CFLAGS` and `CXXFLAGS` as environment variables with an architecture-related setting. For maximum compatibility (but slowest speed), assuming `x86-64` computers, it's possible to do something like this:
|
84
|
+
|
85
|
+
```
|
86
|
+
export CFLAGS="-march=x86-64"
|
87
|
+
export CXXFLAGS="-march=x86-64"
|
62
88
|
pip install outliertree
|
63
89
|
```
|
64
|
-
(Alternatively, can also pass argument `enable-omp` to the `setup.py` file: `python setup.py install enable-omp`)
|
65
90
|
|
91
|
+
or for creating wheels:
|
92
|
+
```
|
93
|
+
export CFLAGS="-march=x86-64"
|
94
|
+
export CXXFLAGS="-march=x86-64"
|
95
|
+
python setup.py bdist_wheel
|
96
|
+
```
|
97
|
+
** *
|
66
98
|
|
67
99
|
* For C++: package doesn't have a build system, nor a `main` function that can produce an executable, but can be built as a shared object and wrapped into other languages with any C++11-compliant compiler (`std=c++11` in most compilers, `/std:c++14` in MSVC). For parallelization, needs OpenMP linkage (`-fopenmp` in most compilers, `/openmp` in MSVC). Package should *not* be built with optimization higher than `O3` (i.e. don't use `-Ofast`). Needs linkage to the `math` library, which should be enabled by default in most C++ compilers, but otherwise would require `-lm` argument. No external dependencies are required.
|
68
100
|
|
101
|
+
* For Ruby: see [external repository with wrapper](https://github.com/ankane/outliertree/).
|
69
102
|
|
70
103
|
# Sample usage
|
71
104
|
|
@@ -77,18 +110,18 @@ library(outliertree)
|
|
77
110
|
nrows = 100
|
78
111
|
set.seed(1)
|
79
112
|
df = data.frame(
|
80
|
-
|
81
|
-
|
82
|
-
|
83
|
-
|
113
|
+
numeric_col1 = c(rnorm(nrows - 1), 1e6),
|
114
|
+
numeric_col2 = rgamma(nrows, 1),
|
115
|
+
categ_col = sample(c('categA', 'categB', 'categC'), size = nrows, replace = TRUE)
|
116
|
+
)
|
84
117
|
|
85
118
|
### test data frame with another obvious outlier
|
86
119
|
nrows_test = 50
|
87
120
|
df_test = data.frame(
|
88
|
-
|
89
|
-
|
90
|
-
|
91
|
-
|
121
|
+
numeric_col1 = rnorm(nrows_test),
|
122
|
+
numeric_col2 = c(-1e6, rgamma(nrows_test - 1, 1)),
|
123
|
+
categ_col = sample(c('categA', 'categB', 'categC'), size = nrows_test, replace = TRUE)
|
124
|
+
)
|
92
125
|
|
93
126
|
### fit model
|
94
127
|
outliers_model = outliertree::outlier.tree(df, outliers_print = 10, save_outliers = TRUE)
|
@@ -113,17 +146,17 @@ from outliertree import OutlierTree
|
|
113
146
|
nrows = 100
|
114
147
|
np.random.seed(1)
|
115
148
|
df = pd.DataFrame({
|
116
|
-
|
117
|
-
|
118
|
-
|
119
|
-
|
149
|
+
"numeric_col1" : np.r_[np.random.normal(size = nrows - 1), np.array([float(1e6)])],
|
150
|
+
"numeric_col2" : np.random.gamma(1, 1, size = nrows),
|
151
|
+
"categ_col" : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
|
152
|
+
})
|
120
153
|
|
121
154
|
### test data frame with another obvious outlier
|
122
155
|
df_test = pd.DataFrame({
|
123
|
-
|
124
|
-
|
125
|
-
|
126
|
-
|
156
|
+
"numeric_col1" : np.random.normal(size = nrows),
|
157
|
+
"numeric_col2" : np.r_[np.array([float(-1e6)]), np.random.gamma(1, 1, size = nrows - 1)],
|
158
|
+
"categ_col" : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
|
159
|
+
})
|
127
160
|
|
128
161
|
### fit model
|
129
162
|
outliers_model = OutlierTree()
|
@@ -138,6 +171,8 @@ outliers_model.print_outliers(new_outliers)
|
|
138
171
|
|
139
172
|
Example [IPython notebook](http://nbviewer.ipython.org/github/david-cortes/outliertree/blob/master/example/titanic_outliertree_python.ipynb) using the Titanic dataset.
|
140
173
|
|
174
|
+
* For Ruby: see the [external repository](https://github.com/ankane/outliertree/).
|
175
|
+
|
141
176
|
* For C++: see functions `fit_outliers_models` and `find_new_outliers` in header `outlier_tree.hpp`.
|
142
177
|
|
143
178
|
# Documentation
|
@@ -146,6 +181,8 @@ Example [IPython notebook](http://nbviewer.ipython.org/github/david-cortes/outli
|
|
146
181
|
|
147
182
|
* For Python: documentation is available at [ReadTheDocs](http://outliertree.readthedocs.io/en/latest/) (and it's also built-in in the package as docstrings, e.g. `help(outliertree.OutlierTree.fit)`).
|
148
183
|
|
184
|
+
* For Ruby: see the [external repository](https://github.com/ankane/outliertree/) and the [Python documentation](http://outliertree.readthedocs.io/en/latest/).
|
185
|
+
|
149
186
|
* For C++: documentation is available in the source files (not in the header).
|
150
187
|
|
151
188
|
# References
|