outliertree 0.1.2 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 9a3e3de44aa1af8f0db3ca34f8ef4fc91cda84ee0ee2d75e142fe975e7821d6b
- data.tar.gz: 1d0e8532fc9f319c0879948dc3c117f1a9b1c1cd64cb740d19bdbe657f015a51
+ metadata.gz: 2851b4b56b23141bc9f1ef5b3c448fb75d785ce0e7b38580113001898ce18e2e
+ data.tar.gz: 817325392325bc61f1dea1363096678fe9fd578ec6026301e173447edc522752
  SHA512:
- metadata.gz: 4df1d581ebb43267c0608c90011298bb452929f2fefe579b1786d99b390c0df8e7f56129ec18d6d043427482a17e719e7de0515791272a15a70a7b820db9b1ea
- data.tar.gz: 4ae398b9da0d6bff07e2d22dd561ef900a4c9a3a84d4581c18d783bbbd47fe3b2d01d91d7e1ad91b8641c2c533ee0fa67a3e0165168433a814c43e93d6a9c080
+ metadata.gz: e1bc84c131959bb7260100b4aa1e345ae09330600252299084efe5d40fbff8d8d9d9aadf78c6384a0b6a158a1b3bbcdeff9de3ce71fca90ce938003125b37898
+ data.tar.gz: cad988456c492f101bc71334997217a48d6588c4e2e5feee333bc074f3b14b62557d0e64429fca787705244b701eca7d685aa17f090dad94578fce172812e5a0
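The entries in checksums.yaml are SHA256 and SHA512 digests of the `metadata.gz` and `data.tar.gz` members of the gem archive. As a minimal sketch, a member extracted from the `.gem` file could be checked against a listed value with Ruby's standard `digest` library. The helper name `digest_matches?` and the file path in the usage comment are hypothetical and are not part of RubyGems.

```ruby
require "digest"

# Hypothetical helper (not a RubyGems API): returns true when the file at
# `path` hashes to `expected_hex` under the chosen algorithm.
def digest_matches?(path, expected_hex, algorithm: :sha256)
  klass = algorithm == :sha512 ? Digest::SHA512 : Digest::SHA256
  klass.file(path).hexdigest == expected_hex.downcase
end

# Usage, assuming data.tar.gz was extracted from the 0.3.0 gem archive:
# digest_matches?(
#   "data.tar.gz",
#   "817325392325bc61f1dea1363096678fe9fd578ec6026301e173447edc522752"
# )
```

Passing `algorithm: :sha512` checks the longer digest from the same file.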
data/CHANGELOG.md CHANGED
@@ -1,3 +1,17 @@
+ ## 0.3.0 (2022-06-13)
+
+ - Updated OutlierTree to 1.8.1
+ - Dropped support for Ruby < 2.7
+
+ ## 0.2.1 (2021-05-23)
+
+ - Improved performance
+
+ ## 0.2.0 (2021-05-17)
+
+ - Updated to Rice 4
+ - Dropped support for Ruby < 2.6
+
  ## 0.1.2 (2021-02-08)

  - Fixed error with missing numeric values
data/NOTICE.txt CHANGED
@@ -1,5 +1,5 @@
  Copyright (C) 2019-2020 David Cortes
- Copyright (C) 2020-2021 Andrew Kane
+ Copyright (C) 2020-2022 Andrew Kane

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
data/README.md CHANGED
@@ -1,4 +1,4 @@
- # OutlierTree
+ # OutlierTree Ruby

  :deciduous_tree: [OutlierTree](https://github.com/david-cortes/outliertree) - explainable outlier/anomaly detection - for Ruby

@@ -8,16 +8,16 @@ Produces human-readable explanations for why values are detected as outliers
  Price (2.50) looks low given Department is Books and Sale is false
  ```

- :evergreen_tree: Check out [IsoTree](https://github.com/ankane/isotree) for an alternative approach that uses Isolation Forest
+ :evergreen_tree: Check out [IsoTree](https://github.com/ankane/isotree-ruby) for an alternative approach that uses Isolation Forest

- [![Build Status](https://github.com/ankane/outliertree/workflows/build/badge.svg?branch=master)](https://github.com/ankane/outliertree/actions)
+ [![Build Status](https://github.com/ankane/outliertree-ruby/workflows/build/badge.svg?branch=master)](https://github.com/ankane/outliertree-ruby/actions)

  ## Installation

  Add this line to your application’s Gemfile:

  ```ruby
- gem 'outliertree'
+ gem "outliertree"
  ```

  ## Getting Started
@@ -28,7 +28,8 @@ Prep your data
  data = [
  {department: "Books", sale: false, price: 2.50},
  {department: "Books", sale: true, price: 3.00},
- {department: "Movies", sale: false, price: 5.00}
+ {department: "Movies", sale: false, price: 5.00},
+ # ...
  ]
  ```

@@ -108,22 +109,22 @@ bundle install

  ## History

- View the [changelog](https://github.com/ankane/outliertree/blob/master/CHANGELOG.md)
+ View the [changelog](https://github.com/ankane/outliertree-ruby/blob/master/CHANGELOG.md)

  ## Contributing

  Everyone is encouraged to help improve this project. Here are a few ways you can help:

- - [Report bugs](https://github.com/ankane/outliertree/issues)
- - Fix bugs and [submit pull requests](https://github.com/ankane/outliertree/pulls)
+ - [Report bugs](https://github.com/ankane/outliertree-ruby/issues)
+ - Fix bugs and [submit pull requests](https://github.com/ankane/outliertree-ruby/pulls)
  - Write, clarify, or fix documentation
  - Suggest or add new features

  To get started with development:

  ```sh
- git clone --recursive https://github.com/ankane/outliertree.git
- cd outliertree
+ git clone --recursive https://github.com/ankane/outliertree-ruby.git
+ cd outliertree-ruby
  bundle install
  bundle exec rake compile
  bundle exec rake test
@@ -2,12 +2,8 @@
  #include <outlier_tree.hpp>

  // rice
- #include <rice/Array.hpp>
- #include <rice/Hash.hpp>
- #include <rice/Module.hpp>
- #include <rice/Object.hpp>
- #include <rice/String.hpp>
- #include <rice/Symbol.hpp>
+ #include <rice/rice.hpp>
+ #include <rice/stl.hpp>

  using Rice::Array;
  using Rice::Hash;
@@ -18,74 +14,77 @@ using Rice::Symbol;
  using Rice::define_class_under;
  using Rice::define_module;

- template<>
- Object to_ruby<std::vector<char>>(std::vector<char> const & x)
+ namespace Rice::detail
  {
- Array a;
- for (size_t i = 0; i < x.size(); i++) {
- a.push(x[i]);
- }
- return a;
- }
+ template<typename T>
+ class To_Ruby<std::vector<T>>
+ {
+ public:
+ VALUE convert(std::vector<T> const & x)
+ {
+ auto a = rb_ary_new2(x.size());
+ for (const auto& v : x) {
+ rb_ary_push(a, To_Ruby<T>().convert(v));
+ }
+ return a;
+ }
+ };

- template<>
- Object to_ruby<std::vector<int>>(std::vector<int> const & x)
- {
- Array a;
- for (size_t i = 0; i < x.size(); i++) {
- a.push(x[i]);
- }
- return a;
- }
+ template<>
+ struct Type<ColType>
+ {
+ static bool verify()
+ {
+ return true;
+ }
+ };

- template<>
- Object to_ruby<std::vector<unsigned long>>(std::vector<unsigned long> const & x)
- {
- Array a;
- for (size_t i = 0; i < x.size(); i++) {
- a.push(x[i]);
- }
- return a;
- }
+ template<>
+ class To_Ruby<ColType>
+ {
+ public:
+ VALUE convert(ColType const & x)
+ {
+ switch (x) {
+ case Numeric: return Symbol("numeric");
+ case Categorical: return Symbol("categorical");
+ case Ordinal: return Symbol("ordinal");
+ case NoType: return Symbol("no_type");
+ }
+ throw std::runtime_error("Unknown column type");
+ }
+ };

- template<>
- Object to_ruby<std::vector<double>>(std::vector<double> const & x)
- {
- Array a;
- for (size_t i = 0; i < x.size(); i++) {
- a.push(x[i]);
- }
- return a;
- }
+ template<>
+ struct Type<SplitType>
+ {
+ static bool verify()
+ {
+ return true;
+ }
+ };

- template<>
- Object to_ruby<ColType>(ColType const & x)
- {
- switch (x) {
- case Numeric: return Symbol("numeric");
- case Categorical: return Symbol("categorical");
- case Ordinal: return Symbol("ordinal");
- case NoType: return Symbol("no_type");
- }
- throw std::runtime_error("Unknown column type");
- }
-
- template<>
- Object to_ruby<SplitType>(SplitType const & x)
- {
- switch (x) {
- case LessOrEqual: return Symbol("less_or_equal");
- case Greater: return Symbol("greater");
- case Equal: return Symbol("equal");
- case NotEqual: return Symbol("not_equal");
- case InSubset: return Symbol("in_subset");
- case NotInSubset: return Symbol("not_in_subset");
- case SingleCateg: return Symbol("single_categ");
- case SubTrees: return Symbol("sub_trees");
- case IsNa: return Symbol("is_na");
- case Root: return Symbol("root");
- }
- throw std::runtime_error("Unknown split type");
+ template<>
+ class To_Ruby<SplitType>
+ {
+ public:
+ VALUE convert(SplitType const & x)
+ {
+ switch (x) {
+ case LessOrEqual: return Symbol("less_or_equal");
+ case Greater: return Symbol("greater");
+ case Equal: return Symbol("equal");
+ case NotEqual: return Symbol("not_equal");
+ case InSubset: return Symbol("in_subset");
+ case NotInSubset: return Symbol("not_in_subset");
+ case SingleCateg: return Symbol("single_categ");
+ case SubTrees: return Symbol("sub_trees");
+ case IsNa: return Symbol("is_na");
+ case Root: return Symbol("root");
+ }
+ throw std::runtime_error("Unknown split type");
+ }
+ };
  }

  extern "C"
@@ -95,55 +94,55 @@ void Init_ext()
  Module rb_mExt = define_module_under(rb_mOutlierTree, "Ext");

  define_class_under<Cluster>(rb_mExt, "Cluster")
- .define_method("upper_lim", *[](Cluster& self) { return self.upper_lim; })
- .define_method("display_lim_high", *[](Cluster& self) { return self.display_lim_high; })
- .define_method("perc_below", *[](Cluster& self) { return self.perc_below; })
- .define_method("display_lim_low", *[](Cluster& self) { return self.display_lim_low; })
- .define_method("perc_above", *[](Cluster& self) { return self.perc_above; })
- .define_method("display_mean", *[](Cluster& self) { return self.display_mean; })
- .define_method("display_sd", *[](Cluster& self) { return self.display_sd; })
- .define_method("cluster_size", *[](Cluster& self) { return self.cluster_size; })
- .define_method("split_point", *[](Cluster& self) { return self.split_point; })
- .define_method("split_subset", *[](Cluster& self) { return self.split_subset; })
- .define_method("split_lev", *[](Cluster& self) { return self.split_lev; })
- .define_method("split_type", *[](Cluster& self) { return self.split_type; })
- .define_method("column_type", *[](Cluster& self) { return self.column_type; })
- .define_method("has_na_branch", *[](Cluster& self) { return self.has_NA_branch; })
- .define_method("col_num", *[](Cluster& self) { return self.col_num; });
+ .define_method("upper_lim", [](Cluster& self) { return self.upper_lim; })
+ .define_method("display_lim_high", [](Cluster& self) { return self.display_lim_high; })
+ .define_method("perc_below", [](Cluster& self) { return self.perc_below; })
+ .define_method("display_lim_low", [](Cluster& self) { return self.display_lim_low; })
+ .define_method("perc_above", [](Cluster& self) { return self.perc_above; })
+ .define_method("display_mean", [](Cluster& self) { return self.display_mean; })
+ .define_method("display_sd", [](Cluster& self) { return self.display_sd; })
+ .define_method("cluster_size", [](Cluster& self) { return self.cluster_size; })
+ .define_method("split_point", [](Cluster& self) { return self.split_point; })
+ .define_method("split_subset", [](Cluster& self) { return self.split_subset; })
+ .define_method("split_lev", [](Cluster& self) { return self.split_lev; })
+ .define_method("split_type", [](Cluster& self) { return self.split_type; })
+ .define_method("column_type", [](Cluster& self) { return self.column_type; })
+ .define_method("has_na_branch", [](Cluster& self) { return self.has_NA_branch; })
+ .define_method("col_num", [](Cluster& self) { return self.col_num; });

  define_class_under<ClusterTree>(rb_mExt, "ClusterTree")
- .define_method("parent_branch", *[](ClusterTree& self) { return self.parent_branch; })
- .define_method("parent", *[](ClusterTree& self) { return self.parent; })
- .define_method("all_branches", *[](ClusterTree& self) { return self.all_branches; })
- .define_method("column_type", *[](ClusterTree& self) { return self.column_type; })
- .define_method("col_num", *[](ClusterTree& self) { return self.col_num; })
- .define_method("split_point", *[](ClusterTree& self) { return self.split_point; })
- .define_method("split_subset", *[](ClusterTree& self) { return self.split_subset; })
- .define_method("split_lev", *[](ClusterTree& self) { return self.split_lev; });
+ .define_method("parent_branch", [](ClusterTree& self) { return self.parent_branch; })
+ .define_method("parent", [](ClusterTree& self) { return self.parent; })
+ .define_method("all_branches", [](ClusterTree& self) { return self.all_branches; })
+ .define_method("column_type", [](ClusterTree& self) { return self.column_type; })
+ .define_method("col_num", [](ClusterTree& self) { return self.col_num; })
+ .define_method("split_point", [](ClusterTree& self) { return self.split_point; })
+ .define_method("split_subset", [](ClusterTree& self) { return self.split_subset; })
+ .define_method("split_lev", [](ClusterTree& self) { return self.split_lev; });

  define_class_under<ModelOutputs>(rb_mExt, "ModelOutputs")
- .define_method("outlier_scores_final", *[](ModelOutputs& self) { return self.outlier_scores_final; })
- .define_method("outlier_columns_final", *[](ModelOutputs& self) { return self.outlier_columns_final; })
- .define_method("outlier_clusters_final", *[](ModelOutputs& self) { return self.outlier_clusters_final; })
- .define_method("outlier_trees_final", *[](ModelOutputs& self) { return self.outlier_trees_final; })
- .define_method("outlier_depth_final", *[](ModelOutputs& self) { return self.outlier_depth_final; })
- .define_method("outlier_decimals_distr", *[](ModelOutputs& self) { return self.outlier_decimals_distr; })
- .define_method("min_decimals_col", *[](ModelOutputs& self) { return self.min_decimals_col; })
+ .define_method("outlier_scores_final", [](ModelOutputs& self) { return self.outlier_scores_final; })
+ .define_method("outlier_columns_final", [](ModelOutputs& self) { return self.outlier_columns_final; })
+ .define_method("outlier_clusters_final", [](ModelOutputs& self) { return self.outlier_clusters_final; })
+ .define_method("outlier_trees_final", [](ModelOutputs& self) { return self.outlier_trees_final; })
+ .define_method("outlier_depth_final", [](ModelOutputs& self) { return self.outlier_depth_final; })
+ .define_method("outlier_decimals_distr", [](ModelOutputs& self) { return self.outlier_decimals_distr; })
+ .define_method("min_decimals_col", [](ModelOutputs& self) { return self.min_decimals_col; })
  .define_method(
  "all_clusters",
- *[](ModelOutputs& self, size_t i, size_t j) {
+ [](ModelOutputs& self, size_t i, size_t j) {
  return self.all_clusters[i][j];
  })
  .define_method(
  "all_trees",
- *[](ModelOutputs& self, size_t i, size_t j) {
+ [](ModelOutputs& self, size_t i, size_t j) {
  return self.all_trees[i][j];
  });

  rb_mExt
- .define_singleton_method(
+ .define_singleton_function(
  "fit_outliers_models",
- *[](Hash options) {
+ [](Hash options) {
  ModelOutputs model_outputs;

  // data
@@ -219,9 +218,9 @@ void Init_ext()
  );
  return model_outputs;
  })
- .define_singleton_method(
+ .define_singleton_function(
  "find_new_outliers",
- *[](ModelOutputs& model_outputs, Hash options) {
+ [](ModelOutputs& model_outputs, Hash options) {
  // data
  size_t nrows = options.get<size_t, Symbol>("nrows");
  size_t ncols_numeric = options.get<size_t, Symbol>("ncols_numeric");
@@ -1,6 +1,6 @@
  require "mkmf-rice"

- $CXXFLAGS += " -std=c++11"
+ $CXXFLAGS += " -std=c++17 $(optflags) -DDONT_THROW_ON_INTERRUPT"

  apple_clang = RbConfig::CONFIG["CC_VERSION_MESSAGE"] =~ /apple clang/i

@@ -22,7 +22,7 @@ module OutlierTree
  if outl_col < @numeric_columns.size
  column = @numeric_columns[outl_col]
  value = df[column][row]
- decimals = model_outputs.outlier_decimals_distr[row]
+ _decimals = model_outputs.outlier_decimals_distr[row]
  else
  column = @categorical_columns[outl_col - @numeric_columns.size]
  value = df[column][row]
@@ -94,11 +94,11 @@ module OutlierTree
  private

  def add_condition(row, split_type, cluster)
- coldecim = 0
+ _coldecim = 0
  case cluster.column_type
  when :numeric
  cond_col = @numeric_columns[cluster.col_num]
- coldecim = model_outputs.min_decimals_col[cluster.col_num]
+ _coldecim = model_outputs.min_decimals_col[cluster.col_num]
  else
  cond_col = @categorical_columns[cluster.col_num]
  end
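The C++ extension converts `SplitType` values into Ruby symbols such as `:less_or_equal` and `:in_subset`, and `add_condition` branches on `cluster.column_type`. As an illustrative sketch only, those symbols could be mapped to readable conditions as shown here. The `describe_condition` helper and its output wording are hypothetical; this is not the gem's implementation.

```ruby
# Hypothetical helper, not part of the outliertree gem.
# Maps a split-type symbol (the kind produced by the extension's
# To_Ruby<SplitType> conversion) to a short readable condition on `column`.
def describe_condition(split_type, column, value)
  case split_type
  when :less_or_equal then "#{column} <= #{value}"
  when :greater       then "#{column} > #{value}"
  when :equal         then "#{column} = #{value}"
  when :not_equal     then "#{column} != #{value}"
  when :in_subset     then "#{column} in [#{Array(value).join(', ')}]"
  when :not_in_subset then "#{column} not in [#{Array(value).join(', ')}]"
  when :is_na         then "#{column} is NA"
  else
    # Structural types such as :root or :sub_trees carry no condition here.
    raise ArgumentError, "Unhandled split type: #{split_type.inspect}"
  end
end

puts describe_condition(:greater, "age", 55.0)
puts describe_condition(:in_subset, "Embarked", ["Q", "S"])
```

Raising on unhandled symbols mirrors the `throw std::runtime_error` fallbacks in the C++ conversions, so an unexpected value fails loudly instead of producing an empty condition.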
@@ -1,3 +1,3 @@
  module OutlierTree
- VERSION = "0.1.2"
+ VERSION = "0.3.0"
  end
@@ -1,47 +1,60 @@
  # OutlierTree

- Explainable outlier/anomaly detection based on smart decision tree grouping, similar in spirit to the GritBot software developed by RuleQuest research. Written in C++ with interfaces for R and Python. Supports columns of types numeric, categorical, binary/boolean, and ordinal, and can handle missing values in all of them. Ideal as a sanity checker in exploratory data analysis.
-
- # How it works
-
- Will try to fit decision trees that try to "predict" values for each column based on the values of each other column. Along the way, each time a split is evaluated, it will take the observations that fall into each branch as a homogeneous cluster in which it will search for outliers in the 1-d distribution of the column being predicted. Outliers are determined according to confidence intervals on this 1-d distribution, and need to have a large gap with respect to the next observation in sorted order to be flagged as outliers. Since outliers are searched for in a decision tree branch, it will know the conditions that make it a rare observation compared to others that meet the same conditions, and the conditions will always be correlated with the target variable (as it's being predicted from them).
-
- As such, it will only be able to detect outliers that can be described through a decision tree logic, and unlike other methods such as [Isolation Forests](https://github.com/david-cortes/isotree), will not be able to assign an outlier score to each observation, nor to detect outliers that are just overall rare, but will always provide a human-readable justification when it flags an outlier.
-
- Procedure is described in more detail in [Explainable outlier detection through decision tree conditioning](http://arxiv.org/abs/2001.00636).
+ Explainable outlier/anomaly detection based on smart decision tree grouping, similar in spirit to the GritBot software developed by RuleQuest research. Written in C++ with interfaces for R and Python (additional Ruby wrapper can be found [here](https://github.com/ankane/outliertree/)). Supports columns of types numeric, categorical, binary/boolean, and ordinal, and can handle missing values in all of them. Ideal as a sanity checker in exploratory data analysis.

  # Example outputs

- Example outliers from [hypothyroid dataset](http://archive.ics.uci.edu/ml/datasets/thyroid+disease):
+ Example outliers from the [hypothyroid dataset](http://archive.ics.uci.edu/ml/datasets/thyroid+disease):
  ```
- row [1137] - suspicious column: [age] - suspicious value: [75.000]
- distribution: 95.122% <= 42.000 - [mean: 31.462] - [sd: 5.281] - [norm. obs: 39]
+ row [1138] - suspicious column: [age] - suspicious value: [75.00]
+ distribution: 95.122% <= 42.00 - [mean: 31.46] - [sd: 5.28] - [norm. obs: 39]
  given:
- [pregnant] = [t]
+ [pregnant] = [TRUE]


- row [2229] - suspicious column: [T3] - suspicious vale: [10.600]
- distribution: 99.951% <= 7.100 - [mean: 1.984] - [sd: 0.750] - [norm. obs: 2050]
+ row [2230] - suspicious column: [T3] - suspicious value: [10.60]
+ distribution: 99.951% <= 7.10 - [mean: 1.98] - [sd: 0.75] - [norm. obs: 2050]
  given:
- [query hyperthyroid] = [f]
+ [query.hyperthyroid] = [FALSE]
+
+ row [745] - suspicious column: [TT4] - suspicious value: [239.00]
+ distribution: 98.571% <= 177.00 - [mean: 135.23] - [sd: 12.57] - [norm. obs: 69]
+ given:
+ [FTI] between (97.96, 128.12] (value: 112.74)
+ [T4U] > [1.12] (value: 2.12)
+ [age] > [55.00] (value: 87.00)
  ```
  (i.e. it's saying that it's abnormal to be pregnant at the age of 75, or to not be classified as hyperthyroidal when having very high thyroid hormone levels)
  (this dataset is also bundled into the R package - e.g. `data(hypothyroid)`)


- Example outlier from [Titanic dataset](https://www.kaggle.com/c/titanic):
+ Example outliers from the [Titanic dataset](https://www.kaggle.com/c/titanic):
  ```
- row [885] - suspicious column: [Fare] - suspicious value: [29.125]
- distribution: 97.849% <= 15.500 - [mean: 7.887] - [sd: 1.173] - [norm. obs: 91]
+ row [1147] - suspicious column: [Fare] - suspicious value: [29.12]
+ distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
  given:
  [Pclass] = [3]
  [SibSp] = [0]
  [Embarked] = [Q]
+
+ row [897] - suspicious column: [Fare] - suspicious value: [0.00]
+ distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
+ given:
+ [Pclass] = [3]
+ [SibSp] = [0]
  ```
- (i.e. it's saying that the this person paid too much for the kind of accomodation he had)
+ (i.e. it's saying that the the first person paid too much for the kind of accomodation he had, and the second person should not have gotten it for free)

  _Note that it can also produce other types of conditions such as 'between' (for numeric intervals) or 'in' (for categorical subsets)_

+ # How it works
+
+ Will try to fit decision trees that try to "predict" values for each column based on the values of each other column. Along the way, each time a split is evaluated, it will take the observations that fall into each branch as a homogeneous cluster in which it will search for outliers in the 1-d distribution of the column being predicted. Outliers are determined according to confidence intervals on this 1-d distribution, and need to have a large gap with respect to the next observation in sorted order to be flagged as outliers. Since outliers are searched for in a decision tree branch, it will know the conditions that make it a rare observation compared to others that meet the same conditions, and the conditions will always be correlated with the target variable (as it's being predicted from them).
+
+ As such, it will only be able to detect outliers that can be described through a decision tree logic, and unlike other methods such as [Isolation Forests](https://github.com/david-cortes/isotree), will not be able to assign an outlier score to each observation, nor to detect outliers that are just overall rare, but will always provide a human-readable justification when it flags an outlier.
+
+ Procedure is described in more detail in [Explainable outlier detection through decision tree conditioning](http://arxiv.org/abs/2001.00636).
+
  # Installation

  * For R:
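The "How it works" text in the hunk above describes flagging a value when it lies outside a confidence interval of the 1-d distribution within a tree branch and is also separated from the next observation by a large gap. The following is a heavily simplified sketch of that idea for the upper tail of a single cluster. The function names, the z-score threshold, the gap rule, and the choice to estimate mean and standard deviation from the remaining observations are all assumptions made for illustration; this is not the algorithm implemented in the OutlierTree C++ library.

```ruby
# Simplified illustration only; NOT the OutlierTree algorithm.

# Sample mean and standard deviation of an array of numbers.
def mean_and_sd(xs)
  mean = xs.sum / xs.size.to_f
  var = xs.sum { |x| (x - mean)**2 } / (xs.size - 1).to_f
  [mean, Math.sqrt(var)]
end

# Flags upper-tail values of one cluster. A candidate (the current maximum)
# is flagged when it is both far from the mean of the remaining values and
# separated from the next-largest value by a large gap, both in sd units.
def upper_tail_outliers(values, z_threshold: 3.0, min_gap_sds: 2.0)
  sorted = values.sort
  flagged = []
  while sorted.size > 3
    candidate = sorted.last
    rest = sorted[0...-1]
    mean, sd = mean_and_sd(rest)
    break if sd.zero?
    far_from_mean = (candidate - mean) / sd > z_threshold
    large_gap = (candidate - rest.last) / sd > min_gap_sds
    break unless far_from_mean && large_gap
    flagged << candidate
    sorted = rest
  end
  flagged
end

# Made-up fares for one hypothetical branch (e.g. Pclass = 3, SibSp = 0).
fares = [7.25, 7.75, 7.9, 8.05, 7.65, 8.1, 7.8, 29.12]
p upper_tail_outliers(fares) # => [29.12]
```

Because the check runs inside one branch, the branch conditions double as the explanation for the flagged value, which is what the "given:" lines in the example outputs convey.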
@@ -54,18 +67,38 @@ install.packages("outliertree")
  ```
  pip install outliertree
  ```
- (Package has only been tested in Python 3)
+ or if that fails:
+ ```
+ pip install --no-use-pep517 outliertree
+ ```
+ ** *

- **Note for macOS users:** on macOS, the Python version of this package will compile **without** multi-threading capabilities. This is due to default apple's redistribution of `clang` not providing OpenMP modules, and aliasing it to `gcc` which causes confusions in build scripts. If you have a non-apple version of `clang` with the OpenMP modules, or if you have `gcc` installed, you can compile this package with multi-threading enabled by setting up an environment variable `ENABLE_OMP=1`:
+ **Note for macOS users:** on macOS, the Python version of this package might compile **without** multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:
  ```
- export ENABLE_OMP=1
+ brew install libomp
+ ```
+ And then reinstall this package: `pip install --force-reinstall outliertree`.
+
+ ** *
+ **IMPORTANT:** the setup script will try to add compilation flag `-march=native`. This instructs the compiler to tune the package for the CPU in which it is being installed, but the result might not be usable in other computers. If building a binary wheel of this package or putting it into a docker image which will be used in different machines, this can be overriden by manually supplying compilation `CFLAGS` and `CXXFLAGS` as environment variables with something related to architecture. For maximum compatibility (but slowest speed), assuming `x86-64` computers, it's possible to do something like this:
+
+ ```
+ export CFLAGS="-march=x86-64"
+ export CXXFLAGS="-march=x86-64"
  pip install outliertree
  ```
- (Alternatively, can also pass argument `enable-omp` to the `setup.py` file: `python setup.py install enable-omp`)

+ or for creating wheels:
+ ```
+ export CFLAGS="-march=x86-64"
+ export CXXFLAGS="-march=x86-64"
+ python setup.py bwheel
+ ```
+ ** *

  * For C++: package doesn't have a build system, nor a `main` function that can produce an executable, but can be built as a shared object and wrapped into other languages with any C++11-compliant compiler (`std=c++11` in most compilers, `/std:c++14` in MSVC). For parallelization, needs OpenMP linkage (`-fopenmp` in most compilers, `/openmp` in MSVC). Package should *not* be built with optimization higher than `O3` (i.e. don't use `-Ofast`). Needs linkage to the `math` library, which should be enabled by default in most C++ compilers, but otherwise would require `-lm` argument. No external dependencies are required.

+ * For Ruby: see [external repository with wrapper](https://github.com/ankane/outliertree/).

  # Sample usage

@@ -77,18 +110,18 @@ library(outliertree)
  nrows = 100
  set.seed(1)
  df = data.frame(
- numeric_col1 = c(rnorm(nrows - 1), 1e6),
- numeric_col2 = rgamma(nrows, 1),
- categ_col = sample(c('categA', 'categB', 'categC'), size = nrows, replace = TRUE)
- )
+ numeric_col1 = c(rnorm(nrows - 1), 1e6),
+ numeric_col2 = rgamma(nrows, 1),
+ categ_col = sample(c('categA', 'categB', 'categC'), size = nrows, replace = TRUE)
+ )

  ### test data frame with another obvious outlier
  nrows_test = 50
  df_test = data.frame(
- numeric_col1 = rnorm(nrows_test),
- numeric_col2 = c(-1e6, rgamma(nrows_test - 1, 1)),
- categ_col = sample(c('categA', 'categB', 'categC'), size = nrows_test, replace = TRUE)
- )
+ numeric_col1 = rnorm(nrows_test),
+ numeric_col2 = c(-1e6, rgamma(nrows_test - 1, 1)),
+ categ_col = sample(c('categA', 'categB', 'categC'), size = nrows_test, replace = TRUE)
+ )

  ### fit model
  outliers_model = outliertree::outlier.tree(df, outliers_print = 10, save_outliers = TRUE)
@@ -113,17 +146,17 @@ from outliertree import OutlierTree
  nrows = 100
  np.random.seed(1)
  df = pd.DataFrame({
- "numeric_col1" : np.r_[np.random.normal(size = nrows - 1), np.array([float(1e6)])],
- "numeric_col2" : np.random.gamma(1, 1, size = nrows),
- "categ_col" : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
- })
+ "numeric_col1" : np.r_[np.random.normal(size = nrows - 1), np.array([float(1e6)])],
+ "numeric_col2" : np.random.gamma(1, 1, size = nrows),
+ "categ_col" : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
+ })

  ### test data frame with another obvious outlier
  df_test = pd.DataFrame({
- "numeric_col1" : np.random.normal(size = nrows),
- "numeric_col2" : np.r_[np.array([float(-1e6)]), np.random.gamma(1, 1, size = nrows - 1)],
- "categ_col" : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
- })
+ "numeric_col1" : np.random.normal(size = nrows),
+ "numeric_col2" : np.r_[np.array([float(-1e6)]), np.random.gamma(1, 1, size = nrows - 1)],
+ "categ_col" : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
+ })

  ### fit model
  outliers_model = OutlierTree()
@@ -138,6 +171,8 @@ outliers_model.print_outliers(new_outliers)

  Example [IPython notebook](http://nbviewer.ipython.org/github/david-cortes/outliertree/blob/master/example/titanic_outliertree_python.ipynb) using the Titanic dataset.

+ * For Ruby: see the [external repository](https://github.com/ankane/outliertree/).
+
  * For C++: see functions `fit_outliers_models` and `find_new_outliers` in header `outlier_tree.hpp`.

  # Documentation
@@ -146,6 +181,8 @@ Example [IPython notebook](http://nbviewer.ipython.org/github/david-cortes/outli

  * For Python: documentation is available at [ReadTheDocs](http://outliertree.readthedocs.io/en/latest/) (and it's also built-in in the package as docstrings, e.g. `help(outliertree.OutlierTree.fit)`).

+ * For Ruby: see the [external repository](https://github.com/ankane/outliertree/) and the [Python documentation](http://outliertree.readthedocs.io/en/latest/).
+
  * For C++: documentation is available in the source files (not in the header).

  # References
@@ -0,0 +1,4 @@
+ PKG_CPPFLAGS = -D_FOR_R @SUPPORTS_RESTRICT@
+ PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) @FNE_FLAG@ @FNTP_FLAG@ $(CXX_VISIBILITY)
+ PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)
+ CXX_STD = CXX11
@@ -0,0 +1,4 @@
+ PKG_CPPFLAGS = -D_FOR_R
+ PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -fno-trapping-math -fno-math-errno
+ PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)
+ CXX_STD = CXX11