isotree 0.2.2 → 0.3.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -1
- data/LICENSE.txt +2 -2
- data/README.md +32 -14
- data/ext/isotree/ext.cpp +144 -31
- data/ext/isotree/extconf.rb +7 -7
- data/lib/isotree/isolation_forest.rb +110 -30
- data/lib/isotree/version.rb +1 -1
- data/vendor/isotree/LICENSE +1 -1
- data/vendor/isotree/README.md +165 -27
- data/vendor/isotree/include/isotree.hpp +2111 -0
- data/vendor/isotree/include/isotree_oop.hpp +394 -0
- data/vendor/isotree/inst/COPYRIGHTS +62 -0
- data/vendor/isotree/src/RcppExports.cpp +525 -52
- data/vendor/isotree/src/Rwrapper.cpp +1931 -268
- data/vendor/isotree/src/c_interface.cpp +953 -0
- data/vendor/isotree/src/crit.hpp +4232 -0
- data/vendor/isotree/src/dist.hpp +1886 -0
- data/vendor/isotree/src/exp_depth_table.hpp +134 -0
- data/vendor/isotree/src/extended.hpp +1444 -0
- data/vendor/isotree/src/external_facing_generic.hpp +399 -0
- data/vendor/isotree/src/fit_model.hpp +2401 -0
- data/vendor/isotree/src/{dealloc.cpp → headers_joined.hpp} +38 -22
- data/vendor/isotree/src/helpers_iforest.hpp +813 -0
- data/vendor/isotree/src/{impute.cpp → impute.hpp} +353 -122
- data/vendor/isotree/src/indexer.cpp +515 -0
- data/vendor/isotree/src/instantiate_template_headers.cpp +118 -0
- data/vendor/isotree/src/instantiate_template_headers.hpp +240 -0
- data/vendor/isotree/src/isoforest.hpp +1659 -0
- data/vendor/isotree/src/isotree.hpp +1804 -392
- data/vendor/isotree/src/isotree_exportable.hpp +99 -0
- data/vendor/isotree/src/merge_models.cpp +159 -16
- data/vendor/isotree/src/mult.hpp +1321 -0
- data/vendor/isotree/src/oop_interface.cpp +842 -0
- data/vendor/isotree/src/oop_interface.hpp +278 -0
- data/vendor/isotree/src/other_helpers.hpp +219 -0
- data/vendor/isotree/src/predict.hpp +1932 -0
- data/vendor/isotree/src/python_helpers.hpp +134 -0
- data/vendor/isotree/src/ref_indexer.hpp +154 -0
- data/vendor/isotree/src/robinmap/LICENSE +21 -0
- data/vendor/isotree/src/robinmap/README.md +483 -0
- data/vendor/isotree/src/robinmap/include/tsl/robin_growth_policy.h +406 -0
- data/vendor/isotree/src/robinmap/include/tsl/robin_hash.h +1620 -0
- data/vendor/isotree/src/robinmap/include/tsl/robin_map.h +807 -0
- data/vendor/isotree/src/robinmap/include/tsl/robin_set.h +660 -0
- data/vendor/isotree/src/serialize.cpp +4300 -139
- data/vendor/isotree/src/sql.cpp +141 -59
- data/vendor/isotree/src/subset_models.cpp +174 -0
- data/vendor/isotree/src/utils.hpp +3808 -0
- data/vendor/isotree/src/xoshiro.hpp +467 -0
- data/vendor/isotree/src/ziggurat.hpp +405 -0
- metadata +38 -104
- data/vendor/cereal/LICENSE +0 -24
- data/vendor/cereal/README.md +0 -85
- data/vendor/cereal/include/cereal/access.hpp +0 -351
- data/vendor/cereal/include/cereal/archives/adapters.hpp +0 -163
- data/vendor/cereal/include/cereal/archives/binary.hpp +0 -169
- data/vendor/cereal/include/cereal/archives/json.hpp +0 -1019
- data/vendor/cereal/include/cereal/archives/portable_binary.hpp +0 -334
- data/vendor/cereal/include/cereal/archives/xml.hpp +0 -956
- data/vendor/cereal/include/cereal/cereal.hpp +0 -1089
- data/vendor/cereal/include/cereal/details/helpers.hpp +0 -422
- data/vendor/cereal/include/cereal/details/polymorphic_impl.hpp +0 -796
- data/vendor/cereal/include/cereal/details/polymorphic_impl_fwd.hpp +0 -65
- data/vendor/cereal/include/cereal/details/static_object.hpp +0 -127
- data/vendor/cereal/include/cereal/details/traits.hpp +0 -1411
- data/vendor/cereal/include/cereal/details/util.hpp +0 -84
- data/vendor/cereal/include/cereal/external/base64.hpp +0 -134
- data/vendor/cereal/include/cereal/external/rapidjson/allocators.h +0 -284
- data/vendor/cereal/include/cereal/external/rapidjson/cursorstreamwrapper.h +0 -78
- data/vendor/cereal/include/cereal/external/rapidjson/document.h +0 -2652
- data/vendor/cereal/include/cereal/external/rapidjson/encodedstream.h +0 -299
- data/vendor/cereal/include/cereal/external/rapidjson/encodings.h +0 -716
- data/vendor/cereal/include/cereal/external/rapidjson/error/en.h +0 -74
- data/vendor/cereal/include/cereal/external/rapidjson/error/error.h +0 -161
- data/vendor/cereal/include/cereal/external/rapidjson/filereadstream.h +0 -99
- data/vendor/cereal/include/cereal/external/rapidjson/filewritestream.h +0 -104
- data/vendor/cereal/include/cereal/external/rapidjson/fwd.h +0 -151
- data/vendor/cereal/include/cereal/external/rapidjson/internal/biginteger.h +0 -290
- data/vendor/cereal/include/cereal/external/rapidjson/internal/diyfp.h +0 -271
- data/vendor/cereal/include/cereal/external/rapidjson/internal/dtoa.h +0 -245
- data/vendor/cereal/include/cereal/external/rapidjson/internal/ieee754.h +0 -78
- data/vendor/cereal/include/cereal/external/rapidjson/internal/itoa.h +0 -308
- data/vendor/cereal/include/cereal/external/rapidjson/internal/meta.h +0 -186
- data/vendor/cereal/include/cereal/external/rapidjson/internal/pow10.h +0 -55
- data/vendor/cereal/include/cereal/external/rapidjson/internal/regex.h +0 -740
- data/vendor/cereal/include/cereal/external/rapidjson/internal/stack.h +0 -232
- data/vendor/cereal/include/cereal/external/rapidjson/internal/strfunc.h +0 -69
- data/vendor/cereal/include/cereal/external/rapidjson/internal/strtod.h +0 -290
- data/vendor/cereal/include/cereal/external/rapidjson/internal/swap.h +0 -46
- data/vendor/cereal/include/cereal/external/rapidjson/istreamwrapper.h +0 -128
- data/vendor/cereal/include/cereal/external/rapidjson/memorybuffer.h +0 -70
- data/vendor/cereal/include/cereal/external/rapidjson/memorystream.h +0 -71
- data/vendor/cereal/include/cereal/external/rapidjson/msinttypes/inttypes.h +0 -316
- data/vendor/cereal/include/cereal/external/rapidjson/msinttypes/stdint.h +0 -300
- data/vendor/cereal/include/cereal/external/rapidjson/ostreamwrapper.h +0 -81
- data/vendor/cereal/include/cereal/external/rapidjson/pointer.h +0 -1414
- data/vendor/cereal/include/cereal/external/rapidjson/prettywriter.h +0 -277
- data/vendor/cereal/include/cereal/external/rapidjson/rapidjson.h +0 -656
- data/vendor/cereal/include/cereal/external/rapidjson/reader.h +0 -2230
- data/vendor/cereal/include/cereal/external/rapidjson/schema.h +0 -2497
- data/vendor/cereal/include/cereal/external/rapidjson/stream.h +0 -223
- data/vendor/cereal/include/cereal/external/rapidjson/stringbuffer.h +0 -121
- data/vendor/cereal/include/cereal/external/rapidjson/writer.h +0 -709
- data/vendor/cereal/include/cereal/external/rapidxml/license.txt +0 -52
- data/vendor/cereal/include/cereal/external/rapidxml/manual.html +0 -406
- data/vendor/cereal/include/cereal/external/rapidxml/rapidxml.hpp +0 -2624
- data/vendor/cereal/include/cereal/external/rapidxml/rapidxml_iterators.hpp +0 -175
- data/vendor/cereal/include/cereal/external/rapidxml/rapidxml_print.hpp +0 -428
- data/vendor/cereal/include/cereal/external/rapidxml/rapidxml_utils.hpp +0 -123
- data/vendor/cereal/include/cereal/macros.hpp +0 -154
- data/vendor/cereal/include/cereal/specialize.hpp +0 -139
- data/vendor/cereal/include/cereal/types/array.hpp +0 -79
- data/vendor/cereal/include/cereal/types/atomic.hpp +0 -55
- data/vendor/cereal/include/cereal/types/base_class.hpp +0 -203
- data/vendor/cereal/include/cereal/types/bitset.hpp +0 -176
- data/vendor/cereal/include/cereal/types/boost_variant.hpp +0 -164
- data/vendor/cereal/include/cereal/types/chrono.hpp +0 -72
- data/vendor/cereal/include/cereal/types/common.hpp +0 -129
- data/vendor/cereal/include/cereal/types/complex.hpp +0 -56
- data/vendor/cereal/include/cereal/types/concepts/pair_associative_container.hpp +0 -73
- data/vendor/cereal/include/cereal/types/deque.hpp +0 -62
- data/vendor/cereal/include/cereal/types/forward_list.hpp +0 -68
- data/vendor/cereal/include/cereal/types/functional.hpp +0 -43
- data/vendor/cereal/include/cereal/types/list.hpp +0 -62
- data/vendor/cereal/include/cereal/types/map.hpp +0 -36
- data/vendor/cereal/include/cereal/types/memory.hpp +0 -425
- data/vendor/cereal/include/cereal/types/optional.hpp +0 -66
- data/vendor/cereal/include/cereal/types/polymorphic.hpp +0 -483
- data/vendor/cereal/include/cereal/types/queue.hpp +0 -132
- data/vendor/cereal/include/cereal/types/set.hpp +0 -103
- data/vendor/cereal/include/cereal/types/stack.hpp +0 -76
- data/vendor/cereal/include/cereal/types/string.hpp +0 -61
- data/vendor/cereal/include/cereal/types/tuple.hpp +0 -123
- data/vendor/cereal/include/cereal/types/unordered_map.hpp +0 -36
- data/vendor/cereal/include/cereal/types/unordered_set.hpp +0 -99
- data/vendor/cereal/include/cereal/types/utility.hpp +0 -47
- data/vendor/cereal/include/cereal/types/valarray.hpp +0 -89
- data/vendor/cereal/include/cereal/types/variant.hpp +0 -109
- data/vendor/cereal/include/cereal/types/vector.hpp +0 -112
- data/vendor/cereal/include/cereal/version.hpp +0 -52
- data/vendor/isotree/src/Makevars +0 -4
- data/vendor/isotree/src/crit.cpp +0 -912
- data/vendor/isotree/src/dist.cpp +0 -749
- data/vendor/isotree/src/extended.cpp +0 -790
- data/vendor/isotree/src/fit_model.cpp +0 -1090
- data/vendor/isotree/src/helpers_iforest.cpp +0 -324
- data/vendor/isotree/src/isoforest.cpp +0 -771
- data/vendor/isotree/src/mult.cpp +0 -607
- data/vendor/isotree/src/predict.cpp +0 -853
- data/vendor/isotree/src/utils.cpp +0 -1566
@@ -1,38 +1,59 @@
 module IsoTree
   class IsolationForest
     def initialize(
-      sample_size:
-
-
-
-
-
-
-
+      sample_size: "auto", ntrees: 500, ndim: 3, ntry: 1,
+      # categ_cols: nil,
+      max_depth: "auto", ncols_per_tree: nil,
+      prob_pick_pooled_gain: 0.0, prob_pick_avg_gain: 0.0,
+      prob_pick_full_gain: 0.0, prob_pick_dens: 0.0,
+      prob_pick_col_by_range: 0.0, prob_pick_col_by_var: 0.0, prob_pick_col_by_kurt: 0.0,
+      min_gain: 0.0, missing_action: "auto", new_categ_action: "auto",
+      categ_split_type: "auto", all_perm: false, coef_by_prop: false,
+      # recode_categ: false,
+      weights_as_sample_prob: true,
+      sample_with_replacement: false, penalize_range: false, standardize_data: true,
+      scoring_metric: "depth", fast_bratio: true,
+      weigh_by_kurtosis: false, coefs: "uniform", assume_full_distr: true,
+      # build_imputer: false,
+      min_imp_obs: 3, depth_imp: "higher",
+      weigh_imp_rows: "inverse", random_seed: 1, use_long_double: false, nthreads: -1
     )
 
       @sample_size = sample_size
       @ntrees = ntrees
       @ndim = ndim
       @ntry = ntry
-      @
+      # @categ_cols = categ_cols
+      @max_depth = max_depth
+      @ncols_per_tree = ncols_per_tree
       @prob_pick_pooled_gain = prob_pick_pooled_gain
-      @
-      @
+      @prob_pick_avg_gain = prob_pick_avg_gain
+      @prob_pick_full_gain = prob_pick_full_gain
+      @prob_pick_dens = prob_pick_dens
+      @prob_pick_col_by_range = prob_pick_col_by_range
+      @prob_pick_col_by_var = prob_pick_col_by_var
+      @prob_pick_col_by_kurt = prob_pick_col_by_kurt
       @min_gain = min_gain
       @missing_action = missing_action
       @new_categ_action = new_categ_action
       @categ_split_type = categ_split_type
       @all_perm = all_perm
       @coef_by_prop = coef_by_prop
+      # @recode_categ = recode_categ
+      @weights_as_sample_prob = weights_as_sample_prob
       @sample_with_replacement = sample_with_replacement
       @penalize_range = penalize_range
+      @standardize_data = standardize_data
+      @scoring_metric = scoring_metric
+      @fast_bratio = fast_bratio
       @weigh_by_kurtosis = weigh_by_kurtosis
       @coefs = coefs
+      @assume_full_distr = assume_full_distr
       @min_imp_obs = min_imp_obs
       @depth_imp = depth_imp
       @weigh_imp_rows = weigh_imp_rows
       @random_seed = random_seed
+      @use_long_double = use_long_double
 
       # etc module returns virtual cores
       nthreads = Etc.nprocessors if nthreads < 0
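The last context line of this hunk turns a negative `nthreads` into the processor count reported by Ruby's `Etc` module. A minimal standalone sketch of that defaulting rule is shown here; `resolve_nthreads` is a hypothetical helper name used only for illustration and is not part of the gem.

```ruby
require "etc"

# Hypothetical helper (not part of the gem): mirrors the rule in the hunk,
# where a negative thread count means "use every available processor".
def resolve_nthreads(nthreads)
  nthreads < 0 ? Etc.nprocessors : nthreads
end

puts resolve_nthreads(-1) # logical processor count of this machine
puts resolve_nthreads(2)  # explicit values pass through unchanged
```

As the diff's own comment notes, `Etc.nprocessors` counts virtual cores, so the default may exceed the number of physical cores.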
@@ -40,10 +61,16 @@ module IsoTree
     end
 
     def fit(x)
+      # make export consistent with Python library
+      update_params
+
       x = Dataset.new(x)
       prep_fit(x)
       options = data_options(x).merge(fit_options)
-
+
+      if options[:sample_size] == "auto"
+        options[:sample_size] = [options[:nrows], 10000].min
+      end
 
       # prevent segfault
       options[:sample_size] = options[:nrows] if options[:sample_size] > options[:nrows]
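The `fit` hunk resolves `sample_size: "auto"` to the number of rows capped at 10000, then clamps any explicit value to the row count. The two steps can be sketched as one pure function; `resolve_sample_size` and `AUTO_SAMPLE_CAP` are hypothetical names chosen for this illustration.

```ruby
# Cap applied when sample_size is "auto" (value taken from the hunk).
AUTO_SAMPLE_CAP = 10_000

# Hypothetical helper: returns the per-tree sub-sample size for a dataset of nrows rows.
def resolve_sample_size(sample_size, nrows)
  sample_size = [nrows, AUTO_SAMPLE_CAP].min if sample_size == "auto"
  # Never request more rows than exist (the diff labels this "prevent segfault").
  [sample_size, nrows].min
end

p resolve_sample_size("auto", 500)    # => 500
p resolve_sample_size("auto", 50_000) # => 10000
p resolve_sample_size(256, 100)       # => 100
```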
@@ -71,18 +98,22 @@ module IsoTree
     end
 
     # same format as Python so models are compatible
-    def export_model(path)
+    def export_model(path, add_metada_file: false)
       check_fit
 
-
-
+      metadata = export_metadata
+      if add_metada_file
+        # indent 4 spaces like Python
+        File.write("#{path}.metadata", JSON.pretty_generate(metadata, indent: "    "))
+      end
+      Ext.serialize_combined(@ext_iso_forest, path, JSON.generate(metadata))
     end
 
     def self.import_model(path)
       model = new
-      metadata =
-      model.
-      model.
+      ext_iso_forest, metadata = Ext.deserialize_combined(path)
+      model.instance_variable_set(:@ext_iso_forest, ext_iso_forest)
+      model.send(:import_metadata, JSON.parse(metadata))
       model
     end
 
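The export hunk writes an optional `.metadata` sidecar as pretty-printed JSON with a 4-space indent, and the import side parses JSON back into a hash. The sketch below reproduces only that JSON round trip with the standard library; the helper names and the sample metadata are assumptions for illustration, and the native `Ext.serialize_combined` call is deliberately left out.

```ruby
require "json"
require "tmpdir"

# Hypothetical helper: writes the human-readable sidecar next to the model file,
# indented with 4 spaces as in the hunk.
def write_metadata_sidecar(path, metadata)
  File.write("#{path}.metadata", JSON.pretty_generate(metadata, indent: "    "))
end

# Hypothetical helper: reads the sidecar back. JSON.parse returns string keys,
# which matches the string lookups (model_info["ndim"]) in the import code of this diff.
def read_metadata_sidecar(path)
  JSON.parse(File.read("#{path}.metadata"))
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "model.bin")
  write_metadata_sidecar(path, { model_info: { ndim: 3, nthreads: 1 } }) # sample data only
  puts File.read("#{path}.metadata")
  p read_metadata_sidecar(path)["model_info"]["ndim"]
end
```

Symbol keys become strings after the round trip, so code that restores parameters has to convert keys, as `params[k.to_s]` does elsewhere in this diff.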
@@ -94,7 +125,9 @@ module IsoTree
         ncols_categ: @categorical_columns.size,
         cols_numeric: @numeric_columns,
         cols_categ: @categorical_columns,
-        cat_levels: @categorical_columns.map { |v| @categories[v].keys }
+        cat_levels: @categorical_columns.map { |v| @categories[v].keys },
+        categ_cols: [],
+        categ_max: []
       }
 
       # Ruby-specific
@@ -104,6 +137,7 @@ module IsoTree
       model_info = {
         ndim: @ndim,
         nthreads: @nthreads,
+        use_long_double: @use_long_double,
         build_imputer: false
       }
 
@@ -112,6 +146,10 @@ module IsoTree
         params[k] = instance_variable_get("@#{k}")
       end
 
+      if params[:max_depth] == "auto"
+        params[:max_depth] = 0
+      end
+
       {
         data_info: data_info,
         model_info: model_info,
@@ -137,6 +175,8 @@ module IsoTree
 
       @ndim = model_info["ndim"]
       @nthreads = model_info["nthreads"]
+      @use_long_double = model_info["use_long_double"]
+      @build_imputer = model_info["build_imputer"]
 
       PARAM_KEYS.each do |k|
         instance_variable_set("@#{k}", params[k.to_s])
@@ -221,31 +261,71 @@ module IsoTree
     end
 
     PARAM_KEYS = %i(
-      sample_size ntrees ntry max_depth
-      prob_pick_avg_gain prob_pick_pooled_gain
-
-      missing_action new_categ_action categ_split_type coefs
-      weigh_imp_rows min_imp_obs random_seed all_perm
-      weights_as_sample_prob sample_with_replacement penalize_range
-      weigh_by_kurtosis assume_full_distr
+      sample_size ntrees ntry max_depth ncols_per_tree
+      prob_pick_avg_gain prob_pick_pooled_gain prob_pick_full_gain prob_pick_dens
+      prob_pick_col_by_range prob_pick_col_by_var prob_pick_col_by_kurt
+      min_gain missing_action new_categ_action categ_split_type coefs
+      depth_imp weigh_imp_rows min_imp_obs random_seed all_perm
+      coef_by_prop weights_as_sample_prob sample_with_replacement penalize_range standardize_data
+      scoring_metric fast_bratio weigh_by_kurtosis assume_full_distr
     )
 
     def fit_options
       keys = %i(
         sample_size ntrees ndim ntry
-
-
+        categ_cols max_depth ncols_per_tree
+        prob_pick_pooled_gain prob_pick_avg_gain
+        prob_pick_full_gain prob_pick_dens
+        prob_pick_col_by_range prob_pick_col_by_var prob_pick_col_by_kurt
         min_gain missing_action new_categ_action
         categ_split_type all_perm coef_by_prop
-
+        weights_as_sample_prob
+        sample_with_replacement penalize_range standardize_data
+        scoring_metric fast_bratio
         weigh_by_kurtosis coefs min_imp_obs depth_imp
-        weigh_imp_rows random_seed nthreads
+        weigh_imp_rows random_seed use_long_double nthreads
       )
       options = {}
       keys.each do |k|
         options[k] = instance_variable_get("@#{k}")
       end
+
+      if options[:max_depth] == "auto"
+        options[:max_depth] = 0
+        options[:limit_depth] = true
+      end
+
+      if options[:ncols_per_tree].nil?
+        options[:ncols_per_tree] = 0
+      end
+
       options
     end
+
+    def update_params
+      if @missing_action == "auto"
+        if @ndim == 1
+          @missing_action = "divide"
+        else
+          @missing_action = "impute"
+        end
+      end
+
+      if @new_categ_action == "auto"
+        if @ndim == 1
+          @new_categ_action = "weighted"
+        else
+          @new_categ_action = "impute"
+        end
+      end
+
+      if @categ_split_type == "auto"
+        if @ndim == 1
+          @categ_split_type = "single_categ"
+        else
+          @categ_split_type = "subset"
+        end
+      end
+    end
   end
 end
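The new `update_params` method picks concrete values for three `"auto"` settings depending on whether `ndim` is 1, and `fit_options` turns `max_depth: "auto"` and `ncols_per_tree: nil` into numeric sentinels. A table-driven sketch of the same rules follows; the function and constant names are hypothetical, and the gem itself uses the explicit `if`/`else` branches shown in the hunk.

```ruby
# Resolved values copied from the hunk: :single applies when ndim == 1, :multi otherwise.
AUTO_DEFAULTS = {
  missing_action:   { single: "divide",       multi: "impute" },
  new_categ_action: { single: "weighted",     multi: "impute" },
  categ_split_type: { single: "single_categ", multi: "subset" }
}.freeze

# Hypothetical helper: replaces "auto" with the ndim-dependent default and
# leaves explicit user choices untouched.
def resolve_auto_params(ndim, params)
  kind = ndim == 1 ? :single : :multi
  params.to_h do |key, value|
    auto = value == "auto" && AUTO_DEFAULTS.key?(key)
    [key, auto ? AUTO_DEFAULTS[key][kind] : value]
  end
end

# Hypothetical helper: applies the numeric sentinels from fit_options.
def normalize_fit_options(options)
  options = options.dup
  if options[:max_depth] == "auto"
    options[:max_depth] = 0
    options[:limit_depth] = true
  end
  options[:ncols_per_tree] = 0 if options[:ncols_per_tree].nil?
  options
end

p resolve_auto_params(1, { missing_action: "auto", categ_split_type: "auto" })
p normalize_fit_options({ max_depth: "auto", ncols_per_tree: nil })
```

Keeping the defaults in one table makes the `ndim == 1` versus `ndim > 1` split easy to audit, at the cost of being less literal than the branches in the diff.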
data/lib/isotree/version.rb
CHANGED
data/vendor/isotree/LICENSE
CHANGED
data/vendor/isotree/README.md
CHANGED
@@ -1,11 +1,27 @@
 # IsoTree
 
-Fast and multi-threaded implementation of
+Fast and multi-threaded implementation of Isolation Forest (a.k.a. iForest) and variations of it such as Extended Isolation Forest (EIF), Split-Criterion iForest (SCiForest), Fair-Cut Forest (FCF), Robust Random-Cut Forest (RRCF), and other customizable variants, aimed at outlier/anomaly detection plus additions for imputation of missing values, distance/similarity calculation between observations, and handling of categorical data. Written in C++ with interfaces for Python, R, and C. An additional wrapper for Ruby can be found [here](https://github.com/ankane/isotree).
 
 The new concepts in this software are described in:
+* [Revisiting randomized choices in isolation forests](https://arxiv.org/abs/2110.13402)
+* [Isolation forests: looking beyond tree depth](https://arxiv.org/abs/2111.11639)
 * [Distance approximation using Isolation Forests](https://arxiv.org/abs/1910.12362)
 * [Imputing missing values with unsupervised random trees](https://arxiv.org/abs/1911.06646)
 
+*********************
+
+For a quick introduction to the Isolation Forest concept as used in this library, see:
+* [Python introductory notebook](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/an_introduction_to_isolation_forests.ipynb).
+* [R Vignette](http://htmlpreview.github.io/?https://github.com/david-cortes/isotree/blob/master/inst/doc/An_Introduction_to_Isolation_Forests.html).
+
+Short Python example notebooks:
+* [General library usage](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_example.ipynb).
+* [Using it as imputer in a scikit-learn pipeline](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_impute.ipynb).
+* [Using it as a kernel for SVMs](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_svm_kernel_example.ipynb).
+* [Converting it to TreeLite format for faster predictions](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/treelite_example.ipynb).
+
+(R examples are available in the internal documentation)
+
 # Description
 
 Isolation Forest is an algorithm originally developed for outlier detection that consists in splitting sub-samples of the data according to some attribute/feature/column at random. The idea is that, the rarer the observation, the more likely it is that a random uniform split on some feature would put outliers alone in one branch, and the fewer splits it will take to isolate an outlier observation like this. The concept is extended to splitting hyperplanes in the extended model (i.e. splitting by more than one column at a time), and to guided (not entirely random) splits in the SCiForest model that aim at isolating outliers faster and finding clustered outliers.
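The Description paragraph says that rarer observations need fewer random uniform splits to end up alone in a branch. A toy one-dimensional sketch of that idea is given below. It is not the library's C++ implementation: the function names, the data, and the seed are all assumptions made for illustration.

```ruby
# Counts how many random uniform splits it takes to isolate `point` from `values`.
def isolation_depth(values, point, rng, depth = 0)
  return depth if values.size <= 1
  lo, hi = values.minmax
  return depth if lo == hi # identical values cannot be separated
  split = lo + rng.rand * (hi - lo)
  # Keep only the branch that contains the point of interest.
  branch = point < split ? values.select { |v| v < split } : values.select { |v| v >= split }
  isolation_depth(branch, point, rng, depth + 1)
end

# Averages the isolation depth over several independent random trees.
def average_depth(values, point, ntrees:, seed: 1)
  rng = Random.new(seed)
  depths = Array.new(ntrees) { isolation_depth(values, point, rng) }
  depths.sum.to_f / ntrees
end

inliers = (0...100).map { |i| i / 100.0 } # evenly spread points in [0, 1)
data = inliers + [10.0]                   # one obvious outlier

puts average_depth(data, 10.0, ntrees: 200) # outlier: isolated after very few splits
puts average_depth(data, 0.5, ntrees: 200)  # inlier: needs noticeably more splits
```

The outlier is usually cut off by the first split because most of the range between the minimum and the maximum is empty space next to it, which is the intuition the paragraph describes.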
@@ -16,6 +32,53 @@ Note that this is a black-box model that will not produce explanations or import
 
 _(Code to produce these plots can be found in the R examples in the documentation)_
 
+# Comparison against other libraries
+
+The folder [timings](https://github.com/david-cortes/isotree/blob/master/timings) contains a speed comparison against other Isolation Forest implementations in Python (SciKit-Learn, EIF) and R (IsolationForest, isofor, solitude). From the benchmarks, IsoTree tends to be at least 1 order of magnitude faster than the libraries compared against in both single-threaded and multi-threaded mode.
+
+Example timings for 100 trees and different sample sizes, CovType dataset - see the link above for full benchmark and details:
+
+| Library | Model | Time (s) 256 | Time (s) 1024 | Time (s) 10k |
+| :---: | :---: | :---: | :---: | :---: |
+| isotree | orig | 0.00161 | 0.00631 | 0.0848 |
+| isotree | ext | 0.00326 | 0.0123 | 0.168 |
+| eif | orig | 0.149 | 0.398 | 4.99 |
+| eif | ext | 0.16 | 0.428 | 5.06 |
+| h2o | orig | 9.33 | 11.21 | 14.23 |
+| h2o | ext | 1.06 | 2.07 | 17.31 |
+| scikit-learn | orig | 8.3 | 8.01 | 6.89 |
+| solitude | orig | 32.612 | 34.01 | 41.01 |
+
+
+Example AUC as outlier detector in typical datasets (notebook to produce results [here](https://github.com/david-cortes/isotree/blob/master/example/comparison_model_quality.ipynb)):
+
+* Satellite dataset:
+
+| Library | AUROC defaults | AUROC grid search |
+| :---: | :---: | :---: |
+| isotree | 0.70 | 0.84 |
+| eif | - | 0.714 |
+| scikit-learn | 0.687 | 0.74 |
+| h2o | 0.662 | 0.748 |
+
+* Annthyroid dataset:
+
+| Library | AUROC defaults | AUROC grid search |
+| :---: | :---: | :---: |
+| isotree | 0.80 | 0.982 |
+| eif | - | 0.808 |
+| scikit-learn | 0.836 | 0.836 |
+| h2o | 0.80 | 0.80 |
+
+*(Disclaimer: these are rather small datasets and thus these AUC estimates have high variance)*
+
+# Non-random splits
+
+While the original idea behind isolation forests consisted in deciding splits uniformly at random, it's possible to get better performance at detecting outliers in some datasets (particularly those with multimodal distributions) by determining splits according to an information gain criterion instead. The idea is described in ["Revisiting randomized choices in isolation forests"](https://arxiv.org/abs/2110.13402) along with some comparisons of different split guiding criteria.
+
+# Different outlier scoring criteria
+
+Although the intuition behind the algorithm was to look at the tree depth required for isolation, this package can also produce outlier scores based on density criteria, which provide improved results in some datasets, particularly when splitting on categorical features. The idea is described in ["Isolation forests: looking beyond tree depth"](https://arxiv.org/abs/2111.11639).
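For reference, the depth-based score that the paragraph contrasts with density criteria is usually written, following the original Isolation Forest paper, as s = 2^(-E[h] / c(n)), where E[h] is the mean isolation depth and c(n) normalises for the sub-sample size n. The sketch below implements that textbook formula only. It is an assumption that it resembles what `scoring_metric: "depth"` computes; the library's exact formula and its density metrics are not shown in this diff.

```ruby
EULER_GAMMA = 0.5772156649015329

# Normalising constant c(n): average path length of an unsuccessful search in a
# binary search tree with n points, using H(i) ~ ln(i) + gamma.
def average_path_length(n)
  return 0.0 if n <= 1
  return 1.0 if n == 2 # exact value, since the approximation is poor for tiny n
  2.0 * (Math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n
end

# Textbook depth-based score for n > 1: values near 1 suggest an outlier,
# values well below 0.5 an ordinary point.
def anomaly_score(mean_depth, n)
  2.0**(-mean_depth / average_path_length(n))
end

puts anomaly_score(2.0, 256)  # shallow isolation gives a higher score
puts anomaly_score(12.0, 256) # deep isolation gives a lower score
```

A point whose mean depth equals c(n) scores exactly 0.5, which is why 0.5 is commonly read as the boundary between typical and unusual points.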
 
 # Distance / similarity calculations
 
@@ -30,51 +93,77 @@ The model can also be used to impute missing values in a similar fashion as kNN,
 There's already many available implementations of isolation forests for both Python and R (such as [the one from the original paper's authors'](https://sourceforge.net/projects/iforest/) or [the one in SciKit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)), but at the time of writing, all of them are lacking some important functionality and/or offer sub-optimal speed. This particular implementation offers the following:
 
 * Implements the extended model (with splitting hyperplanes) and split-criterion model (with non-random splits).
+* Can handle missing values (but performance with them is not so good).
+* Can handle categorical variables (one-hot/dummy encoding does not produce the same result).
 * Can use a mixture of random and non-random splits, and can split by weighted/pooled gain (in addition to simple average).
 * Can produce approximated pairwise distances between observations according to how many steps it takes on average to separate them down the tree.
-* Can
+* Can calculate isolation kernels or proximity matrix, which counts the proportion of trees in which two given observations end up in the same terminal node.
 * Can produce missing value imputations according to observations that fall on each terminal node.
-* Can handle categorical variables (one-hot/dummy encoding does not produce the same result).
 * Can work with sparse matrices.
+* Can use either depth-based metrics or density-based metrics for calculation of outlier scores.
 * Supports sample/observation weights, either as sampling importance or as distribution density measurement.
 * Supports user-provided column sample weights.
 * Can sample columns randomly with weights given by kurtosis.
-* Uses exact formula (not approximation as others do) for harmonic numbers at lower sample and remainder sizes.
+* Uses exact formula (not approximation as others do) for harmonic numbers at lower sample and remainder sizes, and a higher-order approximation for larger sizes.
 * Can fit trees incrementally to user-provided data samples.
 * Produces serializable model objects with reasonable file sizes.
+* Can convert the models to `treelite` format (Python-only and depending on the parameters that are used) ([example here](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/treelite_example.ipynb)).
 * Can translate the generated trees into SQL statements.
-* Fast and multi-threaded C++ code. Can be wrapped in languages other than Python/R/Ruby.
+* Fast and multi-threaded C++ code with an ISO C interface, which is architecture-agnostic, multi-platform, and with the only external dependency (Robin-Map) being optional. Can be wrapped in languages other than Python/R/Ruby.
 
 (Note that categoricals, NAs, and density-like sample weights, are treated heuristically with different options as there is no single logical extension of the original idea to them, and having them present might degrade performance/accuracy for regular numerical non-missing observations)
 
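One item in the feature list above mentions an exact formula for harmonic numbers at small sizes and a higher-order approximation at larger ones. A rough sketch of that strategy follows. The cut-off of 256 and the number of series terms are assumptions made for this example and are not taken from the library's source.

```ruby
EULER_MASCHERONI = 0.5772156649015329
EXACT_LIMIT = 256 # assumed cut-off for this sketch, not the library's value

# H(n) = 1 + 1/2 + ... + 1/n, summed exactly for small n and approximated
# with the asymptotic expansion ln(n) + gamma + 1/(2n) - 1/(12n^2) + 1/(120n^4) otherwise.
def harmonic_number(n)
  return 0.0 if n <= 0
  if n <= EXACT_LIMIT
    (1..n).sum { |k| 1.0 / k }
  else
    Math.log(n) + EULER_MASCHERONI + 1.0 / (2 * n) - 1.0 / (12 * n**2) + 1.0 / (120 * n**4)
  end
end

puts harmonic_number(4)    # exact branch: 1 + 1/2 + 1/3 + 1/4
puts harmonic_number(1000) # approximate branch
```

The plain `ln(n) + gamma` approximation is off by roughly `1/(2n)`, which matters for small sub-samples. That is presumably why an exact sum is preferred there.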
 # Installation
 
+* R:
+
+```r
+install.packages("isotree")
+```
+** *
+
+
 * Python:
-
+
+```
 pip install isotree
 ```
+or if that fails:
+```
+pip install --no-use-pep517 isotree
+```
+** *
 
-**Note for macOS users:** on macOS, the Python version of this package
+**Note for macOS users:** on macOS, the Python version of this package might compile **without** multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:
 ```
-
-pip install isotree
+brew install libomp
 ```
-
+And then reinstall this package: `pip install --force-reinstall isotree`.
 
-*
+** *
+**IMPORTANT:** the setup script will try to add compilation flag `-march=native`. This instructs the compiler to tune the package for the CPU in which it is being installed (by e.g. using AVX instructions if available), but the result might not be usable in other computers. If building a binary wheel of this package or putting it into a docker image which will be used in different machines, this can be overriden either by (a) defining an environment variable `DONT_SET_MARCH=1`, or by (b) manually supplying compilation `CFLAGS` as an environment variable with something related to architecture. For maximum compatibility (but slowest speed), it's possible to do something like this:
 
-```
-
+```
+export DONT_SET_MARCH=1
+pip install isotree
 ```
 
-
+or, by specifying some compilation flag for architecture:
+```
+export CFLAGS="-march=x86-64"
+export CXXFLAGS="-march=x86-64"
+pip install isotree
 ```
-
+** *
+
+* C and C++:
+```
+git clone --recursive https://www.github.com/david-cortes/isotree.git
 cd isotree
 mkdir build
 cd build
-cmake ..
-
+cmake -DUSE_MARCH_NATIVE=1 ..
+cmake --build .
 
 ### for a system-wide install in linux
 sudo make install
@@ -83,16 +172,22 @@ sudo ldconfig
 
 (Will build as a shared object - linkage is then done with `-lisotree`)
 
-
+Be aware that the snippet above includes option `-DUSE_MARCH_NATIVE=1`, which will make it use the highest-available CPU instruction set (e.g. AVX2) and will produces objects that might not run on older CPUs - to build more "portable" objects, remove this option from the cmake command.
+
+The package has an optional dependency on the [Robin-Map](https://github.com/Tessil/robin-map) library, which is added to this repository as a linked submodule. If this library is not found under `/src`, will use the compiler's own hashmaps, which are less optimal.
+
+* Ruby:
 
 See [external repository with wrapper](https://github.com/ankane/isotree).
 
 # Sample usage
 
-**Warning: default parameters in this implementation are very different from default parameters in others such as
+**Warning: default parameters in this implementation are very different from default parameters in others such as Scikit-Learn's, and these defaults won't scale to large datasets (see documentation for details).**
 
 * Python:
 
+(Library is Scikit-Learn compatible)
+
 ```python
 import numpy as np
 from isotree import IsolationForest
@@ -107,7 +202,7 @@ X = np.random.normal(size = (n, m))
 X = np.r_[X, np.array([3, 3]).reshape((1, m))]
 
 ### Fit a small isolation forest model
-iso = IsolationForest(ntrees = 10,
+iso = IsolationForest(ntrees = 10, nthreads = 1)
 iso.fit(X)
 
 ### Check which row has the highest outlier score
@@ -117,6 +212,7 @@ print("Point with highest outlier score: ",
 ```
 
 * R:
+
 (see documentation for more examples - `help(isotree::isolation.forest)`)
 ```r
 ### Random data from a standard normal distribution
@@ -135,29 +231,67 @@ iso <- isolation.forest(X, ntrees = 10, nthreads = 1)
|
|
135
231
|
### Check which row has the highest outlier score
|
136
232
|
pred <- predict(iso, X)
|
137
233
|
cat("Point with highest outlier score: ",
|
138
|
-
|
234
|
+
X[which.max(pred), ], "\n")
|
139
235
|
```
|
140
236
|
|
141
237
|
* C++:
|
142
238
|
|
143
|
-
|
239
|
+
The package comes with two different C++ interfaces: (a) a struct-based interface which exposes the library's full functionality but performs few checks on the inputs it receives and is perhaps a bit difficult to use due to the large number of arguments that its functions require; and (b) a scikit-learn-like interface in which the model is exposed as a single class with methods like 'fit' and 'predict', which is less flexible than the struct-based interface but easier to use, and whose function signatures rule out some potential errors from invalid parameter combinations.
|
240
|
+
|
241
|
+
|
242
|
+
See files: [isotree_cpp_ex.cpp](https://github.com/david-cortes/isotree/blob/master/example/isotree_cpp_ex.cpp) for an example with the struct-based interface; and [isotree_cpp_oop_ex.cpp](https://github.com/david-cortes/isotree/blob/master/example/isotree_cpp_oop_ex.cpp) for an example with the scikit-learn-like interface.
|
144
243
|
|
244
|
+
Note that the second interface does not expose all the functionalities - for example, it only supports inputs of classes 'double' and 'int', while the struct-based interface also supports 'float'/'size_t'.
|
245
|
+
|
246
|
+
* C:
|
247
|
+
|
248
|
+
See file [isotree_c_ex.c](https://github.com/david-cortes/isotree/blob/master/example/isotree_c_ex.c).
|
249
|
+
|
250
|
+
Note that the C interface is a simple wrapper over the scikit-learn-like C++ interface, but using only ISO C bindings for better compatibility and easier wrapping in other languages.
|
251
|
+
|
252
|
+
* Ruby:
|
253
|
+
|
254
|
+
See [external repository with wrapper](https://github.com/ankane/isotree).
|
145
255
|
|
146
256
|
# Examples
|
147
257
|
|
148
|
-
* Python:
|
258
|
+
* Python:
|
259
|
+
* [Example about general library usage](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_example.ipynb).
|
260
|
+
* [Example using it as imputer in a scikit-learn pipeline](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_impute.ipynb).
|
261
|
+
* [Example using it as a kernel for SVMs](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_svm_kernel_example.ipynb).
|
262
|
+
* [Example converting it to TreeLite format for faster predictions](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/treelite_example.ipynb).
|
149
263
|
* R: examples available in the documentation (`help(isotree::isolation.forest)`, [link to CRAN](https://cran.r-project.org/web/packages/isotree/index.html)).
|
150
|
-
* C++: see short
|
264
|
+
* C and C++: see short examples in the section above.
|
265
|
+
* Ruby: see [external repository with wrapper](https://github.com/ankane/isotree).
|
151
266
|
|
152
267
|
# Documentation
|
153
268
|
|
154
269
|
* Python: documentation is available at [ReadTheDocs](http://isotree.readthedocs.io/en/latest/).
|
155
270
|
* R: documentation is available internally in the package (e.g. `help(isolation.forest)`) and in [CRAN](https://cran.r-project.org/web/packages/isotree/index.html).
|
156
|
-
* C++: documentation is available in the public header (`include/isotree.hpp`) and in the source files.
|
271
|
+
* C++: documentation is available in the public header (`include/isotree.hpp`) and in the source files. See also the header for the scikit-learn-like interface (`include/isotree_oop.hpp`).
|
272
|
+
* C: the interface is not documented per se, but the same documentation from the C++ header applies to it. See also its header for some non-comprehensive comments about the parameters that functions take (`include/isotree_c.h`).
|
273
|
+
* Ruby: see [external repository with wrapper](https://github.com/ankane/isotree) for the syntax and the [Python docs](http://isotree.readthedocs.io) for details about the parameters.
|
274
|
+
|
275
|
+
# Reducing library size and compilation times
|
276
|
+
|
277
|
+
By default, this library will compile with some functionalities that are unlikely to be used and which can significantly increase the size of the library and compilation times - if using this library in e.g. embedded devices, it is highly recommended to disable some options, and if creating a Docker image for serving models, one might want to make it as minimal as possible. Being a C++ templated library, it generates multiple versions of its functions that are specialized for different types (such as C `double` and `float`), and in practice not all the supported types are likely to be used.
|
278
|
+
|
279
|
+
In particular, the library supports usage of the `long double` type for more precise aggregated calculations (e.g. standard deviations), which is unlikely to end up being used (its usage is determined by a user-passed function argument and is not available in the C or C++-OOP interfaces). For a smaller library and faster compilation, support for `long double` can be disabled by:
|
280
|
+
|
281
|
+
* Defining an environment variable `NO_LONG_DOUBLE`, which will be accepted by the Python and R build systems - e.g. first run `export NO_LONG_DOUBLE=1`, then a `pip` install; or for R, run `Sys.setenv("NO_LONG_DOUBLE" = "1")` before `install.packages`.
|
282
|
+
* Passing option `NO_LONG_DOUBLE` to the CMake script - e.g. `cmake -DNO_LONG_DOUBLE=1 ..` (only when using the CMake system, which is not used by the Python and R versions).
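
As an illustration of the first option, the following is a minimal sketch of how a Python build script could turn such an environment variable into a preprocessor definition. The helper name `size_reduction_macros` and the rule that any non-empty value enables the option are assumptions made for this example; it is not taken from the package's actual build script.

```python
import os


def size_reduction_macros(environ=None):
    """Hypothetical helper: map environment variables to compiler defines.

    Returns a list of (name, value) pairs in the format that setuptools'
    `Extension(define_macros=...)` accepts, where a value of None means
    "define the macro without a value".
    """
    env = os.environ if environ is None else environ
    macros = []
    # Assumption for this sketch: any non-empty value (e.g. "1") enables the option.
    if env.get("NO_LONG_DOUBLE"):
        macros.append(("NO_LONG_DOUBLE", None))
    return macros


if __name__ == "__main__":
    # With `export NO_LONG_DOUBLE=1` set beforehand, this prints the define;
    # otherwise it prints an empty list.
    print(size_reduction_macros())
```

The returned list could then be passed as `define_macros` when declaring the extension module, so that the C++ sources are compiled with the macro defined.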
|
283
|
+
|
284
|
+
|
285
|
+
Additionally, the library will produce functions for different floating point and integer types of the input data. In practice, one usually ends up using only `double` and `int` types (these are the only types supported in the R interface and in the C and C++-OOP interfaces). When building it as a shared library through the CMake system, these can be disabled (leaving only `double` and `int` support) through option `NO_TEMPLATED_VERSIONS` - e.g.:
|
286
|
+
```
|
287
|
+
cmake -DNO_TEMPLATED_VERSIONS=1 ..
|
288
|
+
```
|
289
|
+
(this option is not available for the Python build system)
|
290
|
+
|
157
291
|
|
158
|
-
#
|
292
|
+
# Help wanted
|
159
293
|
|
160
|
-
|
294
|
+
The package does not currently have any functionality for visualizing trees. Pull requests adding such functionality would be welcome.
|
161
295
|
|
162
296
|
# References
|
163
297
|
|
@@ -170,3 +304,7 @@ When setting a random seed and using more than one thread, the results of some f
|
|
170
304
|
* Quinlan, J. Ross. C4.5: programs for machine learning. Elsevier, 2014.
|
171
305
|
* Cortes, David. "Distance approximation using Isolation Forests." arXiv preprint arXiv:1910.12362 (2019).
|
172
306
|
* Cortes, David. "Imputing missing values with unsupervised random trees." arXiv preprint arXiv:1911.06646 (2019).
|
307
|
+
* Cortes, David. "Revisiting randomized choices in isolation forests." arXiv preprint arXiv:2110.13402 (2021).
|
308
|
+
* Guha, Sudipto, et al. "Robust random cut forest based anomaly detection on streams." International conference on machine learning. PMLR, 2016.
|
309
|
+
* Cortes, David. "Isolation forests: looking beyond tree depth." arXiv preprint arXiv:2111.11639 (2021).
|
310
|
+
* Ting, Kai Ming, Yue Zhu, and Zhi-Hua Zhou. "Isolation kernel and its effect on SVM." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
|