isotree 0.2.2 → 0.3.0

Files changed (151)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +8 -1
  3. data/LICENSE.txt +2 -2
  4. data/README.md +32 -14
  5. data/ext/isotree/ext.cpp +144 -31
  6. data/ext/isotree/extconf.rb +7 -7
  7. data/lib/isotree/isolation_forest.rb +110 -30
  8. data/lib/isotree/version.rb +1 -1
  9. data/vendor/isotree/LICENSE +1 -1
  10. data/vendor/isotree/README.md +165 -27
  11. data/vendor/isotree/include/isotree.hpp +2111 -0
  12. data/vendor/isotree/include/isotree_oop.hpp +394 -0
  13. data/vendor/isotree/inst/COPYRIGHTS +62 -0
  14. data/vendor/isotree/src/RcppExports.cpp +525 -52
  15. data/vendor/isotree/src/Rwrapper.cpp +1931 -268
  16. data/vendor/isotree/src/c_interface.cpp +953 -0
  17. data/vendor/isotree/src/crit.hpp +4232 -0
  18. data/vendor/isotree/src/dist.hpp +1886 -0
  19. data/vendor/isotree/src/exp_depth_table.hpp +134 -0
  20. data/vendor/isotree/src/extended.hpp +1444 -0
  21. data/vendor/isotree/src/external_facing_generic.hpp +399 -0
  22. data/vendor/isotree/src/fit_model.hpp +2401 -0
  23. data/vendor/isotree/src/{dealloc.cpp → headers_joined.hpp} +38 -22
  24. data/vendor/isotree/src/helpers_iforest.hpp +813 -0
  25. data/vendor/isotree/src/{impute.cpp → impute.hpp} +353 -122
  26. data/vendor/isotree/src/indexer.cpp +515 -0
  27. data/vendor/isotree/src/instantiate_template_headers.cpp +118 -0
  28. data/vendor/isotree/src/instantiate_template_headers.hpp +240 -0
  29. data/vendor/isotree/src/isoforest.hpp +1659 -0
  30. data/vendor/isotree/src/isotree.hpp +1804 -392
  31. data/vendor/isotree/src/isotree_exportable.hpp +99 -0
  32. data/vendor/isotree/src/merge_models.cpp +159 -16
  33. data/vendor/isotree/src/mult.hpp +1321 -0
  34. data/vendor/isotree/src/oop_interface.cpp +842 -0
  35. data/vendor/isotree/src/oop_interface.hpp +278 -0
  36. data/vendor/isotree/src/other_helpers.hpp +219 -0
  37. data/vendor/isotree/src/predict.hpp +1932 -0
  38. data/vendor/isotree/src/python_helpers.hpp +134 -0
  39. data/vendor/isotree/src/ref_indexer.hpp +154 -0
  40. data/vendor/isotree/src/robinmap/LICENSE +21 -0
  41. data/vendor/isotree/src/robinmap/README.md +483 -0
  42. data/vendor/isotree/src/robinmap/include/tsl/robin_growth_policy.h +406 -0
  43. data/vendor/isotree/src/robinmap/include/tsl/robin_hash.h +1620 -0
  44. data/vendor/isotree/src/robinmap/include/tsl/robin_map.h +807 -0
  45. data/vendor/isotree/src/robinmap/include/tsl/robin_set.h +660 -0
  46. data/vendor/isotree/src/serialize.cpp +4300 -139
  47. data/vendor/isotree/src/sql.cpp +141 -59
  48. data/vendor/isotree/src/subset_models.cpp +174 -0
  49. data/vendor/isotree/src/utils.hpp +3808 -0
  50. data/vendor/isotree/src/xoshiro.hpp +467 -0
  51. data/vendor/isotree/src/ziggurat.hpp +405 -0
  52. metadata +38 -104
  53. data/vendor/cereal/LICENSE +0 -24
  54. data/vendor/cereal/README.md +0 -85
  55. data/vendor/cereal/include/cereal/access.hpp +0 -351
  56. data/vendor/cereal/include/cereal/archives/adapters.hpp +0 -163
  57. data/vendor/cereal/include/cereal/archives/binary.hpp +0 -169
  58. data/vendor/cereal/include/cereal/archives/json.hpp +0 -1019
  59. data/vendor/cereal/include/cereal/archives/portable_binary.hpp +0 -334
  60. data/vendor/cereal/include/cereal/archives/xml.hpp +0 -956
  61. data/vendor/cereal/include/cereal/cereal.hpp +0 -1089
  62. data/vendor/cereal/include/cereal/details/helpers.hpp +0 -422
  63. data/vendor/cereal/include/cereal/details/polymorphic_impl.hpp +0 -796
  64. data/vendor/cereal/include/cereal/details/polymorphic_impl_fwd.hpp +0 -65
  65. data/vendor/cereal/include/cereal/details/static_object.hpp +0 -127
  66. data/vendor/cereal/include/cereal/details/traits.hpp +0 -1411
  67. data/vendor/cereal/include/cereal/details/util.hpp +0 -84
  68. data/vendor/cereal/include/cereal/external/base64.hpp +0 -134
  69. data/vendor/cereal/include/cereal/external/rapidjson/allocators.h +0 -284
  70. data/vendor/cereal/include/cereal/external/rapidjson/cursorstreamwrapper.h +0 -78
  71. data/vendor/cereal/include/cereal/external/rapidjson/document.h +0 -2652
  72. data/vendor/cereal/include/cereal/external/rapidjson/encodedstream.h +0 -299
  73. data/vendor/cereal/include/cereal/external/rapidjson/encodings.h +0 -716
  74. data/vendor/cereal/include/cereal/external/rapidjson/error/en.h +0 -74
  75. data/vendor/cereal/include/cereal/external/rapidjson/error/error.h +0 -161
  76. data/vendor/cereal/include/cereal/external/rapidjson/filereadstream.h +0 -99
  77. data/vendor/cereal/include/cereal/external/rapidjson/filewritestream.h +0 -104
  78. data/vendor/cereal/include/cereal/external/rapidjson/fwd.h +0 -151
  79. data/vendor/cereal/include/cereal/external/rapidjson/internal/biginteger.h +0 -290
  80. data/vendor/cereal/include/cereal/external/rapidjson/internal/diyfp.h +0 -271
  81. data/vendor/cereal/include/cereal/external/rapidjson/internal/dtoa.h +0 -245
  82. data/vendor/cereal/include/cereal/external/rapidjson/internal/ieee754.h +0 -78
  83. data/vendor/cereal/include/cereal/external/rapidjson/internal/itoa.h +0 -308
  84. data/vendor/cereal/include/cereal/external/rapidjson/internal/meta.h +0 -186
  85. data/vendor/cereal/include/cereal/external/rapidjson/internal/pow10.h +0 -55
  86. data/vendor/cereal/include/cereal/external/rapidjson/internal/regex.h +0 -740
  87. data/vendor/cereal/include/cereal/external/rapidjson/internal/stack.h +0 -232
  88. data/vendor/cereal/include/cereal/external/rapidjson/internal/strfunc.h +0 -69
  89. data/vendor/cereal/include/cereal/external/rapidjson/internal/strtod.h +0 -290
  90. data/vendor/cereal/include/cereal/external/rapidjson/internal/swap.h +0 -46
  91. data/vendor/cereal/include/cereal/external/rapidjson/istreamwrapper.h +0 -128
  92. data/vendor/cereal/include/cereal/external/rapidjson/memorybuffer.h +0 -70
  93. data/vendor/cereal/include/cereal/external/rapidjson/memorystream.h +0 -71
  94. data/vendor/cereal/include/cereal/external/rapidjson/msinttypes/inttypes.h +0 -316
  95. data/vendor/cereal/include/cereal/external/rapidjson/msinttypes/stdint.h +0 -300
  96. data/vendor/cereal/include/cereal/external/rapidjson/ostreamwrapper.h +0 -81
  97. data/vendor/cereal/include/cereal/external/rapidjson/pointer.h +0 -1414
  98. data/vendor/cereal/include/cereal/external/rapidjson/prettywriter.h +0 -277
  99. data/vendor/cereal/include/cereal/external/rapidjson/rapidjson.h +0 -656
  100. data/vendor/cereal/include/cereal/external/rapidjson/reader.h +0 -2230
  101. data/vendor/cereal/include/cereal/external/rapidjson/schema.h +0 -2497
  102. data/vendor/cereal/include/cereal/external/rapidjson/stream.h +0 -223
  103. data/vendor/cereal/include/cereal/external/rapidjson/stringbuffer.h +0 -121
  104. data/vendor/cereal/include/cereal/external/rapidjson/writer.h +0 -709
  105. data/vendor/cereal/include/cereal/external/rapidxml/license.txt +0 -52
  106. data/vendor/cereal/include/cereal/external/rapidxml/manual.html +0 -406
  107. data/vendor/cereal/include/cereal/external/rapidxml/rapidxml.hpp +0 -2624
  108. data/vendor/cereal/include/cereal/external/rapidxml/rapidxml_iterators.hpp +0 -175
  109. data/vendor/cereal/include/cereal/external/rapidxml/rapidxml_print.hpp +0 -428
  110. data/vendor/cereal/include/cereal/external/rapidxml/rapidxml_utils.hpp +0 -123
  111. data/vendor/cereal/include/cereal/macros.hpp +0 -154
  112. data/vendor/cereal/include/cereal/specialize.hpp +0 -139
  113. data/vendor/cereal/include/cereal/types/array.hpp +0 -79
  114. data/vendor/cereal/include/cereal/types/atomic.hpp +0 -55
  115. data/vendor/cereal/include/cereal/types/base_class.hpp +0 -203
  116. data/vendor/cereal/include/cereal/types/bitset.hpp +0 -176
  117. data/vendor/cereal/include/cereal/types/boost_variant.hpp +0 -164
  118. data/vendor/cereal/include/cereal/types/chrono.hpp +0 -72
  119. data/vendor/cereal/include/cereal/types/common.hpp +0 -129
  120. data/vendor/cereal/include/cereal/types/complex.hpp +0 -56
  121. data/vendor/cereal/include/cereal/types/concepts/pair_associative_container.hpp +0 -73
  122. data/vendor/cereal/include/cereal/types/deque.hpp +0 -62
  123. data/vendor/cereal/include/cereal/types/forward_list.hpp +0 -68
  124. data/vendor/cereal/include/cereal/types/functional.hpp +0 -43
  125. data/vendor/cereal/include/cereal/types/list.hpp +0 -62
  126. data/vendor/cereal/include/cereal/types/map.hpp +0 -36
  127. data/vendor/cereal/include/cereal/types/memory.hpp +0 -425
  128. data/vendor/cereal/include/cereal/types/optional.hpp +0 -66
  129. data/vendor/cereal/include/cereal/types/polymorphic.hpp +0 -483
  130. data/vendor/cereal/include/cereal/types/queue.hpp +0 -132
  131. data/vendor/cereal/include/cereal/types/set.hpp +0 -103
  132. data/vendor/cereal/include/cereal/types/stack.hpp +0 -76
  133. data/vendor/cereal/include/cereal/types/string.hpp +0 -61
  134. data/vendor/cereal/include/cereal/types/tuple.hpp +0 -123
  135. data/vendor/cereal/include/cereal/types/unordered_map.hpp +0 -36
  136. data/vendor/cereal/include/cereal/types/unordered_set.hpp +0 -99
  137. data/vendor/cereal/include/cereal/types/utility.hpp +0 -47
  138. data/vendor/cereal/include/cereal/types/valarray.hpp +0 -89
  139. data/vendor/cereal/include/cereal/types/variant.hpp +0 -109
  140. data/vendor/cereal/include/cereal/types/vector.hpp +0 -112
  141. data/vendor/cereal/include/cereal/version.hpp +0 -52
  142. data/vendor/isotree/src/Makevars +0 -4
  143. data/vendor/isotree/src/crit.cpp +0 -912
  144. data/vendor/isotree/src/dist.cpp +0 -749
  145. data/vendor/isotree/src/extended.cpp +0 -790
  146. data/vendor/isotree/src/fit_model.cpp +0 -1090
  147. data/vendor/isotree/src/helpers_iforest.cpp +0 -324
  148. data/vendor/isotree/src/isoforest.cpp +0 -771
  149. data/vendor/isotree/src/mult.cpp +0 -607
  150. data/vendor/isotree/src/predict.cpp +0 -853
  151. data/vendor/isotree/src/utils.cpp +0 -1566
data/lib/isotree/isolation_forest.rb

```diff
@@ -1,38 +1,59 @@
 module IsoTree
   class IsolationForest
     def initialize(
-      sample_size: nil, ntrees: 500, ndim: 3, ntry: 3,
-      prob_pick_avg_gain: 0, prob_pick_pooled_gain: 0,
-      prob_split_avg_gain: 0, prob_split_pooled_gain: 0,
-      min_gain: 0, missing_action: "impute", new_categ_action: "smallest",
-      categ_split_type: "subset", all_perm: false, coef_by_prop: false,
-      sample_with_replacement: false, penalize_range: true,
-      weigh_by_kurtosis: false, coefs: "normal", min_imp_obs: 3, depth_imp: "higher",
-      weigh_imp_rows: "inverse", random_seed: 1, nthreads: -1
+      sample_size: "auto", ntrees: 500, ndim: 3, ntry: 1,
+      # categ_cols: nil,
+      max_depth: "auto", ncols_per_tree: nil,
+      prob_pick_pooled_gain: 0.0, prob_pick_avg_gain: 0.0,
+      prob_pick_full_gain: 0.0, prob_pick_dens: 0.0,
+      prob_pick_col_by_range: 0.0, prob_pick_col_by_var: 0.0, prob_pick_col_by_kurt: 0.0,
+      min_gain: 0.0, missing_action: "auto", new_categ_action: "auto",
+      categ_split_type: "auto", all_perm: false, coef_by_prop: false,
+      # recode_categ: false,
+      weights_as_sample_prob: true,
+      sample_with_replacement: false, penalize_range: false, standardize_data: true,
+      scoring_metric: "depth", fast_bratio: true,
+      weigh_by_kurtosis: false, coefs: "uniform", assume_full_distr: true,
+      # build_imputer: false,
+      min_imp_obs: 3, depth_imp: "higher",
+      weigh_imp_rows: "inverse", random_seed: 1, use_long_double: false, nthreads: -1
     )
 
       @sample_size = sample_size
       @ntrees = ntrees
       @ndim = ndim
       @ntry = ntry
-      @prob_pick_avg_gain = prob_pick_avg_gain
+      # @categ_cols = categ_cols
+      @max_depth = max_depth
+      @ncols_per_tree = ncols_per_tree
       @prob_pick_pooled_gain = prob_pick_pooled_gain
-      @prob_split_avg_gain = prob_split_avg_gain
-      @prob_split_pooled_gain = prob_split_pooled_gain
+      @prob_pick_avg_gain = prob_pick_avg_gain
+      @prob_pick_full_gain = prob_pick_full_gain
+      @prob_pick_dens = prob_pick_dens
+      @prob_pick_col_by_range = prob_pick_col_by_range
+      @prob_pick_col_by_var = prob_pick_col_by_var
+      @prob_pick_col_by_kurt = prob_pick_col_by_kurt
       @min_gain = min_gain
       @missing_action = missing_action
       @new_categ_action = new_categ_action
       @categ_split_type = categ_split_type
       @all_perm = all_perm
       @coef_by_prop = coef_by_prop
+      # @recode_categ = recode_categ
+      @weights_as_sample_prob = weights_as_sample_prob
       @sample_with_replacement = sample_with_replacement
       @penalize_range = penalize_range
+      @standardize_data = standardize_data
+      @scoring_metric = scoring_metric
+      @fast_bratio = fast_bratio
       @weigh_by_kurtosis = weigh_by_kurtosis
       @coefs = coefs
+      @assume_full_distr = assume_full_distr
       @min_imp_obs = min_imp_obs
       @depth_imp = depth_imp
       @weigh_imp_rows = weigh_imp_rows
       @random_seed = random_seed
+      @use_long_double = use_long_double
 
       # etc module returns virtual cores
       nthreads = Etc.nprocessors if nthreads < 0
```
```diff
@@ -40,10 +61,16 @@ module IsoTree
     end
 
     def fit(x)
+      # make export consistent with Python library
+      update_params
+
       x = Dataset.new(x)
       prep_fit(x)
       options = data_options(x).merge(fit_options)
-      options[:sample_size] ||= options[:nrows]
+
+      if options[:sample_size] == "auto"
+        options[:sample_size] = [options[:nrows], 10000].min
+      end
 
       # prevent segfault
       options[:sample_size] = options[:nrows] if options[:sample_size] > options[:nrows]
```
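The `sample_size: "auto"` handling in `fit` can be sketched as a standalone function (`resolve_sample_size` is a hypothetical name for illustration, not part of the gem's API):

```ruby
# Sketch of the sample_size resolution (assumed behavior, mirroring the diff):
# "auto" becomes min(nrows, 10000), and any explicit value is clamped to
# nrows to prevent a segfault in the native extension.
def resolve_sample_size(sample_size, nrows)
  sample_size = [nrows, 10_000].min if sample_size == "auto"
  sample_size = nrows if sample_size > nrows
  sample_size
end

puts resolve_sample_size("auto", 500)    # prints 500
puts resolve_sample_size("auto", 50_000) # prints 10000
puts resolve_sample_size(20_000, 5_000)  # prints 5000
```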
```diff
@@ -71,18 +98,22 @@ module IsoTree
     end
 
     # same format as Python so models are compatible
-    def export_model(path)
+    def export_model(path, add_metada_file: false)
       check_fit
 
-      File.write("#{path}.metadata", JSON.pretty_generate(export_metadata))
-      Ext.serialize_ext_isoforest(@ext_iso_forest, path)
+      metadata = export_metadata
+      if add_metada_file
+        # indent 4 spaces like Python
+        File.write("#{path}.metadata", JSON.pretty_generate(metadata, indent: "    "))
+      end
+      Ext.serialize_combined(@ext_iso_forest, path, JSON.generate(metadata))
     end
 
     def self.import_model(path)
       model = new
-      metadata = JSON.parse(File.read("#{path}.metadata"))
-      model.send(:import_metadata, metadata)
-      model.instance_variable_set(:@ext_iso_forest, Ext.deserialize_ext_isoforest(path))
+      ext_iso_forest, metadata = Ext.deserialize_combined(path)
+      model.instance_variable_set(:@ext_iso_forest, ext_iso_forest)
+      model.send(:import_metadata, JSON.parse(metadata))
       model
     end
 
```
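The "indent 4 spaces like Python" comment refers to matching Python's `json.dump(..., indent=4)` formatting. That can be reproduced with nothing but the Ruby stdlib; the hash below is a made-up stand-in for the real metadata:

```ruby
require "json"

# Hypothetical stand-in for the hash built by export_metadata (illustrative only)
metadata = {
  data_info: {ncols_numeric: 2, cols_numeric: ["a", "b"]},
  model_info: {ndim: 3, build_imputer: false}
}

# JSON.pretty_generate accepts formatting options; a four-space indent
# mirrors the file layout produced by Python's json.dump(obj, f, indent=4)
pretty = JSON.pretty_generate(metadata, indent: "    ")
puts pretty
```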
```diff
@@ -94,7 +125,9 @@ module IsoTree
         ncols_categ: @categorical_columns.size,
         cols_numeric: @numeric_columns,
         cols_categ: @categorical_columns,
-        cat_levels: @categorical_columns.map { |v| @categories[v].keys }
+        cat_levels: @categorical_columns.map { |v| @categories[v].keys },
+        categ_cols: [],
+        categ_max: []
       }
 
       # Ruby-specific
```
```diff
@@ -104,6 +137,7 @@ module IsoTree
       model_info = {
         ndim: @ndim,
         nthreads: @nthreads,
+        use_long_double: @use_long_double,
         build_imputer: false
       }
 
```
```diff
@@ -112,6 +146,10 @@ module IsoTree
         params[k] = instance_variable_get("@#{k}")
       end
 
+      if params[:max_depth] == "auto"
+        params[:max_depth] = 0
+      end
+
       {
         data_info: data_info,
         model_info: model_info,
```
```diff
@@ -137,6 +175,8 @@ module IsoTree
 
       @ndim = model_info["ndim"]
       @nthreads = model_info["nthreads"]
+      @use_long_double = model_info["use_long_double"]
+      @build_imputer = model_info["build_imputer"]
 
       PARAM_KEYS.each do |k|
         instance_variable_set("@#{k}", params[k.to_s])
```
```diff
@@ -221,31 +261,71 @@ module IsoTree
     end
 
     PARAM_KEYS = %i(
-      sample_size ntrees ntry max_depth
-      prob_pick_avg_gain prob_pick_pooled_gain
-      prob_split_avg_gain prob_split_pooled_gain min_gain
-      missing_action new_categ_action categ_split_type coefs depth_imp
-      weigh_imp_rows min_imp_obs random_seed all_perm coef_by_prop
-      weights_as_sample_prob sample_with_replacement penalize_range
-      weigh_by_kurtosis assume_full_distr
+      sample_size ntrees ntry max_depth ncols_per_tree
+      prob_pick_avg_gain prob_pick_pooled_gain prob_pick_full_gain prob_pick_dens
+      prob_pick_col_by_range prob_pick_col_by_var prob_pick_col_by_kurt
+      min_gain missing_action new_categ_action categ_split_type coefs
+      depth_imp weigh_imp_rows min_imp_obs random_seed all_perm
+      coef_by_prop weights_as_sample_prob sample_with_replacement penalize_range standardize_data
+      scoring_metric fast_bratio weigh_by_kurtosis assume_full_distr
     )
 
     def fit_options
       keys = %i(
         sample_size ntrees ndim ntry
-        prob_pick_avg_gain prob_pick_pooled_gain
-        prob_split_avg_gain prob_split_pooled_gain
+        categ_cols max_depth ncols_per_tree
+        prob_pick_pooled_gain prob_pick_avg_gain
+        prob_pick_full_gain prob_pick_dens
+        prob_pick_col_by_range prob_pick_col_by_var prob_pick_col_by_kurt
         min_gain missing_action new_categ_action
         categ_split_type all_perm coef_by_prop
-        sample_with_replacement penalize_range
+        weights_as_sample_prob
+        sample_with_replacement penalize_range standardize_data
+        scoring_metric fast_bratio
         weigh_by_kurtosis coefs min_imp_obs depth_imp
-        weigh_imp_rows random_seed nthreads
+        weigh_imp_rows random_seed use_long_double nthreads
       )
       options = {}
       keys.each do |k|
        options[k] = instance_variable_get("@#{k}")
      end
+
+      if options[:max_depth] == "auto"
+        options[:max_depth] = 0
+        options[:limit_depth] = true
+      end
+
+      if options[:ncols_per_tree].nil?
+        options[:ncols_per_tree] = 0
+      end
+
       options
     end
+
+    def update_params
+      if @missing_action == "auto"
+        if @ndim == 1
+          @missing_action = "divide"
+        else
+          @missing_action = "impute"
+        end
+      end
+
+      if @new_categ_action == "auto"
+        if @ndim == 1
+          @new_categ_action = "weighted"
+        else
+          @new_categ_action = "impute"
+        end
+      end
+
+      if @categ_split_type == "auto"
+        if @ndim == 1
+          @categ_split_type = "single_categ"
+        else
+          @categ_split_type = "subset"
+        end
+      end
+    end
   end
 end
```
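The `"auto"` defaults resolved by `update_params` depend only on `ndim`. As a standalone sketch (hypothetical helper name, mirroring the logic in the diff):

```ruby
# Sketch of the "auto" resolution in update_params: the single-variable
# model (ndim == 1) and the extended model (ndim > 1) get different defaults.
def resolve_auto_params(ndim)
  if ndim == 1
    {missing_action: "divide", new_categ_action: "weighted", categ_split_type: "single_categ"}
  else
    {missing_action: "impute", new_categ_action: "impute", categ_split_type: "subset"}
  end
end

puts resolve_auto_params(1)[:missing_action]    # prints divide
puts resolve_auto_params(3)[:categ_split_type]  # prints subset
```

Resolving these before export is what keeps the serialized metadata consistent with the Python library, which stores concrete values rather than `"auto"`.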
data/lib/isotree/version.rb

```diff
@@ -1,3 +1,3 @@
 module IsoTree
-  VERSION = "0.2.2"
+  VERSION = "0.3.0"
 end
```
data/vendor/isotree/LICENSE

```diff
@@ -1,6 +1,6 @@
 BSD 2-Clause License
 
-Copyright (c) 2020, David Cortes
+Copyright (c) 2019-2022, David Cortes
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
```
data/vendor/isotree/README.md

````diff
@@ -1,11 +1,27 @@
 # IsoTree
 
-Fast and multi-threaded implementation of Extended Isolation Forest, Fair-Cut Forest, SCiForest (a.k.a. Split-Criterion iForest), and regular Isolation Forest, for outlier/anomaly detection, plus additions for imputation of missing values, distance/similarity calculation between observations, and handling of categorical data. Written in C++ with interfaces for Python and R. An additional wrapper for Ruby can be found [here](https://github.com/ankane/isotree).
+Fast and multi-threaded implementation of Isolation Forest (a.k.a. iForest) and variations of it such as Extended Isolation Forest (EIF), Split-Criterion iForest (SCiForest), Fair-Cut Forest (FCF), Robust Random-Cut Forest (RRCF), and other customizable variants, aimed at outlier/anomaly detection, plus additions for imputation of missing values, distance/similarity calculation between observations, and handling of categorical data. Written in C++ with interfaces for Python, R, and C. An additional wrapper for Ruby can be found [here](https://github.com/ankane/isotree).
 
 The new concepts in this software are described in:
+* [Revisiting randomized choices in isolation forests](https://arxiv.org/abs/2110.13402)
+* [Isolation forests: looking beyond tree depth](https://arxiv.org/abs/2111.11639)
 * [Distance approximation using Isolation Forests](https://arxiv.org/abs/1910.12362)
 * [Imputing missing values with unsupervised random trees](https://arxiv.org/abs/1911.06646)
 
+*********************
+
+For a quick introduction to the Isolation Forest concept as used in this library, see:
+* [Python introductory notebook](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/an_introduction_to_isolation_forests.ipynb).
+* [R Vignette](http://htmlpreview.github.io/?https://github.com/david-cortes/isotree/blob/master/inst/doc/An_Introduction_to_Isolation_Forests.html).
+
+Short Python example notebooks:
+* [General library usage](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_example.ipynb).
+* [Using it as imputer in a scikit-learn pipeline](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_impute.ipynb).
+* [Using it as a kernel for SVMs](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_svm_kernel_example.ipynb).
+* [Converting it to TreeLite format for faster predictions](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/treelite_example.ipynb).
+
+(R examples are available in the internal documentation)
+
 # Description
 
 Isolation Forest is an algorithm originally developed for outlier detection that consists in splitting sub-samples of the data according to some attribute/feature/column at random. The idea is that, the rarer the observation, the more likely it is that a random uniform split on some feature would put outliers alone in one branch, and the fewer splits it will take to isolate an outlier observation like this. The concept is extended to splitting hyperplanes in the extended model (i.e. splitting by more than one column at a time), and to guided (not entirely random) splits in the SCiForest model that aim at isolating outliers faster and finding clustered outliers.
````
````diff
@@ -16,6 +32,53 @@ Note that this is a black-box model that will not produce explanations or import
 
 _(Code to produce these plots can be found in the R examples in the documentation)_
 
+# Comparison against other libraries
+
+The folder [timings](https://github.com/david-cortes/isotree/blob/master/timings) contains a speed comparison against other Isolation Forest implementations in Python (SciKit-Learn, EIF) and R (IsolationForest, isofor, solitude). From the benchmarks, IsoTree tends to be at least 1 order of magnitude faster than the libraries compared against in both single-threaded and multi-threaded mode.
+
+Example timings for 100 trees and different sample sizes, CovType dataset - see the link above for full benchmark and details:
+
+| Library | Model | Time (s) 256 | Time (s) 1024 | Time (s) 10k |
+| :---: | :---: | :---: | :---: | :---: |
+| isotree | orig | 0.00161 | 0.00631 | 0.0848 |
+| isotree | ext | 0.00326 | 0.0123 | 0.168 |
+| eif | orig | 0.149 | 0.398 | 4.99 |
+| eif | ext | 0.16 | 0.428 | 5.06 |
+| h2o | orig | 9.33 | 11.21 | 14.23 |
+| h2o | ext | 1.06 | 2.07 | 17.31 |
+| scikit-learn | orig | 8.3 | 8.01 | 6.89 |
+| solitude | orig | 32.612 | 34.01 | 41.01 |
+
+Example AUC as outlier detector in typical datasets (notebook to produce results [here](https://github.com/david-cortes/isotree/blob/master/example/comparison_model_quality.ipynb)):
+
+* Satellite dataset:
+
+| Library | AUROC defaults | AUROC grid search |
+| :---: | :---: | :---: |
+| isotree | 0.70 | 0.84 |
+| eif | - | 0.714 |
+| scikit-learn | 0.687 | 0.74 |
+| h2o | 0.662 | 0.748 |
+
+* Annthyroid dataset:
+
+| Library | AUROC defaults | AUROC grid search |
+| :---: | :---: | :---: |
+| isotree | 0.80 | 0.982 |
+| eif | - | 0.808 |
+| scikit-learn | 0.836 | 0.836 |
+| h2o | 0.80 | 0.80 |
+
+*(Disclaimer: these are rather small datasets and thus these AUC estimates have high variance)*
+
+# Non-random splits
+
+While the original idea behind isolation forests consisted in deciding splits uniformly at random, it's possible to get better performance at detecting outliers in some datasets (particularly those with multimodal distributions) by determining splits according to an information gain criterion instead. The idea is described in ["Revisiting randomized choices in isolation forests"](https://arxiv.org/abs/2110.13402) along with some comparisons of different split guiding criteria.
+
+# Different outlier scoring criteria
+
+Although the intuition behind the algorithm was to look at the tree depth required for isolation, this package can also produce outlier scores based on density criteria, which provide improved results in some datasets, particularly when splitting on categorical features. The idea is described in ["Isolation forests: looking beyond tree depth"](https://arxiv.org/abs/2111.11639).
 
 # Distance / similarity calculations
 
````
````diff
@@ -30,51 +93,77 @@ The model can also be used to impute missing values in a similar fashion as kNN,
 There's already many available implementations of isolation forests for both Python and R (such as [the one from the original paper's authors'](https://sourceforge.net/projects/iforest/) or [the one in SciKit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)), but at the time of writing, all of them are lacking some important functionality and/or offer sub-optimal speed. This particular implementation offers the following:
 
 * Implements the extended model (with splitting hyperplanes) and split-criterion model (with non-random splits).
+* Can handle missing values (but performance with them is not so good).
+* Can handle categorical variables (one-hot/dummy encoding does not produce the same result).
 * Can use a mixture of random and non-random splits, and can split by weighted/pooled gain (in addition to simple average).
 * Can produce approximated pairwise distances between observations according to how many steps it takes on average to separate them down the tree.
-* Can handle missing values (but performance with them is not so good).
+* Can calculate isolation kernels or proximity matrix, which counts the proportion of trees in which two given observations end up in the same terminal node.
 * Can produce missing value imputations according to observations that fall on each terminal node.
-* Can handle categorical variables (one-hot/dummy encoding does not produce the same result).
 * Can work with sparse matrices.
+* Can use either depth-based metrics or density-based metrics for calculation of outlier scores.
 * Supports sample/observation weights, either as sampling importance or as distribution density measurement.
 * Supports user-provided column sample weights.
 * Can sample columns randomly with weights given by kurtosis.
-* Uses exact formula (not approximation as others do) for harmonic numbers at lower sample and remainder sizes.
+* Uses exact formula (not approximation as others do) for harmonic numbers at lower sample and remainder sizes, and a higher-order approximation for larger sizes.
 * Can fit trees incrementally to user-provided data samples.
 * Produces serializable model objects with reasonable file sizes.
+* Can convert the models to `treelite` format (Python-only and depending on the parameters that are used) ([example here](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/treelite_example.ipynb)).
 * Can translate the generated trees into SQL statements.
-* Fast and multi-threaded C++ code. Can be wrapped in languages other than Python/R/Ruby.
+* Fast and multi-threaded C++ code with an ISO C interface, which is architecture-agnostic, multi-platform, and with the only external dependency (Robin-Map) being optional. Can be wrapped in languages other than Python/R/Ruby.
 
 (Note that categoricals, NAs, and density-like sample weights, are treated heuristically with different options as there is no single logical extension of the original idea to them, and having them present might degrade performance/accuracy for regular numerical non-missing observations)
 
 # Installation
 
+* R:
+
+```r
+install.packages("isotree")
+```
+** *
+
 * Python:
-```python
+
+```
 pip install isotree
 ```
+or if that fails:
+```
+pip install --no-use-pep517 isotree
+```
+** *
 
-**Note for macOS users:** on macOS, the Python version of this package will compile **without** multi-threading capabilities. This is due to default apple's redistribution of `clang` not providing OpenMP modules, and aliasing it to `gcc` which causes confusions in build scripts. If you have a non-apple version of `clang` with the OpenMP modules, or if you have `gcc` installed, you can compile this package with multi-threading enabled by setting up an environment variable `ENABLE_OMP=1`:
+**Note for macOS users:** on macOS, the Python version of this package might compile **without** multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:
 ```
-export ENABLE_OMP=1
-pip install isotree
+brew install libomp
 ```
-(Alternatively, can also pass argument `enable-omp` to the `setup.py` file: `python setup.py install enable-omp`)
+And then reinstall this package: `pip install --force-reinstall isotree`.
 
-* R:
+** *
+**IMPORTANT:** the setup script will try to add compilation flag `-march=native`. This instructs the compiler to tune the package for the CPU in which it is being installed (by e.g. using AVX instructions if available), but the result might not be usable in other computers. If building a binary wheel of this package or putting it into a docker image which will be used in different machines, this can be overridden either by (a) defining an environment variable `DONT_SET_MARCH=1`, or by (b) manually supplying compilation `CFLAGS` as an environment variable with something related to architecture. For maximum compatibility (but slowest speed), it's possible to do something like this:
 
-```r
-install.packages("isotree")
+```
+export DONT_SET_MARCH=1
+pip install isotree
 ```
 
-* C++:
+or, by specifying some compilation flag for architecture:
+```
+export CFLAGS="-march=x86-64"
+export CXXFLAGS="-march=x86-64"
+pip install isotree
 ```
-git clone https://www.github.com/david-cortes/isotree.git
+** *
+
+* C and C++:
+```
+git clone --recursive https://www.github.com/david-cortes/isotree.git
 cd isotree
 mkdir build
 cd build
-cmake ..
-make
+cmake -DUSE_MARCH_NATIVE=1 ..
+cmake --build .
 
 ### for a system-wide install in linux
 sudo make install
````
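The harmonic-number bullet above refers to the normalizing constant used by isolation forests, c(n) = 2*H(n-1) - 2*(n-1)/n, where H(k) is the k-th harmonic number. A minimal Ruby sketch of the exact-vs-approximate strategy (illustrative only; the cutoff and the asymptotic expansion here are not taken from the library's source):

```ruby
EULER_GAMMA = 0.5772156649015329

# Harmonic number H(k): exact sum for small k, and the asymptotic
# approximation H(k) ~ ln(k) + gamma + 1/(2k) for larger k, echoing the
# "exact at lower sizes, higher-order approximation at larger sizes" bullet.
def harmonic(k)
  if k < 1000
    (1..k).sum { |i| 1.0 / i }
  else
    Math.log(k) + EULER_GAMMA + 1.0 / (2 * k)
  end
end

# Expected isolation depth for a sample of size n, used to normalize
# path lengths into outlier scores.
def expected_depth(n)
  return 0.0 if n <= 1
  2.0 * harmonic(n - 1) - 2.0 * (n - 1).fdiv(n)
end

puts expected_depth(256) # roughly 10.2 for the default-ish sample size
```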
@@ -83,16 +172,22 @@ sudo ldconfig
83
172
 
84
173
  (Will build as a shared object - linkage is then done with `-lisotree`)

- * Ruby
+ Be aware that the snippet above includes the option `-DUSE_MARCH_NATIVE=1`, which will make it use the highest-available CPU instruction set (e.g. AVX2) and will produce objects that might not run on older CPUs - to build more "portable" objects, remove this option from the cmake command.
+
+ The package has an optional dependency on the [Robin-Map](https://github.com/Tessil/robin-map) library, which is added to this repository as a linked submodule. If this library is not found under `/src`, the compiler's own hashmaps will be used instead, which are less optimal.
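If the repository was cloned without `--recursive`, the submodule can still be fetched into the existing checkout with standard git commands (to be run from the repository root):

```shell
# Fetch the Robin-Map submodule into an already-cloned checkout
git submodule update --init --recursive
```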
+
+ * Ruby:

 See [external repository with wrapper](https://github.com/ankane/isotree).

 # Sample usage
 
92
- **Warning: default parameters in this implementation are very different from default parameters in others such as SciKit-Learn's, and these defaults won't scale to large datasets (see documentation for details).**
185
+ **Warning: default parameters in this implementation are very different from default parameters in others such as Scikit-Learn's, and these defaults won't scale to large datasets (see documentation for details).**
93
186
 
94
187
  * Python:
95
188
 
189
+ (Library is Scikit-Learn compatible)
190
+
96
191
  ```python
97
192
  import numpy as np
98
193
  from isotree import IsolationForest
@@ -107,7 +202,7 @@ X = np.random.normal(size = (n, m))
 X = np.r_[X, np.array([3, 3]).reshape((1, m))]

 ### Fit a small isolation forest model
- iso = IsolationForest(ntrees = 10, ndim = 2, nthreads = 1)
+ iso = IsolationForest(ntrees = 10, nthreads = 1)
 iso.fit(X)

 ### Check which row has the highest outlier score
@@ -117,6 +212,7 @@ print("Point with highest outlier score: ",
 ```

 * R:
+
 (see documentation for more examples - `help(isotree::isolation.forest)`)
 ```r
  ### Random data from a standard normal distribution
@@ -135,29 +231,67 @@ iso <- isolation.forest(X, ntrees = 10, nthreads = 1)
 ### Check which row has the highest outlier score
 pred <- predict(iso, X)
 cat("Point with highest outlier score: ",
- X[which.max(pred), ], "\n")
+ X[which.max(pred), ], "\n")
 ```

 * C++:

- See file [isotree_cpp_ex.cpp](https://github.com/david-cortes/isotree/blob/master/example/isotree_cpp_ex.cpp).
+ The package comes with two different C++ interfaces: (a) a struct-based interface, which exposes the library's full functionality but performs few checks on its inputs and can be somewhat difficult to use due to the large number of arguments its functions require; and (b) a scikit-learn-like interface, in which the model is a single class with methods like 'fit' and 'predict'. The latter is less flexible than the struct-based interface, but is easier to use, and its function signatures rule out some potential errors from invalid parameter combinations.
+
+
+ See files: [isotree_cpp_ex.cpp](https://github.com/david-cortes/isotree/blob/master/example/isotree_cpp_ex.cpp) for an example with the struct-based interface; and [isotree_cpp_oop_ex.cpp](https://github.com/david-cortes/isotree/blob/master/example/isotree_cpp_oop_ex.cpp) for an example with the scikit-learn-like interface.

+ Note that the second interface does not expose all of the functionality - for example, it only supports inputs of classes 'double' and 'int', while the struct-based interface also supports 'float'/'size_t'.
+
+ * C:
+
+ See file [isotree_c_ex.c](https://github.com/david-cortes/isotree/blob/master/example/isotree_c_ex.c).
+
+ Note that the C interface is a simple wrapper over the scikit-learn-like C++ interface, but one which uses only ISO C bindings, for better compatibility and easier wrapping in other languages.
+
+ * Ruby:
+
+ See [external repository with wrapper](https://github.com/ankane/isotree).

  # Examples

- * Python: example notebook [here](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_example.ipynb), (also example as imputer in sklearn pipeline [here](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_impute.ipynb)).
+ * Python:
+     * [Example about general library usage](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_example.ipynb).
+     * [Example using it as imputer in a scikit-learn pipeline](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_impute.ipynb).
+     * [Example using it as a kernel for SVMs](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_svm_kernel_example.ipynb).
+     * [Example converting it to TreeLite format for faster predictions](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/treelite_example.ipynb).
 * R: examples available in the documentation (`help(isotree::isolation.forest)`, [link to CRAN](https://cran.r-project.org/web/packages/isotree/index.html)).
- * C++: see short example in the section above.
+ * C and C++: see short examples in the section above.
+ * Ruby: see [external repository with wrapper](https://github.com/ankane/isotree).
 
 # Documentation

 * Python: documentation is available at [ReadTheDocs](http://isotree.readthedocs.io/en/latest/).
 * R: documentation is available internally in the package (e.g. `help(isolation.forest)`) and in [CRAN](https://cran.r-project.org/web/packages/isotree/index.html).
- * C++: documentation is available in the public header (`include/isotree.hpp`) and in the source files.
+ * C++: documentation is available in the public header (`include/isotree.hpp`) and in the source files. See also the header for the scikit-learn-like interface (`include/isotree_oop.hpp`).
+ * C: the interface is not documented per se, but the same documentation from the C++ header applies to it. See also its header for some non-comprehensive comments about the parameters that functions take (`include/isotree_c.h`).
+ * Ruby: see the [external repository with wrapper](https://github.com/ankane/isotree) for the syntax and the [Python docs](http://isotree.readthedocs.io) for details about the parameters.
+
+ # Reducing library size and compilation times
+
+ By default, this library will compile with some functionalities that are unlikely to be used, and which can significantly increase the size of the library and its compilation times. If using this library on e.g. embedded devices, it is highly recommended to disable some of these options, and if creating a docker image for serving models, one might want to make it as minimal as possible. Being a templated C++ library, it generates multiple versions of its functions specialized for different types (such as C `double` and `float`), and in practice not all of the supported types are likely to be used.
+
+ In particular, the library supports usage of the `long double` type for more precise aggregated calculations (e.g. standard deviations), which is unlikely to end up being used (its usage is determined by a user-passed function argument, and it is not available in the C or C++-OOP interfaces). For a smaller library and faster compilation, support for `long double` can be disabled by:
+
+ * Defining an environment variable `NO_LONG_DOUBLE`, which will be accepted by the Python and R build systems - e.g. first run `export NO_LONG_DOUBLE=1`, then a `pip` install; or for R, run `Sys.setenv("NO_LONG_DOUBLE" = "1")` before `install.packages`.
+ * Passing option `NO_LONG_DOUBLE` to the CMake script - e.g. `cmake -DNO_LONG_DOUBLE=1 ..` (only when using the CMake system, which is not used by the Python and R versions).
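As a concrete sketch of the first option: the environment variable can only take effect when the package is actually compiled, so it is combined here with pip's standard `--no-binary` flag, which forces a from-source build instead of installing a prebuilt wheel:

```shell
export NO_LONG_DOUBLE=1
pip install --no-binary isotree isotree
```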
+
+
+ Additionally, the library will produce functions for different floating point and integer types of the input data. In practice, one usually ends up using only `double` and `int` types (these are the only types supported in the R interface and in the C and C++-OOP interfaces). When building it as a shared library through the CMake system, these can be disabled (leaving only `double` and `int` support) through option `NO_TEMPLATED_VERSIONS` - e.g.:
+ ```
+ cmake -DNO_TEMPLATED_VERSIONS=1 ..
+ ```
+ (this option is not available for the Python build system)
+
 
- # Known issues
+ # Help wanted

- When setting a random seed and using more than one thread, the results of some functions are not 100% reproducible to the last decimal - especially not for imputations. This is due to parallelized aggregations, and thus the only "fix" is to limit oneself to only one thread. The trees themselves are however not affected by this, and neither is the isolation depth (main functionality of the package).
+ The package does not currently have any functionality for visualizing trees. Pull requests adding such functionality would be welcome.
 
 # References
 
@@ -170,3 +304,7 @@ When setting a random seed and using more than one thread, the results of some f
 * Quinlan, J. Ross. C4.5: programs for machine learning. Elsevier, 2014.
 * Cortes, David. "Distance approximation using Isolation Forests." arXiv preprint arXiv:1910.12362 (2019).
 * Cortes, David. "Imputing missing values with unsupervised random trees." arXiv preprint arXiv:1911.06646 (2019).
+ * Cortes, David. "Revisiting randomized choices in isolation forests." arXiv preprint arXiv:2110.13402 (2021).
+ * Guha, Sudipto, et al. "Robust random cut forest based anomaly detection on streams." International conference on machine learning. PMLR, 2016.
+ * Cortes, David. "Isolation forests: looking beyond tree depth." arXiv preprint arXiv:2111.11639 (2021).
+ * Ting, Kai Ming, Yue Zhu, and Zhi-Hua Zhou. "Isolation kernel and its effect on SVM." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.