isotree 0.2.2 → 0.3.0

Files changed (151)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +8 -1
  3. data/LICENSE.txt +2 -2
  4. data/README.md +32 -14
  5. data/ext/isotree/ext.cpp +144 -31
  6. data/ext/isotree/extconf.rb +7 -7
  7. data/lib/isotree/isolation_forest.rb +110 -30
  8. data/lib/isotree/version.rb +1 -1
  9. data/vendor/isotree/LICENSE +1 -1
  10. data/vendor/isotree/README.md +165 -27
  11. data/vendor/isotree/include/isotree.hpp +2111 -0
  12. data/vendor/isotree/include/isotree_oop.hpp +394 -0
  13. data/vendor/isotree/inst/COPYRIGHTS +62 -0
  14. data/vendor/isotree/src/RcppExports.cpp +525 -52
  15. data/vendor/isotree/src/Rwrapper.cpp +1931 -268
  16. data/vendor/isotree/src/c_interface.cpp +953 -0
  17. data/vendor/isotree/src/crit.hpp +4232 -0
  18. data/vendor/isotree/src/dist.hpp +1886 -0
  19. data/vendor/isotree/src/exp_depth_table.hpp +134 -0
  20. data/vendor/isotree/src/extended.hpp +1444 -0
  21. data/vendor/isotree/src/external_facing_generic.hpp +399 -0
  22. data/vendor/isotree/src/fit_model.hpp +2401 -0
  23. data/vendor/isotree/src/{dealloc.cpp → headers_joined.hpp} +38 -22
  24. data/vendor/isotree/src/helpers_iforest.hpp +813 -0
  25. data/vendor/isotree/src/{impute.cpp → impute.hpp} +353 -122
  26. data/vendor/isotree/src/indexer.cpp +515 -0
  27. data/vendor/isotree/src/instantiate_template_headers.cpp +118 -0
  28. data/vendor/isotree/src/instantiate_template_headers.hpp +240 -0
  29. data/vendor/isotree/src/isoforest.hpp +1659 -0
  30. data/vendor/isotree/src/isotree.hpp +1804 -392
  31. data/vendor/isotree/src/isotree_exportable.hpp +99 -0
  32. data/vendor/isotree/src/merge_models.cpp +159 -16
  33. data/vendor/isotree/src/mult.hpp +1321 -0
  34. data/vendor/isotree/src/oop_interface.cpp +842 -0
  35. data/vendor/isotree/src/oop_interface.hpp +278 -0
  36. data/vendor/isotree/src/other_helpers.hpp +219 -0
  37. data/vendor/isotree/src/predict.hpp +1932 -0
  38. data/vendor/isotree/src/python_helpers.hpp +134 -0
  39. data/vendor/isotree/src/ref_indexer.hpp +154 -0
  40. data/vendor/isotree/src/robinmap/LICENSE +21 -0
  41. data/vendor/isotree/src/robinmap/README.md +483 -0
  42. data/vendor/isotree/src/robinmap/include/tsl/robin_growth_policy.h +406 -0
  43. data/vendor/isotree/src/robinmap/include/tsl/robin_hash.h +1620 -0
  44. data/vendor/isotree/src/robinmap/include/tsl/robin_map.h +807 -0
  45. data/vendor/isotree/src/robinmap/include/tsl/robin_set.h +660 -0
  46. data/vendor/isotree/src/serialize.cpp +4300 -139
  47. data/vendor/isotree/src/sql.cpp +141 -59
  48. data/vendor/isotree/src/subset_models.cpp +174 -0
  49. data/vendor/isotree/src/utils.hpp +3808 -0
  50. data/vendor/isotree/src/xoshiro.hpp +467 -0
  51. data/vendor/isotree/src/ziggurat.hpp +405 -0
  52. metadata +38 -104
  53. data/vendor/cereal/LICENSE +0 -24
  54. data/vendor/cereal/README.md +0 -85
  55. data/vendor/cereal/include/cereal/access.hpp +0 -351
  56. data/vendor/cereal/include/cereal/archives/adapters.hpp +0 -163
  57. data/vendor/cereal/include/cereal/archives/binary.hpp +0 -169
  58. data/vendor/cereal/include/cereal/archives/json.hpp +0 -1019
  59. data/vendor/cereal/include/cereal/archives/portable_binary.hpp +0 -334
  60. data/vendor/cereal/include/cereal/archives/xml.hpp +0 -956
  61. data/vendor/cereal/include/cereal/cereal.hpp +0 -1089
  62. data/vendor/cereal/include/cereal/details/helpers.hpp +0 -422
  63. data/vendor/cereal/include/cereal/details/polymorphic_impl.hpp +0 -796
  64. data/vendor/cereal/include/cereal/details/polymorphic_impl_fwd.hpp +0 -65
  65. data/vendor/cereal/include/cereal/details/static_object.hpp +0 -127
  66. data/vendor/cereal/include/cereal/details/traits.hpp +0 -1411
  67. data/vendor/cereal/include/cereal/details/util.hpp +0 -84
  68. data/vendor/cereal/include/cereal/external/base64.hpp +0 -134
  69. data/vendor/cereal/include/cereal/external/rapidjson/allocators.h +0 -284
  70. data/vendor/cereal/include/cereal/external/rapidjson/cursorstreamwrapper.h +0 -78
  71. data/vendor/cereal/include/cereal/external/rapidjson/document.h +0 -2652
  72. data/vendor/cereal/include/cereal/external/rapidjson/encodedstream.h +0 -299
  73. data/vendor/cereal/include/cereal/external/rapidjson/encodings.h +0 -716
  74. data/vendor/cereal/include/cereal/external/rapidjson/error/en.h +0 -74
  75. data/vendor/cereal/include/cereal/external/rapidjson/error/error.h +0 -161
  76. data/vendor/cereal/include/cereal/external/rapidjson/filereadstream.h +0 -99
  77. data/vendor/cereal/include/cereal/external/rapidjson/filewritestream.h +0 -104
  78. data/vendor/cereal/include/cereal/external/rapidjson/fwd.h +0 -151
  79. data/vendor/cereal/include/cereal/external/rapidjson/internal/biginteger.h +0 -290
  80. data/vendor/cereal/include/cereal/external/rapidjson/internal/diyfp.h +0 -271
  81. data/vendor/cereal/include/cereal/external/rapidjson/internal/dtoa.h +0 -245
  82. data/vendor/cereal/include/cereal/external/rapidjson/internal/ieee754.h +0 -78
  83. data/vendor/cereal/include/cereal/external/rapidjson/internal/itoa.h +0 -308
  84. data/vendor/cereal/include/cereal/external/rapidjson/internal/meta.h +0 -186
  85. data/vendor/cereal/include/cereal/external/rapidjson/internal/pow10.h +0 -55
  86. data/vendor/cereal/include/cereal/external/rapidjson/internal/regex.h +0 -740
  87. data/vendor/cereal/include/cereal/external/rapidjson/internal/stack.h +0 -232
  88. data/vendor/cereal/include/cereal/external/rapidjson/internal/strfunc.h +0 -69
  89. data/vendor/cereal/include/cereal/external/rapidjson/internal/strtod.h +0 -290
  90. data/vendor/cereal/include/cereal/external/rapidjson/internal/swap.h +0 -46
  91. data/vendor/cereal/include/cereal/external/rapidjson/istreamwrapper.h +0 -128
  92. data/vendor/cereal/include/cereal/external/rapidjson/memorybuffer.h +0 -70
  93. data/vendor/cereal/include/cereal/external/rapidjson/memorystream.h +0 -71
  94. data/vendor/cereal/include/cereal/external/rapidjson/msinttypes/inttypes.h +0 -316
  95. data/vendor/cereal/include/cereal/external/rapidjson/msinttypes/stdint.h +0 -300
  96. data/vendor/cereal/include/cereal/external/rapidjson/ostreamwrapper.h +0 -81
  97. data/vendor/cereal/include/cereal/external/rapidjson/pointer.h +0 -1414
  98. data/vendor/cereal/include/cereal/external/rapidjson/prettywriter.h +0 -277
  99. data/vendor/cereal/include/cereal/external/rapidjson/rapidjson.h +0 -656
  100. data/vendor/cereal/include/cereal/external/rapidjson/reader.h +0 -2230
  101. data/vendor/cereal/include/cereal/external/rapidjson/schema.h +0 -2497
  102. data/vendor/cereal/include/cereal/external/rapidjson/stream.h +0 -223
  103. data/vendor/cereal/include/cereal/external/rapidjson/stringbuffer.h +0 -121
  104. data/vendor/cereal/include/cereal/external/rapidjson/writer.h +0 -709
  105. data/vendor/cereal/include/cereal/external/rapidxml/license.txt +0 -52
  106. data/vendor/cereal/include/cereal/external/rapidxml/manual.html +0 -406
  107. data/vendor/cereal/include/cereal/external/rapidxml/rapidxml.hpp +0 -2624
  108. data/vendor/cereal/include/cereal/external/rapidxml/rapidxml_iterators.hpp +0 -175
  109. data/vendor/cereal/include/cereal/external/rapidxml/rapidxml_print.hpp +0 -428
  110. data/vendor/cereal/include/cereal/external/rapidxml/rapidxml_utils.hpp +0 -123
  111. data/vendor/cereal/include/cereal/macros.hpp +0 -154
  112. data/vendor/cereal/include/cereal/specialize.hpp +0 -139
  113. data/vendor/cereal/include/cereal/types/array.hpp +0 -79
  114. data/vendor/cereal/include/cereal/types/atomic.hpp +0 -55
  115. data/vendor/cereal/include/cereal/types/base_class.hpp +0 -203
  116. data/vendor/cereal/include/cereal/types/bitset.hpp +0 -176
  117. data/vendor/cereal/include/cereal/types/boost_variant.hpp +0 -164
  118. data/vendor/cereal/include/cereal/types/chrono.hpp +0 -72
  119. data/vendor/cereal/include/cereal/types/common.hpp +0 -129
  120. data/vendor/cereal/include/cereal/types/complex.hpp +0 -56
  121. data/vendor/cereal/include/cereal/types/concepts/pair_associative_container.hpp +0 -73
  122. data/vendor/cereal/include/cereal/types/deque.hpp +0 -62
  123. data/vendor/cereal/include/cereal/types/forward_list.hpp +0 -68
  124. data/vendor/cereal/include/cereal/types/functional.hpp +0 -43
  125. data/vendor/cereal/include/cereal/types/list.hpp +0 -62
  126. data/vendor/cereal/include/cereal/types/map.hpp +0 -36
  127. data/vendor/cereal/include/cereal/types/memory.hpp +0 -425
  128. data/vendor/cereal/include/cereal/types/optional.hpp +0 -66
  129. data/vendor/cereal/include/cereal/types/polymorphic.hpp +0 -483
  130. data/vendor/cereal/include/cereal/types/queue.hpp +0 -132
  131. data/vendor/cereal/include/cereal/types/set.hpp +0 -103
  132. data/vendor/cereal/include/cereal/types/stack.hpp +0 -76
  133. data/vendor/cereal/include/cereal/types/string.hpp +0 -61
  134. data/vendor/cereal/include/cereal/types/tuple.hpp +0 -123
  135. data/vendor/cereal/include/cereal/types/unordered_map.hpp +0 -36
  136. data/vendor/cereal/include/cereal/types/unordered_set.hpp +0 -99
  137. data/vendor/cereal/include/cereal/types/utility.hpp +0 -47
  138. data/vendor/cereal/include/cereal/types/valarray.hpp +0 -89
  139. data/vendor/cereal/include/cereal/types/variant.hpp +0 -109
  140. data/vendor/cereal/include/cereal/types/vector.hpp +0 -112
  141. data/vendor/cereal/include/cereal/version.hpp +0 -52
  142. data/vendor/isotree/src/Makevars +0 -4
  143. data/vendor/isotree/src/crit.cpp +0 -912
  144. data/vendor/isotree/src/dist.cpp +0 -749
  145. data/vendor/isotree/src/extended.cpp +0 -790
  146. data/vendor/isotree/src/fit_model.cpp +0 -1090
  147. data/vendor/isotree/src/helpers_iforest.cpp +0 -324
  148. data/vendor/isotree/src/isoforest.cpp +0 -771
  149. data/vendor/isotree/src/mult.cpp +0 -607
  150. data/vendor/isotree/src/predict.cpp +0 -853
  151. data/vendor/isotree/src/utils.cpp +0 -1566
data/lib/isotree/isolation_forest.rb

```diff
@@ -1,38 +1,59 @@
 module IsoTree
   class IsolationForest
     def initialize(
-      sample_size: nil, ntrees: 500, ndim: 3, ntry: 3,
-      prob_pick_avg_gain: 0, prob_pick_pooled_gain: 0,
-      prob_split_avg_gain: 0, prob_split_pooled_gain: 0,
-      min_gain: 0, missing_action: "impute", new_categ_action: "smallest",
-      categ_split_type: "subset", all_perm: false, coef_by_prop: false,
-      sample_with_replacement: false, penalize_range: true,
-      weigh_by_kurtosis: false, coefs: "normal", min_imp_obs: 3, depth_imp: "higher",
-      weigh_imp_rows: "inverse", random_seed: 1, nthreads: -1
+      sample_size: "auto", ntrees: 500, ndim: 3, ntry: 1,
+      # categ_cols: nil,
+      max_depth: "auto", ncols_per_tree: nil,
+      prob_pick_pooled_gain: 0.0, prob_pick_avg_gain: 0.0,
+      prob_pick_full_gain: 0.0, prob_pick_dens: 0.0,
+      prob_pick_col_by_range: 0.0, prob_pick_col_by_var: 0.0, prob_pick_col_by_kurt: 0.0,
+      min_gain: 0.0, missing_action: "auto", new_categ_action: "auto",
+      categ_split_type: "auto", all_perm: false, coef_by_prop: false,
+      # recode_categ: false,
+      weights_as_sample_prob: true,
+      sample_with_replacement: false, penalize_range: false, standardize_data: true,
+      scoring_metric: "depth", fast_bratio: true,
+      weigh_by_kurtosis: false, coefs: "uniform", assume_full_distr: true,
+      # build_imputer: false,
+      min_imp_obs: 3, depth_imp: "higher",
+      weigh_imp_rows: "inverse", random_seed: 1, use_long_double: false, nthreads: -1
     )
 
       @sample_size = sample_size
       @ntrees = ntrees
       @ndim = ndim
       @ntry = ntry
-      @prob_pick_avg_gain = prob_pick_avg_gain
+      # @categ_cols = categ_cols
+      @max_depth = max_depth
+      @ncols_per_tree = ncols_per_tree
       @prob_pick_pooled_gain = prob_pick_pooled_gain
-      @prob_split_avg_gain = prob_split_avg_gain
-      @prob_split_pooled_gain = prob_split_pooled_gain
+      @prob_pick_avg_gain = prob_pick_avg_gain
+      @prob_pick_full_gain = prob_pick_full_gain
+      @prob_pick_dens = prob_pick_dens
+      @prob_pick_col_by_range = prob_pick_col_by_range
+      @prob_pick_col_by_var = prob_pick_col_by_var
+      @prob_pick_col_by_kurt = prob_pick_col_by_kurt
       @min_gain = min_gain
       @missing_action = missing_action
       @new_categ_action = new_categ_action
       @categ_split_type = categ_split_type
       @all_perm = all_perm
       @coef_by_prop = coef_by_prop
+      # @recode_categ = recode_categ
+      @weights_as_sample_prob = weights_as_sample_prob
       @sample_with_replacement = sample_with_replacement
       @penalize_range = penalize_range
+      @standardize_data = standardize_data
+      @scoring_metric = scoring_metric
+      @fast_bratio = fast_bratio
       @weigh_by_kurtosis = weigh_by_kurtosis
       @coefs = coefs
+      @assume_full_distr = assume_full_distr
       @min_imp_obs = min_imp_obs
       @depth_imp = depth_imp
       @weigh_imp_rows = weigh_imp_rows
       @random_seed = random_seed
+      @use_long_double = use_long_double
 
       # etc module returns virtual cores
       nthreads = Etc.nprocessors if nthreads < 0
```
```diff
@@ -40,10 +61,16 @@ module IsoTree
     end
 
     def fit(x)
+      # make export consistent with Python library
+      update_params
+
       x = Dataset.new(x)
       prep_fit(x)
       options = data_options(x).merge(fit_options)
-      options[:sample_size] ||= options[:nrows]
+
+      if options[:sample_size] == "auto"
+        options[:sample_size] = [options[:nrows], 10000].min
+      end
 
       # prevent segfault
       options[:sample_size] = options[:nrows] if options[:sample_size] > options[:nrows]
```
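The `sample_size: "auto"` handling in `fit` can be sketched as a standalone function (`resolve_sample_size` is a hypothetical name for illustration, not part of the gem's API):

```ruby
# Sketch of the sample_size resolution (assumed behavior, mirroring the diff):
# "auto" becomes min(nrows, 10000), and any explicit value is clamped to
# nrows to prevent a segfault in the native extension.
def resolve_sample_size(sample_size, nrows)
  sample_size = [nrows, 10_000].min if sample_size == "auto"
  sample_size = nrows if sample_size > nrows
  sample_size
end

puts resolve_sample_size("auto", 500)    # prints 500
puts resolve_sample_size("auto", 50_000) # prints 10000
puts resolve_sample_size(20_000, 5_000)  # prints 5000
```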
```diff
@@ -71,18 +98,22 @@ module IsoTree
     end
 
     # same format as Python so models are compatible
-    def export_model(path)
+    def export_model(path, add_metada_file: false)
       check_fit
 
-      File.write("#{path}.metadata", JSON.pretty_generate(export_metadata))
-      Ext.serialize_ext_isoforest(@ext_iso_forest, path)
+      metadata = export_metadata
+      if add_metada_file
+        # indent 4 spaces like Python
+        File.write("#{path}.metadata", JSON.pretty_generate(metadata, indent: "    "))
+      end
+      Ext.serialize_combined(@ext_iso_forest, path, JSON.generate(metadata))
     end
 
     def self.import_model(path)
       model = new
-      metadata = JSON.parse(File.read("#{path}.metadata"))
-      model.send(:import_metadata, metadata)
-      model.instance_variable_set(:@ext_iso_forest, Ext.deserialize_ext_isoforest(path))
+      ext_iso_forest, metadata = Ext.deserialize_combined(path)
+      model.instance_variable_set(:@ext_iso_forest, ext_iso_forest)
+      model.send(:import_metadata, JSON.parse(metadata))
       model
     end
 
```
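The "indent 4 spaces like Python" comment refers to matching Python's `json.dump(..., indent=4)` formatting. That can be reproduced with nothing but the Ruby stdlib; the hash below is a made-up stand-in for the real metadata:

```ruby
require "json"

# Hypothetical stand-in for the hash built by export_metadata (illustrative only)
metadata = {
  data_info: {ncols_numeric: 2, cols_numeric: ["a", "b"]},
  model_info: {ndim: 3, build_imputer: false}
}

# JSON.pretty_generate accepts formatting options; a four-space indent
# mirrors the file layout produced by Python's json.dump(obj, f, indent=4)
pretty = JSON.pretty_generate(metadata, indent: "    ")
puts pretty
```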
```diff
@@ -94,7 +125,9 @@ module IsoTree
         ncols_categ: @categorical_columns.size,
         cols_numeric: @numeric_columns,
         cols_categ: @categorical_columns,
-        cat_levels: @categorical_columns.map { |v| @categories[v].keys }
+        cat_levels: @categorical_columns.map { |v| @categories[v].keys },
+        categ_cols: [],
+        categ_max: []
       }
 
       # Ruby-specific
```
```diff
@@ -104,6 +137,7 @@ module IsoTree
       model_info = {
         ndim: @ndim,
         nthreads: @nthreads,
+        use_long_double: @use_long_double,
         build_imputer: false
       }
 
```
```diff
@@ -112,6 +146,10 @@ module IsoTree
         params[k] = instance_variable_get("@#{k}")
       end
 
+      if params[:max_depth] == "auto"
+        params[:max_depth] = 0
+      end
+
       {
         data_info: data_info,
         model_info: model_info,
```
```diff
@@ -137,6 +175,8 @@ module IsoTree
 
       @ndim = model_info["ndim"]
       @nthreads = model_info["nthreads"]
+      @use_long_double = model_info["use_long_double"]
+      @build_imputer = model_info["build_imputer"]
 
       PARAM_KEYS.each do |k|
         instance_variable_set("@#{k}", params[k.to_s])
```
```diff
@@ -221,31 +261,71 @@ module IsoTree
     end
 
     PARAM_KEYS = %i(
-      sample_size ntrees ntry max_depth
-      prob_pick_avg_gain prob_pick_pooled_gain
-      prob_split_avg_gain prob_split_pooled_gain min_gain
-      missing_action new_categ_action categ_split_type coefs depth_imp
-      weigh_imp_rows min_imp_obs random_seed all_perm coef_by_prop
-      weights_as_sample_prob sample_with_replacement penalize_range
-      weigh_by_kurtosis assume_full_distr
+      sample_size ntrees ntry max_depth ncols_per_tree
+      prob_pick_avg_gain prob_pick_pooled_gain prob_pick_full_gain prob_pick_dens
+      prob_pick_col_by_range prob_pick_col_by_var prob_pick_col_by_kurt
+      min_gain missing_action new_categ_action categ_split_type coefs
+      depth_imp weigh_imp_rows min_imp_obs random_seed all_perm
+      coef_by_prop weights_as_sample_prob sample_with_replacement penalize_range standardize_data
+      scoring_metric fast_bratio weigh_by_kurtosis assume_full_distr
     )
 
     def fit_options
       keys = %i(
         sample_size ntrees ndim ntry
-        prob_pick_avg_gain prob_pick_pooled_gain
-        prob_split_avg_gain prob_split_pooled_gain
+        categ_cols max_depth ncols_per_tree
+        prob_pick_pooled_gain prob_pick_avg_gain
+        prob_pick_full_gain prob_pick_dens
+        prob_pick_col_by_range prob_pick_col_by_var prob_pick_col_by_kurt
         min_gain missing_action new_categ_action
         categ_split_type all_perm coef_by_prop
-        sample_with_replacement penalize_range
+        weights_as_sample_prob
+        sample_with_replacement penalize_range standardize_data
+        scoring_metric fast_bratio
         weigh_by_kurtosis coefs min_imp_obs depth_imp
-        weigh_imp_rows random_seed nthreads
+        weigh_imp_rows random_seed use_long_double nthreads
       )
       options = {}
       keys.each do |k|
        options[k] = instance_variable_get("@#{k}")
      end
+
+      if options[:max_depth] == "auto"
+        options[:max_depth] = 0
+        options[:limit_depth] = true
+      end
+
+      if options[:ncols_per_tree].nil?
+        options[:ncols_per_tree] = 0
+      end
+
       options
     end
+
+    def update_params
+      if @missing_action == "auto"
+        if @ndim == 1
+          @missing_action = "divide"
+        else
+          @missing_action = "impute"
+        end
+      end
+
+      if @new_categ_action == "auto"
+        if @ndim == 1
+          @new_categ_action = "weighted"
+        else
+          @new_categ_action = "impute"
+        end
+      end
+
+      if @categ_split_type == "auto"
+        if @ndim == 1
+          @categ_split_type = "single_categ"
+        else
+          @categ_split_type = "subset"
+        end
+      end
+    end
   end
 end
```
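The `"auto"` defaults resolved by `update_params` depend only on `ndim`. As a standalone sketch (hypothetical helper name, mirroring the logic in the diff):

```ruby
# Sketch of the "auto" resolution in update_params: the single-variable
# model (ndim == 1) and the extended model (ndim > 1) get different defaults.
def resolve_auto_params(ndim)
  if ndim == 1
    {missing_action: "divide", new_categ_action: "weighted", categ_split_type: "single_categ"}
  else
    {missing_action: "impute", new_categ_action: "impute", categ_split_type: "subset"}
  end
end

puts resolve_auto_params(1)[:missing_action]    # prints divide
puts resolve_auto_params(3)[:categ_split_type]  # prints subset
```

Resolving these before export is what keeps the serialized metadata consistent with the Python library, which stores concrete values rather than `"auto"`.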
data/lib/isotree/version.rb

```diff
@@ -1,3 +1,3 @@
 module IsoTree
-  VERSION = "0.2.2"
+  VERSION = "0.3.0"
 end
```
data/vendor/isotree/LICENSE

```diff
@@ -1,6 +1,6 @@
 BSD 2-Clause License
 
-Copyright (c) 2020, David Cortes
+Copyright (c) 2019-2022, David Cortes
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
```
data/vendor/isotree/README.md

````diff
@@ -1,11 +1,27 @@
 # IsoTree
 
-Fast and multi-threaded implementation of Extended Isolation Forest, Fair-Cut Forest, SCiForest (a.k.a. Split-Criterion iForest), and regular Isolation Forest, for outlier/anomaly detection, plus additions for imputation of missing values, distance/similarity calculation between observations, and handling of categorical data. Written in C++ with interfaces for Python and R. An additional wrapper for Ruby can be found [here](https://github.com/ankane/isotree).
+Fast and multi-threaded implementation of Isolation Forest (a.k.a. iForest) and variations of it such as Extended Isolation Forest (EIF), Split-Criterion iForest (SCiForest), Fair-Cut Forest (FCF), Robust Random-Cut Forest (RRCF), and other customizable variants, aimed at outlier/anomaly detection, plus additions for imputation of missing values, distance/similarity calculation between observations, and handling of categorical data. Written in C++ with interfaces for Python, R, and C. An additional wrapper for Ruby can be found [here](https://github.com/ankane/isotree).
 
 The new concepts in this software are described in:
+* [Revisiting randomized choices in isolation forests](https://arxiv.org/abs/2110.13402)
+* [Isolation forests: looking beyond tree depth](https://arxiv.org/abs/2111.11639)
 * [Distance approximation using Isolation Forests](https://arxiv.org/abs/1910.12362)
 * [Imputing missing values with unsupervised random trees](https://arxiv.org/abs/1911.06646)
 
+*********************
+
+For a quick introduction to the Isolation Forest concept as used in this library, see:
+* [Python introductory notebook](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/an_introduction_to_isolation_forests.ipynb).
+* [R Vignette](http://htmlpreview.github.io/?https://github.com/david-cortes/isotree/blob/master/inst/doc/An_Introduction_to_Isolation_Forests.html).
+
+Short Python example notebooks:
+* [General library usage](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_example.ipynb).
+* [Using it as imputer in a scikit-learn pipeline](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_impute.ipynb).
+* [Using it as a kernel for SVMs](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_svm_kernel_example.ipynb).
+* [Converting it to TreeLite format for faster predictions](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/treelite_example.ipynb).
+
+(R examples are available in the internal documentation)
+
 # Description
 
 Isolation Forest is an algorithm originally developed for outlier detection that consists in splitting sub-samples of the data according to some attribute/feature/column at random. The idea is that, the rarer the observation, the more likely it is that a random uniform split on some feature would put outliers alone in one branch, and the fewer splits it will take to isolate an outlier observation like this. The concept is extended to splitting hyperplanes in the extended model (i.e. splitting by more than one column at a time), and to guided (not entirely random) splits in the SCiForest model that aim at isolating outliers faster and finding clustered outliers.
````
````diff
@@ -16,6 +32,53 @@ Note that this is a black-box model that will not produce explanations or import
 
 _(Code to produce these plots can be found in the R examples in the documentation)_
 
+# Comparison against other libraries
+
+The folder [timings](https://github.com/david-cortes/isotree/blob/master/timings) contains a speed comparison against other Isolation Forest implementations in Python (SciKit-Learn, EIF) and R (IsolationForest, isofor, solitude). From the benchmarks, IsoTree tends to be at least 1 order of magnitude faster than the libraries compared against in both single-threaded and multi-threaded mode.
+
+Example timings for 100 trees and different sample sizes, CovType dataset - see the link above for full benchmark and details:
+
+| Library | Model | Time (s) 256 | Time (s) 1024 | Time (s) 10k |
+| :---: | :---: | :---: | :---: | :---: |
+| isotree | orig | 0.00161 | 0.00631 | 0.0848 |
+| isotree | ext | 0.00326 | 0.0123 | 0.168 |
+| eif | orig | 0.149 | 0.398 | 4.99 |
+| eif | ext | 0.16 | 0.428 | 5.06 |
+| h2o | orig | 9.33 | 11.21 | 14.23 |
+| h2o | ext | 1.06 | 2.07 | 17.31 |
+| scikit-learn | orig | 8.3 | 8.01 | 6.89 |
+| solitude | orig | 32.612 | 34.01 | 41.01 |
+
+Example AUC as outlier detector in typical datasets (notebook to produce results [here](https://github.com/david-cortes/isotree/blob/master/example/comparison_model_quality.ipynb)):
+
+* Satellite dataset:
+
+| Library | AUROC defaults | AUROC grid search |
+| :---: | :---: | :---: |
+| isotree | 0.70 | 0.84 |
+| eif | - | 0.714 |
+| scikit-learn | 0.687 | 0.74 |
+| h2o | 0.662 | 0.748 |
+
+* Annthyroid dataset:
+
+| Library | AUROC defaults | AUROC grid search |
+| :---: | :---: | :---: |
+| isotree | 0.80 | 0.982 |
+| eif | - | 0.808 |
+| scikit-learn | 0.836 | 0.836 |
+| h2o | 0.80 | 0.80 |
+
+*(Disclaimer: these are rather small datasets and thus these AUC estimates have high variance)*
+
+# Non-random splits
+
+While the original idea behind isolation forests consisted in deciding splits uniformly at random, it's possible to get better performance at detecting outliers in some datasets (particularly those with multimodal distributions) by determining splits according to an information gain criterion instead. The idea is described in ["Revisiting randomized choices in isolation forests"](https://arxiv.org/abs/2110.13402) along with some comparisons of different split guiding criteria.
+
+# Different outlier scoring criteria
+
+Although the intuition behind the algorithm was to look at the tree depth required for isolation, this package can also produce outlier scores based on density criteria, which provide improved results in some datasets, particularly when splitting on categorical features. The idea is described in ["Isolation forests: looking beyond tree depth"](https://arxiv.org/abs/2111.11639).
 
 # Distance / similarity calculations
 
````
````diff
@@ -30,51 +93,77 @@ The model can also be used to impute missing values in a similar fashion as kNN,
 There's already many available implementations of isolation forests for both Python and R (such as [the one from the original paper's authors'](https://sourceforge.net/projects/iforest/) or [the one in SciKit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)), but at the time of writing, all of them are lacking some important functionality and/or offer sub-optimal speed. This particular implementation offers the following:
 
 * Implements the extended model (with splitting hyperplanes) and split-criterion model (with non-random splits).
+* Can handle missing values (but performance with them is not so good).
+* Can handle categorical variables (one-hot/dummy encoding does not produce the same result).
 * Can use a mixture of random and non-random splits, and can split by weighted/pooled gain (in addition to simple average).
 * Can produce approximated pairwise distances between observations according to how many steps it takes on average to separate them down the tree.
-* Can handle missing values (but performance with them is not so good).
+* Can calculate isolation kernels or proximity matrix, which counts the proportion of trees in which two given observations end up in the same terminal node.
 * Can produce missing value imputations according to observations that fall on each terminal node.
-* Can handle categorical variables (one-hot/dummy encoding does not produce the same result).
 * Can work with sparse matrices.
+* Can use either depth-based metrics or density-based metrics for calculation of outlier scores.
 * Supports sample/observation weights, either as sampling importance or as distribution density measurement.
 * Supports user-provided column sample weights.
 * Can sample columns randomly with weights given by kurtosis.
-* Uses exact formula (not approximation as others do) for harmonic numbers at lower sample and remainder sizes.
+* Uses exact formula (not approximation as others do) for harmonic numbers at lower sample and remainder sizes, and a higher-order approximation for larger sizes.
 * Can fit trees incrementally to user-provided data samples.
 * Produces serializable model objects with reasonable file sizes.
+* Can convert the models to `treelite` format (Python-only and depending on the parameters that are used) ([example here](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/treelite_example.ipynb)).
 * Can translate the generated trees into SQL statements.
-* Fast and multi-threaded C++ code. Can be wrapped in languages other than Python/R/Ruby.
+* Fast and multi-threaded C++ code with an ISO C interface, which is architecture-agnostic, multi-platform, and with the only external dependency (Robin-Map) being optional. Can be wrapped in languages other than Python/R/Ruby.
 
 (Note that categoricals, NAs, and density-like sample weights, are treated heuristically with different options as there is no single logical extension of the original idea to them, and having them present might degrade performance/accuracy for regular numerical non-missing observations)
 
 # Installation
 
+* R:
+
+```r
+install.packages("isotree")
+```
+** *
+
 * Python:
-```python
+
+```
 pip install isotree
 ```
+or if that fails:
+```
+pip install --no-use-pep517 isotree
+```
+** *
 
-**Note for macOS users:** on macOS, the Python version of this package will compile **without** multi-threading capabilities. This is due to default apple's redistribution of `clang` not providing OpenMP modules, and aliasing it to `gcc` which causes confusions in build scripts. If you have a non-apple version of `clang` with the OpenMP modules, or if you have `gcc` installed, you can compile this package with multi-threading enabled by setting up an environment variable `ENABLE_OMP=1`:
+**Note for macOS users:** on macOS, the Python version of this package might compile **without** multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:
 ```
-export ENABLE_OMP=1
-pip install isotree
+brew install libomp
 ```
-(Alternatively, can also pass argument `enable-omp` to the `setup.py` file: `python setup.py install enable-omp`)
+And then reinstall this package: `pip install --force-reinstall isotree`.
 
-* R:
+** *
+**IMPORTANT:** the setup script will try to add compilation flag `-march=native`. This instructs the compiler to tune the package for the CPU in which it is being installed (by e.g. using AVX instructions if available), but the result might not be usable in other computers. If building a binary wheel of this package or putting it into a docker image which will be used in different machines, this can be overridden either by (a) defining an environment variable `DONT_SET_MARCH=1`, or by (b) manually supplying compilation `CFLAGS` as an environment variable with something related to architecture. For maximum compatibility (but slowest speed), it's possible to do something like this:
 
-```r
-install.packages("isotree")
+```
+export DONT_SET_MARCH=1
+pip install isotree
 ```
 
-* C++:
+or, by specifying some compilation flag for architecture:
+```
+export CFLAGS="-march=x86-64"
+export CXXFLAGS="-march=x86-64"
+pip install isotree
 ```
-git clone https://www.github.com/david-cortes/isotree.git
+** *
+
+* C and C++:
+```
+git clone --recursive https://www.github.com/david-cortes/isotree.git
 cd isotree
 mkdir build
 cd build
-cmake ..
-make
+cmake -DUSE_MARCH_NATIVE=1 ..
+cmake --build .
 
 ### for a system-wide install in linux
 sudo make install
````
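The harmonic-number bullet above refers to the normalizing constant used by isolation forests, c(n) = 2*H(n-1) - 2*(n-1)/n, where H(k) is the k-th harmonic number. A minimal Ruby sketch of the exact-vs-approximate strategy (illustrative only; the cutoff and the asymptotic expansion here are not taken from the library's source):

```ruby
EULER_GAMMA = 0.5772156649015329

# Harmonic number H(k): exact sum for small k, and the asymptotic
# approximation H(k) ~ ln(k) + gamma + 1/(2k) for larger k, echoing the
# "exact at lower sizes, higher-order approximation at larger sizes" bullet.
def harmonic(k)
  if k < 1000
    (1..k).sum { |i| 1.0 / i }
  else
    Math.log(k) + EULER_GAMMA + 1.0 / (2 * k)
  end
end

# Expected isolation depth for a sample of size n, used to normalize
# path lengths into outlier scores.
def expected_depth(n)
  return 0.0 if n <= 1
  2.0 * harmonic(n - 1) - 2.0 * (n - 1).fdiv(n)
end

puts expected_depth(256) # roughly 10.2 for the default-ish sample size
```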
@@ -83,16 +172,22 @@ sudo ldconfig
83
172
 
84
173
  (Will build as a shared object - linkage is then done with `-lisotree`)

- * Ruby
+ Be aware that the snippet above includes the option `-DUSE_MARCH_NATIVE=1`, which will make it use the highest-available CPU instruction set (e.g. AVX2) and will produce objects that might not run on older CPUs - to build more "portable" objects, remove this option from the cmake command.
+
+ The package has an optional dependency on the [Robin-Map](https://github.com/Tessil/robin-map) library, which is added to this repository as a linked submodule. If this library is not found under `/src`, the compiler's own hashmaps will be used instead, which are less optimal.
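If the repository was cloned without `--recursive`, the submodule can still be fetched into the existing checkout with standard git commands (to be run from the repository root):

```shell
# Fetch the Robin-Map submodule into an already-cloned checkout
git submodule update --init --recursive
```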
+
+ * Ruby:

 See [external repository with wrapper](https://github.com/ankane/isotree).

 # Sample usage
 
92
- **Warning: default parameters in this implementation are very different from default parameters in others such as SciKit-Learn's, and these defaults won't scale to large datasets (see documentation for details).**
185
+ **Warning: default parameters in this implementation are very different from default parameters in others such as Scikit-Learn's, and these defaults won't scale to large datasets (see documentation for details).**
93
186
 
94
187
  * Python:
95
188
 
189
+ (Library is Scikit-Learn compatible)
190
+
96
191
  ```python
97
192
  import numpy as np
98
193
  from isotree import IsolationForest
@@ -107,7 +202,7 @@ X = np.random.normal(size = (n, m))
 X = np.r_[X, np.array([3, 3]).reshape((1, m))]

 ### Fit a small isolation forest model
- iso = IsolationForest(ntrees = 10, ndim = 2, nthreads = 1)
+ iso = IsolationForest(ntrees = 10, nthreads = 1)
 iso.fit(X)

 ### Check which row has the highest outlier score
@@ -117,6 +212,7 @@ print("Point with highest outlier score: ",
 ```

 * R:
+
 (see documentation for more examples - `help(isotree::isolation.forest)`)
 ```r
  ### Random data from a standard normal distribution
@@ -135,29 +231,67 @@ iso <- isolation.forest(X, ntrees = 10, nthreads = 1)
 ### Check which row has the highest outlier score
 pred <- predict(iso, X)
 cat("Point with highest outlier score: ",
- X[which.max(pred), ], "\n")
+ X[which.max(pred), ], "\n")
 ```

 * C++:

- See file [isotree_cpp_ex.cpp](https://github.com/david-cortes/isotree/blob/master/example/isotree_cpp_ex.cpp).
+ The package comes with two different C++ interfaces: (a) a struct-based interface, which exposes the library's full functionality but performs few checks on its inputs and can be somewhat difficult to use due to the large number of arguments its functions require; and (b) a scikit-learn-like interface, in which the model is a single class with methods like 'fit' and 'predict'. The latter is less flexible than the struct-based interface, but is easier to use, and its function signatures rule out some potential errors from invalid parameter combinations.
+
+
+ See files: [isotree_cpp_ex.cpp](https://github.com/david-cortes/isotree/blob/master/example/isotree_cpp_ex.cpp) for an example with the struct-based interface; and [isotree_cpp_oop_ex.cpp](https://github.com/david-cortes/isotree/blob/master/example/isotree_cpp_oop_ex.cpp) for an example with the scikit-learn-like interface.

+ Note that the second interface does not expose all of the functionality - for example, it only supports inputs of classes 'double' and 'int', while the struct-based interface also supports 'float'/'size_t'.
+
+ * C:
+
+ See file [isotree_c_ex.c](https://github.com/david-cortes/isotree/blob/master/example/isotree_c_ex.c).
+
+ Note that the C interface is a simple wrapper over the scikit-learn-like C++ interface, but one which uses only ISO C bindings, for better compatibility and easier wrapping in other languages.
+
+ * Ruby:
+
+ See [external repository with wrapper](https://github.com/ankane/isotree).

  # Examples

- * Python: example notebook [here](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_example.ipynb), (also example as imputer in sklearn pipeline [here](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_impute.ipynb)).
+ * Python:
+     * [Example about general library usage](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_example.ipynb).
+     * [Example using it as imputer in a scikit-learn pipeline](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_impute.ipynb).
+     * [Example using it as a kernel for SVMs](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/isotree_svm_kernel_example.ipynb).
+     * [Example converting it to TreeLite format for faster predictions](https://nbviewer.jupyter.org/github/david-cortes/isotree/blob/master/example/treelite_example.ipynb).
 * R: examples available in the documentation (`help(isotree::isolation.forest)`, [link to CRAN](https://cran.r-project.org/web/packages/isotree/index.html)).
- * C++: see short example in the section above.
+ * C and C++: see short examples in the section above.
+ * Ruby: see [external repository with wrapper](https://github.com/ankane/isotree).
 
 # Documentation

 * Python: documentation is available at [ReadTheDocs](http://isotree.readthedocs.io/en/latest/).
 * R: documentation is available internally in the package (e.g. `help(isolation.forest)`) and in [CRAN](https://cran.r-project.org/web/packages/isotree/index.html).
- * C++: documentation is available in the public header (`include/isotree.hpp`) and in the source files.
+ * C++: documentation is available in the public header (`include/isotree.hpp`) and in the source files. See also the header for the scikit-learn-like interface (`include/isotree_oop.hpp`).
+ * C: the interface is not documented per se, but the same documentation from the C++ header applies to it. See also its header for some non-comprehensive comments about the parameters that functions take (`include/isotree_c.h`).
+ * Ruby: see the [external repository with wrapper](https://github.com/ankane/isotree) for the syntax and the [Python docs](http://isotree.readthedocs.io) for details about the parameters.
+
+ # Reducing library size and compilation times
+
+ By default, this library will compile with some functionalities that are unlikely to be used, and which can significantly increase the size of the library and its compilation times. If using this library on e.g. embedded devices, it is highly recommended to disable some of these options, and if creating a docker image for serving models, one might want to make it as minimal as possible. Being a templated C++ library, it generates multiple versions of its functions specialized for different types (such as C `double` and `float`), and in practice not all of the supported types are likely to be used.
+
+ In particular, the library supports usage of the `long double` type for more precise aggregated calculations (e.g. standard deviations), which is unlikely to end up being used (its usage is determined by a user-passed function argument, and it is not available in the C or C++-OOP interfaces). For a smaller library and faster compilation, support for `long double` can be disabled by:
+
+ * Defining an environment variable `NO_LONG_DOUBLE`, which will be accepted by the Python and R build systems - e.g. first run `export NO_LONG_DOUBLE=1`, then a `pip` install; or for R, run `Sys.setenv("NO_LONG_DOUBLE" = "1")` before `install.packages`.
+ * Passing option `NO_LONG_DOUBLE` to the CMake script - e.g. `cmake -DNO_LONG_DOUBLE=1 ..` (only when using the CMake system, which is not used by the Python and R versions).
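As a concrete sketch of the first option: the environment variable can only take effect when the package is actually compiled, so it is combined here with pip's standard `--no-binary` flag, which forces a from-source build instead of installing a prebuilt wheel:

```shell
export NO_LONG_DOUBLE=1
pip install --no-binary isotree isotree
```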
+
+
+ Additionally, the library will produce functions for different floating point and integer types of the input data. In practice, one usually ends up using only `double` and `int` types (these are the only types supported in the R interface and in the C and C++-OOP interfaces). When building it as a shared library through the CMake system, these can be disabled (leaving only `double` and `int` support) through option `NO_TEMPLATED_VERSIONS` - e.g.:
+ ```
+ cmake -DNO_TEMPLATED_VERSIONS=1 ..
+ ```
+ (this option is not available for the Python build system)
+
 
- # Known issues
+ # Help wanted

- When setting a random seed and using more than one thread, the results of some functions are not 100% reproducible to the last decimal - especially not for imputations. This is due to parallelized aggregations, and thus the only "fix" is to limit oneself to only one thread. The trees themselves are however not affected by this, and neither is the isolation depth (main functionality of the package).
+ The package does not currently have any functionality for visualizing trees. Pull requests adding such functionality would be welcome.
 
 # References
 
@@ -170,3 +304,7 @@ When setting a random seed and using more than one thread, the results of some f
 * Quinlan, J. Ross. C4.5: programs for machine learning. Elsevier, 2014.
 * Cortes, David. "Distance approximation using Isolation Forests." arXiv preprint arXiv:1910.12362 (2019).
 * Cortes, David. "Imputing missing values with unsupervised random trees." arXiv preprint arXiv:1911.06646 (2019).
+ * Cortes, David. "Revisiting randomized choices in isolation forests." arXiv preprint arXiv:2110.13402 (2021).
+ * Guha, Sudipto, et al. "Robust random cut forest based anomaly detection on streams." International conference on machine learning. PMLR, 2016.
+ * Cortes, David. "Isolation forests: looking beyond tree depth." arXiv preprint arXiv:2111.11639 (2021).
+ * Ting, Kai Ming, Yue Zhu, and Zhi-Hua Zhou. "Isolation kernel and its effect on SVM." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.