outliertree 0.1.2 → 0.3.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +14 -0
- data/NOTICE.txt +1 -1
- data/README.md +11 -10
- data/ext/outliertree/ext.cpp +104 -105
- data/ext/outliertree/extconf.rb +1 -1
- data/lib/outliertree/result.rb +3 -3
- data/lib/outliertree/version.rb +1 -1
- data/vendor/outliertree/README.md +77 -40
- data/vendor/outliertree/src/Makevars.in +4 -0
- data/vendor/outliertree/src/Makevars.win +4 -0
- data/vendor/outliertree/src/RcppExports.cpp +20 -9
- data/vendor/outliertree/src/Rwrapper.cpp +256 -57
- data/vendor/outliertree/src/cat_outlier.cpp +6 -6
- data/vendor/outliertree/src/clusters.cpp +114 -9
- data/vendor/outliertree/src/fit_model.cpp +505 -308
- data/vendor/outliertree/src/misc.cpp +165 -4
- data/vendor/outliertree/src/outlier_tree.hpp +159 -51
- data/vendor/outliertree/src/outliertree-win.def +3 -0
- data/vendor/outliertree/src/predict.cpp +33 -0
- data/vendor/outliertree/src/split.cpp +124 -20
- metadata +10 -8
- data/vendor/outliertree/src/Makevars +0 -3
    
        checksums.yaml
    CHANGED
    
    | @@ -1,7 +1,7 @@ | |
| 1 1 | 
             
            ---
         | 
| 2 2 | 
             
            SHA256:
         | 
| 3 | 
            -
              metadata.gz:  | 
| 4 | 
            -
              data.tar.gz:  | 
| 3 | 
            +
              metadata.gz: 2851b4b56b23141bc9f1ef5b3c448fb75d785ce0e7b38580113001898ce18e2e
         | 
| 4 | 
            +
              data.tar.gz: 817325392325bc61f1dea1363096678fe9fd578ec6026301e173447edc522752
         | 
| 5 5 | 
             
            SHA512:
         | 
| 6 | 
            -
              metadata.gz:  | 
| 7 | 
            -
              data.tar.gz:  | 
| 6 | 
            +
              metadata.gz: e1bc84c131959bb7260100b4aa1e345ae09330600252299084efe5d40fbff8d8d9d9aadf78c6384a0b6a158a1b3bbcdeff9de3ce71fca90ce938003125b37898
         | 
| 7 | 
            +
              data.tar.gz: cad988456c492f101bc71334997217a48d6588c4e2e5feee333bc074f3b14b62557d0e64429fca787705244b701eca7d685aa17f090dad94578fce172812e5a0
         | 
    
        data/CHANGELOG.md
    CHANGED
    
    | @@ -1,3 +1,17 @@ | |
| 1 | 
            +
            ## 0.3.0 (2022-06-13)
         | 
| 2 | 
            +
             | 
| 3 | 
            +
            - Updated OutlierTree to 1.8.1
         | 
| 4 | 
            +
            - Dropped support for Ruby < 2.7
         | 
| 5 | 
            +
             | 
| 6 | 
            +
            ## 0.2.1 (2021-05-23)
         | 
| 7 | 
            +
             | 
| 8 | 
            +
            - Improved performance
         | 
| 9 | 
            +
             | 
| 10 | 
            +
            ## 0.2.0 (2021-05-17)
         | 
| 11 | 
            +
             | 
| 12 | 
            +
            - Updated to Rice 4
         | 
| 13 | 
            +
            - Dropped support for Ruby < 2.6
         | 
| 14 | 
            +
             | 
| 1 15 | 
             
            ## 0.1.2 (2021-02-08)
         | 
| 2 16 |  | 
| 3 17 | 
             
            - Fixed error with missing numeric values
         | 
    
        data/NOTICE.txt
    CHANGED
    
    
    
        data/README.md
    CHANGED
    
    | @@ -1,4 +1,4 @@ | |
| 1 | 
            -
            # OutlierTree
         | 
| 1 | 
            +
            # OutlierTree Ruby
         | 
| 2 2 |  | 
| 3 3 | 
             
            :deciduous_tree: [OutlierTree](https://github.com/david-cortes/outliertree) - explainable outlier/anomaly detection - for Ruby
         | 
| 4 4 |  | 
| @@ -8,16 +8,16 @@ Produces human-readable explanations for why values are detected as outliers | |
| 8 8 | 
             
            Price (2.50) looks low given Department is Books and Sale is false
         | 
| 9 9 | 
             
            ```
         | 
| 10 10 |  | 
| 11 | 
            -
            :evergreen_tree: Check out [IsoTree](https://github.com/ankane/isotree) for an alternative approach that uses Isolation Forest
         | 
| 11 | 
            +
            :evergreen_tree: Check out [IsoTree](https://github.com/ankane/isotree-ruby) for an alternative approach that uses Isolation Forest
         | 
| 12 12 |  | 
| 13 | 
            -
            [](https://github.com/ankane/outliertree/actions)
         | 
| 13 | 
            +
            [](https://github.com/ankane/outliertree-ruby/actions)
         | 
| 14 14 |  | 
| 15 15 | 
             
            ## Installation
         | 
| 16 16 |  | 
| 17 17 | 
             
            Add this line to your application’s Gemfile:
         | 
| 18 18 |  | 
| 19 19 | 
             
            ```ruby
         | 
| 20 | 
            -
            gem  | 
| 20 | 
            +
            gem "outliertree"
         | 
| 21 21 | 
             
            ```
         | 
| 22 22 |  | 
| 23 23 | 
             
            ## Getting Started
         | 
| @@ -28,7 +28,8 @@ Prep your data | |
| 28 28 | 
             
            data = [
         | 
| 29 29 | 
             
              {department: "Books",  sale: false, price: 2.50},
         | 
| 30 30 | 
             
              {department: "Books",  sale: true,  price: 3.00},
         | 
| 31 | 
            -
              {department: "Movies", sale: false, price: 5.00}
         | 
| 31 | 
            +
              {department: "Movies", sale: false, price: 5.00},
         | 
| 32 | 
            +
              # ...
         | 
| 32 33 | 
             
            ]
         | 
| 33 34 | 
             
            ```
         | 
| 34 35 |  | 
| @@ -108,22 +109,22 @@ bundle install | |
| 108 109 |  | 
| 109 110 | 
             
            ## History
         | 
| 110 111 |  | 
| 111 | 
            -
            View the [changelog](https://github.com/ankane/outliertree/blob/master/CHANGELOG.md)
         | 
| 112 | 
            +
            View the [changelog](https://github.com/ankane/outliertree-ruby/blob/master/CHANGELOG.md)
         | 
| 112 113 |  | 
| 113 114 | 
             
            ## Contributing
         | 
| 114 115 |  | 
| 115 116 | 
             
            Everyone is encouraged to help improve this project. Here are a few ways you can help:
         | 
| 116 117 |  | 
| 117 | 
            -
            - [Report bugs](https://github.com/ankane/outliertree/issues)
         | 
| 118 | 
            -
            - Fix bugs and [submit pull requests](https://github.com/ankane/outliertree/pulls)
         | 
| 118 | 
            +
            - [Report bugs](https://github.com/ankane/outliertree-ruby/issues)
         | 
| 119 | 
            +
            - Fix bugs and [submit pull requests](https://github.com/ankane/outliertree-ruby/pulls)
         | 
| 119 120 | 
             
            - Write, clarify, or fix documentation
         | 
| 120 121 | 
             
            - Suggest or add new features
         | 
| 121 122 |  | 
| 122 123 | 
             
            To get started with development:
         | 
| 123 124 |  | 
| 124 125 | 
             
            ```sh
         | 
| 125 | 
            -
            git clone --recursive https://github.com/ankane/outliertree.git
         | 
| 126 | 
            -
            cd outliertree
         | 
| 126 | 
            +
            git clone --recursive https://github.com/ankane/outliertree-ruby.git
         | 
| 127 | 
            +
            cd outliertree-ruby
         | 
| 127 128 | 
             
            bundle install
         | 
| 128 129 | 
             
            bundle exec rake compile
         | 
| 129 130 | 
             
            bundle exec rake test
         | 
    
        data/ext/outliertree/ext.cpp
    CHANGED
    
    | @@ -2,12 +2,8 @@ | |
| 2 2 | 
             
            #include <outlier_tree.hpp>
         | 
| 3 3 |  | 
| 4 4 | 
             
            // rice
         | 
| 5 | 
            -
            #include <rice/ | 
| 6 | 
            -
            #include <rice/ | 
| 7 | 
            -
            #include <rice/Module.hpp>
         | 
| 8 | 
            -
            #include <rice/Object.hpp>
         | 
| 9 | 
            -
            #include <rice/String.hpp>
         | 
| 10 | 
            -
            #include <rice/Symbol.hpp>
         | 
| 5 | 
            +
            #include <rice/rice.hpp>
         | 
| 6 | 
            +
            #include <rice/stl.hpp>
         | 
| 11 7 |  | 
| 12 8 | 
             
            using Rice::Array;
         | 
| 13 9 | 
             
            using Rice::Hash;
         | 
| @@ -18,74 +14,77 @@ using Rice::Symbol; | |
| 18 14 | 
             
            using Rice::define_class_under;
         | 
| 19 15 | 
             
            using Rice::define_module;
         | 
| 20 16 |  | 
| 21 | 
            -
             | 
| 22 | 
            -
            Object to_ruby<std::vector<char>>(std::vector<char> const & x)
         | 
| 17 | 
            +
            namespace Rice::detail
         | 
| 23 18 | 
             
            {
         | 
| 24 | 
            -
               | 
| 25 | 
            -
               | 
| 26 | 
            -
             | 
| 27 | 
            -
               | 
| 28 | 
            -
             | 
| 29 | 
            -
             | 
| 19 | 
            +
              template<typename T>
         | 
| 20 | 
            +
              class To_Ruby<std::vector<T>>
         | 
| 21 | 
            +
              {
         | 
| 22 | 
            +
              public:
         | 
| 23 | 
            +
                VALUE convert(std::vector<T> const & x)
         | 
| 24 | 
            +
                {
         | 
| 25 | 
            +
                  auto a = rb_ary_new2(x.size());
         | 
| 26 | 
            +
                  for (const auto& v : x) {
         | 
| 27 | 
            +
                    rb_ary_push(a, To_Ruby<T>().convert(v));
         | 
| 28 | 
            +
                  }
         | 
| 29 | 
            +
                  return a;
         | 
| 30 | 
            +
                }
         | 
| 31 | 
            +
              };
         | 
| 30 32 |  | 
| 31 | 
            -
            template<>
         | 
| 32 | 
            -
             | 
| 33 | 
            -
            {
         | 
| 34 | 
            -
             | 
| 35 | 
            -
             | 
| 36 | 
            -
             | 
| 37 | 
            -
             | 
| 38 | 
            -
               | 
| 39 | 
            -
            }
         | 
| 33 | 
            +
              template<>
         | 
| 34 | 
            +
              struct Type<ColType>
         | 
| 35 | 
            +
              {
         | 
| 36 | 
            +
                static bool verify()
         | 
| 37 | 
            +
                {
         | 
| 38 | 
            +
                  return true;
         | 
| 39 | 
            +
                }
         | 
| 40 | 
            +
              };
         | 
| 40 41 |  | 
| 41 | 
            -
            template<>
         | 
| 42 | 
            -
             | 
| 43 | 
            -
            {
         | 
| 44 | 
            -
               | 
| 45 | 
            -
             | 
| 46 | 
            -
                 | 
| 47 | 
            -
             | 
| 48 | 
            -
             | 
| 49 | 
            -
             | 
| 42 | 
            +
              template<>
         | 
| 43 | 
            +
              class To_Ruby<ColType>
         | 
| 44 | 
            +
              {
         | 
| 45 | 
            +
              public:
         | 
| 46 | 
            +
                VALUE convert(ColType const & x)
         | 
| 47 | 
            +
                {
         | 
| 48 | 
            +
                  switch (x) {
         | 
| 49 | 
            +
                    case Numeric: return Symbol("numeric");
         | 
| 50 | 
            +
                    case Categorical: return Symbol("categorical");
         | 
| 51 | 
            +
                    case Ordinal: return Symbol("ordinal");
         | 
| 52 | 
            +
                    case NoType: return Symbol("no_type");
         | 
| 53 | 
            +
                  }
         | 
| 54 | 
            +
                  throw std::runtime_error("Unknown column type");
         | 
| 55 | 
            +
                }
         | 
| 56 | 
            +
              };
         | 
| 50 57 |  | 
| 51 | 
            -
            template<>
         | 
| 52 | 
            -
             | 
| 53 | 
            -
            {
         | 
| 54 | 
            -
             | 
| 55 | 
            -
             | 
| 56 | 
            -
             | 
| 57 | 
            -
             | 
| 58 | 
            -
               | 
| 59 | 
            -
            }
         | 
| 58 | 
            +
              template<>
         | 
| 59 | 
            +
              struct Type<SplitType>
         | 
| 60 | 
            +
              {
         | 
| 61 | 
            +
                static bool verify()
         | 
| 62 | 
            +
                {
         | 
| 63 | 
            +
                  return true;
         | 
| 64 | 
            +
                }
         | 
| 65 | 
            +
              };
         | 
| 60 66 |  | 
| 61 | 
            -
            template<>
         | 
| 62 | 
            -
             | 
| 63 | 
            -
            {
         | 
| 64 | 
            -
               | 
| 65 | 
            -
                 | 
| 66 | 
            -
                 | 
| 67 | 
            -
             | 
| 68 | 
            -
             | 
| 69 | 
            -
             | 
| 70 | 
            -
             | 
| 71 | 
            -
             | 
| 72 | 
            -
             | 
| 73 | 
            -
             | 
| 74 | 
            -
             | 
| 75 | 
            -
             | 
| 76 | 
            -
             | 
| 77 | 
            -
             | 
| 78 | 
            -
             | 
| 79 | 
            -
             | 
| 80 | 
            -
                 | 
| 81 | 
            -
             | 
| 82 | 
            -
                case NotInSubset: return Symbol("not_in_subset");
         | 
| 83 | 
            -
                case SingleCateg: return Symbol("single_categ");
         | 
| 84 | 
            -
                case SubTrees: return Symbol("sub_trees");
         | 
| 85 | 
            -
                case IsNa: return Symbol("is_na");
         | 
| 86 | 
            -
                case Root: return Symbol("root");
         | 
| 87 | 
            -
              }
         | 
| 88 | 
            -
              throw std::runtime_error("Unknown split type");
         | 
| 67 | 
            +
              template<>
         | 
| 68 | 
            +
              class To_Ruby<SplitType>
         | 
| 69 | 
            +
              {
         | 
| 70 | 
            +
              public:
         | 
| 71 | 
            +
                VALUE convert(SplitType const & x)
         | 
| 72 | 
            +
                {
         | 
| 73 | 
            +
                  switch (x) {
         | 
| 74 | 
            +
                    case LessOrEqual: return Symbol("less_or_equal");
         | 
| 75 | 
            +
                    case Greater: return Symbol("greater");
         | 
| 76 | 
            +
                    case Equal: return Symbol("equal");
         | 
| 77 | 
            +
                    case NotEqual: return Symbol("not_equal");
         | 
| 78 | 
            +
                    case InSubset: return Symbol("in_subset");
         | 
| 79 | 
            +
                    case NotInSubset: return Symbol("not_in_subset");
         | 
| 80 | 
            +
                    case SingleCateg: return Symbol("single_categ");
         | 
| 81 | 
            +
                    case SubTrees: return Symbol("sub_trees");
         | 
| 82 | 
            +
                    case IsNa: return Symbol("is_na");
         | 
| 83 | 
            +
                    case Root: return Symbol("root");
         | 
| 84 | 
            +
                  }
         | 
| 85 | 
            +
                  throw std::runtime_error("Unknown split type");
         | 
| 86 | 
            +
                }
         | 
| 87 | 
            +
              };
         | 
| 89 88 | 
             
            }
         | 
| 90 89 |  | 
| 91 90 | 
             
            extern "C"
         | 
| @@ -95,55 +94,55 @@ void Init_ext() | |
| 95 94 | 
             
              Module rb_mExt = define_module_under(rb_mOutlierTree, "Ext");
         | 
| 96 95 |  | 
| 97 96 | 
             
              define_class_under<Cluster>(rb_mExt, "Cluster")
         | 
| 98 | 
            -
                .define_method("upper_lim",  | 
| 99 | 
            -
                .define_method("display_lim_high",  | 
| 100 | 
            -
                .define_method("perc_below",  | 
| 101 | 
            -
                .define_method("display_lim_low",  | 
| 102 | 
            -
                .define_method("perc_above",  | 
| 103 | 
            -
                .define_method("display_mean",  | 
| 104 | 
            -
                .define_method("display_sd",  | 
| 105 | 
            -
                .define_method("cluster_size",  | 
| 106 | 
            -
                .define_method("split_point",  | 
| 107 | 
            -
                .define_method("split_subset",  | 
| 108 | 
            -
                .define_method("split_lev",  | 
| 109 | 
            -
                .define_method("split_type",  | 
| 110 | 
            -
                .define_method("column_type",  | 
| 111 | 
            -
                .define_method("has_na_branch",  | 
| 112 | 
            -
                .define_method("col_num",  | 
| 97 | 
            +
                .define_method("upper_lim", [](Cluster& self) { return self.upper_lim; })
         | 
| 98 | 
            +
                .define_method("display_lim_high", [](Cluster& self) { return self.display_lim_high; })
         | 
| 99 | 
            +
                .define_method("perc_below", [](Cluster& self) { return self.perc_below; })
         | 
| 100 | 
            +
                .define_method("display_lim_low", [](Cluster& self) { return self.display_lim_low; })
         | 
| 101 | 
            +
                .define_method("perc_above", [](Cluster& self) { return self.perc_above; })
         | 
| 102 | 
            +
                .define_method("display_mean", [](Cluster& self) { return self.display_mean; })
         | 
| 103 | 
            +
                .define_method("display_sd", [](Cluster& self) { return self.display_sd; })
         | 
| 104 | 
            +
                .define_method("cluster_size", [](Cluster& self) { return self.cluster_size; })
         | 
| 105 | 
            +
                .define_method("split_point", [](Cluster& self) { return self.split_point; })
         | 
| 106 | 
            +
                .define_method("split_subset", [](Cluster& self) { return self.split_subset; })
         | 
| 107 | 
            +
                .define_method("split_lev", [](Cluster& self) { return self.split_lev; })
         | 
| 108 | 
            +
                .define_method("split_type", [](Cluster& self) { return self.split_type; })
         | 
| 109 | 
            +
                .define_method("column_type", [](Cluster& self) { return self.column_type; })
         | 
| 110 | 
            +
                .define_method("has_na_branch", [](Cluster& self) { return self.has_NA_branch; })
         | 
| 111 | 
            +
                .define_method("col_num", [](Cluster& self) { return self.col_num; });
         | 
| 113 112 |  | 
| 114 113 | 
             
              define_class_under<ClusterTree>(rb_mExt, "ClusterTree")
         | 
| 115 | 
            -
                .define_method("parent_branch",  | 
| 116 | 
            -
                .define_method("parent",  | 
| 117 | 
            -
                .define_method("all_branches",  | 
| 118 | 
            -
                .define_method("column_type",  | 
| 119 | 
            -
                .define_method("col_num",  | 
| 120 | 
            -
                .define_method("split_point",  | 
| 121 | 
            -
                .define_method("split_subset",  | 
| 122 | 
            -
                .define_method("split_lev",  | 
| 114 | 
            +
                .define_method("parent_branch", [](ClusterTree& self) { return self.parent_branch; })
         | 
| 115 | 
            +
                .define_method("parent", [](ClusterTree& self) { return self.parent; })
         | 
| 116 | 
            +
                .define_method("all_branches", [](ClusterTree& self) { return self.all_branches; })
         | 
| 117 | 
            +
                .define_method("column_type", [](ClusterTree& self) { return self.column_type; })
         | 
| 118 | 
            +
                .define_method("col_num", [](ClusterTree& self) { return self.col_num; })
         | 
| 119 | 
            +
                .define_method("split_point", [](ClusterTree& self) { return self.split_point; })
         | 
| 120 | 
            +
                .define_method("split_subset", [](ClusterTree& self) { return self.split_subset; })
         | 
| 121 | 
            +
                .define_method("split_lev", [](ClusterTree& self) { return self.split_lev; });
         | 
| 123 122 |  | 
| 124 123 | 
             
              define_class_under<ModelOutputs>(rb_mExt, "ModelOutputs")
         | 
| 125 | 
            -
                .define_method("outlier_scores_final",  | 
| 126 | 
            -
                .define_method("outlier_columns_final",  | 
| 127 | 
            -
                .define_method("outlier_clusters_final",  | 
| 128 | 
            -
                .define_method("outlier_trees_final",  | 
| 129 | 
            -
                .define_method("outlier_depth_final",  | 
| 130 | 
            -
                .define_method("outlier_decimals_distr",  | 
| 131 | 
            -
                .define_method("min_decimals_col",  | 
| 124 | 
            +
                .define_method("outlier_scores_final", [](ModelOutputs& self) { return self.outlier_scores_final; })
         | 
| 125 | 
            +
                .define_method("outlier_columns_final", [](ModelOutputs& self) { return self.outlier_columns_final; })
         | 
| 126 | 
            +
                .define_method("outlier_clusters_final", [](ModelOutputs& self) { return self.outlier_clusters_final; })
         | 
| 127 | 
            +
                .define_method("outlier_trees_final", [](ModelOutputs& self) { return self.outlier_trees_final; })
         | 
| 128 | 
            +
                .define_method("outlier_depth_final", [](ModelOutputs& self) { return self.outlier_depth_final; })
         | 
| 129 | 
            +
                .define_method("outlier_decimals_distr", [](ModelOutputs& self) { return self.outlier_decimals_distr; })
         | 
| 130 | 
            +
                .define_method("min_decimals_col", [](ModelOutputs& self) { return self.min_decimals_col; })
         | 
| 132 131 | 
             
                .define_method(
         | 
| 133 132 | 
             
                  "all_clusters",
         | 
| 134 | 
            -
                   | 
| 133 | 
            +
                  [](ModelOutputs& self, size_t i, size_t j) {
         | 
| 135 134 | 
             
                    return self.all_clusters[i][j];
         | 
| 136 135 | 
             
                  })
         | 
| 137 136 | 
             
                .define_method(
         | 
| 138 137 | 
             
                  "all_trees",
         | 
| 139 | 
            -
                   | 
| 138 | 
            +
                  [](ModelOutputs& self, size_t i, size_t j) {
         | 
| 140 139 | 
             
                    return self.all_trees[i][j];
         | 
| 141 140 | 
             
                  });
         | 
| 142 141 |  | 
| 143 142 | 
             
              rb_mExt
         | 
| 144 | 
            -
                . | 
| 143 | 
            +
                .define_singleton_function(
         | 
| 145 144 | 
             
                  "fit_outliers_models",
         | 
| 146 | 
            -
                   | 
| 145 | 
            +
                  [](Hash options) {
         | 
| 147 146 | 
             
                    ModelOutputs model_outputs;
         | 
| 148 147 |  | 
| 149 148 | 
             
                    // data
         | 
| @@ -219,9 +218,9 @@ void Init_ext() | |
| 219 218 | 
             
                    );
         | 
| 220 219 | 
             
                    return model_outputs;
         | 
| 221 220 | 
             
                  })
         | 
| 222 | 
            -
                . | 
| 221 | 
            +
                .define_singleton_function(
         | 
| 223 222 | 
             
                  "find_new_outliers",
         | 
| 224 | 
            -
                   | 
| 223 | 
            +
                  [](ModelOutputs& model_outputs, Hash options) {
         | 
| 225 224 | 
             
                    // data
         | 
| 226 225 | 
             
                    size_t nrows = options.get<size_t, Symbol>("nrows");
         | 
| 227 226 | 
             
                    size_t ncols_numeric = options.get<size_t, Symbol>("ncols_numeric");
         | 
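The `To_Ruby` specializations in the ext.cpp hunks above follow one pattern: a `switch` over every enumerator with a `throw` as the fallback, plus an element-by-element loop for vectors. The following is a minimal sketch of that pattern without the Rice or Ruby headers, so it compiles on its own. It rests on several assumptions: `symbol_name` and `convert_all` are hypothetical names that do not appear in the gem, `std::string` stands in for Ruby's `Symbol`/`VALUE`, and `ColType` is redeclared locally using only the enumerator names visible in the diff.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Local stand-in for the ColType enum from outlier_tree.hpp. Only the
// enumerator names are taken from the diff; this declaration is an assumption.
enum ColType { Numeric, Categorical, Ordinal, NoType };

// Hypothetical helper: maps each enumerator to the symbol name used in the
// diff and throws if no case matches (same shape as To_Ruby<ColType>).
inline std::string symbol_name(ColType x)
{
  switch (x) {
    case Numeric: return "numeric";
    case Categorical: return "categorical";
    case Ordinal: return "ordinal";
    case NoType: return "no_type";
  }
  throw std::runtime_error("Unknown column type");
}

// Hypothetical helper: converts a vector one element at a time, mirroring
// the loop in To_Ruby<std::vector<T>>::convert (std::string replaces VALUE).
inline std::vector<std::string> convert_all(std::vector<ColType> const & xs)
{
  std::vector<std::string> out;
  out.reserve(xs.size());
  for (const auto& v : xs) {
    out.push_back(symbol_name(v));
  }
  return out;
}
```

Leaving out a `default:` label, as the diff does, lets most compilers warn when a new enumerator is added without a matching case, while the trailing `throw` still covers values outside the enum.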
    
        data/ext/outliertree/extconf.rb
    CHANGED
    
    
    
        data/lib/outliertree/result.rb
    CHANGED
    
    | @@ -22,7 +22,7 @@ module OutlierTree | |
| 22 22 | 
             
                      if outl_col < @numeric_columns.size
         | 
| 23 23 | 
             
                        column = @numeric_columns[outl_col]
         | 
| 24 24 | 
             
                        value = df[column][row]
         | 
| 25 | 
            -
                         | 
| 25 | 
            +
                        _decimals = model_outputs.outlier_decimals_distr[row]
         | 
| 26 26 | 
             
                      else
         | 
| 27 27 | 
             
                        column = @categorical_columns[outl_col - @numeric_columns.size]
         | 
| 28 28 | 
             
                        value = df[column][row]
         | 
| @@ -94,11 +94,11 @@ module OutlierTree | |
| 94 94 | 
             
                private
         | 
| 95 95 |  | 
| 96 96 | 
             
                def add_condition(row, split_type, cluster)
         | 
| 97 | 
            -
                   | 
| 97 | 
            +
                  _coldecim = 0
         | 
| 98 98 | 
             
                  case cluster.column_type
         | 
| 99 99 | 
             
                  when :numeric
         | 
| 100 100 | 
             
                    cond_col = @numeric_columns[cluster.col_num]
         | 
| 101 | 
            -
                     | 
| 101 | 
            +
                    _coldecim = model_outputs.min_decimals_col[cluster.col_num]
         | 
| 102 102 | 
             
                  else
         | 
| 103 103 | 
             
                    cond_col = @categorical_columns[cluster.col_num]
         | 
| 104 104 | 
             
                  end
         | 
    
        data/lib/outliertree/version.rb
    CHANGED
    
    
        data/vendor/outliertree/README.md
    CHANGED
    
    
| @@ -1,47 +1,60 @@ | |
| 1 1 | 
             
            # OutlierTree
         | 
| 2 2 |  | 
| 3 | 
            -
            Explainable outlier/anomaly detection based on smart decision tree grouping, similar in spirit to the GritBot software developed by RuleQuest research. Written in C++ with interfaces for R and Python. Supports columns of types numeric, categorical, binary/boolean, and ordinal, and can handle missing values in all of them. Ideal as a sanity checker in exploratory data analysis.
         | 
| 4 | 
            -
             | 
| 5 | 
            -
            # How it works
         | 
| 6 | 
            -
             | 
| 7 | 
            -
            Will try to fit decision trees that try to "predict" values for each column based on the values of each other column. Along the way, each time a split is evaluated, it will take the observations that fall into each branch as a homogeneous cluster in which it will search for outliers in the 1-d distribution of the column being predicted. Outliers are determined according to confidence intervals on this 1-d distribution, and need to have a large gap with respect to the next observation in sorted order to be flagged as outliers. Since outliers are searched for in a decision tree branch, it will know the conditions that make it a rare observation compared to others that meet the same conditions, and the conditions will always be correlated with the target variable (as it's being predicted from them).
         | 
| 8 | 
            -
             | 
| 9 | 
            -
            As such, it will only be able to detect outliers that can be described through a decision tree logic, and unlike other methods such as [Isolation Forests](https://github.com/david-cortes/isotree), will not be able to assign an outlier score to each observation, nor to detect outliers that are just overall rare, but will always provide a human-readable justification when it flags an outlier.
         | 
| 10 | 
            -
             | 
| 11 | 
            -
            Procedure is described in more detail in [Explainable outlier detection through decision tree conditioning](http://arxiv.org/abs/2001.00636).
         | 
| 3 | 
            +
            Explainable outlier/anomaly detection based on smart decision tree grouping, similar in spirit to the GritBot software developed by RuleQuest research. Written in C++ with interfaces for R and Python (additional Ruby wrapper can be found [here](https://github.com/ankane/outliertree/)). Supports columns of types numeric, categorical, binary/boolean, and ordinal, and can handle missing values in all of them. Ideal as a sanity checker in exploratory data analysis.
         | 
| 12 4 |  | 
| 13 5 | 
             
            # Example outputs
         | 
| 14 6 |  | 
| 15 | 
            -
            Example outliers from [hypothyroid dataset](http://archive.ics.uci.edu/ml/datasets/thyroid+disease):
         | 
| 7 | 
            +
            Example outliers from the [hypothyroid dataset](http://archive.ics.uci.edu/ml/datasets/thyroid+disease):
         | 
| 16 8 | 
             
            ```
         | 
| 17 | 
            -
            row [ | 
| 18 | 
            -
            	distribution: 95.122% <= 42. | 
| 9 | 
            +
            row [1138] - suspicious column: [age] - suspicious value: [75.00]
         | 
| 10 | 
            +
            	distribution: 95.122% <= 42.00 - [mean: 31.46] - [sd: 5.28] - [norm. obs: 39]
         | 
| 19 11 | 
             
            	given:
         | 
| 20 | 
            -
            		[pregnant] = [ | 
| 12 | 
            +
            		[pregnant] = [TRUE]
         | 
| 21 13 |  | 
| 22 14 |  | 
| 23 | 
            -
            row [ | 
| 24 | 
            -
            	distribution: 99.951% <= 7. | 
| 15 | 
            +
            row [2230] - suspicious column: [T3] - suspicious value: [10.60]
         | 
| 16 | 
            +
            	distribution: 99.951% <= 7.10 - [mean: 1.98] - [sd: 0.75] - [norm. obs: 2050]
         | 
| 25 17 | 
             
            	given:
         | 
| 26 | 
            -
            		[query | 
| 18 | 
            +
            		[query.hyperthyroid] = [FALSE]
         | 
| 19 | 
            +
             | 
| 20 | 
            +
            row [745] - suspicious column: [TT4] - suspicious value: [239.00]
         | 
| 21 | 
            +
            	distribution: 98.571% <= 177.00 - [mean: 135.23] - [sd: 12.57] - [norm. obs: 69]
         | 
| 22 | 
            +
            	given:
         | 
| 23 | 
            +
            		[FTI] between (97.96, 128.12] (value: 112.74)
         | 
| 24 | 
            +
            		[T4U] > [1.12] (value: 2.12)
         | 
| 25 | 
            +
            		[age] > [55.00] (value: 87.00)
         | 
| 27 26 | 
             
            ```
         | 
| 28 27 | 
             
            (i.e. it's saying that it's abnormal to be pregnant at the age of 75, or to not be classified as hyperthyroidal when having very high thyroid hormone levels)
         | 
| 29 28 | 
             
            (this dataset is also bundled into the R package - e.g. `data(hypothyroid)`)
         | 
| 30 29 |  | 
| 31 30 |  | 
| 32 | 
            -
            Example  | 
| 31 | 
            +
            Example outliers from the [Titanic dataset](https://www.kaggle.com/c/titanic):
         | 
| 33 32 | 
             
            ```
         | 
| 34 | 
            -
            row [ | 
| 35 | 
            -
            	distribution: 97.849% <= 15. | 
| 33 | 
            +
            row [1147] - suspicious column: [Fare] - suspicious value: [29.12]
         | 
| 34 | 
            +
            	distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
         | 
| 36 35 | 
             
            	given:
         | 
| 37 36 | 
             
            		[Pclass] = [3]
         | 
| 38 37 | 
             
            		[SibSp] = [0]
         | 
| 39 38 | 
             
            		[Embarked] = [Q]
         | 
| 39 | 
            +
             | 
| 40 | 
            +
            row [897] - suspicious column: [Fare] - suspicious value: [0.00]
         | 
| 41 | 
            +
            	distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
         | 
| 42 | 
            +
            	given:
         | 
| 43 | 
            +
            		[Pclass] = [3]
         | 
| 44 | 
            +
            		[SibSp] = [0]
         | 
| 40 45 | 
             
            ```
         | 
| 41 | 
            -
            (i.e. it's saying that the  | 
| 46 | 
            +
            (i.e. it's saying that the first person paid too much for the kind of accommodation he had, and the second person should not have gotten it for free)
         | 
| 42 47 |  | 
| 43 48 | 
             
            _Note that it can also produce other types of conditions such as 'between' (for numeric intervals) or 'in' (for categorical subsets)_
         | 
| 44 49 |  | 
| 50 | 
            +
            # How it works
         | 
| 51 | 
            +
             | 
| 52 | 
            +
            The library fits decision trees that try to "predict" the values of each column from the values of the other columns. Each time a split is evaluated, it treats the observations that fall into each branch as a homogeneous cluster and searches for outliers in the 1-d distribution of the column being predicted. Outliers are determined according to confidence intervals on this 1-d distribution, and must be separated by a large gap from the next observation in sorted order to be flagged. Since outliers are searched for within a decision tree branch, the conditions that make an observation rare compared to others meeting the same conditions are known, and these conditions will always be correlated with the target variable (as it's being predicted from them).
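
The idea of searching a single branch for values that are both extreme and separated by a gap can be illustrated with a small sketch. This is not the library's code: the function name `upper_tail_outliers`, the thresholds `z_min` and `gap_min`, and the use of the median and MAD as centre and scale are all assumptions made for illustration, and only the upper tail is examined.

```python
import numpy as np

def upper_tail_outliers(values, z_min=3.0, gap_min=2.0):
    """Return upper-tail values that are extreme AND sit above a large gap.

    Hypothetical helper, not part of outliertree. Centre and scale are
    estimated with the median and the MAD (an assumption made here so that
    the outliers themselves do not inflate the scale estimate).
    """
    x = np.sort(np.asarray(values, dtype=float))
    center = np.median(x)
    scale = 1.4826 * np.median(np.abs(x - center))  # MAD rescaled to sd units
    if scale == 0:
        return x[:0]
    z = (x - center) / scale
    # Walk down from the largest value while it is still beyond the threshold,
    # and flag everything above the first sufficiently large gap.
    for i in range(len(x) - 1, 0, -1):
        if z[i] < z_min:
            break
        if z[i] - z[i - 1] >= gap_min:
            return x[i:]
    return x[:0]

# Toy data: the first 11 rows satisfy a branch condition (e.g. pregnant = TRUE).
ages = np.array([28, 31, 25, 35, 30, 33, 29, 27, 36, 32, 75,
                 60, 68, 72, 55, 80, 45, 50, 64, 70])
in_branch = np.arange(len(ages)) < 11

flagged_overall = upper_tail_outliers(ages)
flagged_in_branch = upper_tail_outliers(ages[in_branch])
print("whole column:", flagged_overall)
print("within branch:", flagged_in_branch)
```

In this toy data an age of 75 is unremarkable in the whole column but stands out once the observations are restricted to the branch, which is the kind of conditional outlier the README describes.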
         | 
| 53 | 
            +
             | 
| 54 | 
            +
            As such, it will only be able to detect outliers that can be described through a decision tree logic, and unlike other methods such as [Isolation Forests](https://github.com/david-cortes/isotree), will not be able to assign an outlier score to each observation, nor to detect outliers that are just overall rare, but will always provide a human-readable justification when it flags an outlier.
         | 
| 55 | 
            +
             | 
| 56 | 
            +
            Procedure is described in more detail in [Explainable outlier detection through decision tree conditioning](http://arxiv.org/abs/2001.00636).
         | 
| 57 | 
            +
             | 
| 45 58 | 
             
            # Installation
         | 
| 46 59 |  | 
| 47 60 | 
             
            * For R:
         | 
| @@ -54,18 +67,38 @@ install.packages("outliertree") | |
| 54 67 | 
             
            ```
         | 
| 55 68 | 
             
            pip install outliertree
         | 
| 56 69 | 
             
            ```
         | 
| 57 | 
            -
             | 
| 70 | 
            +
            or if that fails:
         | 
| 71 | 
            +
            ```
         | 
| 72 | 
            +
            pip install --no-use-pep517 outliertree
         | 
| 73 | 
            +
            ```
         | 
| 74 | 
            +
            ** *
         | 
| 58 75 |  | 
| 59 | 
            -
            **Note for macOS users:** on macOS, the Python version of this package  | 
| 76 | 
            +
            **Note for macOS users:** on macOS, the Python version of this package might compile **without** multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:
         | 
| 60 77 | 
             
            ```
         | 
| 61 | 
            -
             | 
| 78 | 
            +
            brew install libomp
         | 
| 79 | 
            +
            ```
         | 
| 80 | 
            +
            And then reinstall this package: `pip install --force-reinstall outliertree`.
         | 
| 81 | 
            +
             | 
| 82 | 
            +
            ** *
         | 
| 83 | 
            +
            **IMPORTANT:** the setup script will try to add compilation flag `-march=native`. This instructs the compiler to tune the package for the CPU on which it is being installed, but the result might not be usable on other computers. If building a binary wheel of this package or putting it into a docker image which will be used on different machines, this can be overridden by manually supplying compilation `CFLAGS` and `CXXFLAGS` as environment variables with something related to architecture. For maximum compatibility (but slowest speed), assuming `x86-64` computers, it's possible to do something like this:
         | 
| 84 | 
            +
             | 
| 85 | 
            +
            ```
         | 
| 86 | 
            +
            export CFLAGS="-march=x86-64"
         | 
| 87 | 
            +
            export CXXFLAGS="-march=x86-64"
         | 
| 62 88 | 
             
            pip install outliertree
         | 
| 63 89 | 
             
            ```
         | 
| 64 | 
            -
            (Alternatively, can also pass argument `enable-omp` to the `setup.py` file: `python setup.py install enable-omp`)
         | 
| 65 90 |  | 
| 91 | 
            +
            or for creating wheels:
         | 
| 92 | 
            +
            ```
         | 
| 93 | 
            +
            export CFLAGS="-march=x86-64"
         | 
| 94 | 
            +
            export CXXFLAGS="-march=x86-64"
         | 
| 95 | 
            +
            python setup.py bdist_wheel
         | 
| 96 | 
            +
            ```
         | 
| 97 | 
            +
            ** *
         | 
| 66 98 |  | 
| 67 99 | 
             
            * For C++: package doesn't have a build system, nor a `main` function that can produce an executable, but can be built as a shared object and wrapped into other languages with any C++11-compliant compiler (`-std=c++11` in most compilers, `/std:c++14` in MSVC). For parallelization, needs OpenMP linkage (`-fopenmp` in most compilers, `/openmp` in MSVC). Package should *not* be built with optimization higher than `-O3` (i.e. don't use `-Ofast`). Needs linkage to the `math` library, which should be enabled by default in most C++ compilers, but otherwise would require the `-lm` argument. No external dependencies are required.
         | 
| 68 100 |  | 
| 101 | 
            +
            * For Ruby: see [external repository with wrapper](https://github.com/ankane/outliertree/).
         | 
| 69 102 |  | 
| 70 103 | 
             
            # Sample usage
         | 
| 71 104 |  | 
| @@ -77,18 +110,18 @@ library(outliertree) | |
| 77 110 | 
             
            nrows = 100
         | 
| 78 111 | 
             
            set.seed(1)
         | 
| 79 112 | 
             
            df = data.frame(
         | 
| 80 | 
            -
             | 
| 81 | 
            -
             | 
| 82 | 
            -
             | 
| 83 | 
            -
             | 
| 113 | 
            +
                numeric_col1 = c(rnorm(nrows - 1), 1e6),
         | 
| 114 | 
            +
                numeric_col2 = rgamma(nrows, 1),
         | 
| 115 | 
            +
                categ_col    = sample(c('categA', 'categB', 'categC'), size = nrows, replace = TRUE)
         | 
| 116 | 
            +
            )
         | 
| 84 117 |  | 
| 85 118 | 
             
            ### test data frame with another obvious outlier
         | 
| 86 119 | 
             
            nrows_test = 50
         | 
| 87 120 | 
             
            df_test = data.frame(
         | 
| 88 | 
            -
             | 
| 89 | 
            -
             | 
| 90 | 
            -
             | 
| 91 | 
            -
             | 
| 121 | 
            +
                numeric_col1 = rnorm(nrows_test),
         | 
| 122 | 
            +
                numeric_col2 = c(-1e6, rgamma(nrows_test - 1, 1)),
         | 
| 123 | 
            +
                categ_col    = sample(c('categA', 'categB', 'categC'), size = nrows_test, replace = TRUE)
         | 
| 124 | 
            +
            )
         | 
| 92 125 |  | 
| 93 126 | 
             
            ### fit model
         | 
| 94 127 | 
             
            outliers_model = outliertree::outlier.tree(df, outliers_print = 10, save_outliers = TRUE)
         | 
| @@ -113,17 +146,17 @@ from outliertree import OutlierTree | |
| 113 146 | 
             
            nrows = 100
         | 
| 114 147 | 
             
            np.random.seed(1)
         | 
| 115 148 | 
             
            df = pd.DataFrame({
         | 
| 116 | 
            -
             | 
| 117 | 
            -
             | 
| 118 | 
            -
             | 
| 119 | 
            -
             | 
| 149 | 
            +
                "numeric_col1" : np.r_[np.random.normal(size = nrows - 1), np.array([float(1e6)])],
         | 
| 150 | 
            +
                "numeric_col2" : np.random.gamma(1, 1, size = nrows),
         | 
| 151 | 
            +
                "categ_col"    : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
         | 
| 152 | 
            +
            })
         | 
| 120 153 |  | 
| 121 154 | 
             
            ### test data frame with another obvious outlier
         | 
| 122 155 | 
             
            df_test = pd.DataFrame({
         | 
| 123 | 
            -
             | 
| 124 | 
            -
             | 
| 125 | 
            -
             | 
| 126 | 
            -
             | 
| 156 | 
            +
                "numeric_col1" : np.random.normal(size = nrows),
         | 
| 157 | 
            +
                "numeric_col2" : np.r_[np.array([float(-1e6)]), np.random.gamma(1, 1, size = nrows - 1)],
         | 
| 158 | 
            +
                "categ_col"    : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
         | 
| 159 | 
            +
            })
         | 
| 127 160 |  | 
| 128 161 | 
             
            ### fit model
         | 
| 129 162 | 
             
            outliers_model = OutlierTree()
         | 
| @@ -138,6 +171,8 @@ outliers_model.print_outliers(new_outliers) | |
| 138 171 |  | 
| 139 172 | 
             
            Example [IPython notebook](http://nbviewer.ipython.org/github/david-cortes/outliertree/blob/master/example/titanic_outliertree_python.ipynb) using the Titanic dataset.
         | 
| 140 173 |  | 
| 174 | 
            +
            * For Ruby: see the [external repository](https://github.com/ankane/outliertree/).
         | 
| 175 | 
            +
             | 
| 141 176 | 
             
            * For C++: see functions `fit_outliers_models` and `find_new_outliers` in header `outlier_tree.hpp`.
         | 
| 142 177 |  | 
| 143 178 | 
             
            # Documentation
         | 
| @@ -146,6 +181,8 @@ Example [IPython notebook](http://nbviewer.ipython.org/github/david-cortes/outli | |
| 146 181 |  | 
| 147 182 | 
             
            * For Python: documentation is available at [ReadTheDocs](http://outliertree.readthedocs.io/en/latest/) (and it's also built-in in the package as docstrings, e.g. `help(outliertree.OutlierTree.fit)`).
         | 
| 148 183 |  | 
| 184 | 
            +
            * For Ruby: see the [external repository](https://github.com/ankane/outliertree/) and the [Python documentation](http://outliertree.readthedocs.io/en/latest/).
         | 
| 185 | 
            +
             | 
| 149 186 | 
             
            * For C++: documentation is available in the source files (not in the header).
         | 
| 150 187 |  | 
| 151 188 | 
             
            # References
         |