outliertree 0.1.0

@@ -0,0 +1,155 @@
1
+ # OutlierTree
2
+
3
+ Explainable outlier/anomaly detection based on smart decision tree grouping, similar in spirit to the GritBot software developed by RuleQuest research. Written in C++ with interfaces for R and Python. Supports columns of types numeric, categorical, binary/boolean, and ordinal, and can handle missing values in all of them. Ideal as a sanity checker in exploratory data analysis.
4
+
5
+ # How it works
6
+
7
+ The procedure fits decision trees that try to "predict" the values of each column from the values of the other columns. Each time a split is evaluated, the observations that fall into each branch are treated as a homogeneous cluster, and outliers are searched for in the 1-d distribution of the column being predicted within that cluster. Outliers are determined according to confidence intervals on this 1-d distribution, and must be separated by a large gap from the next observation in sorted order to be flagged. Since outliers are searched for within a decision tree branch, the conditions that make an observation rare compared to others meeting the same conditions are known, and these conditions will always be correlated with the target variable (as it is being predicted from them).
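
As a rough illustration of the 1-d search described above, the sketch below flags values in a single cluster that exceed an upper confidence limit and are also separated by a large gap from the largest non-flagged value. This is a simplification, not the package's implementation: the function name `flag_high_outliers` and the parameters `z_thr` and `min_gap_sd` are hypothetical, the mean and standard deviation are taken over all values in the cluster, and only the upper tail is checked.

```python
import statistics

def flag_high_outliers(values, z_thr=3.0, min_gap_sd=1.0):
    """Return values above mean + z_thr * sd that are also separated from
    the rest by a gap of at least min_gap_sd standard deviations.

    Hypothetical helper for illustration only; thresholds are assumptions.
    """
    xs = sorted(values)
    if len(xs) < 3:
        return []
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    if sd == 0:
        return []
    limit = mean + z_thr * sd

    normal = [x for x in xs if x <= limit]
    candidates = [x for x in xs if x > limit]
    if not normal or not candidates:
        return []

    # Gap between the smallest candidate and the largest "normal" value.
    gap = candidates[0] - normal[-1]
    return candidates if gap >= min_gap_sd * sd else []

# One cluster (e.g. ages of rows meeting the same conditions) with one odd value.
ages = [30, 31, 29, 32, 28, 33, 30, 31, 29, 30, 32, 31, 75]
print(flag_high_outliers(ages))
```

Requiring a gap in addition to the confidence limit means that the tail of a long but continuous distribution is not flagged just because it lies far from the mean.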
8
+
9
+ As such, it will only be able to detect outliers that can be described through a decision tree logic, and unlike other methods such as [Isolation Forests](https://github.com/david-cortes/isotree), will not be able to assign an outlier score to each observation, nor to detect outliers that are just overall rare, but will always provide a human-readable justification when it flags an outlier.
10
+
11
+ Procedure is described in more detail in [Explainable outlier detection through decision tree conditioning](http://arxiv.org/abs/2001.00636).
12
+
13
+ # Example outputs
14
+
15
+ Example outliers from [hypothyroid dataset](http://archive.ics.uci.edu/ml/datasets/thyroid+disease):
16
+ ```
17
+ row [1137] - suspicious column: [age] - suspicious value: [75.000]
18
+ distribution: 95.122% <= 42.000 - [mean: 31.462] - [sd: 5.281] - [norm. obs: 39]
19
+ given:
20
+ [pregnant] = [t]
21
+
22
+
23
+ row [2229] - suspicious column: [T3] - suspicious value: [10.600]
24
+ distribution: 99.951% <= 7.100 - [mean: 1.984] - [sd: 0.750] - [norm. obs: 2050]
25
+ given:
26
+ [query hyperthyroid] = [f]
27
+ ```
28
+ (i.e. it's saying that it's abnormal to be pregnant at the age of 75, or to not be classified as hyperthyroid when having very high thyroid hormone levels)
29
+ (this dataset is also bundled into the R package - e.g. `data(hypothyroid)`)
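
A small arithmetic check on the two reports above. The output does not state how the percentage on the `distribution` line is computed; the sketch only shows that the printed figures are numerically consistent with `norm. obs` divided by a slightly larger cluster size. That reading is an observation about these numbers, not something documented here, and the helper name `implied_cluster_size` is hypothetical.

```python
def implied_cluster_size(norm_obs, pct, max_extra=10):
    """Find the smallest cluster size n >= norm_obs such that
    100 * norm_obs / n matches the printed percentage (3 decimals).

    Hypothetical helper; returns None if nothing matches within max_extra.
    """
    for extra in range(max_extra + 1):
        n = norm_obs + extra
        if abs(100.0 * norm_obs / n - pct) < 0.0005:
            return n
    return None

# (column, norm. obs, printed percentage) copied from the reports above
reports = [("age", 39, 95.122), ("T3", 2050, 99.951)]
for column, norm_obs, pct in reports:
    print(column, norm_obs, implied_cluster_size(norm_obs, pct))
```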
30
+
31
+
32
+ Example outlier from [Titanic dataset](https://www.kaggle.com/c/titanic):
33
+ ```
34
+ row [885] - suspicious column: [Fare] - suspicious value: [29.125]
35
+ distribution: 97.849% <= 15.500 - [mean: 7.887] - [sd: 1.173] - [norm. obs: 91]
36
+ given:
37
+ [Pclass] = [3]
38
+ [SibSp] = [0]
39
+ [Embarked] = [Q]
40
+ ```
41
+ (i.e. it's saying that this person paid too much for the kind of accommodation he had)
42
+
43
+ _Note that it can also produce other types of conditions such as 'between' (for numeric intervals) or 'in' (for categorical subsets)_
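
To make the condition types concrete, here is a minimal sketch of how conditions of the kinds '=', 'between' and 'in' could be represented and checked against a row. The tuple representation and the function `matches` are hypothetical and are not the package's output format; treating a missing value as not matching is also an assumption made only for this sketch.

```python
def matches(row, conditions):
    """Check whether a row (dict of column -> value) meets every condition.

    Each condition is a (column, operator, argument) tuple:
      ("Pclass", "=", 3)
      ("Fare", "between", (0.0, 15.5))   # inclusive numeric interval
      ("Embarked", "in", {"Q", "S"})     # categorical subset
    """
    for column, op, arg in conditions:
        value = row.get(column)
        if value is None:
            return False  # assumption: missing values never match
        if op == "=":
            ok = value == arg
        elif op == "between":
            low, high = arg
            ok = low <= value <= high
        elif op == "in":
            ok = value in arg
        else:
            raise ValueError("unknown operator: " + repr(op))
        if not ok:
            return False
    return True

# The conditions printed for the Titanic example above.
row = {"Pclass": 3, "SibSp": 0, "Embarked": "Q", "Fare": 29.125}
given = [("Pclass", "=", 3), ("SibSp", "=", 0), ("Embarked", "=", "Q")]
print(matches(row, given))
```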
44
+
45
+ # Installation
46
+
47
+ * For R:
48
+ ```r
49
+ install.packages("outliertree")
50
+ ```
51
+
52
+
53
+ * For Python:
54
+ ```
55
+ pip install outliertree
56
+ ```
57
+ (Package has only been tested in Python 3)
58
+
59
+ **Note for macOS users:** on macOS, the Python version of this package will compile **without** multi-threading capabilities. This is because Apple's default distribution of `clang` does not provide OpenMP modules and is aliased to `gcc`, which confuses build scripts. If you have a non-Apple version of `clang` with the OpenMP modules, or if you have `gcc` installed, you can compile this package with multi-threading enabled by setting the environment variable `ENABLE_OMP=1`:
60
+ ```
61
+ export ENABLE_OMP=1
62
+ pip install outliertree
63
+ ```
64
+ (Alternatively, you can pass the argument `enable-omp` to the `setup.py` file: `python setup.py install enable-omp`)
65
+
66
+
67
+ * For C++: the package doesn't have a build system, nor a `main` function that can produce an executable, but it can be built as a shared object and wrapped into other languages with any C++11-compliant compiler (`-std=c++11` in most compilers, `/std:c++14` in MSVC). For parallelization, it needs OpenMP linkage (`-fopenmp` in most compilers, `/openmp` in MSVC). The package should *not* be built with optimization higher than `-O3` (i.e. don't use `-Ofast`). It needs linkage to the `math` library, which should be enabled by default in most C++ compilers, but otherwise would require the `-lm` argument. No external dependencies are required.
68
+
69
+
70
+ # Sample usage
71
+
72
+ * For R:
73
+ ```r
74
+ library(outliertree)
75
+
76
+ ### random data frame with an obvious outlier
77
+ nrows = 100
78
+ set.seed(1)
79
+ df = data.frame(
80
+ numeric_col1 = c(rnorm(nrows - 1), 1e6),
81
+ numeric_col2 = rgamma(nrows, 1),
82
+ categ_col = sample(c('categA', 'categB', 'categC'), size = nrows, replace = TRUE)
83
+ )
84
+
85
+ ### test data frame with another obvious outlier
86
+ nrows_test = 50
87
+ df_test = data.frame(
88
+ numeric_col1 = rnorm(nrows_test),
89
+ numeric_col2 = c(-1e6, rgamma(nrows_test - 1, 1)),
90
+ categ_col = sample(c('categA', 'categB', 'categC'), size = nrows_test, replace = TRUE)
91
+ )
92
+
93
+ ### fit model
94
+ outliers_model = outliertree::outlier.tree(df, outliers_print = 10, save_outliers = TRUE)
95
+
96
+ ### find outliers in new data
97
+ new_outliers = predict(outliers_model, df_test, outliers_print = 10, return_outliers = TRUE)
98
+
99
+ ### print outliers in readable format
100
+ summary(new_outliers)
101
+ ```
102
+ (see documentation for more examples)
103
+
104
+ Example [RMarkdown](http://htmlpreview.github.io/?https://github.com/david-cortes/outliertree/blob/master/example/titanic_outliertree_r.html) using the Titanic dataset.
105
+
106
+
107
+ * For Python:
108
+ ```python
109
+ import numpy as np, pandas as pd
110
+ from outliertree import OutlierTree
111
+
112
+ ### random data frame with an obvious outlier
113
+ nrows = 100
114
+ np.random.seed(1)
115
+ df = pd.DataFrame({
116
+ "numeric_col1" : np.r_[np.random.normal(size = nrows - 1), np.array([float(1e6)])],
117
+ "numeric_col2" : np.random.gamma(1, 1, size = nrows),
118
+ "categ_col" : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
119
+ })
120
+
121
+ ### test data frame with another obvious outlier
122
+ df_test = pd.DataFrame({
123
+ "numeric_col1" : np.random.normal(size = nrows),
124
+ "numeric_col2" : np.r_[np.array([float(-1e6)]), np.random.gamma(1, 1, size = nrows - 1)],
125
+ "categ_col" : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
126
+ })
127
+
128
+ ### fit model
129
+ outliers_model = OutlierTree()
130
+ outliers_df = outliers_model.fit(df, outliers_print = 10, return_outliers = True)
131
+
132
+ ### find outliers in new data
133
+ new_outliers = outliers_model.predict(df_test)
134
+
135
+ ### print outliers in readable format
136
+ outliers_model.print_outliers(new_outliers)
137
+ ```
138
+
139
+ Example [IPython notebook](http://nbviewer.ipython.org/github/david-cortes/outliertree/blob/master/example/titanic_outliertree_python.ipynb) using the Titanic dataset.
140
+
141
+ * For C++: see functions `fit_outliers_models` and `find_new_outliers` in header `outlier_tree.hpp`.
142
+
143
+ # Documentation
144
+
145
+ * For R: documentation is built into the package (e.g. `help(outliertree::outlier.tree)`); a PDF can be downloaded from [CRAN](https://cran.r-project.org/web/packages/outliertree/index.html).
146
+
147
+ * For Python: documentation is available at [ReadTheDocs](http://outliertree.readthedocs.io/en/latest/) (it is also built into the package as docstrings, e.g. `help(outliertree.OutlierTree.fit)`).
148
+
149
+ * For C++: documentation is available in the source files (not in the header).
150
+
151
+ # References
152
+
153
+ * Cortes, David. "Explainable outlier detection through decision tree conditioning." arXiv preprint arXiv:2001.00636 (2020).
154
+ * [GritBot software](https://www.rulequest.com/gritbot-info.html).
155
+
@@ -0,0 +1,3 @@
1
+ PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
2
+ PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)
3
+ CXX_STD = CXX11
@@ -0,0 +1,123 @@
1
+ // Generated by using Rcpp::compileAttributes() -> do not edit by hand
2
+ // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
3
+
4
+ #include <Rcpp.h>
5
+
6
+ using namespace Rcpp;
7
+
8
+ // deserialize_OutlierTree
9
+ SEXP deserialize_OutlierTree(Rcpp::RawVector src);
10
+ RcppExport SEXP _outliertree_deserialize_OutlierTree(SEXP srcSEXP) {
11
+ BEGIN_RCPP
12
+ Rcpp::RObject rcpp_result_gen;
13
+ Rcpp::RNGScope rcpp_rngScope_gen;
14
+ Rcpp::traits::input_parameter< Rcpp::RawVector >::type src(srcSEXP);
15
+ rcpp_result_gen = Rcpp::wrap(deserialize_OutlierTree(src));
16
+ return rcpp_result_gen;
17
+ END_RCPP
18
+ }
19
+ // check_null_ptr_model
20
+ Rcpp::LogicalVector check_null_ptr_model(SEXP ptr_model);
21
+ RcppExport SEXP _outliertree_check_null_ptr_model(SEXP ptr_modelSEXP) {
22
+ BEGIN_RCPP
23
+ Rcpp::RObject rcpp_result_gen;
24
+ Rcpp::RNGScope rcpp_rngScope_gen;
25
+ Rcpp::traits::input_parameter< SEXP >::type ptr_model(ptr_modelSEXP);
26
+ rcpp_result_gen = Rcpp::wrap(check_null_ptr_model(ptr_model));
27
+ return rcpp_result_gen;
28
+ END_RCPP
29
+ }
30
+ // fit_OutlierTree
31
+ Rcpp::List fit_OutlierTree(Rcpp::NumericVector arr_num, size_t ncols_numeric, Rcpp::IntegerVector arr_cat, size_t ncols_categ, Rcpp::IntegerVector ncat, Rcpp::IntegerVector arr_ord, size_t ncols_ord, Rcpp::IntegerVector ncat_ord, size_t nrows, Rcpp::LogicalVector cols_ignore_r, int nthreads, bool categ_as_bin, bool ord_as_bin, bool cat_bruteforce_subset, bool categ_from_maj, bool take_mid, size_t max_depth, double max_perc_outliers, size_t min_size_numeric, size_t min_size_categ, double min_gain, bool follow_all, bool gain_as_pct, double z_norm, double z_outlier, bool return_outliers, Rcpp::ListOf<Rcpp::StringVector> cat_levels, Rcpp::ListOf<Rcpp::StringVector> ord_levels, Rcpp::StringVector colnames_num, Rcpp::StringVector colnames_cat, Rcpp::StringVector colnames_ord, Rcpp::NumericVector min_date, Rcpp::NumericVector min_ts);
32
+ RcppExport SEXP _outliertree_fit_OutlierTree(SEXP arr_numSEXP, SEXP ncols_numericSEXP, SEXP arr_catSEXP, SEXP ncols_categSEXP, SEXP ncatSEXP, SEXP arr_ordSEXP, SEXP ncols_ordSEXP, SEXP ncat_ordSEXP, SEXP nrowsSEXP, SEXP cols_ignore_rSEXP, SEXP nthreadsSEXP, SEXP categ_as_binSEXP, SEXP ord_as_binSEXP, SEXP cat_bruteforce_subsetSEXP, SEXP categ_from_majSEXP, SEXP take_midSEXP, SEXP max_depthSEXP, SEXP max_perc_outliersSEXP, SEXP min_size_numericSEXP, SEXP min_size_categSEXP, SEXP min_gainSEXP, SEXP follow_allSEXP, SEXP gain_as_pctSEXP, SEXP z_normSEXP, SEXP z_outlierSEXP, SEXP return_outliersSEXP, SEXP cat_levelsSEXP, SEXP ord_levelsSEXP, SEXP colnames_numSEXP, SEXP colnames_catSEXP, SEXP colnames_ordSEXP, SEXP min_dateSEXP, SEXP min_tsSEXP) {
33
+ BEGIN_RCPP
34
+ Rcpp::RObject rcpp_result_gen;
35
+ Rcpp::RNGScope rcpp_rngScope_gen;
36
+ Rcpp::traits::input_parameter< Rcpp::NumericVector >::type arr_num(arr_numSEXP);
37
+ Rcpp::traits::input_parameter< size_t >::type ncols_numeric(ncols_numericSEXP);
38
+ Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type arr_cat(arr_catSEXP);
39
+ Rcpp::traits::input_parameter< size_t >::type ncols_categ(ncols_categSEXP);
40
+ Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type ncat(ncatSEXP);
41
+ Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type arr_ord(arr_ordSEXP);
42
+ Rcpp::traits::input_parameter< size_t >::type ncols_ord(ncols_ordSEXP);
43
+ Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type ncat_ord(ncat_ordSEXP);
44
+ Rcpp::traits::input_parameter< size_t >::type nrows(nrowsSEXP);
45
+ Rcpp::traits::input_parameter< Rcpp::LogicalVector >::type cols_ignore_r(cols_ignore_rSEXP);
46
+ Rcpp::traits::input_parameter< int >::type nthreads(nthreadsSEXP);
47
+ Rcpp::traits::input_parameter< bool >::type categ_as_bin(categ_as_binSEXP);
48
+ Rcpp::traits::input_parameter< bool >::type ord_as_bin(ord_as_binSEXP);
49
+ Rcpp::traits::input_parameter< bool >::type cat_bruteforce_subset(cat_bruteforce_subsetSEXP);
50
+ Rcpp::traits::input_parameter< bool >::type categ_from_maj(categ_from_majSEXP);
51
+ Rcpp::traits::input_parameter< bool >::type take_mid(take_midSEXP);
52
+ Rcpp::traits::input_parameter< size_t >::type max_depth(max_depthSEXP);
53
+ Rcpp::traits::input_parameter< double >::type max_perc_outliers(max_perc_outliersSEXP);
54
+ Rcpp::traits::input_parameter< size_t >::type min_size_numeric(min_size_numericSEXP);
55
+ Rcpp::traits::input_parameter< size_t >::type min_size_categ(min_size_categSEXP);
56
+ Rcpp::traits::input_parameter< double >::type min_gain(min_gainSEXP);
57
+ Rcpp::traits::input_parameter< bool >::type follow_all(follow_allSEXP);
58
+ Rcpp::traits::input_parameter< bool >::type gain_as_pct(gain_as_pctSEXP);
59
+ Rcpp::traits::input_parameter< double >::type z_norm(z_normSEXP);
60
+ Rcpp::traits::input_parameter< double >::type z_outlier(z_outlierSEXP);
61
+ Rcpp::traits::input_parameter< bool >::type return_outliers(return_outliersSEXP);
62
+ Rcpp::traits::input_parameter< Rcpp::ListOf<Rcpp::StringVector> >::type cat_levels(cat_levelsSEXP);
63
+ Rcpp::traits::input_parameter< Rcpp::ListOf<Rcpp::StringVector> >::type ord_levels(ord_levelsSEXP);
64
+ Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_num(colnames_numSEXP);
65
+ Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_cat(colnames_catSEXP);
66
+ Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_ord(colnames_ordSEXP);
67
+ Rcpp::traits::input_parameter< Rcpp::NumericVector >::type min_date(min_dateSEXP);
68
+ Rcpp::traits::input_parameter< Rcpp::NumericVector >::type min_ts(min_tsSEXP);
69
+ rcpp_result_gen = Rcpp::wrap(fit_OutlierTree(arr_num, ncols_numeric, arr_cat, ncols_categ, ncat, arr_ord, ncols_ord, ncat_ord, nrows, cols_ignore_r, nthreads, categ_as_bin, ord_as_bin, cat_bruteforce_subset, categ_from_maj, take_mid, max_depth, max_perc_outliers, min_size_numeric, min_size_categ, min_gain, follow_all, gain_as_pct, z_norm, z_outlier, return_outliers, cat_levels, ord_levels, colnames_num, colnames_cat, colnames_ord, min_date, min_ts));
70
+ return rcpp_result_gen;
71
+ END_RCPP
72
+ }
73
+ // predict_OutlierTree
74
+ Rcpp::List predict_OutlierTree(SEXP ptr_model, size_t nrows, int nthreads, Rcpp::NumericVector arr_num, Rcpp::IntegerVector arr_cat, Rcpp::IntegerVector arr_ord, Rcpp::ListOf<Rcpp::StringVector> cat_levels, Rcpp::ListOf<Rcpp::StringVector> ord_levels, Rcpp::StringVector colnames_num, Rcpp::StringVector colnames_cat, Rcpp::StringVector colnames_ord, Rcpp::NumericVector min_date, Rcpp::NumericVector min_ts);
75
+ RcppExport SEXP _outliertree_predict_OutlierTree(SEXP ptr_modelSEXP, SEXP nrowsSEXP, SEXP nthreadsSEXP, SEXP arr_numSEXP, SEXP arr_catSEXP, SEXP arr_ordSEXP, SEXP cat_levelsSEXP, SEXP ord_levelsSEXP, SEXP colnames_numSEXP, SEXP colnames_catSEXP, SEXP colnames_ordSEXP, SEXP min_dateSEXP, SEXP min_tsSEXP) {
76
+ BEGIN_RCPP
77
+ Rcpp::RObject rcpp_result_gen;
78
+ Rcpp::RNGScope rcpp_rngScope_gen;
79
+ Rcpp::traits::input_parameter< SEXP >::type ptr_model(ptr_modelSEXP);
80
+ Rcpp::traits::input_parameter< size_t >::type nrows(nrowsSEXP);
81
+ Rcpp::traits::input_parameter< int >::type nthreads(nthreadsSEXP);
82
+ Rcpp::traits::input_parameter< Rcpp::NumericVector >::type arr_num(arr_numSEXP);
83
+ Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type arr_cat(arr_catSEXP);
84
+ Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type arr_ord(arr_ordSEXP);
85
+ Rcpp::traits::input_parameter< Rcpp::ListOf<Rcpp::StringVector> >::type cat_levels(cat_levelsSEXP);
86
+ Rcpp::traits::input_parameter< Rcpp::ListOf<Rcpp::StringVector> >::type ord_levels(ord_levelsSEXP);
87
+ Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_num(colnames_numSEXP);
88
+ Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_cat(colnames_catSEXP);
89
+ Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_ord(colnames_ordSEXP);
90
+ Rcpp::traits::input_parameter< Rcpp::NumericVector >::type min_date(min_dateSEXP);
91
+ Rcpp::traits::input_parameter< Rcpp::NumericVector >::type min_ts(min_tsSEXP);
92
+ rcpp_result_gen = Rcpp::wrap(predict_OutlierTree(ptr_model, nrows, nthreads, arr_num, arr_cat, arr_ord, cat_levels, ord_levels, colnames_num, colnames_cat, colnames_ord, min_date, min_ts));
93
+ return rcpp_result_gen;
94
+ END_RCPP
95
+ }
96
+ // check_few_values
97
+ Rcpp::LogicalVector check_few_values(Rcpp::NumericVector arr_num, size_t nrows, size_t ncols, int nthreads);
98
+ RcppExport SEXP _outliertree_check_few_values(SEXP arr_numSEXP, SEXP nrowsSEXP, SEXP ncolsSEXP, SEXP nthreadsSEXP) {
99
+ BEGIN_RCPP
100
+ Rcpp::RObject rcpp_result_gen;
101
+ Rcpp::RNGScope rcpp_rngScope_gen;
102
+ Rcpp::traits::input_parameter< Rcpp::NumericVector >::type arr_num(arr_numSEXP);
103
+ Rcpp::traits::input_parameter< size_t >::type nrows(nrowsSEXP);
104
+ Rcpp::traits::input_parameter< size_t >::type ncols(ncolsSEXP);
105
+ Rcpp::traits::input_parameter< int >::type nthreads(nthreadsSEXP);
106
+ rcpp_result_gen = Rcpp::wrap(check_few_values(arr_num, nrows, ncols, nthreads));
107
+ return rcpp_result_gen;
108
+ END_RCPP
109
+ }
110
+
111
+ static const R_CallMethodDef CallEntries[] = {
112
+ {"_outliertree_deserialize_OutlierTree", (DL_FUNC) &_outliertree_deserialize_OutlierTree, 1},
113
+ {"_outliertree_check_null_ptr_model", (DL_FUNC) &_outliertree_check_null_ptr_model, 1},
114
+ {"_outliertree_fit_OutlierTree", (DL_FUNC) &_outliertree_fit_OutlierTree, 33},
115
+ {"_outliertree_predict_OutlierTree", (DL_FUNC) &_outliertree_predict_OutlierTree, 13},
116
+ {"_outliertree_check_few_values", (DL_FUNC) &_outliertree_check_few_values, 4},
117
+ {NULL, NULL, 0}
118
+ };
119
+
120
+ RcppExport void R_init_outliertree(DllInfo *dll) {
121
+ R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
122
+ R_useDynamicSymbols(dll, FALSE);
123
+ }
@@ -0,0 +1,1225 @@
1
+ #include <Rcpp.h>
2
+ // [[Rcpp::plugins(cpp11)]]
3
+
4
+ /* This is to serialize the model objects */
5
+ // [[Rcpp::depends(Rcereal)]]
6
+ #include <cereal/archives/binary.hpp>
7
+ #include <cereal/types/vector.hpp>
8
+ #include <sstream>
9
+ #include <string>
10
+
11
+ /* This is the package's header */
12
+ #include "outlier_tree.hpp"
13
+
14
+ /* for model serialization and re-usage in R */
15
+ /* https://stackoverflow.com/questions/18474292/how-to-handle-c-internal-data-structure-in-r-in-order-to-allow-save-load */
16
+ /* this extra comment below the link is a workaround for Rcpp issue 675 in GitHub, do not remove it */
17
+ #include <Rinternals.h>
18
+ Rcpp::RawVector serialize_OutlierTree(ModelOutputs *model_outputs)
19
+ {
20
+ std::stringstream ss;
21
+ {
22
+ cereal::BinaryOutputArchive oarchive(ss); // Create an output archive
23
+ oarchive(*model_outputs);
24
+ }
25
+ ss.seekg(0, ss.end);
26
+ Rcpp::RawVector retval(ss.tellg());
27
+ ss.seekg(0, ss.beg);
28
+ ss.read(reinterpret_cast<char*>(&retval[0]), retval.size());
29
+ return retval;
30
+ }
31
+
32
+ // [[Rcpp::export]]
33
+ SEXP deserialize_OutlierTree(Rcpp::RawVector src)
34
+ {
35
+ std::stringstream ss;
36
+ ss.write(reinterpret_cast<char*>(&src[0]), src.size());
37
+ ss.seekg(0, ss.beg);
38
+ std::unique_ptr<ModelOutputs> model_outputs = std::unique_ptr<ModelOutputs>(new ModelOutputs());
39
+ {
40
+ cereal::BinaryInputArchive iarchive(ss);
41
+ iarchive(*model_outputs);
42
+ }
43
+ return Rcpp::XPtr<ModelOutputs>(model_outputs.release(), true);
44
+ }
45
+
46
+ // [[Rcpp::export]]
47
+ Rcpp::LogicalVector check_null_ptr_model(SEXP ptr_model)
48
+ {
49
+ return Rcpp::LogicalVector(R_ExternalPtrAddr(ptr_model) == NULL);
50
+ }
51
+
52
+ double* set_R_nan_as_C_nan(double *restrict x_R, std::vector<double> &x_C, size_t n, int nthreads)
53
+ {
54
+ x_C.assign(x_R, x_R + n);
55
+ #pragma omp parallel for schedule(static) num_threads(nthreads) shared(x_R, x_C, n)
56
+ for (size_t_for i = 0; i < n; i++)
57
+ if (isnan(x_R[i]) || Rcpp::NumericVector::is_na(x_R[i]) || Rcpp::traits::is_nan<REALSXP>(x_R[i]))
58
+ x_C[i] = NAN;
59
+ return x_C.data();
60
+ }
61
+
62
+
63
+ /* for predicting outliers */
64
+ Rcpp::List describe_outliers(ModelOutputs &model_outputs,
65
+ double *arr_num,
66
+ int *arr_cat,
67
+ int *arr_ord,
68
+ Rcpp::ListOf<Rcpp::StringVector> cat_levels,
69
+ Rcpp::ListOf<Rcpp::StringVector> ord_levels,
70
+ Rcpp::StringVector colnames_num,
71
+ Rcpp::StringVector colnames_cat,
72
+ Rcpp::StringVector colnames_ord,
73
+ Rcpp::NumericVector min_date,
74
+ Rcpp::NumericVector min_ts)
75
+ {
76
+ size_t nrows = model_outputs.outlier_scores_final.size();
77
+ size_t ncols_num = model_outputs.ncols_numeric;
78
+ size_t ncols_cat = model_outputs.ncols_categ;
79
+ size_t ncols_num_num = model_outputs.ncols_numeric - min_date.size() - min_ts.size();
80
+ size_t ncols_date = min_date.size();
81
+ size_t ncols_cat_cat = cat_levels.size();
82
+ Rcpp::List outp;
83
+
84
+ Rcpp::LogicalVector has_na_col = Rcpp::LogicalVector(nrows, NA_LOGICAL);
85
+ Rcpp::IntegerVector tree_depth = Rcpp::IntegerVector(nrows, NA_INTEGER);
86
+ Rcpp::NumericVector outlier_score = Rcpp::NumericVector(nrows, NA_REAL);
87
+ Rcpp::ListOf<Rcpp::List> outlier_val = Rcpp::ListOf<Rcpp::List>(nrows);
88
+ Rcpp::ListOf<Rcpp::List> lst_stats = Rcpp::ListOf<Rcpp::List>(nrows);
89
+ Rcpp::ListOf<Rcpp::List> lst_cond = Rcpp::ListOf<Rcpp::List>(nrows);
90
+
91
+
92
+ size_t outl_col;
93
+ size_t outl_clust;
94
+ size_t curr_tree;
95
+ size_t parent_tree;
96
+ Rcpp::LogicalVector tmp_bool;
97
+
98
+ for (size_t row = 0; row < nrows; row++) {
99
+ if (model_outputs.outlier_scores_final[row] < 1) {
100
+
101
+ outl_col = model_outputs.outlier_columns_final[row];
102
+ outl_clust = model_outputs.outlier_clusters_final[row];
103
+
104
+ /* metrics of outlierness - used to rank when choosing which to print */
105
+ outlier_score[row] = model_outputs.outlier_scores_final[row];
106
+ tree_depth[row] = (int)model_outputs.outlier_depth_final[row];
107
+ has_na_col[row] = model_outputs.all_clusters[outl_col][outl_clust].has_NA_branch;
108
+
109
+ /* first determine outlier column and suspected value */
110
+ if (outl_col < ncols_num) {
111
+ if (outl_col < ncols_num_num) {
112
+ outlier_val[row] = Rcpp::List::create(
113
+ Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_num[outl_col]),
114
+ Rcpp::_["value"] = Rcpp::wrap(arr_num[row + outl_col * nrows]),
115
+ Rcpp::_["decimals"] = Rcpp::wrap(model_outputs.outlier_decimals_distr[row])
116
+ );
117
+ } else if (outl_col < (ncols_num_num + ncols_date)) {
118
+ outlier_val[row] = Rcpp::List::create(
119
+ Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_num[outl_col]),
120
+ Rcpp::_["value"] = Rcpp::Date(arr_num[row + outl_col * nrows] - 1 + min_date[outl_col - ncols_num_num])
121
+ );
122
+ } else {
123
+ outlier_val[row] = Rcpp::List::create(
124
+ Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_num[outl_col]),
125
+ Rcpp::_["value"] = Rcpp::Datetime(arr_num[row + outl_col * nrows] - 1 + min_ts[outl_col - ncols_num_num - ncols_date])
126
+ );
127
+ }
128
+ } else if (outl_col < (ncols_num + ncols_cat)) {
129
+ if (outl_col < (ncols_num + ncols_cat_cat)) {
130
+ outlier_val[row] = Rcpp::List::create(
131
+ Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_cat[outl_col - ncols_num]),
132
+ Rcpp::_["value"] = Rcpp::CharacterVector(1, cat_levels[outl_col - ncols_num]
133
+ [arr_cat[row + (outl_col - ncols_num) * nrows]])
134
+ );
135
+ } else {
136
+ outlier_val[row] = Rcpp::List::create(
137
+ Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_cat[outl_col - ncols_num]),
138
+ Rcpp::_["value"] = Rcpp::wrap((bool)arr_cat[row + (outl_col - ncols_num) * nrows])
139
+ );
140
+ }
141
+ } else {
142
+ outlier_val[row] = Rcpp::List::create(
143
+ Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_ord[outl_col - ncols_num - ncols_cat]),
144
+ Rcpp::_["value"] = Rcpp::CharacterVector(1, ord_levels[outl_col - ncols_num - ncols_cat]
145
+ [arr_ord[row + (outl_col - ncols_num - ncols_cat) * nrows]])
146
+ );
147
+ }
148
+
149
+
150
+ /* info about the normal observations in the cluster */
151
+ if (outl_col < ncols_num) {
152
+ if (outl_col < ncols_num_num) {
153
+ if (arr_num[row + outl_col * nrows] >= model_outputs.all_clusters[outl_col][outl_clust].upper_lim) {
154
+ lst_stats[row] = Rcpp::List::create(
155
+ Rcpp::_["upper_thr"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_lim_high),
156
+ Rcpp::_["pct_below"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_below),
157
+ Rcpp::_["mean"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_mean),
158
+ Rcpp::_["sd"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_sd),
159
+ Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
160
+ );
161
+ } else {
162
+ lst_stats[row] = Rcpp::List::create(
163
+ Rcpp::_["lower_thr"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_lim_low),
164
+ Rcpp::_["pct_above"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_above),
165
+ Rcpp::_["mean"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_mean),
166
+ Rcpp::_["sd"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_sd),
167
+ Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
168
+ );
169
+ }
170
+ } else if (outl_col < (ncols_num_num + ncols_date)) {
171
+ if (arr_num[row + outl_col * nrows] >= model_outputs.all_clusters[outl_col][outl_clust].upper_lim) {
172
+ lst_stats[row] = Rcpp::List::create(
173
+ Rcpp::_["upper_thr"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].display_lim_high
174
+ - 1 + min_date[outl_col - ncols_num_num]),
175
+ Rcpp::_["pct_below"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_below),
176
+ Rcpp::_["mean"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].display_mean - 1 + min_date[outl_col - ncols_num_num]),
177
+ Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
178
+ );
179
+ } else {
180
+ lst_stats[row] = Rcpp::List::create(
181
+ Rcpp::_["lower_thr"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].display_lim_low
182
+ - 1 + min_date[outl_col - ncols_num_num]),
183
+ Rcpp::_["pct_above"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_above),
184
+ Rcpp::_["mean"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].display_mean - 1 + min_date[outl_col - ncols_num_num]),
185
+ Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
186
+ );
187
+ }
188
+ } else {
189
+ if (arr_num[row + outl_col * nrows] >= model_outputs.all_clusters[outl_col][outl_clust].upper_lim) {
190
+ lst_stats[row] = Rcpp::List::create(
191
+ Rcpp::_["upper_thr"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].display_lim_high
192
+ - 1 + min_ts[outl_col - ncols_num_num - ncols_date]),
193
+ Rcpp::_["pct_below"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_below),
194
+ Rcpp::_["mean"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].display_mean
195
+ - 1 + min_ts[outl_col - ncols_num_num - ncols_date]),
196
+ Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
197
+ );
198
+ } else {
199
+ lst_stats[row] = Rcpp::List::create(
200
+ Rcpp::_["lower_thr"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].display_lim_low
201
+ - 1 + min_ts[outl_col - ncols_num_num - ncols_date]),
202
+ Rcpp::_["pct_above"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_above),
203
+ Rcpp::_["mean"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].display_mean
204
+ - 1 + min_ts[outl_col - ncols_num_num - ncols_date]),
205
+ Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
206
+ );
207
+ }
208
+ }
209
+ } else if (outl_col < (ncols_num + ncols_cat)) {
210
+ if (outl_col < (ncols_num + ncols_cat_cat)) {
211
+ tmp_bool = Rcpp::LogicalVector(model_outputs.all_clusters[outl_col][outl_clust].subset_common.size(), false);
212
+ for (size_t cat = 0; cat < tmp_bool.size(); cat++) {
213
+ if (model_outputs.all_clusters[outl_col][outl_clust].subset_common[cat] == 0) {
214
+ tmp_bool[cat] = true;
215
+ }
216
+ }
217
+ if (model_outputs.all_clusters[outl_col][outl_clust].split_type != Root) {
218
+ if (model_outputs.all_clusters[outl_col][outl_clust].categ_maj < 0) {
219
+ lst_stats[row] = Rcpp::List::create(
220
+ Rcpp::_["categs_common"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[outl_col - ncols_num][tmp_bool]),
221
+ Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
222
+ Rcpp::_["pct_next_most_comm"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_next_most_comm),
223
+ Rcpp::_["prior_prob"] = Rcpp::wrap(model_outputs.prop_categ[model_outputs.start_ix_cat_counts[outl_col - ncols_num] +
224
+ arr_cat[row + (outl_col - ncols_num) * nrows]]),
225
+ Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
226
+ );
227
+ } else {
228
+ lst_stats[row] = Rcpp::List::create(
229
+ Rcpp::_["categ_maj"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[outl_col - ncols_num][
230
+ model_outputs.all_clusters[outl_col][outl_clust].categ_maj
231
+ ]),
232
+ Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
233
+ Rcpp::_["prior_prob"] = Rcpp::wrap(model_outputs.prop_categ[model_outputs.start_ix_cat_counts[outl_col - ncols_num] +
234
+ arr_cat[row + (outl_col - ncols_num) * nrows]]),
235
+ Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
236
+ );
237
+ }
238
+ } else {
239
+ lst_stats[row] = Rcpp::List::create(
240
+ Rcpp::_["categs_common"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[outl_col - ncols_num][tmp_bool]),
241
+ Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
242
+ Rcpp::_["pct_next_most_comm"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_next_most_comm),
243
+ Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
244
+ );
245
+ }
246
+ } else {
247
+ lst_stats[row] = Rcpp::List::create(
248
+ Rcpp::_["pct_other"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
249
+ Rcpp::_["prior_prob"] = Rcpp::wrap(model_outputs.prop_categ[model_outputs.start_ix_cat_counts[outl_col - ncols_num] +
250
+ arr_cat[row + (outl_col - ncols_num) * nrows]]),
251
+ Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
252
+ );
253
+ }
254
+ } else {
255
+ tmp_bool = Rcpp::LogicalVector(model_outputs.all_clusters[outl_col][outl_clust].subset_common.size(), false);
256
+ for (size_t cat = 0; cat < tmp_bool.size(); cat++) {
257
+ if (model_outputs.all_clusters[outl_col][outl_clust].subset_common[cat] == 0) {
258
+ tmp_bool[cat] = true;
259
+ }
260
+ }
261
+ if (model_outputs.all_clusters[outl_col][outl_clust].split_type != Root) {
262
+ if (model_outputs.all_clusters[outl_col][outl_clust].categ_maj < 0) {
263
+ lst_stats[row] = Rcpp::List::create(
264
+ Rcpp::_["categs_common"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[outl_col - ncols_num - ncols_cat][tmp_bool]),
265
+ Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
266
+ Rcpp::_["pct_next_most_comm"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_next_most_comm),
267
+ Rcpp::_["prior_prob"] = Rcpp::wrap(model_outputs.prop_categ[model_outputs.start_ix_cat_counts[outl_col - ncols_num] +
268
+ arr_ord[row + (outl_col - ncols_num - ncols_cat) * nrows]]),
269
+ Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
270
+ );
271
+ } else {
272
+ lst_stats[row] = Rcpp::List::create(
273
+ Rcpp::_["categ_maj"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[outl_col - ncols_num - ncols_cat][
274
+ model_outputs.all_clusters[outl_col][outl_clust].categ_maj
275
+ ]),
276
+ Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
277
+ Rcpp::_["prior_prob"] = Rcpp::wrap(model_outputs.prop_categ[model_outputs.start_ix_cat_counts[outl_col - ncols_num] +
278
+ arr_ord[row + (outl_col - ncols_num - ncols_cat) * nrows]]),
279
+ Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
280
+ );
281
+ }
282
+ } else {
283
+ lst_stats[row] = Rcpp::List::create(
284
+ Rcpp::_["categs_common"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[outl_col - ncols_num - ncols_cat][tmp_bool]),
285
+ Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
286
+ Rcpp::_["pct_next_most_comm"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_next_most_comm),
287
+ Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
288
+ );
289
+ }
290
+ }
291
+
292
+
293
+ /* then determine conditions from the cluster */
294
+ Rcpp::List cond_clust;
295
+ if (model_outputs.all_clusters[outl_col][outl_clust].column_type != NoType) {
296
+
297
+ /* add the column name and actual value for the row */
298
+ switch(model_outputs.all_clusters[outl_col][outl_clust].column_type) {
299
+ case Numeric:
300
+ {
301
+ cond_clust["column"] = Rcpp::CharacterVector(1, colnames_num[model_outputs.all_clusters[outl_col][outl_clust].col_num]);
302
+ if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_num_num) {
303
+ cond_clust["value_this"] = Rcpp::wrap(arr_num[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]);
304
+ if (model_outputs.all_clusters[outl_col][outl_clust].split_type != IsNa)
305
+ cond_clust["decimals"] = Rcpp::wrap(model_outputs.min_decimals_col[model_outputs.all_clusters[outl_col][outl_clust].col_num]);
306
+ } else if (model_outputs.all_clusters[outl_col][outl_clust].col_num < (ncols_num_num + ncols_date)) {
307
+ cond_clust["value_this"] = Rcpp::Date(arr_num[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]
308
+ - 1 + min_date[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num]);
309
+ } else {
310
+ cond_clust["value_this"] = Rcpp::Datetime(arr_num[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]
311
+ - 1 + min_ts[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num - ncols_date]);
312
+ }
313
+ break;
314
+ }
315
+
316
+ case Categorical:
317
+ {
318
+ cond_clust["column"] = Rcpp::CharacterVector(1, colnames_cat[model_outputs.all_clusters[outl_col][outl_clust].col_num]);
319
+ if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_cat_cat) {
320
+ if (arr_cat[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows] >= 0) {
321
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
322
+ [arr_cat[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]]);
323
+ } else {
324
+ cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
325
+ }
326
+ } else {
327
+
328
+ if (arr_cat[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows] >= 0) {
329
+ cond_clust["value_this"] = Rcpp::wrap((bool)arr_cat[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]);
330
+ } else {
331
+ cond_clust["value_this"] = Rcpp::LogicalVector(1, NA_LOGICAL);
332
+ }
333
+ }
334
+ break;
335
+ }
336
+
337
+ case Ordinal:
338
+ {
339
+ cond_clust["column"] = Rcpp::CharacterVector(1, colnames_ord[model_outputs.all_clusters[outl_col][outl_clust].col_num]);
340
+ if (arr_ord[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows] >= 0) {
341
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
342
+ [arr_ord[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]]);
343
+ } else {
344
+ cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
345
+ }
346
+ break;
347
+ }
348
+ }
349
+
350
+ /* add the comparison point */
351
+ switch(model_outputs.all_clusters[outl_col][outl_clust].split_type) {
352
+
353
+ case IsNa:
354
+ {
355
+ cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
356
+ switch(model_outputs.all_clusters[outl_col][outl_clust].column_type) {
357
+ case Numeric:
358
+ {
359
+ /* http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2012-October/004379.html */
360
+ /* this extra comment, placed below the URL, works around an Rcpp bug with comments that contain forward slashes */
361
+ cond_clust["value_comp"] = Rcpp::wrap(NA_REAL);
362
+ break;
363
+ }
364
+
365
+ case Categorical:
366
+ {
367
+ if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_cat_cat) {
368
+ cond_clust["value_comp"] = Rcpp::wrap(NA_STRING);
369
+ } else {
370
+ cond_clust["value_comp"] = Rcpp::LogicalVector(1, NA_LOGICAL);
371
+ }
372
+ break;
373
+ }
374
+
375
+ case Ordinal:
376
+ {
377
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
378
+ break;
379
+ }
380
+ }
381
+ break;
382
+ }
383
+
384
+ case LessOrEqual:
385
+ {
386
+ if (model_outputs.all_clusters[outl_col][outl_clust].column_type == Numeric) {
387
+ if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_num_num) {
388
+ cond_clust["comparison"] = Rcpp::CharacterVector("<=");
389
+ cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].split_point);
390
+ } else if (model_outputs.all_clusters[outl_col][outl_clust].col_num < (ncols_num_num + ncols_date)) {
391
+ cond_clust["comparison"] = Rcpp::CharacterVector("<=");
392
+ cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].split_point
393
+ - 1 + min_date[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num]);
394
+ } else {
395
+ cond_clust["comparison"] = Rcpp::CharacterVector("<=");
396
+ cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].split_point
397
+ - 1 + min_ts[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num - ncols_date]);
398
+ }
399
+ } else {
400
+ tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num].size(), false);
401
+ for (int cat = 0; cat <= model_outputs.all_clusters[outl_col][outl_clust].split_lev; cat++) tmp_bool[cat] = true;
402
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
403
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num][tmp_bool]);
404
+ }
405
+ break;
406
+ }
407
+
408
+ case Greater:
409
+ {
410
+ if (model_outputs.all_clusters[outl_col][outl_clust].column_type == Numeric) {
411
+ if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_num_num) {
412
+ cond_clust["comparison"] = Rcpp::CharacterVector(">");
413
+ cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].split_point);
414
+ } else if (model_outputs.all_clusters[outl_col][outl_clust].col_num < (ncols_num_num + ncols_date)) {
415
+ cond_clust["comparison"] = Rcpp::CharacterVector(">");
416
+ cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].split_point
417
+ - 1 + min_date[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num]);
418
+ } else {
419
+ cond_clust["comparison"] = Rcpp::CharacterVector(">");
420
+ cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].split_point
421
+ - 1 + min_ts[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num - ncols_date]);
422
+ }
423
+ } else {
424
+ tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num].size(), true);
425
+ for (int cat = 0; cat <= model_outputs.all_clusters[outl_col][outl_clust].split_lev; cat++) tmp_bool[cat] = false;
426
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
427
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num][tmp_bool]);
428
+ }
429
+ break;
430
+ }
431
+
432
+ case InSubset:
433
+ {
434
+ tmp_bool = Rcpp::LogicalVector(model_outputs.all_clusters[outl_col][outl_clust].split_subset.size(), false);
435
+ for (size_t cat = 0; cat < model_outputs.all_clusters[outl_col][outl_clust].split_subset.size(); cat++) {
436
+ if (model_outputs.all_clusters[outl_col][outl_clust].split_subset[cat] > 0) {
437
+ tmp_bool[cat] = true;
438
+ }
439
+ }
440
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
441
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num][tmp_bool]);
442
+ break;
443
+ }
444
+
445
+ case NotInSubset:
446
+ {
447
+ tmp_bool = Rcpp::LogicalVector(model_outputs.all_clusters[outl_col][outl_clust].split_subset.size(), false);
448
+ for (size_t cat = 0; cat < model_outputs.all_clusters[outl_col][outl_clust].split_subset.size(); cat++) {
449
+ if (model_outputs.all_clusters[outl_col][outl_clust].split_subset[cat] == 0) {
450
+ tmp_bool[cat] = true;
451
+ }
452
+ }
453
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
454
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num][tmp_bool]);
455
+ break;
456
+ }
457
+
458
+ case Equal:
459
+ {
460
+ if (model_outputs.all_clusters[outl_col][outl_clust].column_type == Categorical) {
461
+ if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_cat_cat) {
462
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
463
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
464
+ [model_outputs.all_clusters[outl_col][outl_clust].split_lev]);
465
+ } else {
466
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
467
+ cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_clusters[outl_col][outl_clust].split_lev);
468
+ }
469
+ } else {
470
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
471
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
472
+ [model_outputs.all_clusters[outl_col][outl_clust].split_lev]);
473
+ }
474
+ break;
475
+ }
476
+
477
+ case NotEqual:
478
+ {
479
+ if (model_outputs.all_clusters[outl_col][outl_clust].column_type == Categorical) {
480
+ if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_cat_cat) {
481
+ cond_clust["comparison"] = Rcpp::CharacterVector("!=");
482
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
483
+ [model_outputs.all_clusters[outl_col][outl_clust].split_lev]);
484
+ } else {
485
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
486
+ cond_clust["value_comp"] = Rcpp::wrap(!((bool)model_outputs.all_clusters[outl_col][outl_clust].split_lev));
487
+ }
488
+ } else {
489
+ cond_clust["comparison"] = Rcpp::CharacterVector("!=");
490
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
491
+ [model_outputs.all_clusters[outl_col][outl_clust].split_lev]);
492
+ }
493
+ break;
494
+ }
495
+
496
+ }
497
+ lst_cond[row] = Rcpp::List::create(Rcpp::clone(cond_clust));
498
+
499
+ /* finally, add conditions from branches that lead to the cluster */
500
+ curr_tree = model_outputs.outlier_trees_final[row];
501
+ Rcpp::List temp_list;
502
+ while (true) {
503
+ if (curr_tree == 0 || model_outputs.all_trees[outl_col][curr_tree].parent_branch == SubTrees) {
504
+ break;
505
+ }
506
+ parent_tree = model_outputs.all_trees[outl_col][curr_tree].parent;
507
+ cond_clust = Rcpp::List();
508
+
509
+ /* when using 'follow_all' */
510
+ if (model_outputs.all_trees[outl_col][parent_tree].all_branches.size() > 0) {
511
+
512
+ /* add column name and value */
513
+ switch(model_outputs.all_trees[outl_col][curr_tree].column_type) {
514
+ case Numeric:
515
+ {
516
+ cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_num[model_outputs.all_trees[outl_col][curr_tree].col_num]);
517
+ break;
518
+ }
519
+
520
+ case Categorical:
521
+ {
522
+ cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_cat[model_outputs.all_trees[outl_col][curr_tree].col_num]);
523
+ break;
524
+ }
525
+
526
+ case Ordinal:
527
+ {
528
+ cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_ord[model_outputs.all_trees[outl_col][curr_tree].col_num]);
529
+ break;
530
+ }
531
+ }
532
+
533
+ /* add conditions from tree */
534
+ switch(model_outputs.all_trees[outl_col][curr_tree].column_type) {
535
+
536
+ case Numeric:
537
+ {
538
+ /* add decimals if appropriate */
539
+ if (
540
+ model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_num_num &&
541
+ model_outputs.all_trees[outl_col][curr_tree].split_this_branch != IsNa
542
+ )
543
+ {
544
+ cond_clust["decimals"] = Rcpp::wrap(model_outputs.min_decimals_col[model_outputs.all_trees[outl_col][curr_tree].col_num]);
545
+ }
546
+
547
+ /* then conditions */
548
+ switch(model_outputs.all_trees[outl_col][curr_tree].split_this_branch) {
549
+
550
+ case IsNa:
551
+ {
552
+ cond_clust["value_this"] = Rcpp::wrap(NA_REAL);
553
+ cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
554
+ cond_clust["value_comp"] = Rcpp::wrap(NA_REAL);
555
+ break;
556
+ }
557
+
558
+ case LessOrEqual:
559
+ {
560
+ if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_num_num) {
561
+ cond_clust["value_this"] = Rcpp::wrap(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
562
+ cond_clust["comparison"] = Rcpp::CharacterVector("<=");
563
+ cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_trees[outl_col][curr_tree].split_point);
564
+ } else if (model_outputs.all_trees[outl_col][curr_tree].col_num < (ncols_num_num + ncols_date)) {
565
+ cond_clust["value_this"] = Rcpp::Date(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]
566
+ - 1 + min_date[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num]);
567
+ cond_clust["comparison"] = Rcpp::CharacterVector("<=");
568
+ cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_trees[outl_col][curr_tree].split_point
569
+ - 1 + min_date[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num]);
570
+ } else {
571
+ cond_clust["value_this"] = Rcpp::Datetime(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]
572
+ - 1 + min_ts[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num - ncols_date]);
573
+ cond_clust["comparison"] = Rcpp::CharacterVector("<=");
574
+ cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_trees[outl_col][curr_tree].split_point
575
+ - 1 + min_ts[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num - ncols_date]);
576
+ }
577
+ break;
578
+ }
579
+
580
+ case Greater:
581
+ {
582
+ if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_num_num) {
583
+ cond_clust["value_this"] = Rcpp::wrap(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
584
+ cond_clust["comparison"] = Rcpp::CharacterVector(">");
585
+ cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_trees[outl_col][curr_tree].split_point);
586
+ } else if (model_outputs.all_trees[outl_col][curr_tree].col_num < (ncols_num_num + ncols_date)) {
587
+ cond_clust["value_this"] = Rcpp::Date(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]
588
+ - 1 + min_date[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num]);
589
+ cond_clust["comparison"] = Rcpp::CharacterVector(">");
590
+ cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_trees[outl_col][curr_tree].split_point
591
+ - 1 + min_date[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num]);
592
+ } else {
593
+ cond_clust["value_this"] = Rcpp::Datetime(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]
594
+ - 1 + min_ts[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num - ncols_date]);
595
+ cond_clust["comparison"] = Rcpp::CharacterVector(">");
596
+ cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_trees[outl_col][curr_tree].split_point
597
+ - 1 + min_ts[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num - ncols_date]);
598
+ }
599
+ break;
600
+ }
601
+
602
+ }
603
+ break;
604
+ }
605
+
606
+ case Categorical:
607
+ {
608
+ switch(model_outputs.all_trees[outl_col][curr_tree].split_this_branch) {
609
+
610
+ case IsNa:
611
+ {
612
+ if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_cat_cat) {
613
+ cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
614
+ cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
615
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
616
+ } else {
617
+ cond_clust["value_this"] = Rcpp::LogicalVector(1, NA_LOGICAL);
618
+ cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
619
+ cond_clust["value_comp"] = Rcpp::LogicalVector(1, NA_LOGICAL);
620
+ }
621
+ break;
622
+ }
623
+
624
+ case InSubset:
625
+ {
626
+ if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_cat_cat) {
627
+ tmp_bool = Rcpp::LogicalVector(model_outputs.all_trees[outl_col][curr_tree].split_subset.size(), false);
628
+ for (size_t cat = 0; cat < model_outputs.all_trees[outl_col][curr_tree].split_subset.size(); cat++) {
629
+ if (model_outputs.all_trees[outl_col][curr_tree].split_subset[cat] > 0) {
630
+ tmp_bool[cat] = true;
631
+ }
632
+ }
633
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
634
+ [arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
635
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
636
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num][tmp_bool]);
637
+ } else {
638
+ cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
639
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
640
+ cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][curr_tree].split_subset[1]);
641
+ }
642
+ break;
643
+ }
644
+
645
+ case NotInSubset:
646
+ {
647
+ if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_cat_cat) {
648
+ tmp_bool = Rcpp::LogicalVector(model_outputs.all_trees[outl_col][curr_tree].split_subset.size(), true);
649
+ for (size_t cat = 0; cat < model_outputs.all_trees[outl_col][curr_tree].split_subset.size(); cat++) {
650
+ if (model_outputs.all_trees[outl_col][curr_tree].split_subset[cat] > 0) {
651
+ tmp_bool[cat] = false;
652
+ }
653
+ }
654
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
655
+ [arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
656
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
657
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num][tmp_bool]);
658
+ } else {
659
+ cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
660
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
661
+ cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][curr_tree].split_subset[0]);
662
+ }
663
+ break;
664
+ }
665
+
666
+ case Equal:
667
+ {
668
+ if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_cat_cat) {
669
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
670
+ [arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
671
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
672
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
673
+ [model_outputs.all_trees[outl_col][curr_tree].split_lev]);
674
+ } else {
675
+ cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
676
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
677
+ cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][curr_tree].split_lev);
678
+ }
679
+ break;
680
+ }
681
+
682
+ case NotEqual:
683
+ {
684
+ if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_cat_cat) {
685
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
686
+ [arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
687
+ cond_clust["comparison"] = Rcpp::CharacterVector("!=");
688
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
689
+ [model_outputs.all_trees[outl_col][curr_tree].split_lev]);
690
+ } else {
691
+ cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
692
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
693
+ cond_clust["value_comp"] = Rcpp::wrap((bool) !model_outputs.all_trees[outl_col][curr_tree].split_lev);
694
+ /* note: boolean splits should always be converted to Equal, so this branch is redundant */
695
+ }
696
+ break;
697
+ }
698
+
699
+ }
700
+ break;
701
+ }
702
+
703
+ case Ordinal:
704
+ {
705
+ switch(model_outputs.all_trees[outl_col][curr_tree].split_this_branch) {
706
+
707
+ case IsNa:
708
+ {
709
+ cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
710
+ cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
711
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
712
+ break;
713
+ }
714
+
715
+ case LessOrEqual:
716
+ {
717
+ tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num].size(), false);
718
+ for (int cat = 0; cat <= model_outputs.all_trees[outl_col][curr_tree].split_lev; cat++) {
719
+ tmp_bool[cat] = true;
720
+ }
721
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
722
+ [arr_ord[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
723
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
724
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num][tmp_bool]);
725
+ break;
726
+ }
727
+
728
+ case Greater:
729
+ {
730
+ tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num].size(), true);
731
+ for (int cat = 0; cat <= model_outputs.all_trees[outl_col][curr_tree].split_lev; cat++) {
732
+ tmp_bool[cat] = false;
733
+ }
734
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
735
+ [arr_ord[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
736
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
737
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num][tmp_bool]);
738
+ break;
739
+ }
740
+
741
+ case Equal:
742
+ {
743
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
744
+ [arr_ord[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
745
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
746
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
747
+ [model_outputs.all_trees[outl_col][curr_tree].split_lev]);
748
+ break;
749
+ }
750
+
751
+ case NotEqual:
752
+ {
753
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
754
+ [arr_ord[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
755
+ cond_clust["comparison"] = Rcpp::CharacterVector("!=");
756
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
757
+ [model_outputs.all_trees[outl_col][curr_tree].split_lev]);
758
+ break;
759
+ }
760
+
761
+ }
762
+ break;
763
+ }
764
+
765
+ }
766
+ }
767
+
768
+ /* regular case (no 'follow_all') */
769
+ else
770
+ {
771
+
772
+ /* add column name and value */
773
+ switch(model_outputs.all_trees[outl_col][parent_tree].column_type) {
774
+ case Numeric:
775
+ {
776
+ cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_num[model_outputs.all_trees[outl_col][parent_tree].col_num]);
777
+ /* add decimals if appropriate */
778
+ if (
779
+ model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_num_num &&
780
+ model_outputs.all_trees[outl_col][curr_tree].parent_branch != IsNa
781
+ )
782
+ {
783
+ cond_clust["decimals"] = Rcpp::wrap(model_outputs.min_decimals_col[model_outputs.all_trees[outl_col][parent_tree].col_num]);
784
+ }
785
+ break;
786
+ }
787
+
788
+ case Categorical:
789
+ {
790
+ cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_cat[model_outputs.all_trees[outl_col][parent_tree].col_num]);
791
+ break;
792
+ }
793
+
794
+ case Ordinal:
795
+ {
796
+ cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_ord[model_outputs.all_trees[outl_col][parent_tree].col_num]);
797
+ break;
798
+ }
799
+ }
800
+
801
+
802
+ /* add conditions from tree */
803
+ switch(model_outputs.all_trees[outl_col][curr_tree].parent_branch) {
804
+
805
+
806
+ case IsNa:
807
+ {
808
+ switch(model_outputs.all_trees[outl_col][parent_tree].column_type) {
809
+ case Numeric:
810
+ {
811
+ cond_clust["value_this"] = Rcpp::wrap(NA_REAL);
812
+ cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
813
+ cond_clust["value_comp"] = Rcpp::wrap(NA_REAL);
814
+ break;
815
+ }
816
+
817
+ case Categorical:
818
+ {
819
+ if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
820
+ cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
821
+ cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
822
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
823
+ } else {
824
+ cond_clust["value_this"] = Rcpp::LogicalVector(1, NA_LOGICAL);
825
+ cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
826
+ cond_clust["value_comp"] = Rcpp::LogicalVector(1, NA_LOGICAL);
827
+ }
828
+ break;
829
+ }
830
+
831
+ case Ordinal:
832
+ {
833
+ cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
834
+ cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
835
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
836
+ break;
837
+ }
838
+ }
839
+ break;
840
+ }
841
+
842
+ case LessOrEqual:
843
+ {
844
+ if (model_outputs.all_trees[outl_col][parent_tree].column_type == Numeric) {
845
+ if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_num_num) {
846
+ cond_clust["value_this"] = Rcpp::wrap(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
847
+ cond_clust["comparison"] = Rcpp::CharacterVector("<=");
848
+ cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_trees[outl_col][parent_tree].split_point);
849
+ } else if (model_outputs.all_trees[outl_col][parent_tree].col_num < (ncols_num_num + ncols_date)) {
850
+ cond_clust["value_this"] = Rcpp::Date(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]
851
+ - 1 + min_date[model_outputs.all_trees[outl_col][parent_tree].col_num - ncols_num_num]);
852
+ cond_clust["comparison"] = Rcpp::CharacterVector("<=");
853
+ cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_trees[outl_col][parent_tree].split_point
854
+ - 1 + min_date[model_outputs.all_trees[outl_col][parent_tree].col_num - ncols_num_num]);
855
+ } else {
856
+ cond_clust["value_this"] = Rcpp::Datetime(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]
857
+ - 1 + min_ts[model_outputs.all_trees[outl_col][parent_tree].col_num
858
+ - ncols_num_num - ncols_date]);
859
+ cond_clust["comparison"] = Rcpp::CharacterVector("<=");
860
+ cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_trees[outl_col][parent_tree].split_point
861
+ - 1 + min_ts[model_outputs.all_trees[outl_col][parent_tree].col_num
862
+ - ncols_num_num - ncols_date]);
863
+ }
864
+ } else {
865
+ tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num].size(), false);
866
+ for (int cat = 0; cat <= model_outputs.all_trees[outl_col][parent_tree].split_lev; cat++) tmp_bool[cat] = true;
867
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
868
+ [arr_ord[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
869
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
870
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num][tmp_bool]);
871
+ }
872
+ break;
873
+ }
874
+
875
+ case Greater:
876
+ {
877
+ if (model_outputs.all_trees[outl_col][parent_tree].column_type == Numeric) {
878
+ if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_num_num) {
879
+ cond_clust["value_this"] = Rcpp::wrap(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
880
+ cond_clust["comparison"] = Rcpp::CharacterVector(">");
881
+ cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_trees[outl_col][parent_tree].split_point);
882
+ } else if (model_outputs.all_trees[outl_col][parent_tree].col_num < (ncols_num_num + ncols_date)) {
883
+ cond_clust["value_this"] = Rcpp::Date(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]
884
+ - 1 + min_date[model_outputs.all_trees[outl_col][parent_tree].col_num - ncols_num_num]);
885
+ cond_clust["comparison"] = Rcpp::CharacterVector(">");
886
+ cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_trees[outl_col][parent_tree].split_point
887
+ - 1 + min_date[model_outputs.all_trees[outl_col][parent_tree].col_num - ncols_num_num]);
888
+ } else {
889
+ cond_clust["value_this"] = Rcpp::Datetime(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]
890
+ - 1 + min_ts[model_outputs.all_trees[outl_col][parent_tree].col_num
891
+ - ncols_num_num - ncols_date]);
892
+ cond_clust["comparison"] = Rcpp::CharacterVector(">");
893
+ cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_trees[outl_col][parent_tree].split_point
894
+ - 1 + min_ts[model_outputs.all_trees[outl_col][parent_tree].col_num
895
+ - ncols_num_num - ncols_date]);
896
+ }
897
+ } else {
898
+ tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num].size(), true);
899
+ for (int cat = 0; cat <= model_outputs.all_trees[outl_col][parent_tree].split_lev; cat++) tmp_bool[cat] = false;
900
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
901
+ [arr_ord[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
902
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
903
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num][tmp_bool]);
904
+ }
905
+ break;
906
+ }
907
+
908
+ case InSubset:
909
+ {
910
+ if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
911
+ tmp_bool = Rcpp::LogicalVector(cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num].size(), false);
912
+ for (size_t cat = 0; cat < model_outputs.all_trees[outl_col][parent_tree].split_subset.size(); cat++) {
913
+ if (model_outputs.all_trees[outl_col][parent_tree].split_subset[cat] > 0) {
914
+ tmp_bool[cat] = true;
915
+ }
916
+ }
917
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
918
+ [arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
919
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
920
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num][tmp_bool]);
921
+ } else {
922
+ cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
923
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
924
+ cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][parent_tree].split_subset[1]);
925
+ }
926
+ break;
927
+ }
928
+
929
+ case NotInSubset:
930
+ {
931
+ if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
932
+ tmp_bool = Rcpp::LogicalVector(cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num].size(), false);
933
+ for (size_t cat = 0; cat < model_outputs.all_trees[outl_col][parent_tree].split_subset.size(); cat++) {
934
+ if (model_outputs.all_trees[outl_col][parent_tree].split_subset[cat] == 0) {
935
+ tmp_bool[cat] = true;
936
+ }
937
+ }
938
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
939
+ [arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
940
+ cond_clust["comparison"] = Rcpp::CharacterVector("in");
941
+ cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num][tmp_bool]);
942
+ } else {
943
+ cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
944
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
945
+ cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][parent_tree].split_subset[0]);
946
+ }
947
+ break;
948
+ }
949
+
950
+ case Equal:
951
+ {
952
+ if (model_outputs.all_trees[outl_col][parent_tree].column_type == Categorical) {
953
+ if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
954
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
955
+ [arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
956
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
957
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
958
+ [model_outputs.all_trees[outl_col][parent_tree].split_lev]);
959
+ } else {
960
+ cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
961
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
962
+ cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][parent_tree].split_subset[1]);
963
+ }
964
+ } else {
965
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
966
+ [arr_ord[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
967
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
968
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
969
+ [model_outputs.all_trees[outl_col][parent_tree].split_lev]);
970
+ }
971
+ break;
972
+ }
973
+
974
+ case NotEqual:
975
+ {
976
+ if (model_outputs.all_trees[outl_col][parent_tree].column_type == Categorical) {
977
+ if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
978
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
979
+ [arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
980
+ cond_clust["comparison"] = Rcpp::CharacterVector("!=");
981
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
982
+ [model_outputs.all_trees[outl_col][parent_tree].split_lev]);
983
+ } else {
984
+ cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
985
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
986
+ cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][parent_tree].split_subset[0]);
987
+ }
988
+ } else {
989
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
990
+ [arr_ord[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
991
+ cond_clust["comparison"] = Rcpp::CharacterVector("!=");
992
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
993
+ [model_outputs.all_trees[outl_col][parent_tree].split_lev]);
994
+ }
995
+ break;
996
+ }
997
+
998
+ case SingleCateg:
999
+ {
1000
+ if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
1001
+ cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
1002
+ [arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
1003
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
1004
+ cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
1005
+ [arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
1006
+ } else {
1007
+ cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
1008
+ cond_clust["comparison"] = Rcpp::CharacterVector("=");
1009
+ cond_clust["value_comp"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
1010
+ }
1011
+ break;
1012
+ }
1013
+
1014
+ }
1015
+
1016
+
1017
+ }
1018
+
1019
+ /* https://github.com/RcppCore/Rcpp/issues/979 */
1020
+ /* this comment below will fix Rcpp issue with having slashes in the comment above */
1021
+ temp_list = lst_cond[row];
1022
+ temp_list.push_back(Rcpp::clone(cond_clust));
1023
+ lst_cond[row] = temp_list;
1024
+ curr_tree = parent_tree;
1025
+ }
1026
+
1027
+ }
1028
+
1029
+ }
1030
+ }
1031
+
1032
+ outp["suspicous_value"] = outlier_val;
1033
+ outp["group_statistics"] = lst_stats;
1034
+ outp["conditions"] = lst_cond;
1035
+ outp["tree_depth"] = tree_depth;
1036
+ outp["uses_NA_branch"] = has_na_col;
1037
+ outp["outlier_score"] = outlier_score;
1038
+ return outp;
1039
+ }
1040
+
1041
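+ /* for extracting info about flaggable outliers: per-column lower/upper bounds (numeric, date, timestamp) or the levels that can be flagged in any cluster (categorical, boolean, ordinal) */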
1042
+ Rcpp::List extract_outl_bounds(ModelOutputs &model_outputs,
1043
+ Rcpp::ListOf<Rcpp::StringVector> cat_levels,
1044
+ Rcpp::ListOf<Rcpp::StringVector> ord_levels,
1045
+ Rcpp::NumericVector min_date,
1046
+ Rcpp::NumericVector min_ts)
1047
+ {
1048
+ size_t ncols_num = model_outputs.ncols_numeric;
1049
+ size_t ncols_cat = model_outputs.ncols_categ;
1050
+ size_t ncols_ord = model_outputs.ncols_ord;
1051
+ size_t col_lim_num = model_outputs.ncols_numeric - min_date.size() - min_ts.size();
1052
+ size_t col_lim_date = model_outputs.ncols_numeric - min_ts.size();
1053
+ size_t ncols_cat_cat = cat_levels.size();
1054
+ size_t tot_cols = ncols_num + ncols_cat + ncols_ord;
1055
+ Rcpp::LogicalVector temp_bool;
1056
+ Rcpp::LogicalVector bool_choice(2, false); bool_choice[1] = true;
1057
+ Rcpp::List outp(tot_cols);
1058
+
1059
+ for (size_t cl = 0; cl < tot_cols; cl++) {
1060
+ if (cl < col_lim_num) {
1061
+ /* numeric */
1062
+ outp[cl] = Rcpp::List::create(Rcpp::_["lb"] = Rcpp::wrap(model_outputs.min_outlier_any_cl[cl]),
1063
+ Rcpp::_["ub"] = Rcpp::wrap(model_outputs.max_outlier_any_cl[cl]));
1064
+ } else if (cl < col_lim_date) {
1065
+ /* date */
1066
+ outp[cl] = Rcpp::List::create(
1067
+ Rcpp::_["lb"] = Rcpp::Date(model_outputs.min_outlier_any_cl[cl] - 1 + min_date[cl - col_lim_num]),
1068
+ Rcpp::_["ub"] = Rcpp::Date(model_outputs.max_outlier_any_cl[cl] - 1 + min_date[cl - col_lim_num])
1069
+ );
1070
+ } else if (cl < ncols_num) {
1071
+ /* timestamp */
1072
+ outp[cl] = Rcpp::List::create(
1073
+ Rcpp::_["lb"] = Rcpp::Datetime(model_outputs.min_outlier_any_cl[cl] - 1 + min_ts[cl - col_lim_date]),
1074
+ Rcpp::_["ub"] = Rcpp::Datetime(model_outputs.max_outlier_any_cl[cl] - 1 + min_ts[cl - col_lim_date])
1075
+ );
1076
+ } else if (cl < (ncols_num + ncols_cat_cat)) {
1077
+ /* categorical */
1078
+ if (model_outputs.cat_outlier_any_cl[cl - ncols_num].size()) {
1079
+ temp_bool = Rcpp::wrap(model_outputs.cat_outlier_any_cl[cl - ncols_num]);
1080
+ outp[cl] = cat_levels[cl - ncols_num][temp_bool];
1081
+ } else {
1082
+ outp[cl] = Rcpp::StringVector();
1083
+ }
1084
+ } else if (cl < (ncols_num + ncols_cat)) {
1085
+ /* boolean */
1086
+ if (model_outputs.cat_outlier_any_cl[cl - ncols_num].size()) {
1087
+ temp_bool = Rcpp::wrap(model_outputs.cat_outlier_any_cl[cl - ncols_num]);
1088
+ outp[cl] = bool_choice[temp_bool];
1089
+ } else {
1090
+ outp[cl] = Rcpp::LogicalVector();
1091
+ }
1092
+ } else {
1093
+ /* ordinal */
1094
+ if (model_outputs.cat_outlier_any_cl[cl - ncols_num].size()) {
1095
+ temp_bool = Rcpp::wrap(model_outputs.cat_outlier_any_cl[cl - ncols_num]);
1096
+ outp[cl] = ord_levels[cl - ncols_num - ncols_cat][temp_bool];
1097
+ } else {
1098
+ outp[cl] = Rcpp::StringVector();
1099
+ }
1100
+ }
1101
+ }
1102
+ return outp;
1103
+ }
1104
+
1105
+
1106
+ /* external functions for fitting the model and predicting outliers */
1107
+ // [[Rcpp::export]]
1108
+ Rcpp::List fit_OutlierTree(Rcpp::NumericVector arr_num, size_t ncols_numeric,
1109
+ Rcpp::IntegerVector arr_cat, size_t ncols_categ, Rcpp::IntegerVector ncat,
1110
+ Rcpp::IntegerVector arr_ord, size_t ncols_ord, Rcpp::IntegerVector ncat_ord,
1111
+ size_t nrows, Rcpp::LogicalVector cols_ignore_r, int nthreads,
1112
+ bool categ_as_bin, bool ord_as_bin, bool cat_bruteforce_subset, bool categ_from_maj, bool take_mid,
1113
+ size_t max_depth, double max_perc_outliers, size_t min_size_numeric, size_t min_size_categ,
1114
+ double min_gain, bool follow_all, bool gain_as_pct, double z_norm, double z_outlier,
1115
+ bool return_outliers,
1116
+ Rcpp::ListOf<Rcpp::StringVector> cat_levels,
1117
+ Rcpp::ListOf<Rcpp::StringVector> ord_levels,
1118
+ Rcpp::StringVector colnames_num,
1119
+ Rcpp::StringVector colnames_cat,
1120
+ Rcpp::StringVector colnames_ord,
1121
+ Rcpp::NumericVector min_date,
1122
+ Rcpp::NumericVector min_ts)
1123
+ {
1124
+ bool found_outliers;
1125
+ Rcpp::List outp;
1126
+ size_t tot_cols = ncols_numeric + ncols_categ + ncols_ord;
1127
+ std::vector<char> cols_ignore;
1128
+ char *cols_ignore_ptr = NULL;
1129
+ if (cols_ignore_r.size() > 0) {
1130
+ cols_ignore.resize(tot_cols, false);
1131
+ for (size_t cl = 0; cl < tot_cols; cl++) cols_ignore[cl] = (bool) cols_ignore_r[cl];
1132
+ cols_ignore_ptr = &cols_ignore[0];
1133
+ }
1134
+ std::vector<double> Xcpp;
1135
+ double *arr_num_C = set_R_nan_as_C_nan(&arr_num[0], Xcpp, arr_num.size(), nthreads);
1136
+
1137
+ std::unique_ptr<ModelOutputs> model_outputs = std::unique_ptr<ModelOutputs>(new ModelOutputs());
1138
+ found_outliers = fit_outliers_models(*model_outputs,
1139
+ arr_num_C, ncols_numeric,
1140
+ &arr_cat[0], ncols_categ, &ncat[0],
1141
+ &arr_ord[0], ncols_ord, &ncat_ord[0],
1142
+ nrows, cols_ignore_ptr, nthreads,
1143
+ categ_as_bin, ord_as_bin, cat_bruteforce_subset, categ_from_maj, take_mid,
1144
+ max_depth, max_perc_outliers, min_size_numeric, min_size_categ,
1145
+ min_gain, gain_as_pct, follow_all, z_norm, z_outlier);
1146
+
1147
+ outp["bounds"] = extract_outl_bounds(*model_outputs,
1148
+ cat_levels,
1149
+ ord_levels,
1150
+ min_date,
1151
+ min_ts);
1152
+
1153
+ outp["serialized_obj"] = serialize_OutlierTree(model_outputs.get());
1154
+ if (return_outliers) {
1155
+ outp["outliers_info"] = describe_outliers(*model_outputs,
1156
+ arr_num_C,
1157
+ &arr_cat[0],
1158
+ &arr_ord[0],
1159
+ cat_levels,
1160
+ ord_levels,
1161
+ colnames_num,
1162
+ colnames_cat,
1163
+ colnames_ord,
1164
+ min_date,
1165
+ min_ts);
1166
+ }
1167
+ /* add number of trees and clusters */
1168
+ size_t ntrees = 0, nclust = 0;
1169
+ for (size_t col = 0; col < model_outputs->all_trees.size(); col++) {
1170
+ ntrees += model_outputs->all_trees[col].size();
1171
+ nclust += model_outputs->all_clusters[col].size();
1172
+ }
1173
+ outp["ntrees"] = Rcpp::wrap((int) ntrees);
1174
+ outp["nclust"] = Rcpp::wrap((int) nclust);
1175
+ outp["found_outliers"] = Rcpp::wrap(found_outliers);
1176
+
1177
+ forget_row_outputs(*model_outputs);
1178
+ outp["ptr_model"] = Rcpp::XPtr<ModelOutputs>(model_outputs.release(), true);
1179
+ return outp;
1180
+ }
1181
+
1182
+ // [[Rcpp::export]]
1183
+ Rcpp::List predict_OutlierTree(SEXP ptr_model, size_t nrows, int nthreads,
1184
+ Rcpp::NumericVector arr_num, Rcpp::IntegerVector arr_cat, Rcpp::IntegerVector arr_ord,
1185
+ Rcpp::ListOf<Rcpp::StringVector> cat_levels,
1186
+ Rcpp::ListOf<Rcpp::StringVector> ord_levels,
1187
+ Rcpp::StringVector colnames_num,
1188
+ Rcpp::StringVector colnames_cat,
1189
+ Rcpp::StringVector colnames_ord,
1190
+ Rcpp::NumericVector min_date,
1191
+ Rcpp::NumericVector min_ts)
1192
+ {
1193
+ std::vector<double> Xcpp;
1194
+ double *arr_num_C = set_R_nan_as_C_nan(&arr_num[0], Xcpp, arr_num.size(), nthreads);
1195
+
1196
+ ModelOutputs *model_outputs = static_cast<ModelOutputs*>(R_ExternalPtrAddr(ptr_model));
1197
+ bool found_outliers = find_new_outliers(arr_num_C, &arr_cat[0], &arr_ord[0],
1198
+ nrows, nthreads, *model_outputs);
1199
+ Rcpp::List outp = describe_outliers(*model_outputs,
1200
+ arr_num_C,
1201
+ &arr_cat[0],
1202
+ &arr_ord[0],
1203
+ cat_levels,
1204
+ ord_levels,
1205
+ colnames_num,
1206
+ colnames_cat,
1207
+ colnames_ord,
1208
+ min_date,
1209
+ min_ts);
1210
+ outp["found_outliers"] = Rcpp::wrap(found_outliers);
1211
+ forget_row_outputs(*model_outputs);
1212
+ return outp;
1213
+ }
1214
+
1215
+ // [[Rcpp::export]]
1216
+ Rcpp::LogicalVector check_few_values(Rcpp::NumericVector arr_num, size_t nrows, size_t ncols, int nthreads)
1217
+ {
1218
+ std::vector<char> too_few_vals(ncols, 0);
1219
+ check_more_two_values(&arr_num[0], nrows, ncols, nthreads, too_few_vals.data());
1220
+ Rcpp::LogicalVector outp(ncols);
1221
+ for (size_t col = 0; col < ncols; col++) {
1222
+ outp[col] = (bool) too_few_vals[col];
1223
+ }
1224
+ return outp;
1225
+ }