outliertree 0.1.0
- checksums.yaml +7 -0
- data/CHANGELOG.md +3 -0
- data/LICENSE.txt +674 -0
- data/NOTICE.txt +14 -0
- data/README.md +107 -0
- data/ext/outliertree/ext.cpp +260 -0
- data/ext/outliertree/extconf.rb +21 -0
- data/lib/outliertree.rb +17 -0
- data/lib/outliertree/dataset.rb +35 -0
- data/lib/outliertree/model.rb +128 -0
- data/lib/outliertree/result.rb +190 -0
- data/lib/outliertree/version.rb +3 -0
- data/vendor/outliertree/LICENSE +674 -0
- data/vendor/outliertree/README.md +155 -0
- data/vendor/outliertree/src/Makevars +3 -0
- data/vendor/outliertree/src/RcppExports.cpp +123 -0
- data/vendor/outliertree/src/Rwrapper.cpp +1225 -0
- data/vendor/outliertree/src/cat_outlier.cpp +328 -0
- data/vendor/outliertree/src/clusters.cpp +972 -0
- data/vendor/outliertree/src/fit_model.cpp +1932 -0
- data/vendor/outliertree/src/misc.cpp +685 -0
- data/vendor/outliertree/src/outlier_tree.hpp +758 -0
- data/vendor/outliertree/src/predict.cpp +706 -0
- data/vendor/outliertree/src/split.cpp +1098 -0
- metadata +150 -0
@@ -0,0 +1,155 @@
|
|
1
|
+
# OutlierTree
|
2
|
+
|
3
|
+
Explainable outlier/anomaly detection based on smart decision tree grouping, similar in spirit to the GritBot software developed by RuleQuest research. Written in C++ with interfaces for R and Python. Supports columns of types numeric, categorical, binary/boolean, and ordinal, and can handle missing values in all of them. Ideal as a sanity checker in exploratory data analysis.
|
4
|
+
|
5
|
+
# How it works
|
6
|
+
|
7
|
+
The library fits decision trees that try to "predict" the values of each column from the values of the other columns. Each time a split is evaluated, the observations that fall into each branch are treated as a homogeneous cluster, and outliers are searched for in the 1-d distribution of the column being predicted. Outliers are determined according to confidence intervals on this 1-d distribution, and a value must also be separated by a large gap from the next observation in sorted order to be flagged. Because outliers are searched for within a decision tree branch, the conditions that make an observation rare compared to others meeting the same conditions are known, and those conditions are always correlated with the target variable (as it is being predicted from them).
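
To make the "confidence interval plus gap" idea concrete, here is a minimal Python sketch of flagging high outliers in a single 1-d cluster. It is a simplification for illustration only, not the procedure implemented in the library: the function name `flag_high_outliers` and the parameters `z_threshold` and `gap_threshold` are hypothetical, and the mean and standard deviation are simply computed on the remaining values rather than through the library's own estimates.

```python
import statistics


def flag_high_outliers(values, z_threshold=4.0, gap_threshold=2.0):
    """Return the values that look like high outliers in a 1-d sample.

    Hypothetical simplification: a value is flagged when it lies more than
    `z_threshold` standard deviations above the mean of the remaining values
    AND is separated from the next-largest value by more than `gap_threshold`
    standard deviations.
    """
    ordered = sorted(values)
    flagged = []
    # Peel candidates off the top of the sorted sample, one at a time.
    while len(ordered) > 2:
        candidate, rest = ordered[-1], ordered[:-1]
        mean = statistics.fmean(rest)
        sd = statistics.stdev(rest)
        if sd == 0:
            break  # no spread left to compare against
        z_score = (candidate - mean) / sd
        gap = (candidate - rest[-1]) / sd
        if z_score > z_threshold and gap > gap_threshold:
            flagged.append(candidate)
            ordered = rest
        else:
            break  # the largest remaining value is not isolated enough
    return flagged


# Ages within one branch (e.g. pregnant = true), with one implausible value.
ages = [28, 30, 31, 33, 29, 35, 32, 27, 34, 30, 75]
print(flag_high_outliers(ages))
```

Requiring a gap in addition to a threshold on the distribution keeps the sketch from flagging the tail of a long but continuous distribution, which is the motivation the paragraph above gives for the gap criterion.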
|
8
|
+
|
9
|
+
As such, it can only detect outliers that can be described through decision tree logic. Unlike other methods such as [Isolation Forests](https://github.com/david-cortes/isotree), it cannot assign an outlier score to each observation, nor detect outliers that are just rare overall, but it will always provide a human-readable justification when it flags an outlier.
|
10
|
+
|
11
|
+
Procedure is described in more detail in [Explainable outlier detection through decision tree conditioning](http://arxiv.org/abs/2001.00636).
|
12
|
+
|
13
|
+
# Example outputs
|
14
|
+
|
15
|
+
Example outliers from [hypothyroid dataset](http://archive.ics.uci.edu/ml/datasets/thyroid+disease):
|
16
|
+
```
|
17
|
+
row [1137] - suspicious column: [age] - suspicious value: [75.000]
|
18
|
+
distribution: 95.122% <= 42.000 - [mean: 31.462] - [sd: 5.281] - [norm. obs: 39]
|
19
|
+
given:
|
20
|
+
[pregnant] = [t]
|
21
|
+
|
22
|
+
|
23
|
+
row [2229] - suspicious column: [T3] - suspicious value: [10.600]
|
24
|
+
distribution: 99.951% <= 7.100 - [mean: 1.984] - [sd: 0.750] - [norm. obs: 2050]
|
25
|
+
given:
|
26
|
+
[query hyperthyroid] = [f]
|
27
|
+
```
|
28
|
+
(i.e. it's saying that it's abnormal to be pregnant at the age of 75, or to not be queried as hyperthyroid when having very high thyroid hormone levels)
|
29
|
+
(this dataset is also bundled into the R package - e.g. `data(hypothyroid)`)
|
30
|
+
|
31
|
+
|
32
|
+
Example outlier from [Titanic dataset](https://www.kaggle.com/c/titanic):
|
33
|
+
```
|
34
|
+
row [885] - suspicious column: [Fare] - suspicious value: [29.125]
|
35
|
+
distribution: 97.849% <= 15.500 - [mean: 7.887] - [sd: 1.173] - [norm. obs: 91]
|
36
|
+
given:
|
37
|
+
[Pclass] = [3]
|
38
|
+
[SibSp] = [0]
|
39
|
+
[Embarked] = [Q]
|
40
|
+
```
|
41
|
+
(i.e. it's saying that this person paid too much for the kind of accommodation he had)
|
42
|
+
|
43
|
+
_Note that it can also produce other types of conditions such as 'between' (for numeric intervals) or 'in' (for categorical subsets)_
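
As an illustration of how such conditions describe the group an outlier is compared against, the following Python sketch checks a row against a list of conditions of the three kinds mentioned ('=', 'between', 'in'). The representation is hypothetical (the tuple layout and the function name `matches` are assumptions, not the library's internal format), and missing values are simply treated as not matching, which is a simplification.

```python
def matches(row, conditions):
    """Return True if `row` (a dict) satisfies every condition.

    Each condition is a hypothetical (column, operator, argument) tuple:
      ("Pclass", "=", 3)             -> equality
      ("Age", "between", (20, 40))   -> inclusive numeric interval
      ("Embarked", "in", {"Q", "S"}) -> categorical subset
    """
    for column, operator, argument in conditions:
        value = row.get(column)
        if value is None:
            return False  # simplification: missing values never match
        if operator == "=":
            ok = value == argument
        elif operator == "between":
            low, high = argument
            ok = low <= value <= high
        elif operator == "in":
            ok = value in argument
        else:
            raise ValueError(f"unknown operator: {operator!r}")
        if not ok:
            return False
    return True


# Conditions taken from the Titanic example output above.
conditions = [("Pclass", "=", 3), ("SibSp", "=", 0), ("Embarked", "=", "Q")]
passenger = {"Pclass": 3, "SibSp": 0, "Embarked": "Q", "Fare": 29.125}
print(matches(passenger, conditions))
```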
|
44
|
+
|
45
|
+
# Installation
|
46
|
+
|
47
|
+
* For R:
|
48
|
+
```r
|
49
|
+
install.packages("outliertree")
|
50
|
+
```
|
51
|
+
|
52
|
+
|
53
|
+
* For Python:
|
54
|
+
```
|
55
|
+
pip install outliertree
|
56
|
+
```
|
57
|
+
(Package has only been tested in Python 3)
|
58
|
+
|
59
|
+
**Note for macOS users:** on macOS, the Python version of this package will compile **without** multi-threading capabilities. This is because Apple's default distribution of `clang` does not provide OpenMP modules, and because it is aliased to `gcc`, which confuses build scripts. If you have a non-Apple version of `clang` with the OpenMP modules, or if you have `gcc` installed, you can compile this package with multi-threading enabled by setting the environment variable `ENABLE_OMP=1`:
|
60
|
+
```
|
61
|
+
export ENABLE_OMP=1
|
62
|
+
pip install outliertree
|
63
|
+
```
|
64
|
+
(Alternatively, can also pass argument `enable-omp` to the `setup.py` file: `python setup.py install enable-omp`)
|
65
|
+
|
66
|
+
|
67
|
+
* For C++: the package has no build system and no `main` function that could produce an executable, but it can be built as a shared object and wrapped into other languages with any C++11-compliant compiler (`-std=c++11` in most compilers, `/std:c++14` in MSVC). Parallelization requires OpenMP linkage (`-fopenmp` in most compilers, `/openmp` in MSVC). The package should *not* be built with optimization higher than `O3` (i.e. don't use `-Ofast`). It needs linkage to the `math` library, which is enabled by default in most C++ compilers but otherwise requires the `-lm` argument. No external dependencies are required.
|
68
|
+
|
69
|
+
|
70
|
+
# Sample usage
|
71
|
+
|
72
|
+
* For R:
|
73
|
+
```r
|
74
|
+
library(outliertree)
|
75
|
+
|
76
|
+
### random data frame with an obvious outlier
|
77
|
+
nrows = 100
|
78
|
+
set.seed(1)
|
79
|
+
df = data.frame(
|
80
|
+
numeric_col1 = c(rnorm(nrows - 1), 1e6),
|
81
|
+
numeric_col2 = rgamma(nrows, 1),
|
82
|
+
categ_col = sample(c('categA', 'categB', 'categC'), size = nrows, replace = TRUE)
|
83
|
+
)
|
84
|
+
|
85
|
+
### test data frame with another obvious outlier
|
86
|
+
nrows_test = 50
|
87
|
+
df_test = data.frame(
|
88
|
+
numeric_col1 = rnorm(nrows_test),
|
89
|
+
numeric_col2 = c(-1e6, rgamma(nrows_test - 1, 1)),
|
90
|
+
categ_col = sample(c('categA', 'categB', 'categC'), size = nrows_test, replace = TRUE)
|
91
|
+
)
|
92
|
+
|
93
|
+
### fit model
|
94
|
+
outliers_model = outliertree::outlier.tree(df, outliers_print = 10, save_outliers = TRUE)
|
95
|
+
|
96
|
+
### find outliers in new data
|
97
|
+
new_outliers = predict(outliers_model, df_test, outliers_print = 10, return_outliers = TRUE)
|
98
|
+
|
99
|
+
### print outliers in readable format
|
100
|
+
summary(new_outliers)
|
101
|
+
```
|
102
|
+
(see documentation for more examples)
|
103
|
+
|
104
|
+
Example [RMarkdown](http://htmlpreview.github.io/?https://github.com/david-cortes/outliertree/blob/master/example/titanic_outliertree_r.html) using the Titanic dataset.
|
105
|
+
|
106
|
+
|
107
|
+
* For Python:
|
108
|
+
```python
|
109
|
+
import numpy as np, pandas as pd
|
110
|
+
from outliertree import OutlierTree
|
111
|
+
|
112
|
+
### random data frame with an obvious outlier
|
113
|
+
nrows = 100
|
114
|
+
np.random.seed(1)
|
115
|
+
df = pd.DataFrame({
|
116
|
+
"numeric_col1" : np.r_[np.random.normal(size = nrows - 1), np.array([float(1e6)])],
|
117
|
+
"numeric_col2" : np.random.gamma(1, 1, size = nrows),
|
118
|
+
"categ_col" : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
|
119
|
+
})
|
120
|
+
|
121
|
+
### test data frame with another obvious outlier
|
122
|
+
df_test = pd.DataFrame({
|
123
|
+
"numeric_col1" : np.random.normal(size = nrows),
|
124
|
+
"numeric_col2" : np.r_[np.array([float(-1e6)]), np.random.gamma(1, 1, size = nrows - 1)],
|
125
|
+
"categ_col" : np.random.choice(['categA', 'categB', 'categC'], size = nrows)
|
126
|
+
})
|
127
|
+
|
128
|
+
### fit model
|
129
|
+
outliers_model = OutlierTree()
|
130
|
+
outliers_df = outliers_model.fit(df, outliers_print = 10, return_outliers = True)
|
131
|
+
|
132
|
+
### find outliers in new data
|
133
|
+
new_outliers = outliers_model.predict(df_test)
|
134
|
+
|
135
|
+
### print outliers in readable format
|
136
|
+
outliers_model.print_outliers(new_outliers)
|
137
|
+
```
|
138
|
+
|
139
|
+
Example [IPython notebook](http://nbviewer.ipython.org/github/david-cortes/outliertree/blob/master/example/titanic_outliertree_python.ipynb) using the Titanic dataset.
|
140
|
+
|
141
|
+
* For C++: see functions `fit_outliers_models` and `find_new_outliers` in header `outlier_tree.hpp`.
|
142
|
+
|
143
|
+
# Documentation
|
144
|
+
|
145
|
+
* For R: documentation is built into the package (e.g. `help(outliertree::outlier.tree)`); a PDF can be downloaded from [CRAN](https://cran.r-project.org/web/packages/outliertree/index.html).
|
146
|
+
|
147
|
+
* For Python: documentation is available at [ReadTheDocs](http://outliertree.readthedocs.io/en/latest/) (it is also built into the package as docstrings, e.g. `help(outliertree.OutlierTree.fit)`).
|
148
|
+
|
149
|
+
* For C++: documentation is available in the source files (not in the header).
|
150
|
+
|
151
|
+
# References
|
152
|
+
|
153
|
+
* Cortes, David. "Explainable outlier detection through decision tree conditioning." arXiv preprint arXiv:2001.00636 (2020).
|
154
|
+
* [GritBot software](https://www.rulequest.com/gritbot-info.html).
|
155
|
+
|
@@ -0,0 +1,123 @@
|
|
1
|
+
// Generated by using Rcpp::compileAttributes() -> do not edit by hand
|
2
|
+
// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
|
3
|
+
|
4
|
+
#include <Rcpp.h>
|
5
|
+
|
6
|
+
using namespace Rcpp;
|
7
|
+
|
8
|
+
// deserialize_OutlierTree
|
9
|
+
SEXP deserialize_OutlierTree(Rcpp::RawVector src);
|
10
|
+
RcppExport SEXP _outliertree_deserialize_OutlierTree(SEXP srcSEXP) {
|
11
|
+
BEGIN_RCPP
|
12
|
+
Rcpp::RObject rcpp_result_gen;
|
13
|
+
Rcpp::RNGScope rcpp_rngScope_gen;
|
14
|
+
Rcpp::traits::input_parameter< Rcpp::RawVector >::type src(srcSEXP);
|
15
|
+
rcpp_result_gen = Rcpp::wrap(deserialize_OutlierTree(src));
|
16
|
+
return rcpp_result_gen;
|
17
|
+
END_RCPP
|
18
|
+
}
|
19
|
+
// check_null_ptr_model
|
20
|
+
Rcpp::LogicalVector check_null_ptr_model(SEXP ptr_model);
|
21
|
+
RcppExport SEXP _outliertree_check_null_ptr_model(SEXP ptr_modelSEXP) {
|
22
|
+
BEGIN_RCPP
|
23
|
+
Rcpp::RObject rcpp_result_gen;
|
24
|
+
Rcpp::RNGScope rcpp_rngScope_gen;
|
25
|
+
Rcpp::traits::input_parameter< SEXP >::type ptr_model(ptr_modelSEXP);
|
26
|
+
rcpp_result_gen = Rcpp::wrap(check_null_ptr_model(ptr_model));
|
27
|
+
return rcpp_result_gen;
|
28
|
+
END_RCPP
|
29
|
+
}
|
30
|
+
// fit_OutlierTree
|
31
|
+
Rcpp::List fit_OutlierTree(Rcpp::NumericVector arr_num, size_t ncols_numeric, Rcpp::IntegerVector arr_cat, size_t ncols_categ, Rcpp::IntegerVector ncat, Rcpp::IntegerVector arr_ord, size_t ncols_ord, Rcpp::IntegerVector ncat_ord, size_t nrows, Rcpp::LogicalVector cols_ignore_r, int nthreads, bool categ_as_bin, bool ord_as_bin, bool cat_bruteforce_subset, bool categ_from_maj, bool take_mid, size_t max_depth, double max_perc_outliers, size_t min_size_numeric, size_t min_size_categ, double min_gain, bool follow_all, bool gain_as_pct, double z_norm, double z_outlier, bool return_outliers, Rcpp::ListOf<Rcpp::StringVector> cat_levels, Rcpp::ListOf<Rcpp::StringVector> ord_levels, Rcpp::StringVector colnames_num, Rcpp::StringVector colnames_cat, Rcpp::StringVector colnames_ord, Rcpp::NumericVector min_date, Rcpp::NumericVector min_ts);
|
32
|
+
RcppExport SEXP _outliertree_fit_OutlierTree(SEXP arr_numSEXP, SEXP ncols_numericSEXP, SEXP arr_catSEXP, SEXP ncols_categSEXP, SEXP ncatSEXP, SEXP arr_ordSEXP, SEXP ncols_ordSEXP, SEXP ncat_ordSEXP, SEXP nrowsSEXP, SEXP cols_ignore_rSEXP, SEXP nthreadsSEXP, SEXP categ_as_binSEXP, SEXP ord_as_binSEXP, SEXP cat_bruteforce_subsetSEXP, SEXP categ_from_majSEXP, SEXP take_midSEXP, SEXP max_depthSEXP, SEXP max_perc_outliersSEXP, SEXP min_size_numericSEXP, SEXP min_size_categSEXP, SEXP min_gainSEXP, SEXP follow_allSEXP, SEXP gain_as_pctSEXP, SEXP z_normSEXP, SEXP z_outlierSEXP, SEXP return_outliersSEXP, SEXP cat_levelsSEXP, SEXP ord_levelsSEXP, SEXP colnames_numSEXP, SEXP colnames_catSEXP, SEXP colnames_ordSEXP, SEXP min_dateSEXP, SEXP min_tsSEXP) {
|
33
|
+
BEGIN_RCPP
|
34
|
+
Rcpp::RObject rcpp_result_gen;
|
35
|
+
Rcpp::RNGScope rcpp_rngScope_gen;
|
36
|
+
Rcpp::traits::input_parameter< Rcpp::NumericVector >::type arr_num(arr_numSEXP);
|
37
|
+
Rcpp::traits::input_parameter< size_t >::type ncols_numeric(ncols_numericSEXP);
|
38
|
+
Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type arr_cat(arr_catSEXP);
|
39
|
+
Rcpp::traits::input_parameter< size_t >::type ncols_categ(ncols_categSEXP);
|
40
|
+
Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type ncat(ncatSEXP);
|
41
|
+
Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type arr_ord(arr_ordSEXP);
|
42
|
+
Rcpp::traits::input_parameter< size_t >::type ncols_ord(ncols_ordSEXP);
|
43
|
+
Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type ncat_ord(ncat_ordSEXP);
|
44
|
+
Rcpp::traits::input_parameter< size_t >::type nrows(nrowsSEXP);
|
45
|
+
Rcpp::traits::input_parameter< Rcpp::LogicalVector >::type cols_ignore_r(cols_ignore_rSEXP);
|
46
|
+
Rcpp::traits::input_parameter< int >::type nthreads(nthreadsSEXP);
|
47
|
+
Rcpp::traits::input_parameter< bool >::type categ_as_bin(categ_as_binSEXP);
|
48
|
+
Rcpp::traits::input_parameter< bool >::type ord_as_bin(ord_as_binSEXP);
|
49
|
+
Rcpp::traits::input_parameter< bool >::type cat_bruteforce_subset(cat_bruteforce_subsetSEXP);
|
50
|
+
Rcpp::traits::input_parameter< bool >::type categ_from_maj(categ_from_majSEXP);
|
51
|
+
Rcpp::traits::input_parameter< bool >::type take_mid(take_midSEXP);
|
52
|
+
Rcpp::traits::input_parameter< size_t >::type max_depth(max_depthSEXP);
|
53
|
+
Rcpp::traits::input_parameter< double >::type max_perc_outliers(max_perc_outliersSEXP);
|
54
|
+
Rcpp::traits::input_parameter< size_t >::type min_size_numeric(min_size_numericSEXP);
|
55
|
+
Rcpp::traits::input_parameter< size_t >::type min_size_categ(min_size_categSEXP);
|
56
|
+
Rcpp::traits::input_parameter< double >::type min_gain(min_gainSEXP);
|
57
|
+
Rcpp::traits::input_parameter< bool >::type follow_all(follow_allSEXP);
|
58
|
+
Rcpp::traits::input_parameter< bool >::type gain_as_pct(gain_as_pctSEXP);
|
59
|
+
Rcpp::traits::input_parameter< double >::type z_norm(z_normSEXP);
|
60
|
+
Rcpp::traits::input_parameter< double >::type z_outlier(z_outlierSEXP);
|
61
|
+
Rcpp::traits::input_parameter< bool >::type return_outliers(return_outliersSEXP);
|
62
|
+
Rcpp::traits::input_parameter< Rcpp::ListOf<Rcpp::StringVector> >::type cat_levels(cat_levelsSEXP);
|
63
|
+
Rcpp::traits::input_parameter< Rcpp::ListOf<Rcpp::StringVector> >::type ord_levels(ord_levelsSEXP);
|
64
|
+
Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_num(colnames_numSEXP);
|
65
|
+
Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_cat(colnames_catSEXP);
|
66
|
+
Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_ord(colnames_ordSEXP);
|
67
|
+
Rcpp::traits::input_parameter< Rcpp::NumericVector >::type min_date(min_dateSEXP);
|
68
|
+
Rcpp::traits::input_parameter< Rcpp::NumericVector >::type min_ts(min_tsSEXP);
|
69
|
+
rcpp_result_gen = Rcpp::wrap(fit_OutlierTree(arr_num, ncols_numeric, arr_cat, ncols_categ, ncat, arr_ord, ncols_ord, ncat_ord, nrows, cols_ignore_r, nthreads, categ_as_bin, ord_as_bin, cat_bruteforce_subset, categ_from_maj, take_mid, max_depth, max_perc_outliers, min_size_numeric, min_size_categ, min_gain, follow_all, gain_as_pct, z_norm, z_outlier, return_outliers, cat_levels, ord_levels, colnames_num, colnames_cat, colnames_ord, min_date, min_ts));
|
70
|
+
return rcpp_result_gen;
|
71
|
+
END_RCPP
|
72
|
+
}
|
73
|
+
// predict_OutlierTree
|
74
|
+
Rcpp::List predict_OutlierTree(SEXP ptr_model, size_t nrows, int nthreads, Rcpp::NumericVector arr_num, Rcpp::IntegerVector arr_cat, Rcpp::IntegerVector arr_ord, Rcpp::ListOf<Rcpp::StringVector> cat_levels, Rcpp::ListOf<Rcpp::StringVector> ord_levels, Rcpp::StringVector colnames_num, Rcpp::StringVector colnames_cat, Rcpp::StringVector colnames_ord, Rcpp::NumericVector min_date, Rcpp::NumericVector min_ts);
|
75
|
+
RcppExport SEXP _outliertree_predict_OutlierTree(SEXP ptr_modelSEXP, SEXP nrowsSEXP, SEXP nthreadsSEXP, SEXP arr_numSEXP, SEXP arr_catSEXP, SEXP arr_ordSEXP, SEXP cat_levelsSEXP, SEXP ord_levelsSEXP, SEXP colnames_numSEXP, SEXP colnames_catSEXP, SEXP colnames_ordSEXP, SEXP min_dateSEXP, SEXP min_tsSEXP) {
|
76
|
+
BEGIN_RCPP
|
77
|
+
Rcpp::RObject rcpp_result_gen;
|
78
|
+
Rcpp::RNGScope rcpp_rngScope_gen;
|
79
|
+
Rcpp::traits::input_parameter< SEXP >::type ptr_model(ptr_modelSEXP);
|
80
|
+
Rcpp::traits::input_parameter< size_t >::type nrows(nrowsSEXP);
|
81
|
+
Rcpp::traits::input_parameter< int >::type nthreads(nthreadsSEXP);
|
82
|
+
Rcpp::traits::input_parameter< Rcpp::NumericVector >::type arr_num(arr_numSEXP);
|
83
|
+
Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type arr_cat(arr_catSEXP);
|
84
|
+
Rcpp::traits::input_parameter< Rcpp::IntegerVector >::type arr_ord(arr_ordSEXP);
|
85
|
+
Rcpp::traits::input_parameter< Rcpp::ListOf<Rcpp::StringVector> >::type cat_levels(cat_levelsSEXP);
|
86
|
+
Rcpp::traits::input_parameter< Rcpp::ListOf<Rcpp::StringVector> >::type ord_levels(ord_levelsSEXP);
|
87
|
+
Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_num(colnames_numSEXP);
|
88
|
+
Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_cat(colnames_catSEXP);
|
89
|
+
Rcpp::traits::input_parameter< Rcpp::StringVector >::type colnames_ord(colnames_ordSEXP);
|
90
|
+
Rcpp::traits::input_parameter< Rcpp::NumericVector >::type min_date(min_dateSEXP);
|
91
|
+
Rcpp::traits::input_parameter< Rcpp::NumericVector >::type min_ts(min_tsSEXP);
|
92
|
+
rcpp_result_gen = Rcpp::wrap(predict_OutlierTree(ptr_model, nrows, nthreads, arr_num, arr_cat, arr_ord, cat_levels, ord_levels, colnames_num, colnames_cat, colnames_ord, min_date, min_ts));
|
93
|
+
return rcpp_result_gen;
|
94
|
+
END_RCPP
|
95
|
+
}
|
96
|
+
// check_few_values
|
97
|
+
Rcpp::LogicalVector check_few_values(Rcpp::NumericVector arr_num, size_t nrows, size_t ncols, int nthreads);
|
98
|
+
RcppExport SEXP _outliertree_check_few_values(SEXP arr_numSEXP, SEXP nrowsSEXP, SEXP ncolsSEXP, SEXP nthreadsSEXP) {
|
99
|
+
BEGIN_RCPP
|
100
|
+
Rcpp::RObject rcpp_result_gen;
|
101
|
+
Rcpp::RNGScope rcpp_rngScope_gen;
|
102
|
+
Rcpp::traits::input_parameter< Rcpp::NumericVector >::type arr_num(arr_numSEXP);
|
103
|
+
Rcpp::traits::input_parameter< size_t >::type nrows(nrowsSEXP);
|
104
|
+
Rcpp::traits::input_parameter< size_t >::type ncols(ncolsSEXP);
|
105
|
+
Rcpp::traits::input_parameter< int >::type nthreads(nthreadsSEXP);
|
106
|
+
rcpp_result_gen = Rcpp::wrap(check_few_values(arr_num, nrows, ncols, nthreads));
|
107
|
+
return rcpp_result_gen;
|
108
|
+
END_RCPP
|
109
|
+
}
|
110
|
+
|
111
|
+
static const R_CallMethodDef CallEntries[] = {
|
112
|
+
{"_outliertree_deserialize_OutlierTree", (DL_FUNC) &_outliertree_deserialize_OutlierTree, 1},
|
113
|
+
{"_outliertree_check_null_ptr_model", (DL_FUNC) &_outliertree_check_null_ptr_model, 1},
|
114
|
+
{"_outliertree_fit_OutlierTree", (DL_FUNC) &_outliertree_fit_OutlierTree, 33},
|
115
|
+
{"_outliertree_predict_OutlierTree", (DL_FUNC) &_outliertree_predict_OutlierTree, 13},
|
116
|
+
{"_outliertree_check_few_values", (DL_FUNC) &_outliertree_check_few_values, 4},
|
117
|
+
{NULL, NULL, 0}
|
118
|
+
};
|
119
|
+
|
120
|
+
RcppExport void R_init_outliertree(DllInfo *dll) {
|
121
|
+
R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
|
122
|
+
R_useDynamicSymbols(dll, FALSE);
|
123
|
+
}
|
@@ -0,0 +1,1225 @@
|
|
1
|
+
#include <Rcpp.h>
|
2
|
+
// [[Rcpp::plugins(cpp11)]]
|
3
|
+
|
4
|
+
/* This is to serialize the model objects */
|
5
|
+
// [[Rcpp::depends(Rcereal)]]
|
6
|
+
#include <cereal/archives/binary.hpp>
|
7
|
+
#include <cereal/types/vector.hpp>
|
8
|
+
#include <sstream>
|
9
|
+
#include <string>
|
10
|
+
|
11
|
+
/* This is the package's header */
|
12
|
+
#include "outlier_tree.hpp"
|
13
|
+
|
14
|
+
/* for model serialization and re-usage in R */
|
15
|
+
/* https://stackoverflow.com/questions/18474292/how-to-handle-c-internal-data-structure-in-r-in-order-to-allow-save-load */
|
16
|
+
/* this extra comment below the link is a workaround for Rcpp issue 675 in GitHub, do not remove it */
|
17
|
+
#include <Rinternals.h>
|
18
|
+
Rcpp::RawVector serialize_OutlierTree(ModelOutputs *model_outputs)
|
19
|
+
{
|
20
|
+
std::stringstream ss;
|
21
|
+
{
|
22
|
+
cereal::BinaryOutputArchive oarchive(ss); // Create an output archive
|
23
|
+
oarchive(*model_outputs);
|
24
|
+
}
|
25
|
+
ss.seekg(0, ss.end);
|
26
|
+
Rcpp::RawVector retval(ss.tellg());
|
27
|
+
ss.seekg(0, ss.beg);
|
28
|
+
ss.read(reinterpret_cast<char*>(&retval[0]), retval.size());
|
29
|
+
return retval;
|
30
|
+
}
|
31
|
+
|
32
|
+
// [[Rcpp::export]]
|
33
|
+
SEXP deserialize_OutlierTree(Rcpp::RawVector src)
|
34
|
+
{
|
35
|
+
std::stringstream ss;
|
36
|
+
ss.write(reinterpret_cast<char*>(&src[0]), src.size());
|
37
|
+
ss.seekg(0, ss.beg);
|
38
|
+
std::unique_ptr<ModelOutputs> model_outputs = std::unique_ptr<ModelOutputs>(new ModelOutputs());
|
39
|
+
{
|
40
|
+
cereal::BinaryInputArchive iarchive(ss);
|
41
|
+
iarchive(*model_outputs);
|
42
|
+
}
|
43
|
+
return Rcpp::XPtr<ModelOutputs>(model_outputs.release(), true);
|
44
|
+
}
|
45
|
+
|
46
|
+
// [[Rcpp::export]]
|
47
|
+
Rcpp::LogicalVector check_null_ptr_model(SEXP ptr_model)
|
48
|
+
{
|
49
|
+
return Rcpp::LogicalVector(R_ExternalPtrAddr(ptr_model) == NULL);
|
50
|
+
}
|
51
|
+
|
52
|
+
double* set_R_nan_as_C_nan(double *restrict x_R, std::vector<double> &x_C, size_t n, int nthreads)
|
53
|
+
{
|
54
|
+
x_C.assign(x_R, x_R + n);
|
55
|
+
#pragma omp parallel for schedule(static) num_threads(nthreads) shared(x_R, x_C, n)
|
56
|
+
for (size_t_for i = 0; i < n; i++)
|
57
|
+
if (isnan(x_R[i]) || Rcpp::NumericVector::is_na(x_R[i]) || Rcpp::traits::is_nan<REALSXP>(x_R[i]))
|
58
|
+
x_C[i] = NAN;
|
59
|
+
return x_C.data();
|
60
|
+
}
|
61
|
+
|
62
|
+
|
63
|
+
/* for predicting outliers */
|
64
|
+
Rcpp::List describe_outliers(ModelOutputs &model_outputs,
|
65
|
+
double *arr_num,
|
66
|
+
int *arr_cat,
|
67
|
+
int *arr_ord,
|
68
|
+
Rcpp::ListOf<Rcpp::StringVector> cat_levels,
|
69
|
+
Rcpp::ListOf<Rcpp::StringVector> ord_levels,
|
70
|
+
Rcpp::StringVector colnames_num,
|
71
|
+
Rcpp::StringVector colnames_cat,
|
72
|
+
Rcpp::StringVector colnames_ord,
|
73
|
+
Rcpp::NumericVector min_date,
|
74
|
+
Rcpp::NumericVector min_ts)
|
75
|
+
{
|
76
|
+
size_t nrows = model_outputs.outlier_scores_final.size();
|
77
|
+
size_t ncols_num = model_outputs.ncols_numeric;
|
78
|
+
size_t ncols_cat = model_outputs.ncols_categ;
|
79
|
+
size_t ncols_num_num = model_outputs.ncols_numeric - min_date.size() - min_ts.size();
|
80
|
+
size_t ncols_date = min_date.size();
|
81
|
+
size_t ncols_cat_cat = cat_levels.size();
|
82
|
+
Rcpp::List outp;
|
83
|
+
|
84
|
+
Rcpp::LogicalVector has_na_col = Rcpp::LogicalVector(nrows, NA_LOGICAL);
|
85
|
+
Rcpp::IntegerVector tree_depth = Rcpp::IntegerVector(nrows, NA_INTEGER);
|
86
|
+
Rcpp::NumericVector outlier_score = Rcpp::NumericVector(nrows, NA_REAL);
|
87
|
+
Rcpp::ListOf<Rcpp::List> outlier_val = Rcpp::ListOf<Rcpp::List>(nrows);
|
88
|
+
Rcpp::ListOf<Rcpp::List> lst_stats = Rcpp::ListOf<Rcpp::List>(nrows);
|
89
|
+
Rcpp::ListOf<Rcpp::List> lst_cond = Rcpp::ListOf<Rcpp::List>(nrows);
|
90
|
+
|
91
|
+
|
92
|
+
size_t outl_col;
|
93
|
+
size_t outl_clust;
|
94
|
+
size_t curr_tree;
|
95
|
+
size_t parent_tree;
|
96
|
+
Rcpp::LogicalVector tmp_bool;
|
97
|
+
|
98
|
+
for (size_t row = 0; row < nrows; row++) {
|
99
|
+
if (model_outputs.outlier_scores_final[row] < 1) {
|
100
|
+
|
101
|
+
outl_col = model_outputs.outlier_columns_final[row];
|
102
|
+
outl_clust = model_outputs.outlier_clusters_final[row];
|
103
|
+
|
104
|
+
/* metrics of outlierness - used to rank when choosing which to print */
|
105
|
+
outlier_score[row] = model_outputs.outlier_scores_final[row];
|
106
|
+
tree_depth[row] = (int)model_outputs.outlier_depth_final[row];
|
107
|
+
has_na_col[row] = model_outputs.all_clusters[outl_col][outl_clust].has_NA_branch;
|
108
|
+
|
109
|
+
/* first determine outlier column and suspected value */
|
110
|
+
if (outl_col < ncols_num) {
|
111
|
+
if (outl_col < ncols_num_num) {
|
112
|
+
outlier_val[row] = Rcpp::List::create(
|
113
|
+
Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_num[outl_col]),
|
114
|
+
Rcpp::_["value"] = Rcpp::wrap(arr_num[row + outl_col * nrows]),
|
115
|
+
Rcpp::_["decimals"] = Rcpp::wrap(model_outputs.outlier_decimals_distr[row])
|
116
|
+
);
|
117
|
+
} else if (outl_col < (ncols_num_num + ncols_date)) {
|
118
|
+
outlier_val[row] = Rcpp::List::create(
|
119
|
+
Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_num[outl_col]),
|
120
|
+
Rcpp::_["value"] = Rcpp::Date(arr_num[row + outl_col * nrows] - 1 + min_date[outl_col - ncols_num_num])
|
121
|
+
);
|
122
|
+
} else {
|
123
|
+
outlier_val[row] = Rcpp::List::create(
|
124
|
+
Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_num[outl_col]),
|
125
|
+
Rcpp::_["value"] = Rcpp::Datetime(arr_num[row + outl_col * nrows] - 1 + min_ts[outl_col - ncols_num_num - ncols_date])
|
126
|
+
);
|
127
|
+
}
|
128
|
+
} else if (outl_col < (ncols_num + ncols_cat)) {
|
129
|
+
if (outl_col < (ncols_num + ncols_cat_cat)) {
|
130
|
+
outlier_val[row] = Rcpp::List::create(
|
131
|
+
Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_cat[outl_col - ncols_num]),
|
132
|
+
Rcpp::_["value"] = Rcpp::CharacterVector(1, cat_levels[outl_col - ncols_num]
|
133
|
+
[arr_cat[row + (outl_col - ncols_num) * nrows]])
|
134
|
+
);
|
135
|
+
} else {
|
136
|
+
outlier_val[row] = Rcpp::List::create(
|
137
|
+
Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_cat[outl_col - ncols_num]),
|
138
|
+
Rcpp::_["value"] = Rcpp::wrap((bool)arr_cat[row + (outl_col - ncols_num) * nrows])
|
139
|
+
);
|
140
|
+
}
|
141
|
+
} else {
|
142
|
+
outlier_val[row] = Rcpp::List::create(
|
143
|
+
Rcpp::_["column"] = Rcpp::CharacterVector(1, colnames_ord[outl_col - ncols_num - ncols_cat]),
|
144
|
+
Rcpp::_["value"] = Rcpp::CharacterVector(1, ord_levels[outl_col - ncols_num - ncols_cat]
|
145
|
+
[arr_ord[row + (outl_col - ncols_num - ncols_cat) * nrows]])
|
146
|
+
);
|
147
|
+
}
|
148
|
+
|
149
|
+
|
150
|
+
/* info about the normal observations in the cluster */
|
151
|
+
if (outl_col < ncols_num) {
|
152
|
+
if (outl_col < ncols_num_num) {
|
153
|
+
if (arr_num[row + outl_col * nrows] >= model_outputs.all_clusters[outl_col][outl_clust].upper_lim) {
|
154
|
+
lst_stats[row] = Rcpp::List::create(
|
155
|
+
Rcpp::_["upper_thr"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_lim_high),
|
156
|
+
Rcpp::_["pct_below"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_below),
|
157
|
+
Rcpp::_["mean"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_mean),
|
158
|
+
Rcpp::_["sd"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_sd),
|
159
|
+
Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
160
|
+
);
|
161
|
+
} else {
|
162
|
+
lst_stats[row] = Rcpp::List::create(
|
163
|
+
Rcpp::_["lower_thr"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_lim_low),
|
164
|
+
Rcpp::_["pct_above"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_above),
|
165
|
+
Rcpp::_["mean"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_mean),
|
166
|
+
Rcpp::_["sd"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].display_sd),
|
167
|
+
Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
168
|
+
);
|
169
|
+
}
|
170
|
+
} else if (outl_col < (ncols_num_num + ncols_date)) {
|
171
|
+
if (arr_num[row + outl_col * nrows] >= model_outputs.all_clusters[outl_col][outl_clust].upper_lim) {
|
172
|
+
lst_stats[row] = Rcpp::List::create(
|
173
|
+
Rcpp::_["upper_thr"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].display_lim_high
|
174
|
+
- 1 + min_date[outl_col - ncols_num_num]),
|
175
|
+
Rcpp::_["pct_below"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_below),
|
176
|
+
Rcpp::_["mean"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].display_mean - 1 + min_date[outl_col - ncols_num_num]),
|
177
|
+
Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
178
|
+
);
|
179
|
+
} else {
|
180
|
+
lst_stats[row] = Rcpp::List::create(
|
181
|
+
Rcpp::_["lower_thr"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].display_lim_low
|
182
|
+
- 1 + min_date[outl_col - ncols_num_num]),
|
183
|
+
Rcpp::_["pct_above"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_above),
|
184
|
+
Rcpp::_["mean"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].display_mean - 1 + min_date[outl_col - ncols_num_num]),
|
185
|
+
Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
186
|
+
);
|
187
|
+
}
|
188
|
+
} else {
|
189
|
+
if (arr_num[row + outl_col * nrows] >= model_outputs.all_clusters[outl_col][outl_clust].upper_lim) {
|
190
|
+
lst_stats[row] = Rcpp::List::create(
|
191
|
+
Rcpp::_["upper_thr"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].display_lim_high
|
192
|
+
- 1 + min_ts[outl_col - ncols_num_num - ncols_date]),
|
193
|
+
Rcpp::_["pct_below"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_below),
|
194
|
+
Rcpp::_["mean"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].display_mean
|
195
|
+
- 1 + min_ts[outl_col - ncols_num_num - ncols_date]),
|
196
|
+
Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
197
|
+
);
|
198
|
+
} else {
|
199
|
+
lst_stats[row] = Rcpp::List::create(
|
200
|
+
Rcpp::_["lower_thr"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].display_lim_low
|
201
|
+
- 1 + min_ts[outl_col - ncols_num_num - ncols_date]),
|
202
|
+
Rcpp::_["pct_above"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_above),
|
203
|
+
Rcpp::_["mean"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].display_mean
|
204
|
+
- 1 + min_ts[outl_col - ncols_num_num - ncols_date]),
|
205
|
+
Rcpp::_["n_obs"] = Rcpp::wrap((int)model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
206
|
+
);
|
207
|
+
}
|
208
|
+
}
|
209
|
+
} else if (outl_col < (ncols_num + ncols_cat)) {
|
210
|
+
if (outl_col < (ncols_num + ncols_cat_cat)) {
|
211
|
+
tmp_bool = Rcpp::LogicalVector(model_outputs.all_clusters[outl_col][outl_clust].subset_common.size(), false);
|
212
|
+
for (size_t cat = 0; cat < tmp_bool.size(); cat++) {
|
213
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].subset_common[cat] == 0) {
|
214
|
+
tmp_bool[cat] = true;
|
215
|
+
}
|
216
|
+
}
|
217
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].split_type != Root) {
|
218
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].categ_maj < 0) {
|
219
|
+
lst_stats[row] = Rcpp::List::create(
|
220
|
+
Rcpp::_["categs_common"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[outl_col - ncols_num][tmp_bool]),
|
221
|
+
Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
|
222
|
+
Rcpp::_["pct_next_most_comm"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_next_most_comm),
|
223
|
+
Rcpp::_["prior_prob"] = Rcpp::wrap(model_outputs.prop_categ[model_outputs.start_ix_cat_counts[outl_col - ncols_num] +
|
224
|
+
arr_cat[row + (outl_col - ncols_num) * nrows]]),
|
225
|
+
Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
226
|
+
);
|
227
|
+
} else {
|
228
|
+
lst_stats[row] = Rcpp::List::create(
|
229
|
+
Rcpp::_["categ_maj"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[outl_col - ncols_num][
|
230
|
+
model_outputs.all_clusters[outl_col][outl_clust].categ_maj
|
231
|
+
]),
|
232
|
+
Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
|
233
|
+
Rcpp::_["prior_prob"] = Rcpp::wrap(model_outputs.prop_categ[model_outputs.start_ix_cat_counts[outl_col - ncols_num] +
|
234
|
+
arr_cat[row + (outl_col - ncols_num) * nrows]]),
|
235
|
+
Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
236
|
+
);
|
237
|
+
}
|
238
|
+
} else {
|
239
|
+
lst_stats[row] = Rcpp::List::create(
|
240
|
+
Rcpp::_["categs_common"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[outl_col - ncols_num][tmp_bool]),
|
241
|
+
Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
|
242
|
+
Rcpp::_["pct_next_most_comm"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_next_most_comm),
|
243
|
+
Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
244
|
+
);
|
245
|
+
}
|
246
|
+
} else {
|
247
|
+
lst_stats[row] = Rcpp::List::create(
|
248
|
+
Rcpp::_["pct_other"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
|
249
|
+
Rcpp::_["prior_prob"] = Rcpp::wrap(model_outputs.prop_categ[model_outputs.start_ix_cat_counts[outl_col - ncols_num] +
|
250
|
+
arr_cat[row + (outl_col - ncols_num) * nrows]]),
|
251
|
+
Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
252
|
+
);
|
253
|
+
}
|
254
|
+
} else {
|
255
|
+
tmp_bool = Rcpp::LogicalVector(model_outputs.all_clusters[outl_col][outl_clust].subset_common.size(), false);
|
256
|
+
for (size_t cat = 0; cat < tmp_bool.size(); cat++) {
|
257
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].subset_common[cat] == 0) {
|
258
|
+
tmp_bool[cat] = true;
|
259
|
+
}
|
260
|
+
}
|
261
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].split_type != Root) {
|
262
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].categ_maj < 0) {
|
263
|
+
lst_stats[row] = Rcpp::List::create(
|
264
|
+
Rcpp::_["categs_common"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[outl_col - ncols_num - ncols_cat][tmp_bool]),
|
265
|
+
Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
|
266
|
+
Rcpp::_["pct_next_most_comm"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_next_most_comm),
|
267
|
+
Rcpp::_["prior_prob"] = Rcpp::wrap(model_outputs.prop_categ[model_outputs.start_ix_cat_counts[outl_col - ncols_num] +
|
268
|
+
arr_ord[row + (outl_col - ncols_num - ncols_cat) * nrows]]),
|
269
|
+
Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
270
|
+
);
|
271
|
+
} else {
|
272
|
+
lst_stats[row] = Rcpp::List::create(
|
273
|
+
Rcpp::_["categ_maj"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[outl_col - ncols_num - ncols_cat][
|
274
|
+
model_outputs.all_clusters[outl_col][outl_clust].categ_maj
|
275
|
+
]),
|
276
|
+
Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
|
277
|
+
Rcpp::_["prior_prob"] = Rcpp::wrap(model_outputs.prop_categ[model_outputs.start_ix_cat_counts[outl_col - ncols_num] +
|
278
|
+
arr_ord[row + (outl_col - ncols_num - ncols_cat) * nrows]]),
|
279
|
+
Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
280
|
+
);
|
281
|
+
}
|
282
|
+
} else {
|
283
|
+
lst_stats[row] = Rcpp::List::create(
|
284
|
+
Rcpp::_["categs_common"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[outl_col - ncols_num - ncols_cat][tmp_bool]),
|
285
|
+
Rcpp::_["pct_common"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_in_subset),
|
286
|
+
Rcpp::_["pct_next_most_comm"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].perc_next_most_comm),
|
287
|
+
Rcpp::_["n_obs"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].cluster_size)
|
288
|
+
);
|
289
|
+
}
|
290
|
+
}
|
291
|
+
|
292
|
+
|
293
|
+
/* then determine conditions from the cluster */
|
294
|
+
Rcpp::List cond_clust;
|
295
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].column_type != NoType) {
|
296
|
+
|
297
|
+
/* add the column name and actual value for the row */
|
298
|
+
switch(model_outputs.all_clusters[outl_col][outl_clust].column_type) {
|
299
|
+
case Numeric:
|
300
|
+
{
|
301
|
+
cond_clust["column"] = Rcpp::CharacterVector(1, colnames_num[model_outputs.all_clusters[outl_col][outl_clust].col_num]);
|
302
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_num_num) {
|
303
|
+
cond_clust["value_this"] = Rcpp::wrap(arr_num[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]);
|
304
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].split_type != IsNa)
|
305
|
+
cond_clust["decimals"] = Rcpp::wrap(model_outputs.min_decimals_col[model_outputs.all_clusters[outl_col][outl_clust].col_num]);
|
306
|
+
} else if (model_outputs.all_clusters[outl_col][outl_clust].col_num < (ncols_num_num + ncols_date)) {
|
307
|
+
cond_clust["value_this"] = Rcpp::Date(arr_num[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]
|
308
|
+
- 1 + min_date[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num]);
|
309
|
+
} else {
|
310
|
+
cond_clust["value_this"] = Rcpp::Datetime(arr_num[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]
|
311
|
+
- 1 + min_ts[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num - ncols_date]);
|
312
|
+
}
|
313
|
+
break;
|
314
|
+
}
|
315
|
+
|
316
|
+
case Categorical:
|
317
|
+
{
|
318
|
+
cond_clust["column"] = Rcpp::CharacterVector(1, colnames_cat[model_outputs.all_clusters[outl_col][outl_clust].col_num]);
|
319
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_cat_cat) {
|
320
|
+
if (arr_cat[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows] >= 0) {
|
321
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
|
322
|
+
[arr_cat[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]]);
|
323
|
+
} else {
|
324
|
+
cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
325
|
+
}
|
326
|
+
} else {
|
327
|
+
|
328
|
+
if (arr_cat[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows] >= 0) {
|
329
|
+
cond_clust["value_this"] = Rcpp::wrap((bool)arr_cat[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]);
|
330
|
+
} else {
|
331
|
+
cond_clust["value_this"] = Rcpp::LogicalVector(1, NA_LOGICAL);
|
332
|
+
}
|
333
|
+
}
|
334
|
+
break;
|
335
|
+
}
|
336
|
+
|
337
|
+
case Ordinal:
|
338
|
+
{
|
339
|
+
cond_clust["column"] = Rcpp::CharacterVector(1, colnames_ord[model_outputs.all_clusters[outl_col][outl_clust].col_num]);
|
340
|
+
if (arr_ord[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows] >= 0) {
|
341
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
|
342
|
+
[arr_ord[row + model_outputs.all_clusters[outl_col][outl_clust].col_num * nrows]]);
|
343
|
+
} else {
|
344
|
+
cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
345
|
+
}
|
346
|
+
break;
|
347
|
+
}
|
348
|
+
}
|
349
|
+
|
350
|
+
/* add the comparison point */
|
351
|
+
switch(model_outputs.all_clusters[outl_col][outl_clust].split_type) {
|
352
|
+
|
353
|
+
case IsNa:
|
354
|
+
{
|
355
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
|
356
|
+
switch(model_outputs.all_clusters[outl_col][outl_clust].column_type) {
|
357
|
+
case Numeric:
|
358
|
+
{
|
359
|
+
/* http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2012-October/004379.html */
|
360
|
+
/* this extra comment, placed below the URL, prevents a bug in Rcpp with comments containing forward slashes */
|
361
|
+
cond_clust["value_comp"] = Rcpp::wrap(NA_REAL);
|
362
|
+
break;
|
363
|
+
}
|
364
|
+
|
365
|
+
case Categorical:
|
366
|
+
{
|
367
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_cat_cat) {
|
368
|
+
cond_clust["value_comp"] = Rcpp::wrap(NA_STRING);
|
369
|
+
} else {
|
370
|
+
cond_clust["value_comp"] = Rcpp::LogicalVector(1, NA_LOGICAL);
|
371
|
+
}
|
372
|
+
break;
|
373
|
+
}
|
374
|
+
|
375
|
+
case Ordinal:
|
376
|
+
{
|
377
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
378
|
+
break;
|
379
|
+
}
|
380
|
+
}
|
381
|
+
break;
|
382
|
+
}
|
383
|
+
|
384
|
+
case LessOrEqual:
|
385
|
+
{
|
386
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].column_type == Numeric) {
|
387
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_num_num) {
|
388
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("<=");
|
389
|
+
cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].split_point);
|
390
|
+
} else if (model_outputs.all_clusters[outl_col][outl_clust].col_num < (ncols_num_num + ncols_date)) {
|
391
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("<=");
|
392
|
+
cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].split_point
|
393
|
+
- 1 + min_date[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num]);
|
394
|
+
} else {
|
395
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("<=");
|
396
|
+
cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].split_point
|
397
|
+
- 1 + min_ts[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num - ncols_date]);
|
398
|
+
}
|
399
|
+
} else {
|
400
|
+
tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num].size(), false);
|
401
|
+
for (int cat = 0; cat <= model_outputs.all_clusters[outl_col][outl_clust].split_lev; cat++) tmp_bool[cat] = true;
|
402
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
403
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num][tmp_bool]);
|
404
|
+
}
|
405
|
+
break;
|
406
|
+
}
|
407
|
+
|
408
|
+
case Greater:
|
409
|
+
{
|
410
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].column_type == Numeric) {
|
411
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_num_num) {
|
412
|
+
cond_clust["comparison"] = Rcpp::CharacterVector(">");
|
413
|
+
cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_clusters[outl_col][outl_clust].split_point);
|
414
|
+
} else if (model_outputs.all_clusters[outl_col][outl_clust].col_num < (ncols_num_num + ncols_date)) {
|
415
|
+
cond_clust["comparison"] = Rcpp::CharacterVector(">");
|
416
|
+
cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_clusters[outl_col][outl_clust].split_point
|
417
|
+
- 1 + min_date[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num]);
|
418
|
+
} else {
|
419
|
+
cond_clust["comparison"] = Rcpp::CharacterVector(">");
|
420
|
+
cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_clusters[outl_col][outl_clust].split_point
|
421
|
+
- 1 + min_ts[model_outputs.all_clusters[outl_col][outl_clust].col_num - ncols_num_num - ncols_date]);
|
422
|
+
}
|
423
|
+
} else {
|
424
|
+
tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num].size(), true);
|
425
|
+
for (int cat = 0; cat <= model_outputs.all_clusters[outl_col][outl_clust].split_lev; cat++) tmp_bool[cat] = false;
|
426
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
427
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num][tmp_bool]);
|
428
|
+
}
|
429
|
+
break;
|
430
|
+
}
|
431
|
+
|
432
|
+
case InSubset:
|
433
|
+
{
|
434
|
+
tmp_bool = Rcpp::LogicalVector(model_outputs.all_clusters[outl_col][outl_clust].split_subset.size(), false);
|
435
|
+
for (size_t cat = 0; cat < model_outputs.all_clusters[outl_col][outl_clust].split_subset.size(); cat++) {
|
436
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].split_subset[cat] > 0) {
|
437
|
+
tmp_bool[cat] = true;
|
438
|
+
}
|
439
|
+
}
|
440
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
441
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num][tmp_bool]);
|
442
|
+
break;
|
443
|
+
}
|
444
|
+
|
445
|
+
case NotInSubset:
|
446
|
+
{
|
447
|
+
tmp_bool = Rcpp::LogicalVector(model_outputs.all_clusters[outl_col][outl_clust].split_subset.size(), false);
|
448
|
+
for (size_t cat = 0; cat < model_outputs.all_clusters[outl_col][outl_clust].split_subset.size(); cat++) {
|
449
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].split_subset[cat] == 0) {
|
450
|
+
tmp_bool[cat] = true;
|
451
|
+
}
|
452
|
+
}
|
453
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
454
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num][tmp_bool]);
|
455
|
+
break;
|
456
|
+
}
|
457
|
+
|
458
|
+
case Equal:
|
459
|
+
{
|
460
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].column_type == Categorical) {
|
461
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_cat_cat) {
|
462
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
463
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
|
464
|
+
[model_outputs.all_clusters[outl_col][outl_clust].split_lev]);
|
465
|
+
} else {
|
466
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
467
|
+
cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_clusters[outl_col][outl_clust].split_lev);
|
468
|
+
}
|
469
|
+
} else {
|
470
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
471
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
|
472
|
+
[model_outputs.all_clusters[outl_col][outl_clust].split_lev]);
|
473
|
+
}
|
474
|
+
break;
|
475
|
+
}
|
476
|
+
|
477
|
+
case NotEqual:
|
478
|
+
{
|
479
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].column_type == Categorical) {
|
480
|
+
if (model_outputs.all_clusters[outl_col][outl_clust].col_num < ncols_cat_cat) {
|
481
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("!=");
|
482
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
|
483
|
+
[model_outputs.all_clusters[outl_col][outl_clust].split_lev]);
|
484
|
+
} else {
|
485
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
486
|
+
cond_clust["value_comp"] = Rcpp::wrap(!((bool)model_outputs.all_clusters[outl_col][outl_clust].split_lev));
|
487
|
+
}
|
488
|
+
} else {
|
489
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("!=");
|
490
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_clusters[outl_col][outl_clust].col_num]
|
491
|
+
[model_outputs.all_clusters[outl_col][outl_clust].split_lev]);
|
492
|
+
}
|
493
|
+
break;
|
494
|
+
}
|
495
|
+
|
496
|
+
}
|
497
|
+
lst_cond[row] = Rcpp::List::create(Rcpp::clone(cond_clust));
|
498
|
+
|
499
|
+
/* finally, add conditions from branches that lead to the cluster */
|
500
|
+
curr_tree = model_outputs.outlier_trees_final[row];
|
501
|
+
Rcpp::List temp_list;
|
502
|
+
while (true) {
|
503
|
+
if (curr_tree == 0 || model_outputs.all_trees[outl_col][curr_tree].parent_branch == SubTrees) {
|
504
|
+
break;
|
505
|
+
}
|
506
|
+
parent_tree = model_outputs.all_trees[outl_col][curr_tree].parent;
|
507
|
+
cond_clust = Rcpp::List();
|
508
|
+
|
509
|
+
/* when using 'follow_all' */
|
510
|
+
if (model_outputs.all_trees[outl_col][parent_tree].all_branches.size() > 0) {
|
511
|
+
|
512
|
+
/* add column name and value */
|
513
|
+
switch(model_outputs.all_trees[outl_col][curr_tree].column_type) {
|
514
|
+
case Numeric:
|
515
|
+
{
|
516
|
+
cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_num[model_outputs.all_trees[outl_col][curr_tree].col_num]);
|
517
|
+
break;
|
518
|
+
}
|
519
|
+
|
520
|
+
case Categorical:
|
521
|
+
{
|
522
|
+
cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_cat[model_outputs.all_trees[outl_col][curr_tree].col_num]);
|
523
|
+
break;
|
524
|
+
}
|
525
|
+
|
526
|
+
case Ordinal:
|
527
|
+
{
|
528
|
+
cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_ord[model_outputs.all_trees[outl_col][curr_tree].col_num]);
|
529
|
+
break;
|
530
|
+
}
|
531
|
+
}
|
532
|
+
|
533
|
+
/* add conditions from tree */
|
534
|
+
switch(model_outputs.all_trees[outl_col][curr_tree].column_type) {
|
535
|
+
|
536
|
+
case Numeric:
|
537
|
+
{
|
538
|
+
/* add decimals if appropriate */
|
539
|
+
if (
|
540
|
+
model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_num_num &&
|
541
|
+
model_outputs.all_trees[outl_col][curr_tree].split_this_branch != IsNa
|
542
|
+
)
|
543
|
+
{
|
544
|
+
cond_clust["decimals"] = Rcpp::wrap(model_outputs.min_decimals_col[model_outputs.all_trees[outl_col][curr_tree].col_num]);
|
545
|
+
}
|
546
|
+
|
547
|
+
/* then conditions */
|
548
|
+
switch(model_outputs.all_trees[outl_col][curr_tree].split_this_branch) {
|
549
|
+
|
550
|
+
case IsNa:
|
551
|
+
{
|
552
|
+
cond_clust["value_this"] = Rcpp::wrap(NA_REAL);
|
553
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
|
554
|
+
cond_clust["value_comp"] = Rcpp::wrap(NA_REAL);
|
555
|
+
break;
|
556
|
+
}
|
557
|
+
|
558
|
+
case LessOrEqual:
|
559
|
+
{
|
560
|
+
if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_num_num) {
|
561
|
+
cond_clust["value_this"] = Rcpp::wrap(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
|
562
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("<=");
|
563
|
+
cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_trees[outl_col][curr_tree].split_point);
|
564
|
+
} else if (model_outputs.all_trees[outl_col][curr_tree].col_num < (ncols_num_num + ncols_date)) {
|
565
|
+
cond_clust["value_this"] = Rcpp::Date(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]
|
566
|
+
- 1 + min_date[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num]);
|
567
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("<=");
|
568
|
+
cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_trees[outl_col][curr_tree].split_point
|
569
|
+
- 1 + min_date[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num]);
|
570
|
+
} else {
|
571
|
+
cond_clust["value_this"] = Rcpp::Datetime(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]
|
572
|
+
- 1 + min_ts[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num - ncols_date]);
|
573
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("<=");
|
574
|
+
cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_trees[outl_col][curr_tree].split_point
|
575
|
+
- 1 + min_ts[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num - ncols_date]);
|
576
|
+
}
|
577
|
+
break;
|
578
|
+
}
|
579
|
+
|
580
|
+
case Greater:
|
581
|
+
{
|
582
|
+
if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_num_num) {
|
583
|
+
cond_clust["value_this"] = Rcpp::wrap(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
|
584
|
+
cond_clust["comparison"] = Rcpp::CharacterVector(">");
|
585
|
+
cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_trees[outl_col][curr_tree].split_point);
|
586
|
+
} else if (model_outputs.all_trees[outl_col][curr_tree].col_num < (ncols_num_num + ncols_date)) {
|
587
|
+
cond_clust["value_this"] = Rcpp::Date(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]
|
588
|
+
- 1 + min_date[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num]);
|
589
|
+
cond_clust["comparison"] = Rcpp::CharacterVector(">");
|
590
|
+
cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_trees[outl_col][curr_tree].split_point
|
591
|
+
- 1 + min_date[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num]);
|
592
|
+
} else {
|
593
|
+
cond_clust["value_this"] = Rcpp::Datetime(arr_num[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]
|
594
|
+
- 1 + min_ts[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num - ncols_date]);
|
595
|
+
cond_clust["comparison"] = Rcpp::CharacterVector(">");
|
596
|
+
cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_trees[outl_col][curr_tree].split_point
|
597
|
+
- 1 + min_ts[model_outputs.all_trees[outl_col][curr_tree].col_num - ncols_num_num - ncols_date]);
|
598
|
+
}
|
599
|
+
break;
|
600
|
+
}
|
601
|
+
|
602
|
+
}
|
603
|
+
break;
|
604
|
+
}
|
605
|
+
|
606
|
+
case Categorical:
|
607
|
+
{
|
608
|
+
switch(model_outputs.all_trees[outl_col][curr_tree].split_this_branch) {
|
609
|
+
|
610
|
+
case IsNa:
|
611
|
+
{
|
612
|
+
if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_cat_cat) {
|
613
|
+
cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
614
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
|
615
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
616
|
+
} else {
|
617
|
+
cond_clust["value_this"] = Rcpp::LogicalVector(1, NA_LOGICAL);
|
618
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
|
619
|
+
cond_clust["value_comp"] = Rcpp::LogicalVector(1, NA_LOGICAL);
|
620
|
+
}
|
621
|
+
break;
|
622
|
+
}
|
623
|
+
|
624
|
+
case InSubset:
|
625
|
+
{
|
626
|
+
if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_cat_cat) {
|
627
|
+
tmp_bool = Rcpp::LogicalVector(model_outputs.all_trees[outl_col][curr_tree].split_subset.size(), false);
|
628
|
+
for (size_t cat = 0; cat < model_outputs.all_trees[outl_col][curr_tree].split_subset.size(); cat++) {
|
629
|
+
if (model_outputs.all_trees[outl_col][curr_tree].split_subset[cat] > 0) {
|
630
|
+
tmp_bool[cat] = true;
|
631
|
+
}
|
632
|
+
}
|
633
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
634
|
+
[arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
|
635
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
636
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num][tmp_bool]);
|
637
|
+
} else {
|
638
|
+
cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
|
639
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
640
|
+
cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][curr_tree].split_subset[1]);
|
641
|
+
}
|
642
|
+
break;
|
643
|
+
}
|
644
|
+
|
645
|
+
case NotInSubset:
|
646
|
+
{
|
647
|
+
if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_cat_cat) {
|
648
|
+
tmp_bool = Rcpp::LogicalVector(model_outputs.all_trees[outl_col][curr_tree].split_subset.size(), true);
|
649
|
+
for (size_t cat = 0; cat < model_outputs.all_trees[outl_col][curr_tree].split_subset.size(); cat++) {
|
650
|
+
if (model_outputs.all_trees[outl_col][curr_tree].split_subset[cat] > 0) {
|
651
|
+
tmp_bool[cat] = false;
|
652
|
+
}
|
653
|
+
}
|
654
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
655
|
+
[arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
|
656
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
657
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num][tmp_bool]);
|
658
|
+
} else {
|
659
|
+
cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
|
660
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
661
|
+
cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][curr_tree].split_subset[0]);
|
662
|
+
}
|
663
|
+
break;
|
664
|
+
}
|
665
|
+
|
666
|
+
case Equal:
|
667
|
+
{
|
668
|
+
if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_cat_cat) {
|
669
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
670
|
+
[arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
|
671
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
672
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
673
|
+
[model_outputs.all_trees[outl_col][curr_tree].split_lev]);
|
674
|
+
} else {
|
675
|
+
cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
|
676
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
677
|
+
cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][curr_tree].split_lev);
|
678
|
+
}
|
679
|
+
break;
|
680
|
+
}
|
681
|
+
|
682
|
+
case NotEqual:
|
683
|
+
{
|
684
|
+
if (model_outputs.all_trees[outl_col][curr_tree].col_num < ncols_cat_cat) {
|
685
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
686
|
+
[arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
|
687
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("!=");
|
688
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
689
|
+
[model_outputs.all_trees[outl_col][curr_tree].split_lev]);
|
690
|
+
} else {
|
691
|
+
cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]);
|
692
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
693
|
+
cond_clust["value_comp"] = Rcpp::wrap((bool) !model_outputs.all_trees[outl_col][curr_tree].split_lev);
|
694
|
+
/* note: booleans should always get converted to Equal, so this branch is redundant */
|
695
|
+
}
|
696
|
+
break;
|
697
|
+
}
|
698
|
+
|
699
|
+
}
|
700
|
+
break;
|
701
|
+
}
|
702
|
+
|
703
|
+
case Ordinal:
|
704
|
+
{
|
705
|
+
switch(model_outputs.all_trees[outl_col][curr_tree].split_this_branch) {
|
706
|
+
|
707
|
+
case IsNa:
|
708
|
+
{
|
709
|
+
cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
710
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
|
711
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
712
|
+
break;
|
713
|
+
}
|
714
|
+
|
715
|
+
case LessOrEqual:
|
716
|
+
{
|
717
|
+
tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num].size(), false);
|
718
|
+
for (int cat = 0; cat <= model_outputs.all_trees[outl_col][curr_tree].split_lev; cat++) {
|
719
|
+
tmp_bool[cat] = true;
|
720
|
+
}
|
721
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
722
|
+
[arr_ord[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
|
723
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
724
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num][tmp_bool]);
|
725
|
+
break;
|
726
|
+
}
|
727
|
+
|
728
|
+
case Greater:
|
729
|
+
{
|
730
|
+
tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num].size(), true);
|
731
|
+
for (int cat = 0; cat <= model_outputs.all_trees[outl_col][curr_tree].split_lev; cat++) {
|
732
|
+
tmp_bool[cat] = false;
|
733
|
+
}
|
734
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
735
|
+
[arr_ord[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
|
736
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
737
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num][tmp_bool]);
|
738
|
+
break;
|
739
|
+
}
|
740
|
+
|
741
|
+
case Equal:
|
742
|
+
{
|
743
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
744
|
+
[arr_ord[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
|
745
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
746
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
747
|
+
[model_outputs.all_trees[outl_col][curr_tree].split_lev]);
|
748
|
+
break;
|
749
|
+
}
|
750
|
+
|
751
|
+
case NotEqual:
|
752
|
+
{
|
753
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
754
|
+
[arr_ord[row + model_outputs.all_trees[outl_col][curr_tree].col_num * nrows]]);
|
755
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("!=");
|
756
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][curr_tree].col_num]
|
757
|
+
[model_outputs.all_trees[outl_col][curr_tree].split_lev]);
|
758
|
+
break;
|
759
|
+
}
|
760
|
+
|
761
|
+
}
|
762
|
+
break;
|
763
|
+
}
|
764
|
+
|
765
|
+
}
|
766
|
+
}
|
767
|
+
|
768
|
+
/* regular case (no 'follow_all') */
|
769
|
+
else
|
770
|
+
{
|
771
|
+
|
772
|
+
/* add column name and value */
|
773
|
+
switch(model_outputs.all_trees[outl_col][parent_tree].column_type) {
|
774
|
+
case Numeric:
|
775
|
+
{
|
776
|
+
cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_num[model_outputs.all_trees[outl_col][parent_tree].col_num]);
|
777
|
+
/* add decimals if appropriate */
|
778
|
+
if (
|
779
|
+
model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_num_num &&
|
780
|
+
model_outputs.all_trees[outl_col][curr_tree].parent_branch != IsNa
|
781
|
+
)
|
782
|
+
{
|
783
|
+
cond_clust["decimals"] = Rcpp::wrap(model_outputs.min_decimals_col[model_outputs.all_trees[outl_col][parent_tree].col_num]);
|
784
|
+
}
|
785
|
+
break;
|
786
|
+
}
|
787
|
+
|
788
|
+
case Categorical:
|
789
|
+
{
|
790
|
+
cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_cat[model_outputs.all_trees[outl_col][parent_tree].col_num]);
|
791
|
+
break;
|
792
|
+
}
|
793
|
+
|
794
|
+
case Ordinal:
|
795
|
+
{
|
796
|
+
cond_clust["column"] = Rcpp::as<Rcpp::CharacterVector>(colnames_ord[model_outputs.all_trees[outl_col][parent_tree].col_num]);
|
797
|
+
break;
|
798
|
+
}
|
799
|
+
}
|
800
|
+
|
801
|
+
|
802
|
+
/* add conditions from tree */
|
803
|
+
switch(model_outputs.all_trees[outl_col][curr_tree].parent_branch) {
|
804
|
+
|
805
|
+
|
806
|
+
case IsNa:
|
807
|
+
{
|
808
|
+
switch(model_outputs.all_trees[outl_col][parent_tree].column_type) {
|
809
|
+
case Numeric:
|
810
|
+
{
|
811
|
+
cond_clust["value_this"] = Rcpp::wrap(NA_REAL);
|
812
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
|
813
|
+
cond_clust["value_comp"] = Rcpp::wrap(NA_REAL);
|
814
|
+
break;
|
815
|
+
}
|
816
|
+
|
817
|
+
case Categorical:
|
818
|
+
{
|
819
|
+
if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
|
820
|
+
cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
821
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
|
822
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
823
|
+
} else {
|
824
|
+
cond_clust["value_this"] = Rcpp::LogicalVector(1, NA_LOGICAL);
|
825
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
|
826
|
+
cond_clust["value_comp"] = Rcpp::LogicalVector(1, NA_LOGICAL);
|
827
|
+
}
|
828
|
+
break;
|
829
|
+
}
|
830
|
+
|
831
|
+
case Ordinal:
|
832
|
+
{
|
833
|
+
cond_clust["value_this"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
834
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("is NA");
|
835
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(NA_STRING);
|
836
|
+
break;
|
837
|
+
}
|
838
|
+
}
|
839
|
+
break;
|
840
|
+
}
|
841
|
+
|
842
|
+
case LessOrEqual:
|
843
|
+
{
|
844
|
+
if (model_outputs.all_trees[outl_col][parent_tree].column_type == Numeric) {
|
845
|
+
if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_num_num) {
|
846
|
+
cond_clust["value_this"] = Rcpp::wrap(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
|
847
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("<=");
|
848
|
+
cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_trees[outl_col][parent_tree].split_point);
|
849
|
+
} else if (model_outputs.all_trees[outl_col][parent_tree].col_num < (ncols_num_num + ncols_date)) {
|
850
|
+
cond_clust["value_this"] = Rcpp::Date(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]
|
851
|
+
- 1 + min_date[model_outputs.all_trees[outl_col][parent_tree].col_num - ncols_num_num]);
|
852
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("<=");
|
853
|
+
cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_trees[outl_col][parent_tree].split_point
|
854
|
+
- 1 + min_date[model_outputs.all_trees[outl_col][parent_tree].col_num - ncols_num_num]);
|
855
|
+
} else {
|
856
|
+
cond_clust["value_this"] = Rcpp::Datetime(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]
|
857
|
+
- 1 + min_ts[model_outputs.all_trees[outl_col][parent_tree].col_num
|
858
|
+
- ncols_num_num - ncols_date]);
|
859
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("<=");
|
860
|
+
cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_trees[outl_col][parent_tree].split_point
|
861
|
+
- 1 + min_ts[model_outputs.all_trees[outl_col][parent_tree].col_num
|
862
|
+
- ncols_num_num - ncols_date]);
|
863
|
+
}
|
864
|
+
} else {
|
865
|
+
tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num].size(), false);
|
866
|
+
for (int cat = 0; cat <= model_outputs.all_trees[outl_col][parent_tree].split_lev; cat++) tmp_bool[cat] = true;
|
867
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
868
|
+
[arr_ord[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
|
869
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
870
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num][tmp_bool]);
|
871
|
+
}
|
872
|
+
break;
|
873
|
+
}
|
874
|
+
|
875
|
+
case Greater:
|
876
|
+
{
|
877
|
+
if (model_outputs.all_trees[outl_col][parent_tree].column_type == Numeric) {
|
878
|
+
if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_num_num) {
|
879
|
+
cond_clust["value_this"] = Rcpp::wrap(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
|
880
|
+
cond_clust["comparison"] = Rcpp::CharacterVector(">");
|
881
|
+
cond_clust["value_comp"] = Rcpp::wrap(model_outputs.all_trees[outl_col][parent_tree].split_point);
|
882
|
+
} else if (model_outputs.all_trees[outl_col][parent_tree].col_num < (ncols_num_num + ncols_date)) {
|
883
|
+
cond_clust["value_this"] = Rcpp::Date(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]
|
884
|
+
- 1 + min_date[model_outputs.all_trees[outl_col][parent_tree].col_num - ncols_num_num]);
|
885
|
+
cond_clust["comparison"] = Rcpp::CharacterVector(">");
|
886
|
+
cond_clust["value_comp"] = Rcpp::Date(model_outputs.all_trees[outl_col][parent_tree].split_point
|
887
|
+
- 1 + min_date[model_outputs.all_trees[outl_col][parent_tree].col_num - ncols_num_num]);
|
888
|
+
} else {
|
889
|
+
cond_clust["value_this"] = Rcpp::Datetime(arr_num[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]
|
890
|
+
- 1 + min_ts[model_outputs.all_trees[outl_col][parent_tree].col_num
|
891
|
+
- ncols_num_num - ncols_date]);
|
892
|
+
cond_clust["comparison"] = Rcpp::CharacterVector(">");
|
893
|
+
cond_clust["value_comp"] = Rcpp::Datetime(model_outputs.all_trees[outl_col][parent_tree].split_point
|
894
|
+
- 1 + min_ts[model_outputs.all_trees[outl_col][parent_tree].col_num
|
895
|
+
- ncols_num_num - ncols_date]);
|
896
|
+
}
|
897
|
+
} else {
|
898
|
+
tmp_bool = Rcpp::LogicalVector(ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num].size(), true);
|
899
|
+
for (int cat = 0; cat <= model_outputs.all_trees[outl_col][parent_tree].split_lev; cat++) tmp_bool[cat] = false;
|
900
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
901
|
+
[arr_ord[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
|
902
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
903
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num][tmp_bool]);
|
904
|
+
}
|
905
|
+
break;
|
906
|
+
}
|
907
|
+
|
908
|
+
case InSubset:
|
909
|
+
{
|
910
|
+
if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
|
911
|
+
tmp_bool = Rcpp::LogicalVector(cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num].size(), false);
|
912
|
+
for (size_t cat = 0; cat < model_outputs.all_trees[outl_col][parent_tree].split_subset.size(); cat++) {
|
913
|
+
if (model_outputs.all_trees[outl_col][parent_tree].split_subset[cat] > 0) {
|
914
|
+
tmp_bool[cat] = true;
|
915
|
+
}
|
916
|
+
}
|
917
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
918
|
+
[arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
|
919
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
920
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num][tmp_bool]);
|
921
|
+
} else {
|
922
|
+
cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
|
923
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
924
|
+
cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][parent_tree].split_subset[1]);
|
925
|
+
}
|
926
|
+
break;
|
927
|
+
}
|
928
|
+
|
929
|
+
case NotInSubset:
|
930
|
+
{
|
931
|
+
if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
|
932
|
+
tmp_bool = Rcpp::LogicalVector(cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num].size(), false);
|
933
|
+
for (size_t cat = 0; cat < model_outputs.all_trees[outl_col][parent_tree].split_subset.size(); cat++) {
|
934
|
+
if (model_outputs.all_trees[outl_col][parent_tree].split_subset[cat] == 0) {
|
935
|
+
tmp_bool[cat] = true;
|
936
|
+
}
|
937
|
+
}
|
938
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
939
|
+
[arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
|
940
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("in");
|
941
|
+
cond_clust["value_comp"] = Rcpp::as<Rcpp::CharacterVector>(cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num][tmp_bool]);
|
942
|
+
} else {
|
943
|
+
cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
|
944
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
945
|
+
cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][parent_tree].split_subset[0]);
|
946
|
+
}
|
947
|
+
break;
|
948
|
+
}
|
949
|
+
|
950
|
+
case Equal:
|
951
|
+
{
|
952
|
+
if (model_outputs.all_trees[outl_col][parent_tree].column_type == Categorical) {
|
953
|
+
if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
|
954
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
955
|
+
[arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
|
956
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
957
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
958
|
+
[model_outputs.all_trees[outl_col][parent_tree].split_lev]);
|
959
|
+
} else {
|
960
|
+
cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
|
961
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
962
|
+
cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][parent_tree].split_subset[1]);
|
963
|
+
}
|
964
|
+
} else {
|
965
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
966
|
+
[arr_ord[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
|
967
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
968
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
969
|
+
[model_outputs.all_trees[outl_col][parent_tree].split_lev]);
|
970
|
+
}
|
971
|
+
break;
|
972
|
+
}
|
973
|
+
|
974
|
+
case NotEqual:
|
975
|
+
{
|
976
|
+
if (model_outputs.all_trees[outl_col][parent_tree].column_type == Categorical) {
|
977
|
+
if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
|
978
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
979
|
+
[arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
|
980
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("!=");
|
981
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
982
|
+
[model_outputs.all_trees[outl_col][parent_tree].split_lev]);
|
983
|
+
} else {
|
984
|
+
cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
|
985
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
986
|
+
cond_clust["value_comp"] = Rcpp::wrap((bool) model_outputs.all_trees[outl_col][parent_tree].split_subset[0]);
|
987
|
+
}
|
988
|
+
} else {
|
989
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
990
|
+
[arr_ord[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
|
991
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("!=");
|
992
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, ord_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
993
|
+
[model_outputs.all_trees[outl_col][parent_tree].split_lev]);
|
994
|
+
}
|
995
|
+
break;
|
996
|
+
}
|
997
|
+
|
998
|
+
case SingleCateg:
|
999
|
+
{
|
1000
|
+
if (model_outputs.all_trees[outl_col][parent_tree].col_num < ncols_cat_cat) {
|
1001
|
+
cond_clust["value_this"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
1002
|
+
[arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
|
1003
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
1004
|
+
cond_clust["value_comp"] = Rcpp::CharacterVector(1, cat_levels[model_outputs.all_trees[outl_col][parent_tree].col_num]
|
1005
|
+
[arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]]);
|
1006
|
+
} else {
|
1007
|
+
cond_clust["value_this"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
|
1008
|
+
cond_clust["comparison"] = Rcpp::CharacterVector("=");
|
1009
|
+
cond_clust["value_comp"] = Rcpp::wrap((bool) arr_cat[row + model_outputs.all_trees[outl_col][parent_tree].col_num * nrows]);
|
1010
|
+
}
|
1011
|
+
break;
|
1012
|
+
}
|
1013
|
+
|
1014
|
+
}
|
1015
|
+
|
1016
|
+
|
1017
|
+
}
|
1018
|
+
|
1019
|
+
/* https://github.com/RcppCore/Rcpp/issues/979 */
|
1020
|
+
/* this extra comment works around the Rcpp issue triggered by the slashes in the comment above */
|
1021
|
+
temp_list = lst_cond[row];
|
1022
|
+
temp_list.push_back(Rcpp::clone(cond_clust));
|
1023
|
+
lst_cond[row] = temp_list;
|
1024
|
+
curr_tree = parent_tree;
|
1025
|
+
}
|
1026
|
+
|
1027
|
+
}
|
1028
|
+
|
1029
|
+
}
|
1030
|
+
}
|
1031
|
+
|
1032
|
+
outp["suspicous_value"] = outlier_val;
|
1033
|
+
outp["group_statistics"] = lst_stats;
|
1034
|
+
outp["conditions"] = lst_cond;
|
1035
|
+
outp["tree_depth"] = tree_depth;
|
1036
|
+
outp["uses_NA_branch"] = has_na_col;
|
1037
|
+
outp["outlier_score"] = outlier_score;
|
1038
|
+
return outp;
|
1039
|
+
}
|
1040
|
+
|
1041
|
+
/* for extracting info about flaggable outliers */
|
1042
|
+
Rcpp::List extract_outl_bounds(ModelOutputs &model_outputs,
|
1043
|
+
Rcpp::ListOf<Rcpp::StringVector> cat_levels,
|
1044
|
+
Rcpp::ListOf<Rcpp::StringVector> ord_levels,
|
1045
|
+
Rcpp::NumericVector min_date,
|
1046
|
+
Rcpp::NumericVector min_ts)
|
1047
|
+
{
|
1048
|
+
size_t ncols_num = model_outputs.ncols_numeric;
|
1049
|
+
size_t ncols_cat = model_outputs.ncols_categ;
|
1050
|
+
size_t ncols_ord = model_outputs.ncols_ord;
|
1051
|
+
size_t col_lim_num = model_outputs.ncols_numeric - min_date.size() - min_ts.size();
|
1052
|
+
size_t col_lim_date = model_outputs.ncols_numeric - min_ts.size();
|
1053
|
+
size_t ncols_cat_cat = cat_levels.size();
|
1054
|
+
size_t tot_cols = ncols_num + ncols_cat + ncols_ord;
|
1055
|
+
Rcpp::LogicalVector temp_bool;
|
1056
|
+
Rcpp::LogicalVector bool_choice(2, false); bool_choice[1] = true;
|
1057
|
+
Rcpp::List outp(tot_cols);
|
1058
|
+
|
1059
|
+
for (size_t cl = 0; cl < tot_cols; cl++) {
|
1060
|
+
if (cl < col_lim_num) {
|
1061
|
+
/* numeric */
|
1062
|
+
outp[cl] = Rcpp::List::create(Rcpp::_["lb"] = Rcpp::wrap(model_outputs.min_outlier_any_cl[cl]),
|
1063
|
+
Rcpp::_["ub"] = Rcpp::wrap(model_outputs.max_outlier_any_cl[cl]));
|
1064
|
+
} else if (cl < col_lim_date) {
|
1065
|
+
/* date */
|
1066
|
+
outp[cl] = Rcpp::List::create(
|
1067
|
+
Rcpp::_["lb"] = Rcpp::Date(model_outputs.min_outlier_any_cl[cl] - 1 + min_date[cl - col_lim_num]),
|
1068
|
+
Rcpp::_["ub"] = Rcpp::Date(model_outputs.max_outlier_any_cl[cl] - 1 + min_date[cl - col_lim_num])
|
1069
|
+
);
|
1070
|
+
} else if (cl < ncols_num) {
|
1071
|
+
/* timestamp */
|
1072
|
+
outp[cl] = Rcpp::List::create(
|
1073
|
+
Rcpp::_["lb"] = Rcpp::Datetime(model_outputs.min_outlier_any_cl[cl] - 1 + min_ts[cl - col_lim_date]),
|
1074
|
+
Rcpp::_["ub"] = Rcpp::Datetime(model_outputs.max_outlier_any_cl[cl] - 1 + min_ts[cl - col_lim_date])
|
1075
|
+
);
|
1076
|
+
} else if (cl < (ncols_num + ncols_cat_cat)) {
|
1077
|
+
/* categorical */
|
1078
|
+
if (model_outputs.cat_outlier_any_cl[cl - ncols_num].size()) {
|
1079
|
+
temp_bool = Rcpp::wrap(model_outputs.cat_outlier_any_cl[cl - ncols_num]);
|
1080
|
+
outp[cl] = cat_levels[cl - ncols_num][temp_bool];
|
1081
|
+
} else {
|
1082
|
+
outp[cl] = Rcpp::StringVector();
|
1083
|
+
}
|
1084
|
+
} else if (cl < (ncols_num + ncols_cat)) {
|
1085
|
+
/* boolean */
|
1086
|
+
if (model_outputs.cat_outlier_any_cl[cl - ncols_num].size()) {
|
1087
|
+
temp_bool = Rcpp::wrap(model_outputs.cat_outlier_any_cl[cl - ncols_num]);
|
1088
|
+
outp[cl] = bool_choice[temp_bool];
|
1089
|
+
} else {
|
1090
|
+
outp[cl] = Rcpp::LogicalVector();
|
1091
|
+
}
|
1092
|
+
} else {
|
1093
|
+
/* ordinal */
|
1094
|
+
if (model_outputs.cat_outlier_any_cl[cl - ncols_num].size()) {
|
1095
|
+
temp_bool = Rcpp::wrap(model_outputs.cat_outlier_any_cl[cl - ncols_num]);
|
1096
|
+
outp[cl] = ord_levels[cl - ncols_num - ncols_cat][temp_bool];
|
1097
|
+
} else {
|
1098
|
+
outp[cl] = Rcpp::StringVector();
|
1099
|
+
}
|
1100
|
+
}
|
1101
|
+
}
|
1102
|
+
return outp;
|
1103
|
+
}
|
1104
|
+
|
1105
|
+
|
1106
|
+
/* external functions for fitting the model and predicting outliers */
|
1107
|
+
// [[Rcpp::export]]
|
1108
|
+
Rcpp::List fit_OutlierTree(Rcpp::NumericVector arr_num, size_t ncols_numeric,
|
1109
|
+
Rcpp::IntegerVector arr_cat, size_t ncols_categ, Rcpp::IntegerVector ncat,
|
1110
|
+
Rcpp::IntegerVector arr_ord, size_t ncols_ord, Rcpp::IntegerVector ncat_ord,
|
1111
|
+
size_t nrows, Rcpp::LogicalVector cols_ignore_r, int nthreads,
|
1112
|
+
bool categ_as_bin, bool ord_as_bin, bool cat_bruteforce_subset, bool categ_from_maj, bool take_mid,
|
1113
|
+
size_t max_depth, double max_perc_outliers, size_t min_size_numeric, size_t min_size_categ,
|
1114
|
+
double min_gain, bool follow_all, bool gain_as_pct, double z_norm, double z_outlier,
|
1115
|
+
bool return_outliers,
|
1116
|
+
Rcpp::ListOf<Rcpp::StringVector> cat_levels,
|
1117
|
+
Rcpp::ListOf<Rcpp::StringVector> ord_levels,
|
1118
|
+
Rcpp::StringVector colnames_num,
|
1119
|
+
Rcpp::StringVector colnames_cat,
|
1120
|
+
Rcpp::StringVector colnames_ord,
|
1121
|
+
Rcpp::NumericVector min_date,
|
1122
|
+
Rcpp::NumericVector min_ts)
|
1123
|
+
{
|
1124
|
+
bool found_outliers;
|
1125
|
+
Rcpp::List outp;
|
1126
|
+
size_t tot_cols = ncols_numeric + ncols_categ + ncols_ord;
|
1127
|
+
std::vector<char> cols_ignore;
|
1128
|
+
char *cols_ignore_ptr = NULL;
|
1129
|
+
if (cols_ignore_r.size() > 0) {
|
1130
|
+
cols_ignore.resize(tot_cols, false);
|
1131
|
+
for (size_t cl = 0; cl < tot_cols; cl++) cols_ignore[cl] = (bool) cols_ignore_r[cl];
|
1132
|
+
cols_ignore_ptr = &cols_ignore[0];
|
1133
|
+
}
|
1134
|
+
std::vector<double> Xcpp;
|
1135
|
+
double *arr_num_C = set_R_nan_as_C_nan(&arr_num[0], Xcpp, arr_num.size(), nthreads);
|
1136
|
+
|
1137
|
+
std::unique_ptr<ModelOutputs> model_outputs = std::unique_ptr<ModelOutputs>(new ModelOutputs());
|
1138
|
+
found_outliers = fit_outliers_models(*model_outputs,
|
1139
|
+
arr_num_C, ncols_numeric,
|
1140
|
+
&arr_cat[0], ncols_categ, &ncat[0],
|
1141
|
+
&arr_ord[0], ncols_ord, &ncat_ord[0],
|
1142
|
+
nrows, cols_ignore_ptr, nthreads,
|
1143
|
+
categ_as_bin, ord_as_bin, cat_bruteforce_subset, categ_from_maj, take_mid,
|
1144
|
+
max_depth, max_perc_outliers, min_size_numeric, min_size_categ,
|
1145
|
+
min_gain, gain_as_pct, follow_all, z_norm, z_outlier);
|
1146
|
+
|
1147
|
+
outp["bounds"] = extract_outl_bounds(*model_outputs,
|
1148
|
+
cat_levels,
|
1149
|
+
ord_levels,
|
1150
|
+
min_date,
|
1151
|
+
min_ts);
|
1152
|
+
|
1153
|
+
outp["serialized_obj"] = serialize_OutlierTree(model_outputs.get());
|
1154
|
+
if (return_outliers) {
|
1155
|
+
outp["outliers_info"] = describe_outliers(*model_outputs,
|
1156
|
+
arr_num_C,
|
1157
|
+
&arr_cat[0],
|
1158
|
+
&arr_ord[0],
|
1159
|
+
cat_levels,
|
1160
|
+
ord_levels,
|
1161
|
+
colnames_num,
|
1162
|
+
colnames_cat,
|
1163
|
+
colnames_ord,
|
1164
|
+
min_date,
|
1165
|
+
min_ts);
|
1166
|
+
}
|
1167
|
+
/* add number of trees and clusters */
|
1168
|
+
size_t ntrees = 0, nclust = 0;
|
1169
|
+
for (size_t col = 0; col < model_outputs->all_trees.size(); col++) {
|
1170
|
+
ntrees += model_outputs->all_trees[col].size();
|
1171
|
+
nclust += model_outputs->all_clusters[col].size();
|
1172
|
+
}
|
1173
|
+
outp["ntrees"] = Rcpp::wrap((int) ntrees);
|
1174
|
+
outp["nclust"] = Rcpp::wrap((int) nclust);
|
1175
|
+
outp["found_outliers"] = Rcpp::wrap(found_outliers);
|
1176
|
+
|
1177
|
+
forget_row_outputs(*model_outputs);
|
1178
|
+
outp["ptr_model"] = Rcpp::XPtr<ModelOutputs>(model_outputs.release(), true);
|
1179
|
+
return outp;
|
1180
|
+
}
|
1181
|
+
|
1182
|
+
// [[Rcpp::export]]
|
1183
|
+
Rcpp::List predict_OutlierTree(SEXP ptr_model, size_t nrows, int nthreads,
|
1184
|
+
Rcpp::NumericVector arr_num, Rcpp::IntegerVector arr_cat, Rcpp::IntegerVector arr_ord,
|
1185
|
+
Rcpp::ListOf<Rcpp::StringVector> cat_levels,
|
1186
|
+
Rcpp::ListOf<Rcpp::StringVector> ord_levels,
|
1187
|
+
Rcpp::StringVector colnames_num,
|
1188
|
+
Rcpp::StringVector colnames_cat,
|
1189
|
+
Rcpp::StringVector colnames_ord,
|
1190
|
+
Rcpp::NumericVector min_date,
|
1191
|
+
Rcpp::NumericVector min_ts)
|
1192
|
+
{
|
1193
|
+
std::vector<double> Xcpp;
|
1194
|
+
double *arr_num_C = set_R_nan_as_C_nan(&arr_num[0], Xcpp, arr_num.size(), nthreads);
|
1195
|
+
|
1196
|
+
ModelOutputs *model_outputs = static_cast<ModelOutputs*>(R_ExternalPtrAddr(ptr_model));
|
1197
|
+
bool found_outliers = find_new_outliers(arr_num_C, &arr_cat[0], &arr_ord[0],
|
1198
|
+
nrows, nthreads, *model_outputs);
|
1199
|
+
Rcpp::List outp = describe_outliers(*model_outputs,
|
1200
|
+
arr_num_C,
|
1201
|
+
&arr_cat[0],
|
1202
|
+
&arr_ord[0],
|
1203
|
+
cat_levels,
|
1204
|
+
ord_levels,
|
1205
|
+
colnames_num,
|
1206
|
+
colnames_cat,
|
1207
|
+
colnames_ord,
|
1208
|
+
min_date,
|
1209
|
+
min_ts);
|
1210
|
+
outp["found_outliers"] = Rcpp::wrap(found_outliers);
|
1211
|
+
forget_row_outputs(*model_outputs);
|
1212
|
+
return outp;
|
1213
|
+
}
|
1214
|
+
|
1215
|
+
// [[Rcpp::export]]
|
1216
|
+
Rcpp::LogicalVector check_few_values(Rcpp::NumericVector arr_num, size_t nrows, size_t ncols, int nthreads)
|
1217
|
+
{
|
1218
|
+
std::vector<char> too_few_vals(ncols, 0);
|
1219
|
+
check_more_two_values(&arr_num[0], nrows, ncols, nthreads, too_few_vals.data());
|
1220
|
+
Rcpp::LogicalVector outp(ncols);
|
1221
|
+
for (size_t col = 0; col < ncols; col++) {
|
1222
|
+
outp[col] = (bool) too_few_vals[col];
|
1223
|
+
}
|
1224
|
+
return outp;
|
1225
|
+
}
|