db-robust-clust 0.1.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,19 @@
1
+ Copyright (c) 2018 The Python Packaging Authority
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy
4
+ of this software and associated documentation files (the "Software"), to deal
5
+ in the Software without restriction, including without limitation the rights
6
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7
+ copies of the Software, and to permit persons to whom the Software is
8
+ furnished to do so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in all
11
+ copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19
+ SOFTWARE.
@@ -0,0 +1,54 @@
1
+ Metadata-Version: 2.4
2
+ Name: db-robust-clust
3
+ Version: 0.1.3
4
+ Summary: Apply distance-based robust clustering for mixed data.
5
+ Home-page: https://github.com/FabioScielzoOrtiz/db_robust_clust
6
+ Author: Fabio Scielzo Ortiz
7
+ Author-email: fabio.scielzoortiz@gmail.com
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Requires-Python: >=3.7
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Requires-Dist: polars
15
+ Requires-Dist: numpy<=1.26.4
16
+ Requires-Dist: PyDistances
17
+ Requires-Dist: pandas
18
+ Requires-Dist: scikit-learn-extra
19
+ Requires-Dist: tqdm
20
+ Requires-Dist: setuptools
21
+ Requires-Dist: pyarrow
22
+ Requires-Dist: matplotlib
23
+ Requires-Dist: seaborn
24
+ Dynamic: author
25
+ Dynamic: author-email
26
+ Dynamic: classifier
27
+ Dynamic: description
28
+ Dynamic: description-content-type
29
+ Dynamic: home-page
30
+ Dynamic: license-file
31
+ Dynamic: requires-dist
32
+ Dynamic: requires-python
33
+ Dynamic: summary
34
+
35
+ # db-robust-clust
36
+
37
+ In the era of big data, data scientists are trying to solve real-world problems using multivariate
38
+ and heterogeneous datasets, i.e., datasets where for each unit multiple variables of different
39
+ nature are observed. Clustering may be a challenging problem when data are of mixed-type and
40
+ present an underlying correlation structure and outlying units.
41
+
42
+ In the paper ***Grané, A., Scielzo-Ortiz, F.: New distance-based clustering algorithms for large mixed-type data, Submitted to Journal of Classification (2025)***, new efficient robust clustering algorithms able to deal with large mixed-type data are developed and implemented in a **new Python package**, called `db_robust_clust`, hosted on the official PyPI page: https://pypi.org/project/db_robust_clust/.
43
+
44
+ Their performance is analyzed in rather complex mixed-type datasets,
45
+ both synthetic and real, where a wide variety of scenarios is considered regarding
46
+ size, the proportion of outlying units, the underlying correlation structure, and the
47
+ cluster pattern. The simulation study comprises four computational experiments
48
+ conducted on datasets of sizes ranging from 35k to 1M, in which the accuracy and
49
+ efficiency of the new proposals are tested and compared to those of existing
50
+ clustering alternatives. In addition, the goodness and computing time of the methods
51
+ under evaluation are tested on real datasets of varying sizes and patterns. MDS is
52
+ used to visualize clustering results.
53
+
54
+ The package is available on the Python Package Index (PyPI), the standard package repository for the Python programming language: https://pypi.org/project/db_robust_clust/
@@ -0,0 +1,20 @@
1
+ # db-robust-clust
2
+
3
+ In the era of big data, data scientists are trying to solve real-world problems using multivariate
4
+ and heterogeneous datasets, i.e., datasets where for each unit multiple variables of different
5
+ nature are observed. Clustering may be a challenging problem when data are of mixed-type and
6
+ present an underlying correlation structure and outlying units.
7
+
8
+ In the paper ***Grané, A., Scielzo-Ortiz, F.: New distance-based clustering algorithms for large mixed-type data, Submitted to Journal of Classification (2025)***, new efficient robust clustering algorithms able to deal with large mixed-type data are developed and implemented in a **new Python package**, called `db_robust_clust`, hosted on the official PyPI page: https://pypi.org/project/db_robust_clust/.
9
+
10
+ Their performance is analyzed in rather complex mixed-type datasets,
11
+ both synthetic and real, where a wide variety of scenarios is considered regarding
12
+ size, the proportion of outlying units, the underlying correlation structure, and the
13
+ cluster pattern. The simulation study comprises four computational experiments
14
+ conducted on datasets of sizes ranging from 35k to 1M, in which the accuracy and
15
+ efficiency of the new proposals are tested and compared to those of existing
16
+ clustering alternatives. In addition, the goodness and computing time of the methods
17
+ under evaluation are tested on real datasets of varying sizes and patterns. MDS is
18
+ used to visualize clustering results.
19
+
20
+ The package is available on the Python Package Index (PyPI), the standard package repository for the Python programming language: https://pypi.org/project/db_robust_clust/
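+
+ ## Quick start
+
+ Below is a minimal usage sketch (illustrative only, not taken verbatim from the package documentation). It assumes the clustering classes are imported from `db_robust_clust.models` and that the columns of the data matrix are ordered as quantitative, binary, multi-class, with `p1`, `p2` and `p3` giving the number of columns of each type; all parameter values are arbitrary.
+
+ ```python
+ # pip install db-robust-clust
+ import numpy as np
+ from db_robust_clust.models import FastKmedoidsGGower
+
+ # Synthetic mixed-type data: 2 quantitative, 2 binary and 2 multi-class columns.
+ rng = np.random.default_rng(123)
+ X = np.column_stack([
+     rng.normal(size=(500, 2)),          # quantitative
+     rng.integers(0, 2, size=(500, 2)),  # binary
+     rng.integers(0, 4, size=(500, 2)),  # multi-class
+ ])
+
+ # Fast k-medoids based on the Generalized Gower distance.
+ model = FastKmedoidsGGower(n_clusters=3, frac_sample_size=0.2, p1=2, p2=2, p3=2,
+                            d1='robust_mahalanobis', d2='jaccard', d3='matching')
+ model.fit(X)
+ print(model.labels_[:10])
+ ```
+
+ `FoldFastKmedoidsGGower` follows the same interface and additionally partitions the data into K folds, clustering each fold and then clustering the fold medoids.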
File without changes
@@ -0,0 +1,129 @@
1
+ import numpy as np
2
+ import polars as pl
3
+ #from PyMachineLearning.preprocessing import Encoder, Imputer
4
+ #from sklearn.pipeline import Pipeline
5
+ #from sklearn.compose import ColumnTransformer
6
+
7
+ #####################################################################################################################
8
+
9
+ def outlier_contamination(X, col_name, prop_below=0.05, prop_above=None, sigma=2, random_state=123) :
10
+ """
11
+ Contaminates a column of a data matrix with outliers.
12
+
13
+ Parameters (inputs)
14
+ ----------
15
+ X: a pandas data-frame (with a default integer index). It represents a data matrix.
16
+ col_name: the name of the column of `X` to be contaminated.
17
+ prop_below: proportion of outliers generated below the lower bound of that column. Ignored if None.
18
+ prop_above: proportion of outliers generated above the upper bound of that column. Ignored if None.
19
+ sigma: parameter that controls how far below the lower bound (or above the upper bound) the generated outliers can fall.
20
+ random_state: controls the random seed of the random elements.
21
+
22
+ Returns (outputs)
23
+ -------
24
+ X_new: the resulting variable after the outlier contamination of `X`.
25
+ outlier_idx_below: the index of the below outliers.
26
+ outlier_idx_above: the index of the above outliers.
27
+ """
28
+
29
+ X_new = X.copy()
30
+ Q25 = X_new[col_name].quantile(0.25)
31
+ Q75 = X_new[col_name].quantile(0.75)
32
+ IQR = Q75 - Q25
33
+ lower_bound = Q25 - 1.5*IQR
34
+ upper_bound = Q75 + 1.5*IQR
35
+ np.random.seed(random_state)
36
+
37
+ if prop_below is not None and prop_above is None: # contaminate only the lower tail
38
+
39
+ n_outliers_below = int(len(X_new)*prop_below)
40
+ outlier_idx_below = np.random.choice(len(X_new), size=n_outliers_below, replace=False)
41
+ outliers_below = np.random.uniform(lower_bound - sigma*np.abs(lower_bound), lower_bound, size=n_outliers_below)
42
+ X_new.loc[outlier_idx_below, col_name] = outliers_below
43
+ return X_new, outlier_idx_below
44
+
45
+
46
+ elif prop_above is not None and prop_below is None: # contaminate only the upper tail
47
+
48
+ n_outliers_above = int(len(X_new)*prop_above)
49
+ outlier_idx_above = np.random.choice(len(X_new), size=n_outliers_above, replace=False)
50
+ outliers_above = np.random.uniform(upper_bound, upper_bound + sigma*np.abs(upper_bound), size=n_outliers_above)
51
+ X_new.loc[outlier_idx_above, col_name] = outliers_above
52
+ return X_new, outlier_idx_above
53
+
54
+ elif prop_below is not None and prop_above is not None:
55
+
56
+ n_outliers_below = int(len(X_new)*prop_below)
57
+ outlier_idx_below = np.random.choice(len(X_new), size=n_outliers_below, replace=False)
58
+ outliers_below = np.random.uniform(lower_bound - sigma*np.abs(lower_bound), lower_bound, size=n_outliers_below)
59
+ X_new.loc[outlier_idx_below, col_name] = outliers_below
60
+
61
+ n_outliers_above = int(len(X_new)*prop_above)
62
+ outlier_idx_above = np.random.choice(len(X_new), size=n_outliers_above, replace=False)
63
+ outliers_above = np.random.uniform(upper_bound, upper_bound + sigma*np.abs(upper_bound), size=n_outliers_above)
64
+ X_new.loc[outlier_idx_above, col_name] = outliers_above
65
+
66
+ return X_new, outlier_idx_below, outlier_idx_above
67
+
68
+ else:
69
+ raise ValueError('prop_below and prop_above cannot be both None.')
70
+
71
+
72
+ #####################################################################################################################
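+ # Illustrative usage sketch (not part of the original module). It assumes `X` is a pandas
+ # DataFrame with a default RangeIndex, since the positional indices drawn with np.random.choice
+ # are assigned through `.loc`:
+ #
+ #   import pandas as pd
+ #   df = pd.DataFrame({'income': np.random.normal(2000, 300, size=1000)})
+ #   # contaminate 5% of the rows with values below Q1 - 1.5*IQR
+ #   df_new, idx_below = outlier_contamination(df, col_name='income', prop_below=0.05)
+ #   # contaminate both tails
+ #   df_new, idx_below, idx_above = outlier_contamination(df, col_name='income', prop_below=0.05, prop_above=0.05)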
73
+
74
+ '''
75
+ def sort_predictors_for_GGower(df, quant_predictors, cat_predictors):
76
+ """
77
+ Given a data-frame, the function returns the names of its categorical variables sorted according to (binary, multi-class)
78
+ and the number of quantitative, binary and multi-class variables.
79
+
80
+ Parameters (inputs)
81
+ ----------
82
+ df: a pandas/polars data-frame. It represents a data matrix.
83
+ quant_predictors: a list with the names of the quantitative variables of `df`.
84
+ cat_predictors: a list with the names of the categorical variables of `df`.
85
+
86
+ Returns (outputs)
87
+ -------
88
+ cat_predictors_sorted: a list with the names of the categorical variables of `df` sorted according to (binary, multi-class).
89
+ p1, p2, p3: the number of quantitative, binary and multi-class variables in `df`, respectively.
90
+ """
91
+
92
+ # Defining the transformers pipeline to impute and codify the predictors that need it.
93
+ quant_pipeline = Pipeline([
94
+ ('imputer', Imputer(method='simple_mean'))
95
+ ])
96
+
97
+ cat_pipeline = Pipeline([
98
+ ('encoder', Encoder(method='ordinal')), # encoding the categorical variables is needed by some imputers
99
+ ('imputer', Imputer(method='simple_most_frequent'))
100
+ ])
101
+
102
+ quant_cat_transformer = ColumnTransformer(transformers=[('quant', quant_pipeline, quant_predictors),
103
+ ('cat', cat_pipeline, cat_predictors)])
104
+
105
+ predictors = quant_predictors + cat_predictors
106
+ if isinstance(df, pl.DataFrame):
107
+ X = df[predictors].to_pandas()
108
+ # The Null values of the Polars columns that are defined as Object type by Pandas are treated as None and not as NaN (which is what we would like)
109
+ # To avoid this behavior the next step is necessary
110
+ X = X.fillna(value=np.nan)
111
+ # First we have to impute missing values so that they are not detected as another unique value
112
+ X = pl.DataFrame(quant_cat_transformer.fit_transform(X))
113
+ X.columns = quant_predictors + cat_predictors
114
+ # Compute number of unique values for each categorical predictor
115
+ n_unique_val = {}
116
+ for col in cat_predictors:
117
+ n_unique_val[col] = len(X[col].unique())
118
+ # Define the list of binary and multi-class predictors based on the number of unique values.
119
+ binary_predictors = [col for col in n_unique_val.keys() if n_unique_val[col] == 2]
120
+ multiclass_predictors = [col for col in n_unique_val.keys() if n_unique_val[col] >= 3]
121
+ # Reorder the list of categorical predictors in a suitable order for Gower Generalized
122
+ cat_predictors_sorted = binary_predictors + multiclass_predictors
123
+ # Getting the number of quant, binary and multi-class predictors
124
+ p1 = len(quant_predictors)
125
+ p2 = len(binary_predictors)
126
+ p3 = len(multiclass_predictors)
127
+
128
+ return cat_predictors_sorted, p1, p2, p3
129
+ '''
@@ -0,0 +1,44 @@
1
+ import numpy as np
2
+ from itertools import permutations
3
+ from sklearn.metrics import accuracy_score
4
+
5
+ #####################################################################################################################
6
+
7
+ def adjusted_score(y_pred, y_true, metric=accuracy_score):
8
+ """
9
+ Computes the adjusted score (accuracy, balanced accuracy, etc.) as the maximum
10
+ score obtained across all possible permutations of the cluster labels (`y_pred`).
11
+
12
+ Parameters
13
+ ----------
14
+ y_pred : numpy.ndarray
15
+ Predicted cluster labels.
16
+ y_true : numpy.ndarray
17
+ True class labels.
18
+ metric : callable, default=accuracy_score
19
+ Function to compute the metric. Must accept (y_true, y_pred) and return a float.
20
+
21
+ Returns
22
+ -------
23
+ adj_score : float
24
+ The best score obtained across all permutations.
25
+ adj_cluster_labels : numpy.ndarray
26
+ The cluster labels permuted according to the best permutation.
27
+ """
28
+
29
+ permutations_list = list(permutations(np.unique(y_pred)))
30
+ scores, permuted_cluster_labels = [], {}
31
+
32
+ for per in permutations_list:
33
+ permutation_dict = dict(zip(np.unique(y_pred), per))
34
+ permuted_cluster_labels[per] = np.array([permutation_dict[x] for x in y_pred])
35
+ scores.append(metric(y_true=y_true, y_pred=permuted_cluster_labels[per]))
36
+
37
+ scores = np.array(scores)
38
+ best_permutation = permutations_list[np.argmax(scores)]
39
+ adj_cluster_labels = permuted_cluster_labels[best_permutation]
40
+ adj_score = np.max(scores)
41
+
42
+ return adj_score, adj_cluster_labels
43
+
44
+ #####################################################################################################################
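+
+ # Illustrative usage sketch (not part of the original module): cluster labels are matched to the
+ # true classes by trying every relabelling of the predicted clusters and keeping the best score.
+ if __name__ == '__main__':
+     y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
+     y_pred = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])  # same partition, different label names
+     adj_acc, adj_labels = adjusted_score(y_pred, y_true, metric=accuracy_score)
+     print(adj_acc)     # 1.0, since the two partitions coincide up to relabelling
+     print(adj_labels)  # [0 0 0 1 1 1 2 2 2]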
@@ -0,0 +1,323 @@
1
+ #####################################################################################################################
2
+ import polars as pl
3
+ import pandas as pd
4
+ import numpy as np
5
+ from sklearn_extra.cluster import KMedoids
6
+ from sklearn.model_selection import KFold
7
+ from PyDistances.mixed import FastGGowerDistMatrix, GGowerDist
8
+ from tqdm import tqdm
9
+
10
+ #####################################################################################################################
11
+
12
+ def concat_X_y(X, y, y_type, p1, p2, p3):
13
+ """
14
+ Concatenates `X` and `y` in a suitable way to be used by the class `FastKmedoidsGGower` when applied to 'supervised' clustering.
15
+
16
+ Parameters (inputs)
17
+ ----------
18
+ X: a numpy array. It represents a predictors matrix.
19
+ y: a numpy array. It represents a response/target variable.
20
+ y_type: the type of response variable. Must be in ['quantitative', 'binary', 'multiclass'].
21
+ p1, p2, p3: number of quantitative, binary and multi-class predictors in `X`.
22
+
23
+ Returns (outputs)
24
+ -------
25
+ X_y: the result of concatenating `X` and `y` in the proper way to be used in `FastKmedoidsGGower`.
26
+ p1, p2, p3: the updated number of quantitative, binary and multi-class predictors in `X_y`.
27
+ y_idx: the column index in which `y` is located in `X_y`.
28
+ """
29
+
30
+ if y_type == 'binary':
31
+ X_y = np.column_stack((X[:,0:p1], y, X[:,p1:])) # insert y right after the quantitative columns, keeping all binary and multi-class columns
32
+ p2 = p2 + 1 # updating p2 since now X contains y and it is binary.
33
+ y_idx = p1
34
+ elif y_type == 'multiclass':
35
+ X_y = np.column_stack((X[:,0:(p1+p2)], y, X[:,(p1+p2):])) # insert y right after the binary columns, keeping all multi-class columns
36
+ p3 = p3 + 1 # updating p3 since now X contains y and it is multiclass.
37
+ y_idx = p1 + p2
38
+ elif y_type == 'quantitative':
39
+ X_y = np.column_stack((y, X))
40
+ p1 = p1 + 1 # updating p1 since now X contains y and it is quant.
41
+ y_idx = 0
42
+ else:
43
+ raise ValueError("Invalid `y` type")
44
+
45
+ return X_y, p1, p2, p3, y_idx
46
+
47
+ #####################################################################################################################
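+ # Illustrative example of the layout produced by concat_X_y (not part of the original module).
+ # With X ordered as [quantitative | binary | multi-class], p1=2, p2=1, p3=1 and a binary y,
+ #   X_y, p1, p2, p3, y_idx = concat_X_y(X, y, y_type='binary', p1=2, p2=1, p3=1)
+ # returns X_y with columns [quant, quant, y, binary, multi-class], p2 updated to 2 and y_idx=2,
+ # so y is treated as one more binary variable by the Generalized Gower distance.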
48
+
49
+ def get_idx_obs(fold_key, medoid_key, idx_fold, labels_fold):
50
+ # Idx of the observations of fold_key associated to the medoid_key of that fold
51
+ return idx_fold[fold_key][np.where(labels_fold[fold_key] == medoid_key)[0]]
52
+
53
+ #####################################################################################################################
54
+
55
+ class FastKmedoidsGGower :
56
+ """
57
+ Implements the Fast-K-medoids algorithm based on the Generalized Gower distance.
58
+ """
59
+
60
+ def __init__(self, n_clusters, method='pam', init='heuristic', max_iter=100, random_state=123,
61
+ frac_sample_size=0.1, p1=None, p2=None, p3=None, d1='robust_mahalanobis', d2='jaccard', d3='matching',
62
+ robust_method='trimmed', alpha=0.05, epsilon=0.05, n_iters=20, q=1,
63
+ fast_VG=False, VG_sample_size=1000, VG_n_samples=5, y_type=None) :
64
+ """
65
+ Constructor method.
66
+
67
+ Parameters:
68
+ n_clusters: the number of clusters.
69
+ method: the k-medoids clustering method. Must be in ['pam', 'alternate']. PAM is the classic one, more accurate but slower.
70
+ init: the k-medoids initialization method. Must be in ['heuristic', 'random']. Heuristic is the classic one, smarter but slower.
71
+ max_iter: the maximum number of iterations run by k-medoids.
72
+ frac_sample_size: the sample size in proportional terms.
73
+ p1, p2, p3: number of quantitative, binary and multi-class variables in the considered data matrix, respectively. Must be a non negative integer.
74
+ d1: name of the distance to be computed for quantitative variables. Must be a string in ['euclidean', 'minkowski', 'canberra', 'mahalanobis', 'robust_mahalanobis'].
75
+ d2: name of the distance to be computed for binary variables. Must be a string in ['sokal', 'jaccard'].
76
+ d3: name of the distance to be computed for multi-class variables. Must be a string in ['matching'].
77
+ q: the parameter that defines the Minkowski distance. Must be a positive integer.
78
+ robust_method: the method to be used for computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
79
+ alpha: a real number in [0,1] that is used if `robust_method` is 'trimmed' or 'winsorized'. Only needed when d1 = 'robust_mahalanobis'.
80
+ epsilon: parameter used by the Delvin algorithm that is used when computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
81
+ n_iters: maximum number of iterations used by the Delvin algorithm. Only needed when d1 = 'robust_mahalanobis'.
82
+ fast_VG: whether the geometric variability estimation will be full (False) or fast (True).
83
+ VG_sample_size: sample size to be used to make the estimation of the geometric variability.
84
+ VG_n_samples: number of samples to be used to make the estimation of the geometric variability.
85
+ random_state: the random seed used for the (random) sample elements.
86
+ y_type: the type of response variable. Must be in ['quantitative', 'binary', 'multiclass'].
87
+ """
88
+ self.n_clusters = n_clusters; self.method = method; self.init = init; self.max_iter = max_iter; self.random_state = random_state
89
+ self.frac_sample_size = frac_sample_size; self.p1 = p1; self.p2 = p2; self.p3 = p3; self.d1 = d1; self.d2 = d2; self.d3 = d3;
90
+ self.robust_method = robust_method; self.alpha = alpha; self.epsilon = epsilon; self.n_iters = n_iters; self.fast_VG = fast_VG;
91
+ self.VG_sample_size = VG_sample_size; self.VG_n_samples = VG_n_samples; self.q = q ; self.y_type = y_type
92
+ self.kmedoids = KMedoids(n_clusters=n_clusters, metric='precomputed', method=method, init=init, max_iter=max_iter, random_state=random_state)
93
+
94
+ def fit(self, X, y=None, weights=None):
95
+ """
96
+ Fit method: fitting the fast k-medoids algorithm to `X` (and `y` if needed).
97
+
98
+ Parameters:
99
+ X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
100
+ y: a pandas/polars series or a numpy array. Represents a response variable. Is not required.
101
+ weights: the sample weights. Only used if provided and d1 = 'robust_mahalanobis'.
102
+ """
103
+ if isinstance(X, (pd.DataFrame, pl.DataFrame)):
104
+ X = X.to_numpy()
105
+ if isinstance(y, (pd.Series, pl.Series)):
106
+ y = y.to_numpy()
107
+
108
+ self.p1_init = self.p1 ; self.p2_init = self.p2 ; self.p3_init = self.p3 # p1, p2 and p3 when X doesn't contain y. These original p's are needed for the predict method, since what is predicted is X without y.
109
+
110
+ if y is not None:
111
+ X, self.p1, self.p2, self.p3, self.y_idx = concat_X_y(X=X, y=y, y_type=self.y_type, p1=self.p1, p2=self.p2, p3=self.p3)
112
+
113
+ fastGG = FastGGowerDistMatrix(frac_sample_size=self.frac_sample_size, random_state=self.random_state, p1=self.p1, p2=self.p2, p3=self.p3,
114
+ d1=self.d1, d2=self.d2, d3=self.d3, robust_method=self.robust_method, alpha=self.alpha, epsilon=self.epsilon,
115
+ n_iters=self.n_iters, fast_VG=self.fast_VG, VG_sample_size=self.VG_sample_size, VG_n_samples=self.VG_n_samples,
116
+ q=self.q, weights=weights)
117
+
118
+ fastGG.compute(X)
119
+
120
+ self.D_GG = fastGG.D_GGower
121
+ self.X_sample = fastGG.X_sample
122
+ self.X_out_sample = fastGG.X_out_sample
123
+ self.sample_index = fastGG.sample_index
124
+ self.out_sample_index = fastGG.out_sample_index
125
+
126
+ self.kmedoids.fit(self.D_GG)
127
+ sample_labels_dict = {idx : self.kmedoids.labels_[i] for i, idx in enumerate(self.sample_index)} # keys: observation indices. values: cluster labels. Contains only the sample observation indices.
128
+ self.sample_labels = np.array(list(sample_labels_dict.values()))
129
+
130
+ self.medoids_ = {}
131
+ medoids_idx = [int(x) for x in self.kmedoids.medoid_indices_]
132
+ for j, idx in enumerate(medoids_idx):
133
+ self.medoids_[j] = self.X_sample[idx,:]
134
+
135
+ sample_weights = weights[self.sample_index] if weights is not None else None
136
+
137
+ self.distGG = GGowerDist(p1=self.p1, p2=self.p2, p3=self.p3, d1=self.d1, d2=self.d2, d3=self.d3, q=self.q,
138
+ robust_method=self.robust_method, alpha=self.alpha, epsilon=self.epsilon,
139
+ n_iters=self.n_iters, VG_sample_size=self.VG_sample_size, VG_n_samples=self.VG_n_samples,
140
+ random_state=self.random_state, weights=sample_weights)
141
+
142
+ if sample_weights is None:
143
+ self.distGG.fit(X)
144
+ else: # if there are weights we cannot use X when n (the number of rows) is too large, since Xw is n x n and cannot be computed in that case. To avoid this potential problem, instead of using X to fit distGG we use the much smaller sample X_sample.
145
+ self.distGG.fit(self.X_sample)
146
+ # We could reuse the VG's computed with fastGG in distGG, rather than making this second estimation. But the current estimation is very fast (less than 1 second) and equally accurate, so using one or the other leads to the same results.
147
+
148
+ dist_out_sample_medoids = {idx : [] for idx in self.out_sample_index} # keys: out sample idx, values: distance with respect each medoid.
149
+ for i, idx in enumerate(self.out_sample_index) :
150
+ for j in range(0, self.n_clusters) :
151
+ dist_out_sample_medoids[idx].append(self.distGG.compute(xi=self.X_out_sample[i,:], xr=self.medoids_[j]))
152
+
153
+ out_sample_labels_dict = {idx : np.argmin(dist_out_sample_medoids[idx]) for idx in self.out_sample_index} # keys: observation indices. Values: cluster labels. Contains only the out of sample observation indices
154
+ self.out_sample_labels = np.array(list(out_sample_labels_dict.values()))
155
+ sample_labels_dict.update(out_sample_labels_dict) # Now sample_label_dict contains the labels for each observation index, but without order.
156
+ labels_dict = {idx : sample_labels_dict[idx] for idx in range(0,len(X))} # keys: observation indices. Values: cluster labels. Contains all the observation indices
157
+ self.labels_ = np.array(list(labels_dict.values()))
158
+
159
+ self.X = X
160
+ self.y = y
161
+
162
+ def predict(self, X):
163
+ """
164
+ Predict method: predicting clusters for `X` observation by assigning them to their nearest cluster (medoid) according to Generalized Gower distance.
165
+
166
+ Parameters:
167
+ X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
168
+ """
169
+
170
+ if self.y is not None: # remove y from the medoids, since in predict method X doesn't contain y.
171
+ for j in range(self.n_clusters):
172
+ self.medoids_[j] = np.delete(self.medoids_[j], self.y_idx)
173
+
174
+ distGG = GGowerDist(p1=self.p1_init, p2=self.p2_init, p3=self.p3_init, d1=self.d1, d2=self.d2, d3=self.d3, q=self.q,
175
+ robust_method=self.robust_method, alpha=self.alpha, epsilon=self.epsilon, n_iters=self.n_iters,
176
+ VG_sample_size=self.VG_sample_size, VG_n_samples=self.VG_n_samples, random_state=self.random_state)
177
+
178
+ distGG.fit(self.X) # self.X is X used during fit method, not necessarily the X parameter passed to the predict method.
179
+
180
+ predicted_clusters = []
181
+ for i in range(0, len(X)):
182
+ dist_xi_medoids = [distGG.compute(xi=X[i,:], xr=self.medoids_[j]) for j in range(self.n_clusters)]
183
+ predicted_clusters.append(np.argmin(dist_xi_medoids))
184
+
185
+ return predicted_clusters
186
+
187
+ #####################################################################################################################
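+ # Illustrative usage sketch for FastKmedoidsGGower (not part of the original module; parameter
+ # values are arbitrary). X must have its columns ordered as [quantitative | binary | multi-class]:
+ #
+ #   model = FastKmedoidsGGower(n_clusters=3, frac_sample_size=0.1, p1=5, p2=2, p3=1,
+ #                              d1='robust_mahalanobis', d2='jaccard', d3='matching')
+ #   model.fit(X)              # k-medoids on a random sample; out-of-sample rows go to the nearest medoid
+ #   labels = model.labels_    # cluster label for every row of X
+ #   new_labels = model.predict(X_new)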
188
+
189
+ class FoldFastKmedoidsGGower:
190
+ """
191
+ Implements the K-Fold Fast-K-medoids algorithm based on the Generalized Gower distance.
192
+ """
193
+
194
+ def __init__(self, n_clusters, method='pam', init='heuristic', max_iter=100, random_state=123,
195
+ frac_sample_size=0.1, p1=None, p2=None, p3=None, d1='robust_mahalanobis', d2='jaccard', d3='matching',
196
+ robust_method='trimmed', alpha=0.05, epsilon=0.05, n_iters=20, q=1, fast_VG=False,
197
+ VG_sample_size=1000, VG_n_samples=5, n_splits=5, shuffle=True, kfold_random_state=123, y_type=None) :
198
+ """
199
+ Constructor method.
200
+
201
+ Parameters:
202
+ n_clusters: the number of clusters.
203
+ method: the k-medoids clustering method. Must be in ['pam', 'alternate']. PAM is the classic one, more accurate but slower.
204
+ init: the k-medoids initialization method. Must be in ['heuristic', 'random']. Heuristic is the classic one, smarter but slower.
205
+ max_iter: the maximum number of iterations run by k-medoids.
206
+ frac_sample_size: the sample size in proportional terms.
207
+ p1, p2, p3: number of quantitative, binary and multi-class variables in the considered data matrix, respectively. Must be a non negative integer.
208
+ d1: name of the distance to be computed for quantitative variables. Must be a string in ['euclidean', 'minkowski', 'canberra', 'mahalanobis', 'robust_mahalanobis'].
209
+ d2: name of the distance to be computed for binary variables. Must be a string in ['sokal', 'jaccard'].
210
+ d3: name of the distance to be computed for multi-class variables. Must be a string in ['matching'].
211
+ q: the parameter that defines the Minkowski distance. Must be a positive integer.
212
+ robust_method: the method to be used for computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
213
+ alpha: a real number in [0,1] that is used if `robust_method` is 'trimmed' or 'winsorized'. Only needed when d1 = 'robust_mahalanobis'.
214
+ epsilon: parameter used by the Delvin algorithm that is used when computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
215
+ n_iters: maximum number of iterations used by the Delvin algorithm. Only needed when d1 = 'robust_mahalanobis'.
216
+ fast_VG: whether the geometric variability estimation will be full (False) or fast (True).
217
+ VG_sample_size: sample size to be used to make the estimation of the geometric variability.
218
+ VG_n_samples: number of samples to be used to make the estimation of the geometric variability.
219
+ random_state: the random seed used for the (random) sample elements.
220
+ y_type: the type of response variable. Must be in ['quantitative', 'binary', 'multiclass'].
221
+ n_splits: number of folds to be used.
222
+ shuffle: whether data is shuffled before applying KFold or not, must be in [True, False].
223
+ kfold_random_state: the random seed for KFold if shuffle = True.
224
+ """
225
+ self.n_clusters = n_clusters; self.method = method; self.init = init; self.max_iter = max_iter; self.random_state = random_state
226
+ self.frac_sample_size = frac_sample_size; self.p1 = p1; self.p2 = p2; self.p3 = p3; self.d1 = d1; self.d2 = d2; self.d3 = d3;
227
+ self.robust_method = robust_method ; self.alpha = alpha; self.epsilon = epsilon; self.n_iters = n_iters; self.fast_VG = fast_VG;
228
+ self.VG_sample_size = VG_sample_size; self.VG_n_samples = VG_n_samples; self.q = q; self.n_splits = n_splits; self.shuffle = shuffle;
229
+ self.kfold_random_state = kfold_random_state; self.y_type = y_type
230
+
231
+ def fit(self, X, y=None, weights=None):
232
+ """
233
+ Fit method: fitting the fast k-medoids algorithm to `X` (and `y` if needed).
234
+
235
+ Parameters:
236
+ X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
237
+ y: a pandas/polars series or a numpy array. Represents a response variable. Is not required.
238
+ weights: the sample weights. Only used if provided and d1 = 'robust_mahalanobis'.
239
+ """
240
+
241
+ if isinstance(X, (pd.DataFrame, pl.DataFrame)):
242
+ X = X.to_numpy()
243
+ if isinstance(y, (pd.Series, pl.Series)):
244
+ y = y.to_numpy()
245
+
246
+ kfold = KFold(n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.kfold_random_state)
247
+
248
+ idx_fold = {}
249
+ for j, (train_index, test_index) in enumerate(kfold.split(X)):
250
+ idx_fold[j] = test_index
251
+
252
+ medoids_fold, labels_fold = {}, {}
253
+ for j in tqdm(range(0, self.n_splits), desc="Clustering Folds"):
254
+
255
+ fold_weights = weights[idx_fold[j]] if weights is not None else None
256
+ y_fold = y[idx_fold[j]] if y is not None else None
257
+
258
+ fast_kmedoids = FastKmedoidsGGower(n_clusters=self.n_clusters, method=self.method, init=self.init, max_iter=self.max_iter,
259
+ random_state=self.random_state, frac_sample_size=self.frac_sample_size,
260
+ p1=self.p1, p2=self.p2, p3=self.p3, d1=self.d1, d2=self.d2, d3=self.d3,
261
+ robust_method=self.robust_method, alpha=self.alpha, epsilon=self.epsilon,
262
+ n_iters=self.n_iters, fast_VG=self.fast_VG, VG_sample_size=self.VG_sample_size,
263
+ VG_n_samples=self.VG_n_samples, y_type=self.y_type)
264
+
265
+ fast_kmedoids.fit(X=X[idx_fold[j],:], y=y_fold, weights=fold_weights)
266
+
267
+ medoids_fold[j] = fast_kmedoids.medoids_
268
+ labels_fold[j] = fast_kmedoids.labels_
269
+
270
+ if y is not None:
271
+ self.y_idx = fast_kmedoids.y_idx
272
+ self.p1_init = fast_kmedoids.p1_init; self.p2_init = fast_kmedoids.p2_init; self.p3_init = fast_kmedoids.p3_init
273
+
274
+ X_medoids = np.row_stack([np.array(list(medoids_fold[fold_key].values())) for fold_key in range(0, self.n_splits)])
275
+
276
+ fast_kmedoids = FastKmedoidsGGower(n_clusters=self.n_clusters, method=self.method, init=self.init, max_iter=self.max_iter,
277
+ random_state=self.random_state, frac_sample_size=0.80, p1=self.p1, p2=self.p2, p3=self.p3,
278
+ d1=self.d1, d2=self.d2, d3=self.d3, robust_method=self.robust_method, alpha=self.alpha,
279
+ epsilon=self.epsilon, n_iters=self.n_iters, fast_VG=self.fast_VG,
280
+ VG_sample_size=self.VG_sample_size, VG_n_samples=self.VG_n_samples)
281
+
282
+ fast_kmedoids.fit(X=X_medoids)
283
+
284
+ fold_medoid_keys = [(fold_key, medoid_key) for fold_key in range(0, self.n_splits) for medoid_key in range(0, self.n_clusters)]
285
+ labels_dict = dict(zip(fold_medoid_keys, fast_kmedoids.labels_))
286
+ labels_dict = {fold_key: {medoid_key: labels_dict[fold_key, medoid_key] for medoid_key in range(0,self.n_clusters)} for fold_key in range(0,self.n_splits)}
287
+
288
+ final_labels = np.repeat(-1, len(X))
289
+ for fold_key in range(0, self.n_splits):
290
+ for medoid_key in range(0, self.n_clusters):
291
+ final_labels[get_idx_obs(fold_key, medoid_key, idx_fold, labels_fold)] = labels_dict[fold_key][medoid_key]
292
+
293
+ self.labels_ = final_labels
294
+ self.medoids_ = fast_kmedoids.medoids_
295
+ self.X = X
296
+ self.y = y
297
+
298
+ def predict(self, X):
299
+ """
300
+ Predict method: predicting clusters for `X` observation by assigning them to their nearest cluster (medoid) according to Generalized Gower distance.
301
+
302
+ Parameters:
303
+ X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
304
+ """
305
+
306
+ if self.y is not None: # remove y from the medoids, since in predict method X doesn't contain y.
307
+ for j in range(self.n_clusters):
308
+ self.medoids_[j] = np.delete(self.medoids_[j], self.y_idx)
309
+
310
+ distGG = GGowerDist(p1=self.p1_init, p2=self.p2_init, p3=self.p3_init, d1=self.d1, d2=self.d2, d3=self.d3, q=self.q,
311
+ robust_method=self.robust_method, alpha=self.alpha, epsilon=self.epsilon, n_iters=self.n_iters,
312
+ VG_sample_size=self.VG_sample_size, VG_n_samples=self.VG_n_samples, random_state=self.random_state)
313
+
314
+ distGG.fit(self.X) # self.X is X used during fit method, not necessarily the X parameter passed to the predict method
315
+
316
+ predicted_clusters = []
317
+ for i in range(0, len(X)):
318
+ dist_xi_medoids = [distGG.compute(xi=X[i,:], xr=self.medoids_[j]) for j in range(self.n_clusters)]
319
+ predicted_clusters.append(np.argmin(dist_xi_medoids))
320
+
321
+ return predicted_clusters
322
+
323
+ #####################################################################################################################
@@ -0,0 +1,306 @@
1
+ #####################################################################################################################
2
+
3
+ import numpy as np
4
+ import polars as pl
5
+ import matplotlib.pyplot as plt
6
+ import seaborn as sns
7
+
8
+ #####################################################################################################################
9
+
10
+ '''
11
+ def clustering_MDS_plot_one_method(X_mds, y_pred, y_true, title='', clustering_method=None, accuracy=None, time=None, figsize=(8, 5), bbox_to_anchor=(1.2, 1),
12
+ title_size=13, title_weight='bold', points_size=40, title_height=0.98, subtitles_size=12, subtitle_weight='bold',
13
+ hspace=0.8, wspace=0.4, save=False, file_name=None, format='jpg', dpi=250, legend_size=9):
14
+ """
15
+ Computes and display the MDS plot for a considered clustering configuration,
16
+ differentiating the cluster labels and the real groups, if they are known.
17
+
18
+ Parameters (inputs)
19
+ ----------
20
+ X_mds: a numpy array with the MDS matrix for the distance matrix used in the considered clustering configuration.
21
+ y_pred: a numpy array with the predictions of the response.
22
+ y_true: a numpy array with the true values of the response.
23
+ title: the title of the plot.
24
+ accuracy: the accuracy of the clustering algorithm, if computed.
25
+ time: the execution time of the clustering algorithm, if computed.
26
+ figsize: the size of the plot.
27
+ bbox_to_anchor: the size of the legend box.
28
+ title_fontsize: the size of the font of the title.
29
+ title_weight: the weight of the title.
30
+ points_size: the size of the points of the plot.
31
+ title_height: the height of the tile of the plot.
32
+
33
+ Returns (outputs)
34
+ -------
35
+ The described plot.
36
+ """
37
+
38
+ X_mds_df = pl.DataFrame(X_mds)
39
+ X_mds_df.columns = ['Z1', 'Z2']
40
+ labels_df = pl.DataFrame(y_pred)
41
+ labels_df.columns = ['cluster_labels']
42
+ MDS_cluster_df = pl.concat((X_mds_df, labels_df), how='horizontal')
43
+
44
+ if y_true is not None:
45
+
46
+ Y_df = pl.DataFrame(y_true)
47
+ Y_df.columns = ['Y']
48
+ MDS_true_df = pl.concat((X_mds_df, Y_df), how='horizontal')
49
+
50
+ fig, axes = plt.subplots(1,2, figsize=figsize)
51
+ axes = axes.flatten()
52
+ sns.scatterplot(x='Z1', y='Z2', hue='Y', data=MDS_true_df, ax=axes[0], s=points_size, palette='bright')
53
+ sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', data=MDS_cluster_df, ax=axes[1], s=points_size, palette='bright')
54
+ axes[0].set_title('Real groups', fontsize=subtitles_size, weight=subtitle_weight)
55
+ if accuracy != None and time != None:
56
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nAcc:{np.round(accuracy,3)}, Time:{np.round(time,1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
57
+ elif accuracy != None:
58
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nAcc:{np.round(accuracy,3)}', fontsize=subtitles_size, weight=subtitle_weight)
59
+ elif time != None:
60
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nTime:{np.round(time,1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
61
+ else:
62
+ axes[1].set_title(f'Predicted groups', fontsize=subtitles_size, weight=subtitle_weight)
63
+ axes[0].legend(title='Y', bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size, title_fontsize=legend_size)
64
+ axes[1].legend(title='Cluster labels',bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size, title_fontsize=legend_size)
65
+ plt.subplots_adjust(hspace=hspace, wspace=wspace)
66
+ plt.suptitle(title, fontsize=title_size, y=title_height, weight=title_weight, color='black')
67
+
68
+ else:
69
+
70
+ fig, axes = plt.subplots(figsize=figsize)
71
+ ax = sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', data=MDS_cluster_df, s=points_size, palette='bright')
72
+ ax.set_title(title, fontsize=title_size, y=title_height, weight=title_weight, color='black')
73
+ ax.legend(title='Cluster labels', bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size)
74
+
75
+ if save == True:
76
+ fig.savefig(file_name, format=format, dpi=dpi, bbox_inches="tight", pad_inches=0.2)
77
+ plt.show()
78
+
79
+ '''
80
+
81
+ def clustering_MDS_plot_one_method(X_mds, y_pred, y_true, title='', clustering_method=None, accuracy=None, time=None,
82
+ outliers_boolean=None, figsize=(8, 5), bbox_to_anchor=(1.2, 1),
83
+ title_size=13, title_weight='bold', points_size=40, title_height=0.98,
84
+ subtitles_size=12, subtitle_weight='bold', hspace=0.8, wspace=0.4,
85
+ save=False, file_name=None, format='jpg', dpi=250, legend_size=9):
86
+ """
87
+ Computes and displays the MDS plot for a considered clustering configuration,
88
+ differentiating the cluster labels and the real groups, if they are known.
89
+
90
+ Parameters (inputs)
91
+ ----------
92
+ X_mds: a numpy array with the MDS matrix.
93
+ y_pred: predicted cluster labels.
94
+ y_true: true labels (if available).
95
+ outliers_boolean: array-like boolean (0 or 1) indicating outliers (if available).
96
+ ...
97
+
98
+ Returns
99
+ -------
100
+ The described plot.
101
+ """
102
+ X_mds_df = pl.DataFrame(X_mds, schema=["Z1", "Z2"])
103
+ labels_df = pl.DataFrame(y_pred, schema=["cluster_labels"])
104
+
105
+ if outliers_boolean is not None:
106
+ outliers_df = pl.DataFrame(outliers_boolean, schema=["outliers"])
107
+ MDS_cluster_df = pl.concat((X_mds_df, labels_df, outliers_df), how='horizontal')
108
+ else:
109
+ MDS_cluster_df = pl.concat((X_mds_df, labels_df), how='horizontal')
110
+
111
+ if y_true is not None:
112
+ Y_df = pl.DataFrame(y_true, schema=["Y"])
113
+
114
+ if outliers_boolean is not None:
115
+ MDS_true_df = pl.concat((X_mds_df, Y_df, outliers_df), how='horizontal')
116
+ else:
117
+ MDS_true_df = pl.concat((X_mds_df, Y_df), how='horizontal')
118
+
119
+ fig, axes = plt.subplots(1, 2, figsize=figsize)
120
+ axes = axes.flatten()
121
+
122
+ if outliers_boolean is not None:
123
+ sns.scatterplot(x='Z1', y='Z2', hue='Y', style='outliers', data=MDS_true_df, ax=axes[0],
124
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
125
+ else:
126
+ sns.scatterplot(x='Z1', y='Z2', hue='Y', data=MDS_true_df, ax=axes[0], s=points_size, palette='bright')
127
+
128
+ if outliers_boolean is not None:
129
+ sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', style='outliers', data=MDS_cluster_df, ax=axes[1],
130
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
131
+ else:
132
+ sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', data=MDS_cluster_df, ax=axes[1], s=points_size, palette='bright')
133
+
134
+ axes[0].set_title('Real groups', fontsize=subtitles_size, weight=subtitle_weight)
135
+
136
+ if accuracy is not None and time is not None:
137
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nAcc:{np.round(accuracy,3)}, Time:{np.round(time,1)} secs',
138
+ fontsize=subtitles_size, weight=subtitle_weight)
139
+ elif accuracy is not None:
140
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nAcc:{np.round(accuracy,3)}',
141
+ fontsize=subtitles_size, weight=subtitle_weight)
142
+ elif time is not None:
143
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nTime:{np.round(time,1)} secs',
144
+ fontsize=subtitles_size, weight=subtitle_weight)
145
+ else:
146
+ axes[1].set_title('Predicted groups', fontsize=subtitles_size, weight=subtitle_weight)
147
+
148
+ axes[0].legend(title='Y', bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size, title_fontsize=legend_size)
149
+ axes[1].legend(title='Cluster labels', bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size, title_fontsize=legend_size)
150
+
151
+ plt.subplots_adjust(hspace=hspace, wspace=wspace)
152
+ plt.suptitle(title, fontsize=title_size, y=title_height, weight=title_weight, color='black')
153
+
154
+ else:
155
+ fig, ax = plt.subplots(figsize=figsize)
156
+
157
+ if outliers_boolean is not None:
158
+ sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', style='outliers', data=MDS_cluster_df,
159
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
160
+ else:
161
+ sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', data=MDS_cluster_df, s=points_size, palette='bright')
162
+
163
+ ax.set_title(title, fontsize=title_size, y=title_height, weight=title_weight, color='black')
164
+ ax.legend(title='Cluster labels', bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size)
165
+
166
+ if save:
167
+ fig.savefig(file_name, format=format, dpi=dpi, bbox_inches="tight", pad_inches=0.2)
168
+ plt.show()
169
+
170
+
171
+ #####################################################################################################################
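+ # Illustrative sketch of how X_mds can be obtained (not part of the original module): a 2-D MDS
+ # configuration computed from a precomputed distance matrix, e.g. the sample Generalized Gower
+ # matrix D_GG of a fitted FastKmedoidsGGower model, plotted against its sample cluster labels.
+ #
+ #   from sklearn.manifold import MDS
+ #   X_mds = MDS(n_components=2, dissimilarity='precomputed', random_state=123).fit_transform(model.D_GG)
+ #   clustering_MDS_plot_one_method(X_mds, y_pred=model.sample_labels, y_true=None,
+ #                                  title='Fast k-medoids (Generalized Gower)')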
172
+
173
+ def clustering_MDS_plot_multiple_methods(X_mds, y_pred, y_true=None, outliers_boolean=None, title='', accuracy=None, time=None, n_rows=2,
174
+ figsize=(8, 5), bbox_to_anchor=(1.2, 1), title_size=13, title_weight='bold', points_size=40,
175
+ title_height=0.98, subtitles_size=12, subtitle_weight='bold', hspace=0.8, wspace=0.4,
176
+ save=False, file_name=None, format='jpg', dpi=250, legend_size=9, legend_title='', n_cols_legend=2):
177
+ """
178
+ Computes and displays the MDS plot for a considered clustering configuration,
179
+ differentiating the cluster labels and the real groups, if they are known.
180
+
181
+ Parameters (inputs)
182
+ ----------
183
+ X_mds: a numpy array with the MDS matrix for the distance matrix used in the considered clustering configuration.
184
+ y_pred: a numpy array with the predictions of the response.
185
+ y_true: a numpy array with the true values of the response.
186
+ title: the title of the plot.
187
+ accuracy: the accuracy of the clustering algorithm, if computed.
188
+ time: the execution time of the clustering algorithm, if computed.
189
+ figsize: the size of the plot.
190
+ bbox_to_anchor: the size of the legend box.
191
+ title_size: the size of the font of the title.
192
+ title_weight: the weight of the title.
193
+ points_size: the size of the points of the plot.
194
+ title_height: the height of the title of the plot.
195
+
196
+ Returns (outputs)
197
+ -------
198
+ The described plot.
199
+ """
200
+
201
+ MDS_cluster_df = {}
202
+ X_mds_df = pl.DataFrame(X_mds)
203
+ X_mds_df.columns = ['Z1', 'Z2']
204
+
205
+ if outliers_boolean is not None:
206
+ outliers_bool_df = pl.DataFrame(outliers_boolean)
207
+ outliers_bool_df.columns = ['outliers']
208
+
209
+ methods = y_pred.keys()
210
+ for method in methods:
211
+ labels_df = pl.DataFrame(y_pred[method])
212
+ labels_df.columns = ['groups']
213
+ if outliers_boolean is not None:
214
+ MDS_cluster_df[method] = pl.concat((X_mds_df, labels_df, outliers_bool_df), how='horizontal')
215
+ else:
216
+ MDS_cluster_df[method] = pl.concat((X_mds_df, labels_df), how='horizontal')
217
+
218
+
219
+ if y_true is not None:
220
+
221
+ Y_df = pl.DataFrame(y_true)
222
+ Y_df.columns = ['Y']
223
+ if outliers_boolean is not None:
224
+ MDS_true_df = pl.concat((X_mds_df, Y_df, outliers_bool_df), how='horizontal')
225
+ else:
226
+ MDS_true_df = pl.concat((X_mds_df, Y_df), how='horizontal')
227
+
228
+ n_methods = len(methods)
229
+ n_cases = n_methods + 1
230
+ n_cols = int(np.ceil(n_cases / n_rows))
231
+
232
+ fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
233
+ axes = axes.flatten()
234
+
235
+ if outliers_boolean is not None:
236
+ sns.scatterplot(x='Z1', y='Z2', hue='Y', style='outliers', data=MDS_true_df, ax=axes[0],
237
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
238
+ else:
239
+ sns.scatterplot(x='Z1', y='Z2', hue='Y', data=MDS_true_df, ax=axes[0], s=points_size, palette='bright')
240
+
241
+ axes[0].set_title('Real groups', fontsize=subtitles_size, weight=subtitle_weight)
242
+ #axes[0].legend(title='', bbox_to_anchor=bbox_to_anchor, loc='lower right', fontsize=legend_size, ncol=2)
243
+ axes[0].legend().remove()
244
+
245
+ for i, method in enumerate(methods):
246
+
247
+ if outliers_boolean is not None:
248
+ sns.scatterplot(x='Z1', y='Z2', hue='groups', style='outliers', data=MDS_cluster_df[method], ax=axes[i+1],
249
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
250
+ else:
251
+ sns.scatterplot(x='Z1', y='Z2', hue='groups', data=MDS_cluster_df[method], ax=axes[i+1], s=points_size, palette='bright')
252
+
253
+ if accuracy is not None and time is not None:
254
+ axes[i+1].set_title(f'Predicted groups by\n{method}\n Acc:{np.round(accuracy[method],3)} - Time:{np.round(time[method],1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
255
+ elif accuracy is not None:
256
+ axes[i+1].set_title(f'Predicted groups by\n{method}\n Acc:{np.round(accuracy[method],3)}', fontsize=subtitles_size, weight=subtitle_weight)
257
+ elif time is not None:
258
+ axes[i+1].set_title(f'Predicted groups by\n{method}\n Time:{np.round(time[method],1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
259
+ else:
260
+ axes[i+1].set_title(f'Predicted groups by\n{method}\n ', fontsize=11)
261
+
262
+ axes[i+1].legend().remove()
263
+
264
+ axes[1].legend(title=legend_title , bbox_to_anchor=bbox_to_anchor,
265
+ loc='lower right', fontsize=legend_size, ncol=n_cols_legend)
266
+
267
+ else:
268
+
269
+ n_methods = len(methods)
270
+ n_cases = n_methods
271
+ n_cols = int(np.ceil(n_cases / n_rows))
272
+
273
+ fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
274
+ axes = axes.flatten()
275
+
276
+ for i, method in enumerate(methods):
277
+
278
+ if outliers_boolean is not None:
279
+ sns.scatterplot(x='Z1', y='Z2', hue='groups', style='outliers', data=MDS_cluster_df[method], ax=axes[i],
280
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
281
+ else:
282
+ sns.scatterplot(x='Z1', y='Z2', hue='groups', data=MDS_cluster_df[method], ax=axes[i], s=points_size, palette='bright')
283
+
284
+ if accuracy is not None and time is not None:
285
+ axes[i].set_title(f'Predicted groups by\n{method}\n Acc:{np.round(accuracy[method],3)} - Time:{np.round(time[method],1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
286
+ elif accuracy is not None:
287
+ axes[i].set_title(f'Predicted groups by\n{method}\n Acc:{np.round(accuracy[method],3)}', fontsize=subtitles_size, weight=subtitle_weight)
288
+ elif time is not None:
289
+ axes[i].set_title(f'Predicted groups by\n{method}\n Time:{np.round(time[method],1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
290
+ else:
291
+ axes[i].set_title(f'Predicted groups by\n{method}\n ', fontsize=11)
292
+
293
+ axes[i].legend().remove()
294
+
295
+ axes[1].legend(title='', bbox_to_anchor=bbox_to_anchor, loc='lower right', fontsize=legend_size)
296
+
297
+
298
+ plt.subplots_adjust(hspace=hspace, wspace=wspace)
299
+ plt.suptitle(title, fontsize=title_size, y=title_height, weight=title_weight, color='black')
300
+ for j in range(n_cases, n_rows * n_cols):
301
+ fig.delaxes(axes[j])
302
+ if save:
303
+ fig.savefig(file_name, format=format, dpi=dpi, bbox_inches="tight", pad_inches=0.2)
304
+ plt.show()
305
+
306
+ #####################################################################################################################
@@ -0,0 +1,54 @@
1
+ Metadata-Version: 2.4
2
+ Name: db-robust-clust
3
+ Version: 0.1.3
4
+ Summary: Apply distance-based robust clustering for mixed data.
5
+ Home-page: https://github.com/FabioScielzoOrtiz/db_robust_clust
6
+ Author: Fabio Scielzo Ortiz
7
+ Author-email: fabio.scielzoortiz@gmail.com
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Requires-Python: >=3.7
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Requires-Dist: polars
15
+ Requires-Dist: numpy<=1.26.4
16
+ Requires-Dist: PyDistances
17
+ Requires-Dist: pandas
18
+ Requires-Dist: scikit-learn-extra
19
+ Requires-Dist: tqdm
20
+ Requires-Dist: setuptools
21
+ Requires-Dist: pyarrow
22
+ Requires-Dist: matplotlib
23
+ Requires-Dist: seaborn
24
+ Dynamic: author
25
+ Dynamic: author-email
26
+ Dynamic: classifier
27
+ Dynamic: description
28
+ Dynamic: description-content-type
29
+ Dynamic: home-page
30
+ Dynamic: license-file
31
+ Dynamic: requires-dist
32
+ Dynamic: requires-python
33
+ Dynamic: summary
34
+
35
+ # db-robust-clust
36
+
37
+ In the era of big data, data scientists are trying to solve real-world problems using multivariate
38
+ and heterogeneous datasets, i.e., datasets where for each unit multiple variables of different
39
+ nature are observed. Clustering may be a challenging problem when data are of mixed-type and
40
+ present an underlying correlation structure and outlying units.
41
+
42
+ In the paper ***Grané, A., Scielzo-Ortiz, F.: New distance-based clustering algorithms for large mixed-type data, Submitted to Journal of Classification (2025)***, new efficient robust clustering algorithms able to deal with large mixed-type data are developed and implemented in a **new Python package**, called `db_robust_clust`, hosted on the official PyPI page: https://pypi.org/project/db_robust_clust/.
43
+
44
+ Their performance is analyzed in rather complex mixed-type datasets,
45
+ both synthetic and real, where a wide variety of scenarios is considered regarding
46
+ size, the proportion of outlying units, the underlying correlation structure, and the
47
+ cluster pattern. The simulation study comprises four computational experiments
48
+ conducted on datasets of sizes ranging from 35k to 1M, in which the accuracy and
49
+ efficiency of the new proposals are tested and compared to those of existing
50
+ clustering alternatives. In addition, the goodness and computing time of the methods
51
+ under evaluation are tested on real datasets of varying sizes and patterns. MDS is
52
+ used to visualize clustering results.
53
+
54
+ The package is available on the Python Package Index (PyPI), the standard package repository for the Python programming language: https://pypi.org/project/db_robust_clust/
@@ -0,0 +1,13 @@
1
+ LICENSE
2
+ README.md
3
+ setup.py
4
+ db_robust_clust/__init__.py
5
+ db_robust_clust/data.py
6
+ db_robust_clust/metrics.py
7
+ db_robust_clust/models.py
8
+ db_robust_clust/plots.py
9
+ db_robust_clust.egg-info/PKG-INFO
10
+ db_robust_clust.egg-info/SOURCES.txt
11
+ db_robust_clust.egg-info/dependency_links.txt
12
+ db_robust_clust.egg-info/requires.txt
13
+ db_robust_clust.egg-info/top_level.txt
@@ -0,0 +1,10 @@
1
+ polars
2
+ numpy<=1.26.4
3
+ PyDistances
4
+ pandas
5
+ scikit-learn-extra
6
+ tqdm
7
+ setuptools
8
+ pyarrow
9
+ matplotlib
10
+ seaborn
@@ -0,0 +1 @@
1
+ db_robust_clust
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,23 @@
1
+ from setuptools import setup, find_packages
2
+
3
+ with open("README.md", "r", encoding="utf-8") as fh:
4
+ long_description = fh.read()
5
+
6
+ setup(
7
+ name="db-robust-clust",
8
+ version="0.1.3",
9
+ author="Fabio Scielzo Ortiz",
10
+ author_email="fabio.scielzoortiz@gmail.com",
11
+ description="Apply distance-based robust clustering for mixed data.",
12
+ long_description=long_description,
13
+ long_description_content_type="text/markdown",
14
+ url="https://github.com/FabioScielzoOrtiz/db_robust_clust",
15
+ packages=find_packages(),
16
+ classifiers=[
17
+ "Programming Language :: Python :: 3",
18
+ "License :: OSI Approved :: MIT License",
19
+ "Operating System :: OS Independent",
20
+ ],
21
+ install_requires=['polars','numpy<=1.26.4', 'PyDistances', 'pandas', 'scikit-learn-extra', 'tqdm', 'setuptools', 'pyarrow', 'matplotlib', 'seaborn'],
22
+ python_requires=">=3.7"
23
+ )