db-robust-clust 0.1.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,19 @@
1
+ Copyright (c) 2018 The Python Packaging Authority
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy
4
+ of this software and associated documentation files (the "Software"), to deal
5
+ in the Software without restriction, including without limitation the rights
6
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7
+ copies of the Software, and to permit persons to whom the Software is
8
+ furnished to do so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in all
11
+ copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19
+ SOFTWARE.
@@ -0,0 +1,54 @@
1
+ Metadata-Version: 2.4
2
+ Name: db-robust-clust
3
+ Version: 0.1.3
4
+ Summary: Apply distance-based robust clustering for mixed data.
5
+ Home-page: https://github.com/FabioScielzoOrtiz/db_robust_clust
6
+ Author: Fabio Scielzo Ortiz
7
+ Author-email: fabio.scielzoortiz@gmail.com
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Requires-Python: >=3.7
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Requires-Dist: polars
15
+ Requires-Dist: numpy<=1.26.4
16
+ Requires-Dist: PyDistances
17
+ Requires-Dist: pandas
18
+ Requires-Dist: scikit-learn-extra
19
+ Requires-Dist: tqdm
20
+ Requires-Dist: setuptools
21
+ Requires-Dist: pyarrow
22
+ Requires-Dist: matplotlib
23
+ Requires-Dist: seaborn
24
+ Dynamic: author
25
+ Dynamic: author-email
26
+ Dynamic: classifier
27
+ Dynamic: description
28
+ Dynamic: description-content-type
29
+ Dynamic: home-page
30
+ Dynamic: license-file
31
+ Dynamic: requires-dist
32
+ Dynamic: requires-python
33
+ Dynamic: summary
34
+
35
+ # db-robust-clust
36
+
37
+ In the era of big data, data scientists are trying to solve real-world problems using multivariate
38
+ and heterogeneous datasets, i.e., datasets where for each unit multiple variables of different
39
+ nature are observed. Clustering may be a challenging problem when data are of mixed-type and
40
+ present an underlying correlation structure and outlying units.
41
+
42
+ In the paper ***Grané, A., Scielzo-Ortiz, F.: New distance-based clustering algorithms for large mixed-type data, Submitted to Journal of Classification (2025)***, new efficient robust clustering algorithms able to deal with large mixed-type data are developed and implemented in a **new Python package**, called `db_robust_clust`, hosted on the official PyPI page: https://pypi.org/project/db_robust_clust/.
43
+
44
+ Their performance is analyzed in rather complex mixed-type datasets,
45
+ both synthetic and real, where a wide variety of scenarios is considered regarding
46
+ size, the proportion of outlying units, the underlying correlation structure, and the
47
+ cluster pattern. The simulation study comprises four computational experiments
48
+ conducted on datasets of sizes ranging from 35k to 1M, in which the accuracy and
49
+ efficiency of the new proposals are tested and compared to those of existing
50
+ clustering alternatives. In addition, the goodness and computing time of the methods
51
+ under evaluation are tested on real datasets of varying sizes and patterns. MDS is
52
+ used to visualize clustering results.
53
+
54
+ The package is available on the Python Package Index (PyPI), the standard package repository for the Python programming language: https://pypi.org/project/db_robust_clust/
@@ -0,0 +1,20 @@
1
+ # db-robust-clust
2
+
3
+ In the era of big data, data scientists are trying to solve real-world problems using multivariate
4
+ and heterogeneous datasets, i.e., datasets where for each unit multiple variables of different
5
+ nature are observed. Clustering may be a challenging problem when data are of mixed-type and
6
+ present an underlying correlation structure and outlying units.
7
+
8
+ In the paper ***Grané, A., Scielzo-Ortiz, F.: New distance-based clustering algorithms for large mixed-type data, Submitted to Journal of Classification (2025)***, new efficient robust clustering algorithms able to deal with large mixed-type data are developed and implemented in a **new Python package**, called `db_robust_clust`, hosted on the official PyPI page: https://pypi.org/project/db_robust_clust/.
9
+
10
+ Their performance is analyzed in rather complex mixed-type datasets,
11
+ both synthetic and real, where a wide variety of scenarios is considered regarding
12
+ size, the proportion of outlying units, the underlying correlation structure, and the
13
+ cluster pattern. The simulation study comprises four computational experiments
14
+ conducted on datasets of sizes ranging from 35k to 1M, in which the accuracy and
15
+ efficiency of the new proposals are tested and compared to those of existing
16
+ clustering alternatives. In addition, the goodness and computing time of the methods
17
+ under evaluation are tested on real datasets of varying sizes and patterns. MDS is
18
+ used to visualize clustering results.
19
+
20
+ The package is available on the Python Package Index (PyPI), the standard package repository for the Python programming language: https://pypi.org/project/db_robust_clust/
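+
+ ## Quick start
+
+ Below is a minimal usage sketch (illustrative only, not taken verbatim from the package documentation). It assumes the clustering classes are imported from `db_robust_clust.models` and that the columns of the data matrix are ordered as quantitative, binary, multi-class, with `p1`, `p2` and `p3` giving the number of columns of each type; all parameter values are arbitrary.
+
+ ```python
+ # pip install db-robust-clust
+ import numpy as np
+ from db_robust_clust.models import FastKmedoidsGGower
+
+ # Synthetic mixed-type data: 2 quantitative, 2 binary and 2 multi-class columns.
+ rng = np.random.default_rng(123)
+ X = np.column_stack([
+     rng.normal(size=(500, 2)),          # quantitative
+     rng.integers(0, 2, size=(500, 2)),  # binary
+     rng.integers(0, 4, size=(500, 2)),  # multi-class
+ ])
+
+ # Fast k-medoids based on the Generalized Gower distance.
+ model = FastKmedoidsGGower(n_clusters=3, frac_sample_size=0.2, p1=2, p2=2, p3=2,
+                            d1='robust_mahalanobis', d2='jaccard', d3='matching')
+ model.fit(X)
+ print(model.labels_[:10])
+ ```
+
+ `FoldFastKmedoidsGGower` follows the same interface and additionally partitions the data into K folds, clustering each fold and then clustering the fold medoids.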
File without changes
@@ -0,0 +1,129 @@
1
+ import numpy as np
2
+ import polars as pl
3
+ #from PyMachineLearning.preprocessing import Encoder, Imputer
4
+ #from sklearn.pipeline import Pipeline
5
+ #from sklearn.compose import ColumnTransformer
6
+
7
+ #####################################################################################################################
8
+
9
+ def outlier_contamination(X, col_name, prop_below=0.05, prop_above=None, sigma=2, random_state=123) :
10
+ """
11
+ Contaminates a column of a data matrix with outliers.
12
+
13
+ Parameters (inputs)
14
+ ----------
15
+ X: a pandas data-frame (with a default integer index). It represents a data matrix.
16
+ col_name: the name of the column of `X` to be contaminated.
17
+ prop_below: proportion of outliers generated below the lower bound of that column. Ignored if None.
18
+ prop_above: proportion of outliers generated above the upper bound of that column. Ignored if None.
19
+ sigma: parameter that controls how far below the lower bound (or above the upper bound) the generated outliers can fall.
20
+ random_state: controls the random seed of the random elements.
21
+
22
+ Returns (outputs)
23
+ -------
24
+ X_new: the resulting variable after the outlier contamination of `X`.
25
+ outlier_idx_below: the index of the below outliers.
26
+ outlier_idx_above: the index of the above outliers.
27
+ """
28
+
29
+ X_new = X.copy()
30
+ Q25 = X_new[col_name].quantile(0.25)
31
+ Q75 = X_new[col_name].quantile(0.75)
32
+ IQR = Q75 - Q25
33
+ lower_bound = Q25 - 1.5*IQR
34
+ upper_bound = Q75 + 1.5*IQR
35
+ np.random.seed(random_state)
36
+
37
+ if prop_below is not None and prop_above is None: # contaminate only the lower tail
38
+
39
+ n_outliers_below = int(len(X_new)*prop_below)
40
+ outlier_idx_below = np.random.choice(len(X_new), size=n_outliers_below, replace=False)
41
+ outliers_below = np.random.uniform(lower_bound - sigma*np.abs(lower_bound), lower_bound, size=n_outliers_below)
42
+ X_new.loc[outlier_idx_below, col_name] = outliers_below
43
+ return X_new, outlier_idx_below
44
+
45
+
46
+ elif prop_above is not None and prop_below is None: # contaminate only the upper tail
47
+
48
+ n_outliers_above = int(len(X_new)*prop_above)
49
+ outlier_idx_above = np.random.choice(len(X_new), size=n_outliers_above, replace=False)
50
+ outliers_above = np.random.uniform(upper_bound, upper_bound + sigma*np.abs(upper_bound), size=n_outliers_above)
51
+ X_new.loc[outlier_idx_above, col_name] = outliers_above
52
+ return X_new, outlier_idx_above
53
+
54
+ elif prop_below is not None and prop_above is not None:
55
+
56
+ n_outliers_below = int(len(X_new)*prop_below)
57
+ outlier_idx_below = np.random.choice(len(X_new), size=n_outliers_below, replace=False)
58
+ outliers_below = np.random.uniform(lower_bound - sigma*np.abs(lower_bound), lower_bound, size=n_outliers_below)
59
+ X_new.loc[outlier_idx_below, col_name] = outliers_below
60
+
61
+ n_outliers_above = int(len(X_new)*prop_above)
62
+ outlier_idx_above = np.random.choice(len(X_new), size=n_outliers_above, replace=False)
63
+ outliers_above = np.random.uniform(upper_bound, upper_bound + sigma*np.abs(upper_bound), size=n_outliers_above)
64
+ X_new.loc[outlier_idx_above, col_name] = outliers_above
65
+
66
+ return X_new, outlier_idx_below, outlier_idx_above
67
+
68
+ else:
69
+ raise ValueError('prop_below and prop_above cannot be both None.')
70
+
71
+
72
+ #####################################################################################################################
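+ # Illustrative usage sketch (not part of the original module). It assumes `X` is a pandas
+ # DataFrame with a default RangeIndex, since the positional indices drawn with np.random.choice
+ # are assigned through `.loc`:
+ #
+ #   import pandas as pd
+ #   df = pd.DataFrame({'income': np.random.normal(2000, 300, size=1000)})
+ #   # contaminate 5% of the rows with values below Q1 - 1.5*IQR
+ #   df_new, idx_below = outlier_contamination(df, col_name='income', prop_below=0.05)
+ #   # contaminate both tails
+ #   df_new, idx_below, idx_above = outlier_contamination(df, col_name='income', prop_below=0.05, prop_above=0.05)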
73
+
74
+ '''
75
+ def sort_predictors_for_GGower(df, quant_predictors, cat_predictors):
76
+ """
77
+ Given a data-frame, the function returns the names of its categorical variables sorted according to (binary, multi-class)
78
+ and the number of quantitative, binary and multi-class variables.
79
+
80
+ Parameters (inputs)
81
+ ----------
82
+ df: a pandas/polars data-frame. It represents a data matrix.
83
+ quant_predictors: a list with the names of the quantitative variables of `df`.
84
+ cat_predictors: a list with the names of the categorical variables of `df`.
85
+
86
+ Returns (outputs)
87
+ -------
88
+ cat_predictors_sorted: a list with the names of the categorical variables of `df` sorted according to (binary, multi-class).
89
+ p1, p2, p3: the number of quantitative, binary and multi-class variables in `df`, respectively.
90
+ """
91
+
92
+ # Defining the transformers pipeline to impute and codify the predictors that need it.
93
+ quant_pipeline = Pipeline([
94
+ ('imputer', Imputer(method='simple_mean'))
95
+ ])
96
+
97
+ cat_pipeline = Pipeline([
98
+ ('encoder', Encoder(method='ordinal')), # encoding the categorical variables is needed by some imputers
99
+ ('imputer', Imputer(method='simple_most_frequent'))
100
+ ])
101
+
102
+ quant_cat_transformer = ColumnTransformer(transformers=[('quant', quant_pipeline, quant_predictors),
103
+ ('cat', cat_pipeline, cat_predictors)])
104
+
105
+ predictors = quant_predictors + cat_predictors
106
+ if isinstance(df, pl.DataFrame):
107
+ X = df[predictors].to_pandas()
108
+ # The Null values of the Polars columns that are defined as Object type by Pandas are treated as None and not as NaN (which is what we would like)
109
+ # To avoid this behavior the next step is necessary
110
+ X = X.fillna(value=np.nan)
111
+ # First we have to impute missing values so that they are not detected as another unique value
112
+ X = pl.DataFrame(quant_cat_transformer.fit_transform(X))
113
+ X.columns = quant_predictors + cat_predictors
114
+ # Compute number of unique values for each categorical predictor
115
+ n_unique_val = {}
116
+ for col in cat_predictors:
117
+ n_unique_val[col] = len(X[col].unique())
118
+ # Define the list of binary and multi-class predictors based on the number of unique values.
119
+ binary_predictors = [col for col in n_unique_val.keys() if n_unique_val[col] == 2]
120
+ multiclass_predictors = [col for col in n_unique_val.keys() if n_unique_val[col] >= 3]
121
+ # Reorder the list of categorical predictors in a suitable order for Gower Generalized
122
+ cat_predictors_sorted = binary_predictors + multiclass_predictors
123
+ # Getting the number of quant, binary and multi-class predictors
124
+ p1 = len(quant_predictors)
125
+ p2 = len(binary_predictors)
126
+ p3 = len(multiclass_predictors)
127
+
128
+ return cat_predictors_sorted, p1, p2, p3
129
+ '''
@@ -0,0 +1,44 @@
1
+ import numpy as np
2
+ from itertools import permutations
3
+ from sklearn.metrics import accuracy_score
4
+
5
+ #####################################################################################################################
6
+
7
+ def adjusted_score(y_pred, y_true, metric=accuracy_score):
8
+ """
9
+ Computes the adjusted score (accuracy, balanced accuracy, etc.) as the maximum
10
+ score obtained across all possible permutations of the cluster labels (`y_pred`).
11
+
12
+ Parameters
13
+ ----------
14
+ y_pred : numpy.ndarray
15
+ Predicted cluster labels.
16
+ y_true : numpy.ndarray
17
+ True class labels.
18
+ metric : callable, default=accuracy_score
19
+ Function to compute the metric. Must accept (y_true, y_pred) and return a float.
20
+
21
+ Returns
22
+ -------
23
+ adj_score : float
24
+ The best score obtained across all permutations.
25
+ adj_cluster_labels : numpy.ndarray
26
+ The cluster labels permuted according to the best permutation.
27
+ """
28
+
29
+ permutations_list = list(permutations(np.unique(y_pred)))
30
+ scores, permuted_cluster_labels = [], {}
31
+
32
+ for per in permutations_list:
33
+ permutation_dict = dict(zip(np.unique(y_pred), per))
34
+ permuted_cluster_labels[per] = np.array([permutation_dict[x] for x in y_pred])
35
+ scores.append(metric(y_true=y_true, y_pred=permuted_cluster_labels[per]))
36
+
37
+ scores = np.array(scores)
38
+ best_permutation = permutations_list[np.argmax(scores)]
39
+ adj_cluster_labels = permuted_cluster_labels[best_permutation]
40
+ adj_score = np.max(scores)
41
+
42
+ return adj_score, adj_cluster_labels
43
+
44
+ #####################################################################################################################
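+
+ # Illustrative usage sketch (not part of the original module): cluster labels are matched to the
+ # true classes by trying every relabelling of the predicted clusters and keeping the best score.
+ if __name__ == '__main__':
+     y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
+     y_pred = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])  # same partition, different label names
+     adj_acc, adj_labels = adjusted_score(y_pred, y_true, metric=accuracy_score)
+     print(adj_acc)     # 1.0, since the two partitions coincide up to relabelling
+     print(adj_labels)  # [0 0 0 1 1 1 2 2 2]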
@@ -0,0 +1,323 @@
1
+ #####################################################################################################################
2
+ import polars as pl
3
+ import pandas as pd
4
+ import numpy as np
5
+ from sklearn_extra.cluster import KMedoids
6
+ from sklearn.model_selection import KFold
7
+ from PyDistances.mixed import FastGGowerDistMatrix, GGowerDist
8
+ from tqdm import tqdm
9
+
10
+ #####################################################################################################################
11
+
12
+ def concat_X_y(X, y, y_type, p1, p2, p3):
13
+ """
14
+ Concatenates `X` and `y` in a suitable way to be used by the class `FastKmedoidsGGower` when applied to 'supervised' clustering.
15
+
16
+ Parameters (inputs)
17
+ ----------
18
+ X: a numpy array. It represents a predictors matrix.
19
+ y: a numpy array. It represents a response/target variable.
20
+ y_type: the type of response variable. Must be in ['quantitative', 'binary', 'multiclass'].
21
+ p1, p2, p3: number of quantitative, binary and multi-class predictors in `X`.
22
+
23
+ Returns (outputs)
24
+ -------
25
+ X_y: the result of concatenating `X` and `y` in the proper way to be used in `FastKmedoidsGGower`.
26
+ p1, p2, p3: the updated number of quantitative, binary and multi-class predictors in `X_y`.
27
+ y_idx: the column index in which `y` is located in `X_y`.
28
+ """
29
+
30
+ if y_type == 'binary':
31
+ X_y = np.column_stack((X[:,0:p1], y, X[:,p1:])) # insert y right after the quantitative columns, keeping all binary and multi-class columns
32
+ p2 = p2 + 1 # updating p2 since now X contains y and it is binary.
33
+ y_idx = p1
34
+ elif y_type == 'multiclass':
35
+ X_y = np.column_stack((X[:,0:(p1+p2)], y, X[:,(p1+p2):])) # insert y right after the binary columns, keeping all multi-class columns
36
+ p3 = p3 + 1 # updating p3 since now X contains y and it is multiclass.
37
+ y_idx = p1 + p2
38
+ elif y_type == 'quantitative':
39
+ X_y = np.column_stack((y, X))
40
+ p1 = p1 + 1 # updating p1 since now X contains y and it is quant.
41
+ y_idx = 0
42
+ else:
43
+ raise ValueError("Invalid `y` type")
44
+
45
+ return X_y, p1, p2, p3, y_idx
46
+
47
+ #####################################################################################################################
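+ # Illustrative example of the layout produced by concat_X_y (not part of the original module).
+ # With X ordered as [quantitative | binary | multi-class], p1=2, p2=1, p3=1 and a binary y,
+ #   X_y, p1, p2, p3, y_idx = concat_X_y(X, y, y_type='binary', p1=2, p2=1, p3=1)
+ # returns X_y with columns [quant, quant, y, binary, multi-class], p2 updated to 2 and y_idx=2,
+ # so y is treated as one more binary variable by the Generalized Gower distance.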
48
+
49
+ def get_idx_obs(fold_key, medoid_key, idx_fold, labels_fold):
50
+ # Idx of the observations of fold_key associated to the medoid_key of that fold
51
+ return idx_fold[fold_key][np.where(labels_fold[fold_key] == medoid_key)[0]]
52
+
53
+ #####################################################################################################################
54
+
55
+ class FastKmedoidsGGower :
56
+ """
57
+ Implements the Fast-K-medoids algorithm based on the Generalized Gower distance.
58
+ """
59
+
60
+ def __init__(self, n_clusters, method='pam', init='heuristic', max_iter=100, random_state=123,
61
+ frac_sample_size=0.1, p1=None, p2=None, p3=None, d1='robust_mahalanobis', d2='jaccard', d3='matching',
62
+ robust_method='trimmed', alpha=0.05, epsilon=0.05, n_iters=20, q=1,
63
+ fast_VG=False, VG_sample_size=1000, VG_n_samples=5, y_type=None) :
64
+ """
65
+ Constructor method.
66
+
67
+ Parameters:
68
+ n_clusters: the number of clusters.
69
+ method: the k-medoids clustering method. Must be in ['pam', 'alternate']. PAM is the classic one, more accurate but slower.
70
+ init: the k-medoids initialization method. Must be in ['heuristic', 'random']. Heuristic is the classic one, smarter but slower.
71
+ max_iter: the maximum number of iterations run by k-medoids.
72
+ frac_sample_size: the sample size in proportional terms.
73
+ p1, p2, p3: number of quantitative, binary and multi-class variables in the considered data matrix, respectively. Must be a non negative integer.
74
+ d1: name of the distance to be computed for quantitative variables. Must be a string in ['euclidean', 'minkowski', 'canberra', 'mahalanobis', 'robust_mahalanobis'].
75
+ d2: name of the distance to be computed for binary variables. Must be a string in ['sokal', 'jaccard'].
76
+ d3: name of the distance to be computed for multi-class variables. Must be a string in ['matching'].
77
+ q: the parameter that defines the Minkowski distance. Must be a positive integer.
78
+ robust_method: the method to be used for computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
79
+ alpha: a real number in [0,1] that is used if `robust_method` is 'trimmed' or 'winsorized'. Only needed when d1 = 'robust_mahalanobis'.
80
+ epsilon: parameter used by the Delvin algorithm that is used when computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
81
+ n_iters: maximum number of iterations used by the Delvin algorithm. Only needed when d1 = 'robust_mahalanobis'.
82
+ fast_VG: whether the geometric variability estimation will be full (False) or fast (True).
83
+ VG_sample_size: sample size to be used to make the estimation of the geometric variability.
84
+ VG_n_samples: number of samples to be used to make the estimation of the geometric variability.
85
+ random_state: the random seed used for the (random) sample elements.
86
+ y_type: the type of response variable. Must be in ['quantitative', 'binary', 'multiclass'].
87
+ """
88
+ self.n_clusters = n_clusters; self.method = method; self.init = init; self.max_iter = max_iter; self.random_state = random_state
89
+ self.frac_sample_size = frac_sample_size; self.p1 = p1; self.p2 = p2; self.p3 = p3; self.d1 = d1; self.d2 = d2; self.d3 = d3;
90
+ self.robust_method = robust_method; self.alpha = alpha; self.epsilon = epsilon; self.n_iters = n_iters; self.fast_VG = fast_VG;
91
+ self.VG_sample_size = VG_sample_size; self.VG_n_samples = VG_n_samples; self.q = q ; self.y_type = y_type
92
+ self.kmedoids = KMedoids(n_clusters=n_clusters, metric='precomputed', method=method, init=init, max_iter=max_iter, random_state=random_state)
93
+
94
+ def fit(self, X, y=None, weights=None):
95
+ """
96
+ Fit method: fitting the fast k-medoids algorithm to `X` (and `y` if needed).
97
+
98
+ Parameters:
99
+ X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
100
+ y: a pandas/polars series or a numpy array. Represents a response variable. Is not required.
101
+ weights: the sample weights. Only used if provided and d1 = 'robust_mahalanobis'.
102
+ """
103
+ if isinstance(X, (pd.DataFrame, pl.DataFrame)):
104
+ X = X.to_numpy()
105
+ if isinstance(y, (pd.Series, pl.Series)):
106
+ y = y.to_numpy()
107
+
108
+ self.p1_init = self.p1 ; self.p2_init = self.p2 ; self.p3_init = self.p3 # p1, p2 and p3 when X doesn't contain y. These original p's are needed for the predict method, since what is predicted is X without y.
109
+
110
+ if y is not None:
111
+ X, self.p1, self.p2, self.p3, self.y_idx = concat_X_y(X=X, y=y, y_type=self.y_type, p1=self.p1, p2=self.p2, p3=self.p3)
112
+
113
+ fastGG = FastGGowerDistMatrix(frac_sample_size=self.frac_sample_size, random_state=self.random_state, p1=self.p1, p2=self.p2, p3=self.p3,
114
+ d1=self.d1, d2=self.d2, d3=self.d3, robust_method=self.robust_method, alpha=self.alpha, epsilon=self.epsilon,
115
+ n_iters=self.n_iters, fast_VG=self.fast_VG, VG_sample_size=self.VG_sample_size, VG_n_samples=self.VG_n_samples,
116
+ q=self.q, weights=weights)
117
+
118
+ fastGG.compute(X)
119
+
120
+ self.D_GG = fastGG.D_GGower
121
+ self.X_sample = fastGG.X_sample
122
+ self.X_out_sample = fastGG.X_out_sample
123
+ self.sample_index = fastGG.sample_index
124
+ self.out_sample_index = fastGG.out_sample_index
125
+
126
+ self.kmedoids.fit(self.D_GG)
127
+ sample_labels_dict = {idx : self.kmedoids.labels_[i] for i, idx in enumerate(self.sample_index)} # keys: observation indices. values: cluster labels. Contains only the sample observation indices.
128
+ self.sample_labels = np.array(list(sample_labels_dict.values()))
129
+
130
+ self.medoids_ = {}
131
+ medoids_idx = [int(x) for x in self.kmedoids.medoid_indices_]
132
+ for j, idx in enumerate(medoids_idx):
133
+ self.medoids_[j] = self.X_sample[idx,:]
134
+
135
+ sample_weights = weights[self.sample_index] if weights is not None else None
136
+
137
+ self.distGG = GGowerDist(p1=self.p1, p2=self.p2, p3=self.p3, d1=self.d1, d2=self.d2, d3=self.d3, q=self.q,
138
+ robust_method=self.robust_method, alpha=self.alpha, epsilon=self.epsilon,
139
+ n_iters=self.n_iters, VG_sample_size=self.VG_sample_size, VG_n_samples=self.VG_n_samples,
140
+ random_state=self.random_state, weights=sample_weights)
141
+
142
+ if sample_weights is None:
143
+ self.distGG.fit(X)
144
+ else: # if there are weights we cannot use X when n (the number of rows) is too large, since Xw is n x n and cannot be computed in that case. To avoid this potential problem, instead of using X to fit distGG we use the much smaller sample X_sample.
145
+ self.distGG.fit(self.X_sample)
146
+ # We could reuse the VG's computed with fastGG in distGG, rather than making this second estimation. But the current estimation is very fast (less than 1 second) and equally accurate, so using one or the other leads to the same results.
147
+
148
+ dist_out_sample_medoids = {idx : [] for idx in self.out_sample_index} # keys: out sample idx, values: distance with respect each medoid.
149
+ for i, idx in enumerate(self.out_sample_index) :
150
+ for j in range(0, self.n_clusters) :
151
+ dist_out_sample_medoids[idx].append(self.distGG.compute(xi=self.X_out_sample[i,:], xr=self.medoids_[j]))
152
+
153
+ out_sample_labels_dict = {idx : np.argmin(dist_out_sample_medoids[idx]) for idx in self.out_sample_index} # keys: observation indices. Values: cluster labels. Contains only the out of sample observation indices
154
+ self.out_sample_labels = np.array(list(out_sample_labels_dict.values()))
155
+ sample_labels_dict.update(out_sample_labels_dict) # Now sample_label_dict contains the labels for each observation index, but without order.
156
+ labels_dict = {idx : sample_labels_dict[idx] for idx in range(0,len(X))} # keys: observation indices. Values: cluster labels. Contains all the observation indices
157
+ self.labels_ = np.array(list(labels_dict.values()))
158
+
159
+ self.X = X
160
+ self.y = y
161
+
162
+ def predict(self, X):
163
+ """
164
+ Predict method: predicting clusters for `X` observation by assigning them to their nearest cluster (medoid) according to Generalized Gower distance.
165
+
166
+ Parameters:
167
+ X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
168
+ """
169
+
170
+ if self.y is not None: # remove y from the medoids, since in predict method X doesn't contain y.
171
+ for j in range(self.n_clusters):
172
+ self.medoids_[j] = np.delete(self.medoids_[j], self.y_idx)
173
+
174
+ distGG = GGowerDist(p1=self.p1_init, p2=self.p2_init, p3=self.p3_init, d1=self.d1, d2=self.d2, d3=self.d3, q=self.q,
175
+ robust_method=self.robust_method, alpha=self.alpha, epsilon=self.epsilon, n_iters=self.n_iters,
176
+ VG_sample_size=self.VG_sample_size, VG_n_samples=self.VG_n_samples, random_state=self.random_state)
177
+
178
+ distGG.fit(self.X) # self.X is X used during fit method, not necessarily the X parameter passed to the predict method.
179
+
180
+ predicted_clusters = []
181
+ for i in range(0, len(X)):
182
+ dist_xi_medoids = [distGG.compute(xi=X[i,:], xr=self.medoids_[j]) for j in range(self.n_clusters)]
183
+ predicted_clusters.append(np.argmin(dist_xi_medoids))
184
+
185
+ return predicted_clusters
186
+
187
+ #####################################################################################################################
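+ # Illustrative usage sketch for FastKmedoidsGGower (not part of the original module; parameter
+ # values are arbitrary). X must have its columns ordered as [quantitative | binary | multi-class]:
+ #
+ #   model = FastKmedoidsGGower(n_clusters=3, frac_sample_size=0.1, p1=5, p2=2, p3=1,
+ #                              d1='robust_mahalanobis', d2='jaccard', d3='matching')
+ #   model.fit(X)              # k-medoids on a random sample; out-of-sample rows go to the nearest medoid
+ #   labels = model.labels_    # cluster label for every row of X
+ #   new_labels = model.predict(X_new)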
188
+
189
+ class FoldFastKmedoidsGGower:
190
+ """
191
+ Implements the K-Fold Fast-K-medoids algorithm based on the Generalized Gower distance.
192
+ """
193
+
194
+ def __init__(self, n_clusters, method='pam', init='heuristic', max_iter=100, random_state=123,
195
+ frac_sample_size=0.1, p1=None, p2=None, p3=None, d1='robust_mahalanobis', d2='jaccard', d3='matching',
196
+ robust_method='trimmed', alpha=0.05, epsilon=0.05, n_iters=20, q=1, fast_VG=False,
197
+ VG_sample_size=1000, VG_n_samples=5, n_splits=5, shuffle=True, kfold_random_state=123, y_type=None) :
198
+ """
199
+ Constructor method.
200
+
201
+ Parameters:
202
+ n_clusters: the number of clusters.
203
+ method: the k-medoids clustering method. Must be in ['pam', 'alternate']. PAM is the classic one, more accurate but slower.
204
+ init: the k-medoids initialization method. Must be in ['heuristic', 'random']. Heuristic is the classic one, smarter but slower.
205
+ max_iter: the maximum number of iterations run by k-medoids.
206
+ frac_sample_size: the sample size in proportional terms.
207
+ p1, p2, p3: number of quantitative, binary and multi-class variables in the considered data matrix, respectively. Must be a non negative integer.
208
+ d1: name of the distance to be computed for quantitative variables. Must be a string in ['euclidean', 'minkowski', 'canberra', 'mahalanobis', 'robust_mahalanobis'].
209
+ d2: name of the distance to be computed for binary variables. Must be a string in ['sokal', 'jaccard'].
210
+ d3: name of the distance to be computed for multi-class variables. Must be a string in ['matching'].
211
+ q: the parameter that defines the Minkowski distance. Must be a positive integer.
212
+ robust_method: the method to be used for computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
213
+ alpha: a real number in [0,1] that is used if `robust_method` is 'trimmed' or 'winsorized'. Only needed when d1 = 'robust_mahalanobis'.
214
+ epsilon: parameter used by the Delvin algorithm that is used when computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
215
+ n_iters: maximum number of iterations used by the Delvin algorithm. Only needed when d1 = 'robust_mahalanobis'.
216
+ fast_VG: whether the geometric variability estimation will be full (False) or fast (True).
217
+ VG_sample_size: sample size to be used to make the estimation of the geometric variability.
218
+ VG_n_samples: number of samples to be used to make the estimation of the geometric variability.
219
+ random_state: the random seed used for the (random) sample elements.
220
+ y_type: the type of response variable. Must be in ['quantitative', 'binary', 'multiclass'].
221
+ n_splits: number of folds to be used.
222
+ shuffle: whether data is shuffled before applying KFold or not, must be in [True, False].
223
+ kfold_random_state: the random seed for KFold if shuffle = True.
224
+ """
225
+ self.n_clusters = n_clusters; self.method = method; self.init = init; self.max_iter = max_iter; self.random_state = random_state
226
+ self.frac_sample_size = frac_sample_size; self.p1 = p1; self.p2 = p2; self.p3 = p3; self.d1 = d1; self.d2 = d2; self.d3 = d3;
227
+ self.robust_method = robust_method ; self.alpha = alpha; self.epsilon = epsilon; self.n_iters = n_iters; self.fast_VG = fast_VG;
228
+ self.VG_sample_size = VG_sample_size; self.VG_n_samples = VG_n_samples; self.q = q; self.n_splits = n_splits; self.shuffle = shuffle;
229
+ self.kfold_random_state = kfold_random_state; self.y_type = y_type
230
+
231
+ def fit(self, X, y=None, weights=None):
232
+ """
233
+ Fit method: fitting the fast k-medoids algorithm to `X` (and `y` if needed).
234
+
235
+ Parameters:
236
+ X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
237
+ y: a pandas/polars series or a numpy array. Represents a response variable. Is not required.
238
+ weights: the sample weights. Only used if provided and d1 = 'robust_mahalanobis'.
239
+ """
240
+
241
+ if isinstance(X, (pd.DataFrame, pl.DataFrame)):
242
+ X = X.to_numpy()
243
+ if isinstance(y, (pd.Series, pl.Series)):
244
+ y = y.to_numpy()
245
+
246
+ kfold = KFold(n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.kfold_random_state)
247
+
248
+ idx_fold = {}
249
+ for j, (train_index, test_index) in enumerate(kfold.split(X)):
250
+ idx_fold[j] = test_index
251
+
252
+ medoids_fold, labels_fold = {}, {}
253
+ for j in tqdm(range(0, self.n_splits), desc="Clustering Folds"):
254
+
255
+ fold_weights = weights[idx_fold[j]] if weights is not None else None
256
+ y_fold = y[idx_fold[j]] if y is not None else None
257
+
258
+ fast_kmedoids = FastKmedoidsGGower(n_clusters=self.n_clusters, method=self.method, init=self.init, max_iter=self.max_iter,
259
+ random_state=self.random_state, frac_sample_size=self.frac_sample_size,
260
+ p1=self.p1, p2=self.p2, p3=self.p3, d1=self.d1, d2=self.d2, d3=self.d3,
261
+ robust_method=self.robust_method, alpha=self.alpha, epsilon=self.epsilon,
262
+ n_iters=self.n_iters, fast_VG=self.fast_VG, VG_sample_size=self.VG_sample_size,
263
+ VG_n_samples=self.VG_n_samples, y_type=self.y_type)
264
+
265
+ fast_kmedoids.fit(X=X[idx_fold[j],:], y=y_fold, weights=fold_weights)
266
+
267
+ medoids_fold[j] = fast_kmedoids.medoids_
268
+ labels_fold[j] = fast_kmedoids.labels_
269
+
270
+ if y is not None:
271
+ self.y_idx = fast_kmedoids.y_idx
272
+ self.p1_init = fast_kmedoids.p1_init; self.p2_init = fast_kmedoids.p2_init; self.p3_init = fast_kmedoids.p3_init
273
+
274
+ X_medoids = np.row_stack([np.array(list(medoids_fold[fold_key].values())) for fold_key in range(0, self.n_splits)])
275
+
276
+ fast_kmedoids = FastKmedoidsGGower(n_clusters=self.n_clusters, method=self.method, init=self.init, max_iter=self.max_iter,
277
+ random_state=self.random_state, frac_sample_size=0.80, p1=self.p1, p2=self.p2, p3=self.p3,
278
+ d1=self.d1, d2=self.d2, d3=self.d3, robust_method=self.robust_method, alpha=self.alpha,
279
+ epsilon=self.epsilon, n_iters=self.n_iters, fast_VG=self.fast_VG,
280
+ VG_sample_size=self.VG_sample_size, VG_n_samples=self.VG_n_samples)
281
+
282
+ fast_kmedoids.fit(X=X_medoids)
283
+
284
+ fold_medoid_keys = [(fold_key, medoid_key) for fold_key in range(0, self.n_splits) for medoid_key in range(0, self.n_clusters)]
285
+ labels_dict = dict(zip(fold_medoid_keys, fast_kmedoids.labels_))
286
+ labels_dict = {fold_key: {medoid_key: labels_dict[fold_key, medoid_key] for medoid_key in range(0,self.n_clusters)} for fold_key in range(0,self.n_splits)}
287
+
288
+ final_labels = np.repeat(-1, len(X))
289
+ for fold_key in range(0, self.n_splits):
290
+ for medoid_key in range(0, self.n_clusters):
291
+ final_labels[get_idx_obs(fold_key, medoid_key, idx_fold, labels_fold)] = labels_dict[fold_key][medoid_key]
292
+
293
+ self.labels_ = final_labels
294
+ self.medoids_ = fast_kmedoids.medoids_
295
+ self.X = X
296
+ self.y = y
297
+
298
+ def predict(self, X):
299
+ """
300
+ Predict method: predicting clusters for `X` observation by assigning them to their nearest cluster (medoid) according to Generalized Gower distance.
301
+
302
+ Parameters:
303
+ X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
304
+ """
305
+
306
+ if self.y is not None: # remove y from the medoids, since in predict method X doesn't contain y.
307
+ for j in range(self.n_clusters):
308
+ self.medoids_[j] = np.delete(self.medoids_[j], self.y_idx)
309
+
310
+ distGG = GGowerDist(p1=self.p1_init, p2=self.p2_init, p3=self.p3_init, d1=self.d1, d2=self.d2, d3=self.d3, q=self.q,
311
+ robust_method=self.robust_method, alpha=self.alpha, epsilon=self.epsilon, n_iters=self.n_iters,
312
+ VG_sample_size=self.VG_sample_size, VG_n_samples=self.VG_n_samples, random_state=self.random_state)
313
+
314
+ distGG.fit(self.X) # self.X is X used during fit method, not necessarily the X parameter passed to the predict method
315
+
316
+ predicted_clusters = []
317
+ for i in range(0, len(X)):
318
+ dist_xi_medoids = [distGG.compute(xi=X[i,:], xr=self.medoids_[j]) for j in range(self.n_clusters)]
319
+ predicted_clusters.append(np.argmin(dist_xi_medoids))
320
+
321
+ return predicted_clusters
322
+
323
+ #####################################################################################################################
@@ -0,0 +1,306 @@
1
+ #####################################################################################################################
2
+
3
+ import numpy as np
4
+ import polars as pl
5
+ import matplotlib.pyplot as plt
6
+ import seaborn as sns
7
+
8
+ #####################################################################################################################
9
+
10
+ '''
11
+ def clustering_MDS_plot_one_method(X_mds, y_pred, y_true, title='', clustering_method=None, accuracy=None, time=None, figsize=(8, 5), bbox_to_anchor=(1.2, 1),
12
+ title_size=13, title_weight='bold', points_size=40, title_height=0.98, subtitles_size=12, subtitle_weight='bold',
13
+ hspace=0.8, wspace=0.4, save=False, file_name=None, format='jpg', dpi=250, legend_size=9):
14
+ """
15
+ Computes and display the MDS plot for a considered clustering configuration,
16
+ differentiating the cluster labels and the real groups, if they are known.
17
+
18
+ Parameters (inputs)
19
+ ----------
20
+ X_mds: a numpy array with the MDS matrix for the distance matrix used in the considered clustering configuration.
21
+ y_pred: a numpy array with the predictions of the response.
22
+ y_true: a numpy array with the true values of the response.
23
+ title: the title of the plot.
24
+ accuracy: the accuracy of the clustering algorithm, if computed.
25
+ time: the execution time of the clustering algorithm, if computed.
26
+ figsize: the size of the plot.
27
+ bbox_to_anchor: the size of the legend box.
28
+ title_fontsize: the size of the font of the title.
29
+ title_weight: the weight of the title.
30
+ points_size: the size of the points of the plot.
31
+ title_height: the height of the tile of the plot.
32
+
33
+ Returns (outputs)
34
+ -------
35
+ The described plot.
36
+ """
37
+
38
+ X_mds_df = pl.DataFrame(X_mds)
39
+ X_mds_df.columns = ['Z1', 'Z2']
40
+ labels_df = pl.DataFrame(y_pred)
41
+ labels_df.columns = ['cluster_labels']
42
+ MDS_cluster_df = pl.concat((X_mds_df, labels_df), how='horizontal')
43
+
44
+ if y_true is not None:
45
+
46
+ Y_df = pl.DataFrame(y_true)
47
+ Y_df.columns = ['Y']
48
+ MDS_true_df = pl.concat((X_mds_df, Y_df), how='horizontal')
49
+
50
+ fig, axes = plt.subplots(1,2, figsize=figsize)
51
+ axes = axes.flatten()
52
+ sns.scatterplot(x='Z1', y='Z2', hue='Y', data=MDS_true_df, ax=axes[0], s=points_size, palette='bright')
53
+ sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', data=MDS_cluster_df, ax=axes[1], s=points_size, palette='bright')
54
+ axes[0].set_title('Real groups', fontsize=subtitles_size, weight=subtitle_weight)
55
+ if accuracy != None and time != None:
56
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nAcc:{np.round(accuracy,3)}, Time:{np.round(time,1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
57
+ elif accuracy != None:
58
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nAcc:{np.round(accuracy,3)}', fontsize=subtitles_size, weight=subtitle_weight)
59
+ elif time != None:
60
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nTime:{np.round(time,1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
61
+ else:
62
+ axes[1].set_title(f'Predicted groups', fontsize=subtitles_size, weight=subtitle_weight)
63
+ axes[0].legend(title='Y', bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size, title_fontsize=legend_size)
64
+ axes[1].legend(title='Cluster labels',bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size, title_fontsize=legend_size)
65
+ plt.subplots_adjust(hspace=hspace, wspace=wspace)
66
+ plt.suptitle(title, fontsize=title_size, y=title_height, weight=title_weight, color='black')
67
+
68
+ else:
69
+
70
+ fig, axes = plt.subplots(figsize=figsize)
71
+ ax = sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', data=MDS_cluster_df, s=points_size, palette='bright')
72
+ ax.set_title(title, fontsize=title_size, y=title_height, weight=title_weight, color='black')
73
+ ax.legend(title='Cluster labels', bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size)
74
+
75
+ if save == True:
76
+ fig.savefig(file_name, format=format, dpi=dpi, bbox_inches="tight", pad_inches=0.2)
77
+ plt.show()
78
+
79
+ '''
80
+
81
+ def clustering_MDS_plot_one_method(X_mds, y_pred, y_true, title='', clustering_method=None, accuracy=None, time=None,
82
+ outliers_boolean=None, figsize=(8, 5), bbox_to_anchor=(1.2, 1),
83
+ title_size=13, title_weight='bold', points_size=40, title_height=0.98,
84
+ subtitles_size=12, subtitle_weight='bold', hspace=0.8, wspace=0.4,
85
+ save=False, file_name=None, format='jpg', dpi=250, legend_size=9):
86
+ """
87
+ Computes and displays the MDS plot for a considered clustering configuration,
88
+ differentiating the cluster labels and the real groups, if they are known.
89
+
90
+ Parameters (inputs)
91
+ ----------
92
+ X_mds: a numpy array with the MDS matrix.
93
+ y_pred: predicted cluster labels.
94
+ y_true: true labels (if available).
95
+ outliers_boolean: array-like boolean (0 or 1) indicating outliers (if available).
96
+ ...
97
+
98
+ Returns
99
+ -------
100
+ The described plot.
101
+ """
102
+ X_mds_df = pl.DataFrame(X_mds, schema=["Z1", "Z2"])
103
+ labels_df = pl.DataFrame(y_pred, schema=["cluster_labels"])
104
+
105
+ if outliers_boolean is not None:
106
+ outliers_df = pl.DataFrame(outliers_boolean, schema=["outliers"])
107
+ MDS_cluster_df = pl.concat((X_mds_df, labels_df, outliers_df), how='horizontal')
108
+ else:
109
+ MDS_cluster_df = pl.concat((X_mds_df, labels_df), how='horizontal')
110
+
111
+ if y_true is not None:
112
+ Y_df = pl.DataFrame(y_true, schema=["Y"])
113
+
114
+ if outliers_boolean is not None:
115
+ MDS_true_df = pl.concat((X_mds_df, Y_df, outliers_df), how='horizontal')
116
+ else:
117
+ MDS_true_df = pl.concat((X_mds_df, Y_df), how='horizontal')
118
+
119
+ fig, axes = plt.subplots(1, 2, figsize=figsize)
120
+ axes = axes.flatten()
121
+
122
+ if outliers_boolean is not None:
123
+ sns.scatterplot(x='Z1', y='Z2', hue='Y', style='outliers', data=MDS_true_df, ax=axes[0],
124
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
125
+ else:
126
+ sns.scatterplot(x='Z1', y='Z2', hue='Y', data=MDS_true_df, ax=axes[0], s=points_size, palette='bright')
127
+
128
+ if outliers_boolean is not None:
129
+ sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', style='outliers', data=MDS_cluster_df, ax=axes[1],
130
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
131
+ else:
132
+ sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', data=MDS_cluster_df, ax=axes[1], s=points_size, palette='bright')
133
+
134
+ axes[0].set_title('Real groups', fontsize=subtitles_size, weight=subtitle_weight)
135
+
136
+ if accuracy is not None and time is not None:
137
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nAcc:{np.round(accuracy,3)}, Time:{np.round(time,1)} secs',
138
+ fontsize=subtitles_size, weight=subtitle_weight)
139
+ elif accuracy is not None:
140
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nAcc:{np.round(accuracy,3)}',
141
+ fontsize=subtitles_size, weight=subtitle_weight)
142
+ elif time is not None:
143
+ axes[1].set_title(f'Predicted groups by\n{clustering_method}\nTime:{np.round(time,1)} secs',
144
+ fontsize=subtitles_size, weight=subtitle_weight)
145
+ else:
146
+ axes[1].set_title('Predicted groups', fontsize=subtitles_size, weight=subtitle_weight)
147
+
148
+ axes[0].legend(title='Y', bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size, title_fontsize=legend_size)
149
+ axes[1].legend(title='Cluster labels', bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size, title_fontsize=legend_size)
150
+
151
+ plt.subplots_adjust(hspace=hspace, wspace=wspace)
152
+ plt.suptitle(title, fontsize=title_size, y=title_height, weight=title_weight, color='black')
153
+
154
+ else:
155
+ fig, ax = plt.subplots(figsize=figsize)
156
+
157
+ if outliers_boolean is not None:
158
+ sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', style='outliers', data=MDS_cluster_df,
159
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
160
+ else:
161
+ sns.scatterplot(x='Z1', y='Z2', hue='cluster_labels', data=MDS_cluster_df, s=points_size, palette='bright')
162
+
163
+ ax.set_title(title, fontsize=title_size, y=title_height, weight=title_weight, color='black')
164
+ ax.legend(title='Cluster labels', bbox_to_anchor=bbox_to_anchor, loc='upper right', fontsize=legend_size)
165
+
166
+ if save:
167
+ fig.savefig(file_name, format=format, dpi=dpi, bbox_inches="tight", pad_inches=0.2)
168
+ plt.show()
169
+
170
+
171
+ #####################################################################################################################
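+ # Illustrative sketch of how X_mds can be obtained (not part of the original module): a 2-D MDS
+ # configuration computed from a precomputed distance matrix, e.g. the sample Generalized Gower
+ # matrix D_GG of a fitted FastKmedoidsGGower model, plotted against its sample cluster labels.
+ #
+ #   from sklearn.manifold import MDS
+ #   X_mds = MDS(n_components=2, dissimilarity='precomputed', random_state=123).fit_transform(model.D_GG)
+ #   clustering_MDS_plot_one_method(X_mds, y_pred=model.sample_labels, y_true=None,
+ #                                  title='Fast k-medoids (Generalized Gower)')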
172
+
173
+ def clustering_MDS_plot_multiple_methods(X_mds, y_pred, y_true=None, outliers_boolean=None, title='', accuracy=None, time=None, n_rows=2,
174
+ figsize=(8, 5), bbox_to_anchor=(1.2, 1), title_size=13, title_weight='bold', points_size=40,
175
+ title_height=0.98, subtitles_size=12, subtitle_weight='bold', hspace=0.8, wspace=0.4,
176
+ save=False, file_name=None, format='jpg', dpi=250, legend_size=9, legend_title='', n_cols_legend=2):
177
+ """
178
+ Computes and displays the MDS plot for a considered clustering configuration,
179
+ differentiating the cluster labels and the real groups, if they are known.
180
+
181
+ Parameters (inputs)
182
+ ----------
183
+ X_mds: a numpy array with the MDS matrix for the distance matrix used in the considered clustering configuration.
184
+ y_pred: a numpy array with the predictions of the response.
185
+ y_true: a numpy array with the true values of the response.
186
+ title: the title of the plot.
187
+ accuracy: the accuracy of the clustering algorithm, if computed.
188
+ time: the execution time of the clustering algorithm, if computed.
189
+ figsize: the size of the plot.
190
+ bbox_to_anchor: the size of the legend box.
191
+ title_size: the size of the font of the title.
192
+ title_weight: the weight of the title.
193
+ points_size: the size of the points of the plot.
194
+ title_height: the height of the title of the plot.
195
+
196
+ Returns (outputs)
197
+ -------
198
+ The described plot.
199
+ """
200
+
201
+ MDS_cluster_df = {}
202
+ X_mds_df = pl.DataFrame(X_mds)
203
+ X_mds_df.columns = ['Z1', 'Z2']
204
+
205
+ if outliers_boolean is not None:
206
+ outliers_bool_df = pl.DataFrame(outliers_boolean)
207
+ outliers_bool_df.columns = ['outliers']
208
+
209
+ methods = y_pred.keys()
210
+ for method in methods:
211
+ labels_df = pl.DataFrame(y_pred[method])
212
+ labels_df.columns = ['groups']
213
+ if outliers_boolean is not None:
214
+ MDS_cluster_df[method] = pl.concat((X_mds_df, labels_df, outliers_bool_df), how='horizontal')
215
+ else:
216
+ MDS_cluster_df[method] = pl.concat((X_mds_df, labels_df), how='horizontal')
217
+
218
+
219
+ if y_true is not None:
220
+
221
+ Y_df = pl.DataFrame(y_true)
222
+ Y_df.columns = ['Y']
223
+ if outliers_boolean is not None:
224
+ MDS_true_df = pl.concat((X_mds_df, Y_df, outliers_bool_df), how='horizontal')
225
+ else:
226
+ MDS_true_df = pl.concat((X_mds_df, Y_df), how='horizontal')
227
+
228
+ n_methods = len(methods)
229
+ n_cases = n_methods + 1
230
+ n_cols = int(np.ceil(n_cases / n_rows))
231
+
232
+ fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
233
+ axes = axes.flatten()
234
+
235
+ if outliers_boolean is not None:
236
+ sns.scatterplot(x='Z1', y='Z2', hue='Y', style='outliers', data=MDS_true_df, ax=axes[0],
237
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
238
+ else:
239
+ sns.scatterplot(x='Z1', y='Z2', hue='Y', data=MDS_true_df, ax=axes[0], s=points_size, palette='bright')
240
+
241
+ axes[0].set_title('Real groups', fontsize=subtitles_size, weight=subtitle_weight)
242
+ #axes[0].legend(title='', bbox_to_anchor=bbox_to_anchor, loc='lower right', fontsize=legend_size, ncol=2)
243
+ axes[0].legend().remove()
244
+
245
+ for i, method in enumerate(methods):
246
+
247
+ if outliers_boolean is not None:
248
+ sns.scatterplot(x='Z1', y='Z2', hue='groups', style='outliers', data=MDS_cluster_df[method], ax=axes[i+1],
249
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
250
+ else:
251
+ sns.scatterplot(x='Z1', y='Z2', hue='groups', data=MDS_cluster_df[method], ax=axes[i+1], s=points_size, palette='bright')
252
+
253
+ if accuracy is not None and time is not None:
254
+ axes[i+1].set_title(f'Predicted groups by\n{method}\n Acc:{np.round(accuracy[method],3)} - Time:{np.round(time[method],1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
255
+ elif accuracy is not None:
256
+ axes[i+1].set_title(f'Predicted groups by\n{method}\n Acc:{np.round(accuracy[method],3)}', fontsize=subtitles_size, weight=subtitle_weight)
257
+ elif time is not None:
258
+ axes[i+1].set_title(f'Predicted groups by\n{method}\n Time:{np.round(time[method],1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
259
+ else:
260
+ axes[i+1].set_title(f'Predicted groups by\n{method}\n ', fontsize=11)
261
+
262
+ axes[i+1].legend().remove()
263
+
264
+ axes[1].legend(title=legend_title , bbox_to_anchor=bbox_to_anchor,
265
+ loc='lower right', fontsize=legend_size, ncol=n_cols_legend)
266
+
267
+ else:
268
+
269
+ n_methods = len(methods)
270
+ n_cases = n_methods
271
+ n_cols = int(np.ceil(n_cases / n_rows))
272
+
273
+ fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
274
+ axes = axes.flatten()
275
+
276
+ for i, method in enumerate(methods):
277
+
278
+ if outliers_boolean is not None:
279
+ sns.scatterplot(x='Z1', y='Z2', hue='groups', style='outliers', data=MDS_cluster_df[method], ax=axes[i],
280
+ s=points_size, palette='bright', markers={0: 'o', 1: '^'})
281
+ else:
282
+ sns.scatterplot(x='Z1', y='Z2', hue='groups', data=MDS_cluster_df[method], ax=axes[i], s=points_size, palette='bright')
283
+
284
+ if accuracy is not None and time is not None:
285
+ axes[i].set_title(f'Predicted groups by\n{method}\n Acc:{np.round(accuracy[method],3)} - Time:{np.round(time[method],1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
286
+ elif accuracy is not None:
287
+ axes[i].set_title(f'Predicted groups by\n{method}\n Acc:{np.round(accuracy[method],3)}', fontsize=subtitles_size, weight=subtitle_weight)
288
+ elif time is not None:
289
+ axes[i].set_title(f'Predicted groups by\n{method}\n Time:{np.round(time[method],1)} secs', fontsize=subtitles_size, weight=subtitle_weight)
290
+ else:
291
+ axes[i].set_title(f'Predicted groups by\n{method}\n ', fontsize=11)
292
+
293
+ axes[i].legend().remove()
294
+
295
+ axes[1].legend(title='', bbox_to_anchor=bbox_to_anchor, loc='lower right', fontsize=legend_size)
296
+
297
+
298
+ plt.subplots_adjust(hspace=hspace, wspace=wspace)
299
+ plt.suptitle(title, fontsize=title_size, y=title_height, weight=title_weight, color='black')
300
+ for j in range(n_cases, n_rows * n_cols):
301
+ fig.delaxes(axes[j])
302
+ if save:
303
+ fig.savefig(file_name, format=format, dpi=dpi, bbox_inches="tight", pad_inches=0.2)
304
+ plt.show()
305
+
306
+ #####################################################################################################################
@@ -0,0 +1,54 @@
1
+ Metadata-Version: 2.4
2
+ Name: db-robust-clust
3
+ Version: 0.1.3
4
+ Summary: Apply distance-based robust clustering for mixed data.
5
+ Home-page: https://github.com/FabioScielzoOrtiz/db_robust_clust
6
+ Author: Fabio Scielzo Ortiz
7
+ Author-email: fabio.scielzoortiz@gmail.com
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Requires-Python: >=3.7
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Requires-Dist: polars
15
+ Requires-Dist: numpy<=1.26.4
16
+ Requires-Dist: PyDistances
17
+ Requires-Dist: pandas
18
+ Requires-Dist: scikit-learn-extra
19
+ Requires-Dist: tqdm
20
+ Requires-Dist: setuptools
21
+ Requires-Dist: pyarrow
22
+ Requires-Dist: matplotlib
23
+ Requires-Dist: seaborn
24
+ Dynamic: author
25
+ Dynamic: author-email
26
+ Dynamic: classifier
27
+ Dynamic: description
28
+ Dynamic: description-content-type
29
+ Dynamic: home-page
30
+ Dynamic: license-file
31
+ Dynamic: requires-dist
32
+ Dynamic: requires-python
33
+ Dynamic: summary
34
+
35
+ # db-robust-clust
36
+
37
+ In the era of big data, data scientists are trying to solve real-world problems using multivariate
38
+ and heterogeneous datasets, i.e., datasets where for each unit multiple variables of different
39
+ nature are observed. Clustering may be a challenging problem when data are of mixed-type and
40
+ present an underlying correlation structure and outlying units.
41
+
42
+ In the paper ***Grané, A., Scielzo-Ortiz, F.: New distance-based clustering algorithms for large mixed-type data, Submitted to Journal of Classification (2025)***, new efficient robust clustering algorithms able to deal with large mixed-type data are developed and implemented in a **new Python package**, called `db_robust_clust`, hosted on the official PyPI page: https://pypi.org/project/db_robust_clust/.
43
+
44
+ Their performance is analyzed in rather complex mixed-type datasets,
45
+ both synthetic and real, where a wide variety of scenarios is considered regarding
46
+ size, the proportion of outlying units, the underlying correlation structure, and the
47
+ cluster pattern. The simulation study comprises four computational experiments
48
+ conducted on datasets of sizes ranging from 35k to 1M, in which the accuracy and
49
+ efficiency of the new proposals are tested and compared to those of existing
50
+ clustering alternatives. In addition, the goodness and computing time of the methods
51
+ under evaluation are tested on real datasets of varying sizes and patterns. MDS is
52
+ used to visualize clustering results.
53
+
54
+ The package is available on the Python Package Index (PyPI), the standard package repository for the Python programming language: https://pypi.org/project/db_robust_clust/
@@ -0,0 +1,13 @@
1
+ LICENSE
2
+ README.md
3
+ setup.py
4
+ db_robust_clust/__init__.py
5
+ db_robust_clust/data.py
6
+ db_robust_clust/metrics.py
7
+ db_robust_clust/models.py
8
+ db_robust_clust/plots.py
9
+ db_robust_clust.egg-info/PKG-INFO
10
+ db_robust_clust.egg-info/SOURCES.txt
11
+ db_robust_clust.egg-info/dependency_links.txt
12
+ db_robust_clust.egg-info/requires.txt
13
+ db_robust_clust.egg-info/top_level.txt
@@ -0,0 +1,10 @@
1
+ polars
2
+ numpy<=1.26.4
3
+ PyDistances
4
+ pandas
5
+ scikit-learn-extra
6
+ tqdm
7
+ setuptools
8
+ pyarrow
9
+ matplotlib
10
+ seaborn
@@ -0,0 +1 @@
1
+ db_robust_clust
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,23 @@
1
+ from setuptools import setup, find_packages
2
+
3
+ with open("README.md", "r", encoding="utf-8") as fh:
4
+ long_description = fh.read()
5
+
6
+ setup(
7
+ name="db-robust-clust",
8
+ version="0.1.3",
9
+ author="Fabio Scielzo Ortiz",
10
+ author_email="fabio.scielzoortiz@gmail.com",
11
+ description="Apply distance-based robust clustering for mixed data.",
12
+ long_description=long_description,
13
+ long_description_content_type="text/markdown",
14
+ url="https://github.com/FabioScielzoOrtiz/db_robust_clust",
15
+ packages=find_packages(),
16
+ classifiers=[
17
+ "Programming Language :: Python :: 3",
18
+ "License :: OSI Approved :: MIT License",
19
+ "Operating System :: OS Independent",
20
+ ],
21
+ install_requires=['polars','numpy<=1.26.4', 'PyDistances', 'pandas', 'scikit-learn-extra', 'tqdm', 'setuptools', 'pyarrow', 'matplotlib', 'seaborn'],
22
+ python_requires=">=3.7"
23
+ )