PyPI - bayesian-sparse-gmm - Versions diffs - 0.2.2__tar.gz → 0.3.0__tar.gz - Mend

Files changed (46)

{bayesian_sparse_gmm-0.2.2 → bayesian_sparse_gmm-0.3.0}/.gitignore RENAMED

@@ -142,4 +142,5 @@ cython_debug/
 /visualize/
 # /tests/
 /docs/
-/.benchmarks/
+/.benchmarks/
+/benchmarks/visualize/

{bayesian_sparse_gmm-0.2.2 → bayesian_sparse_gmm-0.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: bayesian-sparse-gmm
-Version: 0.2.2
+Version: 0.3.0
 Summary: Bayesian Sparse Gaussian Mixture Model implementation in Python
 Author-email: Nam Nam <nampvh4436@gmail.com>
 License: MIT
@@ -26,6 +26,11 @@ Description-Content-Type: text/markdown
 # Bayesian Sparse GMM
+![Python](https://img.shields.io/badge/python-3.9%20--%203.12-blue?logo=python&style=flat-square)
+![CUDA](https://img.shields.io/badge/CUDA-Accelerated-76B900?logo=nvidia&style=flat-square)
+![CuPy](https://img.shields.io/badge/CuPy-Supported-7F22FE?style=flat-square)
+![Numba](https://img.shields.io/badge/Numba-Accelerated-FE7A15?style=flat-square)
 Bayesian Sparse Gaussian Mixture Model (GMM) implementation in Python.
 ## Installation
@@ -79,6 +84,13 @@ print(f"Feature inclusion probabilities: {model.feature_probabilities_.round(3)}
 labels = model.predict(X)
 ```
+## Optimization Methods
+The model supports two optimization architectures via the `optimizer` parameter:
+1. **Gibbs Sampling (MCMC)** (`optimizer="default"`): The original implementation. It samples from the full posterior and is asymptotically exact, but each iteration processes the entire dataset. Best for smaller datasets or when posterior samples, rather than a variational approximation, are required.
+2. **Stochastic Variational Inference (SVI)** (`optimizer="svi"`): A scalable approach that applies natural-gradient steps to the Coordinate Ascent Variational Inference (CAVI) updates. It processes data in mini-batches, which typically makes it far faster than full-data Gibbs sampling and lets it scale to large datasets ($N \gg 10{,}000$).
 ## GPU / CUDA Acceleration
 The model supports three backends, selected via the `backend` parameter:
@@ -164,14 +176,29 @@ Understanding the key hyperparameters is crucial for fine-tuning the model's spa
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
+| `optimizer` | `str` | `"default"` | Choose `"default"` for exact MCMC (Gibbs Sampling) or `"svi"` for scalable Stochastic Variational Inference (mini-batch). |
 | `K_max` | `int` | `15` | The maximum possible number of clusters. The algorithm will automatically find the active number of clusters $K \le K_{max}$. Should be set safely higher than the expected number of true clusters. |
 | `lambda_0` | `float` | `1000.0` | **Spike rate** of the Spike-and-Slab LASSO prior. A larger value aggressively forces non-informative (noise) features closer to zero. Must satisfy `lambda_0 >> lambda_1`. |
 | `lambda_1` | `float` | `0.1` | **Slab rate**. A smaller value allows informative features to deviate freely from zero to capture the cluster structure. |
 | `alpha` | `float` | `1.0` | Dirichlet concentration parameter for the cluster weight prior. Controls the prior belief over the distribution of cluster sizes. |
 | `theta` | `float` | `0.1` | Prior probability of a feature being included in the active set (the slab component). Smaller values induce stronger sparsity (fewer features selected). |
+### MCMC Parameters (`optimizer="default"`)
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
 | `burn_in` | `int` | `500` | Number of initial MCMC iterations discarded to allow the Markov chain to converge to the stationary distribution. |
 | `n_iter` | `int` | `1000` | Total number of MCMC iterations. The number of samples used for posterior inference is `n_iter - burn_in`. |
+### SVI Parameters (`optimizer="svi"`)
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `epochs` | `int` | `100` | Total number of passes over the dataset during variational inference. |
+| `batch_size` | `int` | `256` | Mini-batch size for SVI updates. |
+| `delay_rho` | `float` | `1.0` | Learning rate delay parameter ($\tau_0$) to stabilize early iterations. |
+| `forgetting_rate` | `float` | `0.75` | Forgetting rate ($\kappa \in (0.5, 1.0]$) controlling the learning rate decay $\rho_t = (t + \tau_0)^{-\kappa}$. |
 *Tip: For extremely high-dimensional datasets with heavy noise, tuning `lambda_0` to be larger and `theta` to be smaller will encourage more aggressive feature selection.*
 ## Reference
@@ -187,4 +214,8 @@ Understanding the key hyperparameters is crucial for fine-tuning the model's spa
   pages   = {1--50},
   url     = {http://jmlr.org/papers/v26/23-0142.html}
 }
-```
+```
+## Contributors
+* **Nam Nam** ([@Neeze](https://github.com/Neeze)) - Developer of the SVI (Stochastic Variational Inference) optimizer, GPU/CUDA acceleration, and benchmarking suite.

{bayesian_sparse_gmm-0.2.2 → bayesian_sparse_gmm-0.3.0}/README.md RENAMED

@@ -1,5 +1,10 @@
 # Bayesian Sparse GMM
+![Python](https://img.shields.io/badge/python-3.9%20--%203.12-blue?logo=python&style=flat-square)
+![CUDA](https://img.shields.io/badge/CUDA-Accelerated-76B900?logo=nvidia&style=flat-square)
+![CuPy](https://img.shields.io/badge/CuPy-Supported-7F22FE?style=flat-square)
+![Numba](https://img.shields.io/badge/Numba-Accelerated-FE7A15?style=flat-square)
 Bayesian Sparse Gaussian Mixture Model (GMM) implementation in Python.
 ## Installation
@@ -53,6 +58,13 @@ print(f"Feature inclusion probabilities: {model.feature_probabilities_.round(3)}
 labels = model.predict(X)
 ```
+## Optimization Methods
+The model supports two optimization architectures via the `optimizer` parameter:
+1. **Gibbs Sampling (MCMC)** (`optimizer="default"`): The original implementation. It samples from the full posterior and is asymptotically exact, but each iteration processes the entire dataset. Best for smaller datasets or when posterior samples, rather than a variational approximation, are required.
+2. **Stochastic Variational Inference (SVI)** (`optimizer="svi"`): A scalable approach that applies natural-gradient steps to the Coordinate Ascent Variational Inference (CAVI) updates. It processes data in mini-batches, which typically makes it far faster than full-data Gibbs sampling and lets it scale to large datasets ($N \gg 10{,}000$).
 ## GPU / CUDA Acceleration
 The model supports three backends, selected via the `backend` parameter:
@@ -138,14 +150,29 @@ Understanding the key hyperparameters is crucial for fine-tuning the model's spa
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
+| `optimizer` | `str` | `"default"` | Choose `"default"` for exact MCMC (Gibbs Sampling) or `"svi"` for scalable Stochastic Variational Inference (mini-batch). |
 | `K_max` | `int` | `15` | The maximum possible number of clusters. The algorithm will automatically find the active number of clusters $K \le K_{max}$. Should be set safely higher than the expected number of true clusters. |
 | `lambda_0` | `float` | `1000.0` | **Spike rate** of the Spike-and-Slab LASSO prior. A larger value aggressively forces non-informative (noise) features closer to zero. Must satisfy `lambda_0 >> lambda_1`. |
 | `lambda_1` | `float` | `0.1` | **Slab rate**. A smaller value allows informative features to deviate freely from zero to capture the cluster structure. |
 | `alpha` | `float` | `1.0` | Dirichlet concentration parameter for the cluster weight prior. Controls the prior belief over the distribution of cluster sizes. |
 | `theta` | `float` | `0.1` | Prior probability of a feature being included in the active set (the slab component). Smaller values induce stronger sparsity (fewer features selected). |
+### MCMC Parameters (`optimizer="default"`)
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
 | `burn_in` | `int` | `500` | Number of initial MCMC iterations discarded to allow the Markov chain to converge to the stationary distribution. |
 | `n_iter` | `int` | `1000` | Total number of MCMC iterations. The number of samples used for posterior inference is `n_iter - burn_in`. |
+### SVI Parameters (`optimizer="svi"`)
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `epochs` | `int` | `100` | Total number of passes over the dataset during variational inference. |
+| `batch_size` | `int` | `256` | Mini-batch size for SVI updates. |
+| `delay_rho` | `float` | `1.0` | Learning rate delay parameter ($\tau_0$) to stabilize early iterations. |
+| `forgetting_rate` | `float` | `0.75` | Forgetting rate ($\kappa \in (0.5, 1.0]$) controlling the learning rate decay $\rho_t = (t + \tau_0)^{-\kappa}$. |
 *Tip: For extremely high-dimensional datasets with heavy noise, tuning `lambda_0` to be larger and `theta` to be smaller will encourage more aggressive feature selection.*
 ## Reference
@@ -161,4 +188,8 @@ Understanding the key hyperparameters is crucial for fine-tuning the model's spa
   pages   = {1--50},
   url     = {http://jmlr.org/papers/v26/23-0142.html}
 }
-```
+```
+## Contributors
+* **Nam Nam** ([@Neeze](https://github.com/Neeze)) - Developer of the SVI (Stochastic Variational Inference) optimizer, GPU/CUDA acceleration, and benchmarking suite.
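
The SVI parameter table in the README gives the step-size decay as $\rho_t = (t + \tau_0)^{-\kappa}$. The sketch below evaluates that schedule for the documented defaults to show how quickly the steps shrink. It is an illustration only: the helper name `svi_step_sizes` is hypothetical, and it assumes that `t` counts mini-batch updates starting from 0 and that a final partial batch counts as one update. The README confirms neither detail.

```python
import math


def svi_step_sizes(n_samples, batch_size, epochs, delay=1.0, forgetting_rate=0.75):
    """Return rho_t = (t + delay) ** -forgetting_rate for every mini-batch update.

    Hypothetical helper: `delay` plays the role of tau_0 (`delay_rho`) and
    `forgetting_rate` the role of kappa in the README's formula.
    """
    if not 0.5 < forgetting_rate <= 1.0:
        raise ValueError("forgetting_rate must lie in (0.5, 1.0]")
    # Assumption: a trailing partial batch still triggers one update.
    steps_per_epoch = math.ceil(n_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    # Assumption: the update counter t starts at 0.
    return [(t + delay) ** (-forgetting_rate) for t in range(total_steps)]


rhos = svi_step_sizes(n_samples=10_000, batch_size=256, epochs=100)
print(f"updates: {len(rhos)}")
print(f"first step size: {rhos[0]:.4f}, last step size: {rhos[-1]:.6f}")
```

Under these assumptions, 10,000 samples with `batch_size=256` give 40 updates per epoch, so 100 epochs produce 4,000 updates. A larger `delay_rho` lowers the first few step sizes, which is the stabilizing effect the table describes.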

bayesian_sparse_gmm-0.3.0/benchmarks/benchmark.py ADDED

@@ -0,0 +1,78 @@
+import time
+import numpy as np
+from sklearn.preprocessing import StandardScaler
+from bayesian_sparse_gmm.model import BayesianSparseGMM
+def run_benchmark(n=5000, p=50, K_true=5, signal_features=10, backend="cuda"):
+    print(f"\n=======================================================")
+    print(f"BENCHMARK: SVI vs MCMC (n={n}, p={p}, K={K_true})")
+    print(f"Backend: {backend.upper()}")
+    print(f"=======================================================\n")
+    # Generate Synthetic Sparse Data
+    rng = np.random.default_rng(42)
+    means = np.zeros((K_true, p))
+    means[:, :signal_features] = rng.normal(0, 3.0, size=(K_true, signal_features))
+    n_per_k = n // K_true
+    X_parts = [rng.normal(means[k], 1.0, size=(n_per_k, p)) for k in range(K_true)]
+    X_raw = np.vstack(X_parts)
+    y = np.repeat(np.arange(K_true), n_per_k)
+    # Shuffle
+    shuf = rng.permutation(len(y))
+    X_raw, y = X_raw[shuf], y[shuf]
+    # Standardize
+    X = StandardScaler().fit_transform(X_raw)
+    # --- MCMC (Default) ---
+    print("Running MCMC (Gibbs Sampling) ...")
+    t0 = time.time()
+    gmm_mcmc = BayesianSparseGMM(
+        K_max=10,
+        optimizer="default",
+        n_iter=100,      # Small number for benchmarking
+        burn_in=20,
+        backend=backend,
+        random_state=42,
+        use_identity_covariance=True
+    )
+    gmm_mcmc.fit(X)
+    t_mcmc = time.time() - t0
+    # --- SVI ---
+    print("Running SVI (Natural Gradients) ...")
+    t0 = time.time()
+    gmm_svi = BayesianSparseGMM(
+        K_max=10,
+        optimizer="svi",
+        epochs=10,       # 10 passes over the dataset
+        batch_size=256,
+        backend=backend,
+        random_state=42,
+        use_identity_covariance=True
+    )
+    gmm_svi.fit(X)
+    t_svi = time.time() - t0
+    print("\n--- RESULTS ---")
+    print(f"MCMC Time: {t_mcmc:.2f} seconds")
+    print(f"SVI Time:  {t_svi:.2f} seconds")
+    print(f"Speedup:   {t_mcmc/t_svi:.2f}x")
+    print(f"\nMCMC Found Clusters: {gmm_mcmc.K_hat_}")
+    print(f"SVI Found Clusters:  {gmm_svi.K_hat_}")
+    n_sel_mcmc = len(gmm_mcmc.selected_features_)
+    n_sel_svi = len(gmm_svi.selected_features_)
+    print(f"MCMC Selected Features: {n_sel_mcmc}/{p}")
+    print(f"SVI Selected Features:  {n_sel_svi}/{p}")
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--backend", default="cuda", type=str)
+    args = parser.parse_args()
+    run_benchmark(n=10000, p=100, K_true=10, signal_features=15, backend=args.backend)
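
`benchmark.py` reports only how many features each optimizer selected. Because the synthetic cluster means are non-zero only in the first `signal_features` columns, the selection can also be scored against that known signal set. The following is a minimal sketch, assuming `selected_features_` holds integer column indices (the stress test script treats it that way). The helper `selection_scores` and the example index list are hypothetical and are not part of the package.

```python
def selection_scores(selected, true_signal):
    """Precision and recall of a selected feature set against the known signal set."""
    selected = {int(i) for i in selected}
    true_signal = {int(i) for i in true_signal}
    hits = selected & true_signal
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(selected) if selected else 0.0
    recall = len(hits) / len(true_signal) if true_signal else 0.0
    return precision, recall


# In benchmark.py the signal lives in columns 0 .. signal_features - 1.
true_signal = range(10)
# Made-up selection for illustration: eight true features plus two noise columns.
example_selected = [0, 1, 2, 3, 4, 5, 6, 7, 17, 42]

precision, recall = selection_scores(example_selected, true_signal)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the benchmark itself, the call would be `selection_scores(gmm_svi.selected_features_, range(signal_features))`, again assuming the attribute contains column indices.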

bayesian_sparse_gmm-0.3.0/benchmarks/benchmark_stress_test.py ADDED

@@ -0,0 +1,182 @@
+import time
+import os
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.cluster import DBSCAN
+from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score, v_measure_score
+from sklearn.datasets import make_moons
+from sklearn.preprocessing import StandardScaler
+from bayesian_sparse_gmm.model import BayesianSparseGMM
+def generate_complex_arcs(n_samples=10000, n_features=1024, noise_level=0.15):
+    """
+    Generate a complex dataset with arc-shaped clusters mixed together, embedded in high dimensions.
+    """
+    rng = np.random.default_rng(42)
+    # Create 4 arcs (moons) that intersect
+    n_per_pair = n_samples // 2
+    # Pair 1: Standard moons
+    X1, y1 = make_moons(n_samples=n_per_pair, noise=noise_level, random_state=42)
+    # Pair 2: Rotated and shifted moons
+    X2, y2 = make_moons(n_samples=n_per_pair, noise=noise_level, random_state=43)
+    y2 += 2 # Cluster labels 2 and 3
+    # Rotate by 60 degrees
+    theta = np.radians(60)
+    c, s = np.cos(theta), np.sin(theta)
+    R = np.array([[c, -s], [s, c]])
+    X2 = X2 @ R
+    # Shift to overlap with Pair 1
+    X2[:, 0] += 0.3
+    X2[:, 1] += 0.3
+    X_2d = np.vstack((X1, X2))
+    y = np.concatenate((y1, y2))
+    # Embed in high-dimensional space
+    X = np.zeros((n_samples, n_features))
+    X[:, 0] = X_2d[:, 0]
+    X[:, 1] = X_2d[:, 1]
+    # Add a few derived non-linear signal features
+    X[:, 2] = X_2d[:, 0] * X_2d[:, 1]
+    X[:, 3] = X_2d[:, 0] ** 2
+    X[:, 4] = X_2d[:, 1] ** 2
+    # Add complex pure noise features
+    noise_features = rng.normal(0, 1.5, size=(n_samples, n_features - 5))
+    X[:, 5:] = noise_features
+    # Randomly permute the features so the signal is not just in the first 5 dims
+    feature_shuf = rng.permutation(n_features)
+    X = X[:, feature_shuf]
+    # Shuffle samples
+    sample_shuf = rng.permutation(n_samples)
+    X, y, X_2d = X[sample_shuf], y[sample_shuf], X_2d[sample_shuf]
+    # Scale
+    X = StandardScaler().fit_transform(X)
+    return X, y, X_2d, feature_shuf
+def run_stress_test_benchmark(backend="cuda"):
+    n_samples = 10000
+    n_features = 1024
+    print(f"\n=======================================================")
+    print(f"STRESS TEST BENCHMARK: ARC-SHAPED CLUSTERS")
+    print(f"n={n_samples}, p={n_features}, backend={backend.upper()}")
+    print(f"=======================================================\n")
+    print("Generating complex dataset...")
+    X, y, X_2d, feature_shuf = generate_complex_arcs(n_samples, n_features)
+    # Find original signal feature indices
+    signal_indices = [np.where(feature_shuf == i)[0][0] for i in range(5)]
+    print(f"Signal feature indices: {signal_indices}")
+    # --- DBSCAN ---
+    # In 1024 dimensions, distance values are large.
+    # For N(0,1) variables, expected Euclidean distance is roughly sqrt(2*1024) ~ 45
+    # We set a somewhat large eps and min_samples to see how it handles it.
+    print("\nRunning DBSCAN ...")
+    t0 = time.time()
+    dbscan = DBSCAN(eps=40.0, min_samples=10)
+    dbscan.fit(X)
+    t_dbscan = time.time() - t0
+    # --- BSGMM (SVI) ---
+    print("\nRunning Bayesian Sparse GMM (SVI) ...")
+    t0 = time.time()
+    gmm = BayesianSparseGMM(
+        K_max=10,
+        optimizer="svi",
+        epochs=100,
+        batch_size=512,
+        lambda_0=1000.0,
+        lambda_1=0.1,
+        theta=0.5,
+        backend=backend,
+        random_state=42,
+        use_identity_covariance=True,
+        verbose=1
+    )
+    gmm.fit(X)
+    t_bsgmm = time.time() - t0
+    # --- Results ---
+    ari_db = adjusted_rand_score(y, dbscan.labels_)
+    ami_db = adjusted_mutual_info_score(y, dbscan.labels_)
+    v_db = v_measure_score(y, dbscan.labels_)
+    ari_gmm = adjusted_rand_score(y, gmm.labels_)
+    ami_gmm = adjusted_mutual_info_score(y, gmm.labels_)
+    v_gmm = v_measure_score(y, gmm.labels_)
+    print("\n--- RESULTS ---")
+    print(f"DBSCAN Time: {t_dbscan:.2f} seconds")
+    print(f"BSGMM Time:  {t_bsgmm:.2f} seconds")
+    print(f"Speedup:     {t_dbscan/t_bsgmm:.2f}x (Note: DBSCAN doesn't scale well with dimensions)")
+    print(f"\nDBSCAN Clusters found: {len(np.unique(dbscan.labels_[dbscan.labels_ != -1]))} (Noise points: {np.sum(dbscan.labels_ == -1)})")
+    print(f"BSGMM Clusters found:  {gmm.K_hat_}")
+    print(f"\nDBSCAN - ARI: {ari_db:.4f} | AMI: {ami_db:.4f} | V: {v_db:.4f}")
+    print(f"BSGMM  - ARI: {ari_gmm:.4f} | AMI: {ami_gmm:.4f} | V: {v_gmm:.4f}")
+    n_sel = len(gmm.selected_features_)
+    print(f"\nBSGMM Selected Features: {n_sel}/{n_features}")
+    # Check if BSGMM found the true signal features
+    true_sig = set(signal_indices)
+    sel = set(gmm.selected_features_)
+    found = true_sig.intersection(sel)
+    print(f"Signal features found: {found} (out of {true_sig})")
+    # --- Visualization ---
+    os.makedirs("./visualize", exist_ok=True)
+    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
+    fig.suptitle(f"Stress Test: Mixed Arcs (n={n_samples}, p={n_features})", fontsize=14, fontweight='bold')
+    # Ground Truth
+    pal = plt.cm.tab10(np.linspace(0, 0.9, 10))
+    for k in np.unique(y):
+        m = y == k
+        axes[0].scatter(X_2d[m, 0], X_2d[m, 1], c=[pal[k % 10]], s=5, alpha=0.5)
+    axes[0].set_title("Ground Truth (2D Projection)")
+    # DBSCAN
+    unique_db = np.unique(dbscan.labels_)
+    pal_db = plt.cm.tab20(np.linspace(0, 1, max(len(unique_db), 2)))
+    for idx, k in enumerate(unique_db):
+        m = dbscan.labels_ == k
+        color = 'k' if k == -1 else pal_db[idx % 20]
+        axes[1].scatter(X_2d[m, 0], X_2d[m, 1], c=[color], s=5, alpha=0.5)
+    axes[1].set_title(f"DBSCAN (ARI={ari_db:.3f})")
+    # BSGMM
+    unique_gmm = np.unique(gmm.labels_)
+    pal_gmm = plt.cm.tab20(np.linspace(0, 1, max(len(unique_gmm), 2)))
+    for idx, k in enumerate(unique_gmm):
+        m = gmm.labels_ == k
+        axes[2].scatter(X_2d[m, 0], X_2d[m, 1], c=[pal_gmm[idx % 20]], s=5, alpha=0.5)
+    axes[2].set_title(f"BSGMM SVI (ARI={ari_gmm:.3f}, {n_sel} feats)")
+    plt.tight_layout()
+    plt.savefig("./visualize/stress_test_arcs.png", dpi=150, bbox_inches='tight')
+    plt.close()
+    print("Saved './visualize/stress_test_arcs.png'")
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--backend", default="cuda", type=str)
+    args = parser.parse_args()
+    run_stress_test_benchmark(backend=args.backend)
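
The DBSCAN comment in the stress test states that for N(0,1) features the expected Euclidean distance in 1024 dimensions is roughly sqrt(2*1024) ≈ 45. The sketch below checks that rule of thumb numerically on pure Gaussian noise. It is an independent illustration, not code from the package: the helper `mean_pairwise_distance`, the sample size of 200 and the seed are arbitrary choices.

```python
import numpy as np


def mean_pairwise_distance(n_points, n_features, seed=0):
    """Average Euclidean distance between distinct pairs of N(0, 1) points."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=(n_points, n_features))
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b to avoid an explicit double loop.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Keep each unordered pair once and clip tiny negative rounding errors.
    upper = np.triu_indices(n_points, k=1)
    return float(np.sqrt(np.clip(sq_dists[upper], 0.0, None)).mean())


p = 1024
observed = mean_pairwise_distance(n_points=200, n_features=p)
print(f"observed mean distance: {observed:.2f}, sqrt(2p): {np.sqrt(2 * p):.2f}")
```

If distances concentrate this tightly around sqrt(2p), an `eps` of 40 sits below the typical pairwise distance, so many points may end up without enough neighbours. That is an inference from this sketch, not a result reported in the source.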
