PyPI - mcDETECT - Versions diffs - 2.0.15__tar.gz - Mend

mcDETECT 2.0.15__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

mcdetect-2.0.15/LICENSE +21 -0
mcdetect-2.0.15/PKG-INFO +40 -0
mcdetect-2.0.15/README.md +7 -0
mcdetect-2.0.15/mcDETECT/__init__.py +6 -0
mcdetect-2.0.15/mcDETECT/model.py +644 -0
mcdetect-2.0.15/mcDETECT/utils.py +145 -0
mcdetect-2.0.15/mcDETECT.egg-info/PKG-INFO +40 -0
mcdetect-2.0.15/mcDETECT.egg-info/SOURCES.txt +11 -0
mcdetect-2.0.15/mcDETECT.egg-info/dependency_links.txt +1 -0
mcdetect-2.0.15/mcDETECT.egg-info/requires.txt +9 -0
mcdetect-2.0.15/mcDETECT.egg-info/top_level.txt +1 -0
mcdetect-2.0.15/setup.cfg +4 -0
mcdetect-2.0.15/setup.py +20 -0

mcdetect-2.0.15/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 Chenyang Yuan
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

mcdetect-2.0.15/PKG-INFO ADDED

@@ -0,0 +1,40 @@
+Metadata-Version: 2.4
+Name: mcDETECT
+Version: 2.0.15
+Summary: Uncovering the dark transcriptome in polarized neuronal compartments with mcDETECT
+Home-page: https://github.com/chen-yang-yuan/mcDETECT
+Author: Chenyang Yuan
+Author-email: chenyang.yuan@emory.edu
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.6
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: anndata
+Requires-Dist: miniball
+Requires-Dist: numpy
+Requires-Dist: pandas
+Requires-Dist: rtree
+Requires-Dist: scanpy
+Requires-Dist: scikit-learn
+Requires-Dist: scipy
+Requires-Dist: shapely
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: license-file
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+# mcDETECT
+## Uncovering the dark transcriptome in polarized neuronal compartments with mcDETECT
+#### Chenyang Yuan, Krupa Patel, Hongshun Shi, Hsiao-Lin V. Wang, Feng Wang, Ronghua Li, Yangping Li, Victor G. Corces, Hailing Shi, Sulagna Das, Jindan Yu, Peng Jin, Bing Yao* and Jian Hu*
+mcDETECT is a computational framework designed to study the dark transcriptome related to polarized compartments in the brain using *in situ* spatial transcriptomics (iST) data. It begins by examining the subcellular distribution of mRNAs in an iST sample. Each mRNA molecule is treated as a distinct point with its own 3D spatial coordinates considering the thickness of the sample. Unlike many cell-type marker genes, which are typically found within the nucleus or soma, compartmentalized mRNAs often form small aggregates outside the soma. mcDETECT uses a density-based clustering approach to identify these extrasomatic aggregates. This involves calculating the Euclidean distance between mRNA points and defining the neighborhood of each point within a specified search radius. Points are then categorized as core points, border points, or noise points based on their reachability from neighboring points. mcDETECT recognizes each connected bundle of core and border points as an mRNA aggregate. To minimize false positives, it excludes aggregates that substantially overlap with somata, which are estimated by dilating the nuclear masks derived from DAPI staining. mcDETECT then repeats this process for multiple granule markers, merging aggregates from different markers that exhibit high spatial overlap. After aggregating across all markers, an additional filtering step removes aggregates containing mRNAs from negative control genes, which are known to be enriched exclusively in nuclei and somata. The remaining aggregates are considered individual RNA granules. mcDETECT then computes the minimum enclosing sphere for each aggregate to connect neighboring mRNA molecules from all measured genes and summarizes their counts, thereby defining the spatial transcriptome profile of individual RNA granules.
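
Reviewer note (not part of the package): the description's density-based step, where each point gets a neighborhood within a search radius and is then labeled core, border, or noise, can be illustrated with a small standard-library sketch. The function name `classify_points` and the toy coordinates are hypothetical. The sketch assumes that a neighborhood includes the point itself, which is the convention of scikit-learn's `min_samples`. The packaged code delegates this step to `sklearn.cluster.DBSCAN` rather than implementing it by hand.

```python
import math

def classify_points(points, eps, min_samples):
    """Label each 3D point as 'core', 'border', or 'noise' (DBSCAN-style).

    Hypothetical helper for illustration only; a point counts as its own neighbor.
    """
    n = len(points)
    # For every point, collect the indices of all points within eps (Euclidean distance).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not dense enough itself, but reachable from a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Made-up coordinates: a tight aggregate, one point on its edge, one isolated point.
pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0),
       (1.8, 0.0, 0.0), (10.0, 10.0, 0.0)]
labels = classify_points(pts, eps=1.5, min_samples=4)
print(labels)
```

Core and border points that are connected through core points would form one aggregate. The isolated point is discarded as noise.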

mcdetect-2.0.15/README.md ADDED

@@ -0,0 +1,7 @@
+# mcDETECT
+## Uncovering the dark transcriptome in polarized neuronal compartments with mcDETECT
+#### Chenyang Yuan, Krupa Patel, Hongshun Shi, Hsiao-Lin V. Wang, Feng Wang, Ronghua Li, Yangping Li, Victor G. Corces, Hailing Shi, Sulagna Das, Jindan Yu, Peng Jin, Bing Yao* and Jian Hu*
+mcDETECT is a computational framework designed to study the dark transcriptome related to polarized compartments in the brain using *in situ* spatial transcriptomics (iST) data. It begins by examining the subcellular distribution of mRNAs in an iST sample. Each mRNA molecule is treated as a distinct point with its own 3D spatial coordinates considering the thickness of the sample. Unlike many cell-type marker genes, which are typically found within the nucleus or soma, compartmentalized mRNAs often form small aggregates outside the soma. mcDETECT uses a density-based clustering approach to identify these extrasomatic aggregates. This involves calculating the Euclidean distance between mRNA points and defining the neighborhood of each point within a specified search radius. Points are then categorized as core points, border points, or noise points based on their reachability from neighboring points. mcDETECT recognizes each connected bundle of core and border points as an mRNA aggregate. To minimize false positives, it excludes aggregates that substantially overlap with somata, which are estimated by dilating the nuclear masks derived from DAPI staining. mcDETECT then repeats this process for multiple granule markers, merging aggregates from different markers that exhibit high spatial overlap. After aggregating across all markers, an additional filtering step removes aggregates containing mRNAs from negative control genes, which are known to be enriched exclusively in nuclei and somata. The remaining aggregates are considered individual RNA granules. mcDETECT then computes the minimum enclosing sphere for each aggregate to connect neighboring mRNA molecules from all measured genes and summarizes their counts, thereby defining the spatial transcriptome profile of individual RNA granules.
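
Reviewer note (not part of the package): the README's "merging aggregates from different markers that exhibit high spatial overlap" corresponds to the distance tests in `remove_overlaps` in `mcDETECT/model.py` in this diff. The sketch below restates those tests as a pure function so the decision rule can be read in isolation. The function name `overlap_action` and its string return values are hypothetical. The defaults `l = 1.0` and `rho = 0.2` are taken from the `mcDETECT.__init__` signature shown in the diff.

```python
import math

def overlap_action(center_a, r_a, center_b, r_b, l=1.0, rho=0.2):
    """Decide what to do with two spheres A and B from different markers.

    Hypothetical illustration of the distance tests in remove_overlaps:
    returns 'keep_both', 'keep_a', 'keep_b', or 'merge'.
    """
    dist = math.dist(center_a, center_b)
    radius_sum = r_a + r_b
    radius_diff = r_a - r_b
    if dist >= l * radius_sum:
        return "keep_both"   # the spheres do not overlap
    if dist <= l * abs(radius_diff):
        # One sphere lies inside the other: keep the larger one.
        return "keep_a" if radius_diff > 0 else "keep_b"
    if dist < rho * l * radius_sum:
        # Centers are very close: replace both with one sphere
        # (the package refits a minimum enclosing sphere to the pooled points).
        return "merge"
    return "keep_both"       # partial overlap below the merge threshold

origin = (0.0, 0.0, 0.0)
nested = overlap_action(origin, 2.0, (0.5, 0.0, 0.0), 1.0)
close = overlap_action(origin, 1.0, (0.3, 0.0, 0.0), 1.0)
partial = overlap_action(origin, 1.0, (1.5, 0.0, 0.0), 1.0)
print(nested, close, partial)
```

With the default `rho = 0.2`, two spheres that are not nested are merged only when their centers are closer than 20% of the sum of their radii. Spheres that overlap less than that are both kept.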

mcdetect-2.0.15/mcDETECT/__init__.py ADDED

@@ -0,0 +1,6 @@
+__version__ = "2.0.15"
+from . import model
+from . import utils
+__all__ = ["model", "utils"]

mcdetect-2.0.15/mcDETECT/model.py ADDED

@@ -0,0 +1,644 @@
+import anndata
+import math
+import miniball
+import numpy as np
+import pandas as pd
+import scanpy as sc
+from collections import Counter
+from rtree import index
+from scipy.sparse import csr_matrix
+from scipy.spatial import cKDTree
+from scipy.stats import poisson
+from shapely.geometry import Point
+from sklearn.cluster import DBSCAN
+from sklearn.preprocessing import OneHotEncoder
+from .utils import *
+class mcDETECT:
+    def __init__(self, type, transcripts, gnl_genes, nc_genes = None, eps = 1.5, minspl = None, grid_len = 1.0, cutoff_prob = 0.95, alpha = 5.0, low_bound = 3,
+                 size_thr = 4.0, in_soma_thr = (0.5, 0.5), l = 1.0, rho = 0.2, s = 1.0, nc_top = 20, nc_thr = 0.1):
+        self.type = type                        # string, iST platform, now support MERSCOPE, Xenium, and CosMx
+        self.transcripts = transcripts          # dataframe, transcripts file
+        self.gnl_genes = gnl_genes              # list, string, all granule markers
+        self.nc_genes = nc_genes                # list, string, all negative controls
+        self.eps = eps                          # numeric, searching radius epsilon
+        self.minspl = minspl                    # integer, manually select min_samples, i.e., no automatic parameter selection
+        self.grid_len = grid_len                # numeric, length of grids for computing the tissue area
+        self.cutoff_prob = cutoff_prob          # numeric, cutoff probability in parameter selection for min_samples
+        self.alpha = alpha                      # numeric, scaling factor in parameter selection for min_samples
+        self.low_bound = low_bound              # integer, lower bound in parameter selection for min_samples
+        self.size_thr = size_thr                # numeric, threshold for maximum radius of an aggregation
+        self.in_soma_thr = in_soma_thr          # 2-d tuple, threshold for low- and high-in-soma ratio
+        self.l = l                              # numeric, scaling factor for searching overlapped spheres
+        self.rho = rho                          # numeric, threshold for determining overlaps
+        self.s = s                              # numeric, scaling factor for merging overlapped spheres
+        self.nc_top = nc_top                    # integer, number of negative controls retained for filtering
+        self.nc_thr = nc_thr                    # numeric, threshold for negative control filtering
+    # [INNER] construct grids, input for tissue_area()
+    def construct_grid(self, grid_len = None):
+        if grid_len is None:
+            grid_len = self.grid_len
+        x_min, x_max = np.min(self.transcripts["global_x"]), np.max(self.transcripts["global_x"])
+        y_min, y_max = np.min(self.transcripts["global_y"]), np.max(self.transcripts["global_y"])
+        x_min = np.floor(x_min / grid_len) * grid_len
+        x_max = np.ceil(x_max / grid_len) * grid_len
+        y_min = np.floor(y_min / grid_len) * grid_len
+        y_max = np.ceil(y_max / grid_len) * grid_len
+        x_bins = np.arange(x_min, x_max + grid_len, grid_len)
+        y_bins = np.arange(y_min, y_max + grid_len, grid_len)
+        return x_bins, y_bins
+    # [INNER] calculate tissue area, input for poisson_select()
+    def tissue_area(self):
+        x_bins, y_bins = self.construct_grid(grid_len = None)
+        hist, _, _ = np.histogram2d(self.transcripts["global_x"], self.transcripts["global_y"], bins = [x_bins, y_bins])
+        area = np.count_nonzero(hist) * (self.grid_len ** 2)
+        return area
+    # [INNER] calculate optimal min_samples, input for dbscan()
+    def poisson_select(self, gene_name):
+        num_trans = np.sum(self.transcripts["target"] == gene_name)
+        bg_density = num_trans / self.tissue_area()
+        cutoff_density = poisson.ppf(self.cutoff_prob, mu = self.alpha * bg_density * (np.pi * self.eps ** 2))
+        optimal_m = int(max(cutoff_density, self.low_bound))
+        return optimal_m
+    # [INTERMEDIATE] dictionary, low- and high-in-soma spheres for each granule marker
+    def dbscan(self, target_names = None, record_cell_id = False, write_csv = False, write_path = "./"):
+        if self.type != "Xenium":
+            z_grid = list(self.transcripts["global_z"].unique())
+            z_grid.sort()
+        if target_names is None:
+            target_names = self.gnl_genes
+        transcripts = self.transcripts[self.transcripts["target"].isin(target_names)]
+        num_individual, data_low, data_high = [], {}, {}
+        for j in target_names:
+            # split transcripts
+            target = transcripts[transcripts["target"] == j]
+            others = transcripts[transcripts["target"] != j]
+            tree = make_tree(d1 = np.array(others["global_x"]), d2 = np.array(others["global_y"]), d3 = np.array(others["global_z"]))
+            # 3D DBSCAN
+            if self.minspl is None:
+                min_spl = self.poisson_select(j)
+            else:
+                min_spl = self.minspl
+            X = np.array(target[["global_x", "global_y", "global_z"]])
+            db = DBSCAN(eps = self.eps, min_samples = min_spl, algorithm = "kd_tree").fit(X)
+            labels = db.labels_
+            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
+            # iterate over all aggregations
+            cell_id, sphere_x, sphere_y, sphere_z, layer_z, sphere_r, sphere_size, sphere_comp, sphere_score = [], [], [], [], [], [], [], [], []
+            for k in range(n_clusters):
+                # ---------- find minimum enclosing spheres ---------- #
+                mask = (labels == k)
+                coords = X[mask]
+                if coords.shape[0] == 0:
+                    continue
+                temp = pd.DataFrame(coords, columns=["global_x", "global_y", "global_z"])
+                temp = temp.drop_duplicates()
+                coords_unique = temp.to_numpy()
+                # skip clusters with too few unique points
+                if coords_unique.shape[0] < self.low_bound:
+                    print(f"Skipping small cluster for gene {j}, cluster {k} (n = {coords_unique.shape[0]})")
+                    continue
+                # compute minimum enclosing sphere without singularity issues
+                try:
+                    center, r2 = miniball.get_bounding_ball(coords_unique, epsilon=1e-8)
+                except np.linalg.LinAlgError:
+                    print(f"Warning: singular matrix for gene {j}, cluster {k}; using fallback sphere.")
+                    center = coords_unique.mean(axis=0)
+                    dists = np.linalg.norm(coords_unique - center, axis=1)
+                    r2 = (dists.max() ** 2)
+                # record closest z-layer
+                if self.type != "Xenium":
+                    closest_z = closest(z_grid, center[2])
+                else:
+                    closest_z = center[2]
+                # record cell id after filtering
+                if record_cell_id:
+                    temp_target = target[labels == k]
+                    temp_cell_id_mode = temp_target["cell_id"].mode()[0]
+                    cell_id.append(temp_cell_id_mode)
+                # ---------- compute sphere features (size, composition, and in-soma ratio) ---------- #
+                temp_in_soma = np.sum(target["overlaps_nucleus"].values[mask])
+                temp_size = coords.shape[0]
+                other_idx = tree.query_ball_point([center[0], center[1], center[2]], np.sqrt(r2))
+                other_trans = others.iloc[other_idx]
+                other_in_soma = np.sum(other_trans["overlaps_nucleus"])
+                other_size = other_trans.shape[0]
+                other_comp = len(other_trans["target"].unique())
+                total_size = temp_size + other_size
+                total_comp = 1 + other_comp
+                in_soma_score = (temp_in_soma + other_in_soma) / total_size
+                # record sphere features
+                sphere_x.append(center[0])
+                sphere_y.append(center[1])
+                sphere_z.append(center[2])
+                layer_z.append(closest_z)
+                sphere_r.append(np.sqrt(r2))
+                sphere_size.append(total_size)
+                sphere_comp.append(total_comp)
+                sphere_score.append(in_soma_score)
+            # basic features for all spheres from each granule marker
+            sphere = pd.DataFrame(list(zip(sphere_x, sphere_y, sphere_z, layer_z, sphere_r, sphere_size, sphere_comp, sphere_score, [j] * len(sphere_x))),
+                                      columns = ["sphere_x", "sphere_y", "sphere_z", "layer_z", "sphere_r", "size", "comp", "in_soma_ratio", "gene"])
+            sphere = sphere.astype({"sphere_x": float, "sphere_y": float, "sphere_z": float, "layer_z": float, "sphere_r": float, "size": float, "comp": float, "in_soma_ratio": float, "gene": str})
+            if record_cell_id:
+                sphere["cell_id"] = cell_id
+                sphere = sphere.astype({"cell_id": str})
+            # split low- and high-in-soma spheres
+            sphere_low = sphere[(sphere["sphere_r"] < self.size_thr) & (sphere["in_soma_ratio"] < self.in_soma_thr[0])]
+            sphere_high = sphere[(sphere["sphere_r"] < self.size_thr) & (sphere["in_soma_ratio"] > self.in_soma_thr[1])]
+            if write_csv:
+                sphere_low.to_csv(write_path + j + " sphere.csv", index=0)
+                sphere_high.to_csv(write_path + j + " sphere_high.csv", index=0)
+            num_individual.append(sphere_low.shape[0])
+            data_low[target_names.index(j)] = sphere_low
+            data_high[target_names.index(j)] = sphere_high
+            print("{} out of {} genes processed!".format(target_names.index(j) + 1, len(target_names)))
+        return np.sum(num_individual), data_low, data_high
+    # [INNER] merge points from two overlapped spheres, input for remove_overlaps()
+    def find_points(self, sphere_a, sphere_b):
+        transcripts = self.transcripts[self.transcripts["target"].isin(self.gnl_genes)]
+        tree_temp = make_tree(d1 = np.array(transcripts["global_x"]), d2 = np.array(transcripts["global_y"]), d3 = np.array(transcripts["global_z"]))
+        idx_a = tree_temp.query_ball_point([sphere_a["sphere_x"], sphere_a["sphere_y"], sphere_a["sphere_z"]], sphere_a["sphere_r"])
+        points_a = transcripts.iloc[idx_a]
+        points_a = points_a[points_a["target"] == sphere_a["gene"]]
+        idx_b = tree_temp.query_ball_point([sphere_b["sphere_x"], sphere_b["sphere_y"], sphere_b["sphere_z"]], sphere_b["sphere_r"])
+        points_b = transcripts.iloc[idx_b]
+        points_b = points_b[points_b["target"] == sphere_b["gene"]]
+        points = pd.concat([points_a, points_b])
+        points = points[["global_x", "global_y", "global_z"]]
+        return points
+    def remove_overlaps(self, set_a, set_b):
+        set_a = set_a.copy()
+        set_b = set_b.copy()
+        # find possible overlaps on 2D by r-tree
+        idx_b = make_rtree(set_b)
+        for i, sphere_a in set_a.iterrows():
+            center_a_3D = np.array([sphere_a.sphere_x, sphere_a.sphere_y, sphere_a.sphere_z])
+            bounds_a = (sphere_a.sphere_x - sphere_a.sphere_r,
+                        sphere_a.sphere_y - sphere_a.sphere_r,
+                        sphere_a.sphere_x + sphere_a.sphere_r,
+                        sphere_a.sphere_y + sphere_a.sphere_r)
+            possible_overlaps = idx_b.intersection(bounds_a)
+            # search 3D overlaps within possible overlaps
+            for j in possible_overlaps:
+                if j in set_b.index:
+                    sphere_b = set_b.loc[j]
+                    center_b_3D = np.array([sphere_b.sphere_x, sphere_b.sphere_y, sphere_b.sphere_z])
+                    dist = np.linalg.norm(center_a_3D - center_b_3D)
+                    radius_sum = sphere_a.sphere_r + sphere_b.sphere_r
+                    radius_diff = sphere_a.sphere_r - sphere_b.sphere_r
+                    # relative positions (0: internal & intersect, 1: internal, 2: intersect)
+                    c0 = (dist < self.l * radius_sum)
+                    c1 = (dist <= self.l * np.abs(radius_diff))
+                    c1_1 = (radius_diff > 0)
+                    c2_1 = (dist < self.rho * self.l * radius_sum)
+                    # operations on dataframes
+                    if c0:
+                        if c1 and c1_1:                             # keep A and remove B
+                            set_b.drop(index = j, inplace = True)
+                        elif c1 and not c1_1:                       # replace A with B and remove B
+                            set_a.loc[i] = set_b.loc[j]
+                            set_b.drop(index = j, inplace = True)
+                        elif not c1 and c2_1:                       # replace A with new sphere and remove B
+                            points_union = np.array(self.find_points(sphere_a, sphere_b))
+                            new_center, new_r2 = miniball.get_bounding_ball(points_union, epsilon=1e-8)  # returns (center, squared radius)
+                            set_a.loc[i, "sphere_x"] = new_center[0]
+                            set_a.loc[i, "sphere_y"] = new_center[1]
+                            set_a.loc[i, "sphere_z"] = new_center[2]
+                            set_a.loc[i, "sphere_r"] = self.s * np.sqrt(new_r2)
+                            set_b.drop(index = j, inplace = True)
+        set_a = set_a.reset_index(drop = True)
+        set_b = set_b.reset_index(drop = True)
+        return set_a, set_b
+    # [INNER] merge spheres from different granule markers, input for detect()
+    def merge_sphere(self, sphere_dict):
+        sphere = sphere_dict[0].copy()
+        for j in range(1, len(self.gnl_genes)):
+            target_sphere = sphere_dict[j]
+            sphere, target_sphere_new = self.remove_overlaps(sphere, target_sphere)
+            sphere = pd.concat([sphere, target_sphere_new])
+            sphere = sphere.reset_index(drop = True)
+        return sphere
+    # [INNER] negative control filtering, input for detect()
+    def nc_filter(self, sphere_low, sphere_high):
+        # negative control gene profiling
+        adata_low = self.profile(sphere_low, self.nc_genes)
+        adata_high = self.profile(sphere_high, self.nc_genes)
+        adata = anndata.concat([adata_low, adata_high], axis = 0, merge = "same")
+        adata.var["genes"] = adata.var.index
+        adata.obs_keys = list(np.arange(adata.shape[0]))
+        adata.obs["type"] = ["low"] * adata_low.shape[0] + ["high"] * adata_high.shape[0]
+        adata.obs["type"] = pd.Categorical(adata.obs["type"], categories = ["low", "high"], ordered = True)
+        # DE analysis of negative control genes
+        sc.tl.rank_genes_groups(adata, "type", method = "t-test")
+        names = adata.uns["rank_genes_groups"]["names"]
+        names = pd.DataFrame(names)
+        logfc = adata.uns["rank_genes_groups"]["logfoldchanges"]
+        logfc = pd.DataFrame(logfc)
+        pvals = adata.uns["rank_genes_groups"]["pvals"]
+        pvals = pd.DataFrame(pvals)
+        # select top upregulated negative control genes
+        df = pd.DataFrame({"names": names["high"], "logfc": logfc["high"], "pvals": pvals["high"]})
+        df = df[df["logfc"] >= 0]
+        df = df.sort_values(by = ["pvals"], ascending = True)
+        nc_genes_final = list(df["names"].head(self.nc_top))
+        # negative control filtering
+        nc_transcripts_final = self.transcripts[self.transcripts["target"].isin(nc_genes_final)]
+        tree = make_tree(d1 = np.array(nc_transcripts_final["global_x"]), d2 = np.array(nc_transcripts_final["global_y"]), d3 = np.array(nc_transcripts_final["global_z"]))
+        centers = sphere_low[["sphere_x", "sphere_y", "sphere_z"]].to_numpy()
+        radii = sphere_low["sphere_r"].to_numpy()
+        sizes = sphere_low["size"].to_numpy()
+        counts = np.array([len(tree.query_ball_point(c, r)) for c, r in zip(centers, radii)])
+        nc_ratio = counts / sizes
+        sphere = sphere_low.copy().reset_index(drop=True)
+        sphere["nc_ratio"] = nc_ratio
+        if self.nc_thr is None:
+            return sphere
+        pass_idx = (counts == 0) | (nc_ratio < self.nc_thr)
+        return sphere.loc[pass_idx].reset_index(drop=True)
+    # [MAIN] dataframe, granule metadata
+    def detect(self, record_cell_id = False):
+        _, data_low, data_high = self.dbscan(record_cell_id = record_cell_id)
+        print("Merging spheres...")
+        sphere_low, sphere_high = self.merge_sphere(data_low), self.merge_sphere(data_high)
+        if self.nc_genes is None:
+            return sphere_low
+        else:
+            print("Negative control filtering...")
+            return self.nc_filter(sphere_low, sphere_high)
+    # [MAIN] anndata, granule spatial transcriptome profile
+    def profile(self, granule, genes = None, buffer = 0.0, print_itr = False):
+        if genes is None:
+            genes = list(self.transcripts["target"].unique())
+            transcripts = self.transcripts
+        else:
+            transcripts = self.transcripts[self.transcripts["target"].isin(genes)]
+        gene_to_idx = {g: i for i, g in enumerate(genes)}
+        gene_array = transcripts["target"].to_numpy()
+        tree = make_tree(d1 = np.array(transcripts["global_x"]), d2 = np.array(transcripts["global_y"]), d3 = np.array(transcripts["global_z"]))
+        n_gnl = granule.shape[0]
+        n_gene = len(genes)
+        data, row_idx, col_idx = [], [], []
+        # iterate over all granules to count nearby transcripts
+        for i in range(n_gnl):
+            temp = granule.iloc[i]
+            target_idx = tree.query_ball_point([temp["sphere_x"], temp["sphere_y"], temp["layer_z"]], temp["sphere_r"] + buffer)
+            if not target_idx:
+                continue
+            local_genes = gene_array[target_idx]    # extract genes for those nearby transcripts
+            counts = Counter(local_genes)           # count how many times each gene occurs
+            for g, cnt in counts.items():           # append nonzero entries to sparse matrix lists
+                j = gene_to_idx[g]                  # get gene column index
+                data.append(cnt)                    # nonzero count
+                row_idx.append(i)                   # row index = granule index
+                col_idx.append(j)                   # column index = gene index
+            if print_itr and (i % 5000 == 0):
+                print(f"{i} out of {n_gnl} granules profiled!")
+        # construct sparse spatial transcriptome profile, (n_granules × n_genes)
+        X = csr_matrix((data, (row_idx, col_idx)), shape = (n_gnl, n_gene), dtype = np.float32)
+        adata = anndata.AnnData(X = X, obs = granule.copy())
+        adata.obs["granule_id"] = [f"gnl_{i}" for i in range(n_gnl)]
+        adata.obs = adata.obs.astype({"granule_id": str})
+        adata.obs.rename(columns = {"sphere_x": "global_x", "sphere_y": "global_y", "sphere_z": "global_z"}, inplace = True)
+        adata.var["genes"] = genes
+        adata.var_names = genes
+        adata.var_keys = genes
+        return adata
+    # [MAIN] anndata, spot-level gene expression
+    def spot_expression(self, grid_len, genes = None):
+        if genes is None:
+            genes = list(self.transcripts["target"].unique())
+            transcripts = self.transcripts
+        else:
+            transcripts = self.transcripts[self.transcripts["target"].isin(genes)]
+        # construct bins
+        x_bins, y_bins = self.construct_grid(grid_len = grid_len)
+        # initialize data
+        X = np.zeros((len(genes), (len(x_bins) - 1) * (len(y_bins) - 1)))
+        global_x, global_y = [], []
+        # coordinates
+        for i in list(x_bins)[:-1]:
+            center_x = i + 0.5 * grid_len
+            for j in list(y_bins)[:-1]:
+                center_y = j + 0.5 * grid_len
+                global_x.append(center_x)
+                global_y.append(center_y)
+        # count matrix
+        for k_idx, k in enumerate(genes):
+            target_gene = transcripts[transcripts["target"] == k]
+            count_gene, _, _ = np.histogram2d(target_gene["global_x"], target_gene["global_y"], bins = [x_bins, y_bins])
+            X[k_idx, :] = count_gene.flatten()
+            if k_idx % 100 == 0:
+                print("{} out of {} genes profiled!".format(k_idx, len(genes)))
+        # spot id
+        spot_id = []
+        for i in range(len(global_x)):
+            id = "spot_" + str(i)
+            spot_id.append(id)
+        # assemble data
+        adata = anndata.AnnData(X = np.transpose(X))
+        adata.obs["spot_id"] = spot_id
+        adata.obs["global_x"] = global_x
+        adata.obs["global_y"] = global_y
+        adata.var["genes"] = genes
+        adata.var_names = genes
+        adata.var_keys = genes
+        return adata
+# [MAIN] anndata, spot-level neuron metadata
+def spot_neuron(adata_neuron, spot, grid_len = 50, neuron_loc_key = ["global_x", "global_y"], spot_loc_key = ["global_x", "global_y"]):
+    adata_neuron = adata_neuron.copy()
+    neurons = adata_neuron.obs
+    spot = spot.copy()
+    half_len = grid_len / 2
+    indicator, neuron_count = [], []
+    for _, row in spot.obs.iterrows():
+        x = row[spot_loc_key[0]]
+        y = row[spot_loc_key[1]]
+        neuron_temp = neurons[(neurons[neuron_loc_key[0]] > x - half_len) & (neurons[neuron_loc_key[0]] < x + half_len) & (neurons[neuron_loc_key[1]] > y - half_len) & (neurons[neuron_loc_key[1]] < y + half_len)]
+        indicator.append(int(len(neuron_temp) > 0))
+        neuron_count.append(len(neuron_temp))
+    spot.obs["indicator"] = indicator
+    spot.obs["neuron_count"] = neuron_count
+    return spot
+# [MAIN] anndata, spot-level granule metadata
+def spot_granule(granule, spot, grid_len = 50, gnl_loc_key = ["sphere_x", "sphere_y"], spot_loc_key = ["global_x", "global_y"]):
+    granule = granule.copy()
+    spot = spot.copy()
+    half_len = grid_len / 2
+    indicator, granule_count, granule_radius, granule_size, granule_score = [], [], [], [], []
+    for _, row in spot.obs.iterrows():
+        x = row[spot_loc_key[0]]
+        y = row[spot_loc_key[1]]
+        gnl_temp = granule[(granule[gnl_loc_key[0]] >= x - half_len) & (granule[gnl_loc_key[0]] < x + half_len) & (granule[gnl_loc_key[1]] >= y - half_len) & (granule[gnl_loc_key[1]] < y + half_len)]
+        indicator.append(int(len(gnl_temp) > 0))
+        granule_count.append(len(gnl_temp))
+        if len(gnl_temp) == 0:
+            granule_radius.append(0)
+            granule_size.append(0)
+            granule_score.append(0)
+        else:
+            granule_radius.append(np.nanmean(gnl_temp["sphere_r"]))
+            granule_size.append(np.nanmean(gnl_temp["size"]))
+            granule_score.append(np.nanmean(gnl_temp["in_soma_ratio"]))
+    spot.obs["indicator"] = indicator
+    spot.obs["gnl_count"] = granule_count
+    spot.obs["gnl_radius"] = granule_radius
+    spot.obs["gnl_size"] = granule_size
+    spot.obs["gnl_score"] = granule_score
+    return spot
+# [MAIN] anndata, neuron-granule colocalization
+def neighbor_granule(adata_neuron, granule_adata, radius = 10, sigma = None, loc_key = ["global_x", "global_y"]):
+    adata_neuron = adata_neuron.copy()
+    granule_adata = granule_adata.copy()
+    if sigma is None:
+        sigma = radius / 2
+    # neuron and granule coordinates
+    neuron_coords = adata_neuron.obs[loc_key].values
+    gnl_coords = granule_adata.obs[loc_key].values
+    # make tree
+    tree = make_tree(d1 = gnl_coords[:, 0], d2 = gnl_coords[:, 1])
+    # query neighboring granules for each neuron
+    neighbor_indices = tree.query_ball_point(neuron_coords, r = radius)
+    # record count and indices
+    granule_counts = np.array([len(indices) for indices in neighbor_indices])
+    adata_neuron.obs["neighbor_gnl_count"] = granule_counts
+    adata_neuron.uns["neighbor_gnl_indices"] = neighbor_indices
+    # ---------- neighboring granule expression matrix ---------- #
+    n_neurons, n_genes = adata_neuron.n_obs, adata_neuron.n_vars
+    weighted_expr = np.zeros((n_neurons, n_genes))
+    for i, indices in enumerate(neighbor_indices):
+        if len(indices) == 0:
+            continue
+        distances = np.linalg.norm(gnl_coords[indices] - neuron_coords[i], axis = 1)
+        weights = np.exp(- (distances ** 2) / (2 * sigma ** 2))
+        weights = weights / weights.sum()
+        weighted_expr[i] = np.average(granule_adata.X[indices], axis = 0, weights = weights)
+    adata_neuron.obsm["weighted_gnl_expression"] = weighted_expr
+    # ---------- neighboring granule spatial feature ---------- #
+    features = []
+    for i, gnl_idx in enumerate(neighbor_indices):
+        feats = {}
+        feats["n_granules"] = len(gnl_idx)
+        if len(gnl_idx) == 0:
+            feats.update({"mean_distance": np.nan, "std_distance": np.nan, "radius_max": np.nan, "radius_min": np.nan, "density": 0, "center_offset_norm": np.nan, "anisotropy_ratio": np.nan})
+        else:
+            gnl_pos = gnl_coords[gnl_idx]
+            neuron_pos = neuron_coords[i]
+            dists = np.linalg.norm(gnl_pos - neuron_pos, axis = 1)
+            feats["mean_distance"] = dists.mean()
+            feats["std_distance"] = dists.std()
+            feats["radius_max"] = dists.max()
+            feats["radius_min"] = dists.min()
+            feats["density"] = len(gnl_idx) / (np.pi * radius ** 2)
+            centroid = gnl_pos.mean(axis = 0)
+            offset = centroid - neuron_pos
+            feats["center_offset_norm"] = np.linalg.norm(offset)
+            cov = np.cov((gnl_pos - neuron_pos).T)
+            eigvals = np.linalg.eigvalsh(cov)
+            if np.min(eigvals) > 0:
+                feats["anisotropy_ratio"] = np.max(eigvals) / np.min(eigvals)
+            else:
+                feats["anisotropy_ratio"] = np.nan
+        features.append(feats)
+    spatial_df = pd.DataFrame(features, index = adata_neuron.obs_names)
+    return adata_neuron, spatial_df
+# [MAIN] numpy array, neuron embeddings based on neighboring granules
+def neuron_embedding_one_hot(adata_neuron, granule_adata, k = 10, radius = 10, loc_key = ["global_x", "global_y"], gnl_subtype_key = "granule_subtype_kmeans", padding_value = "Others"):
+    adata_neuron = adata_neuron.copy()
+    granule_adata = granule_adata.copy()
+    # neuron and granule coordinates, granule subtypes
+    neuron_coords = adata_neuron.obs[loc_key].to_numpy()
+    granule_coords = granule_adata.obs[loc_key].to_numpy()
+    granule_subtypes = granule_adata.obs[gnl_subtype_key].astype(str).to_numpy()
+    # include padding category
+    unique_subtypes = np.unique(granule_subtypes).tolist()
+    if padding_value not in unique_subtypes:
+        unique_subtypes.append(padding_value)
+    encoder = OneHotEncoder(categories = [unique_subtypes], sparse_output = False, handle_unknown = "ignore")
+    encoder.fit(np.array(unique_subtypes).reshape(-1, 1))
+    S = len(unique_subtypes)
+    # k-d tree
+    tree = make_tree(d1 = granule_coords[:, 0], d2 = granule_coords[:, 1])
+    distances, indices = tree.query(neuron_coords, k = k, distance_upper_bound = radius)
+    # initialize output
+    n_neurons = neuron_coords.shape[0]
+    embeddings = np.zeros((n_neurons, k, S), dtype = float)
+    for i in range(n_neurons):
+        for j in range(k):
+            idx = indices[i, j]
+            dist = distances[i, j]
+            if idx == granule_coords.shape[0] or np.isinf(dist):
+                subtype = padding_value
+            else:
+                subtype = granule_subtypes[idx]
+            onehot = encoder.transform([[subtype]])[0]
+            embeddings[i, j, :] = onehot
+    return embeddings, encoder.categories_[0]
+# [MAIN] numpy array, neuron embeddings based on neighboring granules
+def neuron_embedding_spatial_weight(adata_neuron, granule_adata, radius = 10, sigma = 10, loc_key = ["global_x", "global_y"], gnl_subtype_key = "granule_subtype_kmeans", padding_value = "Others"):
+    adata_neuron = adata_neuron.copy()
+    granule_adata = granule_adata.copy()
+    # neuron and granule coordinates, granule subtypes
+    neuron_coords = adata_neuron.obs[loc_key].to_numpy()
+    granule_coords = granule_adata.obs[loc_key].to_numpy()
+    granule_subtypes = granule_adata.obs[gnl_subtype_key].astype(str).to_numpy()
+    # include padding category
+    unique_subtypes = np.unique(granule_subtypes).tolist()
+    if padding_value not in unique_subtypes:
+        unique_subtypes.append(padding_value)
+    encoder = OneHotEncoder(categories = [unique_subtypes], sparse_output = False, handle_unknown = "ignore")
+    encoder.fit(np.array(unique_subtypes).reshape(-1, 1))
+    S = len(unique_subtypes)
+    # k-d tree
+    tree = make_tree(d1 = granule_coords[:, 0], d2 = granule_coords[:, 1])
+    all_neighbors = tree.query_ball_point(neuron_coords, r = radius)
+    # initialize output
+    n_neurons = neuron_coords.shape[0]
+    embeddings = np.zeros((n_neurons, S), dtype = float)
+    for i, neighbor_indices in enumerate(all_neighbors):
+        if not neighbor_indices:
+            # no neighbors, assign to padding subtype
+            embeddings[i] = encoder.transform([[padding_value]])[0]
+            continue
+        # get neighbor subtypes and distances
+        neighbor_coords = granule_coords[neighbor_indices]
+        dists = np.linalg.norm(neuron_coords[i] - neighbor_coords, axis = 1)
+        weights = np.exp(- dists / sigma)
+        # encode subtypes to one-hot and weight them
+        subtypes = granule_subtypes[neighbor_indices]
+        onehots = encoder.transform(subtypes.reshape(-1, 1))
+        weighted_sum = (weights[:, np.newaxis] * onehots).sum(axis = 0)
+        # normalize to make it a composition vector
+        embeddings[i] = weighted_sum / weights.sum()
+    return embeddings, encoder.categories_[0]
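
Note on the two embedding functions above: both rely on how `cKDTree` reports missing neighbors. The following is a minimal standalone sketch, not the package's code, using hypothetical toy coordinates and labels. It shows that a bounded k-nearest query pads missing neighbors with an infinite distance and an index equal to the number of tree points, and how exponentially weighted neighbor labels can be normalized into a composition vector.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical toy data: 4 granules with subtype labels, 2 neurons.
granule_coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [50.0, 50.0]])
granule_subtypes = np.array(["A", "B", "A", "B"])
neuron_coords = np.array([[0.0, 0.0], [100.0, 100.0]])
categories = ["A", "B", "Others"]  # "Others" plays the role of the padding category

tree = cKDTree(granule_coords)

# Bounded k-nearest query: a missing neighbor is reported with distance inf
# and an index equal to the number of points in the tree.
distances, indices = tree.query(neuron_coords, k=3, distance_upper_bound=10)
padded = indices == granule_coords.shape[0]

# Ball query plus exponential distance weights, normalized to a composition.
sigma = 10.0
compositions = np.zeros((len(neuron_coords), len(categories)))
for i, neighbors in enumerate(tree.query_ball_point(neuron_coords, r=10)):
    if not neighbors:
        # No granule within the radius: assign everything to the padding category.
        compositions[i, categories.index("Others")] = 1.0
        continue
    dists = np.linalg.norm(granule_coords[neighbors] - neuron_coords[i], axis=1)
    weights = np.exp(-dists / sigma)
    for w, label in zip(weights, granule_subtypes[neighbors]):
        compositions[i, categories.index(label)] += w
    compositions[i] /= weights.sum()

print(padded)
print(compositions)
```

Each row of `compositions` sums to 1, so rows can be compared across neurons regardless of how many granules fall inside the radius.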

mcdetect-2.0.15/mcDETECT/utils.py ADDED

@@ -0,0 +1,145 @@
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+import seaborn as sns
+from matplotlib import colors as mcolors
+from rtree import index
+from scipy.spatial import cKDTree
+from scipy.stats import rankdata
+from shapely.geometry import Point
+def find_threshold_index(cumsum_list, threshold = 0.99):
+    total = cumsum_list[-1]
+    for i, value in enumerate(cumsum_list):
+        if value >= threshold * total:
+            return i
+    return None
+def closest(lst, K):
+    return lst[min(range(len(lst)), key = lambda i: abs(lst[i] - K))]
+def make_tree(d1 = None, d2 = None, d3 = None):
+    active_dimensions = [dimension for dimension in [d1, d2, d3] if dimension is not None]
+    if len(active_dimensions) == 1:
+        points = np.c_[active_dimensions[0].ravel()]
+    elif len(active_dimensions) == 2:
+        points = np.c_[active_dimensions[0].ravel(), active_dimensions[1].ravel()]
+    elif len(active_dimensions) == 3:
+        points = np.c_[active_dimensions[0].ravel(), active_dimensions[1].ravel(), active_dimensions[2].ravel()]
+    return cKDTree(points)
+def make_rtree(spheres):
+    p = index.Property()
+    idx = index.Index(properties = p)
+    for i, sphere in enumerate(spheres.itertuples()):
+        center = Point(sphere.sphere_x, sphere.sphere_y)
+        bounds = (center.x - sphere.sphere_r,
+                  center.y - sphere.sphere_r,
+                  center.x + sphere.sphere_r,
+                  center.y + sphere.sphere_r)
+        idx.insert(i, bounds)
+    return idx
+def scale(array, max = 1):
+    new_array = (array - np.min(array)) / (np.max(array) - np.min(array)) * max
+    return new_array
+def weighted_corr(estimated, actual, weights):
+    estimated = np.array(estimated)
+    actual = np.array(actual)
+    weights = np.array(weights)
+    # weighted mean
+    mean_estimated = np.average(estimated, weights = weights)
+    mean_actual = np.average(actual, weights = weights)
+    # weighted covariance
+    cov_w = np.sum(weights * (estimated - mean_estimated) * (actual - mean_actual)) / np.sum(weights)
+    # weighted variances
+    var_estimated = np.sum(weights * (estimated - mean_estimated) ** 2) / np.sum(weights)
+    var_actual = np.sum(weights * (actual - mean_actual) ** 2) / np.sum(weights)
+    # weighted correlation coefficient
+    weighted_corr = cov_w / np.sqrt(var_estimated * var_actual)
+    return weighted_corr
+def weighted_spearmanr(A, B, weights):
+    A = np.array(A)
+    B = np.array(B)
+    weights = np.array(weights)
+    # rank the data
+    R_A = rankdata(A)
+    R_B = rankdata(B)
+    # weighted mean
+    mean_R_A_w = np.average(R_A, weights=weights)
+    mean_R_B_w = np.average(R_B, weights=weights)
+    # weighted covariance
+    cov_w = np.sum(weights * (R_A - mean_R_A_w) * (R_B - mean_R_B_w)) / np.sum(weights)
+    # weighted variances
+    var_R_A_w = np.sum(weights * (R_A - mean_R_A_w)**2) / np.sum(weights)
+    var_R_B_w = np.sum(weights * (R_B - mean_R_B_w)**2) / np.sum(weights)
+    # weighted Spearman correlation coefficient
+    weighted_spearman_corr = cov_w / np.sqrt(var_R_A_w * var_R_B_w)
+    return weighted_spearman_corr
+def assign_palette_to_adata(adata, obs_key = "granule_expr_cluster_hierarchical", self_defined = False, cmap_name = "tab10"):
+    adata = adata.copy()
+    if not isinstance(adata.obs[obs_key].dtype, pd.CategoricalDtype):
+        adata.obs[obs_key] = adata.obs[obs_key].astype("category")
+    categories = adata.obs[obs_key].cat.categories
+    n_categories = len(categories)
+    if self_defined:
+        cmap = plt.colormaps[cmap_name]
+        color_palette = [cmap(i) for i in range(n_categories)]
+    else:
+        base_colors = plt.get_cmap(cmap_name).colors
+        if n_categories > len(base_colors):
+            color_palette = sns.color_palette(cmap_name, n_categories)
+        else:
+            color_palette = base_colors[:n_categories]
+    adata.uns[f"{obs_key}_colors"] = [mcolors.to_hex(c) for c in color_palette]
+    return adata
+def p_val_to_star(p):
+    if p > 0.05:
+        return "ns"
+    elif p > 0.01:
+        return "*"
+    elif p > 0.001:
+        return "**"
+    else:
+        return "***"
+def top_columns_above_threshold(row, threshold=0.5):
+    sorted_row = row.sort_values(ascending=False)
+    cumsum = sorted_row.cumsum()
+    # Find how many top columns are needed to exceed the threshold
+    n = (cumsum > threshold).idxmax()
+    # Slice up to and including the index that crosses the threshold
+    return sorted_row.loc[:n].index.tolist()
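
Note on the weighted correlation helpers above: the sketch below is a standalone re-implementation of the same formula, with a hypothetical function name and toy data. It checks two properties that follow from the definition: uniform weights reproduce the ordinary Pearson coefficient, and a zero weight is equivalent to dropping that observation.

```python
import numpy as np

def weighted_pearson(x, y, w):
    """Weighted Pearson correlation (hypothetical helper for illustration)."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    mean_x = np.average(x, weights=w)
    mean_y = np.average(y, weights=w)
    # Weighted covariance and variances share the same normalizer, sum(w).
    cov = np.sum(w * (x - mean_x) * (y - mean_y)) / np.sum(w)
    var_x = np.sum(w * (x - mean_x) ** 2) / np.sum(w)
    var_y = np.sum(w * (y - mean_y) ** 2) / np.sum(w)
    return cov / np.sqrt(var_x * var_y)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Uniform weights: should match the unweighted Pearson coefficient.
uniform = weighted_pearson(x, y, np.ones_like(x))
reference = np.corrcoef(x, y)[0, 1]

# Zero weight on the last point: should match the coefficient without it.
dropped = weighted_pearson(x, y, [1, 1, 1, 1, 0])

print(uniform, reference, dropped)
```

The Spearman variant applies the same formula to ranks (for example from `scipy.stats.rankdata`) instead of raw values.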

mcdetect-2.0.15/mcDETECT.egg-info/PKG-INFO ADDED

@@ -0,0 +1,40 @@
+Metadata-Version: 2.4
+Name: mcDETECT
+Version: 2.0.15
+Summary: Uncovering the dark transcriptome in polarized neuronal compartments with mcDETECT
+Home-page: https://github.com/chen-yang-yuan/mcDETECT
+Author: Chenyang Yuan
+Author-email: chenyang.yuan@emory.edu
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.6
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: anndata
+Requires-Dist: miniball
+Requires-Dist: numpy
+Requires-Dist: pandas
+Requires-Dist: rtree
+Requires-Dist: scanpy
+Requires-Dist: scikit-learn
+Requires-Dist: scipy
+Requires-Dist: shapely
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: license-file
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+# mcDETECT
+## Uncovering the dark transcriptome in polarized neuronal compartments with mcDETECT
+#### Chenyang Yuan, Krupa Patel, Hongshun Shi, Hsiao-Lin V. Wang, Feng Wang, Ronghua Li, Yangping Li, Victor G. Corces, Hailing Shi, Sulagna Das, Jindan Yu, Peng Jin, Bing Yao* and Jian Hu*
+mcDETECT is a computational framework designed to study the dark transcriptome related to polarized compartments in the brain using *in situ* spatial transcriptomics (iST) data. It begins by examining the subcellular distribution of mRNAs in an iST sample. Each mRNA molecule is treated as a distinct point with its own 3D spatial coordinates, accounting for the thickness of the sample. Unlike many cell-type marker genes, which are typically found within the nucleus or soma, compartmentalized mRNAs often form small aggregates outside the soma. mcDETECT uses a density-based clustering approach to identify these extrasomatic aggregates. This involves calculating the Euclidean distance between mRNA points and defining the neighborhood of each point within a specified search radius. Points are then categorized as core points, border points, or noise points based on their reachability from neighboring points. mcDETECT recognizes each connected bundle of core and border points as an mRNA aggregate. To minimize false positives, it excludes aggregates that substantially overlap with somata, which are estimated by dilating the nuclear masks derived from DAPI staining. mcDETECT then repeats this process for multiple granule markers, merging aggregates from different markers that exhibit high spatial overlap. After aggregating across all markers, an additional filtering step removes aggregates containing mRNAs from negative control genes, which are known to be enriched exclusively in nuclei and somata. The remaining aggregates are considered individual RNA granules. mcDETECT then computes the minimum enclosing sphere for each aggregate to connect neighboring mRNA molecules from all measured genes and summarizes their counts, thereby defining the spatial transcriptome profile of individual RNA granules.
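
Note on the description above: the core/border/noise scheme it describes can be illustrated with a short sketch. This is not the package's implementation. It assumes scikit-learn's `DBSCAN` as a stand-in for the density-based clustering step, uses synthetic 3D points, and replaces the minimum enclosing sphere with a cruder bound (centroid plus maximum member distance), which always contains the aggregate but is generally not the smallest such sphere.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical 3D mRNA coordinates: two tight aggregates plus sparse background.
aggregate_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.2, size=(30, 3))
aggregate_b = rng.normal(loc=[10.0, 10.0, 1.0], scale=0.2, size=(30, 3))
background = rng.uniform(low=-50.0, high=50.0, size=(20, 3))
points = np.vstack([aggregate_a, aggregate_b, background])

# eps is the search radius; min_samples is the neighbor count that makes a
# point a core point. Points labeled -1 are noise. Both values are assumed.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)

# Bounding sphere per aggregate: centroid plus the largest member distance.
# This is an upper bound on the minimum enclosing sphere, not the sphere itself.
spheres = {}
for label in sorted(set(labels) - {-1}):
    members = points[labels == label]
    center = members.mean(axis=0)
    radius = np.linalg.norm(members - center, axis=1).max()
    spheres[label] = (center, radius)

for label, (center, radius) in spheres.items():
    print(label, np.round(center, 2), round(float(radius), 3))
```

Counting all molecules that fall inside each sphere, across every measured gene, would then give a per-aggregate expression profile in the spirit of the final step described above.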

mcdetect-2.0.15/mcDETECT.egg-info/SOURCES.txt ADDED

@@ -0,0 +1,11 @@
+LICENSE
+README.md
+setup.py
+mcDETECT/__init__.py
+mcDETECT/model.py
+mcDETECT/utils.py
+mcDETECT.egg-info/PKG-INFO
+mcDETECT.egg-info/SOURCES.txt
+mcDETECT.egg-info/dependency_links.txt
+mcDETECT.egg-info/requires.txt
+mcDETECT.egg-info/top_level.txt

mcdetect-2.0.15/mcDETECT.egg-info/dependency_links.txt ADDED

@@ -0,0 +1 @@
+

mcdetect-2.0.15/mcDETECT.egg-info/requires.txt ADDED

@@ -0,0 +1,9 @@
+anndata
+miniball
+numpy
+pandas
+rtree
+scanpy
+scikit-learn
+scipy
+shapely

mcdetect-2.0.15/mcDETECT.egg-info/top_level.txt ADDED

@@ -0,0 +1 @@
+mcDETECT

mcdetect-2.0.15/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0

mcdetect-2.0.15/setup.py ADDED

@@ -0,0 +1,20 @@
+from setuptools import setup, find_packages
+setup(
+    name = "mcDETECT",
+    version = "2.0.15",
+    packages = find_packages(),
+    install_requires = ["anndata", "miniball", "numpy", "pandas", "rtree", "scanpy", "scikit-learn", "scipy", "shapely"],
+    author = "Chenyang Yuan",
+    author_email = "chenyang.yuan@emory.edu",
+    description = "Uncovering the dark transcriptome in polarized neuronal compartments with mcDETECT",
+    long_description = open("README.md").read(),
+    long_description_content_type = "text/markdown",
+    url = "https://github.com/chen-yang-yuan/mcDETECT",
+    classifiers = [
+        "Programming Language :: Python :: 3",
+        "License :: OSI Approved :: MIT License",
+        "Operating System :: OS Independent",
+    ],
+    python_requires = ">=3.6",
+)