cellSP 0.0.1__tar.gz

Files changed (32)
  1. cellsp-0.0.1/LICENSE +21 -0
  2. cellsp-0.0.1/PKG-INFO +137 -0
  3. cellsp-0.0.1/README.md +128 -0
  4. cellsp-0.0.1/cellSP/__init__.py +7 -0
  5. cellsp-0.0.1/cellSP/characterize/__init__.py +2 -0
  6. cellsp-0.0.1/cellSP/characterize/_bicluster.py +384 -0
  7. cellsp-0.0.1/cellSP/characterize/_instant.py +174 -0
  8. cellsp-0.0.1/cellSP/characterize/_sprawl.py +448 -0
  9. cellsp-0.0.1/cellSP/characterize/_utils.py +70 -0
  10. cellsp-0.0.1/cellSP/datasets/__init__.py +1 -0
  11. cellsp-0.0.1/cellSP/datasets/_datasets.py +123 -0
  12. cellsp-0.0.1/cellSP/geo/__init__.py +1 -0
  13. cellsp-0.0.1/cellSP/geo/_geo.py +134 -0
  14. cellsp-0.0.1/cellSP/io/__init__.py +1 -0
  15. cellsp-0.0.1/cellSP/io/_io.py +104 -0
  16. cellsp-0.0.1/cellSP/model/__init__.py +1 -0
  17. cellsp-0.0.1/cellSP/model/_model.py +215 -0
  18. cellsp-0.0.1/cellSP/preprocessing/__init__.py +2 -0
  19. cellsp-0.0.1/cellSP/preprocessing/_extrapolate.py +262 -0
  20. cellsp-0.0.1/cellSP/preprocessing/_impute.py +24 -0
  21. cellsp-0.0.1/cellSP/visualisation/__init__.py +5 -0
  22. cellsp-0.0.1/cellSP/visualisation/_circularize.py +458 -0
  23. cellsp-0.0.1/cellSP/visualisation/_enrichment.py +170 -0
  24. cellsp-0.0.1/cellSP/visualisation/_raw.py +56 -0
  25. cellsp-0.0.1/cellSP/visualisation/_report.py +258 -0
  26. cellsp-0.0.1/cellSP/visualisation/_validation.py +16 -0
  27. cellsp-0.0.1/cellSP.egg-info/PKG-INFO +137 -0
  28. cellsp-0.0.1/cellSP.egg-info/SOURCES.txt +30 -0
  29. cellsp-0.0.1/cellSP.egg-info/dependency_links.txt +1 -0
  30. cellsp-0.0.1/cellSP.egg-info/top_level.txt +3 -0
  31. cellsp-0.0.1/pyproject.toml +20 -0
  32. cellsp-0.0.1/setup.cfg +4 -0
cellsp-0.0.1/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 Bhavay Aggarwal
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
cellsp-0.0.1/PKG-INFO ADDED
@@ -0,0 +1,137 @@
1
+ Metadata-Version: 2.2
2
+ Name: cellSP
3
+ Version: 0.0.1
4
+ Summary: cellSP.
5
+ Author-email: Bhavay Aggarwal <bhavayaggarwal07@gmail.com>
6
+ Requires-Python: >=3.12
7
+ Description-Content-Type: text/markdown
8
+ License-File: LICENSE
9
+
10
+ # CellSP
11
+ __Note: Repository is work in progress__
12
+ <br>
13
+ <br>
14
+ CellSP is a Python package for the analysis of subcellular spatial transcriptomic data. CellSP works with datasets generated at single-molecule resolution from technologies like Xenium, CosMx, MERSCOPE or other ISH-like data. Using existing tools [InSTAnT](https://github.com/bhavaygg/InSTAnT) and [SPRAWL](https://github.com/salzman-lab/SPRAWL/), CellSP identifies statistically significant subcellular patterns of gene transcripts and uses a biclustering algorithm to aggregate these patterns over hundreds of cells to produce "gene-cell modules". These modules represent the consistent detection of the same subcellular pattern by a set of genes in the same cells and offer a summarized and biologically interpretable description of subcellular patterns. CellSP provides specialized techniques for visualizing such modules and their defining spatial patterns. Additionally, CellSP utilizes Gene Ontology (GO) enrichment tests to offer functional insights into the genes comprising the module as well as the cells comprising the module.
15
+
16
+ ![CellSP_overview](https://github.com/bhavaygg/CellSP/blob/main/figures/Overview.png)
17
+
18
+ ***
19
+
20
+ ## How to install CellSP
21
+
22
+ We recommend using our `environment.yml` file to create a new conda environment to avoid issues with package incompatibility.
23
+
24
+ ```
25
+ conda env create -f environment.yml
26
+ ```
27
+ This creates a new conda environment named `CellSP` with all dependencies installed.
28
+
29
+ Alternatively, the package can be installed using pip.
30
+
31
+ ```
32
+ pip install CellSP
33
+ ```
34
+
35
+ __Note: Not operational as of now__
36
+
37
+ ***
38
+ ## How to use CellSP
39
+
40
+ CellSP expects data (both single-cell and spatial transcriptomic) to be in AnnData format. The data can be loaded using:
41
+
42
+ ```
43
+ adata_sc, adata_st = cellSP.ds.load_data(sc_adata= 'files/adata_sc.h5ad', st_adata = "files/adata_st.h5ad")
44
+ ```
45
+
46
+ **Note - Single cell data on the same tissue is required for characterization of the module cells.**
47
+
48
+ To load raw csv data, refer to [file]() for instructions.
49
+
50
+ CellSP preprocesses the input single-cell data by denoising it with [MAGIC](https://github.com/KrishnaswamyLab/MAGIC) and imputes the expression of genes not in the ST panel using [Tangram](https://github.com/broadinstitute/Tangram/).
51
+
52
+ ```
53
+ adata_sc = cellSP.pp.impute(adata_sc, t="auto")
54
+ adata_st = cellSP.pp.run_tangram(adata_sc, adata_st, device='cpu')
55
+ ```
56
+
57
+ After Tangram imputation, the single-cell and spatial AnnData objects are combined into one. This completes the preprocessing required for using CellSP. The preprocessing can be skipped if cellular characterization is not required.
58
+
59
+ There are three main steps involved in running CellSP -
60
+ 1. Subcellular Pattern Discovery
61
+ 2. Module Discovery
62
+ 3. Module Characterization
63
+
64
+ ### Subcellular Pattern Discovery
65
+
66
+ CellSP uses InSTAnT and SPRAWL for identifying statistically significant subcellular patterns. InSTAnT tests if transcripts of a gene pair tend to be proximal to each other more often than expected by chance, while SPRAWL identifies four types of subcellular patterns – peripheral, radial, punctate and central – describing the distribution of a gene’s transcripts within the cell.
67
+
68
+ To run InSTAnT, CellSP has two primary parameters -
69
+ - `distance_threshold`: The distance (in microns) at which to consider 2 genes proximal.
70
+ - `alpha_cpb`: p-value significance threshold below which a gene pair is considered colocalized for the CPB test. Default = 1e-3
71
+
72
+ ```
73
+ adata_st = cellSP.ch.run_instant(adata_st = adata_st, distance_threshold=2, alpha_cpb=1e-5)
74
+ ```
75
+
76
+ To run SPRAWL, CellSP uses the default parameters from the original implementation.
77
+
78
+ ```
79
+ adata_st = cellSP.ch.run_sprawl(adata_st)
80
+ ```
81
+
82
+ ### Module Discovery
83
+
84
+ CellSP uses a biclustering tool, LAS, to analyze each of the patterns and identify "gene-cell modules". Each module represents a set of genes or gene pairs that exhibit the same type of subcellular pattern in the same set of cells, with statistical significance estimated by a Bonferroni-based score.
85
+
86
+ CellSP has two functions for module discovery, one for SPRAWL and one for InSTAnT. Both functions share the same parameters, but the InSTAnT function has two additional parameters:
87
+ - `alpha`: p-value significance threshold below which a gene pair is considered for biclustering. Default = 1e-3
88
+ - `topk`: Select only the K most significant gene pairs that have p-value < `alpha`. Default = None
89
+
90
+ These parameters are used to restrict the number of gene pairs over which biclustering is performed, in order to reduce the computational cost.
91
+
92
+ The other parameters used are -
93
+ - `num_biclusters`: Number of modules to find. Default = 10.
94
+ - `randomized_searches`: Number of randomized searches to perform in LAS. Default = 50000.
95
+
96
+ ```
97
+ adata_st = cellSP.ch.bicluster_instant(adata_st, distance_threshold=2, threads=128, alpha=1e-5, num_biclusters = 50, randomized_searches = 50000)
98
+ adata_st = cellSP.ch.bicluster_sprawl(adata_st, threads=128, num_biclusters = 50, randomized_searches = 50000)
99
+ ```
100
+
101
+ ### Module Characterization
102
+
103
+ To aid biological interpretation, CellSP reports shared properties of the genes and cells of each discovered module. Genes are characterized using Gene Ontology (GO) enrichment tests, while cells are characterized by their cell type composition if such information is available. To provide a more precise characterization of a module’s cells, CellSP trains a machine learning classifier to discriminate those cells from all other cells, using the expression levels of all genes other than the module genes. Genes that are highly predictive in this task are then subjected to GO enrichment tests, furnishing hypotheses about biological processes and pathways that are active specifically in the module cells.
104
+
105
+ To characterize the module genes -
106
+
107
+ ```
108
+ adata_st = cellSP.geo.geo_analysis(adata_st, setting="module")
109
+ ```
110
+
111
+
112
+ To characterize the module cells, we first train a random forest classifier to find genes that are predictive of module presence and then perform enrichment tests -
113
+ ```
114
+ adata_st = cellSP.md.model_modules(adata_st, do_shap=True, subsample = True)
115
+ adata_st = cellSP.geo.geo_analysis(adata_st, setting="cell")
116
+ ```
117
+
118
+ ### Visualization
119
+
120
+ To help visualize modules defined by the five types of subcellular spatial patterns (four types identified by SPRAWL and colocalization patterns identified by InSTAnT), we developed three complementary plotting techniques.
121
+
122
+ ![CellSP_visualizations](https://raw.githubusercontent.com/bhavaygg/main/figures/Overview.png)
123
+
124
+ ***
125
+
126
+ ### How to cite CellSP
127
+
128
+ ```
129
+ @article{aggarwal2025cellsp,
130
+ title={CellSP: Module discovery and visualization for subcellular spatial transcriptomics data},
131
+ author={Aggarwal, Bhavay and Sinha, Saurabh},
132
+ journal={bioRxiv},
133
+ pages={2025--01},
134
+ year={2025},
135
+ publisher={Cold Spring Harbor Laboratory}
136
+ }
137
+ ```
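The description above says that InSTAnT treats the transcripts of two genes as proximal when they lie within `distance_threshold` of each other. The following is a minimal sketch of only that proximity-counting step for a single cell. It is not InSTAnT's statistical test and not CellSP's implementation. The function name `count_proximal_pairs`, the toy coordinates, and the choice of an inclusive (`<=`) comparison are all assumptions made for illustration.

```python
import math
from itertools import product


def count_proximal_pairs(coords_a, coords_b, distance_threshold):
    """Count (a, b) transcript pairs whose Euclidean distance is within the threshold.

    Hypothetical helper: coords_a and coords_b are lists of (x, y) positions,
    in the same units as distance_threshold, for two genes in one cell.
    """
    return sum(
        1
        for a, b in product(coords_a, coords_b)
        if math.dist(a, b) <= distance_threshold  # inclusive cutoff is an assumption
    )


# Toy transcript positions for two genes in one cell.
gene_a = [(0.0, 0.0), (5.0, 5.0)]
gene_b = [(1.0, 1.0), (5.0, 6.5), (20.0, 20.0)]

n_close = count_proximal_pairs(gene_a, gene_b, distance_threshold=2)
print(n_close)
```

A colocalization test would then compare such a count against what is expected by chance; that step is deliberately left out here.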
cellsp-0.0.1/README.md ADDED
@@ -0,0 +1,128 @@
1
+ # CellSP
2
+ __Note: Repository is work in progress__
3
+ <br>
4
+ <br>
5
+ CellSP is a Python package for the analysis of subcellular spatial transcriptomic data. CellSP works with datasets generated at single-molecule resolution from technologies like Xenium, CosMx, MERSCOPE or other ISH-like data. Using existing tools [InSTAnT](https://github.com/bhavaygg/InSTAnT) and [SPRAWL](https://github.com/salzman-lab/SPRAWL/), CellSP identifies statistically significant subcellular patterns of gene transcripts and uses a biclustering algorithm to aggregate these patterns over hundreds of cells to produce "gene-cell modules". These modules represent the consistent detection of the same subcellular pattern by a set of genes in the same cells and offer a summarized and biologically interpretable description of subcellular patterns. CellSP provides specialized techniques for visualizing such modules and their defining spatial patterns. Additionally, CellSP utilizes Gene Ontology (GO) enrichment tests to offer functional insights into the genes comprising the module as well as the cells comprising the module.
6
+
7
+ ![CellSP_overview](https://github.com/bhavaygg/CellSP/blob/main/figures/Overview.png)
8
+
9
+ ***
10
+
11
+ ## How to install CellSP
12
+
13
+ We recommend using our `environment.yml` file to create a new conda environment to avoid issues with package incompatibility.
14
+
15
+ ```
16
+ conda env create -f environment.yml
17
+ ```
18
+ This creates a new conda environment named `CellSP` with all dependencies installed.
19
+
20
+ Alternatively, the package can be installed using pip.
21
+
22
+ ```
23
+ pip install CellSP
24
+ ```
25
+
26
+ __Note: Not operational as of now__
27
+
28
+ ***
29
+ ## How to use CellSP
30
+
31
+ CellSP expects data (both single-cell and spatial transcriptomic) to be in AnnData format. The data can be loaded using:
32
+
33
+ ```
34
+ adata_sc, adata_st = cellSP.ds.load_data(sc_adata= 'files/adata_sc.h5ad', st_adata = "files/adata_st.h5ad")
35
+ ```
36
+
37
+ **Note - Single cell data on the same tissue is required for characterization of the module cells.**
38
+
39
+ To load raw csv data, refer to [file]() for instructions.
40
+
41
+ CellSP preprocesses the input single-cell data by denoising it with [MAGIC](https://github.com/KrishnaswamyLab/MAGIC) and imputes the expression of genes not in the ST panel using [Tangram](https://github.com/broadinstitute/Tangram/).
42
+
43
+ ```
44
+ adata_sc = cellSP.pp.impute(adata_sc, t="auto")
45
+ adata_st = cellSP.pp.run_tangram(adata_sc, adata_st, device='cpu')
46
+ ```
47
+
48
+ After Tangram imputation, the single-cell and spatial AnnData objects are combined into one. This completes the preprocessing required for using CellSP. The preprocessing can be skipped if cellular characterization is not required.
49
+
50
+ There are three main steps involved in running CellSP -
51
+ 1. Subcellular Pattern Discovery
52
+ 2. Module Discovery
53
+ 3. Module Characterization
54
+
55
+ ### Subcellular Pattern Discovery
56
+
57
+ CellSP uses InSTAnT and SPRAWL for identifying statistically significant subcellular patterns. InSTAnT tests if transcripts of a gene pair tend to be proximal to each other more often than expected by chance, while SPRAWL identifies four types of subcellular patterns – peripheral, radial, punctate and central – describing the distribution of a gene’s transcripts within the cell.
58
+
59
+ To run InSTAnT, CellSP has two primary parameters -
60
+ - `distance_threshold`: The distance (in microns) at which to consider 2 genes proximal.
61
+ - `alpha_cpb`: p-value significance threshold below which a gene pair is considered colocalized for the CPB test. Default = 1e-3
62
+
63
+ ```
64
+ adata_st = cellSP.ch.run_instant(adata_st = adata_st, distance_threshold=2, alpha_cpb=1e-5)
65
+ ```
66
+
67
+ To run SPRAWL, CellSP uses the default parameters from the original implementation.
68
+
69
+ ```
70
+ adata_st = cellSP.ch.run_sprawl(adata_st)
71
+ ```
72
+
73
+ ### Module Discovery
74
+
75
+ CellSP uses a biclustering tool, LAS, to analyze each of the patterns and identify "gene-cell modules". Each module represents a set of genes or gene pairs that exhibit the same type of subcellular pattern in the same set of cells, with statistical significance estimated by a Bonferroni-based score.
76
+
77
+ CellSP has two functions for module discovery, one for SPRAWL and one for InSTAnT. Both functions share the same parameters, but the InSTAnT function has two additional parameters:
78
+ - `alpha`: p-value significance threshold below which a gene pair is considered for biclustering. Default = 1e-3
79
+ - `topk`: Select only the K most significant gene pairs that have p-value < `alpha`. Default = None
80
+
81
+ These parameters are used to restrict the number of gene pairs over which biclustering is performed, in order to reduce the computational cost.
82
+
83
+ The other parameters used are -
84
+ - `num_biclusters`: Number of modules to find. Default = 10.
85
+ - `randomized_searches`: Number of randomized searches to perform in LAS. Default = 50000.
86
+
87
+ ```
88
+ adata_st = cellSP.ch.bicluster_instant(adata_st, distance_threshold=2, threads=128, alpha=1e-5, num_biclusters = 50, randomized_searches = 50000)
89
+ adata_st = cellSP.ch.bicluster_sprawl(adata_st, threads=128, num_biclusters = 50, randomized_searches = 50000)
90
+ ```
91
+
92
+ ### Module Characterization
93
+
94
+ To aid biological interpretation, CellSP reports shared properties of the genes and cells of each discovered module. Genes are characterized using Gene Ontology (GO) enrichment tests, while cells are characterized by their cell type composition if such information is available. To provide a more precise characterization of a module’s cells, CellSP trains a machine learning classifier to discriminate those cells from all other cells, using the expression levels of all genes other than the module genes. Genes that are highly predictive in this task are then subjected to GO enrichment tests, furnishing hypotheses about biological processes and pathways that are active specifically in the module cells.
95
+
96
+ To characterize the module genes -
97
+
98
+ ```
99
+ adata_st = cellSP.geo.geo_analysis(adata_st, setting="module")
100
+ ```
101
+
102
+
103
+ To characterize the module cells, we first train a random forest classifier to find genes that are predictive of module presence and then perform enrichment tests -
104
+ ```
105
+ adata_st = cellSP.md.model_modules(adata_st, do_shap=True, subsample = True)
106
+ adata_st = cellSP.geo.geo_analysis(adata_st, setting="cell")
107
+ ```
108
+
109
+ ### Visualization
110
+
111
+ To help visualize modules defined by the five types of subcellular spatial patterns (four types identified by SPRAWL and colocalization patterns identified by InSTAnT), we developed three complementary plotting techniques.
112
+
113
+ ![CellSP_visualizations](https://raw.githubusercontent.com/bhavaygg/main/figures/Overview.png)
114
+
115
+ ***
116
+
117
+ ### How to cite CellSP
118
+
119
+ ```
120
+ @article{aggarwal2025cellsp,
121
+ title={CellSP: Module discovery and visualization for subcellular spatial transcriptomics data},
122
+ author={Aggarwal, Bhavay and Sinha, Saurabh},
123
+ journal={bioRxiv},
124
+ pages={2025--01},
125
+ year={2025},
126
+ publisher={Cold Spring Harbor Laboratory}
127
+ }
128
+ ```
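The README above describes `alpha` and `topk` as a way to limit the gene pairs passed to biclustering: keep pairs with p-value below `alpha`, then optionally keep only the K most significant. The sketch below illustrates that filtering logic on a toy dictionary of p-values. It is not taken from the package source; `select_gene_pairs` and the gene names are hypothetical, and only the defaults (`alpha=1e-3`, `topk=None`) come from the README.

```python
def select_gene_pairs(p_values, alpha=1e-3, topk=None):
    """Return gene pairs with p-value < alpha, most significant first.

    Hypothetical helper: p_values maps a (gene, gene) tuple to its p-value.
    If topk is given, only the topk most significant pairs are kept.
    """
    significant = [(pair, p) for pair, p in p_values.items() if p < alpha]
    significant.sort(key=lambda item: item[1])  # smallest p-value first
    if topk is not None:
        significant = significant[:topk]
    return [pair for pair, _ in significant]


# Toy p-values for four gene pairs.
p_values = {
    ("GeneA", "GeneB"): 1e-6,
    ("GeneA", "GeneC"): 5e-4,
    ("GeneB", "GeneC"): 0.2,
    ("GeneC", "GeneD"): 1e-8,
}

selected_all = select_gene_pairs(p_values)           # every pair below alpha
selected_top2 = select_gene_pairs(p_values, topk=2)  # the two most significant
print(selected_all)
print(selected_top2)
```

Lowering `alpha` or setting `topk` shrinks the input to the biclustering search, which is the trade-off the README refers to.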
cellsp-0.0.1/cellSP/__init__.py ADDED
@@ -0,0 +1,7 @@
1
+ from . import datasets as ds
2
+ from . import io
3
+ from . import visualisation as vs
4
+ from . import preprocessing as pp
5
+ from . import characterize as ch
6
+ from . import model as md
7
+ from . import geo as geo
cellsp-0.0.1/cellSP/characterize/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ from ._instant import run_instant, analyse_fsm, bicluster_instant
2
+ from ._sprawl import run_sprawl, bicluster_sprawl
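The `bicluster_instant` and `bicluster_sprawl` functions exported above search for "gene-cell modules", which the README defines as a set of genes showing the same pattern in the same set of cells. As a toy illustration of that definition only (not of the LAS search or its scoring), the sketch below finds the cells in which every gene of a candidate gene set shows a pattern. The name `cells_sharing_pattern` and the data layout are assumptions made for this example.

```python
def cells_sharing_pattern(calls, genes):
    """Return the sorted cells in which every gene in `genes` shows the pattern.

    Hypothetical helper: calls maps a gene name to the set of cell IDs
    where a significant pattern was called for that gene.
    """
    cell_sets = [calls[gene] for gene in genes]
    if not cell_sets:
        return []
    return sorted(set.intersection(*cell_sets))


# Toy pattern calls: gene -> cells where the pattern was detected.
calls = {
    "GeneA": {"cell1", "cell2", "cell3"},
    "GeneB": {"cell2", "cell3", "cell4"},
    "GeneC": {"cell3", "cell5"},
}

module_cells = cells_sharing_pattern(calls, ["GeneA", "GeneB"])
print(module_cells)
```

A real biclustering method has to search over gene sets and cell sets jointly and tolerate noise, which a strict intersection like this does not.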