scboa 0.1.0__tar.gz

scboa-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 Qiang Su
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
scboa-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,250 @@
1
+ Metadata-Version: 2.4
2
+ Name: scboa
3
+ Version: 0.1.0
4
+ Summary: Integrated Two-Stage Bayesian Optimization and Final Analysis Pipeline for scRNA-seq.
5
+ Author-email: Your Name <your.email@example.com>
6
+ License: MIT License
7
+
8
+ Copyright (c) 2024 Qiang Su
9
+
10
+ Permission is hereby granted, free of charge, to any person obtaining a copy
11
+ of this software and associated documentation files (the "Software"), to deal
12
+ in the Software without restriction, including without limitation the rights
13
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
14
+ copies of the Software, and to permit persons to whom the Software is
15
+ furnished to do so, subject to the following conditions:
16
+
17
+ The above copyright notice and this permission notice shall be included in all
18
+ copies or substantial portions of the Software.
19
+
20
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
22
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
23
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
24
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
25
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
26
+ SOFTWARE.
27
+
28
+ Project-URL: Homepage, https://github.com/your-username/scBOA-project
29
+ Project-URL: Bug Tracker, https://github.com/your-username/scBOA-project/issues
30
+ Keywords: scrna-seq,bioinformatics,bayesian optimization,scanpy,celltypist,single-cell,genomics
31
+ Classifier: Development Status :: 4 - Beta
32
+ Classifier: Programming Language :: Python :: 3
33
+ Classifier: Programming Language :: Python :: 3.9
34
+ Classifier: Programming Language :: Python :: 3.10
35
+ Classifier: Programming Language :: Python :: 3.11
36
+ Classifier: License :: OSI Approved :: MIT License
37
+ Classifier: Operating System :: OS Independent
38
+ Classifier: Intended Audience :: Science/Research
39
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
40
+ Requires-Python: >=3.9
41
+ Description-Content-Type: text/markdown
42
+ License-File: LICENSE
43
+ Requires-Dist: scanpy>=1.9.0
44
+ Requires-Dist: anndata
45
+ Requires-Dist: pandas
46
+ Requires-Dist: numpy
47
+ Requires-Dist: scipy
48
+ Requires-Dist: scikit-learn
49
+ Requires-Dist: matplotlib
50
+ Requires-Dist: seaborn
51
+ Requires-Dist: scikit-optimize
52
+ Requires-Dist: celltypist>=1.5.0
53
+ Requires-Dist: harmonypy
54
+ Requires-Dist: leidenalg
55
+ Requires-Dist: python-igraph
56
+ Requires-Dist: openpyxl
57
+ Dynamic: license-file
58
+
59
+ # In scBOA/README.md
60
+
61
+ # scBOA: scRNA-seq Bayesian Optimization and Analysis
62
+
63
+ **scBOA** is an integrated, two-stage computational pipeline for single-cell RNA sequencing (scRNA-seq) analysis. It automates the discovery of optimal processing parameters using Bayesian Optimization (Stage 1) and then applies these parameters to a comprehensive downstream analysis workflow (Stage 2). The pipeline also features an optional multi-level refinement process (Stage 3/4) to iteratively re-analyze and improve annotations for low-confidence cell clusters.
64
+
65
+ ## Key Features
66
+
67
+ - **Automated Parameter Tuning**: Uses Bayesian Optimization to find the best parameters (`n_highly_variable_genes`, `n_pcs`, `n_neighbors`, `resolution`) for clustering and cell type annotation.
68
+ - **Multi-Metric Objective Function**: Optimizes for a balanced score that considers annotation accuracy (CAS), marker gene specificity (MCS), and cluster separation (Silhouette score).
69
+ - **Single & Multi-Sample Modes**: Natively supports analysis of a single dataset or the integration of two datasets (e.g., control vs. treated) using Harmony.
70
+ - **Iterative Refinement**: Automatically identifies low-confidence cell clusters and re-runs the entire optimization and analysis pipeline on them to improve annotation granularity and accuracy.
71
+ - **Comprehensive Outputs**: Generates publication-quality plots, detailed metric reports, annotated data objects (`.h5ad`), and summary tables for easy interpretation.
72
+
73
+ ---
74
+
75
+ ## Step-by-Step Workflow
76
+
77
+ ### 1. Prerequisites
78
+
79
+ - Python 3.9 or newer.
80
+ - Access to a Linux or macOS command line.
81
+
82
+ ### 2. Installation
83
+
84
+ It is highly recommended to install `scboa` in a dedicated virtual environment to avoid conflicts with other Python projects.
85
+
86
+ ```bash
87
+ # Create a virtual environment named 'scboa-env'
88
+ python3 -m venv scboa-env
89
+
90
+ # Activate the environment
91
+ source scboa-env/bin/activate
92
+
93
+ # Now, install scBOA from PyPI
94
+ pip install scboa
95
+
96
+ # To deactivate the environment later, simply run: deactivate
97
+ ```
98
+
99
+ ### 3. Prepare Your Data
100
+
101
+ - **scRNA-seq Data**: Ensure your Cell Ranger output (the folder containing `barcodes.tsv.gz`, `features.tsv.gz`, and `matrix.mtx.gz`) is accessible.
102
+ - **CellTypist Model**: Download a pre-trained CellTypist model (`.pkl` file). You can find available models on the official [CellTypist models website](https://www.celltypist.org/models).
103
+
104
+ ### 4. Run the Pipeline
105
+
106
+ Here is an example command for a single-sample analysis:
107
+
108
+ ```bash
109
+ # Flag groups: I/O first, then Bayesian optimization (--n_calls to --seed),
110
+ # then analysis and clustering (--hvg_* and --cas_aggregation_method).
111
+ scboa \
112
+ --data_dir /path/to/your/cellranger_output/ \
113
+ --output_dir ./my_analysis_output/ \
114
+ --model_path /path/to/your/celltypist_model.pkl \
115
+ --n_calls 50 \
116
+ --target all \
117
+ --model_type biological \
118
+ --seed 42 \
119
+ --hvg_min_mean 0.0125 \
120
+ --hvg_max_mean 3.0 \
121
+ --hvg_min_disp 0.3 \
122
+ --cas_aggregation_method leiden
123
+ ```
124
+
125
+ ---
126
+
127
+ ### For Developers (Contributing)
128
+ ```bash
129
+ # Clone only the latest commit, which is faster and uses less disk space
130
+ git clone --depth 1 https://github.com/QiangSu/scBOA.git
131
+ cd scBOA
132
+
133
+ # Install in editable mode so that source changes take effect without reinstalling
134
+ pip install -e .
135
+ ```
136
+ ## Command-Line Arguments Explained
137
+
138
+ #### `Stage 1 & 2: Main I/O and Mode`
139
+
140
+ | Argument | Description | Explanation/Usage |
141
+ | :--- | :--- | :--- |
142
+ | `--data_dir <path>` | Path to 10x Genomics data. | **(Single-Sample Mode)** Provide the path to the directory containing `matrix.mtx.gz`, etc. |
143
+ | `--multi_sample <path1> <path2>` | Two paths for WT and Treated 10x data. | **(Multi-Sample Mode)** Provide two paths, first for control/WT, second for treated/perturbed. This mode enables Harmony integration. |
144
+ | `--output_dir <path>` | Path for all output files. | The main directory where all results, plots, and logs will be saved. Subdirectories for each stage will be created here. |
145
+ | `--model_path <path>` | Path to CellTypist model (`.pkl`). | **Required.** The pre-trained model used for cell type annotation. |
146
+ | `--output_prefix <str>` | Base prefix for Stage 1 output files. | Default: `bayesian_opt`. Used for naming optimization reports and plots. |
147
+
148
+ #### `Stage 1: Optimization Parameters`
149
+
150
+ | Argument | Description | Explanation/Usage |
151
+ | :--- | :--- | :--- |
152
+ | `--seed <int>` | Global random seed for reproducibility. | Default: `42`. Ensures that results are identical if run with the same data and parameters. |
153
+ | `--n_calls <int>` | Number of trials for EACH optimization strategy. | Default: `50`. The script runs three strategies (Explore, Exploit, BO-EI), so `50` means a total of 150 optimization steps. |
154
+ | `--model_type <choice>` | Optimization objective function type. | `biological`: Balances annotation agreement (CAS) and marker specificity (MCS). <br> `structural` (default): Adds cluster separation (Silhouette Score) to the biological metrics for more robust clusters. <br> `silhouette`: Optimizes solely for the best Silhouette Score. |
155
+ | `--marker_gene_model <choice>` | Genes to use for MCS calculation. | `all`: All genes are considered. <br> `non-mitochondrial` (default): Excludes mitochondrial genes, which often act as non-specific markers of cell stress. |
156
+ | `--target <choice>` | Optimization target metric. | `all` (default): Runs a single, balanced optimization (equivalent to `--model_type`). <br> `weighted_cas`, `simple_cas`, `mcs`: Runs optimization targeting only that specific metric. |
157
+ | `--cas_aggregation_method <choice>` | Method for calculating Simple CAS. | `leiden` (default): Averages the purity score of each raw Leiden cluster. Best for assessing technical cluster quality. <br> `consensus`: Merges clusters with the same final cell type label before averaging purity. Best for assessing biological group quality. |
158
+
159
+ #### `Stage 1 & 2: HVG Selection Method`
160
+
161
+ | Argument | Description | Explanation/Usage |
162
+ | :--- | :--- | :--- |
163
+ | `--hvg_min_mean <float>` | Min mean for two-step HVG selection. | If set, activates a pre-filtering step on genes based on expression and dispersion before selecting the top `n_hvg`. |
164
+ | `--hvg_max_mean <float>` | Max mean for two-step HVG selection. | See above. |
165
+ | `--hvg_min_disp <float>` | Min dispersion for two-step HVG selection. | See above. |
166
+
167
+ #### `Stage 1 & 2: QC & Filtering Parameters`
168
+
169
+ | Argument | Description | Explanation/Usage |
170
+ | :--- | :--- | :--- |
171
+ | `--min_genes <int>` | Min genes per cell. | Default: `200`. Filters out low-quality cells/empty droplets. |
172
+ | `--max_genes <int>` | Max genes per cell. | Default: `7000`. Filters out potential doublets. |
173
+ | `--max_pct_mt <float>` | Max mitochondrial percentage. | Default: `10.0`. Filters out stressed or dying cells. |
174
+ | `--min_cells <int>` | Min cells per gene. | Default: `3`. Filters out genes with negligible expression. |
175
+
176
+ #### `Stage 2 & Optional Refinement: Final Run Parameters`
177
+
178
+ | Argument | Description | Explanation/Usage |
179
+ | :--- | :--- | :--- |
180
+ | `--final_run_prefix <str>` | Prefix for Stage 2 output files. | Default: `sc_analysis_repro`. |
181
+ | `--fig_dpi <int>` | Resolution (DPI) for saved figures. | Default: `500`. |
182
+ | `--n_pcs_compute <int>` | Number of principal components to compute. | Default: `105`. A higher number allows for a wider search space for the optimal `n_pcs`. |
183
+ | `--n_top_genes <int>` | Number of top marker genes to show. | Default: `5`. Affects dot plots, heatmaps, and marker gene tables. |
184
+ | `--cellmarker_db <path>` | Path to a cell marker database (.csv). | **(Optional)** If provided, performs a "manual-style" annotation based on cluster marker genes and calculates a Marker Capture Score. |
185
+ | `--n_degs_for_capture <int>` | DEGs per cluster for Marker Capture Score. | Default: `50`. Number of top differentially expressed genes used to match against the marker DB. |
186
+ | `--cas_refine_threshold <float>`| CAS threshold to trigger refinement. | **(Optional)** If a cluster's CAS score is below this value (e.g., `90`), its cells are pooled for a new round of optimization and analysis. |
187
+ | `--refinement_depth <int>` | Maximum number of refinement iterations. | Default: `1`. If refinement is triggered, this controls how many times the process can repeat on the subsequently failing cells. |
188
+
189
+ ---
190
+
191
+ ## Output Directory Structure
192
+
193
+ The script generates a structured output directory. Below is an example structure and an explanation of key files.
194
+
195
+ ```
196
+ <output_dir>/
197
+ ├── stage_1_bayesian_optimization/
198
+ │ ├── bayesian_opt_structural_balanced_FINAL_annotated.h5ad
199
+ │ ├── bayesian_opt_structural_balanced_FINAL_best_params.txt
200
+ │ ├── bayesian_opt_structural_balanced_yield_scores_report.csv
201
+ │ ├── bayesian_opt_structural_balanced_optimizer_convergence.png
202
+ │ ├── bayesian_opt_structural_balanced_BO-EI_opt_result.skopt
203
+ │ ├── ... (other plots and strategy files) ...
204
+ │ └── refinement_depth_1/
205
+ │ ├── ... (mirrors the structure above, but for the refined subset of cells) ...
206
+
207
+ └── stage_2_final_analysis/
208
+ ├── sc_analysis_repro_final_processed.h5ad
209
+ ├── sc_analysis_repro_final_processed_with_refinement.h5ad
210
+ ├── sc_analysis_repro_all_annotations.csv
211
+ ├── sc_analysis_repro_all_annotations_with_refinement.csv
212
+ ├── sc_analysis_repro_leiden_cluster_annotation_scores.csv
213
+ ├── sc_analysis_repro_consensus_group_annotation_scores.csv
214
+ ├── sc_analysis_repro_combined_cluster_annotation_scores.csv
215
+ ├── sc_analysis_repro_cell_type_journey_summary.csv
216
+ ├── sc_analysis_repro_umap_leiden.png
217
+ ├── sc_analysis_repro_cluster_celltypist_umap.png
218
+ ├── sc_analysis_repro_umap_low_confidence_greyed.png
219
+ ├── ... (many other plots and result files) ...
220
+ └── refinement_depth_1/
221
+ ├── sc_analysis_repro_refinement_depth_1_final_processed.h5ad
222
+ ├── sc_analysis_repro_refinement_depth_1_umap_cumulative_result.png
223
+ └── ... (mirrors Stage 2 structure for the refined subset) ...
224
+ ```
225
+
226
+ ### Key File Explanations
227
+
228
+ #### Stage 1: `stage_1_bayesian_optimization/`
229
+ - `*_FINAL_best_params.txt`: A summary of the optimal parameters found and the final performance metrics. **This is the most important summary file.**
230
+ - `*_FINAL_annotated.h5ad`: The AnnData object processed with the best parameters, containing all final annotations from the single best run.
231
+ - `*_yield_scores_report.csv`: A detailed log of every trial from every optimization strategy, including parameters tested and all resulting scores (CAS, MCS, Silhouette).
232
+ - `*_optimizer_convergence.png`: A plot showing how the best score improved over time for each strategy.
233
+ - `*_opt_result.skopt`: A saved state of the optimization process, which can be reloaded.
234
+
235
+ #### Stage 2: `stage_2_final_analysis/`
236
+ - `*_final_processed.h5ad`: The final, fully annotated AnnData object from the initial Stage 2 run. Contains UMAP coordinates, clustering, and all annotations.
237
+ - `*_final_processed_with_refinement.h5ad`: **(If refinement runs)** The master AnnData object with the final, combined annotations after all refinement levels are complete.
238
+ - `*_all_annotations_with_refinement.csv`: A cell-by-cell table of all annotations, including the final `combined_annotation` column after refinement.
239
+ - `*_cluster_annotation_scores.csv`: Tables detailing the Cell Annotation Score (CAS) for each Leiden cluster and each consensus cell type group.
240
+ - `*_combined_cluster_annotation_scores.csv`: A concatenation of all CAS reports from the initial run and all refinement levels.
241
+ - `*_cell_type_journey_summary.csv`: A wide-format table showing how the cell count and CAS score for each cell type change across refinement stages.
242
+ - `*_umap_low_confidence_greyed.png`: A UMAP plot from the initial run where cells belonging to clusters that failed the CAS threshold are colored grey.
243
+ - `refinement_depth_1/*_umap_cumulative_result.png`: A UMAP plot showing the state of the data *after* a refinement level, with newly-annotated cells colored and any still-failing cells shown in grey.
244
+
245
+ ---
246
+
247
+ ## License
248
+
249
+ This project is licensed under the MIT License. See the `LICENSE` file for details.
250
+
scboa-0.1.0/README.md ADDED
@@ -0,0 +1,192 @@
1
+ # In scBOA/README.md
2
+
3
+ # scBOA: scRNA-seq Bayesian Optimization and Analysis
4
+
5
+ **scBOA** is an integrated, two-stage computational pipeline for single-cell RNA sequencing (scRNA-seq) analysis. It automates the discovery of optimal processing parameters using Bayesian Optimization (Stage 1) and then applies these parameters to a comprehensive downstream analysis workflow (Stage 2). The pipeline also features an optional multi-level refinement process (Stage 3/4) to iteratively re-analyze and improve annotations for low-confidence cell clusters.
6
+
7
+ ## Key Features
8
+
9
+ - **Automated Parameter Tuning**: Uses Bayesian Optimization to find the best parameters (`n_highly_variable_genes`, `n_pcs`, `n_neighbors`, `resolution`) for clustering and cell type annotation.
10
+ - **Multi-Metric Objective Function**: Optimizes for a balanced score that considers annotation accuracy (CAS), marker gene specificity (MCS), and cluster separation (Silhouette score).
11
+ - **Single & Multi-Sample Modes**: Natively supports analysis of a single dataset or the integration of two datasets (e.g., control vs. treated) using Harmony.
12
+ - **Iterative Refinement**: Automatically identifies low-confidence cell clusters and re-runs the entire optimization and analysis pipeline on them to improve annotation granularity and accuracy.
13
+ - **Comprehensive Outputs**: Generates publication-quality plots, detailed metric reports, annotated data objects (`.h5ad`), and summary tables for easy interpretation.
14
+
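As a rough illustration of the multi-metric objective, the sketch below folds the three scores into a single value. The equal weighting, the assumption that CAS and MCS are expressed as fractions in [0, 1], the rescaling of the Silhouette score from [-1, 1] to [0, 1], and the names `balanced_score` and `objective_value` are all assumptions made for this example; the README does not state the formula scBOA actually uses.

```python
from statistics import mean


def balanced_score(cas, mcs, silhouette, model_type="structural"):
    """Combine the metrics into one score in [0, 1] (hypothetical weighting)."""
    # The Silhouette score lies in [-1, 1]; rescale it to share the [0, 1] range.
    sil01 = (silhouette + 1.0) / 2.0
    if model_type == "biological":
        return mean([cas, mcs])
    if model_type == "structural":
        return mean([cas, mcs, sil01])
    if model_type == "silhouette":
        return sil01
    raise ValueError(f"unknown model_type: {model_type!r}")


def objective_value(cas, mcs, silhouette, model_type="structural"):
    """Negate the score, since scikit-optimize minimizes its objective."""
    return -balanced_score(cas, mcs, silhouette, model_type)
```

Under these assumptions, a trial with high annotation agreement but poorly separated clusters scores lower in `structural` mode than in `biological` mode, which is the trade-off the feature list describes.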
15
+ ---
16
+
17
+ ## Step-by-Step Workflow
18
+
19
+ ### 1. Prerequisites
20
+
21
+ - Python 3.9 or newer.
22
+ - Access to a Linux or macOS command line.
23
+
24
+ ### 2. Installation
25
+
26
+ It is highly recommended to install `scboa` in a dedicated virtual environment to avoid conflicts with other Python projects.
27
+
28
+ ```bash
29
+ # Create a virtual environment named 'scboa-env'
30
+ python3 -m venv scboa-env
31
+
32
+ # Activate the environment
33
+ source scboa-env/bin/activate
34
+
35
+ # Now, install scBOA from PyPI
36
+ pip install scboa
37
+
38
+ # To deactivate the environment later, simply run: deactivate
39
+ ```
40
+
41
+ ### 3. Prepare Your Data
42
+
43
+ - **scRNA-seq Data**: Ensure your Cell Ranger output (the folder containing `barcodes.tsv.gz`, `features.tsv.gz`, and `matrix.mtx.gz`) is accessible.
44
+ - **CellTypist Model**: Download a pre-trained CellTypist model (`.pkl` file). You can find available models on the official [CellTypist models website](https://www.celltypist.org/models).
45
+
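Before launching a long optimization run, it can help to confirm that the inputs are in place. The following is a minimal sketch using only the standard library; the helper names `missing_10x_files` and `check_inputs` are hypothetical and are not part of scBOA.

```python
from pathlib import Path

# File names listed in the data preparation step above.
REQUIRED_FILES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def missing_10x_files(data_dir):
    """Return the required 10x file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def check_inputs(data_dir, model_path):
    """Raise FileNotFoundError with a message naming what is missing."""
    missing = missing_10x_files(data_dir)
    if missing:
        raise FileNotFoundError(f"{data_dir} is missing: {', '.join(missing)}")
    if not Path(model_path).is_file():
        raise FileNotFoundError(f"CellTypist model not found: {model_path}")
```

Calling `check_inputs("/path/to/your/cellranger_output/", "/path/to/your/celltypist_model.pkl")` returns silently when everything is present.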
46
+ ### 4. Run the Pipeline
47
+
48
+ Here is an example command for a single-sample analysis:
49
+
50
+ ```bash
51
+ # Flag groups: I/O first, then Bayesian optimization (--n_calls to --seed),
52
+ # then analysis and clustering (--hvg_* and --cas_aggregation_method).
53
+ scboa \
54
+ --data_dir /path/to/your/cellranger_output/ \
55
+ --output_dir ./my_analysis_output/ \
56
+ --model_path /path/to/your/celltypist_model.pkl \
57
+ --n_calls 50 \
58
+ --target all \
59
+ --model_type biological \
60
+ --seed 42 \
61
+ --hvg_min_mean 0.0125 \
62
+ --hvg_max_mean 3.0 \
63
+ --hvg_min_disp 0.3 \
64
+ --cas_aggregation_method leiden
65
+ ```
66
+
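When the pipeline is driven from a script (for example, to loop over several samples), the same invocation can be assembled as an argument list in Python. This is a sketch only: `build_scboa_command` is a hypothetical helper, and it assumes the `scboa` executable is on the `PATH` and that every option takes exactly one value.

```python
import shlex
import subprocess


def build_scboa_command(data_dir, output_dir, model_path, **options):
    """Return the argv list for a single-sample scboa run."""
    cmd = [
        "scboa",
        "--data_dir", str(data_dir),
        "--output_dir", str(output_dir),
        "--model_path", str(model_path),
    ]
    # Each keyword becomes "--<name> <value>", e.g. n_calls=50 -> --n_calls 50.
    for flag, value in options.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


cmd = build_scboa_command(
    "/path/to/your/cellranger_output/",
    "./my_analysis_output/",
    "/path/to/your/celltypist_model.pkl",
    n_calls=50, target="all", model_type="biological", seed=42,
)
print(shlex.join(cmd))  # inspect the command before running it
# subprocess.run(cmd, check=True)  # uncomment to launch the pipeline
```

Passing a list to `subprocess.run` avoids shell quoting issues with paths that contain spaces.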
67
+ ---
68
+
69
+ ### For Developers (Contributing)
70
+ ```bash
71
+ # Clone only the latest commit, which is faster and uses less disk space
72
+ git clone --depth 1 https://github.com/QiangSu/scBOA.git
73
+ cd scBOA
74
+
75
+ # Install in editable mode so that source changes take effect without reinstalling
76
+ pip install -e .
77
+ ```
78
+ ## Command-Line Arguments Explained
79
+
80
+ #### `Stage 1 & 2: Main I/O and Mode`
81
+
82
+ | Argument | Description | Explanation/Usage |
83
+ | :--- | :--- | :--- |
84
+ | `--data_dir <path>` | Path to 10x Genomics data. | **(Single-Sample Mode)** Provide the path to the directory containing `matrix.mtx.gz`, etc. |
85
+ | `--multi_sample <path1> <path2>` | Two paths for WT and Treated 10x data. | **(Multi-Sample Mode)** Provide two paths, first for control/WT, second for treated/perturbed. This mode enables Harmony integration. |
86
+ | `--output_dir <path>` | Path for all output files. | The main directory where all results, plots, and logs will be saved. Subdirectories for each stage will be created here. |
87
+ | `--model_path <path>` | Path to CellTypist model (`.pkl`). | **Required.** The pre-trained model used for cell type annotation. |
88
+ | `--output_prefix <str>` | Base prefix for Stage 1 output files. | Default: `bayesian_opt`. Used for naming optimization reports and plots. |
89
+
90
+ #### `Stage 1: Optimization Parameters`
91
+
92
+ | Argument | Description | Explanation/Usage |
93
+ | :--- | :--- | :--- |
94
+ | `--seed <int>` | Global random seed for reproducibility. | Default: `42`. Ensures that results are identical if run with the same data and parameters. |
95
+ | `--n_calls <int>` | Number of trials for EACH optimization strategy. | Default: `50`. The script runs three strategies (Explore, Exploit, BO-EI), so `50` means a total of 150 optimization steps. |
96
+ | `--model_type <choice>` | Optimization objective function type. | `biological`: Balances annotation agreement (CAS) and marker specificity (MCS). <br> `structural` (default): Adds cluster separation (Silhouette Score) to the biological metrics for more robust clusters. <br> `silhouette`: Optimizes solely for the best Silhouette Score. |
97
+ | `--marker_gene_model <choice>` | Genes to use for MCS calculation. | `all`: All genes are considered. <br> `non-mitochondrial` (default): Excludes mitochondrial genes, which often act as non-specific markers of cell stress. |
98
+ | `--target <choice>` | Optimization target metric. | `all` (default): Runs a single, balanced optimization (equivalent to `--model_type`). <br> `weighted_cas`, `simple_cas`, `mcs`: Runs optimization targeting only that specific metric. |
99
+ | `--cas_aggregation_method <choice>` | Method for calculating Simple CAS. | `leiden` (default): Averages the purity score of each raw Leiden cluster. Best for assessing technical cluster quality. <br> `consensus`: Merges clusters with the same final cell type label before averaging purity. Best for assessing biological group quality. |
100
+
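The difference between the two `--cas_aggregation_method` choices can be shown on toy data. The sketch below assumes that a group's purity is the fraction of its cells carrying the group's most common per-cell label; that definition and the function names are assumptions for illustration, not scBOA's actual implementation.

```python
from collections import Counter, defaultdict
from statistics import mean


def purity(labels):
    """Fraction of cells that carry the most common label in the group."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def simple_cas(cluster_ids, cell_labels, method="leiden"):
    """Average purity over raw clusters or over merged consensus groups."""
    clusters = defaultdict(list)
    for cid, label in zip(cluster_ids, cell_labels):
        clusters[cid].append(label)
    if method == "leiden":
        groups = list(clusters.values())
    elif method == "consensus":
        # Merge clusters that share the same majority label before scoring.
        merged = defaultdict(list)
        for labels in clusters.values():
            majority = Counter(labels).most_common(1)[0][0]
            merged[majority].extend(labels)
        groups = list(merged.values())
    else:
        raise ValueError(f"unknown method: {method!r}")
    return mean(purity(g) for g in groups)


# Toy data: clusters 0 and 1 are both mostly T cells, cluster 2 is B cells.
cluster_ids = ["0"] * 4 + ["1"] * 4 + ["2"] * 2
cell_labels = ["T", "T", "T", "B"] + ["T"] * 4 + ["B", "B"]
leiden_cas = simple_cas(cluster_ids, cell_labels, "leiden")
consensus_cas = simple_cas(cluster_ids, cell_labels, "consensus")
```

With this toy input, the per-cluster average rewards the two clean clusters individually, while the consensus average scores the merged T-cell group as one unit, so the two methods give different values for the same clustering.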
101
+ #### `Stage 1 & 2: HVG Selection Method`
102
+
103
+ | Argument | Description | Explanation/Usage |
104
+ | :--- | :--- | :--- |
105
+ | `--hvg_min_mean <float>` | Min mean for two-step HVG selection. | If set, activates a pre-filtering step on genes based on expression and dispersion before selecting the top `n_hvg`. |
106
+ | `--hvg_max_mean <float>` | Max mean for two-step HVG selection. | See above. |
107
+ | `--hvg_min_disp <float>` | Min dispersion for two-step HVG selection. | See above. |
108
+
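The two-step selection described in this table can be sketched in plain Python: first discard genes outside the mean and dispersion bounds, then keep the top `n_hvg` survivors. Ranking the survivors by dispersion and the function name `select_hvgs` are assumptions for illustration; the real pipeline operates on an AnnData object rather than a dictionary.

```python
def select_hvgs(genes, n_hvg, min_mean=None, max_mean=None, min_disp=None):
    """Pick highly variable genes from a dict of name -> (mean, dispersion)."""

    def passes(gene_mean, gene_disp):
        # A bound that is None is treated as "not set" and never filters.
        return (
            (min_mean is None or gene_mean >= min_mean)
            and (max_mean is None or gene_mean <= max_mean)
            and (min_disp is None or gene_disp >= min_disp)
        )

    # Step 1: pre-filter on expression and dispersion.
    candidates = {g: md for g, md in genes.items() if passes(*md)}
    # Step 2: keep the n_hvg most dispersed candidates.
    ranked = sorted(candidates, key=lambda g: candidates[g][1], reverse=True)
    return ranked[:n_hvg]


toy_genes = {
    "A": (0.5, 1.2),
    "B": (5.0, 2.0),   # mean above hvg_max_mean
    "C": (0.01, 0.9),  # mean below hvg_min_mean
    "D": (1.0, 0.4),
    "E": (2.0, 0.2),   # dispersion below hvg_min_disp
}
filtered = select_hvgs(toy_genes, 2, min_mean=0.0125, max_mean=3.0, min_disp=0.3)
unfiltered = select_hvgs(toy_genes, 2)
```

Here the pre-filter removes the highly expressed gene `B` even though it has the largest dispersion, so the filtered and unfiltered selections differ.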
109
+ #### `Stage 1 & 2: QC & Filtering Parameters`
110
+
111
+ | Argument | Description | Explanation/Usage |
112
+ | :--- | :--- | :--- |
113
+ | `--min_genes <int>` | Min genes per cell. | Default: `200`. Filters out low-quality cells/empty droplets. |
114
+ | `--max_genes <int>` | Max genes per cell. | Default: `7000`. Filters out potential doublets. |
115
+ | `--max_pct_mt <float>` | Max mitochondrial percentage. | Default: `10.0`. Filters out stressed or dying cells. |
116
+ | `--min_cells <int>` | Min cells per gene. | Default: `3`. Filters out genes with negligible expression. |
117
+
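To make the four thresholds concrete, here is a small sketch that applies them to per-cell and per-gene summaries. The `CellQC` record and both function names are hypothetical, and treating every bound as inclusive is an assumption; the pipeline itself filters an AnnData object.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    barcode: str
    n_genes: int   # number of genes detected in the cell
    pct_mt: float  # percentage of counts from mitochondrial genes


def passes_qc(cell, min_genes=200, max_genes=7000, max_pct_mt=10.0):
    """Keep cells inside the gene-count window with acceptable mito content."""
    return min_genes <= cell.n_genes <= max_genes and cell.pct_mt <= max_pct_mt


def filter_genes(cells_per_gene, min_cells=3):
    """Keep genes detected in at least min_cells cells."""
    return [gene for gene, n in cells_per_gene.items() if n >= min_cells]


cells = [
    CellQC("AAAC", 150, 2.0),    # too few genes: likely an empty droplet
    CellQC("AAAG", 2500, 4.5),   # passes
    CellQC("AAAT", 9000, 3.0),   # too many genes: possible doublet
    CellQC("AACA", 1800, 22.0),  # high mitochondrial fraction
]
kept = [c.barcode for c in cells if passes_qc(c)]
```

The defaults in the function signatures mirror the defaults listed in the table.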
118
+ #### `Stage 2 & Optional Refinement: Final Run Parameters`
119
+
120
+ | Argument | Description | Explanation/Usage |
121
+ | :--- | :--- | :--- |
122
+ | `--final_run_prefix <str>` | Prefix for Stage 2 output files. | Default: `sc_analysis_repro`. |
123
+ | `--fig_dpi <int>` | Resolution (DPI) for saved figures. | Default: `500`. |
124
+ | `--n_pcs_compute <int>` | Number of principal components to compute. | Default: `105`. A higher number allows for a wider search space for the optimal `n_pcs`. |
125
+ | `--n_top_genes <int>` | Number of top marker genes to show. | Default: `5`. Affects dot plots, heatmaps, and marker gene tables. |
126
+ | `--cellmarker_db <path>` | Path to a cell marker database (.csv). | **(Optional)** If provided, performs a "manual-style" annotation based on cluster marker genes and calculates a Marker Capture Score. |
127
+ | `--n_degs_for_capture <int>` | DEGs per cluster for Marker Capture Score. | Default: `50`. Number of top differentially expressed genes used to match against the marker DB. |
128
+ | `--cas_refine_threshold <float>`| CAS threshold to trigger refinement. | **(Optional)** If a cluster's CAS score is below this value (e.g., `90`), its cells are pooled for a new round of optimization and analysis. |
129
+ | `--refinement_depth <int>` | Maximum number of refinement iterations. | Default: `1`. If refinement is triggered, this controls how many times the process can repeat on the subsequently failing cells. |
130
+
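The interaction between `--cas_refine_threshold` and `--refinement_depth` can be sketched as a bounded loop. In this illustration `reanalyze` is a hypothetical stand-in for a full optimization and analysis round on the pooled cells, and it simply returns new per-cluster CAS values; none of these names come from scBOA.

```python
def failing_clusters(cas_by_cluster, threshold):
    """Return the clusters whose CAS falls below the threshold."""
    return sorted(c for c, cas in cas_by_cluster.items() if cas < threshold)


def run_refinement(initial_scores, threshold, max_depth, reanalyze):
    """Repeat refinement on failing clusters, at most max_depth times."""
    history = []
    scores = initial_scores
    for level in range(1, max_depth + 1):
        failing = failing_clusters(scores, threshold)
        if not failing:
            break  # every cluster passed, so no further refinement is needed
        history.append((level, failing))
        # Pool the failing clusters and score the re-clustered subset.
        scores = reanalyze(failing)
    return history


# Stub results for two successive refinement rounds (made-up numbers).
rounds = iter([{"r0": 92.0, "r1": 85.0}, {"rr0": 99.0}])
history = run_refinement(
    {"0": 95.0, "1": 80.0, "2": 60.0},
    threshold=90.0,
    max_depth=2,
    reanalyze=lambda failing: next(rounds),
)
```

With a depth of 2, the second level only sees the cells from `r1`, the one cluster that still failed after the first round.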
131
+ ---
132
+
133
+ ## Output Directory Structure
134
+
135
+ The script generates a structured output directory. Below is an example structure and an explanation of key files.
136
+
137
+ ```
138
+ <output_dir>/
139
+ ├── stage_1_bayesian_optimization/
140
+ │ ├── bayesian_opt_structural_balanced_FINAL_annotated.h5ad
141
+ │ ├── bayesian_opt_structural_balanced_FINAL_best_params.txt
142
+ │ ├── bayesian_opt_structural_balanced_yield_scores_report.csv
143
+ │ ├── bayesian_opt_structural_balanced_optimizer_convergence.png
144
+ │ ├── bayesian_opt_structural_balanced_BO-EI_opt_result.skopt
145
+ │ ├── ... (other plots and strategy files) ...
146
+ │ └── refinement_depth_1/
147
+ │ ├── ... (mirrors the structure above, but for the refined subset of cells) ...
148
+
149
+ └── stage_2_final_analysis/
150
+ ├── sc_analysis_repro_final_processed.h5ad
151
+ ├── sc_analysis_repro_final_processed_with_refinement.h5ad
152
+ ├── sc_analysis_repro_all_annotations.csv
153
+ ├── sc_analysis_repro_all_annotations_with_refinement.csv
154
+ ├── sc_analysis_repro_leiden_cluster_annotation_scores.csv
155
+ ├── sc_analysis_repro_consensus_group_annotation_scores.csv
156
+ ├── sc_analysis_repro_combined_cluster_annotation_scores.csv
157
+ ├── sc_analysis_repro_cell_type_journey_summary.csv
158
+ ├── sc_analysis_repro_umap_leiden.png
159
+ ├── sc_analysis_repro_cluster_celltypist_umap.png
160
+ ├── sc_analysis_repro_umap_low_confidence_greyed.png
161
+ ├── ... (many other plots and result files) ...
162
+ └── refinement_depth_1/
163
+ ├── sc_analysis_repro_refinement_depth_1_final_processed.h5ad
164
+ ├── sc_analysis_repro_refinement_depth_1_umap_cumulative_result.png
165
+ └── ... (mirrors Stage 2 structure for the refined subset) ...
166
+ ```
167
+
168
+ ### Key File Explanations
169
+
170
+ #### Stage 1: `stage_1_bayesian_optimization/`
171
+ - `*_FINAL_best_params.txt`: A summary of the optimal parameters found and the final performance metrics. **This is the most important summary file.**
172
+ - `*_FINAL_annotated.h5ad`: The AnnData object processed with the best parameters, containing all final annotations from the single best run.
173
+ - `*_yield_scores_report.csv`: A detailed log of every trial from every optimization strategy, including parameters tested and all resulting scores (CAS, MCS, Silhouette).
174
+ - `*_optimizer_convergence.png`: A plot showing how the best score improved over time for each strategy.
175
+ - `*_opt_result.skopt`: A saved state of the optimization process, which can be reloaded.
176
+
177
+ #### Stage 2: `stage_2_final_analysis/`
178
+ - `*_final_processed.h5ad`: The final, fully annotated AnnData object from the initial Stage 2 run. Contains UMAP coordinates, clustering, and all annotations.
179
+ - `*_final_processed_with_refinement.h5ad`: **(If refinement runs)** The master AnnData object with the final, combined annotations after all refinement levels are complete.
180
+ - `*_all_annotations_with_refinement.csv`: A cell-by-cell table of all annotations, including the final `combined_annotation` column after refinement.
181
+ - `*_cluster_annotation_scores.csv`: Tables detailing the Cell Annotation Score (CAS) for each Leiden cluster and each consensus cell type group.
182
+ - `*_combined_cluster_annotation_scores.csv`: A concatenation of all CAS reports from the initial run and all refinement levels.
183
+ - `*_cell_type_journey_summary.csv`: A wide-format table showing how the cell count and CAS score for each cell type change across refinement stages.
184
+ - `*_umap_low_confidence_greyed.png`: A UMAP plot from the initial run where cells belonging to clusters that failed the CAS threshold are colored grey.
185
+ - `refinement_depth_1/*_umap_cumulative_result.png`: A UMAP plot showing the state of the data *after* a refinement level, with newly-annotated cells colored and any still-failing cells shown in grey.
186
+
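For downstream scripting, the key files can be located by their suffixes rather than by hard-coded prefixes. The sketch below relies only on the directory and file name patterns listed above; the function name `find_key_outputs` and the dictionary keys are hypothetical.

```python
from pathlib import Path


def find_key_outputs(output_dir):
    """Map a short label to the matching top-level files of each stage."""
    root = Path(output_dir)
    stage1 = root / "stage_1_bayesian_optimization"
    stage2 = root / "stage_2_final_analysis"
    # glob() is not recursive here, so refinement_depth_* folders are skipped.
    return {
        "best_params": sorted(stage1.glob("*_FINAL_best_params.txt")),
        "trial_log": sorted(stage1.glob("*_yield_scores_report.csv")),
        "final_h5ad": sorted(stage2.glob("*_final_processed.h5ad")),
        "refined_h5ad": sorted(
            stage2.glob("*_final_processed_with_refinement.h5ad")
        ),
    }
```

An empty list for `refined_h5ad` indicates that no refinement round produced a combined object for that run.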
187
+ ---
188
+
189
+ ## License
190
+
191
+ This project is licensed under the MIT License. See the `LICENSE` file for details.
192
+
scboa-0.1.0/pyproject.toml ADDED
@@ -0,0 +1,65 @@
1
+ # pyproject.toml
2
+
3
+ [build-system]
4
+ requires = ["setuptools>=61.0"]
5
+ build-backend = "setuptools.build_meta"
6
+
7
+ [project]
8
+ name = "scboa"
9
+ version = "0.1.0"
10
+ authors = [
11
+ { name="Your Name", email="your.email@example.com" },
12
+ ]
13
+ description = "Integrated Two-Stage Bayesian Optimization and Final Analysis Pipeline for scRNA-seq."
14
+
15
+ # Long description read from the README file
16
+ readme = "README.md"
17
+
18
+ # License information
19
+ license = { file = "LICENSE" }
20
+
21
+ # Python version requirement
22
+ requires-python = ">=3.9"
23
+
24
+ # Keywords for discoverability on PyPI
25
+ keywords = ["scrna-seq", "bioinformatics", "bayesian optimization", "scanpy", "celltypist", "single-cell", "genomics"]
26
+
27
+ # Trove classifiers for PyPI
28
+ classifiers = [
29
+ "Development Status :: 4 - Beta",
30
+ "Programming Language :: Python :: 3",
31
+ "Programming Language :: Python :: 3.9",
32
+ "Programming Language :: Python :: 3.10",
33
+ "Programming Language :: Python :: 3.11",
34
+ "License :: OSI Approved :: MIT License",
35
+ "Operating System :: OS Independent",
36
+ "Intended Audience :: Science/Research",
37
+ "Topic :: Scientific/Engineering :: Bio-Informatics",
38
+ ]
39
+
40
+ # Core dependencies required for the tool to run
41
+ dependencies = [
42
+ "scanpy>=1.9.0", # Core analysis framework
43
+ "anndata", # Data structure for scanpy
44
+ "pandas", # Data manipulation
45
+ "numpy", # Numerical operations
46
+ "scipy", # Scientific computing
47
+ "scikit-learn", # For metrics like silhouette score
48
+ "matplotlib", # Plotting
49
+ "seaborn", # Enhanced plotting
50
+ "scikit-optimize", # For Bayesian Optimization
51
+ "celltypist>=1.5.0", # For cell annotation
52
+ "harmonypy", # For multi-sample integration
53
+ "leidenalg", # Required by scanpy for Leiden clustering
54
+ "python-igraph", # Required by scanpy for Leiden clustering
55
+ "openpyxl", # For writing Excel files (e.g., marker genes)
56
+ ]
57
+
58
+ [project.urls]
59
+ "Homepage" = "https://github.com/your-username/scBOA-project"
60
+ "Bug Tracker" = "https://github.com/your-username/scBOA-project/issues"
61
+
62
+ # This section creates the command-line tool.
63
+ # It links the command `scboa` to the `run` function in your `cli.py` module.
64
+ [project.scripts]
65
+ scboa = "scboa.cli:run"
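Reviewer note: the `scboa = "scboa.cli:run"` entry above uses the standard `module:attribute` form for console scripts. The sketch below illustrates, under simplifying assumptions, how such a string is resolved to a callable. `load_entry_point` is a hypothetical helper written for this note, not part of scboa or setuptools, and the wrapper script that an installer actually generates differs in detail. A standard-library target is used so the snippet runs without scboa installed.

```python
import importlib


def load_entry_point(spec: str):
    """Resolve a 'module:attribute' string to the object it names.

    Hypothetical helper for illustration; assumes the attribute part is present.
    """
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    # Walk dotted attribute paths such as "Class.method".
    for part in attr_path.split("."):
        obj = getattr(obj, part)
    return obj


# "json:dumps" stands in for "scboa.cli:run" here.
dumps = load_entry_point("json:dumps")
print(dumps({"ok": True}))
```

With the package installed, the same lookup on `"scboa.cli:run"` would return the `run` function that the `scboa` command calls.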
scboa-0.1.0/setup.cfg ADDED
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
File without changes
@@ -0,0 +1,91 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ Command-Line Interface for the scBOA Pipeline.
5
+
6
+ This script handles argument parsing and serves as the entry point for the
7
+ scBOA pipeline when it's installed as a package. It imports and calls the
8
+ main orchestrator function from the `pipeline` module.
9
+ """
10
+
11
+ import argparse
12
+
13
+ # This is the crucial relative import. It looks for a file named 'pipeline.py'
14
+ # within the same package directory ('src/scboa/') and imports the 'main' function.
15
+ from .pipeline import main
16
+
17
+ def run():
18
+     """
19
+     Parses all command-line arguments and executes the main scBOA pipeline.
20
+
21
+     This function is the target for the [project.scripts] entry point specified
22
+     in the pyproject.toml file.
23
+     """
24
+     parser = argparse.ArgumentParser(
25
+         description="Integrated Two-Stage Bayesian Optimization and Final Analysis Pipeline for scRNA-seq.",
26
+         formatter_class=argparse.RawTextHelpFormatter
27
+     )
28
+
29
+     stage1_group = parser.add_argument_group('Stage 1 & 2: Main I/O and Mode')
30
+     mode_group = stage1_group.add_mutually_exclusive_group(required=True)
31
+     mode_group.add_argument('--data_dir', type=str, help='Path to 10x Genomics data for single-sample analysis.')
32
+     mode_group.add_argument('--multi_sample', nargs=2, metavar=('WT_DIR', 'TREATED_DIR'), help='Two paths for WT/Control and Treated/Perturbed 10x data for multi-sample integration.')
33
+     stage1_group.add_argument('--output_dir', type=str, required=True, help='Path for all output files.')
34
+     stage1_group.add_argument('--model_path', type=str, required=True, help='Path to CellTypist model (.pkl).')
35
+     stage1_group.add_argument('--output_prefix', type=str, default='bayesian_opt', help='Base prefix for Stage 1 output files.')
36
+
37
+     opt_group = parser.add_argument_group('Stage 1: Optimization Parameters')
38
+     opt_group.add_argument('--seed', type=int, default=42, help='Global random seed for reproducibility.')
39
+     opt_group.add_argument('--n_calls', type=int, default=50, help='Number of trials for EACH of the three optimization strategies.')
40
+     opt_group.add_argument(
41
+         '--model_type',
42
+         type=str,
43
+         default='structural',
44
+         choices=['biological', 'structural', 'silhouette'],
45
+         help=("'biological': balances CAS & MCS.\n"
46
+               "'structural' (default): adds silhouette score to balance biological concordance with cluster quality.\n"
47
+               "'silhouette': optimizes solely to maximize the silhouette score.")
48
+     )
49
+     opt_group.add_argument('--marker_gene_model', type=str, default='non-mitochondrial', choices=['all', 'non-mitochondrial'], help="'all': use all genes. 'non-mitochondrial' (default): exclude mitochondrial genes from MCS markers.")
50
+     opt_group.add_argument('--target', type=str, default='all', choices=['all', 'weighted_cas', 'simple_cas', 'mcs'], help="'all' (default): runs a single, balanced optimization. Other options optimize for that specific metric.")
51
+
52
+     opt_group.add_argument(
53
+         '--cas_aggregation_method',
54
+         type=str,
55
+         default='leiden',
56
+         choices=['leiden', 'consensus'],
57
+         help=("Method for calculating Simple Mean CAS and for determining refinement candidates.\n"
58
+               "'leiden' (default): Averages the purity of each individual Leiden cluster.\n"
59
+               "'consensus': Merges Leiden clusters with the same consensus label, then averages their purity.")
60
+     )
61
+
62
+     hvg_group = parser.add_argument_group('Stage 1 & 2: HVG Selection Method')
63
+     hvg_group.add_argument('--hvg_min_mean', type=float, default=None, help='(Optional) Activates two-step HVG selection. Min mean for initial filtering.')
64
+     hvg_group.add_argument('--hvg_max_mean', type=float, default=None, help='(Optional) Activates two-step HVG selection. Max mean for initial filtering.')
65
+     hvg_group.add_argument('--hvg_min_disp', type=float, default=None, help='(Optional) Activates two-step HVG selection. Min dispersion for initial filtering.')
66
+
67
+     qc_group = parser.add_argument_group('Stage 1 & 2: QC & Filtering Parameters')
68
+     qc_group.add_argument('--min_genes', type=int, default=200, help='Min genes per cell.')
69
+     qc_group.add_argument('--max_genes', type=int, default=7000, help='Max genes per cell.')
70
+     qc_group.add_argument('--max_pct_mt', type=float, default=10.0, help='Max mitochondrial percentage.')
71
+     qc_group.add_argument('--min_cells', type=int, default=3, help='Min cells per gene.')
72
+
73
+     stage2_group = parser.add_argument_group('Stage 2 & Optional Refinement: Final Run Parameters')
74
+     stage2_group.add_argument('--final_run_prefix', type=str, default='sc_analysis_repro', help='Prefix for all output files in the Stage 2 subdirectory.')
75
+     stage2_group.add_argument('--fig_dpi', default=500, type=int, help='Resolution (DPI) for saved figures in Stage 2.')
76
+     stage2_group.add_argument('--n_pcs_compute', type=int, default=105, help="Number of principal components to COMPUTE in Stage 1 and 2.")
77
+     stage2_group.add_argument('--n_top_genes', type=int, default=5, help="Number of top marker genes to show in plots/tables in Stage 1 and 2.")
78
+     stage2_group.add_argument('--cellmarker_db', type=str, default=None, help="(Optional) Path to a cell marker database (.csv) for manual annotation in Stage 2.")
79
+     stage2_group.add_argument('--n_degs_for_capture', type=int, default=50, help="Number of top DEGs per cluster to use for the Marker Capture Score calculation in Stage 2.")
80
+     stage2_group.add_argument('--cas_refine_threshold', type=float, default=None, help="(Optional) CAS percentage threshold (0-100). If a cluster's CAS is below this, its cells are pooled for a second, refined optimization run.")
81
+     stage2_group.add_argument('--refinement_depth', type=int, default=1, help="(Optional) Maximum number of times to repeat the refinement process on failing cells. Default is 1.")
82
+
83
+     # Parse the arguments provided by the user from the command line
84
+     parsed_args = parser.parse_args()
85
+
86
+     # Apply any pre-processing logic to the arguments before calling the main function
87
+     if parsed_args.multi_sample and "harmony" not in parsed_args.output_prefix:
88
+         parsed_args.output_prefix += "_harmony"
89
+
90
+     # Call the main orchestrator function from pipeline.py and pass the parsed arguments
91
+     main(parsed_args)
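Reviewer note: the mode selection and prefix handling in `run()` can be exercised in isolation. The sketch below is a reduced stand-in, not the package's parser: it keeps only `--data_dir`, `--multi_sample`, and `--output_prefix` from the diff. The names `build_parser` and `parse` and the program name `scboa-sketch` are assumptions made for this note. It shows that exactly one input mode must be given and that `_harmony` is appended to the prefix only in multi-sample mode when the prefix does not already contain "harmony".

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reduced parser with only the options needed for the demonstration."""
    parser = argparse.ArgumentParser(prog="scboa-sketch")
    # Exactly one of the two input modes must be supplied.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--data_dir", type=str)
    mode.add_argument("--multi_sample", nargs=2, metavar=("WT_DIR", "TREATED_DIR"))
    parser.add_argument("--output_prefix", type=str, default="bayesian_opt")
    return parser


def parse(argv):
    """Parse argv and apply the same prefix adjustment as run() in the diff."""
    args = build_parser().parse_args(argv)
    if args.multi_sample and "harmony" not in args.output_prefix:
        args.output_prefix += "_harmony"
    return args


single = parse(["--data_dir", "sample/"])
multi = parse(["--multi_sample", "wt/", "treated/"])
print(single.output_prefix)  # bayesian_opt
print(multi.output_prefix)   # bayesian_opt_harmony
```

Supplying both `--data_dir` and `--multi_sample` makes argparse exit with a usage error, which is the behaviour the `required=True` mutually exclusive group in the diff relies on.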