rocit 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- rocit-0.1.0/LICENSE +28 -0
- rocit-0.1.0/PKG-INFO +570 -0
- rocit-0.1.0/README.md +509 -0
- rocit-0.1.0/pyproject.toml +54 -0
- rocit-0.1.0/setup.cfg +4 -0
- rocit-0.1.0/src/rocit/__init__.py +12 -0
- rocit-0.1.0/src/rocit/cli.py +464 -0
- rocit-0.1.0/src/rocit/config.py +213 -0
- rocit-0.1.0/src/rocit/constants.py +8 -0
- rocit-0.1.0/src/rocit/data/__init__.py +8 -0
- rocit-0.1.0/src/rocit/data/datamodule.py +78 -0
- rocit-0.1.0/src/rocit/data/dataset.py +274 -0
- rocit-0.1.0/src/rocit/models/__init__.py +6 -0
- rocit-0.1.0/src/rocit/models/lightning_module.py +162 -0
- rocit-0.1.0/src/rocit/models/model_architecture.py +131 -0
- rocit-0.1.0/src/rocit/pipeline.py +202 -0
- rocit-0.1.0/src/rocit/preprocessing/__init__.py +5 -0
- rocit-0.1.0/src/rocit/preprocessing/bam_tools.py +93 -0
- rocit-0.1.0/src/rocit/preprocessing/extract_pacbio_cpg_info.py +261 -0
- rocit-0.1.0/src/rocit/preprocessing/loh_data_labeller.py +148 -0
- rocit-0.1.0/src/rocit/preprocessing/prepare_somatic_data.py +155 -0
- rocit-0.1.0/src/rocit/preprocessing/process_cpg_distribution.py +37 -0
- rocit-0.1.0/src/rocit/preprocessing/snv_data_labeller.py +99 -0
- rocit-0.1.0/src/rocit/preprocessing/tumor_data_labeller.py +74 -0
- rocit-0.1.0/src/rocit/preprocessing/variant_processing.py +103 -0
- rocit-0.1.0/src/rocit/utils/__init__.py +0 -0
- rocit-0.1.0/src/rocit/validation.py +146 -0
- rocit-0.1.0/src/rocit.egg-info/PKG-INFO +570 -0
- rocit-0.1.0/src/rocit.egg-info/SOURCES.txt +31 -0
- rocit-0.1.0/src/rocit.egg-info/dependency_links.txt +1 -0
- rocit-0.1.0/src/rocit.egg-info/entry_points.txt +2 -0
- rocit-0.1.0/src/rocit.egg-info/requires.txt +11 -0
- rocit-0.1.0/src/rocit.egg-info/top_level.txt +1 -0
rocit-0.1.0/LICENSE
ADDED
|
@@ -0,0 +1,28 @@
|
|
|
1
|
+
BSD 3-Clause License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026, Toby Baker
|
|
4
|
+
|
|
5
|
+
Redistribution and use in source and binary forms, with or without
|
|
6
|
+
modification, are permitted provided that the following conditions are met:
|
|
7
|
+
|
|
8
|
+
1. Redistributions of source code must retain the above copyright notice, this
|
|
9
|
+
list of conditions and the following disclaimer.
|
|
10
|
+
|
|
11
|
+
2. Redistributions in binary form must reproduce the above copyright notice,
|
|
12
|
+
this list of conditions and the following disclaimer in the documentation
|
|
13
|
+
and/or other materials provided with the distribution.
|
|
14
|
+
|
|
15
|
+
3. Neither the name of the copyright holder nor the names of its
|
|
16
|
+
contributors may be used to endorse or promote products derived from
|
|
17
|
+
this software without specific prior written permission.
|
|
18
|
+
|
|
19
|
+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
|
20
|
+
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
21
|
+
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
|
22
|
+
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
|
23
|
+
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
24
|
+
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
|
25
|
+
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
|
26
|
+
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
|
27
|
+
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
|
28
|
+
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
rocit-0.1.0/PKG-INFO
ADDED
|
@@ -0,0 +1,570 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: rocit
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Deep learning classifier for tumor-derived long reads using CpG methylation
|
|
5
|
+
Author-email: Toby Baker <tobybaker97@gmail.com>
|
|
6
|
+
License: BSD 3-Clause License
|
|
7
|
+
|
|
8
|
+
Copyright (c) 2026, Toby Baker
|
|
9
|
+
|
|
10
|
+
Redistribution and use in source and binary forms, with or without
|
|
11
|
+
modification, are permitted provided that the following conditions are met:
|
|
12
|
+
|
|
13
|
+
1. Redistributions of source code must retain the above copyright notice, this
|
|
14
|
+
list of conditions and the following disclaimer.
|
|
15
|
+
|
|
16
|
+
2. Redistributions in binary form must reproduce the above copyright notice,
|
|
17
|
+
this list of conditions and the following disclaimer in the documentation
|
|
18
|
+
and/or other materials provided with the distribution.
|
|
19
|
+
|
|
20
|
+
3. Neither the name of the copyright holder nor the names of its
|
|
21
|
+
contributors may be used to endorse or promote products derived from
|
|
22
|
+
this software without specific prior written permission.
|
|
23
|
+
|
|
24
|
+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
|
25
|
+
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
26
|
+
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
|
27
|
+
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
|
28
|
+
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
29
|
+
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
|
30
|
+
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
|
31
|
+
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
|
32
|
+
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
|
33
|
+
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|
34
|
+
|
|
35
|
+
Project-URL: Homepage, https://github.com/tobybaker/rocit
|
|
36
|
+
Project-URL: Repository, https://github.com/tobybaker/rocit
|
|
37
|
+
Project-URL: Bug Tracker, https://github.com/tobybaker/rocit/issues
|
|
38
|
+
Keywords: bioinformatics,cancer,methylation,long-read,deep-learning,pacbio
|
|
39
|
+
Classifier: Development Status :: 3 - Alpha
|
|
40
|
+
Classifier: Intended Audience :: Science/Research
|
|
41
|
+
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
|
|
42
|
+
Classifier: License :: OSI Approved :: BSD License
|
|
43
|
+
Classifier: Programming Language :: Python :: 3
|
|
44
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
45
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
46
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
47
|
+
Requires-Python: >=3.10
|
|
48
|
+
Description-Content-Type: text/markdown
|
|
49
|
+
License-File: LICENSE
|
|
50
|
+
Requires-Dist: torch>=2.0.0
|
|
51
|
+
Requires-Dist: pytorch-lightning>=2.0.0
|
|
52
|
+
Requires-Dist: pyyaml>=6.0
|
|
53
|
+
Requires-Dist: numpy>=1.24.0
|
|
54
|
+
Requires-Dist: polars>=1.0.0
|
|
55
|
+
Requires-Dist: pysam>=0.22.0
|
|
56
|
+
Requires-Dist: click>=8.1.0
|
|
57
|
+
Provides-Extra: dev
|
|
58
|
+
Requires-Dist: build>=1.4.0; extra == "dev"
|
|
59
|
+
Requires-Dist: twine>=6.2.0; extra == "dev"
|
|
60
|
+
Dynamic: license-file
|
|
61
|
+
|
|
62
|
+
# ROCIT
|
|
63
|
+
|
|
64
|
+
<div align="center">
|
|
65
|
+
<img src="https://res.cloudinary.com/dhco9jghq/image/upload/v1771031245/rocitlogo_e2huhi.png" alt="ROCIT Logo" width="200"/>
|
|
66
|
+
</div>
|
|
67
|
+
|
|
68
|
+
**ROCIT** (Read Origin Classifier In Tumors) is a deep learning tool that classifies individual sequencing reads from long-read bulk tumor sequencing as tumor-derived or normal-derived. By leveraging CpG methylation patterns, ROCIT enables read-level resolution of tumor heterogeneity from PacBio sequencing data.
|
|
69
|
+
|
|
70
|
+
ROCIT currently supports training and prediction on PacBio HiFi tumor BAMs with CpG methylation probabilities produced by [Jasmine](https://github.com/PacificBiosciences/jasmine). Oxford Nanopore support is planned for future releases.
|
|
71
|
+
|
|
72
|
+
|
|
73
|
+
### How It Works
|
|
74
|
+
|
|
75
|
+
ROCIT uses a multi-step approach:
|
|
76
|
+
|
|
77
|
+
1. **Data Preprocessing**: Extracts CpG methylation from PacBio BAM files and labels the origin of a subset of reads based on somatic variants (SNVs) and loss of heterozygosity (LOH) events
|
|
78
|
+
2. **Input Features**: Combines read-level methylation patterns with cell-type reference atlases and bulk sample methylation distributions
|
|
79
|
+
3. **Model Training**: Trains a transformer-based neural network to classify the labelled read subset using chromosomal cross-validation
|
|
80
|
+
4. **Prediction**: Applies the trained model to classify all reads in the sample
|
|
81
|
+
|
|
82
|
+
## Installation
|
|
83
|
+
|
|
84
|
+
### Via pip (Recommended)
|
|
85
|
+
|
|
86
|
+
```bash
|
|
87
|
+
pip install rocit
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
### From Source
|
|
91
|
+
|
|
92
|
+
```bash
|
|
93
|
+
git clone https://github.com/tobybaker/rocit.git
|
|
94
|
+
cd rocit
|
|
95
|
+
pip install -e .
|
|
96
|
+
```
|
|
97
|
+
|
|
98
|
+
### Requirements
|
|
99
|
+
|
|
100
|
+
- Python ≥ 3.10
|
|
101
|
+
- PyTorch ≥ 2.9.1
|
|
102
|
+
- PyTorch Lightning ≥ 2.6.0
|
|
103
|
+
- Polars ≥ 1.36.1
|
|
104
|
+
- pysam == 0.22.1
|
|
105
|
+
- Additional dependencies listed in [pyproject.toml](pyproject.toml)
|
|
106
|
+
|
|
107
|
+
## Quick Start
|
|
108
|
+
|
|
109
|
+
### Cell Atlas Requirement
|
|
110
|
+
|
|
111
|
+
ROCIT requires a reference cell-type methylation atlas derived from whole-genome bisulfite sequencing data (GSE186458). Download the pre-computed atlas:
|
|
112
|
+
|
|
113
|
+
```bash
|
|
114
|
+
wget [PLACEHOLDER_URL] -O reference/cell_atlas.parquet
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
Alternatively, you can [generate the atlas from source](#generating-the-cell-atlas-from-source).
|
|
118
|
+
|
|
119
|
+
> **Citation:** Loyfer, N., et al. (2023). [A DNA methylation atlas of normal human cell types](https://doi.org/10.1038/s41586-022-05580-6). *Nature*.
|
|
120
|
+
|
|
121
|
+
### Running ROCIT
|
|
122
|
+
|
|
123
|
+
ROCIT provides a complete end-to-end pipeline via the `rocit run` command:
|
|
124
|
+
|
|
125
|
+
```bash
|
|
126
|
+
rocit run --config config.yaml
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
For more control, you can run individual steps:
|
|
130
|
+
|
|
131
|
+
```bash
|
|
132
|
+
# 1. Extract methylation from BAM
|
|
133
|
+
rocit extract-bam-methylation --sample-id SAMPLE01 \
|
|
134
|
+
--sample-bam aligned.bam \
|
|
135
|
+
--output-dir methylation/
|
|
136
|
+
|
|
137
|
+
# 2. Compute methylation distribution
|
|
138
|
+
rocit extract-cpg-distribution --sample-id SAMPLE01 \
|
|
139
|
+
--methylation-dir methylation/ \
|
|
140
|
+
--output-dir distribution/
|
|
141
|
+
|
|
142
|
+
# 3. Label reads for training
|
|
143
|
+
rocit preprocess --config preprocess_config.yaml
|
|
144
|
+
|
|
145
|
+
# 4. Train the model
|
|
146
|
+
rocit train --config train_config.yaml
|
|
147
|
+
|
|
148
|
+
# 5. Run predictions
|
|
149
|
+
rocit predict --config predict_config.yaml
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
|
|
153
|
+
## Configuration Files
|
|
154
|
+
|
|
155
|
+
ROCIT uses YAML configuration files for reproducibility and ease of use. Below are templates for each command.
|
|
156
|
+
|
|
157
|
+
### Training Configuration (`train_config.yaml`)
|
|
158
|
+
|
|
159
|
+
```yaml
|
|
160
|
+
sample_id: "SAMPLE01"
|
|
161
|
+
labelled_data: "preprocessing/labelled_methylation_data.parquet"
|
|
162
|
+
sample_distribution: "distribution/SAMPLE01_methylation_distribution.parquet"
|
|
163
|
+
cell_atlas: "reference/cell_atlas.parquet"
|
|
164
|
+
val_chromosomes: ["chr20", "chr21"]
|
|
165
|
+
test_chromosomes: ["chr22"]
|
|
166
|
+
output_dir: "training/"
|
|
167
|
+
cache_dir: "/scratch/"
|
|
168
|
+
```
|
|
169
|
+
|
|
170
|
+
**Required fields:**
|
|
171
|
+
- `sample_id`: Unique identifier for the sample
|
|
172
|
+
- `labelled_data`: Path to labelled methylation data from preprocessing
|
|
173
|
+
- `sample_distribution`: Path to sample methylation distribution
|
|
174
|
+
- `cell_atlas`: Path to cell-type methylation reference
|
|
175
|
+
- `val_chromosomes`: Chromosomes reserved for validation (must not overlap with test)
|
|
176
|
+
- `test_chromosomes`: Chromosomes reserved for testing (must not overlap with validation)
|
|
177
|
+
- `output_dir`: Directory for training outputs
|
|
178
|
+
- `cache_dir`: Cache directory for dataset processing (default: `/scratch`)
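
The chromosome hold-out rules in the list above can be checked before a training run starts. The following is a minimal sketch, assuming the YAML file has already been parsed into a plain `dict`; `check_train_config` and `split_of` are hypothetical helpers written for illustration and are not part of ROCIT's API.

```python
# Hypothetical pre-flight check for a parsed train_config.yaml (not ROCIT code).
# cache_dir is left out of the required tuple because the list documents a default for it.
REQUIRED_TRAIN_FIELDS = (
    "sample_id", "labelled_data", "sample_distribution", "cell_atlas",
    "val_chromosomes", "test_chromosomes", "output_dir",
)


def check_train_config(config: dict) -> None:
    """Raise ValueError if a field is missing or the hold-out sets overlap."""
    missing = [name for name in REQUIRED_TRAIN_FIELDS if name not in config]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    overlap = set(config["val_chromosomes"]) & set(config["test_chromosomes"])
    if overlap:
        raise ValueError(f"chromosomes in both val and test: {sorted(overlap)}")


def split_of(chromosome: str, config: dict) -> str:
    """Return the split ("train", "val" or "test") for a read on `chromosome`."""
    if chromosome in config["test_chromosomes"]:
        return "test"
    if chromosome in config["val_chromosomes"]:
        return "val"
    return "train"
```

With the template above, reads on `chr22` would fall into the test split, reads on `chr20` and `chr21` into validation, and everything else into training.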
|
|
179
|
+
|
|
180
|
+
### Prediction Configuration (`predict_config.yaml`)
|
|
181
|
+
|
|
182
|
+
```yaml
|
|
183
|
+
sample_id: "SAMPLE01"
|
|
184
|
+
best_checkpoint_path: "training/SAMPLE01/version_0/checkpoints/best-checkpoint.ckpt"
|
|
185
|
+
sample_distribution: "distribution/SAMPLE01_methylation_distribution.parquet"
|
|
186
|
+
cell_atlas: "reference/cell_atlas.parquet"
|
|
187
|
+
read_store_dir: "methylation/" # OR use read_store for single file
|
|
188
|
+
output_dir: "predictions/"
|
|
189
|
+
cache_dir: "/scratch/"
|
|
190
|
+
```
|
|
191
|
+
|
|
192
|
+
**Required fields:**
|
|
193
|
+
- `sample_id`: Unique identifier for the sample
|
|
194
|
+
- `best_checkpoint_path`: Path to trained model checkpoint
|
|
195
|
+
- `sample_distribution`: Path to sample methylation distribution
|
|
196
|
+
- `cell_atlas`: Path to cell-type methylation reference
|
|
197
|
+
- `read_store` OR `read_store_dir`: Single file or directory of methylation parquet files
|
|
198
|
+
- `output_dir`: Directory for prediction outputs
|
|
199
|
+
- `cache_dir`: Cache directory (default: `/scratch`)
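
The `read_store` / `read_store_dir` choice can be resolved into a list of Parquet paths up front. This sketch assumes that exactly one of the two keys should be set and that a directory is expanded by matching `*.parquet`; both are assumptions, and `resolve_read_store` is a hypothetical helper rather than a ROCIT function.

```python
from pathlib import Path


def resolve_read_store(config: dict) -> list[Path]:
    """Return the methylation Parquet files named by a parsed predict config."""
    has_file = "read_store" in config
    has_dir = "read_store_dir" in config
    # Assumption: the two keys are mutually exclusive.
    if has_file == has_dir:
        raise ValueError("set exactly one of read_store or read_store_dir")
    if has_file:
        return [Path(config["read_store"])]
    # Assumption: every *.parquet file in the directory is a read store.
    return sorted(Path(config["read_store_dir"]).glob("*.parquet"))
```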
|
|
200
|
+
|
|
201
|
+
### Preprocessing Configuration (`preprocess_config.yaml`)
|
|
202
|
+
|
|
203
|
+
```yaml
|
|
204
|
+
sample_id: "SAMPLE01"
|
|
205
|
+
bam: "data/aligned.bam"
|
|
206
|
+
methylation_dir: "methylation/"
|
|
207
|
+
copy_number: "data/copy_number_segments.parquet"
|
|
208
|
+
variants: "data/somatic_variants.parquet"
|
|
209
|
+
haplotags: "data/haplotags.parquet"
|
|
210
|
+
haploblocks: "data/haploblocks.parquet"
|
|
211
|
+
snv_clusters: "data/snv_clusters.parquet"
|
|
212
|
+
snv_cluster_assignments: "data/snv_cluster_assignments.parquet"
|
|
213
|
+
output_dir: "preprocessing/"
|
|
214
|
+
```
|
|
215
|
+
|
|
216
|
+
**Required fields:**
|
|
217
|
+
- `sample_id`: Unique identifier for the sample
|
|
218
|
+
- `bam`: Path to aligned BAM file with methylation tags
|
|
219
|
+
- `methylation_dir`: Directory containing extracted methylation data
|
|
220
|
+
- `copy_number`: Path to copy number segments file
|
|
221
|
+
- `variants`: Path to somatic variants file
|
|
222
|
+
- `haplotags`: Path to read haplotype assignments
|
|
223
|
+
- `haploblocks`: Path to phased haplotype blocks
|
|
224
|
+
- `snv_clusters`: Path to SNV clusters file (cluster IDs with their cancer cell fractions)
|
|
225
|
+
- `output_dir`: Directory for preprocessing outputs
|
|
226
|
+
|
|
227
|
+
**Optional fields:**
|
|
228
|
+
- `snv_cluster_assignments`: Path to SNV cluster assignments (if not provided, will be inferred)
|
|
229
|
+
|
|
230
|
+
### Full Pipeline Configuration (`run_config.yaml`)
|
|
231
|
+
|
|
232
|
+
```yaml
|
|
233
|
+
sample_id: "SAMPLE01"
|
|
234
|
+
bam: "data/aligned.bam"
|
|
235
|
+
bam_index: "data/aligned.bam.bai"
|
|
236
|
+
copy_number: "data/copy_number_segments.parquet"
|
|
237
|
+
variants: "data/somatic_variants.parquet"
|
|
238
|
+
haplotags: "data/haplotags.parquet"
|
|
239
|
+
haploblocks: "data/haploblocks.parquet"
|
|
240
|
+
snv_clusters: "data/snv_clusters.parquet"
|
|
241
|
+
snv_cluster_assignments: "data/snv_cluster_assignments.parquet"
|
|
242
|
+
cell_atlas: "reference/cell_atlas.parquet"
|
|
243
|
+
val_chromosomes: ["chr20", "chr21"]
|
|
244
|
+
test_chromosomes: ["chr22"]
|
|
245
|
+
min_mapq: 0
|
|
246
|
+
workers: 8
|
|
247
|
+
output_dir: "output/"
|
|
248
|
+
cache_dir: "/scratch/"
|
|
249
|
+
```
|
|
250
|
+
|
|
251
|
+
**Required fields:**
|
|
252
|
+
- `sample_id`: Unique identifier for the sample
|
|
253
|
+
- `bam`: Path to aligned BAM file with methylation tags
|
|
254
|
+
- `copy_number`: Path to copy number segments file
|
|
255
|
+
- `variants`: Path to somatic variants file
|
|
256
|
+
- `haplotags`: Path to read haplotype assignments
|
|
257
|
+
- `haploblocks`: Path to phased haplotype blocks
|
|
258
|
+
- `snv_clusters`: Path to SNV clusters file (cluster IDs with their cancer cell fractions)
|
|
259
|
+
- `cell_atlas`: Path to cell-type methylation reference
|
|
260
|
+
- `val_chromosomes`: Chromosomes reserved for validation (must not overlap with test)
|
|
261
|
+
- `test_chromosomes`: Chromosomes reserved for testing (must not overlap with validation)
|
|
262
|
+
- `output_dir`: Directory for all pipeline outputs
|
|
263
|
+
- `cache_dir`: Cache directory for dataset processing
|
|
264
|
+
|
|
265
|
+
**Optional fields:**
|
|
266
|
+
- `bam_index`: BAM index file (auto-detected if not provided)
|
|
267
|
+
- `snv_cluster_assignments`: Path to SNV cluster assignments (if not provided, will be inferred)
|
|
268
|
+
- `chromosomes`: Specific chromosomes to process (defaults to chr1-chrY)
|
|
269
|
+
- `min_mapq`: Minimum mapping quality for reads (default: 0)
|
|
270
|
+
- `workers`: Number of parallel workers for BAM processing (default: 1)
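
The documented defaults for the optional fields can be written down explicitly. This is a sketch only: it assumes "chr1-chrY" means `chr1` to `chr22` plus `chrX` and `chrY`, and `with_defaults` is a hypothetical helper, not ROCIT's own config loader. `bam_index` and `snv_cluster_assignments` are left unset because the list says they are auto-detected or inferred.

```python
# Assumption: "chr1-chrY" expands to the 22 autosomes followed by chrX and chrY.
DEFAULT_CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

OPTIONAL_DEFAULTS = {
    "chromosomes": DEFAULT_CHROMOSOMES,
    "min_mapq": 0,
    "workers": 1,
}


def with_defaults(config: dict) -> dict:
    """Return a copy of `config` with documented defaults filled in."""
    # Keys present in `config` take precedence over the defaults.
    return {**OPTIONAL_DEFAULTS, **config}
```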
|
|
271
|
+
|
|
272
|
+
## Command Reference
|
|
273
|
+
|
|
274
|
+
### `rocit run`
|
|
275
|
+
|
|
276
|
+
Run the complete ROCIT pipeline from BAM to predictions.
|
|
277
|
+
|
|
278
|
+
```bash
|
|
279
|
+
rocit run --config run_config.yaml
|
|
280
|
+
```
|
|
281
|
+
|
|
282
|
+
**Pipeline steps:**
|
|
283
|
+
1. Extract BAM methylation
|
|
284
|
+
2. Compute CpG distribution
|
|
285
|
+
3. Label reads using somatic variants
|
|
286
|
+
4. Train classification model
|
|
287
|
+
5. Generate predictions
|
|
288
|
+
|
|
289
|
+
**Outputs:**
|
|
290
|
+
- `output/methylation/`: Per-chromosome methylation data
|
|
291
|
+
- `output/distribution/`: Sample methylation distribution
|
|
292
|
+
- `output/preprocessing/`: Labelled reads and methylation data
|
|
293
|
+
- `output/training/`: Model checkpoints and training logs
|
|
294
|
+
- `output/predictions/`: Final tumor/normal predictions
|
|
295
|
+
|
|
296
|
+
### `rocit train`
|
|
297
|
+
|
|
298
|
+
Train a ROCIT classification model.
|
|
299
|
+
|
|
300
|
+
```bash
|
|
301
|
+
rocit train --config train_config.yaml
|
|
302
|
+
```
|
|
303
|
+
|
|
304
|
+
**Outputs:**
|
|
305
|
+
- `{output_dir}/{sample_id}/version_X/checkpoints/best-checkpoint.ckpt`: Best model
|
|
306
|
+
- `{output_dir}/{sample_id}/version_X/metrics.csv`: Training metrics (loss, AUROC, etc.)
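
Because each run writes to a numbered `version_X` directory, a small helper can locate the checkpoint path to pass as `best_checkpoint_path`. The sketch below assumes the highest version number is the most recent run; `latest_checkpoint` is a hypothetical name, not something ROCIT provides.

```python
from pathlib import Path


def latest_checkpoint(output_dir: Path, sample_id: str) -> Path:
    """Return the best-checkpoint path under the highest-numbered version_X."""
    run_dir = Path(output_dir) / sample_id
    versions = []
    for path in run_dir.glob("version_*"):
        suffix = path.name.removeprefix("version_")
        # Compare numerically so that version_10 sorts after version_2.
        if path.is_dir() and suffix.isdigit():
            versions.append((int(suffix), path))
    if not versions:
        raise FileNotFoundError(f"no version_* directories under {run_dir}")
    _, newest = max(versions)
    return newest / "checkpoints" / "best-checkpoint.ckpt"
```

The function only builds the path; it does not check that the checkpoint file itself exists.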
|
|
307
|
+
|
|
308
|
+
**Training parameters** (modifiable in code via `TrainingParams` in the Python API):
|
|
309
|
+
- Model architecture: 384-dim, 6 heads, 3 layers
|
|
310
|
+
- Max epochs: 100 with early stopping (patience=10)
|
|
311
|
+
- Learning rate: 1e-4 with warmup
|
|
312
|
+
- Batch size: 256
|
|
313
|
+
- Early stopping metric: validation AUROC
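
To make the interaction between the epoch cap and the patience setting concrete, here is a plain-Python sketch of patience-based early stopping on a higher-is-better metric such as validation AUROC. It illustrates the general technique with the values listed above; it is not ROCIT's training loop, and `early_stop_epoch` is a hypothetical function.

```python
def early_stop_epoch(val_auroc: list[float], patience: int = 10,
                     max_epochs: int = 100) -> int:
    """Return the 0-based index of the last epoch that would run."""
    best = float("-inf")
    epochs_without_improvement = 0
    last_epoch = -1
    for epoch, score in enumerate(val_auroc[:max_epochs]):
        last_epoch = epoch
        if score > best:
            # A new best score resets the patience counter.
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return last_epoch
```

If the best validation AUROC is reached at epoch 2 and never improved, training under this rule would end after epoch 12, ten non-improving epochs later.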
|
|
314
|
+
|
|
315
|
+
### `rocit predict`
|
|
316
|
+
|
|
317
|
+
Generate predictions using a trained model.
|
|
318
|
+
|
|
319
|
+
```bash
|
|
320
|
+
rocit predict --config predict_config.yaml
|
|
321
|
+
```
|
|
322
|
+
|
|
323
|
+
**Outputs:**
|
|
324
|
+
- `{output_dir}/{sample_id}_tumor_origin_predictions.parquet`: Read-level predictions with columns:
|
|
325
|
+
  - `read_index`: Unique read identifier
|
|
326
|
+
  - `chromosome`: Chromosome name
|
|
327
|
+
  - `tumor_probability`: Predicted probability of tumor origin (0-1)
|
|
328
|
+
|
|
329
|
+
### `rocit preprocess`
|
|
330
|
+
|
|
331
|
+
Label reads for training using somatic variant information.
|
|
332
|
+
|
|
333
|
+
```bash
|
|
334
|
+
rocit preprocess --config preprocess_config.yaml
|
|
335
|
+
```
|
|
336
|
+
|
|
337
|
+
**Outputs:**
|
|
338
|
+
- `{output_dir}/labelled_reads.parquet`: Read labels (tumor/normal)
|
|
339
|
+
- `{output_dir}/labelled_methylation_data.parquet`: Methylation data with labels
|
|
340
|
+
|
|
341
|
+
|
|
342
|
+
### `rocit extract-bam-methylation`
|
|
343
|
+
|
|
344
|
+
Extract CpG methylation from PacBio BAM files.
|
|
345
|
+
|
|
346
|
+
```bash
|
|
347
|
+
rocit extract-bam-methylation \
|
|
348
|
+
--sample-id SAMPLE01 \
|
|
349
|
+
--sample-bam aligned.bam \
|
|
350
|
+
--output-dir methylation/ \
|
|
351
|
+
--workers 8 \
|
|
352
|
+
--min-mapq 0 \
|
|
353
|
+
--chromosomes "chr1 chr2 chr3"
|
|
354
|
+
```
|
|
355
|
+
|
|
356
|
+
**Options:**
|
|
357
|
+
- `--sample-id`: Sample identifier for output naming
|
|
358
|
+
- `--sample-bam`: Input BAM file with MM/ML tags
|
|
359
|
+
- `--output-dir`: Output directory
|
|
360
|
+
- `--index`: BAM index file (optional, auto-detected)
|
|
361
|
+
- `--min-mapq`: Minimum mapping quality (default: 0)
|
|
362
|
+
- `--workers`: Number of parallel workers (default: 1)
|
|
363
|
+
- `--chromosomes`: Space-separated chromosomes to process (default: chr1-chrY)
|
|
364
|
+
|
|
365
|
+
**Outputs:**
|
|
366
|
+
- `{output_dir}/{chromosome}_cpg_methylation_data.parquet` for each chromosome
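
Since extraction writes one file per chromosome, it can be useful to confirm that every requested chromosome produced an output before moving on. This sketch relies only on the file-name pattern stated above; the helper names are hypothetical.

```python
from pathlib import Path


def expected_methylation_files(output_dir, chromosomes: list[str]) -> dict[str, Path]:
    """Map each chromosome to the output path the naming pattern implies."""
    return {
        chrom: Path(output_dir) / f"{chrom}_cpg_methylation_data.parquet"
        for chrom in chromosomes
    }


def missing_chromosomes(output_dir, chromosomes: list[str]) -> list[str]:
    """Return the chromosomes whose expected Parquet file does not exist."""
    expected = expected_methylation_files(output_dir, chromosomes)
    return [chrom for chrom, path in expected.items() if not path.is_file()]
```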
|
|
367
|
+
|
|
368
|
+
### `rocit extract-cpg-distribution`
|
|
369
|
+
|
|
370
|
+
Aggregate methylation distribution from extracted data.
|
|
371
|
+
|
|
372
|
+
```bash
|
|
373
|
+
rocit extract-cpg-distribution \
|
|
374
|
+
--sample-id SAMPLE01 \
|
|
375
|
+
--methylation-dir methylation/ \
|
|
376
|
+
--output-dir distribution/
|
|
377
|
+
```
|
|
378
|
+
|
|
379
|
+
**Outputs:**
|
|
380
|
+
- `{output_dir}/{sample_id}_methylation_distribution.parquet`:
|
|
381
|
+
An aggregated distribution of methylation values across the sample, used for model context.
|
|
382
|
+
|
|
383
|
+
## Output Files
|
|
384
|
+
|
|
385
|
+
### Prediction Output
|
|
386
|
+
|
|
387
|
+
The primary output from ROCIT is a parquet file with read-level predictions:
|
|
388
|
+
|
|
389
|
+
```python
|
|
390
|
+
import polars as pl
|
|
391
|
+
|
|
392
|
+
predictions = pl.read_parquet("predictions/SAMPLE01_tumor_origin_predictions.parquet")
|
|
393
|
+
print(predictions.head())
|
|
394
|
+
|
|
395
|
+
# Example output:
|
|
396
|
+
# ┌────────────┬────────────┬───────────────────┐
|
|
397
|
+
# │ read_index │ chromosome │ tumor_probability │
|
|
398
|
+
# ├────────────┼────────────┼───────────────────┤
|
|
399
|
+
# │ 1001 │ chr1 │ 0.87 │
|
|
400
|
+
# │ 1002 │ chr1 │ 0.12 │
|
|
401
|
+
# │ 1003 │ chr1 │ 0.94 │
|
|
402
|
+
# └────────────┴────────────┴───────────────────┘
|
|
403
|
+
```
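
The probabilities can be turned into hard calls and summarised per chromosome. The sketch below works on plain dictionaries (for example the result of polars' `DataFrame.to_dicts()`); the 0.5 cut-off is an illustrative assumption, not a threshold recommended by ROCIT, and `call_reads` is a hypothetical helper.

```python
from collections import Counter


def call_reads(rows: list[dict], threshold: float = 0.5) -> dict[str, Counter]:
    """Count tumor and normal calls per chromosome at a probability cut-off."""
    counts: dict[str, Counter] = {}
    for row in rows:
        label = "tumor" if row["tumor_probability"] >= threshold else "normal"
        counts.setdefault(row["chromosome"], Counter())[label] += 1
    return counts


# Rows copied from the example output above.
rows = [
    {"read_index": 1001, "chromosome": "chr1", "tumor_probability": 0.87},
    {"read_index": 1002, "chromosome": "chr1", "tumor_probability": 0.12},
    {"read_index": 1003, "chromosome": "chr1", "tumor_probability": 0.94},
]
summary = call_reads(rows)
```

A stricter threshold trades recall for precision, so the appropriate value depends on the downstream analysis.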
|
|
404
|
+
|
|
405
|
+
### Training Metrics
|
|
406
|
+
|
|
407
|
+
Training progress is logged to CSV:
|
|
408
|
+
|
|
409
|
+
```python
|
|
410
|
+
import polars as pl

metrics = pl.read_csv("training/SAMPLE01/version_0/metrics.csv")
|
|
411
|
+
# Contains: epoch, train_loss, train_auroc, val_loss, val_auroc, etc.
|
|
412
|
+
```
|
|
413
|
+
|
|
414
|
+
## Model Architecture
|
|
415
|
+
|
|
416
|
+
ROCIT uses a transformer-based architecture designed for long-read methylation data:
|
|
417
|
+
|
|
418
|
+
- Input: CpG methylation patterns, cell atlas features, sample distribution features
|
|
419
|
+
## Python API
|
|
420
|
+
|
|
421
|
+
ROCIT can also be used programmatically:
|
|
422
|
+
|
|
423
|
+
```python
|
|
424
|
+
import rocit
|
|
425
|
+
from pathlib import Path
|
|
426
|
+
|
|
427
|
+
# Training
|
|
428
|
+
train_result = rocit.train(
|
|
429
|
+
sample_id="SAMPLE01",
|
|
430
|
+
labelled_data=labelled_df,
|
|
431
|
+
sample_distribution=distribution_df,
|
|
432
|
+
cell_atlas=atlas_df,
|
|
433
|
+
val_chromosomes=["chr20", "chr21"],
|
|
434
|
+
test_chromosomes=["chr22"],
|
|
435
|
+
output_dir=Path("training/"),
|
|
436
|
+
cache_dir=Path("/scratch/")
|
|
437
|
+
)
|
|
438
|
+
|
|
439
|
+
# Prediction
|
|
440
|
+
predictions = rocit.predict(
|
|
441
|
+
sample_id="SAMPLE01",
|
|
442
|
+
best_checkpoint_path=Path("training/best-checkpoint.ckpt"),
|
|
443
|
+
read_store=[methylation_lazy_df], # List of polars DataFrames or LazyFrames
|
|
444
|
+
sample_distribution=distribution_df,
|
|
445
|
+
cell_atlas=atlas_df,
|
|
446
|
+
output_dir=Path("predictions/"),
|
|
447
|
+
cache_dir=Path("/scratch/")
|
|
448
|
+
)
|
|
449
|
+
```
|
|
450
|
+
|
|
451
|
+
## Generating the Cell Atlas from Source
|
|
452
|
+
|
|
453
|
+
If you prefer to build the cell-type methylation atlas yourself rather than using the pre-computed version, you can use the provided generation script. This process downloads and processes whole-genome bisulfite sequencing data from GEO accession **GSE186458**, which contains methylation profiles across diverse normal human cell types.
|
|
454
|
+
|
|
455
|
+
### Requirements
|
|
456
|
+
|
|
457
|
+
```bash
|
|
458
|
+
pip install pyBigWig polars tqdm
|
|
459
|
+
```
|
|
460
|
+
|
|
461
|
+
### Usage
|
|
462
|
+
|
|
463
|
+
The script provides two modes: automatic download and processing, or processing from pre-downloaded files.
|
|
464
|
+
|
|
465
|
+
**Automatic Download and Processing**
|
|
466
|
+
|
|
467
|
+
This will download ~328 GB of raw data from GEO:
|
|
468
|
+
|
|
469
|
+
```bash
|
|
470
|
+
python setup_scripts/generate_cell_map_df.py \
|
|
471
|
+
--download /path/to/download_directory/ \
|
|
472
|
+
--output reference/cell_atlas.parquet
|
|
473
|
+
```
|
|
474
|
+
|
|
475
|
+
You will be prompted to confirm before the download begins.
|
|
476
|
+
|
|
477
|
+
**Process Pre-Downloaded Files**
|
|
478
|
+
|
|
479
|
+
If you already have the bigwig files:
|
|
480
|
+
|
|
481
|
+
```bash
|
|
482
|
+
python setup_scripts/generate_cell_map_df.py \
|
|
483
|
+
--data-dir /path/to/extracted_bigwig_files/ \
|
|
484
|
+
--output reference/cell_atlas.parquet
|
|
485
|
+
```
|
|
486
|
+
|
|
487
|
+
### What the Script Does
|
|
488
|
+
|
|
489
|
+
1. **Downloads** (if using `--download`): Fetches the GSE186458 tar archive from NCBI GEO
|
|
490
|
+
2. **Extracts**: Unpacks `*.hg38.bigwig` files containing methylation data per cell type
|
|
491
|
+
3. **Processes**: For each cell type, aggregates methylation values across biological replicates
|
|
492
|
+
4. **Combines**: Joins all cell types into a single reference atlas
|
|
493
|
+
5. **Outputs**: Saves a Parquet file with columns:
|
|
494
|
+
   - `chromosome`: chr1-chr22, chrX
|
|
495
|
+
   - `position`: CpG genomic position
|
|
496
|
+
   - `average_methylation_{cell_type}`: Mean methylation value (0-1) for each cell type
|
|
497
|
+
|
|
498
|
+
The resulting atlas enables ROCIT to contextualize read-level methylation patterns using cell-type-specific reference signatures.
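
The replicate-averaging and joining steps can be illustrated on small in-memory data. This is a sketch of the idea only, not the logic of `generate_cell_map_df.py`: it assumes each replicate is a mapping from `(chromosome, position)` to a methylation value in 0-1, and that a CpG missing from some replicates is averaged over the replicates that do cover it. `build_atlas` is a hypothetical name.

```python
from collections import defaultdict
from statistics import fmean


def build_atlas(replicates_by_cell_type: dict) -> dict:
    """Average replicates per cell type and join cell types per CpG site."""
    # values[site][cell_type] collects one methylation value per replicate.
    values = defaultdict(lambda: defaultdict(list))
    for cell_type, replicates in replicates_by_cell_type.items():
        for replicate in replicates:
            for site, methylation in replicate.items():
                values[site][cell_type].append(methylation)
    atlas = {}
    for site in sorted(values):
        atlas[site] = {
            f"average_methylation_{cell_type}": fmean(observed)
            for cell_type, observed in values[site].items()
        }
    return atlas
```

Each key of the result corresponds to one atlas row, and each inner key to one `average_methylation_{cell_type}` column.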
|
|
499
|
+
|
|
500
|
+
### Dataset Information
|
|
501
|
+
|
|
502
|
+
**GSE186458** contains whole-genome bisulfite sequencing (WGBS) data from normal human tissues and cell types. Each cell type typically has multiple biological replicates, which the script averages to produce robust methylation estimates.
|
|
503
|
+
|
|
504
|
+
> **Citation:** Loyfer, N., et al. (2023). [A DNA methylation atlas of normal human cell types](https://doi.org/10.1038/s41586-022-05580-6). *Nature*.
|
|
505
|
+
|
|
506
|
+
---
|
|
507
|
+
|
|
508
|
+
## Data Format Specifications
|
|
509
|
+
|
|
510
|
+
### Copy Number Segments
|
|
511
|
+
|
|
512
|
+
Required columns:
|
|
513
|
+
- `chromosome`: Chromosome name (e.g., "chr1")
|
|
514
|
+
- `start`: Segment start position
|
|
515
|
+
- `end`: Segment end position
|
|
516
|
+
- `minor_cn`: Minor allele copy number
|
|
517
|
+
- `major_cn`: Major allele copy number
|
|
518
|
+
- `total_cn`: Total copy number
|
|
519
|
+
- `purity`: Tumor purity estimate
|
|
520
|
+
- `normal_total_cn`: Normal total copy number (typically 2 except for chrX/chrY in XY subjects)
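
These columns are enough to sanity-check a segment and to estimate what share of reads at a locus comes from tumor cells. The sketch below uses the standard purity and copy-number mixing relationship and treats a segment with `minor_cn == 0` as LOH; both are stated as assumptions about how such a table is typically interpreted, not as ROCIT's labelling code, and the function names are hypothetical.

```python
def check_segment(segment: dict) -> None:
    """Raise ValueError if a copy number segment is internally inconsistent."""
    # Assumption: total_cn is the sum of the two allele-specific copy numbers.
    if segment["minor_cn"] + segment["major_cn"] != segment["total_cn"]:
        raise ValueError("minor_cn + major_cn must equal total_cn")
    if segment["minor_cn"] > segment["major_cn"]:
        raise ValueError("minor_cn must not exceed major_cn")
    if not 0.0 <= segment["purity"] <= 1.0:
        raise ValueError("purity must lie in [0, 1]")


def is_loh(segment: dict) -> bool:
    """Assumption: LOH means one allele is lost while the other is retained."""
    return segment["minor_cn"] == 0 and segment["major_cn"] > 0


def expected_tumor_read_fraction(purity: float, total_cn: int,
                                 normal_total_cn: int = 2) -> float:
    """Expected fraction of reads at a locus that derive from tumor cells."""
    tumor_dna = purity * total_cn
    normal_dna = (1.0 - purity) * normal_total_cn
    return tumor_dna / (tumor_dna + normal_dna)
```

For example, at 50% purity a diploid tumor segment contributes half the reads, while a four-copy segment contributes two thirds.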
|
|
521
|
+
|
|
522
|
+
### Somatic Variants
|
|
523
|
+
|
|
524
|
+
Required columns:
|
|
525
|
+
- `chromosome`: Chromosome name
|
|
526
|
+
- `position`: Variant position
|
|
527
|
+
- `ref`: Reference allele
|
|
528
|
+
- `alt`: Alternate allele
|
|
529
|
+
- Additional variant metadata as needed
|
|
530
|
+
|
|
531
|
+
### Haplotags
|
|
532
|
+
|
|
533
|
+
Required columns:
|
|
534
|
+
- `read_index`: Unique read identifier
|
|
535
|
+
- `chromosome`: Chromosome name
|
|
536
|
+
- `haplotag`: Haplotype assignment (1 or 2)
|
|
537
|
+
- `start`: Read start position
|
|
538
|
+
- `end`: Read end position
|
|
539
|
+
|
|
540
|
+
### Haploblocks
|
|
541
|
+
|
|
542
|
+
Required columns:
|
|
543
|
+
- `chromosome`: Chromosome name
|
|
544
|
+
- `block_start`: Block start position
|
|
545
|
+
- `block_end`: Block end position
|
|
546
|
+
- `block_size`: Size of phased block
|
|
547
|
+
- `haploblock_id`: Unique block identifier
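
Haplotag rows and haploblock rows share chromosome coordinates, so a read can be matched to the phased block that contains it. The following sketch assumes that blocks on a chromosome do not overlap and that a read belongs to a block only when it lies entirely inside it; those rules and the helper names are assumptions for illustration, not ROCIT's behaviour.

```python
from bisect import bisect_right


def index_blocks(haploblocks: list[dict]) -> dict[str, list[tuple]]:
    """Group blocks by chromosome and sort them by start position."""
    by_chrom: dict[str, list[tuple]] = {}
    for block in haploblocks:
        by_chrom.setdefault(block["chromosome"], []).append(
            (block["block_start"], block["block_end"], block["haploblock_id"])
        )
    for blocks in by_chrom.values():
        blocks.sort()
    return by_chrom


def block_for_read(read: dict, index: dict[str, list[tuple]]):
    """Return the haploblock_id that fully contains the read, or None."""
    blocks = index.get(read["chromosome"], [])
    starts = [start for start, _, _ in blocks]
    # Last block whose start is at or before the read start.
    i = bisect_right(starts, read["start"]) - 1
    if i >= 0 and read["end"] <= blocks[i][1]:
        return blocks[i][2]
    return None
```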
|
|
548
|
+
|
|
549
|
+
### SNV Clusters
|
|
550
|
+
|
|
551
|
+
Required columns:
|
|
552
|
+
- `cluster_id`: Unique cluster identifier
|
|
553
|
+
- `cluster_ccf`: Cancer cell fraction for the cluster (0-1)
|
|
554
|
+
- `cluster_fraction`: Fraction of variants assigned to this cluster (0-1)
|
|
555
|
+
|
|
556
|
+
### SNV Cluster Assignments
|
|
557
|
+
|
|
558
|
+
This file is optional. If not provided, cluster assignments will be inferred using a binomial model.
|
|
559
|
+
|
|
560
|
+
Required columns:
|
|
561
|
+
- `chromosome`: Chromosome name
|
|
562
|
+
- `position`: Variant position
|
|
563
|
+
- `cluster_id`: Cluster identifier (must match IDs in SNV Clusters)
|
|
564
|
+
- `n_copies`: Number of allelic copies of the variant.
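
When this file is omitted, assignments are inferred with a binomial model. One common way to set up such a model is sketched below: each cluster implies an expected variant allele fraction, and a variant goes to the cluster that maximises the prior-weighted binomial likelihood of its read counts. This is an assumption about the general approach, not ROCIT's implementation; the per-variant `alt_reads` and `total_reads` inputs and all function names are hypothetical.

```python
from math import comb, log


def expected_vaf(ccf: float, n_copies: int, purity: float,
                 total_cn: int, normal_total_cn: int = 2) -> float:
    """Expected variant allele fraction for a variant in a given cluster."""
    return purity * ccf * n_copies / (
        purity * total_cn + (1.0 - purity) * normal_total_cn
    )


def binomial_log_pmf(k: int, n: int, p: float) -> float:
    """Log-probability of k successes in n trials with success rate p."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)  # keep the logarithms finite
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1.0 - p)


def assign_cluster(alt_reads: int, total_reads: int, clusters: list[dict],
                   purity: float, total_cn: int,
                   normal_total_cn: int = 2, n_copies: int = 1):
    """Return the cluster_id with the highest prior-weighted likelihood."""
    best_id, best_score = None, float("-inf")
    for cluster in clusters:
        if cluster["cluster_fraction"] <= 0:
            continue  # a cluster with no prior weight cannot be chosen
        vaf = expected_vaf(cluster["cluster_ccf"], n_copies, purity,
                           total_cn, normal_total_cn)
        score = log(cluster["cluster_fraction"]) + binomial_log_pmf(
            alt_reads, total_reads, vaf
        )
        if score > best_score:
            best_id, best_score = cluster["cluster_id"], score
    return best_id
```

A fuller model would also consider several candidate values of `n_copies` per variant; the sketch fixes it to keep the example short.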
|
|
565
|
+
|
|
566
|
+
|
|
567
|
+
## License
|
|
568
|
+
|
|
569
|
+
ROCIT is licensed under the BSD 3-Clause License. See the [LICENSE](LICENSE) file for details.
|
|
570
|
+
|