rocit 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. rocit-0.1.0/LICENSE +28 -0
  2. rocit-0.1.0/PKG-INFO +570 -0
  3. rocit-0.1.0/README.md +509 -0
  4. rocit-0.1.0/pyproject.toml +54 -0
  5. rocit-0.1.0/setup.cfg +4 -0
  6. rocit-0.1.0/src/rocit/__init__.py +12 -0
  7. rocit-0.1.0/src/rocit/cli.py +464 -0
  8. rocit-0.1.0/src/rocit/config.py +213 -0
  9. rocit-0.1.0/src/rocit/constants.py +8 -0
  10. rocit-0.1.0/src/rocit/data/__init__.py +8 -0
  11. rocit-0.1.0/src/rocit/data/datamodule.py +78 -0
  12. rocit-0.1.0/src/rocit/data/dataset.py +274 -0
  13. rocit-0.1.0/src/rocit/models/__init__.py +6 -0
  14. rocit-0.1.0/src/rocit/models/lightning_module.py +162 -0
  15. rocit-0.1.0/src/rocit/models/model_architecture.py +131 -0
  16. rocit-0.1.0/src/rocit/pipeline.py +202 -0
  17. rocit-0.1.0/src/rocit/preprocessing/__init__.py +5 -0
  18. rocit-0.1.0/src/rocit/preprocessing/bam_tools.py +93 -0
  19. rocit-0.1.0/src/rocit/preprocessing/extract_pacbio_cpg_info.py +261 -0
  20. rocit-0.1.0/src/rocit/preprocessing/loh_data_labeller.py +148 -0
  21. rocit-0.1.0/src/rocit/preprocessing/prepare_somatic_data.py +155 -0
  22. rocit-0.1.0/src/rocit/preprocessing/process_cpg_distribution.py +37 -0
  23. rocit-0.1.0/src/rocit/preprocessing/snv_data_labeller.py +99 -0
  24. rocit-0.1.0/src/rocit/preprocessing/tumor_data_labeller.py +74 -0
  25. rocit-0.1.0/src/rocit/preprocessing/variant_processing.py +103 -0
  26. rocit-0.1.0/src/rocit/utils/__init__.py +0 -0
  27. rocit-0.1.0/src/rocit/validation.py +146 -0
  28. rocit-0.1.0/src/rocit.egg-info/PKG-INFO +570 -0
  29. rocit-0.1.0/src/rocit.egg-info/SOURCES.txt +31 -0
  30. rocit-0.1.0/src/rocit.egg-info/dependency_links.txt +1 -0
  31. rocit-0.1.0/src/rocit.egg-info/entry_points.txt +2 -0
  32. rocit-0.1.0/src/rocit.egg-info/requires.txt +11 -0
  33. rocit-0.1.0/src/rocit.egg-info/top_level.txt +1 -0
rocit-0.1.0/LICENSE ADDED
@@ -0,0 +1,28 @@
1
+ BSD 3-Clause License
2
+
3
+ Copyright (c) 2026, Toby Baker
4
+
5
+ Redistribution and use in source and binary forms, with or without
6
+ modification, are permitted provided that the following conditions are met:
7
+
8
+ 1. Redistributions of source code must retain the above copyright notice, this
9
+ list of conditions and the following disclaimer.
10
+
11
+ 2. Redistributions in binary form must reproduce the above copyright notice,
12
+ this list of conditions and the following disclaimer in the documentation
13
+ and/or other materials provided with the distribution.
14
+
15
+ 3. Neither the name of the copyright holder nor the names of its
16
+ contributors may be used to endorse or promote products derived from
17
+ this software without specific prior written permission.
18
+
19
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
20
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
21
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
22
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
23
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
24
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
25
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
26
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
27
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
28
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
rocit-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,570 @@
1
+ Metadata-Version: 2.4
2
+ Name: rocit
3
+ Version: 0.1.0
4
+ Summary: Deep learning classifier for tumor-derived long reads using CpG methylation
5
+ Author-email: Toby Baker <tobybaker97@gmail.com>
6
+ License: BSD 3-Clause License
7
+
8
+ Copyright (c) 2026, Toby Baker
9
+
10
+ Redistribution and use in source and binary forms, with or without
11
+ modification, are permitted provided that the following conditions are met:
12
+
13
+ 1. Redistributions of source code must retain the above copyright notice, this
14
+ list of conditions and the following disclaimer.
15
+
16
+ 2. Redistributions in binary form must reproduce the above copyright notice,
17
+ this list of conditions and the following disclaimer in the documentation
18
+ and/or other materials provided with the distribution.
19
+
20
+ 3. Neither the name of the copyright holder nor the names of its
21
+ contributors may be used to endorse or promote products derived from
22
+ this software without specific prior written permission.
23
+
24
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
25
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
27
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
28
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
30
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
31
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
32
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
33
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
34
+
35
+ Project-URL: Homepage, https://github.com/tobybaker/rocit
36
+ Project-URL: Repository, https://github.com/tobybaker/rocit
37
+ Project-URL: Bug Tracker, https://github.com/tobybaker/rocit/issues
38
+ Keywords: bioinformatics,cancer,methylation,long-read,deep-learning,pacbio
39
+ Classifier: Development Status :: 3 - Alpha
40
+ Classifier: Intended Audience :: Science/Research
41
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
42
+ Classifier: License :: OSI Approved :: BSD License
43
+ Classifier: Programming Language :: Python :: 3
44
+ Classifier: Programming Language :: Python :: 3.10
45
+ Classifier: Programming Language :: Python :: 3.11
46
+ Classifier: Programming Language :: Python :: 3.12
47
+ Requires-Python: >=3.10
48
+ Description-Content-Type: text/markdown
49
+ License-File: LICENSE
50
+ Requires-Dist: torch>=2.0.0
51
+ Requires-Dist: pytorch-lightning>=2.0.0
52
+ Requires-Dist: pyyaml>=6.0
53
+ Requires-Dist: numpy>=1.24.0
54
+ Requires-Dist: polars>=1.0.0
55
+ Requires-Dist: pysam>=0.22.0
56
+ Requires-Dist: click>=8.1.0
57
+ Provides-Extra: dev
58
+ Requires-Dist: build>=1.4.0; extra == "dev"
59
+ Requires-Dist: twine>=6.2.0; extra == "dev"
60
+ Dynamic: license-file
61
+
62
+ # ROCIT
63
+
64
+ <div align="center">
65
+ <img src="https://res.cloudinary.com/dhco9jghq/image/upload/v1771031245/rocitlogo_e2huhi.png" alt="ROCIT Logo" width="200"/>
66
+ </div>
67
+
68
+ **ROCIT** (Read Origin Classifier In Tumors) is a deep learning tool that classifies individual sequencing reads from long-read bulk tumor sequencing as tumor-derived or normal-derived. By leveraging CpG methylation patterns, ROCIT enables read-level resolution of tumor heterogeneity from PacBio sequencing data.
69
+
70
+ ROCIT currently supports training and prediction on PacBio HiFi tumor BAMs with CpG methylation probabilities produced by [Jasmine](https://github.com/PacificBiosciences/jasmine). Oxford Nanopore support is planned for a future release.
71
+
72
+
73
+ ### How It Works
74
+
75
+ ROCIT uses a multi-step approach:
76
+
77
+ 1. **Data Preprocessing**: Extracts CpG methylation from PacBio BAM files and labels the origin of a subset of reads based on somatic variants (SNVs) and loss of heterozygosity (LOH) events
78
+ 2. **Input Features**: Combines read-level methylation patterns with cell-type reference atlases and bulk sample methylation distributions
79
+ 3. **Model Training**: Trains a transformer-based neural network to classify the labelled read subset using chromosomal cross-validation
80
+ 4. **Prediction**: Applies the trained model to classify all reads in the sample
81
+
82
+ ## Installation
83
+
84
+ ### Via pip (Recommended)
85
+
86
+ ```bash
87
+ pip install rocit
88
+ ```
89
+
90
+ ### From Source
91
+
92
+ ```bash
93
+ git clone https://github.com/tobybaker/rocit.git
94
+ cd rocit
95
+ pip install -e .
96
+ ```
97
+
98
+ ### Requirements
99
+
100
+ - Python ≥ 3.10
101
+ - PyTorch ≥ 2.9.1
102
+ - PyTorch Lightning ≥ 2.6.0
103
+ - Polars ≥ 1.36.1
104
+ - pysam == 0.22.1
105
+ - Additional dependencies listed in [pyproject.toml](pyproject.toml)
106
+
107
+ ## Quick Start
108
+
109
+ ### Cell Atlas Requirement
110
+
111
+ ROCIT requires a reference cell-type methylation atlas derived from whole-genome bisulfite sequencing data (GSE186458). Download the pre-computed atlas:
112
+
113
+ ```bash
114
+ wget [PLACEHOLDER_URL] -O reference/cell_atlas.parquet
115
+ ```
116
+
117
+ Alternatively, you can [generate the atlas from source](#generating-the-cell-atlas-from-source).
118
+
119
+ > **Citation:** Loyfer, N., et al. (2023). [A DNA methylation atlas of normal human cell types](https://doi.org/10.1038/s41586-022-05580-6). *Nature*.
120
+
121
+ ### Running ROCIT
122
+
123
+ ROCIT provides a complete end-to-end pipeline via the `rocit run` command:
124
+
125
+ ```bash
126
+ rocit run --config config.yaml
127
+ ```
128
+
129
+ For more control, you can run individual steps:
130
+
131
+ ```bash
132
+ # 1. Extract methylation from BAM
133
+ rocit extract-bam-methylation --sample-id SAMPLE01 \
134
+ --sample-bam aligned.bam \
135
+ --output-dir methylation/
136
+
137
+ # 2. Compute methylation distribution
138
+ rocit extract-cpg-distribution --sample-id SAMPLE01 \
139
+ --methylation-dir methylation/ \
140
+ --output-dir distribution/
141
+
142
+ # 3. Label reads for training
143
+ rocit preprocess --config preprocess_config.yaml
144
+
145
+ # 4. Train the model
146
+ rocit train --config train_config.yaml
147
+
148
+ # 5. Run predictions
149
+ rocit predict --config predict_config.yaml
150
+ ```
151
+
152
+
153
+ ## Configuration Files
154
+
155
+ ROCIT uses YAML configuration files for reproducibility and ease of use. Below are templates for each command.
156
+
157
+ ### Training Configuration (`train_config.yaml`)
158
+
159
+ ```yaml
160
+ sample_id: "SAMPLE01"
161
+ labelled_data: "preprocessing/labelled_methylation_data.parquet"
162
+ sample_distribution: "distribution/SAMPLE01_methylation_distribution.parquet"
163
+ cell_atlas: "reference/cell_atlas.parquet"
164
+ val_chromosomes: ["chr20", "chr21"]
165
+ test_chromosomes: ["chr22"]
166
+ output_dir: "training/"
167
+ cache_dir: "/scratch/"
168
+ ```
169
+
170
+ **Required fields:**
171
+ - `sample_id`: Unique identifier for the sample
172
+ - `labelled_data`: Path to labelled methylation data from preprocessing
173
+ - `sample_distribution`: Path to sample methylation distribution
174
+ - `cell_atlas`: Path to cell-type methylation reference
175
+ - `val_chromosomes`: Chromosomes reserved for validation (must not overlap with test)
176
+ - `test_chromosomes`: Chromosomes reserved for testing (must not overlap with validation)
177
+ - `output_dir`: Directory for training outputs
178
+ - `cache_dir`: Cache directory for dataset processing (default: `/scratch`)
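The rule that `val_chromosomes` and `test_chromosomes` must not overlap can be checked before a run is launched. The sketch below is not ROCIT's own validation code: the function names are hypothetical, and it assumes that every chromosome not held out is used for training, which the README does not state explicitly.

```python
# Hypothetical helpers for illustration; not part of the rocit package.
DEFAULT_CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def check_chromosome_split(val_chromosomes, test_chromosomes):
    """Raise ValueError if any chromosome appears in both held-out sets."""
    overlap = sorted(set(val_chromosomes) & set(test_chromosomes))
    if overlap:
        raise ValueError(f"val/test chromosomes overlap: {overlap}")


def training_chromosomes(val_chromosomes, test_chromosomes,
                         all_chromosomes=DEFAULT_CHROMOSOMES):
    """Assumption: everything that is not held out is available for training."""
    check_chromosome_split(val_chromosomes, test_chromosomes)
    held_out = set(val_chromosomes) | set(test_chromosomes)
    return [c for c in all_chromosomes if c not in held_out]


train = training_chromosomes(["chr20", "chr21"], ["chr22"])
print(len(train), train[-2:])
```

Running a check like this on the values loaded from the YAML file catches a mis-specified split before any data is cached.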
179
+
180
+ ### Prediction Configuration (`predict_config.yaml`)
181
+
182
+ ```yaml
183
+ sample_id: "SAMPLE01"
184
+ best_checkpoint_path: "training/SAMPLE01/version_0/checkpoints/best-checkpoint.ckpt"
185
+ sample_distribution: "distribution/SAMPLE01_methylation_distribution.parquet"
186
+ cell_atlas: "reference/cell_atlas.parquet"
187
+ read_store_dir: "methylation/" # OR use read_store for single file
188
+ output_dir: "predictions/"
189
+ cache_dir: "/scratch/"
190
+ ```
191
+
192
+ **Required fields:**
193
+ - `sample_id`: Unique identifier for the sample
194
+ - `best_checkpoint_path`: Path to trained model checkpoint
195
+ - `sample_distribution`: Path to sample methylation distribution
196
+ - `cell_atlas`: Path to cell-type methylation reference
197
+ - `read_store` OR `read_store_dir`: Single file or directory of methylation parquet files
198
+ - `output_dir`: Directory for prediction outputs
199
+ - `cache_dir`: Cache directory (default: `/scratch`)
200
+
201
+ ### Preprocessing Configuration (`preprocess_config.yaml`)
202
+
203
+ ```yaml
204
+ sample_id: "SAMPLE01"
205
+ bam: "data/aligned.bam"
206
+ methylation_dir: "methylation/"
207
+ copy_number: "data/copy_number_segments.parquet"
208
+ variants: "data/somatic_variants.parquet"
209
+ haplotags: "data/haplotags.parquet"
210
+ haploblocks: "data/haploblocks.parquet"
211
+ snv_clusters: "data/snv_clusters.parquet"
212
+ snv_cluster_assignments: "data/snv_cluster_assignments.parquet"
213
+ output_dir: "preprocessing/"
214
+ ```
215
+
216
+ **Required fields:**
217
+ - `sample_id`: Unique identifier for the sample
218
+ - `bam`: Path to aligned BAM file with methylation tags
219
+ - `methylation_dir`: Directory containing extracted methylation data
220
+ - `copy_number`: Path to copy number segments file
221
+ - `variants`: Path to somatic variants file
222
+ - `haplotags`: Path to read haplotype assignments
223
+ - `haploblocks`: Path to phased haplotype blocks
224
+ - `snv_clusters`: Path to SNV clusters file (per-cluster CCF and fraction)
225
+ - `output_dir`: Directory for preprocessing outputs
226
+
227
+ **Optional fields:**
228
+ - `snv_cluster_assignments`: Path to SNV cluster assignments (if not provided, assignments are inferred using a binomial model)
229
+
230
+ ### Full Pipeline Configuration (`run_config.yaml`)
231
+
232
+ ```yaml
233
+ sample_id: "SAMPLE01"
234
+ bam: "data/aligned.bam"
235
+ bam_index: "data/aligned.bam.bai"
236
+ copy_number: "data/copy_number_segments.parquet"
237
+ variants: "data/somatic_variants.parquet"
238
+ haplotags: "data/haplotags.parquet"
239
+ haploblocks: "data/haploblocks.parquet"
240
+ snv_clusters: "data/snv_clusters.parquet"
241
+ snv_cluster_assignments: "data/snv_cluster_assignments.parquet"
242
+ cell_atlas: "reference/cell_atlas.parquet"
243
+ val_chromosomes: ["chr20", "chr21"]
244
+ test_chromosomes: ["chr22"]
245
+ min_mapq: 0
246
+ workers: 8
247
+ output_dir: "output/"
248
+ cache_dir: "/scratch/"
249
+ ```
250
+
251
+ **Required fields:**
252
+ - `sample_id`: Unique identifier for the sample
253
+ - `bam`: Path to aligned BAM file with methylation tags
254
+ - `copy_number`: Path to copy number segments file
255
+ - `variants`: Path to somatic variants file
256
+ - `haplotags`: Path to read haplotype assignments
257
+ - `haploblocks`: Path to phased haplotype blocks
258
+ - `snv_clusters`: Path to SNV clusters file (per-cluster CCF and fraction)
259
+ - `cell_atlas`: Path to cell-type methylation reference
260
+ - `val_chromosomes`: Chromosomes reserved for validation (must not overlap with test)
261
+ - `test_chromosomes`: Chromosomes reserved for testing (must not overlap with validation)
262
+ - `output_dir`: Directory for all pipeline outputs
263
+ - `cache_dir`: Cache directory for dataset processing
264
+
265
+ **Optional fields:**
266
+ - `bam_index`: BAM index file (auto-detected if not provided)
267
+ - `snv_cluster_assignments`: Path to SNV cluster assignments (if not provided, assignments are inferred using a binomial model)
268
+ - `chromosomes`: Specific chromosomes to process (defaults to chr1-chrY)
269
+ - `min_mapq`: Minimum mapping quality for reads (default: 0)
270
+ - `workers`: Number of parallel workers for BAM processing (default: 1)
271
+
272
+ ## Command Reference
273
+
274
+ ### `rocit run`
275
+
276
+ Run the complete ROCIT pipeline from BAM to predictions.
277
+
278
+ ```bash
279
+ rocit run --config run_config.yaml
280
+ ```
281
+
282
+ **Pipeline steps:**
283
+ 1. Extract BAM methylation
284
+ 2. Compute CpG distribution
285
+ 3. Label reads using somatic variants
286
+ 4. Train classification model
287
+ 5. Generate predictions
288
+
289
+ **Outputs:**
290
+ - `output/methylation/`: Per-chromosome methylation data
291
+ - `output/distribution/`: Sample methylation distribution
292
+ - `output/preprocessing/`: Labelled reads and methylation data
293
+ - `output/training/`: Model checkpoints and training logs
294
+ - `output/predictions/`: Final tumor/normal predictions
295
+
296
+ ### `rocit train`
297
+
298
+ Train a ROCIT classification model.
299
+
300
+ ```bash
301
+ rocit train --config train_config.yaml
302
+ ```
303
+
304
+ **Outputs:**
305
+ - `{output_dir}/{sample_id}/version_X/checkpoints/best-checkpoint.ckpt`: Best model
306
+ - `{output_dir}/{sample_id}/version_X/metrics.csv`: Training metrics (loss, AUROC, etc.)
307
+
308
+ **Training parameters** (modifiable via `TrainingParams` in the Python API):
309
+ - Model architecture: 384-dim, 6 heads, 3 layers
310
+ - Max epochs: 100 with early stopping (patience=10)
311
+ - Learning rate: 1e-4 with warmup
312
+ - Batch size: 256
313
+ - Early stopping metric: validation AUROC
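To make the stopping rule concrete, the sketch below replays a list of per-epoch validation AUROC values and reports where training would halt with `patience=10` and a 100-epoch cap. It is a plain-Python illustration, not ROCIT's training loop: the function name is hypothetical, and it assumes one validation check per epoch with no minimum improvement threshold.

```python
def early_stopping_epoch(val_aurocs, patience=10, max_epochs=100):
    """Return (best_epoch, last_epoch_run) for a maximised metric.

    Assumption: training stops once the metric has failed to improve
    for `patience` consecutive epochs.
    """
    best_value = float("-inf")
    best_epoch = -1
    stale = 0
    history = val_aurocs[:max_epochs]
    for epoch, value in enumerate(history):
        if value > best_value:
            best_value, best_epoch, stale = value, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, epoch
    return best_epoch, len(history) - 1


# Made-up AUROC curve: improves for three epochs, then plateaus.
history = [0.60, 0.70, 0.80] + [0.75] * 20
best_epoch, stopped_at = early_stopping_epoch(history)
print(best_epoch, stopped_at)
```

The checkpoint worth keeping is the one from `best_epoch`, not the last epoch run, which is why the training output is named `best-checkpoint.ckpt`.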
314
+
315
+ ### `rocit predict`
316
+
317
+ Generate predictions using a trained model.
318
+
319
+ ```bash
320
+ rocit predict --config predict_config.yaml
321
+ ```
322
+
323
+ **Outputs:**
324
+ - `{output_dir}/{sample_id}_tumor_origin_predictions.parquet`: Read-level predictions with columns:
325
+   - `read_index`: Unique read identifier
326
+   - `chromosome`: Chromosome name
327
+   - `tumor_probability`: Predicted probability of tumor origin (0-1)
328
+
329
+ ### `rocit preprocess`
330
+
331
+ Label reads for training using somatic variant information.
332
+
333
+ ```bash
334
+ rocit preprocess --config preprocess_config.yaml
335
+ ```
336
+
337
+ **Outputs:**
338
+ - `{output_dir}/labelled_reads.parquet`: Read labels (tumor/normal)
339
+ - `{output_dir}/labelled_methylation_data.parquet`: Methylation data with labels
340
+
341
+
342
+ ### `rocit extract-bam-methylation`
343
+
344
+ Extract CpG methylation from PacBio BAM files.
345
+
346
+ ```bash
347
+ rocit extract-bam-methylation \
348
+ --sample-id SAMPLE01 \
349
+ --sample-bam aligned.bam \
350
+ --output-dir methylation/ \
351
+ --workers 8 \
352
+ --min-mapq 0 \
353
+ --chromosomes "chr1 chr2 chr3"
354
+ ```
355
+
356
+ **Options:**
357
+ - `--sample-id`: Sample identifier for output naming
358
+ - `--sample-bam`: Input BAM file with MM/ML tags
359
+ - `--output-dir`: Output directory
360
+ - `--index`: BAM index file (optional, auto-detected)
361
+ - `--min-mapq`: Minimum mapping quality (default: 0)
362
+ - `--workers`: Number of parallel workers (default: 1)
363
+ - `--chromosomes`: Space-separated chromosomes to process (default: chr1-chrY)
364
+
365
+ **Outputs:**
366
+ - `{output_dir}/{chromosome}_cpg_methylation_data.parquet` for each chromosome
367
+
368
+ ### `rocit extract-cpg-distribution`
369
+
370
+ Aggregate methylation distribution from extracted data.
371
+
372
+ ```bash
373
+ rocit extract-cpg-distribution \
374
+ --sample-id SAMPLE01 \
375
+ --methylation-dir methylation/ \
376
+ --output-dir distribution/
377
+ ```
378
+
379
+ **Outputs:**
380
+ - `{output_dir}/{sample_id}_methylation_distribution.parquet`:
381
+ An aggregated distribution of methylation values across the sample, used for model context.
382
+
383
+ ## Output Files
384
+
385
+ ### Prediction Output
386
+
387
+ The primary output from ROCIT is a parquet file with read-level predictions:
388
+
389
+ ```python
390
+ import polars as pl
391
+
392
+ predictions = pl.read_parquet("predictions/SAMPLE01_tumor_origin_predictions.parquet")
393
+ print(predictions.head())
394
+
395
+ # Example output:
396
+ # ┌────────────┬────────────┬───────────────────┐
397
+ # │ read_index │ chromosome │ tumor_probability │
398
+ # ├────────────┼────────────┼───────────────────┤
399
+ # │ 1001       │ chr1       │ 0.87              │
400
+ # │ 1002       │ chr1       │ 0.12              │
401
+ # │ 1003       │ chr1       │ 0.94              │
402
+ # └────────────┴────────────┴───────────────────┘
403
+ ```
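A common next step is to turn the per-read probabilities into hard calls. The sketch below counts tumor and normal calls per chromosome from rows shaped like the prediction table. The 0.5 threshold and the function name are illustrative assumptions, not values taken from ROCIT.

```python
def summarise_predictions(rows, threshold=0.5):
    """Count tumor/normal calls per chromosome.

    `rows` is any iterable of dicts with `chromosome` and
    `tumor_probability` keys. The threshold is an illustrative choice.
    """
    summary = {}
    for row in rows:
        counts = summary.setdefault(row["chromosome"], {"tumor": 0, "normal": 0})
        label = "tumor" if row["tumor_probability"] >= threshold else "normal"
        counts[label] += 1
    return summary


# Rows mirroring the example table above.
rows = [
    {"read_index": 1001, "chromosome": "chr1", "tumor_probability": 0.87},
    {"read_index": 1002, "chromosome": "chr1", "tumor_probability": 0.12},
    {"read_index": 1003, "chromosome": "chr1", "tumor_probability": 0.94},
]
summary = summarise_predictions(rows)
print(summary)
```

With a Polars DataFrame, the same function can be fed `predictions.iter_rows(named=True)`; for large samples a native Polars group-by would be faster.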
404
+
405
+ ### Training Metrics
406
+
407
+ Training progress is logged to CSV:
408
+
409
+ ```python
410
+ metrics = pl.read_csv("training/SAMPLE01/version_0/metrics.csv")
411
+ # Contains: epoch, train_loss, train_auroc, val_loss, val_auroc, etc.
412
+ ```
413
+
414
+ ## Model Architecture
415
+
416
+ ROCIT uses a transformer-based architecture designed for long-read methylation data:
417
+
418
+ - Input: CpG methylation patterns, cell atlas features, sample distribution features
419
+ ## Python API
420
+
421
+ ROCIT can also be used programmatically:
422
+
423
+ ```python
424
+ import rocit
425
+ from pathlib import Path
426
+
427
+ # Training
428
+ train_result = rocit.train(
429
+ sample_id="SAMPLE01",
430
+ labelled_data=labelled_df,
431
+ sample_distribution=distribution_df,
432
+ cell_atlas=atlas_df,
433
+ val_chromosomes=["chr20", "chr21"],
434
+ test_chromosomes=["chr22"],
435
+ output_dir=Path("training/"),
436
+ cache_dir=Path("/scratch/")
437
+ )
438
+
439
+ # Prediction
440
+ predictions = rocit.predict(
441
+ sample_id="SAMPLE01",
442
+ best_checkpoint_path=Path("training/best-checkpoint.ckpt"),
443
+ read_store=[methylation_lazy_df], # List of polars DataFrames or LazyFrames
444
+ sample_distribution=distribution_df,
445
+ cell_atlas=atlas_df,
446
+ output_dir=Path("predictions/"),
447
+ cache_dir=Path("/scratch/")
448
+ )
449
+ ```
450
+
451
+ ## Generating the Cell Atlas from Source
452
+
453
+ If you prefer to build the cell-type methylation atlas yourself rather than using the pre-computed version, you can use the provided generation script. This process downloads and processes whole-genome bisulfite sequencing data from GEO accession **GSE186458**, which contains methylation profiles across diverse normal human cell types.
454
+
455
+ ### Requirements
456
+
457
+ ```bash
458
+ pip install pyBigWig polars tqdm
459
+ ```
460
+
461
+ ### Usage
462
+
463
+ The script provides two modes: automatic download and processing, or processing from pre-downloaded files.
464
+
465
+ **Automatic Download and Processing**
466
+
467
+ This will download ~328 GB of raw data from GEO:
468
+
469
+ ```bash
470
+ python setup_scripts/generate_cell_map_df.py \
471
+ --download /path/to/download_directory/ \
472
+ --output reference/cell_atlas.parquet
473
+ ```
474
+
475
+ You will be prompted to confirm before the download begins.
476
+
477
+ **Process Pre-Downloaded Files**
478
+
479
+ If you already have the bigwig files:
480
+
481
+ ```bash
482
+ python setup_scripts/generate_cell_map_df.py \
483
+ --data-dir /path/to/extracted_bigwig_files/ \
484
+ --output reference/cell_atlas.parquet
485
+ ```
486
+
487
+ ### What the Script Does
488
+
489
+ 1. **Downloads** (if using `--download`): Fetches the GSE186458 tar archive from NCBI GEO
490
+ 2. **Extracts**: Unpacks `*.hg38.bigwig` files containing methylation data per cell type
491
+ 3. **Processes**: For each cell type, aggregates methylation values across biological replicates
492
+ 4. **Combines**: Joins all cell types into a single reference atlas
493
+ 5. **Outputs**: Saves a Parquet file with columns:
494
+    - `chromosome`: chr1-chr22, chrX
495
+    - `position`: CpG genomic position
496
+    - `average_methylation_{cell_type}`: Mean methylation value (0-1) for each cell type
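The replicate-averaging step can be illustrated in a few lines. This is a simplified sketch, not the code in `generate_cell_map_df.py`: it represents each replicate as a dictionary keyed by `(chromosome, position)` and assumes a CpG is averaged over only those replicates that cover it. How the real script handles missing sites is not described here.

```python
from collections import defaultdict


def average_replicates(replicates):
    """Average methylation per CpG across replicates of one cell type.

    `replicates` is a list of dicts mapping (chromosome, position) to a
    methylation value in [0, 1]. Assumption: sites missing from a
    replicate are skipped for that replicate rather than counted as zero.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for replicate in replicates:
        for site, value in replicate.items():
            totals[site] += value
            counts[site] += 1
    return {site: totals[site] / counts[site] for site in totals}


# Two hypothetical replicates; the second lacks coverage at chr1:200.
rep_a = {("chr1", 100): 0.75, ("chr1", 200): 0.20}
rep_b = {("chr1", 100): 0.25}
atlas_column = average_replicates([rep_a, rep_b])
print(atlas_column)
```

Each resulting dictionary corresponds to one `average_methylation_{cell_type}` column before the cell types are joined.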
497
+
498
+ The resulting atlas enables ROCIT to contextualize read-level methylation patterns using cell-type-specific reference signatures.
499
+
500
+ ### Dataset Information
501
+
502
+ **GSE186458** contains whole-genome bisulfite sequencing (WGBS) data from normal human tissues and cell types. Each cell type typically has multiple biological replicates, which the script averages to produce robust methylation estimates.
503
+
504
+ > **Citation:** Loyfer, N., et al. (2023). [A DNA methylation atlas of normal human cell types](https://doi.org/10.1038/s41586-022-05580-6). *Nature*.
505
+
506
+ ---
507
+
508
+ ## Data Format Specifications
509
+
510
+ ### Copy Number Segments
511
+
512
+ Required columns:
513
+ - `chromosome`: Chromosome name (e.g., "chr1")
514
+ - `start`: Segment start position
515
+ - `end`: Segment end position
516
+ - `minor_cn`: Minor allele copy number
517
+ - `major_cn`: Major allele copy number
518
+ - `total_cn`: Total copy number
519
+ - `purity`: Tumor purity estimate
520
+ - `normal_total_cn`: Normal total copy number (typically 2 except for chrX/chrY in XY subjects)
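Before preprocessing, it can help to sanity-check a copy number table against the columns listed above. The sketch below works on plain dictionaries and is not ROCIT's validation module. The consistency rules (`minor_cn + major_cn == total_cn`, `start < end`, purity in [0, 1]) and the LOH criterion (`minor_cn == 0` with at least one major copy) are common conventions assumed for illustration, not rules confirmed by this README.

```python
REQUIRED_COLUMNS = [
    "chromosome", "start", "end", "minor_cn",
    "major_cn", "total_cn", "purity", "normal_total_cn",
]


def missing_columns(columns):
    """Return the required columns absent from `columns`."""
    return [name for name in REQUIRED_COLUMNS if name not in columns]


def is_consistent(segment):
    """Assumed sanity rules for a single segment (see lead-in)."""
    return (
        segment["start"] < segment["end"]
        and 0 <= segment["minor_cn"] <= segment["major_cn"]
        and segment["minor_cn"] + segment["major_cn"] == segment["total_cn"]
        and 0.0 <= segment["purity"] <= 1.0
    )


def is_loh(segment):
    """Assumed LOH definition: one parental allele lost, the other retained."""
    return segment["minor_cn"] == 0 and segment["major_cn"] >= 1


# Hypothetical segment with copy-neutral LOH (2 + 0).
segment = {
    "chromosome": "chr1", "start": 1_000_000, "end": 5_000_000,
    "minor_cn": 0, "major_cn": 2, "total_cn": 2,
    "purity": 0.8, "normal_total_cn": 2,
}
print(missing_columns(segment.keys()), is_consistent(segment), is_loh(segment))
```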
521
+
522
+ ### Somatic Variants
523
+
524
+ Required columns:
525
+ - `chromosome`: Chromosome name
526
+ - `position`: Variant position
527
+ - `ref`: Reference allele
528
+ - `alt`: Alternate allele
529
+ - Additional variant metadata as needed
530
+
531
+ ### Haplotags
532
+
533
+ Required columns:
534
+ - `read_index`: Unique read identifier
535
+ - `chromosome`: Chromosome name
536
+ - `haplotag`: Haplotype assignment (1 or 2)
537
+ - `start`: Read start position
538
+ - `end`: Read end position
539
+
540
+ ### Haploblocks
541
+
542
+ Required columns:
543
+ - `chromosome`: Chromosome name
544
+ - `block_start`: Block start position
545
+ - `block_end`: Block end position
546
+ - `block_size`: Size of phased block
547
+ - `haploblock_id`: Unique block identifier
548
+
549
+ ### SNV Clusters
550
+
551
+ Required columns:
552
+ - `cluster_id`: Unique cluster identifier
553
+ - `cluster_ccf`: Cancer cell fraction for the cluster (0-1)
554
+ - `cluster_fraction`: Fraction of variants assigned to this cluster (0-1)
555
+
556
+ ### SNV Cluster Assignments
557
+
558
+ This file is optional. If not provided, cluster assignments will be inferred using a binomial model.
559
+
560
+ Required columns:
561
+ - `chromosome`: Chromosome name
562
+ - `position`: Variant position
563
+ - `cluster_id`: Cluster identifier (must match IDs in SNV Clusters)
564
+ - `n_copies`: Number of allelic copies of the variant
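As a rough illustration of how cluster assignments could be inferred with a binomial model, the sketch below scores each cluster by its `cluster_fraction` prior times the binomial likelihood of the observed alternate-allele read count. It is a textbook formulation, not ROCIT's implementation. The inputs `alt_count` and `depth` and all function names are hypothetical, and the expected allele fraction uses the standard purity and copy-number mixture formula.

```python
import math


def expected_vaf(cluster_ccf, n_copies, purity, total_cn, normal_total_cn=2):
    """Expected variant allele fraction for a purity/copy-number mixture."""
    variant_copies = purity * cluster_ccf * n_copies
    all_copies = purity * total_cn + (1.0 - purity) * normal_total_cn
    return variant_copies / all_copies


def binomial_log_pmf(k, n, p):
    """Log-probability of k successes in n trials with success probability p."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite at the boundaries
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def assign_cluster(alt_count, depth, clusters, purity, total_cn,
                   n_copies=1, normal_total_cn=2):
    """Return the cluster_id with the highest prior-weighted likelihood."""
    best_id, best_score = None, -math.inf
    for cluster in clusters:
        if cluster["cluster_fraction"] <= 0:
            continue  # a cluster with zero prior weight cannot be chosen
        vaf = expected_vaf(cluster["cluster_ccf"], n_copies, purity,
                           total_cn, normal_total_cn)
        score = (math.log(cluster["cluster_fraction"])
                 + binomial_log_pmf(alt_count, depth, vaf))
        if score > best_score:
            best_id, best_score = cluster["cluster_id"], score
    return best_id


# Hypothetical clusters using the SNV Clusters columns.
clusters = [
    {"cluster_id": "clonal", "cluster_ccf": 1.0, "cluster_fraction": 0.7},
    {"cluster_id": "subclonal", "cluster_ccf": 0.3, "cluster_fraction": 0.3},
]
print(assign_cluster(20, 50, clusters, purity=0.8, total_cn=2))
print(assign_cluster(5, 50, clusters, purity=0.8, total_cn=2))
```

In a diploid region at 80% purity, a clonal single-copy variant is expected at an allele fraction of 0.4 and a 30% CCF variant at 0.12, so the two read counts above fall near different clusters.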
565
+
566
+
567
+ ## License
568
+
569
+ ROCIT is licensed under the BSD 3-Clause License. See the [LICENSE](LICENSE) file for details.
570
+