py-gbcms 2.0.0__py3-none-any.whl → 2.1.1__py3-none-any.whl
- gbcms/__init__.py +1 -13
- gbcms/cli.py +134 -716
- gbcms/core/kernel.py +126 -0
- gbcms/io/input.py +222 -0
- gbcms/io/output.py +361 -0
- gbcms/models/core.py +133 -0
- gbcms/pipeline.py +212 -0
- gbcms/py.typed +0 -0
- py_gbcms-2.1.1.dist-info/METADATA +216 -0
- py_gbcms-2.1.1.dist-info/RECORD +13 -0
- gbcms/config.py +0 -98
- gbcms/counter.py +0 -1074
- gbcms/models.py +0 -295
- gbcms/numba_counter.py +0 -394
- gbcms/output.py +0 -573
- gbcms/parallel.py +0 -129
- gbcms/processor.py +0 -293
- gbcms/reference.py +0 -86
- gbcms/variant.py +0 -390
- py_gbcms-2.0.0.dist-info/METADATA +0 -506
- py_gbcms-2.0.0.dist-info/RECORD +0 -16
- {py_gbcms-2.0.0.dist-info → py_gbcms-2.1.1.dist-info}/WHEEL +0 -0
- {py_gbcms-2.0.0.dist-info → py_gbcms-2.1.1.dist-info}/entry_points.txt +0 -0
- {py_gbcms-2.0.0.dist-info → py_gbcms-2.1.1.dist-info}/licenses/LICENSE +0 -0
py_gbcms-2.0.0.dist-info/METADATA
DELETED
@@ -1,506 +0,0 @@
-Metadata-Version: 2.4
-Name: py-gbcms
-Version: 2.0.0
-Summary: Python implementation of GetBaseCountsMultiSample (gbcms) for calculating base counts in BAM files
-Project-URL: Homepage, https://github.com/msk-access/getbasecounts
-Project-URL: Repository, https://github.com/msk-access/getbasecounts
-Project-URL: Documentation, https://github.com/msk-access/getbasecounts#readme
-Project-URL: Bug Tracker, https://github.com/msk-access/getbasecounts/issues
-Author-email: MSK-ACCESS <shahr2@mskcc.org>
-License: AGPL-3.0
-License-File: LICENSE
-Keywords: bam,base-counts,bioinformatics,gbcms,genomics,maf,vcf
-Classifier: Development Status :: 4 - Beta
-Classifier: Intended Audience :: Science/Research
-Classifier: License :: OSI Approved :: GNU Affero General Public License v3
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
-Requires-Python: >=3.11
-Requires-Dist: joblib>=1.3.0
-Requires-Dist: numba>=0.58.0
-Requires-Dist: numpy>=1.24.0
-Requires-Dist: pandas>=2.0.0
-Requires-Dist: pydantic-settings>=2.0.0
-Requires-Dist: pydantic>=2.0.0
-Requires-Dist: pysam>=0.22.0
-Requires-Dist: rich>=13.0.0
-Requires-Dist: scipy>=1.11.0
-Requires-Dist: typer>=0.9.0
-Provides-Extra: all
-Requires-Dist: cyvcf2>=0.30.0; extra == 'all'
-Provides-Extra: dev
-Requires-Dist: black>=23.0.0; extra == 'dev'
-Requires-Dist: mypy>=1.5.0; extra == 'dev'
-Requires-Dist: pre-commit>=3.3.0; extra == 'dev'
-Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
-Requires-Dist: pytest-mock>=3.11.0; extra == 'dev'
-Requires-Dist: pytest>=7.4.0; extra == 'dev'
-Requires-Dist: ruff>=0.1.0; extra == 'dev'
-Requires-Dist: types-pyyaml>=6.0.0; extra == 'dev'
-Provides-Extra: fast
-Requires-Dist: cyvcf2>=0.30.0; extra == 'fast'
-Description-Content-Type: text/markdown
-
-# py-gbcms - Python Implementation of gbcms
-
-[](https://www.python.org/downloads/)
-[](https://opensource.org/licenses/AGPL-3.0)
-
-A high-performance Python reimplementation of [GetBaseCountsMultiSample](https://github.com/msk-access/GetBaseCountsMultiSample) for calculating base counts in multiple BAM files at variant positions specified in VCF or MAF files.
-
-**Package**: `py-gbcms` | **CLI**: `gbcms`
-
-## ✨ Features
-
-- 🚀 **High Performance**: Multi-threaded processing with efficient BAM file handling
-- **Smart hybrid counting** strategy: numba for simple SNPs (50-100x faster), counter.py for complex variants
-- **Numba JIT compilation** for optimized counting operations
-- **joblib** for efficient local parallelization
-- **Ray** support for distributed computing across clusters
-- 🔬 **Strand Bias Analysis**: Statistical strand bias detection using Fisher's exact test
-- 🎨 **Beautiful CLI**: Rich terminal output with progress bars and colored logging
-- 🔒 **Type Safety**: Pydantic models for runtime validation and type checking
-- 🐳 **Docker Support**: Containerized deployment for reproducibility
-- 📊 **Multiple Formats**: Support for both VCF and MAF input/output formats with strand bias
-- 🧪 **Well Tested**: Comprehensive unit tests with high coverage
-- 🔧 **Modern Python**: Built with type hints, Pydantic models, and modern Python practices
-
-## 🚀 Quick Start
-
-### Installation
-
-# Install with all features
-uv pip install "py-gbcms[all]"
-
-# Or with pip
-pip install "py-gbcms[all]"
-
-# For development (includes scipy for strand bias and type checking)
-uv pip install -e ".[dev]"
-
-**Core Dependencies:**
-- `pysam>=0.22.0` - BAM file processing
-- `numpy>=1.24.0` - Numerical computations
-- `scipy>=1.11.0` - Statistical analysis (Fisher's exact test for strand bias)
-- `pandas>=2.0.0` - Data manipulation
-- `pydantic>=2.0.0` - Runtime validation
-- `numba>=0.58.0` - JIT compilation for performance
-- `joblib>=1.3.0` - Parallel processing
-
-**Optional Dependencies:**
-- `cyvcf2>=0.30.0` - Fast VCF parsing (`py-gbcms[fast]`)
-- `ray>=2.7.0` - Distributed computing (`py-gbcms[ray]`)
-
-**Development Dependencies:**
-- `scipy-stubs>=1.11.0` - Type stubs for scipy (for mypy type checking)
-
-**Requirements:** Python 3.11 or later
-
-### Basic Usage
-
-```bash
-# Run
-gbcms count run \
---fasta reference.fa \
---bam sample1:sample1.bam \
---vcf variants.vcf \
---output counts.txt \
---thread 8
-
-# Run with MAF file
-gbcms count run \
---fasta reference.fa \
---bam-fof bam_files.txt \
---maf variants.maf \
---output counts.txt
-
-### Docker Usage
-
-```bash
-docker pull ghcr.io/msk-access/getbasecounts:latest
-
-# Run the container
-docker run --rm \
--v $(pwd)/data:/data \
-ghcr.io/msk-access/getbasecounts:latest \
-gbcms count run \
---omaf \
---fasta /data/reference.fa \
---bam sample1:/data/sample1.bam \
---vcf /data/variants.vcf \
---output /data/counts.maf
-```
-
-### BAM File of Files Format
-
-Create a tab-separated file (`bam_files.txt`):
-```
-sample1 /path/to/sample1.bam
-sample2 /path/to/sample2.bam
-sample3 /path/to/sample3.bam
-```
-
-Then use:
-
-```bash
-gbcms count run \
---fasta reference.fa \
---bam-fof bam_files.txt \
---vcf variants.vcf \
-
-## Command Line Options
-
-### Commands
-
-gbcms uses subcommands for different operations:
-
-- `gbcms count run`: Run base counting on variants (main command)
-- `gbcms validate files`: Validate input files before processing
-- `gbcms version`: Show version information
-- `gbcms info`: Show tool capabilities and information
-
-### Count Run Options
-
-#### Required Arguments
-
-- `--fasta`, `-f`: Reference genome FASTA file (must be indexed with .fai)
-- `--output`, `-o`: Output file path
-
-#### BAM Input (at least one required)
-
-- `--bam`, `-b`: BAM file in format `SAMPLE_NAME:BAM_FILE` (can be specified multiple times)
-- `--bam-fof`: File containing sample names and BAM paths (tab-separated)
-
-#### Variant Input (one required)
-
-- `--maf`: Input variant file in MAF format (can be specified multiple times)
-- `--vcf`: Input variant file in VCF format (can be specified multiple times)
-
-#### Output Options
-
-- `--omaf`: Output in MAF format (only with MAF input)
-- `--positive-count/--no-positive-count`: Output positive strand counts (default: True)
-- `--negative-count/--no-negative-count`: Output negative strand counts (default: False)
-- `--fragment-count/--no-fragment-count`: Output fragment counts (default: False)
-- `--fragment-fractional-weight`: Use fractional weights for fragments (default: False)
-
-#### Quality Filters
-- `--maq`: Mapping quality threshold (default: 20)
-- `--baq`: Base quality threshold (default: 0)
-- `--filter-duplicate`: Filter duplicate reads (default: True)
-- `--filter-improper-pair`: Filter improperly paired reads (default: False)
-- `--filter-qc-failed`: Filter QC failed reads (default: False)
-- `--filter-indel/--no-filter-indel`: Filter reads with indels (default: False)
-- `--filter-non-primary/--no-filter-non-primary`: Filter non-primary alignments (default: False)
-
-#### Performance Options
-- `--thread`, `-t`: Number of threads (default: 1)
-- `--max-block-size`: Maximum variants per block (default: 10000)
-- `--max-block-dist`: Maximum block distance in bp (default: 100000)
-
-#### Advanced Options
-- `--generic-counting`: Use generic counting algorithm for complex variants
-- `--suppress-warning`: Maximum warnings per type (default: 3)
-- `--verbose`, `-v`: Enable verbose logging
-
-### Validate Files Options
-
-- `--fasta`, `-f`: Reference FASTA file to validate
-- `--bam`, `-b`: BAM files to validate (SAMPLE:PATH format, multiple allowed)
-- `--vcf`: VCF files to validate (multiple allowed)
-- `--maf`: MAF files to validate (multiple allowed)
-
-## Output Format
-
-### VCF Format (Proper VCF with INFO fields)
-
-**Extension**: `.vcf`
-
-**Structure**: Standard VCF format with count and strand bias information in FORMAT and INFO fields
-
-**Example**:
-```vcf
-##fileformat=VCFv4.2
-##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth across all samples">
-##INFO=<ID=SB,Number=3,Type=Float,Description="Strand bias p-value, odds ratio, direction">
-##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total depth for this sample">
-##FORMAT=<ID=SB,Number=3,Type=Float,Description="Strand bias for this sample">
-#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
-chr1 100 . A T . . DP=95;SB=0.001234,2.5,reverse DP:RD:AD:SB 50:30:20:0.001234:2.5:reverse 45:25:20:0.95:1.2:none
-```
-
-**INFO Fields**:
-- `DP`: Total depth across all samples
-- `SB`: Strand bias (p-value,odds_ratio,direction) - most significant across samples
-- `FSB`: Fragment strand bias (when fragment counting enabled)
-
-**FORMAT Fields**:
-- `DP`: Total depth for this sample
-- `RD`: Reference allele depth for this sample
-- `AD`: Alternate allele depth for this sample
-- `DPP`: Positive strand depth (if enabled)
-- `RDP`: Positive strand reference depth (if enabled)
-- `ADP`: Positive strand alternate depth (if enabled)
-- `DPF`: Fragment depth (if enabled)
-- `RDF`: Fragment reference depth (if enabled)
-- `ADF`: Fragment alternate depth (if enabled)
-- `SB`: Strand bias (p-value,odds_ratio,direction) for this sample
-- `FSB`: Fragment strand bias for this sample (if enabled)
-
-### MAF Format
-
-When using `--omaf`, the output maintains the MAF format with updated count columns for tumor and normal samples, including strand bias information.
-
-## Development
-
-### Setup Development Environment
-
-```bash
-# Clone the repository
-git clone https://github.com/msk-access/py-gbcms.git
-cd py-gbcms
-
-# Install with development dependencies (includes scipy-stubs for type checking)
-uv pip install -e ".[dev]"
-
-# Install pre-commit hooks
-pre-commit install
-```
-
-### Running Tests
-
-```bash
-# Run all tests
-pytest
-
-# Run with coverage
-pytest --cov=gbcms --cov-report=html
-
-# Run specific test file
-pytest tests/test_counter.py -v
-```
-
-### Code Quality
-
-```bash
-# Format code
-black src/ tests/
-
-# Lint code
-ruff check src/ tests/
-
-# Type checking (requires scipy-stubs)
-mypy src/
-```
-
-### Building Docker Image
-
-```bash
-# Build the image
-docker build -t gbcms:latest .
-
-# Run tests in container
-docker run --rm gbcms:latest pytest
-```
-
-## Performance Comparison
-
-Compared to the original C++ implementation:
-
-| Feature | C++ Version | Python (Smart Hybrid) | Python (Pure counter.py) |
-|---------|-------------|----------------------|-------------------------|
-| **Simple SNPs** | ~1x | **~50-100x faster** | ~0.8-1.2x |
-| **Complex Variants** | ~1x | **~1x** | ~0.8-1.2x |
-| **Overall** | Baseline | **~10-50x faster** | ~0.8-1.2x |
-| Memory | Baseline | ~1.2x | ~1.2x |
-| Multi-threading | OpenMP | joblib/Ray | joblib/Ray |
-| Dependencies | bamtools, zlib | pysam, numpy, numba, joblib, ray |
-| Scalability | Single machine | Single machine | Multi-node clusters |
-
-**Smart Hybrid Strategy:**
-- **Simple SNPs**: Uses `numba_counter` (50-100x faster than C++)
-- **Complex variants**: Uses `counter.py` (C++ equivalent accuracy)
-- **Automatic selection**: Optimal algorithm chosen per variant type
-*Performance varies based on workload and Python version. Python 3.11+ shows significant improvements.
-
-**With Numba JIT compilation and smart algorithm selection. See [Fast VCF Parsing](docs/CYVCF2_SUPPORT.md) for benchmarks.
-
-## Architecture
-
-The package is organized into distinct modules with clear responsibilities:
-
-```
-py-gbcms/
-├── src/gbcms/
-│ ├── __init__.py
-│ ├── cli.py # 🎨 Typer CLI interface with Rich
-│ ├── config.py # ⚙️ Configuration dataclasses (legacy)
-│ ├── models.py # 🔒 Pydantic models for type safety ⭐
-│ ├── variant.py # 📄 Variant loading (VCF/MAF)
-│ ├── counter.py # 🐢 Pure Python counting (baseline)
-│ ├── numba_counter.py # ⚡ Numba-optimized counting (50-100x faster) ⭐
-│ ├── parallel.py # 🔄 joblib/Ray parallelization ⭐
-│ ├── reference.py # 🧬 Reference sequence handling
-│ ├── output.py # 📤 Output formatting with strand bias ⭐
-│ └── processor.py # 🎯 Main processing pipeline
-├── tests/ # 🧪 Comprehensive test suite
-├── scripts/ # 🛠️ Setup and verification scripts
-├── Dockerfile # 🐳 Production container
-├── pyproject.toml # 📦 Package configuration
-└── docs/
-├── README.md # Main documentation
-├── ARCHITECTURE.md # Module relationships ⭐
-├── INSTALLATION.md # Setup guide
-├── QUICKSTART.md # 5-minute start
-├── ADVANCED_FEATURES.md # Pydantic, Numba, Ray ⭐
-└── CLI_FEATURES.md # CLI documentation
-```
-
-### Module Relationships
-
-```
-CLI (cli.py)
-↓
-Configuration (models.py)
-↓
-Processor (processor.py)
-├─→ Variant Loader (variant.py)
-├─→ Reference (reference.py)
-├─→ Smart Counting Engine ⭐
-│ ├─→ counter.py (Pure Python - accurate, slower)
-│ └─→ numba_counter.py (JIT compiled - 50-100x faster for SNPs) ⭐
-├─→ Strand Bias Analysis (counter.py + output.py) ⭐
-├─→ Parallelization (parallel.py)
-└─→ Output (output.py)
-```
-
-**Key Algorithm Selection:**
-- **`Smart Hybrid Strategy`**: Automatically chooses optimal algorithm per variant
-- **`counter.py`**: Pure Python, easy to debug, baseline performance
-- **`numba_counter.py`**: JIT-compiled, 50-100x faster for simple SNPs, for production
-
-**New Features:**
-- **Strand Bias Analysis**: Statistical detection using Fisher's exact test ⭐
-- **Enhanced Output**: VCF/MAF formats with strand bias columns ⭐
-
-See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed module relationships.
-
-## Documentation
-
-📚 **[Complete Documentation](docs/README.md)** | [Quick Start](docs/QUICKSTART.md) | [Contributing Guide](CONTRIBUTING.md) | [Package Structure](docs/PACKAGE_STRUCTURE.md) | [Testing Guide](docs/TESTING_GUIDE.md)
-
-- **User Guide**
-- [Input & Output Formats](docs/INPUT_OUTPUT.md)
-
-- **Advanced**
-- [Advanced Features](docs/ADVANCED_FEATURES.md)
-- [Parallelization Guide](docs/PARALLELIZATION_GUIDE.md)
-- [Fast VCF Parsing (cyvcf2)](docs/CYVCF2_SUPPORT.md)
-
-- **Reference**
-- [Architecture](docs/ARCHITECTURE.md)
-- [C++ Comparison](docs/CPP_FEATURE_COMPARISON.md)
-- [FAQ](docs/FAQ.md)
-
-- **Docker & Deployment**
-- [Docker Guide](docs/DOCKER_GUIDE.md)
-
-## Advanced Features
-
-GetBaseCounts includes several advanced features for performance and scalability:
-
-### 🔒 Type Safety with Pydantic
-
-```python
-from gbcms.models import GetBaseCountsConfig
-
-# Runtime validation of all inputs
-config = GetBaseCountsConfig(
-fasta_file=Path("reference.fa"),
-bam_files=[...],
-variant_files=[...],
-)
-```
-
-### 🔬 Strand Bias Analysis
-
-```python
-# Strand bias is automatically calculated for all variants
-# Uses Fisher's exact test for statistical rigor
-from gbcms.counter import BaseCounter
-
-counter = BaseCounter(config)
-# Strand bias calculated during counting and included in output
-```
-
-**Features:**
-- **Fisher's exact test** for statistically sound strand bias detection
-- **Automatic direction detection** (forward, reverse, or none)
-- **Minimum depth filtering** (10 reads) for reliable calculations
-- **VCF and MAF output** with strand bias columns
-
-### ⚡ Performance with Smart Hybrid Strategy
-
-```python
-# Automatic algorithm selection based on variant complexity
-from gbcms.counter import BaseCounter
-
-counter = BaseCounter(config)
-# Automatically uses:
-# - numba_counter for simple SNPs (50-100x faster)
-# - counter.py for complex variants (maximum accuracy)
-```
-
-### 🔄 Parallelization with joblib
-
-```bash
-# Use joblib for efficient local parallelization
-gbcms count run --thread 16 --backend joblib ...
-```
-
-### 🌐 Distributed Computing with Ray
-
-```bash
-# Install with Ray support
-uv pip install "gbcms[ray]"
-
-# Use Ray for distributed processing
-gbcms count run --thread 32 --backend ray --use-ray ...
-```
-
-See [ADVANCED_FEATURES.md](ADVANCED_FEATURES.md) for detailed documentation and benchmarks.
-
-## Contributing
-
-Contributions are welcome! Please:
-
-1. Fork the repository
-2. Create a feature branch
-3. Make your changes with tests
-4. Run the test suite and linters
-5. Submit a pull request
-
-## Citation
-
-If you use this tool in your research, please cite:
-
-```
-GetBaseCountsMultiSample: A tool for calculating base counts in multiple BAM files
-MSK-ACCESS Team
-https://github.com/msk-access/GetBaseCountsMultiSample
-```
-
-## License
-
-AGPL-3.0 License - See [LICENSE](LICENSE) for details.
-
-## Support
-
-- 🐛 Report bugs: [GitHub Issues](https://github.com/msk-access/py-gbcms/issues)
-- 💬 Ask questions: [GitHub Discussions](https://github.com/msk-access/py-gbcms/discussions)
-- 📧 Email: shahr2@mskcc.org
-
-## Acknowledgments
-
-This is a Python reimplementation of the original C++ tool developed by the MSK-ACCESS team. Special thanks to the original authors and contributors.
|
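Note on the removed METADATA: the header block at the top of the hunk above uses the email-style core metadata layout, so the dropped `Requires-Dist` entries can be separated into core and per-extra requirements with the standard library. The following is a minimal sketch, not part of either wheel: `METADATA_TEXT` is a trimmed copy of the removed headers, `split_requirements` is a hypothetical helper, and the marker handling only covers the simple `extra == '...'` form seen in this file (compound markers would need a real marker parser).

```python
from email.parser import Parser

# Trimmed copy of the header block removed in this diff (values taken from the hunk).
METADATA_TEXT = """\
Metadata-Version: 2.4
Name: py-gbcms
Version: 2.0.0
Requires-Python: >=3.11
Requires-Dist: joblib>=1.3.0
Requires-Dist: numba>=0.58.0
Requires-Dist: pysam>=0.22.0
Provides-Extra: fast
Requires-Dist: cyvcf2>=0.30.0; extra == 'fast'
Provides-Extra: dev
Requires-Dist: ruff>=0.1.0; extra == 'dev'
"""


def split_requirements(metadata_text):
    """Return (core, extras); extras maps an extra name to its requirement strings."""
    headers = Parser().parsestr(metadata_text, headersonly=True)
    core, extras = [], {}
    for entry in headers.get_all("Requires-Dist", []):
        requirement, _, marker = entry.partition(";")
        marker = marker.strip()
        # Assumption: markers are absent or exactly of the form: extra == 'name'
        if marker.startswith("extra =="):
            extra_name = marker.split("==", 1)[1].strip().strip("'\"")
            extras.setdefault(extra_name, []).append(requirement.strip())
        else:
            core.append(requirement.strip())
    return core, extras


core, extras = split_requirements(METADATA_TEXT)
print("core:", core)
print("extras:", extras)
```

Running the same helper over the 2.1.1 METADATA (not shown in this part of the diff) would give a side-by-side view of which dependencies the release dropped or kept.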
py_gbcms-2.0.0.dist-info/RECORD
DELETED
@@ -1,16 +0,0 @@
-gbcms/__init__.py,sha256=NDfNc1Ajhcxqos51IIDtI95YpSXF5auYr5qXXcftt7c,347
-gbcms/cli.py,sha256=qk8Lb-c5GsJdY79Wm1PFOlF_hS4ikDss8-X_RDC1uXI,23469
-gbcms/config.py,sha256=xmPM5X684OZ3ysF49GcQ8eij99mqm3E0kRmWPOSigqU,3434
-gbcms/counter.py,sha256=dYOHpHS20QPLVX-dVGr28mtuBpzjmMNDaEz6mRB7-04,43863
-gbcms/models.py,sha256=05E3lJMmHdZfvgEJ21beEdwYdAk91IR__pfEUlBP39E,11281
-gbcms/numba_counter.py,sha256=c96WjiWihj6nAr_FJOp1pvsV-4XhHsvIEQo5IbNtzu4,11039
-gbcms/output.py,sha256=8h9NS3aCd45K_UrBpB6zKJdocwVx3ZfNaJbeyJpy1k4,25026
-gbcms/parallel.py,sha256=L_N1B4hu6bAilUpnGmjtm7CxAAe_qEBmGw5zTtOs3aM,4039
-gbcms/processor.py,sha256=16XIlMab9Ja-lnL86niL0NM31FoSrjjNe8zgGuWSxyU,9708
-gbcms/reference.py,sha256=eLc6qRxKMr1wPCR-qkpy9IlkEEO-cRZdlC21tcUECv4,2468
-gbcms/variant.py,sha256=0a1yGduYBG1RiwCxoEJSzNn3GNumxXXoQc76nxsb-M0,13295
-py_gbcms-2.0.0.dist-info/METADATA,sha256=Sjikcd_nQ3b96nnVFaAAKL2Ub7NK6VU5ywKpYE2wRaM,17133
-py_gbcms-2.0.0.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
-py_gbcms-2.0.0.dist-info/entry_points.txt,sha256=AAg3yd8-c7jlb-FDGiFJXSNFVAhqO44zMLJQVFv8oWQ,40
-py_gbcms-2.0.0.dist-info/licenses/LICENSE,sha256=5vLuih3k9yufKSXoR5qVWOhALHC8WXbSXjrOo9ZK3cs,34797
-py_gbcms-2.0.0.dist-info/RECORD,,
|
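Note on the RECORD rows: each row is `path,sha256=<digest>,<size>`, where the digest is the SHA-256 of the file encoded as URL-safe base64 with the trailing `=` padding removed, and the RECORD file's own row leaves both fields empty. The sketch below shows how such rows could be checked against file contents. It is illustrative only: `record_digest` and `check_record` are hypothetical helper names, and the `pkg/...` payload is made-up sample data, not a file from either wheel.

```python
import base64
import csv
import hashlib
import io


def record_digest(data: bytes) -> str:
    """Encode a SHA-256 digest as RECORD does: URL-safe base64 without '=' padding."""
    digest = hashlib.sha256(data).digest()
    return "sha256=" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def check_record(record_text: str, files: dict) -> dict:
    """Map each hashed path to True/False; `files` maps path -> file bytes."""
    results = {}
    for path, hash_field, size_field in csv.reader(io.StringIO(record_text)):
        if not hash_field:
            # The RECORD file lists itself with empty hash and size fields.
            continue
        data = files[path]
        results[path] = (
            record_digest(data) == hash_field and len(data) == int(size_field)
        )
    return results


# Hypothetical sample data standing in for an unpacked wheel.
payload = b"__version__ = '0.0.0'\n"
record_text = (
    f"pkg/__init__.py,{record_digest(payload)},{len(payload)}\n"
    "pkg-0.0.0.dist-info/RECORD,,\n"
)
print(check_record(record_text, {"pkg/__init__.py": payload}))
```

Applied to an unpacked copy of the 2.0.0 wheel, the same check would confirm that the files on disk match the hashes and sizes listed in the removed RECORD.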
File without changes
File without changes
File without changes