PyPI - parapeak - Versions diffs - 0.3.3__tar.gz - Mend

parapeak 0.3.3__tar.gz

Files changed (20)

parapeak-0.3.3/LICENSE +21 -0
parapeak-0.3.3/PKG-INFO +490 -0
parapeak-0.3.3/README.md +458 -0
parapeak-0.3.3/parapeak/__init__.py +1 -0
parapeak-0.3.3/parapeak/__main__.py +28 -0
parapeak-0.3.3/parapeak/bam_reader.py +396 -0
parapeak-0.3.3/parapeak/blacklist.py +60 -0
parapeak-0.3.3/parapeak/cli.py +147 -0
parapeak-0.3.3/parapeak/gc_model.py +120 -0
parapeak-0.3.3/parapeak/output.py +269 -0
parapeak-0.3.3/parapeak/peak_caller.py +496 -0
parapeak-0.3.3/parapeak/stats.py +267 -0
parapeak-0.3.3/parapeak.egg-info/PKG-INFO +490 -0
parapeak-0.3.3/parapeak.egg-info/SOURCES.txt +18 -0
parapeak-0.3.3/parapeak.egg-info/dependency_links.txt +1 -0
parapeak-0.3.3/parapeak.egg-info/entry_points.txt +2 -0
parapeak-0.3.3/parapeak.egg-info/requires.txt +11 -0
parapeak-0.3.3/parapeak.egg-info/top_level.txt +1 -0
parapeak-0.3.3/pyproject.toml +47 -0
parapeak-0.3.3/setup.cfg +4 -0

parapeak-0.3.3/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Takaho A. Endo
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

parapeak-0.3.3/PKG-INFO ADDED

@@ -0,0 +1,490 @@
+Metadata-Version: 2.4
+Name: parapeak
+Version: 0.3.3
+Summary: NGS peak caller with parallel computation and GC-corrected statistics
+Author-email: "Takaho A. Endo" <ssssrabbit@gmail.com>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/takaho/parapeak
+Project-URL: Repository, https://github.com/takaho/parapeak
+Project-URL: Bug Tracker, https://github.com/takaho/parapeak/issues
+Keywords: bioinformatics,NGS,peak-calling,genomics
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Science/Research
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pysam>=0.21
+Requires-Dist: numpy>=1.24
+Requires-Dist: scipy>=1.11
+Requires-Dist: numba>=0.58
+Provides-Extra: dev
+Requires-Dist: pytest; extra == "dev"
+Requires-Dist: pytest-cov; extra == "dev"
+Provides-Extra: bigwig
+Requires-Dist: pyBigWig>=0.3; extra == "bigwig"
+Dynamic: license-file
+# parapeak
+An NGS peak caller supporting shallow (MiSeq) to deep coverage across
+ATAC-seq, genome-editing assays (CIRCLE-seq, GUIDE-seq), ChIP-seq, CUT&TAG,
+and similar experiments.
+**parapeak** is inspired by [MACS3](https://github.com/macs3-project/MACS) but differs in several key aspects:
+| Feature | MACS3 | parapeak |
+|---|---|---|
+| Parallelism | Single-threaded | Multi-process, per chromosome |
+| Genome size | User-supplied `--gsize` | Inferred from BAM headers |
+| Statistical model | Poisson | Negative binomial **and** GC-corrected Z-score (both required) |
+| GC correction | Separate tool | Built-in (no reference FASTA needed) |
+| Blacklist | Not built-in | Applied at pileup construction time |
+| Low-coverage support | Limited | Signal filters for MiSeq-depth data |
+| Paired-end fragments | Fixed extension | Actual insert size from TLEN field |
+| Unsorted BAM support | Requires sorted + indexed | Single-pass; no sorting or index required |
+| Read QC filters | Basic | MAPQ threshold, QC-fail flag, duplicate/secondary/supplementary |
+| Input | BAM/SAM/BED | BAM/SAM only (broken headers auto-repaired) |
+---
+## Installation
+```bash
+pip install -r requirements.txt
+pip install -e .
+```
+Dependencies: `pysam`, `numpy`, `scipy`, `numba`
+> **numba** accelerates the negative-binomial survival function and the
+> Benjamini–Hochberg isotonic step with JIT compilation.
+> If numba is unavailable, parapeak falls back transparently to scipy.
+---
+## Quick start
+```bash
+# Paired-end ATAC-seq, unsorted BAM
+parapeak -t atac.bam --blacklist hg38-blacklist.v2.bed.gz \
+         -o results/ -n ATAC -p 8
+# CIRCLE-seq / genome-editing with NTC control (MiSeq depth)
+parapeak -t treated.bam -c ntc.bam \
+         --min-count 3 --min-fold 3.0 --pseudocount 2.0 \
+         -o results/ -n edit -p 8
+# Multiple replicates, stricter MAPQ filter
+parapeak -t rep1.bam rep2.bam --min-mapq 30 -o results/ -p 4
+```
+BAM files do **not** need to be sorted or indexed. parapeak reads each file
+in a single sequential pass.
+---
+## Options
+```
+Input:
+  -t, --treated BAM [BAM ...]   Treated BAM/SAM files (sorting and indexing not required)
+  -c, --control BAM [BAM ...]   Control/input BAM/SAM files
+  --blacklist BED               Blacklist BED file (plain or gzip-compressed)
+Output:
+  -o, --output DIR              Output directory (default: parapeak_output)
+  -n, --name NAME               Output file prefix (default: parapeak)
+  --bedgraph-value TYPE         Value written to the bedGraph output.
+                                fold_enrichment (default): (treated + pc) / (scaled control + pc)
+                                pvalue: -log10 of the minimum NB / Z-score p-value per bin
+                                pileup: raw treated read count per bin
+Algorithm parameters:
+  --local-window BP             Local background window size in bp (default: 1000)
+  --fragment-size BP            Fallback extension for single-end or discordant
+                                paired-end reads; estimated from paired-end insert
+                                sizes if omitted, falls back to 200 bp
+  --bin-size BP                 Pileup bin size in bp (default: 10)
+  -q, --qvalue FLOAT            Q-value threshold for both methods (default: 0.05)
+  --min-length BP               Minimum peak length in bp (default: 200)
+  --max-gap BP                  Maximum gap to merge between significant bins (default: 30)
+Read QC filters:
+  --min-mapq INT                Minimum mapping quality (MAPQ). Reads below this
+                                threshold are discarded (default: 20)
+  --min-fragment BP             Minimum paired-end insert size accepted from TLEN.
+                                Smaller inserts are treated as single-end (default: 10)
+  --max-fragment BP             Maximum paired-end insert size accepted from TLEN.
+                                Larger inserts are treated as single-end (default: 2000)
+Signal filters:
+  --min-count FLOAT             Minimum pileup at the peak summit bin (default: 5)
+  --min-fold FLOAT              Minimum fold enrichment over background (default: 2.0)
+  --pseudocount FLOAT           Background floor per local window (default: 1.0)
+  -p, --threads INT             Number of parallel worker processes (default: 1)
+```
+---
+## Read QC filtering
+The following filters are applied to each read in order before it contributes
+to the pileup. Counts for each filter are reported in the JSON run report.
+| Filter | SAM flag / field | Default behaviour |
+|---|---|---|
+| Unmapped | `0x4` | Discarded |
+| Duplicate | `0x400` | Discarded |
+| Secondary alignment | `0x100` | Discarded |
+| Supplementary alignment | `0x800` | Discarded |
+| QC fail | `0x200` | Discarded |
+| Low mapping quality | `MAPQ < --min-mapq` | Discarded (default MAPQ < 20) |
+| Read 2 of a pair | `0x80` | Skipped (not a quality filter; avoids double-counting) |
+Paired-end reads that pass all filters but whose TLEN falls outside
+`[--min-fragment, --max-fragment]` are not discarded; they are treated as
+single-end reads and extended by `--fragment-size`. The count of such reads
+is reported as `insert_size_fallback` in the JSON run report.
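
The filter order in the table can be sketched as a small classifier. This is an illustrative sketch, not parapeak's own code: the function name `classify_read` is hypothetical, and the counter names are borrowed from the JSON run report example in this README.

```python
# SAM flag bits, as listed in the table above.
UNMAPPED, READ2, SECONDARY = 0x4, 0x80, 0x100
QCFAIL, DUPLICATE, SUPPLEMENTARY = 0x200, 0x400, 0x800

# (bit, counter name) pairs in the order the table lists them.
FLAG_FILTERS = [
    (UNMAPPED, "filtered_unmapped"),
    (DUPLICATE, "filtered_duplicate"),
    (SECONDARY, "filtered_secondary"),
    (SUPPLEMENTARY, "filtered_supplementary"),
    (QCFAIL, "filtered_qcfail"),
]


def classify_read(flag, mapq, min_mapq=20):
    """Return the counter a read falls into, or "reads_used" if it passes."""
    for bit, counter in FLAG_FILTERS:
        if flag & bit:
            return counter
    if mapq < min_mapq:
        return "filtered_low_mapq"
    if flag & READ2:
        # Not a quality filter: R2 is skipped so each fragment counts once.
        return "skipped_read2"
    return "reads_used"
```

Because the checks run in order, a read that is both a duplicate and a secondary alignment is counted once, under the first matching filter.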
+---
+## Single-end and paired-end handling
+### Single-end
+Each mapped read is extended from its 5′ end by `--fragment-size` bp toward
+the sequenced fragment's expected end.  `--fragment-size` is estimated
+automatically as the median insert size if the BAM contains paired reads;
+otherwise it defaults to 200 bp.
+```
+5'─────read──────3'──────────────extension──────────────►
+|←────────────── fragment_size ────────────────────────→|
+```
+### Paired-end
+Only **R1** is processed; R2 is skipped to avoid double-counting fragments.
+For **proper pairs** whose TLEN (template length from the SAM field) falls
+within `[--min-fragment, --max-fragment]` (default 10–2000 bp), the actual
+fragment interval is used directly:
+```
+R1  5'──────────►
+                    ◄──────────5'  R2
+|←───────── TLEN (actual insert size) ─────────────────→|
+```
+This is important for experiments where fragment size varies:
+- **ATAC-seq**: each pair of Tn5 cut sites defines an accessible region;
+  using actual insert sizes correctly captures nucleosome-free regions
+  (NFR, < 200 bp) and mono-nucleosomal fragments (∼200 bp).
+- **CIRCLE-seq / GUIDE-seq**: read pairs span the circularised DNA around
+  cut sites; TLEN reflects the circle diameter.
+- **ChIP-seq paired-end**: sonication produces a range of fragment sizes;
+  actual TLEN avoids over-smoothing short fragments.
+For **discordant pairs** (TLEN outside `[--min-fragment, --max-fragment]`)
+and **unpaired reads**, the read is treated as single-end and extended by
+`--fragment-size`.
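
The choice between the TLEN interval and the fixed extension might look like the sketch below. It assumes 0-based half-open coordinates and the SAM convention that TLEN is positive for the leftmost mate and negative for the rightmost; the function name and return shape are hypothetical, and clipping to the chromosome end is left out.

```python
def fragment_interval(start, end, is_reverse, tlen, is_proper_pair,
                      fragment_size=200, min_fragment=10, max_fragment=2000):
    """Return (frag_start, frag_end, used_tlen).

    start/end are the read's aligned reference span (0-based, half-open);
    tlen is the signed template length from the SAM record.
    """
    size = abs(tlen)
    if is_proper_pair and min_fragment <= size <= max_fragment:
        if tlen > 0:
            # Leftmost mate: the fragment starts where this read starts.
            return start, start + size, True
        # Rightmost mate: the fragment ends where this read ends.
        return max(0, end - size), end, True
    # Fallback: extend from the 5' end by fragment_size, as for single-end reads.
    if is_reverse:
        return max(0, end - fragment_size), end, False
    return start, start + fragment_size, False
```

The third element of the result marks whether TLEN was used, which is the information a counter such as `insert_size_fallback` would need.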
+---
+## Signal filters for low-coverage data
+At shallow sequencing depth (MiSeq, ~1–5 M reads), stochastic sampling
+means that ~60% of 10 bp bins in the control have zero reads.
+Without a signal floor, NB and Z-score tests can declare bins with only
+1–2 reads significant wherever the control happens to have zero coverage,
+producing thousands of artefactual peaks.
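+Three parameters address this. `--min-count` and `--min-fold` are applied
+*after* the BH q-value test; `--pseudocount` takes effect earlier, when the
+background is estimated: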
+### `--pseudocount` (default: 1.0)
+Adds a pseudocount (in units of reads per `--local-window`) to the scaled
+control pileup before NB background fitting and fold-enrichment calculation.
+Distributing the pseudocount across the local window gives each 10 bp bin
+a minimum background floor of `pseudocount / (local_window / bin_size)`.
+Fold enrichment is computed as:
+```
+fold = (mean_treatment + pseudocount) / (mean_scaled_control + pseudocount)
+```
+This eliminates infinite fold values when the control has zero reads and
+makes the metric interpretable at all coverage depths.
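
As a worked example of the two formulas above, the following sketch computes the per-bin floor and the fold enrichment with the default settings. The function names are hypothetical; only the arithmetic comes from this README.

```python
def per_bin_floor(pseudocount=1.0, local_window=1000, bin_size=10):
    """Background floor added to each bin: pseudocount spread over the window."""
    return pseudocount / (local_window / bin_size)


def fold_enrichment(mean_treatment, mean_scaled_control, pseudocount=1.0):
    """Pseudocount-corrected fold enrichment; finite even for a zero control."""
    return (mean_treatment + pseudocount) / (mean_scaled_control + pseudocount)


# With the defaults, a 1000 bp window holds 100 bins of 10 bp each.
floor = per_bin_floor()
# A mean treatment signal of 4 against an empty control stays finite.
fold = fold_enrichment(4.0, 0.0)
```

With these inputs the floor is 0.01 reads per bin and the fold enrichment is 5.0 rather than infinite.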
+### `--min-count` (default: 5.0)
+Discards peaks whose summit bin has a pileup value below this threshold,
+regardless of the q-value.
+At typical fragment extension (200 bp), a summit pileup of 5 corresponds
+to approximately 2–3 overlapping read fragments; for sparser data where
+even 2–3 reads can be statistically significant against zero control, this
+filter prevents low-signal artefacts.
+### `--min-fold` (default: 2.0)
+Discards peaks with fold enrichment (pseudocount-corrected) below this
+value.  Combined with `--pseudocount`, a minimum fold of 2.0 ensures that
+the signal is at least twice the effective background, not just twice zero.
+Set to values < 1 to disable.
+---
+## Recommended parameters by experiment
+| Experiment | SE/PE | Typical depth | `--min-count` | `--min-fold` | `--pseudocount` | `--local-window` | Notes |
+|---|---|---|---|---|---|---|---|
+| ATAC-seq (deep) | PE | > 20 M | 10 | 2.0 | 0.5 | 1000 | Nucleosome-free regions; actual insert sizes used |
+| ATAC-seq (MiSeq) | PE | 2–5 M | 3 | 2.0 | 2.0 | 2000 | Wider local window compensates for sparse control |
+| CIRCLE-seq | PE | 1–5 M | 3 | 3.0 | 2.0 | 1000 | Sparse library; higher fold reduces false positives |
+| GUIDE-seq | PE | 5–20 M | 5 | 2.0 | 1.0 | 1000 | On-target site may need blacklisting |
+| ChIP-seq narrow (deep) | SE/PE | > 10 M | 10 | 2.0 | 0.5 | 1000 | TF binding sites |
+| ChIP-seq narrow (MiSeq) | SE | 1–5 M | 3 | 2.0 | 2.0 | 5000 | Wider local window for stable background |
+| ChIP-seq broad | SE/PE | > 10 M | 5 | 1.5 | 1.0 | 5000 | Histone marks; lower fold threshold |
+| CUT&TAG | PE | 5–20 M | 5 | 2.0 | 1.0 | 500 | High signal-to-noise; narrow local window |
+| FAIRE-seq | SE/PE | > 5 M | 5 | 2.0 | 1.0 | 1000 | Similar to ATAC-seq |
+### Coverage guide
+| Sequencer | Typical read pairs | Recommended `--min-count` | Recommended `--pseudocount` |
+|---|---|---|---|
+| NovaSeq / HiSeq (deep) | > 30 M | 10–20 | 0.5 |
+| NextSeq / HiSeq (standard) | 10–30 M | 5–10 | 1.0 |
+| MiSeq (V3 2×300) | 3–10 M | 3–5 | 1.0–2.0 |
+| MiSeq (V2 2×150) | 1–3 M | 3 | 2.0–5.0 |
+---
+## Output files
+| File | Format | Description |
+|---|---|---|
+| `<name>_peaks.narrowPeak` | ENCODE narrowPeak (BED6+4) | Final peak calls |
+| `<name>_summits.bed` | BED | Single-base summit positions |
+| `<name>_peaks.tsv` | TSV | All score columns for downstream analysis |
+| `<name>.bedGraph` | UCSC bedGraph | Genome-wide signal track (see `--bedgraph-value`) |
+| `<name>_run.json` | JSON | Settings and read QC statistics |
+The narrowPeak columns follow the
+[ENCODE specification](https://genome.ucsc.edu/FAQ/FAQformat.html#format12):
+```
+chrom  start  end  name  score  strand  signalValue  -log10(p)  -log10(q)  summit_offset
+```
+The TSV file includes separate NB and Z-score p-values and q-values for
+inspection of which statistical test is the driver for each peak.
+### bedGraph signal track
+The bedGraph file covers the full genome at the same bin resolution as the
+pileup (`--bin-size`).  Bins with zero signal and blacklisted bins are
+omitted; consecutive bins with identical values are merged into a single
+record to reduce file size.  The default value is **fold enrichment**:
+```
+fold = (treated_bin + pseudocount) / (scaled_control_bin + pseudocount)
+```
+Use `--bedgraph-value pvalue` to write −log₁₀(*p*) (minimum of the NB
+and Z-score p-values per bin), or `--bedgraph-value pileup` to write the
+raw treated read count per bin.
+The track is compatible with IGV, the UCSC Genome Browser, and any tool
+that accepts bedGraph format.
+### JSON run report
+The JSON file records all settings and per-file read statistics for
+reproducibility and QC review. Example:
+```json
+{
+  "run": {
+    "timestamp": "2026-06-15T10:00:00Z",
+    "tool": "parapeak"
+  },
+  "settings": {
+    "treated": ["sample.bam"],
+    "control": null,
+    "blacklist": null,
+    "bin_size": 10,
+    "fragment_size": null,
+    "min_mapq": 20,
+    "min_fragment": 10,
+    "max_fragment": 2000,
+    "qvalue": 0.05,
+    "min_count": 5.0,
+    "min_fold": 2.0,
+    "pseudocount": 1.0,
+    "threads": 8,
+    "bedgraph_value": "fold_enrichment"
+  },
+  "statistics": {
+    "fragment_size_used": 185,
+    "scale_factor": 1.0,
+    "treated": {
+      "total_reads": 42000000,
+      "filtered_unmapped": 1200000,
+      "filtered_duplicate": 3500000,
+      "filtered_secondary": 80000,
+      "filtered_supplementary": 12000,
+      "filtered_qcfail": 45000,
+      "filtered_low_mapq": 2100000,
+      "skipped_read2": 17500000,
+      "insert_size_fallback": 320000,
+      "reads_used": 17243000
+    },
+    "control": null,
+    "peaks_called": 48312
+  }
+}
+```
+---
+## Algorithm
+### 1. Read pileup (single-pass)
+Each BAM/SAM file is read in a single sequential pass using
+[pysam](https://pysam.readthedocs.io/), with no region argument.
+BAM files do not need to be sorted or indexed.
+For each file, numpy arrays (one per chromosome, length = chromosome size /
+bin size) are pre-allocated and held in memory. As each read is consumed from
+the stream, QC filters are applied (see [Read QC filtering](#read-qc-filtering)),
+the fragment interval is determined (see
+[Single-end and paired-end handling](#single-end-and-paired-end-handling)),
+and the corresponding bins are incremented by 1.
+This approach has O(N_reads) I/O cost regardless of sort order. The previous
+per-chromosome `fetch()` strategy required N_chromosomes scans of an unsorted
+file; on a typical human genome (25 chromosomes) this is a ~25× reduction in
+I/O for unsorted input.
+Blacklisted bins are zeroed after the pass completes.
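
The binning step can be sketched as follows. Plain Python lists stand in for the NumPy arrays to keep the example dependency-free, and the helper names are hypothetical. The sketch assumes that a fragment increments every bin it overlaps, even partially; the README does not state how partial overlaps are handled.

```python
def _bin_range(start, end, bin_size, n_bins):
    """Indices of all bins overlapped by the half-open interval [start, end)."""
    first = max(0, start // bin_size)
    last = min(n_bins, -(-end // bin_size))  # ceiling division
    return range(first, last)


def new_pileup(chrom_length, bin_size=10):
    """Pre-allocate one counter per bin for a chromosome."""
    return [0] * (-(-chrom_length // bin_size))


def add_fragment(pileup, start, end, bin_size=10):
    """Increment every bin the fragment overlaps by 1."""
    for i in _bin_range(start, end, bin_size, len(pileup)):
        pileup[i] += 1


def zero_blacklist(pileup, intervals, bin_size=10):
    """Zero all bins touched by blacklist intervals, after the read pass."""
    for start, end in intervals:
        for i in _bin_range(start, end, bin_size, len(pileup)):
            pileup[i] = 0
```

Each read touches only the bins of its own fragment, so the cost of the pass does not depend on the order in which reads arrive.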
+If pysam rejects a file due to a malformed header (e.g. duplicate `@HD`
+lines produced by some aligners or merge tools), parapeak automatically
+repairs the header: for SAM files the raw text is rewritten with duplicate
+`@HD` lines removed; for BAM files the same repair is applied after dumping
+the file to SAM text via `samtools view`.  The fixed content is written to
+a temporary file, read, and then deleted.  If repair fails, the file is
+skipped with a warning and the run continues with the remaining inputs.
+### 2. GC correction model
+Without requiring a reference genome, parapeak estimates per-bin GC content
+from the sequences of mapped reads (a reliable proxy for reference GC%).
+The genome is divided into non-overlapping windows (`--local-window`).
+For each window, the mean GC fraction of overlapping reads and the observed
+read depth are recorded. All windows across all chromosomes are pooled,
+binned into 20 equal GC% intervals (0–5%, 5–10%, …, 95–100%), and the
+mean and variance of depth are computed per interval.
+A piecewise-linear interpolation curve `f(GC%) → (expected_depth, variance)`
+is fitted to these bin statistics. This curve is used in Step 4 to derive
+GC-corrected expected values and their variances for every genomic bin.
+When too few data points are available (sparse libraries), the model falls
+back to the global mean depth.
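
A minimal sketch of such a GC curve, using only the standard library, is shown below. The function names, the `min_windows` cut-off, and the choice to clamp outside the populated GC range are assumptions made for illustration; the README specifies only the 20 intervals, the per-interval mean and variance, the piecewise-linear interpolation, and the fallback.

```python
from statistics import fmean, pvariance


def fit_gc_curve(gc_fractions, depths, n_bins=20, min_windows=3):
    """Return (gc_centre, mean_depth, variance) knots for populated GC intervals."""
    groups = [[] for _ in range(n_bins)]
    for gc, depth in zip(gc_fractions, depths):
        idx = min(int(gc * n_bins), n_bins - 1)  # GC = 1.0 goes in the last interval
        groups[idx].append(depth)
    knots = []
    for idx, values in enumerate(groups):
        if len(values) >= min_windows:  # hypothetical sparsity cut-off
            knots.append(((idx + 0.5) / n_bins, fmean(values), pvariance(values)))
    return knots


def interpolate(gc, knots, fallback):
    """Piecewise-linear (mean, variance) at gc; `fallback` is used for sparse data."""
    if len(knots) < 2:
        return fallback
    if gc <= knots[0][0]:
        return knots[0][1:]
    if gc >= knots[-1][0]:
        return knots[-1][1:]
    for (x0, m0, v0), (x1, m1, v1) in zip(knots, knots[1:]):
        if x0 <= gc <= x1:
            w = (gc - x0) / (x1 - x0)
            return (m0 + w * (m1 - m0), v0 + w * (v1 - v0))
```

The caller supplies `fallback` as a `(mean, variance)` pair, for example the global mean depth and the global variance.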
+### 3. Negative-binomial p-values
+The scaled control pileup is floored by adding
+`pseudocount / (local_window / bin_size)` per bin before fitting.
+For each bin, the background is estimated at two scales:
+- **Global λ**: genome-wide mean of the pseudocount-adjusted scaled control.
+- **Local λ**: sliding-window mean over `--local-window` bp.
+The **larger** lambda (more conservative, higher p-value) is used, so
+enrichment is only called when signal exceeds both local and global baselines.
+A Negative Binomial NB(*r*, *p*) is fitted by the method of moments:
+```
+r = μ² / (σ² − μ)     p = μ / σ²
+```
+When σ² ≤ μ the distribution degenerates to Poisson (*r* → ∞).
+The survival function is evaluated with a numba-JIT log-PMF recurrence.
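
The method-of-moments fit and a log-PMF recurrence can be sketched in pure Python as below. This is not the numba routine shipped with parapeak. It assumes the p-value is the upper tail P(X ≥ k), that *p* is the success probability in the parameterisation with mean r(1 − p)/p, and that the mean is strictly positive (which the pseudocount floor guarantees). Function names are hypothetical.

```python
import math


def nb_params(mean, var):
    """Method-of-moments NB(r, p); None signals the Poisson limit (var <= mean)."""
    if var <= mean:
        return None
    return mean * mean / (var - mean), mean / var


def upper_tail(k, mean, var):
    """P(X >= k), summing the PMF for 0..k-1 via a log-PMF recurrence."""
    if mean <= 0:
        raise ValueError("mean must be positive (apply the pseudocount floor first)")
    if k <= 0:
        return 1.0
    params = nb_params(mean, var)
    # log P(X = 0) for the NB and for the Poisson fallback.
    log_pmf = -mean if params is None else params[0] * math.log(params[1])
    cdf = 0.0
    for i in range(k):
        cdf += math.exp(log_pmf)
        if params is None:
            # Poisson: P(i+1) / P(i) = mean / (i + 1)
            log_pmf += math.log(mean) - math.log(i + 1)
        else:
            # NB: P(i+1) / P(i) = (i + r) * (1 - p) / (i + 1)
            r, p = params
            log_pmf += math.log(i + r) + math.log(1.0 - p) - math.log(i + 1)
    return max(0.0, 1.0 - cdf)
```

Computing the tail as 1 − CDF loses precision for very small p-values; a production routine would more likely sum the upper tail directly or work in log space throughout.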
+### 4. GC-corrected Z-score p-values
+Each bin receives a GC-matched expected depth *E* and variance *V* from
+Step 2.  Expected values are scaled to match the total treatment read count.
+```
+Z = (observed − E) / √V
+p_Z = P(Z_standard > Z)   [one-sided upper tail]
+```
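
The one-sided p-value can be computed with the complementary error function from the standard library. The sketch below assumes that a bin with non-positive variance is treated as non-significant; the README does not say how that case is handled, and the function name is hypothetical.

```python
import math


def z_pvalue(observed, expected, variance):
    """Return (Z, one-sided upper-tail p-value) for a single bin."""
    if variance <= 0:
        # Assumption: without a usable variance the bin is not tested.
        return 0.0, 1.0
    z = (observed - expected) / math.sqrt(variance)
    # P(Z_standard > z) = 0.5 * erfc(z / sqrt(2))
    return z, 0.5 * math.erfc(z / math.sqrt(2.0))
```

For example, an observed depth of 10 against E = 4 and V = 9 gives Z = 2 and a p-value of about 0.0228.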
+### 5. Genome-wide Benjamini–Hochberg correction
+P-values from all chromosomes are pooled and corrected independently for
+each method. Steps 2–5 are parallelised across chromosomes (`-p` workers).
+```
+q_NB = BH(p_NB)     q_Z = BH(p_Z)
+```
+A bin is **significant** only when **both** q_NB < threshold **and**
+q_Z < threshold.
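
The correction and the two-test requirement can be sketched as follows. This is the standard Benjamini–Hochberg step-up procedure in pure Python, not the numba-accelerated routine in parapeak; the function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH q-values in the same order as the input p-values."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def significant_bins(p_nb, p_z, threshold=0.05):
    """A bin is significant only if both corrected tests pass."""
    q_nb = benjamini_hochberg(p_nb)
    q_z = benjamini_hochberg(p_z)
    return [a < threshold and b < threshold for a, b in zip(q_nb, q_z)]
```

Each method is corrected on its own pooled p-values, so a bin that is extreme under only one model is rejected.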
+### 6. Peak region assembly and signal filtering
+Contiguous significant bins are merged across gaps of up to `--max-gap` bp.
+Regions shorter than `--min-length` bp are discarded.
+Signal filters are then applied in order:
+1. **`--min-count`**: summit pileup ≥ threshold
+2. **`--min-fold`**: fold enrichment (pseudocount-corrected) ≥ threshold
+The summit is the bin with the highest treatment pileup.
+The score in the output is −log₁₀ of the minimum q-value across both methods.
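
Region assembly can be sketched as a single pass over the per-bin significance flags. The sketch assumes the gap is measured in bp from the end of one significant run to the start of the next, and it omits the summit and signal-filter steps; the function name is hypothetical.

```python
def assemble_regions(significant, bin_size=10, max_gap=30, min_length=200):
    """Merge significant bins into (start, end) regions, then drop short ones."""
    regions = []
    for i, flag in enumerate(significant):
        if not flag:
            continue
        start, end = i * bin_size, (i + 1) * bin_size
        if regions and start - regions[-1][1] <= max_gap:
            regions[-1][1] = end  # extend the current region across the gap
        else:
            regions.append([start, end])
    return [(s, e) for s, e in regions if e - s >= min_length]


# Bins 0-9 and 12-25 are significant (20 bp gap); bin 40 is isolated.
flags = [True] * 10 + [False] * 2 + [True] * 14 + [False] * 14 + [True]
peaks = assemble_regions(flags)
```

Here the two runs merge into one 260 bp region, while the isolated 10 bp bin is discarded by the minimum-length rule.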
+---
+## Comparison with MACS3
+MACS3 uses a linked-list data structure internally, which limits its
+ability to parallelise across chromosomes. parapeak replaces this with
+per-chromosome NumPy arrays and Python `multiprocessing`, with
+inter-process communication only at the genome-wide BH correction step.
+MACS3 requires sorted and indexed BAM files for random-access region
+queries. parapeak reads each file in a single forward pass, which
+avoids this requirement and is more I/O-efficient for unsorted input.
+The true runtime bottleneck is BAM decompression (via the C library
+underlying pysam) and gzip decompression of the blacklist. These I/O costs
+dominate over Python-level arithmetic, so there is no need for a compiled
+extension language; numba JIT is sufficient for the compute-intensive
+statistical routines.
+---
+## Author
+Takaho A. Endo
+## License
+MIT