PyPI - sequana-downsampling - Versions diffs - 0.10.0__py3-none-any.whl - Mend

sequana-downsampling 0.10.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

sequana_downsampling-0.10.0.dist-info/METADATA +187 -0
sequana_downsampling-0.10.0.dist-info/RECORD +11 -0
sequana_downsampling-0.10.0.dist-info/WHEEL +4 -0
sequana_downsampling-0.10.0.dist-info/entry_points.txt +3 -0
sequana_downsampling-0.10.0.dist-info/licenses/LICENSE +29 -0
sequana_pipelines/downsampling/__init__.py +3 -0
sequana_pipelines/downsampling/config.yaml +37 -0
sequana_pipelines/downsampling/downsampling.rules +131 -0
sequana_pipelines/downsampling/main.py +126 -0
sequana_pipelines/downsampling/schema.yaml +43 -0
sequana_pipelines/downsampling/tools.txt +2 -0

sequana_downsampling-0.10.0.dist-info/METADATA ADDED

@@ -0,0 +1,187 @@
+Metadata-Version: 2.4
+Name: sequana-downsampling
+Version: 0.10.0
+Summary: Downsample NGS data sets (FastQ/FastA) using the Sequana framework
+License: BSD-3
+License-File: LICENSE
+Keywords: sequana,snakemake,NGS,fastq,downsampling
+Author: Sequana Team
+Requires-Python: >=3.10,<4.0
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Education
+Classifier: Intended Audience :: End Users/Desktop
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: BSD License
+Classifier: License :: Other/Proprietary License
+Classifier: Operating System :: POSIX :: Linux
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Classifier: Topic :: Scientific/Engineering :: Information Analysis
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Dist: click (>=8.1.7)
+Requires-Dist: pulp (>=2.8)
+Requires-Dist: rich-click (>=1.7.2)
+Requires-Dist: sequana (>=0.17)
+Requires-Dist: sequana-pipetools (>=1.5.0)
+Requires-Dist: snakemake (>=7.32)
+Project-URL: Repository, https://github.com/sequana/downsampling
+Description-Content-Type: text/x-rst
+.. image:: https://badge.fury.io/py/sequana-downsampling.svg
+     :target: https://pypi.python.org/pypi/sequana-downsampling
+.. image:: https://github.com/sequana/downsampling/actions/workflows/main.yml/badge.svg
+   :target: https://github.com/sequana/downsampling/actions/workflows
+.. image:: https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue.svg
+    :target: https://pypi.python.org/pypi/sequana-downsampling
+    :alt: Python 3.10 | 3.11 | 3.12
+.. image:: https://joss.theoj.org/papers/10.21105/joss.00352/status.svg
+    :target: https://joss.theoj.org/papers/10.21105/joss.00352
+    :alt: JOSS (journal of open source software) DOI
+This is the **downsampling** pipeline from the `Sequana <https://sequana.readthedocs.org>`_ project.
+:Overview: Downsample NGS data sets (FastQ or FastA).
+:Input: A set of FastQ or FastA files (single or paired-end).
+:Output: Downsampled FastQ or FastA files.
+:Status: Production
+:Citation: Cokelaer et al, (2017), 'Sequana': a Set of Snakemake NGS pipelines, Journal of Open Source Software, 2(16), 352, JOSS DOI https://doi.org/10.21105/joss.00352
+Installation
+~~~~~~~~~~~~
+::
+    pip install sequana_downsampling --upgrade
+You will also need ``pigz`` available on your PATH.
+Quick Start
+~~~~~~~~~~~
+**1. Set up the pipeline**::
+    sequana_downsampling --input-directory DATAPATH
+**2. Run the pipeline**::
+    cd downsampling
+    bash downsampling.sh
+Usage
+~~~~~
+::
+    sequana_downsampling --help
+Key pipeline-specific options:
+``--downsampling-input-format``
+    Input format: ``fastq`` (default) or ``fasta``. ``sam`` is accepted by the CLI, but the pipeline rules raise ``NotImplementedError`` for it.
+``--downsampling-method``
+    ``random`` (default, keeps a fixed number of reads) or ``random_pct``
+    (keeps a percentage of reads).
+``--downsampling-max-entries``
+    Number of reads to keep when using ``random`` (default: 1000).
+``--downsampling-percent``
+    Percentage of reads to keep when using ``random_pct`` (default: 10).
+``--downsampling-threads``
+    Number of threads used by ``pigz`` to compress output (default: 4).
+Examples::
+    sequana_downsampling --input-directory DATAPATH \
+        --downsampling-method random --downsampling-max-entries 100
+    sequana_downsampling --input-directory DATAPATH \
+        --downsampling-method random_pct --downsampling-percent 10 \
+        --downsampling-input-format fasta --input-pattern "*.fasta"
+Run on a SLURM cluster::
+    cd downsampling
+    sbatch downsampling.sh
+Or drive Snakemake directly::
+    snakemake -s downsampling.rules --cores 4 --stats stats.txt
+Requirements
+~~~~~~~~~~~~
+The following tools must be available (install via conda/bioconda)::
+    mamba env create -f environment.yml
+- **sequana** — FastQ/FastA selection (Python API)
+- **pigz** — parallel gzip compression of outputs
+Pipeline overview
+~~~~~~~~~~~~~~~~~
+The pipeline randomly selects reads from the input files (single or paired).
+If the inputs are paired, the one-to-one mapping between R1 and R2 is
+preserved. FastQ inputs can be gzipped; outputs are gzipped with ``pigz``.
+FastA inputs and outputs are uncompressed.
+Configuration
+~~~~~~~~~~~~~
+Here is the `latest documented configuration file <https://raw.githubusercontent.com/sequana/downsampling/main/sequana_pipelines/downsampling/config.yaml>`_.
+Key sections:
+- ``downsampling`` — method (``random`` / ``random_pct``), ``max_entries``,
+  ``percent``, ``threads``, and ``input_format`` (``fastq`` / ``fasta``)
+Changelog
+~~~~~~~~~
+========= ====================================================================
+Version   Description
+========= ====================================================================
+0.10.0    * Migrate to Poetry / pyproject.toml packaging
+          * Simplify __init__.py using importlib.metadata
+          * Rewrite CLI with rich_click (replaces argparse)
+          * Update CI to use setup-micromamba with generate-run-shell
+          * Add ``localrules: pipeline``
+          * Add ``tools.txt`` and ``environment.yml``
+          * Refresh README badges and usage examples
+0.9.0     * Maintenance release
+0.8.5     * Cope with R1/R2 paired data properly. Improved make file
+0.8.4     * Add missing MANIFEST to include missing requirements.txt
+0.8.3     * Comply with new API from sequana_pipetools 0.2.4
+0.8.2     * Add a --run option to execute the pipeline directly
+0.8.1     * Fix input and N in the random selection
+0.8.0     **First release.**
+========= ====================================================================
+Contribute & Code of Conduct
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To contribute to this project, please take a look at the
+`Contributing Guidelines <https://github.com/sequana/sequana/blob/main/CONTRIBUTING.rst>`_ first. Please note that this project is released with a
+`Code of Conduct <https://github.com/sequana/sequana/blob/main/CONDUCT.md>`_. By contributing to this project, you agree to abide by its terms.
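
Reviewer note: the README's "Pipeline overview" says the one-to-one mapping between R1 and R2 is preserved when paired inputs are downsampled. The idea can be illustrated with a minimal, standard-library sketch. It does not use the Sequana API; `select_paired` and the in-memory record lists are hypothetical names chosen for illustration only.

```python
import random


def select_paired(r1_records, r2_records, n, seed=None):
    """Pick the same n record positions from both mates (hypothetical helper)."""
    if len(r1_records) != len(r2_records):
        raise ValueError("R1 and R2 must contain the same number of records")
    rng = random.Random(seed)
    # Never ask for more records than are available.
    n = min(n, len(r1_records))
    # Draw the positions once, then reuse them for both files.
    positions = sorted(rng.sample(range(len(r1_records)), n))
    kept_r1 = [r1_records[i] for i in positions]
    kept_r2 = [r2_records[i] for i in positions]
    return kept_r1, kept_r2


# Toy data standing in for parsed FastQ records.
r1 = [f"read{i}/1" for i in range(10)]
r2 = [f"read{i}/2" for i in range(10)]
kept_r1, kept_r2 = select_paired(r1, r2, 3, seed=0)
```

Drawing the positions once and applying them to both mates is what keeps each R1 record next to its R2 partner in the outputs.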

sequana_downsampling-0.10.0.dist-info/RECORD ADDED

@@ -0,0 +1,11 @@
+sequana_pipelines/downsampling/__init__.py,sha256=a2JTYHRndqt6IeaZvkrwj60qrJfuaH-ruh0ajj-ftus,88
+sequana_pipelines/downsampling/config.yaml,sha256=PATX4d6C_rM2yuVLit5UvFUIdh84ZLYyAaKkLWAxrR0,1252
+sequana_pipelines/downsampling/downsampling.rules,sha256=RFqZRMRmxQIeJxCJljbN97I76iQmbdHAVQKKQ83KS4A,4041
+sequana_pipelines/downsampling/main.py,sha256=gY2HTctxkzY6yCemKLVe2txO2gnbjvVvpC9E7wru75M,3782
+sequana_pipelines/downsampling/schema.yaml,sha256=giBDsqnGUwIcmLZGPodLrGRVsOkI7lWQrCrQsseL__Q,935
+sequana_pipelines/downsampling/tools.txt,sha256=HwamIWOd6wMdHlrYZtHhk-nleP91pi4w46yOtzN6KjY,13
+sequana_downsampling-0.10.0.dist-info/METADATA,sha256=TFS3iAq2cR6Rxcpc7BLgmOLZIqac_QpSQvkeLq-BsqY,6390
+sequana_downsampling-0.10.0.dist-info/WHEEL,sha256=Vz2fHgx6HFtSwhs8KvkHLqH5Ea4w1_rner5uNVGCeIE,88
+sequana_downsampling-0.10.0.dist-info/entry_points.txt,sha256=cWoyjeUM8nwsBG5V30q087sQEGwGYh3XhOJhaVzMKI0,81
+sequana_downsampling-0.10.0.dist-info/licenses/LICENSE,sha256=tifUGMwXA9uaq2g1wFz0ZgRRPOFCfqsOxRj8JneHZ3M,1530
+sequana_downsampling-0.10.0.dist-info/RECORD,,
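
Reviewer note: each RECORD line above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed, and the size is in bytes. A minimal sketch of how such a line could be recomputed is shown below. `record_entry` is a hypothetical helper, and the byte content used for `tools.txt` is an assumption (two newline-terminated lines, which matches the listed size of 13 bytes).

```python
import base64
import hashlib


def record_entry(path, data):
    """Build a RECORD-style line for a file given its raw bytes (hypothetical helper)."""
    digest = hashlib.sha256(data).digest()
    # RECORD uses URL-safe base64 without the trailing '=' padding.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# Assumed content of tools.txt; the real bytes are not shown verbatim in the diff.
entry = record_entry("sequana_pipelines/downsampling/tools.txt", b"sequana\npigz\n")
print(entry)
```

Comparing the printed line with the corresponding RECORD entry is one way to check that an unpacked file matches what was published.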

sequana_downsampling-0.10.0.dist-info/WHEEL ADDED

@@ -0,0 +1,4 @@
+Wheel-Version: 1.0
+Generator: poetry-core 2.3.2
+Root-Is-Purelib: true
+Tag: py3-none-any

sequana_downsampling-0.10.0.dist-info/entry_points.txt ADDED

@@ -0,0 +1,3 @@
+[console_scripts]
+sequana_downsampling=sequana_pipelines.downsampling.main:main
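
Reviewer note: the `console_scripts` line above maps a command name to a `module:attribute` target. The sketch below shows roughly how such a line can be split and resolved with the standard library. `parse_entry_point` and `load_entry_point` are hypothetical helpers, not part of this package, and the demo resolves a standard-library function so that it runs without the package installed.

```python
import importlib


def parse_entry_point(line):
    """Split 'name=module:attr' into its three parts (hypothetical helper)."""
    name, _, target = line.partition("=")
    module, _, attr = target.partition(":")
    return name.strip(), module.strip(), attr.strip()


def load_entry_point(line):
    """Import the module and walk the attribute path to the callable."""
    _, module, attr = parse_entry_point(line)
    obj = importlib.import_module(module)
    for part in attr.split("."):
        obj = getattr(obj, part)
    return obj


parsed = parse_entry_point("sequana_downsampling=sequana_pipelines.downsampling.main:main")
# Demo with a standard-library target so the example is runnable anywhere.
dumps = load_entry_point("demo=json:dumps")
```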

sequana_downsampling-0.10.0.dist-info/licenses/LICENSE ADDED

@@ -0,0 +1,29 @@
+BSD 3-Clause License
+Copyright (c) 2016-2019, Sequana Development Team
+All rights reserved.
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+* Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+* Neither the name of the copyright holder nor the names of its
+  contributors may be used to endorse or promote products derived from
+  this software without specific prior written permission.
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

sequana_pipelines/downsampling/__init__.py ADDED

@@ -0,0 +1,3 @@
+import importlib.metadata
+version = importlib.metadata.version("sequana-downsampling")
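
Reviewer note: `importlib.metadata.version` raises `PackageNotFoundError` when the distribution is not installed, for example when the module is imported from a source checkout. A defensive variant might look like the sketch below. `get_version` and its `default` argument are hypothetical and are not part of the released file.

```python
import importlib.metadata


def get_version(dist_name, default="unknown"):
    """Return the installed version of dist_name, or default if it is not installed."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return default


# Falls back to "unknown" when the distribution metadata cannot be found.
version = get_version("sequana-downsampling")
```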

sequana_pipelines/downsampling/config.yaml ADDED

@@ -0,0 +1,37 @@
+# ============================================================================
+# Config file for the downsampling pipeline
+# ==========================================[ Sections for the users ]========
+#
+# One of input_directory, input_pattern and input_samples must be provided
+# If input_directory provided, use it otherwise if input_pattern provided,
+# use it, otherwise use input_samples.
+# ============================================================================
+input_directory: '.'
+input_pattern: '*.fastq.gz'
+#input_readtag: "_R[12]_"
+apptainers:
+    pigz: https://zenodo.org/record/7346805/files/pigz_2.4.0.img
+##############################################################################
+# Your section
+#
+# :Parameters:
+#
+# - options: string with any valid FastQC options
+# - threads: for unpigz/pigz if data is zipped
+# - method can be one of {tail, tail_pct, head, head_pct, random, random_pct}
+#   if _pct is appended, the percent field is used, otherwise, the max_entries
+#   if max_entries is used, depending on input_format, we select 1, 2, 4 lines.
+# - input_format in [fastq, fasta]. If provided, we will check the extension
+downsampling:
+    threads: 4
+    method: random
+    percent: 10
+    max_entries: 100
+    input_format: fastq

sequana_pipelines/downsampling/downsampling.rules ADDED

@@ -0,0 +1,131 @@
+"""downsampling pipeline
+Author: Thomas Cokelaer
+Affiliation: Institut Pasteur @ 2020
+This pipeline is part of Sequana software (sequana.readthedocs.io)
+snakemake -s downsampling.rules --forceall --stats stats.txt --cores 4
+"""
+from sequana_pipetools import PipelineManager
+# This must be defined before the include
+configfile: "config.yaml"
+manager = PipelineManager("downsampling", config, fastq=True)
+localrules: pipeline, touch_done
+rule pipeline:
+    input: expand("output/{sample}.done", sample=manager.samples)
+fmt = config["downsampling"]["input_format"]
+paired = manager.paired
+if fmt == "fastq":
+    _ext = "fastq"
+elif fmt == "fasta":
+    _ext = "fasta"
+else:
+    raise NotImplementedError(f"Unsupported input_format: {fmt}")
+if fmt == "fastq":
+    if paired:
+        rule select_reads:
+            input: manager.getrawdata()
+            output:
+                r1 = temp("output/{sample}_R1.fastq"),
+                r2 = temp("output/{sample}_R2.fastq"),
+            run:
+                from sequana import FastQ
+                f1 = FastQ(input[0])
+                if config['downsampling']['method'] == "random":
+                    N = config['downsampling']['max_entries']
+                else:
+                    N = int(len(f1) * config["downsampling"]["percent"] / 100)
+                selection = f1.select_random_reads(N, output.r1)
+                f2 = FastQ(input[1])
+                f2.select_random_reads(selection, output.r2)
+    else:
+        rule select_reads:
+            input: manager.getrawdata()
+            output: temp("output/{sample}.fastq")
+            run:
+                from sequana import FastQ
+                f1 = FastQ(input[0])
+                if config['downsampling']['method'] == "random":
+                    N = config['downsampling']['max_entries']
+                else:
+                    N = int(len(f1) * config["downsampling"]["percent"] / 100)
+                f1.select_random_reads(N, output[0])
+elif fmt == "fasta":
+    if paired:
+        rule select_reads:
+            input: manager.getrawdata()
+            output:
+                r1 = temp("output/{sample}_R1.fasta"),
+                r2 = temp("output/{sample}_R2.fasta"),
+            run:
+                from sequana import FastA
+                f1 = FastA(input[0])
+                if config['downsampling']['method'] == "random":
+                    N = config['downsampling']['max_entries']
+                else:
+                    N = int(len(f1) * config["downsampling"]["percent"] / 100)
+                selection = f1.select_random_reads(N, output.r1)
+                f2 = FastA(input[1])
+                f2.select_random_reads(selection, output.r2)
+    else:
+        rule select_reads:
+            input: manager.getrawdata()
+            output: temp("output/{sample}.fasta")
+            run:
+                from sequana import FastA
+                f1 = FastA(input[0])
+                if config['downsampling']['method'] == "random":
+                    N = config['downsampling']['max_entries']
+                else:
+                    N = int(len(f1) * config["downsampling"]["percent"] / 100)
+                f1.select_random_reads(N, output[0])
+rule pigz_compress:
+    input:  "output/{prefix}." + _ext
+    output: "output/{prefix}." + _ext + ".gz"
+    threads: config["downsampling"]["threads"]
+    log:    "logs/{prefix}/pigz.log"
+    container: config["apptainers"]["pigz"]
+    shell:
+        """
+        pigz -f -p {threads} {input} > {log} 2>&1
+        """
+if paired:
+    rule touch_done:
+        input:
+            r1 = "output/{sample}_R1." + _ext + ".gz",
+            r2 = "output/{sample}_R2." + _ext + ".gz",
+        output: "output/{sample}.done"
+        shell: "touch {output}"
+else:
+    rule touch_done:
+        input: "output/{sample}." + _ext + ".gz"
+        output: "output/{sample}.done"
+        shell: "touch {output}"
+onsuccess:
+    manager.teardown(extra_files_to_remove=["*.done"])
+    shell("mv output/* . && rm -rf output")
+onerror:
+    manager.onerror()
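
Reviewer note: all four `select_reads` variants above compute the number of reads to keep in the same way. The logic can be restated as a small standalone function. `reads_to_keep` is a hypothetical name, and unlike the rules (which treat every method other than `random` as percentage-based) this sketch rejects unknown methods explicitly.

```python
def reads_to_keep(total_reads, method, max_entries=100, percent=10.0):
    """Number of reads to select, mirroring the arithmetic used in the rules."""
    if method == "random":
        # Fixed count, independent of the input size.
        return max_entries
    if method == "random_pct":
        # Percentage of the input, truncated towards zero by int().
        return int(total_reads * percent / 100)
    # Assumption of this sketch: fail loudly instead of falling through.
    raise ValueError(f"Unsupported method: {method}")


fixed = reads_to_keep(2500, "random", max_entries=100)
fraction = reads_to_keep(2500, "random_pct", percent=10)
```

Because `int()` truncates, a percentage that does not divide the read count evenly rounds down, so very small inputs can yield zero reads.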

sequana_pipelines/downsampling/main.py ADDED

@@ -0,0 +1,126 @@
+#
+#  This file is part of Sequana software
+#
+#  Copyright (c) 2016-2021 - Sequana Development Team
+#
+#  File author(s):
+#      Thomas Cokelaer <thomas.cokelaer@pasteur.fr>
+#
+#  Distributed under the terms of the 3-clause BSD license.
+#  The full license is in the LICENSE file, distributed with this software.
+#
+#  website: https://github.com/sequana/sequana
+#  documentation: http://sequana.readthedocs.io
+#
+##############################################################################
+import os
+import sys
+import rich_click as click
+from sequana_pipetools import SequanaManager
+from sequana_pipetools.options import *
+NAME = "downsampling"
+help = init_click(
+    NAME,
+    groups={
+        "Pipeline Specific": [
+            "--downsampling-input-format",
+            "--downsampling-method",
+            "--downsampling-percent",
+            "--downsampling-max-entries",
+            "--downsampling-threads",
+        ],
+    },
+)
+@click.command(context_settings=help)
+@include_options_from(ClickInputOptions)
+@include_options_from(ClickSnakemakeOptions, working_directory=NAME)
+@include_options_from(ClickSlurmOptions)
+@include_options_from(ClickGeneralOptions)
+@click.option(
+    "--downsampling-input-format",
+    "downsampling_input_format",
+    default="fastq",
+    show_default=True,
+    type=click.Choice(["fasta", "fastq", "sam"]),
+    help="set input format (only 'fastq', 'fasta', 'sam' supported for now)",
+)
+@click.option(
+    "--downsampling-method",
+    "downsampling_method",
+    default="random",
+    show_default=True,
+    type=click.Choice(["random", "random_pct"]),
+    help="downsampling method: random (based on read counts) or random_pct (based on a percentage of reads)",
+)
+@click.option(
+    "--downsampling-percent",
+    "downsampling_percent",
+    default=10.0,
+    show_default=True,
+    type=float,
+    help="percentage of reads to select. Use with method 'random_pct' only",
+)
+@click.option(
+    "--downsampling-max-entries",
+    "downsampling_max_entries",
+    default=1000,
+    show_default=True,
+    type=int,
+    help="max entries (reads, alignments) to select. Use with method 'random' only",
+)
+@click.option(
+    "--downsampling-threads",
+    "downsampling_threads",
+    default=4,
+    show_default=True,
+    type=int,
+    help="max threads to use with pigz",
+)
+def main(**options):
+    if options["from_project"]:
+        click.echo("--from-project Not yet implemented")
+        sys.exit(1)
+    # the real stuff is here
+    manager = SequanaManager(options, NAME)
+    manager.setup()
+    options = manager.options
+    cfg = manager.config.config
+    from sequana_pipetools import logger
+    logger.setLevel(options.level)
+    logger.name = "sequana_downsampling"
+    manager.fill_data_options()
+    # --------------------------------------------------- downsampling
+    cfg.downsampling.input_format = options.downsampling_input_format
+    cfg.downsampling.method = options.downsampling_method
+    cfg.downsampling.percent = options.downsampling_percent
+    cfg.downsampling.max_entries = options.downsampling_max_entries
+    cfg.downsampling.threads = options.downsampling_threads
+    # If input format is fasta, adjust input pattern default
+    if options.downsampling_input_format == "fasta" and options.input_pattern == "*fastq.gz":
+        cfg.input_pattern = "*fasta.gz"
+    logger.info(f"Input data should be {cfg.downsampling.input_format}")
+    if cfg.downsampling.method == "random":
+        logger.info(f"Your data will be downsampled randomly keeping {cfg.downsampling.max_entries} reads")
+    elif cfg.downsampling.method == "random_pct":
+        logger.info(f"Your data will be downsampled randomly keeping {cfg.downsampling.percent}% of the reads")
+    manager.teardown()
+if __name__ == "__main__":
+    main()
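
Reviewer note: `main()` above switches the input pattern to `*fasta.gz` only when the format is `fasta` and the pattern still equals `*fastq.gz`. The same decision is isolated below as a pure function so it can be checked without the CLI. `resolve_input_pattern` is a hypothetical helper, and treating `*fastq.gz` as the untouched default is an assumption taken from the comparison in `main()`.

```python
def resolve_input_pattern(input_format, input_pattern, default="*fastq.gz"):
    """Return the pattern to use, replacing the FastQ default for FastA inputs."""
    if input_format == "fasta" and input_pattern == default:
        return "*fasta.gz"
    # Any pattern the user set explicitly is left untouched.
    return input_pattern


adjusted = resolve_input_pattern("fasta", "*fastq.gz")
explicit = resolve_input_pattern("fasta", "*.fasta")
```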

sequana_pipelines/downsampling/schema.yaml ADDED

@@ -0,0 +1,43 @@
+# Schema validator for the downsampling pipeline
+# author: Thomas Cokelaer
+type: map
+mapping:
+    "input_directory":
+        type: str
+        required: False
+    #"input_readtag":
+    #    type: str
+    #    required: False
+    "input_pattern":
+        type: str
+        required: False
+    "exclude_pattern":
+        type: str
+        required: False
+        nullable: True
+    "apptainers":
+        type: any
+        required: False
+    "downsampling":
+        type: map
+        mapping:
+            "input_format":
+                type: str
+                enum: [fastq, fasta]
+            "method":
+                type: str
+                enum: [random, random_pct]
+            "threads":
+                type: int
+                range: { min: 1}
+            "percent":
+                type: float
+                range : { min: 0, max: 100}
+            "max_entries":
+                type: int
+                range: { min: 1}
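
Reviewer note: the constraints declared for the `downsampling` section (two enums, two integers with a minimum of 1, and a percentage between 0 and 100) can be restated in plain Python. The sketch below is not the validator the pipeline uses. `check_downsampling` is a hypothetical helper, and accepting an integer for `percent` is an assumption of this sketch.

```python
def check_downsampling(section):
    """Return a list of problems; an empty list means the section passes."""
    problems = []
    allowed = {
        "input_format": ("fastq", "fasta"),
        "method": ("random", "random_pct"),
    }
    for key, choices in allowed.items():
        if key in section and section[key] not in choices:
            problems.append(f"{key} must be one of {choices}")
    for key in ("threads", "max_entries"):
        if key in section:
            value = section[key]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                problems.append(f"{key} must be an integer >= 1")
    if "percent" in section:
        value = section["percent"]
        if isinstance(value, bool) or not isinstance(value, (int, float)) or not 0 <= value <= 100:
            problems.append("percent must be a number between 0 and 100")
    return problems


# Values taken from the config.yaml shown earlier in this diff.
ok = check_downsampling(
    {"threads": 4, "method": "random", "percent": 10, "max_entries": 100, "input_format": "fastq"}
)
bad = check_downsampling({"input_format": "sam", "threads": 0})
```

With these rules, `input_format: sam` fails the check, which is consistent with the schema's enum listing only `fastq` and `fasta`.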

sequana_pipelines/downsampling/tools.txt ADDED

@@ -0,0 +1,2 @@
+sequana
+pigz