DAJIN2 0.4.1__zip → 0.4.3__zip

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71) hide show
  1. {DAJIN2-0.4.1/src/DAJIN2.egg-info → DAJIN2-0.4.3}/PKG-INFO +41 -36
  2. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/README.md +30 -23
  3. DAJIN2-0.4.3/requirements.txt +20 -0
  4. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/setup.py +1 -1
  5. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/clustering/label_merger.py +20 -16
  6. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/consensus/consensus.py +3 -2
  7. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/consensus/name_handler.py +1 -7
  8. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/core.py +20 -123
  9. DAJIN2-0.4.3/src/DAJIN2/core/preprocess/__init__.py +9 -0
  10. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/preprocess/genome_fetcher.py +11 -3
  11. DAJIN2-0.4.3/src/DAJIN2/core/preprocess/input_formatter.py +109 -0
  12. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/preprocess/mapping.py +4 -0
  13. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/preprocess/midsv_caller.py +3 -4
  14. DAJIN2-0.4.3/src/DAJIN2/core/report/__init__.py +3 -0
  15. DAJIN2-0.4.1/src/DAJIN2/core/report/report_bam.py → DAJIN2-0.4.3/src/DAJIN2/core/report/bam_exporter.py +64 -50
  16. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/main.py +1 -1
  17. DAJIN2-0.4.3/src/DAJIN2/utils/fastx_handler.py +94 -0
  18. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/utils/input_validator.py +32 -21
  19. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/utils/io.py +6 -0
  20. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/utils/sam_handler.py +1 -0
  21. {DAJIN2-0.4.1 → DAJIN2-0.4.3/src/DAJIN2.egg-info}/PKG-INFO +41 -36
  22. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2.egg-info/SOURCES.txt +5 -5
  23. DAJIN2-0.4.3/src/DAJIN2.egg-info/requires.txt +16 -0
  24. DAJIN2-0.4.1/requirements.txt +0 -23
  25. DAJIN2-0.4.1/src/DAJIN2/core/preprocess/__init__.py +0 -12
  26. DAJIN2-0.4.1/src/DAJIN2/core/preprocess/fastx_parser.py +0 -59
  27. DAJIN2-0.4.1/src/DAJIN2/core/report/__init__.py +0 -3
  28. DAJIN2-0.4.1/src/DAJIN2/utils/fastx_handler.py +0 -42
  29. DAJIN2-0.4.1/src/DAJIN2.egg-info/requires.txt +0 -18
  30. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/LICENSE +0 -0
  31. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/MANIFEST.in +0 -0
  32. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/setup.cfg +0 -0
  33. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/__init__.py +0 -0
  34. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/__init__.py +0 -0
  35. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/classification/__init__.py +0 -0
  36. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/classification/allele_merger.py +0 -0
  37. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/classification/classifier.py +0 -0
  38. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/clustering/__init__.py +0 -0
  39. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/clustering/appender.py +0 -0
  40. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/clustering/clustering.py +0 -0
  41. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/clustering/kmer_generator.py +0 -0
  42. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/clustering/label_extractor.py +0 -0
  43. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/clustering/label_updator.py +0 -0
  44. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/clustering/score_handler.py +0 -0
  45. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/clustering/strand_bias_handler.py +0 -0
  46. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/consensus/__init__.py +0 -0
  47. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/consensus/clust_formatter.py +0 -0
  48. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/consensus/mutation_extractor.py +0 -0
  49. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/consensus/similarity_searcher.py +0 -0
  50. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/preprocess/cache_checker.py +0 -0
  51. /DAJIN2-0.4.1/src/DAJIN2/core/preprocess/directories.py → /DAJIN2-0.4.3/src/DAJIN2/core/preprocess/directory_manager.py +0 -0
  52. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/preprocess/homopolymer_handler.py +0 -0
  53. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/preprocess/insertions_to_fasta.py +0 -0
  54. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/preprocess/knockin_handler.py +0 -0
  55. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/preprocess/mutation_extractor.py +0 -0
  56. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/core/report/insertion_reflector.py +0 -0
  57. /DAJIN2-0.4.1/src/DAJIN2/core/report/report_mutation.py → /DAJIN2-0.4.3/src/DAJIN2/core/report/mutation_exporter.py +0 -0
  58. /DAJIN2-0.4.1/src/DAJIN2/core/report/report_files.py → /DAJIN2-0.4.3/src/DAJIN2/core/report/sequence_exporter.py +0 -0
  59. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/gui.py +0 -0
  60. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/static/css/style.css +0 -0
  61. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/template_igvjs.html +0 -0
  62. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/templates/index.html +0 -0
  63. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/utils/config.py +0 -0
  64. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/utils/cssplits_handler.py +0 -0
  65. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/utils/dna_handler.py +0 -0
  66. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/utils/multiprocess.py +0 -0
  67. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/utils/report_generator.py +0 -0
  68. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2/view.py +0 -0
  69. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2.egg-info/dependency_links.txt +0 -0
  70. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2.egg-info/entry_points.txt +0 -0
  71. {DAJIN2-0.4.1 → DAJIN2-0.4.3}/src/DAJIN2.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: DAJIN2
3
- Version: 0.4.1
3
+ Version: 0.4.3
4
4
  Summary: One-step genotyping tools for targeted long-read sequencing
5
5
  Home-page: https://github.com/akikuno/DAJIN2
6
6
  Author: Akihiro Kuno
@@ -14,24 +14,22 @@ Classifier: Intended Audience :: Science/Research
14
14
  Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
15
15
  Description-Content-Type: text/markdown
16
16
  License-File: LICENSE
17
- Requires-Dist: numpy>=1.20.0
18
- Requires-Dist: scipy>=1.6.0
17
+ Requires-Dist: numpy>=1.24.0
18
+ Requires-Dist: scipy>=1.10.0
19
19
  Requires-Dist: pandas>=1.0.0
20
- Requires-Dist: openpyxl>=3.0.0
21
- Requires-Dist: rapidfuzz>=3.0.0
22
- Requires-Dist: statsmodels>=0.13.5
23
- Requires-Dist: scikit-learn>=1.0.0
24
- Requires-Dist: openpyxl>=3.0.0
20
+ Requires-Dist: openpyxl>=3.1.0
21
+ Requires-Dist: rapidfuzz>=3.6.0
22
+ Requires-Dist: scikit-learn>=1.3.0
25
23
  Requires-Dist: mappy>=2.24
26
- Requires-Dist: pysam>=0.19.0
24
+ Requires-Dist: pysam>=0.21.0
27
25
  Requires-Dist: Flask>=2.2.0
28
26
  Requires-Dist: waitress>=2.1.0
29
27
  Requires-Dist: Jinja2>=3.1.0
30
- Requires-Dist: plotly>=5.0.0
28
+ Requires-Dist: plotly>=5.19.0
31
29
  Requires-Dist: kaleido>=0.2.0
32
- Requires-Dist: cstag>=0.4.1
33
- Requires-Dist: midsv>=0.10.1
34
- Requires-Dist: wslPath>=0.3.0
30
+ Requires-Dist: cstag>=1.0.0
31
+ Requires-Dist: midsv>=0.11.0
32
+ Requires-Dist: wslPath>=0.4.1
35
33
 
36
34
  [![License](https://img.shields.io/badge/License-MIT-9cf.svg)](https://choosealicense.com/licenses/mit/)
37
35
  [![Test](https://img.shields.io/github/actions/workflow/status/akikuno/dajin2/pytest.yml?branch=main&label=Test&color=brightgreen)](https://github.com/akikuno/dajin2/actions)
@@ -56,14 +54,14 @@ The name DAJIN is derived from the phrase 一網**打尽** (Ichimou **DAJIN** or
56
54
  + **Comprehensive Mutation Detection**: Equipped with the capability to detect genome editing events over a wide range, it can identify a broad spectrum of mutations, from small changes to large structural variations.
57
55
  + DAJIN2 can also detect complex mutations characteristic of genome editing, such as "insertions occurring in regions where deletions have occurred."
58
56
  + **Intuitive Visualization**: The outcomes of genome editing are visualized intuitively, allowing for the rapid and easy identification and analysis of mutations.
59
- + **Multi-Sample Compatibility**: Accommodates a variety of samples, enabling simultaneous processing of multiple samples. This facilitates efficient progression of large-scale experiments and comparative studies.
57
+ + **Multi-Sample Compatibility**: Enables parallel processing of multiple samples. This facilitates efficient progression of large-scale experiments and comparative studies.
60
58
 
61
59
 
62
60
  ## 🛠 Installation
63
61
 
64
62
  ### Prerequisites
65
63
 
66
- - Python 3.7 or later
64
+ - Python 3.8 or later
67
65
  - Unix-like environment (Linux, macOS, WSL2, etc.)
68
66
 
69
67
  ### From [Bioconda](https://anaconda.org/bioconda/DAJIN2) (Recommended)
@@ -80,6 +78,7 @@ conda activate env-dajin2
80
78
  > CONDA_SUBDIR=osx-64 conda create -n env-dajin2 -c conda-forge -c bioconda python=3.10 DAJIN2 -y
81
79
  > conda activate env-dajin2
82
80
  > conda config --env --set subdir osx-64
81
+ > python -c "import platform; print(platform.machine())" # Confirm that the output is 'x86_64', not 'arm64'
83
82
  > ```
84
83
 
85
84
  ### From [PyPI](https://pypi.org/project/DAJIN2/)
@@ -92,7 +91,7 @@ pip install DAJIN2
92
91
  > If you encounter any issues during the installation, please refer to the [Troubleshooting Guide](https://github.com/akikuno/DAJIN2/blob/main/docs/TROUBLESHOOTING.md)
93
92
 
94
93
 
95
- ## 💡 Usage
94
+ ## 💻 Usage
96
95
 
97
96
  ### Required Files
98
97
 
@@ -126,11 +125,11 @@ Assuming barcode01 as the control and barcode02 as the sample, specify each dire
126
125
  The FASTA file should contain descriptions of the alleles anticipated as a result of genome editing.
127
126
 
128
127
  > [!IMPORTANT]
129
- > Specifying the control allele: A header name >control and its sequence are mandatory.
128
+ > **A header name >control and its sequence are mandatory.**
130
129
 
131
130
  If there are anticipated alleles (e.g., knock-ins or knock-outs), include their sequences in the FASTA file too. These anticipated alleles can be named arbitrarily.
132
131
 
133
- Below is a typical example of a FASTA file:
132
+ Below is an example of a FASTA file:
134
133
 
135
134
  ```text
136
135
  >control
@@ -166,12 +165,17 @@ Options:
166
165
  #### Example
167
166
 
168
167
  ```bash
168
+ # Download example dataset
169
+ wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
170
+ tar -xf example_single.tar.gz
171
+
172
+ # Run DAJIN2
169
173
  DAJIN2 \
170
- --control example/barcode01 \
171
- --sample example/barcode02 \
172
- --allele example/design.fa \
173
- --name IL6-knockin \
174
- --genome hg38 \
174
+ --control example_single/control \
175
+ --sample example_single/sample \
176
+ --allele example_single/stx2_deletion.fa \
177
+ --name stx2_deletion \
178
+ --genome mm39 \
175
179
  --threads 4
176
180
  ```
177
181
 
@@ -208,7 +212,6 @@ DAJIN2 \
208
212
 
209
213
  By using the `batch` subcommand, you can process multiple FASTQ files simultaneously.
210
214
  For this purpose, a CSV or Excel file consolidating the sample information is required.
211
- <!-- For a specific example, please refer to [this link](https://github.com/akikuno/DAJIN2/blob/main/examples/example-batch/batch.csv). -->
212
215
 
213
216
  > [!NOTE]
214
217
  > For guidance on how to compile sample information, please refer to [this document](https://docs.google.com/presentation/d/e/2PACX-1vSMEmXJPG2TNjfT66XZJRzqJd82aAqO5gJrdEzyhn15YBBr_Li-j5puOgVChYf3jA/embed?start=false&loop=false&delayms=3000).
@@ -226,13 +229,18 @@ options:
226
229
  #### Example
227
230
 
228
231
  ```bash
229
- DAJIN2 --file batch.csv --threads 4
232
+ # Download the example dataset
233
+ wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
234
+ tar -xf example_batch.tar.gz
235
+
236
+ # Run DAJIN2
237
+ DAJIN2 batch --file example_batch/batch.csv --threads 4
230
238
  ```
231
239
 
232
240
  <!-- ```bash
233
241
  # Download the example dataset
234
- wget https://github.com/akikuno/DAJIN2/raw/main/examples/example-batch.tar.gz
235
- tar -xf example-batch.tar.gz
242
+ wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
243
+ tar -xf example_batch.tar.gz
236
244
 
237
245
  # Run DAJIN2
238
246
  DAJIN2 batch --file example-batch/batch.csv --threads 3
@@ -313,16 +321,17 @@ For example, Tyr point mutation is highlighted in **green**.
313
321
  ### 3. MUTATION_INFO
314
322
 
315
323
  The MUTATION_INFO directory saves tables depicting mutation sites for each allele.
316
- An example of a Tyr point mutation is described by its position on the chromosome and the type of mutation.
324
+ An example of a *Tyr* point mutation is described by its position on the chromosome and the type of mutation.
317
325
 
318
326
  <img src="https://user-images.githubusercontent.com/15861316/274519342-a613490d-5dbb-4a27-a2cf-bca0686b30f0.png" width="75%">
319
327
 
320
- ### 4. read_plot.html and read_plot.pdf
328
+ ### 4. read_summary.xlsx, read_plot.html and read_plot.pdf
321
329
 
330
+ read_summary.xlsx describes the number of reads and presence proportion for each allele.
322
331
  Both read_plot.html and read_plot.pdf illustrate the proportions of each allele.
323
- The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for that allele.
332
+ The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for each allele.
324
333
 
325
- Additionally, the types of **Allele type** include:
334
+ The **Allele type** includes:
326
335
  - **Intact**: Alleles that perfectly match the input FASTA allele.
327
336
  - **Indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
328
337
  - **SV**: Substitutions, deletions, insertions, or inversions beyond 50 bases.
@@ -333,14 +342,10 @@ Additionally, the types of **Allele type** include:
333
342
  > In PCR amplicon sequencing, the % of reads might not match the actual allele proportions due to amplification bias.
334
343
  > Especially when large deletions are present, the deletion alleles might be significantly amplified, potentially not reflecting the actual allele proportions.
335
344
 
336
- ### 5. read_summary.xlsx
337
-
338
- - read_summary.xlsx: Describes the number of reads and presence proportion for each allele.
339
-
340
345
  ## 📣Feedback and Support
341
346
 
342
347
  For questions, bug reports, or other forms of feedback, we'd love to hear from you!
343
- Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues) for all reporting purposes.
348
+ Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues/new/choose) for all reporting purposes.
344
349
 
345
350
  Please refer to [CONTRIBUTING](https://github.com/akikuno/DAJIN2/blob/main/docs/CONTRIBUTING.md) for how to contribute and how to verify your contributions.
346
351
 
@@ -21,14 +21,14 @@ The name DAJIN is derived from the phrase 一網**打尽** (Ichimou **DAJIN** or
21
21
  + **Comprehensive Mutation Detection**: Equipped with the capability to detect genome editing events over a wide range, it can identify a broad spectrum of mutations, from small changes to large structural variations.
22
22
  + DAJIN2 can also detect complex mutations characteristic of genome editing, such as "insertions occurring in regions where deletions have occurred."
23
23
  + **Intuitive Visualization**: The outcomes of genome editing are visualized intuitively, allowing for the rapid and easy identification and analysis of mutations.
24
- + **Multi-Sample Compatibility**: Accommodates a variety of samples, enabling simultaneous processing of multiple samples. This facilitates efficient progression of large-scale experiments and comparative studies.
24
+ + **Multi-Sample Compatibility**: Enables parallel processing of multiple samples. This facilitates efficient progression of large-scale experiments and comparative studies.
25
25
 
26
26
 
27
27
  ## 🛠 Installation
28
28
 
29
29
  ### Prerequisites
30
30
 
31
- - Python 3.7 or later
31
+ - Python 3.8 or later
32
32
  - Unix-like environment (Linux, macOS, WSL2, etc.)
33
33
 
34
34
  ### From [Bioconda](https://anaconda.org/bioconda/DAJIN2) (Recommended)
@@ -45,6 +45,7 @@ conda activate env-dajin2
45
45
  > CONDA_SUBDIR=osx-64 conda create -n env-dajin2 -c conda-forge -c bioconda python=3.10 DAJIN2 -y
46
46
  > conda activate env-dajin2
47
47
  > conda config --env --set subdir osx-64
48
+ > python -c "import platform; print(platform.machine())" # Confirm that the output is 'x86_64', not 'arm64'
48
49
  > ```
49
50
 
50
51
  ### From [PyPI](https://pypi.org/project/DAJIN2/)
@@ -57,7 +58,7 @@ pip install DAJIN2
57
58
  > If you encounter any issues during the installation, please refer to the [Troubleshooting Guide](https://github.com/akikuno/DAJIN2/blob/main/docs/TROUBLESHOOTING.md)
58
59
 
59
60
 
60
- ## 💡 Usage
61
+ ## 💻 Usage
61
62
 
62
63
  ### Required Files
63
64
 
@@ -91,11 +92,11 @@ Assuming barcode01 as the control and barcode02 as the sample, specify each dire
91
92
  The FASTA file should contain descriptions of the alleles anticipated as a result of genome editing.
92
93
 
93
94
  > [!IMPORTANT]
94
- > Specifying the control allele: A header name >control and its sequence are mandatory.
95
+ > **A header name >control and its sequence are mandatory.**
95
96
 
96
97
  If there are anticipated alleles (e.g., knock-ins or knock-outs), include their sequences in the FASTA file too. These anticipated alleles can be named arbitrarily.
97
98
 
98
- Below is a typical example of a FASTA file:
99
+ Below is an example of a FASTA file:
99
100
 
100
101
  ```text
101
102
  >control
@@ -131,12 +132,17 @@ Options:
131
132
  #### Example
132
133
 
133
134
  ```bash
135
+ # Download example dataset
136
+ wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
137
+ tar -xf example_single.tar.gz
138
+
139
+ # Run DAJIN2
134
140
  DAJIN2 \
135
- --control example/barcode01 \
136
- --sample example/barcode02 \
137
- --allele example/design.fa \
138
- --name IL6-knockin \
139
- --genome hg38 \
141
+ --control example_single/control \
142
+ --sample example_single/sample \
143
+ --allele example_single/stx2_deletion.fa \
144
+ --name stx2_deletion \
145
+ --genome mm39 \
140
146
  --threads 4
141
147
  ```
142
148
 
@@ -173,7 +179,6 @@ DAJIN2 \
173
179
 
174
180
  By using the `batch` subcommand, you can process multiple FASTQ files simultaneously.
175
181
  For this purpose, a CSV or Excel file consolidating the sample information is required.
176
- <!-- For a specific example, please refer to [this link](https://github.com/akikuno/DAJIN2/blob/main/examples/example-batch/batch.csv). -->
177
182
 
178
183
  > [!NOTE]
179
184
  > For guidance on how to compile sample information, please refer to [this document](https://docs.google.com/presentation/d/e/2PACX-1vSMEmXJPG2TNjfT66XZJRzqJd82aAqO5gJrdEzyhn15YBBr_Li-j5puOgVChYf3jA/embed?start=false&loop=false&delayms=3000).
@@ -191,13 +196,18 @@ options:
191
196
  #### Example
192
197
 
193
198
  ```bash
194
- DAJIN2 --file batch.csv --threads 4
199
+ # Download the example dataset
200
+ wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
201
+ tar -xf example_batch.tar.gz
202
+
203
+ # Run DAJIN2
204
+ DAJIN2 batch --file example_batch/batch.csv --threads 4
195
205
  ```
196
206
 
197
207
  <!-- ```bash
198
208
  # Download the example dataset
199
- wget https://github.com/akikuno/DAJIN2/raw/main/examples/example-batch.tar.gz
200
- tar -xf example-batch.tar.gz
209
+ wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
210
+ tar -xf example_batch.tar.gz
201
211
 
202
212
  # Run DAJIN2
203
213
  DAJIN2 batch --file example-batch/batch.csv --threads 3
@@ -278,16 +288,17 @@ For example, Tyr point mutation is highlighted in **green**.
278
288
  ### 3. MUTATION_INFO
279
289
 
280
290
  The MUTATION_INFO directory saves tables depicting mutation sites for each allele.
281
- An example of a Tyr point mutation is described by its position on the chromosome and the type of mutation.
291
+ An example of a *Tyr* point mutation is described by its position on the chromosome and the type of mutation.
282
292
 
283
293
  <img src="https://user-images.githubusercontent.com/15861316/274519342-a613490d-5dbb-4a27-a2cf-bca0686b30f0.png" width="75%">
284
294
 
285
- ### 4. read_plot.html and read_plot.pdf
295
+ ### 4. read_summary.xlsx, read_plot.html and read_plot.pdf
286
296
 
297
+ read_summary.xlsx describes the number of reads and presence proportion for each allele.
287
298
  Both read_plot.html and read_plot.pdf illustrate the proportions of each allele.
288
- The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for that allele.
299
+ The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for each allele.
289
300
 
290
- Additionally, the types of **Allele type** include:
301
+ The **Allele type** includes:
291
302
  - **Intact**: Alleles that perfectly match the input FASTA allele.
292
303
  - **Indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
293
304
  - **SV**: Substitutions, deletions, insertions, or inversions beyond 50 bases.
@@ -298,14 +309,10 @@ Additionally, the types of **Allele type** include:
298
309
  > In PCR amplicon sequencing, the % of reads might not match the actual allele proportions due to amplification bias.
299
310
  > Especially when large deletions are present, the deletion alleles might be significantly amplified, potentially not reflecting the actual allele proportions.
300
311
 
301
- ### 5. read_summary.xlsx
302
-
303
- - read_summary.xlsx: Describes the number of reads and presence proportion for each allele.
304
-
305
312
  ## 📣Feedback and Support
306
313
 
307
314
  For questions, bug reports, or other forms of feedback, we'd love to hear from you!
308
- Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues) for all reporting purposes.
315
+ Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues/new/choose) for all reporting purposes.
309
316
 
310
317
  Please refer to [CONTRIBUTING](https://github.com/akikuno/DAJIN2/blob/main/docs/CONTRIBUTING.md) for how to contribute and how to verify your contributions.
311
318
 
@@ -0,0 +1,20 @@
1
+ numpy >= 1.24.0
2
+ scipy >= 1.10.0
3
+ pandas >= 1.0.0
4
+ openpyxl >= 3.1.0
5
+ rapidfuzz >=3.6.0
6
+ scikit-learn >= 1.3.0
7
+
8
+ mappy >= 2.24
9
+ pysam >= 0.21.0
10
+
11
+ Flask >= 2.2.0
12
+ waitress >= 2.1.0
13
+ Jinja2 >= 3.1.0
14
+
15
+ plotly >= 5.19.0
16
+ kaleido >= 0.2.0
17
+
18
+ cstag >= 1.0.0
19
+ midsv >= 0.11.0
20
+ wslPath >=0.4.1
@@ -9,7 +9,7 @@ with open("requirements.txt") as requirements_file:
9
9
 
10
10
  setuptools.setup(
11
11
  name="DAJIN2",
12
- version="0.4.1",
12
+ version="0.4.3",
13
13
  author="Akihiro Kuno",
14
14
  author_email="akuno@md.tsukuba.ac.jp",
15
15
  description="One-step genotyping tools for targeted long-read sequencing",
@@ -11,20 +11,6 @@ def calculate_label_percentages(labels: list[int]) -> dict[int, float]:
11
11
  return {label: (count / total_labels * 100) for label, count in label_counts.items()}
12
12
 
13
13
 
14
- def merge_mixed_cluster(labels_control: list[int], labels_sample: list[int], threshold: float = 0.5) -> list[int]:
15
- """Merge labels in sample if they appear more than 'threshold' percentage in control."""
16
- labels_merged = labels_sample.copy()
17
- label_percentages_control = calculate_label_percentages(labels_control)
18
- mixed_labels = {label for label, percent in label_percentages_control.items() if percent > threshold}
19
-
20
- new_label = max(labels_merged) + 1
21
- for i, label in enumerate(labels_sample):
22
- if label in mixed_labels:
23
- labels_merged[i] = new_label
24
-
25
- return labels_merged
26
-
27
-
28
14
  def map_clusters_to_previous(labels_sample: list[int], labels_previous: list[int]) -> dict[int, int]:
29
15
  """
30
16
  Determine which cluster in labels_previous corresponds to each cluster in labels_sample.
@@ -63,6 +49,8 @@ def merge_minor_cluster(
63
49
  minor_labels_percentage = {label for label, percent in label_percentages.items() if percent < threshold_percentage}
64
50
  minor_labels_readnumber = {label for label, num in Counter(labels_sample).items() if num <= threshold_readnumber}
65
51
  minor_labels = minor_labels_percentage | minor_labels_readnumber
52
+ if minor_labels == set():
53
+ return labels_sample
66
54
 
67
55
  correspondence = map_clusters_to_previous(labels_sample, labels_previous)
68
56
  update_required_labels = get_update_required_labels(correspondence)
@@ -70,7 +58,23 @@ def merge_minor_cluster(
70
58
  labels_merged = labels_sample.copy()
71
59
  for m in minor_labels:
72
60
  new_label = max(labels_merged) + 1
73
- labels_merged = [new_label if label in update_required_labels[correspondence[m]] else label for label in labels_merged]
61
+ labels_merged = [
62
+ new_label if label in update_required_labels[correspondence[m]] else label for label in labels_merged
63
+ ]
64
+
65
+ return labels_merged
66
+
67
+
68
+ def merge_mixed_cluster(labels_control: list[int], labels_sample: list[int], threshold: float = 0.5) -> list[int]:
69
+ """Merge labels in sample if they appear more than 'threshold' percentage in control."""
70
+ labels_merged = labels_sample.copy()
71
+ label_percentages_control = calculate_label_percentages(labels_control)
72
+ mixed_labels = {label for label, percent in label_percentages_control.items() if percent > threshold}
73
+
74
+ new_label = max(labels_merged) + 1
75
+ for i, label in enumerate(labels_sample):
76
+ if label in mixed_labels:
77
+ labels_merged[i] = new_label
74
78
 
75
79
  return labels_merged
76
80
 
@@ -82,7 +86,7 @@ def merge_minor_cluster(
82
86
 
83
87
  def merge_labels(labels_control: list[int], labels_sample: list[int], labels_previous: list[int]) -> list[int]:
84
88
  labels_merged = merge_minor_cluster(
85
- labels_sample, labels_previous, threshold_percentage=0.5, threshold_readnumber=10
89
+ labels_sample, labels_previous, threshold_percentage=0.5, threshold_readnumber=5
86
90
  )
87
91
  labels_merged = merge_mixed_cluster(labels_control, labels_merged)
88
92
  return labels_merged
@@ -1,7 +1,7 @@
1
1
  from __future__ import annotations
2
2
 
3
3
  from pathlib import Path
4
- from typing import NamedTuple
4
+ from dataclasses import dataclass
5
5
  from itertools import groupby
6
6
  from collections import defaultdict
7
7
 
@@ -90,7 +90,8 @@ def call_percentage(cssplits: list[list[str]], mutation_loci: list[set[str]]) ->
90
90
  ###########################################################
91
91
 
92
92
 
93
- class ConsensusKey(NamedTuple):
93
+ @dataclass(frozen=True)
94
+ class ConsensusKey:
94
95
  allele: str
95
96
  label: int
96
97
  percent: float
@@ -1,13 +1,7 @@
1
1
  from __future__ import annotations
2
2
 
3
3
  import re
4
- from typing import NamedTuple
5
-
6
-
7
- class ConsensusKey(NamedTuple):
8
- allele: str
9
- label: int
10
- percent: float
4
+ from DAJIN2.core.consensus.consensus import ConsensusKey
11
5
 
12
6
 
13
7
  def _detect_sv(cons_percentages: dict[ConsensusKey, list], threshold: int = 50) -> list[bool]:
@@ -2,119 +2,16 @@ from __future__ import annotations
2
2
 
3
3
  import shutil
4
4
  import logging
5
- import uuid
6
5
 
7
6
  from pathlib import Path
8
- from typing import NamedTuple
9
- from collections import defaultdict
10
7
 
11
- from DAJIN2.utils import io, config, fastx_handler
8
+ from DAJIN2.utils import io, fastx_handler
12
9
  from DAJIN2.core import classification, clustering, consensus, preprocess, report
10
+ from DAJIN2.core.preprocess.input_formatter import FormattedInputs
13
11
 
14
12
  logger = logging.getLogger(__name__)
15
13
 
16
14
 
17
- def parse_arguments(arguments: dict) -> tuple:
18
- genome_urls = defaultdict(str)
19
- if arguments.get("genome"):
20
- genome_urls.update(
21
- {"genome": arguments["genome"], "blat": arguments["blat"], "goldenpath": arguments["goldenpath"]}
22
- )
23
-
24
- return (
25
- arguments["sample"],
26
- arguments["control"],
27
- arguments["allele"],
28
- arguments["name"],
29
- arguments["threads"],
30
- genome_urls,
31
- uuid.uuid4().hex,
32
- )
33
-
34
-
35
- def convert_input_paths_to_posix(sample: str, control: str, allele: str) -> tuple:
36
- sample = io.convert_to_posix(sample)
37
- control = io.convert_to_posix(control)
38
- allele = io.convert_to_posix(allele)
39
-
40
- return sample, control, allele
41
-
42
-
43
- def create_temporal_directory(name: str, control_name: str) -> Path:
44
- tempdir = Path(config.TEMP_ROOT_DIR, name)
45
- Path(tempdir, "cache", ".igvjs", control_name).mkdir(parents=True, exist_ok=True)
46
-
47
- return tempdir
48
-
49
-
50
- def check_caches(tempdir: Path, path_allele: str, genome_url: str) -> bool:
51
- is_cache_hash = preprocess.cache_checker.exists_cached_hash(tempdir=tempdir, path=path_allele)
52
- is_cache_genome = preprocess.cache_checker.exists_cached_genome(tempdir=tempdir, genome=genome_url)
53
-
54
- return is_cache_hash and is_cache_genome
55
-
56
-
57
- def get_genome_coordinates(genome_urls: dict, fasta_alleles: dict, is_cache_genome: bool, tempdir: Path) -> dict:
58
- genome_coordinates = {
59
- "genome": genome_urls["genome"],
60
- "chrom_size": 0,
61
- "chrom": "control",
62
- "start": 0,
63
- "end": len(fasta_alleles["control"]) - 1,
64
- "strand": "+",
65
- }
66
- if genome_urls["genome"]:
67
- if is_cache_genome:
68
- genome_coordinates = next(io.read_jsonl(Path(tempdir, "cache", "genome_coordinates.jsonl")))
69
- else:
70
- genome_coordinates = preprocess.genome_fetcher.fetch_coordinates(
71
- genome_coordinates, genome_urls, fasta_alleles["control"]
72
- )
73
- genome_coordinates["chrom_size"] = preprocess.genome_fetcher.fetch_chromosome_size(
74
- genome_coordinates, genome_urls
75
- )
76
- io.write_jsonl([genome_coordinates], Path(tempdir, "cache", "genome_coordinates.jsonl"))
77
-
78
- return genome_coordinates
79
-
80
-
81
- class FormattedInputs(NamedTuple):
82
- path_sample: str
83
- path_control: str
84
- path_allele: str
85
- sample_name: str
86
- control_name: str
87
- fasta_alleles: dict[str, str]
88
- tempdir: Path
89
- genome_coordinates: dict[str, str]
90
- threads: int
91
- uuid: str
92
-
93
-
94
- def format_inputs(arguments: dict) -> FormattedInputs:
95
- path_sample, path_control, path_allele, name, threads, genome_urls, uuid = parse_arguments(arguments)
96
- path_sample, path_control, path_allele = convert_input_paths_to_posix(path_sample, path_control, path_allele)
97
- sample_name = preprocess.fastx_parser.extract_basename(path_sample)
98
- control_name = preprocess.fastx_parser.extract_basename(path_control)
99
- fasta_alleles = preprocess.fastx_parser.dictionize_allele(path_allele)
100
- tempdir = create_temporal_directory(name, control_name)
101
- is_cache_genome = check_caches(tempdir, path_allele, genome_urls["genome"])
102
- genome_coordinates = get_genome_coordinates(genome_urls, fasta_alleles, is_cache_genome, tempdir)
103
-
104
- return FormattedInputs(
105
- path_sample,
106
- path_control,
107
- path_allele,
108
- sample_name,
109
- control_name,
110
- fasta_alleles,
111
- tempdir,
112
- genome_coordinates,
113
- threads,
114
- uuid,
115
- )
116
-
117
-
118
15
  ###########################################################
119
16
  # main
120
17
  ###########################################################
@@ -126,9 +23,9 @@ def execute_control(arguments: dict):
126
23
  ###########################################################
127
24
  # Preprocess
128
25
  ###########################################################
129
- ARGS = format_inputs(arguments)
130
- preprocess.directories.create_temporal_directories(ARGS.tempdir, ARGS.control_name, is_control=True)
131
- preprocess.directories.create_report_directories(ARGS.tempdir, ARGS.control_name, is_control=True)
26
+ ARGS: FormattedInputs = preprocess.format_inputs(arguments)
27
+ preprocess.create_temporal_directories(ARGS.tempdir, ARGS.control_name, is_control=True)
28
+ preprocess.create_report_directories(ARGS.tempdir, ARGS.control_name, is_control=True)
132
29
  io.cache_control_hash(ARGS.tempdir, ARGS.path_allele)
133
30
 
134
31
  ###########################################################
@@ -151,7 +48,7 @@ def execute_control(arguments: dict):
151
48
  # ============================================================
152
49
  # Export fasta files as single-FASTA format
153
50
  # ============================================================
154
- preprocess.fastx_parser.export_fasta_files(ARGS.tempdir, ARGS.fasta_alleles, ARGS.control_name)
51
+ fastx_handler.export_fasta_files(ARGS.tempdir, ARGS.fasta_alleles, ARGS.control_name)
155
52
 
156
53
  # ============================================================
157
54
  # Mapping using mappy
@@ -173,8 +70,8 @@ def execute_control(arguments: dict):
173
70
  # Output BAM files
174
71
  ###########################################################
175
72
  logger.info(f"Output BAM files of {arguments['control']}...")
176
- report.report_bam.export_to_bam(
177
- ARGS.tempdir, ARGS.control_name, ARGS.genome_coordinates, ARGS.threads, is_control=True
73
+ report.bam_exporter.export_to_bam(
74
+ ARGS.tempdir, ARGS.control_name, ARGS.genome_coordinates, ARGS.threads, ARGS.uuid, is_control=True
178
75
  )
179
76
  ###########################################################
180
77
  # Finish call
@@ -189,9 +86,9 @@ def execute_sample(arguments: dict):
189
86
  # Preprocess
190
87
  ###########################################################
191
88
 
192
- ARGS = format_inputs(arguments)
193
- preprocess.directories.create_temporal_directories(ARGS.tempdir, ARGS.sample_name, is_control=False)
194
- preprocess.directories.create_report_directories(ARGS.tempdir, ARGS.sample_name, is_control=False)
89
+ ARGS: FormattedInputs = preprocess.format_inputs(arguments)
90
+ preprocess.create_temporal_directories(ARGS.tempdir, ARGS.sample_name, is_control=False)
91
+ preprocess.create_report_directories(ARGS.tempdir, ARGS.sample_name, is_control=False)
195
92
 
196
93
  logger.info(f"Preprocess {arguments['sample']}...")
197
94
 
@@ -209,7 +106,7 @@ def execute_sample(arguments: dict):
209
106
  shutil.copy(path_fasta, Path(ARGS.tempdir, ARGS.sample_name, "fasta"))
210
107
 
211
108
  paths_fasta = Path(ARGS.tempdir, ARGS.sample_name, "fasta").glob("*.fasta")
212
- preprocess.mapping.generate_sam(ARGS, paths_fasta, is_control=False, is_insertion=False)
109
+ preprocess.generate_sam(ARGS, paths_fasta, is_control=False, is_insertion=False)
213
110
 
214
111
  # ============================================================
215
112
  # MIDSV conversion
@@ -234,8 +131,8 @@ def execute_sample(arguments: dict):
234
131
 
235
132
  if paths_insertion_fasta:
236
133
  # mapping to insertion alleles
237
- preprocess.mapping.generate_sam(ARGS, paths_insertion_fasta, is_control=True, is_insertion=True)
238
- preprocess.mapping.generate_sam(ARGS, paths_insertion_fasta, is_control=False, is_insertion=True)
134
+ preprocess.generate_sam(ARGS, paths_insertion_fasta, is_control=True, is_insertion=True)
135
+ preprocess.generate_sam(ARGS, paths_insertion_fasta, is_control=False, is_insertion=True)
239
136
  # add insertions to ARGS.fasta_alleles
240
137
  for path_fasta in paths_insertion_fasta:
241
138
  allele, seq = Path(path_fasta).read_text().strip().split("\n")
@@ -307,15 +204,15 @@ def execute_sample(arguments: dict):
307
204
  # RESULT
308
205
  io.write_jsonl(RESULT_SAMPLE, Path(ARGS.tempdir, "result", f"{ARGS.sample_name}.jsonl"))
309
206
  # FASTA
310
- report.report_files.export_to_fasta(ARGS.tempdir, ARGS.sample_name, cons_sequence)
311
- report.report_files.export_reference_to_fasta(ARGS.tempdir, ARGS.sample_name)
207
+ report.sequence_exporter.export_to_fasta(ARGS.tempdir, ARGS.sample_name, cons_sequence)
208
+ report.sequence_exporter.export_reference_to_fasta(ARGS.tempdir, ARGS.sample_name)
312
209
  # HTML
313
- report.report_files.export_to_html(ARGS.tempdir, ARGS.sample_name, cons_percentage)
210
+ report.sequence_exporter.export_to_html(ARGS.tempdir, ARGS.sample_name, cons_percentage)
314
211
  # CSV (Allele Info)
315
- report.report_mutation.export_to_csv(ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, cons_percentage)
212
+ report.mutation_exporter.export_to_csv(ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, cons_percentage)
316
213
  # BAM
317
- report.report_bam.export_to_bam(
318
- ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, ARGS.threads, RESULT_SAMPLE
214
+ report.bam_exporter.export_to_bam(
215
+ ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, ARGS.threads, ARGS.uuid, RESULT_SAMPLE
319
216
  )
320
217
  for path_bam_igvjs in Path(ARGS.tempdir, "cache", ".igvjs").glob(f"{ARGS.control_name}_control.bam*"):
321
218
  shutil.copy(path_bam_igvjs, Path(ARGS.tempdir, "report", ".igvjs", ARGS.sample_name))
@@ -0,0 +1,9 @@
1
+ from DAJIN2.core.preprocess.cache_checker import exists_cached_hash, exists_cached_genome
2
+ from DAJIN2.core.preprocess.genome_fetcher import fetch_coordinates, fetch_chromosome_size
3
+ from DAJIN2.core.preprocess.mapping import generate_sam
4
+ from DAJIN2.core.preprocess.directory_manager import create_temporal_directories, create_report_directories
5
+ from DAJIN2.core.preprocess.input_formatter import format_inputs
6
+ from DAJIN2.core.preprocess.midsv_caller import generate_midsv
7
+ from DAJIN2.core.preprocess.knockin_handler import extract_knockin_loci
8
+ from DAJIN2.core.preprocess.mutation_extractor import cache_mutation_loci
9
+ from DAJIN2.core.preprocess.insertions_to_fasta import generate_insertion_fasta