DAJIN2 0.4.1__zip → 0.4.2__zip
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {DAJIN2-0.4.1/src/DAJIN2.egg-info → DAJIN2-0.4.2}/PKG-INFO +13 -18
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/README.md +11 -14
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/requirements.txt +1 -4
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/setup.py +1 -1
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/consensus/consensus.py +3 -2
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/consensus/name_handler.py +1 -7
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/core.py +12 -115
- DAJIN2-0.4.2/src/DAJIN2/core/preprocess/__init__.py +9 -0
- DAJIN2-0.4.2/src/DAJIN2/core/preprocess/input_formatter.py +109 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/preprocess/mapping.py +4 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/preprocess/midsv_caller.py +2 -2
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/main.py +1 -1
- DAJIN2-0.4.2/src/DAJIN2/utils/fastx_handler.py +94 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/utils/input_validator.py +32 -21
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/utils/sam_handler.py +14 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2/src/DAJIN2.egg-info}/PKG-INFO +13 -18
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2.egg-info/SOURCES.txt +2 -2
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2.egg-info/requires.txt +1 -3
- DAJIN2-0.4.1/src/DAJIN2/core/preprocess/__init__.py +0 -12
- DAJIN2-0.4.1/src/DAJIN2/core/preprocess/fastx_parser.py +0 -59
- DAJIN2-0.4.1/src/DAJIN2/utils/fastx_handler.py +0 -42
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/LICENSE +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/MANIFEST.in +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/setup.cfg +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/__init__.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/__init__.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/classification/__init__.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/classification/allele_merger.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/classification/classifier.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/clustering/__init__.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/clustering/appender.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/clustering/clustering.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/clustering/kmer_generator.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/clustering/label_extractor.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/clustering/label_merger.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/clustering/label_updator.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/clustering/score_handler.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/clustering/strand_bias_handler.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/consensus/__init__.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/consensus/clust_formatter.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/consensus/mutation_extractor.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/consensus/similarity_searcher.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/preprocess/cache_checker.py +0 -0
- /DAJIN2-0.4.1/src/DAJIN2/core/preprocess/directories.py → /DAJIN2-0.4.2/src/DAJIN2/core/preprocess/directory_manager.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/preprocess/genome_fetcher.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/preprocess/homopolymer_handler.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/preprocess/insertions_to_fasta.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/preprocess/knockin_handler.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/preprocess/mutation_extractor.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/report/__init__.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/report/insertion_reflector.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/report/report_bam.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/report/report_files.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/core/report/report_mutation.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/gui.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/static/css/style.css +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/template_igvjs.html +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/templates/index.html +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/utils/config.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/utils/cssplits_handler.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/utils/dna_handler.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/utils/io.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/utils/multiprocess.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/utils/report_generator.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2/view.py +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2.egg-info/dependency_links.txt +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2.egg-info/entry_points.txt +0 -0
- {DAJIN2-0.4.1 → DAJIN2-0.4.2}/src/DAJIN2.egg-info/top_level.txt +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.1
|
|
2
2
|
Name: DAJIN2
|
|
3
|
-
Version: 0.4.
|
|
3
|
+
Version: 0.4.2
|
|
4
4
|
Summary: One-step genotyping tools for targeted long-read sequencing
|
|
5
5
|
Home-page: https://github.com/akikuno/DAJIN2
|
|
6
6
|
Author: Akihiro Kuno
|
|
@@ -19,9 +19,7 @@ Requires-Dist: scipy>=1.6.0
|
|
|
19
19
|
Requires-Dist: pandas>=1.0.0
|
|
20
20
|
Requires-Dist: openpyxl>=3.0.0
|
|
21
21
|
Requires-Dist: rapidfuzz>=3.0.0
|
|
22
|
-
Requires-Dist: statsmodels>=0.13.5
|
|
23
22
|
Requires-Dist: scikit-learn>=1.0.0
|
|
24
|
-
Requires-Dist: openpyxl>=3.0.0
|
|
25
23
|
Requires-Dist: mappy>=2.24
|
|
26
24
|
Requires-Dist: pysam>=0.19.0
|
|
27
25
|
Requires-Dist: Flask>=2.2.0
|
|
@@ -29,7 +27,7 @@ Requires-Dist: waitress>=2.1.0
|
|
|
29
27
|
Requires-Dist: Jinja2>=3.1.0
|
|
30
28
|
Requires-Dist: plotly>=5.0.0
|
|
31
29
|
Requires-Dist: kaleido>=0.2.0
|
|
32
|
-
Requires-Dist: cstag>=0.
|
|
30
|
+
Requires-Dist: cstag>=1.0.0
|
|
33
31
|
Requires-Dist: midsv>=0.10.1
|
|
34
32
|
Requires-Dist: wslPath>=0.3.0
|
|
35
33
|
|
|
@@ -56,14 +54,14 @@ The name DAJIN is derived from the phrase 一網**打尽** (Ichimou **DAJIN** or
|
|
|
56
54
|
+ **Comprehensive Mutation Detection**: Equipped with the capability to detect genome editing events over a wide range, it can identify a broad spectrum of mutations, from small changes to large structural variations.
|
|
57
55
|
+ DAJIN2 can also detect complex mutations characteristic of genome editing, such as "insertions occurring in regions where deletions have occurred."
|
|
58
56
|
+ **Intuitive Visualization**: The outcomes of genome editing are visualized intuitively, allowing for the rapid and easy identification and analysis of mutations.
|
|
59
|
-
+ **Multi-Sample Compatibility**:
|
|
57
|
+
+ **Multi-Sample Compatibility**: Enabling parallel processing of multiple samples. This facilitates efficient progression of large-scale experiments and comparative studies.
|
|
60
58
|
|
|
61
59
|
|
|
62
60
|
## 🛠 Installation
|
|
63
61
|
|
|
64
62
|
### Prerequisites
|
|
65
63
|
|
|
66
|
-
- Python 3.
|
|
64
|
+
- Python 3.8 or later
|
|
67
65
|
- Unix-like environment (Linux, macOS, WSL2, etc.)
|
|
68
66
|
|
|
69
67
|
### From [Bioconda](https://anaconda.org/bioconda/DAJIN2) (Recommended)
|
|
@@ -92,7 +90,7 @@ pip install DAJIN2
|
|
|
92
90
|
> If you encounter any issues during the installation, please refer to the [Troubleshooting Guide](https://github.com/akikuno/DAJIN2/blob/main/docs/TROUBLESHOOTING.md)
|
|
93
91
|
|
|
94
92
|
|
|
95
|
-
##
|
|
93
|
+
## 💻 Usage
|
|
96
94
|
|
|
97
95
|
### Required Files
|
|
98
96
|
|
|
@@ -126,11 +124,11 @@ Assuming barcode01 as the control and barcode02 as the sample, specify each dire
|
|
|
126
124
|
The FASTA file should contain descriptions of the alleles anticipated as a result of genome editing.
|
|
127
125
|
|
|
128
126
|
> [!IMPORTANT]
|
|
129
|
-
>
|
|
127
|
+
> **A header name >control and its sequence are mandatory.**
|
|
130
128
|
|
|
131
129
|
If there are anticipated alleles (e.g., knock-ins or knock-outs), include their sequences in the FASTA file too. These anticipated alleles can be named arbitrarily.
|
|
132
130
|
|
|
133
|
-
Below is
|
|
131
|
+
Below is an example of a FASTA file:
|
|
134
132
|
|
|
135
133
|
```text
|
|
136
134
|
>control
|
|
@@ -313,16 +311,17 @@ For example, Tyr point mutation is highlighted in **green**.
|
|
|
313
311
|
### 3. MUTATION_INFO
|
|
314
312
|
|
|
315
313
|
The MUTATION_INFO directory saves tables depicting mutation sites for each allele.
|
|
316
|
-
An example of a Tyr point mutation is described by its position on the chromosome and the type of mutation.
|
|
314
|
+
An example of a *Tyr* point mutation is described by its position on the chromosome and the type of mutation.
|
|
317
315
|
|
|
318
316
|
<img src="https://user-images.githubusercontent.com/15861316/274519342-a613490d-5dbb-4a27-a2cf-bca0686b30f0.png" width="75%">
|
|
319
317
|
|
|
320
|
-
### 4. read_plot.html and read_plot.pdf
|
|
318
|
+
### 4. read_summary.xlsx, read_plot.html and read_plot.pdf
|
|
321
319
|
|
|
320
|
+
read_summary.xlsx describes the number of reads and presence proportion for each allele.
|
|
322
321
|
Both read_plot.html and read_plot.pdf illustrate the proportions of each allele.
|
|
323
|
-
The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for
|
|
322
|
+
The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for each allele.
|
|
324
323
|
|
|
325
|
-
|
|
324
|
+
The **Allele type** includes:
|
|
326
325
|
- **Intact**: Alleles that perfectly match the input FASTA allele.
|
|
327
326
|
- **Indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
|
|
328
327
|
- **SV**: Substitutions, deletions, insertions, or inversions beyond 50 bases.
|
|
@@ -333,14 +332,10 @@ Additionally, the types of **Allele type** include:
|
|
|
333
332
|
> In PCR amplicon sequencing, the % of reads might not match the actual allele proportions due to amplification bias.
|
|
334
333
|
> Especially when large deletions are present, the deletion alleles might be significantly amplified, potentially not reflecting the actual allele proportions.
|
|
335
334
|
|
|
336
|
-
### 5. read_summary.xlsx
|
|
337
|
-
|
|
338
|
-
- read_summary.xlsx: Describes the number of reads and presence proportion for each allele.
|
|
339
|
-
|
|
340
335
|
## 📣Feedback and Support
|
|
341
336
|
|
|
342
337
|
For questions, bug reports, or other forms of feedback, we'd love to hear from you!
|
|
343
|
-
Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues) for all reporting purposes.
|
|
338
|
+
Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues/new/choose) for all reporting purposes.
|
|
344
339
|
|
|
345
340
|
Please refer to [CONTRIBUTING](https://github.com/akikuno/DAJIN2/blob/main/docs/CONTRIBUTING.md) for how to contribute and how to verify your contributions.
|
|
346
341
|
|
|
@@ -21,14 +21,14 @@ The name DAJIN is derived from the phrase 一網**打尽** (Ichimou **DAJIN** or
|
|
|
21
21
|
+ **Comprehensive Mutation Detection**: Equipped with the capability to detect genome editing events over a wide range, it can identify a broad spectrum of mutations, from small changes to large structural variations.
|
|
22
22
|
+ DAJIN2 can also detect complex mutations characteristic of genome editing, such as "insertions occurring in regions where deletions have occurred."
|
|
23
23
|
+ **Intuitive Visualization**: The outcomes of genome editing are visualized intuitively, allowing for the rapid and easy identification and analysis of mutations.
|
|
24
|
-
+ **Multi-Sample Compatibility**:
|
|
24
|
+
+ **Multi-Sample Compatibility**: Enabling parallel processing of multiple samples. This facilitates efficient progression of large-scale experiments and comparative studies.
|
|
25
25
|
|
|
26
26
|
|
|
27
27
|
## 🛠 Installation
|
|
28
28
|
|
|
29
29
|
### Prerequisites
|
|
30
30
|
|
|
31
|
-
- Python 3.
|
|
31
|
+
- Python 3.8 or later
|
|
32
32
|
- Unix-like environment (Linux, macOS, WSL2, etc.)
|
|
33
33
|
|
|
34
34
|
### From [Bioconda](https://anaconda.org/bioconda/DAJIN2) (Recommended)
|
|
@@ -57,7 +57,7 @@ pip install DAJIN2
|
|
|
57
57
|
> If you encounter any issues during the installation, please refer to the [Troubleshooting Guide](https://github.com/akikuno/DAJIN2/blob/main/docs/TROUBLESHOOTING.md)
|
|
58
58
|
|
|
59
59
|
|
|
60
|
-
##
|
|
60
|
+
## 💻 Usage
|
|
61
61
|
|
|
62
62
|
### Required Files
|
|
63
63
|
|
|
@@ -91,11 +91,11 @@ Assuming barcode01 as the control and barcode02 as the sample, specify each dire
|
|
|
91
91
|
The FASTA file should contain descriptions of the alleles anticipated as a result of genome editing.
|
|
92
92
|
|
|
93
93
|
> [!IMPORTANT]
|
|
94
|
-
>
|
|
94
|
+
> **A header name >control and its sequence are mandatory.**
|
|
95
95
|
|
|
96
96
|
If there are anticipated alleles (e.g., knock-ins or knock-outs), include their sequences in the FASTA file too. These anticipated alleles can be named arbitrarily.
|
|
97
97
|
|
|
98
|
-
Below is
|
|
98
|
+
Below is an example of a FASTA file:
|
|
99
99
|
|
|
100
100
|
```text
|
|
101
101
|
>control
|
|
@@ -278,16 +278,17 @@ For example, Tyr point mutation is highlighted in **green**.
|
|
|
278
278
|
### 3. MUTATION_INFO
|
|
279
279
|
|
|
280
280
|
The MUTATION_INFO directory saves tables depicting mutation sites for each allele.
|
|
281
|
-
An example of a Tyr point mutation is described by its position on the chromosome and the type of mutation.
|
|
281
|
+
An example of a *Tyr* point mutation is described by its position on the chromosome and the type of mutation.
|
|
282
282
|
|
|
283
283
|
<img src="https://user-images.githubusercontent.com/15861316/274519342-a613490d-5dbb-4a27-a2cf-bca0686b30f0.png" width="75%">
|
|
284
284
|
|
|
285
|
-
### 4. read_plot.html and read_plot.pdf
|
|
285
|
+
### 4. read_summary.xlsx, read_plot.html and read_plot.pdf
|
|
286
286
|
|
|
287
|
+
read_summary.xlsx describes the number of reads and presence proportion for each allele.
|
|
287
288
|
Both read_plot.html and read_plot.pdf illustrate the proportions of each allele.
|
|
288
|
-
The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for
|
|
289
|
+
The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for each allele.
|
|
289
290
|
|
|
290
|
-
|
|
291
|
+
The **Allele type** includes:
|
|
291
292
|
- **Intact**: Alleles that perfectly match the input FASTA allele.
|
|
292
293
|
- **Indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
|
|
293
294
|
- **SV**: Substitutions, deletions, insertions, or inversions beyond 50 bases.
|
|
@@ -298,14 +299,10 @@ Additionally, the types of **Allele type** include:
|
|
|
298
299
|
> In PCR amplicon sequencing, the % of reads might not match the actual allele proportions due to amplification bias.
|
|
299
300
|
> Especially when large deletions are present, the deletion alleles might be significantly amplified, potentially not reflecting the actual allele proportions.
|
|
300
301
|
|
|
301
|
-
### 5. read_summary.xlsx
|
|
302
|
-
|
|
303
|
-
- read_summary.xlsx: Describes the number of reads and presence proportion for each allele.
|
|
304
|
-
|
|
305
302
|
## 📣Feedback and Support
|
|
306
303
|
|
|
307
304
|
For questions, bug reports, or other forms of feedback, we'd love to hear from you!
|
|
308
|
-
Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues) for all reporting purposes.
|
|
305
|
+
Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues/new/choose) for all reporting purposes.
|
|
309
306
|
|
|
310
307
|
Please refer to [CONTRIBUTING](https://github.com/akikuno/DAJIN2/blob/main/docs/CONTRIBUTING.md) for how to contribute and how to verify your contributions.
|
|
311
308
|
|
|
@@ -3,11 +3,8 @@ scipy >= 1.6.0
|
|
|
3
3
|
pandas >= 1.0.0
|
|
4
4
|
openpyxl >= 3.0.0
|
|
5
5
|
rapidfuzz >=3.0.0
|
|
6
|
-
statsmodels >= 0.13.5
|
|
7
6
|
scikit-learn >= 1.0.0
|
|
8
7
|
|
|
9
|
-
openpyxl >= 3.0.0
|
|
10
|
-
|
|
11
8
|
mappy >= 2.24
|
|
12
9
|
pysam >= 0.19.0
|
|
13
10
|
|
|
@@ -18,6 +15,6 @@ Jinja2 >= 3.1.0
|
|
|
18
15
|
plotly >= 5.0.0
|
|
19
16
|
kaleido >= 0.2.0
|
|
20
17
|
|
|
21
|
-
cstag >= 0.
|
|
18
|
+
cstag >= 1.0.0
|
|
22
19
|
midsv >= 0.10.1
|
|
23
20
|
wslPath >=0.3.0
|
|
@@ -9,7 +9,7 @@ with open("requirements.txt") as requirements_file:
|
|
|
9
9
|
|
|
10
10
|
setuptools.setup(
|
|
11
11
|
name="DAJIN2",
|
|
12
|
-
version="0.4.
|
|
12
|
+
version="0.4.2",
|
|
13
13
|
author="Akihiro Kuno",
|
|
14
14
|
author_email="akuno@md.tsukuba.ac.jp",
|
|
15
15
|
description="One-step genotyping tools for targeted long-read sequencing",
|
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
from __future__ import annotations
|
|
2
2
|
|
|
3
3
|
from pathlib import Path
|
|
4
|
-
from
|
|
4
|
+
from dataclasses import dataclass
|
|
5
5
|
from itertools import groupby
|
|
6
6
|
from collections import defaultdict
|
|
7
7
|
|
|
@@ -90,7 +90,8 @@ def call_percentage(cssplits: list[list[str]], mutation_loci: list[set[str]]) ->
|
|
|
90
90
|
###########################################################
|
|
91
91
|
|
|
92
92
|
|
|
93
|
-
|
|
93
|
+
@dataclass(frozen=True)
|
|
94
|
+
class ConsensusKey:
|
|
94
95
|
allele: str
|
|
95
96
|
label: int
|
|
96
97
|
percent: float
|
|
@@ -1,13 +1,7 @@
|
|
|
1
1
|
from __future__ import annotations
|
|
2
2
|
|
|
3
3
|
import re
|
|
4
|
-
from
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
class ConsensusKey(NamedTuple):
|
|
8
|
-
allele: str
|
|
9
|
-
label: int
|
|
10
|
-
percent: float
|
|
4
|
+
from DAJIN2.core.consensus.consensus import ConsensusKey
|
|
11
5
|
|
|
12
6
|
|
|
13
7
|
def _detect_sv(cons_percentages: dict[ConsensusKey, list], threshold: int = 50) -> list[bool]:
|
|
@@ -2,119 +2,16 @@ from __future__ import annotations
|
|
|
2
2
|
|
|
3
3
|
import shutil
|
|
4
4
|
import logging
|
|
5
|
-
import uuid
|
|
6
5
|
|
|
7
6
|
from pathlib import Path
|
|
8
|
-
from typing import NamedTuple
|
|
9
|
-
from collections import defaultdict
|
|
10
7
|
|
|
11
|
-
from DAJIN2.utils import io,
|
|
8
|
+
from DAJIN2.utils import io, fastx_handler
|
|
12
9
|
from DAJIN2.core import classification, clustering, consensus, preprocess, report
|
|
10
|
+
from DAJIN2.core.preprocess.input_formatter import FormattedInputs
|
|
13
11
|
|
|
14
12
|
logger = logging.getLogger(__name__)
|
|
15
13
|
|
|
16
14
|
|
|
17
|
-
def parse_arguments(arguments: dict) -> tuple:
|
|
18
|
-
genome_urls = defaultdict(str)
|
|
19
|
-
if arguments.get("genome"):
|
|
20
|
-
genome_urls.update(
|
|
21
|
-
{"genome": arguments["genome"], "blat": arguments["blat"], "goldenpath": arguments["goldenpath"]}
|
|
22
|
-
)
|
|
23
|
-
|
|
24
|
-
return (
|
|
25
|
-
arguments["sample"],
|
|
26
|
-
arguments["control"],
|
|
27
|
-
arguments["allele"],
|
|
28
|
-
arguments["name"],
|
|
29
|
-
arguments["threads"],
|
|
30
|
-
genome_urls,
|
|
31
|
-
uuid.uuid4().hex,
|
|
32
|
-
)
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
def convert_input_paths_to_posix(sample: str, control: str, allele: str) -> tuple:
|
|
36
|
-
sample = io.convert_to_posix(sample)
|
|
37
|
-
control = io.convert_to_posix(control)
|
|
38
|
-
allele = io.convert_to_posix(allele)
|
|
39
|
-
|
|
40
|
-
return sample, control, allele
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
def create_temporal_directory(name: str, control_name: str) -> Path:
|
|
44
|
-
tempdir = Path(config.TEMP_ROOT_DIR, name)
|
|
45
|
-
Path(tempdir, "cache", ".igvjs", control_name).mkdir(parents=True, exist_ok=True)
|
|
46
|
-
|
|
47
|
-
return tempdir
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
def check_caches(tempdir: Path, path_allele: str, genome_url: str) -> bool:
|
|
51
|
-
is_cache_hash = preprocess.cache_checker.exists_cached_hash(tempdir=tempdir, path=path_allele)
|
|
52
|
-
is_cache_genome = preprocess.cache_checker.exists_cached_genome(tempdir=tempdir, genome=genome_url)
|
|
53
|
-
|
|
54
|
-
return is_cache_hash and is_cache_genome
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
def get_genome_coordinates(genome_urls: dict, fasta_alleles: dict, is_cache_genome: bool, tempdir: Path) -> dict:
|
|
58
|
-
genome_coordinates = {
|
|
59
|
-
"genome": genome_urls["genome"],
|
|
60
|
-
"chrom_size": 0,
|
|
61
|
-
"chrom": "control",
|
|
62
|
-
"start": 0,
|
|
63
|
-
"end": len(fasta_alleles["control"]) - 1,
|
|
64
|
-
"strand": "+",
|
|
65
|
-
}
|
|
66
|
-
if genome_urls["genome"]:
|
|
67
|
-
if is_cache_genome:
|
|
68
|
-
genome_coordinates = next(io.read_jsonl(Path(tempdir, "cache", "genome_coordinates.jsonl")))
|
|
69
|
-
else:
|
|
70
|
-
genome_coordinates = preprocess.genome_fetcher.fetch_coordinates(
|
|
71
|
-
genome_coordinates, genome_urls, fasta_alleles["control"]
|
|
72
|
-
)
|
|
73
|
-
genome_coordinates["chrom_size"] = preprocess.genome_fetcher.fetch_chromosome_size(
|
|
74
|
-
genome_coordinates, genome_urls
|
|
75
|
-
)
|
|
76
|
-
io.write_jsonl([genome_coordinates], Path(tempdir, "cache", "genome_coordinates.jsonl"))
|
|
77
|
-
|
|
78
|
-
return genome_coordinates
|
|
79
|
-
|
|
80
|
-
|
|
81
|
-
class FormattedInputs(NamedTuple):
|
|
82
|
-
path_sample: str
|
|
83
|
-
path_control: str
|
|
84
|
-
path_allele: str
|
|
85
|
-
sample_name: str
|
|
86
|
-
control_name: str
|
|
87
|
-
fasta_alleles: dict[str, str]
|
|
88
|
-
tempdir: Path
|
|
89
|
-
genome_coordinates: dict[str, str]
|
|
90
|
-
threads: int
|
|
91
|
-
uuid: str
|
|
92
|
-
|
|
93
|
-
|
|
94
|
-
def format_inputs(arguments: dict) -> FormattedInputs:
|
|
95
|
-
path_sample, path_control, path_allele, name, threads, genome_urls, uuid = parse_arguments(arguments)
|
|
96
|
-
path_sample, path_control, path_allele = convert_input_paths_to_posix(path_sample, path_control, path_allele)
|
|
97
|
-
sample_name = preprocess.fastx_parser.extract_basename(path_sample)
|
|
98
|
-
control_name = preprocess.fastx_parser.extract_basename(path_control)
|
|
99
|
-
fasta_alleles = preprocess.fastx_parser.dictionize_allele(path_allele)
|
|
100
|
-
tempdir = create_temporal_directory(name, control_name)
|
|
101
|
-
is_cache_genome = check_caches(tempdir, path_allele, genome_urls["genome"])
|
|
102
|
-
genome_coordinates = get_genome_coordinates(genome_urls, fasta_alleles, is_cache_genome, tempdir)
|
|
103
|
-
|
|
104
|
-
return FormattedInputs(
|
|
105
|
-
path_sample,
|
|
106
|
-
path_control,
|
|
107
|
-
path_allele,
|
|
108
|
-
sample_name,
|
|
109
|
-
control_name,
|
|
110
|
-
fasta_alleles,
|
|
111
|
-
tempdir,
|
|
112
|
-
genome_coordinates,
|
|
113
|
-
threads,
|
|
114
|
-
uuid,
|
|
115
|
-
)
|
|
116
|
-
|
|
117
|
-
|
|
118
15
|
###########################################################
|
|
119
16
|
# main
|
|
120
17
|
###########################################################
|
|
@@ -126,9 +23,9 @@ def execute_control(arguments: dict):
|
|
|
126
23
|
###########################################################
|
|
127
24
|
# Preprocess
|
|
128
25
|
###########################################################
|
|
129
|
-
ARGS = format_inputs(arguments)
|
|
130
|
-
preprocess.
|
|
131
|
-
preprocess.
|
|
26
|
+
ARGS: FormattedInputs = preprocess.format_inputs(arguments)
|
|
27
|
+
preprocess.create_temporal_directories(ARGS.tempdir, ARGS.control_name, is_control=True)
|
|
28
|
+
preprocess.create_report_directories(ARGS.tempdir, ARGS.control_name, is_control=True)
|
|
132
29
|
io.cache_control_hash(ARGS.tempdir, ARGS.path_allele)
|
|
133
30
|
|
|
134
31
|
###########################################################
|
|
@@ -151,7 +48,7 @@ def execute_control(arguments: dict):
|
|
|
151
48
|
# ============================================================
|
|
152
49
|
# Export fasta files as single-FASTA format
|
|
153
50
|
# ============================================================
|
|
154
|
-
|
|
51
|
+
fastx_handler.export_fasta_files(ARGS.tempdir, ARGS.fasta_alleles, ARGS.control_name)
|
|
155
52
|
|
|
156
53
|
# ============================================================
|
|
157
54
|
# Mapping using mappy
|
|
@@ -189,9 +86,9 @@ def execute_sample(arguments: dict):
|
|
|
189
86
|
# Preprocess
|
|
190
87
|
###########################################################
|
|
191
88
|
|
|
192
|
-
ARGS = format_inputs(arguments)
|
|
193
|
-
preprocess.
|
|
194
|
-
preprocess.
|
|
89
|
+
ARGS: FormattedInputs = preprocess.format_inputs(arguments)
|
|
90
|
+
preprocess.create_temporal_directories(ARGS.tempdir, ARGS.sample_name, is_control=False)
|
|
91
|
+
preprocess.create_report_directories(ARGS.tempdir, ARGS.sample_name, is_control=False)
|
|
195
92
|
|
|
196
93
|
logger.info(f"Preprocess {arguments['sample']}...")
|
|
197
94
|
|
|
@@ -209,7 +106,7 @@ def execute_sample(arguments: dict):
|
|
|
209
106
|
shutil.copy(path_fasta, Path(ARGS.tempdir, ARGS.sample_name, "fasta"))
|
|
210
107
|
|
|
211
108
|
paths_fasta = Path(ARGS.tempdir, ARGS.sample_name, "fasta").glob("*.fasta")
|
|
212
|
-
preprocess.
|
|
109
|
+
preprocess.generate_sam(ARGS, paths_fasta, is_control=False, is_insertion=False)
|
|
213
110
|
|
|
214
111
|
# ============================================================
|
|
215
112
|
# MIDSV conversion
|
|
@@ -234,8 +131,8 @@ def execute_sample(arguments: dict):
|
|
|
234
131
|
|
|
235
132
|
if paths_insertion_fasta:
|
|
236
133
|
# mapping to insertion alleles
|
|
237
|
-
preprocess.
|
|
238
|
-
preprocess.
|
|
134
|
+
preprocess.generate_sam(ARGS, paths_insertion_fasta, is_control=True, is_insertion=True)
|
|
135
|
+
preprocess.generate_sam(ARGS, paths_insertion_fasta, is_control=False, is_insertion=True)
|
|
239
136
|
# add insertions to ARGS.fasta_alleles
|
|
240
137
|
for path_fasta in paths_insertion_fasta:
|
|
241
138
|
allele, seq = Path(path_fasta).read_text().strip().split("\n")
|
|
@@ -0,0 +1,9 @@
|
|
|
1
|
+
from DAJIN2.core.preprocess.cache_checker import exists_cached_hash, exists_cached_genome
|
|
2
|
+
from DAJIN2.core.preprocess.genome_fetcher import fetch_coordinates, fetch_chromosome_size
|
|
3
|
+
from DAJIN2.core.preprocess.mapping import generate_sam
|
|
4
|
+
from DAJIN2.core.preprocess.directory_manager import create_temporal_directories, create_report_directories
|
|
5
|
+
from DAJIN2.core.preprocess.input_formatter import format_inputs
|
|
6
|
+
from DAJIN2.core.preprocess.midsv_caller import generate_midsv
|
|
7
|
+
from DAJIN2.core.preprocess.knockin_handler import extract_knockin_loci
|
|
8
|
+
from DAJIN2.core.preprocess.mutation_extractor import cache_mutation_loci
|
|
9
|
+
from DAJIN2.core.preprocess.insertions_to_fasta import generate_insertion_fasta
|
|
@@ -0,0 +1,109 @@
|
|
|
1
|
+
from __future__ import annotations
|
|
2
|
+
|
|
3
|
+
import uuid
|
|
4
|
+
|
|
5
|
+
from pathlib import Path
|
|
6
|
+
from dataclasses import dataclass
|
|
7
|
+
from collections import defaultdict
|
|
8
|
+
|
|
9
|
+
from DAJIN2.utils import io, config, fastx_handler
|
|
10
|
+
|
|
11
|
+
from DAJIN2.core import preprocess
|
|
12
|
+
|
|
13
|
+
|
|
14
|
+
def parse_arguments(arguments: dict) -> tuple:
|
|
15
|
+
genome_urls = defaultdict(str)
|
|
16
|
+
if arguments.get("genome"):
|
|
17
|
+
genome_urls.update(
|
|
18
|
+
{"genome": arguments["genome"], "blat": arguments["blat"], "goldenpath": arguments["goldenpath"]}
|
|
19
|
+
)
|
|
20
|
+
|
|
21
|
+
return (
|
|
22
|
+
arguments["sample"],
|
|
23
|
+
arguments["control"],
|
|
24
|
+
arguments["allele"],
|
|
25
|
+
arguments["name"],
|
|
26
|
+
arguments["threads"],
|
|
27
|
+
genome_urls,
|
|
28
|
+
uuid.uuid4().hex,
|
|
29
|
+
)
|
|
30
|
+
|
|
31
|
+
|
|
32
|
+
def convert_input_paths_to_posix(sample: str, control: str, allele: str) -> tuple:
|
|
33
|
+
sample = io.convert_to_posix(sample)
|
|
34
|
+
control = io.convert_to_posix(control)
|
|
35
|
+
allele = io.convert_to_posix(allele)
|
|
36
|
+
|
|
37
|
+
return sample, control, allele
|
|
38
|
+
|
|
39
|
+
|
|
40
|
+
def create_temporal_directory(name: str, control_name: str) -> Path:
|
|
41
|
+
tempdir = Path(config.TEMP_ROOT_DIR, name)
|
|
42
|
+
Path(tempdir, "cache", ".igvjs", control_name).mkdir(parents=True, exist_ok=True)
|
|
43
|
+
|
|
44
|
+
return tempdir
|
|
45
|
+
|
|
46
|
+
|
|
47
|
+
def check_caches(tempdir: Path, path_allele: str, genome_url: str) -> bool:
|
|
48
|
+
is_cache_hash = preprocess.exists_cached_hash(tempdir=tempdir, path=path_allele)
|
|
49
|
+
is_cache_genome = preprocess.exists_cached_genome(tempdir=tempdir, genome=genome_url)
|
|
50
|
+
|
|
51
|
+
return is_cache_hash and is_cache_genome
|
|
52
|
+
|
|
53
|
+
|
|
54
|
+
def get_genome_coordinates(genome_urls: dict, fasta_alleles: dict, is_cache_genome: bool, tempdir: Path) -> dict:
|
|
55
|
+
genome_coordinates = {
|
|
56
|
+
"genome": genome_urls["genome"],
|
|
57
|
+
"chrom_size": 0,
|
|
58
|
+
"chrom": "control",
|
|
59
|
+
"start": 0,
|
|
60
|
+
"end": len(fasta_alleles["control"]) - 1,
|
|
61
|
+
"strand": "+",
|
|
62
|
+
}
|
|
63
|
+
if genome_urls["genome"]:
|
|
64
|
+
if is_cache_genome:
|
|
65
|
+
genome_coordinates = next(io.read_jsonl(Path(tempdir, "cache", "genome_coordinates.jsonl")))
|
|
66
|
+
else:
|
|
67
|
+
genome_coordinates = preprocess.fetch_coordinates(genome_coordinates, genome_urls, fasta_alleles["control"])
|
|
68
|
+
genome_coordinates["chrom_size"] = preprocess.fetch_chromosome_size(genome_coordinates, genome_urls)
|
|
69
|
+
io.write_jsonl([genome_coordinates], Path(tempdir, "cache", "genome_coordinates.jsonl"))
|
|
70
|
+
|
|
71
|
+
return genome_coordinates
|
|
72
|
+
|
|
73
|
+
|
|
74
|
+
@dataclass(frozen=True)
|
|
75
|
+
class FormattedInputs:
|
|
76
|
+
path_sample: str
|
|
77
|
+
path_control: str
|
|
78
|
+
path_allele: str
|
|
79
|
+
sample_name: str
|
|
80
|
+
control_name: str
|
|
81
|
+
fasta_alleles: dict[str, str]
|
|
82
|
+
tempdir: Path
|
|
83
|
+
genome_coordinates: dict[str, str]
|
|
84
|
+
threads: int
|
|
85
|
+
uuid: str
|
|
86
|
+
|
|
87
|
+
|
|
88
|
+
def format_inputs(arguments: dict) -> FormattedInputs:
|
|
89
|
+
path_sample, path_control, path_allele, name, threads, genome_urls, uuid = parse_arguments(arguments)
|
|
90
|
+
path_sample, path_control, path_allele = convert_input_paths_to_posix(path_sample, path_control, path_allele)
|
|
91
|
+
sample_name = fastx_handler.extract_filename(path_sample)
|
|
92
|
+
control_name = fastx_handler.extract_filename(path_control)
|
|
93
|
+
fasta_alleles = fastx_handler.dictionize_allele(path_allele)
|
|
94
|
+
tempdir = create_temporal_directory(name, control_name)
|
|
95
|
+
is_cache_genome = check_caches(tempdir, path_allele, genome_urls["genome"])
|
|
96
|
+
genome_coordinates = get_genome_coordinates(genome_urls, fasta_alleles, is_cache_genome, tempdir)
|
|
97
|
+
|
|
98
|
+
return FormattedInputs(
|
|
99
|
+
path_sample,
|
|
100
|
+
path_control,
|
|
101
|
+
path_allele,
|
|
102
|
+
sample_name,
|
|
103
|
+
control_name,
|
|
104
|
+
fasta_alleles,
|
|
105
|
+
tempdir,
|
|
106
|
+
genome_coordinates,
|
|
107
|
+
threads,
|
|
108
|
+
uuid,
|
|
109
|
+
)
|
|
@@ -215,8 +215,8 @@ def generate_midsv(ARGS, is_control: bool = False, is_insertion: bool = False) -
|
|
|
215
215
|
path_splice = Path(ARGS.tempdir, name, "sam", f"splice_{allele}.sam")
|
|
216
216
|
path_output_midsv = Path(ARGS.tempdir, name, "midsv", f"{allele}.json")
|
|
217
217
|
|
|
218
|
-
sam_ont = sam_handler.remove_overlapped_reads(list(
|
|
219
|
-
sam_splice = sam_handler.remove_overlapped_reads(list(
|
|
218
|
+
sam_ont = sam_handler.remove_overlapped_reads(list(sam_handler.read_sam(path_ont)))
|
|
219
|
+
sam_splice = sam_handler.remove_overlapped_reads(list(sam_handler.read_sam(path_splice)))
|
|
220
220
|
qname_of_map_ont = extract_qname_of_map_ont(sam_ont, sam_splice)
|
|
221
221
|
sam_of_map_ont = filter_sam_by_preset(sam_ont, qname_of_map_ont, preset="map-ont")
|
|
222
222
|
sam_of_splice = filter_sam_by_preset(sam_splice, qname_of_map_ont, preset="splice")
|
|
@@ -0,0 +1,94 @@
|
|
|
1
|
+
from __future__ import annotations
|
|
2
|
+
|
|
3
|
+
import re
|
|
4
|
+
import gzip
|
|
5
|
+
from pathlib import Path
|
|
6
|
+
|
|
7
|
+
import mappy
|
|
8
|
+
|
|
9
|
+
|
|
10
|
+
#################################################
|
|
11
|
+
# Helper function
|
|
12
|
+
#################################################
|
|
13
|
+
|
|
14
|
+
|
|
15
|
+
def sanitize_filename(path_file: Path | str) -> str:
|
|
16
|
+
"""
|
|
17
|
+
Sanitize the path_file by replacing invalid characters on Windows OS with '-'
|
|
18
|
+
"""
|
|
19
|
+
path_file = str(path_file).lstrip()
|
|
20
|
+
if not path_file:
|
|
21
|
+
raise ValueError("Provided FASTA/FASTQ is empty or consists only of whitespace")
|
|
22
|
+
return re.sub(r'[\\/:?.,\'"<>| ]', "-", path_file)
|
|
23
|
+
|
|
24
|
+
|
|
25
|
+
#################################################
|
|
26
|
+
# Extract filename
|
|
27
|
+
#################################################
|
|
28
|
+
|
|
29
|
+
|
|
30
|
+
def extract_filename(path_fasta: Path | str) -> str:
|
|
31
|
+
filename = Path(path_fasta).name
|
|
32
|
+
filename = re.sub(r"\..*$", "", filename) # Remove file extension
|
|
33
|
+
return sanitize_filename(filename)
|
|
34
|
+
|
|
35
|
+
|
|
36
|
+
#################################################
|
|
37
|
+
# Convert allele file to dictionary type fasta format
|
|
38
|
+
#################################################
|
|
39
|
+
|
|
40
|
+
|
|
41
|
+
def dictionize_allele(path_fasta: str | Path) -> dict[str, str]:
|
|
42
|
+
return {sanitize_filename(name): seq.upper() for name, seq, _ in mappy.fastx_read(str(path_fasta))}
|
|
43
|
+
|
|
44
|
+
|
|
45
|
+
#################################################
|
|
46
|
+
# Export fasta files as single-FASTA format
|
|
47
|
+
#################################################
|
|
48
|
+
|
|
49
|
+
|
|
50
|
+
def export_fasta_files(TEMPDIR: Path, FASTA_ALLELES: dict, NAME: str) -> None:
|
|
51
|
+
"""+ Save multiple FASTAs in separate single-FASTA format files."""
|
|
52
|
+
for identifier, sequence in FASTA_ALLELES.items():
|
|
53
|
+
contents = "\n".join([">" + identifier, sequence]) + "\n"
|
|
54
|
+
output_fasta = Path(TEMPDIR, NAME, "fasta", f"{identifier}.fasta")
|
|
55
|
+
output_fasta.write_text(contents)
|
|
56
|
+
|
|
57
|
+
|
|
58
|
+
#################################################
|
|
59
|
+
# save_concatenated_fastx
|
|
60
|
+
#################################################
|
|
61
|
+
|
|
62
|
+
|
|
63
|
+
def extract_extention(path_file: Path) -> str:
|
|
64
|
+
suffixes = path_file.suffixes
|
|
65
|
+
return "".join(suffixes)
|
|
66
|
+
|
|
67
|
+
|
|
68
|
+
def is_gzip_file(path_file: Path) -> bool:
|
|
69
|
+
"""Check if a file is a GZip compressed file."""
|
|
70
|
+
try:
|
|
71
|
+
with path_file.open("rb") as f:
|
|
72
|
+
return f.read(2) == b"\x1f\x8b"
|
|
73
|
+
except IOError:
|
|
74
|
+
return False
|
|
75
|
+
|
|
76
|
+
|
|
77
|
+
def save_fastq_as_gzip(TEMPDIR: Path, path_fastx: list[Path], barcode: str) -> None:
|
|
78
|
+
"""Merge gzip and non-gzip files into a single gzip file."""
|
|
79
|
+
with gzip.open(Path(TEMPDIR, barcode, "fastq", f"{barcode}.fastq.gz"), "wb") as merged_file:
|
|
80
|
+
for path_file in path_fastx:
|
|
81
|
+
if is_gzip_file(path_file):
|
|
82
|
+
with gzip.open(path_file, "rb") as f:
|
|
83
|
+
merged_file.write(f.read())
|
|
84
|
+
else:
|
|
85
|
+
with open(path_file, "r") as f:
|
|
86
|
+
merged_file.write(f.read().encode())
|
|
87
|
+
|
|
88
|
+
|
|
89
|
+
def save_concatenated_fastx(TEMPDIR: Path, directory: str) -> None:
|
|
90
|
+
fastx_suffix = {".fa", ".fq", ".fasta", ".fastq", ".fa.gz", ".fq.gz", ".fasta.gz", ".fastq.gz"}
|
|
91
|
+
path_directory = Path(directory)
|
|
92
|
+
barcode = path_directory.stem
|
|
93
|
+
path_fastx = [path for path in path_directory.iterdir() if extract_extention(path) in fastx_suffix]
|
|
94
|
+
save_fastq_as_gzip(TEMPDIR, path_fastx, barcode)
|
|
@@ -23,40 +23,51 @@ def update_threads(threads: int) -> int:
|
|
|
23
23
|
########################################################################
|
|
24
24
|
|
|
25
25
|
|
|
26
|
-
def validate_file_existence(
|
|
27
|
-
if not Path(
|
|
28
|
-
raise FileNotFoundError(f"{
|
|
26
|
+
def validate_file_existence(path_file: str):
|
|
27
|
+
if not Path(path_file).exists():
|
|
28
|
+
raise FileNotFoundError(f"{path_file} is not found")
|
|
29
29
|
|
|
30
30
|
|
|
31
|
-
def validate_fastq_extension(
|
|
32
|
-
if not re.search(r".fastq$|.fastq.gz$|.fq$|.fq.gz$",
|
|
33
|
-
raise ValueError(f"{
|
|
31
|
+
def validate_fastq_extension(path_fastq: str):
|
|
32
|
+
if not re.search(r".fastq$|.fastq.gz$|.fq$|.fq.gz$", path_fastq):
|
|
33
|
+
raise ValueError(f"{path_fastq} requires extensions either 'fastq', 'fastq.gz', 'fq' or 'fq.gz'")
|
|
34
34
|
|
|
35
35
|
|
|
36
|
-
# Varidate if the file is in the proper format
|
|
37
|
-
|
|
38
|
-
def validate_fastq_content(fastq_path: str):
|
|
36
|
+
# Varidate if the file is in the proper format viewing top 100 lines
|
|
37
|
+
def validate_fastq_content(path_fastq: str):
|
|
39
38
|
try:
|
|
40
|
-
|
|
41
|
-
|
|
39
|
+
headers, seqs, quals = zip(*[(n, s, q) for i, (n, s, q) in enumerate(mappy.fastx_read(path_fastq)) if i < 100])
|
|
40
|
+
# Remove empty elements
|
|
41
|
+
headers = [header for header in headers if header]
|
|
42
|
+
seqs = [seq for seq in seqs if seq]
|
|
43
|
+
quals = [qual for qual in quals if qual]
|
|
44
|
+
|
|
45
|
+
if not (len(headers) == len(seqs) == len(quals) > 0):
|
|
42
46
|
raise ValueError
|
|
47
|
+
|
|
43
48
|
except ValueError:
|
|
44
|
-
raise ValueError(f"{
|
|
49
|
+
raise ValueError(f"{path_fastq} is not a proper FASTQ format")
|
|
45
50
|
|
|
46
51
|
|
|
47
|
-
def validate_fasta_content(
|
|
52
|
+
def validate_fasta_content(path_fasta: str):
|
|
48
53
|
try:
|
|
49
|
-
|
|
50
|
-
|
|
54
|
+
headers, seqs = zip(*[(n, s) for n, s, _ in mappy.fastx_read(path_fasta)])
|
|
55
|
+
# Remove empty elements
|
|
56
|
+
headers = [header for header in headers if header]
|
|
57
|
+
seqs = [seq for seq in seqs if seq]
|
|
58
|
+
|
|
59
|
+
if len(headers) != len(seqs) or not headers:
|
|
51
60
|
raise ValueError
|
|
61
|
+
|
|
52
62
|
except ValueError:
|
|
53
|
-
raise ValueError(f"{
|
|
54
|
-
|
|
55
|
-
|
|
63
|
+
raise ValueError(f"{path_fasta} is not a proper FASTA format")
|
|
64
|
+
|
|
65
|
+
if len(headers) != len(set(headers)):
|
|
66
|
+
raise ValueError(f"{path_fasta} must include unique identifiers")
|
|
56
67
|
if len(seqs) != len(set(seqs)):
|
|
57
|
-
raise ValueError(f"{
|
|
58
|
-
if "control" not in
|
|
59
|
-
raise ValueError(f"One of the headers in the {
|
|
68
|
+
raise ValueError(f"{path_fasta} must include unique DNA sequences")
|
|
69
|
+
if "control" not in headers:
|
|
70
|
+
raise ValueError(f"One of the headers in the {path_fasta} must be '>control'")
|
|
60
71
|
|
|
61
72
|
|
|
62
73
|
def validate_files(SAMPLE: str, CONTROL: str, ALLELE: str) -> None:
|
|
@@ -1,6 +1,9 @@
|
|
|
1
1
|
from __future__ import annotations
|
|
2
2
|
|
|
3
3
|
import re
|
|
4
|
+
|
|
5
|
+
from pathlib import Path
|
|
6
|
+
from typing import Generator
|
|
4
7
|
from itertools import groupby
|
|
5
8
|
from DAJIN2.utils.dna_handler import revcomp
|
|
6
9
|
|
|
@@ -22,6 +25,17 @@ def is_mapped(s: list[str]) -> bool:
|
|
|
22
25
|
return not s[0].startswith("@") and s[9] != "*"
|
|
23
26
|
|
|
24
27
|
|
|
28
|
+
###########################################################
|
|
29
|
+
# Read sam
|
|
30
|
+
###########################################################
|
|
31
|
+
|
|
32
|
+
|
|
33
|
+
def read_sam(path_of_sam: str | Path) -> Generator[list]:
|
|
34
|
+
with open(path_of_sam) as f:
|
|
35
|
+
for line in f:
|
|
36
|
+
yield line.strip().split("\t")
|
|
37
|
+
|
|
38
|
+
|
|
25
39
|
###########################################################
|
|
26
40
|
# remove_overlapped_reads
|
|
27
41
|
###########################################################
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.1
|
|
2
2
|
Name: DAJIN2
|
|
3
|
-
Version: 0.4.
|
|
3
|
+
Version: 0.4.2
|
|
4
4
|
Summary: One-step genotyping tools for targeted long-read sequencing
|
|
5
5
|
Home-page: https://github.com/akikuno/DAJIN2
|
|
6
6
|
Author: Akihiro Kuno
|
|
@@ -19,9 +19,7 @@ Requires-Dist: scipy>=1.6.0
|
|
|
19
19
|
Requires-Dist: pandas>=1.0.0
|
|
20
20
|
Requires-Dist: openpyxl>=3.0.0
|
|
21
21
|
Requires-Dist: rapidfuzz>=3.0.0
|
|
22
|
-
Requires-Dist: statsmodels>=0.13.5
|
|
23
22
|
Requires-Dist: scikit-learn>=1.0.0
|
|
24
|
-
Requires-Dist: openpyxl>=3.0.0
|
|
25
23
|
Requires-Dist: mappy>=2.24
|
|
26
24
|
Requires-Dist: pysam>=0.19.0
|
|
27
25
|
Requires-Dist: Flask>=2.2.0
|
|
@@ -29,7 +27,7 @@ Requires-Dist: waitress>=2.1.0
|
|
|
29
27
|
Requires-Dist: Jinja2>=3.1.0
|
|
30
28
|
Requires-Dist: plotly>=5.0.0
|
|
31
29
|
Requires-Dist: kaleido>=0.2.0
|
|
32
|
-
Requires-Dist: cstag>=0.
|
|
30
|
+
Requires-Dist: cstag>=1.0.0
|
|
33
31
|
Requires-Dist: midsv>=0.10.1
|
|
34
32
|
Requires-Dist: wslPath>=0.3.0
|
|
35
33
|
|
|
@@ -56,14 +54,14 @@ The name DAJIN is derived from the phrase 一網**打尽** (Ichimou **DAJIN** or
|
|
|
56
54
|
+ **Comprehensive Mutation Detection**: Equipped with the capability to detect genome editing events over a wide range, it can identify a broad spectrum of mutations, from small changes to large structural variations.
|
|
57
55
|
+ DAJIN2 is also possible to detect complex mutations characteristic of genome editing, such as "insertions occurring in regions where deletions have occurred."
|
|
58
56
|
+ **Intuitive Visualization**: The outcomes of genome editing are visualized intuitively, allowing for the rapid and easy identification and analysis of mutations.
|
|
59
|
-
+ **Multi-Sample Compatibility**:
|
|
57
|
+
+ **Multi-Sample Compatibility**: Enabling parallel processing of multiple samples. This facilitates efficient progression of large-scale experiments and comparative studies.
|
|
60
58
|
|
|
61
59
|
|
|
62
60
|
## 🛠 Installation
|
|
63
61
|
|
|
64
62
|
### Prerequisites
|
|
65
63
|
|
|
66
|
-
- Python 3.
|
|
64
|
+
- Python 3.8 or later
|
|
67
65
|
- Unix-like environment (Linux, macOS, WSL2, etc.)
|
|
68
66
|
|
|
69
67
|
### From [Bioconda](https://anaconda.org/bioconda/DAJIN2) (Recommended)
|
|
@@ -92,7 +90,7 @@ pip install DAJIN2
|
|
|
92
90
|
> If you encounter any issues during the installation, please refer to the [Troubleshooting Guide](https://github.com/akikuno/DAJIN2/blob/main/docs/TROUBLESHOOTING.md)
|
|
93
91
|
|
|
94
92
|
|
|
95
|
-
##
|
|
93
|
+
## 💻 Usage
|
|
96
94
|
|
|
97
95
|
### Required Files
|
|
98
96
|
|
|
@@ -126,11 +124,11 @@ Assuming barcode01 as the control and barcode02 as the sample, specify each dire
|
|
|
126
124
|
The FASTA file should contain descriptions of the alleles anticipated as a result of genome editing.
|
|
127
125
|
|
|
128
126
|
> [!IMPORTANT]
|
|
129
|
-
>
|
|
127
|
+
> **A header name >control and its sequence are mandatory.**
|
|
130
128
|
|
|
131
129
|
If there are anticipated alleles (e.g., knock-ins or knock-outs), include their sequences in the FASTA file too. These anticipated alleles can be named arbitrarily.
|
|
132
130
|
|
|
133
|
-
Below is
|
|
131
|
+
Below is an example of a FASTA file:
|
|
134
132
|
|
|
135
133
|
```text
|
|
136
134
|
>control
|
|
@@ -313,16 +311,17 @@ For example, Tyr point mutation is highlighted in **green**.
|
|
|
313
311
|
### 3. MUTATION_INFO
|
|
314
312
|
|
|
315
313
|
The MUTATION_INFO directory saves tables depicting mutation sites for each allele.
|
|
316
|
-
An example of a Tyr point mutation is described by its position on the chromosome and the type of mutation.
|
|
314
|
+
An example of a *Tyr* point mutation is described by its position on the chromosome and the type of mutation.
|
|
317
315
|
|
|
318
316
|
<img src="https://user-images.githubusercontent.com/15861316/274519342-a613490d-5dbb-4a27-a2cf-bca0686b30f0.png" width="75%">
|
|
319
317
|
|
|
320
|
-
### 4. read_plot.html and read_plot.pdf
|
|
318
|
+
### 4. resd_summary.xlsx, read_plot.html and read_plot.pdf
|
|
321
319
|
|
|
320
|
+
read_summary.xlsx describes the number of reads and presence proportion for each allele.
|
|
322
321
|
Both read_plot.html and read_plot.pdf illustrate the proportions of each allele.
|
|
323
|
-
The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for
|
|
322
|
+
The chart's **Allele type** indicates the type of allele, and **Percent of reads** shows the proportion of reads for each allele.
|
|
324
323
|
|
|
325
|
-
|
|
324
|
+
The **Allele type** includes:
|
|
326
325
|
- **Intact**: Alleles that perfectly match the input FASTA allele.
|
|
327
326
|
- **Indels**: Substitutions, deletions, insertions, or inversions within 50 bases.
|
|
328
327
|
- **SV**: Substitutions, deletions, insertions, or inversions beyond 50 bases.
|
|
@@ -333,14 +332,10 @@ Additionally, the types of **Allele type** include:
|
|
|
333
332
|
> In PCR amplicon sequencing, the % of reads might not match the actual allele proportions due to amplification bias.
|
|
334
333
|
> Especially when large deletions are present, the deletion alleles might be significantly amplified, potentially not reflecting the actual allele proportions.
|
|
335
334
|
|
|
336
|
-
### 5. read_summary.xlsx
|
|
337
|
-
|
|
338
|
-
- read_summary.xlsx: Describes the number of reads and presence proportion for each allele.
|
|
339
|
-
|
|
340
335
|
## 📣Feedback and Support
|
|
341
336
|
|
|
342
337
|
For questions, bug reports, or other forms of feedback, we'd love to hear from you!
|
|
343
|
-
Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues) for all reporting purposes.
|
|
338
|
+
Please use [GitHub Issues](https://github.com/akikuno/DAJIN2/issues/new/choose) for all reporting purposes.
|
|
344
339
|
|
|
345
340
|
Please refer to [CONTRIBUTING](https://github.com/akikuno/DAJIN2/blob/main/docs/CONTRIBUTING.md) for how to contribute and how to verify your contributions.
|
|
346
341
|
|
|
@@ -36,10 +36,10 @@ src/DAJIN2/core/consensus/name_handler.py
|
|
|
36
36
|
src/DAJIN2/core/consensus/similarity_searcher.py
|
|
37
37
|
src/DAJIN2/core/preprocess/__init__.py
|
|
38
38
|
src/DAJIN2/core/preprocess/cache_checker.py
|
|
39
|
-
src/DAJIN2/core/preprocess/
|
|
40
|
-
src/DAJIN2/core/preprocess/fastx_parser.py
|
|
39
|
+
src/DAJIN2/core/preprocess/directory_manager.py
|
|
41
40
|
src/DAJIN2/core/preprocess/genome_fetcher.py
|
|
42
41
|
src/DAJIN2/core/preprocess/homopolymer_handler.py
|
|
42
|
+
src/DAJIN2/core/preprocess/input_formatter.py
|
|
43
43
|
src/DAJIN2/core/preprocess/insertions_to_fasta.py
|
|
44
44
|
src/DAJIN2/core/preprocess/knockin_handler.py
|
|
45
45
|
src/DAJIN2/core/preprocess/mapping.py
|
|
@@ -3,9 +3,7 @@ scipy>=1.6.0
|
|
|
3
3
|
pandas>=1.0.0
|
|
4
4
|
openpyxl>=3.0.0
|
|
5
5
|
rapidfuzz>=3.0.0
|
|
6
|
-
statsmodels>=0.13.5
|
|
7
6
|
scikit-learn>=1.0.0
|
|
8
|
-
openpyxl>=3.0.0
|
|
9
7
|
mappy>=2.24
|
|
10
8
|
pysam>=0.19.0
|
|
11
9
|
Flask>=2.2.0
|
|
@@ -13,6 +11,6 @@ waitress>=2.1.0
|
|
|
13
11
|
Jinja2>=3.1.0
|
|
14
12
|
plotly>=5.0.0
|
|
15
13
|
kaleido>=0.2.0
|
|
16
|
-
cstag>=0.
|
|
14
|
+
cstag>=1.0.0
|
|
17
15
|
midsv>=0.10.1
|
|
18
16
|
wslPath>=0.3.0
|
|
@@ -1,12 +0,0 @@
|
|
|
1
|
-
from DAJIN2.core.preprocess import (
|
|
2
|
-
fastx_parser,
|
|
3
|
-
genome_fetcher,
|
|
4
|
-
cache_checker,
|
|
5
|
-
directories,
|
|
6
|
-
)
|
|
7
|
-
|
|
8
|
-
from DAJIN2.core.preprocess.mapping import generate_sam
|
|
9
|
-
from DAJIN2.core.preprocess.midsv_caller import generate_midsv
|
|
10
|
-
from DAJIN2.core.preprocess.knockin_handler import extract_knockin_loci
|
|
11
|
-
from DAJIN2.core.preprocess.mutation_extractor import cache_mutation_loci
|
|
12
|
-
from DAJIN2.core.preprocess.insertions_to_fasta import generate_insertion_fasta
|
|
@@ -1,59 +0,0 @@
|
|
|
1
|
-
from __future__ import annotations
|
|
2
|
-
|
|
3
|
-
import re
|
|
4
|
-
from pathlib import Path
|
|
5
|
-
|
|
6
|
-
import mappy
|
|
7
|
-
|
|
8
|
-
########################################################################
|
|
9
|
-
# Helper function
|
|
10
|
-
########################################################################
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
def _sanitize_name(name: str) -> str:
|
|
14
|
-
"""
|
|
15
|
-
Sanitize the name by replacing invalid characters with '-'
|
|
16
|
-
"""
|
|
17
|
-
name = name.lstrip()
|
|
18
|
-
if not name:
|
|
19
|
-
raise ValueError("Provided FASTA/FASTQ is empty or consists only of whitespace")
|
|
20
|
-
return re.sub(r'[\\/:?.,\'"<>| ]', "-", name)
|
|
21
|
-
|
|
22
|
-
|
|
23
|
-
########################################################################
|
|
24
|
-
# Extract basename
|
|
25
|
-
########################################################################
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
def extract_basename(fastq_path: str) -> str:
|
|
29
|
-
name = Path(fastq_path).name
|
|
30
|
-
name = re.sub(r"\..*$", "", name) # Remove file extension
|
|
31
|
-
return _sanitize_name(name)
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
########################################################################
|
|
35
|
-
# Convert allele file to dictionary type fasta format
|
|
36
|
-
########################################################################
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
def dictionize_allele(path_fasta: str | Path) -> dict[str, str]:
|
|
40
|
-
return {_sanitize_name(name): seq.upper() for name, seq, _ in mappy.fastx_read(str(path_fasta))}
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
########################################################################
|
|
44
|
-
# Export fasta files as single-FASTA format
|
|
45
|
-
########################################################################
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
def export_fasta_files(TEMPDIR: Path, FASTA_ALLELES: dict, NAME: str) -> None:
|
|
49
|
-
"""
|
|
50
|
-
This function exports FASTA files in single-FASTA format.
|
|
51
|
-
|
|
52
|
-
:param TEMPDIR: Temporary directory Path object where the output files will be saved.
|
|
53
|
-
:param FASTA_ALLELES: Dictionary containing identifier and sequence pairs.
|
|
54
|
-
:param NAME: Name to be included in the output path.
|
|
55
|
-
"""
|
|
56
|
-
for identifier, sequence in FASTA_ALLELES.items():
|
|
57
|
-
contents = "\n".join([">" + identifier, sequence]) + "\n"
|
|
58
|
-
output_fasta = Path(TEMPDIR, NAME, "fasta", f"{identifier}.fasta")
|
|
59
|
-
output_fasta.write_text(contents)
|
|
@@ -1,42 +0,0 @@
|
|
|
1
|
-
from __future__ import annotations
|
|
2
|
-
|
|
3
|
-
import gzip
|
|
4
|
-
from pathlib import Path
|
|
5
|
-
|
|
6
|
-
#################################################
|
|
7
|
-
# save_concatenated_fastx
|
|
8
|
-
#################################################
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
def extract_extention(file_path: Path) -> str:
|
|
12
|
-
suffixes = file_path.suffixes
|
|
13
|
-
return "".join(suffixes[-2:]) if len(suffixes) >= 2 else suffixes[0]
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
def is_gzip_file(file_name: Path) -> bool:
|
|
17
|
-
"""Check if a file is a GZip compressed file."""
|
|
18
|
-
try:
|
|
19
|
-
with file_name.open("rb") as f:
|
|
20
|
-
return f.read(2) == b"\x1f\x8b"
|
|
21
|
-
except IOError:
|
|
22
|
-
return False
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
def save_fastq_as_gzip(TEMPDIR: Path, path_fastx: list[Path], barcode: str) -> None:
|
|
26
|
-
"""Merge gzip and non-gzip files into a single gzip file."""
|
|
27
|
-
with gzip.open(Path(TEMPDIR, barcode, "fastq", f"{barcode}.fastq.gz"), "wb") as merged_file:
|
|
28
|
-
for file_name in path_fastx:
|
|
29
|
-
if is_gzip_file(file_name):
|
|
30
|
-
with gzip.open(file_name, "rb") as f:
|
|
31
|
-
merged_file.write(f.read())
|
|
32
|
-
else:
|
|
33
|
-
with open(file_name, "r") as f:
|
|
34
|
-
merged_file.write(f.read().encode())
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
def save_concatenated_fastx(TEMPDIR: Path, directory: str) -> None:
|
|
38
|
-
fastx_suffix = {".fa", ".fq", ".fasta", ".fastq", ".fa.gz", ".fq.gz", ".fasta.gz", ".fastq.gz"}
|
|
39
|
-
path_directory = Path(directory)
|
|
40
|
-
barcode = path_directory.stem
|
|
41
|
-
path_fastx = [path for path in path_directory.iterdir() if extract_extention(path) in fastx_suffix]
|
|
42
|
-
save_fastq_as_gzip(TEMPDIR, path_fastx, barcode)
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|