DAJIN2 0.4.2__zip → 0.4.4__zip
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {DAJIN2-0.4.2/src/DAJIN2.egg-info → dajin2-0.4.4}/PKG-INFO +38 -63
- {DAJIN2-0.4.2 → dajin2-0.4.4}/README.md +28 -53
- dajin2-0.4.4/requirements.txt +20 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/setup.py +1 -1
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/clustering/clustering.py +11 -10
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/clustering/label_merger.py +20 -16
- dajin2-0.4.4/src/DAJIN2/core/clustering/strand_bias_handler.py +115 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/core.py +8 -8
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/genome_fetcher.py +11 -3
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/midsv_caller.py +3 -4
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/mutation_extractor.py +7 -7
- dajin2-0.4.4/src/DAJIN2/core/report/__init__.py +3 -0
- DAJIN2-0.4.2/src/DAJIN2/core/report/report_bam.py → dajin2-0.4.4/src/DAJIN2/core/report/bam_exporter.py +64 -50
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/main.py +20 -20
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/utils/io.py +14 -6
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/utils/sam_handler.py +0 -13
- {DAJIN2-0.4.2 → dajin2-0.4.4/src/DAJIN2.egg-info}/PKG-INFO +38 -63
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2.egg-info/SOURCES.txt +3 -3
- dajin2-0.4.4/src/DAJIN2.egg-info/requires.txt +16 -0
- DAJIN2-0.4.2/requirements.txt +0 -20
- DAJIN2-0.4.2/src/DAJIN2/core/clustering/strand_bias_handler.py +0 -113
- DAJIN2-0.4.2/src/DAJIN2/core/report/__init__.py +0 -3
- DAJIN2-0.4.2/src/DAJIN2.egg-info/requires.txt +0 -16
- {DAJIN2-0.4.2 → dajin2-0.4.4}/LICENSE +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/MANIFEST.in +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/setup.cfg +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/__init__.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/__init__.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/classification/__init__.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/classification/allele_merger.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/classification/classifier.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/clustering/__init__.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/clustering/appender.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/clustering/kmer_generator.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/clustering/label_extractor.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/clustering/label_updator.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/clustering/score_handler.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/consensus/__init__.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/consensus/clust_formatter.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/consensus/consensus.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/consensus/mutation_extractor.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/consensus/name_handler.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/consensus/similarity_searcher.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/__init__.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/cache_checker.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/directory_manager.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/homopolymer_handler.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/input_formatter.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/insertions_to_fasta.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/knockin_handler.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/mapping.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/core/report/insertion_reflector.py +0 -0
- /DAJIN2-0.4.2/src/DAJIN2/core/report/report_mutation.py → /dajin2-0.4.4/src/DAJIN2/core/report/mutation_exporter.py +0 -0
- /DAJIN2-0.4.2/src/DAJIN2/core/report/report_files.py → /dajin2-0.4.4/src/DAJIN2/core/report/sequence_exporter.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/gui.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/static/css/style.css +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/template_igvjs.html +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/templates/index.html +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/utils/config.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/utils/cssplits_handler.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/utils/dna_handler.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/utils/fastx_handler.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/utils/input_validator.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/utils/multiprocess.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/utils/report_generator.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2/view.py +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2.egg-info/dependency_links.txt +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2.egg-info/entry_points.txt +0 -0
- {DAJIN2-0.4.2 → dajin2-0.4.4}/src/DAJIN2.egg-info/top_level.txt +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.1
|
|
2
2
|
Name: DAJIN2
|
|
3
|
-
Version: 0.4.
|
|
3
|
+
Version: 0.4.4
|
|
4
4
|
Summary: One-step genotyping tools for targeted long-read sequencing
|
|
5
5
|
Home-page: https://github.com/akikuno/DAJIN2
|
|
6
6
|
Author: Akihiro Kuno
|
|
@@ -14,22 +14,22 @@ Classifier: Intended Audience :: Science/Research
|
|
|
14
14
|
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
|
|
15
15
|
Description-Content-Type: text/markdown
|
|
16
16
|
License-File: LICENSE
|
|
17
|
-
Requires-Dist: numpy>=1.
|
|
18
|
-
Requires-Dist: scipy>=1.
|
|
17
|
+
Requires-Dist: numpy>=1.24.0
|
|
18
|
+
Requires-Dist: scipy>=1.10.0
|
|
19
19
|
Requires-Dist: pandas>=1.0.0
|
|
20
|
-
Requires-Dist: openpyxl>=3.
|
|
21
|
-
Requires-Dist: rapidfuzz>=3.
|
|
22
|
-
Requires-Dist: scikit-learn>=1.
|
|
20
|
+
Requires-Dist: openpyxl>=3.1.0
|
|
21
|
+
Requires-Dist: rapidfuzz>=3.6.0
|
|
22
|
+
Requires-Dist: scikit-learn>=1.3.0
|
|
23
23
|
Requires-Dist: mappy>=2.24
|
|
24
|
-
Requires-Dist: pysam>=0.
|
|
24
|
+
Requires-Dist: pysam>=0.21.0
|
|
25
25
|
Requires-Dist: Flask>=2.2.0
|
|
26
26
|
Requires-Dist: waitress>=2.1.0
|
|
27
27
|
Requires-Dist: Jinja2>=3.1.0
|
|
28
|
-
Requires-Dist: plotly>=5.
|
|
28
|
+
Requires-Dist: plotly>=5.19.0
|
|
29
29
|
Requires-Dist: kaleido>=0.2.0
|
|
30
30
|
Requires-Dist: cstag>=1.0.0
|
|
31
|
-
Requires-Dist: midsv>=0.
|
|
32
|
-
Requires-Dist: wslPath>=0.
|
|
31
|
+
Requires-Dist: midsv>=0.11.0
|
|
32
|
+
Requires-Dist: wslPath>=0.4.1
|
|
33
33
|
|
|
34
34
|
[](https://choosealicense.com/licenses/mit/)
|
|
35
35
|
[](https://github.com/akikuno/dajin2/actions)
|
|
@@ -78,6 +78,7 @@ conda activate env-dajin2
|
|
|
78
78
|
> CONDA_SUBDIR=osx-64 conda create -n env-dajin2 -c conda-forge -c bioconda python=3.10 DAJIN2 -y
|
|
79
79
|
> conda activate env-dajin2
|
|
80
80
|
> conda config --env --set subdir osx-64
|
|
81
|
+
> python -c "import platform; print(platform.machine())" # Confirm that the output is 'x86_64', not 'arm64'
|
|
81
82
|
> ```
|
|
82
83
|
|
|
83
84
|
### From [PyPI](https://pypi.org/project/DAJIN2/)
|
|
@@ -164,12 +165,17 @@ Options:
|
|
|
164
165
|
#### Example
|
|
165
166
|
|
|
166
167
|
```bash
|
|
168
|
+
# Download example dataset
|
|
169
|
+
curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
|
|
170
|
+
tar -xf example_single.tar.gz
|
|
171
|
+
|
|
172
|
+
# Run DAJIN2
|
|
167
173
|
DAJIN2 \
|
|
168
|
-
--control
|
|
169
|
-
--sample
|
|
170
|
-
--allele
|
|
171
|
-
--name
|
|
172
|
-
--genome
|
|
174
|
+
--control example_single/control \
|
|
175
|
+
--sample example_single/sample \
|
|
176
|
+
--allele example_single/stx2_deletion.fa \
|
|
177
|
+
--name stx2_deletion \
|
|
178
|
+
--genome mm39 \
|
|
173
179
|
--threads 4
|
|
174
180
|
```
|
|
175
181
|
|
|
@@ -206,7 +212,6 @@ DAJIN2 \
|
|
|
206
212
|
|
|
207
213
|
By using the `batch` subcommand, you can process multiple FASTQ files simultaneously.
|
|
208
214
|
For this purpose, a CSV or Excel file consolidating the sample information is required.
|
|
209
|
-
<!-- For a specific example, please refer to [this link](https://github.com/akikuno/DAJIN2/blob/main/examples/example-batch/batch.csv). -->
|
|
210
215
|
|
|
211
216
|
> [!NOTE]
|
|
212
217
|
> For guidance on how to compile sample information, please refer to [this document](https://docs.google.com/presentation/d/e/2PACX-1vSMEmXJPG2TNjfT66XZJRzqJd82aAqO5gJrdEzyhn15YBBr_Li-j5puOgVChYf3jA/embed?start=false&loop=false&delayms=3000).
|
|
@@ -224,44 +229,14 @@ options:
|
|
|
224
229
|
#### Example
|
|
225
230
|
|
|
226
231
|
```bash
|
|
227
|
-
DAJIN2 --file batch.csv --threads 4
|
|
228
|
-
```
|
|
229
|
-
|
|
230
|
-
<!-- ```bash
|
|
231
232
|
# Donwload the example dataset
|
|
232
|
-
|
|
233
|
-
tar -xf
|
|
233
|
+
curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
|
|
234
|
+
tar -xf example_batch.tar.gz
|
|
234
235
|
|
|
235
236
|
# Run DAJIN2
|
|
236
|
-
DAJIN2 batch --file
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
# 2023-07-31 17:01:16: Preprocess example-batch/tyr_control.fq.gz...
|
|
240
|
-
# 2023-07-31 17:01:48: Output BAM files of example-batch/tyr_control.fq.gz...
|
|
241
|
-
# 2023-07-31 17:01:52: 🍵 example-batch/tyr_control.fq.gz is finished!
|
|
242
|
-
# 2023-07-31 17:01:52: example-batch/tyr_c230gt_50%.fq.gz is now processing...
|
|
243
|
-
# 2023-07-31 17:01:52: example-batch/tyr_c230gt_10%.fq.gz is now processing...
|
|
244
|
-
# 2023-07-31 17:01:52: example-batch/tyr_c230gt_01%.fq.gz is now processing...
|
|
245
|
-
# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_01%.fq.gz...
|
|
246
|
-
# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_50%.fq.gz...
|
|
247
|
-
# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_10%.fq.gz...
|
|
248
|
-
# 2023-07-31 17:02:17: Classify example-batch/tyr_c230gt_50%.fq.gz...
|
|
249
|
-
# 2023-07-31 17:02:19: Clustering example-batch/tyr_c230gt_50%.fq.gz...
|
|
250
|
-
# 2023-07-31 17:02:34: Classify example-batch/tyr_c230gt_01%.fq.gz...
|
|
251
|
-
# 2023-07-31 17:02:35: Classify example-batch/tyr_c230gt_10%.fq.gz...
|
|
252
|
-
# 2023-07-31 17:02:39: Clustering example-batch/tyr_c230gt_01%.fq.gz...
|
|
253
|
-
# 2023-07-31 17:02:39: Clustering example-batch/tyr_c230gt_10%.fq.gz...
|
|
254
|
-
# 2023-07-31 17:02:53: Consensus calling of example-batch/tyr_c230gt_50%.fq.gz...
|
|
255
|
-
# 2023-07-31 17:02:59: Output reports of example-batch/tyr_c230gt_50%.fq.gz...
|
|
256
|
-
# 2023-07-31 17:03:04: 🍵 example-batch/tyr_c230gt_50%.fq.gz is finished!
|
|
257
|
-
# 2023-07-31 17:03:39: Consensus calling of example-batch/tyr_c230gt_01%.fq.gz...
|
|
258
|
-
# 2023-07-31 17:03:51: Output reports of example-batch/tyr_c230gt_01%.fq.gz...
|
|
259
|
-
# 2023-07-31 17:04:03: 🍵 example-batch/tyr_c230gt_01%.fq.gz is finished!
|
|
260
|
-
# 2023-07-31 17:04:08: Consensus calling of example-batch/tyr_c230gt_10%.fq.gz...
|
|
261
|
-
# 2023-07-31 17:04:16: Output reports of example-batch/tyr_c230gt_10%.fq.gz...
|
|
262
|
-
# 2023-07-31 17:04:24: 🍵 example-batch/tyr_c230gt_10%.fq.gz is finished!
|
|
263
|
-
# 🎉 Finished! Open DAJIN_Results/tyr-substitution to see the report.
|
|
264
|
-
``` -->
|
|
237
|
+
DAJIN2 batch --file example_batch/batch.csv --threads 4
|
|
238
|
+
```
|
|
239
|
+
|
|
265
240
|
|
|
266
241
|
## 📈 Report Contents
|
|
267
242
|
|
|
@@ -271,22 +246,22 @@ Inside the **DAJIN_Results** directory, the following files can be found:
|
|
|
271
246
|
```
|
|
272
247
|
DAJIN_Results/tyr-substitution
|
|
273
248
|
├── BAM
|
|
274
|
-
│ ├── tyr_c230gt_01
|
|
275
|
-
│ ├── tyr_c230gt_10
|
|
276
|
-
│ ├── tyr_c230gt_50
|
|
249
|
+
│ ├── tyr_c230gt_01
|
|
250
|
+
│ ├── tyr_c230gt_10
|
|
251
|
+
│ ├── tyr_c230gt_50
|
|
277
252
|
│ └── tyr_control
|
|
278
253
|
├── FASTA
|
|
279
|
-
│ ├── tyr_c230gt_01
|
|
280
|
-
│ ├── tyr_c230gt_10
|
|
281
|
-
│ └── tyr_c230gt_50
|
|
254
|
+
│ ├── tyr_c230gt_01
|
|
255
|
+
│ ├── tyr_c230gt_10
|
|
256
|
+
│ └── tyr_c230gt_50
|
|
282
257
|
├── HTML
|
|
283
|
-
│ ├── tyr_c230gt_01
|
|
284
|
-
│ ├── tyr_c230gt_10
|
|
285
|
-
│ └── tyr_c230gt_50
|
|
258
|
+
│ ├── tyr_c230gt_01
|
|
259
|
+
│ ├── tyr_c230gt_10
|
|
260
|
+
│ └── tyr_c230gt_50
|
|
286
261
|
├── MUTATION_INFO
|
|
287
|
-
│ ├── tyr_c230gt_01
|
|
288
|
-
│ ├── tyr_c230gt_10
|
|
289
|
-
│ └── tyr_c230gt_50
|
|
262
|
+
│ ├── tyr_c230gt_01.csv
|
|
263
|
+
│ ├── tyr_c230gt_10.csv
|
|
264
|
+
│ └── tyr_c230gt_50.csv
|
|
290
265
|
├── read_plot.html
|
|
291
266
|
├── read_plot.pdf
|
|
292
267
|
└── read_summary.xlsx
|
|
@@ -45,6 +45,7 @@ conda activate env-dajin2
|
|
|
45
45
|
> CONDA_SUBDIR=osx-64 conda create -n env-dajin2 -c conda-forge -c bioconda python=3.10 DAJIN2 -y
|
|
46
46
|
> conda activate env-dajin2
|
|
47
47
|
> conda config --env --set subdir osx-64
|
|
48
|
+
> python -c "import platform; print(platform.machine())" # Confirm that the output is 'x86_64', not 'arm64'
|
|
48
49
|
> ```
|
|
49
50
|
|
|
50
51
|
### From [PyPI](https://pypi.org/project/DAJIN2/)
|
|
@@ -131,12 +132,17 @@ Options:
|
|
|
131
132
|
#### Example
|
|
132
133
|
|
|
133
134
|
```bash
|
|
135
|
+
# Download example dataset
|
|
136
|
+
curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
|
|
137
|
+
tar -xf example_single.tar.gz
|
|
138
|
+
|
|
139
|
+
# Run DAJIN2
|
|
134
140
|
DAJIN2 \
|
|
135
|
-
--control
|
|
136
|
-
--sample
|
|
137
|
-
--allele
|
|
138
|
-
--name
|
|
139
|
-
--genome
|
|
141
|
+
--control example_single/control \
|
|
142
|
+
--sample example_single/sample \
|
|
143
|
+
--allele example_single/stx2_deletion.fa \
|
|
144
|
+
--name stx2_deletion \
|
|
145
|
+
--genome mm39 \
|
|
140
146
|
--threads 4
|
|
141
147
|
```
|
|
142
148
|
|
|
@@ -173,7 +179,6 @@ DAJIN2 \
|
|
|
173
179
|
|
|
174
180
|
By using the `batch` subcommand, you can process multiple FASTQ files simultaneously.
|
|
175
181
|
For this purpose, a CSV or Excel file consolidating the sample information is required.
|
|
176
|
-
<!-- For a specific example, please refer to [this link](https://github.com/akikuno/DAJIN2/blob/main/examples/example-batch/batch.csv). -->
|
|
177
182
|
|
|
178
183
|
> [!NOTE]
|
|
179
184
|
> For guidance on how to compile sample information, please refer to [this document](https://docs.google.com/presentation/d/e/2PACX-1vSMEmXJPG2TNjfT66XZJRzqJd82aAqO5gJrdEzyhn15YBBr_Li-j5puOgVChYf3jA/embed?start=false&loop=false&delayms=3000).
|
|
@@ -191,44 +196,14 @@ options:
|
|
|
191
196
|
#### Example
|
|
192
197
|
|
|
193
198
|
```bash
|
|
194
|
-
DAJIN2 --file batch.csv --threads 4
|
|
195
|
-
```
|
|
196
|
-
|
|
197
|
-
<!-- ```bash
|
|
198
199
|
# Donwload the example dataset
|
|
199
|
-
|
|
200
|
-
tar -xf
|
|
200
|
+
curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
|
|
201
|
+
tar -xf example_batch.tar.gz
|
|
201
202
|
|
|
202
203
|
# Run DAJIN2
|
|
203
|
-
DAJIN2 batch --file
|
|
204
|
-
|
|
205
|
-
|
|
206
|
-
# 2023-07-31 17:01:16: Preprocess example-batch/tyr_control.fq.gz...
|
|
207
|
-
# 2023-07-31 17:01:48: Output BAM files of example-batch/tyr_control.fq.gz...
|
|
208
|
-
# 2023-07-31 17:01:52: 🍵 example-batch/tyr_control.fq.gz is finished!
|
|
209
|
-
# 2023-07-31 17:01:52: example-batch/tyr_c230gt_50%.fq.gz is now processing...
|
|
210
|
-
# 2023-07-31 17:01:52: example-batch/tyr_c230gt_10%.fq.gz is now processing...
|
|
211
|
-
# 2023-07-31 17:01:52: example-batch/tyr_c230gt_01%.fq.gz is now processing...
|
|
212
|
-
# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_01%.fq.gz...
|
|
213
|
-
# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_50%.fq.gz...
|
|
214
|
-
# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_10%.fq.gz...
|
|
215
|
-
# 2023-07-31 17:02:17: Classify example-batch/tyr_c230gt_50%.fq.gz...
|
|
216
|
-
# 2023-07-31 17:02:19: Clustering example-batch/tyr_c230gt_50%.fq.gz...
|
|
217
|
-
# 2023-07-31 17:02:34: Classify example-batch/tyr_c230gt_01%.fq.gz...
|
|
218
|
-
# 2023-07-31 17:02:35: Classify example-batch/tyr_c230gt_10%.fq.gz...
|
|
219
|
-
# 2023-07-31 17:02:39: Clustering example-batch/tyr_c230gt_01%.fq.gz...
|
|
220
|
-
# 2023-07-31 17:02:39: Clustering example-batch/tyr_c230gt_10%.fq.gz...
|
|
221
|
-
# 2023-07-31 17:02:53: Consensus calling of example-batch/tyr_c230gt_50%.fq.gz...
|
|
222
|
-
# 2023-07-31 17:02:59: Output reports of example-batch/tyr_c230gt_50%.fq.gz...
|
|
223
|
-
# 2023-07-31 17:03:04: 🍵 example-batch/tyr_c230gt_50%.fq.gz is finished!
|
|
224
|
-
# 2023-07-31 17:03:39: Consensus calling of example-batch/tyr_c230gt_01%.fq.gz...
|
|
225
|
-
# 2023-07-31 17:03:51: Output reports of example-batch/tyr_c230gt_01%.fq.gz...
|
|
226
|
-
# 2023-07-31 17:04:03: 🍵 example-batch/tyr_c230gt_01%.fq.gz is finished!
|
|
227
|
-
# 2023-07-31 17:04:08: Consensus calling of example-batch/tyr_c230gt_10%.fq.gz...
|
|
228
|
-
# 2023-07-31 17:04:16: Output reports of example-batch/tyr_c230gt_10%.fq.gz...
|
|
229
|
-
# 2023-07-31 17:04:24: 🍵 example-batch/tyr_c230gt_10%.fq.gz is finished!
|
|
230
|
-
# 🎉 Finished! Open DAJIN_Results/tyr-substitution to see the report.
|
|
231
|
-
``` -->
|
|
204
|
+
DAJIN2 batch --file example_batch/batch.csv --threads 4
|
|
205
|
+
```
|
|
206
|
+
|
|
232
207
|
|
|
233
208
|
## 📈 Report Contents
|
|
234
209
|
|
|
@@ -238,22 +213,22 @@ Inside the **DAJIN_Results** directory, the following files can be found:
|
|
|
238
213
|
```
|
|
239
214
|
DAJIN_Results/tyr-substitution
|
|
240
215
|
├── BAM
|
|
241
|
-
│ ├── tyr_c230gt_01
|
|
242
|
-
│ ├── tyr_c230gt_10
|
|
243
|
-
│ ├── tyr_c230gt_50
|
|
216
|
+
│ ├── tyr_c230gt_01
|
|
217
|
+
│ ├── tyr_c230gt_10
|
|
218
|
+
│ ├── tyr_c230gt_50
|
|
244
219
|
│ └── tyr_control
|
|
245
220
|
├── FASTA
|
|
246
|
-
│ ├── tyr_c230gt_01
|
|
247
|
-
│ ├── tyr_c230gt_10
|
|
248
|
-
│ └── tyr_c230gt_50
|
|
221
|
+
│ ├── tyr_c230gt_01
|
|
222
|
+
│ ├── tyr_c230gt_10
|
|
223
|
+
│ └── tyr_c230gt_50
|
|
249
224
|
├── HTML
|
|
250
|
-
│ ├── tyr_c230gt_01
|
|
251
|
-
│ ├── tyr_c230gt_10
|
|
252
|
-
│ └── tyr_c230gt_50
|
|
225
|
+
│ ├── tyr_c230gt_01
|
|
226
|
+
│ ├── tyr_c230gt_10
|
|
227
|
+
│ └── tyr_c230gt_50
|
|
253
228
|
├── MUTATION_INFO
|
|
254
|
-
│ ├── tyr_c230gt_01
|
|
255
|
-
│ ├── tyr_c230gt_10
|
|
256
|
-
│ └── tyr_c230gt_50
|
|
229
|
+
│ ├── tyr_c230gt_01.csv
|
|
230
|
+
│ ├── tyr_c230gt_10.csv
|
|
231
|
+
│ └── tyr_c230gt_50.csv
|
|
257
232
|
├── read_plot.html
|
|
258
233
|
├── read_plot.pdf
|
|
259
234
|
└── read_summary.xlsx
|
|
@@ -0,0 +1,20 @@
|
|
|
1
|
+
numpy >= 1.24.0
|
|
2
|
+
scipy >= 1.10.0
|
|
3
|
+
pandas >= 1.0.0
|
|
4
|
+
openpyxl >= 3.1.0
|
|
5
|
+
rapidfuzz >=3.6.0
|
|
6
|
+
scikit-learn >= 1.3.0
|
|
7
|
+
|
|
8
|
+
mappy >= 2.24
|
|
9
|
+
pysam >= 0.21.0
|
|
10
|
+
|
|
11
|
+
Flask >= 2.2.0
|
|
12
|
+
waitress >= 2.1.0
|
|
13
|
+
Jinja2 >= 3.1.0
|
|
14
|
+
|
|
15
|
+
plotly >= 5.19.0
|
|
16
|
+
kaleido >= 0.2.0
|
|
17
|
+
|
|
18
|
+
cstag >= 1.0.0
|
|
19
|
+
midsv >= 0.11.0
|
|
20
|
+
wslPath >=0.4.1
|
|
@@ -9,7 +9,7 @@ with open("requirements.txt") as requirements_file:
|
|
|
9
9
|
|
|
10
10
|
setuptools.setup(
|
|
11
11
|
name="DAJIN2",
|
|
12
|
-
version="0.4.
|
|
12
|
+
version="0.4.4",
|
|
13
13
|
author="Akihiro Kuno",
|
|
14
14
|
author_email="akuno@md.tsukuba.ac.jp",
|
|
15
15
|
description="One-step genotyping tools for targeted long-read sequencing",
|
|
@@ -39,17 +39,16 @@ def optimize_labels(X: spmatrix, coverage_sample: int, coverage_control: int) ->
|
|
|
39
39
|
# print(i, Counter(labels_sample), Counter(labels_control), Counter(labels_current)) # ! DEBUG
|
|
40
40
|
|
|
41
41
|
num_labels_control = count_number_of_clusters(labels_control, coverage_control)
|
|
42
|
-
|
|
42
|
+
rand_index = metrics.adjusted_rand_score(labels_previous, labels_current)
|
|
43
43
|
|
|
44
44
|
"""
|
|
45
45
|
Return the number of clusters when:
|
|
46
|
-
|
|
47
|
-
|
|
46
|
+
- the number of clusters in control is split into more than one.
|
|
47
|
+
- the mutual information between the current and previous labels is high enough (= similar).
|
|
48
|
+
To reduce the allele number, previous labels are returned.
|
|
48
49
|
"""
|
|
49
|
-
if num_labels_control >= 2:
|
|
50
|
+
if num_labels_control >= 2 or rand_index >= 0.95:
|
|
50
51
|
return labels_previous
|
|
51
|
-
if 0.95 <= mutual_info <= 1.0:
|
|
52
|
-
return labels_current
|
|
53
52
|
labels_previous = labels_current
|
|
54
53
|
return labels_previous
|
|
55
54
|
|
|
@@ -58,11 +57,13 @@ def get_label_most_common(labels: list[int]) -> int:
|
|
|
58
57
|
return Counter(labels).most_common()[0][0]
|
|
59
58
|
|
|
60
59
|
|
|
61
|
-
def return_labels(
|
|
60
|
+
def return_labels(
|
|
61
|
+
path_score_sample: Path, path_score_control: Path, path_sample: Path, strand_bias_in_control: bool
|
|
62
|
+
) -> list[int]:
|
|
62
63
|
np.random.seed(seed=1)
|
|
63
64
|
score_control = list(io.read_jsonl(path_score_control))
|
|
64
65
|
X_control = csr_matrix(score_control)
|
|
65
|
-
|
|
66
|
+
"""Subset to 1000 reads of controls in the most common cluster to remove outliers and reduce computation time"""
|
|
66
67
|
labels_control = BisectingKMeans(n_clusters=2, random_state=1).fit_predict(X_control)
|
|
67
68
|
label_most_common = get_label_most_common(labels_control)
|
|
68
69
|
scores_control_subset = subset_scores(labels_control, io.read_jsonl(path_score_control), label_most_common, 1000)
|
|
@@ -71,7 +72,7 @@ def return_labels(path_score_sample: Path, path_score_control: Path, path_sample
|
|
|
71
72
|
coverage_sample = io.count_newlines(path_score_sample)
|
|
72
73
|
coverage_control = len(scores_control_subset)
|
|
73
74
|
labels = optimize_labels(X, coverage_sample, coverage_control)
|
|
74
|
-
|
|
75
|
-
if
|
|
75
|
+
"""Re-allocate clusters with strand bias to clusters without strand bias"""
|
|
76
|
+
if strand_bias_in_control is False:
|
|
76
77
|
labels = remove_biased_clusters(path_sample, path_score_sample, labels)
|
|
77
78
|
return labels
|
|
@@ -11,20 +11,6 @@ def calculate_label_percentages(labels: list[int]) -> dict[int, float]:
|
|
|
11
11
|
return {label: (count / total_labels * 100) for label, count in label_counts.items()}
|
|
12
12
|
|
|
13
13
|
|
|
14
|
-
def merge_mixed_cluster(labels_control: list[int], labels_sample: list[int], threshold: float = 0.5) -> list[int]:
|
|
15
|
-
"""Merge labels in sample if they appear more than 'threshold' percentage in control."""
|
|
16
|
-
labels_merged = labels_sample.copy()
|
|
17
|
-
label_percentages_control = calculate_label_percentages(labels_control)
|
|
18
|
-
mixed_labels = {label for label, percent in label_percentages_control.items() if percent > threshold}
|
|
19
|
-
|
|
20
|
-
new_label = max(labels_merged) + 1
|
|
21
|
-
for i, label in enumerate(labels_sample):
|
|
22
|
-
if label in mixed_labels:
|
|
23
|
-
labels_merged[i] = new_label
|
|
24
|
-
|
|
25
|
-
return labels_merged
|
|
26
|
-
|
|
27
|
-
|
|
28
14
|
def map_clusters_to_previous(labels_sample: list[int], labels_previous: list[int]) -> dict[int, int]:
|
|
29
15
|
"""
|
|
30
16
|
Determine which cluster in labels_previous corresponds to each cluster in labels_sample.
|
|
@@ -63,6 +49,8 @@ def merge_minor_cluster(
|
|
|
63
49
|
minor_labels_percentage = {label for label, percent in label_percentages.items() if percent < threshold_percentage}
|
|
64
50
|
minor_labels_readnumber = {label for label, num in Counter(labels_sample).items() if num <= threshold_readnumber}
|
|
65
51
|
minor_labels = minor_labels_percentage | minor_labels_readnumber
|
|
52
|
+
if minor_labels == set():
|
|
53
|
+
return labels_sample
|
|
66
54
|
|
|
67
55
|
correspondence = map_clusters_to_previous(labels_sample, labels_previous)
|
|
68
56
|
update_required_labels = get_update_required_labels(correspondence)
|
|
@@ -70,7 +58,23 @@ def merge_minor_cluster(
|
|
|
70
58
|
labels_merged = labels_sample.copy()
|
|
71
59
|
for m in minor_labels:
|
|
72
60
|
new_label = max(labels_merged) + 1
|
|
73
|
-
labels_merged = [
|
|
61
|
+
labels_merged = [
|
|
62
|
+
new_label if label in update_required_labels[correspondence[m]] else label for label in labels_merged
|
|
63
|
+
]
|
|
64
|
+
|
|
65
|
+
return labels_merged
|
|
66
|
+
|
|
67
|
+
|
|
68
|
+
def merge_mixed_cluster(labels_control: list[int], labels_sample: list[int], threshold: float = 0.5) -> list[int]:
|
|
69
|
+
"""Merge labels in sample if they appear more than 'threshold' percentage in control."""
|
|
70
|
+
labels_merged = labels_sample.copy()
|
|
71
|
+
label_percentages_control = calculate_label_percentages(labels_control)
|
|
72
|
+
mixed_labels = {label for label, percent in label_percentages_control.items() if percent > threshold}
|
|
73
|
+
|
|
74
|
+
new_label = max(labels_merged) + 1
|
|
75
|
+
for i, label in enumerate(labels_sample):
|
|
76
|
+
if label in mixed_labels:
|
|
77
|
+
labels_merged[i] = new_label
|
|
74
78
|
|
|
75
79
|
return labels_merged
|
|
76
80
|
|
|
@@ -82,7 +86,7 @@ def merge_minor_cluster(
|
|
|
82
86
|
|
|
83
87
|
def merge_labels(labels_control: list[int], labels_sample: list[int], labels_previous: list[int]) -> list[int]:
|
|
84
88
|
labels_merged = merge_minor_cluster(
|
|
85
|
-
labels_sample, labels_previous, threshold_percentage=0.5, threshold_readnumber=
|
|
89
|
+
labels_sample, labels_previous, threshold_percentage=0.5, threshold_readnumber=5
|
|
86
90
|
)
|
|
87
91
|
labels_merged = merge_mixed_cluster(labels_control, labels_merged)
|
|
88
92
|
return labels_merged
|
|
@@ -0,0 +1,115 @@
|
|
|
1
|
+
from __future__ import annotations
|
|
2
|
+
|
|
3
|
+
"""
|
|
4
|
+
Nanopore sequencing results often results in strand specific mutations even though the mutation is not strand specific, thus they are considered as sequencing errors and should be removed.
|
|
5
|
+
|
|
6
|
+
This module provides functions to determine whether each allele obtained after clustering is formed due to sequencing errors caused by strand bias.
|
|
7
|
+
|
|
8
|
+
Re-allocates reads belonging to clusters with strand bias to clusters without strand bias.
|
|
9
|
+
"""
|
|
10
|
+
|
|
11
|
+
from pathlib import Path
|
|
12
|
+
from collections import defaultdict
|
|
13
|
+
from sklearn.tree import DecisionTreeClassifier
|
|
14
|
+
|
|
15
|
+
from DAJIN2.utils import io
|
|
16
|
+
|
|
17
|
+
# Constants
|
|
18
|
+
STRAND_BIAS_LOWER_LIMIT = 0.1
|
|
19
|
+
STRAND_BIAS_UPPER_LIMIT = 0.9
|
|
20
|
+
|
|
21
|
+
|
|
22
|
+
def is_strand_bias(path_control: Path) -> bool:
|
|
23
|
+
"""
|
|
24
|
+
Determines whether there is a strand bias in sequencing data
|
|
25
|
+
based on the distribution of '+' and '-' strands.
|
|
26
|
+
"""
|
|
27
|
+
count_strand = defaultdict(int)
|
|
28
|
+
for sample in io.read_jsonl(path_control):
|
|
29
|
+
count_strand[sample["STRAND"]] += 1
|
|
30
|
+
|
|
31
|
+
total = count_strand["+"] + count_strand["-"]
|
|
32
|
+
percentage_plus = count_strand["+"] / total if total > 0 else 0
|
|
33
|
+
|
|
34
|
+
return not (STRAND_BIAS_LOWER_LIMIT < percentage_plus < STRAND_BIAS_UPPER_LIMIT)
|
|
35
|
+
|
|
36
|
+
|
|
37
|
+
###############################################################################
|
|
38
|
+
# Handle Strand bias
|
|
39
|
+
# # Clusters of reads with mutations with strand bias are merged into similar clusters without strand bias
|
|
40
|
+
###############################################################################
|
|
41
|
+
|
|
42
|
+
|
|
43
|
+
def count_strand(labels: list[int], samples: list[dict[str, str]]) -> tuple[dict[str, int], dict[str, int]]:
|
|
44
|
+
"""Count the occurrences of each strand type by label."""
|
|
45
|
+
positive_strand_counts_by_labels = defaultdict(int)
|
|
46
|
+
total_counts_by_labels = defaultdict(int)
|
|
47
|
+
|
|
48
|
+
for label, sample in zip(labels, samples):
|
|
49
|
+
total_counts_by_labels[label] += 1
|
|
50
|
+
if sample["STRAND"] == "+":
|
|
51
|
+
positive_strand_counts_by_labels[label] += 1
|
|
52
|
+
|
|
53
|
+
return dict(positive_strand_counts_by_labels), dict(total_counts_by_labels)
|
|
54
|
+
|
|
55
|
+
|
|
56
|
+
def determine_strand_biases(
|
|
57
|
+
positive_strand_counts_by_labels: defaultdict, total_counts_by_labels: defaultdict
|
|
58
|
+
) -> dict[int, bool]:
|
|
59
|
+
"""Determine strand biases based on positive strand counts."""
|
|
60
|
+
strand_biases = {}
|
|
61
|
+
for label, total in total_counts_by_labels.items():
|
|
62
|
+
positive_strand_count = positive_strand_counts_by_labels[label]
|
|
63
|
+
strand_ratio = positive_strand_count / total
|
|
64
|
+
strand_biases[label] = not (STRAND_BIAS_LOWER_LIMIT < strand_ratio < STRAND_BIAS_UPPER_LIMIT)
|
|
65
|
+
|
|
66
|
+
return strand_biases
|
|
67
|
+
|
|
68
|
+
|
|
69
|
+
def prepare_training_testing_sets(labels, scores, strand_biases) -> tuple[list, list, list]:
|
|
70
|
+
"""Prepare training and testing datasets based on strand biases."""
|
|
71
|
+
train_data, train_labels, test_data = [], [], []
|
|
72
|
+
for label, score in zip(labels, scores):
|
|
73
|
+
if strand_biases[label]:
|
|
74
|
+
test_data.append(score)
|
|
75
|
+
else:
|
|
76
|
+
train_data.append(score)
|
|
77
|
+
train_labels.append(label)
|
|
78
|
+
return train_data, train_labels, test_data
|
|
79
|
+
|
|
80
|
+
|
|
81
|
+
def train_decision_tree(train_data, train_labels) -> DecisionTreeClassifier:
|
|
82
|
+
"""Train a decision tree classifier using the provided features and labels."""
|
|
83
|
+
dtree = DecisionTreeClassifier(random_state=1)
|
|
84
|
+
dtree.fit(train_data, train_labels)
|
|
85
|
+
return dtree
|
|
86
|
+
|
|
87
|
+
|
|
88
|
+
def allocate_labels(labels: list[int], strand_biases: dict[str, bool], dtree, test_data) -> list[int]:
|
|
89
|
+
"""Re-allocates reads belonging to clusters with strand bias to clusters without strand bias."""
|
|
90
|
+
label_predictions = iter(dtree.predict(test_data))
|
|
91
|
+
for i, label in enumerate(labels):
|
|
92
|
+
if strand_biases[label]:
|
|
93
|
+
labels[i] = next(label_predictions)
|
|
94
|
+
return labels
|
|
95
|
+
|
|
96
|
+
|
|
97
|
+
def remove_biased_clusters(path_sample: Path, path_score_sample: Path, labels: list[int]) -> list[int]:
|
|
98
|
+
"""Remove clusters with strand bias by re-labeling based on decision tree predictions.
|
|
99
|
+
Continue until at least one of the samples exhibits strand bias (i.e., do not calculate if all samples exhibit strand bias, or conversely, if none of the samples exhibit strand bias) or
|
|
100
|
+
1000 iterations are reached, which serves as a safeguard to prevent infinite loops.
|
|
101
|
+
"""
|
|
102
|
+
samples = io.read_jsonl(path_sample)
|
|
103
|
+
positive_strand_counts_by_labels, total_counts_by_labels = count_strand(labels, samples)
|
|
104
|
+
strand_biases = determine_strand_biases(positive_strand_counts_by_labels, total_counts_by_labels)
|
|
105
|
+
|
|
106
|
+
iteration_count = 0
|
|
107
|
+
labels_corrected = labels
|
|
108
|
+
while len(set(strand_biases.values())) > 1 or iteration_count < 1000:
|
|
109
|
+
scores = io.read_jsonl(path_score_sample)
|
|
110
|
+
train_data, train_labels, test_data = prepare_training_testing_sets(labels, scores, strand_biases)
|
|
111
|
+
dtree = train_decision_tree(train_data, train_labels)
|
|
112
|
+
labels_corrected = allocate_labels(labels, strand_biases, dtree, test_data)
|
|
113
|
+
strand_biases = determine_strand_biases(labels_corrected, path_sample)
|
|
114
|
+
iteration_count += 1
|
|
115
|
+
return labels_corrected
|
|
@@ -70,8 +70,8 @@ def execute_control(arguments: dict):
|
|
|
70
70
|
# Output BAM files
|
|
71
71
|
###########################################################
|
|
72
72
|
logger.info(f"Output BAM files of {arguments['control']}...")
|
|
73
|
-
report.
|
|
74
|
-
ARGS.tempdir, ARGS.control_name, ARGS.genome_coordinates, ARGS.threads, is_control=True
|
|
73
|
+
report.bam_exporter.export_to_bam(
|
|
74
|
+
ARGS.tempdir, ARGS.control_name, ARGS.genome_coordinates, ARGS.threads, ARGS.uuid, is_control=True
|
|
75
75
|
)
|
|
76
76
|
###########################################################
|
|
77
77
|
# Finish call
|
|
@@ -204,15 +204,15 @@ def execute_sample(arguments: dict):
|
|
|
204
204
|
# RESULT
|
|
205
205
|
io.write_jsonl(RESULT_SAMPLE, Path(ARGS.tempdir, "result", f"{ARGS.sample_name}.jsonl"))
|
|
206
206
|
# FASTA
|
|
207
|
-
report.
|
|
208
|
-
report.
|
|
207
|
+
report.sequence_exporter.export_to_fasta(ARGS.tempdir, ARGS.sample_name, cons_sequence)
|
|
208
|
+
report.sequence_exporter.export_reference_to_fasta(ARGS.tempdir, ARGS.sample_name)
|
|
209
209
|
# HTML
|
|
210
|
-
report.
|
|
210
|
+
report.sequence_exporter.export_to_html(ARGS.tempdir, ARGS.sample_name, cons_percentage)
|
|
211
211
|
# CSV (Allele Info)
|
|
212
|
-
report.
|
|
212
|
+
report.mutation_exporter.export_to_csv(ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, cons_percentage)
|
|
213
213
|
# BAM
|
|
214
|
-
report.
|
|
215
|
-
ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, ARGS.threads, RESULT_SAMPLE
|
|
214
|
+
report.bam_exporter.export_to_bam(
|
|
215
|
+
ARGS.tempdir, ARGS.sample_name, ARGS.genome_coordinates, ARGS.threads, ARGS.uuid, RESULT_SAMPLE
|
|
216
216
|
)
|
|
217
217
|
for path_bam_igvjs in Path(ARGS.tempdir, "cache", ".igvjs").glob(f"{ARGS.control_name}_control.bam*"):
|
|
218
218
|
shutil.copy(path_bam_igvjs, Path(ARGS.tempdir, "report", ".igvjs", ARGS.sample_name))
|
|
@@ -5,11 +5,19 @@ from urllib.request import urlopen
|
|
|
5
5
|
|
|
6
6
|
def fetch_seq_coordinates(genome: str, blat_url: str, seq: str) -> dict:
    """Query the UCSC BLAT service and return coordinates of a perfect full-length hit.

    Keeps the LAST record that is both a 100.0% identity match and spans the
    entire query sequence. Raises ValueError when no such record exists.
    """
    url = f"{blat_url}?db={genome}&type=BLAT&userSeq={seq}"
    response_lines = urlopen(url).read().decode("utf8").split("\n")

    full_length = str(len(seq))
    matches = []
    for raw_line in response_lines:
        if "100.0%" not in raw_line:
            continue
        fields = [token for token in raw_line.split(" ") if token]
        # The final column is the match span; keep only hits covering the whole query.
        if fields[-1] == full_length:
            matches = fields

    if not matches:
        raise ValueError(f"{seq[:60]}... is not found in {genome}")

    chrom, strand, start, end, _ = matches[-5:]
    return {"chrom": chrom, "strand": strand, "start": int(start), "end": int(end)}
|
|
14
22
|
|
|
15
23
|
|
|
@@ -8,8 +8,7 @@ from itertools import chain, groupby
|
|
|
8
8
|
|
|
9
9
|
from collections import Counter
|
|
10
10
|
|
|
11
|
-
from DAJIN2.utils import sam_handler
|
|
12
|
-
from DAJIN2.utils import cssplits_handler
|
|
11
|
+
from DAJIN2.utils import io, sam_handler, cssplits_handler
|
|
13
12
|
|
|
14
13
|
|
|
15
14
|
def has_inversion_in_splice(CIGAR: str) -> bool:
|
|
@@ -215,8 +214,8 @@ def generate_midsv(ARGS, is_control: bool = False, is_insertion: bool = False) -
|
|
|
215
214
|
path_splice = Path(ARGS.tempdir, name, "sam", f"splice_{allele}.sam")
|
|
216
215
|
path_output_midsv = Path(ARGS.tempdir, name, "midsv", f"{allele}.json")
|
|
217
216
|
|
|
218
|
-
sam_ont = sam_handler.remove_overlapped_reads(list(
|
|
219
|
-
sam_splice = sam_handler.remove_overlapped_reads(list(
|
|
217
|
+
sam_ont = sam_handler.remove_overlapped_reads(list(io.read_sam(path_ont)))
|
|
218
|
+
sam_splice = sam_handler.remove_overlapped_reads(list(io.read_sam(path_splice)))
|
|
220
219
|
qname_of_map_ont = extract_qname_of_map_ont(sam_ont, sam_splice)
|
|
221
220
|
sam_of_map_ont = filter_sam_by_preset(sam_ont, qname_of_map_ont, preset="map-ont")
|
|
222
221
|
sam_of_splice = filter_sam_by_preset(sam_splice, qname_of_map_ont, preset="splice")
|
|
@@ -89,13 +89,13 @@ def cosine_similarity(x, y):
|
|
|
89
89
|
|
|
90
90
|
|
|
91
91
|
def identify_dissimilar_loci(values_sample, values_control, index: int, is_consensus: bool = False) -> bool:
    """Return True when the locus at `index` differs between sample and control.

    Args:
        values_sample: per-locus mutation percentages of the sample.
        values_control: per-locus mutation percentages of the control.
        index: locus to evaluate.
        is_consensus: enable the unconditional >20-point shortcut (consensus mode).

    Returns:
        True when the locus is dissimilar. (Fixed the annotation: the original
        declared `-> int` but always returns a boolean.)
    """
    # If 'sample' has more than 20% variation compared to 'control' in consensus
    # mode, unconditionally treat it as a dissimilar locus. This counteracts
    # cases where, during significant deletions, cosine similarity can be
    # exceedingly close to 1 even if nothing is observed in the control
    # (e.g., control = [1,1,1,1,1], sample = [100,100,100,100,100] -> cosine similarity = 1).
    if is_consensus and values_sample[index] - values_control[index] > 20:
        return True

    # Subset ~10 bases around the index. BUG FIX: clamp the start at 0 — for
    # index < 5 the original negative start wrapped around and sliced from the
    # END of the array, comparing the wrong (or an empty) window.
    start = max(0, index - 5)
    # Add 1e-6 to avoid division by zero when calculating cosine similarity.
    x = np.array(values_sample[start : index + 6]) + 1e-6
    y = np.array(values_control[start : index + 6]) + 1e-6

    return cosine_similarity(x, y) < 0.95
|
|
101
101
|
|
|
@@ -109,8 +109,8 @@ def detect_anomalies(values_sample, values_control, threshold: float, is_consens
|
|
|
109
109
|
|
|
110
110
|
values_subtract_reshaped = values_subtract.reshape(-1, 1)
|
|
111
111
|
kmeans = MiniBatchKMeans(n_clusters=2, random_state=0, n_init="auto").fit(values_subtract_reshaped)
|
|
112
|
-
|
|
113
|
-
candidate_loci = {i for i, v in enumerate(values_subtract_reshaped) if v >
|
|
112
|
+
threshold_kmeans = kmeans.cluster_centers_.mean()
|
|
113
|
+
candidate_loci = {i for i, v in enumerate(values_subtract_reshaped) if v > threshold_kmeans}
|
|
114
114
|
|
|
115
115
|
return {i for i in candidate_loci if identify_dissimilar_loci(values_sample, values_control, i, is_consensus)}
|
|
116
116
|
|