DAJIN2 0.4.3zip → 0.4.4zip

Files changed (66)

{DAJIN2-0.4.3/src/DAJIN2.egg-info → dajin2-0.4.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: DAJIN2
-Version: 0.4.3
+Version: 0.4.4
 Summary: One-step genotyping tools for targeted long-read sequencing
 Home-page: https://github.com/akikuno/DAJIN2
 Author: Akihiro Kuno
@@ -166,7 +166,7 @@ Options:
 ```bash
 # Download example dataset
-wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
+curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
 tar -xf example_single.tar.gz
 # Run DAJIN2
@@ -230,48 +230,13 @@ options:
 ```bash
 # Download the example dataset
-wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
+curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
 tar -xf example_batch.tar.gz
 # Run DAJIN2
 DAJIN2 batch --file example_batch/batch.csv --threads 4
 ```
-<!-- ```bash
-# Donwload the example dataset
-wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
-tar -xf example_batch.tar.gz
-# Run DAJIN2
-DAJIN2 batch --file example-batch/batch.csv --threads 3
-# 2023-07-31 17:01:10: example-batch/tyr_control.fq.gz is now processing...
-# 2023-07-31 17:01:16: Preprocess example-batch/tyr_control.fq.gz...
-# 2023-07-31 17:01:48: Output BAM files of example-batch/tyr_control.fq.gz...
-# 2023-07-31 17:01:52: 🍵 example-batch/tyr_control.fq.gz is finished!
-# 2023-07-31 17:01:52: example-batch/tyr_c230gt_50%.fq.gz is now processing...
-# 2023-07-31 17:01:52: example-batch/tyr_c230gt_10%.fq.gz is now processing...
-# 2023-07-31 17:01:52: example-batch/tyr_c230gt_01%.fq.gz is now processing...
-# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:02:17: Classify example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:02:19: Clustering example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:02:34: Classify example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:02:35: Classify example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:02:39: Clustering example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:02:39: Clustering example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:02:53: Consensus calling of example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:02:59: Output reports of example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:03:04: 🍵 example-batch/tyr_c230gt_50%.fq.gz is finished!
-# 2023-07-31 17:03:39: Consensus calling of example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:03:51: Output reports of example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:04:03: 🍵 example-batch/tyr_c230gt_01%.fq.gz is finished!
-# 2023-07-31 17:04:08: Consensus calling of example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:04:16: Output reports of example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:04:24: 🍵 example-batch/tyr_c230gt_10%.fq.gz is finished!
-# 🎉 Finished! Open DAJIN_Results/tyr-substitution to see the report.
-``` -->
 ## 📈 Report Contents
@@ -281,22 +246,22 @@ Inside the **DAJIN_Results** directory, the following files can be found:
 ```
 DAJIN_Results/tyr-substitution
 ├── BAM
-│   ├── tyr_c230gt_01%
-│   ├── tyr_c230gt_10%
-│   ├── tyr_c230gt_50%
+│   ├── tyr_c230gt_01
+│   ├── tyr_c230gt_10
+│   ├── tyr_c230gt_50
 │   └── tyr_control
 ├── FASTA
-│   ├── tyr_c230gt_01%
-│   ├── tyr_c230gt_10%
-│   └── tyr_c230gt_50%
+│   ├── tyr_c230gt_01
+│   ├── tyr_c230gt_10
+│   └── tyr_c230gt_50
 ├── HTML
-│   ├── tyr_c230gt_01%
-│   ├── tyr_c230gt_10%
-│   └── tyr_c230gt_50%
+│   ├── tyr_c230gt_01
+│   ├── tyr_c230gt_10
+│   └── tyr_c230gt_50
 ├── MUTATION_INFO
-│   ├── tyr_c230gt_01%.csv
-│   ├── tyr_c230gt_10%.csv
-│   └── tyr_c230gt_50%.csv
+│   ├── tyr_c230gt_01.csv
+│   ├── tyr_c230gt_10.csv
+│   └── tyr_c230gt_50.csv
 ├── read_plot.html
 ├── read_plot.pdf
 └── read_summary.xlsx

{DAJIN2-0.4.3 → dajin2-0.4.4}/README.md RENAMED

@@ -133,7 +133,7 @@ Options:
 ```bash
 # Download example dataset
-wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
+curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
 tar -xf example_single.tar.gz
 # Run DAJIN2
@@ -197,48 +197,13 @@ options:
 ```bash
 # Download the example dataset
-wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
+curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
 tar -xf example_batch.tar.gz
 # Run DAJIN2
 DAJIN2 batch --file example_batch/batch.csv --threads 4
 ```
-<!-- ```bash
-# Donwload the example dataset
-wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
-tar -xf example_batch.tar.gz
-# Run DAJIN2
-DAJIN2 batch --file example-batch/batch.csv --threads 3
-# 2023-07-31 17:01:10: example-batch/tyr_control.fq.gz is now processing...
-# 2023-07-31 17:01:16: Preprocess example-batch/tyr_control.fq.gz...
-# 2023-07-31 17:01:48: Output BAM files of example-batch/tyr_control.fq.gz...
-# 2023-07-31 17:01:52: 🍵 example-batch/tyr_control.fq.gz is finished!
-# 2023-07-31 17:01:52: example-batch/tyr_c230gt_50%.fq.gz is now processing...
-# 2023-07-31 17:01:52: example-batch/tyr_c230gt_10%.fq.gz is now processing...
-# 2023-07-31 17:01:52: example-batch/tyr_c230gt_01%.fq.gz is now processing...
-# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:02:17: Classify example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:02:19: Clustering example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:02:34: Classify example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:02:35: Classify example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:02:39: Clustering example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:02:39: Clustering example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:02:53: Consensus calling of example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:02:59: Output reports of example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:03:04: 🍵 example-batch/tyr_c230gt_50%.fq.gz is finished!
-# 2023-07-31 17:03:39: Consensus calling of example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:03:51: Output reports of example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:04:03: 🍵 example-batch/tyr_c230gt_01%.fq.gz is finished!
-# 2023-07-31 17:04:08: Consensus calling of example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:04:16: Output reports of example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:04:24: 🍵 example-batch/tyr_c230gt_10%.fq.gz is finished!
-# 🎉 Finished! Open DAJIN_Results/tyr-substitution to see the report.
-``` -->
 ## 📈 Report Contents
@@ -248,22 +213,22 @@ Inside the **DAJIN_Results** directory, the following files can be found:
 ```
 DAJIN_Results/tyr-substitution
 ├── BAM
-│   ├── tyr_c230gt_01%
-│   ├── tyr_c230gt_10%
-│   ├── tyr_c230gt_50%
+│   ├── tyr_c230gt_01
+│   ├── tyr_c230gt_10
+│   ├── tyr_c230gt_50
 │   └── tyr_control
 ├── FASTA
-│   ├── tyr_c230gt_01%
-│   ├── tyr_c230gt_10%
-│   └── tyr_c230gt_50%
+│   ├── tyr_c230gt_01
+│   ├── tyr_c230gt_10
+│   └── tyr_c230gt_50
 ├── HTML
-│   ├── tyr_c230gt_01%
-│   ├── tyr_c230gt_10%
-│   └── tyr_c230gt_50%
+│   ├── tyr_c230gt_01
+│   ├── tyr_c230gt_10
+│   └── tyr_c230gt_50
 ├── MUTATION_INFO
-│   ├── tyr_c230gt_01%.csv
-│   ├── tyr_c230gt_10%.csv
-│   └── tyr_c230gt_50%.csv
+│   ├── tyr_c230gt_01.csv
+│   ├── tyr_c230gt_10.csv
+│   └── tyr_c230gt_50.csv
 ├── read_plot.html
 ├── read_plot.pdf
 └── read_summary.xlsx

{DAJIN2-0.4.3 → dajin2-0.4.4}/setup.py RENAMED

@@ -9,7 +9,7 @@ with open("requirements.txt") as requirements_file:
 setuptools.setup(
     name="DAJIN2",
-    version="0.4.3",
+    version="0.4.4",
     author="Akihiro Kuno",
     author_email="akuno@md.tsukuba.ac.jp",
     description="One-step genotyping tools for targeted long-read sequencing",

{DAJIN2-0.4.3 → dajin2-0.4.4}/src/DAJIN2/core/clustering/clustering.py RENAMED

@@ -39,17 +39,16 @@ def optimize_labels(X: spmatrix, coverage_sample: int, coverage_control: int) ->
         # print(i, Counter(labels_sample), Counter(labels_control), Counter(labels_current))  # ! DEBUG
         num_labels_control = count_number_of_clusters(labels_control, coverage_control)
-        mutual_info = metrics.adjusted_rand_score(labels_previous, labels_current)
+        rand_index = metrics.adjusted_rand_score(labels_previous, labels_current)
         """
         Return the number of clusters when:
-        - the number of clusters in control is split into more than one.
-        - the mutual information between the current and previous labels is high enough (= similar).
+            - the number of clusters in control is split into more than one.
+            - the adjusted Rand index between the current and previous labels is high enough (= similar).
+        To reduce the allele number, previous labels are returned.
         """
-        if num_labels_control >= 2:
+        if num_labels_control >= 2 or rand_index >= 0.95:
             return labels_previous
-        if 0.95 <= mutual_info <= 1.0:
-            return labels_current
         labels_previous = labels_current
     return labels_previous
@@ -58,11 +57,13 @@ def get_label_most_common(labels: list[int]) -> int:
     return Counter(labels).most_common()[0][0]
-def return_labels(path_score_sample: Path, path_score_control: Path, path_sample: Path, strand_bias: bool) -> list[int]:
+def return_labels(
+    path_score_sample: Path, path_score_control: Path, path_sample: Path, strand_bias_in_control: bool
+) -> list[int]:
     np.random.seed(seed=1)
     score_control = list(io.read_jsonl(path_score_control))
     X_control = csr_matrix(score_control)
-    # subset to 1000 reads of controls in the most common cluster to remove outliers and reduce computation time
+    """Subset to 1000 reads of controls in the most common cluster to remove outliers and reduce computation time"""
     labels_control = BisectingKMeans(n_clusters=2, random_state=1).fit_predict(X_control)
     label_most_common = get_label_most_common(labels_control)
     scores_control_subset = subset_scores(labels_control, io.read_jsonl(path_score_control), label_most_common, 1000)
@@ -71,7 +72,7 @@ def return_labels(path_score_sample: Path, path_score_control: Path, path_sample
     coverage_sample = io.count_newlines(path_score_sample)
     coverage_control = len(scores_control_subset)
     labels = optimize_labels(X, coverage_sample, coverage_control)
-    # correct clusters with strand bias
-    if strand_bias is False:
+    """Re-allocate clusters with strand bias to clusters without strand bias"""
+    if strand_bias_in_control is False:
         labels = remove_biased_clusters(path_sample, path_score_sample, labels)
     return labels
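The stopping rule that this hunk introduces in `optimize_labels` can be illustrated in isolation. The following is a minimal, standard-library sketch rather than DAJIN2's code: `adjusted_rand_index` is a hypothetical stand-in for `sklearn.metrics.adjusted_rand_score`, and `should_return_previous` is an assumed helper name that only mirrors the 0.4.4 condition `num_labels_control >= 2 or rand_index >= 0.95`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a: list[int], labels_b: list[int]) -> float:
    """Adjusted Rand index of two labelings (hypothetical stand-in for scikit-learn)."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # No pairs to compare; treat the labelings as identical
    # Count pairs of reads that share a cluster in both, in the first, and in the second labeling
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # Degenerate case, e.g. a single cluster in both labelings
    return (sum_both - expected) / (max_index - expected)


def should_return_previous(num_labels_control: int, rand_index: float) -> bool:
    """Mirror of the 0.4.4 condition: stop and keep the previous labels."""
    return num_labels_control >= 2 or rand_index >= 0.95


# Identical partitions score 1.0 even when the cluster IDs are swapped
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Unrelated partitions score low, so the search would continue
different_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, different_partition)
```

Because the index is invariant to how cluster IDs are numbered, a value of at least 0.95 indicates that adding another cluster barely changed the partition, which is the case in which 0.4.4 returns the previous (smaller) labeling.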

dajin2-0.4.4/src/DAJIN2/core/clustering/strand_bias_handler.py ADDED

@@ -0,0 +1,115 @@
+from __future__ import annotations
+"""
+Nanopore sequencing often produces strand-specific mutations even when the true mutation is not strand specific; these are treated as sequencing errors and should be removed.
+This module provides functions to determine whether each allele obtained after clustering is formed due to sequencing errors caused by strand bias.
+It also re-allocates reads belonging to clusters with strand bias to clusters without strand bias.
+"""
+from pathlib import Path
+from collections import defaultdict
+from sklearn.tree import DecisionTreeClassifier
+from DAJIN2.utils import io
+# Constants
+STRAND_BIAS_LOWER_LIMIT = 0.1
+STRAND_BIAS_UPPER_LIMIT = 0.9
+def is_strand_bias(path_control: Path) -> bool:
+    """
+    Determines whether there is a strand bias in sequencing data
+    based on the distribution of '+' and '-' strands.
+    """
+    count_strand = defaultdict(int)
+    for sample in io.read_jsonl(path_control):
+        count_strand[sample["STRAND"]] += 1
+    total = count_strand["+"] + count_strand["-"]
+    percentage_plus = count_strand["+"] / total if total > 0 else 0
+    return not (STRAND_BIAS_LOWER_LIMIT < percentage_plus < STRAND_BIAS_UPPER_LIMIT)
+###############################################################################
+# Handle Strand bias
+# # Clusters of reads with mutations with strand bias are merged into similar clusters without strand bias
+###############################################################################
+def count_strand(labels: list[int], samples: list[dict[str, str]]) -> tuple[dict[str, int], dict[str, int]]:
+    """Count the occurrences of each strand type by label."""
+    positive_strand_counts_by_labels = defaultdict(int)
+    total_counts_by_labels = defaultdict(int)
+    for label, sample in zip(labels, samples):
+        total_counts_by_labels[label] += 1
+        if sample["STRAND"] == "+":
+            positive_strand_counts_by_labels[label] += 1
+    return dict(positive_strand_counts_by_labels), dict(total_counts_by_labels)
+def determine_strand_biases(
+    positive_strand_counts_by_labels: dict[int, int], total_counts_by_labels: dict[int, int]
+) -> dict[int, bool]:
+    """Determine strand biases based on positive strand counts."""
+    strand_biases = {}
+    for label, total in total_counts_by_labels.items():
+        # A cluster without any '+' reads has no entry in the plain dict returned by count_strand
+        positive_strand_count = positive_strand_counts_by_labels.get(label, 0)
+        strand_ratio = positive_strand_count / total
+        strand_biases[label] = not (STRAND_BIAS_LOWER_LIMIT < strand_ratio < STRAND_BIAS_UPPER_LIMIT)
+    return strand_biases
+def prepare_training_testing_sets(labels, scores, strand_biases) -> tuple[list, list, list]:
+    """Prepare training and testing datasets based on strand biases."""
+    train_data, train_labels, test_data = [], [], []
+    for label, score in zip(labels, scores):
+        if strand_biases[label]:
+            test_data.append(score)
+        else:
+            train_data.append(score)
+            train_labels.append(label)
+    return train_data, train_labels, test_data
+def train_decision_tree(train_data, train_labels) -> DecisionTreeClassifier:
+    """Train a decision tree classifier using the provided features and labels."""
+    dtree = DecisionTreeClassifier(random_state=1)
+    dtree.fit(train_data, train_labels)
+    return dtree
+def allocate_labels(labels: list[int], strand_biases: dict[str, bool], dtree, test_data) -> list[int]:
+    """Re-allocates reads belonging to clusters with strand bias to clusters without strand bias."""
+    label_predictions = iter(dtree.predict(test_data))
+    for i, label in enumerate(labels):
+        if strand_biases[label]:
+            labels[i] = next(label_predictions)
+    return labels
+def remove_biased_clusters(path_sample: Path, path_score_sample: Path, labels: list[int]) -> list[int]:
+    """Remove clusters with strand bias by re-labeling based on decision tree predictions.
+    Continue while the clusters are a mix of biased and unbiased ones (i.e., stop if all clusters exhibit strand bias or, conversely, if none do) and
+    fewer than 1000 iterations have run, which serves as a safeguard to prevent infinite loops.
+    """
+    samples = io.read_jsonl(path_sample)
+    positive_strand_counts_by_labels, total_counts_by_labels = count_strand(labels, samples)
+    strand_biases = determine_strand_biases(positive_strand_counts_by_labels, total_counts_by_labels)
+    iteration_count = 0
+    labels_corrected = labels
+    # Both conditions must hold: the decision tree needs unbiased clusters to train on and biased reads to re-label
+    while len(set(strand_biases.values())) > 1 and iteration_count < 1000:
+        scores = io.read_jsonl(path_score_sample)
+        train_data, train_labels, test_data = prepare_training_testing_sets(labels_corrected, scores, strand_biases)
+        dtree = train_decision_tree(train_data, train_labels)
+        labels_corrected = allocate_labels(labels_corrected, strand_biases, dtree, test_data)
+        # Re-count strands with the corrected labels before re-evaluating the bias of each cluster
+        positive_strand_counts_by_labels, total_counts_by_labels = count_strand(
+            labels_corrected, io.read_jsonl(path_sample)
+        )
+        strand_biases = determine_strand_biases(positive_strand_counts_by_labels, total_counts_by_labels)
+        iteration_count += 1
+    return labels_corrected
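The per-cluster bias test used throughout this module can be tried on toy data. The sketch below is a simplified illustration under stated assumptions: `flag_biased_clusters` is a hypothetical name, strands are passed as a plain list instead of being read from JSONL, and only the two thresholds are taken from the diff.

```python
from collections import defaultdict

# Thresholds as defined in strand_bias_handler.py
STRAND_BIAS_LOWER_LIMIT = 0.1
STRAND_BIAS_UPPER_LIMIT = 0.9


def flag_biased_clusters(labels: list[int], strands: list[str]) -> dict[int, bool]:
    """Return, for each cluster label, whether its '+' strand ratio is outside (0.1, 0.9)."""
    plus_counts = defaultdict(int)
    total_counts = defaultdict(int)
    for label, strand in zip(labels, strands):
        total_counts[label] += 1
        if strand == "+":
            plus_counts[label] += 1
    return {
        label: not (STRAND_BIAS_LOWER_LIMIT < plus_counts[label] / total < STRAND_BIAS_UPPER_LIMIT)
        for label, total in total_counts.items()
    }


# Cluster 0 has a balanced mix of strands; cluster 1 contains only '-' reads
labels = [0, 0, 0, 0, 1, 1, 1, 1]
strands = ["+", "-", "+", "-", "-", "-", "-", "-"]
biases = flag_biased_clusters(labels, strands)
print(biases)
```

In this toy input, cluster 1 is flagged as biased, so its reads would be the ones handed to the decision tree for re-allocation, while cluster 0 would supply the training data.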

{DAJIN2-0.4.3 → dajin2-0.4.4}/src/DAJIN2/core/preprocess/mutation_extractor.py RENAMED

@@ -89,13 +89,13 @@ def cosine_similarity(x, y):
 def identify_dissimilar_loci(values_sample, values_control, index: int, is_consensus: bool = False) -> int:
-    # If 'sample' has more than X% variation compared to 'control', unconditionally set it to "dissimilar loci"
-    threshold = 20 if is_consensus else 5
-    if values_sample[index] - values_control[index] > threshold:
+    # If 'sample' has more than 20% variation compared to 'control' in consensus mode, unconditionally set it to 'dissimilar loci'. This is set to counteract cases where, when evaluating cosine similarity during significant deletions, values exceedingly close to 1 can occur even if not observed in the control (e.g., control = [1,1,1,1,1], sample = [100,100,100,100,100] -> cosine similarity = 1).
+    if is_consensus and values_sample[index] - values_control[index] > 20:
         return True
-    x = values_sample[index - 5 : index + 6]
-    y = values_control[index - 5 : index + 6]
+    # Subset 10 bases around index and add 1e-6 to avoid division by zero when calculating cosine similarity.
+    x = np.array(values_sample[index - 5 : index + 6]) + 1e-6
+    y = np.array(values_control[index - 5 : index + 6]) + 1e-6
     return cosine_similarity(x, y) < 0.95
@@ -109,8 +109,8 @@ def detect_anomalies(values_sample, values_control, threshold: float, is_consens
     values_subtract_reshaped = values_subtract.reshape(-1, 1)
     kmeans = MiniBatchKMeans(n_clusters=2, random_state=0, n_init="auto").fit(values_subtract_reshaped)
-    threshold = kmeans.cluster_centers_.mean()
-    candidate_loci = {i for i, v in enumerate(values_subtract_reshaped) if v > threshold}
+    threshold_kmeans = kmeans.cluster_centers_.mean()
+    candidate_loci = {i for i, v in enumerate(values_subtract_reshaped) if v > threshold_kmeans}
     return {i for i in candidate_loci if identify_dissimilar_loci(values_sample, values_control, i, is_consensus)}
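The two points made in the new comments of `identify_dissimilar_loci` can be checked numerically. This is a minimal sketch, assuming a plain-Python `cosine_similarity` as a stand-in for the function of the same name in `mutation_extractor.py` (its body is not part of this diff); the constant name `EPSILON` is likewise an assumption.

```python
import math


def cosine_similarity(x: list[float], y: list[float]) -> float:
    """Plain-Python cosine similarity (stand-in for the module's own function)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


# Example from the comment: the magnitudes differ 100-fold, yet the direction is identical,
# so cosine similarity alone cannot flag the locus as dissimilar
control = [1, 1, 1, 1, 1]
sample = [100, 100, 100, 100, 100]
similarity = cosine_similarity(sample, control)

# An all-zero window has zero norm; shifting every value by a small constant avoids dividing by zero
EPSILON = 1e-6
zero_window = [0.0] * 11
shifted_window = [v + EPSILON for v in zero_window]
similarity_zeros = cosine_similarity(shifted_window, shifted_window)
print(similarity, similarity_zeros)
```

The first case is why the consensus mode keeps the absolute-difference check (more than 20) in front of the cosine test; the second shows why the `1e-6` offset is added to both windows.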

{DAJIN2-0.4.3 → dajin2-0.4.4}/src/DAJIN2/main.py RENAMED

@@ -20,7 +20,7 @@ from DAJIN2.core import core
 from DAJIN2.utils import io, config, report_generator, input_validator, multiprocess
-DAJIN_VERSION = "0.4.3"
+DAJIN_VERSION = "0.4.4"
 def generate_report(name: str) -> None:
@@ -58,21 +58,21 @@ def execute_single_mode(arguments: dict[str]):
 ################################################################################
-def validate_columns_of_batch_file(columns: list, filepath: str) -> None:
-    """Validate the columns of a batch file."""
-    required_columns = ["sample", "control", "allele", "name"]
-    accepted_columns = ["sample", "control", "allele", "name", "genome"]
+def validate_headers_of_batch_file(headers: list, filepath: str) -> None:
+    """Validate the headers of a batch file."""
+    required_headers = ["sample", "control", "allele", "name"]
+    accepted_headers = ["sample", "control", "allele", "name", "genome"]
-    if not set(required_columns).issubset(set(columns)):
-        raise ValueError(f"{filepath} must contain {', '.join(required_columns)} in the header")
+    if not set(required_headers).issubset(set(headers)):
+        raise ValueError(f"{filepath} must contain {', '.join(required_headers)} in the header")
-    if not set(columns).issubset(accepted_columns):
-        raise ValueError(f"Accepted header names of {filepath} are {', '.join(accepted_columns)}.")
+    if not set(headers).issubset(accepted_headers):
+        raise ValueError(f"Accepted header names of {filepath} are {', '.join(accepted_headers)}.")
-def create_argument_dict(columns: list, group: list, cache_urls_genome: dict, is_control: bool) -> dict:
-    """Create a dictionary of arguments from the given columns and group."""
-    args = dict(zip(columns, group))
+def create_argument_dict(headers: list, group: list, cache_urls_genome: dict, is_control: bool) -> dict:
+    """Create a dictionary of arguments from the given headers and group."""
+    args = dict(zip(headers, group))
     args["threads"] = 1  # Set the number of threads to 1 for batch mode
     # Assign the "sample" field depending on whether it's a control or not
@@ -89,11 +89,11 @@ def create_argument_dict(columns: list, group: list, cache_urls_genome: dict, is
 def run_DAJIN2(
-    groups: list, columns: list, cache_urls_genome: dict, is_control: bool = True, num_workers: int = 1
+    groups: list, headers: list, cache_urls_genome: dict, is_control: bool = True, num_workers: int = 1
 ) -> None:
     contents = []
     for group in groups:
-        args = create_argument_dict(columns, group, cache_urls_genome, is_control)
+        args = create_argument_dict(headers, group, cache_urls_genome, is_control)
         if args:  # Add args to contents only if it's not an empty dict
             contents.append(args)
@@ -117,17 +117,17 @@ def execute_batch_mode(arguments: dict[str]):
     inputs = io.load_batchfile(path_batchfile)
     # Validate Column of the batch file
-    columns = inputs[0]
-    validate_columns_of_batch_file(columns, path_batchfile)
+    headers = inputs[0]
+    validate_headers_of_batch_file(headers, path_batchfile)
     # Validate contents and fetch genome urls
     contents = inputs[1:]
     cache_urls_genome = dict()
-    index_of_name = columns.index("name")
+    index_of_name = headers.index("name")
     contents.sort(key=lambda x: x[index_of_name])
     for _, groups in groupby(contents, key=lambda x: x[index_of_name]):
         for group in groups:
-            args = dict(zip(columns, group))
+            args = dict(zip(headers, group))
             # validate contents in the batch file
             input_validator.validate_files(args["sample"], args["control"], args["allele"])
             # validate genome and fetch urls
@@ -141,8 +141,8 @@ def execute_batch_mode(arguments: dict[str]):
         config.set_logging(path_logfile)
         groups = list(groups)
         # Run DAJIN2
-        run_DAJIN2(groups, columns, cache_urls_genome, is_control=True, num_workers=arguments["threads"])
-        run_DAJIN2(groups, columns, cache_urls_genome, is_control=False, num_workers=arguments["threads"])
+        run_DAJIN2(groups, headers, cache_urls_genome, is_control=True, num_workers=arguments["threads"])
+        run_DAJIN2(groups, headers, cache_urls_genome, is_control=False, num_workers=arguments["threads"])
         # Finish
         generate_report(name)
         shutil.move(path_logfile, Path("DAJIN_Results", name))

{DAJIN2-0.4.3 → dajin2-0.4.4}/src/DAJIN2/utils/io.py RENAMED

@@ -83,8 +83,8 @@ def write_xlsx(data: list[dict[str, str]], file_path: str | Path) -> None:
 ###########################################################
-def check_excel_or_csv(file_path: str) -> str | None:
-    """Check if the file is an Excel or CSV file. Raise error for other types."""
+def determine_file_type(file_path: str) -> str | None:
+    """Determine if the file is an Excel or CSV file. Raise error for other types."""
     file_extension = Path(file_path).suffix
     if file_extension in [".xlsx", ".xls"]:
         return "excel"
@@ -112,16 +112,18 @@ def read_xlsx(file_path: str | Path) -> list[dict[str, str]]:
 def read_csv(file_path: str) -> list[dict[str, str]]:
     """Load data from a CSV file and return as a list."""
     with open(file_path, "r") as csvfile:
-        contents = []
+        inputs = []
         for row in csv.reader(csvfile):
+            if not row:  # Skip empty rows
+                continue
             trimmed_row = [field.strip() for field in row]
-            contents.append(trimmed_row)
-        return contents
+            inputs.append(trimmed_row)
+        return inputs
 def load_batchfile(batchfile_path: str) -> list[dict[str, str]]:
     """Load data from either an Excel or CSV file."""
-    file_type = check_excel_or_csv(batchfile_path)
+    file_type = determine_file_type(batchfile_path)
     if file_type == "excel":
         return read_xlsx(batchfile_path)
     elif file_type == "csv":
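The effect of the empty-row check added to `read_csv` can be seen on an in-memory file. The sketch below is an illustration only: `read_rows` is a hypothetical variant that takes a file object instead of a path, and the batch file contents are made up.

```python
import csv
import io


def read_rows(csvfile) -> list[list[str]]:
    """Read CSV rows, skipping empty rows and trimming whitespace around each field."""
    rows = []
    for row in csv.reader(csvfile):
        if not row:  # csv.reader yields [] for a blank line
            continue
        rows.append([field.strip() for field in row])
    return rows


# Hypothetical batch file with a blank line in the middle and another at the end
text = "sample, control, allele, name\n\ns1.fq.gz, c.fq.gz, a.fa, demo\n\n"
rows = read_rows(io.StringIO(text))
print(rows)
```

Without the check, each blank line would be passed on as an empty list and paired with the headers in batch mode. Note that a line containing only spaces is not an empty row for `csv.reader`, so this sketch (like the diff) would still keep it as a row with one empty field.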

{DAJIN2-0.4.3 → dajin2-0.4.4/src/DAJIN2.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: DAJIN2
-Version: 0.4.3
+Version: 0.4.4
 Summary: One-step genotyping tools for targeted long-read sequencing
 Home-page: https://github.com/akikuno/DAJIN2
 Author: Akihiro Kuno
@@ -166,7 +166,7 @@ Options:
 ```bash
 # Download example dataset
-wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
+curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
 tar -xf example_single.tar.gz
 # Run DAJIN2
@@ -230,48 +230,13 @@ options:
 ```bash
 # Download the example dataset
-wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
+curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
 tar -xf example_batch.tar.gz
 # Run DAJIN2
 DAJIN2 batch --file example_batch/batch.csv --threads 4
 ```
-<!-- ```bash
-# Donwload the example dataset
-wget https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
-tar -xf example_batch.tar.gz
-# Run DAJIN2
-DAJIN2 batch --file example-batch/batch.csv --threads 3
-# 2023-07-31 17:01:10: example-batch/tyr_control.fq.gz is now processing...
-# 2023-07-31 17:01:16: Preprocess example-batch/tyr_control.fq.gz...
-# 2023-07-31 17:01:48: Output BAM files of example-batch/tyr_control.fq.gz...
-# 2023-07-31 17:01:52: 🍵 example-batch/tyr_control.fq.gz is finished!
-# 2023-07-31 17:01:52: example-batch/tyr_c230gt_50%.fq.gz is now processing...
-# 2023-07-31 17:01:52: example-batch/tyr_c230gt_10%.fq.gz is now processing...
-# 2023-07-31 17:01:52: example-batch/tyr_c230gt_01%.fq.gz is now processing...
-# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:01:55: Preprocess example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:02:17: Classify example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:02:19: Clustering example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:02:34: Classify example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:02:35: Classify example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:02:39: Clustering example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:02:39: Clustering example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:02:53: Consensus calling of example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:02:59: Output reports of example-batch/tyr_c230gt_50%.fq.gz...
-# 2023-07-31 17:03:04: 🍵 example-batch/tyr_c230gt_50%.fq.gz is finished!
-# 2023-07-31 17:03:39: Consensus calling of example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:03:51: Output reports of example-batch/tyr_c230gt_01%.fq.gz...
-# 2023-07-31 17:04:03: 🍵 example-batch/tyr_c230gt_01%.fq.gz is finished!
-# 2023-07-31 17:04:08: Consensus calling of example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:04:16: Output reports of example-batch/tyr_c230gt_10%.fq.gz...
-# 2023-07-31 17:04:24: 🍵 example-batch/tyr_c230gt_10%.fq.gz is finished!
-# 🎉 Finished! Open DAJIN_Results/tyr-substitution to see the report.
-``` -->
 ## 📈 Report Contents
@@ -281,22 +246,22 @@ Inside the **DAJIN_Results** directory, the following files can be found:
 ```
 DAJIN_Results/tyr-substitution
 ├── BAM
-│   ├── tyr_c230gt_01%
-│   ├── tyr_c230gt_10%
-│   ├── tyr_c230gt_50%
+│   ├── tyr_c230gt_01
+│   ├── tyr_c230gt_10
+│   ├── tyr_c230gt_50
 │   └── tyr_control
 ├── FASTA
-│   ├── tyr_c230gt_01%
-│   ├── tyr_c230gt_10%
-│   └── tyr_c230gt_50%
+│   ├── tyr_c230gt_01
+│   ├── tyr_c230gt_10
+│   └── tyr_c230gt_50
 ├── HTML
-│   ├── tyr_c230gt_01%
-│   ├── tyr_c230gt_10%
-│   └── tyr_c230gt_50%
+│   ├── tyr_c230gt_01
+│   ├── tyr_c230gt_10
+│   └── tyr_c230gt_50
 ├── MUTATION_INFO
-│   ├── tyr_c230gt_01%.csv
-│   ├── tyr_c230gt_10%.csv
-│   └── tyr_c230gt_50%.csv
+│   ├── tyr_c230gt_01.csv
+│   ├── tyr_c230gt_10.csv
+│   └── tyr_c230gt_50.csv
 ├── read_plot.html
 ├── read_plot.pdf
 └── read_summary.xlsx

DAJIN2-0.4.3/src/DAJIN2/core/clustering/strand_bias_handler.py DELETED

@@ -1,113 +0,0 @@
-from __future__ import annotations
-from pathlib import Path
-from collections import defaultdict, Counter
-from sklearn.tree import DecisionTreeClassifier
-from DAJIN2.utils import io
-# Constants
-STRAND_BIAS_LOWER_LIMIT = 0.1
-STRAND_BIAS_UPPER_LIMIT = 0.9
-def is_strand_bias(path_control: Path) -> bool:
-    count_strand = defaultdict(int)
-    for m in io.read_jsonl(path_control):
-        count_strand[m["STRAND"]] += 1
-    total = count_strand["+"] + count_strand["-"]
-    percentage_plus = count_strand["+"] / total if total else 0
-    return not (STRAND_BIAS_LOWER_LIMIT < percentage_plus < STRAND_BIAS_UPPER_LIMIT)
-###############################################################################
-# Handle Strand bias
-# # Clusters of reads with mutations with strand bias are merged into similar clusters without strand bias
-###############################################################################
-def _count_strand(labels: list[int], samples: list[dict[str, str]]) -> tuple[defaultdict, defaultdict]:
-    """Count the occurrences of each strand type by label."""
-    count_strand_by_labels = defaultdict(int)
-    total_count_by_labels = defaultdict(int)
-    for label, sample in zip(labels, samples):
-        total_count_by_labels[label] += 1
-        if sample["STRAND"] == "+":
-            count_strand_by_labels[label] += 1
-    return count_strand_by_labels, total_count_by_labels
-def _calculate_strand_biases(
-    count_strand_by_labels: defaultdict, total_count_by_labels: defaultdict
-) -> dict[int, bool]:
-    """Calculate strand biases based on strand counts."""
-    strand_biases = {}
-    for label, total in total_count_by_labels.items():
-        strand_count = count_strand_by_labels[label]
-        strand_ratio = strand_count / total
-        strand_biases[label] = not (STRAND_BIAS_LOWER_LIMIT < strand_ratio < STRAND_BIAS_UPPER_LIMIT)
-    return strand_biases
-def _get_strand_biases_on_each_label(labels: list[int], path_sample: Path | str) -> dict[int, bool]:
-    """Get strand biases for given labels and samples.
-    Args:
-        labels: A list of integer labels.
-        path_sample: The path to the sample file.
-    Returns:
-        A dictionary containing strand biases by label.
-    """
-    samples = io.read_jsonl(path_sample)
-    count_strand_by_labels, total_count_by_labels = _count_strand(labels, samples)
-    return _calculate_strand_biases(count_strand_by_labels, total_count_by_labels)
-def _prepare_training_testing_sets(labels, scores, strand_biases) -> tuple[list, list, list]:
-    x_train, y_train, x_test = [], [], []
-    for label, score in zip(labels, scores):
-        if strand_biases[label]:
-            x_test.append(score)
-        else:
-            x_train.append(score)
-            y_train.append(label)
-    return x_train, y_train, x_test
-def _train_decision_tree(x_train, y_train) -> DecisionTreeClassifier:
-    dtree = DecisionTreeClassifier(random_state=1)
-    dtree.fit(x_train, y_train)
-    return dtree
-def _allocate_labels(labels, strand_biases, dtree, x_test) -> list[int]:
-    label_predictions = dtree.predict(x_test)
-    label_predict_iter = iter(label_predictions)
-    for i, label in enumerate(labels):
-        if strand_biases[label]:
-            labels[i] = next(label_predict_iter)
-    return labels
-def _correct_clusters_with_strand_bias(path_score_sample, labels, strand_biases) -> list[int]:
-    scores = io.read_jsonl(path_score_sample)
-    x_train, y_train, x_test = _prepare_training_testing_sets(labels, scores, strand_biases)
-    dtree = _train_decision_tree(x_train, y_train)
-    return _allocate_labels(labels, strand_biases, dtree, x_test)
-def remove_biased_clusters(path_sample, path_score_sample, labels) -> list[int]:
-    strand_biases = _get_strand_biases_on_each_label(labels, path_sample)
-    # Until there is at least one True and one False or
-    # 1000 iterations (1000 is a suitable number to exit an infinite loop just in case)
-    i = 0
-    labels_corrected = labels
-    while len(Counter(strand_biases.values())) > 1 and i < 1000:
-        labels_corrected = _correct_clusters_with_strand_bias(path_score_sample, labels_corrected, strand_biases)
-        strand_biases = _get_strand_biases_on_each_label(labels_corrected, path_sample)
-        i += 1
-    return labels_corrected