mgnify-pipelines-toolkit 1.0.3__tar.gz → 1.0.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of mgnify-pipelines-toolkit might be problematic. Click here for more details.

Files changed (62)
  1. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/PKG-INFO +19 -27
  2. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/README.md +2 -1
  3. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/assembly/add_rhea_chebi_annotation.py +5 -1
  4. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/assembly/gff_annotation_utils.py +84 -21
  5. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/assembly/gff_file_utils.py +11 -0
  6. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/assembly/gff_toolkit.py +25 -7
  7. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/assembly/krona_txt_from_cat_classification.py +13 -9
  8. mgnify_pipelines_toolkit-1.0.5/mgnify_pipelines_toolkit/analysis/assembly/process_dbcan_result_cazys.py +211 -0
  9. mgnify_pipelines_toolkit-1.0.5/mgnify_pipelines_toolkit/analysis/assembly/process_dbcan_result_clusters.py +162 -0
  10. mgnify_pipelines_toolkit-1.0.5/mgnify_pipelines_toolkit/analysis/assembly/summarise_antismash_bgcs.py +230 -0
  11. mgnify_pipelines_toolkit-1.0.5/mgnify_pipelines_toolkit/analysis/assembly/summarise_sanntis_bgcs.py +119 -0
  12. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/shared/convert_cmscan_to_cmsearch_tblout.py +6 -3
  13. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/shared/get_subunits.py +1 -1
  14. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit.egg-info/PKG-INFO +19 -27
  15. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit.egg-info/SOURCES.txt +4 -0
  16. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit.egg-info/entry_points.txt +5 -0
  17. mgnify_pipelines_toolkit-1.0.5/mgnify_pipelines_toolkit.egg-info/requires.txt +20 -0
  18. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/pyproject.toml +22 -26
  19. mgnify_pipelines_toolkit-1.0.3/mgnify_pipelines_toolkit.egg-info/requires.txt +0 -29
  20. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/LICENSE +0 -0
  21. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/__init__.py +0 -0
  22. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/__init__.py +0 -0
  23. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/amplicon_utils.py +0 -0
  24. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/are_there_primers.py +0 -0
  25. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/assess_inflection_point_mcp.py +0 -0
  26. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/assess_mcp_proportions.py +0 -0
  27. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/classify_var_regions.py +0 -0
  28. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/find_mcp_inflection_points.py +0 -0
  29. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/make_asv_count_table.py +0 -0
  30. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/mapseq_to_asv_table.py +0 -0
  31. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/primer_val_classification.py +0 -0
  32. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/remove_ambiguous_reads.py +0 -0
  33. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/rev_comp_se_primers.py +0 -0
  34. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/amplicon/standard_primer_matching.py +0 -0
  35. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/assembly/antismash_gff_builder.py +0 -0
  36. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/assembly/combined_gene_caller_merge.py +0 -0
  37. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/assembly/generate_gaf.py +0 -0
  38. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/assembly/go_utils.py +0 -0
  39. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/assembly/summarise_goslims.py +0 -0
  40. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/genomes/__init__.py +0 -0
  41. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/shared/__init__.py +0 -0
  42. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/shared/dwc_summary_generator.py +0 -0
  43. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/shared/fastq_suffix_header_check.py +0 -0
  44. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/shared/get_subunits_coords.py +0 -0
  45. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/shared/library_strategy_check.py +0 -0
  46. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/shared/mapseq2biom.py +0 -0
  47. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/shared/markergene_study_summary.py +0 -0
  48. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/analysis/shared/study_summary_generator.py +0 -0
  49. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/constants/db_labels.py +0 -0
  50. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/constants/ncrna.py +0 -0
  51. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/constants/regex_ambiguous_bases.py +0 -0
  52. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/constants/regex_fasta_header.py +0 -0
  53. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/constants/tax_ranks.py +0 -0
  54. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/constants/thresholds.py +1 -1
  55. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/constants/var_region_coordinates.py +0 -0
  56. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/schemas/schemas.py +0 -0
  57. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/utils/__init__.py +0 -0
  58. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/utils/fasta_to_delimited.py +0 -0
  59. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit/utils/get_mpt_version.py +0 -0
  60. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit.egg-info/dependency_links.txt +0 -0
  61. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/mgnify_pipelines_toolkit.egg-info/top_level.txt +0 -0
  62. {mgnify_pipelines_toolkit-1.0.3 → mgnify_pipelines_toolkit-1.0.5}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: mgnify_pipelines_toolkit
3
- Version: 1.0.3
3
+ Version: 1.0.5
4
4
  Summary: Collection of scripts and tools for MGnify pipelines
5
5
  Author-email: MGnify team <metagenomics-help@ebi.ac.uk>
6
6
  License: Apache Software License 2.0
@@ -11,33 +11,24 @@ Classifier: Operating System :: OS Independent
11
11
  Requires-Python: >=3.9
12
12
  Description-Content-Type: text/markdown
13
13
  License-File: LICENSE
14
- Requires-Dist: biopython==1.82
15
- Requires-Dist: numpy==1.26.0
16
- Requires-Dist: pandas==2.0.2
17
- Requires-Dist: regex==2023.12.25
18
- Requires-Dist: requests==2.32.3
19
- Requires-Dist: click==8.1.7
20
- Requires-Dist: pandera==0.22.1
21
- Requires-Dist: pyfastx>=2.2.0
22
- Requires-Dist: intervaltree==3.1.0
14
+ Requires-Dist: biopython>=1.85
15
+ Requires-Dist: numpy<3,>=2.2.4
16
+ Requires-Dist: pandas<3,>=2.2.3
17
+ Requires-Dist: regex>=2024.11.6
18
+ Requires-Dist: requests<3,>=2.32.3
19
+ Requires-Dist: click<9,>=8.1.8
20
+ Requires-Dist: pandera<0.24,>=0.23.1
21
+ Requires-Dist: pyfastx<3,>=2.2.0
22
+ Requires-Dist: intervaltree<4,>=3.1.0
23
23
  Provides-Extra: tests
24
- Requires-Dist: pytest==7.4.0; extra == "tests"
25
- Requires-Dist: pytest-md==0.2.0; extra == "tests"
26
- Requires-Dist: pytest-workflow==2.0.1; extra == "tests"
27
- Requires-Dist: biopython==1.82; extra == "tests"
28
- Requires-Dist: pandas==2.0.2; extra == "tests"
29
- Requires-Dist: numpy==1.26.0; extra == "tests"
30
- Requires-Dist: regex==2023.12.25; extra == "tests"
31
- Requires-Dist: requests==2.32.3; extra == "tests"
32
- Requires-Dist: click==8.1.7; extra == "tests"
33
- Requires-Dist: pandera==0.22.1; extra == "tests"
34
- Requires-Dist: pyfastx>=2.2.0; extra == "tests"
24
+ Requires-Dist: pytest<9,>=8.3.5; extra == "tests"
25
+ Requires-Dist: pytest-md>=0.2.0; extra == "tests"
26
+ Requires-Dist: pytest-workflow==2.1.0; extra == "tests"
35
27
  Provides-Extra: dev
36
- Requires-Dist: mgnify_pipelines_toolkit[tests]; extra == "dev"
37
- Requires-Dist: pre-commit==3.8.0; extra == "dev"
38
- Requires-Dist: black==24.8.0; extra == "dev"
39
- Requires-Dist: flake8==7.1.1; extra == "dev"
40
- Requires-Dist: pep8-naming==0.14.1; extra == "dev"
28
+ Requires-Dist: pre-commit>=4.2.0; extra == "dev"
29
+ Requires-Dist: black>=25.1.0; extra == "dev"
30
+ Requires-Dist: flake8>=7.1.2; extra == "dev"
31
+ Requires-Dist: pep8-naming>=0.14.1; extra == "dev"
41
32
  Dynamic: license-file
42
33
 
43
34
  # mgnify-pipelines-toolkit
@@ -74,8 +65,9 @@ Before starting any development, you should do these few steps:
74
65
  - Clone the repo if you haven't already and create a feature branch from the `dev` branch (NOT `main`).
75
66
  - Create a virtual environment with the tool of your choice (i.e. `conda create --name my_new_env`)
76
67
  - Activate your new environment (i.e. `conda activate my_new_env`)
77
- - Install dev dependencies `pip install -e '.[dev]'`
68
+ - Install dev dependencies `pip install -e '.[tests,dev]'`
78
69
  - Install pre-commit hooks `pre-commit install`
70
+ - Run unit tests `pytest`
79
71
 
80
72
  When doing these steps above, you ensure that the code you add will be linted and formatted properly.
81
73
 
@@ -32,8 +32,9 @@ Before starting any development, you should do these few steps:
32
32
  - Clone the repo if you haven't already and create a feature branch from the `dev` branch (NOT `main`).
33
33
  - Create a virtual environment with the tool of your choice (i.e. `conda create --name my_new_env`)
34
34
  - Activate your new environment (i.e. `conda activate my_new_env`)
35
- - Install dev dependencies `pip install -e '.[dev]'`
35
+ - Install dev dependencies `pip install -e '.[tests,dev]'`
36
36
  - Install pre-commit hooks `pre-commit install`
37
+ - Run unit tests `pytest`
37
38
 
38
39
  When doing these steps above, you ensure that the code you add will be linted and formatted properly.
39
40
 
@@ -78,7 +78,11 @@ def main():
78
78
  "--output",
79
79
  required=True,
80
80
  type=Path,
81
- help="Output TSV file with columns: contig_id, protein_id, UniRef90 cluster, rhea_ids, CHEBI reaction participants",
81
+ help=(
82
+ "Output TSV file with columns: contig_id, protein_id, protein hash, "
83
+ "Rhea IDs, CHEBI reaction, reaction definition, 'top hit' if it is "
84
+ "the first hit for the protein"
85
+ ),
82
86
  )
83
87
  parser.add_argument(
84
88
  "-p",
@@ -17,8 +17,19 @@
17
17
 
18
18
  import re
19
19
  import sys
20
+ import fileinput
20
21
 
21
- from mgnify_pipelines_toolkit.constants.thresholds import EVALUE_CUTOFF_IPS, EVALUE_CUTOFF_EGGNOG
22
+ from mgnify_pipelines_toolkit.constants.thresholds import (
23
+ EVALUE_CUTOFF_IPS,
24
+ EVALUE_CUTOFF_EGGNOG,
25
+ )
26
+
27
+ DBCAN_CLASSES_DICT = {
28
+ "TC": "dbcan_transporter_classification",
29
+ "TF": "dbcan_transcription_factor",
30
+ "STP": "dbcan_signal_transduction_prot",
31
+ "CAZyme": "dbcan_prot_family",
32
+ }
22
33
 
23
34
 
24
35
  def get_iprs(ipr_annot):
@@ -26,7 +37,8 @@ def get_iprs(ipr_annot):
26
37
  antifams = list()
27
38
  if not ipr_annot:
28
39
  return iprs, antifams
29
- with open(ipr_annot) as f:
40
+ with fileinput.hook_compressed(ipr_annot, "r", encoding="utf-8") as f:
41
+
30
42
  for line in f:
31
43
  cols = line.strip().split("\t")
32
44
  protein = cols[0]
@@ -55,7 +67,8 @@ def get_eggnog(eggnog_annot):
55
67
  eggnogs = {}
56
68
  if not eggnog_annot:
57
69
  return eggnogs
58
- with open(eggnog_annot, "r") as f:
70
+ with fileinput.hook_compressed(eggnog_annot, "r", encoding="utf-8") as f:
71
+
59
72
  for line in f:
60
73
  line = line.rstrip()
61
74
  cols = line.split("\t")
@@ -104,7 +117,8 @@ def get_bgcs(bgc_file, prokka_gff, tool):
104
117
  return bgc_annotations
105
118
  # save positions of each BGC cluster to dictionary cluster_positions
106
119
  # and save the annotations to dictionary bgc_result
107
- with open(bgc_file, "r") as bgc_in:
120
+ with fileinput.hook_compressed(bgc_file, "r", encoding="utf-8") as bgc_in:
121
+
108
122
  for line in bgc_in:
109
123
  if not line.startswith("#"):
110
124
  (
@@ -138,7 +152,7 @@ def get_bgcs(bgc_file, prokka_gff, tool):
138
152
  type_value = ""
139
153
  as_product = ""
140
154
  for a in annotations.split(
141
- ";"
155
+ ";"
142
156
  ): # go through all parts of the annotation field
143
157
  if a.startswith("as_type="):
144
158
  type_value = a.split("=")[1]
@@ -170,9 +184,12 @@ def get_bgcs(bgc_file, prokka_gff, tool):
170
184
  {"bgc_function": type_value},
171
185
  )
172
186
  if as_product:
173
- tool_result[contig]["_".join([start_pos, end_pos])]["bgc_product"] = as_product
187
+ tool_result[contig]["_".join([start_pos, end_pos])][
188
+ "bgc_product"
189
+ ] = as_product
174
190
  # identify CDSs that fall into each of the clusters annotated by the BGC tool
175
- with open(prokka_gff, "r") as gff_in:
191
+ with fileinput.hook_compressed(prokka_gff, "r", encoding="utf-8") as gff_in:
192
+
176
193
  for line in gff_in:
177
194
  if not line.startswith("#"):
178
195
  matching_interval = ""
@@ -228,8 +245,9 @@ def get_bgcs(bgc_file, prokka_gff, tool):
228
245
  },
229
246
  )
230
247
  if "bgc_product" in tool_result[contig][matching_interval]:
231
- bgc_annotations[cds_id]["antismash_product"] = tool_result[contig][matching_interval][
232
- "bgc_product"]
248
+ bgc_annotations[cds_id]["antismash_product"] = tool_result[
249
+ contig
250
+ ][matching_interval]["bgc_product"]
233
251
  elif line.startswith("##FASTA"):
234
252
  break
235
253
  return bgc_annotations
@@ -239,7 +257,7 @@ def get_amr(amr_file):
239
257
  amr_annotations = {}
240
258
  if not amr_file:
241
259
  return amr_annotations
242
- with open(amr_file, "r") as f:
260
+ with fileinput.hook_compressed(amr_file, "r", encoding="utf-8") as f:
243
261
  for line in f:
244
262
  if line.startswith("Protein identifier"):
245
263
  continue
@@ -286,7 +304,7 @@ def get_dbcan(dbcan_file):
286
304
  substrates = dict()
287
305
  if not dbcan_file:
288
306
  return dbcan_annotations
289
- with open(dbcan_file, "r") as f:
307
+ with fileinput.hook_compressed(dbcan_file, "r", encoding="utf-8") as f:
290
308
  for line in f:
291
309
  if "predicted PUL" in line:
292
310
  annot_fields = line.strip().split("\t")[8].split(";")
@@ -314,13 +332,45 @@ def get_dbcan(dbcan_file):
314
332
  elif a.startswith("Parent"):
315
333
  parent = a.split("=")[1]
316
334
  dbcan_annotations[acc] = (
317
- "dbcan_prot_type={};dbcan_prot_family={};substrate_dbcan-pul={};substrate_dbcan-sub={}".format(
335
+ "dbcan_prot_type={};{}={};substrate_dbcan-pul={};substrate_dbcan-sub={}".format(
318
336
  prot_type,
337
+ DBCAN_CLASSES_DICT[prot_type],
319
338
  prot_fam,
320
339
  substrates[parent]["substrate_pul"],
321
340
  substrates[parent]["substrate_ecami"],
322
341
  )
323
342
  )
343
+
344
+ return dbcan_annotations
345
+
346
+
347
+ def get_dbcan_individual_cazys(dbcan_cazys_file):
348
+ dbcan_annotations = dict()
349
+ if not dbcan_cazys_file:
350
+ return dbcan_annotations
351
+ with fileinput.hook_compressed(dbcan_cazys_file, "r", encoding="utf-8") as f:
352
+ for line in f:
353
+ if line.startswith("#"):
354
+ continue
355
+ attributes = line.strip().split("\t")[8]
356
+ attributes_dict = dict(
357
+ re.split(r"(?<!\\)=", item)
358
+ for item in re.split(r"(?<!\\);", attributes.rstrip(";"))
359
+ )
360
+ if "num_tools" in attributes_dict and int(attributes_dict["num_tools"]) < 2:
361
+ continue # don't keep annotations supported by only one tool within dbcan
362
+ cds_pattern = r"\.CDS\d+$"
363
+ protein = re.sub(
364
+ cds_pattern, "", attributes_dict["ID"]
365
+ ) # remove the CDS number
366
+ annotation_text = "dbcan_prot_type=CAZyme;"
367
+ for field in ["protein_family", "substrate_dbcan-sub", "eC_number"]:
368
+ if field in attributes_dict:
369
+ annotation_text += (
370
+ f"{'dbcan_prot_family' if field == 'protein_family' else field}"
371
+ f"={attributes_dict[field]};"
372
+ )
373
+ dbcan_annotations[protein] = annotation_text.strip(";")
324
374
  return dbcan_annotations
325
375
 
326
376
 
@@ -329,7 +379,8 @@ def get_defense_finder(df_file):
329
379
  type_info = dict()
330
380
  if not df_file:
331
381
  return defense_finder_annotations
332
- with open(df_file, "r") as f:
382
+ with fileinput.hook_compressed(df_file, "r", encoding="utf-8") as f:
383
+
333
384
  for line in f:
334
385
  if "Anti-phage system" in line:
335
386
  annot_fields = line.strip().split("\t")[8].split(";")
@@ -366,6 +417,7 @@ def load_annotations(
366
417
  antismash_file,
367
418
  gecco_file,
368
419
  dbcan_file,
420
+ dbcan_cazys_file,
369
421
  defense_finder_file,
370
422
  pseudofinder_file,
371
423
  ):
@@ -376,6 +428,7 @@ def load_annotations(
376
428
  antismash_bgcs = get_bgcs(antismash_file, in_gff, tool="antismash")
377
429
  amr_annotations = get_amr(amr_file)
378
430
  dbcan_annotations = get_dbcan(dbcan_file)
431
+ dbcan_cazys_annotations = get_dbcan_individual_cazys(dbcan_cazys_file)
379
432
  defense_finder_annotations = get_defense_finder(defense_finder_file)
380
433
  pseudogenes = get_pseudogenes(pseudofinder_file)
381
434
  pseudogene_report_dict = dict()
@@ -384,7 +437,7 @@ def load_annotations(
384
437
  header = []
385
438
  fasta = []
386
439
  fasta_flag = False
387
- with open(in_gff) as f:
440
+ with fileinput.hook_compressed(in_gff, "r", encoding="utf-8") as f:
388
441
  for line in f:
389
442
  line = line.strip()
390
443
  if line[0] != "#" and not fasta_flag:
@@ -496,6 +549,11 @@ def load_annotations(
496
549
  added_annot[protein]["dbCAN"] = dbcan_annotations[protein]
497
550
  except KeyError:
498
551
  pass
552
+ try:
553
+ dbcan_cazys_annotations[protein]
554
+ added_annot[protein]["dbCAN"] = dbcan_cazys_annotations[protein]
555
+ except KeyError:
556
+ pass
499
557
  try:
500
558
  defense_finder_annotations[protein]
501
559
  added_annot[protein]["defense_finder"] = (
@@ -530,7 +588,7 @@ def load_annotations(
530
588
  def get_ncrnas(ncrnas_file):
531
589
  ncrnas = {}
532
590
  counts = 0
533
- with open(ncrnas_file, "r") as f:
591
+ with fileinput.hook_compressed(ncrnas_file, "r", encoding="utf-8") as f:
534
592
  for line in f:
535
593
  if not line.startswith("#"):
536
594
  cols = line.strip().split()
@@ -543,7 +601,9 @@ def get_ncrnas(ncrnas_file):
543
601
  # Skip tRNAs, we add them from tRNAscan-SE
544
602
  continue
545
603
  strand = cols[11]
546
- start, end = (int(cols[9]), int(cols[10])) if strand == "+" else (int(cols[10]), int(cols[9]))
604
+ start, end = int(cols[10]), int(cols[9])
605
+ if strand == "+":
606
+ start, end = end, start
547
607
  rna_feature_name, ncrna_class = prepare_rna_gff_fields(cols)
548
608
  annot = [
549
609
  "ID=" + locus,
@@ -718,7 +778,10 @@ def prepare_rna_gff_fields(cols):
718
778
  }
719
779
 
720
780
  if rna_feature_name == "ncRNA":
721
- ncrna_class = next((rna_type for rna_type, rfams in rna_types.items() if cols[2] in rfams), None)
781
+ ncrna_class = next(
782
+ (rna_type for rna_type, rfams in rna_types.items() if cols[2] in rfams),
783
+ None,
784
+ )
722
785
  if not ncrna_class:
723
786
  if "microRNA" in cols[-1]:
724
787
  ncrna_class = "pre_miRNA"
@@ -729,7 +792,7 @@ def prepare_rna_gff_fields(cols):
729
792
 
730
793
  def get_trnas(trnas_file):
731
794
  trnas = {}
732
- with open(trnas_file, "r") as f:
795
+ with fileinput.hook_compressed(trnas_file, "r", encoding="utf-8") as f:
733
796
  for line in f:
734
797
  if not line.startswith("#"):
735
798
  cols = line.split("\t")
@@ -738,13 +801,13 @@ def get_trnas(trnas_file):
738
801
  line = line.replace("tRNAscan-SE", "tRNAscan-SE:2.0.9")
739
802
  trnas.setdefault(contig, dict()).setdefault(
740
803
  int(start), list()
741
- ).append(line.strip())
804
+ ).append(line.strip().strip(";"))
742
805
  return trnas
743
806
 
744
807
 
745
808
  def load_crispr(crispr_file):
746
809
  crispr_annotations = dict()
747
- with open(crispr_file, "r") as f:
810
+ with fileinput.hook_compressed(crispr_file, "r", encoding="utf-8") as f:
748
811
  record = list()
749
812
  left_coord = ""
750
813
  loc_contig = ""
@@ -791,7 +854,7 @@ def get_pseudogenes(pseudofinder_file):
791
854
  pseudogenes = dict()
792
855
  if not pseudofinder_file:
793
856
  return pseudogenes
794
- with open(pseudofinder_file) as file_in:
857
+ with fileinput.hook_compressed(pseudofinder_file, "r", encoding="utf-8") as file_in:
795
858
  for line in file_in:
796
859
  if not line.startswith("#"):
797
860
  col9 = line.strip().split("\t")[8]
@@ -28,6 +28,17 @@ def write_results_to_file(
28
28
  contig_list = check_for_additional_keys(
29
29
  ncrnas, trnas, crispr_annotations, contig_list
30
30
  )
31
+ # sort contigs by digit at the end of contig/genome accession
32
+ if contig_list[0].startswith(
33
+ "MGYG"
34
+ ): # e.g. 'MGYG000500002_1', 'MGYG000500002_2', 'MGYG000500002_3'
35
+ contig_list = sorted(list(contig_list), key=lambda x: int(x.split("_")[-1]))
36
+ elif contig_list[0].startswith(
37
+ "ERZ"
38
+ ): # e.g. 'ERZ1049444', 'ERZ1049445', 'ERZ1049446'
39
+ contig_list = sorted(
40
+ list(contig_list), key=lambda x: int(x.split("ERZ")[-1])
41
+ )
31
42
  for contig in contig_list:
32
43
  sorted_pos_list = sort_positions(
33
44
  contig, main_gff_extended, ncrnas, trnas, crispr_annotations
@@ -17,8 +17,16 @@
17
17
 
18
18
  import argparse
19
19
 
20
- from gff_annotation_utils import get_ncrnas, get_trnas, load_annotations, load_crispr
21
- from gff_file_utils import write_results_to_file, print_pseudogene_report
20
+ from mgnify_pipelines_toolkit.analysis.assembly.gff_annotation_utils import (
21
+ get_ncrnas,
22
+ get_trnas,
23
+ load_annotations,
24
+ load_crispr,
25
+ )
26
+ from mgnify_pipelines_toolkit.analysis.assembly.gff_file_utils import (
27
+ write_results_to_file,
28
+ print_pseudogene_report,
29
+ )
22
30
 
23
31
 
24
32
  def main(
@@ -31,6 +39,7 @@ def main(
31
39
  antismash_file,
32
40
  gecco_file,
33
41
  dbcan_file,
42
+ dbcan_cazys_file,
34
43
  defense_finder_file,
35
44
  pseudofinder_file,
36
45
  rfam_file,
@@ -53,6 +62,7 @@ def main(
53
62
  antismash_file,
54
63
  gecco_file,
55
64
  dbcan_file,
65
+ dbcan_cazys_file,
56
66
  defense_finder_file,
57
67
  pseudofinder_file,
58
68
  )
@@ -66,7 +76,9 @@ def main(
66
76
  if crispr_file:
67
77
  crispr_annotations = load_crispr(crispr_file)
68
78
 
69
- write_results_to_file(outfile, header, main_gff_extended, fasta, ncrnas, trnas, crispr_annotations)
79
+ write_results_to_file(
80
+ outfile, header, main_gff_extended, fasta, ncrnas, trnas, crispr_annotations
81
+ )
70
82
  if pseudogene_report_file:
71
83
  print_pseudogene_report(pseudogene_report_dict, pseudogene_report_file)
72
84
 
@@ -74,7 +86,7 @@ def main(
74
86
  def parse_args():
75
87
  parser = argparse.ArgumentParser(
76
88
  description="The script extends a user-provided base GFF annotation file by incorporating "
77
- "information extracted from the user-provided outputs of supplementary annotation tools.",
89
+ "information extracted from the user-provided outputs of supplementary annotation tools.",
78
90
  )
79
91
  parser.add_argument(
80
92
  "-g",
@@ -124,7 +136,12 @@ def parse_args():
124
136
  )
125
137
  parser.add_argument(
126
138
  "--dbcan",
127
- help="The GFF file produced by dbCAN post-processing script",
139
+ help="The GFF file produced by dbCAN post-processing script that uses cluster annotations",
140
+ required=False,
141
+ )
142
+ parser.add_argument(
143
+ "--dbcan-cazys",
144
+ help="The GFF file produced by dbCAN-CAZYs post-processing script",
128
145
  required=False,
129
146
  )
130
147
  parser.add_argument(
@@ -149,7 +166,7 @@ def parse_args():
149
166
  return parser.parse_args()
150
167
 
151
168
 
152
- if __name__ == '__main__':
169
+ if __name__ == "__main__":
153
170
  args = parse_args()
154
171
  main(
155
172
  args.gff_input,
@@ -161,10 +178,11 @@ if __name__ == '__main__':
161
178
  args.antismash,
162
179
  args.gecco,
163
180
  args.dbcan,
181
+ args.dbcan_cazys,
164
182
  args.defense_finder,
165
183
  args.pseudofinder,
166
184
  args.rfam,
167
185
  args.trnascan,
168
186
  args.outfile,
169
187
  args.pseudogene_report,
170
- )
188
+ )
@@ -40,10 +40,12 @@ def import_nodes(nodes_dmp):
40
40
  taxid2rank = {}
41
41
 
42
42
  with open(nodes_dmp) as f1:
43
- reader = csv.reader(f1, delimiter="\t")
44
- for line in reader:
45
- taxid = line[0]
46
- rank = line[4]
43
+ for line in f1:
44
+ fields = [part.strip() for part in line.split("|")]
45
+ if len(fields) != 14:
46
+ raise ValueError(f"Unexpected number of columns in line: {line}")
47
+ taxid = fields[0]
48
+ rank = fields[2]
47
49
  taxid2rank[taxid] = rank
48
50
 
49
51
  return taxid2rank
@@ -54,11 +56,13 @@ def import_names(names_dmp):
54
56
  taxid2name = {}
55
57
 
56
58
  with open(names_dmp, newline="") as f1:
57
- reader = csv.reader(f1, delimiter="\t")
58
- for line in reader:
59
- if line[6] == "scientific name":
60
- taxid = line[0]
61
- name = line[2]
59
+ for line in f1:
60
+ fields = [part.strip() for part in line.split("|")]
61
+ if len(fields) != 5:
62
+ raise ValueError(f"Unexpected number of columns in line: {line}")
63
+ if fields[3] == "scientific name":
64
+ taxid = fields[0]
65
+ name = fields[1]
62
66
  taxid2name[taxid] = name
63
67
 
64
68
  return taxid2name