mgnify-pipelines-toolkit 0.1.1__tar.gz → 0.1.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/PKG-INFO +5 -3
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/README.md +5 -3
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/make_asv_count_table.py +48 -29
- mgnify_pipelines_toolkit-0.1.2/mgnify_pipelines_toolkit/analysis/amplicon/mapseq_to_asv_table.py +126 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/constants/tax_ranks.py +1 -1
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit.egg-info/PKG-INFO +5 -3
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit.egg-info/SOURCES.txt +1 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit.egg-info/entry_points.txt +1 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/pyproject.toml +2 -1
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/LICENSE +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/__init__.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/__init__.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/amplicon_utils.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/are_there_primers.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/assess_inflection_point_mcp.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/assess_mcp_proportions.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/classify_var_regions.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/find_mcp_inflection_points.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/remove_ambiguous_reads.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/rev_comp_se_primers.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/standard_primer_matching.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/shared/__init__.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/shared/get_subunits.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/shared/get_subunits_coords.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/shared/mapseq2biom.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/constants/regex_ambiguous_bases.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/constants/thresholds.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/constants/var_region_coordinates.py +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit.egg-info/dependency_links.txt +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit.egg-info/requires.txt +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit.egg-info/top_level.txt +0 -0
- {mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/setup.cfg +0 -0
{mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: mgnify_pipelines_toolkit
-Version: 0.1.1
+Version: 0.1.2
 Summary: Collection of scripts and tools for MGnify pipelines
 Author-email: MGnify team <metagenomics-help@ebi.ac.uk>
 License: Apache Software License 2.0
@@ -55,13 +55,14 @@ You should then be able to run the packages from the command-line. For example t
 ### New script requirements
 
 There are a few requirements for your script:
+
 - It needs to have a named main function of some kind. See `mgnify_pipelines_toolkit/analysis/shared/get_subunits.py` and the `main()` function for an example
 - Because this package is meant to be run from the command-line, make sure your script can easily pass arguments using tools like `argparse` or `click`
 - A small amount of dependencies. This requirement is subjective, but for example if your script only requires a handful of basic packages like `Biopython`, `numpy`, `pandas`, etc., then it's fine. However if the script has a more extensive list of dependencies, a container is probably a better fit.
 
 ### How to add a new script
 
-To add a new Python script, first copy it over to the `mgnify_pipelines_toolkit` directory in this repository, specifically to the subdirectory that makes the most sense. If none of the subdirectories make sense for your script, create a new one. If your script doesn't have a `main()` type function yet, write one.
+To add a new Python script, first copy it over to the `mgnify_pipelines_toolkit` directory in this repository, specifically to the subdirectory that makes the most sense. If none of the subdirectories make sense for your script, create a new one. If your script doesn't have a `main()` type function yet, write one.
 
 Then, open `pyproject.toml` as you will need to add some bits. First, add any missing dependencies (include the version) to the `dependencies` field.
 
@@ -73,7 +74,7 @@ Then, scroll down to the `[project.scripts]` line. Here, you will create an alia
 
 - `get_subunits` is the alias
 - `mgnify_pipelines_toolkit.analysis.shared.get_subunits` will link the alias to the script with the path `mgnify_pipelines_toolkit/analysis/shared/get_subunits.py`
-- `:main` will specifically call the function named `main()` when the alias is run.
+- `:main` will specifically call the function named `main()` when the alias is run.
 
 When you have setup this command, executing `get_subunits` on the command-line will be the equivalent of doing:
 
@@ -86,4 +87,5 @@ Finally, you will need to bump up the version in the `version` line.
 At the moment, these should be the only steps required to setup your script in this package (which is subject to change).
 
 ### Building and uploading to PyPi
+
 The building and pushing of the package is automated by GitHub Actions, which will activate only on a new release. Bioconda should then automatically pick up the new PyPi release and push it to their recipes, though it's worth keeping an eye on their automated pull requests just in case [here](https://github.com/bioconda/bioconda-recipes/pulls).
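The README text carried in the PKG-INFO diff above spells out what a new toolkit script needs: a named `main()` entry point, command-line argument handling via `argparse` or `click`, and a small dependency footprint. As a minimal sketch of a script meeting those requirements (the file purpose, flag and help text here are hypothetical and not part of this release):

# Hypothetical example only: a minimal script shaped like the ones in this package,
# with an argparse-based main() that a [project.scripts] alias could point at.
import argparse


def main():
    parser = argparse.ArgumentParser()
    # The flag and help text below are invented for illustration.
    parser.add_argument("-i", "--input", required=True, type=str, help="Path to input file")
    args = parser.parse_args()
    print(f"Read input path: {args.input}")


if __name__ == "__main__":
    main()

A `[project.scripts]` alias pointing at `package.module:main` would then expose such a script as a console command, as the README goes on to describe.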
{mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/README.md

@@ -30,13 +30,14 @@ You should then be able to run the packages from the command-line. For example t
 ### New script requirements
 
 There are a few requirements for your script:
+
 - It needs to have a named main function of some kind. See `mgnify_pipelines_toolkit/analysis/shared/get_subunits.py` and the `main()` function for an example
 - Because this package is meant to be run from the command-line, make sure your script can easily pass arguments using tools like `argparse` or `click`
 - A small amount of dependencies. This requirement is subjective, but for example if your script only requires a handful of basic packages like `Biopython`, `numpy`, `pandas`, etc., then it's fine. However if the script has a more extensive list of dependencies, a container is probably a better fit.
 
 ### How to add a new script
 
-To add a new Python script, first copy it over to the `mgnify_pipelines_toolkit` directory in this repository, specifically to the subdirectory that makes the most sense. If none of the subdirectories make sense for your script, create a new one. If your script doesn't have a `main()` type function yet, write one.
+To add a new Python script, first copy it over to the `mgnify_pipelines_toolkit` directory in this repository, specifically to the subdirectory that makes the most sense. If none of the subdirectories make sense for your script, create a new one. If your script doesn't have a `main()` type function yet, write one.
 
 Then, open `pyproject.toml` as you will need to add some bits. First, add any missing dependencies (include the version) to the `dependencies` field.
 
@@ -48,7 +49,7 @@ Then, scroll down to the `[project.scripts]` line. Here, you will create an alia
 
 - `get_subunits` is the alias
 - `mgnify_pipelines_toolkit.analysis.shared.get_subunits` will link the alias to the script with the path `mgnify_pipelines_toolkit/analysis/shared/get_subunits.py`
-- `:main` will specifically call the function named `main()` when the alias is run.
+- `:main` will specifically call the function named `main()` when the alias is run.
 
 When you have setup this command, executing `get_subunits` on the command-line will be the equivalent of doing:
 
@@ -61,4 +62,5 @@ Finally, you will need to bump up the version in the `version` line.
 At the moment, these should be the only steps required to setup your script in this package (which is subject to change).
 
 ### Building and uploading to PyPi
-
+
+The building and pushing of the package is automated by GitHub Actions, which will activate only on a new release. Bioconda should then automatically pick up the new PyPi release and push it to their recipes, though it's worth keeping an eye on their automated pull requests just in case [here](https://github.com/bioconda/bioconda-recipes/pulls).
{mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/analysis/amplicon/make_asv_count_table.py

@@ -28,7 +28,7 @@ def parse_args():
 
     parser = argparse.ArgumentParser()
 
-    parser.add_argument("-t", "--taxa", required=True, type=str, help="Path to
+    parser.add_argument("-t", "--taxa", required=True, type=str, help="Path to taxa file")
     parser.add_argument("-f", "--fwd", required=True, type=str, help="Path to DADA2 forward map file")
     parser.add_argument("-r", "--rev", required=False, type=str, help="Path to DADA2 reverse map file")
     parser.add_argument("-a", "--amp", required=True, type=str, help="Path to extracted amp_region reads from inference subworkflow")
@@ -49,7 +49,7 @@ def parse_args():
 
 def order_df(taxa_df):
 
-    if len(taxa_df.columns) ==
+    if len(taxa_df.columns) == 9:
         taxa_df = taxa_df.sort_values(_SILVA_TAX_RANKS, ascending=True)
     elif len(taxa_df.columns) == 10:
         taxa_df = taxa_df.sort_values(_PR2_TAX_RANKS, ascending=True)
@@ -66,11 +66,13 @@ def make_tax_assignment_dict_silva(taxa_df, asv_dict):
     for i in range(len(taxa_df)):
 
         sorted_index = taxa_df.index[i]
-
+        asv_num = taxa_df.iloc[i, 0]
+        asv_count = asv_dict[asv_num]
 
         if asv_count == 0:
             continue
 
+        sk = taxa_df.loc[sorted_index, "Superkingdom"]
         k = taxa_df.loc[sorted_index, "Kingdom"]
         p = taxa_df.loc[sorted_index, "Phylum"]
         c = taxa_df.loc[sorted_index, "Class"]
@@ -83,47 +85,53 @@ def make_tax_assignment_dict_silva(taxa_df, asv_dict):
 
         while True:
 
+            if sk != "0":
+                sk = "_".join(sk.split(" "))
+                tax_assignment += sk
+            else:
+                break
+
             if k != "0":
                 k = "_".join(k.split(" "))
-
-
-
-                    tax_assignment += f"sk__Eukaryota"
-                else:
-                    tax_assignment += f"sk__Eukaryota\tk__{k}"
+                tax_assignment += f"\t{k}"
+            elif sk != "0":
+                tax_assignment += f"\tk__"
             else:
                 break
 
             if p != "0":
-                if k == "Archaea" or k == "Bacteria":
-                    tax_assignment += f"\tk__"
                 p = "_".join(p.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{p}"
             else:
                 break
+
             if c != "0":
                 c = "_".join(c.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{c}"
            else:
                 break
+
             if o != "0":
                 o = "_".join(o.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{o}"
             else:
                 break
+
             if f != "0":
                 f = "_".join(f.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{f}"
             else:
                 break
+
             if g != "0":
                 g = "_".join(g.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{g}"
             else:
                 break
+
             if s != "0":
                 s = "_".join(s.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{s}"
             break
 
         if tax_assignment == "":
@@ -140,7 +148,8 @@ def make_tax_assignment_dict_pr2(taxa_df, asv_dict):
     for i in range(len(taxa_df)):
 
         sorted_index = taxa_df.index[i]
-
+        asv_num = taxa_df.iloc[i, 0]
+        asv_count = asv_dict[asv_num]
 
         if asv_count == 0:
             continue
@@ -161,45 +170,55 @@ def make_tax_assignment_dict_pr2(taxa_df, asv_dict):
 
             if d != "0":
                 d = "_".join(d.split(" "))
-                tax_assignment +=
+                tax_assignment += d
             else:
                 break
 
             if sg != "0":
                 sg = "_".join(sg.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{sg}"
             else:
                 break
+
             if dv != "0":
                 dv = "_".join(dv.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{dv}"
+            else:
+                break
 
             if sdv != "0":
                 sdv = "_".join(sdv.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{sdv}"
+            else:
+                break
+
             if c != "0":
                 c = "_".join(c.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{c}"
             else:
                 break
+
             if o != "0":
                 o = "_".join(o.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{o}"
             else:
                 break
+
             if f != "0":
                 f = "_".join(f.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{f}"
             else:
                 break
+
             if g != "0":
                 g = "_".join(g.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{g}"
             else:
                 break
+
             if s != "0":
                 s = "_".join(s.split(" "))
-                tax_assignment += f"\
+                tax_assignment += f"\t{s}"
             break
 
         if tax_assignment == "":
@@ -253,7 +272,7 @@ def main():
             asv_intersection = fwd_asvs
 
         if headers[counter] in amp_reads:
-            asv_dict[int(asv_intersection[0]) - 1] += 1
+            asv_dict[f"seq_{int(asv_intersection[0]) - 1}"] += 1
 
     fwd_fr.close()
     if paired_end:
@@ -261,7 +280,7 @@ def main():
 
     ref_db = ""
 
-    if len(taxa_df.columns) ==
+    if len(taxa_df.columns) == 9:
        tax_assignment_dict = make_tax_assignment_dict_silva(taxa_df, asv_dict)
        ref_db = "silva"
    elif len(taxa_df.columns) == 10:
mgnify_pipelines_toolkit-0.1.2/mgnify_pipelines_toolkit/analysis/amplicon/mapseq_to_asv_table.py
ADDED

@@ -0,0 +1,126 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# Copyright 2024 EMBL - European Bioinformatics Institute
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+from collections import defaultdict
+import logging
+
+import pandas as pd
+
+logging.basicConfig(level=logging.DEBUG)
+
+def parse_args():
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-i", "--input", required=True, type=str, help="Input from MAPseq output")
+    parser.add_argument("-l", "--label", choices=['DADA2-SILVA', 'DADA2-PR2'], required=True, type=str, help="Database label - either DADA2-SILVA or DADA2-PR2")
+    parser.add_argument("-s", "--sample", required=True, type=str, help="Sample ID")
+
+    args = parser.parse_args()
+
+    _INPUT = args.input
+    _LABEL = args.label
+    _SAMPLE = args.sample
+
+    return _INPUT, _LABEL, _SAMPLE
+
+def parse_label(label):
+
+    silva_short_ranks = ["sk__", "k__", "p__", "c__", "o__", "f__", "g__", "s__"]
+    pr2_short_ranks = ["d__", "sg__", "dv__", "sdv__", "c__", "o__", "f__", "g__", "s__"]
+
+    silva_long_ranks = ["Superkingdom", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
+    pr2_long_ranks = ["Domain", "Supergroup", "Division", "Subdivision", "Class", "Order", "Family", "Genus", "Species"]
+
+    chosen_short_ranks = ''
+    chosen_long_ranks = ''
+
+    if label == 'DADA2-SILVA':
+        chosen_short_ranks = silva_short_ranks
+        chosen_long_ranks = silva_long_ranks
+    elif label == 'DADA2-PR2':
+        chosen_short_ranks = pr2_short_ranks
+        chosen_long_ranks = pr2_long_ranks
+    else:
+        logging.error("Incorrect database label - exiting.")
+        exit(1)
+
+    return chosen_short_ranks, chosen_long_ranks
+
+def parse_mapseq(mseq_df, short_ranks, long_ranks):
+
+    res_dict = defaultdict(list)
+
+    for i in range(len(mseq_df)):
+        asv_id = mseq_df.iloc[i, 0]
+        tax_ass = mseq_df.iloc[i, 1].split(';')
+
+        res_dict['ASV'].append(asv_id)
+
+        for j in range(len(short_ranks)):
+
+            curr_rank = long_ranks[j]
+
+            if j >= len(tax_ass):
+                # This would only be true if the assigned taxonomy is shorter than the total reference database taxononmy
+                # so fill each remaining rank with its respective short rank blank
+                curr_tax = short_ranks[j]
+            else:
+                curr_tax = tax_ass[j]
+
+            res_dict[curr_rank].append(curr_tax)
+    res_df = pd.DataFrame.from_dict(res_dict)
+
+    return(res_df)
+
+def process_blank_tax_ends(res_df, ranks):
+    # Necessary function as we want to replace consecutive blank assignments that start at the last rank as NAs
+    # while avoiding making blanks in the middle as NAs
+
+    for i in range(len(res_df)):
+        last_empty_rank = ''
+        currently_empty = False
+        for j in reversed(range(len(ranks))): # Parse an assignment backwards, from Species all the way to Superkingdom/Domain
+            curr_rank = res_df.iloc[i, j+1]
+            if curr_rank in ranks:
+                if last_empty_rank == '': # Last rank is empty, start window of consecutive blanks
+                    last_empty_rank = j+1
+                    currently_empty = True
+                elif currently_empty: # If we're in a window of consecutive blank assignments that started at the beginning
+                    last_empty_rank = j+1
+                else:
+                    break
+            else:
+                break
+        if last_empty_rank != '':
+            res_df.iloc[i, last_empty_rank:] = 'NA'
+
+    return res_df
+
+def main():
+
+    _INPUT, _LABEL, _SAMPLE = parse_args()
+
+    mseq_df = pd.read_csv(_INPUT, header=1, delim_whitespace=True, usecols=[0, 12])
+
+    short_ranks, long_ranks = parse_label(_LABEL)
+    res_df = parse_mapseq(mseq_df, short_ranks, long_ranks)
+    final_res_df = process_blank_tax_ends(res_df, short_ranks)
+
+    final_res_df.to_csv(f"./{_SAMPLE}_{_LABEL}_asv_taxa.tsv", sep="\t", index=False)
+
+if __name__ == "__main__":
+    main()
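To make the behaviour of the new module concrete, here is a small sketch that is not part of the release: it drives the module's own helper functions (parse_label, parse_mapseq, process_blank_tax_ends, all defined in the file above) on a single toy record. The two-column frame, its column names and the lineage string are invented, standing in for the ASV-id and taxonomy columns the script reads from MAPseq output.

# A minimal sketch (assumed toy data) showing how the new helpers behave.
import pandas as pd

from mgnify_pipelines_toolkit.analysis.amplicon.mapseq_to_asv_table import (
    parse_label,
    parse_mapseq,
    process_blank_tax_ends,
)

# Toy input: one ASV with a partial SILVA-style lineage (Kingdom blank,
# nothing assigned below Class). Column names are arbitrary; the code is positional.
toy_mseq = pd.DataFrame(
    {
        "asv": ["seq_0"],
        "lineage": ["sk__Bacteria;k__;p__Proteobacteria;c__Gammaproteobacteria"],
    }
)

short_ranks, long_ranks = parse_label("DADA2-SILVA")
res_df = parse_mapseq(toy_mseq, short_ranks, long_ranks)    # pads missing trailing ranks with blank prefixes
final_df = process_blank_tax_ends(res_df, short_ranks)      # trailing blanks become "NA", middle blanks stay

print(final_df.iloc[0].to_dict())
# {'ASV': 'seq_0', 'Superkingdom': 'sk__Bacteria', 'Kingdom': 'k__',
#  'Phylum': 'p__Proteobacteria', 'Class': 'c__Gammaproteobacteria',
#  'Order': 'NA', 'Family': 'NA', 'Genus': 'NA', 'Species': 'NA'}

Trailing unassigned ranks come back as "NA" while the blank Kingdom in the middle is kept, which is the distinction process_blank_tax_ends exists to make.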
{mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit/constants/tax_ranks.py

@@ -14,5 +14,5 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-_SILVA_TAX_RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
+_SILVA_TAX_RANKS = ["Superkingdom", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
 _PR2_TAX_RANKS = ["Domain", "Supergroup", "Division", "Subdivision", "Class", "Order", "Family", "Genus", "Species"]
{mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit.egg-info/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: mgnify_pipelines_toolkit
-Version: 0.1.1
+Version: 0.1.2
 Summary: Collection of scripts and tools for MGnify pipelines
 Author-email: MGnify team <metagenomics-help@ebi.ac.uk>
 License: Apache Software License 2.0
@@ -55,13 +55,14 @@ You should then be able to run the packages from the command-line. For example t
 ### New script requirements
 
 There are a few requirements for your script:
+
 - It needs to have a named main function of some kind. See `mgnify_pipelines_toolkit/analysis/shared/get_subunits.py` and the `main()` function for an example
 - Because this package is meant to be run from the command-line, make sure your script can easily pass arguments using tools like `argparse` or `click`
 - A small amount of dependencies. This requirement is subjective, but for example if your script only requires a handful of basic packages like `Biopython`, `numpy`, `pandas`, etc., then it's fine. However if the script has a more extensive list of dependencies, a container is probably a better fit.
 
 ### How to add a new script
 
-To add a new Python script, first copy it over to the `mgnify_pipelines_toolkit` directory in this repository, specifically to the subdirectory that makes the most sense. If none of the subdirectories make sense for your script, create a new one. If your script doesn't have a `main()` type function yet, write one.
+To add a new Python script, first copy it over to the `mgnify_pipelines_toolkit` directory in this repository, specifically to the subdirectory that makes the most sense. If none of the subdirectories make sense for your script, create a new one. If your script doesn't have a `main()` type function yet, write one.
 
 Then, open `pyproject.toml` as you will need to add some bits. First, add any missing dependencies (include the version) to the `dependencies` field.
 
@@ -73,7 +74,7 @@ Then, scroll down to the `[project.scripts]` line. Here, you will create an alia
 
 - `get_subunits` is the alias
 - `mgnify_pipelines_toolkit.analysis.shared.get_subunits` will link the alias to the script with the path `mgnify_pipelines_toolkit/analysis/shared/get_subunits.py`
-- `:main` will specifically call the function named `main()` when the alias is run.
+- `:main` will specifically call the function named `main()` when the alias is run.
 
 When you have setup this command, executing `get_subunits` on the command-line will be the equivalent of doing:
 
@@ -86,4 +87,5 @@ Finally, you will need to bump up the version in the `version` line.
 At the moment, these should be the only steps required to setup your script in this package (which is subject to change).
 
 ### Building and uploading to PyPi
+
 The building and pushing of the package is automated by GitHub Actions, which will activate only on a new release. Bioconda should then automatically pick up the new PyPi release and push it to their recipes, though it's worth keeping an eye on their automated pull requests just in case [here](https://github.com/bioconda/bioconda-recipes/pulls).
{mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit.egg-info/SOURCES.txt

@@ -16,6 +16,7 @@ mgnify_pipelines_toolkit/analysis/amplicon/assess_mcp_proportions.py
 mgnify_pipelines_toolkit/analysis/amplicon/classify_var_regions.py
 mgnify_pipelines_toolkit/analysis/amplicon/find_mcp_inflection_points.py
 mgnify_pipelines_toolkit/analysis/amplicon/make_asv_count_table.py
+mgnify_pipelines_toolkit/analysis/amplicon/mapseq_to_asv_table.py
 mgnify_pipelines_toolkit/analysis/amplicon/remove_ambiguous_reads.py
 mgnify_pipelines_toolkit/analysis/amplicon/rev_comp_se_primers.py
 mgnify_pipelines_toolkit/analysis/amplicon/standard_primer_matching.py
{mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/mgnify_pipelines_toolkit.egg-info/entry_points.txt

@@ -8,6 +8,7 @@ get_subunits = mgnify_pipelines_toolkit.analysis.shared.get_subunits:main
 get_subunits_coords = mgnify_pipelines_toolkit.analysis.shared.get_subunits_coords:main
 make_asv_count_table = mgnify_pipelines_toolkit.analysis.amplicon.make_asv_count_table:main
 mapseq2biom = mgnify_pipelines_toolkit.analysis.shared.mapseq2biom:main
+mapseq_to_asv_table = mgnify_pipelines_toolkit.analysis.amplicon.mapseq_to_asv_table:main
 remove_ambiguous_reads = mgnify_pipelines_toolkit.analysis.amplicon.remove_ambiguous_reads:main
 rev_comp_se_primers = mgnify_pipelines_toolkit.analysis.amplicon.rev_comp_se_primers:main
 standard_primer_matching = mgnify_pipelines_toolkit.analysis.amplicon.standard_primer_matching:main
{mgnify_pipelines_toolkit-0.1.1 → mgnify_pipelines_toolkit-0.1.2}/pyproject.toml

@@ -1,6 +1,6 @@
 [project]
 name = "mgnify_pipelines_toolkit"
-version = "0.1.1"
+version = "0.1.2"
 readme = "README.md"
 license = {text = "Apache Software License 2.0"}
 authors = [
@@ -49,6 +49,7 @@ make_asv_count_table = "mgnify_pipelines_toolkit.analysis.amplicon.make_asv_coun
 remove_ambiguous_reads = "mgnify_pipelines_toolkit.analysis.amplicon.remove_ambiguous_reads:main"
 rev_comp_se_primers = "mgnify_pipelines_toolkit.analysis.amplicon.rev_comp_se_primers:main"
 standard_primer_matching = "mgnify_pipelines_toolkit.analysis.amplicon.standard_primer_matching:main"
+mapseq_to_asv_table = "mgnify_pipelines_toolkit.analysis.amplicon.mapseq_to_asv_table:main"
 
 [project.optional-dependencies]
 tests = [
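With the `[project.scripts]` entry added above, installing 0.1.2 exposes a `mapseq_to_asv_table` console command that resolves to the new module's `main()`. The sketch below shows the equivalent call from Python; the input file and sample name are hypothetical placeholders, not values from this release.

# Hypothetical invocation: the console script `mapseq_to_asv_table` amounts to
# calling main() with these command-line arguments (file/sample names invented).
import sys

from mgnify_pipelines_toolkit.analysis.amplicon import mapseq_to_asv_table

sys.argv = [
    "mapseq_to_asv_table",
    "-i", "ERR123456.mseq",   # MAPseq output file (hypothetical path)
    "-l", "DADA2-SILVA",      # database label, DADA2-SILVA or DADA2-PR2
    "-s", "ERR123456",        # sample ID used in the output file name
]
mapseq_to_asv_table.main()
# Writes ./ERR123456_DADA2-SILVA_asv_taxa.tsv with one row per ASV.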
The remaining files listed in the summary above (+0 -0) are unchanged between 0.1.1 and 0.1.2.