PyPI - eplacer - Versions diffs - 0.1.0__tar.gz - Mend

eplacer 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (21)

eplacer-0.1.0/LICENSE.txt +5 -0
eplacer-0.1.0/PKG-INFO +143 -0
eplacer-0.1.0/README.md +103 -0
eplacer-0.1.0/eplacer/__init__.py +0 -0
eplacer-0.1.0/eplacer/__main__.py +176 -0
eplacer-0.1.0/eplacer/data_prep.py +97 -0
eplacer-0.1.0/eplacer/external.py +60 -0
eplacer-0.1.0/eplacer/geographicRep.py +133 -0
eplacer-0.1.0/eplacer/models.py +234 -0
eplacer-0.1.0/eplacer/run_command.py +62 -0
eplacer-0.1.0/eplacer/run_model.py +610 -0
eplacer-0.1.0/eplacer/train_command.py +53 -0
eplacer-0.1.0/eplacer/train_evaluate.py +478 -0
eplacer-0.1.0/eplacer.egg-info/PKG-INFO +143 -0
eplacer-0.1.0/eplacer.egg-info/SOURCES.txt +19 -0
eplacer-0.1.0/eplacer.egg-info/dependency_links.txt +1 -0
eplacer-0.1.0/eplacer.egg-info/entry_points.txt +2 -0
eplacer-0.1.0/eplacer.egg-info/requires.txt +18 -0
eplacer-0.1.0/eplacer.egg-info/top_level.txt +1 -0
eplacer-0.1.0/setup.cfg +4 -0
eplacer-0.1.0/setup.py +42 -0

eplacer-0.1.0/LICENSE.txt ADDED

@@ -0,0 +1,5 @@
+Software code created by U.S. Government employees is not subject to copyright in the United States (17 U.S.C. §105).
+The United States/Department of Commerce reserve all rights to seek and obtain copyright protection in countries other
+than the United States for Software authored in its entirety by the Department of Commerce. To this end, the Department
+of Commerce hereby grants to Recipient a royalty-free, nonexclusive license to use, copy, and create derivative works of
+the Software outside of the United States.

eplacer-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,143 @@
+Metadata-Version: 2.4
+Name: eplacer
+Version: 0.1.0
+Summary: Machine learning platform for taxonomic classification
+Author: Christopher C Powers
+Author-email: christopher.powers@noaa.gov
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: Public Domain
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.12
+Description-Content-Type: text/markdown
+License-File: LICENSE.txt
+Requires-Dist: argh
+Requires-Dist: docopt>=0.6.2
+Requires-Dist: pytorch>=2.5
+Requires-Dist: torchvision
+Requires-Dist: torchinfo
+Requires-Dist: numpy
+Requires-Dist: pandas
+Requires-Dist: scipy
+Requires-Dist: scikit-learn
+Requires-Dist: networkx
+Requires-Dist: shapely
+Requires-Dist: matplotlib
+Requires-Dist: sympy
+Requires-Dist: tqdm
+Requires-Dist: pyyaml
+Requires-Dist: requests
+Requires-Dist: click
+Requires-Dist: pygeohash
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: license-file
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+## ePlacer
+ePlacer is a taxonomic classification tool that uses deep-learning approaches to incorporate both sequence information and biogeographic information into taxonomic assignment of DNA sequences.
+### Why use ePlacer
+The machine learning architecture of ePlacer enables prediction beyond sequence-only classification tools (e.g. sequence alignment with blast or naive-Bayes classifiers) by directly incorporating additional data into the probabilistic estimate of taxonomy, and it was developed specifically for metabarcoding data. This novel application of deep learning is useful because metabarcoding data often contain cases where two reference species have 100% sequence overlap but distinct geographic ranges. ePlacer discriminates between these cases and provides additional data for downstream taxonomic curation. As a result, ePlacer improves interoperability between metabarcoding datasets.
+Currently, ePlacer offers pre-trained models for two popular metabarcoding regions: the [MiFish](https://doi.org/10.1007/s12562-020-01461-x) and the [ecoPrimer, or Riaz,](https://doi.org/10.1093/nar/gkr732) marker gene regions. For these two regions, ePlacer offers the following benefits:
+* **Interoperability.** ePlacer is trained on global datasets, allowing for direct comparison between metabarcoding datasets, regardless of geographic region.
+* **Portability.** ePlacer has pre-trained models for both the MiFish and Riaz marker gene regions, containerized and available for out-of-the-box use.
+* **Interactive Visualization.** ePlacer provides an interactive GUI and curation tool that allows
+* **Increased Accuracy.** The ePlacer model architecture provides increased accuracy, precision, and recall compared to blast, naive-Bayes, or least common ancestor approaches.
+* **Trainability.** In addition to the two provided barcodes, this code repository provides tools for training new models.
+For other barcode regions, training a new model is likely to offer significant advantages. If you are interested in training a new model for ePlacer, please do not hesitate to reach out!
+### Installation
+Users can install the current version of ePlacer with conda.
+```bash
+conda install bioconda::eplacer
+```
+### Using ePlacer for classification
+The ePlacer taxonomic assignment tool can be run in two ways: natively (through the ePlacer CLI or API) or with a [QIIME2](https://github.com/NEFSC/PEMAD-PBB-q2-ePlacer) plugin. This documentation covers native usage; details on the QIIME2 plugin can be found in the linked git repository.
+ePlacer taxonomically classifies ASV sequences using two distinct types of information:
+- Sequence information (inferred from ASVs)
+- Biogeography (inferred from sample metadata and count tables)
+Although not strictly required for assignment, blast results are also used to automatically check "solvable" taxonomic assignments and resolve them more accurately as an automated curation step.
+Using this information, ePlacer generates a raw confidence of presence across all possible taxonomic labels.
+In order to run classification with ePlacer, four data files are required. Properly formatted examples can be seen here:
+- A fasta file of ASVs
+```bash
+>ASV1
+CCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACT
+>ASV2
+CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
+>ASV3
+CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
+```
+- A geography metadata file
+```bash
+#SampleID	Latitude	Longitude
+Sample1	39.645946	-71.746641
+Sample2	39.645946	-71.746641
+```
+- A count table
+```bash
+#OTU ID	Sample1	Sample2
+ASV1	15	0
+ASV2	5	22
+ASV3	0	10
+```
+- blast data output (generated with -outfmt "6 qseqid sseqid pident evalue length qlen slen qstart qend sstart send sseq")
+```bash
+ASV1	SubjectRef_A	100.00	1.45e-45	98	98	98	1	98	1	98	GCCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCAGCTTATAACCCAAAGGACTTGGCGCTGCTTCAGACCCCCCT
+ASV2	SubjectRef_B	99.00	2.12e-42	98	98	98	1	98	1	98	GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
+ASV3	SubjectRef_C	100.00	1.45e-45	98	98	98	1	98	1	98	GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
+```
+#### Acquiring pre-trained models
+Pre-trained models can be acquired from Zenodo (doi:10.5281/zenodo.20820029). Currently, only models for the 12S-V5 ecoPrimer and MiFish primers are available, but others will be created and stored in the future. If you develop your own model, please don't hesitate to reach out.
+Native models are distributed as tar archives that unpack into model directories, and can be obtained as follows:
+```bash
+wget https://zenodo.org/records/20820029/files/mifish.tar.gz
+tar -xzf mifish.tar.gz
+wget https://zenodo.org/records/20820029/files/riaz.tar.gz
+tar -xzf riaz.tar.gz
+```
+Note that we also provide pre-compiled *.qza models for use with QIIME2. These can be found in the same Zenodo repository.
+#### Running Classification with Pre-trained models
+To classify with a pre-trained model (or a model you have trained yourself), use the following command:
+```bash
+eplacer run-model --fasta <fasta path> --counts <count matrix> --geoData <geoData path> --blast <blast path> --confidence <threshold> --model <model path> --maskrate 0
+```
+#### Training new ePlacer models
+Training a new ePlacer model requires three inputs: an aligned fasta file for the barcode of interest (containing all available references of interest), a flat taxonomy file, and a reference file for biogeography (currently, ePlacer supports the [OBIS csv download](https://obis.org/data/access/)).
+ePlacer also supports custom references for biogeography, formatted as follows:
+```bash
+#Species	Latitude	Longitude
+SpeciesLabelA	39.645946	-71.746641
+SpeciesLabelB	39.645946	-71.746641
+```
+To run the training, use the following:
+```bash
+eplacer train-model --fasta <alignment file> --taxa <taxonomy file> \
+            --out <output directory> --taxlevel SPECIES \
+            --geoData <obis data> --augments <several values should be tested here> \
+            --maskrate <several values should be tested here> --threads 1
+```
+==============================================================
+This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.

eplacer-0.1.0/README.md ADDED

@@ -0,0 +1,103 @@
+## ePlacer
+ePlacer is a taxonomic classification tool that uses deep-learning approaches to incorporate both sequence information and biogeographic information into taxonomic assignment of DNA sequences.
+### Why use ePlacer
+The machine learning architecture of ePlacer enables prediction beyond sequence-only classification tools (e.g. sequence alignment with blast or naive-Bayes classifiers) by directly incorporating additional data into the probabilistic estimate of taxonomy, and it was developed specifically for metabarcoding data. This novel application of deep learning is useful because metabarcoding data often contain cases where two reference species have 100% sequence overlap but distinct geographic ranges. ePlacer discriminates between these cases and provides additional data for downstream taxonomic curation. As a result, ePlacer improves interoperability between metabarcoding datasets.
+Currently, ePlacer offers pre-trained models for two popular metabarcoding regions: the [MiFish](https://doi.org/10.1007/s12562-020-01461-x) and the [ecoPrimer, or Riaz,](https://doi.org/10.1093/nar/gkr732) marker gene regions. For these two regions, ePlacer offers the following benefits:
+* **Interoperability.** ePlacer is trained on global datasets, allowing for direct comparison between metabarcoding datasets, regardless of geographic region.
+* **Portability.** ePlacer has pre-trained models for both the MiFish and Riaz marker gene regions, containerized and available for out-of-the-box use.
+* **Interactive Visualization.** ePlacer provides an interactive GUI and curation tool that allows
+* **Increased Accuracy.** The ePlacer model architecture provides increased accuracy, precision, and recall compared to blast, naive-Bayes, or least common ancestor approaches.
+* **Trainability.** In addition to the two provided barcodes, this code repository provides tools for training new models.
+For other barcode regions, training a new model is likely to offer significant advantages. If you are interested in training a new model for ePlacer, please do not hesitate to reach out!
+### Installation
+Users can install the current version of ePlacer with conda.
+```bash
+conda install bioconda::eplacer
+```
+### Using ePlacer for classification
+The ePlacer taxonomic assignment tool can be run in two ways: natively (through the ePlacer CLI or API) or with a [QIIME2](https://github.com/NEFSC/PEMAD-PBB-q2-ePlacer) plugin. This documentation covers native usage; details on the QIIME2 plugin can be found in the linked git repository.
+ePlacer taxonomically classifies ASV sequences using two distinct types of information:
+- Sequence information (inferred from ASVs)
+- Biogeography (inferred from sample metadata and count tables)
+Although not strictly required for assignment, blast results are also used to automatically check "solvable" taxonomic assignments and resolve them more accurately as an automated curation step.
+Using this information, ePlacer generates a raw confidence of presence across all possible taxonomic labels.
+In order to run classification with ePlacer, four data files are required. Properly formatted examples can be seen here:
+- A fasta file of ASVs
+```bash
+>ASV1
+CCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACT
+>ASV2
+CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
+>ASV3
+CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
+```
+- A geography metadata file
+```bash
+#SampleID	Latitude	Longitude
+Sample1	39.645946	-71.746641
+Sample2	39.645946	-71.746641
+```
+- A count table
+```bash
+#OTU ID	Sample1	Sample2
+ASV1	15	0
+ASV2	5	22
+ASV3	0	10
+```
+- blast data output (generated with -outfmt "6 qseqid sseqid pident evalue length qlen slen qstart qend sstart send sseq")
+```bash
+ASV1	SubjectRef_A	100.00	1.45e-45	98	98	98	1	98	1	98	GCCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCAGCTTATAACCCAAAGGACTTGGCGCTGCTTCAGACCCCCCT
+ASV2	SubjectRef_B	99.00	2.12e-42	98	98	98	1	98	1	98	GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
+ASV3	SubjectRef_C	100.00	1.45e-45	98	98	98	1	98	1	98	GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
+```
+#### Acquiring pre-trained models
+Pre-trained models can be acquired from Zenodo (doi:10.5281/zenodo.20820029). Currently, only models for the 12S-V5 ecoPrimer and MiFish primers are available, but others will be created and stored in the future. If you develop your own model, please don't hesitate to reach out.
+Native models are distributed as tar archives that unpack into model directories, and can be obtained as follows:
+```bash
+wget https://zenodo.org/records/20820029/files/mifish.tar.gz
+tar -xzf mifish.tar.gz
+wget https://zenodo.org/records/20820029/files/riaz.tar.gz
+tar -xzf riaz.tar.gz
+```
+Note that we also provide pre-compiled *.qza models for use with QIIME2. These can be found in the same Zenodo repository.
+#### Running Classification with Pre-trained models
+To classify with a pre-trained model (or a model you have trained yourself), use the following command:
+```bash
+eplacer run-model --fasta <fasta path> --counts <count matrix> --geoData <geoData path> --blast <blast path> --confidence <threshold> --model <model path> --maskrate 0
+```
+#### Training new ePlacer models
+Training a new ePlacer model requires three inputs: an aligned fasta file for the barcode of interest (containing all available references of interest), a flat taxonomy file, and a reference file for biogeography (currently, ePlacer supports the [OBIS csv download](https://obis.org/data/access/)).
+ePlacer also supports custom references for biogeography, formatted as follows:
+```bash
+#Species	Latitude	Longitude
+SpeciesLabelA	39.645946	-71.746641
+SpeciesLabelB	39.645946	-71.746641
+```
+To run the training, use the following:
+```bash
+eplacer train-model --fasta <alignment file> --taxa <taxonomy file> \
+            --out <output directory> --taxlevel SPECIES \
+            --geoData <obis data> --augments <several values should be tested here> \
+            --maskrate <several values should be tested here> --threads 1
+```
+==============================================================
+This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.
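
The README above describes a fasta file, a geography metadata file, and a count table that must refer to the same ASV and sample identifiers. The following is a minimal sketch of a pre-flight consistency check, assuming tab-delimited tables with a `#`-prefixed header as in the README examples. The function names (`read_fasta_ids`, `read_tsv`, `check_inputs`) are hypothetical and are not part of ePlacer; the sequences are shortened, and Sample2 is deliberately left out of the geography data to show a reported problem.

```python
import csv
import io

def read_fasta_ids(handle):
    """Return the record IDs (text after '>') in order of appearance."""
    return [line[1:].strip() for line in handle if line.startswith(">")]

def read_tsv(handle):
    """Return (header, rows) for a tab-delimited table with a '#'-prefixed header."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header = [rows[0][0].lstrip("#")] + rows[0][1:]
    return header, rows[1:]

def check_inputs(fasta, geo, counts):
    """Hypothetical check (not ePlacer code); returns a list of problems found."""
    problems = []
    asv_ids = set(read_fasta_ids(fasta))
    _, geo_rows = read_tsv(geo)
    geo_samples = {row[0] for row in geo_rows}
    count_header, count_rows = read_tsv(counts)
    # Every sample column in the count table needs coordinates
    for sample in count_header[1:]:
        if sample not in geo_samples:
            problems.append(f"sample {sample} has counts but no coordinates")
    # Every counted ASV needs a sequence
    for row in count_rows:
        if row[0] not in asv_ids:
            problems.append(f"{row[0]} is in the count table but not the fasta")
    return problems

fasta = io.StringIO(">ASV1\nCCGT\n>ASV2\nCGGT\n>ASV3\nCGGT\n")
geo = io.StringIO("#SampleID\tLatitude\tLongitude\nSample1\t39.645946\t-71.746641\n")
counts = io.StringIO("#OTU ID\tSample1\tSample2\nASV1\t15\t0\nASV2\t5\t22\nASV3\t0\t10\n")
problems = check_inputs(fasta, geo, counts)
print(problems)
```

Running a check like this before `eplacer run-model` can surface mismatched identifiers early; whether ePlacer performs equivalent validation internally is not shown in this diff.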

eplacer-0.1.0/eplacer/__init__.py ADDED

File without changes

eplacer-0.1.0/eplacer/__main__.py ADDED

@@ -0,0 +1,176 @@
+#! /usr/bin/env python
+'''
+Usage:
+    eplacer [--version] [--help] <command> [<args>...]
+Options:
+    -h, --help          Generate Help Screen
+    -v, --version       Get Version Number
+General Commands:
+    train-model         Trains a convolutional neural network to perform
+                        a classification task to a specific taxonomic
+                        group.
+                        Options for classification are the following
+                        - sequence-only
+                        - sequence-geo
+    run-model           Runs the convolutional neural network on new
+                        data to assign taxonomy
+See 'eplacer <command> --help' for more information on a command
+'''
+import sys
+import os
+from docopt import docopt
+import multiprocessing
+def main():
+    args = docopt(__doc__,
+                  version='',
+                  options_first=True)
+    argv = [args['<command>']] + args['<args>']
+    if args['<command>'] == 'train-model':
+        import eplacer.train_command
+        # train_evaluate is called below, so import it explicitly
+        import eplacer.train_evaluate
+        args = docopt(eplacer.train_command.__doc__, argv=argv)
+        # Check if provided directory exists, if provided
+        if args['--out']:
+            if os.path.exists(args['--out']):
+                if args['--force'] == False:
+                    raise Exception('The path already exists! Exiting...\n')
+        # Set default out directory if it doesn't exist
+        else:
+            args['--out']="data/models/"
+            if os.path.exists(args['--out']):
+                if args['--force'] == False:
+                    raise Exception('The path already exists! Exiting...\n')
+        # Check for existence of required input
+        if args['--fasta']:
+            if not os.path.exists(args['--fasta']):
+                raise Exception('No fasta file exists at this location! Exiting...\n')
+        else:
+            raise Exception('No fasta file specified! Exiting...\n')
+        if args['--taxa']:
+            if not os.path.exists(args['--taxa']):
+                raise Exception('No taxa file exists at this location! Exiting...\n')
+        else:
+            raise Exception('No taxa file specified! Exiting...\n')
+        # Default to the species level. Which may or may not work
+        if not args['--taxlevel']:
+            args['--taxlevel']="SPECIES"
+        # Check for the geo data. Set mode of running based on this.
+        if args['--geoData']:
+            if not os.path.exists(args['--geoData']):
+                raise Exception("GeoData path does not exist! Exiting\n")
+            else:
+                sys.stdout.write("Setting mode to train on "
+                                 "sequence and geographic data\n")
+                mode='sequence_geo'
+            if not args['--kernel']:
+                kernel = 3
+            else:
+                kernel = int(args['--kernel'])
+            if not args['--sigma']:
+                sigma = 1
+            else:
+                sigma = int(args['--sigma'])
+            if not args['--precision']:
+                precision = 3
+            else:
+                precision = float(args['--precision'])
+        else:
+            sys.stdout.write("Setting mode to train on "
+                             "sequence data only\n")
+            mode='sequence'
+        if not args['--taxlevel']:
+            args['--taxlevel'] = "SPECIES"
+        if not args['--maskrate']:
+            maskrate = 0
+        else:
+            maskrate = float(args['--maskrate'])
+        if not args['--augments']:
+            augments = 5
+        else:
+            augments = int(args['--augments'])
+            print("augments: ", augments)
+        if mode == 'sequence':
+            # Pass the parsed numeric values, not the raw command-line strings
+            eplacer.train_evaluate.train_sequence(args['--fasta'], args['--taxa'], args['--taxlevel'],args['--out'],augments,maskrate)
+        elif mode == 'sequence_geo':
+            eplacer.train_evaluate.train_sequenceOBIS(args['--fasta'], args['--taxa'], args['--taxlevel'],args['--out'],args['--geoData'],
+                                                      augments,maskrate,sigma,kernel,precision)
+        sys.exit()
+    elif args['<command>'] == 'run-model':
+        import eplacer.run_command
+        import eplacer.run_model
+        args = docopt(eplacer.run_command.__doc__, argv=argv)
+        # check that the provided directory exists, if provided
+        if args['--out']:
+            if os.path.exists(args['--out']):
+                if args['--force'] == False:
+                    raise Exception('The path already exists! Exiting...\n')
+        # Set default out directory if it doesn't exist
+        else:
+            args['--out']="result/models/"
+            if os.path.exists(args['--out']):
+                if args['--force'] == False:
+                    raise Exception('The path already exists! Exiting...\n')
+        if args['--fasta']:
+            if not os.path.exists(args['--fasta']):
+                raise Exception('No fasta file exists at this location! Exiting...\n')
+        else:
+            raise Exception('No fasta file specified! Exiting...\n')
+        if args['--blast']:
+            if not os.path.exists(args['--blast']):
+                raise Exception('No blast result file exists at this location! Exiting...\n')
+        else:
+            raise Exception('No blast result file specified! Exiting...\n')
+        if args['--model']:
+            if not os.path.exists(args['--model']):
+                raise Exception('No model file exists at this location! Exiting...\n')
+            args['--taxfile'] = str(args['--model']) + "/taxa_key_SPECIES.tsv"
+            if not os.path.exists(args['--taxfile']):
+                raise Exception('No taxfile file exists at this location! Your model directory may be corrupted. Exiting...\n')
+        else:
+            raise Exception('No model file specified! Exiting...\n')
+        if not args['--taxlevel']:
+            args['--taxlevel'] = "SPECIES"
+        # Cap the thread count at the number of available CPUs and store it as an int
+        cpu_count = multiprocessing.cpu_count()
+        if args['--threads']:
+            args['--threads'] = int(args['--threads'])
+            if args['--threads'] > cpu_count:
+                args['--threads'] = cpu_count
+                print(f"Too many threads requested. Setting to {cpu_count}")
+        else:
+            args['--threads'] = cpu_count
+            print(f"No threads requested. Setting to {cpu_count}")
+        if not args['--maskrate']:
+            maskrate = 0
+        else:
+            maskrate = float(args['--maskrate'])
+        # Check for the geo data. Set mode of running based on this.
+        if args['--geoData']:
+            if not os.path.exists(args['--geoData']):
+                raise Exception("GeoData path does not exist! Exiting\n")
+            else:
+                sys.stdout.write("Setting mode to run on "
+                                 "sequence and geographic data\n")
+                mode='sequence_geo'
+            if not args['--counts']:
+                raise Exception('Abundance matrix not specified! Exiting\n')
+            elif not os.path.exists(args['--counts']):
+                raise Exception('The path to the count matrix does not exist! Exiting\n')
+            if not args['--kernel']:
+                kernel = 3
+            else:
+                kernel = int(args['--kernel'])
+            if not args['--sigma']:
+                sigma = 1
+            else:
+                sigma = int(args['--sigma'])
+            if not args['--confidence']:
+                args['--confidence'] = 0.9
+            else:
+                args['--confidence'] = float(args['--confidence'])
+        else:
+            raise Exception('No geoData available. Exiting...\n')
+        eplacer.run_model.gen_model_output_OBIS(args['--confidence'], args['--blast'], args['--out'],args['--fasta'], args['--model'], args['--taxlevel'],args['--taxfile'],args['--geoData'],args['--counts'],maskrate,sigma,kernel, args['--threads'])
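
The option handling in `__main__.py` repeats one pattern: use a default when an option is absent, otherwise cast the string supplied on the command line. A minimal sketch of that pattern follows, assuming docopt-style `None` for missing options. The helpers `option_or_default` and `resolve_threads` are hypothetical names for illustration and do not exist in ePlacer.

```python
import multiprocessing

def option_or_default(args, name, default, cast):
    """Return cast(args[name]) when the option was given, else the default."""
    value = args.get(name)
    return default if value is None else cast(value)

def resolve_threads(requested, available=None):
    """Clamp a requested thread count to the number of available CPUs."""
    if available is None:
        available = multiprocessing.cpu_count()
    if requested is None:
        return available
    return min(int(requested), available)

# Example with a parsed-arguments dictionary shaped like docopt's output
args = {"--kernel": None, "--sigma": "2", "--maskrate": "0.1", "--threads": "64"}
kernel = option_or_default(args, "--kernel", 3, int)
sigma = option_or_default(args, "--sigma", 1, int)
maskrate = option_or_default(args, "--maskrate", 0.0, float)
threads = resolve_threads(args["--threads"], available=8)
print(kernel, sigma, maskrate, threads)
```

Centralizing the cast keeps every downstream call receiving numbers rather than a mix of defaults and raw strings.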

eplacer-0.1.0/eplacer/data_prep.py ADDED

@@ -0,0 +1,97 @@
+"""
+This script defines some useful code for prepping datasets
+for ePlacer
+Author: Christopher Powers
+Institution: NOAA NEFSC PEMAD PBB
+"""
+import numpy as np
+from torch.utils.data import Dataset
+import torch
+def get_degenerate_bases():
+    """
+    Returns a dictionary mapping IUPAC degenerate bases to their possible canonical bases
+    """
+    return {
+        'A': ['A'],
+        'C': ['C'],
+        'G': ['G'],
+        'T': ['T'],
+        'R': ['A', 'G'],           # Purine
+        'Y': ['C', 'T'],           # Pyrimidine
+        'M': ['A', 'C'],           # Amino
+        'K': ['G', 'T'],           # Keto
+        'S': ['C', 'G'],           # Strong
+        'W': ['A', 'T'],           # Weak
+        'H': ['A', 'C', 'T'],      # not G
+        'B': ['C', 'G', 'T'],      # not A
+        'V': ['A', 'C', 'G'],      # not T
+        'D': ['A', 'G', 'T'],      # not C
+        'N': ['-'],                # any base is not informative. Encode as gap
+        '-': ['-']                 # gap
+    }
+def encode_onehot(seq, mask_token = "N"):
+    """
+    Function to encode an individual sequence with one hot encoding,
+    handling degenerate bases by averaging their possible canonical forms
+    """
+    mapping = {
+        "A": [1., 0., 0., 0.],
+        "C": [0., 1., 0., 0.],
+        "G": [0., 0., 1., 0.],
+        "T": [0., 0., 0., 1.],
+        "-": [0., 0., 0., 0.],
+        mask_token: [0.25, 0.25, 0.25, 0.25]
+    }
+    degenerate_bases = get_degenerate_bases()
+    # Pre-calculate average encodings for degenerate bases
+    for base, possibilities in degenerate_bases.items():
+        if base not in mapping and len(possibilities) > 0:
+            avg_encoding = np.zeros(4)
+            for canonical_base in possibilities:
+                if canonical_base in mapping:
+                    avg_encoding += np.array(mapping[canonical_base])
+            mapping[base] = (avg_encoding / len(possibilities)).tolist()
+    # Look up each base in turn; unrecognized characters become an all-zero vector
+    return np.array([mapping.get(base, [0., 0., 0., 0.]) for base in seq])
+class SeqGeoDataset(Dataset):
+    """
+    Dataset that stores the one-hot encoded data alongside
+    the geographic data
+    """
+    def __init__(self, sequences, labels, geo_data):
+        self.seqs = []
+        self.geo = []
+        self.taxa_labels = []
+        for i in range(0,len(sequences)):
+            self.seqs.append(sequences[i])
+        for i in range(0,len(labels)):
+            self.taxa_labels.append(labels[i])
+        for i in range(0,len(geo_data)):
+            self.geo.append(geo_data[i])
+        # Encode once: sequences as one-hot tensors, geography as float tensors
+        self.ohe_seqs = torch.stack([torch.from_numpy(encode_onehot(x)).float() for x in self.seqs])
+        self.ohe_geo = torch.stack([torch.from_numpy(x).float() for x in self.geo])
+        self.labels = torch.Tensor(self.taxa_labels).long()
+    def __len__(self): return len(self.seqs)
+    def __getitem__(self,idx):
+        seq = self.ohe_seqs[idx]
+        label = self.labels[idx]
+        geo = self.ohe_geo[idx]
+        return seq, geo, label
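
To illustrate the averaging that `encode_onehot` applies to degenerate bases, here is a dependency-free sketch using plain lists instead of NumPy arrays. The names `encode_base` and `encode_sequence` are hypothetical. As in `encode_onehot` with its default `mask_token`, `N` is given a uniform 0.25 vector here, because the `mask_token` entry is already in the mapping and takes precedence over the `'N': ['-']` entry in `get_degenerate_bases`.

```python
CANONICAL = "ACGT"

# IUPAC degenerate codes and the canonical bases each one stands for
DEGENERATE = {
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
}

def encode_base(base):
    """Return a 4-element encoding over A, C, G, T for one character."""
    if base == "N":
        return [0.25, 0.25, 0.25, 0.25]
    members = DEGENERATE.get(base, base)
    hits = [1.0 if c in members else 0.0 for c in CANONICAL]
    total = sum(hits)
    # Gaps and unknown characters match nothing and stay all-zero
    return [h / total for h in hits] if total else hits

def encode_sequence(seq):
    """Encode a whole sequence as a list of per-base vectors."""
    return [encode_base(base) for base in seq]

encoded = encode_sequence("AR-N")
for base, vector in zip("AR-N", encoded):
    print(base, vector)
```

Each degenerate code spreads unit weight evenly over the bases it can represent, so every non-gap row sums to 1.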

eplacer-0.1.0/eplacer/external.py ADDED

@@ -0,0 +1,60 @@
+"""
+This script runs mafft as a subprocess and generates
+an alignment.
+Author: Christopher Powers
+Institution: NOAA NEFSC PEMAD PBB
+"""
+import subprocess
+def run_mafft(input, reference, moutput, subset_output, threads):
+    """
+    Run the MAFFT alignment to add your sequences to a new fasta
+    """
+    # Get initial IDs
+    names = []
+    with open(input, "r") as infile:
+        for line in infile:
+            if line.startswith(">"):
+                names.append(line[1:].rstrip())
+    try:
+        print("Beginning subprocess...")
+        print("Aligning with mafft...")
+        # Pass arguments as a list (no shell) and write MAFFT's stdout to the output file
+        command = ["mafft", "--add", input, "--adjustdirection", "--thread", str(threads),
+                   "--keeplength", "--reorder", reference]
+        with open(moutput, "w") as outfile:
+            subprocess.run(command, stdout=outfile, check=True)
+    except subprocess.CalledProcessError as e:
+        print(f"MAFFT execution failed with error: {e}")
+    except FileNotFoundError:
+        print("MAFFT not found. Is it installed/in the path?")
+    # subset the new file
+    key = ''
+    seq = ''
+    seqdict = {}
+    with open(moutput, "r") as infile:
+        for line in infile:
+            line = line.rstrip()
+            if line.startswith(">_R_"):
+                if key != '':
+                    seqdict[key] = seq.upper()
+                    seq = ''
+                key = line[4:]
+            elif line.startswith(">"):
+                if key != '':
+                    seqdict[key] = seq.upper()
+                    seq = ''
+                key = line[1:]
+            else:
+                seq += line
+    # Store the final record, which has no following header to trigger the write
+    if key != '':
+        seqdict[key] = seq.upper()
+    with open(subset_output, "w") as outfile:
+        for s in seqdict:
+            if s in names:
+                outfile.write(f">{s}\n{seqdict[s]}\n")
+    return seqdict
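
The alignment parsing in `run_mafft` can be sketched as a standalone function that works on any iterable of lines. This is an illustrative sketch, not the package's code: `parse_aligned_fasta` is a hypothetical name. It strips the `_R_` prefix that MAFFT's `--adjustdirection` adds to reverse-complemented sequences, and it stores the final record after the loop, since no later header line triggers that write.

```python
def parse_aligned_fasta(lines):
    """Map record names to upper-cased sequences, dropping any '_R_' prefix."""
    records = {}
    key = None
    chunks = []
    for raw in lines:
        line = raw.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record
            if key is not None:
                records[key] = "".join(chunks).upper()
            name = line[1:]
            key = name[3:] if name.startswith("_R_") else name
            chunks = []
        elif key is not None:
            chunks.append(line)
    # The last record has no following header, so store it here
    if key is not None:
        records[key] = "".join(chunks).upper()
    return records

example = [">ref1\n", "acgt\n", "--ac\n", ">_R_ASV1\n", "ttaa\n"]
records = parse_aligned_fasta(example)
print(records)
```

Collecting sequence lines in a list and joining once avoids repeated string concatenation on long alignments.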