eplacer 0.1.0__tar.gz

@@ -0,0 +1,5 @@
1
+ Software code created by U.S. Government employees is not subject to copyright in the United States (17 U.S.C. §105).
2
+ The United States/Department of Commerce reserve all rights to seek and obtain copyright protection in countries other
3
+ than the United States for Software authored in its entirety by the Department of Commerce. To this end, the Department
4
+ of Commerce hereby grants to Recipient a royalty-free, nonexclusive license to use, copy, and create derivative works of
5
+ the Software outside of the United States.
eplacer-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,143 @@
1
+ Metadata-Version: 2.4
2
+ Name: eplacer
3
+ Version: 0.1.0
4
+ Summary: Machine learning platform for taxonomic classification
5
+ Author: Christopher C Powers
6
+ Author-email: christopher.powers@noaa.gov
7
+ Classifier: Programming Language :: Python :: 3
8
+ Classifier: License :: Public Domain
9
+ Classifier: Operating System :: OS Independent
10
+ Requires-Python: >=3.12
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE.txt
13
+ Requires-Dist: argh
14
+ Requires-Dist: docopt>=0.6.2
15
+ Requires-Dist: pytorch>=2.5
16
+ Requires-Dist: torchvision
17
+ Requires-Dist: torchinfo
18
+ Requires-Dist: numpy
19
+ Requires-Dist: pandas
20
+ Requires-Dist: scipy
21
+ Requires-Dist: scikit-learn
22
+ Requires-Dist: networkx
23
+ Requires-Dist: shapely
24
+ Requires-Dist: matplotlib
25
+ Requires-Dist: sympy
26
+ Requires-Dist: tqdm
27
+ Requires-Dist: pyyaml
28
+ Requires-Dist: requests
29
+ Requires-Dist: click
30
+ Requires-Dist: pygeohash
31
+ Dynamic: author
32
+ Dynamic: author-email
33
+ Dynamic: classifier
34
+ Dynamic: description
35
+ Dynamic: description-content-type
36
+ Dynamic: license-file
37
+ Dynamic: requires-dist
38
+ Dynamic: requires-python
39
+ Dynamic: summary
40
+
41
+ ## ePlacer
42
+
43
+ ePlacer is a taxonomic classification tool that uses deep-learning approaches to incorporate both sequence information and biogeographic information into taxonomic assignment of DNA sequences.
44
+
45
+ ### Why use ePlacer
46
+
47
+ The machine learning architecture of ePlacer, developed specifically for metabarcoding data, enables prediction beyond sequence-only classification tools (e.g. sequence alignment with BLAST or naive Bayes classifiers) by directly incorporating additional data into the probabilistic estimate of taxonomy. This novel application of deep learning is useful because metabarcoding data often contain cases where two reference species have 100% sequence overlap but distinct geographic ranges. ePlacer discriminates these cases and provides additional data for downstream taxonomic curation. As a result, ePlacer provides enhanced interoperability between metabarcoding datasets.
48
+
49
+ Currently, ePlacer offers pre-trained models for two popular metabarcoding regions: the [MiFish](https://doi.org/10.1007/s12562-020-01461-x) and the [ecoPrimer, or Riaz,](https://doi.org/10.1093/nar/gkr732) marker gene regions. For these two regions, ePlacer offers the following benefits:
50
+
51
+ * **Interoperability.** ePlacer is trained on global datasets, allowing for direct comparison between metabarcoding datasets, regardless of geographic region.
52
+ * **Portability.** Pre-trained ePlacer models for both the MiFish and Riaz marker gene regions are containerized and available for out-of-the-box use.
53
+ * **Interactive Visualization.** ePlacer provides an interactive GUI and curation tool that allows
54
+ * **Increased Accuracy.** The ePlacer model architecture provides increased accuracy, precision, and recall compared to BLAST, naive Bayes, or least common ancestor approaches.
55
+ * **Trainability.** In addition to the two provided barcodes, this code repository provides tools for training new models.
56
+
57
+ For other barcode regions, there will be significant advantages with the training of new models. If you are interested in training a new model for ePlacer, please do not hesitate to reach out!
58
+
59
+ ### Installation
60
+ Users can install the current version of ePlacer with conda.
61
+ ```bash
62
+ conda install bioconda::eplacer
63
+ ```
64
+
65
+ ### Using ePlacer for classification
66
+ The ePlacer taxonomic assignment tool can be run in two ways: natively (through the ePlacer CLI or API) or with a [QIIME2](https://github.com/NEFSC/PEMAD-PBB-q2-ePlacer) plugin. This documentation covers native usage. Details on the QIIME2 plugin can be found in the linked git repository.
67
+
68
+ ePlacer taxonomically classifies ASV sequences using two distinct types of information:
69
+ - Sequence information (inferred from ASVs)
70
+ - Biogeography (inferred from sample metadata and count tables)
71
+
72
+ Although not strictly required for assignment, blast results are also used to automatically check "solvable" taxonomic assignments and resolve them more accurately as an automated curation step.
73
+
74
+ Using this information, ePlacer generates a raw confidence of presence across all possible taxonomic labels.
75
+
76
+ In order to run classification with ePlacer, four data files are required. Properly formatted examples can be seen here:
77
+ - A fasta file of ASVs
78
+ ```bash
79
+ >ASV1
80
+ CCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACT
81
+ >ASV2
82
+ CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
83
+ >ASV3
84
+ CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
85
+ ```
86
+ - A geography metadata file
87
+ ```bash
88
+ #SampleID Latitude Longitude
89
+ Sample1 39.645946 -71.746641
90
+ Sample2 39.645946 -71.746641
91
+ ```
92
+ - A count table
93
+ ```bash
94
+ #OTU ID Sample1 Sample2
95
+ ASV1 15 0
96
+ ASV2 5 22
97
+ ASV3 0 10
98
+ ```
99
+ - blast data output (generated with -outfmt "6 qseqid sseqid pident evalue length qlen slen qstart qend sstart send sseq")
100
+ ```bash
101
+ ASV1 SubjectRef_A 100.00 1.45e-45 98 98 98 1 98 1 98 GCCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCAGCTTATAACCCAAAGGACTTGGCGCTGCTTCAGACCCCCCT
102
+ ASV2 SubjectRef_B 99.00 2.12e-42 98 98 98 1 98 1 98 GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
103
+ ASV3 SubjectRef_C 100.00 1.45e-45 98 98 98 1 98 1 98 GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
104
+ ```
105
+ #### Acquiring pre-trained models
106
+ Pre-trained models can be acquired from Zenodo (doi:10.5281/zenodo.20820029). Currently, only models for the 12S-V5 ecoPrimer and MiFish primers are available, but others will be added in the future. If you develop your own model, please don't hesitate to reach out.
107
+
108
+ Native models are distributed as archived directories and can be downloaded and unpacked as follows:
109
+ ```bash
110
+ wget https://zenodo.org/records/20820029/files/mifish.tar.gz
111
+ tar -xzf mifish.tar.gz
112
+ wget https://zenodo.org/records/20820029/files/riaz.tar.gz
113
+ tar -xzf riaz.tar.gz
114
+ ```
115
+ Note that we also provide pre-compiled `*.qza` models for use with QIIME2. These can be found in the same Zenodo repository.
116
+
117
+ #### Running Classification with Pre-trained models
118
+ With a pre-trained model, or one you have trained yourself, run classification as follows:
119
+ ```bash
120
+ eplacer run-model --fasta <fasta path> --counts <count matrix> --geoData <geoData path> --blast <blast path> --confidence <threshold> --model <model path> --maskrate 0
121
+ ```
122
+
123
+ #### Training new ePlacer models
124
+ Training a new ePlacer model requires three inputs: an aligned fasta file for the barcode of interest (containing all available references of interest), a flat taxonomy file, and a reference file for biogeography (currently, ePlacer supports the [OBIS csv download](https://obis.org/data/access/)).
125
+
126
+ ePlacer also supports custom references for biogeography, formatted as follows:
127
+ ```bash
128
+ #Species Latitude Longitude
129
+ SpeciesLabelA 39.645946 -71.746641
130
+ SpeciesLabelB 39.645946 -71.746641
131
+ ```
132
+
133
+ To run the training, use the following:
134
+ ```bash
135
+ eplacer train-model --fasta <alignment file> --taxa <taxonomy file> \
136
+ --out <output directory> --taxlevel SPECIES \
137
+ --geoData <obis data> --augments <several values should be tested here> \
138
+ --maskrate <several values should be tested here> --threads 1
139
+ ```
140
+
141
+ ==============================================================
142
+
143
+ This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.
@@ -0,0 +1,103 @@
1
+ ## ePlacer
2
+
3
+ ePlacer is a taxonomic classification tool that uses deep-learning approaches to incorporate both sequence information and biogeographic information into taxonomic assignment of DNA sequences.
4
+
5
+ ### Why use ePlacer
6
+
7
+ The machine learning architecture of ePlacer, developed specifically for metabarcoding data, enables prediction beyond sequence-only classification tools (e.g. sequence alignment with BLAST or naive Bayes classifiers) by directly incorporating additional data into the probabilistic estimate of taxonomy. This novel application of deep learning is useful because metabarcoding data often contain cases where two reference species have 100% sequence overlap but distinct geographic ranges. ePlacer discriminates these cases and provides additional data for downstream taxonomic curation. As a result, ePlacer provides enhanced interoperability between metabarcoding datasets.
8
+
9
+ Currently, ePlacer offers pre-trained models for two popular metabarcoding regions: the [MiFish](https://doi.org/10.1007/s12562-020-01461-x) and the [ecoPrimer, or Riaz,](https://doi.org/10.1093/nar/gkr732) marker gene regions. For these two regions, ePlacer offers the following benefits:
10
+
11
+ * **Interoperability.** ePlacer is trained on global datasets, allowing for direct comparison between metabarcoding datasets, regardless of geographic region.
12
+ * **Portability.** Pre-trained ePlacer models for both the MiFish and Riaz marker gene regions are containerized and available for out-of-the-box use.
13
+ * **Interactive Visualization.** ePlacer provides an interactive GUI and curation tool that allows
14
+ * **Increased Accuracy.** The ePlacer model architecture provides increased accuracy, precision, and recall compared to BLAST, naive Bayes, or least common ancestor approaches.
15
+ * **Trainability.** In addition to the two provided barcodes, this code repository provides tools for training new models.
16
+
17
+ For other barcode regions, there will be significant advantages with the training of new models. If you are interested in training a new model for ePlacer, please do not hesitate to reach out!
18
+
19
+ ### Installation
20
+ Users can install the current version of ePlacer with conda.
21
+ ```bash
22
+ conda install bioconda::eplacer
23
+ ```
24
+
25
+ ### Using ePlacer for classification
26
+ The ePlacer taxonomic assignment tool can be run in two ways: natively (through the ePlacer CLI or API) or with a [QIIME2](https://github.com/NEFSC/PEMAD-PBB-q2-ePlacer) plugin. This documentation covers native usage. Details on the QIIME2 plugin can be found in the linked git repository.
27
+
28
+ ePlacer taxonomically classifies ASV sequences using two distinct types of information:
29
+ - Sequence information (inferred from ASVs)
30
+ - Biogeography (inferred from sample metadata and count tables)
31
+
32
+ Although not strictly required for assignment, blast results are also used to automatically check "solvable" taxonomic assignments and resolve them more accurately as an automated curation step.
33
+
34
+ Using this information, ePlacer generates a raw confidence of presence across all possible taxonomic labels.
35
+
36
+ In order to run classification with ePlacer, four data files are required. Properly formatted examples can be seen here:
37
+ - A fasta file of ASVs
38
+ ```bash
39
+ >ASV1
40
+ CCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACT
41
+ >ASV2
42
+ CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
43
+ >ASV3
44
+ CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
45
+ ```
46
+ - A geography metadata file
47
+ ```bash
48
+ #SampleID Latitude Longitude
49
+ Sample1 39.645946 -71.746641
50
+ Sample2 39.645946 -71.746641
51
+ ```
52
+ - A count table
53
+ ```bash
54
+ #OTU ID Sample1 Sample2
55
+ ASV1 15 0
56
+ ASV2 5 22
57
+ ASV3 0 10
58
+ ```
59
+ - blast data output (generated with -outfmt "6 qseqid sseqid pident evalue length qlen slen qstart qend sstart send sseq")
60
+ ```bash
61
+ ASV1 SubjectRef_A 100.00 1.45e-45 98 98 98 1 98 1 98 GCCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCAGCTTATAACCCAAAGGACTTGGCGCTGCTTCAGACCCCCCT
62
+ ASV2 SubjectRef_B 99.00 2.12e-42 98 98 98 1 98 1 98 GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
63
+ ASV3 SubjectRef_C 100.00 1.45e-45 98 98 98 1 98 1 98 GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
64
+ ```
65
+ #### Acquiring pre-trained models
66
+ Pre-trained models can be acquired from Zenodo (doi:10.5281/zenodo.20820029). Currently, only models for the 12S-V5 ecoPrimer and MiFish primers are available, but others will be added in the future. If you develop your own model, please don't hesitate to reach out.
67
+
68
+ Native models are distributed as archived directories and can be downloaded and unpacked as follows:
69
+ ```bash
70
+ wget https://zenodo.org/records/20820029/files/mifish.tar.gz
71
+ tar -xzf mifish.tar.gz
72
+ wget https://zenodo.org/records/20820029/files/riaz.tar.gz
73
+ tar -xzf riaz.tar.gz
74
+ ```
75
+ Note that we also provide pre-compiled `*.qza` models for use with QIIME2. These can be found in the same Zenodo repository.
76
+
77
+ #### Running Classification with Pre-trained models
78
+ With a pre-trained model, or one you have trained yourself, run classification as follows:
79
+ ```bash
80
+ eplacer run-model --fasta <fasta path> --counts <count matrix> --geoData <geoData path> --blast <blast path> --confidence <threshold> --model <model path> --maskrate 0
81
+ ```
82
+
83
+ #### Training new ePlacer models
84
+ Training a new ePlacer model requires three inputs: an aligned fasta file for the barcode of interest (containing all available references of interest), a flat taxonomy file, and a reference file for biogeography (currently, ePlacer supports the [OBIS csv download](https://obis.org/data/access/)).
85
+
86
+ ePlacer also supports custom references for biogeography, formatted as follows:
87
+ ```bash
88
+ #Species Latitude Longitude
89
+ SpeciesLabelA 39.645946 -71.746641
90
+ SpeciesLabelB 39.645946 -71.746641
91
+ ```
92
+
93
+ To run the training, use the following:
94
+ ```bash
95
+ eplacer train-model --fasta <alignment file> --taxa <taxonomy file> \
96
+ --out <output directory> --taxlevel SPECIES \
97
+ --geoData <obis data> --augments <several values should be tested here> \
98
+ --maskrate <several values should be tested here> --threads 1
99
+ ```
100
+
101
+ ==============================================================
102
+
103
+ This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.
File without changes
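The README above lists a fasta file, a geography metadata file, and a count table that must agree on ASV and sample identifiers. The following is a minimal sketch of a consistency check for those three inputs, not part of the package. It assumes the tables are tab-separated (the README examples do not show the delimiter), and `read_tsv` and `check_inputs` are hypothetical helper names.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def check_inputs(fasta_text, geo_text, counts_text):
    """Return a list of problems; an empty list means the inputs agree."""
    # ASV identifiers are the fasta header lines without the leading ">"
    asv_ids = {line[1:].strip() for line in fasta_text.splitlines()
               if line.startswith(">")}
    geo_rows = read_tsv(geo_text)
    count_rows = read_tsv(counts_text)

    geo_samples = {row[0] for row in geo_rows[1:]}   # skip "#SampleID" header
    count_samples = set(count_rows[0][1:])           # header minus "#OTU ID"
    count_asvs = {row[0] for row in count_rows[1:]}

    problems = []
    for sample in sorted(count_samples - geo_samples):
        problems.append(f"sample {sample} has counts but no coordinates")
    for asv in sorted(count_asvs - asv_ids):
        problems.append(f"{asv} is in the count table but not in the fasta file")
    for asv in sorted(asv_ids - count_asvs):
        problems.append(f"{asv} is in the fasta file but not in the count table")
    return problems
```

Running a check like this before `eplacer run-model` can surface mismatched identifiers early, before any model is loaded.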
@@ -0,0 +1,176 @@
+ #! /usr/bin/env python
+ '''
+ Usage:
+     eplacer [--version] [--help] <command> [<args>...]
+
+ Options:
+     -h, --help     Generate Help Screen
+     -v, --version  Get Version Number
+
+ General Commands:
+     train-model    Trains a convolutional neural network to perform
+                    a classification task to a specific taxonomic
+                    group.
+                    Options for classification are the following
+                      - sequence-only
+                      - sequence-geo
+     run-model      Runs the convolutional neural network on new
+                    data to assign taxonomy
+
+ See 'eplacer <command> --help' for more information on a command
+ '''
+
+
+ import sys
+ import os
+ from docopt import docopt
+ import multiprocessing
+
+ def main():
+     args = docopt(__doc__,
+                   version='',
+                   options_first=True)
+     argv = [args['<command>']] + args['<args>']
+     if args['<command>'] == 'train-model':
+         import eplacer.train_command
+         # train_evaluate is called below, so import it explicitly
+         import eplacer.train_evaluate
+         args = docopt(eplacer.train_command.__doc__, argv=argv)
+         # Check if provided directory exists, if provided
+         if args['--out']:
+             if os.path.exists(args['--out']):
+                 if args['--force'] == False:
+                     raise Exception('The path already exists! Exiting...\n')
+         # Set default out directory if it doesn't exist
+         else:
+             args['--out']="data/models/"
+             if os.path.exists(args['--out']):
+                 if args['--force'] == False:
+                     raise Exception('The path already exists! Exiting...\n')
+         # Check for existence of required input
+         if args['--fasta']:
+             if not os.path.exists(args['--fasta']):
+                 raise Exception('No fasta file exists at this location! Exiting...\n')
+         else:
+             raise Exception('No fasta file specified! Exiting...\n')
+         if args['--taxa']:
+             if not os.path.exists(args['--taxa']):
+                 raise Exception('No taxa file exists at this location! Exiting...\n')
+         else:
+             raise Exception('No taxa file specified! Exiting...\n')
+         # Default to the species level. Which may or may not work
+         if not args['--taxlevel']:
+             args['--taxlevel']="SPECIES"
+         # Check for the geo data. Set mode of running based on this.
+         if args['--geoData']:
+             if not os.path.exists(args['--geoData']):
+                 raise Exception("GeoData path does not exist! Exiting\n")
+             else:
+                 sys.stdout.write("Setting mode to train on "
+                                  "sequence and geographic data\n")
+                 mode='sequence_geo'
+                 if not args['--kernel']:
+                     kernel = 3
+                 else:
+                     kernel = int(args['--kernel'])
+                 if not args['--sigma']:
+                     sigma = 1
+                 else:
+                     sigma = int(args['--sigma'])
+                 if not args['--precision']:
+                     precision = 3
+                 else:
+                     precision = float(args['--precision'])
+         else:
+             sys.stdout.write("Setting mode to train on "
+                              "sequence data only\n")
+             mode='sequence'
+         if not args['--taxlevel']:
+             args['--taxlevel'] = "SPECIES"
+         if not args['--maskrate']:
+             maskrate = 0
+         else:
+             maskrate = float(args['--maskrate'])
+         if not args['--augments']:
+             augments = 5
+         else:
+             augments = int(args['--augments'])
+         print("augments: ", augments)
+         if mode == 'sequence':
+             # Pass the parsed values so the defaults above take effect
+             eplacer.train_evaluate.train_sequence(args['--fasta'], args['--taxa'], args['--taxlevel'], args['--out'],
+                                                   augments, maskrate)
+         elif mode == 'sequence_geo':
+             eplacer.train_evaluate.train_sequenceOBIS(args['--fasta'], args['--taxa'], args['--taxlevel'], args['--out'], args['--geoData'],
+                                                       augments, maskrate, sigma, kernel, precision)
+         sys.exit()
+     elif args['<command>'] == 'run-model':
+         import eplacer.run_command
+         import eplacer.run_model
+         args = docopt(eplacer.run_command.__doc__, argv=argv)
+         # check that the provided directory exists, if provided
+         if args['--out']:
+             if os.path.exists(args['--out']):
+                 if args['--force'] == False:
+                     raise Exception('The path already exists! Exiting...\n')
+         # Set default out directory if it doesn't exist
+         else:
+             args['--out']="result/models/"
+             if os.path.exists(args['--out']):
+                 if args['--force'] == False:
+                     raise Exception('The path already exists! Exiting...\n')
+         if args['--fasta']:
+             if not os.path.exists(args['--fasta']):
+                 raise Exception('No fasta file exists at this location! Exiting...\n')
+         else:
+             raise Exception('No fasta file specified! Exiting...\n')
+         if args['--blast']:
+             if not os.path.exists(args['--blast']):
+                 raise Exception('No blast result file exists at this location! Exiting...\n')
+         else:
+             raise Exception('No blast result file specified! Exiting...\n')
+         if args['--model']:
+             if not os.path.exists(args['--model']):
+                 raise Exception('No model file exists at this location! Exiting...\n')
+             args['--taxfile'] = str(args['--model']) + "/taxa_key_SPECIES.tsv"
+             if not os.path.exists(args['--taxfile']):
+                 raise Exception('No taxfile file exists at this location! Your model directory may be corrupted. Exiting...\n')
+         else:
+             raise Exception('No model file specified! Exiting...\n')
+         if not args['--taxlevel']:
+             args['--taxlevel'] = "SPECIES"
+         # Cap the requested thread count at the number of available CPUs
+         if args['--threads']:
+             cpu_count = multiprocessing.cpu_count()
+             if int(args['--threads']) > cpu_count:
+                 args['--threads'] = cpu_count
+                 print(f"Too many threads requested. setting to {cpu_count}")
+         else:
+             cpu_count = multiprocessing.cpu_count()
+             args['--threads'] = cpu_count
+             print(f"No threads requested. setting to {cpu_count}")
+         if not args['--maskrate']:
+             maskrate = 0
+         else:
+             maskrate = float(args['--maskrate'])
+         # Check for the geo data. Set mode of running based on this.
+         if args['--geoData']:
+             if not os.path.exists(args['--geoData']):
+                 raise Exception("GeoData path does not exist! Exiting\n")
+             else:
+                 sys.stdout.write("Setting mode to run on "
+                                  "sequence and geographic data\n")
+                 mode='sequence_geo'
+                 if not args['--counts']:
+                     raise Exception('Abundance matrix not specified! Exiting\n')
+                 elif not os.path.exists(args['--counts']):
+                     raise Exception('The path to the count matrix does not exist! Exiting\n')
+                 if not args['--kernel']:
+                     kernel = 3
+                 else:
+                     kernel = int(args['--kernel'])
+                 if not args['--sigma']:
+                     sigma = 1
+                 else:
+                     sigma = int(args['--sigma'])
+                 if not args['--confidence']:
+                     args['--confidence'] = 0.9
+         else:
+             raise Exception('No geoData available. Exiting...\n')
+         eplacer.run_model.gen_model_output_OBIS(args['--confidence'], args['--blast'], args['--out'], args['--fasta'], args['--model'], args['--taxlevel'], args['--taxfile'], args['--geoData'], args['--counts'], maskrate, sigma, kernel, args['--threads'])
+
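The command-line entry point above repeats one pattern many times: use a default when an option is absent, otherwise cast the supplied string. A minimal sketch of how that pattern could be factored out is shown below. `option` and `clamp_threads` are hypothetical helper names that do not exist in the package, and `args` is assumed to be a plain dictionary like the one docopt returns.

```python
import multiprocessing


def option(args, key, default, cast=str):
    """Return cast(args[key]) when the option was supplied, else the default."""
    value = args.get(key)
    if value is None or value is False:
        return default
    return cast(value)


def clamp_threads(requested, available=None):
    """Limit a requested thread count to the number of available CPUs."""
    if available is None:
        available = multiprocessing.cpu_count()
    if requested is None:
        return available
    return max(1, min(int(requested), available))


# Example with a hand-written dictionary standing in for docopt output
example_args = {"--kernel": None, "--sigma": "2", "--maskrate": "0.1"}
kernel = option(example_args, "--kernel", 3, int)
sigma = option(example_args, "--sigma", 1, int)
maskrate = option(example_args, "--maskrate", 0, float)
```

Keeping the cast next to the default makes it harder to pass a raw string to a function that expects a number.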
@@ -0,0 +1,97 @@
+ """
+ This script defines some useful code for prepping datasets
+ for ePlacer
+
+ Author: Christopher Powers
+ Institution: NOAA NEFSC PEMAD PBB
+ """
+
+
+ import numpy as np
+ import torch
+ from torch.utils.data import Dataset
+
+ def get_degenerate_bases():
+     """
+     Returns a dictionary mapping IUPAC degenerate bases to their possible canonical bases
+     """
+     return {
+         'A': ['A'],
+         'C': ['C'],
+         'G': ['G'],
+         'T': ['T'],
+         'R': ['A', 'G'],       # Purine
+         'Y': ['C', 'T'],       # Pyrimidine
+         'M': ['A', 'C'],       # Amino
+         'K': ['G', 'T'],       # Keto
+         'S': ['C', 'G'],       # Strong
+         'W': ['A', 'T'],       # Weak
+         'H': ['A', 'C', 'T'],  # not G
+         'B': ['C', 'G', 'T'],  # not A
+         'V': ['A', 'C', 'G'],  # not T
+         'D': ['A', 'G', 'T'],  # not C
+         'N': ['-'],            # any base is not informative. Encode as gap
+         '-': ['-']             # gap
+     }
+
+ def encode_onehot(seq, mask_token = "N"):
+     """
+     Function to encode an individual sequence with one hot encoding,
+     handling degenerate bases by averaging their possible canonical forms
+     """
+     mapping = {
+         "A": [1., 0., 0., 0.],
+         "C": [0., 1., 0., 0.],
+         "G": [0., 0., 1., 0.],
+         "T": [0., 0., 0., 1.],
+         "-": [0., 0., 0., 0.],
+         mask_token: [0.25, 0.25, 0.25, 0.25]
+     }
+
+     degenerate_bases = get_degenerate_bases()
+     # Pre-calculate average encodings for degenerate bases
+     for base, possibilities in degenerate_bases.items():
+         if base not in mapping and len(possibilities) > 0:
+             avg_encoding = np.zeros(4)
+             for canonical_base in possibilities:
+                 if canonical_base in mapping:
+                     avg_encoding += np.array(mapping[canonical_base])
+             mapping[base] = (avg_encoding / len(possibilities)).tolist()
+
+     # Look up each base in turn; unrecognized characters encode as all zeros
+     return np.array([mapping.get(base, [0., 0., 0., 0.]) for base in seq])
+
+ class SeqGeoDataset(Dataset):
+     """
+     Dataset that stores the one-hot encoded data alongside
+     the geographic data
+     """
+     def __init__(self, sequences, labels, geo_data):
+         self.seqs = []
+         self.geo = []
+         self.taxa_labels = []
+         for i in range(0,len(sequences)):
+             self.seqs.append(sequences[i])
+         for i in range(0,len(labels)):
+             self.taxa_labels.append(labels[i])
+         for i in range(0,len(geo_data)):
+             self.geo.append(geo_data[i])
+
+         # One-hot encode the sequences and convert everything to tensors
+         self.ohe_seqs = torch.stack([torch.from_numpy(encode_onehot(x)).float() for x in self.seqs])
+         self.ohe_geo = torch.stack([torch.from_numpy(x).float() for x in self.geo])
+         self.labels = torch.Tensor(self.taxa_labels).long()
+
+     def __len__(self): return len(self.seqs)
+
+     def __getitem__(self,idx):
+         seq = self.ohe_seqs[idx]
+         label = self.labels[idx]
+         geo = self.ohe_geo[idx]
+
+         return seq, geo, label
+
+
+
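The encoding scheme in `encode_onehot` can be illustrated without PyTorch. The following is a compact sketch that assumes the same conventions as the code above: a degenerate base is the average of its canonical bases, the mask token is uniform over A, C, G, and T, and gaps or unknown characters are all zeros. `encode` and `IUPAC` are illustrative names, not part of the package.

```python
import numpy as np

CANONICAL = "ACGT"
# IUPAC codes and the canonical bases each one may stand for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
}


def encode(seq, mask_token="N"):
    """Encode a sequence as a (len(seq), 4) array over A, C, G, T."""
    rows = []
    for base in seq:
        row = np.zeros(4)
        if base == mask_token:
            row[:] = 0.25  # masked position: uniform over the four bases
        else:
            options = IUPAC.get(base, "")  # gaps and unknowns stay all-zero
            for canonical in options:
                row[CANONICAL.index(canonical)] += 1.0 / len(options)
        rows.append(row)
    return np.array(rows)


encoded = encode("AR-N")
```

Here `encoded` has one row per input character: a one-hot row for `A`, a row split evenly between A and G for `R`, an all-zero row for the gap, and a uniform row for the masked `N`.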
@@ -0,0 +1,60 @@
+ """
+ This script runs mafft as a subprocess and generates
+ an alignment.
+
+ Author: Christopher Powers
+ Institution: NOAA NEFSC PEMAD PBB
+ """
+
+
+ import subprocess
+
+ def run_mafft(input, reference, moutput, subset_output, threads):
+     """
+     Run the MAFFT alignment to add your sequences to a new fasta
+     """
+
+     # Get initial IDs
+     names = []
+     with open(input, "r") as infile:
+         for line in infile:
+             if line.startswith(">"):
+                 names.append(line[1:].rstrip())
+
+     try:
+         print("Beginning subprocess...")
+         print("Aligning with mafft...")
+         # Pass arguments as a list and redirect stdout in Python, so no shell is needed
+         command = ["mafft", "--add", input, "--adjustdirection", "--thread", str(threads),
+                    "--keeplength", "--reorder", reference]
+         with open(moutput, "w") as outfile:
+             subprocess.run(command, stdout=outfile, check=True)
+     except subprocess.CalledProcessError as e:
+         print(f"MAFFT execution failed with error: {e}")
+     except FileNotFoundError:
+         print("MAFFT not found. Is it installed/in the path?")
+
+     # subset the new file
+     key = ''
+     seq = ''
+     seqdict = {}
+
+     with open(moutput, "r") as infile:
+         for line in infile:
+             line = line.rstrip()
+             # MAFFT prefixes reverse-complemented sequences with "_R_"
+             if line.startswith(">_R_"):
+                 if key != '':
+                     seqdict[key] = seq.upper()
+                 seq = ''
+                 key = line[4:]
+             elif line.startswith(">"):
+                 if key != '':
+                     seqdict[key] = seq.upper()
+                 seq = ''
+                 key = line[1:]
+             else:
+                 seq += line
+         # Store the final record, which has no following header to trigger it
+         if key != '':
+             seqdict[key] = seq.upper()
+     with open(subset_output, "w") as outfile:
+         for s in seqdict:
+             if s in names:
+                 outfile.write(f">{s}\n{seqdict[s]}\n")
+
+     return seqdict
+
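The parsing step of `run_mafft` can be exercised on its own, without MAFFT installed. The sketch below assumes the alignment is plain FASTA text in which reverse-complemented records carry the `_R_` prefix that the code above strips. `parse_alignment` is a hypothetical helper name, not a function in the package.

```python
def parse_alignment(lines):
    """Map each record name (without any "_R_" prefix) to its uppercase sequence."""
    records = {}
    key = None
    chunks = []
    for raw in lines:
        line = raw.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record
            if key is not None:
                records[key] = "".join(chunks).upper()
            name = line[1:]
            key = name[3:] if name.startswith("_R_") else name
            chunks = []
        elif key is not None:
            chunks.append(line)
    # The last record has no following header, so store it explicitly
    if key is not None:
        records[key] = "".join(chunks).upper()
    return records


aligned = parse_alignment([">_R_ASV1\n", "acgt\n", "ac\n", ">ref\n", "ggtt\n"])
```

Collecting sequence chunks in a list and joining them once avoids repeated string concatenation on long alignments.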