boldigger3 1.0.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 DominikBuchner
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,190 @@
1
+ Metadata-Version: 2.1
2
+ Name: boldigger3
3
+ Version: 1.0.0
4
+ Summary: A python package to query different databases of boldsystems.org v5!
5
+ Home-page: https://github.com/DominikBuchner/BOLDigger3
6
+ Author: Dominik Buchner
7
+ Author-email: dominik.buchner@uni-due.de
8
+ License: MIT
9
+ Classifier: Programming Language :: Python :: 3
10
+ Classifier: License :: OSI Approved :: MIT License
11
+ Classifier: Operating System :: OS Independent
12
+ Requires-Python: >=3.10
13
+ Description-Content-Type: text/markdown
14
+ License-File: LICENSE
15
+ Requires-Dist: beautifulsoup4>=4.12.3
16
+ Requires-Dist: Bio>=1.7.1
17
+ Requires-Dist: biopython>=1.84
18
+ Requires-Dist: get_pypi_latest_version>=0.0.12
19
+ Requires-Dist: more_itertools>=10.5.0
20
+ Requires-Dist: numpy>=2.1.2
21
+ Requires-Dist: pandas>=2.2.3
22
+ Requires-Dist: playwright>=1.48.0
23
+ Requires-Dist: Requests>=2.32.3
24
+ Requires-Dist: requests_html_playwright>=0.12.3
25
+ Requires-Dist: setuptools>=65.5.0
26
+ Requires-Dist: tqdm>=4.66.4
27
+ Requires-Dist: urllib3>=1.26.14
28
+ Requires-Dist: tables>=3.9.2
29
+ Requires-Dist: html5lib>=1.1
30
+ Requires-Dist: soupsieve>=2.5
31
+ Requires-Dist: openpyxl>=3.1.1
32
+ Requires-Dist: pyarrow>=11.0.0
33
+ Requires-Dist: lxml_html_clean>=0.1.1
34
+
35
+ # BOLDigger3
36
+
37
+ ![boldigger_logo]()
38
+
39
+ [![Downloads](https://pepy.tech/badge/boldigger3)](https://pepy.tech/project/boldigger3)
40
+
41
+ A Python program to query .fasta files against the databases of www.boldsystems.org v5!
42
+
43
+ ## Introduction
44
+ DNA metabarcoding datasets often comprise hundreds of Operational Taxonomic Units (OTUs) that must be queried against reference databases for taxonomic assignment. The Barcode of Life Data system (BOLD) is a widely used database for this purpose among biologists. However, BOLD's online platform limits users to batches of 1000, 200, or 100 sequences at a time, depending on the operating mode.
45
+
46
+ BOLDigger3, the successor to BOLDigger2 and BOLDigger, aims to overcome these limitations. As a pure Python program, BOLDigger3 offers:
47
+
48
+ - Automated access to BOLD's identification engine
49
+ - Downloading of additional metadata for each hit
50
+ - Selection of the best-fitting hit from the returned results
51
+
52
+ ## Overview
53
+
54
+ **BOLDigger3** is an automated tool designed for DNA sequence identification through **BOLDSystems v5**, supporting integration into bioinformatics pipelines with enhanced functionality and performance. With BOLDigger3, users can identify up to **10,000 sequences per hour** without the need for credentials, using an optimized data storage and queuing system that improves speed and process safety.
55
+
56
+ ## Key Differences Between BOLDigger3 and BOLDigger2
57
+
58
+ - **Unified Function**: BOLDigger3 consolidates all actions into a single function, `identify`, which automatically performs identification, additional data downloading, and top-hit selection, making it easier to integrate into pipelines.
59
+ - **Enhanced Database Accessibility**: Users have access to all databases offered by **BOLDSystems v5** and can select from three different operating modes.
60
+ - **Improved Speeds**: Depending on the operating mode, BOLDigger3 can identify up to **10,000 sequences per hour**, significantly faster than BOLDigger2.
61
+ - **No Password Required**: Users no longer need credentials to perform identifications—just select the FASTA file, database, and operating mode to start.
62
+ - **Streamlined Data Storage**: Data is stored in an **HDF Store** for faster processing, with final outputs available in `.xlsx` and `.parquet` formats.
63
+ - **Process Safety**: BOLDigger3 can resume interrupted executions, continuing exactly where it left off.
64
+ - **Dynamic Queuing**: The tool automatically manages request queuing based on the selected operating mode.
65
+
66
+ ## Features
67
+
68
+ - **Identify Sequences Automatically**: Run DNA sequence identifications with a single command.
69
+ - **Flexible Database Options**: Access to all BOLDSystems v5 databases with user-selected operating modes.
70
+ - **High-Performance Processing**: Up to 10,000 identifications per hour, depending on settings.
71
+ - **Robust Storage**: Data stored in HDF format for efficient processing; results in `.xlsx` and `.parquet`.
72
+ - **User-Friendly**: No credentials needed for use.
73
+
74
+ ## Installation and Usage
75
+
76
+ BOLDigger3 requires Python version 3.10 or higher and can be easily installed using pip in any command line:
77
+
78
+ `pip install boldigger3`
79
+
80
+ This command will install BOLDigger3 along with all its dependencies.
81
+ BOLDigger3 uses the Python package ```playwright```, which needs a separate installation prior to first execution:
82
+
83
+ `playwright install`
84
+
85
+ **FOR BOLDigger2 USERS:**
86
+ BOLDigger2 used requests-html which relied on an old version of pyppeteer. This may lead to conflicts with playwright. To solve:
87
+
88
+ ```
89
+ pip uninstall pyppeteer
90
+ pip install --upgrade pyee
91
+ ```
92
+
93
+ To run the ```identify``` function, use the following command:
94
+
95
+ `boldigger3 identify PATH_TO_FASTA --db DATABASE_NR --mode OPERATING_MODE`
96
+
97
+ ### Databases
98
+
99
+ The ```--db``` argument is a number between 1 and 7 corresponding to the seven databases BOLD v5 currently offers:
100
+
101
+ 1: **ANIMAL LIBRARY (PUBLIC)**
102
+ 2: **ANIMAL SPECIES-LEVEL LIBRARY (PUBLIC + PRIVATE)**
103
+ 3: **ANIMAL LIBRARY (PUBLIC+PRIVATE)**
104
+ 4: **VALIDATED CANADIAN ARTHROPOD LIBRARY**
105
+ 5: **PLANT LIBRARY (PUBLIC)**
106
+ 6: **FUNGI LIBRARY (PUBLIC)**
107
+ 7: **ANIMAL SECONDARY MARKERS (PUBLIC)**
108
+
109
+ ### Operating modes
110
+
111
+ The ```--mode``` argument is a number between 1 and 3 corresponding to the three operating modes BOLD v5 currently offers:
112
+
113
+ 1: **Rapid Species Search**
114
+ 2: **Genus and Species Search**
115
+ 3: **Exhaustive Search**
116
+
117
+ To adapt the implemented thresholds to user-specific needs, thresholds can be passed as an additional, ordered argument. Up to four integer thresholds can be passed, one for each taxonomic level (Species, Genus, Family, Order). Thresholds that are not passed fall back to their defaults, and BOLDigger3 prints the thresholds in use:
118
+
119
+ `boldigger3 identify PATH_TO_FASTA --db DATABASE_NR --mode OPERATING_MODE --thresholds 99 97`
120
+
121
+ Output:
122
+
123
+ ```
124
+ 19:16:16: Default thresholds changed!
125
+ 19:16:16: Species: 99, Genus: 97, Family: 90, Order: 85
126
+ ```
127
+
128
+ When a new version is released, you can update BOLDigger3 by typing:
129
+
130
+ `pip install --upgrade boldigger3`
131
+
132
+ ## How to cite
133
+
134
+ Buchner D, Leese F (2020) BOLDigger – a Python package to identify and organise sequences with the Barcode of Life Data systems. Metabarcoding and Metagenomics 4: e53535. https://doi.org/10.3897/mbmg.4.53535
135
+
136
+
137
+ ## The BOLDigger3 Algorithm
138
+
139
+ The BOLDigger3 algorithm operates as follows:
140
+
141
+ 1. **Split the FASTA**: The input FASTA file is divided into chunks that fit the limits of the selected operating mode of the identification engine.
142
+
143
+ 2. **Queue the Chunks**: These chunks are then queued in the identification engine for processing.
144
+
145
+ 3. **Check for Results**: The algorithm periodically checks if any results can be downloaded.
146
+
147
+ 4. **Data Download**: Once results are available, the data is downloaded.
148
+
149
+ 5. **Data Validation**: The algorithm ensures that all data has been correctly downloaded.
150
+
151
+ 6. **Retrieve Additional Data**: Additional data is obtained via the API.
152
+
153
+ 7. **Select Top Hit**: Finally, the algorithm selects the top hit backed by the most database entries for the final output.
154
+
155
+ ### Top hit selection
156
+
157
+ Different thresholds (97%: species level, 95%: genus level, 90%: family level, 85%: order level) for the taxonomic levels are used to find the best fitting hit. After determining the threshold for all hits the most common hit above the threshold will be selected. Note that for all hits below the threshold, the taxonomic resolution will be adjusted accordingly (e.g. for a 96% hit the species-level information will be discarded, and genus-level information will be used as the lowest taxonomic level).
158
+
159
+ Top hit selection proceeds as follows:
160
+
161
+ 1. **Identify Maximum Similarity**: Find the maximum similarity value among the top 100 hits currently under consideration.
162
+
163
+ 2. **Set Threshold**: Set the threshold to the taxonomic level that matches this maximum similarity and temporarily remove all hits with a similarity below it. For example, if the highest hit has a similarity of 100%, the threshold will be set to 97% (species level), and all hits below 97% will be removed temporarily.
164
+
165
+ 3. **Classification and Sorting**: Count all individual classifications and sort them by abundance.
166
+
167
+ 4. **Filter Missing Data**: Drop all classifications that contain missing data. For instance, if the most common hit is "Arthropoda --> Insecta" with a similarity of 100% but has missing values for Order, Family, Genus, and Species, it is dropped.
168
+
169
+ 5. **Identify Common Hit**: Look for the most common hit that has no missing values.
170
+
171
+ 6. **Return Hit**: If a hit with no missing values is found, return that hit.
172
+
173
+ 7. **Threshold Adjustment**: If no hit without missing values is found, move the threshold to the next higher taxonomic level (for example, from species to genus) and repeat the process until a hit is found.
174
+
175
+
176
+ ### BOLDigger3 Flagging System
177
+
178
+ BOLDigger3 employs a flagging system to highlight certain conditions, indicating a degree of uncertainty in the selected hit. Currently, there are five flags implemented, which may be updated as needed:
179
+
180
+ 1. **Reverse BIN Taxonomy**: This flag is raised if all of the top 100 hits representing the selected match utilize reverse BIN taxonomy. Reverse BIN taxonomy assigns species names to deposited sequences on BOLD that lack species information, potentially introducing uncertainty.
181
+
182
+ 2. **Differing Taxonomic Information**: If there are two or more entries with differing taxonomic information above the selected threshold (e.g., two species above 97%), this flag is triggered, suggesting potential discrepancies.
183
+
184
+ 3. **Private Data**: If all of the top 100 hits representing the top hit are private hits, this flag is raised, indicating limited accessibility to data.
185
+
186
+ 4. **Unique Hit**: This flag indicates that the top hit result represents a unique hit among the top 100 hits, potentially requiring further scrutiny.
187
+
188
+ 5. **Multiple BINs**: If the selected species-level hit is composed of more than one BIN, this flag is raised, suggesting potential complexities in taxonomic assignment.
189
+
190
+ Given the presence of these flags, it is advisable to conduct a closer examination of all flagged hits to better understand and address any uncertainties in the selected hit.
@@ -0,0 +1,156 @@
1
+ # BOLDigger3
2
+
3
+ ![boldigger_logo]()
4
+
5
+ [![Downloads](https://pepy.tech/badge/boldigger3)](https://pepy.tech/project/boldigger3)
6
+
7
+ A Python program to query .fasta files against the databases of www.boldsystems.org v5!
8
+
9
+ ## Introduction
10
+ DNA metabarcoding datasets often comprise hundreds of Operational Taxonomic Units (OTUs) that must be queried against reference databases for taxonomic assignment. The Barcode of Life Data system (BOLD) is a widely used database for this purpose among biologists. However, BOLD's online platform limits users to batches of 1000, 200, or 100 sequences at a time, depending on the operating mode.
11
+
12
+ BOLDigger3, the successor to BOLDigger2 and BOLDigger, aims to overcome these limitations. As a pure Python program, BOLDigger3 offers:
13
+
14
+ - Automated access to BOLD's identification engine
15
+ - Downloading of additional metadata for each hit
16
+ - Selection of the best-fitting hit from the returned results
17
+
18
+ ## Overview
19
+
20
+ **BOLDigger3** is an automated tool designed for DNA sequence identification through **BOLDSystems v5**, supporting integration into bioinformatics pipelines with enhanced functionality and performance. With BOLDigger3, users can identify up to **10,000 sequences per hour** without the need for credentials, using an optimized data storage and queuing system that improves speed and process safety.
21
+
22
+ ## Key Differences Between BOLDigger3 and BOLDigger2
23
+
24
+ - **Unified Function**: BOLDigger3 consolidates all actions into a single function, `identify`, which automatically performs identification, additional data downloading, and top-hit selection, making it easier to integrate into pipelines.
25
+ - **Enhanced Database Accessibility**: Users have access to all databases offered by **BOLDSystems v5** and can select from three different operating modes.
26
+ - **Improved Speeds**: Depending on the operating mode, BOLDigger3 can identify up to **10,000 sequences per hour**, significantly faster than BOLDigger2.
27
+ - **No Password Required**: Users no longer need credentials to perform identifications—just select the FASTA file, database, and operating mode to start.
28
+ - **Streamlined Data Storage**: Data is stored in an **HDF Store** for faster processing, with final outputs available in `.xlsx` and `.parquet` formats.
29
+ - **Process Safety**: BOLDigger3 can resume interrupted executions, continuing exactly where it left off.
30
+ - **Dynamic Queuing**: The tool automatically manages request queuing based on the selected operating mode.
31
+
32
+ ## Features
33
+
34
+ - **Identify Sequences Automatically**: Run DNA sequence identifications with a single command.
35
+ - **Flexible Database Options**: Access to all BOLDSystems v5 databases with user-selected operating modes.
36
+ - **High-Performance Processing**: Up to 10,000 identifications per hour, depending on settings.
37
+ - **Robust Storage**: Data stored in HDF format for efficient processing; results in `.xlsx` and `.parquet`.
38
+ - **User-Friendly**: No credentials needed for use.
39
+
40
+ ## Installation and Usage
41
+
42
+ BOLDigger3 requires Python version 3.10 or higher and can be easily installed using pip in any command line:
43
+
44
+ `pip install boldigger3`
45
+
46
+ This command will install BOLDigger3 along with all its dependencies.
47
+ BOLDigger3 uses the Python package ```playwright```, which needs a separate installation prior to first execution:
48
+
49
+ `playwright install`
50
+
51
+ **FOR BOLDigger2 USERS:**
52
+ BOLDigger2 used requests-html which relied on an old version of pyppeteer. This may lead to conflicts with playwright. To solve:
53
+
54
+ ```
55
+ pip uninstall pyppeteer
56
+ pip install --upgrade pyee
57
+ ```
58
+
59
+ To run the ```identify``` function, use the following command:
60
+
61
+ `boldigger3 identify PATH_TO_FASTA --db DATABASE_NR --mode OPERATING_MODE`
62
+
63
+ ### Databases
64
+
65
+ The ```--db``` argument is a number between 1 and 7 corresponding to the seven databases BOLD v5 currently offers:
66
+
67
+ 1: **ANIMAL LIBRARY (PUBLIC)**
68
+ 2: **ANIMAL SPECIES-LEVEL LIBRARY (PUBLIC + PRIVATE)**
69
+ 3: **ANIMAL LIBRARY (PUBLIC+PRIVATE)**
70
+ 4: **VALIDATED CANADIAN ARTHROPOD LIBRARY**
71
+ 5: **PLANT LIBRARY (PUBLIC)**
72
+ 6: **FUNGI LIBRARY (PUBLIC)**
73
+ 7: **ANIMAL SECONDARY MARKERS (PUBLIC)**
74
+
75
+ ### Operating modes
76
+
77
+ The ```--mode``` argument is a number between 1 and 3 corresponding to the three operating modes BOLD v5 currently offers:
78
+
79
+ 1: **Rapid Species Search**
80
+ 2: **Genus and Species Search**
81
+ 3: **Exhaustive Search**
82
+
83
+ To adapt the implemented thresholds to user-specific needs, thresholds can be passed as an additional, ordered argument. Up to four integer thresholds can be passed, one for each taxonomic level (Species, Genus, Family, Order). Thresholds that are not passed fall back to their defaults, and BOLDigger3 prints the thresholds in use:
84
+
85
+ `boldigger3 identify PATH_TO_FASTA --db DATABASE_NR --mode OPERATING_MODE --thresholds 99 97`
86
+
87
+ Output:
88
+
89
+ ```
90
+ 19:16:16: Default thresholds changed!
91
+ 19:16:16: Species: 99, Genus: 97, Family: 90, Order: 85
92
+ ```
93
+
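The fallback to default thresholds can be sketched as follows. This is a minimal illustration, assuming the default values 97, 95, 90 and 85 stated in this README; the name `fill_thresholds` is hypothetical and is not part of the boldigger3 API.

```python
from typing import Optional, Sequence

# Defaults in the order Species, Genus, Family, Order (values from this README).
DEFAULT_THRESHOLDS = [97, 95, 90, 85]


def fill_thresholds(user_thresholds: Optional[Sequence[int]] = None) -> list[int]:
    """Return one threshold per level, using defaults where none was passed."""
    given = list(user_thresholds or [])
    # Keep the user's values in order, then append the remaining defaults.
    return given[: len(DEFAULT_THRESHOLDS)] + DEFAULT_THRESHOLDS[len(given):]


print(fill_thresholds([99, 97]))  # [99, 97, 90, 85]
```

Values beyond the fourth position are ignored in this sketch, because only four levels have a default.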
94
+ When a new version is released, you can update BOLDigger3 by typing:
95
+
96
+ `pip install --upgrade boldigger3`
97
+
98
+ ## How to cite
99
+
100
+ Buchner D, Leese F (2020) BOLDigger – a Python package to identify and organise sequences with the Barcode of Life Data systems. Metabarcoding and Metagenomics 4: e53535. https://doi.org/10.3897/mbmg.4.53535
101
+
102
+
103
+ ## The BOLDigger3 Algorithm
104
+
105
+ The BOLDigger3 algorithm operates as follows:
106
+
107
+ 1. **Split the FASTA**: The input FASTA file is divided into chunks that fit the limits of the selected operating mode of the identification engine.
108
+
109
+ 2. **Queue the Chunks**: These chunks are then queued in the identification engine for processing.
110
+
111
+ 3. **Check for Results**: The algorithm periodically checks if any results can be downloaded.
112
+
113
+ 4. **Data Download**: Once results are available, the data is downloaded.
114
+
115
+ 5. **Data Validation**: The algorithm ensures that all data has been correctly downloaded.
116
+
117
+ 6. **Retrieve Additional Data**: Additional data is obtained via the API.
118
+
119
+ 7. **Select Top Hit**: Finally, the algorithm selects the top hit backed by the most database entries for the final output.
120
+
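Step 1 of the list above can be illustrated with a short sketch that uses only the standard library. The batch limits 1000, 200 and 100 come from the introduction, but their assignment to modes 1, 2 and 3 is an assumption, and `parse_fasta` and `chunk_records` are hypothetical names rather than boldigger3 functions.

```python
from itertools import islice
from typing import Iterable, Iterator

# Assumed mapping of operating mode to batch limit. The README lists the limits
# 1000, 200 and 100 but does not state which mode each one belongs to.
BATCH_LIMITS = {1: 1000, 2: 200, 3: 100}


def parse_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunk_records(records: Iterable, mode: int) -> Iterator[list]:
    """Yield lists of records no longer than the limit of the given mode."""
    iterator = iter(records)
    limit = BATCH_LIMITS[mode]
    while chunk := list(islice(iterator, limit)):
        yield chunk


fasta = [">otu_1", "ACGT", "ACGT", ">otu_2", "TTGA"]
records = list(parse_fasta(fasta))
print(records)  # [('otu_1', 'ACGTACGT'), ('otu_2', 'TTGA')]
```

With an open file, `chunk_records(parse_fasta(handle), mode=3)` would yield batches of at most 100 records under the assumed mapping.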
121
+ ### Top hit selection
122
+
123
+ Different thresholds (97%: species level, 95%: genus level, 90%: family level, 85%: order level) for the taxonomic levels are used to find the best fitting hit. After determining the threshold for all hits the most common hit above the threshold will be selected. Note that for all hits below the threshold, the taxonomic resolution will be adjusted accordingly (e.g. for a 96% hit the species-level information will be discarded, and genus-level information will be used as the lowest taxonomic level).
124
+
125
+ Top hit selection proceeds as follows:
126
+
127
+ 1. **Identify Maximum Similarity**: Find the maximum similarity value among the top 100 hits currently under consideration.
128
+
129
+ 2. **Set Threshold**: Set the threshold to the taxonomic level that matches this maximum similarity and temporarily remove all hits with a similarity below it. For example, if the highest hit has a similarity of 100%, the threshold will be set to 97% (species level), and all hits below 97% will be removed temporarily.
130
+
131
+ 3. **Classification and Sorting**: Count all individual classifications and sort them by abundance.
132
+
133
+ 4. **Filter Missing Data**: Drop all classifications that contain missing data. For instance, if the most common hit is "Arthropoda --> Insecta" with a similarity of 100% but has missing values for Order, Family, Genus, and Species, it is dropped.
134
+
135
+ 5. **Identify Common Hit**: Look for the most common hit that has no missing values.
136
+
137
+ 6. **Return Hit**: If a hit with no missing values is found, return that hit.
138
+
139
+ 7. **Threshold Adjustment**: If no hit without missing values is found, move the threshold to the next higher taxonomic level (for example, from species to genus) and repeat the process until a hit is found.
140
+
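The seven steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: hits are modelled as dictionaries with hypothetical keys, step 7 is read as moving to the next higher taxonomic level, and hits below the lowest threshold simply yield `None`. The name `pick_top_hit` is hypothetical.

```python
from collections import Counter
from typing import Optional

RANKS = ["phylum", "class", "order", "family", "genus", "species"]
# (lowest rank reported, similarity threshold in percent), values from this README.
LEVELS = [("species", 97), ("genus", 95), ("family", 90), ("order", 85)]


def pick_top_hit(hits: list[dict]) -> Optional[tuple]:
    """Return the most common complete classification, or None if there is none."""
    if not hits:
        return None
    max_similarity = max(hit["similarity"] for hit in hits)
    # Start at the first level whose threshold the best hit reaches.
    start = next((i for i, (_, t) in enumerate(LEVELS) if max_similarity >= t), None)
    if start is None:
        return None
    for rank, threshold in LEVELS[start:]:
        kept_ranks = RANKS[: RANKS.index(rank) + 1]
        candidates = [hit for hit in hits if hit["similarity"] >= threshold]
        # Count classifications truncated to the current rank, most common first.
        counts = Counter(tuple(hit.get(r) for r in kept_ranks) for hit in candidates)
        for classification, _ in counts.most_common():
            if all(classification):  # skip classifications with missing values
                return classification
    return None


complete = {
    "similarity": 99.2, "phylum": "Arthropoda", "class": "Insecta",
    "order": "Diptera", "family": "Chironomidae", "genus": "Chironomus",
    "species": "Chironomus riparius",
}
incomplete = {"similarity": 100.0, "phylum": "Arthropoda", "class": "Insecta"}
print(pick_top_hit([incomplete, incomplete, complete]))
```

In the example the incomplete "Arthropoda, Insecta" classification is the most common one, but it is skipped because of its missing values, so the complete classification is returned.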
141
+
142
+ ### BOLDigger3 Flagging System
143
+
144
+ BOLDigger3 employs a flagging system to highlight certain conditions, indicating a degree of uncertainty in the selected hit. Currently, there are five flags implemented, which may be updated as needed:
145
+
146
+ 1. **Reverse BIN Taxonomy**: This flag is raised if all of the top 100 hits representing the selected match utilize reverse BIN taxonomy. Reverse BIN taxonomy assigns species names to deposited sequences on BOLD that lack species information, potentially introducing uncertainty.
147
+
148
+ 2. **Differing Taxonomic Information**: If there are two or more entries with differing taxonomic information above the selected threshold (e.g., two species above 97%), this flag is triggered, suggesting potential discrepancies.
149
+
150
+ 3. **Private Data**: If all of the top 100 hits representing the top hit are private hits, this flag is raised, indicating limited accessibility to data.
151
+
152
+ 4. **Unique Hit**: This flag indicates that the top hit result represents a unique hit among the top 100 hits, potentially requiring further scrutiny.
153
+
154
+ 5. **Multiple BINs**: If the selected species-level hit is composed of more than one BIN, this flag is raised, suggesting potential complexities in taxonomic assignment.
155
+
156
+ Given the presence of these flags, it is advisable to conduct a closer examination of all flagged hits to better understand and address any uncertainties in the selected hit.
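Three of the flags described above (2, 3 and 4) can be illustrated with a small sketch. The record layout, the key names `classification` and `private`, and the function `compute_flags` are assumptions made for this example; flags 1 and 5 are left out because they need BIN information that is not modelled here.

```python
def compute_flags(hits: list[dict], top_hit: tuple, threshold: float) -> list[int]:
    """Return the flag numbers (2, 3, 4) that apply to the selected top hit."""
    above = [hit for hit in hits if hit["similarity"] >= threshold]
    supporting = [hit for hit in above if hit["classification"] == top_hit]
    flags = []
    # Flag 2: two or more differing classifications above the threshold.
    if len({hit["classification"] for hit in above}) >= 2:
        flags.append(2)
    # Flag 3: every hit that represents the top hit is private.
    if supporting and all(hit["private"] for hit in supporting):
        flags.append(3)
    # Flag 4: the top hit is represented by a single entry only.
    if len(supporting) == 1:
        flags.append(4)
    return flags


hits = [
    {"similarity": 99.5, "classification": ("Chironomus", "Chironomus riparius"), "private": True},
    {"similarity": 98.0, "classification": ("Chironomus", "Chironomus piger"), "private": False},
    {"similarity": 91.0, "classification": ("Chironomus", "Chironomus sp."), "private": False},
]
print(compute_flags(hits, ("Chironomus", "Chironomus riparius"), threshold=97))  # [2, 3, 4]
```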
File without changes
@@ -0,0 +1,125 @@
+ import argparse, sys, datetime, time
+ from boldigger3 import id_engine, additional_data_download, select_top_hit
+ from importlib.metadata import version
+ from get_pypi_latest_version import GetPyPiLatestVersion
+
+
+ # main function to program the commandline interface
+ def main() -> None:
+     """Function to define the commandline interface."""
+     # initialize the default behaviour if boldigger3 is called without any argument
+     formatter = lambda prog: argparse.HelpFormatter(prog, max_help_position=35)
+
+     # define the parser
+     parser = argparse.ArgumentParser(
+         prog="boldigger3",
+         description="A Python package to identify and organise sequences with the Barcode of Life Data systems.",
+         formatter_class=formatter,
+     )
+
+     # display help when no argument is called
+     parser.set_defaults(func=lambda x: parser.print_help())
+
+     # add the subparsers
+     subparsers = parser.add_subparsers(dest="function")
+
+     # add the identify parser
+     parser_identify = subparsers.add_parser(
+         "identify", help="Run the BOLD v5 identification engine"
+     )
+
+     # add the fasta path argument
+     parser_identify.add_argument(
+         "fasta_file",
+         help="Path to the fasta file or fasta file in the current working directory to be identified.",
+         type=str,
+     )
+
+     # add the database argument
+     parser_identify.add_argument(
+         "--db",
+         required=True,
+         help="Integer that defines which database to use (1 to 7). See readme for details",
+         type=int,
+         choices=range(1, 8),
+     )
+
+     # add the operating mode argument
+     parser_identify.add_argument(
+         "--mode",
+         required=True,
+         help="Integer that defines which operating mode to use (1 to 3). See readme for details.",
+         type=int,
+         choices=range(1, 4),
+     )
+
+     # add the optional argument thresholds
+     parser_identify.add_argument(
+         "--thresholds",
+         nargs="+",
+         type=int,
+         help="Thresholds to use for the selection of the top hit.",
+     )
+
+     # add version control
+     # get the installed version
+     current_version = version("boldigger3")
+     obtainer = GetPyPiLatestVersion()
+     latest_version = obtainer("boldigger3")
+
+     # give a user warning if the latest version is not installed
+     if current_version != latest_version:
+         print(
+             "{}: Your boldigger3 version is outdated. Consider updating to the latest version.".format(
+                 datetime.datetime.now().strftime("%H:%M:%S")
+             )
+         )
+
+     # add the version argument
+     parser.add_argument("--version", action="version", version=version("boldigger3"))
+
+     # parse the arguments
+     arguments = parser.parse_args()
+
+     # print help if no argument is provided
+     if len(sys.argv) == 1:
+         arguments.func(arguments)
+         sys.exit()
+
+     # only use the thresholds provided by the user, replace the rest with defaults
+     default_thresholds = [97, 95, 90, 85]
+     thresholds = []
+
+     for i in range(4):
+         try:
+             thresholds.append(arguments.thresholds[i])
+         except (IndexError, TypeError):
+             thresholds.append(default_thresholds[i])
+
+     if arguments.thresholds:
+         # give user output
+         print(
+             "{}: Default thresholds changed!\n{}: Species: {}, Genus: {}, Family: {}, Order: {}".format(
+                 datetime.datetime.now().strftime("%H:%M:%S"),
+                 datetime.datetime.now().strftime("%H:%M:%S"),
+                 *thresholds
+             )
+         )
+
+     # run the identification engine
+     if arguments.function == "identify":
+         # run the id engine
+         id_engine.main(
+             arguments.fasta_file,
+             database=arguments.db,
+             operating_mode=arguments.mode,
+         )
+         # download the additional data
+         additional_data_download.main(arguments.fasta_file)
+         # select the top hit
+         select_top_hit.main(arguments.fasta_file, thresholds=thresholds)
+
+
+ # run only if called as a top level script
+ if __name__ == "__main__":
+     main()
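The argument handling of the command-line interface can be tried without contacting BOLD or PyPI. The sketch below rebuilds only the `identify` sub-parser with the same options as the module above; `build_parser` is a hypothetical helper that does not exist in boldigger3, and `otus.fasta` is a placeholder file name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the identify sub-command with the options used by the CLI module."""
    parser = argparse.ArgumentParser(prog="boldigger3")
    subparsers = parser.add_subparsers(dest="function")
    identify = subparsers.add_parser(
        "identify", help="Run the BOLD v5 identification engine"
    )
    identify.add_argument("fasta_file", type=str)
    identify.add_argument("--db", required=True, type=int, choices=range(1, 8))
    identify.add_argument("--mode", required=True, type=int, choices=range(1, 4))
    identify.add_argument("--thresholds", nargs="+", type=int)
    return parser


# Parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(
    ["identify", "otus.fasta", "--db", "1", "--mode", "2", "--thresholds", "99", "97"]
)
print(args.function, args.db, args.mode, args.thresholds)  # identify 1 2 [99, 97]
```

Because `--thresholds` uses `nargs="+"` with `type=int`, the values arrive as a list of integers, and the attribute is `None` when the option is omitted.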