rdrpcatch 0.0.1__py3-none-any.whl

@@ -0,0 +1,223 @@
1
+ Metadata-Version: 2.4
2
+ Name: rdrpcatch
3
+ Version: 0.0.1
4
+ Dynamic: Summary
5
+ Project-URL: Home, https://github.com/dimitris-karapliafis/RdRpCATCH
6
+ Project-URL: Source, https://github.com/dimitris-karapliafis/RdRpCATCH
7
+ Author-email: Dimitris Karapliafis <dimitris.karapliafis@wur.nl>, Uri Neri <uneri@lbl.gov>, RdRpCATCH contributors <dimitris.karapliafis@wur.nl>
8
+ License: MIT
9
+ License-File: LICENCE
10
+ Requires-Python: >=3.12
11
+ Requires-Dist: altair==5.5.0
12
+ Requires-Dist: matplotlib==3.10.1
13
+ Requires-Dist: needletail==0.6.3
14
+ Requires-Dist: pandas==2.2.3
15
+ Requires-Dist: polars==1.26.0
16
+ Requires-Dist: pyhmmer==0.11.0
17
+ Requires-Dist: requests==2.32.3
18
+ Requires-Dist: rich-click==1.8.8
19
+ Requires-Dist: rich==13.9.4
20
+ Requires-Dist: upsetplot==0.9.0
21
+ Description-Content-Type: text/markdown
22
+
23
+ # RdRpCATCH
24
+ ## RNA-dependent RNA polymerase Collaborative Analysis Tool with Collections of pHMMs
25
+
26
+
27
+
28
+ RdRpCATCH is a collaborative effort to combine various publicly available RNA virus RNA-dependent RNA polymerase pHMM databases in one tool
29
+ to facilitate their detection in (meta-)transcriptomics data.
30
+
31
+
32
+ RdRpCATCH is written in Python and uses the pyhmmer
33
+ library to perform pHMM searches. The tool scans each sequence (aa or nt) in the input file with the selected databases and reports the best hit (the hit with the highest bitscore across all databases) as output.
34
+ In addition, RdRpCATCH provides information about the number of profiles
35
+ that were positive for each sequence across all pHMM databases, and taxonomic information based on the MMseqs2 easy-taxonomy and search modules against a custom RefSeq Riboviria database.
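The best-hit selection and per-database profile counts described above can be illustrated with a minimal sketch. This is not the RdRpCATCH implementation: the record layout, the function name `summarise_hits`, and the example values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical hit records: (sequence_id, database, profile_name, bitscore).
hits = [
    ("contig_1", "RVMT", "profile_a", 210.5),
    ("contig_1", "NeoRdRp", "profile_b", 305.2),
    ("contig_1", "NeoRdRp", "profile_c", 120.0),
    ("contig_2", "RdRp-Scan", "profile_d", 88.1),
]


def summarise_hits(hits):
    """Return (best hit per sequence, positive-profile counts per database)."""
    best = {}
    profiles = defaultdict(lambda: defaultdict(set))
    for seq_id, database, profile, bitscore in hits:
        # Track every distinct profile that matched this sequence.
        profiles[seq_id][database].add(profile)
        # Keep the hit with the highest bitscore across all databases.
        if seq_id not in best or bitscore > best[seq_id][2]:
            best[seq_id] = (database, profile, bitscore)
    counts = {
        seq_id: {db: len(names) for db, names in per_db.items()}
        for seq_id, per_db in profiles.items()
    }
    return best, counts


best, counts = summarise_hits(hits)
```

Here `best["contig_1"]` is the NeoRdRp hit, while `counts["contig_1"]` records one positive RVMT profile and two positive NeoRdRp profiles.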
36
+
37
+ **The tool has been modified to use [rolypoly](https://code.jgi.doe.gov/UNeri/rolypoly) code/approaches.**
38
+
39
+ ![rdrpcatch_flowchart_v0.png](images%2Frdrpcatch_flowchart_v0.png)
40
+
41
+ Supported databases
42
+ - NeoRdRp <sup>1</sup>: 1182 pHMMs
43
+ - NeoRdRp2 <sup>2</sup>: 19394 pHMMs
44
+ - RVMT <sup>3</sup>: 710 pHMMs
45
+ - RdRp-Scan <sup>4</sup>: 68 pHMMs
46
+ - TSA_Olendraite_fam <sup>5</sup>: 77 pHMMs
47
+ - TSA_Olendraite_gen <sup>6</sup>: 341 pHMMs
48
+ - LucaProt_pHMM <sup>7</sup>: 754 pHMMs
49
+
50
+ 1. Sakaguchi, S. et al. (2022) 'NeoRdRp: A comprehensive dataset for identifying RNA-dependent RNA polymerases of various RNA viruses from metatranscriptomic data', *Microbes and Environments*, 37(3). [doi:10.1264/jsme2.me22001](https://doi.org/10.1264/jsme2.me22001)
51
+ 2. Sakaguchi, S., Nakano, T. and Nakagawa, S. (2024) 'Neordrp2 with improved seed data, annotations, and scoring', *Frontiers in Virology*, 4. [doi:10.3389/fviro.2024.1378695](https://doi.org/10.3389/fviro.2024.1378695)
52
+ 3. Neri, U. et al. (2022) 'Expansion of the global RNA virome reveals diverse clades of bacteriophages', *Cell*, 185(21). [doi:10.1016/j.cell.2022.08.023](https://doi.org/10.1016/j.cell.2022.08.023)
53
+ 4. Charon, J. et al. (2022) 'RDRP-Scan: A bioinformatic resource to identify and annotate divergent RNA viruses in metagenomic sequence data', *Virus Evolution*, 8(2). [doi:10.1093/ve/veac082](https://doi.org/10.1093/ve/veac082)
54
+ 5. Olendraite, I., Brown, K. and Firth, A.E. (2023) 'Identification of RNA virus–derived rdrp sequences in publicly available transcriptomic data sets', *Molecular Biology and Evolution*, 40(4). [doi:10.1093/molbev/msad060](https://doi.org/10.1093/molbev/msad060)
55
+ 6. Olendraite, I. (2021) 'Mining diverse and novel RNA viruses in transcriptomic datasets', Apollo. Available at: [https://www.repository.cam.ac.uk/items/1fabebd2-429b-45c9-b6eb-41d27d0a90c2](https://www.repository.cam.ac.uk/items/1fabebd2-429b-45c9-b6eb-41d27d0a90c2)
56
+ 7. Hou, X. et al. (2024) 'Using artificial intelligence to document the hidden RNA virosphere', *Cell*, 187(24). [doi:10.1016/j.cell.2024.09.027](https://doi.org/10.1016/j.cell.2024.09.027)
57
+
58
+
59
+ ## Installation
60
+
61
+ ### Installation instructions for testing phase
62
+
63
+ RdRpCATCH will be available as a bioconda package soon. For the testing phase, we provide a tarball and a .yaml file to
64
+ install the tool and its dependencies. The .tar.bz2 is created for Linux systems but should work on macOS as well.
65
+ (Windows is not supported)
66
+
67
+ #### Prerequisites
68
+ For the installation process, conda is required. If you don't have conda installed, you can find installation instructions at:
69
+ https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
70
+ Mamba is a faster alternative to conda. If you have it installed, you can use it instead of conda.
71
+
72
+ #### Installation steps
73
+
74
+ The package is available as a bioconda package. You can install it using the following command:
75
+ ```bash
76
+ conda install -c bioconda rdrpcatch
77
+ ```
78
+ or
79
+ ```bash
80
+ conda create -n rdrpcatch -c bioconda rdrpcatch
81
+ ```
82
+
83
+ Alternatively, you can install RdRpCATCH from the Python Package Index (PyPI) using pip. In that case, the non-Python dependencies
84
+ must be installed manually. These dependencies are:
85
+ - mmseqs2
86
+ - seqkit
87
+
88
+ The dependencies can be installed using conda or mamba. Follow these steps:
89
+
90
+ Create a new conda environment and install the dependencies:
91
+ ```bash
92
+ conda create -n rdrpcatch python=3.12
93
+ conda activate rdrpcatch
94
+ conda install -c bioconda mmseqs2==17.17.b804f seqkit==2.10.0
95
+ ```
96
+ Install the tool from pip:
97
+ ```bash
98
+ pip install rdrpcatch
99
+ ```
100
+
101
+ Activate the environment and download the RdRpCATCH databases:
102
+
103
+ ```bash
104
+ conda activate rdrpcatch
105
+ rdrpcatch download --destination_dir path/to/store/databases
106
+ ```
107
+
108
+ * Note 1: The databases are large files and may take some time to download (~ 3 GB).
109
+ * Note 2: The databases are stored in the specified directory, and the path is required to run RdRpCATCH.
110
+
111
+ ## Usage
112
+ RdRpCATCH can be used as a CLI tool as follows:
113
+
114
+ ```bash
115
+ # make sure the conda environment is activated
116
+ # conda activate rdrpcatch
117
+
118
+ # scan the input fasta file with the selected databases
119
+ rdrpcatch scan -i path/to/input.fasta -o path/to/output_dir -db_dir path/to/database
120
+ ```
121
+ ### input:
122
+ The input file can be one or more nucleotide or protein sequences in multi-fasta format.
123
+ The output directory is where the results will be stored. We recommend specifying the sequence type on the command line:
124
+ the optional argument `--seq_type` (nuc or prot) specifies whether the input FASTA sequences are nucleotide or amino acid.
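When the scan is driven from a script, the command line shown above can be assembled programmatically. The sketch below only uses flags documented in this README (`-i`, `-o`, `-db_dir`, `--seq_type`); the helper name `build_scan_command` is hypothetical and is not part of the package.

```python
import shlex


def build_scan_command(input_fasta, output_dir, db_dir, seq_type=None):
    """Assemble an argument list for `rdrpcatch scan` (hypothetical helper)."""
    cmd = [
        "rdrpcatch", "scan",
        "-i", str(input_fasta),
        "-o", str(output_dir),
        "-db_dir", str(db_dir),
    ]
    if seq_type is not None:
        # The README documents two explicit values: nuc or prot.
        if seq_type not in ("nuc", "prot"):
            raise ValueError(f"seq_type must be 'nuc' or 'prot', got {seq_type!r}")
        cmd += ["--seq_type", seq_type]
    return cmd


cmd = build_scan_command(
    "path/to/input.fasta", "path/to/output_dir", "path/to/database", seq_type="nuc"
)
# Print a shell-ready form of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` inside the activated conda environment.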
125
+
126
+ ## Commands
127
+ The following two commands are available in RdRpCATCH:
128
+ * [`rdrpcatch scan`](#rdrpcatch-scan)
129
+ * [`rdrpcatch download`](#rdrpcatch-download)
130
+
131
+ ### rdrpcatch download:
132
+ Command to download pre-compiled databases from Zenodo. If the databases are already downloaded in the specified directory,
133
+ the command will check for updates and download the latest version if available.
134
+
135
+ | Argument | Short Flag | Type | Description |
136
+ |----------|------------|------|-------------------------------------------------------------|
137
+ | `--destination_dir` | `-dest` | PATH | Path to the directory to download HMM databases. [required] |
138
+ | `--concept-doi` | | TEXT | Zenodo Concept DOI for database repository |
139
+ | `--help` | | | Show help message and exit |
140
+ ### rdrpcatch scan:
141
+ Search a given input using selected RdRp databases.
142
+
143
+ | Argument | Short Flag | Type | Description |
144
+ |----------|------------|------|-------------|
145
+ | `--input` | `-i` | FILE | Path to the input FASTA file. [required] |
146
+ | `--output` | `-o` | DIRECTORY | Path to the output directory. [required] |
147
+ | `--db_dir` | `-db_dir` | PATH | Path to the directory containing RdRpCATCH databases. [required] |
148
+ | `--db_options` | `-dbs` | TEXT | Comma-separated list of databases to search against. Valid options: RVMT, NeoRdRp, NeoRdRp.2.1, TSA_Olendraite_fam, TSA_Olendraite_gen, RDRP-scan, Lucaprot, all |
149
+ | `--custom-dbs` | | PATH | Path to directory containing custom MSAs/pHMM files to use as additional databases |
150
+ | `--seq_type` | `-seq_type` | TEXT | Type of sequence to search against: (prot,nuc) Default: unknown |
151
+ | `--verbose` | `-v` | FLAG | Print verbose output. |
152
+ | `--evalue` | `-e` | FLOAT | E-value threshold for HMMsearch. (default: 1e-5) |
153
+ | `--incevalue` | `-incE` | FLOAT | Inclusion E-value threshold for HMMsearch. (default: 1e-5) |
154
+ | `--domevalue` | `-domE` | FLOAT | Domain E-value threshold for HMMsearch. (default: 1e-5) |
155
+ | `--incdomevalue` | `-incdomE` | FLOAT | Inclusion domain E-value threshold for HMMsearch. (default: 1e-5) |
156
+ | `--zvalue` | `-z` | INTEGER | Number of sequences to search against. (default: 1000000) |
157
+ | `--cpus` | `-cpus` | INTEGER | Number of CPUs to use for HMMsearch. (default: 1) |
158
+ | `--length_thr` | `-length_thr` | INTEGER | Minimum length threshold for seqkit seq. (default: 400) |
159
+ | `--gen_code` | `-gen_code` | INTEGER | Genetic code to use for translation. (default: 1) |
160
+ | `--bundle` | `-bundle` | | Bundle the output files into a single archive. (default: False) |
161
+ | `--keep_tmp` | `-keep_tmp` | | Keep the temporary files generated during the analysis. (default: False) |
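The `--db_options` value is a comma-separated list, so a wrapper script may want to check it before launching a long run. The sketch below uses the valid names listed in the table above; the exact-match comparison and the expansion of `all` are assumptions about the behaviour, and `parse_db_options` is a hypothetical helper, not the tool's own parser.

```python
# Valid names as listed in the `--db_options` row of the table.
VALID_DBS = (
    "RVMT", "NeoRdRp", "NeoRdRp.2.1", "TSA_Olendraite_fam",
    "TSA_Olendraite_gen", "RDRP-scan", "Lucaprot", "all",
)


def parse_db_options(value):
    """Split a comma-separated database list and reject unknown names."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in VALID_DBS]
    if unknown:
        raise ValueError(f"Unknown database option(s): {', '.join(unknown)}")
    if "all" in names:
        # Assumption: `all` stands for every named database.
        return [name for name in VALID_DBS if name != "all"]
    return names


selected = parse_db_options("RVMT, NeoRdRp.2.1")
```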
162
+
163
+
164
+
165
+ #### Output files
166
+ rdrpcatch scan will create a folder with the following structure:
167
+
168
+ | Output | Description |
169
+ |--------|------------------------------------------------------------------------------|
170
+ | `{prefix}_rdrpcatch_output_annotated.tsv` | A tab-separated file containing the results of the RdRpCATCH analysis. |
171
+ | `{prefix}_rdrpcatch_fasta` | A directory containing the sequences that were identified as RdRp sequences. |
172
+ | `{prefix}_rdrpcatch_plots` | A directory containing the plots generated during the analysis. |
173
+ | `{prefix}_gff_files` | A directory containing the GFF files generated during the analysis. (For now only based on protein sequences) |
174
+ | `tmp` | A directory containing temporary files generated during the analysis. (Only available if the -keep_tmp flag is used) |
175
+
176
+ #### Output table fields
177
+ A summary of the results is stored in the `{prefix}_rdrpcatch_output_annotated.tsv` file, which contains the following fields:
178
+ | Field | Description |
179
+ |-------|---------------------------------------------------------------------------------------------------------------------|
180
+ | `Contig_name` | The name of the contig. |
181
+ | `Translated_contig_name (frame)` | The name of the translated contig and the frame of the RdRp sequence. |
182
+ | `Sequence_length(AA)` | The length of the RdRp sequence in amino acids. |
183
+ | `Total_databases_that_the_contig_was_detected(No_of_Profiles)` | The names of the databases, and the number of profiles in each, that detected the RdRp sequence. |
184
+ | `Best_hit_Database` | The database with the best hit. |
185
+ | `Best_hit_profile_name` | The name of the profile with the best hit. |
186
+ | `Best_hit_profile_length` | The length of the profile with the best hit. |
187
+ | `Best_hit_e-value` | The e-value of the best hit. |
188
+ | `Best_hit_bitscore` | The bitscore of the best hit. |
189
+ | `RdRp_from(AA)` | The start position of the RdRp sequence, in relation to the amino acid sequence. |
190
+ | `RdRp_to(AA)` | The end position of the RdRp sequence, in relation to the amino acid sequence. |
191
+ | `Best_hit_profile_coverage` | The fraction of the profile that was covered by the RdRp sequence. |
192
+ | `Best_hit_contig_coverage` | The fraction of the contig that was covered by the RdRp sequence. (Based on amino acid sequence) |
193
+ | `MMseqs_Taxonomy_2bLCA` | The taxonomy of the RdRp sequence based on MMseqs2 easy-taxonomy module against a custom RefSeq Riboviria database. |
194
+ | `MMseqs_TopHit_accession` | The accession of the top hit in the RefSeq Riboviria database. |
195
+ | `MMseqs_TopHit_fident` | The fraction of identical matches of the top hit in the RefSeq Riboviria database. |
196
+ | `MMseqs_TopHit_alnlen` | The alignment length of the top hit in the RefSeq Riboviria database. |
197
+ | `MMseqs_TopHit_eval` | The e-value of the top hit in the RefSeq Riboviria database. |
198
+ | `MMseqs_TopHit_bitscore` | The bitscore of the top hit in the RefSeq Riboviria database. |
199
+ | `MMseqs_TopHit_qcov` | The query coverage of the top hit in the RefSeq Riboviria database. |
200
+ | `MMseqs_TopHit_lineage` | The lineage of the top hit in the RefSeq Riboviria database. |
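The annotated table can be post-filtered with the standard library. The sketch below relies only on column names from the field list above; the thresholds are illustrative, the function name `filter_hits` is hypothetical, and it assumes the e-value and coverage columns hold plain numeric values.

```python
import csv
import io


def filter_hits(handle, max_evalue=1e-5, min_profile_coverage=0.5):
    """Keep rows whose best hit passes both (illustrative) thresholds."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        evalue = float(row["Best_hit_e-value"])
        coverage = float(row["Best_hit_profile_coverage"])
        if evalue <= max_evalue and coverage >= min_profile_coverage:
            kept.append(row)
    return kept


# Tiny made-up table standing in for {prefix}_rdrpcatch_output_annotated.tsv.
example = (
    "Contig_name\tBest_hit_Database\tBest_hit_e-value\tBest_hit_profile_coverage\n"
    "contig_1\tRVMT\t1e-30\t0.92\n"
    "contig_2\tNeoRdRp\t1e-6\t0.31\n"
)
kept = filter_hits(io.StringIO(example))
```

For a real run, open the TSV file with `open(path, newline="")` and pass the file handle instead of the in-memory example.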
201
+
202
+ ## Citations
203
+ Manuscript still in preparation. If you use RdRpCATCH, please cite this GitHub repository.
204
+ A precompiled version of the used databases is available at Zenodo DOI: [10.5281/zenodo.14358348](https://doi.org/10.5281/zenodo.14358348).
205
+ If you use RdRpCATCH, please also cite the third-party databases listed under "Supported databases" above.
206
+
207
+ ## Acknowledgements
208
+ RdRpCATCH is a collaborative effort and we would like to thank all the authors and developers of the underlying databases.
209
+
210
+ ## Contact
211
+ Dimitris Karapliafis (dimitris.karapliafis@wur.nl), potentially via slack/teams or an issue in the main repo.
212
+
213
+ ## TODO:
214
+ - [ ] loud logging is linking to the utils.py file, not the actual line of code causing the error.
215
+ - [ ] Add `overwrite` flag
216
+ - [ ] drop `db_dir` argument and use global/environment/config variable that is set after running the `download` command
217
+
218
+
219
+ ## Contributing
220
+ TBD up to Dimitris and Anne
221
+
222
+ ## Licence
223
+ [MIT](LICENCE)
@@ -0,0 +1,19 @@
1
+ rdrpcatch/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
2
+ rdrpcatch/rdrpcatch_wrapper.py,sha256=PLj8KSJ2wbXVKlGhCaQEhGgoFBOMXBKQS9DnukHOgAs,30501
3
+ rdrpcatch/cli/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
4
+ rdrpcatch/cli/args.py,sha256=2E2gXY42hNasUP94HmPxpgVCA1glk_oN7D5ftbu6W2c,15805
5
+ rdrpcatch/rdrpcatch_scripts/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
6
+ rdrpcatch/rdrpcatch_scripts/fetch_dbs.py,sha256=e9ShColfLgBvWSZpGOvY3zKhEgIg3rw1IIV__KX7N-g,11054
7
+ rdrpcatch/rdrpcatch_scripts/format_pyhmmer_out.py,sha256=w4I_7W-fvuT4JmKvZmbJ07Dewm3CBQuQmpMvQutdOqo,25112
8
+ rdrpcatch/rdrpcatch_scripts/gui.py,sha256=he8kx_4VJWB7SVv9XSQPk0DmkOjEFIg-uGMAtDp3t-w,10576
9
+ rdrpcatch/rdrpcatch_scripts/mmseqs_tax.py,sha256=bwzuCxu8nHQ5OC0Yr5Lyvhcyk9OWjuamInqe0T0lc38,3809
10
+ rdrpcatch/rdrpcatch_scripts/paths.py,sha256=Nq08P8GGPKPrzX6u4wQ2Xwn-kQP-pue_yOGMuRjrLdY,4706
11
+ rdrpcatch/rdrpcatch_scripts/plot.py,sha256=Y1mZL7rkKHFKEs2D7T2Qj2kpfiORmFwRLq1LYWqwcJI,5938
12
+ rdrpcatch/rdrpcatch_scripts/run_pyhmmer.py,sha256=9zcMzaIwQ4_-NgYzG9kejxOBaDi-gbzaqpvZti8ZXA4,9008
13
+ rdrpcatch/rdrpcatch_scripts/run_seqkit.py,sha256=5y7DtJ6NLa4sRoBQOcjBfczKlqG_LibNrEqNmKLrHu0,4361
14
+ rdrpcatch/rdrpcatch_scripts/utils.py,sha256=Wx1GXhAPBfJw7x67sOu7WclZzMo0N3O-hxNYTVxc3v4,16780
15
+ rdrpcatch-0.0.1.dist-info/METADATA,sha256=LMx68xrBacLt8cml_tHk6F-7_Uvr3KOHmhyZOD38joA,14131
16
+ rdrpcatch-0.0.1.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
17
+ rdrpcatch-0.0.1.dist-info/entry_points.txt,sha256=uiyoPO41jNz_KVOt2JdPak9NbVei-D8WQ6saMeMBFpE,53
18
+ rdrpcatch-0.0.1.dist-info/licenses/LICENCE,sha256=3jm5vKRMIaiETEFfNN34-oyWUShxZtmDmL38PNAwlUI,1120
19
+ rdrpcatch-0.0.1.dist-info/RECORD,,
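Each RECORD entry above has the form `path,sha256=<digest>,size`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed. A minimal sketch for checking one entry against file contents follows; the helper names are hypothetical, and the final `RECORD,,` line, which carries no hash, is not handled.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the RECORD-style hash string for the given file contents."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def check_record_line(line: str, data: bytes) -> bool:
    """Compare one `path,hash,size` entry with the actual file contents."""
    _path, expected_hash, size = line.rsplit(",", 2)
    return expected_hash == record_digest(data) and int(size) == len(data)


# The three empty __init__.py files share the hash of zero bytes.
entry = "rdrpcatch/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
ok = check_record_line(entry, b"")
```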
@@ -0,0 +1,4 @@
1
+ Wheel-Version: 1.0
2
+ Generator: hatchling 1.27.0
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ rdrpcatch = rdrpcatch.cli.args:cli
@@ -0,0 +1,9 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Dimitris Karapliafis and RdRpCATCH contributors.
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
6
+
7
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
8
+
9
+ THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.