simple-blast 0.0.5__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2023 Andrew Tapia
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,88 @@
1
+ Metadata-Version: 2.1
2
+ Name: simple_blast
3
+ Version: 0.0.5
4
+ Summary: A simple library for executing BLAST searches with ncbi-blast+
5
+ Author-email: Andrew Tapia <andrew.tapia@uky.edu>
6
+ Project-URL: Homepage, https://github.com/actapia/simple_blast
7
+ Project-URL: Bug Tracker, https://github.com/actapia/simple_blast
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Requires-Python: >=3.11
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Requires-Dist: pandas>=1.5.3
15
+
16
+ # simple_blast
17
+
18
+ This is a library that provides a very basic wrapper around ncbi-blast+.
19
+ Currently, the library supports searches with `blastn` only, but I may expand
20
+ the library to include wrappers for other BLAST executables if I need them.
21
+
22
+ ## Requirements
23
+
24
+ This library depends on Pandas for parsing BLAST output. The library has been
25
+ tested with Pandas 1.5.3, but it likely works with other versions.
26
+
27
+ Of course, this library assumes that ncbi-blast+ is installed. The library has
28
+ been tested with ncbi-blast 2.12.0+, and it likely works with newer versions of
29
+ the software as well.
30
+
31
+ ## Basic usage
32
+
33
+ You can define a `blastn` search to be carried out using the `BlastnSearch`
34
+ class. `BlastnSearch` objects are constructed with two required
35
+ arguments&mdash;the subject sequence and the query sequence files, in that
36
+ order. For example, to set up a `blastn` search for sequences in `seqs1.fasta`
37
+ against those in `seqs2.fasta`, you could construct a `BlastnSearch` object like
38
+ this:
39
+
40
+ ```python
41
+ from simple_blast import BlastnSearch
42
+
43
+ search = BlastnSearch("seqs2.fasta", "seqs1.fasta")
44
+ ```
45
+
46
+ The BLAST search is not carried out until you ask for the results by accessing
47
+ the `hits` property of the search. This property returns a Pandas dataframe
48
+ containing the HSPs identified in the BLAST search.
49
+
50
+ ```python
51
+ results = search.hits
52
+ ```
53
+
54
+ The columns in the output may be configured by passing either the `out_columns`
55
+ or `additional_columns` arguments when constructing the `BlastnSearch`. The
56
+ former argument overrides the set of output columns; the latter argument is
57
+ added to the list of default output columns.
58
+
59
+ You can also specify an e-value cutoff through the `evalue` argument.
60
+
61
+ ## DB caches
62
+
63
+ When the same sequence file is used as a subject in multiple searches, it can be
64
+ efficient to build a BLAST database up front. The `BlastDBCache` class can be
65
+ used to handle this mostly automatically. To make a `BlastDBCache`, you need
66
+ to specify the location of the cache on the file system.
67
+
68
+ ```python
69
+ from simple_blast import BlastDBCache
70
+
71
+ cache = BlastDBCache("cache_dir")
72
+ ```
73
+
74
+ To add a file to the cache, use the `makedb` method.
75
+
76
+ ```python
77
+ cache.makedb("seqs2.fasta")
78
+ ```
79
+
80
+ When constructing a `BlastnSearch` object, give it the `BlastDBCache` as the
81
+ `db_cache` parameter to make the `BlastnSearch` object use the cache for
82
+ searches.
83
+
84
+ ```python
85
+ search = BlastnSearch("seqs2.fasta", "seqs1.fasta", db_cache=cache)
86
+ ```
87
+
88
+ Now `search` will use the database we created for `seqs2.fasta`.
@@ -0,0 +1,73 @@
1
+ # simple_blast
2
+
3
+ This is a library that provides a very basic wrapper around ncbi-blast+.
4
+ Currently, the library supports searches with `blastn` only, but I may expand
5
+ the library to include wrappers for other BLAST executables if I need them.
6
+
7
+ ## Requirements
8
+
9
+ This library depends on Pandas for parsing BLAST output. The library has been
10
+ tested with Pandas 1.5.3, but it likely works with other versions.
11
+
12
+ Of course, this library assumes that ncbi-blast+ is installed. The library has
13
+ been tested with ncbi-blast 2.12.0+, and it likely works with newer versions of
14
+ the software as well.
15
+
16
+ ## Basic usage
17
+
18
+ You can define a `blastn` search to be carried out using the `BlastnSearch`
19
+ class. `BlastnSearch` objects are constructed with two required
20
+ arguments&mdash;the subject sequence and the query sequence files, in that
21
+ order. For example, to set up a `blastn` search for sequences in `seqs1.fasta`
22
+ against those in `seqs2.fasta`, you could construct a `BlastnSearch` object like
23
+ this:
24
+
25
+ ```python
26
+ from simple_blast import BlastnSearch
27
+
28
+ search = BlastnSearch("seqs2.fasta", "seqs1.fasta")
29
+ ```
30
+
31
+ The BLAST search is not carried out until you ask for the results by accessing
32
+ the `hits` property of the search. This property returns a Pandas dataframe
33
+ containing the HSPs identified in the BLAST search.
34
+
35
+ ```python
36
+ results = search.hits
37
+ ```
38
+
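The deferred execution described above can be illustrated with a minimal, dependency-free sketch. `LazySearch` and its `runner` argument are hypothetical stand-ins (they are not part of `simple_blast`); the only thing taken from the library is the pattern of running the search on first access to `hits` and reusing the stored result afterwards.

```python
from typing import Callable, List, Optional


class LazySearch:
    """Toy stand-in for a search object whose results are computed on demand."""

    def __init__(self, runner: Callable[[], List[str]]):
        # `runner` is a hypothetical callable standing in for the blastn call.
        self._runner = runner
        self._hits: Optional[List[str]] = None
        self.run_count = 0

    @property
    def hits(self) -> List[str]:
        # Run the search only on first access; later accesses reuse the result.
        if self._hits is None:
            self.run_count += 1
            self._hits = self._runner()
        return self._hits


search = LazySearch(lambda: ["hsp1", "hsp2"])
runs_before_access = search.run_count  # nothing has run yet
first = search.hits                    # triggers the (fake) search
second = search.hits                   # served from the stored result
```

One consequence of this design is that constructing many search objects is cheap; the cost is only paid for the searches whose results are actually read.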
39
+ The columns in the output may be configured by passing either the `out_columns`
40
+ or `additional_columns` arguments when constructing the `BlastnSearch`. The
41
+ former argument overrides the set of output columns; the latter argument is
42
+ added to the list of default output columns.
43
+
44
+ You can also specify an e-value cutoff through the `evalue` argument.
45
+
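To make the two column options concrete, here is a small sketch of how the final column list and the `-outfmt` string appear to be derived. The helper names `resolve_columns` and `outfmt_spec` are made up for this illustration; the default column list and the "concatenate, then prefix with `6`" logic follow `blasting.py` in this package. Note that in the package the two lists are simply concatenated, so passing both arguments yields `out_columns` followed by `additional_columns`.

```python
from typing import List, Optional, Sequence

# Default columns as listed in blasting.py of this package.
DEFAULT_OUT_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def resolve_columns(
    out_columns: Optional[Sequence[str]] = None,
    additional_columns: Sequence[str] = (),
) -> List[str]:
    """Return the base columns (defaults unless overridden) plus any extras."""
    base = DEFAULT_OUT_COLUMNS if out_columns is None else out_columns
    return list(base) + list(additional_columns)


def outfmt_spec(columns: Sequence[str]) -> str:
    """Build the value passed to blastn's -outfmt option (tabular format 6)."""
    return " ".join(["6", *columns])


extended = resolve_columns(additional_columns=["qlen"])
overridden = resolve_columns(out_columns=["qseqid", "sseqid"])
spec = outfmt_spec(overridden)
```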
46
+ ## DB caches
47
+
48
+ When the same sequence file is used as a subject in multiple searches, it can be
49
+ efficient to build a BLAST database up front. The `BlastDBCache` class can be
50
+ used to handle this mostly automatically. To make a `BlastDBCache`, you need
51
+ to specify the location of the cache on the file system.
52
+
53
+ ```python
54
+ from simple_blast import BlastDBCache
55
+
56
+ cache = BlastDBCache("cache_dir")
57
+ ```
58
+
59
+ To add a file to the cache, use the `makedb` method.
60
+
61
+ ```python
62
+ cache.makedb("seqs2.fasta")
63
+ ```
64
+
65
+ When constructing a `BlastnSearch` object, give it the `BlastDBCache` as the
66
+ `db_cache` parameter to make the `BlastnSearch` object use the cache for
67
+ searches.
68
+
69
+ ```python
70
+ search = BlastnSearch("seqs2.fasta", "seqs1.fasta", db_cache=cache)
71
+ ```
72
+
73
+ Now `search` will use the database we created for `seqs2.fasta`.
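As a rough sketch of what using the cache means for the generated command line: when the subject file has an entry in the cache, the search refers to the prebuilt database with `-db`; otherwise it falls back to `-subject` with the FASTA file itself. The function `subject_args` and the plain dictionary are hypothetical stand-ins for the logic in `BlastnSearch._build_blast_command` and for `BlastDBCache`, and the directory name `seqs2_x1y2` is invented (the real directory gets a random suffix).

```python
from pathlib import Path
from typing import Dict, List


def subject_args(subject_path: str, cache: Dict[Path, Path]) -> List[str]:
    """Choose between a cached BLAST database and the raw subject file."""
    key = Path(subject_path)
    if key in cache:
        # A database was built earlier for this file, so point blastn at it.
        return ["-db", str(cache[key])]
    # No cached database: let blastn read the FASTA file directly.
    return ["-subject", subject_path]


# Hypothetical cache contents: one database built for seqs2.fasta.
cache = {Path("seqs2.fasta"): Path("cache_dir") / "seqs2_x1y2" / "db"}

cached = subject_args("seqs2.fasta", cache)
uncached = subject_args("other.fasta", cache)
```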
@@ -0,0 +1,25 @@
1
+ [build-system]
2
+ requires = ["setuptools>=61.0"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "simple_blast"
7
+ version = "0.0.5"
8
+ authors = [
9
+ { name = "Andrew Tapia", email = "andrew.tapia@uky.edu" }
10
+ ]
11
+ description = "A simple library for executing BLAST searches with ncbi-blast+"
12
+ readme = "README.md"
13
+ requires-python = ">=3.11"
14
+ dependencies = [
15
+ "pandas >= 1.5.3"
16
+ ]
17
+ classifiers = [
18
+ "Programming Language :: Python :: 3",
19
+ "License :: OSI Approved :: MIT License",
20
+ "Operating System :: OS Independent",
21
+ ]
22
+
23
+ [project.urls]
24
+ "Homepage" = "https://github.com/actapia/simple_blast"
25
+ "Bug Tracker" = "https://github.com/actapia/simple_blast"
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,4 @@
1
+ from .blasting import BlastnSearch
2
+ from .blastdb_cache import BlastDBCache
3
+
4
+ __all__ = ["BlastnSearch", "BlastDBCache"]
@@ -0,0 +1,73 @@
1
+ import subprocess
2
+ import tempfile
3
+ from pathlib import Path
4
+
5
+ def read_nal_title(nal):
6
+ with open(nal, "r") as nal_file:
7
+ for l in nal_file:
8
+ if not l.startswith("#"):
9
+ split = l.rstrip().split(" ")
10
+ if split[0] == "TITLE":
11
+ return " ".join(split[1:])
12
+
13
+
14
+ class BlastDBCache:
15
+ def __init__(self, location, find_existing=True, parse_seqids=False):
16
+ self.location = location
17
+ self._cache = {}
18
+ if find_existing:
19
+ for path in Path(location).glob("*/*.nal"):
20
+ self._cache[Path(read_nal_title(path))] = path.parent / "db"
21
+ self._parse_seqids = parse_seqids
22
+
23
+ @property
24
+ def parse_seqids(self):
25
+ return self._parse_seqids
26
+
27
+ def _build_makeblastdb_command(self, seq_file_path, db_name):
28
+ command = [
29
+ "makeblastdb",
30
+ "-in",
31
+ str(seq_file_path),
32
+ "-out",
33
+ db_name,
34
+ "-dbtype",
35
+ "nucl",
36
+ "-hash_index" # Do I need this?
37
+ ]
38
+ if self._parse_seqids:
39
+ command.append("-parse_seqids")
40
+ return command
41
+
42
+ def makedb(self, seq_file_path):
43
+ seq_file_path = Path(seq_file_path)
44
+ if seq_file_path in self._cache:
45
+ return
46
+ seq_name = seq_file_path.stem
47
+ tempdir = Path(
48
+ tempfile.mkdtemp(
49
+ prefix=seq_name,
50
+ dir=self.location
51
+ )
52
+ )
53
+ db_name = str(tempdir / "db")
54
+ proc = subprocess.Popen(
55
+ self._build_makeblastdb_command(seq_file_path, db_name),
56
+ stdout=subprocess.DEVNULL,
57
+ stderr=subprocess.DEVNULL
58
+ )
59
+ proc.communicate()
60
+ if proc.returncode:
61
+ raise subprocess.CalledProcessError(proc.returncode, proc.args)
62
+ self._cache[seq_file_path] = db_name
63
+
64
+ def __getitem__(self, k):
65
+ return self._cache[Path(k)]
66
+
67
+ def __delitem__(self, k):
68
+ del self._cache[Path(k)]
69
+
70
+ def __contains__(self, k):
71
+ return Path(k) in self._cache
72
+
73
+
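The `find_existing` scan above relies on reading a `TITLE` line from each `.nal` alias file. The sketch below reproduces that parsing on throwaway files so the behaviour is easy to see, including the case where no `TITLE` line exists. The function name `read_title` and the alias-file contents are assumptions made for this example; in the module as written, `read_nal_title` implicitly returns `None` in that case, and the constructor would then pass `None` to `Path`, which raises a `TypeError`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_title(alias_path: Path) -> Optional[str]:
    """Return the text after TITLE in an alias file, or None if absent."""
    with open(alias_path, "r") as alias_file:
        for line in alias_file:
            if line.startswith("#"):
                continue  # skip comment lines
            fields = line.rstrip().split(" ")
            if fields[0] == "TITLE":
                return " ".join(fields[1:])
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical alias files; real ones are written by makeblastdb.
    with_title = Path(tmp) / "db.nal"
    with_title.write_text("# Alias file\nTITLE seqs2.fasta\nDBLIST db.00 db.01\n")
    without_title = Path(tmp) / "other.nal"
    without_title.write_text("# Alias file\nDBLIST db.00\n")

    found = read_title(with_title)
    missing = read_title(without_title)
```

A caller that scans a directory could skip entries for which the result is `None` instead of building a cache key from them.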
@@ -0,0 +1,220 @@
1
+ import subprocess
2
+ import pandas as pd
3
+ from typing import List, Optional
4
+
5
+ from .blastdb_cache import BlastDBCache
6
+
7
+ default_out_columns = ['qseqid',
8
+ 'sseqid',
9
+ 'pident',
10
+ 'length',
11
+ 'mismatch',
12
+ 'gapopen',
13
+ 'qstart',
14
+ 'qend',
15
+ 'sstart',
16
+ 'send',
17
+ 'evalue',
18
+ 'bitscore']
19
+
20
+ yes_no = ["no", "yes"]
21
+
22
+ class BlastnSearch:
23
+ """A search (alignment) to be made with blastn.
24
+
25
+ This class provides a programmer-friendly way to define the parameters of a
26
+ simple blastn search, carry out the search, and parse the results.
27
+
28
+ The most useful property of a BlastnSearch instance is hits. hits runs the
29
+ defined blastn search (if it hasn't been run already), parses the results,
30
+ stores them in a pandas dataframe, and returns the result.
31
+
32
+ Values passed to the constructor may be retrieved through the class's
33
+ properties.
34
+
35
+ Attributes:
36
+ debug (bool): Whether to enable debug features for this instance.
37
+ """
38
+ def __init__(
39
+ self,
40
+ seq1_path: str,
41
+ seq2_path: str,
42
+ evalue: float = 1e-20,
43
+ out_columns: List[str] = default_out_columns,
44
+ additional_columns: List[str] = [],
45
+ db_cache: Optional[BlastDBCache] = None,
46
+ threads: int = 1,
47
+ dust: bool = True,
48
+ task: Optional[str] = None,
49
+ max_targets: int = 500,
50
+ n_seqidlist: Optional[str] = None,
51
+ debug: bool = False
52
+ ):
53
+ """Construct a BlastnSearch with the specified settings.
54
+
55
+ This constructor requires paths to FASTA files containing the query and
56
+ subject sequences to use in the search.
57
+
58
+ Optionally, the caller may provide an expect value cutoff to use for the
59
+ search. If no value is provided, a default evalue of 1e-20 will be used.
60
+
61
+ The caller may specify what columns should be included in the output.
62
+ By default, the included columns are
63
+             qseqid
64
+ sseqid
65
+ pident
66
+ length
67
+             mismatch
68
+ gapopen
69
+ qstart
70
+ qend
71
+ sstart
72
+ send
73
+ evalue
74
+ bitscore
75
+
76
+ Explanations of these columns may be found at
77
+ https://www.metagenomics.wiki/tools/blast/blastn-output-format-6
78
+
79
+ If the caller desires to include additional columns, it may provide
80
+ them to the additional_columns parameter.
81
+
82
+ Parameters:
83
+             seq1_path (str): Path to subject sequence FASTA file.
84
+             seq2_path (str): Path to query sequence FASTA file.
85
+ evalue (float): Expect value cutoff to use in BLAST search.
86
+ out_columns: Output columns to include in results.
87
+ additional_columns: Additional output columns to include in results.
88
+ db_cache: BlastDBCache that tells where to find BLAST DBs.
89
+ threads (int): Number of threads to use for BLAST search.
90
+ dust (bool): Filter low-complexity regions from search.
91
+ task (str): Parameter preset to use.
92
+ max_targets (int): Maximum number of target seqs to include.
93
+ n_seqidlist (str): Specifies seqids to ignore.
94
+ debug (bool): Whether to enable debug features.
95
+ """
96
+ self._seq1_path = seq1_path
97
+ self._seq2_path = seq2_path
98
+ self._evalue = evalue
99
+ self._hits = None
100
+ self._out_columns = list(out_columns + additional_columns)
101
+ self._db_cache = db_cache
102
+ self._threads = threads
103
+ self._dust = dust
104
+ self._task = task
105
+ self._max_targets = max_targets
106
+ self._negative_seqidlist = n_seqidlist
107
+ # If you really need to add extra arguments, you can do it by setting
108
+ # the _extra_args attribute.
109
+ self._extra_args = []
110
+ self.debug = debug
111
+
112
+ @property
113
+     def seq1_path(self) -> str:
114
+         """Return the subject sequence path."""
115
+         return self._seq1_path
116
+
117
+     @property
118
+     def seq2_path(self) -> str:
119
+         """Return the query sequence path."""
120
+ return self._seq2_path
121
+
122
+ @property
123
+ def evalue(self) -> float:
124
+ """Return the expect value used as a cutoff in the blastn search."""
125
+ return self._evalue
126
+
127
+ @property
128
+ def db_cache(self) -> Optional[BlastDBCache]:
129
+ """Return a cache of BLAST DBs to be used in the search."""
130
+ return self._db_cache
131
+
132
+ @property
133
+ def hits(self) -> pd.DataFrame:
134
+ """Return a dataframe containing this search's BLAST results."""
135
+ if self._hits is None:
136
+ self._get_hits()
137
+ return self._hits
138
+
139
+ @property
140
+ def threads(self) -> int:
141
+ """Return the number of threads to use for the search."""
142
+ return self._threads
143
+
144
+ @property
145
+ def dust(self) -> bool:
146
+ """Return whether to filter low-complexity regions."""
147
+ return self._dust
148
+
149
+ @property
150
+ def task(self) -> Optional[str]:
151
+ """Return the name of the parameter preset to use."""
152
+ return self._task
153
+
154
+ @property
155
+ def max_targets(self) -> int:
156
+ """Return the maximum number of target sequences."""
157
+ return self._max_targets
158
+
159
+ @property
160
+ def negative_seqidlist(self) -> Optional[str]:
161
+ """Return a path to a list of sequence IDs to ignore."""
162
+ return self._negative_seqidlist
163
+
164
+
165
+ # def __len__(self) -> int:
166
+ # return len(self.hits)
167
+
168
+ # def __iter__(self):
169
+ # yield from self.hits
170
+
171
+ def _build_blast_command(self):
172
+ command = ["blastn"]
173
+ if self._db_cache and self.seq1_path in self._db_cache:
174
+ command = command + ["-db", str(self._db_cache[self.seq1_path])]
175
+ else:
176
+ command = command + ["-subject", self.seq1_path]
177
+ if self._task is not None:
178
+ command = command + ["-task", self._task]
179
+ if self._negative_seqidlist is not None:
180
+ command = command + [
181
+ "-negative_seqidlist",
182
+ self._negative_seqidlist
183
+ ]
184
+ command = command + [
185
+ "-query",
186
+ str(self.seq2_path),
187
+ "-evalue",
188
+ str(self.evalue),
189
+ "-outfmt",
190
+ " ".join(["6"] + self._out_columns),
191
+ "-num_threads",
192
+ str(self._threads),
193
+ "-dust",
194
+ yes_no[self._dust],
195
+ "-max_target_seqs",
196
+ str(self._max_targets)
197
+ ] + self._extra_args
198
+ #print(" ".join(command), file=sys.stderr)
199
+ return command
200
+
201
+
202
+ def _get_hits(self):
203
+ proc = subprocess.Popen(
204
+ self._build_blast_command(),
205
+ stdout=subprocess.PIPE,
206
+ stderr=subprocess.PIPE
207
+ )
208
+ self._hits = pd.read_csv(
209
+ proc.stdout,
210
+ names=self._out_columns,
211
+             sep=r"\s+"
212
+ )
213
+ # from IPython import embed
214
+ # embed()
215
+ proc.communicate()
216
+ if proc.returncode:
217
+ if self.debug:
218
+ from IPython import embed
219
+ embed()
220
+ raise subprocess.CalledProcessError(proc.returncode, proc.args)
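`_get_hits` reads the child's stdout while stderr is also piped but not read until `communicate()` is called; if the child wrote a lot to stderr, that arrangement could in principle stall. The following sketch shows one alternative under stated assumptions: it collects both streams with `subprocess.run` and parses tab-separated rows with the standard `csv` module rather than pandas, purely to stay dependency-free. `run_and_parse` is a hypothetical helper, and a short Python one-liner stands in for `blastn` so the example runs without BLAST installed.

```python
import csv
import io
import subprocess
import sys
from typing import Dict, List, Sequence


def run_and_parse(
    command: Sequence[str], columns: Sequence[str]
) -> List[Dict[str, str]]:
    """Run a command and parse its tab-separated stdout into row dicts."""
    # run() waits for the child and collects stdout and stderr together,
    # so neither pipe is left unread while the other is being consumed.
    proc = subprocess.run(command, capture_output=True, text=True)
    if proc.returncode:
        raise subprocess.CalledProcessError(
            proc.returncode, proc.args, proc.stdout, proc.stderr
        )
    reader = csv.DictReader(
        io.StringIO(proc.stdout), fieldnames=list(columns), delimiter="\t"
    )
    return list(reader)


# Stand-in for blastn: prints a single tab-separated row.
fake_blastn = [sys.executable, "-c", "print('q1\\ts1\\t98.5')"]
rows = run_and_parse(fake_blastn, ["qseqid", "sseqid", "pident"])
```

Keeping stderr on the raised exception also makes failures easier to diagnose than discarding it.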
@@ -0,0 +1,88 @@
1
+ Metadata-Version: 2.1
2
+ Name: simple_blast
3
+ Version: 0.0.5
4
+ Summary: A simple library for executing BLAST searches with ncbi-blast+
5
+ Author-email: Andrew Tapia <andrew.tapia@uky.edu>
6
+ Project-URL: Homepage, https://github.com/actapia/simple_blast
7
+ Project-URL: Bug Tracker, https://github.com/actapia/simple_blast
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Requires-Python: >=3.11
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Requires-Dist: pandas>=1.5.3
15
+
16
+ # simple_blast
17
+
18
+ This is a library that provides a very basic wrapper around ncbi-blast+.
19
+ Currently, the library supports searches with `blastn` only, but I may expand
20
+ the library to include wrappers for other BLAST executables if I need them.
21
+
22
+ ## Requirements
23
+
24
+ This library depends on Pandas for parsing BLAST output. The library has been
25
+ tested with Pandas 1.5.3, but it likely works with other versions.
26
+
27
+ Of course, this library assumes that ncbi-blast+ is installed. The library has
28
+ been tested with ncbi-blast 2.12.0+, and it likely works with newer versions of
29
+ the software as well.
30
+
31
+ ## Basic usage
32
+
33
+ You can define a `blastn` search to be carried out using the `BlastnSearch`
34
+ class. `BlastnSearch` objects are constructed with two required
35
+ arguments&mdash;the subject sequence and the query sequence files, in that
36
+ order. For example, to set up a `blastn` search for sequences in `seqs1.fasta`
37
+ against those in `seqs2.fasta`, you could construct a `BlastnSearch` object like
38
+ this:
39
+
40
+ ```python
41
+ from simple_blast import BlastnSearch
42
+
43
+ search = BlastnSearch("seqs2.fasta", "seqs1.fasta")
44
+ ```
45
+
46
+ The BLAST search is not carried out until you ask for the results by accessing
47
+ the `hits` property of the search. This property returns a Pandas dataframe
48
+ containing the HSPs identified in the BLAST search.
49
+
50
+ ```python
51
+ results = search.hits
52
+ ```
53
+
54
+ The columns in the output may be configured by passing either the `out_columns`
55
+ or `additional_columns` arguments when constructing the `BlastnSearch`. The
56
+ former argument overrides the set of output columns; the latter argument is
57
+ added to the list of default output columns.
58
+
59
+ You can also specify an e-value cutoff through the `evalue` argument.
60
+
61
+ ## DB caches
62
+
63
+ When the same sequence file is used as a subject in multiple searches, it can be
64
+ efficient to build a BLAST database up front. The `BlastDBCache` class can be
65
+ used to handle this mostly automatically. To make a `BlastDBCache`, you need
66
+ to specify the location of the cache on the file system.
67
+
68
+ ```python
69
+ from simple_blast import BlastDBCache
70
+
71
+ cache = BlastDBCache("cache_dir")
72
+ ```
73
+
74
+ To add a file to the cache, use the `makedb` method.
75
+
76
+ ```python
77
+ cache.makedb("seqs2.fasta")
78
+ ```
79
+
80
+ When constructing a `BlastnSearch` object, give it the `BlastDBCache` as the
81
+ `db_cache` parameter to make the `BlastnSearch` object use the cache for
82
+ searches.
83
+
84
+ ```python
85
+ search = BlastnSearch("seqs2.fasta", "seqs1.fasta", db_cache=cache)
86
+ ```
87
+
88
+ Now `search` will use the database we created for `seqs2.fasta`.
@@ -0,0 +1,11 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ src/simple_blast/__init__.py
5
+ src/simple_blast/blastdb_cache.py
6
+ src/simple_blast/blasting.py
7
+ src/simple_blast.egg-info/PKG-INFO
8
+ src/simple_blast.egg-info/SOURCES.txt
9
+ src/simple_blast.egg-info/dependency_links.txt
10
+ src/simple_blast.egg-info/requires.txt
11
+ src/simple_blast.egg-info/top_level.txt
@@ -0,0 +1 @@
1
+ pandas>=1.5.3
@@ -0,0 +1 @@
1
+ simple_blast