simple-blast 0.0.5__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- simple_blast-0.0.5/LICENSE +21 -0
- simple_blast-0.0.5/PKG-INFO +88 -0
- simple_blast-0.0.5/README.md +73 -0
- simple_blast-0.0.5/pyproject.toml +25 -0
- simple_blast-0.0.5/setup.cfg +4 -0
- simple_blast-0.0.5/src/simple_blast/__init__.py +4 -0
- simple_blast-0.0.5/src/simple_blast/blastdb_cache.py +73 -0
- simple_blast-0.0.5/src/simple_blast/blasting.py +220 -0
- simple_blast-0.0.5/src/simple_blast.egg-info/PKG-INFO +88 -0
- simple_blast-0.0.5/src/simple_blast.egg-info/SOURCES.txt +11 -0
- simple_blast-0.0.5/src/simple_blast.egg-info/dependency_links.txt +1 -0
- simple_blast-0.0.5/src/simple_blast.egg-info/requires.txt +1 -0
- simple_blast-0.0.5/src/simple_blast.egg-info/top_level.txt +1 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Andrew Tapia
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,88 @@
+Metadata-Version: 2.1
+Name: simple_blast
+Version: 0.0.5
+Summary: A simple library for executing BLAST searches with ncbi-blast+
+Author-email: Andrew Tapia <andrew.tapia@uky.edu>
+Project-URL: Homepage, https://github.com/actapia/simple_blast
+Project-URL: Bug Tracker, https://github.com/actapia/simple_blast
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas>=1.5.3
+
+# simple_blast
+
+This is a library that provides a very basic wrapper around ncbi-blast+.
+Currently, the library supports searches with `blastn` only, but I may expand
+the library to include wrappers for other BLAST executables if I need them.
+
+## Requirements
+
+This library depends on Pandas for parsing BLAST output. The library has been
+tested with Pandas 1.5.3, but it likely works with other versions.
+
+Of course, this library assumes that ncbi-blast+ is installed. The library has
+been tested with ncbi-blast 2.12.0+, and it likely works with newer versions of
+the software as well.
+
+## Basic usage
+
+You can define a `blastn` search to be carried out using the `BlastnSearch`
+class. `BlastnSearch` objects are constructed with two required
+arguments: the subject sequence and the query sequence files, in that
+order. For example, to set up a `blastn` search for sequences in `seqs1.fasta`
+against those in `seqs2.fasta`, you could construct a `BlastnSearch` object like
+this:
+
+```python
+from simple_blast import BlastnSearch
+
+search = BlastnSearch("seqs2.fasta", "seqs1.fasta")
+```
+
+The BLAST search is not carried out until you ask for the results by accessing
+the `hits` property of the search. This property returns a Pandas dataframe
+containing the HSPs identified in the BLAST search.
+
+```python
+results = search.hits
+```
+
+The columns in the output may be configured by passing either the `out_columns`
+or `additional_columns` arguments when constructing the `BlastnSearch`. The
+former argument overrides the set of output columns; the latter argument is
+added to the list of default output columns.
+
+You can also specify an e-value cutoff through the `evalue` argument.
+
+## DB caches
+
+When the same sequence file is used as a subject in multiple searches, it can be
+efficient to build a BLAST database up front. The `BlastDBCache` class can be
+used to handle this mostly automatically. To make a `BlastDBCache`, you need
+to specify the location of the cache on the file system.
+
+```python
+from simple_blast import BlastDBCache
+
+cache = BlastDBCache("cache_dir")
+```
+
+To add a file to the cache, use the `makedb` method.
+
+```python
+cache.makedb("seqs2.fasta")
+```
+
+When constructing a `BlastnSearch` object, give it the `BlastDBCache` as the
+`db_cache` parameter to make the `BlastnSearch` object use the cache for
+searches.
+
+```python
+search = BlastnSearch("seqs2.fasta", "seqs1.fasta", db_cache=cache)
+```
+
+Now `search` will use the database we created for `seqs2.fasta`.
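The deferred execution described in the README text (the search runs on first access to `hits`) follows a common lazy-property pattern. Here is a minimal sketch of that pattern, assuming a hypothetical `LazySearch` class whose `_run` method stands in for the real blastn call; neither name is part of simple_blast.

```python
class LazySearch:
    """Hypothetical illustration of a lazily computed, cached result."""

    def __init__(self):
        self._hits = None
        self.runs = 0  # counts how often the expensive step executes

    def _run(self):
        # Stand-in for launching blastn and parsing its output.
        self.runs += 1
        return [("query1", "subject1", 99.5)]

    @property
    def hits(self):
        # Compute on first access only; later accesses reuse the result.
        if self._hits is None:
            self._hits = self._run()
        return self._hits


search = LazySearch()
first = search.hits
second = search.hits
print(search.runs)
```

Constructing the object is cheap under this pattern, and repeated reads of `hits` do not repeat the expensive step.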
@@ -0,0 +1,73 @@
+# simple_blast
+
+This is a library that provides a very basic wrapper around ncbi-blast+.
+Currently, the library supports searches with `blastn` only, but I may expand
+the library to include wrappers for other BLAST executables if I need them.
+
+## Requirements
+
+This library depends on Pandas for parsing BLAST output. The library has been
+tested with Pandas 1.5.3, but it likely works with other versions.
+
+Of course, this library assumes that ncbi-blast+ is installed. The library has
+been tested with ncbi-blast 2.12.0+, and it likely works with newer versions of
+the software as well.
+
+## Basic usage
+
+You can define a `blastn` search to be carried out using the `BlastnSearch`
+class. `BlastnSearch` objects are constructed with two required
+arguments: the subject sequence and the query sequence files, in that
+order. For example, to set up a `blastn` search for sequences in `seqs1.fasta`
+against those in `seqs2.fasta`, you could construct a `BlastnSearch` object like
+this:
+
+```python
+from simple_blast import BlastnSearch
+
+search = BlastnSearch("seqs2.fasta", "seqs1.fasta")
+```
+
+The BLAST search is not carried out until you ask for the results by accessing
+the `hits` property of the search. This property returns a Pandas dataframe
+containing the HSPs identified in the BLAST search.
+
+```python
+results = search.hits
+```
+
+The columns in the output may be configured by passing either the `out_columns`
+or `additional_columns` arguments when constructing the `BlastnSearch`. The
+former argument overrides the set of output columns; the latter argument is
+added to the list of default output columns.
+
+You can also specify an e-value cutoff through the `evalue` argument.
+
+## DB caches
+
+When the same sequence file is used as a subject in multiple searches, it can be
+efficient to build a BLAST database up front. The `BlastDBCache` class can be
+used to handle this mostly automatically. To make a `BlastDBCache`, you need
+to specify the location of the cache on the file system.
+
+```python
+from simple_blast import BlastDBCache
+
+cache = BlastDBCache("cache_dir")
+```
+
+To add a file to the cache, use the `makedb` method.
+
+```python
+cache.makedb("seqs2.fasta")
+```
+
+When constructing a `BlastnSearch` object, give it the `BlastDBCache` as the
+`db_cache` parameter to make the `BlastnSearch` object use the cache for
+searches.
+
+```python
+search = BlastnSearch("seqs2.fasta", "seqs1.fasta", db_cache=cache)
+```
+
+Now `search` will use the database we created for `seqs2.fasta`.
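As a reading aid for the README hunk, the interaction between `out_columns` and `additional_columns` can be sketched without running BLAST. The helpers `resolve_columns` and `outfmt_argument` are hypothetical names, not part of simple_blast. They only mirror the list concatenation and the `-outfmt` string that `blasting.py` builds in this release, and `qlen` is a standard blastn format specifier chosen for illustration.

```python
from typing import List, Optional

# Default columns, copied from default_out_columns in blasting.py.
DEFAULT_OUT_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def resolve_columns(
        out_columns: Optional[List[str]] = None,
        additional_columns: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: combine the two column arguments."""
    base = DEFAULT_OUT_COLUMNS if out_columns is None else out_columns
    extra = [] if additional_columns is None else additional_columns
    # A new list is built, so the defaults are never mutated.
    return list(base) + list(extra)


def outfmt_argument(columns: List[str]) -> str:
    """Build the value passed to blastn's -outfmt option (tabular format 6)."""
    return " ".join(["6"] + columns)


print(resolve_columns(additional_columns=["qlen"])[-2:])
print(outfmt_argument(resolve_columns(out_columns=["qseqid", "sseqid"])))
```

One detail this makes visible: when both arguments are given, the additional columns are appended to whatever `out_columns` was set to, not only to the defaults.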
@@ -0,0 +1,25 @@
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "simple_blast"
+version = "0.0.5"
+authors = [
+    { name = "Andrew Tapia", email = "andrew.tapia@uky.edu" }
+]
+description = "A simple library for executing BLAST searches with ncbi-blast+"
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+    "pandas >= 1.5.3"
+]
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+]
+
+[project.urls]
+"Homepage" = "https://github.com/actapia/simple_blast"
+"Bug Tracker" = "https://github.com/actapia/simple_blast"
@@ -0,0 +1,73 @@
+import subprocess
+import tempfile
+from pathlib import Path
+
+def read_nal_title(nal):
+    with open(nal, "r") as nal_file:
+        for l in nal_file:
+            if not l.startswith("#"):
+                split = l.rstrip().split(" ")
+                if split[0] == "TITLE":
+                    return " ".join(split[1:])
+
+
+class BlastDBCache:
+    def __init__(self, location, find_existing=True, parse_seqids=False):
+        self.location = location
+        self._cache = {}
+        if find_existing:
+            for path in Path(location).glob("*/*.nal"):
+                self._cache[Path(read_nal_title(path))] = path.parent / "db"
+        self._parse_seqids = parse_seqids
+
+    @property
+    def parse_seqids(self):
+        return self._parse_seqids
+
+    def _build_makeblastdb_command(self, seq_file_path, db_name):
+        command = [
+            "makeblastdb",
+            "-in",
+            str(seq_file_path),
+            "-out",
+            db_name,
+            "-dbtype",
+            "nucl",
+            "-hash_index" # Do I need this?
+        ]
+        if self._parse_seqids:
+            command.append("-parse_seqids")
+        return command
+
+    def makedb(self, seq_file_path):
+        seq_file_path = Path(seq_file_path)
+        if seq_file_path in self._cache:
+            return
+        seq_name = seq_file_path.stem
+        tempdir = Path(
+            tempfile.mkdtemp(
+                prefix=seq_name,
+                dir=self.location
+            )
+        )
+        db_name = str(tempdir / "db")
+        proc = subprocess.Popen(
+            self._build_makeblastdb_command(seq_file_path, db_name),
+            stdout=subprocess.DEVNULL,
+            stderr=subprocess.DEVNULL
+        )
+        proc.communicate()
+        if proc.returncode:
+            raise subprocess.CalledProcessError(proc.returncode, proc.args)
+        self._cache[seq_file_path] = db_name
+
+    def __getitem__(self, k):
+        return self._cache[Path(k)]
+
+    def __delitem__(self, k):
+        del self._cache[Path(k)]
+
+    def __contains__(self, k):
+        return Path(k) in self._cache
+
+
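The `find_existing` branch of `BlastDBCache.__init__` relies on `read_nal_title` to map each alias file back to the sequence file it was built from. Below is a minimal sketch of that parsing, assuming an alias file whose `TITLE` line holds the original FASTA path. The file contents are hand-written for illustration and were not produced by `makeblastdb`; the explicit `return None` is an addition for readability.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_nal_title(nal) -> Optional[str]:
    """Return the text after the TITLE keyword, or None if there is none."""
    with open(nal, "r") as nal_file:
        for line in nal_file:
            if line.startswith("#"):
                continue  # skip comment lines
            fields = line.rstrip().split(" ")
            if fields[0] == "TITLE":
                # Re-join so titles containing spaces survive intact.
                return " ".join(fields[1:])
    return None


# Illustrative alias file; the contents are an assumption, not real output.
with tempfile.TemporaryDirectory() as tmp:
    nal_path = Path(tmp) / "db.nal"
    nal_path.write_text(
        "# Alias file (hand-written example)\n"
        "TITLE my data/seqs2.fasta\n"
        "DBLIST db.00 db.01\n"
    )
    title = read_nal_title(nal_path)

print(title)
```

If an alias file has no `TITLE` line the function returns `None`, and `Path(None)` raises `TypeError`, so the cache constructor in this release would likely fail on such a file.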
@@ -0,0 +1,220 @@
+import subprocess
+import pandas as pd
+from typing import List, Optional
+
+from .blastdb_cache import BlastDBCache
+
+default_out_columns = ['qseqid',
+                       'sseqid',
+                       'pident',
+                       'length',
+                       'mismatch',
+                       'gapopen',
+                       'qstart',
+                       'qend',
+                       'sstart',
+                       'send',
+                       'evalue',
+                       'bitscore']
+
+yes_no = ["no", "yes"]
+
+class BlastnSearch:
+    """A search (alignment) to be made with blastn.
+
+    This class provides a programmer-friendly way to define the parameters of a
+    simple blastn search, carry out the search, and parse the results.
+
+    The most useful property of a BlastnSearch instance is hits. hits runs the
+    defined blastn search (if it hasn't been run already), parses the results,
+    stores them in a pandas dataframe, and returns the result.
+
+    Values passed to the constructor may be retrieved through the class's
+    properties.
+
+    Attributes:
+        debug (bool): Whether to enable debug features for this instance.
+    """
+    def __init__(
+            self,
+            seq1_path: str,
+            seq2_path: str,
+            evalue: float = 1e-20,
+            out_columns: List[str] = default_out_columns,
+            additional_columns: List[str] = [],
+            db_cache: Optional[BlastDBCache] = None,
+            threads: int = 1,
+            dust: bool = True,
+            task: Optional[str] = None,
+            max_targets: int = 500,
+            n_seqidlist: Optional[str] = None,
+            debug: bool = False
+    ):
+        """Construct a BlastnSearch with the specified settings.
+
+        This constructor requires paths to FASTA files containing the subject
+        and query sequences to use in the search, in that order.
+
+        Optionally, the caller may provide an expect value cutoff to use for the
+        search. If no value is provided, a default evalue of 1e-20 will be used.
+
+        The caller may specify what columns should be included in the output.
+        By default, the included columns are
+
+            sseqid
+            pident
+            length
+            mismatch
+            gapopen
+            qstart
+            qend
+            sstart
+            send
+            evalue
+            bitscore
+
+        Explanations of these columns may be found at
+        https://www.metagenomics.wiki/tools/blast/blastn-output-format-6
+
+        If the caller desires to include additional columns, it may provide
+        them to the additional_columns parameter.
+
+        Parameters:
+            seq1_path (str): Path to subject sequence FASTA file.
+            seq2_path (str): Path to query sequence FASTA file.
+            evalue (float): Expect value cutoff to use in BLAST search.
+            out_columns: Output columns to include in results.
+            additional_columns: Additional output columns to include in results.
+            db_cache: BlastDBCache that tells where to find BLAST DBs.
+            threads (int): Number of threads to use for BLAST search.
+            dust (bool): Filter low-complexity regions from search.
+            task (str): Parameter preset to use.
+            max_targets (int): Maximum number of target seqs to include.
+            n_seqidlist (str): Specifies seqids to ignore.
+            debug (bool): Whether to enable debug features.
+        """
+        self._seq1_path = seq1_path
+        self._seq2_path = seq2_path
+        self._evalue = evalue
+        self._hits = None
+        self._out_columns = list(out_columns + additional_columns)
+        self._db_cache = db_cache
+        self._threads = threads
+        self._dust = dust
+        self._task = task
+        self._max_targets = max_targets
+        self._negative_seqidlist = n_seqidlist
+        # If you really need to add extra arguments, you can do it by setting
+        # the _extra_args attribute.
+        self._extra_args = []
+        self.debug = debug
+
+    @property
+    def seq1_path(self) -> str:
+        """Return the subject sequence path."""
+        return self._seq1_path
+
+    @property
+    def seq2_path(self) -> str:
+        """Return the query sequence path."""
+        return self._seq2_path
+
+    @property
+    def evalue(self) -> float:
+        """Return the expect value used as a cutoff in the blastn search."""
+        return self._evalue
+
+    @property
+    def db_cache(self) -> Optional[BlastDBCache]:
+        """Return a cache of BLAST DBs to be used in the search."""
+        return self._db_cache
+
+    @property
+    def hits(self) -> pd.DataFrame:
+        """Return a dataframe containing this search's BLAST results."""
+        if self._hits is None:
+            self._get_hits()
+        return self._hits
+
+    @property
+    def threads(self) -> int:
+        """Return the number of threads to use for the search."""
+        return self._threads
+
+    @property
+    def dust(self) -> bool:
+        """Return whether to filter low-complexity regions."""
+        return self._dust
+
+    @property
+    def task(self) -> Optional[str]:
+        """Return the name of the parameter preset to use."""
+        return self._task
+
+    @property
+    def max_targets(self) -> int:
+        """Return the maximum number of target sequences."""
+        return self._max_targets
+
+    @property
+    def negative_seqidlist(self) -> Optional[str]:
+        """Return a path to a list of sequence IDs to ignore."""
+        return self._negative_seqidlist
+
+
+    # def __len__(self) -> int:
+    # return len(self.hits)
+
+    # def __iter__(self):
+    # yield from self.hits
+
+    def _build_blast_command(self):
+        command = ["blastn"]
+        if self._db_cache and self.seq1_path in self._db_cache:
+            command = command + ["-db", str(self._db_cache[self.seq1_path])]
+        else:
+            command = command + ["-subject", self.seq1_path]
+        if self._task is not None:
+            command = command + ["-task", self._task]
+        if self._negative_seqidlist is not None:
+            command = command + [
+                "-negative_seqidlist",
+                self._negative_seqidlist
+            ]
+        command = command + [
+            "-query",
+            str(self.seq2_path),
+            "-evalue",
+            str(self.evalue),
+            "-outfmt",
+            " ".join(["6"] + self._out_columns),
+            "-num_threads",
+            str(self._threads),
+            "-dust",
+            yes_no[self._dust],
+            "-max_target_seqs",
+            str(self._max_targets)
+        ] + self._extra_args
+        #print(" ".join(command), file=sys.stderr)
+        return command
+
+
+    def _get_hits(self):
+        proc = subprocess.Popen(
+            self._build_blast_command(),
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE
+        )
+        self._hits = pd.read_csv(
+            proc.stdout,
+            names=self._out_columns,
+            delim_whitespace=True
+        )
+        # from IPython import embed
+        # embed()
+        proc.communicate()
+        if proc.returncode:
+            if self.debug:
+                from IPython import embed
+                embed()
+            raise subprocess.CalledProcessError(proc.returncode, proc.args)
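To make the argument order concrete, here is a minimal standalone sketch of the list that `_build_blast_command` assembles. The function `build_blastn_command` and its parameter names are hypothetical and not part of simple_blast. Only blastn options that appear in the source above are used, the `-task` and `-negative_seqidlist` branches are left out for brevity, and nothing is executed.

```python
from typing import List, Optional

YES_NO = ["no", "yes"]  # indexed by a bool, as in blasting.py


def build_blastn_command(
        subject: str,
        query: str,
        out_columns: List[str],
        evalue: float = 1e-20,
        db: Optional[str] = None,
        threads: int = 1,
        dust: bool = True,
        max_targets: int = 500,
) -> List[str]:
    """Hypothetical stand-in for BlastnSearch._build_blast_command."""
    command = ["blastn"]
    if db is not None:
        # A prebuilt database replaces the raw subject FASTA file.
        command += ["-db", db]
    else:
        command += ["-subject", subject]
    command += [
        "-query", query,
        "-evalue", str(evalue),
        "-outfmt", " ".join(["6"] + out_columns),
        "-num_threads", str(threads),
        "-dust", YES_NO[dust],
        "-max_target_seqs", str(max_targets),
    ]
    return command


cmd = build_blastn_command("seqs2.fasta", "seqs1.fasta", ["qseqid", "sseqid"])
print(" ".join(cmd))
```

The first positional argument ends up as `-subject` (or is swapped for `-db` when a cached database exists) and the second as `-query`, which matches the subject-then-query order given in the README.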
@@ -0,0 +1,88 @@
+Metadata-Version: 2.1
+Name: simple_blast
+Version: 0.0.5
+Summary: A simple library for executing BLAST searches with ncbi-blast+
+Author-email: Andrew Tapia <andrew.tapia@uky.edu>
+Project-URL: Homepage, https://github.com/actapia/simple_blast
+Project-URL: Bug Tracker, https://github.com/actapia/simple_blast
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas>=1.5.3
+
+# simple_blast
+
+This is a library that provides a very basic wrapper around ncbi-blast+.
+Currently, the library supports searches with `blastn` only, but I may expand
+the library to include wrappers for other BLAST executables if I need them.
+
+## Requirements
+
+This library depends on Pandas for parsing BLAST output. The library has been
+tested with Pandas 1.5.3, but it likely works with other versions.
+
+Of course, this library assumes that ncbi-blast+ is installed. The library has
+been tested with ncbi-blast 2.12.0+, and it likely works with newer versions of
+the software as well.
+
+## Basic usage
+
+You can define a `blastn` search to be carried out using the `BlastnSearch`
+class. `BlastnSearch` objects are constructed with two required
+arguments: the subject sequence and the query sequence files, in that
+order. For example, to set up a `blastn` search for sequences in `seqs1.fasta`
+against those in `seqs2.fasta`, you could construct a `BlastnSearch` object like
+this:
+
+```python
+from simple_blast import BlastnSearch
+
+search = BlastnSearch("seqs2.fasta", "seqs1.fasta")
+```
+
+The BLAST search is not carried out until you ask for the results by accessing
+the `hits` property of the search. This property returns a Pandas dataframe
+containing the HSPs identified in the BLAST search.
+
+```python
+results = search.hits
+```
+
+The columns in the output may be configured by passing either the `out_columns`
+or `additional_columns` arguments when constructing the `BlastnSearch`. The
+former argument overrides the set of output columns; the latter argument is
+added to the list of default output columns.
+
+You can also specify an e-value cutoff through the `evalue` argument.
+
+## DB caches
+
+When the same sequence file is used as a subject in multiple searches, it can be
+efficient to build a BLAST database up front. The `BlastDBCache` class can be
+used to handle this mostly automatically. To make a `BlastDBCache`, you need
+to specify the location of the cache on the file system.
+
+```python
+from simple_blast import BlastDBCache
+
+cache = BlastDBCache("cache_dir")
+```
+
+To add a file to the cache, use the `makedb` method.
+
+```python
+cache.makedb("seqs2.fasta")
+```
+
+When constructing a `BlastnSearch` object, give it the `BlastDBCache` as the
+`db_cache` parameter to make the `BlastnSearch` object use the cache for
+searches.
+
+```python
+search = BlastnSearch("seqs2.fasta", "seqs1.fasta", db_cache=cache)
+```
+
+Now `search` will use the database we created for `seqs2.fasta`.
@@ -0,0 +1,11 @@
+LICENSE
+README.md
+pyproject.toml
+src/simple_blast/__init__.py
+src/simple_blast/blastdb_cache.py
+src/simple_blast/blasting.py
+src/simple_blast.egg-info/PKG-INFO
+src/simple_blast.egg-info/SOURCES.txt
+src/simple_blast.egg-info/dependency_links.txt
+src/simple_blast.egg-info/requires.txt
+src/simple_blast.egg-info/top_level.txt
@@ -0,0 +1 @@
+
@@ -0,0 +1 @@
+pandas>=1.5.3
@@ -0,0 +1 @@
+simple_blast