npsearch 0.0.1
- checksums.yaml +7 -0
- data/.coveralls.yml +1 -0
- data/.gitignore +18 -0
- data/.travis.yml +8 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +661 -0
- data/README.md +68 -0
- data/Rakefile +14 -0
- data/bin/npsearch +165 -0
- data/lib/npsearch.rb +96 -0
- data/lib/npsearch/arg_validator.rb +264 -0
- data/lib/npsearch/logger.rb +12 -0
- data/lib/npsearch/pool.rb +66 -0
- data/lib/npsearch/sequence.rb +25 -0
- data/lib/npsearch/signalp.rb +44 -0
- data/lib/npsearch/version.rb +4 -0
- data/npsearch.gemspec +33 -0
- data/test/files/1_protein.fa +204 -0
- data/test/files/2_orf.fa +1330 -0
- data/test/files/3_signalp_out.txt +667 -0
- data/test/files/4_secretome.fa +6 -0
- data/test/files/5_output.fa +6 -0
- data/test/files/5_output.html +37 -0
- data/test/files/empty_file.fa +0 -0
- data/test/files/genetic.fa +465 -0
- data/test/files/not_fasta.fa +446 -0
- data/test/files/protein.fa +180 -0
- data/test/files/signalp/signalp +0 -0
- data/test/test_np_search.rb +122 -0
- metadata +162 -0
data/README.md
ADDED
@@ -0,0 +1,68 @@
# NeuroPeptideSearch (NpSearch)
[](http://badge.fury.io/rb/NpSearch)
[](https://travis-ci.org/IsmailM/NeuroPeptideSearch)
[](https://gemnasium.com/IsmailM/NeuroPeptideSearch)
[](http://inch-ci.org/github/IsmailM/NeuroPeptideSearch)

> A tool to identify novel neuropeptides.

NpSearch (NeuroPeptideSearch) is a program that searches for potential neuropeptide precursors based on the motifs commonly found in neuropeptides. Ideally, the input would be transcriptome or protein data, since there are no introns to worry about and the signal peptide would be attached to the front of the precursor.

Currently, the program produces a long list of sequences that fulfil all the requirements to be a potential neuropeptide. This list needs to be analysed further to find potential neuropeptides. Future versions of the program will automatically analyse the output file and extract a list of highly likely neuropeptides.

NpSearch produces a number of files. The final output is produced as a fasta file and as a colour-coded html file that can be opened in any web browser or even in a word processor.

Note: For this program to work, you will need to obtain a copy of SignalP 4.1 from CBS at "http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp" and link it to the program. Alternatively, you will need an output text file from SignalP, which you can pass to the program.

** Currently only supported on Mac OS & Linux

If you use this program, please cite us:

Moghul I, Rowe M, Priyam A, Elphick M & Wurm Y <em>(in prep)</em> NpSearch: A Tool to Identify Novel Neuropeptides

## Installation

1. Simply open the terminal and type this
```
$ gem install npsearch
```
## Usage

* Usage: npsearch [Options] -i [Input File] -o [Output Folder Name]

* Mandatory Options:

    -i, --input [file]           The input file (in fasta format). Can be a relative or a full
                                 path.
    -o, --output [folder name]   The path to the output folder. This will be created if the
                                 folder does not exist.

* Optional Options:
    -m, --motif [Query Motif]    By default NpSearch only searches for dibasic cleavage sites
                                 ("KR", "RR" or "KK"). This option allows one to change the
                                 set of cleavage sites to be searched.
                                 The period "." can be used to denote any character. Multiple
                                 motif queries can be combined by placing a pipe character ("|")
                                 between each query and putting the motif query in speech marks
                                 e.g. "KR|RR|R..R"
                                 Advanced Users: Regular expressions are supported.
    -c, --cut_off N              Changes the minimum Open Reading
                                 Frame from the default 10 amino acid residues to N amino acid
                                 residues.
    -s, --signalp_file [file]    Is used to supply the signal peptide results to the program.
                                 These signal peptide results must be created using the SignalP
                                 program (Version 4.x), downloadable from CBS. If this argument
                                 isn't supplied, then NpSearch will try to run a local version
                                 of the Signal P script.
    -e, --extract_orf            Only extracts the Open Reading Frames.
    -v, --verbose                Provides more information on each step taken in this program.
    -h, --help                   Display this screen
        --version                Shows version

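The `-m` option describes its motif query as a regular expression, so the behaviour of a query such as `"KR|RR|R..R"` can be explored directly in Ruby. The following is a minimal sketch, not code from NpSearch: the helper name `cleavage_sites` and the example peptide are hypothetical, and it only illustrates how `|` and `.` behave when the motif string is compiled with `Regexp.new`.

```ruby
# Hypothetical helper (not part of NpSearch): lists the position and text of
# every non-overlapping motif match in a protein sequence.
def cleavage_sites(sequence, motif = 'KR|RR|KK')
  pattern = Regexp.new(motif) # "|" separates alternatives, "." matches any residue
  sites = []
  sequence.scan(pattern) do
    match = Regexp.last_match
    sites << [match.begin(0), match[0]]
  end
  sites
end

# Default dibasic sites.
p cleavage_sites('MKRAAGRRSLKK')
# A custom query that also accepts R..R.
p cleavage_sites('MKRGRAARLL', 'KR|RR|R..R')
```

Because `String#scan` reports non-overlapping matches from left to right, a residue consumed by one match cannot start another; a real search may need lookaheads if overlapping sites matter.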
## Contributing

1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create new Pull Request
data/Rakefile
ADDED
@@ -0,0 +1,14 @@
require 'bundler/gem_tasks'
require 'rake/testtask'

task default: [:build]
desc 'Installs the ruby gem'
task :build do
  # The gemspec shipped in this package is npsearch.gemspec (see the file list).
  exec("gem build npsearch.gemspec && gem install ./npsearch-#{NpSearch::VERSION}.gem")
end

task :test do
  Rake::TestTask.new do |t|
    t.pattern = 'test/test_np_search.rb'
  end
end
data/bin/npsearch
ADDED
@@ -0,0 +1,165 @@
#!/usr/bin/env ruby
require 'optparse'

require 'npsearch'
require 'npsearch/arg_validator'
require 'npsearch/version'

opt = {}
optparse = OptionParser.new do |opts|
  opts.banner = <<Banner

* Usage: npsearch [Options] -i [Input File] -o [Output Folder Name]

* Mandatory Options:

Banner

  opt[:input_file] = nil
  opts.on('-i', '--input [file]', 'Path to the input fasta file') do |f|
    opt[:input_file] = f
  end

  opts.separator ''
  opts.separator '* Optional Options:'

  opt[:motif] = 'KR|RR|KK'
  opts.on('-m', '--motif [Query Motif]', 'By default NpSearch only searches',
          ' for dibasic cleavage site ("KR", "RR" or "KK"). This option allows',
          ' one to change the set of cleavage sites to be searched.',
          ' The period "." can be used to denote any character. Multiple',
          ' motifs query can be used by using a pipeline character ("|")',
          ' between each query and putting the motif query in speech marks',
          ' e.g. "KR|RR|R..R"',
          ' Advanced Users: Regular expressions are supported.') do |motif|
    opt[:motif] = motif
  end

  opt[:cut_off] = 10
  opts.on('-c', '--cut_off N', Integer, 'Changes the minimum Open Reading',
          ' Frame from the default 10 amino acid residues to N amino acid',
          ' residues.') do |n|
    opt[:cut_off] = n
  end

  opt[:signalp_file] = nil
  opts.on('-s', '--signalp_file [file]',
          'Is used to supply the signal peptide results to the program. These',
          ' signal peptide results must be created using the SignalP program',
          " (Version 4.x), downloadable from CBS. If this argument isn't ",
          ' supplied, then NpSearch will try to run a local version of the',
          ' Signal P script.') do |signalp_file|
    opt[:signalp_file] = signalp_file
  end

  opt[:extract_orf] = false
  opts.on('-e', '--extract_orf', 'Only extracts the Open Reading Frames.') do
    opt[:extract_orf] = true
  end

  opt[:verbose] = false
  opts.on('-v', '--verbose', 'Provides more information on each step taken',
          ' in this program.') do
    opt[:verbose] = true
  end

  opts.on('-h', '--help', 'Display this screen') do
    puts opts
    exit
  end

  opts.on('--version', 'Shows version') do
    puts NpSearch::VERSION
    exit
  end
end
optparse.parse!

NpSearch.init(opt)
NpSearch.run

# ############# Argument Validation...##############
# arg_vldr = NpSearch::ArgValidators.new(opt[:verbose])
# input_type = arg_vldr.arg(opt[:motif], opt[:input], opt[:output_dir],
#                           opt[:cut_off], opt[:extract_orf], opt[:signalp_file],
#                           optparse.help)

# ############# General Validation...##############
# vldr = NpSearch::Validators.new
# vldr.output_dir(opt[:output_dir])
# if opt[:signalp_file].nil? && opt[:extract_orf] == false
#   sp_dir = vldr.signalp_dir
# end

# ############# Converting input file to Bio::FastaFormat. #############
# input_read = NpSearch::Input.read(opt[:input], input_type)

# ############# Extract_ORF #############
# if input_type == 'genetic'
#   # Translate Sequences in all 6 frames
#   translated = NpSearch::Translation.translate(input_read)
#   translated.to_fasta('translated seq.', "#{opt[:output_dir]}/1_protein.fa")
#   # Extract all possible ORF that are longer than the ORF_min_length
#   orf = NpSearch::Translation.extract_orf(translated, opt[:cut_off])
#   orf.to_fasta('Open Reading Frames', "#{opt[:output_dir]}/2_orf.fa")

#   if opt[:extract_orf]
#     puts "\nSuccess: All output files created in the directory:" \
#          "#{opt[:output_dir]}'.\n "
#     exit
#   end
# end

# ############# Setting up more variables...##############
# if opt[:motif] == 'neuro_clv'
#   motif = 'KK|KR|RR|' \
#           'R..R|R....R|R......R|H..R|H....R|H......R|K..R|K....R|K......R'
# else
#   motif = opt[:motif]
# end
# vldr.motif_type(motif)

# if input_type == 'genetic'
#   sp_input_file = "#{opt[:output_dir]}/2_orf.fa"
#   sp_hash = orf
#   file_number = 3
# else # i.e. if the input is protein
#   sp_input_file = opt[:input]
#   sp_hash = input_read
#   file_number = 1
# end

# if opt[:signalp_file].nil?
#   sp_out_file = "#{opt[:output_dir]}/#{file_number}_signalp_out.txt"
#   file_number += 1
#   NpSearch::Signalp.signalp(sp_dir, sp_input_file, sp_out_file)
# else
#   sp_out_file = opt[:signalp_file]
#   file_number = 1
# end

# ############# Signal P Results file Validation #############
# vldr.sp_results(sp_out_file)

# ############# Extract sequences with a signal peptide #############
# secretome = NpSearch::Analysis.parse(sp_out_file, sp_hash, motif)
# secretome.to_fasta('secretome file',
#                    "#{opt[:output_dir]}/#{file_number}_secretome.fa")
# file_number += 1

# ############# Remove any duplicate data #############
# flattened_seq = NpSearch::Analysis.flattener(secretome)

# ############# Creating Output Files #############
# flattened_seq.to_fasta('fasta output file',
#                        "#{opt[:output_dir]}/#{file_number}_output.fa")
# flattened_seq.to_html(motif,
#                       "#{opt[:output_dir]}/#{file_number}_output.html")

# ############# Success #############
# puts # a blank line.
# puts "Success: All output files created in the directory:'#{opt[:output_dir]}'."
# puts # a blank line
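The usage banner in this script advertises `-o [Output Folder Name]`, yet no `-o` handler is registered in the version shown, so passing `-o` would make `optparse.parse!` raise `OptionParser::InvalidOption`. The sketch below shows one way such an option could be declared with the standard `optparse` library. It is an illustration under stated assumptions, not the author's implementation: the function `parse_cli` is hypothetical, and the key `:output_dir` is borrowed from the commented-out code above.

```ruby
require 'optparse'

# Sketch only: parses -i and a hypothetical -o into an options hash.
def parse_cli(argv)
  opt = { input_file: nil, output_dir: nil }
  parser = OptionParser.new do |opts|
    opts.on('-i', '--input FILE', 'Path to the input fasta file') do |f|
      opt[:input_file] = f
    end
    # Assumption: this handler does not exist in the released script.
    opts.on('-o', '--output DIR', 'Path to the output folder') do |d|
      opt[:output_dir] = d
    end
  end
  parser.parse(argv) # leaves argv untouched; parse! would consume it in place
  opt
end

p parse_cli(%w[-i input.fa -o results])
```

Declaring the argument as `FILE` or `DIR` rather than `[file]` makes it required: with the bracketed form, `optparse` treats the value as optional and yields `nil` when it is omitted.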
data/lib/npsearch.rb
ADDED
@@ -0,0 +1,96 @@
require 'bio'
require 'fileutils'

# require 'npsearch/arg_validator'
require 'npsearch/logger'
require 'npsearch/sequence'
require 'npsearch/signalp'
require 'npsearch/pool'

# Top level module / namespace.
module NpSearch
  class <<self
    MIN_ORF_SIZE = 40 # amino acids (including potential signal peptide)

    attr_accessor :opt
    attr_accessor :sequences

    def logger
      @logger ||= Logger.new(STDERR, @opt[:verbose])
    end

    def init(opt)
      # @opt = args_validation(opt)
      @opt = opt
      @sequences = []
      @opt[:num_threads] = 8
      @opt[:type] = guess_sequence_type
      @opt[:signalp_path] = '/Volumes/Data/programs/signalp-4.1/signalp'
      @pool = Pool.new(@opt[:num_threads]) if @opt[:num_threads] > 1
    end

    def run
      iterate_input_file
      # score_sequence
      # scan(?<=(KR|RR|KK))(\w+?)(?=(KR|RR|KK|$))
      @sequences.each { |s| puts ">#{s.id}\n#{s.seq}" }
    end

    private

    def iterate_input_file
      biofastafile = Bio::FlatFile.open(Bio::FastaFormat, @opt[:input_file])
      biofastafile.each_entry do |entry|
        if @opt[:num_threads] > 1
          @pool.schedule(entry) { |e| initialise_seqs(e) }
        else
          initialise_seqs(entry)
        end
      end
      @pool.shutdown if @opt[:num_threads] > 1
    end

    def initialise_seqs(entry)
      if @opt[:type] == :protein
        initialise_protein_seq(entry.entry_id, entry.aaseq)
      else
        initialise_transcriptomic_seq(entry.entry_id, entry.naseq)
      end
    end

    def initialise_protein_seq(id, seq)
      sp = Signalp.analyse_sequence(seq)
      @sequences << Sequence.new(id, seq, sp) if sp[:sp] == 'Y'
    end

    def initialise_transcriptomic_seq(id, naseq)
      (1..6).each do |f|
        translated_seq = naseq.translate(f)
        orfs = translated_seq.to_s.scan(/(?=(M\w{#{MIN_ORF_SIZE},}))./).flatten
        initialise_orfs(id, orfs, f)
      end
    end

    def initialise_orfs(id, orfs, frame)
      idx = 0
      orfs.each do |orf|
        sp = Signalp.analyse_sequence(orf)
        next if sp[:sp] == 'N'
        seq = Sequence.new(id, orf, sp)
        seq.translated_frame = frame
        seq.orf_index = idx
        @sequences << seq
        idx += 1
      end
    end

    def guess_sequence_type
      fasta_content = IO.binread(@opt[:input_file])
      # removing non-letter and ambiguous characters
      cleaned_sequence = fasta_content.gsub(/[^A-Z]|[NX]/i, '')
      return nil if cleaned_sequence.length < 10 # conservative
      type = Bio::Sequence.new(cleaned_sequence).guess(0.9)
      (type == Bio::Sequence::NA) ? :nucleotide : :protein
    end
  end
end
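`initialise_transcriptomic_seq` finds open reading frames with the pattern `/(?=(M\w{N,}))./`: a lookahead captures an `M` followed by at least N word characters, and the trailing `.` advances one residue at a time so that ORFs nested inside longer ORFs are also reported. The sketch below reproduces that pattern on a plain string so it can be run without BioRuby. The function name `extract_orfs` and the toy sequence are assumptions made for illustration; it also assumes stop codons appear as `*` in the translated string, which `\w` does not match.

```ruby
# Hypothetical stand-alone version of the ORF scan used in lib/npsearch.rb.
# min_size counts the residues AFTER the initial M, as in the original pattern.
def extract_orfs(translated, min_size)
  translated.scan(/(?=(M\w{#{min_size},}))./).flatten
end

translated = 'AAMKLLVVMAAA*MK'

# With a threshold of 5 only the long ORF qualifies.
p extract_orfs(translated, 5)
# Lowering the threshold to 3 also reports the nested ORF starting at the second M.
p extract_orfs(translated, 3)
```

Since the quantifier applies to the residues after `M`, a threshold of N yields ORFs of at least N + 1 residues; with `MIN_ORF_SIZE = 40` the shortest reported ORF would therefore be 41 residues long.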
data/lib/npsearch/arg_validator.rb
ADDED
@@ -0,0 +1,264 @@
module NpSearch
  class ArgValidators
    # Changes the logger level to output extra info when the verbose option is
    # true.
    def initialize(verbose_opt)
      LOG.level = Logger::INFO if verbose_opt == true
    end

    # Runs all the arguments method...
    def arg(motif, input, output_dir, orf_min_length, extract_orf,
            signalp_file, help_banner)
      comp_arg(input, motif, output_dir, extract_orf, help_banner)
      input_type = guess_input_type(input)
      extract_orf_conflict(input_type, extract_orf)
      input_sp_file_conflict(input_type, signalp_file)
      orf_min_length(orf_min_length)
      input_type
    end

    # Ensures that the compulsory input arguments are supplied...
    def comp_arg(input, motif, output_dir, extract_orf, help_banner)
      comp_arg_error(motif, 'Query Motif ("-m" option)') if extract_orf == false
      comp_arg_error(input, 'Input file ("-i" option)')
      comp_arg_error(output_dir, 'Output Folder ("-o" option)')
      return unless input.nil? || (motif.nil? && extract_orf == false)
      puts help_banner
      exit
    end

    # Ensures that a message is provided for all missing compulsory args.
    # Run from comp_arg method
    def comp_arg_error(arg, message)
      puts 'Usage Error: No ' + message + ' is supplied' if arg.nil?
    end

    # Guesses the type of data within the input file on the first 100 lines of
    # the file (ignores all identifiers (lines that start with a '>')).
    # It has a 80% threshold.
    def guess_input_type(input_file)
      input_file_format(input_file)
      sequences = []
      File.open(input_file, 'r') do |file_stream|
        file_stream.readlines[0..100].each do |line|
          sequences << line.to_s unless line.match(/^>/)
        end
      end
      type = Bio::Sequence.new(sequences).guess(0.8)
      if type == Bio::Sequence::NA
        input_type = 'genetic'
      elsif type == Bio::Sequence::AA
        input_type = 'protein'
      end
      input_type
    end

    # Ensures that the input file a) exists b) is not empty and c) is a fasta
    # file. Run from the guess_input_type method.
    # Note: `fail Klass, message` is used throughout; `Klass(message)` would be
    # parsed as a call to an undefined method.
    def input_file_format(input_file)
      unless File.exist?(input_file)
        fail ArgumentError, "Critical Error: The input file '#{input_file}'" \
                            ' does not exist.'
      end
      if File.zero?(input_file)
        fail ArgumentError, "Critical Error: The input file '#{input_file}'" \
                            ' is empty.'
      end
      unless File.probably_fasta?(input_file)
        fail ArgumentError, "Critical Error: The input file '#{input_file}'" \
                            ' does not seem to be in fasta format. Only' \
                            ' input files in fasta format are supported.'
      end
    end

    # Ensures that the extract_orf option is only used with genetic data.
    def extract_orf_conflict(input_type, extract_orf)
      return unless input_type == 'protein' && extract_orf == true
      fail ArgumentError, 'Usage Error: Conflicting arguments detected:' \
                          ' Protein data detected within the input file,' \
                          ' when using the Extract_ORF option (option' \
                          ' "-e"). This option is only available when' \
                          ' input file contains genetic data.'
    end

    # Ensures that the protein data (or open reading frames) are supplied as
    # the input file when the signal p output file is passed.
    def input_sp_file_conflict(input_type, signalp_file)
      return unless input_type == 'genetic' && !signalp_file.nil?
      fail ArgumentError, 'Usage Error: Conflicting arguments detected' \
                          ': Genetic data detected within the input file' \
                          ' when using the Signal P Input Option (Option' \
                          ' "-s"). The Signal P input Option requires the' \
                          ' input of two files: the Signal P Script Result' \
                          ' files (at the "-s" option) and the protein' \
                          ' data file used to run the Signal P Script.'
    end

    # Ensures that the ORF minimum length is a number. Any digits after the
    # decimal place are ignored.
    def orf_min_length(orf_min_length)
      return unless orf_min_length.to_i < 1
      fail ArgumentError, 'Usage Error: The Open Reading Frames minimum' \
                          ' length can only be a full integer.'
    end
  end

  class Validators
    # Checks for the presence of the output directory; if not found, it asks
    # the user whether they want to create the output directory.
    def output_dir(output_dir)
      unless File.directory? output_dir # If output_dir doesn't exist
        fail IOError, "\n\nThe output directory does not exist\n\n"
      end
    rescue IOError
      puts # a blank line
      puts 'The output directory does not exist.'
      puts # a blank line
      puts "The directory '#{output_dir}' will be created in this location."
      puts 'Do you want to continue? [y/n]'
      print '> '
      inp = $stdin.gets.chomp
      until inp.downcase == 'n' || inp.downcase == 'y' || inp == ''
        puts # a blank line
        puts "The input: '#{inp}' is not recognised - 'y' or 'n' are the" \
             ' only recognisable inputs.'
        puts 'Please try again.'
        puts "The directory '#{output_dir}' will be created in this" \
             ' location.'
        puts 'Do you want to continue? [y/n]'
        print '> '
        inp = $stdin.gets.chomp
      end
      if inp.downcase == 'y' || inp == ''
        FileUtils.mkdir_p "#{output_dir}"
        puts 'Created output directory...'
      elsif inp.downcase == 'n'
        raise ArgumentError, 'Critical Error: An output directory is' \
                             ' required; please create an output directory' \
                             ' and then try again.'
      end
    end

    # Ensures that the Signal P Script is present. If not found in the home
    # directory, it asks the user for its location.
    def signalp_dir
      signalp_dir = "#{Dir.home}/SignalPeptide"
      if File.exist? "#{signalp_dir}/signalp"
        signalp_directory = signalp_dir
      else
        begin
          fail IOError, 'The Signal P Script directory cannot be found at' \
                        " the following location: '#{signalp_dir}/'."
        rescue IOError
          puts # a blank line
          puts 'Error: The Signal P Script directory cannot be found at the' \
               " following location: '#{signalp_dir}/'."
          puts # a blank line
          puts 'Please enter the full path or a relative path to the Signal' \
               ' P Script directory (i.e. to the folder containing the' \
               ' Signal P script). Refer to the online tutorial for more help'
          print '> '
          inp = $stdin.gets.chomp
          until (File.exist? "#{signalp_dir}/signalp") ||
                (File.exist? "#{inp}/signalp")
            puts # a blank line
            puts 'The Signal P directory cannot be found at the following' \
                 " location: '#{inp}'"
            puts 'Please enter the full path or a relative path to the Signal' \
                 ' Peptide directory again.'
            print '> '
            inp = $stdin.gets.chomp
          end
          signalp_directory = inp
          puts # a blank line
          puts "The Signal P directory has been found at '#{signalp_directory}'"
          FileUtils.ln_s "#{signalp_directory}", "#{Dir.home}/SignalPeptide",
                         force: true
          puts # a blank line
        end
      end
      signalp_directory
    end

    # Ensures that the supported version of the Signal P Script has been linked
    # to NpSearch. Run from the 'sp_results' method.
    def sp_version(input_file)
      File.open(input_file, 'r') do |file_stream|
        first_line = file_stream.readline
        if first_line.match(/# SignalP-4.1/)
          return true
        else
          return false
        end
      end
    end

    # Ensures that the critical columns in the tabular results produced by the
    # Signal P script are conserved. Run from the 'sp_results' method.
    # Returns true when the expected column headers are all present.
    def sp_column(input_file)
      # Read the file that was passed in rather than a hard-coded file name.
      File.open(input_file, 'r') do |file_stream|
        secondline = file_stream.readlines[1]
        row = secondline.gsub(/\s+/m, ' ').chomp.split(' ')
        if row[1] == 'name' && row[4] == 'Ymax' && row[5] == 'pos' &&
           row[9] == 'D'
          return true
        else
          return false
        end
      end
    end

    # Ensure that the right version of the Signal P script is used (via
    # 'sp_version' Method). If the wrong signal p script has been linked to
    # NpSearch, check whether the critical columns in the tabular results
    # produced by the Signal P Script are conserved (via 'sp_column'
    # Method).
    def sp_results(signalp_output_file)
      return if sp_version(signalp_output_file)
      # i.e. if Signal P is the wrong version
      if sp_column(signalp_output_file) # If wrong version but correct columns
        puts # a blank line
        puts 'Warning: The wrong version of signalp has been linked.' \
             ' However, the signal peptide output file still seems to' \
             ' be in the right format.'
      else
        puts # a blank line
        puts 'Warning: The wrong version of the signal p has been linked' \
             ' and the signal peptide output is in an unrecognised format.'
        puts 'Continuing may give you meaningless results.'
      end
      puts # a blank line
      puts 'Do you still want to continue? [y/n]'
      print '> '
      inp = $stdin.gets.chomp
      until inp.downcase == 'n' || inp.downcase == 'y'
        puts # a blank line
        puts "The input: '#{inp}' is not recognised - 'y' or 'n' are the" \
             ' only recognisable inputs.'
        puts 'Please try again.'
        # Read the answer again; otherwise this loop can never terminate.
        print '> '
        inp = $stdin.gets.chomp
      end
      if inp.downcase == 'y'
        puts 'Continuing.'
      elsif inp.downcase == 'n'
        fail IOError, 'Critical Error: NpSearch only supports SignalP 4.1' \
                      ' (downloadable from CBS). Please ensure the version' \
                      ' of the signal p script is downloaded.'
      end
    end

    # Guesses the type of the data in the supplied motif. It ignores all
    # non-word characters (e.g. '|' that is used for regex). It has a 90%
    # threshold.
    def motif_type(motif)
      motif_seq = Bio::Sequence.new(motif.gsub(/\W/, ''))
      type = motif_seq.guess(0.9)
      return unless type.to_s != 'Bio::Sequence::AA'
      fail IOError, 'Critical Error: There seems to be an error in' \
                    ' processing the motif. Please ensure that the motif' \
                    ' contains amino acid residues that you wish to search' \
                    ' for.'
    end
  end
end
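`input_file_format` checks three things in order: the file exists, it is not empty, and it looks like fasta. The last check calls `File.probably_fasta?`, which is not defined in any file shown in this diff. The sketch below walks through the same three checks with a stand-in for that method. Everything here is an assumption for illustration: the names `probably_fasta?` and `check_input_file` are hypothetical, and the stand-in simply treats a file as fasta when its first non-blank line starts with `>`.

```ruby
require 'tempfile'

# Hypothetical stand-in for File.probably_fasta? (not defined in this diff):
# true when the first non-blank line is a fasta header.
def probably_fasta?(path)
  File.foreach(path) do |line|
    next if line.strip.empty?
    return line.start_with?('>')
  end
  false
end

# Mirrors the order of checks in input_file_format; raises ArgumentError
# with a message, using the `fail Klass, message` form.
def check_input_file(path)
  fail ArgumentError, "The input file '#{path}' does not exist." unless File.exist?(path)
  fail ArgumentError, "The input file '#{path}' is empty." if File.zero?(path)
  fail ArgumentError, "The input file '#{path}' is not in fasta format." unless probably_fasta?(path)
  true
end

# Small demonstration with a temporary fasta file.
Tempfile.create(['demo', '.fa']) do |f|
  f.write(">seq1\nMKRAAGRR\n")
  f.flush
  puts check_input_file(f.path)
end
```

Checking existence before emptiness matters because `File.zero?` returns false for a missing path, so a missing file would otherwise fall through to the format check.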