npsearch 0.0.1
- checksums.yaml +7 -0
- data/.coveralls.yml +1 -0
- data/.gitignore +18 -0
- data/.travis.yml +8 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +661 -0
- data/README.md +68 -0
- data/Rakefile +14 -0
- data/bin/npsearch +165 -0
- data/lib/npsearch.rb +96 -0
- data/lib/npsearch/arg_validator.rb +264 -0
- data/lib/npsearch/logger.rb +12 -0
- data/lib/npsearch/pool.rb +66 -0
- data/lib/npsearch/sequence.rb +25 -0
- data/lib/npsearch/signalp.rb +44 -0
- data/lib/npsearch/version.rb +4 -0
- data/npsearch.gemspec +33 -0
- data/test/files/1_protein.fa +204 -0
- data/test/files/2_orf.fa +1330 -0
- data/test/files/3_signalp_out.txt +667 -0
- data/test/files/4_secretome.fa +6 -0
- data/test/files/5_output.fa +6 -0
- data/test/files/5_output.html +37 -0
- data/test/files/empty_file.fa +0 -0
- data/test/files/genetic.fa +465 -0
- data/test/files/not_fasta.fa +446 -0
- data/test/files/protein.fa +180 -0
- data/test/files/signalp/signalp +0 -0
- data/test/test_np_search.rb +122 -0
- metadata +162 -0
data/README.md
ADDED
@@ -0,0 +1,68 @@
|
|
1
|
+
# NeuroPeptideSearch (NpSearch)
|
2
|
+
[![Gem Version](https://badge.fury.io/rb/NpSearch.svg)](http://badge.fury.io/rb/NpSearch)
|
3
|
+
[![Build Status](https://travis-ci.org/IsmailM/NeuroPeptideSearch.svg?branch=master)](https://travis-ci.org/IsmailM/NeuroPeptideSearch)
|
4
|
+
[![Dependency Status](https://gemnasium.com/IsmailM/NeuroPeptideSearch.svg)](https://gemnasium.com/IsmailM/NeuroPeptideSearch)
|
5
|
+
[![Inline docs](http://inch-ci.org/github/IsmailM/NeuroPeptideSearch.png?branch=master)](http://inch-ci.org/github/IsmailM/NeuroPeptideSearch)
|
6
|
+
|
7
|
+
> A tool to identify novel neuropeptides.
|
8
|
+
|
9
|
+
NpSearch (NeuroPeptideSearch) is a program that searches for potential neuropeptide precursors based on the motifs commonly found in neuropeptides. Ideally, the input would be transcriptome or protein data, since there are no introns to worry about and the signal peptide would be attached to the front of the precursor.
|
10
|
+
|
11
|
+
Currently, the program produces a long list of sequences that fulfil all the requirements to be a potential neuropeptide. This list needs further analysis to identify the most likely neuropeptides. Future versions of the program will automatically analyse the output file and extract a list of highly likely neuropeptides.
|
12
|
+
|
13
|
+
NpSearch produces a number of files. The final output is produced both as a fasta file and as a colour-coded html file that can be opened in any web browser or even in a word processor.
|
14
|
+
|
15
|
+
Note: For this program to work, you will need to obtain a copy of SignalP 4.1 from CBS at "http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp" and link it to the program. Alternatively, you can supply an output text file produced by SignalP as input to the program.
|
16
|
+
|
17
|
+
** Currently only supported on Mac OS & Linux
|
18
|
+
|
19
|
+
If you use this program, please cite us:
|
20
|
+
|
21
|
+
Moghul I, Rowe M, Priyam A, Elphick M & Wurm Y <em>(in prep)</em> NpSearch: A Tool to Identify Novel Neuropeptides
|
22
|
+
|
23
|
+
## Installation
|
24
|
+
|
25
|
+
1. Simply open the terminal and type this
|
26
|
+
```
|
27
|
+
$ gem install npsearch
|
28
|
+
```
|
29
|
+
## Usage
|
30
|
+
|
31
|
+
* Usage: npsearch [Options] -i [Input File] -o [Output Folder Name]
|
32
|
+
|
33
|
+
* Mandatory Options:
|
34
|
+
|
35
|
+
-i, --input [file] The input file (in fasta format). Can be a relative or a full
|
36
|
+
path.
|
37
|
+
-o, --output [folder name] The path to the output folder. This will be created if the
|
38
|
+
folder does not exist.
|
39
|
+
|
40
|
+
* Optional Options:
|
41
|
+
-m, --motif [Query Motif] By default NpSearch only searches for dibasic cleavage site
|
42
|
+
("KR", "RR" or "KK"). This option allows one to change the
|
43
|
+
set of cleavage sites to be searched.
|
44
|
+
The period "." can be used to denote any character. Multiple
|
45
|
+
motifs query can be used by using a pipeline character ("|")
|
46
|
+
between each query and putting the motif query in speech marks
|
47
|
+
e.g. "KR|RR|R..R"
|
48
|
+
Advanced Users: Regular expressions are supported.
|
49
|
+
-c, --cut_off N Changes the minimum Open Reading
|
50
|
+
Frame from the default 10 amino acid residues to N amino acid
|
51
|
+
residues.
|
52
|
+
-s, --signalp_file [file] Is used to supply the signal peptide results to the program.
|
53
|
+
These signal peptide results must be created using the SignalP
|
54
|
+
program (Version 4.x), downloadable from CBS. If this argument
|
55
|
+
isn't supplied, then NpSearch will try to run a local version
|
56
|
+
of the Signal P script.
|
57
|
+
-e, --extract_orf Only extracts the Open Reading Frames.
|
58
|
+
-v, --verbose Provides more information on each step taken in this program.
|
59
|
+
-h, --help Display this screen
|
60
|
+
--version Shows version
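
Since the `-m` option accepts a regular expression, the effect of a motif string can be previewed in plain Ruby. The sketch below is only an illustration: `cleavage_sites` is a hypothetical helper, not part of NpSearch, and it assumes the motif is compiled with `Regexp.new` and matched left to right without overlaps.

```ruby
# Hypothetical helper (not part of NpSearch): lists each cleavage-site match
# as [start_index, matched_text] for a motif given in the -m option syntax.
def cleavage_sites(sequence, motif = 'KR|RR|KK')
  pattern = Regexp.new(motif)
  sites = []
  # String#scan walks the sequence left to right; matches never overlap.
  sequence.scan(pattern) do
    match = Regexp.last_match
    sites << [match.begin(0), match[0]]
  end
  sites
end

# Default dibasic motif.
DEFAULT_SITES = cleavage_sites('MAAKRGGRRSSKKT')

# Custom motif where "." stands for any residue.
CUSTOM_SITES = cleavage_sites('MARGGRAKRG', 'KR|RR|R..R')
```

With the default motif, the first call finds `KR`, `RR` and `KK` at positions 3, 7 and 11; the custom motif additionally picks up `RGGR` through the `R..R` alternative.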
|
61
|
+
|
62
|
+
## Contributing
|
63
|
+
|
64
|
+
1. Fork it
|
65
|
+
2. Create your feature branch (`git checkout -b my-new-feature`)
|
66
|
+
3. Commit your changes (`git commit -am 'Add some feature'`)
|
67
|
+
4. Push to the branch (`git push origin my-new-feature`)
|
68
|
+
5. Create new Pull Request
|
data/Rakefile
ADDED
@@ -0,0 +1,14 @@
|
|
1
|
+
require 'bundler/gem_tasks'
|
2
|
+
require 'rake/testtask'
|
3
|
+
|
4
|
+
task default: [:build]
|
5
|
+
desc 'Installs the ruby gem'
|
6
|
+
task :build do
|
7
|
+
exec("gem build npsearch.gemspec && gem install ./npsearch-#{NpSearch::VERSION}.gem")
|
8
|
+
end
|
9
|
+
|
10
|
+
task :test do
|
11
|
+
Rake::TestTask.new do |t|
|
12
|
+
t.pattern = 'test/test_np_search.rb'
|
13
|
+
end
|
14
|
+
end
|
data/bin/npsearch
ADDED
@@ -0,0 +1,165 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
require 'optparse'
|
3
|
+
|
4
|
+
require 'npsearch'
|
5
|
+
require 'npsearch/arg_validator'
|
6
|
+
require 'npsearch/version'
|
7
|
+
|
8
|
+
opt = {}
|
9
|
+
optparse = OptionParser.new do |opts|
|
10
|
+
opts.banner = <<Banner
|
11
|
+
|
12
|
+
* Usage: npsearch [Options] -i [Input File] -o [Output Folder Name]
|
13
|
+
|
14
|
+
* Mandatory Options:
|
15
|
+
|
16
|
+
Banner
|
17
|
+
|
18
|
+
opt[:input_file] = nil
|
19
|
+
opts.on('-i', '--input [file]', 'Path to the input fasta file') do |f|
|
20
|
+
opt[:input_file] = f
|
21
|
+
end
|
22
|
+
|
23
|
+
opts.separator ''
|
24
|
+
opts.separator '* Optional Options:'
|
25
|
+
|
26
|
+
opt[:motif] = 'KR|RR|KK'
|
27
|
+
opts.on('-m', '--motif [Query Motif]', 'By default NpSearch only searches',
|
28
|
+
' for dibasic cleavage site ("KR", "RR" or "KK"). This option allows',
|
29
|
+
' one to change the set of cleavage sites to be searched.',
|
30
|
+
' The period "." can be used to denote any character. Multiple',
|
31
|
+
' motifs query can be used by using a pipeline character ("|")',
|
32
|
+
' between each query and putting the motif query in speech marks',
|
33
|
+
' e.g. "KR|RR|R..R"',
|
34
|
+
' Advanced Users: Regular expressions are supported.') do |motif|
|
35
|
+
opt[:motif] = motif
|
36
|
+
end
|
37
|
+
|
38
|
+
opt[:cut_off] = 10
|
39
|
+
opts.on('-c', '--cut_off N', Integer, 'Changes the minimum Open Reading',
|
40
|
+
' Frame from the default 10 amino acid residues to N amino acid',
|
41
|
+
' residues.') do |n|
|
42
|
+
opt[:cut_off] = n
|
43
|
+
end
|
44
|
+
|
45
|
+
opt[:signalp_file] = nil
|
46
|
+
opts.on('-s', '--signalp_file [file]',
|
47
|
+
'Is used to supply the signal peptide results to the program. These',
|
48
|
+
' signal peptide results must be created using the SignalP program',
|
49
|
+
" (Version 4.x), downloadable from CBS. If this argument isn't ",
|
50
|
+
' supplied, then NpSearch will try to run a local version of the',
|
51
|
+
' Signal P script.') do |signalp_file|
|
52
|
+
opt[:signalp_file] = signalp_file
|
53
|
+
end
|
54
|
+
|
55
|
+
opt[:extract_orf] = false
|
56
|
+
opts.on('-e', '--extract_orf', 'Only extracts the Open Reading Frames.') do
|
57
|
+
opt[:extract_orf] = true
|
58
|
+
end
|
59
|
+
|
60
|
+
opt[:verbose] = false
|
61
|
+
opts.on('-v', '--verbose', 'Provides more information on each step taken',
|
62
|
+
' in this program.') do
|
63
|
+
opt[:verbose] = true
|
64
|
+
end
|
65
|
+
|
66
|
+
opts.on('-h', '--help', 'Display this screen') do
|
67
|
+
puts opts
|
68
|
+
exit
|
69
|
+
end
|
70
|
+
|
71
|
+
opts.on('--version', 'Shows version') do
|
72
|
+
puts NpSearch::VERSION
|
73
|
+
exit
|
74
|
+
end
|
75
|
+
end
|
76
|
+
optparse.parse!
|
77
|
+
|
78
|
+
NpSearch.init(opt)
|
79
|
+
NpSearch.run
|
80
|
+
|
81
|
+
|
82
|
+
|
83
|
+
|
84
|
+
# ############# Argument Validation...##############
|
85
|
+
# arg_vldr = NpSearch::ArgValidators.new(opt[:verbose])
|
86
|
+
# input_type = arg_vldr.arg(opt[:motif], opt[:input], opt[:output_dir],
|
87
|
+
# opt[:cut_off], opt[:extract_orf], opt[:signalp_file],
|
88
|
+
# optparse.help)
|
89
|
+
|
90
|
+
# ############# General Validation...##############
|
91
|
+
# vldr = NpSearch::Validators.new
|
92
|
+
# vldr.output_dir(opt[:output_dir])
|
93
|
+
# if opt[:signalp_file].nil? && opt[:extract_orf] == false
|
94
|
+
# sp_dir = vldr.signalp_dir
|
95
|
+
# end
|
96
|
+
|
97
|
+
# ############# Converting input file to Bio::FastaFormat. #############
|
98
|
+
# input_read = NpSearch::Input.read(opt[:input], input_type)
|
99
|
+
|
100
|
+
# ############# Extract_ORF #############
|
101
|
+
# if input_type == 'genetic'
|
102
|
+
# # Translate Sequences in all 6 frames
|
103
|
+
# translated = NpSearch::Translation.translate(input_read)
|
104
|
+
# translated.to_fasta('translated seq.', "#{opt[:output_dir]}/1_protein.fa")
|
105
|
+
# # Extract all possible ORF that are longer than the ORF_min_length
|
106
|
+
# orf = NpSearch::Translation.extract_orf(translated, opt[:cut_off])
|
107
|
+
# orf.to_fasta('Open Reading Frames', "#{opt[:output_dir]}/2_orf.fa")
|
108
|
+
|
109
|
+
# if opt[:extract_orf]
|
110
|
+
# puts "\nSuccess: All output files created in the directory:" \
|
111
|
+
# "#{opt[:output_dir]}'.\n "
|
112
|
+
# exit
|
113
|
+
# end
|
114
|
+
# end
|
115
|
+
|
116
|
+
# ############# Setting up more variables...##############
|
117
|
+
# if opt[:motif] == 'neuro_clv'
|
118
|
+
# motif = 'KK|KR|RR|' \
|
119
|
+
# 'R..R|R....R|R......R|H..R|H....R|H......R|K..R|K....R|K......R'
|
120
|
+
# else
|
121
|
+
# motif = opt[:motif]
|
122
|
+
# end
|
123
|
+
# vldr.motif_type(motif)
|
124
|
+
|
125
|
+
# if input_type == 'genetic'
|
126
|
+
# sp_input_file = "#{opt[:output_dir]}/2_orf.fa"
|
127
|
+
# sp_hash = orf
|
128
|
+
# file_number = 3
|
129
|
+
# else # i.e. if the input is protein
|
130
|
+
# sp_input_file = opt[:input]
|
131
|
+
# sp_hash = input_read
|
132
|
+
# file_number = 1
|
133
|
+
# end
|
134
|
+
|
135
|
+
# if opt[:signalp_file].nil?
|
136
|
+
# sp_out_file = "#{opt[:output_dir]}/#{file_number}_signalp_out.txt"
|
137
|
+
# file_number += 1
|
138
|
+
# NpSearch::Signalp.signalp(sp_dir, sp_input_file, sp_out_file)
|
139
|
+
# else
|
140
|
+
# sp_out_file = opt[:signalp_file]
|
141
|
+
# file_number = 1
|
142
|
+
# end
|
143
|
+
|
144
|
+
# ############# Signal P Results file Validation #############
|
145
|
+
# vldr.sp_results(sp_out_file)
|
146
|
+
|
147
|
+
# ############# Extract sequences with a signal peptide #############
|
148
|
+
# secretome = NpSearch::Analysis.parse(sp_out_file, sp_hash, motif)
|
149
|
+
# secretome.to_fasta('secretome file',
|
150
|
+
# "#{opt[:output_dir]}/#{file_number}_secretome.fa")
|
151
|
+
# file_number += 1
|
152
|
+
|
153
|
+
# ############# Remove any duplicate data #############
|
154
|
+
# flattened_seq = NpSearch::Analysis.flattener(secretome)
|
155
|
+
|
156
|
+
# ############# Creating Output Files #############
|
157
|
+
# flattened_seq.to_fasta('fasta output file',
|
158
|
+
# "#{opt[:output_dir]}/#{file_number}_output.fa")
|
159
|
+
# flattened_seq.to_html(motif,
|
160
|
+
# "#{opt[:output_dir]}/#{file_number}_output.html")
|
161
|
+
|
162
|
+
# ############# Success #############
|
163
|
+
# puts # a blank line.
|
164
|
+
# puts "Success: All output files created in the directory:'#{opt[:output_dir]}'."
|
165
|
+
# puts # a blank line
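
The banner and the README both list `-o` as a mandatory option, and the commented-out code reads `opt[:output_dir]`, but the parser above does not register that switch. The following is a minimal sketch of how both mandatory paths could be parsed with `OptionParser`. The `parse_paths` method is hypothetical, and declaring the arguments as required (`FILE`, `DIR`) rather than optional is an assumption made for the example.

```ruby
require 'optparse'

# Hypothetical helper (not part of NpSearch): parses the two mandatory paths.
# The :output_dir key mirrors the name used in the commented-out code.
def parse_paths(argv)
  opt = { input_file: nil, output_dir: nil }
  parser = OptionParser.new do |opts|
    opts.on('-i', '--input FILE', 'Path to the input fasta file') do |f|
      opt[:input_file] = f
    end
    opts.on('-o', '--output DIR', 'Path to the output folder') do |d|
      opt[:output_dir] = d
    end
  end
  parser.parse!(argv) # removes the recognised switches from argv
  opt
end

PARSED = parse_paths(%w[-i genetic.fa -o results])
```

Because both switches take a required argument here, `OptionParser` raises `OptionParser::MissingArgument` when a path is left out, which a caller could rescue to print the help banner.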
|
data/lib/npsearch.rb
ADDED
@@ -0,0 +1,96 @@
|
|
1
|
+
require 'bio'
|
2
|
+
require 'fileutils'
|
3
|
+
|
4
|
+
# require 'npsearch/arg_validator'
|
5
|
+
require 'npsearch/logger'
|
6
|
+
require 'npsearch/sequence'
|
7
|
+
require 'npsearch/signalp'
|
8
|
+
require 'npsearch/pool'
|
9
|
+
|
10
|
+
# Top level module / namespace.
|
11
|
+
module NpSearch
|
12
|
+
class <<self
|
13
|
+
MIN_ORF_SIZE = 40 # amino acids (including potential signal peptide)
|
14
|
+
|
15
|
+
attr_accessor :opt
|
16
|
+
attr_accessor :sequences
|
17
|
+
|
18
|
+
def logger
|
19
|
+
@logger ||= Logger.new(STDERR, @opt[:verbose])
|
20
|
+
end
|
21
|
+
|
22
|
+
def init(opt)
|
23
|
+
# @opt = args_validation(opt)
|
24
|
+
@opt = opt
|
25
|
+
@sequences = []
|
26
|
+
@opt[:num_threads] = 8
|
27
|
+
@opt[:type] = guess_sequence_type
|
28
|
+
@opt[:signalp_path] = '/Volumes/Data/programs/signalp-4.1/signalp'
|
29
|
+
@pool = Pool.new(@opt[:num_threads]) if @opt[:num_threads] > 1
|
30
|
+
end
|
31
|
+
|
32
|
+
def run
|
33
|
+
iterate_input_file
|
34
|
+
# score_sequence
|
35
|
+
# scan(?<=(KR|RR|KK))(\w+?)(?=(KR|RR|KK|$))
|
36
|
+
@sequences.each { |s| puts ">#{s.id}\n#{s.seq}" }
|
37
|
+
end
|
38
|
+
|
39
|
+
private
|
40
|
+
|
41
|
+
def iterate_input_file
|
42
|
+
biofastafile = Bio::FlatFile.open(Bio::FastaFormat, @opt[:input_file])
|
43
|
+
biofastafile.each_entry do |entry|
|
44
|
+
if @opt[:num_threads] > 1
|
45
|
+
@pool.schedule(entry) { |e| initialise_seqs(e) }
|
46
|
+
else
|
47
|
+
initialise_seqs(entry)
|
48
|
+
end
|
49
|
+
end
|
50
|
+
@pool.shutdown if @opt[:num_threads] > 1
|
51
|
+
end
|
52
|
+
|
53
|
+
def initialise_seqs(entry)
|
54
|
+
if @opt[:type] == :protein
|
55
|
+
initialise_protein_seq(entry.entry_id, entry.aaseq)
|
56
|
+
else
|
57
|
+
initialise_transcriptomic_seq(entry.entry_id, entry.naseq)
|
58
|
+
end
|
59
|
+
end
|
60
|
+
|
61
|
+
def initialise_protein_seq(id, seq)
|
62
|
+
sp = Signalp.analyse_sequence(seq)
|
63
|
+
@sequences << Sequence.new(id, seq, sp) if sp[:sp] == 'Y'
|
64
|
+
end
|
65
|
+
|
66
|
+
def initialise_transcriptomic_seq(id, naseq)
|
67
|
+
(1..6).each do |f|
|
68
|
+
translated_seq = naseq.translate(f)
|
69
|
+
orfs = translated_seq.to_s.scan(/(?=(M\w{#{MIN_ORF_SIZE},}))./).flatten
|
70
|
+
initialise_orfs(id, orfs, f)
|
71
|
+
end
|
72
|
+
end
|
73
|
+
|
74
|
+
def initialise_orfs(id, orfs, frame)
|
75
|
+
idx = 0
|
76
|
+
orfs.each do |orf|
|
77
|
+
sp = Signalp.analyse_sequence(orf)
|
78
|
+
next if sp[:sp] == 'N'
|
79
|
+
seq = Sequence.new(id, orf, sp)
|
80
|
+
seq.translated_frame = frame
|
81
|
+
seq.orf_index = idx
|
82
|
+
@sequences << seq
|
83
|
+
idx += 1
|
84
|
+
end
|
85
|
+
end
|
86
|
+
|
87
|
+
def guess_sequence_type
|
88
|
+
fasta_content = IO.binread(@opt[:input_file])
|
89
|
+
# removing non-letter and ambiguous characters
|
90
|
+
cleaned_sequence = fasta_content.gsub(/[^A-Z]|[NX]/i, '')
|
91
|
+
return nil if cleaned_sequence.length < 10 # conservative
|
92
|
+
type = Bio::Sequence.new(cleaned_sequence).guess(0.9)
|
93
|
+
(type == Bio::Sequence::NA) ? :nucleotide : :protein
|
94
|
+
end
|
95
|
+
end
|
96
|
+
end
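
The ORF scan in `initialise_transcriptomic_seq` relies on a lookahead so that open reading frames nested inside a longer one are also reported. The sketch below reproduces that regular expression on a toy string. The threshold is lowered from 40 to 5 purely for readability, and `candidate_orfs` is a hypothetical name; it also assumes stop codons appear as `*` in the translated sequence, as they do with BioRuby's default translation.

```ruby
# Lowered from the source's 40 so the example stays short (assumption).
MIN_ORF_SIZE = 5

# Hypothetical helper: returns every stretch that starts at an M and is
# followed by at least MIN_ORF_SIZE word characters. '*' is not a word
# character, so each candidate stops at the next stop codon.
def candidate_orfs(translated)
  # The lookahead captures the ORF without consuming it; the trailing "."
  # then advances one residue, so later M's inside the same ORF still match.
  translated.scan(/(?=(M\w{#{MIN_ORF_SIZE},}))./).flatten
end

ORFS = candidate_orfs('AAMKTLLMAVSGG*MKR*')
```

Here the scan yields `MKTLLMAVSGG` and the nested `MAVSGG`, while `MKR` is dropped because fewer than five residues follow its methionine. Note that the threshold counts the residues after the leading M.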
|
data/lib/npsearch/arg_validator.rb
ADDED
@@ -0,0 +1,264 @@
|
|
1
|
+
module NpSearch
|
2
|
+
class ArgValidators
|
3
|
+
|
4
|
+
|
5
|
+
# Changes the logger level to output extra info when the verbose option is
|
6
|
+
# true.
|
7
|
+
def initialize(verbose_opt)
|
8
|
+
LOG.level = Logger::INFO if verbose_opt == true
|
9
|
+
end
|
10
|
+
|
11
|
+
# Runs all the argument validation methods.
|
12
|
+
def arg(motif, input, output_dir, orf_min_length, extract_orf,
|
13
|
+
signalp_file, help_banner)
|
14
|
+
comp_arg(input, motif, output_dir, extract_orf, help_banner)
|
15
|
+
input_type = guess_input_type(input)
|
16
|
+
extract_orf_conflict(input_type, extract_orf)
|
17
|
+
input_sp_file_conflict(input_type, signalp_file)
|
18
|
+
orf_min_length(orf_min_length)
|
19
|
+
input_type
|
20
|
+
end
|
21
|
+
|
22
|
+
# Ensures that the compulsory input arguments are supplied...
|
23
|
+
def comp_arg(input, motif, output_dir, extract_orf, help_banner)
|
24
|
+
comp_arg_error(motif, 'Query Motif ("-m" option)') if extract_orf == false
|
25
|
+
comp_arg_error(input, 'Input file ("-i option")')
|
26
|
+
comp_arg_error(output_dir, 'Output Folder ("-o" option)')
|
27
|
+
return unless input.nil? || (motif.nil? && extract_orf == false)
|
28
|
+
puts help_banner
|
29
|
+
exit
|
30
|
+
end
|
31
|
+
|
32
|
+
# Ensures that a message is provided for all missing compulsory args.
|
33
|
+
# Run from comp_arg method
|
34
|
+
def comp_arg_error(arg, message)
|
35
|
+
puts 'Usage Error: No ' + message + ' is supplied' if arg.nil?
|
36
|
+
end
|
37
|
+
|
38
|
+
# Guesses the type of data within the input file on the first 100 lines of
|
39
|
+
# the file (ignores all identifiers, i.e. lines that start with a '>').
|
40
|
+
# It has a 80% threshold.
|
41
|
+
def guess_input_type(input_file)
|
42
|
+
input_file_format(input_file)
|
43
|
+
sequences = []
|
44
|
+
File.open(input_file, 'r') do |file_stream|
|
45
|
+
file_stream.readlines[0..100].each do |line|
|
46
|
+
sequences << line.to_s unless line.match(/^>/)
|
47
|
+
end
|
48
|
+
end
|
49
|
+
type = Bio::Sequence.new(sequences).guess(0.8)
|
50
|
+
if type == Bio::Sequence::NA
|
51
|
+
input_type = 'genetic'
|
52
|
+
elsif type == Bio::Sequence::AA
|
53
|
+
input_type = 'protein'
|
54
|
+
end
|
55
|
+
input_type
|
56
|
+
end
|
57
|
+
|
58
|
+
# Ensures that the input file a) exists b) is not empty and c) is a fasta
|
59
|
+
# file. Run from the guess_input_type method.
|
60
|
+
def input_file_format(input_file)
|
61
|
+
unless File.exist?(input_file)
|
62
|
+
fail ArgumentError, "Critical Error: The input file '#{input_file}'" \
|
63
|
+
' does not exist.'
|
64
|
+
end
|
65
|
+
if File.zero?(input_file)
|
66
|
+
fail ArgumentError, "Critical Error: The input file '#{input_file}'" \
|
67
|
+
' is empty.'
|
68
|
+
end
|
69
|
+
unless File.probably_fasta?(input_file)
|
70
|
+
fail ArgumentError, "Critical Error: The input file '#{input_file}'" \
|
71
|
+
' does not seem to be in fasta format. Only' \
|
72
|
+
' input files in fasta format are supported.'
|
73
|
+
end
|
74
|
+
end
|
75
|
+
|
76
|
+
# Ensures that the extract_orf option is only used with genetic data.
|
77
|
+
def extract_orf_conflict(input_type, extract_orf)
|
78
|
+
return unless input_type == 'protein' && extract_orf == true
|
79
|
+
fail ArgumentError, 'Usage Error: Conflicting arguments detected:' \
|
80
|
+
' Protein data detected within the input file,' \
|
81
|
+
' when using the Extract_ORF option (option' \
|
82
|
+
' "-e"). This option is only available when' \
|
83
|
+
' input file contains genetic data.'
|
84
|
+
end
|
85
|
+
|
86
|
+
# Ensures that the protein data (or open reading frames) are supplied as
|
87
|
+
# the input file when the signal p output file is passed.
|
88
|
+
def input_sp_file_conflict(input_type, signalp_file)
|
89
|
+
return unless input_type == 'genetic' && !signalp_file.nil?
|
90
|
+
fail ArgumentError, 'Usage Error: Conflicting arguments detected' \
|
91
|
+
': Genetic data detected within the input file' \
|
92
|
+
' when using the Signal P Input Option (Option' \
|
93
|
+
' "-s"). The Signal P input Option requires the' \
|
94
|
+
' input of two files: the Signal P Script Result' \
|
95
|
+
' files (at the "-s" option) and the protein' \
|
96
|
+
' data file used to run the Signal P Script.'
|
97
|
+
end
|
98
|
+
|
99
|
+
# Ensures that the ORF minimum length is a number. Any digits after the
|
100
|
+
# decimal place are ignored.
|
101
|
+
def orf_min_length(orf_min_length)
|
102
|
+
return unless orf_min_length.to_i < 1
|
103
|
+
fail ArgumentError, 'Usage Error: The Open Reading Frames minimum' \
|
104
|
+
' length must be a positive integer.'
|
105
|
+
end
|
106
|
+
end
|
107
|
+
|
108
|
+
class Validators
|
109
|
+
# Checks for the presence of the output directory; if not found, it asks
|
110
|
+
# the user whether they want to create the output directory.
|
111
|
+
def output_dir(output_dir)
|
112
|
+
unless File.directory? output_dir # If output_dir doesn't exist
|
113
|
+
fail IOError, "\n\nThe output directory does not exist\n\n"
|
114
|
+
end
|
115
|
+
rescue IOError
|
116
|
+
puts # a blank line
|
117
|
+
puts 'The output directory does not exist.'
|
118
|
+
puts # a blank line
|
119
|
+
puts "The directory '#{output_dir}' will be created in this location."
|
120
|
+
puts 'Do you want to continue? [y/n]'
|
121
|
+
print '> '
|
122
|
+
inp = $stdin.gets.chomp
|
123
|
+
until inp.downcase == 'n' || inp.downcase == 'y' || inp == ''
|
124
|
+
puts # a blank line
|
125
|
+
puts "The input: '#{inp}' is not recognised - 'y' or 'n' are the" \
|
126
|
+
' only recognisable inputs.'
|
127
|
+
puts 'Please try again.'
|
128
|
+
puts "The directory '#{output_dir}' will be created in this" \
|
129
|
+
' location.'
|
130
|
+
puts 'Do you want to continue? [y/n]'
|
131
|
+
print '> '
|
132
|
+
inp = $stdin.gets.chomp
|
133
|
+
end
|
134
|
+
if inp.downcase == 'y' || inp == ''
|
135
|
+
FileUtils.mkdir_p "#{output_dir}"
|
136
|
+
puts 'Created output directory...'
|
137
|
+
elsif inp.downcase == 'n'
|
138
|
+
raise ArgumentError, 'Critical Error: An output directory is' \
|
139
|
+
' required; please create an output directory' \
|
140
|
+
' and then try again.'
|
141
|
+
end
|
142
|
+
end
|
143
|
+
|
144
|
+
# Ensures that the Signal P Script is present. If not found in the home
|
145
|
+
# directory, it asks the user for its location.
|
146
|
+
def signalp_dir
|
147
|
+
signalp_dir = "#{Dir.home}/SignalPeptide"
|
148
|
+
if File.exist? "#{signalp_dir}/signalp"
|
149
|
+
signalp_directory = signalp_dir
|
150
|
+
else
|
151
|
+
begin
|
152
|
+
fail IOError, 'The Signal P Script directory cannot be found at' \
|
153
|
+
" the following location: '#{signalp_dir}/'."
|
154
|
+
rescue IOError
|
155
|
+
puts # a blank line
|
156
|
+
puts 'Error: The Signal P Script directory cannot be found at the' \
|
157
|
+
" following location: '#{signalp_dir}/'."
|
158
|
+
puts # a blank line
|
159
|
+
puts 'Please enter the full path or a relative path to the Signal' \
|
160
|
+
' P Script directory (i.e. to the folder containing the' \
|
161
|
+
' Signal P script). Refer to the online tutorial for more help'
|
162
|
+
print '> '
|
163
|
+
inp = $stdin.gets.chomp
|
164
|
+
until (File.exist? "#{signalp_dir}/signalp") ||
|
165
|
+
(File.exist? "#{inp}/signalp")
|
166
|
+
puts # a blank line
|
167
|
+
puts 'The Signal P directory cannot be found at the following' \
|
168
|
+
" location: '#{inp}'"
|
169
|
+
puts 'Please enter the full path or a relative path to the Signal' \
|
170
|
+
' Peptide directory again.'
|
171
|
+
print '> '
|
172
|
+
inp = $stdin.gets.chomp
|
173
|
+
end
|
174
|
+
signalp_directory = inp
|
175
|
+
puts # a blank line
|
176
|
+
puts "The Signal P directory has been found at '#{signalp_directory}'"
|
177
|
+
FileUtils.ln_s "#{signalp_directory}", "#{Dir.home}/SignalPeptide",
|
178
|
+
force: true
|
179
|
+
puts # a blank line
|
180
|
+
end
|
181
|
+
end
|
182
|
+
signalp_directory
|
183
|
+
end
|
184
|
+
|
185
|
+
# Ensures that the supported version of the Signal P Script has been linked
|
186
|
+
# to NpSearch. Run from the 'sp_results' method.
|
187
|
+
def sp_version(input_file)
|
188
|
+
File.open(input_file, 'r') do |file_stream|
|
189
|
+
first_line = file_stream.readline
|
190
|
+
if first_line.match(/# SignalP-4.1/)
|
191
|
+
return true
|
192
|
+
else
|
193
|
+
return false
|
194
|
+
end
|
195
|
+
end
|
196
|
+
end
|
197
|
+
|
198
|
+
# Ensures that the critical columns in the tabular results produced by the
|
199
|
+
# Signal P script are conserved. Run from the 'sp_results' method.
|
200
|
+
def sp_column(input_file)
|
201
|
+
File.open(input_file, 'r') do |file_stream|
|
202
|
+
secondline = file_stream.readlines[1]
|
203
|
+
row = secondline.gsub(/\s+/m, ' ').chomp.split(' ')
|
204
|
+
if row[1] == 'name' && row[4] == 'Ymax' && row[5] == 'pos' &&
|
205
|
+
row[9] == 'D'
|
206
|
+
return true
|
207
|
+
else
|
208
|
+
return false
|
209
|
+
end
|
210
|
+
end
|
211
|
+
end
|
212
|
+
|
213
|
+
# Ensure that the right version of the Signal P script is used (via
|
214
|
+
# 'sp_version' Method). If the wrong signal p script has been linked to
|
215
|
+
# NpSearch, check whether the critical columns in the tabular results
|
216
|
+
# produced by the Signal P Script are conserved (via 'sp_column'
|
217
|
+
# Method).
|
218
|
+
def sp_results(signalp_output_file)
|
219
|
+
return if sp_version(signalp_output_file)
|
220
|
+
# i.e. if Signal P is the wrong version
|
221
|
+
if sp_column(signalp_output_file) # If wrong version but correct columns
|
222
|
+
puts # a blank line
|
223
|
+
puts 'Warning: The wrong version of signalp has been linked.' \
|
224
|
+
' However, the signal peptide output file still seems to' \
|
225
|
+
' be in the right format.'
|
226
|
+
else
|
227
|
+
puts # a blank line
|
228
|
+
puts 'Warning: The wrong version of the signal p has been linked' \
|
229
|
+
' and the signal peptide output is in an unrecognised format.'
|
230
|
+
puts 'Continuing may give you meaningless results.'
|
231
|
+
end
|
232
|
+
puts # a blank line
|
233
|
+
puts 'Do you still want to continue? [y/n]'
|
234
|
+
print '> '
|
235
|
+
inp = $stdin.gets.chomp
|
236
|
+
until inp.downcase == 'n' || inp.downcase == 'y'
|
237
|
+
puts # a blank line
|
238
|
+
puts "The input: '#{inp}' is not recognised - 'y' or 'n' are the" \
|
239
|
+
' only recognisable inputs.'
|
240
|
+
puts 'Please try again.'
print '> '
inp = $stdin.gets.chomp
|
241
|
+
end
|
242
|
+
if inp.downcase == 'y'
|
243
|
+
puts 'Continuing.'
|
244
|
+
elsif inp.downcase == 'n'
|
245
|
+
fail IOError, 'Critical Error: NpSearch only supports SignalP 4.1' \
|
246
|
+
' (downloadable from CBS). Please ensure the correct version' \
|
247
|
+
' of the signal p script is downloaded.'
|
248
|
+
end
|
249
|
+
end
|
250
|
+
|
251
|
+
# Guesses the type of the data in the supplied motif. It ignores all
|
252
|
+
# non-word characters (e.g. '|' that is used for regex). It has a 90%
|
253
|
+
# threshold.
|
254
|
+
def motif_type(motif)
|
255
|
+
motif_seq = Bio::Sequence.new(motif.gsub(/\W/, ''))
|
256
|
+
type = motif_seq.guess(0.9)
|
257
|
+
return unless type.to_s != 'Bio::Sequence::AA'
|
258
|
+
fail IOError, 'Critical Error: There seems to be an error in' \
|
259
|
+
' processing the motif. Please ensure that the motif' \
|
260
|
+
' contains amino acid residues that you wish to search' \
|
261
|
+
' for.'
|
262
|
+
end
|
263
|
+
end
|
264
|
+
end
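
Several validators above ask the user a yes/no question and keep prompting until the answer is recognised. The sketch below shows one way to factor that pattern into a single helper. `confirm?` is a hypothetical method, not part of NpSearch, and the injectable `input:` and `output:` streams are an assumption added so the loop can be exercised without a terminal.

```ruby
require 'stringio'

# Hypothetical helper (not part of NpSearch): asks a yes/no question and
# reads a fresh line on every pass until it sees 'y' or 'n'.
def confirm?(question, input: $stdin, output: $stdout)
  loop do
    output.puts "#{question} [y/n]"
    output.print '> '
    line = input.gets
    # Treat end-of-input as a refusal instead of looping forever.
    return false if line.nil?

    answer = line.chomp.downcase
    return true if answer == 'y'
    return false if answer == 'n'

    output.puts "The input: '#{answer}' is not recognised - 'y' or 'n' " \
                'are the only recognisable inputs.'
  end
end

# Simulated session: one unrecognised answer followed by 'y'.
SESSION_OUTPUT = StringIO.new
CONFIRMED = confirm?('Do you still want to continue?',
                     input: StringIO.new("maybe\ny\n"),
                     output: SESSION_OUTPUT)
```

Reading the answer inside the loop is the important detail: a loop that only re-prints the message without calling `gets` again would never terminate on an unrecognised answer.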
|