full_lengther_next 0.0.2 → 0.0.5
- data/History.txt +12 -0
- data/Manifest.txt +2 -1
- data/README.rdoc +44 -6
- data/bin/download_fln_dbs.rb +50 -22
- data/bin/full_lengther_next +13 -5
- data/lib/full_lengther_next.rb +1 -1
- data/lib/full_lengther_next/classes/fl_analysis.rb +2 -2
- data/lib/full_lengther_next/classes/fln_stats.rb +387 -0
- data/lib/full_lengther_next/classes/my_worker.rb +26 -13
- data/lib/full_lengther_next/classes/my_worker_manager.rb +27 -18
- data/lib/full_lengther_next/classes/nc_rna.rb +21 -0
- data/lib/full_lengther_next/classes/test_code.rb +8 -4
- metadata +4 -3
- data/lib/full_lengther_next/classes/fl2_stats.rb +0 -222
data/History.txt
CHANGED
data/Manifest.txt
CHANGED
@@ -3,12 +3,13 @@ bin/make_user_db.rb
|
|
3
3
|
bin/full_lengther_next
|
4
4
|
History.txt
|
5
5
|
lib/full_lengther_next/classes/common_functions.rb
|
6
|
-
lib/full_lengther_next/classes/fl2_stats.rb
|
7
6
|
lib/full_lengther_next/classes/fl_analysis.rb
|
8
7
|
lib/full_lengther_next/classes/fl_string_utils.rb
|
8
|
+
lib/full_lengther_next/classes/fln_stats.rb
|
9
9
|
lib/full_lengther_next/classes/lcs.rb
|
10
10
|
lib/full_lengther_next/classes/my_worker.rb
|
11
11
|
lib/full_lengther_next/classes/my_worker_manager.rb
|
12
|
+
lib/full_lengther_next/classes/nc_rna.rb
|
12
13
|
lib/full_lengther_next/classes/orf.rb
|
13
14
|
lib/full_lengther_next/classes/sequence.rb
|
14
15
|
lib/full_lengther_next/classes/test_code.rb
|
data/README.rdoc
CHANGED
@@ -16,9 +16,9 @@ FULL-LENGTHERNEXT is a tool adapted to NGS technologies, able to work in paralle
|
|
16
16
|
|
17
17
|
* It returns the translated protein sequence for the complete genes and the nucleotide sequence with frame shift fixed and highlighting the start and end codon for an easier finding of the gene and the UTR regions.
|
18
18
|
|
19
|
-
* FULL-LENGTHERNEXT suggests putative new genes analysing what of the genes classified as unknown are probably coding.
|
19
|
+
* FULL-LENGTHERNEXT suggests putative new genes by analysing which of the genes classified as unknown are probably coding and which are putative non-coding RNA sequences.
|
20
20
|
|
21
|
-
* It produces a
|
21
|
+
* It produces an HTML file with statistics useful for comparing assemblies.
|
22
22
|
|
23
23
|
== SYNOPSIS:
|
24
24
|
|
@@ -26,6 +26,40 @@ FULL-LENGTHERNEXT must be fed with a multifasta file containing all unigenes to
|
|
26
26
|
|
27
27
|
full_lengther_next -f input.fasta -g [fungi|human|invertebrates|mammals|plants|rodents|vertebrates] -d user_db [options]
|
28
28
|
|
29
|
+
=== Output
|
30
|
+
Full-LengthNext results files appear at the end of program execution, grouped in a folder called fl2_results, where the following files can be found:
|
31
|
+
* alignments.txt: Displays the BLASTx alignment between our query sequence translated into amino acids and the protein sequence from the Full-LengthNext database.
|
32
|
+
* annotations.txt: contains the main information for each query sequence: status, subject accession number, subject description, warning messages, protein obtained and indices provided by the BLASTx alignment.
|
33
|
+
* nc_rna.txt: Putative non coding RNA sequences detected using BLAST.
|
34
|
+
* nt_seq.txt: Contains the nucleotide sequence, marking where possible the start codon with hyphen, underscore, hyphen (-_-) and the stop codon with three underscores. Useful for locating UTRs and the gene sequence.
|
35
|
+
* proteins.fasta: fasta format file with the complete proteins.
|
36
|
+
* summary_stats.html: summary statistics of the results obtained by Full-LengthNext for the set of query unigenes. It is useful for assemblies comparison.
|
37
|
+
* tcode_result.txt: Equivalent to the annotations.txt file, but used for sequences with no similarity in the databases. Possible statuses are: coding, non-coding or unknown.
|
38
|
+
|
39
|
+
|
40
|
+
=== CLUSTERED INSTALLATION
|
41
|
+
To install FULL-LENGTHERNEXT on a cluster, the software must be available on all machines, either by installing it in a shared location or by installing it on each cluster node. Once installed, you need to create an init_file where your environment is correctly set up (paths, BLASTDB, etc):
|
42
|
+
|
43
|
+
export PATH=/apps/blast+/bin:/apps/cd-hit/bin
|
44
|
+
export BLASTDB=/var/DB/formatted
|
45
|
+
export FULL_LENGTHER_NEXT_INIT=path_to_init_file
|
46
|
+
And initialize the FULL_LENGTHER_NEXT_INIT environment variable on your main node (from where FULL-LENGTHERNEXT will be initially launched):
|
47
|
+
|
48
|
+
export FULL_LENGTHER_NEXT_INIT=path_to_init_file
|
49
|
+
If you use any queue system like PBS Pro or Moab/Slurm, be sure to initialize the variables on each submission script.
|
50
|
+
|
51
|
+
NOTE: all nodes on the cluster should use ssh keys to allow FULL-LENGTHERNEXT to launch workers without asking for a password.
|
52
|
+
|
53
|
+
SAMPLE INIT FILES FOR CLUSTERED INSTALLATION:
|
54
|
+
Init file
|
55
|
+
$> cat fln_init_env
|
56
|
+
|
57
|
+
source ~ruby19/init_env
|
58
|
+
source ~blast_plus/init_env
|
59
|
+
|
60
|
+
export BLASTDB=~full_lenghter_next/DB/formatted/
|
61
|
+
export FULL_LENGTHER_NEXT_INIT=~full_lenghter_next/fln_init_env
|
62
|
+
|
29
63
|
|
30
64
|
=== PBS Submission script
|
31
65
|
|
@@ -42,10 +76,10 @@ cd $PBS_O_WORKDIR
|
|
42
76
|
|
43
77
|
cat ${PBS_NODEFILE} > workers
|
44
78
|
|
45
|
-
# init
|
46
|
-
source ~
|
79
|
+
# init full-lengthernext
|
80
|
+
source ~full_lenghter_next/init_env
|
47
81
|
|
48
|
-
time
|
82
|
+
time full_lenghter_next -f input.fasta -g group -d user_db -w workers -s 10.0.0
|
49
83
|
Once this submission script is created, you only need to launch it with:
|
50
84
|
|
51
85
|
qsub sample_work.sh
|
@@ -101,7 +135,11 @@ gem install full_lengther_next
|
|
101
135
|
|
102
136
|
=== Install and rebuild Full-LengthNEXT databases
|
103
137
|
|
104
|
-
Full-LengthNEXT needs some databases to work. You can use the BLASTDB environment variable to change the default database location. To
|
138
|
+
Full-LengthNEXT needs some databases to work. You can use the BLASTDB environment variable to change the default database location. To set the path for storing databases, execute the following line in your terminal or add it to your .bash_profile:
|
139
|
+
|
140
|
+
export BLASTDB=/my_path/
|
141
|
+
|
142
|
+
To install databases execute:
|
105
143
|
|
106
144
|
$ download_fln_dbs.rb
|
107
145
|
|
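The README hunk above describes annotations.txt as a tab-delimited table with one row per query sequence. As a rough illustration of how such a file could be consumed, here is a minimal sketch that counts rows per status. The column names are copied from the header string that my_worker_manager.rb writes later in this diff; the helper name `count_statuses` is hypothetical and not part of the gem.

```ruby
# Column names copied from the header written by my_worker_manager.rb in this diff.
FLN_COLUMNS = %w[
  Query_id fasta_length Subject_id db_name Status t_code e_value p_ident
  protein_length s_length Warning_msgs frame ORF_start ORF_end s_start s_end
  Description Protein_sequence
].freeze

# Hypothetical helper: counts rows per Status value.
# Accepts any enumerable of lines (an Array, or File.foreach(path)).
def count_statuses(lines)
  counts = Hash.new(0)
  lines.each do |line|
    line = line.chomp
    # Skip blank lines and the header row, as fln_stats.rb does.
    next if line.empty? || line.start_with?("Query_id\t")

    fields = line.split("\t", -1)       # -1 keeps trailing empty columns
    row = FLN_COLUMNS.zip(fields).to_h  # missing columns become nil
    counts[row['Status']] += 1
  end
  counts
end
```

A call such as `count_statuses(File.foreach('fln_results/annotations.txt'))` would then tally a real run. Note that the README text names the output folder fl2_results, while the code in this diff writes to fln_results; the path here follows the code.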
data/bin/download_fln_dbs.rb
CHANGED
@@ -5,15 +5,41 @@
|
|
5
5
|
# Once in UniProtKB/Swiss-Prot, a protein entry is removed from UniProtKB/TrEMBL.
|
6
6
|
|
7
7
|
require 'net/ftp'
|
8
|
+
require 'open-uri'
|
8
9
|
|
9
10
|
################################################### Functions
|
10
11
|
|
12
|
+
def download_ncrna(formatted_db_path)
|
13
|
+
|
14
|
+
if !File.exists?(File.join(formatted_db_path, "nc_rna_db"))
|
15
|
+
Dir.mkdir(File.join(formatted_db_path, "nc_rna_db"))
|
16
|
+
end
|
17
|
+
|
18
|
+
puts "Downloading ncRNA database"
|
19
|
+
open(File.join(formatted_db_path, "nc_rna_db/ncrna_fln_100.fasta.zip"), "wb") do |my_file|
|
20
|
+
my_file.print open('http://www.scbi.uma.es/downloads/FLNDB/ncrna_fln_100.fasta.zip').read
|
21
|
+
end
|
22
|
+
puts "\nncRNA database downloaded"
|
23
|
+
|
24
|
+
ncrna_zip=File.join(formatted_db_path,'nc_rna_db','ncrna_fln_100.fasta.zip')
|
25
|
+
ncrna_out_dir=File.join(formatted_db_path,'nc_rna_db')
|
26
|
+
system("unzip", ncrna_zip, "-d", ncrna_out_dir)
|
27
|
+
system("rm", ncrna_zip)
|
28
|
+
|
29
|
+
puts "\nncRNA database decompressed"
|
30
|
+
|
31
|
+
ncrna_fasta=File.join(formatted_db_path,'nc_rna_db','ncrna_fln_100.fasta')
|
32
|
+
system("makeblastdb", "-in", ncrna_fasta, "-dbtype", "nucl", "-parse_seqids")
|
33
|
+
|
34
|
+
puts "\nncRNA database completed"
|
35
|
+
end
|
36
|
+
|
11
37
|
def conecta_uniprot(my_array, formatted_db_path)
|
12
38
|
|
13
39
|
$ftp = Net::FTP.new()
|
14
40
|
|
15
|
-
if !File.exists?(
|
16
|
-
Dir.mkdir(
|
41
|
+
if !File.exists?(formatted_db_path)
|
42
|
+
Dir.mkdir(formatted_db_path)
|
17
43
|
end
|
18
44
|
|
19
45
|
$ftp.connect('ftp.uniprot.org')
|
@@ -27,8 +53,9 @@ def conecta_uniprot(my_array, formatted_db_path)
|
|
27
53
|
download_uniprot(db_group, formatted_db_path)
|
28
54
|
end
|
29
55
|
|
56
|
+
varsplic_out=File.join(formatted_db_path,'uniprot_sprot_varsplic.fasta.gz')
|
30
57
|
$ftp.chdir("/pub/databases/uniprot/current_release/knowledgebase/complete")
|
31
|
-
$ftp.getbinaryfile("uniprot_sprot_varsplic.fasta.gz",
|
58
|
+
$ftp.getbinaryfile("uniprot_sprot_varsplic.fasta.gz", varsplic_out)
|
32
59
|
|
33
60
|
puts "isoform files downloaded"
|
34
61
|
|
@@ -38,9 +65,11 @@ end
|
|
38
65
|
|
39
66
|
def download_uniprot(uniprot_group, formatted_db_path)
|
40
67
|
|
68
|
+
sp_out=File.join(formatted_db_path,"uniprot_sprot_#{uniprot_group}.dat.gz")
|
69
|
+
tr_out=File.join(formatted_db_path,"uniprot_trembl_#{uniprot_group}.dat.gz")
|
41
70
|
$ftp.chdir("/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions")
|
42
|
-
$ftp.getbinaryfile("uniprot_sprot_#{uniprot_group}.dat.gz",
|
43
|
-
$ftp.getbinaryfile("uniprot_trembl_#{uniprot_group}.dat.gz",
|
71
|
+
$ftp.getbinaryfile("uniprot_sprot_#{uniprot_group}.dat.gz", sp_out)
|
72
|
+
$ftp.getbinaryfile("uniprot_trembl_#{uniprot_group}.dat.gz", tr_out)
|
44
73
|
|
45
74
|
puts "#{uniprot_group} files downloaded"
|
46
75
|
|
@@ -74,11 +103,11 @@ def filter_incomplete_seqs(file_name, isoform_hash, formatted_db_path)
|
|
74
103
|
db_name.sub!('sprot','sp')
|
75
104
|
db_name.sub!('trembl','tr')
|
76
105
|
|
77
|
-
if !File.exists?("#{
|
78
|
-
Dir.mkdir("#{
|
106
|
+
if !File.exists?(File.join(formatted_db_path, "#{db_name}_#{output_name}"))
|
107
|
+
Dir.mkdir(File.join(formatted_db_path, "#{db_name}_#{output_name}"))
|
79
108
|
end
|
80
109
|
|
81
|
-
output_file = File.new("#{
|
110
|
+
output_file = File.new(File.join(formatted_db_path, "#{db_name}_#{output_name}/#{db_name}_#{output_name}.fasta"), "w")
|
82
111
|
|
83
112
|
File.open(file_name).each_line do |line|
|
84
113
|
if (newseq == false)
|
@@ -152,15 +181,10 @@ def load_isoform_hash(file)
|
|
152
181
|
my_fasta += line
|
153
182
|
end
|
154
183
|
end
|
155
|
-
|
156
|
-
# if (isoform_hash[acc].nil?)
|
157
|
-
# isoform_hash[acc]= "#{my_fasta}\n"
|
158
|
-
# else
|
159
|
-
# isoform_hash[acc]+= "#{my_fasta}\n"
|
160
|
-
# end
|
161
184
|
|
162
185
|
return isoform_hash
|
163
186
|
end
|
187
|
+
|
164
188
|
################################################### MAIN
|
165
189
|
|
166
190
|
ROOT_PATH=File.dirname(__FILE__)
|
@@ -173,24 +197,28 @@ end
|
|
173
197
|
|
174
198
|
ENV['BLASTDB']=formatted_db_path
|
175
199
|
puts "Databases will be downloaded at: #{ENV['BLASTDB']}"
|
176
|
-
|
200
|
+
puts "\nTo set the path for storing databases, execute next line in your terminal or add it to your .bash_profile:\n\n\texport BLASTDB=/my_path/\n\n"
|
177
201
|
|
178
202
|
my_array = ["human","fungi","invertebrates","mammals","plants","rodents","vertebrates"]
|
179
|
-
# my_array = ["plants","
|
203
|
+
# my_array = ["plants","human"] # used for a shorter test
|
180
204
|
|
181
205
|
conecta_uniprot(my_array, formatted_db_path)
|
182
|
-
|
206
|
+
system('gunzip '+formatted_db_path+'*.gz')
|
183
207
|
|
184
208
|
isoform_hash = {}
|
185
|
-
isoform_hash = load_isoform_hash("
|
209
|
+
isoform_hash = load_isoform_hash(File.join(formatted_db_path, "uniprot_sprot_varsplic.fasta"))
|
210
|
+
|
211
|
+
download_ncrna(formatted_db_path)
|
186
212
|
|
187
213
|
my_array.each do |db_group|
|
188
214
|
|
189
|
-
filter_incomplete_seqs("
|
190
|
-
filter_incomplete_seqs("
|
215
|
+
filter_incomplete_seqs(File.join(formatted_db_path, "uniprot_sprot_#{db_group}.dat"), isoform_hash, formatted_db_path)
|
216
|
+
filter_incomplete_seqs(File.join(formatted_db_path, "uniprot_trembl_#{db_group}.dat"), isoform_hash, formatted_db_path)
|
191
217
|
|
192
|
-
|
193
|
-
|
218
|
+
sp_fasta=File.join(formatted_db_path,"sp_#{db_group}","sp_#{db_group}.fasta")
|
219
|
+
tr_fasta=File.join(formatted_db_path,"tr_#{db_group}","tr_#{db_group}.fasta")
|
220
|
+
system("makeblastdb -in #{sp_fasta} -dbtype 'prot' -parse_seqids")
|
221
|
+
system("makeblastdb -in #{tr_fasta} -dbtype 'prot' -parse_seqids")
|
194
222
|
|
195
223
|
end
|
196
224
|
|
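One detail in the hunk above may be worth a second look: `system('gunzip '+formatted_db_path+'*.gz')` builds the glob by plain string concatenation, so it appears to depend on formatted_db_path ending with a slash. The following is a minimal sketch of an alternative that does not rely on that, assuming the same directory layout; `gz_files` and `gunzip_all` are hypothetical names, not functions from the gem.

```ruby
# Hypothetical helper: list the .gz files in a database directory.
# File.join copes with db_dir given with or without a trailing slash.
def gz_files(db_dir)
  Dir.glob(File.join(db_dir, '*.gz')).sort
end

# Hypothetical helper: decompress every .gz file in db_dir.
# Passing the files as separate arguments avoids going through a shell,
# so paths containing spaces are handled as-is.
def gunzip_all(db_dir)
  files = gz_files(db_dir)
  return true if files.empty? # nothing to do

  system('gunzip', *files)
end
```

Separately, on current Ruby versions two calls used in this script would need attention: `File.exists?` was removed in Ruby 3.2 in favour of `File.exist?`, and since Ruby 3.0 a URL has to be opened with `URI.open` rather than a bare `open` after requiring open-uri.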
data/bin/full_lengther_next
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
#!/usr/bin/env ruby
|
2
2
|
|
3
3
|
# 12-2-2011 Noe Fernandez Pozo.
|
4
|
-
# Full-
|
4
|
+
# Full-LengtherNEXT predicts if your sequences are complete, showing you the nucleotide sequences and the translated protein
|
5
5
|
|
6
6
|
#------------------------------------------------------------------ parameters entry
|
7
7
|
require 'optparse'
|
@@ -91,7 +91,7 @@ optparse = OptionParser.new do |opts|
|
|
91
91
|
|
92
92
|
|
93
93
|
# Set a banner, displayed at the top of the help screen.
|
94
|
-
opts.banner = "Usage:
|
94
|
+
opts.banner = "Usage: full_lengther_next -f input.fasta -g [fungi|human|invertebrates|mammals|plants|rodents|vertebrates] [options]\n\n"
|
95
95
|
|
96
96
|
# This displays the help screen
|
97
97
|
opts.on( '-h', '--help', 'Display this screen' ) do
|
@@ -129,7 +129,7 @@ require 'full_lengther_next'
|
|
129
129
|
if ENV['FULL_LENGTHER_NEXT_INIT'] && File.exists?(ENV['FULL_LENGTHER_NEXT_INIT'])
|
130
130
|
FULL_LENGTHER_NEXT_INIT=File.expand_path(ENV['FULL_LENGTHER_NEXT_INIT'])
|
131
131
|
else
|
132
|
-
FULL_LENGTHER_NEXT_INIT=File.join(
|
132
|
+
FULL_LENGTHER_NEXT_INIT=File.join(ROOT_PATH,'init_env')
|
133
133
|
end
|
134
134
|
|
135
135
|
|
@@ -142,8 +142,16 @@ end
|
|
142
142
|
ENV['BLASTDB']=formatted_db_path
|
143
143
|
puts "Using databases at: #{ENV['BLASTDB']}"
|
144
144
|
|
145
|
-
|
146
|
-
|
145
|
+
ncrna_path = File.join(ENV['BLASTDB'],'nc_rna_db','ncrna_fln_100.fasta.nhr')
|
146
|
+
if !File.exists?(ncrna_path)
|
147
|
+
puts "DB File #{ncrna_path} doesn't exist"
|
148
|
+
puts optparse.help
|
149
|
+
exit
|
150
|
+
end
|
151
|
+
|
152
|
+
sp_path=File.join(ENV['BLASTDB'],"sp_#{options[:tax_group]}","sp_#{options[:tax_group]}.fasta.psq")
|
153
|
+
if !File.exists?(sp_path)
|
154
|
+
puts "DB File #{sp_path} doesn't exist, or"
|
147
155
|
puts "incorrect taxon group name: #{options[:tax_group]} choose:"
|
148
156
|
puts optparse.help
|
149
157
|
exit
|
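The startup checks added above test for two BLAST index files under BLASTDB before any work is dispatched. A minimal sketch of the same idea, collecting every missing file in one pass, could look like this. The file names are taken from the hunk; the helper name `missing_db_files` is hypothetical.

```ruby
# Hypothetical helper: return the expected BLAST index files that are absent.
# The two paths mirror the checks added to bin/full_lengther_next in this diff.
def missing_db_files(blastdb, tax_group)
  expected = [
    File.join(blastdb, 'nc_rna_db', 'ncrna_fln_100.fasta.nhr'),
    File.join(blastdb, "sp_#{tax_group}", "sp_#{tax_group}.fasta.psq")
  ]
  # File.exist? is the spelling that still exists on Ruby 3.2 and later.
  expected.reject { |path| File.exist?(path) }
end
```

Reporting all missing files at once would let a user see whether the ncRNA database, the taxon group database, or both still need to be built with download_fln_dbs.rb.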
data/lib/full_lengther_next.rb
CHANGED
data/lib/full_lengther_next/classes/fl_analysis.rb
CHANGED
@@ -32,7 +32,7 @@ module FlAnalysis
|
|
32
32
|
if (db_name =~ /^tr_/)
|
33
33
|
if (seq.get_annotations(:tmp_annotation).empty?)
|
34
34
|
if (seq.sec_desc.empty?)
|
35
|
-
seq.annotate(:
|
35
|
+
seq.annotate(:apply_tcode,'')
|
36
36
|
else
|
37
37
|
seq.annotate(:tmp_annotation,[seq.sec_desc, '','',''],true)
|
38
38
|
end
|
@@ -250,7 +250,7 @@ module FlAnalysis
|
|
250
250
|
seq.sec_desc = "#{q.query_def}\t#{seq.seq_fasta.length}\t#{q.hits[0].acc}\t#{db_name}\tCoding Seq\t\t#{q.hits[0].e_val}\t#{q.hits[0].ident}\t\t#{q.hits[0].full_subject_length}\t#{warnings}\t\t\t\t\t\t#{q.hits[0].definition}\t"
|
251
251
|
seq.annotate(:tmp_annotation,[seq.sec_desc, '','',''],true)
|
252
252
|
else
|
253
|
-
seq.annotate(:
|
253
|
+
seq.annotate(:apply_tcode,'')
|
254
254
|
end
|
255
255
|
else
|
256
256
|
warnings = "Coding sequence with some errors, #{warnings}"
|
data/lib/full_lengther_next/classes/fln_stats.rb
ADDED
@@ -0,0 +1,387 @@
|
|
1
|
+
|
2
|
+
module FlnStats
|
3
|
+
|
4
|
+
def summary_stats
|
5
|
+
stats_file = File.open('fln_results/summary_stats.html', 'w')
|
6
|
+
|
7
|
+
(html_head, html_1, html_2, html_3, html_4) = html_code
|
8
|
+
|
9
|
+
total_seqs = 0
|
10
|
+
|
11
|
+
(status_array, seqs_number1, error_1_num, seq_uniq, complete_uniq, seq_length_stats, complete_seq_length_stats) = annotation_stats
|
12
|
+
(tcode_array, seqs_number2, tcode_length_stats, coding_length_stats, unknown_length_stats) = testcode_stats
|
13
|
+
ncrna_array=ncrna_stats
|
14
|
+
|
15
|
+
total_seqs = seqs_number1 + seqs_number2 + ncrna_array[4].to_i
|
16
|
+
|
17
|
+
stats_file.puts html_head
|
18
|
+
stats_file.puts "\t\t\t\t"+'<font color="#FF0000">'+total_seqs.to_s+"</font> sequences in your input fasta\n\t\t\t</h2>\n\t\t</center>"
|
19
|
+
|
20
|
+
if (total_seqs.to_i > 0)
|
21
|
+
stats_file.puts html_1
|
22
|
+
stats_file.puts ' <tr>
|
23
|
+
<td align="center">YES</td>
|
24
|
+
<td align="right">'+seqs_number1.to_s+'</td>
|
25
|
+
<td align="right">'+'%.2f' % (100*seqs_number1.to_f/total_seqs.to_f).to_s+' %</td>
|
26
|
+
<td align="right">'+seq_uniq.to_s+'</td>
|
27
|
+
<td align="right">'+seq_length_stats[0].to_s+'</td>
|
28
|
+
<td align="right">'+seq_length_stats[1].to_s+'</td>
|
29
|
+
<td align="right">'+seq_length_stats[2].to_s+'</td>
|
30
|
+
<td align="right">'+seq_length_stats[3].to_s+'</td>
|
31
|
+
</tr>'
|
32
|
+
stats_file.puts ' <tr>
|
33
|
+
<td align="center">NO</td>
|
34
|
+
<td align="right">'+seqs_number2.to_s+'</td>
|
35
|
+
<td align="right">'+'%.2f' % (100*seqs_number2.to_f/total_seqs.to_f).to_s+' %</td>
|
36
|
+
<td align="right">-</td>
|
37
|
+
<td align="right">'+tcode_length_stats[0].to_s+'</td>
|
38
|
+
<td align="right">'+tcode_length_stats[1].to_s+'</td>
|
39
|
+
<td align="right">'+tcode_length_stats[2].to_s+'</td>
|
40
|
+
<td align="right">'+tcode_length_stats[3].to_s+'</td>
|
41
|
+
</tr>'
|
42
|
+
stats_file.puts ' <tr>
|
43
|
+
<td align="center">ncRNA</td>
|
44
|
+
<td align="right">'+ncrna_array[4].to_s+'</td>
|
45
|
+
<td align="right">'+'%.2f' % (100*ncrna_array[4].to_f/total_seqs.to_f).to_s+' %</td>
|
46
|
+
<td align="right">-</td>
|
47
|
+
<td align="right">'+ncrna_array[0].to_s+'</td>
|
48
|
+
<td align="right">'+ncrna_array[1].to_s+'</td>
|
49
|
+
<td align="right">'+ncrna_array[2].to_s+'</td>
|
50
|
+
<td align="right">'+ncrna_array[3].to_s+'</td>
|
51
|
+
</tr>
|
52
|
+
</table>'
|
53
|
+
|
54
|
+
stats_file.puts ' <p><font color="#FF0000">'+error_1_num.to_s+'</font> Sequences with sense and antisense hits error</p>'
|
55
|
+
stats_file.puts ' <p><font color="#FF0000">'+complete_uniq.to_s+'</font> Complete sequences with different ortologue ID</p>'
|
56
|
+
stats_file.puts html_2
|
57
|
+
status_array.each do |status|
|
58
|
+
stats_file.puts ' <tr>
|
59
|
+
<td align="right">'+status[4].to_s+'</td>
|
60
|
+
<td align="right">'+status[0].to_s+'</td>
|
61
|
+
<td align="right">'+'%.2f' % (100*status[0].to_f/total_seqs.to_f).to_s+' %</td>
|
62
|
+
<td align="right">'+status[1].to_s+'</td>
|
63
|
+
<td align="right">'+status[2].to_s+'</td>
|
64
|
+
<td align="right">'+status[3].to_s+'</td>
|
65
|
+
</tr>'
|
66
|
+
end
|
67
|
+
stats_file.puts html_3
|
68
|
+
|
69
|
+
tcode_array.each do |status|
|
70
|
+
stats_file.puts ' <tr>
|
71
|
+
<td align="right">'+status[5].to_s+'</td>
|
72
|
+
<td align="right">'+status[4].to_s+'</td>
|
73
|
+
<td align="right">'+'%.2f' % (100*status[4].to_f/total_seqs.to_f).to_s+' %</td>
|
74
|
+
<td align="right">'+status[0].to_s+'</td>
|
75
|
+
<td align="right">'+status[1].to_s+'</td>
|
76
|
+
<td align="right">'+status[2].to_s+'</td>
|
77
|
+
<td align="right">'+status[3].to_s+'</td>
|
78
|
+
</tr>'
|
79
|
+
end
|
80
|
+
|
81
|
+
# print Non coding RNA
|
82
|
+
stats_file.puts ' <tr>
|
83
|
+
<td align="right">Putative ncRNA</td>
|
84
|
+
<td align="right">'+ncrna_array[4].to_s+'</td>
|
85
|
+
<td align="right">'+'%.2f' % (100*ncrna_array[4].to_f/total_seqs.to_f).to_s+' %</td>
|
86
|
+
<td align="right">'+ncrna_array[0].to_s+'</td>
|
87
|
+
<td align="right">'+ncrna_array[1].to_s+'</td>
|
88
|
+
<td align="right">'+ncrna_array[2].to_s+'</td>
|
89
|
+
<td align="right">'+ncrna_array[3].to_s+'</td>
|
90
|
+
</tr>
|
91
|
+
</table>
|
92
|
+
</center>'
|
93
|
+
|
94
|
+
end
|
95
|
+
stats_file.puts html_4
|
96
|
+
|
97
|
+
stats_file.close
|
98
|
+
end
|
99
|
+
|
100
|
+
|
101
|
+
def html_code
|
102
|
+
html_head = '<html>
|
103
|
+
<head>
|
104
|
+
<title>FLN Annotation Summary</title>
|
105
|
+
</head>
|
106
|
+
|
107
|
+
<body bgcolor="#FFFFFF">
|
108
|
+
<center>
|
109
|
+
<h1 ALIGN="center">
|
110
|
+
Full-LengtherNEXT
|
111
|
+
<br/>
|
112
|
+
Annotation summary
|
113
|
+
</h1>
|
114
|
+
<h2 align="center">'
|
115
|
+
|
116
|
+
html_1 = '
|
117
|
+
<center>
|
118
|
+
<table border=1>
|
119
|
+
<tr>
|
120
|
+
<th>Ortologue found</th>
|
121
|
+
<th>Sequences found</th>
|
122
|
+
<th>%</th>
|
123
|
+
<th>Different IDs</th>
|
124
|
+
<th>>200 bp</th>
|
125
|
+
<th><200 bp</th>
|
126
|
+
<th>>500 bp</th>
|
127
|
+
<th><500 bp</th>
|
128
|
+
</tr>'
|
129
|
+
|
130
|
+
html_2= ' <br/>
|
131
|
+
<table border=1>
|
132
|
+
<tr>
|
133
|
+
<th>Status</th>
|
134
|
+
<th>Total</th>
|
135
|
+
<th>%</th>
|
136
|
+
<th>UserDB</th>
|
137
|
+
<th>SwissProt</th>
|
138
|
+
<th>TrEMBL</th>
|
139
|
+
</tr>'
|
140
|
+
|
141
|
+
html_3= ' </table>
|
142
|
+
<br/>
|
143
|
+
<table border=1>
|
144
|
+
<tr>
|
145
|
+
<th>Status</th>
|
146
|
+
<th>Total</th>
|
147
|
+
<th>%</th>
|
148
|
+
<th>>200 bp</th>
|
149
|
+
<th><200 bp</th>
|
150
|
+
<th>>500 bp</th>
|
151
|
+
<th><500 bp</th>
|
152
|
+
</tr>'
|
153
|
+
|
154
|
+
html_4 = ' </body>
|
155
|
+
</html>'
|
156
|
+
|
157
|
+
return [html_head, html_1, html_2, html_3, html_4]
|
158
|
+
|
159
|
+
end
|
160
|
+
|
161
|
+
|
162
|
+
def stats_my_db(db_name, array)
|
163
|
+
|
164
|
+
if (db_name !~ /^sp_/) && (db_name !~ /^tr_/)
|
165
|
+
array[1] += 1
|
166
|
+
elsif (db_name =~ /^sp_/)
|
167
|
+
array[2] += 1
|
168
|
+
elsif (db_name =~ /^tr_/)
|
169
|
+
array[3] += 1
|
170
|
+
end
|
171
|
+
|
172
|
+
return array
|
173
|
+
end
|
174
|
+
|
175
|
+
|
176
|
+
def annotation_stats
|
177
|
+
|
178
|
+
seqs_number = 0
|
179
|
+
array_of_all_accs = []
|
180
|
+
array_of_complete_accs = []
|
181
|
+
error_1_num = 0
|
182
|
+
|
183
|
+
# >200, <200, >500, <500
|
184
|
+
seq_length_stats = [0,0,0,0]
|
185
|
+
|
186
|
+
# >200, <200, >500, <500
|
187
|
+
complete_seq_length_stats = [0,0,0,0]
|
188
|
+
|
189
|
+
status_array = []
|
190
|
+
# total, userdb, swissprotdb, trembl, status
|
191
|
+
complete = [0,0,0,0,'Complete']
|
192
|
+
putative_complete = [0,0,0,0,'Putative Complete']
|
193
|
+
c_terminus = [0,0,0,0,'C-terminus']
|
194
|
+
putative_c_terminus = [0,0,0,0,'Putative C-terminus']
|
195
|
+
n_terminus = [0,0,0,0,'N-terminus']
|
196
|
+
putative_n_terminus = [0,0,0,0,'Putative N-terminus']
|
197
|
+
internal = [0,0,0,0,'Internal']
|
198
|
+
cod_seq = [0,0,0,0,'Misassembled']
|
199
|
+
|
200
|
+
|
201
|
+
File.open('fln_results/annotations.txt').each do |line|
|
202
|
+
line.chomp!
|
203
|
+
(name,fasta_length,acc,db_name,status,kk1,kk2,kk3,kk4,kk5,msgs) = line.split("\t")
|
204
|
+
|
205
|
+
if (line !~ /^Query_id\t/) && (!line.empty?)
|
206
|
+
seqs_number += 1
|
207
|
+
array_of_all_accs.push acc
|
208
|
+
# -------------------------------------------------------------------------
|
209
|
+
if (fasta_length.to_i >= 200)
|
210
|
+
seq_length_stats[0] += 1
|
211
|
+
# seqs_longer_200 += 1
|
212
|
+
else
|
213
|
+
seq_length_stats[1] += 1
|
214
|
+
# seqs_shorter_200 += 1
|
215
|
+
end
|
216
|
+
if (fasta_length.to_i >= 500)
|
217
|
+
seq_length_stats[2] += 1
|
218
|
+
# seqs_longer_500 += 1
|
219
|
+
else
|
220
|
+
seq_length_stats[3] += 1
|
221
|
+
# seqs_shorter_500 += 1
|
222
|
+
end
|
223
|
+
# -------------------------------------------------------------------------
|
224
|
+
if (msgs =~ /ERROR#1/)
|
225
|
+
error_1_num += 1
|
226
|
+
end
|
227
|
+
# -------------------------------------------------------------------------
|
228
|
+
if (status == 'Complete')
|
229
|
+
complete[0] += 1
|
230
|
+
array_of_complete_accs.push acc
|
231
|
+
complete = stats_my_db(db_name, complete)
|
232
|
+
|
233
|
+
if (fasta_length.to_i >= 200)
|
234
|
+
complete_seq_length_stats[0] += 1
|
235
|
+
# complete_longer_200 += 1
|
236
|
+
else
|
237
|
+
complete_seq_length_stats[1] += 1
|
238
|
+
# complete_shorter_200 += 1
|
239
|
+
end
|
240
|
+
|
241
|
+
if (fasta_length.to_i >= 500)
|
242
|
+
complete_seq_length_stats[2] += 1
|
243
|
+
# complete_longer_500 += 1
|
244
|
+
else
|
245
|
+
complete_seq_length_stats[3] += 1
|
246
|
+
# complete_shorter_500 += 1
|
247
|
+
end
|
248
|
+
|
249
|
+
elsif (status == 'Putative Complete')
|
250
|
+
putative_complete[0] += 1
|
251
|
+
putative_complete = stats_my_db(db_name, putative_complete)
|
252
|
+
elsif (status == 'C-terminus')
|
253
|
+
c_terminus[0] += 1
|
254
|
+
c_terminus = stats_my_db(db_name, c_terminus)
|
255
|
+
elsif (status == 'N-terminus')
|
256
|
+
n_terminus[0] += 1
|
257
|
+
n_terminus = stats_my_db(db_name, n_terminus)
|
258
|
+
elsif (status == 'Putative C-terminus')
|
259
|
+
putative_c_terminus[0] += 1
|
260
|
+
putative_c_terminus = stats_my_db(db_name, putative_c_terminus)
|
261
|
+
elsif (status == 'Putative N-terminus')
|
262
|
+
putative_n_terminus[0] += 1
|
263
|
+
putative_n_terminus = stats_my_db(db_name, putative_n_terminus)
|
264
|
+
elsif (status == 'Internal')
|
265
|
+
internal[0] += 1
|
266
|
+
internal = stats_my_db(db_name, internal)
|
267
|
+
elsif (status == 'Coding Seq')
|
268
|
+
cod_seq[0] += 1
|
269
|
+
cod_seq = stats_my_db(db_name, cod_seq)
|
270
|
+
end
|
271
|
+
# -------------------------------------------------------------------------
|
272
|
+
end
|
273
|
+
|
274
|
+
end
|
275
|
+
|
276
|
+
status_array = [complete, putative_complete, c_terminus, putative_c_terminus, n_terminus, putative_n_terminus, internal, cod_seq]
|
277
|
+
|
278
|
+
return [status_array, seqs_number, error_1_num, array_of_all_accs.uniq.count, array_of_complete_accs.uniq.count, seq_length_stats, complete_seq_length_stats]
|
279
|
+
end
|
280
|
+
|
281
|
+
|
282
|
+
def testcode_stats
|
283
|
+
|
284
|
+
seqs_number = 0
|
285
|
+
|
286
|
+
# >200, <200, >500, <500
|
287
|
+
all_tcode_stats = [0,0,0,0]
|
288
|
+
|
289
|
+
# >200, <200, >500, <500, total, status
|
290
|
+
coding_length_stats = [0,0,0,0,0,'Coding']
|
291
|
+
p_coding_length_stats = [0,0,0,0,0,'Putative Coding']
|
292
|
+
unknown_length_stats = [0,0,0,0,0,'Unknown']
|
293
|
+
|
294
|
+
File.open('fln_results/tcode_result.txt').each do |line|
|
295
|
+
line.chomp!
|
296
|
+
(name,fasta_length,acc,db_name,status) = line.split("\t")
|
297
|
+
|
298
|
+
if (line !~ /^Query_id\t/) && (!line.empty?)
|
299
|
+
seqs_number += 1
|
300
|
+
|
301
|
+
if (fasta_length.to_i >= 200)
|
302
|
+
all_tcode_stats[0] += 1
|
303
|
+
|
304
|
+
if (status == 'coding')
|
305
|
+
coding_length_stats[4] += 1
|
306
|
+
coding_length_stats[0] += 1
|
307
|
+
elsif (status == 'putative_coding')
|
308
|
+
p_coding_length_stats[4] += 1
|
309
|
+
p_coding_length_stats[0] += 1
|
310
|
+
elsif (status == 'unknown')
|
311
|
+
unknown_length_stats[4] += 1
|
312
|
+
unknown_length_stats[0] += 1
|
313
|
+
end
|
314
|
+
else
|
315
|
+
all_tcode_stats[1] += 1
|
316
|
+
|
317
|
+
if (status == 'coding')
|
318
|
+
coding_length_stats[4] += 1
|
319
|
+
coding_length_stats[1] += 1
|
320
|
+
elsif (status == 'putative_coding')
|
321
|
+
p_coding_length_stats[4] += 1
|
322
|
+
p_coding_length_stats[1] += 1
|
323
|
+
elsif (status == 'unknown')
|
324
|
+
unknown_length_stats[4] += 1
|
325
|
+
unknown_length_stats[1] += 1
|
326
|
+
end
|
327
|
+
end
|
328
|
+
if (fasta_length.to_i >= 500)
|
329
|
+
all_tcode_stats[2] += 1
|
330
|
+
|
331
|
+
if (status == 'coding')
|
332
|
+
coding_length_stats[2] += 1
|
333
|
+
elsif (status == 'putative_coding')
|
334
|
+
p_coding_length_stats[2] += 1
|
335
|
+
elsif (status == 'unknown')
|
336
|
+
unknown_length_stats[2] += 1
|
337
|
+
end
|
338
|
+
else
|
339
|
+
all_tcode_stats[3] += 1
|
340
|
+
|
341
|
+
if (status == 'coding')
|
342
|
+
coding_length_stats[3] += 1
|
343
|
+
elsif (status == 'putative_coding')
|
344
|
+
p_coding_length_stats[3] += 1
|
345
|
+
elsif (status == 'unknown')
|
346
|
+
unknown_length_stats[3] += 1
|
347
|
+
end
|
348
|
+
end
|
349
|
+
|
350
|
+
end
|
351
|
+
|
352
|
+
end
|
353
|
+
|
354
|
+
status_array = [coding_length_stats, p_coding_length_stats, unknown_length_stats]
|
355
|
+
|
356
|
+
return [status_array, seqs_number, all_tcode_stats, coding_length_stats, unknown_length_stats]
|
357
|
+
end
|
358
|
+
|
359
|
+
def ncrna_stats
|
360
|
+
|
361
|
+
# >200, <200, >500, <500, total
|
362
|
+
ncrna_array = [0,0,0,0,0]
|
363
|
+
|
364
|
+
File.open('fln_results/nc_rna.txt').each do |line|
|
365
|
+
line.chomp!
|
366
|
+
(name,fasta_length,acc,db_name,status) = line.split("\t")
|
367
|
+
|
368
|
+
if (status == 'Putative ncRNA')
|
369
|
+
ncrna_array[4] += 1
|
370
|
+
|
371
|
+
if (fasta_length.to_i >= 200)
|
372
|
+
ncrna_array[0] += 1
|
373
|
+
else
|
374
|
+
ncrna_array[1] += 1
|
375
|
+
end
|
376
|
+
if (fasta_length.to_i >= 500)
|
377
|
+
ncrna_array[2] += 1
|
378
|
+
else
|
379
|
+
ncrna_array[3] += 1
|
380
|
+
end
|
381
|
+
end
|
382
|
+
end
|
383
|
+
|
384
|
+
return ncrna_array
|
385
|
+
end
|
386
|
+
|
387
|
+
end
|
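The new fln_stats.rb repeats the same two length comparisons (200 and 500) for annotations, TestCode results and ncRNA hits, and formats each percentage inline. As an illustration only, the shared logic can be sketched as two small helpers. `tally_length` and `percent` are hypothetical names, and the bucket order `[>=200, <200, >=500, <500]` follows the arrays in the file above.

```ruby
# Hypothetical helper: update a 4-slot counter [>=200, <200, >=500, <500]
# for one sequence length, matching the comparisons used in fln_stats.rb.
def tally_length(stats, length)
  stats[length >= 200 ? 0 : 1] += 1
  stats[length >= 500 ? 2 : 3] += 1
  stats
end

# Hypothetical helper: percentage with two decimals, guarded against a zero total.
def percent(part, total)
  return '0.00' if total.to_f.zero?

  format('%.2f', 100.0 * part / total)
end
```

One thing this makes visible: the comparisons are inclusive (`>= 200`, `>= 500`), whereas the HTML table headers in the file read ">200 bp" and ">500 bp", so a sequence of exactly 200 bp is counted in the ">200 bp" column.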
data/lib/full_lengther_next/classes/my_worker.rb
CHANGED
@@ -11,6 +11,8 @@ require "test_code"
|
|
11
11
|
require 'fl_analysis'
|
12
12
|
include FlAnalysis
|
13
13
|
|
14
|
+
require 'nc_rna'
|
15
|
+
include NcRna
|
14
16
|
|
15
17
|
class MyWorker < ScbiMapreduce::Worker
|
16
18
|
|
@@ -41,15 +43,12 @@ class MyWorker < ScbiMapreduce::Worker
|
|
41
43
|
|
42
44
|
end
|
43
45
|
|
44
|
-
# ejecuta
|
45
|
-
def
|
46
|
-
# puts "\n#{user_db_name} ..... executing BLASTx"
|
46
|
+
# run blast using the parameters input file, database, output file and blast type
|
47
|
+
def run_blast(input, database, blast_type, evalue)
|
47
48
|
|
48
|
-
blast=BatchBlast.new("-db #{database}",
|
49
|
+
blast=BatchBlast.new("-db #{database}",blast_type,"-evalue #{evalue} -max_target_seqs 1")
|
49
50
|
blast_result = blast.do_blast_seqs(input, :xml)
|
50
51
|
|
51
|
-
# puts "#{user_db_name} ..... BLASTx finished"
|
52
|
-
|
53
52
|
return blast_result
|
54
53
|
end
|
55
54
|
|
@@ -72,7 +71,7 @@ class MyWorker < ScbiMapreduce::Worker
|
|
72
71
|
end
|
73
72
|
|
74
73
|
# do blast
|
75
|
-
my_blast =
|
74
|
+
my_blast = run_blast(seqs, "#{@options[:user_db]}", 'blastx', '1e-6')
|
76
75
|
|
77
76
|
# split and parse blast
|
78
77
|
seqs.each_with_index do |seq,i|
|
@@ -87,7 +86,8 @@ class MyWorker < ScbiMapreduce::Worker
|
|
87
86
|
|
88
87
|
# -------------------------------------------- UniProt (sp)
|
89
88
|
# blast
|
90
|
-
|
89
|
+
sp_path=File.join("sp_#{@options[:tax_group]}","sp_#{@options[:tax_group]}.fasta")
|
90
|
+
my_blast = run_blast(new_seqs, sp_path, 'blastx', '1e-6')
|
91
91
|
|
92
92
|
# split and parse blast
|
93
93
|
new_seqs.each_with_index do |seq,i|
|
@@ -98,7 +98,8 @@ class MyWorker < ScbiMapreduce::Worker
|
|
98
98
|
|
99
99
|
# -------------------------------------------- UniProt (tr)
|
100
100
|
# blast
|
101
|
-
|
101
|
+
tr_path=File.join("tr_#{@options[:tax_group]}","tr_#{@options[:tax_group]}.fasta")
|
102
|
+
my_blast = run_blast(new_seqs, tr_path, 'blastx', '1e-6')
|
102
103
|
|
103
104
|
# split and parse blast
|
104
105
|
new_seqs.each_with_index do |seq,i|
|
@@ -107,15 +108,27 @@ class MyWorker < ScbiMapreduce::Worker
|
|
107
108
|
|
108
109
|
# -------------------------------------------- Test Code
|
109
110
|
# the sequences without a reliable similarity with an orthologue are processed with Test Code
|
110
|
-
testcode_input=seqs.select{|s| !s.get_annotations(:
|
111
|
+
testcode_input=seqs.select{|s| !s.get_annotations(:apply_tcode).empty?}
|
111
112
|
|
112
|
-
# active this line to test tcode
|
113
|
+
# active this line to test tcode, and comment all lines above in this function
|
113
114
|
# testcode_input=seqs
|
114
|
-
|
115
|
+
|
115
116
|
testcode_input.each do |seq|
|
116
117
|
TestCode.new(seq)
|
117
118
|
end
|
118
|
-
|
119
|
+
|
120
|
+
# -------------------------------------------- nc RNA
|
121
|
+
unknown_seqs=seqs.select{|s| !s.get_annotations(:tcode_unknown).empty?}
|
122
|
+
# run blastn
|
123
|
+
ncrna_path=File.join('nc_rna_db','ncrna_fln_100.fasta')
|
124
|
+
my_blast = run_blast(unknown_seqs, ncrna_path, 'blastn', '1e-3')
|
125
|
+
|
126
|
+
# split and parse blast
|
127
|
+
unknown_seqs.each_with_index do |seq,i|
|
128
|
+
find_nc_rna(seq, my_blast.querys[i])
|
129
|
+
end
|
130
|
+
# ---------------------------------------------------
|
131
|
+
|
119
132
|
end
|
120
133
|
|
121
134
|
end
|
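Read together, the my_worker.rb hunks suggest a staged search: BLASTx against the user database, then SwissProt, then TrEMBL, then TestCode, and finally BLASTn against the ncRNA database for sequences TestCode left as unknown. The sketch below models that flow with plain lambdas in place of BLAST calls. It assumes each stage only sees sequences not resolved earlier, which the visible hunks show for the TestCode and ncRNA steps but not for how new_seqs is built; `run_cascade` and the stage names are hypothetical.

```ruby
# Hypothetical model of a staged search.
# stages is an ordered list of [name, predicate] pairs; each predicate stands
# in for "this stage found a reliable hit for the sequence".
def run_cascade(seq_ids, stages)
  assigned = {}
  remaining = seq_ids.dup

  stages.each do |name, has_hit|
    # Sequences resolved at this stage leave the pool; the rest move on.
    hits, remaining = remaining.partition { |id| has_hit.call(id) }
    hits.each { |id| assigned[id] = name }
  end

  # Anything no stage claimed stays unassigned.
  remaining.each { |id| assigned[id] = :unassigned }
  assigned
end
```

In this model the order of the stages matters: a sequence that would match both the user database and SwissProt is attributed to whichever stage runs first.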
data/lib/full_lengther_next/classes/my_worker_manager.rb
CHANGED
@@ -2,8 +2,8 @@ require 'json'
|
|
2
2
|
require 'scbi_fasta'
|
3
3
|
require 'sequence'
|
4
4
|
|
5
|
-
require '
|
6
|
-
include
|
5
|
+
require 'fln_stats'
|
6
|
+
include FlnStats
|
7
7
|
|
8
8
|
class MyWorkerManager < ScbiMapreduce::WorkManager
|
9
9
|
|
@@ -12,37 +12,43 @@ class MyWorkerManager < ScbiMapreduce::WorkManager
 
     input_file=options[:fasta]
 
-    if !File.exists?('
-      Dir.mkdir('
+    if !File.exists?('fln_results')
+      Dir.mkdir('fln_results')
     end
-
+
+    file_head = "Query_id\tfasta_length\tSubject_id\tdb_name\tStatus\tt_code\te_value\tp_ident\tprotein_length\ts_length\tWarning_msgs\tframe\tORF_start\tORF_end\ts_start\ts_end\tDescription\tProtein_sequence"
+
     @@fasta_file = FastaQualFile.new(input_file,'')
     @@chunk_size=chunk_size
     @@options = options
 
-    @@annotation_file = File.open("
-    @@annotation_file.puts
+    @@annotation_file = File.open("fln_results/annotations.txt", 'w')
+    @@annotation_file.puts file_head
+
+    @@alignment_file = File.open("fln_results/alignments.txt", 'w')
+    @@prot_file = File.open("fln_results/proteins.fasta", 'w')
+    @@nts_file = File.open("fln_results/nt_seq.txt", 'w')
+    @@tcode_file=File.open("fln_results/tcode_result.txt", 'w')
+    @@tcode_file.puts file_head
 
-    @@
-    @@
-    @@nts_file = File.open("fl2_results/nt_seq.txt", 'w')
-    @@tcode_file=File.open("fl2_results/tcode_result.txt", 'w')
-    @@tcode_file.puts "Query_id\tfasta_length\tSubject_id\tdb_name\tStatus\tt_code\te_value\tp_ident\tprotein_length\ts_length\tWarning_msgs\tframe\tORF_start\tORF_end\ts_start\ts_end\tDescription\tProtein_sequence"
+    @@nc_rna_file = File.open("fln_results/nc_rna.txt", 'w')
+    @@nc_rna_file.puts file_head
 
-    # @@error_fasta_file = File.open("
-    # @@error_file = File.open("
+    # @@error_fasta_file = File.open("fln_results/error_seqs.fasta", 'w')
+    # @@error_file = File.open("fln_results/errors_info.txt", 'w')
 
   end
 
   # close files
   def self.end_work_manager
-    @@fasta_file.close
+    # @@fasta_file.close
 
     @@annotation_file.close
     @@alignment_file.close
     @@prot_file.close
     @@nts_file.close
     @@tcode_file.close
+    @@nc_rna_file.close
 
     # @@error_fasta_file.close
     # @@error_file.close
@@ -143,11 +149,14 @@ class MyWorkerManager < ScbiMapreduce::WorkManager
         if (n=seq.get_annotations(:nucleotide).first)
           @@nts_file.puts n[:message]
         end
+      # -------------------------------------------------------- nc RNA
+      elsif (nc=seq.get_annotations(:ncrna).first)
+        @@nc_rna_file.puts nc[:message]
       # -------------------------------------------------------- Test Code
-      elsif (t=seq.get_annotations(:tcode).first)
-
+      elsif (t=seq.get_annotations(:tcode).first)
+        @@tcode_file.puts t[:message]
       end
-      # --------------------------------------------------------
+      # -------------------------------------------------------- errors
       # if e=seq.get_annotations(:error).first
       # if !e[:message].empty?
       # @@error_fasta_file.puts ">#{seq.seq_name}\n#{seq.seq_fasta}"
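The elsif chain in MyWorkerManager writes each sequence to exactly one result file, and the new :ncrna branch is tested before :tcode. A minimal sketch of that first-match routing follows, assuming annotations are a plain Hash of `{ key => [{ message: ... }] }`. `PRIORITY`, `route`, and the StringIO outputs are illustrative names, and the branch that precedes :ncrna in the real chain is not visible in this hunk, so it is left out.

```ruby
require 'stringio'

# Output streams keyed by annotation type; StringIO stands in for the
# files under fln_results/ so the sketch runs without touching disk.
outputs = {
  ncrna: StringIO.new,
  tcode: StringIO.new
}

# Order matters: the first annotation key found wins, as in an if/elsif chain.
PRIORITY = [:ncrna, :tcode].freeze

# annotations is a plain Hash here: { key => [{ message: "..." }] } (an assumed shape).
def route(annotations, outputs)
  key = PRIORITY.find { |k| annotations.fetch(k, []).any? }
  return nil if key.nil?

  outputs[key].puts annotations[key].first[:message]
  key
end

routed = [
  route({ ncrna: [{ message: "contig_1\tncRNA row" }] }, outputs),
  route({ tcode: [{ message: "contig_2\ttestcode row" }] }, outputs),
  route({ ncrna: [{ message: "contig_3\tncRNA row" }],
          tcode: [{ message: "contig_3\ttestcode row" }] }, outputs),
  route({}, outputs)
]
p routed
```

A sequence that carried both keys would be written only to the ncRNA output, which is the behaviour the branch order in the diff implies.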
@@ -0,0 +1,21 @@
+
+module NcRna
+
+  def find_nc_rna(seq, blast_query)
+
+    # used to detect if the sequence and the blast are from different query
+    if seq.seq_name != blast_query.query_def
+      raise "BLAST query name and sequence are different"
+    end
+
+    q=blast_query
+
+    if (!q.hits[0].nil?) # There is match in blast.
+      nc_annotations = "#{q.query_def}\t#{seq.seq_fasta.length}\t#{q.hits[0].acc}\tncRNA\tPutative ncRNA\t\t#{q.hits[0].e_val}\t#{q.hits[0].ident}\t\t\t\t#{q.hits[0].q_frame}\t#{q.hits[0].q_beg}\t#{q.hits[0].q_end}\t#{q.hits[0].s_beg.to_i}\t#{q.hits[0].s_end.to_i}\t#{q.hits[0].definition}\t"
+      seq.annotate(:ncrna,nc_annotations,true)
+    else
+      unknown_annot = seq.get_annotations(:tcode_unknown).first
+      seq.annotate(:tcode, unknown_annot[:message],true)
+    end
+  end
+end
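The ncRNA row in find_nc_rna is assembled by hand, with runs of `\t` standing for empty columns, so it lines up with the 18 columns of file_head only while the tab counts stay right. The sketch below shows one way to derive a row from the header itself. The column names are taken from file_head in the diff; `COLUMNS`, `build_row`, and the sample values are made up for illustration and are not part of the gem.

```ruby
# Column names quoted from file_head in the diff.
COLUMNS = %w[
  Query_id fasta_length Subject_id db_name Status t_code e_value p_ident
  protein_length s_length Warning_msgs frame ORF_start ORF_end s_start s_end
  Description Protein_sequence
].freeze

# Build one tab-separated row; columns without a value are left empty,
# so every row has as many fields as the header.
def build_row(values)
  unknown = values.keys - COLUMNS
  raise ArgumentError, "unknown columns: #{unknown.join(', ')}" unless unknown.empty?

  COLUMNS.map { |col| values.fetch(col, '').to_s }.join("\t")
end

# Made-up values for illustration only.
row = build_row({
  'Query_id'     => 'contig_42',
  'fasta_length' => 312,
  'Subject_id'   => 'NC_EXAMPLE',
  'db_name'      => 'ncRNA',
  'Status'       => 'Putative ncRNA',
  'e_value'      => 1.0e-20,
  'frame'        => 1
})

# A negative limit keeps trailing empty fields when splitting.
fields = row.split("\t", -1)
puts fields.length
```

When reading such rows back, note that `split("\t")` without a limit drops trailing empty fields, so a row whose last columns are empty would appear shorter than the header.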
@@ -26,8 +26,8 @@ class TestCode
       ref_orf = ''
       ref_msgs = 'Sequence length < 200 nt'
 
-      seq.annotate(:
-
+      seq.annotate(:tcode_unknown,"#{ref_name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{ref_status}\t#{ref_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+      # seq.annotate(:tcode,"#{ref_name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{ref_status}\t#{ref_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
     else
 
       # to test TestCode on the whole sequence, instead of on the ORFs ----------------------------------------------------------------------
@@ -44,8 +44,12 @@ class TestCode
 
       # see add_region filter
       (name,t_code,status,ref_start,ref_end,ref_frame,orf,ref_msgs,stop_before_start,more_than_one_frame) = t_code(seq)
-
-
+      if (status == :unknown)
+        seq.annotate(:tcode_unknown,"#{name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{status}\t#{t_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+      else
+        seq.annotate(:tcode,"#{name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{status}\t#{t_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+      end
+
       # if (ref_msgs.nil?)
       # ref_msgs = ''
       # end
metadata
CHANGED
@@ -2,7 +2,7 @@
 name: full_lengther_next
 version: !ruby/object:Gem::Version
   prerelease:
-  version: 0.0.
+  version: 0.0.5
 platform: ruby
 authors:
 - Noe Fernandez & Dario Guerrero
@@ -10,7 +10,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2012-
+date: 2012-03-09 00:00:00 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: xml-simple
@@ -97,12 +97,13 @@ files:
 - bin/full_lengther_next
 - History.txt
 - lib/full_lengther_next/classes/common_functions.rb
-- lib/full_lengther_next/classes/fl2_stats.rb
 - lib/full_lengther_next/classes/fl_analysis.rb
 - lib/full_lengther_next/classes/fl_string_utils.rb
+- lib/full_lengther_next/classes/fln_stats.rb
 - lib/full_lengther_next/classes/lcs.rb
 - lib/full_lengther_next/classes/my_worker.rb
 - lib/full_lengther_next/classes/my_worker_manager.rb
+- lib/full_lengther_next/classes/nc_rna.rb
 - lib/full_lengther_next/classes/orf.rb
 - lib/full_lengther_next/classes/sequence.rb
 - lib/full_lengther_next/classes/test_code.rb
@@ -1,222 +0,0 @@
-
-module Fl2Stats
-
-  # -------------------------------------------------------------------------------- Main
-  def summary_stats
-    stats_file = File.open('fl2_results/summary_stats.txt', 'w')
-
-    total_seqs = 0
-
-    num1 = annotation_stats(stats_file)
-    num2 = testcode_stats(stats_file)
-
-    total_seqs = num1 + num2
-
-    stats_file.puts "\nInput sequences in your fasta: #{total_seqs}\n\n"
-  end
-
-  # ---------------------------------------------------------------------------------- Functions
-  def stats_my_db(db_name, array)
-
-    if (db_name !~ /^sp_/) && (db_name !~ /^tr_/)
-      array[1] += 1
-    elsif (db_name =~ /^sp_/)
-      array[2] += 1
-    elsif (db_name =~ /^tr_/)
-      array[3] += 1
-    end
-
-    return array
-  end
-
-
-  def annotation_stats(stats_file)
-
-    seqs_number = 0
-    array_of_all_accs = []
-    array_of_complete_accs = []
-    error_1_num = 0
-
-    seqs_longer_200 = 0
-    seqs_shorter_200 = 0
-    complete_longer_200 = 0
-    complete_shorter_200 = 0
-
-    seqs_longer_500 = 0
-    seqs_shorter_500 = 0
-    complete_longer_500 = 0
-    complete_shorter_500 = 0
-
-    complete = [0,0,0,0]
-    putative_complete = [0,0,0,0]
-    c_terminus = [0,0,0,0]
-    putative_c_terminus = [0,0,0,0]
-    n_terminus = [0,0,0,0]
-    putative_n_terminus = [0,0,0,0]
-    internal = [0,0,0,0]
-    cod_seq = [0,0,0,0]
-
-
-    File.open('fl2_results/annotations.txt').each do |line|
-      line.chomp!
-      (name,fasta_length,acc,db_name,status,kk1,kk2,kk3,kk4,kk5,msgs) = line.split("\t")
-
-      if (line !~ /^Query_id\t/)
-        seqs_number += 1
-        array_of_all_accs.push acc
-        # -------------------------------------------------------------------------
-        if (fasta_length.to_i >= 200)
-          seqs_longer_200 += 1
-        else
-          seqs_shorter_200 += 1
-        end
-        if (fasta_length.to_i >= 500)
-          seqs_longer_500 += 1
-        else
-          seqs_shorter_500 += 1
-        end
-        # -------------------------------------------------------------------------
-        if (msgs =~ /ERROR#1/)
-          error_1_num += 1
-        end
-        # -------------------------------------------------------------------------
-        if (status == 'Complete')
-          complete[0] += 1
-          array_of_complete_accs.push acc
-          complete = stats_my_db(db_name, complete)
-
-          if (fasta_length.to_i >= 200)
-            complete_longer_200 += 1
-          else
-            complete_shorter_200 += 1
-          end
-
-          if (fasta_length.to_i >= 500)
-            complete_longer_500 += 1
-          else
-            complete_shorter_500 += 1
-          end
-
-        elsif (status == 'Putative Complete')
-          putative_complete[0] += 1
-          putative_complete = stats_my_db(db_name, putative_complete)
-        elsif (status == 'C-terminus')
-          c_terminus[0] += 1
-          c_terminus = stats_my_db(db_name, c_terminus)
-        elsif (status == 'N-terminus')
-          n_terminus[0] += 1
-          n_terminus = stats_my_db(db_name, n_terminus)
-        elsif (status == 'Putative C-terminus')
-          putative_c_terminus[0] += 1
-          putative_c_terminus = stats_my_db(db_name, putative_c_terminus)
-        elsif (status == 'Putative N-terminus')
-          putative_n_terminus[0] += 1
-          putative_n_terminus = stats_my_db(db_name, putative_n_terminus)
-        elsif (status == 'Internal')
-          internal[0] += 1
-          internal = stats_my_db(db_name, internal)
-        elsif (status == 'Coding Seq')
-          cod_seq[0] += 1
-          cod_seq = stats_my_db(db_name, cod_seq)
-        end
-        # -------------------------------------------------------------------------
-      end
-
-    end
-
-    stats_file.puts "--- Annotation Summary ---"
-    stats_file.puts "\n------------------------------ Summary of sequences found by similarity -----"
-
-    stats_file.puts "\n\tSequences found: #{seqs_number}\t\t(>200: #{seqs_longer_200}, <200: #{seqs_shorter_200})\t(>500: #{seqs_longer_500}, <500: #{seqs_shorter_500})"
-    stats_file.puts "\tDifferent IDs: #{array_of_all_accs.uniq.count}"
-
-    stats_file.puts "\n\tsequences with sense and antisense hits error: #{error_1_num}"
-    stats_file.puts "\n------------------------------------------------- Full-Length Sequences -----"
-    stats_file.puts "\tComplete Seqs: #{complete[0]} ("+ '%.3f' % (complete[0].to_f/seqs_number.to_f*100) +" %)\t\t(>200: #{complete_longer_200}, <200: #{complete_shorter_200})\t(>500: #{complete_longer_500}, <500: #{complete_shorter_500})"
-    stats_file.puts "\tDifferent IDs: #{array_of_complete_accs.uniq.count} ("+ '%.3f' % (array_of_complete_accs.uniq.count.to_f/seqs_number.to_f*100) +" %)"
-    stats_file.puts "\n\t\tuser_db: #{complete[1]}\n\t\tsp: #{complete[2]}\n\t\ttr: #{complete[3]}"
-    stats_file.puts "-----------------------------------------------------------------------------"
-
-    stats_file.puts "\n\tputative completes: #{putative_complete[0]}\n\t\tuser_db: #{putative_complete[1]}\n\t\tsp: #{putative_complete[2]}\n\t\ttr: #{putative_complete[3]}"
-    stats_file.puts "\n\tn-terminus: #{n_terminus[0]}\n\t\tuser_db: #{n_terminus[1]}\n\t\tsp: #{n_terminus[2]}\n\t\ttr: #{n_terminus[3]}"
-    stats_file.puts "\n\tputative_n_terminus: #{putative_n_terminus[0]}\n\t\tuser_db: #{putative_n_terminus[1]}\n\t\tsp: #{putative_n_terminus[2]}\n\t\ttr: #{putative_n_terminus[3]}"
-    stats_file.puts "\n\tc-terminus: #{c_terminus[0]}\n\t\tuser_db: #{c_terminus[1]}\n\t\tsp: #{c_terminus[2]}\n\t\ttr: #{c_terminus[3]}"
-    stats_file.puts "\n\tputative_c_terminus: #{putative_c_terminus[0]}\n\t\tuser_db: #{putative_c_terminus[1]}\n\t\tsp: #{putative_c_terminus[2]}\n\t\ttr: #{putative_c_terminus[3]}"
-    stats_file.puts "\n\tinternal: #{internal[0]}\n\t\tuser_db: #{internal[1]}\n\t\tsp: #{internal[2]}\n\t\ttr: #{internal[3]}"
-    stats_file.puts "\n\tcoding sequences with unknown status: #{cod_seq[0]}\n\t\tuser_db: #{cod_seq[1]}\n\t\tsp: #{cod_seq[2]}\n\t\ttr: #{cod_seq[3]}"
-
-    return seqs_number
-  end
-
-
-  def testcode_stats(stats_file)
-
-    seqs_number = 0
-    coding = 0
-    putative_coding = 0
-    unknown = 0
-
-    coding_longer_200 = 0
-    coding_shorter_200 = 0
-    unknown_longer_200 = 0
-    unknown_shorter_200 = 0
-
-    coding_longer_500 = 0
-    coding_shorter_500 = 0
-    unknown_longer_500 = 0
-    unknown_shorter_500 = 0
-
-    File.open('fl2_results/tcode_result.txt').each do |line|
-      line.chomp!
-      (name,fasta_length,acc,db_name,status) = line.split("\t")
-
-      if (line !~ /^Query_id\t/)
-        seqs_number += 1
-
-        if (status == 'coding')
-          coding += 1
-          if (fasta_length.to_i >= 200)
-            coding_longer_200 += 1
-            coding_longer_500 += 1
-          else
-            coding_shorter_200 += 1
-            coding_shorter_500 += 1
-          end
-        elsif (status == 'putative_coding')
-          putative_coding += 1
-        elsif (status == 'unknown')
-          unknown += 1
-          if (fasta_length.to_i >= 200)
-            unknown_longer_200 += 1
-            unknown_longer_500 += 1
-          else
-            unknown_shorter_200 += 1
-            unknown_shorter_500 += 1
-          end
-
-        end
-
-      end
-
-    end
-
-
-    stats_file.puts "\n--------------------------- Test Code Summary\n\n\ttotal seqs: #{seqs_number}"
-    stats_file.puts "\n\tcoding sequences: #{coding}"
-    stats_file.puts "\t\tlonger than 200 bp: #{coding_longer_200}"
-    stats_file.puts "\t\tshorter than 200 bp: #{coding_shorter_200}"
-    stats_file.puts "\t\tlonger than 500 bp: #{coding_longer_500}"
-    stats_file.puts "\t\tshorter than 500 bp: #{coding_shorter_500}"
-    stats_file.puts "\n\tputative coding sequences: #{putative_coding}\n"
-    stats_file.puts "\n\tunknown: #{unknown} ("+ '%.3f' % (unknown.to_f/seqs_number.to_f*100) +" %)"
-    stats_file.puts "\t\tlonger than 200 bp: #{unknown_longer_200}"
-    stats_file.puts "\t\tshorter than 200 bp: #{unknown_shorter_200}"
-    stats_file.puts "\t\tlonger than 500 bp: #{unknown_longer_500}"
-    stats_file.puts "\t\tshorter than 500 bp: #{unknown_shorter_500}"
-    stats_file.puts "\n\tUnknown sequences have a bad test code score or haven't got an ORF longer than 200 nt"
-    stats_file.puts "---------------------------------------------"
-
-    return seqs_number
-  end
-
-end
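In the removed testcode_stats, coding_longer_500 and unknown_longer_500 are incremented inside the `>= 200` branch, so the 500 figures in that summary mirror the 200 figures. Whether fln_stats.rb handles this differently is not visible in this part of the diff. As a point of comparison, the sketch below counts each threshold independently; `length_buckets` is a hypothetical helper written for this note, not code from either module.

```ruby
# Count sequence lengths against each threshold independently.
# The thresholds (200 and 500) are the ones used in the removed module.
def length_buckets(lengths, thresholds = [200, 500])
  thresholds.each_with_object({}) do |limit, counts|
    longer = lengths.count { |len| len >= limit }
    counts[limit] = { longer: longer, shorter: lengths.length - longer }
  end
end

# Made-up lengths for illustration.
buckets = length_buckets([150, 250, 480, 900])
p buckets
```

With these sample lengths, three sequences reach 200 but only one reaches 500, a distinction the nested counters in the removed code would not report.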