full_lengther_next 0.0.2 → 0.0.5

data/History.txt CHANGED
@@ -1,3 +1,15 @@
+ === 0.0.5 2012-03-09
+
+ Fix NCRNA annotation
+
+ === 0.0.4 2012-03-07
+
+ Fixed stats for 0 seqs
+
+ === 0.0.3 2012-03-01
+
+ Added ncrna
+
  === 0.0.2 2012-02-07
 
  Added FULL_LENGTH_NEXT_INIT environment variable for clustered installations
data/Manifest.txt CHANGED
@@ -3,12 +3,13 @@ bin/make_user_db.rb
  bin/full_lengther_next
  History.txt
  lib/full_lengther_next/classes/common_functions.rb
- lib/full_lengther_next/classes/fl2_stats.rb
  lib/full_lengther_next/classes/fl_analysis.rb
  lib/full_lengther_next/classes/fl_string_utils.rb
+ lib/full_lengther_next/classes/fln_stats.rb
  lib/full_lengther_next/classes/lcs.rb
  lib/full_lengther_next/classes/my_worker.rb
  lib/full_lengther_next/classes/my_worker_manager.rb
+ lib/full_lengther_next/classes/nc_rna.rb
  lib/full_lengther_next/classes/orf.rb
  lib/full_lengther_next/classes/sequence.rb
  lib/full_lengther_next/classes/test_code.rb
data/README.rdoc CHANGED
@@ -16,9 +16,9 @@ FULL-LENGTHERNEXT is a tool adapted to NGS technologies, able to work in paralle
 
  * It returns the translated protein sequence for the complete genes and the nucleotide sequence with frame shift fixed and highlighting the start and end codon for an easier finding of the gene and the UTR regions.
 
- * FULL-LENGTHERNEXT suggests putative new genes analysing what of the genes classified as unknown are probably coding.
+ * FULL-LENGTHERNEXT suggests putative new genes by analysing which of the genes classified as unknown are probably coding and which are putative non-coding RNA sequences.
 
- * It produces a stats file useful for assemblies comparison.
+ * It produces an HTML file with statistics useful for comparing assemblies.
 
  == SYNOPSIS:
 
@@ -26,6 +26,40 @@ FULL-LENGTHERNEXT must be fed with a multifasta file containing all unigenes to
 
  full_lengther_next -f input.fasta -g [fungi|human|invertebrates|mammals|plants|rodents|vertebrates] -d user_db [options]
 
+ === Output
+ Full-LengthNext result files appear at the end of program execution, grouped in a folder called fln_results, where the following files can be found:
+ * alignments.txt: Displays the BLASTx alignment between the query sequence translated into amino acids and the protein sequence from the Full-LengthNext database.
+ * annotations.txt: The main information for each query sequence: status, subject accession number, subject description, warning messages, protein obtained and indices provided by the BLASTx alignment.
+ * nc_rna.txt: Putative non-coding RNA sequences detected using BLAST.
+ * nt_seq.txt: Contains the nucleotide sequence, marking when possible the start codon with hyphen, underscore, hyphen (-_-) and the stop codon with three underscores. Useful to find UTRs and the gene sequence.
+ * proteins.fasta: FASTA-format file with the complete proteins.
+ * summary_stats.html: Summary statistics of the results obtained by Full-LengthNext for the set of query unigenes. It is useful for comparing assemblies.
+ * tcode_result.txt: Equivalent to the annotations.txt file, but used for sequences with no similarity in the databases. Possible statuses are: coding, non-coding or unknown.
+
+
+ === CLUSTERED INSTALLATION
+ To install FULL-LENGTHERNEXT on a cluster, you need to have the software available on all machines, either by installing it on a shared location or by installing it on each cluster node. Once installed, you need to create an init_file where your environment is correctly set up (paths, BLASTDB, etc):
+
+ export PATH=/apps/blast+/bin:/apps/cd-hit/bin
+ export BLASTDB=/var/DB/formatted
+ export FULL_LENGTHER_NEXT_INIT=path_to_init_file
+ And initialize the FULL_LENGTHER_NEXT_INIT environment variable on your main node (from where FULL-LENGTHERNEXT will be initially launched):
+
+ export FULL_LENGTHER_NEXT_INIT=path_to_init_file
+ If you use any queue system like PBS Pro or Moab/Slurm, be sure to initialize the variables in each submission script.
+
+ NOTE: all nodes on the cluster should use ssh keys to allow FULL-LENGTHERNEXT to launch workers without asking for a password.
+
+ SAMPLE INIT FILES FOR CLUSTERED INSTALLATION:
+ Init file
+ $> cat fln_init_env
+
+ source ~ruby19/init_env
+ source ~blast_plus/init_env
+
+ export BLASTDB=~full_lenghter_next/DB/formatted/
+ export FULL_LENGTHER_NEXT_INIT=~full_lenghter_next/fln_init_env
+
 
  === PBS Submission script
 
@@ -42,10 +76,10 @@ cd $PBS_O_WORKDIR
 
  cat ${PBS_NODEFILE} > workers
 
- # init seqtrimnext
- source ~seqtrimnext/init_env
+ # init full-lengthernext
+ source ~full_lenghter_next/init_env
 
- time seqtrimnext -t paired_ends.txt -Q fastq -w workers -s 10.0.0
+ time full_lengther_next -f input.fasta -g group -d user_db -w workers -s 10.0.0
  Once this submission script is created, you only need to launch it with:
 
  qsub sample_work.sh
@@ -101,7 +135,11 @@ gem install full_lengther_next
 
  === Install and rebuild Full-LengthNEXT databases
 
- Full-LengthNEXT needs some databases to work. You can use the BLASTDB environment variable to change the default database location. To install them, execute:
+ Full-LengthNEXT needs some databases to work. You can use the BLASTDB environment variable to change the default database location. To set the path for storing databases, execute the following line in your terminal or add it to your .bash_profile:
+
+ export BLASTDB=/my_path/
+
+ To install the databases, execute:
 
  $ download_fln_dbs.rb
 
@@ -5,15 +5,41 @@
  # Once in UniProtKB/Swiss-Prot, a protein entry is removed from UniProtKB/TrEMBL.
 
  require 'net/ftp'
+ require 'open-uri'
 
  ################################################### Functions
 
+ def download_ncrna(formatted_db_path)
+
+ if !File.exists?(File.join(formatted_db_path, "nc_rna_db"))
+ Dir.mkdir(File.join(formatted_db_path, "nc_rna_db"))
+ end
+
+ puts "Downloading ncRNA database"
+ open(File.join(formatted_db_path, "nc_rna_db/ncrna_fln_100.fasta.zip"), "wb") do |my_file|
+ my_file.print open('http://www.scbi.uma.es/downloads/FLNDB/ncrna_fln_100.fasta.zip').read
+ end
+ puts "\nncRNA database downloaded"
+
+ ncrna_zip=File.join(formatted_db_path,'nc_rna_db','ncrna_fln_100.fasta.zip')
+ ncrna_out_dir=File.join(formatted_db_path,'nc_rna_db')
+ system("unzip", ncrna_zip, "-d", ncrna_out_dir)
+ system("rm", ncrna_zip)
+
+ puts "\nncRNA database decompressed"
+
+ ncrna_fasta=File.join(formatted_db_path,'nc_rna_db','ncrna_fln_100.fasta')
+ system("makeblastdb", "-in", ncrna_fasta, "-dbtype", "nucl", "-parse_seqids")
+
+ puts "\nncRNA database completed"
+ end
+
  def conecta_uniprot(my_array, formatted_db_path)
 
  $ftp = Net::FTP.new()
 
- if !File.exists?('blast_dbs')
- Dir.mkdir('blast_dbs')
+ if !File.exists?(formatted_db_path)
+ Dir.mkdir(formatted_db_path)
  end
 
  $ftp.connect('ftp.uniprot.org')
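
Note for readers running this script on a current Ruby: `File.exists?` was removed in Ruby 3.2, and since Ruby 3.0 `Kernel#open` no longer opens URLs through open-uri (`URI.open` is the replacement). The directory checks in the hunk above could be sketched as follows. `ensure_db_dir` is a hypothetical helper, not part of the gem, and unlike `Dir.mkdir` it also creates missing parent directories.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper (not part of full_lengther_next): create the
# database sub-directory if it is missing and return its path.
def ensure_db_dir(formatted_db_path, name)
  dir = File.join(formatted_db_path, name)
  FileUtils.mkdir_p(dir) unless Dir.exist?(dir)
  dir
end

# Usage with a throw-away directory standing in for formatted_db_path.
Dir.mktmpdir do |tmp|
  puts ensure_db_dir(tmp, 'nc_rna_db')
end
```

Calling the helper twice on the same path is harmless, which matches the "create only if missing" intent of the original checks.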
@@ -27,8 +53,9 @@ def conecta_uniprot(my_array, formatted_db_path)
  download_uniprot(db_group, formatted_db_path)
  end
 
+ varsplic_out=File.join(formatted_db_path,'uniprot_sprot_varsplic.fasta.gz')
  $ftp.chdir("/pub/databases/uniprot/current_release/knowledgebase/complete")
- $ftp.getbinaryfile("uniprot_sprot_varsplic.fasta.gz", "#{formatted_db_path}/uniprot_sprot_varsplic.fasta.gz")
+ $ftp.getbinaryfile("uniprot_sprot_varsplic.fasta.gz", varsplic_out)
 
  puts "isoform files downloaded"
 
@@ -38,9 +65,11 @@ end
 
  def download_uniprot(uniprot_group, formatted_db_path)
 
+ sp_out=File.join(formatted_db_path,"uniprot_sprot_#{uniprot_group}.dat.gz")
+ tr_out=File.join(formatted_db_path,"uniprot_trembl_#{uniprot_group}.dat.gz")
  $ftp.chdir("/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions")
- $ftp.getbinaryfile("uniprot_sprot_#{uniprot_group}.dat.gz", "#{formatted_db_path}/uniprot_sprot_#{uniprot_group}.dat.gz")
- $ftp.getbinaryfile("uniprot_trembl_#{uniprot_group}.dat.gz", "#{formatted_db_path}/uniprot_trembl_#{uniprot_group}.dat.gz")
+ $ftp.getbinaryfile("uniprot_sprot_#{uniprot_group}.dat.gz", sp_out)
+ $ftp.getbinaryfile("uniprot_trembl_#{uniprot_group}.dat.gz", tr_out)
 
  puts "#{uniprot_group} files downloaded"
 
@@ -74,11 +103,11 @@ def filter_incomplete_seqs(file_name, isoform_hash, formatted_db_path)
  db_name.sub!('sprot','sp')
  db_name.sub!('trembl','tr')
 
- if !File.exists?("#{formatted_db_path}/#{db_name}_#{output_name}")
- Dir.mkdir("#{formatted_db_path}/#{db_name}_#{output_name}")
+ if !File.exists?(File.join(formatted_db_path, "#{db_name}_#{output_name}"))
+ Dir.mkdir(File.join(formatted_db_path, "#{db_name}_#{output_name}"))
  end
 
- output_file = File.new("#{formatted_db_path}/#{db_name}_#{output_name}/#{db_name}_#{output_name}.fasta", "w")
+ output_file = File.new(File.join(formatted_db_path, "#{db_name}_#{output_name}/#{db_name}_#{output_name}.fasta"), "w")
 
  File.open(file_name).each_line do |line|
  if (newseq == false)
@@ -152,15 +181,10 @@ def load_isoform_hash(file)
  my_fasta += line
  end
  end
-
- # if (isoform_hash[acc].nil?)
- # isoform_hash[acc]= "#{my_fasta}\n"
- # else
- # isoform_hash[acc]+= "#{my_fasta}\n"
- # end
 
  return isoform_hash
  end
+
  ################################################### MAIN
 
  ROOT_PATH=File.dirname(__FILE__)
@@ -173,24 +197,28 @@ end
 
  ENV['BLASTDB']=formatted_db_path
  puts "Databases will be downloaded at: #{ENV['BLASTDB']}"
-
+ puts "\nTo set the path for storing databases, execute next line in your terminal or add it to your .bash_profile:\n\n\texport BLASTDB=/my_path/\n\n"
 
  my_array = ["human","fungi","invertebrates","mammals","plants","rodents","vertebrates"]
- # my_array = ["plants","invertebrates"] # used for a shorter test
+ # my_array = ["plants","human"] # used for a shorter test
 
  conecta_uniprot(my_array, formatted_db_path)
- `gunzip #{formatted_db_path}/*gz`
+ system('gunzip '+formatted_db_path+'*.gz')
 
  isoform_hash = {}
- isoform_hash = load_isoform_hash("#{formatted_db_path}/uniprot_sprot_varsplic.fasta")
+ isoform_hash = load_isoform_hash(File.join(formatted_db_path, "uniprot_sprot_varsplic.fasta"))
+
+ download_ncrna(formatted_db_path)
 
  my_array.each do |db_group|
 
- filter_incomplete_seqs("#{formatted_db_path}/uniprot_sprot_#{db_group}.dat", isoform_hash, formatted_db_path)
- filter_incomplete_seqs("#{formatted_db_path}/uniprot_trembl_#{db_group}.dat", isoform_hash, formatted_db_path)
+ filter_incomplete_seqs(File.join(formatted_db_path, "uniprot_sprot_#{db_group}.dat"), isoform_hash, formatted_db_path)
+ filter_incomplete_seqs(File.join(formatted_db_path, "uniprot_trembl_#{db_group}.dat"), isoform_hash, formatted_db_path)
 
- `makeblastdb -in #{formatted_db_path}/sp_#{db_group}/sp_#{db_group}.fasta -dbtype 'prot' -parse_seqids`
- `makeblastdb -in #{formatted_db_path}/tr_#{db_group}/tr_#{db_group}.fasta -dbtype 'prot' -parse_seqids`
+ sp_fasta=File.join(formatted_db_path,"sp_#{db_group}","sp_#{db_group}.fasta")
+ tr_fasta=File.join(formatted_db_path,"tr_#{db_group}","tr_#{db_group}.fasta")
+ system("makeblastdb -in #{sp_fasta} -dbtype 'prot' -parse_seqids")
+ system("makeblastdb -in #{tr_fasta} -dbtype 'prot' -parse_seqids")
 
  end
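
One detail in the hunk above is worth a sketch: the new `gunzip` call concatenates `formatted_db_path` and `'*.gz'` directly, so it appears to rely on the path ending in a separator. The value of `formatted_db_path` is set outside the lines shown here, so that is an assumption rather than a confirmed bug. A separator-independent variant might look like this; `gz_files` and `gunzip_all` are hypothetical names, not part of the script.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: list the .gz files directly inside db_path,
# whether or not db_path ends with a path separator.
def gz_files(db_path)
  Dir.glob(File.join(db_path, '*.gz')).sort
end

# Sketch of the decompression step: each file is handed to gunzip as its
# own argument, so the shell is not involved in expanding the pattern.
def gunzip_all(db_path)
  files = gz_files(db_path)
  return true if files.empty?
  system('gunzip', *files)
end
```

`File.join` collapses a duplicated separator, so `gz_files('/var/DB/formatted')` and `gz_files('/var/DB/formatted/')` return the same list.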
 
@@ -1,7 +1,7 @@
  #!/usr/bin/env ruby
 
  # 12-2-2011 Noe Fernandez Pozo.
- # Full-Lengther2 predicts if your sequences are complete, showing you the nucleotide sequences and the translated protein
+ # Full-LengtherNEXT predicts if your sequences are complete, showing you the nucleotide sequences and the translated protein
 
  #------------------------------------------------------------------ parameters entry
  require 'optparse'
@@ -91,7 +91,7 @@ optparse = OptionParser.new do |opts|
 
 
  # Set a banner, displayed at the top of the help screen.
- opts.banner = "Usage: full_lengther_2 -f input.fasta -g [fungi|human|invertebrates|mammals|plants|rodents|vertebrates] [options]\n\n"
+ opts.banner = "Usage: full_lengther_next -f input.fasta -g [fungi|human|invertebrates|mammals|plants|rodents|vertebrates] [options]\n\n"
 
  # This displays the help screen
  opts.on( '-h', '--help', 'Display this screen' ) do
@@ -129,7 +129,7 @@ require 'full_lengther_next'
  if ENV['FULL_LENGTHER_NEXT_INIT'] && File.exists?(ENV['FULL_LENGTHER_NEXT_INIT'])
  FULL_LENGTHER_NEXT_INIT=File.expand_path(ENV['FULL_LENGTHER_NEXT_INIT'])
  else
- FULL_LENGTHER_NEXT_INIT=File.join($ROOT_PATH,'init_env')
+ FULL_LENGTHER_NEXT_INIT=File.join(ROOT_PATH,'init_env')
  end
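
The branch above decides which init file the workers will source. As a minimal sketch of that decision in isolation (the method name `resolve_init_file` and its arguments are illustrative, not part of the gem, and `File.exist?` stands in for the older `File.exists?`):

```ruby
require 'tmpdir'

# Hypothetical helper mirroring the branch above: prefer the file named by
# FULL_LENGTHER_NEXT_INIT when it exists, otherwise fall back to the
# init_env file under root_path.
def resolve_init_file(env, root_path)
  candidate = env['FULL_LENGTHER_NEXT_INIT']
  if candidate && File.exist?(candidate)
    File.expand_path(candidate)
  else
    File.join(root_path, 'init_env')
  end
end

# Typical call from an executable:
puts resolve_init_file(ENV, File.dirname(__FILE__))
```

Passing the environment as an argument keeps the lookup testable without touching the real `ENV`.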
 
 
@@ -142,8 +142,16 @@ end
  ENV['BLASTDB']=formatted_db_path
  puts "Using databases at: #{ENV['BLASTDB']}"
 
- if !File.exists?("#{ENV['BLASTDB']}/sp_#{options[:tax_group]}/sp_#{options[:tax_group]}.fasta.psq")
- puts "DB File #{ENV['BLASTDB']}/sp_#{options[:tax_group]}/sp_#{options[:tax_group]}.fasta.psq doesn't exists, or"
+ ncrna_path = File.join(ENV['BLASTDB'],'nc_rna_db','ncrna_fln_100.fasta.nhr')
+ if !File.exists?(ncrna_path)
+ puts "DB File #{ncrna_path} doesn't exists"
+ puts optparse.help
+ exit
+ end
+
+ sp_path=File.join(ENV['BLASTDB'],"sp_#{options[:tax_group]}","sp_#{options[:tax_group]}.fasta.psq")
+ if !File.exists?(sp_path)
+ puts "DB File #{sp_path} doesn't exists, or"
  puts "incorrect taxon group name: #{options[:tax_group]} choose:"
  puts optparse.help
  exit
@@ -7,7 +7,7 @@ $: << File.expand_path(File.join(ROOT_PATH, 'classes'))
 
 
  module FullLengtherNext
- VERSION = '0.0.2'
+ VERSION = '0.0.5'
 
  FULLLENGHTER_VERSION = VERSION
  end
@@ -32,7 +32,7 @@ module FlAnalysis
  if (db_name =~ /^tr_/)
  if (seq.get_annotations(:tmp_annotation).empty?)
  if (seq.sec_desc.empty?)
- seq.annotate(:tcode,'')
+ seq.annotate(:apply_tcode,'')
  else
  seq.annotate(:tmp_annotation,[seq.sec_desc, '','',''],true)
  end
@@ -250,7 +250,7 @@ module FlAnalysis
  seq.sec_desc = "#{q.query_def}\t#{seq.seq_fasta.length}\t#{q.hits[0].acc}\t#{db_name}\tCoding Seq\t\t#{q.hits[0].e_val}\t#{q.hits[0].ident}\t\t#{q.hits[0].full_subject_length}\t#{warnings}\t\t\t\t\t\t#{q.hits[0].definition}\t"
  seq.annotate(:tmp_annotation,[seq.sec_desc, '','',''],true)
  else
- seq.annotate(:tcode,'')
+ seq.annotate(:apply_tcode,'')
  end
  else
  warnings = "Coding sequence with some errors, #{warnings}"
@@ -0,0 +1,387 @@
+
+ module FlnStats
+
+ def summary_stats
+ stats_file = File.open('fln_results/summary_stats.html', 'w')
+
+ (html_head, html_1, html_2, html_3, html_4) = html_code
+
+ total_seqs = 0
+
+ (status_array, seqs_number1, error_1_num, seq_uniq, complete_uniq, seq_length_stats, complete_seq_length_stats) = annotation_stats
+ (tcode_array, seqs_number2, tcode_length_stats, coding_length_stats, unknown_length_stats) = testcode_stats
+ ncrna_array=ncrna_stats
+
+ total_seqs = seqs_number1 + seqs_number2 + ncrna_array[4].to_i
+
+ stats_file.puts html_head
+ stats_file.puts "\t\t\t\t"+'<font color="#FF0000">'+total_seqs.to_s+"</font> sequences in your input fasta\n\t\t\t</h2>\n\t\t</center>"
+
+ if (total_seqs.to_i > 0)
+ stats_file.puts html_1
+ stats_file.puts ' <tr>
+ <td align="center">YES</td>
+ <td align="right">'+seqs_number1.to_s+'</td>
+ <td align="right">'+'%.2f' % (100*seqs_number1.to_f/total_seqs.to_f).to_s+' %</td>
+ <td align="right">'+seq_uniq.to_s+'</td>
+ <td align="right">'+seq_length_stats[0].to_s+'</td>
+ <td align="right">'+seq_length_stats[1].to_s+'</td>
+ <td align="right">'+seq_length_stats[2].to_s+'</td>
+ <td align="right">'+seq_length_stats[3].to_s+'</td>
+ </tr>'
+ stats_file.puts ' <tr>
+ <td align="center">NO</td>
+ <td align="right">'+seqs_number2.to_s+'</td>
+ <td align="right">'+'%.2f' % (100*seqs_number2.to_f/total_seqs.to_f).to_s+' %</td>
+ <td align="right">-</td>
+ <td align="right">'+tcode_length_stats[0].to_s+'</td>
+ <td align="right">'+tcode_length_stats[1].to_s+'</td>
+ <td align="right">'+tcode_length_stats[2].to_s+'</td>
+ <td align="right">'+tcode_length_stats[3].to_s+'</td>
+ </tr>'
+ stats_file.puts ' <tr>
+ <td align="center">ncRNA</td>
+ <td align="right">'+ncrna_array[4].to_s+'</td>
+ <td align="right">'+'%.2f' % (100*ncrna_array[4].to_f/total_seqs.to_f).to_s+' %</td>
+ <td align="right">-</td>
+ <td align="right">'+ncrna_array[0].to_s+'</td>
+ <td align="right">'+ncrna_array[1].to_s+'</td>
+ <td align="right">'+ncrna_array[2].to_s+'</td>
+ <td align="right">'+ncrna_array[3].to_s+'</td>
+ </tr>
+ </table>'
+
+ stats_file.puts ' <p><font color="#FF0000">'+error_1_num.to_s+'</font> Sequences with sense and antisense hits error</p>'
+ stats_file.puts ' <p><font color="#FF0000">'+complete_uniq.to_s+'</font> Complete sequences with different ortologue ID</p>'
+ stats_file.puts html_2
+ status_array.each do |status|
+ stats_file.puts ' <tr>
+ <td align="right">'+status[4].to_s+'</td>
+ <td align="right">'+status[0].to_s+'</td>
+ <td align="right">'+'%.2f' % (100*status[0].to_f/total_seqs.to_f).to_s+' %</td>
+ <td align="right">'+status[1].to_s+'</td>
+ <td align="right">'+status[2].to_s+'</td>
+ <td align="right">'+status[3].to_s+'</td>
+ </tr>'
+ end
+ stats_file.puts html_3
+
+ tcode_array.each do |status|
+ stats_file.puts ' <tr>
+ <td align="right">'+status[5].to_s+'</td>
+ <td align="right">'+status[4].to_s+'</td>
+ <td align="right">'+'%.2f' % (100*status[4].to_f/total_seqs.to_f).to_s+' %</td>
+ <td align="right">'+status[0].to_s+'</td>
+ <td align="right">'+status[1].to_s+'</td>
+ <td align="right">'+status[2].to_s+'</td>
+ <td align="right">'+status[3].to_s+'</td>
+ </tr>'
+ end
+
+ # print Non coding RNA
+ stats_file.puts ' <tr>
+ <td align="right">Putative ncRNA</td>
+ <td align="right">'+ncrna_array[4].to_s+'</td>
+ <td align="right">'+'%.2f' % (100*ncrna_array[4].to_f/total_seqs.to_f).to_s+' %</td>
+ <td align="right">'+ncrna_array[0].to_s+'</td>
+ <td align="right">'+ncrna_array[1].to_s+'</td>
+ <td align="right">'+ncrna_array[2].to_s+'</td>
+ <td align="right">'+ncrna_array[3].to_s+'</td>
+ </tr>
+ </table>
+ </center>'
+
+ end
+ stats_file.puts html_4
+
+ stats_file.close
+ end
+
+
+ def html_code
+ html_head = '<html>
+ <head>
+ <title>FLN Annotation Summary</title>
+ </head>
+
+ <body bgcolor="#FFFFFF">
+ <center>
+ <h1 ALIGN="center">
+ Full-LengtherNEXT
+ <br/>
+ Annotation summary
+ </h1>
+ <h2 align="center">'
+
+ html_1 = '
+ <center>
+ <table border=1>
+ <tr>
+ <th>Ortologue found</th>
+ <th>Sequences found</th>
+ <th>%</th>
+ <th>Different IDs</th>
+ <th>&gt;200 bp</th>
+ <th>&lt;200 bp</th>
+ <th>&gt;500 bp</th>
+ <th>&lt;500 bp</th>
+ </tr>'
+
+ html_2= ' <br/>
+ <table border=1>
+ <tr>
+ <th>Status</th>
+ <th>Total</th>
+ <th>%</th>
+ <th>UserDB</th>
+ <th>SwissProt</th>
+ <th>TrEMBL</th>
+ </tr>'
+
+ html_3= ' </table>
+ <br/>
+ <table border=1>
+ <tr>
+ <th>Status</th>
+ <th>Total</th>
+ <th>%</th>
+ <th>&gt;200 bp</th>
+ <th>&lt;200 bp</th>
+ <th>&gt;500 bp</th>
+ <th>&lt;500 bp</th>
+ </tr>'
+
+ html_4 = ' </body>
+ </html>'
+
+ return [html_head, html_1, html_2, html_3, html_4]
+
+ end
+
+
+ def stats_my_db(db_name, array)
+
+ if (db_name !~ /^sp_/) && (db_name !~ /^tr_/)
+ array[1] += 1
+ elsif (db_name =~ /^sp_/)
+ array[2] += 1
+ elsif (db_name =~ /^tr_/)
+ array[3] += 1
+ end
+
+ return array
+ end
+
+
+ def annotation_stats
+
+ seqs_number = 0
+ array_of_all_accs = []
+ array_of_complete_accs = []
+ error_1_num = 0
+
+ # >200, <200, >500, <500
+ seq_length_stats = [0,0,0,0]
+
+ # >200, <200, >500, <500
+ complete_seq_length_stats = [0,0,0,0]
+
+ status_array = []
+ # total, userdb, swissprotdb, trembl, status
+ complete = [0,0,0,0,'Complete']
+ putative_complete = [0,0,0,0,'Putative Complete']
+ c_terminus = [0,0,0,0,'C-terminus']
+ putative_c_terminus = [0,0,0,0,'Putative C-terminus']
+ n_terminus = [0,0,0,0,'N-terminus']
+ putative_n_terminus = [0,0,0,0,'Putative N-terminus']
+ internal = [0,0,0,0,'Internal']
+ cod_seq = [0,0,0,0,'Misassembled']
+
+
+ File.open('fln_results/annotations.txt').each do |line|
+ line.chomp!
+ (name,fasta_length,acc,db_name,status,kk1,kk2,kk3,kk4,kk5,msgs) = line.split("\t")
+
+ if (line !~ /^Query_id\t/) && (!line.empty?)
+ seqs_number += 1
+ array_of_all_accs.push acc
+ # -------------------------------------------------------------------------
+ if (fasta_length.to_i >= 200)
+ seq_length_stats[0] += 1
+ # seqs_longer_200 += 1
+ else
+ seq_length_stats[1] += 1
+ # seqs_shorter_200 += 1
+ end
+ if (fasta_length.to_i >= 500)
+ seq_length_stats[2] += 1
+ # seqs_longer_500 += 1
+ else
+ seq_length_stats[3] += 1
+ # seqs_shorter_500 += 1
+ end
+ # -------------------------------------------------------------------------
+ if (msgs =~ /ERROR#1/)
+ error_1_num += 1
+ end
+ # -------------------------------------------------------------------------
+ if (status == 'Complete')
+ complete[0] += 1
+ array_of_complete_accs.push acc
+ complete = stats_my_db(db_name, complete)
+
+ if (fasta_length.to_i >= 200)
+ complete_seq_length_stats[0] += 1
+ # complete_longer_200 += 1
+ else
+ complete_seq_length_stats[1] += 1
+ # complete_shorter_200 += 1
+ end
+
+ if (fasta_length.to_i >= 500)
+ complete_seq_length_stats[2] += 1
+ # complete_longer_500 += 1
+ else
+ complete_seq_length_stats[3] += 1
+ # complete_shorter_500 += 1
+ end
+
+ elsif (status == 'Putative Complete')
+ putative_complete[0] += 1
+ putative_complete = stats_my_db(db_name, putative_complete)
+ elsif (status == 'C-terminus')
+ c_terminus[0] += 1
+ c_terminus = stats_my_db(db_name, c_terminus)
+ elsif (status == 'N-terminus')
+ n_terminus[0] += 1
+ n_terminus = stats_my_db(db_name, n_terminus)
+ elsif (status == 'Putative C-terminus')
+ putative_c_terminus[0] += 1
+ putative_c_terminus = stats_my_db(db_name, putative_c_terminus)
+ elsif (status == 'Putative N-terminus')
+ putative_n_terminus[0] += 1
+ putative_n_terminus = stats_my_db(db_name, putative_n_terminus)
+ elsif (status == 'Internal')
+ internal[0] += 1
+ internal = stats_my_db(db_name, internal)
+ elsif (status == 'Coding Seq')
+ cod_seq[0] += 1
+ cod_seq = stats_my_db(db_name, cod_seq)
+ end
+ # -------------------------------------------------------------------------
+ end
+
+ end
+
+ status_array = [complete, putative_complete, c_terminus, putative_c_terminus, n_terminus, putative_n_terminus, internal, cod_seq]
+
+ return [status_array, seqs_number, error_1_num, array_of_all_accs.uniq.count, array_of_complete_accs.uniq.count, seq_length_stats, complete_seq_length_stats]
+ end
+
+
+ def testcode_stats
+
+ seqs_number = 0
+
+ # >200, <200, >500, <500
+ all_tcode_stats = [0,0,0,0]
+
+ # >200, <200, >500, <500, total, status
+ coding_length_stats = [0,0,0,0,0,'Coding']
+ p_coding_length_stats = [0,0,0,0,0,'Putative Coding']
+ unknown_length_stats = [0,0,0,0,0,'Unknown']
+
+ File.open('fln_results/tcode_result.txt').each do |line|
+ line.chomp!
+ (name,fasta_length,acc,db_name,status) = line.split("\t")
+
+ if (line !~ /^Query_id\t/) && (!line.empty?)
+ seqs_number += 1
+
+ if (fasta_length.to_i >= 200)
+ all_tcode_stats[0] += 1
+
+ if (status == 'coding')
+ coding_length_stats[4] += 1
+ coding_length_stats[0] += 1
+ elsif (status == 'putative_coding')
+ p_coding_length_stats[4] += 1
+ p_coding_length_stats[0] += 1
+ elsif (status == 'unknown')
+ unknown_length_stats[4] += 1
+ unknown_length_stats[0] += 1
+ end
+ else
+ all_tcode_stats[1] += 1
+
+ if (status == 'coding')
+ coding_length_stats[4] += 1
+ coding_length_stats[1] += 1
+ elsif (status == 'putative_coding')
+ p_coding_length_stats[4] += 1
+ p_coding_length_stats[1] += 1
+ elsif (status == 'unknown')
+ unknown_length_stats[4] += 1
+ unknown_length_stats[1] += 1
+ end
+ end
+ if (fasta_length.to_i >= 500)
+ all_tcode_stats[2] += 1
+
+ if (status == 'coding')
+ coding_length_stats[2] += 1
+ elsif (status == 'putative_coding')
+ p_coding_length_stats[2] += 1
+ elsif (status == 'unknown')
+ unknown_length_stats[2] += 1
+ end
+ else
+ all_tcode_stats[3] += 1
+
+ if (status == 'coding')
+ coding_length_stats[3] += 1
+ elsif (status == 'putative_coding')
+ p_coding_length_stats[3] += 1
+ elsif (status == 'unknown')
+ unknown_length_stats[3] += 1
+ end
+ end
+
+ end
+
+ end
+
+ status_array = [coding_length_stats, p_coding_length_stats, unknown_length_stats]
+
+ return [status_array, seqs_number, all_tcode_stats, coding_length_stats, unknown_length_stats]
+ end
+
+ def ncrna_stats
+
+ # >200, <200, >500, <500, total
+ ncrna_array = [0,0,0,0,0]
+
+ File.open('fln_results/nc_rna.txt').each do |line|
+ line.chomp!
+ (name,fasta_length,acc,db_name,status) = line.split("\t")
+
+ if (status == 'Putative ncRNA')
+ ncrna_array[4] += 1
+
+ if (fasta_length.to_i >= 200)
+ ncrna_array[0] += 1
+ else
+ ncrna_array[1] += 1
+ end
+ if (fasta_length.to_i >= 500)
+ ncrna_array[2] += 1
+ else
+ ncrna_array[3] += 1
+ end
+ end
+ end
+
+ return ncrna_array
+ end
+
+ end
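
The new FlnStats module repeats the same two length tests (200 and 500 bases) and the same percentage expression in several places, and the `total_seqs > 0` guard in `summary_stats` keeps the percentages from being computed against an empty total. The sketch below isolates both ideas. `length_bins` and `percent` are hypothetical helpers, not part of the gem. Note that the code counts lengths with `>=`, while the HTML column headers read `>200 bp` and `>500 bp`.

```ruby
# Hypothetical helpers, not part of FlnStats.

# Return the two counter indices a sequence length falls into, using the
# counter layout [>=200, <200, >=500, <500] from the module above.
def length_bins(length)
  [length >= 200 ? 0 : 1, length >= 500 ? 2 : 3]
end

# Format count as a percentage of total with two decimals; a zero total
# yields "0.00" rather than dividing by zero.
def percent(count, total)
  return '0.00' if total.to_i.zero?
  format('%.2f', 100.0 * count / total)
end

# Tally three example lengths into one four-slot counter array.
counters = [0, 0, 0, 0]
[150, 320, 800].each do |len|
  length_bins(len).each { |i| counters[i] += 1 }
end
p counters
```

Each sequence increments exactly one of the first two slots and one of the last two, which is the invariant the per-status counters in the module rely on.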
@@ -11,6 +11,8 @@ require "test_code"
  require 'fl_analysis'
  include FlAnalysis
 
+ require 'nc_rna'
+ include NcRna
 
  class MyWorker < ScbiMapreduce::Worker
 
@@ -41,15 +43,12 @@ class MyWorker < ScbiMapreduce::Worker
 
  end
 
- # runs blastx using the input file, database and output file parameters
- def run_blastx(input, database, user_db_name)
- # puts "\n#{user_db_name} ..... executing BLASTx"
+ # runs blast using the input file, database, output file and blast type parameters
+ def run_blast(input, database, blast_type, evalue)
 
- blast=BatchBlast.new("-db #{database}",'blastx',"-evalue 1e-6 -num_alignments 1 -num_descriptions 1")
+ blast=BatchBlast.new("-db #{database}",blast_type,"-evalue #{evalue} -max_target_seqs 1")
  blast_result = blast.do_blast_seqs(input, :xml)
 
- # puts "#{user_db_name} ..... BLASTx finished"
-
  return blast_result
  end
 
@@ -72,7 +71,7 @@ class MyWorker < ScbiMapreduce::Worker
  end
 
  # do blast
- my_blast = run_blastx(seqs, "#{@options[:user_db]}", user_db_name)
+ my_blast = run_blast(seqs, "#{@options[:user_db]}", 'blastx', '1e-6')
 
  # split and parse blast
  seqs.each_with_index do |seq,i|
@@ -87,7 +86,8 @@ class MyWorker < ScbiMapreduce::Worker
 
  # -------------------------------------------- UniProt (sp)
  # blast
- my_blast = run_blastx(new_seqs, "sp_#{@options[:tax_group]}/sp_#{@options[:tax_group]}.fasta", "sp_#{@options[:tax_group]}")
+ sp_path=File.join("sp_#{@options[:tax_group]}","sp_#{@options[:tax_group]}.fasta")
+ my_blast = run_blast(new_seqs, sp_path, 'blastx', '1e-6')
 
  # split and parse blast
  new_seqs.each_with_index do |seq,i|
@@ -98,7 +98,8 @@ class MyWorker < ScbiMapreduce::Worker
 
  # -------------------------------------------- UniProt (tr)
  # blast
- my_blast = run_blastx(new_seqs, "tr_#{@options[:tax_group]}/tr_#{@options[:tax_group]}.fasta", "tr_#{@options[:tax_group]}")
+ tr_path=File.join("tr_#{@options[:tax_group]}","tr_#{@options[:tax_group]}.fasta")
+ my_blast = run_blast(new_seqs, tr_path, 'blastx', '1e-6')
 
  # split and parse blast
  new_seqs.each_with_index do |seq,i|
@@ -107,15 +108,27 @@ class MyWorker < ScbiMapreduce::Worker
 
  # -------------------------------------------- Test Code
  # the sequences without a reliable similarity with an orthologue are processed with Test Code
- testcode_input=seqs.select{|s| !s.get_annotations(:tcode).empty?}
+ testcode_input=seqs.select{|s| !s.get_annotations(:apply_tcode).empty?}
 
- # active this line to test tcode. all the lines above in this method must be commented out
+ # active this line to test tcode, and comment all lines above in this function
  # testcode_input=seqs
-
+
  testcode_input.each do |seq|
  TestCode.new(seq)
  end
-
+
+ # -------------------------------------------- nc RNA
+ unknown_seqs=seqs.select{|s| !s.get_annotations(:tcode_unknown).empty?}
+ # run blastn
+ ncrna_path=File.join('nc_rna_db','ncrna_fln_100.fasta')
+ my_blast = run_blast(unknown_seqs, ncrna_path, 'blastn', '1e-3')
+
+ # split and parse blast
+ unknown_seqs.each_with_index do |seq,i|
+ find_nc_rna(seq, my_blast.querys[i])
+ end
+ # ---------------------------------------------------
+
  end
 
  end
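
Taken together, the worker hunks above issue four searches through the generalized `run_blast`. The sketch below collects their parameters in one place. The database paths, programs and e-values are copied from the calls in the diff; the `Search` struct and the method names are illustrative only. Whether the user database search is conditional is decided outside the lines shown here.

```ruby
# Illustrative container; not a class from the gem.
Search = Struct.new(:label, :database, :blast_type, :evalue)

# Searches issued by the worker in this diff, in order.
def search_plan(tax_group, user_db)
  sp = File.join("sp_#{tax_group}", "sp_#{tax_group}.fasta")
  tr = File.join("tr_#{tax_group}", "tr_#{tax_group}.fasta")
  [
    Search.new('user_db', user_db, 'blastx', '1e-6'),
    Search.new('sp', sp, 'blastx', '1e-6'),
    Search.new('tr', tr, 'blastx', '1e-6'),
    Search.new('ncrna', File.join('nc_rna_db', 'ncrna_fln_100.fasta'), 'blastn', '1e-3')
  ]
end

# Extra option string in the form built inside run_blast.
def blast_options(search)
  "-evalue #{search.evalue} -max_target_seqs 1"
end

search_plan('plants', 'my_user_db').each do |s|
  puts "#{s.label}: #{s.blast_type} -db #{s.database} #{blast_options(s)}"
end
```

Only the last search uses `blastn` and the looser `1e-3` threshold; it runs on the sequences annotated with `:tcode_unknown`.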
@@ -2,8 +2,8 @@ require 'json'
  require 'scbi_fasta'
  require 'sequence'
 
- require 'fl2_stats'
- include Fl2Stats
+ require 'fln_stats'
+ include FlnStats
 
  class MyWorkerManager < ScbiMapreduce::WorkManager
 
@@ -12,37 +12,43 @@ class MyWorkerManager < ScbiMapreduce::WorkManager
 
  input_file=options[:fasta]
 
- if !File.exists?('fl2_results')
- Dir.mkdir('fl2_results')
+ if !File.exists?('fln_results')
+ Dir.mkdir('fln_results')
  end
-
+
+ file_head = "Query_id\tfasta_length\tSubject_id\tdb_name\tStatus\tt_code\te_value\tp_ident\tprotein_length\ts_length\tWarning_msgs\tframe\tORF_start\tORF_end\ts_start\ts_end\tDescription\tProtein_sequence"
+
  @@fasta_file = FastaQualFile.new(input_file,'')
  @@chunk_size=chunk_size
  @@options = options
 
- @@annotation_file = File.open("fl2_results/annotations.txt", 'w')
- @@annotation_file.puts "Query_id\tfasta_length\tSubject_id\tdb_name\tStatus\tt_code\te_value\tp_ident\tprotein_length\ts_length\tWarning_msgs\tframe\tORF_start\tORF_end\ts_start\ts_end\tDescription\tProtein_sequence"
+ @@annotation_file = File.open("fln_results/annotations.txt", 'w')
+ @@annotation_file.puts file_head
+
+ @@alignment_file = File.open("fln_results/alignments.txt", 'w')
+ @@prot_file = File.open("fln_results/proteins.fasta", 'w')
+ @@nts_file = File.open("fln_results/nt_seq.txt", 'w')
+ @@tcode_file=File.open("fln_results/tcode_result.txt", 'w')
+ @@tcode_file.puts file_head
 
- @@alignment_file = File.open("fl2_results/alignments.txt", 'w')
- @@prot_file = File.open("fl2_results/proteins.fasta", 'w')
- @@nts_file = File.open("fl2_results/nt_seq.txt", 'w')
- @@tcode_file=File.open("fl2_results/tcode_result.txt", 'w')
- @@tcode_file.puts "Query_id\tfasta_length\tSubject_id\tdb_name\tStatus\tt_code\te_value\tp_ident\tprotein_length\ts_length\tWarning_msgs\tframe\tORF_start\tORF_end\ts_start\ts_end\tDescription\tProtein_sequence"
+ @@nc_rna_file = File.open("fln_results/nc_rna.txt", 'w')
+ @@nc_rna_file.puts file_head
 
- # @@error_fasta_file = File.open("fl2_results/error_seqs.fasta", 'w')
- # @@error_file = File.open("fl2_results/errors_info.txt", 'w')
+ # @@error_fasta_file = File.open("fln_results/error_seqs.fasta", 'w')
+ # @@error_file = File.open("fln_results/errors_info.txt", 'w')
 
  end
 
  # close files
  def self.end_work_manager
- @@fasta_file.close
+ # @@fasta_file.close
 
  @@annotation_file.close
  @@alignment_file.close
  @@prot_file.close
  @@nts_file.close
  @@tcode_file.close
+ @@nc_rna_file.close
 
  # @@error_fasta_file.close
  # @@error_file.close
@@ -143,11 +149,14 @@ class MyWorkerManager < ScbiMapreduce::WorkManager
        if (n=seq.get_annotations(:nucleotide).first)
          @@nts_file.puts n[:message]
        end
+     # -------------------------------------------------------- nc RNA
+     elsif (nc=seq.get_annotations(:ncrna).first)
+       @@nc_rna_file.puts nc[:message]
      # -------------------------------------------------------- Test Code
-     elsif (t=seq.get_annotations(:tcode).first)
-       @@tcode_file.puts t[:message]
+     elsif (t=seq.get_annotations(:tcode).first)
+       @@tcode_file.puts t[:message]
      end
-     # -------------------------------------------------------- Errors
+     # -------------------------------------------------------- errors
      # if e=seq.get_annotations(:error).first
      #   if !e[:message].empty?
      #     @@error_fasta_file.puts ">#{seq.seq_name}\n#{seq.seq_fasta}"
@@ -0,0 +1,21 @@
+
+ module NcRna
+
+   def find_nc_rna(seq, blast_query)
+
+     # used to detect if the sequence and the blast are from different query
+     if seq.seq_name != blast_query.query_def
+       raise "BLAST query name and sequence are different"
+     end
+
+     q=blast_query
+
+     if (!q.hits[0].nil?) # There is match in blast.
+       nc_annotations = "#{q.query_def}\t#{seq.seq_fasta.length}\t#{q.hits[0].acc}\tncRNA\tPutative ncRNA\t\t#{q.hits[0].e_val}\t#{q.hits[0].ident}\t\t\t\t#{q.hits[0].q_frame}\t#{q.hits[0].q_beg}\t#{q.hits[0].q_end}\t#{q.hits[0].s_beg.to_i}\t#{q.hits[0].s_end.to_i}\t#{q.hits[0].definition}\t"
+       seq.annotate(:ncrna,nc_annotations,true)
+     else
+       unknown_annot = seq.get_annotations(:tcode_unknown).first
+       seq.annotate(:tcode, unknown_annot[:message],true)
+     end
+   end
+ end
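The routing done by `find_nc_rna` can be tried outside the pipeline with stand-in objects. What follows is a minimal sketch, not the gem's code: `FakeSeq`, `FakeHit`, `FakeQuery`, `route_unknown` and the accession `NR_0001` are hypothetical names introduced here, and the annotation message is shortened instead of reproducing the full tab-separated record.

```ruby
# Stand-ins for the real sequence and BLAST result classes (hypothetical,
# defined only so this sketch runs on its own).
FakeHit   = Struct.new(:acc, :e_val)
FakeQuery = Struct.new(:query_def, :hits)

class FakeSeq
  attr_reader :seq_name

  def initialize(seq_name)
    @seq_name = seq_name
    @annotations = Hash.new { |h, k| h[k] = [] }
  end

  # Simplified: appends one message per call and ignores the third argument.
  def annotate(type, message, _flag = true)
    @annotations[type] << { :message => message }
  end

  def get_annotations(type)
    @annotations[type]
  end
end

# Same decision as find_nc_rna in the hunk above, with a shortened message.
def route_unknown(seq, query)
  raise "BLAST query name and sequence are different" if seq.seq_name != query.query_def

  hit = query.hits.first
  if hit
    # A blastn hit exists: record the sequence as a putative ncRNA.
    seq.annotate(:ncrna, "#{query.query_def}\t#{hit.acc}\tPutative ncRNA\t#{hit.e_val}")
  else
    # No hit: promote the parked :tcode_unknown message to :tcode.
    unknown = seq.get_annotations(:tcode_unknown).first
    seq.annotate(:tcode, unknown[:message])
  end
end

with_hit = FakeSeq.new('contig_1')
with_hit.annotate(:tcode_unknown, "contig_1\tunknown")
route_unknown(with_hit, FakeQuery.new('contig_1', [FakeHit.new('NR_0001', 1e-20)]))

no_hit = FakeSeq.new('contig_2')
no_hit.annotate(:tcode_unknown, "contig_2\tunknown")
route_unknown(no_hit, FakeQuery.new('contig_2', []))
```

In this sketch a sequence with a hit ends up with only an `:ncrna` annotation, while a sequence without one gets its `:tcode_unknown` message copied to `:tcode`, which matches the order of the `elsif` branches that write `nc_rna.txt` and `tcode_result.txt`.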
@@ -26,8 +26,8 @@ class TestCode
        ref_orf = ''
        ref_msgs = 'Sequence length < 200 nt'

-       seq.annotate(:tcode,"#{ref_name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{ref_status}\t#{ref_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
-
+       seq.annotate(:tcode_unknown,"#{ref_name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{ref_status}\t#{ref_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+       # seq.annotate(:tcode,"#{ref_name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{ref_status}\t#{ref_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
      else

        # to test testcode with the whole sequence, instead of with the ORFs ----------------------------------------------------------------------
@@ -44,8 +44,12 @@ class TestCode

        # see add_region filter
        (name,t_code,status,ref_start,ref_end,ref_frame,orf,ref_msgs,stop_before_start,more_than_one_frame) = t_code(seq)
-       seq.annotate(:tcode,"#{name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{status}\t#{t_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
-
+       if (status == :unknown)
+         seq.annotate(:tcode_unknown,"#{name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{status}\t#{t_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+       else
+         seq.annotate(:tcode,"#{name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{status}\t#{t_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+       end
+
        # if (ref_msgs.nil?)
        #   ref_msgs = ''
        # end
metadata CHANGED
@@ -2,7 +2,7 @@
  name: full_lengther_next
  version: !ruby/object:Gem::Version
    prerelease:
-   version: 0.0.2
+   version: 0.0.5
  platform: ruby
  authors:
  - Noe Fernandez & Dario Guerrero
@@ -10,7 +10,7 @@ autorequire:
  bindir: bin
  cert_chain: []

- date: 2012-02-07 00:00:00 Z
+ date: 2012-03-09 00:00:00 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: xml-simple
@@ -97,12 +97,13 @@ files:
  - bin/full_lengther_next
  - History.txt
  - lib/full_lengther_next/classes/common_functions.rb
- - lib/full_lengther_next/classes/fl2_stats.rb
  - lib/full_lengther_next/classes/fl_analysis.rb
  - lib/full_lengther_next/classes/fl_string_utils.rb
+ - lib/full_lengther_next/classes/fln_stats.rb
  - lib/full_lengther_next/classes/lcs.rb
  - lib/full_lengther_next/classes/my_worker.rb
  - lib/full_lengther_next/classes/my_worker_manager.rb
+ - lib/full_lengther_next/classes/nc_rna.rb
  - lib/full_lengther_next/classes/orf.rb
  - lib/full_lengther_next/classes/sequence.rb
  - lib/full_lengther_next/classes/test_code.rb
@@ -1,222 +0,0 @@
-
- module Fl2Stats
-
-   # -------------------------------------------------------------------------------- Main
-   def summary_stats
-     stats_file = File.open('fl2_results/summary_stats.txt', 'w')
-
-     total_seqs = 0
-
-     num1 = annotation_stats(stats_file)
-     num2 = testcode_stats(stats_file)
-
-     total_seqs = num1 + num2
-
-     stats_file.puts "\nInput sequences in your fasta: #{total_seqs}\n\n"
-   end
-
-   # ---------------------------------------------------------------------------------- Functions
-   def stats_my_db(db_name, array)
-
-     if (db_name !~ /^sp_/) && (db_name !~ /^tr_/)
-       array[1] += 1
-     elsif (db_name =~ /^sp_/)
-       array[2] += 1
-     elsif (db_name =~ /^tr_/)
-       array[3] += 1
-     end
-
-     return array
-   end
-
-
-   def annotation_stats(stats_file)
-
-     seqs_number = 0
-     array_of_all_accs = []
-     array_of_complete_accs = []
-     error_1_num = 0
-
-     seqs_longer_200 = 0
-     seqs_shorter_200 = 0
-     complete_longer_200 = 0
-     complete_shorter_200 = 0
-
-     seqs_longer_500 = 0
-     seqs_shorter_500 = 0
-     complete_longer_500 = 0
-     complete_shorter_500 = 0
-
-     complete = [0,0,0,0]
-     putative_complete = [0,0,0,0]
-     c_terminus = [0,0,0,0]
-     putative_c_terminus = [0,0,0,0]
-     n_terminus = [0,0,0,0]
-     putative_n_terminus = [0,0,0,0]
-     internal = [0,0,0,0]
-     cod_seq = [0,0,0,0]
-
-
-     File.open('fl2_results/annotations.txt').each do |line|
-       line.chomp!
-       (name,fasta_length,acc,db_name,status,kk1,kk2,kk3,kk4,kk5,msgs) = line.split("\t")
-
-       if (line !~ /^Query_id\t/)
-         seqs_number += 1
-         array_of_all_accs.push acc
-         # -------------------------------------------------------------------------
-         if (fasta_length.to_i >= 200)
-           seqs_longer_200 += 1
-         else
-           seqs_shorter_200 += 1
-         end
-         if (fasta_length.to_i >= 500)
-           seqs_longer_500 += 1
-         else
-           seqs_shorter_500 += 1
-         end
-         # -------------------------------------------------------------------------
-         if (msgs =~ /ERROR#1/)
-           error_1_num += 1
-         end
-         # -------------------------------------------------------------------------
-         if (status == 'Complete')
-           complete[0] += 1
-           array_of_complete_accs.push acc
-           complete = stats_my_db(db_name, complete)
-
-           if (fasta_length.to_i >= 200)
-             complete_longer_200 += 1
-           else
-             complete_shorter_200 += 1
-           end
-
-           if (fasta_length.to_i >= 500)
-             complete_longer_500 += 1
-           else
-             complete_shorter_500 += 1
-           end
-
-         elsif (status == 'Putative Complete')
-           putative_complete[0] += 1
-           putative_complete = stats_my_db(db_name, putative_complete)
-         elsif (status == 'C-terminus')
-           c_terminus[0] += 1
-           c_terminus = stats_my_db(db_name, c_terminus)
-         elsif (status == 'N-terminus')
-           n_terminus[0] += 1
-           n_terminus = stats_my_db(db_name, n_terminus)
-         elsif (status == 'Putative C-terminus')
-           putative_c_terminus[0] += 1
-           putative_c_terminus = stats_my_db(db_name, putative_c_terminus)
-         elsif (status == 'Putative N-terminus')
-           putative_n_terminus[0] += 1
-           putative_n_terminus = stats_my_db(db_name, putative_n_terminus)
-         elsif (status == 'Internal')
-           internal[0] += 1
-           internal = stats_my_db(db_name, internal)
-         elsif (status == 'Coding Seq')
-           cod_seq[0] += 1
-           cod_seq = stats_my_db(db_name, cod_seq)
-         end
-         # -------------------------------------------------------------------------
-       end
-
-     end
-
-     stats_file.puts "--- Annotation Summary ---"
-     stats_file.puts "\n------------------------------ Summary of sequences found by similarity -----"
-
-     stats_file.puts "\n\tSequences found: #{seqs_number}\t\t(>200: #{seqs_longer_200}, <200: #{seqs_shorter_200})\t(>500: #{seqs_longer_500}, <500: #{seqs_shorter_500})"
-     stats_file.puts "\tDifferent IDs: #{array_of_all_accs.uniq.count}"
-
-     stats_file.puts "\n\tsequences with sense and antisense hits error: #{error_1_num}"
-     stats_file.puts "\n------------------------------------------------- Full-Length Sequences -----"
-     stats_file.puts "\tComplete Seqs: #{complete[0]} ("+ '%.3f' % (complete[0].to_f/seqs_number.to_f*100) +" %)\t\t(>200: #{complete_longer_200}, <200: #{complete_shorter_200})\t(>500: #{complete_longer_500}, <500: #{complete_shorter_500})"
-     stats_file.puts "\tDifferent IDs: #{array_of_complete_accs.uniq.count} ("+ '%.3f' % (array_of_complete_accs.uniq.count.to_f/seqs_number.to_f*100) +" %)"
-     stats_file.puts "\n\t\tuser_db: #{complete[1]}\n\t\tsp: #{complete[2]}\n\t\ttr: #{complete[3]}"
-     stats_file.puts "-----------------------------------------------------------------------------"
-
-     stats_file.puts "\n\tputative completes: #{putative_complete[0]}\n\t\tuser_db: #{putative_complete[1]}\n\t\tsp: #{putative_complete[2]}\n\t\ttr: #{putative_complete[3]}"
-     stats_file.puts "\n\tn-terminus: #{n_terminus[0]}\n\t\tuser_db: #{n_terminus[1]}\n\t\tsp: #{n_terminus[2]}\n\t\ttr: #{n_terminus[3]}"
-     stats_file.puts "\n\tputative_n_terminus: #{putative_n_terminus[0]}\n\t\tuser_db: #{putative_n_terminus[1]}\n\t\tsp: #{putative_n_terminus[2]}\n\t\ttr: #{putative_n_terminus[3]}"
-     stats_file.puts "\n\tc-terminus: #{c_terminus[0]}\n\t\tuser_db: #{c_terminus[1]}\n\t\tsp: #{c_terminus[2]}\n\t\ttr: #{c_terminus[3]}"
-     stats_file.puts "\n\tputative_c_terminus: #{putative_c_terminus[0]}\n\t\tuser_db: #{putative_c_terminus[1]}\n\t\tsp: #{putative_c_terminus[2]}\n\t\ttr: #{putative_c_terminus[3]}"
-     stats_file.puts "\n\tinternal: #{internal[0]}\n\t\tuser_db: #{internal[1]}\n\t\tsp: #{internal[2]}\n\t\ttr: #{internal[3]}"
-     stats_file.puts "\n\tcoding sequences with unknown status: #{cod_seq[0]}\n\t\tuser_db: #{cod_seq[1]}\n\t\tsp: #{cod_seq[2]}\n\t\ttr: #{cod_seq[3]}"
-
-     return seqs_number
-   end
-
-
-   def testcode_stats(stats_file)
-
-     seqs_number = 0
-     coding = 0
-     putative_coding = 0
-     unknown = 0
-
-     coding_longer_200 = 0
-     coding_shorter_200 = 0
-     unknown_longer_200 = 0
-     unknown_shorter_200 = 0
-
-     coding_longer_500 = 0
-     coding_shorter_500 = 0
-     unknown_longer_500 = 0
-     unknown_shorter_500 = 0
-
-     File.open('fl2_results/tcode_result.txt').each do |line|
-       line.chomp!
-       (name,fasta_length,acc,db_name,status) = line.split("\t")
-
-       if (line !~ /^Query_id\t/)
-         seqs_number += 1
-
-         if (status == 'coding')
-           coding += 1
-           if (fasta_length.to_i >= 200)
-             coding_longer_200 += 1
-             coding_longer_500 += 1
-           else
-             coding_shorter_200 += 1
-             coding_shorter_500 += 1
-           end
-         elsif (status == 'putative_coding')
-           putative_coding += 1
-         elsif (status == 'unknown')
-           unknown += 1
-           if (fasta_length.to_i >= 200)
-             unknown_longer_200 += 1
-             unknown_longer_500 += 1
-           else
-             unknown_shorter_200 += 1
-             unknown_shorter_500 += 1
-           end
-
-         end
-
-       end
-
-     end
-
-
-     stats_file.puts "\n--------------------------- Test Code Summary\n\n\ttotal seqs: #{seqs_number}"
-     stats_file.puts "\n\tcoding sequences: #{coding}"
-     stats_file.puts "\t\tlonger than 200 bp: #{coding_longer_200}"
-     stats_file.puts "\t\tshorter than 200 bp: #{coding_shorter_200}"
-     stats_file.puts "\t\tlonger than 500 bp: #{coding_longer_500}"
-     stats_file.puts "\t\tshorter than 500 bp: #{coding_shorter_500}"
-     stats_file.puts "\n\tputative coding sequences: #{putative_coding}\n"
-     stats_file.puts "\n\tunknown: #{unknown} ("+ '%.3f' % (unknown.to_f/seqs_number.to_f*100) +" %)"
-     stats_file.puts "\t\tlonger than 200 bp: #{unknown_longer_200}"
-     stats_file.puts "\t\tshorter than 200 bp: #{unknown_shorter_200}"
-     stats_file.puts "\t\tlonger than 500 bp: #{unknown_longer_500}"
-     stats_file.puts "\t\tshorter than 500 bp: #{unknown_shorter_500}"
-     stats_file.puts "\n\tUnknown sequences have a bad test code score or haven't got an ORF longer than 200 nt"
-     stats_file.puts "---------------------------------------------"
-
-     return seqs_number
-   end
-
- end
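The removed `Fl2Stats` module formats percentages as `count.to_f/seqs_number.to_f*100`. When no sequences were read this is 0.0/0.0, which Ruby evaluates to NaN, so the summary would print `NaN %`. The sketch below shows one way to guard that division. The helper name `percent_of` is hypothetical, and since the replacement `fln_stats.rb` is not part of this diff, the fix shipped in 0.0.4 may well differ.

```ruby
# Hypothetical helper (not from the gem): percentage of count over total,
# formatted to three decimals. Returns "0.000" when total is zero, where
# the unguarded float division would yield NaN.
def percent_of(count, total)
  return '0.000' if total.to_i.zero?
  '%.3f' % (count.to_f / total.to_f * 100)
end

# Same layout as the "Complete Seqs" line in the removed module.
puts "\tComplete Seqs: 0 (#{percent_of(0, 0)} %)"
puts "\tComplete Seqs: 3 (#{percent_of(3, 12)} %)"
```

Keeping the guard in a single helper means the `complete`, `Different IDs` and `unknown` percentages could all share it instead of repeating the check at each `puts`.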