full_lengther_next 0.0.2 → 0.0.5

Files changed (14)

data/History.txt +12 -0
data/Manifest.txt +2 -1
data/README.rdoc +44 -6
data/bin/download_fln_dbs.rb +50 -22
data/bin/full_lengther_next +13 -5
data/lib/full_lengther_next.rb +1 -1
data/lib/full_lengther_next/classes/fl_analysis.rb +2 -2
data/lib/full_lengther_next/classes/fln_stats.rb +387 -0
data/lib/full_lengther_next/classes/my_worker.rb +26 -13
data/lib/full_lengther_next/classes/my_worker_manager.rb +27 -18
data/lib/full_lengther_next/classes/nc_rna.rb +21 -0
data/lib/full_lengther_next/classes/test_code.rb +8 -4
metadata +4 -3
data/lib/full_lengther_next/classes/fl2_stats.rb +0 -222

data/History.txt CHANGED

@@ -1,3 +1,15 @@
+=== 0.0.5 2012-03-09
+Fix NCRNA annotation
+=== 0.0.4 2012-03-07
+Fixed stats for 0 seqs
+=== 0.0.3 2012-03-01
+Added ncrna
 === 0.0.2 2012-02-07
 Added FULL_LENGTH_NEXT_INIT environment variable for clustered installations

data/Manifest.txt CHANGED

@@ -3,12 +3,13 @@ bin/make_user_db.rb
 bin/full_lengther_next
 History.txt
 lib/full_lengther_next/classes/common_functions.rb
-lib/full_lengther_next/classes/fl2_stats.rb
 lib/full_lengther_next/classes/fl_analysis.rb
 lib/full_lengther_next/classes/fl_string_utils.rb
+lib/full_lengther_next/classes/fln_stats.rb
 lib/full_lengther_next/classes/lcs.rb
 lib/full_lengther_next/classes/my_worker.rb
 lib/full_lengther_next/classes/my_worker_manager.rb
+lib/full_lengther_next/classes/nc_rna.rb
 lib/full_lengther_next/classes/orf.rb
 lib/full_lengther_next/classes/sequence.rb
 lib/full_lengther_next/classes/test_code.rb

data/README.rdoc CHANGED

@@ -16,9 +16,9 @@ FULL-LENGTHERNEXT is a tool adapted to NGS technologies, able to work in paralle
 * It returns the translated protein sequence for the complete genes and the nucleotide sequence with frame shift fixed and highlighting the start and end codon for an easier finding of the gene and the UTR regions.
-* FULL-LENGTHERNEXT suggests putative new genes analysing what of the genes classified as unknown are probably coding.
+* FULL-LENGTHERNEXT suggests putative new genes by analysing which of the genes classified as unknown are probably coding and which are putative non-coding RNA sequences.
-* It produces a stats file useful for assemblies comparison.
+* It produces an HTML file with statistics useful for comparing assemblies.
 == SYNOPSIS:
@@ -26,6 +26,40 @@ FULL-LENGTHERNEXT must be fed with a multifasta file containing all unigenes to
 full_lengther_next -f input.fasta -g [fungi|human|invertebrates|mammals|plants|rodents|vertebrates] -d user_db [options]
+=== Output
+Full-LengthNext result files are written at the end of program execution to a folder called fln_results, which contains the following files:
+* alignments.txt: displays the BLASTx alignment between the query sequence translated into amino acids and the protein sequence from the Full-LengthNext database.
+* annotations.txt: the main information for each query sequence: status, subject accession number, subject description, warning messages, protein obtained and indices provided by the BLASTx alignment.
+* nc_rna.txt: putative non-coding RNA sequences detected using BLAST.
+* nt_seq.txt: contains the nucleotide sequence, marking where possible the start codon with hyphen, underscore, hyphen (-_-) and the stop codon with three underscores (___). Useful for locating UTRs and the gene sequence.
+* proteins.fasta: FASTA file with the complete proteins.
+* summary_stats.html: summary statistics of the results obtained by Full-LengthNext for the set of query unigenes. It is useful for comparing assemblies.
+* tcode_result.txt: equivalent to the annotations.txt file, but used for sequences with no similarity in the databases. Possible statuses are: coding, non-coding or unknown.
+=== CLUSTERED INSTALLATION
+To install FULL-LENGTHERNEXT on a cluster, the software must be available on all machines, either by installing it in a shared location or by installing it on each cluster node. Once installed, you need to create an init_file where your environment is correctly set up (paths, BLASTDB, etc.):
+export PATH=/apps/blast+/bin:/apps/cd-hit/bin
+export BLASTDB=/var/DB/formatted
+export FULL_LENGTHER_NEXT_INIT=path_to_init_file
+And initialize the FULL_LENGTHER_NEXT_INIT environment variable on your main node (from where FULL-LENGTHERNEXT will be initially launched):
+export FULL_LENGTHER_NEXT_INIT=path_to_init_file
+If you use any queue system like PBS Pro or Moab/Slurm, be sure to initialize the variables on each submission script.
+NOTE: all nodes on the cluster should use ssh keys to allow FULL-LENGTHERNEXT to launch workers without asking for a password.
+SAMPLE INIT FILES FOR CLUSTERED INSTALLATION:
+Init file
+$> cat fln_init_env
+source ~ruby19/init_env
+source ~blast_plus/init_env
+export BLASTDB=~full_lenghter_next/DB/formatted/
+export FULL_LENGTHER_NEXT_INIT=~full_lenghter_next/fln_init_env
 === PBS Submission script
@@ -42,10 +76,10 @@ cd $PBS_O_WORKDIR
 cat ${PBS_NODEFILE} > workers
-# init seqtrimnext
-source ~seqtrimnext/init_env
+# init full-lengthernext
+source ~full_lenghter_next/init_env
-time seqtrimnext -t paired_ends.txt -Q fastq -w workers -s 10.0.0
+time full_lengther_next -f input.fasta -g group -d user_db -w workers -s 10.0.0
 Once this submission script is created, you only need to launch it with:
 qsub sample_work.sh
@@ -101,7 +135,11 @@ gem install full_lengther_next
 === Install and rebuild Full-LengthNEXT databases
-Full-LengthNEXT needs some databases to work. You can use the BLASTDB environment variable to to change the default database location. To install them, execute:
+Full-LengthNEXT needs some databases to work. You can use the BLASTDB environment variable to change the default database location. To set the path for storing databases, execute the following line in your terminal or add it to your .bash_profile:
+export BLASTDB=/my_path/
+To install the databases, execute:
 $ download_fln_dbs.rb

data/bin/download_fln_dbs.rb CHANGED

@@ -5,15 +5,41 @@
 # Once in UniProtKB/Swiss-Prot, a protein entry is removed from UniProtKB/TrEMBL.
 require 'net/ftp'
+require 'open-uri'
 ################################################### Functions
+def download_ncrna(formatted_db_path)
+	if !File.exists?(File.join(formatted_db_path, "nc_rna_db"))
+		Dir.mkdir(File.join(formatted_db_path, "nc_rna_db"))
+	end
+	puts "Downloading ncRNA database"
+	open(File.join(formatted_db_path, "nc_rna_db/ncrna_fln_100.fasta.zip"), "wb") do |my_file|
+	  my_file.print open('http://www.scbi.uma.es/downloads/FLNDB/ncrna_fln_100.fasta.zip').read
+	end
+	puts "\nncRNA database downloaded"
+	ncrna_zip=File.join(formatted_db_path,'nc_rna_db','ncrna_fln_100.fasta.zip')
+	ncrna_out_dir=File.join(formatted_db_path,'nc_rna_db')
+	system("unzip", ncrna_zip, "-d", ncrna_out_dir)
+	system("rm", ncrna_zip)
+	puts "\nncRNA database decompressed"
+	ncrna_fasta=File.join(formatted_db_path,'nc_rna_db','ncrna_fln_100.fasta')
+	system("makeblastdb", "-in", ncrna_fasta, "-dbtype", "nucl", "-parse_seqids")
+	puts "\nncRNA database completed"
+end
 def conecta_uniprot(my_array, formatted_db_path)
 	$ftp = Net::FTP.new()
-	if !File.exists?('blast_dbs')
-		Dir.mkdir('blast_dbs')
+	if !File.exists?(formatted_db_path)
+		Dir.mkdir(formatted_db_path)
 	end
 	$ftp.connect('ftp.uniprot.org')
@@ -27,8 +53,9 @@ def conecta_uniprot(my_array, formatted_db_path)
 		download_uniprot(db_group, formatted_db_path)
 	end
+	varsplic_out=File.join(formatted_db_path,'uniprot_sprot_varsplic.fasta.gz')
 	$ftp.chdir("/pub/databases/uniprot/current_release/knowledgebase/complete")
-	$ftp.getbinaryfile("uniprot_sprot_varsplic.fasta.gz", "#{formatted_db_path}/uniprot_sprot_varsplic.fasta.gz")
+	$ftp.getbinaryfile("uniprot_sprot_varsplic.fasta.gz", varsplic_out)
 	puts "isoform files downloaded"
@@ -38,9 +65,11 @@ end
 def download_uniprot(uniprot_group, formatted_db_path)
+	sp_out=File.join(formatted_db_path,"uniprot_sprot_#{uniprot_group}.dat.gz")
+	tr_out=File.join(formatted_db_path,"uniprot_trembl_#{uniprot_group}.dat.gz")
 	$ftp.chdir("/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions")
-	$ftp.getbinaryfile("uniprot_sprot_#{uniprot_group}.dat.gz", "#{formatted_db_path}/uniprot_sprot_#{uniprot_group}.dat.gz")
-	$ftp.getbinaryfile("uniprot_trembl_#{uniprot_group}.dat.gz", "#{formatted_db_path}/uniprot_trembl_#{uniprot_group}.dat.gz")
+	$ftp.getbinaryfile("uniprot_sprot_#{uniprot_group}.dat.gz", sp_out)
+	$ftp.getbinaryfile("uniprot_trembl_#{uniprot_group}.dat.gz", tr_out)
 	puts "#{uniprot_group} files downloaded"
@@ -74,11 +103,11 @@ def filter_incomplete_seqs(file_name, isoform_hash, formatted_db_path)
 	db_name.sub!('sprot','sp')
 	db_name.sub!('trembl','tr')
-	if !File.exists?("#{formatted_db_path}/#{db_name}_#{output_name}")
-		Dir.mkdir("#{formatted_db_path}/#{db_name}_#{output_name}")
+	if !File.exists?(File.join(formatted_db_path, "#{db_name}_#{output_name}"))
+		Dir.mkdir(File.join(formatted_db_path, "#{db_name}_#{output_name}"))
 	end
-	output_file = File.new("#{formatted_db_path}/#{db_name}_#{output_name}/#{db_name}_#{output_name}.fasta", "w")
+	output_file = File.new(File.join(formatted_db_path, "#{db_name}_#{output_name}/#{db_name}_#{output_name}.fasta"), "w")
 	File.open(file_name).each_line do |line|
 		if (newseq == false)
@@ -152,15 +181,10 @@ def load_isoform_hash(file)
 			my_fasta += line
 		end
 	end
-	# if (isoform_hash[acc].nil?)
-	# 	isoform_hash[acc]= "#{my_fasta}\n"
-	# else
-	# 	isoform_hash[acc]+= "#{my_fasta}\n"
-	# end
 	return isoform_hash
 end
 ################################################### MAIN
 ROOT_PATH=File.dirname(__FILE__)
@@ -173,24 +197,28 @@ end
 ENV['BLASTDB']=formatted_db_path
 puts "Databases will be downloaded at: #{ENV['BLASTDB']}"
+puts "\nTo set the path for storing databases, execute next line in your terminal or add it to your .bash_profile:\n\n\texport BLASTDB=/my_path/\n\n"
 my_array = ["human","fungi","invertebrates","mammals","plants","rodents","vertebrates"]
-# my_array = ["plants","invertebrates"] # used for a shoter test
+# my_array = ["plants","human"] # used for a shorter test
 conecta_uniprot(my_array, formatted_db_path)
-`gunzip #{formatted_db_path}/*gz`
+system('gunzip '+File.join(formatted_db_path,'*.gz'))
 isoform_hash = {}
-isoform_hash = load_isoform_hash("#{formatted_db_path}/uniprot_sprot_varsplic.fasta")
+isoform_hash = load_isoform_hash(File.join(formatted_db_path, "uniprot_sprot_varsplic.fasta"))
+download_ncrna(formatted_db_path)
 my_array.each do |db_group|
-	filter_incomplete_seqs("#{formatted_db_path}/uniprot_sprot_#{db_group}.dat", isoform_hash, formatted_db_path)
-	filter_incomplete_seqs("#{formatted_db_path}/uniprot_trembl_#{db_group}.dat", isoform_hash, formatted_db_path)
+	filter_incomplete_seqs(File.join(formatted_db_path, "uniprot_sprot_#{db_group}.dat"), isoform_hash, formatted_db_path)
+	filter_incomplete_seqs(File.join(formatted_db_path, "uniprot_trembl_#{db_group}.dat"), isoform_hash, formatted_db_path)
-	`makeblastdb -in #{formatted_db_path}/sp_#{db_group}/sp_#{db_group}.fasta -dbtype 'prot' -parse_seqids`
-	`makeblastdb -in #{formatted_db_path}/tr_#{db_group}/tr_#{db_group}.fasta -dbtype 'prot' -parse_seqids`
+	sp_fasta=File.join(formatted_db_path,"sp_#{db_group}","sp_#{db_group}.fasta")
+	tr_fasta=File.join(formatted_db_path,"tr_#{db_group}","tr_#{db_group}.fasta")
+	system("makeblastdb -in #{sp_fasta} -dbtype 'prot' -parse_seqids")
+	system("makeblastdb -in #{tr_fasta} -dbtype 'prot' -parse_seqids")
 end
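
Review note: the script decompresses every .gz file under the database directory before building the BLAST databases. The sketch below shows one way to build that file list with File.join so the result does not depend on whether the directory path ends with a slash. The helper names gz_files and gunzip_command are hypothetical and are not part of the gem.

```ruby
require 'tmpdir'

# Hypothetical helper (not in the gem): list the compressed files in db_path.
# File.join inserts the separator itself, so "/db" and "/db/" behave the same.
def gz_files(db_path)
  Dir.glob(File.join(db_path, '*.gz')).sort
end

# Hypothetical helper: argv for gunzip, suitable for system(*argv),
# which avoids passing the paths through a shell.
def gunzip_command(db_path)
  files = gz_files(db_path)
  files.empty? ? nil : ['gunzip', *files]
end

# Small demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'uniprot_sprot_plants.dat.gz'), '')
  p gunzip_command(dir + '/')
end
```

Returning nil when nothing matches lets the caller skip the gunzip call instead of invoking it with no arguments.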

data/bin/full_lengther_next CHANGED

@@ -1,7 +1,7 @@
 #!/usr/bin/env ruby
 # 12-2-2011 Noe Fernandez Pozo.
-# Full-Lengther2 predicts if your sequences are complete, showing you the nucleotide sequences and the translated protein
+# Full-LengtherNEXT predicts if your sequences are complete, showing you the nucleotide sequences and the translated protein
 #------------------------------------------------------------------ parameters entry
 require 'optparse'
@@ -91,7 +91,7 @@ optparse = OptionParser.new do |opts|
 	# Set a banner, displayed at the top of the help screen.
-	opts.banner = "Usage: full_lengther_2 -f input.fasta -g [fungi|human|invertebrates|mammals|plants|rodents|vertebrates] [options]\n\n"
+	opts.banner = "Usage: full_lengther_next -f input.fasta -g [fungi|human|invertebrates|mammals|plants|rodents|vertebrates] [options]\n\n"
 	# This displays the help screen
 	opts.on( '-h', '--help', 'Display this screen' ) do
@@ -129,7 +129,7 @@ require 'full_lengther_next'
 if ENV['FULL_LENGTHER_NEXT_INIT'] && File.exists?(ENV['FULL_LENGTHER_NEXT_INIT'])
   FULL_LENGTHER_NEXT_INIT=File.expand_path(ENV['FULL_LENGTHER_NEXT_INIT'])
 else
-  FULL_LENGTHER_NEXT_INIT=File.join($ROOT_PATH,'init_env')
+  FULL_LENGTHER_NEXT_INIT=File.join(ROOT_PATH,'init_env')
 end
@@ -142,8 +142,16 @@ end
 ENV['BLASTDB']=formatted_db_path
 puts "Using databases at: #{ENV['BLASTDB']}"
-if !File.exists?("#{ENV['BLASTDB']}/sp_#{options[:tax_group]}/sp_#{options[:tax_group]}.fasta.psq")
-  puts "DB File #{ENV['BLASTDB']}/sp_#{options[:tax_group]}/sp_#{options[:tax_group]}.fasta.psq doesn't exists, or"
+ncrna_path = File.join(ENV['BLASTDB'],'nc_rna_db','ncrna_fln_100.fasta.nhr')
+if !File.exists?(ncrna_path)
+  puts "DB File #{ncrna_path} doesn't exist"
+	puts optparse.help
+	exit
+end
+sp_path=File.join(ENV['BLASTDB'],"sp_#{options[:tax_group]}","sp_#{options[:tax_group]}.fasta.psq")
+if !File.exists?(sp_path)
+  puts "DB File #{sp_path} doesn't exist, or"
 	puts "incorrect taxon group name: #{options[:tax_group]} choose:"
 	puts optparse.help
 	exit

data/lib/full_lengther_next.rb CHANGED

@@ -7,7 +7,7 @@ $: << File.expand_path(File.join(ROOT_PATH, 'classes'))
 module FullLengtherNext
-   VERSION = '0.0.2'
+   VERSION = '0.0.5'
   FULLLENGHTER_VERSION = VERSION
 end

data/lib/full_lengther_next/classes/fl_analysis.rb CHANGED

@@ -32,7 +32,7 @@ module FlAnalysis
 			if (db_name =~ /^tr_/)
 				if (seq.get_annotations(:tmp_annotation).empty?)
 					if (seq.sec_desc.empty?)
-						seq.annotate(:tcode,'')
+						seq.annotate(:apply_tcode,'')
 					else
 						seq.annotate(:tmp_annotation,[seq.sec_desc, '','',''],true)
 					end
@@ -250,7 +250,7 @@ module FlAnalysis
 						seq.sec_desc = "#{q.query_def}\t#{seq.seq_fasta.length}\t#{q.hits[0].acc}\t#{db_name}\tCoding Seq\t\t#{q.hits[0].e_val}\t#{q.hits[0].ident}\t\t#{q.hits[0].full_subject_length}\t#{warnings}\t\t\t\t\t\t#{q.hits[0].definition}\t"
 						seq.annotate(:tmp_annotation,[seq.sec_desc, '','',''],true)
 					else
-						seq.annotate(:tcode,'')
+						seq.annotate(:apply_tcode,'')
 					end
 				else
 					warnings = "Coding sequence with some errors, #{warnings}"

data/lib/full_lengther_next/classes/fln_stats.rb ADDED

@@ -0,0 +1,387 @@
+module FlnStats
+	def summary_stats
+		stats_file = File.open('fln_results/summary_stats.html', 'w')
+		(html_head, html_1, html_2, html_3, html_4) = html_code
+		total_seqs = 0
+		(status_array, seqs_number1, error_1_num, seq_uniq, complete_uniq, seq_length_stats, complete_seq_length_stats) = annotation_stats
+		(tcode_array, seqs_number2, tcode_length_stats, coding_length_stats, unknown_length_stats) = testcode_stats
+		ncrna_array=ncrna_stats
+		total_seqs = seqs_number1 + seqs_number2 + ncrna_array[4].to_i
+		stats_file.puts html_head
+		stats_file.puts "\t\t\t\t"+'<font color="#FF0000">'+total_seqs.to_s+"</font> sequences in your input fasta\n\t\t\t</h2>\n\t\t</center>"
+		if (total_seqs.to_i > 0)
+			stats_file.puts html_1
+			stats_file.puts '				<tr>
+						<td align="center">YES</td>
+						<td align="right">'+seqs_number1.to_s+'</td>
+						<td align="right">'+'%.2f' % (100*seqs_number1.to_f/total_seqs.to_f).to_s+' %</td>
+						<td align="right">'+seq_uniq.to_s+'</td>
+						<td align="right">'+seq_length_stats[0].to_s+'</td>
+						<td align="right">'+seq_length_stats[1].to_s+'</td>
+						<td align="right">'+seq_length_stats[2].to_s+'</td>
+						<td align="right">'+seq_length_stats[3].to_s+'</td>
+					</tr>'
+			stats_file.puts '				<tr>
+						<td align="center">NO</td>
+						<td align="right">'+seqs_number2.to_s+'</td>
+						<td align="right">'+'%.2f' % (100*seqs_number2.to_f/total_seqs.to_f).to_s+' %</td>
+						<td align="right">-</td>
+						<td align="right">'+tcode_length_stats[0].to_s+'</td>
+						<td align="right">'+tcode_length_stats[1].to_s+'</td>
+						<td align="right">'+tcode_length_stats[2].to_s+'</td>
+						<td align="right">'+tcode_length_stats[3].to_s+'</td>
+					</tr>'
+			stats_file.puts '				<tr>
+						<td align="center">ncRNA</td>
+						<td align="right">'+ncrna_array[4].to_s+'</td>
+						<td align="right">'+'%.2f' % (100*ncrna_array[4].to_f/total_seqs.to_f).to_s+' %</td>
+						<td align="right">-</td>
+						<td align="right">'+ncrna_array[0].to_s+'</td>
+						<td align="right">'+ncrna_array[1].to_s+'</td>
+						<td align="right">'+ncrna_array[2].to_s+'</td>
+						<td align="right">'+ncrna_array[3].to_s+'</td>
+					</tr>
+				</table>'
+			stats_file.puts '			<p><font color="#FF0000">'+error_1_num.to_s+'</font> Sequences with sense and antisense hits error</p>'
+			stats_file.puts '			<p><font color="#FF0000">'+complete_uniq.to_s+'</font> Complete sequences with different orthologue ID</p>'
+			stats_file.puts html_2
+			status_array.each do |status|
+				stats_file.puts '				<tr>
+						<td align="right">'+status[4].to_s+'</td>
+						<td align="right">'+status[0].to_s+'</td>
+						<td align="right">'+'%.2f' % (100*status[0].to_f/total_seqs.to_f).to_s+' %</td>
+						<td align="right">'+status[1].to_s+'</td>
+						<td align="right">'+status[2].to_s+'</td>
+						<td align="right">'+status[3].to_s+'</td>
+					</tr>'
+			end
+			stats_file.puts html_3
+			tcode_array.each do |status|
+				stats_file.puts '				<tr>
+						<td align="right">'+status[5].to_s+'</td>
+						<td align="right">'+status[4].to_s+'</td>
+						<td align="right">'+'%.2f' % (100*status[4].to_f/total_seqs.to_f).to_s+' %</td>
+						<td align="right">'+status[0].to_s+'</td>
+						<td align="right">'+status[1].to_s+'</td>
+						<td align="right">'+status[2].to_s+'</td>
+						<td align="right">'+status[3].to_s+'</td>
+					</tr>'
+			end
+			# print Non coding RNA
+			stats_file.puts '				<tr>
+					<td align="right">Putative ncRNA</td>
+					<td align="right">'+ncrna_array[4].to_s+'</td>
+					<td align="right">'+'%.2f' % (100*ncrna_array[4].to_f/total_seqs.to_f).to_s+' %</td>
+					<td align="right">'+ncrna_array[0].to_s+'</td>
+					<td align="right">'+ncrna_array[1].to_s+'</td>
+					<td align="right">'+ncrna_array[2].to_s+'</td>
+					<td align="right">'+ncrna_array[3].to_s+'</td>
+				</tr>
+			</table>
+		</center>'
+		end
+		stats_file.puts html_4
+		stats_file.close
+	end
+	def html_code
+		html_head = '<html>
+	<head>
+		<title>FLN Annotation Summary</title>
+	</head>
+	<body bgcolor="#FFFFFF">
+		<center>
+			<h1 ALIGN="center">
+				Full-LengtherNEXT
+				<br/>
+				Annotation summary
+			</h1>
+			<h2 align="center">'
+		html_1 = '
+		<center>
+			<table border=1>
+				<tr>
+					<th>Orthologue found</th>
+					<th>Sequences found</th>
+					<th>%</th>
+					<th>Different IDs</th>
+					<th>&gt;200 bp</th>
+					<th>&lt;200 bp</th>
+					<th>&gt;500 bp</th>
+					<th>&lt;500 bp</th>
+				</tr>'
+		html_2= '			<br/>
+			<table border=1>
+				<tr>
+					<th>Status</th>
+					<th>Total</th>
+					<th>%</th>
+					<th>UserDB</th>
+					<th>SwissProt</th>
+					<th>TrEMBL</th>
+				</tr>'
+		html_3= '			</table>
+			<br/>
+			<table border=1>
+				<tr>
+					<th>Status</th>
+					<th>Total</th>
+					<th>%</th>
+					<th>&gt;200 bp</th>
+					<th>&lt;200 bp</th>
+					<th>&gt;500 bp</th>
+					<th>&lt;500 bp</th>
+				</tr>'
+		html_4 = '	</body>
+</html>'
+		return [html_head, html_1, html_2, html_3, html_4]
+	end
+	def stats_my_db(db_name, array)
+		if (db_name !~ /^sp_/) && (db_name !~ /^tr_/)
+			array[1] += 1
+		elsif (db_name =~ /^sp_/)
+			array[2] += 1
+		elsif (db_name =~ /^tr_/)
+			array[3] += 1
+		end
+		return array
+	end
+	def annotation_stats
+		seqs_number = 0
+		array_of_all_accs = []
+		array_of_complete_accs = []
+		error_1_num = 0
+		# >200, <200, >500, <500
+		seq_length_stats = [0,0,0,0]
+		# >200, <200, >500, <500
+		complete_seq_length_stats = [0,0,0,0]
+		status_array = []
+		# total, userdb, swissprotdb, trembl, status
+		complete = [0,0,0,0,'Complete']
+		putative_complete = [0,0,0,0,'Putative Complete']
+		c_terminus = [0,0,0,0,'C-terminus']
+		putative_c_terminus = [0,0,0,0,'Putative C-terminus']
+		n_terminus = [0,0,0,0,'N-terminus']
+		putative_n_terminus = [0,0,0,0,'Putative N-terminus']
+		internal = [0,0,0,0,'Internal']
+		cod_seq = [0,0,0,0,'Misassembled']
+		File.open('fln_results/annotations.txt').each do |line|
+			line.chomp!
+			(name,fasta_length,acc,db_name,status,kk1,kk2,kk3,kk4,kk5,msgs) = line.split("\t")
+			if (line !~ /^Query_id\t/) && (!line.empty?)
+				seqs_number += 1
+				array_of_all_accs.push acc
+				# -------------------------------------------------------------------------
+				if (fasta_length.to_i >= 200)
+					seq_length_stats[0] += 1
+					# seqs_longer_200 += 1
+				else
+					seq_length_stats[1] += 1
+					# seqs_shorter_200 += 1
+				end
+				if (fasta_length.to_i >= 500)
+					seq_length_stats[2] += 1
+					# seqs_longer_500 += 1
+				else
+					seq_length_stats[3] += 1
+					# seqs_shorter_500 += 1
+				end
+				# -------------------------------------------------------------------------
+				if (msgs =~ /ERROR#1/)
+					error_1_num += 1
+				end
+				# -------------------------------------------------------------------------
+				if (status == 'Complete')
+					complete[0] += 1
+					array_of_complete_accs.push acc
+					complete = stats_my_db(db_name, complete)
+					if (fasta_length.to_i >= 200)
+						complete_seq_length_stats[0] += 1
+						# complete_longer_200 += 1
+					else
+						complete_seq_length_stats[1] += 1
+						# complete_shorter_200 += 1
+					end
+					if (fasta_length.to_i >= 500)
+						complete_seq_length_stats[2] += 1
+						# complete_longer_500 += 1
+					else
+						complete_seq_length_stats[3] += 1
+						# complete_shorter_500 += 1
+					end
+				elsif (status == 'Putative Complete')
+					putative_complete[0] += 1
+					putative_complete = stats_my_db(db_name, putative_complete)
+				elsif (status == 'C-terminus')
+					c_terminus[0] += 1
+					c_terminus = stats_my_db(db_name, c_terminus)
+				elsif (status == 'N-terminus')
+					n_terminus[0] += 1
+					n_terminus = stats_my_db(db_name, n_terminus)
+				elsif (status == 'Putative C-terminus')
+					putative_c_terminus[0] += 1
+					putative_c_terminus = stats_my_db(db_name, putative_c_terminus)
+				elsif (status == 'Putative N-terminus')
+					putative_n_terminus[0] += 1
+					putative_n_terminus = stats_my_db(db_name, putative_n_terminus)
+				elsif (status == 'Internal')
+					internal[0] += 1
+					internal = stats_my_db(db_name, internal)
+				elsif (status == 'Coding Seq')
+					cod_seq[0] += 1
+					cod_seq = stats_my_db(db_name, cod_seq)
+				end
+				# -------------------------------------------------------------------------
+			end
+		end
+		status_array = [complete, putative_complete, c_terminus, putative_c_terminus, n_terminus, putative_n_terminus, internal, cod_seq]
+		return [status_array, seqs_number, error_1_num, array_of_all_accs.uniq.count, array_of_complete_accs.uniq.count, seq_length_stats, complete_seq_length_stats]
+	end
+	def testcode_stats
+		seqs_number = 0
+		# >200, <200, >500, <500
+		all_tcode_stats = [0,0,0,0]
+		# >200, <200, >500, <500, total, status
+		coding_length_stats = [0,0,0,0,0,'Coding']
+		p_coding_length_stats = [0,0,0,0,0,'Putative Coding']
+		unknown_length_stats = [0,0,0,0,0,'Unknown']
+		File.open('fln_results/tcode_result.txt').each do |line|
+			line.chomp!
+			(name,fasta_length,acc,db_name,status) = line.split("\t")
+			if (line !~ /^Query_id\t/) && (!line.empty?)
+				seqs_number += 1
+				if (fasta_length.to_i >= 200)
+					all_tcode_stats[0] += 1
+					if (status == 'coding')
+						coding_length_stats[4] += 1
+						coding_length_stats[0] += 1
+					elsif (status == 'putative_coding')
+						p_coding_length_stats[4] += 1
+						p_coding_length_stats[0] += 1
+					elsif (status == 'unknown')
+						unknown_length_stats[4] += 1
+						unknown_length_stats[0] += 1
+					end
+				else
+					all_tcode_stats[1] += 1
+					if (status == 'coding')
+						coding_length_stats[4] += 1
+						coding_length_stats[1] += 1
+					elsif (status == 'putative_coding')
+						p_coding_length_stats[4] += 1
+						p_coding_length_stats[1] += 1
+					elsif (status == 'unknown')
+						unknown_length_stats[4] += 1
+						unknown_length_stats[1] += 1
+					end
+				end
+				if (fasta_length.to_i >= 500)
+					all_tcode_stats[2] += 1
+					if (status == 'coding')
+						coding_length_stats[2] += 1
+					elsif (status == 'putative_coding')
+						p_coding_length_stats[2] += 1
+					elsif (status == 'unknown')
+						unknown_length_stats[2] += 1
+					end
+				else
+					all_tcode_stats[3] += 1
+					if (status == 'coding')
+						coding_length_stats[3] += 1
+					elsif (status == 'putative_coding')
+						p_coding_length_stats[3] += 1
+					elsif (status == 'unknown')
+						unknown_length_stats[3] += 1
+					end
+				end
+			end
+		end
+		status_array = [coding_length_stats, p_coding_length_stats, unknown_length_stats]
+		return [status_array, seqs_number, all_tcode_stats, coding_length_stats, unknown_length_stats]
+	end
+	def ncrna_stats
+		# >200, <200, >500, <500, total
+		ncrna_array = [0,0,0,0,0]
+		File.open('fln_results/nc_rna.txt').each do |line|
+			line.chomp!
+			(name,fasta_length,acc,db_name,status) = line.split("\t")
+			if (status == 'Putative ncRNA')
+				ncrna_array[4] += 1
+				if (fasta_length.to_i >= 200)
+					ncrna_array[0] += 1
+				else
+					ncrna_array[1] += 1
+				end
+				if (fasta_length.to_i >= 500)
+					ncrna_array[2] += 1
+				else
+					ncrna_array[3] += 1
+				end
+			end
+		end
+		return ncrna_array
+	end
+end
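
Review note: annotation_stats, testcode_stats and ncrna_stats all repeat the same two length tests (200 and 500) and the same percentage formatting. The sketch below shows how that counting could be factored out. The helper names length_buckets and percent are hypothetical and do not exist in the gem. Note that the module counts sequences of exactly 200 or 500 bp in the "greater than" columns, and the sketch keeps that behaviour.

```ruby
# Hypothetical helper (not in the gem): count sequences per length class.
# Returns [>=200, <200, >=500, <500], matching the array layout used
# by the stats methods in this module.
def length_buckets(lengths)
  stats = [0, 0, 0, 0]
  lengths.each do |len|
    stats[len >= 200 ? 0 : 1] += 1
    stats[len >= 500 ? 2 : 3] += 1
  end
  stats
end

# Hypothetical helper: percentage with two decimals, guarding against
# an empty input so no division by zero can occur.
def percent(part, total)
  return '0.00' if total.zero?
  format('%.2f', 100.0 * part / total)
end

lengths = [150, 200, 499, 500, 1200]
p length_buckets(lengths)
puts "#{percent(2, lengths.size)} %"
```

Every sequence lands in exactly one bucket of each pair, so the first two counts and the last two counts each sum to the number of sequences.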

data/lib/full_lengther_next/classes/my_worker.rb CHANGED

@@ -11,6 +11,8 @@ require "test_code"
 require 'fl_analysis'
 include FlAnalysis
+require 'nc_rna'
+include NcRna
 class MyWorker < ScbiMapreduce::Worker
@@ -41,15 +43,12 @@ class MyWorker < ScbiMapreduce::Worker
 	end
-	# runs blastx using the parameters: input file, database and output file
-	def run_blastx(input, database, user_db_name)
-		# puts "\n#{user_db_name} ..... executing BLASTx"
+	# runs BLAST with the given input sequences, database, BLAST type and e-value
+	def run_blast(input, database, blast_type, evalue)
-		blast=BatchBlast.new("-db #{database}",'blastx',"-evalue 1e-6 -num_alignments 1 -num_descriptions 1")
+		blast=BatchBlast.new("-db #{database}",blast_type,"-evalue #{evalue} -max_target_seqs 1")
 		blast_result = blast.do_blast_seqs(input, :xml)
-		# puts "#{user_db_name} ..... BLASTx finished"
 		return blast_result
 	end
@@ -72,7 +71,7 @@ class MyWorker < ScbiMapreduce::Worker
 			end
 			# do blast
-			my_blast = run_blastx(seqs, "#{@options[:user_db]}", user_db_name)
+			my_blast = run_blast(seqs, "#{@options[:user_db]}", 'blastx', '1e-6')
 			# split and parse blast
 			seqs.each_with_index do |seq,i|
@@ -87,7 +86,8 @@ class MyWorker < ScbiMapreduce::Worker
 		# -------------------------------------------- UniProt (sp)
 		# blast
-		my_blast = run_blastx(new_seqs, "sp_#{@options[:tax_group]}/sp_#{@options[:tax_group]}.fasta", "sp_#{@options[:tax_group]}")
+		sp_path=File.join("sp_#{@options[:tax_group]}","sp_#{@options[:tax_group]}.fasta")
+		my_blast = run_blast(new_seqs, sp_path, 'blastx', '1e-6')
 		# split and parse blast
 		new_seqs.each_with_index do |seq,i|
@@ -98,7 +98,8 @@ class MyWorker < ScbiMapreduce::Worker
 		# -------------------------------------------- UniProt (tr)
 		# blast
-		my_blast = run_blastx(new_seqs, "tr_#{@options[:tax_group]}/tr_#{@options[:tax_group]}.fasta", "tr_#{@options[:tax_group]}")
+		tr_path=File.join("tr_#{@options[:tax_group]}","tr_#{@options[:tax_group]}.fasta")
+		my_blast = run_blast(new_seqs, tr_path, 'blastx', '1e-6')
 		# split and parse blast
 		new_seqs.each_with_index do |seq,i|
@@ -107,15 +108,27 @@ class MyWorker < ScbiMapreduce::Worker
 		# -------------------------------------------- Test Code
 		# the sequences without a reliable similarity with an orthologue are processed with Test Code
-		testcode_input=seqs.select{|s| !s.get_annotations(:tcode).empty?}
+		testcode_input=seqs.select{|s| !s.get_annotations(:apply_tcode).empty?}
-# active this line to test tcode. hay que comentar todas las lineas de arriba de este metodo
+# active this line to test tcode, and comment all lines above in this function
 # testcode_input=seqs
 		testcode_input.each do |seq|
 			TestCode.new(seq)
 		end
+		# -------------------------------------------- nc RNA
+		unknown_seqs=seqs.select{|s| !s.get_annotations(:tcode_unknown).empty?}
+		# run blastn
+		ncrna_path=File.join('nc_rna_db','ncrna_fln_100.fasta')
+		my_blast = run_blast(unknown_seqs, ncrna_path, 'blastn', '1e-3')
+		# split and parse blast
+		unknown_seqs.each_with_index do |seq,i|
+			find_nc_rna(seq, my_blast.querys[i])
+		end
+		# ---------------------------------------------------
 	end
 end

data/lib/full_lengther_next/classes/my_worker_manager.rb CHANGED

@@ -2,8 +2,8 @@ require 'json'
 require 'scbi_fasta'
 require 'sequence'
-require 'fl2_stats'
-include Fl2Stats
+require 'fln_stats'
+include FlnStats
 class MyWorkerManager < ScbiMapreduce::WorkManager
@@ -12,37 +12,43 @@ class MyWorkerManager < ScbiMapreduce::WorkManager
 		input_file=options[:fasta]
-		if !File.exists?('fl2_results')
-			Dir.mkdir('fl2_results')
+		if !File.exists?('fln_results')
+			Dir.mkdir('fln_results')
 		end
+		file_head = "Query_id\tfasta_length\tSubject_id\tdb_name\tStatus\tt_code\te_value\tp_ident\tprotein_length\ts_length\tWarning_msgs\tframe\tORF_start\tORF_end\ts_start\ts_end\tDescription\tProtein_sequence"
 		@@fasta_file = FastaQualFile.new(input_file,'')
 		@@chunk_size=chunk_size
 		@@options = options
-		@@annotation_file = File.open("fl2_results/annotations.txt", 'w')
-		@@annotation_file.puts "Query_id\tfasta_length\tSubject_id\tdb_name\tStatus\tt_code\te_value\tp_ident\tprotein_length\ts_length\tWarning_msgs\tframe\tORF_start\tORF_end\ts_start\ts_end\tDescription\tProtein_sequence"
+		@@annotation_file = File.open("fln_results/annotations.txt", 'w')
+		@@annotation_file.puts file_head
+		@@alignment_file = File.open("fln_results/alignments.txt", 'w')
+		@@prot_file = File.open("fln_results/proteins.fasta", 'w')
+		@@nts_file = File.open("fln_results/nt_seq.txt", 'w')
+		@@tcode_file=File.open("fln_results/tcode_result.txt", 'w')
+		@@tcode_file.puts file_head
-		@@alignment_file = File.open("fl2_results/alignments.txt", 'w')
-		@@prot_file = File.open("fl2_results/proteins.fasta", 'w')
-		@@nts_file = File.open("fl2_results/nt_seq.txt", 'w')
-		@@tcode_file=File.open("fl2_results/tcode_result.txt", 'w')
-		@@tcode_file.puts "Query_id\tfasta_length\tSubject_id\tdb_name\tStatus\tt_code\te_value\tp_ident\tprotein_length\ts_length\tWarning_msgs\tframe\tORF_start\tORF_end\ts_start\ts_end\tDescription\tProtein_sequence"
+		@@nc_rna_file = File.open("fln_results/nc_rna.txt", 'w')
+		@@nc_rna_file.puts file_head
-		# @@error_fasta_file = File.open("fl2_results/error_seqs.fasta", 'w')
-		# @@error_file = File.open("fl2_results/errors_info.txt", 'w')
+		# @@error_fasta_file = File.open("fln_results/error_seqs.fasta", 'w')
+		# @@error_file = File.open("fln_results/errors_info.txt", 'w')
 	end
 	# close files
 	def self.end_work_manager
-		@@fasta_file.close
+		# @@fasta_file.close
 		@@annotation_file.close
 		@@alignment_file.close
 		@@prot_file.close
 		@@nts_file.close
 		@@tcode_file.close
+		@@nc_rna_file.close
 		# @@error_fasta_file.close
 		# @@error_file.close
@@ -143,11 +149,14 @@ class MyWorkerManager < ScbiMapreduce::WorkManager
 				if (n=seq.get_annotations(:nucleotide).first)
 					@@nts_file.puts n[:message]
 				end
+# --------------------------------------------------------     nc RNA
+			elsif (nc=seq.get_annotations(:ncrna).first)
+				@@nc_rna_file.puts nc[:message]
 # --------------------------------------------------------     Test Code
-			elsif (t=seq.get_annotations(:tcode).first)
-				@@tcode_file.puts t[:message]
+			elsif (t=seq.get_annotations(:tcode).first)
+  				@@tcode_file.puts t[:message]
 			end
-# --------------------------------------------------------     Errors
+# --------------------------------------------------------     errors
 			# if e=seq.get_annotations(:error).first
 			# 	if !e[:message].empty?
 			# 		@@error_fasta_file.puts ">#{seq.seq_name}\n#{seq.seq_fasta}"
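
Reviewer sketch (not part of the gem): the manager change above builds the tab-separated header once as `file_head` and writes it to the annotation, tcode and ncRNA files only. A minimal way to express that pattern, assuming a hypothetical helper `open_result_files` and an abbreviated header constant, might look like this:

```ruby
# Abbreviated, hypothetical header; the real one in the diff has 18 columns.
FILE_HEAD = "Query_id\tfasta_length\tStatus"

# Opens every file named in +names_with_header+ inside +dir+ for writing.
# The value for each name says whether the shared header line is written first.
# Returns a hash of name => open File; the caller is responsible for closing.
def open_result_files(dir, names_with_header)
  names_with_header.each_with_object({}) do |(name, header), files|
    f = File.open(File.join(dir, name), 'w')
    f.puts FILE_HEAD if header
    files[name] = f
  end
end
```

Keeping the handles in one hash means the closing step can be a single `files.each_value(&:close)`, so a newly added output such as `nc_rna.txt` cannot be opened without also being closed.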

data/lib/full_lengther_next/classes/nc_rna.rb ADDED

@@ -0,0 +1,21 @@
+module NcRna
+	def find_nc_rna(seq, blast_query)
+		# used to detect if the sequence and the blast are from different query
+		if seq.seq_name != blast_query.query_def
+			raise "BLAST query name and sequence are different"
+		end
+		q=blast_query
+		if (!q.hits[0].nil?) # There is match in blast.
+			nc_annotations = "#{q.query_def}\t#{seq.seq_fasta.length}\t#{q.hits[0].acc}\tncRNA\tPutative ncRNA\t\t#{q.hits[0].e_val}\t#{q.hits[0].ident}\t\t\t\t#{q.hits[0].q_frame}\t#{q.hits[0].q_beg}\t#{q.hits[0].q_end}\t#{q.hits[0].s_beg.to_i}\t#{q.hits[0].s_end.to_i}\t#{q.hits[0].definition}\t"
+			seq.annotate(:ncrna,nc_annotations,true)
+		else
+			unknown_annot = seq.get_annotations(:tcode_unknown).first
+			seq.annotate(:tcode, unknown_annot[:message],true)
+		end
+	end
+end
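
Reviewer sketch (not part of the gem): `find_nc_rna` above records a `:ncrna` annotation when BLAST returns a hit, and otherwise promotes the pending `:tcode_unknown` annotation to `:tcode`. In the else branch `unknown_annot` would be `nil` if no `:tcode_unknown` annotation had been stored, so a guarded version of the same routing could look like the following. `SketchSeq` and `route_nc_rna` are hypothetical stand-ins that model only `annotate` and `get_annotations`.

```ruby
# Hypothetical stand-in for the gem's sequence object.
class SketchSeq
  def initialize
    @annotations = Hash.new { |h, k| h[k] = [] }
  end

  # Stores a message under an annotation type (third argument is ignored here).
  def annotate(type, message, _replace = false)
    @annotations[type] << { message: message }
  end

  # Returns the list of annotations of a type (empty if none were stored).
  def get_annotations(type)
    @annotations[type]
  end
end

# Routes a sequence: an ncRNA hit line wins; otherwise a pending
# :tcode_unknown annotation is promoted to :tcode; otherwise nothing is done.
def route_nc_rna(seq, hit_line)
  if hit_line
    seq.annotate(:ncrna, hit_line, true)
    :ncrna
  elsif (unknown = seq.get_annotations(:tcode_unknown).first)
    seq.annotate(:tcode, unknown[:message], true)
    :tcode
  else
    :none
  end
end
```

The return symbol is only there to make the chosen branch visible to a caller or a test; the gem's method returns whatever `annotate` returns.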

data/lib/full_lengther_next/classes/test_code.rb CHANGED

@@ -26,8 +26,8 @@ class TestCode
 			ref_orf = ''
 			ref_msgs = 'Sequence length < 200 nt'
-			seq.annotate(:tcode,"#{ref_name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{ref_status}\t#{ref_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+			seq.annotate(:tcode_unknown,"#{ref_name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{ref_status}\t#{ref_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+			# seq.annotate(:tcode,"#{ref_name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{ref_status}\t#{ref_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
 		else
 # to test testcode on the whole sequence, instead of on the ORFs ----------------------------------------------------------------------
@@ -44,8 +44,12 @@ class TestCode
 			# see add_region filter
 			(name,t_code,status,ref_start,ref_end,ref_frame,orf,ref_msgs,stop_before_start,more_than_one_frame) = t_code(seq)
-			seq.annotate(:tcode,"#{name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{status}\t#{t_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+			if (status == :unknown)
+				seq.annotate(:tcode_unknown,"#{name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{status}\t#{t_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+			else
+				seq.annotate(:tcode,"#{name}\t#{seq.seq_fasta.length}\t\ttestcode\t#{status}\t#{t_code}\t\t\t\t\t#{ref_msgs}\t#{ref_frame}\t#{ref_start}\t#{ref_end}\t\t\t\t",true)
+			end
 			# if (ref_msgs.nil?)
 			# 	ref_msgs = ''
 			# end

metadata CHANGED

@@ -2,7 +2,7 @@
 name: full_lengther_next
 version: !ruby/object:Gem::Version
   prerelease:
-  version: 0.0.2
+  version: 0.0.5
 platform: ruby
 authors:
 - Noe Fernandez & Dario Guerrero
@@ -10,7 +10,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-02-07 00:00:00 Z
+date: 2012-03-09 00:00:00 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: xml-simple
@@ -97,12 +97,13 @@ files:
 - bin/full_lengther_next
 - History.txt
 - lib/full_lengther_next/classes/common_functions.rb
-- lib/full_lengther_next/classes/fl2_stats.rb
 - lib/full_lengther_next/classes/fl_analysis.rb
 - lib/full_lengther_next/classes/fl_string_utils.rb
+- lib/full_lengther_next/classes/fln_stats.rb
 - lib/full_lengther_next/classes/lcs.rb
 - lib/full_lengther_next/classes/my_worker.rb
 - lib/full_lengther_next/classes/my_worker_manager.rb
+- lib/full_lengther_next/classes/nc_rna.rb
 - lib/full_lengther_next/classes/orf.rb
 - lib/full_lengther_next/classes/sequence.rb
 - lib/full_lengther_next/classes/test_code.rb

data/lib/full_lengther_next/classes/fl2_stats.rb DELETED

@@ -1,222 +0,0 @@
-module Fl2Stats
-	# --------------------------------------------------------------------------------       Main
-	def summary_stats
-		stats_file = File.open('fl2_results/summary_stats.txt', 'w')
-		total_seqs = 0
-		num1 = annotation_stats(stats_file)
-		num2 = testcode_stats(stats_file)
-		total_seqs = num1 + num2
-		stats_file.puts "\nInput sequences in your fasta: #{total_seqs}\n\n"
-	end
-	# ----------------------------------------------------------------------------------      Functions
-	def stats_my_db(db_name, array)
-		if (db_name !~ /^sp_/) && (db_name !~ /^tr_/)
-			array[1] += 1
-		elsif (db_name =~ /^sp_/)
-			array[2] += 1
-		elsif (db_name =~ /^tr_/)
-			array[3] += 1
-		end
-		return array
-	end
-	def annotation_stats(stats_file)
-		seqs_number = 0
-		array_of_all_accs = []
-		array_of_complete_accs = []
-		error_1_num = 0
-		seqs_longer_200 = 0
-		seqs_shorter_200 = 0
-		complete_longer_200 = 0
-		complete_shorter_200 = 0
-		seqs_longer_500 = 0
-		seqs_shorter_500 = 0
-		complete_longer_500 = 0
-		complete_shorter_500 = 0
-		complete = [0,0,0,0]
-		putative_complete = [0,0,0,0]
-		c_terminus = [0,0,0,0]
-		putative_c_terminus = [0,0,0,0]
-		n_terminus = [0,0,0,0]
-		putative_n_terminus = [0,0,0,0]
-		internal = [0,0,0,0]
-		cod_seq = [0,0,0,0]
-		File.open('fl2_results/annotations.txt').each do |line|
-			line.chomp!
-			(name,fasta_length,acc,db_name,status,kk1,kk2,kk3,kk4,kk5,msgs) = line.split("\t")
-			if (line !~ /^Query_id\t/)
-				seqs_number += 1
-				array_of_all_accs.push acc
-				# -------------------------------------------------------------------------
-				if (fasta_length.to_i >= 200)
-					seqs_longer_200 += 1
-				else
-					seqs_shorter_200 += 1
-				end
-				if (fasta_length.to_i >= 500)
-					seqs_longer_500 += 1
-				else
-					seqs_shorter_500 += 1
-				end
-				# -------------------------------------------------------------------------
-				if (msgs =~ /ERROR#1/)
-					error_1_num += 1
-				end
-				# -------------------------------------------------------------------------
-				if (status == 'Complete')
-					complete[0] += 1
-					array_of_complete_accs.push acc
-					complete = stats_my_db(db_name, complete)
-					if (fasta_length.to_i >= 200)
-						complete_longer_200 += 1
-					else
-						complete_shorter_200 += 1
-					end
-					if (fasta_length.to_i >= 500)
-						complete_longer_500 += 1
-					else
-						complete_shorter_500 += 1
-					end
-				elsif (status == 'Putative Complete')
-					putative_complete[0] += 1
-					putative_complete = stats_my_db(db_name, putative_complete)
-				elsif (status == 'C-terminus')
-					c_terminus[0] += 1
-					c_terminus = stats_my_db(db_name, c_terminus)
-				elsif (status == 'N-terminus')
-					n_terminus[0] += 1
-					n_terminus = stats_my_db(db_name, n_terminus)
-				elsif (status == 'Putative C-terminus')
-					putative_c_terminus[0] += 1
-					putative_c_terminus = stats_my_db(db_name, putative_c_terminus)
-				elsif (status == 'Putative N-terminus')
-					putative_n_terminus[0] += 1
-					putative_n_terminus = stats_my_db(db_name, putative_n_terminus)
-				elsif (status == 'Internal')
-					internal[0] += 1
-					internal = stats_my_db(db_name, internal)
-				elsif (status == 'Coding Seq')
-					cod_seq[0] += 1
-					cod_seq = stats_my_db(db_name, cod_seq)
-				end
-				# -------------------------------------------------------------------------
-			end
-		end
-		stats_file.puts "--- Annotation Summary ---"
-		stats_file.puts "\n------------------------------ Summary of sequences found by similarity -----"
-		stats_file.puts "\n\tSequences found: #{seqs_number}\t\t(>200: #{seqs_longer_200}, <200: #{seqs_shorter_200})\t(>500: #{seqs_longer_500}, <500: #{seqs_shorter_500})"
-		stats_file.puts "\tDifferent IDs:   #{array_of_all_accs.uniq.count}"
-		stats_file.puts "\n\tsequences with sense and antisense hits error: #{error_1_num}"
-		stats_file.puts "\n------------------------------------------------- Full-Length Sequences -----"
-		stats_file.puts "\tComplete Seqs: #{complete[0]} ("+ '%.3f' % (complete[0].to_f/seqs_number.to_f*100) +" %)\t\t(>200: #{complete_longer_200}, <200: #{complete_shorter_200})\t(>500: #{complete_longer_500}, <500: #{complete_shorter_500})"
-		stats_file.puts "\tDifferent IDs: #{array_of_complete_accs.uniq.count} ("+ '%.3f' % (array_of_complete_accs.uniq.count.to_f/seqs_number.to_f*100) +" %)"
-		stats_file.puts "\n\t\tuser_db: #{complete[1]}\n\t\tsp: #{complete[2]}\n\t\ttr: #{complete[3]}"
-		stats_file.puts "-----------------------------------------------------------------------------"
-		stats_file.puts "\n\tputative completes: #{putative_complete[0]}\n\t\tuser_db: #{putative_complete[1]}\n\t\tsp: #{putative_complete[2]}\n\t\ttr: #{putative_complete[3]}"
-		stats_file.puts "\n\tn-terminus: #{n_terminus[0]}\n\t\tuser_db: #{n_terminus[1]}\n\t\tsp: #{n_terminus[2]}\n\t\ttr: #{n_terminus[3]}"
-		stats_file.puts "\n\tputative_n_terminus: #{putative_n_terminus[0]}\n\t\tuser_db: #{putative_n_terminus[1]}\n\t\tsp: #{putative_n_terminus[2]}\n\t\ttr: #{putative_n_terminus[3]}"
-		stats_file.puts "\n\tc-terminus: #{c_terminus[0]}\n\t\tuser_db: #{c_terminus[1]}\n\t\tsp: #{c_terminus[2]}\n\t\ttr: #{c_terminus[3]}"
-		stats_file.puts "\n\tputative_c_terminus: #{putative_c_terminus[0]}\n\t\tuser_db: #{putative_c_terminus[1]}\n\t\tsp: #{putative_c_terminus[2]}\n\t\ttr: #{putative_c_terminus[3]}"
-		stats_file.puts "\n\tinternal: #{internal[0]}\n\t\tuser_db: #{internal[1]}\n\t\tsp: #{internal[2]}\n\t\ttr: #{internal[3]}"
-		stats_file.puts "\n\tcoding sequences with unknown status: #{cod_seq[0]}\n\t\tuser_db: #{cod_seq[1]}\n\t\tsp: #{cod_seq[2]}\n\t\ttr: #{cod_seq[3]}"
-		return seqs_number
-	end
-	def testcode_stats(stats_file)
-		seqs_number = 0
-		coding = 0
-		putative_coding = 0
-		unknown = 0
-		coding_longer_200 = 0
-		coding_shorter_200 = 0
-		unknown_longer_200 = 0
-		unknown_shorter_200 = 0
-		coding_longer_500 = 0
-		coding_shorter_500 = 0
-		unknown_longer_500 = 0
-		unknown_shorter_500 = 0
-		File.open('fl2_results/tcode_result.txt').each do |line|
-			line.chomp!
-			(name,fasta_length,acc,db_name,status) = line.split("\t")
-			if (line !~ /^Query_id\t/)
-				seqs_number += 1
-				if (status == 'coding')
-					coding += 1
-					if (fasta_length.to_i >= 200)
-						coding_longer_200 += 1
-						coding_longer_500 += 1
-					else
-						coding_shorter_200 += 1
-						coding_shorter_500 += 1
-					end
-				elsif (status == 'putative_coding')
-					putative_coding += 1
-				elsif (status == 'unknown')
-					unknown += 1
-					if (fasta_length.to_i >= 200)
-						unknown_longer_200 += 1
-						unknown_longer_500 += 1
-					else
-						unknown_shorter_200 += 1
-						unknown_shorter_500 += 1
-					end
-				end
-			end
-		end
-		stats_file.puts "\n--------------------------- Test Code Summary\n\n\ttotal seqs: #{seqs_number}"
-		stats_file.puts "\n\tcoding sequences: #{coding}"
-		stats_file.puts "\t\tlonger than 200 bp: #{coding_longer_200}"
-		stats_file.puts "\t\tshorter than 200 bp: #{coding_shorter_200}"
-		stats_file.puts "\t\tlonger than 500 bp: #{coding_longer_500}"
-		stats_file.puts "\t\tshorter than 500 bp: #{coding_shorter_500}"
-		stats_file.puts "\n\tputative coding sequences: #{putative_coding}\n"
-		stats_file.puts "\n\tunknown: #{unknown} ("+ '%.3f' % (unknown.to_f/seqs_number.to_f*100) +" %)"
-		stats_file.puts "\t\tlonger than 200 bp: #{unknown_longer_200}"
-		stats_file.puts "\t\tshorter than 200 bp: #{unknown_shorter_200}"
-		stats_file.puts "\t\tlonger than 500 bp: #{unknown_longer_500}"
-		stats_file.puts "\t\tshorter than 500 bp: #{unknown_shorter_500}"
-		stats_file.puts "\n\tUnknown sequences have a bad test code score or haven't got an ORF longer than 200 nt"
-		stats_file.puts "---------------------------------------------"
-		return seqs_number
-	end
-end
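
Reviewer sketch (not part of the gem): two details of the deleted `Fl2Stats` module are worth noting. The percentages are computed as `x.to_f/seqs_number.to_f*100`, which yields NaN when no sequences were read, and `testcode_stats` increments the 500 nt counters inside the `>= 200` test, so the 200 and 500 figures always match. The replacement `fln_stats.rb` is not shown in this section, so the following is only an illustration of how both points could be handled, using hypothetical helpers `percent` and `length_bins`:

```ruby
# Formats part/total as a percentage with three decimals.
# Returns '0.000' for an empty total instead of dividing by zero.
def percent(part, total)
  return '0.000' if total.zero?
  '%.3f' % (part.to_f / total * 100)
end

# Counts, for each threshold independently, how many lengths are
# greater than or equal to it (:longer) and how many are below it (:shorter).
def length_bins(lengths, thresholds = [200, 500])
  thresholds.each_with_object({}) do |t, bins|
    longer = lengths.count { |len| len >= t }
    bins[t] = { longer: longer, shorter: lengths.size - longer }
  end
end
```

For example, `length_bins([150, 300, 800])` counts two sequences at or above 200 nt but only one at or above 500 nt, which the shared-branch counting in the deleted code could not distinguish.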