bio-phyta 0.9.0 → 0.9.1

Files changed (12)

data/LICENSE.txt +165 -20
data/Rakefile +2 -2
data/VERSION +1 -1
data/bin/phyta-assign +22 -30
data/bin/phyta-extract +33 -35
data/bin/phyta-split +56 -28
data/lib/kingdom_db.rb +7 -1
data/test/test_blackbox_assign.rb +68 -0
data/test/test_blackbox_extract.rb +58 -0
data/test/test_blackbox_split.rb +116 -0
metadata +109 -166
data/test/test_blackbox.rb +0 -41

data/LICENSE.txt CHANGED

@@ -1,20 +1,165 @@
-Copyright (c) 2011 Philipp Comans
-Permission is hereby granted, free of charge, to any person obtaining
-a copy of this software and associated documentation files (the
-"Software"), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
-LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
-OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
-WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+                   GNU LESSER GENERAL PUBLIC LICENSE
+                       Version 3, 29 June 2007
+ Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+  This version of the GNU Lesser General Public License incorporates
+the terms and conditions of version 3 of the GNU General Public
+License, supplemented by the additional permissions listed below.
+  0. Additional Definitions.
+  As used herein, "this License" refers to version 3 of the GNU Lesser
+General Public License, and the "GNU GPL" refers to version 3 of the GNU
+General Public License.
+  "The Library" refers to a covered work governed by this License,
+other than an Application or a Combined Work as defined below.
+  An "Application" is any work that makes use of an interface provided
+by the Library, but which is not otherwise based on the Library.
+Defining a subclass of a class defined by the Library is deemed a mode
+of using an interface provided by the Library.
+  A "Combined Work" is a work produced by combining or linking an
+Application with the Library.  The particular version of the Library
+with which the Combined Work was made is also called the "Linked
+Version".
+  The "Minimal Corresponding Source" for a Combined Work means the
+Corresponding Source for the Combined Work, excluding any source code
+for portions of the Combined Work that, considered in isolation, are
+based on the Application, and not on the Linked Version.
+  The "Corresponding Application Code" for a Combined Work means the
+object code and/or source code for the Application, including any data
+and utility programs needed for reproducing the Combined Work from the
+Application, but excluding the System Libraries of the Combined Work.
+  1. Exception to Section 3 of the GNU GPL.
+  You may convey a covered work under sections 3 and 4 of this License
+without being bound by section 3 of the GNU GPL.
+  2. Conveying Modified Versions.
+  If you modify a copy of the Library, and, in your modifications, a
+facility refers to a function or data to be supplied by an Application
+that uses the facility (other than as an argument passed when the
+facility is invoked), then you may convey a copy of the modified
+version:
+   a) under this License, provided that you make a good faith effort to
+   ensure that, in the event an Application does not supply the
+   function or data, the facility still operates, and performs
+   whatever part of its purpose remains meaningful, or
+   b) under the GNU GPL, with none of the additional permissions of
+   this License applicable to that copy.
+  3. Object Code Incorporating Material from Library Header Files.
+  The object code form of an Application may incorporate material from
+a header file that is part of the Library.  You may convey such object
+code under terms of your choice, provided that, if the incorporated
+material is not limited to numerical parameters, data structure
+layouts and accessors, or small macros, inline functions and templates
+(ten or fewer lines in length), you do both of the following:
+   a) Give prominent notice with each copy of the object code that the
+   Library is used in it and that the Library and its use are
+   covered by this License.
+   b) Accompany the object code with a copy of the GNU GPL and this license
+   document.
+  4. Combined Works.
+  You may convey a Combined Work under terms of your choice that,
+taken together, effectively do not restrict modification of the
+portions of the Library contained in the Combined Work and reverse
+engineering for debugging such modifications, if you also do each of
+the following:
+   a) Give prominent notice with each copy of the Combined Work that
+   the Library is used in it and that the Library and its use are
+   covered by this License.
+   b) Accompany the Combined Work with a copy of the GNU GPL and this license
+   document.
+   c) For a Combined Work that displays copyright notices during
+   execution, include the copyright notice for the Library among
+   these notices, as well as a reference directing the user to the
+   copies of the GNU GPL and this license document.
+   d) Do one of the following:
+       0) Convey the Minimal Corresponding Source under the terms of this
+       License, and the Corresponding Application Code in a form
+       suitable for, and under terms that permit, the user to
+       recombine or relink the Application with a modified version of
+       the Linked Version to produce a modified Combined Work, in the
+       manner specified by section 6 of the GNU GPL for conveying
+       Corresponding Source.
+       1) Use a suitable shared library mechanism for linking with the
+       Library.  A suitable mechanism is one that (a) uses at run time
+       a copy of the Library already present on the user's computer
+       system, and (b) will operate properly with a modified version
+       of the Library that is interface-compatible with the Linked
+       Version.
+   e) Provide Installation Information, but only if you would otherwise
+   be required to provide such information under section 6 of the
+   GNU GPL, and only to the extent that such information is
+   necessary to install and execute a modified version of the
+   Combined Work produced by recombining or relinking the
+   Application with a modified version of the Linked Version. (If
+   you use option 4d0, the Installation Information must accompany
+   the Minimal Corresponding Source and Corresponding Application
+   Code. If you use option 4d1, you must provide the Installation
+   Information in the manner specified by section 6 of the GNU GPL
+   for conveying Corresponding Source.)
+  5. Combined Libraries.
+  You may place library facilities that are a work based on the
+Library side by side in a single library together with other library
+facilities that are not Applications and are not covered by this
+License, and convey such a combined library under terms of your
+choice, if you do both of the following:
+   a) Accompany the combined library with a copy of the same work based
+   on the Library, uncombined with any other library facilities,
+   conveyed under the terms of this License.
+   b) Give prominent notice with the combined library that part of it
+   is a work based on the Library, and explaining where to find the
+   accompanying uncombined form of the same work.
+  6. Revised Versions of the GNU Lesser General Public License.
+  The Free Software Foundation may publish revised and/or new versions
+of the GNU Lesser General Public License from time to time. Such new
+versions will be similar in spirit to the present version, but may
+differ in detail to address new problems or concerns.
+  Each version is given a distinguishing version number. If the
+Library as you received it specifies that a certain numbered version
+of the GNU Lesser General Public License "or any later version"
+applies to it, you have the option of following the terms and
+conditions either of that published version or of any later version
+published by the Free Software Foundation. If the Library as you
+received it does not specify a version number of the GNU Lesser
+General Public License, you may choose any version of the GNU Lesser
+General Public License ever published by the Free Software Foundation.
+  If the Library as you received it specifies that a proxy can decide
+whether future versions of the GNU Lesser General Public License shall
+apply, that proxy's public statement of acceptance of any version is
+permanent authorization for you to choose that version for the
+Library.

data/Rakefile CHANGED

@@ -15,10 +15,10 @@ require 'jeweler'
 Jeweler::Tasks.new do |gem|
   # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
   gem.name = "bio-phyta"
-  gem.homepage = "http://github.com/pcomans/bioruby-phyta"
+  gem.homepage = "https://github.com/PalMuc/bio-phyta"
   gem.license = "LGPL"
   gem.summary = "Pipeline to remove contaminations from EST libraries"
-  gem.description = "Coming soon"
+  gem.description = "Pipeline to remove contaminations from EST libraries"
   gem.email = "philipp.comans@googlemail.com"
   gem.authors = ["Philipp Comans"]
   # Remove test data from the gem

data/VERSION CHANGED

@@ -1 +1 @@
-0.9.0
+0.9.1

data/bin/phyta-assign CHANGED

@@ -13,6 +13,7 @@ opts = Trollop::options do
   opt :database_user, "Optional: The name of the database user", :type => String, :default => "root", :short => "-u"
   opt :database_password, "Optional: The password of the database user", :type => String, :default => "no password", :short => "-p"
   opt :database_name, "Optional: The name of the NCBI taxonomy database", :type => String, :default => "kingdom_assignment_taxonomy", :short => "-n"
+  opt :filter, "A file in YAML format containing a list of taxa to be considered contaminants", :type => String, :default => "Use builtin filter capturing Bacteria, Archaea, Viruses and NONE. To learn how to write your own filters, visit https://github.com/PalMuc/bio-phyta/wiki/Custom-filters ", :short => "-f"
 end
 unless opts[:input_file_given] && opts[:output_file_given]
@@ -35,6 +36,7 @@ end
 require 'sequel'
 require 'nokogiri'
 require 'bio'
+require 'yaml'
 require 'csv'
@@ -57,6 +59,26 @@ puts "Running #{SCRIPT_NAME} #{PHYTA_VERSION}"
 puts "Settings: " + opts.inspect
+filter_array = nil
+if opts[:filter_given]
+  begin
+    filter_array = YAML::load(File.open(opts[:filter], 'r'))
+  rescue Exception => e
+    puts "Error: #{e.message}"
+    puts e.backtrace.join("\n")
+    puts "Please see https://github.com/PalMuc/bio-phyta/wiki/Custom-filters for instructions on how to write filters"
+    abort
+  end
+  unless filter_array.is_a? Array
+    puts "Error: Invalid filter format.\nPlease see https://github.com/PalMuc/bio-phyta/wiki/Custom-filters for instructions on how to write filters"
+    abort
+  end
+else
+  filter_array = KingdomDB::DEFAULT_FILTER
+end
 #Initialize auxiliary classes
 blast_parser = BlastStringParser.new()
@@ -85,36 +107,6 @@ output = INSTALLED_CSV.open(opts[:output_file], "w", {
                               :headers => ["query sequence id", "hit accession number", "sgi", "evalue", "species", "subject annotation", "subject score", "kingdom"],
                               :write_headers => true})
-filter_array = [
-                "Bacteria",
-                "Archaea",
-                "Viridiplantae",
-                "Rhodophyta",
-                "Glaucocystophyceae",
-                "Alveolata",
-                "Cryptophyta",
-                "stramenopiles", #<- Change
-                "Amoebozoa",
-                "Apusozoa",
-                "Euglenozoa",
-                "Fornicata",
-                "Haptophyceae",
-                "Heterolobosea",
-                "Jakobida",
-                "Katablepharidophyta",
-                "Malawimonadidae",
-                "Nucleariidae",
-                "Oxymonadida",
-                "Parabasalia",
-                "Rhizaria",
-                "unclassified eukaryotes",
-                "Fungi",
-                "Metazoa",
-                "Choanoflagellida",
-                "Opisthokonta incertae sedis", #"Fungi/Metazoa incertae sedis"
-                "Viruses"
-               ]
 filter_hash = db.get_filter(filter_array)
 current_query = ""
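
The filter handling added to phyta-assign in this hunk can be tried in isolation. What follows is a minimal sketch, not the gem's code: `load_filter` is a hypothetical helper, `DEFAULT_FILTER` is copied from `KingdomDB::DEFAULT_FILTER` as it appears in this diff, and the YAML content is an assumed example of a custom filter (the wiki page named in the option text remains the authoritative description of the format). The sketch swaps in `YAML.safe_load` and rescues `StandardError` rather than `Exception`; the script itself uses `YAML::load` and `rescue Exception`.

```ruby
require 'yaml'
require 'tempfile'

# Copied from KingdomDB::DEFAULT_FILTER in this diff.
DEFAULT_FILTER = ["Bacteria", "Archaea", "Viruses", "NONE"].freeze

# Hypothetical helper: returns the list of taxa treated as contaminants.
# Falls back to the default when no path is given; raises ArgumentError
# when the file cannot be read or does not contain a YAML list.
def load_filter(path = nil)
  return DEFAULT_FILTER if path.nil?

  begin
    parsed = YAML.safe_load(File.read(path))
  rescue StandardError => e
    raise ArgumentError, "Could not read filter: #{e.message}"
  end
  raise ArgumentError, "Invalid filter format: expected a YAML list" unless parsed.is_a?(Array)
  parsed
end

# Assumed example of a custom filter file: a plain YAML list of taxon names.
EXAMPLE_FILTER_YAML = <<~YAML
  - Bacteria
  - Fungi
  - Viruses
YAML

Tempfile.create(['filter', '.yaml']) do |file|
  file.write(EXAMPLE_FILTER_YAML)
  file.flush
  p load_filter(file.path)
end
p load_filter
```

Raising an error instead of calling `abort` inside the helper keeps the decision about exiting with the caller, which makes the check easier to cover in a unit test than the shell-based black-box tests further down.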

data/bin/phyta-extract CHANGED

@@ -19,54 +19,52 @@ def table_to_set(table, header)
   return result
 end
+require 'rubygems'
+require 'csv'
+require 'set'
+require 'bio'
+require 'trollop'
 #parse command line arguments
-settings = {}
-unless ARGV.size == 5
-  puts "Usage: kingdom-extraction sequences.fasta clean.csv contaminated.csv clean_output.fasta contaminated_output.fasta"
-  exit
+opts = Trollop::options do
+  opt :fasta, "The file containing the sequences in FASTA format", :type => String
+  opt :input_clean, "The name of the clean sequence table in CSV format", :type => String, :short => "-c"
+  opt :input_contaminated, "The name of the contaminated sequence table in CSV format", :type => String, :short => "-d"
+  opt :output_clean, "The name of the FASTA file where clean sequences will be written to", :type => String, :short => "-o"
+  opt :output_contaminated, "The name of the FASTA file where contaminated sequences will be written to", :type => String, :short => "-p"
+end
+unless opts[:fasta_given] && opts[:input_clean_given] && opts[:input_contaminated_given] && opts[:output_clean_given] && opts[:output_contaminated_given]
+  puts "Invalid arguments, see --help for more information."
+  abort
 end
 $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
 $LOAD_PATH.unshift(File.dirname(__FILE__))
-require 'rubygems'
-require 'csv'
-require 'set'
-require 'bio'
 rootpath = File.dirname(File.dirname(__FILE__))
 PHYTA_VERSION = File.new(File.join(rootpath,'VERSION')).read.chomp
 puts "Running #{SCRIPT_NAME} #{PHYTA_VERSION}"
-settings[:input_fasta] = ARGV.shift
-settings[:input_clean] = ARGV.shift
-settings[:input_contaminated] = ARGV.shift
-settings[:output_clean] = ARGV.shift
-settings[:output_contaminated] = ARGV.shift
-unless File.exists?(settings[:input_fasta])
-  puts "The input file at " + File.expand_path(settings[:input_fasta]) + " could not be opened!"
-  exit
+unless File.exists?(opts[:fasta])
+  abort "The input file at " + File.expand_path(opts[:fasta]) + " could not be opened!"
 end
-unless File.exists?(settings[:input_clean])
-  puts "The input file at " + File.expand_path(settings[:input_clean]) + " could not be opened!"
-  exit
+unless File.exists?(opts[:input_clean])
+  abort "The input file at " + File.expand_path(opts[:input_clean]) + " could not be opened!"
 end
-unless File.exists?(settings[:input_contaminated])
-  puts "The input file at " + File.expand_path(settings[:input_contaminated]) + " could not be opened!"
-  exit
+unless File.exists?(opts[:input_contaminated])
+  abort "The input file at " + File.expand_path(opts[:input_contaminated]) + " could not be opened!"
 end
-if File.exists?(settings[:output_clean])
-  puts "The input file at " + File.expand_path(settings[:output_clean]) + " already exists!"
-  exit
+if File.exists?(opts[:output_clean])
+  abort "The input file at " + File.expand_path(opts[:output_clean]) + " already exists!"
 end
-if File.exists?(settings[:output_contaminated])
-  puts "The input file at " + File.expand_path(settings[:output_contaminated]) + " already exists!"
-  exit
+if File.exists?(opts[:output_contaminated])
+  abort "The input file at " + File.expand_path(opts[:output_contaminated]) + " already exists!"
 end
 #CSV backwards compatibility
@@ -79,23 +77,23 @@ end
 #Open output of Kingdom-Splitter, save clean and contaminated sequence ids in two sets
 puts "Reading clean..."
-clean_table = INSTALLED_CSV.open(settings[:input_clean], "r", { :col_sep => ";", :headers => :first_row, :header_converters => :symbol})
+clean_table = INSTALLED_CSV.open(opts[:input_clean], "r", { :col_sep => ";", :headers => :first_row, :header_converters => :symbol})
 clean = table_to_set(clean_table, :query_sequence_id)
 clean_table.close
 puts "Reading contaminated..."
-contaminated_table = INSTALLED_CSV.open(settings[:input_contaminated], "r", { :col_sep => ";", :headers => :first_row, :header_converters => :symbol})
+contaminated_table = INSTALLED_CSV.open(opts[:input_contaminated], "r", { :col_sep => ";", :headers => :first_row, :header_converters => :symbol})
 contaminated = table_to_set(contaminated_table, :query_sequence_id)
 contaminated_table.close
 #Initialize output files
-clean_out = File.open(settings[:output_clean], "w")
-contaminated_out = File.open(settings[:output_contaminated], "w")
+clean_out = File.open(opts[:output_clean], "w")
+contaminated_out = File.open(opts[:output_contaminated], "w")
 puts "Extracting FASTA sequences..."
 QUERY_SEQ_REGEXP = /\A(\S+)\s.*\z/ #Make sure this is exactly the same as in BlastStringParser in Kingdom-Assignment
-sequences = Bio::FastaFormat.open(settings[:input_fasta])
+sequences = Bio::FastaFormat.open(opts[:fasta])
 sequences.each do |entry|
   current = QUERY_SEQ_REGEXP.match(entry.definition)[1] #TODO do something when this comparison fails
   if clean.include?(current)

data/bin/phyta-split CHANGED

@@ -1,13 +1,30 @@
 #!/usr/bin/env ruby
 require 'rubygems'
+require 'trollop'
 require 'csv' #Will use FasterCSV on Ruby 1.8
+require 'yaml'
 SCRIPT_NAME = "phyta-split"
-# $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
+#parse command line arguments
+opts = Trollop::options do
+  opt :input_file, "The output of phyta-assign in CSV format", :type => String
+  opt :output_clean, "The name of the clean output table in CSV format", :type => String, :default => "[name_of_input_file]_clean.csv", :short => "-c"
+  opt :output_contaminated, "The name of the contaminated output table in CSV format", :type => String, :default => "[name_of_input_file]_contaminated.csv", :short => "-d"
+  opt :filter, "Optional: A file in YAML format containing a list of taxa to be considered contaminants", :type => String, :default => "Use builtin filter capturing Bacteria, Archaea, Viruses and NONE. To learn how to write your own filters, visit https://github.com/PalMuc/bio-phyta/wiki/Custom-filters ", :short => "-f"
+end
+unless opts[:input_file_given]
+  puts "Invalid arguments, see --help for more information."
+  abort
+end
+$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
 $LOAD_PATH.unshift(File.dirname(__FILE__))
+require 'kingdom_db'
 #CSV backwards compatibility
 if CSV.const_defined? :Reader
   require 'fastercsv'
@@ -20,46 +37,57 @@ rootpath = File.dirname(File.dirname(__FILE__))
 PHYTA_VERSION = File.new(File.join(rootpath,'VERSION')).read.chomp
 puts "Running #{SCRIPT_NAME} #{PHYTA_VERSION}"
-unless ARGV.size == 1
-  puts "Usage: #{SCRIPT_NAME} input.csv"
-  puts "This will automatically create input_clean.csv and input_contaminated.csv in the same directory."
-  exit
-end
-#Command line arguments
-settings = {}
-settings[:input_file] =  ARGV.shift
 #Set up output file
-fullpath = File.expand_path(settings[:input_file])
+fullpath = File.expand_path(opts[:input_file])
 suffix = File.extname(fullpath)
 dirname = File.dirname(fullpath)
 name = File.basename(fullpath, suffix)
-settings[:contaminated_file] = dirname + "/" + name + "_contaminated.csv"
-settings[:clean_file] = dirname + "/" + name + "_clean.csv"
+unless opts[:output_clean_given]
+  opts[:output_clean] = dirname + "/" + name + "_clean.csv"
+end
+unless opts[:output_contaminated_given]
+  opts[:output_contaminated] = dirname + "/" + name + "_contaminated.csv"
+end
+filter_array = nil
+if opts[:filter_given]
+  begin
+    filter_array = YAML::load(File.open(opts[:filter], 'r'))
+  rescue Exception => e
+    puts "Error: #{e.message}"
+    puts e.backtrace.join("\n")
+    puts "Please see https://github.com/PalMuc/bio-phyta/wiki/Custom-filters for instructions on how to write filters"
+    abort
+  end
+  unless filter_array.is_a? Array
+    puts "Error: Invalid filter format.\nPlease see https://github.com/PalMuc/bio-phyta/wiki/Custom-filters for instructions on how to write filters"
+    abort
+  end
+else
+  filter_array = KingdomDB::DEFAULT_FILTER
+end
 csv_header = ["query sequence id", "hit accession number", "sgi", "evalue", "species", "subject annotation", "subject score", "kingdom"]
 #Open input file
-if !File.file?(settings[:input_file])
-  puts "No input file at " + File.expand_path(settings[:input_file]) + "!"
+if !File.file?(opts[:input_file])
+  puts "No input file at " + File.expand_path(opts[:input_file]) + "!"
   exit
 end
-input = INSTALLED_CSV.open(settings[:input_file], "r", {
+input = INSTALLED_CSV.open(opts[:input_file], "r", {
                    :col_sep => ";",
                    :headers => :first_row,
                    :header_converters => :symbol})
 clean_seqs = {}
 contaminated_seqs = {}
-contaminated_filter = [
-                       "Bacteria",
-                       "Archaea",
-                       "Viruses",
-                       "NONE"
-                       #TODO is this all?
-                      ]
+#TODO make sure filters are consistent
 warning = false;
@@ -80,7 +108,7 @@ input.each do |current_row|
   seq_is_in_clean            = clean_seqs.has_key?(seqid)
   seq_is_in_contaminated     = contaminated_seqs.has_key?(seqid)
-  kingdom_is_in_contaminated = contaminated_filter.include?(kingdom)
+  kingdom_is_in_contaminated = filter_array.include?(kingdom)
   if seq_is_in_clean && seq_is_in_contaminated
@@ -114,7 +142,7 @@ input.each do |current_row|
       end
     else
       #One hit is not contaminated, move to clean seqs
-      if contaminated_seqs[seqid][:evalue].to_f >= current_row[:evalue].to_f
+      if contaminated_seqs[seqid][:evalue].to_f > current_row[:evalue].to_f
         clean_seqs[seqid] = current_row
       else
         clean_seqs[seqid] = contaminated_seqs[seqid]
@@ -140,12 +168,12 @@ unless (clean_seqs.keys & contaminated_seqs.keys).empty?
 end
 #Output
-contaminated = INSTALLED_CSV.open(settings[:contaminated_file], "w", {
+contaminated = INSTALLED_CSV.open(opts[:output_contaminated], "w", {
                           :col_sep => ";",
                           :headers => csv_header,
                           :write_headers => true})
-clean = INSTALLED_CSV.open(settings[:clean_file], "w", {
+clean = INSTALLED_CSV.open(opts[:output_clean], "w", {
                    :col_sep => ";",
                    :headers => csv_header,
                    :write_headers => true})
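
The change from `>=` to `>` in the hunk at line 114 of phyta-split only affects ties: the current row now replaces the stored contaminated row only when its e-value is strictly smaller. The sketch below illustrates that comparison under stated assumptions; `better_hit` is a hypothetical helper, and plain hashes stand in for the CSV rows the script actually handles.

```ruby
# Hypothetical helper mirroring the comparison changed in this diff:
# the current row wins only when its e-value is strictly smaller,
# so a tie keeps the row that was stored first.
def better_hit(stored_row, current_row)
  if stored_row[:evalue].to_f > current_row[:evalue].to_f
    current_row
  else
    stored_row
  end
end

# E-values are kept as strings, as they would be when read from CSV.
stored  = { :evalue => "1e-20", :kingdom => "Bacteria" }
tie     = { :evalue => "1e-20", :kingdom => "Metazoa" }
smaller = { :evalue => "3e-35", :kingdom => "Metazoa" }

p better_hit(stored, tie)[:kingdom]
p better_hit(stored, smaller)[:kingdom]
```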

data/lib/kingdom_db.rb CHANGED

@@ -4,6 +4,12 @@ class KingdomDB
   ROOT_ID = "1"
   SCIENTIFIC_NAME = "scientific name"
+  DEFAULT_FILTER = [
+                                          "Bacteria",
+                                          "Archaea",
+                                          "Viruses",
+                                          "NONE"
+                                         ]
   def initialize(server, user, password, database)
@@ -26,7 +32,7 @@ class KingdomDB
     @filter_hit_cache = {}
   end
   def id_from_name(taxon_name)
     db_results = @database[:names].select(:taxonid, :class).filter(:name => taxon_name).all

data/test/test_blackbox_assign.rb ADDED

@@ -0,0 +1,68 @@
+require 'helper'
+require 'tmpdir'
+class BlackBoxTest < Test::Unit::TestCase
+  ASSIGN_DATADIR = "test/data/assign"
+  def test_without_parameters
+    #This test does not make a whole lot of sense...
+    result = %x[bin/phyta-assign]
+    expected = "Invalid arguments, see --help for more information."
+    assert_equal expected.strip, result.strip
+  end
+  def test_default_filter
+    Dir.mktmpdir do |dir|
+      %x[bin/phyta-assign -i #{ASSIGN_DATADIR}/in_medium.xml -o #{dir}/out_default_filter.csv]
+      result = File.open("#{dir}/out_default_filter.csv").read
+      target = File.open("#{ASSIGN_DATADIR}/target_default_filter.csv").read
+      assert_not_nil result
+      assert_not_nil target
+      assert_block "Output of out_medium.xml invalid." do
+        result == target
+      end
+    end
+  end
+  def test_invalid_filter
+    Dir.mktmpdir do |dir|
+      response = %x[bin/phyta-assign -i #{ASSIGN_DATADIR}/in_medium.xml -o #{dir}/out_default_filter.csv -f #{ASSIGN_DATADIR}/in_medium.xml]
+      assert response.include? "Error"
+      assert !File.exist?("#{dir}/out_default_filter.csv")
+    end
+  end
+  def test_small
+    Dir.mktmpdir do |dir|
+      res = %x[bin/phyta-assign -i #{ASSIGN_DATADIR}/in_3.xml -o #{dir}/out_3.csv -f #{SPLIT_DATADIR}/../common/default_filter.yaml]
+      result = File.open("#{dir}/out_3.csv").read
+      target = File.open("#{ASSIGN_DATADIR}/target_3.csv").read
+      assert_not_nil result
+      assert_not_nil target
+      assert_equal target, result, "Output of out_3.xml invalid"
+    end
+  end
+  def test_medium
+    Dir.mktmpdir do |dir|
+      %x[bin/phyta-assign -i #{ASSIGN_DATADIR}/in_medium.xml -o #{dir}/out_medium.csv -f #{SPLIT_DATADIR}/../common/default_filter.yaml]
+      result = File.open("#{dir}/out_medium.csv").read
+      target = File.open("#{ASSIGN_DATADIR}/target_medium.csv").read
+      assert_not_nil result
+      assert_not_nil target
+      assert_block "Output of out_medium.xml invalid." do
+        result == target
+      end
+    end
+  end
+end

data/test/test_blackbox_extract.rb ADDED

@@ -0,0 +1,58 @@
+require 'helper'
+require 'tmpdir'
+class BlackBoxTest < Test::Unit::TestCase
+  EXTRACT_DATADIR = "test/data/extract"
+  EXTRACT_BINARY  = "bin/phyta-extract"
+  context "Extract command line output" do
+    should "print default message if run without parameters" do
+      result = %x[#{EXTRACT_BINARY}]
+      expected = "Invalid arguments, see --help for more information."
+      assert_equal expected.strip, result.strip
+    end
+  end
+  context "Extracting" do
+    should "work if the clean file is empty" do
+      Dir.mktmpdir do |dir|
+        result = %x[#{EXTRACT_BINARY} -c #{EXTRACT_DATADIR}/clean_empty_clean.csv -d #{EXTRACT_DATADIR}/clean_empty_contaminated.csv -f #{EXTRACT_DATADIR}/truncated.fasta -o #{dir}/clean_empty_clean_out.fasta -p #{dir}/clean_empty_contaminated_out.fasta]
+        clean_result = File.open("#{dir}/clean_empty_clean_out.fasta").read
+        contaminated_result = File.open("#{dir}/clean_empty_contaminated_out.fasta").read
+        clean_target = File.open("#{EXTRACT_DATADIR}/clean_empty_clean_target.fasta").read
+        contaminated_target = File.open("#{EXTRACT_DATADIR}/clean_empty_contaminated_target.fasta").read
+        assert_not_nil clean_result
+        assert_not_nil contaminated_result
+        assert_not_nil clean_target
+        assert_not_nil contaminated_target
+        assert_equal clean_target, clean_result, "Clean files differ"
+        assert_equal contaminated_target, contaminated_result, "Contaminated files differ"
+      end
+    end
+    should "work if the contaminated file is empty" do
+      Dir.mktmpdir do |dir|
+        result = %x[#{EXTRACT_BINARY} -c #{EXTRACT_DATADIR}/contaminated_empty_clean.csv -d #{EXTRACT_DATADIR}/contaminated_empty_contaminated.csv -f #{EXTRACT_DATADIR}/truncated.fasta -o #{dir}/contaminated_empty_clean_out.fasta -p #{dir}/contaminated_empty_contaminated_out.fasta]
+        clean_result = File.open("#{dir}/contaminated_empty_clean_out.fasta").read
+        contaminated_result = File.open("#{dir}/contaminated_empty_contaminated_out.fasta").read
+        clean_target = File.open("#{EXTRACT_DATADIR}/contaminated_empty_clean_target.fasta").read
+        contaminated_target = File.open("#{EXTRACT_DATADIR}/contaminated_empty_contaminated_target.fasta").read
+        assert_not_nil clean_result
+        assert_not_nil contaminated_result
+        assert_not_nil clean_target
+        assert_not_nil contaminated_target
+        assert_equal clean_target, clean_result, "Clean files differ"
+        assert_equal contaminated_target, contaminated_result, "Contaminated files differ"
+      end
+    end
+  end
+end

data/test/test_blackbox_split.rb ADDED

@@ -0,0 +1,116 @@
+require 'helper'
+require 'tmpdir'
+class BlackBoxTest < Test::Unit::TestCase
+  SPLIT_DATADIR = "test/data/split"
+  context "Command line output" do
+    should "print default message if run without parameters" do
+      result = %x[bin/phyta-split]
+      expected = "Invalid arguments, see --help for more information."
+      assert_equal expected.strip, result.strip
+    end
+  end
+  context "Filter parsing" do
+    should "print an error if the filter file is invalid" do
+      Dir.mktmpdir do |dir|
+        response = %x[bin/phyta-split -i #{SPLIT_DATADIR}/in_okay.csv -c #{dir}/clean_okay.csv -d #{dir}/contaminated_okay.csv -f #{SPLIT_DATADIR}/in_okay.csv]
+        assert response.include? "Error"
+        assert !File.exist?("#{dir}/clean_okay.csv")
+        assert !File.exist?("#{dir}/contaminated_okay.csv")
+      end
+    end
+  end
+  context "PhyTA Split" do
+    should "put a sequence into clean if one hit is not in the filter" do
+      Dir.mktmpdir do |dir|
+        %x[bin/phyta-split -i #{SPLIT_DATADIR}/in_okay.csv -c #{dir}/clean_okay.csv -d #{dir}/contaminated_okay.csv -f #{SPLIT_DATADIR}/../common/default_filter.yaml]
+        clean_result = File.open("#{dir}/clean_okay.csv").read
+        contaminated_result = File.open("#{dir}/contaminated_okay.csv").read
+        clean_target = File.open("#{SPLIT_DATADIR}/clean_okay_target.csv").read
+        contaminated_target = File.open("#{SPLIT_DATADIR}/contaminated_okay_target.csv").read
+        assert_not_nil clean_result
+        assert_not_nil contaminated_result
+        assert_not_nil clean_target
+        assert_not_nil contaminated_target
+        assert_equal clean_target, clean_result, "Clean files differ"
+        assert_equal contaminated_target, contaminated_result, "Contaminated files differ"
+      end
+    end
+    should "put a sequence into contaminated if all hits are captured by the filter" do
+      Dir.mktmpdir do |dir|
+        %x[bin/phyta-split -i #{SPLIT_DATADIR}/in_other.csv -c #{dir}/clean_other.csv -d #{dir}/contaminated_other.csv -f #{SPLIT_DATADIR}/../common/default_filter.yaml]
+        clean_result = File.open("#{dir}/clean_other.csv").read
+        contaminated_result = File.open("#{dir}/contaminated_other.csv").read
+        clean_target = File.open("#{SPLIT_DATADIR}/clean_other_target.csv").read
+        contaminated_target = File.open("#{SPLIT_DATADIR}/contaminated_other_target.csv").read
+        assert_not_nil clean_result
+        assert_not_nil contaminated_result
+        assert_not_nil clean_target
+        assert_not_nil contaminated_target
+        assert_equal clean_target, clean_result, "Clean files differ"
+        assert_equal contaminated_target, contaminated_result, "Contaminated files differ"
+      end
+    end
+    should "always choose the best hit, even if it is in the filtered set" do
+      Dir.mktmpdir do |dir|
+        %x[bin/phyta-split -i #{SPLIT_DATADIR}/in_3.csv -c #{dir}/clean_3.csv -d #{dir}/contaminated_3.csv -f #{SPLIT_DATADIR}/../common/default_filter.yaml]
+        clean_result = File.open("#{dir}/clean_3.csv").read
+        contaminated_result = File.open("#{dir}/contaminated_3.csv").read
+        clean_target = File.open("#{SPLIT_DATADIR}/out_3_target_clean.csv").read
+        contaminated_target = File.open("#{SPLIT_DATADIR}/out_3_target_contaminated.csv").read
+        assert_not_nil clean_result
+        assert_not_nil contaminated_result
+        assert_not_nil clean_target
+        assert_not_nil contaminated_target
+        assert_equal clean_target, clean_result, "Clean files differ"
+        assert_equal contaminated_target, contaminated_result, "Contaminated files differ"
+      end
+    end
+    should "split with the default filter if none specified" do
+      Dir.mktmpdir do |dir|
+        %x[bin/phyta-split -i #{SPLIT_DATADIR}/in_3.csv -c #{dir}/clean_3.csv -d #{dir}/contaminated_3.csv]
+        clean_result = File.open("#{dir}/clean_3.csv").read
+        contaminated_result = File.open("#{dir}/contaminated_3.csv").read
+        clean_target = File.open("#{SPLIT_DATADIR}/out_3_default_filter_target_clean.csv").read
+        contaminated_target = File.open("#{SPLIT_DATADIR}/out_3_default_filter_target_contaminated.csv").read
+        assert_not_nil clean_result
+        assert_not_nil contaminated_result
+        assert_not_nil clean_target
+        assert_not_nil contaminated_target
+        assert_equal clean_target, clean_result, "Clean files differ"
+        assert_equal contaminated_target, contaminated_result, "Contaminated files differ"
+      end
+    end
+  end
+end
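
Each of the split tests above repeats the same read-and-compare steps for the clean and contaminated outputs. As a minimal sketch only, the comparison could be factored into a helper. The name `same_contents?` and the sample CSV row are hypothetical and are not part of bio-phyta or of this diff.

```ruby
require 'tmpdir'

# Hypothetical helper (not in the gem): true when both files hold
# byte-for-byte identical contents. File.read opens, reads and closes
# the file in one call.
def same_contents?(result_path, target_path)
  File.read(result_path) == File.read(target_path)
end

# Illustration with throwaway files; the CSV row is made up.
Dir.mktmpdir do |dir|
  result = File.join(dir, "clean_result.csv")
  target = File.join(dir, "clean_target.csv")
  File.write(result, "seq1,Viridiplantae\n")
  File.write(target, "seq1,Viridiplantae\n")
  puts same_contents?(result, target)
end
```

Inside a test, a call such as `assert same_contents?(result_path, target_path), "Clean files differ"` would stand in for the four `File.open(...).read` lines and the paired `assert_equal`.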

metadata CHANGED

@@ -1,191 +1,138 @@
---- !ruby/object:Gem::Specification
+--- !ruby/object:Gem::Specification
 name: bio-phyta
-version: !ruby/object:Gem::Version
-  hash: 59
+version: !ruby/object:Gem::Version
+  version: 0.9.1
   prerelease:
-  segments:
-  - 0
-  - 9
-  - 0
-  version: 0.9.0
 platform: ruby
-authors:
+authors:
 - Philipp Comans
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2011-10-20 00:00:00 Z
-dependencies:
-- !ruby/object:Gem::Dependency
-  requirement: &id001 !ruby/object:Gem::Requirement
+date: 2011-10-21 00:00:00.000000000Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bio
+  requirement: &2153022740 !ruby/object:Gem::Requirement
     none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        hash: 3
-        segments:
-        - 1
-        - 4
-        - 2
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
         version: 1.4.2
-  version_requirements: *id001
-  name: bio
-  prerelease: false
   type: :runtime
-- !ruby/object:Gem::Dependency
-  requirement: &id002 !ruby/object:Gem::Requirement
+  prerelease: false
+  version_requirements: *2153022740
+- !ruby/object:Gem::Dependency
+  name: mysql
+  requirement: &2153022260 !ruby/object:Gem::Requirement
     none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        hash: 45
-        segments:
-        - 2
-        - 8
-        - 1
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
         version: 2.8.1
-  version_requirements: *id002
-  name: mysql
-  prerelease: false
   type: :runtime
-- !ruby/object:Gem::Dependency
-  requirement: &id003 !ruby/object:Gem::Requirement
+  prerelease: false
+  version_requirements: *2153022260
+- !ruby/object:Gem::Dependency
+  name: sequel
+  requirement: &2153021780 !ruby/object:Gem::Requirement
     none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        hash: 119
-        segments:
-        - 3
-        - 28
-        - 0
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
         version: 3.28.0
-  version_requirements: *id003
-  name: sequel
-  prerelease: false
   type: :runtime
-- !ruby/object:Gem::Dependency
-  requirement: &id004 !ruby/object:Gem::Requirement
+  prerelease: false
+  version_requirements: *2153021780
+- !ruby/object:Gem::Dependency
+  name: fastercsv
+  requirement: &2153021300 !ruby/object:Gem::Requirement
     none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        hash: 11
-        segments:
-        - 1
-        - 5
-        - 4
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
         version: 1.5.4
-  version_requirements: *id004
-  name: fastercsv
-  prerelease: false
   type: :runtime
-- !ruby/object:Gem::Dependency
-  requirement: &id005 !ruby/object:Gem::Requirement
+  prerelease: false
+  version_requirements: *2153021300
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: &2153020820 !ruby/object:Gem::Requirement
     none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        hash: 3
-        segments:
-        - 1
-        - 5
-        - 0
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
         version: 1.5.0
-  version_requirements: *id005
-  name: nokogiri
-  prerelease: false
   type: :runtime
-- !ruby/object:Gem::Dependency
-  requirement: &id006 !ruby/object:Gem::Requirement
+  prerelease: false
+  version_requirements: *2153020820
+- !ruby/object:Gem::Dependency
+  name: trollop
+  requirement: &2153020340 !ruby/object:Gem::Requirement
     none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        hash: 83
-        segments:
-        - 1
-        - 16
-        - 2
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
         version: 1.16.2
-  version_requirements: *id006
-  name: trollop
-  prerelease: false
   type: :runtime
-- !ruby/object:Gem::Dependency
-  requirement: &id007 !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        hash: 3
-        segments:
-        - 0
-        version: "0"
-  version_requirements: *id007
-  name: shoulda
   prerelease: false
+  version_requirements: *2153020340
+- !ruby/object:Gem::Dependency
+  name: shoulda
+  requirement: &2153019860 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
   type: :development
-- !ruby/object:Gem::Dependency
-  requirement: &id008 !ruby/object:Gem::Requirement
+  prerelease: false
+  version_requirements: *2153019860
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: &2153019380 !ruby/object:Gem::Requirement
     none: false
-    requirements:
+    requirements:
     - - ~>
-      - !ruby/object:Gem::Version
-        hash: 23
-        segments:
-        - 1
-        - 0
-        - 0
+      - !ruby/object:Gem::Version
         version: 1.0.0
-  version_requirements: *id008
-  name: bundler
-  prerelease: false
   type: :development
-- !ruby/object:Gem::Dependency
-  requirement: &id009 !ruby/object:Gem::Requirement
+  prerelease: false
+  version_requirements: *2153019380
+- !ruby/object:Gem::Dependency
+  name: jeweler
+  requirement: &2153018900 !ruby/object:Gem::Requirement
     none: false
-    requirements:
+    requirements:
     - - ~>
-      - !ruby/object:Gem::Version
-        hash: 7
-        segments:
-        - 1
-        - 6
-        - 4
+      - !ruby/object:Gem::Version
         version: 1.6.4
-  version_requirements: *id009
-  name: jeweler
-  prerelease: false
   type: :development
-- !ruby/object:Gem::Dependency
-  requirement: &id010 !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        hash: 3
-        segments:
-        - 0
-        version: "0"
-  version_requirements: *id010
-  name: rcov
   prerelease: false
+  version_requirements: *2153018900
+- !ruby/object:Gem::Dependency
+  name: rcov
+  requirement: &2153018420 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
   type: :development
-description: Coming soon
+  prerelease: false
+  version_requirements: *2153018420
+description: Pipeline to remove contaminations from EST libraries
 email: philipp.comans@googlemail.com
-executables:
-- phyta-split
+executables:
 - phyta-assign
 - phyta-extract
 - phyta-setup-taxonomy-db
+- phyta-split
 extensions: []
-extra_rdoc_files:
+extra_rdoc_files:
 - LICENSE.txt
 - README.rdoc
-files:
+files:
 - .document
 - Gemfile
 - LICENSE.txt
@@ -199,41 +146,37 @@ files:
 - lib/blast_string_parser.rb
 - lib/kingdom_db.rb
 - test/helper.rb
-- test/test_blackbox.rb
+- test/test_blackbox_assign.rb
+- test/test_blackbox_extract.rb
+- test/test_blackbox_split.rb
 - test/test_blast_string_parser.rb
 - test/test_kingdom_db.rb
-homepage: http://github.com/pcomans/bioruby-phyta
-licenses:
+homepage: https://github.com/PalMuc/bio-phyta
+licenses:
 - LGPL
 post_install_message:
 rdoc_options: []
-require_paths:
+require_paths:
 - lib
-required_ruby_version: !ruby/object:Gem::Requirement
+required_ruby_version: !ruby/object:Gem::Requirement
   none: false
-  requirements:
-  - - ">="
-    - !ruby/object:Gem::Version
-      hash: 3
-      segments:
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
       - 0
-      version: "0"
-required_rubygems_version: !ruby/object:Gem::Requirement
+      hash: -3130547697683155421
+required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
-  requirements:
-  - - ">="
-    - !ruby/object:Gem::Version
-      hash: 3
-      segments:
-      - 0
-      version: "0"
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
 requirements: []
 rubyforge_project:
 rubygems_version: 1.8.10
 signing_key:
 specification_version: 3
 summary: Pipeline to remove contaminations from EST libraries
 test_files: []
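
The dependency entries in the metadata above use two constraint operators, `>=` and `~>`. As a rough illustration of how RubyGems evaluates them, the sketch below checks a few version strings against the `bio` and `bundler` constraints. The helper name `satisfies?` and the candidate versions are assumptions chosen for the example; `Gem::Requirement` and `Gem::Version` are the standard RubyGems classes.

```ruby
require 'rubygems'

# Hypothetical convenience wrapper around the RubyGems API.
def satisfies?(requirement, version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end

# '>=' accepts the named version and anything newer.
puts satisfies?('>= 1.4.2', '1.5.0')
puts satisfies?('>= 1.4.2', '1.4.1')

# '~> 1.0.0' accepts 1.0.x releases but not 1.1.0 or later.
puts satisfies?('~> 1.0.0', '1.0.9')
puts satisfies?('~> 1.0.0', '1.1.0')
```

The `version: '0'` requirement on shoulda and rcov corresponds to `>= 0`, which any release satisfies.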

data/test/test_blackbox.rb DELETED

@@ -1,41 +0,0 @@
-require 'helper'
-require 'tmpdir'
-class BlackBoxTest < Test::Unit::TestCase
-  def test_without_parameters
-    #This test does not make a whole lot of sense...
-    result = %x[bin/phyta-assign]
-    expected = "Invalid arguments, see --help for more information."
-    assert_equal expected.strip, result.strip
-  end
-  def test_small
-    Dir.mktmpdir do |dir|
-      %x[bin/phyta-assign -i test/data/in_3.xml -o #{dir}/out_3.csv]
-      result = File.open("#{dir}/out_3.csv").read
-      target = File.open("test/data/target_3.csv").read
-      assert_not_nil result
-      assert_not_nil target
-      assert_equal target, result, "Output of out_3.xml invalid"
-    end
-  end
-  def test_medium
-    Dir.mktmpdir do |dir|
-      %x[bin/phyta-assign -i test/data/in_medium.xml -o #{dir}/out_medium.csv]
-      result = File.open("#{dir}/out_medium.csv").read
-      target = File.open("test/data/target_medium.csv").read
-      assert_not_nil result
-      assert_not_nil target
-      assert_block "Output of out_medium.xml invalid." do
-        result == target
-      end
-    end
-  end
-end