bio-kmer_counter 0.1.0 → 0.1.1

Files changed (8)

data/Gemfile +3 -3
data/README.md +28 -9
data/VERSION +1 -1
data/bin/kmer_counter.rb +7 -7
data/test/data/100random.fa +2 -0
data/test/helper.rb +2 -0
data/test/test_bio-kmer_counter.rb +36 -9
metadata +25 -24

data/Gemfile CHANGED

@@ -11,8 +11,8 @@ gem 'bio-logger', '>=1.0.1'
 # Include everything needed to run rake, tests, features, etc.
 group :development do
   gem "shoulda", ">= 0"
-  gem "rdoc", "~> 3.12"
-  gem "jeweler", "~> 1.8.3"
+  gem "rdoc", ">= 3.12"
+  gem "jeweler",">= 1.8.3"
   gem "bundler", ">= 1.0.21"
-  gem "rdoc", "~> 3.12"
+  gem "rdoc", ">= 3.12"
 end
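
The Gemfile hunk above relaxes the pessimistic `~>` constraints on rdoc and jeweler to open-ended `>=` constraints. As a rough illustration of what that changes, the sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes; the helper `allowed?` and the version numbers tried are assumptions made for this example and are not part of the gem.

```ruby
require 'rubygems' # usually already loaded; made explicit here

# True if the given version string satisfies the constraint string.
# (Hypothetical helper for illustration only.)
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# Version numbers below are illustrative, not a list of real releases.
%w(3.12 3.12.2 4.0.0).each do |v|
  puts "rdoc #{v}: '~> 3.12' #{allowed?('~> 3.12', v)}, '>= 3.12' #{allowed?('>= 3.12', v)}"
end
```

Under `~> 3.12` a 4.x release is rejected, while `>= 3.12` accepts it, so the 0.1.1 Gemfile no longer caps the major version of these development dependencies.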

data/README.md CHANGED

@@ -4,28 +4,47 @@
 bio-kmer_counter is a simple [biogem](http://biogem.info) for fingerprinting
 nucleotide sequences by counting the occurences of particular kmers in the
-sequence. The methodology is not new, for references see [Teeling et. al. 2004](http://www.biomedcentral.com/1471-2105/5/163). The default parameters are derived from the methods section of [Dick et. al. 2009](http://genomebiology.com/content/10/8/R85).
+sequence. The methodology is not new, for a reference see
+[Teeling et. al. 2004](http://www.biomedcentral.com/1471-2105/5/163).
+The default parameters are derived from the well explained methods section of
+[Dick et. al. 2009](http://genomebiology.com/content/10/8/R85).
 This methodology is quite different to that of other software that counts
 kmer content with longer kmers, e.g. [khmer](https://github.com/ged-lab/khmer).
-Here only small kmers are intended (e.g. 1mer or 4mer).
-Note: this software is under active development!
+Here only small kmers are intended (e.g. 1-mer or 4-mer).
 ## Installation
+After installing [Ruby](http://www.ruby-lang.org) itself, install the bio-kmer_counter rubygem:
 ```sh
 gem install bio-kmer_counter
 ```
-## Usage
+bio-kmer_counter is only tested on Linux, but probably works on OSX too. It might even work on Windows if
+the progress bar is turned off. Maybe.
-To analyse a fasta file (that contains one or more sequences in it) for 4-mer (tetranucleotide)
-content, reporting the fingerprint of 5kb windows in each sequence separately,
-plus the leftover part if it is longer than 2kb:
+## Usage
+The default parameters analyse a fasta file that contains one or more sequences in it for 4-mer (tetranucleotide)
+content. By default, any sequence
+in the fasta file 2kb or longer is included at least once. Sequences are split up
+into 5kb windows if they are that long, and each window is reported separately.
+If the leftover bit at the end after any 5kb windows is 2kb or longer then this is also included.
+By default, each 4 base window in the input sequence is included exactly once in the output file.
+To account for the fact
+that the directions of sequences with respect to each other are presumed to be unknown (as is the
+case for de-novo genome assembly), either the forward or reverse complement is included. Which one
+(forward or reverse) depends on which one comes first alphabetically. So for instance if the window is ```CTTT```, then ```AAAG```
+is used. Accounting for palindromic sequences like ```ATAT```, there are 136 of these lowest lexigraphical 4-mers.
+So there are 136 columns in the output, plus one for the name of the window. Using only 1 is
+actually slightly different than the method outlined in Dick et. al. 2009, but we
+don't expect the results to differ.
+Example usage, if you wish to fingerprint a fasta file ```my_nucleotide_sequences.fasta```:
 ```sh
-kmer_counter.rb <fasta_file> >tetranucleotide_content.csv
+kmer_counter.rb my_nucleotide_sequences.fasta >tetranucleotide_content.csv
 ```
 The fingerprints are reported in percentages. Well, between 0 and 1, that is.
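
The new README text says that each 4-mer is collapsed to whichever of its forward or reverse-complement form sorts first, giving 136 columns. A minimal plain-Ruby sketch of that idea follows; it does not use the gem's `Bio::Sequence::Kmer` code, and the method names `reverse_complement`, `canonical_form` and `canonical_kmers` are made up for this example.

```ruby
# Reverse complement of an uppercase DNA string.
def reverse_complement(kmer)
  kmer.reverse.tr('ACGT', 'TGCA')
end

# The alphabetically earlier of a k-mer and its reverse complement.
def canonical_form(kmer)
  [kmer, reverse_complement(kmer)].min
end

# All distinct canonical k-mers of length k, sorted alphabetically.
def canonical_kmers(k)
  %w(A C G T).repeated_permutation(k).map { |bases| canonical_form(bases.join) }.uniq.sort
end

puts canonical_form('CTTT')     # prints AAAG
puts canonical_kmers(4).length  # prints 136
```

The count follows from simple arithmetic: of the 256 possible 4-mers, 16 are their own reverse complement (such as `ATAT`), so there are (256 - 16) / 2 + 16 = 136 canonical forms.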

data/VERSION CHANGED

@@ -1 +1 @@
-0.1.0
+0.1.1

data/bin/kmer_counter.rb CHANGED

@@ -64,8 +64,8 @@ o = OptionParser.new do |opts|
   opts.on("-l", "--window-length", "print the length of the window in the output [default #{options[:sequence_length]}]") do |v|
     options[:sequence_length] = true
   end
   # logger options
   opts.on("-q", "--quiet", "Run quietly, set logging to ERROR level [default INFO]") do |q|
     Bio::Log::CLI.trace('error')
@@ -90,7 +90,7 @@ Bio::Log::CLI.configure(LOG_NAME)
 # Print headers
 print "ID\t"
-print Bio::Sequence::Kmer.merge_down_to_lowest_lexigraphical_form(Bio::Sequence::Kmer.empty_full_kmer_hash(options[:kmer])).keys.join("\t")
+print Bio::Sequence::Kmer.merge_down_to_lowest_lexigraphical_form(Bio::Sequence::Kmer.empty_full_kmer_hash(options[:kmer])).keys.sort.join("\t")
 print "\tWindowLength" if options[:sequence_length]
 print "\tcontig" if options[:contig_name]
 puts
@@ -99,7 +99,7 @@ orig = Bio::Sequence::Kmer.empty_full_kmer_hash(options[:kmer])
 process_window = lambda do |window,kmer,sequence_name,contig_name|
   counts = orig.dup
   num_kmers_counted = 0
   window.window_search(options[:kmer],1) do |tetranucleotide|
     str = tetranucleotide.to_s
     next unless str.gsub(/[ATGC]+/,'') == ''
@@ -107,10 +107,10 @@ process_window = lambda do |window,kmer,sequence_name,contig_name|
     counts[str]+=1
     #counts[Bio::Sequence::NA.new(tetranucleotide).lowest_lexigraphical_form.to_s.upcase] += 1
   end
   # Merge everything into lowest lexigraphical form
   new_counts = Bio::Sequence::Kmer.merge_down_to_lowest_lexigraphical_form counts
   if num_kmers_counted == 0
     log.warn "Skipping window #{sequence_name} because few/none ATGC's were detected (was it all N's?)"
   else
@@ -127,7 +127,7 @@ end
 fasta_filename = ARGV[0]
 progress = nil
 progress = ProgressBar.new('kmer_counter', `grep -c '>' '#{fasta_filename}'`.to_i) if options[:progressbar]
-ff = Bio::FlatFile.open(fasta_filename)
+ff = Bio::FlatFile.open(fasta_filename)
 ff.each do |sequence|
   window_counter = 0
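
The script's full windowing loop is not visible in these hunks. As a sketch of the behaviour the README describes (5kb windows, plus a leftover piece when it is 2kb or longer), something like the following would produce the window IDs that the new tests expect (`one_0`, `one_leftover_1`). The method name `split_into_windows` and its parameters are assumptions for illustration, not the script's actual code.

```ruby
# Split a sequence into full-size windows plus an optional leftover piece.
# Defaults mirror the README: 5kb windows, leftover kept only if >= 2kb.
# Returns an array of [id, subsequence] pairs.
def split_into_windows(name, sequence, window_size = 5000, min_leftover = 2000)
  full, remainder = sequence.length.divmod(window_size)
  pieces = (0...full).map do |i|
    ["#{name}_#{i}", sequence[i * window_size, window_size]]
  end
  if remainder >= min_leftover
    pieces << ["#{name}_leftover_#{full}", sequence[full * window_size, remainder]]
  end
  pieces
end

split_into_windows('one', 'A' * 7500).each do |id, piece|
  puts "#{id}\t#{piece.length}"
end
# prints:
# one_0	5000
# one_leftover_1	2500
```

A 2500-base sequence yields a single `one_leftover_0` piece, and a sequence shorter than 2kb yields nothing at all.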

data/test/data/100random.fa ADDED

@@ -0,0 +1,2 @@
+>random
+GCAGAGCACCTCCGCGTGACATTCTATTATGGAATTGAAGTCCAGTCAGACCAGTACCCTTGCACAGGCAATACATTGGAACTGGATCAGAACTTCCTAC

data/test/helper.rb CHANGED

@@ -14,5 +14,7 @@ $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
 $LOAD_PATH.unshift(File.dirname(__FILE__))
 require 'bio-kmer_counter'
+TEST_DATA_DIR = File.join(File.dirname(__FILE__), 'data')
 class Test::Unit::TestCase
 end

data/test/test_bio-kmer_counter.rb CHANGED

@@ -8,12 +8,12 @@ class TestBioKmerCounter < Test::Unit::TestCase
     assert_equal Bio::Sequence::NA.new('AA'), Bio::Sequence::NA.new('TT').lowest_lexigraphical_form
     assert_equal Bio::Sequence::NA.new('AG'), Bio::Sequence::NA.new('CT').lowest_lexigraphical_form
   end
   should 'test_empty_full_kmer_hash' do
     answer = {}; %w(A C G T).each{|k| answer[k] = 0}
     assert_equal answer, Bio::Sequence::Kmer.empty_full_kmer_hash(1)
   end
   should 'test merge down' do
     answer = {}; %w(A C).each{|k| answer[k] = 0}
     full = Bio::Sequence::Kmer.empty_full_kmer_hash(1)
@@ -21,11 +21,11 @@ class TestBioKmerCounter < Test::Unit::TestCase
     full = Bio::Sequence::Kmer.empty_full_kmer_hash #defaults to kmer hash length 4
     assert_equal 136, Bio::Sequence::Kmer.merge_down_to_lowest_lexigraphical_form(full).length
   end
   def script_path
     File.join(File.dirname(__FILE__),'..','bin','kmer_counter.rb')
   end
   should 'test_running1' do
     Tempfile.open('one') do |tempfile|
       tempfile.puts '>one'
@@ -35,7 +35,7 @@ class TestBioKmerCounter < Test::Unit::TestCase
       assert_equal "ID\tA\tC\none_0\t0.6\t0.4\n", `#{script_path} -w 5 -k 1 #{tempfile.path}`
     end
   end
   should 'not whack out when there isnt any sequence to count' do
     Tempfile.open('one') do |tempfile|
       tempfile.puts '>one'
@@ -45,13 +45,13 @@ class TestBioKmerCounter < Test::Unit::TestCase
       assert_equal "ID\tA\tC\n", `#{script_path} -w 5 -k 1 #{tempfile.path}`
     end
   end
   should 'give correct increments in window numbering' do
     Tempfile.open('one') do |tempfile|
       tempfile.puts '>one'
       tempfile.puts 'ATGCATGCAT' #10 letters long
       tempfile.close
       expected = "ID\tA\tC\n"+
       "one_0\t0.5\t0.5\n"+
       "one_1\t0.5\t0.5\n"+
@@ -60,14 +60,14 @@ class TestBioKmerCounter < Test::Unit::TestCase
       assert_equal expected, `#{script_path} -w 4 -k 1 -m 2 #{tempfile.path}`
     end
   end
   should "print help when no arguments are given" do
     command = "#{script_path}"
     Open3.popen3(command) do |stdin, stdout, stderr|
       assert stderr.readlines[0].match(/^Usage: kmer_counter/)
     end
   end
   should 'work with lowercase' do
     Tempfile.open('one') do |tempfile|
       tempfile.puts '>one'
@@ -77,4 +77,31 @@ class TestBioKmerCounter < Test::Unit::TestCase
       assert_equal "ID\tA\tC\none_0\t0.6\t0.4\n", `#{script_path} -w 5 -k 1 #{tempfile.path}`
     end
   end
+  should 'by default count contigs greater than 2kb but less than 5kb' do
+    Tempfile.open('one') do |tempfile|
+      tempfile.puts '>one'
+      tempfile.puts 'A'*2500
+      tempfile.close
+      assert_equal "ID\tA\tC\none_leftover_0\t1.0\t0.0\n", `#{script_path} -k 1 #{tempfile.path}`
+    end
+  end
+  should 'by default count contigs greater than 2kb but less than 5kb' do
+    Tempfile.open('one') do |tempfile|
+      tempfile.puts '>one'
+      tempfile.puts 'A'*7500
+      tempfile.close
+      assert_equal "ID\tA\tC\none_0\t1.0\t0.0\none_leftover_1\t1.0\t0.0\n", `#{script_path} -k 1 #{tempfile.path}`
+    end
+  end
+  should 'work simulated example with kmer length = 2' do
+    expected = %w(ID	AA	AC	AG	AT	CA	CC	CG	GA	GC	TA).join("\t")+"\n"+
+    %w(random_leftover_0	0.1111111111111111	0.13131313131313133	0.1414141414141414	0.0707070707070707	0.1717171717171717	0.1111111111111111	0.020202020202020204	0.1414141414141414	0.050505050505050504	0.050505050505050504).join("\t")+"\n"
+    assert_equal expected, `#{script_path} -k 2 -m 1 #{File.join(TEST_DATA_DIR,'100random.fa')}`
+  end
 end
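
The tests above expect each window's fingerprint to be a set of fractions between 0 and 1 over the canonical k-mers. A minimal sketch of that calculation in plain Ruby, assuming the fraction is the canonical k-mer count divided by the number of valid k-mers in the window, might look like this. The method name `kmer_fingerprint` is hypothetical, and unlike the real script this sketch omits k-mers that never occur rather than reporting them as 0.0.

```ruby
# Fraction of each canonical k-mer in a sequence.
# k-mers containing anything other than A, C, G or T are skipped.
def kmer_fingerprint(sequence, k)
  counts = Hash.new(0)
  total = 0
  seq = sequence.upcase
  (0..seq.length - k).each do |i|
    kmer = seq[i, k]
    next unless kmer =~ /\A[ACGT]+\z/
    rc = kmer.reverse.tr('ACGT', 'TGCA')   # reverse complement
    counts[[kmer, rc].min] += 1            # keep the alphabetically earlier form
    total += 1
  end
  return {} if total == 0
  Hash[counts.sort.map { |kmer, n| [kmer, n.to_f / total] }]
end

kmer_fingerprint('ATGC', 1).each { |kmer, fraction| puts "#{kmer}\t#{fraction}" }
# prints:
# A	0.5
# C	0.5
```

For `ATGC` with k = 1, `T` folds into `A` and `G` folds into `C`, which gives the 0.5 and 0.5 values seen in the window-numbering test.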

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: bio-kmer_counter
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
   prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-08-29 00:00:00.000000000 Z
+date: 2013-03-27 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bio
-  requirement: &85076110 !ruby/object:Gem::Requirement
+  requirement: &73018740 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
         version: 1.4.2
   type: :runtime
   prerelease: false
-  version_requirements: *85076110
+  version_requirements: *73018740
 - !ruby/object:Gem::Dependency
   name: progressbar
-  requirement: &85075540 !ruby/object:Gem::Requirement
+  requirement: &73018270 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
         version: 0.11.0
   type: :runtime
   prerelease: false
-  version_requirements: *85075540
+  version_requirements: *73018270
 - !ruby/object:Gem::Dependency
   name: parallel
-  requirement: &85075130 !ruby/object:Gem::Requirement
+  requirement: &73017680 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
         version: 0.5.17
   type: :runtime
   prerelease: false
-  version_requirements: *85075130
+  version_requirements: *73017680
 - !ruby/object:Gem::Dependency
   name: bio-logger
-  requirement: &85074690 !ruby/object:Gem::Requirement
+  requirement: &73016840 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -54,10 +54,10 @@ dependencies:
         version: 1.0.1
   type: :runtime
   prerelease: false
-  version_requirements: *85074690
+  version_requirements: *73016840
 - !ruby/object:Gem::Dependency
   name: shoulda
-  requirement: &85074250 !ruby/object:Gem::Requirement
+  requirement: &73016590 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -65,32 +65,32 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *85074250
+  version_requirements: *73016590
 - !ruby/object:Gem::Dependency
   name: rdoc
-  requirement: &85073740 !ruby/object:Gem::Requirement
+  requirement: &73016160 !ruby/object:Gem::Requirement
     none: false
     requirements:
-    - - ~>
+    - - ! '>='
       - !ruby/object:Gem::Version
         version: '3.12'
   type: :development
   prerelease: false
-  version_requirements: *85073740
+  version_requirements: *73016160
 - !ruby/object:Gem::Dependency
   name: jeweler
-  requirement: &85093940 !ruby/object:Gem::Requirement
+  requirement: &73015700 !ruby/object:Gem::Requirement
     none: false
     requirements:
-    - - ~>
+    - - ! '>='
       - !ruby/object:Gem::Version
         version: 1.8.3
   type: :development
   prerelease: false
-  version_requirements: *85093940
+  version_requirements: *73015700
 - !ruby/object:Gem::Dependency
   name: bundler
-  requirement: &85093260 !ruby/object:Gem::Requirement
+  requirement: &73015320 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -98,18 +98,18 @@ dependencies:
         version: 1.0.21
   type: :development
   prerelease: false
-  version_requirements: *85093260
+  version_requirements: *73015320
 - !ruby/object:Gem::Dependency
   name: rdoc
-  requirement: &85092790 !ruby/object:Gem::Requirement
+  requirement: &73015000 !ruby/object:Gem::Requirement
     none: false
     requirements:
-    - - ~>
+    - - ! '>='
       - !ruby/object:Gem::Version
         version: '3.12'
   type: :development
   prerelease: false
-  version_requirements: *85092790
+  version_requirements: *73015000
 description: A biogem for counting small kmers for fingerprinting nucleotide sequences.
   See README for details.
 email: gmail.com after donttrustben
@@ -130,6 +130,7 @@ files:
 - bin/kmer_counter.rb
 - lib/bio-kmer_counter.rb
 - lib/bio-kmer_counter/kmer_counter.rb
+- test/data/100random.fa
 - test/helper.rb
 - test/test_bio-kmer_counter.rb
 homepage: http://github.com/wwood/bioruby-kmer_counter
@@ -147,7 +148,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: 285575987
+      hash: -117512543
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements: