RubyGems - parse_fasta - Versions diffs - 1.0.1 → 1.1.0 - Mend

parse_fasta 1.0.1 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

checksums.yaml +4 -4
data/README.md +41 -20
data/lib/parse_fasta/fastq_file.rb +65 -0
data/lib/parse_fasta/quality.rb +33 -0
data/lib/parse_fasta/sequence.rb +2 -5
data/lib/parse_fasta/version.rb +1 -1
data/lib/parse_fasta.rb +2 -0
data/parse_fasta.gemspec +4 -4
data/spec/lib/fastq_file_spec.rb +50 -0
data/spec/lib/quality.rb +35 -0
data/test_files/benchmark.rb +32 -10
data/test_files/test.fq +8 -0
metadata +25 -18

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 81e3eddaba7422d44ae3235f5c9a997f9fc9ea96
-  data.tar.gz: 9fce00b1a06d89911ded694f3727eaca9dbda59e
+  metadata.gz: 964e83c4a500490c9c5b9cb6f709fe7405b9f856
+  data.tar.gz: eca56db94b3e639849699da0baf25e06ceedbbd6
 SHA512:
-  metadata.gz: ce14854b609f406cbafba53f6e523e17ab9e7406d0073e303f1b6d634e6e8e213baf0f2f648d57159af48dcc7b1859bcac426a9b081eb7e069cc5ace43fe5030
-  data.tar.gz: d1bf3eb669efe8d5747315315b1cbb7370d70ed0614e64ea44beb9fd08990554e4609364ce623a54e896e74becd222c124da92ff5a716acfccc55b9887f5cf24
+  metadata.gz: b0ae81b8ccf5bf0867b4457a7ceb65fecfe7b23bfb489132e4547bf62207707ee94ae0085444036c4683d9b5cc05cbeb1e177fc53c67dc249aedc4432cde41bb
+  data.tar.gz: 39a9dea187dae22817b8fdcfc5f1c67ed2e8357abc3d89631544f8d94256df1f6c2aec8d07d957c792adbd039e18968f91c8c133c1e53608a38623896d73d40d
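
The SHA1 and SHA512 entries above are hex digests of the two archives inside the gem. As a rough sketch of how such values could be recomputed for comparison, the snippet below hashes a file with Ruby's standard Digest library. The helper name `checksums_for` and the temporary placeholder file are assumptions made for illustration; they are not part of parse_fasta or RubyGems.

```ruby
require 'digest/sha1'
require 'digest/sha2'
require 'tempfile'

# Hypothetical helper (not part of parse_fasta or RubyGems): returns the
# hex digests of a file for the two algorithms listed in checksums.yaml.
def checksums_for(path)
  {
    'SHA1'   => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Stand-in for an extracted metadata.gz or data.tar.gz; the bytes are
# placeholders, so the digests will not match the values in the diff.
sums = Tempfile.create('example') do |file|
  file.write('placeholder bytes')
  file.flush
  checksums_for(file.path)
end

sums.each { |algorithm, digest| puts "#{algorithm}: #{digest}" }
```

A SHA1 hex digest is 40 characters long and a SHA512 hex digest is 128, which matches the lengths of the entries shown in the hunk.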

data/README.md CHANGED

@@ -18,17 +18,20 @@ Or install it yourself as:
 ## Overview ##
-I wanted a simple, fast way to parse fasta files so I wouldn't have to
-keep writing annoying boilerplate fasta parsing code everytime I go to
-do something with one. I will probably add more, but likely only tasks
-that I find myself doing over and over.
+I wanted a simple, fast way to parse fasta and fastq files so I
+wouldn't have to keep writing annoying boilerplate parsing code
+everytime I go to do something with a fasta or fastq file. I will
+probably add more, but likely only tasks that I find myself doing over
+and over.
-## Usage ##
+## Documentation ##
+Checkout [parse_fasta docs](http://rubydoc.info/gems/parse_fasta/1.1.0/frames) to see
+the full documentation.
-### Version 1.0.0 (current) ###
+## Usage ##
-The monkey patch of the `File` class is no more! Here is the new print
-length example:
+A little script to print header and length of each record.
 	require 'parse_fasta'
@@ -38,28 +41,37 @@ length example:
 And here, a script to calculate GC content:
-	require 'parse_fasta'
 	FastaFile.open(ARGV.first, 'r').each_record do |header, sequence|
 	  puts [header, sequence.gc].join("\t")
 	end
-### Version 0.0.5 (old) ###
+Now we can parse fastq files as well!
-An example that lists the length for each sequence. (Won't work in
-version 1.0.0)
+	FastqFile.open(ARGV.first, 'r').each_record do |head, seq, desc, qual|
+	  puts [head, seq, desc, qual.qual_scores.join(',')].join("\t")
+	end
-    require 'parse_fasta'
+## Versions ##
-	File.open(ARGV.first, 'r').each_record do |header, sequence|
-	  puts [header, sequence.length].join("\t")
-	end
+### 1.1.0 ###
+Added: Fastq and Quality classes
+### 1.0.0 ###
+Added: Fasta and Sequence classes
+Removed: File monkey patch
+### 0.0.5 ###
+Last version with File monkey patch.
 ## Benchmark ##
-Take these with a grain of salt since `BioRuby` is a heavy weight
+Take these with a grain of salt since `BioRuby` is a big module
 module with lots of features and error checking, whereas `parse_fasta`
-is meant to be lightweight and easy to use for my own coding.
+is meant to be lightweight and easy to use for my own research.
 ### FastaFile#each_record ###
@@ -78,12 +90,21 @@ was 1.1 gigabytes. Here are the results from Ruby's `Benchmark` class:
 I just wanted a nice, clean way to parse fasta files, but being nearly
 twice as fasta as BioRuby doesn't hurt either!
+### FastqFile#each_record ###
+The same sequence length test as above, but this time with a fastq
+file containing 4,000,000 illumina reads.
+                        user     system      total        real
+    this_fastq     62.610000   1.660000  64.270000 ( 64.389408)
+    bioruby_fastq 165.500000   2.100000 167.600000 (167.969636)
 ### Sequence#gc ###
 I played around with a few different implementations for the `#gc`
 method and found this one to be the fastest.
-The test is done one random strings mating `/[AaCcTtGgUu]/`. `this_gc`
+The test is done on random strings matching `/[AaCcTtGgUu]/`. `this_gc`
 is `Sequence.new(str).gc`, and `bioruby_gc` is
 `Bio::Sequence::NA.new(str).gc_content`.

data/lib/parse_fasta/fastq_file.rb ADDED

@@ -0,0 +1,65 @@
+# Copyright 2014 Ryan Moore
+# Contact: moorer@udel.edu
+#
+# This file is part of parse_fasta.
+#
+# parse_fasta is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# parse_fasta is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with parse_fasta.  If not, see <http://www.gnu.org/licenses/>.
+# Provides simple interface for parsing four-line-per-record fastq
+# format files.
+class FastqFile < File
+  # Analogous to File#each_line, #each_record is used to go through a
+  # fastq file record by record.
+  #
+  # @example Parsing a fastq file
+  #   FastqFile.open('reads.fq', 'r').each_record do |head, seq, desc, qual|
+  #     # do some fun stuff here!
+  #   end
+  #
+  # @yield The header, sequence, description and quality string for
+  #   each record in the fastq file to the block
+  # @yieldparam header [String] The header of the fastq record without
+  #   the leading '@'
+  # @yieldparam sequence [Sequence] The sequence of the fastq record
+  # @yieldparam description [String] The description line of the fastq
+  #   record without the leading '+'
+  # @yieldparam quality [Quality] The quality string of the fastq
+  #   record
+  def each_record
+    count = 0
+    header = ''
+    sequence = ''
+    description = ''
+    quality = ''
+    self.each_line do |line|
+      line.chomp!
+      case count % 4
+      when 0
+        header = line.sub(/^@/, '')
+      when 1
+        sequence = Sequence.new(line)
+      when 2
+        description = line.sub(/^\+/, '')
+      when 3
+        quality = Quality.new(line)
+        yield(header, sequence, description, quality)
+      end
+      count += 1
+    end
+  end
+end
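
The `each_record` method above tracks a line counter modulo 4 to assemble each record. A minimal alternative sketch of the same four-lines-per-record idea, using `each_slice(4)` on plain strings, is shown below. The helper name `each_fastq_record` is hypothetical, it yields plain `String` objects rather than `Sequence` and `Quality`, and skipping a trailing partial record is an assumption of this sketch, not documented behaviour of the gem.

```ruby
require 'stringio'

# Hypothetical helper, not part of parse_fasta: groups the lines of an IO
# four at a time and yields header, sequence, description and quality.
def each_fastq_record(io)
  return enum_for(:each_fastq_record, io) unless block_given?

  io.each_line.each_slice(4) do |head, seq, desc, qual|
    # Assumption: a trailing group with fewer than four lines is skipped.
    next if qual.nil?

    yield head.chomp.sub(/^@/, ''),
          seq.chomp,
          desc.chomp.sub(/^\+/, ''),
          qual.chomp
  end
end

# Same content as test_files/test.fq in this release.
FASTQ = "@seq1\nAACCTTGG\n+\n)#3gTqN8\n@seq2 apples\nACTG\n+seq2 apples\n*ujM\n"

records = each_fastq_record(StringIO.new(FASTQ)).to_a
records.each { |head, seq, _desc, _qual| puts [head, seq.length].join("\t") }
```

Like the gem's implementation, this only handles files where every record occupies exactly four lines; multi-line sequences would need a different approach.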

data/lib/parse_fasta/quality.rb ADDED

@@ -0,0 +1,33 @@
+# Copyright 2014 Ryan Moore
+# Contact: moorer@udel.edu
+#
+# This file is part of parse_fasta.
+#
+# parse_fasta is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# parse_fasta is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with parse_fasta.  If not, see <http://www.gnu.org/licenses/>.
+# Provide some methods for dealing with common tasks regarding
+# quality strings.
+class Quality < String
+  # Returns an array of illumina style quality scores. The quality
+  # scores generated will be Phred+33.
+  #
+  # @example Get quality score array of a Quality
+  #   Quality.new("!+5?I").qual_scores #=> [0, 10, 20, 30, 40]
+  #
+  # @return [Array<Fixnum>] the quality scores
+  def qual_scores
+    self.each_byte.map { |b| b - 33 }
+  end
+end
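
`qual_scores` decodes Phred+33, meaning each quality character's byte value minus 33 is its score. The standalone sketch below repeats that decoding on a plain string and, as an extra step that is not in the gem, converts each score Q to an error probability using the standard Phred relation P = 10^(-Q/10). The function names `phred33_scores` and `error_probabilities` are assumptions made for illustration.

```ruby
# Hypothetical helpers, not part of parse_fasta.

# Decode a Phred+33 quality string into integer scores.
def phred33_scores(quality_string)
  quality_string.each_byte.map { |byte| byte - 33 }
end

# Convert Phred scores to error probabilities: P = 10^(-Q/10).
def error_probabilities(scores)
  scores.map { |q| 10.0**(-q / 10.0) }
end

scores = phred33_scores("!+5?I")
probs  = error_probabilities(scores)

puts scores.inspect # [0, 10, 20, 30, 40], as in the doc comment above
puts probs.inspect
```

A score of 20 therefore corresponds to roughly a 1 in 100 chance that the base call is wrong, and a score of 40 to roughly 1 in 10,000.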

data/lib/parse_fasta/sequence.rb CHANGED

@@ -19,14 +19,11 @@
 # Provide some methods for dealing with common tasks regarding
 # nucleotide sequences.
 class Sequence < String
-  def initialize(str)
-    super(str)
-  end
-  # Returns GC content for self
+  # Calculates GC content
   #
   # Calculates GC content by dividing count of G + C divided by count
-  # of G + C + T + A +U. If there are both T's and U's in the
+  # of G + C + T + A + U. If there are both T's and U's in the
   # Sequence, things will get weird, but then again, that wouldn't
   # happen, now would it!
   #
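
The comment above describes GC content as the count of G + C divided by the count of G + C + T + A + U. The body of `Sequence#gc` is not part of this hunk, so the following is only a sketch of that formula on a plain string. The helper name `gc_fraction`, the case-insensitive counting, and returning 0.0 when no countable bases are present are all assumptions of this sketch.

```ruby
# Hypothetical helper, not the gem's Sequence#gc implementation.
def gc_fraction(seq)
  gc    = seq.count('GgCc')        # G and C, either case
  total = seq.count('AaCcTtGgUu')  # every base named in the formula
  # Assumption: no countable bases means a fraction of 0.0.
  return 0.0 if total.zero?

  gc / total.to_f
end

puts gc_fraction('AACCTTGG') # 4 of 8 bases are G or C
puts gc_fraction('ACNNGT')   # N is ignored: 2 of 4 counted bases
```

Because characters such as N fall outside the denominator, ambiguous bases do not dilute the result in this sketch.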

data/lib/parse_fasta/version.rb CHANGED

@@ -17,5 +17,5 @@
 # along with parse_fasta.  If not, see <http://www.gnu.org/licenses/>.
 module ParseFasta
-  VERSION = "1.0.1"
+  VERSION = "1.1.0"
 end

data/lib/parse_fasta.rb CHANGED

@@ -18,4 +18,6 @@
 require 'parse_fasta/version'
 require 'parse_fasta/fasta_file'
+require 'parse_fasta/fastq_file'
 require 'parse_fasta/sequence'
+require 'parse_fasta/quality'

data/parse_fasta.gemspec CHANGED

@@ -19,8 +19,8 @@ Gem::Specification.new do |spec|
   spec.require_paths = ["lib"]
   spec.add_development_dependency "bundler", "~> 1.6"
-  spec.add_development_dependency "rake"
-  spec.add_development_dependency "rspec"
-  spec.add_development_dependency "bio"
-  spec.add_development_dependency "yard"
+  spec.add_development_dependency "rake", "~> 10.3"
+  spec.add_development_dependency "rspec", "~> 2.14"
+  spec.add_development_dependency "bio", "~> 1.4"
+  spec.add_development_dependency "yard", "~> 0.8"
 end
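
The gemspec change replaces unconstrained development dependencies with pessimistic (`~>`) constraints. As a small illustration of what `"~> 10.3"` accepts, the sketch below evaluates a few version numbers with `Gem::Requirement` from RubyGems. The version numbers being tested are arbitrary examples, not releases referenced by this diff.

```ruby
require 'rubygems'

# "~> 10.3" allows 10.3 and later, but stops before 11.0.
rake_req = Gem::Requirement.new('~> 10.3')

ok      = rake_req.satisfied_by?(Gem::Version.new('10.4.2'))
too_new = rake_req.satisfied_by?(Gem::Version.new('11.0.0'))
too_old = rake_req.satisfied_by?(Gem::Version.new('10.2.9'))

# The previous, implicit constraint (">= 0") accepted any version.
any_req   = Gem::Requirement.new('>= 0')
unbounded = any_req.satisfied_by?(Gem::Version.new('11.0.0'))

puts [ok, too_new, too_old, unbounded].inspect
```

The same rule applies to the other entries: `"~> 2.14"` for rspec allows 2.14 and later 2.x releases but not 3.0.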

data/spec/lib/fastq_file_spec.rb ADDED

@@ -0,0 +1,50 @@
+# Copyright 2014 Ryan Moore
+# Contact: moorer@udel.edu
+#
+# This file is part of parse_fasta.
+#
+# parse_fasta is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# parse_fasta is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with parse_fasta.  If not, see <http://www.gnu.org/licenses/>.
+require 'spec_helper'
+describe FastqFile do
+  describe "#each_record" do
+    let(:fname) { "#{File.dirname(__FILE__)}/../../test_files/test.fq" }
+    context "with a 4 line per record fastq file" do
+      before do
+        @records = []
+        FastqFile.open(fname, 'r').each_record do |head, seq, desc, qual|
+          @records << [head, seq, desc, qual]
+        end
+      end
+      it "yields the header, sequence, desc, and qual" do
+        expect(@records).to eq([["seq1", "AACCTTGG", "", ")#3gTqN8"],
+                               ["seq2 apples", "ACTG", "seq2 apples",
+                                "*ujM"]])
+      end
+      it "yields the sequence as a Sequence class" do
+        the_sequence = @records[0][1]
+        expect(the_sequence).to be_a(Sequence)
+      end
+      it "yields the quality string as a Quality class" do
+        the_quality = @records[0][3]
+        expect(the_quality).to be_a(Quality)
+      end
+    end
+  end
+end

data/spec/lib/quality.rb ADDED

@@ -0,0 +1,35 @@
+# Copyright 2014 Ryan Moore
+# Contact: moorer@udel.edu
+#
+# This file is part of parse_fasta.
+#
+# parse_fasta is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# parse_fasta is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with parse_fasta.  If not, see <http://www.gnu.org/licenses/>.
+require 'spec_helper'
+require 'bio'
+describe Quality do
+  let(:qual_string) { qual_string = Quality.new('ab%63:K') }
+  let(:bioruby_qual_scores) do
+    Bio::Fastq.new("@seq1\nACTGACT\n+\n#{qual_string}").quality_scores
+  end
+  describe "#qual_scores" do
+    context "with illumina style quality scores" do
+      it "returns an array of quality scores" do
+        expect(qual_string.qual_scores).to eq bioruby_qual_scores
+      end
+    end
+  end
+end

data/test_files/benchmark.rb CHANGED

@@ -54,17 +54,39 @@ def make_seq(num)
   num.times.reduce('') { |str, n| str << %w[A a C c T t G g N n].sample }
 end
-s1 = make_seq(2000000)
-s2 = make_seq(4000000)
-s3 = make_seq(8000000)
+# s1 = make_seq(2000000)
+# s2 = make_seq(4000000)
+# s3 = make_seq(8000000)
-Benchmark.bmbm do |x|
-  x.report('this_gc 1') { this_gc(s1) }
-  x.report('bioruby_gc 1') { bioruby_gc(s1) }
+# Benchmark.bmbm do |x|
+#   x.report('this_gc 1') { this_gc(s1) }
+#   x.report('bioruby_gc 1') { bioruby_gc(s1) }
+#   x.report('this_gc 2') { this_gc(s2) }
+#   x.report('bioruby_gc 2') { bioruby_gc(s2) }
+#   x.report('this_gc 3') { this_gc(s3) }
+#   x.report('bioruby_gc 3') { bioruby_gc(s3) }
+# end
-  x.report('this_gc 2') { this_gc(s2) }
-  x.report('bioruby_gc 2') { bioruby_gc(s2) }
+# fastq = ARGV.first
-  x.report('this_gc 3') { this_gc(s3) }
-  x.report('bioruby_gc 3') { bioruby_gc(s3) }
+def bioruby_fastq(fastq)
+  Bio::FlatFile.open(Bio::Fastq, fastq) do |fq|
+    fq.each do |entry|
+      [entry.definition, entry.seq.length].join("\t")
+    end
+  end
 end
+def this_fastq(fastq)
+  FastqFile.open(fastq).each_record do |head, seq, desc, qual|
+    [head, seq.length].join("\t")
+  end
+end
+# file is 4 million illumina reads (16,000,000 lines) 1.4gb
+# Benchmark.bmbm do |x|
+#   x.report('this_fastq') { this_fastq(ARGV.first) }
+#   x.report('bioruby_fastq') { bioruby_fastq(ARGV.first) }
+# end

data/test_files/test.fq ADDED

@@ -0,0 +1,8 @@
+@seq1
+AACCTTGG
++
+)#3gTqN8
+@seq2 apples
+ACTG
++seq2 apples
+*ujM

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: parse_fasta
 version: !ruby/object:Gem::Version
-  version: 1.0.1
+  version: 1.1.0
 platform: ruby
 authors:
 - Ryan Moore
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-06-04 00:00:00.000000000 Z
+date: 2014-06-13 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -28,58 +28,58 @@ dependencies:
   name: rake
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '10.3'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '10.3'
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '2.14'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '2.14'
 - !ruby/object:Gem::Dependency
   name: bio
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '1.4'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '1.4'
 - !ruby/object:Gem::Dependency
   name: yard
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '0.8'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ">="
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '0.8'
 description: So you want to parse a fasta file...
 email:
 - moorer@udel.edu
@@ -94,14 +94,19 @@ files:
 - Rakefile
 - lib/parse_fasta.rb
 - lib/parse_fasta/fasta_file.rb
+- lib/parse_fasta/fastq_file.rb
+- lib/parse_fasta/quality.rb
 - lib/parse_fasta/sequence.rb
 - lib/parse_fasta/version.rb
 - parse_fasta.gemspec
 - spec/lib/fasta_file_spec.rb
+- spec/lib/fastq_file_spec.rb
+- spec/lib/quality.rb
 - spec/lib/sequence_spec.rb
 - spec/spec_helper.rb
 - test_files/benchmark.rb
 - test_files/test.fa
+- test_files/test.fq
 homepage: https://github.com/mooreryan/parse_fasta
 licenses:
 - 'GPLv3: http://www.gnu.org/licenses/gpl.txt'
@@ -128,6 +133,8 @@ specification_version: 4
 summary: Easy-peasy parsing of fasta files
 test_files:
 - spec/lib/fasta_file_spec.rb
+- spec/lib/fastq_file_spec.rb
+- spec/lib/quality.rb
 - spec/lib/sequence_spec.rb
 - spec/spec_helper.rb
 has_rdoc: