parse_fasta 1.9.2 → 2.0.0

Files changed (48)

checksums.yaml +8 -8
data/.gitignore +1 -0
data/.rspec +2 -0
data/CHANGELOG.md +178 -0
data/README.md +42 -215
data/Rakefile +2 -4
data/bin/console +14 -0
data/bin/setup +8 -0
data/lib/parse_fasta/error.rb +39 -0
data/lib/parse_fasta/record.rb +88 -0
data/lib/parse_fasta/seq_file.rb +221 -114
data/lib/parse_fasta/version.rb +2 -2
data/lib/parse_fasta.rb +5 -20
data/spec/parse_fasta/record_spec.rb +115 -0
data/spec/parse_fasta/seq_file_spec.rb +238 -0
data/spec/parse_fasta_spec.rb +25 -0
data/spec/spec_helper.rb +2 -44
data/spec/test_files/cr.fa +1 -0
data/spec/test_files/cr.fa.gz +0 -0
data/spec/test_files/cr.fq +3 -0
data/spec/test_files/cr.fq.gz +0 -0
data/spec/test_files/cr_nl.fa +4 -0
data/spec/test_files/cr_nl.fa.gz +0 -0
data/spec/test_files/cr_nl.fq +8 -0
data/spec/test_files/cr_nl.fq.gz +0 -0
data/spec/test_files/multi_blob.fa.gz +0 -0
data/spec/test_files/multi_blob.fq.gz +0 -0
data/spec/test_files/not_a_seq_file.txt +1 -0
data/{test_files/bad.fa → spec/test_files/poorly_catted.fa} +0 -0
data/{test_files/test.fa → spec/test_files/seqs.fa} +0 -0
data/spec/test_files/seqs.fa.gz +0 -0
data/spec/test_files/seqs.fq +8 -0
data/spec/test_files/seqs.fq.gz +0 -0
metadata +49 -24
data/lib/parse_fasta/fasta_file.rb +0 -232
data/lib/parse_fasta/fastq_file.rb +0 -160
data/lib/parse_fasta/quality.rb +0 -54
data/lib/parse_fasta/sequence.rb +0 -174
data/spec/lib/fasta_file_spec.rb +0 -212
data/spec/lib/fastq_file_spec.rb +0 -143
data/spec/lib/quality_spec.rb +0 -51
data/spec/lib/seq_file_spec.rb +0 -357
data/spec/lib/sequence_spec.rb +0 -188
data/test_files/benchmark.rb +0 -99
data/test_files/bogus.txt +0 -2
data/test_files/test.fa.gz +0 -0
data/test_files/test.fq +0 -8
data/test_files/test.fq.gz +0 -0

checksums.yaml CHANGED

@@ -1,15 +1,15 @@
 ---
 !binary "U0hBMQ==":
   metadata.gz: !binary |-
-    NmM5ZWYwOGM5YWIxMzU2YjBmZTk4Y2I5YzI0NjY0MzUwM2YwMjgyOA==
+    YzliYjhmZmMzNGRlYmFmNDQwOGE2NGFmNzgyZTliZDdhMDdkMTc0Zg==
   data.tar.gz: !binary |-
-    NzI2NDY1MWZmYmUwNDUxMTk2MmI4YjgwYWVlYjcyZDI4MDUzMzk4NA==
+    OTgxOWFjYTEyMWI0MjNlNjBhZjJkNGZkMjFkZGFkZDNjNGJkNTk2NA==
 SHA512:
   metadata.gz: !binary |-
-    ODY1ZTQ1MzU4MTc2MDhhMjA0OThiYzM4Yzk4YjJiZjU4ZGY4MGM5NTRjYTE5
-    OWZkODk0M2ZmODE5ODY1MjE3NTQ5MzgyNTFjMTk2NzU2NGVjN2NkNGUzYzA3
-    ODliNjRlOGJjOGJhNjhlMWZmMmU1NjkyMjgwNzAyODQ1MDExOTI=
+    OGQxNTg4YzYyYzQyZGM2YjM0NzYyMjFiYzUwMTllYjM3NzZiZjViNTQwMWFi
+    NTI0NDk5NDY0NTc4YThhZTg4ODczYjAxZTA3MGNmZDdmMWYzNmMwMGFlMzhl
+    ODFhM2Q1NzIxZDVlYjE0MjEwYTg0OTlkMzlmZDQyYjIzYjhjNGQ=
   data.tar.gz: !binary |-
-    YWY4NWU3NDFiYTVmMmE1Y2MxMDI3ZjE3NTIyY2Q1N2Q2ZDQxM2ZlZjI4NjUy
-    MWM5OTZhNzEzZWNmMGVlYTQ1MDc1MzViMDBkOTQ0YzQyY2IxYjlmOGQwNzRh
-    YmIyOTg2Yjk0OTFlNWVhOGU3MTMzM2I1ZGY0ZjlkMzExZGNkZDk=
+    MTBlN2NmNmJkOGUwM2Q1MDZhZTkzM2NmMzNmOTY1YWUzMzVjNjdkN2NiMDM2
+    NTJlYmU5Yzk1ODExNzczMGNkNTFkNzEwOWZkZGIwMjRiMjNiNGY5ZGM0MDJk
+    ZmY3YWI0OGQwOTNiMzY2ODAzMzkxZjFkZmNiNTExMGE3NWFlZjk=

data/.gitignore CHANGED

@@ -21,3 +21,4 @@ tmp
 *.a
 mkmf.log
 .ruby-*
+.idea

data/.rspec ADDED

@@ -0,0 +1,2 @@
+--format documentation
+--color

data/CHANGELOG.md ADDED

@@ -0,0 +1,178 @@
+## Versions ##
+### 2.0.0 ###
+A weird feature of `Zlib::GzipReader` made it so that if a gzipped file was created like this:
+```bash
+gzip -c a.fa > z.fa.gz
+gzip -c b.fa >> z.fa.gz
+```
+Then the gzip reader would only read the lines from `a.fa` without some fiddling around. Since this was a pretty low level thing, I just decided to make a bunch of under the hood changes that I've been meaning to get to.
+#### Other things
+- Everything is namespaced under `ParseFasta` module
+- Removed `FastaFile` and `FastqFile` classes, `SeqFile` only remains
+- Removed `Sequence` and `Quality` classes. These might get put back in at some point, but I almost never used them anyway
+- `SeqFile#each_record` yields a `Record` object so you can use the same code to parse fastA and fastQ files
+- Other stuff that I'm forgetting!
+### 1.9.2 ###
+Speed up fastA `each_record` and `each_record_fast`.
+### 1.9.1 ###
+Speed up fastQ `each_record` and `each_record_fast`. Courtesy of
+[Matthew Ralston](https://github.com/MatthewRalston).
+### 1.9.0 ###
+Added "fast" versions of `each_record` methods
+(`each_record_fast`). Basically, they return sequences and quality
+strings as Ruby `String` objects instead of as `Sequence` or `Quality`
+objects. Also, if the sequence or quality string has spaces, they will
+be retained. If this is a problem, use the original `each_record`
+methods.
+### 1.8.2 ###
+Speed up `FastqFile#each_record`.
+### 1.8.1 ###
+An error will be raised if a fasta file has a `>` in the
+sequence. Sometimes files are not terminated with a newline
+character. If this is the case, then catting two fasta files will
+smush the first header of the second file right in with the last
+sequence of the first file. This is bad, raise an error! ;)
+Example
+    >seq1
+    ACTG>seq2
+    ACTG
+    >seq3
+    ACTG
+This will raise `ParseFasta::SequenceFormatError`.
+Also, headers with lots of `>` within are fine now.
+### 1.8 ###
+Add `Sequence#rev_comp`. It can handle IUPAC characters. Since
+`parse_fasta` doesn't check whether the seq is AA or NA, if called on
+an amino acid string, things will get weird as it will complement the
+IUPAC characters in the AA string and leave others.
+### 1.7.2 ###
+Strip spaces (not all whitespace) from `Sequence` and `Quality` strings.
+Some alignment fastas have spaces for easier reading. Strip these
+out. For consistency, also strips spaces from `Quality` strings. If
+there are spaces that don't match in the quality and sequence in a
+fastQ file, then things will get messed up in the FastQ file. FastQ
+shouldn't have spaces though.
+### 1.7 ###
+Add `SeqFile#to_hash`, `FastaFile#to_hash` and `FastqFile#to_hash`.
+### 1.6.2 ###
+`FastaFile::open` now raises a `ParseFasta::DataFormatError` when passed files
+that don't begin with a `>`.
+### 1.6.1 ###
+Better internal handling of empty sequences -- instead of raising
+errors, pass empty sequences.
+### 1.6 ###
+Added `SeqFile` class, which accepts either fastA or fastQ files. It
+uses FastaFile and FastqFile internally. You can use this class if you
+want your scripts to accept either fastA or fastQ files.
+If you need the description and quality string, you should use
+FastqFile instead.
+### 1.5 ###
+Now accepts gzipped files. Huzzah!
+### 1.4 ###
+Added methods:
+    Sequence.base_counts
+	Sequence.base_frequencies
+### 1.3 ###
+Add additional functionality to `each_record` method.
+#### Info ####
+I often like to use the fasta format for other things like so
+	>fruits
+	pineapple
+	pear
+	peach
+	>veggies
+	peppers
+	parsnip
+	peas
+rather than having this in a two column file like this
+	fruit,pineapple
+	fruit,pear
+	fruit,peach
+	veggie,peppers
+	veggie,parsnip
+	veggie,peas
+So I added functionality to `each_record` to keep each line a record
+separate in an array. Here's an example using the above file.
+    info = []
+	FastaFile.open(f, 'r').each_record(1) do |header, lines|
+	  info << [header, lines]
+	end
+Then info will contain the following arrays
+	['fruits', ['pineapple', 'pear', 'peach']],
+	['veggies', ['peppers', 'parsnip', 'peas']]
+### 1.2 ###
+Added `mean_qual` method to the `Quality` class.
+### 1.1.2 ###
+Dropped Ruby requirement to 1.9.3
+(Note, if you want to build the docs with yard and you're using
+Ruby 1.9.3, you may have to install the redcarpet gem.)
+### 1.1 ###
+Added: Fastq and Quality classes
+### 1.0 ###
+Added: Fasta and Sequence classes
+Removed: File monkey patch
+### 0.0.5 ###
+Last version with File monkey patch.
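
The 2.0.0 entry above describes gzip files built by appending one compressed stream to another. A minimal sketch of one common way to read every member using only Ruby's standard `zlib` is shown here. The helper name `read_all_gzip_members` is hypothetical, and this is not necessarily how `seq_file.rb` handles it.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper (not part of parse_fasta): reads every gzip member
# from io and returns the concatenated decompressed data.
def read_all_gzip_members(io)
  out = +""
  loop do
    gz = Zlib::GzipReader.new(io)
    out << gz.read
    unused = gz.unused        # bytes consumed past this member, or nil
    gz.finish                 # close the reader but leave io open
    break if unused.nil?
    io.pos -= unused.length   # rewind to the start of the next member
  end
  out
end

# Simulate `gzip -c a.fa > z.fa.gz; gzip -c b.fa >> z.fa.gz` in memory.
blob = Zlib.gzip(">a\nACTG\n") + Zlib.gzip(">b\nGGCC\n")

everything = read_all_gzip_members(StringIO.new(blob))
puts everything
```

The loop relies on `Zlib::GzipReader#unused` to find out how far the reader overshot the end of the current member, so the underlying IO must support repositioning via `pos=`.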

data/README.md CHANGED

@@ -1,4 +1,4 @@
-# parse_fasta #
+# ParseFasta #
 [![Gem Version](https://badge.fury.io/rb/parse_fasta.svg)](http://badge.fury.io/rb/parse_fasta) [![Build Status](https://travis-ci.org/mooreryan/parse_fasta.svg?branch=master)](https://travis-ci.org/mooreryan/parse_fasta) [![Coverage Status](https://coveralls.io/repos/mooreryan/parse_fasta/badge.svg)](https://coveralls.io/r/mooreryan/parse_fasta)
@@ -8,7 +8,9 @@ So you want to parse a fasta file...
 Add this line to your application's Gemfile:
-    gem 'parse_fasta'
+```ruby
+gem 'parse_fasta'
+```
 And then execute:
@@ -20,9 +22,7 @@ Or install it yourself as:
 ## Overview ##
-Provides nice, programmatic access to fasta and fastq files, as well
-as providing Sequence and Quality helper classes. It's more
-lightweight than BioRuby. And more fun! ;)
+Provides nice, programmatic access to fasta and fastq files. It's faster and more lightweight than BioRuby. And more fun!
 ## Documentation ##
@@ -32,213 +32,40 @@ for the full api documentation.
 ## Usage ##
-Some examples...
-A little script to print header and length of each record.
-	require 'parse_fasta'
-	FastaFile.open(ARGV[0]).each_record do |header, sequence|
-	  puts [header, sequence.length].join("\t")
-	end
-And here, a script to calculate GC content:
-	FastaFile.open(ARGV[0]).each_record do |header, sequence|
-	  puts [header, sequence.gc].join("\t")
-	end
-Now we can parse fastq files as well!
-	FastqFile.open(ARGV[0]).each_record do |head, seq, desc, qual|
-	  puts [header, qual.qual_scores.join(',')].join("\t")
-	end
-What if you don't care if the input is a fastA or a fastQ? No problem!
-	SeqFile.open(ARGV[0]).each_record do |head, seq|
-	  puts [header, seq].join "\t"
-	end
-Read fasta file into a hash.
-    seqs = FastaFile.open(ARGV[0]).to_hash
-## Versions ##
-### 1.9.2 ###
-Speed up fastA `each_record` and `each_record_fast`.
-### 1.9.1 ###
-Speed up fastQ `each_record` and `each_record_fast`. Courtesy of
-[Matthew Ralston](https://github.com/MatthewRalston).
-### 1.9.0 ###
-Added "fast" versions of `each_record` methods
-(`each_record_fast`). Basically, they return sequences and quality
-strings as Ruby `Sring` objects instead of aa `Sequence` or `Quality`
-objects. Also, if the sequence or quality string has spaces, they will
-be retained. If this is a problem, use the original `each_record`
-methods.
-### 1.8.2 ###
-Speed up `FastqFile#each_record`.
-### 1.8.1 ###
-An error will be raised if a fasta file has a `>` in the
-sequence. Sometimes files are not terminated with a newline
-character. If this is the case, then catting two fasta files will
-smush the first header of the second file right in with the last
-sequence of the first file. This is bad, raise an error! ;)
-Example
-    >seq1
-    ACTG>seq2
-    ACTG
-    >seq3
-    ACTG
-This will raise `ParseFasta::SequenceFormatError`.
-Also, headers with lots of `>` within are fine now.
-### 1.8 ###
-Add `Sequence#rev_comp`. It can handle IUPAC characters. Since
-`parse_fasta` doesn't check whether the seq is AA or NA, if called on
-an amino acid string, things will get weird as it will complement the
-IUPAC characters in the AA string and leave others.
-### 1.7.2 ###
-Strip spaces (not all whitespace) from `Sequence` and `Quality` strings.
-Some alignment fastas have spaces for easier reading. Strip these
-out. For consistency, also strips spaces from `Quality` strings. If
-there are spaces that don't match in the quality and sequence in a
-fastQ file, then things will get messed up in the FastQ file. FastQ
-shouldn't have spaces though.
-### 1.7 ###
-Add `SeqFile#to_hash`, `FastaFile#to_hash` and `FastqFile#to_hash`.
-### 1.6.2 ###
-`FastaFile::open` now raises a `ParseFasta::DataFormatError` when passed files
-that don't begin with a `>`.
-### 1.6.1 ###
-Better internal handling of empty sequences -- instead of raising
-errors, pass empty sequences.
-### 1.6 ###
-Added `SeqFile` class, which accepts either fastA or fastQ files. It
-uses FastaFile and FastqFile internally. You can use this class if you
-want your scripts to accept either fastA or fastQ files.
-If you need the description and quality string, you should use
-FastqFile instead.
-### 1.5 ###
-Now accepts gzipped files. Huzzah!
-### 1.4 ###
-Added methods:
-    Sequence.base_counts
-	Sequence.base_frequencies
-### 1.3 ###
-Add additional functionality to `each_record` method.
-#### Info ####
-I often like to use the fasta format for other things like so
-	>fruits
-	pineapple
-	pear
-	peach
-	>veggies
-	peppers
-	parsnip
-	peas
-rather than having this in a two column file like this
-	fruit,pineapple
-	fruit,pear
-	fruit,peach
-	veggie,peppers
-	veggie,parsnip
-	veggie,peas
-So I added functionality to `each_record` to keep each line a record
-separate in an array. Here's an example using the above file.
-    info = []
-	FastaFile.open(f, 'r').each_record(1) do |header, lines|
-	  info << [header, lines]
-	end
-Then info will contain the following arrays
-	['fruits', ['pineapple', 'pear', 'peach']],
-	['veggies', ['peppers', 'parsnip', 'peas']]
-### 1.2 ###
-Added `mean_qual` method to the `Quality` class.
-### 1.1.2 ###
-Dropped Ruby requirement to 1.9.3
-(Note, if you want to build the docs with yard and you're using
-Ruby 1.9.3, you may have to install the redcarpet gem.)
-### 1.1 ###
-Added: Fastq and Quality classes
-### 1.0 ###
-Added: Fasta and Sequence classes
-Removed: File monkey patch
-### 0.0.5 ###
-Last version with File monkey patch.
-## Benchmark ##
-Some quick and dirty benchmarks against `BioRuby`.
-### FastaFile#each_record ###
-You can see the test script in `benchmark.rb`.
-                           user     system      total        real
-    parse_fasta        1.920000   0.160000   2.080000 (  2.145932)
-    parse_fasta fast   1.210000   0.160000   1.370000 (  1.377770)
-    bioruby            4.330000   0.290000   4.620000 (  4.655567)
-Hot dog! It's faster :)
-## Notes ##
-Only the `SeqFile` class actually checks to make sure that you passed
-in a "proper" fastA or fastQ file, so watch out.
+Here are some examples of using ParseFasta. Don't forget to `require "parse_fasta"` at the top of your program!
+Print header and length of each record.
+```ruby
+ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
+  puts [rec.header, rec.seq.length].join "\t"
+end
+```
+You can parse fastQ files in exactly the same way.
+```ruby
+ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
+  printf "Header: %s, Sequence: %s, Description: %s, Quality: %s\n",
+	     rec.header,
+	     rec.seq,
+	     rec.desc,
+	     rec.qual
+end
+```
+The `Record#desc` and `Record#qual` will be `nil` if the file you are parsing is a fastA file.
+```ruby
+ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
+  if rec.qual
+    puts "@#{rec.header}"
+    puts rec.seq
+    puts "+#{rec.desc}"
+    puts rec.qual
+  else
+    puts ">#{rec.header}"
+    puts rec.seq
+  end
+end
+```

data/Rakefile CHANGED

@@ -1,8 +1,6 @@
 require "bundler/gem_tasks"
 require "rspec/core/rake_task"
-RSpec::Core::RakeTask.new
-task default: :spec
-task test: :spec
+RSpec::Core::RakeTask.new(:spec)
+task :default => :spec

data/bin/console ADDED

@@ -0,0 +1,14 @@
+#!/usr/bin/env ruby
+require "bundler/setup"
+require "parse_fasta"
+# You can add fixtures and/or initialization code here to make experimenting
+# with your gem easier. You can also use a different console, if you like.
+# (If you use this, don't forget to add pry to your Gemfile!)
+# require "pry"
+# Pry.start
+require "irb"
+IRB.start

data/bin/setup ADDED

@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
+set -vx
+bundle install
+# Do any other automated setup that you need to do here

data/lib/parse_fasta/error.rb ADDED

@@ -0,0 +1,39 @@
+# Copyright 2014 - 2016 Ryan Moore
+# Contact: moorer@udel.edu
+#
+# This file is part of parse_fasta.
+#
+# parse_fasta is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# parse_fasta is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with parse_fasta.  If not, see <http://www.gnu.org/licenses/>.
+module ParseFasta
+  # Contains the Error classes that ParseFasta API will raise
+  module Error
+    # All ParseFasta errors inherit from ParseFastaError
+    class ParseFastaError < StandardError
+    end
+    # Raised when the input file doesn't look like fastA or fastQ
+    class DataFormatError < ParseFastaError
+    end
+    # Raised when the file is not found
+    class FileNotFoundError < ParseFastaError
+    end
+    # Raised when fastA sequences have a '>' in them
+    class SequenceFormatError < ParseFastaError
+    end
+  end
+end
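
Because every error class in this file inherits from `ParseFastaError`, a caller can rescue one specific error or the whole family. The sketch below illustrates that. It redefines the same hierarchy as a stand-in so it runs without the gem installed, and the `classify` helper is hypothetical.

```ruby
# Stand-in definitions mirroring the hierarchy in error.rb, so this
# sketch runs without the gem installed.
module ParseFasta
  module Error
    class ParseFastaError < StandardError; end
    class DataFormatError < ParseFastaError; end
    class FileNotFoundError < ParseFastaError; end
    class SequenceFormatError < ParseFastaError; end
  end
end

# Hypothetical helper: runs the block and maps any ParseFasta error
# it raises to a short label.
def classify
  yield
  :ok
rescue ParseFasta::Error::SequenceFormatError
  :bad_sequence
rescue ParseFasta::Error::ParseFastaError
  :other_parse_fasta_error
end

puts classify { raise ParseFasta::Error::SequenceFormatError, "'>' in seq" }
puts classify { raise ParseFasta::Error::DataFormatError, "not fastA or fastQ" }
puts classify { "no error" }
```

Listing the more specific `rescue` clause first matters, since Ruby tries the clauses in order and the base class would otherwise match every subclass.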

data/lib/parse_fasta/record.rb ADDED

@@ -0,0 +1,88 @@
+# Copyright 2014 - 2016 Ryan Moore
+# Contact: moorer@udel.edu
+#
+# This file is part of parse_fasta.
+#
+# parse_fasta is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# parse_fasta is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with parse_fasta.  If not, see <http://www.gnu.org/licenses/>.
+module ParseFasta
+  class Record
+    # @!attribute header
+    #   @return [String] the full header of the record without the '>'
+    #     or '@'
+    # @!attribute seq
+    #   @return [String] the sequence of the record
+    # @!attribute desc
+    #   @return [String or Nil] if the record is from a fastA file, it
+    #     is nil; else, the description line of the fastQ record
+    # @!attribute qual
+    #   @return [String or Nil] if the record is from a fastA file, it
+    #     is nil; else, the quality string of the fastQ record
+    attr_accessor :header, :seq, :desc, :qual
+    # The constructor takes keyword args.
+    #
+    # @example Init a new Record object for a fastA record
+    #   Record.new header: "apple", seq: "actg"
+    # @example Init a new Record object for a fastQ record
+    #   Record.new header: "apple", seq: "actd", desc: "", qual: "IIII"
+    #
+    # @param header [String] the header of the record
+    # @param seq [String] the sequence of the record
+    # @param desc [String] the description line of a fastQ record
+    # @param qual [String] the quality string of a fastQ record
+    #
+    # @raise [SequenceFormatError] if a fastA sequence has a '>'
+    #   character in it
+    def initialize args = {}
+      @header = args.fetch :header
+      @desc = args.fetch :desc, nil
+      @qual = args.fetch :qual, nil
+      @qual.gsub!(/\s+/, "") if @qual
+      seq = args.fetch(:seq).gsub(/\s+/, "")
+      if @qual # is fastQ
+        @seq = seq
+      else # is fastA
+        @seq = check_fasta_seq(seq)
+      end
+    end
+    # Compare attrs of this rec with another
+    #
+    # @param rec [Record] a Record object to compare with
+    #
+    # @return [Bool] true or false
+    def == rec
+      self.header == rec.header && self.seq == rec.seq &&
+          self.desc == rec.desc && self.qual == rec.qual
+    end
+    private
+    def check_fasta_seq seq
+      if seq.match ">"
+        raise ParseFasta::Error::SequenceFormatError,
+              "A sequence contained a '>' character " +
+                  "(the fastA file record separator)"
+      else
+        seq
+      end
+    end
+  end
+end
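
One detail of `Record#initialize` as shown above: the quality string is cleaned with the mutating `gsub!`, while the sequence goes through the non-mutating `gsub`. The plain-Ruby sketch below (no gem needed, variable names are illustrative) shows the difference between the two calls. A caller passing a frozen `qual:` string that contains whitespace would presumably hit the error shown in the last step.

```ruby
# Non-mutating gsub returns a new string and leaves the original alone.
seq_in = "AC TG"
seq_out = seq_in.gsub(/\s+/, "")

# Mutating gsub! edits the receiver in place, so the caller's string changes.
qual_in = "II II"
qual_in.gsub!(/\s+/, "")

# A frozen receiver cannot be edited in place.
frozen = "II II".freeze
frozen_result =
  begin
    frozen.gsub!(/\s+/, "")
    :modified
  rescue RuntimeError # FrozenError is a RuntimeError subclass on Ruby >= 2.5
    :raised
  end

puts [seq_in.inspect, seq_out.inspect, qual_in.inspect, frozen_result].join(" ")
```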