parse_fasta 1.0.1 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 81e3eddaba7422d44ae3235f5c9a997f9fc9ea96
4
- data.tar.gz: 9fce00b1a06d89911ded694f3727eaca9dbda59e
3
+ metadata.gz: 964e83c4a500490c9c5b9cb6f709fe7405b9f856
4
+ data.tar.gz: eca56db94b3e639849699da0baf25e06ceedbbd6
5
5
  SHA512:
6
- metadata.gz: ce14854b609f406cbafba53f6e523e17ab9e7406d0073e303f1b6d634e6e8e213baf0f2f648d57159af48dcc7b1859bcac426a9b081eb7e069cc5ace43fe5030
7
- data.tar.gz: d1bf3eb669efe8d5747315315b1cbb7370d70ed0614e64ea44beb9fd08990554e4609364ce623a54e896e74becd222c124da92ff5a716acfccc55b9887f5cf24
6
+ metadata.gz: b0ae81b8ccf5bf0867b4457a7ceb65fecfe7b23bfb489132e4547bf62207707ee94ae0085444036c4683d9b5cc05cbeb1e177fc53c67dc249aedc4432cde41bb
7
+ data.tar.gz: 39a9dea187dae22817b8fdcfc5f1c67ed2e8357abc3d89631544f8d94256df1f6c2aec8d07d957c792adbd039e18968f91c8c133c1e53608a38623896d73d40d
data/README.md CHANGED
@@ -18,17 +18,20 @@ Or install it yourself as:
18
18
 
19
19
  ## Overview ##
20
20
 
21
- I wanted a simple, fast way to parse fasta files so I wouldn't have to
22
- keep writing annoying boilerplate fasta parsing code everytime I go to
23
- do something with one. I will probably add more, but likely only tasks
24
- that I find myself doing over and over.
21
+ I wanted a simple, fast way to parse fasta and fastq files so I
22
+ wouldn't have to keep writing annoying boilerplate parsing code
23
+ every time I go to do something with a fasta or fastq file. I will
24
+ probably add more, but likely only tasks that I find myself doing over
25
+ and over.
25
26
 
26
- ## Usage ##
27
+ ## Documentation ##
28
+
29
+ Check out [parse_fasta docs](http://rubydoc.info/gems/parse_fasta/1.1.0/frames) to see
30
+ the full documentation.
27
31
 
28
- ### Version 1.0.0 (current) ###
32
+ ## Usage ##
29
33
 
30
- The monkey patch of the `File` class is no more! Here is the new print
31
- length example:
34
+ A little script to print header and length of each record.
32
35
 
33
36
  require 'parse_fasta'
34
37
 
@@ -38,28 +41,37 @@ length example:
38
41
 
39
42
  And here, a script to calculate GC content:
40
43
 
41
- require 'parse_fasta'
42
-
43
44
  FastaFile.open(ARGV.first, 'r').each_record do |header, sequence|
44
45
  puts [header, sequence.gc].join("\t")
45
46
  end
46
47
 
47
- ### Version 0.0.5 (old) ###
48
+ Now we can parse fastq files as well!
48
49
 
49
- An example that lists the length for each sequence. (Won't work in
50
- version 1.0.0)
50
+ FastqFile.open(ARGV.first, 'r').each_record do |head, seq, desc, qual|
51
+ puts [head, seq, desc, qual.qual_scores.join(',')].join("\t")
52
+ end
51
53
 
52
- require 'parse_fasta'
54
+ ## Versions ##
53
55
 
54
- File.open(ARGV.first, 'r').each_record do |header, sequence|
55
- puts [header, sequence.length].join("\t")
56
- end
56
+ ### 1.1.0 ###
57
+
58
+ Added: Fastq and Quality classes
59
+
60
+ ### 1.0.0 ###
61
+
62
+ Added: Fasta and Sequence classes
63
+
64
+ Removed: File monkey patch
65
+
66
+ ### 0.0.5 ###
67
+
68
+ Last version with File monkey patch.
57
69
 
58
70
  ## Benchmark ##
59
71
 
60
- Take these with a grain of salt since `BioRuby` is a heavy weight
72
+ Take these with a grain of salt since `BioRuby` is a big
61
73
  module with lots of features and error checking, whereas `parse_fasta`
62
- is meant to be lightweight and easy to use for my own coding.
74
+ is meant to be lightweight and easy to use for my own research.
63
75
 
64
76
  ### FastaFile#each_record ###
65
77
 
@@ -78,12 +90,21 @@ was 1.1 gigabytes. Here are the results from Ruby's `Benchmark` class:
78
90
  I just wanted a nice, clean way to parse fasta files, but being nearly
79
91
  twice as fasta as BioRuby doesn't hurt either!
80
92
 
93
+ ### FastqFile#each_record ###
94
+
95
+ The same sequence length test as above, but this time with a fastq
96
+ file containing 4,000,000 illumina reads.
97
+
98
+                     user     system      total        real
99
+ this_fastq      62.610000   1.660000  64.270000 ( 64.389408)
100
+ bioruby_fastq  165.500000   2.100000 167.600000 (167.969636)
101
+
81
102
  ### Sequence#gc ###
82
103
 
83
104
  I played around with a few different implementations for the `#gc`
84
105
  method and found this one to be the fastest.
85
106
 
86
- The test is done one random strings mating `/[AaCcTtGgUu]/`. `this_gc`
107
+ The test is done on random strings matching `/[AaCcTtGgUu]/`. `this_gc`
87
108
  is `Sequence.new(str).gc`, and `bioruby_gc` is
88
109
  `Bio::Sequence::NA.new(str).gc_content`.
89
110
 
@@ -0,0 +1,65 @@
1
+ # Copyright 2014 Ryan Moore
2
+ # Contact: moorer@udel.edu
3
+ #
4
+ # This file is part of parse_fasta.
5
+ #
6
+ # parse_fasta is free software: you can redistribute it and/or modify
7
+ # it under the terms of the GNU General Public License as published by
8
+ # the Free Software Foundation, either version 3 of the License, or
9
+ # (at your option) any later version.
10
+ #
11
+ # parse_fasta is distributed in the hope that it will be useful,
12
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ # GNU General Public License for more details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License
17
+ # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ # Provides simple interface for parsing four-line-per-record fastq
20
+ # format files.
21
+ class FastqFile < File
22
+
23
+ # Analogous to File#each_line, #each_record is used to go through a
24
+ # fastq file record by record.
25
+ #
26
+ # @example Parsing a fastq file
27
+ # FastqFile.open('reads.fq', 'r').each_record do |head, seq, desc, qual|
28
+ # # do some fun stuff here!
29
+ # end
30
+ #
31
+ # @yield The header, sequence, description and quality string for
32
+ # each record in the fastq file to the block
33
+ # @yieldparam header [String] The header of the fastq record without
34
+ # the leading '@'
35
+ # @yieldparam sequence [Sequence] The sequence of the fastq record
36
+ # @yieldparam description [String] The description line of the fastq
37
+ # record without the leading '+'
38
+ # @yieldparam quality [Quality] The quality string of the fastq
39
+ # record
40
+ def each_record
41
+ count = 0
42
+ header = ''
43
+ sequence = ''
44
+ description = ''
45
+ quality = ''
46
+
47
+ self.each_line do |line|
48
+ line.chomp!
49
+
50
+ case count % 4
51
+ when 0
52
+ header = line.sub(/^@/, '')
53
+ when 1
54
+ sequence = Sequence.new(line)
55
+ when 2
56
+ description = line.sub(/^\+/, '')
57
+ when 3
58
+ quality = Quality.new(line)
59
+ yield(header, sequence, description, quality)
60
+ end
61
+
62
+ count += 1
63
+ end
64
+ end
65
+ end
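Reviewer sketch for the hunk above: the `case count % 4` loop treats every four lines as one record. The same idea can be tried without the gem by slicing an `IO` into groups of four lines. The helper name `each_fastq_record` and the sample data are illustrative assumptions, not part of parse_fasta, and this version yields plain `String`s rather than `Sequence` and `Quality` objects.

```ruby
require 'stringio'

# Hypothetical helper (not part of parse_fasta): yields header, sequence,
# description and quality for each four-line fastq record read from io.
def each_fastq_record(io)
  return enum_for(:each_fastq_record, io) unless block_given?

  io.each_line.each_slice(4) do |head, seq, desc, qual|
    # A trailing partial record (fewer than four lines) is skipped.
    next if qual.nil?

    yield head.chomp.sub(/^@/, ''),
          seq.chomp,
          desc.chomp.sub(/^\+/, ''),
          qual.chomp
  end
end

# Same two records as test_files/test.fq in this diff.
SAMPLE_FASTQ = [
  '@seq1', 'AACCTTGG', '+', ')#3gTqN8',
  '@seq2 apples', 'ACTG', '+seq2 apples', '*ujM'
].join("\n")

SAMPLE_RECORDS = each_fastq_record(StringIO.new(SAMPLE_FASTQ)).to_a
SAMPLE_RECORDS.each { |record| puts record.join("\t") }
```

Like the gem's loop, this assumes strictly four lines per record; fastq files with wrapped sequence or quality lines would need a different parser.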
@@ -0,0 +1,33 @@
1
+ # Copyright 2014 Ryan Moore
2
+ # Contact: moorer@udel.edu
3
+ #
4
+ # This file is part of parse_fasta.
5
+ #
6
+ # parse_fasta is free software: you can redistribute it and/or modify
7
+ # it under the terms of the GNU General Public License as published by
8
+ # the Free Software Foundation, either version 3 of the License, or
9
+ # (at your option) any later version.
10
+ #
11
+ # parse_fasta is distributed in the hope that it will be useful,
12
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ # GNU General Public License for more details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License
17
+ # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ # Provide some methods for dealing with common tasks regarding
20
+ # quality strings.
21
+ class Quality < String
22
+
23
+ # Returns an array of illumina style quality scores. The quality
24
+ # scores generated will be Phred+33.
25
+ #
26
+ # @example Get quality score array of a Quality
27
+ # Quality.new("!+5?I").qual_scores #=> [0, 10, 20, 30, 40]
28
+ #
29
+ # @return [Array<Fixnum>] the quality scores
30
+ def qual_scores
31
+ self.each_byte.map { |b| b - 33 }
32
+ end
33
+ end
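Reviewer sketch for the hunk above: `qual_scores` subtracts 33 from each byte, which is the Phred+33 encoding. The standalone version below repeats that step and, as an extra that the gem does not provide, converts each score Q to an error probability with the standard Phred relation p = 10^(-Q/10). The names `phred33_scores` and `error_probabilities` are assumptions made for this example.

```ruby
# Offset used by the Phred+33 quality encoding.
PHRED_OFFSET = 33

# Hypothetical helper mirroring Quality#qual_scores on a plain String.
def phred33_scores(quality_string)
  quality_string.each_byte.map { |byte| byte - PHRED_OFFSET }
end

# Hypothetical helper: converts Phred scores to error probabilities
# using p = 10 ** (-Q / 10).
def error_probabilities(scores)
  scores.map { |q| 10.0**(-q / 10.0) }
end

EXAMPLE_SCORES = phred33_scores('!+5?I')
puts EXAMPLE_SCORES.inspect
puts error_probabilities(EXAMPLE_SCORES).inspect
```

The input `'!+5?I'` is the one used in the `@example` tag of the diff, so the first line printed should match the documented `[0, 10, 20, 30, 40]`.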
@@ -19,14 +19,11 @@
19
19
  # Provide some methods for dealing with common tasks regarding
20
20
  # nucleotide sequences.
21
21
  class Sequence < String
22
- def initialize(str)
23
- super(str)
24
- end
25
22
 
26
- # Returns GC content for self
23
+ # Calculates GC content
27
24
  #
28
25
  # Calculates GC content by dividing count of G + C divided by count
29
- # of G + C + T + A +U. If there are both T's and U's in the
26
+ # of G + C + T + A + U. If there are both T's and U's in the
30
27
  # Sequence, things will get weird, but then again, that wouldn't
31
28
  # happen, now would it!
32
29
  #
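Reviewer sketch for the comment above: the formula it describes is (G + C) / (G + C + T + A + U). The diff does not show the body of `Sequence#gc`, so the following is only a minimal reading of that comment on a plain `String`. The name `gc_fraction`, the case-insensitive counting, and the `0.0` result for an empty denominator are all assumptions.

```ruby
# Hypothetical helper (not the gem's implementation): GC content as
# count(G + C) divided by count(G + C + T + A + U), ignoring case.
# Other characters, such as N, are left out of both counts.
def gc_fraction(seq)
  bases = seq.downcase
  gc    = bases.count('gc')
  total = gc + bases.count('atu')
  # Assumption: return 0.0 rather than dividing by zero.
  total.zero? ? 0.0 : gc / total.to_f
end

puts gc_fraction('ACTG')
puts gc_fraction('augc')
```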
@@ -17,5 +17,5 @@
17
17
  # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
18
 
19
19
  module ParseFasta
20
- VERSION = "1.0.1"
20
+ VERSION = "1.1.0"
21
21
  end
data/lib/parse_fasta.rb CHANGED
@@ -18,4 +18,6 @@
18
18
 
19
19
  require 'parse_fasta/version'
20
20
  require 'parse_fasta/fasta_file'
21
+ require 'parse_fasta/fastq_file'
21
22
  require 'parse_fasta/sequence'
23
+ require 'parse_fasta/quality'
data/parse_fasta.gemspec CHANGED
@@ -19,8 +19,8 @@ Gem::Specification.new do |spec|
19
19
  spec.require_paths = ["lib"]
20
20
 
21
21
  spec.add_development_dependency "bundler", "~> 1.6"
22
- spec.add_development_dependency "rake"
23
- spec.add_development_dependency "rspec"
24
- spec.add_development_dependency "bio"
25
- spec.add_development_dependency "yard"
22
+ spec.add_development_dependency "rake", "~> 10.3"
23
+ spec.add_development_dependency "rspec", "~> 2.14"
24
+ spec.add_development_dependency "bio", "~> 1.4"
25
+ spec.add_development_dependency "yard", "~> 0.8"
26
26
  end
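Reviewer sketch for the hunk above: the development dependencies move from unconstrained to pessimistic (`~>`) constraints. RubyGems' own `Gem::Requirement` and `Gem::Version` classes show what each constraint admits. The helper `allowed?` and the probe version numbers are made up for illustration.

```ruby
require 'rubygems'

# Hypothetical helper: does the given version satisfy the constraint?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# "~> 10.3" admits 10.3 and later 10.x releases, but not 11.0.
RAKE_MINOR_BUMP = allowed?('~> 10.3', '10.4.2')
RAKE_MAJOR_BUMP = allowed?('~> 10.3', '11.0.0')

# "~> 2.14" admits later 2.x releases, but not 3.0 or anything below 2.14.
RSPEC_MINOR_BUMP = allowed?('~> 2.14', '2.99.0')
RSPEC_MAJOR_BUMP = allowed?('~> 2.14', '3.0.0')
RSPEC_TOO_OLD    = allowed?('~> 2.14', '2.13.0')

puts [RAKE_MINOR_BUMP, RAKE_MAJOR_BUMP,
      RSPEC_MINOR_BUMP, RSPEC_MAJOR_BUMP, RSPEC_TOO_OLD].inspect
```

With two-segment versions such as these, `~>` pins the major version, so a future rspec 3 or rake 11 would not be picked up for development.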
@@ -0,0 +1,50 @@
1
+ # Copyright 2014 Ryan Moore
2
+ # Contact: moorer@udel.edu
3
+ #
4
+ # This file is part of parse_fasta.
5
+ #
6
+ # parse_fasta is free software: you can redistribute it and/or modify
7
+ # it under the terms of the GNU General Public License as published by
8
+ # the Free Software Foundation, either version 3 of the License, or
9
+ # (at your option) any later version.
10
+ #
11
+ # parse_fasta is distributed in the hope that it will be useful,
12
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ # GNU General Public License for more details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License
17
+ # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ require 'spec_helper'
20
+
21
+ describe FastqFile do
22
+ describe "#each_record" do
23
+ let(:fname) { "#{File.dirname(__FILE__)}/../../test_files/test.fq" }
24
+
25
+ context "with a 4 line per record fastq file" do
26
+ before do
27
+ @records = []
28
+ FastqFile.open(fname, 'r').each_record do |head, seq, desc, qual|
29
+ @records << [head, seq, desc, qual]
30
+ end
31
+ end
32
+
33
+ it "yields the header, sequence, desc, and qual" do
34
+ expect(@records).to eq([["seq1", "AACCTTGG", "", ")#3gTqN8"],
35
+ ["seq2 apples", "ACTG", "seq2 apples",
36
+ "*ujM"]])
37
+ end
38
+
39
+ it "yields the sequence as a Sequence class" do
40
+ the_sequence = @records[0][1]
41
+ expect(the_sequence).to be_a(Sequence)
42
+ end
43
+
44
+ it "yields the quality string as a Quality class" do
45
+ the_quality = @records[0][3]
46
+ expect(the_quality).to be_a(Quality)
47
+ end
48
+ end
49
+ end
50
+ end
@@ -0,0 +1,35 @@
1
+ # Copyright 2014 Ryan Moore
2
+ # Contact: moorer@udel.edu
3
+ #
4
+ # This file is part of parse_fasta.
5
+ #
6
+ # parse_fasta is free software: you can redistribute it and/or modify
7
+ # it under the terms of the GNU General Public License as published by
8
+ # the Free Software Foundation, either version 3 of the License, or
9
+ # (at your option) any later version.
10
+ #
11
+ # parse_fasta is distributed in the hope that it will be useful,
12
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ # GNU General Public License for more details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License
17
+ # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ require 'spec_helper'
20
+ require 'bio'
21
+
22
+ describe Quality do
23
+ let(:qual_string) { Quality.new('ab%63:K') }
24
+ let(:bioruby_qual_scores) do
25
+ Bio::Fastq.new("@seq1\nACTGACT\n+\n#{qual_string}").quality_scores
26
+ end
27
+
28
+ describe "#qual_scores" do
29
+ context "with illumina style quality scores" do
30
+ it "returns an array of quality scores" do
31
+ expect(qual_string.qual_scores).to eq bioruby_qual_scores
32
+ end
33
+ end
34
+ end
35
+ end
@@ -54,17 +54,39 @@ def make_seq(num)
54
54
  num.times.reduce('') { |str, n| str << %w[A a C c T t G g N n].sample }
55
55
  end
56
56
 
57
- s1 = make_seq(2000000)
58
- s2 = make_seq(4000000)
59
- s3 = make_seq(8000000)
57
+ # s1 = make_seq(2000000)
58
+ # s2 = make_seq(4000000)
59
+ # s3 = make_seq(8000000)
60
60
 
61
- Benchmark.bmbm do |x|
62
- x.report('this_gc 1') { this_gc(s1) }
63
- x.report('bioruby_gc 1') { bioruby_gc(s1) }
61
+ # Benchmark.bmbm do |x|
62
+ # x.report('this_gc 1') { this_gc(s1) }
63
+ # x.report('bioruby_gc 1') { bioruby_gc(s1) }
64
+
65
+ # x.report('this_gc 2') { this_gc(s2) }
66
+ # x.report('bioruby_gc 2') { bioruby_gc(s2) }
67
+
68
+ # x.report('this_gc 3') { this_gc(s3) }
69
+ # x.report('bioruby_gc 3') { bioruby_gc(s3) }
70
+ # end
64
71
 
65
- x.report('this_gc 2') { this_gc(s2) }
66
- x.report('bioruby_gc 2') { bioruby_gc(s2) }
72
+ # fastq = ARGV.first
67
73
 
68
- x.report('this_gc 3') { this_gc(s3) }
69
- x.report('bioruby_gc 3') { bioruby_gc(s3) }
74
+ def bioruby_fastq(fastq)
75
+ Bio::FlatFile.open(Bio::Fastq, fastq) do |fq|
76
+ fq.each do |entry|
77
+ [entry.definition, entry.seq.length].join("\t")
78
+ end
79
+ end
70
80
  end
81
+
82
+ def this_fastq(fastq)
83
+ FastqFile.open(fastq).each_record do |head, seq, desc, qual|
84
+ [head, seq.length].join("\t")
85
+ end
86
+ end
87
+
88
+ # file is 4 million illumina reads (16,000,000 lines) 1.4gb
89
+ # Benchmark.bmbm do |x|
90
+ # x.report('this_fastq') { this_fastq(ARGV.first) }
91
+ # x.report('bioruby_fastq') { bioruby_fastq(ARGV.first) }
92
+ # end
@@ -0,0 +1,8 @@
1
+ @seq1
2
+ AACCTTGG
3
+ +
4
+ )#3gTqN8
5
+ @seq2 apples
6
+ ACTG
7
+ +seq2 apples
8
+ *ujM
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: parse_fasta
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.1
4
+ version: 1.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ryan Moore
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2014-06-04 00:00:00.000000000 Z
11
+ date: 2014-06-13 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -28,58 +28,58 @@ dependencies:
28
28
  name: rake
29
29
  requirement: !ruby/object:Gem::Requirement
30
30
  requirements:
31
- - - ">="
31
+ - - "~>"
32
32
  - !ruby/object:Gem::Version
33
- version: '0'
33
+ version: '10.3'
34
34
  type: :development
35
35
  prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
- - - ">="
38
+ - - "~>"
39
39
  - !ruby/object:Gem::Version
40
- version: '0'
40
+ version: '10.3'
41
41
  - !ruby/object:Gem::Dependency
42
42
  name: rspec
43
43
  requirement: !ruby/object:Gem::Requirement
44
44
  requirements:
45
- - - ">="
45
+ - - "~>"
46
46
  - !ruby/object:Gem::Version
47
- version: '0'
47
+ version: '2.14'
48
48
  type: :development
49
49
  prerelease: false
50
50
  version_requirements: !ruby/object:Gem::Requirement
51
51
  requirements:
52
- - - ">="
52
+ - - "~>"
53
53
  - !ruby/object:Gem::Version
54
- version: '0'
54
+ version: '2.14'
55
55
  - !ruby/object:Gem::Dependency
56
56
  name: bio
57
57
  requirement: !ruby/object:Gem::Requirement
58
58
  requirements:
59
- - - ">="
59
+ - - "~>"
60
60
  - !ruby/object:Gem::Version
61
- version: '0'
61
+ version: '1.4'
62
62
  type: :development
63
63
  prerelease: false
64
64
  version_requirements: !ruby/object:Gem::Requirement
65
65
  requirements:
66
- - - ">="
66
+ - - "~>"
67
67
  - !ruby/object:Gem::Version
68
- version: '0'
68
+ version: '1.4'
69
69
  - !ruby/object:Gem::Dependency
70
70
  name: yard
71
71
  requirement: !ruby/object:Gem::Requirement
72
72
  requirements:
73
- - - ">="
73
+ - - "~>"
74
74
  - !ruby/object:Gem::Version
75
- version: '0'
75
+ version: '0.8'
76
76
  type: :development
77
77
  prerelease: false
78
78
  version_requirements: !ruby/object:Gem::Requirement
79
79
  requirements:
80
- - - ">="
80
+ - - "~>"
81
81
  - !ruby/object:Gem::Version
82
- version: '0'
82
+ version: '0.8'
83
83
  description: So you want to parse a fasta file...
84
84
  email:
85
85
  - moorer@udel.edu
@@ -94,14 +94,19 @@ files:
94
94
  - Rakefile
95
95
  - lib/parse_fasta.rb
96
96
  - lib/parse_fasta/fasta_file.rb
97
+ - lib/parse_fasta/fastq_file.rb
98
+ - lib/parse_fasta/quality.rb
97
99
  - lib/parse_fasta/sequence.rb
98
100
  - lib/parse_fasta/version.rb
99
101
  - parse_fasta.gemspec
100
102
  - spec/lib/fasta_file_spec.rb
103
+ - spec/lib/fastq_file_spec.rb
104
+ - spec/lib/quality.rb
101
105
  - spec/lib/sequence_spec.rb
102
106
  - spec/spec_helper.rb
103
107
  - test_files/benchmark.rb
104
108
  - test_files/test.fa
109
+ - test_files/test.fq
105
110
  homepage: https://github.com/mooreryan/parse_fasta
106
111
  licenses:
107
112
  - 'GPLv3: http://www.gnu.org/licenses/gpl.txt'
@@ -128,6 +133,8 @@ specification_version: 4
128
133
  summary: Easy-peasy parsing of fasta files
129
134
  test_files:
130
135
  - spec/lib/fasta_file_spec.rb
136
+ - spec/lib/fastq_file_spec.rb
137
+ - spec/lib/quality.rb
131
138
  - spec/lib/sequence_spec.rb
132
139
  - spec/spec_helper.rb
133
140
  has_rdoc: