parse_fasta 1.0.1 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 81e3eddaba7422d44ae3235f5c9a997f9fc9ea96
4
- data.tar.gz: 9fce00b1a06d89911ded694f3727eaca9dbda59e
3
+ metadata.gz: 964e83c4a500490c9c5b9cb6f709fe7405b9f856
4
+ data.tar.gz: eca56db94b3e639849699da0baf25e06ceedbbd6
5
5
  SHA512:
6
- metadata.gz: ce14854b609f406cbafba53f6e523e17ab9e7406d0073e303f1b6d634e6e8e213baf0f2f648d57159af48dcc7b1859bcac426a9b081eb7e069cc5ace43fe5030
7
- data.tar.gz: d1bf3eb669efe8d5747315315b1cbb7370d70ed0614e64ea44beb9fd08990554e4609364ce623a54e896e74becd222c124da92ff5a716acfccc55b9887f5cf24
6
+ metadata.gz: b0ae81b8ccf5bf0867b4457a7ceb65fecfe7b23bfb489132e4547bf62207707ee94ae0085444036c4683d9b5cc05cbeb1e177fc53c67dc249aedc4432cde41bb
7
+ data.tar.gz: 39a9dea187dae22817b8fdcfc5f1c67ed2e8357abc3d89631544f8d94256df1f6c2aec8d07d957c792adbd039e18968f91c8c133c1e53608a38623896d73d40d
data/README.md CHANGED
@@ -18,17 +18,20 @@ Or install it yourself as:
18
18
 
19
19
  ## Overview ##
20
20
 
21
- I wanted a simple, fast way to parse fasta files so I wouldn't have to
22
- keep writing annoying boilerplate fasta parsing code everytime I go to
23
- do something with one. I will probably add more, but likely only tasks
24
- that I find myself doing over and over.
21
+ I wanted a simple, fast way to parse fasta and fastq files so I
22
+ wouldn't have to keep writing annoying boilerplate parsing code
23
+ every time I go to do something with a fasta or fastq file. I will
24
+ probably add more, but likely only tasks that I find myself doing over
25
+ and over.
25
26
 
26
- ## Usage ##
27
+ ## Documentation ##
28
+
29
+ Check out [parse_fasta docs](http://rubydoc.info/gems/parse_fasta/1.1.0/frames) to see
30
+ the full documentation.
27
31
 
28
- ### Version 1.0.0 (current) ###
32
+ ## Usage ##
29
33
 
30
- The monkey patch of the `File` class is no more! Here is the new print
31
- length example:
34
+ A little script to print header and length of each record.
32
35
 
33
36
  require 'parse_fasta'
34
37
 
@@ -38,28 +41,37 @@ length example:
38
41
 
39
42
  And here, a script to calculate GC content:
40
43
 
41
- require 'parse_fasta'
42
-
43
44
  FastaFile.open(ARGV.first, 'r').each_record do |header, sequence|
44
45
  puts [header, sequence.gc].join("\t")
45
46
  end
46
47
 
47
- ### Version 0.0.5 (old) ###
48
+ Now we can parse fastq files as well!
48
49
 
49
- An example that lists the length for each sequence. (Won't work in
50
- version 1.0.0)
50
+ FastqFile.open(ARGV.first, 'r').each_record do |head, seq, desc, qual|
51
+ puts [head, seq, desc, qual.qual_scores.join(',')].join("\t")
52
+ end
51
53
 
52
- require 'parse_fasta'
54
+ ## Versions ##
53
55
 
54
- File.open(ARGV.first, 'r').each_record do |header, sequence|
55
- puts [header, sequence.length].join("\t")
56
- end
56
+ ### 1.1.0 ###
57
+
58
+ Added: Fastq and Quality classes
59
+
60
+ ### 1.0.0 ###
61
+
62
+ Added: Fasta and Sequence classes
63
+
64
+ Removed: File monkey patch
65
+
66
+ ### 0.0.5 ###
67
+
68
+ Last version with File monkey patch.
57
69
 
58
70
  ## Benchmark ##
59
71
 
60
- Take these with a grain of salt since `BioRuby` is a heavy weight
72
+ Take these with a grain of salt since `BioRuby` is a big
61
73
  module with lots of features and error checking, whereas `parse_fasta`
62
- is meant to be lightweight and easy to use for my own coding.
74
+ is meant to be lightweight and easy to use for my own research.
63
75
 
64
76
  ### FastaFile#each_record ###
65
77
 
@@ -78,12 +90,21 @@ was 1.1 gigabytes. Here are the results from Ruby's `Benchmark` class:
78
90
  I just wanted a nice, clean way to parse fasta files, but being nearly
79
91
  twice as fasta as BioRuby doesn't hurt either!
80
92
 
93
+ ### FastqFile#each_record ###
94
+
95
+ The same sequence length test as above, but this time with a fastq
96
+ file containing 4,000,000 Illumina reads.
97
+
98
+                      user     system      total        real
99
+ this_fastq      62.610000   1.660000  64.270000 ( 64.389408)
100
+ bioruby_fastq  165.500000   2.100000 167.600000 (167.969636)
101
+
81
102
  ### Sequence#gc ###
82
103
 
83
104
  I played around with a few different implementations for the `#gc`
84
105
  method and found this one to be the fastest.
85
106
 
86
- The test is done one random strings mating `/[AaCcTtGgUu]/`. `this_gc`
107
+ The test is done on random strings matching `/[AaCcTtGgUu]/`. `this_gc`
87
108
  is `Sequence.new(str).gc`, and `bioruby_gc` is
88
109
  `Bio::Sequence::NA.new(str).gc_content`.
89
110
 
data/lib/parse_fasta/fastq_file.rb ADDED
@@ -0,0 +1,65 @@
1
+ # Copyright 2014 Ryan Moore
2
+ # Contact: moorer@udel.edu
3
+ #
4
+ # This file is part of parse_fasta.
5
+ #
6
+ # parse_fasta is free software: you can redistribute it and/or modify
7
+ # it under the terms of the GNU General Public License as published by
8
+ # the Free Software Foundation, either version 3 of the License, or
9
+ # (at your option) any later version.
10
+ #
11
+ # parse_fasta is distributed in the hope that it will be useful,
12
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ # GNU General Public License for more details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License
17
+ # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ # Provides simple interface for parsing four-line-per-record fastq
20
+ # format files.
21
+ class FastqFile < File
22
+
23
+ # Analogous to File#each_line, #each_record is used to go through a
24
+ # fastq file record by record.
25
+ #
26
+ # @example Parsing a fastq file
27
+ # FastqFile.open('reads.fq', 'r').each_record do |head, seq, desc, qual|
28
+ # # do some fun stuff here!
29
+ # end
30
+ #
31
+ # @yield The header, sequence, description and quality string for
32
+ # each record in the fastq file to the block
33
+ # @yieldparam header [String] The header of the fastq record without
34
+ # the leading '@'
35
+ # @yieldparam sequence [Sequence] The sequence of the fastq record
36
+ # @yieldparam description [String] The description line of the fastq
37
+ # record without the leading '+'
38
+ # @yieldparam quality [Quality] The quality string of the fastq
39
+ # record
40
+ def each_record
41
+ count = 0
42
+ header = ''
43
+ sequence = ''
44
+ description = ''
45
+ quality = ''
46
+
47
+ self.each_line do |line|
48
+ line.chomp!
49
+
50
+ case count % 4
51
+ when 0
52
+ header = line.sub(/^@/, '')
53
+ when 1
54
+ sequence = Sequence.new(line)
55
+ when 2
56
+ description = line.sub(/^\+/, '')
57
+ when 3
58
+ quality = Quality.new(line)
59
+ yield(header, sequence, description, quality)
60
+ end
61
+
62
+ count += 1
63
+ end
64
+ end
65
+ end
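The four-line-per-record logic above can be tried without installing the gem. The following is only a minimal sketch: `each_fastq_record` is a hypothetical helper (not part of parse_fasta), it yields plain strings rather than `Sequence` and `Quality` objects, and it silently skips a trailing incomplete record, which is an assumption about desired behaviour rather than something the gem documents.

```ruby
require 'stringio'

# Hypothetical helper, NOT part of parse_fasta: walks any IO four lines
# at a time and yields header, sequence, description and quality string.
def each_fastq_record(io)
  return enum_for(:each_fastq_record, io) unless block_given?

  io.each_line.each_slice(4) do |lines|
    next unless lines.size == 4 # assumption: ignore a truncated final record

    head, seq, desc, qual = lines.map(&:chomp)
    # Strip the leading '@' and '+' markers, as the class above does.
    yield head.sub(/\A@/, ''), seq, desc.sub(/\A\+/, ''), qual
  end
end

# Same content as the test.fq fixture shipped with this release.
fixture = ['@seq1', 'AACCTTGG', '+', ')#3gTqN8',
           '@seq2 apples', 'ACTG', '+seq2 apples', '*ujM'].join("\n")

records = each_fastq_record(StringIO.new(fixture)).to_a
records.each { |rec| puts rec.join("\t") }
```

Grouping with `each_slice(4)` replaces the `count % 4` bookkeeping; both approaches assume that sequences and quality strings are never wrapped across multiple lines.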
data/lib/parse_fasta/quality.rb ADDED
@@ -0,0 +1,33 @@
1
+ # Copyright 2014 Ryan Moore
2
+ # Contact: moorer@udel.edu
3
+ #
4
+ # This file is part of parse_fasta.
5
+ #
6
+ # parse_fasta is free software: you can redistribute it and/or modify
7
+ # it under the terms of the GNU General Public License as published by
8
+ # the Free Software Foundation, either version 3 of the License, or
9
+ # (at your option) any later version.
10
+ #
11
+ # parse_fasta is distributed in the hope that it will be useful,
12
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ # GNU General Public License for more details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License
17
+ # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ # Provide some methods for dealing with common tasks regarding
20
+ # quality strings.
21
+ class Quality < String
22
+
23
+ # Returns an array of Illumina-style quality scores. The quality
24
+ # scores generated will be Phred+33.
25
+ #
26
+ # @example Get quality score array of a Quality
27
+ # Quality.new("!+5?I").qual_scores #=> [0, 10, 20, 30, 40]
28
+ #
29
+ # @return [Array<Fixnum>] the quality scores
30
+ def qual_scores
31
+ self.each_byte.map { |b| b - 33 }
32
+ end
33
+ end
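To see what the Phred+33 decoding in `qual_scores` does, here is a standalone sketch of the same arithmetic. `phred33_scores` is a hypothetical name used only for illustration; it works on a plain `String` so it can be run without the gem.

```ruby
# Hypothetical stand-in for Quality#qual_scores: each character's byte
# value minus the Phred+33 offset gives the quality score.
def phred33_scores(quality_string)
  quality_string.each_byte.map { |byte| byte - 33 }
end

scores = phred33_scores('!+5?I')
puts scores.inspect # '!' is byte 33, '+' is 43, '5' is 53, '?' is 63, 'I' is 73
```

Older Illumina pipelines used a Phred+64 offset instead; decoding such a file with an offset of 33 would give scores that are 31 too high, so it is worth checking which encoding a file uses.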
data/lib/parse_fasta/sequence.rb CHANGED
@@ -19,14 +19,11 @@
19
19
  # Provide some methods for dealing with common tasks regarding
20
20
  # nucleotide sequences.
21
21
  class Sequence < String
22
- def initialize(str)
23
- super(str)
24
- end
25
22
 
26
- # Returns GC content for self
23
+ # Calculates GC content
27
24
  #
28
25
  # Calculates GC content by dividing the count of G + C by the count
29
- # of G + C + T + A +U. If there are both T's and U's in the
26
+ # of G + C + T + A + U. If there are both T's and U's in the
30
27
  # Sequence, things will get weird, but then again, that wouldn't
31
28
  # happen, now would it!
32
29
  #
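The method body is not shown in this hunk, so the following is only a sketch of the calculation the comment describes, (G + C) / (G + C + T + A + U). `gc_fraction` is a hypothetical name, and returning `0.0` for a string with no countable bases is an assumption, not documented gem behaviour.

```ruby
# Hypothetical sketch of the documented GC calculation, NOT the gem's
# actual Sequence#gc implementation.
def gc_fraction(str)
  s = str.downcase
  gc = s.count('gc')       # String#count with a character set: g or c
  total = s.count('acgtu') # ambiguous bases such as N are ignored
  return 0.0 if total.zero? # assumption: avoid dividing by zero

  gc / total.to_f
end

puts gc_fraction('ACTG')
puts gc_fraction('GGCCNN')
```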
data/lib/parse_fasta/version.rb CHANGED
@@ -17,5 +17,5 @@
17
17
  # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
18
 
19
19
  module ParseFasta
20
- VERSION = "1.0.1"
20
+ VERSION = "1.1.0"
21
21
  end
data/lib/parse_fasta.rb CHANGED
@@ -18,4 +18,6 @@
18
18
 
19
19
  require 'parse_fasta/version'
20
20
  require 'parse_fasta/fasta_file'
21
+ require 'parse_fasta/fastq_file'
21
22
  require 'parse_fasta/sequence'
23
+ require 'parse_fasta/quality'
data/parse_fasta.gemspec CHANGED
@@ -19,8 +19,8 @@ Gem::Specification.new do |spec|
19
19
  spec.require_paths = ["lib"]
20
20
 
21
21
  spec.add_development_dependency "bundler", "~> 1.6"
22
- spec.add_development_dependency "rake"
23
- spec.add_development_dependency "rspec"
24
- spec.add_development_dependency "bio"
25
- spec.add_development_dependency "yard"
22
+ spec.add_development_dependency "rake", "~> 10.3"
23
+ spec.add_development_dependency "rspec", "~> 2.14"
24
+ spec.add_development_dependency "bio", "~> 1.4"
25
+ spec.add_development_dependency "yard", "~> 0.8"
26
26
  end
data/spec/lib/fastq_file_spec.rb ADDED
@@ -0,0 +1,50 @@
1
+ # Copyright 2014 Ryan Moore
2
+ # Contact: moorer@udel.edu
3
+ #
4
+ # This file is part of parse_fasta.
5
+ #
6
+ # parse_fasta is free software: you can redistribute it and/or modify
7
+ # it under the terms of the GNU General Public License as published by
8
+ # the Free Software Foundation, either version 3 of the License, or
9
+ # (at your option) any later version.
10
+ #
11
+ # parse_fasta is distributed in the hope that it will be useful,
12
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ # GNU General Public License for more details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License
17
+ # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ require 'spec_helper'
20
+
21
+ describe FastqFile do
22
+ describe "#each_record" do
23
+ let(:fname) { "#{File.dirname(__FILE__)}/../../test_files/test.fq" }
24
+
25
+ context "with a 4 line per record fastq file" do
26
+ before do
27
+ @records = []
28
+ FastqFile.open(fname, 'r').each_record do |head, seq, desc, qual|
29
+ @records << [head, seq, desc, qual]
30
+ end
31
+ end
32
+
33
+ it "yields the header, sequence, desc, and qual" do
34
+ expect(@records).to eq([["seq1", "AACCTTGG", "", ")#3gTqN8"],
35
+ ["seq2 apples", "ACTG", "seq2 apples",
36
+ "*ujM"]])
37
+ end
38
+
39
+ it "yields the sequence as a Sequence class" do
40
+ the_sequence = @records[0][1]
41
+ expect(the_sequence).to be_a(Sequence)
42
+ end
43
+
44
+ it "yields the quality string as a Quality class" do
45
+ the_quality = @records[0][3]
46
+ expect(the_quality).to be_a(Quality)
47
+ end
48
+ end
49
+ end
50
+ end
data/spec/lib/quality.rb ADDED
@@ -0,0 +1,35 @@
1
+ # Copyright 2014 Ryan Moore
2
+ # Contact: moorer@udel.edu
3
+ #
4
+ # This file is part of parse_fasta.
5
+ #
6
+ # parse_fasta is free software: you can redistribute it and/or modify
7
+ # it under the terms of the GNU General Public License as published by
8
+ # the Free Software Foundation, either version 3 of the License, or
9
+ # (at your option) any later version.
10
+ #
11
+ # parse_fasta is distributed in the hope that it will be useful,
12
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ # GNU General Public License for more details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License
17
+ # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ require 'spec_helper'
20
+ require 'bio'
21
+
22
+ describe Quality do
23
+ let(:qual_string) { Quality.new('ab%63:K') }
24
+ let(:bioruby_qual_scores) do
25
+ Bio::Fastq.new("@seq1\nACTGACT\n+\n#{qual_string}").quality_scores
26
+ end
27
+
28
+ describe "#qual_scores" do
29
+ context "with illumina style quality scores" do
30
+ it "returns an array of quality scores" do
31
+ expect(qual_string.qual_scores).to eq bioruby_qual_scores
32
+ end
33
+ end
34
+ end
35
+ end
data/test_files/benchmark.rb CHANGED
@@ -54,17 +54,39 @@ def make_seq(num)
54
54
  num.times.reduce('') { |str, n| str << %w[A a C c T t G g N n].sample }
55
55
  end
56
56
 
57
- s1 = make_seq(2000000)
58
- s2 = make_seq(4000000)
59
- s3 = make_seq(8000000)
57
+ # s1 = make_seq(2000000)
58
+ # s2 = make_seq(4000000)
59
+ # s3 = make_seq(8000000)
60
60
 
61
- Benchmark.bmbm do |x|
62
- x.report('this_gc 1') { this_gc(s1) }
63
- x.report('bioruby_gc 1') { bioruby_gc(s1) }
61
+ # Benchmark.bmbm do |x|
62
+ # x.report('this_gc 1') { this_gc(s1) }
63
+ # x.report('bioruby_gc 1') { bioruby_gc(s1) }
64
+
65
+ # x.report('this_gc 2') { this_gc(s2) }
66
+ # x.report('bioruby_gc 2') { bioruby_gc(s2) }
67
+
68
+ # x.report('this_gc 3') { this_gc(s3) }
69
+ # x.report('bioruby_gc 3') { bioruby_gc(s3) }
70
+ # end
64
71
 
65
- x.report('this_gc 2') { this_gc(s2) }
66
- x.report('bioruby_gc 2') { bioruby_gc(s2) }
72
+ # fastq = ARGV.first
67
73
 
68
- x.report('this_gc 3') { this_gc(s3) }
69
- x.report('bioruby_gc 3') { bioruby_gc(s3) }
74
+ def bioruby_fastq(fastq)
75
+ Bio::FlatFile.open(Bio::Fastq, fastq) do |fq|
76
+ fq.each do |entry|
77
+ [entry.definition, entry.seq.length].join("\t")
78
+ end
79
+ end
70
80
  end
81
+
82
+ def this_fastq(fastq)
83
+ FastqFile.open(fastq).each_record do |head, seq, desc, qual|
84
+ [head, seq.length].join("\t")
85
+ end
86
+ end
87
+
88
+ # file is 4 million illumina reads (16,000,000 lines) 1.4gb
89
+ # Benchmark.bmbm do |x|
90
+ # x.report('this_fastq') { this_fastq(ARGV.first) }
91
+ # x.report('bioruby_fastq') { bioruby_fastq(ARGV.first) }
92
+ # end
data/test_files/test.fq ADDED
@@ -0,0 +1,8 @@
1
+ @seq1
2
+ AACCTTGG
3
+ +
4
+ )#3gTqN8
5
+ @seq2 apples
6
+ ACTG
7
+ +seq2 apples
8
+ *ujM
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: parse_fasta
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.1
4
+ version: 1.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ryan Moore
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2014-06-04 00:00:00.000000000 Z
11
+ date: 2014-06-13 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -28,58 +28,58 @@ dependencies:
28
28
  name: rake
29
29
  requirement: !ruby/object:Gem::Requirement
30
30
  requirements:
31
- - - ">="
31
+ - - "~>"
32
32
  - !ruby/object:Gem::Version
33
- version: '0'
33
+ version: '10.3'
34
34
  type: :development
35
35
  prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
- - - ">="
38
+ - - "~>"
39
39
  - !ruby/object:Gem::Version
40
- version: '0'
40
+ version: '10.3'
41
41
  - !ruby/object:Gem::Dependency
42
42
  name: rspec
43
43
  requirement: !ruby/object:Gem::Requirement
44
44
  requirements:
45
- - - ">="
45
+ - - "~>"
46
46
  - !ruby/object:Gem::Version
47
- version: '0'
47
+ version: '2.14'
48
48
  type: :development
49
49
  prerelease: false
50
50
  version_requirements: !ruby/object:Gem::Requirement
51
51
  requirements:
52
- - - ">="
52
+ - - "~>"
53
53
  - !ruby/object:Gem::Version
54
- version: '0'
54
+ version: '2.14'
55
55
  - !ruby/object:Gem::Dependency
56
56
  name: bio
57
57
  requirement: !ruby/object:Gem::Requirement
58
58
  requirements:
59
- - - ">="
59
+ - - "~>"
60
60
  - !ruby/object:Gem::Version
61
- version: '0'
61
+ version: '1.4'
62
62
  type: :development
63
63
  prerelease: false
64
64
  version_requirements: !ruby/object:Gem::Requirement
65
65
  requirements:
66
- - - ">="
66
+ - - "~>"
67
67
  - !ruby/object:Gem::Version
68
- version: '0'
68
+ version: '1.4'
69
69
  - !ruby/object:Gem::Dependency
70
70
  name: yard
71
71
  requirement: !ruby/object:Gem::Requirement
72
72
  requirements:
73
- - - ">="
73
+ - - "~>"
74
74
  - !ruby/object:Gem::Version
75
- version: '0'
75
+ version: '0.8'
76
76
  type: :development
77
77
  prerelease: false
78
78
  version_requirements: !ruby/object:Gem::Requirement
79
79
  requirements:
80
- - - ">="
80
+ - - "~>"
81
81
  - !ruby/object:Gem::Version
82
- version: '0'
82
+ version: '0.8'
83
83
  description: So you want to parse a fasta file...
84
84
  email:
85
85
  - moorer@udel.edu
@@ -94,14 +94,19 @@ files:
94
94
  - Rakefile
95
95
  - lib/parse_fasta.rb
96
96
  - lib/parse_fasta/fasta_file.rb
97
+ - lib/parse_fasta/fastq_file.rb
98
+ - lib/parse_fasta/quality.rb
97
99
  - lib/parse_fasta/sequence.rb
98
100
  - lib/parse_fasta/version.rb
99
101
  - parse_fasta.gemspec
100
102
  - spec/lib/fasta_file_spec.rb
103
+ - spec/lib/fastq_file_spec.rb
104
+ - spec/lib/quality.rb
101
105
  - spec/lib/sequence_spec.rb
102
106
  - spec/spec_helper.rb
103
107
  - test_files/benchmark.rb
104
108
  - test_files/test.fa
109
+ - test_files/test.fq
105
110
  homepage: https://github.com/mooreryan/parse_fasta
106
111
  licenses:
107
112
  - 'GPLv3: http://www.gnu.org/licenses/gpl.txt'
@@ -128,6 +133,8 @@ specification_version: 4
128
133
  summary: Easy-peasy parsing of fasta files
129
134
  test_files:
130
135
  - spec/lib/fasta_file_spec.rb
136
+ - spec/lib/fastq_file_spec.rb
137
+ - spec/lib/quality.rb
131
138
  - spec/lib/sequence_spec.rb
132
139
  - spec/spec_helper.rb
133
140
  has_rdoc: