parse_fasta 1.9.2 → 2.0.0

Files changed (48)
  1. checksums.yaml +8 -8
  2. data/.gitignore +1 -0
  3. data/.rspec +2 -0
  4. data/CHANGELOG.md +178 -0
  5. data/README.md +42 -215
  6. data/Rakefile +2 -4
  7. data/bin/console +14 -0
  8. data/bin/setup +8 -0
  9. data/lib/parse_fasta/error.rb +39 -0
  10. data/lib/parse_fasta/record.rb +88 -0
  11. data/lib/parse_fasta/seq_file.rb +221 -114
  12. data/lib/parse_fasta/version.rb +2 -2
  13. data/lib/parse_fasta.rb +5 -20
  14. data/spec/parse_fasta/record_spec.rb +115 -0
  15. data/spec/parse_fasta/seq_file_spec.rb +238 -0
  16. data/spec/parse_fasta_spec.rb +25 -0
  17. data/spec/spec_helper.rb +2 -44
  18. data/spec/test_files/cr.fa +1 -0
  19. data/spec/test_files/cr.fa.gz +0 -0
  20. data/spec/test_files/cr.fq +3 -0
  21. data/spec/test_files/cr.fq.gz +0 -0
  22. data/spec/test_files/cr_nl.fa +4 -0
  23. data/spec/test_files/cr_nl.fa.gz +0 -0
  24. data/spec/test_files/cr_nl.fq +8 -0
  25. data/spec/test_files/cr_nl.fq.gz +0 -0
  26. data/spec/test_files/multi_blob.fa.gz +0 -0
  27. data/spec/test_files/multi_blob.fq.gz +0 -0
  28. data/spec/test_files/not_a_seq_file.txt +1 -0
  29. data/{test_files/bad.fa → spec/test_files/poorly_catted.fa} +0 -0
  30. data/{test_files/test.fa → spec/test_files/seqs.fa} +0 -0
  31. data/spec/test_files/seqs.fa.gz +0 -0
  32. data/spec/test_files/seqs.fq +8 -0
  33. data/spec/test_files/seqs.fq.gz +0 -0
  34. metadata +49 -24
  35. data/lib/parse_fasta/fasta_file.rb +0 -232
  36. data/lib/parse_fasta/fastq_file.rb +0 -160
  37. data/lib/parse_fasta/quality.rb +0 -54
  38. data/lib/parse_fasta/sequence.rb +0 -174
  39. data/spec/lib/fasta_file_spec.rb +0 -212
  40. data/spec/lib/fastq_file_spec.rb +0 -143
  41. data/spec/lib/quality_spec.rb +0 -51
  42. data/spec/lib/seq_file_spec.rb +0 -357
  43. data/spec/lib/sequence_spec.rb +0 -188
  44. data/test_files/benchmark.rb +0 -99
  45. data/test_files/bogus.txt +0 -2
  46. data/test_files/test.fa.gz +0 -0
  47. data/test_files/test.fq +0 -8
  48. data/test_files/test.fq.gz +0 -0
checksums.yaml CHANGED
@@ -1,15 +1,15 @@
1
1
  ---
2
2
  !binary "U0hBMQ==":
3
3
  metadata.gz: !binary |-
4
- NmM5ZWYwOGM5YWIxMzU2YjBmZTk4Y2I5YzI0NjY0MzUwM2YwMjgyOA==
4
+ YzliYjhmZmMzNGRlYmFmNDQwOGE2NGFmNzgyZTliZDdhMDdkMTc0Zg==
5
5
  data.tar.gz: !binary |-
6
- NzI2NDY1MWZmYmUwNDUxMTk2MmI4YjgwYWVlYjcyZDI4MDUzMzk4NA==
6
+ OTgxOWFjYTEyMWI0MjNlNjBhZjJkNGZkMjFkZGFkZDNjNGJkNTk2NA==
7
7
  SHA512:
8
8
  metadata.gz: !binary |-
9
- ODY1ZTQ1MzU4MTc2MDhhMjA0OThiYzM4Yzk4YjJiZjU4ZGY4MGM5NTRjYTE5
10
- OWZkODk0M2ZmODE5ODY1MjE3NTQ5MzgyNTFjMTk2NzU2NGVjN2NkNGUzYzA3
11
- ODliNjRlOGJjOGJhNjhlMWZmMmU1NjkyMjgwNzAyODQ1MDExOTI=
9
+ OGQxNTg4YzYyYzQyZGM2YjM0NzYyMjFiYzUwMTllYjM3NzZiZjViNTQwMWFi
10
+ NTI0NDk5NDY0NTc4YThhZTg4ODczYjAxZTA3MGNmZDdmMWYzNmMwMGFlMzhl
11
+ ODFhM2Q1NzIxZDVlYjE0MjEwYTg0OTlkMzlmZDQyYjIzYjhjNGQ=
12
12
  data.tar.gz: !binary |-
13
- YWY4NWU3NDFiYTVmMmE1Y2MxMDI3ZjE3NTIyY2Q1N2Q2ZDQxM2ZlZjI4NjUy
14
- MWM5OTZhNzEzZWNmMGVlYTQ1MDc1MzViMDBkOTQ0YzQyY2IxYjlmOGQwNzRh
15
- YmIyOTg2Yjk0OTFlNWVhOGU3MTMzM2I1ZGY0ZjlkMzExZGNkZDk=
13
+ MTBlN2NmNmJkOGUwM2Q1MDZhZTkzM2NmMzNmOTY1YWUzMzVjNjdkN2NiMDM2
14
+ NTJlYmU5Yzk1ODExNzczMGNkNTFkNzEwOWZkZGIwMjRiMjNiNGY5ZGM0MDJk
15
+ ZmY3YWI0OGQwOTNiMzY2ODAzMzkxZjFkZmNiNTExMGE3NWFlZjk=
data/.gitignore CHANGED
@@ -21,3 +21,4 @@ tmp
21
21
  *.a
22
22
  mkmf.log
23
23
  .ruby-*
24
+ .idea
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --format documentation
2
+ --color
data/CHANGELOG.md ADDED
@@ -0,0 +1,178 @@
1
+ ## Versions ##
2
+
3
+ ### 2.0.0 ###
4
+
5
+ A weird feature of `Zlib::GzipReader` made it so that if a gzipped file was created like this.
6
+
7
+ ```bash
8
+ gzip -c a.fa > z.fa.gz
9
+ gzip -c b.fa >> z.fa.gz
10
+ ```
11
+
12
+ Then the gzip reader would only read the lines from `a.fa` without some fiddling around. Since this was a pretty low level thing, I just decided to make a bunch of under the hood changes that I've been meaning to get to.
13
+
14
+ #### Other things
15
+
16
+ - Everything is namespaced under `ParseFasta` module
17
+ - Removed `FastaFile` and `FastqFile` classes, `SeqFile` only remains
18
+ - Removed `Sequence` and `Quality` classes. These might get put back in at some point, but I almost never used them anyway
19
+ - `SeqFile#each_record` yields a `Record` object so you can use the same code to parse fastA and fastQ files
20
+ - Other stuff that I'm forgetting!
21
+
22
+
23
+ ### 1.9.2 ###
24
+
25
+ Speed up fastA `each_record` and `each_record_fast`.
26
+
27
+ ### 1.9.1 ###
28
+
29
+ Speed up fastQ `each_record` and `each_record_fast`. Courtesy of
30
+ [Matthew Ralston](https://github.com/MatthewRalston).
31
+
32
+ ### 1.9.0 ###
33
+
34
+ Added "fast" versions of `each_record` methods
35
+ (`each_record_fast`). Basically, they return sequences and quality
36
+ strings as Ruby `String` objects instead of `Sequence` or `Quality`
37
+ objects. Also, if the sequence or quality string has spaces, they will
38
+ be retained. If this is a problem, use the original `each_record`
39
+ methods.
40
+
41
+ ### 1.8.2 ###
42
+
43
+ Speed up `FastqFile#each_record`.
44
+
45
+ ### 1.8.1 ###
46
+
47
+ An error will be raised if a fasta file has a `>` in the
48
+ sequence. Sometimes files are not terminated with a newline
49
+ character. If this is the case, then catting two fasta files will
50
+ smush the first header of the second file right in with the last
51
+ sequence of the first file. This is bad, raise an error! ;)
52
+
53
+ Example
54
+
55
+ >seq1
56
+ ACTG>seq2
57
+ ACTG
58
+ >seq3
59
+ ACTG
60
+
61
+ This will raise `ParseFasta::SequenceFormatError`.
62
+
63
+ Also, headers with lots of `>` within are fine now.
64
+
65
+ ### 1.8 ###
66
+
67
+ Add `Sequence#rev_comp`. It can handle IUPAC characters. Since
68
+ `parse_fasta` doesn't check whether the seq is AA or NA, if called on
69
+ an amino acid string, things will get weird as it will complement the
70
+ IUPAC characters in the AA string and leave others.
71
+
72
+ ### 1.7.2 ###
73
+
74
+ Strip spaces (not all whitespace) from `Sequence` and `Quality` strings.
75
+
76
+ Some alignment fastas have spaces for easier reading. Strip these
77
+ out. For consistency, also strips spaces from `Quality` strings. If
78
+ there are spaces that don't match in the quality and sequence in a
79
+ fastQ file, then things will get messed up in the FastQ file. FastQ
80
+ shouldn't have spaces though.
81
+
82
+ ### 1.7 ###
83
+
84
+ Add `SeqFile#to_hash`, `FastaFile#to_hash` and `FastqFile#to_hash`.
85
+
86
+ ### 1.6.2 ###
87
+
88
+ `FastaFile::open` now raises a `ParseFasta::DataFormatError` when passed files
89
+ that don't begin with a `>`.
90
+
91
+ ### 1.6.1 ###
92
+
93
+ Better internal handling of empty sequences -- instead of raising
94
+ errors, pass empty sequences.
95
+
96
+ ### 1.6 ###
97
+
98
+ Added `SeqFile` class, which accepts either fastA or fastQ files. It
99
+ uses FastaFile and FastqFile internally. You can use this class if you
100
+ want your scripts to accept either fastA or fastQ files.
101
+
102
+ If you need the description and quality string, you should use
103
+ FastqFile instead.
104
+
105
+ ### 1.5 ###
106
+
107
+ Now accepts gzipped files. Huzzah!
108
+
109
+ ### 1.4 ###
110
+
111
+ Added methods:
112
+
113
+ Sequence.base_counts
114
+ Sequence.base_frequencies
115
+
116
+ ### 1.3 ###
117
+
118
+ Add additional functionality to `each_record` method.
119
+
120
+ #### Info ####
121
+
122
+ I often like to use the fasta format for other things like so
123
+
124
+ >fruits
125
+ pineapple
126
+ pear
127
+ peach
128
+ >veggies
129
+ peppers
130
+ parsnip
131
+ peas
132
+
133
+ rather than having this in a two column file like this
134
+
135
+ fruit,pineapple
136
+ fruit,pear
137
+ fruit,peach
138
+ veggie,peppers
139
+ veggie,parsnip
140
+ veggie,peas
141
+
142
+ So I added functionality to `each_record` to keep each line a record
143
+ separate in an array. Here's an example using the above file.
144
+
145
+ info = []
146
+ FastaFile.open(f, 'r').each_record(1) do |header, lines|
147
+ info << [header, lines]
148
+ end
149
+
150
+ Then info will contain the following arrays
151
+
152
+ ['fruits', ['pineapple', 'pear', 'peach']],
153
+ ['veggies', ['peppers', 'parsnip', 'peas']]
154
+
155
+ ### 1.2 ###
156
+
157
+ Added `mean_qual` method to the `Quality` class.
158
+
159
+ ### 1.1.2 ###
160
+
161
+ Dropped Ruby requirement to 1.9.3
162
+
163
+ (Note, if you want to build the docs with yard and you're using
164
+ Ruby 1.9.3, you may have to install the redcarpet gem.)
165
+
166
+ ### 1.1 ###
167
+
168
+ Added: Fastq and Quality classes
169
+
170
+ ### 1.0 ###
171
+
172
+ Added: Fasta and Sequence classes
173
+
174
+ Removed: File monkey patch
175
+
176
+ ### 0.0.5 ###
177
+
178
+ Last version with File monkey patch.
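Note on the 2.0.0 entry in this changelog: a file built with `gzip -c a.fa > z.fa.gz` followed by `gzip -c b.fa >> z.fa.gz` contains two gzip members, and a single `Zlib::GzipReader` stops after the first. The sketch below shows one common workaround that uses only the standard `zlib` calls `unused` and `finish`. It is an illustration, not necessarily what `seq_file.rb` does in 2.0.0. The method name `each_gzip_line` is made up for this example, and the input is assumed to be a seekable IO such as a `File` or `StringIO`.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper (not part of parse_fasta): yields every line from
# every gzip member in a seekable IO.
def each_gzip_line(io)
  loop do
    gz = Zlib::GzipReader.new(io)
    gz.each_line { |line| yield line }

    # Bytes the reader pulled from io past the end of this member, or nil.
    unused = gz.unused
    # Release the reader without closing the underlying io.
    gz.finish

    # Rewind so the next reader starts at the next member's header.
    io.pos -= unused.bytesize if unused
    break if io.eof?
  end
end

# Two members concatenated, like the two gzip commands in the changelog.
blob = Zlib.gzip(">a\nACTG\n") + Zlib.gzip(">b\nGGCC\n")
lines = []
each_gzip_line(StringIO.new(blob)) { |line| lines << line.chomp }
```

The loop ends on `io.eof?` rather than on `unused` being `nil`, so it also copes with a member that happens to end exactly on a read boundary. Trailing non-gzip bytes would still raise a `Zlib` error.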
data/README.md CHANGED
@@ -1,4 +1,4 @@
1
- # parse_fasta #
1
+ # ParseFasta #
2
2
 
3
3
  [![Gem Version](https://badge.fury.io/rb/parse_fasta.svg)](http://badge.fury.io/rb/parse_fasta) [![Build Status](https://travis-ci.org/mooreryan/parse_fasta.svg?branch=master)](https://travis-ci.org/mooreryan/parse_fasta) [![Coverage Status](https://coveralls.io/repos/mooreryan/parse_fasta/badge.svg)](https://coveralls.io/r/mooreryan/parse_fasta)
4
4
 
@@ -8,7 +8,9 @@ So you want to parse a fasta file...
8
8
 
9
9
  Add this line to your application's Gemfile:
10
10
 
11
- gem 'parse_fasta'
11
+ ```ruby
12
+ gem 'parse_fasta'
13
+ ```
12
14
 
13
15
  And then execute:
14
16
 
@@ -20,9 +22,7 @@ Or install it yourself as:
20
22
 
21
23
  ## Overview ##
22
24
 
23
- Provides nice, programmatic access to fasta and fastq files, as well
24
- as providing Sequence and Quality helper classes. It's more
25
- lightweight than BioRuby. And more fun! ;)
25
+ Provides nice, programmatic access to fasta and fastq files. It's faster and more lightweight than BioRuby. And more fun!
26
26
 
27
27
  ## Documentation ##
28
28
 
@@ -32,213 +32,40 @@ for the full api documentation.
32
32
 
33
33
  ## Usage ##
34
34
 
35
- Some examples...
36
-
37
- A little script to print header and length of each record.
38
-
39
- require 'parse_fasta'
40
-
41
- FastaFile.open(ARGV[0]).each_record do |header, sequence|
42
- puts [header, sequence.length].join("\t")
43
- end
44
-
45
- And here, a script to calculate GC content:
46
-
47
- FastaFile.open(ARGV[0]).each_record do |header, sequence|
48
- puts [header, sequence.gc].join("\t")
49
- end
50
-
51
- Now we can parse fastq files as well!
52
-
53
- FastqFile.open(ARGV[0]).each_record do |head, seq, desc, qual|
54
- puts [header, qual.qual_scores.join(',')].join("\t")
55
- end
56
-
57
- What if you don't care if the input is a fastA or a fastQ? No problem!
58
-
59
- SeqFile.open(ARGV[0]).each_record do |head, seq|
60
- puts [header, seq].join "\t"
61
- end
62
-
63
- Read fasta file into a hash.
64
-
65
- seqs = FastaFile.open(ARGV[0]).to_hash
66
-
67
- ## Versions ##
68
-
69
- ### 1.9.2 ###
70
-
71
- Speed up fastA `each_record` and `each_record_fast`.
72
-
73
- ### 1.9.1 ###
74
-
75
- Speed up fastQ `each_record` and `each_record_fast`. Courtesy of
76
- [Matthew Ralston](https://github.com/MatthewRalston).
77
-
78
- ### 1.9.0 ###
79
-
80
- Added "fast" versions of `each_record` methods
81
- (`each_record_fast`). Basically, they return sequences and quality
82
- strings as Ruby `Sring` objects instead of aa `Sequence` or `Quality`
83
- objects. Also, if the sequence or quality string has spaces, they will
84
- be retained. If this is a problem, use the original `each_record`
85
- methods.
86
-
87
- ### 1.8.2 ###
88
-
89
- Speed up `FastqFile#each_record`.
90
-
91
- ### 1.8.1 ###
92
-
93
- An error will be raised if a fasta file has a `>` in the
94
- sequence. Sometimes files are not terminated with a newline
95
- character. If this is the case, then catting two fasta files will
96
- smush the first header of the second file right in with the last
97
- sequence of the first file. This is bad, raise an error! ;)
98
-
99
- Example
100
-
101
- >seq1
102
- ACTG>seq2
103
- ACTG
104
- >seq3
105
- ACTG
106
-
107
- This will raise `ParseFasta::SequenceFormatError`.
108
-
109
- Also, headers with lots of `>` within are fine now.
110
-
111
- ### 1.8 ###
112
-
113
- Add `Sequence#rev_comp`. It can handle IUPAC characters. Since
114
- `parse_fasta` doesn't check whether the seq is AA or NA, if called on
115
- an amino acid string, things will get weird as it will complement the
116
- IUPAC characters in the AA string and leave others.
117
-
118
- ### 1.7.2 ###
119
-
120
- Strip spaces (not all whitespace) from `Sequence` and `Quality` strings.
121
-
122
- Some alignment fastas have spaces for easier reading. Strip these
123
- out. For consistency, also strips spaces from `Quality` strings. If
124
- there are spaces that don't match in the quality and sequence in a
125
- fastQ file, then things will get messed up in the FastQ file. FastQ
126
- shouldn't have spaces though.
127
-
128
- ### 1.7 ###
129
-
130
- Add `SeqFile#to_hash`, `FastaFile#to_hash` and `FastqFile#to_hash`.
131
-
132
- ### 1.6.2 ###
133
-
134
- `FastaFile::open` now raises a `ParseFasta::DataFormatError` when passed files
135
- that don't begin with a `>`.
136
-
137
- ### 1.6.1 ###
138
-
139
- Better internal handling of empty sequences -- instead of raising
140
- errors, pass empty sequences.
141
-
142
- ### 1.6 ###
143
-
144
- Added `SeqFile` class, which accepts either fastA or fastQ files. It
145
- uses FastaFile and FastqFile internally. You can use this class if you
146
- want your scripts to accept either fastA or fastQ files.
147
-
148
- If you need the description and quality string, you should use
149
- FastqFile instead.
150
-
151
- ### 1.5 ###
152
-
153
- Now accepts gzipped files. Huzzah!
154
-
155
- ### 1.4 ###
156
-
157
- Added methods:
158
-
159
- Sequence.base_counts
160
- Sequence.base_frequencies
161
-
162
- ### 1.3 ###
163
-
164
- Add additional functionality to `each_record` method.
165
-
166
- #### Info ####
167
-
168
- I often like to use the fasta format for other things like so
169
-
170
- >fruits
171
- pineapple
172
- pear
173
- peach
174
- >veggies
175
- peppers
176
- parsnip
177
- peas
178
-
179
- rather than having this in a two column file like this
180
-
181
- fruit,pineapple
182
- fruit,pear
183
- fruit,peach
184
- veggie,peppers
185
- veggie,parsnip
186
- veggie,peas
187
-
188
- So I added functionality to `each_record` to keep each line a record
189
- separate in an array. Here's an example using the above file.
190
-
191
- info = []
192
- FastaFile.open(f, 'r').each_record(1) do |header, lines|
193
- info << [header, lines]
194
- end
195
-
196
- Then info will contain the following arrays
197
-
198
- ['fruits', ['pineapple', 'pear', 'peach']],
199
- ['veggies', ['peppers', 'parsnip', 'peas']]
200
-
201
- ### 1.2 ###
202
-
203
- Added `mean_qual` method to the `Quality` class.
204
-
205
- ### 1.1.2 ###
206
-
207
- Dropped Ruby requirement to 1.9.3
208
-
209
- (Note, if you want to build the docs with yard and you're using
210
- Ruby 1.9.3, you may have to install the redcarpet gem.)
211
-
212
- ### 1.1 ###
213
-
214
- Added: Fastq and Quality classes
215
-
216
- ### 1.0 ###
217
-
218
- Added: Fasta and Sequence classes
219
-
220
- Removed: File monkey patch
221
-
222
- ### 0.0.5 ###
223
-
224
- Last version with File monkey patch.
225
-
226
- ## Benchmark ##
227
-
228
- Some quick and dirty benchmarks against `BioRuby`.
229
-
230
- ### FastaFile#each_record ###
231
-
232
- You can see the test script in `benchmark.rb`.
233
-
234
- user system total real
235
- parse_fasta 1.920000 0.160000 2.080000 ( 2.145932)
236
- parse_fasta fast 1.210000 0.160000 1.370000 ( 1.377770)
237
- bioruby 4.330000 0.290000 4.620000 ( 4.655567)
238
-
239
- Hot dog! It's faster :)
240
-
241
- ## Notes ##
242
-
243
- Only the `SeqFile` class actually checks to make sure that you passed
244
- in a "proper" fastA or fastQ file, so watch out.
35
+ Here are some examples of using ParseFasta. Don't forget to `require "parse_fasta"` at the top of your program!
36
+
37
+ Print header and length of each record.
38
+
39
+ ```ruby
40
+ ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
41
+ puts [rec.header, rec.seq.length].join "\t"
42
+ end
43
+ ```
44
+
45
+ You can parse fastQ files in exactly the same way.
46
+
47
+ ```ruby
48
+ ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
49
+ printf "Header: %s, Sequence: %s, Description: %s, Quality: %s\n",
50
+ rec.header,
51
+ rec.seq,
52
+ rec.desc,
53
+ rec.qual
54
+ end
55
+ ```
56
+
57
+ The `Record#desc` and `Record#qual` will be `nil` if the file you are parsing is a fastA file.
58
+
59
+ ```ruby
60
+ ParseFasta::SeqFile.open(ARGV[0]).each_record do |rec|
61
+ if rec.qual
62
+ puts "@#{rec.header}"
63
+ puts rec.seq
64
+ puts "+#{rec.desc}"
65
+ puts rec.qual
66
+ else
67
+ puts ">#{rec.header}"
68
+ puts rec.seq
69
+ end
70
+ end
71
+ ```
data/Rakefile CHANGED
@@ -1,8 +1,6 @@
1
1
  require "bundler/gem_tasks"
2
2
  require "rspec/core/rake_task"
3
3
 
4
- RSpec::Core::RakeTask.new
5
-
6
- task default: :spec
7
- task test: :spec
4
+ RSpec::Core::RakeTask.new(:spec)
8
5
 
6
+ task :default => :spec
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "parse_fasta"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start
data/bin/setup ADDED
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
data/lib/parse_fasta/error.rb ADDED
@@ -0,0 +1,39 @@
1
+ # Copyright 2014 - 2016 Ryan Moore
2
+ # Contact: moorer@udel.edu
3
+ #
4
+ # This file is part of parse_fasta.
5
+ #
6
+ # parse_fasta is free software: you can redistribute it and/or modify
7
+ # it under the terms of the GNU General Public License as published by
8
+ # the Free Software Foundation, either version 3 of the License, or
9
+ # (at your option) any later version.
10
+ #
11
+ # parse_fasta is distributed in the hope that it will be useful,
12
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ # GNU General Public License for more details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License
17
+ # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ module ParseFasta
20
+ # Contains the Error classes that ParseFasta API will raise
21
+ module Error
22
+
23
+ # All ParseFasta errors inherit from ParseFastaError
24
+ class ParseFastaError < StandardError
25
+ end
26
+
27
+ # Raised when the input file doesn't look like fastA or fastQ
28
+ class DataFormatError < ParseFastaError
29
+ end
30
+
31
+ # Raised when the file is not found
32
+ class FileNotFoundError < ParseFastaError
33
+ end
34
+
35
+ # Raised when fastA sequences have a '>' in them
36
+ class SequenceFormatError < ParseFastaError
37
+ end
38
+ end
39
+ end
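Because every error class in `error.rb` inherits from `ParseFastaError`, a caller can rescue that one class and cover them all, or rescue a subclass first for special handling. The sketch below re-declares the hierarchy from the hunk above only so that it runs without the gem installed. The helper `describe_failure` is hypothetical and not part of parse_fasta.

```ruby
# Minimal copy of the hierarchy from error.rb so the example runs
# without the gem installed; real code would `require "parse_fasta"`.
module ParseFasta
  module Error
    class ParseFastaError < StandardError; end
    class DataFormatError < ParseFastaError; end
    class FileNotFoundError < ParseFastaError; end
    class SequenceFormatError < ParseFastaError; end
  end
end

# Hypothetical helper: runs the block and reports which kind of
# ParseFasta error, if any, it raised.
def describe_failure
  yield
  "ok"
rescue ParseFasta::Error::SequenceFormatError => e
  # Most specific class first, so it gets its own message.
  "bad sequence: #{e.message}"
rescue ParseFasta::Error::ParseFastaError => e
  # Catch-all for every other error defined in the module.
  "parse_fasta error: #{e.class.name.split('::').last}"
end

specific = describe_failure { raise ParseFasta::Error::SequenceFormatError, "found '>'" }
generic  = describe_failure { raise ParseFasta::Error::DataFormatError }
clean    = describe_failure { :nothing_raised }
```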
data/lib/parse_fasta/record.rb ADDED
@@ -0,0 +1,88 @@
1
+ # Copyright 2014 - 2016 Ryan Moore
2
+ # Contact: moorer@udel.edu
3
+ #
4
+ # This file is part of parse_fasta.
5
+ #
6
+ # parse_fasta is free software: you can redistribute it and/or modify
7
+ # it under the terms of the GNU General Public License as published by
8
+ # the Free Software Foundation, either version 3 of the License, or
9
+ # (at your option) any later version.
10
+ #
11
+ # parse_fasta is distributed in the hope that it will be useful,
12
+ # but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ # GNU General Public License for more details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License
17
+ # along with parse_fasta. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ module ParseFasta
20
+ class Record
21
+
22
+ # @!attribute header
23
+ # @return [String] the full header of the record without the '>'
24
+ # or '@'
25
+ # @!attribute seq
26
+ # @return [String] the sequence of the record
27
+ # @!attribute desc
28
+ # @return [String or Nil] if the record is from a fastA file, it
29
+ # is nil; else, the description line of the fastQ record
30
+ # @!attribute qual
31
+ # @return [String or Nil] if the record is from a fastA file, it
32
+ # is nil; else, the quality string of the fastQ record
33
+ attr_accessor :header, :seq, :desc, :qual
34
+
35
+ # The constructor takes keyword args.
36
+ #
37
+ # @example Init a new Record object for a fastA record
38
+ # Record.new header: "apple", seq: "actg"
39
+ # @example Init a new Record object for a fastQ record
40
+ # Record.new header: "apple", seq: "actd", desc: "", qual: "IIII"
41
+ #
42
+ # @param header [String] the header of the record
43
+ # @param seq [String] the sequence of the record
44
+ # @param desc [String] the description line of a fastQ record
45
+ # @param qual [String] the quality string of a fastQ record
46
+ #
47
+ # @raise [SequenceFormatError] if a fastA sequence has a '>'
48
+ # character in it
49
+ def initialize args = {}
50
+ @header = args.fetch :header
51
+
52
+ @desc = args.fetch :desc, nil
53
+ @qual = args.fetch :qual, nil
54
+
55
+ @qual.gsub!(/\s+/, "") if @qual
56
+
57
+ seq = args.fetch(:seq).gsub(/\s+/, "")
58
+
59
+ if @qual # is fastQ
60
+ @seq = seq
61
+ else # is fastA
62
+ @seq = check_fasta_seq(seq)
63
+ end
64
+ end
65
+
66
+ # Compare attrs of this rec with another
67
+ #
68
+ # @param rec [Record] a Record object to compare with
69
+ #
70
+ # @return [Bool] true or false
71
+ def == rec
72
+ self.header == rec.header && self.seq == rec.seq &&
73
+ self.desc == rec.desc && self.qual == rec.qual
74
+ end
75
+
76
+ private
77
+
78
+ def check_fasta_seq seq
79
+ if seq.match ">"
80
+ raise ParseFasta::Error::SequenceFormatError,
81
+ "A sequence contained a '>' character " +
82
+ "(the fastA file record separator)"
83
+ else
84
+ seq
85
+ end
86
+ end
87
+ end
88
+ end
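One behaviour worth noting in `Record#initialize` above: `:seq` is cleaned with `gsub`, which returns a new string, while `:qual` is cleaned with `gsub!`, which edits the caller's string in place and raises on a frozen string. The snippet below demonstrates that difference with plain strings only. It does not load parse_fasta, and the variable names are illustrative.

```ruby
# gsub returns a cleaned copy and leaves the original untouched.
seq_in  = "AC TG"
seq_out = seq_in.gsub(/\s+/, "")

# gsub! strips whitespace from the very object it is called on.
qual_in = "II II".dup
qual_in.gsub!(/\s+/, "")

# On a frozen string gsub! raises (FrozenError, a RuntimeError subclass,
# on Ruby 2.5 and later).
frozen_qual = "II II".freeze
raised = begin
  frozen_qual.gsub!(/\s+/, "")
  false
rescue RuntimeError
  true
end
```

If either effect matters to a caller, passing a copy such as `qual.dup` as the `:qual` argument should avoid both.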