bio-bigbio 0.1.4 → 0.1.5
- data/.travis.yml +12 -0
- data/LICENSE.txt +20 -0
- data/README.md +147 -15
- data/Rakefile +1 -0
- data/VERSION +1 -1
- data/bin/fasta_filter.rb +100 -0
- data/bin/fasta_sort.rb +24 -0
- data/bin/getorf +4 -8
- data/bin/nt2aa.rb +3 -6
- data/bio-bigbio.gemspec +9 -5
- data/lib/bigbio/db/fasta/fastareader.rb +35 -0
- data/lib/bigbio/db/fasta/fastarecord.rb +7 -1
- data/lib/bigbio/db/phylip.rb +49 -0
- data/spec/emitter_spec.rb +17 -0
- metadata +23 -17
- data/LICENSE +0 -34
data/.travis.yml
ADDED
@@ -0,0 +1,12 @@
|
|
1
|
+
language: ruby
|
2
|
+
rvm:
|
3
|
+
- 1.9.2
|
4
|
+
# - 1.9.3
|
5
|
+
# - 1.8.7
|
6
|
+
# - jruby-19mode # JRuby in 1.9 mode
|
7
|
+
# - rbx-19mode
|
8
|
+
# - jruby-18mode # JRuby in 1.8 mode
|
9
|
+
# - rbx-18mode
|
10
|
+
|
11
|
+
# uncomment this line if your project needs to run something other than `rake`:
|
12
|
+
# script: bundle exec rspec spec
|
data/LICENSE.txt
ADDED
@@ -0,0 +1,20 @@
|
|
1
|
+
Copyright (c) 2011-2013 Pjotr Prins
|
2
|
+
|
3
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
4
|
+
a copy of this software and associated documentation files (the
|
5
|
+
"Software"), to deal in the Software without restriction, including
|
6
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
7
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
8
|
+
permit persons to whom the Software is furnished to do so, subject to
|
9
|
+
the following conditions:
|
10
|
+
|
11
|
+
The above copyright notice and this permission notice shall be
|
12
|
+
included in all copies or substantial portions of the Software.
|
13
|
+
|
14
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
15
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
16
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
17
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
18
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
19
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
20
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/README.md
CHANGED
@@ -8,31 +8,119 @@ computing in biology.
|
|
8
8
|
BigBio may use BioLib C/C++/D functions for increasing performance and
|
9
9
|
reducing memory consumption.
|
10
10
|
|
11
|
-
|
12
|
-
|
11
|
+
In a way, this is an experimental project. I use it for
|
12
|
+
experimentation, but what is in here should work fine. If you wish to
|
13
|
+
contribute, subscribe to the BioRuby and/or BioLib mailing lists
|
14
|
+
instead.
|
13
15
|
|
14
16
|
# Overview
|
15
17
|
|
16
18
|
* BigBio can translate nucleotide sequences to amino acid
|
17
19
|
sequences using an EMBOSS C function, or BioRuby's translator.
|
20
|
+
* BigBio has a terrific FASTA file emitter which iterates FASTA files and
|
21
|
+
iterates sequences without loading everything in memory. There is
|
22
|
+
also an indexed edition
|
23
|
+
* BigBio has a flexible FASTA filter
|
18
24
|
* BigBio has an ORF emitter which parses DNA/RNA sequences and emits
|
19
25
|
ORFs between START_STOP or STOP_STOP codons.
|
20
|
-
* BigBio has a
|
21
|
-
iterates sequences without loading everything in memory.
|
26
|
+
* BigBio has a Phylip (PAML style) emitter and writer
|
22
27
|
|
23
|
-
#
|
28
|
+
# Installation
|
29
|
+
|
30
|
+
The easy way
|
31
|
+
|
32
|
+
```sh
|
33
|
+
gem install bio-bigbio
|
34
|
+
```
|
35
|
+
|
36
|
+
in your code
|
37
|
+
|
38
|
+
```ruby
|
39
|
+
require 'bigbio'
|
40
|
+
```
|
41
|
+
|
42
|
+
# Command line tools
|
43
|
+
|
44
|
+
Some functionality comes also as executable command line tools (see the
|
45
|
+
./bin directory). Use the -h switch to get information. Current tools
|
46
|
+
are
|
47
|
+
|
48
|
+
1. getorf: fetch all areas between start-stop and stop-stop codons in six frames (using EMBOSS when biolib is available)
|
49
|
+
2. nt2aa.rb: translate in six frames (using EMBOSS when biolib is available)
|
50
|
+
3. fasta_filter.rb
|
51
|
+
|
52
|
+
## Command line Fasta Filter
|
53
|
+
|
54
|
+
The CLI filter accepts standard Ruby expressions.
|
55
|
+
|
56
|
+
Filter sequences that contain more than 25% C's
|
57
|
+
|
58
|
+
```sh
|
59
|
+
fasta_filter.rb --filter "rec.seq.count('C') > rec.seq.size*0.25" test/data/fasta/nt.fa
|
60
|
+
```
|
61
|
+
|
62
|
+
Look for IDs containing -126 or sequences ending in CCC
|
63
|
+
|
64
|
+
```sh
|
65
|
+
fasta_filter.rb --filter "rec.id =~ /-126/ or rec.seq =~ /CCC$/" test/data/fasta/nt.fa
|
66
|
+
```
|
67
|
+
|
68
|
+
Filter out all masked sequences that contain more than 10% masked
|
69
|
+
nucleotides
|
70
|
+
|
71
|
+
```sh
|
72
|
+
fasta_filter.rb --filter "rec.seq.count('N')<rec.seq.size*0.10"
|
73
|
+
```
|
74
|
+
|
75
|
+
Next to rec.id and rec.seq, you have rec.descr and 'num' as variables,
|
76
|
+
so to skip every other record
|
77
|
+
|
78
|
+
```sh
|
79
|
+
fasta_filter.rb --filter "num % 2 == 0"
|
80
|
+
```
|
81
|
+
|
82
|
+
To rewrite all sequences to lower case, you can use the rewrite
|
83
|
+
option
|
84
|
+
|
85
|
+
```sh
|
86
|
+
fasta_filter.rb --rewrite 'rec.seq = rec.seq.downcase'
|
87
|
+
```
|
88
|
+
|
89
|
+
Filters and rewrites can be combined. The rest is up to your imagination!
|
90
|
+
|
91
|
+
# API Examples
|
24
92
|
|
25
93
|
## Iterate through a FASTA file
|
26
94
|
|
27
95
|
Read a file without loading the whole thing in memory
|
28
96
|
|
29
97
|
```ruby
|
98
|
+
require 'bigbio'
|
99
|
+
|
30
100
|
fasta = FastaReader.new(fn)
|
31
101
|
fasta.each do | rec |
|
32
102
|
print rec.descr,rec.seq
|
33
103
|
end
|
34
104
|
```
|
35
105
|
|
106
|
+
Since FastaReader parses the ID, write a tab file with id and sequence
|
107
|
+
|
108
|
+
```ruby
|
109
|
+
i = 1
|
110
|
+
print "num\tid\tseq\n"
|
111
|
+
FastaReader.new(fn).each do | rec |
|
112
|
+
if rec.id =~ /(AT\w+)/
|
113
|
+
print i,"\t",$1,"\t",rec.seq,"\n"
|
114
|
+
i += 1
|
115
|
+
end
|
116
|
+
end
|
117
|
+
```
|
118
|
+
|
119
|
+
which, for example, can be turned into RDF with the
|
120
|
+
[bio-table](https://github.com/pjotrp/bioruby-table) biogem.
|
121
|
+
|
122
|
+
## Write a FASTA file
|
123
|
+
|
36
124
|
Write a FASTA file. The simple way
|
37
125
|
|
38
126
|
```ruby
|
@@ -60,6 +148,44 @@ fasta = FastaWriter.new(fn)
|
|
60
148
|
fasta.write(mysequence)
|
61
149
|
```
|
62
150
|
|
151
|
+
## Transform a FASTA file
|
152
|
+
|
153
|
+
You can combine above FastaReader and FastaWriter to transform
|
154
|
+
sequences, e.g.
|
155
|
+
|
156
|
+
```ruby
|
157
|
+
fasta = FastaWriter.new(out_fn)
|
158
|
+
FastaReader.new(in_fn).each do | rec |
|
159
|
+
# Strip the description down to the second ID
|
160
|
+
id1, id2 = /(\S+)\s+(\S+)/.match(rec.descr).captures
|
161
|
+
fasta.write(id2,rec.seq)
|
162
|
+
end
|
163
|
+
```
|
164
|
+
|
165
|
+
The downside to this approach is the explicit file naming. What if you
|
166
|
+
want to use STDIN or some other source instead? I have come round to
|
167
|
+
the idea of using a combination of lambda and block. For example:
|
168
|
+
|
169
|
+
```ruby
|
170
|
+
FastaReader::emit_fastarecord(-> {gets}) { |rec|
|
171
|
+
print FastaWriter.to_fasta(rec)
|
172
|
+
}
|
173
|
+
```
|
174
|
+
|
175
|
+
which takes STDIN line by line, and outputs FASTA on STDOUT. This is
|
176
|
+
a better design as the FastaReader and FastaWriter know nothing of
|
177
|
+
the mechanism fetching and displaying data. These can both be 'pure'
|
178
|
+
functions. Note also that the data is never fully loaded into RAM.
|
179
|
+
|
180
|
+
Here is the transformer in functional style
|
181
|
+
|
182
|
+
```ruby
|
183
|
+
FastaReader::emit_fastarecord(-> {gets}) { |rec|
|
184
|
+
id1, id2 = /(\S+)\s+(\S+)/.match(rec.descr).captures
|
185
|
+
print FastaWriter.to_fasta(id2,rec.seq)
|
186
|
+
}
|
187
|
+
```
|
188
|
+
|
63
189
|
## Fetch ORFs from a sequence
|
64
190
|
|
65
191
|
BigBio can parse a sequence for ORFs. Together with the FastaReader
|
@@ -83,21 +209,27 @@ translate = Nucleotide::Translate.new(trn_table)
|
|
83
209
|
aa_frames = translate.aa_6_frames("ATCATTAGCAACACCAGCTTCCTCTCTCTCGCTTCAAAGTTCACTACTCGTGGATCTCGT")
|
84
210
|
```
|
85
211
|
|
86
|
-
#
|
212
|
+
# Project home page
|
87
213
|
|
88
|
-
|
214
|
+
For information on the source tree, documentation, examples, issues and
|
215
|
+
how to contribute, see
|
89
216
|
|
90
|
-
|
91
|
-
gem install bio-bigbio
|
92
|
-
```
|
217
|
+
http://github.com/pjotrp/bigbio
|
93
218
|
|
94
|
-
|
219
|
+
The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.
|
95
220
|
|
96
|
-
|
97
|
-
|
98
|
-
|
221
|
+
# Cite
|
222
|
+
|
223
|
+
If you use this software, please cite one of
|
224
|
+
|
225
|
+
* [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
|
226
|
+
* [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
|
227
|
+
|
228
|
+
# Biogems.info
|
229
|
+
|
230
|
+
This Biogem is published at [#bio-bigbio](http://biogems.info/index.html)
|
99
231
|
|
100
232
|
# Copyright
|
101
233
|
|
102
|
-
Copyright (c) 2011-
|
234
|
+
Copyright (c) 2011-2013 Pjotr Prins. See LICENSE.txt for further details.
|
103
235
|
|
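The lambda-plus-block design described in the README can be sketched without the gem. This is a minimal illustration only: `Rec`, `each_fasta_record` and `format_fasta` are hypothetical stand-ins, not the gem's `FastaRecord`, `FastaReader::emit_fastarecord` or `FastaWriter.to_fasta`, and the input is an in-memory `StringIO` rather than STDIN.

```ruby
require 'stringio'

# Hypothetical stand-in for a FASTA record (not the gem's FastaRecord).
Rec = Struct.new(:id, :descr, :seq)

# Simplified stand-in for the emitter: calls get_line until it returns nil
# and yields one record per '>' header line.
def each_fasta_record(get_line)
  descr = nil
  seq = ""
  while (line = get_line.call)
    line = line.strip
    if line.start_with?(">")
      yield Rec.new(descr[/\S+/], descr, seq) if descr
      descr = line[1..-1].strip
      seq = ""
    else
      seq += line
    end
  end
  yield Rec.new(descr[/\S+/], descr, seq) if descr
end

# Hypothetical formatter: one header line, one sequence line.
def format_fasta(descr, seq)
  ">#{descr}\n#{seq}\n"
end

# The reader only sees a lambda, so any line source will do.
input = StringIO.new(">id1 id2 some text\nACGT\nTTGA\n>id3 id4\nCCCC\n")
output = []
each_fasta_record(-> { input.gets }) do |rec|
  # Strip the description down to the second ID.
  _first, second = /(\S+)\s+(\S+)/.match(rec.descr).captures
  output << format_fasta(second, rec.seq)
end
print output.join
```

Swapping `input.gets` for `$stdin.gets` or `ARGF.gets` changes the data source without touching the parsing or formatting code, which is the point of the design.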
data/Rakefile
CHANGED
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.1.
|
1
|
+
0.1.5
|
data/bin/fasta_filter.rb
ADDED
@@ -0,0 +1,100 @@
|
|
1
|
+
#! /usr/bin/env ruby
|
2
|
+
#
|
3
|
+
# Filter for FASTA files
|
4
|
+
#
|
5
|
+
|
6
|
+
$: << File.dirname(__FILE__)+'/../lib'
|
7
|
+
|
8
|
+
require 'bigbio'
|
9
|
+
require 'optparse'
|
10
|
+
require 'ostruct'
|
11
|
+
|
12
|
+
class OptParser
|
13
|
+
#
|
14
|
+
# Return a structure describing the options.
|
15
|
+
#
|
16
|
+
def self.parse(args)
|
17
|
+
# The options specified on the command line will be collected in *options*.
|
18
|
+
# We set default values here.
|
19
|
+
options = OpenStruct.new
|
20
|
+
options.codonize = false
|
21
|
+
options.verbose = false
|
22
|
+
|
23
|
+
opt_parser = OptionParser.new do |opts|
|
24
|
+
opts.banner = "Usage: fasta_filter.rb [options]"
|
25
|
+
|
26
|
+
opts.separator ""
|
27
|
+
opts.separator "Specific options:"
|
28
|
+
|
29
|
+
opts.on("--filter expression","Filter on Ruby expression") do |expr|
|
30
|
+
options.filter = expr
|
31
|
+
end
|
32
|
+
|
33
|
+
opts.on("--rewrite expression","Rewrite expression") do |expr|
|
34
|
+
options.rewrite = expr
|
35
|
+
end
|
36
|
+
|
37
|
+
opts.on("--codonize",
|
38
|
+
"Trim sequence to be at multiple of 3 nucleotides") do |b|
|
39
|
+
options.codonize = b
|
40
|
+
end
|
41
|
+
|
42
|
+
opts.on("--min size",
|
43
|
+
"Set minimum sequence size") do |min|
|
44
|
+
options.min = min.to_i
|
45
|
+
end
|
46
|
+
|
47
|
+
opts.on("--id","Write out ID only") do |b|
|
48
|
+
options.id = b
|
49
|
+
end
|
50
|
+
|
51
|
+
opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
|
52
|
+
options.verbose = v
|
53
|
+
end
|
54
|
+
|
55
|
+
opts.separator ""
|
56
|
+
opts.separator "Examples:"
|
57
|
+
opts.separator ""
|
58
|
+
opts.separator " fasta_filter.rb --filter \"rec.id =~ /-126/ or rec.seq =~ /CCC$/\" test/data/fasta/nt.fa"
|
59
|
+
opts.separator " fasta_filter.rb --filter \"rec.seq.count('C') > rec.seq.size*0.25\" test/data/fasta/nt.fa"
|
60
|
+
opts.separator " fasta_filter.rb --filter \"rec.descr =~ /C. elegans/\" test/data/fasta/nt.fa"
|
61
|
+
opts.separator " fasta_filter.rb --filter \"num % 2 == 0\" test/data/fasta/nt.fa"
|
62
|
+
opts.separator " fasta_filter.rb test/data/fasta/nt.fa --rewrite 'rec.seq.downcase!'"
|
63
|
+
opts.separator ""
|
64
|
+
opts.separator "Other options:"
|
65
|
+
opts.separator ""
|
66
|
+
|
67
|
+
opts.on_tail("-h", "--help", "Show this message") do
|
68
|
+
puts opts
|
69
|
+
exit
|
70
|
+
end
|
71
|
+
|
72
|
+
end
|
73
|
+
|
74
|
+
opt_parser.parse!(args)
|
75
|
+
options
|
76
|
+
end # parse()
|
77
|
+
end # class OptParser
|
78
|
+
|
79
|
+
options = OptParser.parse(ARGV)
|
80
|
+
|
81
|
+
num = -1
|
82
|
+
FastaReader::emit_fastarecord(-> { ARGF.gets }) { | rec |
|
83
|
+
num += 1
|
84
|
+
# --- Filtering
|
85
|
+
next if options.filter and not eval(options.filter)
|
86
|
+
if options.codonize
|
87
|
+
# --- Trim sequence down to a multiple of 3 nucleotides
|
88
|
+
size = rec.seq.size
|
89
|
+
rec.seq = rec.seq[0, size - (size % 3)]
|
90
|
+
end
|
91
|
+
# --- Only use sequences from MIN size
|
92
|
+
next if options.min and rec.seq.size < options.min
|
93
|
+
# --- Truncate description to ID
|
94
|
+
rec.descr = rec.id if options.id
|
95
|
+
|
96
|
+
# --- rewrite
|
97
|
+
eval(options.rewrite) if options.rewrite
|
98
|
+
print rec.to_fasta
|
99
|
+
}
|
100
|
+
|
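The `--filter` and `--rewrite` options work because `eval` runs the expression string in the scope where `rec` and `num` are local variables. The following sketch shows that mechanism in isolation. `FilterRec` and `filter_records` are hypothetical names introduced for illustration; unlike the script, the sketch copies each record so the input array is left unchanged.

```ruby
# Hypothetical record type standing in for the gem's FastaRecord.
FilterRec = Struct.new(:id, :descr, :seq)

# Apply optional filter and rewrite expressions (Ruby source strings).
# Both strings are evaluated where `rec` and `num` are local variables.
def filter_records(records, filter: nil, rewrite: nil)
  kept = []
  records.each_with_index do |rec, num|
    rec = rec.dup                         # work on a copy of the record
    next if filter && !eval(filter)       # drop records failing the filter
    eval(rewrite) if rewrite              # let the expression modify rec
    kept << rec
  end
  kept
end

records = [
  FilterRec.new("a-126", "a-126 first",  "ACCC"),
  FilterRec.new("b-1",   "b-1 second",   "ACGTNNNN"),
  FilterRec.new("c-2",   "c-2 third",    "GGGCCC")
]

by_id_or_tail = filter_records(records, filter: "rec.id =~ /-126/ or rec.seq =~ /CCC$/")
even          = filter_records(records, filter: "num % 2 == 0")
lower         = filter_records(records, rewrite: "rec.seq = rec.seq.downcase")

p by_id_or_tail.map(&:id)
p even.map(&:id)
p lower.map(&:seq)
```

Because `eval` executes arbitrary Ruby, this approach is only suitable when the expression comes from a trusted command line.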
data/bin/fasta_sort.rb
ADDED
@@ -0,0 +1,24 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
#
|
3
|
+
# fasta_sort: Sorts a FASTA file and outputs sorted unique records as FASTA again
|
4
|
+
#
|
5
|
+
# Usage:
|
6
|
+
#
|
7
|
+
# fasta_sort inputfile(s)
|
8
|
+
|
9
|
+
require 'bio'
|
10
|
+
|
11
|
+
include Bio
|
12
|
+
|
13
|
+
table = Hash.new
|
14
|
+
ARGV.each do | fn |
|
15
|
+
Bio::FlatFile.auto(fn).each do | seq |
|
16
|
+
table[seq.definition] ||= seq.data
|
17
|
+
end
|
18
|
+
end
|
19
|
+
|
20
|
+
table.sort.each do | definition, data |
|
21
|
+
rec = Bio::FastaFormat.new('> '+definition.strip+"\n"+data)
|
22
|
+
print rec
|
23
|
+
end
|
24
|
+
|
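The sort-and-deduplicate logic of `fasta_sort.rb` rests on two plain Ruby idioms: `||=` keeps the first value stored under a key, and `Hash#sort` orders the entries by key. A minimal sketch without BioRuby, using made-up `[definition, sequence]` pairs in place of parsed records:

```ruby
# Hypothetical input: [definition, sequence] pairs as a parser might deliver them.
pairs = [
  ["seq2 second", "GGGG"],
  ["seq1 first",  "ACGT"],
  ["seq2 second", "TTTT"]   # duplicate definition; the first one wins
]

table = {}
pairs.each do |definition, data|
  table[definition] ||= data   # keep only the first record per definition
end

# Hash#sort returns [key, value] pairs ordered by key (the definition line).
sorted_fasta = table.sort.map do |definition, data|
  ">#{definition.strip}\n#{data}\n"
end.join

print sorted_fasta
```

Note that this approach holds every record in memory until the sort has finished, unlike the streaming emitters elsewhere in this release.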
data/bin/getorf
CHANGED
@@ -6,12 +6,8 @@
|
|
6
6
|
# (aa_heuristic.fa and nt_heuristic.fa respectively)
|
7
7
|
#
|
8
8
|
# You can choose the heuristic on the command line (default stopstop).
|
9
|
-
|
10
|
-
|
11
|
-
# Copyright:: 2009-2011
|
12
|
-
# License:: Ruby License
|
13
|
-
#
|
14
|
-
# Copyright (C) 2009-2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
9
|
+
|
10
|
+
$stderr.print "WARNING: This tool has one or more known bugs! Better use the EMBOSS getorf instead for now\n"
|
15
11
|
|
16
12
|
rootpath = File.dirname(File.dirname(__FILE__))
|
17
13
|
$: << File.join(rootpath,'lib')
|
@@ -48,10 +44,10 @@ EXAMPLE
|
|
48
44
|
exit()
|
49
45
|
}
|
50
46
|
|
51
|
-
opts.on("-h heuristic", String, "Heuristic (
|
47
|
+
opts.on("-h heuristic", String, "Heuristic (default #{heuristic})") do | s |
|
52
48
|
heuristic = s
|
53
49
|
end
|
54
|
-
opts.on("-s size", "--min-size", Integer, "Minimal sequence size") do | n |
|
50
|
+
opts.on("-s size", "--min-size", Integer, "Minimal sequence size (default #{minsize})") do | n |
|
55
51
|
minsize = n
|
56
52
|
end
|
57
53
|
opts.on("--longest", "Only get longest ORF match") do
|
data/bin/nt2aa.rb
CHANGED
@@ -3,11 +3,6 @@
|
|
3
3
|
# Translate nucleotide sequences into amino acid sequences in all
|
4
4
|
# reading frames.
|
5
5
|
#
|
6
|
-
#
|
7
|
-
# (: pjotrp 2009, 2012 rblicense :)
|
8
|
-
#
|
9
|
-
# Copyright (C) 2012 Pjotr Prins <pjotr.prins@thebird.nl>
|
10
|
-
|
11
6
|
USAGE =<<EOM
|
12
7
|
ruby #{__FILE__} [--six-frame] inputfile(s)
|
13
8
|
EOM
|
@@ -44,7 +39,9 @@ ARGV.each do | fn |
|
|
44
39
|
|
45
40
|
# ajpseqt = Biolib::Emboss.ajTrnSeqOrig(trnTable,ajpseq,frame)
|
46
41
|
# aa = Biolib::Emboss.ajSeqGetSeqCopyC(ajpseqt)
|
47
|
-
print ">
|
42
|
+
print ">",rec.descr
|
43
|
+
print " [",frame.to_s,"]" if do_sixframes
|
44
|
+
print "\n"
|
48
45
|
print aa,"\n"
|
49
46
|
end
|
50
47
|
}
|
data/bio-bigbio.gemspec
CHANGED
@@ -5,25 +5,28 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = "bio-bigbio"
|
8
|
-
s.version = "0.1.
|
8
|
+
s.version = "0.1.5"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Pjotr Prins"]
|
12
|
-
s.date = "
|
12
|
+
s.date = "2013-05-03"
|
13
13
|
s.description = "Fasta reader, ORF emitter, sequence translation"
|
14
14
|
s.email = "pjotr.public01@thebird.nl"
|
15
|
-
s.executables = ["getorf", "nt2aa.rb"]
|
15
|
+
s.executables = ["fasta_filter.rb", "fasta_sort.rb", "getorf", "nt2aa.rb"]
|
16
16
|
s.extra_rdoc_files = [
|
17
|
-
"LICENSE",
|
17
|
+
"LICENSE.txt",
|
18
18
|
"README.md"
|
19
19
|
]
|
20
20
|
s.files = [
|
21
|
+
".travis.yml",
|
21
22
|
"Gemfile",
|
22
23
|
"Gemfile.lock",
|
23
|
-
"LICENSE",
|
24
|
+
"LICENSE.txt",
|
24
25
|
"README.md",
|
25
26
|
"Rakefile",
|
26
27
|
"VERSION",
|
28
|
+
"bin/fasta_filter.rb",
|
29
|
+
"bin/fasta_sort.rb",
|
27
30
|
"bin/getorf",
|
28
31
|
"bin/nt2aa.rb",
|
29
32
|
"bio-bigbio.gemspec",
|
@@ -42,6 +45,7 @@ Gem::Specification.new do |s|
|
|
42
45
|
"lib/bigbio/db/fasta/fastarecord.rb",
|
43
46
|
"lib/bigbio/db/fasta/fastawriter.rb",
|
44
47
|
"lib/bigbio/db/fasta/indexer.rb",
|
48
|
+
"lib/bigbio/db/phylip.rb",
|
45
49
|
"lib/bigbio/environment.rb",
|
46
50
|
"lib/bigbio/sequence/predictorf.rb",
|
47
51
|
"lib/bigbio/sequence/translate.rb",
|
data/lib/bigbio/db/fasta/fastareader.rb
CHANGED
@@ -130,3 +130,38 @@ class FastaReader
|
|
130
130
|
end
|
131
131
|
|
132
132
|
end
|
133
|
+
|
134
|
+
# The following is actually a module/trait implementation without state
|
135
|
+
|
136
|
+
class FastaReader
|
137
|
+
|
138
|
+
# func passes in a FASTA buffer. Every time a record is parsed it is
|
139
|
+
# yielded.
|
140
|
+
#
|
141
|
+
def FastaReader::emit getbuf_func
|
142
|
+
seq = ""
|
143
|
+
id = nil
|
144
|
+
descr = nil
|
145
|
+
while buf = getbuf_func.call
|
146
|
+
buf.split(/\n/).each do | line |
|
147
|
+
if line =~ /^>/
|
148
|
+
yield id, descr, seq if descr
|
149
|
+
descr = line[1..-1].strip
|
150
|
+
matched = /^(\S+)/.match(descr)
|
151
|
+
id = matched[0]
|
152
|
+
seq = ""
|
153
|
+
else
|
154
|
+
seq += line.strip
|
155
|
+
end
|
156
|
+
end
|
157
|
+
end
|
158
|
+
yield id, descr, seq if descr and seq.size > 0
|
159
|
+
end
|
160
|
+
|
161
|
+
def FastaReader::emit_fastarecord getbuf_func
|
162
|
+
emit(getbuf_func) do | id, descr, seq |
|
163
|
+
yield FastaRecord.new(id, descr, seq)
|
164
|
+
end
|
165
|
+
end
|
166
|
+
|
167
|
+
end
|
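The emitter above loops until the supplied lambda returns `nil`, and each returned buffer may hold one line or many. The sketch below mirrors that logic in a standalone function to show two line sources producing the same records. `emit_fasta` and `FASTA_TEXT` are illustrative names and a simplified stand-in, not the gem's own method.

```ruby
require 'stringio'

# Simplified stand-in mirroring the emit logic above.
def emit_fasta(getbuf_func)
  seq = ""
  id = nil
  descr = nil
  while (buf = getbuf_func.call)          # nil from the lambda ends the loop
    buf.split(/\n/).each do |line|
      if line.start_with?(">")
        yield id, descr, seq if descr     # emit the previous record
        descr = line[1..-1].strip
        id = descr[/^\S+/]
        seq = ""
      else
        seq += line.strip
      end
    end
  end
  yield id, descr, seq if descr && !seq.empty?
end

FASTA_TEXT = ">r1 first record\nACGT\nAC\n>r2 second record\nGGTT\n"

# Source 1: one line per call; IO#gets returns nil at end of input.
io = StringIO.new(FASTA_TEXT)
by_line = []
emit_fasta(-> { io.gets }) { |id, descr, seq| by_line << [id, descr, seq] }

# Source 2: the whole text in one call, then nil (Array#shift on an empty array).
chunks = [FASTA_TEXT]
by_chunk = []
emit_fasta(-> { chunks.shift }) { |id, descr, seq| by_chunk << [id, descr, seq] }

p by_line
p by_chunk
```

A lambda that never returns `nil`, such as one that re-opens and reads the file on every call, keeps the loop running until the caller breaks out of the block. Buffers should also end on line boundaries, since a header split across two buffers would be parsed as two separate lines.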
data/lib/bigbio/db/fasta/fastarecord.rb
CHANGED
@@ -7,6 +7,10 @@ class FastaRecord
|
|
7
7
|
@descr = descr
|
8
8
|
@seq = seq
|
9
9
|
end
|
10
|
+
|
11
|
+
def to_fasta
|
12
|
+
">"+@descr+"\n"+@seq+"\n"
|
13
|
+
end
|
10
14
|
end
|
11
15
|
|
12
16
|
class FastaPairedRecord
|
@@ -30,7 +34,9 @@ class FastaPairedRecord
|
|
30
34
|
if nt.seq.size == aa.seq.size*3-3
|
31
35
|
aa.seq.chop!
|
32
36
|
end
|
33
|
-
|
37
|
+
nt_size = nt.seq.size
|
38
|
+
expected_size = aa.seq.size*3
|
39
|
+
# raise "Sequence size mismatch for #{nt.id} <nt:#{nt.seq.size} != #{aa.seq.size*3} (aa:#{aa.seq.size}*3)>" if expected_size - 3 > nt_size and nt_size > expected_size + 3
|
34
40
|
end
|
35
41
|
|
36
42
|
def id
|
data/lib/bigbio/db/phylip.rb
ADDED
@@ -0,0 +1,49 @@
|
|
1
|
+
# Simple phylip reader. Supports PAML style files formatted as
|
2
|
+
#
|
3
|
+
# sequence 1
|
4
|
+
# AAGCTTCACCGGCGCAGTCATTCTCATAAT
|
5
|
+
# CGCCCACGGACTTACATCCTCATTACTATT
|
6
|
+
# sequence 2
|
7
|
+
# AAGCTTCACCGGCGCAATTATCCTCATAAT
|
8
|
+
# CGCCCACGGACTTACATCCTCATTATTATT
|
9
|
+
# sequence 3
|
10
|
+
# AAGCTTCACCGGCGCAGTTGTTCTTATAAT
|
11
|
+
# TGCCCACGGACTTACATCATCATTATTATT
|
12
|
+
# sequence 4
|
13
|
+
# AAGCTTCACCGGCGCAACCACCCTCATGAT
|
14
|
+
# TGCCCATGGACTCACATCCTCCCTACTGTT
|
15
|
+
|
16
|
+
module Bio
|
17
|
+
module Big
|
18
|
+
module PhylipReader
|
19
|
+
# Define get_line as a lambda function, e.g.
|
20
|
+
# Bio::Big::PhylipReader.emit_seq(-> { lines.next }) { | name, seq | p [name,seq] }
|
21
|
+
|
22
|
+
def PhylipReader::emit_seq get_line
|
23
|
+
line = get_line.call.strip
|
24
|
+
a = line.split
|
25
|
+
seq_num = a[0].to_i
|
26
|
+
seq_size = a[1].to_i
|
27
|
+
name = nil
|
28
|
+
seq = ""
|
29
|
+
while true
|
30
|
+
line = get_line.call
|
31
|
+
break if line == nil or line == ""
|
32
|
+
line = line.strip
|
33
|
+
if name == nil
|
34
|
+
name = line
|
35
|
+
next
|
36
|
+
end
|
37
|
+
seq += line
|
38
|
+
if seq.size >= seq_size
|
39
|
+
raise "Name wrong size for #{name}" if name.size > 20
|
40
|
+
raise "Sequence wrong size for #{name}" if seq.size > seq_size
|
41
|
+
yield name, seq
|
42
|
+
name = nil
|
43
|
+
seq = ""
|
44
|
+
end
|
45
|
+
end
|
46
|
+
end
|
47
|
+
end
|
48
|
+
end
|
49
|
+
end
|
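A lambda-fed reader for the PAML-style layout can be sketched as follows. This is a simplified stand-in under stated assumptions, not the module above: `emit_paml_seqs` and `PAML_LINES` are hypothetical names, the header is assumed to hold the sequence count and sequence length, and blank lines are skipped rather than ending the parse.

```ruby
# Simplified stand-in for a PAML-style reader; names are illustrative only.
def emit_paml_seqs(get_line)
  header = get_line.call.split
  seq_size = header[1].to_i             # second header field: sequence length
  name = nil
  seq = ""
  while (line = get_line.call)          # nil from the lambda ends the loop
    line = line.strip
    next if line.empty?
    if name.nil?
      name = line                       # first line of a record is its name
      next
    end
    seq += line
    next if seq.size < seq_size         # keep collecting sequence lines
    raise "Sequence wrong size for #{name}" if seq.size > seq_size
    yield name, seq
    name = nil
    seq = ""
  end
end

PAML_LINES = [
  "  2  12",
  "sequence 1",
  "AAGCTT",
  "CACCGG",
  "sequence 2",
  "AAGCTT",
  "CACCGA"
]

# Array#shift returns nil once the queue is empty, which ends the loop.
queue = PAML_LINES.dup
parsed = []
emit_paml_seqs(-> { queue.shift }) { |name, seq| parsed << [name, seq] }
p parsed
```

A lambda built on `Enumerator#next` behaves differently at the end of input: it raises `StopIteration` instead of returning `nil`, so the caller would need to rescue that exception or wrap the call in `loop`.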
data/spec/emitter_spec.rb
CHANGED
@@ -20,6 +20,23 @@ describe Bio::Big::FastaEmitter, "when using the emitter" do
|
|
20
20
|
end
|
21
21
|
end
|
22
22
|
|
23
|
+
it "should emit functional style" do
|
24
|
+
count = 0
|
25
|
+
FastaReader::emit_fastarecord(-> { File.open("test/data/fasta/nt.fa").read }) { |rec|
|
26
|
+
case count
|
27
|
+
when 0
|
28
|
+
rec.id.should == "PUT-157a-Arabidopsis_thaliana-1"
|
29
|
+
rec.seq[0..10].should == "AGGTTCGNACG"
|
30
|
+
when 1
|
31
|
+
rec.id.should == "PUT-157a-Arabidopsis_thaliana-2"
|
32
|
+
rec.seq[0..10].should == "AGACAAACGAC"
|
33
|
+
else
|
34
|
+
break
|
35
|
+
end
|
36
|
+
count += 1
|
37
|
+
}
|
38
|
+
end
|
39
|
+
|
23
40
|
it "should emit large parts" do
|
24
41
|
FastaEmitter.new("test/data/fasta/nt.fa").emit_seq do | part, index, tag, seq |
|
25
42
|
# p [index, part, tag, seq]
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: bio-bigbio
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.5
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,11 +9,11 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date:
|
12
|
+
date: 2013-05-03 00:00:00.000000000Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: bio
|
16
|
-
requirement: &
|
16
|
+
requirement: &27203900 !ruby/object:Gem::Requirement
|
17
17
|
none: false
|
18
18
|
requirements:
|
19
19
|
- - ! '>='
|
@@ -21,10 +21,10 @@ dependencies:
|
|
21
21
|
version: 1.4.1
|
22
22
|
type: :runtime
|
23
23
|
prerelease: false
|
24
|
-
version_requirements: *
|
24
|
+
version_requirements: *27203900
|
25
25
|
- !ruby/object:Gem::Dependency
|
26
26
|
name: bio-logger
|
27
|
-
requirement: &
|
27
|
+
requirement: &27203120 !ruby/object:Gem::Requirement
|
28
28
|
none: false
|
29
29
|
requirements:
|
30
30
|
- - ! '>='
|
@@ -32,10 +32,10 @@ dependencies:
|
|
32
32
|
version: 0.9.0
|
33
33
|
type: :runtime
|
34
34
|
prerelease: false
|
35
|
-
version_requirements: *
|
35
|
+
version_requirements: *27203120
|
36
36
|
- !ruby/object:Gem::Dependency
|
37
37
|
name: rspec
|
38
|
-
requirement: &
|
38
|
+
requirement: &27202300 !ruby/object:Gem::Requirement
|
39
39
|
none: false
|
40
40
|
requirements:
|
41
41
|
- - ~>
|
@@ -43,10 +43,10 @@ dependencies:
|
|
43
43
|
version: 2.3.0
|
44
44
|
type: :development
|
45
45
|
prerelease: false
|
46
|
-
version_requirements: *
|
46
|
+
version_requirements: *27202300
|
47
47
|
- !ruby/object:Gem::Dependency
|
48
48
|
name: bundler
|
49
|
-
requirement: &
|
49
|
+
requirement: &27201380 !ruby/object:Gem::Requirement
|
50
50
|
none: false
|
51
51
|
requirements:
|
52
52
|
- - ~>
|
@@ -54,10 +54,10 @@ dependencies:
|
|
54
54
|
version: 1.0.0
|
55
55
|
type: :development
|
56
56
|
prerelease: false
|
57
|
-
version_requirements: *
|
57
|
+
version_requirements: *27201380
|
58
58
|
- !ruby/object:Gem::Dependency
|
59
59
|
name: jeweler
|
60
|
-
requirement: &
|
60
|
+
requirement: &27200760 !ruby/object:Gem::Requirement
|
61
61
|
none: false
|
62
62
|
requirements:
|
63
63
|
- - ~>
|
@@ -65,10 +65,10 @@ dependencies:
|
|
65
65
|
version: 1.5.2
|
66
66
|
type: :development
|
67
67
|
prerelease: false
|
68
|
-
version_requirements: *
|
68
|
+
version_requirements: *27200760
|
69
69
|
- !ruby/object:Gem::Dependency
|
70
70
|
name: rcov
|
71
|
-
requirement: &
|
71
|
+
requirement: &27199840 !ruby/object:Gem::Requirement
|
72
72
|
none: false
|
73
73
|
requirements:
|
74
74
|
- - ! '>='
|
@@ -76,23 +76,28 @@ dependencies:
|
|
76
76
|
version: '0'
|
77
77
|
type: :development
|
78
78
|
prerelease: false
|
79
|
-
version_requirements: *
|
79
|
+
version_requirements: *27199840
|
80
80
|
description: Fasta reader, ORF emitter, sequence translation
|
81
81
|
email: pjotr.public01@thebird.nl
|
82
82
|
executables:
|
83
|
+
- fasta_filter.rb
|
84
|
+
- fasta_sort.rb
|
83
85
|
- getorf
|
84
86
|
- nt2aa.rb
|
85
87
|
extensions: []
|
86
88
|
extra_rdoc_files:
|
87
|
-
- LICENSE
|
89
|
+
- LICENSE.txt
|
88
90
|
- README.md
|
89
91
|
files:
|
92
|
+
- .travis.yml
|
90
93
|
- Gemfile
|
91
94
|
- Gemfile.lock
|
92
|
-
- LICENSE
|
95
|
+
- LICENSE.txt
|
93
96
|
- README.md
|
94
97
|
- Rakefile
|
95
98
|
- VERSION
|
99
|
+
- bin/fasta_filter.rb
|
100
|
+
- bin/fasta_sort.rb
|
96
101
|
- bin/getorf
|
97
102
|
- bin/nt2aa.rb
|
98
103
|
- bio-bigbio.gemspec
|
@@ -111,6 +116,7 @@ files:
|
|
111
116
|
- lib/bigbio/db/fasta/fastarecord.rb
|
112
117
|
- lib/bigbio/db/fasta/fastawriter.rb
|
113
118
|
- lib/bigbio/db/fasta/indexer.rb
|
119
|
+
- lib/bigbio/db/phylip.rb
|
114
120
|
- lib/bigbio/environment.rb
|
115
121
|
- lib/bigbio/sequence/predictorf.rb
|
116
122
|
- lib/bigbio/sequence/translate.rb
|
@@ -139,7 +145,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
139
145
|
version: '0'
|
140
146
|
segments:
|
141
147
|
- 0
|
142
|
-
hash:
|
148
|
+
hash: 2941883289909211187
|
143
149
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
144
150
|
none: false
|
145
151
|
requirements:
|
data/LICENSE
DELETED
@@ -1,34 +0,0 @@
|
|
1
|
-
If a license is not specified the code contributed to BioBig defaults to the
|
2
|
-
BSD license:
|
3
|
-
|
4
|
-
Copyright (c) 2008, 2009 The BioLib Project
|
5
|
-
All rights reserved.
|
6
|
-
|
7
|
-
Redistribution and use in source and binary forms, with or without
|
8
|
-
modification, are permitted provided that the following conditions are met:
|
9
|
-
|
10
|
-
* Redistributions of source code must retain the above copyright notice,
|
11
|
-
this list of conditions and the following disclaimer.
|
12
|
-
|
13
|
-
* Redistributions in binary form must reproduce the above copyright notice,
|
14
|
-
this list of conditions and the following disclaimer in the documentation
|
15
|
-
and/or other materials provided with the distribution.
|
16
|
-
|
17
|
-
* Neither the name of the The BioLib Project nor the names of
|
18
|
-
its contributors may be used to endorse or promote products derived from
|
19
|
-
this software without specific prior written permission.
|
20
|
-
|
21
|
-
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
22
|
-
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
23
|
-
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
24
|
-
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
|
25
|
-
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
26
|
-
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
27
|
-
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
|
28
|
-
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
29
|
-
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
30
|
-
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
31
|
-
|
32
|
-
For more information on opensource software licenses see
|
33
|
-
http://www.opensource.org/licenses/bsd-license.php,
|
34
|
-
http://www.gnu.org/licenses/gpl.html and http://www.fsf.org/.
|