bio-bigbio 0.1.4 → 0.1.5
- data/.travis.yml +12 -0
- data/LICENSE.txt +20 -0
- data/README.md +147 -15
- data/Rakefile +1 -0
- data/VERSION +1 -1
- data/bin/fasta_filter.rb +100 -0
- data/bin/fasta_sort.rb +24 -0
- data/bin/getorf +4 -8
- data/bin/nt2aa.rb +3 -6
- data/bio-bigbio.gemspec +9 -5
- data/lib/bigbio/db/fasta/fastareader.rb +35 -0
- data/lib/bigbio/db/fasta/fastarecord.rb +7 -1
- data/lib/bigbio/db/phylip.rb +49 -0
- data/spec/emitter_spec.rb +17 -0
- metadata +23 -17
- data/LICENSE +0 -34
data/.travis.yml
ADDED
@@ -0,0 +1,12 @@
|
|
1
|
+
language: ruby
|
2
|
+
rvm:
|
3
|
+
- 1.9.2
|
4
|
+
# - 1.9.3
|
5
|
+
# - 1.8.7
|
6
|
+
# - jruby-19mode # JRuby in 1.9 mode
|
7
|
+
# - rbx-19mode
|
8
|
+
# - jruby-18mode # JRuby in 1.8 mode
|
9
|
+
# - rbx-18mode
|
10
|
+
|
11
|
+
# uncomment this line if your project needs to run something other than `rake`:
|
12
|
+
# script: bundle exec rspec spec
|
data/LICENSE.txt
ADDED
@@ -0,0 +1,20 @@
|
|
1
|
+
Copyright (c) 2011-2013 Pjotr Prins
|
2
|
+
|
3
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
4
|
+
a copy of this software and associated documentation files (the
|
5
|
+
"Software"), to deal in the Software without restriction, including
|
6
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
7
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
8
|
+
permit persons to whom the Software is furnished to do so, subject to
|
9
|
+
the following conditions:
|
10
|
+
|
11
|
+
The above copyright notice and this permission notice shall be
|
12
|
+
included in all copies or substantial portions of the Software.
|
13
|
+
|
14
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
15
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
16
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
17
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
18
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
19
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
20
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/README.md
CHANGED
@@ -8,31 +8,119 @@ computing in biology.
|
|
8
8
|
BigBio may use BioLib C/C++/D functions for increasing performance and
|
9
9
|
reducing memory consumption.
|
10
10
|
|
11
|
-
|
12
|
-
|
11
|
+
In a way, this is an experimental project. I use it for
|
12
|
+
experimentation, but what is in here should work fine. If you wish to
|
13
|
+
contribute, subscribe to the BioRuby and/or BioLib mailing lists
|
14
|
+
instead.
|
13
15
|
|
14
16
|
# Overview
|
15
17
|
|
16
18
|
* BigBio can translate nucleotide sequences to amino acid
|
17
19
|
sequences using an EMBOSS C function, or BioRuby's translator.
|
20
|
+
* BigBio has a terrific FASTA file emitter which iterates FASTA files and
|
21
|
+
iterates sequences without loading everything in memory. There is
|
22
|
+
also an indexed edition
|
23
|
+
* BigBio has a flexible FASTA filter
|
18
24
|
* BigBio has an ORF emitter which parses DNA/RNA sequences and emits
|
19
25
|
ORFs between START_STOP or STOP_STOP codons.
|
20
|
-
* BigBio has a
|
21
|
-
iterates sequences without loading everything in memory.
|
26
|
+
* BigBio has a Phylip (PAML style) emitter and writer
|
22
27
|
|
23
|
-
#
|
28
|
+
# Installation
|
29
|
+
|
30
|
+
The easy way
|
31
|
+
|
32
|
+
```sh
|
33
|
+
gem install bio-bigbio
|
34
|
+
```
|
35
|
+
|
36
|
+
in your code
|
37
|
+
|
38
|
+
```ruby
|
39
|
+
require 'bigbio'
|
40
|
+
```
|
41
|
+
|
42
|
+
# Command line tools
|
43
|
+
|
44
|
+
Some functionality also comes as executable command line tools (see the
|
45
|
+
./bin directory). Use the -h switch to get information. Current tools
|
46
|
+
are
|
47
|
+
|
48
|
+
1. getorf: fetch all areas between start-stop and stop-stop codons in six frames (using EMBOSS when biolib is available)
|
49
|
+
2. nt2aa.rb: translate in six frames (using EMBOSS when biolib is available)
|
50
|
+
3. fasta_filter.rb
|
51
|
+
|
52
|
+
## Command line Fasta Filter
|
53
|
+
|
54
|
+
The CLI filter accepts standard Ruby commands.
|
55
|
+
|
56
|
+
Filter sequences that contain more than 25% C's
|
57
|
+
|
58
|
+
```sh
|
59
|
+
fasta_filter.rb --filter "rec.seq.count('C') > rec.seq.size*0.25" test/data/fasta/nt.fa
|
60
|
+
```
|
61
|
+
|
62
|
+
Look for IDs containing -126 or sequences ending in CCC
|
63
|
+
|
64
|
+
```sh
|
65
|
+
fasta_filter.rb --filter "rec.id =~ /-126/ or rec.seq =~ /CCC$/" test/data/fasta/nt.fa
|
66
|
+
```
|
67
|
+
|
68
|
+
Filter out all masked sequences that contain more than 10% masked
|
69
|
+
nucleotides
|
70
|
+
|
71
|
+
```sh
|
72
|
+
fasta_filter.rb --filter "rec.seq.count('N')<rec.seq.size*0.10"
|
73
|
+
```
|
74
|
+
|
75
|
+
Next to rec.id and rec.seq, you have rec.descr and 'num' as variables,
|
76
|
+
so to skip every other record
|
77
|
+
|
78
|
+
```sh
|
79
|
+
fasta_filter.rb --filter "num % 2 == 0"
|
80
|
+
```
|
81
|
+
|
82
|
+
To rewrite all sequences to lower case, you can use the rewrite
|
83
|
+
option
|
84
|
+
|
85
|
+
```sh
|
86
|
+
fasta_filter.rb --rewrite 'rec.seq = rec.seq.downcase'
|
87
|
+
```
|
88
|
+
|
89
|
+
Filters and rewrites can be combined. The rest is up to your imagination!
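
As a rough illustration of how such expressions can be evaluated, the sketch below runs a filter string and a rewrite string against each record with `eval`, exposing `rec` and `num` as local variables. The `Rec` struct and the sample records are hypothetical stand-ins, not the gem's `FastaRecord` or its test data. Because `eval` runs arbitrary Ruby, expressions should only come from a trusted source.

```ruby
# Hypothetical stand-in for a FASTA record; not the gem's FastaRecord class.
Rec = Struct.new(:id, :descr, :seq)

records = [
  Rec.new("a-126", "a-126 first", "ACCCGT"),
  Rec.new("b-7", "b-7 second", "ATATAT"),
  Rec.new("c-9", "c-9 third", "GGGCCC")
]

# Expression strings as they might be passed on the command line.
filter  = "rec.id =~ /-126/ or rec.seq =~ /CCC$/"
rewrite = "rec.seq = rec.seq.downcase"

kept = []
records.each_with_index do |rec, num|
  # eval sees the block's local variables, so the strings can use rec and num.
  next unless eval(filter)
  eval(rewrite)
  kept << rec
end

kept.each { |rec| puts ">#{rec.descr}\n#{rec.seq}" }
```

The filter is applied before the rewrite here, so the filter expression sees the original sequence.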
|
90
|
+
|
91
|
+
# API Examples
|
24
92
|
|
25
93
|
## Iterate through a FASTA file
|
26
94
|
|
27
95
|
Read a file without loading the whole thing in memory
|
28
96
|
|
29
97
|
```ruby
|
98
|
+
require 'bigbio'
|
99
|
+
|
30
100
|
fasta = FastaReader.new(fn)
|
31
101
|
fasta.each do | rec |
|
32
102
|
print rec.descr,rec.seq
|
33
103
|
end
|
34
104
|
```
|
35
105
|
|
106
|
+
Since FastaReader parses the ID, write a tab file with id and sequence
|
107
|
+
|
108
|
+
```ruby
|
109
|
+
i = 1
|
110
|
+
print "num\tid\tseq\n"
|
111
|
+
FastaReader.new(fn).each do | rec |
|
112
|
+
if rec.id =~ /(AT\w+)/
|
113
|
+
print i,"\t",$1,"\t",rec.seq,"\n"
|
114
|
+
i += 1
|
115
|
+
end
|
116
|
+
end
|
117
|
+
```
|
118
|
+
|
119
|
+
which, for example, can be turned into RDF with the
|
120
|
+
[bio-table](https://github.com/pjotrp/bioruby-table) biogem.
|
121
|
+
|
122
|
+
## Write a FASTA file
|
123
|
+
|
36
124
|
Write a FASTA file. The simple way
|
37
125
|
|
38
126
|
```ruby
|
@@ -60,6 +148,44 @@ fasta = FastaWriter.new(fn)
|
|
60
148
|
fasta.write(mysequence)
|
61
149
|
```
|
62
150
|
|
151
|
+
## Transform a FASTA file
|
152
|
+
|
153
|
+
You can combine the above FastaReader and FastaWriter to transform
|
154
|
+
sequences, e.g.
|
155
|
+
|
156
|
+
```ruby
|
157
|
+
fasta = FastaWriter.new(out_fn)
|
158
|
+
FastaReader.new(in_fn).each do | rec |
|
159
|
+
# Strip the description down to the second ID
|
160
|
+
(id1,id2) = /(\S+)\s+(\S+)/.match(rec.descr).captures
|
161
|
+
fasta.write(id2,rec.seq)
|
162
|
+
end
|
163
|
+
```
|
164
|
+
|
165
|
+
The downside to this approach is the explicit file naming. What if you
|
166
|
+
want to use STDIN or some other source instead? I have come round to
|
167
|
+
the idea of using a combination of lambda and block. For example:
|
168
|
+
|
169
|
+
```ruby
|
170
|
+
FastaReader::emit_fastarecord(-> {gets}) { |rec|
|
171
|
+
print FastaWriter.to_fasta(rec)
|
172
|
+
}
|
173
|
+
```
|
174
|
+
|
175
|
+
which takes STDIN line by line, and outputs FASTA on STDOUT. This is
|
176
|
+
a better design as the FastaReader and FastaWriter know nothing of
|
177
|
+
the mechanism fetching and displaying data. These can both be 'pure'
|
178
|
+
functions. Note also that the data is never fully loaded into RAM.
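
To make the lambda-plus-block idea concrete, here is a minimal stand-alone sketch. `emit_records` is a hypothetical function written for this example, not the gem's `FastaReader::emit_fastarecord`; it only assumes that the callable returns the next line, or `nil` at end of input. A `StringIO` stands in for STDIN to show that the parser does not care where the lines come from.

```ruby
require 'stringio'

# Hypothetical parser for illustration only. next_line is any callable that
# returns the next input line as a String, or nil when input is exhausted.
def emit_records(next_line)
  id = descr = nil
  seq = ""
  while (line = next_line.call)
    line = line.strip
    if line.start_with?(">")
      # A new header closes the previous record, if there was one.
      yield id, descr, seq if descr
      descr = line[1..-1].strip
      id = descr.split.first
      seq = ""
    else
      seq += line
    end
  end
  # Emit the final record once the source runs dry.
  yield id, descr, seq if descr
end

input = StringIO.new(">seq1 first record\nACGT\nTTGA\n>seq2 second record\nGGCC\n")
records = []
emit_records(-> { input.gets }) { |id, descr, seq| records << [id, descr, seq] }
records.each { |id, _descr, seq| puts "#{id}\t#{seq}" }
```

Swapping `-> { input.gets }` for `-> { gets }` or `-> { ARGF.gets }` would read from STDIN or from files named on the command line without touching the parser.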
|
179
|
+
|
180
|
+
Here is the transformer in functional style
|
181
|
+
|
182
|
+
```ruby
|
183
|
+
FastaReader::emit_fastarecord(-> {gets}) { |rec|
|
184
|
+
(id1,id2) = /(\S+)\s+(\S+)/.match(rec.descr).captures
|
185
|
+
print FastaWriter.to_fasta(id2,rec.seq)
|
186
|
+
}
|
187
|
+
```
|
188
|
+
|
63
189
|
## Fetch ORFs from a sequence
|
64
190
|
|
65
191
|
BigBio can parse a sequence for ORFs. Together with the FastaReader
|
@@ -83,21 +209,27 @@ translate = Nucleotide::Translate.new(trn_table)
|
|
83
209
|
aa_frames = translate.aa_6_frames("ATCATTAGCAACACCAGCTTCCTCTCTCTCGCTTCAAAGTTCACTACTCGTGGATCTCGT")
|
84
210
|
```
|
85
211
|
|
86
|
-
#
|
212
|
+
# Project home page
|
87
213
|
|
88
|
-
|
214
|
+
For information on the source tree, documentation, examples, issues and
|
215
|
+
how to contribute, see
|
89
216
|
|
90
|
-
|
91
|
-
gem install bio-bigbio
|
92
|
-
```
|
217
|
+
http://github.com/pjotrp/bigbio
|
93
218
|
|
94
|
-
|
219
|
+
The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.
|
95
220
|
|
96
|
-
|
97
|
-
|
98
|
-
|
221
|
+
# Cite
|
222
|
+
|
223
|
+
If you use this software, please cite one of
|
224
|
+
|
225
|
+
* [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
|
226
|
+
* [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
|
227
|
+
|
228
|
+
# Biogems.info
|
229
|
+
|
230
|
+
This Biogem is published at [#bio-table](http://biogems.info/index.html)
|
99
231
|
|
100
232
|
# Copyright
|
101
233
|
|
102
|
-
Copyright (c) 2011-
|
234
|
+
Copyright (c) 2011-2013 Pjotr Prins. See LICENSE.txt for further details.
|
103
235
|
|
data/Rakefile
CHANGED
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.1.
|
1
|
+
0.1.5
|
data/bin/fasta_filter.rb
ADDED
@@ -0,0 +1,100 @@
|
|
1
|
+
#! /usr/bin/env ruby
|
2
|
+
#
|
3
|
+
# Filter for FASTA files
|
4
|
+
#
|
5
|
+
|
6
|
+
$: << File.dirname(__FILE__)+'/../lib'
|
7
|
+
|
8
|
+
require 'bigbio'
|
9
|
+
require 'optparse'
|
10
|
+
require 'ostruct'
|
11
|
+
|
12
|
+
class OptParser
|
13
|
+
#
|
14
|
+
# Return a structure describing the options.
|
15
|
+
#
|
16
|
+
def self.parse(args)
|
17
|
+
# The options specified on the command line will be collected in *options*.
|
18
|
+
# We set default values here.
|
19
|
+
options = OpenStruct.new
|
20
|
+
options.codonize = false
|
21
|
+
options.verbose = false
|
22
|
+
|
23
|
+
opt_parser = OptionParser.new do |opts|
|
24
|
+
opts.banner = "Usage: fasta_filter.rb [options]"
|
25
|
+
|
26
|
+
opts.separator ""
|
27
|
+
opts.separator "Specific options:"
|
28
|
+
|
29
|
+
opts.on("--filter expression","Filter on Ruby expression") do |expr|
|
30
|
+
options.filter = expr
|
31
|
+
end
|
32
|
+
|
33
|
+
opts.on("--rewrite expression","Rewrite expression") do |expr|
|
34
|
+
options.rewrite = expr
|
35
|
+
end
|
36
|
+
|
37
|
+
opts.on("--codonize",
|
38
|
+
"Trim sequence to a multiple of 3 nucleotides") do |b|
|
39
|
+
options.codonize = b
|
40
|
+
end
|
41
|
+
|
42
|
+
opts.on("--min size",
|
43
|
+
"Set minimum sequence size") do |min|
|
44
|
+
options.min = min.to_i
|
45
|
+
end
|
46
|
+
|
47
|
+
opts.on("--id","Write out ID only") do |b|
|
48
|
+
options.id = b
|
49
|
+
end
|
50
|
+
|
51
|
+
opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
|
52
|
+
options.verbose = v
|
53
|
+
end
|
54
|
+
|
55
|
+
opts.separator ""
|
56
|
+
opts.separator "Examples:"
|
57
|
+
opts.separator ""
|
58
|
+
opts.separator " fasta_filter.rb --filter \"rec.id =~ /-126/ or rec.seq =~ /CCC$/\" test/data/fasta/nt.fa"
|
59
|
+
opts.separator " fasta_filter.rb --filter \"rec.seq.count('C') > rec.seq.size*0.25\" test/data/fasta/nt.fa"
|
60
|
+
opts.separator " fasta_filter.rb --filter \"rec.descr =~ /C. elegans/\" test/data/fasta/nt.fa"
|
61
|
+
opts.separator " fasta_filter.rb --filter \"num % 2 == 0\" test/data/fasta/nt.fa"
|
62
|
+
opts.separator " fasta_filter.rb test/data/fasta/nt.fa --rewrite 'rec.seq.downcase!'"
|
63
|
+
opts.separator ""
|
64
|
+
opts.separator "Other options:"
|
65
|
+
opts.separator ""
|
66
|
+
|
67
|
+
opts.on_tail("-h", "--help", "Show this message") do
|
68
|
+
puts opts
|
69
|
+
exit
|
70
|
+
end
|
71
|
+
|
72
|
+
end
|
73
|
+
|
74
|
+
opt_parser.parse!(args)
|
75
|
+
options
|
76
|
+
end # parse()
|
77
|
+
end # class OptParser
|
78
|
+
|
79
|
+
options = OptParser.parse(ARGV)
|
80
|
+
|
81
|
+
num = -1
|
82
|
+
FastaReader::emit_fastarecord(-> { ARGF.gets }) { | rec |
|
83
|
+
num += 1
|
84
|
+
# --- Filtering
|
85
|
+
next if options.filter and not eval(options.filter)
|
86
|
+
if options.codonize
|
87
|
+
# --- Trim sequence down to a multiple of 3 nucleotides
|
88
|
+
size = rec.seq.size
|
89
|
+
rec.seq = rec.seq[0..size - (size % 3) - 1]
|
90
|
+
end
|
91
|
+
# --- Only use sequences from MIN size
|
92
|
+
next if options.min and rec.seq.size < options.min
|
93
|
+
# --- Truncate description to ID
|
94
|
+
rec.descr = rec.id if options.id
|
95
|
+
|
96
|
+
# --- rewrite
|
97
|
+
eval(options.rewrite) if options.rewrite
|
98
|
+
print rec.to_fasta
|
99
|
+
}
|
100
|
+
|
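
A note on the `--codonize` arithmetic in the script above, as a sketch rather than a patch. The range form `seq[0..size - (size % 3) - 1]` ends at index `-1` when the sequence is shorter than three nucleotides, and in Ruby a range ending at `-1` selects the whole string. The hypothetical `codonize` helper below uses the start-and-length form instead, which yields an empty string in that case.

```ruby
# Hypothetical helper, not part of the gem: trim a sequence to whole codons
# using the start/length form of String#[].
def codonize(seq)
  seq[0, seq.size - seq.size % 3]
end

# The range form as used in the script, wrapped in a lambda for comparison.
range_form = ->(seq) { seq[0..seq.size - (seq.size % 3) - 1] }

["ACGTA", "ACGTAC", "AC"].each do |seq|
  puts "#{seq}: length form=#{codonize(seq).inspect} range form=#{range_form.call(seq).inspect}"
end
```

Both forms agree for sequences of three or more nucleotides; they differ only for sequences of length one or two.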
data/bin/fasta_sort.rb
ADDED
@@ -0,0 +1,24 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
#
|
3
|
+
# fasta_sort: Sorts a FASTA file and outputs sorted unique records as FASTA again
|
4
|
+
#
|
5
|
+
# Usage:
|
6
|
+
#
|
7
|
+
# fasta_sort inputfile(s)
|
8
|
+
|
9
|
+
require 'bio'
|
10
|
+
|
11
|
+
include Bio
|
12
|
+
|
13
|
+
table = Hash.new
|
14
|
+
ARGV.each do | fn |
|
15
|
+
Bio::FlatFile.auto(fn).each do | seq |
|
16
|
+
table[seq.definition] ||= seq.data
|
17
|
+
end
|
18
|
+
end
|
19
|
+
|
20
|
+
table.sort.each do | definition, data |
|
21
|
+
rec = Bio::FastaFormat.new('> '+definition.strip+"\n"+data)
|
22
|
+
print rec
|
23
|
+
end
|
24
|
+
|
data/bin/getorf
CHANGED
@@ -6,12 +6,8 @@
|
|
6
6
|
# (aa_heuristic.fa and nt_heuristic.fa respectively)
|
7
7
|
#
|
8
8
|
# You can choose the heuristic on the command line (default stopstop).
|
9
|
-
|
10
|
-
|
11
|
-
# Copyright:: 2009-2011
|
12
|
-
# License:: Ruby License
|
13
|
-
#
|
14
|
-
# Copyright (C) 2009-2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
9
|
+
|
10
|
+
$stderr.print "WARNING: This tool has one or more known bugs! Better use the EMBOSS getorf instead for now\n"
|
15
11
|
|
16
12
|
rootpath = File.dirname(File.dirname(__FILE__))
|
17
13
|
$: << File.join(rootpath,'lib')
|
@@ -48,10 +44,10 @@ EXAMPLE
|
|
48
44
|
exit()
|
49
45
|
}
|
50
46
|
|
51
|
-
opts.on("-h heuristic", String, "Heuristic (
|
47
|
+
opts.on("-h heuristic", String, "Heuristic (default #{heuristic})") do | s |
|
52
48
|
heuristic = s
|
53
49
|
end
|
54
|
-
opts.on("-s size", "--min-size", Integer, "Minimal sequence size") do | n |
|
50
|
+
opts.on("-s size", "--min-size", Integer, "Minimal sequence size (default #{minsize})") do | n |
|
55
51
|
minsize = n
|
56
52
|
end
|
57
53
|
opts.on("--longest", "Only get longest ORF match") do
|
data/bin/nt2aa.rb
CHANGED
@@ -3,11 +3,6 @@
|
|
3
3
|
# Translate nucleotide sequences into amino acid sequences in all
|
4
4
|
# reading frames.
|
5
5
|
#
|
6
|
-
#
|
7
|
-
# (: pjotrp 2009, 2012 rblicense :)
|
8
|
-
#
|
9
|
-
# Copyright (C) 2012 Pjotr Prins <pjotr.prins@thebird.nl>
|
10
|
-
|
11
6
|
USAGE =<<EOM
|
12
7
|
ruby #{__FILE__} [--six-frame] inputfile(s)
|
13
8
|
EOM
|
@@ -44,7 +39,9 @@ ARGV.each do | fn |
|
|
44
39
|
|
45
40
|
# ajpseqt = Biolib::Emboss.ajTrnSeqOrig(trnTable,ajpseq,frame)
|
46
41
|
# aa = Biolib::Emboss.ajSeqGetSeqCopyC(ajpseqt)
|
47
|
-
print ">
|
42
|
+
print ">",rec.descr
|
43
|
+
print " [",frame.to_s,"]" if do_sixframes
|
44
|
+
print "\n"
|
48
45
|
print aa,"\n"
|
49
46
|
end
|
50
47
|
}
|
data/bio-bigbio.gemspec
CHANGED
@@ -5,25 +5,28 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = "bio-bigbio"
|
8
|
-
s.version = "0.1.
|
8
|
+
s.version = "0.1.5"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Pjotr Prins"]
|
12
|
-
s.date = "
|
12
|
+
s.date = "2013-05-03"
|
13
13
|
s.description = "Fasta reader, ORF emitter, sequence translation"
|
14
14
|
s.email = "pjotr.public01@thebird.nl"
|
15
|
-
s.executables = ["getorf", "nt2aa.rb"]
|
15
|
+
s.executables = ["fasta_filter.rb", "fasta_sort.rb", "getorf", "nt2aa.rb"]
|
16
16
|
s.extra_rdoc_files = [
|
17
|
-
"LICENSE",
|
17
|
+
"LICENSE.txt",
|
18
18
|
"README.md"
|
19
19
|
]
|
20
20
|
s.files = [
|
21
|
+
".travis.yml",
|
21
22
|
"Gemfile",
|
22
23
|
"Gemfile.lock",
|
23
|
-
"LICENSE",
|
24
|
+
"LICENSE.txt",
|
24
25
|
"README.md",
|
25
26
|
"Rakefile",
|
26
27
|
"VERSION",
|
28
|
+
"bin/fasta_filter.rb",
|
29
|
+
"bin/fasta_sort.rb",
|
27
30
|
"bin/getorf",
|
28
31
|
"bin/nt2aa.rb",
|
29
32
|
"bio-bigbio.gemspec",
|
@@ -42,6 +45,7 @@ Gem::Specification.new do |s|
|
|
42
45
|
"lib/bigbio/db/fasta/fastarecord.rb",
|
43
46
|
"lib/bigbio/db/fasta/fastawriter.rb",
|
44
47
|
"lib/bigbio/db/fasta/indexer.rb",
|
48
|
+
"lib/bigbio/db/phylip.rb",
|
45
49
|
"lib/bigbio/environment.rb",
|
46
50
|
"lib/bigbio/sequence/predictorf.rb",
|
47
51
|
"lib/bigbio/sequence/translate.rb",
|
data/lib/bigbio/db/fasta/fastareader.rb
CHANGED
@@ -130,3 +130,38 @@ class FastaReader
|
|
130
130
|
end
|
131
131
|
|
132
132
|
end
|
133
|
+
|
134
|
+
# The following is actually a module/trait implementation without state
|
135
|
+
|
136
|
+
class FastaReader
|
137
|
+
|
138
|
+
# func passes in a FASTA buffer. Every time a record is parsed it is
|
139
|
+
# yielded.
|
140
|
+
#
|
141
|
+
def FastaReader::emit getbuf_func
|
142
|
+
seq = ""
|
143
|
+
id = nil
|
144
|
+
descr = nil
|
145
|
+
while buf = getbuf_func.call
|
146
|
+
buf.split(/\n/).each do | line |
|
147
|
+
if line =~ /^>/
|
148
|
+
yield id, descr, seq if descr
|
149
|
+
descr = line[1..-1].strip
|
150
|
+
matched = /^(\S+)/.match(descr)
|
151
|
+
id = matched[0]
|
152
|
+
seq = ""
|
153
|
+
else
|
154
|
+
seq += line.strip
|
155
|
+
end
|
156
|
+
end
|
157
|
+
end
|
158
|
+
yield id, descr, seq if descr and seq.size > 0
|
159
|
+
end
|
160
|
+
|
161
|
+
def FastaReader::emit_fastarecord getbuf_func
|
162
|
+
emit(getbuf_func) do | id, descr, seq |
|
163
|
+
yield FastaRecord.new(id, descr, seq)
|
164
|
+
end
|
165
|
+
end
|
166
|
+
|
167
|
+
end
|
data/lib/bigbio/db/fasta/fastarecord.rb
CHANGED
@@ -7,6 +7,10 @@ class FastaRecord
|
|
7
7
|
@descr = descr
|
8
8
|
@seq = seq
|
9
9
|
end
|
10
|
+
|
11
|
+
def to_fasta
|
12
|
+
">"+@descr+"\n"+@seq+"\n"
|
13
|
+
end
|
10
14
|
end
|
11
15
|
|
12
16
|
class FastaPairedRecord
|
@@ -30,7 +34,9 @@ class FastaPairedRecord
|
|
30
34
|
if nt.seq.size == aa.seq.size*3-3
|
31
35
|
aa.seq.chop!
|
32
36
|
end
|
33
|
-
|
37
|
+
nt_size = nt.seq.size
|
38
|
+
expected_size = aa.seq.size*3
|
39
|
+
# raise "Sequence size mismatch for #{nt.id} <nt:#{nt.seq.size} != #{aa.seq.size*3} (aa:#{aa.seq.size}*3)>" if expected_size - 3 > nt_size and nt_size > expected_size + 3
|
34
40
|
end
|
35
41
|
|
36
42
|
def id
|
data/lib/bigbio/db/phylip.rb
ADDED
@@ -0,0 +1,49 @@
|
|
1
|
+
# Simple phylip reader. Supports PAML style files formatted as
|
2
|
+
#
|
3
|
+
# sequence 1
|
4
|
+
# AAGCTTCACCGGCGCAGTCATTCTCATAAT
|
5
|
+
# CGCCCACGGACTTACATCCTCATTACTATT
|
6
|
+
# sequence 2
|
7
|
+
# AAGCTTCACCGGCGCAATTATCCTCATAAT
|
8
|
+
# CGCCCACGGACTTACATCCTCATTATTATT
|
9
|
+
# sequence 3
|
10
|
+
# AAGCTTCACCGGCGCAGTTGTTCTTATAAT
|
11
|
+
# TGCCCACGGACTTACATCATCATTATTATT
|
12
|
+
# sequence 4
|
13
|
+
# AAGCTTCACCGGCGCAACCACCCTCATGAT
|
14
|
+
# TGCCCATGGACTCACATCCTCCCTACTGTT
|
15
|
+
|
16
|
+
module Bio
|
17
|
+
module Big
|
18
|
+
module PhylipReader
|
19
|
+
# Define get_line as a lambda function, e.g.
|
20
|
+
# Bio::Big::PhylipReader.emit_seq(-> { lines.next }) { | name, seq | p [name,seq] }
|
21
|
+
|
22
|
+
def PhylipReader::emit_seq get_line
|
23
|
+
line = get_line.call.strip
|
24
|
+
a = line.split
|
25
|
+
seq_num = a[0].to_i
|
26
|
+
seq_size = a[1].to_i
|
27
|
+
name = nil
|
28
|
+
seq = ""
|
29
|
+
while true
|
30
|
+
line = get_line.call
|
31
|
+
break if line == nil or line == ""
|
32
|
+
line = line.strip
|
33
|
+
if name == nil
|
34
|
+
name = line
|
35
|
+
next
|
36
|
+
end
|
37
|
+
seq += line
|
38
|
+
if seq.size >= seq_size
|
39
|
+
raise "Name wrong size for #{name}" if name.size > 20
|
40
|
+
raise "Sequence wrong size for #{name}" if seq.size > seq_size
|
41
|
+
yield name, seq
|
42
|
+
name = nil
|
43
|
+
seq = ""
|
44
|
+
end
|
45
|
+
end
|
46
|
+
end
|
47
|
+
end
|
48
|
+
end
|
49
|
+
end
|
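
The reader above is driven by a lambda that hands over one line per call. The sketch below shows one way to build such a lambda from an enumerator. `PhylipSketch` is a simplified stand-alone parser written for this example, not the gem's `Bio::Big::PhylipReader`, and it skips blank lines instead of stopping on them. One detail worth noting: `Enumerator#next` raises `StopIteration` at the end of the data, and a plain `while` loop does not rescue that, so the lambda here converts it to `nil`.

```ruby
# Simplified stand-alone parser for illustration; names are hypothetical.
module PhylipSketch
  def self.emit_seq(get_line)
    # Header line: number of sequences and the length of each sequence.
    _count, seq_size = get_line.call.split.map(&:to_i)
    name = nil
    seq = ""
    while (line = get_line.call)
      line = line.strip
      next if line.empty?
      if name.nil?
        name = line
        next
      end
      seq += line
      next if seq.size < seq_size
      raise "Sequence wrong size for #{name}" if seq.size > seq_size
      yield name, seq
      name = nil
      seq = ""
    end
  end
end

text = "2 8\nsequence 1\nAAGC\nTTCA\nsequence 2\nCGCC\nCACG\n"
lines = text.each_line

# Return nil at end of input instead of letting StopIteration escape.
get_line = lambda do
  begin
    lines.next
  rescue StopIteration
    nil
  end
end

result = []
PhylipSketch.emit_seq(get_line) { |name, seq| result << [name, seq] }
p result
```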
data/spec/emitter_spec.rb
CHANGED
@@ -20,6 +20,23 @@ describe Bio::Big::FastaEmitter, "when using the emitter" do
|
|
20
20
|
end
|
21
21
|
end
|
22
22
|
|
23
|
+
it "should emit functional style" do
|
24
|
+
count = 0
|
25
|
+
FastaReader::emit_fastarecord(-> { File.open("test/data/fasta/nt.fa").read }) { |rec|
|
26
|
+
case count
|
27
|
+
when 0
|
28
|
+
rec.id.should == "PUT-157a-Arabidopsis_thaliana-1"
|
29
|
+
rec.seq[0..10].should == "AGGTTCGNACG"
|
30
|
+
when 1
|
31
|
+
rec.id.should == "PUT-157a-Arabidopsis_thaliana-2"
|
32
|
+
rec.seq[0..10].should == "AGACAAACGAC"
|
33
|
+
else
|
34
|
+
break
|
35
|
+
end
|
36
|
+
count += 1
|
37
|
+
}
|
38
|
+
end
|
39
|
+
|
23
40
|
it "should emit large parts" do
|
24
41
|
FastaEmitter.new("test/data/fasta/nt.fa").emit_seq do | part, index, tag, seq |
|
25
42
|
# p [index, part, tag, seq]
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: bio-bigbio
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.5
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,11 +9,11 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date:
|
12
|
+
date: 2013-05-03 00:00:00.000000000Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: bio
|
16
|
-
requirement: &
|
16
|
+
requirement: &27203900 !ruby/object:Gem::Requirement
|
17
17
|
none: false
|
18
18
|
requirements:
|
19
19
|
- - ! '>='
|
@@ -21,10 +21,10 @@ dependencies:
|
|
21
21
|
version: 1.4.1
|
22
22
|
type: :runtime
|
23
23
|
prerelease: false
|
24
|
-
version_requirements: *
|
24
|
+
version_requirements: *27203900
|
25
25
|
- !ruby/object:Gem::Dependency
|
26
26
|
name: bio-logger
|
27
|
-
requirement: &
|
27
|
+
requirement: &27203120 !ruby/object:Gem::Requirement
|
28
28
|
none: false
|
29
29
|
requirements:
|
30
30
|
- - ! '>='
|
@@ -32,10 +32,10 @@ dependencies:
|
|
32
32
|
version: 0.9.0
|
33
33
|
type: :runtime
|
34
34
|
prerelease: false
|
35
|
-
version_requirements: *
|
35
|
+
version_requirements: *27203120
|
36
36
|
- !ruby/object:Gem::Dependency
|
37
37
|
name: rspec
|
38
|
-
requirement: &
|
38
|
+
requirement: &27202300 !ruby/object:Gem::Requirement
|
39
39
|
none: false
|
40
40
|
requirements:
|
41
41
|
- - ~>
|
@@ -43,10 +43,10 @@ dependencies:
|
|
43
43
|
version: 2.3.0
|
44
44
|
type: :development
|
45
45
|
prerelease: false
|
46
|
-
version_requirements: *
|
46
|
+
version_requirements: *27202300
|
47
47
|
- !ruby/object:Gem::Dependency
|
48
48
|
name: bundler
|
49
|
-
requirement: &
|
49
|
+
requirement: &27201380 !ruby/object:Gem::Requirement
|
50
50
|
none: false
|
51
51
|
requirements:
|
52
52
|
- - ~>
|
@@ -54,10 +54,10 @@ dependencies:
|
|
54
54
|
version: 1.0.0
|
55
55
|
type: :development
|
56
56
|
prerelease: false
|
57
|
-
version_requirements: *
|
57
|
+
version_requirements: *27201380
|
58
58
|
- !ruby/object:Gem::Dependency
|
59
59
|
name: jeweler
|
60
|
-
requirement: &
|
60
|
+
requirement: &27200760 !ruby/object:Gem::Requirement
|
61
61
|
none: false
|
62
62
|
requirements:
|
63
63
|
- - ~>
|
@@ -65,10 +65,10 @@ dependencies:
|
|
65
65
|
version: 1.5.2
|
66
66
|
type: :development
|
67
67
|
prerelease: false
|
68
|
-
version_requirements: *
|
68
|
+
version_requirements: *27200760
|
69
69
|
- !ruby/object:Gem::Dependency
|
70
70
|
name: rcov
|
71
|
-
requirement: &
|
71
|
+
requirement: &27199840 !ruby/object:Gem::Requirement
|
72
72
|
none: false
|
73
73
|
requirements:
|
74
74
|
- - ! '>='
|
@@ -76,23 +76,28 @@ dependencies:
|
|
76
76
|
version: '0'
|
77
77
|
type: :development
|
78
78
|
prerelease: false
|
79
|
-
version_requirements: *
|
79
|
+
version_requirements: *27199840
|
80
80
|
description: Fasta reader, ORF emitter, sequence translation
|
81
81
|
email: pjotr.public01@thebird.nl
|
82
82
|
executables:
|
83
|
+
- fasta_filter.rb
|
84
|
+
- fasta_sort.rb
|
83
85
|
- getorf
|
84
86
|
- nt2aa.rb
|
85
87
|
extensions: []
|
86
88
|
extra_rdoc_files:
|
87
|
-
- LICENSE
|
89
|
+
- LICENSE.txt
|
88
90
|
- README.md
|
89
91
|
files:
|
92
|
+
- .travis.yml
|
90
93
|
- Gemfile
|
91
94
|
- Gemfile.lock
|
92
|
-
- LICENSE
|
95
|
+
- LICENSE.txt
|
93
96
|
- README.md
|
94
97
|
- Rakefile
|
95
98
|
- VERSION
|
99
|
+
- bin/fasta_filter.rb
|
100
|
+
- bin/fasta_sort.rb
|
96
101
|
- bin/getorf
|
97
102
|
- bin/nt2aa.rb
|
98
103
|
- bio-bigbio.gemspec
|
@@ -111,6 +116,7 @@ files:
|
|
111
116
|
- lib/bigbio/db/fasta/fastarecord.rb
|
112
117
|
- lib/bigbio/db/fasta/fastawriter.rb
|
113
118
|
- lib/bigbio/db/fasta/indexer.rb
|
119
|
+
- lib/bigbio/db/phylip.rb
|
114
120
|
- lib/bigbio/environment.rb
|
115
121
|
- lib/bigbio/sequence/predictorf.rb
|
116
122
|
- lib/bigbio/sequence/translate.rb
|
@@ -139,7 +145,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
139
145
|
version: '0'
|
140
146
|
segments:
|
141
147
|
- 0
|
142
|
-
hash:
|
148
|
+
hash: 2941883289909211187
|
143
149
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
144
150
|
none: false
|
145
151
|
requirements:
|
data/LICENSE
DELETED
@@ -1,34 +0,0 @@
|
|
1
|
-
If a license is not specified the code contributed to BioBig defaults to the
|
2
|
-
BSD license:
|
3
|
-
|
4
|
-
Copyright (c) 2008, 2009 The BioLib Project
|
5
|
-
All rights reserved.
|
6
|
-
|
7
|
-
Redistribution and use in source and binary forms, with or without
|
8
|
-
modification, are permitted provided that the following conditions are met:
|
9
|
-
|
10
|
-
* Redistributions of source code must retain the above copyright notice,
|
11
|
-
this list of conditions and the following disclaimer.
|
12
|
-
|
13
|
-
* Redistributions in binary form must reproduce the above copyright notice,
|
14
|
-
this list of conditions and the following disclaimer in the documentation
|
15
|
-
and/or other materials provided with the distribution.
|
16
|
-
|
17
|
-
* Neither the name of the The BioLib Project nor the names of
|
18
|
-
its contributors may be used to endorse or promote products derived from
|
19
|
-
this software without specific prior written permission.
|
20
|
-
|
21
|
-
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
22
|
-
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
23
|
-
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
24
|
-
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
|
25
|
-
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
26
|
-
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
27
|
-
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
|
28
|
-
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
29
|
-
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
30
|
-
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
31
|
-
|
32
|
-
For more information on opensource software licenses see
|
33
|
-
http://www.opensource.org/licenses/bsd-license.php,
|
34
|
-
http://www.gnu.org/licenses/gpl.html and http://www.fsf.org/.
|