bio-gff3 0.6.0 → 0.8.0
- data/Gemfile +1 -0
- data/Gemfile.lock +10 -0
- data/README.rdoc +63 -13
- data/VERSION +1 -1
- data/bin/gff3-fetch +12 -9
- data/bio-gff3.gemspec +5 -4
- data/lib/bio-gff3.rb +1 -0
- data/lib/bio/db/gff/gffassemble.rb +52 -52
- data/lib/bio/db/gff/gfffasta.rb +1 -1
- data/spec/gff3_assemble2_spec.rb +1 -1
- data/spec/gff3_assemble3_spec.rb +3 -3
- data/spec/gff3_assemble_spec.rb +23 -20
- data/spec/gff3_fileiterator_spec.rb +2 -2
- data/spec/gffdb_spec.rb +2 -2
- data/test/data/gff/test.gff3 +2 -2
- metadata +17 -6
- data/README +0 -65
data/Gemfile
CHANGED
data/Gemfile.lock
CHANGED
@@ -2,6 +2,7 @@ GEM
|
|
2
2
|
remote: http://rubygems.org/
|
3
3
|
specs:
|
4
4
|
bio (1.4.1)
|
5
|
+
diff-lcs (1.1.2)
|
5
6
|
git (1.2.5)
|
6
7
|
jeweler (1.5.2)
|
7
8
|
bundler (~> 1.0.0)
|
@@ -9,6 +10,14 @@ GEM
|
|
9
10
|
rake
|
10
11
|
rake (0.8.7)
|
11
12
|
rcov (0.9.9)
|
13
|
+
rspec (2.3.0)
|
14
|
+
rspec-core (~> 2.3.0)
|
15
|
+
rspec-expectations (~> 2.3.0)
|
16
|
+
rspec-mocks (~> 2.3.0)
|
17
|
+
rspec-core (2.3.1)
|
18
|
+
rspec-expectations (2.3.0)
|
19
|
+
diff-lcs (~> 1.1.2)
|
20
|
+
rspec-mocks (2.3.0)
|
12
21
|
shoulda (2.11.3)
|
13
22
|
|
14
23
|
PLATFORMS
|
@@ -19,4 +28,5 @@ DEPENDENCIES
|
|
19
28
|
bundler (~> 1.0.0)
|
20
29
|
jeweler (~> 1.5.2)
|
21
30
|
rcov
|
31
|
+
rspec
|
22
32
|
shoulda
|
data/README.rdoc
CHANGED
@@ -1,19 +1,69 @@
|
|
1
1
|
= bio-gff3
|
2
2
|
|
3
|
-
|
4
|
-
|
5
|
-
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
|
3
|
+
GFF3 plugin for BioRuby, aimed at parsing big data
|
4
|
+
|
5
|
+
Features:
|
6
|
+
|
7
|
+
# Take GFF (genome browser) information and digest mRNA and CDS sequences
|
8
|
+
# Options for low memory use and caching of records
|
9
|
+
# Support for external FASTA files
|
10
|
+
|
11
|
+
You can use this plugin in two ways. First as a standalone program, next as a
|
12
|
+
plugin library to BioRuby.
|
13
|
+
|
14
|
+
For example, fetch mRNA and CDS information from GFF3 files and output to FASTA:
|
15
|
+
|
16
|
+
./bin/gff3-fetch mrna test/data/gff/test.gff3
|
17
|
+
./bin/gff3-fetch cds test/data/gff/test.gff3
|
18
|
+
|
19
|
+
Or clone this repository and add the 'lib' dir to the Ruby search path and
|
20
|
+
|
21
|
+
require 'bio/db/gff/gffdb'
|
22
|
+
|
23
|
+
You can also run RSpec with something like
|
24
|
+
|
25
|
+
rspec -I ../bioruby/lib/ spec/*.rb
|
26
|
+
|
27
|
+
This implementation depends on BioRuby's basic GFF3 parser, with the possible
|
28
|
+
advantage that the plugin is faster and does not consume all memory. The Gff3
|
29
|
+
specs are based on the output of the Wormbase genome browser.
|
30
|
+
|
31
|
+
For a write-up see http://thebird.nl/bioruby/BioRuby_GFF3.html
|
32
|
+
|
33
|
+
-------------------------------------------------------------------------------
|
34
|
+
|
35
|
+
|
36
|
+
Fetch and assemble mRNAs, or CDS and print in FASTA format.
|
37
|
+
|
38
|
+
gff3-fetch [--no-cache] mRNA|CDS [filename.fa] filename.gff
|
39
|
+
|
40
|
+
Where:
|
41
|
+
|
42
|
+
--no-cache : do not load everything in memory (slower)
|
43
|
+
mRNA : assemble mRNA
|
44
|
+
CDS : assemble CDS
|
45
|
+
|
46
|
+
Multiple GFF3 files can be used. For external FASTA files, always the last
|
47
|
+
one before the GFF file is used.
|
48
|
+
|
49
|
+
Examples:
|
50
|
+
|
51
|
+
Find mRNA and CDS information from test.gff3 (which includes sequence information)
|
52
|
+
|
53
|
+
gff3-fetch mRNA test/data/gff/test.gff3
|
54
|
+
gff3-fetch CDS test/data/gff/test.gff3
|
55
|
+
|
56
|
+
Find CDS from external FASTA file
|
57
|
+
|
58
|
+
gff3-fetch CDS test/data/gff/MhA1_Contig1133.fa test/data/gff/MhA1_Contig1133.gff3
|
59
|
+
|
60
|
+
Find mRNA from external FASTA file, without loading everything in RAM
|
61
|
+
|
62
|
+
gff3-fetch --no-cache mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3
|
63
|
+
|
64
|
+
If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
|
14
65
|
|
15
66
|
== Copyright
|
16
67
|
|
17
|
-
Copyright (
|
18
|
-
further details.
|
68
|
+
Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
19
69
|
|
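The usage text in the README says that multiple GFF3 files can be given and that the last FASTA file before a GFF file is the one used for it. A minimal sketch of that pairing rule follows. It is not the actual `gff3-fetch` code: `parse_fetch_args` is a hypothetical helper, and detecting FASTA files by a `.fa`/`.fasta` extension, lowercasing the feature name, and resetting the FASTA after each GFF file are all assumptions made for illustration.

```ruby
# Hypothetical helper, not part of bio-gff3: turn a gff3-fetch style
# argument list into a plan of (GFF file, optional external FASTA) pairs.
def parse_fetch_args(argv)
  args = argv.dup
  # --no-cache switches off in-memory caching (slower, less RAM)
  cache = args.delete('--no-cache').nil?
  # First remaining argument is the feature type; case-insensitive here
  feature = args.shift.to_s.downcase
  raise ArgumentError, "expected mRNA or CDS" unless %w[mrna cds].include?(feature)

  jobs = []
  fasta = nil
  args.each do |name|
    if name =~ /\.(fa|fasta)\z/i
      # Assumption: FASTA files are recognised by extension; a later
      # FASTA replaces an earlier one that has not been used yet.
      fasta = name
    else
      jobs << { :gff => name, :fasta => fasta }
      fasta = nil # assumption: a FASTA applies to the next GFF file only
    end
  end
  { :cache => cache, :feature => feature, :jobs => jobs }
end

plan = parse_fetch_args(%w[--no-cache mRNA a.fa b.fa x.gff3 y.gff3])
puts plan[:jobs].inspect
```

In this sketch `x.gff3` is paired with `b.fa` (the last FASTA before it), while `y.gff3` has no external FASTA and would rely on sequence data embedded in the GFF3 file.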
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.
|
1
|
+
0.8.0
|
data/bin/gff3-fetch
CHANGED
@@ -4,7 +4,7 @@
|
|
4
4
|
# Copyright:: August 2010
|
5
5
|
# License:: Ruby License
|
6
6
|
#
|
7
|
-
# Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
7
|
+
# Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
8
8
|
|
9
9
|
|
10
10
|
USAGE = <<EOM
|
@@ -14,7 +14,7 @@ USAGE = <<EOM
|
|
14
14
|
|
15
15
|
Where:
|
16
16
|
|
17
|
-
--no-cache : do not load everything in memory
|
17
|
+
--no-cache : do not load everything in memory (slower)
|
18
18
|
mRNA : assemble mRNA
|
19
19
|
CDS : assemble CDS
|
20
20
|
|
@@ -25,19 +25,22 @@ USAGE = <<EOM
|
|
25
25
|
|
26
26
|
Find mRNA and CDS information from test.gff3 (which includes sequence information)
|
27
27
|
|
28
|
-
|
29
|
-
|
28
|
+
gff3-fetch mRNA test/data/gff/test.gff3
|
29
|
+
gff3-fetch CDS test/data/gff/test.gff3
|
30
30
|
|
31
|
-
Find CDS from
|
31
|
+
Find CDS from external FASTA file
|
32
32
|
|
33
|
-
|
33
|
+
gff3-fetch CDS test/data/gff/MhA1_Contig1133.fa test/data/gff/MhA1_Contig1133.gff3
|
34
34
|
|
35
35
|
Find mRNA from external FASTA file, without loading everything in RAM
|
36
36
|
|
37
|
-
|
37
|
+
gff3-fetch --no-cache mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3
|
38
38
|
|
39
39
|
If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
|
40
40
|
|
41
|
+
== Copyright
|
42
|
+
|
43
|
+
Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
41
44
|
|
42
45
|
EOM
|
43
46
|
|
@@ -45,9 +48,9 @@ rootpath = File.dirname(File.dirname(__FILE__))
|
|
45
48
|
$: << rootpath+'/lib'
|
46
49
|
$: << rootpath+'/../bioruby/lib'
|
47
50
|
|
48
|
-
require 'bio
|
51
|
+
require 'bio-gff3'
|
49
52
|
|
50
|
-
$stderr.print "BioRuby GFF3 Plugin Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>\n\n"
|
53
|
+
$stderr.print "BioRuby GFF3 Plugin Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>\n\n"
|
51
54
|
|
52
55
|
if ARGV.size == 0
|
53
56
|
print USAGE
|
data/bio-gff3.gemspec
CHANGED
@@ -5,11 +5,11 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = %q{bio-gff3}
|
8
|
-
s.version = "0.
|
8
|
+
s.version = "0.8.0"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Pjotr Prins"]
|
12
|
-
s.date = %q{2010-12-
|
12
|
+
s.date = %q{2010-12-31}
|
13
13
|
s.default_executable = %q{gff3-fetch}
|
14
14
|
s.description = %q{GFF3 (genome browser) information and digest mRNA and CDS sequences.
|
15
15
|
Options for low memory use and caching of records.
|
@@ -19,14 +19,12 @@ Support for external FASTA files.
|
|
19
19
|
s.executables = ["gff3-fetch"]
|
20
20
|
s.extra_rdoc_files = [
|
21
21
|
"LICENSE.txt",
|
22
|
-
"README",
|
23
22
|
"README.rdoc"
|
24
23
|
]
|
25
24
|
s.files = [
|
26
25
|
"Gemfile",
|
27
26
|
"Gemfile.lock",
|
28
27
|
"LICENSE.txt",
|
29
|
-
"README",
|
30
28
|
"README.rdoc",
|
31
29
|
"Rakefile",
|
32
30
|
"VERSION",
|
@@ -83,12 +81,14 @@ Support for external FASTA files.
|
|
83
81
|
s.add_development_dependency(%q<jeweler>, ["~> 1.5.2"])
|
84
82
|
s.add_development_dependency(%q<rcov>, [">= 0"])
|
85
83
|
s.add_development_dependency(%q<bio>, [">= 1.4.1"])
|
84
|
+
s.add_development_dependency(%q<rspec>, [">= 0"])
|
86
85
|
else
|
87
86
|
s.add_dependency(%q<shoulda>, [">= 0"])
|
88
87
|
s.add_dependency(%q<bundler>, ["~> 1.0.0"])
|
89
88
|
s.add_dependency(%q<jeweler>, ["~> 1.5.2"])
|
90
89
|
s.add_dependency(%q<rcov>, [">= 0"])
|
91
90
|
s.add_dependency(%q<bio>, [">= 1.4.1"])
|
91
|
+
s.add_dependency(%q<rspec>, [">= 0"])
|
92
92
|
end
|
93
93
|
else
|
94
94
|
s.add_dependency(%q<shoulda>, [">= 0"])
|
@@ -96,6 +96,7 @@ Support for external FASTA files.
|
|
96
96
|
s.add_dependency(%q<jeweler>, ["~> 1.5.2"])
|
97
97
|
s.add_dependency(%q<rcov>, [">= 0"])
|
98
98
|
s.add_dependency(%q<bio>, [">= 1.4.1"])
|
99
|
+
s.add_dependency(%q<rspec>, [">= 0"])
|
99
100
|
end
|
100
101
|
end
|
101
102
|
|
data/lib/bio-gff3.rb
CHANGED
@@ -0,0 +1 @@
|
|
1
|
+
require 'bio/db/gff/gffdb'
|
data/lib/bio/db/gff/gffassemble.rb
CHANGED
@@ -198,77 +198,76 @@ module Bio
|
|
198
198
|
# to the landmark given in column 1 - in this case the sequence as it
|
199
199
|
# is passed in. The following options are available:
|
200
200
|
#
|
201
|
-
# :
|
202
|
-
# :
|
203
|
-
# :
|
204
|
-
# :trim : make sure sequence is multiple of 3 nucleotide bps (false)
|
201
|
+
# :reverse : do reverse if reverse is indicated (default true)
|
202
|
+
# :complement : do complement if reverse is indicated (default true)
|
203
|
+
# :phase : do set CDS phase (default false, normally ignore)
|
204
|
+
# :trim : make sure sequence is multiple of 3 nucleotide bps (default false)
|
205
205
|
#
|
206
206
|
# there are two special options:
|
207
207
|
#
|
208
208
|
# :raw : raw sequence (all above false)
|
209
|
-
# :codonize : codon sequence (
|
209
|
+
# :codonize : codon sequence (reverse, complement and trim are true)
|
210
210
|
#
|
211
|
-
def assemble sequence, startpos, reclist, options = { :phase=>
|
211
|
+
def assemble sequence, startpos, reclist, options = { :phase=>false, :reverse=>true, :trim=>false, :complement=>true, :debug=>false }
|
212
|
+
do_debug = options[:debug]
|
212
213
|
do_phase = options[:phase]
|
213
|
-
do_reverse = options[:reverse]
|
214
|
-
do_trim
|
215
|
-
do_complement = options[:complement]
|
214
|
+
do_reverse = (options[:reverse] == false ? false : true)
|
215
|
+
do_trim = (options[:trim] == false ? false : true)
|
216
|
+
do_complement = (options[:complement] == false ? false : true)
|
216
217
|
if options[:raw]
|
217
218
|
do_phase = false
|
218
219
|
do_reverse = false
|
219
220
|
do_trim = false
|
220
221
|
do_complement = false
|
221
222
|
elsif options[:codonize]
|
222
|
-
do_phase =
|
223
|
+
do_phase = false
|
223
224
|
do_reverse = true
|
224
225
|
do_trim = true
|
225
226
|
do_complement = true
|
226
227
|
end
|
227
|
-
retval = ""
|
228
228
|
sectionlist = Sections::sort(reclist)
|
229
|
-
reverse = false
|
230
|
-
# we assume strand is always the same
|
231
229
|
rec0 = sectionlist.first.rec
|
232
|
-
|
233
|
-
|
234
|
-
|
235
|
-
|
236
|
-
|
237
|
-
|
238
|
-
|
239
|
-
|
240
|
-
|
241
|
-
|
242
|
-
|
243
|
-
|
244
|
-
seq = sequence[(rec.start-1)..(rec.end-1)]
|
245
|
-
retval += seq
|
230
|
+
# we assume ORF is always read in the same direction
|
231
|
+
orf_reverse = (rec0.strand == '-')
|
232
|
+
orf_frame = startpos - 1
|
233
|
+
orf_frameshift = orf_frame % 3
|
234
|
+
sectionlist = sectionlist.reverse if orf_reverse
|
235
|
+
if do_debug
|
236
|
+
p "------------------"
|
237
|
+
p options
|
238
|
+
p [:reverse,do_reverse]
|
239
|
+
p [:complement,do_complement]
|
240
|
+
p [:trim,do_trim]
|
241
|
+
p [:orf_reverse, orf_reverse, rec0.strand]
|
246
242
|
end
|
247
|
-
|
248
|
-
if
|
249
|
-
#
|
250
|
-
|
243
|
+
|
244
|
+
if sequence.kind_of?(Bio::FastaFormat)
|
245
|
+
# BioRuby conversion
|
246
|
+
sequence = sequence.seq
|
251
247
|
end
|
252
|
-
|
253
|
-
|
254
|
-
|
255
|
-
|
256
|
-
|
257
|
-
|
258
|
-
# the phase appears to be disregarded - or rather handled
|
259
|
-
# by start-stop. This is a hack.
|
260
|
-
if do_reverse and reverse and (seq.size % 3 == 0)
|
261
|
-
# do nothing
|
262
|
-
else
|
263
|
-
seq = seq[frame..-1] if frame != 0 # set phase
|
248
|
+
# Generate array of sequences
|
249
|
+
seq = sectionlist.map { | section |
|
250
|
+
rec = section.rec
|
251
|
+
s = sequence[(section.begin-1)..(section.end-1)]
|
252
|
+
if do_reverse and orf_reverse
|
253
|
+
s = s.reverse
|
264
254
|
end
|
265
|
-
|
266
|
-
|
267
|
-
#
|
268
|
-
if
|
269
|
-
|
270
|
-
|
255
|
+
# Correct for phase. Unfortunately the use of phase is ambiguous.
|
256
|
+
# Here we check whether rec.start is in line with orf_frame. If it
|
257
|
+
# is, we correct for phase. Otherwise it is ignored.
|
258
|
+
if do_phase and rec.phase
|
259
|
+
phase = rec.phase.to_i
|
260
|
+
# if ((rec.start-startpos) % 3 == 0)
|
261
|
+
s = s[phase..-1]
|
262
|
+
# end
|
271
263
|
end
|
264
|
+
s
|
265
|
+
}
|
266
|
+
# p seq
|
267
|
+
seq = seq.join
|
268
|
+
if do_complement and do_reverse and orf_reverse
|
269
|
+
ntseq = Bio::Sequence::NA.new(seq)
|
270
|
+
seq = ntseq.forward_complement.upcase
|
272
271
|
end
|
273
272
|
if do_trim
|
274
273
|
reduce = seq.size % 3
|
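The reworked `assemble` in the hunk above slices each section out of the landmark sequence, reverses the pieces for a minus-strand ORF, optionally drops the phase bases, joins the pieces and complements the result. The same flow can be sketched with plain strings, without BioRuby. This is a simplified illustration under stated assumptions, not the plugin's implementation: `Section`, `COMPLEMENT` and `assemble_sketch` are hypothetical names, and trimming from the 3' end is an assumption because the trim code is not fully visible in this diff.

```ruby
# Hypothetical stand-in for a GFF section (1-based, inclusive coordinates)
Section = Struct.new(:start, :stop, :strand, :phase)

COMPLEMENT = { 'A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C', 'N' => 'N' }.freeze

def assemble_sketch(sequence, sections, phase: false, trim: false)
  sorted = sections.sort_by(&:start)
  # Assume every section of one ORF is on the same strand
  minus = sorted.first.strand == '-'
  # A minus-strand ORF is read from the last section back to the first
  sorted = sorted.reverse if minus

  parts = sorted.map do |sec|
    s = sequence[(sec.start - 1)..(sec.stop - 1)]
    s = s.reverse if minus
    # Drop the leading phase bases only when explicitly requested
    s = s[sec.phase..-1] if phase && sec.phase
    s
  end

  seq = parts.join
  # Complement without reversing again; the pieces are already reversed
  seq = seq.each_char.map { |c| COMPLEMENT.fetch(c.upcase, 'N') }.join if minus
  # Assumption: trimming removes surplus bases from the end
  seq = seq[0, seq.size - seq.size % 3] if trim
  seq
end

contig = "ATGCCGTAA"
puts assemble_sketch(contig, [Section.new(1, 9, '+', 0)])
puts assemble_sketch(contig, [Section.new(1, 9, '-', 0)])
```

Reversing each piece, reading the sections in reverse order and then complementing the joined string gives the reverse complement of the spliced region, which is why the complement step does not reverse a second time.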
@@ -279,9 +278,10 @@ module Bio
|
|
279
278
|
end
|
280
279
|
|
281
280
|
# Patch a sequence together from a Sequence string and an array
|
282
|
-
# of records and translate in the correct direction and frame
|
283
|
-
|
284
|
-
|
281
|
+
# of records and translate in the correct direction and frame. The options
|
282
|
+
# are the same as for +assemble+.
|
283
|
+
def assembleAA sequence, startpos, reclist, options = { :phase=>false, :reverse=>true, :trim=>false, :complement=>true }
|
284
|
+
seq = assemble(sequence, startpos, reclist, options)
|
285
285
|
ntseq = Bio::Sequence::NA.new(seq)
|
286
286
|
ntseq.translate
|
287
287
|
end
|
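One Ruby detail is worth keeping in mind with the `options = { ... }` signatures above: a default hash argument is used only when the caller passes no hash at all. Passing `:phase => true` on its own replaces the whole default hash rather than merging into it. The sketch below shows this behaviour next to a merge-based alternative. `DEFAULTS`, `with_default_arg` and `with_merge` are hypothetical names for illustration and do not appear in bio-gff3.

```ruby
# Hypothetical defaults mirroring the documented option values
DEFAULTS = { :phase => false, :reverse => true, :trim => false, :complement => true }.freeze

# Default-argument style: the default applies only when nothing is passed
def with_default_arg(options = DEFAULTS)
  options
end

# Merge style: caller-supplied keys override, missing keys keep defaults
def with_merge(options = {})
  DEFAULTS.merge(options)
end

no_args = with_default_arg
partial = with_default_arg(:phase => true)
merged  = with_merge(:phase => true)

puts no_args[:trim].inspect   # default hash is in effect
puts partial[:trim].inspect   # key is absent from the passed hash
puts merged[:trim].inspect    # default survives the merge
```

With the `== false ? false : true` checks shown in the `assemble` hunk, an absent key evaluates to true. For `:reverse` and `:complement` that matches the documented defaults, but for `:trim` it appears to differ from the documented default of false whenever a caller passes a partial options hash.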
data/lib/bio/db/gff/gfffasta.rb
CHANGED
data/spec/gff3_assemble2_spec.rb
CHANGED
data/spec/gff3_assemble3_spec.rb
CHANGED
@@ -1,12 +1,12 @@
|
|
1
1
|
# RSpec for BioRuby-GFF3-Plugin. Run with something like:
|
2
2
|
#
|
3
|
-
#
|
3
|
+
# rspec -I ../bioruby/lib/ spec/gff3_assemble3_spec.rb
|
4
4
|
#
|
5
|
-
# Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
5
|
+
# Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
6
6
|
#
|
7
7
|
$: << "../lib"
|
8
8
|
|
9
|
-
require 'bio
|
9
|
+
require 'bio-gff3'
|
10
10
|
|
11
11
|
include Bio::GFFbrowser
|
12
12
|
|
data/spec/gff3_assemble_spec.rb
CHANGED
@@ -1,12 +1,12 @@
|
|
1
1
|
# RSpec for BioRuby-GFF3-Plugin. Run with something like:
|
2
2
|
#
|
3
|
-
#
|
3
|
+
# rspec -I ../bioruby/lib/ spec/gff3_assemble_spec.rb
|
4
4
|
#
|
5
|
-
# Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
5
|
+
# Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
6
6
|
#
|
7
7
|
$: << "../lib"
|
8
8
|
|
9
|
-
require 'bio
|
9
|
+
require 'bio-gff3'
|
10
10
|
|
11
11
|
include Bio::GFFbrowser
|
12
12
|
|
@@ -83,17 +83,20 @@ describe GFFdb, "Assemble CDS" do
|
|
83
83
|
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds0])
|
84
84
|
aaseq.should == "MRPLTDEETEKFFKKLSNYIGDNIKLLLEREDGEYVFRLHKDRVYYC"
|
85
85
|
end
|
86
|
+
# MhA1_Contig1133 WormBase CDS 8065 8308 . + 1 ID=cds:MhA1_Contig1133.frz3.gene4;Parent=transcript:MhA1_Contig1133.frz3.gene4
|
86
87
|
it "should translate CDS 8065:8308 (in frame 1, + strand)" do
|
87
88
|
recs = @cdslist['cds:MhA1_Contig1133.frz3.gene4']
|
88
89
|
component = @componentlist['cds:MhA1_Contig1133.frz3.gene4']
|
89
90
|
cds1 = recs[1]
|
90
|
-
seq = @gff.assemble(@contigsequence,component.start,[cds1]
|
91
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds1])
|
91
92
|
seq.size.should == 244
|
92
93
|
seq.should == "TGAAAAATTAATGCGACAAGCAGCATGTATTGGACGTAAACAATTGGGATCTTTTGGAACTTGTTTGGGTAAATTCACAAAAGGAGGGTCTTTCTTTCTTCATATAACATCATTGGATTATTTGGCACCTTATGCTTTAGCAAAAATTTGGTTAAAACCACAAGCTGAACAACAATTTTTATATGGAAATAATATTGTTAAATCTGGTGTTGGAAGAATGAGTGAAGGGATTGAAGAAAAACAA"
|
93
|
-
seq = @gff.assemble(@contigsequence,component.start,[cds1])
|
94
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds1],:phase => true)
|
95
|
+
seq.size.should == 243
|
94
96
|
seq.should == "GAAAAATTAATGCGACAAGCAGCATGTATTGGACGTAAACAATTGGGATCTTTTGGAACTTGTTTGGGTAAATTCACAAAAGGAGGGTCTTTCTTTCTTCATATAACATCATTGGATTATTTGGCACCTTATGCTTTAGCAAAAATTTGGTTAAAACCACAAGCTGAACAACAATTTTTATATGGAAATAATATTGTTAAATCTGGTGTTGGAAGAATGAGTGAAGGGATTGAAGAAAAACAA"
|
95
|
-
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds1])
|
97
|
+
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds1],:phase => true)
|
96
98
|
# note it should handle the frame shift and direction!
|
99
|
+
# wormbase validated
|
97
100
|
aaseq.should == "EKLMRQAACIGRKQLGSFGTCLGKFTKGGSFFLHITSLDYLAPYALAKIWLKPQAEQQFLYGNNIVKSGVGRMSEGIEEKQ"
|
98
101
|
end
|
99
102
|
it "should translate CDS3 (in frame 0, + strand)" do
|
@@ -114,7 +117,7 @@ describe GFFdb, "Assemble CDS" do
|
|
114
117
|
seq.size.should == 543
|
115
118
|
seq.should == "ATGCGTCCTTTAACAGATGAAGAAACTGAAAAGTTTTTCAAAAAACTTTCAAATTATATTGGTGACAATATTAAACTTTTATTGGAAAGAGAAGATGGAGAATATGTTTTTCGTTTACATAAAGACAGAGTTTATTATTGCAGTGAAAAATTAATGCGACAAGCAGCATGTATTGGACGTAAACAATTGGGATCTTTTGGAACTTGTTTGGGTAAATTCACAAAAGGAGGGTCTTTCTTTCTTCATATAACATCATTGGATTATTTGGCACCTTATGCTTTAGCAAAAATTTGGTTAAAACCACAAGCTGAACAACAATTTTTATATGGAAATAATATTGTTAAATCTGGTGTTGGAAGAATGAGTGAAGGGATTGAAGAAAAACAAGGTATTATTATTTATAATATGTCAGATTTACCATTGGGTTTTGGAGTGGCTGCAAAGGGAACATTATCTTGTAGAAAAGTAGATCCTACAGCTTTAGTTGTTTTACATCAATCAGATTTGGGTGAATATATTCGAAATGAAGAGGGATTAATTTAA"
|
116
119
|
seq = @gff.assemble(@contigsequence,component.start,recs)
|
117
|
-
seq.size.should == 543
|
120
|
+
seq.size.should == 543 # auto correct for phase problem
|
118
121
|
seq.should == "ATGCGTCCTTTAACAGATGAAGAAACTGAAAAGTTTTTCAAAAAACTTTCAAATTATATTGGTGACAATATTAAACTTTTATTGGAAAGAGAAGATGGAGAATATGTTTTTCGTTTACATAAAGACAGAGTTTATTATTGCAGTGAAAAATTAATGCGACAAGCAGCATGTATTGGACGTAAACAATTGGGATCTTTTGGAACTTGTTTGGGTAAATTCACAAAAGGAGGGTCTTTCTTTCTTCATATAACATCATTGGATTATTTGGCACCTTATGCTTTAGCAAAAATTTGGTTAAAACCACAAGCTGAACAACAATTTTTATATGGAAATAATATTGTTAAATCTGGTGTTGGAAGAATGAGTGAAGGGATTGAAGAAAAACAAGGTATTATTATTTATAATATGTCAGATTTACCATTGGGTTTTGGAGTGGCTGCAAAGGGAACATTATCTTGTAGAAAAGTAGATCCTACAGCTTTAGTTGTTTTACATCAATCAGATTTGGGTGAATATATTCGAAATGAAGAGGGATTAATTTAA"
|
119
122
|
aaseq = @gff.assembleAA(@contigsequence,component.start,recs)
|
120
123
|
aaseq.should == "MRPLTDEETEKFFKKLSNYIGDNIKLLLEREDGEYVFRLHKDRVYYCSEKLMRQAACIGRKQLGSFGTCLGKFTKGGSFFLHITSLDYLAPYALAKIWLKPQAEQQFLYGNNIVKSGVGRMSEGIEEKQGIIIYNMSDLPLGFGVAAKGTLSCRKVDPTALVVLHQSDLGEYIRNEEGLI*"
|
@@ -161,17 +164,17 @@ describe GFFdb, "Assemble CDS" do
|
|
161
164
|
# tctttgtgcttccaaacgagctaatgacattccactacgatctcgcaatgattgtcgtct
|
162
165
|
# aattgcacctctagctgagaaaggattttctaatgttgaaggtggttgttgaggagattc
|
163
166
|
# aaacttttttctt
|
164
|
-
|
165
|
-
|
166
|
-
|
167
|
-
|
168
|
-
seq = @gff.assemble(@contigsequence,component.start,[
|
167
|
+
cds5 = recs[5]
|
168
|
+
cds5.start.should == 27981
|
169
|
+
cds5.frame.should == 1
|
170
|
+
cds5.strand.should == '-'
|
171
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds5],:phase=>true,:complement=>false)
|
169
172
|
seq.should == "TCTTTTTTCAAACTTAGAGGAGTTGTTGGTGGAAGTTGTAATCTTTTAGGAAAGAGTCGATCTCCACGTTAATCTGCTGTTAGTAACGCTCTAGCATCACCTTACAGTAATCGAGCAAACCTTCGTGTTTCTCTCCCAAGACTGGAATAATCTTCAATATTATCATTTCTTCTGGAAAGAAGATTATGTCGC"
|
170
173
|
seq.size.should == 192
|
171
|
-
seq = @gff.assemble(@contigsequence,component.start,[
|
174
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds5],:phase=>true,:reverse=>true,:complement=>true)
|
172
175
|
seq.should == "AGAAAAAAGTTTGAATCTCCTCAACAACCACCTTCAACATTAGAAAATCCTTTCTCAGCTAGAGGTGCAATTAGACGACAATCATTGCGAGATCGTAGTGGAATGTCATTAGCTCGTTTGGAAGCACAAAGAGAGGGTTCTGACCTTATTAGAAGTTATAATAGTAAAGAAGACCTTTCTTCTAATACAGCG"
|
173
176
|
seq.size.should == 192
|
174
|
-
aaseq = @gff.assembleAA(@contigsequence,component.start,[
|
177
|
+
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds5],:phase=>true)
|
175
178
|
# note it should handle the frame shift and direction!
|
176
179
|
# >EMBOSS_001_4
|
177
180
|
# RKKFESPQQPPSTLENPFSARGAIRRQSLRDRSGMSLARLEAQREGSDLIRSYNSKEDLSSNTA
|
@@ -190,9 +193,9 @@ describe GFFdb, "Assemble CDS" do
|
|
190
193
|
cds2.start.should == 27981
|
191
194
|
cds2.frame.should == 1
|
192
195
|
cds2.strand.should == '-'
|
193
|
-
seq = @gff.assemble(@contigsequence,component.start,[cds2],:complement=>true)
|
194
|
-
seq.should == "
|
195
|
-
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds2])
|
196
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds2],:reverse=>false,:complement=>true)
|
197
|
+
seq.should == "CGCTGTATTAGAAGAAAGGTCTTCTTTACTATTATAACTTCTAATAAGGTCAGAACCCTCTCTTTGTGCTTCCAAACGAGCTAATGACATTCCACTACGATCTCGCAATGATTGTCGTCTAATTGCACCTCTAGCTGAGAAAGGATTTTCTAATGTTGAAGGTGGTTGTTGAGGAGATTCAAACTTTTTTCT"
|
198
|
+
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds2],:phase=>true)
|
196
199
|
# note it should handle the frame shift and direction!
|
197
200
|
# >27981..28173_4 RKKFESPQQPPSTLENPFSARGAIRRQSLRDRSGMSLARLEAQREGSDLIRSYNSKEDLSSNTA
|
198
201
|
aaseq.should == "RKKFESPQQPPSTLENPFSARGAIRRQSLRDRSGMSLARLEAQREGSDLIRSYNSKEDLSSNTA"
|
@@ -222,17 +225,17 @@ describe GFFdb, "Assemble CDS" do
|
|
222
225
|
cds2.strand.should == '-'
|
223
226
|
seq = @gff.assemble(@contigsequence,component.start,[cds2], :raw=>true)
|
224
227
|
seq.should == "ATAAATTTCCCTTTCTCCAGAAAAACTTACAAAAGTAGATTTATCAACAGAATTTCTTTGATCTAAAGGTAATCCTCTTTGATGTAAAATTTTCATATCATTTAACATTTCCCTTTCTGGTTGTTGTCTTCTTTCATCAATCATTTCTTGTGTAATTCCTCTAGCAGCCATTTCAGATTCAATAAGGTCAAGGGTTTGTTCATCATCACAAATATCATAAGGCATATTACCATCTGCATTTACTGCTAGTAAATCTGCGTTG"
|
225
|
-
seq = @gff.assemble(@contigsequence,component.start,[cds2], :
|
228
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds2], :phase=>true)
|
226
229
|
seq.should == "AACGCAGATTTACTAGCAGTAAATGCAGATGGTAATATGCCTTATGATATTTGTGATGATGAACAAACCCTTGACCTTATTGAATCTGAAATGGCTGCTAGAGGAATTACACAAGAAATGATTGATGAAAGAAGACAACAACCAGAAAGGGAAATGTTAAATGATATGAAAATTTTACATCAAAGAGGATTACCTTTAGATCAAAGAAATTCTGTTGATAAATCTACTTTTGTAAGTTTTTCTGGAGAAAGGGAAATTTAT"
|
227
230
|
# cds1.frame = 1
|
228
|
-
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds2])
|
231
|
+
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds2],:phase=>true)
|
229
232
|
# note it should handle the frame shift and direction!
|
230
233
|
aaseq.should == "NADLLAVNADGNMPYDICDDEQTLDLIESEMAARGITQEMIDERRQQPEREMLNDMKILHQRGLPLDQRNSVDKSTFVSFSGEREIY"
|
231
234
|
end
|
232
235
|
it "should assemble the protein sequence for MhA1_Contig1133.frz3.gene11" do
|
233
236
|
recs = @cdslist['cds:MhA1_Contig1133.frz3.gene11']
|
234
237
|
component = @componentlist['cds:MhA1_Contig1133.frz3.gene11']
|
235
|
-
seq = @gff.assemble(@contigsequence,component.start,recs, :
|
238
|
+
seq = @gff.assemble(@contigsequence,component.start,recs, :reverse=>true, :complement=>true)
|
236
239
|
seq.should == "ATGGACCATCATGCATTGGTGGAGGAATTACCAGAAATTGAAAAATTAACTCCTCAAGAACGTATTGCATTAGCTAGAGAACGCCGTGCTGAACAACTTCGACAGAATGCTGCACGGGAGGCTCAATTGCCAATGCCTGCACAGCGCCGGCCTCGTCTTCGATTTACACCAGATGTTGCTTTACTTGAGGCAACATGTGCCATTGACAATAATGAAAGAATTGTTCGTCTTCTGCTTAGGTACGGAGCTTGTGTTAATGCCAAAGACACTGAACTTTGGACACCATTGCACGCAGCTGCATGTTGTGCTTATATTGATATTGTTCGATTGCTTATTGCACACAACGCAGATTTACTAGCAGTAAATGCAGATGGTAATATGCCTTATGATATTTGTGATGATGAACAAACCCTTGACCTTATTGAATCTGAAATGGCTGCTAGAGGAATTACACAAGAAATGATTGATGAAAGAAGACAACAACCAGAAAGGGAAATGTTAAATGATATGAAAATTTTACATCAAAGAGGATTACCTTTAGATCAAAGAAATTCTGTTGATAAATCTACTTTTGTAAGTTTTTCTGGAGAAAGGGAAATTTATTTACATATAGCAGCAGCTAATGGTTATTATGATGTTGCTGCTTTCCTTCTTCGTTGTAATGTTTCTCCAGCATTGAGAGATATAGATTTGTGGCAACCAATTCATGCAGCTGCTTCTTGGAATCAACCAGACTTAATCGAGCTTTTATGCGAATATGGGGCTGATATAAATGCAAAAACTGGAGCTGGGGAAAGCCCTTTAGAATTAACTGAAGATGAACCAACCCAACAAGTAATTAGAACAATCGCTCAGACAGAAGCAAGGAGACGGCGTGGTCCAGGTGGTGGTTACTTTGGTGTTCGTGATTCTCGACGACAAAGCCGAAAAAGAAAAAAGTTTGAATCTCCTCAACAACCACCTTCAACATTAGAAAATCCTTTCTCAGCTAGAGGTGCAATTAGACGACAATCATTGCGAGATCGTAGTGGAATGTCATTAGCTCGTTTGGAAGCACAAAGAGAGGGTTCTGACCTTATTAGAAGTTATAATAGTAAAGAAGACCTTTCTTCTAATACAGCGGATGATTCTTTAAATGTTGGAAGTTCTTCATATCTCAACAATCCAACAGCCTCGGCTAGTGCTTCCTCTTCAGCATTACACGGAACTCCACATCAACAACAACGTCGTGAATCTCCACCTAAACGTGCATTAATGGCTAGAAGTGCTTCTCATCAAAAACAAAAACAACAAATGTCTCCAGATGAATGGCTGAAAAAATTAGAAGCAGATTCTGCAGGTTTTCGAGATAATGATGGAGAAGATGGTGAATTACAATCTGAACTTAAAGGAGGACAAAGAATGAAGAGTGGTGGTGGTGGAGGAGCGAGAGGTCAGCAAGAAATGAATGGTGGTCCAACAGCAACATTTGGTGGAGCTTCAAAACAACAATTAGCAATGGGCTCTGGACCCAATAGACGGCGCAAACAAGGATGTTGCTCTGTTTTGTGA"
|
237
240
|
aaseq = @gff.assembleAA(@contigsequence,component.start,recs)
|
238
241
|
aaseq.should == "MDHHALVEELPEIEKLTPQERIALARERRAEQLRQNAAREAQLPMPAQRRPRLRFTPDVALLEATCAIDNNERIVRLLLRYGACVNAKDTELWTPLHAAACCAYIDIVRLLIAHNADLLAVNADGNMPYDICDDEQTLDLIESEMAARGITQEMIDERRQQPEREMLNDMKILHQRGLPLDQRNSVDKSTFVSFSGEREIYLHIAAANGYYDVAAFLLRCNVSPALRDIDLWQPIHAAASWNQPDLIELLCEYGADINAKTGAGESPLELTEDEPTQQVIRTIAQTEARRRRGPGGGYFGVRDSRRQSRKRKKFESPQQPPSTLENPFSARGAIRRQSLRDRSGMSLARLEAQREGSDLIRSYNSKEDLSSNTADDSLNVGSSSYLNNPTASASASSSALHGTPHQQQRRESPPKRALMARSASHQKQKQQMSPDEWLKKLEADSAGFRDNDGEDGELQSELKGGQRMKSGGGGGARGQQEMNGGPTATFGGASKQQLAMGSGPNRRRKQGCCSVL*"
|
data/spec/gff3_fileiterator_spec.rb
CHANGED
@@ -1,12 +1,12 @@
|
|
1
1
|
# RSpec for BioRuby-GFF3-Plugin. Run with something like:
|
2
2
|
#
|
3
|
-
#
|
3
|
+
# rspec -I ../bioruby/lib/ spec/gff3_fileiterator_spec.rb
|
4
4
|
#
|
5
5
|
# Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
6
6
|
#
|
7
7
|
$: << "../lib"
|
8
8
|
|
9
|
-
require 'bio
|
9
|
+
require 'bio-gff3'
|
10
10
|
|
11
11
|
TEST1='test/data/gff/test.gff3'
|
12
12
|
TEST2='test/data/gff/standard.gff3'
|
data/spec/gffdb_spec.rb
CHANGED
@@ -1,13 +1,13 @@
|
|
1
1
|
# RSpec for BioRuby-GFF3-Plugin. Run with something like:
|
2
2
|
#
|
3
|
-
#
|
3
|
+
# rspec -I ../bioruby/lib/ spec/gffdb_spec.rb
|
4
4
|
#
|
5
5
|
# Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
6
6
|
#
|
7
7
|
$: << "../lib"
|
8
8
|
|
9
9
|
|
10
|
-
require 'bio
|
10
|
+
require 'bio-gff3'
|
11
11
|
|
12
12
|
include Bio::GFFbrowser
|
13
13
|
|
data/test/data/gff/test.gff3
CHANGED
@@ -64,11 +64,11 @@ AATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT
|
|
64
64
|
GCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC
|
65
65
|
CCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC
|
66
66
|
>test02
|
67
|
-
|
67
|
+
ACGACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA
|
68
68
|
GTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA
|
69
69
|
CCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGAT
|
70
70
|
AATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT
|
71
71
|
GCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC
|
72
|
-
|
72
|
+
CCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGACAAAAAA
|
73
73
|
|
74
74
|
|
metadata
CHANGED
@@ -4,9 +4,9 @@ version: !ruby/object:Gem::Version
|
|
4
4
|
prerelease: false
|
5
5
|
segments:
|
6
6
|
- 0
|
7
|
-
-
|
7
|
+
- 8
|
8
8
|
- 0
|
9
|
-
version: 0.
|
9
|
+
version: 0.8.0
|
10
10
|
platform: ruby
|
11
11
|
authors:
|
12
12
|
- Pjotr Prins
|
@@ -14,7 +14,7 @@ autorequire:
|
|
14
14
|
bindir: bin
|
15
15
|
cert_chain: []
|
16
16
|
|
17
|
-
date: 2010-12-
|
17
|
+
date: 2010-12-31 00:00:00 +01:00
|
18
18
|
default_executable: gff3-fetch
|
19
19
|
dependencies:
|
20
20
|
- !ruby/object:Gem::Dependency
|
@@ -88,6 +88,19 @@ dependencies:
|
|
88
88
|
type: :development
|
89
89
|
prerelease: false
|
90
90
|
version_requirements: *id005
|
91
|
+
- !ruby/object:Gem::Dependency
|
92
|
+
name: rspec
|
93
|
+
requirement: &id006 !ruby/object:Gem::Requirement
|
94
|
+
none: false
|
95
|
+
requirements:
|
96
|
+
- - ">="
|
97
|
+
- !ruby/object:Gem::Version
|
98
|
+
segments:
|
99
|
+
- 0
|
100
|
+
version: "0"
|
101
|
+
type: :development
|
102
|
+
prerelease: false
|
103
|
+
version_requirements: *id006
|
91
104
|
description: |
|
92
105
|
GFF3 (genome browser) information and digest mRNA and CDS sequences.
|
93
106
|
Options for low memory use and caching of records.
|
@@ -100,13 +113,11 @@ extensions: []
|
|
100
113
|
|
101
114
|
extra_rdoc_files:
|
102
115
|
- LICENSE.txt
|
103
|
-
- README
|
104
116
|
- README.rdoc
|
105
117
|
files:
|
106
118
|
- Gemfile
|
107
119
|
- Gemfile.lock
|
108
120
|
- LICENSE.txt
|
109
|
-
- README
|
110
121
|
- README.rdoc
|
111
122
|
- Rakefile
|
112
123
|
- VERSION
|
@@ -151,7 +162,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
151
162
|
requirements:
|
152
163
|
- - ">="
|
153
164
|
- !ruby/object:Gem::Version
|
154
|
-
hash: -
|
165
|
+
hash: -1033924243
|
155
166
|
segments:
|
156
167
|
- 0
|
157
168
|
version: "0"
|
data/README
DELETED
@@ -1,65 +0,0 @@
|
|
1
|
-
= GFF3 plugin for BioRuby, aimed at parsing big data =
|
2
|
-
|
3
|
-
Features:
|
4
|
-
|
5
|
-
# Take GFF (genome browser) information and digest mRNA and CDS sequences
|
6
|
-
# Options for low memory use and caching of records
|
7
|
-
# Support for external FASTA files
|
8
|
-
|
9
|
-
You can use this plugin in two ways. First as a standalone program, next as a
|
10
|
-
plugin library to BioRuby.
|
11
|
-
|
12
|
-
For example, fetch mRNA and CDS information from GFF3 files and output to FASTA:
|
13
|
-
|
14
|
-
./bin/gff3-fetch mrna test/data/gff/test.gff3
|
15
|
-
./bin/gff3-fetch cds test/data/gff/test.gff3
|
16
|
-
|
17
|
-
Or clone this repository and add the 'lib' dir to the Ruby search path and
|
18
|
-
|
19
|
-
require 'bio/db/gff/gffdb'
|
20
|
-
|
21
|
-
You can also run RSpec with something like
|
22
|
-
|
23
|
-
ruby -I ../bioruby/lib/ ~/.gems/bin/spec spec/gffdb_spec.rb
|
24
|
-
|
25
|
-
This implementation depends on BioRuby's basic GFF3 parser, with the possible
|
26
|
-
advantage that the plugin is faster and does not consume all memory. The Gff3
|
27
|
-
specs are based on the output of the Wormbase genome browser.
|
28
|
-
|
29
|
-
For a write-up see http://thebird.nl/bioruby/BioRuby_GFF3.html
|
30
|
-
|
31
|
-
Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
32
|
-
|
33
|
-
-------------------------------------------------------------------------------
|
34
|
-
|
35
|
-
Usage:
|
36
|
-
|
37
|
-
BioRuby GFF3 Plugin Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
38
|
-
|
39
|
-
Fetch and assemble mRNAs, or CDS and print in FASTA format.
|
40
|
-
|
41
|
-
gff3-fetch [--no-cache] mRNA|CDS [filename.fa] filename.gff
|
42
|
-
|
43
|
-
Where:
|
44
|
-
|
45
|
-
--no-cache : do not load everything in memory
|
46
|
-
mRNA : assemble mRNA
|
47
|
-
CDS : assemble CDS
|
48
|
-
|
49
|
-
Multiple GFF3 files can be used. For external FASTA files, always the last
|
50
|
-
one before the GFF file is used.
|
51
|
-
|
52
|
-
Examples:
|
53
|
-
|
54
|
-
Find mRNA and CDS information from test.gff3 (which includes sequence information)
|
55
|
-
|
56
|
-
./bin/gff3-fetch mRNA test/data/gff/test.gff3
|
57
|
-
./bin/gff3-fetch CDS test/data/gff/test.gff3
|
58
|
-
|
59
|
-
Find mRNA from external FASTA file, without loading everythin in RAM
|
60
|
-
|
61
|
-
./bin/gff3-fetch --no-cache mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3
|
62
|
-
|
63
|
-
If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
|
64
|
-
|
65
|
-
|