bio-gff3 0.6.0 → 0.8.0
- data/Gemfile +1 -0
- data/Gemfile.lock +10 -0
- data/README.rdoc +63 -13
- data/VERSION +1 -1
- data/bin/gff3-fetch +12 -9
- data/bio-gff3.gemspec +5 -4
- data/lib/bio-gff3.rb +1 -0
- data/lib/bio/db/gff/gffassemble.rb +52 -52
- data/lib/bio/db/gff/gfffasta.rb +1 -1
- data/spec/gff3_assemble2_spec.rb +1 -1
- data/spec/gff3_assemble3_spec.rb +3 -3
- data/spec/gff3_assemble_spec.rb +23 -20
- data/spec/gff3_fileiterator_spec.rb +2 -2
- data/spec/gffdb_spec.rb +2 -2
- data/test/data/gff/test.gff3 +2 -2
- metadata +17 -6
- data/README +0 -65
data/Gemfile
CHANGED
data/Gemfile.lock
CHANGED
@@ -2,6 +2,7 @@ GEM
|
|
2
2
|
remote: http://rubygems.org/
|
3
3
|
specs:
|
4
4
|
bio (1.4.1)
|
5
|
+
diff-lcs (1.1.2)
|
5
6
|
git (1.2.5)
|
6
7
|
jeweler (1.5.2)
|
7
8
|
bundler (~> 1.0.0)
|
@@ -9,6 +10,14 @@ GEM
|
|
9
10
|
rake
|
10
11
|
rake (0.8.7)
|
11
12
|
rcov (0.9.9)
|
13
|
+
rspec (2.3.0)
|
14
|
+
rspec-core (~> 2.3.0)
|
15
|
+
rspec-expectations (~> 2.3.0)
|
16
|
+
rspec-mocks (~> 2.3.0)
|
17
|
+
rspec-core (2.3.1)
|
18
|
+
rspec-expectations (2.3.0)
|
19
|
+
diff-lcs (~> 1.1.2)
|
20
|
+
rspec-mocks (2.3.0)
|
12
21
|
shoulda (2.11.3)
|
13
22
|
|
14
23
|
PLATFORMS
|
@@ -19,4 +28,5 @@ DEPENDENCIES
|
|
19
28
|
bundler (~> 1.0.0)
|
20
29
|
jeweler (~> 1.5.2)
|
21
30
|
rcov
|
31
|
+
rspec
|
22
32
|
shoulda
|
data/README.rdoc
CHANGED
@@ -1,19 +1,69 @@
|
|
1
1
|
= bio-gff3
|
2
2
|
|
3
|
-
|
4
|
-
|
5
|
-
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
|
3
|
+
GFF3 plugin for BioRuby, aimed at parsing big data
|
4
|
+
|
5
|
+
Features:
|
6
|
+
|
7
|
+
# Take GFF (genome browser) information and digest mRNA and CDS sequences
|
8
|
+
# Options for low memory use and caching of records
|
9
|
+
# Support for external FASTA files
|
10
|
+
|
11
|
+
You can use this plugin in two ways. First as a standalone program, next as a
|
12
|
+
plugin library to BioRuby.
|
13
|
+
|
14
|
+
For example, fetch mRNA and CDS information from GFF3 files and output to FASTA:
|
15
|
+
|
16
|
+
./bin/gff3-fetch mrna test/data/gff/test.gff3
|
17
|
+
./bin/gff3-fetch cds test/data/gff/test.gff3
|
18
|
+
|
19
|
+
Or clone this repository and add the 'lib' dir to the Ruby search path and
|
20
|
+
|
21
|
+
require 'bio/db/gff/gffdb'
|
22
|
+
|
23
|
+
You can also run RSpec with something like
|
24
|
+
|
25
|
+
rspec -I ../bioruby/lib/ spec/*.rb
|
26
|
+
|
27
|
+
This implementation depends on BioRuby's basic GFF3 parser, with the possible
|
28
|
+
advantage that the plugin is faster and does not consume all memory. The Gff3
|
29
|
+
specs are based on the output of the Wormbase genome browser.
|
30
|
+
|
31
|
+
For a write-up see http://thebird.nl/bioruby/BioRuby_GFF3.html
|
32
|
+
|
33
|
+
-------------------------------------------------------------------------------
|
34
|
+
|
35
|
+
|
36
|
+
Fetch and assemble mRNAs, or CDS and print in FASTA format.
|
37
|
+
|
38
|
+
gff3-fetch [--no-cache] mRNA|CDS [filename.fa] filename.gff
|
39
|
+
|
40
|
+
Where:
|
41
|
+
|
42
|
+
--no-cache : do not load everything in memory (slower)
|
43
|
+
mRNA : assemble mRNA
|
44
|
+
CDS : assemble CDS
|
45
|
+
|
46
|
+
Multiple GFF3 files can be used. For external FASTA files, always the last
|
47
|
+
one before the GFF file is used.
|
48
|
+
|
49
|
+
Examples:
|
50
|
+
|
51
|
+
Find mRNA and CDS information from test.gff3 (which includes sequence information)
|
52
|
+
|
53
|
+
gff3-fetch mRNA test/data/gff/test.gff3
|
54
|
+
gff3-fetch CDS test/data/gff/test.gff3
|
55
|
+
|
56
|
+
Find CDS from external FASTA file
|
57
|
+
|
58
|
+
gff3-fetch CDS test/data/gff/MhA1_Contig1133.fa test/data/gff/MhA1_Contig1133.gff3
|
59
|
+
|
60
|
+
Find mRNA from external FASTA file, without loading everything in RAM
|
61
|
+
|
62
|
+
gff3-fetch --no-cache mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3
|
63
|
+
|
64
|
+
If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
|
14
65
|
|
15
66
|
== Copyright
|
16
67
|
|
17
|
-
Copyright (
|
18
|
-
further details.
|
68
|
+
Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
19
69
|
|
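Note on the usage text added to README.rdoc: it gives the argument order `[--no-cache] mRNA|CDS [filename.fa] filename.gff` and states that the last FASTA file before a GFF file is the one used. A minimal Ruby sketch of one reading of that rule follows. `plan_gff3_fetch` is a hypothetical helper, not part of bio-gff3. Recognising FASTA files by a `.fa`/`.fasta` extension, and keeping the FASTA file for any later GFF files, are both assumptions that the diff does not confirm.

```ruby
# Hypothetical helper, not part of bio-gff3: turns an argument list such as
#   --no-cache mRNA seqs.fa genes.gff3
# into a plan, following the usage text in README.rdoc.
def plan_gff3_fetch(argv)
  args = argv.dup
  cache = true
  if args.first == '--no-cache'
    cache = false
    args.shift
  end

  # The README examples use both 'mrna' and 'mRNA', so compare case-insensitively.
  feature = args.shift.to_s.downcase
  unless %w[mrna cds].include?(feature)
    raise ArgumentError, 'expected mRNA or CDS'
  end

  jobs = []
  fasta = nil
  args.each do |name|
    if name =~ /\.(fa|fasta)\z/i     # assumption: FASTA files are recognised by extension
      fasta = name                   # a later FASTA file replaces an earlier one
    else
      jobs << { :gff => name, :fasta => fasta }
    end
  end
  { :cache => cache, :feature => feature, :jobs => jobs }
end
```

With `['CDS', 'test/data/gff/test.gff3']` the plan contains one job with no external FASTA file, which corresponds to the README case where the GFF3 file carries its own sequence.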
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.
|
1
|
+
0.8.0
|
data/bin/gff3-fetch
CHANGED
@@ -4,7 +4,7 @@
|
|
4
4
|
# Copyright:: August 2010
|
5
5
|
# License:: Ruby License
|
6
6
|
#
|
7
|
-
# Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
7
|
+
# Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
8
8
|
|
9
9
|
|
10
10
|
USAGE = <<EOM
|
@@ -14,7 +14,7 @@ USAGE = <<EOM
|
|
14
14
|
|
15
15
|
Where:
|
16
16
|
|
17
|
-
--no-cache : do not load everything in memory
|
17
|
+
--no-cache : do not load everything in memory (slower)
|
18
18
|
mRNA : assemble mRNA
|
19
19
|
CDS : assemble CDS
|
20
20
|
|
@@ -25,19 +25,22 @@ USAGE = <<EOM
|
|
25
25
|
|
26
26
|
Find mRNA and CDS information from test.gff3 (which includes sequence information)
|
27
27
|
|
28
|
-
|
29
|
-
|
28
|
+
gff3-fetch mRNA test/data/gff/test.gff3
|
29
|
+
gff3-fetch CDS test/data/gff/test.gff3
|
30
30
|
|
31
|
-
Find CDS from
|
31
|
+
Find CDS from external FASTA file
|
32
32
|
|
33
|
-
|
33
|
+
gff3-fetch CDS test/data/gff/MhA1_Contig1133.fa test/data/gff/MhA1_Contig1133.gff3
|
34
34
|
|
35
35
|
Find mRNA from external FASTA file, without loading everything in RAM
|
36
36
|
|
37
|
-
|
37
|
+
gff3-fetch --no-cache mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3
|
38
38
|
|
39
39
|
If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
|
40
40
|
|
41
|
+
== Copyright
|
42
|
+
|
43
|
+
Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
41
44
|
|
42
45
|
EOM
|
43
46
|
|
@@ -45,9 +48,9 @@ rootpath = File.dirname(File.dirname(__FILE__))
|
|
45
48
|
$: << rootpath+'/lib'
|
46
49
|
$: << rootpath+'/../bioruby/lib'
|
47
50
|
|
48
|
-
require 'bio
|
51
|
+
require 'bio-gff3'
|
49
52
|
|
50
|
-
$stderr.print "BioRuby GFF3 Plugin Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>\n\n"
|
53
|
+
$stderr.print "BioRuby GFF3 Plugin Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>\n\n"
|
51
54
|
|
52
55
|
if ARGV.size == 0
|
53
56
|
print USAGE
|
data/bio-gff3.gemspec
CHANGED
@@ -5,11 +5,11 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = %q{bio-gff3}
|
8
|
-
s.version = "0.
|
8
|
+
s.version = "0.8.0"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Pjotr Prins"]
|
12
|
-
s.date = %q{2010-12-
|
12
|
+
s.date = %q{2010-12-31}
|
13
13
|
s.default_executable = %q{gff3-fetch}
|
14
14
|
s.description = %q{GFF3 (genome browser) information and digest mRNA and CDS sequences.
|
15
15
|
Options for low memory use and caching of records.
|
@@ -19,14 +19,12 @@ Support for external FASTA files.
|
|
19
19
|
s.executables = ["gff3-fetch"]
|
20
20
|
s.extra_rdoc_files = [
|
21
21
|
"LICENSE.txt",
|
22
|
-
"README",
|
23
22
|
"README.rdoc"
|
24
23
|
]
|
25
24
|
s.files = [
|
26
25
|
"Gemfile",
|
27
26
|
"Gemfile.lock",
|
28
27
|
"LICENSE.txt",
|
29
|
-
"README",
|
30
28
|
"README.rdoc",
|
31
29
|
"Rakefile",
|
32
30
|
"VERSION",
|
@@ -83,12 +81,14 @@ Support for external FASTA files.
|
|
83
81
|
s.add_development_dependency(%q<jeweler>, ["~> 1.5.2"])
|
84
82
|
s.add_development_dependency(%q<rcov>, [">= 0"])
|
85
83
|
s.add_development_dependency(%q<bio>, [">= 1.4.1"])
|
84
|
+
s.add_development_dependency(%q<rspec>, [">= 0"])
|
86
85
|
else
|
87
86
|
s.add_dependency(%q<shoulda>, [">= 0"])
|
88
87
|
s.add_dependency(%q<bundler>, ["~> 1.0.0"])
|
89
88
|
s.add_dependency(%q<jeweler>, ["~> 1.5.2"])
|
90
89
|
s.add_dependency(%q<rcov>, [">= 0"])
|
91
90
|
s.add_dependency(%q<bio>, [">= 1.4.1"])
|
91
|
+
s.add_dependency(%q<rspec>, [">= 0"])
|
92
92
|
end
|
93
93
|
else
|
94
94
|
s.add_dependency(%q<shoulda>, [">= 0"])
|
@@ -96,6 +96,7 @@ Support for external FASTA files.
|
|
96
96
|
s.add_dependency(%q<jeweler>, ["~> 1.5.2"])
|
97
97
|
s.add_dependency(%q<rcov>, [">= 0"])
|
98
98
|
s.add_dependency(%q<bio>, [">= 1.4.1"])
|
99
|
+
s.add_dependency(%q<rspec>, [">= 0"])
|
99
100
|
end
|
100
101
|
end
|
101
102
|
|
data/lib/bio-gff3.rb
CHANGED
@@ -0,0 +1 @@
|
|
1
|
+
require 'bio/db/gff/gffdb'
|
data/lib/bio/db/gff/gffassemble.rb
CHANGED
@@ -198,77 +198,76 @@ module Bio
|
|
198
198
|
# to the landmark given in column 1 - in this case the sequence as it
|
199
199
|
# is passed in. The following options are available:
|
200
200
|
#
|
201
|
-
# :
|
202
|
-
# :
|
203
|
-
# :
|
204
|
-
# :trim : make sure sequence is multiple of 3 nucleotide bps (false)
|
201
|
+
# :reverse : do reverse if reverse is indicated (default true)
|
202
|
+
# :complement : do complement if reverse is indicated (default true)
|
203
|
+
# :phase : do set CDS phase (default false, normally ignore)
|
204
|
+
# :trim : make sure sequence is multiple of 3 nucleotide bps (default false)
|
205
205
|
#
|
206
206
|
# there are two special options:
|
207
207
|
#
|
208
208
|
# :raw : raw sequence (all above false)
|
209
|
-
# :codonize : codon sequence (
|
209
|
+
# :codonize : codon sequence (reverse, complement and trim are true)
|
210
210
|
#
|
211
|
-
def assemble sequence, startpos, reclist, options = { :phase=>
|
211
|
+
def assemble sequence, startpos, reclist, options = { :phase=>false, :reverse=>true, :trim=>false, :complement=>true, :debug=>false }
|
212
|
+
do_debug = options[:debug]
|
212
213
|
do_phase = options[:phase]
|
213
|
-
do_reverse = options[:reverse]
|
214
|
-
do_trim
|
215
|
-
do_complement = options[:complement]
|
214
|
+
do_reverse = (options[:reverse] == false ? false : true)
|
215
|
+
do_trim = (options[:trim] == false ? false : true)
|
216
|
+
do_complement = (options[:complement] == false ? false : true)
|
216
217
|
if options[:raw]
|
217
218
|
do_phase = false
|
218
219
|
do_reverse = false
|
219
220
|
do_trim = false
|
220
221
|
do_complement = false
|
221
222
|
elsif options[:codonize]
|
222
|
-
do_phase =
|
223
|
+
do_phase = false
|
223
224
|
do_reverse = true
|
224
225
|
do_trim = true
|
225
226
|
do_complement = true
|
226
227
|
end
|
227
|
-
retval = ""
|
228
228
|
sectionlist = Sections::sort(reclist)
|
229
|
-
reverse = false
|
230
|
-
# we assume strand is always the same
|
231
229
|
rec0 = sectionlist.first.rec
|
232
|
-
|
233
|
-
|
234
|
-
|
235
|
-
|
236
|
-
|
237
|
-
|
238
|
-
|
239
|
-
|
240
|
-
|
241
|
-
|
242
|
-
|
243
|
-
|
244
|
-
seq = sequence[(rec.start-1)..(rec.end-1)]
|
245
|
-
retval += seq
|
230
|
+
# we assume ORF is always read in the same direction
|
231
|
+
orf_reverse = (rec0.strand == '-')
|
232
|
+
orf_frame = startpos - 1
|
233
|
+
orf_frameshift = orf_frame % 3
|
234
|
+
sectionlist = sectionlist.reverse if orf_reverse
|
235
|
+
if do_debug
|
236
|
+
p "------------------"
|
237
|
+
p options
|
238
|
+
p [:reverse,do_reverse]
|
239
|
+
p [:complement,do_complement]
|
240
|
+
p [:trim,do_trim]
|
241
|
+
p [:orf_reverse, orf_reverse, rec0.strand]
|
246
242
|
end
|
247
|
-
|
248
|
-
if
|
249
|
-
#
|
250
|
-
|
243
|
+
|
244
|
+
if sequence.kind_of?(Bio::FastaFormat)
|
245
|
+
# BioRuby conversion
|
246
|
+
sequence = sequence.seq
|
251
247
|
end
|
252
|
-
|
253
|
-
|
254
|
-
|
255
|
-
|
256
|
-
|
257
|
-
|
258
|
-
# the phase appears to be disregarded - or rather handled
|
259
|
-
# by start-stop. This is a hack.
|
260
|
-
if do_reverse and reverse and (seq.size % 3 == 0)
|
261
|
-
# do nothing
|
262
|
-
else
|
263
|
-
seq = seq[frame..-1] if frame != 0 # set phase
|
248
|
+
# Generate array of sequences
|
249
|
+
seq = sectionlist.map { | section |
|
250
|
+
rec = section.rec
|
251
|
+
s = sequence[(section.begin-1)..(section.end-1)]
|
252
|
+
if do_reverse and orf_reverse
|
253
|
+
s = s.reverse
|
264
254
|
end
|
265
|
-
|
266
|
-
|
267
|
-
#
|
268
|
-
if
|
269
|
-
|
270
|
-
|
255
|
+
# Correct for phase. Unfortunately the use of phase is ambiguous.
|
256
|
+
# Here we check whether rec.start is in line with orf_frame. If it
|
257
|
+
# is, we correct for phase. Otherwise it is ignored.
|
258
|
+
if do_phase and rec.phase
|
259
|
+
phase = rec.phase.to_i
|
260
|
+
# if ((rec.start-startpos) % 3 == 0)
|
261
|
+
s = s[phase..-1]
|
262
|
+
# end
|
271
263
|
end
|
264
|
+
s
|
265
|
+
}
|
266
|
+
# p seq
|
267
|
+
seq = seq.join
|
268
|
+
if do_complement and do_reverse and orf_reverse
|
269
|
+
ntseq = Bio::Sequence::NA.new(seq)
|
270
|
+
seq = ntseq.forward_complement.upcase
|
272
271
|
end
|
273
272
|
if do_trim
|
274
273
|
reduce = seq.size % 3
|
@@ -279,9 +278,10 @@ module Bio
|
|
279
278
|
end
|
280
279
|
|
281
280
|
# Patch a sequence together from a Sequence string and an array
|
282
|
-
# of records and translate in the correct direction and frame
|
283
|
-
|
284
|
-
|
281
|
+
# of records and translate in the correct direction and frame. The options
|
282
|
+
# are the same as for +assemble+.
|
283
|
+
def assembleAA sequence, startpos, reclist, options = { :phase=>false, :reverse=>true, :trim=>false, :complement=>true }
|
284
|
+
seq = assemble(sequence, startpos, reclist, options)
|
285
285
|
ntseq = Bio::Sequence::NA.new(seq)
|
286
286
|
ntseq.translate
|
287
287
|
end
|
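Note on the option handling in `assemble`: the flags are read with `options[:x] == false ? false : true`, so a key that is absent from the hash counts as true. Because a Ruby default argument is used only when the argument is omitted entirely, a call such as `assemble(seq, start, recs, :phase => true)` replaces the whole default hash, and `:trim` then resolves to true even though the comment documents it as default false. The stand-alone sketch below restates only the flag logic as it reads in this hunk. `resolve_assemble_flags` and `ASSEMBLE_DEFAULTS` are hypothetical names used for illustration, not part of the gem.

```ruby
# Stand-alone restatement of the flag handling shown in the hunk above.
# The names are hypothetical; only the logic is taken from the diff.
ASSEMBLE_DEFAULTS = { :phase => false, :reverse => true, :trim => false,
                      :complement => true, :debug => false }

def resolve_assemble_flags(options = ASSEMBLE_DEFAULTS)
  flags = {
    :phase      => options[:phase],
    # A missing key is nil, and nil == false is false, so these become true.
    :reverse    => (options[:reverse] == false ? false : true),
    :trim       => (options[:trim] == false ? false : true),
    :complement => (options[:complement] == false ? false : true)
  }
  if options[:raw]
    # :raw switches every transformation off
    flags = { :phase => false, :reverse => false, :trim => false, :complement => false }
  elsif options[:codonize]
    # :codonize forces reverse, complement and trim on, and phase off
    flags = { :phase => false, :reverse => true, :trim => true, :complement => true }
  end
  flags
end

no_args    = resolve_assemble_flags()                # default hash applies
phase_only = resolve_assemble_flags(:phase => true)  # default hash is replaced
```

Here `no_args[:trim]` is false while `phase_only[:trim]` is true. Callers who want the documented defaults alongside one changed option could pass the remaining keys explicitly, for example `:phase => true, :trim => false`.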
data/lib/bio/db/gff/gfffasta.rb
CHANGED
data/spec/gff3_assemble2_spec.rb
CHANGED
data/spec/gff3_assemble3_spec.rb
CHANGED
@@ -1,12 +1,12 @@
|
|
1
1
|
# RSpec for BioRuby-GFF3-Plugin. Run with something like:
|
2
2
|
#
|
3
|
-
#
|
3
|
+
# rspec -I ../bioruby/lib/ spec/gff3_assemble3_spec.rb
|
4
4
|
#
|
5
|
-
# Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
5
|
+
# Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
6
6
|
#
|
7
7
|
$: << "../lib"
|
8
8
|
|
9
|
-
require 'bio
|
9
|
+
require 'bio-gff3'
|
10
10
|
|
11
11
|
include Bio::GFFbrowser
|
12
12
|
|
data/spec/gff3_assemble_spec.rb
CHANGED
@@ -1,12 +1,12 @@
|
|
1
1
|
# RSpec for BioRuby-GFF3-Plugin. Run with something like:
|
2
2
|
#
|
3
|
-
#
|
3
|
+
# rspec -I ../bioruby/lib/ spec/gff3_assemble_spec.rb
|
4
4
|
#
|
5
|
-
# Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
5
|
+
# Copyright (C) 2010,2011 Pjotr Prins <pjotr.prins@thebird.nl>
|
6
6
|
#
|
7
7
|
$: << "../lib"
|
8
8
|
|
9
|
-
require 'bio
|
9
|
+
require 'bio-gff3'
|
10
10
|
|
11
11
|
include Bio::GFFbrowser
|
12
12
|
|
@@ -83,17 +83,20 @@ describe GFFdb, "Assemble CDS" do
|
|
83
83
|
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds0])
|
84
84
|
aaseq.should == "MRPLTDEETEKFFKKLSNYIGDNIKLLLEREDGEYVFRLHKDRVYYC"
|
85
85
|
end
|
86
|
+
# MhA1_Contig1133 WormBase CDS 8065 8308 . + 1 ID=cds:MhA1_Contig1133.frz3.gene4;Parent=transcript:MhA1_Contig1133.frz3.gene4
|
86
87
|
it "should translate CDS 8065:8308 (in frame 1, + strand)" do
|
87
88
|
recs = @cdslist['cds:MhA1_Contig1133.frz3.gene4']
|
88
89
|
component = @componentlist['cds:MhA1_Contig1133.frz3.gene4']
|
89
90
|
cds1 = recs[1]
|
90
|
-
seq = @gff.assemble(@contigsequence,component.start,[cds1]
|
91
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds1])
|
91
92
|
seq.size.should == 244
|
92
93
|
seq.should == "TGAAAAATTAATGCGACAAGCAGCATGTATTGGACGTAAACAATTGGGATCTTTTGGAACTTGTTTGGGTAAATTCACAAAAGGAGGGTCTTTCTTTCTTCATATAACATCATTGGATTATTTGGCACCTTATGCTTTAGCAAAAATTTGGTTAAAACCACAAGCTGAACAACAATTTTTATATGGAAATAATATTGTTAAATCTGGTGTTGGAAGAATGAGTGAAGGGATTGAAGAAAAACAA"
|
93
|
-
seq = @gff.assemble(@contigsequence,component.start,[cds1])
|
94
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds1],:phase => true)
|
95
|
+
seq.size.should == 243
|
94
96
|
seq.should == "GAAAAATTAATGCGACAAGCAGCATGTATTGGACGTAAACAATTGGGATCTTTTGGAACTTGTTTGGGTAAATTCACAAAAGGAGGGTCTTTCTTTCTTCATATAACATCATTGGATTATTTGGCACCTTATGCTTTAGCAAAAATTTGGTTAAAACCACAAGCTGAACAACAATTTTTATATGGAAATAATATTGTTAAATCTGGTGTTGGAAGAATGAGTGAAGGGATTGAAGAAAAACAA"
|
95
|
-
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds1])
|
97
|
+
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds1],:phase => true)
|
96
98
|
# note it should handle the frame shift and direction!
|
99
|
+
# wormbase validated
|
97
100
|
aaseq.should == "EKLMRQAACIGRKQLGSFGTCLGKFTKGGSFFLHITSLDYLAPYALAKIWLKPQAEQQFLYGNNIVKSGVGRMSEGIEEKQ"
|
98
101
|
end
|
99
102
|
it "should translate CDS3 (in frame 0, + strand)" do
|
@@ -114,7 +117,7 @@ describe GFFdb, "Assemble CDS" do
|
|
114
117
|
seq.size.should == 543
|
115
118
|
seq.should == "ATGCGTCCTTTAACAGATGAAGAAACTGAAAAGTTTTTCAAAAAACTTTCAAATTATATTGGTGACAATATTAAACTTTTATTGGAAAGAGAAGATGGAGAATATGTTTTTCGTTTACATAAAGACAGAGTTTATTATTGCAGTGAAAAATTAATGCGACAAGCAGCATGTATTGGACGTAAACAATTGGGATCTTTTGGAACTTGTTTGGGTAAATTCACAAAAGGAGGGTCTTTCTTTCTTCATATAACATCATTGGATTATTTGGCACCTTATGCTTTAGCAAAAATTTGGTTAAAACCACAAGCTGAACAACAATTTTTATATGGAAATAATATTGTTAAATCTGGTGTTGGAAGAATGAGTGAAGGGATTGAAGAAAAACAAGGTATTATTATTTATAATATGTCAGATTTACCATTGGGTTTTGGAGTGGCTGCAAAGGGAACATTATCTTGTAGAAAAGTAGATCCTACAGCTTTAGTTGTTTTACATCAATCAGATTTGGGTGAATATATTCGAAATGAAGAGGGATTAATTTAA"
|
116
119
|
seq = @gff.assemble(@contigsequence,component.start,recs)
|
117
|
-
seq.size.should == 543
|
120
|
+
seq.size.should == 543 # auto correct for phase problem
|
118
121
|
seq.should == "ATGCGTCCTTTAACAGATGAAGAAACTGAAAAGTTTTTCAAAAAACTTTCAAATTATATTGGTGACAATATTAAACTTTTATTGGAAAGAGAAGATGGAGAATATGTTTTTCGTTTACATAAAGACAGAGTTTATTATTGCAGTGAAAAATTAATGCGACAAGCAGCATGTATTGGACGTAAACAATTGGGATCTTTTGGAACTTGTTTGGGTAAATTCACAAAAGGAGGGTCTTTCTTTCTTCATATAACATCATTGGATTATTTGGCACCTTATGCTTTAGCAAAAATTTGGTTAAAACCACAAGCTGAACAACAATTTTTATATGGAAATAATATTGTTAAATCTGGTGTTGGAAGAATGAGTGAAGGGATTGAAGAAAAACAAGGTATTATTATTTATAATATGTCAGATTTACCATTGGGTTTTGGAGTGGCTGCAAAGGGAACATTATCTTGTAGAAAAGTAGATCCTACAGCTTTAGTTGTTTTACATCAATCAGATTTGGGTGAATATATTCGAAATGAAGAGGGATTAATTTAA"
|
119
122
|
aaseq = @gff.assembleAA(@contigsequence,component.start,recs)
|
120
123
|
aaseq.should == "MRPLTDEETEKFFKKLSNYIGDNIKLLLEREDGEYVFRLHKDRVYYCSEKLMRQAACIGRKQLGSFGTCLGKFTKGGSFFLHITSLDYLAPYALAKIWLKPQAEQQFLYGNNIVKSGVGRMSEGIEEKQGIIIYNMSDLPLGFGVAAKGTLSCRKVDPTALVVLHQSDLGEYIRNEEGLI*"
|
@@ -161,17 +164,17 @@ describe GFFdb, "Assemble CDS" do
|
|
161
164
|
# tctttgtgcttccaaacgagctaatgacattccactacgatctcgcaatgattgtcgtct
|
162
165
|
# aattgcacctctagctgagaaaggattttctaatgttgaaggtggttgttgaggagattc
|
163
166
|
# aaacttttttctt
|
164
|
-
|
165
|
-
|
166
|
-
|
167
|
-
|
168
|
-
seq = @gff.assemble(@contigsequence,component.start,[
|
167
|
+
cds5 = recs[5]
|
168
|
+
cds5.start.should == 27981
|
169
|
+
cds5.frame.should == 1
|
170
|
+
cds5.strand.should == '-'
|
171
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds5],:phase=>true,:complement=>false)
|
169
172
|
seq.should == "TCTTTTTTCAAACTTAGAGGAGTTGTTGGTGGAAGTTGTAATCTTTTAGGAAAGAGTCGATCTCCACGTTAATCTGCTGTTAGTAACGCTCTAGCATCACCTTACAGTAATCGAGCAAACCTTCGTGTTTCTCTCCCAAGACTGGAATAATCTTCAATATTATCATTTCTTCTGGAAAGAAGATTATGTCGC"
|
170
173
|
seq.size.should == 192
|
171
|
-
seq = @gff.assemble(@contigsequence,component.start,[
|
174
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds5],:phase=>true,:reverse=>true,:complement=>true)
|
172
175
|
seq.should == "AGAAAAAAGTTTGAATCTCCTCAACAACCACCTTCAACATTAGAAAATCCTTTCTCAGCTAGAGGTGCAATTAGACGACAATCATTGCGAGATCGTAGTGGAATGTCATTAGCTCGTTTGGAAGCACAAAGAGAGGGTTCTGACCTTATTAGAAGTTATAATAGTAAAGAAGACCTTTCTTCTAATACAGCG"
|
173
176
|
seq.size.should == 192
|
174
|
-
aaseq = @gff.assembleAA(@contigsequence,component.start,[
|
177
|
+
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds5],:phase=>true)
|
175
178
|
# note it should handle the frame shift and direction!
|
176
179
|
# >EMBOSS_001_4
|
177
180
|
# RKKFESPQQPPSTLENPFSARGAIRRQSLRDRSGMSLARLEAQREGSDLIRSYNSKEDLSSNTA
|
@@ -190,9 +193,9 @@ describe GFFdb, "Assemble CDS" do
|
|
190
193
|
cds2.start.should == 27981
|
191
194
|
cds2.frame.should == 1
|
192
195
|
cds2.strand.should == '-'
|
193
|
-
seq = @gff.assemble(@contigsequence,component.start,[cds2],:complement=>true)
|
194
|
-
seq.should == "
|
195
|
-
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds2])
|
196
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds2],:reverse=>false,:complement=>true)
|
197
|
+
seq.should == "CGCTGTATTAGAAGAAAGGTCTTCTTTACTATTATAACTTCTAATAAGGTCAGAACCCTCTCTTTGTGCTTCCAAACGAGCTAATGACATTCCACTACGATCTCGCAATGATTGTCGTCTAATTGCACCTCTAGCTGAGAAAGGATTTTCTAATGTTGAAGGTGGTTGTTGAGGAGATTCAAACTTTTTTCT"
|
198
|
+
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds2],:phase=>true)
|
196
199
|
# note it should handle the frame shift and direction!
|
197
200
|
# >27981..28173_4 RKKFESPQQPPSTLENPFSARGAIRRQSLRDRSGMSLARLEAQREGSDLIRSYNSKEDLSSNTA
|
198
201
|
aaseq.should == "RKKFESPQQPPSTLENPFSARGAIRRQSLRDRSGMSLARLEAQREGSDLIRSYNSKEDLSSNTA"
|
@@ -222,17 +225,17 @@ describe GFFdb, "Assemble CDS" do
|
|
222
225
|
cds2.strand.should == '-'
|
223
226
|
seq = @gff.assemble(@contigsequence,component.start,[cds2], :raw=>true)
|
224
227
|
seq.should == "ATAAATTTCCCTTTCTCCAGAAAAACTTACAAAAGTAGATTTATCAACAGAATTTCTTTGATCTAAAGGTAATCCTCTTTGATGTAAAATTTTCATATCATTTAACATTTCCCTTTCTGGTTGTTGTCTTCTTTCATCAATCATTTCTTGTGTAATTCCTCTAGCAGCCATTTCAGATTCAATAAGGTCAAGGGTTTGTTCATCATCACAAATATCATAAGGCATATTACCATCTGCATTTACTGCTAGTAAATCTGCGTTG"
|
225
|
-
seq = @gff.assemble(@contigsequence,component.start,[cds2], :
|
228
|
+
seq = @gff.assemble(@contigsequence,component.start,[cds2], :phase=>true)
|
226
229
|
seq.should == "AACGCAGATTTACTAGCAGTAAATGCAGATGGTAATATGCCTTATGATATTTGTGATGATGAACAAACCCTTGACCTTATTGAATCTGAAATGGCTGCTAGAGGAATTACACAAGAAATGATTGATGAAAGAAGACAACAACCAGAAAGGGAAATGTTAAATGATATGAAAATTTTACATCAAAGAGGATTACCTTTAGATCAAAGAAATTCTGTTGATAAATCTACTTTTGTAAGTTTTTCTGGAGAAAGGGAAATTTAT"
|
227
230
|
# cds1.frame = 1
|
228
|
-
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds2])
|
231
|
+
aaseq = @gff.assembleAA(@contigsequence,component.start,[cds2],:phase=>true)
|
229
232
|
# note it should handle the frame shift and direction!
|
230
233
|
aaseq.should == "NADLLAVNADGNMPYDICDDEQTLDLIESEMAARGITQEMIDERRQQPEREMLNDMKILHQRGLPLDQRNSVDKSTFVSFSGEREIY"
|
231
234
|
end
|
232
235
|
it "should assemble the protein sequence for MhA1_Contig1133.frz3.gene11" do
|
233
236
|
recs = @cdslist['cds:MhA1_Contig1133.frz3.gene11']
|
234
237
|
component = @componentlist['cds:MhA1_Contig1133.frz3.gene11']
|
235
|
-
seq = @gff.assemble(@contigsequence,component.start,recs, :
|
238
|
+
seq = @gff.assemble(@contigsequence,component.start,recs, :reverse=>true, :complement=>true)
|
236
239
|
seq.should == "ATGGACCATCATGCATTGGTGGAGGAATTACCAGAAATTGAAAAATTAACTCCTCAAGAACGTATTGCATTAGCTAGAGAACGCCGTGCTGAACAACTTCGACAGAATGCTGCACGGGAGGCTCAATTGCCAATGCCTGCACAGCGCCGGCCTCGTCTTCGATTTACACCAGATGTTGCTTTACTTGAGGCAACATGTGCCATTGACAATAATGAAAGAATTGTTCGTCTTCTGCTTAGGTACGGAGCTTGTGTTAATGCCAAAGACACTGAACTTTGGACACCATTGCACGCAGCTGCATGTTGTGCTTATATTGATATTGTTCGATTGCTTATTGCACACAACGCAGATTTACTAGCAGTAAATGCAGATGGTAATATGCCTTATGATATTTGTGATGATGAACAAACCCTTGACCTTATTGAATCTGAAATGGCTGCTAGAGGAATTACACAAGAAATGATTGATGAAAGAAGACAACAACCAGAAAGGGAAATGTTAAATGATATGAAAATTTTACATCAAAGAGGATTACCTTTAGATCAAAGAAATTCTGTTGATAAATCTACTTTTGTAAGTTTTTCTGGAGAAAGGGAAATTTATTTACATATAGCAGCAGCTAATGGTTATTATGATGTTGCTGCTTTCCTTCTTCGTTGTAATGTTTCTCCAGCATTGAGAGATATAGATTTGTGGCAACCAATTCATGCAGCTGCTTCTTGGAATCAACCAGACTTAATCGAGCTTTTATGCGAATATGGGGCTGATATAAATGCAAAAACTGGAGCTGGGGAAAGCCCTTTAGAATTAACTGAAGATGAACCAACCCAACAAGTAATTAGAACAATCGCTCAGACAGAAGCAAGGAGACGGCGTGGTCCAGGTGGTGGTTACTTTGGTGTTCGTGATTCTCGACGACAAAGCCGAAAAAGAAAAAAGTTTGAATCTCCTCAACAACCACCTTCAACATTAGAAAATCCTTTCTCAGCTAGAGGTGCAATTAGACGACAATCATTGCGAGATCGTAGTGGAATGTCATTAGCTCGTTTGGAAGCACAAAGAGAGGGTTCTGACCTTATTAGAAGTTATAATAGTAAAGAAGACCTTTCTTCTAATACAGCGGATGATTCTTTAAATGTTGGAAGTTCTTCATATCTCAACAATCCAACAGCCTCGGCTAGTGCTTCCTCTTCAGCATTACACGGAACTCCACATCAACAACAACGTCGTGAATCTCCACCTAAACGTGCATTAATGGCTAGAAGTGCTTCTCATCAAAAACAAAAACAACAAATGTCTCCAGATGAATGGCTGAAAAAATTAGAAGCAGATTCTGCAGGTTTTCGAGATAATGATGGAGAAGATGGTGAATTACAATCTGAACTTAAAGGAGGACAAAGAATGAAGAGTGGTGGTGGTGGAGGAGCGAGAGGTCAGCAAGAAATGAATGGTGGTCCAACAGCAACATTTGGTGGAGCTTCAAAACAACAATTAGCAATGGGCTCTGGACCCAATAGACGGCGCAAACAAGGATGTTGCTCTGTTTTGTGA"
|
237
240
|
aaseq = @gff.assembleAA(@contigsequence,component.start,recs)
|
238
241
|
aaseq.should == "MDHHALVEELPEIEKLTPQERIALARERRAEQLRQNAAREAQLPMPAQRRPRLRFTPDVALLEATCAIDNNERIVRLLLRYGACVNAKDTELWTPLHAAACCAYIDIVRLLIAHNADLLAVNADGNMPYDICDDEQTLDLIESEMAARGITQEMIDERRQQPEREMLNDMKILHQRGLPLDQRNSVDKSTFVSFSGEREIYLHIAAANGYYDVAAFLLRCNVSPALRDIDLWQPIHAAASWNQPDLIELLCEYGADINAKTGAGESPLELTEDEPTQQVIRTIAQTEARRRRGPGGGYFGVRDSRRQSRKRKKFESPQQPPSTLENPFSARGAIRRQSLRDRSGMSLARLEAQREGSDLIRSYNSKEDLSSNTADDSLNVGSSSYLNNPTASASASSSALHGTPHQQQRRESPPKRALMARSASHQKQKQQMSPDEWLKKLEADSAGFRDNDGEDGELQSELKGGQRMKSGGGGGARGQQEMNGGPTATFGGASKQQLAMGSGPNRRRKQGCCSVL*"
|
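Note on the expectations in this spec: a single CDS in phase 1 yields 244 bases without phase correction and 243 with `:phase => true`, and minus-strand records come back as the reverse complement of the contig slice. The sketch below walks through the same steps on plain strings, with reverse and complement left at their defaults. `assemble_sketch` and its hash-based sections are hypothetical stand-ins for the gem's record objects. Complementing with `String#tr` replaces the `Bio::Sequence::NA` call in the diff, and trimming from the 3' end is an assumption, since the trim body lies outside the hunk.

```ruby
# Simplified stand-in for the assembly steps. Sections are hypothetical hashes
# with 1-based :begin/:end coordinates and an optional :phase.
def assemble_sketch(sequence, sections, strand, opts = {})
  minus = (strand == '-')
  ordered = sections.sort_by { |sec| sec[:begin] }
  ordered = ordered.reverse if minus          # read the ORF in one direction

  parts = ordered.map do |sec|
    s = sequence[(sec[:begin] - 1)..(sec[:end] - 1)]
    s = s.reverse if minus
    # Drop the leading phase bases only when phase handling is requested.
    s = s[sec[:phase]..-1] if opts[:phase] && sec[:phase]
    s
  end

  seq = parts.join
  # Each part is already reversed, so a plain complement completes the
  # reverse complement on the minus strand.
  seq = seq.tr('ACGTacgt', 'TGCAtgca').upcase if minus
  # Assumption: trimming removes the surplus bases at the end.
  seq = seq[0, seq.size - seq.size % 3] if opts[:trim]
  seq
end

contig = 'ATGAAACCCGGGTAG'
exons  = [{ :begin => 1, :end => 6 }, { :begin => 10, :end => 15 }]
plus   = assemble_sketch(contig, exons, '+')
minus  = assemble_sketch(contig, exons, '-')
```

On the plus strand the two sections are simply concatenated. On the minus strand the result equals the reverse complement of the plus-strand assembly, which is the relationship the minus-strand expectations above rely on.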
data/spec/gff3_fileiterator_spec.rb
CHANGED
@@ -1,12 +1,12 @@
|
|
1
1
|
# RSpec for BioRuby-GFF3-Plugin. Run with something like:
|
2
2
|
#
|
3
|
-
#
|
3
|
+
# rspec -I ../bioruby/lib/ spec/gff3_fileiterator_spec.rb
|
4
4
|
#
|
5
5
|
# Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
6
6
|
#
|
7
7
|
$: << "../lib"
|
8
8
|
|
9
|
-
require 'bio
|
9
|
+
require 'bio-gff3'
|
10
10
|
|
11
11
|
TEST1='test/data/gff/test.gff3'
|
12
12
|
TEST2='test/data/gff/standard.gff3'
|
data/spec/gffdb_spec.rb
CHANGED
@@ -1,13 +1,13 @@
|
|
1
1
|
# RSpec for BioRuby-GFF3-Plugin. Run with something like:
|
2
2
|
#
|
3
|
-
#
|
3
|
+
# rspec -I ../bioruby/lib/ spec/gffdb_spec.rb
|
4
4
|
#
|
5
5
|
# Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
6
6
|
#
|
7
7
|
$: << "../lib"
|
8
8
|
|
9
9
|
|
10
|
-
require 'bio
|
10
|
+
require 'bio-gff3'
|
11
11
|
|
12
12
|
include Bio::GFFbrowser
|
13
13
|
|
data/test/data/gff/test.gff3
CHANGED
@@ -64,11 +64,11 @@ AATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT
|
|
64
64
|
GCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC
|
65
65
|
CCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC
|
66
66
|
>test02
|
67
|
-
|
67
|
+
ACGACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA
|
68
68
|
GTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA
|
69
69
|
CCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGAT
|
70
70
|
AATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT
|
71
71
|
GCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC
|
72
|
-
|
72
|
+
CCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGACAAAAAA
|
73
73
|
|
74
74
|
|
metadata
CHANGED
@@ -4,9 +4,9 @@ version: !ruby/object:Gem::Version
|
|
4
4
|
prerelease: false
|
5
5
|
segments:
|
6
6
|
- 0
|
7
|
-
-
|
7
|
+
- 8
|
8
8
|
- 0
|
9
|
-
version: 0.
|
9
|
+
version: 0.8.0
|
10
10
|
platform: ruby
|
11
11
|
authors:
|
12
12
|
- Pjotr Prins
|
@@ -14,7 +14,7 @@ autorequire:
|
|
14
14
|
bindir: bin
|
15
15
|
cert_chain: []
|
16
16
|
|
17
|
-
date: 2010-12-
|
17
|
+
date: 2010-12-31 00:00:00 +01:00
|
18
18
|
default_executable: gff3-fetch
|
19
19
|
dependencies:
|
20
20
|
- !ruby/object:Gem::Dependency
|
@@ -88,6 +88,19 @@ dependencies:
|
|
88
88
|
type: :development
|
89
89
|
prerelease: false
|
90
90
|
version_requirements: *id005
|
91
|
+
- !ruby/object:Gem::Dependency
|
92
|
+
name: rspec
|
93
|
+
requirement: &id006 !ruby/object:Gem::Requirement
|
94
|
+
none: false
|
95
|
+
requirements:
|
96
|
+
- - ">="
|
97
|
+
- !ruby/object:Gem::Version
|
98
|
+
segments:
|
99
|
+
- 0
|
100
|
+
version: "0"
|
101
|
+
type: :development
|
102
|
+
prerelease: false
|
103
|
+
version_requirements: *id006
|
91
104
|
description: |
|
92
105
|
GFF3 (genome browser) information and digest mRNA and CDS sequences.
|
93
106
|
Options for low memory use and caching of records.
|
@@ -100,13 +113,11 @@ extensions: []
|
|
100
113
|
|
101
114
|
extra_rdoc_files:
|
102
115
|
- LICENSE.txt
|
103
|
-
- README
|
104
116
|
- README.rdoc
|
105
117
|
files:
|
106
118
|
- Gemfile
|
107
119
|
- Gemfile.lock
|
108
120
|
- LICENSE.txt
|
109
|
-
- README
|
110
121
|
- README.rdoc
|
111
122
|
- Rakefile
|
112
123
|
- VERSION
|
@@ -151,7 +162,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
151
162
|
requirements:
|
152
163
|
- - ">="
|
153
164
|
- !ruby/object:Gem::Version
|
154
|
-
hash: -
|
165
|
+
hash: -1033924243
|
155
166
|
segments:
|
156
167
|
- 0
|
157
168
|
version: "0"
|
data/README
DELETED
@@ -1,65 +0,0 @@
|
|
1
|
-
= GFF3 plugin for BioRuby, aimed at parsing big data =
|
2
|
-
|
3
|
-
Features:
|
4
|
-
|
5
|
-
# Take GFF (genome browser) information and digest mRNA and CDS sequences
|
6
|
-
# Options for low memory use and caching of records
|
7
|
-
# Support for external FASTA files
|
8
|
-
|
9
|
-
You can use this plugin in two ways. First as a standalone program, next as a
|
10
|
-
plugin library to BioRuby.
|
11
|
-
|
12
|
-
For example, fetch mRNA and CDS information from GFF3 files and output to FASTA:
|
13
|
-
|
14
|
-
./bin/gff3-fetch mrna test/data/gff/test.gff3
|
15
|
-
./bin/gff3-fetch cds test/data/gff/test.gff3
|
16
|
-
|
17
|
-
Or clone this repository and add the 'lib' dir to the Ruby search path and
|
18
|
-
|
19
|
-
require 'bio/db/gff/gffdb'
|
20
|
-
|
21
|
-
You can also run RSpec with something like
|
22
|
-
|
23
|
-
ruby -I ../bioruby/lib/ ~/.gems/bin/spec spec/gffdb_spec.rb
|
24
|
-
|
25
|
-
This implementation depends on BioRuby's basic GFF3 parser, with the possible
|
26
|
-
advantage that the plugin is faster and does not consume all memory. The Gff3
|
27
|
-
specs are based on the output of the Wormbase genome browser.
|
28
|
-
|
29
|
-
For a write-up see http://thebird.nl/bioruby/BioRuby_GFF3.html
|
30
|
-
|
31
|
-
Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
32
|
-
|
33
|
-
-------------------------------------------------------------------------------
|
34
|
-
|
35
|
-
Usage:
|
36
|
-
|
37
|
-
BioRuby GFF3 Plugin Copyright (C) 2010 Pjotr Prins <pjotr.prins@thebird.nl>
|
38
|
-
|
39
|
-
Fetch and assemble mRNAs, or CDS and print in FASTA format.
|
40
|
-
|
41
|
-
gff3-fetch [--no-cache] mRNA|CDS [filename.fa] filename.gff
|
42
|
-
|
43
|
-
Where:
|
44
|
-
|
45
|
-
--no-cache : do not load everything in memory
|
46
|
-
mRNA : assemble mRNA
|
47
|
-
CDS : assemble CDS
|
48
|
-
|
49
|
-
Multiple GFF3 files can be used. For external FASTA files, always the last
|
50
|
-
one before the GFF file is used.
|
51
|
-
|
52
|
-
Examples:
|
53
|
-
|
54
|
-
Find mRNA and CDS information from test.gff3 (which includes sequence information)
|
55
|
-
|
56
|
-
./bin/gff3-fetch mRNA test/data/gff/test.gff3
|
57
|
-
./bin/gff3-fetch CDS test/data/gff/test.gff3
|
58
|
-
|
59
|
-
Find mRNA from external FASTA file, without loading everythin in RAM
|
60
|
-
|
61
|
-
./bin/gff3-fetch --no-cache mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3
|
62
|
-
|
63
|
-
If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
|
64
|
-
|
65
|
-
|