bio-nexml 1.0.0 → 1.1.0
- data/.gitignore +3 -0
- data/.travis.yml +10 -0
- data/Gemfile +11 -0
- data/README.mkd +459 -0
- data/Rakefile +36 -0
- data/TODO.txt +6 -0
- data/VERSION +1 -0
- data/bio-nexml.gemspec +27 -0
- data/lib/bio/db/nexml.rb +6 -19
- data/lib/bio/db/nexml/mapper/framework.rb +6 -9
- data/test/data/nexml/test.xml +69 -0
- data/test/unit/bio/db/nexml/tc_factory.rb +119 -0
- data/test/unit/bio/db/nexml/tc_mapper.rb +78 -0
- data/test/unit/bio/db/nexml/tc_matrix.rb +551 -0
- data/test/unit/bio/db/nexml/tc_parser.rb +21 -0
- data/test/unit/bio/db/nexml/tc_taxa.rb +118 -0
- data/test/unit/bio/db/nexml/tc_trees.rb +370 -0
- data/test/unit/bio/db/nexml/tc_writer.rb +633 -0
- metadata +61 -73
- data/README.rdoc +0 -53
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/README.mkd
ADDED
@@ -0,0 +1,459 @@
|
|
1
|
+
[![Build status](https://secure.travis-ci.org/nexml/bio-nexml.png)](http://travis-ci.org/#!/nexml/bio-nexml)
|
2
|
+
|
3
|
+
bio-nexml is listed at http://biogems.info/
|
4
|
+
|
5
|
+
# bio-nexml
|
6
|
+
|
7
|
+
NeXML is a file format for phylogenetic data. It is inspired by the modular
|
8
|
+
architecture of the commonly-used NEXUS file format (hence the name) in that
|
9
|
+
a NeXML instance document can contain:
|
10
|
+
* sets of Operational Taxonomic Units (OTUs), i.e. the tips in phylogenetic
|
11
|
+
trees, and the entities on which comparative observations are made. Often these are
|
12
|
+
species ("taxa").
|
13
|
+
* sets of phylogenetic trees (or reticulate trees, i.e. networks)
|
14
|
+
* sets of comparative data, i.e. molecular sequences, morphological categorical
|
15
|
+
data, continuous data, and other types.
|
16
|
+
|
17
|
+
The elements in a NeXML document can be annotated using RDFa
|
18
|
+
(http://en.wikipedia.org/wiki/RDFa), which means that every object that can
|
19
|
+
be parsed out of a NeXML document must be an object that, in turn, can be
|
20
|
+
annotated with predicates (and their namespaces) and other objects (with,
|
21
|
+
perhaps, their own namespaces). The advantage over previous file formats is
|
22
|
+
that we can retain all metadata for all objects within one file, regardless
|
23
|
+
of where the metadata come from.
|
24
|
+
|
25
|
+
NeXML can be transformed to RDF using an XSL stylesheet. As such, NeXML forms
|
26
|
+
an intermediate format between traditional flat file formats (with predictable
|
27
|
+
structure but no semantics) and RDF (with loose structure, but lots of
|
28
|
+
semantics) that is easy to work with, yet ready for the Semantic Web.
|
29
|
+
|
30
|
+
To learn more, visit http://www.nexml.org
|
31
|
+
|
32
|
+
## Parsing
|
33
|
+
Currently, all parsing is done up front (i.e. there is no streaming); this is likely to change later. To parse a NeXML file:
|
34
|
+
|
35
|
+
```ruby
|
36
|
+
doc = Bio::NeXML::Parser.new( "trees.xml" )
|
37
|
+
nexml = doc.parse
|
38
|
+
nexml.class #Bio::NeXML::Nexml
|
39
|
+
```
|
40
|
+
|
41
|
+
## Serializing
|
42
|
+
The `Bio::NeXML::Writer` class provides a wrapper over libxml-ruby for creating NeXML documents. It defines a set of `serialize_*` instance methods, each of which takes the appropriate object and returns its NeXML representation as an `XML::Node` object. Call `to_s` on the return value to get the raw NeXML string.
|
43
|
+
|
44
|
+
NeXML defines three top-level containers, `otus`, `trees`, and `characters`, which are the parents of all other NeXML elements. In effect, a valid NeXML document has only three types of immediate children. A typical workflow is therefore to create `Bio::NeXML::Otus`, `Bio::NeXML::Trees`, and `Bio::NeXML::Characters` objects and write them to a NeXML file.
|
45
|
+
|
46
|
+
```ruby
|
47
|
+
# Parse a test file. This will give us Bio::NeXML::Otus,
|
48
|
+
# Bio::NeXML::Trees, and Bio::NeXML::Characters objects.
|
49
|
+
doc1 = Bio::NeXML::Parser.new 'test.xml'
|
50
|
+
nexml = doc1.parse
|
51
|
+
doc1.close
|
52
|
+
|
53
|
+
# Create a Writer object,
|
54
|
+
writer = Bio::NeXML::Writer.new
|
55
|
+
|
56
|
+
# add otus, trees and characters to it,
|
57
|
+
writer << nexml.otus
|
58
|
+
writer << nexml.trees
|
59
|
+
writer << nexml.characters
|
60
|
+
|
61
|
+
# save it.
|
62
|
+
writer.save 'sample.xml'
|
63
|
+
```
|
64
|
+
|
65
|
+
`Bio::NeXML::Writer` internally calls the `serialize_*` methods at the lowest level. If need be, they can be called directly to obtain the raw NeXML representation of any NeXML element.
|
66
|
+
|
67
|
+
``` ruby
|
68
|
+
# Create an otus object with a child otu element
|
69
|
+
taxa1 = Bio::NeXML::Otus.new 'taxa1', 'A taxa block'
|
70
|
+
o1 = Bio::NeXML::Otu.new 'o1', 'A taxon'
|
71
|
+
taxa1 << o1
|
72
|
+
|
73
|
+
# Obtain the raw NeXML representation of the otus object created
|
74
|
+
writer = Bio::NeXML::Writer.new
|
75
|
+
writer.serialize_otus( taxa1 ).to_s
|
76
|
+
# => "<otus label=\"A taxa block\" id=\"taxa1\">\n <otu label=\"A taxon\" id=\"o1\"/>\n</otus>"
|
77
|
+
```
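
To make the shape of a `serialize_*` call more concrete, here is a minimal, library-free sketch of the idea: turning an otus-like object and its children into an XML string. `OtuSketch`, `OtusSketch`, `escape_attr`, and `serialize_otus_sketch` are hypothetical names invented for this illustration. They are not part of bio-nexml, whose writer builds `XML::Node` objects through libxml-ruby rather than concatenating strings.

```ruby
# Hypothetical stand-ins for Bio::NeXML::Otu and Bio::NeXML::Otus.
OtuSketch  = Struct.new(:id, :label)
OtusSketch = Struct.new(:id, :label, :otus)

XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape the characters that are unsafe inside an XML attribute value.
def escape_attr(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Build an <otus> element with one self-closing <otu> child per member.
def serialize_otus_sketch(otus)
  children = otus.otus.map do |otu|
    %(  <otu label="#{escape_attr(otu.label)}" id="#{escape_attr(otu.id)}"/>)
  end
  open_tag = %(<otus label="#{escape_attr(otus.label)}" id="#{escape_attr(otus.id)}">)
  [open_tag, *children, "</otus>"].join("\n")
end

taxa = OtusSketch.new("taxa1", "A taxa block", [OtuSketch.new("o1", "A taxon")])
puts serialize_otus_sketch(taxa)
```

A real serializer would also have to handle namespaces and annotations, which this sketch deliberately ignores.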
|
78
|
+
|
79
|
+
The serializer's unit tests contain many more such use cases.
|
80
|
+
|
81
|
+
## Nexml
|
82
|
+
|
83
|
+
``` ruby
|
84
|
+
#get a hash of otus objects indexed with 'id'
|
85
|
+
nexml.otus_set
|
86
|
+
|
87
|
+
#get an array of otus objects
|
88
|
+
nexml.otus
|
89
|
+
|
90
|
+
#get an otus by id
|
91
|
+
taxa1 = nexml.get_otus_by_id "taxa1"
|
92
|
+
|
93
|
+
#iterate over each otus object
|
94
|
+
nexml.each_otus do |taxa|
|
95
|
+
puts taxa.id
|
96
|
+
puts taxa.label
|
97
|
+
end
|
98
|
+
|
99
|
+
#trees
|
100
|
+
nexml.trees_set #return a hash of trees objects indexed with 'id'
|
101
|
+
nexml.trees #return an array of trees objects.
|
102
|
+
|
103
|
+
#iterate over each trees object
|
104
|
+
nexml.each_trees do |trees|
|
105
|
+
puts trees.id
|
106
|
+
puts trees.label
|
107
|
+
end
|
108
|
+
|
109
|
+
#find a trees by id
|
110
|
+
trees1 = nexml.get_trees_by_id 'trees1'
|
111
|
+
|
112
|
+
# characters
|
113
|
+
nexml.characters_set #return a hash of characters objects indexed with 'id'
|
114
|
+
nexml.characters #return an array of characters objects
|
115
|
+
|
116
|
+
#iterate over each characters object
|
117
|
+
nexml.each_characters do |ch|
|
118
|
+
puts ch.id
|
119
|
+
puts ch.label
|
120
|
+
end
|
121
|
+
|
122
|
+
#find a characters object by id
|
123
|
+
characters = nexml.get_characters_by_id 'chars1'
|
124
|
+
```
|
125
|
+
|
126
|
+
## Otus
|
127
|
+
|
128
|
+
Taxa blocks and taxa are stored internally in Ruby hashes for faster id-based lookup.
|
129
|
+
Consider [this](https://www.nescent.org/wg_phyloinformatics/NeXML_Elements#Example) NeXML
|
130
|
+
snippet
|
131
|
+
|
132
|
+
``` ruby
|
133
|
+
#get the id of otus
|
134
|
+
taxa1.id # "taxa1"
|
135
|
+
|
136
|
+
#get the label of otus
|
137
|
+
taxa1.label # "Primary taxa block"
|
138
|
+
|
139
|
+
#get a hash of child otu objects indexed with id
|
140
|
+
taxa1.otu_set
|
141
|
+
|
142
|
+
#get an array of child otu objects
|
143
|
+
taxa1.otus
|
144
|
+
|
145
|
+
#get an otu object by id
|
146
|
+
#get_otu_by_id is an alias of []
|
147
|
+
t1 = taxa1[ 't1' ]
|
148
|
+
|
149
|
+
#add an otu object to otus
|
150
|
+
taxa1.add_otu( otu_object )
|
151
|
+
#to add more than one otu object at a time use << or otus= method
|
152
|
+
taxa1 << [otu_object1, otu_object2]
|
153
|
+
taxa1.otus = otu_object1, otu_object2
|
154
|
+
|
155
|
+
#or iterate over each otu object
|
156
|
+
#each_otu is an alias for each
|
157
|
+
taxa1.each do |taxon|
|
158
|
+
puts taxon.id
|
159
|
+
puts taxon.label
|
160
|
+
end
|
161
|
+
|
162
|
+
#check if an otu with given id belongs to an otus or not
|
163
|
+
#include? and has? are aliases for has_otu?
|
164
|
+
taxa1.has_otu? 't2' # => true
|
165
|
+
taxa1.has? 't8' # => false
|
166
|
+
|
167
|
+
#an otus object is enumerable
|
168
|
+
taxa1.map &:id # => array of otu ids
|
169
|
+
taxa1.select {|t| t.class == "Lemurs" } #maybe in future
|
170
|
+
```
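
The combination described above, hash-backed storage for id-based lookup plus `Enumerable` behaviour, can be sketched in plain Ruby. `IdIndexedSet` and `Taxon` are hypothetical names made up for this example. This is a sketch of the general pattern, assuming each stored object responds to `id`, not the actual bio-nexml implementation.

```ruby
# Hypothetical container; not the bio-nexml implementation.
class IdIndexedSet
  include Enumerable

  def initialize
    @members = {} # id => object; Ruby hashes preserve insertion order
  end

  # Add one object or an array of objects, keyed by their id.
  def <<(objects)
    [objects].flatten.each { |object| @members[object.id] = object }
    self
  end

  # Hash lookup by id; returns nil when the id is unknown.
  def [](id)
    @members[id]
  end

  def has?(id)
    @members.key?(id)
  end

  # Enumerable only needs #each; map, select, etc. come for free.
  def each(&block)
    @members.each_value(&block)
  end
end

Taxon = Struct.new(:id, :label)

taxa = IdIndexedSet.new
taxa << [Taxon.new("t1", "Homo sapiens"), Taxon.new("t2", "Pan paniscus")]

puts taxa["t1"].label        # Homo sapiens
puts taxa.has?("t8")         # false
puts taxa.map(&:id).inspect  # ["t1", "t2"]
```

Keeping the hash private and exposing only `each` means callers get both fast lookup by id and the full `Enumerable` API without depending on the storage layout.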
|
171
|
+
|
172
|
+
### Otu
|
173
|
+
|
174
|
+
``` ruby
|
175
|
+
#get an otu's id
|
176
|
+
t1.id # => "t1"
|
177
|
+
|
178
|
+
#get an otu's label
|
179
|
+
t1.label # => "Homo sapiens"
|
180
|
+
```
|
181
|
+
|
182
|
+
## Trees
|
183
|
+
Trees blocks, trees, and networks are stored internally in Ruby hashes for faster id-based lookup.
|
184
|
+
|
185
|
+
``` ruby
|
186
|
+
trees1.class #Bio::NeXML::Trees
|
187
|
+
|
188
|
+
#get the taxa block to which the trees object is linked
|
189
|
+
trees1.otus #returns an otus object
|
190
|
+
```
|
191
|
+
|
192
|
+
### Tree
|
193
|
+
|
194
|
+
``` ruby
|
195
|
+
trees1.tree_set #return a hash of tree objects indexed with 'id'
|
196
|
+
trees1.trees #return an array of tree objects
|
197
|
+
|
198
|
+
#iterate over each tree object
|
199
|
+
trees1.each_tree do |t|
|
200
|
+
puts t.id
|
201
|
+
puts t.label
|
202
|
+
end
|
203
|
+
|
204
|
+
#get a tree object by its id
|
205
|
+
tree1 = trees1[ 'tree1' ]
|
206
|
+
#or, with a conventional method call
|
207
|
+
tree1 = trees1.get_tree_by_id 'tree1'
|
208
|
+
#or, from a nexml object
|
209
|
+
tree1 = nexml.get_tree_by_id 'tree1'
|
210
|
+
|
211
|
+
tree1.class #Bio::NeXML::IntTree or Bio::NeXML::FloatTree
|
212
|
+
|
213
|
+
#check if a tree belongs to a trees or not
|
214
|
+
#pass it a tree id
|
215
|
+
trees1.has_tree? 'tree1' #return true or false
|
216
|
+
|
217
|
+
#get the number of trees
|
218
|
+
trees1.number_of_trees
|
219
|
+
```
|
220
|
+
|
221
|
+
### Network
|
222
|
+
|
223
|
+
``` ruby
|
224
|
+
trees1.network_set #return a hash of network objects indexed with 'id'
|
225
|
+
trees1.networks #return an array of network objects
|
226
|
+
|
227
|
+
#iterate over each network object
|
228
|
+
trees1.each_network do |n|
|
229
|
+
puts n.id
|
230
|
+
puts n.label
|
231
|
+
end
|
232
|
+
|
233
|
+
#get a network object with its id
|
234
|
+
network1 = trees1[ 'network1' ]
|
235
|
+
#or, with a conventional method call
|
236
|
+
network1 = trees1.get_network_by_id 'network1'
|
237
|
+
#or, from a nexml object
|
238
|
+
network1 = nexml.get_tree_by_id 'network1'
|
239
|
+
|
240
|
+
network1.class #Bio::NeXML::IntTree or Bio::NeXML::FloatTree
|
241
|
+
|
242
|
+
#check if a network belongs to a trees or not
|
243
|
+
#pass it a network id
|
244
|
+
trees1.has_network? 'network1' #return true or false
|
245
|
+
|
246
|
+
#get the number of networks
|
247
|
+
trees1.number_of_networks
|
248
|
+
```
|
249
|
+
|
250
|
+
### Tree and Network
|
251
|
+
|
252
|
+
``` ruby
|
253
|
+
#iterate over both trees and networks
|
254
|
+
trees1.each do |g|
|
255
|
+
puts g.class
|
256
|
+
end
|
257
|
+
|
258
|
+
#find if a tree or a network belongs to a trees or not
|
259
|
+
#include? is an alias for has?
|
260
|
+
trees1.has? 'tree1' #return true or false
|
261
|
+
|
262
|
+
#total number of trees and networks
|
263
|
+
trees1.number_of_graphs
|
264
|
+
```
|
265
|
+
|
266
|
+
All the methods available from the [Bio::Tree](http://bioruby.org/rdoc/classes/Bio/Tree.html#M001688)
|
267
|
+
class can be called on a tree object.
|
268
|
+
|
269
|
+
``` ruby
|
270
|
+
node1 = tree1.get_node_by_name "n3" #note name is same as id
|
271
|
+
tree1.parents node1
|
272
|
+
```
|
273
|
+
|
274
|
+
A trees object is enumerable:
|
275
|
+
|
276
|
+
``` ruby
|
277
|
+
trees1.map &:id
|
278
|
+
```
|
279
|
+
|
280
|
+
## Characters
|
281
|
+
|
282
|
+
``` ruby
|
283
|
+
puts characters.class
|
284
|
+
|
285
|
+
#get the taxa block to which the characters object is linked
|
286
|
+
characters.otus #returns an otus object
|
287
|
+
|
288
|
+
#get the child format element
|
289
|
+
format = characters.format
|
290
|
+
|
291
|
+
puts format.class
|
292
|
+
|
293
|
+
#get the child matrix element
|
294
|
+
matrix = characters.matrix
|
295
|
+
|
296
|
+
puts matrix.class
|
297
|
+
```
|
298
|
+
|
299
|
+
### Format
|
300
|
+
|
301
|
+
``` ruby
|
302
|
+
format.states_set #return a hash of states objects indexed with 'id'
|
303
|
+
format.states #return an array of states objects
|
304
|
+
|
305
|
+
#iterate over each states object
|
306
|
+
format.each_states do |states|
|
307
|
+
puts states.id
|
308
|
+
puts states.label
|
309
|
+
end
|
310
|
+
|
311
|
+
#get a states object by id
|
312
|
+
states = format.get_states_by_id 'states1'
|
313
|
+
|
314
|
+
#check if the states object with 'id' belongs to format or not
|
315
|
+
format.has_states? 'states1'
|
316
|
+
|
317
|
+
format.char_set #return a hash of char objects indexed with 'id'
|
318
|
+
format.chars #return an array of char objects
|
319
|
+
|
320
|
+
#iterate over each char object
|
321
|
+
format.each_char do |char|
|
322
|
+
puts char.id
|
323
|
+
puts char.label
|
324
|
+
end
|
325
|
+
|
326
|
+
#get a char object by id
|
327
|
+
char = format.get_char_by_id 'char1'
|
328
|
+
|
329
|
+
#check if the char object with 'id' belongs to format or not
|
330
|
+
format.has_char? 'char1'
|
331
|
+
|
332
|
+
#get a states or a char object by id
|
333
|
+
states = format[ 'states1' ]
|
334
|
+
char = format[ 'char1' ]
|
335
|
+
|
336
|
+
#check if a states or a char object with 'id' belongs to format or not
|
337
|
+
format.has? 'states1'
|
338
|
+
format.has? 'char1'
|
339
|
+
|
340
|
+
#all objects, including char and states can be iterated over with each
|
341
|
+
format.each do |obj|
|
342
|
+
puts obj.class
|
343
|
+
end
|
344
|
+
|
345
|
+
#format is enumerable
|
346
|
+
format.map &:id
|
347
|
+
```
|
348
|
+
|
349
|
+
#### States
|
350
|
+
|
351
|
+
``` ruby
|
352
|
+
states.state_set #return a hash of state objects indexed with 'id'
|
353
|
+
states.states #return an array of state objects
|
354
|
+
|
355
|
+
#iterate over each state object
|
356
|
+
states.each_state do |state|
|
357
|
+
puts state.id
|
358
|
+
end
|
359
|
+
#or, use its alias each
|
360
|
+
|
361
|
+
#get a state object by id
|
362
|
+
state = states.get_state_by_id 'state1'
|
363
|
+
#or, use hash notation
|
364
|
+
state = states[ 'state1' ]
|
365
|
+
|
366
|
+
#check if a state belongs to states or not
|
367
|
+
states.has_state? 'state1'
|
368
|
+
#or, use its aliases has? and include?
|
369
|
+
```
|
370
|
+
|
371
|
+
##### State
|
372
|
+
|
373
|
+
``` ruby
|
374
|
+
#get the symbol associated with the state
|
375
|
+
state.symbol
|
376
|
+
|
377
|
+
#find if the state is ambiguous
|
378
|
+
state.ambiguous?
|
379
|
+
|
380
|
+
#find the kind of ambiguity
|
381
|
+
state.ambiguity
|
382
|
+
|
383
|
+
#find if it is an uncertain state set
|
384
|
+
state.uncertain?
|
385
|
+
|
386
|
+
#find if it is a polymorphic state set
|
387
|
+
state.polymorphic?
|
388
|
+
|
389
|
+
#get the members of a state set as an array
|
390
|
+
state.members
|
391
|
+
|
392
|
+
#or iterate over each member
|
393
|
+
state.each do |member|
|
394
|
+
puts member.class #same as self
|
395
|
+
puts member.id
|
396
|
+
end
|
397
|
+
|
398
|
+
#a state is Enumerable over its members
|
399
|
+
state.select{ |member| member.id == "rna5" }
|
400
|
+
```
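
The relationship between a plain state and an ambiguous state set can be illustrated with a small stand-alone model. `StateSketch` and its `ambiguity:` / `members:` keyword arguments are hypothetical and exist only for this example. The sketch assumes, as in the NeXML schema, that an ambiguous state is either an uncertain or a polymorphic set of member states.

```ruby
# Hypothetical model of a NeXML state; not the bio-nexml class.
class StateSketch
  include Enumerable

  attr_reader :id, :symbol, :ambiguity, :members

  # ambiguity is nil for a plain state, or :uncertain / :polymorphic
  # for a state set that groups other states.
  def initialize(id, symbol, ambiguity: nil, members: [])
    @id = id
    @symbol = symbol
    @ambiguity = ambiguity
    @members = members
  end

  def ambiguous?
    !@ambiguity.nil?
  end

  def uncertain?
    @ambiguity == :uncertain
  end

  def polymorphic?
    @ambiguity == :polymorphic
  end

  # Iterating a state walks its member states.
  def each(&block)
    @members.each(&block)
  end
end

a = StateSketch.new("s1", "A")
g = StateSketch.new("s2", "G")
r = StateSketch.new("s3", "R", ambiguity: :uncertain, members: [a, g])

puts r.ambiguous?             # true
puts r.map(&:symbol).inspect  # ["A", "G"]
puts a.ambiguous?             # false
```

Because members are themselves states, `member.class` inside the iteration is the same class as the enclosing state.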
|
401
|
+
|
402
|
+
#### Char
|
403
|
+
|
404
|
+
``` ruby
|
405
|
+
#get the id
|
406
|
+
char.id
|
407
|
+
|
408
|
+
#get the label
|
409
|
+
char.label
|
410
|
+
|
411
|
+
#get the states object the char is linked to
|
412
|
+
char.states
|
413
|
+
|
414
|
+
#get the codon position for DnaChar and RnaChar objects
|
415
|
+
char.codon
|
416
|
+
```
|
417
|
+
|
418
|
+
### Matrix
|
419
|
+
|
420
|
+
...
|
421
|
+
|
422
|
+
## Contributing to bio-nexml
|
423
|
+
|
424
|
+
* Check out the latest master to make sure the feature hasn't been implemented
|
425
|
+
or the bug hasn't been fixed yet
|
426
|
+
* Check out the issue tracker to make sure someone hasn't already requested it
|
427
|
+
and/or contributed it
|
428
|
+
* Fork the project
|
429
|
+
* Start a feature/bugfix branch
|
430
|
+
* Commit and push until you are happy with your contribution
|
431
|
+
* Make sure to add tests for it. This is important so I don't break it in a
|
432
|
+
future version unintentionally.
|
433
|
+
* Please try not to mess with the Rakefile, version, or history. If you want to
|
434
|
+
have your own version, or is otherwise necessary, that is fine, but please
|
435
|
+
isolate to its own commit so I can cherry-pick around it.
|
436
|
+
|
437
|
+
## Acknowledgements
|
438
|
+
|
439
|
+
The research leading to these results has received funding from the [European
|
440
|
+
Community's] Seventh Framework Programme ([FP7/2007-2013]) under grant agreement
|
441
|
+
n° [237046].
|
442
|
+
|
443
|
+
## Citing bio-nexml
|
444
|
+
|
445
|
+
If you use this software, please cite:
|
446
|
+
|
447
|
+
> [NeXML: rich, extensible, and verifiable representation of comparative data and metadata][1]
|
448
|
+
|
449
|
+
and
|
450
|
+
|
451
|
+
> [Biogem: an effective tool based approach for scaling up open source software development in bioinformatics][2]
|
452
|
+
|
453
|
+
## Copyright
|
454
|
+
|
455
|
+
Copyright (c) 2011 Rutger Vos and Anurag Priyam. See LICENSE.txt for further
|
456
|
+
details.
|
457
|
+
|
458
|
+
[1]: http://sysbio.oxfordjournals.org/content/early/2012/02/12/sysbio.sys025.short
|
459
|
+
[2]: http://dx.doi.org/10.1093/bioinformatics/bts080
|