bio-nexml 1.0.0 → 1.1.0
- data/.gitignore +3 -0
- data/.travis.yml +10 -0
- data/Gemfile +11 -0
- data/README.mkd +459 -0
- data/Rakefile +36 -0
- data/TODO.txt +6 -0
- data/VERSION +1 -0
- data/bio-nexml.gemspec +27 -0
- data/lib/bio/db/nexml.rb +6 -19
- data/lib/bio/db/nexml/mapper/framework.rb +6 -9
- data/test/data/nexml/test.xml +69 -0
- data/test/unit/bio/db/nexml/tc_factory.rb +119 -0
- data/test/unit/bio/db/nexml/tc_mapper.rb +78 -0
- data/test/unit/bio/db/nexml/tc_matrix.rb +551 -0
- data/test/unit/bio/db/nexml/tc_parser.rb +21 -0
- data/test/unit/bio/db/nexml/tc_taxa.rb +118 -0
- data/test/unit/bio/db/nexml/tc_trees.rb +370 -0
- data/test/unit/bio/db/nexml/tc_writer.rb +633 -0
- metadata +61 -73
- data/README.rdoc +0 -53
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/README.mkd
ADDED
@@ -0,0 +1,459 @@
|
|
1
|
+
[Build status](http://travis-ci.org/#!/nexml/bio-nexml)
|
2
|
+
|
3
|
+
bio-nexml is listed at http://biogems.info/
|
4
|
+
|
5
|
+
# bio-nexml
|
6
|
+
|
7
|
+
NeXML is a file format for phylogenetic data. It is inspired by the modular
|
8
|
+
architecture of the commonly-used NEXUS file format (hence the name) in that
|
9
|
+
a NeXML instance document can contain:
|
10
|
+
* sets of Operational Taxonomic Units (OTUs), i.e. the tips in phylogenetic
|
11
|
+
trees, and the units on which comparative observations are made. Often these are
|
12
|
+
species ("taxa").
|
13
|
+
* sets of phylogenetic trees (or reticulate trees, i.e. networks)
|
14
|
+
* sets of comparative data, i.e. molecular sequences, morphological categorical
|
15
|
+
data, continuous data, and other types.
|
16
|
+
|
17
|
+
The elements in a NeXML document can be annotated using RDFa
|
18
|
+
(http://en.wikipedia.org/wiki/RDFa), which means that every object that can
|
19
|
+
be parsed out of a NeXML document must be an object that, in turn, can be
|
20
|
+
annotated with predicates (and their namespaces) and other objects (with,
|
21
|
+
perhaps, their own namespaces). The advantage over previous file formats is
|
22
|
+
that we can retain all metadata for all objects within one file, regardless
|
23
|
+
of where the metadata come from.
|
24
|
+
|
25
|
+
NeXML can be transformed to RDF using an XSL stylesheet. As such, NeXML forms
|
26
|
+
an intermediate format between traditional flat file formats (with predictable
|
27
|
+
structure but no semantics) and RDF (with loose structure, but lots of
|
28
|
+
semantics) that is both easy to work with and ready for the Semantic Web.
|
29
|
+
|
30
|
+
To learn more, visit http://www.nexml.org
|
31
|
+
|
32
|
+
## Parsing
|
33
|
+
Currently, the whole document is parsed up front (i.e. there is no streaming). This is likely to change later. To parse a NeXML file:
|
34
|
+
|
35
|
+
```ruby
|
36
|
+
doc = Bio::NeXML::Parser.new( "trees.xml" )
|
37
|
+
nexml = doc.parse
|
38
|
+
nexml.class # => Bio::NeXML::Nexml
|
39
|
+
```
|
40
|
+
|
41
|
+
## Serializing
|
42
|
+
The `Bio::NeXML::Writer` class provides a wrapper over libxml-ruby to create any NeXML document. It defines a set of `serialize_*` instance methods, each of which takes the appropriate object and returns its NeXML representation as an `XML::Node` object. To get the raw NeXML string, call `to_s` on the return value.
|
43
|
+
|
44
|
+
NeXML defines three top-level containers, `otus`, `trees`, and `characters`, which are the parents of the other NeXML elements. In effect, a valid NeXML document has only three types of immediate children. A typical workflow is therefore to create `Bio::NeXML::Otus`, `Bio::NeXML::Trees`, and `Bio::NeXML::Characters` objects and write them to the NeXML file.
|
45
|
+
|
46
|
+
```ruby
|
47
|
+
# Parse a test file. This will give us Bio::NeXML::Otus,
|
48
|
+
# Bio::NeXML::Trees, and Bio::NeXML::Characters objects.
|
49
|
+
doc1 = Bio::NeXML::Parser.new 'test.xml'
|
50
|
+
nexml = doc1.parse
|
51
|
+
doc1.close
|
52
|
+
|
53
|
+
# Create a Writer object,
|
54
|
+
writer = Bio::NeXML::Writer.new
|
55
|
+
|
56
|
+
# add otus, trees and characters to it,
|
57
|
+
writer << nexml.otus
|
58
|
+
writer << nexml.trees
|
59
|
+
writer << nexml.characters
|
60
|
+
|
61
|
+
# save it.
|
62
|
+
writer.save 'sample.xml'
|
63
|
+
```
|
64
|
+
|
65
|
+
Internally, `Bio::NeXML::Writer` delegates to these `serialize_*` methods at the lowest level. If need be, they can be called directly to obtain the raw NeXML representation of any NeXML element.
|
66
|
+
|
67
|
+
``` ruby
|
68
|
+
# Create an otus object with a child otu element
|
69
|
+
taxa1 = Bio::NeXML::Otus.new 'taxa1', 'A taxa block'
|
70
|
+
o1 = Bio::NeXML::Otu.new 'o1', 'A taxon'
|
71
|
+
taxa1 << o1
|
72
|
+
|
73
|
+
# Obtain the raw NeXML representation of the otus object created
|
74
|
+
writer = Bio::NeXML::Writer.new
|
75
|
+
writer.serialize_otus( taxa1 ).to_s
|
76
|
+
# => "<otus label=\"A taxa block\" id=\"taxa1\">\n <otu label=\"A taxon\" id=\"o1\"/>\n</otus>"
|
77
|
+
```
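
To make the shape of the serialized output concrete, here is a minimal sketch in plain Ruby that builds the same kind of `<otus>` fragment by hand. It is not how bio-nexml works (the library builds `XML::Node` objects with libxml-ruby); `escape_attr` and `sketch_serialize_otus` are hypothetical helpers introduced only for illustration, and the attribute order is an arbitrary choice.

```ruby
# Hypothetical helpers for illustration only; bio-nexml itself
# builds XML::Node objects through libxml-ruby.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escape the characters that are unsafe inside a double-quoted XML attribute.
def escape_attr(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# otus is a Hash mapping otu id => otu label.
def sketch_serialize_otus(id, label, otus)
  children = otus.map do |otu_id, otu_label|
    %(  <otu id="#{escape_attr(otu_id)}" label="#{escape_attr(otu_label)}"/>)
  end
  opening = %(<otus id="#{escape_attr(id)}" label="#{escape_attr(label)}">)
  [opening, *children, '</otus>'].join("\n")
end

xml = sketch_serialize_otus('taxa1', 'A taxa block', { 'o1' => 'A taxon' })
puts xml
# <otus id="taxa1" label="A taxa block">
#   <otu id="o1" label="A taxon"/>
# </otus>
```

In real code, prefer the `serialize_*` methods, which handle namespaces and escaping through the XML library.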
|
78
|
+
|
79
|
+
The unit tests for the writer (`test/unit/bio/db/nexml/tc_writer.rb`) are filled with such use cases.
|
80
|
+
|
81
|
+
## Nexml
|
82
|
+
|
83
|
+
``` ruby
|
84
|
+
#get a hash of otus objects indexed with 'id'
|
85
|
+
nexml.otus_set
|
86
|
+
|
87
|
+
#get an array of otus objects
|
88
|
+
nexml.otus
|
89
|
+
|
90
|
+
#get an otus by id
|
91
|
+
taxa1 = nexml.get_otus_by_id "taxa1"
|
92
|
+
|
93
|
+
#iterate over each otus object
|
94
|
+
nexml.each_otus do |taxa|
|
95
|
+
puts taxa.id
|
96
|
+
puts taxa.label
|
97
|
+
end
|
98
|
+
|
99
|
+
#trees
|
100
|
+
nexml.trees_set #return a hash of trees objects indexed with 'id'
|
101
|
+
nexml.trees #return an array of trees objects.
|
102
|
+
|
103
|
+
#iterate over each trees object
|
104
|
+
nexml.each_trees do |trees|
|
105
|
+
puts trees.id
|
106
|
+
puts trees.label
|
107
|
+
end
|
108
|
+
|
109
|
+
#find a trees by id
|
110
|
+
trees1 = nexml.get_trees_by_id 'trees1'
|
111
|
+
|
112
|
+
# characters
|
113
|
+
nexml.characters_set #return a hash of characters objects indexed with 'id'
|
114
|
+
nexml.characters #return an array of characters objects
|
115
|
+
|
116
|
+
#iterate over each characters object
|
117
|
+
nexml.each_characters do |ch|
|
118
|
+
puts ch.id
|
119
|
+
puts ch.label
|
120
|
+
end
|
121
|
+
|
122
|
+
#find a characters object by id
|
123
|
+
characters = nexml.get_characters_by_id 'chars1'
|
124
|
+
```
|
125
|
+
|
126
|
+
## Otus
|
127
|
+
|
128
|
+
Taxa blocks and the taxa they contain are stored internally in Ruby hashes for fast 'id'-based lookup.
|
129
|
+
Consider [this](https://www.nescent.org/wg_phyloinformatics/NeXML_Elements#Example) NeXML
|
130
|
+
snippet
|
131
|
+
|
132
|
+
``` ruby
|
133
|
+
#get the id of otus
|
134
|
+
taxa1.id # "taxa1"
|
135
|
+
|
136
|
+
#get the label of otus
|
137
|
+
taxa1.label # "Primary taxa block"
|
138
|
+
|
139
|
+
#get a hash of child otu objects indexed with id
|
140
|
+
taxa1.otu_set
|
141
|
+
|
142
|
+
#get an array of child otu objects
|
143
|
+
taxa1.otus
|
144
|
+
|
145
|
+
#get an otu object by id
|
146
|
+
#get_otu_by_id is an alias of []
|
147
|
+
t1 = taxa1[ 't1' ]
|
148
|
+
|
149
|
+
#add an otu object to otus
|
150
|
+
taxa1.add_otu( otu_object )
|
151
|
+
#to add more than one otu object at a time use << or otus= method
|
152
|
+
taxa1 << [otu_object1, otu_object2]
|
153
|
+
taxa1.otus = otu_object1, otu_object2
|
154
|
+
|
155
|
+
#or iterate over each otu object
|
156
|
+
#each_otu is an alias for each
|
157
|
+
taxa1.each do |taxon|
|
158
|
+
puts taxon.id
|
159
|
+
puts taxon.label
|
160
|
+
end
|
161
|
+
|
162
|
+
#check if an otu with given id belongs to an otus or not
|
163
|
+
#include? and has? are aliases for has_otu?
|
164
|
+
taxa1.has_otu? 't2' # => true
|
165
|
+
taxa1.has? 't8' # => false
|
166
|
+
|
167
|
+
#an otus object is Enumerable
|
168
|
+
taxa1.map &:id # => array of otu ids
|
169
|
+
taxa1.select {|t| t.class == "Lemurs" } #maybe in future
|
170
|
+
```
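
The id-indexed storage described in this section can be pictured with a small, self-contained sketch. `SketchOtu` and `SketchOtus` are hypothetical names, not bio-nexml classes; the sketch only assumes the behaviour documented here (hash-backed lookup, `[]`, `<<`, `has?`, and `Enumerable`).

```ruby
# Hypothetical stand-ins for illustration only; not the bio-nexml classes.
SketchOtu = Struct.new(:id, :label)

class SketchOtus
  include Enumerable

  attr_reader :id, :label

  def initialize(id, label = nil)
    @id = id
    @label = label
    @otu_set = {} # id => otu; Ruby hashes preserve insertion order
  end

  # Accepts a single otu or an array of otus; returns self so calls chain.
  def <<(otus)
    list = otus.is_a?(Array) ? otus : [otus]
    list.each { |otu| @otu_set[otu.id] = otu }
    self
  end

  # Hash lookup by id, no scan over the children.
  def [](id)
    @otu_set[id]
  end

  def has?(id)
    @otu_set.key?(id)
  end

  # Enumerable (map, select, ...) is built on top of each.
  def each(&block)
    @otu_set.each_value(&block)
  end
end

taxa = SketchOtus.new('taxa1', 'A taxa block')
taxa << SketchOtu.new('t1', 'Homo sapiens')
taxa << [SketchOtu.new('t2', 'Pan paniscus'), SketchOtu.new('t3', 'Pan troglodytes')]

taxa['t1'].label # => "Homo sapiens"
taxa.has?('t8')  # => false
taxa.map(&:id)   # => ["t1", "t2", "t3"]
```

Keeping the children in a hash keyed by id is what makes `[]` and `has?` cheap, while `each` over the hash values keeps the container usable with every `Enumerable` method.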
|
171
|
+
|
172
|
+
### Otu
|
173
|
+
|
174
|
+
``` ruby
|
175
|
+
#get an otu's id
|
176
|
+
t1.id # => "t1"
|
177
|
+
|
178
|
+
#get an otu's label
|
179
|
+
t1.label # => "Homo sapiens"
|
180
|
+
```
|
181
|
+
|
182
|
+
## Trees
|
183
|
+
Trees blocks, and the trees and networks they contain, are stored internally in Ruby hashes for fast 'id'-based lookup.
|
184
|
+
|
185
|
+
``` ruby
|
186
|
+
trees1.class #Bio::NeXML::Trees
|
187
|
+
|
188
|
+
#get the taxa block to which the trees object is linked
|
189
|
+
trees1.otus #returns an otus object
|
190
|
+
```
|
191
|
+
|
192
|
+
### Tree
|
193
|
+
|
194
|
+
``` ruby
|
195
|
+
trees1.tree_set #return a hash of tree objects indexed with 'id'
|
196
|
+
trees1.trees #return an array of tree objects
|
197
|
+
|
198
|
+
#iterate over each tree object
|
199
|
+
trees1.each_tree do |t|
|
200
|
+
puts t.id
|
201
|
+
puts t.label
|
202
|
+
end
|
203
|
+
|
204
|
+
#get a tree object by its id
|
205
|
+
tree1 = trees1[ 'tree1' ]
|
206
|
+
#or, with a conventional method call
|
207
|
+
tree1 = trees1.get_tree_by_id 'tree1'
|
208
|
+
#or, from a nexml object
|
209
|
+
tree1 = nexml.get_tree_by_id 'tree1'
|
210
|
+
|
211
|
+
tree1.class #Bio::NeXML::IntTree or Bio::NeXML::FloatTree
|
212
|
+
|
213
|
+
#check if a tree belongs to a trees or not
|
214
|
+
#pass it a tree id
|
215
|
+
trees1.has_tree? 'tree1' #return true or false
|
216
|
+
|
217
|
+
#get the number of trees
|
218
|
+
trees1.number_of_trees
|
219
|
+
```
|
220
|
+
|
221
|
+
### Network
|
222
|
+
|
223
|
+
``` ruby
|
224
|
+
trees1.network_set #return a hash of network objects indexed with 'id'
|
225
|
+
trees1.networks #return an array of network objects
|
226
|
+
|
227
|
+
#iterate over each network object
|
228
|
+
trees1.each_network do |n|
|
229
|
+
puts n.id
|
230
|
+
puts n.label
|
231
|
+
end
|
232
|
+
|
233
|
+
#get a network object with its id
|
234
|
+
network1 = trees1[ 'network1' ]
|
235
|
+
#or, with a conventional method call
|
236
|
+
network1 = trees1.get_network_by_id 'network1'
|
237
|
+
#or, from a nexml object
|
238
|
+
network1 = nexml.get_tree_by_id 'network1'
|
239
|
+
|
240
|
+
network1.class #Bio::NeXML::IntNetwork or Bio::NeXML::FloatNetwork
|
241
|
+
|
242
|
+
#check if a network belongs to a trees or not
|
243
|
+
#pass it a network id
|
244
|
+
trees1.has_network? 'network1' #return true or false
|
245
|
+
|
246
|
+
#get the number of networks
|
247
|
+
trees1.number_of_networks
|
248
|
+
```
|
249
|
+
|
250
|
+
### Tree and Network
|
251
|
+
|
252
|
+
``` ruby
|
253
|
+
#iterate over both trees and networks
|
254
|
+
trees1.each do |g|
|
255
|
+
puts g.class
|
256
|
+
end
|
257
|
+
|
258
|
+
#find if a tree or a network belongs to a trees or not
|
259
|
+
#include? is an alias for has?
|
260
|
+
trees1.has? 'tree1' #return true or false
|
261
|
+
|
262
|
+
#total number of trees and networks
|
263
|
+
trees1.number_of_graphs
|
264
|
+
```
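
One way to picture a container that holds trees and networks separately yet iterates over both is the sketch below. `SketchGraph` and `SketchTrees` are hypothetical names used only for illustration; the method names mirror the ones documented in this section, but the implementation is an assumption, not the library's code.

```ruby
# Hypothetical stand-ins for illustration only; not the bio-nexml classes.
SketchGraph = Struct.new(:id, :kind)

class SketchTrees
  include Enumerable

  def initialize
    @tree_set = {}    # id => tree
    @network_set = {} # id => network
  end

  def add_tree(tree)
    @tree_set[tree.id] = tree
    self
  end

  def add_network(network)
    @network_set[network.id] = network
    self
  end

  # Look in the trees first, then in the networks.
  def [](id)
    @tree_set[id] || @network_set[id]
  end

  def has?(id)
    @tree_set.key?(id) || @network_set.key?(id)
  end
  alias include? has?

  # Yield every tree, then every network.
  def each(&block)
    return enum_for(:each) unless block
    @tree_set.each_value(&block)
    @network_set.each_value(&block)
    self
  end

  def number_of_graphs
    @tree_set.size + @network_set.size
  end
end

trees = SketchTrees.new
trees.add_tree(SketchGraph.new('tree1', :tree))
trees.add_network(SketchGraph.new('network1', :network))

trees.map(&:id)          # => ["tree1", "network1"]
trees.has?('network1')   # => true
trees.number_of_graphs   # => 2
```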
|
265
|
+
|
266
|
+
All the available methods from the [Bio::Tree](http://bioruby.org/rdoc/classes/Bio/Tree.html#M001688)
|
267
|
+
class can be called on a tree object.
|
268
|
+
|
269
|
+
``` ruby
|
270
|
+
node1 = tree1.get_node_by_name "n3" #note name is same as id
|
271
|
+
tree1.parents node1
|
272
|
+
```
|
273
|
+
|
274
|
+
A trees object is Enumerable:
|
275
|
+
|
276
|
+
``` ruby
|
277
|
+
trees1.map &:id
|
278
|
+
```
|
279
|
+
|
280
|
+
## Characters
|
281
|
+
|
282
|
+
``` ruby
|
283
|
+
puts characters.class
|
284
|
+
|
285
|
+
#get the taxa block to which the characters object is linked
|
286
|
+
characters.otus #returns an otus object
|
287
|
+
|
288
|
+
#get the child format element
|
289
|
+
format = characters.format
|
290
|
+
|
291
|
+
puts format.class
|
292
|
+
|
293
|
+
#get the child matrix element
|
294
|
+
matrix = characters.matrix
|
295
|
+
|
296
|
+
puts matrix.class
|
297
|
+
```
|
298
|
+
|
299
|
+
### Format
|
300
|
+
|
301
|
+
``` ruby
|
302
|
+
format.states_set #return a hash of states objects indexed with 'id'
|
303
|
+
format.states #return an array of states objects
|
304
|
+
|
305
|
+
#iterate over each states object
|
306
|
+
format.each_states do |states|
|
307
|
+
puts states.id
|
308
|
+
puts states.label
|
309
|
+
end
|
310
|
+
|
311
|
+
#get a states object by id
|
312
|
+
states = format.get_states_by_id 'states1'
|
313
|
+
|
314
|
+
#check if the states object with 'id' belongs to format or not
|
315
|
+
format.has_states? 'states1'
|
316
|
+
|
317
|
+
format.char_set #return a hash of char objects indexed with 'id'
|
318
|
+
format.chars #return an array of char objects
|
319
|
+
|
320
|
+
#iterate over each char object
|
321
|
+
format.each_char do |char|
|
322
|
+
puts char.id
|
323
|
+
puts char.label
|
324
|
+
end
|
325
|
+
|
326
|
+
#get a char object by id
|
327
|
+
char = format.get_char_by_id 'char1'
|
328
|
+
|
329
|
+
#check if the char object with 'id' belongs to format or not
|
330
|
+
format.has_char? 'char1'
|
331
|
+
|
332
|
+
#get a states or a char object by id
|
333
|
+
states = format[ 'states1' ]
|
334
|
+
char = format[ 'char1' ]
|
335
|
+
|
336
|
+
#check if a states or a char object with 'id' belongs to format or not
|
337
|
+
format.has? 'states1'
|
338
|
+
format.has? 'char1'
|
339
|
+
|
340
|
+
#all objects, including char and states can be iterated over with each
|
341
|
+
format.each do |obj|
|
342
|
+
puts obj.class
|
343
|
+
end
|
344
|
+
|
345
|
+
#format is enumerable
|
346
|
+
format.map &:id
|
347
|
+
```
|
348
|
+
|
349
|
+
#### States
|
350
|
+
|
351
|
+
``` ruby
|
352
|
+
states.state_set #return a hash of state objects indexed with 'id'
|
353
|
+
states.states #return an array of state objects
|
354
|
+
|
355
|
+
#iterate over each state object
|
356
|
+
states.each_state do |state|
|
357
|
+
puts state.id
|
358
|
+
end
|
359
|
+
#or, use its alias each
|
360
|
+
|
361
|
+
#get a state object by id
|
362
|
+
state = states.get_state_by_id 'state1'
|
363
|
+
#or, use hash notation
|
364
|
+
state = states[ 'state1' ]
|
365
|
+
|
366
|
+
#check if a state belongs to states or not
|
367
|
+
states.has_state? 'state1'
|
368
|
+
#or, use its aliases has? or include?
|
369
|
+
```
|
370
|
+
|
371
|
+
##### State
|
372
|
+
|
373
|
+
``` ruby
|
374
|
+
#get the symbol associated with the state
|
375
|
+
state.symbol
|
376
|
+
|
377
|
+
#find if the state is ambiguous
|
378
|
+
state.ambiguous?
|
379
|
+
|
380
|
+
#find the kind of ambiguity
|
381
|
+
state.ambiguity
|
382
|
+
|
383
|
+
#find if it is an uncertain state set
|
384
|
+
state.uncertain?
|
385
|
+
|
386
|
+
#find if it is a polymorphic state set
|
387
|
+
state.polymorphic?
|
388
|
+
|
389
|
+
#get the members of a state set as an array
|
390
|
+
state.members
|
391
|
+
|
392
|
+
#or iterate over each member
|
393
|
+
state.each do |member|
|
394
|
+
puts member.class #same as self
|
395
|
+
puts member.id
|
396
|
+
end
|
397
|
+
|
398
|
+
#a state is Enumerable over its members
|
399
|
+
state.select{ |member| member.id == "rna5" }
|
400
|
+
```
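
The difference between a plain state and a polymorphic or uncertain state set can be sketched in plain Ruby. `SketchState` is a hypothetical class, and representing `ambiguity` as `nil`, `:polymorphic`, or `:uncertain` is an assumption made for this sketch; the values bio-nexml actually returns may differ.

```ruby
# Hypothetical stand-in for illustration only; not the bio-nexml class.
class SketchState
  include Enumerable

  attr_reader :id, :symbol, :ambiguity, :members

  # ambiguity is nil for a plain state, or :polymorphic / :uncertain
  # for a state set (an assumption of this sketch).
  def initialize(id, symbol, ambiguity: nil, members: [])
    @id = id
    @symbol = symbol
    @ambiguity = ambiguity
    @members = members
  end

  def ambiguous?
    !@ambiguity.nil?
  end

  def polymorphic?
    @ambiguity == :polymorphic
  end

  def uncertain?
    @ambiguity == :uncertain
  end

  # A state set is Enumerable over its member states.
  def each(&block)
    @members.each(&block)
  end
end

a = SketchState.new('s1', 'A')
g = SketchState.new('s2', 'G')
r = SketchState.new('s3', 'R', ambiguity: :uncertain, members: [a, g])

r.ambiguous?    # => true
r.uncertain?    # => true
r.map(&:symbol) # => ["A", "G"]
a.ambiguous?    # => false
```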
|
401
|
+
|
402
|
+
#### Char
|
403
|
+
|
404
|
+
``` ruby
|
405
|
+
#get the id
|
406
|
+
char.id
|
407
|
+
|
408
|
+
#get the label
|
409
|
+
char.label
|
410
|
+
|
411
|
+
#get the states object the char is linked to
|
412
|
+
char.states
|
413
|
+
|
414
|
+
#get the codon position for DnaChar and RnaChar objects
|
415
|
+
char.codon
|
416
|
+
```
|
417
|
+
|
418
|
+
### Matrix
|
419
|
+
|
420
|
+
...
|
421
|
+
|
422
|
+
## Contributing to bio-nexml
|
423
|
+
|
424
|
+
* Check out the latest master to make sure the feature hasn't been implemented
|
425
|
+
or the bug hasn't been fixed yet
|
426
|
+
* Check out the issue tracker to make sure someone hasn't already requested it
|
427
|
+
and/or contributed it
|
428
|
+
* Fork the project
|
429
|
+
* Start a feature/bugfix branch
|
430
|
+
* Commit and push until you are happy with your contribution
|
431
|
+
* Make sure to add tests for it. This is important so I don't break it in a
|
432
|
+
future version unintentionally.
|
433
|
+
* Please try not to mess with the Rakefile, version, or history. If you want to
|
434
|
+
have your own version, or it is otherwise necessary, that is fine, but please
|
435
|
+
isolate to its own commit so I can cherry-pick around it.
|
436
|
+
|
437
|
+
## Acknowledgements
|
438
|
+
|
439
|
+
The research leading to these results has received funding from the European
|
440
|
+
Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement
|
441
|
+
n° 237046.
|
442
|
+
|
443
|
+
## Citing bio-nexml
|
444
|
+
|
445
|
+
If you use this software, please cite:
|
446
|
+
|
447
|
+
> [NeXML: rich, extensible, and verifiable representation of comparative data and metadata][1]
|
448
|
+
|
449
|
+
and
|
450
|
+
|
451
|
+
> [Biogem: an effective tool based approach for scaling up open source software development in bioinformatics][2]
|
452
|
+
|
453
|
+
## Copyright
|
454
|
+
|
455
|
+
Copyright (c) 2011 Rutger Vos and Anurag Priyam. See LICENSE.txt for further
|
456
|
+
details.
|
457
|
+
|
458
|
+
[1]: http://sysbio.oxfordjournals.org/content/early/2012/02/12/sysbio.sys025.short
|
459
|
+
[2]: http://dx.doi.org/10.1093/bioinformatics/bts080
|