bio-alignment 0.0.6 → 0.0.7

data/Gemfile CHANGED
@@ -8,7 +8,7 @@ group :development do
8
8
  gem "rake"
9
9
  gem "bio-bigbio", "> 0.1.3" # for reading FASTA files in tests
10
10
  gem "cucumber", ">= 0"
11
- gem "rspec", "~> 2.3.0"
11
+ gem "rspec", "~> 2.10.0"
12
12
  gem "bundler", ">= 1.0.21"
13
13
  gem "jeweler"
14
14
  end
data/README.md CHANGED
@@ -1,22 +1,39 @@
1
1
  # bio-alignment
2
2
 
3
- Alignment handler for multiple sequence alignments (MSA).
3
+ Matrix style alignment handler for multiple sequence alignments (MSA).
4
4
 
5
- This alignment handler makes no assumptions about the underlying
6
- sequence object. Support for any nucleotide, amino acid and codon
7
- sequences that are lists. Any list with payload can be used (e.g.
8
- nucleotide quality score, codon annotation). The only requirement is
9
- that the list is iterable and can be indexed.
5
+ [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-alignment.png)](http://travis-ci.org/pjotrp/bioruby-alignment)
10
6
 
11
- This work is based on Pjotr's experience designing the BioScala
7
+ This alignment handler makes no assumptions about the underlying
8
+ sequence object. It supports any nucleotide, amino acid and codon
9
+ sequences that are lists. Any list with payload or state can be used
10
+ (e.g. nucleotide quality score, codon annotation). The only
11
+ requirement is that the list is Enumerable and can be indexed, i.e.
12
+ it includes Ruby's Enumerable and has the [] method.
13
+
14
+ Features are:
15
+
16
+ * Matrix notation for alignment object
17
+ * Functional style alignment access and editing
18
+ * Support for BioRuby Sequences
19
+ * Support for BioRuby trees and node distance calculation
20
+ * bio-alignment interacts well with BioRuby structures,
21
+ including sequence objects and alignment/tree parsers
22
+
23
+ When possible, BioRuby functionality is merged in. For example, by
24
+ supporting Bio::Sequence objects, standard BioRuby alignment
25
+ functions, sequence readers and writers can be used. By supporting the
26
+ BioRuby Tree object, standard BioRuby tree parsers and writers can be
27
+ used. bio-alignment takes alignment handling with phylogenetic tree
28
+ support to a new level.
29
+
30
+ bio-alignment is based on Pjotr's experience designing the BioScala
12
31
  Alignment handler and BioRuby's PAML support. Read the
13
32
  Bio::BioAlignment
14
33
  [design
15
34
  document](https://github.com/pjotrp/bioruby-alignment/blob/master/doc/bio-alignment-design.md)
16
35
  for Ruby.
17
36
 
18
- Note: this software is under active development.
19
-
20
37
  ## Developers
21
38
 
22
39
  ### Codon alignment example
@@ -40,6 +57,8 @@ aligmment (note codon gaps are represented by '---')
40
57
  aln.rows.each do | row |
41
58
  fasta.write(row.id, row.to_aa.to_s)
42
59
  end
60
+ # get first codon element of the fourth sequence
61
+ p aln[3][0]
43
62
  ```
44
63
 
45
64
  Now add some state - you can define your own row state
@@ -151,8 +170,28 @@ resulting in the codon alignment.
151
170
 
152
171
  ### Phylogeny
153
172
 
154
- BioAlignment has support for attaching a phylogentic tree to an
155
- alignment, and traversing the tree.
173
+ BioAlignment has support for attaching a phylogenetic tree to an
174
+ alignment, and traversing the tree using an intuitive interface
175
+
176
+ ```ruby
177
+ sole_tree = Bio::Newick.new(string).tree # use BioRuby's tree parser
178
+ tree = aln.attach_tree(sole_tree) # attach the tree
179
+ # now do stuff with the tree, which has improved bio-align support
180
+ root = tree.root
181
+ children = root.children
182
+ children.map { |n| n.name }.sort.should == ["","seq7"]
183
+ seq7 = children.last
184
+ seq4 = tree.find("seq4")
185
+ seq4.distance(seq7).should == 19.387756600000003
186
+ print tree.output_newick # BioRuby Newick output
187
+ ```
188
+
189
+ There are methods for finding sibling nodes, splitting the alignment
190
+ based on the tree, and locating sequences on the same branch. More
191
+ examples can be found in the tests and features. The underlying
192
+ implementation of Bio::Tree is that of BioRuby. We have added an OOP
193
+ layer for traversing the tree by injecting methods into the BioRuby
194
+ object itself.
156
195
 
157
196
  ### Alignment marking/masking/editing
158
197
 
@@ -249,18 +288,28 @@ where aln2 is a copy of aln with bridging columns deleted.
249
288
 
250
289
  ### See also
251
290
 
252
- The API documentation is online. For more code examples see
291
+ For more on the design of bio-alignment, read the
292
+ Bio::BioAlignment
293
+ [design
294
+ document](https://github.com/pjotrp/bioruby-alignment/blob/master/doc/bio-alignment-design.md).
295
+
296
+ The API documentation can be found
297
+ [online](http://rubygems.org/gems/bio-alignment). For examples see the files in
253
298
  [./spec/*.rb](https://github.com/pjotrp/bioruby-alignment/tree/master/spec) and
254
299
  [./features/*](https://github.com/pjotrp/bioruby-alignment/tree/master/features).
255
300
 
256
301
  ## Cite
257
302
 
258
- If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
303
+ If you use this software, please cite one of
304
+
305
+ * [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
306
+ * [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
307
+
308
+ ## Biogems.info
309
+
310
+ This Biogem is published at [#bio-alignment](http://biogems.info/index.html)
259
311
 
260
312
  ## Copyright
261
313
 
262
314
  Copyright (c) 2012 Pjotr Prins. See LICENSE.txt for further details.
263
315
 
264
- ## Biogems.info
265
-
266
- This exciting Ruby Biogem is published on http://biogems.info/
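The README's requirement that a sequence be Enumerable, indexable, and able to carry payload can be sketched as follows. This is a minimal illustration rather than code from the gem: `QualityElement` and `PayloadSequence` are hypothetical names, and the per-base quality string is an assumed payload.

```ruby
# One alignment element: a nucleotide together with its payload.
# QualityElement is a hypothetical name, not a bio-alignment class.
QualityElement = Struct.new(:base, :quality) do
  def gap?
    base == "-"
  end

  def to_s
    base
  end
end

# A minimal list that satisfies the stated requirement: it includes
# Enumerable (which needs 'each') and provides '[]'.
class PayloadSequence
  include Enumerable
  attr_reader :id

  def initialize(id, bases, qualities)
    @id = id
    @elements = bases.chars.zip(qualities.chars).map do |base, quality|
      QualityElement.new(base, quality)
    end
  end

  def each(&block)
    @elements.each(&block)
  end

  def [](index)
    @elements[index]
  end

  # Removing an element removes its payload in the same action
  def delete_at(index)
    @elements.delete_at(index)
  end
end

PAYLOAD_SEQ = PayloadSequence.new("seq1", "AGTA", "59*1")
PAYLOAD_SEQ.delete_at(2) # drop the third nucleotide and its score together
```

Because each element holds both values, a single `delete_at` call leaves the bases and the quality scores in step; there is no second string to keep synchronised.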
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.0.6
1
+ 0.0.7
@@ -1,58 +1,59 @@
1
1
  # Bio-alignment design
2
2
 
3
- ''A well designed library should be simple and elegant to use...''
3
+ ''A well designed library should be *simple* and elegant to use...''
4
4
 
5
5
  ## Introduction
6
6
 
7
- Biological multi-sequence alignments (MSA) are normally matrices of
8
- nucleotide or amino acid sequences, with gaps. Despite this rather
9
- simple premise, most software fails make it simple to access these
10
- structures. Also most implementations fail to support a 'pay load' of
11
- items in the matrix (mostly because underlying sequences are String
12
- based). This means a developer has to track information in multiple
13
- places, for example a base pair quality score. This makes code complex
14
- and therefore error prone. With bio-alignment elements of the matrix
15
- can carry information. So, when the alignment gets edited,
16
- the element gets moved or deleted, and the information moves or
17
- deletes along. For example,
18
- say we have a nucleotide sequence with pay load
7
+ Biological multi-sequence alignments (MSA) are matrices of nucleotide or amino
8
+ acid sequences with gaps. Despite this rather simple premise, most software
9
+ fails to make it simple to access these structures. Also most implementations fail
10
+ to support a 'pay load' of items in the matrix (this is because underlying
11
+ sequences are String based). The result is that a developer has to track
12
+ information in multiple places. For example, tracking a base pair quality score
13
+ requires a second matrix of information. This makes code complex and therefore
14
+ error prone. With the bio-alignment library, elements of the matrix can carry
15
+ information, so called 'state'. When the alignment gets edited, i.e. the
16
+ element gets moved or deleted, the information gets moved or deleted along. For
17
+ example, say we have a nucleotide sequence with quality pay load
19
18
 
20
19
  A G T A
21
20
  | | | |
22
21
  5 9 * 1
23
22
 
24
- most library implementations will have two strings "AGTA" and "59*1".
25
- Removing the third nucleotide would mean removing it twice, into first
26
- "AGA", and second "591". With bio-alignment this is one action because we
27
- have one object for each element that contains both values, e.g. the
28
- payload of 'T' is '*'. Moving 'T' automatically moves '*'.
23
+ most library implementations will have two strings "AGTA" and "59*1". Removing
24
+ the third nucleotide would mean removing it twice, into first "AGA", and second
25
+ "591". With bio-alignment this is one action because we have one object for
26
+ each element that contains both values, e.g. the payload of 'T' is '*'. Moving
27
+ 'T' automatically moves '*'. Simple really.
29
28
 
30
- In addition the bio-alignment library deals with codons and codon translation.
31
- Rather than track multiple matrices, the codon is viewed as an element,
32
- and the translated codon as the pay load. Again, when an alignment gets
33
- reordered the code only has to do it in one place.
29
+ In addition to carrying state, the bio-alignment library deals with codons and
30
+ codon translation. Rather than track multiple matrices, the codon is viewed as
31
+ an element, and the translated codon as the pay load. Again, when an alignment
32
+ gets reordered the code only has to do it in one place.
34
33
 
35
- Likewise, an alignment column can have a pay load (e.g. quality score
36
- in a pile up), and an alignment row can have a pay load (e.g. the
37
- sequence name). The concept of pay load is handled through generic
38
- matrix element, column, or row 'attributes'.
34
+ Likewise, an alignment column can have a pay load (e.g. quality score in a pile
35
+ up), and an alignment row can have a pay load (e.g. the sequence name). The
36
+ concept of pay load, normally referred to as 'state', is handled through
37
+ generic matrix element, column, or row 'attributes'.
39
38
 
40
- Many of these ideas came from my work on the [BioScala
39
+ Many of these ideas came from my earlier work on the [BioScala
41
40
  project](https://github.com/pjotrp/bioscala/blob/master/doc/design.txt),
42
41
  The BioScala library has the additional advantage of having type
43
- safety throughout.
42
+ safety throughout, but lacks many of the features I have added to the
43
+ Ruby version.
44
44
 
45
45
  ## Row or Sequence
46
46
 
47
- Any sequence for an alignment is simply a list of objects. The
48
- requirement is that the list should be enumerable and can be indexed. This means
49
- it has to include Enumerable and provide 'each' and '[]' methods. CodonSequence
50
- is a good example.
47
+ Any sequence for an alignment is simply a list of objects. The requirement for
48
+ any such list is that it should be enumerable and can be indexed. In Ruby
49
+ terms, the list has to include Enumerable and provide 'each' and '[]' methods.
50
+ The CodonSequence list, included in this library, is a good example.
51
51
 
52
52
  In addition, elements in the list should respond to certain properties (see
53
53
  below).
54
54
 
55
55
  ```ruby
56
+ # create a list of codons
56
57
  codons = CodonSequence.new(rec.id,rec.seq)
57
58
  print codons.id
58
59
  # get first codon
@@ -70,7 +71,8 @@ acid with
70
71
  print codons.seq[0].to_aa
71
72
  ```
72
73
 
73
- in fact, because Sequence is index-able we can write directly
74
+ in fact, because bio-alignment demands Sequence is index-able we can write
75
+ directly
74
76
 
75
77
  ```ruby
76
78
  print codons[0].to_aa # 'M'
@@ -85,14 +87,17 @@ do a fancy
85
87
  aaseq = codons.map { | codon | codon.to_aa }.join("")
86
88
  ```
87
89
 
90
+ this is getting interesting... Codons, which are three letter nucleotide base
91
+ pairs, actually act as basic lists, and can be converted to amino acids.
92
+
88
93
  ## Element
89
94
 
90
95
  Elements in the list should respond to a gap? method, for an alignment
91
96
  gap, and the undefined? method for a position that is either an
92
97
  element or a gap. Also it should respond to the to_s method.
93
98
 
94
- An element can contain any pay load. If a list of attributes exists
95
- in the sequence object, it can be used.
99
+ It is important to note that an element can contain *any* pay load, or state. Ruby
100
+ objects are 'open'. You can even add state at runtime.
96
101
 
97
102
  ## Elements and CodonSequence
98
103
 
@@ -102,24 +107,25 @@ carry state.
102
107
 
103
108
  The third list type we normally use in an Alignment, next to Sequence and
104
109
  Elements, is the CodonSequence (remember, you can easily roll your own Sequence
105
- type).
110
+ type, just make them Enumerable and indexed).
106
111
 
107
112
  ## Column
108
113
 
109
- The column list tracks the columns of the alignment. The requirement
110
- is that it should be iterable and can be indexed. The Column contains
111
- no elements, but may point to a list when the alignment is transposed.
114
+ The column list tracks the columns of the alignment. Again, the requirement is
115
+ that the list should be Enumerable and indexed. By default, the Column contains
116
+ no elements, only when the alignment is transposed. Matrix elements are found
117
+ by indexing on the sequences (rows).
112
118
 
113
- One of the 'features' of this library is that the Column access logic is
119
+ One of the features of this library is that the Column access logic is
114
120
  split out into a separate module, which accesses the data in a lazy fashion.
115
121
  Also column state is stored as an 'any object'. I.e. a column can contain
116
- any state.
122
+ any type of state.
117
123
 
118
- ## Matrix or MSA
124
+ ## Matrix (MSA)
119
125
 
120
- The Matrix consists of a Column list, multiple Sequences, in turn
121
- consisting of Elements. Accessing the matrix is by Sequence, followed
122
- by Element.
126
+ The matrix (multi sequence alignment or MSA) consists of a Column list, and
127
+ multiple Sequences, in turn consisting of Elements. Accessing the matrix is by
128
+ Sequence, followed by Element, leading to a matrix style notation
123
129
 
124
130
  ```ruby
125
131
  require 'bio-alignment'
@@ -130,31 +136,34 @@ by Element.
130
136
  fasta.each do | rec |
131
137
  aln.sequences << rec
132
138
  end
139
+ # get first codon element of the fourth sequence
140
+ codon = aln[3][0]
133
141
  ```
134
142
 
135
143
  note that MSA understands rec, as long as rec.id and rec.seq exist, and strings
136
- (req.seq is a String). Alternatively we can convert to a Codon sequence by
144
+ (req.seq is a String). Alternatively we can first convert to a Codon sequence by
137
145
 
138
146
  ```ruby
139
147
  fasta.each do | rec |
140
148
  aln.sequences << CodonSequence.new(rec.id,rec.seq)
141
149
  end
150
+ # get first codon element of the fourth sequence
151
+ codon = aln[3][0]
142
152
  ```
143
153
 
144
154
  The Matrix can be accessed in transposed fashion, but accessing the normal
145
- matrix and transposed matrix at the same time is not supported. Matrix is not
146
- designed to be transaction safe - though you can copy the Matrix any time.
147
-
155
+ matrix and transposed matrix at the same time is not supported. Note that
156
+ Matrix editing is not designed to be transaction safe - better to copy the
157
+ Matrix when editing.
148
158
 
149
159
  ## Adding functionality
150
160
 
151
- To ascertain that the basic BioAlignment implementation does not get
152
- polluted, extra functionality is added by using modules. These
153
- modules can be added at run time(!) One advantage is that there is
154
- less name space pollution, the other is that different implementations
155
- can be plugged in - using the same interface. For example, here we are
156
- going to use an alignment editor named DelBridges, which has a method
157
- named del_bridges:
161
+ To ascertain that the basic BioAlignment implementation does not get polluted
162
+ with heaps of methods, extra functionality is added by using modules. These
163
+ modules can be added at run time(!) One advantage is that there is less name
164
+ space pollution, the other is that different implementations can be plugged in
165
+ - using the same interface. For example, here we are going to use an alignment
166
+ editor named DelBridges, which has a method named del_bridges:
158
167
 
159
168
  ```ruby
160
169
  require 'bio-alignment/edit/del_bridges'
@@ -164,18 +173,19 @@ named del_bridges:
164
173
  aln2 = aln.del_bridges
165
174
  ```
166
175
 
167
- in other words, the functionality in DelBridges gets attached to the
168
- aln instance at run time, without affecting any other instantiated
169
- object(!) Also, when not requiring 'bio-alignment/edit/del_bridges',
170
- the functionality is never visible, and never added to the
171
- environment. This type of runtime plugin is something you can only do
172
- in a dynamic language.
176
+ in other words, the functionality in DelBridges gets attached to the aln
177
+ instance at run time, without affecting any other instantiated object(!) Also,
178
+ when not requiring 'bio-alignment/edit/del_bridges', the functionality is never
179
+ visible, and never added to the runtime environment. This type of runtime
180
+ plugin is something you can only do in a dynamic language, such as Ruby. Ruby
181
+ makes it rather convenient.
173
182
 
174
- Likewise you may have your own sequence objects in an alignment. To register
175
- deletion state, simply extend the sequence with the RowState module:
183
+ You may have created your own style of sequence objects in an alignment. To register a
184
+ prefab deletion state, extend the sequence with the RowState module:
176
185
 
177
186
  ```ruby
178
187
  require 'bio-alignment/state'
188
+ # Use the standard BioRuby sequence object
179
189
  bioseq = Bio::Sequence::NA.new("AGCT")
180
190
  bioseq.extend(State) # add state
181
191
  bioseq.state = RowState.new # set state
@@ -183,10 +193,10 @@ deletion state, simply extend the sequence with the RowState module:
183
193
  > false
184
194
  ```
185
195
 
186
- That is impressive - the BioRuby Sequence has no deletion state facility. We
187
- just added that, and it can even be used in BioAlignment editors which require
188
- such a state object. See also the scenario "Give deletion state to a
189
- Bio::Sequence object" in the bioruby.feature.
196
+ That is impressive - the BioRuby Sequence has no deletion state facility by
197
+ itself. We just added that, and it can even be used in BioAlignment editors
198
+ which require such a state object. See also the scenario "Give deletion state
199
+ to a Bio::Sequence object" in the bioruby.feature.
190
200
 
191
201
  Note: if we wanted only to allow one plugin per instance at a time, we can
192
202
  create a generic interface with a method of the same name for every
@@ -195,9 +205,10 @@ multiple plugins (by default).
195
205
 
196
206
  ## Adding Phylogenetic support
197
207
 
198
- MSAs often come with phylogenetic trees. Not to add this functionality by default,
199
- we extend BioAlignment with BioAlignment::AlignmentTree when a tree is plugged in
200
- with the add_tree method.
208
+ An MSA often comes with a phylogenetic tree. Similar to runtime adding of the
209
+ delete state module, now we extend BioAlignment with
210
+ BioAlignment::AlignmentTree. A tree is plugged in with the attach_tree method. See
211
+ the README and features directory for more examples.
201
212
 
202
213
  ## Methods returning alignments and concurrency
203
214
 
@@ -216,7 +227,7 @@ in functional style, such as
216
227
  ```
217
228
 
218
229
  where aln2 is a copy (of aln) with columns removed that were marked for
219
- deletion. In other words, we apply ''Functional programming in Ruby.'' If
230
+ deletion. In other words, applied ''Functional programming in Ruby.'' If
220
231
  functions can be easily 'piped', and code can be easily copy and pasted into
221
232
  different algorithms, it is likely that the module is written in a functional
222
233
  style.
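The run-time plugin idea described in the design document rests on Ruby's `Object#extend`, which adds a module's methods to one instance only. A minimal sketch, assuming a hypothetical `DeletionFlag` module and `ToyRow` struct (neither is part of bio-alignment):

```ruby
# Hypothetical state module; stands in for a plugin such as a row state.
module DeletionFlag
  def deleted?
    @deleted == true
  end

  def delete!
    @deleted = true
    self
  end
end

# Hypothetical row type with no deletion facility of its own
ToyRow = Struct.new(:id, :seq)

ROW_A = ToyRow.new("seq1", "AGCT")
ROW_B = ToyRow.new("seq2", "AGCT")

# Attach the behaviour to one instance at run time
ROW_A.extend(DeletionFlag)
ROW_A.delete!
```

Only `ROW_A` gains `deleted?` and `delete!`; `ROW_B` and the `ToyRow` class itself are untouched, which is the property the design document relies on to keep the base class free of extra methods.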
@@ -0,0 +1,31 @@
1
+ require 'bio-alignment/edit/tree_splitter.rb'
2
+
3
+ When /^I split the tree$/ do |string|
4
+ tree = @aln.attach_tree(@tree)
5
+ @aln.extend TreeSplitter
6
+ (aln1,aln2) = @aln.split_on_distance
7
+ aln2.size.should == 5
8
+ @split1 = aln1
9
+ @split2 = aln2
10
+ end
11
+
12
+ Then /^I should have found sub\-trees "([^"]*)" and "([^"]*)"$/ do |arg1, arg2|
13
+ @split2.ids.sort.join(",").should == arg2
14
+ @split1.ids.sort.join(",").should == arg1
15
+ end
16
+
17
+ When /^I split the tree with a target of (\d+)$/ do |arg1|
18
+ tree = @aln.attach_tree(@tree)
19
+ @aln.extend TreeSplitter
20
+ @split1,@split2 = @aln.split_on_distance(arg1.to_i)
21
+ end
22
+
23
+ Then /^I should have found low\-homology sub\-tree "([^"]*)"$/ do |arg1|
24
+ @split1.ids.sort.join(",").should == arg1
25
+ end
26
+
27
+ Then /^I should have found high\-homology sub\-tree "([^"]*)"$/ do |arg1|
28
+ @split2.ids.sort.join(",").should == arg1
29
+ end
30
+
31
+
@@ -0,0 +1,66 @@
1
+ @split
2
+ Feature: Splitting alignments into equal sized branches using phylogenetic tree info
3
+
4
+ Sometimes we want to split a large alignment into sub-sets. When an
5
+ alignment is accompanied by a phylogenetic tree, we can greedily split the
6
+ tree. With a rooted tree, we start from the root, and walk the tree, taking
7
+ the shortest edge at every node (a tie may favour splitting). If the tree can
8
+ be split, so that both sides are similar sized, the job is done (if you want
9
+ more splits, just repeat the exercise). Essentially one subset shows
10
+ relatively high homology, the other relatively low homology. This is a crude
11
+ method, but has the advantage of being quick to calculate and reproducible.
12
+ If there is no root, we start from the point next to the longest edge.
13
+
14
+ We add one 'target_size' parameter to allow for leaving more sequences in the
15
+ high homology subset. 'target_size' sets the allowed size of the
16
+ high-homology alignment. For example, setting it to 10 in a 15 sequence
17
+ alignment, will stop the splitting at 5 sequences, leaving (approx.) 10
18
+ sequences in the high homology group. Likewise, setting it to 5 will continue
19
+ splitting until that number is reached.
20
+
21
+ In below example the tree will be split in a branch with similar sequences,
22
+ and a branch with sequences that are somewhat removed.
23
+
24
+ Scenario: Split a tree
25
+ Given I have a multiple sequence alignment (MSA)
26
+ """
27
+ seq1 ----SNSFSRPTIIFSGCSTACSGK--SELVCGFRSFMLSDV
28
+ seq2 SSIISNSFSRPTIIFSGCSTACSGK--SEQVCGFR---LSDV
29
+ seq3 SSIISNSFSRPTIIFSGCSTACSGKLTSEQVCGFR---LSDV
30
+ seq4 ----PKLFSRPTIIFSGCSTACSGK--SEPVCGFRSFMLSDV
31
+ seq5 ----------PTIIFSGCSKACSGKGLSELVCGFRSFMLSDV
32
+ seq6 ----------PTIIFSGCSKACSGK-----FRSFRSFMLSAV
33
+ seq7 ----------PTIIFSGCSKACSGK-----VCGIFHAVRSFM
34
+ seq8 ----------PTIIFSGCSKACSGK--SELVCGFRSFMLSAV
35
+ """
36
+ And I have a phylogenetic tree in Newick format
37
+ """
38
+ ((seq6:5.3571434,(seq4:4.04762,((seq8:1.1904755,seq5:1.1904755):1.7857151,((seq3:0.0,seq2:0.0):1.1904755,seq1:1.1904755):1.7857151):1.0714293):1.3095236):4.336735,seq7:9.693878);
39
+ """
40
+ When I split the tree
41
+ """
42
+ ,--9.69----------------------------------------- seq7 ----------PTIIFSGCSKACSGK-----VCGIFHAVRSFM
43
+ | ,--1.19----- seq1 ----SNSFSRPTIIFSGCSTACSGK--SELVCGFRSFMLSDV
44
+ | ,--1.79--| ,-- seq2 SSIISNSFSRPTIIFSGCSTACSGK--SEQVCGFR---LSDV
45
+ | ,--1.07--| `--1.19--+-- seq3 SSIISNSFSRPTIIFSGCSTACSGKLTSEQVCGFR---LSDV
46
+ | | `--1.79--+--1.19----- seq5 ----------PTIIFSGCSKACSGKGLSELVCGFRSFMLSDV
47
+ | ,--1.31--| `--1.19----- seq8 ----------PTIIFSGCSKACSGK--SELVCGFRSFMLSAV
48
+ `--4.34--| `--4.05----------------------- seq4 ----PKLFSRPTIIFSGCSTACSGK--SEPVCGFRSFMLSDV
49
+ `--5.36-------------------------------- seq6 ----------PTIIFSGCSKACSGK-----FRSFRSFMLSAV
50
+ """
51
+ Then I should have found sub-trees "seq4,seq6,seq7" and "seq1,seq2,seq3,seq5,seq8"
52
+ When I split the tree with a target of 2
53
+ Then I should have found high-homology sub-tree "seq5,seq8"
54
+ When I split the tree with a target of 3
55
+ Then I should have found high-homology sub-tree "seq1,seq2,seq3"
56
+ When I split the tree with a target of 4
57
+ Then I should have found high-homology sub-tree "seq1,seq2,seq3"
58
+ When I split the tree with a target of 5
59
+ Then I should have found high-homology sub-tree "seq1,seq2,seq3,seq5,seq8"
60
+ When I split the tree with a target of 6
61
+ Then I should have found high-homology sub-tree "seq1,seq2,seq3,seq4,seq5,seq8"
62
+ When I split the tree with a target of 7
63
+ Then I should have found low-homology sub-tree "seq7"
64
+ When I split the tree with a target of 6
65
+ Then I should have found low-homology sub-tree "seq6,seq7"
66
+
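The greedy split described in the feature can be sketched on a toy tree. This is an illustration under stated assumptions, not the gem's implementation: `ToyNode`, `leaf`, `inner`, and `greedy_split` are hypothetical names, the tree is built by hand from the Newick string above, and the root edge length of 0.0 is a placeholder.

```ruby
# Toy tree node (hypothetical, not Bio::Tree): name is nil for internal
# nodes, edge is the length of the branch leading to the parent.
ToyNode = Struct.new(:name, :edge, :children) do
  # All leaf nodes at or below this node
  def leaves
    children.empty? ? [self] : children.flat_map(&:leaves)
  end

  # Children that share the shortest edge (there may be ties)
  def nearest_children
    shortest = children.map(&:edge).min
    children.select { |c| c.edge == shortest }
  end
end

def leaf(name, edge)
  ToyNode.new(name, edge, [])
end

def inner(edge, *children)
  ToyNode.new(nil, edge, children)
end

# Walk from the root along the shortest edges until the branch holds at
# most target_size leaves. Returns [low_homology_ids, high_homology_ids].
def greedy_split(root, target_size)
  prev = nil
  node = root
  until node.children.empty?
    prev = node
    # on a tie, prefer the child whose leaf count is closest to the target
    node = node.nearest_children.min_by { |c| (c.leaves.size - target_size).abs }
    break if node.leaves.size <= target_size
  end
  # step back up one node if the previous node was closer to the target
  if prev && (prev.leaves.size - target_size).abs < (node.leaves.size - target_size).abs
    node = prev
  end
  high = node.leaves.map(&:name).sort
  low = root.leaves.map(&:name).sort - high
  [low, high]
end

# The tree from the scenario, transcribed from its Newick string
FEATURE_TREE = inner(0.0,
  inner(4.336735,
    leaf("seq6", 5.3571434),
    inner(1.3095236,
      leaf("seq4", 4.04762),
      inner(1.0714293,
        inner(1.7857151, leaf("seq8", 1.1904755), leaf("seq5", 1.1904755)),
        inner(1.7857151,
          inner(1.1904755, leaf("seq3", 0.0), leaf("seq2", 0.0)),
          leaf("seq1", 1.1904755))))),
  leaf("seq7", 9.693878))

LOW_IDS, HIGH_IDS = greedy_split(FEATURE_TREE, 5)
```

With a target of 5 the walk stops at the five-leaf branch, so the sketch reproduces the first split in the scenario. Ties between equally short edges are resolved by leaf count, which is why a target of 2 and a target of 3 select different branches.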
@@ -27,24 +27,29 @@ Then /^I should be able to traverse the tree$/ do
27
27
  root = @aln.root # get the root of the tree
28
28
  root.leaf?.should == false
29
29
  children = root.children
30
+ # root has one direct leaf
30
31
  children.map { |n| n.name }.sort.should == ["","seq7"]
31
32
  seq7 = children.last
32
33
  seq7.name.should == 'seq7'
33
34
  seq7.leaf?.should == true
34
35
  seq7.parent.should == root
36
+ # find leaf seq4
35
37
  seq4 = tree.find("seq4")
36
38
  seq4.leaf?.should == true
37
- seq4.distance(seq7).should == 19.387756600000003 # that is nice!
39
+ # total distance to seq7 9.69+4.34+1.31+4.05 ~ 19.39
40
+ seq4.distance(seq7).should == 19.387756600000003 # BioRuby does this!
38
41
  end
39
42
 
40
43
  Then /^fetch elements from the MSA from each end node in the tree$/ do
41
44
  # walk the tree
42
45
  tree = @aln.attach_tree(@tree)
43
46
  ids = []
47
+ # Walk the ordered tree and fetch the sequence from the alignment
44
48
  column20 = tree.map { | leaf |
45
49
  ids << leaf.name
50
+ # we have the ID, now find the alignment
46
51
  seq = @aln.find(leaf.name)
47
- # p seq
52
+ # Return the 20th element (index 19), just for show
48
53
  seq[19]
49
54
  }
50
55
  ids.should == ["seq6", "seq4", "seq8", "seq5", "seq3", "seq2", "seq1", "seq7"]
@@ -52,11 +57,35 @@ Then /^fetch elements from the MSA from each end node in the tree$/ do
52
57
  end
53
58
 
54
59
  Then /^calculate the phylogenetic distance between each element$/ do
55
- pending # express the regexp above with the code you wish you had
60
+ # we did this earlier with
61
+ tree = @aln.attach_tree(@tree)
62
+ seq7 = tree.find("seq7")
63
+ seq4 = tree.find("seq4")
64
+ # total distance to seq7 9.69+4.34+1.31+4.05 ~ 19.39
65
+ seq4.distance(seq7).should == 19.387756600000003 # BioRuby does this!
66
+ end
67
+
68
+ Then /^find that the nearest sequence to "([^"]*)" is "([^"]*)"$/ do |arg1, arg2|
69
+ tree = @aln.attach_tree(@tree)
70
+ seq = tree.find(arg1)
71
+ seq.nearest.map{|n|n.to_s}.sort.join(',').should == arg2
56
72
  end
57
73
 
74
+ Then /^find that "([^"]*)" is on the same branch as "([^"]*)"$/ do |arg1, arg2|
75
+ # really the same as the above
76
+ tree = @aln.attach_tree(@tree)
77
+ seq = tree.find(arg1)
78
+ seq.nearest.map{|n|n.to_s}.sort.join(',').should == arg2
79
+ end
80
+
81
+
58
82
  Then /^draw the MSA with the tree$/ do | string |
59
83
  # textual drawing, like tabtree, or http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/149701
84
+ # or BioPythons http://biopython.org/DIST/docs/api/Bio.Phylo._utils-pysrc.html#draw_ascii
85
+ # hg clone https://bitbucket.org/keesey/namesonnodes-sa
86
+ #
87
+ # http://cegg.unige.ch/newick_utils
88
+ # http://code.google.com/p/a3lbmonkeybrain-as3/source/browse/trunk/src/a3lbmonkeybrain/calculia/collections/graphs/exporters/TextCladogramExporter.as?spec=svn26&r=26
60
89
  print string
61
90
  pending # express the regexp above with the code you wish you had
62
91
  end
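The distance asserted in these steps is the sum of the edge lengths on the path from seq4 up to the root and down to seq7. A small worked check, using the edge lengths from the Newick string in the feature; the constant names are hypothetical, and the comparison uses a tolerance because the exact floating-point result depends on the order of summation:

```ruby
# Edges on the path seq4 -> parent -> grandparent -> root -> seq7,
# taken from the Newick string used in the tree feature.
SEQ4_TO_SEQ7_EDGES = [4.04762, 1.3095236, 4.336735, 9.693878]

# Path length as a plain sum of edge lengths
SEQ4_TO_SEQ7_DISTANCE = SEQ4_TO_SEQ7_EDGES.sum

# Compare against the value asserted in the step definition
DISTANCE_MATCHES = (SEQ4_TO_SEQ7_DISTANCE - 19.3877566).abs < 1e-9
```

This is only an arithmetic illustration; the step itself delegates the calculation to the tree's `distance` method.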
@@ -1,6 +1,8 @@
1
1
  @tree
2
2
  Feature: Tree support for alignments
3
- Alignments are often accompanied by phylogenetic trees.
3
+ Alignments are often accompanied by phylogenetic trees. When we
4
+ have an alignment with its tree, we want to traverse the tree
5
+ and calculate distances.
4
6
 
5
7
  Scenario: Get ordered elements from a tree
6
8
  Given I have a multiple sequence alignment (MSA)
@@ -21,6 +23,11 @@ Feature: Tree support for alignments
21
23
  Then I should be able to traverse the tree
22
24
  And fetch elements from the MSA from each end node in the tree
23
25
  And calculate the phylogenetic distance between each element
26
+ And find that the nearest sequence to "seq2" is "seq3"
27
+ And find that the nearest sequence to "seq5" is "seq8"
28
+ And find that the nearest sequence to "seq1" is "seq2,seq3"
29
+ And find that "seq1" is on the same branch as "seq2,seq3"
30
+ And find that "seq4" is on the same branch as "seq1,seq2,seq3,seq5,seq8"
24
31
  And draw the MSA with the tree
25
32
  """
26
33
  ,--9.69----------------------------------------- seq7 ----------PTIIFSGCSKACSGK-----VCGIFHAVRSFM
@@ -15,6 +15,7 @@ module Bio
15
15
  include Columns
16
16
 
17
17
  attr_accessor :sequences
18
+ attr_reader :tree
18
19
 
19
20
  # Create alignment. seqs can be a list of sequences. If these
20
21
  # are String types, they get converted to the library Sequence
@@ -41,6 +42,16 @@ module Bio
41
42
 
42
43
  alias rows sequences
43
44
 
45
+ # return an array of sequence ids
46
+ def ids
47
+ rows.map { |r| r.id }
48
+ end
49
+
50
+ def size
51
+ rows.size
52
+ end
53
+
54
+ # Return a sequence by index
44
55
  def [] index
45
56
  rows[index]
46
57
  end
@@ -59,7 +70,7 @@ module Bio
59
70
  each do | seq |
60
71
  return seq if seq.id == name
61
72
  end
62
- raise "ERROR: Sequence not found by its name #{name}"
73
+ raise "ERROR: Sequence not found by its name, looking for <#{name}>"
63
74
  end
64
75
 
65
76
  # clopy alignment and allow updating elements
@@ -85,6 +96,8 @@ module Bio
85
96
  aln.sequences << seq.clone
86
97
  end
87
98
  aln.clone_columns! if @columns
99
+ # clone the tree
100
+ @tree = @tree.clone if @tree
88
101
  aln
89
102
  end
90
103
 
@@ -96,6 +109,19 @@ module Bio
96
109
  @tree = Tree::init(tree)
97
110
  @tree
98
111
  end
112
+
113
+ # Reduce an alignment, based on the new tree
114
+ def tree_reduce new_tree
115
+ names = new_tree.map { | node | node.name }.compact
116
+ # p names
117
+ nrows = []
118
+ names.each do | name |
119
+ nrows << find(name).clone
120
+ end
121
+ new_aln = Alignment.new(nrows)
122
+ new_aln.attach_tree(new_tree.clone)
123
+ new_aln
124
+ end
99
125
  end
100
126
  end
101
127
  end
@@ -0,0 +1,58 @@
1
+ module Bio
2
+ module BioAlignment
3
+
4
+ # Split an alignment based on its phylogeny
5
+ module TreeSplitter
6
+
7
+ # Split an alignment using a phylogeny tree. One half contains sequences
8
+ # that are relatively homologous, the other half contains the rest. This
9
+ # is described in the tree-split.feature in the features directory.
10
+ #
11
+ # The target_size parameter gives the size of the homologous sequence
12
+ # set. If target_size is nil, the set will be split in half.
13
+ #
14
+ # Returns two alignments with their matching trees attached
15
+ def split_on_distance target_size = nil
16
+ target_size = size/2+1 if not target_size
17
+
18
+ aln1 = clone
19
+ # Start from the root of the tree (FIXME: what if there is no root?)
20
+ prev_root = nil
21
+ new_root = aln1.tree.root
22
+ while new_root
23
+ # find the nearest child (shortest edge)
24
+ near_children = new_root.nearest_children
25
+ # We possibly have multiple matches, so we are going to split on the
26
+ # number of leafs, or we leave it like it is, if the split will be
27
+ # too far from the target
28
+ prev_root = new_root
29
+ new_root = near_children.first
30
+ near_children.each do |c|
31
+ next if c == new_root
32
+ # find the nearest match
33
+ if (c.leaves.size-target_size).abs < (new_root.leaves.size-target_size).abs
34
+ new_root = c
35
+ end
36
+ end
37
+ # Break out of the loop when we hit the target
38
+ break if new_root.leaves.size <= target_size
39
+ end
40
+ # Now see whether the last step actually was an improvement, otherwise
41
+ # we take one node up
42
+ # p [(prev_root.leaves.size-target_size).abs,(new_root.leaves.size-target_size).abs]
43
+ new_root = prev_root if (prev_root.leaves.size-target_size).abs < (new_root.leaves.size-target_size).abs
44
+ branch = aln1.tree.clone_subtree(new_root)
45
+ reduced_tree = aln1.tree.clone_tree_without_branch(new_root)
46
+ # p branch.map { |n| n.name }.compact
47
+ # p reduced_tree.map { |n| n.name }.compact
48
+
49
+ # Now reduce the alignments themselves to match the trees
50
+ aln1 = tree_reduce(reduced_tree)
51
+ aln2 = tree_reduce(branch)
52
+ return aln1,aln2
53
+ end
54
+
55
+ end
56
+ end
57
+ end
58
+
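The descent in `split_on_distance` can be followed on a small example. The sketch below is a simplified model of that loop, not the gem's code: `ToyNode`, `toy_leaf`, `pick_split_node` and `TOY_ROOT` are hypothetical names, edge lengths are stored on the child, and a leaf counts as its own single leaf.

```ruby
# Hypothetical toy node: a name, the edge length to its parent, and children.
ToyNode = Struct.new(:name, :edge, :children) do
  # Leaves at or below this node (in this toy model a leaf returns itself)
  def leaves
    return [self] if children.empty?
    children.flat_map(&:leaves)
  end

  # Children that share the shortest edge length
  def nearest_children
    min = children.map(&:edge).min
    children.select { |c| c.edge == min }
  end
end

def toy_leaf(name, edge)
  ToyNode.new(name, edge, [])
end

# Walk down from the root along the shortest edges. Stop once the subtree
# holds no more leaves than target_size, then step back up one node if the
# previous node was closer to the target.
def pick_split_node(root, target_size)
  prev = nil
  node = root
  until node.children.empty?
    prev = node
    # Among equally near children, prefer the one closest in size to the target
    node = node.nearest_children.min_by { |c| (c.leaves.size - target_size).abs }
    break if node.leaves.size <= target_size
  end
  return node if prev.nil?
  (prev.leaves.size - target_size).abs < (node.leaves.size - target_size).abs ? prev : node
end

TOY_ROOT = ToyNode.new(nil, 0.0, [
  ToyNode.new(nil, 0.2, [
    ToyNode.new(nil, 0.1, [toy_leaf("A", 0.2), toy_leaf("B", 0.3)]),
    toy_leaf("E", 0.4)
  ]),
  ToyNode.new(nil, 0.5, [toy_leaf("C", 0.1), toy_leaf("D", 0.1)])
])

# Default target as in the diff: size/2+1
split = pick_split_node(TOY_ROOT, TOY_ROOT.leaves.size / 2 + 1)
puts split.leaves.map(&:name).join(",")
```

With five leaves the default target is 3, so the walk stops at the clade holding A, B and E; the remaining leaves C and D would form the other alignment.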
@@ -37,27 +37,72 @@ module Bio
37
37
  # Here we add to BioRuby's Bio::Tree classes
38
38
  class Tree
39
39
  class Node
40
+ # Add tree information to this node, so it can be queried
40
41
  def inject_tree tree
41
42
  @tree = tree
43
+ self
42
44
  end
43
45
 
46
+ # Is this Node a leaf?
44
47
  def leaf?
45
48
  children.size == 0
46
49
  end
47
50
 
51
+ # Get the children of this Node
48
52
  def children
49
53
  @tree.children(self)
50
54
  end
51
55
 
56
+ def descendents
57
+ @tree.descendents(self)
58
+ end
59
+
60
+ # Get the parent of this Node
52
61
  def parent
53
62
  @tree.parent(self)
54
63
  end
64
+
65
+ # Get the direct sibling nodes (i.e. parent.children)
66
+ def siblings
67
+ parent.children - [self]
68
+ end
69
+
70
+ # Return the leaves of this node
71
+ def leaves
72
+ @tree.leaves(self)
73
+ end
74
+
75
+ # Find the nearest and dearest, i.e. the leaves attached to the parent
76
+ # node
77
+ def nearest
78
+ @tree.leaves(parent) - [self]
79
+ end
55
80
 
56
- # Get the distance to another node (FIXME: write test)
81
+ # Get the distance to another node
57
82
  def distance other
58
83
  @tree.distance(self,other)
59
84
  end
60
- end
85
+
86
+ # Get child node with the shortest edge - note that if there is more
87
+ # than one, the first will be picked
88
+ def nearest_child
89
+ c = nil
90
+ children.each do |n|
91
+ c=n if not c or distance(n)<distance(c)
92
+ end
93
+ c
94
+ end
95
+
96
+ # Get the child nodes with the shortest edge - returns an Array
97
+ def nearest_children
98
+ min_distance = distance(nearest_child)
99
+ cs = []
100
+ children.each do |n|
101
+ cs << n if distance(n) == min_distance
102
+ end
103
+ cs
104
+ end
105
+ end # End of injecting Node functionality
61
106
 
62
107
  def find name
63
108
  get_node_by_name(name)
@@ -65,12 +110,57 @@ module Bio
65
110
 
66
111
  # Walk the ordered tree leaves, calling into the block, and return an array
67
112
  def map
68
- res = []
69
- leaves.each do | leaf |
70
- item = yield leaf
71
- res << item
113
+ leaves.map { | leaf | yield leaf }
114
+ end
115
+
116
+ # Create a clone of the subtree starting at start_node (node objects are shared)
117
+ def clone_subtree start_node
118
+ new_tree = self.class.new
119
+ list = [start_node] + start_node.descendents
120
+ list.each do |x|
121
+ new_tree.add_node(x)
122
+ end
123
+ each_edge do |node1, node2, edge|
124
+ if new_tree.include?(node1) and new_tree.include?(node2)
125
+ new_tree.add_edge(node1, node2, edge)
126
+ end
127
+ end
128
+ new_tree
129
+ end
130
+
131
+ # Clone a tree without the branch starting at node
132
+ def clone_tree_without_branch node
133
+ new_tree = self.class.new
134
+ original = [root] + root.descendents
135
+ # p "Original",original
136
+ skip = [node] + node.descendents
137
+ # p "Skip",skip
138
+ # p "Retain",root.descendents - skip
139
+ nodes.each do |x|
140
+ if not skip.include?(x)
141
+ new_tree.add_node(x)
142
+ else
143
+ end
144
+ end
145
+ each_edge do |node1, node2, edge|
146
+ if new_tree.include?(node1) and new_tree.include?(node2)
147
+ new_tree.add_edge(node1, node2, edge)
148
+ end
149
+ end
150
+ new_tree
151
+ end
152
+
153
+ def clone
154
+ new_tree = self.class.new
155
+ nodes.each do |x|
156
+ new_tree.add_node(x)
157
+ end
158
+ self.each_edge do |node1, node2, edge|
159
+ if new_tree.include?(node1) and new_tree.include?(node2) then
160
+ new_tree.add_edge(node1, node2, edge)
161
+ end
72
162
  end
73
- res
163
+ new_tree
74
164
  end
75
165
 
76
166
  end
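Both `clone_subtree` and `clone_tree_without_branch` above follow the same pattern: choose the set of nodes to retain, then copy only the edges whose two endpoints are both retained. A minimal sketch of that pattern on a plain edge list, assuming hypothetical `toy_*` helpers and symbols as nodes rather than `Bio::Tree` objects:

```ruby
# Hypothetical toy tree: an edge list of [parent, child, length] triples.
TOY_EDGES = [
  [:root, :x, 0.2], [:root, :y, 0.5],
  [:x, :a, 0.1], [:x, :b, 0.3],
  [:y, :c, 0.1], [:y, :d, 0.1]
].freeze

# All nodes below start (children, grandchildren, and so on)
def toy_descendents(edges, start)
  kids = edges.select { |parent, _, _| parent == start }.map { |_, child, _| child }
  kids + kids.flat_map { |k| toy_descendents(edges, k) }
end

# Keep only the edges whose two endpoints are both in the retained node set
def toy_edges_within(edges, keep)
  edges.select { |n1, n2, _| keep.include?(n1) && keep.include?(n2) }
end

# Counterpart of clone_subtree: retain start and everything below it
def toy_subtree(edges, start)
  toy_edges_within(edges, [start] + toy_descendents(edges, start))
end

# Counterpart of clone_tree_without_branch: retain everything except the branch
def toy_without_branch(edges, node)
  skip = [node] + toy_descendents(edges, node)
  all_nodes = edges.flat_map { |n1, n2, _| [n1, n2] }.uniq
  toy_edges_within(edges, all_nodes - skip)
end

p toy_subtree(TOY_EDGES, :x)
p toy_without_branch(TOY_EDGES, :x)
```

Because the edge from `:root` to `:x` has one endpoint outside either retained set, it appears in neither result, which is what separates the branch from the rest of the tree.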
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: bio-alignment
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.6
4
+ version: 0.0.7
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-03-17 00:00:00.000000000Z
12
+ date: 2012-06-25 00:00:00.000000000Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: bio-logger
16
- requirement: &26202820 !ruby/object:Gem::Requirement
16
+ requirement: &83191660 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
21
21
  version: '0'
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *26202820
24
+ version_requirements: *83191660
25
25
  - !ruby/object:Gem::Dependency
26
26
  name: bio
27
- requirement: &26201340 !ruby/object:Gem::Requirement
27
+ requirement: &83191360 !ruby/object:Gem::Requirement
28
28
  none: false
29
29
  requirements:
30
30
  - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
32
32
  version: 1.4.2
33
33
  type: :runtime
34
34
  prerelease: false
35
- version_requirements: *26201340
35
+ version_requirements: *83191360
36
36
  - !ruby/object:Gem::Dependency
37
37
  name: rake
38
- requirement: &26199400 !ruby/object:Gem::Requirement
38
+ requirement: &83190960 !ruby/object:Gem::Requirement
39
39
  none: false
40
40
  requirements:
41
41
  - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
43
43
  version: '0'
44
44
  type: :development
45
45
  prerelease: false
46
- version_requirements: *26199400
46
+ version_requirements: *83190960
47
47
  - !ruby/object:Gem::Dependency
48
48
  name: bio-bigbio
49
- requirement: &26197880 !ruby/object:Gem::Requirement
49
+ requirement: &83190640 !ruby/object:Gem::Requirement
50
50
  none: false
51
51
  requirements:
52
52
  - - ! '>'
@@ -54,10 +54,10 @@ dependencies:
54
54
  version: 0.1.3
55
55
  type: :development
56
56
  prerelease: false
57
- version_requirements: *26197880
57
+ version_requirements: *83190640
58
58
  - !ruby/object:Gem::Dependency
59
59
  name: cucumber
60
- requirement: &26196760 !ruby/object:Gem::Requirement
60
+ requirement: &83190190 !ruby/object:Gem::Requirement
61
61
  none: false
62
62
  requirements:
63
63
  - - ! '>='
@@ -65,21 +65,21 @@ dependencies:
65
65
  version: '0'
66
66
  type: :development
67
67
  prerelease: false
68
- version_requirements: *26196760
68
+ version_requirements: *83190190
69
69
  - !ruby/object:Gem::Dependency
70
70
  name: rspec
71
- requirement: &26195120 !ruby/object:Gem::Requirement
71
+ requirement: &83095350 !ruby/object:Gem::Requirement
72
72
  none: false
73
73
  requirements:
74
74
  - - ~>
75
75
  - !ruby/object:Gem::Version
76
- version: 2.3.0
76
+ version: 2.10.0
77
77
  type: :development
78
78
  prerelease: false
79
- version_requirements: *26195120
79
+ version_requirements: *83095350
80
80
  - !ruby/object:Gem::Dependency
81
81
  name: bundler
82
- requirement: &26194620 !ruby/object:Gem::Requirement
82
+ requirement: &83094950 !ruby/object:Gem::Requirement
83
83
  none: false
84
84
  requirements:
85
85
  - - ! '>='
@@ -87,10 +87,10 @@ dependencies:
87
87
  version: 1.0.21
88
88
  type: :development
89
89
  prerelease: false
90
- version_requirements: *26194620
90
+ version_requirements: *83094950
91
91
  - !ruby/object:Gem::Dependency
92
92
  name: jeweler
93
- requirement: &26193920 !ruby/object:Gem::Requirement
93
+ requirement: &83094400 !ruby/object:Gem::Requirement
94
94
  none: false
95
95
  requirements:
96
96
  - - ! '>='
@@ -98,7 +98,7 @@ dependencies:
98
98
  version: '0'
99
99
  type: :development
100
100
  prerelease: false
101
- version_requirements: *26193920
101
+ version_requirements: *83094400
102
102
  description: Alignment handler for multiple sequence alignments (MSA)
103
103
  email: pjotr.public01@thebird.nl
104
104
  executables:
@@ -138,10 +138,12 @@ files:
138
138
  - features/edit/mask_serial_mutations.feature
139
139
  - features/pal2nal-feature.rb
140
140
  - features/pal2nal.feature
141
+ - features/phylogeny/split-tree-feature.rb
142
+ - features/phylogeny/split-tree.feature
143
+ - features/phylogeny/tree-feature.rb
144
+ - features/phylogeny/tree.feature
141
145
  - features/rows-feature.rb
142
146
  - features/rows.feature
143
- - features/tree-feature.rb
144
- - features/tree.feature
145
147
  - lib/bio-alignment.rb
146
148
  - lib/bio-alignment/alignment.rb
147
149
  - lib/bio-alignment/bioruby.rb
@@ -154,6 +156,7 @@ files:
154
156
  - lib/bio-alignment/edit/edit_rows.rb
155
157
  - lib/bio-alignment/edit/mask_islands.rb
156
158
  - lib/bio-alignment/edit/mask_serial_mutations.rb
159
+ - lib/bio-alignment/edit/tree_splitter.rb
157
160
  - lib/bio-alignment/elements.rb
158
161
  - lib/bio-alignment/pal2nal.rb
159
162
  - lib/bio-alignment/rows.rb
@@ -183,7 +186,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
183
186
  version: '0'
184
187
  segments:
185
188
  - 0
186
- hash: 1800672102634743595
189
+ hash: 900281341
187
190
  required_rubygems_version: !ruby/object:Gem::Requirement
188
191
  none: false
189
192
  requirements:
@@ -192,7 +195,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
192
195
  version: '0'
193
196
  requirements: []
194
197
  rubyforge_project:
195
- rubygems_version: 1.8.10
198
+ rubygems_version: 1.8.6
196
199
  signing_key:
197
200
  specification_version: 3
198
201
  summary: Support for multiple sequence alignments (MSA)