RubyGems - bio-alignment - Versions diffs - 0.0.6 → 0.0.7 - Mend

bio-alignment 0.0.6 → 0.0.7


Files changed (12)

data/Gemfile +1 -1
data/README.md +65 -16
data/VERSION +1 -1
data/doc/bio-alignment-design.md +83 -72
data/features/phylogeny/split-tree-feature.rb +31 -0
data/features/phylogeny/split-tree.feature +66 -0
data/features/{tree-feature.rb → phylogeny/tree-feature.rb} +32 -3
data/features/{tree.feature → phylogeny/tree.feature} +8 -1
data/lib/bio-alignment/alignment.rb +27 -1
data/lib/bio-alignment/edit/tree_splitter.rb +58 -0
data/lib/bio-alignment/tree.rb +97 -7
metadata +26 -23

data/Gemfile CHANGED

@@ -8,7 +8,7 @@ group :development do
   gem "rake"
   gem "bio-bigbio", "> 0.1.3"         # for reading FASTA files in tests
   gem "cucumber", ">= 0"
-  gem "rspec", "~> 2.3.0"
+  gem "rspec", "~> 2.10.0"
   gem "bundler", ">= 1.0.21"
   gem "jeweler"
 end

data/README.md CHANGED

@@ -1,22 +1,39 @@
 # bio-alignment
-Alignment handler for multiple sequence alignments (MSA).
+Matrix style alignment handler for multiple sequence alignments (MSA).
-This alignment handler makes no assumptions about the underlying
-sequence object.  Support for any nucleotide, amino acid and codon
-sequences that are lists. Any list with payload can be used (e.g.
-nucleotide quality score, codon annotation). The only requirement is
-that the list is iterable and can be indexed.
+[![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-alignment.png)](http://travis-ci.org/pjotrp/bioruby-alignment)
-This work is based on Pjotr's experience designing the BioScala
+This alignment handler makes no assumptions about the underlying
+sequence object. It supports any nucleotide, amino acid and codon
+sequences that are lists. Any list with payload or state can be used
+(e.g. nucleotide quality score, codon annotation). The only
+requirement is that the list is Enumerable and can be indexed, i.e.
+it includes Ruby's Enumerable and has the [] method.
+Features are:
+* Matrix notation for alignment object
+* Functional style alignment access and editing
+* Support for BioRuby Sequences
+* Support for BioRuby trees and node distance calculation
+* bio-alignment interacts well with BioRuby structures,
+  including sequence objects and alignment/tree parsers
+When possible, BioRuby functionality is merged in. For example, by
+supporting Bio::Sequence objects, standard BioRuby alignment
+functions, sequence readers and writers can be used. By supporting the
+BioRuby Tree object, standard BioRuby tree parsers and writers can be
+used. bio-alignment takes alignment handling with phylogenetic tree
+support to a new level.
+bio-alignment is based on Pjotr's experience designing the BioScala
 Alignment handler and BioRuby's PAML support. Read the
 Bio::BioAlignment
 [design
 document](https://github.com/pjotrp/bioruby-alignment/blob/master/doc/bio-alignment-design.md)
 for Ruby.
-Note: this software is under active development.
 ## Developers
 ### Codon alignment example
@@ -40,6 +57,8 @@ aligmment (note codon gaps are represented by '---')
   aln.rows.each do | row |
     fasta.write(row.id, row.to_aa.to_s)
   end
+  # get first codon element of the fourth sequence
+  p aln[3][0]
 ```
 Now add some state - you can define your own row state
@@ -151,8 +170,28 @@ resulting in the codon alignment.
 ### Phylogeny
-BioAlignment has support for attaching a phylogentic tree to an
-alignment, and traversing the tree.
+BioAlignment has support for attaching a phylogenetic tree to an
+alignment, and traversing the tree using an intuitive interface
+```ruby
+  sole_tree = Bio::Newick.new(string).tree  # use BioRuby's tree parser
+  tree = aln.attach_tree(sole_tree)         # attach the tree
+  # now do stuff with the tree, which has improved bio-align support
+  root = tree.root
+  children = root.children
+  children.map { |n| n.name }.sort.should == ["","seq7"]
+  seq7 = children.last
+  seq4 = tree.find("seq4")
+  seq4.distance(seq7).should == 19.387756600000003
+  print tree.output_newick                  # BioRuby Newick output
+```
+There are methods for finding sibling nodes, splitting the alignment
+based on the tree, and locating sequences on the same branch. More
+examples can be found in the tests and features.  The underlying
+implementation of Bio::Tree is that of BioRuby. We have added an OOP
+layer for traversing the tree by injecting methods into the BioRuby
+object itself.
 ### Alignment marking/masking/editing
@@ -249,18 +288,28 @@ where aln2 is a copy of aln with bridging columns deleted.
 ### See also
-The API documentation is online. For more code examples see
+For more on the design of bio-alignment, read the
+Bio::BioAlignment
+[design
+document](https://github.com/pjotrp/bioruby-alignment/blob/master/doc/bio-alignment-design.md).
+The API documentation can be found
+[online](http://rubygems.org/gems/bio-alignment). For examples see the files in
 [./spec/*.rb](https://github.com/pjotrp/bioruby-alignment/tree/master/spec) and
 [./features/*](https://github.com/pjotrp/bioruby-alignment/tree/master/features).
 ## Cite
-If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
+If you use this software, please cite one of
+* [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
+* [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
+## Biogems.info
+This Biogem is published at [#bio-alignment](http://biogems.info/index.html)
 ## Copyright
 Copyright (c) 2012 Pjotr Prins. See LICENSE.txt for further details.
-## Biogems.info
-This exciting Ruby Biogem is published on http://biogems.info/
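
Note on the README change above: the new text requires a sequence to be Enumerable and indexable. A minimal sketch of a list that would satisfy that contract follows. `QualitySequence` is a hypothetical class made up for illustration; it is not part of bio-alignment, and the library may expect more of its elements (for example `gap?`) than is shown here.

```ruby
# Hypothetical example class, not part of bio-alignment.
class QualitySequence
  include Enumerable

  attr_reader :id

  # bases and scores are parallel arrays; each element pairs a base
  # with its quality score (the payload)
  def initialize(id, bases, scores)
    @id = id
    @elements = bases.zip(scores)
  end

  # Enumerable is built on top of each
  def each(&block)
    @elements.each(&block)
  end

  # Index access, the second half of the requirement
  def [](index)
    @elements[index]
  end
end

seq = QualitySequence.new("seq1", %w[A G T A], [5, 9, nil, 1])
first = seq[0]                            # base and score at position 0
bases = seq.map { |base, _| base }.join   # Enumerable#map comes for free
```

Because `each` and `[]` are the only methods defined by hand, any richer element type can be swapped in without changing the container.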

data/VERSION CHANGED

@@ -1 +1 @@
-0.0.6
+0.0.7

data/doc/bio-alignment-design.md CHANGED

@@ -1,58 +1,59 @@
 # Bio-alignment design
-''A well designed library should be simple and elegant to use...''
+''A well designed library should be *simple* and elegant to use...''
 ## Introduction
-Biological multi-sequence alignments (MSA) are normally matrices of
-nucleotide or amino acid sequences, with gaps. Despite this rather
-simple premise, most software fails make it simple to access these
-structures. Also most implementations fail to support a 'pay load' of
-items in the matrix (mostly because underlying sequences are String
-based). This means a developer has to track information in multiple
-places, for example a base pair quality score. This makes code complex
-and therefore error prone. With bio-alignment elements of the matrix
-can carry information. So, when the alignment gets edited,
-the element gets moved or deleted, and the information moves or
-deletes along. For example,
-say we have a nucleotide sequence with pay load
+Biological multi-sequence alignments (MSA) are matrices of nucleotide or amino
+acid sequences with gaps. Despite this rather simple premise, most software
+fails to make it simple to access these structures. Also, most implementations
+fail to support a 'pay load' of items in the matrix (because the underlying
+sequences are String based). The result is that a developer has to track
+information in multiple places. For example, tracking a base pair quality score
+requires a second matrix of information. This makes code complex and therefore
+error prone. With the bio-alignment library, elements of the matrix can carry
+information, so-called 'state'. When the alignment gets edited, i.e. an
+element gets moved or deleted, the information gets moved or deleted along with
+it. For example, say we have a nucleotide sequence with quality pay load
     A   G   T    A
     |   |   |    |
     5   9   *    1
-most library implementations will have two strings "AGTA" and "59*1".
-Removing the third nucleotide would mean removing it twice, into first
-"AGA", and second "591". With bio-alignment this is one action because we
-have one object for each element that contains both values, e.g. the
-payload of 'T' is '*'. Moving 'T' automatically moves '*'.
+most library implementations will have two strings "AGTA" and "59*1".  Removing
+the third nucleotide would mean removing it twice, into first "AGA", and second
+"591". With bio-alignment this is one action because we have one object for
+each element that contains both values, e.g. the payload of 'T' is '*'. Moving
+'T' automatically moves '*'. Simple really.
-In addition the bio-alignment library deals with codons and codon translation.
-Rather than track multiple matrices, the codon is viewed as an element,
-and the translated codon as the pay load. Again, when an alignment gets
-reordered the code only has to do it in one place.
+In addition to carrying state, the bio-alignment library deals with codons and
+codon translation.  Rather than track multiple matrices, the codon is viewed as
+an element, and the translated codon as the pay load. Again, when an alignment
+gets reordered the code only has to do it in one place.
-Likewise, an alignment column can have a pay load (e.g. quality score
-in a pile up), and an alignment row can have a pay load (e.g. the
-sequence name). The concept of pay load is handled through generic
-matrix element, column, or row 'attributes'.
+Likewise, an alignment column can have a pay load (e.g. quality score in a pile
+up), and an alignment row can have a pay load (e.g. the sequence name). The
+concept of pay load, normally referred to as 'state', is handled through
+generic matrix element, column, or row 'attributes'.
-Many of these ideas came from my work on the [BioScala
+Many of these ideas came from my earlier work on the [BioScala
 project](https://github.com/pjotrp/bioscala/blob/master/doc/design.txt),
 The BioScala library has the additional advantage of having type
-safety throughout.
+safety throughout, but lacks many of the features I have added to the
+Ruby version.
 ## Row or Sequence
-Any sequence for an alignment is simply a list of objects. The
-requirement is that the list should be enumerable and can be indexed. This means
-it has to include Enumerable and provide 'each' and '[]' methods. CodonSequence
-is a good example.
+Any sequence for an alignment is simply a list of objects. The requirement for
+any such list is that it should be enumerable and can be indexed. In Ruby
+terms, the list has to include Enumerable and provide 'each' and '[]' methods.
+The CodonSequence list, included in this library, is a good example.
 In addition, elements in the list should respond to certain properties (see
 below).
 ```ruby
+    # create a list of codons
     codons = CodonSequence.new(rec.id,rec.seq)
     print codons.id
     # get first codon
@@ -70,7 +71,8 @@ acid with
     print codons.seq[0].to_aa
 ```
-in fact, because Sequence is index-able we can write directly
+in fact, because bio-alignment demands Sequence is index-able we can write
+directly
 ```ruby
     print codons[0].to_aa        # 'M'
@@ -85,14 +87,17 @@ do a fancy
   aaseq = codons.map { | codon | codon.to_aa }.join("")
 ```
+This is getting interesting... Codons, which are nucleotide triplets, actually
+act as basic lists, and can be converted to amino acids.
 ## Element
 Elements in the list should respond to a gap? method, for an alignment
 gap, and the undefined? method for a position that is either an
 element or a gap. Also it should respond to the to_s method.
-An element can contain any pay load.  If a list of attributes exists
-in the sequence object, it can be used.
+It is important to note that an element can contain *any* pay load, or state. Ruby
+objects are 'open'. You can even add state at runtime.
 ## Elements and CodonSequence
@@ -102,24 +107,25 @@ carry state.
 The third list type we normally use in an Alignment, next to Sequence and
 Elements, is the CodonSequence (remember, you can easily roll your own Sequence
-type).
+type, just make them Enumerable and indexed).
 ## Column
-The column list tracks the columns of the alignment. The requirement
-is that it should be iterable and can be indexed. The Column contains
-no elements, but may point to a list when the alignment is transposed.
+The column list tracks the columns of the alignment. Again, the requirement is
+that the list should be Enumerable and indexed. By default, the Column contains
+no elements; it holds them only when the alignment is transposed. Matrix
+elements are found by indexing on the sequences (rows).
-One of the 'features' of this library is that the Column access logic is
+One of the features of this library is that the Column access logic is
 split out into a separate module, which accesses the data in a lazy fashion.
 Also column state is stored as an 'any object'. I.e. a column can contain
-any state.
+any type of state.
-## Matrix or MSA
+## Matrix (MSA)
-The Matrix consists of a Column list, multiple Sequences, in turn
-consisting of Elements. Accessing the matrix is by Sequence, followed
-by Element.
+The matrix (multi sequence alignment or MSA) consists of a Column list, and
+multiple Sequences, in turn consisting of Elements. Accessing the matrix is by
+Sequence, followed by Element, leading to a matrix style notation
 ```ruby
   require 'bio-alignment'
@@ -130,31 +136,34 @@ by Element.
   fasta.each do | rec |
     aln.sequences << rec
   end
+  # get first codon element of the fourth sequence
+  codon = aln[3][0]
 ```
 note that MSA understands rec, as long as rec.id and rec.seq exist, and strings
-(req.seq is a String). Alternatively we can convert to a Codon sequence by
+(rec.seq is a String). Alternatively we can first convert to a Codon sequence by
 ```ruby
   fasta.each do | rec |
     aln.sequences << CodonSequence.new(rec.id,rec.seq)
   end
+  # get first codon element of the fourth sequence
+  codon = aln[3][0]
 ```
 The Matrix can be accessed in transposed fashion, but accessing the normal
-matrix and transposed matrix at the same time is not supported.  Matrix is not
-designed to be transaction safe - though you can copy the Matrix any time.
+matrix and transposed matrix at the same time is not supported. Note that
+Matrix editing is not designed to be transaction safe - better to copy the
+Matrix when editing.
 ## Adding functionality
-To ascertain that the basic BioAlignment implementation does not get
-polluted, extra functionality is added by using modules. These
-modules can be added at run time(!) One advantage is that there is
-less name space pollution, the other is that different implementations
-can be plugged in - using the same interface. For example, here we are
-going to use an alignment editor named DelBridges, which has a method
-named del_bridges:
+To ensure that the basic BioAlignment implementation does not get polluted
+with heaps of methods, extra functionality is added by using modules. These
+modules can be added at run time(!) One advantage is that there is less name
+space pollution, the other is that different implementations can be plugged in
+- using the same interface. For example, here we are going to use an alignment
+editor named DelBridges, which has a method named del_bridges:
 ```ruby
   require 'bio-alignment/edit/del_bridges'
@@ -164,18 +173,19 @@ named del_bridges:
   aln2 = aln.del_bridges
 ```
-in other words, the functionality in DelBridges gets attached to the
-aln instance at run time, without affecting any other instantiated
-object(!) Also, when not requiring 'bio-alignment/edit/del_bridges',
-the functionality is never visible, and never added to the
-environment. This type of runtime plugin is something you can only do
-in a dynamic language.
+in other words, the functionality in DelBridges gets attached to the aln
+instance at run time, without affecting any other instantiated object(!) Also,
+when not requiring 'bio-alignment/edit/del_bridges', the functionality is never
+visible, and never added to the runtime environment. This type of runtime
+plugin is something you can only do in a dynamic language, such as Ruby. Ruby
+makes it rather convenient.
-Likewise you may have your own sequence objects in an alignment. To register
-deletion state, simply extend the sequence with the RowState module:
+You may have created your own style of sequence objects in an alignment. To
+register a prefab deletion state, extend the sequence with the State module and
+assign it a RowState:
 ```ruby
   require 'bio-alignment/state'
+  # Use the standard BioRuby sequence object
   bioseq = Bio::Sequence::NA.new("AGCT")
   bioseq.extend(State)          # add state
   bioseq.state = RowState.new   # set state
@@ -183,10 +193,10 @@ deletion state, simply extend the sequence with the RowState module:
   > false
 ```
-That is impressive - the BioRuby Sequence has no deletion state facility. We
-just added that, and it can even be used in BioAlignment editors which require
-such a state object. See also the scenario "Give deletion state to a
-Bio::Sequence object" in the bioruby.feature.
+That is impressive - the BioRuby Sequence has no deletion state facility by
+itself. We just added that, and it can even be used in BioAlignment editors
+which require such a state object. See also the scenario "Give deletion state
+to a Bio::Sequence object" in the bioruby.feature.
 Note: if we wanted only to allow one plugin per instance at a time, we can
 create a generic interface with a method of the same name for every
@@ -195,9 +205,10 @@ multiple plugins (by default).
 ## Adding Phylogenetic support
-MSAs often come with phylogenetic trees. Not to add this functionality by default,
-we extend BioAlignment with BioAlignment::AlignmentTree when a tree is plugged in
-with the add_tree method.
+An MSA often comes with a phylogenetic tree. Similar to adding the deletion
+state module at runtime, we now extend BioAlignment with
+BioAlignment::AlignmentTree. A tree is plugged in with the attach_tree method.
+See the README and features directory for more examples.
 ## Methods returning alignments and concurrency
@@ -216,7 +227,7 @@ in functional style, such as
 ```
 where aln2 is a copy (of aln) with columns removed that were marked for
-deletion.  In other words, we apply ''Functional programming in Ruby.'' If
+deletion.  In other words, this is applied ''Functional programming in Ruby.'' If
 functions can be easily 'piped', and code can be easily copy and pasted into
 different algorithms, it is likely that the module is written in a functional
 style.
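
Note on the design document change above: the "AGTA" / "59*1" example can be sketched in plain Ruby. `QualityElement` is a hypothetical element type invented for this note; bio-alignment's own element classes are not shown in this diff and may look different. The point is only that one edit moves the base and its payload together.

```ruby
# Hypothetical element type: one object holds both the base and its payload.
QualityElement = Struct.new(:base, :quality) do
  def to_s
    base
  end
end

# Build one row from the two parallel strings used in the design document
row = "AGTA".chars.zip("59*1".chars).map { |b, q| QualityElement.new(b, q) }

# Functional-style edit: copy the row, then drop the third element once
edited = row.dup
edited.delete_at(2)

edited_bases   = edited.map(&:base).join      # bases after the edit
edited_quality = edited.map(&:quality).join   # payload followed along
```

With two separate strings the same edit would have to be applied twice, once per string, which is the error-prone bookkeeping the design text argues against.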

data/features/phylogeny/split-tree-feature.rb ADDED

@@ -0,0 +1,31 @@
+require 'bio-alignment/edit/tree_splitter.rb'
+When /^I split the tree$/ do |string|
+  tree = @aln.attach_tree(@tree)
+  @aln.extend TreeSplitter
+  (aln1,aln2) = @aln.split_on_distance
+  aln2.size.should == 5
+  @split1 = aln1
+  @split2 = aln2
+end
+Then /^I should have found sub\-trees "([^"]*)" and "([^"]*)"$/ do |arg1, arg2|
+  @split2.ids.sort.join(",").should == arg2
+  @split1.ids.sort.join(",").should == arg1
+end
+When /^I split the tree with a target of (\d+)$/ do |arg1|
+  tree = @aln.attach_tree(@tree)
+  @aln.extend TreeSplitter
+  @split1,@split2 = @aln.split_on_distance(arg1.to_i)
+end
+Then /^I should have found low\-homology sub\-tree "([^"]*)"$/ do |arg1|
+  @split1.ids.sort.join(",").should == arg1
+end
+Then /^I should have found high\-homology sub\-tree "([^"]*)"$/ do |arg1|
+  @split2.ids.sort.join(",").should == arg1
+end

data/features/phylogeny/split-tree.feature ADDED

@@ -0,0 +1,66 @@
+@split
+Feature: Splitting alignments into equal sized branches using phylogenetic tree info
+  Sometimes we want to split a large alignment into sub-sets.  When an
+  alignment is accompanied by a phylogenetic tree, we can greedily split the
+  tree. With a rooted tree, we start from the root, and walk the tree, taking
+  the shortest edge at every node (a tie may favour splitting). If the tree can
+  be split, so that both sides are similar sized, the job is done (if you want
+  more splits, just repeat the exercise). Essentially one subset shows
+  relatively high homology, the other relatively low homology. This is a crude
+  method, but has the advantage of being quick to calculate and reproducible.
+  If there is no root, we start from the point next to the longest edge.
+  We add one 'target_size' parameter to allow for leaving more sequences in the
+  high homology subset. 'target_size' sets the allowed size of the
+  high-homology alignment.  For example, setting it to 10 in a 15 sequence
+  alignment, will stop the splitting at 5 sequences, leaving (approx.) 10
+  sequences in the high homology group. Likewise, setting it to 5 will continue
+  splitting until that number is reached.
+  In the example below the tree will be split into a branch with similar sequences,
+  and a branch with sequences that are somewhat removed.
+  Scenario: Split a tree
+    Given I have a multiple sequence alignment (MSA)
+      """
+      seq1  ----SNSFSRPTIIFSGCSTACSGK--SELVCGFRSFMLSDV
+      seq2  SSIISNSFSRPTIIFSGCSTACSGK--SEQVCGFR---LSDV
+      seq3  SSIISNSFSRPTIIFSGCSTACSGKLTSEQVCGFR---LSDV
+      seq4  ----PKLFSRPTIIFSGCSTACSGK--SEPVCGFRSFMLSDV
+      seq5  ----------PTIIFSGCSKACSGKGLSELVCGFRSFMLSDV
+      seq6  ----------PTIIFSGCSKACSGK-----FRSFRSFMLSAV
+      seq7  ----------PTIIFSGCSKACSGK-----VCGIFHAVRSFM
+      seq8  ----------PTIIFSGCSKACSGK--SELVCGFRSFMLSAV
+      """
+    And I have a phylogenetic tree in Newick format
+      """
+      ((seq6:5.3571434,(seq4:4.04762,((seq8:1.1904755,seq5:1.1904755):1.7857151,((seq3:0.0,seq2:0.0):1.1904755,seq1:1.1904755):1.7857151):1.0714293):1.3095236):4.336735,seq7:9.693878);
+      """
+    When I split the tree
+      """
+      ,--9.69----------------------------------------- seq7  ----------PTIIFSGCSKACSGK-----VCGIFHAVRSFM
+      |                                   ,--1.19----- seq1  ----SNSFSRPTIIFSGCSTACSGK--SELVCGFRSFMLSDV
+      |                          ,--1.79--|        ,-- seq2  SSIISNSFSRPTIIFSGCSTACSGK--SEQVCGFR---LSDV
+      |                 ,--1.07--|        `--1.19--+-- seq3  SSIISNSFSRPTIIFSGCSTACSGKLTSEQVCGFR---LSDV
+      |                 |        `--1.79--+--1.19----- seq5  ----------PTIIFSGCSKACSGKGLSELVCGFRSFMLSDV
+      |        ,--1.31--|                 `--1.19----- seq8  ----------PTIIFSGCSKACSGK--SELVCGFRSFMLSAV
+      `--4.34--|        `--4.05----------------------- seq4  ----PKLFSRPTIIFSGCSTACSGK--SEPVCGFRSFMLSDV
+               `--5.36-------------------------------- seq6  ----------PTIIFSGCSKACSGK-----FRSFRSFMLSAV
+      """
+    Then I should have found sub-trees "seq4,seq6,seq7" and "seq1,seq2,seq3,seq5,seq8"
+    When I split the tree with a target of 2
+    Then I should have found high-homology sub-tree "seq5,seq8"
+    When I split the tree with a target of 3
+    Then I should have found high-homology sub-tree "seq1,seq2,seq3"
+    When I split the tree with a target of 4
+    Then I should have found high-homology sub-tree "seq1,seq2,seq3"
+    When I split the tree with a target of 5
+    Then I should have found high-homology sub-tree "seq1,seq2,seq3,seq5,seq8"
+    When I split the tree with a target of 6
+    Then I should have found high-homology sub-tree "seq1,seq2,seq3,seq4,seq5,seq8"
+    When I split the tree with a target of 7
+    Then I should have found low-homology sub-tree "seq7"
+    When I split the tree with a target of 6
+    Then I should have found low-homology sub-tree "seq6,seq7"

data/features/{tree-feature.rb → phylogeny/tree-feature.rb} RENAMED

@@ -27,24 +27,29 @@ Then /^I should be able to traverse the tree$/ do
   root = @aln.root # get the root of the tree
   root.leaf?.should == false
   children = root.children
+  # root has one direct leaf
   children.map { |n| n.name }.sort.should == ["","seq7"]
   seq7 = children.last
   seq7.name.should == 'seq7'
   seq7.leaf?.should == true
   seq7.parent.should == root
+  # find leaf seq4
   seq4 = tree.find("seq4")
   seq4.leaf?.should == true
-  seq4.distance(seq7).should == 19.387756600000003  # that is nice!
+  # total distance to seq7 9.69+4.34+1.31+4.05 ~ 19.38
+  seq4.distance(seq7).should == 19.387756600000003  # BioRuby does this!
 end
 Then /^fetch elements from the MSA from each end node in the tree$/ do
   # walk the tree
   tree = @aln.attach_tree(@tree)
   ids = []
+  # Walk the ordered tree and fetch the sequence from the alignment
   column20 = tree.map { | leaf |
     ids << leaf.name
+    # we have the ID, now find the alignment
     seq = @aln.find(leaf.name)
-    # p seq
+    # Return the 20th element (index 19), just for show
     seq[19]
   }
   ids.should == ["seq6", "seq4", "seq8", "seq5", "seq3", "seq2", "seq1", "seq7"]
@@ -52,11 +57,35 @@ Then /^fetch elements from the MSA from each end node in the tree$/ do
 end
 Then /^calculate the phylogenetic distance between each element$/ do
-  pending # express the regexp above with the code you wish you had
+  # we did this earlier with
+  tree = @aln.attach_tree(@tree)
+  seq7 = tree.find("seq7")
+  seq4 = tree.find("seq4")
+  # total distance to seq7 9.69+4.34+1.31+4.05 ~ 19.38
+  seq4.distance(seq7).should == 19.387756600000003  # BioRuby does this!
+end
+Then /^find that the nearest sequence to "([^"]*)" is "([^"]*)"$/ do |arg1, arg2|
+  tree = @aln.attach_tree(@tree)
+  seq = tree.find(arg1)
+  seq.nearest.map{|n|n.to_s}.sort.join(',').should == arg2
 end
+Then /^find that "([^"]*)" is on the same branch as "([^"]*)"$/ do |arg1, arg2|
+  # really the same as the above
+  tree = @aln.attach_tree(@tree)
+  seq = tree.find(arg1)
+  seq.nearest.map{|n|n.to_s}.sort.join(',').should == arg2
+end
 Then /^draw the MSA with the tree$/ do | string |
   # textual drawing, like tabtree, or http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/149701
+  # or BioPythons http://biopython.org/DIST/docs/api/Bio.Phylo._utils-pysrc.html#draw_ascii
+  # hg clone https://bitbucket.org/keesey/namesonnodes-sa
+  #
+  # http://cegg.unige.ch/newick_utils
+  # http://code.google.com/p/a3lbmonkeybrain-as3/source/browse/trunk/src/a3lbmonkeybrain/calculia/collections/graphs/exporters/TextCladogramExporter.as?spec=svn26&r=26
   print string
   pending # express the regexp above with the code you wish you had
 end
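
Note on the distance comment above (9.69+4.34+1.31+4.05): the figure is the sum of the edge lengths on the path from seq4 up to the root and down to seq7. The sketch below reproduces that sum with a plain parent table. It does not use BioRuby; the internal node names "a" and "b" and the helper names are made up for illustration, and only the edges on this one path are listed.

```ruby
# child => [parent, edge length]; edge lengths are taken from the Newick
# string in the feature file, internal node names are hypothetical
PARENT = {
  "seq7" => ["root", 9.693878],
  "a"    => ["root", 4.336735],
  "b"    => ["a", 1.3095236],
  "seq4" => ["b", 4.04762]
}.freeze

# Cumulative distance from a node to itself and to each of its ancestors
def ancestor_distances(node)
  dist = { node => 0.0 }
  total = 0.0
  while (entry = PARENT[node])
    parent, length = entry
    total += length
    dist[parent] = total
    node = parent
  end
  dist
end

# Path length between two nodes through their first shared ancestor
def path_distance(x, y)
  dx = ancestor_distances(x)
  dy = ancestor_distances(y)
  common = dx.keys.find { |n| dy.key?(n) }
  dx[common] + dy[common]
end

d = path_distance("seq4", "seq7")
puts format("%.2f", d)
```

The exact floating point digits depend on the order of the additions, so a comparison with a small tolerance is safer than the `==` used in the step definition.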

data/features/{tree.feature → phylogeny/tree.feature} RENAMED

@@ -1,6 +1,8 @@
 @tree
 Feature: Tree support for alignments
-  Alignments are often accompanied by phylogenetic trees.
+  Alignments are often accompanied by phylogenetic trees. When we
+  have an alignment with its tree, we want to traverse the tree
+  and calculate distances.
   Scenario: Get ordered elements from a tree
     Given I have a multiple sequence alignment (MSA)
@@ -21,6 +23,11 @@ Feature: Tree support for alignments
     Then I should be able to traverse the tree
     And fetch elements from the MSA from each end node in the tree
     And calculate the phylogenetic distance between each element
+    And find that the nearest sequence to "seq2" is "seq3"
+    And find that the nearest sequence to "seq5" is "seq8"
+    And find that the nearest sequence to "seq1" is "seq2,seq3"
+    And find that "seq1" is on the same branch as "seq2,seq3"
+    And find that "seq4" is on the same branch as "seq1,seq2,seq3,seq5,seq8"
     And draw the MSA with the tree
       """
       ,--9.69----------------------------------------- seq7  ----------PTIIFSGCSKACSGK-----VCGIFHAVRSFM

data/lib/bio-alignment/alignment.rb CHANGED

@@ -15,6 +15,7 @@ module Bio
       include Columns
       attr_accessor :sequences
+      attr_reader :tree
       # Create alignment. seqs can be a list of sequences. If these
       # are String types, they get converted to the library Sequence
@@ -41,6 +42,16 @@ module Bio
       alias rows sequences
+      # return an array of sequence ids
+      def ids
+        rows.map { |r| r.id }
+      end
+      def size
+        rows.size
+      end
+      # Return a sequence by index
       def [] index
         rows[index]
       end
@@ -59,7 +70,7 @@ module Bio
         each do | seq |
           return seq if seq.id == name
         end
-        raise "ERROR: Sequence not found by its name #{name}"
+        raise "ERROR: Sequence not found by its name, looking for <#{name}>"
       end
       # clopy alignment and allow updating elements
@@ -85,6 +96,8 @@ module Bio
           aln.sequences << seq.clone
         end
         aln.clone_columns! if @columns
+        # clone the tree
+        @tree = @tree.clone if @tree
         aln
       end
@@ -96,6 +109,19 @@ module Bio
         @tree = Tree::init(tree)
         @tree
       end
+      # Reduce an alignment, based on the new tree
+      def tree_reduce new_tree
+        names = new_tree.map { | node | node.name }.compact
+        # p names
+        nrows = []
+        names.each do | name |
+          nrows << find(name).clone
+        end
+        new_aln = Alignment.new(nrows)
+        new_aln.attach_tree(new_tree.clone)
+        new_aln
+      end
     end
   end
 end

data/lib/bio-alignment/edit/tree_splitter.rb ADDED

@@ -0,0 +1,58 @@
+module Bio
+  module BioAlignment
+    # Split an alignment based on its phylogeny
+    module TreeSplitter
+      # Split an alignment using a phylogeny tree.  One half contains sequences
+      # that are relatively homologous, the other half contains the rest. This
+      # is described in the split-tree.feature in the features directory.
+      #
+      # The target_size parameter gives the size of the homologous sequence
+      # set. If target_size is nil, the set will be split roughly in half.
+      #
+      # Returns two alignments with their matching trees attached
+      def split_on_distance target_size = nil
+        target_size = size/2+1 if not target_size
+        aln1 = clone
+        # Start from the root of the tree (FIXME: what if there is no root?)
+        prev_root = nil
+        new_root = aln1.tree.root
+        while new_root
+          # find the nearest child (shortest edge)
+          near_children = new_root.nearest_children
+          # We possibly have multiple matches, so we are going to split on the
+          # number of leafs, or we leave it like it is, if the split will be
+          # too far from the target
+          prev_root = new_root
+          new_root = near_children.first
+          near_children.each do |c|
+            next if c == new_root
+            # find the nearest match
+            if (c.leaves.size-target_size).abs < (new_root.leaves.size-target_size).abs
+              new_root = c
+            end
+          end
+          # Break out of the loop when we hit the target
+          break if new_root.leaves.size <= target_size
+        end
+        # Now see whether the last step actually was an improvement, otherwise
+        # we take one node up
+        # p [(prev_root.leaves.size-target_size).abs,(new_root.leaves.size-target_size).abs]
+        new_root = prev_root if (prev_root.leaves.size-target_size).abs < (new_root.leaves.size-target_size).abs
+        branch = aln1.tree.clone_subtree(new_root)
+        reduced_tree = aln1.tree.clone_tree_without_branch(new_root)
+        # p branch.map { |n| n.name }.compact
+        # p reduced_tree.map { |n| n.name }.compact
+        # Now reduce the alignments themselves to match the trees
+        aln1 = tree_reduce(reduced_tree)
+        aln2 = tree_reduce(branch)
+        return aln1,aln2
+      end
+    end
+  end
+end
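
Note on `split_on_distance` above: the greedy walk can be tried out without the gem. The sketch below is a simplified stand-in, not the library code. `SketchNode`, `leaf`, `inner` and `split_branch` are hypothetical names, the tree is hand-built from the Newick string in split-tree.feature, and the sketch returns only the chosen branch rather than two alignments.

```ruby
# Hypothetical minimal node: name (nil for internal nodes), edge length to
# the parent, and child nodes. This is not the BioRuby Bio::Tree API.
SketchNode = Struct.new(:name, :length, :children) do
  def leaves
    children.empty? ? [self] : children.flat_map(&:leaves)
  end
end

def leaf(name, length)
  SketchNode.new(name, length, [])
end

def inner(length, *children)
  SketchNode.new(nil, length, children)
end

# Walk down from the root along the shortest edge until the subtree holds at
# most target_size leaves; on a tie prefer the child closest to the target.
def split_branch(root, target_size)
  off = ->(n) { (n.leaves.size - target_size).abs }
  prev = node = root
  until node.children.empty?
    shortest = node.children.map(&:length).min
    candidates = node.children.select { |c| c.length == shortest }
    prev = node
    node = candidates.min_by { |c| off.(c) }
    break if node.leaves.size <= target_size
  end
  # Step back up one node if the last move overshot the target
  off.(prev) < off.(node) ? prev : node
end

tree = inner(0.0,
  inner(4.336735,
    leaf("seq6", 5.3571434),
    inner(1.3095236,
      leaf("seq4", 4.04762),
      inner(1.0714293,
        inner(1.7857151, leaf("seq8", 1.1904755), leaf("seq5", 1.1904755)),
        inner(1.7857151,
          inner(1.1904755, leaf("seq3", 0.0), leaf("seq2", 0.0)),
          leaf("seq1", 1.1904755))))),
  leaf("seq7", 9.693878))

branch = split_branch(tree, tree.leaves.size / 2 + 1)   # default target
high = branch.leaves.map(&:name).sort
low  = (tree.leaves.map(&:name) - high).sort
puts "high homology: #{high.join(',')}"
puts "low homology:  #{low.join(',')}"
```

Calling `split_branch(tree, 2)` or `split_branch(tree, 3)` exercises the tie between the two 1.79 edges, which is where the "closest to the target" rule decides the outcome.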

data/lib/bio-alignment/tree.rb CHANGED

@@ -37,27 +37,72 @@ module Bio
   # Here we add to BioRuby's Bio::Tree classes
   class Tree
     class Node
+      # Add tree information to this node, so it can be queried
       def inject_tree tree
         @tree = tree
+        self
       end
+      # Is this Node a leaf?
       def leaf?
         children.size == 0
       end
+      # Get the children of this Node
       def children
         @tree.children(self)
       end
+      def descendents
+        @tree.descendents(self)
+      end
+      # Get the parent of this Node
       def parent
         @tree.parent(self)
       end
+      # Get the direct sibling nodes (i.e. parent.children)
+      def siblings
+        parent.children - [self]
+      end
+      # Return the leaves of this node
+      def leaves
+        @tree.leaves(self)
+      end
+      # Find the nearest and dearest, i.e. the leaves attached to the parent
+      # node
+      def nearest
+        @tree.leaves(parent) - [self]
+      end
-      # Get the distance to another node (FIXME: write test)
+      # Get the distance to another node
       def distance other
         @tree.distance(self,other)
       end
-    end
+      # Get the child node with the shortest edge - note that if there is more
+      # than one, the first will be picked
+      def nearest_child
+        c = nil
+        children.each do |n|
+          c=n if not c or distance(n)<distance(c)
+        end
+        c
+      end
+      # Get the child nodes with the shortest edge - returns an Array
+      def nearest_children
+        min_distance = distance(nearest_child)
+        cs = []
+        children.each do |n|
+          cs << n if distance(n) == min_distance
+        end
+        cs
+      end
+    end  # End of injecting Node functionality
     def find name
       get_node_by_name(name)
@@ -65,12 +110,57 @@ module Bio
     # Walk the ordered tree leaves, calling into the block, and return an array
     def map
-      res = []
-      leaves.each do | leaf |
-        item = yield leaf
-        res << item
+      leaves.map { | leaf | yield leaf }
+    end
+    # Clone the subtree rooted at start_node (node objects are shared, not copied)
+    def clone_subtree start_node
+      new_tree = self.class.new
+      list = [start_node] + start_node.descendents
+      list.each do |x|
+        new_tree.add_node(x)
+      end
+      each_edge do |node1, node2, edge|
+        if new_tree.include?(node1) and new_tree.include?(node2)
+          new_tree.add_edge(node1, node2, edge)
+        end
+      end
+      new_tree
+    end
+    # Clone a tree without the branch starting at node
+    def clone_tree_without_branch node
+      new_tree = self.class.new
+      original = [root] + root.descendents
+      # p "Original",original
+      skip = [node] + node.descendents
+      # p "Skip",skip
+      # p "Retain",root.descendents - skip
+      nodes.each do |x|
+        if not skip.include?(x)
+          new_tree.add_node(x)
+        else
+        end
+      end
+      each_edge do |node1, node2, edge|
+        if new_tree.include?(node1) and new_tree.include?(node2)
+          new_tree.add_edge(node1, node2, edge)
+        end
+      end
+      new_tree
+    end
+    def clone
+      new_tree = self.class.new
+      nodes.each do |x|
+        new_tree.add_node(x)
+      end
+      self.each_edge do |node1, node2, edge|
+        if new_tree.include?(node1) and new_tree.include?(node2) then
+          new_tree.add_edge(node1, node2, edge)
+        end
       end
-      res
+      new_tree
     end
   end
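
The three clone methods above share one pattern: choose which nodes to keep, then copy only the edges whose two endpoints both survived. A rough sketch of that pattern follows, assuming a toy representation (a node list plus `[parent, child, length]` triples). `ToyTree` and its method names are hypothetical and are not part of Bio::Tree.

```ruby
# Hypothetical toy tree: nodes is an Array of names, edges an Array of
# [parent, child, length] triples. Not the Bio::Tree API.
ToyTree = Struct.new(:nodes, :edges) do
  def children(node)
    edges.select { |parent, _child, _len| parent == node }
         .map { |_parent, child, _len| child }
  end

  def descendents(node)
    children(node).flat_map { |c| [c] + descendents(c) }
  end

  # Keep the nodes for which the block is true, plus the edges between them
  def filtered
    kept = nodes.select { |n| yield n }
    kept_edges = edges.select { |a, b, _len| kept.include?(a) && kept.include?(b) }
    ToyTree.new(kept, kept_edges)
  end

  # Counterpart of clone_subtree: the start node and everything below it
  def subtree(start)
    keep = [start] + descendents(start)
    filtered { |n| keep.include?(n) }
  end

  # Counterpart of clone_tree_without_branch: everything except that branch
  def without_branch(start)
    skip = [start] + descendents(start)
    filtered { |n| !skip.include?(n) }
  end
end

tree = ToyTree.new(
  %w[root X x1 Y y1 y2 z],
  [["root", "X", 0.1], ["root", "z", 0.5], ["X", "x1", 0.2],
   ["X", "Y", 0.1], ["Y", "y1", 0.3], ["Y", "y2", 0.3]]
)
p tree.subtree("Y").nodes          # prints ["Y", "y1", "y2"]
p tree.without_branch("Y").nodes   # prints ["root", "X", "x1", "z"]
```

Filtering edges by endpoint membership drops the edge that joined the branch to the rest of the tree, so the two results are disjoint and together cover every original node.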

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: bio-alignment
 version: !ruby/object:Gem::Version
-  version: 0.0.6
+  version: 0.0.7
   prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-03-17 00:00:00.000000000Z
+date: 2012-06-25 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bio-logger
-  requirement: &26202820 !ruby/object:Gem::Requirement
+  requirement: &83191660 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *26202820
+  version_requirements: *83191660
 - !ruby/object:Gem::Dependency
   name: bio
-  requirement: &26201340 !ruby/object:Gem::Requirement
+  requirement: &83191360 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
         version: 1.4.2
   type: :runtime
   prerelease: false
-  version_requirements: *26201340
+  version_requirements: *83191360
 - !ruby/object:Gem::Dependency
   name: rake
-  requirement: &26199400 !ruby/object:Gem::Requirement
+  requirement: &83190960 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *26199400
+  version_requirements: *83190960
 - !ruby/object:Gem::Dependency
   name: bio-bigbio
-  requirement: &26197880 !ruby/object:Gem::Requirement
+  requirement: &83190640 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>'
@@ -54,10 +54,10 @@ dependencies:
         version: 0.1.3
   type: :development
   prerelease: false
-  version_requirements: *26197880
+  version_requirements: *83190640
 - !ruby/object:Gem::Dependency
   name: cucumber
-  requirement: &26196760 !ruby/object:Gem::Requirement
+  requirement: &83190190 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -65,21 +65,21 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *26196760
+  version_requirements: *83190190
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &26195120 !ruby/object:Gem::Requirement
+  requirement: &83095350 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
       - !ruby/object:Gem::Version
-        version: 2.3.0
+        version: 2.10.0
   type: :development
   prerelease: false
-  version_requirements: *26195120
+  version_requirements: *83095350
 - !ruby/object:Gem::Dependency
   name: bundler
-  requirement: &26194620 !ruby/object:Gem::Requirement
+  requirement: &83094950 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -87,10 +87,10 @@ dependencies:
         version: 1.0.21
   type: :development
   prerelease: false
-  version_requirements: *26194620
+  version_requirements: *83094950
 - !ruby/object:Gem::Dependency
   name: jeweler
-  requirement: &26193920 !ruby/object:Gem::Requirement
+  requirement: &83094400 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -98,7 +98,7 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *26193920
+  version_requirements: *83094400
 description: Alignment handler for multiple sequence alignments (MSA)
 email: pjotr.public01@thebird.nl
 executables:
@@ -138,10 +138,12 @@ files:
 - features/edit/mask_serial_mutations.feature
 - features/pal2nal-feature.rb
 - features/pal2nal.feature
+- features/phylogeny/split-tree-feature.rb
+- features/phylogeny/split-tree.feature
+- features/phylogeny/tree-feature.rb
+- features/phylogeny/tree.feature
 - features/rows-feature.rb
 - features/rows.feature
-- features/tree-feature.rb
-- features/tree.feature
 - lib/bio-alignment.rb
 - lib/bio-alignment/alignment.rb
 - lib/bio-alignment/bioruby.rb
@@ -154,6 +156,7 @@ files:
 - lib/bio-alignment/edit/edit_rows.rb
 - lib/bio-alignment/edit/mask_islands.rb
 - lib/bio-alignment/edit/mask_serial_mutations.rb
+- lib/bio-alignment/edit/tree_splitter.rb
 - lib/bio-alignment/elements.rb
 - lib/bio-alignment/pal2nal.rb
 - lib/bio-alignment/rows.rb
@@ -183,7 +186,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: 1800672102634743595
+      hash: 900281341
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
@@ -192,7 +195,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 1.8.10
+rubygems_version: 1.8.6
 signing_key:
 specification_version: 3
 summary: Support for multiple sequence alignments (MSA)
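
The rspec requirement in the metadata above moves from `~> 2.3.0` to `~> 2.10.0`. As a small check of what that pessimistic constraint admits, the sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes; the candidate version numbers are only illustrative.

```ruby
require 'rubygems'

# "~> 2.10.0" means >= 2.10.0 and < 2.11.0
req = Gem::Requirement.new("~> 2.10.0")

%w[2.3.0 2.10.0 2.10.5 2.11.0].each do |v|
  ok = req.satisfied_by?(Gem::Version.new(v))
  puts "#{v}: #{ok}"
end
```

Only the 2.10.x versions satisfy the new requirement, so a development setup still pinned to rspec 2.3.x would need to be updated.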