bio-alignment 0.0.7 → 0.0.8

data/README.md CHANGED
@@ -19,6 +19,8 @@ Features are:
19
19
  * Support for BioRuby trees and node distance calculation
20
20
  * bio-alignment interacts well with BioRuby structures,
21
21
  including sequence objects and alignment/tree parsers
22
+ * Support for textual and HTML output of MSA (planned)
23
+ * Support for Clayton's MAF parser (planned)
22
24
 
23
25
  When possible, BioRuby functionality is merged in. For example, by
24
26
  supporting Bio::Sequence objects, standard BioRuby alignment
@@ -34,7 +36,39 @@ Bio::BioAlignment
34
36
  document](https://github.com/pjotrp/bioruby-alignment/blob/master/doc/bio-alignment-design.md)
35
37
  for Ruby.
36
38
 
37
- ## Developers
39
+ ## Command line
40
+
41
+ bio-alignment comes with a command line interface (CLI), which can apply a number
42
+ of editing functions on an alignment, and generate textual and HTML
43
+ output. Note that the CLI does not cover the full library. The CLI can be useful
44
+ for non-Rubyists, pipeline setups, and simply as examples.
45
+
46
+ Remove bridges (columns with mostly gaps) from an alignment
47
+
48
+ bio-alignment aa-alignment.fa --type aminoacid --edit bridges
49
+
50
+ Mask islands (short misaligned 'floating' parts in a sequence)
51
+
52
+ coming soon...
53
+
54
+ Mask serial mutations
55
+
56
+ coming soon...
57
+
58
+ Remove all sequences consisting mostly of gaps (less than 30% informative) and output to FASTA
59
+
60
+ bio-alignment codon-alignment.fa --type codon --edit info --out fasta
61
+
62
+ or output codon style
63
+
64
+ bio-alignment codon-alignment.fa --type codon --edit info --style codon
65
+
66
+ Remove all sequences containing gaps from an alignment (why would you
67
+ want to do that?)
68
+
69
+ bio-alignment codon-alignment.fa --type codon --edit info --perc 100 --out fasta
70
+
71
+ ## Section for developers
38
72
 
39
73
  ### Codon alignment example
40
74
 
@@ -50,7 +84,7 @@ aligmment (note codon gaps are represented by '---')
50
84
  aln = Alignment.new
51
85
  fasta = FastaReader.new('codon-alignment.fa')
52
86
  fasta.each do | rec |
53
- aln.sequences << CodonSequence.new(rec.id, rec.seq)
87
+ aln << CodonSequence.new(rec.id, rec.seq)
54
88
  end
55
89
  # write a matching amino acid alignment
56
90
  fasta = FastaWriter.new('aa-aln.fa')
@@ -106,18 +140,35 @@ BioAlignment supports adding BioRuby's Bio::Sequence objects:
106
140
  include Bio::BioAlignment
107
141
 
108
142
  aln = Alignment.new
109
- aln.sequences << Bio::Sequence::NA.new("atgcatgcaaaa")
110
- aln.sequences << Bio::Sequence::NA.new("atg---tcaaaa")
143
+ aln << Bio::Sequence::NA.new("atgcatgcaaaa")
144
+ aln << Bio::Sequence::NA.new("atg---tcaaaa")
145
+ ```
146
+
147
+ or use BioRuby's flat file reader
148
+
149
+ ```ruby
150
+ aln = Alignment.new
151
+ Bio::FlatFile.auto(filename).each_entry do |entry|
152
+ aln << entry
153
+ end
111
154
  ```
112
155
 
113
- and we can transform BioAlignment into BioRuby's Bio::Alignment and
114
- use BioRuby functions
156
+ and, the other way, we can transform BioAlignment into BioRuby's
157
+ Bio::Alignment and use BioRuby functions
115
158
 
116
159
  ```ruby
117
160
  bioruby_aln = aln.to_bioruby_alignment
118
161
  bioruby_aln.consensus_iupac
119
162
  ```
120
163
 
164
+ Note that native BioRuby objects may not always work. In the first
165
+ case, using Bio::Sequence::NA, no ID is passed in, so each sequence is
166
+ labeled 'id?'. In the second case BioRuby's FlatFile returns a
167
+ FastaFormat object, this time with ID, but FastaFormat does not
168
+ support indexing. In general, it is recommended to stay with the
169
+ bio-alignment Sequence classes (or roll your own, as long as they are
170
+ Enumerable).
171
+
121
172
  ### Pal2nal
122
173
 
123
174
  A protein (amino acid) to nucleotide alignment would first load
@@ -132,7 +183,7 @@ the sequences
132
183
  aln2 = Alignment.new
133
184
  fasta2 = FastaReader.new('nt.fa')
134
185
  fasta2.each do | rec |
135
- aln2.sequences << Sequence.new(rec.id, rec.seq)
186
+ aln2 << Sequence.new(rec.id, rec.seq)
136
187
  end
137
188
  ```
138
189
 
@@ -174,15 +225,17 @@ BioAlignment has support for attaching a phylogenetic tree to an
174
225
  alignment, and traversing the tree using an intuitive interface
175
226
 
176
227
  ```ruby
177
- sole_tree = Bio::Newick.new(string).tree # use BioRuby's tree parser
178
- tree = aln.attach_tree(sole_tree) # attach the tree
179
- # now do stuff with the tree, which has improved bio-align support
228
+ newick_tree = Bio::Newick.new(string).tree # use BioRuby's tree parser
229
+ tree = aln.attach_tree(newick_tree) # attach the tree
230
+ # now do stuff with the tree, which has improved bio-alignment support
180
231
  root = tree.root
181
232
  children = root.children
182
233
  children.map { |n| n.name }.sort.should == ["","seq7"]
183
234
  seq7 = children.last
184
235
  seq4 = tree.find("seq4")
185
236
  seq4.distance(seq7).should == 19.387756600000003
237
+ # find the sequence in the alignment belonging to the node
238
+ print seq4.sequence
186
239
  print tree.output_newick # BioRuby Newick output
187
240
  ```
188
241
 
@@ -201,13 +254,14 @@ programming. Primitives are provided which take out much of the
201
254
  plumbing, such as maintaining row/column/element state, and allow
202
255
  copy-on-edit (so no conflicts arise in concurrent code). For example,
203
256
  to walk an alignment by row, and update the row state, you can mark
204
- all rows for deletion which contain many gaps
257
+ all rows (sequences) which contain many gaps for deletion
205
258
 
206
259
  ```ruby
207
260
  include MarkRows
208
261
  mark_rows { |rowstate,row| # for every row/sequence
209
262
  num = row.count { |e| e.gap? }
210
263
  if (num.to_f/row.length) > 0.5
264
+ # this row in the alignment consists mostly of gaps
211
265
  rowstate.delete! # mark row for deletion
212
266
  end
213
267
  rowstate # returns the updated row state
@@ -225,9 +279,9 @@ The general idea is that there are many potential ways of selecting
225
279
  rows, and changing some state. The 'mark_rows' function/iterator takes
226
280
  care of the plumbing. All the programmer needs to do is to set the
227
281
  criterion, in this case a gap percentage, and tell the library what
228
- state has to change. In this example we only access one row, but you
229
- can also access the other rows. You won't be surprised that marking
230
- columns looks much the same
282
+ state has to change. In this example we only access one row at a time,
283
+ but you can also access the other rows. You won't be surprised that
284
+ marking columns looks much the same
231
285
 
232
286
  ```ruby
233
287
  include MarkColumns
@@ -262,7 +316,7 @@ and, here we remove every marked element by turning it into a gap
262
316
  the old with the new.
263
317
 
264
318
  It is important to note that, instead of directly editing alignments
265
- in place, this module always makes it a two step process. First items
319
+ in place, bio-alignment always makes it a two step process. First items
266
320
  are masked/marked through the state of the rows/columns/elements, next
267
321
  the alignment is rewritten using this state. The advantage of using an
268
322
  intermediate state is that the state can be queried for creating (for
@@ -286,6 +340,9 @@ An edit feature is added at runtime(!) Example:
286
340
 
287
341
  where aln2 is a copy of aln with bridging columns deleted.
288
342
 
343
+ More examples can be found in the features/edit directory of the
344
+ source.
345
+
289
346
  ### See also
290
347
 
291
348
  For more on the design of bio-alignment, read the
data/TODO ADDED
@@ -0,0 +1,2 @@
1
+ - Island masking assumes one unique element, maybe consider a
2
+ neighbour. The example in test/data is rather a good one.
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.0.7
1
+ 0.0.8
@@ -1,12 +1,22 @@
1
1
  #!/usr/bin/env ruby
2
2
  #
3
3
  # BioRuby bio-alignment Plugin
4
- # Version 0.0.0
5
4
  # Author:: Pjotr Prins
6
5
  # Copyright:: 2012
7
6
  # License:: The Ruby License
8
7
 
9
- USAGE = "Describe bio-alignment"
8
+ rootpath = File.dirname(File.dirname(__FILE__))
9
+ $: << File.join(rootpath,'lib')
10
+
11
+ _VERSION = File.new(File.join(rootpath,'VERSION')).read.chomp
12
+
13
+ $stderr.print "bio-alignment "+_VERSION+" Copyright (C) 2012 Pjotr Prins <pjotr.prins@thebird.nl>\n\n"
14
+
15
+ USAGE =<<EOU
16
+
17
+ bio-alignment transforms alignments
18
+
19
+ EOU
10
20
 
11
21
  if ARGV.size == 0
12
22
  print USAGE
@@ -14,51 +24,63 @@ end
14
24
 
15
25
  require 'bio-alignment'
16
26
  require 'optparse'
27
+ include Bio::BioAlignment
17
28
 
18
- # Uncomment when using the bio-logger
19
- # require 'bio-logger'
20
- # Bio::Log::CLI.logger('stderr')
21
- # Bio::Log::CLI.trace('info')
29
+ log = Bio::Log::LoggerPlus.new 'bio-alignment'
22
30
 
23
- options = {:example_switch=>false,:show_help=>false}
31
+ Bio::Log::CLI.logger('stderr')
32
+ Bio::Log::CLI.trace('info')
33
+
34
+ options = {show_help: false}
35
+ options[:show_help] = true if ARGV.size == 0
24
36
  opts = OptionParser.new do |o|
25
- o.banner = "Usage: #{File.basename($0)} [options] reponame\ne.g. #{File.basename($0)} the-perfect-gem"
37
+ o.banner = "Usage: #{File.basename($0)} [options] filename\n\n"
26
38
 
27
- o.on('--example_parameter [EXAMPLE_PARAMETER]', 'TODO: put a description for the PARAMETER') do |example_parameter|
28
- # TODO: your logic here, below an example
29
- options[:example_parameter] = 'this is a parameter'
39
+ o.on('--type codon|nucleotide|aminoacid', [:codon,:nucleotide,:aminoacid], 'Type of sequence data (default auto)') do |type|
40
+ options[:type] = type.to_sym
30
41
  end
31
-
32
- o.separator ""
33
- o.on("--switch-example", 'TODO: put a description for the SWITCH') do
34
- # TODO: your logic here, below an example
35
- self[:example_switch] = true
42
+
43
+ o.on('--edit bridges|islands|info', [:bridges,:islands,:info], 'Apply edit function') do |edit|
44
+ options[:edit] = edit.to_sym
36
45
  end
37
46
 
38
- # Uncomment the following when using the bio-logger
39
- # o.separator ""
40
- # o.on("--logger filename",String,"Log to file (default stderr)") do | name |
41
- # Bio::Log::CLI.logger(name)
42
- # end
43
- #
44
- # o.on("--trace options",String,"Set log level (default INFO, see bio-logger)") do | s |
45
- # Bio::Log::CLI.trace(s)
46
- # end
47
- #
48
- # o.on("-q", "--quiet", "Run quietly") do |q|
49
- # Bio::Log::CLI.trace('error')
50
- # end
51
- #
52
- # o.on("-v", "--verbose", "Run verbosely") do |v|
53
- # Bio::Log::CLI.trace('info')
54
- # end
55
- #
56
- # o.on("--debug", "Show debug messages") do |v|
57
- # Bio::Log::CLI.trace('debug')
58
- # end
47
+ o.on('--perc value', Integer, 'Percentage') do |v|
48
+ options[:perc] = v
49
+ end
50
+
51
+ o.on('--out fasta', [:fasta], 'Output format') do |format|
52
+ options[:out] = format.to_sym
53
+ end
54
+
55
+ o.on('--style codon', [:codon], 'Output style') do |style|
56
+ options[:style] = style.to_sym
57
+ end
58
+
59
+ o.separator ""
60
+
61
+ o.on("--logger filename",String,"Log to file (default stderr)") do | name |
62
+ Bio::Log::CLI.logger(name)
63
+ end
64
+
65
+ o.on("--trace options",String,"Set log level (default INFO, see bio-logger)") do | s |
66
+ Bio::Log::CLI.trace(s)
67
+ end
68
+
69
+ o.on("-q", "--quiet", "Run quietly") do |q|
70
+ Bio::Log::CLI.trace('error')
71
+ end
72
+
73
+ o.on("-v", "--verbose", "Run verbosely") do |v|
74
+ Bio::Log::CLI.trace('info')
75
+ end
76
+
77
+ o.on("--debug", "Show debug messages") do |v|
78
+ Bio::Log::CLI.trace('debug')
79
+ end
59
80
 
60
81
  o.separator ""
61
- o.on_tail('-h', '--help', 'display this help and exit') do
82
+
83
+ o.on_tail('-h', '--help', 'Display this help and exit') do
62
84
  options[:show_help] = true
63
85
  end
64
86
  end
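The option definitions above pass an Array of symbols to `o.on`, which restricts the accepted values. The following is a minimal, stand-alone sketch of that pattern using only Ruby's standard `optparse`; `parse_cli` is a hypothetical helper and not part of the bio-alignment CLI, and only two of the options are reproduced.

```ruby
require 'optparse'

# Hypothetical helper (not in bio-alignment): parse a reduced set of
# options and return [options, remaining_arguments].
def parse_cli(argv)
  options = {}
  parser = OptionParser.new do |o|
    # An Array as second argument restricts the accepted values; the
    # block receives the matching Symbol.
    o.on('--type TYPE', [:codon, :nucleotide, :aminoacid], 'Type of sequence data') do |type|
      options[:type] = type
    end
    # Integer makes OptionParser convert and validate the value.
    o.on('--perc VALUE', Integer, 'Percentage') do |v|
      options[:perc] = v
    end
  end
  rest = parser.parse(argv) # non-destructive; returns the non-option arguments
  [options, rest]
end

options, files = parse_cli(%w[--type codon --perc 100 codon-alignment.fa])
p options
p files

# A value outside the list is rejected by OptionParser itself.
begin
  parse_cli(%w[--type protein])
rescue OptionParser::InvalidArgument => e
  puts "rejected: #{e.message}"
end
```

In this sketch the rejected value raises `OptionParser::InvalidArgument`, which is a different class from the `OptionParser::InvalidOption` that the script rescues, so the script's handler may need to cover both.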
@@ -66,11 +88,67 @@ end
66
88
  begin
67
89
  opts.parse!(ARGV)
68
90
 
69
- # Uncomment the following when using the bio-logger
70
- # Bio::Log::CLI.configure('bio-alignment')
91
+ if options[:show_help]
92
+ print opts
93
+ print USAGE
94
+ end
71
95
 
72
- # TODO: your code here
73
- # use options for your logic
74
96
  rescue OptionParser::InvalidOption => e
75
97
  options[:invalid_argument] = e.message
76
98
  end
99
+
100
+ Bio::Log::CLI.configure('bio-alignment')
101
+ logger = Bio::Log::LoggerPlus['bio-alignment']
102
+ logger.info [options, ARGV]
103
+
104
+ ARGV.each do |fn|
105
+ aln = Alignment.new
106
+ Bio::FlatFile.auto(fn).each_entry do |entry|
107
+ case options[:type]
108
+ when :codon
109
+ aln << CodonSequence.new(entry.entry_id,entry.seq)
110
+ when :nucleotide
111
+ aln << Sequence.new(entry.entry_id,entry.seq)
112
+ when :aminoacid
113
+ aln << Sequence.new(entry.entry_id,entry.seq)
114
+ else
115
+ # auto uses BioRuby sequence type
116
+ logger.warn "Using native type, if you encounter a problem, set the --type explicitly"
117
+ aln << entry
118
+ end
119
+ end
120
+ case options[:edit]
121
+ when :bridges
122
+ logger.info "Apply delete bridges"
123
+ require 'bio-alignment/edit/del_bridges'
124
+ aln.extend(DelBridges)
125
+ aln2 = aln.del_bridges
126
+ aln = aln2
127
+ when :islands
128
+ logger.info "Apply mask islands filter"
129
+ require 'bio-alignment/edit/mask_islands'
130
+ aln.extend(MaskIslands)
131
+ marked_aln = aln.mark_islands
132
+ aln2 = marked_aln.update_each_element { |e| (e.state.masked? ? Element.new("X"):e)}
133
+ aln = aln2
134
+ when :info
135
+ logger.info "Apply sequence information filter"
136
+ require 'bio-alignment/edit/del_non_informative_sequences'
137
+ aln.extend(DelNonInformativeSequences)
138
+ aln.each { |seq| seq.extend(State) }
139
+ aln2 = aln.del_non_informative_sequences(options[:perc])
140
+ aln = aln2
141
+ else
142
+ # do nothing
143
+ end
144
+ case options[:out]
145
+ when :fasta
146
+ aln.each do | seq |
147
+ print FastaOutput::to_fasta(seq)
148
+ end
149
+ else
150
+ aln.each do | seq |
151
+ print TextOutput::to_text(seq,options[:style])
152
+ end
153
+ end
154
+ end
@@ -78,6 +78,11 @@ Then /^find that "([^"]*)" is on the same branch as "([^"]*)"$/ do |arg1, arg2|
78
78
  seq.nearest.map{|n|n.to_s}.sort.join(',').should == arg2
79
79
  end
80
80
 
81
+ Then /^find that the alignment sequence matching tree node "(.*?)" is "(.*?)"$/ do |arg1, arg2|
82
+ tree = @aln.attach_tree(@tree)
83
+ node = tree.find(arg1)
84
+ node.sequence.to_s.should == arg2
85
+ end
81
86
 
82
87
  Then /^draw the MSA with the tree$/ do | string |
83
88
  # textual drawing, like tabtree, or http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/149701
@@ -28,6 +28,7 @@ Feature: Tree support for alignments
28
28
  And find that the nearest sequence to "seq1" is "seq2,seq3"
29
29
  And find that "seq1" is on the same branch as "seq2,seq3"
30
30
  And find that "seq4" is on the same branch as "seq1,seq2,seq3,seq5,seq8"
31
+ And find that the alignment sequence matching tree node "seq4" is "----PKLFSRPTIIFSGCSTACSGK--SEPVCGFRSFMLSDV"
31
32
  And draw the MSA with the tree
32
33
  """
33
34
  ,--9.69----------------------------------------- seq7 ----------PTIIFSGCSKACSGK-----VCGIFHAVRSFM
@@ -0,0 +1,19 @@
1
+ require 'bundler'
2
+ begin
3
+ Bundler.setup(:default, :development)
4
+ rescue Bundler::BundlerError => e
5
+ $stderr.puts e.message
6
+ $stderr.puts "Run `bundle install` to install missing gems"
7
+ exit e.status_code
8
+ end
9
+
10
+ $LOAD_PATH.unshift(File.dirname(__FILE__) + '/../../lib')
11
+ require 'bio-alignment'
12
+
13
+ require 'rspec/expectations'
14
+
15
+ log = Bio::Log::LoggerPlus.new 'bio-alignment'
16
+
17
+ Bio::Log::CLI.logger('stderr')
18
+ Bio::Log::CLI.trace('info')
19
+
@@ -2,9 +2,14 @@
2
2
  # bioruby directory tree.
3
3
  #
4
4
 
5
+ require 'bio-logger'
6
+ require 'bio-alignment/coerce'
5
7
  require 'bio-alignment/state'
6
8
  require 'bio-alignment/elements'
7
9
  require 'bio-alignment/sequence'
8
10
  require 'bio-alignment/codonsequence'
9
11
  require 'bio-alignment/tree'
10
12
  require 'bio-alignment/alignment'
13
+ require 'bio-alignment/format/text'
14
+ require 'bio-alignment/format/fasta'
15
+ require 'bio-alignment/format/phylip'
@@ -13,6 +13,7 @@ module Bio
13
13
  include Pal2Nal
14
14
  include Rows
15
15
  include Columns
16
+ include Coerce
16
17
 
17
18
  attr_accessor :sequences
18
19
  attr_reader :tree
@@ -44,7 +45,7 @@ module Bio
44
45
 
45
46
  # return an array of sequence ids
46
47
  def ids
47
- rows.map { |r| r.id }
48
+ rows.map { |r| Coerce::fetch_id(r) }
48
49
  end
49
50
 
50
51
  def size
@@ -56,6 +57,10 @@ module Bio
56
57
  rows[index]
57
58
  end
58
59
 
60
+ def << seq
61
+ @sequences << seq
62
+ end
63
+
59
64
  def each
60
65
  rows.each { | seq | yield seq }
61
66
  self
@@ -68,21 +73,23 @@ module Bio
68
73
 
69
74
  def find name
70
75
  each do | seq |
71
- return seq if seq.id == name
76
+ return seq if Coerce::fetch_id(seq) == name
72
77
  end
73
78
  raise "ERROR: Sequence not found by its name, looking for <#{name}>"
74
79
  end
75
80
 
76
- # clopy alignment and allow updating elements
81
+ # copy alignment and allow updating elements. Returns alignment.
77
82
  def update_each_element
78
83
  aln = self.clone
79
84
  aln.each { |seq| seq.each_with_index { |e,i| seq.seq[i] = yield e }}
85
+ aln
80
86
  end
81
87
 
82
88
  def to_s
83
89
  res = ""
84
90
  res += "\t" + columns_to_s + "\n" if @columns
85
- res += map{ |seq| seq.id.to_s + "\t" + seq.to_s }.join("\n")
91
+ # fetch each sequence in turn
92
+ res += map{ |seq| Coerce::fetch_id(seq).to_s + "\t" + Coerce::fetch_seq_string(seq) }.join("\n")
86
93
  res
87
94
  end
88
95
 
@@ -106,7 +113,7 @@ module Bio
106
113
  # the tree traverser
107
114
  def attach_tree tree
108
115
  extend Tree
109
- @tree = Tree::init(tree)
116
+ @tree = Tree::init(tree,self)
110
117
  @tree
111
118
  end
112
119
 
@@ -122,6 +129,7 @@ module Bio
122
129
  new_aln.attach_tree(new_tree.clone)
123
130
  new_aln
124
131
  end
132
+
125
133
  end
126
134
  end
127
135
  end
@@ -4,7 +4,7 @@ module Bio
4
4
  class NA
5
5
  include Enumerable
6
6
  def each
7
- to_s.each_byte do | c |
7
+ to_s.each_char do | c |
8
8
  yield c
9
9
  end
10
10
  end
@@ -12,7 +12,7 @@ module Bio
12
12
  class AA
13
13
  include Enumerable
14
14
  def each
15
- to_s.each_byte do | c |
15
+ to_s.each_char do | c |
16
16
  yield c
17
17
  end
18
18
  end
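The switch from `each_byte` to `each_char` changes what the enumerator yields: integers before, one-character strings after. A small sketch in plain Ruby, with no BioRuby classes involved, shows why this matters when elements are compared against a gap character:

```ruby
seq = "atg-"

bytes = seq.each_byte.to_a   # integer byte values
chars = seq.each_char.to_a   # one-character strings

p bytes
p chars

# Comparing against the gap character only works on characters
p chars.count { |c| c == "-" }
p bytes.count { |c| c == "-" }
```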
@@ -7,13 +7,16 @@ module Bio
7
7
  # Codon element for the matrix, used by CodonSequence.
8
8
  class Codon
9
9
  GAP = '---'
10
- UNDEFINED = 'X'
10
+ UNDEFINED = 'XXX'
11
11
 
12
12
  attr_reader :codon_table
13
+ include State
13
14
 
14
15
  def initialize codon, codon_table = 1
15
16
  @codon = codon
16
17
  @codon_table = codon_table
18
+ @codon.freeze
19
+ @codon_table.freeze
17
20
  end
18
21
 
19
22
  def gap?
@@ -32,6 +35,10 @@ module Bio
32
35
  @codon
33
36
  end
34
37
 
38
+ def == other
39
+ @codon == other.to_s
40
+ end
41
+
35
42
  # lazily convert to Amino acid (once only)
36
43
  def to_aa
37
44
  aa = translate
@@ -52,7 +59,7 @@ module Bio
52
59
  # lazy translation of codon to amino acid
53
60
  def translate
54
61
  @aa ||= Bio::CodonTable[@codon_table][@codon]
55
- @aa
62
+ @aa.freeze
56
63
  end
57
64
  end
58
65
 
@@ -72,6 +79,9 @@ module Bio
72
79
  seq.scan(/\S\S\S/).each do | codon |
73
80
  @seq << Codon.new(codon, @codon_table)
74
81
  end
82
+ @id.freeze
83
+ @codon_table.freeze
84
+ # @seq is not immutable, as we can add new codons to the list
75
85
  end
76
86
 
77
87
  def [] index
@@ -86,6 +96,7 @@ module Bio
86
96
  @seq.each { | codon | yield codon }
87
97
  end
88
98
 
99
+ # Output codon style
89
100
  def to_s
90
101
  @seq.map { |codon| codon.to_s }.join(' ')
91
102
  end
@@ -107,7 +118,11 @@ module Bio
107
118
  def << codon
108
119
  @seq << codon
109
120
  end
110
-
121
+
122
+ # Return Sequence (string) as an Elements object
123
+ def to_elements
124
+ self
125
+ end
111
126
  end
112
127
 
113
128
  end
@@ -0,0 +1,45 @@
1
+ module Bio
2
+ module BioAlignment
3
+ module Coerce
4
+ # Make BioRuby's entry_id compatible with id
5
+ def Coerce::fetch_id seq
6
+ if seq.respond_to?(:id)
7
+ seq.id
8
+ elsif seq.respond_to?(:entry_id)
9
+ seq.entry_id
10
+ else
11
+ "id?"
12
+ end
13
+ end
14
+
15
+ # Coerce BioRuby's sequence objects to return the sequence itself
16
+ def Coerce::fetch_seq seq
17
+ if seq.respond_to?(:seq)
18
+ seq.seq
19
+ else
20
+ seq
21
+ end
22
+ end
23
+
24
+ # Coerce sequence objects into a string
25
+ def Coerce::fetch_seq_string seq
26
+ s = fetch_seq(seq)
27
+ if s.respond_to?(:join)
28
+ s.join
29
+ else
30
+ s.to_s
31
+ end
32
+ end
33
+
34
+ # Coerce sequence objects into elements
35
+ def Coerce::to_elements seq
36
+ if seq.respond_to?(:to_elements)
37
+ seq.to_elements
38
+ else
39
+ Elements.new(fetch_id(seq),fetch_seq(seq))
40
+ end
41
+ end
42
+ end
43
+ end
44
+ end
45
+
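The Coerce module relies on duck typing: it asks each object which methods it responds to instead of checking its class. The sketch below mirrors that logic as stand-alone functions so it runs without the gem; `NativeSeq` and `FlatEntry` are hypothetical stand-ins for a bio-alignment sequence and a BioRuby FASTA entry, not real classes from either library.

```ruby
# Hypothetical stand-ins, for illustration only
NativeSeq = Struct.new(:id, :seq)        # responds to #id
FlatEntry = Struct.new(:entry_id, :seq)  # responds to #entry_id

# Prefer #id, fall back to #entry_id, else use a placeholder label
def fetch_id(seq)
  if seq.respond_to?(:id)
    seq.id
  elsif seq.respond_to?(:entry_id)
    seq.entry_id
  else
    "id?"
  end
end

# Return the sequence as a String, whether it is stored as a String,
# as an Array of elements, or is the bare object itself
def fetch_seq_string(seq)
  s = seq.respond_to?(:seq) ? seq.seq : seq
  s.respond_to?(:join) ? s.join : s.to_s
end

puts fetch_id(NativeSeq.new("seq1", "atg"))
puts fetch_id(FlatEntry.new("seq2", "atg"))
puts fetch_id("atg---")                       # a bare string carries no ID
puts fetch_seq_string(NativeSeq.new("seq3", %w[a t g]))
```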
@@ -37,7 +37,7 @@ module Bio
37
37
  end
38
38
 
39
39
  def columns_to_s
40
- columns.map { |c| (c.state ? c.state.to_s : '?') }.join
40
+ columns.map { |c| (c.respond_to?(:state) ? c.state.to_s : '?') }.join
41
41
  end
42
42
 
43
43
  def clone_columns!
@@ -50,21 +50,31 @@ module Bio
50
50
  end
51
51
  end
52
52
 
53
- # Support the notion of columns in an alignment. A column
54
- # can have state by attaching state objects
53
+ # Support the notion of columns in an alignment. A column is simply an
54
+ # integer index into the alignment, stored in @col. A column can have state
55
+ # by attaching state objects.
55
56
  class Column
56
57
  include State
57
58
  include Enumerable
58
59
 
59
60
  def initialize aln, col
60
61
  @aln = aln
61
- @col = col
62
+ @col = col # column index number
63
+ @col.freeze
64
+ @aln
62
65
  end
63
66
 
64
67
  def [] index
65
68
  @aln[index][@col]
66
69
  end
67
70
 
71
+ # update all elements in the column
72
+ # def update! new_column
73
+ # each_with_index do |e,i|
74
+ # @aln[i][@col] = new_column[i]
75
+ # end
76
+ # end
77
+
68
78
  # iterator fetches a column on demand, yielding column elements
69
79
  def each
70
80
  @aln.each do | seq |
@@ -5,11 +5,15 @@ module Bio
5
5
 
6
6
  module DelNonInformativeSequences
7
7
  include MarkRows
8
-
8
+
9
+ # Count the informative elements in a sequence. If the count
10
+ # is less than +percentage+ mark the sequence for deletion.
11
+ #
9
12
  # Return a new alignment with rows marked for deletion, i.e. mark rows
10
13
  # that mostly contain undefined elements and gaps (threshold
11
14
  # +percentage+). The alignment returned is a cloned copy
12
15
  def mark_non_informative_sequences percentage = 30
16
+ percentage=30 if not percentage # for CLI
13
17
  mark_rows { |state,row|
14
18
  num = row.count { |e| e.gap? or e.undefined? }
15
19
  if (num.to_f/row.length) > 1.0-percentage/100.0
@@ -20,7 +24,7 @@ module Bio
20
24
  end
21
25
 
22
26
  def del_non_informative_sequences percentage=30
23
- mark_non_informative_sequences.rows_where { |row| !row.state.deleted? }
27
+ mark_non_informative_sequences(percentage).rows_where { |row| !row.state.deleted? }
24
28
  end
25
29
  end
26
30
  end
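The threshold test in `mark_non_informative_sequences` can be read as: a row is dropped when its fraction of gaps and undefined elements exceeds `1 - percentage/100`. A stand-alone sketch of that arithmetic follows; `non_informative?` is a hypothetical helper, and treating `-` and `X` as gap and undefined characters is an assumption made for the example.

```ruby
# Hypothetical helper: true when the row has fewer than +percentage+
# percent informative elements. '-' and 'X' stand in for gap and
# undefined elements (an assumption for this sketch).
def non_informative?(row, percentage = 30)
  percentage ||= 30 # the CLI may pass nil when --perc is absent
  num = row.count { |e| e == '-' || e == 'X' }
  (num.to_f / row.length) > 1.0 - percentage / 100.0
end

p non_informative?("a--------X".chars)      # 90% gaps/undefined, default threshold
p non_informative?("atgc--atgc".chars)      # 20% gaps, default threshold
p non_informative?("atgc-atgca".chars, 100) # a single gap with --perc 100
p non_informative?("atgcatgcaa".chars, 100) # no gaps with --perc 100
```

With `--perc 100` the right-hand side becomes 0.0, so any row containing at least one gap is removed, which matches the README example for removing all sequences that contain gaps.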
@@ -5,7 +5,7 @@ module Bio
5
5
  # state, and returning a newly cloned alignment
6
6
  module MarkRows
7
7
 
8
- # Mark each seq
8
+ # Mark each seq and return alignment
9
9
  def mark_rows &block
10
10
  aln = markrows_clone
11
11
  aln.rows.each do | row |
@@ -16,11 +16,15 @@ module Bio
16
16
 
17
17
  # allow the marking of elements in a copied alignment, making sure
18
18
  # each element is a proper Element object that can contain state.
19
+ #
19
20
  # A Sequence alignment will be turned into an Elements alignment.
21
+ #
22
+ # Returns the new alignment
20
23
  def mark_row_elements &block
21
24
  aln = markrows_clone
22
25
  aln.rows.each_with_index do | row,rownum |
23
- new_seq = block.call(row.to_elements,rownum)
26
+ new_seq = block.call(Coerce::to_elements(row),rownum)
27
+ # p [rownum,new_seq,row]
24
28
  aln.rows[rownum] = new_seq
25
29
  end
26
30
  aln
@@ -32,13 +36,15 @@ module Bio
32
36
  aln = self.clone
33
37
  # clone row state, or add a state object
34
38
  aln.rows.each do | row |
35
- new_state =
36
- if row.state
37
- row.state.clone
38
- else
39
- RowState.new
40
- end
41
- row.state = new_state
39
+ if row.respond_to?(:state)
40
+ new_state =
41
+ if row.state
42
+ row.state.clone
43
+ else
44
+ RowState.new
45
+ end
46
+ row.state = new_state
47
+ end
42
48
  end
43
49
  aln
44
50
  end
@@ -13,25 +13,33 @@ module Bio
13
13
  end
14
14
  end
15
15
 
16
- # Drop all 'islands' in a sequence with low consensus, that show a gap
17
- # larger than 'min_gap_size' (default 3) on both sides, and are shorter
18
- # than 'max_island_size' (default 30). An island larger than 30 elements
19
- # is arguably no longer an island, and low consensus stretches may be
20
- # loops - it is up to the alignment procedure to get that right. We also
21
- # allow for micro deletions inside an alignment (1 or 2 elements).
22
- # The island consensus is calculated by column. If more than 50% of the
23
- # island shows consensus, the island is retained. Consensus for each
24
- # element is defined as the number of matches in the column (default 1).
16
+ # Drop all 'islands' in a sequence with low consensus, i.e. islands that
17
+ # show a gap larger than 'min_gap_size' (default 3) on both sides, and
18
+ # are shorter than 'max_island_size' (default 30). An island larger than
19
+ # 30 elements is arguably no longer an island, and low consensus
20
+ # stretches may be loops - it is up to the alignment procedure to get
21
+ # that right. We also allow for micro deletions inside an alignment (1 or
22
+ # 2 elements). The island consensus is calculated by column. If more
23
+ # than 50% of the island shows consensus, the island is retained.
24
+ # Consensus for each element is defined as the number of matches in the
25
+ # column (default 1).
25
26
  def mark_islands
26
- mark_row_elements { |row,rownum|
27
- # first set state and find unique elements (i.e. consensus)
27
+ logger = Bio::Log::LoggerPlus['bio-alignment']
28
+ count_marked_islands = 0
29
+ count_marked_elements = 0
30
+
31
+ # Traverse each row in the alignment
32
+ marked_aln = mark_row_elements { |row,rownum|
33
+ # for each element create a state object, and find unique elements (i.e. consensus) across a column
28
34
  row.each_with_index do |e,colnum|
29
35
  e.state = IslandElementState.new
30
36
  column = columns[colnum]
31
37
  e.state.unique = (column.count{|e2| !e2.gap? and e2 == e } == 1)
32
38
  # p [e,e.state,e.state.unique]
33
39
  end
34
- # group elements into islands (split on gap) and mask
40
+ # at this stage every element of the row has been flagged as unique
41
+ # or not within its column. Now group elements into islands (split
42
+ # on gap) and mask
35
43
  gap = []
36
44
  island = []
37
45
  in_island = true
@@ -52,7 +60,9 @@ module Bio
52
60
  gap << e
53
61
  if gap.length > 2
54
62
  in_island = false
55
- mark_island(island)
63
+ ci, ce = mark_island(island)
64
+ count_marked_islands += ci
65
+ count_marked_elements += ce
56
66
  # print_island(island)
57
67
  island = []
58
68
  end
@@ -60,41 +70,29 @@ module Bio
60
70
  end
61
71
  end
62
72
  if in_island
63
- mark_island(island)
73
+ ci, ce = mark_island(island)
74
+ count_marked_islands += ci
75
+ count_marked_elements += ce
64
76
  # print_island(island) if island.length > 0
65
77
  end
66
- # row.each_with_index do |e,colnum|
67
- # e.state = ElementState.new
68
- # column = columns[colnum]
69
- # e.state.mask! if column.count{|e2| !e2.gap? and e2 == e } == 1
70
- # # print e,',',e.state,';'
71
- # end
72
- # now make sure there are at least 5 in a row, otherwise
73
- # start unmasking. First group all elements
74
- # group = []
75
- # row.each_with_index do |e,colnum|
76
- # next if e.gap?
77
- # if e.state.masked?
78
- # group << e
79
- # else
80
- # if group.length <= min_serial
81
- # # the group is too small
82
- # group.each do | e2 |
83
- # e2.state.unmask!
84
- # end
85
- # end
86
- # group = []
87
- # end
88
- # end
89
- row # return changed sequence
78
+ row # always return the row to mark_row_elements
90
79
  }
80
+ logger.info("#{count_marked_islands} islands marked (#{count_marked_elements} elements)")
81
+ return marked_aln
91
82
  end
92
83
 
93
84
  private
94
-
85
+
86
+ # Check a list of elements that form an island. First count the number
87
+ # of elements marked as being unique. If the island is more than 50%
88
+ # unique (i.e. less than 50% consensus with the rest of the alignment)
89
+ # all island elements are marked for masking. Returns the number of
90
+ # islands and elements marked as a tuple
95
91
  def mark_island island
96
- return if island.length < 2
92
+ return 0,0 if island.length < 2
97
93
  unique = 0
94
+ count_islands = 0
95
+ count_elements = 0
98
96
  island.each do |e|
99
97
  unique += 1 if e.state.unique == true
100
98
  end
@@ -104,7 +102,10 @@ module Bio
104
102
  island.each do |e|
105
103
  e.state.mask!
106
104
  end
105
+ count_islands += 1
106
+ count_elements += island.size
107
107
  end
108
+ return count_islands, count_elements
108
109
  end
109
110
 
110
111
  def print_island island
@@ -10,6 +10,7 @@ module Bio
10
10
 
11
11
  def initialize c
12
12
  @c = c
13
+ @c.freeze
13
14
  end
14
15
  def gap?
15
16
  @c == GAP
@@ -23,6 +24,13 @@ module Bio
23
24
  def == other
24
25
  to_s == other.to_s
25
26
  end
27
+ def clone
28
+ e = self.dup
29
+ if e.state != nil
30
+ e.state = e.state.clone
31
+ end
32
+ e
33
+ end
26
34
  end
27
35
 
28
36
  # Elements is a container for Element sequences.
@@ -34,6 +42,7 @@ module Bio
34
42
  attr_reader :id, :seq
35
43
  def initialize id, seq
36
44
  @id = id
45
+ @id.freeze
37
46
  @seq = []
38
47
  if seq.kind_of?(Elements)
39
48
  @seq = seq.clone
@@ -75,7 +84,7 @@ module Bio
75
84
  def clone
76
85
  copy = Elements.new(@id,"")
77
86
  @seq.each do |e|
78
- copy << e.dup
87
+ copy << e.clone
79
88
  end
80
89
  copy
81
90
  end
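The change from `e.dup` to `e.clone` matters because `dup` is a shallow copy: the copied element would still point at the same state object as the original. The sketch below illustrates the difference with simplified, hypothetical `State` and `Element` classes; they only mimic the shape of the classes in this diff.

```ruby
# Simplified stand-ins for illustration; not the library's classes
class State
  attr_accessor :masked
  def initialize
    @masked = false
  end
end

class Element
  attr_accessor :state
  def initialize(c)
    @c = c
  end

  # Copy the element and give the copy its own state object
  def clone
    e = dup
    e.state = e.state.clone unless e.state.nil?
    e
  end
end

original = Element.new("A")
original.state = State.new

shallow = original.dup
shallow.state.masked = true
p original.state.masked   # the shallow copy shares state with the original

deep = original.clone
deep.state.masked = false
p original.state.masked   # the original is unaffected by the deep copy
```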
@@ -0,0 +1,17 @@
1
+ module Bio
2
+ module BioAlignment
3
+ module FastaOutput
4
+ def FastaOutput::to_fasta seq
5
+ buf = ">"
6
+ buf += seq.id+"\n"
7
+ buf += if seq.kind_of?(CodonSequence)
8
+ seq.to_nt
9
+ else
10
+ seq.to_s
11
+ end
12
+ buf+"\n"
13
+ end
14
+ end
15
+ end
16
+ end
17
+
@@ -0,0 +1,25 @@
1
+ module Bio
2
+ module BioAlignment
3
+ module PhylipOutput
4
+ # Calculate header info from alignment and return as string
5
+ def PhylipOutput::header alignment
6
+ "#{alignment.size} #{alignment[0].length}\n"
7
+ end
8
+
9
+ # Output sequence PAML style and return as a multi-line string
10
+ def PhylipOutput::to_paml seq, size=60
11
+ buf = seq.id+"\n"
12
+ coding = if seq.kind_of?(CodonSequence)
13
+ seq.to_nt
14
+ else
15
+ seq.to_s
16
+ end
17
+ coding.scan(/.{1,#{size}}/).each do | section |
18
+ buf += section + "\n"
19
+ end
20
+ buf
21
+ end
22
+ end
23
+ end
24
+ end
25
+
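`to_paml` wraps the sequence with `scan(/.{1,size}/)`, which splits a string into chunks of at most `size` characters. A stand-alone sketch of that wrapping step follows; `wrap_sequence` is a hypothetical name, and the ID and sequence are plain strings rather than alignment objects.

```ruby
# Hypothetical helper mirroring the wrapping logic: ID on the first line,
# then the sequence in chunks of at most +size+ characters
def wrap_sequence(id, coding, size = 60)
  buf = id + "\n"
  coding.scan(/.{1,#{size}}/).each do |section|
    buf += section + "\n"
  end
  buf
end

print wrap_sequence("seq1", "atgcatgcaa", 4)
```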
@@ -0,0 +1,18 @@
1
+ module Bio
2
+ module BioAlignment
3
+ module TextOutput
4
+
5
+ def TextOutput::to_text seq, style
6
+ res = ""
7
+ res += Coerce::fetch_id(seq).to_s + "\t"
8
+ res += if seq.kind_of?(CodonSequence) and style == :codon
9
+ seq.to_s
10
+ else
11
+ Coerce::fetch_seq_string(seq)
12
+ end
13
+ res+"\n"
14
+ end
15
+
16
+ end
17
+ end
18
+ end
@@ -28,6 +28,7 @@ module Bio
28
28
  def initialize aln, row
29
29
  @aln = aln
30
30
  @row = row
31
+ freeze
31
32
  end
32
33
 
33
34
  def count &block
@@ -11,6 +11,7 @@ module Bio
11
11
  attr_reader :id, :seq
12
12
  def initialize id, seq
13
13
  @id = id
14
+ @id.freeze
14
15
  @seq = seq
15
16
  end
16
17
 
@@ -18,6 +19,10 @@ module Bio
18
19
  @seq[index]
19
20
  end
20
21
 
22
+ # def []= index, value --- we should not implement this for reasons of purity
23
+ # @seq[index] = value
24
+ # end
25
+
21
26
  def length
22
27
  @seq.length
23
28
  end
@@ -11,13 +11,13 @@ module Bio
       class Node
       end
 
-      # Make all nodes in the Bio::Tree aware of the tree object so we can use
-      # its methods
-      def Tree::init tree
+      # Make all nodes in the Bio::Tree aware of the tree object, and the alignment, so
+      # get a more intuitive API
+      def Tree::init tree, alignment
         if tree.kind_of?(Bio::Tree)
           # walk all nodes and infect the tree info
           tree.each_node do | node |
-            node.inject_tree(tree)
+            node.inject_tree(tree, alignment)
           end
           # tree.root.set_tree(tree)
         else
@@ -38,8 +38,10 @@ module Bio
   class Tree
     class Node
       # Add tree information to this node, so it can be queried
-      def inject_tree tree
+      def inject_tree tree, alignment
         @tree = tree
+        @tree.freeze
+        @alignment = alignment
         self
       end
 
@@ -102,6 +104,15 @@ module Bio
         end
         cs
       end
+
+      # Return the alignment attached to the tree
+      def alignment
+        @alignment
+      end
+
+      def sequence
+        @alignment.find(name)
+      end
     end # End of injecting Node functionality
 
     def find name
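The new `alignment` and `sequence` accessors let a tree node look up its own sequence through the injected alignment. The following sketch shows that lookup pattern in isolation. `TinySeq`, `TinyAlignment` and `TinyNode` are hypothetical stand-ins, and the assumption that `find` matches on the sequence id is mine, not something the diff confirms.

```ruby
# Hypothetical stand-ins for the gem's sequence, alignment and node classes.
TinySeq = Struct.new(:id, :seq)

class TinyAlignment
  def initialize(seqs)
    @seqs = seqs
  end

  # Assumption: look up a sequence by its id.
  def find(name)
    @seqs.find { |s| s.id == name }
  end
end

class TinyNode
  attr_reader :name, :alignment

  def initialize(name)
    @name = name
  end

  # Same shape as inject_tree in the diff: remember tree and alignment.
  def inject_tree(tree, alignment)
    @tree = tree
    @alignment = alignment
    self
  end

  # As in the hunk above: resolve this node's sequence via the alignment.
  def sequence
    @alignment.find(name)
  end
end

aln  = TinyAlignment.new([TinySeq.new("seq1", "atgcatgcaaaa"),
                          TinySeq.new("seq2", "atg---tcaaaa")])
node = TinyNode.new("seq2").inject_tree(:tree_placeholder, aln)
found = node.sequence
```

Keeping the alignment on each node avoids passing it around separately, at the cost of coupling a tree to one alignment.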
@@ -108,3 +108,18 @@ describe "BioAlignment::DelBridges for codons" do
   aln3 = aln2.columns_where { |col| !col.state.deleted? }
   aln3.columns.size.should == 399
 end
+
+# require 'bio' # BioRuby
+require 'bio-alignment/bioruby' # make Bio::Sequence enumerable
+
+describe "BioAlignment::BioRuby interface" do
+  include Bio::BioAlignment
+
+  aln = Alignment.new
+  aln << Bio::Sequence::NA.new("atgcatgcaaaa")
+  aln << Bio::Sequence::NA.new("atg---tcaaaa")
+  aln[0].should == "atgcatgcaaaa"
+  aln[1].should == "atg---tcaaaa"
+  Coerce::fetch_seq_string(aln[0]).should == "atgcatgcaaaa"
+  test = Coerce::fetch_id(aln[0]) # JRuby may have a name collision with object.id
+end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: bio-alignment
 version: !ruby/object:Gem::Version
-  version: 0.0.7
+  version: 0.0.8
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-06-25 00:00:00.000000000Z
+date: 2014-05-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bio-logger
-  requirement: &83191660 !ruby/object:Gem::Requirement
+  requirement: !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,15 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *83191660
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: bio
-  requirement: &83191360 !ruby/object:Gem::Requirement
+  requirement: !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -32,10 +37,15 @@ dependencies:
         version: 1.4.2
   type: :runtime
   prerelease: false
-  version_requirements: *83191360
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 1.4.2
 - !ruby/object:Gem::Dependency
   name: rake
-  requirement: &83190960 !ruby/object:Gem::Requirement
+  requirement: !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -43,10 +53,15 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *83190960
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: bio-bigbio
-  requirement: &83190640 !ruby/object:Gem::Requirement
+  requirement: !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>'
@@ -54,10 +69,15 @@ dependencies:
         version: 0.1.3
   type: :development
   prerelease: false
-  version_requirements: *83190640
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>'
+      - !ruby/object:Gem::Version
+        version: 0.1.3
 - !ruby/object:Gem::Dependency
   name: cucumber
-  requirement: &83190190 !ruby/object:Gem::Requirement
+  requirement: !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -65,10 +85,15 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *83190190
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &83095350 !ruby/object:Gem::Requirement
+  requirement: !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,10 +101,15 @@ dependencies:
         version: 2.10.0
   type: :development
   prerelease: false
-  version_requirements: *83095350
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 2.10.0
 - !ruby/object:Gem::Dependency
   name: bundler
-  requirement: &83094950 !ruby/object:Gem::Requirement
+  requirement: !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -87,10 +117,15 @@ dependencies:
         version: 1.0.21
   type: :development
   prerelease: false
-  version_requirements: *83094950
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 1.0.21
 - !ruby/object:Gem::Dependency
   name: jeweler
-  requirement: &83094400 !ruby/object:Gem::Requirement
+  requirement: !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -98,7 +133,12 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *83094400
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 description: Alignment handler for multiple sequence alignments (MSA)
 email: pjotr.public01@thebird.nl
 executables:
@@ -107,6 +147,7 @@ extensions: []
 extra_rdoc_files:
 - LICENSE.txt
 - README.md
+- TODO
 files:
 - .document
 - .rspec
@@ -115,6 +156,7 @@ files:
 - LICENSE.txt
 - README.md
 - Rakefile
+- TODO
 - VERSION
 - bin/bio-alignment
 - doc/bio-alignment-design.md
@@ -144,10 +186,12 @@ files:
 - features/phylogeny/tree.feature
 - features/rows-feature.rb
 - features/rows.feature
+- features/support/env.rb
 - lib/bio-alignment.rb
 - lib/bio-alignment/alignment.rb
 - lib/bio-alignment/bioruby.rb
 - lib/bio-alignment/codonsequence.rb
+- lib/bio-alignment/coerce.rb
 - lib/bio-alignment/columns.rb
 - lib/bio-alignment/edit/del_bridges.rb
 - lib/bio-alignment/edit/del_non_informative_sequences.rb
@@ -158,6 +202,9 @@ files:
 - lib/bio-alignment/edit/mask_serial_mutations.rb
 - lib/bio-alignment/edit/tree_splitter.rb
 - lib/bio-alignment/elements.rb
+- lib/bio-alignment/format/fasta.rb
+- lib/bio-alignment/format/phylip.rb
+- lib/bio-alignment/format/text.rb
 - lib/bio-alignment/pal2nal.rb
 - lib/bio-alignment/rows.rb
 - lib/bio-alignment/sequence.rb
@@ -186,7 +233,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: 900281341
+      hash: 3021753307968946034
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
@@ -195,7 +242,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 1.8.6
+rubygems_version: 1.8.23
 signing_key:
 specification_version: 3
 summary: Support for multiple sequence alignments (MSA)