RubyGems - bio-table - Versions diffs - 0.0.5 → 0.0.6 - Mend

bio-table 0.0.5 → 0.0.6

Files changed (25)

data/.travis.yml +1 -1
data/Gemfile +1 -1
data/README.md +49 -6
data/VERSION +1 -1
data/bin/bio-table +24 -7
data/features/cli.feature +40 -0
data/lib/bio-table/filter.rb +16 -1
data/lib/bio-table/formatter.rb +30 -2
data/lib/bio-table/parser.rb +13 -1
data/lib/bio-table/rdf.rb +1 -1
data/lib/bio-table/table.rb +7 -3
data/lib/bio-table/table_apply.rb +9 -2
data/lib/bio-table/tableload.rb +5 -0
data/lib/bio-table/tablewriter.rb +5 -2
data/test/data/input/table_no_headers.txt +5 -0
data/test/data/input/table_split_on.txt +6 -0
data/test/data/regression/table1-append.ref +753 -0
data/test/data/regression/table1-diff.ref +7 -0
data/test/data/regression/table1-html.ref +380 -0
data/test/data/regression/table1-latex.ref +380 -0
data/test/data/regression/table1-merge.ref +380 -0
data/test/data/regression/table_filter_headers.ref +5 -0
data/test/data/regression/table_split_on_regex.ref +11 -0
data/test/data/regression/table_split_on_string.ref +11 -0
metadata +33 -23

data/.travis.yml CHANGED

@@ -2,8 +2,8 @@ language: ruby
 rvm:
   - 1.9.2
   - 1.9.3
-  - jruby-19mode # JRuby in 1.9 mode
   - rbx-19mode
+#  - jruby-19mode # JRuby in 1.9 mode
 #  - 1.8.7
 #  - jruby-18mode # JRuby in 1.8 mode
 #  - rbx-18mode

data/Gemfile CHANGED

@@ -15,5 +15,5 @@ group :development do
   gem "jeweler", "~> 1.8.3"
   gem "bio", ">= 1.4.2"
   gem "rdoc", "~> 3.12"
-  gem "regressiontest"
+  gem "regressiontest", ">= 0.0.2"
 end

data/README.md CHANGED

@@ -25,16 +25,17 @@ you don't need to know Ruby to use the command line interface (CLI).
 Features:
-* Support for converting TAB and CSV files
+* Support for reading and writing TAB and CSV files, as well as regex splitters
 * Filter on data
 * Transform table and data by column or row
 * Recalculate data
 * Diff between tables, selecting on specific column values
 * Merge tables side by side on column value/rowname
 * Split/reduce tables by column
+* Write formatted tables, e.g. HTML, LaTeX
 * Read from STDIN, write to STDOUT
 * Convert table to RDF
-* Convert table to JSON (nyi)
+* Convert table to JSON/YAML/XML (nyi)
 * etc. etc.
 and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
@@ -68,14 +69,21 @@ csv, it will assume CSV. To convert the table back
     bio-table test1.tab --format csv > table1.csv
 ```
+It is also possible to use a string or regex splitter, e.g.
+```sh
+    bio-table --in-format split --split-on ',' test/data/input/table_split_on.txt
+    bio-table --in-format regex --split-on '\s*,\s*' test/data/input/table_split_on.txt
+```
 To filter out rows that contain certain values
 ```sh
     bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
 ```
-The filter ignores the header row, and the row names. If you need
-either, use the switches --with-header and --with-rownames. With math, list all rows
+The filter ignores the header row, and the row names, by default. If you need
+either, use the switches --with-headers and --with-rownames. With math, list all rows
 ```sh
     bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
@@ -174,7 +182,7 @@ To combine tables side by side use the --merge switch:
     bio-table --merge table1.csv table2.csv
 ```
-all rownames will be matched (i.e. the input table order do not need
+all rownames will be matched (i.e. the input table does not need
 to be sorted). For non-matching rownames the fields will be filled
 with NA's, unless you add a filter, e.g.
@@ -226,7 +234,10 @@ finds the overlapping rows, based on the content of column 2.
 ### Different parsers
-more soon
+bio-table currently reads comma separated files and tab delimited
+files.
+(more soon)
 ### Using STDIN
@@ -240,6 +251,38 @@ piped in is the first input file
 will filter both files test1.tab and test1.csv and output to
 test1a.tab.
+### Formatted output
+bio-table has built-in formatters - for CSV and TAB, and for RDF
+(and soon for JSON/YAML and perhaps even XML). The RDF format is
+discussed in 'Output table to RDF'.
+Another flexible option for formatting a table is to create programmatic output
+through a formatter.  If you set the --format switch to *eval*, you
+can add the -e 'command' that is evaluated to print to STDOUT. For
+example, bio-table does not support HTML output directly, but if we
+were to create an HTML table, we could run
+```sh
+    bio-table --format eval -e '"<tr><td>"+field.join("</td><td>")+"</td></tr>"' table1.csv
+```
+likewise to create a LaTeX table we could
+```sh
+    bio-table --columns gene_symbol,gene_desc --format eval -e 'field.join(" & ")+" \\\\"' table1.csv
+```
+Since fields can be accessed independently, you can add any markup for
+fields, e.g.
+```sh
+    bio-table --columns ID,Description,Date --format eval -e'"\\emph{"+field[0]+"} & "+ field[1..-1].join(" & ")+"\\\\"' table1.csv
+```
+Because of the evaluation formatter bio-table does not need to implement the machinery for
+every output format on the planet!
 ### Output table to RDF
 bio-table can write a table into turtle RDF triples (part of the semantic

data/VERSION CHANGED

@@ -1 +1 @@
-0.0.5
+0.0.6

data/bin/bio-table CHANGED

@@ -43,6 +43,10 @@ opts = OptionParser.new do |o|
     options[:num_filter] = par
   end
+  o.on('--filter expression', 'Generic filtering function') do |par|
+    options[:filter] = par
+  end
   o.on('--rewrite expression', 'Rewrite function') do |par|
     options[:rewrite] = par
   end
@@ -81,18 +85,23 @@ opts = OptionParser.new do |o|
   o.separator "\n\tOverrides:\n\n"
-  # o.on('--with-header','Include the header element in filtering etc.') do
-  #   options[:with_header] = true
-  # end
   o.on('--skip lines',Integer,'Skip the first lines before parsing') do |skip|
     options[:skip] = skip
   end
+  o.on('--with-headers','Include the header element in filtering etc.') do
+    options[:with_headers] = true
+    options[:write_header] = false
+  end
   o.on('--with-rownames','Include the rownames in filtering etc.') do
     options[:with_rownames] = true
   end
+  o.on('--strip-quotes','Strip quotes from table fields') do
+    options[:strip_quotes] = true
+  end
   o.separator "\n\tTransform:\n\n"
   o.on('--transform-ids [downcase,upcase]',[:downcase,:upcase],'Transform column and row identifiers') do |par|
@@ -101,14 +110,22 @@ opts = OptionParser.new do |o|
   o.separator "\n\tFormat and options:\n\n"
-  o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
+  o.on('--in-format [tab,csv,split,regex]', [:tab, :csv, :split, :regex], 'Input format (default tab)') do |par|
     options[:in_format] = par.to_sym
   end
-  o.on('--format [tab,csv,rdf]', [:tab, :csv, :rdf], 'Output format (default tab)') do |par|
+  o.on('--format [tab,csv,rdf,eval]', [:tab, :csv, :rdf, :eval], 'Output format (default tab)') do |par|
     options[:format] = par.to_sym
   end
+  o.on("--split-on command",String,"Split on string or regex (use with --in-format)") do | s |
+    options[:split_on] = s
+  end
+  o.on("-e command",String,"Evaluate output command (use with --format eval)") do | s |
+    options[:evaluate] = s
+  end
   o.on('--blank-nodes','Output (RDF) blank nodes - allowing for duplicate row names') do
     options[:blank_nodes] = true
   end
@@ -203,7 +220,7 @@ writer =
   if options[:format] == :rdf
     BioTable::RDF::Writer.new(options[:blank_nodes])
   else
-    BioTable::TableWriter::Writer.new(options[:format])
+    BioTable::TableWriter::Writer.new(options[:format],options[:evaluate])
   end
 if INPUT_ON_STDIN

data/features/cli.feature CHANGED

@@ -33,9 +33,49 @@ Feature: Command-line interface (CLI)
     When I execute "./bin/bio-table --format rdf --transform-ids downcase"
     Then I expect the named output to match "table1-rdf1"
+  Scenario: Write HTML format
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --format eval -e '"<tr><td>"+field.join("</td><td>")+"</td></tr>"'"
+    Then I expect the named output to match "table1-html"
+  Scenario: Write LaTeX format
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --columns gene_symbol,gene_desc --format eval -e 'field.join(" & ")+" \\\\"'"
+    Then I expect the named output to match "table1-latex"
+  Scenario: Merge tables horizontally
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --merge test/data/input/table2.csv"
+    Then I expect the named output to match "table1-merge"
+  Scenario: Merge tables vertically
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table test/data/input/table2.csv"
+    Then I expect the named output to match "table1-append"
+  Scenario: Diff tables
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --diff test/data/input/table2.csv"
+    Then I expect the named output to match "table1-diff"
   Scenario: Read from STDIN
     Given I have input file(s) named "test/data/input/table1.csv"
     When I execute "cat test/data/input/table1.csv|./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
     Then I expect the named output to match "table1-STDIN"
+  Scenario: Use special string splitter
+    Given I have input file(s) named "test/data/input/table_split_on.txt"
+    When I execute "./bin/bio-table test/data/input/table_split_on.txt --in-format split --split-on ','"
+    Then I expect the named output to match "table_split_on_string"
+  Scenario: Use special regex splitter
+    Given I have input file(s) named "test/data/input/table_split_on.txt"
+    When I execute "./bin/bio-table test/data/input/table_split_on.txt --in-format regex --split-on '\s*,'"
+    Then I expect the named output to match "table_split_on_regex"
+  Scenario: Use header in filter
+    Given I have input file(s) named "test/data/input/table_no_headers.txt"
+    When I execute "./bin/bio-table --in-format split --split-on ',' --num-filter 'values[1]!=0' --with-headers"
+    Then I expect the named output to match "table_filter_headers"

data/lib/bio-table/filter.rb CHANGED

@@ -70,7 +70,7 @@ module BioTable
     def Filter::numeric code, fields
       return true if code == nil
       if fields
-        # values = fields.map { |field| (valid_number?(field) ? field.to_f : nil ) } # FIXME: not so lazy
+        # values = fields.map { |field| (valid_number?(field) ? field.to_f : nil ) }
         values = LazyValues.new(fields)
         begin
           eval(code)
@@ -82,6 +82,21 @@ module BioTable
         false
       end
     end
+    def Filter::generic code, tablefields
+      return true if code == nil
+      if tablefields
+        field = tablefields.dup
+        begin
+          eval(code)
+        rescue Exception
+          $stderr.print "Failed to evaluate ",fields," with ",code,"\n"
+          raise
+        end
+      else
+        false
+      end
+    end
   end
 end
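
The new `Filter::generic` evaluates the `--filter` expression with the row's fields bound to a local named `field`, and keeps the row when the expression is truthy. Below is a minimal standalone sketch of that idea; the name `generic_filter` and the sample rows are hypothetical and not part of bio-table. Note that the rescue branch in the hunk prints `fields`, which is not a local variable in the method as shown, so whether that message can be printed depends on code outside this hunk.

```ruby
# Hypothetical stand-in for Filter::generic (illustration only).
# The expression is plain Ruby and sees the row through the local `field`.
def generic_filter(code, tablefields)
  return true if code.nil?          # no filter given: keep every row
  return false unless tablefields   # no data fields: drop the row
  field = tablefields.dup           # copy, so the expression cannot alter the row
  eval(code)
end

# Made-up sample rows: [gene symbol, p-value as string]
rows = [["BRCA1", "0.01"], ["TP53", "0.20"]]
kept = rows.select { |r| generic_filter('field[1].to_f < 0.05', r) }
p kept
```

Because the expression is passed to `eval`, it runs with the full privileges of the calling process and should only come from a trusted source.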

data/lib/bio-table/formatter.rb CHANGED

@@ -17,13 +17,28 @@ module BioTable
       end
       l
     end
+    def Formatter::strip_quotes list
+      list.map { |field|
+        if field == nil
+          nil
+        else
+          first = field[0,1]
+          if first == "\"" or first == "'"
+            last = field[-1,1]
+            if first == last
+              field = field[1..-2]
+            end
+          end
+          field
+        end
+      }
+    end
   end
   class TabFormatter
     def write list
       print list.map{|field| (field==nil ? "NA" : field)}.join("\t"),"\n"
     end
   end
   class CsvFormatter
@@ -35,9 +50,22 @@ module BioTable
     end
   end
+  class EvalFormatter
+    def initialize evaluate
+      @evaluate = evaluate
+    end
+    def write list
+      field = list.dup.map { |e| (e==nil ? "" : e) }
+      print eval(@evaluate)
+      print "\n"
+    end
+  end
   module FormatFactory
-    def self.create format
+    def self.create format, evaluate
       # @logger.info("Formatting to #{format}")
+      return EvalFormatter.new(evaluate) if evaluate
       return CsvFormatter.new if format == :csv
       return TabFormatter.new
     end

data/lib/bio-table/parser.rb CHANGED

@@ -5,9 +5,21 @@ module BioTable
   module LineParser
     # Converts a string into an array of string fields
-    def LineParser::parse(line, in_format)
+    def LineParser::parse(line, in_format, split_on)
       if in_format == :csv
         CSV.parse(line)[0]
+      elsif in_format == :split
+        line.strip.split(split_on).map { |field|
+          fld = field.strip
+          fld = nil if fld == "NA"
+          fld
+        }
+      elsif in_format == :regex
+        line.strip.split(/#{split_on}/).map { |field|
+          fld = field.strip
+          fld = nil if fld == "NA"
+          fld
+        }
       else
         line.strip.split("\t").map { |field|
           fld = field.strip
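
The two new branches differ only in how `split_on` is interpreted: `:split` hands the string to `String#split` as a literal separator, while `:regex` interpolates it into a `Regexp`. The sketch below shows the difference in isolation; the helper `split_fields` and the sample lines are hypothetical and only mirror the strip and `NA` handling visible in the hunk.

```ruby
# Hypothetical helper (not part of bio-table) mirroring the :split and
# :regex branches of LineParser::parse shown in this diff.
def split_fields(line, in_format, split_on)
  separator = (in_format == :regex ? /#{split_on}/ : split_on)
  line.strip.split(separator).map do |field|
    fld = field.strip
    fld == "NA" ? nil : fld   # "NA" is turned into nil
  end
end

line = "gene1 , 0.5 ,NA, 2\n"
by_string = split_fields(line, :split, ",")
by_regex  = split_fields(line, :regex, '\s*,\s*')

# A character class only works as a separator in regex mode.
literal = split_fields("a;b|c", :split, "[;|]")
pattern = split_fields("a;b|c", :regex, "[;|]")
p by_string, by_regex, literal, pattern
```

For a plain comma both modes give the same fields here, because every field is stripped after splitting; the modes diverge once the separator needs pattern syntax.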

data/lib/bio-table/rdf.rb CHANGED

@@ -38,7 +38,7 @@ module BioTable
     #
     # The method returns a String.
-    def RDF::row(row, header, use_blank_nodes)
+    def RDF::row(row, header, use_blank_nodes = false)
       list = []
       rowname = make_identifier(row[0])
       list << ":#{rowname}"+(use_blank_nodes ? " :row [ " : " ") + "rdf:label \"#{row[0]}\" ; a :rowname"

data/lib/bio-table/table.rb CHANGED

@@ -23,7 +23,7 @@ module BioTable
     # Read lines (list/array of string) and add them to the table, setting row
     # names and row fields. The first row is assumed to be the header and
-    # ignored if the header has been set.
+    # ignored if the header has been set (the case with merge/concat tables).
     def read_lines lines, options = {}
       table_apply = TableApply.new(options)
@@ -62,9 +62,13 @@ module BioTable
     def write options = {}
       format = options[:format]
       format = :tab if not format
-      formatter = FormatFactory::create(format)
+      evaluate = nil
+      if format == :eval
+        evaluate = options[:evaluate]
+      end
+      formatter = FormatFactory::create(format,evaluate)
       formatter.write(@header) if options[:write_header]
-      each do | tablerow |
+      each do | tablerow,num |
         # p tablerow
         formatter.write(tablerow.all_fields) if tablerow.all_valid?
       end

data/lib/bio-table/table_apply.rb CHANGED

@@ -11,12 +11,16 @@ module BioTable
       # @logger.debug "Skipping #{@skip} lines" if @skip
       @num_filter  = options[:num_filter]
       @logger.debug "Filtering on #{@num_filter}" if @num_filter
+      @filter  = options[:filter]
+      @logger.debug "Filtering on #{@filter}" if @filter
       @rewrite  = options[:rewrite]
       @logger.debug "Rewrite #{@rewrite}" if @rewrite
       @use_columns = options[:columns]
       @logger.debug "Filtering on columns #{@use_columns}" if @use_columns
       @column_filter = options[:column_filter]
       @logger.debug "Filtering on column names #{@column_filter}" if @column_filter
+      @strip_quotes = options[:strip_quotes]
+      @logger.debug "Strip quotes #{@strip_quotes}" if @strip_quotes
       @transform_ids = options[:transform_ids]
       @logger.debug "Transform ids #{@transform_ids}" if @transform_ids
       @include_rownames = options[:with_rownames]
@@ -25,7 +29,8 @@ module BioTable
     end
     def parse_header(line, options)
-      header = LineParser::parse(line, options[:in_format])
+      header = LineParser::parse(line, options[:in_format], options[:split_on])
+      header = Formatter::strip_quotes(header) if @strip_quotes
       return Formatter::transform_header_ids(@transform_ids, header) if @transform_ids
       header
     end
@@ -38,8 +43,9 @@ module BioTable
     end
     def parse_row(line_num, line, column_idx, last_fields, options)
-      fields = LineParser::parse(line, options[:in_format])
+      fields = LineParser::parse(line, options[:in_format], options[:split_on])
       return nil,nil if fields.compact == []
+      fields = Formatter::strip_quotes(fields) if @strip_quotes
       fields = Formatter::transform_row_ids(@transform_ids, fields) if @transform_ids
       fields = Filter::apply_column_filter(fields,column_idx)
       return nil,nil if fields.compact == []
@@ -48,6 +54,7 @@ module BioTable
       if data_fields.size > 0
         return nil,nil if not Validator::valid_row?(line_num, data_fields, last_fields)
         return nil,nil if not Filter::numeric(@num_filter,data_fields)
+        return nil,nil if not Filter::generic(@filter,data_fields)
         (rowname, data_fields) = Rewrite::rewrite(@rewrite,rowname,data_fields)
       end
       return rowname, data_fields
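
With `--strip-quotes` set, both `parse_header` and `parse_row` now pass their fields through `Formatter::strip_quotes` before any id transformation or filtering. The following standalone sketch reproduces the quote-stripping rule from the formatter.rb hunk (a field is unquoted only when it starts and ends with the same quote character); the free-standing method and sample data are for illustration only.

```ruby
# Standalone copy of the rule in Formatter::strip_quotes (illustration only).
def strip_quotes(list)
  list.map do |field|
    next nil if field.nil?                     # nil fields pass through
    first = field[0, 1]
    quoted = (first == '"' || first == "'") && first == field[-1, 1]
    quoted ? field[1..-2] : field              # drop one matching pair of quotes
  end
end

stripped = strip_quotes(['"id"', "'x'", '"open', nil, "plain"])
p stripped
```

Unbalanced quotes, as in the third sample field, are left untouched.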

data/lib/bio-table/tableload.rb CHANGED

@@ -23,6 +23,11 @@ module BioTable
             column_index,header = table_apply.column_index(header) # we may rewrite the header
             yielder.yield header,:header if options[:write_header] != false
             prev_line = header[1..-1]
+            # When a header filter is defined, rewind the generator, note that skip won't work
+            # properly (FIXME)
+            if options[:with_headers]
+              generator.rewind
+            end
           elsif line_num-skip < 0
             # do nothing
           else