bio-table 0.0.4 → 0.0.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/README.md +85 -22
- data/VERSION +1 -1
- data/bin/bio-table +45 -18
- data/features/cli.feature +11 -1
- data/lib/bio-table.rb +1 -0
- data/lib/bio-table/filter.rb +2 -1
- data/lib/bio-table/formatter.rb +19 -1
- data/lib/bio-table/rdf.rb +106 -0
- data/lib/bio-table/table_apply.rb +10 -2
- data/lib/bio-table/tableload.rb +8 -4
- data/lib/bio-table/tablerow.rb +2 -1
- data/lib/bio-table/tablewriter.rb +1 -1
- data/test/data/regression/table1-STDIN.ref +1138 -0
- data/test/data/regression/table1-columns-indexed.ref +0 -2
- data/test/data/regression/table1-columns-regex.ref +0 -2
- data/test/data/regression/table1-columns.ref +0 -2
- data/test/data/regression/table1-rdf1.ref +415 -0
- metadata +24 -21
data/README.md
CHANGED
@@ -33,13 +33,13 @@ Features:
 * Merge tables side by side on column value/rowname
 * Split/reduce tables by column
 * Read from STDIN, write to STDOUT
+* Convert table to RDF
 * Convert table to JSON (nyi)
-* Convert table to RDF (nyi)
 * etc. etc.
 
 and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
-takes 0.
-my 3.2 GHz desktop.
+takes 0.87 second. Adding a filter makes it parse at 0.95 second on
+my 3.2 GHz desktop (with preloaded disk cache).
 
 Note: this software is under active development, though what is
 documented here should just work.
@@ -57,40 +57,40 @@ documented here should just work.
 Tables can be transformed through the command line. To transform a
 comma separated file to a tab delimited one
 
-```
+```sh
 bio-table test/data/input/table1.csv --in-format csv --format tab > test1.tab
 ```
 
 Tab is actually the general default. Still, if the file name ends in
 csv, it will assume CSV. To convert the table back
 
-```
+```sh
 bio-table test1.tab --format csv > table1.csv
 ```
 
 To filter out rows that contain certain values
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
 ```
 
 The filter ignores the header row, and the row names. If you need
 either, use the switches --with-header and --with-rownames. With math, list all rows
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
 ```
 
 or, list all rows that have at least one field with a value >= 1000.0
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values.max >= 1000.0" > test1a.tab
 ```
 
 Produce all rows that have at least 3 values above 3.0 and one value
 above 10.0:
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
 ```
 
@@ -100,7 +100,7 @@ The --num-filter will convert fields lazily to numerical values (only
 valid numbers are converted). If there are NA (nil) values in the table, you
 may wish to remove them, like this
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values[0..12].compact.max >= 1000.0" > test1a.tab
 ```
 
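Note that the --num-filter expressions above are ordinary Ruby, evaluated once per data row against a `values` array. A minimal sketch of the idea (a hypothetical helper for illustration, not the gem's implementation; see the filter.rb hunk further down for the real number check):

```ruby
# Hypothetical sketch: apply a --num-filter expression to one row.
# Fields arrive as strings; valid numbers become floats, the rest nil (NA).
def num_filter?(expression, fields)
  values = fields.map { |f| Float(f) rescue nil }
  eval(expression) # e.g. "values[0..12].compact.max >= 1000.0"
end

num_filter?("values.compact.max >= 1000.0", ["12.5", "NA", "1200.0"]) # => true
```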
@@ -109,27 +109,27 @@ which takes the first 13 fields and compact removes the nil values.
 Also string comparisons and regular expressions can be used. E.g.
 filter on rownames and a row field both containing 'BGT'
 
-```
+```sh
 # not yet implemented
 bio-table test/data/input/table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/" > test1a.tab
 ```
 
 To reorder/reduce table columns by name
 
-```
+```sh
 bio-table test/data/input/table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19 > test1a.tab
 ```
 
 or use their index numbers (the first column is zero)
 
-```
+```sh
 bio-table test/data/input/table1.csv --columns 0,1,8,2,4,6 > test1a.tab
 ```
 
 
 To filter for columns using a regular expression
 
-```
+```sh
 bio-table table1.csv --column-filter 'colname !~ /infected/i'
 ```
 
@@ -139,7 +139,7 @@ case.
 Finally we can rewrite the content of a table using rowname and fields
 again
 
-```
+```sh
 bio-table table1.csv --rewrite 'rowname.upcase!; field[1]=nil if field[2].to_f<0.25'
 ```
 
@@ -150,7 +150,7 @@ empty if the third field is below 0.25.
 
 To sort a table on column 4 and 2
 
-```
+```sh
 # not yet implemented
 bio-table test/data/input/table1.csv --sort 4,2 > test1a.tab
 ```
@@ -161,20 +161,26 @@ Note: not all is implemented (just yet). Please check bio-table --help first.
 
 You can combine/concat two or more tables by passing in multiple file names
 
+```sh
 bio-table test/data/input/table1.csv test/data/input/table2.csv
+```
 
 this will append table2 to table1, assuming they have the same headers
 (you can use the --columns switch!)
 
 To combine tables side by side use the --merge switch:
 
+```sh
 bio-table --merge table1.csv table2.csv
+```
 
 all rownames will be matched (i.e. the input tables do not need
 to be sorted). For non-matching rownames the fields will be filled
 with NA's, unless you add a filter, e.g.
 
+```sh
 bio-table --merge table1.csv table2.csv --num-filter "values.compact.size == values.size"
+```
 
 ### Splitting a table
 
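To make the --merge semantics just described concrete, here is a hedged sketch (illustrative Ruby only, not bio-table's code) of joining two tables on rowname with NA (nil) fill:

```ruby
# Illustrative only: side-by-side merge on rowname; rownames missing from
# one table get nil (printed as NA) fields, as the README text describes.
def merge_tables(t1, t2)
  w1 = t1.values.first.size
  w2 = t2.values.first.size
  (t1.keys | t2.keys).map do |name|
    [name] + (t1[name] || [nil] * w1) + (t2[name] || [nil] * w2)
  end
end

t1 = { "r1" => ["1"], "r2" => ["2"] }
t2 = { "r1" => ["9"] }
merge_tables(t1, t2) # => [["r1", "1", "9"], ["r2", "2", nil]]
```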
@@ -188,24 +194,32 @@ overlap, based on shared columns. The bio-table diff command shows the
 difference between two tables using the row names (i.e. those rows
 with rownames that appear in table2, but not in table1)
 
+```sh
 bio-table --diff 0 table1.csv table2.csv
+```
 
 bio-table --diff is different from the standard Unix diff tool. The
 latter shows insertions and deletions. bio-table --diff shows what is
 in one file, and not in the other (insertions). To see deletions,
 reverse the file order, i.e. switch the file names
 
+```sh
 bio-table --diff 0 table2.csv table1.csv
+```
 
 To diff on something else
 
+```sh
 bio-table --diff 0,3 table2.csv table1.csv
+```
 
 creates a key using columns 0 and 3 (0 is the rownames column).
 
 Similarly
 
+```sh
 bio-table --overlap 2 table1.csv table2.csv
+```
 
 finds the overlapping rows, based on the content of column 2.
 
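The key-based --diff/--overlap semantics can be pictured with a short sketch (illustrative only, not the gem's code; the key is built from column indices exactly as the README text above describes):

```ruby
# Illustrative only: keep rows of t2 whose key, built from the given
# column indices, does not occur in t1 (the idea behind --diff 0,3).
def table_diff(cols, t1, t2)
  key  = ->(row) { row.values_at(*cols).join("\t") }
  seen = t1.map(&key)
  t2.reject { |row| seen.include?(key.call(row)) }
end

t1 = [["r1", "a"], ["r2", "b"]]
t2 = [["r2", "b"], ["r3", "c"]]
table_diff([0], t1, t2) # => [["r3", "c"]]
```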
@@ -219,14 +233,55 @@ more soon
 bio-table can read data from STDIN, by simply assuming that the data
 piped in is the first input file
 
-```
+```sh
 cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
 ```
 
 will filter both files test1.tab and table1.csv and output to
 test1a.tab.
 
-
+### Output table to RDF
+
+bio-table can write a table into turtle RDF triples (part of the semantic
+web!), so you can put the data directly into a triple-store.
+
+```sh
+bio-table --format rdf table1.csv
+```
+
+The table header is stored with predicate :colname using the header
+values both as subject and label, with the :index:
+
+```rdf
+:header3 rdf:label "Header3" ; a :colname ; :index 4 .
+```
+
+Rows are stored with rowname as subject and label, followed by the
+columns referring to the header triples, and the values. E.g.
+
+```rdf
+:row13475701 rdf:label "row13475701" ; a :rowname ; :Id "row13475701" ; :header1 "1" ; :header2 "0" ; :header3 "3" .
+```
+
+To unify identifier names you may want to transform ids:
+
+```sh
+bio-table --format rdf --transform-ids "downcase" table1.csv
+```
+
+Another interesting option is --blank-nodes. This causes rows to be
+written as blank nodes, and allows for duplicate row names. E.g.
+
+```rdf
+:row13475701 [ rdf:label "row13475701" ; a :rowname ; :Id "row13475701" ; :header1 "1" ; :header2 "0" ; :header3 "3" ] .
+```
+The bio-rdf gem actually uses this bio-table biogem to parse data into a
+triple store and query the data through SPARQL. For examples see the
+features, e.g. the
+[genotype to RDF feature](https://github.com/pjotrp/bioruby-rdf/blob/master/features/genotype-table-to-rdf.feature).
+
+
+## bio-table API (for Ruby programmers)
 
 ```ruby
 require 'bio-table'
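As a hedged illustration of the --blank-nodes idea (a sketch only; the real serializer lives in the new lib/bio-table/rdf.rb added by this release):

```ruby
# Illustrative only: serialize one row as a turtle blank node, so that
# duplicate row names cannot collide as subjects.
def row_as_blank_node(rowname, header, fields)
  props = header.zip(fields).map { |h, v| %(:#{h} "#{v}") }.join(" ; ")
  %([ rdf:label "#{rowname}" ; a :rowname ; #{props} ] .)
end

puts row_as_blank_node("row1", %w[header1 header2], %w[1 0])
# => [ rdf:label "row1" ; a :rowname ; :header1 "1" ; :header2 "0" ] .
```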
@@ -315,17 +370,25 @@ file twice, but being able to handle much larger data.
 
 In above examples we loaded the whole table in memory. It is also
 possible to execute functions without using RAM by using the emit
-function. This is what the bio-table CLI does
+function. This is what the bio-table CLI does to convert a CSV table
+to tab delimited:
 
 ```ruby
 ARGV.each do | fn |
-
-
+  f = File.open(fn)
+  writer = BioTable::TableWriter::Writer.new(format: :tab)
+  BioTable::TableLoader.emit(f, in_format: :csv).each do |row,type|
+    writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
 end
-
 ```
 
+Essentially you can pass in any object that has the *each* method to
+iterate through rows as String (f's each method reads in a line at a
+time). The emit function yields the parsed row object as a simple
+array of fields (each field a String). The type is used to distinguish
+the header row.
+
 ### Loading a numerical matrix
 
 Coming soon
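The new README text above notes that emit accepts any each-able line source; a small hedged example of that point (the sample lines are made up, and the exact namespacing may differ in your version):

```ruby
require 'bio-table'

# Any object whose each method yields one line at a time can feed emit;
# here an in-memory array stands in for a file (sample data made up).
lines  = ["#Id,header1,header2", "row1,1,0", "row2,4,2"]
writer = BioTable::TableWriter::Writer.new(format: :tab)
BioTable::TableLoader.emit(lines, in_format: :csv).each do |row, type|
  # Depending on namespacing you may need BioTable::TableRow here.
  writer.write(BioTable::TableRow.new(row[0], row[1..-1]), type)
end
```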
data/VERSION
CHANGED
@@ -1 +1 @@
-0.0.4
+0.0.5
data/bin/bio-table
CHANGED
@@ -33,19 +33,12 @@ log = Bio::Log::LoggerPlus.new 'bio-table'
 Bio::Log::CLI.logger('stderr')
 Bio::Log::CLI.trace('info')
 
-options = {show_help: false, write_header: true}
+options = {show_help: false, write_header: true, skip: 0}
 options[:show_help] = true if ARGV.size == 0 and not INPUT_ON_STDIN
 opts = OptionParser.new do |o|
   o.banner = "Usage: #{File.basename($0)} [options] filename\n\n"
 
-  o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
-    options[:in_format] = par.to_sym
-  end
-
-  o.on('--format [tab,csv]', [:tab, :csv], 'Output format (default tab)') do |par|
-    options[:format] = par.to_sym
-  end
-
+
   o.on('--num-filter expression', 'Numeric filtering function') do |par|
     options[:num_filter] = par
   end
@@ -82,21 +75,50 @@ opts = OptionParser.new do |o|
     options[:overlap] = l
   end
 
-
+  o.on('--merge','Merge tables by rowname') do
+    options[:merge] = true
+  end
+
+  o.separator "\n\tOverrides:\n\n"
+
   # o.on('--with-header','Include the header element in filtering etc.') do
   #   options[:with_header] = true
   # end
-
+
+  o.on('--skip lines',Integer,'Skip the first lines before parsing') do |skip|
+    options[:skip] = skip
+  end
+
   o.on('--with-rownames','Include the rownames in filtering etc.') do
     options[:with_rownames] = true
   end
+
+  o.separator "\n\tTransform:\n\n"
+
+  o.on('--transform-ids [downcase,upcase]',[:downcase,:upcase],'Transform column and row identifiers') do |par|
+    options[:transform_ids] = par.to_sym
+  end
+
+  o.separator "\n\tFormat and options:\n\n"
 
-  o.
+  o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
+    options[:in_format] = par.to_sym
+  end
 
+  o.on('--format [tab,csv,rdf]', [:tab, :csv, :rdf], 'Output format (default tab)') do |par|
+    options[:format] = par.to_sym
+  end
+
+  o.on('--blank-nodes','Output (RDF) blank nodes - allowing for duplicate row names') do
+    options[:blank_nodes] = true
+  end
+
+  o.separator "\n\tVerbosity:\n\n"
+
   o.on("--logger filename",String,"Log to file (default stderr)") do | name |
     Bio::Log::CLI.logger(name)
   end
-
+
   o.on("--trace options",String,"Set log level (default INFO, see bio-logger)") do | s |
     Bio::Log::CLI.trace(s)
   end
@@ -177,12 +199,17 @@ end
 # http://eric.lubow.org/2010/ruby/multiple-input-locations-from-bash-into-ruby/
 #
 
-writer = BioTable::TableWriter::Writer.new(options[:format])
+writer =
+  if options[:format] == :rdf
+    BioTable::RDF::Writer.new(options[:blank_nodes])
+  else
+    BioTable::TableWriter::Writer.new(options[:format])
+  end
 
 if INPUT_ON_STDIN
   opts = options.dup # so we can modify options
-  BioTable::TableLoader.emit(STDIN, opts).each do |row|
-    writer.write(TableRow.new(row[0],row[1..-1]))
+  BioTable::TableLoader.emit(STDIN, opts).each do |row, type|
+    writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
   options[:write_header] = false # don't write the header for chained files
 end
@@ -194,8 +221,8 @@ ARGV.each do | fn |
     logger.debug "Autodetected CSV file"
     opts[:in_format] = :csv
   end
-  BioTable::TableLoader.emit(f, opts).each do |row|
-    writer.write(TableRow.new(row[0],row[1..-1]))
+  BioTable::TableLoader.emit(f, opts).each do |row,type|
+    writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
   options[:write_header] = false # don't write the header for chained files
 end
data/features/cli.feature
CHANGED
@@ -1,7 +1,7 @@
 @cli
 Feature: Command-line interface (CLI)
 
-  bio-table has a powerful
+  bio-table has a powerful command line interface. Here we regression test features.
 
   Scenario: Test the numerical filter by column values
     Given I have input file(s) named "test/data/input/table1.csv"
@@ -28,4 +28,14 @@ Feature: Command-line interface (CLI)
     When I execute "./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
     Then I expect the named output to match "table1-rewrite-rownames"
 
+  Scenario: Write RDF format
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --format rdf --transform-ids downcase"
+    Then I expect the named output to match "table1-rdf1"
+
+  Scenario: Read from STDIN
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "cat test/data/input/table1.csv|./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
+    Then I expect the named output to match "table1-STDIN"
+
 
data/lib/bio-table.rb
CHANGED
data/lib/bio-table/filter.rb
CHANGED
@@ -63,7 +63,8 @@ module BioTable
   end
 
   def Filter::valid_number?(s)
-    s.to_s.match(/\A[+-]?\d+?(\.\d+)?\Z/) == nil ? false : true
+    # s.to_s.match(/\A[+-]?\d+?(\.\d+)?\Z/) == nil ? false : true
+    begin Float(s) ; true end rescue false
  end
 
  def Filter::numeric code, fields
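The switch from the regex to Kernel#Float broadens what counts as a valid number; a quick hedged check (assumed plain-Ruby behavior, not gem test output):

```ruby
# Assumed plain-Ruby behavior: Float() accepts scientific notation,
# which the old regex rejected.
def valid_number?(s)
  begin Float(s) ; true end rescue false
end

valid_number?("0.05")   # => true  (both versions)
valid_number?("1.5e-3") # => true  (the old regex returned false)
valid_number?("NA")     # => false
```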
data/lib/bio-table/formatter.rb
CHANGED
@@ -1,5 +1,24 @@
 module BioTable
 
+  module Formatter
+    def Formatter::transform_header_ids modify, list
+      l = list.dup
+      case modify
+      when :downcase then l.map { |h| h.downcase }
+      when :upcase then l.map { |h| h.upcase }
+      else l
+      end
+    end
+    def Formatter::transform_row_ids modify, list
+      l = list.dup
+      case modify
+      when :downcase then l[0].downcase!
+      when :upcase then l[0].upcase!
+      end
+      l
+    end
+  end
+
   class TabFormatter
     def write list
       print list.map{|field| (field==nil ? "NA" : field)}.join("\t"),"\n"
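A hedged usage sketch for the new Formatter helpers added above (the sample values are made up):

```ruby
# transform_header_ids maps every header name; transform_row_ids only
# touches the identifier in column 0. Note that list.dup is a shallow
# copy, so the bang methods also modify the strings in the caller's list.
BioTable::Formatter.transform_header_ids(:downcase, ["ID", "Header1"])
# => ["id", "header1"]
BioTable::Formatter.transform_row_ids(:upcase, ["row1", "0.5"])
# => ["ROW1", "0.5"]
```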
@@ -8,7 +27,6 @@ module BioTable
   end
 
   class CsvFormatter
-
     def write list
       csv_string = CSV.generate do |csv|
         csv << list
|