bio-table 0.0.6 → 0.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -15,7 +15,13 @@ Quick example, say we want to filter out rows that contain certain
15
15
  p-values listed in the 4th column:
16
16
 
17
17
  ```
18
- bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05"
18
+ bio-table table1.csv --num-filter "value[3] <= 0.05"
19
+ ```
20
+
21
+ Even better, you can use the actual column name
22
+
23
+ ```
24
+ bio-table table1.csv --num-filter "fdr <= 0.05"
19
25
  ```
20
26
 
21
27
  bio-table should be lazy. And be good for big data, bio-table is
@@ -26,20 +32,24 @@ you don't need to know Ruby to use the command line interface (CLI).
26
32
  Features:
27
33
 
28
34
  * Support for reading and writing TAB and CSV files, as well as regex splitters
29
- * Filter on data
35
+ * Filter on (numerical) data and rownames
30
36
  * Transform table and data by column or row
31
37
  * Recalculate data
38
+ * Calculate new values
39
+ * Calculate column statistics (mean, standard deviation)
32
40
  * Diff between tables, selecting on specific column values
33
41
  * Merge tables side by side on column value/rowname
34
42
  * Split/reduce tables by column
35
43
  * Write formatted tables, e.g. HTML, LaTeX
36
44
  * Read from STDIN, write to STDOUT
37
45
  * Convert table to RDF
46
+ * Convert key-value (attributes) to RDF (nyi)
38
47
  * Convert table to JSON/YAML/XML (nyi)
48
+ * Transpose matrix (nyi)
39
49
  * etc. etc.
40
50
 
41
51
  and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
42
- takes 0.87 second. Adding a filter makes it parse at 0.95 second on
52
+ takes 0.87 seconds with Ruby 1.9. Adding a filter makes it parse in 0.95 seconds on
43
53
  my 3.2 GHz desktop (with preloaded disk cache).
44
54
 
45
55
  Note: this software is under active development, though what is
@@ -59,47 +69,55 @@ Tables can be transformed through the command line. To transform a
59
69
  comma separated file to a tab delimited one
60
70
 
61
71
  ```sh
62
- bio-table test/data/input/table1.csv --in-format csv --format tab > test1.tab
72
+ bio-table table1.csv --in-format csv --format tab > test1.tab
63
73
  ```
64
74
 
65
75
  Tab is actually the general default. Still, if the file name ends in
66
76
  csv, it will assume CSV. To convert the table back
67
77
 
68
78
  ```sh
69
- bio-table test1.tab --format csv > table1.csv
79
+ bio-table test1.tab --format csv > table1a.csv
70
80
  ```
71
81
 
72
- It is also possible to use a string or regex splitter, e.g.
82
+ When you have a special file format, it is also possible to use a string or regex splitter, e.g.
73
83
 
74
84
  ```sh
75
- bio-table --in-format split --split-on ',' test/data/input/table_split_on.txt
76
- bio-table --in-format regex --split-on '\s*,\s*' test/data/input/table_split_on.txt
85
+ bio-table --in-format split --split-on ',' file
86
+ bio-table --in-format regex --split-on '\s*,\s*' file
77
87
  ```
78
88
 
79
89
  To filter out rows that contain certain values
80
90
 
81
91
  ```sh
82
- bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
92
+ bio-table table1.csv --num-filter "values[3] <= 0.05"
93
+ ```
94
+
95
+ or, rather than using an index value (which can change between
96
+ different tables), you can use the column name
97
+ (lower case), say for FDR
98
+
99
+ ```sh
100
+ bio-table table1.csv --num-filter "fdr <= 0.05"
83
101
  ```
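
Presumably the column-name form works by exposing the lower-cased header names to the filter expression; a minimal sketch of that idea (not the gem's actual code):

```ruby
require 'ostruct'

# one data row, with lower-cased header names as attributes (assumed mechanism)
row = OpenStruct.new(:fdr => 0.03, :aj => 23.4, :b6 => 11.2)
row.instance_eval("fdr <= 0.05")   # => true, so the row is kept
```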
84
102
 
85
103
  The filter ignores the header row, and the row names, by default. If you need
86
104
  either, use the switches --with-headers and --with-rownames. With math you can, e.g., list all rows where one column exceeds another by at least 0.05
87
105
 
88
106
  ```sh
89
- bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
107
+ bio-table table1.csv --num-filter "values[3]-values[6] >= 0.05"
90
108
  ```
91
109
 
92
110
  or, list all rows that have at least one field with a value >= 1000.0
93
111
 
94
112
  ```sh
95
- bio-table test/data/input/table1.csv --num-filter "values.max >= 1000.0" > test1a.tab
113
+ bio-table table1.csv --num-filter "values.max >= 1000.0"
96
114
  ```
97
115
 
98
116
  Produce all rows that have more than 3 values above 3.0 and one value
99
117
  above 10.0:
100
118
 
101
119
  ```sh
102
- bio-table test/data/input/table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
120
+ bio-table table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
103
121
  ```
104
122
 
105
123
  How is that for expressiveness? Looks like Ruby to me.
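
The filter string is indeed plain Ruby over the row's numeric fields; evaluated against an ordinary array (illustrative values, not from the test data) it behaves the same way:

```ruby
values = [0.4, 3.2, 5.1, 12.0, 4.4]
values.max >= 10.0 and values.count { |x| x >= 3.0 } > 3   # => true
```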
@@ -109,31 +127,60 @@ valid numbers are converted). If there are NA (nil) values in the table, you
109
127
  may wish to remove them, like this
110
128
 
111
129
  ```sh
112
- bio-table test/data/input/table1.csv --num-filter "values[0..12].compact.max >= 1000.0" > test1a.tab
130
+ bio-table table1.csv --num-filter "values[0..12].compact.max >= 1000.0"
113
131
  ```
114
132
 
115
133
  which takes the first 13 fields; compact removes the nil values.
116
134
 
135
+ To filter out all rows with more than 3 NA values:
136
+
137
+ ```sh
138
+ bio-table table.csv --num-filter 'values.to_a.size - values.compact.size > 3'
139
+ ```
140
+
117
141
  Also string comparisons and regular expressions can be used. E.g.
118
142
  filter on rownames and a row field both containing 'BGT'
119
143
 
120
144
  ```sh
121
- # not yet implemented
122
- bio-table test/data/input/table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/" > test1a.tab
145
+ bio-table table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/"
146
+ ```
147
+
148
+ or use the column name, rather than the indexed column field:
149
+
150
+ ```sh
151
+ bio-table table1.csv --filter "rowname =~ /BGT/ and genename =~ /BGT/"
123
152
  ```
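
Since fields are read as Strings, --filter expressions are ordinary Ruby regex matches; with an illustrative field value:

```ruby
"BGT-3412" =~ /BGT/   # => 0 (truthy), so the row is kept
```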
124
153
 
125
154
  To reorder/reduce table columns by name
126
155
 
127
156
  ```sh
128
- bio-table test/data/input/table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19 > test1a.tab
157
+ bio-table table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19
129
158
  ```
130
159
 
131
160
  or use their index numbers (the first column is zero)
132
161
 
133
162
  ```sh
134
- bio-table test/data/input/table1.csv --columns 0,1,8,2,4,6 > test1a.tab
163
+ bio-table table1.csv --columns 0,1,8,2,4,6
164
+ ```
165
+
166
+ If the table header happens to be one element shorter than the number of columns
167
+ in the table, use --unshift-headers; column 0 becomes an 'ID' column
168
+
169
+ ```sh
170
+ bio-table table1.csv --unshift-headers --columns 0,1,8,2,4,6
171
+ ```
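
Presumably --unshift-headers just prepends a label ('ID', per the note above) so the header lines up with the data columns; roughly:

```ruby
header = ["AJ", "B6", "Axb1"]   # one element short for four data columns
header.unshift("ID")            # => ["ID", "AJ", "B6", "Axb1"]
```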
172
+
173
+ Duplicate columns with
174
+
175
+ ```sh
176
+ bio-table table1.csv --columns AJ,B6,AJ,Axb1,Axb4,AXB13,Axb15,Axb19
135
177
  ```
136
178
 
179
+ Combine column values (more on rewrite below)
180
+
181
+ ```sh
182
+ bio-table table1.csv --rewrite "rowname = rowname + '-' + field[0]"
183
+ ```
137
184
 
138
185
  To filter for columns using a regular expression
139
186
 
@@ -154,13 +201,39 @@ again
154
201
  where we rewrite the rowname in capitals, and set the second field to
155
202
  empty if the third field is below 0.25.
156
203
 
204
+ ### Statistics
205
+
206
+ bio-table can handle some column statistics using the Ruby statsample
207
+ gem
208
+
209
+ ```sh
210
+ gem install statsample
211
+ ```
212
+
213
+ (statsample is not loaded by default, as it has a host of
214
+ dependencies)
215
+
216
+ Thereafter, to calculate the stats for columns 1 and 2 (rowname is column 0)
217
+
218
+ ```sh
219
+ bio-table --statistics --columns 1,2 table1.csv
220
+ stat AJ B6
221
+ size 379 379
222
+ min 0.0 0.0
223
+ max 1171.23 1309.25
224
+ median 6.26 7.45
225
+ mean 23.49952506596308 24.851108179419523
226
+ sd 79.4384873820721 84.43330500777459
227
+ cv 3.3804294835358824 3.3975669977445166
228
+ ```
229
+
157
230
  ### Sorting a table
158
231
 
159
232
  To sort a table on columns 4 and 2
160
233
 
161
234
  ```sh
162
235
  # not yet implemented
163
- bio-table test/data/input/table1.csv --sort 4,2 > test1a.tab
236
+ bio-table table1.csv --sort 4,2
164
237
  ```
165
238
 
166
239
  Note: not all is implemented (just yet). Please check bio-table --help first.
@@ -170,7 +243,7 @@ Note: not all is implemented (just yet). Please check bio-table --help first.
170
243
  You can combine/concat two or more tables by passing in multiple file names
171
244
 
172
245
  ```sh
173
- bio-table test/data/input/table1.csv test/data/input/table2.csv
246
+ bio-table table1.csv table2.csv
174
247
  ```
175
248
 
176
249
  this will append table2 to table1, assuming they have the same headers
@@ -245,7 +318,7 @@ bio-table can read data from STDIN, by simply assuming that the data
245
318
  piped in is the first input file
246
319
 
247
320
  ```sh
248
- cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
321
+ cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05"
249
322
  ```
250
323
 
251
324
  will filter both files test1.tab and table1.csv and output to
@@ -338,7 +411,7 @@ Note: the Ruby API below is a work in progress.
338
411
  Tables are two dimensional matrixes, which can be read from a file
339
412
 
340
413
  ```ruby
341
- t = Table.read_file('test/data/input/table1.csv')
414
+ t = Table.read_file('table1.csv')
342
415
  p t.header # print the header array
343
416
  p t.name[0],t[0] # print the row name and the row
344
417
  p t[0][0] # print the top corner field
@@ -349,7 +422,7 @@ which column to use for names etc. More interestingly you can pass a
349
422
  function to limit the number of rows read into memory:
350
423
 
351
424
  ```ruby
352
- t = Table.read_file('test/data/input/table1.csv',
425
+ t = Table.read_file('table1.csv',
353
426
  :by_row => { | row | row[0..3] } )
354
427
  ```
355
428
 
@@ -358,7 +431,7 @@ the same idea to reformat and reorder table columns when reading data
358
431
  into the table. E.g.
359
432
 
360
433
  ```ruby
361
- t = Table.read_file('test/data/input/table1.csv',
434
+ t = Table.read_file('table1.csv',
362
435
  :by_row => { | row | [row.rowname, row[0..3], row[6].to_i].flatten } )
363
436
  ```
364
437
 
@@ -368,7 +441,7 @@ can pass in a :by_header, which will have :by_row only call on
368
441
  actual table rows.
369
442
 
370
443
  ```ruby
371
- t = Table.read_file('test/data/input/table1.csv',
444
+ t = Table.read_file('table1.csv',
372
445
  :by_header => { | header | ["Row name", header[0..3], header[6]].flatten } )
373
446
  :by_row => { | row | [row.rowname, row[0..3], row[6].to_i].flatten } )
374
447
  ```
@@ -378,7 +451,7 @@ transform a file, and not loading it in memory, is
378
451
 
379
452
  ```ruby
380
453
  f = File.new('test.tab','w')
381
- t = Table.read_file('test/data/input/table1.csv',
454
+ t = Table.read_file('table1.csv',
382
455
  :by_row => { | row |
383
456
  TableRow::write(f,[row.rowname,row[0..3],row[6].to_i].flatten, :separator => "\t")
384
457
  nil # don't create a table in memory, effectively a filter
@@ -426,7 +499,8 @@ ARGV.each do | fn |
426
499
  end
427
500
  ```
428
501
 
429
- Essentially you can pass in any object that has the *each* method to
502
+ Essentially you can pass in any object that has the *each* method
503
+ (here the File object) to
430
504
  iterate through rows as String (f's each method reads in a line at a
431
505
  time). The emit function yields the parsed row object as a simple
432
506
  array of fields (each field a String). The type is used to distinguish
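
To make the streaming pattern concrete, a minimal sketch (assuming the gem is loaded with require 'bio-table' and using a placeholder file name): any object that responds to each, here a File, is handed to TableLoader.emit, which yields each parsed row plus its type.

```ruby
require 'bio-table'

File.open('table1.csv', 'r') do |f|                       # any object with #each works
  BioTable::TableLoader.emit(f, :in_format => :csv).each do |row, type|
    # row is an array of String fields; row[0] is the row name
    puts "#{type}\t#{row.join(',')}"
  end
end
```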
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.0.6
1
+ 0.8.0
@@ -98,6 +98,10 @@ opts = OptionParser.new do |o|
98
98
  options[:with_rownames] = true
99
99
  end
100
100
 
101
+ o.on('--unshift-headers','Add an extra header element at the front (header contains one fewer field than the number of columns)') do
102
+ options[:unshift_headers] = true
103
+ end
104
+
101
105
  o.on('--strip-quotes','Strip quotes from table fields') do
102
106
  options[:strip_quotes] = true
103
107
  end
@@ -130,6 +134,10 @@ opts = OptionParser.new do |o|
130
134
  options[:blank_nodes] = true
131
135
  end
132
136
 
137
+ o.on('--statistics','Output column statistics') do
138
+ options[:statistics] = true
139
+ end
140
+
133
141
  o.separator "\n\tVerbosity:\n\n"
134
142
 
135
143
  o.on("--logger filename",String,"Log to file (default stderr)") do | name |
@@ -224,22 +232,34 @@ writer =
224
232
  end
225
233
 
226
234
  if INPUT_ON_STDIN
227
- opts = options.dup # so we can modify options
235
+ opts = options.dup # so we can 'safely' modify options
228
236
  BioTable::TableLoader.emit(STDIN, opts).each do |row, type|
229
237
  writer.write(TableRow.new(row[0],row[1..-1]),type)
230
238
  end
231
239
  options[:write_header] = false # don't write the header for chained files
232
240
  end
233
241
 
242
+ statistics = if options[:statistics]
243
+ BioTable::Statistics::Accumulate.new
244
+ else
245
+ nil
246
+ end
247
+
234
248
  ARGV.each do | fn |
235
- opts = options.dup # so we can modify options
249
+ opts = options.dup # so we can 'safely' modify options
236
250
  f = File.open(fn,"r")
237
251
  if not opts[:in_format] and fn =~ /\.csv$/
238
252
  logger.debug "Autodetected CSV file"
239
253
  opts[:in_format] = :csv
240
254
  end
241
255
  BioTable::TableLoader.emit(f, opts).each do |row,type|
242
- writer.write(TableRow.new(row[0],row[1..-1]),type)
256
+ if statistics
257
+ statistics.add(row,type)
258
+ else
259
+ writer.write(TableRow.new(row[0],row[1..-1]),type)
260
+ end
243
261
  end
244
262
  options[:write_header] = false # don't write the header for chained files
245
263
  end
264
+
265
+ statistics.write(writer) if statistics
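
For illustration only, here is a hypothetical accumulator with the same add/write interface the script uses above; the real BioTable::Statistics::Accumulate (and its statsample backend) may differ. It buffers numeric columns as rows stream past and emits statistics rows at the end; the :header/:data type labels and the mean-only output are assumptions.

```ruby
# Hypothetical sketch, not the gem's implementation
class SimpleAccumulate
  def initialize
    @header  = nil
    @columns = nil
  end

  # Called once per emitted row; the first row is assumed to be the header
  def add(row, type)
    if @header.nil?
      @header  = row[1..-1]
      @columns = Array.new(@header.size) { [] }
    else
      row[1..-1].each_with_index do |field, i|
        v = Float(field) rescue nil   # skip NA / non-numeric fields
        @columns[i] << v if v
      end
    end
  end

  # Emit statistics rows via the same writer used for normal output
  def write(writer)
    means = @columns.map { |c| c.inject(0.0) { |s, x| s + x } / c.size }
    writer.write(TableRow.new("stat", @header), :header)           # assumed type label
    writer.write(TableRow.new("mean", means.map(&:to_s)), :data)   # assumed type label
  end
end
```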
@@ -3,11 +3,26 @@ Feature: Command-line interface (CLI)
3
3
 
4
4
  bio-table has a powerful command line interface. Here we regression test features.
5
5
 
6
- Scenario: Test the numerical filter by column values
6
+ Scenario: Test the numerical filter by indexed column values
7
7
  Given I have input file(s) named "test/data/input/table1.csv"
8
8
  When I execute "./bin/bio-table --num-filter 'values[3] > 0.05'"
9
9
  Then I expect the named output to match "table1-0_05"
10
10
 
11
+ Scenario: Test the numerical filter by column names
12
+ Given I have input file(s) named "test/data/input/table1.csv"
13
+ When I execute "./bin/bio-table --num-filter 'axb2 > 0.05'"
14
+ Then I expect the named output to match "table1-named-0_05"
15
+
16
+ Scenario: Test the filter by indexed column values
17
+ Given I have input file(s) named "test/data/input/table1.csv"
18
+ When I execute "./bin/bio-table --filter 'fields[3] =~ 0.1'"
19
+ Then I expect the named output to match "table1-filter-0_1"
20
+
21
+ Scenario: Test the filter by column names
22
+ Given I have input file(s) named "test/data/input/table1.csv"
23
+ When I execute "./bin/bio-table --filter 'axb1 =~ /0.1/'"
24
+ Then I expect the named output to match "table1-filter-named-0_1"
25
+
11
26
  Scenario: Reduce columns
12
27
  Given I have input file(s) named "test/data/input/table1.csv"
13
28
  When I execute "./bin/bio-table test/data/input/table1.csv --columns '#Gene,AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19'"
@@ -78,4 +93,7 @@ Feature: Command-line interface (CLI)
78
93
  When I execute "./bin/bio-table --in-format split --split-on ',' --num-filter 'values[1]!=0' --with-headers"
79
94
  Then I expect the named output to match "table_filter_headers"
80
95
 
81
-
96
+ Scenario: Use count in filter
97
+ Given I have input file(s) named "test/data/input/table1.csv"
98
+ When I execute "./bin/bio-table --num-filter 'values.compact.max >= 10.0 and values.compact.count{|x| x>=3.0} > 3'"
99
+ Then I expect the named output to match "table_counter_filter"
@@ -0,0 +1,43 @@
1
+ @filter
2
+ Feature: Filter input table
3
+
4
+ bio-table should read input line by line as an iterator, and emit
5
+ filtered/transformed output, filtering for number values etc.
6
+
7
+ Scenario: Filter a table by value
8
+ Given I load a CSV table containing
9
+ """
10
+ bid,cid,length,num
11
+ 1,a,4658,4
12
+ 1,b,12060,6
13
+ 2,c,5858,7
14
+ 2,d,5626,4
15
+ 3,e,18451,8
16
+ """
17
+ When I numerically filter the table for
18
+ | num_filter | result | description |
19
+ | values[1] > 6000 | [12060,18451] | basic filter |
20
+ | value[1] > 6000 | [12060,18451] | value is alias for values |
21
+ | num==4 | [4658,5626] | column names as variables |
22
+ | num==4 or num==6 | [4658,12060,5626] | column names as variables |
23
+ | num==6 | [12060] | column names as variables |
24
+ | length<5000 | [4658] | column names as variables |
25
+ Then I should have result
26
+
27
+ Scenario: Filter a table by string
28
+ Given I load a CSV table containing
29
+ """
30
+ bid,cid,length,num
31
+ 1,a,4658,4
32
+ 1,b,12060,6
33
+ 2,c,5858,7
34
+ 2,d,5626,4
35
+ 3,e,18451,8
36
+ """
37
+ When I filter the table for
38
+ | filter | result | description |
39
+ | field[1] =~ /4/ | [4658,18451] | regex filter |
40
+ | fields[1] =~ /4/ | [4658,18451] | alias fields |
41
+ | length =~ /4/ | [4658,18451] | use column names |
42
+ Then I should have filter result
43
+
@@ -0,0 +1,46 @@
1
+ Given /^I load a CSV table containing$/ do |string|
2
+ @lines = string.split(/\n/)
3
+ end
4
+
5
+ When /^I numerically filter the table for$/ do |table|
6
+ # table is a Cucumber::Ast::Table
7
+ @table = table
8
+ end
9
+
10
+ Then /^I should have result$/ do
11
+ @table.hashes.each do |h|
12
+ p h
13
+ result = eval(h['result'])
14
+ options = { :in_format => :split, :split_on => ',' }
15
+ options[:num_filter] = h['num_filter']
16
+
17
+ p options
18
+ p result
19
+ t = BioTable::Table.new
20
+ rownames,lines = t.read_lines(@lines, options)
21
+ p lines
22
+ lines.map {|r| r[1].to_i }.should == result
23
+ end
24
+ end
25
+
26
+ When /^I filter the table for$/ do |table|
27
+ # table is a Cucumber::Ast::Table
28
+ @table1 = table
29
+ end
30
+
31
+ Then /^I should have filter result$/ do
32
+ @table1.hashes.each do |h|
33
+ p h
34
+ result = eval(h['result'])
35
+ options = { :in_format => :split, :split_on => ',' }
36
+ options[:filter] = h['filter']
37
+
38
+ p options
39
+ p result
40
+ t = BioTable::Table.new
41
+ rownames,lines = t.read_lines(@lines, options)
42
+ p lines
43
+ lines.map {|r| r[1].to_i }.should == result
44
+ end
45
+ end
46
+