bio-table 0.0.6 → 0.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -15,7 +15,13 @@ Quick example, say we want to filter out rows that contain certain
15
15
  p-values listed in the 4th column:
16
16
 
17
17
  ```
18
- bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05"
18
+ bio-table table1.csv --num-filter "value[3] <= 0.05"
19
+ ```
20
+
21
+ Even better, you can use the actual column name
22
+
23
+ ```
24
+ bio-table table1.csv --num-filter "fdr <= 0.05"
19
25
  ```
20
26
 
21
27
  bio-table should be lazy. And be good for big data, bio-table is
@@ -26,20 +32,24 @@ you don't need to know Ruby to use the command line interface (CLI).
26
32
  Features:
27
33
 
28
34
  * Support for reading and writing TAB and CSV files, as well as regex splitters
29
- * Filter on data
35
+ * Filter on (numerical) data and rownames
30
36
  * Transform table and data by column or row
31
37
  * Recalculate data
38
+ * Calculate new values
39
+ * Calculate column statistics (mean, standard deviation)
32
40
  * Diff between tables, selecting on specific column values
33
41
  * Merge tables side by side on column value/rowname
34
42
  * Split/reduce tables by column
35
43
  * Write formatted tables, e.g. HTML, LaTeX
36
44
  * Read from STDIN, write to STDOUT
37
45
  * Convert table to RDF
46
+ * Convert key-value (attributes) to RDF (nyi)
38
47
  * Convert table to JSON/YAML/XML (nyi)
48
+ * Transpose matrix (nyi)
39
49
  * etc. etc.
40
50
 
41
51
  and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
42
- takes 0.87 second. Adding a filter makes it parse at 0.95 second on
52
+ takes 0.87 seconds with Ruby 1.9. Adding a filter makes it parse in 0.95 seconds on
43
53
  my 3.2 GHz desktop (with preloaded disk cache).
44
54
 
45
55
  Note: this software is under active development, though what is
@@ -59,47 +69,55 @@ Tables can be transformed through the command line. To transform a
59
69
  comma separated file to a tab delimited one
60
70
 
61
71
  ```sh
62
- bio-table test/data/input/table1.csv --in-format csv --format tab > test1.tab
72
+ bio-table table1.csv --in-format csv --format tab > test1.tab
63
73
  ```
64
74
 
65
75
  Tab is actually the general default. Still, if the file name ends in
66
76
  csv, it will assume CSV. To convert the table back
67
77
 
68
78
  ```sh
69
- bio-table test1.tab --format csv > table1.csv
79
+ bio-table test1.tab --format csv > table1a.csv
70
80
  ```
71
81
 
72
- It is also possible to use a string or regex splitter, e.g.
82
+ When you have a special file format, it is also possible to use a string or regex splitter, e.g.
73
83
 
74
84
  ```sh
75
- bio-table --in-format split --split-on ',' test/data/input/table_split_on.txt
76
- bio-table --in-format regex --split-on '\s*,\s*' test/data/input/table_split_on.txt
85
+ bio-table --in-format split --split-on ',' file
86
+ bio-table --in-format regex --split-on '\s*,\s*' file
77
87
  ```
78
88
 
79
89
  To filter out rows that contain certain values
80
90
 
81
91
  ```sh
82
- bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
92
+ bio-table table1.csv --num-filter "values[3] <= 0.05"
93
+ ```
94
+
95
+ or, rather than using an index value (which can change between
96
+ different tables), you can use the column name
97
+ (lower case), say for FDR
98
+
99
+ ```sh
100
+ bio-table table1.csv --num-filter "fdr <= 0.05"
83
101
  ```
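
Presumably the column-name form works by exposing the lower-cased header names to the filter expression; a minimal sketch of that idea (not the gem's actual code):

```ruby
require 'ostruct'

# one data row, with lower-cased header names as attributes (assumed mechanism)
row = OpenStruct.new(:fdr => 0.03, :aj => 23.4, :b6 => 11.2)
row.instance_eval("fdr <= 0.05")   # => true, so the row is kept
```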
84
102
 
85
103
  The filter ignores the header row, and the row names, by default. If you need
86
104
  either, use the switches --with-headers and --with-rownames. With math you can, e.g., list all rows where one column exceeds another by at least 0.05
87
105
 
88
106
  ```sh
89
- bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
107
+ bio-table table1.csv --num-filter "values[3]-values[6] >= 0.05"
90
108
  ```
91
109
 
92
110
  or, list all rows that have at least one field with a value >= 1000.0
93
111
 
94
112
  ```sh
95
- bio-table test/data/input/table1.csv --num-filter "values.max >= 1000.0" > test1a.tab
113
+ bio-table table1.csv --num-filter "values.max >= 1000.0"
96
114
  ```
97
115
 
98
116
  Produce all rows that have more than 3 values above 3.0 and one value
99
117
  above 10.0:
100
118
 
101
119
  ```sh
102
- bio-table test/data/input/table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
120
+ bio-table table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
103
121
  ```
104
122
 
105
123
  How is that for expressiveness? Looks like Ruby to me.
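
The filter string is indeed plain Ruby over the row's numeric fields; evaluated against an ordinary array (illustrative values, not from the test data) it behaves the same way:

```ruby
values = [0.4, 3.2, 5.1, 12.0, 4.4]
values.max >= 10.0 and values.count { |x| x >= 3.0 } > 3   # => true
```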
@@ -109,31 +127,60 @@ valid numbers are converted). If there are NA (nil) values in the table, you
109
127
  may wish to remove them, like this
110
128
 
111
129
  ```sh
112
- bio-table test/data/input/table1.csv --num-filter "values[0..12].compact.max >= 1000.0" > test1a.tab
130
+ bio-table table1.csv --num-filter "values[0..12].compact.max >= 1000.0"
113
131
  ```
114
132
 
115
133
  which takes the first 13 fields; compact removes the nil values.
116
134
 
135
+ To filter out all rows with more than 3 NA values:
136
+
137
+ ```sh
138
+ bio-table table.csv --num-filter 'values.to_a.size - values.compact.size > 3'
139
+ ```
140
+
117
141
  Also string comparisons and regular expressions can be used. E.g.
118
142
  filter on rownames and a row field both containing 'BGT'
119
143
 
120
144
  ```sh
121
- # not yet implemented
122
- bio-table test/data/input/table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/" > test1a.tab
145
+ bio-table table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/"
146
+ ```
147
+
148
+ or use the column name, rather than the indexed column field:
149
+
150
+ ```sh
151
+ bio-table table1.csv --filter "rowname =~ /BGT/ and genename =~ /BGT/"
123
152
  ```
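
Since fields are read as Strings, --filter expressions are ordinary Ruby regex matches; with an illustrative field value:

```ruby
"BGT-3412" =~ /BGT/   # => 0 (truthy), so the row is kept
```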
124
153
 
125
154
  To reorder/reduce table columns by name
126
155
 
127
156
  ```sh
128
- bio-table test/data/input/table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19 > test1a.tab
157
+ bio-table table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19
129
158
  ```
130
159
 
131
160
  or use their index numbers (the first column is zero)
132
161
 
133
162
  ```sh
134
- bio-table test/data/input/table1.csv --columns 0,1,8,2,4,6 > test1a.tab
163
+ bio-table table1.csv --columns 0,1,8,2,4,6
164
+ ```
165
+
166
+ If the table header happens to be one element shorter than the number of columns
167
+ in the table, use --unshift-headers; column 0 becomes an 'ID' column
168
+
169
+ ```sh
170
+ bio-table table1.csv --unshift-headers --columns 0,1,8,2,4,6
171
+ ```
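
Presumably --unshift-headers just prepends a label ('ID', per the note above) so the header lines up with the data columns; roughly:

```ruby
header = ["AJ", "B6", "Axb1"]   # one element short for four data columns
header.unshift("ID")            # => ["ID", "AJ", "B6", "Axb1"]
```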
172
+
173
+ Duplicate columns with
174
+
175
+ ```sh
176
+ bio-table table1.csv --columns AJ,B6,AJ,Axb1,Axb4,AXB13,Axb15,Axb19
135
177
  ```
136
178
 
179
+ Combine column values (more on rewrite below)
180
+
181
+ ```sh
182
+ bio-table table1.csv --rewrite "rowname = rowname + '-' + field[0]"
183
+ ```
137
184
 
138
185
  To filter for columns using a regular expression
139
186
 
@@ -154,13 +201,39 @@ again
154
201
  where we rewrite the rowname in capitals, and set the second field to
155
202
  empty if the third field is below 0.25.
156
203
 
204
+ ### Statistics
205
+
206
+ bio-table can handle some column statistics using the Ruby statsample
207
+ gem
208
+
209
+ ```sh
210
+ gem install statsample
211
+ ```
212
+
213
+ (statsample is not loaded by default, as it has a host of
214
+ dependencies)
215
+
216
+ Thereafter, to calculate the stats for columns 1 and 2 (rowname is column 0)
217
+
218
+ ```sh
219
+ bio-table --statistics --columns 1,2 table1.csv
220
+ stat AJ B6
221
+ size 379 379
222
+ min 0.0 0.0
223
+ max 1171.23 1309.25
224
+ median 6.26 7.45
225
+ mean 23.49952506596308 24.851108179419523
226
+ sd 79.4384873820721 84.43330500777459
227
+ cv 3.3804294835358824 3.3975669977445166
228
+ ```
229
+
157
230
  ### Sorting a table
158
231
 
159
232
  To sort a table on columns 4 and 2
160
233
 
161
234
  ```sh
162
235
  # not yet implemented
163
- bio-table test/data/input/table1.csv --sort 4,2 > test1a.tab
236
+ bio-table table1.csv --sort 4,2
164
237
  ```
165
238
 
166
239
  Note: not all is implemented (just yet). Please check bio-table --help first.
@@ -170,7 +243,7 @@ Note: not all is implemented (just yet). Please check bio-table --help first.
170
243
  You can combine/concat two or more tables by passing in multiple file names
171
244
 
172
245
  ```sh
173
- bio-table test/data/input/table1.csv test/data/input/table2.csv
246
+ bio-table table1.csv table2.csv
174
247
  ```
175
248
 
176
249
  this will append table2 to table1, assuming they have the same headers
@@ -245,7 +318,7 @@ bio-table can read data from STDIN, by simply assuming that the data
245
318
  piped in is the first input file
246
319
 
247
320
  ```sh
248
- cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
321
+ cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05"
249
322
  ```
250
323
 
251
324
  will filter both files test1.tab and table1.csv and output to
@@ -338,7 +411,7 @@ Note: the Ruby API below is a work in progress.
338
411
  Tables are two dimensional matrixes, which can be read from a file
339
412
 
340
413
  ```ruby
341
- t = Table.read_file('test/data/input/table1.csv')
414
+ t = Table.read_file('table1.csv')
342
415
  p t.header # print the header array
343
416
  p t.name[0],t[0] # print the row name and the row
344
417
  p t[0][0] # print the top corner field
@@ -349,7 +422,7 @@ which column to use for names etc. More interestingly you can pass a
349
422
  function to limit the number of rows read into memory:
350
423
 
351
424
  ```ruby
352
- t = Table.read_file('test/data/input/table1.csv',
425
+ t = Table.read_file('table1.csv',
353
426
  :by_row => { | row | row[0..3] } )
354
427
  ```
355
428
 
@@ -358,7 +431,7 @@ the same idea to reformat and reorder table columns when reading data
358
431
  into the table. E.g.
359
432
 
360
433
  ```ruby
361
- t = Table.read_file('test/data/input/table1.csv',
434
+ t = Table.read_file('table1.csv',
362
435
  :by_row => { | row | [row.rowname, row[0..3], row[6].to_i].flatten } )
363
436
  ```
364
437
 
@@ -368,7 +441,7 @@ can pass in a :by_header, which will have :by_row only call on
368
441
  actual table rows.
369
442
 
370
443
  ```ruby
371
- t = Table.read_file('test/data/input/table1.csv',
444
+ t = Table.read_file('table1.csv',
372
445
  :by_header => { | header | ["Row name", header[0..3], header[6]].flatten } )
373
446
  :by_row => { | row | [row.rowname, row[0..3], row[6].to_i].flatten } )
374
447
  ```
@@ -378,7 +451,7 @@ transform a file, and not loading it in memory, is
378
451
 
379
452
  ```ruby
380
453
  f = File.new('test.tab','w')
381
- t = Table.read_file('test/data/input/table1.csv',
454
+ t = Table.read_file('table1.csv',
382
455
  :by_row => { | row |
383
456
  TableRow::write(f,[row.rowname,row[0..3],row[6].to_i].flatten, :separator => "\t")
384
457
  nil # don't create a table in memory, effectively a filter
@@ -426,7 +499,8 @@ ARGV.each do | fn |
426
499
  end
427
500
  ```
428
501
 
429
- Essentially you can pass in any object that has the *each* method to
502
+ Essentially you can pass in any object that has the *each* method
503
+ (here the File object) to
430
504
  iterate through rows as String (f's each method reads in a line at a
431
505
  time). The emit function yields the parsed row object as a simple
432
506
  array of fields (each field a String). The type is used to distinguish
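
To make the streaming pattern concrete, a minimal sketch (assuming the gem is loaded with require 'bio-table' and using a placeholder file name): any object that responds to each, here a File, is handed to TableLoader.emit, which yields each parsed row plus its type.

```ruby
require 'bio-table'

File.open('table1.csv', 'r') do |f|                       # any object with #each works
  BioTable::TableLoader.emit(f, :in_format => :csv).each do |row, type|
    # row is an array of String fields; row[0] is the row name
    puts "#{type}\t#{row.join(',')}"
  end
end
```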
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.0.6
1
+ 0.8.0
@@ -98,6 +98,10 @@ opts = OptionParser.new do |o|
98
98
  options[:with_rownames] = true
99
99
  end
100
100
 
101
+ o.on('--unshift-headers','Add an extra header element at the front (header contains one fewer field than the number of columns)') do
102
+ options[:unshift_headers] = true
103
+ end
104
+
101
105
  o.on('--strip-quotes','Strip quotes from table fields') do
102
106
  options[:strip_quotes] = true
103
107
  end
@@ -130,6 +134,10 @@ opts = OptionParser.new do |o|
130
134
  options[:blank_nodes] = true
131
135
  end
132
136
 
137
+ o.on('--statistics','Output column statistics') do
138
+ options[:statistics] = true
139
+ end
140
+
133
141
  o.separator "\n\tVerbosity:\n\n"
134
142
 
135
143
  o.on("--logger filename",String,"Log to file (default stderr)") do | name |
@@ -224,22 +232,34 @@ writer =
224
232
  end
225
233
 
226
234
  if INPUT_ON_STDIN
227
- opts = options.dup # so we can modify options
235
+ opts = options.dup # so we can 'safely' modify options
228
236
  BioTable::TableLoader.emit(STDIN, opts).each do |row, type|
229
237
  writer.write(TableRow.new(row[0],row[1..-1]),type)
230
238
  end
231
239
  options[:write_header] = false # don't write the header for chained files
232
240
  end
233
241
 
242
+ statistics = if options[:statistics]
243
+ BioTable::Statistics::Accumulate.new
244
+ else
245
+ nil
246
+ end
247
+
234
248
  ARGV.each do | fn |
235
- opts = options.dup # so we can modify options
249
+ opts = options.dup # so we can 'safely' modify options
236
250
  f = File.open(fn,"r")
237
251
  if not opts[:in_format] and fn =~ /\.csv$/
238
252
  logger.debug "Autodetected CSV file"
239
253
  opts[:in_format] = :csv
240
254
  end
241
255
  BioTable::TableLoader.emit(f, opts).each do |row,type|
242
- writer.write(TableRow.new(row[0],row[1..-1]),type)
256
+ if statistics
257
+ statistics.add(row,type)
258
+ else
259
+ writer.write(TableRow.new(row[0],row[1..-1]),type)
260
+ end
243
261
  end
244
262
  options[:write_header] = false # don't write the header for chained files
245
263
  end
264
+
265
+ statistics.write(writer) if statistics
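
For illustration only, here is a hypothetical accumulator with the same add/write interface the script uses above; the real BioTable::Statistics::Accumulate (and its statsample backend) may differ. It buffers numeric columns as rows stream past and emits statistics rows at the end; the :header/:data type labels and the mean-only output are assumptions.

```ruby
# Hypothetical sketch, not the gem's implementation
class SimpleAccumulate
  def initialize
    @header  = nil
    @columns = nil
  end

  # Called once per emitted row; the first row is assumed to be the header
  def add(row, type)
    if @header.nil?
      @header  = row[1..-1]
      @columns = Array.new(@header.size) { [] }
    else
      row[1..-1].each_with_index do |field, i|
        v = Float(field) rescue nil   # skip NA / non-numeric fields
        @columns[i] << v if v
      end
    end
  end

  # Emit statistics rows via the same writer used for normal output
  def write(writer)
    means = @columns.map { |c| c.inject(0.0) { |s, x| s + x } / c.size }
    writer.write(TableRow.new("stat", @header), :header)           # assumed type label
    writer.write(TableRow.new("mean", means.map(&:to_s)), :data)   # assumed type label
  end
end
```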
@@ -3,11 +3,26 @@ Feature: Command-line interface (CLI)
3
3
 
4
4
  bio-table has a powerful command line interface. Here we regression test features.
5
5
 
6
- Scenario: Test the numerical filter by column values
6
+ Scenario: Test the numerical filter by indexed column values
7
7
  Given I have input file(s) named "test/data/input/table1.csv"
8
8
  When I execute "./bin/bio-table --num-filter 'values[3] > 0.05'"
9
9
  Then I expect the named output to match "table1-0_05"
10
10
 
11
+ Scenario: Test the numerical filter by column names
12
+ Given I have input file(s) named "test/data/input/table1.csv"
13
+ When I execute "./bin/bio-table --num-filter 'axb2 > 0.05'"
14
+ Then I expect the named output to match "table1-named-0_05"
15
+
16
+ Scenario: Test the filter by indexed column values
17
+ Given I have input file(s) named "test/data/input/table1.csv"
18
+ When I execute "./bin/bio-table --filter 'fields[3] =~ 0.1'"
19
+ Then I expect the named output to match "table1-filter-0_1"
20
+
21
+ Scenario: Test the filter by column names
22
+ Given I have input file(s) named "test/data/input/table1.csv"
23
+ When I execute "./bin/bio-table --filter 'axb1 =~ /0.1/'"
24
+ Then I expect the named output to match "table1-filter-named-0_1"
25
+
11
26
  Scenario: Reduce columns
12
27
  Given I have input file(s) named "test/data/input/table1.csv"
13
28
  When I execute "./bin/bio-table test/data/input/table1.csv --columns '#Gene,AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19'"
@@ -78,4 +93,7 @@ Feature: Command-line interface (CLI)
78
93
  When I execute "./bin/bio-table --in-format split --split-on ',' --num-filter 'values[1]!=0' --with-headers"
79
94
  Then I expect the named output to match "table_filter_headers"
80
95
 
81
-
96
+ Scenario: Use count in filter
97
+ Given I have input file(s) named "test/data/input/table1.csv"
98
+ When I execute "./bin/bio-table --num-filter 'values.compact.max >= 10.0 and values.compact.count{|x| x>=3.0} > 3'"
99
+ Then I expect the named output to match "table_counter_filter"
@@ -0,0 +1,43 @@
1
+ @filter
2
+ Feature: Filter input table
3
+
4
+ bio-table should read input line by line as an iterator, and emit
5
+ filtered/transformed output, filtering for number values etc.
6
+
7
+ Scenario: Filter a table by value
8
+ Given I load a CSV table containing
9
+ """
10
+ bid,cid,length,num
11
+ 1,a,4658,4
12
+ 1,b,12060,6
13
+ 2,c,5858,7
14
+ 2,d,5626,4
15
+ 3,e,18451,8
16
+ """
17
+ When I numerically filter the table for
18
+ | num_filter | result | description |
19
+ | values[1] > 6000 | [12060,18451] | basic filter |
20
+ | value[1] > 6000 | [12060,18451] | value is alias for values |
21
+ | num==4 | [4658,5626] | column names as variables |
22
+ | num==4 or num==6 | [4658,12060,5626] | column names as variables |
23
+ | num==6 | [12060] | column names as variables |
24
+ | length<5000 | [4658] | column names as variables |
25
+ Then I should have result
26
+
27
+ Scenario: Filter a table by string
28
+ Given I load a CSV table containing
29
+ """
30
+ bid,cid,length,num
31
+ 1,a,4658,4
32
+ 1,b,12060,6
33
+ 2,c,5858,7
34
+ 2,d,5626,4
35
+ 3,e,18451,8
36
+ """
37
+ When I filter the table for
38
+ | filter | result | description |
39
+ | field[1] =~ /4/ | [4658,18451] | regex filter |
40
+ | fields[1] =~ /4/ | [4658,18451] | alias fields |
41
+ | length =~ /4/ | [4658,18451] | use column names |
42
+ Then I should have filter result
43
+
@@ -0,0 +1,46 @@
1
+ Given /^I load a CSV table containing$/ do |string|
2
+ @lines = string.split(/\n/)
3
+ end
4
+
5
+ When /^I numerically filter the table for$/ do |table|
6
+ # table is a Cucumber::Ast::Table
7
+ @table = table
8
+ end
9
+
10
+ Then /^I should have result$/ do
11
+ @table.hashes.each do |h|
12
+ p h
13
+ result = eval(h['result'])
14
+ options = { :in_format => :split, :split_on => ',' }
15
+ options[:num_filter] = h['num_filter']
16
+
17
+ p options
18
+ p result
19
+ t = BioTable::Table.new
20
+ rownames,lines = t.read_lines(@lines, options)
21
+ p lines
22
+ lines.map {|r| r[1].to_i }.should == result
23
+ end
24
+ end
25
+
26
+ When /^I filter the table for$/ do |table|
27
+ # table is a Cucumber::Ast::Table
28
+ @table1 = table
29
+ end
30
+
31
+ Then /^I should have filter result$/ do
32
+ @table1.hashes.each do |h|
33
+ p h
34
+ result = eval(h['result'])
35
+ options = { :in_format => :split, :split_on => ',' }
36
+ options[:filter] = h['filter']
37
+
38
+ p options
39
+ p result
40
+ t = BioTable::Table.new
41
+ rownames,lines = t.read_lines(@lines, options)
42
+ p lines
43
+ lines.map {|r| r[1].to_i }.should == result
44
+ end
45
+ end
46
+