bio-table 0.0.6 → 0.8.0

data/README.md CHANGED
@@ -15,7 +15,13 @@ Quick example, say we want to filter out rows that contain certain
  p-values listed in the 4th column:

  ```
- bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05"
+ bio-table table1.csv --num-filter "value[3] <= 0.05"
+ ```
+
+ Even better, you can use the actual column name:
+
+ ```
+ bio-table table1.csv --num-filter "fdr <= 0.05"
  ```

  bio-table should be lazy. And be good for big data, bio-table is
@@ -26,20 +32,24 @@ you don't need to know Ruby to use the command line interface (CLI).
  Features:

  * Support for reading and writing TAB and CSV files, as well as regex splitters
- * Filter on data
+ * Filter on (numerical) data and rownames
  * Transform table and data by column or row
  * Recalculate data
+ * Calculate new values
+ * Calculate column statistics (mean, standard deviation)
  * Diff between tables, selecting on specific column values
  * Merge tables side by side on column value/rowname
  * Split/reduce tables by column
  * Write formatted tables, e.g. HTML, LaTeX
  * Read from STDIN, write to STDOUT
  * Convert table to RDF
+ * Convert key-value (attributes) to RDF (nyi)
  * Convert table to JSON/YAML/XML (nyi)
+ * Transpose matrix (nyi)
  * etc. etc.

  and bio-table is pretty fast. To convert a 3 MB file of 18670 rows
- takes 0.87 second. Adding a filter makes it parse at 0.95 second on
+ takes 0.87 seconds with Ruby 1.9. Adding a filter makes it parse in 0.95 seconds on
  my 3.2 GHz desktop (with preloaded disk cache).

  Note: this software is under active development, though what is
@@ -59,47 +69,55 @@ Tables can be transformed through the command line. To transform a
  comma-separated file to a tab-delimited one

  ```sh
- bio-table test/data/input/table1.csv --in-format csv --format tab > test1.tab
+ bio-table table1.csv --in-format csv --format tab > test1.tab
  ```

  Tab is actually the general default. Still, if the file name ends in
  csv, it will assume CSV. To convert the table back

  ```sh
- bio-table test1.tab --format csv > table1.csv
+ bio-table test1.tab --format csv > table1a.csv
  ```

- It is also possible to use a string or regex splitter, e.g.
+ When you have a special file format, it is also possible to use a string or regex splitter, e.g.

  ```sh
- bio-table --in-format split --split-on ',' test/data/input/table_split_on.txt
- bio-table --in-format regex --split-on '\s*,\s*' test/data/input/table_split_on.txt
+ bio-table --in-format split --split-on ',' file
+ bio-table --in-format regex --split-on '\s*,\s*' file
  ```

  To filter out rows that contain certain values

  ```sh
- bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
+ bio-table table1.csv --num-filter "values[3] <= 0.05"
+ ```
+
+ or, rather than using an index value (which can change between
+ different tables), you can use the column name
+ (lower case), say for FDR:
+
+ ```sh
+ bio-table table1.csv --num-filter "fdr <= 0.05"
  ```

  The filter ignores the header row and the row names by default. If you need
  either, use the switches --with-headers and --with-rownames. Using math, list all rows

  ```sh
- bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
+ bio-table table1.csv --num-filter "values[3]-values[6] >= 0.05"
  ```

  or, list all rows that have at least one field with a value >= 1000.0

  ```sh
- bio-table test/data/input/table1.csv --num-filter "values.max >= 1000.0" > test1a.tab
+ bio-table table1.csv --num-filter "values.max >= 1000.0"
  ```

  Produce all rows that have more than 3 values of at least 3.0, and at least
  one value of at least 10.0:

  ```sh
- bio-table test/data/input/table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
+ bio-table table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
  ```

  How is that for expressiveness? Looks like Ruby to me.
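The same numeric filters can be driven from the Ruby API. The new step definitions at the end of this diff exercise them through `BioTable::Table#read_lines`; a minimal sketch along the same lines (the input lines and filter expression here are illustrative, not taken from the gem's docs):

```ruby
require 'bio-table'

# Hypothetical raw input lines, as they would come from a CSV file
lines = [ "gene,AJ,B6,fdr",
          "BGT1,2.0,3.0,0.01",
          "BGT2,5.0,1.0,0.20" ]
# Same option keys the step definitions at the bottom of this diff use
options = { :in_format => :split, :split_on => ',',
            :num_filter => 'values.max >= 3.0' }
t = BioTable::Table.new
rownames, rows = t.read_lines(lines, options)  # keeps only rows passing the filter
```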
@@ -109,31 +127,60 @@ valid numbers are converted). If there are NA (nil) values in the table, you
  may wish to remove them, like this

  ```sh
- bio-table test/data/input/table1.csv --num-filter "values[0..12].compact.max >= 1000.0" > test1a.tab
+ bio-table table1.csv --num-filter "values[0..12].compact.max >= 1000.0"
  ```

  which takes the first 13 fields; compact removes the nil values.

+ To filter out all rows with more than 3 NA values:
+
+ ```sh
+ bio-table table.csv --num-filter 'values.to_a.size - values.compact.size > 3'
+ ```
+
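The compact idiom is plain Ruby and easy to check in irb; a quick sketch of how the NA count in the filter above behaves on a single parsed row (values made up):

```ruby
values = [0.2, nil, 1.5, nil, nil, 0.7]   # a row with three NA (nil) fields
values.compact                            # => [0.2, 1.5, 0.7], nils removed
values.to_a.size - values.compact.size    # => 3, the number of NA values
```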
  Also string comparisons and regular expressions can be used, e.g. to
  filter on rownames and a row field both containing 'BGT'

  ```sh
- # not yet implemented
- bio-table test/data/input/table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/" > test1a.tab
+ bio-table table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/"
+ ```
+
+ or use the column name, rather than the indexed column field:
+
+ ```sh
+ bio-table table1.csv --filter "rowname =~ /BGT/ and genename =~ /BGT/"
  ```

  To reorder/reduce table columns by name

  ```sh
- bio-table test/data/input/table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19 > test1a.tab
+ bio-table table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19
  ```

  or use their index numbers (the first column is zero)

  ```sh
- bio-table test/data/input/table1.csv --columns 0,1,8,2,4,6 > test1a.tab
+ bio-table table1.csv --columns 0,1,8,2,4,6
+ ```
+
+ If the table header happens to be one element shorter than the number of columns
+ in the table, use --unshift-headers: column 0 becomes an 'ID' column
+
+ ```sh
+ bio-table table1.csv --unshift-headers --columns 0,1,8,2,4,6
+ ```
+
+ Duplicate columns with
+
+ ```sh
+ bio-table table1.csv --columns AJ,B6,AJ,Axb1,Axb4,AXB13,Axb15,Axb19
  ```

+ Combine column values (more on rewrite below)
+
+ ```sh
+ bio-table table1.csv --rewrite "rowname = rowname + '-' + field[0]"
+ ```

  To filter for columns using a regular expression

@@ -154,13 +201,39 @@ again
  where we rewrite the rowname in capitals, and set the second field to
  empty if the third field is below 0.25.

+ ### Statistics
+
+ bio-table can handle some column statistics using the Ruby statsample
+ gem
+
+ ```sh
+ gem install statsample
+ ```
+
+ (statsample is not loaded by default, as it has a host of
+ dependencies)
+
+ Thereafter, to calculate the stats for columns 1 and 2 (rowname is column 0)
+
+ ```sh
+ bio-table --statistics --columns 1,2 table1.csv
+ stat    AJ                  B6
+ size    379                 379
+ min     0.0                 0.0
+ max     1171.23             1309.25
+ median  6.26                7.45
+ mean    23.49952506596308   24.851108179419523
+ sd      79.4384873820721    84.43330500777459
+ cv      3.3804294835358824  3.3975669977445166
+ ```
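For reading that output: size is the number of data rows per column, and cv is the coefficient of variation, sd/mean. A plain-Ruby sketch of how the derived rows relate (illustrative values, assuming the sample standard deviation; not the statsample implementation):

```ruby
values = [2.0, 6.26, 23.5, 64.2]   # hypothetical column of values
n    = values.size.to_f
mean = values.reduce(:+) / n
sd   = Math.sqrt(values.map { |v| (v - mean)**2 }.reduce(:+) / (n - 1))
cv   = sd / mean                   # e.g. 79.44 / 23.50 gives the 3.38 above
```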
+
  ### Sorting a table

  To sort a table on columns 4 and 2

  ```sh
  # not yet implemented
- bio-table test/data/input/table1.csv --sort 4,2 > test1a.tab
+ bio-table table1.csv --sort 4,2
  ```

  Note: not all is implemented (just yet). Please check bio-table --help first.
@@ -170,7 +243,7 @@ Note: not all is implemented (just yet). Please check bio-table --help first.
  You can combine/concat two or more tables by passing in multiple file names

  ```sh
- bio-table test/data/input/table1.csv test/data/input/table2.csv
+ bio-table table1.csv table2.csv
  ```

  this will append table2 to table1, assuming they have the same headers
@@ -245,7 +318,7 @@ bio-table can read data from STDIN, by simply assuming that the data
  piped in is the first input file

  ```sh
- cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
+ cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05"
  ```

  will filter both files test1.tab and table1.csv and output to
@@ -338,7 +411,7 @@ Note: the Ruby API below is a work in progress.
  Tables are two-dimensional matrices, which can be read from a file

  ```ruby
- t = Table.read_file('test/data/input/table1.csv')
+ t = Table.read_file('table1.csv')
  p t.header        # print the header array
  p t.name[0],t[0]  # print the row name and the first row
  p t[0][0]         # print the top corner field
@@ -349,7 +422,7 @@ which column to use for names etc. More interestingly you can pass a
  function to limit the number of rows read into memory:

  ```ruby
- t = Table.read_file('test/data/input/table1.csv',
+ t = Table.read_file('table1.csv',
    :by_row => lambda { |row| row[0..3] } )
  ```

@@ -358,7 +431,7 @@ the same idea to reformat and reorder table columns when reading data
  into the table. E.g.

  ```ruby
- t = Table.read_file('test/data/input/table1.csv',
+ t = Table.read_file('table1.csv',
    :by_row => lambda { |row| [row.rowname, row[0..3], row[6].to_i].flatten } )
  ```

@@ -368,7 +441,7 @@ can pass in a :by_header, which will have :by_row only call on
  actual table rows.

  ```ruby
- t = Table.read_file('test/data/input/table1.csv',
+ t = Table.read_file('table1.csv',
    :by_header => lambda { |header| ["Row name", header[0..3], header[6]].flatten },
    :by_row    => lambda { |row| [row.rowname, row[0..3], row[6].to_i].flatten } )
  ```
@@ -378,7 +451,7 @@ transform a file, and not loading it in memory, is

  ```ruby
  f = File.new('test.tab','w')
- t = Table.read_file('test/data/input/table1.csv',
+ t = Table.read_file('table1.csv',
    :by_row => lambda { |row|
      TableRow::write(f,[row.rowname,row[0..3],row[6].to_i].flatten, :separator => "\t")
      nil # don't create a table in memory, effectively a filter
@@ -426,7 +499,8 @@ ARGV.each do | fn |
  end
  ```

- Essentially you can pass in any object that has the *each* method to
+ Essentially you can pass in any object that has the *each* method
+ (here the File object) to
  iterate through rows as Strings (f's each method reads in a line at a
  time). The emit function yields the parsed row object as a simple
  array of fields (each field a String). The type is used to distinguish
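A minimal sketch of that contract, mirroring the emit call in the bin/bio-table diff below (the Array input and the option values are assumptions; the changelog itself only shows File and STDIN as sources):

```ruby
require 'bio-table'

# Any each-able yielding one line at a time can stand in for the File object
lines = [ "gene\tAJ\tB6",
          "BGT1\t0.01\t0.2" ]
opts = { :num_filter => 'values[0] <= 0.05' }
BioTable::TableLoader.emit(lines, opts).each do |row, type|
  # row is an array of String fields, row[0] the rowname;
  # type distinguishes the header row from data rows
  p [type, row]
end
```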
data/VERSION CHANGED
@@ -1 +1 @@
- 0.0.6
+ 0.8.0
@@ -98,6 +98,10 @@ opts = OptionParser.new do |o|
      options[:with_rownames] = true
    end

+   o.on('--unshift-headers','Add an extra header element at the front (header contains one fewer field than the number of columns)') do
+     options[:unshift_headers] = true
+   end
+
    o.on('--strip-quotes','Strip quotes from table fields') do
      options[:strip_quotes] = true
    end
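In README terms this option prepends an 'ID' element when the header is one field short of the data columns; a one-line sketch of the presumed effect (not the gem's actual implementation):

```ruby
header = ["AJ", "B6", "Axb1"]                     # one fewer field than the data columns
header.unshift("ID") if options[:unshift_headers]
# header is now ["ID", "AJ", "B6", "Axb1"]
```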
@@ -130,6 +134,10 @@ opts = OptionParser.new do |o|
      options[:blank_nodes] = true
    end

+   o.on('--statistics','Output column statistics') do
+     options[:statistics] = true
+   end
+
    o.separator "\n\tVerbosity:\n\n"

    o.on("--logger filename",String,"Log to file (default stderr)") do | name |
@@ -224,22 +232,34 @@ writer =
  end

  if INPUT_ON_STDIN
-   opts = options.dup # so we can modify options
+   opts = options.dup # so we can 'safely' modify options
    BioTable::TableLoader.emit(STDIN, opts).each do |row, type|
      writer.write(TableRow.new(row[0],row[1..-1]),type)
    end
    options[:write_header] = false # don't write the header for chained files
  end

+ statistics = if options[:statistics]
+   BioTable::Statistics::Accumulate.new
+ else
+   nil
+ end
+
  ARGV.each do | fn |
-   opts = options.dup # so we can modify options
+   opts = options.dup # so we can 'safely' modify options
    f = File.open(fn,"r")
    if not opts[:in_format] and fn =~ /\.csv$/
      logger.debug "Autodetected CSV file"
      opts[:in_format] = :csv
    end
    BioTable::TableLoader.emit(f, opts).each do |row,type|
-     writer.write(TableRow.new(row[0],row[1..-1]),type)
+     if statistics
+       statistics.add(row,type)
+     else
+       writer.write(TableRow.new(row[0],row[1..-1]),type)
+     end
    end
    options[:write_header] = false # don't write the header for chained files
  end
+
+ statistics.write(writer) if statistics
@@ -3,11 +3,26 @@ Feature: Command-line interface (CLI)

  bio-table has a powerful command line interface. Here we regression test features.

- Scenario: Test the numerical filter by column values
+ Scenario: Test the numerical filter by indexed column values
    Given I have input file(s) named "test/data/input/table1.csv"
    When I execute "./bin/bio-table --num-filter 'values[3] > 0.05'"
    Then I expect the named output to match "table1-0_05"

+ Scenario: Test the numerical filter by column names
+   Given I have input file(s) named "test/data/input/table1.csv"
+   When I execute "./bin/bio-table --num-filter 'axb2 > 0.05'"
+   Then I expect the named output to match "table1-named-0_05"
+
+ Scenario: Test the filter by indexed column values
+   Given I have input file(s) named "test/data/input/table1.csv"
+   When I execute "./bin/bio-table --filter 'fields[3] =~ /0.1/'"
+   Then I expect the named output to match "table1-filter-0_1"
+
+ Scenario: Test the filter by column names
+   Given I have input file(s) named "test/data/input/table1.csv"
+   When I execute "./bin/bio-table --filter 'axb1 =~ /0.1/'"
+   Then I expect the named output to match "table1-filter-named-0_1"
+
  Scenario: Reduce columns
    Given I have input file(s) named "test/data/input/table1.csv"
    When I execute "./bin/bio-table test/data/input/table1.csv --columns '#Gene,AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19'"
@@ -78,4 +93,7 @@ Feature: Command-line interface (CLI)
    When I execute "./bin/bio-table --in-format split --split-on ',' --num-filter 'values[1]!=0' --with-headers"
    Then I expect the named output to match "table_filter_headers"

-
+ Scenario: Use count in filter
+   Given I have input file(s) named "test/data/input/table1.csv"
+   When I execute "./bin/bio-table --num-filter 'values.compact.max >= 10.0 and values.compact.count{|x| x>=3.0} > 3'"
+   Then I expect the named output to match "table_counter_filter"
@@ -0,0 +1,43 @@
+ @filter
+ Feature: Filter input table
+
+   bio-table should read input line by line as an iterator, and emit
+   filtered/transformed output, filtering for number values etc.
+
+   Scenario: Filter a table by value
+     Given I load a CSV table containing
+       """
+       bid,cid,length,num
+       1,a,4658,4
+       1,b,12060,6
+       2,c,5858,7
+       2,d,5626,4
+       3,e,18451,8
+       """
+     When I numerically filter the table for
+       | num_filter       | result            | description               |
+       | values[1] > 6000 | [12060,18451]     | basic filter              |
+       | value[1] > 6000  | [12060,18451]     | value is alias for values |
+       | num==4           | [4658,5626]       | column names as variables |
+       | num==4 or num==6 | [4658,12060,5626] | column names as variables |
+       | num==6           | [12060]           | column names as variables |
+       | length<5000      | [4658]            | column names as variables |
+     Then I should have result
+
+   Scenario: Filter a table by string
+     Given I load a CSV table containing
+       """
+       bid,cid,length,num
+       1,a,4658,4
+       1,b,12060,6
+       2,c,5858,7
+       2,d,5626,4
+       3,e,18451,8
+       """
+     When I filter the table for
+       | filter           | result       | description      |
+       | field[1] =~ /4/  | [4658,18451] | regex filter     |
+       | fields[1] =~ /4/ | [4658,18451] | alias fields     |
+       | length =~ /4/    | [4658,18451] | use column names |
+     Then I should have filter result
+
@@ -0,0 +1,46 @@
+ Given /^I load a CSV table containing$/ do |string|
+   @lines = string.split(/\n/)
+ end
+
+ When /^I numerically filter the table for$/ do |table|
+   # table is a Cucumber::Ast::Table
+   @table = table
+ end
+
+ Then /^I should have result$/ do
+   @table.hashes.each do |h|
+     p h
+     result = eval(h['result'])
+     options = { :in_format => :split, :split_on => ',' }
+     options[:num_filter] = h['num_filter']
+
+     p options
+     p result
+     t = BioTable::Table.new
+     rownames,lines = t.read_lines(@lines, options)
+     p lines
+     lines.map {|r| r[1].to_i }.should == result
+   end
+ end
+
+ When /^I filter the table for$/ do |table|
+   # table is a Cucumber::Ast::Table
+   @table1 = table
+ end
+
+ Then /^I should have filter result$/ do
+   @table1.hashes.each do |h|
+     p h
+     result = eval(h['result'])
+     options = { :in_format => :split, :split_on => ',' }
+     options[:filter] = h['filter']
+
+     p options
+     p result
+     t = BioTable::Table.new
+     rownames,lines = t.read_lines(@lines, options)
+     p lines
+     lines.map {|r| r[1].to_i }.should == result
+   end
+ end
+