bio-table 0.0.6 → 0.8.0
- data/README.md +100 -26
- data/VERSION +1 -1
- data/bin/bio-table +23 -3
- data/features/cli.feature +20 -2
- data/features/filters.feature +43 -0
- data/features/step_definitions/filters.rb +46 -0
- data/features/support/env.rb +3 -0
- data/lib/bio-table.rb +4 -0
- data/lib/bio-table/filter.rb +84 -18
- data/lib/bio-table/rdf.rb +8 -4
- data/lib/bio-table/statistics.rb +45 -0
- data/lib/bio-table/table.rb +6 -5
- data/lib/bio-table/table_apply.rb +13 -5
- data/lib/bio-table/tableload.rb +3 -2
- data/lib/bio-table/validator.rb +1 -1
- data/test/data/regression/table1-filter-0_1.ref +1 -0
- data/test/data/regression/table1-filter-named-0_1.ref +13 -0
- data/test/data/regression/table1-named-0_05.ref +281 -0
- data/test/data/regression/table_counter_filter.ref +197 -0
- metadata +29 -22
data/README.md
CHANGED
@@ -15,7 +15,13 @@ Quick example, say we want to filter out rows that contain certain
 p-values listed in the 4th column:
 
 ```
-bio-table
+bio-table table1.csv --num-filter "value[3] <= 0.05"
+```
+
+even better, you can use the actual column name
+
+```
+bio-table table1.csv --num-filter "fdr <= 0.05"
 ```
 
 bio-table should be lazy. And be good for big data, bio-table is
@@ -26,20 +32,24 @@ you don't need to know Ruby to use the command line interface (CLI).
 Features:
 
 * Support for reading and writing TAB and CSV files, as well as regex splitters
-* Filter on data
+* Filter on (numerical) data and rownames
 * Transform table and data by column or row
 * Recalculate data
+* Calculate new values
+* Calculate column statistics (mean, standard deviation)
 * Diff between tables, selecting on specific column values
 * Merge tables side by side on column value/rowname
 * Split/reduce tables by column
 * Write formatted tables, e.g. HTML, LaTeX
 * Read from STDIN, write to STDOUT
 * Convert table to RDF
+* Convert key-value (attributes) to RDF (nyi)
 * Convert table to JSON/YAML/XML (nyi)
+* Transpose matrix (nyi)
 * etc. etc.
 
 and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
-takes 0.87 second. Adding a filter makes it parse at 0.95 second on
+takes 0.87 seconds with Ruby 1.9. Adding a filter makes it parse at 0.95 seconds on
 my 3.2 GHz desktop (with preloaded disk cache).
 
 Note: this software is under active development, though what is
@@ -59,47 +69,55 @@ Tables can be transformed through the command line. To transform a
 comma separated file to a tab delimited one
 
 ```sh
-bio-table
+bio-table table1.csv --in-format csv --format tab > test1.tab
 ```
 
 Tab is actually the general default. Still, if the file name ends in
 csv, it will assume CSV. To convert the table back
 
 ```sh
-bio-table test1.tab --format csv >
+bio-table test1.tab --format csv > table1a.csv
 ```
 
-
+When you have a special file format, it is also possible to use a string or regex splitter, e.g.
 
 ```sh
-bio-table --in-format split --split-on ','
-bio-table --in-format regex --split-on '\s*,\s*'
+bio-table --in-format split --split-on ',' file
+bio-table --in-format regex --split-on '\s*,\s*' file
 ```
 
 To filter out rows that contain certain values
 
 ```sh
-bio-table
+bio-table table1.csv --num-filter "values[3] <= 0.05"
+```
+
+or, rather than using an index value (which can change between
+different tables), you can use the column name
+(lower case), say for FDR
+
+```sh
+bio-table table1.csv --num-filter "fdr <= 0.05"
 ```
 
 The filter ignores the header row, and the row names, by default. If you need
 either, use the switches --with-headers and --with-rownames. With math, list all rows
 
 ```sh
-bio-table
+bio-table table1.csv --num-filter "values[3]-values[6] >= 0.05"
 ```
 
 or, list all rows that have at least a field with values >= 1000.0
 
 ```sh
-bio-table
+bio-table table1.csv --num-filter "values.max >= 1000.0"
 ```
 
 Produce all rows that have at least 3 values above 3.0 and one value
 above 10.0:
 
 ```sh
-bio-table
+bio-table table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
 ```
 
 How is that for expressiveness? Looks like Ruby to me.
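The --num-filter strings above are plain Ruby, evaluated once per data row. As a rough sketch of the mechanism (data/lib/bio-table/filter.rb, which this release extends, is not included in this diff, so this is an illustration rather than the gem's code):

```ruby
# Illustration only: apply a --num-filter expression to a single row.
# Fields after the rowname are parsed as floats; non-numeric fields
# (such as NA) become nil.
expr = 'values.compact.max >= 1000.0'
row  = ["gene1", "12.5", "NA", "1171.23"]

values = row[1..-1].map { |f| Float(f) rescue nil }
puts eval(expr)   # => true, since 1171.23 >= 1000.0
```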
@@ -109,31 +127,60 @@ valid numbers are converted). If there are NA (nil) values in the table, you
 may wish to remove them, like this
 
 ```sh
-bio-table
+bio-table table1.csv --num-filter "values[0..12].compact.max >= 1000.0"
 ```
 
 which takes the first 13 fields and compact removes the nil values.
 
+To filter out all rows with more than 3 NA values:
+
+```sh
+bio-table table.csv --num-filter 'values.to_a.size - values.compact.size > 3'
+```
+
 Also string comparisons and regular expressions can be used. E.g.
 filter on rownames and a row field both containing 'BGT'
 
 ```sh
-
-
+bio-table table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/"
+```
+
+or use the column name, rather than the indexed column field:
+
+```sh
+bio-table table1.csv --filter "rowname =~ /BGT/ and genename =~ /BGT/"
 ```
 
 To reorder/reduce table columns by name
 
 ```sh
-bio-table
+bio-table table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19
 ```
 
 or use their index numbers (the first column is zero)
 
 ```sh
-bio-table
+bio-table table1.csv --columns 0,1,8,2,4,6
+```
+
+If the table header happens to be one element shorter than the number of columns
+in the table, use --unshift-headers; column 0 becomes an 'ID' column
+
+```sh
+bio-table table1.csv --unshift-headers --columns 0,1,8,2,4,6
+```
+
+Duplicate columns with
+
+```sh
+bio-table table1.csv --columns AJ,B6,AJ,Axb1,Axb4,AXB13,Axb15,Axb19
 ```
 
+Combine column values (more on rewrite below)
+
+```sh
+bio-table table1.csv --rewrite "rowname = rowname + '-' + field[0]"
+```
 
 To filter for columns using a regular expression
 
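A quick illustration of what compact buys you in the NA filters above (assuming, as elsewhere, that NA fields parse to nil):

```ruby
# NA fields become nil after numeric conversion
values = [0.15, nil, 0.8, nil, nil, 2.4]

p values.compact                          # => [0.15, 0.8, 2.4], NAs removed
p values[0..12].compact.max               # => 2.4
p values.to_a.size - values.compact.size  # => 3, the NA count for this row
```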
@@ -154,13 +201,39 @@ again
 where we rewrite the rowname in capitals, and set the second field to
 empty if the third field is below 0.25.
 
+### Statistics
+
+bio-table can handle some column statistics using the Ruby statsample
+gem
+
+```sh
+gem install statsample
+```
+
+(statsample is not loaded by default, as it has a host of
+dependencies)
+
+Thereafter, to calculate the stats for columns 1 and 2 (rowname is column 0)
+
+```sh
+bio-table --statistics --columns 1,2 table1.csv
+stat   AJ                 B6
+size   379                379
+min    0.0                0.0
+max    1171.23            1309.25
+median 6.26               7.45
+mean   23.49952506596308  24.851108179419523
+sd     79.4384873820721   84.43330500777459
+cv     3.3804294835358824 3.3975669977445166
+```
+
 ### Sorting a table
 
 To sort a table on column 4 and 2
 
 ```sh
 # not yet implemented
-bio-table
+bio-table table1.csv --sort 4,2
 ```
 
 Note: not all is implemented (just yet). Please check bio-table --help first.
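For reference, the numbers in that listing can be reproduced in a few lines of Ruby. A sketch, not bio-table's implementation (the gem delegates to statsample; that sd is the sample standard deviation is an assumption, while cv = sd/mean matches the output shown):

```ruby
# Sketch: the column statistics printed by --statistics.
# median is taken as sorted[n/2], a simplification for even n.
def column_stats(xs)
  sorted = xs.sort
  n      = xs.size
  mean   = xs.inject(:+) / n
  sd     = Math.sqrt(xs.inject(0.0) { |s, x| s + (x - mean)**2 } / (n - 1))
  { :size => n, :min => sorted.first, :max => sorted.last,
    :median => sorted[n / 2], :mean => mean, :sd => sd, :cv => sd / mean }
end

p column_stats([0.0, 6.26, 23.5, 1171.23])
```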
@@ -170,7 +243,7 @@ Note: not all is implemented (just yet). Please check bio-table --help first.
 You can combine/concat two or more tables by passing in multiple file names
 
 ```sh
-bio-table
+bio-table table1.csv table2.csv
 ```
 
 this will append table2 to table1, assuming they have the same headers
@@ -245,7 +318,7 @@ bio-table can read data from STDIN, by simply assuming that the data
 piped in is the first input file
 
 ```sh
-cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05"
+cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05"
 ```
 
 will filter both files test1.tab and table1.csv and output to
@@ -338,7 +411,7 @@ Note: the Ruby API below is a work in progress.
 Tables are two dimensional matrixes, which can be read from a file
 
 ```ruby
-t = Table.read_file('
+t = Table.read_file('table1.csv')
 p t.header       # print the header array
 p t.name[0],t[0] # print the row name and row
 p t[0][0]        # print the top corner field
@@ -349,7 +422,7 @@ which column to use for names etc. More interestingly you can pass a
 function to limit the number of rows read into memory:
 
 ```ruby
-t = Table.read_file('
+t = Table.read_file('table1.csv',
   :by_row => { | row | row[0..3] } )
 ```
 
@@ -358,7 +431,7 @@ the same idea to reformat and reorder table columns when reading data
 into the table. E.g.
 
 ```ruby
-t = Table.read_file('
+t = Table.read_file('table1.csv',
   :by_row => { | row | [row.rowname, row[0..3], row[6].to_i].flatten } )
 ```
 
@@ -368,7 +441,7 @@ can pass in a :by_header, which will have :by_row only call on
 actual table rows.
 
 ```ruby
-t = Table.read_file('
+t = Table.read_file('table1.csv',
   :by_header => { | header | ["Row name", header[0..3], header[6]].flatten },
   :by_row => { | row | [row.rowname, row[0..3], row[6].to_i].flatten } )
 ```
@@ -378,7 +451,7 @@ transform a file, and not loading it in memory, is
 
 ```ruby
 f = File.new('test.tab','w')
-t = Table.read_file('
+t = Table.read_file('table1.csv',
   :by_row => { | row |
     TableRow::write(f,[row.rowname,row[0..3],row[6].to_i].flatten, :separator => "\t")
     nil # don't create a table in memory, effectively a filter
@@ -426,7 +499,8 @@ ARGV.each do | fn |
 end
 ```
 
-Essentially you can pass in any object that has the *each* method
+Essentially you can pass in any object that has the *each* method
+(here the File object) to
 iterate through rows as String (f's each method reads in a line at a
 time). The emit function yields the parsed row object as a simple
 array of fields (each field a String). The type is used to distinguish
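Since emit only needs an object that responds to each (yielding one line per call), an in-memory array of strings can stand in for the File object. A sketch, assuming an empty options hash behaves like the unconfigured CLI:

```ruby
require 'bio-table'

# Any each-able source of lines works, not just a File
lines = ["#Gene\tAJ\tB6", "gene1\t12.5\t13.1", "gene2\t0.0\t1.2"]
BioTable::TableLoader.emit(lines, {}).each do |row, type|
  p [type, row]   # row is an array of String fields
end
```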
data/VERSION
CHANGED
@@ -1 +1 @@
-0.0.6
+0.8.0
data/bin/bio-table
CHANGED
@@ -98,6 +98,10 @@ opts = OptionParser.new do |o|
     options[:with_rownames] = true
   end
 
+  o.on('--unshift-headers','Add an extra header element at the front (header contains one fewer field than the number of columns)') do
+    options[:unshift_headers] = true
+  end
+
   o.on('--strip-quotes','Strip quotes from table fields') do
     options[:strip_quotes] = true
   end
@@ -130,6 +134,10 @@ opts = OptionParser.new do |o|
     options[:blank_nodes] = true
   end
 
+  o.on('--statistics','Output column statistics') do
+    options[:statistics] = true
+  end
+
   o.separator "\n\tVerbosity:\n\n"
 
   o.on("--logger filename",String,"Log to file (default stderr)") do | name |
@@ -224,22 +232,34 @@ writer =
 end
 
 if INPUT_ON_STDIN
-  opts = options.dup # so we can modify options
+  opts = options.dup # so we can 'safely' modify options
   BioTable::TableLoader.emit(STDIN, opts).each do |row, type|
     writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
   options[:write_header] = false # don't write the header for chained files
 end
 
+statistics = if options[:statistics]
+               BioTable::Statistics::Accumulate.new
+             else
+               nil
+             end
+
 ARGV.each do | fn |
-  opts = options.dup # so we can modify options
+  opts = options.dup # so we can 'safely' modify options
   f = File.open(fn,"r")
   if not opts[:in_format] and fn =~ /\.csv$/
     logger.debug "Autodetected CSV file"
     opts[:in_format] = :csv
   end
   BioTable::TableLoader.emit(f, opts).each do |row,type|
-    writer.write(TableRow.new(row[0],row[1..-1]),type)
+    if statistics
+      statistics.add(row,type)
+    else
+      writer.write(TableRow.new(row[0],row[1..-1]),type)
+    end
   end
   options[:write_header] = false # don't write the header for chained files
 end
+
+statistics.write(writer) if statistics
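The diff shows only the call sites: statistics.add(row, type) while rows stream past, then statistics.write(writer) once at the end. The accumulator itself is in data/lib/bio-table/statistics.rb, which is not included in this excerpt; a hypothetical sketch of its general shape:

```ruby
# Hypothetical sketch, NOT the code in data/lib/bio-table/statistics.rb:
# collect numeric fields per column during streaming, compute at the end.
class Accumulate
  def initialize
    @header  = nil
    @columns = Hash.new { |h, k| h[k] = [] }
  end

  def add(row, type)
    if type == :header               # assumed type symbol
      @header = row
    else
      row[1..-1].each_with_index do |field, i|
        v = Float(field) rescue nil
        @columns[i] << v if v        # skip NA fields
      end
    end
  end
end
```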
data/features/cli.feature
CHANGED
@@ -3,11 +3,26 @@ Feature: Command-line interface (CLI)
 
   bio-table has a powerful command line interface. Here we regression test features.
 
-  Scenario: Test the numerical filter by column values
+  Scenario: Test the numerical filter by indexed column values
     Given I have input file(s) named "test/data/input/table1.csv"
     When I execute "./bin/bio-table --num-filter 'values[3] > 0.05'"
     Then I expect the named output to match "table1-0_05"
 
+  Scenario: Test the numerical filter by column names
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --num-filter 'axb2 > 0.05'"
+    Then I expect the named output to match "table1-named-0_05"
+
+  Scenario: Test the filter by indexed column values
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --filter 'fields[3] =~ 0.1'"
+    Then I expect the named output to match "table1-filter-0_1"
+
+  Scenario: Test the filter by column names
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --filter 'axb1 =~ /0.1/'"
+    Then I expect the named output to match "table1-filter-named-0_1"
+
   Scenario: Reduce columns
     Given I have input file(s) named "test/data/input/table1.csv"
     When I execute "./bin/bio-table test/data/input/table1.csv --columns '#Gene,AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19'"
@@ -78,4 +93,7 @@ Feature: Command-line interface (CLI)
     When I execute "./bin/bio-table --in-format split --split-on ',' --num-filter 'values[1]!=0' --with-headers"
     Then I expect the named output to match "table_filter_headers"
 
-
+  Scenario: Use count in filter
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --num-filter 'values.compact.max >= 10.0 and values.compact.count{|x| x>=3.0} > 3'"
+    Then I expect the named output to match "table_counter_filter"
data/features/filters.feature
ADDED
@@ -0,0 +1,43 @@
+@filter
+Feature: Filter input table
+
+  bio-table should read input line by line as an iterator, and emit
+  filtered/transformed output, filtering for number values etc.
+
+  Scenario: Filter a table by value
+    Given I load a CSV table containing
+      """
+      bid,cid,length,num
+      1,a,4658,4
+      1,b,12060,6
+      2,c,5858,7
+      2,d,5626,4
+      3,e,18451,8
+      """
+    When I numerically filter the table for
+      | num_filter        | result            | description               |
+      | values[1] > 6000  | [12060,18451]     | basic filter              |
+      | value[1] > 6000   | [12060,18451]     | value is alias for values |
+      | num==4            | [4658,5626]       | column names as variables |
+      | num==4 or num==6  | [4658,12060,5626] | column names as variables |
+      | num==6            | [12060]           | column names as variables |
+      | length<5000       | [4658]            | column names as variables |
+    Then I should have result
+
+  Scenario: Filter a table by string
+    Given I load a CSV table containing
+      """
+      bid,cid,length,num
+      1,a,4658,4
+      1,b,12060,6
+      2,c,5858,7
+      2,d,5626,4
+      3,e,18451,8
+      """
+    When I filter the table for
+      | filter           | result       | description      |
+      | field[1] =~ /4/  | [4658,18451] | regex filter     |
+      | fields[1] =~ /4/ | [4658,18451] | alias fields     |
+      | length =~ /4/    | [4658,18451] | use column names |
+    Then I should have filter result
+
data/features/step_definitions/filters.rb
ADDED
@@ -0,0 +1,46 @@
+Given /^I load a CSV table containing$/ do |string|
+  @lines = string.split(/\n/)
+end
+
+When /^I numerically filter the table for$/ do |table|
+  # table is a Cucumber::Ast::Table
+  @table = table
+end
+
+Then /^I should have result$/ do
+  @table.hashes.each do |h|
+    p h
+    result = eval(h['result'])
+    options = { :in_format => :split, :split_on => ',' }
+    options[:num_filter] = h['num_filter']
+
+    p options
+    p result
+    t = BioTable::Table.new
+    rownames,lines = t.read_lines(@lines, options)
+    p lines
+    lines.map {|r| r[1].to_i }.should == result
+  end
+end
+
+When /^I filter the table for$/ do |table|
+  # table is a Cucumber::Ast::Table
+  @table1 = table
+end
+
+Then /^I should have filter result$/ do
+  @table1.hashes.each do |h|
+    p h
+    result = eval(h['result'])
+    options = { :in_format => :split, :split_on => ',' }
+    options[:filter] = h['filter']
+
+    p options
+    p result
+    t = BioTable::Table.new
+    rownames,lines = t.read_lines(@lines, options)
+    p lines
+    lines.map {|r| r[1].to_i }.should == result
+  end
+end
+
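The new feature file is tagged @filter (its first line above), so it can be run on its own with cucumber's standard tag selector, `cucumber --tags @filter`, alongside the full regression suite.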