bio-table 0.0.6 → 0.8.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/README.md +100 -26
- data/VERSION +1 -1
- data/bin/bio-table +23 -3
- data/features/cli.feature +20 -2
- data/features/filters.feature +43 -0
- data/features/step_definitions/filters.rb +46 -0
- data/features/support/env.rb +3 -0
- data/lib/bio-table.rb +4 -0
- data/lib/bio-table/filter.rb +84 -18
- data/lib/bio-table/rdf.rb +8 -4
- data/lib/bio-table/statistics.rb +45 -0
- data/lib/bio-table/table.rb +6 -5
- data/lib/bio-table/table_apply.rb +13 -5
- data/lib/bio-table/tableload.rb +3 -2
- data/lib/bio-table/validator.rb +1 -1
- data/test/data/regression/table1-filter-0_1.ref +1 -0
- data/test/data/regression/table1-filter-named-0_1.ref +13 -0
- data/test/data/regression/table1-named-0_05.ref +281 -0
- data/test/data/regression/table_counter_filter.ref +197 -0
- metadata +29 -22
data/README.md
CHANGED
@@ -15,7 +15,13 @@ Quick example, say we want to filter out rows that contain certain
 p-values listed in the 4th column:
 
 ```
-bio-table
+bio-table table1.csv --num-filter "value[3] <= 0.05"
+```
+
+even better, you can use the actual column name
+
+```
+bio-table table1.csv --num-filter "fdr <= 0.05"
 ```
 
 bio-table should be lazy. And be good for big data, bio-table is
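The new `--num-filter` option shown above takes a plain Ruby expression that is evaluated against each data row. A rough sketch of the idea in plain Ruby (an illustration only, not bio-table's implementation; the file name and column index are the ones from the example):

```ruby
require 'csv'

# Keep rows whose 4th column (index 3) is at most 0.05, skipping the
# header row, as bio-table's numeric filter does by default.
CSV.read('table1.csv', headers: true).each do |row|
  puts row.to_csv if row[3].to_f <= 0.05
end
```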
@@ -26,20 +32,24 @@ you don't need to know Ruby to use the command line interface (CLI).
 Features:
 
 * Support for reading and writing TAB and CSV files, as well as regex splitters
-* Filter on data
+* Filter on (numerical) data and rownames
 * Transform table and data by column or row
 * Recalculate data
+* Calculate new values
+* Calculate column statistics (mean, standard deviation)
 * Diff between tables, selecting on specific column values
 * Merge tables side by side on column value/rowname
 * Split/reduce tables by column
 * Write formatted tables, e.g. HTML, LaTeX
 * Read from STDIN, write to STDOUT
 * Convert table to RDF
+* Convert key-value (attributes) to RDF (nyi)
 * Convert table to JSON/YAML/XML (nyi)
+* Transpose matrix (nyi)
 * etc. etc.
 
 and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
-takes 0.87 second. Adding a filter makes it parse at 0.95 second on
+takes 0.87 second with Ruby 1.9. Adding a filter makes it parse at 0.95 second on
 my 3.2 GHz desktop (with preloaded disk cache).
 
 Note: this software is under active development, though what is
@@ -59,47 +69,55 @@ Tables can be transformed through the command line. To transform a
 comma separated file to a tab delimited one
 
 ```sh
-bio-table
+bio-table table1.csv --in-format csv --format tab > test1.tab
 ```
 
 Tab is actually the general default. Still, if the file name ends in
 csv, it will assume CSV. To convert the table back
 
 ```sh
-bio-table test1.tab --format csv >
+bio-table test1.tab --format csv > table1a.csv
 ```
 
-
+When you have a special file format, it is also possible to use a string or regex splitter, e.g.
 
 ```sh
-bio-table --in-format split --split-on ','
-bio-table --in-format regex --split-on '\s*,\s*'
+bio-table --in-format split --split-on ',' file
+bio-table --in-format regex --split-on '\s*,\s*' file
 ```
 
 To filter out rows that contain certain values
 
 ```sh
-bio-table
+bio-table table1.csv --num-filter "values[3] <= 0.05"
+```
+
+or, rather than using an index value (which can change between
+different tables), you can use the column name
+(lower case), say for FDR
+
+```sh
+bio-table table1.csv --num-filter "fdr <= 0.05"
 ```
 
 The filter ignores the header row, and the row names, by default. If you need
 either, use the switches --with-headers and --with-rownames. With math, list all rows
 
 ```sh
-bio-table
+bio-table table1.csv --num-filter "values[3]-values[6] >= 0.05"
 ```
 
 or, list all rows that have a least a field with values >= 1000.0
 
 ```sh
-bio-table
+bio-table table1.csv --num-filter "values.max >= 1000.0"
 ```
 
 Produce all rows that have at least 3 values above 3.0 and 1 one value
 above 10.0:
 
 ```sh
-bio-table
+bio-table table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
 ```
 
 How is that for expressiveness? Looks like Ruby to me.
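The `--num-filter` expressions in this hunk are ordinary Ruby evaluated per row, with `values` bound to the row's numeric fields. What the last predicate does, written against a plain Ruby array (illustration only, with made-up numbers):

```ruby
# One data row, already converted to numbers (bio-table converts valid fields).
values = [0.2, 12.5, 3.4, 0.9, 11.0]

# The predicate from the last example above:
keep = values.max >= 10.0 && values.count { |x| x >= 3.0 } > 3
puts keep  # => false; only three fields reach 3.0, and the filter needs more than 3
```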
@@ -109,31 +127,60 @@ valid numbers are converted). If there are NA (nil) values in the table, you
 may wish to remove them, like this
 
 ```sh
-bio-table
+bio-table table1.csv --num-filter "values[0..12].compact.max >= 1000.0"
 ```
 
 which takes the first 13 fields and compact removes the nil values.
 
+To filter out all rows with more than 3 NA values:
+
+```sh
+bio-table table.csv --num-filter 'values.to_a.size - values.compact.size > 3'
+```
+
 Also string comparisons and regular expressions can be used. E.g.
 filter on rownames and a row field both containing 'BGT'
 
 ```sh
-
-
+bio-table table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/"
+```
+
+or use the column name, rather than the indexed column field:
+
+```sh
+bio-table table1.csv --filter "rowname =~ /BGT/ and genename =~ /BGT/"
 ```
 
 To reorder/reduce table columns by name
 
 ```sh
-bio-table
+bio-table table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19
 ```
 
 or use their index numbers (the first column is zero)
 
 ```sh
-bio-table
+bio-table table1.csv --columns 0,1,8,2,4,6
+```
+
+If the table header happens to be one element shorter than the number of columns
+in the table, use unshift headers, 0 becomes an 'ID' column
+
+```sh
+bio-table table1.csv --unshift-headers --columns 0,1,8,2,4,6
+```
+
+Duplicate columns with
+
+```sh
+bio-table table1.csv --columns AJ,B6,AJ,Axb1,Axb4,AXB13,Axb15,Axb19
 ```
 
+Combine column values (more on rewrite below)
+
+```sh
+bio-table table1.csv --rewrite "rowname = rowname + '-' + field[0]"
+```
 
 To filter for columns using a regular expression
 
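Selecting or reordering columns by header name, as `--columns` does above, is essentially an index lookup against the header row. A small plain-Ruby illustration of that idea (hypothetical header and row values, not bio-table's code):

```ruby
header = ['#Gene', 'AJ', 'B6', 'Axb1', 'Axb2', 'Axb4', 'AXB13', 'Axb15', 'Axb19']
row    = ['BGT-1', '0.12', '0.98', '0.05', '0.44', '0.31', '0.02', '0.66', '0.09']

wanted  = ['AJ', 'B6', 'Axb1', 'Axb4', 'AXB13', 'Axb15', 'Axb19']
indices = wanted.map { |name| header.index(name) }

# Emit the row reduced/reordered to the wanted columns
puts indices.map { |i| row[i] }.join("\t")
```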
@@ -154,13 +201,39 @@ again
 where we rewrite the rowname in capitals, and set the second field to
 empty if the third field is below 0.25.
 
+### Statistics
+
+bio-table can handle some column statistics using the Ruby statsample
+gem
+
+```sh
+gem install statsample
+```
+
+(statsample is not loaded by default, as it has a host of
+dependencies)
+
+Thereafter, to calculate the stats for columns 1 and 2 (rowname is column 0)
+
+```sh
+bio-table --statistics --columns 1,2 table1.csv
+stat AJ B6
+size 379 379
+min 0.0 0.0
+max 1171.23 1309.25
+median 6.26 7.45
+mean 23.49952506596308 24.851108179419523
+sd 79.4384873820721 84.43330500777459
+cv 3.3804294835358824 3.3975669977445166
+```
+
 ### Sorting a table
 
 To sort a table on column 4 and 2
 
 ```sh
 # not yet implemented
-bio-table
+bio-table table1.csv --sort 4,2
 ```
 
 Note: not all is implemented (just yet). Please check bio-table --help first.
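The statistics block added above reports size, min, max, median, mean, sd and cv per column. What those numbers mean, in plain Ruby (bio-table itself delegates to the statsample gem; the column values here are made up):

```ruby
column = [0.0, 6.26, 23.5, 1171.23, 4.7]  # hypothetical column values

size   = column.size
sorted = column.sort
median = size.odd? ? sorted[size / 2] : (sorted[size / 2 - 1] + sorted[size / 2]) / 2.0
mean   = column.inject(0.0, :+) / size
sd     = Math.sqrt(column.inject(0.0) { |s, x| s + (x - mean)**2 } / (size - 1))  # sample sd
cv     = sd / mean  # coefficient of variation: sd relative to the mean

puts [size, column.min, column.max, median, mean, sd, cv].join("\t")
```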
@@ -170,7 +243,7 @@ Note: not all is implemented (just yet). Please check bio-table --help first.
 You can combine/concat two or more tables by passing in multiple file names
 
 ```sh
-bio-table
+bio-table table1.csv table2.csv
 ```
 
 this will append table2 to table1, assuming they have the same headers
@@ -245,7 +318,7 @@ bio-table can read data from STDIN, by simply assuming that the data
 piped in is the first input file
 
 ```sh
-cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05"
+cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05"
 ```
 
 will filter both files test1.tab and test1.csv and output to
@@ -338,7 +411,7 @@ Note: the Ruby API below is a work in progress.
 Tables are two dimensional matrixes, which can be read from a file
 
 ```ruby
-t = Table.read_file('
+t = Table.read_file('table1.csv')
 p t.header # print the header array
 p t.name[0],t[0] # print the row name and row row
 p t[0][0] # print the top corner field
@@ -349,7 +422,7 @@ which column to use for names etc. More interestingly you can pass a
 function to limit the amount of row read into memory:
 
 ```ruby
-t = Table.read_file('
+t = Table.read_file('table1.csv',
 :by_row => { | row | row[0..3] } )
 ```
 
@@ -358,7 +431,7 @@ the same idea to reformat and reorder table columns when reading data
 into the table. E.g.
 
 ```ruby
-t = Table.read_file('
+t = Table.read_file('table1.csv',
 :by_row => { | row | [row.rowname, row[0..3], row[6].to_i].flatten } )
 ```
 
@@ -368,7 +441,7 @@ can pass in a :by_header, which will have :by_row only call on
 actual table rows.
 
 ```ruby
-t = Table.read_file('
+t = Table.read_file('table1.csv',
 :by_header => { | header | ["Row name", header[0..3], header[6]].flatten } )
 :by_row => { | row | [row.rowname, row[0..3], row[6].to_i].flatten } )
 ```
@@ -378,7 +451,7 @@ transform a file, and not loading it in memory, is
 
 ```ruby
 f = File.new('test.tab','w')
-t = Table.read_file('
+t = Table.read_file('table1.csv',
 :by_row => { | row |
 TableRow::write(f,[row.rowname,row[0..3],row[6].to_i].flatten, :separator => "\t")
 nil # don't create a table in memory, effectively a filter
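Taken together, the :by_header/:by_row hooks in these hunks give a streaming transform. A fuller sketch of how the pieces combine (assumptions: `require 'bio-table'` plus the BioTable namespace, and lambdas where the README writes bare block syntax; treat it as an illustration of the documented calls, not verified API):

```ruby
require 'bio-table'
include BioTable  # assumed namespace, as used elsewhere in this diff

# Stream table1.csv: keep the rowname, the first four data fields and an
# integer version of field 6, writing straight to test.tab without building
# a table in memory (returning nil from :by_row acts as a filter).
f = File.new('test.tab', 'w')
Table.read_file('table1.csv',
  :by_header => lambda { |header| ['Row name', header[0..3], header[6]].flatten },
  :by_row    => lambda { |row|
    TableRow::write(f, [row.rowname, row[0..3], row[6].to_i].flatten, :separator => "\t")
    nil
  })
f.close
```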
@@ -426,7 +499,8 @@ ARGV.each do | fn |
 end
 ```
 
-Essentially you can pass in any object that has the *each* method
+Essentially you can pass in any object that has the *each* method
+(here the File object) to
 iterate through rows as String (f's each method reads in a line at a
 time). The emit function yields the parsed row object as a simple
 array of fields (each field a String). The type is used to distinguish
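The emit pattern described at the end of this README section is the same one the updated bin/bio-table uses (see its hunks below). A condensed sketch of the loop, with the calls and option keys taken from this diff:

```ruby
require 'bio-table'

# Stream a CSV file row by row; each yielded row is a plain array of String
# fields, and type distinguishes e.g. header rows from data rows.
opts = { :in_format => :csv, :num_filter => 'values[3] <= 0.05' }
File.open('table1.csv', 'r') do |f|
  BioTable::TableLoader.emit(f, opts).each do |row, type|
    puts [type, row.join(',')].join("\t")
  end
end
```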
data/VERSION
CHANGED
@@ -1 +1 @@
-0.0
+0.8.0
data/bin/bio-table
CHANGED
@@ -98,6 +98,10 @@ opts = OptionParser.new do |o|
     options[:with_rownames] = true
   end
 
+  o.on('--unshift-headers','Add an extra header element at the front (header contains one fewer field than the number of columns)') do
+    options[:unshift_headers] = true
+  end
+
   o.on('--strip-quotes','Strip quotes from table fields') do
     options[:strip_quotes] = true
   end
@@ -130,6 +134,10 @@ opts = OptionParser.new do |o|
     options[:blank_nodes] = true
   end
 
+  o.on('--statistics','Output column statistics') do
+    options[:statistics] = true
+  end
+
   o.separator "\n\tVerbosity:\n\n"
 
   o.on("--logger filename",String,"Log to file (default stderr)") do | name |
@@ -224,22 +232,34 @@ writer =
 end
 
 if INPUT_ON_STDIN
-  opts = options.dup # so we can modify options
+  opts = options.dup # so we can 'safely' modify options
   BioTable::TableLoader.emit(STDIN, opts).each do |row, type|
     writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
   options[:write_header] = false # don't write the header for chained files
 end
 
+statistics = if options[:statistics]
+               BioTable::Statistics::Accumulate.new
+             else
+               nil
+             end
+
 ARGV.each do | fn |
-  opts = options.dup # so we can modify options
+  opts = options.dup # so we can 'safely' modify options
   f = File.open(fn,"r")
   if not opts[:in_format] and fn =~ /\.csv$/
     logger.debug "Autodetected CSV file"
     opts[:in_format] = :csv
   end
   BioTable::TableLoader.emit(f, opts).each do |row,type|
-
+    if statistics
+      statistics.add(row,type)
+    else
+      writer.write(TableRow.new(row[0],row[1..-1]),type)
+    end
   end
   options[:write_header] = false # don't write the header for chained files
 end
+
+statistics.write(writer) if statistics
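lib/bio-table/statistics.rb itself is not part of the hunks shown here, but the bin/bio-table changes above pin down its calling convention: an Accumulate object receives every (row, type) pair through add and writes a summary table through write(writer). A purely speculative sketch of an accumulator with that shape, only to illustrate the flow (the real class differs; per the README it builds on statsample, and the :header/:data type values used below are assumptions):

```ruby
# Hypothetical accumulator mirroring the add/write calls in bin/bio-table above.
class ColumnAccumulator
  def initialize
    @header  = nil
    @columns = nil
  end

  def add(row, type)
    if type == :header                                # assumed type value
      @header = row
    else
      @columns ||= Array.new(row.size - 1) { [] }     # skip the rowname field
      row[1..-1].each_with_index { |v, i| @columns[i] << v.to_f }
    end
  end

  def write(writer)
    means = @columns.map { |c| c.inject(0.0, :+) / c.size }
    writer.write(TableRow.new('mean', means.map(&:to_s)), :data)  # assumed type value
  end
end
```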
data/features/cli.feature
CHANGED
@@ -3,11 +3,26 @@ Feature: Command-line interface (CLI)
 
   bio-table has a powerful command line interface. Here we regression test features.
 
-  Scenario: Test the numerical filter by column values
+  Scenario: Test the numerical filter by indexed column values
     Given I have input file(s) named "test/data/input/table1.csv"
     When I execute "./bin/bio-table --num-filter 'values[3] > 0.05'"
     Then I expect the named output to match "table1-0_05"
 
+  Scenario: Test the numerical filter by column names
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --num-filter 'axb2 > 0.05'"
+    Then I expect the named output to match "table1-named-0_05"
+
+  Scenario: Test the filter by indexed column values
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --filter 'fields[3] =~ 0.1'"
+    Then I expect the named output to match "table1-filter-0_1"
+
+  Scenario: Test the filter by column names
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --filter 'axb1 =~ /0.1/'"
+    Then I expect the named output to match "table1-filter-named-0_1"
+
   Scenario: Reduce columns
     Given I have input file(s) named "test/data/input/table1.csv"
     When I execute "./bin/bio-table test/data/input/table1.csv --columns '#Gene,AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19'"
@@ -78,4 +93,7 @@ Feature: Command-line interface (CLI)
     When I execute "./bin/bio-table --in-format split --split-on ',' --num-filter 'values[1]!=0' --with-headers"
     Then I expect the named output to match "table_filter_headers"
 
-
+  Scenario: Use count in filter
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --num-filter 'values.compact.max >= 10.0 and values.compact.count{|x| x>=3.0} > 3'"
+    Then I expect the named output to match "table_counter_filter"
data/features/filters.feature
ADDED
@@ -0,0 +1,43 @@
+@filter
+Feature: Filter input table
+
+  bio-table should read input line by line as an iterator, and emit
+  filtered/transformed output, filtering for number values etc.
+
+  Scenario: Filter a table by value
+    Given I load a CSV table containing
+    """
+    bid,cid,length,num
+    1,a,4658,4
+    1,b,12060,6
+    2,c,5858,7
+    2,d,5626,4
+    3,e,18451,8
+    """
+    When I numerically filter the table for
+      | num_filter        | result            | description               |
+      | values[1] > 6000  | [12060,18451]     | basic filter              |
+      | value[1] > 6000   | [12060,18451]     | value is alias for values |
+      | num==4            | [4658,5626]       | column names as variables |
+      | num==4 or num==6  | [4658,12060,5626] | column names as variables |
+      | num==6            | [12060]           | column names as variables |
+      | length<5000       | [4658]            | column names as variables |
+    Then I should have result
+
+  Scenario: Filter a table by string
+    Given I load a CSV table containing
+    """
+    bid,cid,length,num
+    1,a,4658,4
+    1,b,12060,6
+    2,c,5858,7
+    2,d,5626,4
+    3,e,18451,8
+    """
+    When I filter the table for
+      | filter           | result        | description      |
+      | field[1] =~ /4/  | [4658,18451]  | regex filter     |
+      | fields[1] =~ /4/ | [4658,18451]  | alias fields     |
+      | length =~ /4/    | [4658,18451]  | use column names |
+    Then I should have filter result
+
data/features/step_definitions/filters.rb
ADDED
@@ -0,0 +1,46 @@
+Given /^I load a CSV table containing$/ do |string|
+  @lines = string.split(/\n/)
+end
+
+When /^I numerically filter the table for$/ do |table|
+  # table is a Cucumber::Ast::Table
+  @table = table
+end
+
+Then /^I should have result$/ do
+  @table.hashes.each do |h|
+    p h
+    result = eval(h['result'])
+    options = { :in_format => :split, :split_on => ',' }
+    options[:num_filter] = h['num_filter']
+
+    p options
+    p result
+    t = BioTable::Table.new
+    rownames,lines = t.read_lines(@lines, options)
+    p lines
+    lines.map {|r| r[1].to_i }.should == result
+  end
+end
+
+When /^I filter the table for$/ do |table|
+  # table is a Cucumber::Ast::Table
+  @table1 = table
+end
+
+Then /^I should have filter result$/ do
+  @table1.hashes.each do |h|
+    p h
+    result = eval(h['result'])
+    options = { :in_format => :split, :split_on => ',' }
+    options[:filter] = h['filter']
+
+    p options
+    p result
+    t = BioTable::Table.new
+    rownames,lines = t.read_lines(@lines, options)
+    p lines
+    lines.map {|r| r[1].to_i }.should == result
+  end
+end
+
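The step definitions above double as a usage example for the Ruby API: Table#read_lines takes an array of raw input lines plus the same options hash the CLI builds, and returns the row names and the filtered rows. A condensed version of what the first filters.feature scenario exercises (same calls as the step definitions; the expected output is the "basic filter" row of the Gherkin table):

```ruby
require 'bio-table'

lines = ['bid,cid,length,num',
         '1,a,4658,4',
         '1,b,12060,6',
         '2,c,5858,7',
         '2,d,5626,4',
         '3,e,18451,8']

options = { :in_format => :split, :split_on => ',',
            :num_filter => 'values[1] > 6000' }

t = BioTable::Table.new
rownames, rows = t.read_lines(lines, options)
p rows.map { |r| r[1].to_i }   # => [12060, 18451], per the filters.feature table
```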