bio-table 0.0.4 → 0.0.5
- data/README.md +85 -22
- data/VERSION +1 -1
- data/bin/bio-table +45 -18
- data/features/cli.feature +11 -1
- data/lib/bio-table.rb +1 -0
- data/lib/bio-table/filter.rb +2 -1
- data/lib/bio-table/formatter.rb +19 -1
- data/lib/bio-table/rdf.rb +106 -0
- data/lib/bio-table/table_apply.rb +10 -2
- data/lib/bio-table/tableload.rb +8 -4
- data/lib/bio-table/tablerow.rb +2 -1
- data/lib/bio-table/tablewriter.rb +1 -1
- data/test/data/regression/table1-STDIN.ref +1138 -0
- data/test/data/regression/table1-columns-indexed.ref +0 -2
- data/test/data/regression/table1-columns-regex.ref +0 -2
- data/test/data/regression/table1-columns.ref +0 -2
- data/test/data/regression/table1-rdf1.ref +415 -0
- metadata +24 -21
data/README.md
CHANGED
@@ -33,13 +33,13 @@ Features:
 * Merge tables side by side on column value/rowname
 * Split/reduce tables by column
 * Read from STDIN, write to STDOUT
+* Convert table to RDF
 * Convert table to JSON (nyi)
-* Convert table to RDF (nyi)
 * etc. etc.
 
 and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
-takes 0.
-my 3.2 GHz desktop.
+takes 0.87 second. Adding a filter makes it parse at 0.95 second on
+my 3.2 GHz desktop (with preloaded disk cache).
 
 Note: this software is under active development, though what is
 documented here should just work.
@@ -57,40 +57,40 @@ documented here should just work.
 Tables can be transformed through the command line. To transform a
 comma separated file to a tab delimited one
 
-```
+```sh
 bio-table test/data/input/table1.csv --in-format csv --format tab > test1.tab
 ```
 
 Tab is actually the general default. Still, if the file name ends in
 csv, it will assume CSV. To convert the table back
 
-```
+```sh
 bio-table test1.tab --format csv > table1.csv
 ```
 
 To filter out rows that contain certain values
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
 ```
 
 The filter ignores the header row, and the row names. If you need
 either, use the switches --with-header and --with-rownames. With math, list all rows
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
 ```
 
 or, list all rows that have a least a field with values >= 1000.0
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values.max >= 1000.0" > test1a.tab
 ```
 
 Produce all rows that have at least 3 values above 3.0 and 1 one value
 above 10.0:
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
 ```
 
@@ -100,7 +100,7 @@ The --num-filter will convert fields lazily to numerical values (only
 valid numbers are converted). If there are NA (nil) values in the table, you
 may wish to remove them, like this
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values[0..12].compact.max >= 1000.0" > test1a.tab
 ```
 
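The --num-filter semantics documented in these hunks (fields converted lazily to numbers, then an expression evaluated over `values`) can be sketched in a few lines of plain Ruby. This is only an illustration of the described behavior, not the gem's filter code; the `num_filter?` helper and its use of `eval` are assumptions.

```ruby
# Illustrative sketch of --num-filter semantics (not bio-table's code):
# fields that parse as numbers become Floats, others (e.g. "NA") become
# nil, and the filter expression is evaluated with `values` in scope.
def num_filter?(fields, expr)
  values = fields.map { |f| Float(f) rescue nil }
  eval(expr)  # the expression sees the local `values`, as in the README
end

rows = [%w(0.01 2 3), %w(0.9 NA 1200)]
kept = rows.select { |r| num_filter?(r, "values.compact.max >= 1000.0") }
p kept  # => [["0.9", "NA", "1200"]]
```

Note how `compact` drops the nil produced by the unparseable "NA" field, matching the README's `values[0..12].compact.max` example.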
@@ -109,27 +109,27 @@ which takes the first 13 fields and compact removes the nil values.
 Also string comparisons and regular expressions can be used. E.g.
 filter on rownames and a row field both containing 'BGT'
 
-```
+```sh
 # not yet implemented
 bio-table test/data/input/table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/" > test1a.tab
 ```
 
 To reorder/reduce table columns by name
 
-```
+```sh
 bio-table test/data/input/table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19 > test1a.tab
 ```
 
 or use their index numbers (the first column is zero)
 
-```
+```sh
 bio-table test/data/input/table1.csv --columns 0,1,8,2,4,6 > test1a.tab
 ```
 
 
 To filter for columns using a regular expression
 
-```
+```sh
 bio-table table1.csv --column-filter 'colname !~ /infected/i'
 ```
 
@@ -139,7 +139,7 @@ case.
 Finally we can rewrite the content of a table using rowname and fields
 again
 
-```
+```sh
 bio-table table1.csv --rewrite 'rowname.upcase!; field[1]=nil if field[2].to_f<0.25'
 ```
 
@@ -150,7 +150,7 @@ empty if the third field is below 0.25.
 
 To sort a table on column 4 and 2
 
-```
+```sh
 # not yet implemented
 bio-table test/data/input/table1.csv --sort 4,2 > test1a.tab
 ```
@@ -161,20 +161,26 @@ Note: not all is implemented (just yet). Please check bio-table --help first.
 
 You can combine/concat two or more tables by passing in multiple file names
 
+```sh
 bio-table test/data/input/table1.csv test/data/input/table2.csv
+```
 
 this will append table2 to table1, assuming they have the same headers
 (you can use the --columns switch!)
 
 To combine tables side by side use the --merge switch:
 
+```sh
 bio-table --merge table1.csv table2.csv
+```
 
 all rownames will be matched (i.e. the input table order do not need
 to be sorted). For non-matching rownames the fields will be filled
 with NA's, unless you add a filter, e.g.
 
+```sh
 bio-table --merge table1.csv table2.csv --num-filter "values.compact.size == values.size"
+```
 
 ### Splitting a table
 
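The --merge behavior described in this hunk (match rows by rowname, fill NA for non-matching rows) can be sketched as a toy join. This is a stand-in under the simplifying assumption of hash-shaped tables, not the gem's merge code.

```ruby
# Toy sketch of --merge by rowname (not bio-table's implementation):
# each table is a hash of rowname => fields; a rowname missing from one
# table gets nil (rendered as NA) in that table's columns.
def merge_by_rowname(t1, t2)
  names = (t1.keys + t2.keys).uniq
  names.map { |n| [n] + (t1[n] || [nil]) + (t2[n] || [nil]) }
end

t1 = { "gene1" => ["1.0"], "gene2" => ["2.0"] }
t2 = { "gene2" => ["0.3"], "gene3" => ["0.9"] }
p merge_by_rowname(t1, t2)
# => [["gene1", "1.0", nil], ["gene2", "2.0", "0.3"], ["gene3", nil, "0.9"]]
```

Applying the README's `values.compact.size == values.size` filter to such output drops exactly the NA-padded rows, which is why the two switches combine the way the README shows.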
@@ -188,24 +194,32 @@ overlap, based on shared columns. The bio-table diff command shows the
 difference between two tables using the row names (i.e. those rows
 with rownames that appear in table2, but not in table1)
 
+```sh
 bio-table --diff 0 table1.csv table2.csv
+```
 
 bio-table --diff is different from the standard Unix diff tool. The
 latter shows insertions and deletions. bio-table --diff shows what is
 in one file, and not in the other (insertions). To see deletions,
 reverse the file order, i.e. switch the file names
 
+```sh
 bio-table --diff 0 table1.csv table2.csv
+```
 
 To diff on something else
 
+```sh
 bio-table --diff 0,3 table2.csv table1.csv
+```
 
 creates a key using columns 0 and 3 (0 is the rownames column).
 
 Similarly
 
+```sh
 bio-table --overlap 2 table1.csv table2.csv
+```
 
 finds the overlapping rows, based on the content of column 2.
 
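The --diff key semantics above (build a key from the given column indices, report rows of the second table whose key never occurs in the first) can be sketched as follows; `table_diff` is a local stand-in, not the gem's code.

```ruby
# Toy sketch of --diff key semantics (not bio-table's implementation):
# a key is the tuple of values at the given column indices; keep rows of
# table2 whose key does not occur in table1.
def table_diff(cols, table1, table2)
  key = lambda { |row| row.values_at(*cols) }
  keys1 = table1.map(&key)
  table2.reject { |row| keys1.include?(key.call(row)) }
end

t1 = [["r1", "x", "5"], ["r2", "y", "7"]]
t2 = [["r2", "y", "7"], ["r3", "z", "9"]]
p table_diff([0], t1, t2)  # => [["r3", "z", "9"]]
```

Swapping the two table arguments yields the "deletions" view, exactly as the hunk describes for reversing the file order on the command line.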
@@ -219,14 +233,55 @@ more soon
 bio-table can read data from STDIN, by simply assuming that the data
 piped in is the first input file
 
-```
+```sh
 cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
 ```
 
 will filter both files test1.tab and test1.csv and output to
 test1a.tab.
 
-
+### Output table to RDF
+
+bio-table can write a table into turtle RDF triples (part of the semantic
+web!), so you can put the data directly into a triple-store.
+
+```sh
+bio-table --format rdf table1.csv
+```
+
+The table header is stored with predicate :colname using the header
+values both as subject and label, with the :index:
+
+```rdf
+:header3 rdf:label "Header3" ; a :colname; :index 4 .
+```
+
+Rows are stored with rowname as subject and label, followed by the
+columns referring to the header triples, and the values. E.g.
+
+```rdf
+:row13475701 rdf:label "row13475701" ; a :rowname ; ; :Id "row13475701" ; :header1 "1" ; :header2 "0" ; :header3 "3" .
+```
+
+To unify identifier names you may want to transform ids:
+
+```sh
+bio-table --format rdf --transform-ids "downcase" table1.csv
+```
+
+Another interesting option is --blank-nodes. This causes rows to be
+written as blank nodes, and allows for duplicate row names. E.g.
+
+```rdf
+:row13475701 [ rdf:label "row13475701" ; a :rowname ; ; :Id "row13475701" ; :header1 "1" ; :header2 "0" ; :header3 "3" ] .
+```
+The bio-rdf gem actually uses this bio-table biogem to parse data into a
+triple store and query the data through SPARQL. For examples see the
+features, e.g. the
+[genotype to RDF feature](https://github.com/pjotrp/bioruby-rdf/blob/master/features/genotype-table-to-rdf.feature).
+
+
+## bio-table API (for Ruby programmers)
 
 ```ruby
 require 'bio-table'
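The header-triple shape added in this hunk can be reproduced with a tiny formatter. This is a hypothetical sketch read off the single `:header3 ... :index 4` example above, not the gem's rdf.rb; the caller supplies the column index because the example does not establish the index base.

```ruby
# Hypothetical sketch of the :colname header triple shown above
# (not bio-table's rdf.rb); `colname_triple` is a local stand-in and
# the index is passed in rather than computed.
def colname_triple(name, index)
  ":#{name.downcase} rdf:label \"#{name}\" ; a :colname ; :index #{index} ."
end

puts colname_triple("Header3", 4)
# => :header3 rdf:label "Header3" ; a :colname ; :index 4 .
```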
@@ -315,17 +370,25 @@ file twice, but being able to handle much larger data.
 
 In above examples we loaded the whole table in memory. It is also
 possible to execute functions without using RAM by using the emit
-function. This is what the bio-table CLI does
+function. This is what the bio-table CLI does to convert a CSV table
+to tab delimited:
 
 ```ruby
 ARGV.each do | fn |
-
-
+  f = File.open(fn)
+  writer = BioTable::TableWriter::Writer.new(format: :tab)
+  BioTable::TableLoader.emit(f, in_format: :csv).each do |row,type|
+    writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
 end
-
 ```
 
+Essentially you can pass in any object that has the *each* method to
+iterate through rows as String (f's each method reads in a line at a
+time). The emit function yields the parsed row object as a simple
+array of fields (each field a String). The type is used to distinguish
+the header row.
+
 ### Loading a numerical matrix
 
 Coming soon
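The streaming pattern documented in the last README hunk — any object with an `each` method, rows yielded together with a type that marks the header — can be imitated in a few lines. The mock `emit` below only shows the shape of that API; it is not the gem's `BioTable::TableLoader`.

```ruby
require 'stringio'

# Mock of the emit pattern described above (not BioTable::TableLoader):
# iterate any each-able line source and yield [fields, type], where the
# first yielded row is tagged :header and the rest :row.
def emit(io, sep = "\t")
  Enumerator.new do |y|
    type = :header
    io.each do |line|
      y << [line.chomp.split(sep), type]
      type = :row
    end
  end
end

csv = StringIO.new("id,a,b\nrow1,1,2\n")
emit(csv, ",").each do |fields, type|
  puts "#{type}\t#{fields.join(',')}"
end
```

Because the source only needs `each`, a `File`, a `StringIO`, or STDIN all work the same way, which is how the CLI chains files and piped input.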
data/VERSION
CHANGED
@@ -1 +1 @@
-0.0.4
+0.0.5
data/bin/bio-table
CHANGED
@@ -33,19 +33,12 @@ log = Bio::Log::LoggerPlus.new 'bio-table'
 Bio::Log::CLI.logger('stderr')
 Bio::Log::CLI.trace('info')
 
-options = {show_help: false, write_header: true}
+options = {show_help: false, write_header: true, skip: 0}
 options[:show_help] = true if ARGV.size == 0 and not INPUT_ON_STDIN
 opts = OptionParser.new do |o|
   o.banner = "Usage: #{File.basename($0)} [options] filename\n\n"
 
-
-    options[:in_format] = par.to_sym
-  end
-
-  o.on('--format [tab,csv]', [:tab, :csv], 'Output format (default tab)') do |par|
-    options[:format] = par.to_sym
-  end
-
+
   o.on('--num-filter expression', 'Numeric filtering function') do |par|
     options[:num_filter] = par
   end
@@ -82,21 +75,50 @@ opts = OptionParser.new do |o|
     options[:overlap] = l
   end
 
-
+  o.on('--merge','Merge tables by rowname') do
+    options[:merge] = true
+  end
+
+  o.separator "\n\tOverrides:\n\n"
+
 #  o.on('--with-header','Include the header element in filtering etc.') do
 #    options[:with_header] = true
 #  end
-
+
+  o.on('--skip lines',Integer,'Skip the first lines before parsing') do |skip|
+    options[:skip] = skip
+  end
+
   o.on('--with-rownames','Include the rownames in filtering etc.') do
     options[:with_rownames] = true
   end
+
+  o.separator "\n\tTransform:\n\n"
+
+  o.on('--transform-ids [downcase,upcase]',[:downcase,:upcase],'Transform column and row identifiers') do |par|
+    options[:transform_ids] = par.to_sym
+  end
+
+  o.separator "\n\tFormat and options:\n\n"
 
-o.
+  o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
+    options[:in_format] = par.to_sym
+  end
 
+  o.on('--format [tab,csv,rdf]', [:tab, :csv, :rdf], 'Output format (default tab)') do |par|
+    options[:format] = par.to_sym
+  end
+
+  o.on('--blank-nodes','Output (RDF) blank nodes - allowing for duplicate row names') do
+    options[:blank_nodes] = true
+  end
+
+  o.separator "\n\tVerbosity:\n\n"
+
   o.on("--logger filename",String,"Log to file (default stderr)") do | name |
     Bio::Log::CLI.logger(name)
   end
-
+
   o.on("--trace options",String,"Set log level (default INFO, see bio-logger)") do | s |
     Bio::Log::CLI.trace(s)
   end
@@ -177,12 +199,17 @@ end
 # http://eric.lubow.org/2010/ruby/multiple-input-locations-from-bash-into-ruby/
 #
 
-writer =
+writer =
+  if options[:format] == :rdf
+    BioTable::RDF::Writer.new(options[:blank_nodes])
+  else
+    BioTable::TableWriter::Writer.new(options[:format])
+  end
 
 if INPUT_ON_STDIN
   opts = options.dup # so we can modify options
-  BioTable::TableLoader.emit(STDIN, opts).each do |row|
-    writer.write(TableRow.new(row[0],row[1..-1]))
+  BioTable::TableLoader.emit(STDIN, opts).each do |row, type|
+    writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
   options[:write_header] = false # don't write the header for chained files
 end
@@ -194,8 +221,8 @@ ARGV.each do | fn |
     logger.debug "Autodetected CSV file"
     opts[:in_format] = :csv
   end
-  BioTable::TableLoader.emit(f, opts).each do |row|
-    writer.write(TableRow.new(row[0],row[1..-1]))
+  BioTable::TableLoader.emit(f, opts).each do |row,type|
+    writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
   options[:write_header] = false # don't write the header for chained files
 end
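The change to bin/bio-table above selects a writer object by output format and then drives every row through the same `write(row, type)` interface. A minimal mock of that dispatch, with stand-in writer classes rather than the gem's `BioTable::RDF::Writer` and `BioTable::TableWriter::Writer`:

```ruby
# Mock of the format-based writer dispatch in bin/bio-table; TabWriter
# and RdfWriter here are illustrative stand-ins, not the gem's classes.
class TabWriter
  def write(row, type)
    puts row.join("\t")
  end
end

class RdfWriter
  def initialize(blank_nodes)
    @blank_nodes = blank_nodes
  end

  def write(row, type)
    return if type == :header  # headers become :colname triples elsewhere
    puts ":#{row.first} rdf:label \"#{row.first}\" ."
  end
end

options = { format: :rdf, blank_nodes: false }
writer =
  if options[:format] == :rdf
    RdfWriter.new(options[:blank_nodes])
  else
    TabWriter.new
  end
writer.write(%w(id h1), :header)
writer.write(%w(row1 1), :row)
```

Keeping one `write(row, type)` signature is what lets the main loop stay identical for tab, csv, and rdf output.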
data/features/cli.feature
CHANGED
@@ -1,7 +1,7 @@
 @cli
 Feature: Command-line interface (CLI)
 
-  bio-table has a powerful
+  bio-table has a powerful command line interface. Here we regression test features.
 
   Scenario: Test the numerical filter by column values
     Given I have input file(s) named "test/data/input/table1.csv"
@@ -28,4 +28,14 @@ Feature: Command-line interface (CLI)
     When I execute "./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
     Then I expect the named output to match "table1-rewrite-rownames"
 
+  Scenario: Write RDF format
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --format rdf --transform-ids downcase"
+    Then I expect the named output to match "table1-rdf1"
+
+  Scenario: Read from STDIN
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "cat test/data/input/table1.csv|./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
+    Then I expect the named output to match "table1-STDIN"
+
 
data/lib/bio-table.rb
CHANGED
data/lib/bio-table/filter.rb
CHANGED
@@ -63,7 +63,8 @@ module BioTable
   end
 
   def Filter::valid_number?(s)
-    s.to_s.match(/\A[+-]?\d+?(\.\d+)?\Z/) == nil ? false : true
+    # s.to_s.match(/\A[+-]?\d+?(\.\d+)?\Z/) == nil ? false : true
+    begin Float(s) ; true end rescue false
   end
 
   def Filter::numeric code, fields
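The valid_number? change swaps a hand-written regex for Ruby's strict `Float()` conversion. Side by side (both bodies copied from the diff; the method names below are local stand-ins), the `Float()` form also accepts exponent notation that the old regex rejected:

```ruby
# Old and new number checks from the diff, as standalone methods.
def regex_valid_number?(s)
  s.to_s.match(/\A[+-]?\d+?(\.\d+)?\Z/) == nil ? false : true
end

def float_valid_number?(s)
  begin Float(s) ; true end rescue false
end

p regex_valid_number?("1.5e3")  # => false
p float_valid_number?("1.5e3")  # => true
p float_valid_number?("NA")     # => false
```

`Float()` raises on anything that is not a complete number (including nil), and the `rescue` modifier turns that into false, so NA fields still fail the check.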
data/lib/bio-table/formatter.rb
CHANGED
@@ -1,5 +1,24 @@
 module BioTable
 
+  module Formatter
+    def Formatter::transform_header_ids modify, list
+      l = list.dup
+      case modify
+      when :downcase then l.map { |h| h.downcase }
+      when :upcase then l.map { |h| h.upcase }
+      else l
+      end
+    end
+    def Formatter::transform_row_ids modify, list
+      l = list.dup
+      case modify
+      when :downcase then l[0].downcase!
+      when :upcase then l[0].upcase!
+      end
+      l
+    end
+  end
+
   class TabFormatter
     def write list
       print list.map{|field| (field==nil ? "NA" : field)}.join("\t"),"\n"
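The new Formatter helpers behind --transform-ids can be exercised in isolation; the module below is copied from the diff so the calls run standalone.

```ruby
# Copied from the diff so the helpers can run standalone.
module Formatter
  def Formatter::transform_header_ids modify, list
    l = list.dup
    case modify
    when :downcase then l.map { |h| h.downcase }
    when :upcase then l.map { |h| h.upcase }
    else l
    end
  end
  def Formatter::transform_row_ids modify, list
    # note: dup copies the array, not the strings, so downcase!/upcase!
    # mutate the caller's rowname string in place
    l = list.dup
    case modify
    when :downcase then l[0].downcase!
    when :upcase then l[0].upcase!
    end
    l
  end
end

p Formatter.transform_header_ids(:downcase, ["Id", "Header1"])
# => ["id", "header1"]
p Formatter.transform_row_ids(:upcase, ["row1", "0.5"])
# => ["ROW1", "0.5"]
```

The header variant transforms every column name, while the row variant only touches the first field (the rowname), matching the --transform-ids behavior documented in the README hunk.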
@@ -8,7 +27,6 @@ module BioTable
   end
 
   class CsvFormatter
-
     def write list
       csv_string = CSV.generate do |csv|
         csv << list