bio-table 0.0.4 → 0.0.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/README.md +85 -22
- data/VERSION +1 -1
- data/bin/bio-table +45 -18
- data/features/cli.feature +11 -1
- data/lib/bio-table.rb +1 -0
- data/lib/bio-table/filter.rb +2 -1
- data/lib/bio-table/formatter.rb +19 -1
- data/lib/bio-table/rdf.rb +106 -0
- data/lib/bio-table/table_apply.rb +10 -2
- data/lib/bio-table/tableload.rb +8 -4
- data/lib/bio-table/tablerow.rb +2 -1
- data/lib/bio-table/tablewriter.rb +1 -1
- data/test/data/regression/table1-STDIN.ref +1138 -0
- data/test/data/regression/table1-columns-indexed.ref +0 -2
- data/test/data/regression/table1-columns-regex.ref +0 -2
- data/test/data/regression/table1-columns.ref +0 -2
- data/test/data/regression/table1-rdf1.ref +415 -0
- metadata +24 -21
data/README.md
CHANGED
@@ -33,13 +33,13 @@ Features:
 * Merge tables side by side on column value/rowname
 * Split/reduce tables by column
 * Read from STDIN, write to STDOUT
+* Convert table to RDF
 * Convert table to JSON (nyi)
-* Convert table to RDF (nyi)
 * etc. etc.
 
 and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
-takes 0.
-my 3.2 GHz desktop.
+takes 0.87 second. Adding a filter makes it parse at 0.95 second on
+my 3.2 GHz desktop (with preloaded disk cache).
 
 Note: this software is under active development, though what is
 documented here should just work.
@@ -57,40 +57,40 @@ documented here should just work.
 Tables can be transformed through the command line. To transform a
 comma separated file to a tab delimited one
 
-```
+```sh
 bio-table test/data/input/table1.csv --in-format csv --format tab > test1.tab
 ```
 
 Tab is actually the general default. Still, if the file name ends in
 csv, it will assume CSV. To convert the table back
 
-```
+```sh
 bio-table test1.tab --format csv > table1.csv
 ```
 
 To filter out rows that contain certain values
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
 ```
 
 The filter ignores the header row, and the row names. If you need
 either, use the switches --with-header and --with-rownames. With math, list all rows
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
 ```
 
 or, list all rows that have at least one field with a value >= 1000.0
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values.max >= 1000.0" > test1a.tab
 ```
 
 Produce all rows that have at least 3 values above 3.0 and one value
 above 10.0:
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
 ```
 
@@ -100,7 +100,7 @@ The --num-filter will convert fields lazily to numerical values (only
 valid numbers are converted). If there are NA (nil) values in the table, you
 may wish to remove them, like this
 
-```
+```sh
 bio-table test/data/input/table1.csv --num-filter "values[0..12].compact.max >= 1000.0" > test1a.tab
 ```
 
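Note that the --num-filter expressions above are ordinary Ruby, evaluated once per data row against a `values` array. A minimal sketch of the idea (a hypothetical helper for illustration, not the gem's implementation; see the filter.rb hunk further down for the real number check):

```ruby
# Hypothetical sketch: apply a --num-filter expression to one row.
# Fields arrive as strings; valid numbers become floats, the rest nil (NA).
def num_filter?(expression, fields)
  values = fields.map { |f| Float(f) rescue nil }
  eval(expression) # e.g. "values[0..12].compact.max >= 1000.0"
end

num_filter?("values.compact.max >= 1000.0", ["12.5", "NA", "1200.0"]) # => true
```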
@@ -109,27 +109,27 @@ which takes the first 13 fields and compact removes the nil values.
 Also string comparisons and regular expressions can be used. E.g.
 filter on rownames and a row field both containing 'BGT'
 
-```
+```sh
 # not yet implemented
 bio-table test/data/input/table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/" > test1a.tab
 ```
 
 To reorder/reduce table columns by name
 
-```
+```sh
 bio-table test/data/input/table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19 > test1a.tab
 ```
 
 or use their index numbers (the first column is zero)
 
-```
+```sh
 bio-table test/data/input/table1.csv --columns 0,1,8,2,4,6 > test1a.tab
 ```
 
 
 To filter for columns using a regular expression
 
-```
+```sh
 bio-table table1.csv --column-filter 'colname !~ /infected/i'
 ```
 
@@ -139,7 +139,7 @@ case.
 Finally we can rewrite the content of a table using rowname and fields
 again
 
-```
+```sh
 bio-table table1.csv --rewrite 'rowname.upcase!; field[1]=nil if field[2].to_f<0.25'
 ```
 
@@ -150,7 +150,7 @@ empty if the third field is below 0.25.
 
 To sort a table on column 4 and 2
 
-```
+```sh
 # not yet implemented
 bio-table test/data/input/table1.csv --sort 4,2 > test1a.tab
 ```
@@ -161,20 +161,26 @@ Note: not all is implemented (just yet). Please check bio-table --help first.
 
 You can combine/concat two or more tables by passing in multiple file names
 
+```sh
 bio-table test/data/input/table1.csv test/data/input/table2.csv
+```
 
 this will append table2 to table1, assuming they have the same headers
 (you can use the --columns switch!)
 
 To combine tables side by side use the --merge switch:
 
+```sh
 bio-table --merge table1.csv table2.csv
+```
 
 all rownames will be matched (i.e. the input tables do not need
 to be sorted). For non-matching rownames the fields will be filled
 with NA's, unless you add a filter, e.g.
 
+```sh
 bio-table --merge table1.csv table2.csv --num-filter "values.compact.size == values.size"
+```
 
 ### Splitting a table
 
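To make the --merge semantics just described concrete, here is a hedged sketch (illustrative Ruby only, not bio-table's code) of joining two tables on rowname with NA (nil) fill:

```ruby
# Illustrative only: side-by-side merge on rowname; rownames missing from
# one table get nil (printed as NA) fields, as the README text describes.
def merge_tables(t1, t2)
  w1 = t1.values.first.size
  w2 = t2.values.first.size
  (t1.keys | t2.keys).map do |name|
    [name] + (t1[name] || [nil] * w1) + (t2[name] || [nil] * w2)
  end
end

t1 = { "r1" => ["1"], "r2" => ["2"] }
t2 = { "r1" => ["9"] }
merge_tables(t1, t2) # => [["r1", "1", "9"], ["r2", "2", nil]]
```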
@@ -188,24 +194,32 @@ overlap, based on shared columns. The bio-table diff command shows the
 difference between two tables using the row names (i.e. those rows
 with rownames that appear in table2, but not in table1)
 
+```sh
 bio-table --diff 0 table1.csv table2.csv
+```
 
 bio-table --diff is different from the standard Unix diff tool. The
 latter shows insertions and deletions. bio-table --diff shows what is
 in one file, and not in the other (insertions). To see deletions,
 reverse the file order, i.e. switch the file names
 
+```sh
 bio-table --diff 0 table2.csv table1.csv
+```
 
 To diff on something else
 
+```sh
 bio-table --diff 0,3 table2.csv table1.csv
+```
 
 creates a key using columns 0 and 3 (0 is the rownames column).
 
 Similarly
 
+```sh
 bio-table --overlap 2 table1.csv table2.csv
+```
 
 finds the overlapping rows, based on the content of column 2.
 
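The key-based --diff/--overlap semantics can be pictured with a short sketch (illustrative only, not the gem's code; the key is built from column indices exactly as the README text above describes):

```ruby
# Illustrative only: keep rows of t2 whose key, built from the given
# column indices, does not occur in t1 (the idea behind --diff 0,3).
def table_diff(cols, t1, t2)
  key  = ->(row) { row.values_at(*cols).join("\t") }
  seen = t1.map(&key)
  t2.reject { |row| seen.include?(key.call(row)) }
end

t1 = [["r1", "a"], ["r2", "b"]]
t2 = [["r2", "b"], ["r3", "c"]]
table_diff([0], t1, t2) # => [["r3", "c"]]
```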
@@ -219,14 +233,55 @@ more soon
 bio-table can read data from STDIN, by simply assuming that the data
 piped in is the first input file
 
-```
+```sh
 cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
 ```
 
 will filter both files test1.tab and table1.csv and output to
 test1a.tab.
 
-
+### Output table to RDF
+
+bio-table can write a table into turtle RDF triples (part of the semantic
+web!), so you can put the data directly into a triple-store.
+
+```sh
+bio-table --format rdf table1.csv
+```
+
+The table header is stored with predicate :colname using the header
+values both as subject and label, with the :index:
+
+```rdf
+:header3 rdf:label "Header3" ; a :colname ; :index 4 .
+```
+
+Rows are stored with rowname as subject and label, followed by the
+columns referring to the header triples, and the values. E.g.
+
+```rdf
+:row13475701 rdf:label "row13475701" ; a :rowname ; :Id "row13475701" ; :header1 "1" ; :header2 "0" ; :header3 "3" .
+```
+
+To unify identifier names you may want to transform ids:
+
+```sh
+bio-table --format rdf --transform-ids "downcase" table1.csv
+```
+
+Another interesting option is --blank-nodes. This causes rows to be
+written as blank nodes, and allows for duplicate row names. E.g.
+
+```rdf
+:row13475701 [ rdf:label "row13475701" ; a :rowname ; :Id "row13475701" ; :header1 "1" ; :header2 "0" ; :header3 "3" ] .
+```
+The bio-rdf gem actually uses this bio-table biogem to parse data into a
+triple store and query the data through SPARQL. For examples see the
+features, e.g. the
+[genotype to RDF feature](https://github.com/pjotrp/bioruby-rdf/blob/master/features/genotype-table-to-rdf.feature).
+
+
+## bio-table API (for Ruby programmers)
 
 ```ruby
 require 'bio-table'
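As a hedged illustration of the --blank-nodes idea (a sketch only; the real serializer lives in the new lib/bio-table/rdf.rb added by this release):

```ruby
# Illustrative only: serialize one row as a turtle blank node, so that
# duplicate row names cannot collide as subjects.
def row_as_blank_node(rowname, header, fields)
  props = header.zip(fields).map { |h, v| %(:#{h} "#{v}") }.join(" ; ")
  %([ rdf:label "#{rowname}" ; a :rowname ; #{props} ] .)
end

puts row_as_blank_node("row1", %w[header1 header2], %w[1 0])
# => [ rdf:label "row1" ; a :rowname ; :header1 "1" ; :header2 "0" ] .
```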
@@ -315,17 +370,25 @@ file twice, but being able to handle much larger data.
 
 In above examples we loaded the whole table in memory. It is also
 possible to execute functions without using RAM by using the emit
-function. This is what the bio-table CLI does
+function. This is what the bio-table CLI does to convert a CSV table
+to tab delimited:
 
 ```ruby
 ARGV.each do | fn |
-
-
+  f = File.open(fn)
+  writer = BioTable::TableWriter::Writer.new(format: :tab)
+  BioTable::TableLoader.emit(f, in_format: :csv).each do |row,type|
+    writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
 end
-
 ```
 
+Essentially you can pass in any object that has the *each* method to
+iterate through rows as String (f's each method reads in a line at a
+time). The emit function yields the parsed row object as a simple
+array of fields (each field a String). The type is used to distinguish
+the header row.
+
 ### Loading a numerical matrix
 
 Coming soon
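The new README text above notes that emit accepts any each-able line source; a small hedged example of that point (the sample lines are made up, and the exact namespacing may differ in your version):

```ruby
require 'bio-table'

# Any object whose each method yields one line at a time can feed emit;
# here an in-memory array stands in for a file (sample data made up).
lines  = ["#Id,header1,header2", "row1,1,0", "row2,4,2"]
writer = BioTable::TableWriter::Writer.new(format: :tab)
BioTable::TableLoader.emit(lines, in_format: :csv).each do |row, type|
  # Depending on namespacing you may need BioTable::TableRow here.
  writer.write(BioTable::TableRow.new(row[0], row[1..-1]), type)
end
```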
data/VERSION
CHANGED
@@ -1 +1 @@
-0.0.4
+0.0.5
data/bin/bio-table
CHANGED
@@ -33,19 +33,12 @@ log = Bio::Log::LoggerPlus.new 'bio-table'
 Bio::Log::CLI.logger('stderr')
 Bio::Log::CLI.trace('info')
 
-options = {show_help: false, write_header: true}
+options = {show_help: false, write_header: true, skip: 0}
 options[:show_help] = true if ARGV.size == 0 and not INPUT_ON_STDIN
 opts = OptionParser.new do |o|
   o.banner = "Usage: #{File.basename($0)} [options] filename\n\n"
 
-  o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
-    options[:in_format] = par.to_sym
-  end
-
-  o.on('--format [tab,csv]', [:tab, :csv], 'Output format (default tab)') do |par|
-    options[:format] = par.to_sym
-  end
-
+
   o.on('--num-filter expression', 'Numeric filtering function') do |par|
     options[:num_filter] = par
   end
@@ -82,21 +75,50 @@ opts = OptionParser.new do |o|
     options[:overlap] = l
   end
 
-
+  o.on('--merge','Merge tables by rowname') do
+    options[:merge] = true
+  end
+
+  o.separator "\n\tOverrides:\n\n"
+
   # o.on('--with-header','Include the header element in filtering etc.') do
   #   options[:with_header] = true
   # end
-
+
+  o.on('--skip lines',Integer,'Skip the first lines before parsing') do |skip|
+    options[:skip] = skip
+  end
+
   o.on('--with-rownames','Include the rownames in filtering etc.') do
     options[:with_rownames] = true
   end
+
+  o.separator "\n\tTransform:\n\n"
+
+  o.on('--transform-ids [downcase,upcase]',[:downcase,:upcase],'Transform column and row identifiers') do |par|
+    options[:transform_ids] = par.to_sym
+  end
+
+  o.separator "\n\tFormat and options:\n\n"
 
-  o.
+  o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
+    options[:in_format] = par.to_sym
+  end
 
+  o.on('--format [tab,csv,rdf]', [:tab, :csv, :rdf], 'Output format (default tab)') do |par|
+    options[:format] = par.to_sym
+  end
+
+  o.on('--blank-nodes','Output (RDF) blank nodes - allowing for duplicate row names') do
+    options[:blank_nodes] = true
+  end
+
+  o.separator "\n\tVerbosity:\n\n"
+
   o.on("--logger filename",String,"Log to file (default stderr)") do | name |
     Bio::Log::CLI.logger(name)
   end
-
+
   o.on("--trace options",String,"Set log level (default INFO, see bio-logger)") do | s |
     Bio::Log::CLI.trace(s)
   end
@@ -177,12 +199,17 @@ end
 # http://eric.lubow.org/2010/ruby/multiple-input-locations-from-bash-into-ruby/
 #
 
-writer = BioTable::TableWriter::Writer.new(options[:format])
+writer =
+  if options[:format] == :rdf
+    BioTable::RDF::Writer.new(options[:blank_nodes])
+  else
+    BioTable::TableWriter::Writer.new(options[:format])
+  end
 
 if INPUT_ON_STDIN
   opts = options.dup # so we can modify options
-  BioTable::TableLoader.emit(STDIN, opts).each do |row|
-    writer.write(TableRow.new(row[0],row[1..-1]))
+  BioTable::TableLoader.emit(STDIN, opts).each do |row, type|
+    writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
   options[:write_header] = false # don't write the header for chained files
 end
@@ -194,8 +221,8 @@ ARGV.each do | fn |
     logger.debug "Autodetected CSV file"
     opts[:in_format] = :csv
   end
-  BioTable::TableLoader.emit(f, opts).each do |row|
-    writer.write(TableRow.new(row[0],row[1..-1]))
+  BioTable::TableLoader.emit(f, opts).each do |row,type|
+    writer.write(TableRow.new(row[0],row[1..-1]),type)
   end
   options[:write_header] = false # don't write the header for chained files
 end
data/features/cli.feature
CHANGED
@@ -1,7 +1,7 @@
 @cli
 Feature: Command-line interface (CLI)
 
-  bio-table has a powerful
+  bio-table has a powerful command line interface. Here we regression test features.
 
   Scenario: Test the numerical filter by column values
     Given I have input file(s) named "test/data/input/table1.csv"
@@ -28,4 +28,14 @@ Feature: Command-line interface (CLI)
     When I execute "./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
     Then I expect the named output to match "table1-rewrite-rownames"
 
+  Scenario: Write RDF format
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "./bin/bio-table --format rdf --transform-ids downcase"
+    Then I expect the named output to match "table1-rdf1"
+
+  Scenario: Read from STDIN
+    Given I have input file(s) named "test/data/input/table1.csv"
+    When I execute "cat test/data/input/table1.csv|./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
+    Then I expect the named output to match "table1-STDIN"
+
 
data/lib/bio-table.rb
CHANGED
data/lib/bio-table/filter.rb
CHANGED
@@ -63,7 +63,8 @@ module BioTable
   end
 
   def Filter::valid_number?(s)
-    s.to_s.match(/\A[+-]?\d+?(\.\d+)?\Z/) == nil ? false : true
+    # s.to_s.match(/\A[+-]?\d+?(\.\d+)?\Z/) == nil ? false : true
+    begin Float(s) ; true end rescue false
  end
 
  def Filter::numeric code, fields
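The switch from the regex to Kernel#Float broadens what counts as a valid number; a quick hedged check (assumed plain-Ruby behavior, not gem test output):

```ruby
# Assumed plain-Ruby behavior: Float() accepts scientific notation,
# which the old regex rejected.
def valid_number?(s)
  begin Float(s) ; true end rescue false
end

valid_number?("0.05")   # => true  (both versions)
valid_number?("1.5e-3") # => true  (the old regex returned false)
valid_number?("NA")     # => false
```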
data/lib/bio-table/formatter.rb
CHANGED
@@ -1,5 +1,24 @@
 module BioTable
 
+  module Formatter
+    def Formatter::transform_header_ids modify, list
+      l = list.dup
+      case modify
+      when :downcase then l.map { |h| h.downcase }
+      when :upcase then l.map { |h| h.upcase }
+      else l
+      end
+    end
+    def Formatter::transform_row_ids modify, list
+      l = list.dup
+      case modify
+      when :downcase then l[0].downcase!
+      when :upcase then l[0].upcase!
+      end
+      l
+    end
+  end
+
   class TabFormatter
     def write list
       print list.map{|field| (field==nil ? "NA" : field)}.join("\t"),"\n"
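A hedged usage sketch for the new Formatter helpers added above (the sample values are made up):

```ruby
# transform_header_ids maps every header name; transform_row_ids only
# touches the identifier in column 0. Note that list.dup is a shallow
# copy, so the bang methods also modify the strings in the caller's list.
BioTable::Formatter.transform_header_ids(:downcase, ["ID", "Header1"])
# => ["id", "header1"]
BioTable::Formatter.transform_row_ids(:upcase, ["row1", "0.5"])
# => ["ROW1", "0.5"]
```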
@@ -8,7 +27,6 @@ module BioTable
   end
 
   class CsvFormatter
-
     def write list
       csv_string = CSV.generate do |csv|
         csv << list
|