bio-table 0.0.5 → 0.0.6
- data/.travis.yml +1 -1
- data/Gemfile +1 -1
- data/README.md +49 -6
- data/VERSION +1 -1
- data/bin/bio-table +24 -7
- data/features/cli.feature +40 -0
- data/lib/bio-table/filter.rb +16 -1
- data/lib/bio-table/formatter.rb +30 -2
- data/lib/bio-table/parser.rb +13 -1
- data/lib/bio-table/rdf.rb +1 -1
- data/lib/bio-table/table.rb +7 -3
- data/lib/bio-table/table_apply.rb +9 -2
- data/lib/bio-table/tableload.rb +5 -0
- data/lib/bio-table/tablewriter.rb +5 -2
- data/test/data/input/table_no_headers.txt +5 -0
- data/test/data/input/table_split_on.txt +6 -0
- data/test/data/regression/table1-append.ref +753 -0
- data/test/data/regression/table1-diff.ref +7 -0
- data/test/data/regression/table1-html.ref +380 -0
- data/test/data/regression/table1-latex.ref +380 -0
- data/test/data/regression/table1-merge.ref +380 -0
- data/test/data/regression/table_filter_headers.ref +5 -0
- data/test/data/regression/table_split_on_regex.ref +11 -0
- data/test/data/regression/table_split_on_string.ref +11 -0
- metadata +33 -23
data/.travis.yml
CHANGED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -25,16 +25,17 @@ you don't need to know Ruby to use the command line interface (CLI).
|
|
25
25
|
|
26
26
|
Features:
|
27
27
|
|
28
|
-
* Support for
|
28
|
+
* Support for reading and writing TAB and CSV files, as well as regex splitters
|
29
29
|
* Filter on data
|
30
30
|
* Transform table and data by column or row
|
31
31
|
* Recalculate data
|
32
32
|
* Diff between tables, selecting on specific column values
|
33
33
|
* Merge tables side by side on column value/rowname
|
34
34
|
* Split/reduce tables by column
|
35
|
+
* Write formatted tables, e.g. HTML, LaTeX
|
35
36
|
* Read from STDIN, write to STDOUT
|
36
37
|
* Convert table to RDF
|
37
|
-
* Convert table to JSON (nyi)
|
38
|
+
* Convert table to JSON/YAML/XML (nyi)
|
38
39
|
* etc. etc.
|
39
40
|
|
40
41
|
and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
|
@@ -68,14 +69,21 @@ csv, it will assume CSV. To convert the table back
|
|
68
69
|
bio-table test1.tab --format csv > table1.csv
|
69
70
|
```
|
70
71
|
|
72
|
+
It is also possible to use a string or regex splitter, e.g.
|
73
|
+
|
74
|
+
```sh
|
75
|
+
bio-table --in-format split --split-on ',' test/data/input/table_split_on.txt
|
76
|
+
bio-table --in-format regex --split-on '\s*,\s*' test/data/input/table_split_on.txt
|
77
|
+
```
|
78
|
+
|
71
79
|
To filter out rows that contain certain values
|
72
80
|
|
73
81
|
```sh
|
74
82
|
bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
|
75
83
|
```
|
76
84
|
|
77
|
-
The filter ignores the header row, and the row names. If you need
|
78
|
-
either, use the switches --with-
|
85
|
+
The filter ignores the header row, and the row names, by default. If you need
|
86
|
+
either, use the switches --with-headers and --with-rownames. With math, list all rows
|
79
87
|
|
80
88
|
```sh
|
81
89
|
bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
|
@@ -174,7 +182,7 @@ To combine tables side by side use the --merge switch:
|
|
174
182
|
bio-table --merge table1.csv table2.csv
|
175
183
|
```
|
176
184
|
|
177
|
-
all rownames will be matched (i.e. the input table
|
185
|
+
all rownames will be matched (i.e. the input table does not need
|
178
186
|
to be sorted). For non-matching rownames the fields will be filled
|
179
187
|
with NA's, unless you add a filter, e.g.
|
180
188
|
|
@@ -226,7 +234,10 @@ finds the overlapping rows, based on the content of column 2.
|
|
226
234
|
|
227
235
|
### Different parsers
|
228
236
|
|
229
|
-
|
237
|
+
bio-table currently reads comma separated files and tab delimited
|
238
|
+
files.
|
239
|
+
|
240
|
+
(more soon)
|
230
241
|
|
231
242
|
### Using STDIN
|
232
243
|
|
@@ -240,6 +251,38 @@ piped in is the first input file
|
|
240
251
|
will filter both files test1.tab and test1.csv and output to
|
241
252
|
test1a.tab.
|
242
253
|
|
254
|
+
### Formatted output
|
255
|
+
|
256
|
+
bio-table has built-in formatters for CSV and TAB, and for RDF
|
257
|
+
(and soon for JSON/YAML and perhaps even XML). The RDF format is
|
258
|
+
discussed in 'Output table to RDF'.
|
259
|
+
|
260
|
+
Another flexible option for formatting a table is to create programmatic output
|
261
|
+
through a formatter. If you set the --format switch to *eval*, you
|
262
|
+
can add the -e 'command' that is evaluated to print to STDOUT. For
|
263
|
+
example, bio-table does not support HTML output directly, but if we
|
264
|
+
were to create an HTML table, we could run
|
265
|
+
|
266
|
+
```sh
|
267
|
+
bio-table --format eval -e '"<tr><td>"+field.join("</td><td>")+"</td></tr>"' table1.csv
|
268
|
+
```
|
269
|
+
|
270
|
+
Likewise, to create a LaTeX table we could run
|
271
|
+
|
272
|
+
```sh
|
273
|
+
bio-table --columns gene_symbol,gene_desc --format eval -e 'field.join(" & ")+" \\\\"' table1.csv
|
274
|
+
```
|
275
|
+
|
276
|
+
Since fields can be accessed independently, you can add any markup for
|
277
|
+
fields, e.g.
|
278
|
+
|
279
|
+
```sh
|
280
|
+
bio-table --columns ID,Description,Date --format eval -e'"\\emph{"+field[0]+"} & "+ field[1..-1].join(" & ")+"\\\\"' table1.csv
|
281
|
+
```
|
282
|
+
|
283
|
+
Because of the evaluation formatter, bio-table does not need to implement the machinery for
|
284
|
+
every output format on the planet!
|
285
|
+
|
243
286
|
### Output table to RDF
|
244
287
|
|
245
288
|
bio-table can write a table into turtle RDF triples (part of the semantic
|
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.0.
|
1
|
+
0.0.6
|
data/bin/bio-table
CHANGED
@@ -43,6 +43,10 @@ opts = OptionParser.new do |o|
|
|
43
43
|
options[:num_filter] = par
|
44
44
|
end
|
45
45
|
|
46
|
+
o.on('--filter expression', 'Generic filtering function') do |par|
|
47
|
+
options[:filter] = par
|
48
|
+
end
|
49
|
+
|
46
50
|
o.on('--rewrite expression', 'Rewrite function') do |par|
|
47
51
|
options[:rewrite] = par
|
48
52
|
end
|
@@ -81,18 +85,23 @@ opts = OptionParser.new do |o|
|
|
81
85
|
|
82
86
|
o.separator "\n\tOverrides:\n\n"
|
83
87
|
|
84
|
-
# o.on('--with-header','Include the header element in filtering etc.') do
|
85
|
-
# options[:with_header] = true
|
86
|
-
# end
|
87
|
-
|
88
88
|
o.on('--skip lines',Integer,'Skip the first lines before parsing') do |skip|
|
89
89
|
options[:skip] = skip
|
90
90
|
end
|
91
91
|
|
92
|
+
o.on('--with-headers','Include the header element in filtering etc.') do
|
93
|
+
options[:with_headers] = true
|
94
|
+
options[:write_header] = false
|
95
|
+
end
|
96
|
+
|
92
97
|
o.on('--with-rownames','Include the rownames in filtering etc.') do
|
93
98
|
options[:with_rownames] = true
|
94
99
|
end
|
95
100
|
|
101
|
+
o.on('--strip-quotes','Strip quotes from table fields') do
|
102
|
+
options[:strip_quotes] = true
|
103
|
+
end
|
104
|
+
|
96
105
|
o.separator "\n\tTransform:\n\n"
|
97
106
|
|
98
107
|
o.on('--transform-ids [downcase,upcase]',[:downcase,:upcase],'Transform column and row identifiers') do |par|
|
@@ -101,14 +110,22 @@ opts = OptionParser.new do |o|
|
|
101
110
|
|
102
111
|
o.separator "\n\tFormat and options:\n\n"
|
103
112
|
|
104
|
-
o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
|
113
|
+
o.on('--in-format [tab,csv,split,regex]', [:tab, :csv, :split, :regex], 'Input format (default tab)') do |par|
|
105
114
|
options[:in_format] = par.to_sym
|
106
115
|
end
|
107
116
|
|
108
|
-
o.on('--format [tab,csv,rdf]', [:tab, :csv, :rdf], 'Output format (default tab)') do |par|
|
117
|
+
o.on('--format [tab,csv,rdf,eval]', [:tab, :csv, :rdf, :eval], 'Output format (default tab)') do |par|
|
109
118
|
options[:format] = par.to_sym
|
110
119
|
end
|
111
120
|
|
121
|
+
o.on("--split-on command",String,"Split on string or regex (use with --in-format)") do | s |
|
122
|
+
options[:split_on] = s
|
123
|
+
end
|
124
|
+
|
125
|
+
o.on("-e command",String,"Evaluate output command (use with --format eval)") do | s |
|
126
|
+
options[:evaluate] = s
|
127
|
+
end
|
128
|
+
|
112
129
|
o.on('--blank-nodes','Output (RDF) blank nodes - allowing for duplicate row names') do
|
113
130
|
options[:blank_nodes] = true
|
114
131
|
end
|
@@ -203,7 +220,7 @@ writer =
|
|
203
220
|
if options[:format] == :rdf
|
204
221
|
BioTable::RDF::Writer.new(options[:blank_nodes])
|
205
222
|
else
|
206
|
-
BioTable::TableWriter::Writer.new(options[:format])
|
223
|
+
BioTable::TableWriter::Writer.new(options[:format],options[:evaluate])
|
207
224
|
end
|
208
225
|
|
209
226
|
if INPUT_ON_STDIN
|
data/features/cli.feature
CHANGED
@@ -33,9 +33,49 @@ Feature: Command-line interface (CLI)
|
|
33
33
|
When I execute "./bin/bio-table --format rdf --transform-ids downcase"
|
34
34
|
Then I expect the named output to match "table1-rdf1"
|
35
35
|
|
36
|
+
Scenario: Write HTML format
|
37
|
+
Given I have input file(s) named "test/data/input/table1.csv"
|
38
|
+
When I execute "./bin/bio-table --format eval -e '"<tr><td>"+field.join("</td><td>")+"</td></tr>"'"
|
39
|
+
Then I expect the named output to match "table1-html"
|
40
|
+
|
41
|
+
Scenario: Write LaTeX format
|
42
|
+
Given I have input file(s) named "test/data/input/table1.csv"
|
43
|
+
When I execute "./bin/bio-table --columns gene_symbol,gene_desc --format eval -e 'field.join(" & ")+" \\\\"'"
|
44
|
+
Then I expect the named output to match "table1-latex"
|
45
|
+
|
46
|
+
Scenario: Merge tables horizontally
|
47
|
+
Given I have input file(s) named "test/data/input/table1.csv"
|
48
|
+
When I execute "./bin/bio-table --merge test/data/input/table2.csv"
|
49
|
+
Then I expect the named output to match "table1-merge"
|
50
|
+
|
51
|
+
Scenario: Merge tables vertically
|
52
|
+
Given I have input file(s) named "test/data/input/table1.csv"
|
53
|
+
When I execute "./bin/bio-table test/data/input/table2.csv"
|
54
|
+
Then I expect the named output to match "table1-append"
|
55
|
+
|
56
|
+
Scenario: Diff tables
|
57
|
+
Given I have input file(s) named "test/data/input/table1.csv"
|
58
|
+
When I execute "./bin/bio-table --diff test/data/input/table2.csv"
|
59
|
+
Then I expect the named output to match "table1-diff"
|
60
|
+
|
36
61
|
Scenario: Read from STDIN
|
37
62
|
Given I have input file(s) named "test/data/input/table1.csv"
|
38
63
|
When I execute "cat test/data/input/table1.csv|./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
|
39
64
|
Then I expect the named output to match "table1-STDIN"
|
40
65
|
|
66
|
+
Scenario: Use special string splitter
|
67
|
+
Given I have input file(s) named "test/data/input/table_split_on.txt"
|
68
|
+
When I execute "./bin/bio-table test/data/input/table_split_on.txt --in-format split --split-on ','"
|
69
|
+
Then I expect the named output to match "table_split_on_string"
|
70
|
+
|
71
|
+
Scenario: Use special regex splitter
|
72
|
+
Given I have input file(s) named "test/data/input/table_split_on.txt"
|
73
|
+
When I execute "./bin/bio-table test/data/input/table_split_on.txt --in-format regex --split-on '\s*,'"
|
74
|
+
Then I expect the named output to match "table_split_on_regex"
|
75
|
+
|
76
|
+
Scenario: Use header in filter
|
77
|
+
Given I have input file(s) named "test/data/input/table_no_headers.txt"
|
78
|
+
When I execute "./bin/bio-table --in-format split --split-on ',' --num-filter 'values[1]!=0' --with-headers"
|
79
|
+
Then I expect the named output to match "table_filter_headers"
|
80
|
+
|
41
81
|
|
data/lib/bio-table/filter.rb
CHANGED
@@ -70,7 +70,7 @@ module BioTable
|
|
70
70
|
def Filter::numeric code, fields
|
71
71
|
return true if code == nil
|
72
72
|
if fields
|
73
|
-
# values = fields.map { |field| (valid_number?(field) ? field.to_f : nil ) }
|
73
|
+
# values = fields.map { |field| (valid_number?(field) ? field.to_f : nil ) }
|
74
74
|
values = LazyValues.new(fields)
|
75
75
|
begin
|
76
76
|
eval(code)
|
@@ -82,6 +82,21 @@ module BioTable
|
|
82
82
|
false
|
83
83
|
end
|
84
84
|
end
|
85
|
+
|
86
|
+
def Filter::generic code, tablefields
|
87
|
+
return true if code == nil
|
88
|
+
if tablefields
|
89
|
+
field = tablefields.dup
|
90
|
+
begin
|
91
|
+
eval(code)
|
92
|
+
rescue Exception
|
93
|
+
$stderr.print "Failed to evaluate ",tablefields," with ",code,"\n"
|
94
|
+
raise
|
95
|
+
end
|
96
|
+
else
|
97
|
+
false
|
98
|
+
end
|
99
|
+
end
|
85
100
|
end
|
86
101
|
|
87
102
|
end
|
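Review note: the new `Filter::generic` evaluates a user-supplied Ruby expression with the row's fields bound to a local variable named `field`. The sketch below is a stand-alone approximation of that idea, not the gem's code: `generic_filter` and the sample rows are hypothetical names chosen for illustration, and the result is coerced to a strict boolean, which the hunk above does not do.

```ruby
# Hypothetical stand-alone version of the generic filter. `code` is a Ruby
# expression (a String) that may refer to the local variable `field`.
def generic_filter(code, tablefields)
  return true if code.nil?          # no filter given: keep every row
  return false unless tablefields   # no fields: drop the row
  field = tablefields.dup           # visible to eval through the local binding
  eval(code) ? true : false
end

# Made-up sample rows, for illustration only
rows = [
  ["BRCA1", "0.01", "tumor"],
  ["TP53",  "0.20", "normal"]
]

kept = rows.select { |row| generic_filter('field[2] == "tumor"', row) }
p kept
```

Because the expression is passed to `eval`, it runs with the full privileges of the process, so this style of filter is only appropriate for trusted command-line input.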
data/lib/bio-table/formatter.rb
CHANGED
@@ -17,13 +17,28 @@ module BioTable
|
|
17
17
|
end
|
18
18
|
l
|
19
19
|
end
|
20
|
+
def Formatter::strip_quotes list
|
21
|
+
list.map { |field|
|
22
|
+
if field == nil
|
23
|
+
nil
|
24
|
+
else
|
25
|
+
first = field[0,1]
|
26
|
+
if first == "\"" or first == "'"
|
27
|
+
last = field[-1,1]
|
28
|
+
if first == last
|
29
|
+
field = field[1..-2]
|
30
|
+
end
|
31
|
+
end
|
32
|
+
field
|
33
|
+
end
|
34
|
+
}
|
35
|
+
end
|
20
36
|
end
|
21
37
|
|
22
38
|
class TabFormatter
|
23
39
|
def write list
|
24
40
|
print list.map{|field| (field==nil ? "NA" : field)}.join("\t"),"\n"
|
25
41
|
end
|
26
|
-
|
27
42
|
end
|
28
43
|
|
29
44
|
class CsvFormatter
|
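Review note: `Formatter::strip_quotes` only removes a leading quote when the same quote character also ends the field. A minimal sketch of that behaviour, written as a free-standing method (the name `strip_quotes_sketch` and the sample values are assumptions for illustration):

```ruby
# Sketch of the quote stripping shown in the hunk above: a field loses its
# outer quotes only when it starts and ends with the same quote character.
def strip_quotes_sketch(list)
  list.map do |field|
    next nil if field.nil?            # nil (NA) fields pass through untouched
    first = field[0, 1]
    if (first == '"' || first == "'") && field[-1, 1] == first
      field[1..-2]
    else
      field
    end
  end
end

p strip_quotes_sketch(['"gene"', "'x'", "plain", nil, '"open'])
```

An unbalanced field such as `"open` is left unchanged, which matches the `first == last` check in the diff.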
@@ -35,9 +50,22 @@ module BioTable
|
|
35
50
|
end
|
36
51
|
end
|
37
52
|
|
53
|
+
class EvalFormatter
|
54
|
+
def initialize evaluate
|
55
|
+
@evaluate = evaluate
|
56
|
+
end
|
57
|
+
def write list
|
58
|
+
field = list.dup.map { |e| (e==nil ? "" : e) }
|
59
|
+
print eval(@evaluate)
|
60
|
+
print "\n"
|
61
|
+
end
|
62
|
+
end
|
63
|
+
|
64
|
+
|
38
65
|
module FormatFactory
|
39
|
-
def self.create format
|
66
|
+
def self.create format, evaluate
|
40
67
|
# @logger.info("Formatting to #{format}")
|
68
|
+
return EvalFormatter.new(evaluate) if evaluate
|
41
69
|
return CsvFormatter.new if format == :csv
|
42
70
|
return TabFormatter.new
|
43
71
|
end
|
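Review note: `EvalFormatter` turns the `-e` expression into one output line per row, with nil fields replaced by empty strings before evaluation. The following sketch mirrors that logic but returns the string instead of printing it, so it can be checked in isolation; the class name `EvalFormatterSketch` and the method name `render` are hypothetical.

```ruby
# Sketch of an eval-based formatter: the expression sees the row as `field`.
class EvalFormatterSketch
  def initialize(evaluate)
    @evaluate = evaluate
  end

  # Returns the formatted line rather than printing it
  def render(list)
    field = list.map { |e| e.nil? ? "" : e }   # nil becomes an empty cell
    eval(@evaluate).to_s
  end
end

html = EvalFormatterSketch.new('"<tr><td>" + field.join("</td><td>") + "</td></tr>"')
puts html.render(["id1", nil, "0.5"])
```

The expression string is the same one used in the README's HTML example, so the sketch shows what a single row would look like under `--format eval`.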
data/lib/bio-table/parser.rb
CHANGED
@@ -5,9 +5,21 @@ module BioTable
|
|
5
5
|
module LineParser
|
6
6
|
|
7
7
|
# Converts a string into an array of string fields
|
8
|
-
def LineParser::parse(line, in_format)
|
8
|
+
def LineParser::parse(line, in_format, split_on)
|
9
9
|
if in_format == :csv
|
10
10
|
CSV.parse(line)[0]
|
11
|
+
elsif in_format == :split
|
12
|
+
line.strip.split(split_on).map { |field|
|
13
|
+
fld = field.strip
|
14
|
+
fld = nil if fld == "NA"
|
15
|
+
fld
|
16
|
+
}
|
17
|
+
elsif in_format == :regex
|
18
|
+
line.strip.split(/#{split_on}/).map { |field|
|
19
|
+
fld = field.strip
|
20
|
+
fld = nil if fld == "NA"
|
21
|
+
fld
|
22
|
+
}
|
11
23
|
else
|
12
24
|
line.strip.split("\t").map { |field|
|
13
25
|
fld = field.strip
|
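Review note: the `:split` and `:regex` branches differ only in whether `--split-on` is used as a literal string or compiled into a regular expression. A condensed sketch of the three non-CSV branches, assuming a hypothetical helper named `parse_line` (the CSV branch is left out on purpose):

```ruby
# Sketch of the splitter selection: literal string, regex, or TAB by default.
def parse_line(line, in_format, split_on = nil)
  separator =
    case in_format
    when :split then split_on              # literal string separator
    when :regex then Regexp.new(split_on)  # pattern separator
    else "\t"                              # default: tab delimited
    end
  line.strip.split(separator).map do |field|
    fld = field.strip
    fld == "NA" ? nil : fld                # NA is stored as nil
  end
end

p parse_line("a , b,NA ,c\n", :split, ",")
p parse_line("a , b,NA ,c\n", :regex, '\s*,\s*')
```

Since every field is stripped after splitting, both calls give the same fields for this input; the regex form matters when the separator itself varies, not just the surrounding whitespace.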
data/lib/bio-table/rdf.rb
CHANGED
@@ -38,7 +38,7 @@ module BioTable
|
|
38
38
|
#
|
39
39
|
# The method returns a String.
|
40
40
|
|
41
|
-
def RDF::row(row, header, use_blank_nodes)
|
41
|
+
def RDF::row(row, header, use_blank_nodes = false)
|
42
42
|
list = []
|
43
43
|
rowname = make_identifier(row[0])
|
44
44
|
list << ":#{rowname}"+(use_blank_nodes ? " :row [ " : " ") + "rdf:label \"#{row[0]}\" ; a :rowname"
|
data/lib/bio-table/table.rb
CHANGED
@@ -23,7 +23,7 @@ module BioTable
|
|
23
23
|
|
24
24
|
# Read lines (list/array of string) and add them to the table, setting row
|
25
25
|
# names and row fields. The first row is assumed to be the header and
|
26
|
-
# ignored if the header has been set.
|
26
|
+
# ignored if the header has been set (the case with merge/concat tables).
|
27
27
|
|
28
28
|
def read_lines lines, options = {}
|
29
29
|
table_apply = TableApply.new(options)
|
@@ -62,9 +62,13 @@ module BioTable
|
|
62
62
|
def write options = {}
|
63
63
|
format = options[:format]
|
64
64
|
format = :tab if not format
|
65
|
-
|
65
|
+
evaluate = nil
|
66
|
+
if format == :eval
|
67
|
+
evaluate = options[:evaluate]
|
68
|
+
end
|
69
|
+
formatter = FormatFactory::create(format,evaluate)
|
66
70
|
formatter.write(@header) if options[:write_header]
|
67
|
-
each do | tablerow |
|
71
|
+
each do | tablerow,num |
|
68
72
|
# p tablerow
|
69
73
|
formatter.write(tablerow.all_fields) if tablerow.all_valid?
|
70
74
|
end
|
@@ -11,12 +11,16 @@ module BioTable
|
|
11
11
|
# @logger.debug "Skipping #{@skip} lines" if @skip
|
12
12
|
@num_filter = options[:num_filter]
|
13
13
|
@logger.debug "Filtering on #{@num_filter}" if @num_filter
|
14
|
+
@filter = options[:filter]
|
15
|
+
@logger.debug "Filtering on #{@filter}" if @filter
|
14
16
|
@rewrite = options[:rewrite]
|
15
17
|
@logger.debug "Rewrite #{@rewrite}" if @rewrite
|
16
18
|
@use_columns = options[:columns]
|
17
19
|
@logger.debug "Filtering on columns #{@use_columns}" if @use_columns
|
18
20
|
@column_filter = options[:column_filter]
|
19
21
|
@logger.debug "Filtering on column names #{@column_filter}" if @column_filter
|
22
|
+
@strip_quotes = options[:strip_quotes]
|
23
|
+
@logger.debug "Strip quotes #{@strip_quotes}" if @strip_quotes
|
20
24
|
@transform_ids = options[:transform_ids]
|
21
25
|
@logger.debug "Transform ids #{@transform_ids}" if @transform_ids
|
22
26
|
@include_rownames = options[:with_rownames]
|
@@ -25,7 +29,8 @@ module BioTable
|
|
25
29
|
end
|
26
30
|
|
27
31
|
def parse_header(line, options)
|
28
|
-
header = LineParser::parse(line, options[:in_format])
|
32
|
+
header = LineParser::parse(line, options[:in_format], options[:split_on])
|
33
|
+
header = Formatter::strip_quotes(header) if @strip_quotes
|
29
34
|
return Formatter::transform_header_ids(@transform_ids, header) if @transform_ids
|
30
35
|
header
|
31
36
|
end
|
@@ -38,8 +43,9 @@ module BioTable
|
|
38
43
|
end
|
39
44
|
|
40
45
|
def parse_row(line_num, line, column_idx, last_fields, options)
|
41
|
-
fields = LineParser::parse(line, options[:in_format])
|
46
|
+
fields = LineParser::parse(line, options[:in_format], options[:split_on])
|
42
47
|
return nil,nil if fields.compact == []
|
48
|
+
fields = Formatter::strip_quotes(fields) if @strip_quotes
|
43
49
|
fields = Formatter::transform_row_ids(@transform_ids, fields) if @transform_ids
|
44
50
|
fields = Filter::apply_column_filter(fields,column_idx)
|
45
51
|
return nil,nil if fields.compact == []
|
@@ -48,6 +54,7 @@ module BioTable
|
|
48
54
|
if data_fields.size > 0
|
49
55
|
return nil,nil if not Validator::valid_row?(line_num, data_fields, last_fields)
|
50
56
|
return nil,nil if not Filter::numeric(@num_filter,data_fields)
|
57
|
+
return nil,nil if not Filter::generic(@filter,data_fields)
|
51
58
|
(rowname, data_fields) = Rewrite::rewrite(@rewrite,rowname,data_fields)
|
52
59
|
end
|
53
60
|
return rowname, data_fields
|
data/lib/bio-table/tableload.rb
CHANGED
@@ -23,6 +23,11 @@ module BioTable
|
|
23
23
|
column_index,header = table_apply.column_index(header) # we may rewrite the header
|
24
24
|
yielder.yield header,:header if options[:write_header] != false
|
25
25
|
prev_line = header[1..-1]
|
26
|
+
# When a header filter is defined, rewind the generator, note that skip won't work
|
27
|
+
# properly (FIXME)
|
28
|
+
if options[:with_headers]
|
29
|
+
generator.rewind
|
30
|
+
end
|
26
31
|
elsif line_num-skip < 0
|
27
32
|
# do nothing
|
28
33
|
else
|
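Review note: the `generator.rewind` call means that, with `--with-headers`, the first line is handed out again and so passes through the row filters like any data line. The sketch below illustrates the effect with a plain Ruby `Enumerator`; whether `generator` in `tableload.rb` is exactly this kind of object is an assumption, and the sample lines are made up.

```ruby
# Illustration of Enumerator#rewind: after rewinding, the line already
# consumed as the header is produced again as the first row.
lines = ["1,0,3", "2,5,6"].each   # external enumerator over input lines

header = lines.next               # first line consumed as the header
lines.rewind                      # start over from the first line

rows = []
loop { rows << lines.next }       # loop ends quietly on StopIteration

p header
p rows
```

This also hints at why the FIXME mentions `--skip`: rewinding returns to the very first line, not to the first line after the skipped ones.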