bio-table 0.0.5 → 0.0.6
- data/.travis.yml +1 -1
- data/Gemfile +1 -1
- data/README.md +49 -6
- data/VERSION +1 -1
- data/bin/bio-table +24 -7
- data/features/cli.feature +40 -0
- data/lib/bio-table/filter.rb +16 -1
- data/lib/bio-table/formatter.rb +30 -2
- data/lib/bio-table/parser.rb +13 -1
- data/lib/bio-table/rdf.rb +1 -1
- data/lib/bio-table/table.rb +7 -3
- data/lib/bio-table/table_apply.rb +9 -2
- data/lib/bio-table/tableload.rb +5 -0
- data/lib/bio-table/tablewriter.rb +5 -2
- data/test/data/input/table_no_headers.txt +5 -0
- data/test/data/input/table_split_on.txt +6 -0
- data/test/data/regression/table1-append.ref +753 -0
- data/test/data/regression/table1-diff.ref +7 -0
- data/test/data/regression/table1-html.ref +380 -0
- data/test/data/regression/table1-latex.ref +380 -0
- data/test/data/regression/table1-merge.ref +380 -0
- data/test/data/regression/table_filter_headers.ref +5 -0
- data/test/data/regression/table_split_on_regex.ref +11 -0
- data/test/data/regression/table_split_on_string.ref +11 -0
- metadata +33 -23
data/.travis.yml
CHANGED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -25,16 +25,17 @@ you don't need to know Ruby to use the command line interface (CLI).
|
|
25
25
|
|
26
26
|
Features:
|
27
27
|
|
28
|
-
* Support for
|
28
|
+
* Support for reading and writing TAB and CSV files, as well as regex splitters
|
29
29
|
* Filter on data
|
30
30
|
* Transform table and data by column or row
|
31
31
|
* Recalculate data
|
32
32
|
* Diff between tables, selecting on specific column values
|
33
33
|
* Merge tables side by side on column value/rowname
|
34
34
|
* Split/reduce tables by column
|
35
|
+
* Write formatted tables, e.g. HTML, LaTeX
|
35
36
|
* Read from STDIN, write to STDOUT
|
36
37
|
* Convert table to RDF
|
37
|
-
* Convert table to JSON (nyi)
|
38
|
+
* Convert table to JSON/YAML/XML (nyi)
|
38
39
|
* etc. etc.
|
39
40
|
|
40
41
|
and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
|
@@ -68,14 +69,21 @@ csv, it will assume CSV. To convert the table back
|
|
68
69
|
bio-table test1.tab --format csv > table1.csv
|
69
70
|
```
|
70
71
|
|
72
|
+
It is also possible to use a string or regex splitter, e.g.
|
73
|
+
|
74
|
+
```sh
|
75
|
+
bio-table --in-format split --split-on ',' test/data/input/table_split_on.txt
|
76
|
+
bio-table --in-format regex --split-on '\s*,\s*' test/data/input/table_split_on.txt
|
77
|
+
```
|
78
|
+
|
71
79
|
To filter out rows that contain certain values
|
72
80
|
|
73
81
|
```sh
|
74
82
|
bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
|
75
83
|
```
|
76
84
|
|
77
|
-
The filter ignores the header row, and the row names. If you need
|
78
|
-
either, use the switches --with-
|
85
|
+
The filter ignores the header row, and the row names, by default. If you need
|
86
|
+
either, use the switches --with-headers and --with-rownames. With math, list all rows
|
79
87
|
|
80
88
|
```sh
|
81
89
|
bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
|
@@ -174,7 +182,7 @@ To combine tables side by side use the --merge switch:
|
|
174
182
|
bio-table --merge table1.csv table2.csv
|
175
183
|
```
|
176
184
|
|
177
|
-
all rownames will be matched (i.e. the input table
|
185
|
+
all rownames will be matched (i.e. the input table does not need
|
178
186
|
to be sorted). For non-matching rownames the fields will be filled
|
179
187
|
with NA's, unless you add a filter, e.g.
|
180
188
|
|
@@ -226,7 +234,10 @@ finds the overlapping rows, based on the content of column 2.
|
|
226
234
|
|
227
235
|
### Different parsers
|
228
236
|
|
229
|
-
|
237
|
+
bio-table currently reads comma separated files and tab delimited
|
238
|
+
files.
|
239
|
+
|
240
|
+
(more soon)
|
230
241
|
|
231
242
|
### Using STDIN
|
232
243
|
|
@@ -240,6 +251,38 @@ piped in is the first input file
|
|
240
251
|
will filter both files test1.tab and test1.csv and output to
|
241
252
|
test1a.tab.
|
242
253
|
|
254
|
+
### Formatted output
|
255
|
+
|
256
|
+
bio-table has built-in formatters - for CSV and TAB, and for RDF
|
257
|
+
(and soon for JSON/YAML and perhaps even XML). The RDF format is
|
258
|
+
discussed in 'Output table to RDF'.
|
259
|
+
|
260
|
+
Another flexible option for formatting a table is to create programmatic output
|
261
|
+
through a formatter. If you set the --format switch to *eval*, you
|
262
|
+
can add the -e 'command' that is evaluated to print to STDOUT. For
|
263
|
+
example, bio-table does not support HTML output directly, but if we
|
264
|
+
were to create an HTML table, we could run
|
265
|
+
|
266
|
+
```sh
|
267
|
+
bio-table --format eval -e '"<tr><td>"+field.join("</td><td>")+"</td></tr>"' table1.csv
|
268
|
+
```
|
269
|
+
|
270
|
+
likewise to create a LaTeX table we could
|
271
|
+
|
272
|
+
```sh
|
273
|
+
bio-table --columns gene_symbol,gene_desc --format eval -e 'field.join(" & ")+" \\\\"' table1.csv
|
274
|
+
```
|
275
|
+
|
276
|
+
Since fields can be accessed independently, you can add any markup for
|
277
|
+
fields, e.g.
|
278
|
+
|
279
|
+
```sh
|
280
|
+
bio-table --columns ID,Description,Date --format eval -e'"\\emph{"+field[0]+"} & "+ field[1..-1].join(" & ")+"\\\\"' table1.csv
|
281
|
+
```
|
282
|
+
|
283
|
+
Because of the evaluation formatter bio-table does not need to implement the machinery for
|
284
|
+
every output format on the planet!
|
285
|
+
|
243
286
|
### Output table to RDF
|
244
287
|
|
245
288
|
bio-table can write a table into turtle RDF triples (part of the semantic
|
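The `--format eval` examples in the README hunks above can be read as "evaluate this Ruby expression once per row, with `field` bound to the row values". The following is a minimal sketch of that idea, assuming the behaviour shown in the `EvalFormatter` hunk of this diff; `render_row`, `HTML_EXPR` and `LATEX_EXPR` are hypothetical names introduced only for illustration.

```ruby
# Hypothetical helper (not part of bio-table): evaluates a user-supplied Ruby
# expression with `field` bound to the row values, one row at a time.
def render_row(expression, row)
  field = row.map { |e| e.nil? ? "" : e } # nil fields become empty strings
  eval(expression)                        # the expression can read `field`
end

# The HTML expression from the README example.
HTML_EXPR = '"<tr><td>"+field.join("</td><td>")+"</td></tr>"'

# A raw heredoc keeps the four backslashes exactly as a single-quoted
# shell argument would pass them to the program.
LATEX_EXPR = <<~'EXPR'.strip
  field.join(" & ")+" \\\\"
EXPR

puts render_row(HTML_EXPR, ["gene1", "0.5", nil])
puts render_row(LATEX_EXPR, ["gene1", "kinase"])
```

Because the expression is passed to `eval`, it runs with the full privileges of the calling process, so it should only come from a trusted command line.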
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.0.
|
1
|
+
0.0.6
|
data/bin/bio-table
CHANGED
@@ -43,6 +43,10 @@ opts = OptionParser.new do |o|
|
|
43
43
|
options[:num_filter] = par
|
44
44
|
end
|
45
45
|
|
46
|
+
o.on('--filter expression', 'Generic filtering function') do |par|
|
47
|
+
options[:filter] = par
|
48
|
+
end
|
49
|
+
|
46
50
|
o.on('--rewrite expression', 'Rewrite function') do |par|
|
47
51
|
options[:rewrite] = par
|
48
52
|
end
|
@@ -81,18 +85,23 @@ opts = OptionParser.new do |o|
|
|
81
85
|
|
82
86
|
o.separator "\n\tOverrides:\n\n"
|
83
87
|
|
84
|
-
# o.on('--with-header','Include the header element in filtering etc.') do
|
85
|
-
# options[:with_header] = true
|
86
|
-
# end
|
87
|
-
|
88
88
|
o.on('--skip lines',Integer,'Skip the first lines before parsing') do |skip|
|
89
89
|
options[:skip] = skip
|
90
90
|
end
|
91
91
|
|
92
|
+
o.on('--with-headers','Include the header element in filtering etc.') do
|
93
|
+
options[:with_headers] = true
|
94
|
+
options[:write_header] = false
|
95
|
+
end
|
96
|
+
|
92
97
|
o.on('--with-rownames','Include the rownames in filtering etc.') do
|
93
98
|
options[:with_rownames] = true
|
94
99
|
end
|
95
100
|
|
101
|
+
o.on('--strip-quotes','Strip quotes from table fields') do
|
102
|
+
options[:strip_quotes] = true
|
103
|
+
end
|
104
|
+
|
96
105
|
o.separator "\n\tTransform:\n\n"
|
97
106
|
|
98
107
|
o.on('--transform-ids [downcase,upcase]',[:downcase,:upcase],'Transform column and row identifiers') do |par|
|
@@ -101,14 +110,22 @@ opts = OptionParser.new do |o|
|
|
101
110
|
|
102
111
|
o.separator "\n\tFormat and options:\n\n"
|
103
112
|
|
104
|
-
o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
|
113
|
+
o.on('--in-format [tab,csv,split,regex]', [:tab, :csv, :split, :regex], 'Input format (default tab)') do |par|
|
105
114
|
options[:in_format] = par.to_sym
|
106
115
|
end
|
107
116
|
|
108
|
-
o.on('--format [tab,csv,rdf]', [:tab, :csv, :rdf], 'Output format (default tab)') do |par|
|
117
|
+
o.on('--format [tab,csv,rdf,eval]', [:tab, :csv, :rdf, :eval], 'Output format (default tab)') do |par|
|
109
118
|
options[:format] = par.to_sym
|
110
119
|
end
|
111
120
|
|
121
|
+
o.on("--split-on command",String,"Split on string or regex (use with --in-format)") do | s |
|
122
|
+
options[:split_on] = s
|
123
|
+
end
|
124
|
+
|
125
|
+
o.on("-e command",String,"Evaluate output command (use with --format eval)") do | s |
|
126
|
+
options[:evaluate] = s
|
127
|
+
end
|
128
|
+
|
112
129
|
o.on('--blank-nodes','Output (RDF) blank nodes - allowing for duplicate row names') do
|
113
130
|
options[:blank_nodes] = true
|
114
131
|
end
|
@@ -203,7 +220,7 @@ writer =
|
|
203
220
|
if options[:format] == :rdf
|
204
221
|
BioTable::RDF::Writer.new(options[:blank_nodes])
|
205
222
|
else
|
206
|
-
BioTable::TableWriter::Writer.new(options[:format])
|
223
|
+
BioTable::TableWriter::Writer.new(options[:format],options[:evaluate])
|
207
224
|
end
|
208
225
|
|
209
226
|
if INPUT_ON_STDIN
|
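The option hunks above add `--split-on`, `-e`, and the `split`, `regex` and `eval` format values. A stand-alone sketch of how such options can be declared with Ruby's standard `OptionParser` is shown below. `parse_cli` is a hypothetical wrapper, and the arguments are declared as mandatory here for simplicity, whereas the diff declares the format arguments in brackets (optional).

```ruby
require 'optparse'

# Hypothetical wrapper (not the real bin/bio-table): parses a subset of the
# options touched by this diff and returns [options, remaining_arguments].
def parse_cli(argv)
  options = { in_format: :tab, format: :tab }
  parser = OptionParser.new do |o|
    # An Array as the second argument restricts the accepted values.
    o.on('--in-format FORMAT', [:tab, :csv, :split, :regex], 'Input format') do |par|
      options[:in_format] = par
    end
    o.on('--format FORMAT', [:tab, :csv, :rdf, :eval], 'Output format') do |par|
      options[:format] = par
    end
    o.on('--split-on PATTERN', String, 'Split on string or regex') do |s|
      options[:split_on] = s
    end
    o.on('-e COMMAND', String, 'Evaluate output command') do |s|
      options[:evaluate] = s
    end
  end
  rest = parser.parse(argv) # non-destructive; returns the non-option arguments
  [options, rest]
end
```

A call such as `parse_cli(['--in-format', 'regex', '--split-on', '\s*,\s*', 'table.csv'])` would leave `table.csv` in the remaining arguments and store the pattern as a plain string.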
data/features/cli.feature
CHANGED
@@ -33,9 +33,49 @@ Feature: Command-line interface (CLI)
|
|
33
33
|
When I execute "./bin/bio-table --format rdf --transform-ids downcase"
|
34
34
|
Then I expect the named output to match "table1-rdf1"
|
35
35
|
|
36
|
+
Scenario: Write HTML format
|
37
|
+
Given I have input file(s) named "test/data/input/table1.csv"
|
38
|
+
When I execute "./bin/bio-table --format eval -e '"<tr><td>"+field.join("</td><td>")+"</td></tr>"'"
|
39
|
+
Then I expect the named output to match "table1-html"
|
40
|
+
|
41
|
+
Scenario: Write LaTeX format
|
42
|
+
Given I have input file(s) named "test/data/input/table1.csv"
|
43
|
+
When I execute "./bin/bio-table --columns gene_symbol,gene_desc --format eval -e 'field.join(" & ")+" \\\\"'"
|
44
|
+
Then I expect the named output to match "table1-latex"
|
45
|
+
|
46
|
+
Scenario: Merge tables horizontally
|
47
|
+
Given I have input file(s) named "test/data/input/table1.csv"
|
48
|
+
When I execute "./bin/bio-table --merge test/data/input/table2.csv"
|
49
|
+
Then I expect the named output to match "table1-merge"
|
50
|
+
|
51
|
+
Scenario: Merge tables vertically
|
52
|
+
Given I have input file(s) named "test/data/input/table1.csv"
|
53
|
+
When I execute "./bin/bio-table test/data/input/table2.csv"
|
54
|
+
Then I expect the named output to match "table1-append"
|
55
|
+
|
56
|
+
Scenario: Diff tables
|
57
|
+
Given I have input file(s) named "test/data/input/table1.csv"
|
58
|
+
When I execute "./bin/bio-table --diff test/data/input/table2.csv"
|
59
|
+
Then I expect the named output to match "table1-diff"
|
60
|
+
|
36
61
|
Scenario: Read from STDIN
|
37
62
|
Given I have input file(s) named "test/data/input/table1.csv"
|
38
63
|
When I execute "cat test/data/input/table1.csv|./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
|
39
64
|
Then I expect the named output to match "table1-STDIN"
|
40
65
|
|
66
|
+
Scenario: Use special string splitter
|
67
|
+
Given I have input file(s) named "test/data/input/table_split_on.txt"
|
68
|
+
When I execute "./bin/bio-table test/data/input/table_split_on.txt --in-format split --split-on ','"
|
69
|
+
Then I expect the named output to match "table_split_on_string"
|
70
|
+
|
71
|
+
Scenario: Use special regex splitter
|
72
|
+
Given I have input file(s) named "test/data/input/table_split_on.txt"
|
73
|
+
When I execute "./bin/bio-table test/data/input/table_split_on.txt --in-format regex --split-on '\s*,'"
|
74
|
+
Then I expect the named output to match "table_split_on_regex"
|
75
|
+
|
76
|
+
Scenario: Use header in filter
|
77
|
+
Given I have input file(s) named "test/data/input/table_no_headers.txt"
|
78
|
+
When I execute "./bin/bio-table --in-format split --split-on ',' --num-filter 'values[1]!=0' --with-headers"
|
79
|
+
Then I expect the named output to match "table_filter_headers"
|
80
|
+
|
41
81
|
|
data/lib/bio-table/filter.rb
CHANGED
@@ -70,7 +70,7 @@ module BioTable
|
|
70
70
|
def Filter::numeric code, fields
|
71
71
|
return true if code == nil
|
72
72
|
if fields
|
73
|
-
# values = fields.map { |field| (valid_number?(field) ? field.to_f : nil ) }
|
73
|
+
# values = fields.map { |field| (valid_number?(field) ? field.to_f : nil ) }
|
74
74
|
values = LazyValues.new(fields)
|
75
75
|
begin
|
76
76
|
eval(code)
|
@@ -82,6 +82,21 @@ module BioTable
|
|
82
82
|
false
|
83
83
|
end
|
84
84
|
end
|
85
|
+
|
86
|
+
def Filter::generic code, tablefields
|
87
|
+
return true if code == nil
|
88
|
+
if tablefields
|
89
|
+
field = tablefields.dup
|
90
|
+
begin
|
91
|
+
eval(code)
|
92
|
+
rescue Exception
|
93
|
+
$stderr.print "Failed to evaluate ",fields," with ",code,"\n"
|
94
|
+
raise
|
95
|
+
end
|
96
|
+
else
|
97
|
+
false
|
98
|
+
end
|
99
|
+
end
|
85
100
|
end
|
86
101
|
|
87
102
|
end
|
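The new `Filter::generic` evaluates an arbitrary expression against a copy of the row, exposed as `field`. Note that its rescue branch prints `fields`, a name that is not defined in that method as shown in the hunk, so the error report itself would likely fail with a `NameError`. A minimal stand-alone sketch of the same technique that reports the variable actually in scope is given below; `generic_filter` is a hypothetical name, and coercing the result to a strict boolean is a choice made for this sketch.

```ruby
# Hypothetical stand-alone variant of an eval-based row filter.
def generic_filter(code, tablefields)
  return true if code.nil?        # no filter given: keep every row
  return false unless tablefields # no fields: drop the row
  field = tablefields.dup         # the expression sees the row as `field`
  begin
    eval(code) ? true : false     # coerce the result to a strict boolean
  rescue StandardError, SyntaxError
    # Report the variable that exists in this scope, then re-raise.
    $stderr.print "Failed to evaluate ", tablefields.inspect, " with ", code, "\n"
    raise
  end
end
```

For instance, `generic_filter('field[1] =~ /kinase/', ["g1", "protein kinase"])` keeps the row, while a row whose second field does not match is dropped.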
data/lib/bio-table/formatter.rb
CHANGED
@@ -17,13 +17,28 @@ module BioTable
|
|
17
17
|
end
|
18
18
|
l
|
19
19
|
end
|
20
|
+
def Formatter::strip_quotes list
|
21
|
+
list.map { |field|
|
22
|
+
if field == nil
|
23
|
+
nil
|
24
|
+
else
|
25
|
+
first = field[0,1]
|
26
|
+
if first == "\"" or first == "'"
|
27
|
+
last = field[-1,1]
|
28
|
+
if first == last
|
29
|
+
field = field[1..-2]
|
30
|
+
end
|
31
|
+
end
|
32
|
+
field
|
33
|
+
end
|
34
|
+
}
|
35
|
+
end
|
20
36
|
end
|
21
37
|
|
22
38
|
class TabFormatter
|
23
39
|
def write list
|
24
40
|
print list.map{|field| (field==nil ? "NA" : field)}.join("\t"),"\n"
|
25
41
|
end
|
26
|
-
|
27
42
|
end
|
28
43
|
|
29
44
|
class CsvFormatter
|
@@ -35,9 +50,22 @@ module BioTable
|
|
35
50
|
end
|
36
51
|
end
|
37
52
|
|
53
|
+
class EvalFormatter
|
54
|
+
def initialize evaluate
|
55
|
+
@evaluate = evaluate
|
56
|
+
end
|
57
|
+
def write list
|
58
|
+
field = list.dup.map { |e| (e==nil ? "" : e) }
|
59
|
+
print eval(@evaluate)
|
60
|
+
print "\n"
|
61
|
+
end
|
62
|
+
end
|
63
|
+
|
64
|
+
|
38
65
|
module FormatFactory
|
39
|
-
def self.create format
|
66
|
+
def self.create format, evaluate
|
40
67
|
# @logger.info("Formatting to #{format}")
|
68
|
+
return EvalFormatter.new(evaluate) if evaluate
|
41
69
|
return CsvFormatter.new if format == :csv
|
42
70
|
return TabFormatter.new
|
43
71
|
end
|
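The `Formatter::strip_quotes` hunk above removes one pair of matching surrounding quotes from each field and leaves `nil` fields alone. A compact sketch of the same logic follows; `strip_quotes` is written here as a free-standing method, and the length check is an extra guard added for this sketch (the code in the diff has no such check).

```ruby
# Stand-alone sketch: strip one pair of matching single or double quotes.
def strip_quotes(list)
  list.map do |field|
    next nil if field.nil?       # keep missing values as nil
    first = field[0, 1]
    last  = field[-1, 1]
    # The size check is an added guard so a lone quote character is kept.
    if ['"', "'"].include?(first) && first == last && field.size > 1
      field[1..-2]               # drop the first and last character
    else
      field                      # unquoted or mismatched quotes: unchanged
    end
  end
end
```

Fields with an opening quote but no matching closing quote are returned unchanged, which matches the `first == last` test in the diff.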
data/lib/bio-table/parser.rb
CHANGED
@@ -5,9 +5,21 @@ module BioTable
|
|
5
5
|
module LineParser
|
6
6
|
|
7
7
|
# Converts a string into an array of string fields
|
8
|
-
def LineParser::parse(line, in_format)
|
8
|
+
def LineParser::parse(line, in_format, split_on)
|
9
9
|
if in_format == :csv
|
10
10
|
CSV.parse(line)[0]
|
11
|
+
elsif in_format == :split
|
12
|
+
line.strip.split(split_on).map { |field|
|
13
|
+
fld = field.strip
|
14
|
+
fld = nil if fld == "NA"
|
15
|
+
fld
|
16
|
+
}
|
17
|
+
elsif in_format == :regex
|
18
|
+
line.strip.split(/#{split_on}/).map { |field|
|
19
|
+
fld = field.strip
|
20
|
+
fld = nil if fld == "NA"
|
21
|
+
fld
|
22
|
+
}
|
11
23
|
else
|
12
24
|
line.strip.split("\t").map { |field|
|
13
25
|
fld = field.strip
|
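The two new branches in `LineParser::parse` differ only in how the splitter is interpreted: `:split` passes the string to `String#split` literally, while `:regex` compiles it into a regular expression first. A small sketch of that difference, assuming the field handling shown in the hunk (strip whitespace, map `NA` to `nil`); `parse_line` is a hypothetical name.

```ruby
# Hypothetical stand-alone version of the :split and :regex branches.
def parse_line(line, in_format, split_on)
  # :regex compiles the pattern; anything else uses the literal string.
  splitter = in_format == :regex ? Regexp.new(split_on) : split_on
  line.strip.split(splitter).map do |field|
    fld = field.strip
    fld == "NA" ? nil : fld # NA marks a missing value
  end
end

p parse_line("a , b,NA ,c\n", :split, ",")
p parse_line("a , b,NA ,c\n", :regex, '\s*,\s*')
p parse_line("a ,b", :split, '\s*,') # literal text "\s*," does not occur
```

With a literal splitter the pattern `\s*,` is searched for as plain text, so the last call returns the line as a single field.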
data/lib/bio-table/rdf.rb
CHANGED
@@ -38,7 +38,7 @@ module BioTable
|
|
38
38
|
#
|
39
39
|
# The method returns a String.
|
40
40
|
|
41
|
-
def RDF::row(row, header, use_blank_nodes)
|
41
|
+
def RDF::row(row, header, use_blank_nodes = false)
|
42
42
|
list = []
|
43
43
|
rowname = make_identifier(row[0])
|
44
44
|
list << ":#{rowname}"+(use_blank_nodes ? " :row [ " : " ") + "rdf:label \"#{row[0]}\" ; a :rowname"
|
data/lib/bio-table/table.rb
CHANGED
@@ -23,7 +23,7 @@ module BioTable
|
|
23
23
|
|
24
24
|
# Read lines (list/array of string) and add them to the table, setting row
|
25
25
|
# names and row fields. The first row is assumed to be the header and
|
26
|
-
# ignored if the header has been set.
|
26
|
+
# ignored if the header has been set (the case with merge/concat tables).
|
27
27
|
|
28
28
|
def read_lines lines, options = {}
|
29
29
|
table_apply = TableApply.new(options)
|
@@ -62,9 +62,13 @@ module BioTable
|
|
62
62
|
def write options = {}
|
63
63
|
format = options[:format]
|
64
64
|
format = :tab if not format
|
65
|
-
|
65
|
+
evaluate = nil
|
66
|
+
if format == :eval
|
67
|
+
evaluate = options[:evaluate]
|
68
|
+
end
|
69
|
+
formatter = FormatFactory::create(format,evaluate)
|
66
70
|
formatter.write(@header) if options[:write_header]
|
67
|
-
each do | tablerow |
|
71
|
+
each do | tablerow,num |
|
68
72
|
# p tablerow
|
69
73
|
formatter.write(tablerow.all_fields) if tablerow.all_valid?
|
70
74
|
end
|
data/lib/bio-table/table_apply.rb
CHANGED
@@ -11,12 +11,16 @@ module BioTable
|
|
11
11
|
# @logger.debug "Skipping #{@skip} lines" if @skip
|
12
12
|
@num_filter = options[:num_filter]
|
13
13
|
@logger.debug "Filtering on #{@num_filter}" if @num_filter
|
14
|
+
@filter = options[:filter]
|
15
|
+
@logger.debug "Filtering on #{@filter}" if @filter
|
14
16
|
@rewrite = options[:rewrite]
|
15
17
|
@logger.debug "Rewrite #{@rewrite}" if @rewrite
|
16
18
|
@use_columns = options[:columns]
|
17
19
|
@logger.debug "Filtering on columns #{@use_columns}" if @use_columns
|
18
20
|
@column_filter = options[:column_filter]
|
19
21
|
@logger.debug "Filtering on column names #{@column_filter}" if @column_filter
|
22
|
+
@strip_quotes = options[:strip_quotes]
|
23
|
+
@logger.debug "Strip quotes #{@strip_quotes}" if @strip_quotes
|
20
24
|
@transform_ids = options[:transform_ids]
|
21
25
|
@logger.debug "Transform ids #{@transform_ids}" if @transform_ids
|
22
26
|
@include_rownames = options[:with_rownames]
|
@@ -25,7 +29,8 @@ module BioTable
|
|
25
29
|
end
|
26
30
|
|
27
31
|
def parse_header(line, options)
|
28
|
-
header = LineParser::parse(line, options[:in_format])
|
32
|
+
header = LineParser::parse(line, options[:in_format], options[:split_on])
|
33
|
+
header = Formatter::strip_quotes(header) if @strip_quotes
|
29
34
|
return Formatter::transform_header_ids(@transform_ids, header) if @transform_ids
|
30
35
|
header
|
31
36
|
end
|
@@ -38,8 +43,9 @@ module BioTable
|
|
38
43
|
end
|
39
44
|
|
40
45
|
def parse_row(line_num, line, column_idx, last_fields, options)
|
41
|
-
fields = LineParser::parse(line, options[:in_format])
|
46
|
+
fields = LineParser::parse(line, options[:in_format], options[:split_on])
|
42
47
|
return nil,nil if fields.compact == []
|
48
|
+
fields = Formatter::strip_quotes(fields) if @strip_quotes
|
43
49
|
fields = Formatter::transform_row_ids(@transform_ids, fields) if @transform_ids
|
44
50
|
fields = Filter::apply_column_filter(fields,column_idx)
|
45
51
|
return nil,nil if fields.compact == []
|
@@ -48,6 +54,7 @@ module BioTable
|
|
48
54
|
if data_fields.size > 0
|
49
55
|
return nil,nil if not Validator::valid_row?(line_num, data_fields, last_fields)
|
50
56
|
return nil,nil if not Filter::numeric(@num_filter,data_fields)
|
57
|
+
return nil,nil if not Filter::generic(@filter,data_fields)
|
51
58
|
(rowname, data_fields) = Rewrite::rewrite(@rewrite,rowname,data_fields)
|
52
59
|
end
|
53
60
|
return rowname, data_fields
|
data/lib/bio-table/tableload.rb
CHANGED
@@ -23,6 +23,11 @@ module BioTable
|
|
23
23
|
column_index,header = table_apply.column_index(header) # we may rewrite the header
|
24
24
|
yielder.yield header,:header if options[:write_header] != false
|
25
25
|
prev_line = header[1..-1]
|
26
|
+
# When a header filter is defined, rewind the generator, note that skip won't work
|
27
|
+
# properly (FIXME)
|
28
|
+
if options[:with_headers]
|
29
|
+
generator.rewind
|
30
|
+
end
|
26
31
|
elsif line_num-skip < 0
|
27
32
|
# do nothing
|
28
33
|
else
|
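The `generator.rewind` call above makes the first line available again after it has been consumed as the header, so that `--with-headers` can treat it as a data row too. The sketch below illustrates that idea with a plain Ruby `Enumerator`. It assumes the generator behaves like an external enumerator, which the hunk does not show, and `read_table` is a hypothetical name.

```ruby
# Hypothetical illustration of re-reading the header line as data.
def read_table(lines, with_headers: false)
  generator = lines.each           # external Enumerator over the input lines
  header = generator.next          # consume the first line as the header
  generator.rewind if with_headers # start over so the header is read again
  rows = []
  loop { rows << generator.next }  # Kernel#loop stops on StopIteration
  [header, rows]
end

p read_table(["id,value", "a,1", "b,2"])
p read_table(["id,value", "a,1", "b,2"], with_headers: true)
```

Rewinding restarts from the very first line, which is consistent with the FIXME in the hunk noting that `--skip` will not work properly in this mode.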