bio-table 0.0.5 → 0.0.6

@@ -2,8 +2,8 @@ language: ruby
2
2
  rvm:
3
3
  - 1.9.2
4
4
  - 1.9.3
5
- - jruby-19mode # JRuby in 1.9 mode
6
5
  - rbx-19mode
6
+ # - jruby-19mode # JRuby in 1.9 mode
7
7
  # - 1.8.7
8
8
  # - jruby-18mode # JRuby in 1.8 mode
9
9
  # - rbx-18mode
data/Gemfile CHANGED
@@ -15,5 +15,5 @@ group :development do
15
15
  gem "jeweler", "~> 1.8.3"
16
16
  gem "bio", ">= 1.4.2"
17
17
  gem "rdoc", "~> 3.12"
18
- gem "regressiontest"
18
+ gem "regressiontest", ">= 0.0.2"
19
19
  end
data/README.md CHANGED
@@ -25,16 +25,17 @@ you don't need to know Ruby to use the command line interface (CLI).
25
25
 
26
26
  Features:
27
27
 
28
- * Support for converting TAB and CSV files
28
+ * Support for reading and writing TAB and CSV files, as well as regex splitters
29
29
  * Filter on data
30
30
  * Transform table and data by column or row
31
31
  * Recalculate data
32
32
  * Diff between tables, selecting on specific column values
33
33
  * Merge tables side by side on column value/rowname
34
34
  * Split/reduce tables by column
35
+ * Write formatted tables, e.g. HTML, LaTeX
35
36
  * Read from STDIN, write to STDOUT
36
37
  * Convert table to RDF
37
- * Convert table to JSON (nyi)
38
+ * Convert table to JSON/YAML/XML (nyi)
38
39
  * etc. etc.
39
40
 
40
41
  and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
@@ -68,14 +69,21 @@ csv, it will assume CSV. To convert the table back
68
69
  bio-table test1.tab --format csv > table1.csv
69
70
  ```
70
71
 
72
+ It is also possible to use a string or regex splitter, e.g.
73
+
74
+ ```sh
75
+ bio-table --in-format split --split-on ',' test/data/input/table_split_on.txt
76
+ bio-table --in-format regex --split-on '\s*,\s*' test/data/input/table_split_on.txt
77
+ ```
78
+
71
79
  To filter out rows that contain certain values
72
80
 
73
81
  ```sh
74
82
  bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
75
83
  ```
76
84
 
77
- The filter ignores the header row, and the row names. If you need
78
- either, use the switches --with-header and --with-rownames. With math, list all rows
85
+ The filter ignores the header row, and the row names, by default. If you need
86
+ either, use the switches --with-headers and --with-rownames. With math, list all rows
79
87
 
80
88
  ```sh
81
89
  bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
@@ -174,7 +182,7 @@ To combine tables side by side use the --merge switch:
174
182
  bio-table --merge table1.csv table2.csv
175
183
  ```
176
184
 
177
- all rownames will be matched (i.e. the input table order do not need
185
+ all rownames will be matched (i.e. the input table does not need
178
186
  to be sorted). For non-matching rownames the fields will be filled
179
187
  with NA's, unless you add a filter, e.g.
180
188
 
@@ -226,7 +234,10 @@ finds the overlapping rows, based on the content of column 2.
226
234
 
227
235
  ### Different parsers
228
236
 
229
- more soon
237
+ bio-table currently reads comma separated files and tab delimited
238
+ files.
239
+
240
+ (more soon)
230
241
 
231
242
  ### Using STDIN
232
243
 
@@ -240,6 +251,38 @@ piped in is the first input file
240
251
  will filter both files test1.tab and test1.csv and output to
241
252
  test1a.tab.
242
253
 
254
+ ### Formatted output
255
+
256
+ bio-table has built-in formatters - for CSV and TAB, and for RDF
257
+ (and soon for JSON/YAML and perhaps even XML). The RDF format is
258
+ discussed in 'Output table to RDF'.
259
+
260
+ Another flexible option for formatting a table is to create programmatic output
261
+ through a formatter. If you set the --format switch to *eval*, you
262
+ can add the -e 'command' that is evaluated to print to STDOUT. For
263
+ example, bio-table does not support HTML output directly, but if we
264
+ were to create an HTML table, we could run
265
+
266
+ ```sh
267
+ bio-table --format eval -e '"<tr><td>"+field.join("</td><td>")+"</td></tr>"' table1.csv
268
+ ```
269
+
270
+ likewise to create a LaTeX table we could
271
+
272
+ ```sh
273
+ bio-table --columns gene_symbol,gene_desc --format eval -e 'field.join(" & ")+" \\\\"' table1.csv
274
+ ```
275
+
276
+ Since fields can be accessed independently, you can add any markup for
277
+ fields, e.g.
278
+
279
+ ```sh
280
+ bio-table --columns ID,Description,Date --format eval -e'"\\emph{"+field[0]+"} & "+ field[1..-1].join(" & ")+"\\\\"' table1.csv
281
+ ```
282
+
283
+ Because of the evaluation formatter bio-table does not need to implement the machinery for
284
+ every output format on the planet!
285
+
243
286
  ### Output table to RDF
244
287
 
245
288
  bio-table can write a table into turtle RDF triples (part of the semantic
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.0.5
1
+ 0.0.6
@@ -43,6 +43,10 @@ opts = OptionParser.new do |o|
43
43
  options[:num_filter] = par
44
44
  end
45
45
 
46
+ o.on('--filter expression', 'Generic filtering function') do |par|
47
+ options[:filter] = par
48
+ end
49
+
46
50
  o.on('--rewrite expression', 'Rewrite function') do |par|
47
51
  options[:rewrite] = par
48
52
  end
@@ -81,18 +85,23 @@ opts = OptionParser.new do |o|
81
85
 
82
86
  o.separator "\n\tOverrides:\n\n"
83
87
 
84
- # o.on('--with-header','Include the header element in filtering etc.') do
85
- # options[:with_header] = true
86
- # end
87
-
88
88
  o.on('--skip lines',Integer,'Skip the first lines before parsing') do |skip|
89
89
  options[:skip] = skip
90
90
  end
91
91
 
92
+ o.on('--with-headers','Include the header element in filtering etc.') do
93
+ options[:with_headers] = true
94
+ options[:write_header] = false
95
+ end
96
+
92
97
  o.on('--with-rownames','Include the rownames in filtering etc.') do
93
98
  options[:with_rownames] = true
94
99
  end
95
100
 
101
+ o.on('--strip-quotes','Strip quotes from table fields') do
102
+ options[:strip_quotes] = true
103
+ end
104
+
96
105
  o.separator "\n\tTransform:\n\n"
97
106
 
98
107
  o.on('--transform-ids [downcase,upcase]',[:downcase,:upcase],'Transform column and row identifiers') do |par|
@@ -101,14 +110,22 @@ opts = OptionParser.new do |o|
101
110
 
102
111
  o.separator "\n\tFormat and options:\n\n"
103
112
 
104
- o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
113
+ o.on('--in-format [tab,csv,split,regex]', [:tab, :csv, :split, :regex], 'Input format (default tab)') do |par|
105
114
  options[:in_format] = par.to_sym
106
115
  end
107
116
 
108
- o.on('--format [tab,csv,rdf]', [:tab, :csv, :rdf], 'Output format (default tab)') do |par|
117
+ o.on('--format [tab,csv,rdf,eval]', [:tab, :csv, :rdf, :eval], 'Output format (default tab)') do |par|
109
118
  options[:format] = par.to_sym
110
119
  end
111
120
 
121
+ o.on("--split-on command",String,"Split on string or regex (use with --in-format)") do | s |
122
+ options[:split_on] = s
123
+ end
124
+
125
+ o.on("-e command",String,"Evaluate output command (use with --format eval)") do | s |
126
+ options[:evaluate] = s
127
+ end
128
+
112
129
  o.on('--blank-nodes','Output (RDF) blank nodes - allowing for duplicate row names') do
113
130
  options[:blank_nodes] = true
114
131
  end
@@ -203,7 +220,7 @@ writer =
203
220
  if options[:format] == :rdf
204
221
  BioTable::RDF::Writer.new(options[:blank_nodes])
205
222
  else
206
- BioTable::TableWriter::Writer.new(options[:format])
223
+ BioTable::TableWriter::Writer.new(options[:format],options[:evaluate])
207
224
  end
208
225
 
209
226
  if INPUT_ON_STDIN
@@ -33,9 +33,49 @@ Feature: Command-line interface (CLI)
33
33
  When I execute "./bin/bio-table --format rdf --transform-ids downcase"
34
34
  Then I expect the named output to match "table1-rdf1"
35
35
 
36
+ Scenario: Write HTML format
37
+ Given I have input file(s) named "test/data/input/table1.csv"
38
+ When I execute "./bin/bio-table --format eval -e '"<tr><td>"+field.join("</td><td>")+"</td></tr>"'"
39
+ Then I expect the named output to match "table1-html"
40
+
41
+ Scenario: Write LaTeX format
42
+ Given I have input file(s) named "test/data/input/table1.csv"
43
+ When I execute "./bin/bio-table --columns gene_symbol,gene_desc --format eval -e 'field.join(" & ")+" \\\\"'"
44
+ Then I expect the named output to match "table1-latex"
45
+
46
+ Scenario: Merge tables horizontally
47
+ Given I have input file(s) named "test/data/input/table1.csv"
48
+ When I execute "./bin/bio-table --merge test/data/input/table2.csv"
49
+ Then I expect the named output to match "table1-merge"
50
+
51
+ Scenario: Merge tables vertically
52
+ Given I have input file(s) named "test/data/input/table1.csv"
53
+ When I execute "./bin/bio-table test/data/input/table2.csv"
54
+ Then I expect the named output to match "table1-append"
55
+
56
+ Scenario: Diff tables
57
+ Given I have input file(s) named "test/data/input/table1.csv"
58
+ When I execute "./bin/bio-table --diff test/data/input/table2.csv"
59
+ Then I expect the named output to match "table1-diff"
60
+
36
61
  Scenario: Read from STDIN
37
62
  Given I have input file(s) named "test/data/input/table1.csv"
38
63
  When I execute "cat test/data/input/table1.csv|./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
39
64
  Then I expect the named output to match "table1-STDIN"
40
65
 
66
+ Scenario: Use special string splitter
67
+ Given I have input file(s) named "test/data/input/table_split_on.txt"
68
+ When I execute "./bin/bio-table test/data/input/table_split_on.txt --in-format split --split-on ','"
69
+ Then I expect the named output to match "table_split_on_string"
70
+
71
+ Scenario: Use special regex splitter
72
+ Given I have input file(s) named "test/data/input/table_split_on.txt"
73
+ When I execute "./bin/bio-table test/data/input/table_split_on.txt --in-format regex --split-on '\s*,'"
74
+ Then I expect the named output to match "table_split_on_regex"
75
+
76
+ Scenario: Use header in filter
77
+ Given I have input file(s) named "test/data/input/table_no_headers.txt"
78
+ When I execute "./bin/bio-table --in-format split --split-on ',' --num-filter 'values[1]!=0' --with-headers"
79
+ Then I expect the named output to match "table_filter_headers"
80
+
41
81
 
@@ -70,7 +70,7 @@ module BioTable
70
70
  def Filter::numeric code, fields
71
71
  return true if code == nil
72
72
  if fields
73
- # values = fields.map { |field| (valid_number?(field) ? field.to_f : nil ) } # FIXME: not so lazy
73
+ # values = fields.map { |field| (valid_number?(field) ? field.to_f : nil ) }
74
74
  values = LazyValues.new(fields)
75
75
  begin
76
76
  eval(code)
@@ -82,6 +82,21 @@ module BioTable
82
82
  false
83
83
  end
84
84
  end
85
+
86
+ def Filter::generic code, tablefields
87
+ return true if code == nil
88
+ if tablefields
89
+ field = tablefields.dup
90
+ begin
91
+ eval(code)
92
+ rescue Exception
93
+ $stderr.print "Failed to evaluate ",tablefields," with ",code,"\n"
94
+ raise
95
+ end
96
+ else
97
+ false
98
+ end
99
+ end
85
100
  end
86
101
 
87
102
  end
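Review note: the new `--filter` switch evaluates a Ruby expression once per row, with the row's data fields exposed as the local variable `field`. The sketch below illustrates that behaviour under stated assumptions. It is not the gem's code: `generic_filter` is a hypothetical standalone method, and it reports `tablefields` in the error message.

```ruby
# Hypothetical standalone version of the generic filter (not the gem's API).
# Returns the value of the evaluated expression; callers treat a falsy
# result as "drop this row".
def generic_filter(code, tablefields)
  return true if code.nil?        # no filter given: keep every row
  return false unless tablefields # no fields: reject the row

  field = tablefields.dup         # the name the expression refers to
  begin
    eval(code, binding)
  rescue StandardError, SyntaxError
    warn "Failed to evaluate #{tablefields.inspect} with #{code}"
    raise
  end
end

row = ["0.12", "protein kinase", "chr1"]
p generic_filter('field[1] =~ /kinase/', row) ? :keep : :drop
p generic_filter('field[2] == "chrX"', row) ? :keep : :drop
```

Because the expression is passed to `eval`, it should only come from a trusted command line.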
@@ -17,13 +17,28 @@ module BioTable
17
17
  end
18
18
  l
19
19
  end
20
+ def Formatter::strip_quotes list
21
+ list.map { |field|
22
+ if field == nil
23
+ nil
24
+ else
25
+ first = field[0,1]
26
+ if first == "\"" or first == "'"
27
+ last = field[-1,1]
28
+ if first == last
29
+ field = field[1..-2]
30
+ end
31
+ end
32
+ field
33
+ end
34
+ }
35
+ end
20
36
  end
21
37
 
22
38
  class TabFormatter
23
39
  def write list
24
40
  print list.map{|field| (field==nil ? "NA" : field)}.join("\t"),"\n"
25
41
  end
26
-
27
42
  end
28
43
 
29
44
  class CsvFormatter
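Review note: `Formatter::strip_quotes` only removes a quote pair when the first and last characters are the same quote character, and it passes `nil` fields through untouched. A minimal sketch of the same logic as a free-standing method (the name `strip_quotes_sketch` is hypothetical) makes the edge cases easier to see:

```ruby
# Hypothetical free-standing rendition of the quote-stripping logic.
def strip_quotes_sketch(list)
  list.map do |field|
    next nil if field.nil?          # missing values stay nil

    first = field[0, 1]
    if (first == '"' || first == "'") && field[-1, 1] == first
      field[1..-2]                  # drop the matching outer quotes
    else
      field                         # unquoted or unbalanced: unchanged
    end
  end
end

p strip_quotes_sketch(['"gene"', "'x'", '"open', nil, 'plain'])
```

A field with an opening quote but no matching closing quote is left as it is.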
@@ -35,9 +50,22 @@ module BioTable
35
50
  end
36
51
  end
37
52
 
53
+ class EvalFormatter
54
+ def initialize evaluate
55
+ @evaluate = evaluate
56
+ end
57
+ def write list
58
+ field = list.dup.map { |e| (e==nil ? "" : e) }
59
+ print eval(@evaluate)
60
+ print "\n"
61
+ end
62
+ end
63
+
64
+
38
65
  module FormatFactory
39
- def self.create format
66
+ def self.create format, evaluate
40
67
  # @logger.info("Formatting to #{format}")
68
+ return EvalFormatter.new(evaluate) if evaluate
41
69
  return CsvFormatter.new if format == :csv
42
70
  return TabFormatter.new
43
71
  end
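Review note: the `EvalFormatter` added here evaluates the `-e` expression for every row, with the row available as `field` and `nil` entries replaced by empty strings. The sketch below shows the idea in isolation. `EvalFormatterSketch` and its `render` method are hypothetical names, and the sketch returns the string instead of printing it so the result can be inspected.

```ruby
# Hypothetical stand-in for the eval-based formatter (not the gem's class).
class EvalFormatterSketch
  def initialize(evaluate)
    @evaluate = evaluate
  end

  # Returns the formatted row rather than printing it.
  def render(list)
    # `field` is the local name visible to the evaluated expression;
    # nil entries become empty strings.
    field = list.map { |e| e.nil? ? "" : e }
    eval(@evaluate, binding)
  end
end

html = EvalFormatterSketch.new('"<tr><td>"+field.join("</td><td>")+"</td></tr>"')
puts html.render(["gene1", nil, "0.5"])

# A quoted heredoc keeps the backslashes exactly as they would arrive
# from a single-quoted shell argument.
latex_expr = <<'EXPR'.strip
field.join(" & ")+" \\\\"
EXPR
latex = EvalFormatterSketch.new(latex_expr)
puts latex.render(["gene1", "kinase"])
```

Note that in the diff, `FormatFactory.create` returns the eval formatter whenever an `evaluate` string is present, while `Table#write` only passes one through when the format is `:eval`.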
@@ -5,9 +5,21 @@ module BioTable
5
5
  module LineParser
6
6
 
7
7
  # Converts a string into an array of string fields
8
- def LineParser::parse(line, in_format)
8
+ def LineParser::parse(line, in_format, split_on)
9
9
  if in_format == :csv
10
10
  CSV.parse(line)[0]
11
+ elsif in_format == :split
12
+ line.strip.split(split_on).map { |field|
13
+ fld = field.strip
14
+ fld = nil if fld == "NA"
15
+ fld
16
+ }
17
+ elsif in_format == :regex
18
+ line.strip.split(/#{split_on}/).map { |field|
19
+ fld = field.strip
20
+ fld = nil if fld == "NA"
21
+ fld
22
+ }
11
23
  else
12
24
  line.strip.split("\t").map { |field|
13
25
  fld = field.strip
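Review note: the `:split` and `:regex` branches differ only in whether `--split-on` is used as a literal string or compiled into a regular expression. Both strip each field and map `NA` to `nil`. A minimal sketch, assuming a hypothetical helper `split_line` that folds the two branches together:

```ruby
# Hypothetical helper combining the :split and :regex branches.
def split_line(line, in_format, split_on)
  separator = in_format == :regex ? Regexp.new(split_on) : split_on
  line.strip.split(separator).map do |field|
    fld = field.strip
    fld == "NA" ? nil : fld   # NA marks a missing value
  end
end

line = "id1 , 0.5 ,NA, gene A\n"
p split_line(line, :split, ",")
p split_line(line, :regex, '\s*,\s*')

# A regex is needed when more than one delimiter can occur.
p split_line("a;b,c", :split, ",")
p split_line("a;b,c", :regex, "[;,]")
```

Since every field is stripped after splitting, a plain `,` and the regex `\s*,\s*` give the same fields for the first input. The regex form is needed when several different delimiters have to be handled.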
@@ -38,7 +38,7 @@ module BioTable
38
38
  #
39
39
  # The method returns a String.
40
40
 
41
- def RDF::row(row, header, use_blank_nodes)
41
+ def RDF::row(row, header, use_blank_nodes = false)
42
42
  list = []
43
43
  rowname = make_identifier(row[0])
44
44
  list << ":#{rowname}"+(use_blank_nodes ? " :row [ " : " ") + "rdf:label \"#{row[0]}\" ; a :rowname"
@@ -23,7 +23,7 @@ module BioTable
23
23
 
24
24
  # Read lines (list/array of string) and add them to the table, setting row
25
25
  # names and row fields. The first row is assumed to be the header and
26
- # ignored if the header has been set.
26
+ # ignored if the header has been set (the case with merge/concat tables).
27
27
 
28
28
  def read_lines lines, options = {}
29
29
  table_apply = TableApply.new(options)
@@ -62,9 +62,13 @@ module BioTable
62
62
  def write options = {}
63
63
  format = options[:format]
64
64
  format = :tab if not format
65
- formatter = FormatFactory::create(format)
65
+ evaluate = nil
66
+ if format == :eval
67
+ evaluate = options[:evaluate]
68
+ end
69
+ formatter = FormatFactory::create(format,evaluate)
66
70
  formatter.write(@header) if options[:write_header]
67
- each do | tablerow |
71
+ each do | tablerow,num |
68
72
  # p tablerow
69
73
  formatter.write(tablerow.all_fields) if tablerow.all_valid?
70
74
  end
@@ -11,12 +11,16 @@ module BioTable
11
11
  # @logger.debug "Skipping #{@skip} lines" if @skip
12
12
  @num_filter = options[:num_filter]
13
13
  @logger.debug "Filtering on #{@num_filter}" if @num_filter
14
+ @filter = options[:filter]
15
+ @logger.debug "Filtering on #{@filter}" if @filter
14
16
  @rewrite = options[:rewrite]
15
17
  @logger.debug "Rewrite #{@rewrite}" if @rewrite
16
18
  @use_columns = options[:columns]
17
19
  @logger.debug "Filtering on columns #{@use_columns}" if @use_columns
18
20
  @column_filter = options[:column_filter]
19
21
  @logger.debug "Filtering on column names #{@column_filter}" if @column_filter
22
+ @strip_quotes = options[:strip_quotes]
23
+ @logger.debug "Strip quotes #{@strip_quotes}" if @strip_quotes
20
24
  @transform_ids = options[:transform_ids]
21
25
  @logger.debug "Transform ids #{@transform_ids}" if @transform_ids
22
26
  @include_rownames = options[:with_rownames]
@@ -25,7 +29,8 @@ module BioTable
25
29
  end
26
30
 
27
31
  def parse_header(line, options)
28
- header = LineParser::parse(line, options[:in_format])
32
+ header = LineParser::parse(line, options[:in_format], options[:split_on])
33
+ header = Formatter::strip_quotes(header) if @strip_quotes
29
34
  return Formatter::transform_header_ids(@transform_ids, header) if @transform_ids
30
35
  header
31
36
  end
@@ -38,8 +43,9 @@ module BioTable
38
43
  end
39
44
 
40
45
  def parse_row(line_num, line, column_idx, last_fields, options)
41
- fields = LineParser::parse(line, options[:in_format])
46
+ fields = LineParser::parse(line, options[:in_format], options[:split_on])
42
47
  return nil,nil if fields.compact == []
48
+ fields = Formatter::strip_quotes(fields) if @strip_quotes
43
49
  fields = Formatter::transform_row_ids(@transform_ids, fields) if @transform_ids
44
50
  fields = Filter::apply_column_filter(fields,column_idx)
45
51
  return nil,nil if fields.compact == []
@@ -48,6 +54,7 @@ module BioTable
48
54
  if data_fields.size > 0
49
55
  return nil,nil if not Validator::valid_row?(line_num, data_fields, last_fields)
50
56
  return nil,nil if not Filter::numeric(@num_filter,data_fields)
57
+ return nil,nil if not Filter::generic(@filter,data_fields)
51
58
  (rowname, data_fields) = Rewrite::rewrite(@rewrite,rowname,data_fields)
52
59
  end
53
60
  return rowname, data_fields
@@ -23,6 +23,11 @@ module BioTable
23
23
  column_index,header = table_apply.column_index(header) # we may rewrite the header
24
24
  yielder.yield header,:header if options[:write_header] != false
25
25
  prev_line = header[1..-1]
26
+ # When a header filter is defined, rewind the generator, note that skip won't work
27
+ # properly (FIXME)
28
+ if options[:with_headers]
29
+ generator.rewind
30
+ end
26
31
  elsif line_num-skip < 0
27
32
  # do nothing
28
33
  else