bio-table 0.0.5 → 0.0.6

@@ -2,8 +2,8 @@ language: ruby
2
2
  rvm:
3
3
  - 1.9.2
4
4
  - 1.9.3
5
- - jruby-19mode # JRuby in 1.9 mode
6
5
  - rbx-19mode
6
+ # - jruby-19mode # JRuby in 1.9 mode
7
7
  # - 1.8.7
8
8
  # - jruby-18mode # JRuby in 1.8 mode
9
9
  # - rbx-18mode
data/Gemfile CHANGED
@@ -15,5 +15,5 @@ group :development do
15
15
  gem "jeweler", "~> 1.8.3"
16
16
  gem "bio", ">= 1.4.2"
17
17
  gem "rdoc", "~> 3.12"
18
- gem "regressiontest"
18
+ gem "regressiontest", ">= 0.0.2"
19
19
  end
data/README.md CHANGED
@@ -25,16 +25,17 @@ you don't need to know Ruby to use the command line interface (CLI).
25
25
 
26
26
  Features:
27
27
 
28
- * Support for converting TAB and CSV files
28
+ * Support for reading and writing TAB and CSV files, as well as regex splitters
29
29
  * Filter on data
30
30
  * Transform table and data by column or row
31
31
  * Recalculate data
32
32
  * Diff between tables, selecting on specific column values
33
33
  * Merge tables side by side on column value/rowname
34
34
  * Split/reduce tables by column
35
+ * Write formatted tables, e.g. HTML, LaTeX
35
36
  * Read from STDIN, write to STDOUT
36
37
  * Convert table to RDF
37
- * Convert table to JSON (nyi)
38
+ * Convert table to JSON/YAML/XML (nyi)
38
39
  * etc. etc.
39
40
 
40
41
  and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
@@ -68,14 +69,21 @@ csv, it will assume CSV. To convert the table back
68
69
  bio-table test1.tab --format csv > table1.csv
69
70
  ```
70
71
 
72
+ It is also possible to use a string or regex splitter, e.g.
73
+
74
+ ```sh
75
+ bio-table --in-format split --split-on ',' test/data/input/table_split_on.txt
76
+ bio-table --in-format regex --split-on '\s*,\s*' test/data/input/table_split_on.txt
77
+ ```
78
+
71
79
  To filter out rows that contain certain values
72
80
 
73
81
  ```sh
74
82
  bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
75
83
  ```
76
84
 
77
- The filter ignores the header row, and the row names. If you need
78
- either, use the switches --with-header and --with-rownames. With math, list all rows
85
+ The filter ignores the header row, and the row names, by default. If you need
86
+ either, use the switches --with-headers and --with-rownames. With math, list all rows
79
87
 
80
88
  ```sh
81
89
  bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
@@ -174,7 +182,7 @@ To combine tables side by side use the --merge switch:
174
182
  bio-table --merge table1.csv table2.csv
175
183
  ```
176
184
 
177
- all rownames will be matched (i.e. the input table order do not need
185
+ all rownames will be matched (i.e. the input table does not need
178
186
  to be sorted). For non-matching rownames the fields will be filled
179
187
  with NA's, unless you add a filter, e.g.
180
188
 
@@ -226,7 +234,10 @@ finds the overlapping rows, based on the content of column 2.
226
234
 
227
235
  ### Different parsers
228
236
 
229
- more soon
237
+ bio-table currently reads comma separated files and tab delimited
238
+ files.
239
+
240
+ (more soon)
230
241
 
231
242
  ### Using STDIN
232
243
 
@@ -240,6 +251,38 @@ piped in is the first input file
240
251
  will filter both files test1.tab and test1.csv and output to
241
252
  test1a.tab.
242
253
 
254
+ ### Formatted output
255
+
256
+ bio-table has built-in formatters - for CSV and TAB, and for RDF
257
+ (and soon for JSON/YAML and perhaps even XML). The RDF format is
258
+ discussed in 'Output table to RDF'.
259
+
260
+ Another flexible option for formatting a table is to create programmatic output
261
+ through a formatter. If you set the --format switch to *eval*, you
262
+ can add the -e 'command' that is evaluated to print to STDOUT. For
263
+ example, bio-table does not support HTML output directly, but if we
264
+ were to create an HTML table, we could run
265
+
266
+ ```sh
267
+ bio-table --format eval -e '"<tr><td>"+field.join("</td><td>")+"</td></tr>"' table1.csv
268
+ ```
269
+
270
+ likewise to create a LaTeX table we could
271
+
272
+ ```sh
273
+ bio-table --columns gene_symbol,gene_desc --format eval -e 'field.join(" & ")+" \\\\"' table1.csv
274
+ ```
275
+
276
+ Since fields can be accessed independently, you can add any markup for
277
+ fields, e.g.
278
+
279
+ ```sh
280
+ bio-table --columns ID,Description,Date --format eval -e'"\\emph{"+field[0]+"} & "+ field[1..-1].join(" & ")+"\\\\"' table1.csv
281
+ ```
282
+
283
+ Because of the evaluation formatter bio-table does not need to implement the machinery for
284
+ every output format on the planet!
285
+
243
286
  ### Output table to RDF
244
287
 
245
288
  bio-table can write a table into turtle RDF triples (part of the semantic
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.0.5
1
+ 0.0.6
@@ -43,6 +43,10 @@ opts = OptionParser.new do |o|
43
43
  options[:num_filter] = par
44
44
  end
45
45
 
46
+ o.on('--filter expression', 'Generic filtering function') do |par|
47
+ options[:filter] = par
48
+ end
49
+
46
50
  o.on('--rewrite expression', 'Rewrite function') do |par|
47
51
  options[:rewrite] = par
48
52
  end
@@ -81,18 +85,23 @@ opts = OptionParser.new do |o|
81
85
 
82
86
  o.separator "\n\tOverrides:\n\n"
83
87
 
84
- # o.on('--with-header','Include the header element in filtering etc.') do
85
- # options[:with_header] = true
86
- # end
87
-
88
88
  o.on('--skip lines',Integer,'Skip the first lines before parsing') do |skip|
89
89
  options[:skip] = skip
90
90
  end
91
91
 
92
+ o.on('--with-headers','Include the header element in filtering etc.') do
93
+ options[:with_headers] = true
94
+ options[:write_header] = false
95
+ end
96
+
92
97
  o.on('--with-rownames','Include the rownames in filtering etc.') do
93
98
  options[:with_rownames] = true
94
99
  end
95
100
 
101
+ o.on('--strip-quotes','Strip quotes from table fields') do
102
+ options[:strip_quotes] = true
103
+ end
104
+
96
105
  o.separator "\n\tTransform:\n\n"
97
106
 
98
107
  o.on('--transform-ids [downcase,upcase]',[:downcase,:upcase],'Transform column and row identifiers') do |par|
@@ -101,14 +110,22 @@ opts = OptionParser.new do |o|
101
110
 
102
111
  o.separator "\n\tFormat and options:\n\n"
103
112
 
104
- o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
113
+ o.on('--in-format [tab,csv,split,regex]', [:tab, :csv, :split, :regex], 'Input format (default tab)') do |par|
105
114
  options[:in_format] = par.to_sym
106
115
  end
107
116
 
108
- o.on('--format [tab,csv,rdf]', [:tab, :csv, :rdf], 'Output format (default tab)') do |par|
117
+ o.on('--format [tab,csv,rdf,eval]', [:tab, :csv, :rdf, :eval], 'Output format (default tab)') do |par|
109
118
  options[:format] = par.to_sym
110
119
  end
111
120
 
121
+ o.on("--split-on command",String,"Split on string or regex (use with --in-format)") do | s |
122
+ options[:split_on] = s
123
+ end
124
+
125
+ o.on("-e command",String,"Evaluate output command (use with --format eval)") do | s |
126
+ options[:evaluate] = s
127
+ end
128
+
112
129
  o.on('--blank-nodes','Output (RDF) blank nodes - allowing for duplicate row names') do
113
130
  options[:blank_nodes] = true
114
131
  end
@@ -203,7 +220,7 @@ writer =
203
220
  if options[:format] == :rdf
204
221
  BioTable::RDF::Writer.new(options[:blank_nodes])
205
222
  else
206
- BioTable::TableWriter::Writer.new(options[:format])
223
+ BioTable::TableWriter::Writer.new(options[:format],options[:evaluate])
207
224
  end
208
225
 
209
226
  if INPUT_ON_STDIN
@@ -33,9 +33,49 @@ Feature: Command-line interface (CLI)
33
33
  When I execute "./bin/bio-table --format rdf --transform-ids downcase"
34
34
  Then I expect the named output to match "table1-rdf1"
35
35
 
36
+ Scenario: Write HTML format
37
+ Given I have input file(s) named "test/data/input/table1.csv"
38
+ When I execute "./bin/bio-table --format eval -e '"<tr><td>"+field.join("</td><td>")+"</td></tr>"'"
39
+ Then I expect the named output to match "table1-html"
40
+
41
+ Scenario: Write LaTeX format
42
+ Given I have input file(s) named "test/data/input/table1.csv"
43
+ When I execute "./bin/bio-table --columns gene_symbol,gene_desc --format eval -e 'field.join(" & ")+" \\\\"'"
44
+ Then I expect the named output to match "table1-latex"
45
+
46
+ Scenario: Merge tables horizontally
47
+ Given I have input file(s) named "test/data/input/table1.csv"
48
+ When I execute "./bin/bio-table --merge test/data/input/table2.csv"
49
+ Then I expect the named output to match "table1-merge"
50
+
51
+ Scenario: Merge tables vertically
52
+ Given I have input file(s) named "test/data/input/table1.csv"
53
+ When I execute "./bin/bio-table test/data/input/table2.csv"
54
+ Then I expect the named output to match "table1-append"
55
+
56
+ Scenario: Diff tables
57
+ Given I have input file(s) named "test/data/input/table1.csv"
58
+ When I execute "./bin/bio-table --diff test/data/input/table2.csv"
59
+ Then I expect the named output to match "table1-diff"
60
+
36
61
  Scenario: Read from STDIN
37
62
  Given I have input file(s) named "test/data/input/table1.csv"
38
63
  When I execute "cat test/data/input/table1.csv|./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
39
64
  Then I expect the named output to match "table1-STDIN"
40
65
 
66
+ Scenario: Use special string splitter
67
+ Given I have input file(s) named "test/data/input/table_split_on.txt"
68
+ When I execute "./bin/bio-table test/data/input/table_split_on.txt --in-format split --split-on ','"
69
+ Then I expect the named output to match "table_split_on_string"
70
+
71
+ Scenario: Use special regex splitter
72
+ Given I have input file(s) named "test/data/input/table_split_on.txt"
73
+ When I execute "./bin/bio-table test/data/input/table_split_on.txt --in-format regex --split-on '\s*,'"
74
+ Then I expect the named output to match "table_split_on_regex"
75
+
76
+ Scenario: Use header in filter
77
+ Given I have input file(s) named "test/data/input/table_no_headers.txt"
78
+ When I execute "./bin/bio-table --in-format split --split-on ',' --num-filter 'values[1]!=0' --with-headers"
79
+ Then I expect the named output to match "table_filter_headers"
80
+
41
81
 
@@ -70,7 +70,7 @@ module BioTable
70
70
  def Filter::numeric code, fields
71
71
  return true if code == nil
72
72
  if fields
73
- # values = fields.map { |field| (valid_number?(field) ? field.to_f : nil ) } # FIXME: not so lazy
73
+ # values = fields.map { |field| (valid_number?(field) ? field.to_f : nil ) }
74
74
  values = LazyValues.new(fields)
75
75
  begin
76
76
  eval(code)
@@ -82,6 +82,21 @@ module BioTable
82
82
  false
83
83
  end
84
84
  end
85
+
86
+ def Filter::generic code, tablefields
87
+ return true if code == nil
88
+ if tablefields
89
+ field = tablefields.dup
90
+ begin
91
+ eval(code)
92
+ rescue Exception
93
+ $stderr.print "Failed to evaluate ",fields," with ",code,"\n"
94
+ raise
95
+ end
96
+ else
97
+ false
98
+ end
99
+ end
85
100
  end
86
101
 
87
102
  end
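
The `--filter` hunk above evaluates a user-supplied Ruby expression against each row, exposing the row as a local named `field`. A minimal sketch of that idea follows. `SketchFilter` is a hypothetical stand-in, not the gem's module. It coerces the result to a boolean and reports the row itself on failure, both of which are assumptions of this sketch.

```ruby
# Hypothetical stand-in for the generic filter in the hunk above; not the gem's code.
module SketchFilter
  # Returns true when the row should be kept.
  def self.generic(code, tablefields)
    return true if code.nil?         # no filter given: keep every row
    return false unless tablefields  # no row data: drop
    field = tablefields.dup          # the expression sees the row as `field`
    begin
      # Coerce to a strict boolean (assumption; the diff returns the raw eval result).
      eval(code, binding) ? true : false
    rescue StandardError, SyntaxError
      $stderr.print "Failed to evaluate ", tablefields.inspect, " with ", code, "\n"
      raise
    end
  end
end

rows = [["0.01", "geneA"], ["0.20", "geneB"]]
kept = rows.select { |r| SketchFilter.generic('field[0].to_f < 0.05', r) }
```

Because the expression is passed to `eval`, it should only ever come from a trusted command line.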
@@ -17,13 +17,28 @@ module BioTable
17
17
  end
18
18
  l
19
19
  end
20
+ def Formatter::strip_quotes list
21
+ list.map { |field|
22
+ if field == nil
23
+ nil
24
+ else
25
+ first = field[0,1]
26
+ if first == "\"" or first == "'"
27
+ last = field[-1,1]
28
+ if first == last
29
+ field = field[1..-2]
30
+ end
31
+ end
32
+ field
33
+ end
34
+ }
35
+ end
20
36
  end
21
37
 
22
38
  class TabFormatter
23
39
  def write list
24
40
  print list.map{|field| (field==nil ? "NA" : field)}.join("\t"),"\n"
25
41
  end
26
-
27
42
  end
28
43
 
29
44
  class CsvFormatter
@@ -35,9 +50,22 @@ module BioTable
35
50
  end
36
51
  end
37
52
 
53
+ class EvalFormatter
54
+ def initialize evaluate
55
+ @evaluate = evaluate
56
+ end
57
+ def write list
58
+ field = list.dup.map { |e| (e==nil ? "" : e) }
59
+ print eval(@evaluate)
60
+ print "\n"
61
+ end
62
+ end
63
+
64
+
38
65
  module FormatFactory
39
- def self.create format
66
+ def self.create format, evaluate
40
67
  # @logger.info("Formatting to #{format}")
68
+ return EvalFormatter.new(evaluate) if evaluate
41
69
  return CsvFormatter.new if format == :csv
42
70
  return TabFormatter.new
43
71
  end
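
The `EvalFormatter` added in this hunk turns each row into a local `field` array (with `nil` replaced by an empty string) and prints whatever the `-e` expression returns. The sketch below illustrates that mechanism under stated assumptions: `SketchEvalFormatter` and its `render` method are hypothetical names, and the method returns the string instead of printing it so the result can be inspected.

```ruby
# Hypothetical class mirroring the idea of the EvalFormatter in the hunk above.
class SketchEvalFormatter
  def initialize(evaluate)
    @evaluate = evaluate
  end

  # Returns the formatted line (the diff's formatter prints it instead).
  def render(list)
    field = list.map { |e| e.nil? ? "" : e }  # nil fields become empty strings
    eval(@evaluate, binding).to_s             # the expression refers to `field`
  end
end

html = SketchEvalFormatter.new('"<tr><td>"+field.join("</td><td>")+"</td></tr>"')
html_row = html.render(["gene1", nil, "0.5"])
```

With this input, `html_row` holds one table row in which the missing value appears as an empty cell.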
@@ -5,9 +5,21 @@ module BioTable
5
5
  module LineParser
6
6
 
7
7
  # Converts a string into an array of string fields
8
- def LineParser::parse(line, in_format)
8
+ def LineParser::parse(line, in_format, split_on)
9
9
  if in_format == :csv
10
10
  CSV.parse(line)[0]
11
+ elsif in_format == :split
12
+ line.strip.split(split_on).map { |field|
13
+ fld = field.strip
14
+ fld = nil if fld == "NA"
15
+ fld
16
+ }
17
+ elsif in_format == :regex
18
+ line.strip.split(/#{split_on}/).map { |field|
19
+ fld = field.strip
20
+ fld = nil if fld == "NA"
21
+ fld
22
+ }
11
23
  else
12
24
  line.strip.split("\t").map { |field|
13
25
  fld = field.strip
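
The two new branches in this hunk differ only in how the separator is built: `:split` passes the string to `String#split` as-is, while `:regex` interpolates it into a regular expression first. A small sketch of the combined behaviour, where `split_line` is a hypothetical helper and not part of bio-table:

```ruby
# Hypothetical helper; the name split_line is not part of bio-table.
def split_line(line, in_format, split_on)
  # :regex compiles the separator, anything else uses it as a literal string.
  separator = in_format == :regex ? Regexp.new(split_on) : split_on
  line.strip.split(separator).map do |field|
    fld = field.strip
    fld == "NA" ? nil : fld  # "NA" marks a missing value
  end
end

by_string = split_line("id , value,NA\n", :split, ",")
by_regex  = split_line("id , value,NA\n", :regex, '\s*,\s*')
```

Both calls yield the same fields here, because each field is stripped after splitting. The regex form matters when the separator itself varies, for example a mix of commas and semicolons.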
@@ -38,7 +38,7 @@ module BioTable
38
38
  #
39
39
  # The method returns a String.
40
40
 
41
- def RDF::row(row, header, use_blank_nodes)
41
+ def RDF::row(row, header, use_blank_nodes = false)
42
42
  list = []
43
43
  rowname = make_identifier(row[0])
44
44
  list << ":#{rowname}"+(use_blank_nodes ? " :row [ " : " ") + "rdf:label \"#{row[0]}\" ; a :rowname"
@@ -23,7 +23,7 @@ module BioTable
23
23
 
24
24
  # Read lines (list/array of string) and add them to the table, setting row
25
25
  # names and row fields. The first row is assumed to be the header and
26
- # ignored if the header has been set.
26
+ # ignored if the header has been set (the case with merge/concat tables).
27
27
 
28
28
  def read_lines lines, options = {}
29
29
  table_apply = TableApply.new(options)
@@ -62,9 +62,13 @@ module BioTable
62
62
  def write options = {}
63
63
  format = options[:format]
64
64
  format = :tab if not format
65
- formatter = FormatFactory::create(format)
65
+ evaluate = nil
66
+ if format == :eval
67
+ evaluate = options[:evaluate]
68
+ end
69
+ formatter = FormatFactory::create(format,evaluate)
66
70
  formatter.write(@header) if options[:write_header]
67
- each do | tablerow |
71
+ each do | tablerow,num |
68
72
  # p tablerow
69
73
  formatter.write(tablerow.all_fields) if tablerow.all_valid?
70
74
  end
@@ -11,12 +11,16 @@ module BioTable
11
11
  # @logger.debug "Skipping #{@skip} lines" if @skip
12
12
  @num_filter = options[:num_filter]
13
13
  @logger.debug "Filtering on #{@num_filter}" if @num_filter
14
+ @filter = options[:filter]
15
+ @logger.debug "Filtering on #{@filter}" if @filter
14
16
  @rewrite = options[:rewrite]
15
17
  @logger.debug "Rewrite #{@rewrite}" if @rewrite
16
18
  @use_columns = options[:columns]
17
19
  @logger.debug "Filtering on columns #{@use_columns}" if @use_columns
18
20
  @column_filter = options[:column_filter]
19
21
  @logger.debug "Filtering on column names #{@column_filter}" if @column_filter
22
+ @strip_quotes = options[:strip_quotes]
23
+ @logger.debug "Strip quotes #{@strip_quotes}" if @strip_quotes
20
24
  @transform_ids = options[:transform_ids]
21
25
  @logger.debug "Transform ids #{@transform_ids}" if @transform_ids
22
26
  @include_rownames = options[:with_rownames]
@@ -25,7 +29,8 @@ module BioTable
25
29
  end
26
30
 
27
31
  def parse_header(line, options)
28
- header = LineParser::parse(line, options[:in_format])
32
+ header = LineParser::parse(line, options[:in_format], options[:split_on])
33
+ header = Formatter::strip_quotes(header) if @strip_quotes
29
34
  return Formatter::transform_header_ids(@transform_ids, header) if @transform_ids
30
35
  header
31
36
  end
@@ -38,8 +43,9 @@ module BioTable
38
43
  end
39
44
 
40
45
  def parse_row(line_num, line, column_idx, last_fields, options)
41
- fields = LineParser::parse(line, options[:in_format])
46
+ fields = LineParser::parse(line, options[:in_format], options[:split_on])
42
47
  return nil,nil if fields.compact == []
48
+ fields = Formatter::strip_quotes(fields) if @strip_quotes
43
49
  fields = Formatter::transform_row_ids(@transform_ids, fields) if @transform_ids
44
50
  fields = Filter::apply_column_filter(fields,column_idx)
45
51
  return nil,nil if fields.compact == []
@@ -48,6 +54,7 @@ module BioTable
48
54
  if data_fields.size > 0
49
55
  return nil,nil if not Validator::valid_row?(line_num, data_fields, last_fields)
50
56
  return nil,nil if not Filter::numeric(@num_filter,data_fields)
57
+ return nil,nil if not Filter::generic(@filter,data_fields)
51
58
  (rowname, data_fields) = Rewrite::rewrite(@rewrite,rowname,data_fields)
52
59
  end
53
60
  return rowname, data_fields
@@ -23,6 +23,11 @@ module BioTable
23
23
  column_index,header = table_apply.column_index(header) # we may rewrite the header
24
24
  yielder.yield header,:header if options[:write_header] != false
25
25
  prev_line = header[1..-1]
26
+ # When a header filter is defined, rewind the generator, note that skip won't work
27
+ # properly (FIXME)
28
+ if options[:with_headers]
29
+ generator.rewind
30
+ end
26
31
  elsif line_num-skip < 0
27
32
  # do nothing
28
33
  else