bio-table 0.0.4 → 0.0.5

data/README.md CHANGED
@@ -33,13 +33,13 @@ Features:
  * Merge tables side by side on column value/rowname
  * Split/reduce tables by column
  * Read from STDIN, write to STDOUT
+ * Convert table to RDF
  * Convert table to JSON (nyi)
- * Convert table to RDF (nyi)
  * etc. etc.

  and bio-table is pretty fast. To convert a 3Mb file of 18670 rows
- takes 0.96 second. Adding a filter makes it parse at 1.01 second on
- my 3.2 GHz desktop.
+ takes 0.87 seconds. Adding a filter makes it parse in 0.95 seconds on
+ my 3.2 GHz desktop (with a preloaded disk cache).

  Note: this software is under active development, though what is
  documented here should just work.
@@ -57,40 +57,40 @@ documented here should just work.
  Tables can be transformed through the command line. To transform a
  comma separated file to a tab delimited one

- ```
+ ```sh
  bio-table test/data/input/table1.csv --in-format csv --format tab > test1.tab
  ```

  Tab is actually the general default. Still, if the file name ends in
  csv, it will assume CSV. To convert the table back

- ```
+ ```sh
  bio-table test1.tab --format csv > table1.csv
  ```

  To filter out rows that contain certain values

- ```
+ ```sh
  bio-table test/data/input/table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
  ```

  The filter ignores the header row, and the row names. If you need
  either, use the switches --with-header and --with-rownames.
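For example, to make the row names part of the row seen by the filter (a sketch reusing the numeric filter from above):

```sh
# --with-rownames includes the rowname in filtering
bio-table test/data/input/table1.csv --with-rownames --num-filter "values[3] <= 0.05" > test1a.tab
```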
  With math, list all rows

- ```
+ ```sh
  bio-table test/data/input/table1.csv --num-filter "values[3]-values[6] >= 0.05" > test1a.tab
  ```

  or, list all rows that have at least one field with values >= 1000.0

- ```
+ ```sh
  bio-table test/data/input/table1.csv --num-filter "values.max >= 1000.0" > test1a.tab
  ```

  Produce all rows that have at least 3 values above 3.0 and one value
  above 10.0:

- ```
+ ```sh
  bio-table test/data/input/table1.csv --num-filter "values.max >= 10.0 and values.count{|x| x>=3.0} > 3"
  ```

@@ -100,7 +100,7 @@ The --num-filter will convert fields lazily to numerical values (only
  valid numbers are converted). If there are NA (nil) values in the table, you
  may wish to remove them, like this

- ```
+ ```sh
  bio-table test/data/input/table1.csv --num-filter "values[0..12].compact.max >= 1000.0" > test1a.tab
  ```

@@ -109,27 +109,27 @@ which takes the first 13 fields and compact removes the nil values.
  String comparisons and regular expressions can also be used. E.g.
  filter on rownames and a row field both containing 'BGT'

- ```
+ ```sh
  # not yet implemented
  bio-table test/data/input/table1.csv --filter "rowname =~ /BGT/ and field[1] =~ /BGT/" > test1a.tab
  ```

  To reorder/reduce table columns by name

- ```
+ ```sh
  bio-table test/data/input/table1.csv --columns AJ,B6,Axb1,Axb4,AXB13,Axb15,Axb19 > test1a.tab
  ```

  or use their index numbers (the first column is zero)

- ```
+ ```sh
  bio-table test/data/input/table1.csv --columns 0,1,8,2,4,6 > test1a.tab
  ```


  To filter for columns using a regular expression

- ```
+ ```sh
  bio-table table1.csv --column-filter 'colname !~ /infected/i'
  ```

@@ -139,7 +139,7 @@ case.
  Finally we can rewrite the content of a table using rowname and fields
  again

- ```
+ ```sh
  bio-table table1.csv --rewrite 'rowname.upcase!; field[1]=nil if field[2].to_f<0.25'
  ```

@@ -150,7 +150,7 @@ empty if the third field is below 0.25.

  To sort a table on columns 4 and 2

- ```
+ ```sh
  # not yet implemented
  bio-table test/data/input/table1.csv --sort 4,2 > test1a.tab
  ```
@@ -161,20 +161,26 @@ Note: not all is implemented (just yet). Please check bio-table --help first.

  You can combine/concat two or more tables by passing in multiple file names

+ ```sh
  bio-table test/data/input/table1.csv test/data/input/table2.csv
+ ```

  this will append table2 to table1, assuming they have the same headers
  (you can use the --columns switch!)

  To combine tables side by side use the --merge switch:

+ ```sh
  bio-table --merge table1.csv table2.csv
+ ```

  all rownames will be matched (i.e. the input tables do not need
  to be sorted). For non-matching rownames the fields will be filled
  with NA's, unless you add a filter, e.g.

+ ```sh
  bio-table --merge table1.csv table2.csv --num-filter "values.compact.size == values.size"
+ ```

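The filter expression is plain Ruby evaluated against the merged row, so its effect is easy to check in irb (here values stands for the row's fields, as in the filters above):

```ruby
values = [0.1, nil, 0.3]            # a merged row with one NA (nil) field
values.compact.size == values.size  # => false, so the row is dropped
```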
  ### Splitting a table

@@ -188,24 +194,32 @@ overlap, based on shared columns. The bio-table diff command shows the
  difference between two tables using the row names (i.e. those rows
  with rownames that appear in table2, but not in table1)

+ ```sh
  bio-table --diff 0 table1.csv table2.csv
+ ```

  bio-table --diff is different from the standard Unix diff tool. The
  latter shows insertions and deletions. bio-table --diff shows what is
  in one file, and not in the other (insertions). To see deletions,
  reverse the file order, i.e. switch the file names

+ ```sh
  bio-table --diff 0 table2.csv table1.csv
+ ```

  To diff on something else

+ ```sh
  bio-table --diff 0,3 table2.csv table1.csv
+ ```

  creates a key using columns 0 and 3 (0 is the rownames column).

  Similarly

+ ```sh
  bio-table --overlap 2 table1.csv table2.csv
+ ```

  finds the overlapping rows, based on the content of column 2.

@@ -219,14 +233,55 @@ more soon
  bio-table can read data from STDIN, by simply assuming that the data
  piped in is the first input file

- ```
+ ```sh
  cat test1.tab | bio-table table1.csv --num-filter "values[3] <= 0.05" > test1a.tab
  ```

  will filter both files test1.tab and table1.csv and output to
  test1a.tab.

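And because output goes to STDOUT, bio-table commands also chain in a shell pipeline (a sketch combining switches shown earlier):

```sh
# convert CSV to tab, then filter in a second bio-table pass
bio-table table1.csv --format tab | bio-table --num-filter "values.max >= 1000.0" > filtered.tab
```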
- ## bio-table API (for Ruby programming)
+ ### Output table to RDF
+
+ bio-table can write a table as Turtle RDF triples (part of the semantic
+ web!), so you can load the data directly into a triple store.
+
+ ```sh
+ bio-table --format rdf table1.csv
+ ```
+
+ Each table header is stored with type :colname, using the header
+ value both as subject and label, together with its column :index:
+
+ ```rdf
+ :header3 rdf:label "Header3" ; a :colname ; :index 4 .
+ ```
+
+ Rows are stored with the rowname as subject and label, followed by
+ predicates referring to the header triples, with the field values. E.g.
+
+ ```rdf
+ :row13475701 rdf:label "row13475701" ; a :rowname ; :Id "row13475701" ; :header1 "1" ; :header2 "0" ; :header3 "3" .
+ ```
+
+ To unify identifier names you may want to transform ids:
+
+ ```sh
+ bio-table --format rdf --transform-ids "downcase" table1.csv
+ ```
+
+ Another interesting option is --blank-nodes. This causes rows to be
+ written as blank nodes, which allows for duplicate row names. E.g.
+
+ ```rdf
+ :row13475701 [ rdf:label "row13475701" ; a :rowname ; :Id "row13475701" ; :header1 "1" ; :header2 "0" ; :header3 "3" ] .
+ ```
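This form is produced by adding the --blank-nodes switch to the RDF output:

```sh
bio-table --format rdf --blank-nodes table1.csv
```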
+
+ The bio-rdf gem actually uses this bio-table biogem to parse data into a
+ triple store and query the data through SPARQL. For examples see the
+ features, e.g. the
+ [genotype to RDF feature](https://github.com/pjotrp/bioruby-rdf/blob/master/features/genotype-table-to-rdf.feature).
+
+
+ ## bio-table API (for Ruby programmers)

  ```ruby
  require 'bio-table'
@@ -315,17 +370,25 @@ file twice, but being able to handle much larger data.

  In the above examples we loaded the whole table in memory. It is also
  possible to execute functions without using RAM by using the emit
- function. This is what the bio-table CLI does:
+ function. This is what the bio-table CLI does to convert a CSV table
+ to tab delimited:

  ```ruby
  ARGV.each do | fn |
-   BioTable::TableLoader.emit(f, options).each do |row|
-     writer.write(TableRow.new(row[0],row[1..-1]))
+   f = File.open(fn)
+   writer = BioTable::TableWriter::Writer.new(:tab)
+   BioTable::TableLoader.emit(f, in_format: :csv).each do |row,type|
+     writer.write(TableRow.new(row[0],row[1..-1]),type)
    end
  end
-
  ```

+ Essentially you can pass in any object that responds to *each* and
+ yields rows as Strings (a File's each method reads one line at a
+ time). The emit function yields each parsed row as a simple array of
+ String fields, together with a type that distinguishes the header row
+ from data rows.
+

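For instance, since any each-able object will do, an Array of lines can stand in for a file (a minimal sketch along the lines of the explanation above):

```ruby
require 'bio-table'

# An in-memory "file": Array#each yields one CSV line at a time.
lines = ["id,header1,header2",
         "row1,0.1,0.2",
         "row2,0.3,0.4"]

BioTable::TableLoader.emit(lines, in_format: :csv).each do |row, type|
  p [type, row]  # type marks the header row; row is an array of String fields
end
```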
  ### Loading a numerical matrix
  Coming soon
data/VERSION CHANGED
@@ -1 +1 @@
- 0.0.4
+ 0.0.5
data/bin/bio-table CHANGED
@@ -33,19 +33,12 @@ log = Bio::Log::LoggerPlus.new 'bio-table'
  Bio::Log::CLI.logger('stderr')
  Bio::Log::CLI.trace('info')

- options = {show_help: false, write_header: true}
+ options = {show_help: false, write_header: true, skip: 0}
  options[:show_help] = true if ARGV.size == 0 and not INPUT_ON_STDIN
  opts = OptionParser.new do |o|
    o.banner = "Usage: #{File.basename($0)} [options] filename\n\n"

-   o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
-     options[:in_format] = par.to_sym
-   end
-
-   o.on('--format [tab,csv]', [:tab, :csv], 'Output format (default tab)') do |par|
-     options[:format] = par.to_sym
-   end
-
+
    o.on('--num-filter expression', 'Numeric filtering function') do |par|
      options[:num_filter] = par
    end
@@ -82,21 +75,50 @@ opts = OptionParser.new do |o|
      options[:overlap] = l
    end

-
+   o.on('--merge','Merge tables by rowname') do
+     options[:merge] = true
+   end
+
+   o.separator "\n\tOverrides:\n\n"
+
    # o.on('--with-header','Include the header element in filtering etc.') do
    #   options[:with_header] = true
    # end
-
+
+   o.on('--skip lines',Integer,'Skip the first lines before parsing') do |skip|
+     options[:skip] = skip
+   end
+
    o.on('--with-rownames','Include the rownames in filtering etc.') do
      options[:with_rownames] = true
    end
+
+   o.separator "\n\tTransform:\n\n"
+
+   o.on('--transform-ids [downcase,upcase]',[:downcase,:upcase],'Transform column and row identifiers') do |par|
+     options[:transform_ids] = par.to_sym
+   end
+
+   o.separator "\n\tFormat and options:\n\n"

-   o.separator ""
+   o.on('--in-format [tab,csv]', [:tab, :csv], 'Input format (default tab)') do |par|
+     options[:in_format] = par.to_sym
+   end

+   o.on('--format [tab,csv,rdf]', [:tab, :csv, :rdf], 'Output format (default tab)') do |par|
+     options[:format] = par.to_sym
+   end
+
+   o.on('--blank-nodes','Output (RDF) blank nodes - allowing for duplicate row names') do
+     options[:blank_nodes] = true
+   end
+
+   o.separator "\n\tVerbosity:\n\n"
+
    o.on("--logger filename",String,"Log to file (default stderr)") do | name |
      Bio::Log::CLI.logger(name)
    end
-
+
    o.on("--trace options",String,"Set log level (default INFO, see bio-logger)") do | s |
      Bio::Log::CLI.trace(s)
    end
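As an aside, the new --skip switch defined above can be exercised like this (a sketch):

```sh
# drop the first two lines of the file before parsing starts
bio-table --skip 2 table1.tab
```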
@@ -177,12 +199,17 @@ end
  # http://eric.lubow.org/2010/ruby/multiple-input-locations-from-bash-into-ruby/
  #

- writer = BioTable::TableWriter::Writer.new(options[:format])
+ writer =
+   if options[:format] == :rdf
+     BioTable::RDF::Writer.new(options[:blank_nodes])
+   else
+     BioTable::TableWriter::Writer.new(options[:format])
+   end

  if INPUT_ON_STDIN
    opts = options.dup # so we can modify options
-   BioTable::TableLoader.emit(STDIN, opts).each do |row|
-     writer.write(TableRow.new(row[0],row[1..-1]))
+   BioTable::TableLoader.emit(STDIN, opts).each do |row, type|
+     writer.write(TableRow.new(row[0],row[1..-1]),type)
    end
    options[:write_header] = false # don't write the header for chained files
  end
@@ -194,8 +221,8 @@ ARGV.each do | fn |
      logger.debug "Autodetected CSV file"
      opts[:in_format] = :csv
    end
-   BioTable::TableLoader.emit(f, opts).each do |row|
-     writer.write(TableRow.new(row[0],row[1..-1]))
+   BioTable::TableLoader.emit(f, opts).each do |row,type|
+     writer.write(TableRow.new(row[0],row[1..-1]),type)
    end
    options[:write_header] = false # don't write the header for chained files
  end
@@ -1,7 +1,7 @@
  @cli
  Feature: Command-line interface (CLI)

- bio-table has a powerful comman line interface. Here we regression test features.
+ bio-table has a powerful command line interface. Here we regression test features.

  Scenario: Test the numerical filter by column values
    Given I have input file(s) named "test/data/input/table1.csv"
@@ -28,4 +28,14 @@ Feature: Command-line interface (CLI)
    When I execute "./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
    Then I expect the named output to match "table1-rewrite-rownames"

+ Scenario: Write RDF format
+   Given I have input file(s) named "test/data/input/table1.csv"
+   When I execute "./bin/bio-table --format rdf --transform-ids downcase"
+   Then I expect the named output to match "table1-rdf1"
+
+ Scenario: Read from STDIN
+   Given I have input file(s) named "test/data/input/table1.csv"
+   When I execute "cat test/data/input/table1.csv|./bin/bio-table test/data/input/table1.csv --rewrite 'rowname = field[2]; field[1]=nil if field[2].to_f<0.25'"
+   Then I expect the named output to match "table1-STDIN"
+

data/lib/bio-table.rb CHANGED
@@ -25,4 +25,5 @@ require 'bio-table/table_apply.rb'
  require 'bio-table/diff.rb'
  require 'bio-table/overlap.rb'
  require 'bio-table/merge.rb'
+ require 'bio-table/rdf.rb'

@@ -63,7 +63,8 @@ module BioTable
    end

    def Filter::valid_number?(s)
-     s.to_s.match(/\A[+-]?\d+?(\.\d+)?\Z/) == nil ? false : true
+     # s.to_s.match(/\A[+-]?\d+?(\.\d+)?\Z/) == nil ? false : true
+     begin Float(s) ; true end rescue false
    end

    def Filter::numeric code, fields
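As a side note, the switch from the regexp to Ruby's Float() also broadens what parses as a number, e.g. scientific notation now passes (a quick irb check, not part of the diff):

```ruby
# Float() accepts scientific notation, which the old regexp rejected;
# for anything unparsable it raises, which the rescue turns into false.
Float("1e-3")                              # => 0.001
begin Float("NA") ; true end rescue false  # => false
```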
@@ -1,5 +1,24 @@
  module BioTable

+   module Formatter
+     def Formatter::transform_header_ids modify, list
+       l = list.dup
+       case modify
+       when :downcase then l.map { |h| h.downcase }
+       when :upcase then l.map { |h| h.upcase }
+       else l
+       end
+     end
+     def Formatter::transform_row_ids modify, list
+       l = list.dup
+       case modify
+       when :downcase then l[0] = l[0].downcase # dup is shallow: downcase! would mutate the caller's string
+       when :upcase then l[0] = l[0].upcase
+       end
+       l
+     end
+   end
+
    class TabFormatter
      def write list
        print list.map{|field| (field==nil ? "NA" : field)}.join("\t"),"\n"
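For reference, the new helpers can also be called standalone (a quick sketch, assuming the module as defined above):

```ruby
BioTable::Formatter.transform_header_ids(:downcase, ["AJ", "B6"])
# => ["aj", "b6"]
BioTable::Formatter.transform_row_ids(:upcase, ["row1", "0.5"])
# => ["ROW1", "0.5"]
```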
@@ -8,7 +27,6 @@ module BioTable
      end

    class CsvFormatter
-
      def write list
        csv_string = CSV.generate do |csv|
          csv << list