dreader 0.4.1 → 0.4.2
- checksums.yaml +4 -4
- data/Changelog.org +11 -0
- data/Gemfile.lock +1 -1
- data/README.md +71 -10
- data/examples/wikipedia_us_cities/us_cities_bulk_declare.rb +85 -0
- data/lib/dreader/version.rb +1 -1
- data/lib/dreader.rb +125 -23
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 6d616b9bad780960c105a2c754e392ecf51c0cdaeeb6076c11b4042d7c40e414
|
4
|
+
data.tar.gz: 656507d726a22a24111fc239c0f640dedf90321aab2bcda74a73dc0cf730a83d
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 9ee4f8c9367864ef01aea8d75240a3a93a7bdbcab2411b4e2d4f92df8dc4fa5840e812308a02b8c5da4afc791c358f5eff4849017f63a4d90011ebcdca24e217
|
7
|
+
data.tar.gz: 11571140e63afb0c52b33a010e701e754964150878cba7e395878c2cf2f59cc85eb6b493d5fd05f287d229ec807e8873a45aafe850be6fc30abcbbb2d1cc47fc
|
data/Changelog.org
CHANGED
@@ -1,3 +1,14 @@
|
|
1
|
+
* Version 0.4.2
|
2
|
+
** better error messages for process and check functions
|
3
|
+
dreader now captures exceptions raised by process and check and
|
4
|
+
prints an error message to stdout if an error is found.
|
5
|
+
the exception is then propagated in the standard way.
|
6
|
+
** new method bulk_declare
|
7
|
+
bulk_declare makes it easy to declare columns which don't need any
|
8
|
+
specific treatment
|
9
|
+
** read will now complain if the argument passed is not a hash
|
10
|
+
** virtualcols is now accessible (attr_reader)
|
11
|
+
** fixed a bug with slice
|
1
12
|
* Version 0.4.1
|
2
13
|
** fixed an issue with ~read~: it always required a hash as input
|
3
14
|
** changed syntax of ~debug~, which now accepts a hash as argument
|
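The error-handling change described in the 0.4.2 changelog entry (print a message with context, then let the exception propagate) can be sketched as follows. This is a minimal illustration, not dreader's code: `call_with_context` and its parameters are hypothetical names introduced only for this example.

```ruby
# Hypothetical helper (not part of dreader): run a user-supplied callable,
# report where it failed, then re-raise the original exception unchanged.
def call_with_context(kind, colname, row_number, cell, callable)
  callable.call(cell)
rescue => e
  puts "error: '#{kind}' specification for :#{colname} raised an exception " \
       "at row #{row_number} (value: #{cell.inspect})"
  raise e
end

# A process function that fails on non-numeric input.
to_int = ->(value) { Integer(value) }

ok = call_with_context("process", :age, 1, "30", to_int)

caught = nil
begin
  call_with_context("process", :age, 2, "thirty", to_int)
rescue ArgumentError => e
  # The caller still sees the original exception class.
  caught = e
end
```

Printing before re-raising keeps the standard backtrace while adding the row and column context that the exception alone does not carry.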
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -46,7 +46,7 @@ Or install it yourself as:
|
|
46
46
|
|
47
47
|
## Usage
|
48
48
|
|
49
|
-
### Declare
|
49
|
+
### Declare what file you want to read
|
50
50
|
|
51
51
|
Require `dreader` and declare an instance of the `Dreader::Engine` class:
|
52
52
|
|
@@ -137,6 +137,44 @@ end
|
|
137
137
|
# we are done with our declarations)
|
138
138
|
```
|
139
139
|
|
140
|
+
If there are different columns that you want to read and process in
|
141
|
+
the same way, you can use the method `bulk_declare`, which accepts a
|
142
|
+
hash as input.
|
143
|
+
|
144
|
+
For instance:
|
145
|
+
|
146
|
+
```ruby
|
147
|
+
i.bulk_declare({a: 'A', b: 'B'})
|
148
|
+
```
|
149
|
+
|
150
|
+
is equivalent to:
|
151
|
+
|
152
|
+
```ruby
|
153
|
+
i.column :a do
|
154
|
+
colref 'A'
|
155
|
+
end
|
156
|
+
|
157
|
+
i.column :b do
|
158
|
+
colref 'B'
|
159
|
+
end
|
160
|
+
```
|
161
|
+
|
162
|
+
The method also accepts a code block, which lets you define a common
|
163
|
+
`process` function for all columns. In that case, **don't forget to put
|
164
|
+
the hash in parentheses, or the Ruby parser won't be able to
|
165
|
+
distinguish the hash from the code block.** For instance:
|
166
|
+
|
167
|
+
```ruby
|
168
|
+
i.bulk_declare({a: 'A', b: 'B'}) do
|
169
|
+
process do |cell|
|
170
|
+
...
|
171
|
+
end
|
172
|
+
end
|
173
|
+
```
|
174
|
+
|
175
|
+
There is an example of `bulk_declare` in the examples directory:
|
176
|
+
([us_cities_bulk_declare.rb](examples/wikipedia_us_cities/us_cities_bulk_declare.rb)).
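
To make the mechanics of `bulk_declare` concrete, here is a self-contained sketch that mirrors the method added to `lib/dreader.rb` in this release. `SketchColumn` and `SketchEngine` are stand-ins invented for this example; the real `Column` class is not shown in this diff, so its shape here is an assumption.

```ruby
# Stand-in for dreader's Column class (assumed shape, for illustration only).
class SketchColumn
  def initialize
    @spec = {}
  end

  def colref(ref)
    @spec[:colref] = ref
  end

  def process(&block)
    @spec[:process] = block
  end

  def to_hash
    @spec
  end
end

# Stand-in for the engine: only the part needed to show bulk_declare.
class SketchEngine
  attr_reader :colspec

  def initialize
    @colspec = []
  end

  # Declare one column per hash entry. The optional block is evaluated in
  # the context of each new column, so `process` resolves to that column.
  def bulk_declare(hash, &block)
    hash.each do |name, ref|
      column = SketchColumn.new
      column.colref ref
      column.instance_eval(&block) if block
      @colspec << column.to_hash.merge(name: name)
    end
  end
end

engine = SketchEngine.new

# The hash is in parentheses so that Ruby does not parse it as a block.
engine.bulk_declare({ a: 'A', b: 'B' }) do
  process { |cell| cell.strip }
end

names = engine.colspec.map { |c| c[:name] }
stripped = engine.colspec.first[:process].call("  x ")
```

Because the block is re-evaluated for every column, each column gets its own copy of the `process` function rather than sharing one object.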
|
177
|
+
|
140
178
|
**Remarks:**
|
141
179
|
|
142
180
|
1. the column name can be anything ruby can use as a Hash key. You
|
@@ -146,7 +184,7 @@ end
|
|
146
184
|
2. `colref` can be a string (e.g., `'A'`) or an integer, in which case
|
147
185
|
the first column is one
|
148
186
|
|
149
|
-
3. you need to declare only the columns you want to import
|
187
|
+
3. **you need to declare only the columns you want to import.** For
|
150
188
|
instance, we could skip the declaration for column 1, if 'Date of
|
151
189
|
Birth' is the only data we want to import
|
152
190
|
|
@@ -215,9 +253,6 @@ into a `@table` instance variable.
|
|
215
253
|
i.read
|
216
254
|
```
|
217
255
|
|
218
|
-
**Read applies all the `column` and `virtual_column` declarations and
|
219
|
-
builds a hash with the data read.**
|
220
|
-
|
221
256
|
After reading the file we can use `errors` to see whether any of the
|
222
257
|
`check` functions failed:
|
223
258
|
|
@@ -228,6 +263,13 @@ array_of_strings ech do |error_line|
|
|
228
263
|
end
|
229
264
|
```
|
230
265
|
|
266
|
+
We can then use `virtual_columns` to process data and generate the
|
267
|
+
virtual columns:
|
268
|
+
|
269
|
+
```ruby
|
270
|
+
i.virtual_columns
|
271
|
+
```
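
As a rough picture of what this step does, the sketch below applies a virtual column to rows shaped like the `i.table` output shown in this README. It follows the loop visible in `lib/dreader.rb`, but the `virtualcols` specification and the `:adult` column are made up for the example.

```ruby
# Rows in the shape the README shows for i.table (values wrapped in hashes).
table = [
  { name: { value: "John", row_number: 1 }, age: { value: 30, row_number: 1 } },
  { name: { value: "Jane", row_number: 2 }, age: { value: 31, row_number: 2 } }
]

# Hypothetical virtual column specification: a name plus a callable that
# receives the whole row and returns the derived value.
virtualcols = [
  { name: :adult, process: ->(row) { row[:age][:value] >= 18 } }
]

# Add one derived cell per virtual column to every row.
table.each do |row|
  virtualcols.each do |vc|
    row[vc[:name]] = { value: vc[:process].call(row), virtual: true }
  end
end
```

Each derived cell is flagged with `virtual: true`, which lets later steps tell it apart from cells read from the file.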
|
272
|
+
|
231
273
|
Finally we can use the `process` function to execute the `mapping`
|
232
274
|
directive to each line read from the file.
|
233
275
|
|
@@ -290,13 +332,11 @@ i.table
|
|
290
332
|
age: { value: 31, row_number: 2, col_number: 2, errors: nil } } ]
|
291
333
|
```
|
292
334
|
|
293
|
-
## Simplifying the data read
|
335
|
+
## Simplifying the hash with the data read
|
294
336
|
|
295
337
|
The `Dreader::Util` class provides some functions to simplify and
|
296
338
|
restructure the hashes built by `dreader`.
|
297
339
|
|
298
|
-
More in details:
|
299
|
-
|
300
340
|
`Dreader::Util.simplify hash` simplifies the hash passed as input by
|
301
341
|
removing all information but the value and making the value
|
302
342
|
accessible directly from the name of the column.
|
@@ -309,14 +349,35 @@ Dreader::Util.simplify i.table[0]
|
|
309
349
|
`Dreader::Util.slice hash, keys` and `Dreader::Util.clean hash,
|
310
350
|
keys`, where `keys` is an array of keys, are respectively used to
|
311
351
|
select or remove some keys from `hash`.
|
352
|
+
|
353
|
+
```ruby
|
354
|
+
i.table[0]
|
355
|
+
{ name: { value: "John", row_number: 1, col_number: 1, errors: nil },
|
356
|
+
age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
|
357
|
+
|
358
|
+
Dreader::Util.slice i.table[0], [:name]
|
359
|
+
{name: { value: "John", row_number: 1, col_number: 1, errors: nil }}
|
360
|
+
|
361
|
+
Dreader::Util.clean i.table[0], [:name]
|
362
|
+
{age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
|
363
|
+
```
|
364
|
+
|
365
|
+
The methods `slice` and `clean` are more useful when used in
|
366
|
+
conjunction with `simplify`:
|
312
367
|
|
313
368
|
```ruby
|
314
|
-
Dreader::Util.
|
369
|
+
hash = Dreader::Util.simplify i.table[0]
|
370
|
+
{name: "John", age: 30}
|
371
|
+
|
372
|
+
Dreader::Util.slice hash, [:age]
|
315
373
|
{age: 30}
|
316
374
|
|
317
|
-
Dreader::Util.clean
|
375
|
+
Dreader::Util.clean hash, [:age]
|
318
376
|
{name: "John"}
|
319
377
|
```
|
378
|
+
|
379
|
+
Notice that the output produced by `slice` and `simplify` is a hash
|
380
|
+
which can be used to create an `ActiveRecord` object.
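
The behaviour of these helpers can be reproduced with plain Ruby hashes, which may help when deciding which one to call. In the sketch below only `slice` follows the implementation shown in this release (`hash.slice *keys`). The bodies of `simplify` and `clean` are assumptions inferred from their documented results, and `PersonSketch` is a stand-in for a model class.

```ruby
module UtilSketch
  # Assumed behaviour: keep only the :value of every cell.
  def self.simplify(hash)
    hash.transform_values { |cell| cell[:value] }
  end

  # As in lib/dreader.rb: keys is an array of keys to keep.
  def self.slice(hash, keys)
    hash.slice(*keys)
  end

  # Assumed behaviour: drop every key listed in keys.
  def self.clean(hash, keys)
    hash.reject { |key, _| keys.include?(key) }
  end
end

row = {
  name: { value: "John", row_number: 1, col_number: 1, errors: nil },
  age:  { value: 30,     row_number: 1, col_number: 2, errors: nil }
}

simple   = UtilSketch.simplify(row)
only_age = UtilSketch.slice(simple, [:age])
no_age   = UtilSketch.clean(simple, [:age])

# A keyword-initialised Struct stands in for a model built from the hash.
PersonSketch = Struct.new(:name, :age, keyword_init: true)
person = PersonSketch.new(**simple)
```

The simplified hash maps attribute names directly to values, which is why it can be handed to a constructor that takes attributes by name.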
|
320
381
|
|
321
382
|
Finally, the `Dreader::Util.restructure` method helps building hashes
|
322
383
|
to create
|
@@ -0,0 +1,85 @@
|
|
1
|
+
require 'dreader'
|
2
|
+
|
3
|
+
# this is the class which will contain all the data we read from the file
|
4
|
+
class City
|
5
|
+
[:city, :state, :population, :lat, :lon].each do |var|
|
6
|
+
attr_accessor var
|
7
|
+
end
|
8
|
+
|
9
|
+
def initialize(hash)
|
10
|
+
hash.each do |k, v|
|
11
|
+
self.send("#{k}=", v)
|
12
|
+
end
|
13
|
+
end
|
14
|
+
end
|
15
|
+
|
16
|
+
importer = Dreader::Engine.new
|
17
|
+
|
18
|
+
# read from us_cities.tsv, lines from 2 to 10 (included)
|
19
|
+
importer.options do
|
20
|
+
filename "us_cities.tsv"
|
21
|
+
first_row 2
|
22
|
+
last_row 10
|
23
|
+
end
|
24
|
+
|
25
|
+
# these are the columns for which we only need to specify column and name
|
26
|
+
importer.bulk_declare ({city: 2, state: 3, latlon: 11}) do
|
27
|
+
process { |val| val.strip }
|
28
|
+
end
|
29
|
+
|
30
|
+
# the population column requires more work
|
31
|
+
importer.column :population do |col|
|
32
|
+
col.colref 4
|
33
|
+
|
34
|
+
# make "3,000" into 3000 (int)
|
35
|
+
col.process do |value|
|
36
|
+
value.gsub(",", "").to_i
|
37
|
+
end
|
38
|
+
|
39
|
+
col.check do |value|
|
40
|
+
value > 0
|
41
|
+
end
|
42
|
+
|
43
|
+
end
|
44
|
+
|
45
|
+
cities = []
|
46
|
+
|
47
|
+
importer.mapping do |row|
|
48
|
+
# remove all additional information stored in each cell
|
49
|
+
r = Dreader::Util.simplify row
|
50
|
+
|
51
|
+
# make latlon into the lat, lon fields
|
52
|
+
r[:lat], r[:lon] = r[:latlon].split(" ")
|
53
|
+
|
54
|
+
# now r contains something like
|
55
|
+
# {lat: ..., lon: ..., city: ..., state: ..., population: ..., latlon: ...}
|
56
|
+
|
57
|
+
# remove fields which are not understood by the City class and
|
58
|
+
# make a new instance
|
59
|
+
cleaned = Dreader::Util.clean r, [:latlon]
|
60
|
+
|
61
|
+
# you must declare an array cities before calling importer.process
|
62
|
+
cities << City.new(cleaned)
|
63
|
+
end
|
64
|
+
|
65
|
+
# print to stdout what we told dreader to read
|
66
|
+
# (useful only for ... debugging!)
|
67
|
+
importer.debug n: 10
|
68
|
+
|
69
|
+
# check some other features of debug:
|
70
|
+
# disable processing and debug (e.g., to analyze the raw data read)
|
71
|
+
importer.debug process: false, check: false
|
72
|
+
|
73
|
+
# load and process
|
74
|
+
importer.load
|
75
|
+
cities = []
|
76
|
+
importer.process
|
77
|
+
|
78
|
+
# output everything to see whether it works
|
79
|
+
puts "First ten cities in the US (source Wikipedia)"
|
80
|
+
cities.each do |city|
|
81
|
+
[:city, :state, :population, :lat, :lon].each do |var|
|
82
|
+
puts "#{var.to_s.capitalize}: #{city.send(var)}"
|
83
|
+
end
|
84
|
+
puts ""
|
85
|
+
end
|
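The `process` and `check` functions declared for the population column in the example above can be tried without a spreadsheet. The sketch below runs the same two functions over a few made-up cell values and collects errors in a way that loosely follows the read loop in `lib/dreader.rb`. The sample strings and the wording of the error message are illustrative assumptions.

```ruby
# Same transformations as the population column in the example.
process = ->(value) { value.gsub(",", "").to_i }
check   = ->(value) { value > 0 }

# Made-up raw cell values; "n/a" converts to 0 and fails the check.
cells = ["3,000", "12,500", "n/a"]

errors = []
results = cells.each_with_index.map do |cell, i|
  value  = process.call(cell)
  passed = check.call(value)
  unless passed
    errors << "value \"#{cell}\" at row #{i + 1} does not pass the check function"
  end
  { value: value, error: !passed }
end
```

A failed check does not stop the run: the row is kept and flagged, and the message is queued for later inspection.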
data/lib/dreader/version.rb
CHANGED
data/lib/dreader.rb
CHANGED
@@ -70,8 +70,9 @@ module Dreader
|
|
70
70
|
end
|
71
71
|
|
72
72
|
# an alias for Hash.slice
|
73
|
-
|
74
|
-
|
73
|
+
# keys is an array of keys
|
74
|
+
def self.slice hash, keys
|
75
|
+
hash.slice *keys
|
75
76
|
end
|
76
77
|
|
77
78
|
# remove all `keys` from `hash`
|
@@ -102,9 +103,11 @@ module Dreader
|
|
102
103
|
attr_reader :options
|
103
104
|
# the specification of the columns to process
|
104
105
|
attr_reader :colspec
|
106
|
+
# the specification of the virtual columns
|
107
|
+
attr_reader :virtualcols
|
105
108
|
# the data we read
|
106
109
|
attr_reader :table
|
107
|
-
|
110
|
+
|
108
111
|
def initialize
|
109
112
|
@options = {}
|
110
113
|
@colspec = []
|
@@ -133,6 +136,51 @@ module Dreader
|
|
133
136
|
@colspec << column.to_hash.merge({name: name})
|
134
137
|
end
|
135
138
|
|
139
|
+
# bulk declare columns we intend to read
|
140
|
+
#
|
141
|
+
# - hash is a hash in the form { symbolic_name: colref }
|
142
|
+
#
|
143
|
+
# i.bulk_declare({name: 'B', age: 'C'}) is equivalent to:
|
144
|
+
#
|
145
|
+
# i.column :name do
|
146
|
+
# colref 'B'
|
147
|
+
# end
|
148
|
+
# i.column :age do
|
149
|
+
# colref 'C'
|
150
|
+
# end
|
151
|
+
#
|
152
|
+
# i.bulk_declare({name: 'B', age: 'C'}) do
|
153
|
+
# process do |cell|
|
154
|
+
# cell.strip
|
155
|
+
# end
|
156
|
+
# end
|
157
|
+
#
|
158
|
+
# is equivalent to:
|
159
|
+
#
|
160
|
+
# i.column :name do
|
161
|
+
# colref 'B'
|
162
|
+
# process do |cell|
|
163
|
+
# cell.strip
|
164
|
+
# end
|
165
|
+
# end
|
166
|
+
# i.column :age do
|
167
|
+
# colref 'C'
|
168
|
+
# process do |cell|
|
169
|
+
# cell.strip
|
170
|
+
# end
|
171
|
+
# end
|
172
|
+
def bulk_declare hash, &block
|
173
|
+
hash.keys.each do |key|
|
174
|
+
column = Column.new
|
175
|
+
column.colref hash[key]
|
176
|
+
if block
|
177
|
+
column.instance_eval(&block)
|
178
|
+
end
|
179
|
+
@colspec << column.to_hash.merge({name: key})
|
180
|
+
end
|
181
|
+
end
|
182
|
+
|
183
|
+
|
136
184
|
# virtual columns define derived attributes
|
137
185
|
# the code specified in the virtual column is executed after reading
|
138
186
|
# a row and before applying the mapping function
|
@@ -165,7 +213,12 @@ module Dreader
|
|
165
213
|
# @return the data read from filename, in the form of an array of
|
166
214
|
# hashes
|
167
215
|
def read args = {}
|
168
|
-
|
216
|
+
if args.class == Hash
|
217
|
+
hash = @options.merge(args)
|
218
|
+
else
|
219
|
+
puts "dreader error at #{__callee__}: this function takes a Hash as input"
|
220
|
+
exit
|
221
|
+
end
|
169
222
|
|
170
223
|
spreadsheet = Dreader::Engine.open_spreadsheet (hash[:filename])
|
171
224
|
sheet = spreadsheet.sheet(hash[:sheet] || 0)
|
@@ -187,13 +240,23 @@ module Dreader
|
|
187
240
|
r[colname][:row_number] = row_number
|
188
241
|
r[colname][:col_number] = colspec[:colref]
|
189
242
|
|
190
|
-
|
243
|
+
begin
|
244
|
+
r[colname][:value] = value = colspec[:process] ? colspec[:process].call(cell) : cell
|
245
|
+
rescue => e
|
246
|
+
puts "dreader error at #{__callee__}: 'process' specification for :#{colname} raised an exception at row #{row_number} (col #{index + 1}, value: #{cell})"
|
247
|
+
raise e
|
248
|
+
end
|
191
249
|
|
192
|
-
|
193
|
-
|
194
|
-
|
195
|
-
|
196
|
-
|
250
|
+
begin
|
251
|
+
if colspec[:check] and not colspec[:check].call(value) then
|
252
|
+
r[colname][:error] = true
|
253
|
+
@errors << "dreader error at #{__callee__}: value \"#{cell}\" for #{colname} at row #{row_number} (col #{index + 1}) does not pass the check function"
|
254
|
+
else
|
255
|
+
r[colname][:error] = false
|
256
|
+
end
|
257
|
+
rescue => e
|
258
|
+
puts "dreader error at #{__callee__}: 'check' specification for :#{colname} raised an exception at row #{row_number} (col #{index + 1}, value: #{cell})"
|
259
|
+
raise e
|
197
260
|
end
|
198
261
|
end
|
199
262
|
|
@@ -205,10 +268,34 @@ module Dreader
|
|
205
268
|
|
206
269
|
alias_method :load, :read
|
207
270
|
|
271
|
+
# get (processed) row number
|
272
|
+
#
|
273
|
+
# - row_number is the row to get: index starts at 1.
|
274
|
+
#
|
275
|
+
# get_row(1) gets the first line read, that is, the row specified
|
276
|
+
# by `first_row` in `options` (or in read)
|
277
|
+
#
|
278
|
+
# You need to invoke read first
|
279
|
+
def get_row row_number
|
280
|
+
if row_number > @table.size
|
281
|
+
puts "dreader error at #{__callee__}: 'row_number' is out of range (did you invoke read first?)"
|
282
|
+
exit
|
283
|
+
elsif row_number <= 0
|
284
|
+
puts "dreader error at #{__callee__}: 'row_number' is zero or negative (first row is 1)."
|
285
|
+
else
|
286
|
+
@table[row_number - 1]
|
287
|
+
end
|
288
|
+
end
|
289
|
+
|
208
290
|
# show to stdout the first `n` records we read from the file given the current
|
209
291
|
# configuration
|
210
292
|
def debug args = {}
|
211
|
-
|
293
|
+
if args.class == Hash
|
294
|
+
hash = @options.merge(args)
|
295
|
+
else
|
296
|
+
puts "dreader error at #{__callee__}: this function takes a Hash as input"
|
297
|
+
exit
|
298
|
+
end
|
212
299
|
|
213
300
|
# apply some defaults, if not defined in the options
|
214
301
|
hash[:process] = true if not hash.has_key? :process # shall we apply the process function?
|
@@ -246,13 +333,23 @@ module Dreader
|
|
246
333
|
checked_str = ""
|
247
334
|
|
248
335
|
if hash[:process]
|
249
|
-
|
250
|
-
|
336
|
+
begin
|
337
|
+
processed = colspec[:process] ? colspec[:process].call(cell) : cell
|
338
|
+
processed_str = "processed: '#{processed}' (#{processed.class})"
|
339
|
+
rescue => e
|
340
|
+
puts "dreader error at #{__callee__}: 'process' specification for :#{colname} raised an exception at row #{row_number} (col #{index + 1}, value: #{cell})"
|
341
|
+
raise e
|
342
|
+
end
|
251
343
|
end
|
252
344
|
if hash[:check]
|
253
|
-
|
254
|
-
|
255
|
-
|
345
|
+
begin
|
346
|
+
processed = colspec[:process] ? colspec[:process].call(cell) : cell
|
347
|
+
check = colspec[:check] ? colspec[:check].call(processed) : "no check specified"
|
348
|
+
checked_str = "checked: '#{check}'"
|
349
|
+
rescue => e
|
350
|
+
puts "dreader error at #{__callee__}: 'check' specification for #{colname} at row #{row_number} raised an exception (col #{index + 1}, value: #{cell})"
|
351
|
+
raise e
|
352
|
+
end
|
256
353
|
end
|
257
354
|
|
258
355
|
puts " #{colname} => orig: '#{cell}' (#{cell.class}) #{processed_str} #{checked_str} (column: '#{colspec[:colref]}')"
|
@@ -268,13 +365,18 @@ module Dreader
|
|
268
365
|
|
269
366
|
def virtual_columns
|
270
367
|
# execute the virtual column specification
|
271
|
-
@
|
272
|
-
@
|
273
|
-
|
274
|
-
|
275
|
-
|
276
|
-
|
277
|
-
|
368
|
+
@table.each do |r|
|
369
|
+
@virtualcols.each do |virtualcol|
|
370
|
+
begin
|
371
|
+
# add the cell to the table
|
372
|
+
r[virtualcol[:name]] = {
|
373
|
+
value: virtualcol[:process].call(r),
|
374
|
+
virtual: true,
|
375
|
+
}
|
376
|
+
rescue => e
|
377
|
+
puts "dreader error at #{__callee__}: 'process' specification for :#{virtualcol[:name]} raised an exception at row #{r[r.keys.first][:row_number]}"
|
378
|
+
raise e
|
379
|
+
end
|
278
380
|
end
|
279
381
|
end
|
280
382
|
end
|
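The 1-based indexing used by `get_row` in the diff above can be illustrated in isolation. The sketch below is an alternative formulation, not the library's code: it raises `ArgumentError` instead of printing and calling `exit`, and `get_row_sketch` is a hypothetical name.

```ruby
# Hypothetical variant of get_row: row numbers start at 1, and an invalid
# row number raises instead of terminating the program.
def get_row_sketch(table, row_number)
  unless row_number.is_a?(Integer) && row_number.between?(1, table.size)
    raise ArgumentError,
          "'row_number' must be between 1 and #{table.size} (got #{row_number.inspect})"
  end
  table[row_number - 1]
end

table = [{ name: "John" }, { name: "Jane" }]

first = get_row_sketch(table, 1)
last  = get_row_sketch(table, 2)
```

Raising leaves the decision to the caller, which is usually easier to handle inside a larger application than a call to `exit`.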
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: dreader
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.4.
|
4
|
+
version: 0.4.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Adolfo Villafiorita
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2018-
|
11
|
+
date: 2018-05-20 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: bundler
|
@@ -84,6 +84,7 @@ files:
|
|
84
84
|
- examples/wikipedia_big_us_cities/cities_by_state.ods
|
85
85
|
- examples/wikipedia_us_cities/us_cities.rb
|
86
86
|
- examples/wikipedia_us_cities/us_cities.tsv
|
87
|
+
- examples/wikipedia_us_cities/us_cities_bulk_declare.rb
|
87
88
|
- lib/dreader.rb
|
88
89
|
- lib/dreader/version.rb
|
89
90
|
homepage: http://github.com/avillafiorita/dreader
|