dreader 0.4.1 → 0.4.2
- checksums.yaml +4 -4
- data/Changelog.org +11 -0
- data/Gemfile.lock +1 -1
- data/README.md +71 -10
- data/examples/wikipedia_us_cities/us_cities_bulk_declare.rb +85 -0
- data/lib/dreader/version.rb +1 -1
- data/lib/dreader.rb +125 -23
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 6d616b9bad780960c105a2c754e392ecf51c0cdaeeb6076c11b4042d7c40e414
|
4
|
+
data.tar.gz: 656507d726a22a24111fc239c0f640dedf90321aab2bcda74a73dc0cf730a83d
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 9ee4f8c9367864ef01aea8d75240a3a93a7bdbcab2411b4e2d4f92df8dc4fa5840e812308a02b8c5da4afc791c358f5eff4849017f63a4d90011ebcdca24e217
|
7
|
+
data.tar.gz: 11571140e63afb0c52b33a010e701e754964150878cba7e395878c2cf2f59cc85eb6b493d5fd05f287d229ec807e8873a45aafe850be6fc30abcbbb2d1cc47fc
|
data/Changelog.org
CHANGED
@@ -1,3 +1,14 @@
|
|
1
|
+
* Version 0.4.2
|
2
|
+
** better error messages for process and check functions
|
3
|
+
dreader now captures exceptions raised by process and check and
|
4
|
+
prints an error message to stdout if an error is found.
|
5
|
+
the exception is then propagated in the standard way.
|
6
|
+
** new method bulk_declare
|
7
|
+
bulk_declare allows you to easily declare columns which don't need a
|
8
|
+
specific treatment
|
9
|
+
** read now complains if the argument passed is not a hash
|
10
|
+
** virtualcols is now accessible (attr_reader)
|
11
|
+
** fixed a bug with slice
|
1
12
|
* Version 0.4.1
|
2
13
|
** fixed an issue with ~read~: it always required a hash as input
|
3
14
|
** changed syntax of ~debug~, which now accepts a hash as argument
|
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -46,7 +46,7 @@ Or install it yourself as:
|
|
46
46
|
|
47
47
|
## Usage
|
48
48
|
|
49
|
-
### Declare
|
49
|
+
### Declare what file you want to read
|
50
50
|
|
51
51
|
Require `dreader` and declare an instance of the `Dreader::Engine` class:
|
52
52
|
|
@@ -137,6 +137,44 @@ end
|
|
137
137
|
# we are done with our declarations)
|
138
138
|
```
|
139
139
|
|
140
|
+
If there are different columns that you want to read and process in
|
141
|
+
the same way, you can use the method `bulk_declare`, which accepts a
|
142
|
+
hash as input.
|
143
|
+
|
144
|
+
For instance:
|
145
|
+
|
146
|
+
```ruby
|
147
|
+
i.bulk_declare({a: 'A', b: 'B'})
|
148
|
+
```
|
149
|
+
|
150
|
+
is equivalent to:
|
151
|
+
|
152
|
+
```ruby
|
153
|
+
i.column :a do
|
154
|
+
colref 'A'
|
155
|
+
end
|
156
|
+
|
157
|
+
i.column :b do
|
158
|
+
colref 'B'
|
159
|
+
end
|
160
|
+
```
|
161
|
+
|
162
|
+
The method also accepts a code block, which lets you define a common
|
163
|
+
`process` function for all columns. In that case, **don't forget to put
|
164
|
+
the hash in parentheses, or the Ruby parser won't be able to
|
165
|
+
distinguish the hash from the code block.** For instance:
|
166
|
+
|
167
|
+
```ruby
|
168
|
+
i.bulk_declare({a: 'A', b: 'B'}) do
|
169
|
+
process do |cell|
|
170
|
+
...
|
171
|
+
end
|
172
|
+
end
|
173
|
+
```
|
174
|
+
|
175
|
+
There is an example of `bulk_declare` in the examples directory:
|
176
|
+
([us_cities_bulk_declare.rb](examples/wikipedia_us_cities/us_cities_bulk_declare.rb)).
|
177
|
+
|
140
178
|
**Remarks:**
|
141
179
|
|
142
180
|
1. the column name can be anything ruby can use as a Hash key. You
|
@@ -146,7 +184,7 @@ end
|
|
146
184
|
2. `colref` can be a string (e.g., `'A'`) or an integer, in which case
|
147
185
|
the first column is one
|
148
186
|
|
149
|
-
3. you need to declare only the columns you want to import
|
187
|
+
3. **you need to declare only the columns you want to import.** For
|
150
188
|
instance, we could skip the declaration for column 1, if 'Date of
|
151
189
|
Birth' is the only data we want to import
|
152
190
|
|
@@ -215,9 +253,6 @@ into a `@table` instance variable.
|
|
215
253
|
i.read
|
216
254
|
```
|
217
255
|
|
218
|
-
**Read applies all the `column` and `virtual_column` declarations and
|
219
|
-
builds a hash with the data read.**
|
220
|
-
|
221
256
|
After reading the file we can use `errors` to see whether any of the
|
222
257
|
`check` functions failed:
|
223
258
|
|
@@ -228,6 +263,13 @@ array_of_strings ech do |error_line|
|
|
228
263
|
end
|
229
264
|
```
|
230
265
|
|
266
|
+
We can then use `virtual_columns` to process data and generate the
|
267
|
+
virtual columns:
|
268
|
+
|
269
|
+
```ruby
|
270
|
+
i.virtual_columns
|
271
|
+
```
|
272
|
+
|
231
273
|
Finally we can use the `process` function to execute the `mapping`
|
232
274
|
directive to each line read from the file.
|
233
275
|
|
@@ -290,13 +332,11 @@ i.table
|
|
290
332
|
age: { value: 31, row_number: 2, col_number: 2, errors: nil } } ]
|
291
333
|
```
|
292
334
|
|
293
|
-
## Simplifying the data read
|
335
|
+
## Simplifying the hash with the data read
|
294
336
|
|
295
337
|
The `Dreader::Util` class provides some functions to simplify and
|
296
338
|
restructure the hashes built by `dreader`.
|
297
339
|
|
298
|
-
More in details:
|
299
|
-
|
300
340
|
`Dreader::Util.simplify hash` simplifies the hash passed as input by
|
301
341
|
removing all information but the value and making the value
|
302
342
|
accessible directly from the name of the column.
|
@@ -309,14 +349,35 @@ Dreader::Util.simplify i.table[0]
|
|
309
349
|
`Dreader::Util.slice hash, keys` and `Dreader::Util.clean hash,
|
310
350
|
keys`, where `keys` is an array of keys, are respectively used to
|
311
351
|
select or remove some keys from `hash`.
|
352
|
+
|
353
|
+
```ruby
|
354
|
+
i.table[0]
|
355
|
+
{ name: { value: "John", row_number: 1, col_number: 1, errors: nil },
|
356
|
+
age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
|
357
|
+
|
358
|
+
Dreader::Util.slice i.table[0], [:name]
|
359
|
+
{name: { value: "John", row_number: 1, col_number: 1, errors: nil }}
|
360
|
+
|
361
|
+
Dreader::Util.clean i.table[0], [:name]
|
362
|
+
{age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
|
363
|
+
```
|
364
|
+
|
365
|
+
The methods `slice` and `clean` are more useful when used in
|
366
|
+
conjunction with `simplify`:
|
312
367
|
|
313
368
|
```ruby
|
314
|
-
Dreader::Util.
|
369
|
+
hash = Dreader::Util.simplify i.table[0]
|
370
|
+
{name: "John", age: 30}
|
371
|
+
|
372
|
+
Dreader::Util.slice hash, [:age]
|
315
373
|
{age: 30}
|
316
374
|
|
317
|
-
Dreader::Util.clean
|
375
|
+
Dreader::Util.clean hash, [:age]
|
318
376
|
{name: "John"}
|
319
377
|
```
|
378
|
+
|
379
|
+
Notice that the output produced by `slice` and `simplify` is a hash
|
380
|
+
which can be used to create an `ActiveRecord` object.
|
320
381
|
|
321
382
|
Finally, the `Dreader::Util.restructure` method helps build hashes
|
322
383
|
to create
|
data/examples/wikipedia_us_cities/us_cities_bulk_declare.rb
@@ -0,0 +1,85 @@
|
|
1
|
+
require 'dreader'
|
2
|
+
|
3
|
+
# this is the class which will contain all the data we read from the file
|
4
|
+
class City
|
5
|
+
[:city, :state, :population, :lat, :lon].each do |var|
|
6
|
+
attr_accessor var
|
7
|
+
end
|
8
|
+
|
9
|
+
def initialize(hash)
|
10
|
+
hash.each do |k, v|
|
11
|
+
self.send("#{k}=", v)
|
12
|
+
end
|
13
|
+
end
|
14
|
+
end
|
15
|
+
|
16
|
+
importer = Dreader::Engine.new
|
17
|
+
|
18
|
+
# read from us_cities.tsv, lines from 2 to 10 (included)
|
19
|
+
importer.options do
|
20
|
+
filename "us_cities.tsv"
|
21
|
+
first_row 2
|
22
|
+
last_row 10
|
23
|
+
end
|
24
|
+
|
25
|
+
# these are the columns for which we only need to specify column and name
|
26
|
+
importer.bulk_declare ({city: 2, state: 3, latlon: 11}) do
|
27
|
+
process { |val| val.strip }
|
28
|
+
end
|
29
|
+
|
30
|
+
# the population column requires more work
|
31
|
+
importer.column :population do |col|
|
32
|
+
col.colref 4
|
33
|
+
|
34
|
+
# make "3,000" into 3000 (int)
|
35
|
+
col.process do |value|
|
36
|
+
value.gsub(",", "").to_i
|
37
|
+
end
|
38
|
+
|
39
|
+
col.check do |value|
|
40
|
+
value > 0
|
41
|
+
end
|
42
|
+
|
43
|
+
end
|
44
|
+
|
45
|
+
cities = []
|
46
|
+
|
47
|
+
importer.mapping do |row|
|
48
|
+
# remove all additional information stored in each cell
|
49
|
+
r = Dreader::Util.simplify row
|
50
|
+
|
51
|
+
# make latlon into the lat, lon fields
|
52
|
+
r[:lat], r[:lon] = r[:latlon].split(" ")
|
53
|
+
|
54
|
+
# now r contains something like
|
55
|
+
# {lat: ..., lon: ..., city: ..., state: ..., population: ..., latlon: ...}
|
56
|
+
|
57
|
+
# remove fields which are not understood by the City class and
|
58
|
+
# make a new instance
|
59
|
+
cleaned = Dreader::Util.clean r, [:latlon]
|
60
|
+
|
61
|
+
# you must declare an array cities before calling importer.process
|
62
|
+
cities << City.new(cleaned)
|
63
|
+
end
|
64
|
+
|
65
|
+
# print to stdout what we told dreader to read
|
66
|
+
# (useful only for ... debugging!)
|
67
|
+
importer.debug n: 10
|
68
|
+
|
69
|
+
# check some other features of debug:
|
70
|
+
# disable processing and debug (e.g., to analyze the raw data read)
|
71
|
+
importer.debug process: false, check: false
|
72
|
+
|
73
|
+
# load and process
|
74
|
+
importer.load
|
75
|
+
cities = []
|
76
|
+
importer.process
|
77
|
+
|
78
|
+
# output everything to see whether it works
|
79
|
+
puts "First ten cities in the US (source Wikipedia)"
|
80
|
+
cities.each do |city|
|
81
|
+
[:city, :state, :population, :lat, :lon].each do |var|
|
82
|
+
puts "#{var.to_s.capitalize}: #{city.send(var)}"
|
83
|
+
end
|
84
|
+
puts ""
|
85
|
+
end
|
data/lib/dreader/version.rb
CHANGED
data/lib/dreader.rb
CHANGED
@@ -70,8 +70,9 @@ module Dreader
|
|
70
70
|
end
|
71
71
|
|
72
72
|
# an alias for Hash.slice
|
73
|
-
|
74
|
-
|
73
|
+
# keys is an array of keys
|
74
|
+
def self.slice hash, keys
|
75
|
+
hash.slice *keys
|
75
76
|
end
|
76
77
|
|
77
78
|
# remove all `keys` from `hash`
|
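The hunk above shows `slice` delegating to `Hash#slice` with a splatted array of keys. The following is a minimal standalone sketch of that pair of helpers, written so it runs without the gem. `UtilSketch` is a hypothetical stand-in name, and the body of `clean` is an assumption based only on its comment ("remove all `keys` from `hash`"), not the gem's actual code.

```ruby
# Hypothetical stand-in for Dreader::Util; not the gem's implementation.
module UtilSketch
  # Keep only the listed keys (Hash#slice requires Ruby 2.5 or later).
  def self.slice(hash, keys)
    hash.slice(*keys)
  end

  # Drop the listed keys (assumed behavior, inferred from the comment in the diff).
  def self.clean(hash, keys)
    hash.reject { |key, _value| keys.include?(key) }
  end
end

row = { name: "John", age: 30 }
sliced = UtilSketch.slice(row, [:age])
cleaned = UtilSketch.clean(row, [:age])
```

Both helpers return a new hash and leave `row` untouched, which is what makes them safe to chain after `simplify`.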
@@ -102,9 +103,11 @@ module Dreader
|
|
102
103
|
attr_reader :options
|
103
104
|
# the specification of the columns to process
|
104
105
|
attr_reader :colspec
|
106
|
+
# the specification of the virtual columns
|
107
|
+
attr_reader :virtualcols
|
105
108
|
# the data we read
|
106
109
|
attr_reader :table
|
107
|
-
|
110
|
+
|
108
111
|
def initialize
|
109
112
|
@options = {}
|
110
113
|
@colspec = []
|
@@ -133,6 +136,51 @@ module Dreader
|
|
133
136
|
@colspec << column.to_hash.merge({name: name})
|
134
137
|
end
|
135
138
|
|
139
|
+
# bulk declare columns we intend to read
|
140
|
+
#
|
141
|
+
# - hash is a hash in the form { symbolic_name: colref }
|
142
|
+
#
|
143
|
+
# i.bulk_declare({name: 'B', age: 'C'}) is equivalent to:
|
144
|
+
#
|
145
|
+
# i.column :name do
|
146
|
+
# colref 'B'
|
147
|
+
# end
|
148
|
+
# i.column :age do
|
149
|
+
# colref 'C'
|
150
|
+
# end
|
151
|
+
#
|
152
|
+
# i.bulk_declare({name: 'B', age: 'C'}) do
|
153
|
+
# process do |cell|
|
154
|
+
# cell.strip
|
155
|
+
# end
|
156
|
+
# end
|
157
|
+
#
|
158
|
+
# is equivalent to:
|
159
|
+
#
|
160
|
+
# i.column :name do
|
161
|
+
# colref 'B'
|
162
|
+
# process do |cell|
|
163
|
+
# cell.strip
|
164
|
+
# end
|
165
|
+
# end
|
166
|
+
# i.column :age do
|
167
|
+
# colref 'C'
|
168
|
+
# process do |cell|
|
169
|
+
# cell.strip
|
170
|
+
# end
|
171
|
+
# end
|
172
|
+
def bulk_declare hash, &block
|
173
|
+
hash.keys.each do |key|
|
174
|
+
column = Column.new
|
175
|
+
column.colref hash[key]
|
176
|
+
if block
|
177
|
+
column.instance_eval(&block)
|
178
|
+
end
|
179
|
+
@colspec << column.to_hash.merge({name: key})
|
180
|
+
end
|
181
|
+
end
|
182
|
+
|
183
|
+
|
136
184
|
# virtual columns define derived attributes
|
137
185
|
# the code specified in the virtual column is executed after reading
|
138
186
|
# a row and before applying the mapping function
|
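The `bulk_declare` method added in this hunk builds one column per hash entry and evaluates the shared block in the context of each column with `instance_eval`. Here is a self-contained sketch of that mechanism. `ColumnSketch` and `EngineSketch` are hypothetical stand-ins whose internals are assumptions; only the shape of `bulk_declare` follows the diff.

```ruby
# Hypothetical stand-ins; names and internals are assumptions, not dreader's code.
class ColumnSketch
  def initialize
    @colref = nil
    @process = nil
  end

  # Record which spreadsheet column this declaration refers to.
  def colref(ref)
    @colref = ref
  end

  # Record the block used to transform each cell.
  def process(&block)
    @process = block
  end

  def to_hash
    { colref: @colref, process: @process }
  end
end

class EngineSketch
  attr_reader :colspec

  def initialize
    @colspec = []
  end

  # One column per hash entry; the optional block runs inside each column.
  def bulk_declare(hash, &block)
    hash.each do |name, ref|
      column = ColumnSketch.new
      column.colref ref
      column.instance_eval(&block) if block
      @colspec << column.to_hash.merge(name: name)
    end
  end
end

engine = EngineSketch.new
# The hash is parenthesized so Ruby does not parse the braces as a block.
engine.bulk_declare({ name: 'B', age: 'C' }) do
  process { |cell| cell.strip }
end

names = engine.colspec.map { |spec| spec[:name] }
stripped = engine.colspec.first[:process].call("  John ")
```

Because the block is evaluated with `instance_eval`, the bare `process` call inside it resolves to the column's own method, which is why no block parameter is needed.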
@@ -165,7 +213,12 @@ module Dreader
|
|
165
213
|
# @return the data read from filename, in the form of an array of
|
166
214
|
# hashes
|
167
215
|
def read args = {}
|
168
|
-
|
216
|
+
if args.class == Hash
|
217
|
+
hash = @options.merge(args)
|
218
|
+
else
|
219
|
+
puts "dreader error at #{__callee__}: this function takes a Hash as input"
|
220
|
+
exit
|
221
|
+
end
|
169
222
|
|
170
223
|
spreadsheet = Dreader::Engine.open_spreadsheet (hash[:filename])
|
171
224
|
sheet = spreadsheet.sheet(hash[:sheet] || 0)
|
@@ -187,13 +240,23 @@ module Dreader
|
|
187
240
|
r[colname][:row_number] = row_number
|
188
241
|
r[colname][:col_number] = colspec[:colref]
|
189
242
|
|
190
|
-
|
243
|
+
begin
|
244
|
+
r[colname][:value] = value = colspec[:process] ? colspec[:process].call(cell) : cell
|
245
|
+
rescue => e
|
246
|
+
puts "dreader error at #{__callee__}: 'process' specification for :#{colname} raised an exception at row #{row_number} (col #{index + 1}, value: #{cell})"
|
247
|
+
raise e
|
248
|
+
end
|
191
249
|
|
192
|
-
|
193
|
-
|
194
|
-
|
195
|
-
|
196
|
-
|
250
|
+
begin
|
251
|
+
if colspec[:check] and not colspec[:check].call(value) then
|
252
|
+
r[colname][:error] = true
|
253
|
+
@errors << "dreader error at #{__callee__}: value \"#{cell}\" for #{colname} at row #{row_number} (col #{index + 1}) does not pass the check function"
|
254
|
+
else
|
255
|
+
r[colname][:error] = false
|
256
|
+
end
|
257
|
+
rescue => e
|
258
|
+
puts "dreader error at #{__callee__}: 'check' specification for :#{colname} raised an exception at row #{row_number} (col #{index + 1}, value: #{cell})"
|
259
|
+
raise e
|
197
260
|
end
|
198
261
|
end
|
199
262
|
|
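The hunk above wraps the `process` and `check` calls in `begin`/`rescue`, prints where the failure happened, and re-raises. The same pattern can be sketched in isolation as follows. `call_with_context` is a hypothetical helper introduced only for illustration, and the message format is simplified from the one in the diff.

```ruby
# Hypothetical helper: report where a user-supplied block failed, then re-raise.
def call_with_context(label, row_number, cell)
  yield cell
rescue => e
  puts "error: '#{label}' raised an exception at row #{row_number} (value: #{cell})"
  raise e
end

# A block that succeeds returns its value unchanged.
ok = call_with_context("process", 1, "3,000") { |v| v.gsub(",", "").to_i }

# A block that fails still raises after the message has been printed.
propagated = begin
  call_with_context("process", 2, nil) { |v| v.gsub(",", "") }
  false
rescue NoMethodError
  true
end
```

Re-raising keeps the original exception class and backtrace, so callers can still rescue it in the usual way.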
@@ -205,10 +268,34 @@ module Dreader
|
|
205
268
|
|
206
269
|
alias_method :load, :read
|
207
270
|
|
271
|
+
# get (processed) row number
|
272
|
+
#
|
273
|
+
# - row_number is the row to get: index starts at 1.
|
274
|
+
#
|
275
|
+
# get_row(1) gets the first line read, that is, the row specified
|
276
|
+
# by `first_row` in `options` (or in read)
|
277
|
+
#
|
278
|
+
# You need to invoke read first
|
279
|
+
def get_row row_number
|
280
|
+
if row_number > @table.size
|
281
|
+
puts "dreader error at #{__callee__}: 'row_number' is out of range (did you invoke read first?)"
|
282
|
+
exit
|
283
|
+
elsif row_number <= 0
|
284
|
+
puts "dreader error at #{__callee__}: 'row_number' is zero or negative (first row is 1)."
|
285
|
+
else
|
286
|
+
@table[row_number - 1]
|
287
|
+
end
|
288
|
+
end
|
289
|
+
|
208
290
|
# show to stdout the first `n` records we read from the file given the current
|
209
291
|
# configuration
|
210
292
|
def debug args = {}
|
211
|
-
|
293
|
+
if args.class == Hash
|
294
|
+
hash = @options.merge(args)
|
295
|
+
else
|
296
|
+
puts "dreader error at #{__callee__}: this function takes a Hash as input"
|
297
|
+
exit
|
298
|
+
end
|
212
299
|
|
213
300
|
# apply some defaults, if not defined in the options
|
214
301
|
hash[:process] = true if not hash.has_key? :process # shall we apply the process function?
|
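`get_row` in this hunk uses a 1-based index and prints a message when the index is out of range. As an illustration of the indexing only, here is a sketch of a variation that raises `IndexError` instead of printing and exiting. `fetch_row` is a hypothetical name, and raising is my assumption for the sketch, not the gem's behavior.

```ruby
# Hypothetical variation of get_row: 1-based lookup that raises when out of range.
def fetch_row(table, row_number)
  unless row_number.between?(1, table.size)
    raise IndexError, "row_number #{row_number} is out of range (valid: 1..#{table.size})"
  end
  table[row_number - 1]
end

table = [{ name: "John" }, { name: "Jane" }]
first = fetch_row(table, 1)

out_of_range = begin
  fetch_row(table, 3)
  false
rescue IndexError
  true
end
```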
@@ -246,13 +333,23 @@ module Dreader
|
|
246
333
|
checked_str = ""
|
247
334
|
|
248
335
|
if hash[:process]
|
249
|
-
|
250
|
-
|
336
|
+
begin
|
337
|
+
processed = colspec[:process] ? colspec[:process].call(cell) : cell
|
338
|
+
processed_str = "processed: '#{processed}' (#{processed.class})"
|
339
|
+
rescue => e
|
340
|
+
puts "dreader error at #{__callee__}: 'process' specification for :#{colname} raised an exception at row #{row_number} (col #{index + 1}, value: #{cell})"
|
341
|
+
raise e
|
342
|
+
end
|
251
343
|
end
|
252
344
|
if hash[:check]
|
253
|
-
|
254
|
-
|
255
|
-
|
345
|
+
begin
|
346
|
+
processed = colspec[:process] ? colspec[:process].call(cell) : cell
|
347
|
+
check = colspec[:check] ? colspec[:check].call(processed) : "no check specified"
|
348
|
+
checked_str = "checked: '#{check}'"
|
349
|
+
rescue => e
|
350
|
+
puts "dreader error at #{__callee__}: 'check' specification for #{colname} at row #{row_number} raised an exception (col #{index + 1}, value: #{cell})"
|
351
|
+
raise e
|
352
|
+
end
|
256
353
|
end
|
257
354
|
|
258
355
|
puts " #{colname} => orig: '#{cell}' (#{cell.class}) #{processed_str} #{checked_str} (column: '#{colspec[:colref]}')"
|
@@ -268,13 +365,18 @@ module Dreader
|
|
268
365
|
|
269
366
|
def virtual_columns
|
270
367
|
# execute the virtual column specification
|
271
|
-
@
|
272
|
-
@
|
273
|
-
|
274
|
-
|
275
|
-
|
276
|
-
|
277
|
-
|
368
|
+
@table.each do |r|
|
369
|
+
@virtualcols.each do |virtualcol|
|
370
|
+
begin
|
371
|
+
# add the cell to the table
|
372
|
+
r[virtualcol[:name]] = {
|
373
|
+
value: virtualcol[:process].call(r),
|
374
|
+
virtual: true,
|
375
|
+
}
|
376
|
+
rescue => e
|
377
|
+
puts "dreader error at #{__callee__}: 'process' specification for :#{virtualcol[:name]} raised an exception at row #{r[r.keys.first][:row_number]}"
|
378
|
+
raise e
|
379
|
+
end
|
278
380
|
end
|
279
381
|
end
|
280
382
|
end
|
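The rewritten `virtual_columns` iterates over every row already read and adds one derived cell per virtual column, marked with `virtual: true`. A small sketch with made-up data shows the resulting row shape. The `table` and `virtualcols` values below are hypothetical, shaped after the structures visible in the diff.

```ruby
# Hypothetical data shaped like @table and @virtualcols in the diff.
table = [
  { name: { value: "John" }, age: { value: 30 } },
  { name: { value: "Jane" }, age: { value: 31 } }
]
virtualcols = [
  { name: :label, process: ->(row) { "#{row[:name][:value]} (#{row[:age][:value]})" } }
]

# Each virtual column receives the whole row and stores its result as a new cell.
table.each do |row|
  virtualcols.each do |virtualcol|
    row[virtualcol[:name]] = { value: virtualcol[:process].call(row), virtual: true }
  end
end
```

Since the derived cell is written back into the row, later virtual columns in the list can read the values produced by earlier ones.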
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: dreader
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.4.
|
4
|
+
version: 0.4.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Adolfo Villafiorita
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2018-
|
11
|
+
date: 2018-05-20 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: bundler
|
@@ -84,6 +84,7 @@ files:
|
|
84
84
|
- examples/wikipedia_big_us_cities/cities_by_state.ods
|
85
85
|
- examples/wikipedia_us_cities/us_cities.rb
|
86
86
|
- examples/wikipedia_us_cities/us_cities.tsv
|
87
|
+
- examples/wikipedia_us_cities/us_cities_bulk_declare.rb
|
87
88
|
- lib/dreader.rb
|
88
89
|
- lib/dreader/version.rb
|
89
90
|
homepage: http://github.com/avillafiorita/dreader
|