dreader 0.4.2 → 1.0.0
- checksums.yaml +4 -4
- data/CHANGELOG.ORG +45 -0
- data/Gemfile.lock +21 -8
- data/README.org +794 -0
- data/dreader.gemspec +6 -4
- data/examples/age/age.rb +22 -6
- data/examples/age_with_multiple_checks/Birthdays.ods +0 -0
- data/examples/age_with_multiple_checks/age_with_multiple_checks.rb +62 -0
- data/examples/template/template_generation.rb +37 -0
- data/examples/wikipedia_big_us_cities/big_us_cities.rb +20 -18
- data/examples/wikipedia_us_cities/us_cities.rb +28 -27
- data/examples/wikipedia_us_cities/us_cities_bulk_declare.rb +22 -22
- data/lib/dreader/column.rb +39 -0
- data/lib/dreader/engine.rb +473 -0
- data/lib/dreader/options.rb +16 -0
- data/lib/dreader/util.rb +71 -0
- data/lib/dreader/version.rb +1 -1
- data/lib/dreader.rb +5 -411
- metadata +59 -25
- data/Changelog.org +0 -20
- data/README.md +0 -469
data/README.md
DELETED
@@ -1,469 +0,0 @@
# Dreader

A simple DSL built on top of [Roo](https://github.com/roo-rb/roo) to
read and process tabular data (CSV, LibreOffice, Excel).

This gem allows you to:

1. specify the structure of some tabular data you want to process
2. debug and check correctness of the data you read
3. read a file and process it, that is, execute code for each cell and
   each row of the file

If your data requires processing that cannot be performed line by
line, you can also access all the data read by the gem and manipulate
it as you need.

The input data can be in CSV (comma or tab separated), LibreOffice,
or Excel format.

We use it to import data into Rails applications, but the gem can be used
in any Ruby application.

The gem should be relatively easy to use, despite its name: *dread*
stands for *d*ata *r*eader.

The gem depends on `roo`, from which we leverage all data
reading/parsing facilities and which allows us to achieve what we want
in about 250 lines of code.

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'dreader'
```

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install dreader

## Usage

### Declare what file you want to read

Require `dreader` and declare an instance of the `Dreader::Engine` class:

```ruby
require 'dreader'

i = Dreader::Engine.new
```

Specify the parsing options, using the following syntax:

```ruby
i.options do
  filename 'example.ods'

  sheet 'Sheet 1'

  first_row 1
  last_row 20
end
```

where:

* (optional) `filename` is the file to read. If not specified, you
  will have to supply a filename when loading the file (see `read`,
  below). The extension determines the file type. **Use `tsv` for
  tab-separated files.**
* (optional) `first_row` is the first line to read (use `2` if your
  file has a header)
* (optional) `last_row` is the last line to read. If not specified, we
  will rely on `roo` to determine the last row
* (optional) `sheet` is the sheet name or number to read from. If not
  specified, the first (default) sheet is used

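To make the row options concrete, here is a minimal plain-Ruby sketch of the selection they describe. It does not use `dreader` or `roo`: `select_rows` is a hypothetical helper, and it assumes that rows are numbered from 1 and that `last_row` is inclusive, which the option descriptions suggest but do not state outright.

```ruby
# Hypothetical helper, not part of dreader: selects the rows that
# `first_row` and `last_row` would delimit, assuming 1-based,
# inclusive row numbers.
def select_rows(rows, first_row: 1, last_row: rows.length)
  rows[(first_row - 1)..(last_row - 1)] || []
end

rows = [
  ['Name', 'Date of birth'], # row 1: header
  ['Forest Whitaker', 'July 15, 1961'],
  ['Daniel Day-Lewis', 'April 29, 1957'],
  ['Sean Penn', 'August 17, 1960']
]

# Skip the header by starting from row 2.
select_rows(rows, first_row: 2).each { |row| puts row.join(' | ') }
```

With `first_row: 2` the header is left out, which is the case the `first_row` description mentions.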
### Declare the columns you want to read

Declare the columns you want to read by assigning them a name and a
column reference:

```ruby
# we will access column A in Ruby code using :name
i.column :name do
  colref 'A'
end
```

You can also specify two Ruby blocks, `process` and `check`, to
preprocess data and to check for errors.

For instance, given the following file:

| Name             | Date of birth   |
|------------------|-----------------|
| Forest Whitaker  | July 15, 1961   |
| Daniel Day-Lewis | April 29, 1957  |
| Sean Penn        | August 17, 1960 |

we could use the following declaration to specify the data to read:

```ruby
# we want to access column 1 using :name
# :name should be non nil and of length greater than 0
i.column :name do
  colref 1
  check do |x|
    x and x.length > 0
  end
end

# we want to access column 2 (Date of birth) using :birthdate
i.column :birthdate do
  colref 2

  # make sure the column is transformed into a Date
  process do |x|
    Date.parse(x)
  end

  # check birthdate is a date (check is invoked on the value returned
  # by process)
  check do |x|
    x.class == Date
  end
end

# we don't care about any other column (and, therefore,
# we are done with our declarations)
```

If there are different columns that you want to read and process in
the same way, you can use the method `bulk_declare`, which accepts a
hash as input.

For instance:

```ruby
i.bulk_declare({a: 'A', b: 'B'})
```

is equivalent to:

```ruby
i.column :a do
  colref 'A'
end

i.column :b do
  colref 'B'
end
```

The method also accepts a code block, which allows you to define a common
`process` function for all columns. In this case, **don't forget to put
the hash in parentheses, or the Ruby parser won't be able to
distinguish the hash from the code block.** For instance:

```ruby
i.bulk_declare({a: 'A', b: 'B'}) do
  process do |cell|
    # ...
  end
end
```

There is an example of `bulk_declare` in the examples directory
([us_cities_bulk_declare.rb](examples/wikipedia_us_cities/us_cities_bulk_declare.rb)).

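The warning about parentheses can be checked without loading the gem. The sketch below only asks the Ruby parser whether a snippet is syntactically valid; `valid_ruby?` and the method name `declare` are hypothetical and used only for this check, and `RubyVM::InstructionSequence` is specific to CRuby.

```ruby
# Returns true if `source` is syntactically valid Ruby. The code is
# only compiled, never executed (CRuby only).
def valid_ruby?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

# `declare` is a hypothetical method name used only for this check.
puts valid_ruby?("declare {a: 'A', b: 'B'}")         # braces are read as a block
puts valid_ruby?("declare({a: 'A', b: 'B'})")        # hash passed as an argument
puts valid_ruby?("declare({a: 'A', b: 'B'}) do end") # hash argument plus a block
```

Without parentheses, the opening brace right after the method name is taken as the start of a block, so the hash literal never reaches the method.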
**Remarks:**

1. The column name can be anything Ruby can use as a Hash key. You
   can use symbols, strings, and even object instances, if you wish to
   do so.

2. `colref` can be a string (e.g., `'A'`) or an integer, in which case
   the first column is one.

3. **You need to declare only the columns you want to import.** For
   instance, we could skip the declaration for column 1, if 'Date of
   Birth' is the only data we want to import.

4. If `process` and `check` are specified, then `check` will receive
   the result of invoking `process` on the cell value. This makes
   sense if `process` is used to make the cell value more accessible to
   Ruby code (e.g., transforming a string into an integer).

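The ordering described in remark 4 can be sketched in plain Ruby, without the gem. The lambdas below stand in for the blocks passed to `process` and `check`; `read_cell`, the constant names, and the keys of the returned hash are assumptions made for this illustration, not dreader's API.

```ruby
require 'date'

# Hypothetical stand-ins for the blocks passed to `process` and `check`.
BIRTHDATE_PROCESS = ->(cell) { Date.parse(cell) }
BIRTHDATE_CHECK = ->(value) { value.is_a?(Date) }

# Applies `process` first and hands its result to `check`.
# The keys of the returned hash are illustrative only.
def read_cell(raw, process: BIRTHDATE_PROCESS, check: BIRTHDATE_CHECK)
  value = process.call(raw)
  { value: value, check_passed: check.call(value) }
end

cell = read_cell('July 15, 1961')
puts cell[:value].class   # the check sees a Date, not the original string
puts cell[:check_passed]
```

Because `check` receives the processed value, it can test for a `Date` even though the cell in the file is text.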
### Add virtual columns, if you want

Sometimes it is convenient to aggregate or otherwise manipulate the
data read from each row before doing the actual processing.

For instance, we might have a table with dates of birth, while we are
really interested in the age of people.

In such cases, we can use a virtual column. A **virtual column** allows
one to add a column to the data read. The value of the column for
each row is computed using the values of other cells.

Virtual columns are declared similarly to columns. Thus, for instance,
the following declaration adds an `age` column to each row of the data
we read from the previous example:

```ruby
i.virtual_column :age do
  process do |row|
    # `compute_birthday` has to be defined
    compute_birthday(row[:birthdate])
  end
end
```

Virtual columns are, of course, available to the `mapping` directive
(see below).

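The declaration above leaves `compute_birthday` undefined. As one possible helper for an `age` column, the sketch below computes an age in whole years from a `Date`. The name `age_on` and its `today` parameter are assumptions made for this example, and the helper expects a `Date`, so the caller would pass it the processed cell value.

```ruby
require 'date'

# Hypothetical helper: age in whole years on the day `today`.
def age_on(birthdate, today = Date.today)
  age = today.year - birthdate.year
  # Subtract one year if the birthday has not occurred yet in `today`'s year.
  had_birthday = today.month > birthdate.month ||
                 (today.month == birthdate.month && today.day >= birthdate.day)
  had_birthday ? age : age - 1
end

puts age_on(Date.new(1961, 7, 15), Date.new(2020, 7, 14)) # day before the birthday
puts age_on(Date.new(1961, 7, 15), Date.new(2020, 7, 15)) # on the birthday
```

Passing `today` explicitly keeps the result reproducible; leaving it out uses the current date.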
### Specify how to process data

Finally we can specify how we process lines, using the `mapping`
directive. Mapping takes an arbitrary piece of Ruby code, which can
reference the fields of a row.

For instance:

```ruby
i.mapping do |row|
  puts "#{row[:name][:value]} is #{row[:age][:value]} years old"
end
```

Notice that the data read from each row of our input data is stored in
a hash. The hash uses column names as the primary key and stores
the values in the `:value` key.

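To see the shape of a row without reading a file, the sketch below builds one by hand and formats it the way the `mapping` block above does. `SAMPLE_ROW` and `describe_row` are hypothetical names, and the row holds only the `:value` key, which is the only one the block uses.

```ruby
# A hand-built row: column names as keys, each holding a hash whose
# :value key stores the cell value.
SAMPLE_ROW = {
  name: { value: 'John' },
  age:  { value: 30 }
}.freeze

# Hypothetical helper doing what the `mapping` block does, but
# returning the string instead of printing it.
def describe_row(row)
  "#{row[:name][:value]} is #{row[:age][:value]} years old"
end

puts describe_row(SAMPLE_ROW)
```

Returning the string rather than printing it makes the row logic easy to test separately from the file reading.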
### Start working with the data

We are now all set and we can start working with the data.

First use `read` or `load` (synonyms) to read all data and put it
into a `@table` instance variable.

```ruby
i.read
```

After reading the file we can use `errors` to see whether any of the
`check` functions failed:

```ruby
array_of_strings = i.errors
array_of_strings.each do |error_line|
  puts error_line
end
```

We can then use `virtual_columns` to process data and generate the
virtual columns:

```ruby
i.virtual_columns
```

Finally we can use the `process` function to execute the `mapping`
directive on each line read from the file.

```ruby
i.process
```

Look in the examples directory for further details and a couple of
working examples.

**Remark.** You can override some of the defaults by passing a hash as
argument to `read`. For instance:

```ruby
i.read filename: another_filepath
```

will read data from `another_filepath`, rather than from the filename
specified in the options. This might be useful, for instance, if the
same specification has to be used for different files.

## Digging deeper

If you need to perform processing that cannot be
captured with `process` (that is, by processing the data row by row),
you can also directly access all data read, using the `table` method:

```ruby
i.read
i.table
# an array of hashes (one hash per row)
```

In more detail, the `read` method fills a `@table` instance variable
with an array of hashes. Each hash represents a line of the file.

Each hash contains one key per column, following your specification.
Its value is, in turn, a hash with the following structure:

```ruby
{
  value: ...,      # the result of calling process on the cell
  row_number: ..., # the row number
  col_number: ..., # the column number
  error: ...       # the result of calling check on the cell processed value
}
```

(Note that virtual columns only store `value` and a Boolean `virtual`,
which is always `true`.)

Thus, for instance, given the example above:

```ruby
i.table
# => [ { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
#        age: { value: 30, row_number: 1, col_number: 2, errors: nil } },
#      { name: { value: "Jane", row_number: 2, col_number: 1, errors: nil },
#        age: { value: 31, row_number: 2, col_number: 2, errors: nil } } ]
```

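As an example of processing that cannot be done row by row, the sketch below averages one column over a whole table. It works on a hand-built array shaped like the `i.table` output shown above; `SAMPLE_TABLE` and `average_of` are hypothetical names, and no file is read.

```ruby
# Sample data shaped like the `i.table` output shown above.
SAMPLE_TABLE = [
  { name: { value: 'John', row_number: 1, col_number: 1, errors: nil },
    age:  { value: 30, row_number: 1, col_number: 2, errors: nil } },
  { name: { value: 'Jane', row_number: 2, col_number: 1, errors: nil },
    age:  { value: 31, row_number: 2, col_number: 2, errors: nil } }
].freeze

# Hypothetical helper: averages the :value of one column over all rows.
# Assumes the table is not empty.
def average_of(table, column)
  values = table.map { |row| row[column][:value] }
  values.sum.to_f / values.length
end

puts average_of(SAMPLE_TABLE, :age)
```

The same pattern (map each row to the `:value` of a column, then aggregate) applies to sums, minima, or grouping.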
## Simplifying the hash with the data read

The `Dreader::Util` class provides some functions to simplify and
restructure the hashes built by `dreader`.

`Dreader::Util.simplify hash` simplifies the hash passed as input by
removing all information but the value and making the value
accessible directly from the name of the column.

```ruby
Dreader::Util.simplify i.table[0]
# => {name: "John", age: 30}
```

`Dreader::Util.slice hash, keys` and `Dreader::Util.clean hash,
keys`, where `keys` is an array of keys, are respectively used to
select or remove some keys from `hash`.

```ruby
i.table[0]
# => { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
#      age: { value: 30, row_number: 1, col_number: 2, errors: nil } }

Dreader::Util.slice i.table[0], [:name]
# => { name: { value: "John", row_number: 1, col_number: 1, errors: nil } }

Dreader::Util.clean i.table[0], [:name]
# => { age: { value: 30, row_number: 1, col_number: 2, errors: nil } }
```

The methods `slice` and `clean` are more useful when used in
conjunction with `simplify`:

```ruby
hash = Dreader::Util.simplify i.table[0]
# => {name: "John", age: 30}

Dreader::Util.slice hash, [:age]
# => {age: 30}

Dreader::Util.clean hash, [:age]
# => {name: "John"}
```

Notice that the output produced by `slice` and `simplify` is a hash
which can be used to create an `ActiveRecord` object.

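For readers who want to see what these transformations amount to, here is a minimal plain-Ruby sketch of the behaviour described above. The functions `simplify_row`, `slice_row`, and `clean_row` are illustrative re-implementations written for this example; they are not the code of `Dreader::Util` and may differ from it in edge cases.

```ruby
# Keep only the :value of every column.
def simplify_row(row)
  row.transform_values { |cell| cell[:value] }
end

# Keep only the listed keys.
def slice_row(row, keys)
  row.select { |key, _| keys.include?(key) }
end

# Drop the listed keys.
def clean_row(row, keys)
  row.reject { |key, _| keys.include?(key) }
end

row = {
  name: { value: 'John', row_number: 1, col_number: 1, errors: nil },
  age:  { value: 30, row_number: 1, col_number: 2, errors: nil }
}

simple = simplify_row(row)
p simple
p slice_row(simple, [:age])
p clean_row(simple, [:age])
```

Simplifying first and slicing afterwards yields a flat hash of plain values, which is the form the text above suggests passing to a model constructor.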
Finally, the `Dreader::Util.restructure` method helps build hashes
to create
[ActiveModel](http://api.rubyonrails.org/classes/ActiveModel/Model.html)
objects with nested attributes:

```ruby
hash = {name: "John", surname: "Doe", address: "Unknown", city: "NY" }

Dreader::Util.restructure hash, [:name, :surname], :address_attributes, [:address, :city]
# => {name: "John", surname: "Doe", address_attributes: {address: "Unknown", city: "NY"}}
```

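The nesting performed by `restructure` can be reproduced with two calls to `Hash#slice`. The sketch below is an illustrative re-implementation based only on the example above, not the gem's code; `restructure_row` and its parameter names are assumptions.

```ruby
# Keeps `top_keys` at the top level and moves `nested_keys` under
# the key `nested_name`.
def restructure_row(hash, top_keys, nested_name, nested_keys)
  hash.slice(*top_keys).merge(nested_name => hash.slice(*nested_keys))
end

person = { name: 'John', surname: 'Doe', address: 'Unknown', city: 'NY' }

p restructure_row(person, [:name, :surname], :address_attributes, [:address, :city])
```

Keys listed in neither array are dropped by this sketch; whether the gem does the same is not stated in the text above.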
## Debugging your specification

If you are not sure about what is going on (like I often am when
reading tabular data), you can use the `debug` function, which prints
the current configuration, reads some records from your files, and
prints them to standard output:

```ruby
i.debug
i.debug n: 40 # read 40 lines (from first_row, if the option is declared)
i.debug n: 40, filename: filepath # like above, but read from filepath
```

Another possibility is getting the value of the `@table` variable,
which contains all the data read.

By default `debug` invokes the `process` and `check` directives. Pass
the following options if you want to disable this behavior; this
might be useful, for instance, if you intend to check only what data
is read:

```ruby
i.debug process: false, check: false
```

Notice that `check` implies `process`, since `check` is invoked on the
output of the `process` directive.

## Changelog

See [[Changelog]].

## Known Limitations

At the moment:

- it is not possible to specify column references using header names
  (like Roo does).
- it is not possible to pass options to the file readers. As a
  consequence tab-separated files must have the `.tsv` extension to be
  correctly parsed.
- some testing wouldn't hurt.

## Known Bugs

No known bugs and an unknown number of unknown bugs.

(See the open issues for the known bugs.)

## Development

After checking out the repo, run `bin/setup` to install dependencies. You can
also run `bin/console` for an interactive prompt that will allow you to
experiment.

To install this gem onto your local machine, run `bundle exec rake
install`. To release a new version, update the version number in `version.rb`,
and then run `bundle exec rake release`, which will create a git tag for the
version, push git commits and tags, and push the `.gem` file to
[rubygems.org](https://rubygems.org).

## Contributing

Bug reports and pull requests are welcome on GitHub at
https://github.com/avillafiorita/dreader.

## License

The gem is available as open source under the terms of the [MIT
License](https://opensource.org/licenses/MIT).