dreader 0.5.0 → 1.1.0
- checksums.yaml +4 -4
- data/CHANGELOG.org +92 -0
- data/Gemfile.lock +20 -7
- data/README.org +821 -0
- data/dreader.gemspec +6 -4
- data/examples/age/age.rb +41 -25
- data/examples/age_with_multiple_checks/Birthdays.ods +0 -0
- data/examples/age_with_multiple_checks/age_with_multiple_checks.rb +64 -0
- data/examples/local_vars/local_vars.rb +28 -0
- data/examples/template/template_generation.rb +37 -0
- data/examples/wikipedia_big_us_cities/big_us_cities.rb +24 -20
- data/examples/wikipedia_us_cities/us_cities.rb +31 -28
- data/examples/wikipedia_us_cities/us_cities_bulk_declare.rb +25 -23
- data/lib/dreader/column.rb +39 -0
- data/lib/dreader/engine.rb +495 -0
- data/lib/dreader/options.rb +16 -0
- data/lib/dreader/util.rb +86 -0
- data/lib/dreader/version.rb +1 -1
- data/lib/dreader.rb +5 -411
- metadata +60 -24
- data/Changelog.org +0 -20
- data/README.md +0 -469
data/README.md
DELETED
@@ -1,469 +0,0 @@
|
|
1
|
-
# Dreader
|
2
|
-
|
3
|
-
A simple DSL built on top of [Roo](https://github.com/roo-rb/roo) to
|
4
|
-
read and process tabular data (CSV, LibreOffice, Excel).
|
5
|
-
|
6
|
-
This gem allows you to:
|
7
|
-
|
8
|
-
1. specify the structure of some tabular data you want to process
|
9
|
-
2. debug and check correctness of the data you read
|
10
|
-
3. read a file and process it, that is, execute code for each cell and
|
11
|
-
each row of the file
|
12
|
-
|
13
|
-
If your data require elaborations which cannot be performed line by
|
14
|
-
line, you can also access all the data read by the gem and manipulate
|
15
|
-
it as you need.
|
16
|
-
|
17
|
-
The input data can be in CSV (comma or tab separated), LibreOffice,
|
18
|
-
and Excel.
|
19
|
-
|
20
|
-
We use it to import data into Rails applications, but the gem can be used
|
21
|
-
in any Ruby application.
|
22
|
-
|
23
|
-
The gem should be relatively easy to use, despite its name: *dread*
|
24
|
-
stands for *d*ata *r*eader.
|
25
|
-
|
26
|
-
The gem depends on `roo`, from which we leverage all data
|
27
|
-
reading/parsing facilities and which allows us to achieve what we want
|
28
|
-
in about 250 lines of code.
|
29
|
-
|
30
|
-
|
31
|
-
## Installation
|
32
|
-
|
33
|
-
Add this line to your application's Gemfile:
|
34
|
-
|
35
|
-
```ruby
|
36
|
-
gem 'dreader'
|
37
|
-
```
|
38
|
-
|
39
|
-
And then execute:
|
40
|
-
|
41
|
-
$ bundle
|
42
|
-
|
43
|
-
Or install it yourself as:
|
44
|
-
|
45
|
-
$ gem install dreader
|
46
|
-
|
47
|
-
## Usage
|
48
|
-
|
49
|
-
### Declare what file you want to read
|
50
|
-
|
51
|
-
Require `dreader` and declare an instance of the `Dreader::Engine` class:
|
52
|
-
|
53
|
-
```ruby
|
54
|
-
require 'dreader'
|
55
|
-
|
56
|
-
i = Dreader::Engine.new
|
57
|
-
```
|
58
|
-
|
59
|
-
Specify the parsing options using the following syntax:
|
60
|
-
|
61
|
-
```ruby
|
62
|
-
i.options do
|
63
|
-
filename 'example.ods'
|
64
|
-
|
65
|
-
sheet 'Sheet 1'
|
66
|
-
|
67
|
-
first_row 1
|
68
|
-
last_row 20
|
69
|
-
end
|
70
|
-
```
|
71
|
-
|
72
|
-
where:
|
73
|
-
|
74
|
-
* (optional) `filename` is the file to read. If not specified, you
|
75
|
-
will have to supply a filename when loading the file (see `read`,
|
76
|
-
below). The extension determines the file type. **Use `tsv` for
|
77
|
-
tab-separated files.**
|
78
|
-
* (optional) `first_row` is the first line to read (use `2` if your
|
79
|
-
file has a header)
|
80
|
-
* (optional) `last_row` is the last line to read. If not specified, we
|
81
|
-
will rely on `roo` to determine the last row
|
82
|
-
* (optional) `sheet` is the sheet name or number to read from. If not
|
83
|
-
specified, the first (default) sheet is used
|
84
|
-
|
85
|
-
### Declare the columns you want to read
|
86
|
-
|
87
|
-
Declare the columns you want to read by assigning them a name and a
|
88
|
-
column reference:
|
89
|
-
|
90
|
-
```ruby
|
91
|
-
# we will access column A in Ruby code using :name
|
92
|
-
i.column :name do
|
93
|
-
colref 'A'
|
94
|
-
end
|
95
|
-
```
|
96
|
-
|
97
|
-
You can also specify two Ruby blocks, `process` and `check`, to
|
98
|
-
preprocess data and to check for errors.
|
99
|
-
|
100
|
-
For instance, given the following file:
|
101
|
-
|
102
|
-
| Name | Date of birth |
|
103
|
-
|------------------|-----------------|
|
104
|
-
| Forest Whitaker | July 15, 1961 |
|
105
|
-
| Daniel Day-Lewis | April 29, 1957 |
|
106
|
-
| Sean Penn | August 17, 1960 |
|
107
|
-
|
108
|
-
we could use the following declaration to specify the data to read:
|
109
|
-
|
110
|
-
```ruby
|
111
|
-
# we want to access column 1 using :name
|
112
|
-
# :name should be non nil and of length greater than 0
|
113
|
-
i.column :name do
|
114
|
-
colref 1
|
115
|
-
check do |x|
|
116
|
-
x and x.length > 0
|
117
|
-
end
|
118
|
-
end
|
119
|
-
|
120
|
-
# we want to access column 2 (Date of birth) using :birthdate
|
121
|
-
i.column :birthdate do
|
122
|
-
colref 2
|
123
|
-
|
124
|
-
# make sure the column is transformed into a date
|
125
|
-
process do |x|
|
126
|
-
Date.parse(x)
|
127
|
-
end
|
128
|
-
|
129
|
-
# check birthdate is a date (check is invoked on the value returned
|
130
|
-
# by process)
|
131
|
-
check do |x|
|
132
|
-
x.class == Date
|
133
|
-
end
|
134
|
-
end
|
135
|
-
|
136
|
-
# we don't care about any other column (and, therefore,
|
137
|
-
# we are done with our declarations)
|
138
|
-
```
|
139
|
-
|
140
|
-
If there are different columns that you want to read and process in
|
141
|
-
the same way, you can use the method `bulk_declare`, which accepts a
|
142
|
-
hash as input.
|
143
|
-
|
144
|
-
For instance:
|
145
|
-
|
146
|
-
```ruby
|
147
|
-
i.bulk_declare({a: 'A', b: 'B'})
|
148
|
-
```
|
149
|
-
|
150
|
-
is equivalent to:
|
151
|
-
|
152
|
-
```ruby
|
153
|
-
i.column :a do
|
154
|
-
colref 'A'
|
155
|
-
end
|
156
|
-
|
157
|
-
i.column :b do
|
158
|
-
colref 'B'
|
159
|
-
end
|
160
|
-
```
|
161
|
-
|
162
|
-
The method also accepts a code block, which allows you to define a common
|
163
|
-
`process` function for all columns. In that case, **don't forget to put
|
164
|
-
the hash in parentheses, or the Ruby parser won't be able to
|
165
|
-
distinguish the hash from the code block.** For instance:
|
166
|
-
|
167
|
-
```ruby
|
168
|
-
i.bulk_declare({a: 'A', b: 'B'}) do
|
169
|
-
process do |cell|
|
170
|
-
...
|
171
|
-
end
|
172
|
-
end
|
173
|
-
```
|
174
|
-
|
175
|
-
There is an example of `bulk_declare` in the examples directory:
|
176
|
-
([us_cities_bulk_declare.rb](examples/wikipedia_us_cities/us_cities_bulk_declare.rb)).
|
177
|
-
|
178
|
-
**Remarks:**
|
179
|
-
|
180
|
-
1. the column name can be anything Ruby can use as a Hash key. You
|
181
|
-
can use symbols, strings, and even object instances, if you wish to
|
182
|
-
do so.
|
183
|
-
|
184
|
-
2. `colref` can be a string (e.g., `'A'`) or an integer, in which case
|
185
|
-
the first column is one
|
186
|
-
|
187
|
-
3. **you need to declare only the columns you want to import.** For
|
188
|
-
instance, we could skip the declaration for column 1, if 'Date of
|
189
|
-
Birth' is the only data we want to import
|
190
|
-
|
191
|
-
4. If `process` and `check` are specified, then `check` will receive
|
192
|
-
the result of invoking `process` on the cell value. This makes
|
193
|
-
sense if process is used to make the cell value more accessible to
|
194
|
-
ruby code (e.g., transforming a string into an integer).
|
195
|
-
|
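Remark 4 can be illustrated without the gem. The following is a minimal plain-Ruby sketch, not dreader's implementation: `PROCESS`, `CHECK`, and `process_then_check` are hypothetical names standing in for the blocks passed to `process` and `check` in the `:birthdate` declaration.

```ruby
require 'date'

# Hypothetical stand-ins for the blocks given to `process` and `check`.
PROCESS = ->(cell) { Date.parse(cell) }
CHECK   = ->(value) { value.is_a?(Date) }

# Apply `process` first, then run `check` on the processed value
# (not on the raw cell content).
def process_then_check(raw)
  value = PROCESS.call(raw)
  [value, CHECK.call(value)]
end

value, ok = process_then_check("July 15, 1961")
puts "#{value} valid=#{ok}"
```

Because `check` sees the processed value, it can test for a `Date` even though the cell in the file holds a string.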
196
|
-
|
197
|
-
### Add virtual columns, if you want
|
198
|
-
|
199
|
-
Sometimes it is convenient to aggregate or otherwise manipulate the
|
200
|
-
data read from each row before doing the actual processing.
|
201
|
-
|
202
|
-
For instance, we might have a table with dates of birth, while we are
|
203
|
-
really interested in the age of people.
|
204
|
-
|
205
|
-
In such cases, we can use a virtual column. A **virtual column** allows
|
206
|
-
one to add a column to the data read. The value of the column for
|
207
|
-
each row is computed using the values of other cells.
|
208
|
-
|
209
|
-
Virtual columns are declared similarly to columns. Thus, for instance,
|
210
|
-
the following declaration adds an `age` column to each row of the data
|
211
|
-
we read from the previous example:
|
212
|
-
|
213
|
-
```ruby
|
214
|
-
i.virtual_column :age do
|
215
|
-
process do |row|
|
216
|
-
# `compute_birthday` has to be defined
|
217
|
-
compute_birthday(row[:birthdate])
|
218
|
-
end
|
219
|
-
end
|
220
|
-
```
|
221
|
-
|
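The declaration above leaves `compute_birthday` undefined. One possible definition is sketched here, assuming it receives a `Date` and should return the age in whole years. The body and the optional `today` parameter are assumptions made for illustration. Depending on how the row is passed to the block, you may need to pass `row[:birthdate][:value]` rather than `row[:birthdate]`.

```ruby
require 'date'

# Hypothetical helper: age in whole years on a given day.
def compute_birthday(birthdate, today = Date.today)
  age = today.year - birthdate.year
  # Subtract one year if this year's birthday has not occurred yet.
  had_birthday = today.month > birthdate.month ||
                 (today.month == birthdate.month && today.day >= birthdate.day)
  had_birthday ? age : age - 1
end

# The day before the 59th birthday.
puts compute_birthday(Date.new(1961, 7, 15), Date.new(2020, 7, 14)) # prints 58
```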
222
|
-
Virtual columns are, of course, available to the `mapping` directive
|
223
|
-
(see below).
|
224
|
-
|
225
|
-
|
226
|
-
### Specify how to process data
|
227
|
-
|
228
|
-
Finally we can specify how we process lines, using the `mapping`
|
229
|
-
directive. Mapping takes an arbitrary piece of Ruby code, which can
|
230
|
-
reference the fields of a row.
|
231
|
-
|
232
|
-
For instance:
|
233
|
-
|
234
|
-
```ruby
|
235
|
-
i.mapping do |row|
|
236
|
-
puts "#{row[:name][:value]} is #{row[:age][:value]} years old"
|
237
|
-
end
|
238
|
-
```
|
239
|
-
|
240
|
-
Notice that the data read from each row of our input data is stored in
|
241
|
-
a hash. The hash uses column names as the primary key and stores
|
242
|
-
the values in the `:value` key.
|
243
|
-
|
244
|
-
|
245
|
-
### Start working with the data
|
246
|
-
|
247
|
-
We are now all set and we can start working with the data.
|
248
|
-
|
249
|
-
First use `read` or `load` (synonyms), to read all data and put it
|
250
|
-
into a `@table` instance variable.
|
251
|
-
|
252
|
-
```ruby
|
253
|
-
i.read
|
254
|
-
```
|
255
|
-
|
256
|
-
After reading the file we can use `errors` to see whether any of the
|
257
|
-
`check` functions failed:
|
258
|
-
|
259
|
-
```ruby
|
260
|
-
array_of_strings = i.errors
|
261
|
-
array_of_strings.each do |error_line|
|
262
|
-
puts error_line
|
263
|
-
end
|
264
|
-
```
|
265
|
-
|
266
|
-
We can then use `virtual_columns` to process data and generate the
|
267
|
-
virtual columns:
|
268
|
-
|
269
|
-
```ruby
|
270
|
-
i.virtual_columns
|
271
|
-
```
|
272
|
-
|
273
|
-
Finally we can use the `process` function to execute the `mapping`
|
274
|
-
directive on each line read from the file.
|
275
|
-
|
276
|
-
```ruby
|
277
|
-
i.process
|
278
|
-
```
|
279
|
-
|
280
|
-
Look in the examples directory for further details and a couple of
|
281
|
-
working examples.
|
282
|
-
|
283
|
-
**Remark.** You can override some of the defaults by passing a hash as
|
284
|
-
argument to read. For instance:
|
285
|
-
|
286
|
-
```ruby
|
287
|
-
i.read filename: another_filepath
|
288
|
-
```
|
289
|
-
|
290
|
-
will read data from `another_filepath`, rather than from the filename
|
291
|
-
specified in the options. This might be useful, for instance, if the
|
292
|
-
same specification has to be used for different files.
|
293
|
-
|
294
|
-
|
295
|
-
## Digging deeper
|
296
|
-
|
297
|
-
If you need to perform more elaborations on the data which cannot be
|
298
|
-
captured with `process` (that is, by processing the data row by row),
|
299
|
-
you can also directly access all data read, using the `table` method:
|
300
|
-
|
301
|
-
```ruby
|
302
|
-
i.read
|
303
|
-
i.table
|
304
|
-
# an array of hashes (one hash per row)
|
305
|
-
```
|
306
|
-
|
307
|
-
In more detail, the `read` method fills a `@table` instance variable
|
308
|
-
with an array of hashes. Each hash represents a line of the file.
|
309
|
-
|
310
|
-
Each hash contains one key per column, following your specification.
|
311
|
-
Its value is, in turn, a hash with the following structure:
|
312
|
-
|
313
|
-
```ruby
|
314
|
-
{
|
315
|
-
value: ..., # the result of calling process on the cell
|
316
|
-
row_number: ..., # the row number
|
317
|
-
col_number: ..., # the column number
|
318
|
-
error: ... # the result of calling check on the cell processed value
|
319
|
-
}
|
320
|
-
```
|
321
|
-
|
322
|
-
(Note that virtual columns only store `value` and a Boolean `virtual`,
|
323
|
-
which is always `true`.)
|
324
|
-
|
325
|
-
Thus, for instance, given the example above:
|
326
|
-
|
327
|
-
```ruby
|
328
|
-
i.table
|
329
|
-
[ { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
|
330
|
-
age: { value: 30, row_number: 1, col_number: 2, errors: nil } },
|
331
|
-
{ name: { value: "Jane", row_number: 2, col_number: 1, errors: nil },
|
332
|
-
age: { value: 31, row_number: 2, col_number: 2, errors: nil } } ]
|
333
|
-
```
|
334
|
-
|
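As an illustration of work that cannot be done row by row, the sketch below averages one column over a table shaped like the output above. It uses a literal array instead of calling the gem. `TABLE` and `column_average` are hypothetical names introduced for this example.

```ruby
# Sample data shaped like the `i.table` output shown above.
TABLE = [
  { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
    age:  { value: 30, row_number: 1, col_number: 2, errors: nil } },
  { name: { value: "Jane", row_number: 2, col_number: 1, errors: nil },
    age:  { value: 31, row_number: 2, col_number: 2, errors: nil } }
].freeze

# Hypothetical helper: mean of the :value entries of one column.
def column_average(table, column)
  values = table.map { |row| row[column][:value] }
  values.sum.to_f / values.size
end

puts column_average(TABLE, :age) # prints 30.5
```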
335
|
-
## Simplifying the hash with the data read
|
336
|
-
|
337
|
-
The `Dreader::Util` class provides some functions to simplify and
|
338
|
-
restructure the hashes built by `dreader`.
|
339
|
-
|
340
|
-
`Dreader::Util.simplify hash` simplifies the hash passed as input by
|
341
|
-
removing all information but the value and making the value
|
342
|
-
accessible directly from the name of the column.
|
343
|
-
|
344
|
-
```ruby
|
345
|
-
Dreader::Util.simplify i.table[0]
|
346
|
-
{name: "John", age: 30}
|
347
|
-
```
|
348
|
-
|
349
|
-
`Dreader::Util.slice hash, keys` and `Dreader::Util.clean hash,
|
350
|
-
keys`, where `keys` is an array of keys, are respectively used to
|
351
|
-
select or remove some keys from `hash`.
|
352
|
-
|
353
|
-
```ruby
|
354
|
-
i.table[0]
|
355
|
-
{ name: { value: "John", row_number: 1, col_number: 1, errors: nil },
|
356
|
-
age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
|
357
|
-
|
358
|
-
Dreader::Util.slice i.table[0], :name
|
359
|
-
{name: { value: "John", row_number: 1, col_number: 1, errors: nil }}
|
360
|
-
|
361
|
-
Dreader::Util.clean i.table[0], :name
|
362
|
-
{age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
|
363
|
-
```
|
364
|
-
|
365
|
-
The methods `slice` and `clean` are more useful when used in
|
366
|
-
conjunction with `simplify`:
|
367
|
-
|
368
|
-
```ruby
|
369
|
-
hash = Dreader::Util.simplify i.table[0]
|
370
|
-
{name: "John", age: 30}
|
371
|
-
|
372
|
-
Dreader::Util.slice hash, [:age]
|
373
|
-
{age: 30}
|
374
|
-
|
375
|
-
Dreader::Util.clean hash, [:age]
|
376
|
-
{name: "John"}
|
377
|
-
```
|
378
|
-
|
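To make the behaviour described above concrete, here is a plain-Ruby sketch of what the three helpers do to a row hash. These are hypothetical re-implementations written for illustration (`simplify_sketch`, `slice_sketch`, `clean_sketch`), not the code in `Dreader::Util`.

```ruby
# Hypothetical re-implementations, for illustration only.

# Keep only the :value of each column.
def simplify_sketch(row)
  row.transform_values { |cell| cell[:value] }
end

# Keep only the listed keys.
def slice_sketch(hash, keys)
  hash.select { |key, _| keys.include?(key) }
end

# Drop the listed keys.
def clean_sketch(hash, keys)
  hash.reject { |key, _| keys.include?(key) }
end

row = {
  name: { value: "John", row_number: 1, col_number: 1, errors: nil },
  age:  { value: 30, row_number: 1, col_number: 2, errors: nil }
}

simple = simplify_sketch(row)
p simple                        # name and age mapped to their values
p slice_sketch(simple, [:age])  # only the age entry
p clean_sketch(simple, [:age])  # everything except age
```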
379
|
-
Notice that the output produced by `slice` and `simplify` is a hash
|
380
|
-
which can be used to create an `ActiveRecord` object.
|
381
|
-
|
382
|
-
Finally, the `Dreader::Util.restructure` method helps build hashes
|
383
|
-
to create
|
384
|
-
[ActiveModel](http://api.rubyonrails.org/classes/ActiveModel/Model.html)
|
385
|
-
objects with nested attributes:
|
386
|
-
|
387
|
-
```ruby
|
388
|
-
hash = {name: "John", surname: "Doe", address: "Unknown", city: "NY" }
|
389
|
-
|
390
|
-
Dreader::Util.restructure hash, [:name, :surname], :address_attributes, [:address, :city]
|
391
|
-
{name: "John", surname: "Doe", address_attributes: {address: "Unknown", city: "NY"}}
|
392
|
-
```
|
393
|
-
|
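The transformation shown above can be approximated in plain Ruby as follows. This is a sketch under the assumption that the arguments are, in order, the keys to keep at the top level, the name of the nested key, and the keys to move under it. `restructure_sketch` is a hypothetical name, not the gem's implementation.

```ruby
# Hypothetical re-implementation, for illustration only.
# Keeps `top_keys` at the top level and groups `nested_keys`
# under `nested_name`.
def restructure_sketch(hash, top_keys, nested_name, nested_keys)
  top = hash.select { |key, _| top_keys.include?(key) }
  nested = hash.select { |key, _| nested_keys.include?(key) }
  top.merge(nested_name => nested)
end

hash = { name: "John", surname: "Doe", address: "Unknown", city: "NY" }
p restructure_sketch(hash, [:name, :surname], :address_attributes, [:address, :city])
```

The resulting shape matches what Rails expects for nested attributes: the parent's fields at the top level and the child's fields under a `*_attributes` key.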
394
|
-
|
395
|
-
## Debugging your specification
|
396
|
-
|
397
|
-
If you are not sure about what is going on (like I often am when
|
398
|
-
reading tabular data), you can use the `debug` function, which prints
|
399
|
-
the current configuration, reads some records from your files, and
|
400
|
-
shows them to standard output:
|
401
|
-
|
402
|
-
|
403
|
-
```ruby
|
404
|
-
i.debug
|
405
|
-
i.debug n: 40 # read 40 lines (from first_row, if the option is declared)
|
406
|
-
i.debug n: 40, filename: filepath # like above, but read from filepath
|
407
|
-
```
|
408
|
-
|
409
|
-
Another possibility is getting the value of the `@table` variable,
|
410
|
-
which contains all the data read.
|
411
|
-
|
412
|
-
By default `debug` invokes the `process` and `check` directives. Pass
|
413
|
-
the following options, if you want to disable this behavior; this
|
414
|
-
might be useful, for instance, if you intend to check only what data
|
415
|
-
is read:
|
416
|
-
|
417
|
-
```ruby
|
418
|
-
i.debug process: false, check: false
|
419
|
-
```
|
420
|
-
|
421
|
-
Notice that `check` implies `process`, since `check` is invoked on the
|
422
|
-
output of the `process` directive.
|
423
|
-
|
424
|
-
|
425
|
-
## Changelog
|
426
|
-
|
427
|
-
See [[Changelog]].
|
428
|
-
|
429
|
-
|
430
|
-
## Known Limitations
|
431
|
-
|
432
|
-
At the moment:
|
433
|
-
|
434
|
-
- it is not possible to specify column references using header names
|
435
|
-
(like Roo does).
|
436
|
-
- it is not possible to pass options to the file readers. As a
|
437
|
-
consequence tab-separated files must have the `.tsv` extension to be
|
438
|
-
correctly parsed.
|
439
|
-
- some testing wouldn't hurt.
|
440
|
-
|
441
|
-
|
442
|
-
## Known Bugs
|
443
|
-
|
444
|
-
No known bugs and an unknown number of unknown bugs.
|
445
|
-
|
446
|
-
(See the open issues for the known bugs.)
|
447
|
-
|
448
|
-
|
449
|
-
## Development
|
450
|
-
|
451
|
-
After checking out the repo, run `bin/setup` to install dependencies. You can
|
452
|
-
also run `bin/console` for an interactive prompt that will allow you to
|
453
|
-
experiment.
|
454
|
-
|
455
|
-
To install this gem onto your local machine, run `bundle exec rake
|
456
|
-
install`. To release a new version, update the version number in `version.rb`,
|
457
|
-
and then run `bundle exec rake release`, which will create a git tag for the
|
458
|
-
version, push git commits and tags, and push the `.gem` file to
|
459
|
-
[rubygems.org](https://rubygems.org).
|
460
|
-
|
461
|
-
## Contributing
|
462
|
-
|
463
|
-
Bug reports and pull requests are welcome on GitHub at
|
464
|
-
https://github.com/avillafiorita/dreader.
|
465
|
-
|
466
|
-
## License
|
467
|
-
|
468
|
-
The gem is available as open source under the terms of the [MIT
|
469
|
-
License](https://opensource.org/licenses/MIT).
|