dreader 0.5.0 → 1.1.0

data/README.md DELETED
@@ -1,469 +0,0 @@
- # Dreader
-
- A simple DSL built on top of [Roo](https://github.com/roo-rb/roo) to
- read and process tabular data (CSV, LibreOffice, Excel).
-
- This gem allows you to:
-
- 1. specify the structure of some tabular data you want to process
- 2. debug and check the correctness of the data you read
- 3. read a file and process it, that is, execute code for each cell and
-    each row of the file
-
- If your data require elaborations which cannot be performed line by
- line, you can also access all the data read by the gem and manipulate
- it as you need.
-
- The input data can be in CSV (comma or tab separated), LibreOffice,
- or Excel format.
-
- We use it to import data into Rails applications, but the gem can be used
- in any Ruby application.
-
- The gem should be relatively easy to use, despite its name: *dread*
- stands for *d*ata *r*eader.
-
- The gem depends on `roo`, from which we leverage all data
- reading/parsing facilities and which allows us to achieve what we want
- in about 250 lines of code.
-
-
- ## Installation
-
- Add this line to your application's Gemfile:
-
- ```ruby
- gem 'dreader'
- ```
-
- And then execute:
-
-     $ bundle
-
- Or install it yourself as:
-
-     $ gem install dreader
-
- ## Usage
-
- ### Declare what file you want to read
-
- Require `dreader` and declare an instance of the `Dreader::Engine` class:
-
- ```ruby
- require 'dreader'
-
- i = Dreader::Engine.new
- ```
-
- Specify parsing options using the following syntax:
-
- ```ruby
- i.options do
-   filename 'example.ods'
-
-   sheet 'Sheet 1'
-
-   first_row 1
-   last_row 20
- end
- ```
-
- where:
-
- * (optional) `filename` is the file to read. If not specified, you
-   will have to supply a filename when loading the file (see `read`,
-   below). The extension determines the file type. **Use `tsv` for
-   tab-separated files.**
- * (optional) `first_row` is the first line to read (use `2` if your
-   file has a header)
- * (optional) `last_row` is the last line to read. If not specified, we
-   will rely on `roo` to determine the last row
- * (optional) `sheet` is the sheet name or number to read from. If not
-   specified, the first (default) sheet is used
-
- ### Declare the columns you want to read
-
- Declare the columns you want to read by assigning them a name and a
- column reference:
-
- ```ruby
- # we will access column A in Ruby code using :name
- i.column :name do
-   colref 'A'
- end
- ```
-
- You can also specify two Ruby blocks, `process` and `check`, to
- preprocess data and to check for errors.
-
- For instance, given the following file:
-
- | Name             | Date of birth   |
- |------------------|-----------------|
- | Forest Whitaker  | July 15, 1961   |
- | Daniel Day-Lewis | April 29, 1957  |
- | Sean Penn        | August 17, 1960 |
-
- we could use the following declaration to specify the data to read:
-
- ```ruby
- # Date.parse (used below) lives in Ruby's standard `date` library
- require 'date'
-
- # we want to access column 1 using :name
- # :name should be non nil and of length greater than 0
- i.column :name do
-   colref 1
-   check do |x|
-     x and x.length > 0
-   end
- end
-
- # we want to access column 2 (Date of birth) using :birthdate
- i.column :birthdate do
-   colref 2
-
-   # make sure the column is transformed into a Date
-   process do |x|
-     Date.parse(x)
-   end
-
-   # check that birthdate is a Date (check is invoked on the value
-   # returned by process)
-   check do |x|
-     x.class == Date
-   end
- end
-
- # we don't care about any other column (and, therefore,
- # we are done with our declarations)
- ```
-
- If there are several columns that you want to read and process in
- the same way, you can use the method `bulk_declare`, which accepts a
- hash as input.
-
- For instance:
-
- ```ruby
- i.bulk_declare({a: 'A', b: 'B'})
- ```
-
- is equivalent to:
-
- ```ruby
- i.column :a do
-   colref 'A'
- end
-
- i.column :b do
-   colref 'B'
- end
- ```
-
- The method also accepts a code block, which lets you define a common
- `process` function for all columns. In that case, **don't forget to put
- the hash in parentheses, or the Ruby parser won't be able to
- distinguish the hash from the code block.** For instance:
-
- ```ruby
- i.bulk_declare({a: 'A', b: 'B'}) do
-   process do |cell|
-     ...
-   end
- end
- ```
-
- There is an example of `bulk_declare` in the examples directory
- ([us_cities_bulk_declare.rb](examples/wikipedia_us_cities/us_cities_bulk_declare.rb)).
-
- **Remarks:**
-
- 1. The column name can be anything Ruby can use as a Hash key. You
-    can use symbols, strings, and even object instances, if you wish to
-    do so.
-
- 2. `colref` can be a string (e.g., `'A'`) or an integer, in which case
-    the first column is 1.
-
- 3. **You need to declare only the columns you want to import.** For
-    instance, we could skip the declaration for column 1, if 'Date of
-    Birth' is the only data we want to import.
-
- 4. If `process` and `check` are specified, then `check` will receive
-    the result of invoking `process` on the cell value. This makes
-    sense if `process` is used to make the cell value more accessible to
-    Ruby code (e.g., transforming a string into an integer).
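
Remark 4 can be pictured in plain Ruby, without the gem. The sketch below is only an illustration under our own assumptions: the lambdas `process` and `check` stand in for the blocks declared on the `:birthdate` column and are not part of the dreader API.

```ruby
require 'date'

# Hypothetical stand-ins for the `process` and `check` blocks of a column.
process = ->(cell)  { Date.parse(cell) }
check   = ->(value) { value.is_a?(Date) }

raw_cell = "July 15, 1961"
value    = process.call(raw_cell) # a Date, no longer a String
check_ok = check.call(value)      # check sees the Date returned by process

puts "#{value.iso8601}: #{check_ok}"
```

Because `check` runs on the processed value, it can rely on `Date` methods rather than re-parsing the original string.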
-
-
- ### Add virtual columns, if you want
-
- Sometimes it is convenient to aggregate or otherwise manipulate the
- data read from each row before doing the actual processing.
-
- For instance, we might have a table with dates of birth, while we are
- really interested in the age of people.
-
- In such cases, we can use a virtual column. A **virtual column** allows
- one to add a column to the data read. The value of the column for
- each row is computed using the values of other cells.
-
- Virtual columns are declared similarly to columns. Thus, for instance,
- the following declaration adds an `age` column to each row of the data
- we read from the previous example:
-
- ```ruby
- i.virtual_column :age do
-   process do |row|
-     # `compute_birthday` has to be defined
-     compute_birthday(row[:birthdate])
-   end
- end
- ```
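
The helper `compute_birthday` used above is left to the reader. A minimal sketch follows, assuming the helper receives a `Date` and should return the age in whole years; the optional `today` parameter is our own addition, so that the result is reproducible. If your rows store each cell as a hash, pass the cell's `:value` to the helper rather than the cell itself.

```ruby
require 'date'

# Hypothetical helper (not part of dreader): age in whole years on `today`.
def compute_birthday(birthdate, today = Date.today)
  age = today.year - birthdate.year
  # Subtract one year if this year's birthday has not happened yet.
  had_birthday = (today.month > birthdate.month) ||
                 (today.month == birthdate.month && today.day >= birthdate.day)
  had_birthday ? age : age - 1
end

age_before = compute_birthday(Date.new(1961, 7, 15), Date.new(2020, 7, 14))
age_on_day = compute_birthday(Date.new(1961, 7, 15), Date.new(2020, 7, 15))
puts age_before
puts age_on_day
```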
-
- Virtual columns are, of course, available to the `mapping` directive
- (see below).
-
-
- ### Specify how to process data
-
- Finally, we can specify how to process lines, using the `mapping`
- directive. `mapping` takes an arbitrary piece of Ruby code, which can
- reference the fields of a row.
-
- For instance:
-
- ```ruby
- i.mapping do |row|
-   puts "#{row[:name][:value]} is #{row[:age][:value]} years old"
- end
- ```
-
- Notice that the data read from each row of our input data is stored in
- a hash. The hash uses column names as its keys and stores each
- value under the `:value` key.
-
-
- ### Start working with the data
-
- We are now all set and we can start working with the data.
-
- First, use `read` or `load` (synonyms) to read all data and put it
- into a `@table` instance variable:
-
- ```ruby
- i.read
- ```
-
- After reading the file, we can use `errors` to see whether any of the
- `check` functions failed:
-
- ```ruby
- array_of_strings = i.errors
- array_of_strings.each do |error_line|
-   puts error_line
- end
- ```
-
- We can then use `virtual_columns` to process data and generate the
- virtual columns:
-
- ```ruby
- i.virtual_columns
- ```
-
- Finally, we can use the `process` function to execute the `mapping`
- directive on each line read from the file:
-
- ```ruby
- i.process
- ```
-
- Look in the examples directory for further details and a couple of
- working examples.
-
- **Remark.** You can override some of the defaults by passing a hash as
- argument to `read`. For instance:
-
- ```ruby
- i.read filename: another_filepath
- ```
-
- will read data from `another_filepath`, rather than from the filename
- specified in the options. This might be useful, for instance, if the
- same specification has to be used for different files.
-
-
- ## Digging deeper
-
- If you need to perform further elaborations on the data which cannot be
- captured with `process` (that is, by processing the data row by row),
- you can also directly access all data read, using the `table` method:
-
- ```ruby
- i.read
- i.table
- # an array of hashes (one hash per row)
- ```
-
- In more detail, the `read` method fills a `@table` instance variable
- with an array of hashes. Each hash represents a line of the file.
-
- Each hash contains one key per column, following your specification.
- Its value is, in turn, a hash with the following structure:
-
- ```ruby
- {
-   value: ...,      # the result of calling process on the cell
-   row_number: ..., # the row number
-   col_number: ..., # the column number
-   error: ...       # the result of calling check on the cell processed value
- }
- ```
-
- (Note that virtual columns only store `value` and a Boolean `virtual`,
- which is always `true`.)
-
- Thus, for instance, given the example above:
-
- ```ruby
- i.table
- [ { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
-     age: { value: 30, row_number: 1, col_number: 2, errors: nil } },
-   { name: { value: "Jane", row_number: 2, col_number: 1, errors: nil },
-     age: { value: 31, row_number: 2, col_number: 2, errors: nil } } ]
- ```
-
- ## Simplifying the hash with the data read
-
- The `Dreader::Util` class provides some functions to simplify and
- restructure the hashes built by `dreader`.
-
- `Dreader::Util.simplify hash` simplifies the hash passed as input by
- removing all information but the value and making the value
- accessible directly from the name of the column.
-
- ```ruby
- Dreader::Util.simplify i.table[0]
- {name: "John", age: 30}
- ```
-
- `Dreader::Util.slice hash, keys` and `Dreader::Util.clean hash,
- keys`, where `keys` is an array of keys, are respectively used to
- select or remove some keys from `hash`.
-
- ```ruby
- i.table[0]
- { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
-   age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
-
- Dreader::Util.slice i.table[0], [:name]
- {name: { value: "John", row_number: 1, col_number: 1, errors: nil}}
-
- Dreader::Util.clean i.table[0], [:name]
- {age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
- ```
-
- The methods `slice` and `clean` are more useful when used in
- conjunction with `simplify`:
-
- ```ruby
- hash = Dreader::Util.simplify i.table[0]
- {name: "John", age: 30}
-
- Dreader::Util.slice hash, [:age]
- {age: 30}
-
- Dreader::Util.clean hash, [:age]
- {name: "John"}
- ```
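
To make the effect of these helpers concrete, here is a rough plain-Ruby equivalent. It is a sketch that mimics the outputs shown above using core `Hash` methods; it is not the gem's implementation, and the local variable names are ours.

```ruby
# One row, shaped like the hashes shown in the examples above.
row = {
  name: { value: "John", row_number: 1, col_number: 1, errors: nil },
  age:  { value: 30,     row_number: 1, col_number: 2, errors: nil }
}

# Rough equivalent of `simplify`: keep only the :value of each column.
simplified = row.transform_values { |cell| cell[:value] }

# Rough equivalents of `slice` and `clean`: keep or drop the given keys.
keys        = [:age]
only_age    = simplified.slice(*keys)
without_age = simplified.reject { |key, _| keys.include?(key) }

p simplified
p only_age
p without_age
```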
-
- Notice that the output produced by `slice` and `simplify` is a hash
- which can be used to create an `ActiveRecord` object.
-
- Finally, the `Dreader::Util.restructure` method helps build hashes
- to create
- [ActiveModel](http://api.rubyonrails.org/classes/ActiveModel/Model.html)
- objects with nested attributes:
-
- ```ruby
- hash = {name: "John", surname: "Doe", address: "Unknown", city: "NY" }
-
- Dreader::Util.restructure hash, [:name, :surname], :address_attributes, [:address, :city]
- {name: "John", surname: "Doe", address_attributes: {address: "Unknown", city: "NY"}}
- ```
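
The transformation performed by `restructure` can be approximated in a few lines of plain Ruby. The function below, `restructure_sketch`, is a hypothetical stand-in written for illustration; it reproduces the output shown above but makes no claim about how the gem implements it.

```ruby
# Hypothetical stand-in for Dreader::Util.restructure (not the gem's code):
# keep `top_keys` at the top level and nest `nested_keys` under `nested_name`.
def restructure_sketch(hash, top_keys, nested_name, nested_keys)
  top    = hash.slice(*top_keys)
  nested = hash.slice(*nested_keys)
  top.merge(nested_name => nested)
end

hash = { name: "John", surname: "Doe", address: "Unknown", city: "NY" }
result = restructure_sketch(hash, [:name, :surname],
                            :address_attributes, [:address, :city])
p result
```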
-
-
- ## Debugging your specification
-
- If you are not sure about what is going on (like I often am when
- reading tabular data), you can use the `debug` function, which prints
- the current configuration, reads some records from your files, and
- writes them to standard output:
-
-
- ```ruby
- i.debug
- i.debug n: 40 # read 40 lines (from first_row, if the option is declared)
- i.debug n: 40, filename: filepath # like above, but read from filepath
- ```
-
- Another possibility is getting the value of the `@table` variable,
- which contains all the data read.
-
- By default, `debug` invokes the `process` and `check` directives. Pass
- the following options if you want to disable this behavior; this
- might be useful, for instance, if you intend to check only what data
- is read:
-
- ```ruby
- i.debug process: false, check: false
- ```
-
- Notice that `check` implies `process`, since `check` is invoked on the
- output of the `process` directive.
-
-
- ## Changelog
-
- See [[Changelog]].
-
-
- ## Known Limitations
-
- At the moment:
-
- - it is not possible to specify column references using header names
-   (like Roo does).
- - it is not possible to pass options to the file readers. As a
-   consequence, tab-separated files must have the `.tsv` extension to be
-   correctly parsed.
- - some testing wouldn't hurt.
-
-
- ## Known Bugs
-
- No known bugs and an unknown number of unknown bugs.
-
- (See the open issues for the known bugs.)
-
-
- ## Development
-
- After checking out the repo, run `bin/setup` to install dependencies. You can
- also run `bin/console` for an interactive prompt that will allow you to
- experiment.
-
- To install this gem onto your local machine, run `bundle exec rake
- install`. To release a new version, update the version number in `version.rb`,
- and then run `bundle exec rake release`, which will create a git tag for the
- version, push git commits and tags, and push the `.gem` file to
- [rubygems.org](https://rubygems.org).
-
- ## Contributing
-
- Bug reports and pull requests are welcome on GitHub at
- https://github.com/avillafiorita/dreader.
-
- ## License
-
- The gem is available as open source under the terms of the [MIT
- License](https://opensource.org/licenses/MIT).