dreader 0.5.0 → 1.0.0

data/README.md DELETED
@@ -1,469 +0,0 @@
- # Dreader
-
- A simple DSL built on top of [Roo](https://github.com/roo-rb/roo) to
- read and process tabular data (CSV, LibreOffice, Excel).
-
- This gem allows you to:
-
- 1. specify the structure of some tabular data you want to process
- 2. debug and check the correctness of the data you read
- 3. read a file and process it, that is, execute code for each cell and
-    each row of the file
-
- If your data requires processing that cannot be performed line by
- line, you can also access all the data read by the gem and manipulate
- it as you need.
-
- The input data can be in CSV (comma or tab separated), LibreOffice,
- or Excel format.
-
- We use it to import data into Rails applications, but the gem can be
- used in any Ruby application.
-
- The gem should be relatively easy to use, despite its name: *dread*
- stands for *d*ata *r*eader.
-
- The gem depends on `roo`, from which we leverage all data
- reading/parsing facilities and which allows us to achieve what we want
- in about 250 lines of code.
-
-
- ## Installation
-
- Add this line to your application's Gemfile:
-
- ```ruby
- gem 'dreader'
- ```
-
- And then execute:
-
-     $ bundle
-
- Or install it yourself as:
-
-     $ gem install dreader
-
- ## Usage
-
- ### Declare what file you want to read
-
- Require `dreader` and create an instance of the `Dreader::Engine` class:
-
- ```ruby
- require 'dreader'
-
- i = Dreader::Engine.new
- ```
-
- Specify parsing options using the following syntax:
-
- ```ruby
- i.options do
-   filename 'example.ods'
-
-   sheet 'Sheet 1'
-
-   first_row 1
-   last_row 20
- end
- ```
-
- where:
-
- * (optional) `filename` is the file to read. If not specified, you
-   will have to supply a filename when loading the file (see `read`,
-   below). The extension determines the file type. **Use `tsv` for
-   tab-separated files.**
- * (optional) `first_row` is the first line to read (use `2` if your
-   file has a header)
- * (optional) `last_row` is the last line to read. If not specified, we
-   will rely on `roo` to determine the last row
- * (optional) `sheet` is the sheet name or number to read from. If not
-   specified, the first (default) sheet is used
-
- ### Declare the columns you want to read
-
- Declare the columns you want to read by assigning them a name and a
- column reference:
-
- ```ruby
- # we will access column A in Ruby code using :name
- i.column :name do
-   colref 'A'
- end
- ```
-
- You can also specify two Ruby blocks, `process` and `check`, to
- preprocess data and to check for errors.
-
- For instance, given the following file:
-
- | Name             | Date of birth   |
- |------------------|-----------------|
- | Forest Whitaker  | July 15, 1961   |
- | Daniel Day-Lewis | April 29, 1957  |
- | Sean Penn        | August 17, 1960 |
-
- we could use the following declaration to specify the data to read:
-
- ```ruby
- # we want to access column 1 using :name
- # :name should be non nil and of length greater than 0
- i.column :name do
-   colref 1
-   check do |x|
-     x and x.length > 0
-   end
- end
-
- # we want to access column 2 (Date of birth) using :birthdate
- i.column :birthdate do
-   colref 2
-
-   # make sure the column is transformed into a Date
-   # (Date.parse needs `require 'date'`)
-   process do |x|
-     Date.parse(x)
-   end
-
-   # check birthdate is a Date (check is invoked on the value returned
-   # by process)
-   check do |x|
-     x.class == Date
-   end
- end
-
- # we don't care about any other column (and, therefore,
- # we are done with our declarations)
- ```
-
- If there are different columns that you want to read and process in
- the same way, you can use the method `bulk_declare`, which accepts a
- hash as input.
-
- For instance:
-
- ```ruby
- i.bulk_declare({a: 'A', b: 'B'})
- ```
-
- is equivalent to:
-
- ```ruby
- i.column :a do
-   colref 'A'
- end
-
- i.column :b do
-   colref 'B'
- end
- ```
-
- The method also accepts a code block, which allows you to define a
- common `process` function for all columns. In this case, **don't
- forget to put the hash in parentheses, or the Ruby parser won't be
- able to distinguish the hash from the code block.** For instance:
-
- ```ruby
- i.bulk_declare({a: 'A', b: 'B'}) do
-   process do |cell|
-     ...
-   end
- end
- ```
-
- There is an example of `bulk_declare` in the examples directory:
- ([us_cities_bulk_declare.rb](examples/wikipedia_us_cities/us_cities_bulk_declare.rb)).
-
- **Remarks:**
-
- 1. The column name can be anything Ruby can use as a Hash key. You
-    can use symbols, strings, and even object instances, if you wish to
-    do so.
-
- 2. `colref` can be a string (e.g., `'A'`) or an integer, in which case
-    the first column is one.
-
- 3. **You need to declare only the columns you want to import.** For
-    instance, we could skip the declaration for column 1, if 'Date of
-    Birth' is the only data we want to import.
-
- 4. If `process` and `check` are specified, then `check` will receive
-    the result of invoking `process` on the cell value. This makes
-    sense if `process` is used to make the cell value more accessible to
-    Ruby code (e.g., transforming a string into an integer).
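
The order described in remark 4 can be pictured in plain Ruby. The sketch below does not use `dreader`: the `process` and `check` lambdas are hypothetical stand-ins for the blocks passed to a column declaration, and we simply assume that `check` is called on whatever `process` returns.

```ruby
require 'date'

# Hypothetical stand-ins for the blocks given to `process` and `check`.
process = ->(cell) { Date.parse(cell) }
check   = ->(value) { value.is_a?(Date) }

cell  = 'July 15, 1961'     # raw cell content, as read from the file
value = process.call(cell)  # first transform the cell ...
valid = check.call(value)   # ... then check the transformed value

puts "#{value} (valid: #{valid})"
```

Because the check sees a `Date` rather than the original string, it can test properties of the converted value directly.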
-
-
- ### Add virtual columns, if you want
-
- Sometimes it is convenient to aggregate or otherwise manipulate the
- data read from each row before doing the actual processing.
-
- For instance, we might have a table with dates of birth, while we are
- really interested in the age of people.
-
- In such cases, we can use a virtual column. A **virtual column** allows
- one to add a column to the data read. The value of the column for
- each row is computed using the values of other cells.
-
- Virtual columns are declared similarly to columns. Thus, for instance,
- the following declaration adds an `age` column to each row of the data
- we read from the previous example:
-
- ```ruby
- i.virtual_column :age do
-   process do |row|
-     # `compute_birthday` has to be defined
-     compute_birthday(row[:birthdate])
-   end
- end
- ```
-
- Virtual columns are, of course, available to the `mapping` directive
- (see below).
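
The declaration above leaves `compute_birthday` undefined. One possible definition is sketched below. It assumes the helper receives a `Date` (how the birthdate is extracted from `row` is left out), and the optional `today` parameter is our own addition, there only to make the result reproducible.

```ruby
require 'date'

# Hypothetical helper: age in full years at `today`.
def compute_birthday(birthdate, today = Date.today)
  age = today.year - birthdate.year
  # One year less if this year's birthday has not been reached yet.
  comparison = [today.month, today.day] <=> [birthdate.month, birthdate.day]
  comparison >= 0 ? age : age - 1
end

puts compute_birthday(Date.new(1961, 7, 15), Date.new(2020, 7, 14)) # prints 58
```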
-
-
- ### Specify how to process data
-
- Finally we can specify how we process lines, using the `mapping`
- directive. Mapping takes an arbitrary piece of Ruby code, which can
- reference the fields of a row.
-
- For instance:
-
- ```ruby
- i.mapping do |row|
-   puts "#{row[:name][:value]} is #{row[:age][:value]} years old"
- end
- ```
-
- Notice that the data read from each row of our input data is stored in
- a hash. The hash uses column names as keys; each entry is itself a
- hash that stores the cell value under the `:value` key.
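
To make the shape of `row` concrete, here is a hand-built hash with the layout just described. The values are made up, and only the `:value` key is shown.

```ruby
# A hand-built row, mimicking the structure passed to `mapping`.
row = {
  name: { value: 'Forest Whitaker' },
  age:  { value: 58 }
}

message = "#{row[:name][:value]} is #{row[:age][:value]} years old"
puts message
```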
-
-
- ### Start working with the data
-
- We are now all set and we can start working with the data.
-
- First use `read` or `load` (synonyms) to read all data and put it
- into a `@table` instance variable.
-
- ```ruby
- i.read
- ```
-
- After reading the file we can use `errors` to see whether any of the
- `check` functions failed:
-
- ```ruby
- array_of_strings = i.errors
- array_of_strings.each do |error_line|
-   puts error_line
- end
- ```
-
- We can then use `virtual_columns` to process data and generate the
- virtual columns:
-
- ```ruby
- i.virtual_columns
- ```
-
- Finally we can use the `process` function to execute the `mapping`
- directive on each line read from the file.
-
- ```ruby
- i.process
- ```
-
- Look in the examples directory for further details and a couple of
- working examples.
-
- **Remark.** You can override some of the defaults by passing a hash as
- an argument to `read`. For instance:
-
- ```ruby
- i.read filename: another_filepath
- ```
-
- will read data from `another_filepath`, rather than from the filename
- specified in the options. This might be useful, for instance, if the
- same specification has to be used for different files.
-
-
- ## Digging deeper
-
- If you need to perform further processing on the data which cannot be
- captured with `process` (that is, by processing the data row by row),
- you can also directly access all data read, using the `table` method:
-
- ```ruby
- i.read
- i.table
- # an array of hashes (one hash per row)
- ```
-
- In more detail, the `read` method fills a `@table` instance variable
- with an array of hashes. Each hash represents a line of the file.
-
- Each hash contains one key per column, following your specification.
- Its value is, in turn, a hash with the following structure:
-
- ```ruby
- {
-   value: ...,      # the result of calling process on the cell
-   row_number: ..., # the row number
-   col_number: ..., # the column number
-   error: ...       # the result of calling check on the cell processed value
- }
- ```
-
- (Note that virtual columns only store `value` and a Boolean `virtual`,
- which is always `true`.)
-
- Thus, for instance, `table` might return:
-
- ```ruby
- i.table
- [ { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
-     age: { value: 30, row_number: 1, col_number: 2, errors: nil } },
-   { name: { value: "Jane", row_number: 2, col_number: 1, errors: nil },
-     age: { value: 31, row_number: 2, col_number: 2, errors: nil } } ]
- ```
-
- ## Simplifying the hash with the data read
-
- The `Dreader::Util` class provides some functions to simplify and
- restructure the hashes built by `dreader`.
-
- `Dreader::Util.simplify hash` simplifies the hash passed as input by
- removing all information but the value and making the value
- accessible directly from the name of the column.
-
- ```ruby
- Dreader::Util.simplify i.table[0]
- {name: "John", age: 30}
- ```
-
- `Dreader::Util.slice hash, keys` and `Dreader::Util.clean hash,
- keys`, where `keys` is an array of keys, are respectively used to
- select or remove some keys from `hash`.
-
- ```ruby
- i.table[0]
- { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
-   age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
-
- Dreader::Util.slice i.table[0], [:name]
- {name: { value: "John", row_number: 1, col_number: 1, errors: nil}}
-
- Dreader::Util.clean i.table[0], [:name]
- {age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
- ```
-
- The methods `slice` and `clean` are more useful when used in
- conjunction with `simplify`:
-
- ```ruby
- hash = Dreader::Util.simplify i.table[0]
- {name: "John", age: 30}
-
- Dreader::Util.slice hash, [:age]
- {age: 30}
-
- Dreader::Util.clean hash, [:age]
- {name: "John"}
- ```
-
- Notice that the output produced by `slice` and `simplify` is a hash
- which can be used to create an `ActiveRecord` object.
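
For readers who want to see what these helpers amount to, the same transformations can be approximated with core `Hash` methods. This is an illustration in plain Ruby (2.5 or later, for `Hash#slice`), not the implementation of `Dreader::Util`; the sample row is made up.

```ruby
row = {
  name: { value: 'John', row_number: 1, col_number: 1 },
  age:  { value: 30, row_number: 1, col_number: 2 }
}

# Like `simplify`: keep only the value of each cell.
simplified = row.transform_values { |cell| cell[:value] }

# Like `slice`: keep only the listed keys.
selected = simplified.slice(:age)

# Like `clean`: drop the listed keys.
cleaned = simplified.reject { |key, _| [:age].include?(key) }

p simplified, selected, cleaned
```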
-
- Finally, the `Dreader::Util.restructure` method helps build hashes
- to create
- [ActiveModel](http://api.rubyonrails.org/classes/ActiveModel/Model.html)
- objects with nested attributes:
-
- ```ruby
- hash = {name: "John", surname: "Doe", address: "Unknown", city: "NY" }
-
- Dreader::Util.restructure hash, [:name, :surname], :address_attributes, [:address, :city]
- {name: "John", surname: "Doe", address_attributes: {address: "Unknown", city: "NY"}}
- ```
-
-
- ## Debugging your specification
-
- If you are not sure about what is going on (like I often am when
- reading tabular data), you can use the `debug` function, which prints
- the current configuration, reads some records from your files, and
- prints them to standard output:
-
- ```ruby
- i.debug
- i.debug n: 40 # read 40 lines (from first_row, if the option is declared)
- i.debug n: 40, filename: filepath # like above, but read from filepath
- ```
-
- Another possibility is getting the value of the `@table` variable,
- which contains all the data read.
-
- By default `debug` invokes the `process` and `check` directives. Pass
- the following options, if you want to disable this behavior; this
- might be useful, for instance, if you intend to check only what data
- is read:
-
- ```ruby
- i.debug process: false, check: false
- ```
-
- Notice that `check` implies `process`, since `check` is invoked on the
- output of the `process` directive.
-
-
- ## Changelog
-
- See [[Changelog]].
-
-
- ## Known Limitations
-
- At the moment:
-
- - it is not possible to specify column references using header names
-   (like Roo does).
- - it is not possible to pass options to the file readers. As a
-   consequence tab-separated files must have the `.tsv` extension to be
-   correctly parsed.
- - some testing wouldn't hurt.
-
-
- ## Known Bugs
-
- No known bugs and an unknown number of unknown bugs.
-
- (See the open issues for the known bugs.)
-
-
- ## Development
-
- After checking out the repo, run `bin/setup` to install dependencies. You can
- also run `bin/console` for an interactive prompt that will allow you to
- experiment.
-
- To install this gem onto your local machine, run `bundle exec rake
- install`. To release a new version, update the version number in `version.rb`,
- and then run `bundle exec rake release`, which will create a git tag for the
- version, push git commits and tags, and push the `.gem` file to
- [rubygems.org](https://rubygems.org).
-
- ## Contributing
-
- Bug reports and pull requests are welcome on GitHub at
- https://github.com/avillafiorita/dreader.
-
- ## License
-
- The gem is available as open source under the terms of the [MIT
- License](https://opensource.org/licenses/MIT).