dreader 1.1.1 → 1.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 524e55af5bb94cae3f407a1602069549783e935798a638361f3c98e922ffc54d
4
- data.tar.gz: 2599e048324ccd233e3fa4a0261e134ced0a3347d27af1e978e635639b6284a8
3
+ metadata.gz: 3c30be2fe49c6c8ce20d4930c75f1279ac1a92a099f609b1266b14dc61c7cf3c
4
+ data.tar.gz: 58c735a67c45ef11a180bc6f17892ba912656d346320da1a716caa69661f4695
5
5
  SHA512:
6
- metadata.gz: e8a78531d96ef35f9272a38daa5327620026dd66c98529710901d4c5994b9e5369308612cc79d1e0998d46dbb9dd6d26c448ddc4265dd536f1e61a4fc50fb885
7
- data.tar.gz: de71caed5d3df79d0d456b080a72a0edc035ef4e46906aac3f94295b07d57b956554a9d9e07b4b6195c5c09df0badfa9202122d7b6e0568cf89569b6a2277d28
6
+ metadata.gz: 7847892dbcf648432a9c51867fd70e1260e956e82ea7cbbad93f92882dd867be6f36f67a0edfbb972e16e12c1364efe92a54bd03d31801770da4b28fac725350
7
+ data.tar.gz: a717955a2eaa0c406d6fb140daf9cd084c11e0d4710289de34b7757b5fd4f4e920ded4fd1450306e3e42a42278c983e078ff8e53248fe0f5b7a93d09fb8a9d40
data/CHANGELOG.org CHANGED
@@ -1,5 +1,20 @@
1
1
  #+TITLE: Changelog
2
2
 
3
+ * Version 1.2.0 - <2023-11-02 Thu>
4
+ ** reject declaration
5
+
6
+ - A new =reject= declaration lets you reject some lines. =reject= takes as
7
+ input a row and can predicate over columns and virtual columns. When
8
+ true, the corresponding line is discarded.
9
+
10
+ * Version 1.1.2 - <2023-10-31 Tue>
11
+ ** Fixes an issue with the :extension option
12
+
13
+ - Fixes a bug related to =:extension= and adds a working example, to test
14
+ the feature
15
+ - Changes the extension from a string to a symbol. No initial dot required
16
+ any longer
17
+
3
18
  * Version 1.1.1 - <2023-10-16 Mon>
4
19
  ** Adds option :extension
5
20
 
data/README.org CHANGED
@@ -137,7 +137,8 @@ To write an import function with Dreader:
137
137
  and check parsed data
138
138
  - Add virtual columns, that is, columns computed from other values
139
139
  in the row
140
- - Specify how to map line. This is where you do the actual work
140
+ - Specify what lines you want to reject, if any
141
+ - Specify how to transform lines. This is where you do the actual work
141
142
  (for instance, if you process a file line by line) or put together data for
142
143
  processing after the file has been fully read --- see the next step.
143
144
 
@@ -165,12 +166,13 @@ Require =dreader= and declare a class which extends =Dreader::Engine=:
165
166
  end
166
167
  #+END_EXAMPLE
167
168
 
168
- In the class specify parsing option, using the following syntax:
169
+ Specify parsing options in the class, using the following syntax:
169
170
 
170
171
  #+BEGIN_EXAMPLE ruby
171
172
  options do
172
173
  filename 'example.ods'
173
- extension ".ods"
174
+ # this is optional. Use it when the file does not have an extension
175
+ extension :ods
174
176
 
175
177
  sheet 'Sheet 1'
176
178
 
@@ -190,10 +192,10 @@ where:
190
192
  to supply a filename when loading the file (see =read=, below). *Use
191
193
  =.tsv= for tab-separated files.*
192
194
  - (optional) =extension= overrides or specifies the extension of =filename=.
193
- Takes as input the extension preceded by a "." (e.g., ".xlsx"). Notice that
194
- **value of this option is not appended to filename** (see =read= below).
195
- Filename must thus be a valid reference to a file in the file system. This
196
- option is useful in one of these two circumstances:
195
+ Takes as input a symbol (e.g., =:xlsx=).
196
+ Notice that **the value of this option is not appended to the filename** (see =read=
197
+ below). Filename must thus be a valid reference to a file in the file
198
+ system. This option is useful in one of these two circumstances:
197
199
  1. When =filename= has no extension
198
200
  2. When you want to override the extension of the filename, e.g., to force
199
201
  reading a "file.csv" as a tab separated file
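The two circumstances above can be sketched in plain Ruby. This is a minimal illustration, assuming the fallback rule is "use the =:extension= symbol if given, otherwise derive a lowercase symbol from the filename"; =resolve_extension= is a hypothetical helper name used only for this example and is not part of dreader.

```ruby
# Hypothetical helper (not part of dreader): returns the symbol that would
# select the file format, preferring an explicit extension over the filename.
def resolve_extension(filename, extension: nil)
  # File.extname("DATA.CSV") returns ".CSV": lowercase it and drop the dot.
  # For a filename without extension the result is nil.
  extension || File.extname(filename).downcase[1..-1]&.to_sym
end

resolve_extension("example.ods")                # extension taken from the name
resolve_extension("example", extension: :ods)   # filename has no extension
resolve_extension("file.csv", extension: :tsv)  # override: read as tab-separated
```

Note that the filename is never modified: the symbol only selects how the file is parsed.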
@@ -397,6 +399,9 @@ See [[file:examples/wikipedia_us_cities/us_cities_bulk_declare.rb][us_cities_bul
397
399
  hash from the code block.
398
400
  #+END_NOTES
399
401
 
402
+ The data read from each row of our input data is stored in a hash. The hash
403
+ uses column names as the primary key and stores the values in the =:value=
404
+ key.
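As a rough illustration of this layout, the sketch below builds a row by hand. It assumes only what the text states (column names as top-level keys, cell content under =:value=); real rows may carry additional per-cell keys, which are omitted here.

```ruby
# Hand-built row shaped as described above; the values are sample data.
row = {
  name: { value: "Forest Whitaker" },
  birthdate: { value: "July 15, 1961" }
}

# Access a single cell through the column name and the :value key.
name = row[:name][:value]

# Flatten the row to { column => value }, similar in spirit to what the
# examples obtain with Dreader::Util.simplify.
simple = row.transform_values { |cell| cell[:value] }
```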
400
405
 
401
406
  *** Add virtual columns
402
407
 
@@ -426,6 +431,22 @@ Virtual columns are, of course, available to the =mapping= directive
426
431
  (see below).
427
432
 
428
433
 
434
+ *** Specify which lines to reject
435
+
436
+ You can reject some lines using the =reject= declaration, which is applied row
437
+ by row, can predicate over columns and virtual columns, and has to return a
438
+ Boolean value.
439
+
440
+ All lines for which the block returns a truthy value are rejected, that is, not
441
+ stored in the =@table= variable (and, consequently, not passed to the mapping function).
442
+
443
+ For instance, the following declaration rejects all lines in which the
444
+ population column is higher than =3_000_000=:
445
+
446
+ #+begin_src ruby
447
+ reject { |row| row[:population][:value] > 3_000_000 }
448
+ #+end_src
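The effect of such a declaration can be approximated without the library. The sketch below is plain Ruby under stated assumptions: =rows= and =reject_rule= are illustrative names, the city names and figures are made up, and the map-then-compact pattern is a simplification of how rejected rows are dropped.

```ruby
# Illustrative rows (made-up data), shaped like the rows dreader passes around.
rows = [
  { city: { value: "Bigtown" },    population: { value: 5_000_000 } },
  { city: { value: "Smallville" }, population: { value: 500_000 } }
]

# Same predicate as the reject declaration above.
reject_rule = ->(row) { row[:population][:value] > 3_000_000 }

kept = rows.map { |row|
  next if reject_rule.call(row) # a rejected row becomes nil ...
  row
}.compact                       # ... and compact removes it

kept_names = kept.map { |row| row[:city][:value] }
```

Only rows for which the predicate is falsy survive, so =kept_names= contains just the smaller city.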
449
+
429
450
  *** Specify how to process each line
430
451
 
431
452
  The =mapping= directive specifies what to do with each line read. The
@@ -441,10 +462,9 @@ value of column =:age= and prints them to standard output
441
462
  end
442
463
  #+END_EXAMPLE
443
464
 
444
- The data read from each row of our input data is stored in a hash. The hash
445
- uses column names as the primary key and stores the values in the =:value=
446
- key.
447
-
465
+ To apply the =mapping= declaration to a file, use the =mappings= method,
466
+ which uses =map= to apply it to each row and stores in the =@table= variable
467
+ whatever value the mapping returns.
448
468
 
449
469
  *** Process data
450
470
 
@@ -464,8 +484,8 @@ A typical scenario works as follows:
464
484
  # examples:
465
485
  # i.read
466
486
  # i.read filename: "example.ods"
467
- # i.read filename: "example.ods", extension: ".ods"
468
- # i.read filename: "example", extension: ".ods"
487
+ # i.read filename: "example.ods", extension: :ods
488
+ # i.read filename: "example", extension: :ods
469
489
  # (the line above opens the file "example" as an Open Document Spreadsheet)
470
490
  i.read
471
491
 
@@ -500,7 +520,13 @@ A typical scenario works as follows:
500
520
  (Optionally: check again for errors.)
501
521
 
502
522
  5. Add your own code to process the data returned after =mappings=, which you
503
- can access with =i.table= or =i.data= (synonyms).
523
+ can assign to a variable (e.g., =returned_data = i.mappings=) or access
524
+ with =i.table= or =i.data= (synonyms).
525
+
526
+ #+begin_quote
527
+ Notice that =mappings= has a side effect on =@table=, so invoking it twice in a
528
+ row won't work: you need to reload the file first.
529
+ #+end_quote
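To see why a second invocation fails, consider this toy model. It is a sketch, not dreader's implementation: =ToyEngine= is a hypothetical class that only mimics the documented behavior of replacing the table with the mapped values.

```ruby
# Toy stand-in for an engine (hypothetical, for illustration only).
class ToyEngine
  attr_reader :table

  def initialize(rows)
    @table = rows
  end

  # Mimics the documented side effect: the mapped values replace @table.
  def mappings
    @table = @table.map { |row| row[:name][:value].upcase }
  end
end

engine = ToyEngine.new([{ name: { value: "ada" } }])
returned_data = engine.mappings # returned value and engine.table now coincide

# The table now holds strings, not row hashes, so mapping again breaks.
second_call_failed = false
begin
  engine.mappings
rescue TypeError, NoMethodError
  second_call_failed = true
end
```

Reloading the file restores the original rows, after which the mapping can be applied again.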
504
530
 
505
531
  Look in the examples directory for further details and a couple of working
506
532
  examples.
@@ -0,0 +1,13 @@
1
+ Name Date of birth
2
+ Forest Whitaker July 15, 1961
3
+ Daniel Day-Lewis April 29, 1957
4
+ Sean Penn August 17, 1960
5
+ Jeff Bridges December 4, 1949
6
+ Colin Firth September 10, 1960
7
+ Jean Dujardin June 19, 1972
8
+ Daniel Day-Lewis April 29, 1957
9
+ Matthew McConaughey November 4, 1969
10
+ Eddie Redmayne January 6, 1982
11
+ Leonardo DiCaprio November 11, 1974
12
+ Casey Affleck August 12, 1975
13
+ Gary Oldman March 21, 1958
@@ -0,0 +1,13 @@
1
+ Name,Date of birth
2
+ Forest Whitaker,"July 15, 1961"
3
+ Daniel Day-Lewis,"April 29, 1957"
4
+ Sean Penn,"August 17, 1960"
5
+ Jeff Bridges,"December 4, 1949"
6
+ Colin Firth,"September 10, 1960"
7
+ Jean Dujardin,"June 19, 1972"
8
+ Daniel Day-Lewis,"April 29, 1957"
9
+ Matthew McConaughey,"November 4, 1969"
10
+ Eddie Redmayne,"January 6, 1982"
11
+ Leonardo DiCaprio,"November 11, 1974"
12
+ Casey Affleck,"August 12, 1975"
13
+ Gary Oldman,"March 21, 1958"
@@ -0,0 +1,55 @@
1
+ require "dreader"
2
+
3
+ class Reader
4
+ extend Dreader::Engine
5
+
6
+ options do
7
+ first_row 2
8
+ debug true
9
+ end
10
+
11
+ column :name do
12
+ doc "A is the name string"
13
+ colref 'A'
14
+ end
15
+
16
+ column :birthdate do
17
+ doc "Birthdate contains a full date (i.e., including the year)"
18
+ colref 'B'
19
+
20
+ process do |c|
21
+ Date.parse(c)
22
+ end
23
+ end
24
+
25
+ virtual_column :age do
26
+ process do |row|
27
+ birthdate = row[:birthdate][:value]
28
+ birthday = Date.new(Date.today.year, birthdate.month, birthdate.day)
29
+ today = Date.today
30
+
31
+ [0, today.year - birthdate.year - (birthday > today ? 1 : 0)].max
32
+ end
33
+ end
34
+
35
+ mapping do |row|
36
+ r = Dreader::Util.simplify(row)
37
+ puts "#{r[:name]} is #{r[:age]} years old (born on #{r[:birthdate]})"
38
+ end
39
+ end
40
+
41
+ i = Reader
42
+ i.read filename: "Birthdays.csv", mapping: true
43
+
44
+ i.read filename: "Birthdays-TabSeparated.csv", extension: :tsv, mapping: true
45
+
46
+ #
47
+ # Here we can do further processing on the data
48
+ #
49
+ File.open("ages.txt", "w") do |file|
50
+ i.table.each do |row|
51
+ unless row[:row_errors].any?
52
+ file.puts "#{row[:name][:value]} #{row[:age][:value]}"
53
+ end
54
+ end
55
+ end
Binary file
Binary file
@@ -0,0 +1,73 @@
1
+ require "dreader"
2
+
3
+ class Reader
4
+ extend Dreader::Engine
5
+
6
+ options do
7
+ first_row 2
8
+ debug true
9
+ extension :ods
10
+ end
11
+
12
+ column :name do
13
+ doc "A is the name string"
14
+ colref 'A'
15
+ end
16
+
17
+ column :birthdate do
18
+ doc "Birthdate contains a full date (i.e., including the year)"
19
+ colref 'B'
20
+
21
+ process do |c|
22
+ Date.parse(c)
23
+ end
24
+ end
25
+
26
+ virtual_column :age do
27
+ process do |row|
28
+ birthdate = row[:birthdate][:value]
29
+ birthday = Date.new(Date.today.year, birthdate.month, birthdate.day)
30
+ today = Date.today
31
+
32
+ [0, today.year - birthdate.year - (birthday > today ? 1 : 0)].max
33
+ end
34
+ end
35
+
36
+ mapping do |row|
37
+ r = Dreader::Util.simplify(row)
38
+ puts "#{r[:name]} is #{r[:age]} years old (born on #{r[:birthdate]})"
39
+ end
40
+ end
41
+
42
+ puts
43
+ puts "*****************************************************************"
44
+ puts "Reading ODS with no extension, using extension set in the options"
45
+ puts "*****************************************************************"
46
+ puts
47
+
48
+ i = Reader
49
+ i.read filename: "Birthdays"
50
+ i.virtual_columns
51
+ i.mappings
52
+
53
+ puts
54
+ puts "*****************************************************************"
55
+ puts "Reading XLSX with wrong extension, overriding existing extension"
56
+ puts "*****************************************************************"
57
+ puts
58
+
59
+ i = Reader
60
+ i.read filename: "Birthdays-xlsx-with-wrong-extension.xls", extension: :xlsx
61
+ i.virtual_columns
62
+ i.mappings
63
+
64
+ puts
65
+ puts "*****************************************************************"
66
+ puts "Reading XLSX with no extension"
67
+ puts "*****************************************************************"
68
+ puts
69
+
70
+ i = Reader
71
+ i.read filename: "Birthdays-xlsx", extension: :xlsx
72
+ i.virtual_columns
73
+ i.mappings
@@ -0,0 +1,77 @@
1
+ require 'dreader'
2
+
3
+ # this is the class which will contain all the data we read from the file
4
+ class City
5
+ [:city, :state, :population, :lat, :lon].each do |var|
6
+ attr_accessor var
7
+ end
8
+
9
+ def initialize(hash)
10
+ hash.each do |k, v|
11
+ self.send("#{k}=", v)
12
+ end
13
+ end
14
+ end
15
+
16
+ class Importer
17
+ extend Dreader::Engine
18
+
19
+ # read from us_cities.tsv, lines from 2 to 307 (included)
20
+ options do
21
+ filename "us_cities.tsv"
22
+ first_row 2
23
+ last_row 307
24
+ end
25
+
26
+ # these are the columns for which we only need to specify column and name
27
+ columns ({city: 2, state: 3, latlon: 11}) do
28
+ process { |val| val.strip }
29
+ end
30
+
31
+ # the population column requires more work
32
+ column :population do |col|
33
+ col.colref 4
34
+
35
+ # make "3,000" into 3000 (int)
36
+ col.process { |value| value.gsub(",", "").to_i }
37
+
38
+ # check population is positive
39
+ col.check { |value| value > 0 }
40
+ end
41
+
42
+ # reject all cities with 3M people or more
43
+ reject do |row|
44
+ row[:population][:value] >= 3_000_000
45
+ end
46
+
47
+ mapping do |row|
48
+ # remove all additional information stored in each cell
49
+ r = Dreader::Util.simplify row
50
+
51
+ # make latlon into the lat, lon fields
52
+ r[:lat], r[:lon] = r[:latlon].split(" ")
53
+
54
+ # now r contains something like
55
+ # {lat: ..., lon: ..., city: ..., state: ..., population: ..., latlon: ...}
56
+
57
+ # remove fields which are not understood by the City class and
58
+ # make a new instance
59
+ cleaned = Dreader::Util.clean r, [:latlon]
60
+
61
+ # return the new City: whatever the mapping returns is stored in the table
62
+ City.new(cleaned)
63
+ end
64
+ end
65
+
66
+ # load and process
67
+ importer = Importer
68
+ importer.load mapping: true, debug: true
69
+
70
+ # output everything to see whether it works
71
+ puts "Cities in the US with fewer than 3M inhabitants (source Wikipedia)"
72
+ importer.table.each do |city|
73
+ [:city, :state, :population, :lat, :lon].each do |var|
74
+ puts "#{var.to_s.capitalize}: #{city.send(var)}"
75
+ end
76
+ puts ""
77
+ end
@@ -21,6 +21,8 @@ module Dreader
21
21
  attr_accessor :declared_virtual_columns
22
22
  # the mapping rules
23
23
  attr_accessor :declared_mapping
24
+ # the declared filter
25
+ attr_accessor :declared_reject
24
26
 
25
27
  # the data we read
26
28
  attr_reader :table
@@ -118,6 +120,11 @@ module Dreader
118
120
  @declared_virtual_columns << column.to_hash.merge({ name: name })
119
121
  end
120
122
 
123
+ # define a filter, which skips some rows
124
+ def reject(&block)
125
+ @declared_reject = block
126
+ end
127
+
121
128
  # define what we do with each line we read
122
129
  # - `block` is the code which takes as input a `row` and processes
123
130
  # `row` is a hash in which each spreadsheet cell is accessible under
@@ -187,8 +194,13 @@ module Dreader
187
194
  # this has side-effects on r
188
195
  virtual_columns_on(r) if options[:virtual] || options[:mapping]
189
196
 
197
+ # check whether the filter would ignore this line
198
+ # notice that we need to invoke compact to avoid nil being added
199
+ # to the table
200
+ next if !options[:ignore_reject] && reject?(r)
201
+
190
202
  options[:mapping] ? mappings_on(r) : r
191
- end
203
+ end.compact
192
204
  end
193
205
 
194
206
  # TODO: PASS A ROW (and not row_number and sheet)
@@ -268,6 +280,7 @@ module Dreader
268
280
 
269
281
  # Compute virtual columns for a row, with side effects on the row
270
282
  def virtual_columns_on(row)
283
+ @declared_virtual_columns ||= []
271
284
  @declared_virtual_columns.each do |virtualcol|
272
285
  colname = virtualcol[:name]
273
286
  row[colname] = { virtual: true }
@@ -291,13 +304,36 @@ module Dreader
291
304
  end
292
305
  end
293
306
 
294
- # apply the mapping code to the array it makes sense to invoke it only
295
- # once.
307
+ # check whether a line has to be rejected
308
+ def reject?(row)
309
+ rejected = @declared_reject&.call(row)
310
+ if rejected
311
+ @logger.debug "[dreader] row rejected by reject declaration #{row}"
312
+ end
+ rejected
313
+ end
314
+
315
+ # apply the mapping code to the @table. Notice that we do a side effect
316
+ # on @table and, hence, invoking the mapping twice won't work (you need to
317
+ # reload first).
318
+ #
319
+ # the mapping is applied only if it is defined; the method returns the
320
+ # output of the mapping.
296
321
  #
297
- # the mapping is applied only if it defined and it uses map, so that
298
- # it can be used functionally
322
+ # notice also that we do a side-effect on @table. This is to make the
323
+ # behavior of
324
+ #
325
+ # i.load mapping: true
326
+ # i.table
327
+ #
328
+ # and
329
+ #
330
+ # i = load;
331
+ # i.mappings
332
+ # i.table
333
+ #
334
+ # the same
299
335
  def mappings
300
- @table.map { |row| mappings_on(row) }
336
+ @table = @table.map { |row| mappings_on(row) }
301
337
  end
302
338
 
303
339
  def mappings_on(row)
@@ -398,28 +434,36 @@ module Dreader
398
434
  # list of keys we support in options. We remove them when reading
399
435
  # the CSV file
400
436
  OPTION_KEYS = %i[
401
- filename sheet first_row last_row logger logger_level
437
+ filename extension sheet first_row last_row
438
+ logger logger_level
439
+ debug
402
440
  ]
403
441
 
404
442
  def open_spreadsheet(options)
405
443
  filename = options[:filename]
406
- ext = options[:extension] || File.extname(filename)
444
+ # use the extension option or make ".CSV" into :csv
445
+ extension = options[:extension] || File.extname(filename).downcase[1..-1]&.to_sym
446
+
447
+
448
+ # TODO: MAKE DEBUG AND LOGGER INTO REAL CLASS VARIABLES OR MAKE LOCAL AND/OR FUNCTIONS
449
+ @debug = @declared_options.merge(options)[:debug] == true
450
+ if @debug
451
+ @logger = options[:logger] || Logger.new($stdout)
452
+ @logger.debug "[dreader open_spreadsheet] filename: #{filename}"
453
+ @logger.debug "[dreader open_spreadsheet] extension: #{extension}"
454
+ end
407
455
 
408
- case ext
409
- when ".csv"
456
+ case extension
457
+ when :csv
410
458
  csv_options = @declared_options.except(*OPTION_KEYS)
411
459
  Roo::CSV.new(filename, csv_options:)
412
- when ".tsv"
460
+ when :tsv
413
461
  csv_options = @declared_options.except(*OPTION_KEYS).merge({ col_sep: "\t" })
414
462
  Roo::CSV.new(filename, csv_options:)
415
- when ".ods"
416
- Roo::OpenOffice.new(filename)
417
- when ".xls"
418
- Roo::Excel.new(filename)
419
- when ".xlsx"
420
- Roo::Excelx.new(filename)
463
+ when :ods, :xls, :xlsx
464
+ Roo::Spreadsheet.open(filename, extension:)
421
465
  else
422
- raise "Unknown extension: #{ext}"
466
+ raise "Unknown extension: #{extension}. Use the :extension option."
423
467
  end
424
468
  end
425
469
 
@@ -1,3 +1,3 @@
1
1
  module Dreader
2
- VERSION = "1.1.1"
2
+ VERSION = "1.2.0"
3
3
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: dreader
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.1.1
4
+ version: 1.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Adolfo Villafiorita
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2023-10-31 00:00:00.000000000 Z
11
+ date: 2023-11-02 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: roo
@@ -108,6 +108,13 @@ files:
108
108
  - dreader.gemspec
109
109
  - examples/age/Birthdays.ods
110
110
  - examples/age/age.rb
111
+ - examples/age_csv/Birthdays-TabSeparated.csv
112
+ - examples/age_csv/Birthdays.csv
113
+ - examples/age_csv/age.rb
114
+ - examples/age_noext/Birthdays
115
+ - examples/age_noext/Birthdays-xlsx
116
+ - examples/age_noext/Birthdays-xlsx-with-wrong-extension.xls
117
+ - examples/age_noext/age.rb
111
118
  - examples/age_with_multiple_checks/Birthdays.ods
112
119
  - examples/age_with_multiple_checks/age_with_multiple_checks.rb
113
120
  - examples/local_vars/local_vars.rb
@@ -117,6 +124,7 @@ files:
117
124
  - examples/wikipedia_us_cities/us_cities.rb
118
125
  - examples/wikipedia_us_cities/us_cities.tsv
119
126
  - examples/wikipedia_us_cities/us_cities_bulk_declare.rb
127
+ - examples/wikipedia_us_cities/us_cities_reject.rb
120
128
  - lib/dreader.rb
121
129
  - lib/dreader/column.rb
122
130
  - lib/dreader/engine.rb
@@ -142,7 +150,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
142
150
  - !ruby/object:Gem::Version
143
151
  version: '0'
144
152
  requirements: []
145
- rubygems_version: 3.4.10
153
+ rubygems_version: 3.4.21
146
154
  signing_key:
147
155
  specification_version: 4
148
156
  summary: Process and import data from cvs and spreadsheets