dreader 1.1.1 → 1.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 524e55af5bb94cae3f407a1602069549783e935798a638361f3c98e922ffc54d
-   data.tar.gz: 2599e048324ccd233e3fa4a0261e134ced0a3347d27af1e978e635639b6284a8
+   metadata.gz: 3c30be2fe49c6c8ce20d4930c75f1279ac1a92a099f609b1266b14dc61c7cf3c
+   data.tar.gz: 58c735a67c45ef11a180bc6f17892ba912656d346320da1a716caa69661f4695
  SHA512:
-   metadata.gz: e8a78531d96ef35f9272a38daa5327620026dd66c98529710901d4c5994b9e5369308612cc79d1e0998d46dbb9dd6d26c448ddc4265dd536f1e61a4fc50fb885
-   data.tar.gz: de71caed5d3df79d0d456b080a72a0edc035ef4e46906aac3f94295b07d57b956554a9d9e07b4b6195c5c09df0badfa9202122d7b6e0568cf89569b6a2277d28
+   metadata.gz: 7847892dbcf648432a9c51867fd70e1260e956e82ea7cbbad93f92882dd867be6f36f67a0edfbb972e16e12c1364efe92a54bd03d31801770da4b28fac725350
+   data.tar.gz: a717955a2eaa0c406d6fb140daf9cd084c11e0d4710289de34b7757b5fd4f4e920ded4fd1450306e3e42a42278c983e078ff8e53248fe0f5b7a93d09fb8a9d40
data/CHANGELOG.org CHANGED
@@ -1,5 +1,20 @@
  #+TITLE: Changelog

+ * Version 1.2.0 - <2023-11-02 Thu>
+ ** reject declaration
+
+ - A new reject declaration lets you reject some lines. reject takes a row
+   as input and can predicate over columns and virtual columns. When it
+   returns true, the corresponding line is discarded.
+
+ * Version 1.1.2 - <2023-10-31 Tue>
+ ** Fixes an issue with the :extension option
+
+ - Fixes a bug related to =:extension= and adds a working example to test
+   the feature
+ - Changes the extension from a string to a symbol. The initial dot is no
+   longer required
+
  * Version 1.1.1 - <2023-10-16 Mon>
  ** Adds option :extension

data/README.org CHANGED
@@ -137,7 +137,8 @@ To write an import function with Dreader:
    and check parsed data
  - Add virtual columns, that is, columns computed from other values
    in the row
- - Specify how to map line. This is where you do the actual work
+ - Specify what lines you want to reject, if any
+ - Specify how to transform lines. This is where you do the actual work
    (for instance, if you process a file line by line) or put together data for
    processing after the file has been fully read --- see the next step.

@@ -165,12 +166,13 @@ Require =dreader= and declare a class which extends =Dreader::Engine=:
  end
  #+END_EXAMPLE

- In the class specify parsing option, using the following syntax:
+ Specify parsing options in the class, using the following syntax:

  #+BEGIN_EXAMPLE ruby
  options do
    filename 'example.ods'
-   extension ".ods"
+   # this is optional. Use it when the file does not have an extension
+   extension :ods

    sheet 'Sheet 1'

@@ -190,10 +192,10 @@ where:
    to supply a filename when loading the file (see =read=, below). *Use
    =.tsv= for tab-separated files.*
  - (optional) =extension= overrides or specify the extension of =filename=.
-   Takes as input the extension preceded by a "." (e.g., ".xlsx"). Notice that
-   **value of this option is not appended to filename** (see =read= below).
-   Filename must thus be a valid reference to a file in the file system. This
-   option is useful in one of these two circumstances:
+   Takes as input a symbol (e.g., =:xlsx=).
+   Notice that **the value of this option is not appended to filename** (see
+   =read= below). Filename must thus be a valid reference to a file in the
+   file system. This option is useful in one of these two circumstances:
    1. When =filename= has no extension
    2. When you want to override the extension of the filename, e.g., to force
       reading a "file.csv" as a tab separated file
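As a rough illustration of the two circumstances listed above, the sketch below mirrors the expression that the =engine.rb= change later in this diff uses to pick the extension. The helper name =resolve_extension= is hypothetical and is not part of dreader; only the expression inside it comes from the diff.

```ruby
# Hypothetical helper (not part of dreader): it wraps the expression used by
# open_spreadsheet in this diff to decide which extension to use.
def resolve_extension(filename, extension = nil)
  # An explicit symbol wins; otherwise ".CSV" becomes :csv, and a filename
  # without extension yields nil.
  extension || File.extname(filename).downcase[1..-1]&.to_sym
end

p resolve_extension("cities.CSV")       # => :csv
p resolve_extension("Birthdays")        # => nil
p resolve_extension("Birthdays", :ods)  # => :ods
p resolve_extension("file.csv", :tsv)   # => :tsv
```

A =nil= result corresponds to the "Unknown extension" branch of the engine, which is why a file without extension needs the =:extension= option.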
@@ -397,6 +399,9 @@ See [[file:examples/wikipedia_us_cities/us_cities_bulk_declare.rb][us_cities_bul
  hash from the code block.
  #+END_NOTES

+ The data read from each row of our input data is stored in a hash. The hash
+ uses column names as the primary key and stores the values in the =:value=
+ key.

  *** Add virtual columns

@@ -426,6 +431,22 @@ Virtual columns are, of course, available to the =mapping= directive
  (see below).


+ *** Specify which lines to reject
+
+ You can reject some lines using the =reject= declaration, which is applied row
+ by row, can predicate over columns and virtual columns, and has to return a
+ Boolean value.
+
+ All lines for which it returns a truthy value will be rejected, that is, not
+ stored in the =@table= variable (and, consequently, not passed to the mapping
+ function).
+
+ For instance, the following declaration rejects all lines in which the
+ population column is higher than =3_000_000=:
+
+ #+begin_src ruby
+ reject { |row| row[:population][:value] > 3_000_000 }
+ #+end_src
+
  *** Specify how to process each line

  The =mapping= directive specifies what to do with each line read. The
@@ -441,10 +462,9 @@ value of column =:age= and prints them to standard output
  end
  #+END_EXAMPLE

- The data read from each row of our input data is stored in a hash. The hash
- uses column names as the primary key and stores the values in the =:value=
- key.
-
+ To invoke the =mapping= declaration on a file, use the =mappings= method,
+ which invokes =map= on the rows and stores in the =@table= variable
+ whatever value the mapping returns.

  *** Process data

@@ -464,8 +484,8 @@ A typical scenario works as follows:
  # examples:
  # i.read
  # i.read filename: "example.ods"
- # i.read filename: "example.ods", extension: ".ods"
- # i.read filename: "example", extension: ".ods"
+ # i.read filename: "example.ods", extension: :ods
+ # i.read filename: "example", extension: :ods
  # (the line above opens the file "example" as an Open Document Spreasdheet)
  i.read

@@ -500,7 +520,13 @@ A typical scenario works as follows:
  (Optionally: check again for errors.)

  5. Add your own code to process the data returned after =mappings=, which you
-    can access with =i.table= or =i.data= (synonyms).
+    can assign to a variable (e.g., =returned_data = i.mappings=) or access
+    with =i.table= or =i.data= (synonyms).
+
+ #+begin_quote
+ Notice that =mappings= has a side effect on =@table=, so invoking it twice in
+ a row won't work: you need to reload the file first.
+ #+end_quote

  Look in the examples directory for further details and a couple of working
  examples.
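To make the side effect of =mappings= concrete, here is a minimal sketch in plain Ruby. =TinyEngine= is a hypothetical stand-in written for this note, not dreader's engine; it only reproduces the =@table = @table.map { ... }= assignment that appears in the =engine.rb= change of this diff.

```ruby
# TinyEngine is a hypothetical stand-in for a class extending Dreader::Engine.
# It only reproduces the reassignment of @table performed by `mappings`.
class TinyEngine
  attr_reader :table

  def initialize(rows, &mapping)
    @table = rows
    @mapping = mapping
  end

  def mappings
    @table = @table.map { |row| @mapping.call(row) }
  end
end

engine = TinyEngine.new([{ name: { value: "Ada" } }]) do |row|
  row[:name][:value].upcase
end

first_result = engine.mappings
# first_result and engine.table now both hold the mapped values.

# A second call feeds the already-mapped Strings back into the block, which
# expects row hashes, so the block raises.
second_call_error =
  begin
    engine.mappings
    nil
  rescue TypeError => e
    e.class
  end

p first_result       # => ["ADA"]
p second_call_error  # => TypeError
```

In this sketch the second call fails because the table no longer contains row hashes, which is the situation the note above describes: reload the file before mapping again.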
data/examples/age_csv/Birthdays-TabSeparated.csv ADDED
@@ -0,0 +1,13 @@
+ Name Date of birth
+ Forest Whitaker July 15, 1961
+ Daniel Day-Lewis April 29, 1957
+ Sean Penn August 17, 1960
+ Jeff Bridges December 4, 1949
+ Colin Firth September 10, 1960
+ Jean Dujardin June 19, 1972
+ Daniel Day-Lewis April 29, 1957
+ Matthew McConaughey November 4, 1969
+ Eddie Redmayne January 6, 1982
+ Leonardo DiCaprio November 11, 1974
+ Casey Affleck August 12, 1975
+ Gary Oldman March 21, 1958
data/examples/age_csv/Birthdays.csv ADDED
@@ -0,0 +1,13 @@
+ Name,Date of birth
+ Forest Whitaker,"July 15, 1961"
+ Daniel Day-Lewis,"April 29, 1957"
+ Sean Penn,"August 17, 1960"
+ Jeff Bridges,"December 4, 1949"
+ Colin Firth,"September 10, 1960"
+ Jean Dujardin,"June 19, 1972"
+ Daniel Day-Lewis,"April 29, 1957"
+ Matthew McConaughey,"November 4, 1969"
+ Eddie Redmayne,"January 6, 1982"
+ Leonardo DiCaprio,"November 11, 1974"
+ Casey Affleck,"August 12, 1975"
+ Gary Oldman,"March 21, 1958"
data/examples/age_csv/age.rb ADDED
@@ -0,0 +1,55 @@
+ require "dreader"
+
+ class Reader
+   extend Dreader::Engine
+
+   options do
+     first_row 2
+     debug true
+   end
+
+   column :name do
+     doc "A is the name string"
+     colref 'A'
+   end
+
+   column :birthdate do
+     doc "Birthdate contains a full date (i.e., including the year)"
+     colref 'B'
+
+     process do |c|
+       Date.parse(c)
+     end
+   end
+
+   virtual_column :age do
+     process do |row|
+       birthdate = row[:birthdate][:value]
+       birthday = Date.new(Date.today.year, birthdate.month, birthdate.day)
+       today = Date.today
+
+       [0, today.year - birthdate.year - (birthday > today ? 1 : 0)].max
+     end
+   end
+
+   mapping do |row|
+     r = Dreader::Util.simplify(row)
+     puts "#{r[:name]} is #{r[:age]} years old (born on #{r[:birthdate]})"
+   end
+ end
+
+ i = Reader
+ i.read filename: "Birthdays.csv", mapping: true
+
+ i.read filename: "Birthdays-TabSeparated.csv", extension: :tsv, mapping: true
+
+ #
+ # Here we can do further processing on the data
+ #
+ File.open("ages.txt", "w") do |file|
+   i.table.each do |row|
+     unless row[:row_errors].any?
+       file.puts "#{row[:name][:value]} #{row[:age][:value]}"
+     end
+   end
+ end
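The =:age= virtual column in the example above derives an age from a birthdate. The standalone sketch below shows one way to do that calculation against a fixed reference date, so the result does not depend on the day it is run. The helper =age_on= and the February 29 fallback are assumptions made for this note; they are not part of the example file.

```ruby
require "date"

# Hypothetical helper: age in whole years on a given reference date.
def age_on(birthdate, today)
  # Birthday in the reference year. For this sketch we assume that a
  # February 29 birthday falls back to February 28 in non-leap years.
  day = birthdate.day
  day = 28 if birthdate.month == 2 && day == 29 && !Date.leap?(today.year)
  birthday = Date.new(today.year, birthdate.month, day)

  # Subtract one year when the birthday has not occurred yet this year.
  [0, today.year - birthdate.year - (birthday > today ? 1 : 0)].max
end

# Fixed "today" for reproducibility: the date of the 1.2.0 changelog entry.
reference = Date.new(2023, 11, 2)

p age_on(Date.parse("July 15, 1961"), reference)     # => 62
p age_on(Date.parse("December 4, 1949"), reference)  # => 73
```

Passing the reference date as a parameter, rather than calling =Date.today= inside the helper, keeps the calculation easy to check by hand.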
Binary file
Binary file
data/examples/age_noext/age.rb ADDED
@@ -0,0 +1,73 @@
+ require "dreader"
+
+ class Reader
+   extend Dreader::Engine
+
+   options do
+     first_row 2
+     debug true
+     extension :ods
+   end
+
+   column :name do
+     doc "A is the name string"
+     colref 'A'
+   end
+
+   column :birthdate do
+     doc "Birthdate contains a full date (i.e., including the year)"
+     colref 'B'
+
+     process do |c|
+       Date.parse(c)
+     end
+   end
+
+   virtual_column :age do
+     process do |row|
+       birthdate = row[:birthdate][:value]
+       birthday = Date.new(Date.today.year, birthdate.month, birthdate.day)
+       today = Date.today
+
+       [0, today.year - birthdate.year - (birthday > today ? 1 : 0)].max
+     end
+   end
+
+   mapping do |row|
+     r = Dreader::Util.simplify(row)
+     puts "#{r[:name]} is #{r[:age]} years old (born on #{r[:birthdate]})"
+   end
+ end
+
+ puts
+ puts "*****************************************************************"
+ puts "Reading ODS with no extension, using extension set in the options"
+ puts "*****************************************************************"
+ puts
+
+ i = Reader
+ i.read filename: "Birthdays"
+ i.virtual_columns
+ i.mappings
+
+ puts
+ puts "*****************************************************************"
+ puts "Reading XLSX with wrong extension, overriding existing extension"
+ puts "*****************************************************************"
+ puts
+
+ i = Reader
+ i.read filename: "Birthdays-xlsx-with-wrong-extension.xls", extension: :xlsx
+ i.virtual_columns
+ i.mappings
+
+ puts
+ puts "*****************************************************************"
+ puts "Reading XLSX with no extension"
+ puts "*****************************************************************"
+ puts
+
+ i = Reader
+ i.read filename: "Birthdays-xlsx", extension: :xlsx
+ i.virtual_columns
+ i.mappings
data/examples/wikipedia_us_cities/us_cities_reject.rb ADDED
@@ -0,0 +1,77 @@
+ require 'dreader'
+
+ # this is the class which will contain all the data we read from the file
+ class City
+   [:city, :state, :population, :lat, :lon].each do |var|
+     attr_accessor var
+   end
+
+   def initialize(hash)
+     hash.each do |k, v|
+       self.send("#{k}=", v)
+     end
+   end
+ end
+
+ class Importer
+   extend Dreader::Engine
+
+   # read from us_cities.tsv, rows 2 to 307 (inclusive)
+   options do
+     filename "us_cities.tsv"
+     first_row 2
+     last_row 307
+   end
+
+   # these are the columns for which we only need to specify column and name
+   columns ({city: 2, state: 3, latlon: 11}) do
+     process { |val| val.strip }
+   end
+
+   # the population column requires more work
+   column :population do |col|
+     col.colref 4
+
+     # make "3,000" into 3000 (int)
+     col.process { |value| value.gsub(",", "").to_i }
+
+     # check population is positive
+     col.check { |value| value > 0 }
+   end
+
+   # reject all cities with 3M people or more
+   reject do |row|
+     row[:population][:value] >= 3_000_000
+   end
+
+   mapping do |row|
+     # remove all additional information stored in each cell
+     r = Dreader::Util.simplify row
+
+     # make latlon into the lat, lon fields
+     r[:lat], r[:lon] = r[:latlon].split(" ")
+
+     # now r contains something like
+     # {lat: ..., lon: ..., city: ..., state: ..., population: ..., latlon: ...}
+
+     # remove fields which are not understood by the City class and
+     # make a new instance
+     cleaned = Dreader::Util.clean r, [:latlon]
+
+     # the value returned here is what ends up in the table
+     City.new(cleaned)
+   end
+ end
+
+ # load and process
+ importer = Importer
+ importer.load mapping: true, debug: true
+
+ # output everything to see whether it works
+ puts "US cities with fewer than 3M inhabitants (source: Wikipedia)"
+ importer.table.each do |city|
+   [:city, :state, :population, :lat, :lon].each do |var|
+     puts "#{var.to_s.capitalize}: #{city.send(var)}"
+   end
+   puts ""
+ end
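The mapping block above relies on the row layout described in the README: a hash keyed by column name, with each cell's value under =:value=. The sketch below walks through the same steps in plain Ruby. The =transform_values= and =reject= calls are assumed stand-ins for =Dreader::Util.simplify= and =Dreader::Util.clean=, whose exact behaviour is not shown in this diff, and the row values are made up for illustration.

```ruby
# A row as described in the README: column name => { value: ... }.
# The values are invented for this sketch.
row = {
  city:       { value: "Springfield" },
  population: { value: 120_000 },
  latlon:     { value: "39.8 -89.6" }
}

# Assumed equivalent of Dreader::Util.simplify: keep only each cell's :value.
simple = row.transform_values { |cell| cell[:value] }

# Split the combined field into two, as the mapping block does.
simple[:lat], simple[:lon] = simple[:latlon].split(" ")

# Assumed equivalent of Dreader::Util.clean r, [:latlon]: drop the listed keys.
cleaned = simple.reject { |key, _value| key == :latlon }

p cleaned
# => {:city=>"Springfield", :population=>120000, :lat=>"39.8", :lon=>"-89.6"}
#    (newer Ruby versions print the same hash as {city: "Springfield", ...})
```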
data/lib/dreader/engine.rb CHANGED
@@ -21,6 +21,8 @@ module Dreader
  attr_accessor :declared_virtual_columns
  # the mapping rules
  attr_accessor :declared_mapping
+ # the declared filter
+ attr_accessor :declared_reject

  # the data we read
  attr_reader :table
@@ -118,6 +120,11 @@ module Dreader
    @declared_virtual_columns << column.to_hash.merge({ name: name })
  end

+ # define a filter, which skips some rows
+ def reject(&block)
+   @declared_reject = block
+ end
+
  # define what we do with each line we read
  # - `block` is the code which takes as input a `row` and processes
  # `row` is a hash in which each spreadsheet cell is accessible under
@@ -187,8 +194,13 @@ module Dreader
      # this has side-effects on r
      virtual_columns_on(r) if options[:virtual] || options[:mapping]

+     # check whether the filter would ignore this line
+     # notice that we need to invoke compact to avoid nil being added
+     # to the table
+     next if !options[:ignore_reject] && reject?(r)
+
      options[:mapping] ? mappings_on(r) : r
-   end
+   end.compact
  end

  # TODO: PASS A ROW (and not row_number and sheet)
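The hunk above combines =next= inside a =map= block with a trailing =compact=. A minimal sketch of that pattern in plain Ruby follows; the rows and the =declared_reject= lambda are made up for illustration and only imitate what the engine stores in =@declared_reject=.

```ruby
# Invented rows, using the column => { value: ... } layout from the README.
rows = [
  { city: { value: "Metropolis" }, population: { value: 5_000_000 } },
  { city: { value: "Smallville" }, population: { value: 45_000 } }
]

# Stand-in for the block stored by the reject declaration.
declared_reject = ->(row) { row[:population][:value] > 3_000_000 }

table = rows.map do |row|
  # `next` without a value makes the block return nil for this row ...
  next if declared_reject&.call(row)

  row
end.compact # ... and compact removes those nils from the result.

p table.map { |row| row[:city][:value] } # => ["Smallville"]
```

Without =compact=, the table would keep a =nil= placeholder for every rejected row. The safe-navigation call (=&.=) also means that no rows are rejected when no reject block has been declared.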
@@ -268,6 +280,7 @@ module Dreader

  # Compute virtual columns for, with side effect on row
  def virtual_columns_on(row)
+   @declared_virtual_columns ||= []
    @declared_virtual_columns.each do |virtualcol|
      colname = virtualcol[:name]
      row[colname] = { virtual: true }
@@ -291,13 +304,36 @@ module Dreader
    end
  end

- # apply the mapping code to the array it makes sense to invoke it only
- # once.
+ # check whether a line has to be rejected
+ def reject?(row)
+   rejected = @declared_reject&.call(row)
+   if rejected
+     @logger.debug "[dreader] row rejected by reject declaration #{row}"
+   end
+ end
+
+ # apply the mapping code to the @table. Notice that we do a side effect
+ # on @table and, hence, invoking the mapping twice won't work (you need to
+ # reload first).
+ #
+ # the mapping is applied only if it is defined and it returns the output of
+ # the mapping.
  #
- # the mapping is applied only if it defined and it uses map, so that
- # it can be used functionally
+ # notice also that we do a side-effect on @table. This is to make the
+ # behavior of
+ #
+ # i.load mapping: true
+ # i.table
+ #
+ # and
+ #
+ # i = load;
+ # i.mappings
+ # i.table
+ #
+ # the same
  def mappings
-   @table.map { |row| mappings_on(row) }
+   @table = @table.map { |row| mappings_on(row) }
  end

  def mappings_on(row)
@@ -398,28 +434,36 @@ module Dreader
  # list of keys we support in options. We remove them when reading
  # the CSV file
  OPTION_KEYS = %i[
-   filename sheet first_row last_row logger logger_level
+   filename extension sheet first_row last_row
+   logger logger_level
+   debug
  ]

  def open_spreadsheet(options)
    filename = options[:filename]
-   ext = options[:extension] || File.extname(filename)
+   # use the extension option or make ".CSV" into :csv
+   extension = options[:extension] || File.extname(filename).downcase[1..-1]&.to_sym
+
+
+   # TODO: MAKE DEBUG AND LOGGER INTO REAL CLASS VARIABLES OR MAKE LOCAL AND/OR FUNCTIONS
+   @debug = @declared_options.merge(options)[:debug] == true
+   if @debug
+     @logger = options[:logger] || Logger.new($stdout)
+     @logger.debug "[dreader open_spreadsheet] filename: #{filename}"
+     @logger.debug "[dreader open_spreadsheet] extension: #{extension}"
+   end

-   case ext
-   when ".csv"
+   case extension
+   when :csv
      csv_options = @declared_options.except(*OPTION_KEYS)
      Roo::CSV.new(filename, csv_options:)
-   when ".tsv"
+   when :tsv
      csv_options = @declared_options.except(*OPTION_KEYS).merge({ col_sep: "\t" })
      Roo::CSV.new(filename, csv_options:)
-   when ".ods"
-     Roo::OpenOffice.new(filename)
-   when ".xls"
-     Roo::Excel.new(filename)
-   when ".xlsx"
-     Roo::Excelx.new(filename)
+   when :ods, :xls, :xlsx
+     Roo::Spreadsheet.open(filename, extension:)
    else
-     raise "Unknown extension: #{ext}"
+     raise "Unknown extension: #{extension}. Use the :extension option."
    end
  end

data/lib/dreader/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Dreader
-   VERSION = "1.1.1"
+   VERSION = "1.2.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: dreader
  version: !ruby/object:Gem::Version
-   version: 1.1.1
+   version: 1.2.0
  platform: ruby
  authors:
  - Adolfo Villafiorita
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2023-10-31 00:00:00.000000000 Z
+ date: 2023-11-02 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: roo
@@ -108,6 +108,13 @@ files:
  - dreader.gemspec
  - examples/age/Birthdays.ods
  - examples/age/age.rb
+ - examples/age_csv/Birthdays-TabSeparated.csv
+ - examples/age_csv/Birthdays.csv
+ - examples/age_csv/age.rb
+ - examples/age_noext/Birthdays
+ - examples/age_noext/Birthdays-xlsx
+ - examples/age_noext/Birthdays-xlsx-with-wrong-extension.xls
+ - examples/age_noext/age.rb
  - examples/age_with_multiple_checks/Birthdays.ods
  - examples/age_with_multiple_checks/age_with_multiple_checks.rb
  - examples/local_vars/local_vars.rb
@@ -117,6 +124,7 @@ files:
  - examples/wikipedia_us_cities/us_cities.rb
  - examples/wikipedia_us_cities/us_cities.tsv
  - examples/wikipedia_us_cities/us_cities_bulk_declare.rb
+ - examples/wikipedia_us_cities/us_cities_reject.rb
  - lib/dreader.rb
  - lib/dreader/column.rb
  - lib/dreader/engine.rb
@@ -142,7 +150,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
    - !ruby/object:Gem::Version
      version: '0'
  requirements: []
- rubygems_version: 3.4.10
+ rubygems_version: 3.4.21
  signing_key:
  specification_version: 4
  summary: Process and import data from cvs and spreadsheets