dreader 1.1.2 → 1.2.0

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d59887423bcb1658823534ca3d8a5505899cd70a5f8d23eda235159c590cf364
4
- data.tar.gz: b17d71ed8d7db969fcc872a3f01cb6fdfbfaff69e0030ffc304ebc481a15675d
3
+ metadata.gz: 3c30be2fe49c6c8ce20d4930c75f1279ac1a92a099f609b1266b14dc61c7cf3c
4
+ data.tar.gz: 58c735a67c45ef11a180bc6f17892ba912656d346320da1a716caa69661f4695
5
5
  SHA512:
6
- metadata.gz: 1407751e4baa35d4c6644cc9c9ebea7a8185febd2211e6b76c86a3001e8972d58061b6946a178dd007043c897a9943d41971e0868dab2e9e332350a5436bfff1
7
- data.tar.gz: da163158d92a1618de0218f5285fe189f3ec5fe78b0d815c6b873574175c817ef4134725ff70a174b0206fb0da1770f9888044b076dbfc8e62b79f22e10930a6
6
+ metadata.gz: 7847892dbcf648432a9c51867fd70e1260e956e82ea7cbbad93f92882dd867be6f36f67a0edfbb972e16e12c1364efe92a54bd03d31801770da4b28fac725350
7
+ data.tar.gz: a717955a2eaa0c406d6fb140daf9cd084c11e0d4710289de34b7757b5fd4f4e920ded4fd1450306e3e42a42278c983e078ff8e53248fe0f5b7a93d09fb8a9d40
data/CHANGELOG.org CHANGED
@@ -1,5 +1,12 @@
1
1
  #+TITLE: Changelog
2
2
 
3
+ * Version 1.2.0 - <2023-11-02 Thu>
4
+ ** reject declaration
5
+
6
+ - A new reject declaration allows to reject some lines. reject takes as
7
+ input a row and can predicate over columns and virtual columns. When
8
+ true, the corresponding line is discarded.
9
+
3
10
  * Version 1.1.2 - <2023-10-31 Tue>
4
11
  ** Fixes an issue with the :extension option
5
12
 
data/README.org CHANGED
@@ -137,7 +137,8 @@ To write an import function with Dreader:
137
137
  and check parsed data
138
138
  - Add virtual columns, that is, columns computed from other values
139
139
  in the row
140
- - Specify how to map line. This is where you do the actual work
140
+ - Specify what lines you want to reject, if any
141
+ - Specify how to transform lines. This is where you do the actual work
141
142
  (for instance, if you process a file line by line) or put together data for
142
143
  processing after the file has been fully read --- see the next step.
143
144
 
@@ -398,6 +399,9 @@ See [[file:examples/wikipedia_us_cities/us_cities_bulk_declare.rb][us_cities_bul
398
399
  hash from the code block.
399
400
  #+END_NOTES
400
401
 
402
+ The data read from each row of our input data is stored in a hash. The hash
403
+ uses column names as the primary key and stores the values in the =:value=
404
+ key.
401
405
 
402
406
  *** Add virtual columns
403
407
 
@@ -427,6 +431,22 @@ Virtual columns are, of course, available to the =mapping= directive
427
431
  (see below).
428
432
 
429
433
 
434
+ *** Specify which lines to reject
435
+
436
+ You can reject some lines using the =reject= declaration, which is applied row
437
+ by row, can predicate over columns and virtual columns, and has to return a
438
+ Boolean value.
439
+
440
+ All lines returning a truish value will be be rejected, that is, not stored in
441
+ the =@table= variable (and, consequently, passed to the mapping function).
442
+
443
+ For instance, the following declaration rejects all lines in which the
444
+ population column is higher than =3_000_000=:
445
+
446
+ #+begin_src ruby
447
+ reject { |row| row[:population][:value] > 3_000_000 }
448
+ #+end_src
449
+
430
450
  *** Specify how to process each line
431
451
 
432
452
  The =mapping= directive specifies what to do with each line read. The
@@ -442,10 +462,9 @@ value of column =:age= and prints them to standard output
442
462
  end
443
463
  #+END_EXAMPLE
444
464
 
445
- The data read from each row of our input data is stored in a hash. The hash
446
- uses column names as the primary key and stores the values in the =:value=
447
- key.
448
-
465
+ To invoke the =mapping= declaration on a file, use the =mappings= method,
466
+ which invokes =map= to each row and it stores in the =@table= variable
467
+ whatever value mapping returns.
449
468
 
450
469
  *** Process data
451
470
 
@@ -501,7 +520,13 @@ A typical scenario works as follows:
501
520
  (Optionally: check again for errors.)
502
521
 
503
522
  5. Add your own code to process the data returned after =mappings=, which you
504
- can access with =i.table= or =i.data= (synonyms).
523
+ can assign to a variable (e.g., =returned_data = i.mappings=) or access
524
+ with =i.table= or =i.data= (synonyms).
525
+
526
+ #+begin_quote
527
+ Notice that =mappings= does a side effect and invoking the mapping twice in a
528
+ row won't work: you need to reload the file first.
529
+ #+end_quote
505
530
 
506
531
  Look in the examples directory for further details and a couple of working
507
532
  examples.
@@ -0,0 +1,77 @@
1
+ require 'dreader'
2
+
3
+ # this is the class which will contain all the data we read from the file
4
+ class City
5
+ [:city, :state, :population, :lat, :lon].each do |var|
6
+ attr_accessor var
7
+ end
8
+
9
+ def initialize(hash)
10
+ hash.each do |k, v|
11
+ self.send("#{k}=", v)
12
+ end
13
+ end
14
+ end
15
+
16
+ class Importer
17
+ extend Dreader::Engine
18
+
19
+ # read from us_cities.tsv, lines from 2 to 10 (included)
20
+ options do
21
+ filename "us_cities.tsv"
22
+ first_row 2
23
+ last_row 307
24
+ end
25
+
26
+ # these are the columns for which we only need to specify column and name
27
+ columns ({city: 2, state: 3, latlon: 11}) do
28
+ process { |val| val.strip }
29
+ end
30
+
31
+ # the population column requires more work
32
+ column :population do |col|
33
+ col.colref 4
34
+
35
+ # make "3,000" into 3000 (int)
36
+ col.process { |value| value.gsub(",", "").to_i }
37
+
38
+ # check population is positive
39
+ col.check { |value| value > 0 }
40
+ end
41
+
42
+ # reject all cities with more than 3M people
43
+ reject do |row|
44
+ row[:population][:value] >= 3_000_000
45
+ end
46
+
47
+ mapping do |row|
48
+ # remove all additional information stored in each cell
49
+ r = Dreader::Util.simplify row
50
+
51
+ # make latlon into the lat, lon fields
52
+ r[:lat], r[:lon] = r[:latlon].split(" ")
53
+
54
+ # now r contains something like
55
+ # {lat: ..., lon: ..., city: ..., state: ..., population: ..., latlon: ...}
56
+
57
+ # remove fields which are not understood by the Cities class and
58
+ # make a new instance
59
+ cleaned = Dreader::Util.clean r, [:latlon]
60
+
61
+ # you must declare an array cities before calling importer.mapping
62
+ City.new(cleaned)
63
+ end
64
+ end
65
+
66
+ # load and process
67
+ importer = Importer
68
+ importer.load mapping: true, debug: true
69
+
70
+ # output everything to see whether it works
71
+ puts "First ten cities in the US with less than 3M (source Wikipedia)"
72
+ importer.table.each do |city|
73
+ [:city, :state, :population, :lat, :lon].each do |var|
74
+ puts "#{var.to_s.capitalize}: #{city.send(var)}"
75
+ end
76
+ puts ""
77
+ end
@@ -21,6 +21,8 @@ module Dreader
21
21
  attr_accessor :declared_virtual_columns
22
22
  # the mapping rules
23
23
  attr_accessor :declared_mapping
24
+ # the declared filter
25
+ attr_accessor :declared_reject
24
26
 
25
27
  # the data we read
26
28
  attr_reader :table
@@ -118,6 +120,11 @@ module Dreader
118
120
  @declared_virtual_columns << column.to_hash.merge({ name: name })
119
121
  end
120
122
 
123
+ # define a filter, which skips some rows
124
+ def reject(&block)
125
+ @declared_reject = block
126
+ end
127
+
121
128
  # define what we do with each line we read
122
129
  # - `block` is the code which takes as input a `row` and processes
123
130
  # `row` is a hash in which each spreadsheet cell is accessible under
@@ -187,8 +194,13 @@ module Dreader
187
194
  # this has side-effects on r
188
195
  virtual_columns_on(r) if options[:virtual] || options[:mapping]
189
196
 
197
+ # check whether the filter would ignore this line
198
+ # notice that we need to invoke compact to avoid nil being added
199
+ # to the table
200
+ next if !options[:ignore_reject] && reject?(r)
201
+
190
202
  options[:mapping] ? mappings_on(r) : r
191
- end
203
+ end.compact
192
204
  end
193
205
 
194
206
  # TODO: PASS A ROW (and not row_number and sheet)
@@ -268,6 +280,7 @@ module Dreader
268
280
 
269
281
  # Compute virtual columns for, with side effect on row
270
282
  def virtual_columns_on(row)
283
+ @declared_virtual_columns ||= []
271
284
  @declared_virtual_columns.each do |virtualcol|
272
285
  colname = virtualcol[:name]
273
286
  row[colname] = { virtual: true }
@@ -291,13 +304,36 @@ module Dreader
291
304
  end
292
305
  end
293
306
 
294
- # apply the mapping code to the array it makes sense to invoke it only
295
- # once.
307
+ # check whether a line has to be rejected
308
+ def reject?(row)
309
+ rejected = @declared_reject&.call(row)
310
+ if rejected
311
+ @logger.debug "[dreader] row rejected by reject declaration #{row}"
312
+ end
313
+ end
314
+
315
+ # apply the mapping code to the @table. Notice that we do a side effect
316
+ # on @table and, hence, invoking the mapping twice won't work (you need to
317
+ # reload first).
318
+ #
319
+ # the mapping is applied only if it defined and it returns the output of
320
+ # the mapping.
321
+ #
322
+ # notice also that we do a side-effect on @table. This is to make the
323
+ # behavior of
324
+ #
325
+ # i.load mapping: true
326
+ # i.table
327
+ #
328
+ # and
329
+ #
330
+ # i = load;
331
+ # i.mappings
332
+ # i.table
296
333
  #
297
- # the mapping is applied only if it defined and it uses map, so that
298
- # it can be used functionally
334
+ # the same
299
335
  def mappings
300
- @table.map { |row| mappings_on(row) }
336
+ @table = @table.map { |row| mappings_on(row) }
301
337
  end
302
338
 
303
339
  def mappings_on(row)
@@ -1,3 +1,3 @@
1
1
  module Dreader
2
- VERSION = "1.1.2"
2
+ VERSION = "1.2.0"
3
3
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: dreader
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.1.2
4
+ version: 1.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Adolfo Villafiorita
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2023-11-01 00:00:00.000000000 Z
11
+ date: 2023-11-02 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: roo
@@ -124,6 +124,7 @@ files:
124
124
  - examples/wikipedia_us_cities/us_cities.rb
125
125
  - examples/wikipedia_us_cities/us_cities.tsv
126
126
  - examples/wikipedia_us_cities/us_cities_bulk_declare.rb
127
+ - examples/wikipedia_us_cities/us_cities_reject.rb
127
128
  - lib/dreader.rb
128
129
  - lib/dreader/column.rb
129
130
  - lib/dreader/engine.rb
@@ -149,7 +150,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
149
150
  - !ruby/object:Gem::Version
150
151
  version: '0'
151
152
  requirements: []
152
- rubygems_version: 3.4.10
153
+ rubygems_version: 3.4.21
153
154
  signing_key:
154
155
  specification_version: 4
155
156
  summary: Process and import data from cvs and spreadsheets