RubyGems - dreader - Versions diffs - 0.4.2 → 1.0.0 - Mend

dreader 0.4.2 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (21) hide show

checksums.yaml +4 -4
data/CHANGELOG.ORG +45 -0
data/Gemfile.lock +21 -8
data/README.org +794 -0
data/dreader.gemspec +6 -4
data/examples/age/age.rb +22 -6
data/examples/age_with_multiple_checks/Birthdays.ods +0 -0
data/examples/age_with_multiple_checks/age_with_multiple_checks.rb +62 -0
data/examples/template/template_generation.rb +37 -0
data/examples/wikipedia_big_us_cities/big_us_cities.rb +20 -18
data/examples/wikipedia_us_cities/us_cities.rb +28 -27
data/examples/wikipedia_us_cities/us_cities_bulk_declare.rb +22 -22
data/lib/dreader/column.rb +39 -0
data/lib/dreader/engine.rb +473 -0
data/lib/dreader/options.rb +16 -0
data/lib/dreader/util.rb +71 -0
data/lib/dreader/version.rb +1 -1
data/lib/dreader.rb +5 -411
metadata +59 -25
data/Changelog.org +0 -20
data/README.md +0 -469

data/README.org ADDED Viewed

@@ -0,0 +1,794 @@
+#+TITLE: Dreader
+#+AUTHOR: Adolfo Villafiorita
+#+STARTUP: showall
+Dreader is a simple DSL built on top of [[https://github.com/roo-rb/roo][Roo]] to read and process
+tabular data (CSV, LibreOffice, Excel) in a simple and structured way.
+Main advantages:
+1. All code to parse input data has the same structure, simplifying
+   code management and understanding (convention over configuration).
+2. It favors a declarative approach, clearly identifying from which
+   data has to be read and in which way.
+3. Has facilities to run simulations, to debug and check code and
+   data.
+We use Dreader for importing fairly big files (in the order of
+10K-100K records) in MIP, an ERP to manage distribution of bins to the
+population.  The main issues we had before using Dreader were errors
+and exceptional cases in the input data.  We also had to manage
+several small variations in the input files (coming from different
+ERPs) and Dreader helped us standardizing the input code.
+The gem depends on =roo=, from which it leverages all data
+reading/parsing facilities keeping its size in about 250 lines of
+code.
+It should be relatively easy to use; /dreader/ stands for /d/ata /r/eader.
+* Installation
+Add this line to your application's Gemfile:
+#+BEGIN_EXAMPLE ruby
+  gem 'dreader'
+#+END_EXAMPLE
+And then execute:
+#+BEGIN_EXAMPLE
+  $ bundle
+#+END_EXAMPLE
+Or install it yourself as:
+#+BEGIN_EXAMPLE
+  $ gem install dreader
+#+END_EXAMPLE
+* Usage
+** Quick start
+Print name and age of people from the following data:
+| Name             | Date of birth   |
+|------------------+-----------------|
+| Forest Whitaker  | July 15, 1961   |
+| Daniel Day-Lewis | April 29, 1957  |
+| Sean Penn        | August 17, 1960 |
+#+BEGIN_EXAMPLE ruby
+  require 'dreader'
+  class Reader < Dreader::Engine
+    options do
+      # we start reading from row 2
+      first_row 2
+    end
+    column :name do
+      doc "column A contains :name, a string; doc is optional"
+      colref 'A'
+    end
+    # column B contains :birthdate, a date. We can use a Hash and omit
+    # colref
+    column({ birthdate: 'B' }) do
+      process do |c|
+        Date.parse(c)
+      end
+    end
+    # add as many example lines as you want to show examples of good
+    # records these example lines are added to the template generated with
+    # generate_template
+    example { name: "John", birthday: "27/03/2020" }
+    # for each line, :age is computed from :birthdate
+    virtual_column :age do
+      process do |row|
+        birthdate = row[:birthdate][:value]
+        birthday = Date.new(Date.today.year, birthdate.month, birthdate.day)
+        today = Date.today
+        [0, today.year - birthdate.year - (birthday < today ? 1 : 0)].max
+      end
+    end
+    # this is how we process each line of the input file
+    mapping do |row|
+      r = Dreader::Util.simplify(row)
+      puts "#{r[:name]} is #{r[:age]} years old (born on #{r[:birthdate]})"
+    end
+  end
+  reader = Reader.new
+  # read the file
+  reader.read filename: "Birthdays.ods"
+  # compute the virtual columns
+  reader.virtual_columns
+  # run the mapping declaration
+  reader.process
+  #
+  # Here we can do further processing on the data
+  #
+  File.open("ages.txt", "w") do |file|
+    reader.table.each do |row|
+      unless row[:row_errors].any?
+        file.puts "#{row[:name][:value]} #{row[:age][:value]}"
+      end
+    end
+  end
+#+END_EXAMPLE
+** Gentler Introduction
+To write an import function with Dreader:
+- Declare which is the input file and where we can find data (Sheet
+  and first row)
+- Declare the content of columns and how to check raw data, parse data,
+  and check parsed data
+- Add virtual columns, that is, columns computed from other values
+  in the row
+- Specify how to process data each line. This is where you do the actual work
+  (for instance, if you process a file line by line) or put together data for
+  processing after the file has been fully read --- see the next step.
+Dreader has now collected and shaped the data according to your instructions
+and collected errors in the process.  We are now ready to do the actual
+processing:
+- Do the processing
+Each step is described in more details in the following sections.
+*** Declare which is the input file and where we can find data
+Require =dreader= and declare a class which inherits from =Dreader::Engine=:
+#+BEGIN_EXAMPLE ruby
+  require 'dreader'
+  class Reader < Dreader::Engine
+  [...]
+  end
+#+END_EXAMPLE
+In the class specify parsing option, using the following syntax:
+#+BEGIN_EXAMPLE ruby
+  options do
+    filename 'example.ods'
+    sheet 'Sheet 1'
+    first_row 1
+    last_row 20
+    # optional (this allows to integrate with other applications already
+    # using a logger)
+    logger Logger.new
+    logger_level Logger::INFO
+  end
+#+END_EXAMPLE
+where:
+- (optional) =filename= is the file to read. If not specified, you will
+  have to supply a filename when loading the file (see =read=, below).
+  The extension determines the file type. *Use =.tsv= for tab-separated
+  files.*
+- (optional) =first_row= is the first line to read (use =2= if your file
+  has a header)
+- (optional) =last_row= is the last line to read. If not specified, we
+  will rely on =roo= to determine the last row.  This is useful for
+  those files in which you only want to process some of the content or
+  contain "garbage" after the records.
+- (optional) =sheet= is the sheet name or number to read from. If not
+  specified, the first (default) sheet is used
+#+BEGIN_NOTES
+You can override some of the defaults by passing a hash as argument to
+the =read= function. For instance:
+#+BEGIN_EXAMPLE ruby
+  i.read filename: another_filepath
+#+END_EXAMPLE
+will read data from =another_filepath=, rather than from the filename
+specified in the options. This might be useful, for instance, if the
+same specification has to be used for different files.
+#+END_NOTES
+*** Declare the content of columns and how to parse them
+Declare the columns you want to read by assigning them a name and a column
+reference.
+There are two notations:
+#+BEGIN_EXAMPLE ruby
+  # First notation, colref is put in the block
+  i.column :name do
+    colref 'A'
+  end
+  # Second notation, a hash is passed in the name
+  i.column({ name: 'A' }) do
+  end
+#+END_EXAMPLE
+The reference to a column can either be a letter or a number. First column
+is ='A'= or =1=.
+The =column= declaration can contain Ruby blocks:
+- one or more =check_raw= block check raw data as read from the input
+  file. They can be used, for instance, to verify presence of a value in the
+  input file.  *Check must return true if there are no errors; any other
+  value (e.g. an array of messages) is considered an error.*
+- =process= can be used to transform data into something closer to the input
+  data required for the importing (e.g., it can be used for downcase or
+  strip a string)
+- one or more =check= block perform a check on the =process=ed data, to check
+  for errors. They can be used, for instance, to check that a model built with
+  =process= is valid.  *Check must return true if there are no errors.*
+#+begin_example
+  i.column({ name: 'A' }) do
+    check_raw do |cell|
+      !cell.nil?
+    end
+  end
+#+end_example
+#+begin_quote
+  *If you declare more than a check block of the same type per column, use a
+  unique symbol to distinguish the blocks or the error messages will be
+  overwritten*.
+#+end_quote
+#+begin_example
+  i.column({ name: 'A' }) do
+    check_raw :must_be_non_nil do |cell|
+      !cell.nil?
+    end
+    check_raw :first_letter_must_be_a do |cell|
+      cell[0] == 'A'
+    end
+  end
+#+end_example
+#+begin_quote
+  =process= is always executed before =check=. If you want to check raw data
+  use the =check_raw= directive.
+#+end_quote
+#+begin_quote
+  There can be only one process block.  *If you define more than one per
+  column, only the last one is executed.*
+#+end_quote
+#+begin_example
+  i.column({ name: 'A' }) do
+    check_raw do |cell|
+      # Here cell is like in the input file
+    end
+    process do |cell|
+      cell.upcase
+    end
+    check do |cell|
+      # Here cell is upcase and
+    end
+  end
+#+end_example
+For instance, given the tabular data:
+| Name             | Date of birth   |
+|------------------+-----------------|
+| Forest Whitaker  | July 15, 1961   |
+| Daniel Day-Lewis | April 29, 1957  |
+| Sean Penn        | August 17, 1960 |
+we could use the following declaration to specify the data to read:
+#+BEGIN_EXAMPLE ruby
+  # we want to access column 1 using :name (1 and A are equivalent)
+  # :name should be non nil and of length greater than 0
+  column :name do
+    colref 1
+    check do |x|
+      x and x.length > 0
+    end
+  end
+  # we want to access column 2 (Date of birth) using :birthdate
+  column :birthdate do
+    colref 2
+    # make sure the column is transformed into a Date
+    process do |x|
+      Date.parse(x)
+    end
+    # check age is a date (check is invoked on the value returned
+    # by process)
+    check do |x|
+      x.class == Date
+    end
+  end
+#+END_EXAMPLE
+#+BEGIN_NOTES
+1. The column name can be anything Ruby can use as a key for a Hash,
+   such as, for instance, symbols, strings, and even object instances.
+2. =colref= can be a string (e.g., ='A'=) or an integer, with
+   1 and "A" being the first column.
+3. *You need to declare only the columns you want to import.* For
+   instance, we could skip the declaration for column 1, if 'Date of
+   Birth' is the only data we want to import
+4. If =process= and =check= are specified, then =check= will receive the
+   result of invoking =process= on the cell value. This makes sense if
+   process is used to make the cell value more accessible to ruby code
+   (e.g., transforming a string into an integer).
+#+END_NOTES
+If there are different columns that have to be read and processed in the same
+way, =columns= (notice the plural form) allows for a more compact
+representation:
+#+BEGIN_EXAMPLE ruby
+  columns { a: 'A', b: 'B' }
+#+END_EXAMPLE
+is equivalent to:
+#+BEGIN_EXAMPLE ruby
+  column :a do
+    colref 'A'
+  end
+  column :b do
+    colref 'B'
+  end
+#+END_EXAMPLE
+=columns= accepts a code block, which can be used to add =process= and =check=
+declarations:
+#+BEGIN_EXAMPLE ruby
+  columns({ a: 'A', b: 'B' }) do
+    process do |cell|
+      ...
+    end
+  end
+#+END_EXAMPLE
+See [[file:examples/wikipedia_us_cities/us_cities_bulk_declare.rb][us_cities_bulk_declare.rb]] for an example of =columns=.
+#+BEGIN_NOTES
+  If you use code blocks, don't forget to put in parentheses the
+  column mapping, or the Ruby parser won't be able to distinguish the
+  hash from the code block.
+#+END_NOTES
+*** Add virtual columns
+Sometimes it is convenient to aggregate or otherwise manipulate the data
+read from each row, before doing the actual processing.
+For instance, we might have a table with dates of birth, while we are
+really interested in the age of people.
+In such cases, we can use virtual column. A *virtual column* allows
+one to add a column to the data read, computed using the values of
+other cells in the same row.
+The following declaration adds an =age= column to each row of the data
+read from the previous example:
+#+BEGIN_EXAMPLE ruby
+  virtual_column :age do
+    process do |row|
+      # the function `compute_birthday` has to be defined
+      compute_birthday(row[:birthdate])
+    end
+  end
+#+END_EXAMPLE
+Virtual columns are, of course, available to the =mapping= directive
+(see below).
+*** Specify how to process each line
+The =mapping= directive specifies what to do with each line read.  The
+=mapping= declaration takes an arbitrary piece of ruby code, which can
+reference the fields using the column names we declared.
+For instance the following code gets the value of column =:name=, the
+value of column =:age= and prints them to standard output
+#+BEGIN_EXAMPLE ruby
+  mapping do |row|
+    puts "#{row[:name][:value]} is #{row[:age][:value]} years old"
+  end
+#+END_EXAMPLE
+The data read from each row of our input data is stored in a hash. The hash
+uses column names as the primary key and stores the values in the =:value=
+key.
+*** Process data
+If =mapping= does not work for your data processing activities (e.g., you need
+to make elaborations on data which span different rows), you can add your own
+code after the =process= directive.
+A typical scenario works as follows:
+1. Instantiate the class: ~i = Reader.new~
+1. Use =i.read= or =i.load= (synonyms), to read all data.
+#+BEGIN_EXAMPLE ruby
+  i.read
+#+END_EXAMPLE
+2. Use =errors= to see whether any of the check functions failed:
+#+BEGIN_EXAMPLE ruby
+  array_of_hashes = i.errors
+  array_of_hashes.each do |error_hash|
+    puts error_hash
+  end
+#+END_EXAMPLE
+3. Use =virtual_columns= to generate the virtual columns:
+#+BEGIN_EXAMPLE ruby
+  i.virtual_columns
+#+END_EXAMPLE
+(Optionally: check again for errors.)
+4. Use the =process= function to execute the =mapping=
+directive on each line read from the file.
+#+BEGIN_EXAMPLE ruby
+  i.process
+#+END_EXAMPLE
+(Optionally: check again for errors.)
+5. Add your own code to process data. Use the =table= function to access data.
+Look in the examples directory for further details and a couple of
+working examples.
+*** Managing Errors
+**** Finding errors in input data
+Dreader collects errors in three specific ways:
+1. In each column specification, using =check_raw= and =check=.  This allows
+   to check each field for errors (e.g., a =nil= value in a cell)
+2. In virtual columns, using =check_raw= and =check=.  This allows to perform
+   more complex checks by putting together all the values read from a row
+   (e.g., =to_date= occurs before =from_date=)
+The following, for instance checks that name or surname have a valid value:
+#+begin_example ruby
+virtual_column :global_check do
+  doc "Name or Surname must exist"
+  check :name_or_surname_must_be_defined do |row|
+    row[:name] || row[:surname]
+  end
+end
+#+end_example
+If you prefer, you can also define a virtual column that contains the value of
+the check:
+#+begin_example ruby
+virtual_column :name_or_surname_exist do
+  doc "Name or Surname must exist"
+  process do |row|
+    row[:name] || row[:surname]
+  end
+end
+#+end_example
+You can then act in the mapping directive according to value returned by the
+virtual column:
+#+begin_example ruby
+mapping do |row|
+  unless row[:global_check][:value] == false
+  [...]
+end
+#+end_example
+**** Managing Errors
+You can check for errors in two different ways:
+The first is in the =mapping= directive, where can check whether some checks for
+the =row= failed, by:
+1. checking from the =:error= boolean key associated to each column, that is:
+   =row[<column_name>][:error]=
+2. looking at the value of the =:row_errors= key, which contains all error messages
+   generated for the row:
+   =row[:row_errors]=
+3. After the processing, by using the method =errors=, which lists all the errors.
+The utility function =Dreader::Util.errors= takes as input the errors generated by
+Dreader and extract those of a specific row and, optionally column:
+#+begin_example ruby
+  # get all the errors at line 2
+  Dreader::Util.errors i.errors, 2
+  # get all the errors at line 2, column 'C'
+  Dreader::Util.errors i.errors, 2, 3
+#+end_example
+* Generating a Template from the specification
+From version 0.6.0 =dreader= allows to generate a template starting from the
+specification.
+The template is generated by the following call:
+#+begin_example ruby
+generate_template template_filename: "template.xlsx"
+#+end_example
+(The =template_filename= directive can also be specified in the =options=
+section).
+The template contains the following rows:
+- The first row contains the names of the columns, as specified in the
+  =columns= declarations and made into a human readable form.
+- The second row contains the doc strings of the columns, if set.
+- The remaining rows contain the example records added with the
+  =example= directive
+The position of the first row is determined by the value of =first_row=, that
+is, if =first_row= is 2 (content starts from the second row), the header row
+is put in row 1.
+Only Excel is supported, at the moment.
+An example of template generation can be found in the Examples.
+** Digging deeper
+If you need to perform elaborations which cannot be performed row by
+row you can access all data, with the =table= method:
+#+BEGIN_EXAMPLE ruby
+  i.read
+  i.table
+#+END_EXAMPLE
+The function =i.table= returns an array of Hashes.  Each element of
+the array is a row of the input file.  Each element/row has the
+following structure:
+#+BEGIN_EXAMPLE ruby
+  {
+    col_name1: { <info about col_name_1 in row_j> },
+    [...]
+    col_nameN: { <info about col_name_N in row_j> },
+    row_errors: [ <errors associated to row> ],
+    row_number: <row number>
+  }
+#+END_EXAMPLE
+where =col_name1=, ..., =col_nameN= are the names you have assigned to
+the columns and the information stored for each cell is the
+following:
+#+BEGIN_EXAMPLE ruby
+  {
+    value: ...,      # the result of calling process on the cell
+    row_number: ..., # the row number
+    col_number: ..., # the column number
+    error: ...       # the result of calling check on the cell processed value
+  }
+#+END_EXAMPLE
+(Note that virtual columns only store =value= and a Boolean =virtual=,
+which is always =true=.)
+Thus, for instance, given the example above returns:
+#+BEGIN_EXAMPLE ruby
+  i.table
+  [
+    {
+      name: { value: "John", row_number: 1, col_number: 1, errors: nil },
+      age:  { value: 30, row_number: 1, col_number: 2, errors: nil }
+    },
+    {
+      name: { value: "Jane", row_number: 2, col_number: 1, errors: nil },
+      age:  { value: 31, row_number: 2, col_number: 2, errors: nil }
+    }
+  ]
+#+END_EXAMPLE
+* Simplifying the hash with the data read
+The =Dreader::Util= class provides some functions to simplify the
+hashes built by =dreader=.  This is useful to simplify the code you
+write and to genereate hashes you can pass, for instance, to
+ActiveRecord creators.
+** Simplify removes everything but the values
+=Dreader::Util.simplify hash= removes all information but the value
+and making the value accessible directly from the name of the column.
+#+BEGIN_EXAMPLE ruby
+  i.table[0]
+  { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
+    age:  { value: 30, row_number: 1, col_number: 2, errors: nil } }
+  Dreader::Util.simplify i.table[0]
+  { name: "John", age: 30 }
+#+END_EXAMPLE
+*As an additional bonus, it removes the keys =row_number= and =row_errors=,
+which are not part of the data read, in the first place.*
+** Slice and Clean select columns
+=Dreader::Util.slice hash, keys= and =Dreader::Util.clean hash, keys=,
+where =keys= is an arrays of keys, are respectively used to select or
+remove some keys from the hash returned by Dreader.  (Notice that the
+Ruby Hash class already provides similar methods.)
+#+BEGIN_EXAMPLE ruby
+  i.table[0]
+  { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
+    age:  { value: 30, row_number: 1, col_number: 2, errors: nil }}
+  Dreader::Util.slice i.table[0], :name
+  { name: { value: "John", row_number: 1, col_number: 1, errors: nil}
+  Dreader::Util.clean i.table[0], :name
+  { age:  { value: 30, row_number: 1, col_number: 2, errors: nil }
+#+END_EXAMPLE
+The methods =slice= and =clean= are more useful when used in
+conjuction with =simplify=:
+#+BEGIN_EXAMPLE ruby
+  hash = Dreader::Util.simplify i.table[0]
+  { name: "John", age: 30 }
+  Dreader::Util.slice hash, [:age]
+  { age: 30 }
+  Dreader::Util.clean hash, [:age]
+  { name: "John" }
+#+END_EXAMPLE
+The output produced by =slice= and =simplify= is a hash which can be used to
+create an =ActiveRecord= object.
+** Better Integration with ActiveRecord
+Finally, the =Dreader::Util.restructure= method helps building hashes
+to create [[http://api.rubyonrails.org/classes/ActiveModel/Model.html][ActiveModel]] objects with nested attributes:
+#+BEGIN_EXAMPLE ruby
+  hash = {name: "John", surname: "Doe", address: "Unknown", city: "NY" }
+  Dreader::Util.restructure hash, [:name, :surname], :address_attributes, [:address, :city]
+  {name: "John", surname: "Doe", address_attributes: {address: "Unknonw", city: "NY"}}
+#+END_EXAMPLE
+* Debugging your specification
+The =debug= function prints the current configuration, reads some
+records from the input file(s), and shows the records read:
+#+BEGIN_EXAMPLE ruby
+  i.debug
+  i.debug n: 40 # read 40 lines (from first_row)
+  i.debug n: 40, filename: filepath # like above, but read from filepath
+#+END_EXAMPLE
+By default =debug= invokes the =check_raw=, =process=, and =check=
+directives. Pass the following options, if you want to disable this behavior;
+this might be useful, for instance, if you intend to check only what data is
+read:
+#+BEGIN_EXAMPLE ruby
+  i.debug process: false, check: false
+#+END_EXAMPLE
+Notice that =check= implies =process=, since =check= is invoked on the
+output of the =process= directive.`
+If you prefer, in alternative to =debug= you can also use configuration
+variables (but then you need to change the configuration according to the
+environment):
+#+begin_example ruby
+  i.options do
+    debug true
+  end
+#+end_example
+* Changelog
+See [[file:CHANGELOG.ORG][CHANGELOG]].
+* Known Limitations
+At the moment:
+- it is not possible to specify column references using header names
+  (like Roo does).
+- it is not possible to pass options to the file readers. As a
+  consequence tab-separated files must have the =.tsv= extension or
+  they will not be parsed correctly
+- some more testing wouldn't hurt.
+* Known Bugs
+Some known bugs and an unknown number of unknown bugs.
+(See the open issues for the known bugs.)
+* Development
+After checking out the repo, run =bin/setup= to install dependencies.
+You can also run =bin/console= for an interactive prompt that will
+allow you to experiment.
+To install this gem onto your local machine, run =bundle exec rake
+install=. To release a new version, update the version number in
+=version.rb=, and then run =bundle exec rake release=, which will
+create a git tag for the version, push git commits and tags, and push
+the =.gem= file to [[https://rubygems.org][rubygems.org]].
+* Contributing
+Bug reports and pull requests are welcome.
+You need to get in touch with me by email, till I figure how to enable
+it in Gitea.
+* License
+[[https://opensource.org/licenses/MIT][MIT License]].