RubyGems - dreader - Versions diffs - 0.5.0 → 1.0.0 - Mend

dreader 0.5.0 → 1.0.0

Files changed (21)

checksums.yaml +4 -4
data/CHANGELOG.ORG +45 -0
data/Gemfile.lock +20 -7
data/README.org +794 -0
data/dreader.gemspec +6 -4
data/examples/age/age.rb +22 -6
data/examples/age_with_multiple_checks/Birthdays.ods +0 -0
data/examples/age_with_multiple_checks/age_with_multiple_checks.rb +62 -0
data/examples/template/template_generation.rb +37 -0
data/examples/wikipedia_big_us_cities/big_us_cities.rb +20 -18
data/examples/wikipedia_us_cities/us_cities.rb +28 -27
data/examples/wikipedia_us_cities/us_cities_bulk_declare.rb +22 -22
data/lib/dreader/column.rb +39 -0
data/lib/dreader/engine.rb +473 -0
data/lib/dreader/options.rb +16 -0
data/lib/dreader/util.rb +71 -0
data/lib/dreader/version.rb +1 -1
data/lib/dreader.rb +5 -411
metadata +59 -24
data/Changelog.org +0 -20
data/README.md +0 -469

data/README.org ADDED

@@ -0,0 +1,794 @@
+#+TITLE: Dreader
+#+AUTHOR: Adolfo Villafiorita
+#+STARTUP: showall
+Dreader is a simple DSL built on top of [[https://github.com/roo-rb/roo][Roo]] to read and process
+tabular data (CSV, LibreOffice, Excel) in a simple and structured way.
+Main advantages:
+1. All code to parse input data has the same structure, simplifying
+   code management and understanding (convention over configuration).
+2. It favors a declarative approach, clearly identifying where data
+   has to be read from and in which way.
+3. Has facilities to run simulations, to debug and check code and
+   data.
+We use Dreader for importing fairly big files (in the order of
+10K-100K records) in MIP, an ERP to manage distribution of bins to the
+population.  The main issues we had before using Dreader were errors
+and exceptional cases in the input data.  We also had to manage
+several small variations in the input files (coming from different
+ERPs) and Dreader helped us standardize the input code.
+The gem depends on =roo=, from which it leverages all data
+reading/parsing facilities, keeping its size to about 250 lines of
+code.
+It should be relatively easy to use; /dreader/ stands for /d/ata /r/eader.
+* Installation
+Add this line to your application's Gemfile:
+#+BEGIN_EXAMPLE ruby
+  gem 'dreader'
+#+END_EXAMPLE
+And then execute:
+#+BEGIN_EXAMPLE
+  $ bundle
+#+END_EXAMPLE
+Or install it yourself as:
+#+BEGIN_EXAMPLE
+  $ gem install dreader
+#+END_EXAMPLE
+* Usage
+** Quick start
+Print name and age of people from the following data:
+| Name             | Date of birth   |
+|------------------+-----------------|
+| Forest Whitaker  | July 15, 1961   |
+| Daniel Day-Lewis | April 29, 1957  |
+| Sean Penn        | August 17, 1960 |
+#+BEGIN_EXAMPLE ruby
+  require 'dreader'
+  class Reader < Dreader::Engine
+    options do
+      # we start reading from row 2
+      first_row 2
+    end
+    column :name do
+      doc "column A contains :name, a string; doc is optional"
+      colref 'A'
+    end
+    # column B contains :birthdate, a date. We can use a Hash and omit
+    # colref
+    column({ birthdate: 'B' }) do
+      process do |c|
+        Date.parse(c)
+      end
+    end
+    # add as many example lines as you want to show examples of good
+    # records; these example lines are added to the template generated with
+    # generate_template (the parentheses keep Ruby from parsing the hash
+    # as a block)
+    example({ name: "John", birthdate: "27/03/2020" })
+    # for each line, :age is computed from :birthdate
+    virtual_column :age do
+      process do |row|
+        birthdate = row[:birthdate][:value]
+        birthday = Date.new(Date.today.year, birthdate.month, birthdate.day)
+        today = Date.today
+        # subtract one year if this year's birthday is still to come
+        [0, today.year - birthdate.year - (birthday > today ? 1 : 0)].max
+      end
+    end
+    # this is how we process each line of the input file
+    mapping do |row|
+      r = Dreader::Util.simplify(row)
+      puts "#{r[:name]} is #{r[:age]} years old (born on #{r[:birthdate]})"
+    end
+  end
+  reader = Reader.new
+  # read the file
+  reader.read filename: "Birthdays.ods"
+  # compute the virtual columns
+  reader.virtual_columns
+  # run the mapping declaration
+  reader.process
+  #
+  # Here we can do further processing on the data
+  #
+  File.open("ages.txt", "w") do |file|
+    reader.table.each do |row|
+      unless row[:row_errors].any?
+        file.puts "#{row[:name][:value]} #{row[:age][:value]}"
+      end
+    end
+  end
+#+END_EXAMPLE
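+The age computation used in the virtual column can be tried on its own.
+The following is a minimal sketch in plain Ruby that does not call Dreader;
+=age_on= is a hypothetical helper introduced only for illustration. It
+compares month and day directly, so it does not need to build a date in the
+current year (which would fail for a February 29 birthdate in a non-leap
+year).

```ruby
require 'date'

# Hypothetical helper (not part of Dreader): age in whole years on a given day.
def age_on(birthdate, today = Date.today)
  years = today.year - birthdate.year
  # Compare [month, day] pairs: -1 means this year's birthday is still to come.
  cmp = [today.month, today.day] <=> [birthdate.month, birthdate.day]
  years -= 1 if cmp < 0
  [0, years].max
end

born = Date.new(1961, 7, 15)
puts age_on(born, Date.new(2024, 7, 14)) # the day before the birthday: 62
puts age_on(born, Date.new(2024, 7, 15)) # on the birthday: 63
```

+The same expression could be placed inside the =process= block of the
+=:age= virtual column.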
+** Gentler Introduction
+To write an import function with Dreader:
+- Declare which is the input file and where we can find data (Sheet
+  and first row)
+- Declare the content of columns and how to check raw data, parse data,
+  and check parsed data
+- Add virtual columns, that is, columns computed from other values
+  in the row
+- Specify how to process each line. This is where you do the actual work
+  (for instance, if you process a file line by line) or put together data for
+  processing after the file has been fully read (see the next step).
+Dreader has now collected and shaped the data according to your instructions
+and collected errors in the process.  We are now ready to do the actual
+processing:
+- Do the processing
+Each step is described in more detail in the following sections.
+*** Declare which is the input file and where we can find data
+Require =dreader= and declare a class which inherits from =Dreader::Engine=:
+#+BEGIN_EXAMPLE ruby
+  require 'dreader'
+  class Reader < Dreader::Engine
+  [...]
+  end
+#+END_EXAMPLE
+In the class specify the parsing options, using the following syntax:
+#+BEGIN_EXAMPLE ruby
+  options do
+    filename 'example.ods'
+    sheet 'Sheet 1'
+    first_row 1
+    last_row 20
+    # optional (this allows integration with other applications already
+    # using a logger)
+    logger Logger.new($stdout)
+    logger_level Logger::INFO
+  end
+#+END_EXAMPLE
+where:
+- (optional) =filename= is the file to read. If not specified, you will
+  have to supply a filename when loading the file (see =read=, below).
+  The extension determines the file type. *Use =.tsv= for tab-separated
+  files.*
+- (optional) =first_row= is the first line to read (use =2= if your file
+  has a header)
+- (optional) =last_row= is the last line to read. If not specified, we
+  will rely on =roo= to determine the last row.  This is useful for
+  files in which you only want to process some of the content or
+  which contain "garbage" after the records.
+- (optional) =sheet= is the sheet name or number to read from. If not
+  specified, the first (default) sheet is used
+#+BEGIN_NOTES
+You can override some of the defaults by passing a hash as argument to
+the =read= function. For instance:
+#+BEGIN_EXAMPLE ruby
+  i.read filename: another_filepath
+#+END_EXAMPLE
+will read data from =another_filepath=, rather than from the filename
+specified in the options. This might be useful, for instance, if the
+same specification has to be used for different files.
+#+END_NOTES
+*** Declare the content of columns and how to parse them
+Declare the columns you want to read by assigning them a name and a column
+reference.
+There are two notations:
+#+BEGIN_EXAMPLE ruby
+  # First notation, colref is put in the block
+  i.column :name do
+    colref 'A'
+  end
+  # Second notation, a hash is passed in the name
+  i.column({ name: 'A' }) do
+  end
+#+END_EXAMPLE
+The reference to a column can either be a letter or a number. First column
+is ='A'= or =1=.
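+To make the equivalence between the two kinds of reference concrete, here is
+a small sketch in plain Ruby. =column_number= is a hypothetical helper, not
+Dreader's own code, and whether Dreader itself accepts multi-letter
+references such as ='AA'= is not stated in this document.

```ruby
# Hypothetical helper (not Dreader's code): turn a column reference into a
# 1-based column number. Integers are returned unchanged.
def column_number(colref)
  return colref if colref.is_a?(Integer)

  # Treat the letters as a base-26 number where 'A' is 1 and 'Z' is 26.
  colref.upcase.each_char.reduce(0) do |number, letter|
    number * 26 + (letter.ord - 'A'.ord + 1)
  end
end

puts column_number('A') # 1
puts column_number('C') # 3
puts column_number(2)   # 2
```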
+The =column= declaration can contain Ruby blocks:
+- one or more =check_raw= blocks check raw data as read from the input
+  file. They can be used, for instance, to verify presence of a value in the
+  input file.  *Check must return true if there are no errors; any other
+  value (e.g. an array of messages) is considered an error.*
+- =process= can be used to transform data into something closer to the input
+  data required for the importing (e.g., it can be used to downcase or
+  strip a string)
+- one or more =check= blocks perform a check on the =process=ed data, to check
+  for errors. They can be used, for instance, to check that a model built with
+  =process= is valid.  *Check must return true if there are no errors.*
+#+begin_example
+  i.column({ name: 'A' }) do
+    check_raw do |cell|
+      !cell.nil?
+    end
+  end
+#+end_example
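+Since any value other than =true= counts as an error, a check can return the
+list of problems it found. The following sketch shows that pattern with a
+plain lambda standing in for the body of a =check_raw= block; the lambda
+name, the messages, and the 50-character limit are assumptions made for
+illustration.

```ruby
# A plain lambda standing in for the body of a check_raw block.
# It returns true when the cell is fine, and an array of messages otherwise.
name_check = lambda do |cell|
  messages = []
  messages << "name is missing" if cell.nil? || cell.to_s.strip.empty?
  messages << "name is too long" if cell.to_s.length > 50 # arbitrary limit
  messages.empty? ? true : messages
end

p name_check.call("Sean Penn") # true
p name_check.call(nil)         # ["name is missing"]
```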
+#+begin_quote
+  *If you declare more than one check block of the same type per column, use a
+  unique symbol to distinguish the blocks or the error messages will be
+  overwritten*.
+#+end_quote
+#+begin_example
+  i.column({ name: 'A' }) do
+    check_raw :must_be_non_nil do |cell|
+      !cell.nil?
+    end
+    check_raw :first_letter_must_be_a do |cell|
+      cell[0] == 'A'
+    end
+  end
+#+end_example
+#+begin_quote
+  =process= is always executed before =check=. If you want to check raw data
+  use the =check_raw= directive.
+#+end_quote
+#+begin_quote
+  There can be only one process block.  *If you define more than one per
+  column, only the last one is executed.*
+#+end_quote
+#+begin_example
+  i.column({ name: 'A' }) do
+    check_raw do |cell|
+      # Here cell is like in the input file
+    end
+    process do |cell|
+      cell.upcase
+    end
+    check do |cell|
+      # Here cell is upcased (the value returned by process)
+    end
+  end
+#+end_example
+For instance, given the tabular data:
+| Name             | Date of birth   |
+|------------------+-----------------|
+| Forest Whitaker  | July 15, 1961   |
+| Daniel Day-Lewis | April 29, 1957  |
+| Sean Penn        | August 17, 1960 |
+we could use the following declaration to specify the data to read:
+#+BEGIN_EXAMPLE ruby
+  # we want to access column 1 using :name (1 and A are equivalent)
+  # :name should be non nil and of length greater than 0
+  column :name do
+    colref 1
+    check do |x|
+      x and x.length > 0
+    end
+  end
+  # we want to access column 2 (Date of birth) using :birthdate
+  column :birthdate do
+    colref 2
+    # make sure the column is transformed into a Date
+    process do |x|
+      Date.parse(x)
+    end
+    # check birthdate is a date (check is invoked on the value returned
+    # by process)
+    check do |x|
+      x.class == Date
+    end
+  end
+#+END_EXAMPLE
+#+BEGIN_NOTES
+1. The column name can be anything Ruby can use as a key for a Hash,
+   such as, for instance, symbols, strings, and even object instances.
+2. =colref= can be a string (e.g., ='A'=) or an integer, with
+   1 and "A" being the first column.
+3. *You need to declare only the columns you want to import.* For
+   instance, we could skip the declaration for column 1, if 'Date of
+   Birth' is the only data we want to import
+4. If =process= and =check= are specified, then =check= will receive the
+   result of invoking =process= on the cell value. This makes sense if
+   process is used to make the cell value more accessible to ruby code
+   (e.g., transforming a string into an integer).
+#+END_NOTES
+If there are different columns that have to be read and processed in the same
+way, =columns= (notice the plural form) allows for a more compact
+representation:
+#+BEGIN_EXAMPLE ruby
+  columns({ a: 'A', b: 'B' })
+#+END_EXAMPLE
+is equivalent to:
+#+BEGIN_EXAMPLE ruby
+  column :a do
+    colref 'A'
+  end
+  column :b do
+    colref 'B'
+  end
+#+END_EXAMPLE
+=columns= accepts a code block, which can be used to add =process= and =check=
+declarations:
+#+BEGIN_EXAMPLE ruby
+  columns({ a: 'A', b: 'B' }) do
+    process do |cell|
+      ...
+    end
+  end
+#+END_EXAMPLE
+See [[file:examples/wikipedia_us_cities/us_cities_bulk_declare.rb][us_cities_bulk_declare.rb]] for an example of =columns=.
+#+BEGIN_NOTES
+  If you use code blocks, don't forget to put in parentheses the
+  column mapping, or the Ruby parser won't be able to distinguish the
+  hash from the code block.
+#+END_NOTES
+*** Add virtual columns
+Sometimes it is convenient to aggregate or otherwise manipulate the data
+read from each row, before doing the actual processing.
+For instance, we might have a table with dates of birth, while we are
+really interested in the age of people.
+In such cases, we can use a virtual column. A *virtual column* allows
+one to add a column to the data read, computed using the values of
+other cells in the same row.
+The following declaration adds an =age= column to each row of the data
+read from the previous example:
+#+BEGIN_EXAMPLE ruby
+  virtual_column :age do
+    process do |row|
+      # the function `compute_birthday` has to be defined
+      compute_birthday(row[:birthdate])
+    end
+  end
+#+END_EXAMPLE
+Virtual columns are, of course, available to the =mapping= directive
+(see below).
+*** Specify how to process each line
+The =mapping= directive specifies what to do with each line read.  The
+=mapping= declaration takes an arbitrary piece of ruby code, which can
+reference the fields using the column names we declared.
+For instance the following code gets the value of column =:name=, the
+value of column =:age= and prints them to standard output
+#+BEGIN_EXAMPLE ruby
+  mapping do |row|
+    puts "#{row[:name][:value]} is #{row[:age][:value]} years old"
+  end
+#+END_EXAMPLE
+The data read from each row of our input data is stored in a hash. The hash
+uses column names as the primary key and stores the values in the =:value=
+key.
+*** Process data
+If =mapping= does not work for your data processing activities (e.g., you need
+to perform computations that span several rows), you can add your own
+code after the call to =process=.
+A typical scenario works as follows:
+1. Instantiate the class: ~i = Reader.new~
+2. Use =i.read= or =i.load= (synonyms), to read all data.
+#+BEGIN_EXAMPLE ruby
+  i.read
+#+END_EXAMPLE
+3. Use =errors= to see whether any of the check functions failed:
+#+BEGIN_EXAMPLE ruby
+  array_of_hashes = i.errors
+  array_of_hashes.each do |error_hash|
+    puts error_hash
+  end
+#+END_EXAMPLE
+4. Use =virtual_columns= to generate the virtual columns:
+#+BEGIN_EXAMPLE ruby
+  i.virtual_columns
+#+END_EXAMPLE
+(Optionally: check again for errors.)
+5. Use the =process= function to execute the =mapping=
+directive on each line read from the file.
+#+BEGIN_EXAMPLE ruby
+  i.process
+#+END_EXAMPLE
+(Optionally: check again for errors.)
+6. Add your own code to process data. Use the =table= function to access data.
+Look in the examples directory for further details and a couple of
+working examples.
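+As an illustration of the last step, the sketch below computes a value that
+spans several rows (the average age) and skips rows with errors. It does not
+call Dreader: the =table= array is hand-written sample data that imitates the
+row structure documented in this README, so the names and values are
+assumptions.

```ruby
# Hand-written stand-in for the array returned by the table method:
# one hash per row, with a :value per column plus :row_errors and :row_number.
table = [
  { name: { value: "John" }, age: { value: 30 }, row_errors: [], row_number: 2 },
  { name: { value: "Jane" }, age: { value: 31 }, row_errors: [], row_number: 3 },
  { name: { value: nil }, age: { value: 0 }, row_errors: ["name is missing"], row_number: 4 }
]

# Keep only the rows for which no check failed.
valid_rows = table.select { |row| row[:row_errors].empty? }

# A computation that needs all rows at once, so it cannot live in mapping.
average_age = valid_rows.sum { |row| row[:age][:value] } / valid_rows.size.to_f

puts "Average age over #{valid_rows.size} valid rows: #{average_age}"
# prints "Average age over 2 valid rows: 30.5"
```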
+*** Managing Errors
+**** Finding errors in input data
+Dreader collects errors in two specific ways:
+1. In each column specification, using =check_raw= and =check=.  This allows
+   checking each field for errors (e.g., a =nil= value in a cell)
+2. In virtual columns, using =check_raw= and =check=.  This allows performing
+   more complex checks by putting together all the values read from a row
+   (e.g., =to_date= occurs before =from_date=)
+The following, for instance, checks that name or surname have a valid value:
+#+begin_example ruby
+virtual_column :global_check do
+  doc "Name or Surname must exist"
+  check :name_or_surname_must_be_defined do |row|
+    !(row[:name][:value].nil? && row[:surname][:value].nil?)
+  end
+end
+#+end_example
+If you prefer, you can also define a virtual column that contains the value of
+the check:
+#+begin_example ruby
+virtual_column :name_or_surname_exist do
+  doc "Name or Surname must exist"
+  process do |row|
+    row[:name][:value] || row[:surname][:value]
+  end
+end
+#+end_example
+You can then act in the mapping directive according to value returned by the
+virtual column:
+#+begin_example ruby
+mapping do |row|
+  if row[:name_or_surname_exist][:value]
+    [...]
+  end
+end
+#+end_example
+**** Managing Errors
+You can check for errors in two different ways.
+The first is in the =mapping= directive, where you can check whether some checks
+for the =row= failed, by:
+1. checking the =:error= boolean key associated to each column, that is:
+   =row[<column_name>][:error]=
+2. looking at the value of the =:row_errors= key, which contains all error messages
+   generated for the row:
+   =row[:row_errors]=
+The second is after the processing, by using the method =errors=, which lists
+all the errors.
+The utility function =Dreader::Util.errors= takes as input the errors generated by
+Dreader and extracts those of a specific row and, optionally, column:
+#+begin_example ruby
+  # get all the errors at line 2
+  Dreader::Util.errors i.errors, 2
+  # get all the errors at line 2, column 'C'
+  Dreader::Util.errors i.errors, 2, 3
+#+end_example
+* Generating a Template from the specification
+From version 0.6.0 =dreader= can generate a template starting from the
+specification.
+The template is generated by the following call:
+#+begin_example ruby
+generate_template template_filename: "template.xlsx"
+#+end_example
+(The =template_filename= directive can also be specified in the =options=
+section).
+The template contains the following rows:
+- The first row contains the names of the columns, as specified in the
+  =columns= declarations and made into a human readable form.
+- The second row contains the doc strings of the columns, if set.
+- The remaining rows contain the example records added with the
+  =example= directive
+The position of the first row is determined by the value of =first_row=, that
+is, if =first_row= is 2 (content starts from the second row), the header row
+is put in row 1.
+Only Excel is supported, at the moment.
+An example of template generation can be found in the Examples.
+** Digging deeper
+If you need to perform elaborations which cannot be performed row by
+row you can access all data, with the =table= method:
+#+BEGIN_EXAMPLE ruby
+  i.read
+  i.table
+#+END_EXAMPLE
+The function =i.table= returns an array of Hashes.  Each element of
+the array is a row of the input file.  Each element/row has the
+following structure:
+#+BEGIN_EXAMPLE ruby
+  {
+    col_name1: { <info about col_name_1 in row_j> },
+    [...]
+    col_nameN: { <info about col_name_N in row_j> },
+    row_errors: [ <errors associated to row> ],
+    row_number: <row number>
+  }
+#+END_EXAMPLE
+where =col_name1=, ..., =col_nameN= are the names you have assigned to
+the columns and the information stored for each cell is the
+following:
+#+BEGIN_EXAMPLE ruby
+  {
+    value: ...,      # the result of calling process on the cell
+    row_number: ..., # the row number
+    col_number: ..., # the column number
+    error: ...       # the result of calling check on the cell processed value
+  }
+#+END_EXAMPLE
+(Note that virtual columns only store =value= and a Boolean =virtual=,
+which is always =true=.)
+Thus, for instance, =i.table= returns something like:
+#+BEGIN_EXAMPLE ruby
+  i.table
+  [
+    {
+      name: { value: "John", row_number: 1, col_number: 1, errors: nil },
+      age:  { value: 30, row_number: 1, col_number: 2, errors: nil }
+    },
+    {
+      name: { value: "Jane", row_number: 2, col_number: 1, errors: nil },
+      age:  { value: 31, row_number: 2, col_number: 2, errors: nil }
+    }
+  ]
+#+END_EXAMPLE
+* Simplifying the hash with the data read
+The =Dreader::Util= class provides some functions to simplify the
+hashes built by =dreader=.  This is useful to simplify the code you
+write and to generate hashes you can pass, for instance, to
+ActiveRecord creators.
+** Simplify removes everything but the values
+=Dreader::Util.simplify hash= removes all information but the value,
+making the value accessible directly from the name of the column.
+#+BEGIN_EXAMPLE ruby
+  i.table[0]
+  { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
+    age:  { value: 30, row_number: 1, col_number: 2, errors: nil } }
+  Dreader::Util.simplify i.table[0]
+  { name: "John", age: 30 }
+#+END_EXAMPLE
+*As an additional bonus, it removes the keys =row_number= and =row_errors=,
+which are not part of the data read, in the first place.*
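+The transformation can be pictured with a few lines of plain Ruby. This is a
+sketch of the behavior described above, not the library's implementation;
+=simplify_sketch= and =METADATA_KEYS= are hypothetical names, and the sample
+row is hand-written.

```ruby
# Keys that describe the row rather than the data read from it.
METADATA_KEYS = %i[row_number row_errors].freeze

# Hypothetical stand-in for the behavior described for simplify:
# drop the row-level keys and keep only the :value of each column.
def simplify_sketch(row)
  row.reject { |key, _| METADATA_KEYS.include?(key) }
     .transform_values { |cell| cell[:value] }
end

row = {
  name: { value: "John", row_number: 1, col_number: 1 },
  age: { value: 30, row_number: 1, col_number: 2 },
  row_errors: [],
  row_number: 1
}

p simplify_sketch(row) # a hash with only :name => "John" and :age => 30
```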
+** Slice and Clean select columns
+=Dreader::Util.slice hash, keys= and =Dreader::Util.clean hash, keys=,
+where =keys= is an array of keys, are respectively used to select or
+remove some keys from the hash returned by Dreader.  (Notice that the
+Ruby Hash class already provides similar methods.)
+#+BEGIN_EXAMPLE ruby
+  i.table[0]
+  { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
+    age:  { value: 30, row_number: 1, col_number: 2, errors: nil }}
+  Dreader::Util.slice i.table[0], [:name]
+  { name: { value: "John", row_number: 1, col_number: 1, errors: nil } }
+  Dreader::Util.clean i.table[0], [:name]
+  { age:  { value: 30, row_number: 1, col_number: 2, errors: nil } }
+#+END_EXAMPLE
+The methods =slice= and =clean= are more useful when used in
+conjunction with =simplify=:
+#+BEGIN_EXAMPLE ruby
+  hash = Dreader::Util.simplify i.table[0]
+  { name: "John", age: 30 }
+  Dreader::Util.slice hash, [:age]
+  { age: 30 }
+  Dreader::Util.clean hash, [:age]
+  { name: "John" }
+#+END_EXAMPLE
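+For comparison, the similar methods of the Ruby Hash class mentioned above
+can be used as follows. Note that =Hash#slice= takes the keys as separate
+arguments rather than as an array; =Hash#except= (Ruby 3.0 and later) is the
+built-in counterpart for removing keys, and =reject= is used here so the
+sketch also runs on older versions.

```ruby
hash = { name: "John", age: 30 }

# Keep only the listed keys (keys are passed as separate arguments).
selected = hash.slice(:age)

# Remove the listed keys; on Ruby 3.0+ hash.except(:age) does the same.
remaining = hash.reject { |key, _| key == :age }

p selected  # only the :age entry
p remaining # only the :name entry
```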
+The output produced by =slice= and =simplify= is a hash which can be used to
+create an =ActiveRecord= object.
+** Better Integration with ActiveRecord
+Finally, the =Dreader::Util.restructure= method helps build hashes
+to create [[http://api.rubyonrails.org/classes/ActiveModel/Model.html][ActiveModel]] objects with nested attributes:
+#+BEGIN_EXAMPLE ruby
+  hash = {name: "John", surname: "Doe", address: "Unknown", city: "NY" }
+  Dreader::Util.restructure hash, [:name, :surname], :address_attributes, [:address, :city]
+  {name: "John", surname: "Doe", address_attributes: {address: "Unknown", city: "NY"}}
+#+END_EXAMPLE
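+The reshaping performed in the example can be reproduced in plain Ruby. The
+sketch below is not the library's implementation: =restructure_sketch= is a
+hypothetical function whose argument order simply mirrors the call shown
+above.

```ruby
# Hypothetical stand-in: keep top_keys at the top level and move nested_keys
# into a sub-hash stored under nested_key.
def restructure_sketch(hash, top_keys, nested_key, nested_keys)
  hash.slice(*top_keys).merge(nested_key => hash.slice(*nested_keys))
end

hash = { name: "John", surname: "Doe", address: "Unknown", city: "NY" }
nested = restructure_sketch(hash, [:name, :surname], :address_attributes, [:address, :city])

p nested # :address and :city now live under :address_attributes
```

+A hash shaped like this is what a model declaring nested attributes for
+=address= would typically expect.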
+* Debugging your specification
+The =debug= function prints the current configuration, reads some
+records from the input file(s), and shows the records read:
+#+BEGIN_EXAMPLE ruby
+  i.debug
+  i.debug n: 40 # read 40 lines (from first_row)
+  i.debug n: 40, filename: filepath # like above, but read from filepath
+#+END_EXAMPLE
+By default =debug= invokes the =check_raw=, =process=, and =check=
+directives. Pass the following options, if you want to disable this behavior;
+this might be useful, for instance, if you intend to check only what data is
+read:
+#+BEGIN_EXAMPLE ruby
+  i.debug process: false, check: false
+#+END_EXAMPLE
+Notice that =check= implies =process=, since =check= is invoked on the
+output of the =process= directive.
+If you prefer, as an alternative to =debug= you can also use configuration
+variables (but then you need to change the configuration according to the
+environment):
+#+begin_example ruby
+  i.options do
+    debug true
+  end
+#+end_example
+* Changelog
+See [[file:CHANGELOG.ORG][CHANGELOG]].
+* Known Limitations
+At the moment:
+- it is not possible to specify column references using header names
+  (like Roo does).
+- it is not possible to pass options to the file readers. As a
+  consequence tab-separated files must have the =.tsv= extension or
+  they will not be parsed correctly
+- some more testing wouldn't hurt.
+* Known Bugs
+Some known bugs and an unknown number of unknown bugs.
+(See the open issues for the known bugs.)
+* Development
+After checking out the repo, run =bin/setup= to install dependencies.
+You can also run =bin/console= for an interactive prompt that will
+allow you to experiment.
+To install this gem onto your local machine, run =bundle exec rake
+install=. To release a new version, update the version number in
+=version.rb=, and then run =bundle exec rake release=, which will
+create a git tag for the version, push git commits and tags, and push
+the =.gem= file to [[https://rubygems.org][rubygems.org]].
+* Contributing
+Bug reports and pull requests are welcome.
+You need to get in touch with me by email, until I figure out how to enable
+them in Gitea.
+* License
+[[https://opensource.org/licenses/MIT][MIT License]].