smarter_csv 1.9.3 → 1.10.0 (RubyGems)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

checksums.yaml +4 -4
data/CHANGELOG.md +18 -0
data/README.md +28 -7
data/lib/smarter_csv/hash_transformations.rb +91 -0
data/lib/smarter_csv/header_transformations.rb +63 -0
data/lib/smarter_csv/header_validations.rb +34 -0
data/lib/smarter_csv/headers.rb +6 -98
data/lib/smarter_csv/options_processing.rb +10 -1
data/lib/smarter_csv/smarter_csv.rb +68 -92
data/lib/smarter_csv/variables.rb +5 -1
data/lib/smarter_csv/version.rb +1 -1
data/lib/smarter_csv.rb +8 -0
metadata +6 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 5f35e10ff8bc0e79ff1ed9bea8e413f746f51128a6f6a9622d246873fd588366
-  data.tar.gz: 5cc30cf6f4422dd16f3019915bc5305a92aaaa4b99665e4c4c525d3bbf489cfd
+  metadata.gz: f1d0b58acf0135b621e3182470674230ef73b48c829810e74fffa975fc318cf5
+  data.tar.gz: ee404c5c485748d35cda36b8d249cb6813a3f80005182fe8c05feac1694aba57
 SHA512:
-  metadata.gz: 057472a73ae0be95318b16428b276ecffba384a68479af715c5ec3ca7601405ca73928b0fbf245c9b3f46fd33b82a8c6d9c9e6330ddb0305b83ae23f58173df0
-  data.tar.gz: 319b12a53875c1963eed6d27aa67850135d33a5b3a9f70607e6d812906733b711ade6c3ee6e789d78c2e159004a879e59e700145224134745b16d279039ac38a
+  metadata.gz: 4fee097fe2237f863510100155062da6815237260da5b15189f104f54596f7d5ff0479deb80596544e0bb1b9ba7b78126d2251798721e8d2f91e06b430950cd6
+  data.tar.gz: c30562965452ef296b5e5aaf2a9a12887aa42d8e8396780b73b34f99a2386d232bf020578618fcbd65186fc864518c81a3e7555cae9b00a005322f3599e18c5a

data/CHANGELOG.md CHANGED

@@ -1,6 +1,24 @@
 # SmarterCSV 1.x Change Log
+## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡
+  * BREAKING CHANGES:
+    Changed behavior:
+     + when `user_provided_headers` are provided:
+       * if they are not unique, an exception will now be raised
+       * they are taken "as is", no header transformations can be applied
+       * when they are given as strings or as symbols, it is assumed that this is the desired format
+       * the value of the `strings_as_keys` options will be ignored
+     + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
+       * this allows automatic disambiguation when processing of CSV files with duplicate headers, by appending a number
+       * explicitly set this option to `nil` to get the behavior from previous versions.
+  * performance and memory improvements
+  * code refactor
 ## 1.9.3 (2023-12-16)
   * raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
   * code refactor / no functional changes
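Note on the `duplicate_header_suffix` change: the following is a minimal standalone sketch of the behavior the changelog describes, not the gem's own code path. `disambiguate` mirrors the counting logic of `disambiguate_headers` as it appears in this diff; `effective_headers` and the use of `ArgumentError` are hypothetical stand-ins for the gem's pipeline and its `SmarterCSV::DuplicateHeaders` error.

```ruby
# Standalone sketch; does not require the smarter_csv gem.
# Mirrors the counting logic of `disambiguate_headers` shown in this diff.
def disambiguate(headers, suffix)
  counts = Hash.new(0)
  headers.map do |header|
    counts[header] += 1
    counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
  end
end

# Hypothetical helper: disambiguate only when a suffix is set, then reject
# any duplicates that remain. ArgumentError stands in for the gem's error.
def effective_headers(headers, suffix)
  headers = disambiguate(headers, suffix) if suffix # '' is truthy in Ruby, nil is not
  dups = headers.select { |h| headers.count(h) > 1 }.uniq
  raise ArgumentError, "duplicate headers: #{dups.join(', ')}" unless dups.empty?
  headers
end

new_default = effective_headers(%w[name email name], '')  # ["name", "email", "name2"]
custom      = effective_headers(%w[name email name], '_') # ["name", "email", "name_2"]
# effective_headers(%w[name email name], nil) raises, as with the pre-1.10.0 default
```

Because an empty string is truthy in Ruby, the new default `''` enables disambiguation, while `nil` skips it and lets the duplicate check fail.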

data/README.md CHANGED

@@ -2,15 +2,33 @@
 # SmarterCSV
  [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
+#### LATEST CHANGES
+* Version 1.10.0 has BREAKING CHANGES:
+    Changed behavior:
+     + when `user_provided_headers` are provided:
+       * if they are not unique, an exception will now be raised
+       * they are taken "as is", no header transformations can be applied
+       * when they are given as strings or as symbols, it is assumed that this is the desired format
+       * the value of the `strings_as_keys` options will be ignored
+     + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
+       * this allows automatic disambiguation when processing of CSV files with duplicate headers, by appending a number
+       * explicitly set this option to `nil` to get the behavior from previous versions.
 #### Development Branches
 * default branch is `main` for 1.x development
-* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
+* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
+  - This is an EXPERIMENTAL branch - DO NOT USE in production
-#### Work towards Future Version 2.0
+#### Work towards Future Version 2.x
-* Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
+* Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
   Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
 ---------------
@@ -84,6 +102,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 00000040  73 2c 35 36 37 38 0d 0a                           |s,5678..|
 ```
+### Articles
+* [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
+* [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
 ### Examples
 Here are some examples to demonstrate the versatility of SmarterCSV.
@@ -243,8 +265,6 @@ NOTE: If you use `key_mappings` and `value_converters`, make sure that the value
     data[0][:price].class
       => Float
 ```
-## Parallel Processing
-[Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
 ## Documentation
@@ -280,7 +300,8 @@ The options and the block are optional.
      | :headers_in_file            |   true   | Whether or not the file contains headers as the first line.                          |
      |                             |          | Important if the file does not contain headers,                                      |
      |                             |          | otherwise you would lose the first line of data.                                     |
-     | :duplicate_header_suffix    |   nil    | If set, adds numbers to duplicated headers and separates them by the given suffix    |
+     | :duplicate_header_suffix    |   ''     | Adds numbers to duplicated headers and separates them by the given suffix.           |
+     |                             |          | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior)        |
      | :user_provided_headers      |   nil    | *careful with that axe!*                                                             |
      |                             |          | user provided Array of header strings or symbols, to define                          |
      |                             |          | what headers should be used, overriding any in-file headers.                         |

data/lib/smarter_csv/hash_transformations.rb ADDED

@@ -0,0 +1,91 @@
+# frozen_string_literal: true
+module SmarterCSV
+  class << self
+    def hash_transformations(hash, options)
+      # there may be unmapped keys, or keys purposedly mapped to nil or an empty key..
+      # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
+      remove_empty_values = options[:remove_empty_values] == true
+      remove_zero_values = options[:remove_zero_values]
+      remove_values_matching = options[:remove_values_matching]
+      convert_to_numeric = options[:convert_values_to_numeric]
+      value_converters = options[:value_converters]
+      hash.each_with_object({}) do |(k, v), new_hash|
+        next if k.nil? || k == '' || k == :""
+        next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
+        next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
+        next if remove_values_matching && v =~ remove_values_matching
+        # deal with the :only / :except options to :convert_values_to_numeric
+        if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
+          if v =~ /^[+-]?\d+\.\d+$/
+            v = v.to_f
+          elsif v =~ /^[+-]?\d+$/
+            v = v.to_i
+          end
+        end
+        converter = value_converters[k] if value_converters
+        v = converter.convert(v) if converter
+        new_hash[k] = v
+      end
+    end
+    # def hash_transformations(hash, options)
+    #   # there may be unmapped keys, or keys purposedly mapped to nil or an empty key..
+    #   # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
+    #   hash.delete(nil)
+    #   hash.delete('')
+    #   hash.delete(:"")
+    #   if options[:remove_empty_values] == true
+    #     hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
+    #   end
+    #   hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
+    #   hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
+    #   if options[:convert_values_to_numeric]
+    #     hash.each do |k, v|
+    #       # deal with the :only / :except options to :convert_values_to_numeric
+    #       next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
+    #       # convert if it's a numeric value:
+    #       case v
+    #       when /^[+-]?\d+\.\d+$/
+    #         hash[k] = v.to_f
+    #       when /^[+-]?\d+$/
+    #         hash[k] = v.to_i
+    #       end
+    #     end
+    #   end
+    #   if options[:value_converters]
+    #     hash.each do |k, v|
+    #       converter = options[:value_converters][k]
+    #       next unless converter
+    #       hash[k] = converter.convert(v)
+    #     end
+    #   end
+    #   hash
+    # end
+    protected
+    # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
+    def limit_execution_for_only_or_except(options, option_name, key)
+      if options[option_name].is_a?(Hash)
+        if options[option_name].has_key?(:except)
+          return true if Array(options[option_name][:except]).include?(key)
+        elsif options[option_name].has_key?(:only)
+          return true unless Array(options[option_name][:only]).include?(key)
+        end
+      end
+      false
+    end
+  end
+end
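Note on the `:only` / `:except` guard: a minimal standalone sketch of how that guard interacts with numeric conversion in the single-pass loop above. `skip_key?` and `to_numeric` are hypothetical names; the logic follows `limit_execution_for_only_or_except` and the two numeric regexes in this file, and the sample row is invented for illustration.

```ruby
# Standalone sketch; does not require the smarter_csv gem.
# Returns true when conversion should be skipped for this key.
def skip_key?(option_value, key)
  return false unless option_value.is_a?(Hash)

  if option_value.key?(:except)
    Array(option_value[:except]).include?(key)
  elsif option_value.key?(:only)
    !Array(option_value[:only]).include?(key)
  else
    false
  end
end

# Same patterns as in hash_transformations: floats first, then integers.
def to_numeric(value)
  case value
  when /^[+-]?\d+\.\d+$/ then value.to_f
  when /^[+-]?\d+$/      then value.to_i
  else value
  end
end

row = { zip: '00501', price: '9.99', qty: '3' }
opt = { except: :zip } # keep the ZIP code as a String

converted = row.each_with_object({}) do |(k, v), out|
  out[k] = skip_key?(opt, k) ? v : to_numeric(v)
end
# converted == { zip: '00501', price: 9.99, qty: 3 }
```

Without the `except:` entry, `'00501'` would match the integer pattern and lose its leading zeros.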

data/lib/smarter_csv/header_transformations.rb ADDED

@@ -0,0 +1,63 @@
+# frozen_string_literal: true
+module SmarterCSV
+  class << self
+    # transform the headers that were in the file:
+    def header_transformations(header_array, options)
+      header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
+      header_array.map!{|x| x.strip} if options[:strip_whitespace]
+      unless options[:keep_original_headers]
+        header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
+        header_array.map!{|x| x.downcase} if options[:downcase_header]
+      end
+      # detect duplicate headers and disambiguate
+      header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
+      # symbolize headers
+      header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
+      # doesn't make sense to re-map when we have user_provided_headers
+      header_array = remap_headers(header_array, options) if options[:key_mapping]
+      header_array
+    end
+    def disambiguate_headers(headers, options)
+      counts = Hash.new(0)
+      headers.map do |header|
+        counts[header] += 1
+        counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
+      end
+    end
+    # do some key mapping on the keys in the file header
+    # if you want to completely delete a key, then map it to nil or to ''
+    def remap_headers(headers, options)
+      key_mapping = options[:key_mapping]
+      if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
+        raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
+      end
+      key_mapping = options[:key_mapping]
+      # if silence_missing_keys are not set, raise error if missing header
+      missing_keys = key_mapping.keys - headers
+      # if the user passes a list of speciffic mapped keys that are optional
+      missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
+      unless missing_keys.empty? || options[:silence_missing_keys] == true
+        raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
+      end
+      headers.map! do |header|
+        if key_mapping.has_key?(header)
+          key_mapping[header].nil? ? nil : key_mapping[header]
+        elsif options[:remove_unmapped_keys]
+          nil
+        else
+          header
+        end
+      end
+      headers
+    end
+  end
+end

data/lib/smarter_csv/header_validations.rb ADDED

@@ -0,0 +1,34 @@
+# frozen_string_literal: true
+module SmarterCSV
+  class << self
+    def header_validations(headers, options)
+      check_duplicate_headers(headers, options)
+      check_required_headers(headers, options)
+    end
+    def check_duplicate_headers(headers, _options)
+      header_counts = Hash.new(0)
+      headers.each { |header| header_counts[header] += 1 unless header.nil? }
+      duplicates = header_counts.select { |_, count| count > 1 }
+      unless duplicates.empty?
+        raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
+      end
+    end
+    require 'set'
+    def check_required_headers(headers, options)
+      if options[:required_keys] && options[:required_keys].is_a?(Array)
+        headers_set = headers.to_set
+        missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }
+        unless missing_keys.empty?
+          raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
+        end
+      end
+    end
+  end
+end
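Note on `check_required_headers`: a minimal standalone sketch of the Set-based lookup used above. `missing_required` is a hypothetical helper and the header values are invented. It assumes headers have already been symbolized, which this diff does unless `strings_as_keys` or `keep_original_headers` is set.

```ruby
# Standalone sketch; does not require the smarter_csv gem.
require 'set'

# Returns the required keys that are absent from the effective headers.
def missing_required(headers, required_keys)
  headers_set = headers.to_set # one Set build, then constant-time lookups
  required_keys.reject { |k| headers_set.include?(k) }
end

missing        = missing_required(%i[name email], %i[name phone]) # [:phone]
type_mismatch  = missing_required(%i[name email], %w[name])       # ["name"]
```

The second call shows why the key type matters: a String key never matches a Symbol header, so `required_keys` should use the same type as the effective headers.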

data/lib/smarter_csv/headers.rb CHANGED

@@ -14,7 +14,11 @@ module SmarterCSV
         # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
         header_line = @raw_header = readline_with_counts(filehandle, options)
         header_line = preprocess_header_line(header_line, options)
-        file_header_array, file_header_size = parse_and_modify_headers(header_line, options)
+        file_header_array, file_header_size = parse(header_line, options)
+        file_header_array = header_transformations(file_header_array, options)
       else
         unless options[:user_provided_headers]
           raise SmarterCSV::IncorrectOption, "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers"
@@ -36,22 +40,12 @@ module SmarterCSV
             # we could print out the mapping of file_header_array to header_array here
           end
         end
         header_array = user_header_array
       else
         header_array = file_header_array
       end
-      # detect duplicate headers and disambiguate
-      header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
-      # symbolize headers
-      header_array.map!{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
-      # wouldn't make sense to re-map user provided headers
-      header_array = remap_headers(header_array, options) if options[:key_mapping] && !options[:user_provided_headers]
-      validate_and_deprecate_headers(header_array, options)
       [header_array, header_array.size]
     end
@@ -65,92 +59,6 @@ module SmarterCSV
       header_line
     end
-    def parse_and_modify_headers(header_line, options)
-      file_header_array, file_header_size = parse(header_line, options)
-      file_header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
-      file_header_array.map!{|x| x.strip} if options[:strip_whitespace]
-      unless options[:keep_original_headers]
-        file_header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
-        file_header_array.map!{|x| x.downcase} if options[:downcase_header]
-      end
-      [file_header_array, file_header_size]
-    end
-    def disambiguate_headers(headers, options)
-      counts = Hash.new(0)
-      headers.map do |header|
-        counts[header] += 1
-        counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
-      end
-    end
-    # do some key mapping on the keys in the file header
-    # if you want to completely delete a key, then map it to nil or to ''
-    def remap_headers(headers, options)
-      key_mapping = options[:key_mapping]
-      if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
-        raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
-      end
-      key_mapping = options[:key_mapping]
-      # if silence_missing_keys are not set, raise error if missing header
-      missing_keys = key_mapping.keys - headers
-      # if the user passes a list of speciffic mapped keys that are optional
-      missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
-      unless missing_keys.empty? || options[:silence_missing_keys] == true
-        raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
-      end
-      headers.map! do |header|
-        if key_mapping.has_key?(header)
-          key_mapping[header].nil? ? nil : key_mapping[header]
-        elsif options[:remove_unmapped_keys]
-          nil
-        else
-          header
-        end
-      end
-      headers
-    end
-    # header_validations
-    def validate_and_deprecate_headers(headers, options)
-      duplicate_headers = []
-      headers.compact.each do |k|
-        duplicate_headers << k if headers.select{|x| x == k}.size > 1
-      end
-      unless options[:user_provided_headers] || duplicate_headers.empty?
-        raise SmarterCSV::DuplicateHeaders, "ERROR: duplicate headers: #{duplicate_headers.join(',')}"
-      end
-      # deprecate required_headers
-      unless options[:required_headers].nil?
-        puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
-        if options[:required_keys].nil?
-          options[:required_keys] = options[:required_headers]
-          options[:required_headers] = nil
-        end
-      end
-      if options[:required_keys] && options[:required_keys].is_a?(Array)
-        missing_keys = []
-        options[:required_keys].each do |k|
-          missing_keys << k unless headers.include?(k)
-        end
-        raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}" unless missing_keys.empty?
-      end
-    end
-    def enforce_utf8_encoding(header, options)
-      return header unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
-      header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
-    end
     def remove_comments_from_header(header, options)
       return header unless options[:comment_regexp]

data/lib/smarter_csv/options_processing.rb CHANGED

@@ -9,7 +9,7 @@ module SmarterCSV
     comment_regexp: nil, # was: /\A#/,
     convert_values_to_numeric: true,
     downcase_header: true,
-    duplicate_header_suffix: nil,
+    duplicate_header_suffix: '', # was: nil,
     file_encoding: 'utf-8',
     force_simple_split: false,
     force_utf8: false,
@@ -62,6 +62,15 @@ module SmarterCSV
     private
     def validate_options!(options)
+      # deprecate required_headers
+      unless options[:required_headers].nil?
+        puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
+        if options[:required_keys].nil?
+          options[:required_keys] = options[:required_headers]
+          options[:required_headers] = nil
+        end
+      end
       keys = options.keys
       errors = []
       errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])

data/lib/smarter_csv/smarter_csv.rb CHANGED

@@ -12,28 +12,34 @@ module SmarterCSV
   # first parameter: filename or input object which responds to readline method
   def SmarterCSV.process(input, given_options = {}, &block) # rubocop:disable Lint/UnusedMethodArgument
+    initialize_variables
     options = process_options(given_options)
-    initialize_variables
+    @enforce_utf8 = options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+    @verbose = options[:verbose]
-    has_rails = !!defined?(Rails)
     begin
       fh = input.respond_to?(:readline) ? input : File.open(input, "r:#{options[:file_encoding]}")
+      if @enforce_utf8 && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
+        puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
+      end
       # auto-detect the row separator
       options[:row_sep] = guess_line_ending(fh, options) if options[:row_sep]&.to_sym == :auto
       # attempt to auto-detect column separator
       options[:col_sep] = guess_column_separator(fh, options) if options[:col_sep]&.to_sym == :auto
-      if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
-        puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
-      end
       skip_lines(fh, options)
       @headers, header_size = process_headers(fh, options)
       @headerA = @headers # @headerA is deprecated, use @headers
+      puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
+      header_validations(@headers, options)
       # in case we use chunking.. we'll need to set it up..
       if options[:chunk_size].to_i > 0
         use_chunks = true
@@ -45,31 +51,42 @@ module SmarterCSV
       end
       # now on to processing all the rest of the lines in the CSV file:
+      # fh.each_line |line|
       until fh.eof? # we can't use fh.readlines() here, because this would read the whole file into memory at once, and eof => true
         line = readline_with_counts(fh, options)
         # replace invalid byte sequence in UTF-8 with question mark to avoid errors
-        line = line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+        line = enforce_utf8_encoding(line, options) if @enforce_utf8
-        print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if options[:verbose]
+        print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if @verbose
         next if options[:comment_regexp] && line =~ options[:comment_regexp] # ignore all comment lines if there are any
         # cater for the quoted csv data containing the row separator carriage return character
         # in which case the row data will be split across multiple lines (see the sample content in spec/fixtures/carriage_returns_rn.csv)
         # by detecting the existence of an uneven number of quote characters
+        multiline = count_quote_chars(line, options[:quote_char]).odd?
-        multiline = count_quote_chars(line, options[:quote_char]).odd? # should handle quote_char nil
-        while count_quote_chars(line, options[:quote_char]).odd? # should handle quote_char nil
+        while multiline
           next_line = fh.readline(options[:row_sep])
-          next_line = next_line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+          next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
           line += next_line
           @file_line_count += 1
+          break if fh.eof? # Exit loop if end of file is reached
+          multiline = count_quote_chars(line, options[:quote_char]).odd?
         end
-        print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count if options[:verbose] && multiline
+        # :nocov:
+        if multiline && @verbose
+          print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count
+        end
+        # :nocov:
         line.chomp!(options[:row_sep])
+        # --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
         dataA, _data_size = parse(line, options, header_size)
         dataA.map!{|x| x.strip} if options[:strip_whitespace]
@@ -77,48 +94,25 @@ module SmarterCSV
         # if all values are blank, then ignore this line
         next if options[:remove_empty_hashes] && (dataA.empty? || blank?(dataA))
+        # --- HASH TRANSFORMATIONS ------------------------------------------------------------
         hash = @headers.zip(dataA).to_h
-        # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
-        hash.delete(nil)
-        hash.delete('')
-        hash.delete(:"")
-        if options[:remove_empty_values] == true
-          hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
-        end
-        hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
-        hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
-        if options[:convert_values_to_numeric]
-          hash.each do |k, v|
-            # deal with the :only / :except options to :convert_values_to_numeric
-            next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
+        hash = hash_transformations(hash, options)
-            # convert if it's a numeric value:
-            case v
-            when /^[+-]?\d+\.\d+$/
-              hash[k] = v.to_f
-            when /^[+-]?\d+$/
-              hash[k] = v.to_i
-            end
-          end
-        end
-        if options[:value_converters]
-          hash.each do |k, v|
-            converter = options[:value_converters][k]
-            next unless converter
-            hash[k] = converter.convert(v)
-          end
-        end
+        # --- HASH VALIDATIONS ----------------------------------------------------------------
+        # will go here, and be able to:
+        #  - validate correct format of the values for fields
+        #  - required fields to be non-empty
+        #  - ...
+        # -------------------------------------------------------------------------------------
         next if options[:remove_empty_hashes] && hash.empty?
+        puts "CSV Line #{@file_line_count}: #{pp(hash)}" if @verbose == '2' # very verbose setting
+        # optional adding of csv_line_number to the hash to help debugging
         hash[:csv_line_number] = @csv_line_count if options[:with_line_numbers]
+        # process the chunks or the resulting hash
         if use_chunks
           chunk << hash # append temp result to chunk
@@ -127,16 +121,13 @@ module SmarterCSV
             if block_given?
               yield chunk # do something with the hashes in the chunk in the block
             else
-              @result << chunk # not sure yet, why anybody would want to do this without a block
+              @result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
             end
             @chunk_count += 1
-            chunk = [] # initialize for next chunk of data
+            chunk.clear # re-initialize for next chunk of data
           else
-            # the last chunk may contain partial data, which also needs to be returned (BUG / ISSUE-18)
+            # the last chunk may contain partial data, which is handled below
           end
           # while a chunk is being filled up we don't need to do anything else here
         else # no chunk handling
@@ -149,15 +140,15 @@ module SmarterCSV
       end
       # print new line to retain last processing line message
-      print "\n" if options[:verbose]
+      print "\n" if @verbose
-      # last chunk:
+      # handling of last chunk:
       if !chunk.nil? && chunk.size > 0
         # do something with the chunk
         if block_given?
           yield chunk # do something with the hashes in the chunk in the block
         else
-          @result << chunk # not sure yet, why anybody would want to do this without a block
+          @result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
         end
         @chunk_count += 1
         # chunk = [] # initialize for next chunk of data
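Note on `chunk.dup` with `chunk.clear`: a minimal standalone sketch of why the copy is needed once the chunk array is reused rather than replaced. The variable names and values are illustrative only.

```ruby
# Without dup: the result holds a reference to the same Array that gets cleared.
shared = []
chunk = [1, 2]
shared << chunk
chunk.clear
# shared == [[]]

# With dup: the result keeps its own shallow copy of the chunk.
copied = []
chunk = [1, 2]
copied << chunk.dup
chunk.clear
# copied == [[1, 2]]
```

`dup` makes a shallow copy, which is enough here because only the outer array is cleared; the row hashes themselves are not modified afterwards.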
@@ -174,16 +165,22 @@ module SmarterCSV
   end
   class << self
-    # * the `scan` method iterates through the string and finds all occurrences of the pattern
-    # * The reqular expression:
-    #   - (?<!\\) : Negative lookbehind to ensure the quote character is not preceded by an unescaped backslash.
-    #   - (?:\\\\)* : Non-capturing group for an even number of backslashes (escaped backslashes).
-    #                 This allows for any number of escaped backslashes before the quote character.
-    #   - #{Regexp.escape(quote_char)} : Dynamically inserts the quote_char into the regex,
-    #                                    ensuring it's properly escaped for use in the regex.
-    #
     def count_quote_chars(line, quote_char)
-      line.scan(/(?<!\\)(?:\\\\)*#{Regexp.escape(quote_char)}/).count
+      return 0 if line.nil? || quote_char.nil? || quote_char.empty?
+      count = 0
+      escaped = false
+      line.each_char do |char|
+        if char == '\\' && !escaped
+          escaped = true
+        else
+          count += 1 if char == quote_char && !escaped
+          escaped = false
+        end
+      end
+      count
     end
     def has_acceleration?
@@ -192,18 +189,6 @@ module SmarterCSV
     protected
-    # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
-    def limit_execution_for_only_or_except(options, option_name, key)
-      if options[option_name].is_a?(Hash)
-        if options[option_name].has_key?(:except)
-          return true if Array(options[option_name][:except]).include?(key)
-        elsif options[option_name].has_key?(:only)
-          return true unless Array(options[option_name][:only]).include?(key)
-        end
-      end
-      false
-    end
     # SEE: https://github.com/rails/rails/blob/32015b6f369adc839c4f0955f2d9dce50c0b6123/activesupport/lib/active_support/core_ext/object/blank.rb#L121
     # and in the future we might also include UTF-8 space characters: https://www.compart.com/en/unicode/category/Zs
     BLANK_RE = /\A\s*\z/.freeze
@@ -211,33 +196,24 @@ module SmarterCSV
     def blank?(value)
       case value
       when String
-        value.empty? || BLANK_RE.match?(value)
+        BLANK_RE.match?(value)
       when NilClass
         true
       when Array
-        value.empty? || value.inject(true){|result, x| result && elem_blank?(x)}
+        value.all? { |elem| blank?(elem) }
       when Hash
-        value.empty? || value.values.inject(true){|result, x| result && elem_blank?(x)}
+        value.values.all? { |elem| blank?(elem) } # Focus on values only
       else
         false
       end
     end
-    def elem_blank?(value)
-      case value
-      when String
-        value.empty? || BLANK_RE.match?(value)
+    private
-      when NilClass
-        true
+    def enforce_utf8_encoding(line, options)
+      # return line unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
-      else
-        false
-      end
+      line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
     end
   end
 end
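Note on the new `count_quote_chars`: the method below is copied from this diff into a standalone script so its escape handling can be tried in isolation. The sample lines are invented for illustration.

```ruby
# Standalone copy of count_quote_chars from this diff, for experimentation.
def count_quote_chars(line, quote_char)
  return 0 if line.nil? || quote_char.nil? || quote_char.empty?

  count = 0
  escaped = false
  line.each_char do |char|
    if char == '\\' && !escaped
      escaped = true # next character is escaped
    else
      count += 1 if char == quote_char && !escaped
      escaped = false
    end
  end
  count
end

complete   = count_quote_chars('a,"b,c",d', '"')          # 2: balanced quotes
open_field = count_quote_chars('a,"first line', '"')      # 1: odd, field continues
escaped    = count_quote_chars('a,"say \\"hi\\"",d', '"') # 2: \" is not counted
no_quote   = count_quote_chars('a,b', nil)                # 0: nil quote_char is safe
```

An odd count is what the main loop treats as a multi-line field, so `open_field.odd?` being true corresponds to reading another physical line.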

data/lib/smarter_csv/variables.rb CHANGED

@@ -2,9 +2,10 @@
 module SmarterCSV
   class << self
-    attr_reader :csv_line_count, :chunk_count, :errors, :file_line_count, :headers, :raw_header, :result, :warnings
+    attr_reader :has_rails, :csv_line_count, :chunk_count, :errors, :file_line_count, :headers, :raw_header, :result, :warnings
     def initialize_variables
+      @has_rails = !!defined?(Rails)
       @csv_line_count = 0
       @chunk_count = 0
       @errors = {}
@@ -14,13 +15,16 @@ module SmarterCSV
       @raw_header = nil # header as it appears in the file
       @result = []
       @warnings = {}
+      @enforce_utf8 = false # only set to true if needed (after options parsing)
     end
     # :nocov:
+    # rubocop:disable Naming/MethodName
     def headerA
       warn "Deprecarion Warning: 'headerA' will be removed in future versions. Use 'headders'"
       @headerA
     end
+    # rubocop:enable Naming/MethodName
     # :nocov:
   end
 end

data/lib/smarter_csv/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module SmarterCSV
-  VERSION = "1.9.3"
+  VERSION = "1.10.0"
 end

data/lib/smarter_csv.rb CHANGED

@@ -5,13 +5,21 @@ require "smarter_csv/file_io"
 require "smarter_csv/options_processing"
 require "smarter_csv/auto_detection"
 require "smarter_csv/variables"
+require 'smarter_csv/header_transformations'
+require 'smarter_csv/header_validations'
 require "smarter_csv/headers"
+require "smarter_csv/hash_transformations"
 require "smarter_csv/parse"
+# load the C-extension:
 case RUBY_ENGINE
 when 'ruby'
   begin
     if `uname -s`.chomp == 'Darwin'
+      #
+      # Please report if you see cases where the rake-compiler is building x86_64 code on arm64 cpus:
+      # https://github.com/rake-compiler/rake-compiler/issues/231
+      #
       require 'smarter_csv/smarter_csv.bundle'
     else
       # :nocov:

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.9.3
+  version: 1.10.0
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-12-16 00:00:00.000000000 Z
+date: 2023-12-31 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print
@@ -118,6 +118,9 @@ files:
 - lib/smarter_csv.rb
 - lib/smarter_csv/auto_detection.rb
 - lib/smarter_csv/file_io.rb
+- lib/smarter_csv/hash_transformations.rb
+- lib/smarter_csv/header_transformations.rb
+- lib/smarter_csv/header_validations.rb
 - lib/smarter_csv/headers.rb
 - lib/smarter_csv/options_processing.rb
 - lib/smarter_csv/parse.rb
@@ -148,7 +151,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.2.3
+rubygems_version: 3.5.3
 signing_key:
 specification_version: 4
 summary: Ruby Gem for smarter importing of CSV Files (and CSV-like files), with lots