smarter_csv 1.9.3 → 1.10.1

Files changed (14)

checksums.yaml +4 -4
data/CHANGELOG.md +21 -0
data/CONTRIBUTORS.md +1 -0
data/README.md +28 -7
data/lib/smarter_csv/hash_transformations.rb +91 -0
data/lib/smarter_csv/header_transformations.rb +63 -0
data/lib/smarter_csv/header_validations.rb +34 -0
data/lib/smarter_csv/headers.rb +6 -98
data/lib/smarter_csv/options_processing.rb +10 -1
data/lib/smarter_csv/smarter_csv.rb +68 -92
data/lib/smarter_csv/variables.rb +5 -1
data/lib/smarter_csv/version.rb +1 -1
data/lib/smarter_csv.rb +8 -0
metadata +5 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 5f35e10ff8bc0e79ff1ed9bea8e413f746f51128a6f6a9622d246873fd588366
-  data.tar.gz: 5cc30cf6f4422dd16f3019915bc5305a92aaaa4b99665e4c4c525d3bbf489cfd
+  metadata.gz: 6b214e402e999d37eb8fff613e0d87afe9084298ea0813447ca81aec33d7503a
+  data.tar.gz: 5344f4221d56ce53864bcd825c35d128cd998b1a54a2f60bed6f7e9d4d7c802f
 SHA512:
-  metadata.gz: 057472a73ae0be95318b16428b276ecffba384a68479af715c5ec3ca7601405ca73928b0fbf245c9b3f46fd33b82a8c6d9c9e6330ddb0305b83ae23f58173df0
-  data.tar.gz: 319b12a53875c1963eed6d27aa67850135d33a5b3a9f70607e6d812906733b711ade6c3ee6e789d78c2e159004a879e59e700145224134745b16d279039ac38a
+  metadata.gz: f05993e5a591b7b720dc2833d525ee2443d6fe00e6d0acdda2d237406296e16fddbe3959ff8df57f6d3bf64f95401f3d8b3b83d4a8a92ea8e9a7a8ba82cd57fe
+  data.tar.gz: 906a7b5ef793ed46d875a77d55471d568f6ac5adebcf2c121bf67ecdbbf150059eb0eeddec6a63a2d10cc6617ed2ba7359eeb6f627b25891a38ea4bdbdf37b83

data/CHANGELOG.md CHANGED

@@ -1,6 +1,27 @@
 # SmarterCSV 1.x Change Log
+## 1.10.1 (2024-01-07)
+  * fix incorrect warning about UTF-8 (issue #268, thanks hirowatari)
+## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡
+  * BREAKING CHANGES:
+    Changed behavior:
+     + when `user_provided_headers` are provided:
+       * if they are not unique, an exception will now be raised
+       * they are taken "as is", no header transformations can be applied
+       * when they are given as strings or as symbols, it is assumed that this is the desired format
+       * the value of the `strings_as_keys` options will be ignored
+     + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
+       * this allows automatic disambiguation when processing of CSV files with duplicate headers, by appending a number
+       * explicitly set this option to `nil` to get the behavior from previous versions.
+  * performance and memory improvements
+  * code refactor
 ## 1.9.3 (2023-12-16)
   * raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
   * code refactor / no functional changes
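
The `duplicate_header_suffix` change listed above can be illustrated with a minimal standalone sketch. It mirrors the `disambiguate_headers` logic that appears further down in this diff, but `effective_headers` and the plain `RuntimeError` are hypothetical stand-ins used only for illustration (the gem itself raises `SmarterCSV::DuplicateHeaders`). The key point is that `''` is truthy in Ruby, so the new default switches disambiguation on, while `nil` switches it off.

```ruby
# Standalone sketch; does not require the smarter_csv gem.
# `effective_headers` is a hypothetical helper, not part of the gem's API.
def effective_headers(headers, suffix)
  if suffix # '' is truthy in Ruby, so the new default takes this branch
    counts = Hash.new(0)
    headers = headers.map do |header|
      counts[header] += 1
      counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
    end
  end
  # Stand-in for the gem's duplicate check (SmarterCSV::DuplicateHeaders).
  raise "Duplicate headers: #{headers.inspect}" if headers.uniq.size != headers.size

  headers
end

new_default = effective_headers(%w[name email name], '')  # 1.10.x default
with_suffix = effective_headers(%w[name email name], '_')
old_default = begin
  effective_headers(%w[name email name], nil)             # pre-1.10 default
rescue RuntimeError => e
  e.message
end

p new_default # ["name", "email", "name2"]
p with_suffix # ["name", "email", "name_2"]
p old_default # the duplicate-header error message
```

Under these assumptions, callers who relied on the old exception need to pass `duplicate_header_suffix: nil` explicitly.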

data/CONTRIBUTORS.md CHANGED

@@ -51,3 +51,4 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
  * [Rahul Chaudhary](https://github.com/rahulch95)
  * [Alessandro Fazzi](https://github.com/pioneerskies)
  * [JP Camara](https://github.com/jpcamara)
+ * [Hiro Watari](https://github.com/hirowatari)

data/README.md CHANGED

@@ -2,15 +2,33 @@
 # SmarterCSV
  [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
+#### LATEST CHANGES
+* Version 1.10.0 has BREAKING CHANGES:
+    Changed behavior:
+     + when `user_provided_headers` are provided:
+       * if they are not unique, an exception will now be raised
+       * they are taken "as is", no header transformations can be applied
+       * when they are given as strings or as symbols, it is assumed that this is the desired format
+       * the value of the `strings_as_keys` options will be ignored
+     + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
+       * this allows automatic disambiguation when processing of CSV files with duplicate headers, by appending a number
+       * explicitly set this option to `nil` to get the behavior from previous versions.
 #### Development Branches
 * default branch is `main` for 1.x development
-* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
+* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
+  - This is an EXPERIMENTAL branch - DO NOT USE in production
-#### Work towards Future Version 2.0
+#### Work towards Future Version 2.x
-* Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
+* Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
   Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
 ---------------
@@ -84,6 +102,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 00000040  73 2c 35 36 37 38 0d 0a                           |s,5678..|
 ```
+### Articles
+* [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
+* [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
 ### Examples
 Here are some examples to demonstrate the versatility of SmarterCSV.
@@ -243,8 +265,6 @@ NOTE: If you use `key_mappings` and `value_converters`, make sure that the value
     data[0][:price].class
       => Float
 ```
-## Parallel Processing
-[Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
 ## Documentation
@@ -280,7 +300,8 @@ The options and the block are optional.
      | :headers_in_file            |   true   | Whether or not the file contains headers as the first line.                          |
      |                             |          | Important if the file does not contain headers,                                      |
      |                             |          | otherwise you would lose the first line of data.                                     |
-     | :duplicate_header_suffix    |   nil    | If set, adds numbers to duplicated headers and separates them by the given suffix    |
+     | :duplicate_header_suffix    |   ''     | Adds numbers to duplicated headers and separates them by the given suffix.           |
+     |                             |          | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior)        |
      | :user_provided_headers      |   nil    | *careful with that axe!*                                                             |
      |                             |          | user provided Array of header strings or symbols, to define                          |
      |                             |          | what headers should be used, overriding any in-file headers.                         |

data/lib/smarter_csv/hash_transformations.rb ADDED

@@ -0,0 +1,91 @@
+# frozen_string_literal: true
+module SmarterCSV
+  class << self
+    def hash_transformations(hash, options)
+      # there may be unmapped keys, or keys purposedly mapped to nil or an empty key..
+      # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
+      remove_empty_values = options[:remove_empty_values] == true
+      remove_zero_values = options[:remove_zero_values]
+      remove_values_matching = options[:remove_values_matching]
+      convert_to_numeric = options[:convert_values_to_numeric]
+      value_converters = options[:value_converters]
+      hash.each_with_object({}) do |(k, v), new_hash|
+        next if k.nil? || k == '' || k == :""
+        next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
+        next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
+        next if remove_values_matching && v =~ remove_values_matching
+        # deal with the :only / :except options to :convert_values_to_numeric
+        if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
+          if v =~ /^[+-]?\d+\.\d+$/
+            v = v.to_f
+          elsif v =~ /^[+-]?\d+$/
+            v = v.to_i
+          end
+        end
+        converter = value_converters[k] if value_converters
+        v = converter.convert(v) if converter
+        new_hash[k] = v
+      end
+    end
+    # def hash_transformations(hash, options)
+    #   # there may be unmapped keys, or keys purposedly mapped to nil or an empty key..
+    #   # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
+    #   hash.delete(nil)
+    #   hash.delete('')
+    #   hash.delete(:"")
+    #   if options[:remove_empty_values] == true
+    #     hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
+    #   end
+    #   hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
+    #   hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
+    #   if options[:convert_values_to_numeric]
+    #     hash.each do |k, v|
+    #       # deal with the :only / :except options to :convert_values_to_numeric
+    #       next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
+    #       # convert if it's a numeric value:
+    #       case v
+    #       when /^[+-]?\d+\.\d+$/
+    #         hash[k] = v.to_f
+    #       when /^[+-]?\d+$/
+    #         hash[k] = v.to_i
+    #       end
+    #     end
+    #   end
+    #   if options[:value_converters]
+    #     hash.each do |k, v|
+    #       converter = options[:value_converters][k]
+    #       next unless converter
+    #       hash[k] = converter.convert(v)
+    #     end
+    #   end
+    #   hash
+    # end
+    protected
+    # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
+    def limit_execution_for_only_or_except(options, option_name, key)
+      if options[option_name].is_a?(Hash)
+        if options[option_name].has_key?(:except)
+          return true if Array(options[option_name][:except]).include?(key)
+        elsif options[option_name].has_key?(:only)
+          return true unless Array(options[option_name][:only]).include?(key)
+        end
+      end
+      false
+    end
+  end
+end
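
The single-pass `each_with_object` rewrite and the `:only` / `:except` road-block above can be tried in isolation. The following is a simplified sketch, not the gem's code: `skip_key?` and `convert_numeric` are hypothetical names, and only key removal and numeric conversion are modelled.

```ruby
# Standalone sketch; `skip_key?` and `convert_numeric` are hypothetical names.
# Returns true when conversion should be skipped for this key.
def skip_key?(option_value, key)
  return false unless option_value.is_a?(Hash)

  if option_value.key?(:except)
    Array(option_value[:except]).include?(key)
  elsif option_value.key?(:only)
    !Array(option_value[:only]).include?(key)
  else
    false
  end
end

# Builds a new hash in one pass instead of mutating the input several times.
def convert_numeric(hash, option_value)
  hash.each_with_object({}) do |(k, v), out|
    next if k.nil? || k == '' || k == :"" # drop keys that were mapped to nothing

    unless skip_key?(option_value, k)
      case v
      when /^[+-]?\d+\.\d+$/ then v = v.to_f
      when /^[+-]?\d+$/ then v = v.to_i
      end
    end
    out[k] = v
  end
end

row = { id: '42', zip: '01234', price: '9.50', nil => 'unmapped' }
all_converted = convert_numeric(row, true)
zip_kept      = convert_numeric(row, { except: :zip })
only_price    = convert_numeric(row, { only: [:price] })
```

Excluding `:zip` keeps the leading zero that an integer conversion would otherwise discard.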

data/lib/smarter_csv/header_transformations.rb ADDED

@@ -0,0 +1,63 @@
+# frozen_string_literal: true
+module SmarterCSV
+  class << self
+    # transform the headers that were in the file:
+    def header_transformations(header_array, options)
+      header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
+      header_array.map!{|x| x.strip} if options[:strip_whitespace]
+      unless options[:keep_original_headers]
+        header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
+        header_array.map!{|x| x.downcase} if options[:downcase_header]
+      end
+      # detect duplicate headers and disambiguate
+      header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
+      # symbolize headers
+      header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
+      # doesn't make sense to re-map when we have user_provided_headers
+      header_array = remap_headers(header_array, options) if options[:key_mapping]
+      header_array
+    end
+    def disambiguate_headers(headers, options)
+      counts = Hash.new(0)
+      headers.map do |header|
+        counts[header] += 1
+        counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
+      end
+    end
+    # do some key mapping on the keys in the file header
+    # if you want to completely delete a key, then map it to nil or to ''
+    def remap_headers(headers, options)
+      key_mapping = options[:key_mapping]
+      if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
+        raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
+      end
+      key_mapping = options[:key_mapping]
+      # if silence_missing_keys are not set, raise error if missing header
+      missing_keys = key_mapping.keys - headers
+      # if the user passes a list of speciffic mapped keys that are optional
+      missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
+      unless missing_keys.empty? || options[:silence_missing_keys] == true
+        raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
+      end
+      headers.map! do |header|
+        if key_mapping.has_key?(header)
+          key_mapping[header].nil? ? nil : key_mapping[header]
+        elsif options[:remove_unmapped_keys]
+          nil
+        else
+          header
+        end
+      end
+      headers
+    end
+  end
+end
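
As a rough illustration of what the default header clean-up in `header_transformations` does to in-file headers, here is a standalone sketch. `normalize_headers` is a hypothetical name, the quote character is removed with a plain string `gsub` rather than the regexp used above, and the options are hard-wired to their default behaviour.

```ruby
# Standalone sketch; `normalize_headers` is a hypothetical helper.
def normalize_headers(raw_headers, quote_char: '"')
  raw_headers.map do |header|
    h = header.gsub(quote_char, '') # drop quote characters
    h = h.strip                     # strip_whitespace
    h = h.gsub(/\s+|-+/, '_')       # a run of whitespace or of dashes becomes one underscore
    h = h.downcase                  # downcase_header
    h.to_sym                        # symbolize (skipped with strings_as_keys / keep_original_headers)
  end
end

headers = normalize_headers(['"First Name"', ' Last-Name ', 'E-Mail  Address'])
p headers # [:first_name, :last_name, :e_mail_address]
```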

data/lib/smarter_csv/header_validations.rb ADDED

@@ -0,0 +1,34 @@
+# frozen_string_literal: true
+module SmarterCSV
+  class << self
+    def header_validations(headers, options)
+      check_duplicate_headers(headers, options)
+      check_required_headers(headers, options)
+    end
+    def check_duplicate_headers(headers, _options)
+      header_counts = Hash.new(0)
+      headers.each { |header| header_counts[header] += 1 unless header.nil? }
+      duplicates = header_counts.select { |_, count| count > 1 }
+      unless duplicates.empty?
+        raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
+      end
+    end
+    require 'set'
+    def check_required_headers(headers, options)
+      if options[:required_keys] && options[:required_keys].is_a?(Array)
+        headers_set = headers.to_set
+        missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }
+        unless missing_keys.empty?
+          raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
+        end
+      end
+    end
+  end
+end
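
The two checks above replace nested scans with a counting hash and a `Set` lookup. A minimal sketch of the same idea follows; `duplicate_headers` and `missing_keys` are hypothetical helpers that return their findings instead of raising `SmarterCSV::DuplicateHeaders` or `SmarterCSV::MissingKeys`.

```ruby
require 'set'

# Standalone sketch; helper names are hypothetical.
# Counts each header once, ignoring nil (columns removed via key_mapping).
def duplicate_headers(headers)
  counts = Hash.new(0)
  headers.each { |h| counts[h] += 1 unless h.nil? }
  counts.select { |_, n| n > 1 }
end

# Set membership keeps each lookup constant-time on average.
def missing_keys(headers, required_keys)
  present = headers.to_set
  required_keys.reject { |k| present.include?(k) }
end

dups    = duplicate_headers([:id, :name, nil, :name, nil])
missing = missing_keys([:id, :name], [:id, :email])
p dups    # {:name=>2} (printed as {name: 2} on newer Rubies)
p missing # [:email]
```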

data/lib/smarter_csv/headers.rb CHANGED

@@ -14,7 +14,11 @@ module SmarterCSV
         # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
         header_line = @raw_header = readline_with_counts(filehandle, options)
         header_line = preprocess_header_line(header_line, options)
-        file_header_array, file_header_size = parse_and_modify_headers(header_line, options)
+        file_header_array, file_header_size = parse(header_line, options)
+        file_header_array = header_transformations(file_header_array, options)
       else
         unless options[:user_provided_headers]
           raise SmarterCSV::IncorrectOption, "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers"
@@ -36,22 +40,12 @@ module SmarterCSV
             # we could print out the mapping of file_header_array to header_array here
           end
         end
         header_array = user_header_array
       else
         header_array = file_header_array
       end
-      # detect duplicate headers and disambiguate
-      header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
-      # symbolize headers
-      header_array.map!{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
-      # wouldn't make sense to re-map user provided headers
-      header_array = remap_headers(header_array, options) if options[:key_mapping] && !options[:user_provided_headers]
-      validate_and_deprecate_headers(header_array, options)
       [header_array, header_array.size]
     end
@@ -65,92 +59,6 @@ module SmarterCSV
       header_line
     end
-    def parse_and_modify_headers(header_line, options)
-      file_header_array, file_header_size = parse(header_line, options)
-      file_header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
-      file_header_array.map!{|x| x.strip} if options[:strip_whitespace]
-      unless options[:keep_original_headers]
-        file_header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
-        file_header_array.map!{|x| x.downcase} if options[:downcase_header]
-      end
-      [file_header_array, file_header_size]
-    end
-    def disambiguate_headers(headers, options)
-      counts = Hash.new(0)
-      headers.map do |header|
-        counts[header] += 1
-        counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
-      end
-    end
-    # do some key mapping on the keys in the file header
-    # if you want to completely delete a key, then map it to nil or to ''
-    def remap_headers(headers, options)
-      key_mapping = options[:key_mapping]
-      if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
-        raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
-      end
-      key_mapping = options[:key_mapping]
-      # if silence_missing_keys are not set, raise error if missing header
-      missing_keys = key_mapping.keys - headers
-      # if the user passes a list of speciffic mapped keys that are optional
-      missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
-      unless missing_keys.empty? || options[:silence_missing_keys] == true
-        raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
-      end
-      headers.map! do |header|
-        if key_mapping.has_key?(header)
-          key_mapping[header].nil? ? nil : key_mapping[header]
-        elsif options[:remove_unmapped_keys]
-          nil
-        else
-          header
-        end
-      end
-      headers
-    end
-    # header_validations
-    def validate_and_deprecate_headers(headers, options)
-      duplicate_headers = []
-      headers.compact.each do |k|
-        duplicate_headers << k if headers.select{|x| x == k}.size > 1
-      end
-      unless options[:user_provided_headers] || duplicate_headers.empty?
-        raise SmarterCSV::DuplicateHeaders, "ERROR: duplicate headers: #{duplicate_headers.join(',')}"
-      end
-      # deprecate required_headers
-      unless options[:required_headers].nil?
-        puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
-        if options[:required_keys].nil?
-          options[:required_keys] = options[:required_headers]
-          options[:required_headers] = nil
-        end
-      end
-      if options[:required_keys] && options[:required_keys].is_a?(Array)
-        missing_keys = []
-        options[:required_keys].each do |k|
-          missing_keys << k unless headers.include?(k)
-        end
-        raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}" unless missing_keys.empty?
-      end
-    end
-    def enforce_utf8_encoding(header, options)
-      return header unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
-      header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
-    end
     def remove_comments_from_header(header, options)
       return header unless options[:comment_regexp]

data/lib/smarter_csv/options_processing.rb CHANGED

@@ -9,7 +9,7 @@ module SmarterCSV
     comment_regexp: nil, # was: /\A#/,
     convert_values_to_numeric: true,
     downcase_header: true,
-    duplicate_header_suffix: nil,
+    duplicate_header_suffix: '', # was: nil,
     file_encoding: 'utf-8',
     force_simple_split: false,
     force_utf8: false,
@@ -62,6 +62,15 @@ module SmarterCSV
     private
     def validate_options!(options)
+      # deprecate required_headers
+      unless options[:required_headers].nil?
+        puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
+        if options[:required_keys].nil?
+          options[:required_keys] = options[:required_headers]
+          options[:required_headers] = nil
+        end
+      end
       keys = options.keys
       errors = []
       errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
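
The `required_headers` deprecation now runs during option validation rather than during header validation. Its effect on the options hash can be sketched on its own as follows. `migrate_required_headers` is a hypothetical helper, and it uses `warn` (stderr) where the gem uses `puts`.

```ruby
# Standalone sketch; `migrate_required_headers` is a hypothetical helper.
def migrate_required_headers(options)
  return options if options[:required_headers].nil?

  warn "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
  if options[:required_keys].nil?
    # only copy the old option over when the new one was not given
    options[:required_keys] = options[:required_headers]
    options[:required_headers] = nil
  end
  options
end

migrated  = migrate_required_headers({ required_headers: [:id] })
both_set  = migrate_required_headers({ required_headers: [:id], required_keys: [:email] })
untouched = migrate_required_headers({ required_keys: [:email] })
```

When both options are supplied, `required_keys` wins and the deprecated value is left in place.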

data/lib/smarter_csv/smarter_csv.rb CHANGED

@@ -12,28 +12,34 @@ module SmarterCSV
   # first parameter: filename or input object which responds to readline method
   def SmarterCSV.process(input, given_options = {}, &block) # rubocop:disable Lint/UnusedMethodArgument
+    initialize_variables
     options = process_options(given_options)
-    initialize_variables
+    @enforce_utf8 = options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+    @verbose = options[:verbose]
-    has_rails = !!defined?(Rails)
     begin
       fh = input.respond_to?(:readline) ? input : File.open(input, "r:#{options[:file_encoding]}")
+      if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
+        puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
+      end
       # auto-detect the row separator
       options[:row_sep] = guess_line_ending(fh, options) if options[:row_sep]&.to_sym == :auto
       # attempt to auto-detect column separator
       options[:col_sep] = guess_column_separator(fh, options) if options[:col_sep]&.to_sym == :auto
-      if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
-        puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
-      end
       skip_lines(fh, options)
       @headers, header_size = process_headers(fh, options)
       @headerA = @headers # @headerA is deprecated, use @headers
+      puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
+      header_validations(@headers, options)
       # in case we use chunking.. we'll need to set it up..
       if options[:chunk_size].to_i > 0
         use_chunks = true
@@ -45,31 +51,42 @@ module SmarterCSV
       end
       # now on to processing all the rest of the lines in the CSV file:
+      # fh.each_line |line|
       until fh.eof? # we can't use fh.readlines() here, because this would read the whole file into memory at once, and eof => true
         line = readline_with_counts(fh, options)
         # replace invalid byte sequence in UTF-8 with question mark to avoid errors
-        line = line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+        line = enforce_utf8_encoding(line, options) if @enforce_utf8
-        print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if options[:verbose]
+        print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if @verbose
         next if options[:comment_regexp] && line =~ options[:comment_regexp] # ignore all comment lines if there are any
         # cater for the quoted csv data containing the row separator carriage return character
         # in which case the row data will be split across multiple lines (see the sample content in spec/fixtures/carriage_returns_rn.csv)
         # by detecting the existence of an uneven number of quote characters
+        multiline = count_quote_chars(line, options[:quote_char]).odd?
-        multiline = count_quote_chars(line, options[:quote_char]).odd? # should handle quote_char nil
-        while count_quote_chars(line, options[:quote_char]).odd? # should handle quote_char nil
+        while multiline
           next_line = fh.readline(options[:row_sep])
-          next_line = next_line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+          next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
           line += next_line
           @file_line_count += 1
+          break if fh.eof? # Exit loop if end of file is reached
+          multiline = count_quote_chars(line, options[:quote_char]).odd?
+        end
+        # :nocov:
+        if multiline && @verbose
+          print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count
         end
-        print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count if options[:verbose] && multiline
+        # :nocov:
         line.chomp!(options[:row_sep])
+        # --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
         dataA, _data_size = parse(line, options, header_size)
         dataA.map!{|x| x.strip} if options[:strip_whitespace]
@@ -77,48 +94,25 @@ module SmarterCSV
         # if all values are blank, then ignore this line
         next if options[:remove_empty_hashes] && (dataA.empty? || blank?(dataA))
+        # --- HASH TRANSFORMATIONS ------------------------------------------------------------
         hash = @headers.zip(dataA).to_h
-        # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
-        hash.delete(nil)
-        hash.delete('')
-        hash.delete(:"")
-        if options[:remove_empty_values] == true
-          hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
-        end
-        hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
-        hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
-        if options[:convert_values_to_numeric]
-          hash.each do |k, v|
-            # deal with the :only / :except options to :convert_values_to_numeric
-            next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
-            # convert if it's a numeric value:
-            case v
-            when /^[+-]?\d+\.\d+$/
-              hash[k] = v.to_f
-            when /^[+-]?\d+$/
-              hash[k] = v.to_i
-            end
-          end
-        end
-        if options[:value_converters]
-          hash.each do |k, v|
-            converter = options[:value_converters][k]
-            next unless converter
+        hash = hash_transformations(hash, options)
-            hash[k] = converter.convert(v)
-          end
-        end
+        # --- HASH VALIDATIONS ----------------------------------------------------------------
+        # will go here, and be able to:
+        #  - validate correct format of the values for fields
+        #  - required fields to be non-empty
+        #  - ...
+        # -------------------------------------------------------------------------------------
         next if options[:remove_empty_hashes] && hash.empty?
+        puts "CSV Line #{@file_line_count}: #{pp(hash)}" if @verbose == '2' # very verbose setting
+        # optional adding of csv_line_number to the hash to help debugging
         hash[:csv_line_number] = @csv_line_count if options[:with_line_numbers]
+        # process the chunks or the resulting hash
         if use_chunks
           chunk << hash # append temp result to chunk
@@ -127,16 +121,13 @@ module SmarterCSV
             if block_given?
               yield chunk # do something with the hashes in the chunk in the block
             else
-              @result << chunk # not sure yet, why anybody would want to do this without a block
+              @result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
             end
             @chunk_count += 1
-            chunk = [] # initialize for next chunk of data
+            chunk.clear # re-initialize for next chunk of data
           else
-            # the last chunk may contain partial data, which also needs to be returned (BUG / ISSUE-18)
+            # the last chunk may contain partial data, which is handled below
           end
           # while a chunk is being filled up we don't need to do anything else here
         else # no chunk handling
@@ -149,15 +140,15 @@ module SmarterCSV
       end
       # print new line to retain last processing line message
-      print "\n" if options[:verbose]
+      print "\n" if @verbose
-      # last chunk:
+      # handling of last chunk:
       if !chunk.nil? && chunk.size > 0
         # do something with the chunk
         if block_given?
           yield chunk # do something with the hashes in the chunk in the block
         else
-          @result << chunk # not sure yet, why anybody would want to do this without a block
+          @result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
         end
         @chunk_count += 1
         # chunk = [] # initialize for next chunk of data
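
The switch from `chunk = []` to `chunk.clear` is why `@result << chunk.dup` is needed in the hunks above: `clear` empties the array in place, so any stored reference to it would be emptied as well. A small sketch under that reading, with `collect_chunks` as a hypothetical helper:

```ruby
# Standalone sketch; `collect_chunks` is a hypothetical helper.
def collect_chunks(rows, chunk_size, copy:)
  result = []
  chunk = []
  rows.each do |row|
    chunk << row
    next if chunk.size < chunk_size

    result << (copy ? chunk.dup : chunk)
    chunk.clear # empties the same Array object that may already sit in result
  end
  # last, possibly partial chunk
  result << (copy ? chunk.dup : chunk) unless chunk.empty?
  result
end

rows = [{ id: 1 }, { id: 2 }, { id: 3 }]
with_dup    = collect_chunks(rows, 2, copy: true)
without_dup = collect_chunks(rows, 2, copy: false)
```

Without the copy, every entry in the result points at the same array, so earlier chunks end up showing only the final chunk's rows.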
@@ -174,16 +165,22 @@ module SmarterCSV
   end
   class << self
-    # * the `scan` method iterates through the string and finds all occurrences of the pattern
-    # * The reqular expression:
-    #   - (?<!\\) : Negative lookbehind to ensure the quote character is not preceded by an unescaped backslash.
-    #   - (?:\\\\)* : Non-capturing group for an even number of backslashes (escaped backslashes).
-    #                 This allows for any number of escaped backslashes before the quote character.
-    #   - #{Regexp.escape(quote_char)} : Dynamically inserts the quote_char into the regex,
-    #                                    ensuring it's properly escaped for use in the regex.
-    #
     def count_quote_chars(line, quote_char)
-      line.scan(/(?<!\\)(?:\\\\)*#{Regexp.escape(quote_char)}/).count
+      return 0 if line.nil? || quote_char.nil? || quote_char.empty?
+      count = 0
+      escaped = false
+      line.each_char do |char|
+        if char == '\\' && !escaped
+          escaped = true
+        else
+          count += 1 if char == quote_char && !escaped
+          escaped = false
+        end
+      end
+      count
     end
     def has_acceleration?
@@ -192,18 +189,6 @@ module SmarterCSV
     protected
-    # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
-    def limit_execution_for_only_or_except(options, option_name, key)
-      if options[option_name].is_a?(Hash)
-        if options[option_name].has_key?(:except)
-          return true if Array(options[option_name][:except]).include?(key)
-        elsif options[option_name].has_key?(:only)
-          return true unless Array(options[option_name][:only]).include?(key)
-        end
-      end
-      false
-    end
     # SEE: https://github.com/rails/rails/blob/32015b6f369adc839c4f0955f2d9dce50c0b6123/activesupport/lib/active_support/core_ext/object/blank.rb#L121
     # and in the future we might also include UTF-8 space characters: https://www.compart.com/en/unicode/category/Zs
     BLANK_RE = /\A\s*\z/.freeze
@@ -211,33 +196,24 @@ module SmarterCSV
     def blank?(value)
       case value
       when String
-        value.empty? || BLANK_RE.match?(value)
+        BLANK_RE.match?(value)
       when NilClass
         true
       when Array
-        value.empty? || value.inject(true){|result, x| result && elem_blank?(x)}
+        value.all? { |elem| blank?(elem) }
       when Hash
-        value.empty? || value.values.inject(true){|result, x| result && elem_blank?(x)}
+        value.values.all? { |elem| blank?(elem) } # Focus on values only
       else
         false
       end
     end
-    def elem_blank?(value)
-      case value
-      when String
-        value.empty? || BLANK_RE.match?(value)
+    private
-      when NilClass
-        true
+    def enforce_utf8_encoding(line, options)
+      # return line unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
-      else
-        false
-      end
+      line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
     end
   end
 end
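
The new character-by-character `count_quote_chars` can be exercised on its own. The method body below is copied from the diff above; `multiline?` is a hypothetical wrapper added only to show how an odd count signals that a quoted field continues on the next line.

```ruby
# Method body as in the diff above, lifted out of the SmarterCSV module.
def count_quote_chars(line, quote_char)
  return 0 if line.nil? || quote_char.nil? || quote_char.empty?

  count = 0
  escaped = false
  line.each_char do |char|
    if char == '\\' && !escaped
      escaped = true # the next character is escaped
    else
      count += 1 if char == quote_char && !escaped
      escaped = false
    end
  end
  count
end

# Hypothetical wrapper: an odd number of quotes means the row is incomplete.
def multiline?(line, quote_char = '"')
  count_quote_chars(line, quote_char).odd?
end

balanced       = count_quote_chars('a,"b",c', '"')   # two unescaped quotes
open_quote     = count_quote_chars('a,"b', '"')      # one unescaped quote
escaped_quote  = count_quote_chars('a,\"b', '"')     # backslash escapes the quote
escaped_bslash = count_quote_chars('a,\\\\"b', '"')  # escaped backslash, then a real quote
```

Unlike the earlier regexp version, this one returns 0 for a `nil` or empty `quote_char` instead of raising.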

data/lib/smarter_csv/variables.rb CHANGED

@@ -2,9 +2,10 @@
 module SmarterCSV
   class << self
-    attr_reader :csv_line_count, :chunk_count, :errors, :file_line_count, :headers, :raw_header, :result, :warnings
+    attr_reader :has_rails, :csv_line_count, :chunk_count, :errors, :file_line_count, :headers, :raw_header, :result, :warnings
     def initialize_variables
+      @has_rails = !!defined?(Rails)
       @csv_line_count = 0
       @chunk_count = 0
       @errors = {}
@@ -14,13 +15,16 @@ module SmarterCSV
       @raw_header = nil # header as it appears in the file
       @result = []
       @warnings = {}
+      @enforce_utf8 = false # only set to true if needed (after options parsing)
     end
     # :nocov:
+    # rubocop:disable Naming/MethodName
     def headerA
       warn "Deprecarion Warning: 'headerA' will be removed in future versions. Use 'headders'"
       @headerA
     end
+    # rubocop:enable Naming/MethodName
     # :nocov:
   end
 end

data/lib/smarter_csv/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module SmarterCSV
-  VERSION = "1.9.3"
+  VERSION = "1.10.1"
 end

data/lib/smarter_csv.rb CHANGED

@@ -5,13 +5,21 @@ require "smarter_csv/file_io"
 require "smarter_csv/options_processing"
 require "smarter_csv/auto_detection"
 require "smarter_csv/variables"
+require 'smarter_csv/header_transformations'
+require 'smarter_csv/header_validations'
 require "smarter_csv/headers"
+require "smarter_csv/hash_transformations"
 require "smarter_csv/parse"
+# load the C-extension:
 case RUBY_ENGINE
 when 'ruby'
   begin
     if `uname -s`.chomp == 'Darwin'
+      #
+      # Please report if you see cases where the rake-compiler is building x86_64 code on arm64 cpus:
+      # https://github.com/rake-compiler/rake-compiler/issues/231
+      #
       require 'smarter_csv/smarter_csv.bundle'
     else
       # :nocov:

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.9.3
+  version: 1.10.1
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-12-16 00:00:00.000000000 Z
+date: 2024-01-07 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print
@@ -118,6 +118,9 @@ files:
 - lib/smarter_csv.rb
 - lib/smarter_csv/auto_detection.rb
 - lib/smarter_csv/file_io.rb
+- lib/smarter_csv/hash_transformations.rb
+- lib/smarter_csv/header_transformations.rb
+- lib/smarter_csv/header_validations.rb
 - lib/smarter_csv/headers.rb
 - lib/smarter_csv/options_processing.rb
 - lib/smarter_csv/parse.rb