RubyGems - smarter_csv - Versions diffs - 1.4.2 → 1.5.2 - Mend

smarter_csv 1.4.2 → 1.5.2

Files changed (17)

checksums.yaml +4 -4
data/CHANGELOG.md +22 -1
data/CONTRIBUTORS.md +2 -0
data/README.md +12 -2
data/lib/smarter_csv/smarter_csv.rb +137 -100
data/lib/smarter_csv/version.rb +1 -1
data/spec/fixtures/additional_separator.csv +6 -0
data/spec/fixtures/duplicate_headers.csv +1 -1
data/spec/fixtures/hard_sample.csv +2 -0
data/spec/smarter_csv/additional_separator_spec.rb +45 -0
data/spec/smarter_csv/binary_file2_spec.rb +1 -1
data/spec/smarter_csv/duplicate_headers_spec.rb +76 -0
data/spec/smarter_csv/hard_sample_spec.rb +24 -0
data/spec/smarter_csv/ignore_comments_spec.rb +45 -30
data/spec/smarter_csv/invalid_headers_spec.rb +8 -22
data/spec/smarter_csv/no_header_spec.rb +16 -11
metadata +12 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 3be724101d41326ff480bcb723c1b40a3cabd879eb55e0c2f044372f8e5a57d0
-  data.tar.gz: 657db1421352f449bf042f8df4d5178167af048ad37836e4f2f2f8a6aea3ece0
+  metadata.gz: 88b9932c898320fb05d5697e155dc0bd3ade887d2fcfab7b660933e230007364
+  data.tar.gz: f0525d9c917aff44f910d4547b8e918faa3beb50d47adc29182df1fc1ec2be19
 SHA512:
-  metadata.gz: 3430649df35ac8139d35b04b85e8691ca5fc3d98b7b15f0d3987855f571987bdb742e0ed6f807ddb7a2e61e61d696d529ac311bc58e30188325f1c4bb78098a4
-  data.tar.gz: 1b386af7cc7c39bc7ea934875e16f6641a2cc0c2bb5dfaa3b1f298739b1b355b2f41570e42998a2d7790a17f96feb07118b69c23d913acc634aae5901f0c9229
+  metadata.gz: 330ad44b9808150f6fdf96dec65d259c2d9cf5eb25e22dc80f63095f4014b065b8aa97a2ba9b814c6cea6f4c0361e04567be403ab78b54d0518b49dc072f36ac
+  data.tar.gz: 27531bd508b5b455a32947badfb85d7e95489ad282a837ef046864806ba7fa12539148ab2fe4c84174fba3ef085dd3adda5d7c070615d684cd99ed0f90b903a3

data/CHANGELOG.md CHANGED

@@ -1,7 +1,28 @@
 # SmarterCSV 1.x Change Log
-## 1.4.1 (2022-02-12)
+## 1.5.2 (2022-04-29)
+  * added missing keys to the SmarterCSV::KeyMappingError exception message #189 (thanks to John Dell)
+## 1.5.1 (2022-04-27)
+  * added raising of `KeyMappingError` if `key_mapping` refers to a non-existent key
+  * added option `duplicate_header_suffix` (thanks to Skye Shaw)
+    When given a non-nil string, it uses the suffix to append numbering 2..n to duplicate headers.
+    If your code will need to process arbitrary CSV files, please set `duplicate_header_suffix`.
+## 1.5.0 (2022-04-25)
+  * fixed bug with trailing col_sep characters, introduced in 1.4.0
+  * Fix deprecation warning in Ruby 3.0.3 / $INPUT_RECORD_SEPARATOR (thanks to Joel Fouse)
+  * changed default for `comment_regexp` to be `nil` for a safer default behavior (thanks to David Lazar)
+  **Note**
+    This no longer assumes that lines starting with `#` are comments.
+    If you want to treat lines starting with '#' as comments, use `comment_regexp: /\A#/`
+## 1.4.2 (2022-02-12)
+  * fixed issue with simplecov
+## 1.4.1 (2022-02-12) (PULLED)
   * minor fix: also support `col_sep: :auto`
   * added simplecov
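
The 1.5.0 change to `comment_regexp` can be illustrated without the gem. This is a minimal sketch, not the gem's code: `data_lines` is a hypothetical helper that only mimics the guard `options[:comment_regexp] && line =~ options[:comment_regexp]` from the diff, and the sample data is invented.

```ruby
require 'stringio'

# Hypothetical helper (not part of smarter_csv): returns the lines that
# survive comment filtering. A nil regexp disables filtering entirely,
# which matches the new default described in the changelog.
def data_lines(io, comment_regexp = nil)
  io.each_line
    .reject { |line| comment_regexp && line =~ comment_regexp }
    .map(&:chomp)
end

csv = "name,total\n#MR1220817,144\n# a real comment\n"

p data_lines(StringIO.new(csv))         # all three lines are kept
p data_lines(StringIO.new(csv), /\A#/)  # => ["name,total"]
```

With the regexp set, the data row whose first field happens to start with `#` is dropped along with the real comment, which is the risk the `nil` default avoids.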

data/CONTRIBUTORS.md CHANGED

@@ -43,3 +43,5 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
  * [Olle Jonsson](https://github.com/olleolleolle)
  * [Nicolas Guillemain](https://github.com/Viiruus)
  * [Sp6](https://github.com/sp6)
+ * [Joel Fouse](https://github.com/jfouse)
+ * [John Dell](https://github.com/spovich)

data/README.md CHANGED

@@ -215,7 +215,7 @@ The options and the block are optional.
      | :invalid_byte_sequence      |   ''     | what to replace invalid byte sequences with                                          |
      | :force_utf8                 |   false  | force UTF-8 encoding of all lines (including headers) in the CSV file                |
      | :skip_lines                 |   nil    | how many lines to skip before the first line or header line is processed             |
-     | :comment_regexp             |   /^#/   | regular expression which matches comment lines (see NOTE about the CSV header)       |
+     | :comment_regexp             |   nil    | regular expression to ignore comment lines (see NOTE on CSV header), e.g. /\A#/      |
      ---------------------------------------------------------------------------------------------------------------------------------
      | :col_sep                    |   ','    | column separator, can be set to :auto                                                |
      | :force_simple_split         |   false  | force simple splitting on :col_sep character for non-standard CSV-files.             |
@@ -228,6 +228,7 @@ The options and the block are optional.
      | :headers_in_file            |   true   | Whether or not the file contains headers as the first line.                          |
      |                             |          | Important if the file does not contain headers,                                      |
      |                             |          | otherwise you would lose the first line of data.                                     |
+     | :duplicate_header_suffix    |   nil    | If set, adds numbers to duplicated headers and separates them by the given suffix    |
      | :user_provided_headers      |   nil    | *careful with that axe!*                                                             |
      |                             |          | user provided Array of header strings or symbols, to define                          |
      |                             |          | what headers should be used, overriding any in-file headers.                         |
@@ -282,14 +283,23 @@ And header and data validations will also be supported in 2.x
          data = SmarterCSV.process(f)
        end
 ```
 #### NOTES about CSV Headers:
  * as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
- * the first line with the CSV header may or may not be commented out according to the :comment_regexp
+ * the first line with the header might be commented out, in which case you will need to set `comment_regexp: /\A#/`
+   This is no longer handled automatically since 1.5.0.
  * any occurrences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
  * any of the keys in the header line will be downcased, spaces replaced by underscore, and converted to Ruby symbols before being used as keys in the returned Hashes
  * you can not combine the :user_provided_headers and :key_mapping options
  * if the incorrect number of headers are provided via :user_provided_headers, exception SmarterCSV::HeaderSizeMismatch is raised
+#### NOTES on Duplicate Headers:
+ As a corner case, it is possible that a CSV file contains multiple headers with the same name.
+ * If that happens, by default `smarter_csv` will raise a `DuplicateHeaders` error.
+ * If you set `duplicate_header_suffix` to a non-nil string, it will use it to append numbers 2..n to the duplicate headers. To disambiguate the headers further, you can also use `key_mapping` to assign meaningful names.
+ * If your code will need to process arbitrary CSV files, please set `duplicate_header_suffix`.
+ * Another way to deal with duplicate headers is to use `user_provided_headers` to ignore any headers in the file.
 #### NOTES on Key Mapping:
  * keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
  * if you want to completely delete a key, then map it to nil or to '', they will be automatically deleted from any result Hash
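
The numbering scheme in the duplicate-header notes can be sketched in plain Ruby. The function below follows the logic of `process_duplicate_headers` from this diff, but `disambiguate_headers` is a hypothetical standalone name and the code runs without the gem.

```ruby
# Standalone sketch of the duplicate-header numbering: the first occurrence
# of a header is kept as-is, later occurrences get the suffix plus a counter
# starting at 2.
def disambiguate_headers(headers, suffix)
  counts = Hash.new(0)
  headers.map do |key|
    counts[key] += 1
    counts[key] == 1 ? key : [key, suffix, counts[key]].join
  end
end

p disambiguate_headers(%w[email firstname lastname email age], '_')
# => ["email", "firstname", "lastname", "email_2", "age"]
p disambiguate_headers(%w[a b c a a], '')
# => ["a", "b", "c", "a2", "a3"]
```

An empty-string suffix still enables the feature because only `nil` turns it off.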

data/lib/smarter_csv/smarter_csv.rb CHANGED

@@ -5,108 +5,41 @@ module SmarterCSV
   class DuplicateHeaders < SmarterCSVException; end
   class MissingHeaders < SmarterCSVException; end
   class NoColSepDetected < SmarterCSVException; end
+  class KeyMappingError < SmarterCSVException; end
-  def SmarterCSV.process(input, options={}, &block)   # first parameter: filename or input object with readline method
+  # first parameter: filename or input object which responds to readline method
+  def SmarterCSV.process(input, options={}, &block)
     options = default_options.merge(options)
     options[:invalid_byte_sequence] = '' if options[:invalid_byte_sequence].nil?
     headerA = []
     result = []
-    old_row_sep = $INPUT_RECORD_SEPARATOR
-    file_line_count = 0
-    csv_line_count = 0
+    @file_line_count = 0
+    @csv_line_count = 0
     has_rails = !! defined?(Rails)
     begin
-      f = input.respond_to?(:readline) ? input : File.open(input, "r:#{options[:file_encoding]}")
+      fh = input.respond_to?(:readline) ? input : File.open(input, "r:#{options[:file_encoding]}")
       # auto-detect the row separator
-      options[:row_sep] = SmarterCSV.guess_line_ending(f, options) if options[:row_sep].to_sym == :auto
-      $INPUT_RECORD_SEPARATOR = options[:row_sep]
+      options[:row_sep] = SmarterCSV.guess_line_ending(fh, options) if options[:row_sep].to_sym == :auto
       # attempt to auto-detect column separator
-      options[:col_sep] = guess_column_separator(f) if options[:col_sep].to_sym == :auto
+      options[:col_sep] = guess_column_separator(fh, options) if options[:col_sep].to_sym == :auto
       # preserve options, in case we need to call the CSV class
       csv_options = options.select{|k,v| [:col_sep, :row_sep, :quote_char].include?(k)} # options.slice(:col_sep, :row_sep, :quote_char)
       csv_options.delete(:row_sep) if [nil, :auto].include?( options[:row_sep].to_sym )
       csv_options.delete(:col_sep) if [nil, :auto].include?( options[:col_sep].to_sym )
-      if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && ( f.respond_to?(:external_encoding) && f.external_encoding != Encoding.find('UTF-8') || f.respond_to?(:encoding) && f.encoding != Encoding.find('UTF-8') )
+      if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && ( fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8') )
         puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
       end
-      options[:skip_lines].to_i.times{f.readline} if options[:skip_lines].to_i > 0
-      if options[:headers_in_file]        # extract the header line
-        # process the header line in the CSV file..
-        # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
-        header = f.readline
-        header = header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
-        header = header.sub(options[:comment_regexp],'').chomp(options[:row_sep])
-        file_line_count += 1
-        csv_line_count += 1
-        header = header.gsub(options[:strip_chars_from_headers], '') if options[:strip_chars_from_headers]
-        if (header =~ %r{#{options[:quote_char]}}) and (! options[:force_simple_split])
-          file_headerA = begin
-            CSV.parse( header, **csv_options ).flatten.collect!{|x| x.nil? ? '' : x} # to deal with nil values from CSV.parse
-          rescue CSV::MalformedCSVError => e
-            raise $!, "#{$!} [SmarterCSV: csv line #{csv_line_count}]", $!.backtrace
-          end
-        else
-          file_headerA =  header.split(options[:col_sep])
-        end
-        file_header_size = file_headerA.size # before mapping, which could delete keys
-        file_headerA.map!{|x| x.gsub(%r/#{options[:quote_char]}/,'') }
-        file_headerA.map!{|x| x.strip}  if options[:strip_whitespace]
-        unless options[:keep_original_headers]
-          file_headerA.map!{|x| x.gsub(/\s+|-+/,'_')}
-          file_headerA.map!{|x| x.downcase }   if options[:downcase_header]
+      if options[:skip_lines].to_i > 0
+        options[:skip_lines].to_i.times do
+          readline_with_counts(fh, options)
         end
-      else
-        raise SmarterCSV::IncorrectOption , "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers" if options[:user_provided_headers].nil?
-      end
-      if options[:user_provided_headers] && options[:user_provided_headers].class == Array && ! options[:user_provided_headers].empty?
-        # use user-provided headers
-        headerA = options[:user_provided_headers]
-        if defined?(file_header_size) && ! file_header_size.nil?
-          if headerA.size != file_header_size
-            raise SmarterCSV::HeaderSizeMismatch , "ERROR: :user_provided_headers defines #{headerA.size} headers !=  CSV-file #{input} has #{file_header_size} headers"
-          else
-            # we could print out the mapping of file_headerA to headerA here
-          end
-        end
-      else
-        headerA = file_headerA
       end
-      header_size = headerA.size # used for splitting lines
-      headerA.map!{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
-      unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
-        key_mappingH = options[:key_mapping]
-        # do some key mapping on the keys in the file header
-        #   if you want to completely delete a key, then map it to nil or to ''
-        if ! key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
-          headerA.map!{|x| key_mappingH.has_key?(x) ? (key_mappingH[x].nil? ? nil : key_mappingH[x]) : (options[:remove_unmapped_keys] ? nil : x)}
-        end
-      end
-      # header_validations
-      duplicate_headers = []
-      headerA.compact.each do |k|
-        duplicate_headers << k if headerA.select{|x| x == k}.size > 1
-      end
-      raise SmarterCSV::DuplicateHeaders , "ERROR: duplicate headers: #{duplicate_headers.join(',')}" unless duplicate_headers.empty?
-      if options[:required_headers] && options[:required_headers].is_a?(Array)
-        missing_headers = []
-        options[:required_headers].each do |k|
-          missing_headers << k unless headerA.include?(k)
-        end
-        raise SmarterCSV::MissingHeaders , "ERROR: missing headers: #{missing_headers.join(',')}" unless missing_headers.empty?
-      end
+      headerA, header_size = process_headers(fh, options, csv_options)
       # in case we use chunking.. we'll need to set it up..
       if ! options[:chunk_size].nil? && options[:chunk_size].to_i > 0
@@ -119,42 +52,41 @@ module SmarterCSV
       end
       # now on to processing all the rest of the lines in the CSV file:
-      while ! f.eof?    # we can't use f.readlines() here, because this would read the whole file into memory at once, and eof => true
-        line = f.readline  # read one line.. this uses the input_record_separator $INPUT_RECORD_SEPARATOR which we set previously!
+      while ! fh.eof?    # we can't use fh.readlines() here, because this would read the whole file into memory at once, and eof => true
+        line = readline_with_counts(fh, options)
         # replace invalid byte sequence in UTF-8 with question mark to avoid errors
         line = line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
-        file_line_count += 1
-        csv_line_count += 1
-        print "processing file line %10d, csv line %10d\r" % [file_line_count, csv_line_count] if options[:verbose]
-        next  if  line =~ options[:comment_regexp]  # ignore all comment lines if there are any
+        print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if options[:verbose]
+        next if options[:comment_regexp] && line =~ options[:comment_regexp] # ignore all comment lines if there are any
         # cater for the quoted csv data containing the row separator carriage return character
         # in which case the row data will be split across multiple lines (see the sample content in spec/fixtures/carriage_returns_rn.csv)
         # by detecting the existence of an uneven number of quote characters
         multiline = line.count(options[:quote_char])%2 == 1 # should handle quote_char nil
         while line.count(options[:quote_char])%2 == 1 # should handle quote_char nil
-          next_line = f.readline
+          next_line = fh.readline(options[:row_sep])
           next_line = next_line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
           line += next_line
-          file_line_count += 1
+          @file_line_count += 1
         end
-        print "\nline contains uneven number of quote chars so including content through file line %d\n" % file_line_count if options[:verbose] && multiline
+        print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count if options[:verbose] && multiline
-        line.chomp!    # will use $INPUT_RECORD_SEPARATOR which is set to options[:col_sep]
+        line.chomp!(options[:row_sep])
         if (line =~ %r{#{options[:quote_char]}}) and (! options[:force_simple_split])
           dataA = begin
             CSV.parse( line, **csv_options ).flatten.collect!{|x| x.nil? ? '' : x} # to deal with nil values from CSV.parse
           rescue CSV::MalformedCSVError => e
-            raise $!, "#{$!} [SmarterCSV: csv line #{csv_line_count}]", $!.backtrace
+            raise $!, "#{$!} [SmarterCSV: csv line #{@csv_line_count}]", $!.backtrace
           end
         else
-          dataA =  line.split(options[:col_sep], header_size)
+          dataA = line.split(options[:col_sep], header_size)
         end
-####     dataA.map!{|x| x.gsub(%r/#{options[:quote_char]}/,'') }  # this is actually not a good idea as a default
-        dataA.map!{|x| x.strip}  if options[:strip_whitespace]
+        dataA.map!{|x| x.sub(/(#{options[:col_sep]})+\z/, '')} # remove any unwanted trailing col_sep characters at the end
+        dataA.map!{|x| x.strip} if options[:strip_whitespace]
         # if all values are blank, then ignore this line
         # SEE: https://github.com/rails/rails/blob/32015b6f369adc839c4f0955f2d9dce50c0b6123/activesupport/lib/active_support/core_ext/object/blank.rb#L121
@@ -208,7 +140,7 @@ module SmarterCSV
         if use_chunks
           chunk << hash  # append temp result to chunk
-          if chunk.size >= chunk_size || f.eof?   # if chunk is full, or EOF reached
+          if chunk.size >= chunk_size || fh.eof?   # if chunk is full, or EOF reached
             # do something with the chunk
             if block_given?
               yield chunk  # do something with the hashes in the chunk in the block
@@ -249,8 +181,7 @@ module SmarterCSV
         chunk = []  # initialize for next chunk of data
       end
     ensure
-      $INPUT_RECORD_SEPARATOR = old_row_sep   # make sure this stupid global variable is always reset to it's previous value after we're done!
-      f.close if f.respond_to?(:close)
+      fh.close if fh.respond_to?(:close)
     end
     if block_given?
       return chunk_count  # when we do processing through a block we only care how many chunks we processed
@@ -261,14 +192,22 @@ module SmarterCSV
   private
+  def self.readline_with_counts(filehandle, options)
+    line  = filehandle.readline(options[:row_sep])
+    @file_line_count += 1
+    @csv_line_count += 1
+    line
+  end
   def self.default_options
     {
       auto_row_sep_chars: 500,
       chunk_size: nil ,
       col_sep: ',',
-      comment_regexp: /\A#/,
+      comment_regexp: nil, # was: /\A#/,
       convert_values_to_numeric: true,
       downcase_header: true,
+      duplicate_header_suffix: nil,
       file_encoding: 'utf-8',
       force_simple_split: false ,
       force_utf8: false,
@@ -329,11 +268,11 @@ module SmarterCSV
   end
   # raise exception if none is found
-  def self.guess_column_separator(filehandle)
+  def self.guess_column_separator(filehandle, options)
     del = [',', "\t", ';', ':', '|']
     n = Hash.new(0)
     5.times do
-      line = filehandle.readline
+      line = filehandle.readline(options[:row_sep])
       del.each do |d|
         n[d] += line.scan(d).count
       end
@@ -379,4 +318,102 @@ module SmarterCSV
     k,_ = counts.max_by{|_,v| v}
     return k                    # the most frequent one is it
   end
+  def self.process_headers(filehandle, options, csv_options)
+    if options[:headers_in_file]        # extract the header line
+      # process the header line in the CSV file..
+      # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
+      header = readline_with_counts(filehandle, options)
+      header = header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+      header = header.sub(options[:comment_regexp],'') if options[:comment_regexp]
+      header = header.chomp(options[:row_sep])
+      header = header.gsub(options[:strip_chars_from_headers], '') if options[:strip_chars_from_headers]
+      if (header =~ %r{#{options[:quote_char]}}) and (! options[:force_simple_split])
+        file_headerA = begin
+          CSV.parse( header, **csv_options ).flatten.collect!{|x| x.nil? ? '' : x} # to deal with nil values from CSV.parse
+        rescue CSV::MalformedCSVError => e
+          raise $!, "#{$!} [SmarterCSV: csv line #{@csv_line_count}]", $!.backtrace
+        end
+      else
+        file_headerA =  header.split(options[:col_sep])
+      end
+      file_header_size = file_headerA.size # before mapping, which could delete keys
+      file_headerA.map!{|x| x.gsub(%r/#{options[:quote_char]}/,'') }
+      file_headerA.map!{|x| x.strip}  if options[:strip_whitespace]
+      unless options[:keep_original_headers]
+        file_headerA.map!{|x| x.gsub(/\s+|-+/,'_')}
+        file_headerA.map!{|x| x.downcase }   if options[:downcase_header]
+      end
+    else
+      raise SmarterCSV::IncorrectOption , "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers" unless options[:user_provided_headers]
+    end
+    if options[:user_provided_headers] && options[:user_provided_headers].class == Array && ! options[:user_provided_headers].empty?
+      # use user-provided headers
+      headerA = options[:user_provided_headers]
+      if defined?(file_header_size) && ! file_header_size.nil?
+        if headerA.size != file_header_size
+          raise SmarterCSV::HeaderSizeMismatch , "ERROR: :user_provided_headers defines #{headerA.size} headers !=  CSV-file #{input} has #{file_header_size} headers"
+        else
+          # we could print out the mapping of file_headerA to headerA here
+        end
+      end
+    else
+      headerA = file_headerA
+    end
+    # detect duplicate headers and disambiguate
+    headerA = process_duplicate_headers(headerA, options) if options[:duplicate_header_suffix]
+    header_size = headerA.size # used for splitting lines
+    headerA.map!{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
+    unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
+      key_mappingH = options[:key_mapping]
+      # do some key mapping on the keys in the file header
+      #   if you want to completely delete a key, then map it to nil or to ''
+      if ! key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
+        # we can't map keys that are not there
+        missing_keys = key_mappingH.keys - headerA
+        raise(SmarterCSV::KeyMappingError, "missing header(s): #{missing_keys.join(",")}") unless missing_keys.empty?
+        headerA.map!{|x| key_mappingH.has_key?(x) ? (key_mappingH[x].nil? ? nil : key_mappingH[x]) : (options[:remove_unmapped_keys] ? nil : x)}
+      end
+    end
+    # header_validations
+    duplicate_headers = []
+    headerA.compact.each do |k|
+      duplicate_headers << k if headerA.select{|x| x == k}.size > 1
+    end
+    raise SmarterCSV::DuplicateHeaders , "ERROR: duplicate headers: #{duplicate_headers.join(',')}" unless duplicate_headers.empty?
+    if options[:required_headers] && options[:required_headers].is_a?(Array)
+      missing_headers = []
+      options[:required_headers].each do |k|
+        missing_headers << k unless headerA.include?(k)
+      end
+      raise SmarterCSV::MissingHeaders , "ERROR: missing headers: #{missing_headers.join(',')}" unless missing_headers.empty?
+    end
+    [headerA, header_size]
+  end
+  def self.process_duplicate_headers(headers, options)
+    counts = Hash.new(0)
+    result = []
+    headers.each do |key|
+      counts[key] += 1
+      if counts[key] == 1
+        result << key
+      else
+        result << [key, options[:duplicate_header_suffix], counts[key]].join
+      end
+    end
+    result
+  end
 end
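
The diff above stops assigning the global `$INPUT_RECORD_SEPARATOR` and passes `options[:row_sep]` to every `readline` and `chomp!` call instead. A minimal sketch of that pattern, using `StringIO` and made-up data with the `"\cA"` / `"\cB"` separators that appear in the binary-file spec:

```ruby
require 'stringio'

# Hypothetical input: "\cA" separates columns, "\cB" separates rows.
col_sep = "\cA"
row_sep = "\cB"
io = StringIO.new("a\cAb\cBc\cAd\cB")

rows = []
until io.eof?
  line = io.readline(row_sep)  # separator passed per call; the global $/ is untouched
  rows << line.chomp(row_sep).split(col_sep)
end

p rows  # => [["a", "b"], ["c", "d"]]
```

Because the separator is an argument, nothing has to be restored in an `ensure` block and other code that reads lines elsewhere is not affected.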

data/lib/smarter_csv/version.rb CHANGED

@@ -1,3 +1,3 @@
 module SmarterCSV
-  VERSION = "1.4.2"
+  VERSION = "1.5.2"
 end

data/spec/fixtures/additional_separator.csv ADDED

@@ -0,0 +1,6 @@
+col1,col2
+eins,zwei
+uno,dos,
+one,two ,,,
+ichi, ,,,,,
+un

data/spec/fixtures/duplicate_headers.csv CHANGED

@@ -1,3 +1,3 @@
 email,firstname,lastname,email,age
 tom@bla.com,Tom,Sawyer,mike@bla.com,34
-eri@bla.com,Eri Chan,tom@bla.com,21
+eri@bla.com,Eri,Chan,tom@bla.com,21

data/spec/fixtures/hard_sample.csv ADDED

@@ -0,0 +1,2 @@
+Name,Email,Financial Status,Paid at,Fulfillment Status,Fulfilled at,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,Total,Discount Code,Discount Amount,Shipping Method,Created at,Lineitem quantity,Lineitem name,Lineitem price,Lineitem compare at price,Lineitem sku,Lineitem requires shipping,Lineitem taxable,Lineitem fulfillment status,Billing Name,Billing Street,Billing Address1,Billing Address2,Billing Company,Billing City,Billing Zip,Billing Province,Billing Country,Billing Phone,Shipping Name,Shipping Street,Shipping Address1,Shipping Address2,Shipping Company,Shipping City,Shipping Zip,Shipping Province,Shipping Country,Shipping Phone,Notes,Note Attributes,Cancelled at,Payment Method,Payment Reference,Refunded Amount,Vendor, rece,Tags,Risk Level,Source,Lineitem discount,Tax 1 Name,Tax 1 Value,Tax 2 Name,Tax 2 Value,Tax 3 Name,Tax 3 Value,Tax 4 Name,Tax 4 Value,Tax 5 Name,Tax 5 Value,Phone,Receipt Number,Duties,Billing Province Name,Shipping Province Name,Payment ID,Payment Terms Name,Next Payment Due At
+#MR1220817,foo@bar.com,paid,2022-02-08 22:31:28 +0100,unfulfilled,,yes,EUR,144,0,24,144,VIP,119.6,"Livraison Standard GRATUITE, 2-5 jours avec suivi",2022-02-08 22:31:26 +0100,2,Cire Épilation Nacrée,37,,WAX-200-NAC,true,true,pending,French Fry,64 Boulevard Budgié,64 Boulevard Budgié,,,dootdoot’,'49100,,FR,06 12 34 56 78,French Fry,64 Boulevard Budgi,64 Boulevard Budgié,,,dootdoot,'49100,,FR,06 12 34 56 78,,,,Stripe,c23800013619353.2,0,Goober Rég,4331065802905,902,Low,web,0,FR TVA 20%,24,,,,,,,,,3366012111111,,,,,,,

data/spec/smarter_csv/additional_separator_spec.rb ADDED

@@ -0,0 +1,45 @@
+require 'spec_helper'
+fixture_path = 'spec/fixtures'
+describe 'handling of additional trailing column separators' do
+  let(:file) { "#{fixture_path}/additional_separator.csv" }
+  describe '' do
+    let(:data) { SmarterCSV.process(file) }
+    it 'reads all lines' do
+      data.size.should eq 5
+    end
+    it 'reads regular lines' do
+      item = data[0]
+      item[:col1].should == 'eins'
+      item[:col2].should == 'zwei'
+    end
+    it 'strips single trailing col_sep character' do
+      item = data[1]
+      item[:col1].should == 'uno'
+      item[:col2].should == 'dos'
+    end
+    it 'strips multiple trailing col_sep characters' do
+      item = data[2]
+      item[:col1].should == 'one'
+      item[:col2].should == 'two'
+    end
+    it 'strips multiple trailing col_sep chars' do
+      item = data[3]
+      item[:col1].should == 'ichi'
+      item[:col2].should == nil
+    end
+    it 'strips multiple trailing col_sep chars' do
+      item = data[4]
+      item[:col1].should == 'un'
+      item[:col2].should == nil
+    end
+  end
+end
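
The behavior this spec checks comes from the split-then-trim step added in 1.5.0: the line is split into at most `header_size` fields, then any run of trailing separators is removed from each field. A rough standalone sketch follows. `split_row` is a hypothetical helper, and the `Regexp.escape` call is an addition here; the diffed code interpolates `col_sep` into the regexp directly.

```ruby
# Hypothetical helper mirroring the split-then-trim step (not the gem's API).
def split_row(line, col_sep, header_size)
  fields = line.split(col_sep, header_size)  # surplus separators stay in the last field
  fields.map { |x| x.sub(/(#{Regexp.escape(col_sep)})+\z/, '').strip }
end

p split_row('uno,dos,', ',', 2)     # => ["uno", "dos"]
p split_row('one,two ,,,', ',', 2)  # => ["one", "two"]
p split_row('ichi, ,,,,,', ',', 2)  # => ["ichi", ""]
p split_row('un', ',', 2)           # => ["un"]
```

The empty string in the third result corresponds to the `nil` the spec expects for `:col2`, presumably because the gem drops blank values in a later step that is not part of this sketch.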

data/spec/smarter_csv/binary_file2_spec.rb CHANGED

@@ -12,7 +12,7 @@ describe 'be_able_to' do
   it 'loads_binary_file_with_strings_as_keys' do
     options = {:col_sep => "\cA", :row_sep => "\cB", :comment_regexp => /^#/, :strings_as_keys => true}
     data = SmarterCSV.process("#{fixture_path}/binary.csv", options)
-    data.flatten.size.should == 8
+    data.size.should == 8
     data.each do |item|
       # all keys should be strings
       item.keys.each{|x| x.class.should be == String}

data/spec/smarter_csv/duplicate_headers_spec.rb ADDED

@@ -0,0 +1,76 @@
+require 'spec_helper'
+fixture_path = 'spec/fixtures'
+describe 'duplicate headers' do
+  describe 'without special handling / default behavior' do
+    it 'raises error on duplicate headers' do
+      expect {
+        SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", {})
+      }.to raise_exception(SmarterCSV::DuplicateHeaders)
+    end
+    it 'raises error on duplicate given headers' do
+      expect {
+        options = {:user_provided_headers => [:a,:b,:c,:d,:a]}
+        SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
+      }.to raise_exception(SmarterCSV::DuplicateHeaders)
+    end
+    it 'raises error on missing mapped headers and includes missing headers in message' do
+      expect {
+        # the mapping is right, but the underlying csv file is bad
+        options = {:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :manager_email => :d, :age => :e} }
+        SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
+      }.to raise_exception(SmarterCSV::KeyMappingError, "missing header(s): manager_email")
+    end
+  end
+  describe 'with special handling' do
+    context 'with given suffix' do
+      let(:options) { {duplicate_header_suffix: '_'} }
+      it 'reads whole file' do
+        data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
+        expect(data.size).to eq 2
+      end
+      it 'generates the correct keys' do
+        data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
+        expect(data.first.keys).to eq [:email, :firstname, :lastname, :email_2, :age]
+      end
+      it 'enumerates when duplicate headers are given' do
+        options.merge!({:user_provided_headers => [:a,:b,:c,:a,:a]})
+        data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
+        expect(data.first.keys).to eq [:a, :b, :c, :a_2, :a_3]
+      end
+      it 'can remap duplicated headers' do
+        options.merge!({:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :email_2 => :d, :age => :e}})
+        data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
+        expect(data.first).to eq({a: 'tom@bla.com', b: 'Tom', c: 'Sawyer', d: 'mike@bla.com', e: 34})
+      end
+    end
+    context 'with empty suffix' do
+      let(:options) { {duplicate_header_suffix: ''} }
+      it 'reads whole file' do
+        data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
+        expect(data.size).to eq 2
+      end
+      it 'generates the correct keys' do
+        data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
+        expect(data.first.keys).to eq [:email, :firstname, :lastname, :email2, :age]
+      end
+      it 'enumerates when duplicate headers are given' do
+        options.merge!({:user_provided_headers => [:a,:b,:c,:a,:a]})
+        data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
+        expect(data.first.keys).to eq [:a, :b, :c, :a2, :a3]
+      end
+    end
+  end
+end
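
Reviewer note: the key names expected by the new duplicate_headers_spec.rb (`:email_2` with suffix `'_'`, `:a2`/`:a3` with an empty suffix) can be reproduced with a small stand-alone sketch. `disambiguate_headers` is a hypothetical helper written for illustration only; it is not taken from smarter_csv.rb and may differ from how the gem actually does it.

```ruby
# Hypothetical helper (not part of smarter_csv): the first occurrence of a
# header keeps its name, later occurrences get the suffix plus a running count.
def disambiguate_headers(headers, suffix)
  counts = Hash.new(0)
  headers.map do |header|
    counts[header] += 1
    counts[header] == 1 ? header : :"#{header}#{suffix}#{counts[header]}"
  end
end

p disambiguate_headers([:email, :firstname, :lastname, :email, :age], '_')
# => [:email, :firstname, :lastname, :email_2, :age]
p disambiguate_headers([:a, :b, :c, :a, :a], '')
# => [:a, :b, :c, :a2, :a3]
```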

data/spec/smarter_csv/hard_sample_spec.rb ADDED

@@ -0,0 +1,24 @@
+require 'spec_helper'
+fixture_path = 'spec/fixtures'
+describe 'can handle the difficult CSV file' do
+  it 'loads the data with default values' do
+    data = SmarterCSV.process("#{fixture_path}/hard_sample.csv")
+    data.size.should eq 1
+    item = data.first
+    item.keys.count.should == 48
+    item[:name].should == '#MR1220817'
+    item[:shipping_method].should == 'Livraison Standard GRATUITE, 2-5 jours avec suivi'
+    item[:lineitem_name].should == 'Cire Épilation Nacrée'
+    item[:phone].should == 3366012111111
+  end
+  # the main problem is the data line starting with a # character, but not being a comment
+  it 'fails to load the CSV file with incorrectly set comment_regexp' do
+    options = {comment_regexp: /\A#/ }
+    data = SmarterCSV.process("#{fixture_path}/hard_sample.csv", options)
+    data.size.should eq 0
+  end
+end
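
Reviewer note: the second example above passes only because a `comment_regexp` of `/\A#/` also matches a real data line that happens to start with `#`. A minimal sketch of that effect follows, assuming comment filtering is a plain per-line regexp match. `drop_comment_lines` and the sample row are made up for illustration and are not the gem's code or the fixture's contents.

```ruby
# Hypothetical line filter (not smarter_csv's implementation): with no regexp
# every line is kept; with a regexp every matching line is dropped.
def drop_comment_lines(lines, comment_regexp = nil)
  return lines if comment_regexp.nil?

  lines.reject { |line| line.match?(comment_regexp) }
end

# Made-up data row modeled on the values the spec checks.
rows = ['#MR1220817,Livraison Standard GRATUITE']

puts drop_comment_lines(rows).size          # row kept: 1
puts drop_comment_lines(rows, /\A#/).size   # row mistaken for a comment: 0
```

The same reasoning explains the record counts in ignore_comments_spec.rb: without `comment_regexp` the `#` lines are counted as data, with it they are skipped.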

data/spec/smarter_csv/ignore_comments_spec.rb CHANGED

@@ -1,30 +1,45 @@
-require 'spec_helper'
-fixture_path = 'spec/fixtures'
-describe 'be_able_to' do
-  it 'ignore comments in CSV files' do
-    options = {}
-    data = SmarterCSV.process("#{fixture_path}/ignore_comments.csv", options)
-    data.size.should eq 5
-    # all the keys should be symbols
-    data.each{|item| item.keys.each{|x| x.is_a?(Symbol).should be_truthy}}
-    data.each do |h|
-      h.keys.each do |key|
-        [:"not_a_comment#first_name", :last_name, :dogs, :cats, :birds, :fish].should include( key )
-      end
-    end
-  end
-  it 'ignore comments in CSV files with CRLF' do
-    options = {row_sep: "\r\n"}
-    data = SmarterCSV.process("#{fixture_path}/ignore_comments2.csv", options)
-    # all the keys should be symbols
-    data.size.should eq 1
-    data.first[:h1].should eq 'a'
-    data.first[:h2].should eq "b\r\n#c"
-  end
-end
+require 'spec_helper'
+fixture_path = 'spec/fixtures'
+describe 'be_able_to' do
+  it 'by default does not ignore comments in CSV files' do
+    options = {}
+    data = SmarterCSV.process("#{fixture_path}/ignore_comments.csv", options)
+    data.size.should eq 8
+    # all the keys should be symbols
+    data.each{|item| item.keys.each{|x| x.is_a?(Symbol).should be_truthy}}
+    data.each do |h|
+      h.keys.each do |key|
+        [:"not_a_comment#first_name", :last_name, :dogs, :cats, :birds, :fish].should include( key )
+      end
+    end
+  end
+  it 'ignore comments in CSV files using comment_regexp' do
+    options = {comment_regexp: /\A#/}
+    data = SmarterCSV.process("#{fixture_path}/ignore_comments.csv", options)
+    data.size.should eq 5
+    # all the keys should be symbols
+    data.each{|item| item.keys.each{|x| x.is_a?(Symbol).should be_truthy}}
+    data.each do |h|
+      h.keys.each do |key|
+        [:"not_a_comment#first_name", :last_name, :dogs, :cats, :birds, :fish].should include( key )
+      end
+    end
+  end
+  it 'ignore comments in CSV files with CRLF' do
+    options = {row_sep: "\r\n"}
+    data = SmarterCSV.process("#{fixture_path}/ignore_comments2.csv", options)
+    # all the keys should be symbols
+    data.size.should eq 1
+    data.first[:h1].should eq 'a'
+    data.first[:h2].should eq "b\r\n#c"
+  end
+end

data/spec/smarter_csv/invalid_headers_spec.rb CHANGED

@@ -3,28 +3,6 @@ require 'spec_helper'
 fixture_path = 'spec/fixtures'
 describe 'test exceptions for invalid headers' do
-  it 'raises error on duplicate headers' do
-    expect {
-      SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", {})
-    }.to raise_exception(SmarterCSV::DuplicateHeaders)
-  end
-  it 'raises error on duplicate given headers' do
-    expect {
-      options = {:user_provided_headers => [:a,:b,:c,:d,:a]}
-      SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
-    }.to raise_exception(SmarterCSV::DuplicateHeaders)
-  end
-  it 'raises error on duplicate mapped headers' do
-    expect {
-      # the mapping is right, but the underlying csv file is bad
-      options = {:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :manager_email => :d, :age => :e} }
-      SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
-    }.to raise_exception(SmarterCSV::DuplicateHeaders)
-  end
   it 'does not raise an error if no required headers are given' do
     options = {:required_headers => nil} # order does not matter
     data = SmarterCSV.process("#{fixture_path}/user_import.csv", options)
@@ -49,4 +27,12 @@ describe 'test exceptions for invalid headers' do
       SmarterCSV.process("#{fixture_path}/user_import.csv", options)
     }.to raise_exception(SmarterCSV::MissingHeaders)
   end
+  it 'raises error on missing mapped headers and includes missing headers in message' do
+    expect {
+      # :age does not exist in the CSV header
+      options = {:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :manager_email => :d, :age => :e} }
+      SmarterCSV.process("#{fixture_path}/user_import.csv", options)
+    }.to raise_exception(SmarterCSV::KeyMappingError, "missing header(s): age")
+  end
 end
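
Reviewer note: the new example expects the exception message to name the mapped keys that are absent from the CSV header. A rough sketch of such a check is below. The `KeyMappingError` class here is a local stand-in, `apply_key_mapping` is a hypothetical helper, the header list is assumed rather than read from user_import.csv, and the comma used to join several missing keys is a guess; only the single-key message `missing header(s): age` is confirmed by the spec.

```ruby
# Local stand-in for SmarterCSV::KeyMappingError, for illustration only.
class KeyMappingError < StandardError; end

# Hypothetical helper: fail if the mapping names headers that are not present,
# otherwise rename the headers according to the mapping.
def apply_key_mapping(headers, key_mapping)
  missing = key_mapping.keys - headers
  unless missing.empty?
    # Joining with "," is an assumption; the spec only covers one missing key.
    raise KeyMappingError, "missing header(s): #{missing.join(',')}"
  end

  headers.map { |header| key_mapping.fetch(header, header) }
end

headers = [:email, :firstname, :lastname, :manager_email] # assumed header row
mapping = { email: :a, firstname: :b, lastname: :c, manager_email: :d, age: :e }

begin
  apply_key_mapping(headers, mapping)
rescue KeyMappingError => e
  puts e.message # missing header(s): age
end
```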

data/spec/smarter_csv/no_header_spec.rb CHANGED

@@ -2,23 +2,28 @@ require 'spec_helper'
 fixture_path = 'spec/fixtures'
-describe 'be_able_to' do
-  it 'loads_csv_file_without_header' do
-    options = {:headers_in_file => false, :user_provided_headers => [:a,:b,:c,:d,:e,:f]}
-    data = SmarterCSV.process("#{fixture_path}/no_header.csv", options)
+describe 'no header in file' do
+  let(:headers) { [:a,:b,:c,:d,:e,:f] }
+  let(:options) { {:headers_in_file => false, :user_provided_headers => headers} }
+  subject(:data) { SmarterCSV.process("#{fixture_path}/no_header.csv", options) }
+  it 'load the correct number of records' do
     data.size.should == 5
-    # all the keys should be symbols
-    data.each{|item| item.keys.each{|x| x.class.should be == Symbol}}
+  end
-    data.each do |item|
+  it 'uses given symbols for all records' do
+    data.each do |item|
       item.keys.each do |key|
         [:a,:b,:c,:d,:e,:f].should include( key )
       end
     end
-    data.each do |h|
-      h.size.should <= 6
-    end
   end
+  it 'loads the correct data' do
+    data[0].should == {a: "Dan", b: "McAllister", c: 2, d: 0}
+    data[1].should == {a: "Lucy", b: "Laweless", d: 5, e: 0}
+    data[2].should == {a: "Miles", b: "O'Brian", c: 0, d: 0, e: 0, f: 21}
+    data[3].should == {a: "Nancy", b: "Homes", c: 2, d: 0, e: 1}
+    data[4].should == {a: "Hernán", b: "Curaçon", c: 3, d: 0, e: 0}
+  end
 end
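
Reviewer note: the expected hashes in 'loads the correct data' omit keys whose values are blank or absent (for example no `:c` in the second record) and hold integers where the file has digits. A simplified sketch of that row-to-hash step follows. `row_to_hash` is a hypothetical helper, the raw lines are assumed since no_header.csv is not part of this diff, and quoting is ignored entirely.

```ruby
# Hypothetical helper (not the gem's code): pair user-provided headers with the
# values of one line, drop blank or missing values, convert digit-only strings.
def row_to_hash(headers, line)
  values = line.split(',', -1) # -1 keeps trailing empty fields
  pairs = headers.zip(values).reject { |_, value| value.nil? || value.strip.empty? }
  pairs.to_h { |key, value| [key, value.match?(/\A\d+\z/) ? value.to_i : value] }
end

headers = [:a, :b, :c, :d, :e, :f]

# Assumed raw lines; the real fixture may be laid out differently.
p row_to_hash(headers, 'Lucy,Laweless,,5,0')
# => {:a=>"Lucy", :b=>"Laweless", :d=>5, :e=>0}
p row_to_hash(headers, "Miles,O'Brian,0,0,0,21")
# => {:a=>"Miles", :b=>"O'Brian", :c=>0, :d=>0, :e=>0, :f=>21}
```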

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.4.2
+  version: 1.5.2
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2022-02-15 00:00:00.000000000 Z
+date: 2022-04-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec
@@ -62,6 +62,7 @@ files:
 - lib/smarter_csv/smarter_csv.rb
 - lib/smarter_csv/version.rb
 - smarter_csv.gemspec
+- spec/fixtures/additional_separator.csv
 - spec/fixtures/basic.csv
 - spec/fixtures/binary.csv
 - spec/fixtures/carriage_returns_n.csv
@@ -73,6 +74,7 @@ files:
 - spec/fixtures/empty.csv
 - spec/fixtures/empty_columns_1.csv
 - spec/fixtures/empty_columns_2.csv
+- spec/fixtures/hard_sample.csv
 - spec/fixtures/ignore_comments.csv
 - spec/fixtures/ignore_comments2.csv
 - spec/fixtures/key_mapping.csv
@@ -101,6 +103,7 @@ files:
 - spec/fixtures/valid_unicode.csv
 - spec/fixtures/with_dashes.csv
 - spec/fixtures/with_dates.csv
+- spec/smarter_csv/additional_separator_spec.rb
 - spec/smarter_csv/binary_file2_spec.rb
 - spec/smarter_csv/binary_file_spec.rb
 - spec/smarter_csv/blank_spec.rb
@@ -109,8 +112,10 @@ files:
 - spec/smarter_csv/close_file_spec.rb
 - spec/smarter_csv/column_separator_spec.rb
 - spec/smarter_csv/convert_values_to_numeric_spec.rb
+- spec/smarter_csv/duplicate_headers_spec.rb
 - spec/smarter_csv/empty_columns_spec.rb
 - spec/smarter_csv/extenstions_spec.rb
+- spec/smarter_csv/hard_sample_spec.rb
 - spec/smarter_csv/header_transformation_spec.rb
 - spec/smarter_csv/ignore_comments_spec.rb
 - spec/smarter_csv/invalid_headers_spec.rb
@@ -164,6 +169,7 @@ specification_version: 4
 summary: Ruby Gem for smarter importing of CSV Files (and CSV-like files), with lots
   of optional features, e.g. chunked processing for huge CSV files
 test_files:
+- spec/fixtures/additional_separator.csv
 - spec/fixtures/basic.csv
 - spec/fixtures/binary.csv
 - spec/fixtures/carriage_returns_n.csv
@@ -175,6 +181,7 @@ test_files:
 - spec/fixtures/empty.csv
 - spec/fixtures/empty_columns_1.csv
 - spec/fixtures/empty_columns_2.csv
+- spec/fixtures/hard_sample.csv
 - spec/fixtures/ignore_comments.csv
 - spec/fixtures/ignore_comments2.csv
 - spec/fixtures/key_mapping.csv
@@ -203,6 +210,7 @@ test_files:
 - spec/fixtures/valid_unicode.csv
 - spec/fixtures/with_dashes.csv
 - spec/fixtures/with_dates.csv
+- spec/smarter_csv/additional_separator_spec.rb
 - spec/smarter_csv/binary_file2_spec.rb
 - spec/smarter_csv/binary_file_spec.rb
 - spec/smarter_csv/blank_spec.rb
@@ -211,8 +219,10 @@ test_files:
 - spec/smarter_csv/close_file_spec.rb
 - spec/smarter_csv/column_separator_spec.rb
 - spec/smarter_csv/convert_values_to_numeric_spec.rb
+- spec/smarter_csv/duplicate_headers_spec.rb
 - spec/smarter_csv/empty_columns_spec.rb
 - spec/smarter_csv/extenstions_spec.rb
+- spec/smarter_csv/hard_sample_spec.rb
 - spec/smarter_csv/header_transformation_spec.rb
 - spec/smarter_csv/ignore_comments_spec.rb
 - spec/smarter_csv/invalid_headers_spec.rb