RubyGems - smarter_csv - Versions diffs - 1.7.3 → 1.8.0 - Mend

smarter_csv 1.7.3 → 1.8.0

Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: d4046758f38c21262fdec6bc7e13e3a7811c7aee3944d92e0cc36a2a1cfb032a
-  data.tar.gz: 9d111e2f36171ca488034f3af73fc71c7c9f6fde73986d277aeaf1560a066fa2
+  metadata.gz: 55400b3977ce35c58d60c4101362b68d99f2dbf7cb6a63956ae3b6ab79fcf1ac
+  data.tar.gz: 41f46d3e4de69a7924ecd2214ba4e37766106469d1b8b257fd752a96204a47fd
 SHA512:
-  metadata.gz: c46c5c45dd3fafe66735b2b17b0679c5aaff27b3670140d97bc19e1c825ad91310fa2cf55a12a5c7b0c31ef82fe9cc12a2c4bda0a78b218d80ad5816c01c0d9f
-  data.tar.gz: ba03acd95955f8afeb8e96f16c7cfa2e1605dbaf6fddb7008930294aab83196aed21f57605efb3553799381c1c4811528eee2db221efa50dc82f58bcf9135842
+  metadata.gz: 24ecc14cf9c65efe5c11e4bd20753420aa8ccd7385171cd21eac2e1be92c4896087cdc2a18799fa111c0f36154ad4481daed7f08b752f4fae2b5f27241b8cf6c
+  data.tar.gz: c1d70e18a7ae8057e58cbf73b62f4896dd7030bc5fd2e927669e5ea829f9a3c11daeb9c8b83296dbb46e6f0d23034245b7207882a77b54cc1ca128a581175359

data/CHANGELOG.md CHANGED

@@ -1,6 +1,13 @@
 # SmarterCSV 1.x Change Log
+## 1.8.0 (2023-03-18)
+  * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+  * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
+## 1.7.4 (2023-01-13)
+  * improved guessing of the column separator, thanks to Alessandro Fazzi
 ## 1.7.3 (2022-12-05)
   * new option :silence_missing_keys; if set to true, it ignores missing keys in `key_mapping`
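
The 1.8.0 entry makes `row_sep: :auto` the default. As a rough illustration of what automatic row-separator detection involves, here is a minimal standalone sketch. It is not SmarterCSV's implementation: the helper name `guess_row_sep` and its `max_chars` keyword are hypothetical, chosen to echo the `:auto_row_sep_chars` option, and the sketch simply counts candidate line endings in the first N characters.

```ruby
# Hypothetical helper, not part of SmarterCSV: guess the row separator
# from the first max_chars characters of a string.
def guess_row_sep(text, max_chars: 500)
  # nil or 0 means "look at the whole input".
  sample = max_chars.to_i > 0 ? text[0, max_chars] : text
  counts = {
    "\r\n" => sample.scan("\r\n").count,
    "\n"   => sample.scan(/(?<!\r)\n/).count, # bare LF only
    "\r"   => sample.scan(/\r(?!\n)/).count   # bare CR only
  }
  sep, hits = counts.max_by { |_sep, n| n }
  hits.zero? ? nil : sep # nil when no line ending was seen at all
end

puts guess_row_sep("some_id,type\r\n42,zizzles\r\n").inspect
puts guess_row_sep("some_id,type\n42,zizzles\n").inspect
```

Callers who depend on the pre-1.8.0 behaviour can presumably still pass `col_sep: ','` and an explicit `row_sep` in the options, since only the defaults changed.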

data/CONTRIBUTORS.md CHANGED

@@ -49,3 +49,4 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
  * [Nicolas Rodriguez](https://github.com/n-rodriguez)
  * [Hirotaka Mizutani ](https://github.com/hirotaka)
  * [Rahul Chaudhary](https://github.com/rahulch95)
+ * [Alessandro Fazzi](https://github.com/pioneerskies)

data/README.md CHANGED

@@ -55,10 +55,23 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
  * calling `process` with or without a block
  * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
-Tip: If you are uncertain about what line endings a CSV-file uses, try specifying `:row_sep => :auto` as part of the options.
-But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
-To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
-You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
+By default (since version 1.8.0), detection of the column and row separators is set to automatic `row_sep: :auto`, `col_sep: :auto`. This should make it easier to process any CSV files without having to examine the line endings or column separators.
+You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters); nil or 0 will check the whole file).
+You can also set the `:row_sep` manually! Checkout Example 4 for unusual `:row_sep` and `:col_sep`.
+### Troubleshooting
+In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection  a tool like `hexdump` can help find otherwise hidden control character or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
+```
+$ hexdump -C spec/fixtures/bom_test_feff.csv
+00000000  fe ff 73 6f 6d 65 5f 69  64 2c 74 79 70 65 2c 66  |..some_id,type,f|
+00000010  75 7a 7a 62 6f 78 65 73  0d 0a 34 32 37 36 36 38  |uzzboxes..427668|
+00000020  30 35 2c 7a 69 7a 7a 6c  65 73 2c 31 32 33 34 0d  |05,zizzles,1234.|
+00000030  0a 33 38 37 35 39 31 35  30 2c 71 75 69 7a 7a 65  |.38759150,quizze|
+00000040  73 2c 35 36 37 38 0d 0a                           |s,5678..|
+```
 #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
@@ -222,10 +235,10 @@ The options and the block are optional.
      | :skip_lines                 |   nil    | how many lines to skip before the first line or header line is processed             |
      | :comment_regexp             |   nil    | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/       |
      ---------------------------------------------------------------------------------------------------------------------------------
-     | :col_sep                    |   ','    | column separator, can be set to :auto                                                |
+     | :col_sep                    |   :auto   | column separator (default was ',')                                           |
      | :force_simple_split         |   false  | force simple splitting on :col_sep character for non-standard CSV-files.             |
      |                             |          | e.g. when :quote_char is not properly escaped                                        |
-     | :row_sep                    | $/ ,"\n" | row separator or record separator , defaults to system's $/ , which defaults to "\n" |
+     | :row_sep                    |  :auto   | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
      |                             |          | This can also be set to :auto, but will process the whole cvs file first  (slow!)    |
      | :auto_row_sep_chars         |   500    | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
      | :quote_char                 |   '"'    | quotation character                                                                  |
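
The new Troubleshooting section suggests `hexdump` for spotting a BOM. Where that tool is not at hand, the same check can be sketched in plain Ruby. This is only an illustration: `detect_bom` and the `BOMS` table are hypothetical names, not part of SmarterCSV, and the byte sequences are the five that the 1.8.0 `remove_bom` method in this diff recognizes.

```ruby
require 'tempfile'

# Known BOM byte sequences. The 4-byte entries come first so that
# UTF-32LE (ff fe 00 00) is not mistaken for UTF-16LE (ff fe).
BOMS = {
  'UTF-32BE' => [0x00, 0x00, 0xFE, 0xFF],
  'UTF-32LE' => [0xFF, 0xFE, 0x00, 0x00],
  'UTF-8'    => [0xEF, 0xBB, 0xBF],
  'UTF-16BE' => [0xFE, 0xFF],
  'UTF-16LE' => [0xFF, 0xFE]
}.freeze

# Hypothetical helper: returns the name of the BOM that the given
# binary string starts with, or nil if there is none.
def detect_bom(data)
  head = data.bytes.first(4)
  BOMS.each do |name, bytes|
    return name if head.first(bytes.size) == bytes
  end
  nil
end

# Demo: write a small file that starts with fe ff, like the fixture above,
# then inspect only its first four bytes.
Tempfile.create(['bom_demo', '.csv']) do |f|
  f.binmode
  f.write("\xFE\xFFsome_id,type,fuzzboxes\r\n".b)
  f.flush
  puts detect_bom(File.binread(f.path, 4) || '').inspect
end
```

Reading just four bytes with `File.binread` keeps the check cheap even for large files.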

data/Rakefile CHANGED

@@ -3,6 +3,17 @@
 require "bundler/gem_tasks"
 require 'rspec/core/rake_task'
+# temp fix for NoMethodError: undefined method `last_comment'
+# remove when fixed in Rake 11.x and higher
+module TempFixForRakeLastComment
+  def last_comment
+    last_description
+  end
+end
+Rake::Application.send :include, TempFixForRakeLastComment
+### end of tempfix
 RSpec::Core::RakeTask.new(:spec)
 require "rubocop/rake_task"

data/lib/smarter_csv/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module SmarterCSV
-  VERSION = "1.7.3"
+  VERSION = "1.8.0"
 end

data/lib/smarter_csv.rb CHANGED

@@ -3,8 +3,8 @@
 require_relative "extensions/hash"
 require_relative "smarter_csv/version"
-require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
-# require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
+# require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
 module SmarterCSV
   class SmarterCSVException < StandardError; end
@@ -39,11 +39,7 @@ module SmarterCSV
         puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
       end
-      if options[:skip_lines].to_i > 0
-        options[:skip_lines].to_i.times do
-          readline_with_counts(fh, options)
-        end
-      end
+      skip_lines(fh, options)
       headerA, header_size = process_headers(fh, options)
@@ -207,7 +203,7 @@ module SmarterCSV
         acceleration: true,
         auto_row_sep_chars: 500,
         chunk_size: nil,
-        col_sep: ',',
+        col_sep: :auto, # was: ',',
         comment_regexp: nil, # was: /\A#/,
         convert_values_to_numeric: true,
         downcase_header: true,
@@ -226,7 +222,7 @@ module SmarterCSV
         remove_values_matching: nil,
         remove_zero_values: false,
         required_headers: nil,
-        row_sep: $/,
+        row_sep: :auto, # was: $/,
         silence_missing_keys: false,
         skip_lines: nil,
         strings_as_keys: false,
@@ -243,9 +239,24 @@ module SmarterCSV
       line = filehandle.readline(options[:row_sep])
       @file_line_count += 1
       @csv_line_count += 1
+      line = remove_bom(line) if @csv_line_count == 1
       line
     end
+    def skip_lines(filehandle, options)
+      return unless options[:skip_lines].to_i > 0
+      options[:skip_lines].to_i.times do
+        readline_with_counts(filehandle, options)
+      end
+    end
+    def rewind(filehandle)
+      @file_line_count = 0
+      @csv_line_count = 0
+      filehandle.rewind
+    end
     ###
     ### Thin wrapper around C-extension
     ###
@@ -374,24 +385,23 @@ module SmarterCSV
       return false
     end
-    # raise exception if none is found
+    # If file has headers, then guesses column separator from headers.
+    # Otherwise guesses column separator from contents.
+    # Raises exception if none is found.
     def guess_column_separator(filehandle, options)
-      del = [',', "\t", ';', ':', '|']
-      n = Hash.new(0)
+      skip_lines(filehandle, options)
-      5.times do
-        line = filehandle.readline(options[:row_sep])
-        del.each do |d|
-          n[d] += line.scan(d).count
-        end
-      rescue EOFError # short files
-        break
-      end
+      possible_delimiters = [',', "\t", ';', ':', '|']
-      filehandle.rewind
-      raise SmarterCSV::NoColSepDetected if n.values.max == 0
+      candidates = if options.fetch(:headers_in_file)
+                     candidated_column_separators_from_headers(filehandle, options, possible_delimiters)
+                   else
+                     candidated_column_separators_from_contents(filehandle, options, possible_delimiters)
+                   end
+      raise SmarterCSV::NoColSepDetected if candidates.values.max == 0
-      col_sep = n.key(n.values.max)
+      candidates.key(candidates.values.max)
     end
     # limitation: this currently reads the whole file in before making a decision
@@ -420,7 +430,7 @@ module SmarterCSV
         lines += 1
         break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
       end
-      filehandle.rewind
+      rewind(filehandle)
       counts["\r"] += 1 if last_char == "\r"
       # find the most frequent key/value pair:
@@ -476,13 +486,13 @@ module SmarterCSV
       unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
         key_mappingH = options[:key_mapping]
         # do some key mapping on the keys in the file header
         #   if you want to completely delete a key, then map it to nil or to ''
         if !key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
           unless options[:silence_missing_keys]
             # if silence_missing_keys are not set, raise error if missing header
             missing_keys = key_mappingH.keys - headerA
             puts "WARNING: missing header(s): #{missing_keys.join(",")}" unless missing_keys.empty?
           end
@@ -525,5 +535,56 @@ module SmarterCSV
       end
       result
     end
+    private
+    UTF_32_BOM = %w[0 0 fe ff].freeze
+    UTF_32LE_BOM = %w[ff fe 0 0].freeze
+    UTF_8_BOM = %w[ef bb bf].freeze
+    UTF_16_BOM = %w[fe ff].freeze
+    UTF_16LE_BOM = %w[ff fe].freeze
+    def remove_bom(str)
+      str_as_hex = str.bytes.map{|x| x.to_s(16)}
+      # if string does not start with one of the bytes above, there is no BOM
+      return str unless %w[ef fe ff 0].include?(str_as_hex[0])
+      return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
+      return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
+      return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
+      puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
+      str
+    end
+    def candidated_column_separators_from_headers(filehandle, options, delimiters)
+      candidates = Hash.new(0)
+      line = readline_with_counts(filehandle, options.slice(:row_sep))
+      delimiters.each do |d|
+        candidates[d] += line.scan(d).count
+      end
+      rewind(filehandle)
+      candidates
+    end
+    def candidated_column_separators_from_contents(filehandle, options, delimiters)
+      candidates = Hash.new(0)
+      5.times do
+        line = readline_with_counts(filehandle, options.slice(:row_sep))
+        delimiters.each do |d|
+          candidates[d] += line.scan(d).count
+        end
+      rescue EOFError # short files
+        break
+      end
+      rewind(filehandle)
+      candidates
+    end
   end
 end
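
The reworked `guess_column_separator` counts delimiter candidates in the header line when the file has headers, and in the first five lines otherwise. The following standalone sketch reproduces that counting on a `StringIO` to show how the two strategies can disagree. The helper `guess_col_sep` and its `lines:` keyword are hypothetical simplifications, not the gem's API, and the sample data is made up for illustration.

```ruby
require 'stringio'

POSSIBLE_DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Hypothetical standalone helper: count each candidate delimiter in the
# first `lines` lines and return the most frequent one.
def guess_col_sep(io, row_sep: "\n", lines: 1)
  counts = Hash.new(0)
  lines.times do
    line = io.gets(row_sep)
    break if line.nil? # short input
    POSSIBLE_DELIMITERS.each { |d| counts[d] += line.scan(d).count }
  end
  io.rewind
  raise ArgumentError, 'no column separator detected' if counts.empty? || counts.values.max.zero?
  counts.key(counts.values.max)
end

# Time values contain more colons than the rows contain semicolons.
csv = StringIO.new("id;start;end\n1;10:15:00;11:30:00\n2;12:00:00;13:45:00\n")

puts guess_col_sep(csv).inspect           # header line only  → ";"
puts guess_col_sep(csv, lines: 5).inspect # header + contents → ":"
```

In this made-up input the header-only count picks `;`, while counting across the data rows is swayed by the colons in the time values. That may be one motivation for preferring the header line when it exists, though the diff itself does not state the reasoning.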

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.7.3
+  version: 1.8.0
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2022-12-09 00:00:00.000000000 Z
+date: 2023-03-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print