RubyGems - smarter_csv - Versions diffs - 1.1.4 → 1.1.5 - Mend

smarter_csv 1.1.4 → 1.1.5


Files changed (7)

checksums.yaml +4 -4
data/.travis.yml +12 -22
data/Gemfile +3 -2
data/README.md +19 -9
data/lib/smarter_csv/smarter_csv.rb +11 -4
data/lib/smarter_csv/version.rb +1 -1
metadata +9 -7

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: cb1216b85e197c77005a95ab4c3bc46896b7719f
-  data.tar.gz: 7539e858a39825e5fac1dc27e51f53f1e3f20c2c
+  metadata.gz: 042aadb2bc5426a07a64f09e781bccbd728e8052
+  data.tar.gz: ba48c2e303591d4027e05d1208c225381d362857
 SHA512:
-  metadata.gz: 2669d2f524e138bdcd8e9ded254a8dee1996589d56eabe4ac2f4480be7ebc88c1360600d26e109d3eba7e1e91075aa52629663ef4fd32489a7fd7e809f8b587c
-  data.tar.gz: 39cf42229ab96f15e860472ea138e04ba18c6c717e0860eaa87e9fd2e0c8ca516a8070ee6333f96a6cbc8a40662b509fef492eaefc72b95ecb1ccf5d8c1b1faa
+  metadata.gz: 58cb92edabb46bdcb48598d4b4b02b5f0f09cc63378e818ac672daf8d722b5fbf1b246df5db262dff306e87943a2bb2bebbb753944adc5449b19cd5a1475c00b
+  data.tar.gz: 31fe30f2b2027274a5252c55b234b120327be6ce652f7ad71232bd8a920e33d30cbae42577fff398d0a61574dc17b3016cd3fa1d520eec3dd4636569cc62860e

data/.travis.yml CHANGED

@@ -1,29 +1,19 @@
 language: ruby
 bundler_args: --without development
-rvm:
-  - 1.8.7
-  - 1.9.2
-  - 1.9.3
-  - 2.0.0
-  - 2.1.3
-  - 2.2.2
-  - jruby
-  - ruby-head
-  - jruby-head
-  - ree
-  - rbx
-# jdk:
-#   - oraclejdk7
-#   - openjdk7
-env: JRUBY_OPTS="--server -Xcompile.invokedynamic=false -J-XX:+TieredCompilation -J-XX:TieredStopAtLevel=1 -J-noverify -J-Xms512m -J-Xmx1024m"
+before_install:
+  - gem install bundler
+  - gem update --system
 matrix:
-  allow_failures:
-    - rbx
-    - rvm: jruby-head
+  include:
+    - rvm: 2.2.8
+    - rvm: 2.3.5
+    - rvm: 2.4.2
+    - rvm: jruby-9.1.13.0
+      env:
+        - JRUBY_OPTS="--server -Xcompile.invokedynamic=false -J-XX:+TieredCompilation -J-XX:TieredStopAtLevel=1 -J-noverify -J-Xms512m -J-Xmx1024m"
     - rvm: ruby-head
-    - rvm: ree
-    - rvm: 1.8.7
-    - rvm: jruby-18mode
 branches:
   only:
     - master

data/Gemfile CHANGED

@@ -4,8 +4,9 @@ source 'https://rubygems.org'
 gemspec
-gem "rake"
+gem "rake", "< 11"
+gem 'pry'
 group :test do
-  gem "rspec", "~> 2.14"
+  gem "rspec", "~> 2.99"
 end

data/README.md CHANGED

@@ -1,6 +1,6 @@
 # SmarterCSV
-[![Build Status](https://secure.travis-ci.org/tilo/smarter_csv.png?branch=master)](http://travis-ci.org/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
+[![Build Status](https://secure.travis-ci.org/tilo/smarter_csv.svg?branch=master)](http://travis-ci.org/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
 `smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with Mongoid or ActiveRecord,
 and parallel processing with Resque or Sidekiq.
@@ -35,7 +35,10 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
  * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
 Tip: If you are uncertain about what line endings a CSV-file uses, try specifying `:row_sep => :auto` as part of the options.
-But this could be slow, because it will try to analyze each CSV file first. If you want to speed things up, set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
+But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
+To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
+You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
 #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
 Please note how each hash contains only the keys for columns with non-null values.
@@ -166,7 +169,7 @@ NOTE: If you use `key_mappings` and `value_converters`, make sure that the value
       => Float
 ## Parallel Processing
-[Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing/)
+[Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
 ## Documentation
@@ -184,6 +187,7 @@ The options and the block are optional.
      | :col_sep                    |   ','    | column separator                                                                     |
      | :row_sep                    | $/ ,"\n" | row separator or record separator , defaults to system's $/ , which defaults to "\n" |
      |                             |          | This can also be set to :auto, but will process the whole cvs file first  (slow!)    |
+     | :auto_row_sep_chars         |   500    | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
      | :quote_char                 |   '"'    | quotation character                                                                  |
      | :comment_regexp             |   /^#/   | regular expression which matches comment lines (see NOTE about the CSV header)       |
      | :chunk_size                 |   nil    | if set, determines the desired chunk-size (defaults to nil, no chunk processing)     |
@@ -216,7 +220,7 @@ The options and the block are optional.
      |                             |          |      also accepts either {:except => [:key1,:key2]} or {:only => :key3}              |
      | :remove_empty_hashes        |   true   | remove / ignore any hashes which don't have any key/value pairs                      |
      | :file_encoding              |   utf-8  | Set the file encoding eg.: 'windows-1252' or 'iso-8859-1'                            |
-     | :force_simple_split         |   false  | force simiple splitting on :col_sep character for non-standard CSV-files.            |
+     | :force_simple_split         |   false  | force simple splitting on :col_sep character for non-standard CSV-files.            |
      |                             |          | e.g. when :quote_char is not properly escaped                                        |
      | :verbose                    |   false  | print out line number while processing (to track down problems in input files)       |
@@ -261,10 +265,6 @@ The options and the block are optional.
  * some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, use the `:force_simple_split => true` option in combination with `:strip_chars_from_headers => /[\-"]/` . This will also significantly speed up the import.
    If you would force a different :quote_char instead (setting it to a non-used character), then the import would be up to 5-times slower than using `:force_simple_split`.
-#### Known Issues:
- * if you are using 1.8.7 versions of Ruby, JRuby, or Ruby Enterprise Edition, `smarter_csv` will have problems with double-quoted fields, because of a bug in an underlying library.
 ## See also:
   http://www.unixgods.org/~tilo/Ruby/process_csv_as_hashes.html
@@ -293,8 +293,14 @@ Planned in the next releases:
 ## Changes
+#### 1.1.5 (2017-11-05)
+ * fix issue with invalid byte sequences in header (issue #103, thanks to Dave Myron)
+ * fix issue with invalid byte sequences in multi-line data (thanks to Ivan Ushakov)
+ * analyze only 500 characters by default when `:row_sep => :auto` is used.
+   added option `row_sep_auto_chars` to change the default if necessary. (thanks to Matthieu Paret)
 #### 1.1.4 (2017-01-16)
- * fixing UTF-8 related bug which was introduced in 1.1.2 (thank to Tirdad C.)
+ * fixing UTF-8 related bug which was introduced in 1.1.2 (thanks to Tirdad C.)
 #### 1.1.3 (2016-12-30)
  * added warning when options indicate UTF-8 processing, but input filehandle is not opened with r:UTF-8 option
@@ -449,6 +455,10 @@ And a special thanks to those who contributed pull requests:
  * [Michael](https://github.com/polycarpou)
  * [Kevin Coleman](https://github.com/KevinColemanInc)
  * [Tirdad C.](https://github.com/tridadc)
+ * [Dave Myron](https://github.com/contentfree)
+ * [Ivan Ushakov](https://github.com/IvanUshakov)
+ * [Matthieu Paret](https://github.com/mtparet)
+ * [Rohit Amarnath](https://github.com/ramarnat)
 ## Contributing
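
The two invalid-byte-sequence fixes listed in the 1.1.5 changelog entry above both come down to cleaning a line before any regexp or string operation touches it. In Ruby, applying a regexp to a string that contains invalid UTF-8 bytes raises `ArgumentError`, which is presumably why the header is now re-encoded before the comment prefix is stripped. The sketch below illustrates that ordering. It is not the gem's code: `sanitize_line` is a hypothetical helper, and it uses `String#scrub` (Ruby 2.1+) in place of the `encode(..., invalid: :replace)` call that the gem itself uses.

```ruby
# Hypothetical helper, not part of smarter_csv: return a UTF-8 copy of a raw
# line in which every invalid byte has been replaced by `replacement`.
def sanitize_line(raw, replacement = '')
  raw.dup.force_encoding('utf-8').scrub(replacement)
end

# A commented-out header line containing a stray 0xE9 byte (invalid in UTF-8).
raw_header = "#first_name,last\xE9name\n".b

# Clean first, then strip the comment marker and the row separator.
header = sanitize_line(raw_header, '?')
header = header.sub(/^#/, '').chomp("\n")

# Multi-line records: clean each continuation line before appending it,
# so the accumulated record never contains invalid bytes.
raw_lines = ["\"multi\xFF\n".b, "line\",2\n".b]
record = raw_lines.map { |l| sanitize_line(l) }.join
```

The replacement string plays the role of the `:invalid_byte_sequence` option, whose default in the gem is the empty string.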

data/lib/smarter_csv/smarter_csv.rb CHANGED

@@ -9,7 +9,8 @@ module SmarterCSV
       :remove_empty_values => true, :remove_zero_values => false , :remove_values_matching => nil , :remove_empty_hashes => true , :strip_whitespace => true,
       :convert_values_to_numeric => true, :strip_chars_from_headers => nil , :user_provided_headers => nil , :headers_in_file => true,
       :comment_regexp => /^#/, :chunk_size => nil , :key_mapping_hash => nil , :downcase_header => true, :strings_as_keys => false, :file_encoding => 'utf-8',
-      :remove_unmapped_keys => false, :keep_original_headers => false, :value_converters => nil, :skip_lines => nil, :force_utf8 => false, :invalid_byte_sequence => ''
+      :remove_unmapped_keys => false, :keep_original_headers => false, :value_converters => nil, :skip_lines => nil, :force_utf8 => false, :invalid_byte_sequence => '',
+      :auto_row_sep_chars => 500
     }
     options = default_options.merge(options)
     options[:invalid_byte_sequence] = '' if options[:invalid_byte_sequence].nil?
@@ -27,7 +28,7 @@ module SmarterCSV
       end
       if options[:row_sep] == :auto
-        options[:row_sep] =  SmarterCSV.guess_line_ending( f, options )
+        options[:row_sep] = line_ending = SmarterCSV.guess_line_ending( f, options )
         f.rewind
       end
       $/ = options[:row_sep]
@@ -39,8 +40,9 @@ module SmarterCSV
       if options[:headers_in_file]        # extract the header line
         # process the header line in the CSV file..
         # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
-        header = f.readline.sub(options[:comment_regexp],'').chomp(options[:row_sep])
+        header = f.readline
         header = header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+        header = header.sub(options[:comment_regexp],'').chomp(options[:row_sep])
         file_line_count += 1
         csv_line_count += 1
@@ -118,7 +120,9 @@ module SmarterCSV
         # by detecting the existence of an uneven number of quote characters
         multiline = line.count(options[:quote_char])%2 == 1
         while line.count(options[:quote_char])%2 == 1
-          line += f.readline
+          next_line = f.readline
+          next_line = next_line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+          line += next_line
           file_line_count += 1
         end
         print "\nline contains uneven number of quote chars so including content through file line %d\n" % file_line_count if options[:verbose] && multiline
@@ -251,6 +255,7 @@ module SmarterCSV
     # count how many of the pre-defined line-endings we find
     # ignoring those contained within quote characters
     last_char = nil
+    lines = 0
     filehandle.each_char do |c|
       quoted_char = !quoted_char if c == options[:quote_char]
       next if quoted_char
@@ -265,6 +270,8 @@ module SmarterCSV
         counts["\n"] += 1
       end
       last_char = c
+      lines += 1
+      break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
     end
     counts["\r"] += 1 if last_char == "\r"
     # find the key/value pair with the largest counter:

data/lib/smarter_csv/version.rb CHANGED

@@ -1,3 +1,3 @@
 module SmarterCSV
-  VERSION = "1.1.4"
+  VERSION = "1.1.5"
 end

metadata CHANGED

@@ -1,15 +1,16 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.1.4
+  version: 1.1.5
 platform: ruby
 authors:
-- |
-  Tilo Sloboda
+- 'Tilo Sloboda
+'
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-01-17 00:00:00.000000000 Z
+date: 2017-11-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec
@@ -29,8 +30,9 @@ description: Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes,
   optional features for processing large files in parallel, embedded comments, unusual
   field- and record-separators, flexible mapping of CSV-headers to Hash-keys
 email:
-- |
-  tilo.sloboda@gmail.com
+- 'tilo.sloboda@gmail.com
+'
 executables: []
 extensions: []
 extra_rdoc_files: []
@@ -123,7 +125,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 requirements:
 - csv
 rubyforge_project:
-rubygems_version: 2.4.5
+rubygems_version: 2.6.13
 signing_key:
 specification_version: 4
 summary: Ruby Gem for smarter importing of CSV Files (and CSV-like files), with lots