RubyGems - smarter_csv - Versions diffs - 1.1.4 → 1.1.5

smarter_csv 1.1.4 → 1.1.5

Files changed (7)

checksums.yaml +4 -4
data/.travis.yml +12 -22
data/Gemfile +3 -2
data/README.md +19 -9
data/lib/smarter_csv/smarter_csv.rb +11 -4
data/lib/smarter_csv/version.rb +1 -1
metadata +9 -7

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: cb1216b85e197c77005a95ab4c3bc46896b7719f
-  data.tar.gz: 7539e858a39825e5fac1dc27e51f53f1e3f20c2c
+  metadata.gz: 042aadb2bc5426a07a64f09e781bccbd728e8052
+  data.tar.gz: ba48c2e303591d4027e05d1208c225381d362857
 SHA512:
-  metadata.gz: 2669d2f524e138bdcd8e9ded254a8dee1996589d56eabe4ac2f4480be7ebc88c1360600d26e109d3eba7e1e91075aa52629663ef4fd32489a7fd7e809f8b587c
-  data.tar.gz: 39cf42229ab96f15e860472ea138e04ba18c6c717e0860eaa87e9fd2e0c8ca516a8070ee6333f96a6cbc8a40662b509fef492eaefc72b95ecb1ccf5d8c1b1faa
+  metadata.gz: 58cb92edabb46bdcb48598d4b4b02b5f0f09cc63378e818ac672daf8d722b5fbf1b246df5db262dff306e87943a2bb2bebbb753944adc5449b19cd5a1475c00b
+  data.tar.gz: 31fe30f2b2027274a5252c55b234b120327be6ce652f7ad71232bd8a920e33d30cbae42577fff398d0a61574dc17b3016cd3fa1d520eec3dd4636569cc62860e

data/.travis.yml CHANGED

@@ -1,29 +1,19 @@
 language: ruby
 bundler_args: --without development
-rvm:
-  - 1.8.7
-  - 1.9.2
-  - 1.9.3
-  - 2.0.0
-  - 2.1.3
-  - 2.2.2
-  - jruby
-  - ruby-head
-  - jruby-head
-  - ree
-  - rbx
-# jdk:
-#   - oraclejdk7
-#   - openjdk7
-env: JRUBY_OPTS="--server -Xcompile.invokedynamic=false -J-XX:+TieredCompilation -J-XX:TieredStopAtLevel=1 -J-noverify -J-Xms512m -J-Xmx1024m"
+before_install:
+  - gem install bundler
+  - gem update --system
 matrix:
-  allow_failures:
-    - rbx
-    - rvm: jruby-head
+  include:
+    - rvm: 2.2.8
+    - rvm: 2.3.5
+    - rvm: 2.4.2
+    - rvm: jruby-9.1.13.0
+      env:
+        - JRUBY_OPTS="--server -Xcompile.invokedynamic=false -J-XX:+TieredCompilation -J-XX:TieredStopAtLevel=1 -J-noverify -J-Xms512m -J-Xmx1024m"
     - rvm: ruby-head
-    - rvm: ree
-    - rvm: 1.8.7
-    - rvm: jruby-18mode
 branches:
   only:
     - master

data/Gemfile CHANGED

@@ -4,8 +4,9 @@ source 'https://rubygems.org'
 gemspec
-gem "rake"
+gem "rake", "< 11"
+gem 'pry'
 group :test do
-  gem "rspec", "~> 2.14"
+  gem "rspec", "~> 2.99"
 end
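
Note on the Gemfile hunk above: it tightens two version constraints, `"< 11"` for rake and `"~> 2.99"` for rspec. As a rough illustration of what those constraints permit, the sketch below checks a few version numbers with RubyGems' own `Gem::Requirement`. The constant names, the `allowed?` helper, and the version numbers tested are illustrative picks, not taken from the gem.

```ruby
require 'rubygems'

# Constraints copied from the new Gemfile lines.
RAKE_REQ  = Gem::Requirement.new('< 11')
RSPEC_REQ = Gem::Requirement.new('~> 2.99')

# Hypothetical helper: true if the version string satisfies the requirement.
def allowed?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

# Sample versions chosen for illustration only.
%w[10.5.0 11.0.0].each do |v|
  puts "rake #{v}: #{allowed?(RAKE_REQ, v)}"
end
%w[2.14.8 2.99.2 3.0.0].each do |v|
  puts "rspec #{v}: #{allowed?(RSPEC_REQ, v)}"
end
```

The pessimistic operator `~> 2.99` accepts 2.99 and later 2.x releases but stops short of 3.0, so the change keeps the test suite on the last rspec 2 series.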

data/README.md CHANGED

@@ -1,6 +1,6 @@
 # SmarterCSV
-[![Build Status](https://secure.travis-ci.org/tilo/smarter_csv.png?branch=master)](http://travis-ci.org/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
+[![Build Status](https://secure.travis-ci.org/tilo/smarter_csv.svg?branch=master)](http://travis-ci.org/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
 `smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with Mongoid or ActiveRecord,
 and parallel processing with Resque or Sidekiq.
@@ -35,7 +35,10 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
  * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
 Tip: If you are uncertain about what line endings a CSV-file uses, try specifying `:row_sep => :auto` as part of the options.
-But this could be slow, because it will try to analyze each CSV file first. If you want to speed things up, set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
+But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
+To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
+You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
 #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
 Please note how each hash contains only the keys for columns with non-null values.
@@ -166,7 +169,7 @@ NOTE: If you use `key_mappings` and `value_converters`, make sure that the value
       => Float
 ## Parallel Processing
-[Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing/)
+[Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
 ## Documentation
@@ -184,6 +187,7 @@ The options and the block are optional.
      | :col_sep                    |   ','    | column separator                                                                     |
      | :row_sep                    | $/ ,"\n" | row separator or record separator , defaults to system's $/ , which defaults to "\n" |
      |                             |          | This can also be set to :auto, but will process the whole cvs file first  (slow!)    |
+     | :auto_row_sep_chars         |   500    | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
      | :quote_char                 |   '"'    | quotation character                                                                  |
      | :comment_regexp             |   /^#/   | regular expression which matches comment lines (see NOTE about the CSV header)       |
      | :chunk_size                 |   nil    | if set, determines the desired chunk-size (defaults to nil, no chunk processing)     |
@@ -216,7 +220,7 @@ The options and the block are optional.
      |                             |          |      also accepts either {:except => [:key1,:key2]} or {:only => :key3}              |
      | :remove_empty_hashes        |   true   | remove / ignore any hashes which don't have any key/value pairs                      |
      | :file_encoding              |   utf-8  | Set the file encoding eg.: 'windows-1252' or 'iso-8859-1'                            |
-     | :force_simple_split         |   false  | force simiple splitting on :col_sep character for non-standard CSV-files.            |
+     | :force_simple_split         |   false  | force simple splitting on :col_sep character for non-standard CSV-files.            |
      |                             |          | e.g. when :quote_char is not properly escaped                                        |
      | :verbose                    |   false  | print out line number while processing (to track down problems in input files)       |
@@ -261,10 +265,6 @@ The options and the block are optional.
  * some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, use the `:force_simple_split => true` option in combination with `:strip_chars_from_headers => /[\-"]/` . This will also significantly speed up the import.
    If you would force a different :quote_char instead (setting it to a non-used character), then the import would be up to 5-times slower than using `:force_simple_split`.
-#### Known Issues:
- * if you are using 1.8.7 versions of Ruby, JRuby, or Ruby Enterprise Edition, `smarter_csv` will have problems with double-quoted fields, because of a bug in an underlying library.
 ## See also:
   http://www.unixgods.org/~tilo/Ruby/process_csv_as_hashes.html
@@ -293,8 +293,14 @@ Planned in the next releases:
 ## Changes
+#### 1.1.5 (2017-11-05)
+ * fix issue with invalid byte sequences in header (issue #103, thanks to Dave Myron)
+ * fix issue with invalid byte sequences in multi-line data (thanks to Ivan Ushakov)
+ * analyze only 500 characters by default when `:row_sep => :auto` is used.
+   added option `row_sep_auto_chars` to change the default if necessary. (thanks to Matthieu Paret)
 #### 1.1.4 (2017-01-16)
- * fixing UTF-8 related bug which was introduced in 1.1.2 (thank to Tirdad C.)
+ * fixing UTF-8 related bug which was introduced in 1.1.2 (thanks to Tirdad C.)
 #### 1.1.3 (2016-12-30)
  * added warning when options indicate UTF-8 processing, but input filehandle is not opened with r:UTF-8 option
@@ -449,6 +455,10 @@ And a special thanks to those who contributed pull requests:
  * [Michael](https://github.com/polycarpou)
  * [Kevin Coleman](https://github.com/KevinColemanInc)
  * [Tirdad C.](https://github.com/tridadc)
+ * [Dave Myron](https://github.com/contentfree)
+ * [Ivan Ushakov](https://github.com/IvanUshakov)
+ * [Matthieu Paret](https://github.com/mtparet)
+ * [Rohit Amarnath](https://github.com/ramarnat)
 ## Contributing
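
Note on the README hunks above: the option table and the code name the new option `:auto_row_sep_chars`, while the 1.1.5 changelog entry calls it `row_sep_auto_chars`. The code in this diff reads `options[:auto_row_sep_chars]`, so that spelling appears to be the one that takes effect. The sketch below is a stand-alone approximation of the idea the README describes, counting candidate line endings outside quoted fields and stopping after a fixed number of characters. The method name `guess_line_ending_sketch` and its keyword arguments are hypothetical. It is not the gem's own `guess_line_ending`.

```ruby
require 'stringio'

# Hypothetical stand-alone sketch, not the gem's implementation.
# Counts "\n", "\r" and "\r\n" outside quoted fields and returns the most
# frequent one. A max_chars of nil or 0 means the whole input is scanned.
def guess_line_ending_sketch(io, quote_char: '"', max_chars: 500)
  counts = { "\n" => 0, "\r" => 0, "\r\n" => 0 }
  quoted = false
  last_char = nil
  seen = 0 # characters examined outside quotes

  io.each_char do |c|
    quoted = !quoted if c == quote_char
    next if quoted

    if last_char == "\r"
      if c == "\n"
        counts["\r\n"] += 1
      else
        counts["\r"] += 1 # a lone \r is counted once the next char is known
      end
    elsif c == "\n"
      counts["\n"] += 1
    end
    last_char = c
    seen += 1
    break if max_chars && max_chars > 0 && seen >= max_chars
  end

  counts["\r"] += 1 if last_char == "\r"
  counts.max_by { |_, count| count }.first
end

sample = "a\nb\n" + ("c\r\n" * 10)
p guess_line_ending_sketch(StringIO.new(sample), max_chars: 4)
p guess_line_ending_sketch(StringIO.new(sample), max_chars: nil)
```

With a small limit only the start of the input is inspected, so a file whose first lines are not representative can be misdetected. Setting `:row_sep` explicitly avoids the guess altogether.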

data/lib/smarter_csv/smarter_csv.rb CHANGED

@@ -9,7 +9,8 @@ module SmarterCSV
       :remove_empty_values => true, :remove_zero_values => false , :remove_values_matching => nil , :remove_empty_hashes => true , :strip_whitespace => true,
       :convert_values_to_numeric => true, :strip_chars_from_headers => nil , :user_provided_headers => nil , :headers_in_file => true,
       :comment_regexp => /^#/, :chunk_size => nil , :key_mapping_hash => nil , :downcase_header => true, :strings_as_keys => false, :file_encoding => 'utf-8',
-      :remove_unmapped_keys => false, :keep_original_headers => false, :value_converters => nil, :skip_lines => nil, :force_utf8 => false, :invalid_byte_sequence => ''
+      :remove_unmapped_keys => false, :keep_original_headers => false, :value_converters => nil, :skip_lines => nil, :force_utf8 => false, :invalid_byte_sequence => '',
+      :auto_row_sep_chars => 500
     }
     options = default_options.merge(options)
     options[:invalid_byte_sequence] = '' if options[:invalid_byte_sequence].nil?
@@ -27,7 +28,7 @@ module SmarterCSV
       end
       if options[:row_sep] == :auto
-        options[:row_sep] =  SmarterCSV.guess_line_ending( f, options )
+        options[:row_sep] = line_ending = SmarterCSV.guess_line_ending( f, options )
         f.rewind
       end
       $/ = options[:row_sep]
@@ -39,8 +40,9 @@ module SmarterCSV
       if options[:headers_in_file]        # extract the header line
         # process the header line in the CSV file..
         # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
-        header = f.readline.sub(options[:comment_regexp],'').chomp(options[:row_sep])
+        header = f.readline
         header = header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+        header = header.sub(options[:comment_regexp],'').chomp(options[:row_sep])
         file_line_count += 1
         csv_line_count += 1
@@ -118,7 +120,9 @@ module SmarterCSV
         # by detecting the existence of an uneven number of quote characters
         multiline = line.count(options[:quote_char])%2 == 1
         while line.count(options[:quote_char])%2 == 1
-          line += f.readline
+          next_line = f.readline
+          next_line = next_line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+          line += next_line
           file_line_count += 1
         end
         print "\nline contains uneven number of quote chars so including content through file line %d\n" % file_line_count if options[:verbose] && multiline
@@ -251,6 +255,7 @@ module SmarterCSV
     # count how many of the pre-defined line-endings we find
     # ignoring those contained within quote characters
     last_char = nil
+    lines = 0
     filehandle.each_char do |c|
       quoted_char = !quoted_char if c == options[:quote_char]
       next if quoted_char
@@ -265,6 +270,8 @@ module SmarterCSV
         counts["\n"] += 1
       end
       last_char = c
+      lines += 1
+      break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
     end
     counts["\r"] += 1 if last_char == "\r"
     # find the key/value pair with the largest counter:
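
Note on the hunks above: the header line is now re-encoded before the comment prefix is stripped, presumably so that the regexp substitution runs on an already-cleaned string, and each continuation line of a multi-line field gets the same treatment. The sketch below isolates that expression in a helper to show its effect on a string containing an invalid byte. The helper name `sanitize_line` and the sample data are hypothetical, and the helper calls `dup` first to avoid mutating its argument; the gem inlines the expression instead.

```ruby
# Hypothetical helper wrapping the expression used in the diff.
# Re-encodes only when :force_utf8 is set or :file_encoding is not UTF-8.
def sanitize_line(line, options)
  return line unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i

  line.dup.force_encoding('utf-8').encode(
    'utf-8',
    invalid: :replace,
    undef: :replace,
    replace: options[:invalid_byte_sequence]
  )
end

# Subset of the gem's default options, with :force_utf8 switched on.
SAMPLE_OPTIONS = {
  force_utf8: true,
  file_encoding: 'utf-8',
  invalid_byte_sequence: '',
  comment_regexp: /^#/,
  row_sep: "\n"
}.freeze

raw_header = "#caf\xE9,name\n" # \xE9 is not valid UTF-8 on its own
header = sanitize_line(raw_header, SAMPLE_OPTIONS)
header = header.sub(SAMPLE_OPTIONS[:comment_regexp], '').chomp(SAMPLE_OPTIONS[:row_sep])
p header
```

Passing a non-empty `:invalid_byte_sequence` such as `'?'` keeps a visible marker where bytes were dropped, which can help when tracking down bad input.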

data/lib/smarter_csv/version.rb CHANGED

@@ -1,3 +1,3 @@
 module SmarterCSV
-  VERSION = "1.1.4"
+  VERSION = "1.1.5"
 end

metadata CHANGED

@@ -1,15 +1,16 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.1.4
+  version: 1.1.5
 platform: ruby
 authors:
-- |
-  Tilo Sloboda
+- 'Tilo Sloboda
+'
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-01-17 00:00:00.000000000 Z
+date: 2017-11-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec
@@ -29,8 +30,9 @@ description: Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes,
   optional features for processing large files in parallel, embedded comments, unusual
   field- and record-separators, flexible mapping of CSV-headers to Hash-keys
 email:
-- |
-  tilo.sloboda@gmail.com
+- 'tilo.sloboda@gmail.com
+'
 executables: []
 extensions: []
 extra_rdoc_files: []
@@ -123,7 +125,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 requirements:
 - csv
 rubyforge_project:
-rubygems_version: 2.4.5
+rubygems_version: 2.6.13
 signing_key:
 specification_version: 4
 summary: Ruby Gem for smarter importing of CSV Files (and CSV-like files), with lots