RubyGems - smarter_csv - Versions diffs - 1.8.1 → 1.8.3 - Mend

smarter_csv 1.8.1 → 1.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

checksums.yaml +4 -4
data/CHANGELOG.md +13 -1
data/README.md +20 -10
data/TO_DO_v2.md +14 -0
data/ext/smarter_csv/smarter_csv.c +46 -46
data/lib/smarter_csv/version.rb +1 -1
data/lib/smarter_csv.rb +25 -39
metadata +3 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a7aa350efc77f90c6986a7573e733b5d9d02930c94465f17d2227b346263a6ce
-  data.tar.gz: 42351edf3e618b8c025f266796897aa0c3572d77e42788a05b1ee37ce8bdeed2
+  metadata.gz: 654b04532f0d0b1e15bf84c2e23231e00946a1f57c613f53555ba2d531eaf4f9
+  data.tar.gz: d99a921a908864764a39e94818be45c9feb8a1fbe15eb776e24ef10e98c749fd
 SHA512:
-  metadata.gz: 8bd9d59d7260a8e90ce472917801b98d088e37de5b1e912914f820f2efbbeb0491f5056d47575debdf1bccb8b9b8670cd089647efa15ec93b02413747dcfe702
-  data.tar.gz: 861364c6213af99c11cd3b9a59b2cf46f8c8e850ee2273e4f1b790714c9cd0ca66a734d64233737e086669c2b6aa51415f1343c3d61811547ec3c715d7a1620c
+  metadata.gz: 8005c2b6bdd4e82ab1acc8849afd4b8d7abf0d744bb18fa76aaac68a707a8f14300b4e844abac3dacab24b254c81787dc4501a1cc1138ebdc97fe52728e82f30
+  data.tar.gz: 0baead2aa4d6841f3770e27a24e5dc5d783873db253c8185e81366a2b5a36045d82f4dc2011fbd46373cf434968675b232217c8764c2c858d74c1cceaebd45ed

data/CHANGELOG.md CHANGED

@@ -1,13 +1,25 @@
 # SmarterCSV 1.x Change Log
+## 1.8.3 (2023-03-30)
+  * bugfix: windows one-column files were raising NoColSepDetected (issue #229)
+## 1.8.2 (2023-03-21)
+  * bugfix: do not raise `NoColSepDetected` for CSV files with only one column in most cases (issue #222)
+            If the first lines contain non-ASCII characters, and no col_sep is detected, it will still raise `NoColSepDetected`
 ## 1.8.1 (2023-03-19)
   * added validation against invalid values for :col_sep, :row_sep, :quote_char (issue #216)
   * deprecating `required_headers` and replace with `required_keys` (issue #140)
   * fixed issue with require statement
-## 1.8.0 (2023-03-18)
+## 1.8.0 (2023-03-18) BREAKING
   * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+    MAKE SURE to rescue `NoColSepDetected` if your CSV files can have unexpected formats,
+              e.g. from users uploading them to a service, and handle those cases.
   * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
 ## 1.7.4 (2023-01-13)

data/README.md CHANGED

@@ -3,26 +3,33 @@
  [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
+#### Development Branches
+* default branch is `main` for 1.x development
+* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
 #### Work towards Future Version 2.0
 * Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
   Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
-* New versions of SmarterCSV 1.x will soon print a deprecation warning if you set :verbose to true
-  See below for list of deprecated options.
+---------------
-#### Restructured Branches
+#### SmarterCSV 1.x [Current Version]
-* default branch is `main` for 1.x development
-* 2.x development is on `2.0-development`
+`smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or oploading data to S3.
----------------
+The goals for SmarterCSV are:
+  * ease of use for handling most common CSV files without having to tweak options
+  * improve robustness of your code when you have no control over the quality of the CSV files which are processed
+  * formatting each row of data as a hash, in order to allow easy processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or oploading data to S3.
-#### SmarterCSV 1.x [Current Version]
+#### Rescue from Exceptions
+While SmarterCSV uses sensible defaults to process the most common CSV files, it will raise exceptions if it can not auto-detect `col_sep`, `row_sep`, or if it encounters other problems. Therefore, when calling `SmarterCSV.process`, please rescue from `SmarterCSVException`, and handle outliers according to your requirements.
-`smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, parallel processing, or kicking-off batch jobs with Sidekiq.
+If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accomodate for unusual formats.
-To create high-quality output, some options are enabled as a default. Please make sure to check the output and tweak the options accordingly.
+#### Features
 One `smarter_csv` user wrote:
@@ -77,7 +84,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 Here are some examples to demonstrate the versatility of SmarterCSV.
-By default SmarterCSV determines the `row_sep` and `col_sep` values automatically.
+**It is generally recommended to rescue `SmarterCSVException` or it's sub-classes.**
+By default SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception will be raised, e.g. `NoColSepDetected`. Rescuing from these exceptions will make sure that you don't miss processing CSV files, in case users upload CSV files with unexpected formats.
 In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
 #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
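
The README hunk above recommends rescuing `SmarterCSVException` (or its sub-classes) around `SmarterCSV.process`. A minimal sketch of that pattern follows. `import_csv_safely` is a hypothetical helper that is not part of the gem, and the stand-in exception classes exist only so the snippet runs without the gem installed. That `NoColSepDetected` inherits from `SmarterCSVException` is an assumption based on the README wording.

```ruby
# Stand-ins so this sketch runs without the gem; when smarter_csv is already
# loaded, the real classes are used instead.
# Assumption: NoColSepDetected is a sub-class of SmarterCSVException.
unless defined?(SmarterCSV)
  module SmarterCSV
    class SmarterCSVException < StandardError; end
    class NoColSepDetected < SmarterCSVException; end
  end
end

# Hypothetical helper (not part of smarter_csv): runs the given import block
# and turns parser failures into a result hash instead of letting them escape.
def import_csv_safely(path)
  { ok: true, rows: yield(path) }
rescue SmarterCSV::NoColSepDetected
  # The specific sub-class is rescued first, so it is not swallowed by the
  # broader clause below.
  { ok: false, reason: "could not detect a column separator in #{path}" }
rescue SmarterCSV::SmarterCSVException => e
  { ok: false, reason: "#{e.class}: #{e.message}" }
end
```

With the gem installed, a call could look like `import_csv_safely("upload.csv") { |path| SmarterCSV.process(path) }`. Returning a result hash is only one option; logging the failure and skipping the file would work equally well.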

data/TO_DO_v2.md ADDED

@@ -0,0 +1,14 @@
+# SmarterCSV v2.0 TO DO List
+* add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
+* use Procs for validations and transformatoins [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+* make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+* skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120).
+  Or stream large file from S3 (linked in the issue)
+* Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
+* Don't call rewind on filehandle
+* [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
+* [2.0 BUG]  convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
+* Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
+* Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)

data/ext/smarter_csv/smarter_csv.c CHANGED

@@ -15,67 +15,67 @@
 static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) {
   if (RB_TYPE_P(line, T_NIL) == 1) {
     return rb_ary_new();
+  }
-  } else if (RB_TYPE_P(line, T_STRING) == 1) {
-    rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
-    char *startP = RSTRING_PTR(line); /* may not be null terminated */
-    long line_len = RSTRING_LEN(line);
-    char *endP = startP + line_len ; /* points behind the string */
-    char *p = startP;
+  if (RB_TYPE_P(line, T_STRING) != 1) {
+    rb_raise(rb_eTypeError, "ERROR in SmarterCSV.parse_line: line has to be a string or nil");
+  }
-    char *col_sepP = RSTRING_PTR(col_sep);
-    long col_sep_len = RSTRING_LEN(col_sep);
+  rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
+  char *startP = RSTRING_PTR(line); /* may not be null terminated */
+  long line_len = RSTRING_LEN(line);
+  char *endP = startP + line_len ; /* points behind the string */
+  char *p = startP;
-    char *quoteP = RSTRING_PTR(quote_char);
-    long quote_count = 0;
+  char *col_sepP = RSTRING_PTR(col_sep);
+  long col_sep_len = RSTRING_LEN(col_sep);
-    bool col_sep_found = true;
+  char *quoteP = RSTRING_PTR(quote_char);
+  long quote_count = 0;
-    VALUE elements = rb_ary_new();
-    VALUE field;
-    long i;
+  bool col_sep_found = true;
-    while (p < endP) {
-      /* does the remaining string start with col_sep ? */
-      col_sep_found = true;
-      for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
-        col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
-      }
-      /* if col_sep was found and we have even quotes */
-      if (col_sep_found && (quote_count % 2 == 0)) {
-        /* if max_size != nil && lements.size >= header_size */
-        if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
-          break;
-        } else {
-          /* push that field with original encoding onto the results */
-          field = rb_enc_str_new(startP, p - startP, encoding);
-          rb_ary_push(elements, field);
+  VALUE elements = rb_ary_new();
+  VALUE field;
+  long i;
-          p += col_sep_len;
-          startP = p;
-        }
+  while (p < endP) {
+    /* does the remaining string start with col_sep ? */
+    col_sep_found = true;
+    for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
+      col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
+    }
+    /* if col_sep was found and we have even quotes */
+    if (col_sep_found && (quote_count % 2 == 0)) {
+      /* if max_size != nil && lements.size >= header_size */
+      if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
+        break;
       } else {
-        if (*p == *quoteP) {
-          quote_count += 1;
-        }
-        p++;
-      }
-    } /* while */
+        /* push that field with original encoding onto the results */
+        field = rb_enc_str_new(startP, p - startP, encoding);
+        rb_ary_push(elements, field);
-    /* check if the last part of the line needs to be processed */
-    if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
-      /* copy the remaining line as a field with original encoding onto the results */
-      field = rb_enc_str_new(startP, endP - startP, encoding);
-      rb_ary_push(elements, field);
+        p += col_sep_len;
+        startP = p;
+      }
+    } else {
+      if (*p == *quoteP) {
+        quote_count += 1;
+      }
+      p++;
     }
+  } /* while */
-    return elements;
+  /* check if the last part of the line needs to be processed */
+  if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
+    /* copy the remaining line as a field with original encoding onto the results */
+    field = rb_enc_str_new(startP, endP - startP, encoding);
+    rb_ary_push(elements, field);
   }
-  rb_raise(rb_eTypeError, "ERROR in SmarterCSV.parse_line: line has to be a string or nil");
+  return elements;
 }
 VALUE SmarterCSV = Qnil;
 void Init_smarter_csv(void) {
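
The C hunk above only flattens the control flow into guard clauses; the splitting logic itself is unchanged. For readers less familiar with the Ruby C API, here is a rough pure-Ruby sketch of the same algorithm. The method name `parse_csv_line_sketch` is hypothetical. Unlike the C code, it works on characters rather than bytes, requires the full `col_sep` to match, and assumes a non-empty `col_sep`.

```ruby
# Hypothetical pure-Ruby port of the C function's logic, for illustration only.
# Splits `line` on `col_sep`, ignoring separators that appear while the number
# of quote characters seen so far is odd (i.e. inside a quoted field).
def parse_csv_line_sketch(line, col_sep, quote_char, max_size = nil)
  return [] if line.nil?                       # first guard clause
  unless line.is_a?(String)                    # second guard clause
    raise TypeError, "line has to be a string or nil"
  end

  elements = []
  start = 0          # index where the current field begins
  pos = 0            # current scan position
  quote_count = 0

  while pos < line.length
    if line[pos, col_sep.length] == col_sep && quote_count.even?
      # stop once we already have max_size fields
      break if max_size && elements.size >= max_size

      elements << line[start...pos]
      pos += col_sep.length
      start = pos
    else
      quote_count += 1 if line[pos] == quote_char
      pos += 1
    end
  end

  # the remainder of the line is the last field, unless max_size was reached
  elements << line[start..-1] if max_size.nil? || elements.size < max_size
  elements
end
```

For example, `parse_csv_line_sketch('a,"b,c",d', ',', '"')` returns `["a", "\"b,c\"", "d"]`: the quotes are kept in the field, as in the C version, and the comma inside them does not split it.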

data/lib/smarter_csv/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module SmarterCSV
-  VERSION = "1.8.1"
+  VERSION = "1.8.3"
 end

data/lib/smarter_csv.rb CHANGED

@@ -3,8 +3,11 @@
 require_relative "extensions/hash"
 require_relative "smarter_csv/version"
-require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
-# require 'smarter_csv.bundle' unless ENV['CI'] # local testing
+if `uname -s`.chomp == 'Darwin'
+  require 'smarter_csv.bundle' unless ENV['CI'] # local testing
+else
+  require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+end
 module SmarterCSV
   class SmarterCSVException < StandardError; end
@@ -393,15 +396,28 @@ module SmarterCSV
     def guess_column_separator(filehandle, options)
       skip_lines(filehandle, options)
-      possible_delimiters = [',', "\t", ';', ':', '|']
+      delimiters = [',', "\t", ';', ':', '|']
+      line = nil
+      has_header = options[:headers_in_file]
+      candidates = Hash.new(0)
+      count = has_header ? 1 : 5
+      count.times do
+        line = readline_with_counts(filehandle, options)
+        delimiters.each do |d|
+          candidates[d] += line.scan(d).count
+        end
+      rescue EOFError # short files
+        break
+      end
+      rewind(filehandle)
-      candidates = if options.fetch(:headers_in_file)
-                     candidated_column_separators_from_headers(filehandle, options, possible_delimiters)
-                   else
-                     candidated_column_separators_from_contents(filehandle, options, possible_delimiters)
-                   end
+      if candidates.values.max == 0
+        # if the header only contains
+        return ',' if line.chomp(options[:row_sep]) =~ /^\w+$/
-      raise SmarterCSV::NoColSepDetected if candidates.values.max == 0
+        raise SmarterCSV::NoColSepDetected
+      end
       candidates.key(candidates.values.max)
     end
@@ -582,35 +598,5 @@ module SmarterCSV
       return true if str.is_a?(String) && !str.empty?
       false
     end
-    def candidated_column_separators_from_headers(filehandle, options, delimiters)
-      candidates = Hash.new(0)
-      line = readline_with_counts(filehandle, options.slice(:row_sep))
-      delimiters.each do |d|
-        candidates[d] += line.scan(d).count
-      end
-      rewind(filehandle)
-      candidates
-    end
-    def candidated_column_separators_from_contents(filehandle, options, delimiters)
-      candidates = Hash.new(0)
-      5.times do
-        line = readline_with_counts(filehandle, options.slice(:row_sep))
-        delimiters.each do |d|
-          candidates[d] += line.scan(d).count
-        end
-      rescue EOFError # short files
-        break
-      end
-      rewind(filehandle)
-      candidates
-    end
   end
 end
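
The reworked `guess_column_separator` above counts candidate delimiters in the header line (or in up to five lines when there is no header) and falls back to `','` when the last sampled line looks like a single-column value. The sketch below mirrors that heuristic on an in-memory array of lines so it can be tried without the gem. `guess_col_sep` and the top-level `NoColSepDetected` are stand-ins for illustration, and plain `chomp` replaces the gem's `chomp(options[:row_sep])`.

```ruby
DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Stand-in for SmarterCSV::NoColSepDetected so the sketch is self-contained.
class NoColSepDetected < StandardError; end

# Hypothetical re-statement of the heuristic in the diff, operating on an
# array of already-read lines instead of a filehandle.
def guess_col_sep(lines, has_header: true)
  sample = lines.first(has_header ? 1 : 5)   # header only, or first 5 lines
  candidates = Hash.new(0)

  sample.each do |line|
    DELIMITERS.each { |d| candidates[d] += line.scan(d).count }
  end

  if candidates.values.max.to_i == 0
    # No delimiter seen: treat a line made only of word characters as a
    # one-column file and default to ','.
    return ',' if sample.last.to_s.chomp =~ /^\w+$/

    raise NoColSepDetected
  end

  # the delimiter with the highest count wins
  candidates.key(candidates.values.max)
end
```

With this sketch, `guess_col_sep(["a;b;c\n"])` returns `";"`, `guess_col_sep(["name\n"])` returns `","`, and `guess_col_sep(["first name\n"])` raises `NoColSepDetected` because the space is not a word character.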

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.8.1
+  version: 1.8.3
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-03-19 00:00:00.000000000 Z
+date: 2023-03-30 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print
@@ -112,6 +112,7 @@ files:
 - LICENSE.txt
 - README.md
 - Rakefile
+- TO_DO_v2.md
 - ext/smarter_csv/extconf.rb
 - ext/smarter_csv/smarter_csv.c
 - lib/extensions/hash.rb