RubyGems - smarter_csv - Versions diffs - 1.12.0 → 1.13.0 - Mend

smarter_csv 1.12.0 → 1.13.0

Files changed (15)

checksums.yaml +4 -4
data/CHANGELOG.md +42 -0
data/CONTRIBUTORS.md +5 -0
data/docs/data_transformations.md +10 -1
data/docs/header_transformations.md +2 -0
data/docs/options.md +11 -8
data/docs/value_converters.md +20 -5
data/ext/smarter_csv/smarter_csv.c +29 -20
data/lib/smarter_csv/auto_detection.rb +5 -1
data/lib/smarter_csv/errors.rb +1 -0
data/lib/smarter_csv/options.rb +14 -0
data/lib/smarter_csv/parser.rb +47 -21
data/lib/smarter_csv/reader.rb +31 -10
data/lib/smarter_csv/version.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: e37441fcb5fcb55c507df960d4472d085b6b8ab207596e0c723b1c7ed868bb90
-  data.tar.gz: ec554fd545805f48838000446af1749b2adaa4c8e3fb31b3ca146aa3d9b91fad
+  metadata.gz: c28d21a143e743de4b21d8ad93d860b1d51424e525e9ec0a73bb640b170d9823
+  data.tar.gz: e7a16ae8494196b85d9a196d071f09b302583c9ef3a414f09f5ad6ae1f11c29b
 SHA512:
-  metadata.gz: 55842abeea7fa20b4811c8d1021a054829abe0dcd9e808e669ebcf8b17457979c66e7bf8110e0a5a07f224e2ca6371b98b1929b678c488f9499a845e733efb17
-  data.tar.gz: 8945d14497a08fef63b7908b10a9a8d483864b065c3b0fdd26497e6826733196fc76fa69c03008d48dbf233d382e552d6d3a56b999536278ebf33f48a5eb0c03
+  metadata.gz: 319ee5aed33630e9670a1c95cc8da6fd57df9d1d7db57a00af79c1e5c10de56b4e9054c86b6b462ebdc693513a79aff2c881a1ede00ad28e5da58768b4a6f2cf
+  data.tar.gz: 6b36378d3a15ed9065c697f56f2cafc359d2c746e5796780a276c1a87c6a04be38616205f31ec9341412f3c9a4f52d150ce4ef95c0e76f340368d5683b1452e6

data/CHANGELOG.md CHANGED

@@ -1,6 +1,48 @@
 # SmarterCSV 1.x Change Log
+## 1.13.0 (2024-11-06) ⚡ POTENTIALLY BREAKING ⚡
+  CHANGED DEFAULT BEHAVIOR
+  ========================
+  The changes are to improve robustness and to reduce the risk of data loss
+  * implementing auto-detection of extra columns (thanks to James Fenley)
+  * improved handling of unbalanced quote_char in input ([issue 288](https://github.com/tilo/smarter_csv/issues/288), thanks to Simon Rentzke, and [issue 283](https://github.com/tilo/smarter_csv/issues/283), thanks to James Fenley, Randall B, Matthew Kennedy)
+    -> SmarterCSV will now raise `SmarterCSV::MalformedCSV` for unbalanced quote_char.
+  * bugfix / improved handling of extra columns in input data ([issue 284](https://github.com/tilo/smarter_csv/issues/284)) (thanks to James Fenley)
+    * previous behavior:
+      when a CSV row had more columns than listed in the header, the additional columns were ignored
+    * new behavior:
+      * new default behavior is to auto-generate additional headers, e.g. :column_7, :column_8, etc
+      * you can set option `:strict` to true in order to get a `SmarterCSV::MalformedCSV` exception instead
+  * setting `user_provided_headers` now implies `headers_in_file: false` ([issue 282](https://github.com/tilo/smarter_csv/issues/282))
+    The option `user_provided_headers` can be used to specify headers when there are none in the input, OR to completely override headers that are in the input (file).
+    SmarterCSV is now using a safer default behavior.
+    * previous behavior:
+      Setting `user_provided_headers` did not change the default `headers_in_file: true`
+      If the input had no headers, this would cause the first line to be erroneously treated as a header, and the user could lose the first row of data.
+    * new behavior:
+      Setting `user_provided_headers` sets `headers_in_file: false`
+      a) Improved behavior if there was no header in the input data.
+      b) If there was a header in the input data, and `user_provided_headers` is used to override the headers in the file, then please explicitly specify `headers_in_file: true`, otherwise you will get an extra hash which includes the header data.
+    IF you set `user_provided_headers` and the file has a header, then provide `headers_in_file: true` to avoid getting that extra record.
+  * handling of numeric columns with leading zeroes, e.g. ZIP codes ([issue #151](https://github.com/tilo/smarter_csv/issues/151), thanks to David Moles). `convert_values_to_numeric: { except: [:zip] }` will now return a string for that column instead.
+## 1.12.1 (2024-07-10)
+  * Improved column separator detection by ignoring quoted sections [#276](https://github.com/tilo/smarter_csv/pull/276) (thanks to Nicolas Castellanos)
 ## 1.12.0 (2024-07-09)
   * Added Thread-Safety: added SmarterCSV::Reader to process CSV files in a thread-safe manner ([issue #277](https://github.com/tilo/smarter_csv/pull/277))
   * SmarterCSV::Writer changed default row separator to the system's row separator (`\n` on Linux, `\r\n` on Windows)

data/CONTRIBUTORS.md CHANGED

@@ -53,3 +53,8 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
  * [JP Camara](https://github.com/jpcamara)
  * [Kenton Hirowatari](https://github.com/hirowatari)
  * [Daniel Pepper](https://github.com/dpep)
+ * [Nicolas Castellanos](https://github.com/nicastelo)
+ * [James Fenley](https://github.com/rex-remind101)
+ * [Simon Rentzke](https://github.com/simonrentzke)
+ * [Randall B](https://github.com/randall-coding)
+ * [Matthew Kennedy](https://github.com/MattKitmanLabs)

data/docs/data_transformations.md CHANGED

@@ -26,6 +26,15 @@ It removes any values which are `nil` or would be empty strings.
 `convert_values_to_numeric` is enabled by default.
 SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
+Here is an example of using `convert_values_to_numeric` for numbers with leading zeros, e.g. ZIP codes:
+```
+  data = SmarterCSV.process('/tmp/zip.csv',  convert_values_to_numeric: { except: [:zip] })
+   => [{:zip=>"00480"}, {:zip=>"51903"}, {:zip=>"12354"}, {:zip=>"02343"}]
+```
+This will return the column `:zip` as a string with all digits intact.
 ## Remove Zero Values
 `remove_zero_values` is disabled by default.
 When enabled, it removes key/value pairs which have a numeric value equal to zero.
@@ -44,7 +53,7 @@ It can happen that after all transformations, a row of the CSV file would produc
 By default SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.
-This can be set to `true`, to keep these empty hashes in the results.
+This can be set to `false`, to keep these empty hashes in the results.
 -------------------
 PREVIOUS: [Header Validations](./header_validations.md) | NEXT: [Value Converters](./value_converters.md)
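
The `except:` form of `convert_values_to_numeric` shown in this hunk can be pictured with a small stand-alone sketch. The helper `convert_numeric_values` below is hypothetical and is not SmarterCSV's implementation; it only assumes that numeric-looking strings become Integer or Float unless their key is listed in `except`.

```ruby
# Hypothetical helper (not part of SmarterCSV): converts numeric-looking
# strings to Integer or Float, leaving keys listed in `except` untouched.
def convert_numeric_values(row, except: [])
  row.each_with_object({}) do |(key, value), result|
    result[key] =
      if except.include?(key) || !value.is_a?(String)
        value                      # excluded key or non-string: keep as-is
      elsif value.match?(/\A[+-]?\d+\z/)
        value.to_i                 # integer-looking string
      elsif value.match?(/\A[+-]?\d+\.\d+\z/)
        value.to_f                 # float-looking string
      else
        value                      # anything else stays a string
      end
  end
end

row = { zip: "00480", count: "42", price: "9.50" }
p convert_numeric_values(row)                  # :zip becomes the Integer 480
p convert_numeric_values(row, except: [:zip])  # :zip stays the String "00480"
```

Without the exclusion, the leading zeros are lost as soon as the string is turned into an Integer, which is the data-loss case the new option addresses.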

data/docs/header_transformations.md CHANGED

@@ -64,6 +64,8 @@ If you want to have an underscore between the header and the number, you can set
    => [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}]
 ```
+If you set `duplicate_header_suffix: nil`, you get the same behavior as earlier versions, which raised the `SmarterCSV::DuplicateHeaders` error.
 ## Key Mapping
 The above example already illustrates how intermediate keys can be mapped into something different.
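
A rough sketch of the two `duplicate_header_suffix` modes described in this hunk: numbering repeated headers, or raising when the suffix is `nil`. The helper `disambiguate_headers`, the stand-in error class, and the exact numbering scheme (first occurrence unchanged, later ones numbered from 2) are assumptions made for illustration, not the library's code.

```ruby
# Stand-in for SmarterCSV::DuplicateHeaders, so the sketch runs without the gem.
class DuplicateHeadersSketch < StandardError; end

# Hypothetical helper: appends "<suffix><n>" to repeated headers,
# or raises if suffix is nil (mirroring the older behavior described above).
def disambiguate_headers(headers, suffix: '')
  counts = Hash.new(0)
  headers.map do |header|
    counts[header] += 1
    next header if counts[header] == 1   # first occurrence is kept as-is
    raise DuplicateHeadersSketch, "duplicate header: #{header}" if suffix.nil?

    :"#{header}#{suffix}#{counts[header]}"
  end
end

p disambiguate_headers(%i[name name name])               # numbered without separator
p disambiguate_headers(%i[name name name], suffix: '_')  # numbered with underscore
```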

data/docs/options.md CHANGED

@@ -41,7 +41,7 @@
      | :skip_lines                 |   nil    | how many lines to skip before the first line or header line is processed             |
      | :comment_regexp             |   nil    | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/       |
      ---------------------------------------------------------------------------------------------------------------------------------
-     | :col_sep                    |   :auto   | column separator (default was ',')                                           |
+     | :col_sep                    |   :auto   | column separator (default was ',')                                                  |
      | :force_simple_split         |   false  | force simple splitting on :col_sep character for non-standard CSV-files.             |
      |                             |          | e.g. when :quote_char is not properly escaped                                        |
      | :row_sep                    |  :auto   | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
@@ -49,9 +49,10 @@
      | :auto_row_sep_chars         |   500    | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
      | :quote_char                 |   '"'    | quotation character                                                                  |
      ---------------------------------------------------------------------------------------------------------------------------------
-     | :headers_in_file            |   true   | Whether or not the file contains headers as the first line.                          |
-     |                             |          | Important if the file does not contain headers,                                      |
-     |                             |          | otherwise you would lose the first line of data.                                     |
+     | :headers_in_file            |  true(1) | Whether or not the file contains headers as the first line.                          |
+     |                             |          | (1): if `user_provided_headers` is given, the default is `false`,                    |
+     |                             |          | unless you specify it to be explicitly `true`.                                       |
+     |                             |          | This prevents losing the first line of data, which is otherwise assumed to be a header. |
      | :duplicate_header_suffix    |   ''     | Adds numbers to duplicated headers and separates them by the given suffix.           |
      |                             |          | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior)        |
      | :user_provided_headers      |   nil    | *careful with that axe!*                                                             |
@@ -61,6 +62,8 @@
      | :remove_empty_hashes        |   true   | remove / ignore any hashes which don't have any key/value pairs or all empty values  |
      | :verbose                    |   false  | print out line number while processing (to track down problems in input files)       |
      | :with_line_numbers          |   false  | add :csv_line_number to each data hash                                               |
+     | :missing_header_prefix      |  column_ | can be set to a string of your liking                                                |
+     | :strict                     |   false  | When set to `true`, extra columns will raise MalformedCSV exception                  |
      ---------------------------------------------------------------------------------------------------------------------------------
 Additional 1.x Options which may be replaced in 2.0
@@ -71,11 +74,11 @@ There have been a lot of 1-offs and feature creep around these options, and goin
      | Option                      | Default  |  Explanation                                                                         |
      ---------------------------------------------------------------------------------------------------------------------------------
      | :key_mapping                |   nil    | a hash which maps headers from the CSV file to keys in the result hash               |
-     | :silence_missing_keys        |   false  | ignore missing keys in `key_mapping`                                   |
-     |                             |          | if set to true: makes all mapped keys optional                         |
+     | :silence_missing_keys        |   false  | ignore missing keys in `key_mapping`                                                |
+     |                             |          | if set to true: makes all mapped keys optional                                       |
      |                             |          | if given an array, makes only the keys listed in it optional                         |
-     | :required_keys              |   nil    | An array. Specify the required names AFTER header transformation.                  |
-     | :required_headers           |   nil    | (DEPRECATED / renamed) Use `required_keys` instead                          |
+     | :required_keys              |   nil    | An array. Specify the required names AFTER header transformation.                    |
+     | :required_headers           |   nil    | (DEPRECATED / renamed) Use `required_keys` instead                                   |
      |                             |          | or an exception is raised   No validation if nil is given.                           |
      | :remove_unmapped_keys       |   false  | when using :key_mapping option, should non-mapped keys / columns be removed?         |
      | :downcase_header            |   true   | downcase all column headers                                                          |

data/docs/value_converters.md CHANGED

@@ -21,10 +21,10 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
 ```ruby
     $ cat spec/fixtures/with_dates.csv
-    first,last,date,price
-    Ben,Miller,10/30/1998,$44.50
-    Tom,Turner,2/1/2011,$15.99
-    Ken,Smith,01/09/2013,$199.99
+    first,last,date,price,member
+    Ben,Miller,10/30/1998,$44.50,TRUE
+    Tom,Turner,2/1/2011,$15.99,False
+    Ken,Smith,01/09/2013,$199.99,true
     $ irb
     > require 'smarter_csv'
@@ -51,7 +51,20 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
       end
     end
-    options = {:value_converters => {:date => DateConverter, :price => DollarConverter}}
+    class BooleanConverter
+      def self.convert(value)
+        case value
+        when /true/i
+          true
+        when /false/i
+          false
+        else
+          nil
+        end
+      end
+    end
+    options = {value_converters: {date: DateConverter, price: DollarConverter, member: BooleanConverter}}
     data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
     first_record = data.first
     first_record[:date]
@@ -62,6 +75,8 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
       => 44.50
     first_record[:price].class
       => Float
+    first_record[:member]
+      => true
 ```
 --------------------
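
The `BooleanConverter` in this hunk matches with unanchored patterns, so a value such as `untrue` would also match `/true/i`. A stricter variant, sketched under the assumption that only the exact words `true` and `false` (in any letter case, with surrounding whitespace ignored) should convert, could look like this. The class name `StrictBooleanConverter` is hypothetical; the `self.convert(value)` interface is the one the documented converters use.

```ruby
# Hypothetical stricter variant of the documented BooleanConverter:
# only whole-word "true"/"false" convert; everything else becomes nil.
class StrictBooleanConverter
  def self.convert(value)
    case value.to_s.strip
    when /\Atrue\z/i  then true
    when /\Afalse\z/i then false
    end                       # no match: case returns nil
  end
end

p StrictBooleanConverter.convert("TRUE")
p StrictBooleanConverter.convert("False")
p StrictBooleanConverter.convert("untrue")
```

It would be passed the same way as the other converters, e.g. `value_converters: { member: StrictBooleanConverter }`.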

data/ext/smarter_csv/smarter_csv.c CHANGED

@@ -9,9 +9,10 @@
   #define true  ((bool)1)
 #endif
-/*
-   max_size: pass nil if no limit is specified
- */
+VALUE SmarterCSV = Qnil;
+VALUE eMalformedCSVError = Qnil;
+VALUE Parser = Qnil;
 static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) {
   if (RB_TYPE_P(line, T_NIL) == 1) {
     return rb_ary_new();
@@ -24,7 +25,7 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
   rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
   char *startP = RSTRING_PTR(line); /* may not be null terminated */
   long line_len = RSTRING_LEN(line);
-  char *endP = startP + line_len ; /* points behind the string */
+  char *endP = startP + line_len; /* points behind the string */
   char *p = startP;
   char *col_sepP = RSTRING_PTR(col_sep);
@@ -39,18 +40,19 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
   VALUE field;
   long i;
-  char prev_char = '\0'; // Store the previous character for comparison against an escape character
-  long backslash_count = 0; // to count consecutive backslash characters
+  /* Variables for escaped quote handling */
+  long backslash_count = 0;
+  bool in_quotes = false;
   while (p < endP) {
     /* does the remaining string start with col_sep ? */
     col_sep_found = true;
-    for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
+    for(i=0; (i < col_sep_len) && (p+i < endP); i++) {
       col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
     }
-    /* if col_sep was found and we have even quotes */
-    if (col_sep_found && (quote_count % 2 == 0)) {
-      /* if max_size != nil && lements.size >= header_size */
+    /* if col_sep was found and we're not inside quotes */
+    if (col_sep_found && !in_quotes) {
+      /* if max_size != nil && elements.size >= header_size */
       if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
         break;
       } else {
@@ -60,22 +62,30 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
         p += col_sep_len;
         startP = p;
+        backslash_count = 0; // Reset backslash count at the start of a new field
       }
     } else {
       if (*p == '\\') {
         backslash_count++;
       } else {
-        if (*p == *quoteP && (backslash_count % 2 == 0)) {
-          quote_count++;
+        if (*p == *quoteP) {
+          if (backslash_count % 2 == 0) {
+            /* Even number of backslashes means quote is not escaped */
+            in_quotes = !in_quotes;
+          }
+          /* Else, quote is escaped; do nothing */
         }
-        backslash_count = 0; // no more consecutive backslash characters
+        backslash_count = 0; // Reset after any character other than backslash
       }
       p++;
     }
-    prev_char = *(p - 1); // Update the previous character
   } /* while */
+  /* Check for unclosed quotes at the end of the line */
+  if (in_quotes) {
+    rb_raise(eMalformedCSVError, "Unclosed quoted field detected in line: %s", StringValueCStr(line));
+  }
   /* check if the last part of the line needs to be processed */
   if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
     /* copy the remaining line as a field with original encoding onto the results */
@@ -86,12 +96,11 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
   return elements;
 }
-VALUE SmarterCSV = Qnil;
-VALUE Parser = Qnil;
 void Init_smarter_csv(void) {
-  SmarterCSV = rb_define_module("SmarterCSV");
-  Parser = rb_define_module_under(SmarterCSV, "Parser");
+  // these modules and the error class are already defined in Ruby code, make them accessible:
+  SmarterCSV = rb_const_get(rb_cObject, rb_intern("SmarterCSV"));
+  Parser = rb_const_get(SmarterCSV, rb_intern("Parser"));
+  eMalformedCSVError = rb_const_get(SmarterCSV, rb_intern("MalformedCSV"));
   rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 4);
 }

data/lib/smarter_csv/auto_detection.rb CHANGED

@@ -13,13 +13,17 @@ module SmarterCSV
       delimiters = [',', "\t", ';', ':', '|']
       line = nil
+      escaped_quote = Regexp.escape(options[:quote_char])
       has_header = options[:headers_in_file]
       candidates = Hash.new(0)
       count = has_header ? 1 : 5
       count.times do
         line = readline_with_counts(filehandle, options)
         delimiters.each do |d|
-          candidates[d] += line.scan(d).count
+          # Count only non-quoted occurrences of the delimiter
+          non_quoted_text = line.split(/#{escaped_quote}[^#{escaped_quote}]*#{escaped_quote}/).join
+          candidates[d] += non_quoted_text.scan(d).count
         end
       rescue EOFError # short files
         break
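
To see why the hunk strips quoted sections before counting, here is a minimal stand-alone sketch. The helper `count_unquoted` is hypothetical; it uses the same regular expression idea as the diff (remove `quote ... quote` runs, then count) but with `gsub` instead of `split(...).join`.

```ruby
# Hypothetical helper: counts occurrences of `delimiter` that are
# outside of quoted sections of `line`.
def count_unquoted(line, delimiter, quote_char = '"')
  q = Regexp.escape(quote_char)
  unquoted = line.gsub(/#{q}[^#{q}]*#{q}/, '')  # drop every quoted run
  unquoted.scan(delimiter).count                # String pattern matches literally
end

line = '"Smith; John",42,"a;b;c"'

puts line.scan(';').count        # naive count also sees semicolons inside quotes
puts count_unquoted(line, ';')   # only semicolons outside quotes
puts count_unquoted(line, ',')   # commas outside quotes
```

With the naive count, `;` would outscore `,` on this line even though the real separator is the comma; counting only unquoted text avoids that misdetection.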

data/lib/smarter_csv/errors.rb CHANGED

@@ -11,6 +11,7 @@ module SmarterCSV
   class MissingKeys < SmarterCSVException; end # previously known as MissingHeaders
   class NoColSepDetected < SmarterCSVException; end
   class KeyMappingError < SmarterCSVException; end
+  class MalformedCSV < SmarterCSVException; end
   # Writer:
   class InvalidInputData < SmarterCSVException; end
 end

data/lib/smarter_csv/options.rb CHANGED

@@ -26,6 +26,7 @@ module SmarterCSV
       invalid_byte_sequence: '',
       keep_original_headers: false,
       key_mapping: nil,
+      missing_header_prefix: 'column_',
       quote_char: '"',
       remove_empty_hashes: true,
       remove_empty_values: true,
@@ -37,6 +38,7 @@ module SmarterCSV
       row_sep: :auto, # was: $/,
       silence_missing_keys: false,
       skip_lines: nil,
+      strict: false,
       strings_as_keys: false,
       strip_chars_from_headers: nil,
       strip_whitespace: true,
@@ -50,6 +52,18 @@ module SmarterCSV
     def process_options(given_options = {})
       puts "User provided options:\n#{pp(given_options)}\n" if given_options[:verbose]
+      # Special case for :user_provided_headers:
+      #
+      # If we would use the default `headers_in_file: true`, and `:user_provided_headers` are given,
+      # we could lose the first data row
+      #
+      # We now err on the side of treating an actual header as data, rather than losing a data row.
+      #
+      if given_options[:user_provided_headers] && !given_options.keys.include?(:headers_in_file)
+        given_options[:headers_in_file] = false
+        puts "WARNING: setting `headers_in_file: false` as a precaution to not lose the first row. Set explicitly to `true` if you have headers."
+      end
       @options = DEFAULT_OPTIONS.dup.merge!(given_options)
       # fix invalid input
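
The special case in this hunk boils down to a three-way rule for `headers_in_file`. A minimal sketch of that rule follows, assuming a trimmed-down defaults hash (`DEFAULTS_SKETCH`) and a hypothetical helper `resolve_options`. Unlike the hunk, the sketch copies the given hash instead of modifying it in place, and it prints no warning.

```ruby
# Trimmed-down stand-in for the library's default options.
DEFAULTS_SKETCH = { headers_in_file: true, user_provided_headers: nil }.freeze

# Hypothetical helper: user-provided headers imply `headers_in_file: false`
# unless the caller set `headers_in_file` explicitly.
def resolve_options(given = {})
  given = given.dup # do not mutate the caller's hash
  if given[:user_provided_headers] && !given.key?(:headers_in_file)
    given[:headers_in_file] = false
  end
  DEFAULTS_SKETCH.merge(given)
end

p resolve_options[:headers_in_file]                                   # plain default
p resolve_options(user_provided_headers: %i[a b])[:headers_in_file]   # implied value
p resolve_options(user_provided_headers: %i[a b],
                  headers_in_file: true)[:headers_in_file]            # explicit override
```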

data/lib/smarter_csv/parser.rb CHANGED

@@ -7,6 +7,8 @@ module SmarterCSV
     ###
     ### Thin wrapper around C-extension
     ###
+    ### NOTE: we are no longer passing-in header_size
+    ###
     def parse(line, options, header_size = nil)
       # puts "SmarterCSV.parse OPTIONS: #{options[:acceleration]}" if options[:verbose]
@@ -31,59 +33,83 @@ module SmarterCSV
     # - we are not assuming that quotes inside a fields need to be doubled
     # - we are not assuming that all fields need to be quoted (0 is even)
     # - works with multi-char col_sep
-    # - if header_size is given, only up to header_size fields are parsed
     #
-    # We use header_size for parsing the body lines to make sure we always match the number of headers
-    # in case there are trailing col_sep characters in line
+    # NOTE: we are no longer passing-in header_size
     #
-    # Our convention is that empty fields are returned as empty strings, not as nil.
+    # - if header_size was given, only up to header_size fields are parsed
     #
+    #     We used header_size for parsing the body lines to make sure we always match the number of headers
+    #     in case there are trailing col_sep characters in line
     #
-    # the purpose of the max_size parameter is to handle a corner case where
-    # CSV lines contain more fields than the header.
-    # In which case the remaining fields in the line are ignored
+    #     the purpose of the max_size parameter was to handle a corner case where
+    #     CSV lines contain more fields than the header. In which case the remaining fields in the line were ignored
     #
+    # Our convention is that empty fields are returned as empty strings, not as nil.
     def parse_csv_line_ruby(line, options, header_size = nil)
-      return [] if line.nil?
+      return [[], 0] if line.nil?
       line_size = line.size
       col_sep = options[:col_sep]
       col_sep_size = col_sep.size
       quote = options[:quote_char]
-      quote_count = 0
       elements = []
       start = 0
       i = 0
-      previous_char = ''
+      backslash_count = 0
+      in_quotes = false
       while i < line_size
-        if line[i...i+col_sep_size] == col_sep && quote_count.even?
+        # Check if the current position matches the column separator and we're not inside quotes
+        if line[i...i+col_sep_size] == col_sep && !in_quotes
           break if !header_size.nil? && elements.size >= header_size
           elements << cleanup_quotes(line[start...i], quote)
-          previous_char = line[i]
-          i += col_sep.size
+          i += col_sep_size
           start = i
+          backslash_count = 0 # Reset backslash count at the start of a new field
         else
-          quote_count += 1 if line[i] == quote && previous_char != '\\'
-          previous_char = line[i]
+          if line[i] == '\\'
+            backslash_count += 1
+          else
+            if line[i] == quote
+              if backslash_count % 2 == 0
+                # Even number of backslashes means quote is not escaped
+                in_quotes = !in_quotes
+              end
+              # Else, quote is escaped; do nothing
+            end
+            backslash_count = 0 # Reset after any character other than backslash
+          end
           i += 1
         end
       end
-      elements << cleanup_quotes(line[start..-1], quote) if header_size.nil? || elements.size < header_size
+      # Check for unclosed quotes at the end of the line
+      if in_quotes
+        raise MalformedCSV, "Unclosed quoted field detected in line: #{line}"
+      end
+      # Process the remaining field
+      if header_size.nil? || elements.size < header_size
+        elements << cleanup_quotes(line[start..-1], quote)
+      end
       [elements, elements.size]
     end
     def cleanup_quotes(field, quote)
       return field if field.nil?
-      # return if field !~ /#{quote}/ # this check can probably eliminated
+      # Remove surrounding quotes if present
       if field.start_with?(quote) && field.end_with?(quote)
-        field.delete_prefix!(quote)
-        field.delete_suffix!(quote)
+        field = field[1..-2]
       end
-      field.gsub!("#{quote}#{quote}", quote)
+      # Replace double quotes with a single quote
+      field.gsub!("#{quote * 2}", quote)
       field
     end
   end
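
The backslash-parity rule used by the new parser (a quote toggles the quoted state only when it is preceded by an even number of backslashes) can be isolated in a few lines. The helper `unclosed_quote?` below is a hypothetical simplification: it ignores column separators entirely and only reports whether a line ends inside a quoted field.

```ruby
# Hypothetical helper: true if `line` ends inside a quoted field, treating a
# quote preceded by an odd number of backslashes as escaped.
def unclosed_quote?(line, quote = '"')
  in_quotes = false
  backslashes = 0
  line.each_char do |ch|
    if ch == '\\'
      backslashes += 1
    else
      in_quotes = !in_quotes if ch == quote && backslashes.even?
      backslashes = 0 # reset after any character other than a backslash
    end
  end
  in_quotes
end

p unclosed_quote?('a,"b,c",d')      # balanced quotes
p unclosed_quote?('a,"b,c')         # opening quote never closed
p unclosed_quote?('a,"b \" c",d')   # escaped quote inside a quoted field
```

In the parser itself, ending a line in this state is what now raises `SmarterCSV::MalformedCSV`.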

data/lib/smarter_csv/reader.rb CHANGED

@@ -62,7 +62,8 @@ module SmarterCSV
         skip_lines(fh, options)
-        @headers, header_size = process_headers(fh, options)
+        # NOTE: we are no longer using header_size
+        @headers, _header_size = process_headers(fh, options)
         @headerA = @headers # @headerA is deprecated, use @headers
         puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
@@ -97,14 +98,23 @@ module SmarterCSV
           multiline = count_quote_chars(line, options[:quote_char]).odd?
           while multiline
-            next_line = fh.readline(options[:row_sep])
-            next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
-            line += next_line
-            @file_line_count += 1
-            break if fh.eof? # Exit loop if end of file is reached
-            multiline = count_quote_chars(line, options[:quote_char]).odd?
+            begin
+              next_line = fh.readline(options[:row_sep])
+              next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
+              line += next_line
+              @file_line_count += 1
+              multiline = count_quote_chars(line, options[:quote_char]).odd?
+            rescue EOFError
+              # End of file reached. Check if quotes are balanced.
+              total_quotes = count_quote_chars(line, options[:quote_char])
+              if total_quotes.odd?
+                raise MalformedCSV, "Unclosed quoted field detected in multiline data"
+              else
+                # Quotes are balanced; proceed without raising an error.
+                break
+              end
+            end
           end
           # :nocov:
@@ -116,7 +126,18 @@ module SmarterCSV
           line.chomp!(options[:row_sep])
           # --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
-          dataA, _data_size = parse(line, options, header_size)
+          dataA, data_size = parse(line, options) # we parse the extra columns
+          if options[:strict]
+            raise SmarterCSV::HeaderSizeMismatch, "extra columns detected on line #{@file_line_count}"
+          else
+            # we create additional columns on-the-fly
+            current_size = @headers.size
+            while current_size < data_size
+              @headers << "#{options[:missing_header_prefix]}#{current_size + 1}".to_sym
+              current_size += 1
+            end
+          end
           dataA.map!{|x| x.strip} if options[:strip_whitespace]
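
As shown in this hunk, the `:strict` branch raises without comparing `data_size` to the number of headers. The changelog describes the intended behavior as raising only for rows with extra columns and otherwise generating `column_N` headers. A minimal sketch of that described behavior, using a hypothetical helper `reconcile_headers` and a stand-in error class rather than the library's own code:

```ruby
# Stand-in error class so the sketch runs without the gem.
class ExtraColumnsSketch < StandardError; end

# Hypothetical helper: returns the headers to use for a row with `data_size`
# fields. Extra columns either raise (strict) or get generated names.
def reconcile_headers(headers, data_size, strict: false, prefix: 'column_', line_number: nil)
  return headers if data_size <= headers.size # nothing extra: headers unchanged

  raise ExtraColumnsSketch, "extra columns detected on line #{line_number}" if strict

  # Name each extra column after its 1-based position, e.g. :column_3
  extra = ((headers.size + 1)..data_size).map { |n| :"#{prefix}#{n}" }
  headers + extra
end

p reconcile_headers(%i[first last], 4)                # two generated headers appended
p reconcile_headers(%i[first last], 2, strict: true)  # no extra columns, no error
```

Checking the size first keeps strict mode from rejecting well-formed rows; whether the released code behaves this way is not something this diff confirms.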

data/lib/smarter_csv/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module SmarterCSV
-  VERSION = "1.12.0"
+  VERSION = "1.13.0"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.12.0
+  version: 1.13.0
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-07-10 00:00:00.000000000 Z
+date: 2024-11-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print