smarter_csv 1.12.0 → 1.13.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: e37441fcb5fcb55c507df960d4472d085b6b8ab207596e0c723b1c7ed868bb90
4
- data.tar.gz: ec554fd545805f48838000446af1749b2adaa4c8e3fb31b3ca146aa3d9b91fad
3
+ metadata.gz: c28d21a143e743de4b21d8ad93d860b1d51424e525e9ec0a73bb640b170d9823
4
+ data.tar.gz: e7a16ae8494196b85d9a196d071f09b302583c9ef3a414f09f5ad6ae1f11c29b
5
5
  SHA512:
6
- metadata.gz: 55842abeea7fa20b4811c8d1021a054829abe0dcd9e808e669ebcf8b17457979c66e7bf8110e0a5a07f224e2ca6371b98b1929b678c488f9499a845e733efb17
7
- data.tar.gz: 8945d14497a08fef63b7908b10a9a8d483864b065c3b0fdd26497e6826733196fc76fa69c03008d48dbf233d382e552d6d3a56b999536278ebf33f48a5eb0c03
6
+ metadata.gz: 319ee5aed33630e9670a1c95cc8da6fd57df9d1d7db57a00af79c1e5c10de56b4e9054c86b6b462ebdc693513a79aff2c881a1ede00ad28e5da58768b4a6f2cf
7
+ data.tar.gz: 6b36378d3a15ed9065c697f56f2cafc359d2c746e5796780a276c1a87c6a04be38616205f31ec9341412f3c9a4f52d150ce4ef95c0e76f340368d5683b1452e6
data/CHANGELOG.md CHANGED
@@ -1,6 +1,48 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
+ ## 1.13.0 (2024-11-06) ⚡ POTENTIALLY BREAKING ⚡
5
+
6
+ CHANGED DEFAULT BEHAVIOR
7
+ ========================
8
+ The changes are to improve robustness and to reduce the risk of data loss
9
+
10
+ * implementing auto-detection of extra columns (thanks to James Fenley)
11
+
12
+ * improved handling of unbalanced quote_char in input ([issue 288](https://github.com/tilo/smarter_csv/issues/288), thanks to Simon Rentzke; [issue 283](https://github.com/tilo/smarter_csv/issues/283), thanks to James Fenley, Randall B, Matthew Kennedy)
13
+ -> SmarterCSV will now raise `SmarterCSV::MalformedCSV` for unbalanced quote_char.
14
+
15
+ * bugfix / improved handling of extra columns in input data ([issue 284](https://github.com/tilo/smarter_csv/issues/284)) (thanks to James Fenley)
16
+
17
+ * previous behavior:
18
+ when a CSV row had more columns than listed in the header, the additional columns were ignored
19
+
20
+ * new behavior:
21
+ * new default behavior is to auto-generate additional headers, e.g. :column_7, :column_8, etc
22
+ * you can set option `:strict` to true in order to get a `SmarterCSV::MalformedCSV` exception instead
23
+
24
+ * setting `user_provided_headers` now implies `headers_in_file: false` ([issue 282](https://github.com/tilo/smarter_csv/issues/282))
25
+
26
+ The option `user_provided_headers` can be used to specify headers when there are none in the input, OR to completely override headers that are in the input (file).
27
+
28
+ SmarterCSV is now using a safer default behavior.
29
+
30
+ * previous behavior:
31
+ Setting `user_provided_headers` did not change the default `headers_in_file: true`
32
+ If the input had no headers, this would cause the first line to be erroneously treated as a header, and the user could lose the first row of data.
33
+
34
+ * new behavior:
35
+ Setting `user_provided_headers` sets `headers_in_file: false`
36
+ a) Improved behavior if there was no header in the input data.
37
+ b) If there was a header in the input data, and `user_provided_headers` is used to override the headers in the file, then please explicitly specify `headers_in_file: true`, otherwise you will get an extra hash which includes the header data.
38
+
39
+ If you set `user_provided_headers` and the file has a header, then provide `headers_in_file: true` to avoid getting that extra record.
40
+
41
+ * handling of numeric columns with leading zeroes, e.g. ZIP codes. ([issue #151](https://github.com/tilo/smarter_csv/issues/151) thanks to David Moles). `convert_values_to_numeric: { except: [:zip] }` will now return a string for that column instead.
42
+
43
+ ## 1.12.1 (2024-07-10)
44
+ * Improved column separator detection by ignoring quoted sections [#276](https://github.com/tilo/smarter_csv/pull/276) (thanks to Nicolas Castellanos)
45
+
4
46
  ## 1.12.0 (2024-07-09)
5
47
  * Added Thread-Safety: added SmarterCSV::Reader to process CSV files in a thread-safe manner ([issue #277](https://github.com/tilo/smarter_csv/pull/277))
6
48
  * SmarterCSV::Writer changed default row separator to the system's row separator (`\n` on Linux, `\r\n` on Windows)
data/CONTRIBUTORS.md CHANGED
@@ -53,3 +53,8 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
53
53
  * [JP Camara](https://github.com/jpcamara)
54
54
  * [Kenton Hirowatari](https://github.com/hirowatari)
55
55
  * [Daniel Pepper](https://github.com/dpep)
56
+ * [Nicolas Castellanos](https://github.com/nicastelo)
57
+ * [James Fenley](https://github.com/rex-remind101)
58
+ * [Simon Rentzke](https://github.com/simonrentzke)
59
+ * [Randall B](https://github.com/randall-coding)
60
+ * [Matthew Kennedy](https://github.com/MattKitmanLabs)
@@ -26,6 +26,15 @@ It removes any values which are `nil` or would be empty strings.
26
26
  `convert_values_to_numeric` is enabled by default.
27
27
  SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
28
28
 
29
+ Here is an example of using `convert_values_to_numeric` for numbers with leading zeros, e.g. ZIP codes:
30
+
31
+ ```
32
+ data = SmarterCSV.process('/tmp/zip.csv', convert_values_to_numeric: { except: [:zip] })
33
+ => [{:zip=>"00480"}, {:zip=>"51903"}, {:zip=>"12354"}, {:zip=>"02343"}]
34
+ ```
35
+
36
+ This will return the column `:zip` as a string with all digits intact.
37
+
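To illustrate what the `except:` form means for a row, here is a minimal pure-Ruby sketch. It is not SmarterCSV's implementation: `convert_numeric` is a hypothetical helper and the regular expressions are assumptions chosen for this example.

```ruby
# Hypothetical helper, NOT part of SmarterCSV: converts numeric-looking
# strings to Integer/Float, but leaves the keys listed in `except` untouched.
def convert_numeric(row, except: [])
  row.each_with_object({}) do |(key, value), result|
    result[key] =
      if except.include?(key) || !value.is_a?(String)
        value                      # excluded key or non-string: keep as-is
      elsif value.match?(/\A[+-]?\d+\z/)
        value.to_i                 # integer-looking string
      elsif value.match?(/\A[+-]?\d+\.\d+\z/)
        value.to_f                 # float-looking string
      else
        value
      end
  end
end

row = { zip: "00480", count: "12", price: "4.50" }
converted = convert_numeric(row, except: [:zip])
# converted == { zip: "00480", count: 12, price: 4.5 }
```

Without the exclusion, `"00480"` would become the integer `480` and lose its leading zeros.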
29
38
  ## Remove Zero Values
30
39
  `remove_zero_values` is disabled by default.
31
40
  When enabled, it removes key/value pairs which have a numeric value equal to zero.
@@ -44,7 +53,7 @@ It can happen that after all transformations, a row of the CSV file would produc
44
53
 
45
54
  By default SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.
46
55
 
47
- This can be set to `true`, to keep these empty hashes in the results.
56
+ This can be set to `false`, to keep these empty hashes in the results.
48
57
 
49
58
  -------------------
50
59
  PREVIOUS: [Header Validations](./header_validations.md) | NEXT: [Value Converters](./value_converters.md)
@@ -64,6 +64,8 @@ If you want to have an underscore between the header and the number, you can set
64
64
  => [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}]
65
65
  ```
66
66
 
67
+ If you set `duplicate_header_suffix: nil`, you get the same behavior as earlier versions, which raised the `SmarterCSV::DuplicateHeaders` error.
68
+
67
69
  ## Key Mapping
68
70
 
69
71
  The above example already illustrates how intermediate keys can be mapped into something different.
data/docs/options.md CHANGED
@@ -41,7 +41,7 @@
41
41
  | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
42
42
  | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g. /\A#/ |
43
43
  ---------------------------------------------------------------------------------------------------------------------------------
44
- | :col_sep | :auto | column separator (default was ',') |
44
+ | :col_sep | :auto | column separator (default was ',') |
45
45
  | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
46
46
  | | | e.g. when :quote_char is not properly escaped |
47
47
  | :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
@@ -49,9 +49,10 @@
49
49
  | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
50
50
  | :quote_char | '"' | quotation character |
51
51
  ---------------------------------------------------------------------------------------------------------------------------------
52
- | :headers_in_file | true | Whether or not the file contains headers as the first line. |
53
- | | | Important if the file does not contain headers, |
54
- | | | otherwise you would lose the first line of data. |
52
+ | :headers_in_file | true(1) | Whether or not the file contains headers as the first line. |
53
+ | | | (1): if `user_provided_headers` is given, the default is `false`, |
54
+ | | | unless you specify it to be explicitly `true`. |
55
+ | | | This prevents losing the first line of data, which is otherwise assumed to be a header. |
55
56
  | :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
56
57
  | | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
57
58
  | :user_provided_headers | nil | *careful with that axe!* |
@@ -61,6 +62,8 @@
61
62
  | :remove_empty_hashes | true | remove / ignore any hashes which don't have any key/value pairs or all empty values |
62
63
  | :verbose | false | print out line number while processing (to track down problems in input files) |
63
64
  | :with_line_numbers | false | add :csv_line_number to each data hash |
65
+ | :missing_header_prefix | column_ | prefix for the auto-generated headers of extra columns; can be set to any string |
66
+ | :strict | false | When set to `true`, extra columns will raise MalformedCSV exception |
64
67
  ---------------------------------------------------------------------------------------------------------------------------------
65
68
 
66
69
  Additional 1.x Options which may be replaced in 2.0
@@ -71,11 +74,11 @@ There have been a lot of 1-offs and feature creep around these options, and goin
71
74
  | Option | Default | Explanation |
72
75
  ---------------------------------------------------------------------------------------------------------------------------------
73
76
  | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
74
- | :silence_missing_keys | false | ignore missing keys in `key_mapping` |
75
- | | | if set to true: makes all mapped keys optional |
77
+ | :silence_missing_keys | false | ignore missing keys in `key_mapping` |
78
+ | | | if set to true: makes all mapped keys optional |
76
79
  | | | if given an array, makes only the keys listed in it optional |
77
- | :required_keys | nil | An array. Specify the required names AFTER header transformation. |
78
- | :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
80
+ | :required_keys | nil | An array. Specify the required names AFTER header transformation. |
81
+ | :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
79
82
  | | | or an exception is raised. No validation if nil is given. |
80
83
  | :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
81
84
  | :downcase_header | true | downcase all column headers |
@@ -21,10 +21,10 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
21
21
 
22
22
  ```ruby
23
23
  $ cat spec/fixtures/with_dates.csv
24
- first,last,date,price
25
- Ben,Miller,10/30/1998,$44.50
26
- Tom,Turner,2/1/2011,$15.99
27
- Ken,Smith,01/09/2013,$199.99
24
+ first,last,date,price,member
25
+ Ben,Miller,10/30/1998,$44.50,TRUE
26
+ Tom,Turner,2/1/2011,$15.99,False
27
+ Ken,Smith,01/09/2013,$199.99,true
28
28
 
29
29
  $ irb
30
30
  > require 'smarter_csv'
@@ -51,7 +51,20 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
51
51
  end
52
52
  end
53
53
 
54
- options = {:value_converters => {:date => DateConverter, :price => DollarConverter}}
54
+ class BooleanConverter
55
+ def self.convert(value)
56
+ case value
57
+ when /true/i
58
+ true
59
+ when /false/i
60
+ false
61
+ else
62
+ nil
63
+ end
64
+ end
65
+ end
66
+
67
+ options = {value_converters: {date: DateConverter, price: DollarConverter, member: BooleanConverter}}
55
68
  data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
56
69
  first_record = data.first
57
70
  first_record[:date]
@@ -62,6 +75,8 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
62
75
  => 44.50
63
76
  first_record[:price].class
64
77
  => Float
78
+ first_record[:member]
79
+ => true
65
80
  ```
66
81
 
67
82
  --------------------
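Note that the `BooleanConverter` in the example matches `/true/i` and `/false/i` anywhere in the value, so a string such as `"untrue"` would also convert to `true`. If that is not wanted, an anchored variant could look like the following sketch. `StrictBooleanConverter` is a hypothetical name; it only relies on the `self.convert(value)` convention shown in the example.

```ruby
# Hypothetical stricter converter: only exact "true"/"false"
# (case-insensitive, surrounding whitespace ignored) are converted.
class StrictBooleanConverter
  def self.convert(value)
    case value.to_s.strip
    when /\Atrue\z/i then true
    when /\Afalse\z/i then false
    end # any other value falls through and returns nil
  end
end

StrictBooleanConverter.convert("TRUE")   # true
StrictBooleanConverter.convert("untrue") # nil
```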
@@ -9,9 +9,10 @@
9
9
  #define true ((bool)1)
10
10
  #endif
11
11
 
12
- /*
13
- max_size: pass nil if no limit is specified
14
- */
12
+ VALUE SmarterCSV = Qnil;
13
+ VALUE eMalformedCSVError = Qnil;
14
+ VALUE Parser = Qnil;
15
+
15
16
  static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) {
16
17
  if (RB_TYPE_P(line, T_NIL) == 1) {
17
18
  return rb_ary_new();
@@ -24,7 +25,7 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
24
25
  rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
25
26
  char *startP = RSTRING_PTR(line); /* may not be null terminated */
26
27
  long line_len = RSTRING_LEN(line);
27
- char *endP = startP + line_len ; /* points behind the string */
28
+ char *endP = startP + line_len; /* points behind the string */
28
29
  char *p = startP;
29
30
 
30
31
  char *col_sepP = RSTRING_PTR(col_sep);
@@ -39,18 +40,19 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
39
40
  VALUE field;
40
41
  long i;
41
42
 
42
- char prev_char = '\0'; // Store the previous character for comparison against an escape character
43
- long backslash_count = 0; // to count consecutive backslash characters
43
+ /* Variables for escaped quote handling */
44
+ long backslash_count = 0;
45
+ bool in_quotes = false;
44
46
 
45
47
  while (p < endP) {
46
48
  /* does the remaining string start with col_sep ? */
47
49
  col_sep_found = true;
48
- for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
50
+ for(i=0; (i < col_sep_len) && (p+i < endP); i++) {
49
51
  col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
50
52
  }
51
- /* if col_sep was found and we have even quotes */
52
- if (col_sep_found && (quote_count % 2 == 0)) {
53
- /* if max_size != nil && lements.size >= header_size */
53
+ /* if col_sep was found and we're not inside quotes */
54
+ if (col_sep_found && !in_quotes) {
55
+ /* if max_size != nil && elements.size >= header_size */
54
56
  if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
55
57
  break;
56
58
  } else {
@@ -60,22 +62,30 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
60
62
 
61
63
  p += col_sep_len;
62
64
  startP = p;
65
+ backslash_count = 0; // Reset backslash count at the start of a new field
63
66
  }
64
67
  } else {
65
68
  if (*p == '\\') {
66
69
  backslash_count++;
67
70
  } else {
68
- if (*p == *quoteP && (backslash_count % 2 == 0)) {
69
- quote_count++;
71
+ if (*p == *quoteP) {
72
+ if (backslash_count % 2 == 0) {
73
+ /* Even number of backslashes means quote is not escaped */
74
+ in_quotes = !in_quotes;
75
+ }
76
+ /* Else, quote is escaped; do nothing */
70
77
  }
71
- backslash_count = 0; // no more consecutive backslash characters
78
+ backslash_count = 0; // Reset after any character other than backslash
72
79
  }
73
80
  p++;
74
81
  }
75
-
76
- prev_char = *(p - 1); // Update the previous character
77
82
  } /* while */
78
83
 
84
+ /* Check for unclosed quotes at the end of the line */
85
+ if (in_quotes) {
86
+ rb_raise(eMalformedCSVError, "Unclosed quoted field detected in line: %s", StringValueCStr(line));
87
+ }
88
+
79
89
  /* check if the last part of the line needs to be processed */
80
90
  if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
81
91
  /* copy the remaining line as a field with original encoding onto the results */
@@ -86,12 +96,11 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
86
96
  return elements;
87
97
  }
88
98
 
89
- VALUE SmarterCSV = Qnil;
90
- VALUE Parser = Qnil;
91
-
92
99
  void Init_smarter_csv(void) {
93
- SmarterCSV = rb_define_module("SmarterCSV");
94
- Parser = rb_define_module_under(SmarterCSV, "Parser");
100
+ // these modules and the error class are already defined in Ruby code, make them accessible:
101
+ SmarterCSV = rb_const_get(rb_cObject, rb_intern("SmarterCSV"));
102
+ Parser = rb_const_get(SmarterCSV, rb_intern("Parser"));
103
+ eMalformedCSVError = rb_const_get(SmarterCSV, rb_intern("MalformedCSV"));
95
104
 
96
105
  rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 4);
97
106
  }
@@ -13,13 +13,17 @@ module SmarterCSV
13
13
  delimiters = [',', "\t", ';', ':', '|']
14
14
 
15
15
  line = nil
16
+ escaped_quote = Regexp.escape(options[:quote_char])
16
17
  has_header = options[:headers_in_file]
17
18
  candidates = Hash.new(0)
18
19
  count = has_header ? 1 : 5
19
20
  count.times do
20
21
  line = readline_with_counts(filehandle, options)
21
22
  delimiters.each do |d|
22
- candidates[d] += line.scan(d).count
23
+ # Count only non-quoted occurrences of the delimiter
24
+ non_quoted_text = line.split(/#{escaped_quote}[^#{escaped_quote}]*#{escaped_quote}/).join
25
+
26
+ candidates[d] += non_quoted_text.scan(d).count
23
27
  end
24
28
  rescue EOFError # short files
25
29
  break
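The effect of ignoring quoted sections can be seen in isolation with a small standalone sketch. `count_unquoted_delimiters` is a hypothetical helper written for this illustration; it reuses the same split-and-join idea as the change above, but is not the library's method.

```ruby
# Hypothetical helper: counts delimiter candidates in a line,
# ignoring anything between a pair of quote characters.
def count_unquoted_delimiters(line, delimiters: [',', "\t", ';', ':', '|'], quote_char: '"')
  q = Regexp.escape(quote_char)
  # Drop every quoted section, then count in what is left.
  unquoted = line.split(/#{q}[^#{q}]*#{q}/).join
  delimiters.each_with_object(Hash.new(0)) do |d, counts|
    counts[d] += unquoted.scan(d).count # String#scan with a String matches literally
  end
end

line = 'name;"Doe, John; Jr.";city'
counts = count_unquoted_delimiters(line)
# counts[';'] == 2 and counts[','] == 0, whereas a plain
# line.scan(',').count would report 1 and line.scan(';').count 3
```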
@@ -11,6 +11,7 @@ module SmarterCSV
11
11
  class MissingKeys < SmarterCSVException; end # previously known as MissingHeaders
12
12
  class NoColSepDetected < SmarterCSVException; end
13
13
  class KeyMappingError < SmarterCSVException; end
14
+ class MalformedCSV < SmarterCSVException; end
14
15
  # Writer:
15
16
  class InvalidInputData < SmarterCSVException; end
16
17
  end
@@ -26,6 +26,7 @@ module SmarterCSV
26
26
  invalid_byte_sequence: '',
27
27
  keep_original_headers: false,
28
28
  key_mapping: nil,
29
+ missing_header_prefix: 'column_',
29
30
  quote_char: '"',
30
31
  remove_empty_hashes: true,
31
32
  remove_empty_values: true,
@@ -37,6 +38,7 @@ module SmarterCSV
37
38
  row_sep: :auto, # was: $/,
38
39
  silence_missing_keys: false,
39
40
  skip_lines: nil,
41
+ strict: false,
40
42
  strings_as_keys: false,
41
43
  strip_chars_from_headers: nil,
42
44
  strip_whitespace: true,
@@ -50,6 +52,18 @@ module SmarterCSV
50
52
  def process_options(given_options = {})
51
53
  puts "User provided options:\n#{pp(given_options)}\n" if given_options[:verbose]
52
54
 
55
+ # Special case for :user_provided_headers:
56
+ #
57
+ # If we would use the default `headers_in_file: true`, and `:user_provided_headers` are given,
58
+ # we could lose the first data row
59
+ #
60
+ # We now err on the side of treating an actual header as data, rather than losing a data row.
61
+ #
62
+ if given_options[:user_provided_headers] && !given_options.keys.include?(:headers_in_file)
63
+ given_options[:headers_in_file] = false
64
+ puts "WARNING: setting `headers_in_file: false` as a precaution to not lose the first row. Set explicitly to `true` if you have headers."
65
+ end
66
+
53
67
  @options = DEFAULT_OPTIONS.dup.merge!(given_options)
54
68
 
55
69
  # fix invalid input
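The new defaulting rule for `headers_in_file` can be sketched on its own. This is a simplified illustration, assuming a trimmed-down defaults hash; `resolve_options` is a hypothetical name, and the warning output of the real code is left out.

```ruby
# Simplified sketch of the defaulting rule, NOT the library's process_options.
def resolve_options(given = {})
  defaults = { headers_in_file: true, user_provided_headers: nil }
  given = given.dup # do not mutate the caller's hash in this sketch

  # Only override the default when the caller did not set headers_in_file.
  if given[:user_provided_headers] && !given.key?(:headers_in_file)
    given[:headers_in_file] = false
  end

  defaults.merge(given)
end

resolve_options(user_provided_headers: [:a, :b])[:headers_in_file]
# false: the first line is treated as data
resolve_options(user_provided_headers: [:a, :b], headers_in_file: true)[:headers_in_file]
# true: the header line in the file is skipped and replaced
```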
@@ -7,6 +7,8 @@ module SmarterCSV
7
7
  ###
8
8
  ### Thin wrapper around C-extension
9
9
  ###
10
+ ### NOTE: we are no longer passing-in header_size
11
+ ###
10
12
  def parse(line, options, header_size = nil)
11
13
  # puts "SmarterCSV.parse OPTIONS: #{options[:acceleration]}" if options[:verbose]
12
14
 
@@ -31,59 +33,83 @@ module SmarterCSV
31
33
  # - we are not assuming that quotes inside a fields need to be doubled
32
34
  # - we are not assuming that all fields need to be quoted (0 is even)
33
35
  # - works with multi-char col_sep
34
- # - if header_size is given, only up to header_size fields are parsed
35
36
  #
36
- # We use header_size for parsing the body lines to make sure we always match the number of headers
37
- # in case there are trailing col_sep characters in line
37
+ # NOTE: we are no longer passing-in header_size
38
38
  #
39
- # Our convention is that empty fields are returned as empty strings, not as nil.
39
+ # - if header_size was given, only up to header_size fields are parsed
40
40
  #
41
+ # We used header_size for parsing the body lines to make sure we always match the number of headers
42
+ # in case there are trailing col_sep characters in line
41
43
  #
42
- # the purpose of the max_size parameter is to handle a corner case where
43
- # CSV lines contain more fields than the header.
44
- # In which case the remaining fields in the line are ignored
44
+ # the purpose of the max_size parameter was to handle a corner case where
45
+ # CSV lines contain more fields than the header. In which case the remaining fields in the line were ignored
45
46
  #
47
+ # Our convention is that empty fields are returned as empty strings, not as nil.
48
+
46
49
  def parse_csv_line_ruby(line, options, header_size = nil)
47
- return [] if line.nil?
50
+ return [[], 0] if line.nil?
48
51
 
49
52
  line_size = line.size
50
53
  col_sep = options[:col_sep]
51
54
  col_sep_size = col_sep.size
52
55
  quote = options[:quote_char]
53
- quote_count = 0
54
56
  elements = []
55
57
  start = 0
56
58
  i = 0
57
59
 
58
- previous_char = ''
60
+ backslash_count = 0
61
+ in_quotes = false
62
+
59
63
  while i < line_size
60
- if line[i...i+col_sep_size] == col_sep && quote_count.even?
64
+ # Check if the current position matches the column separator and we're not inside quotes
65
+ if line[i...i+col_sep_size] == col_sep && !in_quotes
61
66
  break if !header_size.nil? && elements.size >= header_size
62
67
 
63
68
  elements << cleanup_quotes(line[start...i], quote)
64
- previous_char = line[i]
65
- i += col_sep.size
69
+ i += col_sep_size
66
70
  start = i
71
+ backslash_count = 0 # Reset backslash count at the start of a new field
67
72
  else
68
- quote_count += 1 if line[i] == quote && previous_char != '\\'
69
- previous_char = line[i]
73
+ if line[i] == '\\'
74
+ backslash_count += 1
75
+ else
76
+ if line[i] == quote
77
+ if backslash_count % 2 == 0
78
+ # Even number of backslashes means quote is not escaped
79
+ in_quotes = !in_quotes
80
+ end
81
+ # Else, quote is escaped; do nothing
82
+ end
83
+ backslash_count = 0 # Reset after any character other than backslash
84
+ end
70
85
  i += 1
71
86
  end
72
87
  end
73
- elements << cleanup_quotes(line[start..-1], quote) if header_size.nil? || elements.size < header_size
88
+
89
+ # Check for unclosed quotes at the end of the line
90
+ if in_quotes
91
+ raise MalformedCSV, "Unclosed quoted field detected in line: #{line}"
92
+ end
93
+
94
+ # Process the remaining field
95
+ if header_size.nil? || elements.size < header_size
96
+ elements << cleanup_quotes(line[start..-1], quote)
97
+ end
98
+
74
99
  [elements, elements.size]
75
100
  end
76
101
 
77
102
  def cleanup_quotes(field, quote)
78
103
  return field if field.nil?
79
104
 
80
- # return if field !~ /#{quote}/ # this check can probably eliminated
81
-
105
+ # Remove surrounding quotes if present
82
106
  if field.start_with?(quote) && field.end_with?(quote)
83
- field.delete_prefix!(quote)
84
- field.delete_suffix!(quote)
107
+ field = field[1..-2]
85
108
  end
86
- field.gsub!("#{quote}#{quote}", quote)
109
+
110
+ # Replace double quotes with a single quote
111
+ field.gsub!("#{quote * 2}", quote)
112
+
87
113
  field
88
114
  end
89
115
  end
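The `in_quotes` / `backslash_count` state machine is easier to follow on a small standalone example. The sketch below is a reduced version for illustration only: `split_fields` is a hypothetical name, it omits `header_size` and the quote cleanup, and it raises `ArgumentError` where the diff raises `SmarterCSV::MalformedCSV`.

```ruby
# Reduced sketch of quote-aware splitting; fields keep their surrounding quotes.
def split_fields(line, col_sep: ',', quote: '"')
  fields = []
  start = 0
  i = 0
  backslashes = 0
  in_quotes = false

  while i < line.size
    if !in_quotes && line[i, col_sep.size] == col_sep
      fields << line[start...i]
      i += col_sep.size
      start = i
      backslashes = 0 # a new field starts with a clean slate
    else
      if line[i] == '\\'
        backslashes += 1
      else
        # An even number of preceding backslashes means the quote is not escaped.
        in_quotes = !in_quotes if line[i] == quote && backslashes.even?
        backslashes = 0
      end
      i += 1
    end
  end

  raise ArgumentError, "Unclosed quoted field detected in line: #{line}" if in_quotes

  fields << line[start..-1]
end

split_fields('a,"b,c",d') # ["a", "\"b,c\"", "d"]
```

A line such as `'a,"b,c'` ends with `in_quotes` still set, so the sketch raises instead of silently returning a truncated row.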
@@ -62,7 +62,8 @@ module SmarterCSV
62
62
 
63
63
  skip_lines(fh, options)
64
64
 
65
- @headers, header_size = process_headers(fh, options)
65
+ # NOTE: we are no longer using header_size
66
+ @headers, _header_size = process_headers(fh, options)
66
67
  @headerA = @headers # @headerA is deprecated, use @headers
67
68
 
68
69
  puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
@@ -97,14 +98,23 @@ module SmarterCSV
97
98
  multiline = count_quote_chars(line, options[:quote_char]).odd?
98
99
 
99
100
  while multiline
100
- next_line = fh.readline(options[:row_sep])
101
- next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
102
- line += next_line
103
- @file_line_count += 1
104
-
105
- break if fh.eof? # Exit loop if end of file is reached
106
-
107
- multiline = count_quote_chars(line, options[:quote_char]).odd?
101
+ begin
102
+ next_line = fh.readline(options[:row_sep])
103
+ next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
104
+ line += next_line
105
+ @file_line_count += 1
106
+
107
+ multiline = count_quote_chars(line, options[:quote_char]).odd?
108
+ rescue EOFError
109
+ # End of file reached. Check if quotes are balanced.
110
+ total_quotes = count_quote_chars(line, options[:quote_char])
111
+ if total_quotes.odd?
112
+ raise MalformedCSV, "Unclosed quoted field detected in multiline data"
113
+ else
114
+ # Quotes are balanced; proceed without raising an error.
115
+ break
116
+ end
117
+ end
108
118
  end
109
119
 
110
120
  # :nocov:
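The multiline handling above can be tried out with an in-memory file. The following is a minimal sketch, assuming plain `"\n"` row separators and a naive quote count that ignores escaped quotes; `read_logical_line` is a hypothetical helper, and it raises `ArgumentError` in place of `SmarterCSV::MalformedCSV`.

```ruby
require 'stringio'

# Keeps appending physical lines until the quote count is even.
# Hitting end-of-file while the quotes are still unbalanced is an error.
def read_logical_line(io, quote_char: '"')
  line = io.readline
  while line.scan(quote_char).size.odd?
    begin
      line += io.readline
    rescue EOFError
      raise ArgumentError, 'Unclosed quoted field detected in multiline data'
    end
  end
  line
end

io = StringIO.new("a,\"multi\nline\",b\nc,d\n")
first  = read_logical_line(io) # "a,\"multi\nline\",b\n"
second = read_logical_line(io) # "c,d\n"
```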
@@ -116,7 +126,18 @@ module SmarterCSV
116
126
  line.chomp!(options[:row_sep])
117
127
 
118
128
  # --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
119
- dataA, _data_size = parse(line, options, header_size)
129
+ dataA, data_size = parse(line, options) # we parse the extra columns
130
+
131
+ if options[:strict]
132
+ raise SmarterCSV::HeaderSizeMismatch, "extra columns detected on line #{@file_line_count}"
133
+ else
134
+ # we create additional columns on-the-fly
135
+ current_size = @headers.size
136
+ while current_size < data_size
137
+ @headers << "#{options[:missing_header_prefix]}#{current_size + 1}".to_sym
138
+ current_size += 1
139
+ end
140
+ end
120
141
 
121
142
  dataA.map!{|x| x.strip} if options[:strip_whitespace]
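The on-the-fly header generation can be sketched as a small pure function. `extend_headers` is a hypothetical helper, not library code. As an assumption based on the changelog wording, this sketch only raises in strict mode when a row actually has more fields than headers, and it uses `ArgumentError` as a stand-in for the library's exception.

```ruby
# Hypothetical helper: returns the header list extended to cover `data_size` fields.
def extend_headers(headers, data_size, prefix: 'column_', strict: false)
  return headers if data_size <= headers.size # nothing extra in this row

  raise ArgumentError, 'extra columns detected' if strict

  # Number the new headers by their 1-based column position.
  extra = (headers.size + 1..data_size).map { |n| :"#{prefix}#{n}" }
  headers + extra
end

extend_headers([:first, :last], 4)
# [:first, :last, :column_3, :column_4]
```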
122
143
 
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SmarterCSV
4
- VERSION = "1.12.0"
4
+ VERSION = "1.13.0"
5
5
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: smarter_csv
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.12.0
4
+ version: 1.13.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tilo Sloboda
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2024-07-10 00:00:00.000000000 Z
11
+ date: 2024-11-05 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: awesome_print