smarter_csv 1.12.1 → 1.13.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 05aa9e7d2d22ec6e1beb3790e2b727cd3e615cadcd537716f2dfbb190cc87a09
- data.tar.gz: e37b072c7c81a3b6cdc6192ed2bfab046c924f3aa7a8a3e2a66f55fafa25b7ff
+ metadata.gz: 2e785bafb4281cfadba23ef7db89a50a1a22f642ca4dd03cd2e97323a8cfa761
+ data.tar.gz: 2d3b7f87540a0982582859fc804cb50088d9b4906aa75765fa3bf8d4ac535a23
  SHA512:
- metadata.gz: 07c149aaa123ef75fb65fd596fbab64359e24cf2b8606fe406d714358a1c14696fa9ecb420e6dd0a95d40f6af6d41e4988b16df9eac4346d9e1295e3c32f22b1
- data.tar.gz: 71341c1cf1092fabbfe9106ce533adb872e2bc1b0c30fbc032f3ceaea1832e2ddef5d4156f1465658a67dddaae508cd23b12cfe9fdf34edea3f1f3ede0385688
+ metadata.gz: 0d3e04d0ec26174a179d42f577c412483956e4ecf8af0e3b4b27234ed52027b984e26afbadfdb8fc1ee95f77780c6856f824dc1ca6e8eb91a87e7a81d46d40ed
+ data.tar.gz: 379e3a0daa40afb0158fb546cf33fbb33d52eea77847f587349e69e4f9c6f51233b9953d60c5c3cf0a06e4a30115174cbefb81f42f9c5b2ac495d22a1fa7efa1
data/CHANGELOG.md CHANGED
@@ -1,6 +1,48 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
+ ## 1.13.1 (2024-12-12)
5
+ * fix bug with SmarterCSV.generate with `force_quotes: true` ([issue 294](https://github.com/tilo/smarter_csv/issues/294))
6
+
7
+ ## 1.13.0 (2024-11-06) ⚡ POTENTIALLY BREAKING ⚡
8
+
9
+ CHANGED DEFAULT BEHAVIOR
10
+ ========================
11
+ The changes are to improve robustness and to reduce the risk of data loss
12
+
13
+ * implementing auto-detection of extra columns (thanks to James Fenley)
14
+
15
+ * improved handling of unbalanced quote_char in input ([issue 288](https://github.com/tilo/smarter_csv/issues/288)) thanks to Simon Rentzke), and ([issue 283](https://github.com/tilo/smarter_csv/issues/283)) thanks to James Fenley, Randall B, Matthew Kennedy)
16
+ -> SmarterCSV will now raise `SmarterCSV::MalformedCSV` for unbalanced quote_char.
17
+
18
+ * bugfix / improved handling of extra columns in input data ([issue 284](https://github.com/tilo/smarter_csv/issues/284)) (thanks to James Fenley)
19
+
20
+ * previous behavior:
21
+ when a CSV row had more columns than listed in the header, the additional columns were ignored
22
+
23
+ * new behavior:
24
+ * new default behavior is to auto-generate additional headers, e.g. :column_7, :column_8, etc
25
+ * you can set option `:strict` to true in order to get a `SmarterCSV::MalformedCSV` exception instead
26
+
27
+ * setting `user_provided_headers` now implies `headers_in_file: false` ([issue 282](https://github.com/tilo/smarter_csv/issues/282))
28
+
29
+ The option `user_provided_headers` can be used to specify headers when there are none in the input, OR to completely override headers that are in the input (file).
30
+
31
+ SmarterCSV is now using a safer default behavior.
32
+
33
+ * previous behavior:
34
+ Setting `user_provided_headers` did not change the default `headers_in_file: true`
35
+ If the input had no headers, this would cause the first line to be erroneously treated as a header, and the user could lose the first row of data.
36
+
37
+ * new behavior:
38
+ Setting `user_provided_headers` sets`headers_in_file: false`
39
+ a) Improved behavior if there was no header in the input data.
40
+ b) If there was a header in the input data, and `user_provided_headers` is used to override the headers in the file, then please explicitly specify `headers_in_file: true`, otherwise you will get an extra hash which includes the header data.
41
+
42
+ IF you set `user_provided_headers` and the file has a header, then provide `headers_in_file: true` to avoid getting that extra record.
43
+
44
+ * improved documentation for handling of numeric columns with leading zeroes, e.g. ZIP codes. ([issue #151](https://github.com/tilo/smarter_csv/issues/151) thanks to David Moles). `convert_values_to_numeric: { except: [:zip] }` will return a string for that column instead (since version 1.10.x)
45
+
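The new default for extra columns described in the 1.13.0 entry can be sketched in plain Ruby. This is only an illustration of the idea: `extend_headers` is a hypothetical helper, not part of the gem's API, and `ArgumentError` stands in for whichever exception the gem raises in strict mode.

```ruby
# Hypothetical helper (an assumption for illustration, not SmarterCSV API):
# grows the header list when a row has more fields than headers.
def extend_headers(headers, data_size, prefix: 'column_', strict: false)
  if data_size > headers.size
    # Stand-in error; the gem raises its own exception class in strict mode.
    raise ArgumentError, 'extra columns detected' if strict

    # Generate :column_3, :column_4, ... for the unexpected positions.
    ((headers.size + 1)..data_size).each { |n| headers << "#{prefix}#{n}".to_sym }
  end
  headers
end

headers = [:first_name, :last_name]
row     = ['Carl', 'Sagan', 'astronomer', '1934']

extend_headers(headers, row.size)
record = headers.zip(row).to_h
```

In this sketch the third and fourth fields end up under `:column_3` and `:column_4` instead of being dropped, which mirrors the "reduce the risk of data loss" goal stated above.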
4
46
  ## 1.12.1 (2024-07-10)
5
47
  * Improved column separator detection by ignoring quoted sections [#276](https://github.com/tilo/smarter_csv/pull/276) (thanks to Nicolas Castellanos)
6
48
 
data/CONTRIBUTORS.md CHANGED
@@ -54,3 +54,7 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
  * [Kenton Hirowatari](https://github.com/hirowatari)
  * [Daniel Pepper](https://github.com/dpep)
  * [Nicolas Castellanos](https://github.com/nicastelo)
+ * [James Fenley](https://github.com/rex-remind101)
+ * [Simon Rentzke](https://github.com/simonrentzke)
+ * [Randall B](https://github.com/randall-coding)
+ * [Matthew Kennedy](https://github.com/MattKitmanLabs)
data/README.md CHANGED
@@ -1,7 +1,7 @@
1
1
 
2
2
  # SmarterCSV
3
3
 
4
- [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
4
+ [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) ![Gem Version](https://img.shields.io/gem/v/smarter_csv) [View on RubyGems](https://rubygems.org/gems/smarter_csv) [View on RubyToolbox](https://www.ruby-toolbox.com/search?q=smarter_csv)
5
5
 
6
6
  SmarterCSV provides a convenient interface for reading and writing CSV files and data.
7
7
 
@@ -47,7 +47,7 @@ Or install it yourself as:
47
47
  # Articles
48
48
  * [Parsing CSV Files in Ruby with SmarterCSV](https://tilo-sloboda.medium.com/parsing-csv-files-in-ruby-with-smartercsv-6ce66fb6cf38)
49
49
  * [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
50
- * [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
50
+ * [Faster Parsing CSV with Parallel Processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing) by [Jack lin](https://github.com/xjlin0/)
51
51
  * [The original post](http://www.unixgods.org/Ruby/process_csv_as_hashes.html) that started SmarterCSV
52
52
 
53
53
  # [ChangeLog](./CHANGELOG.md)
@@ -26,6 +26,15 @@ It removes any values which are `nil` or would be empty strings.
26
26
  `convert_values_to_numeric` is enabled by default.
27
27
  SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
28
28
 
29
+ Here is an example of using `convert_values_to_numeric` for numbers with leading zeros, e.g. ZIP codes:
30
+
31
+ ```
32
+ data = SmarterCSV.process('/tmp/zip.csv', convert_values_to_numeric: { except: [:zip] })
33
+ => [{:zip=>"00480"}, {:zip=>"51903"}, {:zip=>"12354"}, {:zip=>"02343"}]
34
+ ```
35
+
36
+ This will return the column `:zip` as a string with all digits intact.
37
+
29
38
  ## Remove Zero Values
30
39
  `remove_zero_values` is disabled by default.
31
40
  When enabled, it removes key/value pairs which have a numeric value equal to zero.
@@ -44,7 +53,7 @@ It can happen that after all transformations, a row of the CSV file would produc
44
53
 
45
54
  By default SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.
46
55
 
47
- This can be set to `true`, to keep these empty hashes in the results.
56
+ This can be set to `false`, to keep these empty hashes in the results.
48
57
 
49
58
  -------------------
50
59
  PREVIOUS: [Header Validations](./header_validations.md) | NEXT: [Value Converters](./value_converters.md)
@@ -64,6 +64,8 @@ If you want to have an underscore between the header and the number, you can set
64
64
  => [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}]
65
65
  ```
66
66
 
67
+ If you set `duplicate_header_suffix: nil`, you get the same behavior as earlier versions, which raised the `SmarterCSV::DuplicateHeaders` error.
68
+
67
69
  ## Key Mapping
68
70
 
69
71
  The above example already illustrates how intermediate keys can be mapped into something different.
@@ -79,8 +81,8 @@ There is an additional option `remove_unmapped_keys` which can be enabled to onl
79
81
 
80
82
  ## CSV Files without Headers
81
83
 
82
- If you have CSV files without headers, it is important to set `headers_in_file: false`, otherwise you'll lose the first data line in your file.
83
- You then have to provide `user_provided_headers`, which takes an array of either symbols or strings.
84
+ If you have CSV files without headers, it is important to set `headers_in_file: false`, otherwise you'll lose the first data line in your file.
85
+ You then have to provide `user_provided_headers`, which takes an array of either symbols or strings. Versions >1.13 now automatically set `headers_in_file: false` if you provide `user_provided_headers`. Also see next paragraph.
84
86
 
85
87
 
86
88
  ## CSV Files with Headers
@@ -91,6 +93,7 @@ For CSV files with headers, you can either:
91
93
  * map one or more headers into whatever you chose using the `map_headers` option.
92
94
  (if you map a header to `nil`, it will remove that column from the resulting row hash).
93
95
  * completely replace the headers using `user_provided_headers` (please be careful with this powerful option, as it is not robust against changes in input format).
96
+ When you use `user_provided_headers`, versions >1.13 will set `headers_in_file: false` -- so if you replace the headers for a file that has headers, you must set `headers_in_file: true` to override this and ignore the header row.
94
97
  * use the original unmodified headers from the CSV file, using `keep_original_headers`. This results in hash keys that are strings, and may be padded with spaces.
95
98
 
96
99
 
@@ -102,11 +105,10 @@ For CSV files with headers, you can either:
102
105
  * any occurences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
103
106
  * any of the keys in the header line will be downcased, spaces replaced by underscore, and converted to Ruby symbols before being used as keys in the returned Hashes
104
107
  * you can not combine the :user_provided_headers and :key_mapping options
105
- * if the incorrect number of headers are provided via :user_provided_headers, exception SmarterCSV::HeaderSizeMismatch is raised
108
+ * if the incorrect number of headers are provided via :user_provided_headers, versions >1.13 will automatically add column names `column_N` for additional unexpected columns. If you want to raise an error instead, add option `strict: true`, and it will raise `SmarterCSV::HeaderSizeMismatch`.
106
109
 
107
110
  ### NOTES on improper quotation and unwanted characters in headers:
108
- * some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, use the `:force_simple_split => true` option in combination with `:strip_chars_from_headers => /[\-"]/` . This will also significantly speed up the import.
109
- If you would force a different :quote_char instead (setting it to a non-used character), then the import would be up to 5-times slower than using `:force_simple_split`.
111
+ * some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, set the `quote_char` to something different, e.g. `quote_char: "%"`, or try setting `:strip_chars_from_headers => /[\-"]/`
110
112
 
111
113
  ---------------
112
114
  PREVIOUS: [Row and Column Separators](./row_col_sep.md) | NEXT: [Header Validations](./header_validations.md)
data/docs/options.md CHANGED
@@ -41,17 +41,16 @@
41
41
  | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
42
42
  | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
43
43
  ---------------------------------------------------------------------------------------------------------------------------------
44
- | :col_sep | :auto | column separator (default was ',') |
45
- | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
46
- | | | e.g. when :quote_char is not properly escaped |
44
+ | :col_sep | :auto | column separator (default was ',') |
47
45
  | :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
48
46
  | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
49
47
  | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
50
48
  | :quote_char | '"' | quotation character |
51
49
  ---------------------------------------------------------------------------------------------------------------------------------
52
- | :headers_in_file | true | Whether or not the file contains headers as the first line. |
53
- | | | Important if the file does not contain headers, |
54
- | | | otherwise you would lose the first line of data. |
50
+ | :headers_in_file | true(1) | Whether or not the file contains headers as the first line. |
51
+ | | | (1): if `user_provided_headers` is given, the default is `false`, |
52
+ | | | unless you specify it to be explicitly `true`. |
53
+ | | | This prevents losing the first line of data, which is otherwise assumed to be a header. |
55
54
  | :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
56
55
  | | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
57
56
  | :user_provided_headers | nil | *careful with that axe!* |
@@ -61,6 +60,8 @@
61
60
  | :remove_empty_hashes | true | remove / ignore any hashes which don't have any key/value pairs or all empty values |
62
61
  | :verbose | false | print out line number while processing (to track down problems in input files) |
63
62
  | :with_line_numbers | false | add :csv_line_number to each data hash |
63
+ | :missing_header_prefix | column_ | can be set to a string of your liking |
64
+ | :strict | false | When set to `true`, extra columns will raise MalformedCSV exception |
64
65
  ---------------------------------------------------------------------------------------------------------------------------------
65
66
 
66
67
  Additional 1.x Options which may be replaced in 2.0
@@ -71,11 +72,11 @@ There have been a lot of 1-offs and feature creep around these options, and goin
71
72
  | Option | Default | Explanation |
72
73
  ---------------------------------------------------------------------------------------------------------------------------------
73
74
  | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
74
- | :silence_missing_keys | false | ignore missing keys in `key_mapping` |
75
- | | | if set to true: makes all mapped keys optional |
75
+ | :silence_missing_keys | false | ignore missing keys in `key_mapping` |
76
+ | | | if set to true: makes all mapped keys optional |
76
77
  | | | if given an array, makes only the keys listed in it optional |
77
- | :required_keys | nil | An array. Specify the required names AFTER header transformation. |
78
- | :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
78
+ | :required_keys | nil | An array. Specify the required names AFTER header transformation. |
79
+ | :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
79
80
  | | | or an exception is raised No validation if nil is given. |
80
81
  | :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
81
82
  | :downcase_header | true | downcase all column headers |
@@ -21,10 +21,10 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
21
21
 
22
22
  ```ruby
23
23
  $ cat spec/fixtures/with_dates.csv
24
- first,last,date,price
25
- Ben,Miller,10/30/1998,$44.50
26
- Tom,Turner,2/1/2011,$15.99
27
- Ken,Smith,01/09/2013,$199.99
24
+ first,last,date,price,member
25
+ Ben,Miller,10/30/1998,$44.50,TRUE
26
+ Tom,Turner,2/1/2011,$15.99,False
27
+ Ken,Smith,01/09/2013,$199.99,true
28
28
 
29
29
  $ irb
30
30
  > require 'smarter_csv'
@@ -51,7 +51,20 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
51
51
  end
52
52
  end
53
53
 
54
- options = {:value_converters => {:date => DateConverter, :price => DollarConverter}}
54
+ class BooleanConverter
55
+ def self.convert(value)
56
+ case value
57
+ when /true/i
58
+ true
59
+ when /false/i
60
+ false
61
+ else
62
+ nil
63
+ end
64
+ end
65
+ end
66
+
67
+ options = {value_converters: {date: DateConverter, price: DollarConverter, member: BooleanConverter}}
55
68
  data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
56
69
  first_record = data.first
57
70
  first_record[:date]
@@ -62,6 +75,8 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
62
75
  => 44.50
63
76
  first_record[:price].class
64
77
  => Float
78
+ first_record[:member]
79
+ => true
65
80
  ```
66
81
 
67
82
  --------------------
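One thing to keep in mind about the `BooleanConverter` example in this hunk: `when /true/i` is an unanchored match, so a value such as `untrue` would also convert to `true`. A stricter variant might look like the sketch below. The class name `StrictBooleanConverter` is an assumption made for illustration and does not appear in the gem or its docs.

```ruby
# Hypothetical stricter converter (illustration only): the whole value,
# after trimming whitespace, must be "true" or "false" in any letter case.
class StrictBooleanConverter
  def self.convert(value)
    case value.to_s.strip
    when /\Atrue\z/i  then true
    when /\Afalse\z/i then false
    end # anything else falls through and yields nil
  end
end

StrictBooleanConverter.convert('TRUE')   # => true
StrictBooleanConverter.convert('False')  # => false
StrictBooleanConverter.convert('untrue') # => nil
```

Like the converters in the documented example, it only needs to respond to `convert`, so it could presumably be passed in the `value_converters` hash the same way.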
@@ -9,9 +9,10 @@
9
9
  #define true ((bool)1)
10
10
  #endif
11
11
 
12
- /*
13
- max_size: pass nil if no limit is specified
14
- */
12
+ VALUE SmarterCSV = Qnil;
13
+ VALUE eMalformedCSVError = Qnil;
14
+ VALUE Parser = Qnil;
15
+
15
16
  static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) {
16
17
  if (RB_TYPE_P(line, T_NIL) == 1) {
17
18
  return rb_ary_new();
@@ -24,7 +25,7 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
24
25
  rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
25
26
  char *startP = RSTRING_PTR(line); /* may not be null terminated */
26
27
  long line_len = RSTRING_LEN(line);
27
- char *endP = startP + line_len ; /* points behind the string */
28
+ char *endP = startP + line_len; /* points behind the string */
28
29
  char *p = startP;
29
30
 
30
31
  char *col_sepP = RSTRING_PTR(col_sep);
@@ -39,18 +40,19 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
39
40
  VALUE field;
40
41
  long i;
41
42
 
42
- char prev_char = '\0'; // Store the previous character for comparison against an escape character
43
- long backslash_count = 0; // to count consecutive backslash characters
43
+ /* Variables for escaped quote handling */
44
+ long backslash_count = 0;
45
+ bool in_quotes = false;
44
46
 
45
47
  while (p < endP) {
46
48
  /* does the remaining string start with col_sep ? */
47
49
  col_sep_found = true;
48
- for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
50
+ for(i=0; (i < col_sep_len) && (p+i < endP); i++) {
49
51
  col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
50
52
  }
51
- /* if col_sep was found and we have even quotes */
52
- if (col_sep_found && (quote_count % 2 == 0)) {
53
- /* if max_size != nil && lements.size >= header_size */
53
+ /* if col_sep was found and we're not inside quotes */
54
+ if (col_sep_found && !in_quotes) {
55
+ /* if max_size != nil && elements.size >= header_size */
54
56
  if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
55
57
  break;
56
58
  } else {
@@ -60,22 +62,30 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
60
62
 
61
63
  p += col_sep_len;
62
64
  startP = p;
65
+ backslash_count = 0; // Reset backslash count at the start of a new field
63
66
  }
64
67
  } else {
65
68
  if (*p == '\\') {
66
69
  backslash_count++;
67
70
  } else {
68
- if (*p == *quoteP && (backslash_count % 2 == 0)) {
69
- quote_count++;
71
+ if (*p == *quoteP) {
72
+ if (backslash_count % 2 == 0) {
73
+ /* Even number of backslashes means quote is not escaped */
74
+ in_quotes = !in_quotes;
75
+ }
76
+ /* Else, quote is escaped; do nothing */
70
77
  }
71
- backslash_count = 0; // no more consecutive backslash characters
78
+ backslash_count = 0; // Reset after any character other than backslash
72
79
  }
73
80
  p++;
74
81
  }
75
-
76
- prev_char = *(p - 1); // Update the previous character
77
82
  } /* while */
78
83
 
84
+ /* Check for unclosed quotes at the end of the line */
85
+ if (in_quotes) {
86
+ rb_raise(eMalformedCSVError, "Unclosed quoted field detected in line: %s", StringValueCStr(line));
87
+ }
88
+
79
89
  /* check if the last part of the line needs to be processed */
80
90
  if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
81
91
  /* copy the remaining line as a field with original encoding onto the results */
@@ -86,12 +96,11 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
86
96
  return elements;
87
97
  }
88
98
 
89
- VALUE SmarterCSV = Qnil;
90
- VALUE Parser = Qnil;
91
-
92
99
  void Init_smarter_csv(void) {
93
- SmarterCSV = rb_define_module("SmarterCSV");
94
- Parser = rb_define_module_under(SmarterCSV, "Parser");
100
+ // these modules and the error class are already defined in Ruby code, make them accessible:
101
+ SmarterCSV = rb_const_get(rb_cObject, rb_intern("SmarterCSV"));
102
+ Parser = rb_const_get(SmarterCSV, rb_intern("Parser"));
103
+ eMalformedCSVError = rb_const_get(SmarterCSV, rb_intern("MalformedCSV"));
95
104
 
96
105
  rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 4);
97
106
  }
@@ -13,14 +13,13 @@ module SmarterCSV
13
13
  delimiters = [',', "\t", ';', ':', '|']
14
14
 
15
15
  line = nil
16
+ escaped_quote = Regexp.escape(options[:quote_char])
16
17
  has_header = options[:headers_in_file]
17
18
  candidates = Hash.new(0)
18
19
  count = has_header ? 1 : 5
19
20
  count.times do
20
21
  line = readline_with_counts(filehandle, options)
21
22
  delimiters.each do |d|
22
- escaped_quote = Regexp.escape(options[:quote_char])
23
-
24
23
  # Count only non-quoted occurrences of the delimiter
25
24
  non_quoted_text = line.split(/#{escaped_quote}[^#{escaped_quote}]*#{escaped_quote}/).join
26
25
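The effect of stripping quoted sections before counting delimiters can be seen in a small standalone sketch. Only the `split(...).join` step is taken from the hunk above; the method name `guess_col_sep`, the use of `String#count`, and the tie-breaking are assumptions made for illustration.

```ruby
# Hypothetical sketch of delimiter guessing on a single line (not the gem's code).
def guess_col_sep(line, quote_char: '"', delimiters: [',', "\t", ';', ':', '|'])
  q = Regexp.escape(quote_char)
  # Drop quoted sections so delimiters inside them are not counted.
  unquoted = line.split(/#{q}[^#{q}]*#{q}/).join
  counts = delimiters.to_h { |d| [d, unquoted.count(d)] }
  best, hits = counts.max_by { |_, c| c }
  hits.zero? ? nil : best
end

guess_col_sep('id;"Doe, Jane, Jr.";"a,b,c";42') # => ";"
```

Counted naively, that line contains four commas and three semicolons; once the quoted fields are removed, only the semicolons remain.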
 
@@ -11,6 +11,7 @@ module SmarterCSV
11
11
  class MissingKeys < SmarterCSVException; end # previously known as MissingHeaders
12
12
  class NoColSepDetected < SmarterCSVException; end
13
13
  class KeyMappingError < SmarterCSVException; end
14
+ class MalformedCSV < SmarterCSVException; end
14
15
  # Writer:
15
16
  class InvalidInputData < SmarterCSVException; end
16
17
  end
@@ -20,12 +20,12 @@ module SmarterCSV
20
20
  downcase_header: true,
21
21
  duplicate_header_suffix: '', # was: nil,
22
22
  file_encoding: 'utf-8',
23
- force_simple_split: false,
24
23
  force_utf8: false,
25
24
  headers_in_file: true,
26
25
  invalid_byte_sequence: '',
27
26
  keep_original_headers: false,
28
27
  key_mapping: nil,
28
+ missing_header_prefix: 'column_',
29
29
  quote_char: '"',
30
30
  remove_empty_hashes: true,
31
31
  remove_empty_values: true,
@@ -37,6 +37,7 @@ module SmarterCSV
37
37
  row_sep: :auto, # was: $/,
38
38
  silence_missing_keys: false,
39
39
  skip_lines: nil,
40
+ strict: false,
40
41
  strings_as_keys: false,
41
42
  strip_chars_from_headers: nil,
42
43
  strip_whitespace: true,
@@ -50,6 +51,18 @@ module SmarterCSV
50
51
  def process_options(given_options = {})
51
52
  puts "User provided options:\n#{pp(given_options)}\n" if given_options[:verbose]
52
53
 
54
+ # Special case for :user_provided_headers:
55
+ #
56
+ # If we would use the default `headers_in_file: true`, and `:user_provided_headers` are given,
57
+ # we could lose the first data row
58
+ #
59
+ # We now err on the side of treating an actual header as data, rather than losing a data row.
60
+ #
61
+ if given_options[:user_provided_headers] && !given_options.keys.include?(:headers_in_file)
62
+ given_options[:headers_in_file] = false
63
+ puts "WARNING: setting `headers_in_file: false` as a precaution to not lose the first row. Set explicitly to `true` if you have headers."
64
+ end
65
+
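The three cases that follow from this special-casing can be checked with a standalone sketch. `resolve_header_options` is a hypothetical stand-in for the relevant part of `process_options`, reduced to the two options involved; unlike the code above, it copies the caller's hash instead of modifying it.

```ruby
# Hypothetical reduction of the option handling (illustration only).
def resolve_header_options(given_options = {})
  defaults = { headers_in_file: true, user_provided_headers: nil }
  options  = given_options.dup # leave the caller's hash untouched

  # Only override the default when the caller did not state a preference.
  if options[:user_provided_headers] && !options.key?(:headers_in_file)
    options[:headers_in_file] = false
  end

  defaults.merge(options)
end

no_header_file  = resolve_header_options(user_provided_headers: [:zip, :city])
override_header = resolve_header_options(user_provided_headers: [:zip, :city], headers_in_file: true)
plain           = resolve_header_options

no_header_file[:headers_in_file]  # => false
override_header[:headers_in_file] # => true
plain[:headers_in_file]           # => true
```

The check uses `key?` rather than the value, so an explicit `headers_in_file: true` from the caller is respected.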
53
66
  @options = DEFAULT_OPTIONS.dup.merge!(given_options)
54
67
 
55
68
  # fix invalid input
@@ -7,6 +7,8 @@ module SmarterCSV
7
7
  ###
8
8
  ### Thin wrapper around C-extension
9
9
  ###
10
+ ### NOTE: we are no longer passing-in header_size
11
+ ###
10
12
  def parse(line, options, header_size = nil)
11
13
  # puts "SmarterCSV.parse OPTIONS: #{options[:acceleration]}" if options[:verbose]
12
14
 
@@ -31,59 +33,83 @@ module SmarterCSV
31
33
  # - we are not assuming that quotes inside a fields need to be doubled
32
34
  # - we are not assuming that all fields need to be quoted (0 is even)
33
35
  # - works with multi-char col_sep
34
- # - if header_size is given, only up to header_size fields are parsed
35
36
  #
36
- # We use header_size for parsing the body lines to make sure we always match the number of headers
37
- # in case there are trailing col_sep characters in line
37
+ # NOTE: we are no longer passing-in header_size
38
38
  #
39
- # Our convention is that empty fields are returned as empty strings, not as nil.
39
+ # - if header_size was given, only up to header_size fields are parsed
40
40
  #
41
+ # We used header_size for parsing the body lines to make sure we always match the number of headers
42
+ # in case there are trailing col_sep characters in line
41
43
  #
42
- # the purpose of the max_size parameter is to handle a corner case where
43
- # CSV lines contain more fields than the header.
44
- # In which case the remaining fields in the line are ignored
44
+ # the purpose of the max_size parameter was to handle a corner case where
45
+ # CSV lines contain more fields than the header. In which case the remaining fields in the line were ignored
45
46
  #
47
+ # Our convention is that empty fields are returned as empty strings, not as nil.
48
+
46
49
  def parse_csv_line_ruby(line, options, header_size = nil)
47
- return [] if line.nil?
50
+ return [[], 0] if line.nil?
48
51
 
49
52
  line_size = line.size
50
53
  col_sep = options[:col_sep]
51
54
  col_sep_size = col_sep.size
52
55
  quote = options[:quote_char]
53
- quote_count = 0
54
56
  elements = []
55
57
  start = 0
56
58
  i = 0
57
59
 
58
- previous_char = ''
60
+ backslash_count = 0
61
+ in_quotes = false
62
+
59
63
  while i < line_size
60
- if line[i...i+col_sep_size] == col_sep && quote_count.even?
64
+ # Check if the current position matches the column separator and we're not inside quotes
65
+ if line[i...i+col_sep_size] == col_sep && !in_quotes
61
66
  break if !header_size.nil? && elements.size >= header_size
62
67
 
63
68
  elements << cleanup_quotes(line[start...i], quote)
64
- previous_char = line[i]
65
- i += col_sep.size
69
+ i += col_sep_size
66
70
  start = i
71
+ backslash_count = 0 # Reset backslash count at the start of a new field
67
72
  else
68
- quote_count += 1 if line[i] == quote && previous_char != '\\'
69
- previous_char = line[i]
73
+ if line[i] == '\\'
74
+ backslash_count += 1
75
+ else
76
+ if line[i] == quote
77
+ if backslash_count % 2 == 0
78
+ # Even number of backslashes means quote is not escaped
79
+ in_quotes = !in_quotes
80
+ end
81
+ # Else, quote is escaped; do nothing
82
+ end
83
+ backslash_count = 0 # Reset after any character other than backslash
84
+ end
70
85
  i += 1
71
86
  end
72
87
  end
73
- elements << cleanup_quotes(line[start..-1], quote) if header_size.nil? || elements.size < header_size
88
+
89
+ # Check for unclosed quotes at the end of the line
90
+ if in_quotes
91
+ raise MalformedCSV, "Unclosed quoted field detected in line: #{line}"
92
+ end
93
+
94
+ # Process the remaining field
95
+ if header_size.nil? || elements.size < header_size
96
+ elements << cleanup_quotes(line[start..-1], quote)
97
+ end
98
+
74
99
  [elements, elements.size]
75
100
  end
76
101
 
77
102
  def cleanup_quotes(field, quote)
78
103
  return field if field.nil?
79
104
 
80
- # return if field !~ /#{quote}/ # this check can probably eliminated
81
-
105
+ # Remove surrounding quotes if present
82
106
  if field.start_with?(quote) && field.end_with?(quote)
83
- field.delete_prefix!(quote)
84
- field.delete_suffix!(quote)
107
+ field = field[1..-2]
85
108
  end
86
- field.gsub!("#{quote}#{quote}", quote)
109
+
110
+ # Replace double quotes with a single quote
111
+ field.gsub!("#{quote * 2}", quote)
112
+
87
113
  field
88
114
  end
89
115
  end
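The quote tracking introduced in this hunk (toggle `in_quotes` on a quote character unless it is preceded by an odd number of backslashes) can be isolated into a small predicate. `ends_inside_quotes?` is a hypothetical name used for illustration; it ignores column separators and only answers whether a line would trigger the "unclosed quoted field" error.

```ruby
# Hypothetical predicate (illustration only): true if the line ends while
# still inside a quoted field, using the same backslash-parity rule.
def ends_inside_quotes?(line, quote = '"')
  backslashes = 0
  in_quotes   = false

  line.each_char do |ch|
    if ch == '\\'
      backslashes += 1
    else
      # An even number of preceding backslashes means the quote is not escaped.
      in_quotes = !in_quotes if ch == quote && backslashes.even?
      backslashes = 0
    end
  end

  in_quotes
end

ends_inside_quotes?('a,"b,c",d')     # => false (balanced)
ends_inside_quotes?('a,"b,c')        # => true  (unclosed)
ends_inside_quotes?('a,"b \" c",d')  # => false (escaped quote is skipped)
```

Tracking a boolean rather than a running quote count keeps the check local to the characters seen so far, which is what lets the parser raise as soon as the line ends in an open state.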
@@ -62,7 +62,8 @@ module SmarterCSV
62
62
 
63
63
  skip_lines(fh, options)
64
64
 
65
- @headers, header_size = process_headers(fh, options)
65
+ # NOTE: we are no longer using header_size
66
+ @headers, _header_size = process_headers(fh, options)
66
67
  @headerA = @headers # @headerA is deprecated, use @headers
67
68
 
68
69
  puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
@@ -97,14 +98,23 @@ module SmarterCSV
97
98
  multiline = count_quote_chars(line, options[:quote_char]).odd?
98
99
 
99
100
  while multiline
100
- next_line = fh.readline(options[:row_sep])
101
- next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
102
- line += next_line
103
- @file_line_count += 1
104
-
105
- break if fh.eof? # Exit loop if end of file is reached
106
-
107
- multiline = count_quote_chars(line, options[:quote_char]).odd?
101
+ begin
102
+ next_line = fh.readline(options[:row_sep])
103
+ next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
104
+ line += next_line
105
+ @file_line_count += 1
106
+
107
+ multiline = count_quote_chars(line, options[:quote_char]).odd?
108
+ rescue EOFError
109
+ # End of file reached. Check if quotes are balanced.
110
+ total_quotes = count_quote_chars(line, options[:quote_char])
111
+ if total_quotes.odd?
112
+ raise MalformedCSV, "Unclosed quoted field detected in multiline data"
113
+ else
114
+ # Quotes are balanced; proceed without raising an error.
115
+ break
116
+ end
117
+ end
108
118
  end
109
119
 
110
120
  # :nocov:
@@ -116,7 +126,18 @@ module SmarterCSV
116
126
  line.chomp!(options[:row_sep])
117
127
 
118
128
  # --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
119
- dataA, _data_size = parse(line, options, header_size)
129
+ dataA, data_size = parse(line, options) # we parse the extra columns
130
+
131
+ if options[:strict]
132
+ raise SmarterCSV::HeaderSizeMismatch, "extra columns detected on line #{@file_line_count}"
133
+ else
134
+ # we create additional columns on-the-fly
135
+ current_size = @headers.size
136
+ while current_size < data_size
137
+ @headers << "#{options[:missing_header_prefix]}#{current_size + 1}".to_sym
138
+ current_size += 1
139
+ end
140
+ end
120
141
 
121
142
  dataA.map!{|x| x.strip} if options[:strip_whitespace]
122
143
 
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module SmarterCSV
- VERSION = "1.12.1"
+ VERSION = "1.13.1"
  end
@@ -81,6 +81,7 @@ module SmarterCSV
81
81
  def finalize
82
82
  # Map headers if :map_headers option is provided
83
83
  mapped_headers = @headers.map { |header| @map_headers[header] || header }
84
+ mapped_headers = mapped_headers.map{|x| escape_csv_field(x)} if @force_quotes
84
85
 
85
86
  @temp_file.rewind
86
87
  @output_file.write(mapped_headers.join(@col_sep) + @row_sep)
metadata CHANGED
@@ -1,14 +1,14 @@
1
  --- !ruby/object:Gem::Specification
  name: smarter_csv
  version: !ruby/object:Gem::Version
- version: 1.12.1
+ version: 1.13.1
  platform: ruby
  authors:
  - Tilo Sloboda
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-07-10 00:00:00.000000000 Z
+ date: 2024-12-12 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: awesome_print
14
  name: awesome_print
@@ -165,7 +165,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.2.3
+ rubygems_version: 3.5.4
  signing_key:
  specification_version: 4
  summary: Convenient CSV Reading and Writing