smarter_csv 1.12.1 → 1.13.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +39 -0
- data/CONTRIBUTORS.md +4 -0
- data/docs/data_transformations.md +10 -1
- data/docs/header_transformations.md +2 -0
- data/docs/options.md +11 -8
- data/docs/value_converters.md +20 -5
- data/ext/smarter_csv/smarter_csv.c +29 -20
- data/lib/smarter_csv/auto_detection.rb +1 -2
- data/lib/smarter_csv/errors.rb +1 -0
- data/lib/smarter_csv/options.rb +14 -0
- data/lib/smarter_csv/parser.rb +47 -21
- data/lib/smarter_csv/reader.rb +31 -10
- data/lib/smarter_csv/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: c28d21a143e743de4b21d8ad93d860b1d51424e525e9ec0a73bb640b170d9823
|
4
|
+
data.tar.gz: e7a16ae8494196b85d9a196d071f09b302583c9ef3a414f09f5ad6ae1f11c29b
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 319ee5aed33630e9670a1c95cc8da6fd57df9d1d7db57a00af79c1e5c10de56b4e9054c86b6b462ebdc693513a79aff2c881a1ede00ad28e5da58768b4a6f2cf
|
7
|
+
data.tar.gz: 6b36378d3a15ed9065c697f56f2cafc359d2c746e5796780a276c1a87c6a04be38616205f31ec9341412f3c9a4f52d150ce4ef95c0e76f340368d5683b1452e6
|
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,45 @@
|
|
1
1
|
|
2
2
|
# SmarterCSV 1.x Change Log
|
3
3
|
|
4
|
+
## 1.13.0 (2024-11-06) ⚡ POTENTIALLY BREAKING ⚡
|
5
|
+
|
6
|
+
CHANGED DEFAULT BEHAVIOR
|
7
|
+
========================
|
8
|
+
The changes are to improve robustness and to reduce the risk of data loss
|
9
|
+
|
10
|
+
* implementing auto-detection of extra columns (thanks to James Fenley)
|
11
|
+
|
12
|
+
* improved handling of unbalanced quote_char in input ([issue 288](https://github.com/tilo/smarter_csv/issues/288), thanks to Simon Rentzke, and [issue 283](https://github.com/tilo/smarter_csv/issues/283), thanks to James Fenley, Randall B, Matthew Kennedy)
|
13
|
+
-> SmarterCSV will now raise `SmarterCSV::MalformedCSV` for unbalanced quote_char.
|
14
|
+
|
15
|
+
* bugfix / improved handling of extra columns in input data ([issue 284](https://github.com/tilo/smarter_csv/issues/284)) (thanks to James Fenley)
|
16
|
+
|
17
|
+
* previous behavior:
|
18
|
+
when a CSV row had more columns than listed in the header, the additional columns were ignored
|
19
|
+
|
20
|
+
* new behavior:
|
21
|
+
* new default behavior is to auto-generate additional headers, e.g. :column_7, :column_8, etc
|
22
|
+
* you can set option `:strict` to true in order to get a `SmarterCSV::MalformedCSV` exception instead
|
23
|
+
|
24
|
+
* setting `user_provided_headers` now implies `headers_in_file: false` ([issue 282](https://github.com/tilo/smarter_csv/issues/282))
|
25
|
+
|
26
|
+
The option `user_provided_headers` can be used to specify headers when there are none in the input, OR to completely override headers that are in the input (file).
|
27
|
+
|
28
|
+
SmarterCSV is now using a safer default behavior.
|
29
|
+
|
30
|
+
* previous behavior:
|
31
|
+
Setting `user_provided_headers` did not change the default `headers_in_file: true`
|
32
|
+
If the input had no headers, this would cause the first line to be erroneously treated as a header, and the user could lose the first row of data.
|
33
|
+
|
34
|
+
* new behavior:
|
35
|
+
Setting `user_provided_headers` sets `headers_in_file: false`
|
36
|
+
a) Improved behavior if there was no header in the input data.
|
37
|
+
b) If there was a header in the input data, and `user_provided_headers` is used to override the headers in the file, then please explicitly specify `headers_in_file: true`, otherwise you will get an extra hash which includes the header data.
|
38
|
+
|
39
|
+
IF you set `user_provided_headers` and the file has a header, then provide `headers_in_file: true` to avoid getting that extra record.
|
40
|
+
|
41
|
+
* handling of numeric columns with leading zeroes, e.g. ZIP codes. ([issue #151](https://github.com/tilo/smarter_csv/issues/151) thanks to David Moles). `convert_values_to_numeric: { except: [:zip] }` will now return a string for that column instead.
|
42
|
+
|
4
43
|
## 1.12.1 (2024-07-10)
|
5
44
|
* Improved column separator detection by ignoring quoted sections [#276](https://github.com/tilo/smarter_csv/pull/276) (thanks to Nicolas Castellanos)
|
6
45
|
|
data/CONTRIBUTORS.md
CHANGED
@@ -54,3 +54,7 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
|
|
54
54
|
* [Kenton Hirowatari](https://github.com/hirowatari)
|
55
55
|
* [Daniel Pepper](https://github.com/dpep)
|
56
56
|
* [Nicolas Castellanos](https://github.com/nicastelo)
|
57
|
+
* [James Fenley](https://github.com/rex-remind101)
|
58
|
+
* [Simon Rentzke](https://github.com/simonrentzke)
|
59
|
+
* [Randall B](https://github.com/randall-coding)
|
60
|
+
* [Matthew Kennedy](https://github.com/MattKitmanLabs)
|
data/docs/data_transformations.md
CHANGED
@@ -26,6 +26,15 @@ It removes any values which are `nil` or would be empty strings.
|
|
26
26
|
`convert_values_to_numeric` is enabled by default.
|
27
27
|
SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
|
28
28
|
|
29
|
+
Here is an example of using `convert_values_to_numeric` for numbers with leading zeros, e.g. ZIP codes:
|
30
|
+
|
31
|
+
```
|
32
|
+
data = SmarterCSV.process('/tmp/zip.csv', convert_values_to_numeric: { except: [:zip] })
|
33
|
+
=> [{:zip=>"00480"}, {:zip=>"51903"}, {:zip=>"12354"}, {:zip=>"02343"}]
|
34
|
+
```
|
35
|
+
|
36
|
+
This will return the column `:zip` as a string with all digits intact.
|
37
|
+
|
29
38
|
## Remove Zero Values
|
30
39
|
`remove_zero_values` is disabled by default.
|
31
40
|
When enabled, it removes key/value pairs which have a numeric value equal to zero.
|
@@ -44,7 +53,7 @@ It can happen that after all transformations, a row of the CSV file would produc
|
|
44
53
|
|
45
54
|
By default SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.
|
46
55
|
|
47
|
-
This can be set to `
|
56
|
+
This can be set to `false`, to keep these empty hashes in the results.
|
48
57
|
|
49
58
|
-------------------
|
50
59
|
PREVIOUS: [Header Validations](./header_validations.md) | NEXT: [Value Converters](./value_converters.md)
|
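The `except:` form documented in the data transformations hunk can be pictured with a small stand-alone sketch. It does not call SmarterCSV: `to_numeric` and `convert_row` are hypothetical helpers, and the two regular expressions are only an assumption about what counts as an Integer or a Float.

```ruby
# Hypothetical stand-in for numeric conversion; not SmarterCSV's own code.
def to_numeric(value)
  case value
  when /\A[+-]?\d+\z/      then value.to_i   # "00480" would become 480
  when /\A[+-]?\d+\.\d+\z/ then value.to_f
  else value                                 # anything else is left alone
  end
end

# Convert every value in a row hash, skipping the keys listed in `except`.
def convert_row(row, except: [])
  row.to_h { |key, value| [key, except.include?(key) ? value : to_numeric(value)] }
end

row = { zip: "00480", count: "12", price: "4.50" }
p convert_row(row)                  # :zip is converted and loses its leading zeros
p convert_row(row, except: [:zip])  # :zip stays the string "00480"
```

Excluding a column by key keeps identifiers such as ZIP codes intact while the remaining columns are still converted.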
data/docs/header_transformations.md
CHANGED
@@ -64,6 +64,8 @@ If you want to have an underscore between the header and the number, you can set
|
|
64
64
|
=> [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}]
|
65
65
|
```
|
66
66
|
|
67
|
+
If you set `duplicate_header_suffix: nil`, you get the same behavior as earlier versions, which raised the `SmarterCSV::DuplicateHeaders` error.
|
68
|
+
|
67
69
|
## Key Mapping
|
68
70
|
|
69
71
|
The above example already illustrates how intermediate keys can be mapped into something different.
|
data/docs/options.md
CHANGED
@@ -41,7 +41,7 @@
|
|
41
41
|
| :skip_lines | nil | how many lines to skip before the first line or header line is processed |
|
42
42
|
| :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
|
43
43
|
---------------------------------------------------------------------------------------------------------------------------------
|
44
|
-
| :col_sep | :auto | column separator (default was ',')
|
44
|
+
| :col_sep | :auto | column separator (default was ',') |
|
45
45
|
| :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
|
46
46
|
| | | e.g. when :quote_char is not properly escaped |
|
47
47
|
| :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
|
@@ -49,9 +49,10 @@
|
|
49
49
|
| :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
|
50
50
|
| :quote_char | '"' | quotation character |
|
51
51
|
---------------------------------------------------------------------------------------------------------------------------------
|
52
|
-
| :headers_in_file |
|
53
|
-
| | |
|
54
|
-
| | |
|
52
|
+
| :headers_in_file | true(1) | Whether or not the file contains headers as the first line. |
|
53
|
+
| | | (1): if `user_provided_headers` is given, the default is `false`, |
|
54
|
+
| | | unless you specify it to be explicitly `true`. |
|
55
|
+
| | | This prevents losing the first line of data, which is otherwise assumed to be a header. |
|
55
56
|
| :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
|
56
57
|
| | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
|
57
58
|
| :user_provided_headers | nil | *careful with that axe!* |
|
@@ -61,6 +62,8 @@
|
|
61
62
|
| :remove_empty_hashes | true | remove / ignore any hashes which don't have any key/value pairs or all empty values |
|
62
63
|
| :verbose | false | print out line number while processing (to track down problems in input files) |
|
63
64
|
| :with_line_numbers | false | add :csv_line_number to each data hash |
|
65
|
+
| :missing_header_prefix | column_ | can be set to a string of your liking |
|
66
|
+
| :strict | false | When set to `true`, extra columns will raise MalformedCSV exception |
|
64
67
|
---------------------------------------------------------------------------------------------------------------------------------
|
65
68
|
|
66
69
|
Additional 1.x Options which may be replaced in 2.0
|
@@ -71,11 +74,11 @@ There have been a lot of 1-offs and feature creep around these options, and goin
|
|
71
74
|
| Option | Default | Explanation |
|
72
75
|
---------------------------------------------------------------------------------------------------------------------------------
|
73
76
|
| :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
|
74
|
-
| :silence_missing_keys | false | ignore missing keys in `key_mapping`
|
75
|
-
| | | if set to true: makes all mapped keys optional
|
77
|
+
| :silence_missing_keys | false | ignore missing keys in `key_mapping` |
|
78
|
+
| | | if set to true: makes all mapped keys optional |
|
76
79
|
| | | if given an array, makes only the keys listed in it optional |
|
77
|
-
| :required_keys | nil | An array. Specify the required names AFTER header transformation.
|
78
|
-
| :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead
|
80
|
+
| :required_keys | nil | An array. Specify the required names AFTER header transformation. |
|
81
|
+
| :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
|
79
82
|
| | | or an exception is raised No validation if nil is given. |
|
80
83
|
| :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
|
81
84
|
| :downcase_header | true | downcase all column headers |
|
data/docs/value_converters.md
CHANGED
@@ -21,10 +21,10 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
|
|
21
21
|
|
22
22
|
```ruby
|
23
23
|
$ cat spec/fixtures/with_dates.csv
|
24
|
-
first,last,date,price
|
25
|
-
Ben,Miller,10/30/1998,$44.50
|
26
|
-
Tom,Turner,2/1/2011,$15.99
|
27
|
-
Ken,Smith,01/09/2013,$199.99
|
24
|
+
first,last,date,price,member
|
25
|
+
Ben,Miller,10/30/1998,$44.50,TRUE
|
26
|
+
Tom,Turner,2/1/2011,$15.99,False
|
27
|
+
Ken,Smith,01/09/2013,$199.99,true
|
28
28
|
|
29
29
|
$ irb
|
30
30
|
> require 'smarter_csv'
|
@@ -51,7 +51,20 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
|
|
51
51
|
end
|
52
52
|
end
|
53
53
|
|
54
|
-
|
54
|
+
class BooleanConverter
|
55
|
+
def self.convert(value)
|
56
|
+
case value
|
57
|
+
when /true/i
|
58
|
+
true
|
59
|
+
when /false/i
|
60
|
+
false
|
61
|
+
else
|
62
|
+
nil
|
63
|
+
end
|
64
|
+
end
|
65
|
+
end
|
66
|
+
|
67
|
+
options = {value_converters: {date: DateConverter, price: DollarConverter, member: BooleanConverter}}
|
55
68
|
data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
|
56
69
|
first_record = data.first
|
57
70
|
first_record[:date]
|
@@ -62,6 +75,8 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
|
|
62
75
|
=> 44.50
|
63
76
|
first_record[:price].class
|
64
77
|
=> Float
|
78
|
+
first_record[:member]
|
79
|
+
=> true
|
65
80
|
```
|
66
81
|
|
67
82
|
--------------------
|
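The value converters in the hunk above are plain classes that respond to `convert`. As a hedged variation, the sketch below uses anchored patterns, so that a value such as "untrue" is not read as `true`. The class name `StrictBooleanConverter` is hypothetical, and the stricter matching is an assumption about what a caller might want, not the documented behavior.

```ruby
# Hypothetical variant of the BooleanConverter from the docs, with anchored patterns.
class StrictBooleanConverter
  def self.convert(value)
    case value.to_s.strip
    when /\Atrue\z/i  then true
    when /\Afalse\z/i then false
    end                        # any other input falls through to nil
  end
end

p StrictBooleanConverter.convert("TRUE")
p StrictBooleanConverter.convert("False")
p StrictBooleanConverter.convert("untrue")
```

It would be wired in the same way as the documented converters, for example as the value for `member:` in the `value_converters` hash.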
data/ext/smarter_csv/smarter_csv.c
CHANGED
@@ -9,9 +9,10 @@
|
|
9
9
|
#define true ((bool)1)
|
10
10
|
#endif
|
11
11
|
|
12
|
-
|
13
|
-
|
14
|
-
|
12
|
+
VALUE SmarterCSV = Qnil;
|
13
|
+
VALUE eMalformedCSVError = Qnil;
|
14
|
+
VALUE Parser = Qnil;
|
15
|
+
|
15
16
|
static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) {
|
16
17
|
if (RB_TYPE_P(line, T_NIL) == 1) {
|
17
18
|
return rb_ary_new();
|
@@ -24,7 +25,7 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
24
25
|
rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
|
25
26
|
char *startP = RSTRING_PTR(line); /* may not be null terminated */
|
26
27
|
long line_len = RSTRING_LEN(line);
|
27
|
-
char *endP = startP + line_len
|
28
|
+
char *endP = startP + line_len; /* points behind the string */
|
28
29
|
char *p = startP;
|
29
30
|
|
30
31
|
char *col_sepP = RSTRING_PTR(col_sep);
|
@@ -39,18 +40,19 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
39
40
|
VALUE field;
|
40
41
|
long i;
|
41
42
|
|
42
|
-
|
43
|
-
long backslash_count = 0;
|
43
|
+
/* Variables for escaped quote handling */
|
44
|
+
long backslash_count = 0;
|
45
|
+
bool in_quotes = false;
|
44
46
|
|
45
47
|
while (p < endP) {
|
46
48
|
/* does the remaining string start with col_sep ? */
|
47
49
|
col_sep_found = true;
|
48
|
-
for(i=0; (i < col_sep_len) && (p+i < endP)
|
50
|
+
for(i=0; (i < col_sep_len) && (p+i < endP); i++) {
|
49
51
|
col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
|
50
52
|
}
|
51
|
-
/* if col_sep was found and we
|
52
|
-
if (col_sep_found &&
|
53
|
-
/* if max_size != nil &&
|
53
|
+
/* if col_sep was found and we're not inside quotes */
|
54
|
+
if (col_sep_found && !in_quotes) {
|
55
|
+
/* if max_size != nil && elements.size >= header_size */
|
54
56
|
if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
|
55
57
|
break;
|
56
58
|
} else {
|
@@ -60,22 +62,30 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
60
62
|
|
61
63
|
p += col_sep_len;
|
62
64
|
startP = p;
|
65
|
+
backslash_count = 0; // Reset backslash count at the start of a new field
|
63
66
|
}
|
64
67
|
} else {
|
65
68
|
if (*p == '\\') {
|
66
69
|
backslash_count++;
|
67
70
|
} else {
|
68
|
-
if (*p == *quoteP
|
69
|
-
|
71
|
+
if (*p == *quoteP) {
|
72
|
+
if (backslash_count % 2 == 0) {
|
73
|
+
/* Even number of backslashes means quote is not escaped */
|
74
|
+
in_quotes = !in_quotes;
|
75
|
+
}
|
76
|
+
/* Else, quote is escaped; do nothing */
|
70
77
|
}
|
71
|
-
backslash_count = 0; //
|
78
|
+
backslash_count = 0; // Reset after any character other than backslash
|
72
79
|
}
|
73
80
|
p++;
|
74
81
|
}
|
75
|
-
|
76
|
-
prev_char = *(p - 1); // Update the previous character
|
77
82
|
} /* while */
|
78
83
|
|
84
|
+
/* Check for unclosed quotes at the end of the line */
|
85
|
+
if (in_quotes) {
|
86
|
+
rb_raise(eMalformedCSVError, "Unclosed quoted field detected in line: %s", StringValueCStr(line));
|
87
|
+
}
|
88
|
+
|
79
89
|
/* check if the last part of the line needs to be processed */
|
80
90
|
if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
|
81
91
|
/* copy the remaining line as a field with original encoding onto the results */
|
@@ -86,12 +96,11 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
86
96
|
return elements;
|
87
97
|
}
|
88
98
|
|
89
|
-
VALUE SmarterCSV = Qnil;
|
90
|
-
VALUE Parser = Qnil;
|
91
|
-
|
92
99
|
void Init_smarter_csv(void) {
|
93
|
-
|
94
|
-
|
100
|
+
// these modules and the error class are already defined in Ruby code, make them accessible:
|
101
|
+
SmarterCSV = rb_const_get(rb_cObject, rb_intern("SmarterCSV"));
|
102
|
+
Parser = rb_const_get(SmarterCSV, rb_intern("Parser"));
|
103
|
+
eMalformedCSVError = rb_const_get(SmarterCSV, rb_intern("MalformedCSV"));
|
95
104
|
|
96
105
|
rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 4);
|
97
106
|
}
|
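The quote tracking added to the C extension is a small state machine: a quote character toggles `in_quotes` only when it is preceded by an even number of backslashes, and a line that ends with `in_quotes` still set is reported as malformed. Below is a minimal Ruby sketch of that rule, assuming a single-character quote. The method name `unclosed_quote?` is hypothetical.

```ruby
# Returns true when the line ends inside an open quoted field.
def unclosed_quote?(line, quote = '"')
  in_quotes = false
  backslashes = 0
  line.each_char do |char|
    if char == '\\'
      backslashes += 1            # count consecutive backslashes
    else
      # an even number of preceding backslashes means the quote is not escaped
      in_quotes = !in_quotes if char == quote && backslashes.even?
      backslashes = 0             # reset after any character other than a backslash
    end
  end
  in_quotes
end

p unclosed_quote?('a,"b,c",d')   # balanced quotes
p unclosed_quote?('a,"b,c')      # the quoted field is never closed
```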
data/lib/smarter_csv/auto_detection.rb
CHANGED
@@ -13,14 +13,13 @@ module SmarterCSV
|
|
13
13
|
delimiters = [',', "\t", ';', ':', '|']
|
14
14
|
|
15
15
|
line = nil
|
16
|
+
escaped_quote = Regexp.escape(options[:quote_char])
|
16
17
|
has_header = options[:headers_in_file]
|
17
18
|
candidates = Hash.new(0)
|
18
19
|
count = has_header ? 1 : 5
|
19
20
|
count.times do
|
20
21
|
line = readline_with_counts(filehandle, options)
|
21
22
|
delimiters.each do |d|
|
22
|
-
escaped_quote = Regexp.escape(options[:quote_char])
|
23
|
-
|
24
23
|
# Count only non-quoted occurrences of the delimiter
|
25
24
|
non_quoted_text = line.split(/#{escaped_quote}[^#{escaped_quote}]*#{escaped_quote}/).join
|
26
25
|
|
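The separator detection in this hunk counts delimiters only outside quoted sections. The sketch below reuses the same split expression on a single line. The method name `guess_col_sep`, its keyword argument, and the `nil` return for lines without any candidate are assumptions made for illustration.

```ruby
# Guess the column separator of one line, ignoring delimiters inside quotes.
def guess_col_sep(line, quote_char = '"', delimiters: [',', "\t", ';', ':', '|'])
  escaped_quote = Regexp.escape(quote_char)
  # Drop balanced quoted sections, then count what is left.
  non_quoted_text = line.split(/#{escaped_quote}[^#{escaped_quote}]*#{escaped_quote}/).join
  counts = delimiters.to_h { |d| [d, non_quoted_text.scan(d).length] }
  best, hits = counts.max_by { |_, n| n }
  hits.zero? ? nil : best
end

p guess_col_sep('name;"Smith, John, Jr.";42')  # the commas inside the quotes are not counted
```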
data/lib/smarter_csv/errors.rb
CHANGED
@@ -11,6 +11,7 @@ module SmarterCSV
|
|
11
11
|
class MissingKeys < SmarterCSVException; end # previously known as MissingHeaders
|
12
12
|
class NoColSepDetected < SmarterCSVException; end
|
13
13
|
class KeyMappingError < SmarterCSVException; end
|
14
|
+
class MalformedCSV < SmarterCSVException; end
|
14
15
|
# Writer:
|
15
16
|
class InvalidInputData < SmarterCSVException; end
|
16
17
|
end
|
data/lib/smarter_csv/options.rb
CHANGED
@@ -26,6 +26,7 @@ module SmarterCSV
|
|
26
26
|
invalid_byte_sequence: '',
|
27
27
|
keep_original_headers: false,
|
28
28
|
key_mapping: nil,
|
29
|
+
missing_header_prefix: 'column_',
|
29
30
|
quote_char: '"',
|
30
31
|
remove_empty_hashes: true,
|
31
32
|
remove_empty_values: true,
|
@@ -37,6 +38,7 @@ module SmarterCSV
|
|
37
38
|
row_sep: :auto, # was: $/,
|
38
39
|
silence_missing_keys: false,
|
39
40
|
skip_lines: nil,
|
41
|
+
strict: false,
|
40
42
|
strings_as_keys: false,
|
41
43
|
strip_chars_from_headers: nil,
|
42
44
|
strip_whitespace: true,
|
@@ -50,6 +52,18 @@ module SmarterCSV
|
|
50
52
|
def process_options(given_options = {})
|
51
53
|
puts "User provided options:\n#{pp(given_options)}\n" if given_options[:verbose]
|
52
54
|
|
55
|
+
# Special case for :user_provided_headers:
|
56
|
+
#
|
57
|
+
# If we would use the default `headers_in_file: true`, and `:user_provided_headers` are given,
|
58
|
+
# we could lose the first data row
|
59
|
+
#
|
60
|
+
# We now err on the side of treating an actual header as data, rather than losing a data row.
|
61
|
+
#
|
62
|
+
if given_options[:user_provided_headers] && !given_options.keys.include?(:headers_in_file)
|
63
|
+
given_options[:headers_in_file] = false
|
64
|
+
puts "WARNING: setting `headers_in_file: false` as a precaution to not lose the first row. Set explicitly to `true` if you have headers."
|
65
|
+
end
|
66
|
+
|
53
67
|
@options = DEFAULT_OPTIONS.dup.merge!(given_options)
|
54
68
|
|
55
69
|
# fix invalid input
|
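The special case added to `process_options` can be read as a small defaulting rule: `user_provided_headers` switches `headers_in_file` to `false` unless the caller set that key explicitly. Here is a stand-alone sketch of the rule. `effective_options` and its reduced defaults hash are hypothetical, and unlike the hunk above it copies the caller's hash instead of modifying it.

```ruby
# Hypothetical helper mirroring the headers_in_file defaulting rule.
def effective_options(given_options, defaults = { headers_in_file: true, user_provided_headers: nil })
  options = given_options.dup   # leave the caller's hash untouched
  if options[:user_provided_headers] && !options.key?(:headers_in_file)
    options[:headers_in_file] = false
  end
  defaults.merge(options)
end

p effective_options({})[:headers_in_file]
p effective_options({ user_provided_headers: [:a, :b] })[:headers_in_file]
p effective_options({ user_provided_headers: [:a, :b], headers_in_file: true })[:headers_in_file]
```

The third call corresponds to the case described in the change log, where the file does contain a header line that the user-provided headers should replace.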
data/lib/smarter_csv/parser.rb
CHANGED
@@ -7,6 +7,8 @@ module SmarterCSV
|
|
7
7
|
###
|
8
8
|
### Thin wrapper around C-extension
|
9
9
|
###
|
10
|
+
### NOTE: we are no longer passing-in header_size
|
11
|
+
###
|
10
12
|
def parse(line, options, header_size = nil)
|
11
13
|
# puts "SmarterCSV.parse OPTIONS: #{options[:acceleration]}" if options[:verbose]
|
12
14
|
|
@@ -31,59 +33,83 @@ module SmarterCSV
|
|
31
33
|
# - we are not assuming that quotes inside a fields need to be doubled
|
32
34
|
# - we are not assuming that all fields need to be quoted (0 is even)
|
33
35
|
# - works with multi-char col_sep
|
34
|
-
# - if header_size is given, only up to header_size fields are parsed
|
35
36
|
#
|
36
|
-
#
|
37
|
-
# in case there are trailing col_sep characters in line
|
37
|
+
# NOTE: we are no longer passing-in header_size
|
38
38
|
#
|
39
|
-
#
|
39
|
+
# - if header_size was given, only up to header_size fields are parsed
|
40
40
|
#
|
41
|
+
# We used header_size for parsing the body lines to make sure we always match the number of headers
|
42
|
+
# in case there are trailing col_sep characters in line
|
41
43
|
#
|
42
|
-
#
|
43
|
-
#
|
44
|
-
# In which case the remaining fields in the line are ignored
|
44
|
+
# the purpose of the max_size parameter was to handle a corner case where
|
45
|
+
# CSV lines contain more fields than the header. In which case the remaining fields in the line were ignored
|
45
46
|
#
|
47
|
+
# Our convention is that empty fields are returned as empty strings, not as nil.
|
48
|
+
|
46
49
|
def parse_csv_line_ruby(line, options, header_size = nil)
|
47
|
-
return [] if line.nil?
|
50
|
+
return [[], 0] if line.nil?
|
48
51
|
|
49
52
|
line_size = line.size
|
50
53
|
col_sep = options[:col_sep]
|
51
54
|
col_sep_size = col_sep.size
|
52
55
|
quote = options[:quote_char]
|
53
|
-
quote_count = 0
|
54
56
|
elements = []
|
55
57
|
start = 0
|
56
58
|
i = 0
|
57
59
|
|
58
|
-
|
60
|
+
backslash_count = 0
|
61
|
+
in_quotes = false
|
62
|
+
|
59
63
|
while i < line_size
|
60
|
-
if
|
64
|
+
# Check if the current position matches the column separator and we're not inside quotes
|
65
|
+
if line[i...i+col_sep_size] == col_sep && !in_quotes
|
61
66
|
break if !header_size.nil? && elements.size >= header_size
|
62
67
|
|
63
68
|
elements << cleanup_quotes(line[start...i], quote)
|
64
|
-
|
65
|
-
i += col_sep.size
|
69
|
+
i += col_sep_size
|
66
70
|
start = i
|
71
|
+
backslash_count = 0 # Reset backslash count at the start of a new field
|
67
72
|
else
|
68
|
-
|
69
|
-
|
73
|
+
if line[i] == '\\'
|
74
|
+
backslash_count += 1
|
75
|
+
else
|
76
|
+
if line[i] == quote
|
77
|
+
if backslash_count % 2 == 0
|
78
|
+
# Even number of backslashes means quote is not escaped
|
79
|
+
in_quotes = !in_quotes
|
80
|
+
end
|
81
|
+
# Else, quote is escaped; do nothing
|
82
|
+
end
|
83
|
+
backslash_count = 0 # Reset after any character other than backslash
|
84
|
+
end
|
70
85
|
i += 1
|
71
86
|
end
|
72
87
|
end
|
73
|
-
|
88
|
+
|
89
|
+
# Check for unclosed quotes at the end of the line
|
90
|
+
if in_quotes
|
91
|
+
raise MalformedCSV, "Unclosed quoted field detected in line: #{line}"
|
92
|
+
end
|
93
|
+
|
94
|
+
# Process the remaining field
|
95
|
+
if header_size.nil? || elements.size < header_size
|
96
|
+
elements << cleanup_quotes(line[start..-1], quote)
|
97
|
+
end
|
98
|
+
|
74
99
|
[elements, elements.size]
|
75
100
|
end
|
76
101
|
|
77
102
|
def cleanup_quotes(field, quote)
|
78
103
|
return field if field.nil?
|
79
104
|
|
80
|
-
#
|
81
|
-
|
105
|
+
# Remove surrounding quotes if present
|
82
106
|
if field.start_with?(quote) && field.end_with?(quote)
|
83
|
-
field
|
84
|
-
field.delete_suffix!(quote)
|
107
|
+
field = field[1..-2]
|
85
108
|
end
|
86
|
-
|
109
|
+
|
110
|
+
# Replace double quotes with a single quote
|
111
|
+
field.gsub!("#{quote * 2}", quote)
|
112
|
+
|
87
113
|
field
|
88
114
|
end
|
89
115
|
end
|
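The reworked `cleanup_quotes` does two things: it strips one pair of surrounding quotes and collapses doubled quotes into single ones. The sketch below shows the same two steps without modifying its argument. The name `cleanup_quotes_copy` and the extra length guard are assumptions, not part of the gem.

```ruby
# Non-mutating sketch of the quote cleanup; returns a new string.
def cleanup_quotes_copy(field, quote = '"')
  return field if field.nil?

  # Remove one pair of surrounding quotes if present (the size guard is an addition).
  field = field[1..-2] if field.size >= 2 && field.start_with?(quote) && field.end_with?(quote)
  # Collapse doubled quotes into a single quote; gsub leaves the input unchanged.
  field.gsub(quote * 2, quote)
end

p cleanup_quotes_copy('"a,b"')
p cleanup_quotes_copy('"say ""hi"""')
```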
data/lib/smarter_csv/reader.rb
CHANGED
@@ -62,7 +62,8 @@ module SmarterCSV
|
|
62
62
|
|
63
63
|
skip_lines(fh, options)
|
64
64
|
|
65
|
-
|
65
|
+
# NOTE: we are no longer using header_size
|
66
|
+
@headers, _header_size = process_headers(fh, options)
|
66
67
|
@headerA = @headers # @headerA is deprecated, use @headers
|
67
68
|
|
68
69
|
puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
|
@@ -97,14 +98,23 @@ module SmarterCSV
|
|
97
98
|
multiline = count_quote_chars(line, options[:quote_char]).odd?
|
98
99
|
|
99
100
|
while multiline
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
|
107
|
-
|
101
|
+
begin
|
102
|
+
next_line = fh.readline(options[:row_sep])
|
103
|
+
next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
|
104
|
+
line += next_line
|
105
|
+
@file_line_count += 1
|
106
|
+
|
107
|
+
multiline = count_quote_chars(line, options[:quote_char]).odd?
|
108
|
+
rescue EOFError
|
109
|
+
# End of file reached. Check if quotes are balanced.
|
110
|
+
total_quotes = count_quote_chars(line, options[:quote_char])
|
111
|
+
if total_quotes.odd?
|
112
|
+
raise MalformedCSV, "Unclosed quoted field detected in multiline data"
|
113
|
+
else
|
114
|
+
# Quotes are balanced; proceed without raising an error.
|
115
|
+
break
|
116
|
+
end
|
117
|
+
end
|
108
118
|
end
|
109
119
|
|
110
120
|
# :nocov:
|
@@ -116,7 +126,18 @@ module SmarterCSV
|
|
116
126
|
line.chomp!(options[:row_sep])
|
117
127
|
|
118
128
|
# --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
|
119
|
-
dataA,
|
129
|
+
dataA, data_size = parse(line, options) # we parse the extra columns
|
130
|
+
|
131
|
+
if options[:strict]
|
132
|
+
raise SmarterCSV::HeaderSizeMismatch, "extra columns detected on line #{@file_line_count}"
|
133
|
+
else
|
134
|
+
# we create additional columns on-the-fly
|
135
|
+
current_size = @headers.size
|
136
|
+
while current_size < data_size
|
137
|
+
@headers << "#{options[:missing_header_prefix]}#{current_size + 1}".to_sym
|
138
|
+
current_size += 1
|
139
|
+
end
|
140
|
+
end
|
120
141
|
|
121
142
|
dataA.map!{|x| x.strip} if options[:strip_whitespace]
|
122
143
|
|
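The extra-column handling in this hunk appends generated header names built from `missing_header_prefix` and the column position. The stand-alone sketch below illustrates that naming scheme. `extend_headers` is a hypothetical helper, `ArgumentError` stands in for the library's exception, and checking the field count before raising in strict mode is an assumption about the intended behavior.

```ruby
# Hypothetical helper: grow the header list to cover data_size fields.
def extend_headers(headers, data_size, prefix: 'column_', strict: false, line_number: nil)
  return headers if data_size <= headers.size   # nothing extra on this row

  raise ArgumentError, "extra columns detected on line #{line_number}" if strict

  extra = ((headers.size + 1)..data_size).map { |n| :"#{prefix}#{n}" }
  headers + extra
end

p extend_headers([:first, :last], 4)                    # adds :column_3 and :column_4
p extend_headers([:first, :last], 3, prefix: 'extra_')  # adds :extra_3
```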
data/lib/smarter_csv/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: smarter_csv
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.13.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Tilo Sloboda
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2024-
|
11
|
+
date: 2024-11-05 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: awesome_print
|