smarter_csv 1.12.0 → 1.13.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +42 -0
- data/CONTRIBUTORS.md +5 -0
- data/docs/data_transformations.md +10 -1
- data/docs/header_transformations.md +2 -0
- data/docs/options.md +11 -8
- data/docs/value_converters.md +20 -5
- data/ext/smarter_csv/smarter_csv.c +29 -20
- data/lib/smarter_csv/auto_detection.rb +5 -1
- data/lib/smarter_csv/errors.rb +1 -0
- data/lib/smarter_csv/options.rb +14 -0
- data/lib/smarter_csv/parser.rb +47 -21
- data/lib/smarter_csv/reader.rb +31 -10
- data/lib/smarter_csv/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: c28d21a143e743de4b21d8ad93d860b1d51424e525e9ec0a73bb640b170d9823
|
4
|
+
data.tar.gz: e7a16ae8494196b85d9a196d071f09b302583c9ef3a414f09f5ad6ae1f11c29b
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 319ee5aed33630e9670a1c95cc8da6fd57df9d1d7db57a00af79c1e5c10de56b4e9054c86b6b462ebdc693513a79aff2c881a1ede00ad28e5da58768b4a6f2cf
|
7
|
+
data.tar.gz: 6b36378d3a15ed9065c697f56f2cafc359d2c746e5796780a276c1a87c6a04be38616205f31ec9341412f3c9a4f52d150ce4ef95c0e76f340368d5683b1452e6
|
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,48 @@
|
|
1
1
|
|
2
2
|
# SmarterCSV 1.x Change Log
|
3
3
|
|
4
|
+
## 1.13.0 (2024-11-06) ⚡ POTENTIALLY BREAKING ⚡
|
5
|
+
|
6
|
+
CHANGED DEFAULT BEHAVIOR
|
7
|
+
========================
|
8
|
+
The changes are to improve robustness and to reduce the risk of data loss
|
9
|
+
|
10
|
+
* implementing auto-detection of extra columns (thanks to James Fenley)
|
11
|
+
|
12
|
+
* improved handling of unbalanced quote_char in input ([issue 288](https://github.com/tilo/smarter_csv/issues/288)) thanks to Simon Rentzke), and ([issue 283](https://github.com/tilo/smarter_csv/issues/283)) thanks to James Fenley, Randall B, Matthew Kennedy)
|
13
|
+
-> SmarterCSV will now raise `SmarterCSV::MalformedCSV` for unbalanced quote_char.
|
14
|
+
|
15
|
+
* bugfix / improved handling of extra columns in input data ([issue 284](https://github.com/tilo/smarter_csv/issues/284)) (thanks to James Fenley)
|
16
|
+
|
17
|
+
* previous behavior:
|
18
|
+
when a CSV row had more columns than listed in the header, the additional columns were ignored
|
19
|
+
|
20
|
+
* new behavior:
|
21
|
+
* new default behavior is to auto-generate additional headers, e.g. :column_7, :column_8, etc
|
22
|
+
* you can set option `:strict` to true in order to get a `SmarterCSV::MalformedCSV` exception instead
|
23
|
+
|
24
|
+
* setting `user_provided_headers` now implies `headers_in_file: false` ([issue 282](https://github.com/tilo/smarter_csv/issues/282))
|
25
|
+
|
26
|
+
The option `user_provided_headers` can be used to specify headers when there are none in the input, OR to completely override headers that are in the input (file).
|
27
|
+
|
28
|
+
SmarterCSV is now using a safer default behavior.
|
29
|
+
|
30
|
+
* previous behavior:
|
31
|
+
Setting `user_provided_headers` did not change the default `headers_in_file: true`
|
32
|
+
If the input had no headers, this would cause the first line to be erroneously treated as a header, and the user could lose the first row of data.
|
33
|
+
|
34
|
+
* new behavior:
|
35
|
+
Setting `user_provided_headers` sets`headers_in_file: false`
|
36
|
+
a) Improved behavior if there was no header in the input data.
|
37
|
+
b) If there was a header in the input data, and `user_provided_headers` is used to override the headers in the file, then please explicitly specify `headers_in_file: true`, otherwise you will get an extra hash which includes the header data.
|
38
|
+
|
39
|
+
IF you set `user_provided_headers` and the file has a header, then provide `headers_in_file: true` to avoid getting that extra record.
|
40
|
+
|
41
|
+
* handling of numeric columns with leading zeroes, e.g. ZIP codes. ([issue #151](https://github.com/tilo/smarter_csv/issues/151) thanks to David Moles). `convert_values_to_numeric: { except: [:zip] }` will now return a string for that column instead.
|
42
|
+
|
43
|
+
## 1.12.1 (2024-07-10)
|
44
|
+
* Improved column separator detection by ignoring quoted sections [#276](https://github.com/tilo/smarter_csv/pull/276) (thanks to Nicolas Castellanos)
|
45
|
+
|
4
46
|
## 1.12.0 (2024-07-09)
|
5
47
|
* Added Thread-Safety: added SmarterCSV::Reader to process CSV files in a thread-safe manner ([issue #277](https://github.com/tilo/smarter_csv/pull/277))
|
6
48
|
* SmarterCSV::Writer changed default row separator to the system's row separator (`\n` on Linux, `\r\n` on Windows)
|
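The extra-column behavior described in the 1.13.0 entry can be sketched in plain Ruby. This is only an illustration, not the gem's code: `headers_for_row` and `MalformedCSVSketch` are hypothetical names, and the numbering follows the changelog's `:column_7, :column_8` example (position-based, using the `column_` prefix).

```ruby
# Hypothetical stand-in for SmarterCSV::MalformedCSV so the sketch runs without the gem.
class MalformedCSVSketch < StandardError; end

# Returns the header list to use for a row with `row_size` fields.
# strict: true  -> a row longer than the header raises
# strict: false -> headers are padded with generated names (:column_3, :column_4, ...)
def headers_for_row(headers, row_size, prefix: 'column_', strict: false)
  return headers if row_size <= headers.size
  raise MalformedCSVSketch, "extra columns detected (#{row_size} > #{headers.size})" if strict

  extra = ((headers.size + 1)..row_size).map { |n| :"#{prefix}#{n}" }
  headers + extra
end

headers = %i[first last]
row = %w[Ada Lovelace 1815 London]
p headers_for_row(headers, row.size).zip(row).to_h
```

In the sketch, a row that fits the header is returned unchanged, so only rows with surplus fields trigger either the padding or the exception.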
data/CONTRIBUTORS.md
CHANGED
@@ -53,3 +53,8 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
|
|
53
53
|
* [JP Camara](https://github.com/jpcamara)
|
54
54
|
* [Kenton Hirowatari](https://github.com/hirowatari)
|
55
55
|
* [Daniel Pepper](https://github.com/dpep)
|
56
|
+
* [Nicolas Castellanos](https://github.com/nicastelo)
|
57
|
+
* [James Fenley](https://github.com/rex-remind101)
|
58
|
+
* [Simon Rentzke](https://github.com/simonrentzke)
|
59
|
+
* [Randall B](https://github.com/randall-coding)
|
60
|
+
* [Matthew Kennedy](https://github.com/MattKitmanLabs)
|
data/docs/data_transformations.md
CHANGED
@@ -26,6 +26,15 @@ It removes any values which are `nil` or would be empty strings.
|
|
26
26
|
`convert_values_to_numeric` is enabled by default.
|
27
27
|
SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
|
28
28
|
|
29
|
+
Here is an example of using `convert_values_to_numeric` for numbers with leading zeros, e.g. ZIP codes:
|
30
|
+
|
31
|
+
```
|
32
|
+
data = SmarterCSV.process('/tmp/zip.csv', convert_values_to_numeric: { except: [:zip] })
|
33
|
+
=> [{:zip=>"00480"}, {:zip=>"51903"}, {:zip=>"12354"}, {:zip=>"02343"}]
|
34
|
+
```
|
35
|
+
|
36
|
+
This will return the column `:zip` as a string with all digits intact.
|
37
|
+
|
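To see why the `except:` list matters for values such as ZIP codes, here is a small stand-alone sketch of numeric conversion with an exclusion list. It is an assumption-laden illustration, not SmarterCSV's implementation: `convert_numeric` is a hypothetical helper and its regular expressions are deliberately simple.

```ruby
# Convert numeric-looking strings to Integer/Float,
# but leave the keys listed in `except` untouched.
def convert_numeric(row, except: [])
  row.to_h do |key, value|
    next [key, value] if except.include?(key) || !value.is_a?(String)

    converted =
      case value
      when /\A[+-]?\d+\z/      then value.to_i   # "00480" would become 480 here
      when /\A[+-]?\d+\.\d+\z/ then value.to_f
      else value
      end
    [key, converted]
  end
end

row = { zip: '00480', qty: '12', price: '1.50' }
p convert_numeric(row)                  # :zip loses its leading zeros
p convert_numeric(row, except: [:zip])  # :zip stays a string
```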
29
38
|
## Remove Zero Values
|
30
39
|
`remove_zero_values` is disabled by default.
|
31
40
|
When enabled, it removes key/value pairs which have a numeric value equal to zero.
|
@@ -44,7 +53,7 @@ It can happen that after all transformations, a row of the CSV file would produc
|
|
44
53
|
|
45
54
|
By default SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.
|
46
55
|
|
47
|
-
This can be set to `
|
56
|
+
This can be set to `false`, to keep these empty hashes in the results.
|
48
57
|
|
49
58
|
-------------------
|
50
59
|
PREVIOUS: [Header Validations](./header_validations.md) | NEXT: [Value Converters](./value_converters.md)
|
data/docs/header_transformations.md
CHANGED
@@ -64,6 +64,8 @@ If you want to have an underscore between the header and the number, you can set
|
|
64
64
|
=> [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}]
|
65
65
|
```
|
66
66
|
|
67
|
+
If you set `duplicate_header_suffix: nil`, you get the same behavior as earlier versions, which raised the `SmarterCSV::DuplicateHeaders` error.
|
68
|
+
|
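The two modes of `duplicate_header_suffix` can be illustrated with a rough sketch. It assumes that repeated headers are numbered from the second occurrence onward (`:name`, `:name2`, ...), which is how the surrounding docs describe it. `disambiguate` and `DuplicateHeadersSketch` are hypothetical names, not part of the gem.

```ruby
# Hypothetical stand-in for SmarterCSV::DuplicateHeaders.
class DuplicateHeadersSketch < StandardError; end

# suffix: nil     -> duplicates raise (the older behavior)
# suffix: '', '_' -> duplicates get a running number appended after the suffix
def disambiguate(headers, suffix: '')
  if suffix.nil?
    dupes = headers.select { |h| headers.count(h) > 1 }.uniq
    raise DuplicateHeadersSketch, "duplicate headers: #{dupes.join(', ')}" unless dupes.empty?

    return headers
  end

  seen = Hash.new(0)
  headers.map do |h|
    seen[h] += 1
    seen[h] == 1 ? h : :"#{h}#{suffix}#{seen[h]}"
  end
end

p disambiguate(%i[name name name], suffix: '_')
```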
67
69
|
## Key Mapping
|
68
70
|
|
69
71
|
The above example already illustrates how intermediate keys can be mapped into something different.
|
data/docs/options.md
CHANGED
@@ -41,7 +41,7 @@
|
|
41
41
|
| :skip_lines | nil | how many lines to skip before the first line or header line is processed |
|
42
42
|
| :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
|
43
43
|
---------------------------------------------------------------------------------------------------------------------------------
|
44
|
-
| :col_sep | :auto | column separator (default was ',')
|
44
|
+
| :col_sep | :auto | column separator (default was ',') |
|
45
45
|
| :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
|
46
46
|
| | | e.g. when :quote_char is not properly escaped |
|
47
47
|
| :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
|
@@ -49,9 +49,10 @@
|
|
49
49
|
| :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
|
50
50
|
| :quote_char | '"' | quotation character |
|
51
51
|
---------------------------------------------------------------------------------------------------------------------------------
|
52
|
-
| :headers_in_file |
|
53
|
-
| | |
|
54
|
-
| | |
|
52
|
+
| :headers_in_file | true(1) | Whether or not the file contains headers as the first line. |
|
53
|
+
| | | (1): if `user_provided_headers` is given, the default is `false`, |
|
54
|
+
| | | unless you specify it to be explicitly `true`. |
|
55
|
+
| | | This prevents losing the first line of data, which is otherwise assumed to be a header. |
|
55
56
|
| :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
|
56
57
|
| | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
|
57
58
|
| :user_provided_headers | nil | *careful with that axe!* |
|
@@ -61,6 +62,8 @@
|
|
61
62
|
| :remove_empty_hashes | true | remove / ignore any hashes which don't have any key/value pairs or all empty values |
|
62
63
|
| :verbose | false | print out line number while processing (to track down problems in input files) |
|
63
64
|
| :with_line_numbers | false | add :csv_line_number to each data hash |
|
65
|
+
| :missing_header_prefix | column_ | can be set to a string of your liking |
|
66
|
+
| :strict | false | When set to `true`, extra columns will raise MalformedCSV exception |
|
64
67
|
---------------------------------------------------------------------------------------------------------------------------------
|
65
68
|
|
66
69
|
Additional 1.x Options which may be replaced in 2.0
|
@@ -71,11 +74,11 @@ There have been a lot of 1-offs and feature creep around these options, and goin
|
|
71
74
|
| Option | Default | Explanation |
|
72
75
|
---------------------------------------------------------------------------------------------------------------------------------
|
73
76
|
| :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
|
74
|
-
| :silence_missing_keys | false | ignore missing keys in `key_mapping`
|
75
|
-
| | | if set to true: makes all mapped keys optional
|
77
|
+
| :silence_missing_keys | false | ignore missing keys in `key_mapping` |
|
78
|
+
| | | if set to true: makes all mapped keys optional |
|
76
79
|
| | | if given an array, makes only the keys listed in it optional |
|
77
|
-
| :required_keys | nil | An array. Specify the required names AFTER header transformation.
|
78
|
-
| :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead
|
80
|
+
| :required_keys | nil | An array. Specify the required names AFTER header transformation. |
|
81
|
+
| :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
|
79
82
|
| | | or an exception is raised No validation if nil is given. |
|
80
83
|
| :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
|
81
84
|
| :downcase_header | true | downcase all column headers |
|
data/docs/value_converters.md
CHANGED
@@ -21,10 +21,10 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
|
|
21
21
|
|
22
22
|
```ruby
|
23
23
|
$ cat spec/fixtures/with_dates.csv
|
24
|
-
first,last,date,price
|
25
|
-
Ben,Miller,10/30/1998,$44.50
|
26
|
-
Tom,Turner,2/1/2011,$15.99
|
27
|
-
Ken,Smith,01/09/2013,$199.99
|
24
|
+
first,last,date,price,member
|
25
|
+
Ben,Miller,10/30/1998,$44.50,TRUE
|
26
|
+
Tom,Turner,2/1/2011,$15.99,False
|
27
|
+
Ken,Smith,01/09/2013,$199.99,true
|
28
28
|
|
29
29
|
$ irb
|
30
30
|
> require 'smarter_csv'
|
@@ -51,7 +51,20 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
|
|
51
51
|
end
|
52
52
|
end
|
53
53
|
|
54
|
-
|
54
|
+
class BooleanConverter
|
55
|
+
def self.convert(value)
|
56
|
+
case value
|
57
|
+
when /true/i
|
58
|
+
true
|
59
|
+
when /false/i
|
60
|
+
false
|
61
|
+
else
|
62
|
+
nil
|
63
|
+
end
|
64
|
+
end
|
65
|
+
end
|
66
|
+
|
67
|
+
options = {value_converters: {date: DateConverter, price: DollarConverter, member: BooleanConverter}}
|
55
68
|
data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
|
56
69
|
first_record = data.first
|
57
70
|
first_record[:date]
|
@@ -62,6 +75,8 @@ If you use `key_mappings` and `value_converters`, make sure that the value conve
|
|
62
75
|
=> 44.50
|
63
76
|
first_record[:price].class
|
64
77
|
=> Float
|
78
|
+
first_record[:member]
|
79
|
+
=> true
|
65
80
|
```
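One caveat about the `BooleanConverter` in this hunk: `when /true/i` matches any string that merely contains "true" (for example "untrue"). If that matters for your data, an anchored variant is easy to write. The class below is a hypothetical alternative, not part of the gem's docs. It only assumes the interface the docs already use, a class-level `convert(value)` method.

```ruby
# Stricter variant: only the whole words "true"/"false" (any case,
# surrounding whitespace ignored) are converted; everything else yields nil.
class StrictBooleanConverter
  def self.convert(value)
    case value.to_s.strip
    when /\Atrue\z/i  then true
    when /\Afalse\z/i then false
    end
  end
end

p StrictBooleanConverter.convert('TRUE')
p StrictBooleanConverter.convert('untrue')
```

It would be passed the same way as the converters above, e.g. `value_converters: { member: StrictBooleanConverter }`.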
|
66
81
|
|
67
82
|
--------------------
|
data/ext/smarter_csv/smarter_csv.c
CHANGED
@@ -9,9 +9,10 @@
|
|
9
9
|
#define true ((bool)1)
|
10
10
|
#endif
|
11
11
|
|
12
|
-
|
13
|
-
|
14
|
-
|
12
|
+
VALUE SmarterCSV = Qnil;
|
13
|
+
VALUE eMalformedCSVError = Qnil;
|
14
|
+
VALUE Parser = Qnil;
|
15
|
+
|
15
16
|
static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) {
|
16
17
|
if (RB_TYPE_P(line, T_NIL) == 1) {
|
17
18
|
return rb_ary_new();
|
@@ -24,7 +25,7 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
24
25
|
rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
|
25
26
|
char *startP = RSTRING_PTR(line); /* may not be null terminated */
|
26
27
|
long line_len = RSTRING_LEN(line);
|
27
|
-
char *endP = startP + line_len
|
28
|
+
char *endP = startP + line_len; /* points behind the string */
|
28
29
|
char *p = startP;
|
29
30
|
|
30
31
|
char *col_sepP = RSTRING_PTR(col_sep);
|
@@ -39,18 +40,19 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
39
40
|
VALUE field;
|
40
41
|
long i;
|
41
42
|
|
42
|
-
|
43
|
-
long backslash_count = 0;
|
43
|
+
/* Variables for escaped quote handling */
|
44
|
+
long backslash_count = 0;
|
45
|
+
bool in_quotes = false;
|
44
46
|
|
45
47
|
while (p < endP) {
|
46
48
|
/* does the remaining string start with col_sep ? */
|
47
49
|
col_sep_found = true;
|
48
|
-
for(i=0; (i < col_sep_len) && (p+i < endP)
|
50
|
+
for(i=0; (i < col_sep_len) && (p+i < endP); i++) {
|
49
51
|
col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
|
50
52
|
}
|
51
|
-
/* if col_sep was found and we
|
52
|
-
if (col_sep_found &&
|
53
|
-
/* if max_size != nil &&
|
53
|
+
/* if col_sep was found and we're not inside quotes */
|
54
|
+
if (col_sep_found && !in_quotes) {
|
55
|
+
/* if max_size != nil && elements.size >= header_size */
|
54
56
|
if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
|
55
57
|
break;
|
56
58
|
} else {
|
@@ -60,22 +62,30 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
60
62
|
|
61
63
|
p += col_sep_len;
|
62
64
|
startP = p;
|
65
|
+
backslash_count = 0; // Reset backslash count at the start of a new field
|
63
66
|
}
|
64
67
|
} else {
|
65
68
|
if (*p == '\\') {
|
66
69
|
backslash_count++;
|
67
70
|
} else {
|
68
|
-
if (*p == *quoteP
|
69
|
-
|
71
|
+
if (*p == *quoteP) {
|
72
|
+
if (backslash_count % 2 == 0) {
|
73
|
+
/* Even number of backslashes means quote is not escaped */
|
74
|
+
in_quotes = !in_quotes;
|
75
|
+
}
|
76
|
+
/* Else, quote is escaped; do nothing */
|
70
77
|
}
|
71
|
-
backslash_count = 0; //
|
78
|
+
backslash_count = 0; // Reset after any character other than backslash
|
72
79
|
}
|
73
80
|
p++;
|
74
81
|
}
|
75
|
-
|
76
|
-
prev_char = *(p - 1); // Update the previous character
|
77
82
|
} /* while */
|
78
83
|
|
84
|
+
/* Check for unclosed quotes at the end of the line */
|
85
|
+
if (in_quotes) {
|
86
|
+
rb_raise(eMalformedCSVError, "Unclosed quoted field detected in line: %s", StringValueCStr(line));
|
87
|
+
}
|
88
|
+
|
79
89
|
/* check if the last part of the line needs to be processed */
|
80
90
|
if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
|
81
91
|
/* copy the remaining line as a field with original encoding onto the results */
|
@@ -86,12 +96,11 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
86
96
|
return elements;
|
87
97
|
}
|
88
98
|
|
89
|
-
VALUE SmarterCSV = Qnil;
|
90
|
-
VALUE Parser = Qnil;
|
91
|
-
|
92
99
|
void Init_smarter_csv(void) {
|
93
|
-
|
94
|
-
|
100
|
+
// these modules and the error class are already defined in Ruby code, make them accessible:
|
101
|
+
SmarterCSV = rb_const_get(rb_cObject, rb_intern("SmarterCSV"));
|
102
|
+
Parser = rb_const_get(SmarterCSV, rb_intern("Parser"));
|
103
|
+
eMalformedCSVError = rb_const_get(SmarterCSV, rb_intern("MalformedCSV"));
|
95
104
|
|
96
105
|
rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 4);
|
97
106
|
}
|
data/lib/smarter_csv/auto_detection.rb
CHANGED
@@ -13,13 +13,17 @@ module SmarterCSV
|
|
13
13
|
delimiters = [',', "\t", ';', ':', '|']
|
14
14
|
|
15
15
|
line = nil
|
16
|
+
escaped_quote = Regexp.escape(options[:quote_char])
|
16
17
|
has_header = options[:headers_in_file]
|
17
18
|
candidates = Hash.new(0)
|
18
19
|
count = has_header ? 1 : 5
|
19
20
|
count.times do
|
20
21
|
line = readline_with_counts(filehandle, options)
|
21
22
|
delimiters.each do |d|
|
22
|
-
|
23
|
+
# Count only non-quoted occurrences of the delimiter
|
24
|
+
non_quoted_text = line.split(/#{escaped_quote}[^#{escaped_quote}]*#{escaped_quote}/).join
|
25
|
+
|
26
|
+
candidates[d] += non_quoted_text.scan(d).count
|
23
27
|
end
|
24
28
|
rescue EOFError # short files
|
25
29
|
break
|
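The quoted-section filtering added to the delimiter detection can be tried in isolation. The sketch below reuses the same `split`/`join`/`scan` idea from the hunk. `count_unquoted_delimiters` is a hypothetical wrapper, and like the original regex it does not account for backslash-escaped quotes.

```ruby
# Count delimiter candidates only outside of quoted sections.
def count_unquoted_delimiters(line, quote_char: '"', delimiters: [',', "\t", ';', ':', '|'])
  q = Regexp.escape(quote_char)
  # Drop every "..." section, then count what is left.
  unquoted = line.split(/#{q}[^#{q}]*#{q}/).join
  delimiters.to_h { |d| [d, unquoted.scan(d).count] }
end

# The commas sit inside quotes, so ';' should win the vote.
p count_unquoted_delimiters('a;"x,y,z";b;c')
```

Without the filtering step, the three quoted commas in this line would tie with the three real `;` separators.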
data/lib/smarter_csv/errors.rb
CHANGED
@@ -11,6 +11,7 @@ module SmarterCSV
|
|
11
11
|
class MissingKeys < SmarterCSVException; end # previously known as MissingHeaders
|
12
12
|
class NoColSepDetected < SmarterCSVException; end
|
13
13
|
class KeyMappingError < SmarterCSVException; end
|
14
|
+
class MalformedCSV < SmarterCSVException; end
|
14
15
|
# Writer:
|
15
16
|
class InvalidInputData < SmarterCSVException; end
|
16
17
|
end
|
data/lib/smarter_csv/options.rb
CHANGED
@@ -26,6 +26,7 @@ module SmarterCSV
|
|
26
26
|
invalid_byte_sequence: '',
|
27
27
|
keep_original_headers: false,
|
28
28
|
key_mapping: nil,
|
29
|
+
missing_header_prefix: 'column_',
|
29
30
|
quote_char: '"',
|
30
31
|
remove_empty_hashes: true,
|
31
32
|
remove_empty_values: true,
|
@@ -37,6 +38,7 @@ module SmarterCSV
|
|
37
38
|
row_sep: :auto, # was: $/,
|
38
39
|
silence_missing_keys: false,
|
39
40
|
skip_lines: nil,
|
41
|
+
strict: false,
|
40
42
|
strings_as_keys: false,
|
41
43
|
strip_chars_from_headers: nil,
|
42
44
|
strip_whitespace: true,
|
@@ -50,6 +52,18 @@ module SmarterCSV
|
|
50
52
|
def process_options(given_options = {})
|
51
53
|
puts "User provided options:\n#{pp(given_options)}\n" if given_options[:verbose]
|
52
54
|
|
55
|
+
# Special case for :user_provided_headers:
|
56
|
+
#
|
57
|
+
# If we would use the default `headers_in_file: true`, and `:user_provided_headers` are given,
|
58
|
+
# we could lose the first data row
|
59
|
+
#
|
60
|
+
# We now err on the side of treating an actual header as data, rather than losing a data row.
|
61
|
+
#
|
62
|
+
if given_options[:user_provided_headers] && !given_options.keys.include?(:headers_in_file)
|
63
|
+
given_options[:headers_in_file] = false
|
64
|
+
puts "WARNING: setting `headers_in_file: false` as a precaution to not lose the first row. Set explicitly to `true` if you have headers."
|
65
|
+
end
|
66
|
+
|
53
67
|
@options = DEFAULT_OPTIONS.dup.merge!(given_options)
|
54
68
|
|
55
69
|
# fix invalid input
|
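The special case for `:user_provided_headers` boils down to "only override `headers_in_file` when the caller did not mention it". A minimal sketch of that rule follows. `DEFAULTS_SKETCH` and `effective_options` are hypothetical names. Unlike the code in the hunk, the sketch copies the hash instead of mutating the caller's options, and prints no warning.

```ruby
DEFAULTS_SKETCH = { headers_in_file: true, user_provided_headers: nil }.freeze

def effective_options(given = {})
  given = given.dup
  # Only apply the safer default when :headers_in_file was not set explicitly.
  if given[:user_provided_headers] && !given.key?(:headers_in_file)
    given[:headers_in_file] = false
  end
  DEFAULTS_SKETCH.merge(given)
end

p effective_options(user_provided_headers: %i[a b])
p effective_options(user_provided_headers: %i[a b], headers_in_file: true)
```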
data/lib/smarter_csv/parser.rb
CHANGED
@@ -7,6 +7,8 @@ module SmarterCSV
|
|
7
7
|
###
|
8
8
|
### Thin wrapper around C-extension
|
9
9
|
###
|
10
|
+
### NOTE: we are no longer passing-in header_size
|
11
|
+
###
|
10
12
|
def parse(line, options, header_size = nil)
|
11
13
|
# puts "SmarterCSV.parse OPTIONS: #{options[:acceleration]}" if options[:verbose]
|
12
14
|
|
@@ -31,59 +33,83 @@ module SmarterCSV
|
|
31
33
|
# - we are not assuming that quotes inside a fields need to be doubled
|
32
34
|
# - we are not assuming that all fields need to be quoted (0 is even)
|
33
35
|
# - works with multi-char col_sep
|
34
|
-
# - if header_size is given, only up to header_size fields are parsed
|
35
36
|
#
|
36
|
-
#
|
37
|
-
# in case there are trailing col_sep characters in line
|
37
|
+
# NOTE: we are no longer passing-in header_size
|
38
38
|
#
|
39
|
-
#
|
39
|
+
# - if header_size was given, only up to header_size fields are parsed
|
40
40
|
#
|
41
|
+
# We used header_size for parsing the body lines to make sure we always match the number of headers
|
42
|
+
# in case there are trailing col_sep characters in line
|
41
43
|
#
|
42
|
-
#
|
43
|
-
#
|
44
|
-
# In which case the remaining fields in the line are ignored
|
44
|
+
# the purpose of the max_size parameter was to handle a corner case where
|
45
|
+
# CSV lines contain more fields than the header. In which case the remaining fields in the line were ignored
|
45
46
|
#
|
47
|
+
# Our convention is that empty fields are returned as empty strings, not as nil.
|
48
|
+
|
46
49
|
def parse_csv_line_ruby(line, options, header_size = nil)
|
47
|
-
return [] if line.nil?
|
50
|
+
return [[], 0] if line.nil?
|
48
51
|
|
49
52
|
line_size = line.size
|
50
53
|
col_sep = options[:col_sep]
|
51
54
|
col_sep_size = col_sep.size
|
52
55
|
quote = options[:quote_char]
|
53
|
-
quote_count = 0
|
54
56
|
elements = []
|
55
57
|
start = 0
|
56
58
|
i = 0
|
57
59
|
|
58
|
-
|
60
|
+
backslash_count = 0
|
61
|
+
in_quotes = false
|
62
|
+
|
59
63
|
while i < line_size
|
60
|
-
if
|
64
|
+
# Check if the current position matches the column separator and we're not inside quotes
|
65
|
+
if line[i...i+col_sep_size] == col_sep && !in_quotes
|
61
66
|
break if !header_size.nil? && elements.size >= header_size
|
62
67
|
|
63
68
|
elements << cleanup_quotes(line[start...i], quote)
|
64
|
-
|
65
|
-
i += col_sep.size
|
69
|
+
i += col_sep_size
|
66
70
|
start = i
|
71
|
+
backslash_count = 0 # Reset backslash count at the start of a new field
|
67
72
|
else
|
68
|
-
|
69
|
-
|
73
|
+
if line[i] == '\\'
|
74
|
+
backslash_count += 1
|
75
|
+
else
|
76
|
+
if line[i] == quote
|
77
|
+
if backslash_count % 2 == 0
|
78
|
+
# Even number of backslashes means quote is not escaped
|
79
|
+
in_quotes = !in_quotes
|
80
|
+
end
|
81
|
+
# Else, quote is escaped; do nothing
|
82
|
+
end
|
83
|
+
backslash_count = 0 # Reset after any character other than backslash
|
84
|
+
end
|
70
85
|
i += 1
|
71
86
|
end
|
72
87
|
end
|
73
|
-
|
88
|
+
|
89
|
+
# Check for unclosed quotes at the end of the line
|
90
|
+
if in_quotes
|
91
|
+
raise MalformedCSV, "Unclosed quoted field detected in line: #{line}"
|
92
|
+
end
|
93
|
+
|
94
|
+
# Process the remaining field
|
95
|
+
if header_size.nil? || elements.size < header_size
|
96
|
+
elements << cleanup_quotes(line[start..-1], quote)
|
97
|
+
end
|
98
|
+
|
74
99
|
[elements, elements.size]
|
75
100
|
end
|
76
101
|
|
77
102
|
def cleanup_quotes(field, quote)
|
78
103
|
return field if field.nil?
|
79
104
|
|
80
|
-
#
|
81
|
-
|
105
|
+
# Remove surrounding quotes if present
|
82
106
|
if field.start_with?(quote) && field.end_with?(quote)
|
83
|
-
field
|
84
|
-
field.delete_suffix!(quote)
|
107
|
+
field = field[1..-2]
|
85
108
|
end
|
86
|
-
|
109
|
+
|
110
|
+
# Replace double quotes with a single quote
|
111
|
+
field.gsub!("#{quote * 2}", quote)
|
112
|
+
|
87
113
|
field
|
88
114
|
end
|
89
115
|
end
|
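The backslash-parity rule used by the new parser loop is easiest to check on small inputs. The function below is a condensed, stand-alone sketch of the same state machine. It is not the gem's method: it keeps the surrounding quotes instead of calling `cleanup_quotes`, and `split_unquoted` / `MalformedCSVSketch` are hypothetical names.

```ruby
# Hypothetical stand-in for SmarterCSV::MalformedCSV.
class MalformedCSVSketch < StandardError; end

# Split on col_sep only outside quotes; a quote preceded by an odd number
# of backslashes counts as escaped and does not toggle the quote state.
def split_unquoted(line, col_sep: ',', quote: '"')
  fields = []
  start = 0
  i = 0
  backslashes = 0
  in_quotes = false

  while i < line.size
    if !in_quotes && line[i, col_sep.size] == col_sep
      fields << line[start...i]
      i += col_sep.size
      start = i
      backslashes = 0
    else
      if line[i] == '\\'
        backslashes += 1
      else
        in_quotes = !in_quotes if line[i] == quote && backslashes.even?
        backslashes = 0
      end
      i += 1
    end
  end
  raise MalformedCSVSketch, "unclosed quoted field in: #{line}" if in_quotes

  fields << line[start..]
end

p split_unquoted('a,"b,c",d')
```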
data/lib/smarter_csv/reader.rb
CHANGED
@@ -62,7 +62,8 @@ module SmarterCSV
|
|
62
62
|
|
63
63
|
skip_lines(fh, options)
|
64
64
|
|
65
|
-
|
65
|
+
# NOTE: we are no longer using header_size
|
66
|
+
@headers, _header_size = process_headers(fh, options)
|
66
67
|
@headerA = @headers # @headerA is deprecated, use @headers
|
67
68
|
|
68
69
|
puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
|
@@ -97,14 +98,23 @@ module SmarterCSV
|
|
97
98
|
multiline = count_quote_chars(line, options[:quote_char]).odd?
|
98
99
|
|
99
100
|
while multiline
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
|
107
|
-
|
101
|
+
begin
|
102
|
+
next_line = fh.readline(options[:row_sep])
|
103
|
+
next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
|
104
|
+
line += next_line
|
105
|
+
@file_line_count += 1
|
106
|
+
|
107
|
+
multiline = count_quote_chars(line, options[:quote_char]).odd?
|
108
|
+
rescue EOFError
|
109
|
+
# End of file reached. Check if quotes are balanced.
|
110
|
+
total_quotes = count_quote_chars(line, options[:quote_char])
|
111
|
+
if total_quotes.odd?
|
112
|
+
raise MalformedCSV, "Unclosed quoted field detected in multiline data"
|
113
|
+
else
|
114
|
+
# Quotes are balanced; proceed without raising an error.
|
115
|
+
break
|
116
|
+
end
|
117
|
+
end
|
108
118
|
end
|
109
119
|
|
110
120
|
# :nocov:
|
@@ -116,7 +126,18 @@ module SmarterCSV
|
|
116
126
|
line.chomp!(options[:row_sep])
|
117
127
|
|
118
128
|
# --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
|
119
|
-
dataA,
|
129
|
+
dataA, data_size = parse(line, options) # we parse the extra columns
|
130
|
+
|
131
|
+
if options[:strict]
|
132
|
+
raise SmarterCSV::HeaderSizeMismatch, "extra columns detected on line #{@file_line_count}"
|
133
|
+
else
|
134
|
+
# we create additional columns on-the-fly
|
135
|
+
current_size = @headers.size
|
136
|
+
while current_size < data_size
|
137
|
+
@headers << "#{options[:missing_header_prefix]}#{current_size + 1}".to_sym
|
138
|
+
current_size += 1
|
139
|
+
end
|
140
|
+
end
|
120
141
|
|
121
142
|
dataA.map!{|x| x.strip} if options[:strip_whitespace]
|
122
143
|
|
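The multiline handling in this hunk keeps appending physical lines until the quote count is even, and treats end-of-file with an odd count as malformed input. A simplified sketch of that loop is shown below, with a `StringIO` standing in for the file handle. `read_logical_line` and `MalformedCSVSketch` are hypothetical, and `String#count` is used as a rough substitute for the gem's `count_quote_chars`, so escaped quotes are not treated specially here.

```ruby
require 'stringio'

# Hypothetical stand-in for SmarterCSV::MalformedCSV.
class MalformedCSVSketch < StandardError; end

# Read one logical CSV record, joining physical lines while a quote is open.
def read_logical_line(io, quote: '"', row_sep: "\n")
  line = io.readline(row_sep)
  while line.count(quote).odd?
    begin
      line += io.readline(row_sep)
    rescue EOFError
      # Input ended while a quoted field was still open.
      raise MalformedCSVSketch, 'unclosed quoted field in multiline data'
    end
  end
  line
end

io = StringIO.new("a,\"multi\nline\",b\nc,d\n")
p read_logical_line(io)
p read_logical_line(io)
```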
data/lib/smarter_csv/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: smarter_csv
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.13.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Tilo Sloboda
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2024-
|
11
|
+
date: 2024-11-05 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: awesome_print
|