smarter_csv 1.9.2 → 1.10.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +23 -0
- data/README.md +29 -8
- data/lib/smarter_csv/auto_detection.rb +73 -0
- data/lib/smarter_csv/file_io.rb +50 -0
- data/lib/smarter_csv/hash_transformations.rb +91 -0
- data/lib/smarter_csv/header_transformations.rb +63 -0
- data/lib/smarter_csv/header_validations.rb +34 -0
- data/lib/smarter_csv/headers.rb +68 -0
- data/lib/smarter_csv/options_processing.rb +10 -1
- data/lib/smarter_csv/parse.rb +90 -0
- data/lib/smarter_csv/smarter_csv.rb +79 -416
- data/lib/smarter_csv/variables.rb +30 -0
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +16 -3
- metadata +11 -4
- data/lib/core_ext/hash.rb +0 -9
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: f1d0b58acf0135b621e3182470674230ef73b48c829810e74fffa975fc318cf5
|
4
|
+
data.tar.gz: ee404c5c485748d35cda36b8d249cb6813a3f80005182fe8c05feac1694aba57
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 4fee097fe2237f863510100155062da6815237260da5b15189f104f54596f7d5ff0479deb80596544e0bb1b9ba7b78126d2251798721e8d2f91e06b430950cd6
|
7
|
+
data.tar.gz: c30562965452ef296b5e5aaf2a9a12887aa42d8e8396780b73b34f99a2386d232bf020578618fcbd65186fc864518c81a3e7555cae9b00a005322f3599e18c5a
|
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,29 @@
|
|
1
1
|
|
2
2
|
# SmarterCSV 1.x Change Log
|
3
3
|
|
4
|
+
## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡
|
5
|
+
|
6
|
+
* BREAKING CHANGES:
|
7
|
+
|
8
|
+
Changed behavior:
|
9
|
+
+ when `user_provided_headers` are provided:
|
10
|
+
* if they are not unique, an exception will now be raised
|
11
|
+
* they are taken "as is", no header transformations can be applied
|
12
|
+
* when they are given as strings or as symbols, it is assumed that this is the desired format
|
13
|
+
* the value of the `strings_as_keys` option will be ignored
|
14
|
+
|
15
|
+
+ option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
|
16
|
+
* this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
|
17
|
+
* explicitly set this option to `nil` to get the behavior from previous versions.
|
18
|
+
|
19
|
+
* performance and memory improvements
|
20
|
+
* code refactor
|
21
|
+
|
22
|
+
## 1.9.3 (2023-12-16)
|
23
|
+
* raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
|
24
|
+
* code refactor / no functional changes
|
25
|
+
* added test cases
|
26
|
+
|
4
27
|
## 1.9.2 (2023-11-12)
|
5
28
|
* fixed bug with '\\' at end of line (issue #252, thanks to averycrespi-moz)
|
6
29
|
* fixed require statements (issue #249, thanks to PikachuEXE, courtsimas)
|
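The `duplicate_header_suffix` change described in the 1.10.0 entry can be illustrated with a small standalone sketch. It follows the logic of the `disambiguate_headers` method that this diff adds in `header_transformations.rb`; the method name `disambiguate` and the sample headers are made up for illustration and are not part of the gem's API.

```ruby
# Standalone sketch (hypothetical helper, not the gem's API):
# append a running number to repeated headers, separated by the given suffix.
def disambiguate(headers, suffix)
  counts = Hash.new(0)
  headers.map do |header|
    counts[header] += 1
    # the first occurrence stays untouched, later ones get "<suffix><n>"
    counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
  end
end

p disambiguate(%w[name email email], '')   # suffix '' is the new default
p disambiguate(%w[name email email], '_')  # a custom suffix
```

With the suffix set to `nil`, the changelog says the previous behavior applies, so no renaming would take place and the duplicate-header check would apply instead.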
data/README.md
CHANGED
@@ -2,15 +2,33 @@
|
|
2
2
|
# SmarterCSV
|
3
3
|
|
4
4
|
[](https://codecov.io/gh/tilo/smarter_csv) [](http://badge.fury.io/rb/smarter_csv)
|
5
|
-
|
5
|
+
|
6
|
+
|
7
|
+
#### LATEST CHANGES
|
8
|
+
|
9
|
+
* Version 1.10.0 has BREAKING CHANGES:
|
10
|
+
|
11
|
+
Changed behavior:
|
12
|
+
+ when `user_provided_headers` are provided:
|
13
|
+
* if they are not unique, an exception will now be raised
|
14
|
+
* they are taken "as is", no header transformations can be applied
|
15
|
+
* when they are given as strings or as symbols, it is assumed that this is the desired format
|
16
|
+
* the value of the `strings_as_keys` option will be ignored
|
17
|
+
|
18
|
+
+ option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
|
19
|
+
* this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
|
20
|
+
* explicitly set this option to `nil` to get the behavior from previous versions.
|
21
|
+
|
6
22
|
#### Development Branches
|
7
23
|
|
8
24
|
* default branch is `main` for 1.x development
|
9
|
-
|
25
|
+
|
26
|
+
* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
|
27
|
+
- This is an EXPERIMENTAL branch - DO NOT USE in production
|
10
28
|
|
11
|
-
#### Work towards Future Version 2.
|
29
|
+
#### Work towards Future Version 2.x
|
12
30
|
|
13
|
-
* Work towards SmarterCSV 2.
|
31
|
+
* Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
|
14
32
|
Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
|
15
33
|
|
16
34
|
---------------
|
@@ -84,6 +102,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
|
|
84
102
|
00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
|
85
103
|
```
|
86
104
|
|
105
|
+
### Articles
|
106
|
+
* [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
|
107
|
+
* [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
|
108
|
+
|
87
109
|
### Examples
|
88
110
|
|
89
111
|
Here are some examples to demonstrate the versatility of SmarterCSV.
|
@@ -243,8 +265,6 @@ NOTE: If you use `key_mappings` and `value_converters`, make sure that the value
|
|
243
265
|
data[0][:price].class
|
244
266
|
=> Float
|
245
267
|
```
|
246
|
-
## Parallel Processing
|
247
|
-
[Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
|
248
268
|
|
249
269
|
## Documentation
|
250
270
|
|
@@ -280,7 +300,8 @@ The options and the block are optional.
|
|
280
300
|
| :headers_in_file | true | Whether or not the file contains headers as the first line. |
|
281
301
|
| | | Important if the file does not contain headers, |
|
282
302
|
| | | otherwise you would lose the first line of data. |
|
283
|
-
| :duplicate_header_suffix |
|
303
|
+
| :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
|
304
|
+
| | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
|
284
305
|
| :user_provided_headers | nil | *careful with that axe!* |
|
285
306
|
| | | user provided Array of header strings or symbols, to define |
|
286
307
|
| | | what headers should be used, overriding any in-file headers. |
|
@@ -300,7 +321,7 @@ And header and data validations will also be supported in 2.x
|
|
300
321
|
| Option | Default | Explanation |
|
301
322
|
---------------------------------------------------------------------------------------------------------------------------------
|
302
323
|
| :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
|
303
|
-
| :
|
324
|
+
| :silence_missing_keys | false | ignore missing keys in `key_mapping` |
|
304
325
|
| | | if set to true: makes all mapped keys optional |
|
305
326
|
| | | if given an array, makes only the keys listed in it optional |
|
306
327
|
| :required_keys | nil | An array. Specify the required names AFTER header transformation. |
|
data/lib/smarter_csv/auto_detection.rb
ADDED
@@ -0,0 +1,73 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
protected
|
6
|
+
|
7
|
+
# If file has headers, then guesses column separator from headers.
|
8
|
+
# Otherwise guesses column separator from contents.
|
9
|
+
# Raises exception if none is found.
|
10
|
+
def guess_column_separator(filehandle, options)
|
11
|
+
skip_lines(filehandle, options)
|
12
|
+
|
13
|
+
delimiters = [',', "\t", ';', ':', '|']
|
14
|
+
|
15
|
+
line = nil
|
16
|
+
has_header = options[:headers_in_file]
|
17
|
+
candidates = Hash.new(0)
|
18
|
+
count = has_header ? 1 : 5
|
19
|
+
count.times do
|
20
|
+
line = readline_with_counts(filehandle, options)
|
21
|
+
delimiters.each do |d|
|
22
|
+
candidates[d] += line.scan(d).count
|
23
|
+
end
|
24
|
+
rescue EOFError # short files
|
25
|
+
break
|
26
|
+
end
|
27
|
+
rewind(filehandle)
|
28
|
+
|
29
|
+
if candidates.values.max == 0
|
30
|
+
# if the header only contains a single column (word characters only), default to ','
|
31
|
+
return ',' if line.chomp(options[:row_sep]) =~ /^\w+$/
|
32
|
+
|
33
|
+
raise SmarterCSV::NoColSepDetected
|
34
|
+
end
|
35
|
+
|
36
|
+
candidates.key(candidates.values.max)
|
37
|
+
end
|
38
|
+
|
39
|
+
# limitation: this currently reads the whole file in before making a decision
|
40
|
+
def guess_line_ending(filehandle, options)
|
41
|
+
counts = {"\n" => 0, "\r" => 0, "\r\n" => 0}
|
42
|
+
quoted_char = false
|
43
|
+
|
44
|
+
# count how many of the pre-defined line-endings we find
|
45
|
+
# ignoring those contained within quote characters
|
46
|
+
last_char = nil
|
47
|
+
lines = 0
|
48
|
+
filehandle.each_char do |c|
|
49
|
+
quoted_char = !quoted_char if c == options[:quote_char]
|
50
|
+
next if quoted_char
|
51
|
+
|
52
|
+
if last_char == "\r"
|
53
|
+
if c == "\n"
|
54
|
+
counts["\r\n"] += 1
|
55
|
+
else
|
56
|
+
counts["\r"] += 1 # \r are counted after they appeared
|
57
|
+
end
|
58
|
+
elsif c == "\n"
|
59
|
+
counts["\n"] += 1
|
60
|
+
end
|
61
|
+
last_char = c
|
62
|
+
lines += 1
|
63
|
+
break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
|
64
|
+
end
|
65
|
+
rewind(filehandle)
|
66
|
+
|
67
|
+
counts["\r"] += 1 if last_char == "\r"
|
68
|
+
# find the most frequent key/value pair:
|
69
|
+
most_frequent_key, _count = counts.max_by{|_, v| v}
|
70
|
+
most_frequent_key
|
71
|
+
end
|
72
|
+
end
|
73
|
+
end
|
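The counting approach used by `guess_column_separator` above can be tried in isolation. The following is a minimal sketch that works on an array of sample lines instead of a file handle; `guess_separator` and `CANDIDATE_DELIMITERS` are hypothetical names, and the sketch returns `nil` where the gem raises `SmarterCSV::NoColSepDetected`.

```ruby
# Hypothetical helper, not part of SmarterCSV: pick the candidate delimiter
# that occurs most often across the sample lines.
CANDIDATE_DELIMITERS = [',', "\t", ';', ':', '|'].freeze

def guess_separator(sample_lines)
  counts = Hash.new(0)
  sample_lines.each do |line|
    CANDIDATE_DELIMITERS.each { |d| counts[d] += line.scan(d).count }
  end

  max = counts.values.max
  # nothing to count, or no candidate appeared at all
  return nil if max.nil? || max.zero?

  # on a tie, the candidate listed first wins
  counts.key(max)
end

p guess_separator(["name;age;city\n", "Ann;3,5;Oslo\n"])
```

Because only raw occurrences are counted, a delimiter character that appears often inside quoted values can skew the guess; setting the separator explicitly avoids that.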
data/lib/smarter_csv/file_io.rb
ADDED
@@ -0,0 +1,50 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
protected
|
6
|
+
|
7
|
+
def readline_with_counts(filehandle, options)
|
8
|
+
line = filehandle.readline(options[:row_sep])
|
9
|
+
@file_line_count += 1
|
10
|
+
@csv_line_count += 1
|
11
|
+
line = remove_bom(line) if @csv_line_count == 1
|
12
|
+
line
|
13
|
+
end
|
14
|
+
|
15
|
+
def skip_lines(filehandle, options)
|
16
|
+
options[:skip_lines].to_i.times do
|
17
|
+
readline_with_counts(filehandle, options)
|
18
|
+
end
|
19
|
+
end
|
20
|
+
|
21
|
+
def rewind(filehandle)
|
22
|
+
@file_line_count = 0
|
23
|
+
@csv_line_count = 0
|
24
|
+
filehandle.rewind
|
25
|
+
end
|
26
|
+
|
27
|
+
private
|
28
|
+
|
29
|
+
UTF_32_BOM = %w[0 0 fe ff].freeze
|
30
|
+
UTF_32LE_BOM = %w[ff fe 0 0].freeze
|
31
|
+
UTF_8_BOM = %w[ef bb bf].freeze
|
32
|
+
UTF_16_BOM = %w[fe ff].freeze
|
33
|
+
UTF_16LE_BOM = %w[ff fe].freeze
|
34
|
+
|
35
|
+
def remove_bom(str)
|
36
|
+
str_as_hex = str.bytes.map{|x| x.to_s(16)}
|
37
|
+
# if string does not start with one of the bytes, there is no BOM
|
38
|
+
return str unless %w[ef fe ff 0].include?(str_as_hex[0])
|
39
|
+
|
40
|
+
return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
|
41
|
+
return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
|
42
|
+
return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
|
43
|
+
|
44
|
+
# :nocov:
|
45
|
+
puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
|
46
|
+
str
|
47
|
+
# :nocov:
|
48
|
+
end
|
49
|
+
end
|
50
|
+
end
|
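The BOM handling in `remove_bom` above compares hex strings; the same idea can be sketched with raw byte values. This is an illustrative variant under that assumption, not the gem's code, and `strip_bom` and `BOMS` are hypothetical names.

```ruby
# Hypothetical sketch: strip a leading byte-order mark by comparing raw bytes.
# The 4-byte marks come first because the UTF-32LE mark starts with the
# same two bytes as the UTF-16LE mark.
BOMS = [
  [0x00, 0x00, 0xFE, 0xFF], # UTF-32, big endian
  [0xFF, 0xFE, 0x00, 0x00], # UTF-32, little endian
  [0xEF, 0xBB, 0xBF],       # UTF-8
  [0xFE, 0xFF],             # UTF-16, big endian
  [0xFF, 0xFE]              # UTF-16, little endian
].freeze

def strip_bom(str)
  head = str.byteslice(0, 4).bytes
  bom = BOMS.find { |candidate| head.first(candidate.size) == candidate }
  # byteslice works on bytes, so it is safe even if the BOM is not valid text
  bom ? str.byteslice(bom.size..-1) : str
end

p strip_bom("\xEF\xBB\xBFname,age")
```

Only the first line of a file needs this treatment, which is why `readline_with_counts` calls `remove_bom` only when `@csv_line_count == 1`.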
data/lib/smarter_csv/hash_transformations.rb
ADDED
@@ -0,0 +1,91 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
def hash_transformations(hash, options)
|
6
|
+
# there may be unmapped keys, or keys purposely mapped to nil or an empty key..
|
7
|
+
# make sure we delete any key/value pairs from the hash, which the user wanted to delete:
|
8
|
+
remove_empty_values = options[:remove_empty_values] == true
|
9
|
+
remove_zero_values = options[:remove_zero_values]
|
10
|
+
remove_values_matching = options[:remove_values_matching]
|
11
|
+
convert_to_numeric = options[:convert_values_to_numeric]
|
12
|
+
value_converters = options[:value_converters]
|
13
|
+
|
14
|
+
hash.each_with_object({}) do |(k, v), new_hash|
|
15
|
+
next if k.nil? || k == '' || k == :""
|
16
|
+
next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
|
17
|
+
next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
|
18
|
+
next if remove_values_matching && v =~ remove_values_matching
|
19
|
+
|
20
|
+
# deal with the :only / :except options to :convert_values_to_numeric
|
21
|
+
if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
|
22
|
+
if v =~ /^[+-]?\d+\.\d+$/
|
23
|
+
v = v.to_f
|
24
|
+
elsif v =~ /^[+-]?\d+$/
|
25
|
+
v = v.to_i
|
26
|
+
end
|
27
|
+
end
|
28
|
+
|
29
|
+
converter = value_converters[k] if value_converters
|
30
|
+
v = converter.convert(v) if converter
|
31
|
+
|
32
|
+
new_hash[k] = v
|
33
|
+
end
|
34
|
+
end
|
35
|
+
|
36
|
+
# def hash_transformations(hash, options)
|
37
|
+
# # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
|
38
|
+
# # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
|
39
|
+
# hash.delete(nil)
|
40
|
+
# hash.delete('')
|
41
|
+
# hash.delete(:"")
|
42
|
+
|
43
|
+
# if options[:remove_empty_values] == true
|
44
|
+
# hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
|
45
|
+
# end
|
46
|
+
|
47
|
+
# hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
|
48
|
+
# hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
|
49
|
+
|
50
|
+
# if options[:convert_values_to_numeric]
|
51
|
+
# hash.each do |k, v|
|
52
|
+
# # deal with the :only / :except options to :convert_values_to_numeric
|
53
|
+
# next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
|
54
|
+
|
55
|
+
# # convert if it's a numeric value:
|
56
|
+
# case v
|
57
|
+
# when /^[+-]?\d+\.\d+$/
|
58
|
+
# hash[k] = v.to_f
|
59
|
+
# when /^[+-]?\d+$/
|
60
|
+
# hash[k] = v.to_i
|
61
|
+
# end
|
62
|
+
# end
|
63
|
+
# end
|
64
|
+
|
65
|
+
# if options[:value_converters]
|
66
|
+
# hash.each do |k, v|
|
67
|
+
# converter = options[:value_converters][k]
|
68
|
+
# next unless converter
|
69
|
+
|
70
|
+
# hash[k] = converter.convert(v)
|
71
|
+
# end
|
72
|
+
# end
|
73
|
+
|
74
|
+
# hash
|
75
|
+
# end
|
76
|
+
|
77
|
+
protected
|
78
|
+
|
79
|
+
# acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
|
80
|
+
def limit_execution_for_only_or_except(options, option_name, key)
|
81
|
+
if options[option_name].is_a?(Hash)
|
82
|
+
if options[option_name].has_key?(:except)
|
83
|
+
return true if Array(options[option_name][:except]).include?(key)
|
84
|
+
elsif options[option_name].has_key?(:only)
|
85
|
+
return true unless Array(options[option_name][:only]).include?(key)
|
86
|
+
end
|
87
|
+
end
|
88
|
+
false
|
89
|
+
end
|
90
|
+
end
|
91
|
+
end
|
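The interplay between numeric conversion and the `:only` / `:except` filter in `hash_transformations` above can be sketched on its own. In this sketch `setting` stands in for `options[:convert_values_to_numeric]`; the method names are hypothetical, and the regular expressions use `\A`/`\z` anchors rather than the `^`/`$` anchors of the source.

```ruby
# Hypothetical sketch of the :only / :except filter for numeric conversion.
def skip_key?(setting, key)
  return false unless setting.is_a?(Hash)
  return Array(setting[:except]).include?(key) if setting.key?(:except)
  return !Array(setting[:only]).include?(key) if setting.key?(:only)

  false
end

# Convert strings that look like floats or integers, leave the rest alone.
def to_numeric(value)
  case value
  when /\A[+-]?\d+\.\d+\z/ then value.to_f
  when /\A[+-]?\d+\z/ then value.to_i
  else value
  end
end

def convert_row(row, setting)
  row.each_with_object({}) do |(key, value), result|
    convert = setting && !skip_key?(setting, key)
    result[key] = convert ? to_numeric(value) : value
  end
end

# keep the ZIP code as a String so its leading zero survives
p convert_row({ id: '007', price: '9.50', zip: '02134' }, { except: [:zip] })
```

As in the source, building a new hash in a single pass avoids the repeated `delete_if` and `each` passes of the commented-out older version.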
data/lib/smarter_csv/header_transformations.rb
ADDED
@@ -0,0 +1,63 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
# transform the headers that were in the file:
|
6
|
+
def header_transformations(header_array, options)
|
7
|
+
header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
|
8
|
+
header_array.map!{|x| x.strip} if options[:strip_whitespace]
|
9
|
+
|
10
|
+
unless options[:keep_original_headers]
|
11
|
+
header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
|
12
|
+
header_array.map!{|x| x.downcase} if options[:downcase_header]
|
13
|
+
end
|
14
|
+
|
15
|
+
# detect duplicate headers and disambiguate
|
16
|
+
header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
|
17
|
+
# symbolize headers
|
18
|
+
header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
|
19
|
+
# doesn't make sense to re-map when we have user_provided_headers
|
20
|
+
header_array = remap_headers(header_array, options) if options[:key_mapping]
|
21
|
+
|
22
|
+
header_array
|
23
|
+
end
|
24
|
+
|
25
|
+
def disambiguate_headers(headers, options)
|
26
|
+
counts = Hash.new(0)
|
27
|
+
headers.map do |header|
|
28
|
+
counts[header] += 1
|
29
|
+
counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
|
30
|
+
end
|
31
|
+
end
|
32
|
+
|
33
|
+
# do some key mapping on the keys in the file header
|
34
|
+
# if you want to completely delete a key, then map it to nil or to ''
|
35
|
+
def remap_headers(headers, options)
|
36
|
+
key_mapping = options[:key_mapping]
|
37
|
+
if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
|
38
|
+
raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
|
39
|
+
end
|
40
|
+
|
41
|
+
key_mapping = options[:key_mapping]
|
42
|
+
# if silence_missing_keys are not set, raise error if missing header
|
43
|
+
missing_keys = key_mapping.keys - headers
|
44
|
+
# if the user passes a list of specific mapped keys that are optional
|
45
|
+
missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
|
46
|
+
|
47
|
+
unless missing_keys.empty? || options[:silence_missing_keys] == true
|
48
|
+
raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
|
49
|
+
end
|
50
|
+
|
51
|
+
headers.map! do |header|
|
52
|
+
if key_mapping.has_key?(header)
|
53
|
+
key_mapping[header].nil? ? nil : key_mapping[header]
|
54
|
+
elsif options[:remove_unmapped_keys]
|
55
|
+
nil
|
56
|
+
else
|
57
|
+
header
|
58
|
+
end
|
59
|
+
end
|
60
|
+
headers
|
61
|
+
end
|
62
|
+
end
|
63
|
+
end
|
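The mapping rules in `remap_headers` above (rename, map to `nil` to drop, optionally drop everything unmapped) can be sketched without the options hash. `remap` is a hypothetical name, and the sketch raises the standard `KeyError` where the gem raises `SmarterCSV::KeyMappingError`; the `silence_missing_keys` handling is left out.

```ruby
# Hypothetical sketch of key mapping on an array of headers.
def remap(headers, key_mapping, remove_unmapped: false)
  missing = key_mapping.keys - headers
  raise KeyError, "cannot map headers: #{missing.join(', ')}" unless missing.empty?

  headers.map do |header|
    if key_mapping.key?(header)
      key_mapping[header] # may be nil, which marks the column for removal
    elsif remove_unmapped
      nil
    else
      header
    end
  end
end

p remap(%i[first_name ssn age], { first_name: :first, ssn: nil })
p remap(%i[first_name ssn age], { first_name: :first }, remove_unmapped: true)
```

Columns whose header ends up as `nil` are later skipped by `hash_transformations`, which ignores `nil` and empty keys.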
data/lib/smarter_csv/header_validations.rb
ADDED
@@ -0,0 +1,34 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
def header_validations(headers, options)
|
6
|
+
check_duplicate_headers(headers, options)
|
7
|
+
check_required_headers(headers, options)
|
8
|
+
end
|
9
|
+
|
10
|
+
def check_duplicate_headers(headers, _options)
|
11
|
+
header_counts = Hash.new(0)
|
12
|
+
headers.each { |header| header_counts[header] += 1 unless header.nil? }
|
13
|
+
|
14
|
+
duplicates = header_counts.select { |_, count| count > 1 }
|
15
|
+
|
16
|
+
unless duplicates.empty?
|
17
|
+
raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
|
18
|
+
end
|
19
|
+
end
|
20
|
+
|
21
|
+
require 'set'
|
22
|
+
|
23
|
+
def check_required_headers(headers, options)
|
24
|
+
if options[:required_keys] && options[:required_keys].is_a?(Array)
|
25
|
+
headers_set = headers.to_set
|
26
|
+
missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }
|
27
|
+
|
28
|
+
unless missing_keys.empty?
|
29
|
+
raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
|
30
|
+
end
|
31
|
+
end
|
32
|
+
end
|
33
|
+
end
|
34
|
+
end
|
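The two checks in `header_validations` above can be exercised as plain functions that return their findings instead of raising. This is a sketch with hypothetical method names; the gem raises `SmarterCSV::DuplicateHeaders` and `SmarterCSV::MissingKeys` at the corresponding points.

```ruby
require 'set'

# Hypothetical sketch: headers that occur more than once, with their counts.
# nil headers (columns marked for removal) are ignored.
def duplicate_headers(headers)
  counts = Hash.new(0)
  headers.each { |header| counts[header] += 1 unless header.nil? }
  counts.select { |_, count| count > 1 }
end

# Hypothetical sketch: required keys that are absent from the headers.
def missing_required_keys(headers, required_keys)
  present = headers.to_set # Set lookup avoids rescanning the array per key
  required_keys.reject { |key| present.include?(key) }
end

p duplicate_headers([:email, :name, :email, nil, nil])
p missing_required_keys(%i[name email], %i[name phone])
```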
data/lib/smarter_csv/headers.rb
ADDED
@@ -0,0 +1,68 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
def process_headers(filehandle, options)
|
6
|
+
@raw_header = nil # header as it appears in the file
|
7
|
+
@headers = nil # the processed headers
|
8
|
+
header_array = []
|
9
|
+
file_header_size = nil
|
10
|
+
|
11
|
+
# if headers_in_file, get the headers -> We get the number of columns, even when user provided headers
|
12
|
+
if options[:headers_in_file] # extract the header line
|
13
|
+
# process the header line in the CSV file..
|
14
|
+
# the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
|
15
|
+
header_line = @raw_header = readline_with_counts(filehandle, options)
|
16
|
+
header_line = preprocess_header_line(header_line, options)
|
17
|
+
|
18
|
+
file_header_array, file_header_size = parse(header_line, options)
|
19
|
+
|
20
|
+
file_header_array = header_transformations(file_header_array, options)
|
21
|
+
|
22
|
+
else
|
23
|
+
unless options[:user_provided_headers]
|
24
|
+
raise SmarterCSV::IncorrectOption, "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers"
|
25
|
+
end
|
26
|
+
end
|
27
|
+
|
28
|
+
if options[:user_provided_headers]
|
29
|
+
unless options[:user_provided_headers].is_a?(Array) && !options[:user_provided_headers].empty?
|
30
|
+
raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for user_provided_headers! Expecting array with headers.")
|
31
|
+
end
|
32
|
+
|
33
|
+
# use user-provided headers
|
34
|
+
user_header_array = options[:user_provided_headers]
|
35
|
+
# user_provided_headers: their count should match the headers_in_file if any
|
36
|
+
if defined?(file_header_size) && !file_header_size.nil?
|
37
|
+
if user_header_array.size != file_header_size
|
38
|
+
raise SmarterCSV::HeaderSizeMismatch, "ERROR: :user_provided_headers defines #{user_header_array.size} headers != CSV-file has #{file_header_size} headers"
|
39
|
+
else
|
40
|
+
# we could print out the mapping of file_header_array to header_array here
|
41
|
+
end
|
42
|
+
end
|
43
|
+
|
44
|
+
header_array = user_header_array
|
45
|
+
else
|
46
|
+
header_array = file_header_array
|
47
|
+
end
|
48
|
+
|
49
|
+
[header_array, header_array.size]
|
50
|
+
end
|
51
|
+
|
52
|
+
private
|
53
|
+
|
54
|
+
def preprocess_header_line(header_line, options)
|
55
|
+
header_line = enforce_utf8_encoding(header_line, options)
|
56
|
+
header_line = remove_comments_from_header(header_line, options)
|
57
|
+
header_line = header_line.chomp(options[:row_sep])
|
58
|
+
header_line.gsub!(options[:strip_chars_from_headers], '') if options[:strip_chars_from_headers]
|
59
|
+
header_line
|
60
|
+
end
|
61
|
+
|
62
|
+
def remove_comments_from_header(header, options)
|
63
|
+
return header unless options[:comment_regexp]
|
64
|
+
|
65
|
+
header.sub(options[:comment_regexp], '')
|
66
|
+
end
|
67
|
+
end
|
68
|
+
end
|
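The precedence rules in `process_headers` above (user-provided headers win, but must be a non-empty Array whose size matches the file's header line) can be condensed into a short sketch. `choose_headers` is a hypothetical name, `file_headers` is `nil` when the file has no header line, and `ArgumentError` stands in for the gem's `SmarterCSV::IncorrectOption` and `SmarterCSV::HeaderSizeMismatch`.

```ruby
# Hypothetical sketch of how the final header list is chosen.
def choose_headers(file_headers, user_headers)
  if file_headers.nil? && user_headers.nil?
    raise ArgumentError, 'without headers in the file, user-provided headers are required'
  end
  return file_headers if user_headers.nil?

  unless user_headers.is_a?(Array) && !user_headers.empty?
    raise ArgumentError, 'user-provided headers must be a non-empty Array'
  end
  if file_headers && user_headers.size != file_headers.size
    raise ArgumentError, "expected #{file_headers.size} headers, got #{user_headers.size}"
  end

  user_headers # taken as is, no transformations applied
end

p choose_headers(%i[a b], nil)
p choose_headers(%i[a b], %w[x y])
```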
data/lib/smarter_csv/options_processing.rb
CHANGED
@@ -9,7 +9,7 @@ module SmarterCSV
|
|
9
9
|
comment_regexp: nil, # was: /\A#/,
|
10
10
|
convert_values_to_numeric: true,
|
11
11
|
downcase_header: true,
|
12
|
-
duplicate_header_suffix: nil,
|
12
|
+
duplicate_header_suffix: '', # was: nil,
|
13
13
|
file_encoding: 'utf-8',
|
14
14
|
force_simple_split: false,
|
15
15
|
force_utf8: false,
|
@@ -62,6 +62,15 @@ module SmarterCSV
|
|
62
62
|
private
|
63
63
|
|
64
64
|
def validate_options!(options)
|
65
|
+
# deprecate required_headers
|
66
|
+
unless options[:required_headers].nil?
|
67
|
+
puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
|
68
|
+
if options[:required_keys].nil?
|
69
|
+
options[:required_keys] = options[:required_headers]
|
70
|
+
options[:required_headers] = nil
|
71
|
+
end
|
72
|
+
end
|
73
|
+
|
65
74
|
keys = options.keys
|
66
75
|
errors = []
|
67
76
|
errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
|
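The deprecation fallback added to `validate_options!` above copies `required_headers` into `required_keys` only when the latter is unset. A minimal sketch of that pattern follows; unlike the source it returns a copy instead of mutating the options and writes the warning with `warn` rather than `puts`, and `migrate_required_headers` is a hypothetical name.

```ruby
# Hypothetical sketch of a deprecated-option fallback.
def migrate_required_headers(options)
  return options if options[:required_headers].nil?

  warn "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
  migrated = options.dup
  # an explicitly given :required_keys always takes precedence
  if migrated[:required_keys].nil?
    migrated[:required_keys] = migrated[:required_headers]
    migrated[:required_headers] = nil
  end
  migrated
end

p migrate_required_headers({ required_headers: [:email] })
```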
data/lib/smarter_csv/parse.rb
ADDED
@@ -0,0 +1,90 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
protected
|
6
|
+
|
7
|
+
###
|
8
|
+
### Thin wrapper around C-extension
|
9
|
+
###
|
10
|
+
def parse(line, options, header_size = nil)
|
11
|
+
# puts "SmarterCSV.parse OPTIONS: #{options[:acceleration]}" if options[:verbose]
|
12
|
+
|
13
|
+
if options[:acceleration] && has_acceleration?
|
14
|
+
# :nocov:
|
15
|
+
has_quotes = line =~ /#{options[:quote_char]}/
|
16
|
+
elements = parse_csv_line_c(line, options[:col_sep], options[:quote_char], header_size)
|
17
|
+
elements.map!{|x| cleanup_quotes(x, options[:quote_char])} if has_quotes
|
18
|
+
[elements, elements.size]
|
19
|
+
# :nocov:
|
20
|
+
else
|
21
|
+
# puts "WARNING: SmarterCSV is using un-accelerated parsing of lines. Check options[:acceleration]"
|
22
|
+
parse_csv_line_ruby(line, options, header_size)
|
23
|
+
end
|
24
|
+
end
|
25
|
+
|
26
|
+
# ------------------------------------------------------------------
|
27
|
+
# Ruby equivalent of the C-extension for parse_line
|
28
|
+
#
|
29
|
+
# parses a single line: either a CSV header or a body line
|
30
|
+
# - quoting rules compared to RFC-4180 are somewhat relaxed
|
31
|
+
# - we are not assuming that quotes inside a field need to be doubled
|
32
|
+
# - we are not assuming that all fields need to be quoted (0 is even)
|
33
|
+
# - works with multi-char col_sep
|
34
|
+
# - if header_size is given, only up to header_size fields are parsed
|
35
|
+
#
|
36
|
+
# We use header_size for parsing the body lines to make sure we always match the number of headers
|
37
|
+
# in case there are trailing col_sep characters in line
|
38
|
+
#
|
39
|
+
# Our convention is that empty fields are returned as empty strings, not as nil.
|
40
|
+
#
|
41
|
+
#
|
42
|
+
# the purpose of the header_size parameter is to handle a corner case where
|
43
|
+
# CSV lines contain more fields than the header.
|
44
|
+
# In which case the remaining fields in the line are ignored
|
45
|
+
#
|
46
|
+
def parse_csv_line_ruby(line, options, header_size = nil)
|
47
|
+
return [] if line.nil?
|
48
|
+
|
49
|
+
line_size = line.size
|
50
|
+
col_sep = options[:col_sep]
|
51
|
+
col_sep_size = col_sep.size
|
52
|
+
quote = options[:quote_char]
|
53
|
+
quote_count = 0
|
54
|
+
elements = []
|
55
|
+
start = 0
|
56
|
+
i = 0
|
57
|
+
|
58
|
+
previous_char = ''
|
59
|
+
while i < line_size
|
60
|
+
if line[i...i+col_sep_size] == col_sep && quote_count.even?
|
61
|
+
break if !header_size.nil? && elements.size >= header_size
|
62
|
+
|
63
|
+
elements << cleanup_quotes(line[start...i], quote)
|
64
|
+
previous_char = line[i]
|
65
|
+
i += col_sep.size
|
66
|
+
start = i
|
67
|
+
else
|
68
|
+
quote_count += 1 if line[i] == quote && previous_char != '\\'
|
69
|
+
previous_char = line[i]
|
70
|
+
i += 1
|
71
|
+
end
|
72
|
+
end
|
73
|
+
elements << cleanup_quotes(line[start..-1], quote) if header_size.nil? || elements.size < header_size
|
74
|
+
[elements, elements.size]
|
75
|
+
end
|
76
|
+
|
77
|
+
def cleanup_quotes(field, quote)
|
78
|
+
return field if field.nil?
|
79
|
+
|
80
|
+
# return if field !~ /#{quote}/ # this check can probably be eliminated
|
81
|
+
|
82
|
+
if field.start_with?(quote) && field.end_with?(quote)
|
83
|
+
field.delete_prefix!(quote)
|
84
|
+
field.delete_suffix!(quote)
|
85
|
+
end
|
86
|
+
field.gsub!("#{quote}#{quote}", quote)
|
87
|
+
field
|
88
|
+
end
|
89
|
+
end
|
90
|
+
end
|
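The core idea of `parse_csv_line_ruby` above is that a separator only ends a field when an even number of quote characters has been seen so far. The sketch below reduces the method to that idea; it omits the backslash-escape check and the `header_size` limit, and `split_line` and `unquote` are hypothetical names.

```ruby
# Hypothetical sketch: split on col_sep only outside of quoted sections,
# i.e. while the number of quote characters seen so far is even.
def split_line(line, col_sep: ',', quote: '"')
  fields = []
  start = 0
  i = 0
  quote_count = 0
  while i < line.size
    if quote_count.even? && line[i, col_sep.size] == col_sep
      fields << unquote(line[start...i], quote)
      i += col_sep.size # supports multi-character separators
      start = i
    else
      quote_count += 1 if line[i] == quote
      i += 1
    end
  end
  fields << unquote(line[start..-1], quote) # last field, possibly empty
end

# Remove the surrounding quotes and collapse doubled quotes inside the field.
def unquote(field, quote)
  wrapped = field.size >= 2 * quote.size && field.start_with?(quote) && field.end_with?(quote)
  field = field[quote.size...-quote.size] if wrapped
  field.gsub(quote * 2, quote)
end

p split_line('a,"b,c",d')
p split_line('1,"say ""hi""",')
```

Empty fields come back as empty strings rather than `nil`, matching the convention stated in the source comments.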