smarter_csv 1.9.2 → 1.10.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +23 -0
- data/README.md +29 -8
- data/lib/smarter_csv/auto_detection.rb +73 -0
- data/lib/smarter_csv/file_io.rb +50 -0
- data/lib/smarter_csv/hash_transformations.rb +91 -0
- data/lib/smarter_csv/header_transformations.rb +63 -0
- data/lib/smarter_csv/header_validations.rb +34 -0
- data/lib/smarter_csv/headers.rb +68 -0
- data/lib/smarter_csv/options_processing.rb +10 -1
- data/lib/smarter_csv/parse.rb +90 -0
- data/lib/smarter_csv/smarter_csv.rb +79 -416
- data/lib/smarter_csv/variables.rb +30 -0
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +16 -3
- metadata +11 -4
- data/lib/core_ext/hash.rb +0 -9
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: f1d0b58acf0135b621e3182470674230ef73b48c829810e74fffa975fc318cf5
|
4
|
+
data.tar.gz: ee404c5c485748d35cda36b8d249cb6813a3f80005182fe8c05feac1694aba57
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 4fee097fe2237f863510100155062da6815237260da5b15189f104f54596f7d5ff0479deb80596544e0bb1b9ba7b78126d2251798721e8d2f91e06b430950cd6
|
7
|
+
data.tar.gz: c30562965452ef296b5e5aaf2a9a12887aa42d8e8396780b73b34f99a2386d232bf020578618fcbd65186fc864518c81a3e7555cae9b00a005322f3599e18c5a
|
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,29 @@
|
|
1
1
|
|
2
2
|
# SmarterCSV 1.x Change Log
|
3
3
|
|
4
|
+
## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡
|
5
|
+
|
6
|
+
* BREAKING CHANGES:
|
7
|
+
|
8
|
+
Changed behavior:
|
9
|
+
+ when `user_provided_headers` are provided:
|
10
|
+
* if they are not unique, an exception will now be raised
|
11
|
+
* they are taken "as is", no header transformations can be applied
|
12
|
+
* when they are given as strings or as symbols, it is assumed that this is the desired format
|
13
|
+
* the value of the `strings_as_keys` option will be ignored
|
14
|
+
|
15
|
+
+ option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
|
16
|
+
* this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
|
17
|
+
* explicitly set this option to `nil` to get the behavior from previous versions.
|
18
|
+
|
19
|
+
* performance and memory improvements
|
20
|
+
* code refactor
|
21
|
+
|
22
|
+
## 1.9.3 (2023-12-16)
|
23
|
+
* raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
|
24
|
+
* code refactor / no functional changes
|
25
|
+
* added test cases
|
26
|
+
|
4
27
|
## 1.9.2 (2023-11-12)
|
5
28
|
* fixed bug with '\\' at end of line (issue #252, thanks to averycrespi-moz)
|
6
29
|
* fixed require statements (issue #249, thanks to PikachuEXE, courtsimas)
|
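The `duplicate_header_suffix` change described in the 1.10.0 entry above can be illustrated with a minimal standalone sketch. The helper name `disambiguate` is hypothetical and not part of the gem's public API; it only mirrors the counting approach visible in `disambiguate_headers` further down in this diff.

```ruby
# Hypothetical helper: appends a running number to repeated headers,
# separated by the given suffix. A nil suffix leaves the headers untouched
# (per the README, the gem then raises DuplicateHeaders during validation).
def disambiguate(headers, suffix)
  return headers if suffix.nil?

  counts = Hash.new(0)
  headers.map do |header|
    counts[header] += 1
    counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
  end
end

p disambiguate(%w[name email name name], '')  # new default suffix ''
p disambiguate(%w[name email name], '_')      # custom suffix
```

With the new default of `''`, the second and third `name` columns become `name2` and `name3`; with `'_'` the second one becomes `name_2`.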
data/README.md
CHANGED
@@ -2,15 +2,33 @@
|
|
2
2
|
# SmarterCSV
|
3
3
|
|
4
4
|
[![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
|
5
|
-
|
5
|
+
|
6
|
+
|
7
|
+
#### LATEST CHANGES
|
8
|
+
|
9
|
+
* Version 1.10.0 has BREAKING CHANGES:
|
10
|
+
|
11
|
+
Changed behavior:
|
12
|
+
+ when `user_provided_headers` are provided:
|
13
|
+
* if they are not unique, an exception will now be raised
|
14
|
+
* they are taken "as is", no header transformations can be applied
|
15
|
+
* when they are given as strings or as symbols, it is assumed that this is the desired format
|
16
|
+
* the value of the `strings_as_keys` option will be ignored
|
17
|
+
|
18
|
+
+ option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
|
19
|
+
* this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
|
20
|
+
* explicitly set this option to `nil` to get the behavior from previous versions.
|
21
|
+
|
6
22
|
#### Development Branches
|
7
23
|
|
8
24
|
* default branch is `main` for 1.x development
|
9
|
-
|
25
|
+
|
26
|
+
* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
|
27
|
+
- This is an EXPERIMENTAL branch - DO NOT USE in production
|
10
28
|
|
11
|
-
#### Work towards Future Version 2.
|
29
|
+
#### Work towards Future Version 2.x
|
12
30
|
|
13
|
-
* Work towards SmarterCSV 2.
|
31
|
+
* Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
|
14
32
|
Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
|
15
33
|
|
16
34
|
---------------
|
@@ -84,6 +102,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
|
|
84
102
|
00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
|
85
103
|
```
|
86
104
|
|
105
|
+
### Articles
|
106
|
+
* [Processing 1.4 Million CSV Records in Ruby, fast](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
|
107
|
+
* [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
|
108
|
+
|
87
109
|
### Examples
|
88
110
|
|
89
111
|
Here are some examples to demonstrate the versatility of SmarterCSV.
|
@@ -243,8 +265,6 @@ NOTE: If you use `key_mappings` and `value_converters`, make sure that the value
|
|
243
265
|
data[0][:price].class
|
244
266
|
=> Float
|
245
267
|
```
|
246
|
-
## Parallel Processing
|
247
|
-
[Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
|
248
268
|
|
249
269
|
## Documentation
|
250
270
|
|
@@ -280,7 +300,8 @@ The options and the block are optional.
|
|
280
300
|
| :headers_in_file | true | Whether or not the file contains headers as the first line. |
|
281
301
|
| | | Important if the file does not contain headers, |
|
282
302
|
| | | otherwise you would lose the first line of data. |
|
283
|
-
| :duplicate_header_suffix |
|
303
|
+
| :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
|
304
|
+
| | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
|
284
305
|
| :user_provided_headers | nil | *careful with that axe!* |
|
285
306
|
| | | user provided Array of header strings or symbols, to define |
|
286
307
|
| | | what headers should be used, overriding any in-file headers. |
|
@@ -300,7 +321,7 @@ And header and data validations will also be supported in 2.x
|
|
300
321
|
| Option | Default | Explanation |
|
301
322
|
---------------------------------------------------------------------------------------------------------------------------------
|
302
323
|
| :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
|
303
|
-
| :
|
324
|
+
| :silence_missing_keys | false | ignore missing keys in `key_mapping` |
|
304
325
|
| | | if set to true: makes all mapped keys optional |
|
305
326
|
| | | if given an array, makes only the keys listed in it optional |
|
306
327
|
| :required_keys | nil | An array. Specify the required names AFTER header transformation. |
|
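The `:silence_missing_keys` rows in the README table above describe three modes: `false`, `true`, and an Array of optional keys. A minimal sketch of that rule follows; `unmapped_keys` is a hypothetical helper used only for illustration, and the sample mapping is made up.

```ruby
# Hypothetical helper: returns the keys of key_mapping that are absent from
# the headers and are not silenced.
def unmapped_keys(key_mapping, headers, silence_missing_keys)
  return [] if silence_missing_keys == true # every mapped key is optional

  missing = key_mapping.keys - headers
  missing -= silence_missing_keys if silence_missing_keys.is_a?(Array)
  missing
end

mapping = { first: :given_name, last: :family_name, age: :years }
headers = [:first]

p unmapped_keys(mapping, headers, false)   # => [:last, :age]
p unmapped_keys(mapping, headers, [:age])  # => [:last]
p unmapped_keys(mapping, headers, true)    # => []
```

A non-empty result corresponds to the situation in which the gem raises `KeyMappingError`.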
data/lib/smarter_csv/auto_detection.rb
ADDED
@@ -0,0 +1,73 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
protected
|
6
|
+
|
7
|
+
# If file has headers, then guesses column separator from headers.
|
8
|
+
# Otherwise guesses column separator from contents.
|
9
|
+
# Raises exception if none is found.
|
10
|
+
def guess_column_separator(filehandle, options)
|
11
|
+
skip_lines(filehandle, options)
|
12
|
+
|
13
|
+
delimiters = [',', "\t", ';', ':', '|']
|
14
|
+
|
15
|
+
line = nil
|
16
|
+
has_header = options[:headers_in_file]
|
17
|
+
candidates = Hash.new(0)
|
18
|
+
count = has_header ? 1 : 5
|
19
|
+
count.times do
|
20
|
+
line = readline_with_counts(filehandle, options)
|
21
|
+
delimiters.each do |d|
|
22
|
+
candidates[d] += line.scan(d).count
|
23
|
+
end
|
24
|
+
rescue EOFError # short files
|
25
|
+
break
|
26
|
+
end
|
27
|
+
rewind(filehandle)
|
28
|
+
|
29
|
+
if candidates.values.max == 0
|
30
|
+
# if the header only contains a single column name (word characters only), assume ','
|
31
|
+
return ',' if line.chomp(options[:row_sep]) =~ /^\w+$/
|
32
|
+
|
33
|
+
raise SmarterCSV::NoColSepDetected
|
34
|
+
end
|
35
|
+
|
36
|
+
candidates.key(candidates.values.max)
|
37
|
+
end
|
38
|
+
|
39
|
+
# limitation: unless options[:auto_row_sep_chars] is set, this reads the whole file in before making a decision
|
40
|
+
def guess_line_ending(filehandle, options)
|
41
|
+
counts = {"\n" => 0, "\r" => 0, "\r\n" => 0}
|
42
|
+
quoted_char = false
|
43
|
+
|
44
|
+
# count how many of the pre-defined line-endings we find
|
45
|
+
# ignoring those contained within quote characters
|
46
|
+
last_char = nil
|
47
|
+
lines = 0
|
48
|
+
filehandle.each_char do |c|
|
49
|
+
quoted_char = !quoted_char if c == options[:quote_char]
|
50
|
+
next if quoted_char
|
51
|
+
|
52
|
+
if last_char == "\r"
|
53
|
+
if c == "\n"
|
54
|
+
counts["\r\n"] += 1
|
55
|
+
else
|
56
|
+
counts["\r"] += 1 # \r are counted after they appeared
|
57
|
+
end
|
58
|
+
elsif c == "\n"
|
59
|
+
counts["\n"] += 1
|
60
|
+
end
|
61
|
+
last_char = c
|
62
|
+
lines += 1
|
63
|
+
break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
|
64
|
+
end
|
65
|
+
rewind(filehandle)
|
66
|
+
|
67
|
+
counts["\r"] += 1 if last_char == "\r"
|
68
|
+
# find the most frequent key/value pair:
|
69
|
+
most_frequent_key, _count = counts.max_by{|_, v| v}
|
70
|
+
most_frequent_key
|
71
|
+
end
|
72
|
+
end
|
73
|
+
end
|
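The separator detection in `guess_column_separator` above boils down to counting candidate delimiters and picking the most frequent one. Here is a standalone sketch of that idea, assuming the lines have already been read into an Array. The name `guess_col_sep` and the use of `ArgumentError` are assumptions for illustration; the gem itself raises `SmarterCSV::NoColSepDetected`.

```ruby
DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Hypothetical helper: counts each candidate delimiter over the sample lines
# and returns the most frequent one.
def guess_col_sep(lines)
  counts = Hash.new(0)
  lines.each do |line|
    DELIMITERS.each { |d| counts[d] += line.scan(d).count }
  end

  best, hits = counts.max_by { |_, n| n }
  if hits.nil? || hits.zero?
    # a lone word on the first line is treated as a single-column file
    return ',' if lines.first.to_s.chomp =~ /\A\w+\z/

    raise ArgumentError, 'no column separator detected'
  end
  best
end

p guess_col_sep(["a;b;c\n", "1;2;3\n"])  # => ";"
p guess_col_sep(["id\n"])                # => ","
```

Counting raw occurrences ignores quoting, so a quoted field full of commas can skew the result; that limitation applies to this sketch as well.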
data/lib/smarter_csv/file_io.rb
ADDED
@@ -0,0 +1,50 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
protected
|
6
|
+
|
7
|
+
def readline_with_counts(filehandle, options)
|
8
|
+
line = filehandle.readline(options[:row_sep])
|
9
|
+
@file_line_count += 1
|
10
|
+
@csv_line_count += 1
|
11
|
+
line = remove_bom(line) if @csv_line_count == 1
|
12
|
+
line
|
13
|
+
end
|
14
|
+
|
15
|
+
def skip_lines(filehandle, options)
|
16
|
+
options[:skip_lines].to_i.times do
|
17
|
+
readline_with_counts(filehandle, options)
|
18
|
+
end
|
19
|
+
end
|
20
|
+
|
21
|
+
def rewind(filehandle)
|
22
|
+
@file_line_count = 0
|
23
|
+
@csv_line_count = 0
|
24
|
+
filehandle.rewind
|
25
|
+
end
|
26
|
+
|
27
|
+
private
|
28
|
+
|
29
|
+
UTF_32_BOM = %w[0 0 fe ff].freeze
|
30
|
+
UTF_32LE_BOM = %w[ff fe 0 0].freeze
|
31
|
+
UTF_8_BOM = %w[ef bb bf].freeze
|
32
|
+
UTF_16_BOM = %w[fe ff].freeze
|
33
|
+
UTF_16LE_BOM = %w[ff fe].freeze
|
34
|
+
|
35
|
+
def remove_bom(str)
|
36
|
+
str_as_hex = str.bytes.map{|x| x.to_s(16)}
|
37
|
+
# if string does not start with one of the bytes, there is no BOM
|
38
|
+
return str unless %w[ef fe ff 0].include?(str_as_hex[0])
|
39
|
+
|
40
|
+
return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
|
41
|
+
return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
|
42
|
+
return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
|
43
|
+
|
44
|
+
# :nocov:
|
45
|
+
puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
|
46
|
+
str
|
47
|
+
# :nocov:
|
48
|
+
end
|
49
|
+
end
|
50
|
+
end
|
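The BOM handling in `remove_bom` above can be sketched on its own. This version compares integer byte values rather than hex strings and uses a hypothetical `strip_bom` helper, so it is an illustration of the technique, not the gem's code. The candidates are ordered longest first so that the 4-byte UTF-32LE mark is not mistaken for the 2-byte UTF-16LE mark.

```ruby
BOMS = [
  [0x00, 0x00, 0xFE, 0xFF], # UTF-32 (big endian)
  [0xFF, 0xFE, 0x00, 0x00], # UTF-32LE
  [0xEF, 0xBB, 0xBF],       # UTF-8
  [0xFE, 0xFF],             # UTF-16 (big endian)
  [0xFF, 0xFE]              # UTF-16LE
].freeze

# Hypothetical helper: drops a leading byte-order mark, if one is present.
def strip_bom(str)
  bytes = str.bytes
  bom = BOMS.find { |candidate| bytes.first(candidate.size) == candidate }
  bom ? str.byteslice(bom.size..-1) : str
end

p strip_bom("\uFEFFname,age\n")  # => "name,age\n"
p strip_bom("name,age\n")        # unchanged
```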
data/lib/smarter_csv/hash_transformations.rb
ADDED
@@ -0,0 +1,91 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
def hash_transformations(hash, options)
|
6
|
+
# there may be unmapped keys, or keys purposely mapped to nil or an empty key..
|
7
|
+
# make sure we delete any key/value pairs from the hash, which the user wanted to delete:
|
8
|
+
remove_empty_values = options[:remove_empty_values] == true
|
9
|
+
remove_zero_values = options[:remove_zero_values]
|
10
|
+
remove_values_matching = options[:remove_values_matching]
|
11
|
+
convert_to_numeric = options[:convert_values_to_numeric]
|
12
|
+
value_converters = options[:value_converters]
|
13
|
+
|
14
|
+
hash.each_with_object({}) do |(k, v), new_hash|
|
15
|
+
next if k.nil? || k == '' || k == :""
|
16
|
+
next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
|
17
|
+
next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
|
18
|
+
next if remove_values_matching && v =~ remove_values_matching
|
19
|
+
|
20
|
+
# deal with the :only / :except options to :convert_values_to_numeric
|
21
|
+
if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
|
22
|
+
if v =~ /^[+-]?\d+\.\d+$/
|
23
|
+
v = v.to_f
|
24
|
+
elsif v =~ /^[+-]?\d+$/
|
25
|
+
v = v.to_i
|
26
|
+
end
|
27
|
+
end
|
28
|
+
|
29
|
+
converter = value_converters[k] if value_converters
|
30
|
+
v = converter.convert(v) if converter
|
31
|
+
|
32
|
+
new_hash[k] = v
|
33
|
+
end
|
34
|
+
end
|
35
|
+
|
36
|
+
# def hash_transformations(hash, options)
|
37
|
+
# # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
|
38
|
+
# # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
|
39
|
+
# hash.delete(nil)
|
40
|
+
# hash.delete('')
|
41
|
+
# hash.delete(:"")
|
42
|
+
|
43
|
+
# if options[:remove_empty_values] == true
|
44
|
+
# hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
|
45
|
+
# end
|
46
|
+
|
47
|
+
# hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
|
48
|
+
# hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
|
49
|
+
|
50
|
+
# if options[:convert_values_to_numeric]
|
51
|
+
# hash.each do |k, v|
|
52
|
+
# # deal with the :only / :except options to :convert_values_to_numeric
|
53
|
+
# next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
|
54
|
+
|
55
|
+
# # convert if it's a numeric value:
|
56
|
+
# case v
|
57
|
+
# when /^[+-]?\d+\.\d+$/
|
58
|
+
# hash[k] = v.to_f
|
59
|
+
# when /^[+-]?\d+$/
|
60
|
+
# hash[k] = v.to_i
|
61
|
+
# end
|
62
|
+
# end
|
63
|
+
# end
|
64
|
+
|
65
|
+
# if options[:value_converters]
|
66
|
+
# hash.each do |k, v|
|
67
|
+
# converter = options[:value_converters][k]
|
68
|
+
# next unless converter
|
69
|
+
|
70
|
+
# hash[k] = converter.convert(v)
|
71
|
+
# end
|
72
|
+
# end
|
73
|
+
|
74
|
+
# hash
|
75
|
+
# end
|
76
|
+
|
77
|
+
protected
|
78
|
+
|
79
|
+
# acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
|
80
|
+
def limit_execution_for_only_or_except(options, option_name, key)
|
81
|
+
if options[option_name].is_a?(Hash)
|
82
|
+
if options[option_name].has_key?(:except)
|
83
|
+
return true if Array(options[option_name][:except]).include?(key)
|
84
|
+
elsif options[option_name].has_key?(:only)
|
85
|
+
return true unless Array(options[option_name][:only]).include?(key)
|
86
|
+
end
|
87
|
+
end
|
88
|
+
false
|
89
|
+
end
|
90
|
+
end
|
91
|
+
end
|
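The numeric conversion with its `:only` / `:except` guard, as it appears in `hash_transformations` above, can be tried in isolation. The helper names below (`skip_key?`, `to_numeric`, `convert_row`) are hypothetical, and the `option` argument stands in for the value of `:convert_values_to_numeric`.

```ruby
# Returns true when the key is excluded by an :except or :only restriction.
def skip_key?(option, key)
  return false unless option.is_a?(Hash)
  return Array(option[:except]).include?(key) if option.key?(:except)
  return !Array(option[:only]).include?(key) if option.key?(:only)

  false
end

# Converts strings that look like decimals or integers; leaves the rest alone.
def to_numeric(value)
  case value
  when /\A[+-]?\d+\.\d+\z/ then value.to_f
  when /\A[+-]?\d+\z/      then value.to_i
  else value
  end
end

def convert_row(row, option)
  row.each_with_object({}) do |(key, value), out|
    out[key] = (option && !skip_key?(option, key)) ? to_numeric(value) : value
  end
end

row = { id: '42', price: '9.50', zip: '02134' }
p convert_row(row, true)              # zip becomes the Integer 2134
p convert_row(row, { except: :zip })  # zip stays the String "02134"
```

The ZIP code case shows why the restriction exists: blanket conversion silently drops leading zeros.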
data/lib/smarter_csv/header_transformations.rb
ADDED
@@ -0,0 +1,63 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
# transform the headers that were in the file:
|
6
|
+
def header_transformations(header_array, options)
|
7
|
+
header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
|
8
|
+
header_array.map!{|x| x.strip} if options[:strip_whitespace]
|
9
|
+
|
10
|
+
unless options[:keep_original_headers]
|
11
|
+
header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
|
12
|
+
header_array.map!{|x| x.downcase} if options[:downcase_header]
|
13
|
+
end
|
14
|
+
|
15
|
+
# detect duplicate headers and disambiguate
|
16
|
+
header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
|
17
|
+
# symbolize headers
|
18
|
+
header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
|
19
|
+
# doesn't make sense to re-map when we have user_provided_headers
|
20
|
+
header_array = remap_headers(header_array, options) if options[:key_mapping]
|
21
|
+
|
22
|
+
header_array
|
23
|
+
end
|
24
|
+
|
25
|
+
def disambiguate_headers(headers, options)
|
26
|
+
counts = Hash.new(0)
|
27
|
+
headers.map do |header|
|
28
|
+
counts[header] += 1
|
29
|
+
counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
|
30
|
+
end
|
31
|
+
end
|
32
|
+
|
33
|
+
# do some key mapping on the keys in the file header
|
34
|
+
# if you want to completely delete a key, then map it to nil or to ''
|
35
|
+
def remap_headers(headers, options)
|
36
|
+
key_mapping = options[:key_mapping]
|
37
|
+
if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
|
38
|
+
raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
|
39
|
+
end
|
40
|
+
|
41
|
+
key_mapping = options[:key_mapping]
|
42
|
+
# if silence_missing_keys are not set, raise error if missing header
|
43
|
+
missing_keys = key_mapping.keys - headers
|
44
|
+
# if the user passes a list of specific mapped keys that are optional
|
45
|
+
missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
|
46
|
+
|
47
|
+
unless missing_keys.empty? || options[:silence_missing_keys] == true
|
48
|
+
raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
|
49
|
+
end
|
50
|
+
|
51
|
+
headers.map! do |header|
|
52
|
+
if key_mapping.has_key?(header)
|
53
|
+
key_mapping[header].nil? ? nil : key_mapping[header]
|
54
|
+
elsif options[:remove_unmapped_keys]
|
55
|
+
nil
|
56
|
+
else
|
57
|
+
header
|
58
|
+
end
|
59
|
+
end
|
60
|
+
headers
|
61
|
+
end
|
62
|
+
end
|
63
|
+
end
|
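The mapping step at the end of `remap_headers` above is easy to reason about with a small standalone sketch. `remap` is a hypothetical helper; the keyword argument stands in for the `:remove_unmapped_keys` option, and the sample headers are made up.

```ruby
# Hypothetical helper: applies a from -> to mapping to a header list.
# A header mapped to nil, or an unmapped header when remove_unmapped_keys
# is set, becomes nil and is thereby marked for removal.
def remap(headers, key_mapping, remove_unmapped_keys: false)
  headers.map do |header|
    if key_mapping.key?(header)
      key_mapping[header]
    elsif remove_unmapped_keys
      nil
    else
      header
    end
  end
end

headers = %i[first_name last_name internal_id]
mapping = { first_name: :first, internal_id: nil }

p remap(headers, mapping)                              # => [:first, :last_name, nil]
p remap(headers, mapping, remove_unmapped_keys: true)  # => [:first, nil, nil]
```

The `nil` entries keep their position so that values still line up with columns; in this diff, `hash_transformations` is the place where pairs with a `nil` or empty key are skipped.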
data/lib/smarter_csv/header_validations.rb
ADDED
@@ -0,0 +1,34 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
def header_validations(headers, options)
|
6
|
+
check_duplicate_headers(headers, options)
|
7
|
+
check_required_headers(headers, options)
|
8
|
+
end
|
9
|
+
|
10
|
+
def check_duplicate_headers(headers, _options)
|
11
|
+
header_counts = Hash.new(0)
|
12
|
+
headers.each { |header| header_counts[header] += 1 unless header.nil? }
|
13
|
+
|
14
|
+
duplicates = header_counts.select { |_, count| count > 1 }
|
15
|
+
|
16
|
+
unless duplicates.empty?
|
17
|
+
raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
|
18
|
+
end
|
19
|
+
end
|
20
|
+
|
21
|
+
require 'set'
|
22
|
+
|
23
|
+
def check_required_headers(headers, options)
|
24
|
+
if options[:required_keys] && options[:required_keys].is_a?(Array)
|
25
|
+
headers_set = headers.to_set
|
26
|
+
missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }
|
27
|
+
|
28
|
+
unless missing_keys.empty?
|
29
|
+
raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
|
30
|
+
end
|
31
|
+
end
|
32
|
+
end
|
33
|
+
end
|
34
|
+
end
|
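The required-keys check in `check_required_headers` above uses a `Set` for membership tests. A minimal sketch of the same idea follows, with a hypothetical `missing_required` helper that returns the missing keys instead of raising.

```ruby
require 'set'

# Hypothetical helper: lists the required keys that are not among the headers.
def missing_required(headers, required_keys)
  return [] unless required_keys.is_a?(Array)

  present = headers.to_set
  required_keys.reject { |key| present.include?(key) }
end

p missing_required(%i[name email], %i[name phone])  # => [:phone]
p missing_required(%w[name], %i[name])              # => [:name]
```

The second call shows why the README asks for the names after header transformation: the String `"name"` and the Symbol `:name` are different keys.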
data/lib/smarter_csv/headers.rb
ADDED
@@ -0,0 +1,68 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
def process_headers(filehandle, options)
|
6
|
+
@raw_header = nil # header as it appears in the file
|
7
|
+
@headers = nil # the processed headers
|
8
|
+
header_array = []
|
9
|
+
file_header_size = nil
|
10
|
+
|
11
|
+
# if headers_in_file, get the headers -> We get the number of columns, even when user provided headers
|
12
|
+
if options[:headers_in_file] # extract the header line
|
13
|
+
# process the header line in the CSV file..
|
14
|
+
# the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
|
15
|
+
header_line = @raw_header = readline_with_counts(filehandle, options)
|
16
|
+
header_line = preprocess_header_line(header_line, options)
|
17
|
+
|
18
|
+
file_header_array, file_header_size = parse(header_line, options)
|
19
|
+
|
20
|
+
file_header_array = header_transformations(file_header_array, options)
|
21
|
+
|
22
|
+
else
|
23
|
+
unless options[:user_provided_headers]
|
24
|
+
raise SmarterCSV::IncorrectOption, "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers"
|
25
|
+
end
|
26
|
+
end
|
27
|
+
|
28
|
+
if options[:user_provided_headers]
|
29
|
+
unless options[:user_provided_headers].is_a?(Array) && !options[:user_provided_headers].empty?
|
30
|
+
raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for user_provided_headers! Expecting array with headers.")
|
31
|
+
end
|
32
|
+
|
33
|
+
# use user-provided headers
|
34
|
+
user_header_array = options[:user_provided_headers]
|
35
|
+
# user_provided_headers: their count should match the headers_in_file if any
|
36
|
+
if defined?(file_header_size) && !file_header_size.nil?
|
37
|
+
if user_header_array.size != file_header_size
|
38
|
+
raise SmarterCSV::HeaderSizeMismatch, "ERROR: :user_provided_headers defines #{user_header_array.size} headers != CSV-file has #{file_header_size} headers"
|
39
|
+
else
|
40
|
+
# we could print out the mapping of file_header_array to header_array here
|
41
|
+
end
|
42
|
+
end
|
43
|
+
|
44
|
+
header_array = user_header_array
|
45
|
+
else
|
46
|
+
header_array = file_header_array
|
47
|
+
end
|
48
|
+
|
49
|
+
[header_array, header_array.size]
|
50
|
+
end
|
51
|
+
|
52
|
+
private
|
53
|
+
|
54
|
+
def preprocess_header_line(header_line, options)
|
55
|
+
header_line = enforce_utf8_encoding(header_line, options)
|
56
|
+
header_line = remove_comments_from_header(header_line, options)
|
57
|
+
header_line = header_line.chomp(options[:row_sep])
|
58
|
+
header_line.gsub!(options[:strip_chars_from_headers], '') if options[:strip_chars_from_headers]
|
59
|
+
header_line
|
60
|
+
end
|
61
|
+
|
62
|
+
def remove_comments_from_header(header, options)
|
63
|
+
return header unless options[:comment_regexp]
|
64
|
+
|
65
|
+
header.sub(options[:comment_regexp], '')
|
66
|
+
end
|
67
|
+
end
|
68
|
+
end
|
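The choice between file headers and user-provided headers in `process_headers` above follows a simple precedence rule. The sketch below restates it with a hypothetical `choose_headers` helper; it raises `ArgumentError` where the gem raises `IncorrectOption` or `HeaderSizeMismatch`.

```ruby
# Hypothetical helper: user-provided headers win, but must be a non-empty
# Array and, if the file had a header line, match its column count.
def choose_headers(file_headers, user_headers)
  return file_headers if user_headers.nil?

  unless user_headers.is_a?(Array) && !user_headers.empty?
    raise ArgumentError, 'user_provided_headers must be a non-empty Array'
  end
  if file_headers && user_headers.size != file_headers.size
    raise ArgumentError, "#{user_headers.size} user headers != #{file_headers.size} file headers"
  end

  user_headers
end

p choose_headers(%i[a b], nil)       # => [:a, :b]
p choose_headers(%i[a b], %i[x y])   # => [:x, :y]
```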
data/lib/smarter_csv/options_processing.rb
CHANGED
@@ -9,7 +9,7 @@ module SmarterCSV
|
|
9
9
|
comment_regexp: nil, # was: /\A#/,
|
10
10
|
convert_values_to_numeric: true,
|
11
11
|
downcase_header: true,
|
12
|
-
duplicate_header_suffix: nil,
|
12
|
+
duplicate_header_suffix: '', # was: nil,
|
13
13
|
file_encoding: 'utf-8',
|
14
14
|
force_simple_split: false,
|
15
15
|
force_utf8: false,
|
@@ -62,6 +62,15 @@ module SmarterCSV
|
|
62
62
|
private
|
63
63
|
|
64
64
|
def validate_options!(options)
|
65
|
+
# deprecate required_headers
|
66
|
+
unless options[:required_headers].nil?
|
67
|
+
puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
|
68
|
+
if options[:required_keys].nil?
|
69
|
+
options[:required_keys] = options[:required_headers]
|
70
|
+
options[:required_headers] = nil
|
71
|
+
end
|
72
|
+
end
|
73
|
+
|
65
74
|
keys = options.keys
|
66
75
|
errors = []
|
67
76
|
errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
|
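The `required_headers` deprecation added to `validate_options!` above can be sketched as a small standalone function. Unlike the gem's code, this hypothetical `migrate_required_headers` works on a copy of the options and writes the warning to standard error via `warn`; both choices are assumptions made for the example.

```ruby
# Hypothetical helper: moves a deprecated :required_headers value over to
# :required_keys, unless :required_keys was given explicitly.
def migrate_required_headers(options)
  return options if options[:required_headers].nil?

  warn "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
  migrated = options.dup
  if migrated[:required_keys].nil?
    migrated[:required_keys] = migrated[:required_headers]
    migrated[:required_headers] = nil
  end
  migrated
end

p migrate_required_headers({ required_headers: [:name] })
```

When both options are supplied, `:required_keys` takes precedence and is left untouched.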
data/lib/smarter_csv/parse.rb
ADDED
@@ -0,0 +1,90 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
protected
|
6
|
+
|
7
|
+
###
|
8
|
+
### Thin wrapper around C-extension
|
9
|
+
###
|
10
|
+
def parse(line, options, header_size = nil)
|
11
|
+
# puts "SmarterCSV.parse OPTIONS: #{options[:acceleration]}" if options[:verbose]
|
12
|
+
|
13
|
+
if options[:acceleration] && has_acceleration?
|
14
|
+
# :nocov:
|
15
|
+
has_quotes = line =~ /#{options[:quote_char]}/
|
16
|
+
elements = parse_csv_line_c(line, options[:col_sep], options[:quote_char], header_size)
|
17
|
+
elements.map!{|x| cleanup_quotes(x, options[:quote_char])} if has_quotes
|
18
|
+
[elements, elements.size]
|
19
|
+
# :nocov:
|
20
|
+
else
|
21
|
+
# puts "WARNING: SmarterCSV is using un-accelerated parsing of lines. Check options[:acceleration]"
|
22
|
+
parse_csv_line_ruby(line, options, header_size)
|
23
|
+
end
|
24
|
+
end
|
25
|
+
|
26
|
+
# ------------------------------------------------------------------
|
27
|
+
# Ruby equivalent of the C-extension for parse_line
|
28
|
+
#
|
29
|
+
# parses a single line: either a CSV header or a body line
|
30
|
+
# - quoting rules compared to RFC-4180 are somewhat relaxed
|
31
|
+
# - we are not assuming that quotes inside a field need to be doubled
|
32
|
+
# - we are not assuming that all fields need to be quoted (0 is even)
|
33
|
+
# - works with multi-char col_sep
|
34
|
+
# - if header_size is given, only up to header_size fields are parsed
|
35
|
+
#
|
36
|
+
# We use header_size for parsing the body lines to make sure we always match the number of headers
|
37
|
+
# in case there are trailing col_sep characters in line
|
38
|
+
#
|
39
|
+
# Our convention is that empty fields are returned as empty strings, not as nil.
|
40
|
+
#
|
41
|
+
#
|
42
|
+
# the purpose of the header_size parameter is to handle a corner case where
|
43
|
+
# CSV lines contain more fields than the header.
|
44
|
+
# In which case the remaining fields in the line are ignored
|
45
|
+
#
|
46
|
+
def parse_csv_line_ruby(line, options, header_size = nil)
|
47
|
+
return [] if line.nil?
|
48
|
+
|
49
|
+
line_size = line.size
|
50
|
+
col_sep = options[:col_sep]
|
51
|
+
col_sep_size = col_sep.size
|
52
|
+
quote = options[:quote_char]
|
53
|
+
quote_count = 0
|
54
|
+
elements = []
|
55
|
+
start = 0
|
56
|
+
i = 0
|
57
|
+
|
58
|
+
previous_char = ''
|
59
|
+
while i < line_size
|
60
|
+
if line[i...i+col_sep_size] == col_sep && quote_count.even?
|
61
|
+
break if !header_size.nil? && elements.size >= header_size
|
62
|
+
|
63
|
+
elements << cleanup_quotes(line[start...i], quote)
|
64
|
+
previous_char = line[i]
|
65
|
+
i += col_sep.size
|
66
|
+
start = i
|
67
|
+
else
|
68
|
+
quote_count += 1 if line[i] == quote && previous_char != '\\'
|
69
|
+
previous_char = line[i]
|
70
|
+
i += 1
|
71
|
+
end
|
72
|
+
end
|
73
|
+
elements << cleanup_quotes(line[start..-1], quote) if header_size.nil? || elements.size < header_size
|
74
|
+
[elements, elements.size]
|
75
|
+
end
|
76
|
+
|
77
|
+
def cleanup_quotes(field, quote)
|
78
|
+
return field if field.nil?
|
79
|
+
|
80
|
+
# return if field !~ /#{quote}/ # this check can probably be eliminated
|
81
|
+
|
82
|
+
if field.start_with?(quote) && field.end_with?(quote)
|
83
|
+
field.delete_prefix!(quote)
|
84
|
+
field.delete_suffix!(quote)
|
85
|
+
end
|
86
|
+
field.gsub!("#{quote}#{quote}", quote)
|
87
|
+
field
|
88
|
+
end
|
89
|
+
end
|
90
|
+
end
|
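The quote-aware splitting in `parse_csv_line_ruby` above relies on one observation: a separator only counts when the number of quote characters seen so far is even. The following simplified sketch shows that technique with hypothetical helpers `split_line` and `unquote`. It leaves out the backslash-escape handling of the original and returns only the fields, not the `[elements, size]` pair.

```ruby
# Strips one pair of enclosing quotes and collapses doubled quotes.
def unquote(field, quote)
  if field.size >= 2 * quote.size && field.start_with?(quote) && field.end_with?(quote)
    field = field[quote.size...-quote.size]
  end
  field.gsub(quote + quote, quote)
end

# Splits a line on col_sep, ignoring separators inside quoted sections.
# max_fields caps the number of fields, like header_size in the original.
def split_line(line, col_sep: ',', quote: '"', max_fields: nil)
  fields = []
  start = 0
  i = 0
  quotes_seen = 0
  while i < line.size
    if quotes_seen.even? && line[i, col_sep.size] == col_sep
      break if max_fields && fields.size >= max_fields

      fields << unquote(line[start...i], quote)
      i += col_sep.size
      start = i
    else
      quotes_seen += 1 if line[i] == quote
      i += 1
    end
  end
  fields << unquote(line[start..-1], quote) if max_fields.nil? || fields.size < max_fields
  fields
end

p split_line('a,"b,c",d')               # => ["a", "b,c", "d"]
p split_line('1,2,3,4', max_fields: 2)  # => ["1", "2"]
p split_line('a||b', col_sep: '||')     # => ["a", "b"]
```

Empty fields come back as empty strings rather than `nil`, matching the convention stated in the comments of the original method.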