smarter_csv 1.9.3 → 1.10.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +18 -0
- data/README.md +28 -7
- data/lib/smarter_csv/hash_transformations.rb +91 -0
- data/lib/smarter_csv/header_transformations.rb +63 -0
- data/lib/smarter_csv/header_validations.rb +34 -0
- data/lib/smarter_csv/headers.rb +6 -98
- data/lib/smarter_csv/options_processing.rb +10 -1
- data/lib/smarter_csv/smarter_csv.rb +68 -92
- data/lib/smarter_csv/variables.rb +5 -1
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +8 -0
- metadata +6 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: f1d0b58acf0135b621e3182470674230ef73b48c829810e74fffa975fc318cf5
+data.tar.gz: ee404c5c485748d35cda36b8d249cb6813a3f80005182fe8c05feac1694aba57
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 4fee097fe2237f863510100155062da6815237260da5b15189f104f54596f7d5ff0479deb80596544e0bb1b9ba7b78126d2251798721e8d2f91e06b430950cd6
+data.tar.gz: c30562965452ef296b5e5aaf2a9a12887aa42d8e8396780b73b34f99a2386d232bf020578618fcbd65186fc864518c81a3e7555cae9b00a005322f3599e18c5a
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,24 @@
 
 # SmarterCSV 1.x Change Log
 
+## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡
+
+* BREAKING CHANGES:
+
+Changed behavior:
++ when `user_provided_headers` are provided:
+* if they are not unique, an exception will now be raised
+* they are taken "as is", no header transformations can be applied
+* when they are given as strings or as symbols, it is assumed that this is the desired format
+* the value of the `strings_as_keys` options will be ignored
+
++ option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
+* this allows automatic disambiguation when processing of CSV files with duplicate headers, by appending a number
+* explicitly set this option to `nil` to get the behavior from previous versions.
+
+* performance and memory improvements
+* code refactor
+
 ## 1.9.3 (2023-12-16)
 * raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
 * code refactor / no functional changes
data/README.md
CHANGED
@@ -2,15 +2,33 @@
 # SmarterCSV
 
 [](https://codecov.io/gh/tilo/smarter_csv) [](http://badge.fury.io/rb/smarter_csv)
-
+
+
+#### LATEST CHANGES
+
+* Version 1.10.0 has BREAKING CHANGES:
+
+Changed behavior:
++ when `user_provided_headers` are provided:
+* if they are not unique, an exception will now be raised
+* they are taken "as is", no header transformations can be applied
+* when they are given as strings or as symbols, it is assumed that this is the desired format
+* the value of the `strings_as_keys` options will be ignored
+
++ option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
+* this allows automatic disambiguation when processing of CSV files with duplicate headers, by appending a number
+* explicitly set this option to `nil` to get the behavior from previous versions.
+
 #### Development Branches
 
 * default branch is `main` for 1.x development
-
+
+* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
+- This is an EXPERIMENTAL branch - DO NOT USE in production
 
-#### Work towards Future Version 2.
+#### Work towards Future Version 2.x
 
-* Work towards SmarterCSV 2.
+* Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
 Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
 
 ---------------
@@ -84,6 +102,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
 ```
 
+### Articles
+* [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
+* [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
+
 ### Examples
 
 Here are some examples to demonstrate the versatility of SmarterCSV.
@@ -243,8 +265,6 @@ NOTE: If you use `key_mappings` and `value_converters`, make sure that the value
 data[0][:price].class
 => Float
 ```
-## Parallel Processing
-[Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
 
 ## Documentation
 
@@ -280,7 +300,8 @@ The options and the block are optional.
 | :headers_in_file | true | Whether or not the file contains headers as the first line. |
 | | | Important if the file does not contain headers, |
 | | | otherwise you would lose the first line of data. |
-| :duplicate_header_suffix |
+| :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
+| | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
 | :user_provided_headers | nil | *careful with that axe!* |
 | | | user provided Array of header strings or symbols, to define |
 | | | what headers should be used, overriding any in-file headers. |
data/lib/smarter_csv/hash_transformations.rb
ADDED
@@ -0,0 +1,91 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
def hash_transformations(hash, options)
|
6
|
+
# there may be unmapped keys, or keys purposedly mapped to nil or an empty key..
|
7
|
+
# make sure we delete any key/value pairs from the hash, which the user wanted to delete:
|
8
|
+
remove_empty_values = options[:remove_empty_values] == true
|
9
|
+
remove_zero_values = options[:remove_zero_values]
|
10
|
+
remove_values_matching = options[:remove_values_matching]
|
11
|
+
convert_to_numeric = options[:convert_values_to_numeric]
|
12
|
+
value_converters = options[:value_converters]
|
13
|
+
|
14
|
+
hash.each_with_object({}) do |(k, v), new_hash|
|
15
|
+
next if k.nil? || k == '' || k == :""
|
16
|
+
next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
|
17
|
+
next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
|
18
|
+
next if remove_values_matching && v =~ remove_values_matching
|
19
|
+
|
20
|
+
# deal with the :only / :except options to :convert_values_to_numeric
|
21
|
+
if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
|
22
|
+
if v =~ /^[+-]?\d+\.\d+$/
|
23
|
+
v = v.to_f
|
24
|
+
elsif v =~ /^[+-]?\d+$/
|
25
|
+
v = v.to_i
|
26
|
+
end
|
27
|
+
end
|
28
|
+
|
29
|
+
converter = value_converters[k] if value_converters
|
30
|
+
v = converter.convert(v) if converter
|
31
|
+
|
32
|
+
new_hash[k] = v
|
33
|
+
end
|
34
|
+
end
|
35
|
+
|
36
|
+
# def hash_transformations(hash, options)
|
37
|
+
# # there may be unmapped keys, or keys purposedly mapped to nil or an empty key..
|
38
|
+
# # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
|
39
|
+
# hash.delete(nil)
|
40
|
+
# hash.delete('')
|
41
|
+
# hash.delete(:"")
|
42
|
+
|
43
|
+
# if options[:remove_empty_values] == true
|
44
|
+
# hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
|
45
|
+
# end
|
46
|
+
|
47
|
+
# hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
|
48
|
+
# hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
|
49
|
+
|
50
|
+
# if options[:convert_values_to_numeric]
|
51
|
+
# hash.each do |k, v|
|
52
|
+
# # deal with the :only / :except options to :convert_values_to_numeric
|
53
|
+
# next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
|
54
|
+
|
55
|
+
# # convert if it's a numeric value:
|
56
|
+
# case v
|
57
|
+
# when /^[+-]?\d+\.\d+$/
|
58
|
+
# hash[k] = v.to_f
|
59
|
+
# when /^[+-]?\d+$/
|
60
|
+
# hash[k] = v.to_i
|
61
|
+
# end
|
62
|
+
# end
|
63
|
+
# end
|
64
|
+
|
65
|
+
# if options[:value_converters]
|
66
|
+
# hash.each do |k, v|
|
67
|
+
# converter = options[:value_converters][k]
|
68
|
+
# next unless converter
|
69
|
+
|
70
|
+
# hash[k] = converter.convert(v)
|
71
|
+
# end
|
72
|
+
# end
|
73
|
+
|
74
|
+
# hash
|
75
|
+
# end
|
76
|
+
|
77
|
+
protected
|
78
|
+
|
79
|
+
# acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
|
80
|
+
def limit_execution_for_only_or_except(options, option_name, key)
|
81
|
+
if options[option_name].is_a?(Hash)
|
82
|
+
if options[option_name].has_key?(:except)
|
83
|
+
return true if Array(options[option_name][:except]).include?(key)
|
84
|
+
elsif options[option_name].has_key?(:only)
|
85
|
+
return true unless Array(options[option_name][:only]).include?(key)
|
86
|
+
end
|
87
|
+
end
|
88
|
+
false
|
89
|
+
end
|
90
|
+
end
|
91
|
+
end
|
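Reviewer note: the numeric-conversion step in `hash_transformations` is easier to follow in isolation. The following is a minimal standalone sketch, not the gem's code: `skip_conversion?` and `convert_numeric_values` are hypothetical names, and only the two regular expressions and the `:only` / `:except` semantics are taken from the file above.

```ruby
# Hypothetical helper: true when numeric conversion should be skipped for `key`.
# `setting` stands in for options[:convert_values_to_numeric], which may be
# true/false or a Hash such as { except: [...] } or { only: [...] }.
def skip_conversion?(setting, key)
  return false unless setting.is_a?(Hash)

  if setting.key?(:except)
    Array(setting[:except]).include?(key)
  elsif setting.key?(:only)
    !Array(setting[:only]).include?(key)
  else
    false
  end
end

# Hypothetical helper: builds a new hash, converting numeric-looking Strings.
def convert_numeric_values(row, setting)
  row.each_with_object({}) do |(key, value), result|
    if setting && !skip_conversion?(setting, key)
      if value =~ /^[+-]?\d+\.\d+$/   # same float pattern as in the diff
        value = value.to_f
      elsif value =~ /^[+-]?\d+$/     # same integer pattern as in the diff
        value = value.to_i
      end
    end
    result[key] = value
  end
end

row = { id: "42", price: "9.50", zip: "01234" }
p convert_numeric_values(row, { except: [:zip] })
```

Excluding `:zip` matters here because `"01234".to_i` would drop the leading zero. Building a fresh hash with `each_with_object` matches the single-pass style of the new method, whereas the commented-out older version mutated the hash in several passes.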
data/lib/smarter_csv/header_transformations.rb
ADDED
@@ -0,0 +1,63 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
# transform the headers that were in the file:
|
6
|
+
def header_transformations(header_array, options)
|
7
|
+
header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
|
8
|
+
header_array.map!{|x| x.strip} if options[:strip_whitespace]
|
9
|
+
|
10
|
+
unless options[:keep_original_headers]
|
11
|
+
header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
|
12
|
+
header_array.map!{|x| x.downcase} if options[:downcase_header]
|
13
|
+
end
|
14
|
+
|
15
|
+
# detect duplicate headers and disambiguate
|
16
|
+
header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
|
17
|
+
# symbolize headers
|
18
|
+
header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
|
19
|
+
# doesn't make sense to re-map when we have user_provided_headers
|
20
|
+
header_array = remap_headers(header_array, options) if options[:key_mapping]
|
21
|
+
|
22
|
+
header_array
|
23
|
+
end
|
24
|
+
|
25
|
+
def disambiguate_headers(headers, options)
|
26
|
+
counts = Hash.new(0)
|
27
|
+
headers.map do |header|
|
28
|
+
counts[header] += 1
|
29
|
+
counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
|
30
|
+
end
|
31
|
+
end
|
32
|
+
|
33
|
+
# do some key mapping on the keys in the file header
|
34
|
+
# if you want to completely delete a key, then map it to nil or to ''
|
35
|
+
def remap_headers(headers, options)
|
36
|
+
key_mapping = options[:key_mapping]
|
37
|
+
if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
|
38
|
+
raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
|
39
|
+
end
|
40
|
+
|
41
|
+
key_mapping = options[:key_mapping]
|
42
|
+
# if silence_missing_keys are not set, raise error if missing header
|
43
|
+
missing_keys = key_mapping.keys - headers
|
44
|
+
# if the user passes a list of speciffic mapped keys that are optional
|
45
|
+
missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
|
46
|
+
|
47
|
+
unless missing_keys.empty? || options[:silence_missing_keys] == true
|
48
|
+
raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
|
49
|
+
end
|
50
|
+
|
51
|
+
headers.map! do |header|
|
52
|
+
if key_mapping.has_key?(header)
|
53
|
+
key_mapping[header].nil? ? nil : key_mapping[header]
|
54
|
+
elsif options[:remove_unmapped_keys]
|
55
|
+
nil
|
56
|
+
else
|
57
|
+
header
|
58
|
+
end
|
59
|
+
end
|
60
|
+
headers
|
61
|
+
end
|
62
|
+
end
|
63
|
+
end
|
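Reviewer note: because `''` is truthy in Ruby, the new default for `duplicate_header_suffix` passes the `if options[:duplicate_header_suffix]` guard, so disambiguation now runs unless the option is explicitly `nil`. A small standalone sketch of the counting logic follows; the method name `disambiguate` is hypothetical, and the body mirrors `disambiguate_headers` above with the suffix passed directly instead of through `options`.

```ruby
# Appends "<suffix><n>" to the 2nd, 3rd, ... occurrence of a header.
# The first occurrence is left untouched.
def disambiguate(headers, suffix)
  counts = Hash.new(0) # unseen headers start at a count of 0
  headers.map do |header|
    counts[header] += 1
    counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
  end
end

p disambiguate(%w[name email email email], '')  # empty suffix (new default)
p disambiguate(%w[name email email], '_')       # explicit separator
```

With the empty suffix the repeated headers become `email2` and `email3`; with `'_'` the second one becomes `email_2`.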
data/lib/smarter_csv/header_validations.rb
ADDED
@@ -0,0 +1,34 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module SmarterCSV
|
4
|
+
class << self
|
5
|
+
def header_validations(headers, options)
|
6
|
+
check_duplicate_headers(headers, options)
|
7
|
+
check_required_headers(headers, options)
|
8
|
+
end
|
9
|
+
|
10
|
+
def check_duplicate_headers(headers, _options)
|
11
|
+
header_counts = Hash.new(0)
|
12
|
+
headers.each { |header| header_counts[header] += 1 unless header.nil? }
|
13
|
+
|
14
|
+
duplicates = header_counts.select { |_, count| count > 1 }
|
15
|
+
|
16
|
+
unless duplicates.empty?
|
17
|
+
raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
|
18
|
+
end
|
19
|
+
end
|
20
|
+
|
21
|
+
require 'set'
|
22
|
+
|
23
|
+
def check_required_headers(headers, options)
|
24
|
+
if options[:required_keys] && options[:required_keys].is_a?(Array)
|
25
|
+
headers_set = headers.to_set
|
26
|
+
missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }
|
27
|
+
|
28
|
+
unless missing_keys.empty?
|
29
|
+
raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
|
30
|
+
end
|
31
|
+
end
|
32
|
+
end
|
33
|
+
end
|
34
|
+
end
|
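Reviewer note: both validations replace nested scans with a single pass. The sketch below reproduces the two ideas outside the gem; `find_duplicates` and `missing_required` are hypothetical names, and raising the gem's error classes is deliberately left out.

```ruby
require 'set'

# Counts each non-nil header once, then keeps those seen more than once.
def find_duplicates(headers)
  counts = Hash.new(0)
  headers.each { |header| counts[header] += 1 unless header.nil? }
  counts.select { |_, count| count > 1 }
end

# Uses a Set for constant-time membership checks on the headers.
def missing_required(headers, required_keys)
  present = headers.to_set
  required_keys.reject { |key| present.include?(key) }
end

p find_duplicates([:id, :email, :id, nil, nil])
p missing_required([:id, :email], [:id, :name])
```

`nil` headers are skipped when counting, so columns that were mapped to `nil` for deletion do not count as duplicates of each other.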
data/lib/smarter_csv/headers.rb
CHANGED
@@ -14,7 +14,11 @@ module SmarterCSV
|
|
14
14
|
# the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
|
15
15
|
header_line = @raw_header = readline_with_counts(filehandle, options)
|
16
16
|
header_line = preprocess_header_line(header_line, options)
|
17
|
-
|
17
|
+
|
18
|
+
file_header_array, file_header_size = parse(header_line, options)
|
19
|
+
|
20
|
+
file_header_array = header_transformations(file_header_array, options)
|
21
|
+
|
18
22
|
else
|
19
23
|
unless options[:user_provided_headers]
|
20
24
|
raise SmarterCSV::IncorrectOption, "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers"
|
@@ -36,22 +40,12 @@ module SmarterCSV
|
|
36
40
|
# we could print out the mapping of file_header_array to header_array here
|
37
41
|
end
|
38
42
|
end
|
43
|
+
|
39
44
|
header_array = user_header_array
|
40
45
|
else
|
41
46
|
header_array = file_header_array
|
42
47
|
end
|
43
48
|
|
44
|
-
# detect duplicate headers and disambiguate
|
45
|
-
header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
|
46
|
-
|
47
|
-
# symbolize headers
|
48
|
-
header_array.map!{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
|
49
|
-
|
50
|
-
# wouldn't make sense to re-map user provided headers
|
51
|
-
header_array = remap_headers(header_array, options) if options[:key_mapping] && !options[:user_provided_headers]
|
52
|
-
|
53
|
-
validate_and_deprecate_headers(header_array, options)
|
54
|
-
|
55
49
|
[header_array, header_array.size]
|
56
50
|
end
|
57
51
|
|
@@ -65,92 +59,6 @@ module SmarterCSV
|
|
65
59
|
header_line
|
66
60
|
end
|
67
61
|
|
68
|
-
def parse_and_modify_headers(header_line, options)
|
69
|
-
file_header_array, file_header_size = parse(header_line, options)
|
70
|
-
|
71
|
-
file_header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
|
72
|
-
file_header_array.map!{|x| x.strip} if options[:strip_whitespace]
|
73
|
-
|
74
|
-
unless options[:keep_original_headers]
|
75
|
-
file_header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
|
76
|
-
file_header_array.map!{|x| x.downcase} if options[:downcase_header]
|
77
|
-
end
|
78
|
-
[file_header_array, file_header_size]
|
79
|
-
end
|
80
|
-
|
81
|
-
def disambiguate_headers(headers, options)
|
82
|
-
counts = Hash.new(0)
|
83
|
-
headers.map do |header|
|
84
|
-
counts[header] += 1
|
85
|
-
counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
|
86
|
-
end
|
87
|
-
end
|
88
|
-
|
89
|
-
# do some key mapping on the keys in the file header
|
90
|
-
# if you want to completely delete a key, then map it to nil or to ''
|
91
|
-
def remap_headers(headers, options)
|
92
|
-
key_mapping = options[:key_mapping]
|
93
|
-
if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
|
94
|
-
raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
|
95
|
-
end
|
96
|
-
|
97
|
-
key_mapping = options[:key_mapping]
|
98
|
-
# if silence_missing_keys are not set, raise error if missing header
|
99
|
-
missing_keys = key_mapping.keys - headers
|
100
|
-
# if the user passes a list of speciffic mapped keys that are optional
|
101
|
-
missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
|
102
|
-
|
103
|
-
unless missing_keys.empty? || options[:silence_missing_keys] == true
|
104
|
-
raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
|
105
|
-
end
|
106
|
-
|
107
|
-
headers.map! do |header|
|
108
|
-
if key_mapping.has_key?(header)
|
109
|
-
key_mapping[header].nil? ? nil : key_mapping[header]
|
110
|
-
elsif options[:remove_unmapped_keys]
|
111
|
-
nil
|
112
|
-
else
|
113
|
-
header
|
114
|
-
end
|
115
|
-
end
|
116
|
-
headers
|
117
|
-
end
|
118
|
-
|
119
|
-
# header_validations
|
120
|
-
def validate_and_deprecate_headers(headers, options)
|
121
|
-
duplicate_headers = []
|
122
|
-
headers.compact.each do |k|
|
123
|
-
duplicate_headers << k if headers.select{|x| x == k}.size > 1
|
124
|
-
end
|
125
|
-
|
126
|
-
unless options[:user_provided_headers] || duplicate_headers.empty?
|
127
|
-
raise SmarterCSV::DuplicateHeaders, "ERROR: duplicate headers: #{duplicate_headers.join(',')}"
|
128
|
-
end
|
129
|
-
|
130
|
-
# deprecate required_headers
|
131
|
-
unless options[:required_headers].nil?
|
132
|
-
puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
|
133
|
-
if options[:required_keys].nil?
|
134
|
-
options[:required_keys] = options[:required_headers]
|
135
|
-
options[:required_headers] = nil
|
136
|
-
end
|
137
|
-
end
|
138
|
-
|
139
|
-
if options[:required_keys] && options[:required_keys].is_a?(Array)
|
140
|
-
missing_keys = []
|
141
|
-
options[:required_keys].each do |k|
|
142
|
-
missing_keys << k unless headers.include?(k)
|
143
|
-
end
|
144
|
-
raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}" unless missing_keys.empty?
|
145
|
-
end
|
146
|
-
end
|
147
|
-
|
148
|
-
def enforce_utf8_encoding(header, options)
|
149
|
-
return header unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
|
150
|
-
|
151
|
-
header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
|
152
|
-
end
|
153
|
-
|
154
62
|
def remove_comments_from_header(header, options)
|
155
63
|
return header unless options[:comment_regexp]
|
156
64
|
|
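Reviewer note: `process_headers` now delegates to `header_transformations`, and that call only happens on the in-file header path. As a rough illustration of what the default pipeline does to raw in-file headers, here is a standalone sketch. `normalize_headers` is a hypothetical name; it assumes `strip_whitespace` and `downcase_header` are on and `keep_original_headers` and `strings_as_keys` are off, and it strips the quote character with a plain string `gsub` rather than the interpolated regex used in the gem.

```ruby
# Sketch of the default header pipeline:
# strip quotes -> strip whitespace -> spaces/dashes to "_" -> downcase -> symbolize
def normalize_headers(raw_headers, quote_char: '"')
  raw_headers
    .map { |header| header.gsub(quote_char, '') }
    .map(&:strip)
    .map { |header| header.gsub(/\s+|-+/, '_') }
    .map(&:downcase)
    .map(&:to_sym)
end

p normalize_headers(['"First Name"', ' E-Mail ', 'ZIP'])
```

Per the changelog, none of these steps are applied to `user_provided_headers` in 1.10.0, which are taken as given.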
data/lib/smarter_csv/options_processing.rb
CHANGED
@@ -9,7 +9,7 @@ module SmarterCSV
|
|
9
9
|
comment_regexp: nil, # was: /\A#/,
|
10
10
|
convert_values_to_numeric: true,
|
11
11
|
downcase_header: true,
|
12
|
-
duplicate_header_suffix: nil,
|
12
|
+
duplicate_header_suffix: '', # was: nil,
|
13
13
|
file_encoding: 'utf-8',
|
14
14
|
force_simple_split: false,
|
15
15
|
force_utf8: false,
|
@@ -62,6 +62,15 @@ module SmarterCSV
|
|
62
62
|
private
|
63
63
|
|
64
64
|
def validate_options!(options)
|
65
|
+
# deprecate required_headers
|
66
|
+
unless options[:required_headers].nil?
|
67
|
+
puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
|
68
|
+
if options[:required_keys].nil?
|
69
|
+
options[:required_keys] = options[:required_headers]
|
70
|
+
options[:required_headers] = nil
|
71
|
+
end
|
72
|
+
end
|
73
|
+
|
65
74
|
keys = options.keys
|
66
75
|
errors = []
|
67
76
|
errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
|
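Reviewer note: the `required_headers` deprecation moved from header validation into option validation, so it now runs before any header is read. A minimal sketch of the same migration on a plain options hash follows. The method name `migrate_required_headers` is hypothetical, and it uses `warn` (stderr) where the gem uses `puts`.

```ruby
# Copies the deprecated :required_headers into :required_keys,
# but only when :required_keys was not given explicitly.
def migrate_required_headers(options)
  return options if options[:required_headers].nil?

  warn "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
  if options[:required_keys].nil?
    options[:required_keys] = options[:required_headers]
    options[:required_headers] = nil
  end
  options
end

p migrate_required_headers({ required_headers: [:id, :email] })
```

When both options are present, `required_keys` wins and `required_headers` is left untouched, which matches the nested `if` in the diff.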
data/lib/smarter_csv/smarter_csv.rb
CHANGED
@@ -12,28 +12,34 @@ module SmarterCSV
|
|
12
12
|
|
13
13
|
# first parameter: filename or input object which responds to readline method
|
14
14
|
def SmarterCSV.process(input, given_options = {}, &block) # rubocop:disable Lint/UnusedMethodArgument
|
15
|
+
initialize_variables
|
16
|
+
|
15
17
|
options = process_options(given_options)
|
16
18
|
|
17
|
-
|
19
|
+
@enforce_utf8 = options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
|
20
|
+
@verbose = options[:verbose]
|
18
21
|
|
19
|
-
has_rails = !!defined?(Rails)
|
20
22
|
begin
|
21
23
|
fh = input.respond_to?(:readline) ? input : File.open(input, "r:#{options[:file_encoding]}")
|
22
24
|
|
25
|
+
if @enforce_utf8 && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
|
26
|
+
puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
|
27
|
+
end
|
28
|
+
|
23
29
|
# auto-detect the row separator
|
24
30
|
options[:row_sep] = guess_line_ending(fh, options) if options[:row_sep]&.to_sym == :auto
|
25
31
|
# attempt to auto-detect column separator
|
26
32
|
options[:col_sep] = guess_column_separator(fh, options) if options[:col_sep]&.to_sym == :auto
|
27
33
|
|
28
|
-
if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
|
29
|
-
puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
|
30
|
-
end
|
31
|
-
|
32
34
|
skip_lines(fh, options)
|
33
35
|
|
34
36
|
@headers, header_size = process_headers(fh, options)
|
35
37
|
@headerA = @headers # @headerA is deprecated, use @headers
|
36
38
|
|
39
|
+
puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
|
40
|
+
|
41
|
+
header_validations(@headers, options)
|
42
|
+
|
37
43
|
# in case we use chunking.. we'll need to set it up..
|
38
44
|
if options[:chunk_size].to_i > 0
|
39
45
|
use_chunks = true
|
@@ -45,31 +51,42 @@ module SmarterCSV
|
|
45
51
|
end
|
46
52
|
|
47
53
|
# now on to processing all the rest of the lines in the CSV file:
|
54
|
+
# fh.each_line |line|
|
48
55
|
until fh.eof? # we can't use fh.readlines() here, because this would read the whole file into memory at once, and eof => true
|
49
56
|
line = readline_with_counts(fh, options)
|
50
57
|
|
51
58
|
# replace invalid byte sequence in UTF-8 with question mark to avoid errors
|
52
|
-
line = line
|
59
|
+
line = enforce_utf8_encoding(line, options) if @enforce_utf8
|
53
60
|
|
54
|
-
print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if
|
61
|
+
print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if @verbose
|
55
62
|
|
56
63
|
next if options[:comment_regexp] && line =~ options[:comment_regexp] # ignore all comment lines if there are any
|
57
64
|
|
58
65
|
# cater for the quoted csv data containing the row separator carriage return character
|
59
66
|
# in which case the row data will be split across multiple lines (see the sample content in spec/fixtures/carriage_returns_rn.csv)
|
60
67
|
# by detecting the existence of an uneven number of quote characters
|
68
|
+
multiline = count_quote_chars(line, options[:quote_char]).odd?
|
61
69
|
|
62
|
-
multiline
|
63
|
-
while count_quote_chars(line, options[:quote_char]).odd? # should handle quote_char nil
|
70
|
+
while multiline
|
64
71
|
next_line = fh.readline(options[:row_sep])
|
65
|
-
next_line = next_line
|
72
|
+
next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
|
66
73
|
line += next_line
|
67
74
|
@file_line_count += 1
|
75
|
+
|
76
|
+
break if fh.eof? # Exit loop if end of file is reached
|
77
|
+
|
78
|
+
multiline = count_quote_chars(line, options[:quote_char]).odd?
|
68
79
|
end
|
69
|
-
|
80
|
+
|
81
|
+
# :nocov:
|
82
|
+
if multiline && @verbose
|
83
|
+
print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count
|
84
|
+
end
|
85
|
+
# :nocov:
|
70
86
|
|
71
87
|
line.chomp!(options[:row_sep])
|
72
88
|
|
89
|
+
# --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
|
73
90
|
dataA, _data_size = parse(line, options, header_size)
|
74
91
|
|
75
92
|
dataA.map!{|x| x.strip} if options[:strip_whitespace]
|
@@ -77,48 +94,25 @@ module SmarterCSV
|
|
77
94
|
# if all values are blank, then ignore this line
|
78
95
|
next if options[:remove_empty_hashes] && (dataA.empty? || blank?(dataA))
|
79
96
|
|
97
|
+
# --- HASH TRANSFORMATIONS ------------------------------------------------------------
|
80
98
|
hash = @headers.zip(dataA).to_h
|
81
99
|
|
82
|
-
|
83
|
-
hash.delete(nil)
|
84
|
-
hash.delete('')
|
85
|
-
hash.delete(:"")
|
86
|
-
|
87
|
-
if options[:remove_empty_values] == true
|
88
|
-
hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
|
89
|
-
end
|
90
|
-
|
91
|
-
hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
|
92
|
-
hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
|
93
|
-
|
94
|
-
if options[:convert_values_to_numeric]
|
95
|
-
hash.each do |k, v|
|
96
|
-
# deal with the :only / :except options to :convert_values_to_numeric
|
97
|
-
next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
|
100
|
+
hash = hash_transformations(hash, options)
|
98
101
|
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
end
|
106
|
-
end
|
107
|
-
end
|
108
|
-
|
109
|
-
if options[:value_converters]
|
110
|
-
hash.each do |k, v|
|
111
|
-
converter = options[:value_converters][k]
|
112
|
-
next unless converter
|
113
|
-
|
114
|
-
hash[k] = converter.convert(v)
|
115
|
-
end
|
116
|
-
end
|
102
|
+
# --- HASH VALIDATIONS ----------------------------------------------------------------
|
103
|
+
# will go here, and be able to:
|
104
|
+
# - validate correct format of the values for fields
|
105
|
+
# - required fields to be non-empty
|
106
|
+
# - ...
|
107
|
+
# -------------------------------------------------------------------------------------
|
117
108
|
|
118
109
|
next if options[:remove_empty_hashes] && hash.empty?
|
119
110
|
|
111
|
+
puts "CSV Line #{@file_line_count}: #{pp(hash)}" if @verbose == '2' # very verbose setting
|
112
|
+
# optional adding of csv_line_number to the hash to help debugging
|
120
113
|
hash[:csv_line_number] = @csv_line_count if options[:with_line_numbers]
|
121
114
|
|
115
|
+
# process the chunks or the resulting hash
|
122
116
|
if use_chunks
|
123
117
|
chunk << hash # append temp result to chunk
|
124
118
|
|
@@ -127,16 +121,13 @@ module SmarterCSV
|
|
127
121
|
if block_given?
|
128
122
|
yield chunk # do something with the hashes in the chunk in the block
|
129
123
|
else
|
130
|
-
@result << chunk #
|
124
|
+
@result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
|
131
125
|
end
|
132
126
|
@chunk_count += 1
|
133
|
-
chunk
|
127
|
+
chunk.clear # re-initialize for next chunk of data
|
134
128
|
else
|
135
|
-
|
136
|
-
# the last chunk may contain partial data, which also needs to be returned (BUG / ISSUE-18)
|
137
|
-
|
129
|
+
# the last chunk may contain partial data, which is handled below
|
138
130
|
end
|
139
|
-
|
140
131
|
# while a chunk is being filled up we don't need to do anything else here
|
141
132
|
|
142
133
|
else # no chunk handling
|
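Reviewer note: the switch to `chunk.clear` is what makes `chunk.dup` necessary. `clear` empties the same Array object that was just appended to `@result`, so without a copy every stored chunk would alias one array. The sketch below demonstrates the effect with plain arrays; `collect_chunks` and its `copy:` flag are hypothetical and exist only to show both behaviors side by side.

```ruby
# Groups rows into chunks of `chunk_size`, reusing one buffer array.
# copy: true  -> store a snapshot of the buffer (what the diff does with .dup)
# copy: false -> store the buffer itself (shows the aliasing problem)
def collect_chunks(rows, chunk_size, copy:)
  result = []
  chunk = []
  rows.each do |row|
    chunk << row
    next unless chunk.size >= chunk_size

    result << (copy ? chunk.dup : chunk)
    chunk.clear # empties the buffer in place for the next chunk
  end
  # the last chunk may be only partially filled
  result << (copy ? chunk.dup : chunk) unless chunk.empty?
  result
end

p collect_chunks([1, 2, 3, 4, 5], 2, copy: true)
p collect_chunks([1, 2, 3, 4, 5], 2, copy: false)
```

Without the copy, all three stored entries are the same object and end up showing only the final buffer contents. Note that `dup` is a shallow copy: the row hashes themselves are shared, which is fine here because they are not mutated afterwards.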
@@ -149,15 +140,15 @@ module SmarterCSV
|
|
149
140
|
end
|
150
141
|
|
151
142
|
# print new line to retain last processing line message
|
152
|
-
print "\n" if
|
143
|
+
print "\n" if @verbose
|
153
144
|
|
154
|
-
# last chunk:
|
145
|
+
# handling of last chunk:
|
155
146
|
if !chunk.nil? && chunk.size > 0
|
156
147
|
# do something with the chunk
|
157
148
|
if block_given?
|
158
149
|
yield chunk # do something with the hashes in the chunk in the block
|
159
150
|
else
|
160
|
-
@result << chunk #
|
151
|
+
@result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
|
161
152
|
end
|
162
153
|
@chunk_count += 1
|
163
154
|
# chunk = [] # initialize for next chunk of data
|
@@ -174,16 +165,22 @@ module SmarterCSV
|
|
174
165
|
end
|
175
166
|
|
176
167
|
class << self
|
177
|
-
# * the `scan` method iterates through the string and finds all occurrences of the pattern
|
178
|
-
# * The reqular expression:
|
179
|
-
# - (?<!\\) : Negative lookbehind to ensure the quote character is not preceded by an unescaped backslash.
|
180
|
-
# - (?:\\\\)* : Non-capturing group for an even number of backslashes (escaped backslashes).
|
181
|
-
# This allows for any number of escaped backslashes before the quote character.
|
182
|
-
# - #{Regexp.escape(quote_char)} : Dynamically inserts the quote_char into the regex,
|
183
|
-
# ensuring it's properly escaped for use in the regex.
|
184
|
-
#
|
185
168
|
def count_quote_chars(line, quote_char)
|
186
|
-
line.
|
169
|
+
return 0 if line.nil? || quote_char.nil? || quote_char.empty?
|
170
|
+
|
171
|
+
count = 0
|
172
|
+
escaped = false
|
173
|
+
|
174
|
+
line.each_char do |char|
|
175
|
+
if char == '\\' && !escaped
|
176
|
+
escaped = true
|
177
|
+
else
|
178
|
+
count += 1 if char == quote_char && !escaped
|
179
|
+
escaped = false
|
180
|
+
end
|
181
|
+
end
|
182
|
+
|
183
|
+
count
|
187
184
|
end
|
188
185
|
|
189
186
|
def has_acceleration?
|
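Reviewer note: the new `count_quote_chars` is a small state machine: a backslash escapes the next character, and an escaped backslash does not escape what follows it. The standalone copy below uses the hypothetical name `count_unescaped_quotes` so it can be run outside the gem; the body follows the diff line for line.

```ruby
# Counts quote characters that are not preceded by an escaping backslash.
def count_unescaped_quotes(line, quote_char)
  return 0 if line.nil? || quote_char.nil? || quote_char.empty?

  count = 0
  escaped = false

  line.each_char do |char|
    if char == '\\' && !escaped
      escaped = true          # next character is escaped
    else
      count += 1 if char == quote_char && !escaped
      escaped = false         # escape applies to one character only
    end
  end

  count
end

backslash = '\\'
p count_unescaped_quotes('a,"b,c"', '"')                      # balanced quotes
p count_unescaped_quotes('a,"b', '"')                         # open quote: row continues
p count_unescaped_quotes("a,#{backslash}\"b", '"')            # escaped quote is ignored
p count_unescaped_quotes("a,#{backslash}#{backslash}\"b", '"') # escaped backslash, real quote
```

An odd result is what `SmarterCSV.process` treats as a row that continues onto the next file line. The early return also covers a `nil` or empty `quote_char`, which the removed line flagged with the comment "should handle quote_char nil".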
@@ -192,18 +189,6 @@ module SmarterCSV
|
|
192
189
|
|
193
190
|
protected
|
194
191
|
|
195
|
-
# acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
|
196
|
-
def limit_execution_for_only_or_except(options, option_name, key)
|
197
|
-
if options[option_name].is_a?(Hash)
|
198
|
-
if options[option_name].has_key?(:except)
|
199
|
-
return true if Array(options[option_name][:except]).include?(key)
|
200
|
-
elsif options[option_name].has_key?(:only)
|
201
|
-
return true unless Array(options[option_name][:only]).include?(key)
|
202
|
-
end
|
203
|
-
end
|
204
|
-
false
|
205
|
-
end
|
206
|
-
|
207
192
|
# SEE: https://github.com/rails/rails/blob/32015b6f369adc839c4f0955f2d9dce50c0b6123/activesupport/lib/active_support/core_ext/object/blank.rb#L121
|
208
193
|
# and in the future we might also include UTF-8 space characters: https://www.compart.com/en/unicode/category/Zs
|
209
194
|
BLANK_RE = /\A\s*\z/.freeze
|
@@ -211,33 +196,24 @@ module SmarterCSV
|
|
211
196
|
def blank?(value)
|
212
197
|
case value
|
213
198
|
when String
|
214
|
-
|
215
|
-
|
199
|
+
BLANK_RE.match?(value)
|
216
200
|
when NilClass
|
217
201
|
true
|
218
|
-
|
219
202
|
when Array
|
220
|
-
value.
|
221
|
-
|
203
|
+
value.all? { |elem| blank?(elem) }
|
222
204
|
when Hash
|
223
|
-
value.
|
224
|
-
|
205
|
+
value.values.all? { |elem| blank?(elem) } # Focus on values only
|
225
206
|
else
|
226
207
|
false
|
227
208
|
end
|
228
209
|
end
|
229
210
|
|
230
|
-
|
231
|
-
case value
|
232
|
-
when String
|
233
|
-
value.empty? || BLANK_RE.match?(value)
|
211
|
+
private
|
234
212
|
|
235
|
-
|
236
|
-
|
213
|
+
def enforce_utf8_encoding(line, options)
|
214
|
+
# return line unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
|
237
215
|
|
238
|
-
|
239
|
-
false
|
240
|
-
end
|
216
|
+
line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
|
241
217
|
end
|
242
218
|
end
|
243
219
|
end
|
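Reviewer note: the rewritten `blank?` recurses into arrays and hash values, which is what lets `remove_empty_hashes` drop a row whose fields are all whitespace. A standalone sketch follows under the hypothetical name `blank_value?`; it inlines the `BLANK_RE` pattern rather than referencing the gem's constant.

```ruby
# Recursive blank check: Strings of whitespace, nil, and containers
# whose elements (or values) are all blank count as blank.
def blank_value?(value)
  case value
  when String
    /\A\s*\z/.match?(value) # same pattern as BLANK_RE in the diff
  when NilClass
    true
  when Array
    value.all? { |elem| blank_value?(elem) }
  when Hash
    value.values.all? { |elem| blank_value?(elem) } # keys are ignored
  else
    false
  end
end

p blank_value?(['', nil, "  \t"])
p blank_value?({ name: nil, email: ' ' })
p blank_value?(['', 'x'])
p blank_value?(0)
```

Because `all?` returns `true` for an empty collection, an empty Array or Hash is also treated as blank, while numbers and other objects never are.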
data/lib/smarter_csv/variables.rb
CHANGED
@@ -2,9 +2,10 @@
|
|
2
2
|
|
3
3
|
module SmarterCSV
|
4
4
|
class << self
|
5
|
-
attr_reader :csv_line_count, :chunk_count, :errors, :file_line_count, :headers, :raw_header, :result, :warnings
|
5
|
+
attr_reader :has_rails, :csv_line_count, :chunk_count, :errors, :file_line_count, :headers, :raw_header, :result, :warnings
|
6
6
|
|
7
7
|
def initialize_variables
|
8
|
+
@has_rails = !!defined?(Rails)
|
8
9
|
@csv_line_count = 0
|
9
10
|
@chunk_count = 0
|
10
11
|
@errors = {}
|
@@ -14,13 +15,16 @@ module SmarterCSV
|
|
14
15
|
@raw_header = nil # header as it appears in the file
|
15
16
|
@result = []
|
16
17
|
@warnings = {}
|
18
|
+
@enforce_utf8 = false # only set to true if needed (after options parsing)
|
17
19
|
end
|
18
20
|
|
19
21
|
# :nocov:
|
22
|
+
# rubocop:disable Naming/MethodName
|
20
23
|
def headerA
|
21
24
|
warn "Deprecarion Warning: 'headerA' will be removed in future versions. Use 'headders'"
|
22
25
|
@headerA
|
23
26
|
end
|
27
|
+
# rubocop:enable Naming/MethodName
|
24
28
|
# :nocov:
|
25
29
|
end
|
26
30
|
end
|
data/lib/smarter_csv/version.rb
CHANGED
data/lib/smarter_csv.rb
CHANGED
@@ -5,13 +5,21 @@ require "smarter_csv/file_io"
|
|
5
5
|
require "smarter_csv/options_processing"
|
6
6
|
require "smarter_csv/auto_detection"
|
7
7
|
require "smarter_csv/variables"
|
8
|
+
require 'smarter_csv/header_transformations'
|
9
|
+
require 'smarter_csv/header_validations'
|
8
10
|
require "smarter_csv/headers"
|
11
|
+
require "smarter_csv/hash_transformations"
|
9
12
|
require "smarter_csv/parse"
|
10
13
|
|
14
|
+
# load the C-extension:
|
11
15
|
case RUBY_ENGINE
|
12
16
|
when 'ruby'
|
13
17
|
begin
|
14
18
|
if `uname -s`.chomp == 'Darwin'
|
19
|
+
#
|
20
|
+
# Please report if you see cases where the rake-compiler is building x86_64 code on arm64 cpus:
|
21
|
+
# https://github.com/rake-compiler/rake-compiler/issues/231
|
22
|
+
#
|
15
23
|
require 'smarter_csv/smarter_csv.bundle'
|
16
24
|
else
|
17
25
|
# :nocov:
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: smarter_csv
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.10.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Tilo Sloboda
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2023-12-
|
11
|
+
date: 2023-12-31 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: awesome_print
|
@@ -118,6 +118,9 @@ files:
|
|
118
118
|
- lib/smarter_csv.rb
|
119
119
|
- lib/smarter_csv/auto_detection.rb
|
120
120
|
- lib/smarter_csv/file_io.rb
|
121
|
+
- lib/smarter_csv/hash_transformations.rb
|
122
|
+
- lib/smarter_csv/header_transformations.rb
|
123
|
+
- lib/smarter_csv/header_validations.rb
|
121
124
|
- lib/smarter_csv/headers.rb
|
122
125
|
- lib/smarter_csv/options_processing.rb
|
123
126
|
- lib/smarter_csv/parse.rb
|
@@ -148,7 +151,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
148
151
|
- !ruby/object:Gem::Version
|
149
152
|
version: '0'
|
150
153
|
requirements: []
|
151
|
-
rubygems_version: 3.
|
154
|
+
rubygems_version: 3.5.3
|
152
155
|
signing_key:
|
153
156
|
specification_version: 4
|
154
157
|
summary: Ruby Gem for smarter importing of CSV Files (and CSV-like files), with lots
|