smarter_csv 1.9.3 → 1.10.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 5f35e10ff8bc0e79ff1ed9bea8e413f746f51128a6f6a9622d246873fd588366
4
- data.tar.gz: 5cc30cf6f4422dd16f3019915bc5305a92aaaa4b99665e4c4c525d3bbf489cfd
3
+ metadata.gz: f1d0b58acf0135b621e3182470674230ef73b48c829810e74fffa975fc318cf5
4
+ data.tar.gz: ee404c5c485748d35cda36b8d249cb6813a3f80005182fe8c05feac1694aba57
5
5
  SHA512:
6
- metadata.gz: 057472a73ae0be95318b16428b276ecffba384a68479af715c5ec3ca7601405ca73928b0fbf245c9b3f46fd33b82a8c6d9c9e6330ddb0305b83ae23f58173df0
7
- data.tar.gz: 319b12a53875c1963eed6d27aa67850135d33a5b3a9f70607e6d812906733b711ade6c3ee6e789d78c2e159004a879e59e700145224134745b16d279039ac38a
6
+ metadata.gz: 4fee097fe2237f863510100155062da6815237260da5b15189f104f54596f7d5ff0479deb80596544e0bb1b9ba7b78126d2251798721e8d2f91e06b430950cd6
7
+ data.tar.gz: c30562965452ef296b5e5aaf2a9a12887aa42d8e8396780b73b34f99a2386d232bf020578618fcbd65186fc864518c81a3e7555cae9b00a005322f3599e18c5a
data/CHANGELOG.md CHANGED
@@ -1,6 +1,24 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
+ ## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡
5
+
6
+ * BREAKING CHANGES:
7
+
8
+ Changed behavior:
9
+ + when `user_provided_headers` are provided:
10
+ * if they are not unique, an exception will now be raised
11
+ * they are taken "as is", no header transformations can be applied
12
+ * when they are given as strings or as symbols, it is assumed that this is the desired format
13
+ * the value of the `strings_as_keys` option will be ignored
14
+
15
+ + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
16
+ * this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
17
+ * explicitly set this option to `nil` to get the behavior from previous versions.
18
+
19
+ * performance and memory improvements
20
+ * code refactor
21
+
4
22
  ## 1.9.3 (2023-12-16)
5
23
  * raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
6
24
  * code refactor / no functional changes
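Reviewer note: the changelog entry above describes the new `duplicate_header_suffix` default only in words. A minimal standalone sketch of the numbering scheme follows, mirroring the `disambiguate_headers` logic added in this release. The method name `disambiguate` and the sample headers are illustrative assumptions, not part of the gem's API.

```ruby
# Standalone sketch: number repeated headers, joining name and counter
# with the given suffix. Not the gem's API; names are illustrative.
def disambiguate(headers, suffix)
  counts = Hash.new(0)
  headers.map do |header|
    counts[header] += 1
    # the first occurrence is kept as-is, later ones get "<suffix><n>"
    counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
  end
end

headers = %w[name email name name]

with_default    = disambiguate(headers, '')  # mimics the new default ''
with_underscore = disambiguate(headers, '_') # an explicit suffix

puts with_default.inspect    # ["name", "email", "name2", "name3"]
puts with_underscore.inspect # ["name", "email", "name_2", "name_3"]
```

With `nil`, this step is skipped entirely, so repeated names stay unchanged and are rejected later by the duplicate-header check.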
data/README.md CHANGED
@@ -2,15 +2,33 @@
2
2
  # SmarterCSV
3
3
 
4
4
  [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
5
-
5
+
6
+
7
+ #### LATEST CHANGES
8
+
9
+ * Version 1.10.0 has BREAKING CHANGES:
10
+
11
+ Changed behavior:
12
+ + when `user_provided_headers` are provided:
13
+ * if they are not unique, an exception will now be raised
14
+ * they are taken "as is", no header transformations can be applied
15
+ * when they are given as strings or as symbols, it is assumed that this is the desired format
16
+ * the value of the `strings_as_keys` option will be ignored
17
+
18
+ + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
19
+ * this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
20
+ * explicitly set this option to `nil` to get the behavior from previous versions.
21
+
6
22
  #### Development Branches
7
23
 
8
24
  * default branch is `main` for 1.x development
9
- * 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
25
+
26
+ * 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
27
+ - This is an EXPERIMENTAL branch - DO NOT USE in production
10
28
 
11
- #### Work towards Future Version 2.0
29
+ #### Work towards Future Version 2.x
12
30
 
13
- * Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
31
+ * Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
14
32
  Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
15
33
 
16
34
  ---------------
@@ -84,6 +102,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
84
102
  00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
85
103
  ```
86
104
 
105
+ ### Articles
106
+ * [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
107
+ * [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
108
+
87
109
  ### Examples
88
110
 
89
111
  Here are some examples to demonstrate the versatility of SmarterCSV.
@@ -243,8 +265,6 @@ NOTE: If you use `key_mappings` and `value_converters`, make sure that the value
243
265
  data[0][:price].class
244
266
  => Float
245
267
  ```
246
- ## Parallel Processing
247
- [Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
248
268
 
249
269
  ## Documentation
250
270
 
@@ -280,7 +300,8 @@ The options and the block are optional.
280
300
  | :headers_in_file | true | Whether or not the file contains headers as the first line. |
281
301
  | | | Important if the file does not contain headers, |
282
302
  | | | otherwise you would lose the first line of data. |
283
- | :duplicate_header_suffix | nil | If set, adds numbers to duplicated headers and separates them by the given suffix |
303
+ | :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
304
+ | | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
284
305
  | :user_provided_headers | nil | *careful with that axe!* |
285
306
  | | | user provided Array of header strings or symbols, to define |
286
307
  | | | what headers should be used, overriding any in-file headers. |
data/lib/smarter_csv/hash_transformations.rb ADDED
@@ -0,0 +1,91 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ def hash_transformations(hash, options)
6
+ # there may be unmapped keys, or keys purposedly mapped to nil or an empty key..
7
+ # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
8
+ remove_empty_values = options[:remove_empty_values] == true
9
+ remove_zero_values = options[:remove_zero_values]
10
+ remove_values_matching = options[:remove_values_matching]
11
+ convert_to_numeric = options[:convert_values_to_numeric]
12
+ value_converters = options[:value_converters]
13
+
14
+ hash.each_with_object({}) do |(k, v), new_hash|
15
+ next if k.nil? || k == '' || k == :""
16
+ next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
17
+ next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
18
+ next if remove_values_matching && v =~ remove_values_matching
19
+
20
+ # deal with the :only / :except options to :convert_values_to_numeric
21
+ if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
22
+ if v =~ /^[+-]?\d+\.\d+$/
23
+ v = v.to_f
24
+ elsif v =~ /^[+-]?\d+$/
25
+ v = v.to_i
26
+ end
27
+ end
28
+
29
+ converter = value_converters[k] if value_converters
30
+ v = converter.convert(v) if converter
31
+
32
+ new_hash[k] = v
33
+ end
34
+ end
35
+
36
+ # def hash_transformations(hash, options)
37
+ # # there may be unmapped keys, or keys purposedly mapped to nil or an empty key..
38
+ # # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
39
+ # hash.delete(nil)
40
+ # hash.delete('')
41
+ # hash.delete(:"")
42
+
43
+ # if options[:remove_empty_values] == true
44
+ # hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
45
+ # end
46
+
47
+ # hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
48
+ # hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
49
+
50
+ # if options[:convert_values_to_numeric]
51
+ # hash.each do |k, v|
52
+ # # deal with the :only / :except options to :convert_values_to_numeric
53
+ # next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
54
+
55
+ # # convert if it's a numeric value:
56
+ # case v
57
+ # when /^[+-]?\d+\.\d+$/
58
+ # hash[k] = v.to_f
59
+ # when /^[+-]?\d+$/
60
+ # hash[k] = v.to_i
61
+ # end
62
+ # end
63
+ # end
64
+
65
+ # if options[:value_converters]
66
+ # hash.each do |k, v|
67
+ # converter = options[:value_converters][k]
68
+ # next unless converter
69
+
70
+ # hash[k] = converter.convert(v)
71
+ # end
72
+ # end
73
+
74
+ # hash
75
+ # end
76
+
77
+ protected
78
+
79
+ # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
80
+ def limit_execution_for_only_or_except(options, option_name, key)
81
+ if options[option_name].is_a?(Hash)
82
+ if options[option_name].has_key?(:except)
83
+ return true if Array(options[option_name][:except]).include?(key)
84
+ elsif options[option_name].has_key?(:only)
85
+ return true unless Array(options[option_name][:only]).include?(key)
86
+ end
87
+ end
88
+ false
89
+ end
90
+ end
91
+ end
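Reviewer note: the `:only` / `:except` gating and the numeric conversion in the file above are easier to follow in isolation. The sketch below is a simplified stand-in under stated assumptions: `skip_conversion?`, `to_numeric` and `convert_row` are hypothetical names, and the patterns use `\A...\z` anchors (whole-string match) where the gem's code uses `^...$`.

```ruby
# Hypothetical helper: true when conversion should be skipped for this key.
def skip_conversion?(setting, key)
  return false unless setting.is_a?(Hash)

  if setting.key?(:except)
    Array(setting[:except]).include?(key)
  elsif setting.key?(:only)
    !Array(setting[:only]).include?(key)
  else
    false
  end
end

# Hypothetical helper: turn numeric-looking strings into Float or Integer.
def to_numeric(value)
  case value
  when /\A[+-]?\d+\.\d+\z/ then value.to_f
  when /\A[+-]?\d+\z/      then value.to_i
  else value # non-numeric strings and nil pass through unchanged
  end
end

# Build a new hash rather than mutating the input, as the new code does.
def convert_row(row, setting)
  row.each_with_object({}) do |(key, value), out|
    out[key] = skip_conversion?(setting, key) ? value : to_numeric(value)
  end
end

row = { id: '42', zip: '01234', price: '9.50', note: 'n/a' }

converted = convert_row(row, { except: :zip }) # keep the ZIP code a String
only_id   = convert_row(row, { only: [:id] })  # convert nothing but :id

puts converted[:zip].class # String
puts only_id[:price].class # String
```

Excluding a key such as `:zip` matters because `'01234'.to_i` would drop the leading zero.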
data/lib/smarter_csv/header_transformations.rb ADDED
@@ -0,0 +1,63 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ # transform the headers that were in the file:
6
+ def header_transformations(header_array, options)
7
+ header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
8
+ header_array.map!{|x| x.strip} if options[:strip_whitespace]
9
+
10
+ unless options[:keep_original_headers]
11
+ header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
12
+ header_array.map!{|x| x.downcase} if options[:downcase_header]
13
+ end
14
+
15
+ # detect duplicate headers and disambiguate
16
+ header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
17
+ # symbolize headers
18
+ header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
19
+ # doesn't make sense to re-map when we have user_provided_headers
20
+ header_array = remap_headers(header_array, options) if options[:key_mapping]
21
+
22
+ header_array
23
+ end
24
+
25
+ def disambiguate_headers(headers, options)
26
+ counts = Hash.new(0)
27
+ headers.map do |header|
28
+ counts[header] += 1
29
+ counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
30
+ end
31
+ end
32
+
33
+ # do some key mapping on the keys in the file header
34
+ # if you want to completely delete a key, then map it to nil or to ''
35
+ def remap_headers(headers, options)
36
+ key_mapping = options[:key_mapping]
37
+ if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
38
+ raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
39
+ end
40
+
41
+ key_mapping = options[:key_mapping]
42
+ # if silence_missing_keys are not set, raise error if missing header
43
+ missing_keys = key_mapping.keys - headers
44
+ # if the user passes a list of speciffic mapped keys that are optional
45
+ missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
46
+
47
+ unless missing_keys.empty? || options[:silence_missing_keys] == true
48
+ raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
49
+ end
50
+
51
+ headers.map! do |header|
52
+ if key_mapping.has_key?(header)
53
+ key_mapping[header].nil? ? nil : key_mapping[header]
54
+ elsif options[:remove_unmapped_keys]
55
+ nil
56
+ else
57
+ header
58
+ end
59
+ end
60
+ headers
61
+ end
62
+ end
63
+ end
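Reviewer note: the header pipeline in the file above runs in a fixed order: strip quote characters, strip whitespace, replace whitespace and dashes with underscores, downcase, then symbolize. The sketch below walks one header row through those steps. `normalize_headers` and its keyword arguments are assumptions made for illustration, and key mapping and duplicate handling are left out.

```ruby
# Illustrative only: a condensed version of the header normalization order.
def normalize_headers(raw_headers, quote_char: '"', downcase: true, symbolize: true)
  raw_headers.map do |header|
    h = header.gsub(quote_char, '').strip # drop quote chars, then outer whitespace
    h = h.gsub(/\s+|-+/, '_')             # whitespace runs and dash runs become '_'
    h = h.downcase if downcase
    symbolize ? h.to_sym : h
  end
end

normalized = normalize_headers(['"First Name"', ' E-Mail ', 'ZIP'])
as_strings = normalize_headers(['"First Name"'], symbolize: false)

puts normalized.inspect # [:first_name, :e_mail, :zip]
puts as_strings.inspect # ["first_name"]
```

According to the changelog in this diff, none of these steps apply to `user_provided_headers`, which are taken as given.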
data/lib/smarter_csv/header_validations.rb ADDED
@@ -0,0 +1,34 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ def header_validations(headers, options)
6
+ check_duplicate_headers(headers, options)
7
+ check_required_headers(headers, options)
8
+ end
9
+
10
+ def check_duplicate_headers(headers, _options)
11
+ header_counts = Hash.new(0)
12
+ headers.each { |header| header_counts[header] += 1 unless header.nil? }
13
+
14
+ duplicates = header_counts.select { |_, count| count > 1 }
15
+
16
+ unless duplicates.empty?
17
+ raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
18
+ end
19
+ end
20
+
21
+ require 'set'
22
+
23
+ def check_required_headers(headers, options)
24
+ if options[:required_keys] && options[:required_keys].is_a?(Array)
25
+ headers_set = headers.to_set
26
+ missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }
27
+
28
+ unless missing_keys.empty?
29
+ raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
30
+ end
31
+ end
32
+ end
33
+ end
34
+ end
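Reviewer note: the two checks in the file above replace a quadratic duplicate scan with a counting hash and an `Array#include?` loop with a `Set` lookup. A minimal sketch of both follows. `validate_headers` is a hypothetical name, and `ArgumentError` stands in for the gem's own `SmarterCSV::DuplicateHeaders` and `SmarterCSV::MissingKeys` classes.

```ruby
require 'set'

# Sketch: one pass to count headers, one Set for required-key lookups.
def validate_headers(headers, required_keys: [])
  counts = headers.compact.each_with_object(Hash.new(0)) { |h, acc| acc[h] += 1 }
  duplicates = counts.select { |_, count| count > 1 }
  raise ArgumentError, "duplicate headers: #{duplicates.keys.join(',')}" unless duplicates.empty?

  present = headers.to_set
  missing = required_keys.reject { |key| present.include?(key) }
  raise ArgumentError, "missing keys: #{missing.join(',')}" unless missing.empty?

  true
end

ok = validate_headers(%i[id name email], required_keys: %i[id email])

duplicate_message =
  begin
    validate_headers(%i[id name name])
  rescue ArgumentError => e
    e.message
  end

missing_message =
  begin
    validate_headers(%i[id name], required_keys: %i[id email])
  rescue ArgumentError => e
    e.message
  end

puts duplicate_message # duplicate headers: name
puts missing_message   # missing keys: email
```

`nil` headers (columns mapped away) are ignored by the duplicate count, matching the `unless header.nil?` guard above.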
data/lib/smarter_csv/headers.rb CHANGED
@@ -14,7 +14,11 @@ module SmarterCSV
14
14
  # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
15
15
  header_line = @raw_header = readline_with_counts(filehandle, options)
16
16
  header_line = preprocess_header_line(header_line, options)
17
- file_header_array, file_header_size = parse_and_modify_headers(header_line, options)
17
+
18
+ file_header_array, file_header_size = parse(header_line, options)
19
+
20
+ file_header_array = header_transformations(file_header_array, options)
21
+
18
22
  else
19
23
  unless options[:user_provided_headers]
20
24
  raise SmarterCSV::IncorrectOption, "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers"
@@ -36,22 +40,12 @@ module SmarterCSV
36
40
  # we could print out the mapping of file_header_array to header_array here
37
41
  end
38
42
  end
43
+
39
44
  header_array = user_header_array
40
45
  else
41
46
  header_array = file_header_array
42
47
  end
43
48
 
44
- # detect duplicate headers and disambiguate
45
- header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
46
-
47
- # symbolize headers
48
- header_array.map!{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
49
-
50
- # wouldn't make sense to re-map user provided headers
51
- header_array = remap_headers(header_array, options) if options[:key_mapping] && !options[:user_provided_headers]
52
-
53
- validate_and_deprecate_headers(header_array, options)
54
-
55
49
  [header_array, header_array.size]
56
50
  end
57
51
 
@@ -65,92 +59,6 @@ module SmarterCSV
65
59
  header_line
66
60
  end
67
61
 
68
- def parse_and_modify_headers(header_line, options)
69
- file_header_array, file_header_size = parse(header_line, options)
70
-
71
- file_header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
72
- file_header_array.map!{|x| x.strip} if options[:strip_whitespace]
73
-
74
- unless options[:keep_original_headers]
75
- file_header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
76
- file_header_array.map!{|x| x.downcase} if options[:downcase_header]
77
- end
78
- [file_header_array, file_header_size]
79
- end
80
-
81
- def disambiguate_headers(headers, options)
82
- counts = Hash.new(0)
83
- headers.map do |header|
84
- counts[header] += 1
85
- counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
86
- end
87
- end
88
-
89
- # do some key mapping on the keys in the file header
90
- # if you want to completely delete a key, then map it to nil or to ''
91
- def remap_headers(headers, options)
92
- key_mapping = options[:key_mapping]
93
- if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
94
- raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
95
- end
96
-
97
- key_mapping = options[:key_mapping]
98
- # if silence_missing_keys are not set, raise error if missing header
99
- missing_keys = key_mapping.keys - headers
100
- # if the user passes a list of speciffic mapped keys that are optional
101
- missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
102
-
103
- unless missing_keys.empty? || options[:silence_missing_keys] == true
104
- raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
105
- end
106
-
107
- headers.map! do |header|
108
- if key_mapping.has_key?(header)
109
- key_mapping[header].nil? ? nil : key_mapping[header]
110
- elsif options[:remove_unmapped_keys]
111
- nil
112
- else
113
- header
114
- end
115
- end
116
- headers
117
- end
118
-
119
- # header_validations
120
- def validate_and_deprecate_headers(headers, options)
121
- duplicate_headers = []
122
- headers.compact.each do |k|
123
- duplicate_headers << k if headers.select{|x| x == k}.size > 1
124
- end
125
-
126
- unless options[:user_provided_headers] || duplicate_headers.empty?
127
- raise SmarterCSV::DuplicateHeaders, "ERROR: duplicate headers: #{duplicate_headers.join(',')}"
128
- end
129
-
130
- # deprecate required_headers
131
- unless options[:required_headers].nil?
132
- puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
133
- if options[:required_keys].nil?
134
- options[:required_keys] = options[:required_headers]
135
- options[:required_headers] = nil
136
- end
137
- end
138
-
139
- if options[:required_keys] && options[:required_keys].is_a?(Array)
140
- missing_keys = []
141
- options[:required_keys].each do |k|
142
- missing_keys << k unless headers.include?(k)
143
- end
144
- raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}" unless missing_keys.empty?
145
- end
146
- end
147
-
148
- def enforce_utf8_encoding(header, options)
149
- return header unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
150
-
151
- header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
152
- end
153
-
154
62
  def remove_comments_from_header(header, options)
155
63
  return header unless options[:comment_regexp]
156
64
 
data/lib/smarter_csv/options_processing.rb CHANGED
@@ -9,7 +9,7 @@ module SmarterCSV
9
9
  comment_regexp: nil, # was: /\A#/,
10
10
  convert_values_to_numeric: true,
11
11
  downcase_header: true,
12
- duplicate_header_suffix: nil,
12
+ duplicate_header_suffix: '', # was: nil,
13
13
  file_encoding: 'utf-8',
14
14
  force_simple_split: false,
15
15
  force_utf8: false,
@@ -62,6 +62,15 @@ module SmarterCSV
62
62
  private
63
63
 
64
64
  def validate_options!(options)
65
+ # deprecate required_headers
66
+ unless options[:required_headers].nil?
67
+ puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
68
+ if options[:required_keys].nil?
69
+ options[:required_keys] = options[:required_headers]
70
+ options[:required_headers] = nil
71
+ end
72
+ end
73
+
65
74
  keys = options.keys
66
75
  errors = []
67
76
  errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
data/lib/smarter_csv/smarter_csv.rb CHANGED
@@ -12,28 +12,34 @@ module SmarterCSV
12
12
 
13
13
  # first parameter: filename or input object which responds to readline method
14
14
  def SmarterCSV.process(input, given_options = {}, &block) # rubocop:disable Lint/UnusedMethodArgument
15
+ initialize_variables
16
+
15
17
  options = process_options(given_options)
16
18
 
17
- initialize_variables
19
+ @enforce_utf8 = options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
20
+ @verbose = options[:verbose]
18
21
 
19
- has_rails = !!defined?(Rails)
20
22
  begin
21
23
  fh = input.respond_to?(:readline) ? input : File.open(input, "r:#{options[:file_encoding]}")
22
24
 
25
+ if @enforce_utf8 && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
26
+ puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
27
+ end
28
+
23
29
  # auto-detect the row separator
24
30
  options[:row_sep] = guess_line_ending(fh, options) if options[:row_sep]&.to_sym == :auto
25
31
  # attempt to auto-detect column separator
26
32
  options[:col_sep] = guess_column_separator(fh, options) if options[:col_sep]&.to_sym == :auto
27
33
 
28
- if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
29
- puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
30
- end
31
-
32
34
  skip_lines(fh, options)
33
35
 
34
36
  @headers, header_size = process_headers(fh, options)
35
37
  @headerA = @headers # @headerA is deprecated, use @headers
36
38
 
39
+ puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
40
+
41
+ header_validations(@headers, options)
42
+
37
43
  # in case we use chunking.. we'll need to set it up..
38
44
  if options[:chunk_size].to_i > 0
39
45
  use_chunks = true
@@ -45,31 +51,42 @@ module SmarterCSV
45
51
  end
46
52
 
47
53
  # now on to processing all the rest of the lines in the CSV file:
54
+ # fh.each_line |line|
48
55
  until fh.eof? # we can't use fh.readlines() here, because this would read the whole file into memory at once, and eof => true
49
56
  line = readline_with_counts(fh, options)
50
57
 
51
58
  # replace invalid byte sequence in UTF-8 with question mark to avoid errors
52
- line = line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
59
+ line = enforce_utf8_encoding(line, options) if @enforce_utf8
53
60
 
54
- print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if options[:verbose]
61
+ print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if @verbose
55
62
 
56
63
  next if options[:comment_regexp] && line =~ options[:comment_regexp] # ignore all comment lines if there are any
57
64
 
58
65
  # cater for the quoted csv data containing the row separator carriage return character
59
66
  # in which case the row data will be split across multiple lines (see the sample content in spec/fixtures/carriage_returns_rn.csv)
60
67
  # by detecting the existence of an uneven number of quote characters
68
+ multiline = count_quote_chars(line, options[:quote_char]).odd?
61
69
 
62
- multiline = count_quote_chars(line, options[:quote_char]).odd? # should handle quote_char nil
63
- while count_quote_chars(line, options[:quote_char]).odd? # should handle quote_char nil
70
+ while multiline
64
71
  next_line = fh.readline(options[:row_sep])
65
- next_line = next_line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
72
+ next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
66
73
  line += next_line
67
74
  @file_line_count += 1
75
+
76
+ break if fh.eof? # Exit loop if end of file is reached
77
+
78
+ multiline = count_quote_chars(line, options[:quote_char]).odd?
68
79
  end
69
- print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count if options[:verbose] && multiline
80
+
81
+ # :nocov:
82
+ if multiline && @verbose
83
+ print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count
84
+ end
85
+ # :nocov:
70
86
 
71
87
  line.chomp!(options[:row_sep])
72
88
 
89
+ # --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
73
90
  dataA, _data_size = parse(line, options, header_size)
74
91
 
75
92
  dataA.map!{|x| x.strip} if options[:strip_whitespace]
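Reviewer note: the hunk above changes how a quoted field that spans several physical lines is stitched back into one record, and adds an exit at end of file so an unterminated quote cannot raise `EOFError`. A minimal sketch of that loop over a `StringIO` follows. `read_records` is a hypothetical helper, and it counts quotes with plain `String#count`, which (unlike the gem's `count_quote_chars`) ignores backslash escapes.

```ruby
require 'stringio'

# Sketch: keep appending physical lines while the quote count is odd.
def read_records(io, quote_char: '"', row_sep: "\n")
  records = []
  until io.eof?
    line = io.readline(row_sep)
    # an odd number of quotes means a quoted field is still open;
    # stop at end of input so an unterminated quote cannot raise EOFError
    line += io.readline(row_sep) while line.count(quote_char).odd? && !io.eof?
    records << line.chomp(row_sep)
  end
  records
end

csv = StringIO.new("id,comment\n1,\"first line\nsecond line\"\n2,plain\n")
records = read_records(csv)

unterminated = read_records(StringIO.new("a,\"never closed\n"))

puts records.size      # 3
puts unterminated.size # 1
```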
@@ -77,48 +94,25 @@ module SmarterCSV
77
94
  # if all values are blank, then ignore this line
78
95
  next if options[:remove_empty_hashes] && (dataA.empty? || blank?(dataA))
79
96
 
97
+ # --- HASH TRANSFORMATIONS ------------------------------------------------------------
80
98
  hash = @headers.zip(dataA).to_h
81
99
 
82
- # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
83
- hash.delete(nil)
84
- hash.delete('')
85
- hash.delete(:"")
86
-
87
- if options[:remove_empty_values] == true
88
- hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
89
- end
90
-
91
- hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
92
- hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
93
-
94
- if options[:convert_values_to_numeric]
95
- hash.each do |k, v|
96
- # deal with the :only / :except options to :convert_values_to_numeric
97
- next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
100
+ hash = hash_transformations(hash, options)
98
101
 
99
- # convert if it's a numeric value:
100
- case v
101
- when /^[+-]?\d+\.\d+$/
102
- hash[k] = v.to_f
103
- when /^[+-]?\d+$/
104
- hash[k] = v.to_i
105
- end
106
- end
107
- end
108
-
109
- if options[:value_converters]
110
- hash.each do |k, v|
111
- converter = options[:value_converters][k]
112
- next unless converter
113
-
114
- hash[k] = converter.convert(v)
115
- end
116
- end
102
+ # --- HASH VALIDATIONS ----------------------------------------------------------------
103
+ # will go here, and be able to:
104
+ # - validate correct format of the values for fields
105
+ # - required fields to be non-empty
106
+ # - ...
107
+ # -------------------------------------------------------------------------------------
117
108
 
118
109
  next if options[:remove_empty_hashes] && hash.empty?
119
110
 
111
+ puts "CSV Line #{@file_line_count}: #{pp(hash)}" if @verbose == '2' # very verbose setting
112
+ # optional adding of csv_line_number to the hash to help debugging
120
113
  hash[:csv_line_number] = @csv_line_count if options[:with_line_numbers]
121
114
 
115
+ # process the chunks or the resulting hash
122
116
  if use_chunks
123
117
  chunk << hash # append temp result to chunk
124
118
 
@@ -127,16 +121,13 @@ module SmarterCSV
127
121
  if block_given?
128
122
  yield chunk # do something with the hashes in the chunk in the block
129
123
  else
130
- @result << chunk # not sure yet, why anybody would want to do this without a block
124
+ @result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
131
125
  end
132
126
  @chunk_count += 1
133
- chunk = [] # initialize for next chunk of data
127
+ chunk.clear # re-initialize for next chunk of data
134
128
  else
135
-
136
- # the last chunk may contain partial data, which also needs to be returned (BUG / ISSUE-18)
137
-
129
+ # the last chunk may contain partial data, which is handled below
138
130
  end
139
-
140
131
  # while a chunk is being filled up we don't need to do anything else here
141
132
 
142
133
  else # no chunk handling
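Reviewer note: the switch from `chunk = []` to `chunk.clear` in the hunk above is what makes `chunk.dup` necessary. `clear` empties the same Array object that was just appended to `@result`, so without a copy every stored chunk would end up empty. The snippet below demonstrates the aliasing with plain arrays; the variable names are illustrative.

```ruby
aliased_result = []
copied_result  = []
chunk = [{ id: 1 }, { id: 2 }]

aliased_result << chunk     # stores a reference to the same Array object
copied_result  << chunk.dup # stores a shallow copy of the Array

chunk.clear                 # reuse the buffer for the next chunk

puts aliased_result.first.size # 0, the stored chunk was emptied too
puts copied_result.first.size  # 2, the copy is unaffected
```

The copy is shallow: the row hashes themselves are shared, which is fine here because they are not modified after being stored.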
@@ -149,15 +140,15 @@ module SmarterCSV
149
140
  end
150
141
 
151
142
  # print new line to retain last processing line message
152
- print "\n" if options[:verbose]
143
+ print "\n" if @verbose
153
144
 
154
- # last chunk:
145
+ # handling of last chunk:
155
146
  if !chunk.nil? && chunk.size > 0
156
147
  # do something with the chunk
157
148
  if block_given?
158
149
  yield chunk # do something with the hashes in the chunk in the block
159
150
  else
160
- @result << chunk # not sure yet, why anybody would want to do this without a block
151
+ @result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
161
152
  end
162
153
  @chunk_count += 1
163
154
  # chunk = [] # initialize for next chunk of data
@@ -174,16 +165,22 @@ module SmarterCSV
174
165
  end
175
166
 
176
167
  class << self
177
- # * the `scan` method iterates through the string and finds all occurrences of the pattern
178
- # * The reqular expression:
179
- # - (?<!\\) : Negative lookbehind to ensure the quote character is not preceded by an unescaped backslash.
180
- # - (?:\\\\)* : Non-capturing group for an even number of backslashes (escaped backslashes).
181
- # This allows for any number of escaped backslashes before the quote character.
182
- # - #{Regexp.escape(quote_char)} : Dynamically inserts the quote_char into the regex,
183
- # ensuring it's properly escaped for use in the regex.
184
- #
185
168
  def count_quote_chars(line, quote_char)
186
- line.scan(/(?<!\\)(?:\\\\)*#{Regexp.escape(quote_char)}/).count
169
+ return 0 if line.nil? || quote_char.nil? || quote_char.empty?
170
+
171
+ count = 0
172
+ escaped = false
173
+
174
+ line.each_char do |char|
175
+ if char == '\\' && !escaped
176
+ escaped = true
177
+ else
178
+ count += 1 if char == quote_char && !escaped
179
+ escaped = false
180
+ end
181
+ end
182
+
183
+ count
187
184
  end
188
185
 
189
186
  def has_acceleration?
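Reviewer note: the new `count_quote_chars` replaces a lookbehind regex with a single pass that tracks one `escaped` flag. The copy below uses the assumed name `count_unescaped_quotes` so that it runs on its own, and shows how the flag treats escaped quotes and escaped backslashes.

```ruby
# Same single-pass idea as the method above, under an illustrative name.
def count_unescaped_quotes(line, quote_char)
  return 0 if line.nil? || quote_char.nil? || quote_char.empty?

  count = 0
  escaped = false
  line.each_char do |char|
    if char == '\\' && !escaped
      escaped = true # the next character is escaped
    else
      count += 1 if char == quote_char && !escaped
      escaped = false
    end
  end
  count
end

plain          = count_unescaped_quotes('a,"b",c', '"')       # two real quotes
escaped_quote  = count_unescaped_quotes('a,"b \\" c",d', '"') # \" is not counted
escaped_bslash = count_unescaped_quotes('a,"b \\\\",c', '"')  # \\ then a real quote
open_field     = count_unescaped_quotes('a,"still open', '"') # odd: record continues

puts [plain, escaped_quote, escaped_bslash, open_field].inspect # [2, 2, 2, 1]
```

An odd result is what the processing loop treats as a record that continues on the next physical line.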
@@ -192,18 +189,6 @@ module SmarterCSV
192
189
 
193
190
  protected
194
191
 
195
- # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
196
- def limit_execution_for_only_or_except(options, option_name, key)
197
- if options[option_name].is_a?(Hash)
198
- if options[option_name].has_key?(:except)
199
- return true if Array(options[option_name][:except]).include?(key)
200
- elsif options[option_name].has_key?(:only)
201
- return true unless Array(options[option_name][:only]).include?(key)
202
- end
203
- end
204
- false
205
- end
206
-
207
192
  # SEE: https://github.com/rails/rails/blob/32015b6f369adc839c4f0955f2d9dce50c0b6123/activesupport/lib/active_support/core_ext/object/blank.rb#L121
208
193
  # and in the future we might also include UTF-8 space characters: https://www.compart.com/en/unicode/category/Zs
209
194
  BLANK_RE = /\A\s*\z/.freeze
@@ -211,33 +196,24 @@ module SmarterCSV
211
196
  def blank?(value)
212
197
  case value
213
198
  when String
214
- value.empty? || BLANK_RE.match?(value)
215
-
199
+ BLANK_RE.match?(value)
216
200
  when NilClass
217
201
  true
218
-
219
202
  when Array
220
- value.empty? || value.inject(true){|result, x| result && elem_blank?(x)}
221
-
203
+ value.all? { |elem| blank?(elem) }
222
204
  when Hash
223
- value.empty? || value.values.inject(true){|result, x| result && elem_blank?(x)}
224
-
205
+ value.values.all? { |elem| blank?(elem) } # Focus on values only
225
206
  else
226
207
  false
227
208
  end
228
209
  end
229
210
 
230
- def elem_blank?(value)
231
- case value
232
- when String
233
- value.empty? || BLANK_RE.match?(value)
211
+ private
234
212
 
235
- when NilClass
236
- true
213
+ def enforce_utf8_encoding(line, options)
214
+ # return line unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
237
215
 
238
- else
239
- false
240
- end
216
+ line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
241
217
  end
242
218
  end
243
219
  end
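Reviewer note: the rewritten `blank?` drops the explicit `empty?` checks because `all?` returns `true` for an empty collection and the blank pattern already matches the empty string. It also recurses into elements, where the removed `elem_blank?` only handled strings and `nil`. A standalone sketch under the assumed name `blank_value?`:

```ruby
# Sketch of the recursive blank check; the name is illustrative.
def blank_value?(value)
  case value
  when String   then /\A\s*\z/.match?(value) # '' and whitespace-only strings
  when NilClass then true
  when Array    then value.all? { |elem| blank_value?(elem) }        # [] is blank
  when Hash     then value.values.all? { |elem| blank_value?(elem) } # keys are ignored
  else false
  end
end

checks = {
  empty_string: blank_value?(''),
  whitespace:   blank_value?("  \t"),
  empty_array:  blank_value?([]),
  blank_array:  blank_value?(['', nil]),
  nested_array: blank_value?([[nil], '']),
  blank_hash:   blank_value?({ a: ' ' }),
  zero:         blank_value?(0),
  mixed_array:  blank_value?(['', 'x'])
}

checks.each { |name, result| puts "#{name}: #{result}" }
```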
data/lib/smarter_csv/variables.rb CHANGED
@@ -2,9 +2,10 @@
2
2
 
3
3
  module SmarterCSV
4
4
  class << self
5
- attr_reader :csv_line_count, :chunk_count, :errors, :file_line_count, :headers, :raw_header, :result, :warnings
5
+ attr_reader :has_rails, :csv_line_count, :chunk_count, :errors, :file_line_count, :headers, :raw_header, :result, :warnings
6
6
 
7
7
  def initialize_variables
8
+ @has_rails = !!defined?(Rails)
8
9
  @csv_line_count = 0
9
10
  @chunk_count = 0
10
11
  @errors = {}
@@ -14,13 +15,16 @@ module SmarterCSV
14
15
  @raw_header = nil # header as it appears in the file
15
16
  @result = []
16
17
  @warnings = {}
18
+ @enforce_utf8 = false # only set to true if needed (after options parsing)
17
19
  end
18
20
 
19
21
  # :nocov:
22
+ # rubocop:disable Naming/MethodName
20
23
  def headerA
21
24
  warn "Deprecation Warning: 'headerA' will be removed in future versions. Use 'headers'"
22
25
  @headerA
23
26
  end
27
+ # rubocop:enable Naming/MethodName
24
28
  # :nocov:
25
29
  end
26
30
  end
data/lib/smarter_csv/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SmarterCSV
4
- VERSION = "1.9.3"
4
+ VERSION = "1.10.0"
5
5
  end
data/lib/smarter_csv.rb CHANGED
@@ -5,13 +5,21 @@ require "smarter_csv/file_io"
5
5
  require "smarter_csv/options_processing"
6
6
  require "smarter_csv/auto_detection"
7
7
  require "smarter_csv/variables"
8
+ require 'smarter_csv/header_transformations'
9
+ require 'smarter_csv/header_validations'
8
10
  require "smarter_csv/headers"
11
+ require "smarter_csv/hash_transformations"
9
12
  require "smarter_csv/parse"
10
13
 
14
+ # load the C-extension:
11
15
  case RUBY_ENGINE
12
16
  when 'ruby'
13
17
  begin
14
18
  if `uname -s`.chomp == 'Darwin'
19
+ #
20
+ # Please report if you see cases where the rake-compiler is building x86_64 code on arm64 cpus:
21
+ # https://github.com/rake-compiler/rake-compiler/issues/231
22
+ #
15
23
  require 'smarter_csv/smarter_csv.bundle'
16
24
  else
17
25
  # :nocov:
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: smarter_csv
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.9.3
4
+ version: 1.10.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tilo Sloboda
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2023-12-16 00:00:00.000000000 Z
11
+ date: 2023-12-31 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: awesome_print
@@ -118,6 +118,9 @@ files:
118
118
  - lib/smarter_csv.rb
119
119
  - lib/smarter_csv/auto_detection.rb
120
120
  - lib/smarter_csv/file_io.rb
121
+ - lib/smarter_csv/hash_transformations.rb
122
+ - lib/smarter_csv/header_transformations.rb
123
+ - lib/smarter_csv/header_validations.rb
121
124
  - lib/smarter_csv/headers.rb
122
125
  - lib/smarter_csv/options_processing.rb
123
126
  - lib/smarter_csv/parse.rb
@@ -148,7 +151,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
148
151
  - !ruby/object:Gem::Version
149
152
  version: '0'
150
153
  requirements: []
151
- rubygems_version: 3.2.3
154
+ rubygems_version: 3.5.3
152
155
  signing_key:
153
156
  specification_version: 4
154
157
  summary: Ruby Gem for smarter importing of CSV Files (and CSV-like files), with lots