smarter_csv 1.9.3 → 1.10.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 5f35e10ff8bc0e79ff1ed9bea8e413f746f51128a6f6a9622d246873fd588366
4
- data.tar.gz: 5cc30cf6f4422dd16f3019915bc5305a92aaaa4b99665e4c4c525d3bbf489cfd
3
+ metadata.gz: f1d0b58acf0135b621e3182470674230ef73b48c829810e74fffa975fc318cf5
4
+ data.tar.gz: ee404c5c485748d35cda36b8d249cb6813a3f80005182fe8c05feac1694aba57
5
5
  SHA512:
6
- metadata.gz: 057472a73ae0be95318b16428b276ecffba384a68479af715c5ec3ca7601405ca73928b0fbf245c9b3f46fd33b82a8c6d9c9e6330ddb0305b83ae23f58173df0
7
- data.tar.gz: 319b12a53875c1963eed6d27aa67850135d33a5b3a9f70607e6d812906733b711ade6c3ee6e789d78c2e159004a879e59e700145224134745b16d279039ac38a
6
+ metadata.gz: 4fee097fe2237f863510100155062da6815237260da5b15189f104f54596f7d5ff0479deb80596544e0bb1b9ba7b78126d2251798721e8d2f91e06b430950cd6
7
+ data.tar.gz: c30562965452ef296b5e5aaf2a9a12887aa42d8e8396780b73b34f99a2386d232bf020578618fcbd65186fc864518c81a3e7555cae9b00a005322f3599e18c5a
data/CHANGELOG.md CHANGED
@@ -1,6 +1,24 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
+ ## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡
5
+
6
+ * BREAKING CHANGES:
7
+
8
+ Changed behavior:
9
+ + when `user_provided_headers` are provided:
10
+ * if they are not unique, an exception will now be raised
11
+ * they are taken "as is", no header transformations can be applied
12
+ * when they are given as strings or as symbols, it is assumed that this is the desired format
13
+ * the value of the `strings_as_keys` options will be ignored
14
+
15
+ + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
16
+ * this allows automatic disambiguation when processing of CSV files with duplicate headers, by appending a number
17
+ * explicitly set this option to `nil` to get the behavior from previous versions.
18
+
19
+ * performance and memory improvements
20
+ * code refactor
21
+
4
22
  ## 1.9.3 (2023-12-16)
5
23
  * raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
6
24
  * code refactor / no functional changes
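Note on the 1.10.0 changelog entry: the `duplicate_header_suffix` change can be illustrated with a minimal standalone sketch. It does not call the gem. `effective_headers` and `DuplicateHeadersSketch` are hypothetical names that mirror the disambiguation and duplicate-check logic added elsewhere in this diff, assuming a `nil` suffix skips disambiguation and leaves duplicates to be rejected.

```ruby
# Hypothetical helper and error class, not part of SmarterCSV.
class DuplicateHeadersSketch < StandardError; end

def effective_headers(headers, suffix)
  # Any non-nil suffix (including the empty string) enables disambiguation.
  unless suffix.nil?
    counts = Hash.new(0)
    headers = headers.map do |header|
      counts[header] += 1
      counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
    end
  end

  # Whatever is still duplicated at this point is rejected.
  duplicates = headers.group_by(&:itself).select { |_, group| group.size > 1 }.keys
  raise DuplicateHeadersSketch, "Duplicate headers: #{duplicates.inspect}" unless duplicates.empty?

  headers
end

new_default = effective_headers(%w[name email name name], '')
with_suffix = effective_headers(%w[name email name], '_')
old_default_error =
  begin
    effective_headers(%w[name email name], nil)
    nil
  rescue DuplicateHeadersSketch => e
    e
  end
```

With the new default of `''`, the repeated `name` columns become `name2` and `name3`; with `nil`, the same input raises instead.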
data/README.md CHANGED
@@ -2,15 +2,33 @@
2
2
  # SmarterCSV
3
3
 
4
4
  [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
5
-
5
+
6
+
7
+ #### LATEST CHANGES
8
+
9
+ * Version 1.10.0 has BREAKING CHANGES:
10
+
11
+ Changed behavior:
12
+ + when `user_provided_headers` are provided:
13
+ * if they are not unique, an exception will now be raised
14
+ * they are taken "as is", no header transformations can be applied
15
+ * when they are given as strings or as symbols, it is assumed that this is the desired format
16
+ * the value of the `strings_as_keys` options will be ignored
17
+
18
+ + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
19
+ * this allows automatic disambiguation when processing of CSV files with duplicate headers, by appending a number
20
+ * explicitly set this option to `nil` to get the behavior from previous versions.
21
+
6
22
  #### Development Branches
7
23
 
8
24
  * default branch is `main` for 1.x development
9
- * 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
25
+
26
+ * 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
27
+ - This is an EXPERIMENTAL branch - DO NOT USE in production
10
28
 
11
- #### Work towards Future Version 2.0
29
+ #### Work towards Future Version 2.x
12
30
 
13
- * Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
31
+ * Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
14
32
  Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
15
33
 
16
34
  ---------------
@@ -84,6 +102,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
84
102
  00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
85
103
  ```
86
104
 
105
+ ### Articles
106
+ * [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
107
+ * [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
108
+
87
109
  ### Examples
88
110
 
89
111
  Here are some examples to demonstrate the versatility of SmarterCSV.
@@ -243,8 +265,6 @@ NOTE: If you use `key_mappings` and `value_converters`, make sure that the value
243
265
  data[0][:price].class
244
266
  => Float
245
267
  ```
246
- ## Parallel Processing
247
- [Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
248
268
 
249
269
  ## Documentation
250
270
 
@@ -280,7 +300,8 @@ The options and the block are optional.
280
300
  | :headers_in_file | true | Whether or not the file contains headers as the first line. |
281
301
  | | | Important if the file does not contain headers, |
282
302
  | | | otherwise you would lose the first line of data. |
283
- | :duplicate_header_suffix | nil | If set, adds numbers to duplicated headers and separates them by the given suffix |
303
+ | :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
304
+ | | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
284
305
  | :user_provided_headers | nil | *careful with that axe!* |
285
306
  | | | user provided Array of header strings or symbols, to define |
286
307
  | | | what headers should be used, overriding any in-file headers. |
@@ -0,0 +1,91 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ def hash_transformations(hash, options)
6
+ # there may be unmapped keys, or keys purposedly mapped to nil or an empty key..
7
+ # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
8
+ remove_empty_values = options[:remove_empty_values] == true
9
+ remove_zero_values = options[:remove_zero_values]
10
+ remove_values_matching = options[:remove_values_matching]
11
+ convert_to_numeric = options[:convert_values_to_numeric]
12
+ value_converters = options[:value_converters]
13
+
14
+ hash.each_with_object({}) do |(k, v), new_hash|
15
+ next if k.nil? || k == '' || k == :""
16
+ next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
17
+ next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
18
+ next if remove_values_matching && v =~ remove_values_matching
19
+
20
+ # deal with the :only / :except options to :convert_values_to_numeric
21
+ if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
22
+ if v =~ /^[+-]?\d+\.\d+$/
23
+ v = v.to_f
24
+ elsif v =~ /^[+-]?\d+$/
25
+ v = v.to_i
26
+ end
27
+ end
28
+
29
+ converter = value_converters[k] if value_converters
30
+ v = converter.convert(v) if converter
31
+
32
+ new_hash[k] = v
33
+ end
34
+ end
35
+
36
+ # def hash_transformations(hash, options)
37
+ # # there may be unmapped keys, or keys purposedly mapped to nil or an empty key..
38
+ # # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
39
+ # hash.delete(nil)
40
+ # hash.delete('')
41
+ # hash.delete(:"")
42
+
43
+ # if options[:remove_empty_values] == true
44
+ # hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
45
+ # end
46
+
47
+ # hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
48
+ # hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
49
+
50
+ # if options[:convert_values_to_numeric]
51
+ # hash.each do |k, v|
52
+ # # deal with the :only / :except options to :convert_values_to_numeric
53
+ # next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
54
+
55
+ # # convert if it's a numeric value:
56
+ # case v
57
+ # when /^[+-]?\d+\.\d+$/
58
+ # hash[k] = v.to_f
59
+ # when /^[+-]?\d+$/
60
+ # hash[k] = v.to_i
61
+ # end
62
+ # end
63
+ # end
64
+
65
+ # if options[:value_converters]
66
+ # hash.each do |k, v|
67
+ # converter = options[:value_converters][k]
68
+ # next unless converter
69
+
70
+ # hash[k] = converter.convert(v)
71
+ # end
72
+ # end
73
+
74
+ # hash
75
+ # end
76
+
77
+ protected
78
+
79
+ # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
80
+ def limit_execution_for_only_or_except(options, option_name, key)
81
+ if options[option_name].is_a?(Hash)
82
+ if options[option_name].has_key?(:except)
83
+ return true if Array(options[option_name][:except]).include?(key)
84
+ elsif options[option_name].has_key?(:only)
85
+ return true unless Array(options[option_name][:only]).include?(key)
86
+ end
87
+ end
88
+ false
89
+ end
90
+ end
91
+ end
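Note on the new `hash_transformations` file: the rewrite replaces several `delete_if` and `each` passes with one `each_with_object` pass that filters and converts each pair as it goes. The sketch below shows that single-pass shape in isolation. `transform_row` and its keyword arguments are hypothetical, cover only a subset of the options, and use a simplified blank check.

```ruby
# Hypothetical standalone sketch; `transform_row` is not a SmarterCSV method.
def transform_row(row, remove_empty_values: true, convert_values_to_numeric: true)
  row.each_with_object({}) do |(key, value), result|
    # Drop pairs whose key was mapped to nil or to an empty key.
    next if key.nil? || key == '' || key == :""
    # Simplified blank check (assumption): nil or whitespace-only strings.
    next if remove_empty_values && (value.nil? || value.to_s.strip.empty?)

    if convert_values_to_numeric && value.is_a?(String)
      if value =~ /^[+-]?\d+\.\d+$/
        value = value.to_f
      elsif value =~ /^[+-]?\d+$/
        value = value.to_i
      end
    end

    result[key] = value
  end
end

row = { id: '42', price: '9.50', name: 'widget', note: '  ', nil => 'dropped' }
converted = transform_row(row)
```

Building a new hash avoids mutating the input while iterating over it, at the cost of one extra hash allocation per row.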
@@ -0,0 +1,63 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ # transform the headers that were in the file:
6
+ def header_transformations(header_array, options)
7
+ header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
8
+ header_array.map!{|x| x.strip} if options[:strip_whitespace]
9
+
10
+ unless options[:keep_original_headers]
11
+ header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
12
+ header_array.map!{|x| x.downcase} if options[:downcase_header]
13
+ end
14
+
15
+ # detect duplicate headers and disambiguate
16
+ header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
17
+ # symbolize headers
18
+ header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
19
+ # doesn't make sense to re-map when we have user_provided_headers
20
+ header_array = remap_headers(header_array, options) if options[:key_mapping]
21
+
22
+ header_array
23
+ end
24
+
25
+ def disambiguate_headers(headers, options)
26
+ counts = Hash.new(0)
27
+ headers.map do |header|
28
+ counts[header] += 1
29
+ counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
30
+ end
31
+ end
32
+
33
+ # do some key mapping on the keys in the file header
34
+ # if you want to completely delete a key, then map it to nil or to ''
35
+ def remap_headers(headers, options)
36
+ key_mapping = options[:key_mapping]
37
+ if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
38
+ raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
39
+ end
40
+
41
+ key_mapping = options[:key_mapping]
42
+ # if silence_missing_keys are not set, raise error if missing header
43
+ missing_keys = key_mapping.keys - headers
44
+ # if the user passes a list of speciffic mapped keys that are optional
45
+ missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
46
+
47
+ unless missing_keys.empty? || options[:silence_missing_keys] == true
48
+ raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
49
+ end
50
+
51
+ headers.map! do |header|
52
+ if key_mapping.has_key?(header)
53
+ key_mapping[header].nil? ? nil : key_mapping[header]
54
+ elsif options[:remove_unmapped_keys]
55
+ nil
56
+ else
57
+ header
58
+ end
59
+ end
60
+ headers
61
+ end
62
+ end
63
+ end
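Note on the new `header_transformations` file: the normalization steps (strip quote characters, strip whitespace, replace runs of whitespace or dashes with an underscore, downcase, symbolize) can be tried in isolation with the sketch below. `normalize_headers` and its keyword arguments are hypothetical. It uses a plain string `gsub` for the quote character instead of the interpolated regular expression in the diff, and it omits disambiguation and key mapping.

```ruby
# Hypothetical standalone sketch; `normalize_headers` is not a SmarterCSV method.
def normalize_headers(headers, quote_char: '"', downcase: true, symbolize: true)
  headers.map do |header|
    name = header.gsub(quote_char, '').strip   # remove quotes, then outer whitespace
    name = name.gsub(/\s+|-+/, '_')            # whitespace or dash runs become one underscore
    name = name.downcase if downcase
    symbolize ? name.to_sym : name
  end
end

normalized = normalize_headers(['"First Name"', ' E-Mail ', 'ZIP  code'])
as_strings = normalize_headers(['"First Name"'], symbolize: false)
```

Per the changelog in this diff, none of these steps apply to `user_provided_headers` as of 1.10.0; they only affect headers read from the file.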
@@ -0,0 +1,34 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ def header_validations(headers, options)
6
+ check_duplicate_headers(headers, options)
7
+ check_required_headers(headers, options)
8
+ end
9
+
10
+ def check_duplicate_headers(headers, _options)
11
+ header_counts = Hash.new(0)
12
+ headers.each { |header| header_counts[header] += 1 unless header.nil? }
13
+
14
+ duplicates = header_counts.select { |_, count| count > 1 }
15
+
16
+ unless duplicates.empty?
17
+ raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
18
+ end
19
+ end
20
+
21
+ require 'set'
22
+
23
+ def check_required_headers(headers, options)
24
+ if options[:required_keys] && options[:required_keys].is_a?(Array)
25
+ headers_set = headers.to_set
26
+ missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }
27
+
28
+ unless missing_keys.empty?
29
+ raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
30
+ end
31
+ end
32
+ end
33
+ end
34
+ end
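Note on the new `header_validations` file: `check_required_headers` converts the headers to a `Set` once, so each required key is tested with a hash lookup instead of an `Array#include?` scan. A minimal sketch of that check follows. `check_required` and `MissingKeysSketch` are hypothetical stand-ins for the gem's method and error class.

```ruby
require 'set'

# Hypothetical error class and helper, not part of SmarterCSV.
class MissingKeysSketch < StandardError; end

def check_required(headers, required_keys)
  return unless required_keys.is_a?(Array)

  header_set = headers.to_set # built once, then O(1) membership checks
  missing = required_keys.reject { |key| header_set.include?(key) }
  raise MissingKeysSketch, "missing attributes: #{missing.join(',')}" unless missing.empty?
end

all_present = check_required(%i[id name email], %i[id email])
missing_message =
  begin
    check_required(%i[id name], %i[id email phone])
    nil
  rescue MissingKeysSketch => e
    e.message
  end
```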
@@ -14,7 +14,11 @@ module SmarterCSV
14
14
  # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
15
15
  header_line = @raw_header = readline_with_counts(filehandle, options)
16
16
  header_line = preprocess_header_line(header_line, options)
17
- file_header_array, file_header_size = parse_and_modify_headers(header_line, options)
17
+
18
+ file_header_array, file_header_size = parse(header_line, options)
19
+
20
+ file_header_array = header_transformations(file_header_array, options)
21
+
18
22
  else
19
23
  unless options[:user_provided_headers]
20
24
  raise SmarterCSV::IncorrectOption, "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers"
@@ -36,22 +40,12 @@ module SmarterCSV
36
40
  # we could print out the mapping of file_header_array to header_array here
37
41
  end
38
42
  end
43
+
39
44
  header_array = user_header_array
40
45
  else
41
46
  header_array = file_header_array
42
47
  end
43
48
 
44
- # detect duplicate headers and disambiguate
45
- header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
46
-
47
- # symbolize headers
48
- header_array.map!{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
49
-
50
- # wouldn't make sense to re-map user provided headers
51
- header_array = remap_headers(header_array, options) if options[:key_mapping] && !options[:user_provided_headers]
52
-
53
- validate_and_deprecate_headers(header_array, options)
54
-
55
49
  [header_array, header_array.size]
56
50
  end
57
51
 
@@ -65,92 +59,6 @@ module SmarterCSV
65
59
  header_line
66
60
  end
67
61
 
68
- def parse_and_modify_headers(header_line, options)
69
- file_header_array, file_header_size = parse(header_line, options)
70
-
71
- file_header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
72
- file_header_array.map!{|x| x.strip} if options[:strip_whitespace]
73
-
74
- unless options[:keep_original_headers]
75
- file_header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
76
- file_header_array.map!{|x| x.downcase} if options[:downcase_header]
77
- end
78
- [file_header_array, file_header_size]
79
- end
80
-
81
- def disambiguate_headers(headers, options)
82
- counts = Hash.new(0)
83
- headers.map do |header|
84
- counts[header] += 1
85
- counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
86
- end
87
- end
88
-
89
- # do some key mapping on the keys in the file header
90
- # if you want to completely delete a key, then map it to nil or to ''
91
- def remap_headers(headers, options)
92
- key_mapping = options[:key_mapping]
93
- if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
94
- raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
95
- end
96
-
97
- key_mapping = options[:key_mapping]
98
- # if silence_missing_keys are not set, raise error if missing header
99
- missing_keys = key_mapping.keys - headers
100
- # if the user passes a list of speciffic mapped keys that are optional
101
- missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
102
-
103
- unless missing_keys.empty? || options[:silence_missing_keys] == true
104
- raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
105
- end
106
-
107
- headers.map! do |header|
108
- if key_mapping.has_key?(header)
109
- key_mapping[header].nil? ? nil : key_mapping[header]
110
- elsif options[:remove_unmapped_keys]
111
- nil
112
- else
113
- header
114
- end
115
- end
116
- headers
117
- end
118
-
119
- # header_validations
120
- def validate_and_deprecate_headers(headers, options)
121
- duplicate_headers = []
122
- headers.compact.each do |k|
123
- duplicate_headers << k if headers.select{|x| x == k}.size > 1
124
- end
125
-
126
- unless options[:user_provided_headers] || duplicate_headers.empty?
127
- raise SmarterCSV::DuplicateHeaders, "ERROR: duplicate headers: #{duplicate_headers.join(',')}"
128
- end
129
-
130
- # deprecate required_headers
131
- unless options[:required_headers].nil?
132
- puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
133
- if options[:required_keys].nil?
134
- options[:required_keys] = options[:required_headers]
135
- options[:required_headers] = nil
136
- end
137
- end
138
-
139
- if options[:required_keys] && options[:required_keys].is_a?(Array)
140
- missing_keys = []
141
- options[:required_keys].each do |k|
142
- missing_keys << k unless headers.include?(k)
143
- end
144
- raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}" unless missing_keys.empty?
145
- end
146
- end
147
-
148
- def enforce_utf8_encoding(header, options)
149
- return header unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
150
-
151
- header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
152
- end
153
-
154
62
  def remove_comments_from_header(header, options)
155
63
  return header unless options[:comment_regexp]
156
64
 
@@ -9,7 +9,7 @@ module SmarterCSV
9
9
  comment_regexp: nil, # was: /\A#/,
10
10
  convert_values_to_numeric: true,
11
11
  downcase_header: true,
12
- duplicate_header_suffix: nil,
12
+ duplicate_header_suffix: '', # was: nil,
13
13
  file_encoding: 'utf-8',
14
14
  force_simple_split: false,
15
15
  force_utf8: false,
@@ -62,6 +62,15 @@ module SmarterCSV
62
62
  private
63
63
 
64
64
  def validate_options!(options)
65
+ # deprecate required_headers
66
+ unless options[:required_headers].nil?
67
+ puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
68
+ if options[:required_keys].nil?
69
+ options[:required_keys] = options[:required_headers]
70
+ options[:required_headers] = nil
71
+ end
72
+ end
73
+
65
74
  keys = options.keys
66
75
  errors = []
67
76
  errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
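Note on the hunk above: the `required_headers` deprecation moved out of header validation and into option validation, so the alias is resolved once before any header is read. The sketch below shows the aliasing on a plain hash. `migrate_required_headers` is a hypothetical name, and it uses `warn` (stderr) where the diff uses `puts`.

```ruby
# Hypothetical standalone sketch of the option aliasing shown in the diff.
def migrate_required_headers(options)
  return options if options[:required_headers].nil?

  warn "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
  # Only copy the old value when the new option was not given explicitly.
  if options[:required_keys].nil?
    options[:required_keys] = options[:required_headers]
    options[:required_headers] = nil
  end
  options
end

migrated = migrate_required_headers({ required_headers: [:id] })
both_set = migrate_required_headers({ required_headers: [:id], required_keys: [:email] })
```

When both options are present, `required_keys` wins and `required_headers` is left untouched.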
@@ -12,28 +12,34 @@ module SmarterCSV
12
12
 
13
13
  # first parameter: filename or input object which responds to readline method
14
14
  def SmarterCSV.process(input, given_options = {}, &block) # rubocop:disable Lint/UnusedMethodArgument
15
+ initialize_variables
16
+
15
17
  options = process_options(given_options)
16
18
 
17
- initialize_variables
19
+ @enforce_utf8 = options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
20
+ @verbose = options[:verbose]
18
21
 
19
- has_rails = !!defined?(Rails)
20
22
  begin
21
23
  fh = input.respond_to?(:readline) ? input : File.open(input, "r:#{options[:file_encoding]}")
22
24
 
25
+ if @enforce_utf8 && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
26
+ puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
27
+ end
28
+
23
29
  # auto-detect the row separator
24
30
  options[:row_sep] = guess_line_ending(fh, options) if options[:row_sep]&.to_sym == :auto
25
31
  # attempt to auto-detect column separator
26
32
  options[:col_sep] = guess_column_separator(fh, options) if options[:col_sep]&.to_sym == :auto
27
33
 
28
- if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
29
- puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
30
- end
31
-
32
34
  skip_lines(fh, options)
33
35
 
34
36
  @headers, header_size = process_headers(fh, options)
35
37
  @headerA = @headers # @headerA is deprecated, use @headers
36
38
 
39
+ puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
40
+
41
+ header_validations(@headers, options)
42
+
37
43
  # in case we use chunking.. we'll need to set it up..
38
44
  if options[:chunk_size].to_i > 0
39
45
  use_chunks = true
@@ -45,31 +51,42 @@ module SmarterCSV
45
51
  end
46
52
 
47
53
  # now on to processing all the rest of the lines in the CSV file:
54
+ # fh.each_line |line|
48
55
  until fh.eof? # we can't use fh.readlines() here, because this would read the whole file into memory at once, and eof => true
49
56
  line = readline_with_counts(fh, options)
50
57
 
51
58
  # replace invalid byte sequence in UTF-8 with question mark to avoid errors
52
- line = line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
59
+ line = enforce_utf8_encoding(line, options) if @enforce_utf8
53
60
 
54
- print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if options[:verbose]
61
+ print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if @verbose
55
62
 
56
63
  next if options[:comment_regexp] && line =~ options[:comment_regexp] # ignore all comment lines if there are any
57
64
 
58
65
  # cater for the quoted csv data containing the row separator carriage return character
59
66
  # in which case the row data will be split across multiple lines (see the sample content in spec/fixtures/carriage_returns_rn.csv)
60
67
  # by detecting the existence of an uneven number of quote characters
68
+ multiline = count_quote_chars(line, options[:quote_char]).odd?
61
69
 
62
- multiline = count_quote_chars(line, options[:quote_char]).odd? # should handle quote_char nil
63
- while count_quote_chars(line, options[:quote_char]).odd? # should handle quote_char nil
70
+ while multiline
64
71
  next_line = fh.readline(options[:row_sep])
65
- next_line = next_line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
72
+ next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
66
73
  line += next_line
67
74
  @file_line_count += 1
75
+
76
+ break if fh.eof? # Exit loop if end of file is reached
77
+
78
+ multiline = count_quote_chars(line, options[:quote_char]).odd?
68
79
  end
69
- print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count if options[:verbose] && multiline
80
+
81
+ # :nocov:
82
+ if multiline && @verbose
83
+ print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count
84
+ end
85
+ # :nocov:
70
86
 
71
87
  line.chomp!(options[:row_sep])
72
88
 
89
+ # --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
73
90
  dataA, _data_size = parse(line, options, header_size)
74
91
 
75
92
  dataA.map!{|x| x.strip} if options[:strip_whitespace]
@@ -77,48 +94,25 @@ module SmarterCSV
77
94
  # if all values are blank, then ignore this line
78
95
  next if options[:remove_empty_hashes] && (dataA.empty? || blank?(dataA))
79
96
 
97
+ # --- HASH TRANSFORMATIONS ------------------------------------------------------------
80
98
  hash = @headers.zip(dataA).to_h
81
99
 
82
- # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
83
- hash.delete(nil)
84
- hash.delete('')
85
- hash.delete(:"")
86
-
87
- if options[:remove_empty_values] == true
88
- hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
89
- end
90
-
91
- hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
92
- hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
93
-
94
- if options[:convert_values_to_numeric]
95
- hash.each do |k, v|
96
- # deal with the :only / :except options to :convert_values_to_numeric
97
- next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
100
+ hash = hash_transformations(hash, options)
98
101
 
99
- # convert if it's a numeric value:
100
- case v
101
- when /^[+-]?\d+\.\d+$/
102
- hash[k] = v.to_f
103
- when /^[+-]?\d+$/
104
- hash[k] = v.to_i
105
- end
106
- end
107
- end
108
-
109
- if options[:value_converters]
110
- hash.each do |k, v|
111
- converter = options[:value_converters][k]
112
- next unless converter
113
-
114
- hash[k] = converter.convert(v)
115
- end
116
- end
102
+ # --- HASH VALIDATIONS ----------------------------------------------------------------
103
+ # will go here, and be able to:
104
+ # - validate correct format of the values for fields
105
+ # - required fields to be non-empty
106
+ # - ...
107
+ # -------------------------------------------------------------------------------------
117
108
 
118
109
  next if options[:remove_empty_hashes] && hash.empty?
119
110
 
111
+ puts "CSV Line #{@file_line_count}: #{pp(hash)}" if @verbose == '2' # very verbose setting
112
+ # optional adding of csv_line_number to the hash to help debugging
120
113
  hash[:csv_line_number] = @csv_line_count if options[:with_line_numbers]
121
114
 
115
+ # process the chunks or the resulting hash
122
116
  if use_chunks
123
117
  chunk << hash # append temp result to chunk
124
118
 
@@ -127,16 +121,13 @@ module SmarterCSV
127
121
  if block_given?
128
122
  yield chunk # do something with the hashes in the chunk in the block
129
123
  else
130
- @result << chunk # not sure yet, why anybody would want to do this without a block
124
+ @result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
131
125
  end
132
126
  @chunk_count += 1
133
- chunk = [] # initialize for next chunk of data
127
+ chunk.clear # re-initialize for next chunk of data
134
128
  else
135
-
136
- # the last chunk may contain partial data, which also needs to be returned (BUG / ISSUE-18)
137
-
129
+ # the last chunk may contain partial data, which is handled below
138
130
  end
139
-
140
131
  # while a chunk is being filled up we don't need to do anything else here
141
132
 
142
133
  else # no chunk handling
@@ -149,15 +140,15 @@ module SmarterCSV
149
140
  end
150
141
 
151
142
  # print new line to retain last processing line message
152
- print "\n" if options[:verbose]
143
+ print "\n" if @verbose
153
144
 
154
- # last chunk:
145
+ # handling of last chunk:
155
146
  if !chunk.nil? && chunk.size > 0
156
147
  # do something with the chunk
157
148
  if block_given?
158
149
  yield chunk # do something with the hashes in the chunk in the block
159
150
  else
160
- @result << chunk # not sure yet, why anybody would want to do this without a block
151
+ @result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
161
152
  end
162
153
  @chunk_count += 1
163
154
  # chunk = [] # initialize for next chunk of data
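Note on the `chunk.dup` / `chunk.clear` change above: once the same Array object is reused and emptied with `clear`, appending it to `@result` without `dup` would store references to an array that is later emptied. The standalone sketch below, using illustrative variable names, shows the difference.

```ruby
chunk = []
without_dup = []
with_dup = []

[[1, 2], [3, 4]].each do |rows|
  chunk.concat(rows)
  without_dup << chunk      # stores a reference to the same Array object
  with_dup << chunk.dup     # stores a shallow copy of the current contents
  chunk.clear               # empties the shared Array in place
end
```

After the loop, `without_dup` holds two references to the same empty array, while `with_dup` keeps both chunks. The copy is shallow, so the row hashes themselves are still shared.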
@@ -174,16 +165,22 @@ module SmarterCSV
174
165
  end
175
166
 
176
167
  class << self
177
- # * the `scan` method iterates through the string and finds all occurrences of the pattern
178
- # * The reqular expression:
179
- # - (?<!\\) : Negative lookbehind to ensure the quote character is not preceded by an unescaped backslash.
180
- # - (?:\\\\)* : Non-capturing group for an even number of backslashes (escaped backslashes).
181
- # This allows for any number of escaped backslashes before the quote character.
182
- # - #{Regexp.escape(quote_char)} : Dynamically inserts the quote_char into the regex,
183
- # ensuring it's properly escaped for use in the regex.
184
- #
185
168
  def count_quote_chars(line, quote_char)
186
- line.scan(/(?<!\\)(?:\\\\)*#{Regexp.escape(quote_char)}/).count
169
+ return 0 if line.nil? || quote_char.nil? || quote_char.empty?
170
+
171
+ count = 0
172
+ escaped = false
173
+
174
+ line.each_char do |char|
175
+ if char == '\\' && !escaped
176
+ escaped = true
177
+ else
178
+ count += 1 if char == quote_char && !escaped
179
+ escaped = false
180
+ end
181
+ end
182
+
183
+ count
187
184
  end
188
185
 
189
186
  def has_acceleration?
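Note on the `count_quote_chars` rewrite above: the regular-expression scan is replaced by a character loop that tracks whether the previous character was an unescaped backslash. An odd result means a quoted field is still open, so the row continues on the next line. The sketch below reproduces the loop under the hypothetical name `count_unescaped_quotes` so it can be run outside the gem.

```ruby
# Standalone copy of the loop from the diff, under a hypothetical name.
def count_unescaped_quotes(line, quote_char)
  return 0 if line.nil? || quote_char.nil? || quote_char.empty?

  count = 0
  escaped = false

  line.each_char do |char|
    if char == '\\' && !escaped
      escaped = true            # next character is escaped
    else
      count += 1 if char == quote_char && !escaped
      escaped = false
    end
  end

  count
end

complete     = count_unescaped_quotes('1,"hello",3', '"')
open_field   = count_unescaped_quotes('1,"hello', '"')
with_escapes = count_unescaped_quotes('1,"say \"hi\"",3', '"')
no_quote     = count_unescaped_quotes('1,2,3', nil)
```

Here `open_field` is odd, which is the condition the reader loop uses to keep appending lines.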
@@ -192,18 +189,6 @@ module SmarterCSV
192
189
 
193
190
  protected
194
191
 
195
- # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
196
- def limit_execution_for_only_or_except(options, option_name, key)
197
- if options[option_name].is_a?(Hash)
198
- if options[option_name].has_key?(:except)
199
- return true if Array(options[option_name][:except]).include?(key)
200
- elsif options[option_name].has_key?(:only)
201
- return true unless Array(options[option_name][:only]).include?(key)
202
- end
203
- end
204
- false
205
- end
206
-
207
192
  # SEE: https://github.com/rails/rails/blob/32015b6f369adc839c4f0955f2d9dce50c0b6123/activesupport/lib/active_support/core_ext/object/blank.rb#L121
208
193
  # and in the future we might also include UTF-8 space characters: https://www.compart.com/en/unicode/category/Zs
209
194
  BLANK_RE = /\A\s*\z/.freeze
@@ -211,33 +196,24 @@ module SmarterCSV
211
196
  def blank?(value)
212
197
  case value
213
198
  when String
214
- value.empty? || BLANK_RE.match?(value)
215
-
199
+ BLANK_RE.match?(value)
216
200
  when NilClass
217
201
  true
218
-
219
202
  when Array
220
- value.empty? || value.inject(true){|result, x| result && elem_blank?(x)}
221
-
203
+ value.all? { |elem| blank?(elem) }
222
204
  when Hash
223
- value.empty? || value.values.inject(true){|result, x| result && elem_blank?(x)}
224
-
205
+ value.values.all? { |elem| blank?(elem) } # Focus on values only
225
206
  else
226
207
  false
227
208
  end
228
209
  end
229
210
 
230
- def elem_blank?(value)
231
- case value
232
- when String
233
- value.empty? || BLANK_RE.match?(value)
211
+ private
234
212
 
235
- when NilClass
236
- true
213
+ def enforce_utf8_encoding(line, options)
214
+ # return line unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
237
215
 
238
- else
239
- false
240
- end
216
+ line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
241
217
  end
242
218
  end
243
219
  end
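Note on the `blank?` simplification above: the separate `elem_blank?` helper is gone and `blank?` now recurses into arrays and hash values with `all?`. Since `all?` returns `true` for an empty collection, the explicit `empty?` checks are no longer needed. The sketch below copies that logic under the hypothetical names `blank_sketch?` and `BLANK_SKETCH_RE`.

```ruby
BLANK_SKETCH_RE = /\A\s*\z/.freeze

# Standalone copy of the simplified check, under a hypothetical name.
def blank_sketch?(value)
  case value
  when String   then BLANK_SKETCH_RE.match?(value)
  when NilClass then true
  when Array    then value.all? { |elem| blank_sketch?(elem) }
  when Hash     then value.values.all? { |elem| blank_sketch?(elem) }
  else false
  end
end

checks = {
  empty_string: blank_sketch?(''),
  whitespace:   blank_sketch?("  \n"),
  empty_array:  blank_sketch?([]),
  nested:       blank_sketch?([nil, '', [' ']]),
  hash_values:  blank_sketch?({ a: nil, b: '' }),
  number:       blank_sketch?(0),
  text_in_list: blank_sketch?(['', 'x'])
}
```

One behavioral difference from the removed code: nested arrays and hashes are now checked recursively, whereas `elem_blank?` treated any nested collection as non-blank.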
@@ -2,9 +2,10 @@
2
2
 
3
3
  module SmarterCSV
4
4
  class << self
5
- attr_reader :csv_line_count, :chunk_count, :errors, :file_line_count, :headers, :raw_header, :result, :warnings
5
+ attr_reader :has_rails, :csv_line_count, :chunk_count, :errors, :file_line_count, :headers, :raw_header, :result, :warnings
6
6
 
7
7
  def initialize_variables
8
+ @has_rails = !!defined?(Rails)
8
9
  @csv_line_count = 0
9
10
  @chunk_count = 0
10
11
  @errors = {}
@@ -14,13 +15,16 @@ module SmarterCSV
14
15
  @raw_header = nil # header as it appears in the file
15
16
  @result = []
16
17
  @warnings = {}
18
+ @enforce_utf8 = false # only set to true if needed (after options parsing)
17
19
  end
18
20
 
19
21
  # :nocov:
22
+ # rubocop:disable Naming/MethodName
20
23
  def headerA
21
24
  warn "Deprecarion Warning: 'headerA' will be removed in future versions. Use 'headders'"
22
25
  @headerA
23
26
  end
27
+ # rubocop:enable Naming/MethodName
24
28
  # :nocov:
25
29
  end
26
30
  end
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SmarterCSV
4
- VERSION = "1.9.3"
4
+ VERSION = "1.10.0"
5
5
  end
data/lib/smarter_csv.rb CHANGED
@@ -5,13 +5,21 @@ require "smarter_csv/file_io"
5
5
  require "smarter_csv/options_processing"
6
6
  require "smarter_csv/auto_detection"
7
7
  require "smarter_csv/variables"
8
+ require 'smarter_csv/header_transformations'
9
+ require 'smarter_csv/header_validations'
8
10
  require "smarter_csv/headers"
11
+ require "smarter_csv/hash_transformations"
9
12
  require "smarter_csv/parse"
10
13
 
14
+ # load the C-extension:
11
15
  case RUBY_ENGINE
12
16
  when 'ruby'
13
17
  begin
14
18
  if `uname -s`.chomp == 'Darwin'
19
+ #
20
+ # Please report if you see cases where the rake-compiler is building x86_64 code on arm64 cpus:
21
+ # https://github.com/rake-compiler/rake-compiler/issues/231
22
+ #
15
23
  require 'smarter_csv/smarter_csv.bundle'
16
24
  else
17
25
  # :nocov:
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: smarter_csv
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.9.3
4
+ version: 1.10.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tilo Sloboda
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2023-12-16 00:00:00.000000000 Z
11
+ date: 2023-12-31 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: awesome_print
@@ -118,6 +118,9 @@ files:
118
118
  - lib/smarter_csv.rb
119
119
  - lib/smarter_csv/auto_detection.rb
120
120
  - lib/smarter_csv/file_io.rb
121
+ - lib/smarter_csv/hash_transformations.rb
122
+ - lib/smarter_csv/header_transformations.rb
123
+ - lib/smarter_csv/header_validations.rb
121
124
  - lib/smarter_csv/headers.rb
122
125
  - lib/smarter_csv/options_processing.rb
123
126
  - lib/smarter_csv/parse.rb
@@ -148,7 +151,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
148
151
  - !ruby/object:Gem::Version
149
152
  version: '0'
150
153
  requirements: []
151
- rubygems_version: 3.2.3
154
+ rubygems_version: 3.5.3
152
155
  signing_key:
153
156
  specification_version: 4
154
157
  summary: Ruby Gem for smarter importing of CSV Files (and CSV-like files), with lots