smarter_csv 1.9.2 → 1.10.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 3e4032569303bd062a92b3c3f45f5166346808291667dda9ebd91af123f532ef
4
- data.tar.gz: 78b73abc411d8ed866feae600b87b72c3c99fd3b00b67c81eac227c17f8d38ea
3
+ metadata.gz: f1d0b58acf0135b621e3182470674230ef73b48c829810e74fffa975fc318cf5
4
+ data.tar.gz: ee404c5c485748d35cda36b8d249cb6813a3f80005182fe8c05feac1694aba57
5
5
  SHA512:
6
- metadata.gz: 1712951a2ce4f6e8ad93a6e76a105a3a8d4890babacfbb9ae3eead11ac638962d9da3d45421a327049e87c9d54b43c0dca1327f11a13bbd54440d3a7fefc6253
7
- data.tar.gz: 3d8b81f04c8eb16a7b2ab9ddf27bdaf2b2bfdd2ee3a8b70765a88f809fc9869500debe950d8ec27e3a6af818e6f1e415d96d078e52784d638f1363619088faa3
6
+ metadata.gz: 4fee097fe2237f863510100155062da6815237260da5b15189f104f54596f7d5ff0479deb80596544e0bb1b9ba7b78126d2251798721e8d2f91e06b430950cd6
7
+ data.tar.gz: c30562965452ef296b5e5aaf2a9a12887aa42d8e8396780b73b34f99a2386d232bf020578618fcbd65186fc864518c81a3e7555cae9b00a005322f3599e18c5a
data/CHANGELOG.md CHANGED
@@ -1,6 +1,29 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
+ ## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡
5
+
6
+ * BREAKING CHANGES:
7
+
8
+ Changed behavior:
9
+ + when `user_provided_headers` are provided:
10
+ * if they are not unique, an exception will now be raised
11
+ * they are taken "as is", no header transformations can be applied
12
+ * when they are given as strings or as symbols, it is assumed that this is the desired format
13
+ * the value of the `strings_as_keys` option will be ignored
14
+
15
+ + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
16
+ * this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
17
+ * explicitly set this option to `nil` to get the behavior from previous versions.
18
+
19
+ * performance and memory improvements
20
+ * code refactor
21
+
22
+ ## 1.9.3 (2023-12-16)
23
+ * raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
24
+ * code refactor / no functional changes
25
+ * added test cases
26
+
4
27
  ## 1.9.2 (2023-11-12)
5
28
  * fixed bug with '\\' at end of line (issue #252, thanks to averycrespi-moz)
6
29
  * fixed require statements (issue #249, thanks to PikachuEXE, courtsimas)
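The new `duplicate_header_suffix` default described in this changelog can be tried in isolation. The following is a minimal standalone sketch that does not call the gem; the helper name `disambiguate` is hypothetical, and its logic follows the `disambiguate_headers` method that appears later in this diff.

```ruby
# Hypothetical helper, not part of the SmarterCSV API.
# Appends "<suffix><n>" to the 2nd, 3rd, ... occurrence of a header.
def disambiguate(headers, suffix)
  counts = Hash.new(0)
  headers.map do |header|
    counts[header] += 1
    counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
  end
end

p disambiguate(%w[name email email email], '')  # suffix '' is the new default
# prints ["name", "email", "email2", "email3"]
p disambiguate(%w[name email email email], '_')
# prints ["name", "email", "email_2", "email_3"]
```

With the option set to `nil`, the changelog states that no disambiguation happens and the previous behavior (an error on duplicate headers) applies.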
data/README.md CHANGED
@@ -2,15 +2,33 @@
2
2
  # SmarterCSV
3
3
 
4
4
  [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
5
-
5
+
6
+
7
+ #### LATEST CHANGES
8
+
9
+ * Version 1.10.0 has BREAKING CHANGES:
10
+
11
+ Changed behavior:
12
+ + when `user_provided_headers` are provided:
13
+ * if they are not unique, an exception will now be raised
14
+ * they are taken "as is", no header transformations can be applied
15
+ * when they are given as strings or as symbols, it is assumed that this is the desired format
16
+ * the value of the `strings_as_keys` option will be ignored
17
+
18
+ + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
19
+ * this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
20
+ * explicitly set this option to `nil` to get the behavior from previous versions.
21
+
6
22
  #### Development Branches
7
23
 
8
24
  * default branch is `main` for 1.x development
9
- * 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
25
+
26
+ * 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
27
+ - This is an EXPERIMENTAL branch - DO NOT USE in production
10
28
 
11
- #### Work towards Future Version 2.0
29
+ #### Work towards Future Version 2.x
12
30
 
13
- * Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
31
+ * Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
14
32
  Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
15
33
 
16
34
  ---------------
@@ -84,6 +102,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
84
102
  00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
85
103
  ```
86
104
 
105
+ ### Articles
106
+ * [Processing 1.4 Million CSV Records in Ruby, fast](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
107
+ * [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
108
+
87
109
  ### Examples
88
110
 
89
111
  Here are some examples to demonstrate the versatility of SmarterCSV.
@@ -243,8 +265,6 @@ NOTE: If you use `key_mappings` and `value_converters`, make sure that the value
243
265
  data[0][:price].class
244
266
  => Float
245
267
  ```
246
- ## Parallel Processing
247
- [Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
248
268
 
249
269
  ## Documentation
250
270
 
@@ -280,7 +300,8 @@ The options and the block are optional.
280
300
  | :headers_in_file | true | Whether or not the file contains headers as the first line. |
281
301
  | | | Important if the file does not contain headers, |
282
302
  | | | otherwise you would lose the first line of data. |
283
- | :duplicate_header_suffix | nil | If set, adds numbers to duplicated headers and separates them by the given suffix |
303
+ | :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
304
+ | | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
284
305
  | :user_provided_headers | nil | *careful with that axe!* |
285
306
  | | | user provided Array of header strings or symbols, to define |
286
307
  | | | what headers should be used, overriding any in-file headers. |
@@ -300,7 +321,7 @@ And header and data validations will also be supported in 2.x
300
321
  | Option | Default | Explanation |
301
322
  ---------------------------------------------------------------------------------------------------------------------------------
302
323
  | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
303
- | :silence_missing_key | false | ignore missing keys in `key_mapping` |
324
+ | :silence_missing_keys | false | ignore missing keys in `key_mapping` |
304
325
  | | | if set to true: makes all mapped keys optional |
305
326
  | | | if given an array, makes only the keys listed in it optional |
306
327
  | :required_keys | nil | An array. Specify the required names AFTER header transformation. |
@@ -0,0 +1,73 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ protected
6
+
7
+ # If file has headers, then guesses column separator from headers.
8
+ # Otherwise guesses column separator from contents.
9
+ # Raises exception if none is found.
10
+ def guess_column_separator(filehandle, options)
11
+ skip_lines(filehandle, options)
12
+
13
+ delimiters = [',', "\t", ';', ':', '|']
14
+
15
+ line = nil
16
+ has_header = options[:headers_in_file]
17
+ candidates = Hash.new(0)
18
+ count = has_header ? 1 : 5
19
+ count.times do
20
+ line = readline_with_counts(filehandle, options)
21
+ delimiters.each do |d|
22
+ candidates[d] += line.scan(d).count
23
+ end
24
+ rescue EOFError # short files
25
+ break
26
+ end
27
+ rewind(filehandle)
28
+
29
+ if candidates.values.max == 0
30
+ # if the header only contains a single column (word characters only), default to ','
31
+ return ',' if line.chomp(options[:row_sep]) =~ /^\w+$/
32
+
33
+ raise SmarterCSV::NoColSepDetected
34
+ end
35
+
36
+ candidates.key(candidates.values.max)
37
+ end
38
+
39
+ # limitation: unless options[:auto_row_sep_chars] is set, this reads the whole file in before making a decision
40
+ def guess_line_ending(filehandle, options)
41
+ counts = {"\n" => 0, "\r" => 0, "\r\n" => 0}
42
+ quoted_char = false
43
+
44
+ # count how many of the pre-defined line-endings we find
45
+ # ignoring those contained within quote characters
46
+ last_char = nil
47
+ lines = 0
48
+ filehandle.each_char do |c|
49
+ quoted_char = !quoted_char if c == options[:quote_char]
50
+ next if quoted_char
51
+
52
+ if last_char == "\r"
53
+ if c == "\n"
54
+ counts["\r\n"] += 1
55
+ else
56
+ counts["\r"] += 1 # \r are counted after they appeared
57
+ end
58
+ elsif c == "\n"
59
+ counts["\n"] += 1
60
+ end
61
+ last_char = c
62
+ lines += 1
63
+ break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
64
+ end
65
+ rewind(filehandle)
66
+
67
+ counts["\r"] += 1 if last_char == "\r"
68
+ # find the most frequent key/value pair:
69
+ most_frequent_key, _count = counts.max_by{|_, v| v}
70
+ most_frequent_key
71
+ end
72
+ end
73
+ end
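The line-ending detection in the hunk above can be exercised without a file handle. This is a minimal sketch under stated assumptions: it works on a plain `String`, omits the `auto_row_sep_chars` cut-off and the rewind, and the name `guess_line_ending_in` is hypothetical.

```ruby
# Hypothetical standalone variant of the counting logic in guess_line_ending.
# Counts "\n", "\r" and "\r\n" outside of quoted sections and returns the most frequent.
def guess_line_ending_in(text, quote_char = '"')
  counts = { "\n" => 0, "\r" => 0, "\r\n" => 0 }
  in_quotes = false
  last_char = nil

  text.each_char do |c|
    in_quotes = !in_quotes if c == quote_char
    next if in_quotes # ignore line endings inside quoted fields

    if last_char == "\r"
      if c == "\n"
        counts["\r\n"] += 1
      else
        counts["\r"] += 1 # a lone \r is counted once the next character is seen
      end
    elsif c == "\n"
      counts["\n"] += 1
    end
    last_char = c
  end
  counts["\r"] += 1 if last_char == "\r" # trailing \r at end of input

  counts.max_by { |_, v| v }.first
end

p guess_line_ending_in("h1,h2\r\n\"multi\nline\",2\r\n")
# prints "\r\n"
```

The `\n` inside the quoted field is skipped, so it does not distort the count.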
@@ -0,0 +1,50 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ protected
6
+
7
+ def readline_with_counts(filehandle, options)
8
+ line = filehandle.readline(options[:row_sep])
9
+ @file_line_count += 1
10
+ @csv_line_count += 1
11
+ line = remove_bom(line) if @csv_line_count == 1
12
+ line
13
+ end
14
+
15
+ def skip_lines(filehandle, options)
16
+ options[:skip_lines].to_i.times do
17
+ readline_with_counts(filehandle, options)
18
+ end
19
+ end
20
+
21
+ def rewind(filehandle)
22
+ @file_line_count = 0
23
+ @csv_line_count = 0
24
+ filehandle.rewind
25
+ end
26
+
27
+ private
28
+
29
+ UTF_32_BOM = %w[0 0 fe ff].freeze
30
+ UTF_32LE_BOM = %w[ff fe 0 0].freeze
31
+ UTF_8_BOM = %w[ef bb bf].freeze
32
+ UTF_16_BOM = %w[fe ff].freeze
33
+ UTF_16LE_BOM = %w[ff fe].freeze
34
+
35
+ def remove_bom(str)
36
+ str_as_hex = str.bytes.map{|x| x.to_s(16)}
37
+ # if string does not start with one of the bytes, there is no BOM
38
+ return str unless %w[ef fe ff 0].include?(str_as_hex[0])
39
+
40
+ return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
41
+ return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
42
+ return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
43
+
44
+ # :nocov:
45
+ puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
46
+ str
47
+ # :nocov:
48
+ end
49
+ end
50
+ end
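The BOM handling above compares the leading bytes of the first line against known byte order marks. A compact sketch of the same idea follows; the constant `BOMS` and the method `strip_bom` are hypothetical names, and integer byte values are used instead of the hex strings in the diff.

```ruby
# Hypothetical standalone sketch, not the gem's implementation.
# Longer BOMs come first so UTF-32LE (ff fe 00 00) is not mistaken for UTF-16LE (ff fe).
BOMS = [
  [0x00, 0x00, 0xFE, 0xFF], # UTF-32 BE
  [0xFF, 0xFE, 0x00, 0x00], # UTF-32 LE
  [0xEF, 0xBB, 0xBF],       # UTF-8
  [0xFE, 0xFF],             # UTF-16 BE
  [0xFF, 0xFE]              # UTF-16 LE
].freeze

def strip_bom(str)
  bytes = str.bytes
  bom = BOMS.find { |candidate| bytes.first(candidate.size) == candidate }
  bom ? str.byteslice(bom.size..-1) : str
end

p strip_bom("\xEF\xBB\xBFname,age")
# prints "name,age"
```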
@@ -0,0 +1,91 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ def hash_transformations(hash, options)
6
+ # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
7
+ # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
8
+ remove_empty_values = options[:remove_empty_values] == true
9
+ remove_zero_values = options[:remove_zero_values]
10
+ remove_values_matching = options[:remove_values_matching]
11
+ convert_to_numeric = options[:convert_values_to_numeric]
12
+ value_converters = options[:value_converters]
13
+
14
+ hash.each_with_object({}) do |(k, v), new_hash|
15
+ next if k.nil? || k == '' || k == :""
16
+ next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
17
+ next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
18
+ next if remove_values_matching && v =~ remove_values_matching
19
+
20
+ # deal with the :only / :except options to :convert_values_to_numeric
21
+ if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
22
+ if v =~ /^[+-]?\d+\.\d+$/
23
+ v = v.to_f
24
+ elsif v =~ /^[+-]?\d+$/
25
+ v = v.to_i
26
+ end
27
+ end
28
+
29
+ converter = value_converters[k] if value_converters
30
+ v = converter.convert(v) if converter
31
+
32
+ new_hash[k] = v
33
+ end
34
+ end
35
+
36
+ # def hash_transformations(hash, options)
37
+ # # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
38
+ # # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
39
+ # hash.delete(nil)
40
+ # hash.delete('')
41
+ # hash.delete(:"")
42
+
43
+ # if options[:remove_empty_values] == true
44
+ # hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
45
+ # end
46
+
47
+ # hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
48
+ # hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
49
+
50
+ # if options[:convert_values_to_numeric]
51
+ # hash.each do |k, v|
52
+ # # deal with the :only / :except options to :convert_values_to_numeric
53
+ # next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
54
+
55
+ # # convert if it's a numeric value:
56
+ # case v
57
+ # when /^[+-]?\d+\.\d+$/
58
+ # hash[k] = v.to_f
59
+ # when /^[+-]?\d+$/
60
+ # hash[k] = v.to_i
61
+ # end
62
+ # end
63
+ # end
64
+
65
+ # if options[:value_converters]
66
+ # hash.each do |k, v|
67
+ # converter = options[:value_converters][k]
68
+ # next unless converter
69
+
70
+ # hash[k] = converter.convert(v)
71
+ # end
72
+ # end
73
+
74
+ # hash
75
+ # end
76
+
77
+ protected
78
+
79
+ # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
80
+ def limit_execution_for_only_or_except(options, option_name, key)
81
+ if options[option_name].is_a?(Hash)
82
+ if options[option_name].has_key?(:except)
83
+ return true if Array(options[option_name][:except]).include?(key)
84
+ elsif options[option_name].has_key?(:only)
85
+ return true unless Array(options[option_name][:only]).include?(key)
86
+ end
87
+ end
88
+ false
89
+ end
90
+ end
91
+ end
@@ -0,0 +1,63 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ # transform the headers that were in the file:
6
+ def header_transformations(header_array, options)
7
+ header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
8
+ header_array.map!{|x| x.strip} if options[:strip_whitespace]
9
+
10
+ unless options[:keep_original_headers]
11
+ header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
12
+ header_array.map!{|x| x.downcase} if options[:downcase_header]
13
+ end
14
+
15
+ # detect duplicate headers and disambiguate
16
+ header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
17
+ # symbolize headers
18
+ header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
19
+ # doesn't make sense to re-map when we have user_provided_headers
20
+ header_array = remap_headers(header_array, options) if options[:key_mapping]
21
+
22
+ header_array
23
+ end
24
+
25
+ def disambiguate_headers(headers, options)
26
+ counts = Hash.new(0)
27
+ headers.map do |header|
28
+ counts[header] += 1
29
+ counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
30
+ end
31
+ end
32
+
33
+ # do some key mapping on the keys in the file header
34
+ # if you want to completely delete a key, then map it to nil or to ''
35
+ def remap_headers(headers, options)
36
+ key_mapping = options[:key_mapping]
37
+ if key_mapping.empty? || !key_mapping.is_a?(Hash) || key_mapping.keys.empty?
38
+ raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
39
+ end
40
+
41
+ key_mapping = options[:key_mapping]
42
+ # if silence_missing_keys is not set, raise error if missing header
43
+ missing_keys = key_mapping.keys - headers
44
+ # if the user passes a list of speciffic mapped keys that are optional
45
+ missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
46
+
47
+ unless missing_keys.empty? || options[:silence_missing_keys] == true
48
+ raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
49
+ end
50
+
51
+ headers.map! do |header|
52
+ if key_mapping.has_key?(header)
53
+ key_mapping[header].nil? ? nil : key_mapping[header]
54
+ elsif options[:remove_unmapped_keys]
55
+ nil
56
+ else
57
+ header
58
+ end
59
+ end
60
+ headers
61
+ end
62
+ end
63
+ end
@@ -0,0 +1,34 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ def header_validations(headers, options)
6
+ check_duplicate_headers(headers, options)
7
+ check_required_headers(headers, options)
8
+ end
9
+
10
+ def check_duplicate_headers(headers, _options)
11
+ header_counts = Hash.new(0)
12
+ headers.each { |header| header_counts[header] += 1 unless header.nil? }
13
+
14
+ duplicates = header_counts.select { |_, count| count > 1 }
15
+
16
+ unless duplicates.empty?
17
+ raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
18
+ end
19
+ end
20
+
21
+ require 'set'
22
+
23
+ def check_required_headers(headers, options)
24
+ if options[:required_keys] && options[:required_keys].is_a?(Array)
25
+ headers_set = headers.to_set
26
+ missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }
27
+
28
+ unless missing_keys.empty?
29
+ raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
30
+ end
31
+ end
32
+ end
33
+ end
34
+ end
@@ -0,0 +1,68 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ def process_headers(filehandle, options)
6
+ @raw_header = nil # header as it appears in the file
7
+ @headers = nil # the processed headers
8
+ header_array = []
9
+ file_header_size = nil
10
+
11
+ # if headers_in_file, get the headers -> We get the number of columns, even when user provided headers
12
+ if options[:headers_in_file] # extract the header line
13
+ # process the header line in the CSV file..
14
+ # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
15
+ header_line = @raw_header = readline_with_counts(filehandle, options)
16
+ header_line = preprocess_header_line(header_line, options)
17
+
18
+ file_header_array, file_header_size = parse(header_line, options)
19
+
20
+ file_header_array = header_transformations(file_header_array, options)
21
+
22
+ else
23
+ unless options[:user_provided_headers]
24
+ raise SmarterCSV::IncorrectOption, "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers"
25
+ end
26
+ end
27
+
28
+ if options[:user_provided_headers]
29
+ unless options[:user_provided_headers].is_a?(Array) && !options[:user_provided_headers].empty?
30
+ raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for user_provided_headers! Expecting array with headers.")
31
+ end
32
+
33
+ # use user-provided headers
34
+ user_header_array = options[:user_provided_headers]
35
+ # user_provided_headers: their count should match the headers_in_file if any
36
+ if defined?(file_header_size) && !file_header_size.nil?
37
+ if user_header_array.size != file_header_size
38
+ raise SmarterCSV::HeaderSizeMismatch, "ERROR: :user_provided_headers defines #{user_header_array.size} headers != CSV-file has #{file_header_size} headers"
39
+ else
40
+ # we could print out the mapping of file_header_array to header_array here
41
+ end
42
+ end
43
+
44
+ header_array = user_header_array
45
+ else
46
+ header_array = file_header_array
47
+ end
48
+
49
+ [header_array, header_array.size]
50
+ end
51
+
52
+ private
53
+
54
+ def preprocess_header_line(header_line, options)
55
+ header_line = enforce_utf8_encoding(header_line, options)
56
+ header_line = remove_comments_from_header(header_line, options)
57
+ header_line = header_line.chomp(options[:row_sep])
58
+ header_line.gsub!(options[:strip_chars_from_headers], '') if options[:strip_chars_from_headers]
59
+ header_line
60
+ end
61
+
62
+ def remove_comments_from_header(header, options)
63
+ return header unless options[:comment_regexp]
64
+
65
+ header.sub(options[:comment_regexp], '')
66
+ end
67
+ end
68
+ end
@@ -9,7 +9,7 @@ module SmarterCSV
9
9
  comment_regexp: nil, # was: /\A#/,
10
10
  convert_values_to_numeric: true,
11
11
  downcase_header: true,
12
- duplicate_header_suffix: nil,
12
+ duplicate_header_suffix: '', # was: nil,
13
13
  file_encoding: 'utf-8',
14
14
  force_simple_split: false,
15
15
  force_utf8: false,
@@ -62,6 +62,15 @@ module SmarterCSV
62
62
  private
63
63
 
64
64
  def validate_options!(options)
65
+ # deprecate required_headers
66
+ unless options[:required_headers].nil?
67
+ puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
68
+ if options[:required_keys].nil?
69
+ options[:required_keys] = options[:required_headers]
70
+ options[:required_headers] = nil
71
+ end
72
+ end
73
+
65
74
  keys = options.keys
66
75
  errors = []
67
76
  errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
@@ -0,0 +1,90 @@
1
+ # frozen_string_literal: true
2
+
3
+ module SmarterCSV
4
+ class << self
5
+ protected
6
+
7
+ ###
8
+ ### Thin wrapper around C-extension
9
+ ###
10
+ def parse(line, options, header_size = nil)
11
+ # puts "SmarterCSV.parse OPTIONS: #{options[:acceleration]}" if options[:verbose]
12
+
13
+ if options[:acceleration] && has_acceleration?
14
+ # :nocov:
15
+ has_quotes = line =~ /#{options[:quote_char]}/
16
+ elements = parse_csv_line_c(line, options[:col_sep], options[:quote_char], header_size)
17
+ elements.map!{|x| cleanup_quotes(x, options[:quote_char])} if has_quotes
18
+ [elements, elements.size]
19
+ # :nocov:
20
+ else
21
+ # puts "WARNING: SmarterCSV is using un-accelerated parsing of lines. Check options[:acceleration]"
22
+ parse_csv_line_ruby(line, options, header_size)
23
+ end
24
+ end
25
+
26
+ # ------------------------------------------------------------------
27
+ # Ruby equivalent of the C-extension for parse_line
28
+ #
29
+ # parses a single line: either a CSV header or a body line
30
+ # - quoting rules compared to RFC-4180 are somewhat relaxed
31
+ # - we are not assuming that quotes inside a field need to be doubled
32
+ # - we are not assuming that all fields need to be quoted (0 is even)
33
+ # - works with multi-char col_sep
34
+ # - if header_size is given, only up to header_size fields are parsed
35
+ #
36
+ # We use header_size for parsing the body lines to make sure we always match the number of headers
37
+ # in case there are trailing col_sep characters in line
38
+ #
39
+ # Our convention is that empty fields are returned as empty strings, not as nil.
40
+ #
41
+ #
42
+ # the purpose of the header_size parameter is to handle a corner case where
43
+ # CSV lines contain more fields than the header.
44
+ # In which case the remaining fields in the line are ignored
45
+ #
46
+ def parse_csv_line_ruby(line, options, header_size = nil)
47
+ return [] if line.nil?
48
+
49
+ line_size = line.size
50
+ col_sep = options[:col_sep]
51
+ col_sep_size = col_sep.size
52
+ quote = options[:quote_char]
53
+ quote_count = 0
54
+ elements = []
55
+ start = 0
56
+ i = 0
57
+
58
+ previous_char = ''
59
+ while i < line_size
60
+ if line[i...i+col_sep_size] == col_sep && quote_count.even?
61
+ break if !header_size.nil? && elements.size >= header_size
62
+
63
+ elements << cleanup_quotes(line[start...i], quote)
64
+ previous_char = line[i]
65
+ i += col_sep.size
66
+ start = i
67
+ else
68
+ quote_count += 1 if line[i] == quote && previous_char != '\\'
69
+ previous_char = line[i]
70
+ i += 1
71
+ end
72
+ end
73
+ elements << cleanup_quotes(line[start..-1], quote) if header_size.nil? || elements.size < header_size
74
+ [elements, elements.size]
75
+ end
76
+
77
+ def cleanup_quotes(field, quote)
78
+ return field if field.nil?
79
+
80
+ # return if field !~ /#{quote}/ # this check can probably be eliminated
81
+
82
+ if field.start_with?(quote) && field.end_with?(quote)
83
+ field.delete_prefix!(quote)
84
+ field.delete_suffix!(quote)
85
+ end
86
+ field.gsub!("#{quote}#{quote}", quote)
87
+ field
88
+ end
89
+ end
90
+ end
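The Ruby fallback parser above splits on `col_sep` only while the number of quote characters seen so far is even. The sketch below isolates that idea; it is a simplification with hypothetical names (`split_csv_line`, `unquote`) and leaves out the backslash-escape check and the `header_size` limit.

```ruby
# Hypothetical simplified splitter: a separator only counts when we are
# outside of quotes, i.e. when an even number of quote chars has been seen.
def split_csv_line(line, col_sep: ',', quote: '"')
  fields = []
  start = 0
  i = 0
  quote_count = 0

  while i < line.size
    if quote_count.even? && line[i, col_sep.size] == col_sep
      fields << line[start...i]
      i += col_sep.size
      start = i
    else
      quote_count += 1 if line[i] == quote
      i += 1
    end
  end
  fields << line[start..-1] # last field (empty string after a trailing separator)

  fields.map { |field| unquote(field, quote) }
end

# Removes one pair of surrounding quotes and collapses doubled quotes.
def unquote(field, quote)
  if field.size >= 2 * quote.size && field.start_with?(quote) && field.end_with?(quote)
    field = field[quote.size...-quote.size]
  end
  field.gsub(quote * 2, quote)
end

p split_csv_line('a,"b,c",d')
# prints ["a", "b,c", "d"]
p split_csv_line('x::y', col_sep: '::')
# prints ["x", "y"]
```

As in the diff, empty fields come back as empty strings rather than `nil`.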