smarter_csv 1.5.0 → 1.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +13 -0
- data/CONTRIBUTORS.md +1 -0
- data/README.md +17 -4
- data/lib/smarter_csv/smarter_csv.rb +182 -102
- data/lib/smarter_csv/version.rb +1 -1
- data/smarter_csv.gemspec +1 -1
- data/spec/fixtures/duplicate_headers.csv +1 -1
- data/spec/smarter_csv/duplicate_headers_spec.rb +76 -0
- data/spec/smarter_csv/invalid_headers_spec.rb +8 -22
- data/spec/smarter_csv/malformed_spec.rb +15 -7
- data/spec/smarter_csv/no_header_spec.rb +16 -11
- data/spec/smarter_csv/parse/column_separator_spec.rb +61 -0
- data/spec/smarter_csv/parse/old_csv_library_spec.rb +74 -0
- data/spec/smarter_csv/parse/rfc4180_and_more_spec.rb +170 -0
- data/spec/smarter_csv/quoted_spec.rb +8 -4
- metadata +25 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: fd2cf82aafc3b45257fbdfc594ed8e1d3bf2226e59cbee144b3003d8f79ec6cf
|
4
|
+
data.tar.gz: 95df862865e3123cf86194d47107f140f69f2fc91c20aba01d4004e8bffa5d74
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: df32ae9a380fa4fff0932d56e8a0cacadb8d4ebf7d8124e607f2ba389c3b60f875c300a2137fe04aac2b7eda77850b343af5e58c53e310f90460f96223f3228c
|
7
|
+
data.tar.gz: 107e1dbacdc6293a0c044a91cf237f50fbeab59eb5b032167f55d0fe6c2cf07b079c6cfe296368c03d2c84e64ce3c0e6ad744043397cdc217cd3ab51beb3ab09
|
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,19 @@
|
|
1
1
|
|
2
2
|
# SmarterCSV 1.x Change Log
|
3
3
|
|
4
|
+
## 1.6.0 (2022-05-03)
|
5
|
+
* completely rewrote line parser
|
6
|
+
* added methods `SmarterCSV.raw_headers` and `SmarterCSV.headers` to allow easy examination of how the headers are processed.
|
7
|
+
|
8
|
+
## 1.5.2 (2022-04-29)
|
9
|
+
* added missing keys to the SmarterCSV::KeyMappingError exception message #189 (thanks to John Dell)
|
10
|
+
|
11
|
+
## 1.5.1 (2022-04-27)
|
12
|
+
* added raising of `KeyMappingError` if `key_mapping` refers to a non-existent key
|
13
|
+
* added option `duplicate_header_suffix` (thanks to Skye Shaw)
|
14
|
+
When given a non-nil string, it uses the suffix to append numbering 2..n to duplicate headers.
|
15
|
+
If your code will need to process arbitrary CSV files, please set `duplicate_header_suffix`.
|
16
|
+
|
4
17
|
## 1.5.0 (2022-04-25)
|
5
18
|
* fixed bug with trailing col_sep characters, introduced in 1.4.0
|
6
19
|
* Fix deprecation warning in Ruby 3.0.3 / $INPUT_RECORD_SEPARATOR (thanks to Joel Fouse )
|
data/CONTRIBUTORS.md
CHANGED
data/README.md
CHANGED
@@ -16,10 +16,12 @@
|
|
16
16
|
|
17
17
|
# SmarterCSV
|
18
18
|
|
19
|
-
[](http://travis-ci.
|
19
|
+
[](http://travis-ci.com/tilo/smarter_csv) [](http://badge.fury.io/rb/smarter_csv)
|
20
20
|
|
21
21
|
#### SmarterCSV 1.x
|
22
22
|
|
23
|
+
`smarter_csv` is now 10 years old, and still kicking! 🎉🎉🎉
|
24
|
+
|
23
25
|
`smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with Mongoid or ActiveRecord,
|
24
26
|
and parallel processing with Resque or Sidekiq.
|
25
27
|
|
@@ -42,11 +44,13 @@ NOTE; This Gem is only for importing CSV files - writing of CSV files is not sup
|
|
42
44
|
|
43
45
|
### Why?
|
44
46
|
|
45
|
-
Ruby's CSV library's API is pretty old, and it's processing of CSV-files returning Arrays of Arrays feels 'very close to the metal'. The output is not easy to use - especially not if you want to create database records
|
47
|
+
Ruby's CSV library's API is pretty old, and it's processing of CSV-files returning Arrays of Arrays feels 'very close to the metal'. The output is not easy to use - especially not if you want to create database records or Sidekiq jobs with it. Another shortcoming is that Ruby's CSV library does not have good support for huge CSV-files, e.g. there is no support for 'chunking' and/or parallel processing of the CSV-content (e.g. with Sidekiq).
|
48
|
+
|
49
|
+
As the existing CSV libraries didn't fit my needs, I was writing my own CSV processing - specifically for use in connection with Rails ORMs like Mongoid, MongoMapper and ActiveRecord. In those ORMs you can easily pass a hash with attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept larger arrays of such hashes to create a larger amount of records quickly with just one call. The same patterns are used when you pass data to Sidekiq jobs.
|
46
50
|
|
47
|
-
|
51
|
+
For processing large CSV files it is essential to process them in chunks, so the memory impact is minimized.
|
48
52
|
|
49
|
-
###
|
53
|
+
### How?
|
50
54
|
|
51
55
|
The two main choices you have in terms of how to call `SmarterCSV.process` are:
|
52
56
|
* calling `process` with or without a block
|
@@ -228,6 +232,7 @@ The options and the block are optional.
|
|
228
232
|
| :headers_in_file | true | Whether or not the file contains headers as the first line. |
|
229
233
|
| | | Important if the file does not contain headers, |
|
230
234
|
| | | otherwise you would lose the first line of data. |
|
235
|
+
| :duplicate_header_suffix | nil | If set, adds numbers to duplicated headers and separates them by the given suffix |
|
231
236
|
| :user_provided_headers | nil | *careful with that axe!* |
|
232
237
|
| | | user provided Array of header strings or symbols, to define |
|
233
238
|
| | | what headers should be used, overriding any in-file headers. |
|
@@ -282,6 +287,7 @@ And header and data validations will also be supported in 2.x
|
|
282
287
|
data = SmarterCSV.process(f)
|
283
288
|
end
|
284
289
|
```
|
290
|
+
|
285
291
|
#### NOTES about CSV Headers:
|
286
292
|
* as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
|
287
293
|
* the first line with the header might be commented out, in which case you will need to set `comment_regexp: /\A#/`
|
@@ -291,6 +297,13 @@ And header and data validations will also be supported in 2.x
|
|
291
297
|
* you can not combine the :user_provided_headers and :key_mapping options
|
292
298
|
* if the incorrect number of headers are provided via :user_provided_headers, exception SmarterCSV::HeaderSizeMismatch is raised
|
293
299
|
|
300
|
+
#### NOTES on Duplicate Headers:
|
301
|
+
As a corner case, it is possible that a CSV file contains multiple headers with the same name.
|
302
|
+
* If that happens, by default `smarter_csv` will raise a `DuplicateHeaders` error.
|
303
|
+
* If you set `duplicate_header_suffix` to a non-nil string, it will use it to append numbers 2..n to the duplicate headers. To further disambiguate the headers, you can further use `key_mapping` to assign meaningful names.
|
304
|
+
* If your code will need to process arbitrary CSV files, please set `duplicate_header_suffix`.
|
305
|
+
* Another way to deal with duplicate headers it to use `user_assigned_headers` to ignore any headers in the file.
|
306
|
+
|
294
307
|
#### NOTES on Key Mapping:
|
295
308
|
* keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
|
296
309
|
* if you want to completely delete a key, then map it to nil or to '', they will be automatically deleted from any result Hash
|
@@ -5,107 +5,38 @@ module SmarterCSV
|
|
5
5
|
class DuplicateHeaders < SmarterCSVException; end
|
6
6
|
class MissingHeaders < SmarterCSVException; end
|
7
7
|
class NoColSepDetected < SmarterCSVException; end
|
8
|
+
class KeyMappingError < SmarterCSVException; end
|
9
|
+
class MalformedCSVError < SmarterCSVException; end
|
8
10
|
|
9
|
-
|
11
|
+
# first parameter: filename or input object which responds to readline method
|
12
|
+
def SmarterCSV.process(input, options={}, &block)
|
10
13
|
options = default_options.merge(options)
|
11
14
|
options[:invalid_byte_sequence] = '' if options[:invalid_byte_sequence].nil?
|
12
15
|
|
13
16
|
headerA = []
|
14
17
|
result = []
|
15
|
-
file_line_count = 0
|
16
|
-
csv_line_count = 0
|
18
|
+
@file_line_count = 0
|
19
|
+
@csv_line_count = 0
|
17
20
|
has_rails = !! defined?(Rails)
|
18
21
|
begin
|
19
|
-
|
22
|
+
fh = input.respond_to?(:readline) ? input : File.open(input, "r:#{options[:file_encoding]}")
|
20
23
|
|
21
24
|
# auto-detect the row separator
|
22
|
-
options[:row_sep] = SmarterCSV.guess_line_ending(
|
25
|
+
options[:row_sep] = SmarterCSV.guess_line_ending(fh, options) if options[:row_sep].to_sym == :auto
|
23
26
|
# attempt to auto-detect column separator
|
24
|
-
options[:col_sep] = guess_column_separator(
|
25
|
-
# preserve options, in case we need to call the CSV class
|
26
|
-
csv_options = options.select{|k,v| [:col_sep, :row_sep, :quote_char].include?(k)} # options.slice(:col_sep, :row_sep, :quote_char)
|
27
|
-
csv_options.delete(:row_sep) if [nil, :auto].include?( options[:row_sep].to_sym )
|
28
|
-
csv_options.delete(:col_sep) if [nil, :auto].include?( options[:col_sep].to_sym )
|
27
|
+
options[:col_sep] = guess_column_separator(fh, options) if options[:col_sep].to_sym == :auto
|
29
28
|
|
30
|
-
if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && (
|
29
|
+
if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && ( fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8') )
|
31
30
|
puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
|
32
31
|
end
|
33
32
|
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
# process the header line in the CSV file..
|
38
|
-
# the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
|
39
|
-
header = f.readline(options[:row_sep])
|
40
|
-
header = header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
|
41
|
-
header = header.sub(options[:comment_regexp],'') if options[:comment_regexp]
|
42
|
-
header = header.chomp(options[:row_sep])
|
43
|
-
|
44
|
-
file_line_count += 1
|
45
|
-
csv_line_count += 1
|
46
|
-
header = header.gsub(options[:strip_chars_from_headers], '') if options[:strip_chars_from_headers]
|
47
|
-
|
48
|
-
if (header =~ %r{#{options[:quote_char]}}) and (! options[:force_simple_split])
|
49
|
-
file_headerA = begin
|
50
|
-
CSV.parse( header, **csv_options ).flatten.collect!{|x| x.nil? ? '' : x} # to deal with nil values from CSV.parse
|
51
|
-
rescue CSV::MalformedCSVError => e
|
52
|
-
raise $!, "#{$!} [SmarterCSV: csv line #{csv_line_count}]", $!.backtrace
|
53
|
-
end
|
54
|
-
else
|
55
|
-
file_headerA = header.split(options[:col_sep])
|
56
|
-
end
|
57
|
-
file_header_size = file_headerA.size # before mapping, which could delete keys
|
58
|
-
|
59
|
-
file_headerA.map!{|x| x.gsub(%r/#{options[:quote_char]}/,'') }
|
60
|
-
file_headerA.map!{|x| x.strip} if options[:strip_whitespace]
|
61
|
-
unless options[:keep_original_headers]
|
62
|
-
file_headerA.map!{|x| x.gsub(/\s+|-+/,'_')}
|
63
|
-
file_headerA.map!{|x| x.downcase } if options[:downcase_header]
|
64
|
-
end
|
65
|
-
else
|
66
|
-
raise SmarterCSV::IncorrectOption , "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers" if options[:user_provided_headers].nil?
|
67
|
-
end
|
68
|
-
if options[:user_provided_headers] && options[:user_provided_headers].class == Array && ! options[:user_provided_headers].empty?
|
69
|
-
# use user-provided headers
|
70
|
-
headerA = options[:user_provided_headers]
|
71
|
-
if defined?(file_header_size) && ! file_header_size.nil?
|
72
|
-
if headerA.size != file_header_size
|
73
|
-
raise SmarterCSV::HeaderSizeMismatch , "ERROR: :user_provided_headers defines #{headerA.size} headers != CSV-file #{input} has #{file_header_size} headers"
|
74
|
-
else
|
75
|
-
# we could print out the mapping of file_headerA to headerA here
|
76
|
-
end
|
33
|
+
if options[:skip_lines].to_i > 0
|
34
|
+
options[:skip_lines].to_i.times do
|
35
|
+
readline_with_counts(fh, options)
|
77
36
|
end
|
78
|
-
else
|
79
|
-
headerA = file_headerA
|
80
37
|
end
|
81
|
-
header_size = headerA.size # used for splitting lines
|
82
|
-
|
83
|
-
headerA.map!{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
|
84
|
-
|
85
|
-
unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
|
86
|
-
key_mappingH = options[:key_mapping]
|
87
38
|
|
88
|
-
|
89
|
-
# if you want to completely delete a key, then map it to nil or to ''
|
90
|
-
if ! key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
|
91
|
-
headerA.map!{|x| key_mappingH.has_key?(x) ? (key_mappingH[x].nil? ? nil : key_mappingH[x]) : (options[:remove_unmapped_keys] ? nil : x)}
|
92
|
-
end
|
93
|
-
end
|
94
|
-
|
95
|
-
# header_validations
|
96
|
-
duplicate_headers = []
|
97
|
-
headerA.compact.each do |k|
|
98
|
-
duplicate_headers << k if headerA.select{|x| x == k}.size > 1
|
99
|
-
end
|
100
|
-
raise SmarterCSV::DuplicateHeaders , "ERROR: duplicate headers: #{duplicate_headers.join(',')}" unless duplicate_headers.empty?
|
101
|
-
|
102
|
-
if options[:required_headers] && options[:required_headers].is_a?(Array)
|
103
|
-
missing_headers = []
|
104
|
-
options[:required_headers].each do |k|
|
105
|
-
missing_headers << k unless headerA.include?(k)
|
106
|
-
end
|
107
|
-
raise SmarterCSV::MissingHeaders , "ERROR: missing headers: #{missing_headers.join(',')}" unless missing_headers.empty?
|
108
|
-
end
|
39
|
+
headerA, header_size = process_headers(fh, options)
|
109
40
|
|
110
41
|
# in case we use chunking.. we'll need to set it up..
|
111
42
|
if ! options[:chunk_size].nil? && options[:chunk_size].to_i > 0
|
@@ -118,15 +49,13 @@ module SmarterCSV
|
|
118
49
|
end
|
119
50
|
|
120
51
|
# now on to processing all the rest of the lines in the CSV file:
|
121
|
-
while !
|
122
|
-
line =
|
52
|
+
while ! fh.eof? # we can't use fh.readlines() here, because this would read the whole file into memory at once, and eof => true
|
53
|
+
line = readline_with_counts(fh, options)
|
123
54
|
|
124
55
|
# replace invalid byte sequence in UTF-8 with question mark to avoid errors
|
125
56
|
line = line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
|
126
57
|
|
127
|
-
file_line_count
|
128
|
-
csv_line_count += 1
|
129
|
-
print "processing file line %10d, csv line %10d\r" % [file_line_count, csv_line_count] if options[:verbose]
|
58
|
+
print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if options[:verbose]
|
130
59
|
|
131
60
|
next if options[:comment_regexp] && line =~ options[:comment_regexp] # ignore all comment lines if there are any
|
132
61
|
|
@@ -135,24 +64,17 @@ module SmarterCSV
|
|
135
64
|
# by detecting the existence of an uneven number of quote characters
|
136
65
|
multiline = line.count(options[:quote_char])%2 == 1 # should handle quote_char nil
|
137
66
|
while line.count(options[:quote_char])%2 == 1 # should handle quote_char nil
|
138
|
-
next_line =
|
67
|
+
next_line = fh.readline(options[:row_sep])
|
139
68
|
next_line = next_line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
|
140
69
|
line += next_line
|
141
|
-
file_line_count += 1
|
70
|
+
@file_line_count += 1
|
142
71
|
end
|
143
|
-
print "\nline contains uneven number of quote chars so including content through file line %d\n" % file_line_count if options[:verbose] && multiline
|
72
|
+
print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count if options[:verbose] && multiline
|
144
73
|
|
145
74
|
line.chomp!(options[:row_sep])
|
146
75
|
|
147
|
-
|
148
|
-
|
149
|
-
CSV.parse( line, **csv_options ).flatten.collect!{|x| x.nil? ? '' : x} # to deal with nil values from CSV.parse
|
150
|
-
rescue CSV::MalformedCSVError => e
|
151
|
-
raise $!, "#{$!} [SmarterCSV: csv line #{csv_line_count}]", $!.backtrace
|
152
|
-
end
|
153
|
-
else
|
154
|
-
dataA = line.split(options[:col_sep], header_size)
|
155
|
-
end
|
76
|
+
dataA, data_size = parse(line, options, header_size)
|
77
|
+
|
156
78
|
dataA.map!{|x| x.sub(/(#{options[:col_sep]})+\z/, '')} # remove any unwanted trailing col_sep characters at the end
|
157
79
|
dataA.map!{|x| x.strip} if options[:strip_whitespace]
|
158
80
|
|
@@ -208,7 +130,7 @@ module SmarterCSV
|
|
208
130
|
if use_chunks
|
209
131
|
chunk << hash # append temp result to chunk
|
210
132
|
|
211
|
-
if chunk.size >= chunk_size ||
|
133
|
+
if chunk.size >= chunk_size || fh.eof? # if chunk if full, or EOF reached
|
212
134
|
# do something with the chunk
|
213
135
|
if block_given?
|
214
136
|
yield chunk # do something with the hashes in the chunk in the block
|
@@ -249,7 +171,7 @@ module SmarterCSV
|
|
249
171
|
chunk = [] # initialize for next chunk of data
|
250
172
|
end
|
251
173
|
ensure
|
252
|
-
|
174
|
+
fh.close if fh.respond_to?(:close)
|
253
175
|
end
|
254
176
|
if block_given?
|
255
177
|
return chunk_count # when we do processing through a block we only care how many chunks we processed
|
@@ -268,6 +190,7 @@ module SmarterCSV
|
|
268
190
|
comment_regexp: nil, # was: /\A#/,
|
269
191
|
convert_values_to_numeric: true,
|
270
192
|
downcase_header: true,
|
193
|
+
duplicate_header_suffix: nil,
|
271
194
|
file_encoding: 'utf-8',
|
272
195
|
force_simple_split: false ,
|
273
196
|
force_utf8: false,
|
@@ -293,6 +216,62 @@ module SmarterCSV
|
|
293
216
|
}
|
294
217
|
end
|
295
218
|
|
219
|
+
def self.readline_with_counts(filehandle, options)
|
220
|
+
line = filehandle.readline(options[:row_sep])
|
221
|
+
@file_line_count += 1
|
222
|
+
@csv_line_count += 1
|
223
|
+
line
|
224
|
+
end
|
225
|
+
|
226
|
+
# parses a single line: either a CSV header and body line
|
227
|
+
# - quoting rules compared to RFC-4180 are somewhat relaxed
|
228
|
+
# - we are not assuming that quotes inside a fields need to be doubled
|
229
|
+
# - we are not assuming that all fields need to be quoted (0 is even)
|
230
|
+
# - works with multi-char col_sep
|
231
|
+
# - if header_size is given, only up to header_size fields are parsed
|
232
|
+
#
|
233
|
+
# We use header_size for parsing the body lines to make sure we always match the number of headers
|
234
|
+
# in case there are trailing col_sep characters in line
|
235
|
+
#
|
236
|
+
# Our convention is that empty fields are returned as empty strings, not as nil.
|
237
|
+
#
|
238
|
+
def self.parse(line, options, header_size = nil)
|
239
|
+
return [] if line.nil?
|
240
|
+
|
241
|
+
col_sep = options[:col_sep]
|
242
|
+
quote = options[:quote_char]
|
243
|
+
quote_count = 0
|
244
|
+
elements = []
|
245
|
+
start = 0
|
246
|
+
i = 0
|
247
|
+
|
248
|
+
while i < line.size do
|
249
|
+
if line[i...i+col_sep.size] == col_sep && quote_count.even?
|
250
|
+
break if !header_size.nil? && elements.size >= header_size
|
251
|
+
|
252
|
+
elements << cleanup_quotes(line[start...i], quote)
|
253
|
+
i += col_sep.size
|
254
|
+
start = i
|
255
|
+
else
|
256
|
+
quote_count += 1 if line[i] == quote
|
257
|
+
i += 1
|
258
|
+
end
|
259
|
+
end
|
260
|
+
elements << cleanup_quotes(line[start..-1], quote) if header_size.nil? || elements.size < header_size
|
261
|
+
[elements, elements.size]
|
262
|
+
end
|
263
|
+
|
264
|
+
def self.cleanup_quotes(field, quote)
|
265
|
+
return field if field.nil? || field !~ /#{quote}/
|
266
|
+
|
267
|
+
if field.start_with?(quote) && field.end_with?(quote)
|
268
|
+
field.delete_prefix!(quote)
|
269
|
+
field.delete_suffix!(quote)
|
270
|
+
end
|
271
|
+
field.gsub!("#{quote}#{quote}", quote)
|
272
|
+
field
|
273
|
+
end
|
274
|
+
|
296
275
|
def self.blank?(value)
|
297
276
|
case value
|
298
277
|
when Array
|
@@ -378,4 +357,105 @@ module SmarterCSV
|
|
378
357
|
k,_ = counts.max_by{|_,v| v}
|
379
358
|
return k # the most frequent one is it
|
380
359
|
end
|
360
|
+
|
361
|
+
def self.raw_hearder
|
362
|
+
@raw_header
|
363
|
+
end
|
364
|
+
|
365
|
+
def self.headers
|
366
|
+
@headers
|
367
|
+
end
|
368
|
+
|
369
|
+
def self.process_headers(filehandle, options)
|
370
|
+
@raw_header = nil
|
371
|
+
@headers = nil
|
372
|
+
if options[:headers_in_file] # extract the header line
|
373
|
+
# process the header line in the CSV file..
|
374
|
+
# the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
|
375
|
+
header = readline_with_counts(filehandle, options)
|
376
|
+
@raw_header = header
|
377
|
+
|
378
|
+
header = header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
|
379
|
+
header = header.sub(options[:comment_regexp],'') if options[:comment_regexp]
|
380
|
+
header = header.chomp(options[:row_sep])
|
381
|
+
|
382
|
+
header = header.gsub(options[:strip_chars_from_headers], '') if options[:strip_chars_from_headers]
|
383
|
+
|
384
|
+
file_headerA, file_header_size = parse(header, options)
|
385
|
+
|
386
|
+
file_headerA.map!{|x| x.gsub(%r/#{options[:quote_char]}/,'') }
|
387
|
+
file_headerA.map!{|x| x.strip} if options[:strip_whitespace]
|
388
|
+
unless options[:keep_original_headers]
|
389
|
+
file_headerA.map!{|x| x.gsub(/\s+|-+/,'_')}
|
390
|
+
file_headerA.map!{|x| x.downcase } if options[:downcase_header]
|
391
|
+
end
|
392
|
+
else
|
393
|
+
raise SmarterCSV::IncorrectOption , "ERROR: If :headers_in_file is set to false, you have to provide :user_provided_headers" unless options[:user_provided_headers]
|
394
|
+
end
|
395
|
+
if options[:user_provided_headers] && options[:user_provided_headers].class == Array && ! options[:user_provided_headers].empty?
|
396
|
+
# use user-provided headers
|
397
|
+
headerA = options[:user_provided_headers]
|
398
|
+
if defined?(file_header_size) && ! file_header_size.nil?
|
399
|
+
if headerA.size != file_header_size
|
400
|
+
raise SmarterCSV::HeaderSizeMismatch , "ERROR: :user_provided_headers defines #{headerA.size} headers != CSV-file #{input} has #{file_header_size} headers"
|
401
|
+
else
|
402
|
+
# we could print out the mapping of file_headerA to headerA here
|
403
|
+
end
|
404
|
+
end
|
405
|
+
else
|
406
|
+
headerA = file_headerA
|
407
|
+
end
|
408
|
+
|
409
|
+
# detect duplicate headers and disambiguate
|
410
|
+
headerA = process_duplicate_headers(headerA, options) if options[:duplicate_header_suffix]
|
411
|
+
header_size = headerA.size # used for splitting lines
|
412
|
+
|
413
|
+
headerA.map!{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
|
414
|
+
|
415
|
+
unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
|
416
|
+
key_mappingH = options[:key_mapping]
|
417
|
+
|
418
|
+
# do some key mapping on the keys in the file header
|
419
|
+
# if you want to completely delete a key, then map it to nil or to ''
|
420
|
+
if ! key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
|
421
|
+
# we can't map keys that are not there
|
422
|
+
missing_keys = key_mappingH.keys - headerA
|
423
|
+
raise(SmarterCSV::KeyMappingError, "missing header(s): #{missing_keys.join(",")}") unless missing_keys.empty?
|
424
|
+
|
425
|
+
headerA.map!{|x| key_mappingH.has_key?(x) ? (key_mappingH[x].nil? ? nil : key_mappingH[x]) : (options[:remove_unmapped_keys] ? nil : x)}
|
426
|
+
end
|
427
|
+
end
|
428
|
+
|
429
|
+
# header_validations
|
430
|
+
duplicate_headers = []
|
431
|
+
headerA.compact.each do |k|
|
432
|
+
duplicate_headers << k if headerA.select{|x| x == k}.size > 1
|
433
|
+
end
|
434
|
+
raise SmarterCSV::DuplicateHeaders , "ERROR: duplicate headers: #{duplicate_headers.join(',')}" unless duplicate_headers.empty?
|
435
|
+
|
436
|
+
if options[:required_headers] && options[:required_headers].is_a?(Array)
|
437
|
+
missing_headers = []
|
438
|
+
options[:required_headers].each do |k|
|
439
|
+
missing_headers << k unless headerA.include?(k)
|
440
|
+
end
|
441
|
+
raise SmarterCSV::MissingHeaders , "ERROR: missing headers: #{missing_headers.join(',')}" unless missing_headers.empty?
|
442
|
+
end
|
443
|
+
|
444
|
+
@headers = headerA
|
445
|
+
[headerA, header_size]
|
446
|
+
end
|
447
|
+
|
448
|
+
def self.process_duplicate_headers(headers, options)
|
449
|
+
counts = Hash.new(0)
|
450
|
+
result = []
|
451
|
+
headers.each do |key|
|
452
|
+
counts[key] += 1
|
453
|
+
if counts[key] == 1
|
454
|
+
result << key
|
455
|
+
else
|
456
|
+
result << [key, options[:duplicate_header_suffix], counts[key]].join
|
457
|
+
end
|
458
|
+
end
|
459
|
+
result
|
460
|
+
end
|
381
461
|
end
|
data/lib/smarter_csv/version.rb
CHANGED
data/smarter_csv.gemspec
CHANGED
@@ -16,9 +16,9 @@ Gem::Specification.new do |spec|
|
|
16
16
|
spec.executables = spec.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
|
17
17
|
spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
|
18
18
|
spec.require_paths = ["lib"]
|
19
|
-
spec.requirements = ['csv'] # for CSV.parse() only needed in case we have quoted fields
|
20
19
|
spec.add_development_dependency "rspec"
|
21
20
|
spec.add_development_dependency "simplecov"
|
21
|
+
spec.add_development_dependency "awesome_print"
|
22
22
|
# spec.add_development_dependency "guard-rspec"
|
23
23
|
|
24
24
|
spec.metadata["homepage_uri"] = spec.homepage
|
@@ -0,0 +1,76 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
fixture_path = 'spec/fixtures'
|
4
|
+
|
5
|
+
describe 'duplicate headers' do
|
6
|
+
describe 'without special handling / default behavior' do
|
7
|
+
it 'raises error on duplicate headers' do
|
8
|
+
expect {
|
9
|
+
SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", {})
|
10
|
+
}.to raise_exception(SmarterCSV::DuplicateHeaders)
|
11
|
+
end
|
12
|
+
|
13
|
+
it 'raises error on duplicate given headers' do
|
14
|
+
expect {
|
15
|
+
options = {:user_provided_headers => [:a,:b,:c,:d,:a]}
|
16
|
+
SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
17
|
+
}.to raise_exception(SmarterCSV::DuplicateHeaders)
|
18
|
+
end
|
19
|
+
|
20
|
+
it 'raises error on missing mapped headers and includes missing headers in message' do
|
21
|
+
expect {
|
22
|
+
# the mapping is right, but the underlying csv file is bad
|
23
|
+
options = {:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :manager_email => :d, :age => :e} }
|
24
|
+
SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
25
|
+
}.to raise_exception(SmarterCSV::KeyMappingError, "missing header(s): manager_email")
|
26
|
+
end
|
27
|
+
end
|
28
|
+
|
29
|
+
describe 'with special handling' do
|
30
|
+
context 'with given suffix' do
|
31
|
+
let(:options) { {duplicate_header_suffix: '_'} }
|
32
|
+
|
33
|
+
it 'reads whole file' do
|
34
|
+
data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
35
|
+
expect(data.size).to eq 2
|
36
|
+
end
|
37
|
+
|
38
|
+
it 'generates the correct keys' do
|
39
|
+
data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
40
|
+
expect(data.first.keys).to eq [:email, :firstname, :lastname, :email_2, :age]
|
41
|
+
end
|
42
|
+
|
43
|
+
it 'enumerates when duplicate headers are given' do
|
44
|
+
options.merge!({:user_provided_headers => [:a,:b,:c,:a,:a]})
|
45
|
+
data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
46
|
+
expect(data.first.keys).to eq [:a, :b, :c, :a_2, :a_3]
|
47
|
+
end
|
48
|
+
|
49
|
+
it 'can remap duplicated headers' do
|
50
|
+
options.merge!({:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :email_2 => :d, :age => :e}})
|
51
|
+
data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
52
|
+
expect(data.first).to eq({a: 'tom@bla.com', b: 'Tom', c: 'Sawyer', d: 'mike@bla.com', e: 34})
|
53
|
+
end
|
54
|
+
end
|
55
|
+
|
56
|
+
context 'with empty suffix' do
|
57
|
+
let(:options) { {duplicate_header_suffix: ''} }
|
58
|
+
|
59
|
+
it 'reads whole file' do
|
60
|
+
data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
61
|
+
expect(data.size).to eq 2
|
62
|
+
end
|
63
|
+
|
64
|
+
it 'generates the correct keys' do
|
65
|
+
data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
66
|
+
expect(data.first.keys).to eq [:email, :firstname, :lastname, :email2, :age]
|
67
|
+
end
|
68
|
+
|
69
|
+
it 'enumerates when duplicate headers are given' do
|
70
|
+
options.merge!({:user_provided_headers => [:a,:b,:c,:a,:a]})
|
71
|
+
data = SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
72
|
+
expect(data.first.keys).to eq [:a, :b, :c, :a2, :a3]
|
73
|
+
end
|
74
|
+
end
|
75
|
+
end
|
76
|
+
end
|
@@ -3,28 +3,6 @@ require 'spec_helper'
|
|
3
3
|
fixture_path = 'spec/fixtures'
|
4
4
|
|
5
5
|
describe 'test exceptions for invalid headers' do
|
6
|
-
it 'raises error on duplicate headers' do
|
7
|
-
expect {
|
8
|
-
SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", {})
|
9
|
-
}.to raise_exception(SmarterCSV::DuplicateHeaders)
|
10
|
-
end
|
11
|
-
|
12
|
-
it 'raises error on duplicate given headers' do
|
13
|
-
expect {
|
14
|
-
options = {:user_provided_headers => [:a,:b,:c,:d,:a]}
|
15
|
-
SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
16
|
-
}.to raise_exception(SmarterCSV::DuplicateHeaders)
|
17
|
-
end
|
18
|
-
|
19
|
-
it 'raises error on duplicate mapped headers' do
|
20
|
-
expect {
|
21
|
-
# the mapping is right, but the underlying csv file is bad
|
22
|
-
options = {:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :manager_email => :d, :age => :e} }
|
23
|
-
SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
|
24
|
-
}.to raise_exception(SmarterCSV::DuplicateHeaders)
|
25
|
-
end
|
26
|
-
|
27
|
-
|
28
6
|
it 'does not raise an error if no required headers are given' do
|
29
7
|
options = {:required_headers => nil} # order does not matter
|
30
8
|
data = SmarterCSV.process("#{fixture_path}/user_import.csv", options)
|
@@ -49,4 +27,12 @@ describe 'test exceptions for invalid headers' do
|
|
49
27
|
SmarterCSV.process("#{fixture_path}/user_import.csv", options)
|
50
28
|
}.to raise_exception(SmarterCSV::MissingHeaders)
|
51
29
|
end
|
30
|
+
|
31
|
+
it 'raises error on missing mapped headers and includes missing headers in message' do
|
32
|
+
expect {
|
33
|
+
# :age does not exist in the CSV header
|
34
|
+
options = {:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :manager_email => :d, :age => :e} }
|
35
|
+
SmarterCSV.process("#{fixture_path}/user_import.csv", options)
|
36
|
+
}.to raise_exception(SmarterCSV::KeyMappingError, "missing header(s): age")
|
37
|
+
end
|
52
38
|
end
|
@@ -2,16 +2,24 @@ require 'spec_helper'
|
|
2
2
|
|
3
3
|
fixture_path = 'spec/fixtures'
|
4
4
|
|
5
|
-
|
6
|
-
|
7
|
-
|
8
|
-
context "malformed header" do
|
5
|
+
# according to RFC-4180 quotes inside of "words" shouldbe doubled, but our parser is robust against that.
|
6
|
+
describe 'malformed CSV quotes' do
|
7
|
+
context "malformed quotes in header" do
|
9
8
|
let(:csv_path) { "#{fixture_path}/malformed_header.csv" }
|
10
|
-
it
|
9
|
+
it 'should be resilient against single quotes' do
|
10
|
+
data = SmarterCSV.process(csv_path)
|
11
|
+
expect(data[0]).to eq({:name=>"Arnold Schwarzenegger", :dobdob=>"1947-07-30"})
|
12
|
+
expect(data[1]).to eq({:name=>"Jeff Bridges", :dobdob=>"1949-12-04"})
|
13
|
+
end
|
11
14
|
end
|
12
15
|
|
13
|
-
context "malformed content" do
|
16
|
+
context "malformed quotes in content" do
|
14
17
|
let(:csv_path) { "#{fixture_path}/malformed.csv" }
|
15
|
-
|
18
|
+
|
19
|
+
it 'should be resilient against single quotes' do
|
20
|
+
data = SmarterCSV.process(csv_path)
|
21
|
+
expect(data[0]).to eq({:name=>"Arnold Schwarzenegger", :dob=>"1947-07-30"})
|
22
|
+
expect(data[1]).to eq({:name=>"Jeff \"the dude\" Bridges", :dob=>"1949-12-04"})
|
23
|
+
end
|
16
24
|
end
|
17
25
|
end
|
@@ -2,23 +2,28 @@ require 'spec_helper'
|
|
2
2
|
|
3
3
|
fixture_path = 'spec/fixtures'
|
4
4
|
|
5
|
-
describe '
|
6
|
-
|
7
|
-
|
8
|
-
|
5
|
+
describe 'no header in file' do
|
6
|
+
let(:headers) { [:a,:b,:c,:d,:e,:f] }
|
7
|
+
let(:options) { {:headers_in_file => false, :user_provided_headers => headers} }
|
8
|
+
subject(:data) { SmarterCSV.process("#{fixture_path}/no_header.csv", options) }
|
9
|
+
|
10
|
+
it 'load the correct number of records' do
|
9
11
|
data.size.should == 5
|
10
|
-
|
11
|
-
data.each{|item| item.keys.each{|x| x.class.should be == Symbol}}
|
12
|
+
end
|
12
13
|
|
13
|
-
|
14
|
+
it 'uses given symbols for all records' do
|
15
|
+
data.each do |item|
|
14
16
|
item.keys.each do |key|
|
15
17
|
[:a,:b,:c,:d,:e,:f].should include( key )
|
16
18
|
end
|
17
19
|
end
|
18
|
-
|
19
|
-
data.each do |h|
|
20
|
-
h.size.should <= 6
|
21
|
-
end
|
22
20
|
end
|
23
21
|
|
22
|
+
it 'loads the correct data' do
|
23
|
+
data[0].should == {a: "Dan", b: "McAllister", c: 2, d: 0}
|
24
|
+
data[1].should == {a: "Lucy", b: "Laweless", d: 5, e: 0}
|
25
|
+
data[2].should == {a: "Miles", b: "O'Brian", c: 0, d: 0, e: 0, f: 21}
|
26
|
+
data[3].should == {a: "Nancy", b: "Homes", c: 2, d: 0, e: 1}
|
27
|
+
data[4].should == {a: "Hernán", b: "Curaçon", c: 3, d: 0, e: 0}
|
28
|
+
end
|
24
29
|
end
|
@@ -0,0 +1,61 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
describe 'parse with col_sep' do
|
4
|
+
let(:options) { {quote_char: '"'} }
|
5
|
+
|
6
|
+
it 'parses with comma' do
|
7
|
+
line = "a,b,,d"
|
8
|
+
options.merge!({col_sep: ","})
|
9
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
10
|
+
expect(array).to eq ['a', 'b', '', 'd']
|
11
|
+
expect(array_size).to eq 4
|
12
|
+
end
|
13
|
+
|
14
|
+
it 'parses trailing commas' do
|
15
|
+
line = "a,b,c,,"
|
16
|
+
options.merge!({col_sep: ","})
|
17
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
18
|
+
expect(array).to eq ['a', 'b', 'c', '', '']
|
19
|
+
expect(array_size).to eq 5
|
20
|
+
end
|
21
|
+
|
22
|
+
it 'parses with space' do
|
23
|
+
line = "a b d"
|
24
|
+
options.merge!({col_sep: " "})
|
25
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
26
|
+
expect(array).to eq ['a', 'b', '', 'd']
|
27
|
+
expect(array_size).to eq 4
|
28
|
+
end
|
29
|
+
|
30
|
+
it 'parses with tab' do
|
31
|
+
line = "a\tb\t\td"
|
32
|
+
options.merge!({col_sep: "\t"})
|
33
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
34
|
+
expect(array).to eq ['a', 'b', '', 'd']
|
35
|
+
expect(array_size).to eq 4
|
36
|
+
end
|
37
|
+
|
38
|
+
it 'parses with multiple space separator' do
|
39
|
+
line = "a b d"
|
40
|
+
options.merge!({col_sep: " "})
|
41
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
42
|
+
expect(array).to eq ['a b', '', 'd']
|
43
|
+
expect(array_size).to eq 3
|
44
|
+
end
|
45
|
+
|
46
|
+
it 'parses with multiple char separator' do
|
47
|
+
line = '<=><=>A<=>B<=>C'
|
48
|
+
options.merge!({col_sep: "<=>"})
|
49
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
50
|
+
expect(array).to eq ["", "", "A", "B", "C"]
|
51
|
+
expect(array_size).to eq 5
|
52
|
+
end
|
53
|
+
|
54
|
+
it 'parses trailing multiple char separator' do
|
55
|
+
line = '<=><=>A<=>B<=>C<=><=>'
|
56
|
+
options.merge!({col_sep: "<=>"})
|
57
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
58
|
+
expect(array).to eq ["", "", "A", "B", "C", "", ""]
|
59
|
+
expect(array_size).to eq 7
|
60
|
+
end
|
61
|
+
end
|
@@ -0,0 +1,74 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
describe 'old CSV library parsing tests' do
|
4
|
+
let(:options) { {quote_char: '"', col_sep: ","} }
|
5
|
+
|
6
|
+
[ ["\t", ["\t"]],
|
7
|
+
["foo,\"\"\"\"\"\",baz", ["foo", "\"\"", "baz"]],
|
8
|
+
["foo,\"\"\"bar\"\"\",baz", ["foo", "\"bar\"", "baz"]],
|
9
|
+
["\"\"\"\n\",\"\"\"\n\"", ["\"\n", "\"\n"]],
|
10
|
+
["foo,\"\r\n\",baz", ["foo", "\r\n", "baz"]],
|
11
|
+
["\"\"", [""]],
|
12
|
+
["foo,\"\"\"\",baz", ["foo", "\"", "baz"]],
|
13
|
+
["foo,\"\r.\n\",baz", ["foo", "\r.\n", "baz"]],
|
14
|
+
["foo,\"\r\",baz", ["foo", "\r", "baz"]],
|
15
|
+
["foo,\"\",baz", ["foo", "", "baz"]],
|
16
|
+
["\",\"", [","]],
|
17
|
+
["foo", ["foo"]],
|
18
|
+
[",,", ['', '', '']],
|
19
|
+
[",", ['', '']],
|
20
|
+
["foo,\"\n\",baz", ["foo", "\n", "baz"]],
|
21
|
+
["foo,,baz", ["foo", '', "baz"]],
|
22
|
+
["\"\"\"\r\",\"\"\"\r\"", ["\"\r", "\"\r"]],
|
23
|
+
["\",\",\",\"", [",", ","]],
|
24
|
+
["foo,bar,", ["foo", "bar", '']],
|
25
|
+
[",foo,bar", ['', "foo", "bar"]],
|
26
|
+
["foo,bar", ["foo", "bar"]],
|
27
|
+
[";", [";"]],
|
28
|
+
["\t,\t", ["\t", "\t"]],
|
29
|
+
["foo,\"\r\n\r\",baz", ["foo", "\r\n\r", "baz"]],
|
30
|
+
["foo,\"\r\n\n\",baz", ["foo", "\r\n\n", "baz"]],
|
31
|
+
["foo,\"foo,bar\",baz", ["foo", "foo,bar", "baz"]],
|
32
|
+
[";,;", [";", ";"]]
|
33
|
+
].each do |line, result|
|
34
|
+
it "parses #{line}" do
|
35
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
36
|
+
expect(array).to eq result
|
37
|
+
end
|
38
|
+
end
|
39
|
+
|
40
|
+
[ ["foo,\"\"\"\"\"\",baz", ["foo", "\"\"", "baz"]],
|
41
|
+
["foo,\"\"\"bar\"\"\",baz", ["foo", "\"bar\"", "baz"]],
|
42
|
+
["foo,\"\r\n\",baz", ["foo", "\r\n", "baz"]],
|
43
|
+
["\"\"", [""]],
|
44
|
+
["foo,\"\"\"\",baz", ["foo", "\"", "baz"]],
|
45
|
+
["foo,\"\r.\n\",baz", ["foo", "\r.\n", "baz"]],
|
46
|
+
["foo,\"\r\",baz", ["foo", "\r", "baz"]],
|
47
|
+
["foo,\"\",baz", ["foo", "", "baz"]],
|
48
|
+
["foo", ["foo"]],
|
49
|
+
[",,", ['', '', '']],
|
50
|
+
[",", ['', '']],
|
51
|
+
["foo,\"\n\",baz", ["foo", "\n", "baz"]],
|
52
|
+
["foo,,baz", ["foo", '', "baz"]],
|
53
|
+
["foo,bar", ["foo", "bar"]],
|
54
|
+
["foo,\"\r\n\n\",baz", ["foo", "\r\n\n", "baz"]],
|
55
|
+
["foo,\"foo,bar\",baz", ["foo", "foo,bar", "baz"]]
|
56
|
+
].each do |line, result|
|
57
|
+
it "parses #{line}" do
|
58
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
59
|
+
expect(array).to eq result
|
60
|
+
end
|
61
|
+
end
|
62
|
+
|
63
|
+
it 'mixed quotes' do
|
64
|
+
line = %Q{Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K}
|
65
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
66
|
+
expect(array).to eq ["Ten Thousand", "10000", " 2710 ", "", "10,000", "It's \"10 Grand\", baby", "10K"]
|
67
|
+
end
|
68
|
+
|
69
|
+
it 'single quotes in fields' do
|
70
|
+
line = 'Indoor Chrome,49.2"" L x 49.2"" W x 20.5"" H,Chrome,"Crystal,Metal,Wood",23.12'
|
71
|
+
array, array_size = SmarterCSV.send(:parse, line, options)
|
72
|
+
expect(array).to eq ['Indoor Chrome', '49.2" L x 49.2" W x 20.5" H', 'Chrome', 'Crystal,Metal,Wood', '23.12']
|
73
|
+
end
|
74
|
+
end
|
@@ -0,0 +1,170 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
fixture_path = 'spec/fixtures'
|
4
|
+
|
5
|
+
describe 'fulfills RFC-4180 and more' do
|
6
|
+
let(:options) { {col_sep: ',', row_sep: $INPUT_RECORD_SEPARATOR, quote_char: '"' } }
|
7
|
+
|
8
|
+
context 'parses simple CSV' do
|
9
|
+
context 'RFC-4180' do
|
10
|
+
it 'separating on col_sep' do
|
11
|
+
line = 'aaa,bbb,ccc'
|
12
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [%w[aaa bbb ccc], 3]
|
13
|
+
end
|
14
|
+
|
15
|
+
it 'preserves whitespace' do
|
16
|
+
line = ' aaa , bbb , ccc '
|
17
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
18
|
+
[' aaa ', ' bbb ', ' ccc '], 3
|
19
|
+
]
|
20
|
+
end
|
21
|
+
end
|
22
|
+
|
23
|
+
context 'extending RFC-4180' do
|
24
|
+
it 'with extra col_sep' do
|
25
|
+
line = 'aaa,bbb,ccc,'
|
26
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
27
|
+
['aaa', 'bbb', 'ccc', ''], 4
|
28
|
+
]
|
29
|
+
end
|
30
|
+
|
31
|
+
it 'with extra col_sep with given header_size' do
|
32
|
+
line = 'aaa,bbb,ccc,'
|
33
|
+
expect( SmarterCSV.send(:parse, line, options, 3)).to eq [
|
34
|
+
['aaa', 'bbb', 'ccc'], 3
|
35
|
+
]
|
36
|
+
end
|
37
|
+
|
38
|
+
it 'with multiple extra col_sep' do
|
39
|
+
line = 'aaa,bbb,ccc,,,'
|
40
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
41
|
+
['aaa', 'bbb', 'ccc', '', '', ''], 6
|
42
|
+
]
|
43
|
+
end
|
44
|
+
|
45
|
+
it 'with multiple extra col_sep' do
|
46
|
+
line = 'aaa,bbb,ccc,,,'
|
47
|
+
expect( SmarterCSV.send(:parse, line, options, 3)).to eq [
|
48
|
+
['aaa', 'bbb', 'ccc'], 3
|
49
|
+
]
|
50
|
+
end
|
51
|
+
|
52
|
+
it 'with multiple complex col_sep' do
|
53
|
+
line = 'aaa<=>bbb<=>ccc<=><=><=>'
|
54
|
+
expect( SmarterCSV.send(:parse, line, options.merge({col_sep: '<=>'}))).to eq [
|
55
|
+
['aaa', 'bbb', 'ccc', '', '', ''], 6
|
56
|
+
]
|
57
|
+
end
|
58
|
+
|
59
|
+
it 'with multiple complex col_sep with given header_size' do
|
60
|
+
line = 'aaa<=>bbb<=>ccc<=><=><=>'
|
61
|
+
expect( SmarterCSV.send(:parse, line, options.merge({col_sep: '<=>'}), 3)).to eq [
|
62
|
+
['aaa', 'bbb', 'ccc'], 3
|
63
|
+
]
|
64
|
+
end
|
65
|
+
end
|
66
|
+
end
|
67
|
+
|
68
|
+
context 'parses quoted CSV' do
|
69
|
+
context 'RFC-4180' do
|
70
|
+
it 'separating on col_sep' do
|
71
|
+
line = '"aaa","bbb","ccc"'
|
72
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [%w[aaa bbb ccc], 3]
|
73
|
+
end
|
74
|
+
|
75
|
+
it 'parses corner case correctly' do
|
76
|
+
line = '"Board 4""","$17.40","10000003427"'
|
77
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
78
|
+
['Board 4"', '$17.40', '10000003427'], 3
|
79
|
+
]
|
80
|
+
end
|
81
|
+
|
82
|
+
it 'quoted parts can contain spaces' do
|
83
|
+
line = '" aaa1 aaa2 "," bbb1 bbb2 "," ccc1 ccc2 "'
|
84
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
85
|
+
[' aaa1 aaa2 ', ' bbb1 bbb2 ', ' ccc1 ccc2 '], 3
|
86
|
+
]
|
87
|
+
end
|
88
|
+
|
89
|
+
it 'quoted parts can contain row_sep' do
|
90
|
+
line = '"aaa1, aaa2","bbb1, bbb2","ccc1, ccc2"'
|
91
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
92
|
+
['aaa1, aaa2', 'bbb1, bbb2', 'ccc1, ccc2'], 3
|
93
|
+
]
|
94
|
+
end
|
95
|
+
|
96
|
+
it 'quoted parts can contain row_sep' do
|
97
|
+
line = '"aaa1, ""aaa2"", aaa3","""bbb1"", bbb2","ccc1, ""ccc2"""'
|
98
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
99
|
+
['aaa1, "aaa2", aaa3', '"bbb1", bbb2', 'ccc1, "ccc2"'], 3
|
100
|
+
]
|
101
|
+
end
|
102
|
+
|
103
|
+
it 'some fields are quoted' do
|
104
|
+
line = '1,"board 4""",12.95'
|
105
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
106
|
+
['1', 'board 4"', '12.95'], 3
|
107
|
+
]
|
108
|
+
end
|
109
|
+
|
110
|
+
it 'separating on col_sep' do
|
111
|
+
line = '"some","thing","""completely"" different"'
|
112
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
113
|
+
['some', 'thing', '"completely" different'], 3
|
114
|
+
]
|
115
|
+
end
|
116
|
+
end
|
117
|
+
|
118
|
+
context 'extending RFC-4180' do
|
119
|
+
it 'with extra col_sep, without given header_size' do
|
120
|
+
line = '"aaa","bbb","ccc",'
|
121
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
122
|
+
['aaa', 'bbb', 'ccc', ''], 4
|
123
|
+
]
|
124
|
+
end
|
125
|
+
|
126
|
+
it 'with extra col_sep, with given header_size' do
|
127
|
+
line = '"aaa","bbb","ccc",'
|
128
|
+
expect( SmarterCSV.send(:parse, line, options, 3)).to eq [%w[aaa bbb ccc], 3]
|
129
|
+
end
|
130
|
+
|
131
|
+
it 'with multiple extra col_sep, without given header_size' do
|
132
|
+
line = '"aaa","bbb","ccc",,,'
|
133
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
134
|
+
['aaa', 'bbb', 'ccc', '', '', ''], 6
|
135
|
+
]
|
136
|
+
end
|
137
|
+
|
138
|
+
it 'with multiple extra col_sep, with given header_size' do
|
139
|
+
line = '"aaa","bbb","ccc",,,'
|
140
|
+
expect( SmarterCSV.send(:parse, line, options, 3)).to eq [
|
141
|
+
['aaa', 'bbb', 'ccc'], 3
|
142
|
+
]
|
143
|
+
end
|
144
|
+
|
145
|
+
it 'with multiple complex extra col_sep, without given header_size' do
|
146
|
+
line = '"aaa"<=>"bbb"<=>"ccc"<=><=><=>'
|
147
|
+
expect( SmarterCSV.send(:parse, line, options.merge({col_sep: '<=>'}))).to eq [
|
148
|
+
['aaa', 'bbb', 'ccc', '', '', ''], 6
|
149
|
+
]
|
150
|
+
end
|
151
|
+
|
152
|
+
it 'with multiple complex extra col_sep, with given header_size' do
|
153
|
+
line = '"aaa"<=>"bbb"<=>"ccc"<=><=><=>'
|
154
|
+
expect( SmarterCSV.send(:parse, line, options.merge({col_sep: '<=>'}), 3)).to eq [
|
155
|
+
['aaa', 'bbb', 'ccc'], 3
|
156
|
+
]
|
157
|
+
end
|
158
|
+
end
|
159
|
+
end
|
160
|
+
|
161
|
+
# relaxed parsing compared to RFC-4180
|
162
|
+
context 'liberal_parsing' do
|
163
|
+
it 'parses corner case correctly' do
|
164
|
+
line = 'is,this "three, or four",fields'
|
165
|
+
expect( SmarterCSV.send(:parse, line, options)).to eq [
|
166
|
+
['is', 'this "three, or four"', 'fields'], 3
|
167
|
+
]
|
168
|
+
end
|
169
|
+
end
|
170
|
+
end
|
@@ -3,7 +3,6 @@ require 'spec_helper'
|
|
3
3
|
fixture_path = 'spec/fixtures'
|
4
4
|
|
5
5
|
describe 'loading file with quoted fields' do
|
6
|
-
|
7
6
|
it 'leaving the quotes in the data' do
|
8
7
|
options = {}
|
9
8
|
data = SmarterCSV.process("#{fixture_path}/quoted.csv", options)
|
@@ -12,6 +11,7 @@ describe 'loading file with quoted fields' do
|
|
12
11
|
data[1][:description].should be_nil
|
13
12
|
data[2][:model].should eq 'Venture "Extended Edition, Very Large"'
|
14
13
|
data[2][:description].should be_nil
|
14
|
+
data[3][:description].should eq 'MUST SELL! air, moon roof, loaded'
|
15
15
|
data.each do |h|
|
16
16
|
h[:year].class.should eq Fixnum
|
17
17
|
h[:make].should_not be_nil
|
@@ -20,17 +20,21 @@ describe 'loading file with quoted fields' do
|
|
20
20
|
end
|
21
21
|
end
|
22
22
|
|
23
|
-
|
23
|
+
# quotes inside quoted fields need to be escaped by another double-quote
|
24
24
|
it 'removes quotes around quoted fields, but not inside data' do
|
25
25
|
options = {}
|
26
26
|
data = SmarterCSV.process("#{fixture_path}/quote_char.csv", options)
|
27
27
|
|
28
28
|
data.length.should eq 6
|
29
|
+
data[0][:first_name].should eq "\"John"
|
30
|
+
data[0][:last_name].should eq "Cooke\""
|
29
31
|
data[1][:first_name].should eq "Jam\ne\nson\""
|
30
32
|
data[2][:first_name].should eq "\"Jean"
|
33
|
+
data[4][:first_name].should eq "Bo\"bbie"
|
34
|
+
data[5][:first_name].should eq 'Mica'
|
35
|
+
data[5][:last_name].should eq 'Copeland'
|
31
36
|
end
|
32
37
|
|
33
|
-
|
34
38
|
# NOTE: quotes inside headers need to be escaped by doubling them
|
35
39
|
# e.g. 'correct ""EXAMPLE""'
|
36
40
|
# this escaping is illegal: 'incorrect \"EXAMPLE\"' <-- this caused CSV parsing error
|
@@ -43,6 +47,6 @@ describe 'loading file with quoted fields' do
|
|
43
47
|
data.length.should eq 3
|
44
48
|
data.first.keys[2].should eq :isbn
|
45
49
|
data.first.keys[3].should eq :discounted_price
|
50
|
+
data[1][:author].should eq 'Timothy "The Parser" Campbell'
|
46
51
|
end
|
47
|
-
|
48
52
|
end
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: smarter_csv
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.6.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Tilo Sloboda
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2022-
|
11
|
+
date: 2022-05-03 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: rspec
|
@@ -38,6 +38,20 @@ dependencies:
|
|
38
38
|
- - ">="
|
39
39
|
- !ruby/object:Gem::Version
|
40
40
|
version: '0'
|
41
|
+
- !ruby/object:Gem::Dependency
|
42
|
+
name: awesome_print
|
43
|
+
requirement: !ruby/object:Gem::Requirement
|
44
|
+
requirements:
|
45
|
+
- - ">="
|
46
|
+
- !ruby/object:Gem::Version
|
47
|
+
version: '0'
|
48
|
+
type: :development
|
49
|
+
prerelease: false
|
50
|
+
version_requirements: !ruby/object:Gem::Requirement
|
51
|
+
requirements:
|
52
|
+
- - ">="
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
version: '0'
|
41
55
|
description: Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, with
|
42
56
|
optional features for processing large files in parallel, embedded comments, unusual
|
43
57
|
field- and record-separators, flexible mapping of CSV-headers to Hash-keys
|
@@ -112,6 +126,7 @@ files:
|
|
112
126
|
- spec/smarter_csv/close_file_spec.rb
|
113
127
|
- spec/smarter_csv/column_separator_spec.rb
|
114
128
|
- spec/smarter_csv/convert_values_to_numeric_spec.rb
|
129
|
+
- spec/smarter_csv/duplicate_headers_spec.rb
|
115
130
|
- spec/smarter_csv/empty_columns_spec.rb
|
116
131
|
- spec/smarter_csv/extenstions_spec.rb
|
117
132
|
- spec/smarter_csv/hard_sample_spec.rb
|
@@ -125,6 +140,9 @@ files:
|
|
125
140
|
- spec/smarter_csv/malformed_spec.rb
|
126
141
|
- spec/smarter_csv/no_header_spec.rb
|
127
142
|
- spec/smarter_csv/not_downcase_header_spec.rb
|
143
|
+
- spec/smarter_csv/parse/column_separator_spec.rb
|
144
|
+
- spec/smarter_csv/parse/old_csv_library_spec.rb
|
145
|
+
- spec/smarter_csv/parse/rfc4180_and_more_spec.rb
|
128
146
|
- spec/smarter_csv/problematic.rb
|
129
147
|
- spec/smarter_csv/quoted_spec.rb
|
130
148
|
- spec/smarter_csv/remove_empty_values_spec.rb
|
@@ -160,8 +178,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
160
178
|
- - ">="
|
161
179
|
- !ruby/object:Gem::Version
|
162
180
|
version: '0'
|
163
|
-
requirements:
|
164
|
-
- csv
|
181
|
+
requirements: []
|
165
182
|
rubygems_version: 3.1.6
|
166
183
|
signing_key:
|
167
184
|
specification_version: 4
|
@@ -218,6 +235,7 @@ test_files:
|
|
218
235
|
- spec/smarter_csv/close_file_spec.rb
|
219
236
|
- spec/smarter_csv/column_separator_spec.rb
|
220
237
|
- spec/smarter_csv/convert_values_to_numeric_spec.rb
|
238
|
+
- spec/smarter_csv/duplicate_headers_spec.rb
|
221
239
|
- spec/smarter_csv/empty_columns_spec.rb
|
222
240
|
- spec/smarter_csv/extenstions_spec.rb
|
223
241
|
- spec/smarter_csv/hard_sample_spec.rb
|
@@ -231,6 +249,9 @@ test_files:
|
|
231
249
|
- spec/smarter_csv/malformed_spec.rb
|
232
250
|
- spec/smarter_csv/no_header_spec.rb
|
233
251
|
- spec/smarter_csv/not_downcase_header_spec.rb
|
252
|
+
- spec/smarter_csv/parse/column_separator_spec.rb
|
253
|
+
- spec/smarter_csv/parse/old_csv_library_spec.rb
|
254
|
+
- spec/smarter_csv/parse/rfc4180_and_more_spec.rb
|
234
255
|
- spec/smarter_csv/problematic.rb
|
235
256
|
- spec/smarter_csv/quoted_spec.rb
|
236
257
|
- spec/smarter_csv/remove_empty_values_spec.rb
|