smarter_csv 1.5.1 → 1.6.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +11 -1
- data/CONTRIBUTORS.md +1 -0
- data/README.md +8 -4
- data/lib/smarter_csv/smarter_csv.rb +92 -42
- data/lib/smarter_csv/version.rb +1 -1
- data/smarter_csv.gemspec +1 -1
- data/spec/smarter_csv/duplicate_headers_spec.rb +4 -4
- data/spec/smarter_csv/invalid_headers_spec.rb +4 -4
- data/spec/smarter_csv/malformed_spec.rb +15 -7
- data/spec/smarter_csv/parse/column_separator_spec.rb +61 -0
- data/spec/smarter_csv/parse/old_csv_library_spec.rb +74 -0
- data/spec/smarter_csv/parse/rfc4180_and_more_spec.rb +170 -0
- data/spec/smarter_csv/quoted_spec.rb +8 -4
- metadata +23 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9be5f053e15e157d7d28555b4de894d2761d5918203da45f5fc4e6c5adcc2a3f
+  data.tar.gz: a47394f3d1f985960a64abf1a43ce6ebf9b8217af2c01a0c5f053af8c77c09ae
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f27113af8a5771d89ac5c8783f1f69645c24bb576dafd97de17b0d8db8fff74dc396b42450802f9332c9f8b32a02ee18dabbc5dbc2daa91c75f957a678e99099
+  data.tar.gz: 5b2c2f3cbfc17b43c030c4c4c261962818bfbb2ce1a0ecd88f394682b75b75a64a2d8b6afcc0e4b99b97d39ef4b645de143cbb5f595e7aa9d661e66b1a53e98f
data/CHANGELOG.md
CHANGED
@@ -1,7 +1,17 @@
 
 # SmarterCSV 1.x Change Log
 
-## 1.
+## 1.6.1 (2022-05-06)
+* unused keys in `key_mapping` generate a warning, no longer raise an exception
+
+## 1.6.0 (2022-05-03)
+* completely rewrote line parser
+* added methods `SmarterCSV.raw_headers` and `SmarterCSV.headers` to allow easy examination of how the headers are processed.
+
+## 1.5.2 (2022-04-29)
+* added missing keys to the SmarterCSV::KeyMappingError exception message #189 (thanks to John Dell)
+
+## 1.5.1 (2022-04-27)
 * added raising of `KeyMappingError` if `key_mapping` refers to a non-existent key
 * added option `duplicate_header_suffix` (thanks to Skye Shaw)
 When given a non-nil string, it uses the suffix to append numbering 2..n to duplicate headers.
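The 1.6.1 entry above says that unused `key_mapping` keys now produce a warning instead of an exception. A minimal standalone sketch of that behaviour follows. It does not call the gem: `map_headers` is a hypothetical helper, and it writes the warning with `warn` (stderr) where the library code in this diff uses `puts`.

```ruby
# Standalone sketch, not the gem's code: `map_headers` is a hypothetical helper
# that mirrors the "warn instead of raise" behaviour described for 1.6.1.
def map_headers(headers, key_mapping, remove_unmapped_keys: false)
  # Keys named in the mapping that do not occur in the CSV header.
  missing_keys = key_mapping.keys - headers
  # Assumption: report on stderr; the diff itself prints with `puts`.
  warn "WARNING: missing header(s): #{missing_keys.join(',')}" unless missing_keys.empty?

  mapped = headers.map do |header|
    if key_mapping.key?(header)
      key_mapping[header]               # mapped key (nil deletes the column)
    else
      remove_unmapped_keys ? nil : header
    end
  end
  [mapped, missing_keys]
end

headers = %i[email firstname lastname]
mapping = { email: :a, firstname: :b, age: :e } # :age is not in the header

mapped, missing = map_headers(headers, mapping)
puts mapped.inspect  # [:a, :b, :lastname]
puts missing.inspect # [:age]
```

Processing continues with the headers that do exist; only the report about `:age` changes from an error to a message.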
data/CONTRIBUTORS.md
CHANGED
data/README.md
CHANGED
@@ -16,10 +16,12 @@
 
 # SmarterCSV
 
-[](http://travis-ci.
+[](http://travis-ci.com/tilo/smarter_csv) [](http://badge.fury.io/rb/smarter_csv)
 
 #### SmarterCSV 1.x
 
+`smarter_csv` is now 10 years old, and still kicking! 🎉🎉🎉
+
 `smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with Mongoid or ActiveRecord,
 and parallel processing with Resque or Sidekiq.
 
@@ -42,11 +44,13 @@ NOTE; This Gem is only for importing CSV files - writing of CSV files is not sup
 
 ### Why?
 
-Ruby's CSV library's API is pretty old, and it's processing of CSV-files returning Arrays of Arrays feels 'very close to the metal'. The output is not easy to use - especially not if you want to create database records
+Ruby's CSV library's API is pretty old, and it's processing of CSV-files returning Arrays of Arrays feels 'very close to the metal'. The output is not easy to use - especially not if you want to create database records or Sidekiq jobs with it. Another shortcoming is that Ruby's CSV library does not have good support for huge CSV-files, e.g. there is no support for 'chunking' and/or parallel processing of the CSV-content (e.g. with Sidekiq).
+
+As the existing CSV libraries didn't fit my needs, I was writing my own CSV processing - specifically for use in connection with Rails ORMs like Mongoid, MongoMapper and ActiveRecord. In those ORMs you can easily pass a hash with attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept larger arrays of such hashes to create a larger amount of records quickly with just one call. The same patterns are used when you pass data to Sidekiq jobs.
 
-
+For processing large CSV files it is essential to process them in chunks, so the memory impact is minimized.
 
-###
+### How?
 
 The two main choices you have in terms of how to call `SmarterCSV.process` are:
 * calling `process` with or without a block
data/lib/smarter_csv/smarter_csv.rb
CHANGED
@@ -6,6 +6,7 @@ module SmarterCSV
   class MissingHeaders < SmarterCSVException; end
   class NoColSepDetected < SmarterCSVException; end
   class KeyMappingError < SmarterCSVException; end
+  class MalformedCSVError < SmarterCSVException; end
 
   # first parameter: filename or input object which responds to readline method
   def SmarterCSV.process(input, options={}, &block)
@@ -18,24 +19,24 @@ module SmarterCSV
     @csv_line_count = 0
     has_rails = !! defined?(Rails)
     begin
-
+      fh = input.respond_to?(:readline) ? input : File.open(input, "r:#{options[:file_encoding]}")
 
       # auto-detect the row separator
-      options[:row_sep] = SmarterCSV.guess_line_ending(
+      options[:row_sep] = SmarterCSV.guess_line_ending(fh, options) if options[:row_sep].to_sym == :auto
       # attempt to auto-detect column separator
-      options[:col_sep] = guess_column_separator(
-      # preserve options, in case we need to call the CSV class
-      csv_options = options.select{|k,v| [:col_sep, :row_sep, :quote_char].include?(k)} # options.slice(:col_sep, :row_sep, :quote_char)
-      csv_options.delete(:row_sep) if [nil, :auto].include?( options[:row_sep].to_sym )
-      csv_options.delete(:col_sep) if [nil, :auto].include?( options[:col_sep].to_sym )
+      options[:col_sep] = guess_column_separator(fh, options) if options[:col_sep].to_sym == :auto
 
-      if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && (
+      if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && ( fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8') )
         puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
       end
 
-
+      if options[:skip_lines].to_i > 0
+        options[:skip_lines].to_i.times do
+          readline_with_counts(fh, options)
+        end
+      end
 
-      headerA, header_size = process_headers(
+      headerA, header_size = process_headers(fh, options)
 
       # in case we use chunking.. we'll need to set it up..
       if ! options[:chunk_size].nil? && options[:chunk_size].to_i > 0
@@ -48,10 +49,8 @@ module SmarterCSV
       end
 
       # now on to processing all the rest of the lines in the CSV file:
-      while !
-        line =
-        @file_line_count += 1
-        @csv_line_count += 1
+      while ! fh.eof? # we can't use fh.readlines() here, because this would read the whole file into memory at once, and eof => true
+        line = readline_with_counts(fh, options)
 
         # replace invalid byte sequence in UTF-8 with question mark to avoid errors
         line = line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
@@ -63,9 +62,10 @@ module SmarterCSV
         # cater for the quoted csv data containing the row separator carriage return character
         # in which case the row data will be split across multiple lines (see the sample content in spec/fixtures/carriage_returns_rn.csv)
         # by detecting the existence of an uneven number of quote characters
+
         multiline = line.count(options[:quote_char])%2 == 1 # should handle quote_char nil
         while line.count(options[:quote_char])%2 == 1 # should handle quote_char nil
-          next_line =
+          next_line = fh.readline(options[:row_sep])
           next_line = next_line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
           line += next_line
           @file_line_count += 1
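The hunk above joins physical lines for as long as the accumulated text holds an odd number of quote characters, because an odd count means a quoted field is still open. A minimal standalone sketch of that idea is given here under stated assumptions: `read_logical_line` is a hypothetical helper, the input is a `StringIO`, and the extra `eof?` guard is an addition that the library loop does not have.

```ruby
require 'stringio'

# Standalone sketch: read one logical CSV record, which may span several
# physical lines when a quoted field contains the row separator.
# `read_logical_line` is a hypothetical name, not part of the gem.
def read_logical_line(io, row_sep: "\n", quote_char: '"')
  line = io.readline(row_sep)
  # An odd number of quote characters means a quoted field is still open.
  # Assumption: stop at EOF instead of raising on an unterminated quote.
  while line.count(quote_char).odd? && !io.eof?
    line += io.readline(row_sep)
  end
  line.chomp(row_sep)
end

io = StringIO.new(%Q{name,note\n"Jane","first\nsecond"\n"Bob","ok"\n})

header  = read_logical_line(io)
record1 = read_logical_line(io)
record2 = read_logical_line(io)

puts header.inspect  # "name,note"
puts record1.inspect # "\"Jane\",\"first\nsecond\""
puts record2.inspect # "\"Bob\",\"ok\""
```

The second record keeps its embedded newline, so a field parser that runs afterwards sees the complete quoted field.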
@@ -74,16 +74,8 @@ module SmarterCSV
 
         line.chomp!(options[:row_sep])
 
-
-
-            CSV.parse( line, **csv_options ).flatten.collect!{|x| x.nil? ? '' : x} # to deal with nil values from CSV.parse
-          rescue CSV::MalformedCSVError => e
-            raise $!, "#{$!} [SmarterCSV: csv line #{@csv_line_count}]", $!.backtrace
-          end
-        else
-          dataA = line.split(options[:col_sep], header_size)
-        end
-        dataA.map!{|x| x.sub(/(#{options[:col_sep]})+\z/, '')} # remove any unwanted trailing col_sep characters at the end
+        dataA, data_size = parse(line, options, header_size)
+
         dataA.map!{|x| x.strip} if options[:strip_whitespace]
 
         # if all values are blank, then ignore this line
@@ -138,7 +130,7 @@ module SmarterCSV
         if use_chunks
           chunk << hash # append temp result to chunk
 
-          if chunk.size >= chunk_size ||
+          if chunk.size >= chunk_size || fh.eof? # if chunk if full, or EOF reached
             # do something with the chunk
             if block_given?
               yield chunk # do something with the hashes in the chunk in the block
@@ -179,7 +171,7 @@ module SmarterCSV
         chunk = [] # initialize for next chunk of data
       end
     ensure
-
+      fh.close if fh.respond_to?(:close)
     end
     if block_given?
       return chunk_count # when we do processing through a block we only care how many chunks we processed
@@ -224,6 +216,62 @@ module SmarterCSV
     }
   end
 
+  def self.readline_with_counts(filehandle, options)
+    line = filehandle.readline(options[:row_sep])
+    @file_line_count += 1
+    @csv_line_count += 1
+    line
+  end
+
+  # parses a single line: either a CSV header and body line
+  # - quoting rules compared to RFC-4180 are somewhat relaxed
+  # - we are not assuming that quotes inside a fields need to be doubled
+  # - we are not assuming that all fields need to be quoted (0 is even)
+  # - works with multi-char col_sep
+  # - if header_size is given, only up to header_size fields are parsed
+  #
+  # We use header_size for parsing the body lines to make sure we always match the number of headers
+  # in case there are trailing col_sep characters in line
+  #
+  # Our convention is that empty fields are returned as empty strings, not as nil.
+  #
+  def self.parse(line, options, header_size = nil)
+    return [] if line.nil?
+
+    col_sep = options[:col_sep]
+    quote = options[:quote_char]
+    quote_count = 0
+    elements = []
+    start = 0
+    i = 0
+
+    while i < line.size do
+      if line[i...i+col_sep.size] == col_sep && quote_count.even?
+        break if !header_size.nil? && elements.size >= header_size
+
+        elements << cleanup_quotes(line[start...i], quote)
+        i += col_sep.size
+        start = i
+      else
+        quote_count += 1 if line[i] == quote
+        i += 1
+      end
+    end
+    elements << cleanup_quotes(line[start..-1], quote) if header_size.nil? || elements.size < header_size
+    [elements, elements.size]
+  end
+
+  def self.cleanup_quotes(field, quote)
+    return field if field.nil? || field !~ /#{quote}/
+
+    if field.start_with?(quote) && field.end_with?(quote)
+      field.delete_prefix!(quote)
+      field.delete_suffix!(quote)
+    end
+    field.gsub!("#{quote}#{quote}", quote)
+    field
+  end
+
   def self.blank?(value)
     case value
     when Array
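The new `parse` and `cleanup_quotes` methods in the hunk above split a line on `col_sep` only while the number of quote characters seen so far is even, then strip enclosing quotes and collapse doubled quotes. A standalone sketch of the same technique is given here so it can be run without the gem. The names `sketch_parse` and `strip_outer_quotes` and the keyword arguments are hypothetical, and the sketch returns only the field array and uses non-mutating string calls where the library mutates in place.

```ruby
# Standalone sketch of the relaxed line parser shown in the diff.
# `sketch_parse` and `strip_outer_quotes` are hypothetical names.
def strip_outer_quotes(field, quote)
  return field unless field.include?(quote)

  # Drop one enclosing pair of quotes, if the field has one.
  if field.start_with?(quote) && field.end_with?(quote)
    field = field.delete_prefix(quote).delete_suffix(quote)
  end
  # Collapse doubled quotes ("") into a single quote character.
  field.gsub(quote * 2, quote)
end

def sketch_parse(line, col_sep: ',', quote: '"', header_size: nil)
  elements = []
  quote_count = 0
  start = 0
  i = 0

  while i < line.size
    # A separator only counts while we are outside a quoted section,
    # that is, while an even number of quote characters has been seen.
    if line[i, col_sep.size] == col_sep && quote_count.even?
      break if header_size && elements.size >= header_size

      elements << strip_outer_quotes(line[start...i], quote)
      i += col_sep.size
      start = i
    else
      quote_count += 1 if line[i] == quote
      i += 1
    end
  end

  # The last field, unless header_size already capped the result.
  if header_size.nil? || elements.size < header_size
    elements << strip_outer_quotes(line[start..-1], quote)
  end
  elements
end

puts sketch_parse('aaa,bbb,ccc,').inspect                    # trailing empty field kept
puts sketch_parse('aaa,bbb,ccc,', header_size: 3).inspect    # capped at three fields
puts sketch_parse('1,"board 4""",12.95').inspect             # doubled quote collapsed
puts sketch_parse('is,this "three, or four",fields').inspect # relaxed quoting
puts sketch_parse('a<=>b<=><=>', col_sep: '<=>').inspect     # multi-char separator
```

Counting quotes rather than tracking a full quote state machine is what makes the parser tolerant of quotes in the middle of an unquoted field, at the cost of accepting some input that RFC-4180 would reject.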
@@ -310,13 +358,22 @@ module SmarterCSV
     return k # the most frequent one is it
   end
 
-  def self.
+  def self.raw_hearder
+    @raw_header
+  end
+
+  def self.headers
+    @headers
+  end
+
+  def self.process_headers(filehandle, options)
+    @raw_header = nil
+    @headers = nil
     if options[:headers_in_file] # extract the header line
       # process the header line in the CSV file..
       # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
-      header = filehandle
-      @
-      @csv_line_count += 1
+      header = readline_with_counts(filehandle, options)
+      @raw_header = header
 
       header = header.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence]) if options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
       header = header.sub(options[:comment_regexp],'') if options[:comment_regexp]
@@ -324,16 +381,7 @@ module SmarterCSV
 
       header = header.gsub(options[:strip_chars_from_headers], '') if options[:strip_chars_from_headers]
 
-
-      file_headerA = begin
-        CSV.parse( header, **csv_options ).flatten.collect!{|x| x.nil? ? '' : x} # to deal with nil values from CSV.parse
-      rescue CSV::MalformedCSVError => e
-        raise $!, "#{$!} [SmarterCSV: csv line #{@csv_line_count}]", $!.backtrace
-      end
-      else
-        file_headerA = header.split(options[:col_sep])
-      end
-      file_header_size = file_headerA.size # before mapping, which could delete keys
+      file_headerA, file_header_size = parse(header, options)
 
       file_headerA.map!{|x| x.gsub(%r/#{options[:quote_char]}/,'') }
       file_headerA.map!{|x| x.strip} if options[:strip_whitespace]
@@ -371,7 +419,8 @@ module SmarterCSV
     # if you want to completely delete a key, then map it to nil or to ''
     if ! key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
       # we can't map keys that are not there
-
+      missing_keys = key_mappingH.keys - headerA
+      puts "WARNING: missing header(s): #{missing_keys.join(",")}" unless missing_keys.empty?
 
       headerA.map!{|x| key_mappingH.has_key?(x) ? (key_mappingH[x].nil? ? nil : key_mappingH[x]) : (options[:remove_unmapped_keys] ? nil : x)}
     end
@@ -392,6 +441,7 @@ module SmarterCSV
       raise SmarterCSV::MissingHeaders , "ERROR: missing headers: #{missing_headers.join(',')}" unless missing_headers.empty?
     end
 
+    @headers = headerA
     [headerA, header_size]
   end
 
data/lib/smarter_csv/version.rb
CHANGED
data/smarter_csv.gemspec
CHANGED
@@ -16,9 +16,9 @@ Gem::Specification.new do |spec|
   spec.executables = spec.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ["lib"]
-  spec.requirements = ['csv'] # for CSV.parse() only needed in case we have quoted fields
   spec.add_development_dependency "rspec"
   spec.add_development_dependency "simplecov"
+  spec.add_development_dependency "awesome_print"
   # spec.add_development_dependency "guard-rspec"
 
   spec.metadata["homepage_uri"] = spec.homepage
data/spec/smarter_csv/duplicate_headers_spec.rb
CHANGED
@@ -17,12 +17,12 @@ describe 'duplicate headers' do
     }.to raise_exception(SmarterCSV::DuplicateHeaders)
   end
 
-  it '
+  it 'does not raise error on missing mapped headers and includes missing headers in message' do
+    # the mapping is right, but the underlying csv file is bad
+    options = {:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :manager_email => :d, :age => :e} }
     expect {
-      # the mapping is right, but the underlying csv file is bad
-      options = {:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :manager_email => :d, :age => :e} }
       SmarterCSV.process("#{fixture_path}/duplicate_headers.csv", options)
-    }.
+    }.not_to raise_exception(SmarterCSV::KeyMappingError)
   end
 end
 
data/spec/smarter_csv/invalid_headers_spec.rb
CHANGED
@@ -28,11 +28,11 @@ describe 'test exceptions for invalid headers' do
     }.to raise_exception(SmarterCSV::MissingHeaders)
   end
 
-  it '
+  it 'does not raise error on missing mapped headers and includes missing headers in message' do
+    # :age does not exist in the CSV header
+    options = {:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :manager_email => :d, :age => :e} }
     expect {
-      # :age does not exist in the CSV header
-      options = {:key_mapping => {:email => :a, :firstname => :b, :lastname => :c, :manager_email => :d, :age => :e} }
       SmarterCSV.process("#{fixture_path}/user_import.csv", options)
-    }.
+    }.not_to raise_exception(SmarterCSV::KeyMappingError)
   end
 end
data/spec/smarter_csv/malformed_spec.rb
CHANGED
@@ -2,16 +2,24 @@ require 'spec_helper'
 
 fixture_path = 'spec/fixtures'
 
-
-
-
-  context "malformed header" do
+# according to RFC-4180 quotes inside of "words" shouldbe doubled, but our parser is robust against that.
+describe 'malformed CSV quotes' do
+  context "malformed quotes in header" do
     let(:csv_path) { "#{fixture_path}/malformed_header.csv" }
-    it
+    it 'should be resilient against single quotes' do
+      data = SmarterCSV.process(csv_path)
+      expect(data[0]).to eq({:name=>"Arnold Schwarzenegger", :dobdob=>"1947-07-30"})
+      expect(data[1]).to eq({:name=>"Jeff Bridges", :dobdob=>"1949-12-04"})
+    end
   end
 
-  context "malformed content" do
+  context "malformed quotes in content" do
     let(:csv_path) { "#{fixture_path}/malformed.csv" }
-
+
+    it 'should be resilient against single quotes' do
+      data = SmarterCSV.process(csv_path)
+      expect(data[0]).to eq({:name=>"Arnold Schwarzenegger", :dob=>"1947-07-30"})
+      expect(data[1]).to eq({:name=>"Jeff \"the dude\" Bridges", :dob=>"1949-12-04"})
+    end
   end
 end
data/spec/smarter_csv/parse/column_separator_spec.rb
ADDED
@@ -0,0 +1,61 @@
+require 'spec_helper'
+
+describe 'parse with col_sep' do
+  let(:options) { {quote_char: '"'} }
+
+  it 'parses with comma' do
+    line = "a,b,,d"
+    options.merge!({col_sep: ","})
+    array, array_size = SmarterCSV.send(:parse, line, options)
+    expect(array).to eq ['a', 'b', '', 'd']
+    expect(array_size).to eq 4
+  end
+
+  it 'parses trailing commas' do
+    line = "a,b,c,,"
+    options.merge!({col_sep: ","})
+    array, array_size = SmarterCSV.send(:parse, line, options)
+    expect(array).to eq ['a', 'b', 'c', '', '']
+    expect(array_size).to eq 5
+  end
+
+  it 'parses with space' do
+    line = "a b d"
+    options.merge!({col_sep: " "})
+    array, array_size = SmarterCSV.send(:parse, line, options)
+    expect(array).to eq ['a', 'b', '', 'd']
+    expect(array_size).to eq 4
+  end
+
+  it 'parses with tab' do
+    line = "a\tb\t\td"
+    options.merge!({col_sep: "\t"})
+    array, array_size = SmarterCSV.send(:parse, line, options)
+    expect(array).to eq ['a', 'b', '', 'd']
+    expect(array_size).to eq 4
+  end
+
+  it 'parses with multiple space separator' do
+    line = "a b d"
+    options.merge!({col_sep: " "})
+    array, array_size = SmarterCSV.send(:parse, line, options)
+    expect(array).to eq ['a b', '', 'd']
+    expect(array_size).to eq 3
+  end
+
+  it 'parses with multiple char separator' do
+    line = '<=><=>A<=>B<=>C'
+    options.merge!({col_sep: "<=>"})
+    array, array_size = SmarterCSV.send(:parse, line, options)
+    expect(array).to eq ["", "", "A", "B", "C"]
+    expect(array_size).to eq 5
+  end
+
+  it 'parses trailing multiple char separator' do
+    line = '<=><=>A<=>B<=>C<=><=>'
+    options.merge!({col_sep: "<=>"})
+    array, array_size = SmarterCSV.send(:parse, line, options)
+    expect(array).to eq ["", "", "A", "B", "C", "", ""]
+    expect(array_size).to eq 7
+  end
+end
data/spec/smarter_csv/parse/old_csv_library_spec.rb
ADDED
@@ -0,0 +1,74 @@
+require 'spec_helper'
+
+describe 'old CSV library parsing tests' do
+  let(:options) { {quote_char: '"', col_sep: ","} }
+
+  [ ["\t", ["\t"]],
+    ["foo,\"\"\"\"\"\",baz", ["foo", "\"\"", "baz"]],
+    ["foo,\"\"\"bar\"\"\",baz", ["foo", "\"bar\"", "baz"]],
+    ["\"\"\"\n\",\"\"\"\n\"", ["\"\n", "\"\n"]],
+    ["foo,\"\r\n\",baz", ["foo", "\r\n", "baz"]],
+    ["\"\"", [""]],
+    ["foo,\"\"\"\",baz", ["foo", "\"", "baz"]],
+    ["foo,\"\r.\n\",baz", ["foo", "\r.\n", "baz"]],
+    ["foo,\"\r\",baz", ["foo", "\r", "baz"]],
+    ["foo,\"\",baz", ["foo", "", "baz"]],
+    ["\",\"", [","]],
+    ["foo", ["foo"]],
+    [",,", ['', '', '']],
+    [",", ['', '']],
+    ["foo,\"\n\",baz", ["foo", "\n", "baz"]],
+    ["foo,,baz", ["foo", '', "baz"]],
+    ["\"\"\"\r\",\"\"\"\r\"", ["\"\r", "\"\r"]],
+    ["\",\",\",\"", [",", ","]],
+    ["foo,bar,", ["foo", "bar", '']],
+    [",foo,bar", ['', "foo", "bar"]],
+    ["foo,bar", ["foo", "bar"]],
+    [";", [";"]],
+    ["\t,\t", ["\t", "\t"]],
+    ["foo,\"\r\n\r\",baz", ["foo", "\r\n\r", "baz"]],
+    ["foo,\"\r\n\n\",baz", ["foo", "\r\n\n", "baz"]],
+    ["foo,\"foo,bar\",baz", ["foo", "foo,bar", "baz"]],
+    [";,;", [";", ";"]]
+  ].each do |line, result|
+    it "parses #{line}" do
+      array, array_size = SmarterCSV.send(:parse, line, options)
+      expect(array).to eq result
+    end
+  end
+
+  [ ["foo,\"\"\"\"\"\",baz", ["foo", "\"\"", "baz"]],
+    ["foo,\"\"\"bar\"\"\",baz", ["foo", "\"bar\"", "baz"]],
+    ["foo,\"\r\n\",baz", ["foo", "\r\n", "baz"]],
+    ["\"\"", [""]],
+    ["foo,\"\"\"\",baz", ["foo", "\"", "baz"]],
+    ["foo,\"\r.\n\",baz", ["foo", "\r.\n", "baz"]],
+    ["foo,\"\r\",baz", ["foo", "\r", "baz"]],
+    ["foo,\"\",baz", ["foo", "", "baz"]],
+    ["foo", ["foo"]],
+    [",,", ['', '', '']],
+    [",", ['', '']],
+    ["foo,\"\n\",baz", ["foo", "\n", "baz"]],
+    ["foo,,baz", ["foo", '', "baz"]],
+    ["foo,bar", ["foo", "bar"]],
+    ["foo,\"\r\n\n\",baz", ["foo", "\r\n\n", "baz"]],
+    ["foo,\"foo,bar\",baz", ["foo", "foo,bar", "baz"]]
+  ].each do |line, result|
+    it "parses #{line}" do
+      array, array_size = SmarterCSV.send(:parse, line, options)
+      expect(array).to eq result
+    end
+  end
+
+  it 'mixed quotes' do
+    line = %Q{Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K}
+    array, array_size = SmarterCSV.send(:parse, line, options)
+    expect(array).to eq ["Ten Thousand", "10000", " 2710 ", "", "10,000", "It's \"10 Grand\", baby", "10K"]
+  end
+
+  it 'single quotes in fields' do
+    line = 'Indoor Chrome,49.2"" L x 49.2"" W x 20.5"" H,Chrome,"Crystal,Metal,Wood",23.12'
+    array, array_size = SmarterCSV.send(:parse, line, options)
+    expect(array).to eq ['Indoor Chrome', '49.2" L x 49.2" W x 20.5" H', 'Chrome', 'Crystal,Metal,Wood', '23.12']
+  end
+end
data/spec/smarter_csv/parse/rfc4180_and_more_spec.rb
ADDED
@@ -0,0 +1,170 @@
+require 'spec_helper'
+
+fixture_path = 'spec/fixtures'
+
+describe 'fulfills RFC-4180 and more' do
+  let(:options) { {col_sep: ',', row_sep: $INPUT_RECORD_SEPARATOR, quote_char: '"' } }
+
+  context 'parses simple CSV' do
+    context 'RFC-4180' do
+      it 'separating on col_sep' do
+        line = 'aaa,bbb,ccc'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [%w[aaa bbb ccc], 3]
+      end
+
+      it 'preserves whitespace' do
+        line = ' aaa , bbb , ccc '
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          [' aaa ', ' bbb ', ' ccc '], 3
+        ]
+      end
+    end
+
+    context 'extending RFC-4180' do
+      it 'with extra col_sep' do
+        line = 'aaa,bbb,ccc,'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          ['aaa', 'bbb', 'ccc', ''], 4
+        ]
+      end
+
+      it 'with extra col_sep with given header_size' do
+        line = 'aaa,bbb,ccc,'
+        expect( SmarterCSV.send(:parse, line, options, 3)).to eq [
+          ['aaa', 'bbb', 'ccc'], 3
+        ]
+      end
+
+      it 'with multiple extra col_sep' do
+        line = 'aaa,bbb,ccc,,,'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          ['aaa', 'bbb', 'ccc', '', '', ''], 6
+        ]
+      end
+
+      it 'with multiple extra col_sep' do
+        line = 'aaa,bbb,ccc,,,'
+        expect( SmarterCSV.send(:parse, line, options, 3)).to eq [
+          ['aaa', 'bbb', 'ccc'], 3
+        ]
+      end
+
+      it 'with multiple complex col_sep' do
+        line = 'aaa<=>bbb<=>ccc<=><=><=>'
+        expect( SmarterCSV.send(:parse, line, options.merge({col_sep: '<=>'}))).to eq [
+          ['aaa', 'bbb', 'ccc', '', '', ''], 6
+        ]
+      end
+
+      it 'with multiple complex col_sep with given header_size' do
+        line = 'aaa<=>bbb<=>ccc<=><=><=>'
+        expect( SmarterCSV.send(:parse, line, options.merge({col_sep: '<=>'}), 3)).to eq [
+          ['aaa', 'bbb', 'ccc'], 3
+        ]
+      end
+    end
+  end
+
+  context 'parses quoted CSV' do
+    context 'RFC-4180' do
+      it 'separating on col_sep' do
+        line = '"aaa","bbb","ccc"'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [%w[aaa bbb ccc], 3]
+      end
+
+      it 'parses corner case correctly' do
+        line = '"Board 4""","$17.40","10000003427"'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          ['Board 4"', '$17.40', '10000003427'], 3
+        ]
+      end
+
+      it 'quoted parts can contain spaces' do
+        line = '" aaa1 aaa2 "," bbb1 bbb2 "," ccc1 ccc2 "'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          [' aaa1 aaa2 ', ' bbb1 bbb2 ', ' ccc1 ccc2 '], 3
+        ]
+      end
+
+      it 'quoted parts can contain row_sep' do
+        line = '"aaa1, aaa2","bbb1, bbb2","ccc1, ccc2"'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          ['aaa1, aaa2', 'bbb1, bbb2', 'ccc1, ccc2'], 3
+        ]
+      end
+
+      it 'quoted parts can contain row_sep' do
+        line = '"aaa1, ""aaa2"", aaa3","""bbb1"", bbb2","ccc1, ""ccc2"""'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          ['aaa1, "aaa2", aaa3', '"bbb1", bbb2', 'ccc1, "ccc2"'], 3
+        ]
+      end
+
+      it 'some fields are quoted' do
+        line = '1,"board 4""",12.95'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          ['1', 'board 4"', '12.95'], 3
+        ]
+      end
+
+      it 'separating on col_sep' do
+        line = '"some","thing","""completely"" different"'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          ['some', 'thing', '"completely" different'], 3
+        ]
+      end
+    end
+
+    context 'extending RFC-4180' do
+      it 'with extra col_sep, without given header_size' do
+        line = '"aaa","bbb","ccc",'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          ['aaa', 'bbb', 'ccc', ''], 4
+        ]
+      end
+
+      it 'with extra col_sep, with given header_size' do
+        line = '"aaa","bbb","ccc",'
+        expect( SmarterCSV.send(:parse, line, options, 3)).to eq [%w[aaa bbb ccc], 3]
+      end
+
+      it 'with multiple extra col_sep, without given header_size' do
+        line = '"aaa","bbb","ccc",,,'
+        expect( SmarterCSV.send(:parse, line, options)).to eq [
+          ['aaa', 'bbb', 'ccc', '', '', ''], 6
+        ]
+      end
+
+      it 'with multiple extra col_sep, with given header_size' do
+        line = '"aaa","bbb","ccc",,,'
+        expect( SmarterCSV.send(:parse, line, options, 3)).to eq [
+          ['aaa', 'bbb', 'ccc'], 3
+        ]
+      end
+
+      it 'with multiple complex extra col_sep, without given header_size' do
+        line = '"aaa"<=>"bbb"<=>"ccc"<=><=><=>'
+        expect( SmarterCSV.send(:parse, line, options.merge({col_sep: '<=>'}))).to eq [
+          ['aaa', 'bbb', 'ccc', '', '', ''], 6
+        ]
+      end
+
+      it 'with multiple complex extra col_sep, with given header_size' do
+        line = '"aaa"<=>"bbb"<=>"ccc"<=><=><=>'
+        expect( SmarterCSV.send(:parse, line, options.merge({col_sep: '<=>'}), 3)).to eq [
+          ['aaa', 'bbb', 'ccc'], 3
+        ]
+      end
+    end
+  end
+
+  # relaxed parsing compared to RFC-4180
+  context 'liberal_parsing' do
+    it 'parses corner case correctly' do
+      line = 'is,this "three, or four",fields'
+      expect( SmarterCSV.send(:parse, line, options)).to eq [
+        ['is', 'this "three, or four"', 'fields'], 3
+      ]
+    end
+  end
+end
data/spec/smarter_csv/quoted_spec.rb
CHANGED
@@ -3,7 +3,6 @@ require 'spec_helper'
 fixture_path = 'spec/fixtures'
 
 describe 'loading file with quoted fields' do
-
   it 'leaving the quotes in the data' do
     options = {}
     data = SmarterCSV.process("#{fixture_path}/quoted.csv", options)
@@ -12,6 +11,7 @@ describe 'loading file with quoted fields' do
     data[1][:description].should be_nil
     data[2][:model].should eq 'Venture "Extended Edition, Very Large"'
     data[2][:description].should be_nil
+    data[3][:description].should eq 'MUST SELL! air, moon roof, loaded'
     data.each do |h|
       h[:year].class.should eq Fixnum
       h[:make].should_not be_nil
@@ -20,17 +20,21 @@ describe 'loading file with quoted fields' do
     end
   end
 
-
+  # quotes inside quoted fields need to be escaped by another double-quote
   it 'removes quotes around quoted fields, but not inside data' do
     options = {}
     data = SmarterCSV.process("#{fixture_path}/quote_char.csv", options)
 
     data.length.should eq 6
+    data[0][:first_name].should eq "\"John"
+    data[0][:last_name].should eq "Cooke\""
     data[1][:first_name].should eq "Jam\ne\nson\""
     data[2][:first_name].should eq "\"Jean"
+    data[4][:first_name].should eq "Bo\"bbie"
+    data[5][:first_name].should eq 'Mica'
+    data[5][:last_name].should eq 'Copeland'
   end
 
-
   # NOTE: quotes inside headers need to be escaped by doubling them
   # e.g. 'correct ""EXAMPLE""'
   # this escaping is illegal: 'incorrect \"EXAMPLE\"' <-- this caused CSV parsing error
@@ -43,6 +47,6 @@ describe 'loading file with quoted fields' do
     data.length.should eq 3
     data.first.keys[2].should eq :isbn
     data.first.keys[3].should eq :discounted_price
+    data[1][:author].should eq 'Timothy "The Parser" Campbell'
   end
-
 end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.6.1
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2022-
+date: 2022-05-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec
@@ -38,6 +38,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: awesome_print
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 description: Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, with
   optional features for processing large files in parallel, embedded comments, unusual
   field- and record-separators, flexible mapping of CSV-headers to Hash-keys
@@ -126,6 +140,9 @@ files:
 - spec/smarter_csv/malformed_spec.rb
 - spec/smarter_csv/no_header_spec.rb
 - spec/smarter_csv/not_downcase_header_spec.rb
+- spec/smarter_csv/parse/column_separator_spec.rb
+- spec/smarter_csv/parse/old_csv_library_spec.rb
+- spec/smarter_csv/parse/rfc4180_and_more_spec.rb
 - spec/smarter_csv/problematic.rb
 - spec/smarter_csv/quoted_spec.rb
 - spec/smarter_csv/remove_empty_values_spec.rb
@@ -161,8 +178,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - - ">="
     - !ruby/object:Gem::Version
       version: '0'
-requirements:
-- csv
+requirements: []
 rubygems_version: 3.1.6
 signing_key:
 specification_version: 4
@@ -233,6 +249,9 @@ test_files:
 - spec/smarter_csv/malformed_spec.rb
 - spec/smarter_csv/no_header_spec.rb
 - spec/smarter_csv/not_downcase_header_spec.rb
+- spec/smarter_csv/parse/column_separator_spec.rb
+- spec/smarter_csv/parse/old_csv_library_spec.rb
+- spec/smarter_csv/parse/rfc4180_and_more_spec.rb
 - spec/smarter_csv/problematic.rb
 - spec/smarter_csv/quoted_spec.rb
 - spec/smarter_csv/remove_empty_values_spec.rb