smarter_csv 1.7.3 → 1.8.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +7 -0
- data/CONTRIBUTORS.md +1 -0
- data/README.md +19 -6
- data/Rakefile +11 -0
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +86 -25
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 55400b3977ce35c58d60c4101362b68d99f2dbf7cb6a63956ae3b6ab79fcf1ac
+  data.tar.gz: 41f46d3e4de69a7924ecd2214ba4e37766106469d1b8b257fd752a96204a47fd
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 24ecc14cf9c65efe5c11e4bd20753420aa8ccd7385171cd21eac2e1be92c4896087cdc2a18799fa111c0f36154ad4481daed7f08b752f4fae2b5f27241b8cf6c
+  data.tar.gz: c1d70e18a7ae8057e58cbf73b62f4896dd7030bc5fd2e927669e5ea829f9a3c11daeb9c8b83296dbb46e6f0d23034245b7207882a77b54cc1ca128a581175359
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,13 @@
 
 # SmarterCSV 1.x Change Log
 
+## 1.8.0 (2023-03-18)
+* NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+* ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
+
+## 1.7.4 (2023-01-13)
+* improved guessing of the column separator, thanks to Alessandro Fazzi
+
 ## 1.7.3 (2022-12-05)
 * new option :silence_missing_keys; if set to true, it ignores missing keys in `key_mapping`
 
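Because both separators now default to `:auto`, callers that relied on the old fixed defaults may want to pin them explicitly. A minimal sketch, assuming the previous defaults were `','` and `$/` as the `# was:` comments in the `data/lib/smarter_csv.rb` part of this diff indicate. The constant names are hypothetical, and the `SmarterCSV.process` call is left as a comment because it needs the gem installed:

```ruby
# Options mirroring the pre-1.8.0 separator defaults named in this diff.
# LEGACY_SEPARATORS is a hypothetical name used only for illustration.
LEGACY_SEPARATORS = { col_sep: ',', row_sep: $/ }.freeze

# Explicit separators for a semicolon-delimited file with CRLF line endings.
# EXPLICIT_SEPARATORS is likewise a hypothetical name.
EXPLICIT_SEPARATORS = { col_sep: ';', row_sep: "\r\n" }.freeze

# With the gem installed, either hash can be passed as the options argument:
#   require 'smarter_csv'
#   records = SmarterCSV.process('data.csv', LEGACY_SEPARATORS)
```

Pinning the separators skips auto-detection entirely, which may be preferable when the file format is known in advance.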
data/CONTRIBUTORS.md
CHANGED
@@ -49,3 +49,4 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
 * [Nicolas Rodriguez](https://github.com/n-rodriguez)
 * [Hirotaka Mizutani ](https://github.com/hirotaka)
 * [Rahul Chaudhary](https://github.com/rahulch95)
+* [Alessandro Fazzi](https://github.com/pioneerskies)
data/README.md
CHANGED
@@ -55,10 +55,23 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
 * calling `process` with or without a block
 * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
 
-
-
-
-You can also set the `:row_sep` manually! Checkout Example
+By default (since version 1.8.0), detection of the column and row separators is set to automatic `row_sep: :auto`, `col_sep: :auto`. This should make it easier to process any CSV files without having to examine the line endings or column separators.
+
+You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters); nil or 0 will check the whole file).
+You can also set the `:row_sep` manually! Checkout Example 4 for unusual `:row_sep` and `:col_sep`.
+
+### Troubleshooting
+
+In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection a tool like `hexdump` can help find otherwise hidden control character or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
+
+```
+$ hexdump -C spec/fixtures/bom_test_feff.csv
+00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
+00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
+00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
+00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
+00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
+```
 
 
 #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
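Where `hexdump` is not available, the same inspection described in the new Troubleshooting text can be approximated from Ruby. A minimal sketch using only the standard library; the method names `hex_preview` and `hex_preview_of_file` are hypothetical and not part of SmarterCSV:

```ruby
# Returns the first `count` bytes of `data` as space-separated two-digit hex values.
def hex_preview(data, count = 16)
  data.byteslice(0, count).to_s.bytes.map { |byte| format('%02x', byte) }.join(' ')
end

# Reads only the leading bytes of a file in binary mode and previews them.
# File.binread returns nil for an empty file, hence the to_s.
def hex_preview_of_file(path, count = 16)
  hex_preview(File.binread(path, count).to_s, count)
end
```

For a file like the fixture in the README excerpt above, the preview would be expected to start with `fe ff`, the two bytes shown at offset 0 of the `hexdump` output.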
@@ -222,10 +235,10 @@ The options and the block are optional.
 | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
 | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
 ---------------------------------------------------------------------------------------------------------------------------------
-| :col_sep |
+| :col_sep | :auto | column separator (default was ',') |
 | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
 | | | e.g. when :quote_char is not properly escaped |
-| :row_sep |
+| :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
 | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
 | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
 | :quote_char | '"' | quotation character |
data/Rakefile
CHANGED
@@ -3,6 +3,17 @@
 require "bundler/gem_tasks"
 require 'rspec/core/rake_task'
 
+
+# temp fix for NoMethodError: undefined method `last_comment'
+# remove when fixed in Rake 11.x and higher
+module TempFixForRakeLastComment
+  def last_comment
+    last_description
+  end
+end
+Rake::Application.send :include, TempFixForRakeLastComment
+### end of tempfix
+
 RSpec::Core::RakeTask.new(:spec)
 
 require "rubocop/rake_task"
data/lib/smarter_csv/version.rb
CHANGED
data/lib/smarter_csv.rb
CHANGED
@@ -3,8 +3,8 @@
 require_relative "extensions/hash"
 require_relative "smarter_csv/version"
 
-require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
-
+# require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
 
 module SmarterCSV
   class SmarterCSVException < StandardError; end
@@ -39,11 +39,7 @@ module SmarterCSV
   puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
 end
 
-
-options[:skip_lines].to_i.times do
-  readline_with_counts(fh, options)
-end
-end
+skip_lines(fh, options)
 
 headerA, header_size = process_headers(fh, options)
 
@@ -207,7 +203,7 @@ module SmarterCSV
 acceleration: true,
 auto_row_sep_chars: 500,
 chunk_size: nil,
-col_sep: ',',
+col_sep: :auto, # was: ',',
 comment_regexp: nil, # was: /\A#/,
 convert_values_to_numeric: true,
 downcase_header: true,
@@ -226,7 +222,7 @@ module SmarterCSV
 remove_values_matching: nil,
 remove_zero_values: false,
 required_headers: nil,
-row_sep: $/,
+row_sep: :auto, # was: $/,
 silence_missing_keys: false,
 skip_lines: nil,
 strings_as_keys: false,
@@ -243,9 +239,24 @@ module SmarterCSV
   line = filehandle.readline(options[:row_sep])
   @file_line_count += 1
   @csv_line_count += 1
+  line = remove_bom(line) if @csv_line_count == 1
   line
 end
 
+def skip_lines(filehandle, options)
+  return unless options[:skip_lines].to_i > 0
+
+  options[:skip_lines].to_i.times do
+    readline_with_counts(filehandle, options)
+  end
+end
+
+def rewind(filehandle)
+  @file_line_count = 0
+  @csv_line_count = 0
+  filehandle.rewind
+end
+
 ###
 ### Thin wrapper around C-extension
 ###
@@ -374,24 +385,23 @@ module SmarterCSV
   return false
 end
 
-#
+# If file has headers, then guesses column separator from headers.
+# Otherwise guesses column separator from contents.
+# Raises exception if none is found.
 def guess_column_separator(filehandle, options)
-
-  n = Hash.new(0)
+  skip_lines(filehandle, options)
 
-
-  line = filehandle.readline(options[:row_sep])
-  del.each do |d|
-    n[d] += line.scan(d).count
-  end
-  rescue EOFError # short files
-  break
-  end
+  possible_delimiters = [',', "\t", ';', ':', '|']
 
-
-
+  candidates = if options.fetch(:headers_in_file)
+    candidated_column_separators_from_headers(filehandle, options, possible_delimiters)
+  else
+    candidated_column_separators_from_contents(filehandle, options, possible_delimiters)
+  end
+
+  raise SmarterCSV::NoColSepDetected if candidates.values.max == 0
 
-
+  candidates.key(candidates.values.max)
 end
 
 # limitation: this currently reads the whole file in before making a decision
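The header-based branch of the new `guess_column_separator` can be illustrated outside the gem. A minimal sketch under stated assumptions: it works on any IO-like object such as `StringIO`, inspects only the first line, and raises `ArgumentError` as a stand-in for `SmarterCSV::NoColSepDetected`. The names `POSSIBLE_DELIMITERS` and `guess_col_sep` are hypothetical:

```ruby
require 'stringio'

# Same candidate list as in the hunk above.
POSSIBLE_DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Counts each candidate delimiter in the first line and returns the most
# frequent one. Raises ArgumentError when no candidate occurs at all.
def guess_col_sep(io, row_sep = "\n")
  header = io.readline(row_sep)
  io.rewind # leave the handle at the start so the real parse sees every line

  counts = POSSIBLE_DELIMITERS.to_h { |delimiter| [delimiter, header.scan(delimiter).count] }
  raise ArgumentError, 'no column separator detected' if counts.values.max.zero?

  # On a tie, Hash#key returns the first candidate with the highest count.
  counts.key(counts.values.max)
end

puts guess_col_sep(StringIO.new("a;b;c\n1;2;3\n")).inspect
```

Rewinding after the peek is the reason the diff introduces a `rewind` helper: detection must not consume the header line that the parser still needs.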
@@ -420,7 +430,7 @@ module SmarterCSV
   lines += 1
   break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
 end
-filehandle
+rewind(filehandle)
 
 counts["\r"] += 1 if last_char == "\r"
 # find the most frequent key/value pair:
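Only a fragment of the row-separator detection is visible in this hunk. A simplified sketch of the idea, assuming the candidates are `"\n"`, `"\r"` and `"\r\n"` and that only the first `max_chars` characters are sampled (the documented `:auto_row_sep_chars` default is 500, with nil or 0 meaning the whole input). It works on a plain string and ignores quoted fields; the name `guess_row_sep` is hypothetical:

```ruby
# Returns the most frequent line ending in the sampled part of `text`.
def guess_row_sep(text, max_chars = 500)
  sample = max_chars.to_i > 0 ? text[0, max_chars] : text
  counts = { "\n" => 0, "\r" => 0, "\r\n" => 0 }
  last_char = nil

  sample.each_char do |char|
    if last_char == "\r"
      # A "\r" is classified only once the following character is known.
      if char == "\n"
        counts["\r\n"] += 1
      else
        counts["\r"] += 1
      end
    elsif char == "\n"
      counts["\n"] += 1
    end
    last_char = char
  end

  # A trailing "\r" has no follower, so it is counted after the loop.
  counts["\r"] += 1 if last_char == "\r"

  counts.max_by { |_separator, count| count }.first
end
```

The post-loop adjustment corresponds to the `counts["\r"] += 1 if last_char == "\r"` context line in the hunk above.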
@@ -476,13 +486,13 @@ module SmarterCSV
 
 unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
   key_mappingH = options[:key_mapping]
-
   # do some key mapping on the keys in the file header
   # if you want to completely delete a key, then map it to nil or to ''
   if !key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
     unless options[:silence_missing_keys]
       # if silence_missing_keys are not set, raise error if missing header
       missing_keys = key_mappingH.keys - headerA
+
       puts "WARNING: missing header(s): #{missing_keys.join(",")}" unless missing_keys.empty?
     end
 
@@ -525,5 +535,56 @@ module SmarterCSV
       end
       result
     end
+
+    private
+
+    UTF_32_BOM = %w[0 0 fe ff].freeze
+    UTF_32LE_BOM = %w[ff fe 0 0].freeze
+    UTF_8_BOM = %w[ef bb bf].freeze
+    UTF_16_BOM = %w[fe ff].freeze
+    UTF_16LE_BOM = %w[ff fe].freeze
+
+    def remove_bom(str)
+      str_as_hex = str.bytes.map{|x| x.to_s(16)}
+      # if string does not start with one of the bytes above, there is no BOM
+      return str unless %w[ef fe ff 0].include?(str_as_hex[0])
+
+      return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
+      return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
+      return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
+
+      puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
+      str
+    end
+
+    def candidated_column_separators_from_headers(filehandle, options, delimiters)
+      candidates = Hash.new(0)
+      line = readline_with_counts(filehandle, options.slice(:row_sep))
+
+      delimiters.each do |d|
+        candidates[d] += line.scan(d).count
+      end
+
+      rewind(filehandle)
+
+      candidates
+    end
+
+    def candidated_column_separators_from_contents(filehandle, options, delimiters)
+      candidates = Hash.new(0)
+
+      5.times do
+        line = readline_with_counts(filehandle, options.slice(:row_sep))
+        delimiters.each do |d|
+          candidates[d] += line.scan(d).count
+        end
+      rescue EOFError # short files
+        break
+      end
+
+      rewind(filehandle)
+
+      candidates
+    end
   end
 end
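The BOM handling added in this hunk can be tried in isolation. A minimal standalone sketch, not the gem's code: where `remove_bom` above compares hex strings, this variant compares integer byte arrays, and it silently returns the input when no signature matches. The names `BOM_SIGNATURES` and `strip_bom` are hypothetical:

```ruby
# BOM signatures as byte arrays, longest first: the UTF-32LE signature begins
# with the same two bytes as UTF-16LE, so it has to be tested earlier.
BOM_SIGNATURES = [
  [0x00, 0x00, 0xFE, 0xFF], # UTF-32 (big-endian)
  [0xFF, 0xFE, 0x00, 0x00], # UTF-32LE
  [0xEF, 0xBB, 0xBF],       # UTF-8
  [0xFE, 0xFF],             # UTF-16 (big-endian)
  [0xFF, 0xFE]              # UTF-16LE
].freeze

# Returns the string without its leading BOM, or the string unchanged.
def strip_bom(str)
  leading = str.bytes.first(4)
  signature = BOM_SIGNATURES.find { |sig| leading.first(sig.size) == sig }
  signature ? str.byteslice(signature.size..-1) : str
end
```

As in the diff, `byteslice` is used rather than character slicing, so the cut is made at a byte offset regardless of the string's declared encoding.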
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.8.0
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2023-03-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print