smarter_csv 1.7.3 → 1.8.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +7 -0
- data/CONTRIBUTORS.md +1 -0
- data/README.md +19 -6
- data/Rakefile +11 -0
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +86 -25
- metadata +2 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 55400b3977ce35c58d60c4101362b68d99f2dbf7cb6a63956ae3b6ab79fcf1ac
|
|
4
|
+
data.tar.gz: 41f46d3e4de69a7924ecd2214ba4e37766106469d1b8b257fd752a96204a47fd
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 24ecc14cf9c65efe5c11e4bd20753420aa8ccd7385171cd21eac2e1be92c4896087cdc2a18799fa111c0f36154ad4481daed7f08b752f4fae2b5f27241b8cf6c
|
|
7
|
+
data.tar.gz: c1d70e18a7ae8057e58cbf73b62f4896dd7030bc5fd2e927669e5ea829f9a3c11daeb9c8b83296dbb46e6f0d23034245b7207882a77b54cc1ca128a581175359
|
data/CHANGELOG.md
CHANGED
|
@@ -1,6 +1,13 @@
|
|
|
1
1
|
|
|
2
2
|
# SmarterCSV 1.x Change Log
|
|
3
3
|
|
|
4
|
+
## 1.8.0 (2023-03-18)
|
|
5
|
+
* NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
|
|
6
|
+
* ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
|
|
7
|
+
|
|
8
|
+
## 1.7.4 (2023-01-13)
|
|
9
|
+
* improved guessing of the column separator, thanks to Alessandro Fazzi
|
|
10
|
+
|
|
4
11
|
## 1.7.3 (2022-12-05)
|
|
5
12
|
* new option :silence_missing_keys; if set to true, it ignores missing keys in `key_mapping`
|
|
6
13
|
|
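Because `col_sep` and `row_sep` now default to `:auto`, callers that relied on the previous defaults may want to pin them explicitly. A minimal sketch, assuming the old values recorded in the `was:` comments of `lib/smarter_csv.rb` (`','` and `$/`). The helper name `with_legacy_separators` and the file path in the usage comment are hypothetical and not part of the gem:

```ruby
# Pre-1.8.0 separator defaults, as recorded in the "was:" comments of the diff.
LEGACY_SEPARATORS = { col_sep: ',', row_sep: $/ }.freeze

# Hypothetical helper: returns a new options hash that pins the old
# separators unless the caller overrides them.
def with_legacy_separators(options = {})
  LEGACY_SEPARATORS.merge(options)
end

# Usage (requires the smarter_csv gem; 'data.csv' is a placeholder path):
#   SmarterCSV.process('data.csv', with_legacy_separators(chunk_size: 100))
```

Passing the separators explicitly skips auto-detection for that call, which may be preferable when the input format is known in advance.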
data/CONTRIBUTORS.md
CHANGED
|
@@ -49,3 +49,4 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
|
|
|
49
49
|
* [Nicolas Rodriguez](https://github.com/n-rodriguez)
|
|
50
50
|
* [Hirotaka Mizutani](https://github.com/hirotaka)
|
|
51
51
|
* [Rahul Chaudhary](https://github.com/rahulch95)
|
|
52
|
+
* [Alessandro Fazzi](https://github.com/pioneerskies)
|
data/README.md
CHANGED
|
@@ -55,10 +55,23 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
|
|
|
55
55
|
* calling `process` with or without a block
|
|
56
56
|
* passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
|
|
57
57
|
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
You can also set the `:row_sep` manually! Checkout Example
|
|
58
|
+
By default (since version 1.8.0), detection of the column and row separators is automatic (`row_sep: :auto`, `col_sep: :auto`). This should make it easier to process any CSV file without having to examine its line endings or column separators.
|
|
59
|
+
|
|
60
|
+
You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters; `nil` or `0` will check the whole file).
|
|
61
|
+
You can also set the `:row_sep` manually! Check out Example 4 for unusual `:row_sep` and `:col_sep`.
|
|
62
|
+
|
|
63
|
+
### Troubleshooting
|
|
64
|
+
|
|
65
|
+
In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection, a tool like `hexdump` can help find otherwise hidden control characters or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
|
|
66
|
+
|
|
67
|
+
```
|
|
68
|
+
$ hexdump -C spec/fixtures/bom_test_feff.csv
|
|
69
|
+
00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
|
|
70
|
+
00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
|
|
71
|
+
00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
|
|
72
|
+
00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
|
|
73
|
+
00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
|
|
74
|
+
```
|
|
62
75
|
|
|
63
76
|
|
|
64
77
|
#### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
|
|
@@ -222,10 +235,10 @@ The options and the block are optional.
|
|
|
222
235
|
| :skip_lines | nil | how many lines to skip before the first line or header line is processed |
|
|
223
236
|
| :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g. /\A#/ |
|
|
224
237
|
---------------------------------------------------------------------------------------------------------------------------------
|
|
225
|
-
| :col_sep |
|
|
238
|
+
| :col_sep | :auto | column separator (default was ',') |
|
|
226
239
|
| :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
|
|
227
240
|
| | | e.g. when :quote_char is not properly escaped |
|
|
228
|
-
| :row_sep |
|
|
241
|
+
| :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
|
|
229
242
|
| | | This can also be set to :auto, but will process the whole csv file first (slow!) |
|
|
230
243
|
| :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
|
|
231
244
|
| :quote_char | '"' | quotation character |
|
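To illustrate what sampling the first N characters for `row_sep: :auto` can look like, here is a simplified sketch. It is not the gem's implementation: the method name `guess_row_sep` and the fallback to `"\n"` when no line ending is found are assumptions made for this example.

```ruby
# Guess the row separator by counting line endings in a sample of the text.
# max_chars mirrors the idea behind :auto_row_sep_chars: nil or 0 means
# "look at the whole text".
def guess_row_sep(text, max_chars = 500)
  sample = (max_chars.nil? || max_chars.zero?) ? text : text[0, max_chars]

  counts = Hash.new(0)
  # "\r\n" is listed first so a Windows line ending is counted once,
  # not as a separate "\r" and "\n".
  sample.scan(/\r\n|\n|\r/) { |sep| counts[sep] += 1 }

  return "\n" if counts.empty? # assumption: fall back to "\n"

  counts.max_by { |_sep, n| n }.first
end

guess_row_sep("some_id,type\r\n42,zizzles\r\n") # "\r\n" occurs most often
```

A smaller sample is faster on large files, at the cost of possibly missing the dominant line ending if the first characters are not representative.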
data/Rakefile
CHANGED
|
@@ -3,6 +3,17 @@
|
|
|
3
3
|
require "bundler/gem_tasks"
|
|
4
4
|
require 'rspec/core/rake_task'
|
|
5
5
|
|
|
6
|
+
|
|
7
|
+
# temp fix for NoMethodError: undefined method `last_comment'
|
|
8
|
+
# remove when fixed in Rake 11.x and higher
|
|
9
|
+
module TempFixForRakeLastComment
|
|
10
|
+
def last_comment
|
|
11
|
+
last_description
|
|
12
|
+
end
|
|
13
|
+
end
|
|
14
|
+
Rake::Application.send :include, TempFixForRakeLastComment
|
|
15
|
+
### end of tempfix
|
|
16
|
+
|
|
6
17
|
RSpec::Core::RakeTask.new(:spec)
|
|
7
18
|
|
|
8
19
|
require "rubocop/rake_task"
|
data/lib/smarter_csv/version.rb
CHANGED
data/lib/smarter_csv.rb
CHANGED
|
@@ -3,8 +3,8 @@
|
|
|
3
3
|
require_relative "extensions/hash"
|
|
4
4
|
require_relative "smarter_csv/version"
|
|
5
5
|
|
|
6
|
-
require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
|
|
7
|
-
|
|
6
|
+
# require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
|
|
7
|
+
require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
|
|
8
8
|
|
|
9
9
|
module SmarterCSV
|
|
10
10
|
class SmarterCSVException < StandardError; end
|
|
@@ -39,11 +39,7 @@ module SmarterCSV
|
|
|
39
39
|
puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
|
|
40
40
|
end
|
|
41
41
|
|
|
42
|
-
|
|
43
|
-
options[:skip_lines].to_i.times do
|
|
44
|
-
readline_with_counts(fh, options)
|
|
45
|
-
end
|
|
46
|
-
end
|
|
42
|
+
skip_lines(fh, options)
|
|
47
43
|
|
|
48
44
|
headerA, header_size = process_headers(fh, options)
|
|
49
45
|
|
|
@@ -207,7 +203,7 @@ module SmarterCSV
|
|
|
207
203
|
acceleration: true,
|
|
208
204
|
auto_row_sep_chars: 500,
|
|
209
205
|
chunk_size: nil,
|
|
210
|
-
col_sep: ',',
|
|
206
|
+
col_sep: :auto, # was: ',',
|
|
211
207
|
comment_regexp: nil, # was: /\A#/,
|
|
212
208
|
convert_values_to_numeric: true,
|
|
213
209
|
downcase_header: true,
|
|
@@ -226,7 +222,7 @@ module SmarterCSV
|
|
|
226
222
|
remove_values_matching: nil,
|
|
227
223
|
remove_zero_values: false,
|
|
228
224
|
required_headers: nil,
|
|
229
|
-
row_sep: $/,
|
|
225
|
+
row_sep: :auto, # was: $/,
|
|
230
226
|
silence_missing_keys: false,
|
|
231
227
|
skip_lines: nil,
|
|
232
228
|
strings_as_keys: false,
|
|
@@ -243,9 +239,24 @@ module SmarterCSV
|
|
|
243
239
|
line = filehandle.readline(options[:row_sep])
|
|
244
240
|
@file_line_count += 1
|
|
245
241
|
@csv_line_count += 1
|
|
242
|
+
line = remove_bom(line) if @csv_line_count == 1
|
|
246
243
|
line
|
|
247
244
|
end
|
|
248
245
|
|
|
246
|
+
def skip_lines(filehandle, options)
|
|
247
|
+
return unless options[:skip_lines].to_i > 0
|
|
248
|
+
|
|
249
|
+
options[:skip_lines].to_i.times do
|
|
250
|
+
readline_with_counts(filehandle, options)
|
|
251
|
+
end
|
|
252
|
+
end
|
|
253
|
+
|
|
254
|
+
def rewind(filehandle)
|
|
255
|
+
@file_line_count = 0
|
|
256
|
+
@csv_line_count = 0
|
|
257
|
+
filehandle.rewind
|
|
258
|
+
end
|
|
259
|
+
|
|
249
260
|
###
|
|
250
261
|
### Thin wrapper around C-extension
|
|
251
262
|
###
|
|
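The `skip_lines` and `rewind` helpers in this hunk keep the line counters in step with the file position. The following sketch shows the same idea on a `StringIO`. The `LineReader` class is a hypothetical stand-in written for this example; in the gem these are module-level methods operating on instance variables.

```ruby
require 'stringio'

# Hypothetical stand-in for the reader state; not the gem's actual class.
class LineReader
  attr_reader :file_line_count

  def initialize(io, row_sep: "\n")
    @io = io
    @row_sep = row_sep
    @file_line_count = 0
  end

  # Read one line and keep the counter in sync.
  def readline
    line = @io.readline(@row_sep)
    @file_line_count += 1
    line
  end

  # Skip the first n lines; nil or 0 skips nothing.
  def skip_lines(n)
    n.to_i.times { readline }
  end

  # Reset the counter together with the IO position, so a later read
  # starts counting from the first line again.
  def rewind
    @file_line_count = 0
    @io.rewind
  end
end

reader = LineReader.new(StringIO.new("# comment\nid,name\n1,Ann\n"))
reader.skip_lines(1)
header = reader.readline # the line after the skipped comment
```

Resetting the counters inside `rewind` matters once detection code reads ahead: without it, the lines consumed while guessing separators would be counted twice.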
@@ -374,24 +385,23 @@ module SmarterCSV
|
|
|
374
385
|
return false
|
|
375
386
|
end
|
|
376
387
|
|
|
377
|
-
#
|
|
388
|
+
# If file has headers, then guesses column separator from headers.
|
|
389
|
+
# Otherwise guesses column separator from contents.
|
|
390
|
+
# Raises exception if none is found.
|
|
378
391
|
def guess_column_separator(filehandle, options)
|
|
379
|
-
|
|
380
|
-
n = Hash.new(0)
|
|
392
|
+
skip_lines(filehandle, options)
|
|
381
393
|
|
|
382
|
-
|
|
383
|
-
line = filehandle.readline(options[:row_sep])
|
|
384
|
-
del.each do |d|
|
|
385
|
-
n[d] += line.scan(d).count
|
|
386
|
-
end
|
|
387
|
-
rescue EOFError # short files
|
|
388
|
-
break
|
|
389
|
-
end
|
|
394
|
+
possible_delimiters = [',', "\t", ';', ':', '|']
|
|
390
395
|
|
|
391
|
-
|
|
392
|
-
|
|
396
|
+
candidates = if options.fetch(:headers_in_file)
|
|
397
|
+
candidated_column_separators_from_headers(filehandle, options, possible_delimiters)
|
|
398
|
+
else
|
|
399
|
+
candidated_column_separators_from_contents(filehandle, options, possible_delimiters)
|
|
400
|
+
end
|
|
401
|
+
|
|
402
|
+
raise SmarterCSV::NoColSepDetected if candidates.values.max == 0
|
|
393
403
|
|
|
394
|
-
|
|
404
|
+
candidates.key(candidates.values.max)
|
|
395
405
|
end
|
|
396
406
|
|
|
397
407
|
# limitation: this currently reads the whole file in before making a decision
|
|
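The counting approach used by `guess_column_separator` can be tried on a single header line in isolation. This is a standalone sketch of that approach, not the gem's code: the method name `guess_col_sep` is hypothetical, and it raises `ArgumentError` where the gem raises `SmarterCSV::NoColSepDetected`.

```ruby
# Same candidate list as in the diff above.
POSSIBLE_DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Count how often each candidate occurs in the line and return the most
# frequent one. On a tie, the candidate listed first wins.
def guess_col_sep(header_line, delimiters = POSSIBLE_DELIMITERS)
  counts = delimiters.to_h { |d| [d, header_line.scan(d).count] }
  raise ArgumentError, 'no column separator detected' if counts.values.max.zero?

  counts.key(counts.values.max)
end

guess_col_sep('some_id;type;fuzzboxes') # ';' occurs twice, others never
```

A plain occurrence count does not know about quoting, so a header such as `"a,b";c` would be counted as containing one `,` and one `;`.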
@@ -420,7 +430,7 @@ module SmarterCSV
|
|
|
420
430
|
lines += 1
|
|
421
431
|
break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
|
|
422
432
|
end
|
|
423
|
-
filehandle
|
|
433
|
+
rewind(filehandle)
|
|
424
434
|
|
|
425
435
|
counts["\r"] += 1 if last_char == "\r"
|
|
426
436
|
# find the most frequent key/value pair:
|
|
@@ -476,13 +486,13 @@ module SmarterCSV
|
|
|
476
486
|
|
|
477
487
|
unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
|
|
478
488
|
key_mappingH = options[:key_mapping]
|
|
479
|
-
|
|
480
489
|
# do some key mapping on the keys in the file header
|
|
481
490
|
# if you want to completely delete a key, then map it to nil or to ''
|
|
482
491
|
if !key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
|
|
483
492
|
unless options[:silence_missing_keys]
|
|
484
493
|
# if silence_missing_keys are not set, raise error if missing header
|
|
485
494
|
missing_keys = key_mappingH.keys - headerA
|
|
495
|
+
|
|
486
496
|
puts "WARNING: missing header(s): #{missing_keys.join(",")}" unless missing_keys.empty?
|
|
487
497
|
end
|
|
488
498
|
|
|
@@ -525,5 +535,56 @@ module SmarterCSV
|
|
|
525
535
|
end
|
|
526
536
|
result
|
|
527
537
|
end
|
|
538
|
+
|
|
539
|
+
private
|
|
540
|
+
|
|
541
|
+
UTF_32_BOM = %w[0 0 fe ff].freeze
|
|
542
|
+
UTF_32LE_BOM = %w[ff fe 0 0].freeze
|
|
543
|
+
UTF_8_BOM = %w[ef bb bf].freeze
|
|
544
|
+
UTF_16_BOM = %w[fe ff].freeze
|
|
545
|
+
UTF_16LE_BOM = %w[ff fe].freeze
|
|
546
|
+
|
|
547
|
+
def remove_bom(str)
|
|
548
|
+
str_as_hex = str.bytes.map{|x| x.to_s(16)}
|
|
549
|
+
# if string does not start with one of the bytes above, there is no BOM
|
|
550
|
+
return str unless %w[ef fe ff 0].include?(str_as_hex[0])
|
|
551
|
+
|
|
552
|
+
return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
|
|
553
|
+
return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
|
|
554
|
+
return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
|
|
555
|
+
|
|
556
|
+
puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
|
|
557
|
+
str
|
|
558
|
+
end
|
|
559
|
+
|
|
560
|
+
def candidated_column_separators_from_headers(filehandle, options, delimiters)
|
|
561
|
+
candidates = Hash.new(0)
|
|
562
|
+
line = readline_with_counts(filehandle, options.slice(:row_sep))
|
|
563
|
+
|
|
564
|
+
delimiters.each do |d|
|
|
565
|
+
candidates[d] += line.scan(d).count
|
|
566
|
+
end
|
|
567
|
+
|
|
568
|
+
rewind(filehandle)
|
|
569
|
+
|
|
570
|
+
candidates
|
|
571
|
+
end
|
|
572
|
+
|
|
573
|
+
def candidated_column_separators_from_contents(filehandle, options, delimiters)
|
|
574
|
+
candidates = Hash.new(0)
|
|
575
|
+
|
|
576
|
+
5.times do
|
|
577
|
+
line = readline_with_counts(filehandle, options.slice(:row_sep))
|
|
578
|
+
delimiters.each do |d|
|
|
579
|
+
candidates[d] += line.scan(d).count
|
|
580
|
+
end
|
|
581
|
+
rescue EOFError # short files
|
|
582
|
+
break
|
|
583
|
+
end
|
|
584
|
+
|
|
585
|
+
rewind(filehandle)
|
|
586
|
+
|
|
587
|
+
candidates
|
|
588
|
+
end
|
|
528
589
|
end
|
|
529
590
|
end
|
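The BOM handling above compares hex-formatted byte arrays. An alternative way to express the same check, shown here only as a sketch, compares binary string prefixes directly. The names `BOMS` and `strip_bom` are hypothetical and not part of the gem.

```ruby
# Known byte order marks as binary strings. The 4-byte UTF-32 marks come
# first, because the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
  "\x00\x00\xFE\xFF".b, # UTF-32 BE
  "\xFF\xFE\x00\x00".b, # UTF-32 LE
  "\xEF\xBB\xBF".b,     # UTF-8
  "\xFE\xFF".b,         # UTF-16 BE
  "\xFF\xFE".b          # UTF-16 LE
].freeze

# Return str without a leading BOM; strings without a BOM come back unchanged.
def strip_bom(str)
  bytes = str.b # binary copy, so the prefix check ignores the encoding
  bom = BOMS.find { |candidate| bytes.start_with?(candidate) }
  bom ? str.byteslice(bom.bytesize..-1) : str
end

strip_bom("\xEF\xBB\xBFsome_id,type") # header line without the UTF-8 BOM
```

As in the diff, only the bytes are removed; the string is not transcoded, so UTF-16 or UTF-32 content still needs the correct encoding when the file is opened.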
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: smarter_csv
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 1.
|
|
4
|
+
version: 1.8.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Tilo Sloboda
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: bin
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date:
|
|
11
|
+
date: 2023-03-19 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: awesome_print
|