smarter_csv 1.7.4 → 1.8.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +5 -1
- data/README.md +19 -6
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +47 -15
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 55400b3977ce35c58d60c4101362b68d99f2dbf7cb6a63956ae3b6ab79fcf1ac
+  data.tar.gz: 41f46d3e4de69a7924ecd2214ba4e37766106469d1b8b257fd752a96204a47fd
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 24ecc14cf9c65efe5c11e4bd20753420aa8ccd7385171cd21eac2e1be92c4896087cdc2a18799fa111c0f36154ad4481daed7f08b752f4fae2b5f27241b8cf6c
+  data.tar.gz: c1d70e18a7ae8057e58cbf73b62f4896dd7030bc5fd2e927669e5ea829f9a3c11daeb9c8b83296dbb46e6f0d23034245b7207882a77b54cc1ca128a581175359
data/CHANGELOG.md
CHANGED
@@ -1,7 +1,11 @@
 
 # SmarterCSV 1.x Change Log
 
-## 1.
+## 1.8.0 (2023-03-18)
+* NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+* ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
+
+## 1.7.4 (2023-01-13)
 * improved guessing of the column separator, thanks to Alessandro Fazzi
 
 ## 1.7.3 (2022-12-05)
data/README.md
CHANGED
@@ -55,10 +55,23 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
 * calling `process` with or without a block
 * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
 
-
-
-
-You can also set the `:row_sep` manually! Checkout Example
+By default (since version 1.8.0), detection of the column and row separators is set to automatic `row_sep: :auto`, `col_sep: :auto`. This should make it easier to process any CSV files without having to examine the line endings or column separators.
+
+You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters); nil or 0 will check the whole file).
+You can also set the `:row_sep` manually! Checkout Example 4 for unusual `:row_sep` and `:col_sep`.
+
+### Troubleshooting
+
+In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection a tool like `hexdump` can help find otherwise hidden control character or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
+
+```
+$ hexdump -C spec/fixtures/bom_test_feff.csv
+00000000  fe ff 73 6f 6d 65 5f 69  64 2c 74 79 70 65 2c 66  |..some_id,type,f|
+00000010  75 7a 7a 62 6f 78 65 73  0d 0a 34 32 37 36 36 38  |uzzboxes..427668|
+00000020  30 35 2c 7a 69 7a 7a 6c  65 73 2c 31 32 33 34 0d  |05,zizzles,1234.|
+00000030  0a 33 38 37 35 39 31 35  30 2c 71 75 69 7a 7a 65  |.38759150,quizze|
+00000040  73 2c 35 36 37 38 0d 0a                           |s,5678..|
+```
 
 
 #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
@@ -222,10 +235,10 @@ The options and the block are optional.
 | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
 | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
 ---------------------------------------------------------------------------------------------------------------------------------
-| :col_sep |
+| :col_sep | :auto | column separator (default was ',') |
 | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
 | | | e.g. when :quote_char is not properly escaped |
-| :row_sep |
+| :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
 | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
 | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
 | :quote_char | '"' | quotation character |
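To make the README's description of `:auto_row_sep_chars` concrete, here is a minimal sketch of guessing a row separator from only the first N characters of an input. `guess_row_sep` is a hypothetical helper written for illustration with the standard library only; it is not SmarterCSV's implementation, and it ignores the case where the sample is cut between a `\r` and its `\n`.

```ruby
require 'stringio'

# Hypothetical helper (not part of SmarterCSV): count candidate line endings
# in the first `max_chars` characters and return the most frequent one.
# nil or 0 for `max_chars` means "read the whole input".
def guess_row_sep(io, max_chars = 500)
  sample = max_chars.to_i > 0 ? io.read(max_chars) : io.read
  io.rewind # leave the handle at the start for the actual parsing pass
  sample = sample.to_s

  counts = {
    "\r\n" => sample.scan("\r\n").size,
    "\n"   => sample.scan(/(?<!\r)\n/).size, # LF not preceded by CR
    "\r"   => sample.scan(/\r(?!\n)/).size   # CR not followed by LF
  }

  best, hits = counts.max_by { |_sep, n| n }
  raise ArgumentError, 'no line ending found in sample' if hits.zero?

  best
end

puts guess_row_sep(StringIO.new("a,b\r\n1,2\r\n")).inspect
```

A small sample keeps detection cheap on large files, at the cost of a wrong guess when the first N characters are not representative.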
data/lib/smarter_csv/version.rb
CHANGED
data/lib/smarter_csv.rb
CHANGED
@@ -3,8 +3,8 @@
 require_relative "extensions/hash"
 require_relative "smarter_csv/version"
 
-require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
-
+# require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
 
 module SmarterCSV
   class SmarterCSVException < StandardError; end
@@ -39,11 +39,7 @@ module SmarterCSV
     puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
   end
 
-
-    options[:skip_lines].to_i.times do
-      readline_with_counts(fh, options)
-    end
-  end
+  skip_lines(fh, options)
 
   headerA, header_size = process_headers(fh, options)
 
@@ -207,7 +203,7 @@ module SmarterCSV
 acceleration: true,
 auto_row_sep_chars: 500,
 chunk_size: nil,
-col_sep: ',',
+col_sep: :auto, # was: ',',
 comment_regexp: nil, # was: /\A#/,
 convert_values_to_numeric: true,
 downcase_header: true,
@@ -226,7 +222,7 @@ module SmarterCSV
 remove_values_matching: nil,
 remove_zero_values: false,
 required_headers: nil,
-row_sep: $/,
+row_sep: :auto, # was: $/,
 silence_missing_keys: false,
 skip_lines: nil,
 strings_as_keys: false,
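Since both separator defaults change in this release, callers who relied on the earlier values may want to pin them explicitly. The sketch below only illustrates how user-supplied options would override the new defaults in a plain hash merge. `NEW_DEFAULTS` is a three-key excerpt copied from the hunks above, and the file name in the trailing comment is a placeholder.

```ruby
# Excerpt of the new defaults as they appear in the diff above
# (not the full defaults hash).
NEW_DEFAULTS = { col_sep: :auto, row_sep: :auto, auto_row_sep_chars: 500 }.freeze

# Explicit values a caller might pass to keep the earlier behaviour.
# "\n" stands in for $/, which the README says defaulted to "\n".
legacy_separators = { col_sep: ',', row_sep: "\n" }

# User-supplied keys win over defaults; untouched keys keep their default.
options = NEW_DEFAULTS.merge(legacy_separators)
puts options.inspect

# The explicit options would then be passed along, for example:
# SmarterCSV.process('data.csv', legacy_separators)   # 'data.csv' is a placeholder
```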
@@ -243,9 +239,24 @@ module SmarterCSV
   line = filehandle.readline(options[:row_sep])
   @file_line_count += 1
   @csv_line_count += 1
+  line = remove_bom(line) if @csv_line_count == 1
   line
 end
 
+def skip_lines(filehandle, options)
+  return unless options[:skip_lines].to_i > 0
+
+  options[:skip_lines].to_i.times do
+    readline_with_counts(filehandle, options)
+  end
+end
+
+def rewind(filehandle)
+  @file_line_count = 0
+  @csv_line_count = 0
+  filehandle.rewind
+end
+
 ###
 ### Thin wrapper around C-extension
 ###
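The new `rewind` helper resets the line counters together with the file position, so a detection pass that reads ahead does not inflate the counts seen by the later parsing pass. Here is a minimal sketch of that idea, using a hypothetical `LineReader` class in place of the module-level instance variables in the diff:

```ruby
require 'stringio'

# Hypothetical stand-in for the module-level state used in the diff above.
class LineReader
  attr_reader :file_line_count

  def initialize(io, row_sep = "\n")
    @io = io
    @row_sep = row_sep
    @file_line_count = 0
  end

  # Read one line and keep the counter in sync with the file position.
  def readline
    line = @io.readline(@row_sep)
    @file_line_count += 1
    line
  end

  # Consume `count` lines; nil counts as zero.
  def skip_lines(count)
    count.to_i.times { readline }
  end

  # Reset the counter together with the position, so a later pass
  # starts counting from zero again.
  def rewind
    @file_line_count = 0
    @io.rewind
  end
end

reader = LineReader.new(StringIO.new("# comment\nid,name\n1,foo\n"))
reader.skip_lines(1)
header = reader.readline
puts "header=#{header.inspect} lines_read=#{reader.file_line_count}"
```

Calling only `@io.rewind` would return to the first byte but leave the counter at its old value, which is the mismatch the helper in the diff appears to guard against.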
@@ -378,6 +389,8 @@ module SmarterCSV
 # Otherwise guesses column separator from contents.
 # Raises exception if none is found.
 def guess_column_separator(filehandle, options)
+  skip_lines(filehandle, options)
+
   possible_delimiters = [',', "\t", ';', ':', '|']
 
   candidates = if options.fetch(:headers_in_file)
@@ -417,7 +430,7 @@ module SmarterCSV
   lines += 1
   break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
 end
-filehandle
+rewind(filehandle)
 
 counts["\r"] += 1 if last_char == "\r"
 # find the most frequent key/value pair:
@@ -473,13 +486,13 @@ module SmarterCSV
 
 unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
   key_mappingH = options[:key_mapping]
-
   # do some key mapping on the keys in the file header
   # if you want to completely delete a key, then map it to nil or to ''
   if !key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
     unless options[:silence_missing_keys]
       # if silence_missing_keys are not set, raise error if missing header
       missing_keys = key_mappingH.keys - headerA
+
       puts "WARNING: missing header(s): #{missing_keys.join(",")}" unless missing_keys.empty?
     end
 
@@ -525,15 +538,34 @@ module SmarterCSV
 
 private
 
+UTF_32_BOM = %w[0 0 fe ff].freeze
+UTF_32LE_BOM = %w[ff fe 0 0].freeze
+UTF_8_BOM = %w[ef bb bf].freeze
+UTF_16_BOM = %w[fe ff].freeze
+UTF_16LE_BOM = %w[ff fe].freeze
+
+def remove_bom(str)
+  str_as_hex = str.bytes.map{|x| x.to_s(16)}
+  # if string does not start with one of the bytes above, there is no BOM
+  return str unless %w[ef fe ff 0].include?(str_as_hex[0])
+
+  return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
+  return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
+  return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
+
+  puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
+  str
+end
+
 def candidated_column_separators_from_headers(filehandle, options, delimiters)
   candidates = Hash.new(0)
-  line = filehandle.
+  line = readline_with_counts(filehandle, options.slice(:row_sep))
 
   delimiters.each do |d|
     candidates[d] += line.scan(d).count
   end
 
-  filehandle
+  rewind(filehandle)
 
   candidates
 end
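The order of the checks in `remove_bom` matters: the UTF-32LE BOM (`ff fe 00 00`) begins with the same two bytes as the UTF-16LE BOM (`ff fe`), so the four-byte patterns have to be tested first. Below is a minimal sketch of the same idea that compares integer byte values instead of hex strings. `strip_bom` and `BOMS` are hypothetical names, and like the original this only drops the marker without transcoding the remaining bytes.

```ruby
# Hypothetical byte-based variant of the BOM stripping shown in the diff.
# Longer patterns come first so that UTF-32LE is not mistaken for UTF-16LE.
BOMS = [
  [0x00, 0x00, 0xFE, 0xFF], # UTF-32 (big endian)
  [0xFF, 0xFE, 0x00, 0x00], # UTF-32LE
  [0xEF, 0xBB, 0xBF],       # UTF-8
  [0xFE, 0xFF],             # UTF-16 (big endian)
  [0xFF, 0xFE]              # UTF-16LE
].freeze

def strip_bom(str)
  head = str.byteslice(0, 4).to_s.bytes # at most the first four bytes
  bom = BOMS.find { |candidate| head.first(candidate.size) == candidate }
  bom ? str.byteslice(bom.size..-1) : str
end

# Same leading bytes as the hexdump in the README hunk: fe ff, then ASCII.
first_line = "\xFE\xFFsome_id,type,fuzzboxes\r\n".b
puts strip_bom(first_line).inspect
```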
@@ -542,7 +574,7 @@ module SmarterCSV
   candidates = Hash.new(0)
 
   5.times do
-    line = filehandle.
+    line = readline_with_counts(filehandle, options.slice(:row_sep))
     delimiters.each do |d|
       candidates[d] += line.scan(d).count
     end
@@ -550,7 +582,7 @@ module SmarterCSV
     break
   end
 
-  filehandle
+  rewind(filehandle)
 
   candidates
 end
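The two helpers above count how often each candidate delimiter occurs in the sampled lines and then rewind the handle. A minimal sketch of that counting approach on the header line only, using `StringIO` so it runs without the gem. `guess_col_sep` and `DELIMITERS` are hypothetical names, and picking the most frequent candidate is an assumption about how the counts are used.

```ruby
require 'stringio'

# Candidate delimiters, copied from `possible_delimiters` in the diff.
DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Hypothetical helper: count each candidate in the first line, rewind the
# handle so parsing can start from the top, and return the most frequent one.
def guess_col_sep(io, row_sep = "\n")
  line = io.readline(row_sep)
  io.rewind

  counts = DELIMITERS.to_h { |d| [d, line.scan(d).count] }
  best, hits = counts.max_by { |_d, n| n }
  raise ArgumentError, 'no column separator found' if hits.zero?

  best
end

io = StringIO.new("some_id;type;fuzzboxes\n42766805;zizzles;1234\n")
puts guess_col_sep(io).inspect
```

Counting on a single line is fragile when a value itself contains another candidate character (a time such as `12:30:00` adds two colons), which may be why the no-header variant in the diff samples up to five lines.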
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.8.0
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-
+date: 2023-03-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print