smarter_csv 1.7.4 → 1.8.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +10 -1
- data/README.md +27 -7
- data/ext/smarter_csv/smarter_csv.c +0 -1
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +79 -21
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: a7aa350efc77f90c6986a7573e733b5d9d02930c94465f17d2227b346263a6ce
|
4
|
+
data.tar.gz: 42351edf3e618b8c025f266796897aa0c3572d77e42788a05b1ee37ce8bdeed2
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 8bd9d59d7260a8e90ce472917801b98d088e37de5b1e912914f820f2efbbeb0491f5056d47575debdf1bccb8b9b8670cd089647efa15ec93b02413747dcfe702
|
7
|
+
data.tar.gz: 861364c6213af99c11cd3b9a59b2cf46f8c8e850ee2273e4f1b790714c9cd0ca66a734d64233737e086669c2b6aa51415f1343c3d61811547ec3c715d7a1620c
|
data/CHANGELOG.md
CHANGED
@@ -1,7 +1,16 @@
|
|
1
1
|
|
2
2
|
# SmarterCSV 1.x Change Log
|
3
3
|
|
4
|
-
## 1.
|
4
|
+
## 1.8.1 (2023-03-19)
|
5
|
+
* added validation against invalid values for :col_sep, :row_sep, :quote_char (issue #216)
|
6
|
+
* deprecating `required_headers` and replacing it with `required_keys` (issue #140)
|
7
|
+
* fixed issue with require statement
|
8
|
+
|
9
|
+
## 1.8.0 (2023-03-18)
|
10
|
+
* NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
|
11
|
+
* ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
|
12
|
+
|
13
|
+
## 1.7.4 (2023-01-13)
|
5
14
|
* improved guessing of the column separator, thanks to Alessandro Fazzi
|
6
15
|
|
7
16
|
## 1.7.3 (2022-12-05)
|
data/README.md
CHANGED
@@ -55,11 +55,30 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
|
|
55
55
|
* calling `process` with or without a block
|
56
56
|
* passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
|
57
57
|
|
58
|
-
|
59
|
-
But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
|
60
|
-
To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
|
61
|
-
You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
|
58
|
+
By default (since version 1.8.0), the column and row separators are detected automatically (`row_sep: :auto`, `col_sep: :auto`). This should make it easier to process any CSV file without having to examine the line endings or column separators.
|
62
59
|
|
60
|
+
You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters; nil or 0 will check the whole file).
|
61
|
+
You can also set the `:row_sep` manually! Check out Example 4 for unusual `:row_sep` and `:col_sep`.
|
62
|
+
|
63
|
+
### Troubleshooting
|
64
|
+
|
65
|
+
In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection, a tool like `hexdump` can help find otherwise hidden control characters or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
|
66
|
+
|
67
|
+
```
|
68
|
+
$ hexdump -C spec/fixtures/bom_test_feff.csv
|
69
|
+
00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
|
70
|
+
00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
|
71
|
+
00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
|
72
|
+
00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
|
73
|
+
00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
|
74
|
+
```
|
75
|
+
|
76
|
+
### Examples
|
77
|
+
|
78
|
+
Here are some examples to demonstrate the versatility of SmarterCSV.
|
79
|
+
|
80
|
+
By default SmarterCSV determines the `row_sep` and `col_sep` values automatically.
|
81
|
+
In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
|
63
82
|
|
64
83
|
#### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
|
65
84
|
Please note how each hash contains only the keys for columns with non-null values.
|
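The Troubleshooting step in the README hunk above uses `hexdump` to spot a BOM. The same check can be sketched in plain Ruby. The helper name `leading_bytes_hex` and the sample string are illustrative assumptions and are not part of SmarterCSV.

```ruby
# Hypothetical helper (not part of SmarterCSV): return the first `count`
# bytes of a string as two-digit hex values, similar to `hexdump -C` output.
def leading_bytes_hex(data, count = 4)
  data.byteslice(0, count).bytes.map { |b| format('%02x', b) }
end

# Sample input mimicking the fixture above: a UTF-16BE BOM (fe ff)
# followed by the start of the header row.
sample = "\xFE\xFFsome_id,type,fuzzboxes\r\n".b

puts leading_bytes_hex(sample).join(' ') # prints "fe ff 73 6f"
```

For a file on disk, `File.binread(path, 4)` would supply the leading bytes in the same way.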
@@ -222,10 +241,10 @@ The options and the block are optional.
|
|
222
241
|
| :skip_lines | nil | how many lines to skip before the first line or header line is processed |
|
223
242
|
| :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
|
224
243
|
---------------------------------------------------------------------------------------------------------------------------------
|
225
|
-
| :col_sep |
|
244
|
+
| :col_sep | :auto | column separator (default was ',') |
|
226
245
|
| :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
|
227
246
|
| | | e.g. when :quote_char is not properly escaped |
|
228
|
-
| :row_sep |
|
247
|
+
| :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
|
229
248
|
| | | This can also be set to :auto, but will process the whole CSV file first (slow!) |
|
230
249
|
| :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
|
231
250
|
| :quote_char | '"' | quotation character |
|
@@ -254,7 +273,8 @@ And header and data validations will also be supported in 2.x
|
|
254
273
|
---------------------------------------------------------------------------------------------------------------------------------
|
255
274
|
| :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
|
256
275
|
| :silence_missing_key | false | ignore missing keys in `key_mapping` if true |
|
257
|
-
| :
|
276
|
+
| :required_keys | nil | An array. Specify the required names AFTER header transformation. |
|
277
|
+
| :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
|
258
278
|
| | | or an exception is raised. No validation if nil is given. |
|
259
279
|
| :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
|
260
280
|
| :downcase_header | true | downcase all column headers |
|
data/ext/smarter_csv/smarter_csv.c
CHANGED
@@ -27,7 +27,6 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
|
|
27
27
|
long col_sep_len = RSTRING_LEN(col_sep);
|
28
28
|
|
29
29
|
char *quoteP = RSTRING_PTR(quote_char);
|
30
|
-
long quote_len = RSTRING_LEN(quote_char);
|
31
30
|
long quote_count = 0;
|
32
31
|
|
33
32
|
bool col_sep_found = true;
|
data/lib/smarter_csv/version.rb
CHANGED
data/lib/smarter_csv.rb
CHANGED
@@ -4,23 +4,24 @@ require_relative "extensions/hash"
|
|
4
4
|
require_relative "smarter_csv/version"
|
5
5
|
|
6
6
|
require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
|
7
|
-
# require 'smarter_csv.bundle' unless ENV['CI'] #
|
7
|
+
# require 'smarter_csv.bundle' unless ENV['CI'] # local testing
|
8
8
|
|
9
9
|
module SmarterCSV
|
10
10
|
class SmarterCSVException < StandardError; end
|
11
11
|
class HeaderSizeMismatch < SmarterCSVException; end
|
12
12
|
class IncorrectOption < SmarterCSVException; end
|
13
|
+
class ValidationError < SmarterCSVException; end
|
13
14
|
class DuplicateHeaders < SmarterCSVException; end
|
14
15
|
class MissingHeaders < SmarterCSVException; end
|
15
16
|
class NoColSepDetected < SmarterCSVException; end
|
16
|
-
class KeyMappingError < SmarterCSVException; end
|
17
|
-
class MalformedCSVError < SmarterCSVException; end
|
17
|
+
class KeyMappingError < SmarterCSVException; end # CURRENTLY UNUSED -> version 1.9.0
|
18
18
|
|
19
19
|
# first parameter: filename or input object which responds to readline method
|
20
20
|
def SmarterCSV.process(input, options = {}, &block)
|
21
21
|
options = default_options.merge(options)
|
22
22
|
options[:invalid_byte_sequence] = '' if options[:invalid_byte_sequence].nil?
|
23
23
|
puts "SmarterCSV OPTIONS: #{options.inspect}" if options[:verbose]
|
24
|
+
validate_options!(options)
|
24
25
|
|
25
26
|
headerA = []
|
26
27
|
result = []
|
@@ -39,11 +40,7 @@ module SmarterCSV
|
|
39
40
|
puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
|
40
41
|
end
|
41
42
|
|
42
|
-
|
43
|
-
options[:skip_lines].to_i.times do
|
44
|
-
readline_with_counts(fh, options)
|
45
|
-
end
|
46
|
-
end
|
43
|
+
skip_lines(fh, options)
|
47
44
|
|
48
45
|
headerA, header_size = process_headers(fh, options)
|
49
46
|
|
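The `validate_options!(options)` call added in this hunk rejects unusable separator values before any input is read. As a minimal standalone sketch of that idea: the names `separator_option_valid?` and `validate_separator_options!`, and the local `ValidationError` class, are stand-ins chosen for illustration, not the library's own code.

```ruby
# Local stand-in for SmarterCSV::ValidationError.
class ValidationError < StandardError; end

# A separator or quote option is accepted if it is :auto or a non-empty String.
def separator_option_valid?(value)
  return true if value == :auto

  value.is_a?(String) && !value.empty?
end

# Collect one message per invalid option and raise them together.
def validate_separator_options!(options)
  invalid = %i[row_sep col_sep quote_char].select do |key|
    options.key?(key) && !separator_option_valid?(options[key])
  end
  errors = invalid.map { |key| "invalid #{key}" }
  raise ValidationError, errors.inspect if errors.any?
end

validate_separator_options!({ col_sep: :auto, row_sep: "\n", quote_char: '"' }) # passes

begin
  validate_separator_options!({ col_sep: '' })
rescue ValidationError => e
  puts e.message # prints ["invalid col_sep"]
end
```

Raising once with all collected messages lets a caller fix every bad option in one pass instead of discovering them one at a time.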
@@ -207,7 +204,7 @@ module SmarterCSV
|
|
207
204
|
acceleration: true,
|
208
205
|
auto_row_sep_chars: 500,
|
209
206
|
chunk_size: nil,
|
210
|
-
col_sep: ',',
|
207
|
+
col_sep: :auto, # was: ',',
|
211
208
|
comment_regexp: nil, # was: /\A#/,
|
212
209
|
convert_values_to_numeric: true,
|
213
210
|
downcase_header: true,
|
@@ -218,7 +215,7 @@ module SmarterCSV
|
|
218
215
|
headers_in_file: true,
|
219
216
|
invalid_byte_sequence: '',
|
220
217
|
keep_original_headers: false,
|
221
|
-
|
218
|
+
key_mapping: nil,
|
222
219
|
quote_char: '"',
|
223
220
|
remove_empty_hashes: true,
|
224
221
|
remove_empty_values: true,
|
@@ -226,7 +223,8 @@ module SmarterCSV
|
|
226
223
|
remove_values_matching: nil,
|
227
224
|
remove_zero_values: false,
|
228
225
|
required_headers: nil,
|
229
|
-
|
226
|
+
required_keys: nil,
|
227
|
+
row_sep: :auto, # was: $/,
|
230
228
|
silence_missing_keys: false,
|
231
229
|
skip_lines: nil,
|
232
230
|
strings_as_keys: false,
|
@@ -243,9 +241,24 @@ module SmarterCSV
|
|
243
241
|
line = filehandle.readline(options[:row_sep])
|
244
242
|
@file_line_count += 1
|
245
243
|
@csv_line_count += 1
|
244
|
+
line = remove_bom(line) if @csv_line_count == 1
|
246
245
|
line
|
247
246
|
end
|
248
247
|
|
248
|
+
def skip_lines(filehandle, options)
|
249
|
+
return unless options[:skip_lines].to_i > 0
|
250
|
+
|
251
|
+
options[:skip_lines].to_i.times do
|
252
|
+
readline_with_counts(filehandle, options)
|
253
|
+
end
|
254
|
+
end
|
255
|
+
|
256
|
+
def rewind(filehandle)
|
257
|
+
@file_line_count = 0
|
258
|
+
@csv_line_count = 0
|
259
|
+
filehandle.rewind
|
260
|
+
end
|
261
|
+
|
249
262
|
###
|
250
263
|
### Thin wrapper around C-extension
|
251
264
|
###
|
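The `skip_lines` and `rewind` helpers in this hunk keep the line counters in step with the file position. The sketch below shows that pairing on a `StringIO`. `LineReader` is a hypothetical wrapper written for this example; the library keeps its counters in module-level instance variables instead.

```ruby
require 'stringio'

# Hypothetical wrapper (not SmarterCSV's API) pairing an IO with a line counter.
class LineReader
  attr_reader :file_line_count

  def initialize(io)
    @io = io
    @file_line_count = 0
  end

  # Read one line and count it.
  def readline(row_sep = "\n")
    line = @io.readline(row_sep)
    @file_line_count += 1
    line
  end

  # Discard `count` leading lines; nil or 0 skips nothing.
  def skip_lines(count)
    count.to_i.times { readline }
  end

  # Reset the counter together with the IO position so that a later
  # pass over the input starts from a consistent state.
  def rewind
    @file_line_count = 0
    @io.rewind
  end
end

reader = LineReader.new(StringIO.new("# exported by tool\nid,name\n1,Ann\n"))
reader.skip_lines(1)
puts reader.readline         # prints "id,name"
puts reader.file_line_count  # prints 2
```

Resetting the counters inside `rewind` matters because the separator-guessing passes read lines through the same counting reader before the real parse begins.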
@@ -378,6 +391,8 @@ module SmarterCSV
|
|
378
391
|
# Otherwise guesses column separator from contents.
|
379
392
|
# Raises exception if none is found.
|
380
393
|
def guess_column_separator(filehandle, options)
|
394
|
+
skip_lines(filehandle, options)
|
395
|
+
|
381
396
|
possible_delimiters = [',', "\t", ';', ':', '|']
|
382
397
|
|
383
398
|
candidates = if options.fetch(:headers_in_file)
|
@@ -417,7 +432,7 @@ module SmarterCSV
|
|
417
432
|
lines += 1
|
418
433
|
break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
|
419
434
|
end
|
420
|
-
filehandle
|
435
|
+
rewind(filehandle)
|
421
436
|
|
422
437
|
counts["\r"] += 1 if last_char == "\r"
|
423
438
|
# find the most frequent key/value pair:
|
@@ -497,12 +512,21 @@ module SmarterCSV
|
|
497
512
|
raise SmarterCSV::DuplicateHeaders, "ERROR: duplicate headers: #{duplicate_headers.join(',')}"
|
498
513
|
end
|
499
514
|
|
500
|
-
|
501
|
-
|
502
|
-
|
503
|
-
|
515
|
+
# deprecate required_headers
|
516
|
+
if !options[:required_headers].nil?
|
517
|
+
puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required headers'"
|
518
|
+
if options[:required_keys].nil?
|
519
|
+
options[:required_keys] = options[:required_headers]
|
520
|
+
options[:required_headers] = nil
|
521
|
+
end
|
522
|
+
end
|
523
|
+
|
524
|
+
if options[:required_keys] && options[:required_keys].is_a?(Array)
|
525
|
+
missing_keys = []
|
526
|
+
options[:required_keys].each do |k|
|
527
|
+
missing_keys << k unless headerA.include?(k)
|
504
528
|
end
|
505
|
-
raise SmarterCSV::MissingHeaders, "ERROR: missing
|
529
|
+
raise SmarterCSV::MissingHeaders, "ERROR: missing attributes: #{missing_keys.join(',')}" unless missing_keys.empty?
|
506
530
|
end
|
507
531
|
|
508
532
|
@headers = headerA
|
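The deprecation shim and the `required_keys` check in this hunk can be read as two small steps: map the old option onto the new one, then compare the required keys with the processed headers. A minimal sketch follows. The method names and the local `MissingHeaders` class are assumptions for illustration, and `warn` is used here where the diff uses `puts`.

```ruby
# Local stand-in for SmarterCSV::MissingHeaders.
class MissingHeaders < StandardError; end

# Step 1: copy a deprecated :required_headers value into :required_keys
# unless :required_keys was given explicitly. Works on a copy of the options.
def normalize_required_keys(options)
  opts = options.dup
  unless opts[:required_headers].nil?
    warn "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
    if opts[:required_keys].nil?
      opts[:required_keys] = opts[:required_headers]
      opts[:required_headers] = nil
    end
  end
  opts
end

# Step 2: raise if any required key is absent from the transformed headers.
def check_required_keys!(headers, options)
  required = options[:required_keys]
  return unless required.is_a?(Array)

  missing = required.reject { |key| headers.include?(key) }
  raise MissingHeaders, "ERROR: missing attributes: #{missing.join(',')}" unless missing.empty?
end

opts = normalize_required_keys({ required_headers: [:id] })
check_required_keys!(%i[id name], opts) # passes, :id is present
```

The keys are compared after header transformation, so with the default settings they would be downcased symbols such as `:id` rather than the raw header text.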
@@ -525,15 +549,49 @@ module SmarterCSV
|
|
525
549
|
|
526
550
|
private
|
527
551
|
|
552
|
+
UTF_32_BOM = %w[0 0 fe ff].freeze
|
553
|
+
UTF_32LE_BOM = %w[ff fe 0 0].freeze
|
554
|
+
UTF_8_BOM = %w[ef bb bf].freeze
|
555
|
+
UTF_16_BOM = %w[fe ff].freeze
|
556
|
+
UTF_16LE_BOM = %w[ff fe].freeze
|
557
|
+
|
558
|
+
def remove_bom(str)
|
559
|
+
str_as_hex = str.bytes.map{|x| x.to_s(16)}
|
560
|
+
# if string does not start with one of the bytes, there is no BOM
|
561
|
+
return str unless %w[ef fe ff 0].include?(str_as_hex[0])
|
562
|
+
|
563
|
+
return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
|
564
|
+
return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
|
565
|
+
return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
|
566
|
+
|
567
|
+
puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
|
568
|
+
str
|
569
|
+
end
|
570
|
+
|
571
|
+
def validate_options!(options)
|
572
|
+
keys = options.keys
|
573
|
+
errors = []
|
574
|
+
errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
|
575
|
+
errors << "invalid col_sep" if keys.include?(:col_sep) && !option_valid?(options[:col_sep])
|
576
|
+
errors << "invalid quote_char" if keys.include?(:quote_char) && !option_valid?(options[:quote_char])
|
577
|
+
raise SmarterCSV::ValidationError, errors.inspect if errors.any?
|
578
|
+
end
|
579
|
+
|
580
|
+
def option_valid?(str)
|
581
|
+
return true if str.is_a?(Symbol) && str == :auto
|
582
|
+
return true if str.is_a?(String) && !str.empty?
|
583
|
+
false
|
584
|
+
end
|
585
|
+
|
528
586
|
def candidated_column_separators_from_headers(filehandle, options, delimiters)
|
529
587
|
candidates = Hash.new(0)
|
530
|
-
line = filehandle.
|
588
|
+
line = readline_with_counts(filehandle, options.slice(:row_sep))
|
531
589
|
|
532
590
|
delimiters.each do |d|
|
533
591
|
candidates[d] += line.scan(d).count
|
534
592
|
end
|
535
593
|
|
536
|
-
filehandle
|
594
|
+
rewind(filehandle)
|
537
595
|
|
538
596
|
candidates
|
539
597
|
end
|
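The `remove_bom` helper in this hunk compares hex strings of the leading bytes against known BOMs. An alternative sketch of the same behavior compares raw bytes directly. This is an illustration under my own naming (`BOMS`, `strip_bom`), not the library's implementation.

```ruby
# Known byte order marks as binary strings. Order matters: the UTF-32LE
# mark begins with the UTF-16LE mark, so the longer one is checked first.
BOMS = [
  "\x00\x00\xFE\xFF".b, # UTF-32 (big endian)
  "\xFF\xFE\x00\x00".b, # UTF-32 (little endian)
  "\xEF\xBB\xBF".b,     # UTF-8
  "\xFE\xFF".b,         # UTF-16 (big endian)
  "\xFF\xFE".b          # UTF-16 (little endian)
].freeze

# Return `str` without a leading BOM; strings without a BOM come back unchanged.
def strip_bom(str)
  bytes = str.b
  bom = BOMS.find { |candidate| bytes.start_with?(candidate) }
  bom ? str.byteslice(bom.bytesize..-1) : str
end

puts strip_bom("\xEF\xBB\xBFid,name") # prints "id,name"
```

As in the diff, only the first line of the input needs this treatment, which is why the library calls it when `@csv_line_count == 1`.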
@@ -542,7 +600,7 @@ module SmarterCSV
|
|
542
600
|
candidates = Hash.new(0)
|
543
601
|
|
544
602
|
5.times do
|
545
|
-
line = filehandle.
|
603
|
+
line = readline_with_counts(filehandle, options.slice(:row_sep))
|
546
604
|
delimiters.each do |d|
|
547
605
|
candidates[d] += line.scan(d).count
|
548
606
|
end
|
@@ -550,7 +608,7 @@ module SmarterCSV
|
|
550
608
|
break
|
551
609
|
end
|
552
610
|
|
553
|
-
filehandle
|
611
|
+
rewind(filehandle)
|
554
612
|
|
555
613
|
candidates
|
556
614
|
end
|
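Both candidate-counting helpers above tally how often each possible delimiter occurs in the sampled lines. The sketch below shows the counting step on a single header line. The selection rule (most frequent delimiter wins, error if none occurs) is an assumption made for this example, as are the names `guess_col_sep` and the local `NoColSepDetected` class.

```ruby
# Local stand-in for SmarterCSV::NoColSepDetected.
class NoColSepDetected < StandardError; end

# Same candidate list as in guess_column_separator.
POSSIBLE_DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Count each candidate in the line and return the most frequent one.
# String#scan with a String argument matches literally, so '|' is safe here.
def guess_col_sep(line, delimiters = POSSIBLE_DELIMITERS)
  counts = delimiters.to_h { |d| [d, line.scan(d).count] }
  best, hits = counts.max_by { |_, n| n }
  raise NoColSepDetected if hits.zero?

  best
end

puts guess_col_sep("id;name;email") # prints ";"
```

Counting on the header line alone can mislead when header names themselves contain a candidate character, which is presumably why the diff also has a variant that samples several content lines.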
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: smarter_csv
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.8.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Tilo Sloboda
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2023-
|
11
|
+
date: 2023-03-19 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: awesome_print
|