smarter_csv 1.7.4 → 1.8.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +10 -1
- data/README.md +27 -7
- data/ext/smarter_csv/smarter_csv.c +0 -1
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +79 -21
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a7aa350efc77f90c6986a7573e733b5d9d02930c94465f17d2227b346263a6ce
+  data.tar.gz: 42351edf3e618b8c025f266796897aa0c3572d77e42788a05b1ee37ce8bdeed2
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8bd9d59d7260a8e90ce472917801b98d088e37de5b1e912914f820f2efbbeb0491f5056d47575debdf1bccb8b9b8670cd089647efa15ec93b02413747dcfe702
+  data.tar.gz: 861364c6213af99c11cd3b9a59b2cf46f8c8e850ee2273e4f1b790714c9cd0ca66a734d64233737e086669c2b6aa51415f1343c3d61811547ec3c715d7a1620c
data/CHANGELOG.md
CHANGED
@@ -1,7 +1,16 @@

 # SmarterCSV 1.x Change Log

-## 1.
+## 1.8.1 (2023-03-19)
+* added validation against invalid values for :col_sep, :row_sep, :quote_char (issue #216)
+* deprecating `required_headers` and replace with `required_keys` (issue #140)
+* fixed issue with require statement
+
+## 1.8.0 (2023-03-18)
+* NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+* ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
+
+## 1.7.4 (2023-01-13)
 * improved guessing of the column separator, thanks to Alessandro Fazzi

 ## 1.7.3 (2022-12-05)
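Note on the 1.8.1 entry: the changelog mentions validation of `:col_sep`, `:row_sep`, and `:quote_char`. The standalone sketch below illustrates one plausible reading of that rule, modeled on the `option_valid?` helper that appears in the `data/lib/smarter_csv.rb` hunks of this diff. It does not load the gem, and the names `separator_valid?` and `invalid_options` are hypothetical.

```ruby
# Standalone sketch; does not require smarter_csv.
# Assumption: a value is acceptable when it is :auto or a non-empty String,
# as in the option_valid? helper shown in this diff.
def separator_valid?(value)
  return true if value == :auto

  value.is_a?(String) && !value.empty?
end

# Hypothetical helper: names of the options that fail the check.
# Options that are absent from the hash are not checked.
def invalid_options(options)
  %i[row_sep col_sep quote_char].select do |key|
    options.key?(key) && !separator_valid?(options[key])
  end
end

p invalid_options(col_sep: :auto, row_sep: "\n", quote_char: '"')
p invalid_options(col_sep: "", row_sep: nil)
```

In the gem itself, failures of this kind are reported by raising `SmarterCSV::ValidationError`, according to the hunks further down.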
data/README.md
CHANGED
@@ -55,11 +55,30 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
 * calling `process` with or without a block
 * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.

-
-But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
-To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
-You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
+By default (since version 1.8.0), detection of the column and row separators is set to automatic `row_sep: :auto`, `col_sep: :auto`. This should make it easier to process any CSV files without having to examine the line endings or column separators.

+You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters); nil or 0 will check the whole file).
+You can also set the `:row_sep` manually! Checkout Example 4 for unusual `:row_sep` and `:col_sep`.
+
+### Troubleshooting
+
+In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection a tool like `hexdump` can help find otherwise hidden control character or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
+
+```
+$ hexdump -C spec/fixtures/bom_test_feff.csv
+00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
+00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
+00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
+00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
+00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
+```
+
+### Examples
+
+Here are some examples to demonstrate the versatility of SmarterCSV.
+
+By default SmarterCSV determines the `row_sep` and `col_sep` values automatically.
+In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.

 #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
 Please note how each hash contains only the keys for columns with non-null values.
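Note on the Troubleshooting text: where `hexdump` is not at hand, the same kind of inspection can be approximated in Ruby. The sketch below is only an illustration; `hex_prefix` is a hypothetical helper and not part of SmarterCSV.

```ruby
# Hypothetical helper: hex view of the first n bytes of some raw data,
# similar in spirit to the first columns of `hexdump -C`.
def hex_prefix(data, n = 8)
  data.byteslice(0, n).bytes.map { |b| format('%02x', b) }.join(' ')
end

# Sample data starting with a UTF-8 byte order mark (ef bb bf).
sample = "\xEF\xBB\xBFsome_id,type\r\n".b
puts hex_prefix(sample)
```

For a real file, the data could come from `File.binread(path, 8)`, which reads the leading bytes without any encoding conversion.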
@@ -222,10 +241,10 @@ The options and the block are optional.
 | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
 | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
 ---------------------------------------------------------------------------------------------------------------------------------
-| :col_sep |
+| :col_sep | :auto | column separator (default was ',') |
 | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
 | | | e.g. when :quote_char is not properly escaped |
-| :row_sep |
+| :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
 | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
 | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
 | :quote_char | '"' | quotation character |
@@ -254,7 +273,8 @@ And header and data validations will also be supported in 2.x
 ---------------------------------------------------------------------------------------------------------------------------------
 | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
 | :silence_missing_key | false | ignore missing keys in `key_mapping` if true |
-| :
+| :required_keys | nil | An array. Specify the required names AFTER header transformation. |
+| :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
 | | | or an exception is raised No validation if nil is given. |
 | :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
 | :downcase_header | true | downcase all column headers |
data/ext/smarter_csv/smarter_csv.c
CHANGED
@@ -27,7 +27,6 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
 long col_sep_len = RSTRING_LEN(col_sep);

 char *quoteP = RSTRING_PTR(quote_char);
-long quote_len = RSTRING_LEN(quote_char);
 long quote_count = 0;

 bool col_sep_found = true;
data/lib/smarter_csv/version.rb
CHANGED
data/lib/smarter_csv.rb
CHANGED
@@ -4,23 +4,24 @@ require_relative "extensions/hash"
 require_relative "smarter_csv/version"

 require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
-# require 'smarter_csv.bundle' unless ENV['CI'] #
+# require 'smarter_csv.bundle' unless ENV['CI'] # local testing

 module SmarterCSV
   class SmarterCSVException < StandardError; end
   class HeaderSizeMismatch < SmarterCSVException; end
   class IncorrectOption < SmarterCSVException; end
+  class ValidationError < SmarterCSVException; end
   class DuplicateHeaders < SmarterCSVException; end
   class MissingHeaders < SmarterCSVException; end
   class NoColSepDetected < SmarterCSVException; end
-  class KeyMappingError < SmarterCSVException; end
-  class MalformedCSVError < SmarterCSVException; end
+  class KeyMappingError < SmarterCSVException; end # CURRENTLY UNUSED -> version 1.9.0

   # first parameter: filename or input object which responds to readline method
   def SmarterCSV.process(input, options = {}, &block)
     options = default_options.merge(options)
     options[:invalid_byte_sequence] = '' if options[:invalid_byte_sequence].nil?
     puts "SmarterCSV OPTIONS: #{options.inspect}" if options[:verbose]
+    validate_options!(options)

     headerA = []
     result = []
@@ -39,11 +40,7 @@ module SmarterCSV
   puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
 end

-
-  options[:skip_lines].to_i.times do
-    readline_with_counts(fh, options)
-  end
-end
+skip_lines(fh, options)

 headerA, header_size = process_headers(fh, options)

@@ -207,7 +204,7 @@ module SmarterCSV
 acceleration: true,
 auto_row_sep_chars: 500,
 chunk_size: nil,
-col_sep: ',',
+col_sep: :auto, # was: ',',
 comment_regexp: nil, # was: /\A#/,
 convert_values_to_numeric: true,
 downcase_header: true,
@@ -218,7 +215,7 @@ module SmarterCSV
 headers_in_file: true,
 invalid_byte_sequence: '',
 keep_original_headers: false,
-
+key_mapping: nil,
 quote_char: '"',
 remove_empty_hashes: true,
 remove_empty_values: true,
@@ -226,7 +223,8 @@ module SmarterCSV
 remove_values_matching: nil,
 remove_zero_values: false,
 required_headers: nil,
-
+required_keys: nil,
+row_sep: :auto, # was: $/,
 silence_missing_keys: false,
 skip_lines: nil,
 strings_as_keys: false,
@@ -243,9 +241,24 @@ module SmarterCSV
   line = filehandle.readline(options[:row_sep])
   @file_line_count += 1
   @csv_line_count += 1
+  line = remove_bom(line) if @csv_line_count == 1
   line
 end

+def skip_lines(filehandle, options)
+  return unless options[:skip_lines].to_i > 0
+
+  options[:skip_lines].to_i.times do
+    readline_with_counts(filehandle, options)
+  end
+end
+
+def rewind(filehandle)
+  @file_line_count = 0
+  @csv_line_count = 0
+  filehandle.rewind
+end
+
 ###
 ### Thin wrapper around C-extension
 ###
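Note on the hunk above: `skip_lines` and `rewind` keep the line counters in step with the file position. The sketch below illustrates that idea on a `StringIO`. The `LineReader` class is a hypothetical stand-in that holds the state in an object; the gem keeps its counters in module-level instance variables instead.

```ruby
require 'stringio'

# Hypothetical stand-in for the reader state shown in the hunk above.
class LineReader
  attr_reader :line_count

  def initialize(io, row_sep = "\n")
    @io = io
    @row_sep = row_sep
    @line_count = 0
  end

  # Read one line and count it.
  def readline
    line = @io.readline(@row_sep)
    @line_count += 1
    line
  end

  # Skip n leading lines; nil or 0 skips nothing.
  def skip_lines(n)
    n.to_i.times { readline }
  end

  # Reset the counter together with the IO position.
  def rewind
    @line_count = 0
    @io.rewind
  end
end

reader = LineReader.new(StringIO.new("junk\nid,name\n1,a\n"))
reader.skip_lines(1)
puts reader.readline # the header line that follows the skipped line
reader.rewind
puts reader.line_count
```

Resetting the counter inside `rewind` means a detection pass that reads ahead does not distort the line numbers seen by the main parsing pass.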
@@ -378,6 +391,8 @@ module SmarterCSV
 # Otherwise guesses column separator from contents.
 # Raises exception if none is found.
 def guess_column_separator(filehandle, options)
+  skip_lines(filehandle, options)
+
   possible_delimiters = [',', "\t", ';', ':', '|']

   candidates = if options.fetch(:headers_in_file)
@@ -417,7 +432,7 @@ module SmarterCSV
   lines += 1
   break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
 end
-filehandle
+rewind(filehandle)

 counts["\r"] += 1 if last_char == "\r"
 # find the most frequent key/value pair:
@@ -497,12 +512,21 @@ module SmarterCSV
   raise SmarterCSV::DuplicateHeaders, "ERROR: duplicate headers: #{duplicate_headers.join(',')}"
 end

-
-
-
-
+# deprecate required_headers
+if !options[:required_headers].nil?
+  puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required headers'"
+  if options[:required_keys].nil?
+    options[:required_keys] = options[:required_headers]
+    options[:required_headers] = nil
+  end
+end
+
+if options[:required_keys] && options[:required_keys].is_a?(Array)
+  missing_keys = []
+  options[:required_keys].each do |k|
+    missing_keys << k unless headerA.include?(k)
   end
-  raise SmarterCSV::MissingHeaders, "ERROR: missing
+  raise SmarterCSV::MissingHeaders, "ERROR: missing attributes: #{missing_keys.join(',')}" unless missing_keys.empty?
 end

 @headers = headerA
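Note on the hunk above: the deprecated `required_headers` option is copied into `required_keys` when the latter is unset, and the transformed headers are then checked against that list. The sketch below restates this as two standalone functions. The names `normalize_required_keys` and `missing_keys` are hypothetical; unlike the gem, the sketch works on a copy of the options hash, writes the warning with `warn` instead of `puts`, and returns the missing keys instead of raising.

```ruby
# Hypothetical helper: map the deprecated option onto the new one.
def normalize_required_keys(options)
  opts = options.dup
  unless opts[:required_headers].nil?
    warn "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
    if opts[:required_keys].nil?
      opts[:required_keys] = opts[:required_headers]
      opts[:required_headers] = nil
    end
  end
  opts
end

# Hypothetical helper: required keys that are absent from the headers.
def missing_keys(headers, options)
  required = options[:required_keys]
  return [] unless required.is_a?(Array)

  required.reject { |k| headers.include?(k) }
end

opts = normalize_required_keys(required_headers: [:id, :email])
p missing_keys([:id, :name], opts)
```

In the gem, a non-empty result at this point leads to `SmarterCSV::MissingHeaders` being raised.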
@@ -525,15 +549,49 @@ module SmarterCSV

 private

+UTF_32_BOM = %w[0 0 fe ff].freeze
+UTF_32LE_BOM = %w[ff fe 0 0].freeze
+UTF_8_BOM = %w[ef bb bf].freeze
+UTF_16_BOM = %w[fe ff].freeze
+UTF_16LE_BOM = %w[ff fe].freeze
+
+def remove_bom(str)
+  str_as_hex = str.bytes.map{|x| x.to_s(16)}
+  # if string does not start with one of the bytes, there is no BOM
+  return str unless %w[ef fe ff 0].include?(str_as_hex[0])
+
+  return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
+  return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
+  return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
+
+  puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
+  str
+end
+
+def validate_options!(options)
+  keys = options.keys
+  errors = []
+  errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
+  errors << "invalid col_sep" if keys.include?(:col_sep) && !option_valid?(options[:col_sep])
+  errors << "invalid quote_char" if keys.include?(:quote_char) && !option_valid?(options[:quote_char])
+  raise SmarterCSV::ValidationError, errors.inspect if errors.any?
+end
+
+def option_valid?(str)
+  return true if str.is_a?(Symbol) && str == :auto
+  return true if str.is_a?(String) && !str.empty?
+  false
+end
+
 def candidated_column_separators_from_headers(filehandle, options, delimiters)
   candidates = Hash.new(0)
-  line = filehandle.
+  line = readline_with_counts(filehandle, options.slice(:row_sep))

   delimiters.each do |d|
     candidates[d] += line.scan(d).count
   end

-  filehandle
+  rewind(filehandle)

   candidates
 end
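Note on `remove_bom` in the hunk above: the method strips a leading byte order mark from the first line, checking the 4-byte UTF-32 marks before the shorter ones. The sketch below shows the same idea with a raw byte comparison instead of the hex-string comparison used in the diff. `BOMS` and `strip_bom` are hypothetical names, and this is an illustration, not the gem's implementation.

```ruby
# Byte order marks, longest first, so that a UTF-32LE mark (ff fe 00 00)
# is not mistaken for a UTF-16LE mark (ff fe).
BOMS = [
  "\x00\x00\xFE\xFF".b, # UTF-32 BE
  "\xFF\xFE\x00\x00".b, # UTF-32 LE
  "\xEF\xBB\xBF".b,     # UTF-8
  "\xFE\xFF".b,         # UTF-16 BE
  "\xFF\xFE".b          # UTF-16 LE
].freeze

# Return str without a leading BOM; str is returned unchanged if none is found.
def strip_bom(str)
  bytes = str.b # binary copy, so the comparison is byte-wise
  bom = BOMS.find { |mark| bytes.start_with?(mark) }
  bom ? str.byteslice(bom.bytesize..-1) : str
end

puts strip_bom("\xEF\xBB\xBFsome_id,type")
```

Ordering the list by length matters for the same reason the diff tests the UTF-32 constants first: the UTF-16LE mark is a prefix of the UTF-32LE mark.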
@@ -542,7 +600,7 @@ module SmarterCSV
 candidates = Hash.new(0)

 5.times do
-  line = filehandle.
+  line = readline_with_counts(filehandle, options.slice(:row_sep))
   delimiters.each do |d|
     candidates[d] += line.scan(d).count
   end
@@ -550,7 +608,7 @@ module SmarterCSV
     break
   end

-  filehandle
+  rewind(filehandle)

   candidates
 end
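Note on the two hunks above: both helpers count how often each candidate delimiter occurs, in the header line or in up to five data lines. The sketch below reproduces that counting on an array of lines. The final selection step, taking the most frequent candidate, is an assumption of this sketch, since the selection code is not part of this diff. `delimiter_counts` and `guess_delimiter` are hypothetical names.

```ruby
# Candidate list as it appears in guess_column_separator in this diff.
DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Count each candidate delimiter over at most the first five lines.
def delimiter_counts(lines, delimiters = DELIMITERS)
  counts = Hash.new(0)
  lines.first(5).each do |line|
    delimiters.each { |d| counts[d] += line.scan(d).count }
  end
  counts
end

# Assumption of this sketch: the most frequent candidate wins;
# nil is returned when no candidate occurs at all.
def guess_delimiter(lines)
  best, count = delimiter_counts(lines).max_by { |_, c| c }
  count.to_i.zero? ? nil : best
end

p guess_delimiter(["id;name;city", "1;Ann;Oslo"])
```

A plain frequency count can be misled by delimiters inside quoted fields, which is one reason the README keeps the option of setting `:col_sep` manually.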
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.8.1
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-
+date: 2023-03-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print