smarter_csv 1.8.0 → 1.8.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +14 -1
- data/README.md +11 -1
- data/TO_DO_v2.md +14 -0
- data/ext/smarter_csv/smarter_csv.c +0 -1
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +54 -45
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4a6ec6f3a579d9c1e6bfc2c3c9006f64d8c7b705eeca6ec048ea56c688f8ea1c
+  data.tar.gz: ba9a4a289adcc2fc398ae608f9570c28baac57b877852f3ea37c78fa57f2d7e3
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d8f516501a5539e30789e2d18c4d051f50372786d8df1272192c2bc7997470cf5d5e1ae94b776d0d580cb62ed8ffb0f6591ccc2be5d60eae6e421f22f0c92f94
+  data.tar.gz: 2993c59278adb531cf2299c0aa3869c637868f2f4e6421f845e4ade35011f94ee6e226f43e07b206c67c354c2eb38c0a41981ffbdbff00950690b94b06b2aacd
data/CHANGELOG.md
CHANGED
@@ -1,8 +1,21 @@
 
 # SmarterCSV 1.x Change Log
 
-## 1.8.
+## 1.8.2 (2023-03-21)
+* bugfix: do not raise `NoColSepDetected` for CSV files with only one column in most cases (issue #222)
+If the first lines contain non-ASCII characters, and no col_sep is detected, it will still raise `NoColSepDetected`
+
+## 1.8.1 (2023-03-19)
+* added validation against invalid values for :col_sep, :row_sep, :quote_char (issue #216)
+* deprecating `required_headers` and replace with `required_keys` (issue #140)
+* fixed issue with require statement
+
+## 1.8.0 (2023-03-18) BREAKING
 * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+
+MAKE SURE to rescue `NoColSepDetected` if your CSV files can have unexpected formats,
+e.g. from users uploading them to a service, and handle those cases.
+
 * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
 
 ## 1.7.4 (2023-01-13)
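The 1.8.0 entry above asks callers to rescue `NoColSepDetected`. A minimal sketch of that pattern follows. So that it runs without the gem installed, it declares stand-in exception classes in a hypothetical `CsvSketch` module that mirror the hierarchy in `lib/smarter_csv.rb`, and it uses a hypothetical `import_rows` helper that takes the parsing step as a block. In real code the block would presumably call `SmarterCSV.process(path)` and the rescue clauses would name `SmarterCSV::NoColSepDetected` and `SmarterCSV::SmarterCSVException`.

```ruby
# Stand-ins for SmarterCSV's exception hierarchy (assumption: these names are
# local to this sketch and are not part of the gem).
module CsvSketch
  class SmarterCSVException < StandardError; end
  class NoColSepDetected < SmarterCSVException; end
end

# Hypothetical helper: runs the given parsing block and converts parse
# failures into a result hash instead of letting them propagate.
def import_rows(path)
  { ok: true, rows: yield(path) }
rescue CsvSketch::NoColSepDetected
  # most specific case first: separator auto-detection failed
  { ok: false, error: "could not detect a column separator in #{path}" }
rescue CsvSketch::SmarterCSVException => e
  # any other error from the same hierarchy
  { ok: false, error: "could not parse #{path}: #{e.message}" }
end

good = import_rows("ok.csv") { |_path| [{ name: "a" }] }
bad  = import_rows("upload.csv") { |_path| raise CsvSketch::NoColSepDetected }
puts good.inspect
puts bad.inspect
```

Rescuing the subclass before the base class lets an upload endpoint report a separator problem specifically while still catching every other error from the same hierarchy.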
data/README.md
CHANGED
@@ -73,6 +73,15 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
 ```
 
+### Examples
+
+Here are some examples to demonstrate the versatility of SmarterCSV.
+
+**It is generally recommended to rescue `SmarterCSVException` or it's sub-classes.**
+
+By default SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception will be raised, e.g. `NoColSepDetected`. Rescuing from these exceptions will make sure that you don't miss processing CSV files, in case users upload CSV files with unexpected formats.
+
+In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
 
 #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
 Please note how each hash contains only the keys for columns with non-null values.
@@ -267,7 +276,8 @@ And header and data validations will also be supported in 2.x
 ---------------------------------------------------------------------------------------------------------------------------------
 | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
 | :silence_missing_key | false | ignore missing keys in `key_mapping` if true |
-| :
+| :required_keys | nil | An array. Specify the required names AFTER header transformation. |
+| :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
 | | | or an exception is raised No validation if nil is given. |
 | :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
 | :downcase_header | true | downcase all column headers |
data/TO_DO_v2.md
ADDED
@@ -0,0 +1,14 @@
+# SmarterCSV v2.0 TO DO List
+
+* add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
+* use Procs for validations and transformatoins [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+* make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+* skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120).
+Or stream large file from S3 (linked in the issue)
+* Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
+* Don't call rewind on filehandle
+* [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
+* [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
+* Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
+* Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
+
data/ext/smarter_csv/smarter_csv.c
CHANGED
@@ -27,7 +27,6 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
   long col_sep_len = RSTRING_LEN(col_sep);
 
   char *quoteP = RSTRING_PTR(quote_char);
-  long quote_len = RSTRING_LEN(quote_char);
   long quote_count = 0;
 
   bool col_sep_found = true;
data/lib/smarter_csv/version.rb
CHANGED
data/lib/smarter_csv.rb
CHANGED
@@ -3,24 +3,25 @@
 require_relative "extensions/hash"
 require_relative "smarter_csv/version"
 
-
-require 'smarter_csv.bundle' unless ENV['CI'] #
+require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+# require 'smarter_csv.bundle' unless ENV['CI'] # local testing
 
 module SmarterCSV
   class SmarterCSVException < StandardError; end
   class HeaderSizeMismatch < SmarterCSVException; end
   class IncorrectOption < SmarterCSVException; end
+  class ValidationError < SmarterCSVException; end
   class DuplicateHeaders < SmarterCSVException; end
   class MissingHeaders < SmarterCSVException; end
   class NoColSepDetected < SmarterCSVException; end
-  class KeyMappingError < SmarterCSVException; end
-  class MalformedCSVError < SmarterCSVException; end
+  class KeyMappingError < SmarterCSVException; end # CURRENTLY UNUSED -> version 1.9.0
 
   # first parameter: filename or input object which responds to readline method
   def SmarterCSV.process(input, options = {}, &block)
     options = default_options.merge(options)
     options[:invalid_byte_sequence] = '' if options[:invalid_byte_sequence].nil?
     puts "SmarterCSV OPTIONS: #{options.inspect}" if options[:verbose]
+    validate_options!(options)
 
     headerA = []
     result = []
@@ -214,7 +215,7 @@ module SmarterCSV
         headers_in_file: true,
         invalid_byte_sequence: '',
         keep_original_headers: false,
-
+        key_mapping: nil,
         quote_char: '"',
         remove_empty_hashes: true,
         remove_empty_values: true,
@@ -222,6 +223,7 @@ module SmarterCSV
         remove_values_matching: nil,
         remove_zero_values: false,
         required_headers: nil,
+        required_keys: nil,
         row_sep: :auto, # was: $/,
         silence_missing_keys: false,
         skip_lines: nil,
@@ -391,15 +393,28 @@ module SmarterCSV
     def guess_column_separator(filehandle, options)
       skip_lines(filehandle, options)
 
-
+      delimiters = [',', "\t", ';', ':', '|']
 
-
-
-
-
-
+      line = nil
+      has_header = options[:headers_in_file]
+      candidates = Hash.new(0)
+      count = has_header ? 1 : 5
+      count.times do
+        line = readline_with_counts(filehandle, options)
+        delimiters.each do |d|
+          candidates[d] += line.scan(d).count
+        end
+      rescue EOFError # short files
+        break
+      end
+      rewind(filehandle)
 
-
+      if candidates.values.max == 0
+        # if the header only contains
+        return ',' if line =~ /^\w+$/
+
+        raise SmarterCSV::NoColSepDetected
+      end
 
       candidates.key(candidates.values.max)
     end
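The detection logic in this hunk can be tried in isolation. The following is a standalone sketch, not the gem's code: `guess_col_sep_sketch` and `NoColSepDetectedSketch` are hypothetical names, the input is assumed to be any IO-like object (a `StringIO` here), and plain `readline` replaces the gem's `readline_with_counts`. Unlike the hunk, the sketch checks `eof?` instead of rescuing `EOFError`, and it raises on empty input.

```ruby
require "stringio"

# Stand-in for SmarterCSV::NoColSepDetected so the sketch runs without the gem.
class NoColSepDetectedSketch < StandardError; end

DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Counts each candidate delimiter in the first line (or the first five lines
# when there is no header) and returns the most frequent one.
def guess_col_sep_sketch(io, headers_in_file: true)
  candidates = Hash.new(0)
  line = nil
  (headers_in_file ? 1 : 5).times do
    break if io.eof? # short inputs
    line = io.readline.chomp
    DELIMITERS.each { |d| candidates[d] += line.scan(d).count }
  end
  io.rewind

  if candidates.values.max.to_i.zero?
    # a line made only of ASCII word characters is treated as a one-column file
    return ',' if line.to_s.match?(/\A\w+\z/)

    raise NoColSepDetectedSketch
  end

  candidates.key(candidates.values.max)
end

puts guess_col_sep_sketch(StringIO.new("a;b;c\n1;2;3\n")).inspect
puts guess_col_sep_sketch(StringIO.new("name\nalice\n")).inspect
```

Because `\w` in a Ruby regexp matches only ASCII word characters, a single-column header containing non-ASCII characters still falls through to the exception, which matches the caveat in the 1.8.2 changelog entry.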
@@ -486,13 +501,13 @@ module SmarterCSV
 
       unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
         key_mappingH = options[:key_mapping]
+
         # do some key mapping on the keys in the file header
         # if you want to completely delete a key, then map it to nil or to ''
         if !key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
           unless options[:silence_missing_keys]
             # if silence_missing_keys are not set, raise error if missing header
             missing_keys = key_mappingH.keys - headerA
-
             puts "WARNING: missing header(s): #{missing_keys.join(",")}" unless missing_keys.empty?
           end
 
@@ -510,12 +525,21 @@ module SmarterCSV
         raise SmarterCSV::DuplicateHeaders, "ERROR: duplicate headers: #{duplicate_headers.join(',')}"
       end
 
-
-
-
-
+      # deprecate required_headers
+      if !options[:required_headers].nil?
+        puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required headers'"
+        if options[:required_keys].nil?
+          options[:required_keys] = options[:required_headers]
+          options[:required_headers] = nil
         end
-
+      end
+
+      if options[:required_keys] && options[:required_keys].is_a?(Array)
+        missing_keys = []
+        options[:required_keys].each do |k|
+          missing_keys << k unless headerA.include?(k)
+        end
+        raise SmarterCSV::MissingHeaders, "ERROR: missing attributes: #{missing_keys.join(',')}" unless missing_keys.empty?
       end
 
       @headers = headerA
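The deprecation path in this hunk maps `required_headers` onto `required_keys` and then checks the transformed headers. A standalone sketch of that flow follows, assuming headers have already been converted to symbols. `check_required_keys_sketch!` and `MissingHeadersSketch` are hypothetical names used only for illustration, and the sketch writes the warning to stderr with `warn` where the hunk uses `puts`.

```ruby
# Stand-in for SmarterCSV::MissingHeaders so the sketch runs without the gem.
class MissingHeadersSketch < StandardError; end

# Maps the deprecated :required_headers option onto :required_keys, then
# raises if any required key is absent from the already transformed headers.
def check_required_keys_sketch!(headers, options)
  unless options[:required_headers].nil?
    warn "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
    if options[:required_keys].nil?
      options[:required_keys] = options[:required_headers]
      options[:required_headers] = nil
    end
  end

  required = options[:required_keys]
  return unless required.is_a?(Array)

  missing = required - headers
  raise MissingHeadersSketch, "ERROR: missing attributes: #{missing.join(',')}" unless missing.empty?
end

headers = %i[first_name last_name]
check_required_keys_sketch!(headers, { required_keys: %i[first_name] }) # passes silently
begin
  check_required_keys_sketch!(headers, { required_headers: %i[email] })
rescue MissingHeadersSketch => e
  puts e.message
end
```

As in the hunk, an explicitly supplied `required_keys` wins: the deprecated option is only copied over when `required_keys` is nil.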
@@ -546,7 +570,7 @@ module SmarterCSV
 
     def remove_bom(str)
       str_as_hex = str.bytes.map{|x| x.to_s(16)}
-      # if string does not start with one of the bytes
+      # if string does not start with one of the bytes, there is no BOM
       return str unless %w[ef fe ff 0].include?(str_as_hex[0])
 
       return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
@@ -557,34 +581,19 @@ module SmarterCSV
       str
     end
 
-    def
-
-
-
-
-
-
-
-      rewind(filehandle)
-
-      candidates
+    def validate_options!(options)
+      keys = options.keys
+      errors = []
+      errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
+      errors << "invalid col_sep" if keys.include?(:col_sep) && !option_valid?(options[:col_sep])
+      errors << "invalid quote_char" if keys.include?(:quote_char) && !option_valid?(options[:quote_char])
+      raise SmarterCSV::ValidationError, errors.inspect if errors.any?
     end
 
-    def
-
-
-
-        line = readline_with_counts(filehandle, options.slice(:row_sep))
-        delimiters.each do |d|
-          candidates[d] += line.scan(d).count
-        end
-      rescue EOFError # short files
-        break
-      end
-
-      rewind(filehandle)
-
-      candidates
+    def option_valid?(str)
+      return true if str.is_a?(Symbol) && str == :auto
+      return true if str.is_a?(String) && !str.empty?
+      false
     end
   end
 end
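The option validation added in this hunk accepts `:auto` or any non-empty string for `row_sep`, `col_sep`, and `quote_char`, and collects every failure before raising. A standalone sketch of the same rule follows. `validate_separators!`, `separator_valid?`, and `ValidationErrorSketch` are hypothetical names chosen so the snippet runs without the gem; the gem's own methods are the ones shown in the diff.

```ruby
# Stand-in for SmarterCSV::ValidationError so the sketch runs without the gem.
class ValidationErrorSketch < StandardError; end

# A separator option is valid when it is :auto or a non-empty String.
def separator_valid?(value)
  value == :auto || (value.is_a?(String) && !value.empty?)
end

# Collects one message per invalid option, then raises once with all of them.
def validate_separators!(options)
  errors = %i[row_sep col_sep quote_char]
           .select { |key| options.key?(key) && !separator_valid?(options[key]) }
           .map { |key| "invalid #{key}" }
  raise ValidationErrorSketch, errors.inspect if errors.any?
end

validate_separators!({ col_sep: :auto, row_sep: "\n", quote_char: '"' }) # passes
begin
  validate_separators!({ col_sep: "", quote_char: nil })
rescue ValidationErrorSketch => e
  puts e.message
end
```

Only keys that are present are checked, so an option left out entirely is not reported; a key explicitly set to `nil` or `""` is.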
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.8.
+  version: 1.8.2
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-03-
+date: 2023-03-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print
@@ -112,6 +112,7 @@ files:
 - LICENSE.txt
 - README.md
 - Rakefile
+- TO_DO_v2.md
 - ext/smarter_csv/extconf.rb
 - ext/smarter_csv/smarter_csv.c
 - lib/extensions/hash.rb