smarter_csv 1.7.4 → 1.8.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 3c20e6e8f4281f99aa22e67c778e1e950ef7040a48f454c7d9751f2ccf44c093
4
- data.tar.gz: 14ec931e58fce24c675bd76ad39f6046c52a0a4b07cc89814fa55c6a981ebe27
3
+ metadata.gz: a7aa350efc77f90c6986a7573e733b5d9d02930c94465f17d2227b346263a6ce
4
+ data.tar.gz: 42351edf3e618b8c025f266796897aa0c3572d77e42788a05b1ee37ce8bdeed2
5
5
  SHA512:
6
- metadata.gz: cf3d642f523bf49d0867bc1768a6df247f3392390090c2b0fbfda5ac75f5f8f829eaac2ec14105936e4d317c7a1d1b865d74de1e60ce6405c6fd1f868bd703eb
7
- data.tar.gz: 70522e31ca2ced36beef2a38509d1df01bad12af5bb56cb1404cfcbceffdfb4cbbbf8a8a8a535e3ea060d2d4c5b4c1049c8486d163d7de6b466dd83838aa2cf0
6
+ metadata.gz: 8bd9d59d7260a8e90ce472917801b98d088e37de5b1e912914f820f2efbbeb0491f5056d47575debdf1bccb8b9b8670cd089647efa15ec93b02413747dcfe702
7
+ data.tar.gz: 861364c6213af99c11cd3b9a59b2cf46f8c8e850ee2273e4f1b790714c9cd0ca66a734d64233737e086669c2b6aa51415f1343c3d61811547ec3c715d7a1620c
data/CHANGELOG.md CHANGED
@@ -1,7 +1,16 @@
1
1
 
2
2
  # SmarterCSV 1.x Change Log
3
3
 
4
- ## 1.7.4 (2022-01-13)
4
+ ## 1.8.1 (2023-03-19)
5
+ * added validation against invalid values for :col_sep, :row_sep, :quote_char (issue #216)
6
+ * deprecating `required_headers` and replace with `required_keys` (issue #140)
7
+ * fixed issue with require statement
8
+
9
+ ## 1.8.0 (2023-03-18)
10
+ * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
11
+ * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
12
+
13
+ ## 1.7.4 (2023-01-13)
5
14
  * improved guessing of the column separator, thanks to Alessandro Fazzi
6
15
 
7
16
  ## 1.7.3 (2022-12-05)
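The 1.8.1 entry above mentions validation of `:col_sep`, `:row_sep`, and `:quote_char`. The following is a minimal standalone sketch of that kind of check, modelled on the `validate_options!` and `option_valid?` methods that appear later in this diff. The module name `OptionCheckSketch` and the constant `SEPARATOR_OPTIONS` are hypothetical and exist only so the snippet runs without the gem installed; it is not the gem's own code.

```ruby
# Hypothetical module: mirrors the idea of the validation added in 1.8.1,
# but is self-contained and not part of smarter_csv.
module OptionCheckSketch
  class ValidationError < StandardError; end

  # The three options the changelog entry names (assumed order for messages).
  SEPARATOR_OPTIONS = %i[row_sep col_sep quote_char].freeze

  # A value is accepted if it is the symbol :auto or a non-empty String.
  def self.option_valid?(value)
    return true if value == :auto

    value.is_a?(String) && !value.empty?
  end

  # Collects one message per invalid option, then raises a single error.
  # Options that are not present in the hash are not checked.
  def self.validate_options!(options)
    errors = SEPARATOR_OPTIONS
             .select { |key| options.key?(key) && !option_valid?(options[key]) }
             .map { |key| "invalid #{key}" }
    raise ValidationError, errors.inspect if errors.any?
  end
end

# Usage: an empty separator is rejected before any parsing would start.
begin
  OptionCheckSketch.validate_options!(col_sep: "", row_sep: :auto)
rescue OptionCheckSketch::ValidationError => e
  puts "rejected: #{e.message}"
end
```

Collecting all messages before raising means a caller sees every invalid option at once rather than fixing them one at a time.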
data/README.md CHANGED
@@ -55,11 +55,30 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
55
55
  * calling `process` with or without a block
56
56
  * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
57
57
 
58
- Tip: If you are uncertain about what line endings a CSV-file uses, try specifying `:row_sep => :auto` as part of the options.
59
- But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
60
- To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
61
- You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
58
+ By default (since version 1.8.0), detection of the column and row separators is set to automatic `row_sep: :auto`, `col_sep: :auto`. This should make it easier to process any CSV files without having to examine the line endings or column separators.
62
59
 
60
+ You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters; nil or 0 will check the whole file).
61
+ You can also set the `:row_sep` manually! Check out Example 4 for unusual `:row_sep` and `:col_sep`.
62
+
63
+ ### Troubleshooting
64
+
65
+ In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection a tool like `hexdump` can help find otherwise hidden control characters or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
66
+
67
+ ```
68
+ $ hexdump -C spec/fixtures/bom_test_feff.csv
69
+ 00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
70
+ 00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
71
+ 00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
72
+ 00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
73
+ 00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
74
+ ```
75
+
76
+ ### Examples
77
+
78
+ Here are some examples to demonstrate the versatility of SmarterCSV.
79
+
80
+ By default SmarterCSV determines the `row_sep` and `col_sep` values automatically.
81
+ In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
63
82
 
64
83
  #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
65
84
  Please note how each hash contains only the keys for columns with non-null values.
@@ -222,10 +241,10 @@ The options and the block are optional.
222
241
  | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
223
242
  | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
224
243
  ---------------------------------------------------------------------------------------------------------------------------------
225
- | :col_sep | ',' | column separator, can be set to :auto |
244
+ | :col_sep | :auto | column separator (default was ',') |
226
245
  | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
227
246
  | | | e.g. when :quote_char is not properly escaped |
228
- | :row_sep | $/ ,"\n" | row separator or record separator , defaults to system's $/ , which defaults to "\n" |
247
+ | :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
229
248
  | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
230
249
  | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
231
250
  | :quote_char | '"' | quotation character |
@@ -254,7 +273,8 @@ And header and data validations will also be supported in 2.x
254
273
  ---------------------------------------------------------------------------------------------------------------------------------
255
274
  | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
256
275
  | :silence_missing_key | false | ignore missing keys in `key_mapping` if true |
257
- | :required_headers | nil | An array. Each of the given headers must be present after header manipulation, |
276
+ | :required_keys | nil | An array. Specify the required names AFTER header transformation. |
277
+ | :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
258
278
  | | | or an exception is raised No validation if nil is given. |
259
279
  | :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
260
280
  | :downcase_header | true | downcase all column headers |
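The options table above introduces `:required_keys` and marks `:required_headers` as deprecated. The sketch below illustrates how such a rename can be bridged, following the logic visible in the `smarter_csv.rb` hunk of this diff: the old option is copied into the new one only when the new one is unset. The function name `missing_required_keys` is hypothetical, it returns the missing keys instead of raising, and it writes the deprecation notice with `warn` rather than `puts`; all three are assumptions made to keep the example self-contained.

```ruby
# Hypothetical helper, not part of smarter_csv: returns the required keys
# that are absent from the (already transformed) header keys.
def missing_required_keys(header_keys, options = {})
  options = options.dup # avoid mutating the caller's hash

  # Bridge the deprecated option: it only takes effect when the
  # replacement option has not been given.
  unless options[:required_headers].nil?
    warn "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
    if options[:required_keys].nil?
      options[:required_keys] = options[:required_headers]
      options[:required_headers] = nil
    end
  end

  required = options[:required_keys]
  return [] unless required.is_a?(Array)

  required.reject { |key| header_keys.include?(key) }
end

# Usage: keys are compared after header transformation, so symbols are used.
p missing_required_keys(%i[first_name last_name], required_keys: %i[first_name email])
```

When both options are supplied, `:required_keys` wins and the deprecated value is ignored, which matches the precedence shown in the diff.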
@@ -27,7 +27,6 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
27
27
  long col_sep_len = RSTRING_LEN(col_sep);
28
28
 
29
29
  char *quoteP = RSTRING_PTR(quote_char);
30
- long quote_len = RSTRING_LEN(quote_char);
31
30
  long quote_count = 0;
32
31
 
33
32
  bool col_sep_found = true;
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SmarterCSV
4
- VERSION = "1.7.4"
4
+ VERSION = "1.8.1"
5
5
  end
data/lib/smarter_csv.rb CHANGED
@@ -4,23 +4,24 @@ require_relative "extensions/hash"
4
4
  require_relative "smarter_csv/version"
5
5
 
6
6
  require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
7
- # require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
7
+ # require 'smarter_csv.bundle' unless ENV['CI'] # local testing
8
8
 
9
9
  module SmarterCSV
10
10
  class SmarterCSVException < StandardError; end
11
11
  class HeaderSizeMismatch < SmarterCSVException; end
12
12
  class IncorrectOption < SmarterCSVException; end
13
+ class ValidationError < SmarterCSVException; end
13
14
  class DuplicateHeaders < SmarterCSVException; end
14
15
  class MissingHeaders < SmarterCSVException; end
15
16
  class NoColSepDetected < SmarterCSVException; end
16
- class KeyMappingError < SmarterCSVException; end
17
- class MalformedCSVError < SmarterCSVException; end
17
+ class KeyMappingError < SmarterCSVException; end # CURRENTLY UNUSED -> version 1.9.0
18
18
 
19
19
  # first parameter: filename or input object which responds to readline method
20
20
  def SmarterCSV.process(input, options = {}, &block)
21
21
  options = default_options.merge(options)
22
22
  options[:invalid_byte_sequence] = '' if options[:invalid_byte_sequence].nil?
23
23
  puts "SmarterCSV OPTIONS: #{options.inspect}" if options[:verbose]
24
+ validate_options!(options)
24
25
 
25
26
  headerA = []
26
27
  result = []
@@ -39,11 +40,7 @@ module SmarterCSV
39
40
  puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
40
41
  end
41
42
 
42
- if options[:skip_lines].to_i > 0
43
- options[:skip_lines].to_i.times do
44
- readline_with_counts(fh, options)
45
- end
46
- end
43
+ skip_lines(fh, options)
47
44
 
48
45
  headerA, header_size = process_headers(fh, options)
49
46
 
@@ -207,7 +204,7 @@ module SmarterCSV
207
204
  acceleration: true,
208
205
  auto_row_sep_chars: 500,
209
206
  chunk_size: nil,
210
- col_sep: ',',
207
+ col_sep: :auto, # was: ',',
211
208
  comment_regexp: nil, # was: /\A#/,
212
209
  convert_values_to_numeric: true,
213
210
  downcase_header: true,
@@ -218,7 +215,7 @@ module SmarterCSV
218
215
  headers_in_file: true,
219
216
  invalid_byte_sequence: '',
220
217
  keep_original_headers: false,
221
- key_mapping_hash: nil,
218
+ key_mapping: nil,
222
219
  quote_char: '"',
223
220
  remove_empty_hashes: true,
224
221
  remove_empty_values: true,
@@ -226,7 +223,8 @@ module SmarterCSV
226
223
  remove_values_matching: nil,
227
224
  remove_zero_values: false,
228
225
  required_headers: nil,
229
- row_sep: $/,
226
+ required_keys: nil,
227
+ row_sep: :auto, # was: $/,
230
228
  silence_missing_keys: false,
231
229
  skip_lines: nil,
232
230
  strings_as_keys: false,
@@ -243,9 +241,24 @@ module SmarterCSV
243
241
  line = filehandle.readline(options[:row_sep])
244
242
  @file_line_count += 1
245
243
  @csv_line_count += 1
244
+ line = remove_bom(line) if @csv_line_count == 1
246
245
  line
247
246
  end
248
247
 
248
+ def skip_lines(filehandle, options)
249
+ return unless options[:skip_lines].to_i > 0
250
+
251
+ options[:skip_lines].to_i.times do
252
+ readline_with_counts(filehandle, options)
253
+ end
254
+ end
255
+
256
+ def rewind(filehandle)
257
+ @file_line_count = 0
258
+ @csv_line_count = 0
259
+ filehandle.rewind
260
+ end
261
+
249
262
  ###
250
263
  ### Thin wrapper around C-extension
251
264
  ###
@@ -378,6 +391,8 @@ module SmarterCSV
378
391
  # Otherwise guesses column separator from contents.
379
392
  # Raises exception if none is found.
380
393
  def guess_column_separator(filehandle, options)
394
+ skip_lines(filehandle, options)
395
+
381
396
  possible_delimiters = [',', "\t", ';', ':', '|']
382
397
 
383
398
  candidates = if options.fetch(:headers_in_file)
@@ -417,7 +432,7 @@ module SmarterCSV
417
432
  lines += 1
418
433
  break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
419
434
  end
420
- filehandle.rewind
435
+ rewind(filehandle)
421
436
 
422
437
  counts["\r"] += 1 if last_char == "\r"
423
438
  # find the most frequent key/value pair:
@@ -497,12 +512,21 @@ module SmarterCSV
497
512
  raise SmarterCSV::DuplicateHeaders, "ERROR: duplicate headers: #{duplicate_headers.join(',')}"
498
513
  end
499
514
 
500
- if options[:required_headers] && options[:required_headers].is_a?(Array)
501
- missing_headers = []
502
- options[:required_headers].each do |k|
503
- missing_headers << k unless headerA.include?(k)
515
+ # deprecate required_headers
516
+ if !options[:required_headers].nil?
517
+ puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required headers'"
518
+ if options[:required_keys].nil?
519
+ options[:required_keys] = options[:required_headers]
520
+ options[:required_headers] = nil
521
+ end
522
+ end
523
+
524
+ if options[:required_keys] && options[:required_keys].is_a?(Array)
525
+ missing_keys = []
526
+ options[:required_keys].each do |k|
527
+ missing_keys << k unless headerA.include?(k)
504
528
  end
505
- raise SmarterCSV::MissingHeaders, "ERROR: missing headers: #{missing_headers.join(',')}" unless missing_headers.empty?
529
+ raise SmarterCSV::MissingHeaders, "ERROR: missing attributes: #{missing_keys.join(',')}" unless missing_keys.empty?
506
530
  end
507
531
 
508
532
  @headers = headerA
@@ -525,15 +549,49 @@ module SmarterCSV
525
549
 
526
550
  private
527
551
 
552
+ UTF_32_BOM = %w[0 0 fe ff].freeze
553
+ UTF_32LE_BOM = %w[ff fe 0 0].freeze
554
+ UTF_8_BOM = %w[ef bb bf].freeze
555
+ UTF_16_BOM = %w[fe ff].freeze
556
+ UTF_16LE_BOM = %w[ff fe].freeze
557
+
558
+ def remove_bom(str)
559
+ str_as_hex = str.bytes.map{|x| x.to_s(16)}
560
+ # if string does not start with one of the bytes, there is no BOM
561
+ return str unless %w[ef fe ff 0].include?(str_as_hex[0])
562
+
563
+ return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
564
+ return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
565
+ return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
566
+
567
+ puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
568
+ str
569
+ end
570
+
571
+ def validate_options!(options)
572
+ keys = options.keys
573
+ errors = []
574
+ errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
575
+ errors << "invalid col_sep" if keys.include?(:col_sep) && !option_valid?(options[:col_sep])
576
+ errors << "invalid quote_char" if keys.include?(:quote_char) && !option_valid?(options[:quote_char])
577
+ raise SmarterCSV::ValidationError, errors.inspect if errors.any?
578
+ end
579
+
580
+ def option_valid?(str)
581
+ return true if str.is_a?(Symbol) && str == :auto
582
+ return true if str.is_a?(String) && !str.empty?
583
+ false
584
+ end
585
+
528
586
  def candidated_column_separators_from_headers(filehandle, options, delimiters)
529
587
  candidates = Hash.new(0)
530
- line = filehandle.readline(options[:row_sep])
588
+ line = readline_with_counts(filehandle, options.slice(:row_sep))
531
589
 
532
590
  delimiters.each do |d|
533
591
  candidates[d] += line.scan(d).count
534
592
  end
535
593
 
536
- filehandle.rewind
594
+ rewind(filehandle)
537
595
 
538
596
  candidates
539
597
  end
@@ -542,7 +600,7 @@ module SmarterCSV
542
600
  candidates = Hash.new(0)
543
601
 
544
602
  5.times do
545
- line = filehandle.readline(options[:row_sep])
603
+ line = readline_with_counts(filehandle, options.slice(:row_sep))
546
604
  delimiters.each do |d|
547
605
  candidates[d] += line.scan(d).count
548
606
  end
@@ -550,7 +608,7 @@ module SmarterCSV
550
608
  break
551
609
  end
552
610
 
553
- filehandle.rewind
611
+ rewind(filehandle)
554
612
 
555
613
  candidates
556
614
  end
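The `remove_bom` method added in this file strips a byte order mark from the first line read. As a rough illustration of the same idea, here is a standalone sketch that compares raw byte prefixes instead of hex strings. The module name `BomSketch` and the `BOMS` constant are hypothetical, and unlike the method in the diff this version does not print a message when the first byte looks like a BOM but no known pattern matches.

```ruby
# Hypothetical module, not the gem's implementation: strips a leading
# byte order mark by comparing binary prefixes.
module BomSketch
  # Longer marks come first so that UTF-32 LE (ff fe 00 00) is not
  # mistaken for UTF-16 LE (ff fe).
  BOMS = [
    "\x00\x00\xFE\xFF".b, # UTF-32, big-endian
    "\xFF\xFE\x00\x00".b, # UTF-32, little-endian
    "\xEF\xBB\xBF".b,     # UTF-8
    "\xFE\xFF".b,         # UTF-16, big-endian
    "\xFF\xFE".b          # UTF-16, little-endian
  ].freeze

  # Returns the string without its BOM, or the string unchanged if none is found.
  def self.remove_bom(str)
    bytes = str.b # binary copy, so the prefix check is byte-wise
    bom = BOMS.find { |candidate| bytes.start_with?(candidate) }
    bom ? str.byteslice(bom.bytesize..-1) : str
  end
end

# Usage: the first bytes of the hexdump shown in the README hunk (fe ff ...).
line = "\xFE\xFFsome_id,type,fuzzboxes\r\n".b
p BomSketch.remove_bom(line)
```

Slicing with `byteslice` rather than character indexing keeps the operation safe even when the input's declared encoding does not match its actual bytes.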
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: smarter_csv
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.7.4
4
+ version: 1.8.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tilo Sloboda
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2023-01-14 00:00:00.000000000 Z
11
+ date: 2023-03-19 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: awesome_print