smarter_csv 1.7.4 → 1.8.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 3c20e6e8f4281f99aa22e67c778e1e950ef7040a48f454c7d9751f2ccf44c093
-   data.tar.gz: 14ec931e58fce24c675bd76ad39f6046c52a0a4b07cc89814fa55c6a981ebe27
+   metadata.gz: a7aa350efc77f90c6986a7573e733b5d9d02930c94465f17d2227b346263a6ce
+   data.tar.gz: 42351edf3e618b8c025f266796897aa0c3572d77e42788a05b1ee37ce8bdeed2
  SHA512:
-   metadata.gz: cf3d642f523bf49d0867bc1768a6df247f3392390090c2b0fbfda5ac75f5f8f829eaac2ec14105936e4d317c7a1d1b865d74de1e60ce6405c6fd1f868bd703eb
-   data.tar.gz: 70522e31ca2ced36beef2a38509d1df01bad12af5bb56cb1404cfcbceffdfb4cbbbf8a8a8a535e3ea060d2d4c5b4c1049c8486d163d7de6b466dd83838aa2cf0
+   metadata.gz: 8bd9d59d7260a8e90ce472917801b98d088e37de5b1e912914f820f2efbbeb0491f5056d47575debdf1bccb8b9b8670cd089647efa15ec93b02413747dcfe702
+   data.tar.gz: 861364c6213af99c11cd3b9a59b2cf46f8c8e850ee2273e4f1b790714c9cd0ca66a734d64233737e086669c2b6aa51415f1343c3d61811547ec3c715d7a1620c
data/CHANGELOG.md CHANGED
@@ -1,7 +1,16 @@

  # SmarterCSV 1.x Change Log

- ## 1.7.4 (2022-01-13)
+ ## 1.8.1 (2023-03-19)
+ * added validation against invalid values for :col_sep, :row_sep, :quote_char (issue #216)
+ * deprecating `required_headers` and replace with `required_keys` (issue #140)
+ * fixed issue with require statement
+
+ ## 1.8.0 (2023-03-18)
+ * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+ * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
+
+ ## 1.7.4 (2023-01-13)
  * improved guessing of the column separator, thanks to Alessandro Fazzi

  ## 1.7.3 (2022-12-05)
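
Note on the 1.8.0 defaults: the `lib/smarter_csv.rb` diff further down shows `SmarterCSV.process` building its options with `default_options.merge(options)`, so values passed by the caller should still take precedence over the new `:auto` defaults. The following is a minimal standalone sketch of that merge behaviour. It does not load the gem; `NEW_DEFAULTS` and `effective_options` are hypothetical names used only for illustration.

```ruby
# Standalone sketch: caller-supplied options override the 1.8.0 defaults.
# NEW_DEFAULTS and effective_options are illustrative names, not gem API.
NEW_DEFAULTS = { col_sep: :auto, row_sep: :auto, quote_char: '"' }.freeze

def effective_options(user_options = {})
  # Hash#merge keeps the right-hand value when a key exists on both sides.
  NEW_DEFAULTS.merge(user_options)
end

auto   = effective_options
pinned = effective_options(col_sep: ',', row_sep: "\n")

puts auto.inspect
puts pinned.inspect
```

Assuming the gem behaves as the diff suggests, a caller who relied on the pre-1.8.0 defaults could pin them explicitly, for example `SmarterCSV.process(path, col_sep: ',', row_sep: $/)`.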
data/README.md CHANGED
@@ -55,11 +55,30 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
  * calling `process` with or without a block
  * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.

- Tip: If you are uncertain about what line endings a CSV-file uses, try specifying `:row_sep => :auto` as part of the options.
- But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
- To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
- You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
+ By default (since version 1.8.0), detection of the column and row separators is set to automatic `row_sep: :auto`, `col_sep: :auto`. This should make it easier to process any CSV files without having to examine the line endings or column separators.

+ You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters); nil or 0 will check the whole file).
+ You can also set the `:row_sep` manually! Checkout Example 4 for unusual `:row_sep` and `:col_sep`.
+
+ ### Troubleshooting
+
+ In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection a tool like `hexdump` can help find otherwise hidden control character or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
+
+ ```
+ $ hexdump -C spec/fixtures/bom_test_feff.csv
+ 00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
+ 00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
+ 00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
+ 00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
+ 00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
+ ```
+
+ ### Examples
+
+ Here are some examples to demonstrate the versatility of SmarterCSV.
+
+ By default SmarterCSV determines the `row_sep` and `col_sep` values automatically.
+ In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.

  #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
  Please note how each hash contains only the keys for columns with non-null values.
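
Note on the new Troubleshooting section: where `hexdump` is not available, the leading bytes of a file can also be inspected from Ruby itself. The sketch below uses only the standard library; `leading_bytes_hex` is a hypothetical helper name, and the demo file is a temporary stand-in for the fixture shown in the hexdump output, not the fixture itself.

```ruby
require 'tempfile'

# Returns the first `count` bytes of the file at `path` as two-digit hex strings.
# Hypothetical helper for illustration; not part of SmarterCSV.
def leading_bytes_hex(path, count = 4)
  File.binread(path, count).to_s.bytes.map { |b| format('%02x', b) }
end

# Demo file: a UTF-16BE BOM (fe ff) followed by ASCII header text.
file = Tempfile.new(['bom_demo', '.csv'])
file.binmode
file.write("\xFE\xFFsome_id,type".b)
file.close

hex = leading_bytes_hex(file.path)
puts hex.join(' ')
file.unlink
```

A leading `fe ff`, `ff fe`, or `ef bb bf` in the output indicates a byte order mark at the start of the file.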
@@ -222,10 +241,10 @@ The options and the block are optional.
  | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
  | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
  ---------------------------------------------------------------------------------------------------------------------------------
- | :col_sep | ',' | column separator, can be set to :auto |
+ | :col_sep | :auto | column separator (default was ',') |
  | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
  | | | e.g. when :quote_char is not properly escaped |
- | :row_sep | $/ ,"\n" | row separator or record separator , defaults to system's $/ , which defaults to "\n" |
+ | :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
  | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
  | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
  | :quote_char | '"' | quotation character |
@@ -254,7 +273,8 @@ And header and data validations will also be supported in 2.x
  ---------------------------------------------------------------------------------------------------------------------------------
  | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
  | :silence_missing_key | false | ignore missing keys in `key_mapping` if true |
- | :required_headers | nil | An array. Each of the given headers must be present after header manipulation, |
+ | :required_keys | nil | An array. Specify the required names AFTER header transformation. |
+ | :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
  | | | or an exception is raised No validation if nil is given. |
  | :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
  | :downcase_header | true | downcase all column headers |
data/ext/smarter_csv/smarter_csv.c CHANGED
@@ -27,7 +27,6 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
  long col_sep_len = RSTRING_LEN(col_sep);

  char *quoteP = RSTRING_PTR(quote_char);
- long quote_len = RSTRING_LEN(quote_char);
  long quote_count = 0;

  bool col_sep_found = true;
data/lib/smarter_csv/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module SmarterCSV
-   VERSION = "1.7.4"
+   VERSION = "1.8.1"
  end
data/lib/smarter_csv.rb CHANGED
@@ -4,23 +4,24 @@ require_relative "extensions/hash"
  require_relative "smarter_csv/version"

  require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
- # require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
+ # require 'smarter_csv.bundle' unless ENV['CI'] # local testing

  module SmarterCSV
    class SmarterCSVException < StandardError; end
    class HeaderSizeMismatch < SmarterCSVException; end
    class IncorrectOption < SmarterCSVException; end
+   class ValidationError < SmarterCSVException; end
    class DuplicateHeaders < SmarterCSVException; end
    class MissingHeaders < SmarterCSVException; end
    class NoColSepDetected < SmarterCSVException; end
-   class KeyMappingError < SmarterCSVException; end
-   class MalformedCSVError < SmarterCSVException; end
+   class KeyMappingError < SmarterCSVException; end # CURRENTLY UNUSED -> version 1.9.0

    # first parameter: filename or input object which responds to readline method
    def SmarterCSV.process(input, options = {}, &block)
      options = default_options.merge(options)
      options[:invalid_byte_sequence] = '' if options[:invalid_byte_sequence].nil?
      puts "SmarterCSV OPTIONS: #{options.inspect}" if options[:verbose]
+     validate_options!(options)

      headerA = []
      result = []
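
Note on the new `validate_options!` call: judging by the `validate_options!` and `option_valid?` methods added later in this file, `:row_sep`, `:col_sep` and `:quote_char` must each be either `:auto` or a non-empty String, and all violations are reported in one `ValidationError`. The sketch below re-implements that rule in isolation so it can be run without the gem; the names `ValidationError`, `separator_valid?` and `validate_separators!` are local stand-ins, not the gem's own constants.

```ruby
# Standalone re-implementation of the validation rule for illustration only.
class ValidationError < StandardError; end

CHECKED_KEYS = %i[row_sep col_sep quote_char].freeze

# A value is accepted if it is :auto or a non-empty String.
def separator_valid?(value)
  value == :auto || (value.is_a?(String) && !value.empty?)
end

# Collects every invalid option first, then raises once with all of them.
def validate_separators!(options)
  errors = CHECKED_KEYS
           .select { |key| options.key?(key) && !separator_valid?(options[key]) }
           .map { |key| "invalid #{key}" }
  raise ValidationError, errors.inspect if errors.any?
end

validate_separators!(col_sep: :auto, row_sep: "\n", quote_char: '"') # passes

begin
  validate_separators!(col_sep: '', row_sep: nil)
rescue ValidationError => e
  puts e.message
end
```

Options that are absent from the hash are not checked, which matches the `keys.include?` guards in the diff.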
@@ -39,11 +40,7 @@ module SmarterCSV
    puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
  end

- if options[:skip_lines].to_i > 0
-   options[:skip_lines].to_i.times do
-     readline_with_counts(fh, options)
-   end
- end
+ skip_lines(fh, options)

  headerA, header_size = process_headers(fh, options)

@@ -207,7 +204,7 @@ module SmarterCSV
  acceleration: true,
  auto_row_sep_chars: 500,
  chunk_size: nil,
- col_sep: ',',
+ col_sep: :auto, # was: ',',
  comment_regexp: nil, # was: /\A#/,
  convert_values_to_numeric: true,
  downcase_header: true,
@@ -218,7 +215,7 @@ module SmarterCSV
  headers_in_file: true,
  invalid_byte_sequence: '',
  keep_original_headers: false,
- key_mapping_hash: nil,
+ key_mapping: nil,
  quote_char: '"',
  remove_empty_hashes: true,
  remove_empty_values: true,
@@ -226,7 +223,8 @@ module SmarterCSV
  remove_values_matching: nil,
  remove_zero_values: false,
  required_headers: nil,
- row_sep: $/,
+ required_keys: nil,
+ row_sep: :auto, # was: $/,
  silence_missing_keys: false,
  skip_lines: nil,
  strings_as_keys: false,
@@ -243,9 +241,24 @@ module SmarterCSV
    line = filehandle.readline(options[:row_sep])
    @file_line_count += 1
    @csv_line_count += 1
+   line = remove_bom(line) if @csv_line_count == 1
    line
  end

+ def skip_lines(filehandle, options)
+   return unless options[:skip_lines].to_i > 0
+
+   options[:skip_lines].to_i.times do
+     readline_with_counts(filehandle, options)
+   end
+ end
+
+ def rewind(filehandle)
+   @file_line_count = 0
+   @csv_line_count = 0
+   filehandle.rewind
+ end
+
  ###
  ### Thin wrapper around C-extension
  ###
@@ -378,6 +391,8 @@ module SmarterCSV
  # Otherwise guesses column separator from contents.
  # Raises exception if none is found.
  def guess_column_separator(filehandle, options)
+   skip_lines(filehandle, options)
+
    possible_delimiters = [',', "\t", ';', ':', '|']

    candidates = if options.fetch(:headers_in_file)
@@ -417,7 +432,7 @@ module SmarterCSV
    lines += 1
    break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
  end
- filehandle.rewind
+ rewind(filehandle)

  counts["\r"] += 1 if last_char == "\r"
  # find the most frequent key/value pair:
@@ -497,12 +512,21 @@ module SmarterCSV
    raise SmarterCSV::DuplicateHeaders, "ERROR: duplicate headers: #{duplicate_headers.join(',')}"
  end

- if options[:required_headers] && options[:required_headers].is_a?(Array)
-   missing_headers = []
-   options[:required_headers].each do |k|
-     missing_headers << k unless headerA.include?(k)
+ # deprecate required_headers
+ if !options[:required_headers].nil?
+   puts "DEPRECATION WARNING: please use 'required_keys' instead of 'required headers'"
+   if options[:required_keys].nil?
+     options[:required_keys] = options[:required_headers]
+     options[:required_headers] = nil
+   end
+ end
+
+ if options[:required_keys] && options[:required_keys].is_a?(Array)
+   missing_keys = []
+   options[:required_keys].each do |k|
+     missing_keys << k unless headerA.include?(k)
    end
-   raise SmarterCSV::MissingHeaders, "ERROR: missing headers: #{missing_headers.join(',')}" unless missing_headers.empty?
+   raise SmarterCSV::MissingHeaders, "ERROR: missing attributes: #{missing_keys.join(',')}" unless missing_keys.empty?
  end

  @headers = headerA
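
Note on the `required_headers` deprecation: as written in this hunk, a non-nil `:required_headers` always prints the warning, but its value is only copied over when `:required_keys` is nil, so `:required_keys` wins if both are given. The sketch below isolates that fallback so the precedence can be checked without the gem. `missing_required_keys` is a hypothetical helper; unlike the gem, it returns the missing keys instead of raising `MissingHeaders`, and it writes the warning to stderr.

```ruby
# Standalone sketch of the fallback and the required-key check.
# missing_required_keys is an illustrative name, not part of SmarterCSV.
def missing_required_keys(headers, options)
  opts = options.dup # leave the caller's hash untouched
  unless opts[:required_headers].nil?
    warn "DEPRECATION WARNING: please use 'required_keys' instead of 'required_headers'"
    # Only fall back to the deprecated option when required_keys is not set.
    opts[:required_keys] = opts[:required_headers] if opts[:required_keys].nil?
  end

  required = opts[:required_keys]
  return [] unless required.is_a?(Array)

  required.reject { |key| headers.include?(key) }
end

headers = %i[first_name last_name]

via_old = missing_required_keys(headers, required_headers: %i[first_name email])
via_new = missing_required_keys(headers, required_keys: %i[first_name last_name])
both    = missing_required_keys(headers, required_headers: %i[email], required_keys: %i[first_name])

puts via_old.inspect
puts via_new.inspect
puts both.inspect
```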
@@ -525,15 +549,49 @@ module SmarterCSV

  private

+ UTF_32_BOM = %w[0 0 fe ff].freeze
+ UTF_32LE_BOM = %w[ff fe 0 0].freeze
+ UTF_8_BOM = %w[ef bb bf].freeze
+ UTF_16_BOM = %w[fe ff].freeze
+ UTF_16LE_BOM = %w[ff fe].freeze
+
+ def remove_bom(str)
+   str_as_hex = str.bytes.map{|x| x.to_s(16)}
+   # if string does not start with one of the bytes, there is no BOM
+   return str unless %w[ef fe ff 0].include?(str_as_hex[0])
+
+   return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
+   return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
+   return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
+
+   puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
+   str
+ end
+
+ def validate_options!(options)
+   keys = options.keys
+   errors = []
+   errors << "invalid row_sep" if keys.include?(:row_sep) && !option_valid?(options[:row_sep])
+   errors << "invalid col_sep" if keys.include?(:col_sep) && !option_valid?(options[:col_sep])
+   errors << "invalid quote_char" if keys.include?(:quote_char) && !option_valid?(options[:quote_char])
+   raise SmarterCSV::ValidationError, errors.inspect if errors.any?
+ end
+
+ def option_valid?(str)
+   return true if str.is_a?(Symbol) && str == :auto
+   return true if str.is_a?(String) && !str.empty?
+   false
+ end
+
  def candidated_column_separators_from_headers(filehandle, options, delimiters)
    candidates = Hash.new(0)
-   line = filehandle.readline(options[:row_sep])
+   line = readline_with_counts(filehandle, options.slice(:row_sep))

    delimiters.each do |d|
      candidates[d] += line.scan(d).count
    end

-   filehandle.rewind
+   rewind(filehandle)

    candidates
  end
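
Note on `remove_bom`: the method above compares hex strings of the leading bytes and checks the 4-byte UTF-32 marks before the 2-byte UTF-16 marks, which matters because `ff fe 00 00` begins with `ff fe`. The sketch below shows the same idea with a different technique, a direct byte-prefix comparison instead of hex strings. It is not the gem's implementation; `BOMS` and `strip_bom` are hypothetical names.

```ruby
# Alternative sketch: strip a leading BOM by comparing raw byte prefixes.
# Longest marks come first so UTF-32LE is not mistaken for UTF-16LE.
BOMS = [
  "\x00\x00\xFE\xFF".b, # UTF-32BE
  "\xFF\xFE\x00\x00".b, # UTF-32LE
  "\xEF\xBB\xBF".b,     # UTF-8
  "\xFE\xFF".b,         # UTF-16BE
  "\xFF\xFE".b          # UTF-16LE
].freeze

def strip_bom(str)
  bytes = str.b # binary copy, so the comparison is purely byte-wise
  bom = BOMS.find { |candidate| bytes.start_with?(candidate) }
  bom ? str.byteslice(bom.bytesize..-1) : str
end

puts strip_bom("\xEF\xBB\xBFsome_id,type\n").inspect
puts strip_bom("some_id,type\n").inspect
```

Only the mark itself is removed; the rest of the line keeps whatever encoding the input string carried.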
@@ -542,7 +600,7 @@ module SmarterCSV
  candidates = Hash.new(0)

  5.times do
-   line = filehandle.readline(options[:row_sep])
+   line = readline_with_counts(filehandle, options.slice(:row_sep))
    delimiters.each do |d|
      candidates[d] += line.scan(d).count
    end
@@ -550,7 +608,7 @@ module SmarterCSV
      break
    end

-   filehandle.rewind
+   rewind(filehandle)

    candidates
  end
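
Note on the separator-guessing hunks: both helpers count how often each candidate delimiter occurs in the first line(s) and then rewind the input so that normal processing starts from the top. The sketch below reproduces that counting step against a `StringIO` using only the standard library. `delimiter_counts` is a hypothetical name, and picking the most frequent delimiter at the end is an assumption made for the demo, not necessarily how the gem chooses among candidates.

```ruby
require 'stringio'

DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Counts delimiter occurrences in up to `max_lines` lines, then rewinds the IO.
def delimiter_counts(io, row_sep: "\n", max_lines: 5)
  counts = Hash.new(0)
  max_lines.times do
    break if io.eof? # short inputs: stop instead of raising EOFError

    line = io.readline(row_sep)
    DELIMITERS.each { |d| counts[d] += line.scan(d).count }
  end
  io.rewind
  counts
end

io = StringIO.new("id;name;city\n1;Ann;Oslo\n")
counts = delimiter_counts(io)
best = counts.max_by { |_delimiter, count| count }.first # demo-only selection rule

puts counts.inspect
puts best.inspect
```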
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: smarter_csv
  version: !ruby/object:Gem::Version
-   version: 1.7.4
+   version: 1.8.1
  platform: ruby
  authors:
  - Tilo Sloboda
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-01-14 00:00:00.000000000 Z
+ date: 2023-03-19 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: awesome_print