smarter_csv 1.7.4 → 1.8.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 3c20e6e8f4281f99aa22e67c778e1e950ef7040a48f454c7d9751f2ccf44c093
-   data.tar.gz: 14ec931e58fce24c675bd76ad39f6046c52a0a4b07cc89814fa55c6a981ebe27
+   metadata.gz: 55400b3977ce35c58d60c4101362b68d99f2dbf7cb6a63956ae3b6ab79fcf1ac
+   data.tar.gz: 41f46d3e4de69a7924ecd2214ba4e37766106469d1b8b257fd752a96204a47fd
  SHA512:
-   metadata.gz: cf3d642f523bf49d0867bc1768a6df247f3392390090c2b0fbfda5ac75f5f8f829eaac2ec14105936e4d317c7a1d1b865d74de1e60ce6405c6fd1f868bd703eb
-   data.tar.gz: 70522e31ca2ced36beef2a38509d1df01bad12af5bb56cb1404cfcbceffdfb4cbbbf8a8a8a535e3ea060d2d4c5b4c1049c8486d163d7de6b466dd83838aa2cf0
+   metadata.gz: 24ecc14cf9c65efe5c11e4bd20753420aa8ccd7385171cd21eac2e1be92c4896087cdc2a18799fa111c0f36154ad4481daed7f08b752f4fae2b5f27241b8cf6c
+   data.tar.gz: c1d70e18a7ae8057e58cbf73b62f4896dd7030bc5fd2e927669e5ea829f9a3c11daeb9c8b83296dbb46e6f0d23034245b7207882a77b54cc1ca128a581175359
data/CHANGELOG.md CHANGED
@@ -1,7 +1,11 @@
 
  # SmarterCSV 1.x Change Log
 
- ## 1.7.4 (2022-01-13)
+ ## 1.8.0 (2023-03-18)
+ * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+ * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
+
+ ## 1.7.4 (2023-01-13)
  * improved guessing of the column separator, thanks to Alessandro Fazzi
 
  ## 1.7.3 (2022-12-05)
data/README.md CHANGED
@@ -55,10 +55,23 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
  * calling `process` with or without a block
  * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
 
- Tip: If you are uncertain about what line endings a CSV-file uses, try specifying `:row_sep => :auto` as part of the options.
- But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
- To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
- You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
+ By default (since version 1.8.0), detection of the column and row separators is automatic (`row_sep: :auto`, `col_sep: :auto`). This should make it easier to process any CSV file without having to examine its line endings or column separators.
+
+ You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters; nil or 0 will check the whole file).
+ You can also set the `:row_sep` manually! Check out Example 4 for unusual `:row_sep` and `:col_sep`.
+
+ ### Troubleshooting
+
+ In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection, a tool like `hexdump` can help find otherwise hidden control characters or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
+
+ ```
+ $ hexdump -C spec/fixtures/bom_test_feff.csv
+ 00000000  fe ff 73 6f 6d 65 5f 69  64 2c 74 79 70 65 2c 66  |..some_id,type,f|
+ 00000010  75 7a 7a 62 6f 78 65 73  0d 0a 34 32 37 36 36 38  |uzzboxes..427668|
+ 00000020  30 35 2c 7a 69 7a 7a 6c  65 73 2c 31 32 33 34 0d  |05,zizzles,1234.|
+ 00000030  0a 33 38 37 35 39 31 35  30 2c 71 75 69 7a 7a 65  |.38759150,quizze|
+ 00000040  73 2c 35 36 37 38 0d 0a                           |s,5678..|
+ ```
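The same check that the `hexdump` output above makes by eye can be sketched in plain Ruby. This is only an illustration and not part of the diff: `BOMS`, `detect_bom`, and `strip_bom` are hypothetical names, and the sample string simply mirrors the first bytes of the fixture shown above. The 4-byte marks are listed first because the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.

```ruby
# Hypothetical helpers (not SmarterCSV API): identify and strip a leading BOM.
# Ruby hashes keep insertion order, so the 4-byte marks are tested before the
# 2-byte marks that share their prefix.
BOMS = {
  "UTF-32BE" => [0x00, 0x00, 0xFE, 0xFF],
  "UTF-32LE" => [0xFF, 0xFE, 0x00, 0x00],
  "UTF-8"    => [0xEF, 0xBB, 0xBF],
  "UTF-16BE" => [0xFE, 0xFF],
  "UTF-16LE" => [0xFF, 0xFE]
}.freeze

# Returns the name of the BOM at the start of +str+, or nil if there is none.
def detect_bom(str)
  head = str.byteslice(0, 4).to_s.bytes
  BOMS.each do |name, bom|
    return name if head.first(bom.size) == bom
  end
  nil
end

# Returns +str+ without its leading BOM bytes (unchanged if no BOM is found).
def strip_bom(str)
  name = detect_bom(str)
  return str unless name

  str.byteslice(BOMS[name].size..-1)
end

# Same leading bytes as the fixture in the hexdump: fe ff, then ASCII text.
sample = "\xFE\xFFsome_id,type,fuzzboxes\r\n".b

p detect_bom(sample) # "UTF-16BE"
p strip_bom(sample)  # "some_id,type,fuzzboxes\r\n"
```

Working on bytes rather than characters keeps the check independent of the encoding the file was opened with.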
 
 
  #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
@@ -222,10 +235,10 @@ The options and the block are optional.
  | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
  | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g. /\A#/ |
  ---------------------------------------------------------------------------------------------------------------------------------
- | :col_sep | ',' | column separator, can be set to :auto |
+ | :col_sep | :auto | column separator (default was ',') |
  | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
  | | | e.g. when :quote_char is not properly escaped |
- | :row_sep | $/ ,"\n" | row separator or record separator , defaults to system's $/ , which defaults to "\n" |
+ | :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
  | | | This can also be set to :auto, but will process the whole csv file first (slow!) |
  | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
  | :quote_char | '"' | quotation character |
data/lib/smarter_csv/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module SmarterCSV
-   VERSION = "1.7.4"
+   VERSION = "1.8.0"
  end
data/lib/smarter_csv.rb CHANGED
@@ -3,8 +3,8 @@
  require_relative "extensions/hash"
  require_relative "smarter_csv/version"
 
- require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
- # require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
+ # require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+ require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
 
  module SmarterCSV
    class SmarterCSVException < StandardError; end
@@ -39,11 +39,7 @@ module SmarterCSV
    puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
  end
 
- if options[:skip_lines].to_i > 0
-   options[:skip_lines].to_i.times do
-     readline_with_counts(fh, options)
-   end
- end
+ skip_lines(fh, options)
 
  headerA, header_size = process_headers(fh, options)
 
@@ -207,7 +203,7 @@ module SmarterCSV
  acceleration: true,
  auto_row_sep_chars: 500,
  chunk_size: nil,
- col_sep: ',',
+ col_sep: :auto, # was: ',',
  comment_regexp: nil, # was: /\A#/,
  convert_values_to_numeric: true,
  downcase_header: true,
@@ -226,7 +222,7 @@ module SmarterCSV
  remove_values_matching: nil,
  remove_zero_values: false,
  required_headers: nil,
- row_sep: $/,
+ row_sep: :auto, # was: $/,
  silence_missing_keys: false,
  skip_lines: nil,
  strings_as_keys: false,
@@ -243,9 +239,24 @@ module SmarterCSV
    line = filehandle.readline(options[:row_sep])
    @file_line_count += 1
    @csv_line_count += 1
+   line = remove_bom(line) if @csv_line_count == 1
    line
  end
 
+ def skip_lines(filehandle, options)
+   return unless options[:skip_lines].to_i > 0
+
+   options[:skip_lines].to_i.times do
+     readline_with_counts(filehandle, options)
+   end
+ end
+
+ def rewind(filehandle)
+   @file_line_count = 0
+   @csv_line_count = 0
+   filehandle.rewind
+ end
+
  ###
  ### Thin wrapper around C-extension
  ###
@@ -378,6 +389,8 @@ module SmarterCSV
  # Otherwise guesses column separator from contents.
  # Raises exception if none is found.
  def guess_column_separator(filehandle, options)
+   skip_lines(filehandle, options)
+
    possible_delimiters = [',', "\t", ';', ':', '|']
 
    candidates = if options.fetch(:headers_in_file)
@@ -417,7 +430,7 @@ module SmarterCSV
    lines += 1
    break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
  end
- filehandle.rewind
+ rewind(filehandle)
 
  counts["\r"] += 1 if last_char == "\r"
  # find the most frequent key/value pair:
@@ -473,13 +486,13 @@ module SmarterCSV
 
  unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
    key_mappingH = options[:key_mapping]
-
    # do some key mapping on the keys in the file header
    # if you want to completely delete a key, then map it to nil or to ''
    if !key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
      unless options[:silence_missing_keys]
        # if silence_missing_keys are not set, raise error if missing header
        missing_keys = key_mappingH.keys - headerA
+
        puts "WARNING: missing header(s): #{missing_keys.join(",")}" unless missing_keys.empty?
      end
 
@@ -525,15 +538,34 @@ module SmarterCSV
 
  private
 
+ UTF_32_BOM = %w[0 0 fe ff].freeze
+ UTF_32LE_BOM = %w[ff fe 0 0].freeze
+ UTF_8_BOM = %w[ef bb bf].freeze
+ UTF_16_BOM = %w[fe ff].freeze
+ UTF_16LE_BOM = %w[ff fe].freeze
+
+ def remove_bom(str)
+   str_as_hex = str.bytes.map{|x| x.to_s(16)}
+   # if string does not start with one of the bytes above, there is no BOM
+   return str unless %w[ef fe ff 0].include?(str_as_hex[0])
+
+   return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
+   return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
+   return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
+
+   puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
+   str
+ end
+
  def candidated_column_separators_from_headers(filehandle, options, delimiters)
    candidates = Hash.new(0)
-   line = filehandle.readline(options[:row_sep])
+   line = readline_with_counts(filehandle, options.slice(:row_sep))
 
    delimiters.each do |d|
      candidates[d] += line.scan(d).count
    end
 
-   filehandle.rewind
+   rewind(filehandle)
 
    candidates
  end
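The header-based counting in the hunk above can be tried outside the gem with a small stand-alone sketch. It is an illustration under stated assumptions, not SmarterCSV's implementation: `count_delimiters` and `guess_delimiter` are hypothetical names, the row separator is passed in explicitly instead of being auto-detected, and choosing the most frequent candidate is an assumption, since the selection step is not shown in this diff. The delimiter list and the raise-if-none-found behaviour follow the `guess_column_separator` hunk.

```ruby
require "stringio"

# Same candidate list as possible_delimiters in guess_column_separator.
DELIMITERS = [",", "\t", ";", ":", "|"].freeze

# Counts each candidate delimiter in the first line, then rewinds the IO
# so that later processing starts from the beginning again.
def count_delimiters(io, row_sep: "\n")
  line = io.readline(row_sep)
  counts = DELIMITERS.to_h { |d| [d, line.scan(d).count] }
  io.rewind
  counts
end

# Assumption: pick the candidate with the highest count; raise if none occurs.
def guess_delimiter(io, row_sep: "\n")
  best, hits = count_delimiters(io, row_sep: row_sep).max_by { |_, n| n }
  raise ArgumentError, "no column separator found" if hits.zero?

  best
end

io = StringIO.new("id;name;price\n1;apple;0.5\n")
puts guess_delimiter(io) # ;
```

Rewinding after the peek matters for the same reason the diff replaces `filehandle.rewind` with `rewind(filehandle)`: any state derived from the lines already read has to be reset before real processing begins.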
@@ -542,7 +574,7 @@ module SmarterCSV
  candidates = Hash.new(0)
 
  5.times do
-   line = filehandle.readline(options[:row_sep])
+   line = readline_with_counts(filehandle, options.slice(:row_sep))
    delimiters.each do |d|
      candidates[d] += line.scan(d).count
    end
@@ -550,7 +582,7 @@ module SmarterCSV
      break
    end
 
-   filehandle.rewind
+   rewind(filehandle)
 
    candidates
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: smarter_csv
  version: !ruby/object:Gem::Version
-   version: 1.7.4
+   version: 1.8.0
  platform: ruby
  authors:
  - Tilo Sloboda
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-01-14 00:00:00.000000000 Z
+ date: 2023-03-19 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: awesome_print