smarter_csv 1.7.3 → 1.8.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: d4046758f38c21262fdec6bc7e13e3a7811c7aee3944d92e0cc36a2a1cfb032a
-   data.tar.gz: 9d111e2f36171ca488034f3af73fc71c7c9f6fde73986d277aeaf1560a066fa2
+   metadata.gz: 55400b3977ce35c58d60c4101362b68d99f2dbf7cb6a63956ae3b6ab79fcf1ac
+   data.tar.gz: 41f46d3e4de69a7924ecd2214ba4e37766106469d1b8b257fd752a96204a47fd
  SHA512:
-   metadata.gz: c46c5c45dd3fafe66735b2b17b0679c5aaff27b3670140d97bc19e1c825ad91310fa2cf55a12a5c7b0c31ef82fe9cc12a2c4bda0a78b218d80ad5816c01c0d9f
-   data.tar.gz: ba03acd95955f8afeb8e96f16c7cfa2e1605dbaf6fddb7008930294aab83196aed21f57605efb3553799381c1c4811528eee2db221efa50dc82f58bcf9135842
+   metadata.gz: 24ecc14cf9c65efe5c11e4bd20753420aa8ccd7385171cd21eac2e1be92c4896087cdc2a18799fa111c0f36154ad4481daed7f08b752f4fae2b5f27241b8cf6c
+   data.tar.gz: c1d70e18a7ae8057e58cbf73b62f4896dd7030bc5fd2e927669e5ea829f9a3c11daeb9c8b83296dbb46e6f0d23034245b7207882a77b54cc1ca128a581175359
data/CHANGELOG.md CHANGED
@@ -1,6 +1,13 @@

  # SmarterCSV 1.x Change Log

+ ## 1.8.0 (2023-03-18)
+ * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+ * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
+
+ ## 1.7.4 (2023-01-13)
+ * improved guessing of the column separator, thanks to Alessandro Fazzi
+
  ## 1.7.3 (2022-12-05)
  * new option :silence_missing_keys; if set to true, it ignores missing keys in `key_mapping`

data/CONTRIBUTORS.md CHANGED
@@ -49,3 +49,4 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
  * [Nicolas Rodriguez](https://github.com/n-rodriguez)
  * [Hirotaka Mizutani ](https://github.com/hirotaka)
  * [Rahul Chaudhary](https://github.com/rahulch95)
+ * [Alessandro Fazzi](https://github.com/pioneerskies)
data/README.md CHANGED
@@ -55,10 +55,23 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
  * calling `process` with or without a block
  * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.

- Tip: If you are uncertain about what line endings a CSV-file uses, try specifying `:row_sep => :auto` as part of the options.
- But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
- To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
- You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
+ By default (since version 1.8.0), detection of the column and row separators is set to automatic `row_sep: :auto`, `col_sep: :auto`. This should make it easier to process any CSV files without having to examine the line endings or column separators.
+
+ You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters); nil or 0 will check the whole file).
+ You can also set the `:row_sep` manually! Checkout Example 4 for unusual `:row_sep` and `:col_sep`.
+
+ ### Troubleshooting
+
+ In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection a tool like `hexdump` can help find otherwise hidden control character or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
+
+ ```
+ $ hexdump -C spec/fixtures/bom_test_feff.csv
+ 00000000  fe ff 73 6f 6d 65 5f 69  64 2c 74 79 70 65 2c 66  |..some_id,type,f|
+ 00000010  75 7a 7a 62 6f 78 65 73  0d 0a 34 32 37 36 36 38  |uzzboxes..427668|
+ 00000020  30 35 2c 7a 69 7a 7a 6c  65 73 2c 31 32 33 34 0d  |05,zizzles,1234.|
+ 00000030  0a 33 38 37 35 39 31 35  30 2c 71 75 69 7a 7a 65  |.38759150,quizze|
+ 00000040  73 2c 35 36 37 38 0d 0a                           |s,5678..|
+ ```


  #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
@@ -222,10 +235,10 @@ The options and the block are optional.
  | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
  | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
  ---------------------------------------------------------------------------------------------------------------------------------
- | :col_sep | ',' | column separator, can be set to :auto |
+ | :col_sep | :auto | column separator (default was ',') |
  | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
  | | | e.g. when :quote_char is not properly escaped |
- | :row_sep | $/ ,"\n" | row separator or record separator , defaults to system's $/ , which defaults to "\n" |
+ | :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
  | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
  | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
  | :quote_char | '"' | quotation character |
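
The README changes above describe `:row_sep => :auto` sampling the first `:auto_row_sep_chars` characters (500 by default) to pick a row separator. The idea can be illustrated with a minimal sketch. The helper name `sketch_guess_row_sep` is hypothetical, and this is a simplified stand-in rather than the gem's own detection code, which this diff shows only in part.

```ruby
require 'stringio'

# Hypothetical helper, NOT part of SmarterCSV: guess the row separator by
# counting candidate line endings in the first `max_chars` characters.
# As in the README, nil or 0 means "look at the whole input".
def sketch_guess_row_sep(io, max_chars = 500)
  sample = (max_chars.to_i > 0 ? io.read(max_chars) : io.read).to_s
  io.rewind # leave the handle where the real parser would start

  counts = {
    "\r\n" => sample.scan("\r\n").count,
    "\n"   => sample.scan(/(?<!\r)\n/).count, # LF not preceded by CR
    "\r"   => sample.scan(/\r(?!\n)/).count   # CR not followed by LF
  }

  sep, count = counts.max_by { |_sep, n| n }
  count.zero? ? nil : sep
end

p sketch_guess_row_sep(StringIO.new("a,b\r\n1,2\r\n"))
```

Callers who depend on the previous behaviour can still pass `:col_sep` and `:row_sep` explicitly, as the option table indicates, instead of relying on the new `:auto` defaults.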
data/Rakefile CHANGED
@@ -3,6 +3,17 @@
  require "bundler/gem_tasks"
  require 'rspec/core/rake_task'

+
+ # temp fix for NoMethodError: undefined method `last_comment'
+ # remove when fixed in Rake 11.x and higher
+ module TempFixForRakeLastComment
+   def last_comment
+     last_description
+   end
+ end
+ Rake::Application.send :include, TempFixForRakeLastComment
+ ### end of tempfix
+
  RSpec::Core::RakeTask.new(:spec)

  require "rubocop/rake_task"
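
The Rakefile change above adds a compatibility shim: a module that re-exposes an old method name by delegating to its replacement, mixed into the class that lost it. A minimal sketch of the same pattern follows, using a stand-in class so it runs without Rake. `SketchApplication` and `SketchLastCommentShim` are hypothetical names, not Rake APIs.

```ruby
# Stand-in for a class that only offers the newer method name
# (hypothetical; the diff applies the shim to Rake::Application).
class SketchApplication
  def last_description
    "run the specs"
  end
end

# Compatibility shim: callers that still use the old name keep working
# because it simply forwards to the new one.
module SketchLastCommentShim
  def last_comment
    last_description
  end
end

# `send(:include, ...)` mirrors the call style used in the diff.
SketchApplication.send(:include, SketchLastCommentShim)

puts SketchApplication.new.last_comment
```

Because the shim lives in its own module, removing it later is a matter of deleting the module and the `include` line.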
data/lib/smarter_csv/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module SmarterCSV
-   VERSION = "1.7.3"
+   VERSION = "1.8.0"
  end
data/lib/smarter_csv.rb CHANGED
@@ -3,8 +3,8 @@
  require_relative "extensions/hash"
  require_relative "smarter_csv/version"

- require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
- # require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
+ # require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+ require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?

  module SmarterCSV
    class SmarterCSVException < StandardError; end
@@ -39,11 +39,7 @@ module SmarterCSV
      puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
    end

-   if options[:skip_lines].to_i > 0
-     options[:skip_lines].to_i.times do
-       readline_with_counts(fh, options)
-     end
-   end
+   skip_lines(fh, options)

    headerA, header_size = process_headers(fh, options)

@@ -207,7 +203,7 @@ module SmarterCSV
  acceleration: true,
  auto_row_sep_chars: 500,
  chunk_size: nil,
- col_sep: ',',
+ col_sep: :auto, # was: ',',
  comment_regexp: nil, # was: /\A#/,
  convert_values_to_numeric: true,
  downcase_header: true,
@@ -226,7 +222,7 @@ module SmarterCSV
  remove_values_matching: nil,
  remove_zero_values: false,
  required_headers: nil,
- row_sep: $/,
+ row_sep: :auto, # was: $/,
  silence_missing_keys: false,
  skip_lines: nil,
  strings_as_keys: false,
@@ -243,9 +239,24 @@ module SmarterCSV
    line = filehandle.readline(options[:row_sep])
    @file_line_count += 1
    @csv_line_count += 1
+   line = remove_bom(line) if @csv_line_count == 1
    line
  end

+ def skip_lines(filehandle, options)
+   return unless options[:skip_lines].to_i > 0
+
+   options[:skip_lines].to_i.times do
+     readline_with_counts(filehandle, options)
+   end
+ end
+
+ def rewind(filehandle)
+   @file_line_count = 0
+   @csv_line_count = 0
+   filehandle.rewind
+ end
+
  ###
  ### Thin wrapper around C-extension
  ###
@@ -374,24 +385,23 @@ module SmarterCSV
    return false
  end

- # raise exception if none is found
+ # If file has headers, then guesses column separator from headers.
+ # Otherwise guesses column separator from contents.
+ # Raises exception if none is found.
  def guess_column_separator(filehandle, options)
-   del = [',', "\t", ';', ':', '|']
-   n = Hash.new(0)
+   skip_lines(filehandle, options)

-   5.times do
-     line = filehandle.readline(options[:row_sep])
-     del.each do |d|
-       n[d] += line.scan(d).count
-     end
-   rescue EOFError # short files
-     break
-   end
+   possible_delimiters = [',', "\t", ';', ':', '|']

-   filehandle.rewind
-   raise SmarterCSV::NoColSepDetected if n.values.max == 0
+   candidates = if options.fetch(:headers_in_file)
+     candidated_column_separators_from_headers(filehandle, options, possible_delimiters)
+   else
+     candidated_column_separators_from_contents(filehandle, options, possible_delimiters)
+   end
+
+   raise SmarterCSV::NoColSepDetected if candidates.values.max == 0

-   col_sep = n.key(n.values.max)
+   candidates.key(candidates.values.max)
  end

  # limitation: this currently reads the whole file in before making a decision
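
The reworked `guess_column_separator` above counts candidate delimiters in the header line when the file has headers, rather than across the first five lines. A minimal standalone sketch of that counting step is given here. `sketch_guess_col_sep` and `SKETCH_DELIMITERS` are hypothetical names, and the sketch returns `nil` where the gem raises `SmarterCSV::NoColSepDetected`.

```ruby
require 'stringio'

# Same candidate list as in the diff.
SKETCH_DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Hypothetical helper: count each candidate delimiter in the first line
# only and return the most frequent one, or nil if none occurs.
def sketch_guess_col_sep(io, row_sep = "\n")
  line = io.readline(row_sep)
  io.rewind # the real parser still has to read the header afterwards

  # String#scan with a String argument matches it literally,
  # so '|' needs no regexp escaping here.
  counts = SKETCH_DELIMITERS.to_h { |d| [d, line.scan(d).count] }
  best, count = counts.max_by { |_d, n| n }
  count.zero? ? nil : best
end

# The data row contains a decimal comma, but only the header is inspected.
p sketch_guess_col_sep(StringIO.new("id;name;price\n1;x;2,50\n"))
```

Looking only at the header avoids being misled by delimiter-like characters inside data values, such as the decimal comma in the sample row. On a tie, `max_by` keeps the earliest candidate in the list.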
@@ -420,7 +430,7 @@ module SmarterCSV
    lines += 1
    break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
  end
- filehandle.rewind
+ rewind(filehandle)

  counts["\r"] += 1 if last_char == "\r"
  # find the most frequent key/value pair:
@@ -476,13 +486,13 @@ module SmarterCSV

  unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
    key_mappingH = options[:key_mapping]
-
    # do some key mapping on the keys in the file header
    # if you want to completely delete a key, then map it to nil or to ''
    if !key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
      unless options[:silence_missing_keys]
        # if silence_missing_keys are not set, raise error if missing header
        missing_keys = key_mappingH.keys - headerA
+
        puts "WARNING: missing header(s): #{missing_keys.join(",")}" unless missing_keys.empty?
      end

@@ -525,5 +535,56 @@ module SmarterCSV
        end
        result
      end
+
+     private
+
+     UTF_32_BOM = %w[0 0 fe ff].freeze
+     UTF_32LE_BOM = %w[ff fe 0 0].freeze
+     UTF_8_BOM = %w[ef bb bf].freeze
+     UTF_16_BOM = %w[fe ff].freeze
+     UTF_16LE_BOM = %w[ff fe].freeze
+
+     def remove_bom(str)
+       str_as_hex = str.bytes.map{|x| x.to_s(16)}
+       # if string does not start with one of the bytes above, there is no BOM
+       return str unless %w[ef fe ff 0].include?(str_as_hex[0])
+
+       return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
+       return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
+       return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
+
+       puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
+       str
+     end
+
+     def candidated_column_separators_from_headers(filehandle, options, delimiters)
+       candidates = Hash.new(0)
+       line = readline_with_counts(filehandle, options.slice(:row_sep))
+
+       delimiters.each do |d|
+         candidates[d] += line.scan(d).count
+       end
+
+       rewind(filehandle)
+
+       candidates
+     end
+
+     def candidated_column_separators_from_contents(filehandle, options, delimiters)
+       candidates = Hash.new(0)
+
+       5.times do
+         line = readline_with_counts(filehandle, options.slice(:row_sep))
+         delimiters.each do |d|
+           candidates[d] += line.scan(d).count
+         end
+       rescue EOFError # short files
+         break
+       end
+
+       rewind(filehandle)
+
+       candidates
+     end
    end
  end
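
The new `remove_bom` above strips a leading byte order mark from the first line, testing the 4-byte marks before the 2-byte ones so that `ff fe 0 0` is not mistaken for the shorter `ff fe`. The same ordering idea can be sketched standalone. This variant compares integer bytes instead of the hex strings used in the diff, and `SKETCH_BOMS` and `sketch_strip_bom` are hypothetical names.

```ruby
# BOM byte sequences, longest first so the UTF-32LE mark is tested
# before the UTF-16LE mark it begins with.
SKETCH_BOMS = [
  [0x00, 0x00, 0xFE, 0xFF], # UTF-32, big-endian
  [0xFF, 0xFE, 0x00, 0x00], # UTF-32, little-endian
  [0xEF, 0xBB, 0xBF],       # UTF-8
  [0xFE, 0xFF],             # UTF-16, big-endian
  [0xFF, 0xFE]              # UTF-16, little-endian
].freeze

# Hypothetical helper: return `str` without a leading BOM, if it has one.
def sketch_strip_bom(str)
  head = str.byteslice(0, 4).bytes
  bom = SKETCH_BOMS.find { |candidate| head.first(candidate.size) == candidate }
  bom ? str.byteslice(bom.size..-1) : str
end

# The fixture in the README's hexdump starts with fe ff.
p sketch_strip_bom("\xFE\xFFsome_id,type".b)
```

Working on bytes rather than characters keeps the check independent of the encoding the file handle was opened with.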
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: smarter_csv
  version: !ruby/object:Gem::Version
-   version: 1.7.3
+   version: 1.8.0
  platform: ruby
  authors:
  - Tilo Sloboda
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2022-12-09 00:00:00.000000000 Z
+ date: 2023-03-19 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: awesome_print