smarter_csv 1.7.3 → 1.8.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: d4046758f38c21262fdec6bc7e13e3a7811c7aee3944d92e0cc36a2a1cfb032a
-   data.tar.gz: 9d111e2f36171ca488034f3af73fc71c7c9f6fde73986d277aeaf1560a066fa2
+   metadata.gz: 55400b3977ce35c58d60c4101362b68d99f2dbf7cb6a63956ae3b6ab79fcf1ac
+   data.tar.gz: 41f46d3e4de69a7924ecd2214ba4e37766106469d1b8b257fd752a96204a47fd
  SHA512:
-   metadata.gz: c46c5c45dd3fafe66735b2b17b0679c5aaff27b3670140d97bc19e1c825ad91310fa2cf55a12a5c7b0c31ef82fe9cc12a2c4bda0a78b218d80ad5816c01c0d9f
-   data.tar.gz: ba03acd95955f8afeb8e96f16c7cfa2e1605dbaf6fddb7008930294aab83196aed21f57605efb3553799381c1c4811528eee2db221efa50dc82f58bcf9135842
+   metadata.gz: 24ecc14cf9c65efe5c11e4bd20753420aa8ccd7385171cd21eac2e1be92c4896087cdc2a18799fa111c0f36154ad4481daed7f08b752f4fae2b5f27241b8cf6c
+   data.tar.gz: c1d70e18a7ae8057e58cbf73b62f4896dd7030bc5fd2e927669e5ea829f9a3c11daeb9c8b83296dbb46e6f0d23034245b7207882a77b54cc1ca128a581175359
data/CHANGELOG.md CHANGED
@@ -1,6 +1,13 @@
 
  # SmarterCSV 1.x Change Log
 
+ ## 1.8.0 (2023-03-18)
+ * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+ * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
+
+ ## 1.7.4 (2023-01-13)
+ * improved guessing of the column separator, thanks to Alessandro Fazzi
+
  ## 1.7.3 (2022-12-05)
  * new option :silence_missing_keys; if set to true, it ignores missing keys in `key_mapping`
 
data/CONTRIBUTORS.md CHANGED
@@ -49,3 +49,4 @@ A Big Thank you to everyone who filed issues, sent comments, and who contributed
  * [Nicolas Rodriguez](https://github.com/n-rodriguez)
  * [Hirotaka Mizutani ](https://github.com/hirotaka)
  * [Rahul Chaudhary](https://github.com/rahulch95)
+ * [Alessandro Fazzi](https://github.com/pioneerskies)
data/README.md CHANGED
@@ -55,10 +55,23 @@ The two main choices you have in terms of how to call `SmarterCSV.process` are:
  * calling `process` with or without a block
  * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
 
- Tip: If you are uncertain about what line endings a CSV-file uses, try specifying `:row_sep => :auto` as part of the options.
- But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
- To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
- You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
+ By default (since version 1.8.0), detection of the column and row separators is set to automatic `row_sep: :auto`, `col_sep: :auto`. This should make it easier to process any CSV files without having to examine the line endings or column separators.
+
+ You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters); nil or 0 will check the whole file).
+ You can also set the `:row_sep` manually! Checkout Example 4 for unusual `:row_sep` and `:col_sep`.
+
+ ### Troubleshooting
+
+ In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection a tool like `hexdump` can help find otherwise hidden control character or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
+
+ ```
+ $ hexdump -C spec/fixtures/bom_test_feff.csv
+ 00000000  fe ff 73 6f 6d 65 5f 69  64 2c 74 79 70 65 2c 66  |..some_id,type,f|
+ 00000010  75 7a 7a 62 6f 78 65 73  0d 0a 34 32 37 36 36 38  |uzzboxes..427668|
+ 00000020  30 35 2c 7a 69 7a 7a 6c  65 73 2c 31 32 33 34 0d  |05,zizzles,1234.|
+ 00000030  0a 33 38 37 35 39 31 35  30 2c 71 75 69 7a 7a 65  |.38759150,quizze|
+ 00000040  73 2c 35 36 37 38 0d 0a                           |s,5678..|
+ ```
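
The same check can be done from Ruby instead of `hexdump`. The following is a minimal sketch, not part of SmarterCSV: `detect_bom` and the `BOMS` table are hypothetical names introduced here for illustration, and the sample string simply repeats the first bytes of the fixture in the hexdump above.

```ruby
# Hypothetical helper, not part of SmarterCSV's API.
# Longer marks are listed first so that UTF-32LE (ff fe 00 00)
# is not mistaken for UTF-16LE (ff fe).
BOMS = {
  "UTF-32BE" => [0x00, 0x00, 0xFE, 0xFF],
  "UTF-32LE" => [0xFF, 0xFE, 0x00, 0x00],
  "UTF-8"    => [0xEF, 0xBB, 0xBF],
  "UTF-16BE" => [0xFE, 0xFF],
  "UTF-16LE" => [0xFF, 0xFE],
}.freeze

# Returns the name of the BOM found at the start of +str+, or nil if none.
def detect_bom(str)
  head = str.bytes.first(4)
  BOMS.each do |name, mark|
    return name if head.first(mark.size) == mark
  end
  nil
end

# First bytes of the fixture shown in the hexdump (fe ff, then the header).
sample = "\xFE\xFFsome_id,type,fuzzboxes\r\n".b
puts detect_bom(sample).inspect            # => "UTF-16BE"
puts detect_bom("some_id,type\n").inspect  # => nil
```

A non-nil result means the first line carries a mark that a parser has to strip before the header names can be compared.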
 
 
  #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
@@ -222,10 +235,10 @@ The options and the block are optional.
  | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
  | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
  ---------------------------------------------------------------------------------------------------------------------------------
- | :col_sep | ',' | column separator, can be set to :auto |
+ | :col_sep | :auto | column separator (default was ',') |
  | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
  | | | e.g. when :quote_char is not properly escaped |
- | :row_sep | $/ ,"\n" | row separator or record separator , defaults to system's $/ , which defaults to "\n" |
+ | :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
  | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
  | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
  | :quote_char | '"' | quotation character |
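
How `:row_sep => :auto` and `:auto_row_sep_chars` interact can be illustrated with a small standalone sketch. This is a simplified illustration and not SmarterCSV's implementation: `guess_row_sep` is a hypothetical name, and the sketch only counts line endings within the first N characters and returns the most frequent one.

```ruby
require "stringio"

# Hypothetical helper, not SmarterCSV's implementation.
# Counts "\r\n", "\n" and "\r" in the first +limit+ characters of +io+
# (nil or 0 means the whole input) and returns the most frequent one.
def guess_row_sep(io, limit = 500)
  counts = { "\r\n" => 0, "\n" => 0, "\r" => 0 }
  last_char = nil
  seen = 0

  io.each_char do |c|
    if last_char == "\r"
      # a "\r" can only be classified once the next character is known
      if c == "\n"
        counts["\r\n"] += 1
      else
        counts["\r"] += 1
      end
    elsif c == "\n"
      counts["\n"] += 1
    end
    last_char = c
    seen += 1
    break if limit && limit > 0 && seen >= limit
  end
  io.rewind

  counts["\r"] += 1 if last_char == "\r" # "\r" was the last character sampled
  counts.max_by { |_sep, n| n }.first
end

puts guess_row_sep(StringIO.new("a,b\r\n1,2\r\n")).inspect  # => "\r\n"
puts guess_row_sep(StringIO.new("a,b\n1,2\n")).inspect      # => "\n"
```

A small limit keeps detection cheap on large files, at the cost of deciding from a sample that may not be representative.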
data/Rakefile CHANGED
@@ -3,6 +3,17 @@
  require "bundler/gem_tasks"
  require 'rspec/core/rake_task'
 
+
+ # temp fix for NoMethodError: undefined method `last_comment'
+ # remove when fixed in Rake 11.x and higher
+ module TempFixForRakeLastComment
+   def last_comment
+     last_description
+   end
+ end
+ Rake::Application.send :include, TempFixForRakeLastComment
+ ### end of tempfix
+
  RSpec::Core::RakeTask.new(:spec)
 
  require "rubocop/rake_task"
data/lib/smarter_csv/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module SmarterCSV
-   VERSION = "1.7.3"
+   VERSION = "1.8.0"
  end
data/lib/smarter_csv.rb CHANGED
@@ -3,8 +3,8 @@
  require_relative "extensions/hash"
  require_relative "smarter_csv/version"
 
- require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
- # require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
+ # require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+ require 'smarter_csv.bundle' unless ENV['CI'] # does not compile/link in CI?
 
  module SmarterCSV
    class SmarterCSVException < StandardError; end
@@ -39,11 +39,7 @@ module SmarterCSV
            puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
          end
 
-         if options[:skip_lines].to_i > 0
-           options[:skip_lines].to_i.times do
-             readline_with_counts(fh, options)
-           end
-         end
+         skip_lines(fh, options)
 
          headerA, header_size = process_headers(fh, options)
 
@@ -207,7 +203,7 @@ module SmarterCSV
          acceleration: true,
          auto_row_sep_chars: 500,
          chunk_size: nil,
-         col_sep: ',',
+         col_sep: :auto, # was: ',',
          comment_regexp: nil, # was: /\A#/,
          convert_values_to_numeric: true,
          downcase_header: true,
@@ -226,7 +222,7 @@ module SmarterCSV
          remove_values_matching: nil,
          remove_zero_values: false,
          required_headers: nil,
-         row_sep: $/,
+         row_sep: :auto, # was: $/,
          silence_missing_keys: false,
          skip_lines: nil,
          strings_as_keys: false,
@@ -243,9 +239,24 @@ module SmarterCSV
        line = filehandle.readline(options[:row_sep])
        @file_line_count += 1
        @csv_line_count += 1
+       line = remove_bom(line) if @csv_line_count == 1
        line
      end
 
+     def skip_lines(filehandle, options)
+       return unless options[:skip_lines].to_i > 0
+
+       options[:skip_lines].to_i.times do
+         readline_with_counts(filehandle, options)
+       end
+     end
+
+     def rewind(filehandle)
+       @file_line_count = 0
+       @csv_line_count = 0
+       filehandle.rewind
+     end
+
      ###
      ### Thin wrapper around C-extension
      ###
@@ -374,24 +385,23 @@ module SmarterCSV
        return false
      end
 
-     # raise exception if none is found
+     # If file has headers, then guesses column separator from headers.
+     # Otherwise guesses column separator from contents.
+     # Raises exception if none is found.
      def guess_column_separator(filehandle, options)
-       del = [',', "\t", ';', ':', '|']
-       n = Hash.new(0)
+       skip_lines(filehandle, options)
 
-       5.times do
-         line = filehandle.readline(options[:row_sep])
-         del.each do |d|
-           n[d] += line.scan(d).count
-         end
-       rescue EOFError # short files
-         break
-       end
+       possible_delimiters = [',', "\t", ';', ':', '|']
 
-       filehandle.rewind
-       raise SmarterCSV::NoColSepDetected if n.values.max == 0
+       candidates = if options.fetch(:headers_in_file)
+         candidated_column_separators_from_headers(filehandle, options, possible_delimiters)
+       else
+         candidated_column_separators_from_contents(filehandle, options, possible_delimiters)
+       end
+
+       raise SmarterCSV::NoColSepDetected if candidates.values.max == 0
 
-       col_sep = n.key(n.values.max)
+       candidates.key(candidates.values.max)
      end
 
      # limitation: this currently reads the whole file in before making a decision
@@ -420,7 +430,7 @@ module SmarterCSV
          lines += 1
          break if options[:auto_row_sep_chars] && options[:auto_row_sep_chars] > 0 && lines >= options[:auto_row_sep_chars]
        end
-       filehandle.rewind
+       rewind(filehandle)
 
        counts["\r"] += 1 if last_char == "\r"
        # find the most frequent key/value pair:
@@ -476,13 +486,13 @@ module SmarterCSV
 
        unless options[:user_provided_headers] # wouldn't make sense to re-map user provided headers
          key_mappingH = options[:key_mapping]
-
          # do some key mapping on the keys in the file header
          # if you want to completely delete a key, then map it to nil or to ''
          if !key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
            unless options[:silence_missing_keys]
              # if silence_missing_keys are not set, raise error if missing header
              missing_keys = key_mappingH.keys - headerA
+
              puts "WARNING: missing header(s): #{missing_keys.join(",")}" unless missing_keys.empty?
            end
 
@@ -525,5 +535,56 @@ module SmarterCSV
        end
        result
      end
+
+     private
+
+     UTF_32_BOM = %w[0 0 fe ff].freeze
+     UTF_32LE_BOM = %w[ff fe 0 0].freeze
+     UTF_8_BOM = %w[ef bb bf].freeze
+     UTF_16_BOM = %w[fe ff].freeze
+     UTF_16LE_BOM = %w[ff fe].freeze
+
+     def remove_bom(str)
+       str_as_hex = str.bytes.map{|x| x.to_s(16)}
+       # if string does not start with one of the bytes above, there is no BOM
+       return str unless %w[ef fe ff 0].include?(str_as_hex[0])
+
+       return str.byteslice(4..-1) if [UTF_32_BOM, UTF_32LE_BOM].include?(str_as_hex[0..3])
+       return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM
+       return str.byteslice(2..-1) if [UTF_16_BOM, UTF_16LE_BOM].include?(str_as_hex[0..1])
+
+       puts "SmarterCSV found unhandled BOM! #{str.chars[0..7].inspect}"
+       str
+     end
+
+     def candidated_column_separators_from_headers(filehandle, options, delimiters)
+       candidates = Hash.new(0)
+       line = readline_with_counts(filehandle, options.slice(:row_sep))
+
+       delimiters.each do |d|
+         candidates[d] += line.scan(d).count
+       end
+
+       rewind(filehandle)
+
+       candidates
+     end
+
+     def candidated_column_separators_from_contents(filehandle, options, delimiters)
+       candidates = Hash.new(0)
+
+       5.times do
+         line = readline_with_counts(filehandle, options.slice(:row_sep))
+         delimiters.each do |d|
+           candidates[d] += line.scan(d).count
+         end
+       rescue EOFError # short files
+         break
+       end
+
+       rewind(filehandle)
+
+       candidates
+     end
    end
  end
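
The header-based guess in this diff comes down to counting each candidate delimiter in one line and taking the most frequent. Below is a standalone sketch of that idea, assuming an IO-like input positioned at the start and an already known row separator. `guess_col_sep` is a hypothetical name, and the exception class is a local stand-in for `SmarterCSV::NoColSepDetected`; none of this is the gem's own code.

```ruby
require "stringio"

# Local stand-in for the gem's exception, so the sketch runs without the gem.
class NoColSepDetected < StandardError; end

POSSIBLE_DELIMITERS = [",", "\t", ";", ":", "|"].freeze

# Hypothetical helper: counts each candidate delimiter in the first line
# of +io+ and returns the most frequent one.
def guess_col_sep(io, row_sep = "\n")
  line = io.readline(row_sep)
  io.rewind # so that later parsing starts from the top again

  candidates = POSSIBLE_DELIMITERS.to_h { |d| [d, line.scan(d).count] }
  raise NoColSepDetected if candidates.values.max == 0

  candidates.key(candidates.values.max)
end

puts guess_col_sep(StringIO.new("id;name;created_at\n1;Ann;12:30\n")).inspect # => ";"
```

Looking only at the header line avoids being misled by data values such as `12:30`, which would otherwise add to the count for `:`.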
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: smarter_csv
  version: !ruby/object:Gem::Version
-   version: 1.7.3
+   version: 1.8.0
  platform: ruby
  authors:
  - Tilo Sloboda
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2022-12-09 00:00:00.000000000 Z
+ date: 2023-03-19 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: awesome_print