smarter_csv 1.11.2 → 1.12.1
- checksums.yaml +4 -4
- data/.rubocop.yml +8 -2
- data/CHANGELOG.md +32 -1
- data/CONTRIBUTORS.md +1 -0
- data/README.md +31 -396
- data/docs/_introduction.md +56 -0
- data/docs/basic_api.md +157 -0
- data/docs/batch_processing.md +68 -0
- data/docs/data_transformations.md +50 -0
- data/docs/examples.md +75 -0
- data/docs/header_transformations.md +113 -0
- data/docs/header_validations.md +36 -0
- data/docs/options.md +98 -0
- data/docs/row_col_sep.md +104 -0
- data/docs/value_converters.md +68 -0
- data/ext/smarter_csv/smarter_csv.c +4 -2
- data/lib/smarter_csv/auto_detection.rb +7 -2
- data/lib/smarter_csv/file_io.rb +1 -1
- data/lib/smarter_csv/hash_transformations.rb +1 -1
- data/lib/smarter_csv/header_transformations.rb +1 -1
- data/lib/smarter_csv/header_validations.rb +2 -2
- data/lib/smarter_csv/headers.rb +1 -1
- data/lib/smarter_csv/{options_processing.rb → options.rb} +44 -43
- data/lib/smarter_csv/{parse.rb → parser.rb} +2 -2
- data/lib/smarter_csv/reader.rb +243 -0
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv/writer.rb +2 -1
- data/lib/smarter_csv.rb +20 -4
- data/smarter_csv.gemspec +2 -2
- metadata +21 -11
- data/lib/smarter_csv/smarter_csv.rb +0 -210
- data/lib/smarter_csv/variables.rb +0 -30
data/docs/row_col_sep.md
ADDED
@@ -0,0 +1,104 @@
+
+### Contents
+
+* [Introduction](./_introduction.md)
+* [The Basic API](./basic_api.md)
+* [Batch Processing](./batch_processing.md)
+* [Configuration Options](./options.md)
+* [**Row and Column Separators**](./row_col_sep.md)
+* [Header Transformations](./header_transformations.md)
+* [Header Validations](./header_validations.md)
+* [Data Transformations](./data_transformations.md)
+* [Value Converters](./value_converters.md)
+
+--------------
+
+# Row and Column Separators
+
+## Automatic Detection
+
+Convenient defaults allow automatic detection of the column and row separators: `row_sep: :auto`, `col_sep: :auto`. This makes it easier to process any CSV files without having to examine the line endings or column separators, e.g. when users upload CSV files to your service and you have no control over the incoming files.
+
+You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters; `nil` or `0` will check the whole file). Of course you can also set the `:row_sep` manually.
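To make the idea of sampling the first N characters concrete, here is a minimal stdlib-only sketch. The helper name `guess_row_sep_sketch` and its counting strategy are assumptions made for illustration; this is not smarter_csv's actual detection code.

```ruby
# Hypothetical helper for illustration only; not smarter_csv's implementation.
# Counts each candidate line ending in the first `limit` characters and
# returns the most frequent one. A `limit` of nil or 0 inspects the whole text.
def guess_row_sep_sketch(text, limit = 500)
  sample = limit.nil? || limit.zero? ? text : text[0, limit]
  counts = {
    "\r\n" => sample.scan("\r\n").size,
    "\n" => sample.scan(/(?<!\r)\n/).size, # bare LF only
    "\r" => sample.scan(/\r(?!\n)/).size   # bare CR only
  }
  counts.max_by { |_sep, count| count }.first
end

windows_style = guess_row_sep_sketch("a,b\r\n1,2\r\n")
unix_style = guess_row_sep_sketch("a,b\n1,2\n")
```

A small sample keeps detection cheap on large uploads, at the cost of possibly missing the first line ending when the opening row is longer than the sample.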
+
+
+## Column Separator `col_sep`
+
+The automatic detection of column separators considers: `,`, `\t`, `;`, `:`, `|`.
+
+Some CSV files may contain an unusual column separator, which could even be a control character. In that case, set `col_sep` explicitly.
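The following stdlib-only sketch illustrates one way to pick a column separator from a single line while ignoring delimiters inside quoted fields. The helper name `guess_col_sep_sketch` is hypothetical, and the real detection in the gem may weigh several lines and differ in detail.

```ruby
# Hypothetical helper for illustration only; not smarter_csv's implementation.
# Removes quoted sections, then counts each candidate delimiter in what is left.
def guess_col_sep_sketch(line, quote_char = '"', delimiters = [',', "\t", ';', ':', '|'])
  quote = Regexp.escape(quote_char)
  unquoted = line.split(/#{quote}[^#{quote}]*#{quote}/).join
  counts = delimiters.to_h { |d| [d, unquoted.scan(d).count] }
  best, count = counts.max_by { |_d, c| c }
  count.zero? ? nil : best
end

# The commas live inside a quoted field, so only the semicolons are counted.
separator = guess_col_sep_sketch('name;"Smith, John, Jr.";age')
```

Returning `nil` when no candidate occurs leaves the decision to the caller, which is where an explicit `col_sep` would come in.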
+
+## Row Separator `row_sep`
+
+The automatic detection of row separators considers: `\n`, `\r\n`, `\r`.
+
+Some CSV files may contain an unusual row separator, which could even be a control character. In that case, set `row_sep` explicitly.
+
+
+## Custom / Non-Standard CSV Formats
+
+Besides custom values for `col_sep` and `row_sep`, some other customizations of CSV files are:
+* the presence of a number of leading lines before the header or data section starts.
+* the presence of comment lines, e.g. lines starting with `#`
+
+To explore these special cases, please use the following examples.
+
+### Example 1: reading an iTunes DB dump
+
+This data format uses CTRL-A as the column separator, and CTRL-B followed by a newline as the record separator. It also has comment lines that start with a `#` character. The example also maps the header `name` to `genre`, and ignores the column `export_date`.
+
+```ruby
+filename = '/tmp/itunes_db_dump'
+options = {
+  :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
+  :chunk_size => 100 , :key_mapping => {export_date: nil, name: :genre},
+}
+n = SmarterCSV.process(filename, options) do |chunk|
+  SidekiqWorkerClass.process_async(chunk) # pass an array of hashes to Sidekiq workers for parallel processing
+end
+# => returns number of chunks
+```
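To show what chunked block processing amounts to without a Sidekiq or smarter_csv dependency, this sketch groups already-parsed rows with `Enumerable#each_slice`. The sample rows and variable names are illustrative assumptions.

```ruby
# Illustration only: five parsed rows, handed to a block two at a time.
rows = (1..5).map { |i| { id: i } }

chunks = []
chunk_count = 0
rows.each_slice(2) do |chunk|
  chunks << chunk # a real block might enqueue the chunk for a worker here
  chunk_count += 1
end
```

The last chunk holds the remainder and may be smaller than the chunk size, so a consumer should not assume full chunks.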
+
+### Example 2: Reading a CSV-File with custom col_sep, row_sep
+In this example we have an unusual CSV file with `|` as the row separator, and `#` as the column separator.
+This unusual format needs explicit options `col_sep` and `row_sep`.
+
+```ruby
+filename = '/tmp/input_file.txt'
+recordsA = SmarterCSV.process(filename, {col_sep: "#", row_sep: "|"})
+
+# => returns an array of hashes
+```
+
+### Example 3: skipping leading lines with `skip_lines`
+In this example, we use `skip_lines: 3` to skip the first 3 lines of the input.
+
+
+```ruby
+filename = '/tmp/input_file.txt'
+recordsA = SmarterCSV.process(filename, {skip_lines: 3})
+
+# => returns an array of hashes
+```
+
+
+### Example 4: reading an iTunes DB dump
+
+In this example, we use `comment_regexp` to filter out and ignore any lines starting with `#`.
+
+
+```ruby
+# Consider a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
+filename = '/tmp/strange_db_dump'
+options = {
+  :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
+  :chunk_size => 100 , :key_mapping => {:export_date => nil, :name => :genre},
+}
+n = SmarterCSV.process(filename, options) do |chunk|
+  SidekiqWorkerClass.process_async(chunk) # pass an array of hashes to Sidekiq workers for parallel processing
+end
+# => returns number of chunks
+```
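As a rough picture of what skipping leading lines and filtering comment lines do to the input, here is a stdlib-only sketch over a `StringIO`. The helper `preview_rows_sketch` and its keyword arguments are hypothetical; they only borrow the option names from the examples above and do not call the gem.

```ruby
require 'stringio'

# Hypothetical helper for illustration only; it returns raw lines, not hashes.
def preview_rows_sketch(io, skip_lines: 0, comment_regexp: nil, row_sep: "\n")
  skip_lines.times { io.readline(row_sep) } # drop the leading lines
  rows = []
  until io.eof?
    line = io.readline(row_sep)
    next if comment_regexp && line =~ comment_regexp # ignore comment lines
    rows << line.chomp(row_sep)
  end
  rows
end

io = StringIO.new("exported by tool\nv2\n\nname,genre\n# a comment\nBlue,Jazz\n")
rows = preview_rows_sketch(io, skip_lines: 3, comment_regexp: /\A#/)
```

Skipping happens by line count before the header is read, while comment filtering applies to every remaining line.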
+
+----------------
+PREVIOUS: [Configuration Options](./options.md) | NEXT: [Header Transformations](./header_transformations.md)
data/docs/value_converters.md
ADDED
@@ -0,0 +1,68 @@
+
+### Contents
+
+* [Introduction](./_introduction.md)
+* [The Basic API](./basic_api.md)
+* [Batch Processing](./batch_processing.md)
+* [Configuration Options](./options.md)
+* [Row and Column Separators](./row_col_sep.md)
+* [Header Transformations](./header_transformations.md)
+* [Header Validations](./header_validations.md)
+* [Data Transformations](./data_transformations.md)
+* [**Value Converters**](./value_converters.md)
+
+--------------
+
+# Using Value Converters
+
+Value Converters allow you to do custom transformations on the values of specific columns, to help you massage the data so it fits the expectations of your down-stream process, such as creating a DB record.
+
+If you use `key_mapping` and `value_converters`, make sure that the value converters reference the keys based on the final mapped name, not the original name in the CSV file.
+
+```ruby
+$ cat spec/fixtures/with_dates.csv
+first,last,date,price
+Ben,Miller,10/30/1998,$44.50
+Tom,Turner,2/1/2011,$15.99
+Ken,Smith,01/09/2013,$199.99
+
+$ irb
+> require 'smarter_csv'
+> require 'date'
+
+# define a custom converter class, which implements self.convert(value)
+class DateConverter
+  def self.convert(value)
+    Date.strptime( value, '%m/%d/%Y') # parses custom date format into Date instance
+  end
+end
+
+class DollarConverter
+  def self.convert(value)
+    value.sub('$','').to_f # strips the dollar sign and creates a Float value
+  end
+end
+
+require 'money'
+class MoneyConverter
+  def self.convert(value)
+    # depending on locale you might want to also remove the indicator for thousands, e.g. comma
+    Money.from_amount(value.gsub(/[\s\$]/,'').to_f) # creates a Money instance (based on cents)
+  end
+end
+
+options = {:value_converters => {:date => DateConverter, :price => DollarConverter}}
+data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
+first_record = data.first
+first_record[:date]
+=> #<Date: 1998-10-30 ((2451117j,0s,0n),+0s,2299161j)>
+first_record[:date].class
+=> Date
+first_record[:price]
+=> 44.5
+first_record[:price].class
+=> Float
+```
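The note about referencing the final mapped key can be sketched without the gem. The helper `apply_converters_sketch` is hypothetical and only imitates the order "map keys first, then convert values" that the text describes; the `:purchased_on` key is an assumed name.

```ruby
require 'date'

class DateConverter
  def self.convert(value)
    Date.strptime(value, '%m/%d/%Y')
  end
end

class DollarConverter
  def self.convert(value)
    value.sub('$', '').to_f
  end
end

# Hypothetical helper for illustration only; not smarter_csv's implementation.
def apply_converters_sketch(row, key_mapping, value_converters)
  # Step 1: rename keys; a key mapped to nil is dropped.
  mapped = row.each_with_object({}) do |(key, value), out|
    new_key = key_mapping.fetch(key, key)
    out[new_key] = value unless new_key.nil?
  end
  # Step 2: look up converters by the mapped key name.
  mapped.to_h do |key, value|
    converter = value_converters[key]
    [key, converter ? converter.convert(value) : value]
  end
end

row = { first: 'Ben', last: 'Miller', date: '10/30/1998', price: '$44.50' }
converted = apply_converters_sketch(
  row,
  { date: :purchased_on },
  { purchased_on: DateConverter, price: DollarConverter }
)
```

Registering the converter under `:date` instead of `:purchased_on` would leave the value as an unconverted string in this sketch, which is the pitfall the note warns about.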
+
+--------------------
+PREVIOUS: [Data Transformations](./data_transformations.md) | UP: [README](../README.md)
data/ext/smarter_csv/smarter_csv.c
CHANGED
@@ -87,9 +87,11 @@ static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quot
 }
 
 VALUE SmarterCSV = Qnil;
+VALUE Parser = Qnil;
 
 void Init_smarter_csv(void) {
-
+  SmarterCSV = rb_define_module("SmarterCSV");
+  Parser = rb_define_module_under(SmarterCSV, "Parser");
 
-  rb_define_module_function(
+  rb_define_module_function(Parser, "parse_csv_line_c", rb_parse_csv_line, 4);
 }
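Since the C function is now registered on a nested module, Ruby code can probe for it with `respond_to?` and fall back to a pure-Ruby path. The sketch below uses a stand-in module named `ParserSketch`; the naive `split` fallback is an assumption for illustration and ignores quoting.

```ruby
# Stand-in module for illustration only; not the gem's Parser.
module ParserSketch
  def self.parse_line(line, col_sep, quote_char = '"')
    if respond_to?(:parse_csv_line_c)
      # Accelerated path: only taken when an extension defined this method.
      parse_csv_line_c(line, col_sep, quote_char, nil)
    else
      line.split(col_sep) # naive fallback: does not honor quoted fields
    end
  end
end

fields = ParserSketch.parse_line('a,b,c', ',')
```

Probing at call time keeps the code usable on platforms where the extension could not be compiled.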
data/lib/smarter_csv/auto_detection.rb
CHANGED
@@ -1,7 +1,7 @@
 # frozen_string_literal: true
 
 module SmarterCSV
-
+  module AutoDetection
     protected
 
     # If file has headers, then guesses column separator from headers.
@@ -19,7 +19,12 @@ module SmarterCSV
       count.times do
         line = readline_with_counts(filehandle, options)
         delimiters.each do |d|
-
+          escaped_quote = Regexp.escape(options[:quote_char])
+
+          # Count only non-quoted occurrences of the delimiter
+          non_quoted_text = line.split(/#{escaped_quote}[^#{escaped_quote}]*#{escaped_quote}/).join
+
+          candidates[d] += non_quoted_text.scan(d).count
         end
       rescue EOFError # short files
         break
data/lib/smarter_csv/file_io.rb
CHANGED
data/lib/smarter_csv/hash_transformations.rb
CHANGED
@@ -1,7 +1,7 @@
 # frozen_string_literal: true
 
 module SmarterCSV
-
+  module HashTransformations
     def hash_transformations(hash, options)
       # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
       # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
data/lib/smarter_csv/header_validations.rb
CHANGED
@@ -1,7 +1,7 @@
 # frozen_string_literal: true
 
 module SmarterCSV
-
+  module HeaderValidations
     def header_validations(headers, options)
       check_duplicate_headers(headers, options)
       check_required_headers(headers, options)
@@ -26,7 +26,7 @@ module SmarterCSV
         missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }
 
         unless missing_keys.empty?
-          raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}. Check `
+          raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}. Check `reader.headers` for original headers."
         end
       end
     end
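The required-keys check in this hunk can be mirrored in a few lines of plain Ruby. The error class `MissingKeysSketch` and the helper name are stand-ins chosen for this sketch, so it runs without the gem.

```ruby
require 'set'

# Stand-in error class for illustration; the gem raises SmarterCSV::MissingKeys.
class MissingKeysSketch < StandardError; end

# Hypothetical helper: raises when any required key is absent from the headers.
def check_required_keys_sketch(headers, required_keys)
  headers_set = headers.to_set
  missing = required_keys.reject { |key| headers_set.include?(key) }
  return true if missing.empty?

  raise MissingKeysSketch, "ERROR: missing attributes: #{missing.join(',')}"
end

ok = check_required_keys_sketch(%i[first last email], %i[first email])
```

Converting the headers to a `Set` keeps each membership test constant-time, which matters little for a handful of headers but mirrors the `headers_set` name seen in the hunk.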
data/lib/smarter_csv/headers.rb
CHANGED
data/lib/smarter_csv/{options_processing.rb → options.rb}
CHANGED
@@ -1,43 +1,51 @@
 # frozen_string_literal: true
 
 module SmarterCSV
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+  #
+  # NOTE: this is not called when "parse" methods are tested by themselves
+  #
+  # ONLY FOR BACKWARDS-COMPATIBILITY
+  def self.default_options
+    Options::DEFAULT_OPTIONS
+  end
+
+  module Options
+    DEFAULT_OPTIONS = {
+      acceleration: true, # if user wants to use acceleration or not
+      auto_row_sep_chars: 500,
+      chunk_size: nil,
+      col_sep: :auto, # was: ',',
+      comment_regexp: nil, # was: /\A#/,
+      convert_values_to_numeric: true,
+      downcase_header: true,
+      duplicate_header_suffix: '', # was: nil,
+      file_encoding: 'utf-8',
+      force_simple_split: false,
+      force_utf8: false,
+      headers_in_file: true,
+      invalid_byte_sequence: '',
+      keep_original_headers: false,
+      key_mapping: nil,
+      quote_char: '"',
+      remove_empty_hashes: true,
+      remove_empty_values: true,
+      remove_unmapped_keys: false,
+      remove_values_matching: nil,
+      remove_zero_values: false,
+      required_headers: nil,
+      required_keys: nil,
+      row_sep: :auto, # was: $/,
+      silence_missing_keys: false,
+      skip_lines: nil,
+      strings_as_keys: false,
+      strip_chars_from_headers: nil,
+      strip_whitespace: true,
+      user_provided_headers: nil,
+      value_converters: nil,
+      verbose: false,
+      with_line_numbers: false,
+    }.freeze
 
-  class << self
     # NOTE: this is not called when "parse" methods are tested by themselves
     def process_options(given_options = {})
       puts "User provided options:\n#{pp(given_options)}\n" if given_options[:verbose]
@@ -53,13 +61,6 @@ module SmarterCSV
       @options
     end
 
-    # NOTE: this is not called when "parse" methods are tested by themselves
-    #
-    # ONLY FOR BACKWARDS-COMPATIBILITY
-    def default_options
-      DEFAULT_OPTIONS
-    end
-
     private
 
     def validate_options!(options)
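A frozen defaults hash combined with `Hash#merge` is a common way to layer user options over defaults without mutating the constant. The sketch below uses a trimmed, hypothetical `DEFAULTS_SKETCH` and does not reproduce what `process_options` does beyond that idea.

```ruby
# Trimmed stand-in for illustration; the real defaults list many more keys.
DEFAULTS_SKETCH = {
  col_sep: :auto,
  row_sep: :auto,
  chunk_size: nil,
  verbose: false
}.freeze

# Hypothetical helper: user-supplied values win, defaults fill the gaps.
def process_options_sketch(given_options = {})
  DEFAULTS_SKETCH.merge(given_options)
end

effective = process_options_sketch(col_sep: ';', chunk_size: 100)
```

Because `merge` returns a new hash, each caller gets its own option set while the frozen constant stays untouched.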
data/lib/smarter_csv/{parse.rb → parser.rb}
CHANGED
@@ -1,7 +1,7 @@
 # frozen_string_literal: true
 
 module SmarterCSV
-
+  module Parser
     protected
 
     ###
@@ -10,7 +10,7 @@ module SmarterCSV
     def parse(line, options, header_size = nil)
       # puts "SmarterCSV.parse OPTIONS: #{options[:acceleration]}" if options[:verbose]
 
-      if options[:acceleration] && has_acceleration
+      if options[:acceleration] && has_acceleration
         # :nocov:
         has_quotes = line =~ /#{options[:quote_char]}/
         elements = parse_csv_line_c(line, options[:col_sep], options[:quote_char], header_size)
data/lib/smarter_csv/reader.rb
ADDED
@@ -0,0 +1,243 @@
+# frozen_string_literal: true
+
+module SmarterCSV
+  class Reader
+    include ::SmarterCSV::Options
+    include ::SmarterCSV::FileIO
+    include ::SmarterCSV::AutoDetection
+    include ::SmarterCSV::Headers
+    include ::SmarterCSV::HeaderTransformations
+    include ::SmarterCSV::HeaderValidations
+    include ::SmarterCSV::HashTransformations
+    include ::SmarterCSV::Parser
+
+    attr_reader :input, :options
+    attr_reader :csv_line_count, :chunk_count, :file_line_count
+    attr_reader :enforce_utf8, :has_rails, :has_acceleration
+    attr_reader :errors, :warnings, :headers, :raw_header, :result
+
+    # :nocov:
+    # rubocop:disable Naming/MethodName
+    def headerA
+      warn "Deprecation Warning: 'headerA' will be removed in future versions. Use 'headers'"
+      @headerA
+    end
+    # rubocop:enable Naming/MethodName
+    # :nocov:
+
+    # first parameter: filename or input object which responds to readline method
+    def initialize(input, given_options = {})
+      @input = input
+      @has_rails = !!defined?(Rails)
+      @csv_line_count = 0
+      @chunk_count = 0
+      @errors = {}
+      @file_line_count = 0
+      @headerA = []
+      @headers = nil
+      @raw_header = nil # header as it appears in the file
+      @result = []
+      @warnings = {}
+      @enforce_utf8 = false # only set to true if needed (after options parsing)
+      @options = process_options(given_options)
+      # true if it is compiled with acceleration
+      @has_acceleration = !!SmarterCSV::Parser.respond_to?(:parse_csv_line_c)
+    end
+
+    def process(&block) # rubocop:disable Lint/UnusedMethodArgument
+      @enforce_utf8 = options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+      @verbose = options[:verbose]
+
+      begin
+        fh = input.respond_to?(:readline) ? input : File.open(input, "r:#{options[:file_encoding]}")
+
+        if (options[:force_utf8] || options[:file_encoding] =~ /utf-8/i) && (fh.respond_to?(:external_encoding) && fh.external_encoding != Encoding.find('UTF-8') || fh.respond_to?(:encoding) && fh.encoding != Encoding.find('UTF-8'))
+          puts 'WARNING: you are trying to process UTF-8 input, but did not open the input with "b:utf-8" option. See README file "NOTES about File Encodings".'
+        end
+
+        # auto-detect the row separator
+        options[:row_sep] = guess_line_ending(fh, options) if options[:row_sep]&.to_sym == :auto
+        # attempt to auto-detect column separator
+        options[:col_sep] = guess_column_separator(fh, options) if options[:col_sep]&.to_sym == :auto
+
+        skip_lines(fh, options)
+
+        @headers, header_size = process_headers(fh, options)
+        @headerA = @headers # @headerA is deprecated, use @headers
+
+        puts "Effective headers:\n#{pp(@headers)}\n" if @verbose
+
+        header_validations(@headers, options)
+
+        # in case we use chunking.. we'll need to set it up..
+        if options[:chunk_size].to_i > 0
+          use_chunks = true
+          chunk_size = options[:chunk_size].to_i
+          @chunk_count = 0
+          chunk = []
+        else
+          use_chunks = false
+        end
+
+        # now on to processing all the rest of the lines in the CSV file:
+        # fh.each_line |line|
+        until fh.eof? # we can't use fh.readlines() here, because this would read the whole file into memory at once, and eof => true
+          line = readline_with_counts(fh, options)
+
+          # replace invalid byte sequence in UTF-8 with question mark to avoid errors
+          line = enforce_utf8_encoding(line, options) if @enforce_utf8
+
+          print "processing file line %10d, csv line %10d\r" % [@file_line_count, @csv_line_count] if @verbose
+
+          next if options[:comment_regexp] && line =~ options[:comment_regexp] # ignore all comment lines if there are any
+
+          # cater for the quoted csv data containing the row separator carriage return character
+          # in which case the row data will be split across multiple lines (see the sample content in spec/fixtures/carriage_returns_rn.csv)
+          # by detecting the existence of an uneven number of quote characters
+          multiline = count_quote_chars(line, options[:quote_char]).odd?
+
+          while multiline
+            next_line = fh.readline(options[:row_sep])
+            next_line = enforce_utf8_encoding(next_line, options) if @enforce_utf8
+            line += next_line
+            @file_line_count += 1
+
+            break if fh.eof? # Exit loop if end of file is reached
+
+            multiline = count_quote_chars(line, options[:quote_char]).odd?
+          end
+
+          # :nocov:
+          if multiline && @verbose
+            print "\nline contains uneven number of quote chars so including content through file line %d\n" % @file_line_count
+          end
+          # :nocov:
+
+          line.chomp!(options[:row_sep])
+
+          # --- SPLIT LINE & DATA TRANSFORMATIONS ------------------------------------------------------------
+          dataA, _data_size = parse(line, options, header_size)
+
+          dataA.map!{|x| x.strip} if options[:strip_whitespace]
+
+          # if all values are blank, then ignore this line
+          next if options[:remove_empty_hashes] && (dataA.empty? || blank?(dataA))
+
+          # --- HASH TRANSFORMATIONS ------------------------------------------------------------
+          hash = @headers.zip(dataA).to_h
+
+          hash = hash_transformations(hash, options)
+
+          # --- HASH VALIDATIONS ----------------------------------------------------------------
+          # will go here, and be able to:
+          # - validate correct format of the values for fields
+          # - required fields to be non-empty
+          # - ...
+          # -------------------------------------------------------------------------------------
+
+          next if options[:remove_empty_hashes] && hash.empty?
+
+          puts "CSV Line #{@file_line_count}: #{pp(hash)}" if @verbose == '2' # very verbose setting
+          # optional adding of csv_line_number to the hash to help debugging
+          hash[:csv_line_number] = @csv_line_count if options[:with_line_numbers]
+
+          # process the chunks or the resulting hash
+          if use_chunks
+            chunk << hash # append temp result to chunk
+
+            if chunk.size >= chunk_size || fh.eof? # if chunk is full, or EOF reached
+              # do something with the chunk
+              if block_given?
+                yield chunk # do something with the hashes in the chunk in the block
+              else
+                @result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
+              end
+              @chunk_count += 1
+              chunk.clear # re-initialize for next chunk of data
+            else
+              # the last chunk may contain partial data, which is handled below
+            end
+            # while a chunk is being filled up we don't need to do anything else here
+
+          else # no chunk handling
+            if block_given?
+              yield [hash] # do something with the hash in the block (better to use chunking here)
+            else
+              @result << hash
+            end
+          end
+        end
+
+        # print new line to retain last processing line message
+        print "\n" if @verbose
+
+        # handling of last chunk:
+        if !chunk.nil? && chunk.size > 0
+          # do something with the chunk
+          if block_given?
+            yield chunk # do something with the hashes in the chunk in the block
+          else
+            @result << chunk.dup # Append chunk to result (use .dup to keep a copy after we do chunk.clear)
+          end
+          @chunk_count += 1
+          # chunk = [] # initialize for next chunk of data
+        end
+      ensure
+        fh.close if fh.respond_to?(:close)
+      end
+
+      if block_given?
+        @chunk_count # when we do processing through a block we only care how many chunks we processed
+      else
+        @result # returns either an Array of Hashes, or an Array of Arrays of Hashes (if in chunked mode)
+      end
+    end
+
+    def count_quote_chars(line, quote_char)
+      return 0 if line.nil? || quote_char.nil? || quote_char.empty?
+
+      count = 0
+      escaped = false
+
+      line.each_char do |char|
+        if char == '\\' && !escaped
+          escaped = true
+        else
+          count += 1 if char == quote_char && !escaped
+          escaped = false
+        end
+      end
+
+      count
+    end
+
+    protected
+
+    # SEE: https://github.com/rails/rails/blob/32015b6f369adc839c4f0955f2d9dce50c0b6123/activesupport/lib/active_support/core_ext/object/blank.rb#L121
+    # and in the future we might also include UTF-8 space characters: https://www.compart.com/en/unicode/category/Zs
+    BLANK_RE = /\A\s*\z/.freeze
+
+    def blank?(value)
+      case value
+      when String
+        BLANK_RE.match?(value)
+      when NilClass
+        true
+      when Array
+        value.all? { |elem| blank?(elem) }
+      when Hash
+        value.values.all? { |elem| blank?(elem) } # Focus on values only
+      else
+        false
+      end
+    end
+
+    private
+
+    def enforce_utf8_encoding(line, options)
+      # return line unless options[:force_utf8] || options[:file_encoding] !~ /utf-8/i
+
+      line.force_encoding('utf-8').encode('utf-8', invalid: :replace, undef: :replace, replace: options[:invalid_byte_sequence])
+    end
+  end
+end
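The multi-line handling in `process` relies on an odd count of unescaped quote characters to decide that a record continues on the next physical line. The standalone sketch below reuses that counting logic under the assumed names `count_quote_chars_sketch` and `read_logical_line_sketch`, reading from a `StringIO` instead of a file.

```ruby
require 'stringio'

# Same counting idea as in the hunk above: backslash-escaped quotes are skipped.
def count_quote_chars_sketch(line, quote_char = '"')
  return 0 if line.nil? || quote_char.nil? || quote_char.empty?

  count = 0
  escaped = false
  line.each_char do |char|
    if char == '\\' && !escaped
      escaped = true
    else
      count += 1 if char == quote_char && !escaped
      escaped = false
    end
  end
  count
end

# Hypothetical helper: keeps appending physical lines while a quote is open.
def read_logical_line_sketch(io, row_sep = "\n")
  line = io.readline(row_sep)
  line += io.readline(row_sep) while count_quote_chars_sketch(line).odd? && !io.eof?
  line.chomp(row_sep)
end

io = StringIO.new(%(1,"first\nsecond",x\n2,plain,y\n))
first = read_logical_line_sketch(io)
second = read_logical_line_sketch(io)
```

An unbalanced quote at the end of the input simply ends the loop at EOF, so the last record is returned as read rather than raising.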
data/lib/smarter_csv/version.rb
CHANGED
data/lib/smarter_csv/writer.rb
CHANGED
@@ -40,7 +40,8 @@ module SmarterCSV
   class Writer
     def initialize(file_path, options = {})
       @options = options
-
+
+      @row_sep = options[:row_sep] || $/ # Defaults to system's row separator. RFC4180 "\r\n"
       @col_sep = options[:col_sep] || ','
       @quote_char = options[:quote_char] || '"'
       @force_quotes = options[:force_quotes] == true
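To illustrate the effect of a configurable row separator with a fallback to `$/`, here is a minimal sketch. The helper `format_row_sketch` is hypothetical, performs no quoting, and is not the Writer's actual serialization code.

```ruby
# Hypothetical helper for illustration only: joins values and appends the row separator.
def format_row_sketch(values, options = {})
  row_sep = options[:row_sep] || $/ # falls back to Ruby's input record separator
  col_sep = options[:col_sep] || ','
  values.join(col_sep) + row_sep
end

rfc4180_row = format_row_sketch([1, 'a'], row_sep: "\r\n")
custom_row = format_row_sketch([1, 'a'], col_sep: ';', row_sep: "|")
```

Passing `row_sep: "\r\n"` explicitly is the way to get RFC 4180 line endings in this sketch, since the fallback depends on the runtime's `$/`.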