smarter_csv 1.11.2 → 1.12.0.pre1
- checksums.yaml +4 -4
- data/.rubocop.yml +8 -2
- data/CHANGELOG.md +29 -1
- data/README.md +31 -396
- data/docs/_introduction.md +40 -0
- data/docs/basic_api.md +140 -0
- data/docs/batch_processing.md +53 -0
- data/docs/data_transformations.md +32 -0
- data/docs/examples.md +61 -0
- data/docs/header_transformations.md +95 -0
- data/docs/header_validations.md +18 -0
- data/docs/notes.md +29 -0
- data/docs/options.md +82 -0
- data/docs/row_col_sep.md +87 -0
- data/docs/value_converters.md +51 -0
- data/ext/smarter_csv/smarter_csv.c +4 -2
- data/lib/smarter_csv/auto_detection.rb +1 -1
- data/lib/smarter_csv/file_io.rb +1 -1
- data/lib/smarter_csv/hash_transformations.rb +1 -1
- data/lib/smarter_csv/header_transformations.rb +1 -1
- data/lib/smarter_csv/header_validations.rb +2 -2
- data/lib/smarter_csv/headers.rb +1 -1
- data/lib/smarter_csv/{options_processing.rb → options.rb} +44 -43
- data/lib/smarter_csv/{parse.rb → parser.rb} +2 -2
- data/lib/smarter_csv/reader.rb +243 -0
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv/writer.rb +2 -1
- data/lib/smarter_csv.rb +20 -4
- data/smarter_csv.gemspec +1 -1
- metadata +19 -9
- data/lib/smarter_csv/smarter_csv.rb +0 -210
- data/lib/smarter_csv/variables.rb +0 -30
data/docs/basic_api.md
ADDED
@@ -0,0 +1,140 @@
|
|
1
|
+
|
2
|
+
# SmarterCSV API
|
3
|
+
|
4
|
+
Let's explore the basic APIs for reading and writing CSV files. There is a simplified API (backwards compatible with previous SmarterCSV versions) and the full API, which allows you to access the internal state of the reader or writer instance after processing.
|
5
|
+
|
6
|
+
## Reading CSV
|
7
|
+
|
8
|
+
SmarterCSV has convenient defaults for automatically detecting row and column separators based on the given data. This provides more robust parsing of input files when you have no control over the data, e.g. when users upload CSV files.
|
9
|
+
Learn more about this [in this section](docs/examples/row_col_sep.md).
|
10
|
+
|
11
|
+
### Simplified Interface
|
12
|
+
|
13
|
+
The simplified call to read CSV files is:
|
14
|
+
|
15
|
+
```
|
16
|
+
array_of_hashes = SmarterCSV.process(file_or_input, options, &block)
|
17
|
+
|
18
|
+
```
|
19
|
+
It can also be used with a block:
|
20
|
+
|
21
|
+
```
|
22
|
+
SmarterCSV.process(file_or_input, options) do |hash|
|
23
|
+
# process one row of CSV
|
24
|
+
end
|
25
|
+
```
|
26
|
+
|
27
|
+
It can also be used for processing batches of rows:
|
28
|
+
|
29
|
+
```
|
30
|
+
SmarterCSV.process(file_or_input, {chunk_size: 100}) do |array_of_hashes|
|
31
|
+
# process one chunk of up to 100 rows of CSV data
|
32
|
+
end
|
33
|
+
```
|
34
|
+
|
35
|
+
### Full Interface
|
36
|
+
|
37
|
+
The simplified API works in most cases, but if you need access to the internal state and detailed results of the CSV-parsing, you should use this form:
|
38
|
+
|
39
|
+
```
|
40
|
+
reader = SmarterCSV::Reader.new(file_or_input, options)
|
41
|
+
data = reader.process
|
42
|
+
|
43
|
+
puts reader.raw_headers
|
44
|
+
```
|
45
|
+
It can also be used with a block:
|
46
|
+
|
47
|
+
```
|
48
|
+
reader = SmarterCSV::Reader.new(file_or_input, options)
|
49
|
+
data = reader.process do
|
50
|
+
# do something here
|
51
|
+
end
|
52
|
+
|
53
|
+
puts reader.raw_headers
|
54
|
+
```
|
55
|
+
|
56
|
+
This gives you access to the internal state of the `reader` instance after processing.
|
57
|
+
|
58
|
+
|
59
|
+
## Interface for Writing CSV
|
60
|
+
|
61
|
+
To generate a CSV file, we use the `<<` operator to append new data to the file.
|
62
|
+
|
63
|
+
The input operator for adding data to a CSV file `<<` can handle single hashes, array-of-hashes, or array-of-arrays-of-hashes, and can be called one or multiple times for each file.
|
64
|
+
|
65
|
+
One smart feature of writing CSV data is the discovery of headers.
|
66
|
+
|
67
|
+
If you have hashes of data, where each hash can have different keys, the `SmarterCSV::Writer` automatically discovers the superset of keys as the headers of the CSV file. This can be disabled by providing one of the options `headers` or `map_headers`, or by setting `discover_headers: false`.
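
The header discovery described above amounts to taking the union of all keys, in order of first appearance. The following is a minimal plain-Ruby sketch of that idea, not the library's implementation; the `records` data and the naive comma join (no quoting) are assumptions made for illustration.

```ruby
# Illustration only: discover the superset of keys across heterogeneous hashes.
records = [
  { name: "Ada",   email: "ada@example.com" },
  { name: "Grace", phone: "555-0100" }
]

# Union of all keys, keeping the order in which they first appear.
headers = records.flat_map(&:keys).uniq

# Emit one line per record; keys a record lacks become empty fields.
lines = [headers.join(",")] +
        records.map { |record| headers.map { |h| record[h] }.join(",") }

puts lines
```

Each record contributes only the keys it has, so columns missing from a record are written as empty fields.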
|
68
|
+
|
69
|
+
|
70
|
+
### Simplified Interface
|
71
|
+
|
72
|
+
The simplified interface takes a block:
|
73
|
+
|
74
|
+
```
|
75
|
+
SmarterCSV.generate(filename, options) do |csv_writer|
|
76
|
+
|
77
|
+
MyModel.find_in_batches(batch_size: 100) do |batch|
|
78
|
+
batch.pluck(:name, :description, :instructor).each do |record|
|
79
|
+
csv_writer << record
|
80
|
+
end
|
81
|
+
end
|
82
|
+
|
83
|
+
end
|
84
|
+
```
|
85
|
+
|
86
|
+
### Full Interface
|
87
|
+
|
88
|
+
```
|
89
|
+
writer = SmarterCSV::Writer.new(file_path, options)
|
90
|
+
|
91
|
+
MyModel.find_in_batches(batch_size: 100) do |batch|
|
92
|
+
batch.pluck(:name, :description, :instructor).each do |record|
|
93
|
+
writer << record
|
94
|
+
end
end
|
95
|
+
|
96
|
+
writer.finalize
|
97
|
+
```
|
98
|
+
|
99
|
+
## Rescue from Exceptions
|
100
|
+
|
101
|
+
While SmarterCSV uses sensible defaults to process the most common CSV files, it will raise exceptions if it cannot auto-detect `col_sep` or `row_sep`, or if it encounters other problems. Therefore please rescue from `SmarterCSV::Error`, and handle outliers according to your requirements.
|
102
|
+
|
103
|
+
If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accommodate unusual formats.
|
104
|
+
|
105
|
+
## Troubleshooting
|
106
|
+
|
107
|
+
If your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection, a tool like `hexdump` can help find otherwise hidden control characters or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
|
108
|
+
|
109
|
+
```
|
110
|
+
$ hexdump -C spec/fixtures/bom_test_feff.csv
|
111
|
+
00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
|
112
|
+
00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
|
113
|
+
00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
|
114
|
+
00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
|
115
|
+
00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
|
116
|
+
```
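
The same check can be done from Ruby by looking at the first bytes of the file. This is a small sketch under stated assumptions: `detect_bom` and `BOMS` are hypothetical names, not part of SmarterCSV, and only three common BOMs are covered.

```ruby
# Hypothetical helper (not part of SmarterCSV): report which BOM, if any,
# starts a file. Only UTF-8 and UTF-16 BOMs are checked here.
BOMS = {
  "UTF-8"    => "\xEF\xBB\xBF".b,
  "UTF-16BE" => "\xFE\xFF".b,
  "UTF-16LE" => "\xFF\xFE".b
}.freeze

def detect_bom(path)
  head = File.binread(path, 3) || "".b   # read at most 3 raw bytes
  found = BOMS.find { |_name, bytes| head.start_with?(bytes) }
  found && found.first                   # encoding name, or nil if no BOM
end
```

For the fixture shown in the hexdump above, which starts with `fe ff`, such a helper would report a UTF-16 big-endian BOM.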
|
117
|
+
|
118
|
+
## Assumptions / Limitations
|
119
|
+
|
120
|
+
* the escape character is `\`, as on UNIX and Windows systems.
|
121
|
+
* quote characters around fields are balanced, e.g. valid: `"field"`, invalid: `"field\"`
|
122
|
+
e.g. an escaped `quote_char` does not denote the end of a field.
|
123
|
+
|
124
|
+
|
125
|
+
## NOTES about File Encodings:
|
126
|
+
* if you have a CSV file which contains unicode characters, you can process it as follows:
|
127
|
+
|
128
|
+
```ruby
|
129
|
+
File.open(filename, "r:bom|utf-8") do |f|
|
130
|
+
data = SmarterCSV.process(f)
|
131
|
+
end
|
132
|
+
```
|
133
|
+
* if the CSV file with unicode characters is in a remote location, similarly you need to give the encoding as an option to the `open` call:
|
134
|
+
```ruby
|
135
|
+
require 'open-uri'
|
136
|
+
file_location = 'http://your.remote.org/sample.csv'
|
137
|
+
URI.open(file_location, 'r:utf-8') do |f| # don't forget to specify the UTF-8 encoding!!
|
138
|
+
data = SmarterCSV.process(f)
|
139
|
+
end
|
140
|
+
```
|
data/docs/batch_processing.md
ADDED
@@ -0,0 +1,53 @@
|
|
1
|
+
|
2
|
+
# Batch Processing
|
3
|
+
|
4
|
+
Processing CSV data in batches (chunks) allows you to parallelize the workload of importing data.
|
5
|
+
This can come in handy when you don't want to slow down the CSV import of large files.
|
6
|
+
|
7
|
+
Setting the option `chunk_size` sets the max batch size.
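
To get a feel for what a maximum batch size means, here is a plain-Ruby sketch that uses `each_slice` on an in-memory array. It only illustrates the resulting chunk shapes and is not how SmarterCSV reads files; the `rows` data is made up.

```ruby
# Illustration only: split 5 row hashes into chunks of at most 2.
rows = (1..5).map { |i| { id: i } }

chunks = rows.each_slice(2).to_a

chunks.each_with_index do |chunk, index|
  puts "chunk #{index + 1}: #{chunk.size} row(s)"
end
```

The last chunk holds the remainder, so it can be smaller than the requested size.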
|
8
|
+
|
9
|
+
|
10
|
+
## Example 1: How SmarterCSV processes CSV-files as chunks, returning arrays of hashes:
|
11
|
+
Please note how the returned array contains two sub-arrays containing the chunks which were read, each chunk containing 2 hashes.
|
12
|
+
In case the number of rows is not cleanly divisible by `:chunk_size`, the last chunk contains fewer hashes.
|
13
|
+
|
14
|
+
```ruby
|
15
|
+
> pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
|
16
|
+
=> [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>2}, {:first=>"Lucy", :last=>"Laweless", :cats=>5} ],
|
17
|
+
[ {:first=>"Miles", :last=>"O'Brian", :fish=>21}, {:first=>"Nancy", :last=>"Homes", :dogs=>2, :birds=>1} ]
|
18
|
+
]
|
19
|
+
```
|
20
|
+
|
21
|
+
## Example 2: How SmarterCSV processes CSV-files as chunks, and passes arrays of hashes to a given block:
|
22
|
+
Please note how the given block is passed the data for each chunk as the parameter (array of hashes),
|
23
|
+
and how the `process` method returns the number of chunks when called with a block.
|
24
|
+
|
25
|
+
```ruby
|
26
|
+
> total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
|
27
|
+
chunk.each do |h| # you can post-process the data from each row to your heart's content, and also create virtual attributes:
|
28
|
+
h[:full_name] = [h[:first],h[:last]].join(' ') # create a virtual attribute
|
29
|
+
h.delete(:first) ; h.delete(:last) # remove two keys
|
30
|
+
end
|
31
|
+
puts chunk.inspect # we could at this point pass the chunk to a Resque worker..
|
32
|
+
end
|
33
|
+
|
34
|
+
[{:dogs=>2, :full_name=>"Dan McAllister"}, {:cats=>5, :full_name=>"Lucy Laweless"}]
|
35
|
+
[{:fish=>21, :full_name=>"Miles O'Brian"}, {:dogs=>2, :birds=>1, :full_name=>"Nancy Homes"}]
|
36
|
+
=> 2
|
37
|
+
```
|
38
|
+
|
39
|
+
## Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
|
40
|
+
```ruby
|
41
|
+
# using chunks:
|
42
|
+
filename = '/tmp/some.csv'
|
43
|
+
options = {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
|
44
|
+
n = SmarterCSV.process(filename, options) do |chunk|
|
45
|
+
# we're passing a block in, to process each resulting hash / row (block takes array of hashes)
|
46
|
+
# when chunking is enabled, there are up to :chunk_size hashes in each chunk
|
47
|
+
MyModel.collection.insert( chunk ) # insert up to 100 records at a time
|
48
|
+
end
|
49
|
+
|
50
|
+
# => returns number of chunks we processed
|
51
|
+
```
|
52
|
+
|
53
|
+
|
data/docs/data_transformations.md
ADDED
@@ -0,0 +1,32 @@
|
|
1
|
+
# Data Transformations
|
2
|
+
|
3
|
+
SmarterCSV automatically transforms the values in each column in order to normalize the data.
|
4
|
+
This behavior can be customized or disabled.
|
5
|
+
|
6
|
+
## Remove Empty Values
|
7
|
+
`remove_empty_values` is enabled by default.
|
8
|
+
It removes any values which are `nil` or would be empty strings.
|
9
|
+
|
10
|
+
## Convert Values to Numeric
|
11
|
+
`convert_values_to_numeric` is enabled by default.
|
12
|
+
SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
|
13
|
+
|
14
|
+
## Remove Zero Values
|
15
|
+
`remove_zero_values` is disabled by default.
|
16
|
+
When enabled, it removes key/value pairs which have a numeric value equal to zero.
|
17
|
+
|
18
|
+
## Remove Values Matching
|
19
|
+
`remove_values_matching` is disabled by default.
|
20
|
+
When enabled, this can help remove key/value pairs from result hashes which would cause problems.
|
21
|
+
|
22
|
+
e.g.
|
23
|
+
* `remove_values_matching: /^\$0\.0+$/` would remove $0.00
|
24
|
+
* `remove_values_matching: /^#VALUE!$/` would remove errors from Excel spreadsheets
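
The combined effect of the transformations above on a single row can be sketched in plain Ruby. This is an illustration of the documented behavior under simplifying assumptions, not the library's code; `normalize_row` and the sample `row` are hypothetical.

```ruby
# Illustration only: mimic the documented value transformations on one row.
def normalize_row(row, remove_values_matching: nil)
  row.each_with_object({}) do |(key, value), result|
    next if value.nil? || value.to_s.strip.empty?                  # remove empty values
    next if remove_values_matching && value.to_s =~ remove_values_matching

    result[key] = case value
                  when /\A[+-]?\d+\z/      then value.to_i          # Integer-like string
                  when /\A[+-]?\d+\.\d+\z/ then value.to_f          # Float-like string
                  else value
                  end
  end
end

row = { name: "Widget", qty: "3", price: "$0.00", weight: "1.5", note: "", ref: "#VALUE!" }
cleaned = normalize_row(row, remove_values_matching: /^\$0\.0+$/)
p cleaned
```

Here the empty `note` and the `$0.00` price are dropped, while `qty` and `weight` become numeric; `ref` survives because the given pattern does not match it.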
|
25
|
+
|
26
|
+
## Empty Hashes
|
27
|
+
|
28
|
+
It can happen that after all transformations, a row of the CSV file would produce a completely empty hash.
|
29
|
+
|
30
|
+
By default SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.
|
31
|
+
|
32
|
+
This can be set to `false`, to keep these empty hashes in the results.
|
data/docs/examples.md
ADDED
@@ -0,0 +1,61 @@
|
|
1
|
+
|
2
|
+
# Examples
|
3
|
+
|
4
|
+
Here are some examples to demonstrate the versatility of SmarterCSV.
|
5
|
+
|
6
|
+
**It is generally recommended to rescue `SmarterCSV::Error` or its subclasses.**
|
7
|
+
|
8
|
+
By default SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception will be raised, e.g. `NoColSepDetected`. Rescuing from these exceptions will make sure that you don't miss processing CSV files, in case users upload CSV files with unexpected formats.
|
9
|
+
|
10
|
+
In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
|
11
|
+
|
12
|
+
## Example 1a: How SmarterCSV processes CSV-files as array of hashes:
|
13
|
+
Please note how each hash contains only the keys for columns with non-null values.
|
14
|
+
|
15
|
+
```ruby
|
16
|
+
$ cat pets.csv
|
17
|
+
first name,last name,dogs,cats,birds,fish
|
18
|
+
Dan,McAllister,2,,,
|
19
|
+
Lucy,Laweless,,5,,
|
20
|
+
Miles,O'Brian,,,,21
|
21
|
+
Nancy,Homes,2,,1,
|
22
|
+
$ irb
|
23
|
+
> require 'smarter_csv'
|
24
|
+
=> true
|
25
|
+
> pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
|
26
|
+
=> [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>2},
|
27
|
+
{:first_name=>"Lucy", :last_name=>"Laweless", :cats=>5},
|
28
|
+
{:first_name=>"Miles", :last_name=>"O'Brian", :fish=>21},
|
29
|
+
{:first_name=>"Nancy", :last_name=>"Homes", :dogs=>2, :birds=>1}
|
30
|
+
]
|
31
|
+
```
|
32
|
+
|
33
|
+
|
34
|
+
## Example 3: Populate a MySQL or MongoDB Database with SmarterCSV:
|
35
|
+
```ruby
|
36
|
+
# without using chunks:
|
37
|
+
filename = '/tmp/some.csv'
|
38
|
+
options = {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
|
39
|
+
n = SmarterCSV.process(filename, options) do |array|
|
40
|
+
# we're passing a block in, to process each resulting hash / row (the block takes array of hashes)
|
41
|
+
# when chunking is not enabled, there is only one hash in each array
|
42
|
+
MyModel.create( array.first )
|
43
|
+
end
|
44
|
+
|
45
|
+
# => returns number of chunks / rows we processed
|
46
|
+
```
|
47
|
+
|
48
|
+
## Example 4: Processing a CSV File, and inserting batch jobs in Sidekiq:
|
49
|
+
```ruby
|
50
|
+
filename = '/tmp/input.csv' # CSV file containing ids or data to process
|
51
|
+
options = { :chunk_size => 100 }
|
52
|
+
n = SmarterCSV.process(filename, options) do |chunk|
|
53
|
+
Sidekiq::Client.push_bulk(
|
54
|
+
'class' => SidekiqIndividualWorkerClass,
|
55
|
+
'args' => chunk,
|
56
|
+
)
|
57
|
+
# OR:
|
58
|
+
# SidekiqBatchWorkerClass.process_async(chunk ) # pass an array of hashes to Sidekiq workers for parallel processing
|
59
|
+
end
|
60
|
+
# => returns number of chunks
|
61
|
+
```
|
data/docs/header_transformations.md
ADDED
@@ -0,0 +1,95 @@
|
|
1
|
+
# Header Transformations
|
2
|
+
|
3
|
+
By default SmarterCSV assumes that a CSV file has headers, and it automatically normalizes the headers and transforms them into Ruby symbols. You can completely customize or override this (see below).
|
4
|
+
|
5
|
+
## Header Normalization
|
6
|
+
|
7
|
+
When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores. e.g. " \t Annual Sales " becomes `:annual_sales`. (see Notes below)
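
The normalization steps named above can be approximated in a few lines of plain Ruby. This is a sketch of the described behavior, not SmarterCSV's internal code; `normalize_header` is a hypothetical name.

```ruby
# Illustration only: strip, downcase, replace inner whitespace, symbolize.
def normalize_header(raw)
  raw.strip.downcase.gsub(/\s+/, '_').to_sym
end

p normalize_header(" \t Annual Sales ")
```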
|
8
|
+
|
9
|
+
## Duplicate Headers
|
10
|
+
|
11
|
+
There can be a lot of variation in CSV files. It is possible that a CSV file contains multiple headers with the same name.
|
12
|
+
|
13
|
+
By default SmarterCSV handles duplicate headers by appending numbers 2..n to them.
|
14
|
+
|
15
|
+
Consider this example:
|
16
|
+
|
17
|
+
```
|
18
|
+
$ cat > /tmp/dupe.csv
|
19
|
+
name,name,name
|
20
|
+
Carl,Edward,Sagan
|
21
|
+
```
|
22
|
+
|
23
|
+
When parsing these duplicate headers, SmarterCSV will return:
|
24
|
+
|
25
|
+
```
|
26
|
+
data = SmarterCSV.process('/tmp/dupe.csv')
|
27
|
+
=> [{:name=>"Carl", :name2=>"Edward", :name3=>"Sagan"}]
|
28
|
+
```
|
29
|
+
|
30
|
+
If you want to have an underscore between the header and the number, you can set `duplicate_header_suffix: '_'`.
|
31
|
+
|
32
|
+
```
|
33
|
+
data = SmarterCSV.process('/tmp/dupe.csv', {duplicate_header_suffix: '_'})
|
34
|
+
=> [{:name=>"Carl", :name_2=>"Edward", :name_3=>"Sagan"}]
|
35
|
+
```
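
The numbering scheme shown in these outputs can be reproduced with a simple counter per header. The following plain-Ruby sketch only illustrates that scheme; `disambiguate` is a hypothetical helper, not the library's implementation.

```ruby
# Illustration only: append 2..n (with an optional suffix) to repeated headers.
def disambiguate(headers, suffix: '')
  seen = Hash.new(0)
  headers.map do |header|
    seen[header] += 1
    seen[header] == 1 ? header : "#{header}#{suffix}#{seen[header]}"
  end
end

p disambiguate(%w[name name name])
p disambiguate(%w[name name name], suffix: '_')
```

The first occurrence keeps its name; only later occurrences receive a number.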
|
36
|
+
|
37
|
+
To further disambiguate the headers, you can use `key_mapping` to assign meaningful names. Please note that the mapping uses the already transformed keys `name_2`, `name_3` as input.
|
38
|
+
|
39
|
+
```
|
40
|
+
options = {
|
41
|
+
duplicate_header_suffix: '_',
|
42
|
+
key_mapping: {
|
43
|
+
name: :first_name,
|
44
|
+
name_2: :middle_name,
|
45
|
+
name_3: :last_name,
|
46
|
+
}
|
47
|
+
}
|
48
|
+
data = SmarterCSV.process('/tmp/dupe.csv', options)
|
49
|
+
=> [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}]
|
50
|
+
```
|
51
|
+
|
52
|
+
## Key Mapping
|
53
|
+
|
54
|
+
The above example already illustrates how intermediate keys can be mapped into something different.
|
55
|
+
This transforms some of the keys in the input, but other keys are still present.
|
56
|
+
|
57
|
+
There is an additional option `remove_unmapped_keys` which can be enabled to only produce the mapped keys in the resulting hashes, and drop any other columns.
|
58
|
+
|
59
|
+
|
60
|
+
### NOTES on Key Mapping:
|
61
|
+
* keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
|
62
|
+
* if you want to completely delete a key, map it to nil or to ''; it will then be automatically deleted from every result Hash
|
63
|
+
* if you have input files with a large number of columns, and you want to ignore all columns which are not specifically mapped with :key_mapping, then use option :remove_unmapped_keys => true
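
The three notes above can be pictured with a small plain-Ruby model of the mapping step. This is a sketch of the documented behavior only; `apply_key_mapping` and the sample data are hypothetical and not part of SmarterCSV.

```ruby
# Illustration only: rename mapped keys, drop keys mapped to nil or '',
# and optionally drop every key that has no mapping at all.
def apply_key_mapping(row, mapping, remove_unmapped_keys: false)
  row.each_with_object({}) do |(key, value), out|
    if mapping.key?(key)
      new_key = mapping[key]
      out[new_key] = value unless new_key.nil? || new_key == ''
    elsif !remove_unmapped_keys
      out[key] = value
    end
  end
end

row     = { first_name: "Dan", last_name: "McAllister", internal_id: 7, dogs: 2 }
mapping = { first_name: :first, last_name: :last, internal_id: nil }

p apply_key_mapping(row, mapping)
p apply_key_mapping(row, mapping, remove_unmapped_keys: true)
```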
|
64
|
+
|
65
|
+
## CSV Files without Headers
|
66
|
+
|
67
|
+
If you have CSV files without headers, it is important to set `headers_in_file: false`, otherwise you'll lose the first data line in your file.
|
68
|
+
You then have to provide `user_provided_headers`, which takes an array of either symbols or strings.
|
69
|
+
|
70
|
+
|
71
|
+
## CSV Files with Headers
|
72
|
+
|
73
|
+
For CSV files with headers, you can either:
|
74
|
+
|
75
|
+
* use the automatic header normalization
|
76
|
+
* map one or more headers into whatever you choose using the `key_mapping` option.
|
77
|
+
(if you map a header to `nil`, it will remove that column from the resulting row hash).
|
78
|
+
* completely replace the headers using `user_provided_headers` (please be careful with this powerful option, as it is not robust against changes in input format).
|
79
|
+
* use the original unmodified headers from the CSV file, using `keep_original_headers`. This results in hash keys that are strings, and may be padded with spaces.
|
80
|
+
|
81
|
+
|
82
|
+
# Notes
|
83
|
+
|
84
|
+
### NOTES about CSV Headers:
|
85
|
+
* as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
|
86
|
+
* the first line with the header might be commented out, in which case you will need to set `comment_regexp: /\A#/`
|
87
|
+
* any occurrences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
|
88
|
+
* any of the keys in the header line will be downcased, spaces replaced by underscore, and converted to Ruby symbols before being used as keys in the returned Hashes
|
89
|
+
* you cannot combine the :user_provided_headers and :key_mapping options
|
90
|
+
* if an incorrect number of headers is provided via :user_provided_headers, the exception SmarterCSV::HeaderSizeMismatch is raised
|
91
|
+
|
92
|
+
### NOTES on improper quotation and unwanted characters in headers:
|
93
|
+
* some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, use the `:force_simple_split => true` option in combination with `:strip_chars_from_headers => /[\-"]/` . This will also significantly speed up the import.
|
94
|
+
If you forced a different :quote_char instead (setting it to an unused character), the import would be up to 5 times slower than using `:force_simple_split`.
|
95
|
+
|
data/docs/header_validations.md
ADDED
@@ -0,0 +1,18 @@
|
|
1
|
+
# Header Validations
|
2
|
+
|
3
|
+
When you are importing data, it can be important to verify that all required data is present, to ensure consistent quality when importing data.
|
4
|
+
|
5
|
+
You can use the `required_keys` option to specify an array of hash keys that you require to be present at a minimum for every data row (after header transformation).
|
6
|
+
|
7
|
+
If these keys are not present, `SmarterCSV::MissingKeys` will be raised to inform you of the data inconsistency.
|
8
|
+
|
9
|
+
## Example
|
10
|
+
|
11
|
+
```ruby
|
12
|
+
options = {
|
13
|
+
required_keys: [:source_account, :destination_account, :amount]
|
14
|
+
}
|
15
|
+
data = SmarterCSV.process("/tmp/transactions.csv", options)
|
16
|
+
|
17
|
+
# => this will raise SmarterCSV::MissingKeys if any row does not contain these three keys
|
18
|
+
```
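
Conceptually, the validation is a set difference between the required keys and the keys that are actually present after header transformation. A plain-Ruby sketch of that check, with a hypothetical `missing_required_keys` helper and made-up headers:

```ruby
# Hypothetical helper (not SmarterCSV's API): list required keys that are absent.
def missing_required_keys(present_keys, required_keys)
  required_keys - present_keys
end

present  = [:source_account, :amount, :memo]
required = [:source_account, :destination_account, :amount]

missing = missing_required_keys(present, required)
warn "missing keys: #{missing.join(', ')}" unless missing.empty?
```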
|
data/docs/notes.md
ADDED
@@ -0,0 +1,29 @@
|
|
1
|
+
|
2
|
+
# Notes
|
3
|
+
|
4
|
+
|
5
|
+
|
6
|
+
|
7
|
+
## NOTES on the use of Chunking and Blocks:
|
8
|
+
* chunking can be very useful for large files when used in combination with passing a block to `SmarterCSV.process`
|
9
|
+
* if you pass a block to `SmarterCSV.process`, that block will be executed and given an Array of Hashes as the parameter.
|
10
|
+
* if the chunk_size is not set, then the array will only contain one Hash.
|
11
|
+
* if the chunk_size is > 0, then the array may contain up to chunk_size Hashes.
|
12
|
+
* this can be very useful when passing chunked data to a post-processing step, e.g. through Sidekiq
|
13
|
+
|
14
|
+
## NOTES about File Encodings:
|
15
|
+
* if you have a CSV file which contains unicode characters, you can process it as follows:
|
16
|
+
|
17
|
+
```ruby
|
18
|
+
File.open(filename, "r:bom|utf-8") do |f|
|
19
|
+
data = SmarterCSV.process(f)
|
20
|
+
end
|
21
|
+
```
|
22
|
+
* if the CSV file with unicode characters is in a remote location, similarly you need to give the encoding as an option to the `open` call:
|
23
|
+
```ruby
|
24
|
+
require 'open-uri'
|
25
|
+
file_location = 'http://your.remote.org/sample.csv'
|
26
|
+
URI.open(file_location, 'r:utf-8') do |f| # don't forget to specify the UTF-8 encoding!!
|
27
|
+
data = SmarterCSV.process(f)
|
28
|
+
end
|
29
|
+
```
|
data/docs/options.md
ADDED
@@ -0,0 +1,82 @@
|
|
1
|
+
|
2
|
+
# SmarterCSV Options
|
3
|
+
|
4
|
+
## CSV Writing
|
5
|
+
|
6
|
+
| Option | Default | Explanation |
|
7
|
+
---------------------------------------------------------------------------------------------------------------------------------
|
8
|
+
| :row_sep | $/ | Separates rows; Defaults to your OS row separator. `\n` on UNIX, `\r\n` on Windows |
|
9
|
+
| :col_sep | "," | Separates each value in a row |
|
10
|
+
| :quote_char | '"' | |
|
11
|
+
| :force_quotes | false | Forces each individual value to be quoted |
|
12
|
+
| :discover_headers | true | Automatically detects all keys in the input before writing the header |
|
13
|
+
| | | This can be disabled by providing `headers` or `map_headers` options. |
|
14
|
+
| :headers | [] | You can provide the specific list of keys from the input you'd like to be used as headers in the CSV file |
|
15
|
+
| :map_headers | {} | Similar to `headers`, but also maps each desired key to a user-specified value that is used as the header. |
|
16
|
+
|
|
17
|
+
|
18
|
+
## CSV Reading
|
19
|
+
|
20
|
+
| Option | Default | Explanation |
|
21
|
+
---------------------------------------------------------------------------------------------------------------------------------
|
22
|
+
| :chunk_size | nil | if set, determines the desired chunk-size (defaults to nil, no chunk processing) |
|
23
|
+
| | | |
|
24
|
+
| :file_encoding | utf-8 | Set the file encoding eg.: 'windows-1252' or 'iso-8859-1' |
|
25
|
+
| :invalid_byte_sequence | '' | what to replace invalid byte sequences with |
|
26
|
+
| :force_utf8 | false | force UTF-8 encoding of all lines (including headers) in the CSV file |
|
27
|
+
| :skip_lines | nil | how many lines to skip before the first line or header line is processed |
|
28
|
+
| :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g. /\A#/ |
|
29
|
+
---------------------------------------------------------------------------------------------------------------------------------
|
30
|
+
| :col_sep | :auto | column separator (default was ',') |
|
31
|
+
| :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
|
32
|
+
| | | e.g. when :quote_char is not properly escaped |
|
33
|
+
| :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
|
34
|
+
| | | This can also be set to :auto, but will process the whole csv file first (slow!) |
|
35
|
+
| :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
|
36
|
+
| :quote_char | '"' | quotation character |
|
37
|
+
---------------------------------------------------------------------------------------------------------------------------------
|
38
|
+
| :headers_in_file | true | Whether or not the file contains headers as the first line. |
|
39
|
+
| | | Important if the file does not contain headers, |
|
40
|
+
| | | otherwise you would lose the first line of data. |
|
41
|
+
| :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
|
42
|
+
| | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
|
43
|
+
| :user_provided_headers | nil | *careful with that axe!* |
|
44
|
+
| | | user provided Array of header strings or symbols, to define |
|
45
|
+
| | | what headers should be used, overriding any in-file headers. |
|
46
|
+
| | | You can not combine the :user_provided_headers and :key_mapping options |
|
47
|
+
| :remove_empty_hashes | true | remove / ignore any hashes which don't have any key/value pairs or all empty values |
|
48
|
+
| :verbose | false | print out line number while processing (to track down problems in input files) |
|
49
|
+
| :with_line_numbers | false | add :csv_line_number to each data hash |
|
50
|
+
---------------------------------------------------------------------------------------------------------------------------------
|
51
|
+
|
52
|
+
Additional 1.x Options which may be replaced in 2.0
|
53
|
+
|
54
|
+
There have been a lot of 1-offs and feature creep around these options, and going forward we'll strive to have a simpler, but more flexible way to address these features.
|
55
|
+
|
56
|
+
|
57
|
+
| Option | Default | Explanation |
|
58
|
+
---------------------------------------------------------------------------------------------------------------------------------
|
59
|
+
| :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
|
60
|
+
| :silence_missing_keys | false | ignore missing keys in `key_mapping` |
|
61
|
+
| | | if set to true: makes all mapped keys optional |
|
62
|
+
| | | if given an array, makes only the keys listed in it optional |
|
63
|
+
| :required_keys | nil | An array. Specify the required names AFTER header transformation. |
|
64
|
+
| :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
|
65
|
+
| | | or an exception is raised. No validation if nil is given. |
|
66
|
+
| :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
|
67
|
+
| :downcase_header | true | downcase all column headers |
|
68
|
+
| :strings_as_keys | false | use strings instead of symbols as the keys in the result hashes |
|
69
|
+
| :strip_whitespace | true | remove whitespace before/after values and headers |
|
70
|
+
| :keep_original_headers | false | keep the original headers from the CSV-file as-is. |
|
71
|
+
| | | Disables other flags manipulating the header fields. |
|
72
|
+
| :strip_chars_from_headers | nil | RegExp to remove extraneous characters from the header line (e.g. if headers are quoted) |
|
73
|
+
---------------------------------------------------------------------------------------------------------------------------------
|
74
|
+
| :value_converters | nil | supply a hash of :header => KlassName; the class needs to implement self.convert(val)|
|
75
|
+
| :remove_empty_values | true | remove values which have nil or empty strings as values |
|
76
|
+
| :remove_zero_values | false | remove values which have a numeric value equal to zero / 0 |
|
77
|
+
| :remove_values_matching | nil | removes key/value pairs if value matches given regular expressions. e.g.: |
|
78
|
+
| | | /^\$0\.0+$/ to match $0.00 , or /^#VALUE!$/ to match errors in Excel spreadsheets |
|
79
|
+
| :convert_values_to_numeric | true | converts strings containing Integers or Floats to the appropriate class |
|
80
|
+
| | | also accepts either {:except => [:key1,:key2]} or {:only => :key3} |
|
81
|
+
---------------------------------------------------------------------------------------------------------------------------------
|
82
|
+
|
data/docs/row_col_sep.md
ADDED
@@ -0,0 +1,87 @@
|
|
1
|
+
|
2
|
+
# Row and Column Separators
|
3
|
+
|
4
|
+
## Automatic Detection
|
5
|
+
|
6
|
+
Convenient defaults allow automatic detection of the column and row separators: `row_sep: :auto`, `col_sep: :auto`. This makes it easier to process any CSV files without having to examine the line endings or column separators, e.g. when users upload CSV files to your service and you have no control over the incoming files.
|
7
|
+
|
8
|
+
You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters; `nil` or `0` will check the whole file). Of course you can also set the `:row_sep` manually.
|
9
|
+
|
10
|
+
|
11
|
+
## Column Separator `col_sep`
|
12
|
+
|
13
|
+
The automatic detection of column separators considers: `,`, `\t`, `;`, `:`, `|`.
|
14
|
+
|
15
|
+
Some CSV files may contain an unusual column separator, which could even be a control character.
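
One simple way to picture automatic detection is to count how often each candidate appears in a sample line and pick the most frequent one. The sketch below is a naive heuristic written for illustration, not SmarterCSV's actual detection logic; `guess_col_sep` and `COL_SEP_CANDIDATES` are hypothetical names.

```ruby
# Illustration only: pick the candidate separator that occurs most often.
COL_SEP_CANDIDATES = [",", "\t", ";", ":", "|"].freeze

def guess_col_sep(line)
  sep, count = COL_SEP_CANDIDATES
               .map { |candidate| [candidate, line.count(candidate)] }
               .max_by { |_candidate, n| n }
  count.zero? ? nil : sep   # nil when no candidate occurs at all
end

p guess_col_sep("id;name;price")
```

A real detector also has to ignore separators inside quoted fields, which this sketch does not attempt.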
|
16
|
+
|
17
|
+
## Row Separator `row_sep`
|
18
|
+
|
19
|
+
The automatic detection of row separators considers: `\n`, `\r\n`, `\r`.
|
20
|
+
|
21
|
+
Some CSV files may contain an unusual row separator, which could even be a control character.
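
Row separator detection can be pictured the same way: count the line endings in a sample of the input, taking care not to count the two halves of `\r\n` separately. This is a naive sketch for illustration, not the library's algorithm; `guess_row_sep` is a hypothetical name.

```ruby
# Illustration only: guess the row separator from a sample of the input.
def guess_row_sep(sample)
  crlf = sample.scan("\r\n").size
  counts = {
    "\r\n" => crlf,
    "\n"   => sample.count("\n") - crlf,   # bare line feeds only
    "\r"   => sample.count("\r") - crlf    # bare carriage returns only
  }
  sep, count = counts.max_by { |_sep, n| n }
  count.zero? ? nil : sep
end

p guess_row_sep("a,b\r\n1,2\r\n")
```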
|
22
|
+
|
23
|
+
|
24
|
+
## Custom / Non-Standard CSV Formats
|
25
|
+
|
26
|
+
Besides custom values for `col_sep`, `row_sep`, some other customizations of CSV files are:
|
27
|
+
* the presence of a number of leading lines before the header or data section start.
|
28
|
+
* the presence of comment lines, e.g. lines starting with `#`
|
29
|
+
|
30
|
+
To explore these special cases, please use the following examples.
|
31
|
+
|
32
|
+
### Example 1: reading an iTunes DB dump
|
33
|
+
|
34
|
+
This data format uses CTRL-A as the column separator, and CTRL-B as the record separator. It also has comment lines that start with a `#` character. This also maps the header `name` to `genre`, and ignores the column `export_date`.
|
35
|
+
|
36
|
+
```ruby
|
37
|
+
filename = '/tmp/itunes_db_dump'
|
38
|
+
options = {
|
39
|
+
:col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
|
40
|
+
:chunk_size => 100 , :key_mapping => {export_date: nil, name: :genre},
|
41
|
+
}
|
42
|
+
n = SmarterCSV.process(filename, options) do |chunk|
|
43
|
+
SidekiqWorkerClass.process_async(chunk) # pass an array of hashes to Sidekiq workers for parallel processing
|
44
|
+
end
|
45
|
+
# => returns number of chunks
|
46
|
+
```
|
47
|
+
|
48
|
+
### Example 2: Reading a CSV-File with custom col_sep, row_sep
|
49
|
+
In this example we have an unusual CSV file with `|` as the row separator, and `#` as the column separator.
|
50
|
+
This unusual format needs explicit options `col_sep` and `row_sep`.
|
51
|
+
|
52
|
+
```ruby
|
53
|
+
filename = '/tmp/input_file.txt'
|
54
|
+
recordsA = SmarterCSV.process(filename, {col_sep: "#", row_sep: "|"})
|
55
|
+
|
56
|
+
# => returns an array of hashes
|
57
|
+
```
|
58
|
+
|
59
|
+
### Example 3:
|
60
|
+
In this example, we use `skip_lines: 3` to skip and ignore the first 3 lines in the input.
|
61
|
+
|
62
|
+
|
63
|
+
```ruby
|
64
|
+
filename = '/tmp/input_file.txt'
|
65
|
+
recordsA = SmarterCSV.process(filename, {skip_lines: 3})
|
66
|
+
|
67
|
+
# => returns an array of hashes
|
68
|
+
```
|
69
|
+
|
70
|
+
|
71
|
+
### Example 4: reading an iTunes DB dump
|
72
|
+
|
73
|
+
In this example, we use `comment_regexp` to filter out and ignore any lines starting with `#`
|
74
|
+
|
75
|
+
|
76
|
+
```ruby
|
77
|
+
# Consider a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
|
78
|
+
filename = '/tmp/strange_db_dump'
|
79
|
+
options = {
|
80
|
+
:col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
|
81
|
+
:chunk_size => 100 , :key_mapping => {:export_date => nil, :name => :genre},
|
82
|
+
}
|
83
|
+
n = SmarterCSV.process(filename, options) do |chunk|
|
84
|
+
SidekiqWorkerClass.process_async(chunk) # pass an array of hashes to Sidekiq workers for parallel processing
|
85
|
+
end
|
86
|
+
# => returns number of chunks
|
87
|
+
```
|
data/docs/value_converters.md
ADDED
@@ -0,0 +1,51 @@
|
|
1
|
+
|
2
|
+
# Using Value Converters
|
3
|
+
|
4
|
+
Value Converters allow you to do custom transformations on specific columns, to help you massage the data so it fits the expectations of your downstream process, such as creating a DB record.
|
5
|
+
|
6
|
+
If you use `key_mapping` and `value_converters`, make sure that the value converters reference the keys based on the final mapped name, not the original name in the CSV file.
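
The ordering matters because the mapping is applied before the converters are looked up. The following plain-Ruby model illustrates that order under simplifying assumptions; it does not call SmarterCSV, and the `purchased_on` key and sample row are made up for the example.

```ruby
require 'date'

# A converter only needs to respond to self.convert(value).
class DateConverter
  def self.convert(value)
    Date.strptime(value, '%m/%d/%Y')
  end
end

key_mapping      = { date: :purchased_on }          # original header => mapped key
value_converters = { purchased_on: DateConverter }  # keyed by the mapped name

row = { first: "Ben", date: "10/30/1998" }

# Step 1: rename keys. Step 2: convert values, looked up by the renamed key.
mapped = row.transform_keys { |key| key_mapping.fetch(key, key) }
result = mapped.to_h do |key, value|
  [key, value_converters.key?(key) ? value_converters[key].convert(value) : value]
end

p result
```

Had the converter been registered under `:date`, it would never be found in this model, because that key no longer exists after the mapping step.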
|
7
|
+
|
8
|
+
```ruby
|
9
|
+
$ cat spec/fixtures/with_dates.csv
|
10
|
+
first,last,date,price
|
11
|
+
Ben,Miller,10/30/1998,$44.50
|
12
|
+
Tom,Turner,2/1/2011,$15.99
|
13
|
+
Ken,Smith,01/09/2013,$199.99
|
14
|
+
|
15
|
+
$ irb
|
16
|
+
> require 'smarter_csv'
|
17
|
+
> require 'date'
|
18
|
+
|
19
|
+
# define a custom converter class, which implements self.convert(value)
|
20
|
+
class DateConverter
|
21
|
+
def self.convert(value)
|
22
|
+
Date.strptime( value, '%m/%d/%Y') # parses custom date format into Date instance
|
23
|
+
end
|
24
|
+
end
|
25
|
+
|
26
|
+
class DollarConverter
|
27
|
+
def self.convert(value)
|
28
|
+
value.sub('$','').to_f # strips the dollar sign and creates a Float value
|
29
|
+
end
|
30
|
+
end
|
31
|
+
|
32
|
+
require 'money'
|
33
|
+
class MoneyConverter
|
34
|
+
def self.convert(value)
|
35
|
+
# depending on locale you might want to also remove the indicator for thousands, e.g. comma
|
36
|
+
Money.from_amount(value.gsub(/[\s\$]/,'').to_f) # creates a Money instance (based on cents)
|
37
|
+
end
|
38
|
+
end
|
39
|
+
|
40
|
+
options = {:value_converters => {:date => DateConverter, :price => DollarConverter}}
|
41
|
+
data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
|
42
|
+
first_record = data.first
|
43
|
+
first_record[:date]
|
44
|
+
=> #<Date: 1998-10-30 ((2451117j,0s,0n),+0s,2299161j)>
|
45
|
+
first_record[:date].class
|
46
|
+
=> Date
|
47
|
+
first_record[:price]
|
48
|
+
=> 44.5
|
49
|
+
first_record[:price].class
|
50
|
+
=> Float
|
51
|
+
```
|