smarter_csv 1.11.2 → 1.12.0.pre1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 4a650fbccc5bf703199c6771da75ed95c87c9e3763e176b688a2daaa1b0c4669
-   data.tar.gz: 89cfff5bb92ee8faaeea7a644b61e63a7e47c61f30faefc515b602808eec0989
+   metadata.gz: a7573b21c853accca5035c8ba3b1db8f0d3a7fdddc961a535440adedcf0b6b82
+   data.tar.gz: 19fffb74289999f01210ad359ac349286a8b267df3910846bec81934f6cde333
  SHA512:
-   metadata.gz: 70ea1888be23a467d15001086935117252a30860ee5a418e8b61b61bfb19a03377391ffb28589db412b0aac4293ac029a047b268d0fa6836ec60f2319cabf102
-   data.tar.gz: 49d7d1d1c4e258611168056a0b8b7433dfedcf37eeac25c22c0aade552a878361d019b86156cbadcacc5996a43cae11d9d55990a0ef9c4a599cf447019cf8d7f
+   metadata.gz: 1f9bcb549185941fec0ee7a238df470a8bfdba7cc7ec007057afed0f9dfda8e7a298d1fcfe3bcb2911337827900ccb71df6bd65ed917aed322a3499d4cf3c3a9
+   data.tar.gz: 3350a2d318e351f5d5a192fa1aa0664ed0ba42910c5742c18e4a92cdcd145f0e27ddea928e968e834b50e9ea2906b4b4aa573939540e661b65e11acffa739c0b
data/.rubocop.yml CHANGED
@@ -25,7 +25,7 @@ Metrics/BlockNesting:
  Metrics/ClassLength:
    Enabled: false

- Metrics/CyclomaticComplexity: # BS rule
+ Metrics/CyclomaticComplexity:
    Enabled: false

  Metrics/MethodLength:
@@ -34,7 +34,7 @@ Metrics/MethodLength:
  Metrics/ModuleLength:
    Enabled: false

- Metrics/PerceivedComplexity: # BS rule
+ Metrics/PerceivedComplexity:
    Enabled: false

  Naming/PredicateName:
@@ -46,6 +46,9 @@ Naming/VariableName:
  Naming/VariableNumber:
    Enabled: false

+ Style/AccessorGrouping: # not needed
+   Enabled: false
+
  Style/ClassEqualityComparison:
    Enabled: false

@@ -88,6 +91,9 @@ Style/IfInsideElse:
  Style/IfUnlessModifier:
    Enabled: false

+ Style/InverseMethods:
+   Enabled: false
+
  Style/NestedTernaryOperator:
    Enabled: false

data/CHANGELOG.md CHANGED
@@ -1,7 +1,35 @@

  # SmarterCSV 1.x Change Log

- ## 1.11.2 (2024-07-05)
+ ## 1.12.0 (2024-07-08)
+ * added SmarterCSV::Reader to process CSV files ([issue #277](https://github.com/tilo/smarter_csv/pull/277))
+ * SmarterCSV::Writer changed default row separator to the system's row separator (`\n` on Linux, `\r\n` on Windows)
+ * added a lot of docs
+
+ * POTENTIAL ISSUE:
+
+   Version 1.12.x has a change of the underlying implementation of `SmarterCSV.process(file_or_input, options, &block)`.
+   Underneath it now uses this interface:
+   ```
+   reader = SmarterCSV::Reader.new(file_or_input, options)
+
+   # either simple one-liner:
+   data = reader.process
+
+   # or block format:
+   data = reader.process do
+     # do something here
+   end
+   ```
+   It still supports calling `SmarterCSV.process` for backwards-compatibility, but it no longer provides access to the internal state, e.g. raw_headers.
+
+   `SmarterCSV.raw_headers` -> `reader.raw_headers`
+   `SmarterCSV.headers` -> `reader.headers`
+
+   If you need these features, please update your code to create an instance of `SmarterCSV::Reader` as shown above.
+
+
+ ## 1.11.2 (2024-07-06)
  * fixing missing errors definition

  ## 1.11.1 (2024-07-05) (YANKED)
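The changelog entry above centers on SmarterCSV's array-of-hashes result format. For readers unfamiliar with that shape, it can be approximated with Ruby's stdlib `CSV` (a rough sketch for illustration only: `rows_as_hashes` is a hypothetical helper, not part of the gem, and SmarterCSV additionally auto-detects separators, chunks input, maps keys, and converts values):

```ruby
require "csv"

# Approximate SmarterCSV's output shape with stdlib CSV:
# one hash per row, headers symbolized, empty fields dropped.
def rows_as_hashes(csv_string)
  CSV.parse(csv_string, headers: true).map do |row|
    row.to_h
       .map { |k, v| [k.strip.downcase.tr(" ", "_").to_sym, v] } # normalize headers
       .reject { |_k, v| v.nil? || v.strip.empty? }              # drop nil/empty fields
       .to_h
  end
end

csv = <<~CSV
  first name,last name,dogs,cats
  Dan,McAllister,2,
  Lucy,Laweless,,5
CSV

p rows_as_hashes(csv)
# => [{:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
#     {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"}]
```

This hash-per-row shape is what makes the results directly usable with `Model.create`, Sidekiq arguments, or JSON serialization.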
data/README.md CHANGED
@@ -3,405 +3,21 @@

  [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)

- SmarterCSV provides a complete interface to CSV files and data. It offers tools to enable you to read and write to and from Strings or IO objects, as needed.
+ SmarterCSV provides a convenient interface for reading and writing CSV files and data.

- SmarterCSV focuses on representing the data for each row as a hash.
+ Unlike traditional CSV parsing methods, SmarterCSV focuses on representing the data for each row as a Ruby hash, which lends itself perfectly for direct use with ActiveRecord, Sidekiq, and JSON stores such as S3. For large files it supports processing CSV data in chunks of array-of-hashes, which allows parallel or batch processing of the data.

- When reading CSV files, Using an array-of-hashes format makes it much easier to further process the data, or creating database records with it.
+ Its powerful interface is designed to simplify and optimize the process of handling CSV data, and allows for highly customizable and efficient data processing by enabling the user to easily map CSV headers to Hash keys, skip unwanted rows, and transform data on-the-fly.

- When writing CSV data to file, it similarly takes arrays of hashes, and converts them to a CSV file.
+ This results in a more readable, maintainable, and performant codebase. Whether you're dealing with large datasets or complex data transformations, SmarterCSV streamlines CSV operations, making it an invaluable tool for developers seeking to enhance their data processing workflows.

- #### BREAKING CHANGES
+ When writing CSV data to file, it similarly takes arrays of hashes, and converts them to a CSV file.

- * Version 1.10.0 had BREAKING CHANGES:
+ One user wrote:

- Changed behavior:
- + when `user_provided_headers` are provided:
-   * if they are not unique, an exception will now be raised
-   * they are taken "as is", no header transformations can be applied
-   * when they are given as strings or as symbols, it is assumed that this is the desired format
-   * the value of the `strings_as_keys` options will be ignored
-
- + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
-   * this allows automatic disambiguation when processing of CSV files with duplicate headers, by appending a number
-   * explicitly set this option to `nil` to get the behavior from previous versions.
+ > *Best gem for CSV for us yet. [...] taking an import process from 7+ hours to about 3 minutes. [...] Smarter CSV was a big part and helped clean up our code ALOT*

- #### Development Branches
-
- * default branch is `main` for 1.x development
-
- * 2.x development is [MOVED TO THIS PR](https://github.com/tilo/smarter_csv/pull/267)
-   - 2.x behavior is still EXPERIMENTAL - DO NOT USE in production
-
- ---------------
-
- #### SmarterCSV 1.x [Current Version]
-
- `smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or oploading data to S3.
-
- The goals for SmarterCSV are:
- * ease of use for handling most common CSV files without having to tweak options
- * improve robustness of your code when you have no control over the quality of the CSV files which are processed
- * formatting each row of data as a hash, in order to allow easy processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or oploading data to S3.
-
- #### Rescue from Exceptions
- While SmarterCSV uses sensible defaults to process the most common CSV files, it will raise exceptions if it can not auto-detect `col_sep`, `row_sep`, or if it encounters other problems. Therefore, when calling `SmarterCSV.process`, please rescue from `SmarterCSVException`, and handle outliers according to your requirements.
-
- If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accomodate for unusual formats.
-
- #### Features
-
- One `smarter_csv` user wrote:
-
- *Best gem for CSV for us yet. [...] taking an import process from 7+ hours to about 3 minutes.
- [...] Smarter CSV was a big part and helped clean up our code ALOT*
-
- `smarter_csv` has lots of features:
- * able to process large CSV-files
- * able to chunk the input from the CSV file to avoid loading the whole CSV file into memory
- * return a Hash for each line of the CSV file, so we can quickly use the results for either creating MongoDB or ActiveRecord entries, or further processing with Resque
- * able to pass a block to the `process` method, so data from the CSV file can be directly processed (e.g. Resque.enqueue )
- * allows to have a bit more flexible input format, where comments are possible, and col_sep,row_sep can be set to any character sequence, including control characters.
- * able to re-map CSV "column names" to Hash-keys of your choice (normalization)
- * able to ignore "columns" in the input (delete columns)
- * able to eliminate nil or empty fields from the result hashes (default)
-
- #### Assumptions / Limitations
- * It is assumed that the escape character is `\`, as on UNIX and Windows systems.
- * It is assumed that quote charcters around fields are balanced, e.g. valid: `"field"`, invalid: `"field\"`
-   e.g. an escaped `quote_char` does not denote the end of a field.
- * This Gem is only for importing CSV files - writing of CSV files is not supported at this time.
-
- ### Why?
-
- Ruby's CSV library's API is pretty old, and it's processing of CSV-files returning Arrays of Arrays feels 'very close to the metal'. The output is not easy to use - especially not if you want to create database records or Sidekiq jobs with it. Another shortcoming is that Ruby's CSV library does not have good support for huge CSV-files, e.g. there is no support for 'chunking' and/or parallel processing of the CSV-content (e.g. with Sidekiq).
-
- As the existing CSV libraries didn't fit my needs, I was writing my own CSV processing - specifically for use in connection with Rails ORMs like Mongoid, MongoMapper and ActiveRecord. In those ORMs you can easily pass a hash with attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept larger arrays of such hashes to create a larger amount of records quickly with just one call. The same patterns are used when you pass data to Sidekiq jobs.
-
- For processing large CSV files it is essential to process them in chunks, so the memory impact is minimized.
-
- ### How?
-
- The two main choices you have in terms of how to call `SmarterCSV.process` are:
- * calling `process` with or without a block
- * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
-
- By default (since version 1.8.0), detection of the column and row separators is set to automatic `row_sep: :auto`, `col_sep: :auto`. This should make it easier to process any CSV files without having to examine the line endings or column separators.
-
- You can change the setting `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500 characters); nil or 0 will check the whole file).
- You can also set the `:row_sep` manually! Checkout Example 4 for unusual `:row_sep` and `:col_sep`.
-
- ### Troubleshooting
-
- In case your CSV file is not being parsed correctly, try to examine it in a text editor. For closer inspection a tool like `hexdump` can help find otherwise hidden control character or byte sequences like [BOMs](https://en.wikipedia.org/wiki/Byte_order_mark).
-
- ```
- $ hexdump -C spec/fixtures/bom_test_feff.csv
- 00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
- 00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
- 00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
- 00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
- 00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
- ```
-
- ### Articles
- * [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
- * [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
-
- ### Examples
-
- Here are some examples to demonstrate the versatility of SmarterCSV.
-
- **It is generally recommended to rescue `SmarterCSVException` or it's sub-classes.**
-
- By default SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception will be raised, e.g. `NoColSepDetected`. Rescuing from these exceptions will make sure that you don't miss processing CSV files, in case users upload CSV files with unexpected formats.
-
- In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
-
- #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
- Please note how each hash contains only the keys for columns with non-null values.
-
- ```ruby
- $ cat pets.csv
- first name,last name,dogs,cats,birds,fish
- Dan,McAllister,2,,,
- Lucy,Laweless,,5,,
- Miles,O'Brian,,,,21
- Nancy,Homes,2,,1,
- $ irb
- > require 'smarter_csv'
- => true
- > pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
- => [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
-      {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"},
-      {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"},
-      {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"}
-    ]
- ```
-
-
- #### Example 1b: How SmarterCSV processes CSV-files as chunks, returning arrays of hashes:
- Please note how the returned array contains two sub-arrays containing the chunks which were read, each chunk containing 2 hashes.
- In case the number of rows is not cleanly divisible by `:chunk_size`, the last chunk contains fewer hashes.
-
- ```ruby
- > pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
- => [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>"2"}, {:first=>"Lucy", :last=>"Laweless", :cats=>"5"} ],
-      [ {:first=>"Miles", :last=>"O'Brian", :fish=>"21"}, {:first=>"Nancy", :last=>"Homes", :dogs=>"2", :birds=>"1"} ]
-    ]
- ```
-
- #### Example 1c: How SmarterCSV processes CSV-files as chunks, and passes arrays of hashes to a given block:
- Please note how the given block is passed the data for each chunk as the parameter (array of hashes),
- and how the `process` method returns the number of chunks when called with a block
-
- ```ruby
- > total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
-     chunk.each do |h| # you can post-process the data from each row to your heart's content, and also create virtual attributes:
-       h[:full_name] = [h[:first],h[:last]].join(' ') # create a virtual attribute
-       h.delete(:first) ; h.delete(:last) # remove two keys
-     end
-     puts chunk.inspect # we could at this point pass the chunk to a Resque worker..
-   end
-
-   [{:dogs=>"2", :full_name=>"Dan McAllister"}, {:cats=>"5", :full_name=>"Lucy Laweless"}]
-   [{:fish=>"21", :full_name=>"Miles O'Brian"}, {:dogs=>"2", :birds=>"1", :full_name=>"Nancy Homes"}]
- => 2
- ```
- #### Example 2: Reading a CSV-File in one Chunk, returning one Array of Hashes:
- ```ruby
- filename = '/tmp/input_file.txt' # TAB delimited file, each row ending with Control-M
- recordsA = SmarterCSV.process(filename, {:col_sep => "\t", :row_sep => "\cM"}) # no block given
-
- => returns an array of hashes
- ```
- #### Example 3: Populate a MySQL or MongoDB Database with SmarterCSV:
- ```ruby
- # without using chunks:
- filename = '/tmp/some.csv'
- options = {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
- n = SmarterCSV.process(filename, options) do |array|
-   # we're passing a block in, to process each resulting hash / =row (the block takes array of hashes)
-   # when chunking is not enabled, there is only one hash in each array
-   MyModel.create( array.first )
- end
-
- => returns number of chunks / rows we processed
- ```
-
- #### Example 4: Processing a CSV File, and inserting batch jobs in Sidekiq:
- ```ruby
- filename = '/tmp/input.csv' # CSV file containing ids or data to process
- options = { :chunk_size => 100 }
- n = SmarterCSV.process(filename, options) do |chunk|
-   Sidekiq::Client.push_bulk(
-     'class' => SidekiqIndividualWorkerClass,
-     'args' => chunk,
-   )
-   # OR:
-   # SidekiqBatchWorkerClass.process_async(chunk ) # pass an array of hashes to Sidekiq workers for parallel processing
- end
- => returns number of chunks
- ```
-
- #### Example 4b: Reading a CSV-like File, and Processing it with Sidekiq:
- ```ruby
- filename = '/tmp/strange_db_dump' # a file with CRTL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
- options = {
-   :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
-   :chunk_size => 100 , :key_mapping => {:export_date => nil, :name => :genre}
- }
- n = SmarterCSV.process(filename, options) do |chunk|
-   SidekiqWorkerClass.process_async(chunk ) # pass an array of hashes to Sidekiq workers for parallel processing
- end
- => returns number of chunks
- ```
- #### Example 5: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
- ```ruby
- # using chunks:
- filename = '/tmp/some.csv'
- options = {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
- n = SmarterCSV.process(filename, options) do |chunk|
-   # we're passing a block in, to process each resulting hash / row (block takes array of hashes)
-   # when chunking is enabled, there are up to :chunk_size hashes in each chunk
-   MyModel.collection.insert( chunk ) # insert up to 100 records at a time
- end
-
- => returns number of chunks we processed
- ```
-
- #### Example 6: Using Value Converters
-
- NOTE: If you use `key_mappings` and `value_converters`, make sure that the value converters has references the keys based on the final mapped name, not the original name in the CSV file.
- ```ruby
- $ cat spec/fixtures/with_dates.csv
- first,last,date,price
- Ben,Miller,10/30/1998,$44.50
- Tom,Turner,2/1/2011,$15.99
- Ken,Smith,01/09/2013,$199.99
- $ irb
- > require 'smarter_csv'
- > require 'date'
-
- # define a custom converter class, which implements self.convert(value)
- class DateConverter
-   def self.convert(value)
-     Date.strptime( value, '%m/%d/%Y') # parses custom date format into Date instance
-   end
- end
-
- class DollarConverter
-   def self.convert(value)
-     value.sub('$','').to_f
-   end
- end
-
- options = {:value_converters => {:date => DateConverter, :price => DollarConverter}}
- data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
- data[0][:date]
- => #<Date: 1998-10-30 ((2451117j,0s,0n),+0s,2299161j)>
- data[0][:date].class
- => Date
- data[0][:price]
- => 44.50
- data[0][:price].class
- => Float
- ```
-
- ## Documentation
-
- The `process` method reads and processes a "generalized" CSV file and returns the contents either as an Array of Hashes,
- or an Array of Arrays, which contain Hashes, or processes Chunks of Hashes via a given block.
-
- SmarterCSV.process(filename, options={}, &block)
-
- The options and the block are optional.
-
- `SmarterCSV.process` supports the following options:
-
- #### Options:
-
- | Option | Default | Explanation |
- ---------------------------------------------------------------------------------------------------------------------------------
- | :chunk_size | nil | if set, determines the desired chunk-size (defaults to nil, no chunk processing) |
- | | | |
- | :file_encoding | utf-8 | Set the file encoding eg.: 'windows-1252' or 'iso-8859-1' |
- | :invalid_byte_sequence | '' | what to replace invalid byte sequences with |
- | :force_utf8 | false | force UTF-8 encoding of all lines (including headers) in the CSV file |
- | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
- | :comment_regexp | nil | regular expression to ignore comment lines (see NOTE on CSV header), e.g./\A#/ |
- ---------------------------------------------------------------------------------------------------------------------------------
- | :col_sep | :auto | column separator (default was ',') |
- | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
- | | | e.g. when :quote_char is not properly escaped |
- | :row_sep | :auto | row separator or record separator (previous default was system's $/ , which defaulted to "\n") |
- | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
- | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
- | :quote_char | '"' | quotation character |
- ---------------------------------------------------------------------------------------------------------------------------------
- | :headers_in_file | true | Whether or not the file contains headers as the first line. |
- | | | Important if the file does not contain headers, |
- | | | otherwise you would lose the first line of data. |
- | :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
- | | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
- | :user_provided_headers | nil | *careful with that axe!* |
- | | | user provided Array of header strings or symbols, to define |
- | | | what headers should be used, overriding any in-file headers. |
- | | | You can not combine the :user_provided_headers and :key_mapping options |
- | :remove_empty_hashes | true | remove / ignore any hashes which don't have any key/value pairs or all empty values |
- | :verbose | false | print out line number while processing (to track down problems in input files) |
- | :with_line_numbers | false | add :csv_line_number to each data hash |
- ---------------------------------------------------------------------------------------------------------------------------------
-
- #### Deprecated 1.x Options: to be replaced in 2.0
-
- There have been a lot of 1-offs and feature creep around these options, and going forward we'll have a simpler, but more flexible way to address these features.
-
- Instead of these options, there will be a new and more flexible way to process the header fields, as well as the fields in each line of the CSV.
- And header and data validations will also be supported in 2.x
-
- | Option | Default | Explanation |
- ---------------------------------------------------------------------------------------------------------------------------------
- | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
- | :silence_missing_keys | false | ignore missing keys in `key_mapping` |
- | | | if set to true: makes all mapped keys optional |
- | | | if given an array, makes only the keys listed in it optional |
- | :required_keys | nil | An array. Specify the required names AFTER header transformation. |
- | :required_headers | nil | (DEPRECATED / renamed) Use `required_keys` instead |
- | | | or an exception is raised No validation if nil is given. |
- | :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
- | :downcase_header | true | downcase all column headers |
- | :strings_as_keys | false | use strings instead of symbols as the keys in the result hashes |
- | :strip_whitespace | true | remove whitespace before/after values and headers |
- | :keep_original_headers | false | keep the original headers from the CSV-file as-is. |
- | | | Disables other flags manipulating the header fields. |
- | :strip_chars_from_headers | nil | RegExp to remove extraneous characters from the header line (e.g. if headers are quoted) |
- ---------------------------------------------------------------------------------------------------------------------------------
- | :value_converters | nil | supply a hash of :header => KlassName; the class needs to implement self.convert(val)|
- | :remove_empty_values | true | remove values which have nil or empty strings as values |
- | :remove_zero_values | false | remove values which have a numeric value equal to zero / 0 |
- | :remove_values_matching | nil | removes key/value pairs if value matches given regular expressions. e.g.: |
- | | | /^\$0\.0+$/ to match $0.00 , or /^#VALUE!$/ to match errors in Excel spreadsheets |
- | :convert_values_to_numeric | true | converts strings containing Integers or Floats to the appropriate class |
- | | | also accepts either {:except => [:key1,:key2]} or {:only => :key3} |
- ---------------------------------------------------------------------------------------------------------------------------------
-
-
- #### NOTES about File Encodings:
- * if you have a CSV file which contains unicode characters, you can process it as follows:
-
- ```ruby
- File.open(filename, "r:bom|utf-8") do |f|
-   data = SmarterCSV.process(f);
- end
- ```
- * if the CSV file with unicode characters is in a remote location, similarly you need to give the encoding as an option to the `open` call:
- ```ruby
- require 'open-uri'
- file_location = 'http://your.remote.org/sample.csv'
- open(file_location, 'r:utf-8') do |f| # don't forget to specify the UTF-8 encoding!!
-   data = SmarterCSV.process(f)
- end
- ```
-
- #### NOTES about CSV Headers:
- * as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
- * the first line with the header might be commented out, in which case you will need to set `comment_regexp: /\A#/`
-   This is no longer handled automatically since 1.5.0.
- * any occurences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
- * any of the keys in the header line will be downcased, spaces replaced by underscore, and converted to Ruby symbols before being used as keys in the returned Hashes
- * you can not combine the :user_provided_headers and :key_mapping options
- * if the incorrect number of headers are provided via :user_provided_headers, exception SmarterCSV::HeaderSizeMismatch is raised
-
- #### NOTES on Duplicate Headers:
- As a corner case, it is possible that a CSV file contains multiple headers with the same name.
- * If that happens, by default `smarter_csv` will raise a `DuplicateHeaders` error.
- * If you set `duplicate_header_suffix` to a non-nil string, it will use it to append numbers 2..n to the duplicate headers. To further disambiguate the headers, you can further use `key_mapping` to assign meaningful names.
- * If your code will need to process arbitrary CSV files, please set `duplicate_header_suffix`.
- * Another way to deal with duplicate headers it to use `user_assigned_headers` to ignore any headers in the file.
-
- #### NOTES on Key Mapping:
- * keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
- * if you want to completely delete a key, then map it to nil or to '', they will be automatically deleted from any result Hash
- * if you have input files with a large number of columns, and you want to ignore all columns which are not specifically mapped with :key_mapping, then use option :remove_unmapped_keys => true
-
- #### NOTES on the use of Chunking and Blocks:
- * chunking can be VERY USEFUL if used in combination with passing a block to File.read_csv FOR LARGE FILES
- * if you pass a block to File.read_csv, that block will be executed and given an Array of Hashes as the parameter.
- * if the chunk_size is not set, then the array will only contain one Hash.
- * if the chunk_size is > 0 , then the array may contain up to chunk_size Hashes.
- * this can be very useful when passing chunked data to a post-processing step, e.g. through Resque
-
- #### NOTES on improper quotation and unwanted characters in headers:
- * some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, use the `:force_simple_split => true` option in combination with `:strip_chars_from_headers => /[\-"]/` . This will also significantly speed up the import.
-   If you would force a different :quote_char instead (setting it to a non-used character), then the import would be up to 5-times slower than using `:force_simple_split`.
-
- ## The original post that started SmarterCSV:
-
- http://www.unixgods.org/Ruby/process_csv_as_hashes.html
-
-
- ## Installation
+ # Installation

  Add this line to your application's Gemfile:
  ```ruby
@@ -415,9 +31,28 @@ Or install it yourself as:
  ```ruby
  $ gem install smarter_csv
  ```
- ## [ChangeLog](./CHANGELOG.md)

- ## Reporting Bugs / Feature Requests
+ # Documentation
+
+ * [Introduction](docs/_introduction.md)
+ * [The Basic API](docs/basic_api.md)
+ * [Configuration Options](docs/options.md)
+ * [Row and Column Separators](docs/row_col_sep.md)
+ * [Header Transformations](docs/header_transformations.md)
+ * [Header Validations](docs/header_validations.md)
+ * [Data Transformations](docs/data_transformations.md)
+ * [Value Converters](docs/value_converters.md)
+
+ * [Notes](docs/notes.md) <--- this info needs to be moved to individual pages
+
+ # Articles
+ * [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
+ * [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
+ * [The original post](http://www.unixgods.org/Ruby/process_csv_as_hashes.html) that started SmarterCSV
+
+ # [ChangeLog](./CHANGELOG.md)
+
+ # Reporting Bugs / Feature Requests

  Please [open an Issue on GitHub](https://github.com/tilo/smarter_csv/issues) if you have feedback, new feature requests, or want to report a bug. Thank you!

@@ -426,10 +61,10 @@ For reporting issues, please:
  * open a pull-request adding a test that demonstrates the issue
  * mention your version of SmarterCSV, Ruby, Rails

- ## [A Special Thanks to all Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
+ # [A Special Thanks to all Contributors!](CONTRIBUTORS.md) 🎉🎉🎉


- ## Contributing
+ # Contributing

  1. Fork it
  2. Create your feature branch (`git checkout -b my-new-feature`)
data/docs/_introduction.md ADDED
@@ -0,0 +1,40 @@
+
+ # SmarterCSV Introduction
+
+ `smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or uploading data to S3.
+
+
+ ## Why another CSV library?
+
+ Ruby's original 'csv' library's API is pretty old, and its processing of CSV-files returning an array-of-array format feels unnecessarily 'close to the metal'. Its output is not easy to use - especially not if you need a data hash to create database records, or JSON from it, or pass it to Sidekiq or S3. Another shortcoming is that Ruby's 'csv' library does not have good support for huge CSV-files, e.g. there is no support for batching and/or parallel processing of the CSV-content (e.g. with Sidekiq jobs).
+
+ When SmarterCSV was envisioned, I needed to do nightly imports of very large data sets that came in CSV format, that needed to be upserted into a database, and because of the sheer volume of data needed to be processed in parallel.
+ The CSV processing also needed to be robust against variations in the input data.
+
+ ## Benefits of using SmarterCSV
+
+ * Improved Robustness:
+   Typically you have little control over the data quality of CSV files that need to be imported. Because SmarterCSV has intelligent defaults and auto-detection of typical formats, this improves the robustness of your CSV imports without having to manually tweak options.
+
+ * Easy-to-use Format:
+   By using a Ruby hash to represent a CSV row, SmarterCSV allows you to directly use this data and insert it into a database, or use it with Sidekiq, S3, message queues, etc
+
+ * Normalized Headers:
+   SmarterCSV automatically transforms CSV headers to Ruby symbols, stripping leading or trailing whitespace.
+   There are many ways to customize the header transformation to your liking. You can re-map CSV headers to hash keys, and you can ignore CSV columns.
+
+ * Normalized Data:
+   SmarterCSV transforms the data in each CSV row automatically, stripping whitespace, converting numerical data into numbers, ignoring nil or empty fields, and more. There are many ways to customize this. You can even add your own value converters.
+
+ * Batch Processing of large CSV files:
+   Processing large CSV files in chunks reduces the memory impact and allows for faster / parallel processing.
+   By adding the option `chunk_size: numeric_value`, you can switch to batch processing. SmarterCSV will then return arrays-of-hashes. This makes parallel processing easy: you can pass whole chunks of data to Sidekiq, bulk-insert into a DB, or pass it to other data sinks.
+
+ ## Additional Features
+
+ * Header Validation:
+   You can validate that a set of hash keys is present in each record after header transformations are applied.
+   This can help ensure importing data with consistent quality.
+
+ * Data Validations
+   (planned feature)
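The batch-processing benefit described in the added introduction can be sketched with only stdlib Ruby (an illustration of the idea, not gem code: `process_in_chunks` is a hypothetical helper; SmarterCSV implements this internally via its `chunk_size` option):

```ruby
require "csv"

# Parse rows into symbol-keyed hashes, then hand them to a consumer
# in fixed-size batches, returning the number of chunks processed --
# mirroring the shape of SmarterCSV's chunked processing.
def process_in_chunks(csv_string, chunk_size)
  rows = CSV.parse(csv_string, headers: true).map do |row|
    row.to_h.transform_keys { |k| k.strip.downcase.tr(" ", "_").to_sym }
  end
  chunks = 0
  rows.each_slice(chunk_size) do |chunk|
    yield chunk   # e.g. bulk-insert into a DB, or enqueue a background job
    chunks += 1
  end
  chunks          # like SmarterCSV.process with a block, return the chunk count
end

csv = "id,name\n1,a\n2,b\n3,c\n"
n = process_in_chunks(csv, 2) { |chunk| p chunk }
# three rows with chunk_size 2 yields chunks of size 2 and 1, so n == 2
```

Because each chunk is an independent array-of-hashes, the per-chunk work can be fanned out to parallel workers without sharing state.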