smarter_csv 1.1.5 → 1.12.1

Files changed (95)
  1. checksums.yaml +5 -5
  2. data/.rspec +1 -2
  3. data/.rubocop.yml +154 -0
  4. data/CHANGELOG.md +364 -0
  5. data/CONTRIBUTORS.md +56 -0
  6. data/Gemfile +7 -2
  7. data/LICENSE.txt +21 -0
  8. data/README.md +44 -441
  9. data/Rakefile +39 -19
  10. data/TO_DO_v2.md +14 -0
  11. data/docs/_introduction.md +56 -0
  12. data/docs/basic_api.md +157 -0
  13. data/docs/batch_processing.md +68 -0
  14. data/docs/data_transformations.md +50 -0
  15. data/docs/examples.md +75 -0
  16. data/docs/header_transformations.md +113 -0
  17. data/docs/header_validations.md +36 -0
  18. data/docs/options.md +98 -0
  19. data/docs/row_col_sep.md +104 -0
  20. data/docs/value_converters.md +68 -0
  21. data/ext/smarter_csv/extconf.rb +14 -0
  22. data/ext/smarter_csv/smarter_csv.c +97 -0
  23. data/lib/smarter_csv/auto_detection.rb +78 -0
  24. data/lib/smarter_csv/errors.rb +16 -0
  25. data/lib/smarter_csv/file_io.rb +50 -0
  26. data/lib/smarter_csv/hash_transformations.rb +91 -0
  27. data/lib/smarter_csv/header_transformations.rb +63 -0
  28. data/lib/smarter_csv/header_validations.rb +34 -0
  29. data/lib/smarter_csv/headers.rb +68 -0
  30. data/lib/smarter_csv/options.rb +95 -0
  31. data/lib/smarter_csv/parser.rb +90 -0
  32. data/lib/smarter_csv/reader.rb +243 -0
  33. data/lib/smarter_csv/version.rb +3 -1
  34. data/lib/smarter_csv/writer.rb +116 -0
  35. data/lib/smarter_csv.rb +91 -3
  36. data/smarter_csv.gemspec +43 -20
  37. metadata +122 -137
  38. data/.gitignore +0 -8
  39. data/.travis.yml +0 -19
  40. data/lib/extensions/hash.rb +0 -7
  41. data/lib/smarter_csv/smarter_csv.rb +0 -281
  42. data/spec/fixtures/basic.csv +0 -8
  43. data/spec/fixtures/binary.csv +0 -1
  44. data/spec/fixtures/carriage_returns_n.csv +0 -18
  45. data/spec/fixtures/carriage_returns_quoted.csv +0 -3
  46. data/spec/fixtures/carriage_returns_r.csv +0 -1
  47. data/spec/fixtures/carriage_returns_rn.csv +0 -18
  48. data/spec/fixtures/chunk_cornercase.csv +0 -10
  49. data/spec/fixtures/empty.csv +0 -5
  50. data/spec/fixtures/line_endings_n.csv +0 -4
  51. data/spec/fixtures/line_endings_r.csv +0 -1
  52. data/spec/fixtures/line_endings_rn.csv +0 -4
  53. data/spec/fixtures/lots_of_columns.csv +0 -2
  54. data/spec/fixtures/malformed.csv +0 -3
  55. data/spec/fixtures/malformed_header.csv +0 -3
  56. data/spec/fixtures/money.csv +0 -3
  57. data/spec/fixtures/no_header.csv +0 -7
  58. data/spec/fixtures/numeric.csv +0 -5
  59. data/spec/fixtures/pets.csv +0 -5
  60. data/spec/fixtures/quoted.csv +0 -5
  61. data/spec/fixtures/separator.csv +0 -4
  62. data/spec/fixtures/skip_lines.csv +0 -8
  63. data/spec/fixtures/valid_unicode.csv +0 -5
  64. data/spec/fixtures/with_dashes.csv +0 -8
  65. data/spec/fixtures/with_dates.csv +0 -4
  66. data/spec/smarter_csv/binary_file2_spec.rb +0 -24
  67. data/spec/smarter_csv/binary_file_spec.rb +0 -22
  68. data/spec/smarter_csv/carriage_return_spec.rb +0 -170
  69. data/spec/smarter_csv/chunked_reading_spec.rb +0 -14
  70. data/spec/smarter_csv/close_file_spec.rb +0 -15
  71. data/spec/smarter_csv/column_separator_spec.rb +0 -11
  72. data/spec/smarter_csv/convert_values_to_numeric_spec.rb +0 -48
  73. data/spec/smarter_csv/extenstions_spec.rb +0 -17
  74. data/spec/smarter_csv/header_transformation_spec.rb +0 -21
  75. data/spec/smarter_csv/keep_headers_spec.rb +0 -24
  76. data/spec/smarter_csv/key_mapping_spec.rb +0 -25
  77. data/spec/smarter_csv/line_ending_spec.rb +0 -43
  78. data/spec/smarter_csv/load_basic_spec.rb +0 -20
  79. data/spec/smarter_csv/malformed_spec.rb +0 -21
  80. data/spec/smarter_csv/no_header_spec.rb +0 -24
  81. data/spec/smarter_csv/not_downcase_header_spec.rb +0 -24
  82. data/spec/smarter_csv/quoted_spec.rb +0 -23
  83. data/spec/smarter_csv/remove_empty_values_spec.rb +0 -13
  84. data/spec/smarter_csv/remove_keys_from_hashes_spec.rb +0 -25
  85. data/spec/smarter_csv/remove_not_mapped_keys_spec.rb +0 -35
  86. data/spec/smarter_csv/remove_values_matching_spec.rb +0 -26
  87. data/spec/smarter_csv/remove_zero_values_spec.rb +0 -25
  88. data/spec/smarter_csv/skip_lines_spec.rb +0 -29
  89. data/spec/smarter_csv/strings_as_keys_spec.rb +0 -24
  90. data/spec/smarter_csv/strip_chars_from_headers_spec.rb +0 -24
  91. data/spec/smarter_csv/valid_unicode_spec.rb +0 -94
  92. data/spec/smarter_csv/value_converters_spec.rb +0 -52
  93. data/spec/spec/spec_helper.rb +0 -17
  94. data/spec/spec.opts +0 -2
  95. data/spec/spec_helper.rb +0 -21
data/README.md CHANGED
@@ -1,467 +1,70 @@
- # SmarterCSV
-
- [![Build Status](https://secure.travis-ci.org/tilo/smarter_csv.svg?branch=master)](http://travis-ci.org/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
-
- `smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with Mongoid or ActiveRecord,
- and parallel processing with Resque or Sidekiq.
-
- One `smarter_csv` user wrote:
-
- *Best gem for CSV for us yet. [...] taking an import process from 7+ hours to about 3 minutes.
- [...] Smarter CSV was a big part and helped clean up our code ALOT*
-
- `smarter_csv` has lots of features:
- * able to process large CSV-files
- * able to chunk the input from the CSV file to avoid loading the whole CSV file into memory
- * return a Hash for each line of the CSV file, so we can quickly use the results for either creating MongoDB or ActiveRecord entries, or further processing with Resque
- * able to pass a block to the `process` method, so data from the CSV file can be directly processed (e.g. Resque.enqueue )
- * allows to have a bit more flexible input format, where comments are possible, and col_sep,row_sep can be set to any character sequence, including control characters.
- * able to re-map CSV "column names" to Hash-keys of your choice (normalization)
- * able to ignore "columns" in the input (delete columns)
- * able to eliminate nil or empty fields from the result hashes (default)
-
- NOTE; This Gem is only for importing CSV files - writing of CSV files is not supported at this time.
-
- ### Why?
-
- Ruby's CSV library's API is pretty old, and it's processing of CSV-files returning Arrays of Arrays feels 'very close to the metal'. The output is not easy to use - especially not if you want to create database records from it. Another shortcoming is that Ruby's CSV library does not have good support for huge CSV-files, e.g. there is no support for 'chunking' and/or parallel processing of the CSV-content (e.g. with Resque or Sidekiq),
-
- As the existing CSV libraries didn't fit my needs, I was writing my own CSV processing - specifically for use in connection with Rails ORMs like Mongoid, MongoMapper or ActiveRecord. In those ORMs you can easily pass a hash with attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept larger arrays of such hashes to create a larger amount of records quickly with just one call.
-
- ### Examples
-
- The two main choices you have in terms of how to call `SmarterCSV.process` are:
- * calling `process` with or without a block
- * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
-
- Tip: If you are uncertain about what line endings a CSV-file uses, try specifying `:row_sep => :auto` as part of the options.
- But this could be slow if we would analyze the whole CSV file first (previous to 1.1.5 the whole file was analyzed).
- To speed things up, you can setting the option `:auto_row_sep_chars` to only analyze the first N characters of the file (default is 500; nil or 0 will check the whole file).
- You can also set the `:row_sep` manually! Checkout Example 5 for unusual `:row_sep` and `:col_sep`.
-
-
- #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
- Please note how each hash contains only the keys for columns with non-null values.
-
- $ cat pets.csv
- first name,last name,dogs,cats,birds,fish
- Dan,McAllister,2,,,
- Lucy,Laweless,,5,,
- Miles,O'Brian,,,,21
- Nancy,Homes,2,,1,
- $ irb
- > require 'smarter_csv'
- => true
- > pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
- => [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
- {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"},
- {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"},
- {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"}
- ]
-
-
- #### Example 1b: How SmarterCSV processes CSV-files as chunks, returning arrays of hashes:
- Please note how the returned array contains two sub-arrays containing the chunks which were read, each chunk containing 2 hashes.
- In case the number of rows is not cleanly divisible by `:chunk_size`, the last chunk contains fewer hashes.
-
- > pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
- => [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>"2"}, {:first=>"Lucy", :last=>"Laweless", :cats=>"5"} ],
- [ {:first=>"Miles", :last=>"O'Brian", :fish=>"21"}, {:first=>"Nancy", :last=>"Homes", :dogs=>"2", :birds=>"1"} ]
- ]
-
- #### Example 1c: How SmarterCSV processes CSV-files as chunks, and passes arrays of hashes to a given block:
- Please note how the given block is passed the data for each chunk as the parameter (array of hashes),
- and how the `process` method returns the number of chunks when called with a block
-
- > total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
- chunk.each do |h| # you can post-process the data from each row to your heart's content, and also create virtual attributes:
- h[:full_name] = [h[:first],h[:last]].join(' ') # create a virtual attribute
- h.delete(:first) ; h.delete(:last) # remove two keys
- end
- puts chunk.inspect # we could at this point pass the chunk to a Resque worker..
- end
-
- [{:dogs=>"2", :full_name=>"Dan McAllister"}, {:cats=>"5", :full_name=>"Lucy Laweless"}]
- [{:fish=>"21", :full_name=>"Miles O'Brian"}, {:dogs=>"2", :birds=>"1", :full_name=>"Nancy Homes"}]
- => 2
-
- #### Example 2: Reading a CSV-File in one Chunk, returning one Array of Hashes:
-
- filename = '/tmp/input_file.txt' # TAB delimited file, each row ending with Control-M
- recordsA = SmarterCSV.process(filename, {:col_sep => "\t", :row_sep => "\cM"}) # no block given
-
- => returns an array of hashes
-
- #### Example 3: Populate a MySQL or MongoDB Database with SmarterCSV:
-
- # without using chunks:
- filename = '/tmp/some.csv'
- options = {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
- n = SmarterCSV.process(filename, options) do |array|
- # we're passing a block in, to process each resulting hash / =row (the block takes array of hashes)
- # when chunking is not enabled, there is only one hash in each array
- MyModel.create( array.first )
- end
-
- => returns number of chunks / rows we processed
-
- #### Example 4: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
-
- # using chunks:
- filename = '/tmp/some.csv'
- options = {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
- n = SmarterCSV.process(filename, options) do |chunk|
- # we're passing a block in, to process each resulting hash / row (block takes array of hashes)
- # when chunking is enabled, there are up to :chunk_size hashes in each chunk
- MyModel.collection.insert( chunk ) # insert up to 100 records at a time
- end
-
- => returns number of chunks we processed
-
-
- #### Example 5: Reading a CSV-like File, and Processing it with Resque:
-
- filename = '/tmp/strange_db_dump' # a file with CRTL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
- options = {
- :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
- :chunk_size => 100 , :key_mapping => {:export_date => nil, :name => :genre}
- }
- n = SmarterCSV.process(filename, options) do |chunk|
- Resque.enque( ResqueWorkerClass, chunk ) # pass chunks of CSV-data to Resque workers for parallel processing
- end
- => returns number of chunks
-
- #### Example 6: Using Value Converters
-
- NOTE: If you use `key_mappings` and `value_converters`, make sure that the value converters has references the keys based on the final mapped name, not the original name in the CSV file.
-
- $ cat spec/fixtures/with_dates.csv
- first,last,date,price
- Ben,Miller,10/30/1998,$44.50
- Tom,Turner,2/1/2011,$15.99
- Ken,Smith,01/09/2013,$199.99
- $ irb
- > require 'smarter_csv'
- > require 'date'
-
- # define a custom converter class, which implements self.convert(value)
- class DateConverter
- def self.convert(value)
- Date.strptime( value, '%m/%d/%Y') # parses custom date format into Date instance
- end
- end
-
- class DollarConverter
- def self.convert(value)
- value.sub('$','').to_f
- end
- end
-
- options = {:value_converters => {:date => DateConverter, :price => DollarConverter}}
- data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
- data[0][:date]
- => #<Date: 1998-10-30 ((2451117j,0s,0n),+0s,2299161j)>
- data[0][:date].class
- => Date
- data[0][:price]
- => 44.50
- data[0][:price].class
- => Float
-
- ## Parallel Processing
- [Jack](https://github.com/xjlin0) wrote an interesting article about [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
-
- ## Documentation
-
- The `process` method reads and processes a "generalized" CSV file and returns the contents either as an Array of Hashes,
- or an Array of Arrays, which contain Hashes, or processes Chunks of Hashes via a given block.
-
- SmarterCSV.process(filename, options={}, &block)
-
- The options and the block are optional.
-
- `SmarterCSV.process` supports the following options:
-
- | Option | Default | Explanation |
- ---------------------------------------------------------------------------------------------------------------------------------
- | :col_sep | ',' | column separator |
- | :row_sep | $/ ,"\n" | row separator or record separator , defaults to system's $/ , which defaults to "\n" |
- | | | This can also be set to :auto, but will process the whole cvs file first (slow!) |
- | :auto_row_sep_chars | 500 | How many characters to analyze when using `:row_sep => :auto`. nil or 0 means whole file. |
- | :quote_char | '"' | quotation character |
- | :comment_regexp | /^#/ | regular expression which matches comment lines (see NOTE about the CSV header) |
- | :chunk_size | nil | if set, determines the desired chunk-size (defaults to nil, no chunk processing) |
- ---------------------------------------------------------------------------------------------------------------------------------
- | :key_mapping | nil | a hash which maps headers from the CSV file to keys in the result hash |
- | :remove_unmapped_keys | false | when using :key_mapping option, should non-mapped keys / columns be removed? |
- | :downcase_header | true | downcase all column headers |
- | :strings_as_keys | false | use strings instead of symbols as the keys in the result hashes |
- | :strip_whitespace | true | remove whitespace before/after values and headers |
- | :keep_original_headers | false | keep the original headers from the CSV-file as-is. |
- | | | Disables other flags manipulating the header fields. |
- | :user_provided_headers | nil | *careful with that axe!* |
- | | | user provided Array of header strings or symbols, to define |
- | | | what headers should be used, overriding any in-file headers. |
- | | | You can not combine the :user_provided_headers and :key_mapping options |
- | :strip_chars_from_headers | nil | RegExp to remove extraneous characters from the header line (e.g. if headers are quoted) |
- | :headers_in_file | true | Whether or not the file contains headers as the first line. |
- | | | Important if the file does not contain headers, |
- | | | otherwise you would lose the first line of data. |
- | :skip_lines | nil | how many lines to skip before the first line or header line is processed |
- | :force_utf8 | false | force UTF-8 encoding of all lines (including headers) in the CSV file |
- | :invalid_byte_sequence | '' | how to replace invalid byte sequences with |
- ---------------------------------------------------------------------------------------------------------------------------------
- | :value_converters | nil | supply a hash of :header => KlassName; the class needs to implement self.convert(val)|
- | :remove_empty_values | true | remove values which have nil or empty strings as values |
- | :remove_zero_values | true | remove values which have a numeric value equal to zero / 0 |
- | :remove_values_matching | nil | removes key/value pairs if value matches given regular expressions. e.g.: |
- | | | /^\$0\.0+$/ to match $0.00 , or /^#VALUE!$/ to match errors in Excel spreadsheets |
- | :convert_values_to_numeric | true | converts strings containing Integers or Floats to the appropriate class |
- | | | also accepts either {:except => [:key1,:key2]} or {:only => :key3} |
- | :remove_empty_hashes | true | remove / ignore any hashes which don't have any key/value pairs |
- | :file_encoding | utf-8 | Set the file encoding eg.: 'windows-1252' or 'iso-8859-1' |
- | :force_simple_split | false | force simple splitting on :col_sep character for non-standard CSV-files. |
- | | | e.g. when :quote_char is not properly escaped |
- | :verbose | false | print out line number while processing (to track down problems in input files) |
-
-
- #### NOTES about File Encodings:
- * if you have a CSV file which contains unicode characters, you can process it as follows:
-
-
- File.open(filename, "r:bom|utf-8") do |f|
- data = SmarterCSV.process(f);
- end
-
- * if the CSV file with unicode characters is in a remote location, similarly you need to give the encoding as an option to the `open` call:

- require 'open-uri'
- file_location = 'http://your.remote.org/sample.csv'
- open(file_location, 'r:utf-8') do |f| # don't forget to specify the UTF-8 encoding!!
- data = SmarterCSV.process(f)
- end
+ # SmarterCSV

- #### NOTES about CSV Headers:
- * as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
- * the first line with the CSV header may or may not be commented out according to the :comment_regexp
- * any occurences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
- * any of the keys in the header line will be downcased, spaces replaced by underscore, and converted to Ruby symbols before being used as keys in the returned Hashes
- * you can not combine the :user_provided_headers and :key_mapping options
- * if the incorrect number of headers are provided via :user_provided_headers, exception SmarterCSV::HeaderSizeMismatch is raised
+ [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)

- #### NOTES on Key Mapping:
- * keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
- * if you want to completely delete a key, then map it to nil or to '', they will be automatically deleted from any result Hash
- * if you have input files with a large number of columns, and you want to ignore all columns which are not specifically mapped with :key_mapping, then use option :remove_unmapped_keys => true
+ SmarterCSV provides a convenient interface for reading and writing CSV files and data.

- #### NOTES on the use of Chunking and Blocks:
- * chunking can be VERY USEFUL if used in combination with passing a block to File.read_csv FOR LARGE FILES
- * if you pass a block to File.read_csv, that block will be executed and given an Array of Hashes as the parameter.
- * if the chunk_size is not set, then the array will only contain one Hash.
- * if the chunk_size is > 0 , then the array may contain up to chunk_size Hashes.
- * this can be very useful when passing chunked data to a post-processing step, e.g. through Resque
+ Unlike traditional CSV parsing methods, SmarterCSV focuses on representing the data for each row as a Ruby hash, which lends itself perfectly for direct use with ActiveRecord, Sidekiq, and JSON stores such as S3. For large files it supports processing CSV data in chunks of array-of-hashes, which allows parallel or batch processing of the data.

- #### NOTES on improper quotation and unwanted characters in headers:
- * some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, use the `:force_simple_split => true` option in combination with `:strip_chars_from_headers => /[\-"]/` . This will also significantly speed up the import.
- If you would force a different :quote_char instead (setting it to a non-used character), then the import would be up to 5-times slower than using `:force_simple_split`.
+ Its powerful interface is designed to simplify and optimize the process of handling CSV data, and allows for highly customizable and efficient data processing by enabling the user to easily map CSV headers to Hash keys, skip unwanted rows, and transform data on-the-fly.

- ## See also:
+ This results in a more readable, maintainable, and performant codebase. Whether you're dealing with large datasets or complex data transformations, SmarterCSV streamlines CSV operations, making it an invaluable tool for developers seeking to enhance their data processing workflows.

- http://www.unixgods.org/~tilo/Ruby/process_csv_as_hashes.html
+ When writing CSV data to file, it similarly takes arrays of hashes, and converts them to a CSV file.

+ One user wrote:

+ > *Best gem for CSV for us yet. [...] taking an import process from 7+ hours to about 3 minutes. [...] Smarter CSV was a big part and helped clean up our code ALOT*

- ## Installation
+ # Installation

 Add this line to your application's Gemfile:
-
+ ```ruby
 gem 'smarter_csv'
-
+ ```
 And then execute:
-
+ ```ruby
 $ bundle
-
+ ```
 Or install it yourself as:
-
+ ```ruby
 $ gem install smarter_csv
-
- ## Upcoming
-
- Planned in the next releases:
- * programmatic header transformations
- * CSV command line
-
- ## Changes
-
- #### 1.1.5 (2017-11-05)
- * fix issue with invalid byte sequences in header (issue #103, thanks to Dave Myron)
- * fix issue with invalid byte sequences in multi-line data (thanks to Ivan Ushakov)
- * analyze only 500 characters by default when `:row_sep => :auto` is used.
- added option `row_sep_auto_chars` to change the default if necessary. (thanks to Matthieu Paret)
-
- #### 1.1.4 (2017-01-16)
- * fixing UTF-8 related bug which was introduced in 1.1.2 (thanks to Tirdad C.)
-
- #### 1.1.3 (2016-12-30)
- * added warning when options indicate UTF-8 processing, but input filehandle is not opened with r:UTF-8 option
-
- #### 1.1.2 (2016-12-29)
- * added option `invalid_byte_sequence` (thanks to polycarpou)
- * added comments on handling of UTF-8 encoding when opening from File vs. OpenURI (thanks to KevinColemanInc)
-
- #### 1.1.1 (2016-11-26)
- * added option to `skip_lines` (thanks to wal)
- * added option to `force_utf8` encoding (thanks to jordangraft)
- * bugfix if no headers in input data (thanks to esBeee)
- * ensure input file is closed (thanks to waldyr)
- * improved verbose output (thankd to benmaher)
- * improved documentation
-
- #### 1.1.0 (2015-07-26)
- * added feature :value_converters, which allows parsing of dates, money, and other things (thanks to Raphaël Bleuse, Lucas Camargo de Almeida, Alejandro)
- * added error if :headers_in_file is set to false, and no :user_provided_headers are given (thanks to innhyu)
- * added support to convert dashes to underscore characters in headers (thanks to César Camacho)
- * fixing automatic detection of \r\n line-endings (thanks to feens)
-
- #### 1.0.19 (2014-10-29)
- * added option :keep_original_headers to keep CSV-headers as-is (thanks to Benjamin Thouret)
-
- #### 1.0.18 (2014-10-27)
- * added support for multi-line fields / csv fields containing CR (thanks to Chris Hilton) (issue #31)
-
- #### 1.0.17 (2014-01-13)
- * added option to set :row_sep to :auto , for automatic detection of the row-separator (issue #22)
-
- #### 1.0.16 (2014-01-13)
- * :convert_values_to_numeric option can now be qualified with :except or :only (thanks to Hugo Lepetit)
- * removed deprecated `process_csv` method
-
- #### 1.0.15 (2013-12-07)
- * new option:
- * :remove_unmapped_keys to completely ignore columns which were not mapped with :key_mapping (thanks to Dave Sanders)
-
- #### 1.0.14 (2013-11-01)
- * added GPL-2 and MIT license to GEM spec file; if you need another license contact me
-
- #### 1.0.13 (2013-11-01) ### YANKED!
- * added GPL-2 license to GEM spec file; if you need another license contact me
-
- #### 1.0.12 (2013-10-15)
- * added RSpec tests
-
- #### 1.0.11 (2013-09-28)
- * bugfix : fixed issue #18 - fixing issue with last chunk not being properly returned (thanks to Jordan Running)
- * added RSpec tests
-
- #### 1.0.10 (2013-06-26)
- * bugfix : fixed issue #14 - passing options along to CSV.parse (thanks to Marcos Zimmermann)
-
- #### 1.0.9 (2013-06-19)
- * bugfix : fixed issue #13 with negative integers and floats not being correctly converted (thanks to Graham Wetzler)
-
- #### 1.0.8 (2013-06-01)
-
- * bugfix : fixed issue with nil values in inputs with quote-char (thanks to Félix Bellanger)
- * new options:
- * :force_simple_split : to force simiple splitting on :col_sep character for non-standard CSV-files. e.g. without properly escaped :quote_char
- * :verbose : print out line number while processing (to track down problems in input files)
-
- #### 1.0.7 (2013-05-20)
-
- * allowing process to work with objects with a 'readline' method (thanks to taq)
- * added options:
- * :file_encoding : defaults to utf8 (thanks to MrTin, Paxa)
-
- #### 1.0.6 (2013-05-19)
-
- * bugfix : quoted fields are now correctly parsed
-
- #### 1.0.5 (2013-05-08)
-
- * bugfix : for :headers_in_file option
-
- #### 1.0.4 (2012-08-17)
-
- * renamed the following options:
- * :strip_whitepace_from_values => :strip_whitespace - removes leading/trailing whitespace from headers and values
-
- #### 1.0.3 (2012-08-16)
-
- * added the following options:
- * :strip_whitepace_from_values - removes leading/trailing whitespace from values
-
- #### 1.0.2 (2012-08-02)
-
- * added more options for dealing with headers:
- * :user_provided_headers ,user provided Array with header strings or symbols, to precisely define what the headers should be, overriding any in-file headers (default: nil)
- * :headers_in_file , if the file contains headers as the first line (default: true)
-
- #### 1.0.1 (2012-07-30)
-
- * added the following options:
- * :downcase_header
- * :strings_as_keys
- * :remove_zero_values
- * :remove_values_matching
- * :remove_empty_hashes
- * :convert_values_to_numeric
-
- * renamed the following options:
- * :remove_empty_fields => :remove_empty_values
-
-
- #### 1.0.0 (2012-07-29)
-
- * renamed `SmarterCSV.process_csv` to `SmarterCSV.process`.
-
- #### 1.0.0.pre1 (2012-07-29)
-
-
- ## Reporting Bugs / Feature Requests
+ ```
+
+ # Documentation
+
+ * [Introduction](docs/_introduction.md)
+ * [The Basic API](docs/basic_api.md)
+ * [Batch Processing](./docs/batch_processing.md)
+ * [Configuration Options](docs/options.md)
+ * [Row and Column Separators](docs/row_col_sep.md)
+ * [Header Transformations](docs/header_transformations.md)
+ * [Header Validations](docs/header_validations.md)
+ * [Data Transformations](docs/data_transformations.md)
+ * [Value Converters](docs/value_converters.md)
+
+ # Articles
+ * [Parsing CSV Files in Ruby with SmarterCSV](https://tilo-sloboda.medium.com/parsing-csv-files-in-ruby-with-smartercsv-6ce66fb6cf38)
+ * [Processing 1.4 Million CSV Records in Ruby, fast](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
+ * [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
+ * [The original post](http://www.unixgods.org/Ruby/process_csv_as_hashes.html) that started SmarterCSV
+
+ # [ChangeLog](./CHANGELOG.md)
+
+ # Reporting Bugs / Feature Requests

 Please [open an Issue on GitHub](https://github.com/tilo/smarter_csv/issues) if you have feedback, new feature requests, or want to report a bug. Thank you!

+ For reporting issues, please:
+ * include a small sample CSV file
+ * open a pull-request adding a test that demonstrates the issue
+ * mention your version of SmarterCSV, Ruby, Rails

- ## Special Thanks
-
- Many thanks to people who have filed issues and sent comments.
- And a special thanks to those who contributed pull requests:
-
- * [Jack 0](https://github.com/xjlin0)
- * [Alejandro](https://github.com/agaviria)
- * [Lucas Camargo de Almeida](https://github.com/lcalmeida)
- * [Raphaël Bleuse](https://github.com/bleuse)
- * [feens](https://github.com/feens)
- * [César Camacho](https://github.com/chanko)
- * [innhyu](https://github.com/innhyu)
- * [Benjamin Thouret](https://github.com/benichu)
- * [Chris Hilton](https://github.com/chrismhilton)
- * [Sean Duckett](http://github.com/sduckett)
- * [Alex Ong](http://github.com/khaong)
- * [Martin Nilsson](http://github.com/MrTin)
- * [Eustáquio Rangel](http://github.com/taq)
- * [Pavel](http://github.com/paxa)
- * [Félix Bellanger](https://github.com/Keeguon)
- * [Graham Wetzler](https://github.com/grahamwetzler)
- * [Marcos G. Zimmermann](https://github.com/marcosgz)
- * [Jordan Running](https://github.com/jrunning)
- * [Dave Sanders](https://github.com/DaveSanders)
- * [Hugo Lepetit](https://github.com/giglemad)
- * [esBeee](https://github.com/esBeee)
- * [Waldyr de Souza](https://github.com/waldyr)
- * [Ben Maher](https://github.com/benmaher)
- * [Wal McConnell](https://github.com/wal)
- * [Jordan Graft](https://github.com/jordangraft)
- * [Michael](https://github.com/polycarpou)
- * [Kevin Coleman](https://github.com/KevinColemanInc)
- * [Tirdad C.](https://github.com/tridadc)
- * [Dave Myron](https://github.com/contentfree)
- * [Ivan Ushakov](https://github.com/IvanUshakov)
- * [Matthieu Paret](https://github.com/mtparet)
- * [Rohit Amarnath](https://github.com/ramarnat)
+ # [A Special Thanks to all Contributors!](CONTRIBUTORS.md) 🎉🎉🎉


- ## Contributing
+ # Contributing

 1. Fork it
 2. Create your feature branch (`git checkout -b my-new-feature`)
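
The rewritten README above describes reading CSV rows as Ruby hashes and, per the new `lib/smarter_csv/writer.rb` in the file list, writing CSV from hashes, but it defers all examples to the docs pages. Below is a minimal sketch of both directions; the file names are illustrative, and the writer call (`SmarterCSV.generate`) is taken from the gem's current documentation rather than from this diff, so treat its exact signature as an assumption.

```ruby
require 'smarter_csv'

# Reading: each row becomes a Hash keyed by the downcased, symbolized headers;
# empty cells are dropped and numeric strings are converted by default.
rows = SmarterCSV.process('pets.csv')
# e.g. [{first_name: "Dan", last_name: "McAllister", dogs: 2}, ...]

# Writing: hashes go back out as CSV; the header row is derived from the hash keys.
SmarterCSV.generate('pets_copy.csv') do |csv|
  rows.each { |row| csv << row }
end
```
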
data/Rakefile CHANGED
@@ -1,26 +1,46 @@
- #!/usr/bin/env rake
+ # frozen_string_literal: true
+
 require "bundler/gem_tasks"
+ require 'rspec/core/rake_task'

- require 'rubygems'
- require 'rake'
+ # # temp fix for NoMethodError: undefined method `last_comment'
+ # # remove when fixed in Rake 11.x and higher
+ # module TempFixForRakeLastComment
+ # def last_comment
+ # last_description
+ # end
+ # end
+ # Rake::Application.send :include, TempFixForRakeLastComment
+ # ### end of tempfix

- require 'rspec/core/rake_task'
+ RSpec::Core::RakeTask.new(:spec)

- desc "Run RSpec"
- RSpec::Core::RakeTask.new do |t|
- t.verbose = false
- end
+ require "rubocop/rake_task"

- desc "Run specs for all test cases"
- task :spec_all do
- system "rake spec"
- end
+ RuboCop::RakeTask.new

- # task :spec_all do
- # %w[active_record data_mapper mongoid].each do |model_adapter|
- # puts "MODEL_ADAPTER = #{model_adapter}"
- # system "rake spec MODEL_ADAPTER=#{model_adapter}"
- # end
- # end
+ require "rake/extensiontask"
+
+ if RUBY_ENGINE == 'jruby'
+
+ task default: %i[spec]
+
+ else
+ task build: :compile

- task :default => :spec
+ Rake::ExtensionTask.new("smarter_csv") do |ext|
+ ext.lib_dir = "lib/smarter_csv"
+ ext.ext_dir = "ext/smarter_csv"
+ ext.source_pattern = "*.{c,h}"
+ end
+
+ # task default: %i[clobber compile spec rubocop]
+ task default: %i[clobber compile spec]
+ end
+
+ desc 'Run spec with coverage'
+ task :coverage do
+ ENV['COVERAGE'] = 'true'
+ Rake::Task['spec'].execute
+ `open coverage/index.html`
+ end
data/TO_DO_v2.md ADDED
@@ -0,0 +1,14 @@
+ # SmarterCSV v2.0 TO DO List
+
+ * add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
+ * use Procs for validations and transformations [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+ * make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+ * skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120).
+ Or stream large file from S3 (linked in the issue)
+ * Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
+ * Don't call rewind on filehandle
+ * [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
+ * [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
+ * Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
+ * Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
+
data/docs/_introduction.md ADDED
@@ -0,0 +1,56 @@
+
+ ### Contents
+
+ * [**Introduction**](./_introduction.md)
+ * [The Basic API](./basic_api.md)
+ * [Batch Processing](./batch_processing.md)
+ * [Configuration Options](./options.md)
+ * [Row and Column Separators](./row_col_sep.md)
+ * [Header Transformations](./header_transformations.md)
+ * [Header Validations](./header_validations.md)
+ * [Data Transformations](./data_transformations.md)
+ * [Value Converters](./value_converters.md)
+
+ --------------
+
+ # SmarterCSV Introduction
+
+ `smarter_csv` is a Ruby Gem for convenient reading and writing of CSV files. It has intelligent defaults, and auto-discovery of column and row separators. It imports CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, kicking off batch jobs with Sidekiq, parallel processing, or uploading data to S3. Similarly, writing CSV files takes Hashes, or Arrays of Hashes, to create a CSV file.
+
+ ## Why another CSV library?
+
+ Ruby's original 'csv' library's API is pretty old, and its processing of CSV-files returning an array-of-array format feels unnecessarily 'close to the metal'. Its output is not easy to use - especially not if you need a data hash to create database records, or JSON from it, or pass it to Sidekiq or S3. Another shortcoming is that Ruby's 'csv' library does not have good support for huge CSV-files, e.g. there is no support for batching and/or parallel processing of the CSV-content (e.g. with Sidekiq jobs).
+
+ When SmarterCSV was envisioned, I needed to do nightly imports of very large data sets that came in CSV format, that needed to be upserted into a database, and because of the sheer volume of data needed to be processed in parallel.
+ The CSV processing also needed to be robust against variations in the input data.
+
+ ## Benefits of using SmarterCSV
+
+ * Improved Robustness:
+ Typically you have little control over the data quality of CSV files that need to be imported. Because SmarterCSV has intelligent defaults and auto-detection of typical formats, this improves the robustness of your CSV imports without having to manually tweak options.
+
+ * Easy-to-use Format:
+ By using a Ruby hash to represent a CSV row, SmarterCSV allows you to directly use this data and insert it into a database, or use it with Sidekiq, S3, message queues, etc.
+
+ * Normalized Headers:
+ SmarterCSV automatically transforms CSV headers to Ruby symbols, stripping leading or trailing whitespace.
+ There are many ways to customize the header transformation to your liking. You can re-map CSV headers to hash keys, and you can ignore CSV columns.
+
+ * Normalized Data:
+ SmarterCSV transforms the data in each CSV row automatically, stripping whitespace, converting numerical data into numbers, ignoring nil or empty fields, and more. There are many ways to customize this. You can even add your own value converters.
+
+ * Batch Processing of large CSV files:
+ Processing large CSV files in chunks reduces the memory impact and allows for faster / parallel processing.
+ By adding the option `chunk_size: numeric_value`, you can switch to batch processing. SmarterCSV will then return arrays-of-hashes. This makes parallel processing easy: you can pass whole chunks of data to Sidekiq, bulk-insert into a DB, or pass it to other data sinks.
+
+ ## Additional Features
+
+ * Header Validation:
+ You can validate that a set of hash keys is present in each record after header transformations are applied.
+ This can help ensure importing data with consistent quality.
+
+ * Data Validations
+ (planned feature)
+
+ ---------------
+ PREVIOUS [README](../README.md) | NEXT: [The Basic API](./basic_api.md)
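
The batch-processing bullet in the introduction above is easiest to see in code. Here is a minimal sketch of chunked reading with a header re-mapping, in the spirit of the examples removed from the old README; `large_import.csv`, the column names, and `ImportWorker` are hypothetical placeholders.

```ruby
require 'smarter_csv'

options = {
  chunk_size: 100,                             # yield the rows as arrays of up to 100 hashes
  key_mapping: { old_column_name: :new_key }   # rename a header; mapping a header to nil drops that column
}

chunks = SmarterCSV.process('large_import.csv', options) do |chunk|
  # each chunk is an Array of Hashes, ready for a bulk insert or a background job
  ImportWorker.perform_async(chunk)            # hypothetical Sidekiq worker
end
# when called with a block, process returns the number of chunks read
# (see Example 1c in the old README shown in the diff above)
```
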