smarter_csv 1.0.0 → 1.0.1

data/README.md CHANGED
@@ -3,15 +3,17 @@
  `smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with Mongoid or ActiveRecord,
  and parallel processing with Resque or Sidekiq.
 
- `smarter_csv` has lots of optional features:
+ `smarter_csv` has lots of features:
  * able to process large CSV-files
  * able to chunk the input from the CSV file to avoid loading the whole CSV file into memory
  * return a Hash for each line of the CSV file, so we can quickly use the results for either creating MongoDB or ActiveRecord entries, or further processing with Resque
- * able to pass a block to the method, so data from the CSV file can be directly processed (e.g. Resque.enqueue )
- * have a bit more flexible input format, where comments are possible, and col_sep,row_sep can be set to any character sequence, including control characters.
+ * able to pass a block to the `process` method, so data from the CSV file can be directly processed (e.g. Resque.enqueue )
+ * allows a more flexible input format, where comments are possible, and col_sep and row_sep can be set to any character sequence, including control characters.
  * able to re-map CSV "column names" to Hash-keys of your choice (normalization)
  * able to ignore "columns" in the input (delete columns)
- * able to eliminate nil or empty fields from the result hashes
+ * able to eliminate nil or empty fields from the result hashes (default)
+
+ NOTE: This Gem is only for importing CSV files - writing of CSV files is not supported.
 
  ### Why?
 
@@ -20,14 +22,64 @@ Ruby's CSV library's API is pretty old, and it's processing of CSV-files returni
  As the existing CSV libraries didn't fit my needs, I wrote my own CSV processing - specifically for use in connection with Rails ORMs like Mongoid, MongoMapper or ActiveRecord. In those ORMs you can easily pass a hash with attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept larger arrays of such hashes, to create a large number of records quickly with just one call.
 
  ### Examples
- #### Example 1: Reading a CSV-File in one Chunk, returning one Array of Hashes:
+
+ The two main choices you have when calling `SmarterCSV.process` are:
+ * calling `process` with or without a block
+ * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
+
+ #### Example 1a: How SmarterCSV processes CSV-files as an array of hashes:
+ Please note how each hash contains only the keys for columns with non-null values.
+
+     $ cat /tmp/pets.csv
+     first name,last name,dogs,cats,birds,fish
+     Dan,McAllister,2,,,
+     Lucy,Laweless,,5,,
+     Miles,O'Brian,,,,21
+     Nancy,Homes,2,,1,
+     $ irb
+     > require 'smarter_csv'
+     => true
+     > pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
+     => [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>2},
+          {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>5},
+          {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>21},
+          {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>2, :birds=>1}
+        ]
+
+
+ #### Example 1b: How SmarterCSV processes CSV-files as chunks, returning arrays of hashes:
+ Please note how the returned array contains two sub-arrays containing the chunks which were read, each chunk containing 2 hashes.
+ In case the number of rows is not cleanly divisible by `:chunk_size`, the last chunk contains fewer hashes.
+
+     > pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
+     => [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>2}, {:first=>"Lucy", :last=>"Laweless", :cats=>5} ],
+          [ {:first=>"Miles", :last=>"O'Brian", :fish=>21}, {:first=>"Nancy", :last=>"Homes", :dogs=>2, :birds=>1} ]
+        ]
+
+ #### Example 1c: How SmarterCSV processes CSV-files as chunks, and passes arrays of hashes to a given block:
+ Please note how the given block is passed the data for each chunk as the parameter (array of hashes),
+ and how the `process` method returns the number of chunks when called with a block.
+
+     > total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
+         chunk.each do |h|    # you can post-process the data from each row to your heart's content, and also create virtual attributes:
+           h[:full_name] = [h[:first],h[:last]].join(' ')  # create a virtual attribute
+           h.delete(:first) ; h.delete(:last)              # remove two keys
+         end
+         puts chunk.inspect   # we could at this point pass the chunk to a Resque worker..
+       end
+
+     [{:dogs=>2, :full_name=>"Dan McAllister"}, {:cats=>5, :full_name=>"Lucy Laweless"}]
+     [{:fish=>21, :full_name=>"Miles O'Brian"}, {:dogs=>2, :birds=>1, :full_name=>"Nancy Homes"}]
+     => 2
+
+ #### Example 2: Reading a CSV-File in one Chunk, returning one Array of Hashes:
 
      filename = '/tmp/input_file.txt' # TAB delimited file, each row ending with Control-M
-     recordsA = SmarterCSV.process(filename, {:col_sep => "\t", :row_sep => "\cM"}
+     recordsA = SmarterCSV.process(filename, {:col_sep => "\t", :row_sep => "\cM"})  # no block given
 
      => returns an array of hashes
 
- #### Example 2: Populate a MySQL or MongoDB Database with SmarterCSV:
+ #### Example 3: Populate a MySQL or MongoDB Database with SmarterCSV:
 
      # without using chunks:
      filename = '/tmp/some.csv'
@@ -40,29 +92,71 @@ As the existing CSV libraries didn't fit my needs, I was writing my own CSV proc
      => returns number of chunks / rows we processed
 
 
- #### Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
+ #### Example 4: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
 
      # using chunks:
      filename = '/tmp/some.csv'
-     n = SmarterCSV.process(filename, {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}, :chunk_size => 100}) do |array|
+     n = SmarterCSV.process(filename, {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}) do |chunk|
        # we're passing a block in, to process each resulting hash / row (block takes array of hashes)
-       # when chunking is enabled, there are up to :chunk_size hashes in each array
-       MyModel.collection.insert( array ) # insert up to 100 records at a time
+       # when chunking is enabled, there are up to :chunk_size hashes in each chunk
+       MyModel.collection.insert( chunk ) # insert up to 100 records at a time
      end
 
      => returns number of chunks we processed
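+
+ For comparison, the same chunked pattern could be used to create one record at a time, e.g. with ActiveRecord
+ (a minimal sketch; `MyModel` is a made-up model name):
+
+     n = SmarterCSV.process(filename, {:chunk_size => 100}) do |chunk|
+       chunk.each do |h|      # iterate over the hashes of one chunk
+         MyModel.create!(h)   # create one database record per CSV row
+       end
+     end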
 
 
- #### Example 4: Reading a CSV-like File, and Processing it with Resque:
+ #### Example 5: Reading a CSV-like File, and Processing it with Resque:
 
      filename = '/tmp/strange_db_dump' # a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes)
      n = SmarterCSV.process(filename, {:col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
-       :chunk_size => '5' , :key_mapping => {:export_date => nil, :name => :genre}}) do |x|
-       puts "Resque.enque( ResqueWorkerClass, #{x.size}, #{x.inspect} )" # simulate processing each chunk
+       :chunk_size => 100 , :key_mapping => {:export_date => nil, :name => :genre}}) do |chunk|
+       Resque.enqueue( ResqueWorkerClass, chunk ) # pass chunks of CSV-data to Resque workers for parallel processing
      end
      => returns number of chunks
 
 
 
 
+ ## Documentation
+
+ The `process` method reads and processes a "generalized" CSV file and returns the contents either as an Array of Hashes,
+ or as an Array of Arrays which contain Hashes, or it processes chunks of Hashes via a given block.
+
+     SmarterCSV.process(filename, options={}, &block)
+
+ The options and the block are optional.
+
+ `SmarterCSV.process` supports the following options (see the combined example below the list):
+ * :col_sep : column separator , which defaults to ','
+ * :row_sep : row separator or record separator , defaults to system's $/ , which defaults to "\n"
+ * :quote_char : quotation character , defaults to '"'
+ * :comment_regexp : regular expression which matches comment lines , defaults to /^#/ (see NOTE about the CSV header)
+ * :chunk_size : if set, determines the desired chunk-size (defaults to nil, no chunk processing)
+ * :key_mapping : a hash which maps headers from the CSV file to keys in the result hash (default: nil)
+ * :downcase_header : downcase all column headers (default: true)
+ * :strings_as_keys : use strings instead of symbols as the keys in the result hashes (default: false)
+ * :remove_empty_values : remove key/value pairs which have nil or empty strings as values (default: true)
+ * :remove_zero_values : remove key/value pairs which have a numeric value equal to zero / 0 (default: false)
+ * :remove_values_matching : removes key/value pairs if the value matches a given regular expression (default: nil) , e.g. /^\$0\.0+$/ to match $0.00 , or /^#VALUE!$/ to match errors in Excel spreadsheets
+ * :convert_values_to_numeric : converts strings containing Integers or Floats to the appropriate class (default: true)
+ * :remove_empty_hashes : remove / ignore any hashes which don't have any key/value pairs (default: true)
+
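+ For instance, a hypothetical semicolon-separated file could be read by combining several of these options
+ (file name and column names are made up for illustration):
+
+     # keep String keys, drop zero values; numeric Strings are converted by default
+     records = SmarterCSV.process('/tmp/prices.csv',
+                                   {:col_sep => ';', :strings_as_keys => true, :remove_zero_values => true})
+     # => e.g. [ {'item' => 'apple', 'price' => 0.35}, ... ]
+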
+ #### NOTES about CSV Headers:
+ * as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
+ * the first line with the CSV header may or may not be commented out according to the :comment_regexp
+ * any occurrences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
+ * any of the keys in the header line will be downcased, spaces replaced by underscore, and converted to Ruby symbols before being used as keys in the returned Hashes (see the example below)
+
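+ For example, a made-up header line
+
+     First Name,Last Name,Dogs
+
+ would by default result in the hash keys `:first_name`, `:last_name` and `:dogs`.
+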
+ #### NOTES on Key Mapping:
+ * keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
+ * if you want to completely delete a key, then map it to nil or to '', and it will be automatically deleted from any result Hash (see the example below)
+
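+ For example, a mapping which renames two keys and completely deletes a made-up `:internal_id` column:
+
+     records = SmarterCSV.process(filename,
+                                   {:key_mapping => {:first_name => :first, :last_name => :last, :internal_id => nil}})
+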
+ #### NOTES on the use of Chunking and Blocks:
+ * chunking can be VERY USEFUL if used in combination with passing a block to `SmarterCSV.process` FOR LARGE FILES
+ * if you pass a block to `SmarterCSV.process`, that block will be executed and given an Array of Hashes as the parameter.
+ * if the chunk_size is not set, then the array will only contain one Hash (see the sketch below).
+ * if the chunk_size is > 0 , then the array may contain up to chunk_size Hashes.
+ * this can be very useful when passing chunked data to a post-processing step, e.g. through Resque
+
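+ For example (a minimal sketch, re-using `pets.csv` from Example 1a), a block given without a `:chunk_size`
+ receives one Array with a single Hash per CSV row:
+
+     SmarterCSV.process('/tmp/pets.csv') do |array|
+       puts array.first.inspect   # each array contains exactly one hash here
+     end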
+
  ## See also:
 
  http://www.unixgods.org/~tilo/Ruby/process_csv_as_hashes.html
@@ -83,9 +177,35 @@ Or install it yourself as:
 
      $ gem install smarter_csv
 
- ## Usage
 
- TODO: Write usage instructions here
+ ## Changes
+
+ #### 1.0.1 (2012-07-30)
+
+ * added the following options:
+   * :downcase_header
+   * :strings_as_keys
+   * :remove_zero_values
+   * :remove_values_matching
+   * :remove_empty_hashes
+   * :convert_values_to_numeric
+
+ * renamed the following options:
+   * :remove_empty_fields => :remove_empty_values
+
+
+ #### 1.0.0 (2012-07-29)
+
+ * renamed `SmarterCSV.process_csv` to `SmarterCSV.process`.
+
+ #### 1.0.0.pre1 (2012-07-29)
+
+
+ ## Reporting Bugs / Feature Requests
+
+ Please [open an Issue on GitHub](https://github.com/tilo/smarter_csv/issues) if you have feedback, new feature requests, or want to report a bug. Thank you!
+
+
 
  ## Contributing
@@ -1,37 +1,9 @@
  module SmarterCSV
-   # this reads and processes a "generalized" CSV file and returns the contents either as an Array of Hashes,
-   # or an Array of Arrays, which contain Hashes, or processes Chunks of Hashes via a given block
-   #
-   # File.read_csv supports the following options:
-   #  * :col_sep : column separator , which defaults to ','
-   #  * :row_sep : row separator or record separator , defaults to system's $/ , which defaults to "\n"
-   #  * :quote_char : quotation character , defaults to '"' (currently not used)
-   #  * :comment_regexp : regular expression which matches comment lines , defaults to /^#/ (see NOTE about the CSV header)
-   #  * :chunk_size : if set, determines the desired chunk-size (defaults to nil, no chunk processing)
-   #  * :remove_empty_fields : remove fields which have nil or empty strings as values (default: true)
-   #
-   # NOTES about CSV Headers:
-   #  - as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
-   #  - the first line with the CSV header may or may not be commented out according to the :comment_regexp
-   #  - any occurences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
-   #  - any of the keys in the header line will be converted to Ruby symbols before being used in the returned Hashes
-   #
-   # NOTES on Key Mapping:
-   #  - keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes
-   #    can be better used internally in our application (e.g. when directly creating MongoDB entries with them)
-   #  - if you want to completely delete a key, then map it to nil or to '', they will be automatically deleted from any result Hash
-   #
-   # NOTES on the use of Chunking and Blocks:
-   #  - chunking can be VERY USEFUL if used in combination with passing a block to File.read_csv FOR LARGE FILES
-   #  - if you pass a block to File.read_csv, that block will be executed and given an Array of Hashes as the parameter.
-   #    If the chunk_size is not set, then the array will only contain one Hash.
-   #    If the chunk_size is > 0 , then the array may contain up to chunk_size Hashes.
-   #    This can be very useful when passing chunked data to a post-processing step, e.g. through Resque
-   #
-
    def SmarterCSV.process(filename, options={}, &block)
-     default_options = {:col_sep => ',' , :row_sep => $/ , :quote_char => '"', :remove_empty_fields => true,
-                        :comment_regexp => /^#/, :chunk_size => nil , :key_mapping_hash => nil
+     default_options = {:col_sep => ',' , :row_sep => $/ , :quote_char => '"',
+                        :remove_empty_values => true, :remove_zero_values => false , :remove_values_matching => nil , :remove_empty_hashes => true ,
+                        :convert_values_to_numeric => true, :strip_chars_from_headers => nil ,
+                        :comment_regexp => /^#/, :chunk_size => nil , :key_mapping => nil , :downcase_header => true, :strings_as_keys => false
      }
      options = default_options.merge(options)
      headerA = []
@@ -43,7 +15,11 @@ module SmarterCSV
 
      # process the header line in the CSV file..
      # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
-     headerA = f.readline.sub(options[:comment_regexp],'').chomp(options[:row_sep]).split(options[:col_sep]).map{|x| x.gsub(%r/options[:quote_char]/,'').gsub(/\s+/,'_').to_sym}
+     headerA = f.readline.sub(options[:comment_regexp],'').chomp(options[:row_sep])
+     headerA = headerA.gsub(options[:strip_chars_from_headers], '') if options[:strip_chars_from_headers]
+     headerA = headerA.split(options[:col_sep]).map{|x| x.gsub(%r/#{options[:quote_char]}/,'').gsub(/\s+/,'_')}  # strip the quote_char (interpolated into the regexp), replace whitespace with '_'
+     headerA.map!{|x| x.downcase } if options[:downcase_header]
+     headerA.map!{|x| x.to_sym } unless options[:strings_as_keys]
      key_mappingH = options[:key_mapping]
 
      # do some key mapping on the keys in the file header
@@ -72,8 +48,21 @@ module SmarterCSV
      hash = Hash.zip(headerA,dataA)  # from Facets of Ruby library
      # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
      hash.delete(nil); hash.delete(''); hash.delete(:"")  # delete any hash keys which were mapped to be deleted
-     hash.delete_if{|k,v| v.nil? || v =~ /^\s*$/} if options[:remove_empty_fields]
-
+     hash.delete_if{|k,v| v.nil? || v =~ /^\s*$/} if options[:remove_empty_values]
+     hash.delete_if{|k,v| ! v.nil? && v =~ /^(\d+|\d+\.\d+)$/ && v.to_f == 0} if options[:remove_zero_values]  # values are typically Strings!
+     hash.delete_if{|k,v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
+     if options[:convert_values_to_numeric]
+       hash.each do |k,v|
+         case v
+         when /^\d+$/       # an integer-looking String becomes an Integer
+           hash[k] = v.to_i
+         when /^\d+\.\d+$/  # a float-looking String becomes a Float
+           hash[k] = v.to_f
+         end
+       end
+     end
+     next if options[:remove_empty_hashes] && hash.empty?
+
      if use_chunks
        chunk << hash  # append temp result to chunk
 
@@ -1,3 +1,3 @@
  module SmarterCSV
-   VERSION = "1.0.0"
+   VERSION = "1.0.1"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: smarter_csv
  version: !ruby/object:Gem::Version
-   version: 1.0.0
+   version: 1.0.1
  prerelease:
  platform: ruby
  authors:
@@ -11,7 +11,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-07-29 00:00:00.000000000 Z
+ date: 2012-07-30 00:00:00.000000000 Z
  dependencies: []
  description: Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, with
    optional features for processing large files in parallel, embedded comments, unusual