smarter_csv 1.0.0 → 1.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -3,15 +3,17 @@
  `smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with Mongoid or ActiveRecord,
  and parallel processing with Resque or Sidekiq.
 
- `smarter_csv` has lots of optional features:
+ `smarter_csv` has lots of features:
  * able to process large CSV-files
  * able to chunk the input from the CSV file to avoid loading the whole CSV file into memory
  * return a Hash for each line of the CSV file, so we can quickly use the results for either creating MongoDB or ActiveRecord entries, or further processing with Resque
- * able to pass a block to the method, so data from the CSV file can be directly processed (e.g. Resque.enqueue)
- * have a bit more flexible input format, where comments are possible, and col_sep, row_sep can be set to any character sequence, including control characters.
+ * able to pass a block to the `process` method, so data from the CSV file can be directly processed (e.g. Resque.enqueue)
+ * allows a more flexible input format, where comments are possible, and col_sep, row_sep can be set to any character sequence, including control characters.
  * able to re-map CSV "column names" to Hash-keys of your choice (normalization)
  * able to ignore "columns" in the input (delete columns)
- * able to eliminate nil or empty fields from the result hashes
+ * able to eliminate nil or empty fields from the result hashes (default)
+
+ NOTE: This Gem is only for importing CSV files - writing of CSV files is not supported.
 
  ### Why?
 
@@ -20,14 +22,64 @@ Ruby's CSV library's API is pretty old, and it's processing of CSV-files returni
  As the existing CSV libraries didn't fit my needs, I was writing my own CSV processing - specifically for use in connection with Rails ORMs like Mongoid, MongoMapper or ActiveRecord. In those ORMs you can easily pass a hash with attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept larger arrays of such hashes to create a larger amount of records quickly with just one call.
 
  ### Examples
- #### Example 1: Reading a CSV-File in one Chunk, returning one Array of Hashes:
+
+ The two main choices you have in terms of how to call `SmarterCSV.process` are:
+ * calling `process` with or without a block
+ * passing a `:chunk_size` to the `process` method, and processing the CSV-file in chunks, rather than in one piece.
+
+ #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
+ Please note how each hash contains only the keys for columns with non-null values.
+
+     $ cat pets.csv
+     first name,last name,dogs,cats,birds,fish
+     Dan,McAllister,2,,,
+     Lucy,Laweless,,5,,
+     Miles,O'Brian,,,,21
+     Nancy,Homes,2,,1,
+     $ irb
+     > require 'smarter_csv'
+     => true
+     > pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
+     => [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
+          {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"},
+          {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"},
+          {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"}
+        ]
+
+
+ #### Example 1b: How SmarterCSV processes CSV-files as chunks, returning arrays of hashes:
+ Please note how the returned array contains two sub-arrays containing the chunks which were read, each chunk containing 2 hashes.
+ In case the number of rows is not cleanly divisible by `:chunk_size`, the last chunk contains fewer hashes.
+
+     > pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
+     => [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>"2"}, {:first=>"Lucy", :last=>"Laweless", :cats=>"5"} ],
+          [ {:first=>"Miles", :last=>"O'Brian", :fish=>"21"}, {:first=>"Nancy", :last=>"Homes", :dogs=>"2", :birds=>"1"} ]
+        ]
+
+ #### Example 1c: How SmarterCSV processes CSV-files as chunks, and passes arrays of hashes to a given block:
+ Please note how the given block is passed the data for each chunk as the parameter (array of hashes),
+ and how the `process` method returns the number of chunks when called with a block
+
+     > total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
+         chunk.each do |h|   # you can post-process the data from each row to your heart's content, and also create virtual attributes:
+           h[:full_name] = [h[:first],h[:last]].join(' ')   # create a virtual attribute
+           h.delete(:first) ; h.delete(:last)   # remove two keys
+         end
+         puts chunk.inspect   # we could at this point pass the chunk to a Resque worker..
+       end
+
+     [{:dogs=>"2", :full_name=>"Dan McAllister"}, {:cats=>"5", :full_name=>"Lucy Laweless"}]
+     [{:fish=>"21", :full_name=>"Miles O'Brian"}, {:dogs=>"2", :birds=>"1", :full_name=>"Nancy Homes"}]
+     => 2
+
+ #### Example 2: Reading a CSV-File in one Chunk, returning one Array of Hashes:
 
  filename = '/tmp/input_file.txt' # TAB delimited file, each row ending with Control-M
- recordsA = SmarterCSV.process(filename, {:col_sep => "\t", :row_sep => "\cM"}
+ recordsA = SmarterCSV.process(filename, {:col_sep => "\t", :row_sep => "\cM"}) # no block given
 
  => returns an array of hashes
 
- #### Example 2: Populate a MySQL or MongoDB Database with SmarterCSV:
+ #### Example 3: Populate a MySQL or MongoDB Database with SmarterCSV:
 
  # without using chunks:
  filename = '/tmp/some.csv'
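The chunking semantics shown in Example 1b (a short final chunk when the row count is not evenly divisible by `:chunk_size`) can be sketched with core Ruby's `Enumerable#each_slice`, without the gem; the row hashes below are made-up stand-ins for parsed CSV rows:

```ruby
# Stand-alone sketch of SmarterCSV's chunking semantics, using only core Ruby.
# The rows are hypothetical stand-ins for hashes parsed from a CSV file.
rows = [
  { first: "Dan",   dogs: "2"  },
  { first: "Lucy",  cats: "5"  },
  { first: "Miles", fish: "21" },
]

chunk_size = 2
chunks = rows.each_slice(chunk_size).to_a

# 3 rows with chunk_size 2 yield two chunks, of sizes 2 and 1
puts chunks.map(&:size).inspect  # => [2, 1]
```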
@@ -40,29 +92,71 @@ As the existing CSV libraries didn't fit my needs, I was writing my own CSV proc
  => returns number of chunks / rows we processed
 
 
- #### Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
+ #### Example 4: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
 
  # using chunks:
  filename = '/tmp/some.csv'
- n = SmarterCSV.process(filename, {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}, :chunk_size => 100}) do |array|
+ n = SmarterCSV.process(filename, {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}) do |chunk|
  # we're passing a block in, to process each resulting hash / row (block takes array of hashes)
- # when chunking is enabled, there are up to :chunk_size hashes in each array
- MyModel.collection.insert( array ) # insert up to 100 records at a time
+ # when chunking is enabled, there are up to :chunk_size hashes in each chunk
+ MyModel.collection.insert( chunk ) # insert up to 100 records at a time
  end
 
  => returns number of chunks we processed
 
 
- #### Example 4: Reading a CSV-like File, and Processing it with Resque:
+ #### Example 5: Reading a CSV-like File, and Processing it with Resque:
 
  filename = '/tmp/strange_db_dump' # a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes)
  n = SmarterCSV.process(filename, {:col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
- :chunk_size => '5' , :key_mapping => {:export_date => nil, :name => :genre}}) do |x|
- puts "Resque.enque( ResqueWorkerClass, #{x.size}, #{x.inspect} )" # simulate processing each chunk
+ :chunk_size => 100 , :key_mapping => {:export_date => nil, :name => :genre}}) do |chunk|
+ Resque.enqueue( ResqueWorkerClass, chunk ) # pass chunks of CSV-data to Resque workers for parallel processing
  end
  => returns number of chunks
 
 
+ ## Documentation
+
+ The `process` method reads and processes a "generalized" CSV file and returns the contents either as an Array of Hashes,
+ or an Array of Arrays, which contain Hashes, or processes Chunks of Hashes via a given block.
+
+     SmarterCSV.process(filename, options={}, &block)
+
+ The options and the block are optional.
+
+ `SmarterCSV.process` supports the following options:
+ * :col_sep : column separator, which defaults to ','
+ * :row_sep : row separator or record separator, defaults to system's $/ , which defaults to "\n"
+ * :quote_char : quotation character, defaults to '"'
+ * :comment_regexp : regular expression which matches comment lines, defaults to /^#/ (see NOTE about the CSV header)
+ * :chunk_size : if set, determines the desired chunk-size (defaults to nil, no chunk processing)
+ * :key_mapping : a hash which maps headers from the CSV file to keys in the result hash (default: nil)
+ * :downcase_header : downcase all column headers (default: true)
+ * :strings_as_keys : use strings instead of symbols as the keys in the result hashes (default: false)
+ * :remove_empty_values : remove values which have nil or empty strings as values (default: true)
+ * :remove_zero_values : remove values which have a numeric value equal to zero / 0 (default: false)
+ * :remove_values_matching : removes key/value pairs if value matches given regular expressions (default: nil), e.g. /^\$0\.0+$/ to match $0.00 , or /^#VALUE!$/ to match errors in Excel spreadsheets
+ * :convert_values_to_numeric : converts strings containing Integers or Floats to the appropriate class (default: true)
+ * :remove_empty_hashes : remove / ignore any hashes which don't have any key/value pairs (default: true)
+
+ #### NOTES about CSV Headers:
+ * as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
+ * the first line with the CSV header may or may not be commented out according to the :comment_regexp
+ * any occurrences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
+ * any of the keys in the header line will be downcased, spaces replaced by underscore, and converted to Ruby symbols before being used as keys in the returned Hashes
+
+ #### NOTES on Key Mapping:
+ * keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
+ * if you want to completely delete a key, then map it to nil or to '', and it will be automatically deleted from any result Hash
+
+ #### NOTES on the use of Chunking and Blocks:
+ * chunking can be VERY USEFUL if used in combination with passing a block to `SmarterCSV.process` FOR LARGE FILES
+ * if you pass a block to `SmarterCSV.process`, that block will be executed and given an Array of Hashes as the parameter.
+ * if the chunk_size is not set, then the array will only contain one Hash.
+ * if the chunk_size is > 0 , then the array may contain up to chunk_size Hashes.
+ * this can be very useful when passing chunked data to a post-processing step, e.g. through Resque
+
+
  ## See also:
 
  http://www.unixgods.org/~tilo/Ruby/process_csv_as_hashes.html
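The cleanup options documented in the hunk above (`:remove_empty_values`, `:remove_zero_values`, `:convert_values_to_numeric`) can be illustrated with a minimal plain-Ruby sketch that mimics, rather than calls, the gem's per-row logic; the sample row hash is hypothetical:

```ruby
# Minimal sketch of the per-row cleanup described by the options above.
# CSV fields arrive as Strings; this mimics (does not call) SmarterCSV's logic.
row = { name: "Dan", dogs: "2", cats: "", price: "0.00", weight: "12.5" }

row.delete_if { |_k, v| v.nil? || v =~ /\A\s*\z/ }                  # :remove_empty_values
row.delete_if { |_k, v| v =~ /\A(\d+|\d+\.\d+)\z/ && v.to_f == 0 }  # :remove_zero_values
row = row.transform_values do |v|                                   # :convert_values_to_numeric
  case v
  when /\A\d+\z/      then v.to_i
  when /\A\d+\.\d+\z/ then v.to_f
  else v
  end
end

puts row.inspect  # => {:name=>"Dan", :dogs=>2, :weight=>12.5}
```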
@@ -83,9 +177,35 @@ Or install it yourself as:
 
  $ gem install smarter_csv
 
- ## Usage
 
- TODO: Write usage instructions here
+ ## Changes
+
+ #### 1.0.1 (2012-07-30)
+
+ * added the following options:
+   * :downcase_header
+   * :strings_as_keys
+   * :remove_zero_values
+   * :remove_values_matching
+   * :remove_empty_hashes
+   * :convert_values_to_numeric
+
+ * renamed the following options:
+   * :remove_empty_fields => :remove_empty_values
+
+
+ #### 1.0.0 (2012-07-29)
+
+ * renamed `SmarterCSV.process_csv` to `SmarterCSV.process`.
+
+ #### 1.0.0.pre1 (2012-07-29)
+
+
+ ## Reporting Bugs / Feature Requests
+
+ Please [open an Issue on GitHub](https://github.com/tilo/smarter_csv/issues) if you have feedback, new feature requests, or want to report a bug. Thank you!
+
+
 
  ## Contributing
 
@@ -1,37 +1,9 @@
  module SmarterCSV
- # this reads and processes a "generalized" CSV file and returns the contents either as an Array of Hashes,
- # or an Array of Arrays, which contain Hashes, or processes Chunks of Hashes via a given block
- #
- # File.read_csv supports the following options:
- # * :col_sep : column separator , which defaults to ','
- # * :row_sep : row separator or record separator , defaults to system's $/ , which defaults to "\n"
- # * :quote_char : quotation character , defaults to '"' (currently not used)
- # * :comment_regexp : regular expression which matches comment lines , defaults to /^#/ (see NOTE about the CSV header)
- # * :chunk_size : if set, determines the desired chunk-size (defaults to nil, no chunk processing)
- # * :remove_empty_fields : remove fields which have nil or empty strings as values (default: true)
- #
- # NOTES about CSV Headers:
- # - as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
- # - the first line with the CSV header may or may not be commented out according to the :comment_regexp
- # - any occurences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
- # - any of the keys in the header line will be converted to Ruby symbols before being used in the returned Hashes
- #
- # NOTES on Key Mapping:
- # - keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes
- #   can be better used internally in our application (e.g. when directly creating MongoDB entries with them)
- # - if you want to completely delete a key, then map it to nil or to '', they will be automatically deleted from any result Hash
- #
- # NOTES on the use of Chunking and Blocks:
- # - chunking can be VERY USEFUL if used in combination with passing a block to File.read_csv FOR LARGE FILES
- # - if you pass a block to File.read_csv, that block will be executed and given an Array of Hashes as the parameter.
- #   If the chunk_size is not set, then the array will only contain one Hash.
- #   If the chunk_size is > 0 , then the array may contain up to chunk_size Hashes.
- #   This can be very useful when passing chunked data to a post-processing step, e.g. through Resque
- #
-
  def SmarterCSV.process(filename, options={}, &block)
- default_options = {:col_sep => ',' , :row_sep => $/ , :quote_char => '"', :remove_empty_fields => true,
-   :comment_regexp => /^#/, :chunk_size => nil , :key_mapping_hash => nil
+ default_options = {:col_sep => ',' , :row_sep => $/ , :quote_char => '"',
+   :remove_empty_values => true, :remove_zero_values => false , :remove_values_matching => nil , :remove_empty_hashes => true ,
+   :convert_values_to_numeric => true, :strip_chars_from_headers => nil ,
+   :comment_regexp => /^#/, :chunk_size => nil , :key_mapping_hash => nil , :downcase_header => true, :strings_as_keys => false
  }
  options = default_options.merge(options)
  headerA = []
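The `default_options.merge(options)` call above is the standard Ruby idiom for letting caller-supplied options override defaults on key collisions; a tiny self-contained illustration:

```ruby
# User-supplied options win over defaults on key collisions; unspecified keys
# keep their default values. (Sample keys mirror two of the gem's options.)
default_options = { col_sep: ',', chunk_size: nil }
options = default_options.merge(chunk_size: 100)

puts options.inspect  # => {:col_sep=>",", :chunk_size=>100}
```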
@@ -43,7 +15,11 @@ module SmarterCSV
 
  # process the header line in the CSV file..
  # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
- headerA = f.readline.sub(options[:comment_regexp],'').chomp(options[:row_sep]).split(options[:col_sep]).map{|x| x.gsub(%r/options[:quote_char]/,'').gsub(/\s+/,'_').to_sym}
+ headerA = f.readline.sub(options[:comment_regexp],'').chomp(options[:row_sep])
+ headerA = headerA.gsub(options[:strip_chars_from_headers], '') if options[:strip_chars_from_headers]
+ headerA = headerA.split(options[:col_sep]).map{|x| x.gsub(%r/options[:quote_char]/,'').gsub(/\s+/,'_')}
+ headerA.map!{|x| x.downcase } if options[:downcase_header]
+ headerA.map!{|x| x.to_sym } unless options[:strings_as_keys]
  key_mappingH = options[:key_mapping]
 
  # do some key mapping on the keys in the file header
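The header steps added above (strip the quote character, replace whitespace with underscores, optionally downcase, and symbolize unless `:strings_as_keys`) can be sketched in isolation. The header line below is a made-up sample; note the sketch uses a literal `gsub('"', '')`, since the gem's own `%r/options[:quote_char]/` is a literal regexp that does not interpolate the option:

```ruby
# Stand-alone sketch of the header normalization shown in the hunk above.
header_line = '"first name","last name",Dogs,Cats'

keys = header_line.split(',')
                  .map { |x| x.gsub('"', '').gsub(/\s+/, '_') }  # strip quotes, spaces -> underscores
keys.map!(&:downcase)  # :downcase_header => true (default)
keys.map!(&:to_sym)    # :strings_as_keys => false (default), so symbolize

puts keys.inspect  # => [:first_name, :last_name, :dogs, :cats]
```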
@@ -72,8 +48,21 @@ module SmarterCSV
  hash = Hash.zip(headerA,dataA) # from Facets of Ruby library
  # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
  hash.delete(nil); hash.delete(''); hash.delete(:"") # delete any hash keys which were mapped to be deleted
- hash.delete_if{|k,v| v.nil? || v =~ /^\s*$/} if options[:remove_empty_fields]
-
+ hash.delete_if{|k,v| v.nil? || v =~ /^\s*$/} if options[:remove_empty_values]
+ hash.delete_if{|k,v| ! v.nil? && v =~ /^(\d+|\d+\.\d+)$/ && v.to_f == 0} if options[:remove_zero_values] # values are typically Strings!
+ hash.delete_if{|k,v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
+ if options[:convert_values_to_numeric]
+   hash.each do |k,v|
+     case v
+     when /^\d+$/
+       hash[k] = v.to_i
+     when /^\d+\.\d+$/
+       hash[k] = v.to_f
+     end
+   end
+ end
+ next if hash.empty? if options[:remove_empty_hashes]
+
  if use_chunks
  chunk << hash # append temp result to chunk
 
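The `Hash.zip(headerA, dataA)` call above comes from the Facets gem; in modern core Ruby the same pairing of header keys with row values can be done with `Array#zip` plus `to_h`, as this small sketch (with hypothetical sample data) shows:

```ruby
# Plain-Ruby equivalent of Facets' Hash.zip: pair header keys with row values.
header = [:first_name, :last_name, :dogs]
data   = ["Dan", "McAllister", "2"]

row = header.zip(data).to_h
puts row.inspect  # => {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"}
```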
@@ -1,3 +1,3 @@
  module SmarterCSV
- VERSION = "1.0.0"
+ VERSION = "1.0.1"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: smarter_csv
  version: !ruby/object:Gem::Version
- version: 1.0.0
+ version: 1.0.1
  prerelease:
  platform: ruby
  authors:
@@ -11,7 +11,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-07-29 00:00:00.000000000 Z
+ date: 2012-07-30 00:00:00.000000000 Z
  dependencies: []
  description: Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, with
  optional features for processing large files in parallel, embedded comments, unusual