smarter_csv 1.0.0.pre1
- data/.gitignore +8 -0
- data/.rvmrc +1 -0
- data/Gemfile +4 -0
- data/LICENSE +23 -0
- data/README.md +98 -0
- data/Rakefile +2 -0
- data/lib/extensions/hash.rb +9 -0
- data/lib/smarter_csv.rb +4 -0
- data/lib/smarter_csv/smarter_csv.rb +111 -0
- data/lib/smarter_csv/version.rb +3 -0
- data/smarter_csv.gemspec +17 -0
- metadata +63 -0
data/.gitignore
ADDED
data/.rvmrc
ADDED
@@ -0,0 +1 @@
+rvm gemset use smarter_csv
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,23 @@
+Copyright (c) 2012 Tilo Sloboda
+
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,98 @@
+# SmarterCSV
+
+`smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with Mongoid or ActiveRecord,
+and parallel processing with Resque or Sidekiq.
+
+`smarter_csv` has lots of optional features:
+* able to process large CSV-files
+* able to chunk the input from the CSV file to avoid loading the whole CSV file into memory
+* returns a Hash for each line of the CSV file, so we can quickly use the results for either creating MongoDB or ActiveRecord entries, or further processing with Resque
+* able to pass a block to the method, so data from the CSV file can be directly processed (e.g. Resque.enqueue)
+* has a more flexible input format, where comments are possible, and col_sep / row_sep can be set to any character sequence, including control characters
+* able to re-map CSV "column names" to Hash-keys of your choice (normalization)
+* able to ignore "columns" in the input (delete columns)
+* able to eliminate nil or empty fields from the result hashes
+
+### Why?
+
+Ruby's CSV library's API is pretty old, and its processing of CSV-files, returning Arrays of Arrays, feels 'very close to the metal'. The output is not easy to use - especially not if you want to create database records from it. Another shortcoming is that Ruby's CSV library does not have good support for huge CSV-files, e.g. there is no support for 'chunking' and/or parallel processing of the CSV-content (e.g. with Resque or Sidekiq).
+
+As the existing CSV libraries didn't fit my needs, I wrote my own CSV processing - specifically for use in connection with Rails ORMs like Mongoid, MongoMapper or ActiveRecord. In those ORMs you can easily pass a hash with attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept arrays of such hashes, to create a larger number of records quickly with just one call.
+
+### Examples
+#### Example 1: Reading a CSV-File in one Chunk, returning one Array of Hashes:
+
+    filename = '/tmp/input_file.txt' # TAB delimited file, each row ending with Control-M
+    recordsA = SmarterCSV.process_csv(filename, {:col_sep => "\t", :row_sep => "\cM"})
+
+    => returns an array of hashes
+
+#### Example 2: Populate a MySQL or MongoDB Database with SmarterCSV:
+
+    # without using chunks:
+    filename = '/tmp/some.csv'
+    n = SmarterCSV.process_csv(filename, {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}) do |array|
+      # we're passing a block in, to process each resulting hash / row (the block takes an array of hashes)
+      # when chunking is not enabled, there is only one hash in each array
+      MyModel.create( array.first )
+    end
+
+    => returns number of chunks / rows we processed
+
+
+#### Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
+
+    # using chunks:
+    filename = '/tmp/some.csv'
+    n = SmarterCSV.process_csv(filename, {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}, :chunk_size => 100}) do |array|
+      # we're passing a block in, to process each resulting hash / row (the block takes an array of hashes)
+      # when chunking is enabled, there are up to :chunk_size hashes in each array
+      MyModel.collection.insert( array ) # insert up to 100 records at a time
+    end
+
+    => returns number of chunks we processed
+
+
+#### Example 4: Reading a CSV-like File, and Processing it with Resque:
+
+    filename = '/tmp/strange_db_dump' # a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes)
+    n = SmarterCSV.process_csv(filename, {:col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
+                                          :chunk_size => 5, :key_mapping => {:export_date => nil, :name => :genre}}) do |x|
+      puts "Resque.enqueue( ResqueWorkerClass, #{x.size}, #{x.inspect} )" # simulate processing each chunk
+    end
+    => returns number of chunks
+
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+    gem 'smarter_csv'
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install smarter_csv
+
+## Usage
+
+TODO: Write usage instructions here
+
+## Contributing
+
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Added some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
+
+
+## See also:
+
+http://www.unixgods.org/~tilo/Ruby/process_csv_as_hashes.html
+
+https://gist.github.com/3101950
+
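To tie the README examples together, here is a minimal, self-contained sketch of the API as documented above. The file path, its contents, and the `rows` variable are hypothetical illustrations, not part of the gem:

    require 'smarter_csv'

    # Hypothetical sample data: a plain comma-separated file with a header line.
    File.write('/tmp/pets.csv', "first name,pet\nAnna,cat\nBob,dog\n")

    # Without :chunk_size, each block call receives an array holding one hash per row;
    # header names are symbolized, with whitespace replaced by underscores.
    rows = []
    SmarterCSV.process_csv('/tmp/pets.csv') do |array|
      rows << array.first
    end
    rows  # => [{:first_name=>"Anna", :pet=>"cat"}, {:first_name=>"Bob", :pet=>"dog"}]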
data/Rakefile
ADDED
data/lib/smarter_csv.rb
ADDED
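The contents of this 4-line loader are not included in the diff view. A plausible reconstruction (hypothetical, for orientation only) would simply require the gem's parts:

    # Hypothetical reconstruction of data/lib/smarter_csv.rb (+4 lines, not shown in this diff):
    require 'smarter_csv/version'
    require 'smarter_csv/smarter_csv'
    require 'extensions/hash'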
data/lib/smarter_csv/smarter_csv.rb
ADDED
@@ -0,0 +1,111 @@
+module SmarterCSV
+  # this reads and processes a "generalized" CSV file and returns the contents either as an Array of Hashes,
+  # or an Array of Arrays, which contain Hashes, or processes Chunks of Hashes via a given block
+  #
+  # SmarterCSV.process_csv supports the following options:
+  # * :col_sep : column separator, which defaults to ','
+  # * :row_sep : row separator or record separator, defaults to system's $/ , which defaults to "\n"
+  # * :quote_char : quotation character, defaults to '"' (currently not used)
+  # * :comment_regexp : regular expression which matches comment lines, defaults to /^#/ (see NOTE about the CSV header)
+  # * :chunk_size : if set, determines the desired chunk-size (defaults to nil, no chunk processing)
+  # * :remove_empty_fields : remove fields which have nil or empty strings as values (default: true)
+  #
+  # NOTES about CSV Headers:
+  # - as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
+  # - the first line with the CSV header may or may not be commented out according to the :comment_regexp
+  # - any occurrences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
+  # - any of the keys in the header line will be converted to Ruby symbols before being used in the returned Hashes
+  #
+  # NOTES on Key Mapping:
+  # - keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes
+  #   can be better used internally in our application (e.g. when directly creating MongoDB entries with them)
+  # - if you want to completely delete a key, then map it to nil or to '' ; it will be automatically deleted from any result Hash
+  #
+  # NOTES on the use of Chunking and Blocks:
+  # - chunking can be VERY USEFUL if used in combination with passing a block to SmarterCSV.process_csv FOR LARGE FILES
+  # - if you pass a block to SmarterCSV.process_csv, that block will be executed and given an Array of Hashes as the parameter.
+  #   If the chunk_size is not set, then the array will only contain one Hash.
+  #   If the chunk_size is > 0 , then the array may contain up to chunk_size Hashes.
+  #   This can be very useful when passing chunked data to a post-processing step, e.g. through Resque
+  #
+
+  def SmarterCSV.process_csv(filename, options={}, &block)
+    default_options = {:col_sep => ',' , :row_sep => $/ , :quote_char => '"', :remove_empty_fields => true,
+                       :comment_regexp => /^#/, :chunk_size => nil , :key_mapping => nil
+                      }
+    options = default_options.merge(options)
+    headerA = []
+    result = []
+    old_row_sep = $/
+    begin
+      $/ = options[:row_sep]
+      f = File.open(filename, "r")
+      chunk_count = 0 # defined up front, so it is a valid return value even when chunking is not used
+      # process the header line in the CSV file..
+      # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
+      headerA = f.readline.sub(options[:comment_regexp],'').chomp(options[:row_sep]).split(options[:col_sep]).map{|x| x.gsub(options[:quote_char],'').gsub(/\s+/,'_').to_sym}
+      key_mappingH = options[:key_mapping]
+
+      # do some key mapping on the keys in the file header
+      # if you want to completely delete a key, then map it to nil or to ''
+      if ! key_mappingH.nil? && key_mappingH.class == Hash && key_mappingH.keys.size > 0
+        headerA.map!{|x| key_mappingH.has_key?(x) ? (key_mappingH[x].nil? ? nil : key_mappingH[x].to_sym) : x}
+      end
+
+      # in case we use chunking.. we'll need to set it up..
+      if ! options[:chunk_size].nil? && options[:chunk_size].to_i > 0
+        use_chunks = true
+        chunk_size = options[:chunk_size].to_i
+        chunk_count = 0
+        chunk = []
+      else
+        use_chunks = false
+      end
+
+      # now on to processing all the rest of the lines in the CSV file:
+      while ! f.eof? # we can't use f.readlines() here, because this would read the whole file into memory at once
+        line = f.readline # read one line.. this uses the input_record_separator $/ which we set previously!
+        next if line =~ options[:comment_regexp] # ignore all comment lines if there are any
+        line.chomp! # removes the trailing record separator; chomp! uses $/ , which is set to options[:row_sep]
+
+        dataA = line.split(options[:col_sep])
+        hash = Hash.zip(headerA,dataA) # from Facets of Ruby library
+        # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
+        hash.delete(nil); hash.delete(''); hash.delete(:"") # delete any hash keys which were mapped to be deleted
+        hash.delete_if{|k,v| v.nil? || v =~ /^\s*$/} if options[:remove_empty_fields]
+
+        if use_chunks
+          chunk << hash # append temp result to chunk
+
+          if chunk.size >= chunk_size || f.eof? # if the chunk is full, or EOF is reached
+            # do something with the chunk
+            if block_given?
+              yield chunk # do something with the hashes in the chunk in the block
+            else
+              result << chunk # not sure yet, why anybody would want to do this without a block
+            end
+            chunk_count += 1
+            chunk = [] # initialize for next chunk of data
+          end
+          # while a chunk is being filled up we don't need to do anything else here
+
+        else # no chunk handling
+          if block_given?
+            yield [hash] ; chunk_count += 1 # do something with the hash in the block (better to use chunking here)
+          else
+            result << hash
+          end
+        end
+      end
+    ensure
+      $/ = old_row_sep ; f.close if f && ! f.closed? # always reset this global variable to its previous value, and close the file
+    end
+    if block_given?
+      return chunk_count # when we do processing through a block we only care how many chunks we processed
+    else
+      return result # returns either an Array of Hashes, or an Array of Arrays of Hashes (if in chunked mode)
+    end
+  end
+
+end
+
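The parser above depends on `Hash.zip` (from the Facets library, vendored here as lib/extensions/hash.rb, whose 9 lines are not shown in this diff). A minimal compatible sketch, assuming it simply pairs header symbols with field values:

    # Minimal sketch of a Hash.zip compatible with the call in process_csv;
    # the real definition lives in lib/extensions/hash.rb (not shown above).
    class Hash
      def self.zip(keys, values)
        Hash[keys.zip(values)] # e.g. Hash.zip([:name, :pet], ["Anna", "cat"]) => {:name=>"Anna", :pet=>"cat"}
      end
    end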
data/smarter_csv.gemspec
ADDED
@@ -0,0 +1,17 @@
+# -*- encoding: utf-8 -*-
+require File.expand_path('../lib/smarter_csv/version', __FILE__)
+
+Gem::Specification.new do |gem|
+  gem.authors       = ["Tilo Sloboda\n"]
+  gem.email         = ["tilo.sloboda@gmail.com\n"]
+  gem.description   = %q{Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, with optional features for processing large files in parallel, embedded comments, unusual field- and record-separators, flexible mapping of CSV-headers to Hash-keys}
+  gem.summary       = %q{Ruby Gem for smarter importing of CSV Files (and CSV-like files), with lots of optional features, e.g. chunked processing for huge CSV files}
+  gem.homepage      = ""
+
+  gem.files         = `git ls-files`.split($\)
+  gem.executables   = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
+  gem.test_files    = gem.files.grep(%r{^(test|spec|features)/})
+  gem.name          = "smarter_csv"
+  gem.require_paths = ["lib"]
+  gem.version       = SmarterCSV::VERSION
+end
metadata
ADDED
@@ -0,0 +1,63 @@
+--- !ruby/object:Gem::Specification
+name: smarter_csv
+version: !ruby/object:Gem::Version
+  version: 1.0.0.pre1
+prerelease: 6
+platform: ruby
+authors:
+- ! 'Tilo Sloboda
+
+'
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2012-07-29 00:00:00.000000000 Z
+dependencies: []
+description: Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, with
+  optional features for processing large files in parallel, embedded comments, unusual
+  field- and record-separators, flexible mapping of CSV-headers to Hash-keys
+email:
+- ! 'tilo.sloboda@gmail.com
+
+'
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- .rvmrc
+- Gemfile
+- LICENSE
+- README.md
+- Rakefile
+- lib/extensions/hash.rb
+- lib/smarter_csv.rb
+- lib/smarter_csv/smarter_csv.rb
+- lib/smarter_csv/version.rb
+- smarter_csv.gemspec
+homepage: ''
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>'
+    - !ruby/object:Gem::Version
+      version: 1.3.1
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.15
+signing_key:
+specification_version: 3
+summary: Ruby Gem for smarter importing of CSV Files (and CSV-like files), with lots
+  of optional features, e.g. chunked processing for huge CSV files
+test_files: []