csv_fast_importer 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (91) hide show
  1. checksums.yaml +7 -0
  2. data/.gitignore +36 -0
  3. data/.ruby-version +1 -0
  4. data/.travis.yml +15 -0
  5. data/CONTRIBUTING.md +24 -0
  6. data/Gemfile +3 -0
  7. data/Gemfile.lock +128 -0
  8. data/LICENSE +21 -0
  9. data/README.md +186 -0
  10. data/Rakefile +44 -0
  11. data/benchmark/NPRI-SubsDisp-Normalized-Since1993.csv +10000 -0
  12. data/benchmark/README.md +140 -0
  13. data/benchmark/benchmark.rb +26 -0
  14. data/benchmark/results.png +0 -0
  15. data/benchmark/results.xlsx +0 -0
  16. data/benchmark/strategies.rb +115 -0
  17. data/benchmark/tools.rb +61 -0
  18. data/csv_fast_importer.gemspec +42 -0
  19. data/lib/csv_fast_importer.rb +12 -0
  20. data/lib/csv_fast_importer/configuration.rb +57 -0
  21. data/lib/csv_fast_importer/database/mysql.rb +28 -0
  22. data/lib/csv_fast_importer/database/postgres.rb +36 -0
  23. data/lib/csv_fast_importer/database/queryable.rb +51 -0
  24. data/lib/csv_fast_importer/database_connection.rb +19 -0
  25. data/lib/csv_fast_importer/database_factory.rb +19 -0
  26. data/lib/csv_fast_importer/import.rb +58 -0
  27. data/lib/csv_fast_importer/version.rb +3 -0
  28. data/sample-app/.gitignore +10 -0
  29. data/sample-app/Gemfile +50 -0
  30. data/sample-app/Gemfile.lock +172 -0
  31. data/sample-app/README.md +23 -0
  32. data/sample-app/Rakefile +6 -0
  33. data/sample-app/app/assets/images/.keep +0 -0
  34. data/sample-app/app/assets/javascripts/application.js +16 -0
  35. data/sample-app/app/assets/stylesheets/application.css +15 -0
  36. data/sample-app/app/controllers/application_controller.rb +5 -0
  37. data/sample-app/app/controllers/concerns/.keep +0 -0
  38. data/sample-app/app/helpers/application_helper.rb +2 -0
  39. data/sample-app/app/mailers/.keep +0 -0
  40. data/sample-app/app/models/.keep +0 -0
  41. data/sample-app/app/models/concerns/.keep +0 -0
  42. data/sample-app/app/models/knight.rb +2 -0
  43. data/sample-app/app/views/layouts/application.html.erb +14 -0
  44. data/sample-app/bin/bundle +3 -0
  45. data/sample-app/bin/rails +9 -0
  46. data/sample-app/bin/rake +9 -0
  47. data/sample-app/bin/setup +29 -0
  48. data/sample-app/bin/spring +17 -0
  49. data/sample-app/config.ru +4 -0
  50. data/sample-app/config/application.rb +26 -0
  51. data/sample-app/config/boot.rb +3 -0
  52. data/sample-app/config/database.yml +21 -0
  53. data/sample-app/config/environment.rb +5 -0
  54. data/sample-app/config/environments/development.rb +41 -0
  55. data/sample-app/config/environments/production.rb +79 -0
  56. data/sample-app/config/environments/test.rb +42 -0
  57. data/sample-app/config/initializers/assets.rb +11 -0
  58. data/sample-app/config/initializers/backtrace_silencers.rb +7 -0
  59. data/sample-app/config/initializers/cookies_serializer.rb +3 -0
  60. data/sample-app/config/initializers/filter_parameter_logging.rb +4 -0
  61. data/sample-app/config/initializers/inflections.rb +16 -0
  62. data/sample-app/config/initializers/mime_types.rb +4 -0
  63. data/sample-app/config/initializers/session_store.rb +3 -0
  64. data/sample-app/config/initializers/wrap_parameters.rb +14 -0
  65. data/sample-app/config/locales/en.yml +23 -0
  66. data/sample-app/config/routes.rb +56 -0
  67. data/sample-app/config/secrets.yml +22 -0
  68. data/sample-app/db/development.sqlite3 +0 -0
  69. data/sample-app/db/migrate/20170818134706_create_knights.rb +8 -0
  70. data/sample-app/db/schema.rb +24 -0
  71. data/sample-app/db/seeds.rb +7 -0
  72. data/sample-app/knights.csv +3 -0
  73. data/sample-app/lib/assets/.keep +0 -0
  74. data/sample-app/lib/tasks/.keep +0 -0
  75. data/sample-app/lib/tasks/csv_fast_importer.rake +9 -0
  76. data/sample-app/log/.keep +0 -0
  77. data/sample-app/public/404.html +67 -0
  78. data/sample-app/public/422.html +67 -0
  79. data/sample-app/public/500.html +66 -0
  80. data/sample-app/public/favicon.ico +0 -0
  81. data/sample-app/public/robots.txt +5 -0
  82. data/sample-app/test/controllers/.keep +0 -0
  83. data/sample-app/test/fixtures/.keep +0 -0
  84. data/sample-app/test/fixtures/knights.yml +9 -0
  85. data/sample-app/test/helpers/.keep +0 -0
  86. data/sample-app/test/integration/.keep +0 -0
  87. data/sample-app/test/mailers/.keep +0 -0
  88. data/sample-app/test/models/.keep +0 -0
  89. data/sample-app/test/models/knight_test.rb +7 -0
  90. data/sample-app/test/test_helper.rb +10 -0
  91. metadata +331 -0
@@ -0,0 +1,140 @@
1
+ # Benchmark
2
+
3
+ ## Description
4
+
5
+ There are many ways to import CSV files in a database. Some are based on native ruby libraries, other on dedicated gems.
6
+ We have tried here to build a big picture of all the main strategies.
7
+
8
+ :point_right: If you think one is missing, do not hesitate to create an issue, or better, submit a pull request.
9
+
10
+ ## Modus operandi
11
+
12
+ For each identified strategy, a **10 000 lines** CSV file (created from the first lines of `NPRI-SubsDisp-Normalized-Since1993.csv`) is imported into a **PostgreSQL** database. A small file is used here because some strategies would have taken hours to import a file with millions of lines.
13
+
14
+ :information_source: `NPRI-SubsDisp-Normalized-Since1993.csv` was downloaded from [canadian open data](http://ouvert.canada.ca/data/fr/dataset).
15
+
16
+ :information_source: Duration measure includes file reading and database writing, after transaction commit.
17
+
18
+ ## Strategies
19
+
20
+ `Dataset` is an ActiveRecord model.
21
+
22
+ `file` is the file to import.
23
+
24
+ ### CSV.foreach + ActiveRecord .create
25
+
26
+ ```ruby
27
+ require 'csv'
28
+ Dataset.transaction do
29
+ CSV.foreach(file, headers: true) do |row|
30
+ Dataset.create!(row.to_hash)
31
+ end
32
+ end
33
+ ```
34
+
35
+ ### [SmarterCSV](https://github.com/tilo/smarter_csv) 1.1.4 + ActiveRecord .create
36
+
37
+ CSV file reading can be customized with chunk size (this may affect performance).
38
+
39
+ ```ruby
40
+ require 'smarter_csv'
41
+ Dataset.transaction do
42
+ SmarterCSV.process(file.path, chunk_size: 1000) do |dataset_attributes|
43
+ Dataset.create! dataset_attributes
44
+ end
45
+ end
46
+ ```
47
+
48
+ ### [SmarterCSV](https://github.com/tilo/smarter_csv) 1.1.4 + [activerecord-import](https://github.com/zdennis/activerecord-import) 0.10.0
49
+
50
+ `activerecord-import` becomes efficient when importing multiple rows at the same time. But importing the whole CSV file at once is not a solution because of its memory footprint :boom:. So, we read the CSV file here in batches. This is done with `SmarterCSV`, which has a small effect on global performance (see results).
51
+
52
+ :information_source: Model validations are skipped here to improve performances even if no validation was defined.
53
+
54
+ ```ruby
55
+ require 'smarter_csv'
56
+ require 'activerecord-import/base'
57
+ SmarterCSV.process(file.path, chunk_size: 1000) do |dataset_attributes|
58
+ datasets = dataset_attributes.map { |attributes| Dataset.new attributes }
59
+ Dataset.import dataset_attributes.first.keys, datasets, batch_size: 100, validate: false
60
+ end
61
+ ```
62
+
63
+ ### [SmarterCSV](https://github.com/tilo/smarter_csv) 1.1.4 + [bulk_insert](https://github.com/jamis/bulk_insert) 1.5.0
64
+
65
+ Same constraints as `activerecord-import`: batch processing improves performance.
66
+
67
+ ```ruby
68
+ require 'smarter_csv'
69
+ require 'bulk_insert'
70
+ SmarterCSV.process(file.path, chunk_size: 1000) do |dataset_attributes|
71
+ Dataset.bulk_insert values: dataset_attributes
72
+ end
73
+ ```
74
+
75
+ ### CSV.foreach + [upsert](https://github.com/seamusabshere/upsert) 2.2.1
76
+
77
+ ```ruby
78
+ require 'csv'
79
+ require 'upsert'
80
+ Upsert.batch(Dataset.connection, Dataset.table_name) do |upsert|
81
+ CSV.foreach(file, headers: true) do |row|
82
+ upsert.row(row.to_hash)
83
+ end
84
+ end
85
+ ```
86
+
87
+ ### [CSVImporter](https://github.com/pcreux/csv-importer) 0.3.2
88
+
89
+ ```ruby
90
+ DatasetCSVImporter.new(path: file.path).run!
91
+ ```
92
+
93
+ ### [ActiveImporter](https://github.com/continuum/active_importer) 0.2.6
94
+
95
+ ```ruby
96
+ DatasetActiveImporter.import file.path
97
+ ```
98
+
99
+ ### [Ferry](https://github.com/cmu-is-projects/ferry) 2.0.0
100
+
101
+ :information_source: `Ferry` is more than just a gem which imports CSV files, but it can also be used to do that.
102
+
103
+ ```ruby
104
+ require 'ferry'
105
+ Ferry::Importer.new.import_csv "benchmark_env", "datasets", file.path
106
+ ```
107
+
108
+ ## Results
109
+
110
+ ![Benchmark](results.png?raw=true "Benchmark")
111
+
112
+ Produced on a MacBookPro (OSX 10.12.6, i5 2.4GHz, 8Go RAM, Flash drive), with local PostgreSQL **9.6.1.0** instance.
113
+
114
+ :information_source: Results variability across multiple executions is lower than 5%.
115
+
116
+ ## Explanations
117
+
118
+ First of all, CSV reading took approximately **400ms** with `CSV.foreach`, and **1000ms** with `SmarterCSV`.
119
+
120
+ We can also notice that all strategies based on Rails' `create!` are very slow. Indeed, this strategy executes each SQL `INSERT` in a dedicated statement, and the whole ActiveRecord process (validations, callbacks...) is also executed. This last point could be very useful in a Rails application, but it is the main drawback when you are looking for performance.
121
+
122
+ `upsert` could be more efficient with an id column in imported file (and a unique constraint in database schema), which is not the case here. To give some idea, duration would be divided by 2 with such additional column.
123
+
124
+ Finally, `CSVFastImport` executes one single statement (with the `COPY` command) which delegates the operation to the PostgreSQL instance. Then, the CSV file is directly read by the database engine without any constraints (SQL standards, communication protocol...). This is the fastest way to import data into a database :rocket:.
125
+
126
+ ## How to execute this benchmark?
127
+
128
+ Start local PostgreSQL instance.
129
+
130
+ Create database
131
+ ```shell
132
+ bundle exec rake test:db:create
133
+ ```
134
+
135
+ Execute benchmark
136
+ ```
137
+ bundle exec rake benchmark
138
+ ```
139
+
140
+ :information_source: Environment variables `DB_USERNAME` and `DB_PASSWORD` will be used for database authentication. Default is anonymous connection (works great with OSX and [Postgres.app](https://postgresapp.com)).
@@ -0,0 +1,26 @@
1
+ require 'active_record'
2
+ require_relative './tools'
3
+
4
+ class Dataset < ActiveRecord::Base
5
+ end
6
+
7
+ db = database_connect
8
+ build_dataset(db, 'datasets', ENV['DATASET_SIZE'] || 10_000) do |file|
9
+ lines_count = count(file)
10
+ puts "Start benchmark with a #{lines_count} lines file."
11
+
12
+ puts "Running benchmark..."
13
+ require_relative './strategies'
14
+ STRATEGIES.each do |label, strategy|
15
+ db.execute 'TRUNCATE TABLE datasets'
16
+ printf "%-35s: ", label
17
+
18
+ duration = measure_duration { strategy.call(file) }
19
+
20
+ warning_message = '(file partially imported)' if Dataset.count < lines_count - 1 # Header
21
+ printf "%20d ms %s\n", duration, warning_message
22
+ end
23
+ end
24
+
25
+ puts
26
+ puts "Benchmark finished."
Binary file
Binary file
@@ -0,0 +1,115 @@
1
+ # -----------------------------------------------------------------------------
2
+ # All tested strategy (implementations).
3
+ # -----------------------------------------------------------------------------
4
+
5
+ STRATEGIES = {}
6
+
7
+ # CSVFastImporter -------------------------------------------------------------
8
+ STRATEGIES['CSVFastImporter'] = lambda do |file|
9
+ CsvFastImporter.import file, col_sep: ','
10
+ end
11
+
12
+ # CSV.foreach + ActiveRecord create -------------------------------------------
13
+ STRATEGIES['CSV.foreach + ActiveRecord create'] = lambda do |file|
14
+ require 'csv'
15
+ Dataset.transaction do
16
+ CSV.foreach(file, headers: true) do |row|
17
+ Dataset.create!(row.to_hash)
18
+ end
19
+ end
20
+ end
21
+
22
+ # SmarterCSV + ActiveRecord create --------------------------------------------
23
+ STRATEGIES['SmarterCSV + ActiveRecord create'] = lambda do |file|
24
+ require 'smarter_csv'
25
+ Dataset.transaction do
26
+ SmarterCSV.process(file.path, chunk_size: 1000) do |dataset_attributes|
27
+ Dataset.create! dataset_attributes
28
+ end
29
+ end
30
+ end
31
+
32
+ # SmarterCSV + activerecord-import --------------------------------------------
33
+ STRATEGIES['SmarterCSV + activerecord-import'] = lambda do |file|
34
+ require 'smarter_csv'
35
+ require 'activerecord-import/base'
36
+ SmarterCSV.process(file.path, chunk_size: 1000) do |dataset_attributes|
37
+ datasets = dataset_attributes.map { |attributes| Dataset.new attributes }
38
+ Dataset.import dataset_attributes.first.keys, datasets, batch_size: 100, validate: false
39
+ end
40
+ end
41
+
42
+ # SmarterCSV + BulkInsert -----------------------------------------------------
43
+ STRATEGIES['SmarterCSV + BulkInsert'] = lambda do |file|
44
+ require 'smarter_csv'
45
+ require 'bulk_insert'
46
+ SmarterCSV.process(file.path, chunk_size: 1000) do |dataset_attributes|
47
+ Dataset.bulk_insert values: dataset_attributes
48
+ end
49
+ # Nearly same performance with following code:
50
+ # Dataset.bulk_insert(set_size: 500) do |worker|
51
+ # SmarterCSV.process(file.path, chunk_size: 500) do |dataset_attributes|
52
+ # dataset_attributes.each do |attributes|
53
+ # worker.add attributes
54
+ # end
55
+ # end
56
+ # end
57
+ end
58
+
59
+ STRATEGIES['CSV.foreach + upsert'] = lambda do |file|
60
+ require 'csv'
61
+ require 'upsert'
62
+ Upsert.logger.level = Logger::ERROR
63
+ Upsert.batch(Dataset.connection, Dataset.table_name) do |upsert|
64
+ CSV.foreach(file, headers: true) do |row|
65
+ upsert.row(row.to_hash)
66
+ end
67
+ end
68
+ end
69
+
70
+
71
+ # CSVImporter -----------------------------------------------------------------
72
+ require 'csv-importer'
73
+ class DatasetCSVImporter
74
+ include CSVImporter
75
+
76
+ model Dataset
77
+ end
78
+
79
+ STRATEGIES['CSVImporter'] = lambda do |file|
80
+ DatasetCSVImporter.new(path: file.path).run!
81
+ end
82
+
83
+ # ActiveImporter --------------------------------------------------------------
84
+ require 'active_importer'
85
+ class DatasetActiveImporter < ActiveImporter::Base
86
+ imports Dataset
87
+ end
88
+
89
+ STRATEGIES['ActiveImporter'] = lambda do |file|
90
+ DatasetActiveImporter.import file.path
91
+ end
92
+
93
+ # ferry -----------------------------------------------------------------------
94
+ STRATEGIES['ferry'] = lambda do |file|
95
+ # Required to make ferry work without a rails application
96
+ require 'yaml'
97
+ FileUtils.mkdir_p('config') unless File.exists?('config')
98
+ config_file = 'config/database.yml'
99
+ FileUtils.touch(config_file)
100
+ config = YAML.load(<<-EOT)
101
+ benchmark_env:
102
+ adapter: postgresql
103
+ database: csv_fast_importer_test
104
+ EOT
105
+ File.open(config_file, 'w') { |f| f.write config.to_yaml }
106
+
107
+ # Prevent progress output
108
+ $stderr.reopen(Tempfile.new('benchmark_ferry').path, "w")
109
+
110
+ require 'ferry'
111
+ Ferry::Importer.new.import_csv "benchmark_env", "datasets", file.path
112
+
113
+ FileUtils.rm(config_file)
114
+ end
115
+
@@ -0,0 +1,61 @@
1
+ # -----------------------------------------------------------------------------
2
+ # Set of usefull methods
3
+ # -----------------------------------------------------------------------------
4
+
5
+ def database_connect
6
+ require_relative '../spec/config/test_database.rb'
7
+ test_db = TestDatabase.new
8
+ test_db.connect
9
+ require 'csv_fast_importer'
10
+ CsvFastImporter::DatabaseFactory.build
11
+ end
12
+
13
+ # Downloaded from http://ouvert.canada.ca/data/fr/dataset
14
+ ORIGINAL_DATASET_FILE = File.new('benchmark/NPRI-SubsDisp-Normalized-Since1993.csv')
15
+
16
+ def build_dataset(db, file_name, lines_count)
17
+ puts "Database schema generation..."
18
+ db.execute "DROP TABLE IF EXISTS #{file_name}"
19
+ db.execute <<-SQL
20
+ CREATE TABLE #{file_name} (
21
+ Reporting_Year smallint NULL,
22
+ NPRI_ID integer NULL,
23
+ Facility_Name varchar(255) NULL,
24
+ Company_Name varchar(255) NULL,
25
+ NAICS integer NULL,
26
+ Province varchar(255) NULL,
27
+ CAS_Number varchar(255) NULL,
28
+ substance_name varchar(255) NULL,
29
+ group_escaped varchar(255) NULL,
30
+ Category varchar(255) NULL,
31
+ Quantity decimal NULL,
32
+ Units varchar(255) NULL,
33
+ Estimation_Method varchar(255) NULL
34
+ )
35
+ SQL
36
+
37
+ dataset_file = File.new("benchmark/#{file_name}.csv", 'w+')
38
+ `head -n #{lines_count} #{ORIGINAL_DATASET_FILE.path} > #{dataset_file.path}`
39
+ yield dataset_file
40
+ File.delete(dataset_file)
41
+ end
42
+
43
+ def count(file)
44
+ `wc -l "#{file.path}"`.strip.split(' ')[0].to_i
45
+ end
46
+
47
+ # In milliseconds
48
+ def measure_duration
49
+ start_time = Time.now
50
+ block_stdout { yield }
51
+ (1000 * (Time.now - start_time)).to_i
52
+ end
53
+
54
+ def block_stdout
55
+ original_stdout = $stdout
56
+ File.open(File::NULL, "w") do |file|
57
+ $stdout = file
58
+ yield
59
+ $stdout = original_stdout
60
+ end
61
+ end
@@ -0,0 +1,42 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'csv_fast_importer/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "csv_fast_importer"
8
+ spec.version = CSVFastImporter::VERSION
9
+ spec.authors = ["Sogilis"]
10
+ spec.email = ["sogilis@sogilis.com"]
11
+
12
+ spec.summary = "Fast CSV Importer"
13
+ spec.description = "Import CSV files' content into a PostgreSQL database. It is based on the Postgre COPY command which is designed to be as faster as possible."
14
+ spec.homepage = "https://github.com/sogilis/csv_fast_importer"
15
+ spec.license = "MIT"
16
+
17
+ spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
18
+ spec.bindir = "exe"
19
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
20
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
21
+ spec.require_paths = ["lib"]
22
+
23
+ spec.required_ruby_version = ">= 2.0"
24
+
25
+ spec.add_development_dependency "bundler", "~> 1.10"
26
+ spec.add_development_dependency "rake", "~> 10.0"
27
+ spec.add_development_dependency "pg", ">= 0.18.4"
28
+ spec.add_development_dependency "mysql2", ">= 0.3.10"
29
+ spec.add_development_dependency "codacy-coverage"
30
+ spec.add_development_dependency "rspec"
31
+
32
+ # Only for benchmark
33
+ spec.add_development_dependency "smarter_csv"
34
+ spec.add_development_dependency "activerecord-import"
35
+ spec.add_development_dependency "bulk_insert"
36
+ spec.add_development_dependency "upsert"
37
+ spec.add_development_dependency "csv-importer"
38
+ spec.add_development_dependency "active_importer"
39
+ spec.add_development_dependency "ferry"
40
+
41
+ spec.add_runtime_dependency "activerecord", [">= 3.0"]
42
+ end
@@ -0,0 +1,12 @@
1
+ require 'csv_fast_importer/version'
2
+ require 'csv_fast_importer/configuration'
3
+ require 'csv_fast_importer/import'
4
+
5
+ module CsvFastImporter
6
+
7
+ def self.import(file, parameters = {})
8
+ configuration = CsvFastImporter::Configuration.new file, parameters
9
+ CsvFastImporter::Import.new(configuration).run
10
+ end
11
+
12
+ end
@@ -0,0 +1,57 @@
1
+ module CsvFastImporter
2
+
3
+ # Gather all import configurations based on given file and additional parameters.
4
+ # This class is also responsible for default configuration.
5
+ class Configuration
6
+
7
+ attr_accessor :file
8
+
9
+ def initialize(file, parameters = {})
10
+ @file = file
11
+ @parameters = parameters
12
+ end
13
+
14
+ def encoding
15
+ @encoding ||= @parameters[:encoding] || 'UTF-8'
16
+ end
17
+
18
+ def column_separator
19
+ @column_separator ||= @parameters[:col_sep] || ';'
20
+ end
21
+
22
+ def mapping
23
+ @mapping ||= downcase_keys_and_values(@parameters[:mapping] || {})
24
+ end
25
+
26
+ def destination_table
27
+ @destination_table ||= (@parameters[:destination] || File.basename(@file, '.*'))
28
+ end
29
+
30
+ def row_index_column
31
+ @row_index_column ||= @parameters[:row_index_column]
32
+ end
33
+
34
+ def transactional?
35
+ @transactional ||= !(@parameters[:transaction] == :disabled)
36
+ end
37
+
38
+ def transactional_forced?
39
+ @transactional_forced ||= (@parameters[:transaction] == :enabled)
40
+ end
41
+
42
+ def truncate?
43
+ @deletion ||= @parameters[:deletion] == :truncate
44
+ end
45
+
46
+ def deletion?
47
+ @deletion ||= !(@parameters[:deletion] == :none)
48
+ end
49
+
50
+ private
51
+
52
+ def downcase_keys_and_values(hash)
53
+ Hash[hash.map{ |k, v| [k.to_s.downcase, v.to_s.downcase] }]
54
+ end
55
+
56
+ end
57
+ end