salesforce_bulk_query 0.0.2

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: a199495e5a1f919e25c9c2f6f129a9d2010b240a
+   data.tar.gz: ffca70b80c504a2c43224e828bc0d059cc23477e
+ SHA512:
+   metadata.gz: 7149c8f12a7fcacc181f1ec39d9d987b68db9dbbf1beada5ab3a0774343bfd9e7aced0d828a7ac78978bd2d2baca5b1cf8309a27098d9b7a452bb277cfd2f5e7
+   data.tar.gz: 669dab096fd60a81c1bdb0a0a1b0811ebdc57519f23a0f44b8000ef94e75911a3ff51660abe5cc7a0bc0aeb9c9a5a5b37f9d800149428f3afa7c1d5c4aaaf737
data/.gitignore ADDED
@@ -0,0 +1,2 @@
+ test_salesforce_credentials.json
+ Gemfile.lock
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source "http://rubygems.org"
+
+ # Specify your gem's dependencies in salesforce_bulk_query.gemspec
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,22 @@
+ (BSD License)
+
+ Copyright (c) 2014 Yatish Mehta & GoodData Corporation. All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided
+ that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice, this list of conditions and
+   the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice, this list of conditions
+   and the following disclaimer in the documentation and/or other materials provided with the distribution.
+ * Neither the name of the GoodData Corporation nor the names of its contributors may be used to endorse
+   or promote products derived from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
+ OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
+ CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/README.md ADDED
@@ -0,0 +1,108 @@
+ Salesforce Bulk Query
+ =====================
+ A library for downloading data from the Salesforce Bulk API. We focus only on querying; other operations of the API aren't supported. Designed to handle a lot of data.
+
+ Derived from [Salesforce Bulk API](https://github.com/yatish27/salesforce_bulk_api)
+
+ ## Basic Usage
+ To install, run:
+
+     gem install salesforce_bulk_query
+
+ or add
+
+     gem 'salesforce_bulk_query'
+
+ to your Gemfile.
+
+ Before using the library, make sure you have an account in your Salesforce organization that has API access, and that you won't run out of the [API limits](http://www.salesforce.com/us/developer/docs/api_asynchpre/Content/asynch_api_concepts_limits.htm#batch_proc_time_title).
+
+ For most of the API calls, the library uses [Restforce](https://github.com/ejholmes/restforce). Code example:
+
22
+ require 'restforce'
23
+ require 'salesforce_bulk_query'
24
+
25
+ # Create a restforce client instance
26
+ # with basic auth
27
+ restforce = Restforce.new(
28
+ :username => 'me',
29
+ :password => 'password',
30
+ :security_token => 'token',
31
+ :client_id => "my sfdc app client id",
32
+ :client_secret => "my sfdc app client secret"
33
+ )
34
+
35
+ # or OAuth
36
+ restforce = Restforce.new(
37
+ :refresh_token => "xyz",
38
+ :client_id => "my sfdc app client id",
39
+ :client_secret => "my sfdc app client secret"
40
+ )
41
+
42
+ bulk_api = SalesforceBulkQuery::Api.new(restforce)
43
+
44
+ # query the api
45
+ result = bulk_client.query("Task", "SELECT Id, Name FROM Task")
46
+
47
+ # the result is files
48
+ puts "All the downloaded stuff is in csvs: #{result[:filenames]}"
49
+
50
+ # query is a blocking call and can take several hours
51
+ # if you want to just start the query asynchronously, use
52
+ query = start_query("Task", "SELECT Id, Name FROM Task")
53
+
54
+ # get a cofee
55
+ sleep(1234)
56
+
57
+ # check the status
58
+ status = query.check_status
59
+ if status[:finished]
60
+ result = query.get_results
61
+ puts "All the downloaded stuff is in csvs: #{result[:filenames]}"
62
+ end
63
+
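+ If you drive the polling yourself, you can also periodically download what's ready and restart subqueries that ran over the job time limit, the way the blocking `query` call does internally. A minimal sketch (the sleep value is arbitrary):
+
+     query = bulk_api.start_query("Task", "SELECT Id, Name FROM Task")
+
+     loop do
+       status = query.check_status
+       if status[:finished]
+         result = query.get_results
+         puts "Done, the CSVs: #{result[:filenames]}"
+         break
+       end
+
+       # downloads whatever is ready and re-submits batches that exceeded
+       # the job time limit as new, smaller jobs
+       query.get_result_or_restart
+
+       sleep(60)
+     end
+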
+ ## How it works
+
+ The library uses the [Salesforce Bulk API](https://www.salesforce.com/us/developer/docs/api_asynch/index_Left.htm#CSHID=asynch_api_bulk_query.htm|StartTopic=Content%2Fasynch_api_bulk_query.htm|SkinName=webhelp). The given query is divided into 15 subqueries, according to the [limits](http://www.salesforce.com/us/developer/docs/api_asynchpre/Content/asynch_api_concepts_limits.htm#batch_proc_time_title). Each subquery covers an interval of the CreatedDate Salesforce field, and the interval bounds are passed to the API as conditions in the SOQL queries. Subqueries are sent to the API as batches and added to a job.
+
+ The first interval starts at the date the first Salesforce object was created; we query the Salesforce REST API for that. If this query times out, we use a constant. The last interval ends a few minutes before now, to avoid consistency issues. A custom start and end can be passed - see Options.
+
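+ For illustration, the splitting works roughly like this (a standalone sketch, not the gem's actual code; the dates are made up):
+
+     require 'date'
+
+     start = DateTime.parse("2014-01-01T00:00:00.000Z")
+     stop  = DateTime.parse("2014-01-16T00:00:00.000Z")
+     batch_count = 15
+
+     # equal-sized CreatedDate intervals covering start..stop
+     step = (stop - start) / batch_count
+     intervals = (0...batch_count).map { |i| [start + step * i, start + step * (i + 1)] }
+
+     # each pair ends up in one subquery as
+     # ... WHERE CreatedDate >= <from> AND CreatedDate < <to>
+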
+ A job has a fixed time limit to process all its subqueries. Batches that finish in time are downloaded to CSVs; batches that don't are each divided into 15 subqueries and added to new jobs.
+
+ CSV results are downloaded in chunks, so that we don't run into memory-related issues. All other requests are made through the Restforce client that is passed in when instantiating the Api class. Restforce is not in the dependencies, so theoretically you can pass another object with the same set of methods as the Restforce client.
+
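+ Judging from the calls the library makes, a replacement client would need roughly this interface (an inferred sketch, not an official contract):
+
+     class MyClient
+       # raw Bulk API requests made by the Connection class
+       def post(path, body, headers); end
+       def get(path, params, headers); end
+
+       # the Connection class reads the session token and instance url from here
+       def options
+         {:oauth_token => 'session id', :instance_url => 'https://na1.salesforce.com'}
+       end
+
+       # used by Query to find the oldest CreatedDate via the REST API
+       def query(soql); end
+
+       # used by Api#instance_url
+       def instance_url; end
+     end
+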
+ ## Options
+ There are a few optional settings you can pass to the `Api` constructor and its methods (an example call follows the list):
+ * `api_version`: which Salesforce api version should be used
+ * `logger`: where logs should go
+ * `filename_prefix`: prefix applied to csv files
+ * `directory_path`: custom directory path for CSVs; if omitted, a new temp directory is created
+ * `check_interval`: how often the results should be checked, in seconds
+ * `time_limit`: maximum time the query can take, in seconds. If this limit is exceeded, available results are downloaded and the list of subqueries that didn't finish is returned. The limit should be understood as a limit on waiting: when it's reached, the function downloads whatever data is ready, which can take some additional time.
+ * `created_from`, `created_to`: limits for the CreatedDate field, in a format like `"1999-01-01T00:00:00.000Z"`. Note that queries can't contain any WHERE clauses, as we do some manipulations to create the subqueries and we don't want things to get too complicated, so this is the way to limit the query yourself.
+ * `single_batch`: if true, the query is not divided into subqueries as described above. Instead, one batch job is created with the given query.
+
+ See specs for exact usage.
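+
+ For illustration, a call using several of the query options might look like this (all values are placeholders):
+
+     result = bulk_api.query(
+       "Opportunity",
+       "SELECT Id, Name FROM Opportunity",
+       :check_interval => 30,
+       :time_limit => 3600,
+       :directory_path => "/tmp/sfdc-csvs",
+       :created_from => "2014-01-01T00:00:00.000Z",
+       :created_to => "2014-06-01T00:00:00.000Z"
+     )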
+
+ ## Logging
+
+     require 'logger'
+     require 'restforce'
+
+     # create the restforce client
+     restforce = Restforce.new(...)
+
+     # instantiate a logger and pass it to the Api constructor
+     logger = Logger.new(STDOUT)
+     bulk_api = SalesforceBulkQuery::Api.new(restforce, :logger => logger)
+
+     # switch off logging in Restforce so you don't get every message twice
+     Restforce.log = false
+
+ If you're using Restforce as a client (which you probably are) and you want logging, Salesforce Bulk Query will use a custom logging middleware for Restforce. This is because the original logging middleware logs all API responses in full, which is not something you want to do for CSVs of a few gigabytes. When you use the :logger parameter, it's recommended you switch off the default logging in Restforce, otherwise you'll get all messages twice.
+
+ ## Copyright
+
+ Copyright (c) 2014 Yatish Mehta & GoodData Corporation. See [LICENSE](LICENSE) for details.
+
data/example_test_salesforce_credentials.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "username": "me@mycompany.com",
+   "password": "mypassword",
+   "token": "token I got in my email",
+   "client_id": "id for my registered SFDC app",
+   "client_secret": "secret number for my SFDC app"
+ }
data/lib/salesforce_bulk_query/batch.rb ADDED
@@ -0,0 +1,76 @@
+ require 'tmpdir'
+
+ module SalesforceBulkQuery
+   # Represents a Salesforce api batch. A batch contains a single subquery.
+   # Many batches are contained in a Job.
+   class Batch
+     def initialize(options)
+       @sobject = options[:sobject]
+       @soql = options[:soql]
+       @job_id = options[:job_id]
+       @connection = options[:connection]
+       @start = options[:start]
+       @stop = options[:stop]
+       @@directory_path ||= Dir.mktmpdir
+     end
+
+     attr_reader :soql, :start, :stop
+
+     # Do the api request
+     def create
+       path = "job/#{@job_id}/batch/"
+
+       response_parsed = @connection.post_xml(path, @soql, {:csv_content_type => true})
+
+       @batch_id = response_parsed['id'][0]
+     end
+
+     def check_status
+       # request to get the result id
+       path = "job/#{@job_id}/batch/#{@batch_id}/result"
+
+       response_parsed = @connection.get_xml(path)
+
+       @result_id = response_parsed["result"] ? response_parsed["result"][0] : nil
+       return {
+         :finished => ! @result_id.nil?,
+         :result_id => @result_id
+       }
+     end
+
+     def get_filename
+       return "#{@sobject}_#{@batch_id}_#{@start}-#{@stop}.csv"
+     end
+
+     def get_result(directory_path=nil)
+       # request to get the actual results
+       path = "job/#{@job_id}/batch/#{@batch_id}/result/#{@result_id}"
+
+       if !@result_id
+         raise "batch not finished yet, trying to get result: #{path}"
+       end
+
+       directory_path ||= @@directory_path
+
+       # write it to a file
+       filename = File.join(directory_path, get_filename)
+       @connection.get_to_file(path, filename)
+
+       return filename
+     end
+
+     def to_log
+       return {
+         :sobject => @sobject,
+         :soql => @soql,
+         :job_id => @job_id,
+         :connection => @connection.to_log,
+         :start => @start,
+         :stop => @stop,
+         :directory_path => @@directory_path
+       }
+     end
+   end
+ end
data/lib/salesforce_bulk_query/connection.rb ADDED
@@ -0,0 +1,124 @@
+ require 'xmlsimple'
+ require 'net/http'
+
+ module SalesforceBulkQuery
+
+   # Connection to the Salesforce API,
+   # shared by all classes that do requests
+   class Connection
+     def initialize(client, api_version, logger=nil, filename_prefix=nil)
+       @client = client
+       @logger = logger
+       @filename_prefix = filename_prefix
+
+       @@API_VERSION = api_version
+       @@PATH_PREFIX = "/services/async/#{@@API_VERSION}/"
+     end
+
+     attr_reader :client
+
+     XML_REQUEST_HEADER = {'Content-Type' => 'application/xml; charset=utf-8'}
+     CSV_REQUEST_HEADER = {'Content-Type' => 'text/csv; charset=UTF-8'}
+
+     def session_header
+       {'X-SFDC-Session' => @client.options[:oauth_token]}
+     end
+
+     def parse_xml(xml)
+       parsed = nil
+       begin
+         parsed = XmlSimple.xml_in(xml)
+       rescue => e
+         @logger.error "Error parsing xml: #{xml}\n#{e}\n#{e.backtrace}" if @logger
+         raise
+       end
+
+       return parsed
+     end
+
+     def post_xml(path, xml, options={})
+       path = "#{@@PATH_PREFIX}#{path}"
+       headers = options[:csv_content_type] ? CSV_REQUEST_HEADER : XML_REQUEST_HEADER
+
+       response = nil
+       # do the request
+       with_retries do
+         begin
+           response = @client.post(path, xml, headers.merge(session_header))
+         rescue JSON::ParserError => e
+           if e.message.index('ExceededQuota')
+             raise "You've run out of sfdc batch api quota. Original error: #{e}\n #{e.backtrace}"
+           end
+           raise e
+         end
+       end
+
+       return parse_xml(response.body)
+     end
+
+     def get_xml(path, options={})
+       path = "#{@@PATH_PREFIX}#{path}"
+       headers = XML_REQUEST_HEADER
+
+       response = nil
+       with_retries do
+         response = @client.get(path, {}, headers.merge(session_header))
+       end
+
+       return options[:skip_parsing] ? response.body : parse_xml(response.body)
+     end
+
+     def get_to_file(path, filename)
+       path = "#{@@PATH_PREFIX}#{path}"
+       uri = URI.parse(@client.options[:instance_url])
+       # open an http connection to the instance
+       http = Net::HTTP.new(uri.host, uri.port)
+       http.use_ssl = true
+       headers = XML_REQUEST_HEADER.merge(session_header)
+       @logger.info "Doing GET to #{path}, headers #{headers}" if @logger
+
+       if @filename_prefix
+         filename = "#{@filename_prefix}_#{filename}"
+       end
+
+       # do the request
+       http.request_get(path, headers) do |res|
+
+         @logger.info "Got response #{res.inspect}, reading response body in chunks and writing to #{filename}" if @logger
+
+         File.open(filename, 'w') do |file|
+           # write the body to the file in chunks
+           res.read_body do |segment|
+             file.write(segment)
+           end
+         end
+       end
+     end
+
+     def with_retries
+       i = 0
+       begin
+         yield
+       rescue => e
+         i += 1
+         if i < 3
+           @logger.warn "Retrying, got error: #{e}, #{e.backtrace}" if @logger
+           retry
+         else
+           @logger.error "Failed 3 times, last error: #{e}, #{e.backtrace}" if @logger
+           raise
+         end
+       end
+     end
+
+     def to_log
+       return {
+         :client => "Restforce, probably",
+         :filename_prefix => @filename_prefix,
+         :api_version => @@API_VERSION,
+         :path_prefix => @@PATH_PREFIX
+       }
+     end
+   end
+ end
data/lib/salesforce_bulk_query/job.rb ADDED
@@ -0,0 +1,161 @@
+ require "salesforce_bulk_query/batch"
+
+ module SalesforceBulkQuery
+
+   # Represents a Salesforce bulk api job, contains multiple batches.
+   # Many jobs are contained in a Query.
+   class Job
+     @@operation = 'query'
+     @@xml_header = '<?xml version="1.0" encoding="utf-8" ?>'
+     JOB_TIME_LIMIT = 10 * 60
+     BATCH_COUNT = 15
+
+
+     def initialize(sobject, connection, logger=nil)
+       @sobject = sobject
+       @connection = connection
+       @logger = logger
+       @batches = []
+       @unfinished_batches = []
+     end
+
+     attr_reader :job_id
+
+     # Do the API request
+     def create_job(csv=true)
+       content_type = csv ? "CSV" : "XML"
+       xml = "#{@@xml_header}<jobInfo xmlns=\"http://www.force.com/2009/06/asyncapi/dataload\">"
+       xml += "<operation>#{@@operation}</operation>"
+       xml += "<object>#{@sobject}</object>"
+       xml += "<contentType>#{content_type}</contentType>"
+       xml += "</jobInfo>"
+
+       response_parsed = @connection.post_xml("job", xml)
+       @job_id = response_parsed['id'][0]
+     end
+
+     def get_extended_soql(soql, from, to)
+       return "#{soql} WHERE CreatedDate >= #{from} AND CreatedDate < #{to}"
+     end
+
+     def generate_batches(soql, start, stop, single_batch=false)
+       # if there's just one batch wanted, add it and we're done
+       if single_batch
+         soql_extended = get_extended_soql(soql, start, stop)
+         @logger.info "Adding soql #{soql_extended} as a batch to job" if @logger
+
+         add_query(soql_extended,
+           :start => start,
+           :stop => stop
+         )
+         return
+       end
+
+       # if there are more, generate the time intervals and the batches
+       step_size = (stop - start) / BATCH_COUNT
+
+       interval_beginnings = start.step(stop - step_size, step_size).map {|f| f}
+       interval_ends = interval_beginnings.clone
+       interval_ends.shift
+       interval_ends.push(stop)
+
+       interval_beginnings.zip(interval_ends).each do |from, to|
+
+         soql_extended = get_extended_soql(soql, from, to)
+         @logger.info "Adding soql #{soql_extended} as a batch to job" if @logger
+
+         add_query(soql_extended,
+           :start => from,
+           :stop => to
+         )
+       end
+     end
+
+     def add_query(query, options={})
+       # create a batch and submit it via the api
+       batch = SalesforceBulkQuery::Batch.new(
+         :sobject => @sobject,
+         :soql => query,
+         :job_id => @job_id,
+         :connection => @connection,
+         :start => options[:start],
+         :stop => options[:stop]
+       )
+       batch.create
+
+       # add the batch to the list
+       @batches.push(batch)
+     end
+
+     def close_job
+       xml = "#{@@xml_header}<jobInfo xmlns=\"http://www.force.com/2009/06/asyncapi/dataload\">"
+       xml += "<state>Closed</state>"
+       xml += "</jobInfo>"
+
+       path = "job/#{@job_id}"
+
+       response_parsed = @connection.post_xml(path, xml)
+       @job_closed = Time.now
+     end
+
+     def check_status
+       path = "job/#{@job_id}"
+       response_parsed = @connection.get_xml(path)
+       @completed = Integer(response_parsed["numberBatchesCompleted"][0])
+       @finished = @completed == Integer(response_parsed["numberBatchesTotal"][0])
+       return {
+         :finished => @finished,
+         :some_failed => Integer(response_parsed["numberRecordsFailed"][0]) > 0,
+         :response => response_parsed
+       }
+     end
+
+     # downloads whatever is available, returns as unfinished whatever is not
+     def get_results(options={})
+       filenames = []
+       unfinished_batches = []
+
+       # get the result for each batch in the job
+       @batches.each do |batch|
+         batch_status = batch.check_status
+
+         # if the result is ready
+         if batch_status[:finished]
+
+           # download the result
+           filename = batch.get_result(options[:directory_path])
+           filenames.push(filename)
+         else
+           # otherwise put it to unfinished
+           unfinished_batches.push(batch)
+         end
+       end
+       @unfinished_batches = unfinished_batches
+
+       return {
+         :filenames => filenames,
+         :unfinished_batches => unfinished_batches
+       }
+     end
+
+     def get_available_results(options={})
+       # if we haven't reached the time limit yet, do nothing
+       # if all is done, do nothing
+       # if none of the batches finished, same thing
+       if (Time.now - @job_closed < JOB_TIME_LIMIT) || @finished || @completed == 0
+         return nil
+       end
+
+       return get_results(options)
+     end
+
+     def to_log
+       return {
+         :sobject => @sobject,
+         :connection => @connection.to_log,
+         :batches => @batches.map {|b| b.to_log},
+         :unfinished_batches => @unfinished_batches.map {|b| b.to_log}
+       }
+     end
+   end
+ end
data/lib/salesforce_bulk_query/logger.rb ADDED
@@ -0,0 +1,44 @@
+ require 'forwardable'
+ require 'faraday'
+
+ module SalesforceBulkQuery
+   # Custom logger for Restforce that doesn't log tons of data.
+   class Logger < Faraday::Response::Middleware
+     extend Forwardable
+
+     MAX_LOG_LENGTH = 2000
+
+     def initialize(app, logger, options)
+       super(app)
+       @options = options
+       @logger = logger || begin
+         require 'logger'
+         ::Logger.new(STDOUT)
+       end
+     end
+
+     def_delegators :@logger, :debug, :info, :warn, :error, :fatal
+
+     def call(env)
+       debug('request') do
+         dump :url => env[:url].to_s,
+              :method => env[:method],
+              :headers => env[:request_headers],
+              :body => env[:body][0..MAX_LOG_LENGTH]
+       end
+       super
+     end
+
+     def on_complete(env)
+       debug('response') do
+         dump :status => env[:status].to_s,
+              :headers => env[:response_headers],
+              :body => env[:body][0..MAX_LOG_LENGTH]
+       end
+     end
+
+     def dump(hash)
+       "\n" + hash.map { |k, v| "  #{k}: #{v.inspect}" }.join("\n")
+     end
+   end
+ end
data/lib/salesforce_bulk_query/query.rb ADDED
@@ -0,0 +1,149 @@
+ require 'salesforce_bulk_query/job'
+
+ module SalesforceBulkQuery
+
+   # Abstraction of a single user-given query. It contains multiple jobs and is tied to a specific connection
+   class Query
+
+     # if no created_to is given, we use the current time with this offset
+     # subtracted (to make sure the freshest changes, which can be inconsistent,
+     # aren't included). It's in minutes
+     OFFSET_FROM_NOW = 10
+
+     def initialize(sobject, soql, connection, options={})
+       @sobject = sobject
+       @soql = soql
+       @connection = connection
+       @logger = options[:logger]
+       @created_from = options[:created_from]
+       @created_to = options[:created_to]
+       @single_batch = options[:single_batch]
+       @jobs_in_progress = []
+       @jobs_done = []
+       @finished_batch_filenames = []
+       @restarted_subqueries = []
+     end
+
+     DEFAULT_MIN_CREATED = "1999-01-01T00:00:00.000Z"
+
+     # Creates the first job, divides the query into subqueries and adds all the subqueries to the job as batches
+     def start
+       # order by and where not allowed
+       if (!@single_batch) && (@soql =~ /WHERE/i || @soql =~ /ORDER BY/i)
+         raise "You can't have WHERE or ORDER BY in your soql. If you want to download just a specific date range, use created_from / created_to"
+       end
+
+       # create the first job
+       job = SalesforceBulkQuery::Job.new(@sobject, @connection, @logger)
+       job.create_job
+
+       # get the date when it should start
+       if @created_from
+         min_created = @created_from
+       else
+         # get the date when the first object was created
+         min_created = nil
+         begin
+           min_created_resp = @connection.client.query("SELECT CreatedDate FROM #{@sobject} ORDER BY CreatedDate LIMIT 1")
+           min_created_resp.each {|s| min_created = s[:CreatedDate]}
+         rescue Faraday::Error::TimeoutError => e
+           @logger.warn "Timeout getting the oldest object for #{@sobject}. Error: #{e}. Using the default value" if @logger
+           min_created = DEFAULT_MIN_CREATED
+         end
+       end
+
+       # generate intervals
+       start = DateTime.parse(min_created)
+       stop = @created_to ? DateTime.parse(@created_to) : DateTime.now - Rational(OFFSET_FROM_NOW, 1440)
+       job.generate_batches(@soql, start, stop, @single_batch)
+
+       job.close_job
+
+       @jobs_in_progress.push(job)
+     end
+
+
+     # Check statuses of all jobs
+     def check_status
+       all_done = true
+       job_statuses = []
+       # check all job statuses and put them in an array
+       @jobs_in_progress.each do |job|
+         job_status = job.check_status
+         all_done &&= job_status[:finished]
+         job_statuses.push(job_status)
+       end
+
+       return {
+         :finished => all_done,
+         :job_statuses => job_statuses
+       }
+     end
+
+     # Get results for all jobs
+     # @param options[:directory_path]
+     def get_results(options={})
+       all_job_results = []
+       job_result_filenames = []
+       unfinished_subqueries = []
+       # check each job and collect its results
+       @jobs_in_progress.each do |job|
+         job_results = job.get_results(options)
+         all_job_results.push(job_results)
+         job_result_filenames += job_results[:filenames]
+         unfinished_subqueries.push(job_results[:unfinished_batches].map {|b| b.soql})
+         # if it's done, add it to done
+         if job_results[:unfinished_batches].empty?
+           @jobs_done.push(job)
+         end
+       end
+       return {
+         :filenames => job_result_filenames + @finished_batch_filenames,
+         :unfinished_subqueries => unfinished_subqueries,
+         :restarted_subqueries => @restarted_subqueries,
+         :results => all_job_results,
+         :done_jobs => @jobs_done
+       }
+     end
+
+     # Restart unfinished batches in all jobs in progress, creating new jobs;
+     # downloads results for finished batches
+     def get_result_or_restart(options={})
+       new_jobs = []
+       job_ids_to_remove = []
+       jobs_done = []
+
+       @jobs_in_progress.each do |job|
+         # get whatever is available; if it's not the right time yet, go on
+         available_results = job.get_available_results(options)
+         if available_results.nil?
+           next
+         end
+
+         unfinished_batches = available_results[:unfinished_batches]
+
+         # store the filenames and the restarted subqueries
+         @finished_batch_filenames += available_results[:filenames]
+         @restarted_subqueries += unfinished_batches.map {|b| b.soql}
+
+         unfinished_batches.each do |batch|
+           # for each unfinished batch create a new job and add it to new jobs
+           @logger.info "The following subquery didn't end in time: #{batch.soql}. Dividing into multiple and running again" if @logger
+           new_job = SalesforceBulkQuery::Job.new(@sobject, @connection, @logger)
+           new_job.create_job
+           new_job.generate_batches(@soql, batch.start, batch.stop)
+           new_job.close_job
+           new_jobs.push(new_job)
+         end
+         # the current job is to be removed from jobs in progress
+         job_ids_to_remove.push(job.job_id)
+         jobs_done.push(job)
+       end
+       # remove the finished jobs from progress and add the new ones
+       @jobs_in_progress.select! {|j| ! job_ids_to_remove.include?(j.job_id)}
+       @jobs_done += jobs_done
+
+       @jobs_in_progress += new_jobs
+     end
+   end
+ end
data/lib/salesforce_bulk_query/version.rb ADDED
@@ -0,0 +1,3 @@
+ module SalesforceBulkQuery
+   VERSION = '0.0.2'
+ end
data/lib/salesforce_bulk_query.rb ADDED
@@ -0,0 +1,115 @@
+ require 'salesforce_bulk_query/connection'
+ require 'salesforce_bulk_query/query'
+ require 'salesforce_bulk_query/logger'
+
+ # Module where everything happens
+ module SalesforceBulkQuery
+
+   # Abstracts the whole library; the class the user interacts with
+   class Api
+     @@DEFAULT_API_VERSION = '29.0'
+
+     # Constructor
+     # @param client [Restforce] An instance of the Restforce client that is used internally to access the Salesforce api
+     # @param options
+     def initialize(client, options={})
+       @logger = options[:logger]
+
+       api_version = options[:api_version] || @@DEFAULT_API_VERSION
+
+       # use our own logging middleware if a logger was passed
+       if @logger && client.respond_to?(:middleware)
+         client.middleware.use(SalesforceBulkQuery::Logger, @logger, options)
+       end
+
+       # initialize the connection
+       @connection = SalesforceBulkQuery::Connection.new(client, api_version, @logger, options[:filename_prefix])
+     end
+
+     # Get the Salesforce instance URL
+     def instance_url
+       # make sure it ends with /
+       url = @connection.client.instance_url
+       url += '/' if url[-1] != '/'
+       return url
+     end
+
+     CHECK_INTERVAL = 10
+     QUERY_TIME_LIMIT = 60 * 60 * 2 # two hours
+
+     # Query the Salesforce API. It's a blocking method - it waits until the query is resolved,
+     # which can take quite some time
+     # @param sobject Salesforce object, e.g. "Opportunity"
+     # @param soql SOQL query, e.g. "SELECT Name FROM Opportunity"
+     # @return hash with :filenames and other useful stuff
+     def query(sobject, soql, options={})
+       check_interval = options[:check_interval] || CHECK_INTERVAL
+       time_limit = options[:time_limit] || QUERY_TIME_LIMIT
+
+       start_time = Time.now
+
+       # start the machinery
+       query = start_query(sobject, soql, options)
+       results = nil
+
+       loop do
+         # check the status
+         status = query.check_status
+
+         # if finished, get the results and we're done
+         if status[:finished]
+           results = query.get_results(:directory_path => options[:directory_path])
+           @logger.info "Query finished. Results: #{results_to_string(results)}" if @logger
+           break
+         end
+
+         # if we've run out of the time limit, go away
+         if Time.now - start_time > time_limit
+           @logger.warn "Ran out of time limit, downloading what's available and terminating" if @logger
+
+           # download what's available
+           results = query.get_results(
+             :directory_path => options[:directory_path],
+           )
+
+           @logger.info "Downloaded the following files: #{results[:filenames]}. The following didn't finish in time: #{results[:unfinished_subqueries]}. Results: #{results_to_string(results)}" if @logger
+           break
+         end
+
+         # restart whatever needs to be restarted and sleep
+         query.get_result_or_restart(:directory_path => options[:directory_path])
+         @logger.info "Sleeping #{check_interval}" if @logger
+         sleep(check_interval)
+       end
+
+       return results
+     end
+
+     # Start the query (asynchronous, non-blocking method)
+     # @params see #query
+     # @return Query instance with the running query
+     def start_query(sobject, soql, options={})
+       # create the query, start it and return it
+       query = SalesforceBulkQuery::Query.new(sobject, soql, @connection, {:logger => @logger}.merge(options))
+       query.start
+       return query
+     end
+
+     private
+
+     # create a hash with just the fields we want to show in logs
+     def results_to_string(results)
+       return results.merge({
+         :results => results[:results].map do |r|
+           r.merge({
+             :unfinished_batches => r[:unfinished_batches].map do |b|
+               b.to_log
+             end
+           })
+         end,
+         :done_jobs => results[:done_jobs].map {|j| j.to_log}
+       })
+     end
+   end
+ end
data/salesforce_bulk_query.gemspec ADDED
@@ -0,0 +1,31 @@
+ # -*- encoding: utf-8 -*-
+ $:.push File.expand_path("../lib/", __FILE__)
+ require "salesforce_bulk_query/version"
+
+ Gem::Specification.new do |s|
+   s.name = 'salesforce_bulk_query'
+   s.version = SalesforceBulkQuery::VERSION
+   s.authors = ['Petr Cvengros']
+   s.email = ['petr.cvengros@gooddata.com']
+
+   s.required_ruby_version = '>= 1.9'
+
+   s.homepage = 'https://github.com/cvengros/salesforce_bulk_query'
+   s.summary = %q{Downloading data from Salesforce Bulk API made easy and scalable.}
+   s.description = %q{A library for downloading data from Salesforce Bulk API. We only focus on querying, other operations of the API aren't supported. Designed to handle a lot of data.}
+   s.license = 'BSD'
+
+   s.add_dependency 'json', '~> 1.8'
+   s.add_dependency 'xml-simple', '~> 1.1'
+
+   s.add_development_dependency 'multi_json', '~> 1.9'
+   s.add_development_dependency 'restforce', '~> 1.4'
+   s.add_development_dependency 'rspec', '~> 2.14'
+   s.add_development_dependency 'pry', '~> 0.9'
+
+
+   s.files = `git ls-files`.split($/)
+   s.require_paths = ['lib']
+
+   s.rubygems_version = "1.3.7"
+ end
data/spec/salesforce_bulk_query_spec.rb ADDED
@@ -0,0 +1,130 @@
+ require 'spec_helper'
+ require 'multi_json'
+ require 'restforce'
+ require 'csv'
+ require 'tmpdir'
+ require 'logger'
+
+ LOGGING = false
+
+ describe SalesforceBulkQuery do
+
+   before :all do
+     auth = MultiJson.load(File.read('test_salesforce_credentials.json'), :symbolize_keys => true)
+
+     @client = Restforce.new(
+       :username => auth[:username],
+       :password => auth[:password],
+       :security_token => auth[:token],
+       :client_id => auth[:client_id],
+       :client_secret => auth[:client_secret],
+       :api_version => '30.0'
+     )
+     @api = SalesforceBulkQuery::Api.new(@client,
+       :api_version => '30.0',
+       :logger => LOGGING ? Logger.new(STDOUT) : nil
+     )
+
+     # switch off the normal logging
+     Restforce.log = false
+   end
+
+   describe "instance_url" do
+     it "gives you some reasonable url" do
+       url = @api.instance_url
+       url.should_not be_empty
+       url.should match(/salesforce\.com\//)
+     end
+   end
+
+   describe "query" do
+     context "when you give it no options" do
+       it "downloads the data to a few files", :constraint => 'slow' do
+         result = @api.query("Opportunity", "SELECT Id, Name FROM Opportunity")
+         result[:filenames].should have_at_least(2).items
+         result[:results].should_not be_empty
+         result[:done_jobs].should_not be_empty
+
+         result[:filenames].each do |filename|
+           File.size?(filename).should be_true
+
+           lines = CSV.read(filename)
+
+           if lines.length > 1
+             # the first line should be the header
+             lines[0].should eql(["Id", "Name"])
+
+             # the first id shouldn't be empty
+             lines[1][0].should_not be_empty
+           end
+         end
+       end
+     end
+     context "when you give it all the options" do
+       it "downloads a single file" do
+         tmp = Dir.mktmpdir
+         frm = "2000-01-01"
+         from = "#{frm}T00:00:00.000Z"
+         t = "2020-01-01"
+         to = "#{t}T00:00:00.000Z"
+         result = @api.query(
+           "Account",
+           "SELECT Id, Name, Industry, Type FROM Account",
+           :check_interval => 30,
+           :directory_path => tmp,
+           :created_from => from,
+           :created_to => to,
+           :single_batch => true
+         )
+
+         result[:filenames].should have(1).items
+         result[:results].should_not be_empty
+         result[:done_jobs].should_not be_empty
+
+         filename = result[:filenames][0]
+
+         File.size?(filename).should be_true
+         lines = CSV.read(filename)
+
+         # the first line should be the header
+         lines[0].should eql(["Id", "Name", "Industry", "Type"])
+
+         # the first id shouldn't be empty
+         lines[1][0].should_not be_empty
+
+         filename.should match(tmp)
+         filename.should match(frm)
+         filename.should match(t)
+       end
+     end
+     context "when you give it a short time limit" do
+       it "downloads just a few files" do
+         result = @api.query(
+           "Task",
+           "SELECT Id, Name, CreatedDate FROM Task",
+           :time_limit => 30
+         )
+         result[:results].should_not be_empty
+       end
+     end
+   end
+
+   describe "start_query" do
+     it "starts a query that finishes some time later" do
+       query = @api.start_query("Opportunity", "SELECT Id, Name, CreatedDate FROM Opportunity")
+
+       # get a coffee
+       sleep(40)
+
+       # check the status
+       status = query.check_status
+       if status[:finished]
+         result = query.get_results
+         result[:filenames].should have_at_least(2).items
+         result[:results].should_not be_empty
+         result[:done_jobs].should_not be_empty
+       end
+     end
+
+   end
+ end
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,6 @@
+ require 'salesforce_bulk_query'
+
+ RSpec.configure do |c|
+   c.filter_run :focus => true
+   c.run_all_when_everything_filtered = true
+ end
metadata ADDED
@@ -0,0 +1,145 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: salesforce_bulk_query
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.2
5
+ platform: ruby
6
+ authors:
7
+ - Petr Cvengros
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2014-06-02 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: json
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.8'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.8'
27
+ - !ruby/object:Gem::Dependency
28
+ name: xml-simple
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '1.1'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '1.1'
41
+ - !ruby/object:Gem::Dependency
42
+ name: multi_json
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: '1.9'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: '1.9'
55
+ - !ruby/object:Gem::Dependency
56
+ name: restforce
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: '1.4'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - "~>"
67
+ - !ruby/object:Gem::Version
68
+ version: '1.4'
69
+ - !ruby/object:Gem::Dependency
70
+ name: rspec
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: '2.14'
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: '2.14'
83
+ - !ruby/object:Gem::Dependency
84
+ name: pry
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - "~>"
88
+ - !ruby/object:Gem::Version
89
+ version: '0.9'
90
+ type: :development
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - "~>"
95
+ - !ruby/object:Gem::Version
96
+ version: '0.9'
97
+ description: A library for downloading data from Salesforce Bulk API. We only focus
98
+ on querying, other operations of the API aren't supported. Designed to handle a
99
+ lot of data.
100
+ email:
101
+ - petr.cvengros@gooddata.com
102
+ executables: []
103
+ extensions: []
104
+ extra_rdoc_files: []
105
+ files:
106
+ - ".gitignore"
107
+ - Gemfile
108
+ - LICENSE
109
+ - README.md
110
+ - example_test_salesforce_credentials.json
111
+ - lib/salesforce_bulk_query.rb
112
+ - lib/salesforce_bulk_query/batch.rb
113
+ - lib/salesforce_bulk_query/connection.rb
114
+ - lib/salesforce_bulk_query/job.rb
115
+ - lib/salesforce_bulk_query/logger.rb
116
+ - lib/salesforce_bulk_query/query.rb
117
+ - lib/salesforce_bulk_query/version.rb
118
+ - salesforce_bulk_query.gemspec
119
+ - spec/salesforce_bulk_query_spec.rb
120
+ - spec/spec_helper.rb
121
+ homepage: https://github.com/cvengros/salesforce_bulk_query
122
+ licenses:
123
+ - BSD
124
+ metadata: {}
125
+ post_install_message:
126
+ rdoc_options: []
127
+ require_paths:
128
+ - lib
129
+ required_ruby_version: !ruby/object:Gem::Requirement
130
+ requirements:
131
+ - - ">="
132
+ - !ruby/object:Gem::Version
133
+ version: '1.9'
134
+ required_rubygems_version: !ruby/object:Gem::Requirement
135
+ requirements:
136
+ - - ">="
137
+ - !ruby/object:Gem::Version
138
+ version: '0'
139
+ requirements: []
140
+ rubyforge_project:
141
+ rubygems_version: 2.2.2
142
+ signing_key:
143
+ specification_version: 4
144
+ summary: Downloading data from Salesforce Bulk API made easy and scalable.
145
+ test_files: []