salesforce_bulk_query 0.0.6 → 0.1.0

Files changed (18)

checksums.yaml +4 -4
data/.gitignore +3 -2
data/.travis.yml +26 -0
data/README.md +67 -16
data/Rakefile +20 -0
data/env_setup-example.sh +13 -0
data/lib/salesforce_bulk_query/batch.rb +84 -11
data/lib/salesforce_bulk_query/connection.rb +13 -1
data/lib/salesforce_bulk_query/job.rb +60 -29
data/lib/salesforce_bulk_query/query.rb +73 -69
data/lib/salesforce_bulk_query/utils.rb +16 -0
data/lib/salesforce_bulk_query/version.rb +1 -1
data/lib/salesforce_bulk_query.rb +14 -43
data/salesforce_bulk_query.gemspec +5 -0
data/spec/salesforce_bulk_query_spec.rb +45 -43
data/spec/spec_helper.rb +27 -0
metadata +70 -18
data/example_test_salesforce_credentials.json +0 -7

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 4a5ed2f842b4032e2c90680ff6467feba485ba62
-  data.tar.gz: 3a15cdf85bc0707570b2e35f4db44139680f850e
+  metadata.gz: 39893db1495deba35f872109eae3fde80d55b506
+  data.tar.gz: 64d1e6f6d2b016d5257004c8e83527372c51d688
 SHA512:
-  metadata.gz: 3f1e236abefe0706f1ccfa80fc44c42feff3f5fec1ef5b6af7d3c61d9de30c6062d21ad629e785ac67c131705bd26517114426d71de37c4d21a8dd6f8cf547a3
-  data.tar.gz: b431e5601bf9a9f940d79fc089a6dc00380e73a37eb17589b6bcf6f024278a3742a7056fe2553d40383db98e59d4ab0e55695442e4fb058e0daf0c553837324a
+  metadata.gz: 8dbda4621053cb70ad98f309feaf6842e0dfa8d3041db06b5cff531f151acd2dc9f41cc09ed60a3844c64d292bf49a9660c00d3c229edc63233ac5f37bcbc5a6
+  data.tar.gz: cf03cdc3d3b34bf587cb26359648af0343ee7a5b0956a789e4e0491d3a1c2ff905d42e52503c120806f385a1814ca969a78370a472319033cf3573b03048d87f

data/.gitignore CHANGED

@@ -1,3 +1,4 @@
-test_salesforce_credentials.json
+env_setup.sh
 Gemfile.lock
-*.gem
+*.gem
+hh*

data/.travis.yml ADDED

@@ -0,0 +1,26 @@
+language: ruby
+sudo: false
+os:
+  - linux
+  - osx
+branches:
+except:
+  - gh-pages
+  - next_doc_release
+env:
+  global:
+    - LOGGING=true
+    - secure: EVku6JcQcp69DksaZUo8Sluk2Mk45CHjs/7w2xDG6n58Px3zxG3i0DSxm5SYBaDvcIQCv+LLt/LyjTRs5t51GbQrFLZGxGLMHDkJ/eS4K2q27P8NGqM/Yj4WtxB7BYQNLQf00eCo9HeTyjY7AsAqZG8wn6W9DUF5Gtu7AUO0aiQ=
+    - secure: GwO5mPCUhaeC1fHuX86T3INRrrpqb2raWS+M83Ko4fnwKUryGiMOVYtzu5NSiTfN2YRHd3gvAsisempqCHndTPvuWcJhSCIGEZdIYgdxXAz4Meb78t0w/5rY7O9BOJ5+Zin098SmB/uiPvW/HDSQw+T+bBHO3ETw+h8VvFDmHM4=
+    - secure: TbguXDUV3wZhZ9ZHGYP2X0zo8gJejeuawmHQ6xey5gZzYrAH92r6wgddQ7KM2Vc+UCw0Gpnuy2AKM6EhEGbRShc5u6LmwAJIqZ6iG48ALOxha6JOoICuSbahRmnNSVOP1utzAXjOB20YO8lao5PGNfqyacFnr3VhqG22Kl5EJhc=
+    - secure: a2wb07pS1Cl6aiKN0tSVjW4Zsc0K0yQ3mI+nrR3KQpF1vaMu43gKYRGLz3NDWcaPrIgKfFyeKjucxd8o52dLn5Mi8OW7hpPWy3d9Y9umx0PpTYHl16tOw2PLpWedqGTM/eZn5Y8UtzrsAdVJKIcCXE7SA28qGmaR5KKuaJjC+3w=
+    - secure: JvH3Nr6N9GbSWDNc8vE0cwDZPQcUgwXB2MjtkPZ/fnfrR/x/cNmCyc2lzZqfExmDNn17YbR2fPP8zKBczEDvY8q03v42LfnpDSCR2F73yJzQe9dxHwACroAT8VgAz3o23ejgry31SzZz+IoUvZ2fzuoZs+uAt1yr48I2V03PqfM=
+rvm:
+  - 1.9.3
+  - jruby-19mode
+  - 2.1
+  - 2.2
+before_install:
+  - gem update --system
+  - gem update bundler
+script: rake ci

data/README.md CHANGED

@@ -4,6 +4,15 @@ A library for downloading data from Salesforce Bulk API. We only focus on queryi
 Derived from [Salesforce Bulk API](https://github.com/yatish27/salesforce_bulk_api)
+## Status
+[![Gem Version](https://badge.fury.io/rb/salesforce_bulk_query.png)](http://badge.fury.io/rb/salesforce_bulk_query)
+[![Downloads](http://img.shields.io/gem/dt/salesforce_bulk_query.svg)](http://rubygems.org/gems/salesforce_bulk_query)
+[![Dependency Status](https://gemnasium.com/cvengros/salesforce_bulk_query.png)](https://gemnasium.com/cvengros/salesforce_bulk_query)
+[![Code Climate](https://codeclimate.com/github/cvengros/salesforce_bulk_query.png)](https://codeclimate.com/github/cvengros/salesforce_bulk_query)
+[![Build Status](https://travis-ci.org/cvengros/salesforce_bulk_query.png)](https://travis-ci.org/cvengros/salesforce_bulk_query)
+[![Coverage Status](https://coveralls.io/repos/cvengros/salesforce_bulk_query/badge.png)](https://coveralls.io/r/cvengros/salesforce_bulk_query)
 ## Basic Usage
 To install, run:
@@ -54,14 +63,15 @@ puts "All the downloaded stuff is in csvs: #{result[:filenames]}"
 # if you want to just start the query asynchronously, use
 query = start_query("Task", "SELECT Id, Name FROM Task")
-# get a cofee
+# get a coffee
 sleep(1234)
-# check the status
-status = query.check_status
-if status[:finished]
-  result = query.get_results
-  puts "All the downloaded stuff is in csvs: #{result[:filenames]}"
+# get what's available and check the status
+results = query.get_available_results
+if results[:succeeded]
+  puts "All the downloaded stuff is in csvs: #{results[:filenames]}"
+else
+  puts "This is going to take a while, get another coffee"
 end
 ```
@@ -82,28 +92,69 @@ There are a few optional settings you can pass to the `Api` methods:
 * `filename_prefix`: prefix applied to csv files
 * `directory_path`: custom directory path for CSVs; if omitted, a new temp directory is created
 * `check_interval`: how often the results should be checked in secs.
-* `time_limit`: maximum time the query can take, in seconds. If this time limit is exceeded, available results are downloaded and the list of subqueries that didn't finish is returned. The limit should be understood as a limit for waiting: when it is reached, the function downloads the data that is ready, which can take some additional time.
+* `time_limit`: maximum time the query can take, in seconds. If this time limit is exceeded, available results are downloaded and the list of subqueries that didn't finish is returned. The limit should be understood as a limit for waiting: when it is reached, the function downloads the data that is ready, which can take some additional time. If no limit is given, the query runs until it finishes.
 * `created_from`, `created_to`: limits for the CreatedDate field. Note that queries can't contain any WHERE statements as we're doing some manipulations to create subqueries and we don't want things to get too difficult. So this is the way to limit the query yourself. The format is like `"1999-01-01T00:00:00.000Z"`
 * `single_batch`: If true, the queries are not divided into subqueries as described above. Instead one batch job is created with the given query. This is faster for small amount of data, but will fail with a timeout if you have a lot of data.
 See specs for exact usage.
 ## Logging
-    require 'logger'
-    require 'restforce'
-    # create the restforce client
-    restforce = Restforce.new(...)
+```ruby
+require 'logger'
+require 'restforce'
+# create the restforce client
+restforce = Restforce.new(...)
-    # instantiate a logger and pass it to the Api constructor
-    logger = Logger.new(STDOUT)
-    bulk_api = SalesforceBulkQuery::Api.new(restforce, :logger => logger)
+# instantiate a logger and pass it to the Api constructor
+logger = Logger.new(STDOUT)
+bulk_api = SalesforceBulkQuery::Api.new(restforce, :logger => logger)
-    # switch off logging in Restforce so you don't get every message twice
-    Restforce.log = false
+# switch off logging in Restforce so you don't get every message twice
+Restforce.log = false
+```
 If you're using Restforce as a client (which you probably are) and you want to do logging, Salesforce Bulk Query will use a custom logging middleware for Restforce. This is because the original logging middleware puts all API responses to log, which is not something you would like to do for a few gigabytes CSVs. When you use the :logger parameter it's recommended you switch off the default logging in Restforce, otherwise you'll get all messages twice.
+## Notes
+Query (user given) -> Job (Salesforce construct that encapsulates 15 batches) -> Batch (1 SOQL with CreatedDate constraints)
+At the beginning the query is divided into 15 subqueries and put into a single job. When one of the subqueries fails, a new job with 15 subqueries is created, the range of the failed query is divided into 15 sub-subqueries.
+## Running tests locally
+Travis CI is set up for this repository to make sure all the tests are passing with each commit.
+To run the tests locally:
+* Copy the env_setup-example.sh file
+```
+cp  env_setup-example.sh env_setup.sh
+```
+* Setup all the params in env_setup. USERNAME, PASSWORD and TOKEN are your salesforce account credentials. You can get those by [registering for a free developer account](https://developer.salesforce.com/signup). You might need to [reset your security token](https://help.salesforce.com/apex/HTViewHelpDoc?id=user_security_token.htm) to put it to TOKEN variable. CLIENT_ID and CLIENT_SECRET belong to your Salesforce connected app. You can create one by following the steps outlined in [the tutorial](https://help.salesforce.com/apex/HTViewHelpDoc?id=connected_app_create.htm). Make sure you check the 'api' permission.
+* Run the env_setup
+```
+. env_setup.sh
+```
+* Run the tests
+```
+bundle exec rspec
+```
+Note that env_setup.sh is ignored from git in .gitignore so that you don't commit your credentials by accident.
+## Contributing
+1. Fork it ( https://github.com/[my-github-username]/salesforce_bulk_query/fork )
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Run the tests (see above), fix if they fail.
+4. Commit your changes (`git commit -am 'Add some feature'`)
+5. Push to the branch (`git push origin my-new-feature`)
+6. Create a new Pull Request
+Make sure you run all the tests and they pass. If you create a new feature, write a test for it.
 ## Copyright
 Copyright (c) 2014 Yatish Mehta & GoodData Corporation. See [LICENSE](LICENSE) for details.
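
The Notes section in the README diff above says a query is divided into 15 subqueries by `CreatedDate`, and that a failed subquery's range is divided into 15 again. A minimal sketch of that splitting, assuming equal-width intervals; `split_range` is a hypothetical helper written for illustration and is not the gem's `generate_batches`:

```ruby
require 'date'

# Hypothetical helper: split the half-open range [start, stop) into
# `count` consecutive sub-ranges of equal width.
def split_range(start, stop, count = 15)
  step = (stop - start) / count # difference of two DateTimes is a Rational number of days
  (0...count).map do |i|
    from = start + step * i
    # pin the last boundary to `stop` so the whole range is covered exactly
    to = (i == count - 1) ? stop : start + step * (i + 1)
    [from, to]
  end
end

range_start = DateTime.parse("1999-01-01T00:00:00.000Z")
range_stop = DateTime.parse("2014-01-01T00:00:00.000Z")
ranges = split_range(range_start, range_stop)

# Re-splitting one sub-range mirrors the "sub-subqueries" step from the Notes
sub_ranges = split_range(ranges[3][0], ranges[3][1])
puts "#{ranges.length} subqueries, #{sub_ranges.length} sub-subqueries for one of them"
```

Each pair would become the `CreatedDate >= from AND CreatedDate < to` constraint of one batch.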

data/Rakefile ADDED

@@ -0,0 +1,20 @@
+require 'coveralls/rake/task'
+require 'rake/testtask'
+require 'rspec/core/rake_task'
+Coveralls::RakeTask.new
+task default: %w[ci]
+desc 'Run continuous integration test'
+task :ci do
+  Rake::Task['test:unit'].invoke
+  Rake::Task['coveralls:push'].invoke
+end
+namespace :test do
+  desc "Run unit tests"
+  RSpec::Core::RakeTask.new(:unit) do |t|
+    t.pattern = 'spec/**/*.rb'
+  end
+end

data/env_setup-example.sh ADDED

@@ -0,0 +1,13 @@
+#!/bin/sh
+# mandatory
+export CLIENT_ID=""
+export CLIENT_SECRET=""
+export USERNAME=""
+export PASSWORD=""
+export TOKEN=""
+# optional
+export LOGGING='true' # unset to switch off logging
+export ENTITY='Task'

data/lib/salesforce_bulk_query/batch.rb CHANGED

@@ -1,5 +1,8 @@
 require 'tmpdir'
+require 'salesforce_bulk_query/utils'
 module SalesforceBulkQuery
   # Represents a Salesforce api batch. Batch contains a single subquery.
   # Many batches are contained in a Job.
@@ -11,10 +14,12 @@ module SalesforceBulkQuery
       @connection = options[:connection]
       @start = options[:start]
       @stop = options[:stop]
+      @logger = options[:logger]
       @@directory_path ||= Dir.mktmpdir
+      @filename = nil
     end
-    attr_reader :soql, :start, :stop
+    attr_reader :soql, :start, :stop, :filename, :fail_message, :batch_id, :csv_record_count
     # Do the api request
     def create
@@ -25,15 +30,48 @@ module SalesforceBulkQuery
       @batch_id = response_parsed['id'][0]
     end
+    # check status of the batch
+    # if it fails, don't throw an error now, let the job above collect all fails and raise it at once
     def check_status
-      # request to get the result id
-      path = "job/#{@job_id}/batch/#{@batch_id}/result"
+      succeeded = nil
+      failed = nil
+      # get the status of the batch
+      # https://www.salesforce.com/us/developer/docs/api_asynch/Content/asynch_api_batches_get_info.htm
+      status_path = "job/#{@job_id}/batch/#{@batch_id}"
+      status_response = @connection.get_xml(status_path)
+      # interpret the status
+      @status = status_response['state'][0]
+      # https://www.salesforce.com/us/developer/docs/api_asynch/Content/asynch_api_batches_interpret_status.htm
+      case @status
+        when 'Failed'
+          failed = true
+          @fail_message = status_response['stateMessage']
+        when 'InProgress', 'Queued'
+          succeeded = false
+        when 'Completed'
+          succeeded = true
+          failed = false
+        else
+          fail "Something weird happened, #{@batch_id} has status #{@status}."
+      end
+      if succeeded
+        # request to get the result id
+        # https://www.salesforce.com/us/developer/docs/api_asynch/Content/asynch_api_batches_get_results.htm
+        path = "job/#{@job_id}/batch/#{@batch_id}/result"
+        response_parsed = @connection.get_xml(path)
-      response_parsed = @connection.get_xml(path)
+        @result_id = response_parsed["result"] ? response_parsed["result"][0] : nil
+      end
-      @result_id = response_parsed["result"] ? response_parsed["result"][0] : nil
       return {
-        :finished => ! @result_id.nil?,
+        :failed => failed,
+        :fail_message => @fail_message,
+        :succeeded => succeeded,
         :result_id => @result_id
       }
     end
@@ -42,7 +80,14 @@ module SalesforceBulkQuery
       return "#{@sobject}_#{@batch_id}_#{@start}-#{@stop}.csv"
     end
-    def get_result(directory_path=nil)
+    def get_result(options={})
+      # if it was already downloaded, no one should ask about it
+      if @filename
+        raise "This batch was already downloaded once: #{@filename}, #{@batch_id}"
+      end
+      directory_path = options[:directory_path]
+      skip_verification = options[:skip_verification]
       # request to get the actual results
       path = "job/#{@job_id}/batch/#{@batch_id}/result/#{@result_id}"
@@ -54,10 +99,39 @@ module SalesforceBulkQuery
       directory_path ||= @@directory_path
       # write it to a file
-      filename = File.join(directory_path, get_filename)
-      @connection.get_to_file(path, filename)
+      @filename = File.join(directory_path, get_filename)
+      @connection.get_to_file(path, @filename)
+      # Verify the number of downloaded records is roughly the same as
+      # count on the soql api
+      # maybe also verify
+      unless skip_verification
+        @verification = verification
+      end
-      return filename
+      return {
+        :filename => @filename,
+        :verification => @verification
+      }
+    end
+    def verification
+      api_count = @connection.query_count(@sobject, @start, @stop)
+      # if we weren't able to get the count, fail.
+      if api_count.nil?
+        return false
+      end
+      # count the records in the csv
+      @csv_record_count = Utils.line_count(@filename)
+      if @logger && @csv_record_count % 100 == 0
+        @logger.warn "The line count for batch #{@soql} is highly suspicious: #{@csv_record_count}"
+      end
+      if @logger && @csv_record_count != api_count
+        @logger.warn "The counts for batch #{@soql} don't match. Record count in downloaded csv #{@csv_record_count}, record count on api count(): #{api_count}"
+      end
+      return @csv_record_count >= api_count
     end
     def to_log
@@ -71,6 +145,5 @@ module SalesforceBulkQuery
         :directory_path => @@directory_path
       }
     end
   end
 end
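
The `verification` method added to `batch.rb` compares the number of records in the downloaded CSV with a count obtained from the API, and treats a missing API count as a failure. A minimal, self-contained sketch of that comparison; `csv_record_count` and `verified?` are hypothetical names, and the API count is passed in as a plain argument rather than fetched from Salesforce:

```ruby
require 'csv'
require 'tempfile'

# Count data rows in a CSV file, not counting the header row.
def csv_record_count(path)
  count = 0
  CSV.foreach(path, :headers => true) { |_| count += 1 }
  count
end

# Hypothetical check: passes when the CSV holds at least as many records
# as the API reported; fails when the API count is unavailable (nil).
def verified?(path, api_count)
  return false if api_count.nil?
  csv_record_count(path) >= api_count
end

# Build a small CSV with a header and two records to try it out
file = Tempfile.new(['batch', '.csv'])
file.write("\"Id\",\"Name\"\n\"001\",\"first\"\n\"002\",\"second\"\n")
file.close

puts verified?(file.path, 2)
puts verified?(file.path, 3)
```

Using `>=` rather than `==` tolerates records created between the bulk download and the count query, which appears to be the intent of the comparison in the diff.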

data/lib/salesforce_bulk_query/connection.rb CHANGED

@@ -7,7 +7,7 @@ module SalesforceBulkQuery
   # shared in all classes that do some requests
   class Connection
     def initialize(client, api_version, logger=nil, filename_prefix=nil)
-      @client=client
+      @client = client
       @logger = logger
       @filename_prefix = filename_prefix
@@ -112,6 +112,18 @@ module SalesforceBulkQuery
       end
     end
+    def query_count(sobject, from, to)
+      # do it with retries, if it doesn't succeed, return nil, don't fail.
+      begin
+        with_retries do
+          q = @client.query("SELECT COUNT() FROM #{sobject} WHERE CreatedDate >= #{from} AND CreatedDate < #{to}")
+          return q.size
+        end
+      rescue TimeoutError => e
+        return nil
+      end
+    end
     def to_log
       return {
         :client => "Restforce asi",

data/lib/salesforce_bulk_query/job.rb CHANGED

@@ -7,16 +7,24 @@ module SalesforceBulkQuery
   class Job
     @@operation = 'query'
     @@xml_header = '<?xml version="1.0" encoding="utf-8" ?>'
-    JOB_TIME_LIMIT = 10 * 60
+    JOB_TIME_LIMIT = 15 * 60
     BATCH_COUNT = 15
-    def initialize(sobject, connection, logger=nil)
+    def initialize(sobject, connection, options={})
       @sobject = sobject
       @connection = connection
-      @logger = logger
+      @logger = options[:logger]
+      @job_time_limit = options[:job_time_limit] || JOB_TIME_LIMIT
+      # all batches (static)
       @batches = []
+      # unfinished batches as of last get_available_results call
       @unfinished_batches = []
+      # filenames for the already downloaded and verified batches
+      @filenames = []
     end
     attr_reader :job_id
@@ -79,12 +87,14 @@ module SalesforceBulkQuery
         :job_id => @job_id,
         :connection => @connection,
         :start => options[:start],
-        :stop => options[:stop]
+        :stop => options[:stop],
+        :logger => @logger
       )
       batch.create
       # add the batch to the list
       @batches.push(batch)
+      @unfinished_batches.push(batch)
     end
     def close_job
@@ -95,60 +105,81 @@ module SalesforceBulkQuery
       path = "job/#{@job_id}"
       response_parsed = @connection.post_xml(path, xml)
-      @job_closed = Time.now
+      @job_closed_time = Time.now
     end
     def check_status
       path = "job/#{@job_id}"
       response_parsed = @connection.get_xml(path)
-      @completed = Integer(response_parsed["numberBatchesCompleted"][0])
-      @finished = @completed == Integer(response_parsed["numberBatchesTotal"][0])
+      @completed_count = Integer(response_parsed["numberBatchesCompleted"][0])
+      @succeeded = @completed_count == Integer(response_parsed["numberBatchesTotal"][0])
       return {
-        :finished => @finished,
-        :some_failed => Integer(response_parsed["numberRecordsFailed"][0]) > 0,
+        :succeeded => @succeeded,
+        :some_records_failed => Integer(response_parsed["numberRecordsFailed"][0]) > 0,
+        :some_batches_failed => Integer(response_parsed["numberBatchesFailed"][0]) > 0,
         :response => response_parsed
       }
     end
+    def over_limit?
+      (Time.now - @job_closed_time) > @job_time_limit
+    end
     # downloads whatever is available, returns as unfinished whatever is not
-    def get_results(options={})
-      filenames = []
+    def get_available_results(options={})
+      downloaded_filenames = []
       unfinished_batches = []
+      verification_fail_batches = []
+      failed_batches = []
       # get result for each batch in the job
-      @batches.each do |batch|
+      @unfinished_batches.each do |batch|
         batch_status = batch.check_status
         # if the result is ready
-        if batch_status[:finished]
+        if batch_status[:succeeded]
+          # each finished batch should go here only once
           # download the result
-          filename = batch.get_result(options[:directory_path])
-          filenames.push(filename)
+          result = batch.get_result(options)
+          # if the verification failed, put it to failed
+          # will never ask about this one again.
+          if result[:verification] == false
+            verification_fail_batches << batch
+          else
+            # if verification ok and finished put it to filenames
+            downloaded_filenames << result[:filename]
+          end
+        elsif batch_status[:failed]
+          # put it to failed and raise error at the end
+          failed_batches << batch
         else
           # otherwise put it to unfinished
-          unfinished_batches.push(batch)
+          unfinished_batches << batch
         end
       end
+      unless failed_batches.empty?
+        details = failed_batches.map{ |b| "#{b.batch_id}: #{b.fail_message}"}.join("\n")
+        fail ArgumentError, "#{failed_batches.length} batches failed. Details: #{details}"
+      end
+      # cache the unfinished_batches till the next run
       @unfinished_batches = unfinished_batches
+      # cumulate filenames
+      @filenames += downloaded_filenames
       return {
-        :filenames => filenames,
-        :unfinished_batches => unfinished_batches
+        :filenames => @filenames,
+        :unfinished_batches => unfinished_batches,
+        :verification_fail_batches => verification_fail_batches
       }
     end
-    def get_available_results(options={})
-      # if we didn't reach limit yet, do nothing
-      # if all done, do nothing
-      # if none of the batches finished, same thing
-      if (Time.now - @job_closed < JOB_TIME_LIMIT) || @finished || @completed == 0
-        return nil
-      end
-      return get_results(options)
-    end
     def to_log
       return {
         :sobject => @sobject,

data/lib/salesforce_bulk_query/query.rb CHANGED

@@ -19,23 +19,33 @@ module SalesforceBulkQuery
       @created_from = options[:created_from]
       @created_to = options[:created_to]
       @single_batch = options[:single_batch]
+      # jobs currently running
       @jobs_in_progress = []
+      # successfully finished jobs with no batches to split
       @jobs_done = []
+      # finished or timed-out jobs with some batches split into other jobs
+      @jobs_restarted = []
       @finished_batch_filenames = []
       @restarted_subqueries = []
     end
+    attr_reader :jobs_in_progress, :jobs_restarted, :jobs_done
     DEFAULT_MIN_CREATED = "1999-01-01T00:00:00.000Z"
     # Creates the first job, divides the query to subqueries, puts all the subqueries as batches to the job
-    def start
+    def start(options={})
       # order by and where not allowed
       if (!@single_batch) && (@soql =~ /WHERE/i || @soql =~ /ORDER BY/i)
         raise "You can't have WHERE or ORDER BY in your soql. If you want to download just specific date range use created_from / created_to"
       end
       # create the first job
-      job = SalesforceBulkQuery::Job.new(@sobject, @connection, @logger)
+      job = SalesforceBulkQuery::Job.new(@sobject, @connection, {:logger => @logger}.merge(options))
       job.create_job
       # get the date when it should start
@@ -55,7 +65,7 @@ module SalesforceBulkQuery
       # generate intervals
       start = DateTime.parse(min_created)
-      stop = @created_to ? DateTime.parse(@created_to) : DateTime.now - Rational(OFFSET_FROM_NOW, 1440)
+      stop = @created_to ? DateTime.parse(@created_to) : DateTime.now - Rational(options[:offset_from_now] || OFFSET_FROM_NOW, 1440)
       job.generate_batches(@soql, start, stop, @single_batch)
       job.close_job
@@ -63,89 +73,83 @@ module SalesforceBulkQuery
       @jobs_in_progress.push(job)
     end
+    # Get results for all finished jobs. If there are some unfinished batches, skip them and return them as unfinished.
+    #
+    # @param options[:directory_path]
+    def get_available_results(options={})
-    # Check statuses of all jobs
-    def check_status
       all_done = true
-      job_statuses = []
-      # check all jobs statuses and put them in an array
-      @jobs_in_progress.each do |job|
-        job_status = job.check_status
-        all_done &&= job_status[:finished]
-        job_statuses.push(job_status)
-      end
-      return {
-        :finished => all_done,
-        :job_statuses => job_statuses,
-        :jobs_done => @jobs_done
-      }
-    end
-    # Get results for all jobs
-    # @param options[:directory_path]
-    def get_results(options={})
-      all_job_results = []
-      job_result_filenames = []
       unfinished_subqueries = []
-      # check each job and put it there
-      @jobs_in_progress.each do |job|
-        job_results = job.get_results(options)
-        all_job_results.push(job_results)
-        job_result_filenames += job_results[:filenames]
-        unfinished_subqueries.push(job_results[:unfinished_batches].map {|b| b.soql})
-        # if it's done add it to done
-        if job_results[:unfinished_batches].empty?
-          @jobs_done.push(job)
-        end
-      end
-      return {
-        :filenames => job_result_filenames + @finished_batch_filenames,
-        :unfinished_subqueries => unfinished_subqueries,
-        :restarted_subqueries => @restarted_subqueries,
-        :results => all_job_results,
-        :done_jobs => @jobs_done
-      }
-    end
-    # Restart unfinished batches in all jobs in progress, creating new jobs
-    # downloads results for finished batches
-    def get_result_or_restart(options={})
-      new_jobs = []
-      job_ids_to_remove = []
+      jobs_in_progress = []
+      jobs_restarted = []
       jobs_done = []
+      # check all jobs statuses and split what should be split
       @jobs_in_progress.each do |job|
-        # get available stuff, if not the right time yet, go on
-        available_results = job.get_available_results(options)
-        if available_results.nil?
-          next
-        end
-        unfinished_batches = available_results[:unfinished_batches]
-        # store the filenames and resturted stuff
-        @finished_batch_filenames += available_results[:filenames]
-        @restarted_subqueries += unfinished_batches.map {|b| b.soql}
+        # check job status
+        job_status = job.check_status
+        job_over_limit = job.over_limit?
+        job_done = job_status[:succeeded] || job_over_limit
+        # download what's available
+        job_results = job.get_available_results(options)
+        unfinished_batches = job_results[:unfinished_batches]
+        unfinished_subqueries += unfinished_batches.map {|b| b.soql}
+        # split to subqueries what needs to be split
+        to_split = job_results[:verification_fail_batches]
+        to_split += unfinished_batches if job_over_limit
+        # delete files associated with batches that failed verification
+        job_results[:verification_fail_batches].each do |b|
+          @logger.info "Deleting #{b.filename}, verification failed." if @logger
+          File.delete(b.filename)
+        end
-        unfinished_batches.each do |batch|
+        to_split.each do |batch|
           # for each unfinished batch create a new job and add it to new jobs
-          @logger.info "The following subquery didn't end in time: #{batch.soql}. Dividing into multiple and running again" if @logger
-          new_job = SalesforceBulkQuery::Job.new(@sobject, @connection)
+          @logger.info "The following subquery didn't end in time / failed verification: #{batch.soql}. Dividing into multiple and running again" if @logger
+          new_job = SalesforceBulkQuery::Job.new(@sobject, @connection, {:logger => @logger}.merge(options))
           new_job.create_job
           new_job.generate_batches(@soql, batch.start, batch.stop)
           new_job.close_job
-          new_jobs.push(new_job)
+          jobs_in_progress.push(new_job)
+        end
+        # what to do with the current job
+        # finish, some stuff restarted
+        if job_done
+          if to_split.empty?
+            # done, nothing left
+            jobs_done.push(job)
+          else
+            # done, some batches needed to be restarted
+            jobs_restarted.push(job)
+          end
+          # store the filenames and restarted stuff
+          @finished_batch_filenames += job_results[:filenames]
+          @restarted_subqueries += to_split.map {|b| b.soql}
+        else
+          # still in progress
+          jobs_in_progress.push(job)
         end
-        # the current job to be removed from jobs in progress
-        job_ids_to_remove.push(job.job_id)
-        jobs_done.push(job)
+        # we're done if this job is done and it didn't generate any new jobs
+        all_done &&= (job_done && to_split.empty?)
       end
       # remove the finished jobs from progress and add there the new ones
-      @jobs_in_progress.select! {|j| ! job_ids_to_remove.include?(j.job_id)}
+      @jobs_in_progress = jobs_in_progress
       @jobs_done += jobs_done
-      @jobs_in_progress += new_jobs
+      return {
+        :succeeded => all_done,
+        :filenames => @finished_batch_filenames,
+        :unfinished_subqueries => unfinished_subqueries,
+        :jobs_done => @jobs_done.map { |j| j.job_id }
+      }
     end
   end
 end
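
The new `get_available_results` in `query.rb` sorts each job into one of three lists: still in progress, done, or restarted (finished or over the time limit, with some batches split into new jobs). A minimal sketch of that decision as a pure function; `classify_job` is a hypothetical helper that takes the three inputs as plain values instead of `Job` objects:

```ruby
# Hypothetical helper mirroring the branching in the diff above:
# a job counts as done when it succeeded or ran over its time limit;
# a done job with batches left to split is "restarted".
def classify_job(succeeded, over_limit, to_split_count)
  job_done = succeeded || over_limit
  return :in_progress unless job_done
  to_split_count.zero? ? :done : :restarted
end

puts classify_job(true, false, 0)
puts classify_job(false, true, 2)
puts classify_job(false, false, 0)
```

The whole query is reported as succeeded only when every job lands in the `:done` case.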

data/lib/salesforce_bulk_query/utils.rb ADDED

@@ -0,0 +1,16 @@
+require 'csv'
+module SalesforceBulkQuery
+  class Utils
+    # record count if they want to
+    def self.line_count(f)
+      i = 0
+      CSV.foreach(f, :headers => true) {|_| i += 1}
+      i
+    end
+    def self.header(f)
+      File.open(f, &:readline).split(',').map{ |c| c.strip.delete('"') }
+    end
+  end
+end

data/lib/salesforce_bulk_query/version.rb CHANGED

@@ -1,3 +1,3 @@
 module SalesforceBulkQuery
-  VERSION = '0.0.6'
+  VERSION = '0.1.0'
 end

data/lib/salesforce_bulk_query.rb CHANGED

@@ -3,6 +3,8 @@ require 'csv'
 require 'salesforce_bulk_query/connection'
 require 'salesforce_bulk_query/query'
 require 'salesforce_bulk_query/logger'
+require 'salesforce_bulk_query/utils'
 # Module where all the stuff is happening
 module SalesforceBulkQuery
@@ -36,8 +38,7 @@ module SalesforceBulkQuery
       return url
     end
-    CHECK_INTERVAL = 10
-    QUERY_TIME_LIMIT = 60 * 60 * 2 # two hours
+    CHECK_INTERVAL = 30
     # Query the Salesforce API. It's a blocking method - waits until the query is resolved
     # can take quite some time
@@ -46,7 +47,7 @@ module SalesforceBulkQuery
     # @return hash with :filenames and other useful stuff
     def query(sobject, soql, options={})
       check_interval = options[:check_interval] || CHECK_INTERVAL
-      time_limit = options[:time_limit] || QUERY_TIME_LIMIT
+      time_limit = options[:time_limit] # in seconds
       start_time = Time.now
@@ -55,34 +56,27 @@ module SalesforceBulkQuery
       results = nil
       loop do
-        # check the status
-        status = query.check_status
+        # get available results and check the status
+        results = query.get_available_results(options)
         # if finished get the result and we're done
-        if status[:finished]
+        if results[:succeeded]
-          # get the results and we're done
-          results = query.get_results(:directory_path => options[:directory_path])
-          @logger.info "Query finished. Results: #{results_to_string(results)}" if @logger
+          # we're done
+          @logger.info "Query succeeded. Results: #{results}" if @logger
           break
         end
         # if we've run out of time limit, go away
-        if Time.now - start_time > time_limit
+        if time_limit && (Time.now - start_time > time_limit)
           @logger.warn "Ran out of time limit, downloading what's available and terminating" if @logger
-          # download what's available
-          results = query.get_results(
-            :directory_path => options[:directory_path],
-          )
-          @logger.info "Downloaded the following files: #{results[:filenames]} The following didn't finish in time: #{results[:unfinished_subqueries]}. Results: #{results_to_string(results)}" if @logger
+          @logger.info "Downloaded the following files: #{results[:filenames]} The following didn't finish in time: #{results[:unfinished_subqueries]}." if @logger
           break
         end
-        # restart whatever needs to be restarted and sleep
-        query.get_result_or_restart(:directory_path => options[:directory_path])
         @logger.info "Sleeping #{check_interval}" if @logger
+        @logger.info "Downloaded files: #{results[:filenames].length} Jobs in progress: #{query.jobs_in_progress.length}" if @logger
         sleep(check_interval)
       end
@@ -90,7 +84,7 @@ module SalesforceBulkQuery
       if @logger && ! results[:filenames].empty?
         @logger.info "Download finished. Downloaded files in #{File.dirname(results[:filenames][0])}. Filename size [line count]:"
-        @logger.info "\n" + results[:filenames].sort.map{|f| "#{File.basename(f)} #{File.size(f)} #{line_count(f) if options[:count_lines]}"}.join("\n")
+        @logger.info "\n" + results[:filenames].sort.map{|f| "#{File.basename(f)} #{File.size(f)} #{Utils.line_count(f) if options[:count_lines]}"}.join("\n")
       end
       return results
     end
@@ -101,31 +95,8 @@ module SalesforceBulkQuery
     def start_query(sobject, soql, options={})
       # create the query, start it and return it
       query = SalesforceBulkQuery::Query.new(sobject, soql, @connection, {:logger => @logger}.merge(options))
-      query.start
+      query.start(options)
       return query
     end
-    private
-    # record count if they want to
-    def line_count(f)
-      i = 0
-      CSV.foreach(f, :headers => true) {|_| i+=1}
-      i
-    end
-    # create a hash with just the fields we want to show in logs
-    def results_to_string(results)
-      return results.merge({
-        :results => results[:results].map do |r|
-          r.merge({
-            :unfinished_batches => r[:unfinished_batches].map do |b|
-              b.to_log
-            end
-          })
-        end,
-        :done_jobs => results[:done_jobs].map {|j| j.to_log}
-      })
-    end
   end
 end
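
The reworked loop in the hunk above polls `get_available_results` until `results[:succeeded]` is set, and it only enforces a deadline when `time_limit` is non-nil. A minimal sketch of that control flow follows, assuming a stand-in `FakeQuery` class and a `poll_until_done` helper. Both names are hypothetical and are not part of the gem; only the hash keys `:succeeded`, `:filenames` and `:unfinished_subqueries` are taken from the diff.

```ruby
# Hypothetical stand-in for the gem's query object; it only mimics the
# result hash keys that appear in the diff.
class FakeQuery
  def initialize(polls_needed)
    @polls_needed = polls_needed
    @polls = 0
  end

  # Each call "downloads" one more file until polls_needed calls were made.
  def get_available_results(_options = {})
    @polls += 1
    done = @polls >= @polls_needed
    {
      :succeeded => done,
      :filenames => (1..[@polls, @polls_needed].min).map { |i| "part_#{i}.csv" },
      :unfinished_subqueries => done ? [] : ["subquery_#{@polls + 1}"]
    }
  end
end

# Hypothetical helper showing the loop shape: poll, stop on success,
# stop on an optional deadline, otherwise sleep and poll again.
def poll_until_done(query, options = {})
  check_interval = options.fetch(:check_interval, 0)
  time_limit = options[:time_limit] # nil means "no limit"
  start_time = Time.now

  loop do
    results = query.get_available_results(options)
    return results if results[:succeeded]
    # the nil guard mirrors `time_limit && (...)` from the diff
    return results if time_limit && (Time.now - start_time > time_limit)
    sleep(check_interval)
  end
end

finished = poll_until_done(FakeQuery.new(3))
puts finished[:filenames].inspect
```

Because the partial results are returned as-is when the deadline passes, a caller can still inspect `:filenames` and `:unfinished_subqueries` after a timeout.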

data/salesforce_bulk_query.gemspec CHANGED

@@ -22,6 +22,11 @@ Gem::Specification.new do |s|
   s.add_development_dependency 'restforce', '~>1.4'
   s.add_development_dependency 'rspec', '~>2.14'
   s.add_development_dependency 'pry', '~>0.9'
+  s.add_development_dependency 'pry-stack_explorer', '~>0.4' if RUBY_PLATFORM != 'java'
+  s.add_development_dependency 'rake', '~> 10.3'
+  s.add_development_dependency 'coveralls', '~> 0.7', '>= 0.7.0'
   s.files = `git ls-files`.split($/)
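
The `coveralls` line above combines `'~> 0.7'` with `'>= 0.7.0'`. As a quick check of what those constraints accept, the sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes; the list of sample version strings is an arbitrary choice for illustration.

```ruby
require 'rubygems'

# '~> 0.7' is the pessimistic constraint: >= 0.7 and < 1.0
pessimistic = Gem::Requirement.new('~> 0.7')
# the pair of constraints as written in the gemspec
combined = Gem::Requirement.new('~> 0.7', '>= 0.7.0')

# sample versions chosen for illustration only
%w[0.6.9 0.7.0 0.7.5 0.9.9 1.0.0].each do |v|
  version = Gem::Version.new(v)
  puts "#{v}: ~> 0.7 => #{pessimistic.satisfied_by?(version)}, " \
       "combined => #{combined.satisfied_by?(version)}"
end
```

For these samples both requirements agree, which suggests the extra `'>= 0.7.0'` does not narrow the range further.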

data/spec/salesforce_bulk_query_spec.rb CHANGED

@@ -1,32 +1,17 @@
 require 'spec_helper'
 require 'multi_json'
-require 'restforce'
 require 'csv'
 require 'tmpdir'
 require 'logger'
+require 'set'
-LOGGING = false
+# a test that somehow fakes the situation in twc
 describe SalesforceBulkQuery do
   before :all do
-    auth = MultiJson.load(File.read('test_salesforce_credentials.json'), :symbolize_keys => true)
-    @client = Restforce.new(
-      :username => auth[:username],
-      :password => auth[:password],
-      :security_token => auth[:token],
-      :client_id => auth[:client_id],
-      :client_secret => auth[:client_secret],
-      :api_version => '30.0'
-    )
-    @api = SalesforceBulkQuery::Api.new(@client,
-      :api_version => '30.0',
-      :logger => LOGGING ? Logger.new(STDOUT): nil
-    )
-    # switch off the normal logging
-    Restforce.log = false
+    @client = SpecHelper.create_default_restforce
+    @api = SpecHelper.create_default_api(@client)
+    @entity = ENV['ENTITY'] || 'Opportunity'
+    @field_list = (ENV['FIELD_LIST'] || "Id,CreatedDate").split(',')
   end
   describe "instance_url" do
@@ -38,21 +23,29 @@ describe SalesforceBulkQuery do
   end
   describe "query" do
+    context "if you give it an invalid SOQL" do
+      it "fails with argument error" do
+        expect{@api.query(@entity, "SELECT Id, SomethingInvalid FROM #{@entity}")}.to raise_error(ArgumentError)
+      end
+    end
     context "when you give it no options" do
       it "downloads the data to a few files", :constraint => 'slow'  do
-        result = @api.query("Opportunity", "SELECT Id, Name FROM Opportunity")
-        result[:filenames].should have_at_least(2).items
-        result[:results].should_not be_empty
-        result[:done_jobs].should_not be_empty
+        result = @api.query(@entity, "SELECT #{@field_list.join(', ')} FROM #{@entity}", :count_lines => true)
+        filenames = result[:filenames]
+        filenames.should have_at_least(2).items
+        result[:jobs_done].should_not be_empty
+        # no duplicate filenames
+        expect(Set.new(filenames).length).to eq(filenames.length)
-        result[:filenames].each do |filename|
+        filenames.each do |filename|
           File.size?(filename).should be_true
           lines = CSV.read(filename)
           if lines.length > 1
             # first line should be the header
-            lines[0].should eql(["Id", "Name"])
+            lines[0].should eql(@field_list)
             # first id shouldn't be empty
             lines[1][0].should_not be_empty
@@ -79,8 +72,7 @@ describe SalesforceBulkQuery do
         )
         result[:filenames].should have(1).items
-        result[:results].should_not be_empty
-        result[:done_jobs].should_not be_empty
+        result[:jobs_done].should_not be_empty
         filename = result[:filenames][0]
@@ -99,32 +91,42 @@ describe SalesforceBulkQuery do
       end
     end
     context "when you give it a short time limit" do
-      it "downloads just a few files" do
+      it "downloads some stuff is unfinished" do
         result = @api.query(
-          "Task",
-          "SELECT Id, Name, CreatedDate FROM Task",
-          :time_limit => 30
+          "Opportunity",
+          "SELECT Id, Name, CreatedDate FROM Opportunity",
+          :time_limit => 15
         )
-        result[:results].should_not be_empty
+        # one of them should be non-empty
+        expect((! result[:unfinished_subqueries].empty?) || (! result[:filenames].empty?)).to eq true
+      end
+    end
+    context "when you pass a short job time limit" do
+      it "creates quite a few jobs quickly", :skip => true do
+        # development only
+        result = @api.query(
+          @entity,
+          "SELECT Id, CreatedDate FROM #{@entity}",
+          :count_lines => true,
+          :job_time_limit => 60
+        )
+        require 'pry'; binding.pry
       end
     end
   end
   describe "start_query" do
     it "starts a query that finishes some time later" do
-      query = @api.start_query("Opportunity",  "SELECT Id, Name, CreatedDate FROM Opportunity")
+      query = @api.start_query("Opportunity",  "SELECT Id, Name, CreatedDate FROM Opportunity", :single_batch => true)
       # get a coffee
-      sleep(40)
+      sleep(60*2)
       # check the status
-      status = query.check_status
-      if status[:finished]
-        result = query.get_results
-        result[:filenames].should have_at_least(2).items
-        result[:results].should_not be_empty
-        result[:done_jobs].should_not be_empty
-      end
+      result = query.get_available_results
+      expect(result[:succeeded]).to eq true
+      result[:filenames].should have_at_least(1).items
+      result[:jobs_done].should_not be_empty
     end
   end

data/spec/spec_helper.rb CHANGED

@@ -1,6 +1,33 @@
 require 'salesforce_bulk_query'
+require 'restforce'
 RSpec.configure do |c|
   c.filter_run :focus => true
   c.run_all_when_everything_filtered = true
+  c.filter_run_excluding :skip => true
+end
+class SpecHelper
+  DEFAULT_API_VERSION = '30.0'
+  def self.create_default_restforce
+    Restforce.new(
+      :username => ENV['USERNAME'],
+      :password => ENV['PASSWORD'],
+      :security_token => ENV['TOKEN'],
+      :client_id => ENV['CLIENT_ID'],
+      :client_secret => ENV['CLIENT_SECRET'],
+      :api_version => ENV['API_VERSION'] || DEFAULT_API_VERSION
+    )
+  end
+  def self.create_default_api(restforce)
+    # switch off the normal logging
+    Restforce.log = false
+    SalesforceBulkQuery::Api.new(restforce,
+      :api_version => ENV['API_VERSION'] || DEFAULT_API_VERSION,
+      :logger => ENV['LOGGING'] ? Logger.new(STDOUT): nil
+    )
+  end
 end
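
The new `SpecHelper` reads credentials straight from `ENV`, so a missing variable only surfaces later as a failed login. One possible way to fail earlier is sketched below. The `EnvConfig` module and its `load` method are hypothetical and not part of the gem or its specs; only the variable names and the `'30.0'` default come from the diff. Note that the helper in the diff treats any non-nil `LOGGING` value as true, whereas this sketch deliberately uses a stricter `== 'true'` comparison.

```ruby
# Hypothetical helper: collects the settings SpecHelper reads from ENV
# and raises early when a required one is missing.
module EnvConfig
  DEFAULT_API_VERSION = '30.0'
  REQUIRED = %w[USERNAME PASSWORD TOKEN CLIENT_ID CLIENT_SECRET]

  # env defaults to the process environment; a plain Hash works as well,
  # which keeps the method easy to exercise without touching ENV.
  def self.load(env = ENV)
    missing = REQUIRED.reject { |key| env[key] && !env[key].empty? }
    unless missing.empty?
      raise ArgumentError, "Missing environment variables: #{missing.join(', ')}"
    end

    {
      :username => env['USERNAME'],
      :password => env['PASSWORD'],
      :security_token => env['TOKEN'],
      :client_id => env['CLIENT_ID'],
      :client_secret => env['CLIENT_SECRET'],
      :api_version => env['API_VERSION'] || DEFAULT_API_VERSION,
      # stricter than the diff: only the literal string 'true' enables logging
      :logging => env['LOGGING'] == 'true'
    }
  end
end

# Example with a plain Hash standing in for ENV (placeholder values)
sample = {
  'USERNAME' => 'me@mycompany.com',
  'PASSWORD' => 'mypassword',
  'TOKEN' => 'token',
  'CLIENT_ID' => 'id',
  'CLIENT_SECRET' => 'secret'
}
puts EnvConfig.load(sample)[:api_version]
```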

metadata CHANGED

@@ -1,99 +1,147 @@
 --- !ruby/object:Gem::Specification
 name: salesforce_bulk_query
 version: !ruby/object:Gem::Version
-  version: 0.0.6
+  version: 0.1.0
 platform: ruby
 authors:
 - Petr Cvengros
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-01-09 00:00:00.000000000 Z
+date: 2015-01-23 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: json
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '1.8'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '1.8'
 - !ruby/object:Gem::Dependency
   name: xml-simple
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '1.1'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '1.1'
 - !ruby/object:Gem::Dependency
   name: multi_json
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '1.9'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '1.9'
 - !ruby/object:Gem::Dependency
   name: restforce
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '1.4'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '1.4'
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '2.14'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '2.14'
 - !ruby/object:Gem::Dependency
   name: pry
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '0.9'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '0.9'
+- !ruby/object:Gem::Dependency
+  name: pry-stack_explorer
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.3'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.3'
+- !ruby/object:Gem::Dependency
+  name: coveralls
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.7'
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 0.7.0
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.7'
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: 0.7.0
 description: A library for downloading data from Salesforce Bulk API. We only focus
   on querying, other operations of the API aren't supported. Designed to handle a
   lot of data.
@@ -103,17 +151,20 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
-- .gitignore
+- ".gitignore"
+- ".travis.yml"
 - Gemfile
 - LICENSE
 - README.md
-- example_test_salesforce_credentials.json
+- Rakefile
+- env_setup-example.sh
 - lib/salesforce_bulk_query.rb
 - lib/salesforce_bulk_query/batch.rb
 - lib/salesforce_bulk_query/connection.rb
 - lib/salesforce_bulk_query/job.rb
 - lib/salesforce_bulk_query/logger.rb
 - lib/salesforce_bulk_query/query.rb
+- lib/salesforce_bulk_query/utils.rb
 - lib/salesforce_bulk_query/version.rb
 - salesforce_bulk_query.gemspec
 - spec/salesforce_bulk_query_spec.rb
@@ -128,12 +179,12 @@ require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
-  - - '>='
+  - - ">="
     - !ruby/object:Gem::Version
       version: '1.9'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - '>='
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
@@ -143,3 +194,4 @@ signing_key:
 specification_version: 4
 summary: Downloading data from Salesforce Bulk API made easy and scalable.
 test_files: []
+has_rdoc:

data/example_test_salesforce_credentials.json DELETED

@@ -1,7 +0,0 @@
-{
-  "username": "me@mycompany.com",
-  "password": "mypassword",
-  "token": "token I got in my email",
-  "client_id": "id for my registered SFDC app",
-  "client_secret": "secret number for my SFDC app"
-}