salesforce_bulk_query 0.0.2

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: a199495e5a1f919e25c9c2f6f129a9d2010b240a
+   data.tar.gz: ffca70b80c504a2c43224e828bc0d059cc23477e
+ SHA512:
+   metadata.gz: 7149c8f12a7fcacc181f1ec39d9d987b68db9dbbf1beada5ab3a0774343bfd9e7aced0d828a7ac78978bd2d2baca5b1cf8309a27098d9b7a452bb277cfd2f5e7
+   data.tar.gz: 669dab096fd60a81c1bdb0a0a1b0811ebdc57519f23a0f44b8000ef94e75911a3ff51660abe5cc7a0bc0aeb9c9a5a5b37f9d800149428f3afa7c1d5c4aaaf737
data/.gitignore ADDED
@@ -0,0 +1,2 @@
+ test_salesforce_credentials.json
+ Gemfile.lock
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source "http://rubygems.org"
+
+ # Specify your gem's dependencies in salesforce_bulk_query.gemspec
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,22 @@
+ (BSD License)
+
+ Copyright (c) 2014 Yatish Mehta & GoodData Corporation. All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided
+ that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice, this list of conditions and
+   the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice, this list of conditions
+   and the following disclaimer in the documentation and/or other materials provided with the distribution.
+ * Neither the name of the GoodData Corporation nor the names of its contributors may be used to endorse
+   or promote products derived from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
+ OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
+ CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/README.md ADDED
@@ -0,0 +1,108 @@
+ Salesforce Bulk Query
+ =====================
+ A library for downloading data from the Salesforce Bulk API. We focus only on querying; other operations of the API aren't supported. Designed to handle a lot of data.
+
+ Derived from [Salesforce Bulk API](https://github.com/yatish27/salesforce_bulk_api)
+
+ ## Basic Usage
+ To install, run:
+
+     gem install salesforce_bulk_query
+
+ or add
+
+     gem 'salesforce_bulk_query'
+
+ to your Gemfile.
+
+ Before using the library, make sure you have an account in your Salesforce organization that has API access, and that you won't run out of the [API limits](http://www.salesforce.com/us/developer/docs/api_asynchpre/Content/asynch_api_concepts_limits.htm#batch_proc_time_title).
+
+ For most of the API calls, the library uses [Restforce](https://github.com/ejholmes/restforce). Code example:
+
22
+ require 'restforce'
23
+ require 'salesforce_bulk_query'
24
+
25
+ # Create a restforce client instance
26
+ # with basic auth
27
+ restforce = Restforce.new(
28
+ :username => 'me',
29
+ :password => 'password',
30
+ :security_token => 'token',
31
+ :client_id => "my sfdc app client id",
32
+ :client_secret => "my sfdc app client secret"
33
+ )
34
+
35
+ # or OAuth
36
+ restforce = Restforce.new(
37
+ :refresh_token => "xyz",
38
+ :client_id => "my sfdc app client id",
39
+ :client_secret => "my sfdc app client secret"
40
+ )
41
+
42
+ bulk_api = SalesforceBulkQuery::Api.new(restforce)
43
+
44
+ # query the api
45
+ result = bulk_client.query("Task", "SELECT Id, Name FROM Task")
46
+
47
+ # the result is files
48
+ puts "All the downloaded stuff is in csvs: #{result[:filenames]}"
49
+
50
+ # query is a blocking call and can take several hours
51
+ # if you want to just start the query asynchronously, use
52
+ query = start_query("Task", "SELECT Id, Name FROM Task")
53
+
54
+ # get a cofee
55
+ sleep(1234)
56
+
57
+ # check the status
58
+ status = query.check_status
59
+ if status[:finished]
60
+ result = query.get_results
61
+ puts "All the downloaded stuff is in csvs: #{result[:filenames]}"
62
+ end
63
+
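+ If you drive the polling yourself, you can also periodically download what's ready and restart subqueries that ran over the job time limit, the way the blocking `query` call does internally. A minimal sketch (the sleep value is arbitrary):
+
+     query = bulk_api.start_query("Task", "SELECT Id, Name FROM Task")
+
+     loop do
+       status = query.check_status
+       if status[:finished]
+         result = query.get_results
+         puts "Done, the CSVs: #{result[:filenames]}"
+         break
+       end
+
+       # downloads whatever is ready and re-submits batches that exceeded
+       # the job time limit as new, smaller jobs
+       query.get_result_or_restart
+
+       sleep(60)
+     end
+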
+ ## How it works
+
+ The library uses the [Salesforce Bulk API](https://www.salesforce.com/us/developer/docs/api_asynch/index_Left.htm#CSHID=asynch_api_bulk_query.htm|StartTopic=Content%2Fasynch_api_bulk_query.htm|SkinName=webhelp). The given query is divided into 15 subqueries, according to the [limits](http://www.salesforce.com/us/developer/docs/api_asynchpre/Content/asynch_api_concepts_limits.htm#batch_proc_time_title). Each subquery covers an interval of the CreatedDate Salesforce field, and the interval bounds are passed to the API as conditions in the SOQL queries. Subqueries are sent to the API as batches and added to a job.
+
+ The first interval starts at the date the first Salesforce object was created; we query the Salesforce REST API for that. If this query times out, we use a constant. The last interval ends a few minutes before now, to avoid consistency issues. A custom start and end can be passed - see Options.
+
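+ For illustration, the splitting works roughly like this (a standalone sketch, not the gem's actual code; the dates are made up):
+
+     require 'date'
+
+     start = DateTime.parse("2014-01-01T00:00:00.000Z")
+     stop  = DateTime.parse("2014-01-16T00:00:00.000Z")
+     batch_count = 15
+
+     # equal-sized CreatedDate intervals covering start..stop
+     step = (stop - start) / batch_count
+     intervals = (0...batch_count).map { |i| [start + step * i, start + step * (i + 1)] }
+
+     # each pair ends up in one subquery as
+     # ... WHERE CreatedDate >= <from> AND CreatedDate < <to>
+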
+ A job has a fixed time limit to process all its subqueries. Batches that finish in time are downloaded to CSVs; batches that don't are each divided into 15 subqueries and added to new jobs.
+
+ CSV results are downloaded in chunks, so that we don't run into memory-related issues. All other requests are made through the Restforce client that is passed in when instantiating the Api class. Restforce is not in the dependencies, so theoretically you can pass another object with the same set of methods as the Restforce client.
+
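+ Judging from the calls the library makes, a replacement client would need roughly this interface (an inferred sketch, not an official contract):
+
+     class MyClient
+       # raw Bulk API requests made by the Connection class
+       def post(path, body, headers); end
+       def get(path, params, headers); end
+
+       # the Connection class reads the session token and instance url from here
+       def options
+         {:oauth_token => 'session id', :instance_url => 'https://na1.salesforce.com'}
+       end
+
+       # used by Query to find the oldest CreatedDate via the REST API
+       def query(soql); end
+
+       # used by Api#instance_url
+       def instance_url; end
+     end
+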
+ ## Options
+ There are a few optional settings you can pass to the `Api` constructor and its methods (an example call follows the list):
+ * `api_version`: which Salesforce api version should be used
+ * `logger`: where logs should go
+ * `filename_prefix`: prefix applied to csv files
+ * `directory_path`: custom directory path for CSVs; if omitted, a new temp directory is created
+ * `check_interval`: how often the results should be checked, in seconds
+ * `time_limit`: maximum time the query can take, in seconds. If this limit is exceeded, available results are downloaded and the list of subqueries that didn't finish is returned. The limit should be understood as a limit on waiting: when it's reached, the function downloads whatever data is ready, which can take some additional time.
+ * `created_from`, `created_to`: limits for the CreatedDate field, in a format like `"1999-01-01T00:00:00.000Z"`. Note that queries can't contain any WHERE clauses, as we do some manipulations to create the subqueries and we don't want things to get too complicated, so this is the way to limit the query yourself.
+ * `single_batch`: if true, the query is not divided into subqueries as described above. Instead, one batch job is created with the given query.
+
+ See specs for exact usage.
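+
+ For illustration, a call using several of the query options might look like this (all values are placeholders):
+
+     result = bulk_api.query(
+       "Opportunity",
+       "SELECT Id, Name FROM Opportunity",
+       :check_interval => 30,
+       :time_limit => 3600,
+       :directory_path => "/tmp/sfdc-csvs",
+       :created_from => "2014-01-01T00:00:00.000Z",
+       :created_to => "2014-06-01T00:00:00.000Z"
+     )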
+
+ ## Logging
+
+     require 'logger'
+     require 'restforce'
+
+     # create the restforce client
+     restforce = Restforce.new(...)
+
+     # instantiate a logger and pass it to the Api constructor
+     logger = Logger.new(STDOUT)
+     bulk_api = SalesforceBulkQuery::Api.new(restforce, :logger => logger)
+
+     # switch off logging in Restforce so you don't get every message twice
+     Restforce.log = false
+
+ If you're using Restforce as a client (which you probably are) and you want logging, Salesforce Bulk Query will use a custom logging middleware for Restforce. This is because the original logging middleware logs all API responses in full, which is not something you want to do for CSVs of a few gigabytes. When you use the :logger parameter, it's recommended you switch off the default logging in Restforce, otherwise you'll get all messages twice.
+
+ ## Copyright
+
+ Copyright (c) 2014 Yatish Mehta & GoodData Corporation. See [LICENSE](LICENSE) for details.
+
data/example_test_salesforce_credentials.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "username": "me@mycompany.com",
+   "password": "mypassword",
+   "token": "token I got in my email",
+   "client_id": "id for my registered SFDC app",
+   "client_secret": "secret number for my SFDC app"
+ }
data/lib/salesforce_bulk_query/batch.rb ADDED
@@ -0,0 +1,76 @@
+ require 'tmpdir'
+
+ module SalesforceBulkQuery
+   # Represents a Salesforce api batch. A batch contains a single subquery.
+   # Many batches are contained in a Job.
+   class Batch
+     def initialize(options)
+       @sobject = options[:sobject]
+       @soql = options[:soql]
+       @job_id = options[:job_id]
+       @connection = options[:connection]
+       @start = options[:start]
+       @stop = options[:stop]
+       @@directory_path ||= Dir.mktmpdir
+     end
+
+     attr_reader :soql, :start, :stop
+
+     # Do the api request
+     def create
+       path = "job/#{@job_id}/batch/"
+
+       response_parsed = @connection.post_xml(path, @soql, {:csv_content_type => true})
+
+       @batch_id = response_parsed['id'][0]
+     end
+
+     def check_status
+       # request to get the result id
+       path = "job/#{@job_id}/batch/#{@batch_id}/result"
+
+       response_parsed = @connection.get_xml(path)
+
+       @result_id = response_parsed["result"] ? response_parsed["result"][0] : nil
+       return {
+         :finished => ! @result_id.nil?,
+         :result_id => @result_id
+       }
+     end
+
+     def get_filename
+       return "#{@sobject}_#{@batch_id}_#{@start}-#{@stop}.csv"
+     end
+
+     def get_result(directory_path=nil)
+       # request to get the actual results
+       path = "job/#{@job_id}/batch/#{@batch_id}/result/#{@result_id}"
+
+       if !@result_id
+         raise "batch not finished yet, trying to get result: #{path}"
+       end
+
+       directory_path ||= @@directory_path
+
+       # write it to a file
+       filename = File.join(directory_path, get_filename)
+       @connection.get_to_file(path, filename)
+
+       return filename
+     end
+
+     def to_log
+       return {
+         :sobject => @sobject,
+         :soql => @soql,
+         :job_id => @job_id,
+         :connection => @connection.to_log,
+         :start => @start,
+         :stop => @stop,
+         :directory_path => @@directory_path
+       }
+     end
+   end
+ end
data/lib/salesforce_bulk_query/connection.rb ADDED
@@ -0,0 +1,124 @@
+ require 'xmlsimple'
+ require 'net/http'
+
+ module SalesforceBulkQuery
+
+   # Connection to the Salesforce API,
+   # shared by all classes that do requests
+   class Connection
+     def initialize(client, api_version, logger=nil, filename_prefix=nil)
+       @client = client
+       @logger = logger
+       @filename_prefix = filename_prefix
+
+       @@API_VERSION = api_version
+       @@PATH_PREFIX = "/services/async/#{@@API_VERSION}/"
+     end
+
+     attr_reader :client
+
+     XML_REQUEST_HEADER = {'Content-Type' => 'application/xml; charset=utf-8'}
+     CSV_REQUEST_HEADER = {'Content-Type' => 'text/csv; charset=UTF-8'}
+
+     def session_header
+       {'X-SFDC-Session' => @client.options[:oauth_token]}
+     end
+
+     def parse_xml(xml)
+       parsed = nil
+       begin
+         parsed = XmlSimple.xml_in(xml)
+       rescue => e
+         @logger.error "Error parsing xml: #{xml}\n#{e}\n#{e.backtrace}" if @logger
+         raise
+       end
+
+       return parsed
+     end
+
+     def post_xml(path, xml, options={})
+       path = "#{@@PATH_PREFIX}#{path}"
+       headers = options[:csv_content_type] ? CSV_REQUEST_HEADER : XML_REQUEST_HEADER
+
+       response = nil
+       # do the request
+       with_retries do
+         begin
+           response = @client.post(path, xml, headers.merge(session_header))
+         rescue JSON::ParserError => e
+           if e.message.index('ExceededQuota')
+             raise "You've run out of sfdc batch api quota. Original error: #{e}\n #{e.backtrace}"
+           end
+           raise e
+         end
+       end
+
+       return parse_xml(response.body)
+     end
+
+     def get_xml(path, options={})
+       path = "#{@@PATH_PREFIX}#{path}"
+       headers = XML_REQUEST_HEADER
+
+       response = nil
+       with_retries do
+         response = @client.get(path, {}, headers.merge(session_header))
+       end
+
+       return options[:skip_parsing] ? response.body : parse_xml(response.body)
+     end
+
+     def get_to_file(path, filename)
+       path = "#{@@PATH_PREFIX}#{path}"
+       uri = URI.parse(@client.options[:instance_url])
+       # open an http connection to the instance
+       http = Net::HTTP.new(uri.host, uri.port)
+       http.use_ssl = true
+       headers = XML_REQUEST_HEADER.merge(session_header)
+       @logger.info "Doing GET to #{path}, headers #{headers}" if @logger
+
+       if @filename_prefix
+         filename = "#{@filename_prefix}_#{filename}"
+       end
+
+       # do the request
+       http.request_get(path, headers) do |res|
+
+         @logger.info "Got response #{res.inspect}, reading response body in chunks and writing to #{filename}" if @logger
+
+         File.open(filename, 'w') do |file|
+           # write the body to the file in chunks
+           res.read_body do |segment|
+             file.write(segment)
+           end
+         end
+       end
+     end
+
+     def with_retries
+       i = 0
+       begin
+         yield
+       rescue => e
+         i += 1
+         if i < 3
+           @logger.warn "Retrying, got error: #{e}, #{e.backtrace}" if @logger
+           retry
+         else
+           @logger.error "Failed 3 times, last error: #{e}, #{e.backtrace}" if @logger
+           raise
+         end
+       end
+     end
+
+     def to_log
+       return {
+         :client => "Restforce, probably",
+         :filename_prefix => @filename_prefix,
+         :api_version => @@API_VERSION,
+         :path_prefix => @@PATH_PREFIX
+       }
+     end
+   end
+ end
data/lib/salesforce_bulk_query/job.rb ADDED
@@ -0,0 +1,161 @@
+ require "salesforce_bulk_query/batch"
+
+ module SalesforceBulkQuery
+
+   # Represents a Salesforce bulk api job, contains multiple batches.
+   # Many jobs are contained in a Query.
+   class Job
+     @@operation = 'query'
+     @@xml_header = '<?xml version="1.0" encoding="utf-8" ?>'
+     JOB_TIME_LIMIT = 10 * 60
+     BATCH_COUNT = 15
+
+
+     def initialize(sobject, connection, logger=nil)
+       @sobject = sobject
+       @connection = connection
+       @logger = logger
+       @batches = []
+       @unfinished_batches = []
+     end
+
+     attr_reader :job_id
+
+     # Do the API request
+     def create_job(csv=true)
+       content_type = csv ? "CSV" : "XML"
+       xml = "#{@@xml_header}<jobInfo xmlns=\"http://www.force.com/2009/06/asyncapi/dataload\">"
+       xml += "<operation>#{@@operation}</operation>"
+       xml += "<object>#{@sobject}</object>"
+       xml += "<contentType>#{content_type}</contentType>"
+       xml += "</jobInfo>"
+
+       response_parsed = @connection.post_xml("job", xml)
+       @job_id = response_parsed['id'][0]
+     end
+
+     def get_extended_soql(soql, from, to)
+       return "#{soql} WHERE CreatedDate >= #{from} AND CreatedDate < #{to}"
+     end
+
+     def generate_batches(soql, start, stop, single_batch=false)
+       # if there's just one batch wanted, add it and we're done
+       if single_batch
+         soql_extended = get_extended_soql(soql, start, stop)
+         @logger.info "Adding soql #{soql_extended} as a batch to job" if @logger
+
+         add_query(soql_extended,
+           :start => start,
+           :stop => stop
+         )
+         return
+       end
+
+       # if there are more, generate the time intervals and the batches
+       step_size = (stop - start) / BATCH_COUNT
+
+       interval_beginnings = start.step(stop - step_size, step_size).map {|f| f}
+       interval_ends = interval_beginnings.clone
+       interval_ends.shift
+       interval_ends.push(stop)
+
+       interval_beginnings.zip(interval_ends).each do |from, to|
+
+         soql_extended = get_extended_soql(soql, from, to)
+         @logger.info "Adding soql #{soql_extended} as a batch to job" if @logger
+
+         add_query(soql_extended,
+           :start => from,
+           :stop => to
+         )
+       end
+     end
+
+     def add_query(query, options={})
+       # create a batch and submit it via the api
+       batch = SalesforceBulkQuery::Batch.new(
+         :sobject => @sobject,
+         :soql => query,
+         :job_id => @job_id,
+         :connection => @connection,
+         :start => options[:start],
+         :stop => options[:stop]
+       )
+       batch.create
+
+       # add the batch to the list
+       @batches.push(batch)
+     end
+
+     def close_job
+       xml = "#{@@xml_header}<jobInfo xmlns=\"http://www.force.com/2009/06/asyncapi/dataload\">"
+       xml += "<state>Closed</state>"
+       xml += "</jobInfo>"
+
+       path = "job/#{@job_id}"
+
+       response_parsed = @connection.post_xml(path, xml)
+       @job_closed = Time.now
+     end
+
+     def check_status
+       path = "job/#{@job_id}"
+       response_parsed = @connection.get_xml(path)
+       @completed = Integer(response_parsed["numberBatchesCompleted"][0])
+       @finished = @completed == Integer(response_parsed["numberBatchesTotal"][0])
+       return {
+         :finished => @finished,
+         :some_failed => Integer(response_parsed["numberRecordsFailed"][0]) > 0,
+         :response => response_parsed
+       }
+     end
+
+     # downloads whatever is available, returns as unfinished whatever is not
+     def get_results(options={})
+       filenames = []
+       unfinished_batches = []
+
+       # get the result for each batch in the job
+       @batches.each do |batch|
+         batch_status = batch.check_status
+
+         # if the result is ready
+         if batch_status[:finished]
+
+           # download the result
+           filename = batch.get_result(options[:directory_path])
+           filenames.push(filename)
+         else
+           # otherwise put it to unfinished
+           unfinished_batches.push(batch)
+         end
+       end
+       @unfinished_batches = unfinished_batches
+
+       return {
+         :filenames => filenames,
+         :unfinished_batches => unfinished_batches
+       }
+     end
+
+     def get_available_results(options={})
+       # if we haven't reached the time limit yet, do nothing
+       # if all is done, do nothing
+       # if none of the batches finished, same thing
+       if (Time.now - @job_closed < JOB_TIME_LIMIT) || @finished || @completed == 0
+         return nil
+       end
+
+       return get_results(options)
+     end
+
+     def to_log
+       return {
+         :sobject => @sobject,
+         :connection => @connection.to_log,
+         :batches => @batches.map {|b| b.to_log},
+         :unfinished_batches => @unfinished_batches.map {|b| b.to_log}
+       }
+     end
+   end
+ end
data/lib/salesforce_bulk_query/logger.rb ADDED
@@ -0,0 +1,44 @@
+ require 'forwardable'
+ require 'faraday'
+
+ module SalesforceBulkQuery
+   # Custom logger for Restforce that doesn't log tons of data.
+   class Logger < Faraday::Response::Middleware
+     extend Forwardable
+
+     MAX_LOG_LENGTH = 2000
+
+     def initialize(app, logger, options)
+       super(app)
+       @options = options
+       @logger = logger || begin
+         require 'logger'
+         ::Logger.new(STDOUT)
+       end
+     end
+
+     def_delegators :@logger, :debug, :info, :warn, :error, :fatal
+
+     def call(env)
+       debug('request') do
+         dump :url => env[:url].to_s,
+              :method => env[:method],
+              :headers => env[:request_headers],
+              :body => env[:body][0..MAX_LOG_LENGTH]
+       end
+       super
+     end
+
+     def on_complete(env)
+       debug('response') do
+         dump :status => env[:status].to_s,
+              :headers => env[:response_headers],
+              :body => env[:body][0..MAX_LOG_LENGTH]
+       end
+     end
+
+     def dump(hash)
+       "\n" + hash.map { |k, v| "  #{k}: #{v.inspect}" }.join("\n")
+     end
+   end
+ end
data/lib/salesforce_bulk_query/query.rb ADDED
@@ -0,0 +1,149 @@
+ require 'salesforce_bulk_query/job'
+
+ module SalesforceBulkQuery
+
+   # Abstraction of a single user-given query. It contains multiple jobs and is tied to a specific connection
+   class Query
+
+     # if no created_to is given, we use the current time with this offset
+     # subtracted (to make sure the freshest changes, which can be inconsistent,
+     # aren't included). It's in minutes
+     OFFSET_FROM_NOW = 10
+
+     def initialize(sobject, soql, connection, options={})
+       @sobject = sobject
+       @soql = soql
+       @connection = connection
+       @logger = options[:logger]
+       @created_from = options[:created_from]
+       @created_to = options[:created_to]
+       @single_batch = options[:single_batch]
+       @jobs_in_progress = []
+       @jobs_done = []
+       @finished_batch_filenames = []
+       @restarted_subqueries = []
+     end
+
+     DEFAULT_MIN_CREATED = "1999-01-01T00:00:00.000Z"
+
+     # Creates the first job, divides the query into subqueries and adds all the subqueries to the job as batches
+     def start
+       # order by and where not allowed
+       if (!@single_batch) && (@soql =~ /WHERE/i || @soql =~ /ORDER BY/i)
+         raise "You can't have WHERE or ORDER BY in your soql. If you want to download just a specific date range, use created_from / created_to"
+       end
+
+       # create the first job
+       job = SalesforceBulkQuery::Job.new(@sobject, @connection, @logger)
+       job.create_job
+
+       # get the date when it should start
+       if @created_from
+         min_created = @created_from
+       else
+         # get the date when the first object was created
+         min_created = nil
+         begin
+           min_created_resp = @connection.client.query("SELECT CreatedDate FROM #{@sobject} ORDER BY CreatedDate LIMIT 1")
+           min_created_resp.each {|s| min_created = s[:CreatedDate]}
+         rescue Faraday::Error::TimeoutError => e
+           @logger.warn "Timeout getting the oldest object for #{@sobject}. Error: #{e}. Using the default value" if @logger
+           min_created = DEFAULT_MIN_CREATED
+         end
+       end
+
+       # generate intervals
+       start = DateTime.parse(min_created)
+       stop = @created_to ? DateTime.parse(@created_to) : DateTime.now - Rational(OFFSET_FROM_NOW, 1440)
+       job.generate_batches(@soql, start, stop, @single_batch)
+
+       job.close_job
+
+       @jobs_in_progress.push(job)
+     end
+
+
+     # Check statuses of all jobs
+     def check_status
+       all_done = true
+       job_statuses = []
+       # check all job statuses and put them in an array
+       @jobs_in_progress.each do |job|
+         job_status = job.check_status
+         all_done &&= job_status[:finished]
+         job_statuses.push(job_status)
+       end
+
+       return {
+         :finished => all_done,
+         :job_statuses => job_statuses
+       }
+     end
+
+     # Get results for all jobs
+     # @param options[:directory_path]
+     def get_results(options={})
+       all_job_results = []
+       job_result_filenames = []
+       unfinished_subqueries = []
+       # check each job and collect its results
+       @jobs_in_progress.each do |job|
+         job_results = job.get_results(options)
+         all_job_results.push(job_results)
+         job_result_filenames += job_results[:filenames]
+         unfinished_subqueries.push(job_results[:unfinished_batches].map {|b| b.soql})
+         # if it's done, add it to done
+         if job_results[:unfinished_batches].empty?
+           @jobs_done.push(job)
+         end
+       end
+       return {
+         :filenames => job_result_filenames + @finished_batch_filenames,
+         :unfinished_subqueries => unfinished_subqueries,
+         :restarted_subqueries => @restarted_subqueries,
+         :results => all_job_results,
+         :done_jobs => @jobs_done
+       }
+     end
+
+     # Restart unfinished batches in all jobs in progress, creating new jobs;
+     # downloads results for finished batches
+     def get_result_or_restart(options={})
+       new_jobs = []
+       job_ids_to_remove = []
+       jobs_done = []
+
+       @jobs_in_progress.each do |job|
+         # get whatever is available; if it's not the right time yet, go on
+         available_results = job.get_available_results(options)
+         if available_results.nil?
+           next
+         end
+
+         unfinished_batches = available_results[:unfinished_batches]
+
+         # store the filenames and the restarted subqueries
+         @finished_batch_filenames += available_results[:filenames]
+         @restarted_subqueries += unfinished_batches.map {|b| b.soql}
+
+         unfinished_batches.each do |batch|
+           # for each unfinished batch create a new job and add it to new jobs
+           @logger.info "The following subquery didn't end in time: #{batch.soql}. Dividing into multiple and running again" if @logger
+           new_job = SalesforceBulkQuery::Job.new(@sobject, @connection, @logger)
+           new_job.create_job
+           new_job.generate_batches(@soql, batch.start, batch.stop)
+           new_job.close_job
+           new_jobs.push(new_job)
+         end
+         # the current job is to be removed from jobs in progress
+         job_ids_to_remove.push(job.job_id)
+         jobs_done.push(job)
+       end
+       # remove the finished jobs from progress and add the new ones
+       @jobs_in_progress.select! {|j| ! job_ids_to_remove.include?(j.job_id)}
+       @jobs_done += jobs_done
+
+       @jobs_in_progress += new_jobs
+     end
+   end
+ end
data/lib/salesforce_bulk_query/version.rb ADDED
@@ -0,0 +1,3 @@
+ module SalesforceBulkQuery
+   VERSION = '0.0.2'
+ end
data/lib/salesforce_bulk_query.rb ADDED
@@ -0,0 +1,115 @@
+ require 'salesforce_bulk_query/connection'
+ require 'salesforce_bulk_query/query'
+ require 'salesforce_bulk_query/logger'
+
+ # Module where everything happens
+ module SalesforceBulkQuery
+
+   # Abstracts the whole library; the class the user interacts with
+   class Api
+     @@DEFAULT_API_VERSION = '29.0'
+
+     # Constructor
+     # @param client [Restforce] An instance of the Restforce client that is used internally to access the Salesforce api
+     # @param options
+     def initialize(client, options={})
+       @logger = options[:logger]
+
+       api_version = options[:api_version] || @@DEFAULT_API_VERSION
+
+       # use our own logging middleware if a logger was passed
+       if @logger && client.respond_to?(:middleware)
+         client.middleware.use(SalesforceBulkQuery::Logger, @logger, options)
+       end
+
+       # initialize the connection
+       @connection = SalesforceBulkQuery::Connection.new(client, api_version, @logger, options[:filename_prefix])
+     end
+
+     # Get the Salesforce instance URL
+     def instance_url
+       # make sure it ends with /
+       url = @connection.client.instance_url
+       url += '/' if url[-1] != '/'
+       return url
+     end
+
+     CHECK_INTERVAL = 10
+     QUERY_TIME_LIMIT = 60 * 60 * 2 # two hours
+
+     # Query the Salesforce API. It's a blocking method - it waits until the query is resolved,
+     # which can take quite some time
+     # @param sobject Salesforce object, e.g. "Opportunity"
+     # @param soql SOQL query, e.g. "SELECT Name FROM Opportunity"
+     # @return hash with :filenames and other useful stuff
+     def query(sobject, soql, options={})
+       check_interval = options[:check_interval] || CHECK_INTERVAL
+       time_limit = options[:time_limit] || QUERY_TIME_LIMIT
+
+       start_time = Time.now
+
+       # start the machinery
+       query = start_query(sobject, soql, options)
+       results = nil
+
+       loop do
+         # check the status
+         status = query.check_status
+
+         # if finished, get the results and we're done
+         if status[:finished]
+           results = query.get_results(:directory_path => options[:directory_path])
+           @logger.info "Query finished. Results: #{results_to_string(results)}" if @logger
+           break
+         end
+
+         # if we've run out of the time limit, go away
+         if Time.now - start_time > time_limit
+           @logger.warn "Ran out of time limit, downloading what's available and terminating" if @logger
+
+           # download what's available
+           results = query.get_results(
+             :directory_path => options[:directory_path],
+           )
+
+           @logger.info "Downloaded the following files: #{results[:filenames]}. The following didn't finish in time: #{results[:unfinished_subqueries]}. Results: #{results_to_string(results)}" if @logger
+           break
+         end
+
+         # restart whatever needs to be restarted and sleep
+         query.get_result_or_restart(:directory_path => options[:directory_path])
+         @logger.info "Sleeping #{check_interval}" if @logger
+         sleep(check_interval)
+       end
+
+       return results
+     end
+
+     # Start the query (asynchronous, non-blocking method)
+     # @params see #query
+     # @return Query instance with the running query
+     def start_query(sobject, soql, options={})
+       # create the query, start it and return it
+       query = SalesforceBulkQuery::Query.new(sobject, soql, @connection, {:logger => @logger}.merge(options))
+       query.start
+       return query
+     end
+
+     private
+
+     # create a hash with just the fields we want to show in logs
+     def results_to_string(results)
+       return results.merge({
+         :results => results[:results].map do |r|
+           r.merge({
+             :unfinished_batches => r[:unfinished_batches].map do |b|
+               b.to_log
+             end
+           })
+         end,
+         :done_jobs => results[:done_jobs].map {|j| j.to_log}
+       })
+     end
+   end
+ end
data/salesforce_bulk_query.gemspec ADDED
@@ -0,0 +1,31 @@
+ # -*- encoding: utf-8 -*-
+ $:.push File.expand_path("../lib/", __FILE__)
+ require "salesforce_bulk_query/version"
+
+ Gem::Specification.new do |s|
+   s.name = 'salesforce_bulk_query'
+   s.version = SalesforceBulkQuery::VERSION
+   s.authors = ['Petr Cvengros']
+   s.email = ['petr.cvengros@gooddata.com']
+
+   s.required_ruby_version = '>= 1.9'
+
+   s.homepage = 'https://github.com/cvengros/salesforce_bulk_query'
+   s.summary = %q{Downloading data from Salesforce Bulk API made easy and scalable.}
+   s.description = %q{A library for downloading data from Salesforce Bulk API. We only focus on querying, other operations of the API aren't supported. Designed to handle a lot of data.}
+   s.license = 'BSD'
+
+   s.add_dependency 'json', '~> 1.8'
+   s.add_dependency 'xml-simple', '~> 1.1'
+
+   s.add_development_dependency 'multi_json', '~> 1.9'
+   s.add_development_dependency 'restforce', '~> 1.4'
+   s.add_development_dependency 'rspec', '~> 2.14'
+   s.add_development_dependency 'pry', '~> 0.9'
+
+
+   s.files = `git ls-files`.split($/)
+   s.require_paths = ['lib']
+
+   s.rubygems_version = "1.3.7"
+ end
data/spec/salesforce_bulk_query_spec.rb ADDED
@@ -0,0 +1,130 @@
+ require 'spec_helper'
+ require 'multi_json'
+ require 'restforce'
+ require 'csv'
+ require 'tmpdir'
+ require 'logger'
+
+ LOGGING = false
+
+ describe SalesforceBulkQuery do
+
+   before :all do
+     auth = MultiJson.load(File.read('test_salesforce_credentials.json'), :symbolize_keys => true)
+
+     @client = Restforce.new(
+       :username => auth[:username],
+       :password => auth[:password],
+       :security_token => auth[:token],
+       :client_id => auth[:client_id],
+       :client_secret => auth[:client_secret],
+       :api_version => '30.0'
+     )
+     @api = SalesforceBulkQuery::Api.new(@client,
+       :api_version => '30.0',
+       :logger => LOGGING ? Logger.new(STDOUT) : nil
+     )
+
+     # switch off the normal logging
+     Restforce.log = false
+   end
+
+   describe "instance_url" do
+     it "gives you some reasonable url" do
+       url = @api.instance_url
+       url.should_not be_empty
+       url.should match(/salesforce\.com\//)
+     end
+   end
+
+   describe "query" do
+     context "when you give it no options" do
+       it "downloads the data to a few files", :constraint => 'slow' do
+         result = @api.query("Opportunity", "SELECT Id, Name FROM Opportunity")
+         result[:filenames].should have_at_least(2).items
+         result[:results].should_not be_empty
+         result[:done_jobs].should_not be_empty
+
+         result[:filenames].each do |filename|
+           File.size?(filename).should be_true
+
+           lines = CSV.read(filename)
+
+           if lines.length > 1
+             # the first line should be the header
+             lines[0].should eql(["Id", "Name"])
+
+             # the first id shouldn't be empty
+             lines[1][0].should_not be_empty
+           end
+         end
+       end
+     end
+     context "when you give it all the options" do
+       it "downloads a single file" do
+         tmp = Dir.mktmpdir
+         frm = "2000-01-01"
+         from = "#{frm}T00:00:00.000Z"
+         t = "2020-01-01"
+         to = "#{t}T00:00:00.000Z"
+         result = @api.query(
+           "Account",
+           "SELECT Id, Name, Industry, Type FROM Account",
+           :check_interval => 30,
+           :directory_path => tmp,
+           :created_from => from,
+           :created_to => to,
+           :single_batch => true
+         )
+
+         result[:filenames].should have(1).items
+         result[:results].should_not be_empty
+         result[:done_jobs].should_not be_empty
+
+         filename = result[:filenames][0]
+
+         File.size?(filename).should be_true
+         lines = CSV.read(filename)
+
+         # the first line should be the header
+         lines[0].should eql(["Id", "Name", "Industry", "Type"])
+
+         # the first id shouldn't be empty
+         lines[1][0].should_not be_empty
+
+         filename.should match(tmp)
+         filename.should match(frm)
+         filename.should match(t)
+       end
+     end
+     context "when you give it a short time limit" do
+       it "downloads just a few files" do
+         result = @api.query(
+           "Task",
+           "SELECT Id, Name, CreatedDate FROM Task",
+           :time_limit => 30
+         )
+         result[:results].should_not be_empty
+       end
+     end
+   end
+
+   describe "start_query" do
+     it "starts a query that finishes some time later" do
+       query = @api.start_query("Opportunity", "SELECT Id, Name, CreatedDate FROM Opportunity")
+
+       # get a coffee
+       sleep(40)
+
+       # check the status
+       status = query.check_status
+       if status[:finished]
+         result = query.get_results
+         result[:filenames].should have_at_least(2).items
+         result[:results].should_not be_empty
+         result[:done_jobs].should_not be_empty
+       end
+     end
+
+   end
+ end
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,6 @@
+ require 'salesforce_bulk_query'
+
+ RSpec.configure do |c|
+   c.filter_run :focus => true
+   c.run_all_when_everything_filtered = true
+ end
metadata ADDED
@@ -0,0 +1,145 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: salesforce_bulk_query
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.0.2
5
+ platform: ruby
6
+ authors:
7
+ - Petr Cvengros
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2014-06-02 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: json
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.8'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.8'
27
+ - !ruby/object:Gem::Dependency
28
+ name: xml-simple
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '1.1'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '1.1'
41
+ - !ruby/object:Gem::Dependency
42
+ name: multi_json
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: '1.9'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: '1.9'
55
+ - !ruby/object:Gem::Dependency
56
+ name: restforce
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: '1.4'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - "~>"
67
+ - !ruby/object:Gem::Version
68
+ version: '1.4'
69
+ - !ruby/object:Gem::Dependency
70
+ name: rspec
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: '2.14'
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: '2.14'
83
+ - !ruby/object:Gem::Dependency
84
+ name: pry
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - "~>"
88
+ - !ruby/object:Gem::Version
89
+ version: '0.9'
90
+ type: :development
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - "~>"
95
+ - !ruby/object:Gem::Version
96
+ version: '0.9'
97
+ description: A library for downloading data from Salesforce Bulk API. We only focus
98
+ on querying, other operations of the API aren't supported. Designed to handle a
99
+ lot of data.
100
+ email:
101
+ - petr.cvengros@gooddata.com
102
+ executables: []
103
+ extensions: []
104
+ extra_rdoc_files: []
105
+ files:
106
+ - ".gitignore"
107
+ - Gemfile
108
+ - LICENSE
109
+ - README.md
110
+ - example_test_salesforce_credentials.json
111
+ - lib/salesforce_bulk_query.rb
112
+ - lib/salesforce_bulk_query/batch.rb
113
+ - lib/salesforce_bulk_query/connection.rb
114
+ - lib/salesforce_bulk_query/job.rb
115
+ - lib/salesforce_bulk_query/logger.rb
116
+ - lib/salesforce_bulk_query/query.rb
117
+ - lib/salesforce_bulk_query/version.rb
118
+ - salesforce_bulk_query.gemspec
119
+ - spec/salesforce_bulk_query_spec.rb
120
+ - spec/spec_helper.rb
121
+ homepage: https://github.com/cvengros/salesforce_bulk_query
122
+ licenses:
123
+ - BSD
124
+ metadata: {}
125
+ post_install_message:
126
+ rdoc_options: []
127
+ require_paths:
128
+ - lib
129
+ required_ruby_version: !ruby/object:Gem::Requirement
130
+ requirements:
131
+ - - ">="
132
+ - !ruby/object:Gem::Version
133
+ version: '1.9'
134
+ required_rubygems_version: !ruby/object:Gem::Requirement
135
+ requirements:
136
+ - - ">="
137
+ - !ruby/object:Gem::Version
138
+ version: '0'
139
+ requirements: []
140
+ rubyforge_project:
141
+ rubygems_version: 2.2.2
142
+ signing_key:
143
+ specification_version: 4
144
+ summary: Downloading data from Salesforce Bulk API made easy and scalable.
145
+ test_files: []