sidekiq-iteration 0.1.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 82316cffa840b2c9619792b6f0c5bb7ec696f964ad81edf6b0ed5861339ca064
- data.tar.gz: 8337c0e87e6be8858d5b9d868c8a04b05f91de315be2e9aca9c40e8447c78644
+ metadata.gz: 40efca13e06cd7fdcfc1ff59ad08fea8fc731ee1b5560ae5b75b2591379bcb63
+ data.tar.gz: eec2991b40bb67ffc1dcea55f1c0e8acc98a18cd59ad2fd117cd80c1a94c3e79
  SHA512:
- metadata.gz: 63712780bca873613cbe3ef89ff0037c3eaa5633a28382eb02427311360714a89f092e5efef29873efa75067d22f32745eea3f04d7606442e29da33e2e2e6a08
- data.tar.gz: 4ded7fc6ab772c019154e6559027c87d389a356a16915d2c869e318fb26f5339dd98355a0f1eb422358c1cd9f7fe31bc2a70d5627a577f8f45394db15d0593b7
+ metadata.gz: 1162ffafc4d157e7a8f9d2b8f69163e90e83431daac129707e16605a9b0df250c1cf2dd9063f651782b01ab0dcd9a0f3848d381981f94ef5a2daf36f43d591be
+ data.tar.gz: 45a1efa4e1e65ae322b7c923cb5795ceda901863afc4d135cb25e1b342c9604e7f90719259b7fd93b8c38b15a5bcb4cab984617ea915fb58055f21936470269c
data/CHANGELOG.md CHANGED
@@ -1,5 +1,28 @@
  ## master (unreleased)
 
+ ## 0.3.0 (2023-05-20)
+
+ - Allow a default retry backoff to be configured
+
+ ```ruby
+ SidekiqIteration.default_retry_backoff = 10.seconds
+ ```
+
+ - Add ability to iterate Active Record enumerators in reverse order
+
+ ```ruby
+ active_record_records_enumerator(User.all, order: :desc)
+ ```
+
+ ## 0.2.0 (2022-11-11)
+
+ - Fix storing run metadata when the job fails for sidekiq < 6.5.2
+
+ - Make enumerators resume from the last cursor position
+
+ This fixes `NestedEnumerator` to work correctly. Previously, each intermediate enumerator
+ was resumed from the next cursor position, possibly skipping remaining inner items.
+
  ## 0.1.0 (2022-11-02)
 
  - First release
data/README.md CHANGED
@@ -2,7 +2,7 @@
2
2
 
3
3
  [![Build Status](https://github.com/fatkodima/sidekiq-iteration/actions/workflows/ci.yml/badge.svg?branch=master)](https://github.com/fatkodima/sidekiq-iteration/actions/workflows/ci.yml)
4
4
 
5
- Meet Iteration, an extension for [Sidekiq](https://github.com/mperham/sidekiq) that makes your jobs interruptible and resumable, saving all progress that the job has made (aka checkpoint for jobs).
5
+ Meet Iteration, an extension for [Sidekiq](https://github.com/mperham/sidekiq) that makes your long-running jobs interruptible and resumable, saving all progress that the job has made (aka checkpoint for jobs).
6
6
 
7
7
  ## Background
8
8
 
@@ -33,7 +33,7 @@ Software that is designed for high availability [must be resilient](https://12fa
33
33
  - Ruby 2.7+ (if you need support for older ruby, [open an issue](https://github.com/fatkodima/sidekiq-iteration/issues/new))
34
34
  - Sidekiq 6+
35
35
 
36
- ## Getting started
36
+ ## Installation
37
37
 
38
38
  Add this line to your application's Gemfile:
39
39
 
@@ -45,6 +45,8 @@ And then execute:
45
45
 
46
46
  $ bundle
47
47
 
48
+ ## Getting started
49
+
48
50
  In the job, include `SidekiqIteration::Iteration` module and start describing the job with two methods (`build_enumerator` and `each_iteration`) instead of `perform`:
49
51
 
50
52
  ```ruby
@@ -136,10 +138,10 @@ class BatchesJob
136
138
  end
137
139
  ```
138
140
 
139
- ### Iterating over batches of Active Record Relations
141
+ ### Iterating over Active Record Relations
140
142
 
141
143
  ```ruby
142
- class BatchesAsRelationJob
144
+ class RelationsJob
143
145
  include Sidekiq::Job
144
146
  include SidekiqIteration::Iteration
145
147
 
@@ -151,14 +153,14 @@ class BatchesAsRelationJob
151
153
  )
152
154
  end
153
155
 
154
- def each_iteration(batch_of_comments, product_id)
155
- # batch_of_comments will be a Comment::ActiveRecord_Relation
156
- batch_of_comments.update_all(deleted: true)
156
+ def each_iteration(comments_relation, product_id)
157
+ # comments_relation will be a Comment::ActiveRecord_Relation
158
+ comments_relation.update_all(deleted: true)
157
159
  end
158
160
  end
159
161
  ```
160
162
 
161
- ### Iterating over arrays
163
+ ### Iterating over arbitrary arrays
162
164
 
163
165
  ```ruby
164
166
  class ArrayJob
@@ -184,10 +186,10 @@ class CsvJob
184
186
 
185
187
  def build_enumerator(import_id, cursor:)
186
188
  import = Import.find(import_id)
187
- csv_enumereator(import.csv, cursor: cursor)
189
+ csv_enumerator(import.csv, cursor: cursor)
188
190
  end
189
191
 
190
- def each_iteration(csv_row)
192
+ def each_iteration(csv_row, import_id)
191
193
  # insert csv_row to database
192
194
  end
193
195
  end
@@ -220,6 +222,7 @@ end
220
222
  ## Guides
221
223
 
222
224
  * [Iteration: how it works](guides/iteration-how-it-works.md)
225
+ * [Job argument semantics](guides/argument-semantics.md)
223
226
  * [Best practices](guides/best-practices.md)
224
227
  * [Writing custom enumerator](guides/custom-enumerator.md)
225
228
  * [Throttling](guides/throttling.md)
@@ -228,10 +231,15 @@ For more detailed documentation, see [rubydoc](https://rubydoc.info/gems/sidekiq
228
231
 
229
232
  ## API
230
233
 
231
- Iteration job must respond to `build_enumerator` and `each_iteration` methods. `build_enumerator` must return [`Enumerator`](https://ruby-doc.org/core-3.1.2/Enumerator.htmll) object that respects the `cursor` value.
234
+ Iteration job must respond to `build_enumerator` and `each_iteration` methods. `build_enumerator` must return [`Enumerator`](https://ruby-doc.org/core-3.1.2/Enumerator.html) object that respects the `cursor` value.
232
235
 
233
236
  ## FAQ
234
237
 
238
+ **Advantages of this pattern over splitting a large job into many small jobs?**
239
+ * Having one job is easier for redis in terms of memory, time and # of requests needed for enqueuing.
240
+ * It simplifies sidekiq monitoring, because you have a predictable number of jobs in the queues, instead of having thousands of them at one time and millions at another. Also easier to navigate its web UI.
241
+ * You can stop/pause/delete just one job, if something goes wrong. With many jobs it is harder and can take a long time, if it is critical to stop it right now.
242
+
235
243
  **Why can't I just iterate in `#perform` method and do whatever I want?** You can, but then your job has to comply with a long list of requirements, such as the ones above. This creates leaky abstractions more easily, when instead we can expose a more powerful abstraction for developers without exposing the underlying infrastructure.
236
244
 
237
245
  **What happens when my job is interrupted?** A checkpoint will be persisted to Redis after the current `each_iteration`, and the job will be re-enqueued. Once it's popped off the queue, the worker will work off from the next iteration.
@@ -0,0 +1,130 @@
1
+ # Argument Semantics
2
+
3
+ `sidekiq-iteration` defines the `perform` method, required by `sidekiq`, to allow for iteration.
4
+
5
+ The call sequence is usually 3 methods:
6
+
7
+ `perform -> build_enumerator -> each_iteration`
8
+
9
+ In that sense `sidekiq-iteration` works like a framework (it calls your code) rather than like a library (that you call). When using jobs with parameters, the following rules of thumb are good to keep in mind.
10
+
11
+ ## Jobs without arguments
12
+
13
+ Jobs without arguments do not pass anything into either `build_enumerator` or `each_iteration` except for the `cursor` which `sidekiq-iteration` persists by itself:
14
+
15
+ ```ruby
16
+ class ArglessJob
17
+ include Sidekiq::Job
18
+ include SidekiqIteration::Iteration
19
+
20
+ def build_enumerator(cursor:)
21
+ # ...
22
+ end
23
+
24
+ def each_iteration(single_object_yielded_from_enumerator)
25
+ # ...
26
+ end
27
+ end
28
+ ```
29
+
30
+ To enqueue the job:
31
+
32
+ ```ruby
33
+ ArglessJob.perform_async
34
+ ```
35
+
36
+ ## Jobs with positional arguments
37
+
38
+ Jobs with positional arguments will have those arguments available to both `build_enumerator` and `each_iteration`:
39
+
40
+ ```ruby
41
+ class ArgumentativeJob
42
+ include Sidekiq::Job
43
+ include SidekiqIteration::Iteration
44
+
45
+ def build_enumerator(arg1, arg2, arg3, cursor:)
46
+ # ...
47
+ end
48
+
49
+ def each_iteration(single_object_yielded_from_enumerator, arg1, arg2, arg3)
50
+ # ...
51
+ end
52
+ end
53
+ ```
54
+
55
+ To enqueue the job:
56
+
57
+ ```ruby
58
+ ArgumentativeJob.perform_async(_arg1 = "One", _arg2 = "Two", _arg3 = "Three")
59
+ ```
60
+
61
+ ## Jobs with keyword arguments
62
+
63
+ Jobs with keyword arguments will have the keyword arguments available to both `build_enumerator` and `each_iteration`, but these arguments come packaged into a Hash in both cases. You will need to `fetch` or `[]` your parameter from the `Hash` you get passed in:
64
+
65
+ ```ruby
66
+ class ParameterizedJob
67
+ include Sidekiq::Job
68
+ include SidekiqIteration::Iteration
69
+
70
+ def build_enumerator(kwargs, cursor:)
71
+ name = kwargs.fetch("name")
72
+ email = kwargs.fetch("email")
73
+ # ...
74
+ end
75
+
76
+ def each_iteration(object_yielded_from_enumerator, kwargs)
77
+ name = kwargs.fetch("name")
78
+ email = kwargs.fetch("email")
79
+ # ...
80
+ end
81
+ end
82
+ ```
83
+
84
+ To enqueue the job:
85
+
86
+ ```ruby
87
+ ParameterizedJob.perform_async("name" => "Jane", "email" => "jane@host.example")
88
+ ```
89
+
90
+ ## Jobs with both positional and keyword arguments
91
+
92
+ Jobs with keyword arguments will have the keyword arguments available to both `build_enumerator` and `each_iteration`, but these arguments come packaged into a Hash in both cases. You will need to `fetch` or `[]` your parameter from the `Hash` you get passed in. Positional arguments get passed first and "unsplatted" (not combined into an array), the `Hash` containing keyword arguments comes after:
93
+
94
+ ```ruby
95
+ class HighlyConfigurableGreetingJob
96
+ include Sidekiq::Job
97
+ include SidekiqIteration::Iteration
98
+
99
+ def build_enumerator(subject_line, kwargs, cursor:)
100
+ name = kwargs.fetch("sender_name")
101
+ email = kwargs.fetch("sender_email")
102
+ # ...
103
+ end
104
+
105
+ def each_iteration(object_yielded_from_enumerator, subject_line, kwargs)
106
+ name = kwargs.fetch("sender_name")
107
+ email = kwargs.fetch("sender_email")
108
+ # ...
109
+ end
110
+ end
111
+ ```
112
+
113
+ To enqueue the job:
114
+
115
+ ```ruby
116
+ HighlyConfigurableGreetingJob.perform_async(_subject_line = "Greetings everybody!", "sender_name" => "Jane", "sender_email" => "jane@host.example")
117
+ ```
118
+
119
+ ## Returning (yielding) from enumerators
120
+
121
+ When defining a custom enumerator (see the [custom enumerator guide](custom-enumerator.md)) you need to yield two positional arguments from it: the object that will be the value for the current iteration (like a single ActiveModel instance, a single number...) and the value you want to be persisted as the `cursor` value should `sidekiq-iteration` decide to interrupt you after this iteration. Calling the enumerator with that cursor should return the next object after the one returned in this iteration. That new `cursor` value does not get passed to `each_iteration`:
122
+
123
+ ```ruby
124
+ Enumerator.new do |yielder|
125
+ # In this case `cursor` is an Integer
126
+ cursor.upto(99999) do |offset|
127
+ yielder.yield(fetch_record_at(offset), offset)
128
+ end
129
+ end
130
+ ```
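As a rough illustration of the rule above, the following self-contained sketch (plain Ruby, no Sidekiq required) builds an enumerator that yields an object together with its cursor, then rebuilds it from a saved cursor. The `records` array and the `build_offset_enumerator` helper are hypothetical names invented for this example. Resuming at `cursor + 1` is an assumption that follows the sentence above ("return the next object after the one returned in this iteration"); it is not taken from the gem's source.

```ruby
# Builds an enumerator that yields [object, cursor] pairs.
# `records` is a hypothetical in-memory stand-in for `fetch_record_at(offset)`.
# A nil cursor means the job has never run; otherwise we start right after
# the offset that was persisted (an assumption, see the note above).
def build_offset_enumerator(records, cursor)
  start = cursor.nil? ? 0 : cursor + 1

  Enumerator.new do |yielder|
    start.upto(records.size - 1) do |offset|
      # First value: the object for this iteration. Second value: the cursor.
      yielder.yield(records[offset], offset)
    end
  end
end

records = %w[alpha beta gamma delta]

# Simulate a run that is interrupted after two iterations.
first_run = build_offset_enumerator(records, nil).take(2)
saved_cursor = first_run.last.last

# Simulate the resumed run, built from the persisted cursor.
resumed_run = build_offset_enumerator(records, saved_cursor).to_a
```

Only the first yielded value would reach `each_iteration`; the second exists purely so the run can be picked up again later.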
@@ -1,5 +1,10 @@
1
1
  # Best practices
2
2
 
3
+ ## Considerations when writing jobs
4
+
5
+ * Duration of `#each_iteration`: processing a single element from the enumerator built in `#build_enumerator` should take less than 25 seconds, or the duration set as a timeout for Sidekiq. This allows the job to be safely interrupted and resumed.
6
+ * Idempotency of `#each_iteration`: it should be safe to run `#each_iteration` multiple times for the same element from the enumerator. Read more in [this Sidekiq best practice](https://github.com/mperham/sidekiq/wiki/Best-Practices#2-make-your-job-idempotent-and-transactional). It's important if the job errors and you run it again, because the same element that errored the job may be processed again. It especially matters in the situation described above, when the iteration duration exceeds the timeout: if the job is re-enqueued, multiple elements may be processed again.
7
+
3
8
  ## Batch iteration
4
9
 
5
10
  Regardless of the active record enumerator used in the task, `sidekiq-iteration` gem loads records in batches of 100 (by default).
@@ -2,6 +2,17 @@
2
2
 
3
3
  Iteration leverages the [`Enumerator`](https://ruby-doc.org/core-3.1.2/Enumerator.html) pattern from the Ruby standard library, which allows us to use almost any resource as a collection to iterate.
4
4
 
5
+ Before writing an enumerator, it is important to understand [how Iteration works](iteration-how-it-works.md) and how
6
+ your enumerator will be used by it. An enumerator must `yield` two things in the following order as positional
7
+ arguments:
8
+ - An object to be processed in a job `each_iteration` method
9
+ - A cursor position, which Iteration will persist if `each_iteration` returns successfully and the job is forced to shut
10
+ down. It can be any data type your job backend can serialize and deserialize correctly.
11
+
12
+ A job that includes Iteration is first started with `nil` as the cursor. When resuming an interrupted job, Iteration
13
+ will deserialize the persisted cursor and pass it to the job's `build_enumerator` method, which your enumerator uses to
14
+ find objects that come _after_ the last successfully processed object.
15
+
5
16
  ## Cursorless Enumerator
6
17
 
7
18
  Consider a custom Enumerator that takes items from a Redis list. Because a Redis list is essentially a queue, we can ignore the cursor:
@@ -23,7 +34,7 @@ class ListJob
23
34
  end
24
35
  end
25
36
 
26
- def each_iteration(item)
37
+ def each_iteration(item_from_redis)
27
38
  # ...
28
39
  end
29
40
  end
@@ -31,14 +42,15 @@ end
31
42
 
32
43
  ## Enumerator with cursor
33
44
 
34
- But what about iterating based on a cursor? Consider this Enumerator that wraps third party API (Stripe) for paginated iteration:
45
+ For a more complex example, consider this Enumerator that wraps a third party API (Stripe) for paginated iteration and
46
+ stores a string as the cursor position:
35
47
 
36
48
  ```ruby
37
49
  class StripeListEnumerator
38
50
  # @param resource [Stripe::APIResource] The type of Stripe object to request
39
51
  # @param params [Hash] Query parameters for the request
40
52
  # @param options [Hash] Request options, such as API key or version
41
- # @param cursor [String]
53
+ # @param cursor [nil, String] The Stripe ID of the last item iterated over
42
54
  def initialize(resource, params: {}, options: {}, cursor:)
43
55
  pagination_params = {}
44
56
  pagination_params[:starting_after] = cursor unless cursor.nil?
@@ -59,6 +71,9 @@ class StripeListEnumerator
59
71
  def each
60
72
  loop do
61
73
  @list.each do |item, _index|
74
+ # The first argument is what gets passed to `each_iteration`.
75
+ # The second argument (item.id) is going to be persisted as the cursor,
76
+ # it doesn't get passed to `each_iteration`.
62
77
  yield item, item.id
63
78
  end
64
79
 
@@ -71,26 +86,38 @@ class StripeListEnumerator
71
86
  end
72
87
  ```
73
88
 
89
+ Here we leverage the Stripe cursor pagination where the cursor is an ID of a specific item in the collection. The job
90
+ which uses such an `Enumerator` would then look like so:
91
+
74
92
  ```ruby
75
- class StripeJob
93
+ class LoadRefundsForChargeJob
76
94
  include Sidekiq::Job
77
95
  include SidekiqIteration::Iteration
78
96
 
79
- def build_enumerator(params, cursor:)
97
+ def build_enumerator(charge_id, cursor:)
80
98
  StripeListEnumerator.new(
81
99
  Stripe::Refund,
82
- params: { charge: "ch_123" },
100
+ params: { charge: charge_id }, # "charge_id" will be a prefixed Stripe ID such as "chrg_123"
83
101
  options: { api_key: "sk_test_123", stripe_version: "2018-01-18" },
84
102
  cursor: cursor
85
103
  ).to_enumerator
86
104
  end
87
105
 
88
- def each_iteration(stripe_refund, _params)
106
+ # Note that in this case `each_iteration` will only receive one positional argument per iteration.
107
+ # If what your enumerator yields is a composite object you will need to unpack it yourself
108
+ # inside the `each_iteration`.
109
+ def each_iteration(stripe_refund, charge_id)
89
110
  # ...
90
111
  end
91
112
  end
92
113
  ```
93
114
 
115
+ and you initiate the job with
116
+
117
+ ```ruby
118
+ LoadRefundsForChargeJob.perform_async(_charge_id = "chrg_345")
119
+ ```
120
+
94
121
  ## Notes
95
122
 
96
123
  We recommend that you read the implementation of the other enumerators that come with the library (`CsvEnumerator`, `ActiveRecordEnumerator`) to gain a better understanding of building Enumerator objects.
@@ -1,28 +1,51 @@
1
1
  # frozen_string_literal: true
2
2
 
3
- require_relative "active_record_cursor"
4
-
5
3
  module SidekiqIteration
6
- # Builds Enumerator based on ActiveRecord Relation. Supports enumerating on rows and batches.
7
4
  # @private
8
5
  class ActiveRecordEnumerator
9
- SQL_DATETIME_WITH_NSEC = "%Y-%m-%d %H:%M:%S.%N"
6
+ SQL_DATETIME_WITH_NSEC = "%Y-%m-%d %H:%M:%S.%6N"
10
7
 
11
- def initialize(relation, columns: nil, batch_size: 100, cursor: nil)
8
+ def initialize(relation, columns: nil, batch_size: 100, order: :asc, cursor: nil)
12
9
  unless relation.is_a?(ActiveRecord::Relation)
13
10
  raise ArgumentError, "relation must be an ActiveRecord::Relation"
14
11
  end
15
12
 
16
- @relation = relation
13
+ unless order == :asc || order == :desc
14
+ raise ArgumentError, ":order must be :asc or :desc, got #{order.inspect}"
15
+ end
16
+
17
+ @primary_key = "#{relation.table_name}.#{relation.primary_key}"
18
+ @columns = Array(columns&.map(&:to_s) || @primary_key)
19
+ @primary_key_index = @columns.index(@primary_key) || @columns.index(relation.primary_key)
20
+ @pluck_columns = if @primary_key_index
21
+ @columns
22
+ else
23
+ @columns + [@primary_key]
24
+ end
17
25
  @batch_size = batch_size
18
- @columns = Array(columns || "#{relation.table_name}.#{relation.primary_key}")
19
- @cursor = cursor
26
+ @order = order
27
+ @cursor = Array.wrap(cursor)
28
+ raise ArgumentError, "Must specify at least one column" if @columns.empty?
29
+ if relation.joins_values.present? && !@columns.all?(/\./)
30
+ raise ArgumentError, "You need to specify fully-qualified columns if you join a table"
31
+ end
32
+
33
+ if relation.arel.orders.present? || relation.arel.taken.present?
34
+ raise ArgumentError,
35
+ "The relation cannot use ORDER BY or LIMIT due to the way how iteration with a cursor is designed. " \
36
+ "You can use other ways to limit the number of rows, e.g. a WHERE condition on the primary key column."
37
+ end
38
+
39
+ ordering = @columns.to_h { |column| [column, @order] }
40
+ @base_relation = relation.reorder(ordering)
41
+ @iteration_count = 0
20
42
  end
21
43
 
22
44
  def records
23
- Enumerator.new(-> { size }) do |yielder|
45
+ Enumerator.new(-> { records_size }) do |yielder|
24
46
  batches.each do |batch, _|
25
47
  batch.each do |record|
48
+ @iteration_count += 1
26
49
  yielder.yield(record, cursor_value(record))
27
50
  end
28
51
  end
@@ -30,40 +53,146 @@ module SidekiqIteration
30
53
  end
31
54
 
32
55
  def batches
33
- cursor = ActiveRecordCursor.new(@relation, @columns, @cursor)
34
- Enumerator.new(-> { size }) do |yielder|
35
- while (records = cursor.next_batch(@batch_size))
36
- yielder.yield(records, cursor_value(records.last)) if records.any?
56
+ Enumerator.new(-> { records_size }) do |yielder|
57
+ while (batch = next_batch(load: true))
58
+ @iteration_count += 1
59
+ yielder.yield(batch, cursor_value(batch.last))
37
60
  end
38
61
  end
39
62
  end
40
63
 
41
- def size
42
- @relation.count(:all)
64
+ def relations
65
+ Enumerator.new(-> { relations_size }) do |yielder|
66
+ while (batch = next_batch(load: false))
67
+ @iteration_count += 1
68
+ yielder.yield(batch, unwrap_array(@cursor))
69
+ end
70
+ end
43
71
  end
44
72
 
45
73
  private
74
+ def records_size
75
+ @base_relation.count(:all)
76
+ end
77
+
78
+ def relations_size
79
+ (records_size + @batch_size - 1) / @batch_size # ceiling division
80
+ end
81
+
82
+ def next_batch(load:)
83
+ batch_relation = @base_relation.limit(@batch_size)
84
+ if conditions.any?
85
+ batch_relation = batch_relation.where(*conditions)
86
+ end
87
+
88
+ records = nil
89
+ cursor_values, ids = batch_relation.uncached do
90
+ if load
91
+ records = batch_relation.records
92
+ pluck_columns(records)
93
+ else
94
+ pluck_columns(batch_relation)
95
+ end
96
+ end
97
+
98
+ cursor = cursor_values.last
99
+ return unless cursor.present?
100
+
101
+ # The primary key was plucked, but original cursor did not include it, so we should remove it
102
+ cursor.pop unless @primary_key_index
103
+ @cursor = Array.wrap(cursor)
104
+
105
+ # Yields relations by selecting the primary keys of records in the batch.
106
+ # Post.where(published: nil) results in an enumerator of relations like:
107
+ # Post.where(published: nil, ids: batch_of_ids)
108
+ relation = @base_relation.where(@primary_key => ids)
109
+ relation.send(:load_records, records) if load
110
+ relation
111
+ end
112
+
113
+ def pluck_columns(batch)
114
+ columns =
115
+ if batch.is_a?(Array)
116
+ @pluck_columns.map { |column| column.to_s.split(".").last }
117
+ else
118
+ @pluck_columns
119
+ end
120
+
121
+ if columns.size == 1 # only the primary key
122
+ column_values = batch.pluck(columns.first)
123
+ return [column_values, column_values]
124
+ end
125
+
126
+ column_values = batch.pluck(*columns)
127
+ primary_key_index = @primary_key_index || -1
128
+ primary_key_values = column_values.map { |values| values[primary_key_index] }
129
+
130
+ serialize_column_values!(column_values)
131
+ [column_values, primary_key_values]
132
+ end
133
+
46
134
  def cursor_value(record)
47
135
  positions = @columns.map do |column|
48
136
  attribute_name = column.to_s.split(".").last
49
- column_value(record, attribute_name)
137
+ column_value(record[attribute_name])
138
+ end
139
+
140
+ unwrap_array(positions)
141
+ end
142
+
143
+ def conditions
144
+ return [] if @cursor.empty?
145
+
146
+ binds = []
147
+ sql = build_starts_after_conditions(0, binds)
148
+
149
+ # Start from the record pointed by cursor.
150
+ # We use the property that `>=` is equivalent to `> or =`.
151
+ if @iteration_count == 0
152
+ binds.unshift(*@cursor)
153
+ columns_equality = @columns.map { |column| "#{column} = ?" }.join(" AND ")
154
+ sql = "(#{columns_equality}) OR (#{sql})"
50
155
  end
51
156
 
52
- if positions.size == 1
53
- positions.first
157
+ [sql, *binds]
158
+ end
159
+
160
+ # (x, y) > (a, b) iff (x > a or (x = a and y > b))
161
+ # (x, y) < (a, b) iff (x < a or (x = a and y < b))
162
+ def build_starts_after_conditions(index, binds)
163
+ column = @columns[index]
164
+
165
+ if index < @cursor.size - 1
166
+ binds << @cursor[index] << @cursor[index]
167
+ "#{column} #{@order == :asc ? '>' : '<'} ? OR (#{column} = ? AND (#{build_starts_after_conditions(index + 1, binds)}))"
54
168
  else
55
- positions
169
+ binds << @cursor[index]
170
+ if @columns.size == @cursor.size
171
+ @order == :asc ? "#{column} > ?" : "#{column} < ?"
172
+ else
173
+ @order == :asc ? "#{column} >= ?" : "#{column} <= ?"
174
+ end
56
175
  end
57
176
  end
58
177
 
59
- def column_value(record, attribute)
60
- value = record.read_attribute(attribute.to_sym)
61
- case record.class.columns_hash.fetch(attribute).type
62
- when :datetime
178
+ def serialize_column_values!(column_values)
179
+ column_values.map! { |values| values.map! { |value| column_value(value) } }
180
+ end
181
+
182
+ def column_value(value)
183
+ if value.is_a?(Time)
63
184
  value.strftime(SQL_DATETIME_WITH_NSEC)
64
185
  else
65
186
  value
66
187
  end
67
188
  end
189
+
190
+ def unwrap_array(array)
191
+ if array.size == 1
192
+ array.first
193
+ else
194
+ array
195
+ end
196
+ end
68
197
  end
69
198
  end
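To make the recursive `build_starts_after_conditions` method in the hunk above easier to follow, here is a simplified, standalone sketch of the same idea. It is an approximation and not the gem's code: the `starts_after_conditions` name and its argument list are invented for this example, and it only covers the case where the cursor holds one value per column (the `>=` / `<=` branch is left out).

```ruby
# Simplified re-creation of the recursive builder shown in the diff.
# Returns [sql_fragment, binds] for "rows strictly after the cursor".
def starts_after_conditions(columns, cursor, order, index = 0, binds = [])
  column = columns[index]
  operator = order == :asc ? ">" : "<"

  if index < cursor.size - 1
    # This column is bound twice: once for the strict comparison, once for the tie.
    binds << cursor[index] << cursor[index]
    inner_sql, = starts_after_conditions(columns, cursor, order, index + 1, binds)
    ["#{column} #{operator} ? OR (#{column} = ? AND (#{inner_sql}))", binds]
  else
    # Last column: a plain strict comparison.
    binds << cursor[index]
    ["#{column} #{operator} ?", binds]
  end
end

columns = ["posts.created_at", "posts.id"]
cursor = ["2023-05-20 10:00:00.000000", 42]

asc_sql, asc_binds = starts_after_conditions(columns, cursor, :asc)
desc_sql, = starts_after_conditions(columns, cursor, :desc)

puts asc_sql
puts desc_sql
```

With `:desc` the comparison operators flip from `>` to `<`, which is how the same cursor logic can serve the new `order:` option.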
@@ -49,7 +49,7 @@ module SidekiqIteration
49
49
  def rows(cursor:)
50
50
  @csv.lazy
51
51
  .each_with_index
52
- .drop(count_of_processed_rows(cursor))
52
+ .drop(cursor || 0)
53
53
  .to_enum { count_of_rows_in_file }
54
54
  end
55
55
 
@@ -60,7 +60,7 @@ module SidekiqIteration
60
60
  @csv.lazy
61
61
  .each_slice(batch_size)
62
62
  .with_index
63
- .drop(count_of_processed_rows(cursor))
63
+ .drop(cursor || 0)
64
64
  .to_enum { (count_of_rows_in_file.to_f / batch_size).ceil }
65
65
  end
66
66
 
@@ -73,13 +73,5 @@ module SidekiqIteration
73
73
  count -= 1 if @csv.headers
74
74
  count
75
75
  end
76
-
77
- def count_of_processed_rows(cursor)
78
- if cursor
79
- cursor + 1
80
- else
81
- 0
82
- end
83
- end
84
76
  end
85
77
  end
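The hunks above replace `count_of_processed_rows(cursor)`, which returned `cursor + 1`, with `cursor || 0`. The sketch below shows only what that expression does to an enumeration, using a plain array in place of a CSV object. The `rows_from` helper is a hypothetical name for this example. How `Iteration` itself treats the row at the cursor is not visible in the hunks shown here.

```ruby
# Mirrors the `.lazy.each_with_index.drop(cursor || 0)` chain from the diff.
def rows_from(rows, cursor)
  rows.lazy.each_with_index.drop(cursor || 0).to_a
end

rows = %w[r0 r1 r2 r3]

fresh = rows_from(rows, nil) # no cursor yet: nothing is dropped
resumed = rows_from(rows, 2) # cursor 2: enumeration restarts at index 2

p fresh
p resumed
```

This matches the 0.2.0 changelog entry: the enumerator now resumes from the cursor position itself instead of the position after it.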
@@ -1,7 +1,6 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  require_relative "active_record_enumerator"
4
- require_relative "active_record_batch_enumerator"
5
4
  require_relative "csv_enumerator"
6
5
  require_relative "nested_enumerator"
7
6
 
@@ -22,8 +21,7 @@ module SidekiqIteration
22
21
  raise ArgumentError, "array cannot contain ActiveRecord objects"
23
22
  end
24
23
 
25
- drop = cursor ? cursor + 1 : 0
26
- array.each_with_index.drop(drop).to_enum { array.size }
24
+ array.each_with_index.drop(cursor || 0).to_enum { array.size }
27
25
  end
28
26
 
29
27
  # Builds Enumerator from Active Record Relation. Each Enumerator tick moves the cursor one row forward.
@@ -33,6 +31,7 @@ module SidekiqIteration
33
31
  # @option options :columns [Array<String, Symbol>] used to build the actual query for iteration,
34
32
  # defaults to primary key
35
33
  # @option options :batch_size [Integer] (100) size of the batch
34
+ # @option options :order [:asc, :desc] (:asc) specifies iteration order
36
35
  #
37
36
  # +columns:+ argument is used to build the actual query for iteration. +columns+: defaults to primary key:
38
37
  #
@@ -115,7 +114,7 @@ module SidekiqIteration
115
114
  # end
116
115
  #
117
116
  def active_record_relations_enumerator(scope, cursor:, **options)
118
- ActiveRecordBatchEnumerator.new(scope, cursor: cursor, **options).each
117
+ ActiveRecordEnumerator.new(scope, cursor: cursor, **options).relations
119
118
  end
120
119
 
121
120
  # Builds Enumerator from a CSV file.
@@ -13,13 +13,13 @@ module SidekiqIteration
13
13
  base.extend(Throttling)
14
14
 
15
15
  base.class_eval do
16
- throttle_on(backoff: 0) do |job|
16
+ throttle_on(backoff: SidekiqIteration.default_retry_backoff) do |job|
17
17
  job.class.max_job_runtime &&
18
18
  job.start_time &&
19
19
  (Time.now.utc - job.start_time) > job.class.max_job_runtime
20
20
  end
21
21
 
22
- throttle_on(backoff: 0) do
22
+ throttle_on(backoff: SidekiqIteration.default_retry_backoff) do
23
23
  defined?(Sidekiq::CLI) &&
24
24
  Sidekiq::CLI.instance.launcher.stopping?
25
25
  end
@@ -56,16 +56,22 @@ module SidekiqIteration
56
56
 
57
57
  attr_reader :executions,
58
58
  :cursor_position,
59
- :start_time,
60
59
  :times_interrupted,
61
- :total_time,
62
60
  :current_run_iterations
63
61
 
62
+ # The time when the job starts running. If the job is interrupted and runs again,
63
+ # the value is updated.
64
+ attr_reader :start_time
65
+
66
+ # The total time the job has been running, including multiple iterations.
67
+ # The time isn't reset if the job is interrupted.
68
+ attr_reader :total_time
69
+
64
70
  # @private
65
71
  def initialize
66
72
  super
67
73
  @arguments = nil
68
- @job_iteration_retry_backoff = nil
74
+ @job_iteration_retry_backoff = SidekiqIteration.default_retry_backoff
69
75
  @needs_reenqueue = false
70
76
  @current_run_iterations = 0
71
77
  end
@@ -191,14 +197,14 @@ module SidekiqIteration
191
197
  )
192
198
  end
193
199
 
194
- adjust_total_time
195
200
  true
201
+ ensure
202
+ adjust_total_time
196
203
  end
197
204
 
198
205
  def reenqueue_iteration_job
199
206
  SidekiqIteration.logger.info("[SidekiqIteration::Iteration] Interrupting and re-enqueueing the job cursor_position=#{cursor_position}")
200
207
 
201
- adjust_total_time
202
208
  @times_interrupted += 1
203
209
 
204
210
  arguments = @arguments
@@ -7,6 +7,17 @@ module SidekiqIteration
7
7
  module JobRetryPatch
8
8
  private
9
9
  def process_retry(jobinst, msg, queue, exception)
10
+ add_sidekiq_iteration_metadata(jobinst, msg)
11
+ super
12
+ end
13
+
14
+ # The method was renamed in https://github.com/mperham/sidekiq/commit/0676a5202e89aa9da4ad7991f4111b97a9d8a0a4.
15
+ def attempt_retry(jobinst, msg, queue, exception)
16
+ add_sidekiq_iteration_metadata(jobinst, msg)
17
+ super
18
+ end
19
+
20
+ def add_sidekiq_iteration_metadata(jobinst, msg)
10
21
  if jobinst.is_a?(Iteration)
11
22
  unless msg["args"].last.is_a?(Hash)
12
23
  msg["args"].push({})
@@ -19,12 +30,14 @@ module SidekiqIteration
19
30
  "total_time" => jobinst.total_time,
20
31
  }
21
32
  end
22
-
23
- super
24
33
  end
25
34
  end
26
35
  end
27
36
 
28
- if Sidekiq::JobRetry.instance_method(:process_retry)
37
+ if Sidekiq::JobRetry.private_method_defined?(:process_retry) ||
38
+ Sidekiq::JobRetry.private_method_defined?(:attempt_retry)
29
39
  Sidekiq::JobRetry.prepend(SidekiqIteration::JobRetryPatch)
40
+ else
41
+ raise "Sidekiq #{Sidekiq::VERSION} removed the #process_retry method. " \
42
+ "Please open an issue at the `sidekiq-iteration` gem."
30
43
  end
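The patch above relies on `Module#prepend`, `super`, and `private_method_defined?`. A minimal sketch of that pattern in plain Ruby follows. `FakeJobRetry` and `MetadataPatch` are hypothetical stand-ins written for this example and are not Sidekiq classes; the real methods take more arguments.

```ruby
# Hypothetical stand-in for a class that only has the newer method name.
class FakeJobRetry
  def handle(msg)
    attempt_retry(msg)
  end

  private

  def attempt_retry(msg)
    msg.merge("retried" => true)
  end
end

# Patch module that wraps both possible method names and delegates with `super`.
module MetadataPatch
  private

  def process_retry(msg)
    super(add_metadata(msg))
  end

  def attempt_retry(msg)
    super(add_metadata(msg))
  end

  def add_metadata(msg)
    msg.merge("metadata" => "added")
  end
end

# `private_method_defined?` returns false for a missing method,
# whereas `instance_method` would raise NameError.
if FakeJobRetry.private_method_defined?(:process_retry) ||
   FakeJobRetry.private_method_defined?(:attempt_retry)
  FakeJobRetry.prepend(MetadataPatch)
else
  raise "Neither retry hook is defined on FakeJobRetry"
end

result = FakeJobRetry.new.handle({ "args" => [] })
p result
```

Defining both wrappers is harmless here because only the one whose name exists on the patched class is ever called.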
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SidekiqIteration
4
- VERSION = "0.1.0"
4
+ VERSION = "0.3.0"
5
5
  end
@@ -22,6 +22,17 @@ module SidekiqIteration
22
22
  #
23
23
  attr_accessor :max_job_runtime
24
24
 
25
+ # Configures a delay duration to wait before resuming an interrupted job.
26
+ #
27
+ # @example
28
+ # SidekiqIteration.default_retry_backoff = 10.seconds
29
+ #
30
+ # Defaults to nil which means interrupted jobs will be retried immediately.
31
+ # This value will be ignored when an interruption is raised by a throttle enumerator,
32
+ # where the throttle backoff value will take precedence over this setting.
33
+ #
34
+ attr_accessor :default_retry_backoff
35
+
25
36
  # Set a custom logger for sidekiq-iteration.
26
37
  # Defaults to `Sidekiq.logger`.
27
38
  #
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: sidekiq-iteration
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.3.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - fatkodima
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2022-11-02 00:00:00.000000000 Z
12
+ date: 2023-05-20 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: sidekiq
@@ -35,14 +35,13 @@ files:
35
35
  - CHANGELOG.md
36
36
  - LICENSE.txt
37
37
  - README.md
38
+ - guides/argument-semantics.md
38
39
  - guides/best-practices.md
39
40
  - guides/custom-enumerator.md
40
41
  - guides/iteration-how-it-works.md
41
42
  - guides/throttling.md
42
43
  - lib/sidekiq-iteration.rb
43
44
  - lib/sidekiq_iteration.rb
44
- - lib/sidekiq_iteration/active_record_batch_enumerator.rb
45
- - lib/sidekiq_iteration/active_record_cursor.rb
46
45
  - lib/sidekiq_iteration/active_record_enumerator.rb
47
46
  - lib/sidekiq_iteration/csv_enumerator.rb
48
47
  - lib/sidekiq_iteration/enumerators.rb
@@ -73,8 +72,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
73
72
  - !ruby/object:Gem::Version
74
73
  version: '0'
75
74
  requirements: []
76
- rubygems_version: 3.1.6
75
+ rubygems_version: 3.4.12
77
76
  signing_key:
78
77
  specification_version: 4
79
- summary: Makes your sidekiq jobs interruptible and resumable.
78
+ summary: Makes your long-running sidekiq jobs interruptible and resumable.
80
79
  test_files: []
@@ -1,127 +0,0 @@
- # frozen_string_literal: true
-
- module SidekiqIteration
-   # Batch Enumerator based on ActiveRecord Relation.
-   # @private
-   class ActiveRecordBatchEnumerator
-     include Enumerable
-
-     SQL_DATETIME_WITH_NSEC = "%Y-%m-%d %H:%M:%S.%N"
-
-     def initialize(relation, columns: nil, batch_size: 100, cursor: nil)
-       @primary_key = "#{relation.table_name}.#{relation.primary_key}"
-       @columns = Array(columns&.map(&:to_s) || @primary_key)
-       @primary_key_index = @columns.index(@primary_key) || @columns.index(relation.primary_key)
-       @pluck_columns = if @primary_key_index
-         @columns
-       else
-         @columns + [@primary_key]
-       end
-       @batch_size = batch_size
-       @cursor = Array.wrap(cursor)
-       @initial_cursor = @cursor
-       raise ArgumentError, "Must specify at least one column" if @columns.empty?
-       if relation.joins_values.present? && !@columns.all?(/\./)
-         raise ArgumentError, "You need to specify fully-qualified columns if you join a table"
-       end
-
-       if relation.arel.orders.present? || relation.arel.taken.present?
-         raise ArgumentError,
-           "The relation cannot use ORDER BY or LIMIT due to the way how iteration with a cursor is designed. " \
-           "You can use other ways to limit the number of rows, e.g. a WHERE condition on the primary key column."
-       end
-
-       @base_relation = relation.reorder(@columns.join(", "))
-     end
-
-     def each
-       return to_enum { size } unless block_given?
-
-       while (relation = next_batch)
-         yield relation, cursor_value
-       end
-     end
-
-     def size
-       (@base_relation.count(:all) + @batch_size - 1) / @batch_size # ceiling division
-     end
-
-     private
-       def next_batch
-         relation = @base_relation.limit(@batch_size)
-         if conditions.any?
-           relation = relation.where(*conditions)
-         end
-
-         cursor_values, ids = relation.uncached do
-           pluck_columns(relation)
-         end
-
-         cursor = cursor_values.last
-         unless cursor.present?
-           @cursor = @initial_cursor
-           return
-         end
-         # The primary key was plucked, but original cursor did not include it, so we should remove it
-         cursor.pop unless @primary_key_index
-         @cursor = Array.wrap(cursor)
-
-         # Yields relations by selecting the primary keys of records in the batch.
-         # Post.where(published: nil) results in an enumerator of relations like:
-         # Post.where(published: nil, ids: batch_of_ids)
-         @base_relation.where(@primary_key => ids)
-       end
-
-       def pluck_columns(relation)
-         if @pluck_columns.size == 1 # only the primary key
-           column_values = relation.pluck(*@pluck_columns)
-           return [column_values, column_values]
-         end
-
-         column_values = relation.pluck(*@pluck_columns)
-         primary_key_index = @primary_key_index || -1
-         primary_key_values = column_values.map { |values| values[primary_key_index] }
-
-         serialize_column_values!(column_values)
-         [column_values, primary_key_values]
-       end
-
-       def cursor_value
-         if @cursor.size == 1
-           @cursor.first
-         else
-           @cursor
-         end
-       end
-
-       def conditions
-         column_index = @cursor.size - 1
-         column = @columns[column_index]
-         where_clause = if @columns.size == @cursor.size
-           "#{column} > ?"
-         else
-           "#{column} >= ?"
-         end
-         while column_index > 0
-           column_index -= 1
-           column = @columns[column_index]
-           where_clause = "#{column} > ? OR (#{column} = ? AND (#{where_clause}))"
-         end
-         ret = @cursor.reduce([where_clause]) { |params, value| params << value << value }
-         ret.pop
-         ret
-       end
-
-       def serialize_column_values!(column_values)
-         column_values.map! { |values| values.map! { |value| column_value(value) } }
-       end
-
-       def column_value(value)
-         if value.is_a?(Time)
-           value.strftime(SQL_DATETIME_WITH_NSEC)
-         else
-           value
-         end
-       end
-   end
- end
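
The `conditions` method in the file removed above builds a keyset-pagination `WHERE` clause from the cursor. The sketch below reproduces that logic as a standalone function so its output can be inspected without a database. The helper name `keyset_conditions`, the column names, and the cursor values are assumptions made for illustration and are not part of the gem.

```ruby
# Hypothetical standalone helper mirroring the removed `conditions` method.
# It only builds the SQL fragment and its bind values; no database is needed.
def keyset_conditions(columns, cursor)
  return [] if cursor.empty? # no cursor yet: start from the first row

  column_index = cursor.size - 1
  column = columns[column_index]
  # Strictly greater when the cursor covers every column, inclusive otherwise.
  where_clause = columns.size == cursor.size ? "#{column} > ?" : "#{column} >= ?"

  # Wrap the clause once per preceding column, from right to left.
  while column_index > 0
    column_index -= 1
    column = columns[column_index]
    where_clause = "#{column} > ? OR (#{column} = ? AND (#{where_clause}))"
  end

  # Every cursor value is bound twice (for `>` and `=`) except the last one.
  binds = cursor.flat_map { |value| [value, value] }
  binds.pop
  [where_clause, *binds]
end

# Illustrative inputs (assumed, not taken from the gem).
columns = ["posts.created_at", "posts.id"]
full_cursor = ["2023-05-20 00:00:00.000000000", 42]

full = keyset_conditions(columns, full_cursor)
partial = keyset_conditions(columns, full_cursor.first(1))

puts full.inspect
puts partial.inspect
```

With a full cursor the clause skips the row the cursor points at. With a partial cursor the comparison is inclusive, because rows sharing the leading column value may not have been visited yet.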
@@ -1,89 +0,0 @@
- # frozen_string_literal: true
-
- module SidekiqIteration
-   # @private
-   class ActiveRecordCursor
-     include Comparable
-
-     attr_reader :position, :reached_end
-
-     def initialize(relation, columns = nil, position = nil)
-       columns ||= "#{relation.table_name}.#{relation.primary_key}"
-       @columns = Array.wrap(columns)
-       raise ArgumentError, "Must specify at least one column" if @columns.empty?
-
-       self.position = Array.wrap(position)
-       if relation.joins_values.present? && !@columns.all?(/\./)
-         raise ArgumentError, "You need to specify fully-qualified columns if you join a table"
-       end
-
-       if relation.arel.orders.present? || relation.arel.taken.present?
-         raise ArgumentError,
-           "The relation cannot use ORDER BY or LIMIT due to the way how iteration with a cursor is designed. " \
-           "You can use other ways to limit the number of rows, e.g. a WHERE condition on the primary key column."
-       end
-
-       @base_relation = relation.reorder(@columns.join(", "))
-       @reached_end = false
-     end
-
-     def <=>(other)
-       if reached_end == other.reached_end
-         position <=> other.position
-       else
-         reached_end ? 1 : -1
-       end
-     end
-
-     def position=(position)
-       raise ArgumentError, "Cursor position cannot contain nil values" if position.any?(&:nil?)
-
-       @position = position
-     end
-
-     def next_batch(batch_size)
-       return if @reached_end
-
-       relation = @base_relation.limit(batch_size)
-
-       if (conditions = self.conditions).any?
-         relation = relation.where(*conditions)
-       end
-
-       records = relation.uncached do
-         relation.to_a
-       end
-
-       update_from_record(records.last) if records.any?
-       @reached_end = records.size < batch_size
-
-       records if records.any?
-     end
-
-     private
-       def conditions
-         i = @position.size - 1
-         column = @columns[i]
-         conditions = if @columns.size == @position.size
-           "#{column} > ?"
-         else
-           "#{column} >= ?"
-         end
-         while i > 0
-           i -= 1
-           column = @columns[i]
-           conditions = "#{column} > ? OR (#{column} = ? AND (#{conditions}))"
-         end
-         ret = @position.reduce([conditions]) { |params, value| params << value << value }
-         ret.pop
-         ret
-       end
-
-       def update_from_record(record)
-         self.position = @columns.map do |column|
-           method = column.to_s.split(".").last
-           record.send(method)
-         end
-       end
-   end
- end
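
The `next_batch` method in the removed `ActiveRecordCursor` advances the cursor to the last record of each batch and sets `reached_end` once a batch comes back short. A minimal in-memory sketch of that control flow follows. `InMemoryCursor` is a hypothetical class that stands in for the database with a sorted array of integers; it is not the gem's implementation.

```ruby
# Hypothetical in-memory analogue of the removed cursor's batching loop.
class InMemoryCursor
  attr_reader :position, :reached_end

  def initialize(ids, position = nil)
    @ids = ids.sort
    @position = position
    @reached_end = false
  end

  def next_batch(batch_size)
    return if @reached_end

    # Equivalent of `WHERE id > position ORDER BY id LIMIT batch_size`.
    candidates = @position ? @ids.select { |id| id > @position } : @ids
    records = candidates.first(batch_size)

    @position = records.last if records.any?
    # A short (or empty) batch means there is nothing left to fetch.
    @reached_end = records.size < batch_size

    records if records.any?
  end
end

def collect_batches(cursor, batch_size)
  batches = []
  while (batch = cursor.next_batch(batch_size))
    batches << batch
  end
  batches
end

odd_cursor = InMemoryCursor.new([1, 2, 3, 4, 5])
odd_batches = collect_batches(odd_cursor, 2)

even_cursor = InMemoryCursor.new([1, 2, 3, 4])
even_batches = collect_batches(even_cursor, 2)

puts odd_batches.inspect
puts even_batches.inspect
```

When the row count is an exact multiple of the batch size, the end is only detected by one extra, empty fetch, which is why the method returns `nil` rather than an empty array in that case.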