RubyGems - job-iteration - Versions diffs - 1.3.5 → 1.4.0 - Mend

job-iteration 1.3.5 → 1.4.0

Files changed (28)

checksums.yaml +4 -4
data/.github/workflows/ci.yml +50 -26
data/.github/workflows/cla.yml +22 -0
data/.rubocop.yml +3 -3
data/CHANGELOG.md +30 -4
data/Gemfile +0 -1
data/Gemfile.lock +64 -65
data/README.md +26 -8
data/dev.yml +2 -2
data/gemfiles/rails_6_1.gemfile +12 -0
data/gemfiles/rails_7_0.gemfile +6 -0
data/guides/argument-semantics.md +128 -0
data/guides/best-practices.md +72 -32
data/guides/custom-enumerator.md +76 -28
data/guides/iteration-how-it-works.md +2 -18
data/{railgun.yml → isogun.yml} +0 -4
data/lib/job-iteration/active_record_batch_enumerator.rb +3 -1
data/lib/job-iteration/active_record_cursor.rb +7 -3
data/lib/job-iteration/active_record_enumerator.rb +6 -1
data/lib/job-iteration/csv_enumerator.rb +1 -1
data/lib/job-iteration/enumerator_builder.rb +49 -9
data/lib/job-iteration/iteration.rb +83 -46
data/lib/job-iteration/log_subscriber.rb +38 -0
data/lib/job-iteration/nested_enumerator.rb +48 -0
data/lib/job-iteration/throttle_enumerator.rb +1 -2
data/lib/job-iteration/version.rb +1 -1
data/lib/job-iteration.rb +25 -0
metadata +10 -4

data/guides/best-practices.md CHANGED

@@ -1,20 +1,67 @@
 # Best practices
-## Instrumentation
+## Batch iteration
-Iteration leverages `ActiveSupport::Notifications` which lets you instrument all kind of events:
+Regardless of the Active Record enumerator used in the task, the `job-iteration` gem loads records in batches of 100 (by default).
+The following two tasks produce equivalent database queries;
+however, the `RecordsJob` task allows for more frequent interruptions by doing just one thing in the `each_iteration` method.
 ```ruby
-# config/initializers/instrumentation.rb
-ActiveSupport::Notifications.subscribe('build_enumerator.iteration') do |_, started, finished, _, tags|
-  StatsD.distribution(
-    'iteration.build_enumerator',
-    (finished - started),
-    tags: { job_class: tags[:job_class]&.underscore }
-  )
+# bad
+class BatchesJob < ApplicationJob
+  include JobIteration::Iteration
+  def build_enumerator(product_id, cursor:)
+    enumerator_builder.active_record_on_batches(
+      Comment.where(product_id: product_id),
+      cursor: cursor,
+      batch_size: 5,
+    )
+  end
+  def each_iteration(batch_of_comments, product_id)
+    batch_of_comments.each(&:destroy)
+  end
+end
+# good
+class RecordsJob < ApplicationJob
+  include JobIteration::Iteration
+  def build_enumerator(product_id, cursor:)
+    enumerator_builder.active_record_on_records(
+      Comment.where(product_id: product_id),
+      cursor: cursor,
+      batch_size: 5,
+    )
+  end
+  def each_iteration(comment, product_id)
+    comment.destroy
+  end
 end
+```
-ActiveSupport::Notifications.subscribe('each_iteration.iteration') do |_, started, finished, _, tags|
+## Instrumentation
+Iteration leverages [`ActiveSupport::Notifications`](https://guides.rubyonrails.org/active_support_instrumentation.html)
+to notify you what it's doing. You can subscribe to the following events (listed in order of job lifecycle):
+- `build_enumerator.iteration`
+- `throttled.iteration` (when using ThrottleEnumerator)
+- `nil_enumerator.iteration`
+- `resumed.iteration`
+- `each_iteration.iteration`
+- `not_found.iteration`
+- `interrupted.iteration`
+- `completed.iteration`
+All events have tags including the job class name and cursor position; some also add the number of times the job was interrupted and/or
+the total time the job spent running across interruptions.
+```ruby
+# config/initializers/instrumentation.rb
+ActiveSupport::Notifications.monotonic_subscribe("each_iteration.iteration") do |_, started, finished, _, tags|
   elapsed = finished - started
   StatsD.distribution(
     "iteration.each_iteration",
@@ -27,28 +74,6 @@ ActiveSupport::Notifications.subscribe('each_iteration.iteration') do |_, starte
     "each_iteration runtime exceeded limit of #{BackgroundQueue.max_iteration_runtime}s"
   end
 end
-ActiveSupport::Notifications.subscribe('resumed.iteration') do |_, _, _, _, tags|
-  StatsD.increment(
-    "iteration.resumed",
-    tags: { job_class: tags[:job_class]&.underscore }
-  )
-end
-ActiveSupport::Notifications.subscribe('interrupted.iteration') do |_, _, _, _, tags|
-  StatsD.increment(
-    "iteration.interrupted",
-    tags: { job_class: tags[:job_class]&.underscore }
-  )
-end
-# If you're using ThrottleEnumerator
-ActiveSupport::Notifications.subscribe('throttled.iteration') do |_, _, _, _, tags|
-  StatsD.increment(
-    "iteration.throttled",
-    tags: { job_class: tags[:job_class]&.underscore }
-  )
-end
 ```
 ## Max iteration time
@@ -66,3 +91,18 @@ JobIteration.max_job_runtime = 5.minutes # nil by default
 ```
 Use this accessor to tweak how often you'd like the job to interrupt itself.
+### Per job max job runtime
+For more granular control, `job_iteration_max_job_runtime` can be set **per-job class**. This allows incremental adoption, as well as pairing a conservative global setting with a more aggressive setting for individual jobs.
+```ruby
+class MyJob < ApplicationJob
+  include JobIteration::Iteration
+  self.job_iteration_max_job_runtime = 3.minutes
+  # ...
+```
+This setting will be inherited by any child classes, although it can be further overridden. Note that no class can **increase** the `max_job_runtime` it has inherited; it can only be **decreased**. No job can increase its `max_job_runtime` beyond the global limit.

data/guides/custom-enumerator.md CHANGED

@@ -1,38 +1,34 @@
-Iteration leverages the [Enumerator](http://ruby-doc.org/core-2.5.1/Enumerator.html) pattern from the Ruby standard library, which allows us to use almost any resource as a collection to iterate.
+Iteration leverages the [Enumerator](https://ruby-doc.org/3.2.1/Enumerator.html) pattern from the Ruby standard library,
+which allows us to use almost any resource as a collection to iterate.
-Consider a custom Enumerator that takes items from a Redis list. Because a Redis List is essentially a queue, we can ignore the cursor:
+Before writing an enumerator, it is important to understand [how Iteration works](iteration-how-it-works.md) and how
+your enumerator will be used by it. An enumerator must `yield` two things in the following order as positional
+arguments:
+- An object to be processed in the job's `each_iteration` method
+- A cursor position, which Iteration will persist if `each_iteration` returns successfully and the job is forced to shut
+  down. It can be any data type your job backend can serialize and deserialize correctly.
-```ruby
-class ListJob < ActiveJob::Base
-  include JobIteration::Iteration
-  def build_enumerator(*)
-    @redis = Redis.new
-    Enumerator.new do |yielder|
-      yielder.yield @redis.lpop(key), nil
-    end
-  end
-  def each_iteration(item_from_redis)
-    # ...
-  end
-end
-```
+A job that includes Iteration is first started with `nil` as the cursor. When resuming an interrupted job, Iteration
+will deserialize the persisted cursor and pass it to the job's `build_enumerator` method, which your enumerator uses to
+find objects that come _after_ the last successfully processed object. The [array enumerator](https://github.com/Shopify/job-iteration/blob/v1.3.6/lib/job-iteration/enumerator_builder.rb#L50-L67)
+is a simple example which uses the array index as the cursor position.
-But what about iterating based on a cursor? Consider this Enumerator that wraps third party API (Stripe) for paginated iteration:
+For a more complex example, consider this Enumerator that wraps a third party API (Stripe) for paginated iteration and
+stores a string as the cursor position:
 ```ruby
 class StripeListEnumerator
+  # @see https://stripe.com/docs/api/pagination
   # @param resource [Stripe::APIResource] The type of Stripe object to request
   # @param params [Hash] Query parameters for the request
   # @param options [Hash] Request options, such as API key or version
-  # @param cursor [String]
+  # @param cursor [nil, String] The Stripe ID of the last item iterated over
   def initialize(resource, params: {}, options: {}, cursor:)
     pagination_params = {}
     pagination_params[:starting_after] = cursor unless cursor.nil?
+    # The following line makes a request, consider adding your rate limiter here.
     @list = resource.public_send(:list, params.merge(pagination_params), options)
-      .auto_paging_each.lazy
   end
   def to_enumerator
@@ -45,27 +41,75 @@ class StripeListEnumerator
   # as the cursor on the job. This allows us to properly set the
   # `starting_after` parameter for the API request when resuming.
   def each
-    @list.each do |item, _index|
-      yield item, item.id
+    loop do
+      @list.each do |item, _index|
+        # The first argument is what gets passed to `each_iteration`.
+        # The second argument (item.id) is going to be persisted as the cursor,
+        # it doesn't get passed to `each_iteration`.
+        yield item, item.id
+      end
+      # The following line makes a request, consider adding your rate limiter here.
+      @list = @list.next_page
+      break if @list.empty?
     end
   end
 end
 ```
+Here we leverage the Stripe cursor pagination where the cursor is an ID of a specific item in the collection. The job
+which uses such an `Enumerator` would then look like so:
 ```ruby
-class StripeJob < ActiveJob::Base
+class LoadRefundsForChargeJob < ActiveJob::Base
   include JobIteration::Iteration
-  def build_enumerator(params, cursor:)
+  # If you added your own rate limiting above, handle it here. For example:
+  # retry_on(MyRateLimiter::LimitExceededError, wait: 30.seconds, attempts: :unlimited)
+  # Use an exponential back-off strategy when Stripe's API returns errors.
+  def build_enumerator(charge_id, cursor:)
     StripeListEnumerator.new(
       Stripe::Refund,
-      params: { charge: "ch_123" },
+      params: { charge: charge_id }, # "charge_id" will be a prefixed Stripe ID such as "ch_123"
       options: { api_key: "sk_test_123", stripe_version: "2018-01-18" },
       cursor: cursor
     ).to_enumerator
   end
-  def each_iteration(stripe_refund, _params)
+  # Note that in this case `each_iteration` will only receive one positional argument per iteration.
+  # If what your enumerator yields is a composite object you will need to unpack it yourself
+  # inside the `each_iteration`.
+  def each_iteration(stripe_refund, charge_id)
+    # ...
+  end
+end
+```
+and you initiate the job with
+```ruby
+LoadRefundsForChargeJob.perform_later(_charge_id = "ch_345")
+```
+Sometimes you can ignore the cursor. Consider the following custom Enumerator that takes items from a Redis list, which
+is essentially a queue. Even if this job doesn't need to persist a cursor in order to resume, it can still use
+Iteration's signal handling to finish `each_iteration` and gracefully terminate.
+```ruby
+class RedisPopListJob < ActiveJob::Base
+  include JobIteration::Iteration
+  # @see https://redis.io/commands/lpop/
+  def build_enumerator(*)
+    @redis = Redis.new
+    Enumerator.new do |yielder|
+      yielder.yield @redis.lpop(key), nil
+    end
+  end
+  def each_iteration(item_from_redis)
     # ...
   end
 end
@@ -73,4 +117,8 @@ end
 We recommend that you read the implementation of the other enumerators that come with the library (`CsvEnumerator`, `ActiveRecordEnumerator`) to gain a better understanding of building Enumerator objects.
-Code that is written after the `yield` in a custom enumerator is not guaranteed to execute. In the case that a job is forced to exit ie `job_should_exit?` is true, then the job is re-enqueued during the yield and the rest of the code in the enumerator does not run. You can follow that logic [here](https://github.com/Shopify/job-iteration/blob/9641f455b9126efff2214692c0bef423e0d12c39/lib/job-iteration/iteration.rb#L128-L131).
+Code that is written after the `yield` in a custom enumerator is not guaranteed to execute. If a job is
+forced to exit, i.e. `job_should_exit?` is true, the job is re-enqueued during the yield and the rest of the code in
+the enumerator does not run. You can follow that logic
+[here](https://github.com/Shopify/job-iteration/blob/v1.3.6/lib/job-iteration/iteration.rb#L161-L165) and
+[here](https://github.com/Shopify/job-iteration/blob/v1.3.6/lib/job-iteration/iteration.rb#L131-L143).
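
The enumerator contract described in this guide (yield an object, then a cursor, and resume _after_ the persisted cursor) can be illustrated without the gem. The following is a minimal plain-Ruby sketch that uses the array index as the cursor; `build_index_cursor_enumerator` is a hypothetical helper written for this illustration, not the gem's array enumerator.

```ruby
# Hypothetical helper (not part of job-iteration): yields [item, cursor]
# pairs, using the array index as the cursor position.
def build_index_cursor_enumerator(items, cursor:)
  # A nil cursor means a fresh start. Otherwise skip everything up to and
  # including the index of the last successfully processed item.
  start = cursor.nil? ? 0 : cursor + 1

  Enumerator.new do |yielder|
    items.each_with_index.drop(start).each do |item, index|
      # First value: what a job would process. Second value: the cursor.
      yielder.yield(item, index)
    end
  end
end

first_run = build_index_cursor_enumerator(%w[a b c d], cursor: nil)
# Pretend the job was interrupted after processing index 1 ("b").
resumed = build_index_cursor_enumerator(%w[a b c d], cursor: 1)
```

Here `resumed` starts at `"c"` with cursor `2`, so the item at the persisted cursor is not processed twice.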

data/guides/iteration-how-it-works.md CHANGED

@@ -34,22 +34,6 @@ Further reading: [Sidekiq signals](https://github.com/mperham/sidekiq/wiki/Signa
 In the early versions of Iteration, `build_enumerator` used to return ActiveRecord relations directly, and we would infer the Enumerator based on the type of object. We used to support ActiveRecord relations, arrays and CSVs. This made it hard to add support for other types of enumerations, and it was easy for developers to make mistakes and return an array of ActiveRecord objects, which we would then treat as an array instead of as an ActiveRecord relation.
-The current version of Iteration supports _any_ Enumerator. We expose helpers to build enumerators conveniently (`enumerator_builder.active_record_on_records`), but it's up for a developer to implement a custom Enumerator. Consider this example:
-```ruby
-class MyJob < ActiveJob::Base
-  include JobIteration::Iteration
-  def build_enumerator(cursor:)
-    Enumerator.new do
-      Redis.lpop("mylist") # or: Kafka.poll(timeout: 10.seconds)
-    end
-  end
-  def each_iteration(element_from_redis)
-    # ...
-  end
-end
-```
+The current version of Iteration supports _any_ Enumerator. We expose helpers to build common enumerators conveniently (`enumerator_builder.active_record_on_records`), but it's up to a developer to implement [a custom Enumerator](custom-enumerator.md).
-Further reading: [ruby-doc](http://ruby-doc.org/core-2.5.1/Enumerator.html), [a great post about Enumerators](http://blog.arkency.com/2014/01/ruby-to-enum-for-enumerator/).
+Further reading: [ruby-doc](https://ruby-doc.org/3.2.1/Enumerator.html), [a great post about Enumerators](http://blog.arkency.com/2014/01/ruby-to-enum-for-enumerator/).

data/{railgun.yml → isogun.yml} RENAMED

@@ -2,14 +2,10 @@
 name: job-iteration
 vm:
-  image:      /opt/dev/misc/railgun-images/default
   ip_address: 192.168.64.142
   memory:     1G
   cores:      2
-volumes:
-  root:  '500M'
 services:
   - redis
   - mysql

data/lib/job-iteration/active_record_batch_enumerator.rb CHANGED

@@ -26,7 +26,7 @@ module JobIteration
       end
       if relation.arel.orders.present? || relation.arel.taken.present?
-        raise ConditionNotSupportedError
+        raise JobIteration::ActiveRecordCursor::ConditionNotSupportedError
       end
       @base_relation = relation.reorder(@columns.join(","))
@@ -34,6 +34,7 @@ module JobIteration
     def each
       return to_enum { size } unless block_given?
       while (relation = next_batch)
         yield relation, cursor_value
       end
@@ -86,6 +87,7 @@ module JobIteration
     def cursor_value
       return @cursor.first if @cursor.size == 1
       @cursor
     end

data/lib/job-iteration/active_record_cursor.rb CHANGED

@@ -19,8 +19,11 @@ module JobIteration
     end
     def initialize(relation, columns = nil, position = nil)
-      columns ||= "#{relation.table_name}.#{relation.primary_key}"
-      @columns = Array.wrap(columns)
+      @columns = if columns
+        Array(columns)
+      else
+        Array(relation.primary_key).map { |pk| "#{relation.table_name}.#{pk}" }
+      end
       self.position = Array.wrap(position)
       raise ArgumentError, "Must specify at least one column" if columns.empty?
       if relation.joins_values.present? && !@columns.all? { |column| column.to_s.include?(".") }
@@ -45,6 +48,7 @@ module JobIteration
     def position=(position)
       raise "Cursor position cannot contain nil values" if position.any?(&:nil?)
       @position = position
     end
@@ -56,7 +60,7 @@ module JobIteration
     end
     def next_batch(batch_size)
-      return nil if @reached_end
+      return if @reached_end
       relation = @base_relation.limit(batch_size)

data/lib/job-iteration/active_record_enumerator.rb CHANGED

@@ -10,7 +10,11 @@ module JobIteration
     def initialize(relation, columns: nil, batch_size: 100, cursor: nil)
       @relation = relation
       @batch_size = batch_size
-      @columns = Array(columns || "#{relation.table_name}.#{relation.primary_key}")
+      @columns = if columns
+        Array(columns)
+      else
+        Array(relation.primary_key).map { |pk| "#{relation.table_name}.#{pk}" }
+      end
       @cursor = cursor
     end
@@ -45,6 +49,7 @@ module JobIteration
         column_value(record, attribute_name)
       end
       return positions.first if positions.size == 1
       positions
     end
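
The new column-defaulting branch (here and in `active_record_cursor.rb`) maps over `Array(relation.primary_key)` instead of interpolating the primary key directly. The diff does not state the motivation. One plausible reading, which is an assumption here, is that it handles a primary key reported as an array of column names. The sketch below reproduces the branch in plain Ruby; `FakeRelation` and `default_cursor_columns` are hypothetical stand-ins, not library code.

```ruby
# Hypothetical stand-in for an ActiveRecord relation. Only the two methods
# used by the column-defaulting branch are modelled.
FakeRelation = Struct.new(:table_name, :primary_key)

# Mirrors the branch added in the diff: explicit columns win, otherwise each
# primary key column is qualified with the table name.
def default_cursor_columns(relation, columns = nil)
  if columns
    Array(columns)
  else
    Array(relation.primary_key).map { |pk| "#{relation.table_name}.#{pk}" }
  end
end

single = default_cursor_columns(FakeRelation.new("comments", "id"))
composite = default_cursor_columns(FakeRelation.new("order_items", ["order_id", "id"]))
explicit = default_cursor_columns(FakeRelation.new("comments", "id"), :created_at)
```

With a single-column key the result is one qualified name, as before. With an array-valued key, each column is qualified separately, whereas the removed code would have interpolated the whole array into a single string.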

data/lib/job-iteration/csv_enumerator.rb CHANGED

@@ -41,7 +41,7 @@ module JobIteration
     def batches(batch_size:, cursor:)
       @csv.lazy
         .each_slice(batch_size)
-        .each_with_index
+        .with_index
         .drop(count_of_processed_rows(cursor))
         .to_enum { (count_of_rows_in_file.to_f / batch_size).ceil }
     end
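
The `lazy.each_slice.with_index.drop` chain in `batches` can be tried on a plain array to see what shape the enumerator yields. This is a minimal sketch under stated assumptions: the row data is made up, and `batches_already_processed` is a hypothetical value standing in for whatever the job derives from its cursor.

```ruby
rows = (1..7).to_a
batch_size = 3
# Hypothetical: pretend one batch was already processed before an interruption.
batches_already_processed = 1

batches = rows.lazy
  .each_slice(batch_size)          # groups rows into batches of up to 3
  .with_index                      # pairs each batch with its zero-based index
  .drop(batches_already_processed) # skips batches handled in an earlier run
  .to_a
```

Each element of `batches` is a `[batch, index]` pair, so the index can serve as the cursor for the batch it accompanies.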

data/lib/job-iteration/enumerator_builder.rb CHANGED

@@ -4,6 +4,7 @@ require_relative "./active_record_batch_enumerator"
 require_relative "./active_record_enumerator"
 require_relative "./csv_enumerator"
 require_relative "./throttle_enumerator"
+require_relative "./nested_enumerator"
 require "forwardable"
 module JobIteration
@@ -19,10 +20,12 @@ module JobIteration
     # compatibility with raw calls to EnumeratorBuilder. Think of these wrappers
     # the way you should a middleware.
     class Wrapper < Enumerator
-      def self.wrap(_builder, enum)
-        new(-> { enum.size }) do |yielder|
-          enum.each do |*val|
-            yielder.yield(*val)
+      class << self
+        def wrap(_builder, enum)
+          new(-> { enum.size }) do |yielder|
+            enum.each do |*val|
+              yielder.yield(*val)
+            end
           end
         end
       end
@@ -43,6 +46,7 @@ module JobIteration
     # Builds an Enumerator that iterates N times and yields numbers starting from zero.
     def build_times_enumerator(number, cursor:)
       raise ArgumentError, "First argument must be an Integer" unless number.is_a?(Integer)
       wrap(self, build_array_enumerator(number.times.to_a, cursor: cursor))
     end
@@ -54,6 +58,7 @@ module JobIteration
       if enumerable.any? { |i| defined?(ActiveRecord) && i.is_a?(ActiveRecord::Base) }
         raise ArgumentError, "array cannot contain ActiveRecord objects"
       end
       drop =
         if cursor.nil?
           0
@@ -97,7 +102,7 @@ module JobIteration
       enum = build_active_record_enumerator(
         scope,
         cursor: cursor,
-        **args
+        **args,
       ).records
       wrap(self, enum)
     end
@@ -112,7 +117,7 @@ module JobIteration
       enum = build_active_record_enumerator(
         scope,
         cursor: cursor,
-        **args
+        **args,
       ).batches
       wrap(self, enum)
     end
@@ -123,7 +128,7 @@ module JobIteration
       enum = JobIteration::ActiveRecordBatchEnumerator.new(
         scope,
         cursor: cursor,
-        **args
+        **args,
       ).each
       enum = wrap(self, enum) if wrap
       enum
@@ -134,7 +139,7 @@ module JobIteration
         enum,
         @job,
         throttle_on: throttle_on,
-        backoff: backoff
+        backoff: backoff,
       ).to_enum
     end
@@ -142,6 +147,40 @@ module JobIteration
       CsvEnumerator.new(enumerable).rows(cursor: cursor)
     end
+    # Builds Enumerator for nested iteration.
+    #
+    # @param enums [Array<Proc>] an Array of Procs, each should return an Enumerator.
+    #   Each proc from enums should accept the yielded items from the parent enumerators
+    #     and the `cursor` as its arguments.
+    #   Each proc's `cursor` argument is its part from the `build_enumerator`'s `cursor` array.
+    # @param cursor [Array<Object>] array of offsets for each of the enums to start iteration from
+    #
+    # @example
+    #   def build_enumerator(cursor:)
+    #     enumerator_builder.nested(
+    #       [
+    #         ->(cursor) {
+    #           enumerator_builder.active_record_on_records(Shop.all, cursor: cursor)
+    #         },
+    #         ->(shop, cursor) {
+    #           enumerator_builder.active_record_on_records(shop.products, cursor: cursor)
+    #         },
+    #         ->(_shop, product, cursor) {
+    #           enumerator_builder.active_record_on_batch_relations(product.product_variants, cursor: cursor)
+    #         }
+    #       ],
+    #       cursor: cursor
+    #     )
+    #   end
+    #
+    #   def each_iteration(product_variants_relation)
+    #     # do something
+    #   end
+    #
+    def build_nested_enumerator(enums, cursor:)
+      NestedEnumerator.new(enums, cursor: cursor).each
+    end
     alias_method :once, :build_once_enumerator
     alias_method :times, :build_times_enumerator
     alias_method :array, :build_array_enumerator
@@ -150,6 +189,7 @@ module JobIteration
     alias_method :active_record_on_batch_relations, :build_active_record_enumerator_on_batch_relations
     alias_method :throttle, :build_throttle_enumerator
     alias_method :csv, :build_csv_enumerator
+    alias_method :nested, :build_nested_enumerator
     private
@@ -161,7 +201,7 @@ module JobIteration
       JobIteration::ActiveRecordEnumerator.new(
         scope,
         cursor: cursor,
-        **args
+        **args,
       )
     end
   end