RubyGems - job-iteration - Versions diffs - 0.9.0 - Mend

job-iteration 0.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of job-iteration might be problematic.

Files changed (24)

checksums.yaml +7 -0
data/.gitignore +10 -0
data/.rubocop.yml +13 -0
data/.travis.yml +14 -0
data/CODE_OF_CONDUCT.md +74 -0
data/Gemfile +24 -0
data/Gemfile.lock +113 -0
data/LICENSE.txt +21 -0
data/README.md +191 -0
data/Rakefile +12 -0
data/guides/best-practices.md +60 -0
data/guides/iteration-how-it-works.md +55 -0
data/job-iteration.gemspec +30 -0
data/lib/job-iteration.rb +33 -0
data/lib/job-iteration/active_record_cursor.rb +93 -0
data/lib/job-iteration/active_record_enumerator.rb +51 -0
data/lib/job-iteration/csv_enumerator.rb +42 -0
data/lib/job-iteration/enumerator_builder.rb +146 -0
data/lib/job-iteration/integrations/resque.rb +26 -0
data/lib/job-iteration/integrations/sidekiq.rb +21 -0
data/lib/job-iteration/iteration.rb +204 -0
data/lib/job-iteration/test_helper.rb +41 -0
data/lib/job-iteration/version.rb +5 -0
metadata +122 -0

data/Rakefile ADDED

@@ -0,0 +1,12 @@
+# frozen_string_literal: true
+require "bundler/gem_tasks"
+require "rake/testtask"
+Rake::TestTask.new(:test) do |t|
+  t.libs << "test"
+  t.libs << "lib"
+  t.test_files = FileList["test/**/*_test.rb"]
+end
+task default: :test

data/guides/best-practices.md ADDED

@@ -0,0 +1,60 @@
+# Best practices
+## Instrumentation
+Iteration leverages `ActiveSupport::Notifications`, which lets you instrument all kinds of events:
+```ruby
+# config/initializers/instrumentation.rb
+ActiveSupport::Notifications.subscribe('build_enumerator.iteration') do |_, started, finished, _, tags|
+  StatsD.distribution(
+    'iteration.build_enumerator',
+    (finished - started),
+    tags: { job_class: tags[:job_class]&.underscore }
+  )
+end
+ActiveSupport::Notifications.subscribe('each_iteration.iteration') do |_, started, finished, _, tags|
+  elapsed = finished - started
+  StatsD.distribution(
+    "iteration.each_iteration",
+    elapsed,
+    tags: { job_class: tags[:job_class]&.underscore }
+  )
+  if elapsed >= BackgroundQueue.max_iteration_runtime
+    Rails.logger.warn "[Iteration] job_class=#{tags[:job_class]} " \
+    "each_iteration runtime exceeded limit of #{BackgroundQueue.max_iteration_runtime}s"
+  end
+end
+ActiveSupport::Notifications.subscribe('resumed.iteration') do |_, _, _, _, tags|
+  StatsD.increment(
+    "iteration.resumed",
+    tags: { job_class: tags[:job_class]&.underscore }
+  )
+end
+ActiveSupport::Notifications.subscribe('interrupted.iteration') do |_, _, _, _, tags|
+  StatsD.increment(
+    "iteration.interrupted",
+    tags: { job_class: tags[:job_class]&.underscore }
+  )
+end
+```
+## Max iteration time
+As you may notice in the snippet above, at Shopify we enforce that `each_iteration` does not take longer than `BackgroundQueue.max_iteration_runtime`, which is set to `25` seconds.
+We discourage longer iterations because jobs with a long `each_iteration` make interruptibility somewhat useless: the infrastructure has to wait longer for the job to interrupt.
+## Max job runtime
+If a job is supposed to have millions of iterations and you expect it to run for hours or days, it's still a good idea to interrupt the job periodically even if there are no interruption signals coming from deploys or the infrastructure. At Shopify, we interrupt at least every 5 minutes to preserve **worker capacity**.
+```ruby
+JobIteration.max_job_runtime = 5.minutes # nil by default
+```
+Use this accessor to tweak how often you'd like the job to interrupt itself.
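
The self-interruption idea described under "Max job runtime" can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: `run_with_max_runtime`, its return hash, and the injectable `clock` are hypothetical names introduced here, and the check simply compares elapsed time against the budget between iterations.

```ruby
# Hypothetical stand-in for a job loop; none of these names come from the gem.
# `clock` is injectable so the behaviour can be exercised without sleeping.
def run_with_max_runtime(items, max_job_runtime:,
                         clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  started_at = clock.call
  processed = []

  items.each_with_index do |item, index|
    yield item
    processed << item

    # Between iterations, check whether the runtime budget is used up.
    # A nil budget (the default in the guide) disables the check.
    if max_job_runtime && clock.call - started_at >= max_job_runtime
      return { processed: processed, cursor: index, interrupted: true }
    end
  end

  { processed: processed, cursor: nil, interrupted: false }
end

# Usage: a fake clock that advances two seconds per reading.
now = 0
result = run_with_max_runtime(%w(a b c d e), max_job_runtime: 5, clock: -> { now += 2 }) { |item| item }
puts result.inspect
```

With the fake clock, the budget of 5 seconds is exceeded after the third item, so the sketch stops there and reports index `2` as the position to resume from.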

data/guides/iteration-how-it-works.md ADDED

@@ -0,0 +1,55 @@
+# Iteration: how it works
+The main idea behind Iteration is to provide an API for describing jobs in an interruptible manner, in contrast with one massive `def perform` that is impossible to interrupt safely.
+Exposing the enumerator and the action to apply allows us to keep the cursor and interrupt between iterations. Let's see what this looks like using the example of an ActiveRecord relation (and Enumerator).
+1. `build_enumerator` is called, which constructs `ActiveRecordEnumerator` from an ActiveRecord relation (`Product.all`)
+2. The first batch of records is loaded:
+```sql
+SELECT  `products`.* FROM `products` ORDER BY products.id LIMIT 100
+```
+3. The job iterates over two records of the relation and then receives `SIGTERM` (graceful termination signal) caused by a deploy
+4. The signal handler sets a flag that makes `job_should_exit?` return `true`
+5. After the current iteration completes, we check `job_should_exit?`, which now returns `true`
+6. The job stops iterating and pushes itself back to the queue, with the latest `cursor_position` value.
+7. The next time the job is taken from the queue, we load records starting after the last primary key that was processed:
+```sql
+SELECT  `products`.* FROM `products` WHERE (products.id > 2) ORDER BY products.id LIMIT 100
+```
+## Signals
+It's critical to know UNIX signals in order to understand how interruption works. There are two main signals that Sidekiq and Resque use: `SIGTERM` and `SIGKILL`. `SIGTERM` is the graceful termination signal, which means that the process should exit _soon_, not immediately. For Iteration, it means that we have time to wait for the current iteration to finish and to push the job back to the queue with the last cursor position.
+`SIGTERM` is what allows Iteration to work. In contrast, `SIGKILL` means immediate exit. It doesn't let the worker terminate gracefully; instead, the worker drops the job and exits as soon as possible.
+Most deploy strategies (Kubernetes, Heroku, Capistrano) send `SIGTERM` before shutting down, then wait for a timeout (usually from 30 seconds to a minute) and send `SIGKILL` if the process hasn't terminated yet.
+Further reading: [Sidekiq signals](https://github.com/mperham/sidekiq/wiki/Signals).
+## Enumerators
+In the early versions of Iteration, `build_enumerator` used to return ActiveRecord relations directly, and we would infer the Enumerator based on the type of object. We used to support ActiveRecord relations, arrays and CSVs. This made it hard to add support for anything else to enumerate, and it was easy for developers to make a mistake and return an array of ActiveRecord objects, which we would then treat as an array instead of an ActiveRecord relation.
+The current version of Iteration supports _any_ Enumerator. We expose helpers to build enumerators conveniently (`enumerator_builder.active_record_on_records`), but it's up to the developer to implement a custom Enumerator. Consider this example:
+```ruby
+class MyJob < ActiveJob::Base
+  include JobIteration::Iteration
+  def build_enumerator(cursor:)
+    Enumerator.new do |yielder|
+      yielder.yield Redis.lpop("mylist"), nil # or: Kafka.poll(timeout: 10.seconds)
+    end
+  end
+  def each_iteration(element_from_redis)
+    # ...
+  end
+end
+```
+Further reading: [ruby-doc](http://ruby-doc.org/core-2.5.1/Enumerator.html), [a great post about Enumerators](http://blog.arkency.com/2014/01/ruby-to-enum-for-enumerator/).
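
The numbered interrupt-and-resume walkthrough in the guide above can be simulated without a database or a queue. The following is a rough sketch under stated assumptions: `PRODUCTS`, `load_batch`, `run_job`, and the `should_exit` callable are all hypothetical stand-ins (for the `products` table, the batched `SELECT`, the job execution, and the flag set by the `SIGTERM` handler, respectively), not the gem's API.

```ruby
# In-memory stand-in for the `products` table; ids play the role of the primary key.
PRODUCTS = (1..5).map { |id| { id: id } }.freeze

# Mimics `WHERE id > cursor ORDER BY id LIMIT batch_size`.
def load_batch(cursor, batch_size)
  PRODUCTS.select { |row| cursor.nil? || row[:id] > cursor }.first(batch_size)
end

# Runs one execution of the job. `should_exit` is a hypothetical callable
# standing in for the flag that a SIGTERM handler would set.
# Returns [processed_ids, cursor]; a nil cursor means the job finished.
def run_job(cursor:, should_exit:, batch_size: 100)
  processed = []
  while (batch = load_batch(cursor, batch_size)).any?
    batch.each do |row|
      processed << row[:id] # the work done per iteration
      cursor = row[:id]     # remember the last processed primary key
      # The exit flag is only checked between iterations, never mid-iteration.
      return [processed, cursor] if should_exit.call(processed)
    end
  end
  [processed, nil]
end

# First execution: "SIGTERM" arrives after two records.
first = run_job(cursor: nil, should_exit: ->(done) { done.size == 2 })
puts first.inspect  # => [[1, 2], 2]

# Second execution: resumes after primary key 2 and runs to completion.
second = run_job(cursor: first.last, should_exit: ->(_done) { false })
puts second.inspect # => [[3, 4, 5], nil]
```

The second call mirrors the `WHERE (products.id > 2)` query shown in the guide: only the cursor needs to survive between executions.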

data/job-iteration.gemspec ADDED

@@ -0,0 +1,30 @@
+# frozen_string_literal: true
+lib = File.expand_path("../lib", __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require "job-iteration/version"
+Gem::Specification.new do |spec|
+  spec.name          = "job-iteration"
+  spec.version       = JobIteration::VERSION
+  spec.authors       = %w(Shopify)
+  spec.email         = ["ops-accounts+shipit@shopify.com"]
+  spec.summary       = 'Makes your background jobs interruptible and resumable.'
+  spec.description   = spec.summary
+  spec.homepage      = "https://github.com/shopify/job-iteration"
+  spec.license       = "MIT"
+  spec.files         = `git ls-files -z`.split("\x0").reject do |f|
+    f.match(%r{^(test|spec|features)/})
+  end
+  spec.bindir        = "exe"
+  spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+  spec.require_paths = %w(lib)
+  spec.add_dependency "activejob", "~> 5.2"
+  spec.add_development_dependency "bundler", "~> 1.16"
+  spec.add_development_dependency "rake", "~> 10.0"
+  spec.add_development_dependency "minitest", "~> 5.0"
+end

data/lib/job-iteration.rb ADDED

@@ -0,0 +1,33 @@
+# frozen_string_literal: true
+require "job-iteration/version"
+require "job-iteration/enumerator_builder"
+require "job-iteration/iteration"
+module JobIteration
+  INTEGRATIONS = [:resque, :sidekiq]
+  extend self
+  attr_accessor :max_job_runtime, :interruption_adapter
+  module AlwaysRunningInterruptionAdapter
+    extend self
+    def shutdown?
+      false
+    end
+  end
+  self.interruption_adapter = AlwaysRunningInterruptionAdapter
+  def load_integrations
+    INTEGRATIONS.each do |integration|
+      begin
+        require "job-iteration/integrations/#{integration}"
+      rescue LoadError
+      end
+    end
+  end
+end
+JobIteration.load_integrations unless ENV['ITERATION_DISABLE_AUTOCONFIGURE']
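
Judging from `AlwaysRunningInterruptionAdapter` above, an interruption adapter appears to be any object that responds to `shutdown?`. The sketch below shows what a custom adapter might look like. It is an assumption-laden illustration: `JobIterationSketch` is a stand-in module so the example runs without the gem installed, and `FlagInterruptionAdapter` with its `request_shutdown!` and `reset!` methods is hypothetical.

```ruby
# Stand-in for the gem's top-level module (assumption: only the accessor matters here).
module JobIterationSketch
  class << self
    attr_accessor :interruption_adapter
  end
end

# Hypothetical adapter: reports shutdown once `request_shutdown!` has been called,
# for example from a signal handler.
module FlagInterruptionAdapter
  @shutdown = false

  class << self
    def request_shutdown!
      @shutdown = true
    end

    def reset!
      @shutdown = false
    end

    def shutdown?
      @shutdown
    end
  end
end

JobIterationSketch.interruption_adapter = FlagInterruptionAdapter
puts JobIterationSketch.interruption_adapter.shutdown? # => false
FlagInterruptionAdapter.request_shutdown!
puts JobIterationSketch.interruption_adapter.shutdown? # => true
```

With the real gem, the equivalent assignment would presumably be `JobIteration.interruption_adapter = ...`, using the accessor declared in the file above.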

data/lib/job-iteration/active_record_cursor.rb ADDED

@@ -0,0 +1,93 @@
+# frozen_string_literal: true
+module JobIteration
+  class ActiveRecordCursor
+    include Comparable
+    attr_reader :position
+    attr_accessor :reached_end
+    class ConditionNotSupportedError < ArgumentError
+      def initialize
+        super(
+          "The relation cannot use ORDER BY or LIMIT due to the way how iteration with a cursor is designed. " \
+          "You can use other ways to limit the number of rows, e.g. a WHERE condition on the primary key column."
+        )
+      end
+    end
+    def initialize(relation, columns = nil, position = nil)
+      columns ||= "#{relation.table_name}.#{relation.primary_key}"
+      @columns = Array.wrap(columns)
+      self.position = Array.wrap(position)
+      raise ArgumentError, "Must specify at least one column" if columns.empty?
+      if relation.joins_values.present? && !@columns.all? { |column| column.to_s.include?('.') }
+        raise ArgumentError, "You need to specify fully-qualified columns if you join a table"
+      end
+      if relation.arel.orders.present? || relation.arel.taken.present?
+        raise ConditionNotSupportedError
+      end
+      @base_relation = relation.reorder(@columns.join(','))
+      @reached_end = false
+    end
+    def <=>(other)
+      if reached_end != other.reached_end
+        reached_end ? 1 : -1
+      else
+        position <=> other.position
+      end
+    end
+    def position=(position)
+      raise "Cursor position cannot contain nil values" if position.any?(&:nil?)
+      @position = position
+    end
+    def update_from_record(record)
+      self.position = @columns.map do |column|
+        method = column.to_s.split('.').last
+        record.send(method.to_sym)
+      end
+    end
+    def next_batch(batch_size)
+      return nil if @reached_end
+      relation = @base_relation.limit(batch_size)
+      if (conditions = self.conditions).any?
+        relation = relation.where(*conditions)
+      end
+      records = relation.to_a
+      update_from_record(records.last) unless records.empty?
+      @reached_end = records.size < batch_size
+      records.empty? ? nil : records
+    end
+    protected
+    def conditions
+      i = @position.size - 1
+      column = @columns[i]
+      conditions = if @columns.size == @position.size
+        "#{column} > ?"
+      else
+        "#{column} >= ?"
+      end
+      while i > 0
+        i -= 1
+        column = @columns[i]
+        conditions = "#{column} > ? OR (#{column} = ? AND (#{conditions}))"
+      end
+      ret = @position.reduce([conditions]) { |params, value| params << value << value }
+      ret.pop
+      ret
+    end
+  end
+end
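
To see what `ActiveRecordCursor#conditions` produces, the logic can be lifted into a stand-alone function. This is a sketch that mirrors the method above rather than calling it; `cursor_conditions` is a hypothetical name, and the column names and values are made up for illustration.

```ruby
# Stand-alone copy of the logic in ActiveRecordCursor#conditions.
# Returns [sql_fragment, *bind_values], or [] when there is no position yet.
def cursor_conditions(columns, position)
  return [] if position.empty?

  i = position.size - 1
  column = columns[i]
  # Strictly greater when every column has a value; otherwise inclusive.
  sql = columns.size == position.size ? "#{column} > ?" : "#{column} >= ?"

  # Wrap outwards: earlier columns take precedence in the ordering.
  while i > 0
    i -= 1
    column = columns[i]
    sql = "#{column} > ? OR (#{column} = ? AND (#{sql}))"
  end

  # Every value is bound twice except the innermost one, which is compared once.
  binds = position.flat_map { |value| [value, value] }
  binds.pop
  [sql, *binds]
end

p cursor_conditions(["products.id"], [2])
# => ["products.id > ?", 2]

p cursor_conditions(["created_at", "id"], ["2018-01-01 00:00:00", 5])
# => ["created_at > ? OR (created_at = ? AND (id > ?))", "2018-01-01 00:00:00", "2018-01-01 00:00:00", 5]
```

The single-column case corresponds to the `WHERE (products.id > 2)` query in the guide. The two-column case is the standard keyset pagination predicate for a compound ordering.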

data/lib/job-iteration/active_record_enumerator.rb ADDED

@@ -0,0 +1,51 @@
+# frozen_string_literal: true
+require_relative "./active_record_cursor"
+module JobIteration
+  class ActiveRecordEnumerator
+    def initialize(relation, columns: nil, batch_size: 100, cursor: nil)
+      @relation = relation
+      @batch_size = batch_size
+      @columns = Array(columns || "#{relation.table_name}.#{relation.primary_key}")
+      @cursor = cursor
+    end
+    def records
+      Enumerator.new(method(:size)) do |yielder|
+        batches.each do |batch, _|
+          batch.each do |record|
+            yielder.yield(record, cursor_value(record))
+          end
+        end
+      end
+    end
+    def batches
+      cursor = finder_cursor
+      Enumerator.new(method(:size)) do |yielder|
+        while records = cursor.next_batch(@batch_size)
+          yielder.yield(records, cursor_value(records.last)) if records.any?
+        end
+      end
+    end
+    def size
+      @relation.count
+    end
+    private
+    def cursor_value(record)
+      positions = @columns.map do |column|
+        method = column.to_s.split('.').last
+        attribute = record.read_attribute(method.to_sym)
+        attribute.is_a?(Time) ? attribute.to_s(:db) : attribute
+      end
+      return positions.first if positions.size == 1
+      positions
+    end
+    def finder_cursor
+      JobIteration::ActiveRecordCursor.new(@relation, @columns, @cursor)
+    end
+  end
+end

data/lib/job-iteration/csv_enumerator.rb ADDED

@@ -0,0 +1,42 @@
+# frozen_string_literal: true
+module JobIteration
+  class CsvEnumerator
+    def initialize(csv)
+      unless csv.instance_of?(CSV)
+        raise ArgumentError, "CsvEnumerator.new takes CSV object"
+      end
+      @csv = csv
+    end
+    def rows(cursor:)
+      @csv.lazy
+        .each_with_index
+        .drop(cursor.to_i)
+        .to_enum { count_rows_in_file }
+    end
+    def batches(batch_size:, cursor:)
+      @csv.lazy
+        .each_slice(batch_size)
+        .each_with_index
+        .drop(cursor.to_i)
+        .to_enum { (count_rows_in_file.to_f / batch_size).ceil }
+    end
+    private
+    def count_rows_in_file
+      begin
+        filepath = @csv.path
+      rescue NoMethodError
+        return
+      end
+      count = `wc -l < #{filepath}`.strip.to_i
+      count -= 1 if @csv.headers
+      count
+    end
+  end
+end
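
The row enumeration in `CsvEnumerator#rows` can be tried on an in-memory CSV. The sketch below re-creates the same chain (`lazy.each_with_index.drop(cursor.to_i)`) in a hypothetical helper `csv_rows`, leaving out the size block; `SAMPLE_CSV` and `names_with_index` are illustrative names only.

```ruby
require "csv"
require "stringio"

SAMPLE_CSV = "name,price\napple,1\npear,2\nplum,3\n"

# Stand-alone re-creation of the chain used by CsvEnumerator#rows (size block omitted).
def csv_rows(csv, cursor:)
  csv.lazy.each_with_index.drop(cursor.to_i)
end

# Helper for the demo: reduce each [row, index] pair to [name, index].
def names_with_index(cursor)
  csv = CSV.new(StringIO.new(SAMPLE_CSV), headers: true)
  csv_rows(csv, cursor: cursor).map { |row, index| [row["name"], index] }.to_a
end

p names_with_index(nil) # => [["apple", 0], ["pear", 1], ["plum", 2]]
p names_with_index(1)   # => [["pear", 1], ["plum", 2]]
```

Note that, as written in this version, a cursor of `1` yields the row at index 1 again, because `drop(cursor.to_i)` skips exactly `cursor` rows. This differs from `build_array_enumerator`, which drops `cursor + 1` elements.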

data/lib/job-iteration/enumerator_builder.rb ADDED

@@ -0,0 +1,146 @@
+# frozen_string_literal: true
+require_relative "./active_record_enumerator"
+require_relative "./csv_enumerator"
+module JobIteration
+  class EnumeratorBuilder
+    extend Forwardable
+    # These wrappers ensure we have a custom type that we can assert on in
+    # Iteration. It's useful that the `wrapper` passed to EnumeratorBuilder in
+    # `enumerator_builder` is _always_ the type that is returned from
+    # `build_enumerator`. This prevents people from implementing custom
+    # Enumerators without wrapping them in
+    # `enumerator_builder.wrap(custom_enum)`. We don't do this yet for backwards
+    # compatibility with raw calls to EnumeratorBuilder. Think of these wrappers
+    # the way you should a middleware.
+    class Wrapper < Enumerator
+      def self.wrap(_builder, enum)
+        new(-> { enum.size }) do |yielder|
+          enum.each do |*val|
+            yielder.yield(*val)
+          end
+        end
+      end
+    end
+    def initialize(job, wrapper: Wrapper)
+      @job = job
+      @wrapper = wrapper
+    end
+    def_delegator :@wrapper, :wrap
+    # Builds an Enumerator object that iterates once.
+    def build_once_enumerator(cursor:)
+      wrap(self, build_times_enumerator(1, cursor: cursor))
+    end
+    # Builds an Enumerator object that iterates N times and yields numbers starting from zero.
+    def build_times_enumerator(number, cursor:)
+      raise ArgumentError, "First argument must be an Integer" unless number.is_a?(Integer)
+      wrap(self, build_array_enumerator(number.times.to_a, cursor: cursor))
+    end
+    # Builds Enumerator object from a given array, using +cursor+ as an offset.
+    def build_array_enumerator(enumerable, cursor:)
+      unless enumerable.is_a?(Array)
+        raise ArgumentError, "enumerable must be an Array"
+      end
+      if enumerable.any? { |i| defined?(ActiveRecord) && i.is_a?(ActiveRecord::Base) }
+        raise ArgumentError, "array cannot contain ActiveRecord objects"
+      end
+      drop =
+        if cursor.nil?
+          0
+        else
+          cursor + 1
+        end
+      wrap(self, enumerable.each_with_index.drop(drop).to_enum { enumerable.size })
+    end
+    # Builds Enumerator from a lock queue instance that belongs to a job.
+    # The helper is only to be used from jobs that use LockQueue module.
+    def build_lock_queue_enumerator(lock_queue, at_most_once:)
+      unless lock_queue.is_a?(BackgroundQueue::LockQueue::RedisQueue) ||
+          lock_queue.is_a?(BackgroundQueue::LockQueue::RolloutRedisQueue)
+        raise ArgumentError, "an argument to #build_lock_queue_enumerator must be a LockQueue"
+      end
+      wrap(self, BackgroundQueue::LockQueueEnumerator.new(lock_queue, at_most_once: at_most_once).to_enum)
+    end
+    # Builds Enumerator from Active Record Relation. Each Enumerator tick moves the cursor one row forward.
+    #
+    # +columns:+ argument is used to build the actual query for iteration. +columns+: defaults to primary key:
+    #
+    #   1) SELECT * FROM users ORDER BY id LIMIT 100
+    #
+    # When iteration is resumed, +cursor:+ and +columns:+ values will be used to continue from the point
+    # where iteration stopped:
+    #
+    #   2) SELECT * FROM users WHERE id > $CURSOR ORDER BY id LIMIT 100
+    #
+    # +columns:+ can also take more than one column. In that case, +cursor+ will contain serialized values
+    # of all columns at the point where iteration stopped.
+    #
+    # Consider this example with +columns: [:created_at, :id]+. Here's the query used on the first iteration:
+    #
+    #   1) SELECT * FROM `products` ORDER BY created_at, id LIMIT 100
+    #
+    # And the query on the next iteration:
+    #
+    #   2) SELECT * FROM `products`
+    #        WHERE (created_at > '$LAST_CREATED_AT_CURSOR'
+    #          OR (created_at = '$LAST_CREATED_AT_CURSOR' AND (id > '$LAST_ID_CURSOR')))
+    #        ORDER BY created_at, id LIMIT 100
+    def build_active_record_enumerator_on_records(scope, cursor:, columns: nil, batch_size: nil)
+      enum = build_active_record_enumerator(
+        scope,
+        cursor: cursor,
+        columns: columns,
+        batch_size: batch_size,
+      ).records
+      wrap(self, enum)
+    end
+    # Builds Enumerator from Active Record Relation and enumerates on batches.
+    # Each Enumerator tick moves the cursor +batch_size+ rows forward.
+    #
+    # +batch_size:+ sets how many records will be fetched in one batch. Defaults to 100.
+    #
+    # For the rest of arguments, see documentation for #build_active_record_enumerator_on_records
+    def build_active_record_enumerator_on_batches(scope, cursor:, columns: nil, batch_size: nil)
+      enum = build_active_record_enumerator(
+        scope,
+        cursor: cursor,
+        columns: columns,
+        batch_size: batch_size,
+      ).batches
+      wrap(self, enum)
+    end
+    alias_method :once, :build_once_enumerator
+    alias_method :times, :build_times_enumerator
+    alias_method :array, :build_array_enumerator
+    alias_method :active_record_on_records, :build_active_record_enumerator_on_records
+    alias_method :active_record_on_batches, :build_active_record_enumerator_on_batches
+    private
+    def build_active_record_enumerator(scope, cursor:, columns:, batch_size:)
+      unless scope.is_a?(ActiveRecord::Relation)
+        raise ArgumentError, "scope must be an ActiveRecord::Relation"
+      end
+      JobIteration::ActiveRecordEnumerator.new(
+        scope,
+        **{
+          columns: columns,
+          batch_size: batch_size,
+          cursor: cursor,
+        }.compact
+      )
+    end
+  end
+end
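
The cursor handling in `build_array_enumerator` is easy to check in isolation. The following sketch copies its core expression into a hypothetical helper, `array_enumerator`, without the type checks or the `Wrapper`; it only uses core Ruby.

```ruby
# Stand-alone copy of the core of build_array_enumerator: the cursor is the index
# of the last processed element, so resuming skips `cursor + 1` elements.
def array_enumerator(array, cursor:)
  drop = cursor.nil? ? 0 : cursor + 1
  array.each_with_index.drop(drop).to_enum { array.size }
end

letters = %w(a b c d)

p array_enumerator(letters, cursor: nil).to_a
# => [["a", 0], ["b", 1], ["c", 2], ["d", 3]]

p array_enumerator(letters, cursor: 1).to_a
# => [["c", 2], ["d", 3]]

p array_enumerator(letters, cursor: 1).size
# => 4
```

Each element is yielded together with its original index, which is what lets the index serve as the cursor on the next run. The size block reports the length of the whole array, not the number of remaining elements.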