RubyGems - resque_stuck_queue_revised - Versions diffs - 0.5.1 - Mend

resque_stuck_queue_revised 0.5.1

Files changed (24)

checksums.yaml +7 -0
data/.gitignore +2 -0
data/Gemfile +19 -0
data/LICENSE.txt +22 -0
data/README.md +199 -0
data/Rakefile +26 -0
data/THOUGHTS +9 -0
data/lib/resque/stuck_queue.rb +1 -0
data/lib/resque_stuck_queue.rb +320 -0
data/lib/resque_stuck_queue/config.rb +81 -0
data/lib/resque_stuck_queue/heartbeat_job.rb +19 -0
data/lib/resque_stuck_queue/version.rb +5 -0
data/resque_stuck_queue.gemspec +27 -0
data/test/resque/set_redis_key.rb +9 -0
data/test/test_collision.rb +47 -0
data/test/test_config.rb +67 -0
data/test/test_helper.rb +57 -0
data/test/test_integration.rb +172 -0
data/test/test_lagtime.rb +34 -0
data/test/test_named_queues.rb +96 -0
data/test/test_resque_stuck_queue.rb +58 -0
data/test/test_set_custom_refresh_job.rb +41 -0
data/test/test_ver_2.rb +45 -0
metadata +132 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: d6c0c95202091d497b9fc9017530947331ac679e
+  data.tar.gz: c9f74138b3c84fc652425e18de9e8abf4f36dfad
+SHA512:
+  metadata.gz: 9ebb2a19d7f350d20ccb36ec94bdca73afc502a3ea381b967cec98c8409f26f7b22f82d76bcc5e474de514dc65cd0f9f8f20819cb9b740ed0419c2014e212461
+  data.tar.gz: c2f46e30c50bf844346aace386249a8d3fcdafd3d4ceebcfb8fa2e6321af58bfd092c89409ab0181a16882b30c15694d032f61ba652e0a9a3ce59484329f4600

data/.gitignore ADDED

@@ -0,0 +1,2 @@
+*gem
+Gemfile.lock

data/Gemfile ADDED

@@ -0,0 +1,19 @@
+source 'https://rubygems.org'
+gemspec
+if ENV['RESQUE_2']
+# resque 2
+  gem 'resque', :git => "https://github.com/engineyard/resque.git"
+else
+  gem 'resque'
+end
+gem 'redis-mutex'
+# TEST
+gem 'minitest'
+gem 'mocha'
+gem 'pry'
+gem 'm'
+gem 'resque-scheduler'

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2014 Shai Rosenfeld
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,199 @@
+# Resque stuck queue
+## Why?
+This addresses an ops problem. In some cases resque processes stop processing jobs for unknown reasons. In others, resque isn't running at all because of deploy problems or architecture/human error. And sometimes resque is so backed up that jobs wait a long time before they run. This gem gives you a little insight into those issues.
+## What is it?
+If resque doesn't run jobs in specific queues (defaults to `@queue = :app`) within a certain timeframe, it will trigger a pre-defined handler of your choice. You can use this to send an email, notify PagerDuty, add more resque workers, restart resque, send yourself a text, or whatever suits you.
+It will also fire a proc to notify you when it's recovered.
+## How it works
+It's a heartbeat mechanism:
+![meme](http://cdn.memegenerator.net/instances/500x/43575729.jpg)
+Ok, seriously:
+When you call `start` you are essentially starting two threads that run continuously until `stop` is called or the process shuts down.
+One thread is responsible for pushing a 'heartbeat' job to resque, which refreshes a specific key in redis every time that job is processed.
+The other thread is a continuous loop that checks redis (bypassing resque) for that key and reads the last time the heartbeat job successfully updated it.
+StuckQueue will trigger a pre-defined proc if the queue is lagging according to the times you've configured (see below).
+After firing the proc, it will continue to monitor the queue, but won't call the proc again until the queue is found to be good again (it will then call a different "recovered" handler).
+Once the recovered proc has been called, the trigger is re-armed and will fire again the next time lag is detected.
+You can also configure it to keep triggering periodically until the queue has recovered (see `:warn_interval` below).
+## Usage
+Run this as a daemon somewhere alongside the app/in your setup. You'll need to configure it to your needs first:
+Put something like this in `config/initializers/resque-stuck-queue.rb`:
+<pre>
+require 'resque_stuck_queue' # or require 'resque/stuck_queue'
+require 'logger'
+# change to decent values that make sense for you
+Resque::StuckQueue.config[:heartbeat_interval]       = 10.seconds
+Resque::StuckQueue.config[:watcher_interval]         = 1.seconds
+Resque::StuckQueue.config[:trigger_timeout]          = 30.seconds # acceptable lagtime
+Resque::StuckQueue.config[:warn_interval]            = 5.minutes  # keep on triggering periodically, default is only one trigger
+# which queues to monitor
+Resque::StuckQueue.config[:queues]                   = [:app, :custom_queue]
+# handler for when a resque queue is being problematic
+Resque::StuckQueue.config[:triggered_handler]         = proc { |bad_queue, lagtime|
+  msg = "[BAD] APPNAME #{Rails.env}'s Resque #{bad_queue} queue lagging job execution by #{lagtime} seconds."
+  send_email(msg)
+}
+# handler for when a resque queue recovers
+Resque::StuckQueue.config[:recovered_handler]         = proc { |good_queue, lagtime|
+  msg = "[GOOD] APPNAME #{Rails.env}'s Resque #{good_queue} queue lagging job execution by #{lagtime} seconds."
+  send_email(msg)
+}
+# create a sync/unbuffered log
+logpath = Rails.root.join('log', 'resque_stuck_queue.log')
+logfile = File.open(logpath, "a")
+logfile.sync = true
+logger = Logger.new(logfile)
+logger.formatter = Logger::Formatter.new
+Resque::StuckQueue.config[:logger]                    = logger
+# your own redis
+Resque::StuckQueue.config[:redis]                     = YOUR_REDIS
+</pre>
+Then create a task to run it as a daemon (similar to how the resque rake job is implemented):
+<pre>
+# put this in lib/tasks/resque_stuck_queue.rake
+namespace :resque do
+  desc "Start a Resque-stuck daemon"
+  # :environment dep task should load the config via the initializer
+  task :stuck_queue => :environment do
+    Resque::StuckQueue.start
+  end
+end
+</pre>
+then run it via god, monit or whatever:
+<pre>
+$ bundle exec rake --trace resque:stuck_queue # outdated god config - https://gist.github.com/shaiguitar/298935953d91faa6bd4e
+</pre>
+## Configuration Options
+Configuration settings are below. At a minimum you'll most likely want to tune the `:triggered_handler`, `:heartbeat_interval` and `:trigger_timeout` settings.
+<pre>
+triggered_handler:
+	set to what gets triggered when resque-stuck-queue detects that the latest heartbeat is older than the trigger_timeout setting.
+	Example:
+	Resque::StuckQueue.config[:triggered_handler] = proc { |queue_name, lagtime| send_email("queue #{queue_name} isn't working, aaah the daemons") }
+recovered_handler:
+	set to what gets triggered when resque-stuck-queue has triggered a problem, but then detects the queue is functioning well again (it won't trigger again until it has recovered).
+	Example:
+	Resque::StuckQueue.config[:recovered_handler] = proc { |queue_name, lagtime| send_email("phew, queue #{queue_name} is ok") }
+heartbeat_interval:
+	set to how often to push the 'heartbeat' job which will refresh the latest working time.
+	Example:
+	Resque::StuckQueue.config[:heartbeat_interval] = 5.minutes
+watcher_interval:
+	set to how often to check when the heartbeat was last refreshed.
+	Example:
+	Resque::StuckQueue.config[:watcher_interval] = 1.minute
+trigger_timeout:
+	set to how much of a resque work lag you are willing to accept before being notified. note: take the :watcher_interval setting into account when setting this timeout.
+	Example:
+	Resque::StuckQueue.config[:trigger_timeout] = 9.minutes
+warn_interval:
+	optional: if set, it will continuously trigger/warn at this interval after the first trigger, i.e. for as long as lagtime stays above trigger_timeout and recovery hasn't occurred yet.
+redis:
+	set the Redis StuckQueue will use. Either a Redis or Redis::Namespace instance.
+heartbeat_key:
+	optional, name of keys to keep track of the last good resque heartbeat time
+triggered_key:
+	optional, name of keys to keep track of the last trigger time
+logger:
+	optional, pass a Logger. By default a ruby Logger will be instantiated. Anything you pass needs to respond to that interface.
+queues:
+	optional, monitor specific queues you want to send a heartbeat/monitor to. default is [:app]
+abort_on_exception:
+	optional, if you want the resque-stuck-queue threads to explicitly raise, default is true
+heartbeat_job:
+	optional, your own custom refreshing job. if you are using something other than resque
+enable_signals:
+	optional, allow resque::stuck's signal_handlers which do mostly nothing at this point. possible future plan: log info, reopen log file, etc.
+</pre>
+To start it:
+<pre>
+Resque::StuckQueue.start                # blocking
+Resque::StuckQueue.start_in_background  # sugar for Thread.new { Resque::StuckQueue.start }
+</pre>
+Stopping it consists of the same idea:
+<pre>
+Resque::StuckQueue.stop                 # this will block until the threads end their current iteration
+Resque::StuckQueue.force_stop!          # force kill those threads and let's move on
+</pre>
+## Sidekiq/Other redis-based job queues
+If you use another redis-based job queue, you can still use this lib by setting your own custom refresh job (that is, the job that refreshes your queue-specific heartbeat_key). The one thing you need to ensure is that, however you enqueue your custom job, it sets the heartbeat_key to Time.now:
+<pre>
+class CustomJob
+  include Sidekiq::Worker
+  def perform
+    # ensure you're setting the key in the redis the job queue is using
+    # (queue_name is a placeholder here: supply the name of the queue being monitored)
+    $redis.set(Resque::StuckQueue.heartbeat_key_for(queue_name), Time.now.to_i)
+  end
+end
+Resque::StuckQueue.config[:heartbeat_job] = proc {
+  # or however else you enqueue your custom job, Sidekiq::Client.enqueue(CustomJob), whatever, etc.
+  CustomJob.perform_async
+}
+</pre>
+## Tests
+Run the tests:
+`bundle; bundle exec rake`
+`RESQUE_2=1 bundle exec rake # for resque 2 compat`
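
The README's note that `:trigger_timeout` should be chosen with `:watcher_interval` in mind can be made concrete with a little arithmetic. The sketch below is not part of the gem: `StuckQueueTiming` and its method names are hypothetical. It assumes the behaviour the README describes (a heartbeat refreshed roughly once per `heartbeat_interval`, a watcher that compares the lag to `trigger_timeout` once per `watcher_interval`) and ignores job execution time.

```ruby
# Hypothetical helper, not part of resque_stuck_queue.
module StuckQueueTiming
  module_function

  # On a healthy queue the heartbeat key is refreshed about once per
  # heartbeat_interval, so the observed lag can grow to roughly that value
  # just before the next heartbeat job runs. A trigger_timeout at or below
  # it could fire on a healthy queue.
  def false_alarm_risk?(heartbeat_interval:, trigger_timeout:)
    trigger_timeout <= heartbeat_interval
  end

  # Longest time between the last processed heartbeat and the watcher
  # noticing the stall: the lag has to reach trigger_timeout, and the watcher
  # may have checked just before that moment, so add one watcher_interval.
  def worst_case_detection(trigger_timeout:, watcher_interval:)
    trigger_timeout + watcher_interval
  end
end

# Values from the README's initializer example, in plain seconds.
puts StuckQueueTiming.false_alarm_risk?(heartbeat_interval: 10, trigger_timeout: 30)
puts StuckQueueTiming.worst_case_detection(trigger_timeout: 30, watcher_interval: 1)
```

With the initializer's values the timeout is comfortably above the heartbeat interval, and a stall would be noticed at most about 31 seconds after the last processed heartbeat.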

data/Rakefile ADDED

@@ -0,0 +1,26 @@
+require 'rake/testtask'
+task :default => :test
+#task :test do
+  ## forking and what not. keep contained, each in its own process?
+  #Dir['./test/test_*.rb'].each do |file|
+    #system("ruby -I. -I lib/ #{file}")
+  #end
+#end
+Rake::TestTask.new do |t|
+  t.pattern = "test/test_*rb"
+end
+require 'resque/tasks'
+task :'resque:setup' do
+  # https://github.com/resque/resque/issues/773
+  # have the jobs loaded in memory
+  Dir["./test/resque/*.rb"].each {|file| require file}
+  # load project
+  Dir["./lib/resque_stuck_queue.rb"].each {|file| require file}
+end
+require 'resque_scheduler/tasks'
+task "resque:scheduler_setup"

data/THOUGHTS ADDED

@@ -0,0 +1,9 @@
+## TODOS
+rm redis locking (since it works by keys now, no need for it, recover/trigger ping pong).
+rm require resque?
+refactor tests to have an around(:suite) to run with resque beforehand (no startup time) and just run test_integration.rb
+  (& compact dup tests etc)
+don't continue to send heartbeat job if its alerting/stuck, it can just back the queue up (even if just marginally) more.

data/lib/resque/stuck_queue.rb ADDED

@@ -0,0 +1 @@
+require File.join(File.expand_path(File.dirname(__FILE__)), "..","resque_stuck_queue")

data/lib/resque_stuck_queue.rb ADDED

@@ -0,0 +1,320 @@
+require "resque_stuck_queue/version"
+require "resque_stuck_queue/config"
+require "resque_stuck_queue/heartbeat_job"
+require 'redis-namespace'
+# TODO move this require into a configurable?
+require 'resque'
+# TODO rm redis-mutex dep and just do the setnx locking here
+require 'redis-mutex'
+module Resque
+  module StuckQueue
+    class << self
+      attr_accessor :config
+      def config
+        @config ||= Config.new
+      end
+      def logger
+        @logger ||= (config[:logger] || StuckQueue::LOGGER)
+      end
+      def redis
+        @redis ||= config[:redis]
+      end
+      def heartbeat_key_for(queue)
+        if config[:heartbeat_key]
+          "#{queue}:#{config[:heartbeat_key]}"
+        else
+          "#{queue}:#{HEARTBEAT_KEY}"
+        end
+      end
+      def triggered_key_for(queue)
+        if config[:triggered_key]
+          "#{queue}:#{self.config[:triggered_key]}"
+        else
+          "#{queue}:#{TRIGGERED_KEY}"
+        end
+      end
+      def heartbeat_keys
+        queues.map{|q| heartbeat_key_for(q) }
+      end
+      def queues
+        @queues ||= (config[:queues] || [:app])
+      end
+      def abort_on_exception
+        if !config[:abort_on_exception].nil?
+          config[:abort_on_exception] # allow overriding w false
+        else
+          true # default
+        end
+      end
+      def start_in_background
+        Thread.new do
+          Thread.current.abort_on_exception = abort_on_exception
+          self.start
+        end
+      end
+      # call this after setting config. once started you shouldn't be allowed to modify it
+      def start
+        @running = true
+        @stopped = false
+        @threads = []
+        config.validate_required_keys!
+        config.freeze
+        log_starting_info
+        reset_keys
+        RedisClassy.redis = redis if RedisClassy.redis.nil?
+        pretty_process_name
+        setup_heartbeat_thread
+        setup_watcher_thread
+        setup_warn_thread
+        # fo-eva.
+        @threads.map(&:join)
+        logger.info("threads stopped")
+        @stopped = true
+      end
+      def stop
+        reset!
+        # wait for clean thread shutdown
+        while @stopped == false
+          sleep 1
+        end
+        logger.info("Stopped")
+      end
+      def force_stop!
+        logger.info("Force stopping")
+        @threads.map(&:kill)
+        reset!
+      end
+      def reset!
+        # clean state so we can stop and start in the same process.
+        @config = Config.new # clear, unfreeze
+        @queues = nil
+        @running = false
+        @logger = nil
+      end
+      def reset_keys
+        queues.each do |qn|
+          redis.del(heartbeat_key_for(qn))
+          redis.del(triggered_key_for(qn))
+        end
+      end
+      def stopped?
+        @stopped
+      end
+      def trigger_handler(queue_name, type)
+        raise 'Must trigger either the recovered or triggered handler!' unless (type == :recovered || type == :triggered)
+        handler_name = :"#{type}_handler"
+        logger.info("Triggering #{type} handler for #{queue_name} at #{Time.now}.")
+        (config[handler_name] || const_get(handler_name.upcase)).call(queue_name, lag_time(queue_name))
+        manual_refresh(queue_name, type)
+      rescue => e
+        logger.info("handler #{type} for #{queue_name} crashed: #{e.inspect}")
+        logger.info("\n#{e.backtrace.join("\n")}")
+        raise e
+      end
+      def log_starting_info
+        logger.info("Starting StuckQueue with config: #{self.config.inspect}")
+      end
+      def log_watcher_info(queue_name)
+        logger.info("Lag time for #{queue_name} is #{lag_time(queue_name).inspect} seconds.")
+        if triggered_ago = last_triggered(queue_name)
+          logger.info("Last triggered for #{queue_name} is #{triggered_ago.inspect} seconds.")
+        else
+          logger.info("No last trigger found for #{queue_name}.")
+        end
+      end
+      private
+      def log_starting_thread(type)
+        interval_keyname = "#{type}_interval".to_sym
+        logger.info("Starting #{type} thread with interval of #{config[interval_keyname]} seconds")
+      end
+      def read_from_redis(keyname)
+        redis.get(keyname)
+      end
+      def setup_watcher_thread
+        @threads << Thread.new do
+          Thread.current.abort_on_exception = abort_on_exception
+          log_starting_thread(:watcher)
+          while @running
+            mutex = Redis::Mutex.new('resque_stuck_queue_lock', block: 0)
+            if mutex.lock
+              begin
+                queues.each do |queue_name|
+                  log_watcher_info(queue_name)
+                  if should_trigger?(queue_name)
+                    trigger_handler(queue_name, :triggered)
+                  elsif should_recover?(queue_name)
+                    trigger_handler(queue_name, :recovered)
+                  end
+                end
+              ensure
+                mutex.unlock
+              end
+            end
+            wait_for_it(:watcher_interval)
+          end
+        end
+      end
+      def setup_heartbeat_thread
+        @threads << Thread.new do
+          Thread.current.abort_on_exception = abort_on_exception
+          log_starting_thread(:heartbeat)
+          while @running
+            # we want to go through resque jobs, because that's what we're trying to test here:
+            # ensure that jobs get executed and the time is updated!
+            wait_for_it(:heartbeat_interval)
+            logger.info("Sending heartbeat jobs")
+            enqueue_jobs
+          end
+        end
+      end
+      def setup_warn_thread
+        if config[:warn_interval]
+          @threads << Thread.new do
+            Thread.current.abort_on_exception = abort_on_exception
+            log_starting_thread(:warn)
+            while @running
+              queues.each do |qn|
+                trigger_handler(qn, :triggered) if should_trigger?(qn, true)
+              end
+              wait_for_it(:warn_interval)
+            end
+          end
+        end
+      end
+      def enqueue_jobs
+        if config[:heartbeat_job]
+          # FIXME config[:heartbeat_job] with multiple queues is bad semantics
+          config[:heartbeat_job].call
+        else
+          queues.each do |queue_name|
+            # Redis::Namespace.new support as well as Redis.new
+            namespace = redis.respond_to?(:namespace) ? redis.namespace : nil
+            Resque.enqueue_to(queue_name, HeartbeatJob, heartbeat_key_for(queue_name), redis.client.host, redis.client.port, namespace, Time.now.to_i )
+          end
+        end
+      end
+      def last_successful_heartbeat(queue_name)
+        time_set = read_from_redis(heartbeat_key_for(queue_name))
+        if time_set
+          time_set
+        else
+          logger.info("manually refreshing #{queue_name} for :first_time")
+          manual_refresh(queue_name, :first_time)
+        end.to_i
+      end
+      def manual_refresh(queue_name, type)
+        if type == :triggered
+          time = Time.now.to_i
+          redis.set(triggered_key_for(queue_name), time)
+          time
+        elsif type == :recovered
+          redis.del(triggered_key_for(queue_name))
+          nil
+        elsif type == :first_time
+          time = Time.now.to_i
+          redis.set(heartbeat_key_for(queue_name), time)
+          time
+        end
+      end
+      def lag_time(queue_name)
+        Time.now.to_i - last_successful_heartbeat(queue_name)
+      end
+      def last_triggered(queue_name)
+        time_set = read_from_redis(triggered_key_for(queue_name))
+        if !time_set.nil?
+          Time.now.to_i - time_set.to_i
+        end
+      end
+      def should_recover?(queue_name)
+        last_triggered(queue_name) &&
+          lag_time(queue_name) < max_wait_time
+      end
+      def should_trigger?(queue_name, force_trigger = false)
+        if lag_time(queue_name) >= max_wait_time
+          last_trigger = last_triggered(queue_name)
+          if force_trigger
+            return true
+          end
+          if last_trigger.nil?
+            # if it hasn't been triggered before, do it
+            return true
+          end
+          # if it already triggered in the past don't trigger again.
+          # :recovered should clear out last_triggered so the cycle (trigger<->recover) continues
+          return false
+        end
+      end
+      def wait_for_it(type)
+        if type == :heartbeat_interval
+          sleep config[:heartbeat_interval] || HEARTBEAT_INTERVAL
+        elsif type == :watcher_interval
+          sleep config[:watcher_interval]   || WATCHER_INTERVAL
+        elsif type == :warn_interval
+          sleep config[:warn_interval]
+        else
+          raise 'Must sleep for :watcher_interval interval or :heartbeat_interval or :warn_interval interval!'
+        end
+      end
+      def max_wait_time
+        config[:trigger_timeout] || TRIGGER_TIMEOUT
+      end
+      def pretty_process_name
+        $0 = "rake --trace resque:stuck_queue #{redis.inspect} QUEUES=#{queues.join(",")}"
+      end
+    end
+  end
+end
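
The trigger/recover cycle implemented by `should_trigger?`, `should_recover?` and `manual_refresh` above can be exercised without Redis or Resque. The following is a minimal sketch, not the gem's code: `InMemoryMonitor` is a hypothetical class that swaps Redis for a `Hash`, uses made-up key suffixes (`heartbeat`, `triggered`) in place of the constants defined in `config.rb`, and takes the current time as an argument so the behaviour is deterministic.

```ruby
# Hypothetical stand-in for the watcher logic; a Hash replaces Redis and
# timestamps are passed in explicitly instead of calling Time.now.
class InMemoryMonitor
  def initialize(trigger_timeout:)
    @trigger_timeout = trigger_timeout
    @store = {} # plays the role of the heartbeat/triggered keys in Redis
  end

  # What the heartbeat job does when a worker actually runs it.
  def beat(queue, now)
    @store["#{queue}:heartbeat"] = now
  end

  # One watcher iteration for one queue.
  # Returns :triggered, :recovered, or nil when nothing changes.
  def check(queue, now)
    # No heartbeat recorded yet: seed the key, like the :first_time refresh.
    last_beat = (@store["#{queue}:heartbeat"] ||= now)
    lag = now - last_beat
    triggered_key = "#{queue}:triggered"

    if lag >= @trigger_timeout
      return nil if @store.key?(triggered_key) # already alerted, stay quiet
      @store[triggered_key] = now
      :triggered
    elsif @store.key?(triggered_key)
      @store.delete(triggered_key) # clearing the key re-arms the trigger
      :recovered
    end
  end
end

monitor = InMemoryMonitor.new(trigger_timeout: 30)
monitor.beat(:app, 0)
events = [10, 40, 50].map { |t| monitor.check(:app, t) }
monitor.beat(:app, 55)
events << monitor.check(:app, 60)
p events # => [nil, :triggered, nil, :recovered]
```

The second check past the timeout returns `nil` because the triggered key is still set. Only the recovery, which deletes that key, allows a later stall to trigger again.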