resque_stuck_queue_revised 0.5.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: d6c0c95202091d497b9fc9017530947331ac679e
+   data.tar.gz: c9f74138b3c84fc652425e18de9e8abf4f36dfad
+ SHA512:
+   metadata.gz: 9ebb2a19d7f350d20ccb36ec94bdca73afc502a3ea381b967cec98c8409f26f7b22f82d76bcc5e474de514dc65cd0f9f8f20819cb9b740ed0419c2014e212461
+   data.tar.gz: c2f46e30c50bf844346aace386249a8d3fcdafd3d4ceebcfb8fa2e6321af58bfd092c89409ab0181a16882b30c15694d032f61ba652e0a9a3ce59484329f4600
data/.gitignore ADDED
@@ -0,0 +1,2 @@
+ *gem
+ Gemfile.lock
data/Gemfile ADDED
@@ -0,0 +1,19 @@
+ source 'https://rubygems.org'
+
+ gemspec
+
+ if ENV['RESQUE_2']
+   # resque 2
+   gem 'resque', :git => "https://github.com/engineyard/resque.git"
+ else
+   gem 'resque'
+ end
+
+ gem 'redis-mutex'
+
+ # TEST
+ gem 'minitest'
+ gem 'mocha'
+ gem 'pry'
+ gem 'm'
+ gem 'resque-scheduler'
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
+ Copyright (c) 2014 Shai Rosenfeld
+
+ MIT License
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,199 @@
+ # Resque stuck queue
+
+ ## Why?
+
+ This exists to solve an ops problem. There have been cases where resque processes would stop processing jobs for unknown reasons. Other times resque wasn't running at all because of deploy problems, architecture issues, or human error. Or, on a different note, resque could be so backed up that it's too busy to process jobs in a timely manner. This gem gives you a little insight into those situations.
+
+ ## What is it?
+
+ If resque doesn't run jobs in specific queues (defaults to `@queue = :app`) within a certain timeframe, it will trigger a pre-defined handler of your choice. You can use this to send an email, trigger PagerDuty, add more resque workers, restart resque, send you a txt...whatever suits you.
+
+ It will also fire a proc to notify you when it has recovered.
+
+ ## How it works
+
+ It's a heartbeat mechanism:
+
+ ![meme](http://cdn.memegenerator.net/instances/500x/43575729.jpg)
+
+ Ok, seriously:
+
+ When you call `start` you are essentially starting two threads that will run continuously until `stop` is called or until the process shuts down.
+
+ One thread is responsible for pushing a 'heartbeat' job to resque, which will refresh a specific key in redis every time that job is processed.
+
+ The other thread is a continuous loop that checks redis directly (bypassing resque) for that key, to see when the heartbeat job last successfully updated it.
+
+ StuckQueue will trigger a pre-defined proc if the queue is lagging according to the times you've configured (see below).
+
+ After firing the proc, it will continue to monitor the queue, but won't call the proc again until the queue is found to be good again (it will then call a different "recovered" handler).
+
+ Once the recovered proc has been called, it will complain again the next time lag is detected.
+
+ You can also configure it to keep triggering periodically unless of course it's recovered/good again (see the `:warn_interval` below).
+
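The trigger/recover cycle can be sketched in a few lines of plain Ruby. This is a toy illustration, not the gem's code: `store`, `heartbeat` and `stuck` are hypothetical names, a `Hash` stands in for redis, and no real resque job is enqueued.

```ruby
# Toy sketch of the heartbeat mechanism (hypothetical names; a Hash
# stands in for redis, and no real resque job is enqueued).
store = {}

# What the enqueued heartbeat job does once a worker picks it up:
# refresh the queue's key with the current time.
heartbeat = ->(queue) { store["#{queue}:heartbeat"] = Time.now.to_i }

# What the watcher thread checks: has the last heartbeat aged past
# the acceptable lag?
stuck = lambda do |queue, trigger_timeout|
  last_beat = store.fetch("#{queue}:heartbeat", 0)
  (Time.now.to_i - last_beat) >= trigger_timeout
end

heartbeat.call(:app)
stuck.call(:app, 30)   # => false, the heartbeat just landed
stuck.call(:other, 30) # => true, no heartbeat ever recorded
```

In the gem itself the "heartbeat" side goes through resque on purpose: the point is to prove that jobs actually get executed, not just that redis is reachable.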
+ ## Usage
+
+ Run this as a daemon somewhere alongside the app/in your setup. You'll need to configure it to your needs first.
+
+ Put something like this in `config/initializers/resque-stuck-queue.rb`:
+
+ <pre>
+ require 'resque_stuck_queue' # or require 'resque/stuck_queue'
+ require 'logger'
+
+ # change to decent values that make sense for you
+ Resque::StuckQueue.config[:heartbeat_interval] = 10.seconds
+ Resque::StuckQueue.config[:watcher_interval] = 1.second
+ Resque::StuckQueue.config[:trigger_timeout] = 30.seconds # acceptable lagtime
+ Resque::StuckQueue.config[:warn_interval] = 5.minutes # keep on triggering periodically; default is to trigger only once
+
+ # which queues to monitor
+ Resque::StuckQueue.config[:queues] = [:app, :custom_queue]
+
+ # handler for when a resque queue is being problematic
+ Resque::StuckQueue.config[:triggered_handler] = proc { |bad_queue, lagtime|
+   msg = "[BAD] APPNAME #{Rails.env}'s Resque #{bad_queue} queue lagging job execution by #{lagtime} seconds."
+   send_email(msg)
+ }
+
+ # handler for when a resque queue recovers
+ Resque::StuckQueue.config[:recovered_handler] = proc { |good_queue, lagtime|
+   msg = "[GOOD] APPNAME #{Rails.env}'s Resque #{good_queue} queue recovered, lag is now #{lagtime} seconds."
+   send_email(msg)
+ }
+
+ # create a sync/unbuffered log
+ logpath = Rails.root.join('log', 'resque_stuck_queue.log')
+ logfile = File.open(logpath, "a")
+ logfile.sync = true
+ logger = Logger.new(logfile)
+ logger.formatter = Logger::Formatter.new
+ Resque::StuckQueue.config[:logger] = logger
+
+ # your own redis
+ Resque::StuckQueue.config[:redis] = YOUR_REDIS
+ </pre>
+
+ Then create a task to run it as a daemon (similar to how the resque rake task is implemented):
+
+ <pre>
+ # put this in lib/tasks/resque_stuck_queue.rb
+
+ namespace :resque do
+   desc "Start a Resque-stuck daemon"
+   # the :environment dep task should load the config via the initializer
+   task :stuck_queue => :environment do
+     Resque::StuckQueue.start
+   end
+ end
+ </pre>
+
+ Then run it via god, monit or whatever:
+
+ <pre>
+ $ bundle exec rake --trace resque:stuck_queue # outdated god config - https://gist.github.com/shaiguitar/298935953d91faa6bd4e
+ </pre>
+
+ ## Configuration Options
+
+ Configuration settings are below. You'll most likely want to tune at least the `:triggered_handler`, `:heartbeat_interval` and `:trigger_timeout` settings.
+
+ <pre>
+ triggered_handler:
+   set to what gets triggered when resque-stuck-queue detects that the latest heartbeat is older than the trigger_timeout setting.
+   Example:
+   Resque::StuckQueue.config[:triggered_handler] = proc { |queue_name, lagtime| send_email("queue #{queue_name} isn't working, aaah the daemons") }
+
+ recovered_handler:
+   set to what gets triggered when resque-stuck-queue has reported a problem, but then detects the queue is functioning well again (it won't trigger again until it has recovered).
+   Example:
+   Resque::StuckQueue.config[:recovered_handler] = proc { |queue_name, lagtime| send_email("phew, queue #{queue_name} is ok") }
+
+ heartbeat_interval:
+   set to how often to push the 'heartbeat' job which will refresh the latest working time.
+   Example:
+   Resque::StuckQueue.config[:heartbeat_interval] = 5.minutes
+
+ watcher_interval:
+   set to how often to check when the queue last worked.
+   Example:
+   Resque::StuckQueue.config[:watcher_interval] = 1.minute
+
+ trigger_timeout:
+   set to how much of a resque work lag you are willing to accept before being notified. note: take the :watcher_interval setting into account when setting this timeout.
+   Example:
+   Resque::StuckQueue.config[:trigger_timeout] = 9.minutes
+
+ warn_interval:
+   optional: if set, it will keep triggering/warning at this interval after the first trigger, for as long as lagtime stays above trigger_timeout and recovery hasn't occurred yet.
+
+ redis:
+   set the Redis StuckQueue will use. Either a Redis or Redis::Namespace instance.
+
+ heartbeat_key:
+   optional, name of the keys that keep track of the last good resque heartbeat time
+
+ triggered_key:
+   optional, name of the keys that keep track of the last trigger time
+
+ logger:
+   optional, pass a Logger. By default a ruby Logger will be instantiated. Needs to respond to that interface.
+
+ queues:
+   optional, the specific queues you want to send a heartbeat to/monitor. default is [:app]
+
+ abort_on_exception:
+   optional, whether you want the resque-stuck-queue threads to explicitly raise, default is true
+
+ heartbeat_job:
+   optional, your own custom refreshing job, if you are using something other than resque
+
+ enable_signals:
+   optional, allow resque::stuck's signal_handlers, which do mostly nothing at this point. possible future plan: log info, reopen log file, etc.
+ </pre>
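A rule of thumb when tuning these: the watcher only notices a stuck queue on its next wake-up, so the worst-case delay between the last good heartbeat and the handler firing is roughly `trigger_timeout + watcher_interval`. The numbers below are illustrative only (plain integer seconds instead of ActiveSupport durations):

```ruby
# Illustrative arithmetic only: worst-case detection delay when
# combining :trigger_timeout with :watcher_interval.
trigger_timeout  = 9 * 60  # lag you accept, in seconds (9.minutes)
watcher_interval = 60      # watcher wakes up once a minute (1.minute)

worst_case = trigger_timeout + watcher_interval
worst_case # => 600 seconds before the trigger fires
```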
+
+ To start it:
+
+ <pre>
+ Resque::StuckQueue.start # blocking
+ Resque::StuckQueue.start_in_background # sugar for Thread.new { Resque::StuckQueue.start }
+ </pre>
+
+ Stopping it works the same way:
+
+ <pre>
+ Resque::StuckQueue.stop # this will block until the threads end their current iteration
+ Resque::StuckQueue.force_stop! # force kill those threads and let's move on
+ </pre>
+
+ ## Sidekiq/Other redis-based job queues
+
+ If you use another redis-based job queue, you can still use this lib by setting your own custom refresh job (aka the job that refreshes your queue-specific heartbeat_key). The one thing you need to take care of is that however you enqueue your own custom job, it sets the heartbeat_key to the current time:
+
+ <pre>
+ class CustomJob
+   include Sidekiq::Worker
+   def perform
+     # ensure you're setting the key in the redis the job queue is using
+     $redis.set(Resque::StuckQueue.heartbeat_key_for(queue_name), Time.now.to_i)
+   end
+ end
+
+ Resque::StuckQueue.config[:heartbeat_job] = proc {
+   # or however else you enqueue your custom job: Sidekiq::Client.enqueue(CustomJob), whatever, etc.
+   CustomJob.perform_async
+ }
+ </pre>
+
+ ## Tests
+
+ Run the tests:
+
+ `bundle; bundle exec rake`
+ `RESQUE_2=1 bundle exec rake # for resque 2 compat`
data/Rakefile ADDED
@@ -0,0 +1,26 @@
+ require 'rake/testtask'
+
+ task :default => :test
+ #task :test do
+   ## forking and whatnot. keep contained in each own process?
+   #Dir['./test/test_*.rb'].each do |file|
+     #system("ruby -I. -I lib/ #{file}")
+   #end
+ #end
+ Rake::TestTask.new do |t|
+   t.pattern = "test/test_*rb"
+ end
+
+ require 'resque/tasks'
+
+ task :'resque:setup' do
+   # https://github.com/resque/resque/issues/773
+   # have the jobs loaded in memory
+   Dir["./test/resque/*.rb"].each {|file| require file}
+   # load project
+   Dir["./lib/resque_stuck_queue.rb"].each {|file| require file}
+ end
+
+ require 'resque_scheduler/tasks'
+ task "resque:scheduler_setup"
+
data/THOUGHTS ADDED
@@ -0,0 +1,9 @@
+ ## TODOS
+
+ rm redis locking (since it works by keys now, no need for it, recover/trigger ping pong).
+ rm require resque?
+
+ refactor tests to have an around(:suite) to run with resque beforehand (no startup time) and just run test_integration.rb
+ (& compact dup tests etc)
+
+ don't continue to send heartbeat job if it's alerting/stuck, it can just back the queue up (even if just marginally) more.
@@ -0,0 +1 @@
+ require File.join(File.expand_path(File.dirname(__FILE__)), "..","resque_stuck_queue")
@@ -0,0 +1,320 @@
+ require "resque_stuck_queue/version"
+ require "resque_stuck_queue/config"
+ require "resque_stuck_queue/heartbeat_job"
+
+ require 'redis-namespace'
+
+ # TODO move this require into a configurable?
+ require 'resque'
+
+ # TODO rm redis-mutex dep and just do the setnx locking here
+ require 'redis-mutex'
+
+ module Resque
+   module StuckQueue
+
+     class << self
+
+       attr_accessor :config
+
+       def config
+         @config ||= Config.new
+       end
+
+       def logger
+         @logger ||= (config[:logger] || StuckQueue::LOGGER)
+       end
+
+       def redis
+         @redis ||= config[:redis]
+       end
+
+       def heartbeat_key_for(queue)
+         if config[:heartbeat_key]
+           "#{queue}:#{config[:heartbeat_key]}"
+         else
+           "#{queue}:#{HEARTBEAT_KEY}"
+         end
+       end
+
+       def triggered_key_for(queue)
+         if config[:triggered_key]
+           "#{queue}:#{self.config[:triggered_key]}"
+         else
+           "#{queue}:#{TRIGGERED_KEY}"
+         end
+       end
+
+       def heartbeat_keys
+         queues.map{|q| heartbeat_key_for(q) }
+       end
+
+       def queues
+         @queues ||= (config[:queues] || [:app])
+       end
+
+       def abort_on_exception
+         if !config[:abort_on_exception].nil?
+           config[:abort_on_exception] # allow overriding w false
+         else
+           true # default
+         end
+       end
+
+       def start_in_background
+         Thread.new do
+           Thread.current.abort_on_exception = abort_on_exception
+           self.start
+         end
+       end
+
+       # call this after setting config. once started you shouldn't be allowed to modify it
+       def start
+         @running = true
+         @stopped = false
+         @threads = []
+         config.validate_required_keys!
+         config.freeze
+
+         log_starting_info
+
+         reset_keys
+
+         RedisClassy.redis = redis if RedisClassy.redis.nil?
+
+         pretty_process_name
+
+         setup_heartbeat_thread
+         setup_watcher_thread
+
+         setup_warn_thread
+
+         # fo-eva.
+         @threads.map(&:join)
+
+         logger.info("threads stopped")
+         @stopped = true
+       end
+
+       def stop
+         reset!
+         # wait for clean thread shutdown
+         while @stopped == false
+           sleep 1
+         end
+         logger.info("Stopped")
+       end
+
+       def force_stop!
+         logger.info("Force stopping")
+         @threads.map(&:kill)
+         reset!
+       end
+
+       def reset!
+         # clean state so we can stop and start in the same process.
+         @config = Config.new # clear, unfreeze
+         @queues = nil
+         @running = false
+         @logger = nil
+       end
+
+       def reset_keys
+         queues.each do |qn|
+           redis.del(heartbeat_key_for(qn))
+           redis.del(triggered_key_for(qn))
+         end
+       end
+
+       def stopped?
+         @stopped
+       end
+
+       def trigger_handler(queue_name, type)
+         raise 'Must trigger either the recovered or triggered handler!' unless (type == :recovered || type == :triggered)
+         handler_name = :"#{type}_handler"
+         logger.info("Triggering #{type} handler for #{queue_name} at #{Time.now}.")
+         (config[handler_name] || const_get(handler_name.upcase)).call(queue_name, lag_time(queue_name))
+         manual_refresh(queue_name, type)
+       rescue => e
+         logger.info("handler #{type} for #{queue_name} crashed: #{e.inspect}")
+         logger.info("\n#{e.backtrace.join("\n")}")
+         raise e
+       end
+
+       def log_starting_info
+         logger.info("Starting StuckQueue with config: #{self.config.inspect}")
+       end
+
+       def log_watcher_info(queue_name)
+         logger.info("Lag time for #{queue_name} is #{lag_time(queue_name).inspect} seconds.")
+         if triggered_ago = last_triggered(queue_name)
+           logger.info("Last triggered for #{queue_name} was #{triggered_ago.inspect} seconds ago.")
+         else
+           logger.info("No last trigger found for #{queue_name}.")
+         end
+       end
+
+       private
+
+       def log_starting_thread(type)
+         interval_keyname = "#{type}_interval".to_sym
+         logger.info("Starting #{type} thread with interval of #{config[interval_keyname]} seconds")
+       end
+
+       def read_from_redis(keyname)
+         redis.get(keyname)
+       end
+
+       def setup_watcher_thread
+         @threads << Thread.new do
+           Thread.current.abort_on_exception = abort_on_exception
+           log_starting_thread(:watcher)
+           while @running
+             mutex = Redis::Mutex.new('resque_stuck_queue_lock', block: 0)
+             if mutex.lock
+               begin
+                 queues.each do |queue_name|
+                   log_watcher_info(queue_name)
+                   if should_trigger?(queue_name)
+                     trigger_handler(queue_name, :triggered)
+                   elsif should_recover?(queue_name)
+                     trigger_handler(queue_name, :recovered)
+                   end
+                 end
+               ensure
+                 mutex.unlock
+               end
+             end
+             wait_for_it(:watcher_interval)
+           end
+         end
+       end
+
+       def setup_heartbeat_thread
+         @threads << Thread.new do
+           Thread.current.abort_on_exception = abort_on_exception
+           log_starting_thread(:heartbeat)
+           while @running
+             # we want to go through resque jobs, because that's what we're trying to test here:
+             # ensure that jobs get executed and the time is updated!
+             wait_for_it(:heartbeat_interval)
+             logger.info("Sending heartbeat jobs")
+             enqueue_jobs
+           end
+         end
+       end
+
+       def setup_warn_thread
+         if config[:warn_interval]
+           @threads << Thread.new do
+             Thread.current.abort_on_exception = abort_on_exception
+             log_starting_thread(:warn)
+             while @running
+               queues.each do |qn|
+                 trigger_handler(qn, :triggered) if should_trigger?(qn, true)
+               end
+               wait_for_it(:warn_interval)
+             end
+           end
+         end
+       end
+
+
+       def enqueue_jobs
+         if config[:heartbeat_job]
+           # FIXME config[:heartbeat_job] with multiple queues is bad semantics
+           config[:heartbeat_job].call
+         else
+           queues.each do |queue_name|
+             # Redis::Namespace.new support as well as Redis.new
+             namespace = redis.respond_to?(:namespace) ? redis.namespace : nil
+             Resque.enqueue_to(queue_name, HeartbeatJob, heartbeat_key_for(queue_name), redis.client.host, redis.client.port, namespace, Time.now.to_i )
+           end
+         end
+       end
+
+       def last_successful_heartbeat(queue_name)
+         time_set = read_from_redis(heartbeat_key_for(queue_name))
+         if time_set
+           time_set
+         else
+           logger.info("manually refreshing #{queue_name} for :first_time")
+           manual_refresh(queue_name, :first_time)
+         end.to_i
+       end
+
+       def manual_refresh(queue_name, type)
+         if type == :triggered
+           time = Time.now.to_i
+           redis.set(triggered_key_for(queue_name), time)
+           time
+         elsif type == :recovered
+           redis.del(triggered_key_for(queue_name))
+           nil
+         elsif type == :first_time
+           time = Time.now.to_i
+           redis.set(heartbeat_key_for(queue_name), time)
+           time
+         end
+       end
+
+       def lag_time(queue_name)
+         Time.now.to_i - last_successful_heartbeat(queue_name)
+       end
+
+       def last_triggered(queue_name)
+         time_set = read_from_redis(triggered_key_for(queue_name))
+         if !time_set.nil?
+           Time.now.to_i - time_set.to_i
+         end
+       end
+
+       def should_recover?(queue_name)
+         last_triggered(queue_name) &&
+           lag_time(queue_name) < max_wait_time
+       end
+
+       def should_trigger?(queue_name, force_trigger = false)
+         if lag_time(queue_name) >= max_wait_time
+           last_trigger = last_triggered(queue_name)
+
+           if force_trigger
+             return true
+           end
+
+           if last_trigger.nil?
+             # if it hasn't been triggered before, do it
+             return true
+           end
+
+           # if it already triggered in the past don't trigger again.
+           # :recovered should clear out last_triggered so the cycle (trigger<->recover) continues
+           return false
+         end
+       end
+
+       def wait_for_it(type)
+         if type == :heartbeat_interval
+           sleep config[:heartbeat_interval] || HEARTBEAT_INTERVAL
+         elsif type == :watcher_interval
+           sleep config[:watcher_interval] || WATCHER_INTERVAL
+         elsif type == :warn_interval
+           sleep config[:warn_interval]
+         else
+           raise 'Must sleep for :watcher_interval interval or :heartbeat_interval or :warn_interval interval!'
+         end
+       end
+
+       def max_wait_time
+         config[:trigger_timeout] || TRIGGER_TIMEOUT
+       end
+
+       def pretty_process_name
+         $0 = "rake --trace resque:stuck_queue #{redis.inspect} QUEUES=#{queues.join(",")}"
+       end
+
+     end
+   end
+ end
+