gitlab-sidekiq-fetcher 0.5.0.pre.alpha → 0.5.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 7da36ba54ba6a0e97cef5da210f02bdde86dc06e22960d37573e51e20046dd40
-  data.tar.gz: bd3bf13edc789374109a2c18b6797b44c1440bc2665938c654e3a627bbc5bc23
+  metadata.gz: 477650a08755f00beb453c4867a270fa8469a83a372b5b09ba1c030885899699
+  data.tar.gz: da3fcf3f0b67dd3f71c73ea80d09666e82fa9360cdf5cbf6c8b8e5668b85bfb5
 SHA512:
-  metadata.gz: 852f01191a052384cfe4215766c6b27dc932ced77782a92c23f73aba091b520e83d293d57b407e45ed5d141661664a9f77b88211b50e9bed52cb15a011b0347a
-  data.tar.gz: 59ee3493c9ddcb3e145ce77259b85ed508f8e57ca97878c12f5427e25952d2a4418860cd1d90562d39aa693dd630359e88ac78fd882aaceb9c5c374b6ca7509a
+  metadata.gz: b4d03a71a6c00e2fa55affd4af5633c2c1ef58093c8bd82b95d4a1603523d940e35b8db4643becf20236224ef065fc879b70585933569efa893e0e05481bffb9
+  data.tar.gz: 4fac10a26e3507c70927ad8689c2d8903d452cc9ed25a6522dc5c2c7e5e7fcb13d8412013eadb435a9a2feecab51130ad516553c4bbb49c72b2b36827f75c3cf
data/.gitignore CHANGED
@@ -1,2 +1,3 @@
 *.gem
 coverage
+.DS_Store
data/.gitlab-ci.yml CHANGED
@@ -25,7 +25,7 @@ rspec:
 .integration:
   stage: test
   script:
-    - cd tests/reliability_test
+    - cd tests/reliability
     - bundle exec ruby reliability_test.rb
   services:
     - redis:alpine
@@ -47,11 +47,19 @@ integration_basic:
   variables:
     JOB_FETCHER: basic
 
-retry_test:
+kill_interruption:
   stage: test
   script:
-    - cd tests/retry_test
-    - bundle exec ruby retry_test.rb
+    - cd tests/interruption
+    - bundle exec ruby test_kill_signal.rb
+  services:
+    - redis:alpine
+
+term_interruption:
+  stage: test
+  script:
+    - cd tests/interruption
+    - bundle exec ruby test_term_signal.rb
   services:
     - redis:alpine
 
data/README.md CHANGED
@@ -10,6 +10,17 @@ There are two strategies implemented: [Reliable fetch](http://redis.io/commands/
 semi-reliable fetch that uses regular `brpop` and `lpush` to pick the job and put it to working queue. The main benefit of "Reliable" strategy is that `rpoplpush` is atomic, eliminating a race condition in which jobs can be lost.
 However, it comes at a cost because `rpoplpush` can't watch multiple lists at the same time so we need to iterate over the entire queue list which significantly increases pressure on Redis when there are more than a few queues. The "semi-reliable" strategy is much more reliable than the default Sidekiq fetcher, though. Compared to the reliable fetch strategy, it does not increase pressure on Redis significantly.
 
+### Interruption handling
+
+Sidekiq expects any job to report success or failure. In the failure case, Sidekiq puts a `retry_count` counter
+into the job and keeps re-running the job until the counter reaches the maximum allowed value. When the job has
+not been given a chance to finish its work (to report success or failure), for example, when it was killed forcibly or requeued after receiving a TERM signal, the standard retry mechanism does not come into play and the job would be retried indefinitely. This is why Reliable Fetcher maintains a special counter, `interrupted_count`,
+which is used to limit the number of such retries. In both cases, Reliable Fetcher increments the `interrupted_count` counter and rejects the job from running again once the counter exceeds `max_retries_after_interruption` (default: 3 times).
+Such a job is put into the `interrupted` queue. This queue mostly behaves like the Sidekiq Dead queue, so it only stores a limited number of jobs for a limited time. As with the Dead queue, the limits are configurable via the `interrupted_max_jobs` (default: 10_000) and `interrupted_timeout_in_seconds` (default: 3 months) Sidekiq option keys.
+
+You can also disable the special handling of interrupted jobs by setting `max_retries_after_interruption` to `-1`.
+In this case, interrupted jobs will run without any limits from Reliable Fetcher and won't be put into the Interrupted queue. (A configuration sketch follows this diff.)
+
 
 ## Installation
 
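To make the knobs above concrete, here is a minimal configuration sketch in Ruby. The option keys come from the README text above and from the `max_retries_after_interruption` lookup in `base_reliable_fetch.rb` further down; the initializer path and the `ImportWorker` class are hypothetical.

```ruby
# config/initializers/sidekiq.rb -- hypothetical placement
require 'sidekiq'

# Reject an interrupted job after five interruptions
# (set to -1 to disable interruption handling entirely).
Sidekiq.options[:max_retries_after_interruption] = 5

# Keep at most 5_000 quarantined jobs, each for at most one month.
Sidekiq.options[:interrupted_max_jobs] = 5_000
Sidekiq.options[:interrupted_timeout_in_seconds] = 31 * 24 * 60 * 60

# Hypothetical worker showing the per-class override, which
# max_retries_after_interruption reads from sidekiq_options:
class ImportWorker
  include Sidekiq::Worker

  sidekiq_options max_retries_after_interruption: 10

  def perform; end
end
```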
data/gitlab-sidekiq-fetcher.gemspec CHANGED
@@ -1,14 +1,14 @@
 Gem::Specification.new do |s|
-  s.name = 'gitlab-sidekiq-fetcher'
-  s.version = '0.5.0-alpha'
-  s.authors = ['TEA', 'GitLab']
-  s.email = 'valery@gitlab.com'
-  s.license = 'LGPL-3.0'
-  s.homepage = 'https://gitlab.com/gitlab-org/sidekiq-reliable-fetch/'
-  s.summary = 'Reliable fetch extension for Sidekiq'
-  s.description = 'Redis reliable queue pattern implemented in Sidekiq'
+  s.name = 'gitlab-sidekiq-fetcher'
+  s.version = '0.5.4'
+  s.authors = ['TEA', 'GitLab']
+  s.email = 'valery@gitlab.com'
+  s.license = 'LGPL-3.0'
+  s.homepage = 'https://gitlab.com/gitlab-org/sidekiq-reliable-fetch/'
+  s.summary = 'Reliable fetch extension for Sidekiq'
+  s.description = 'Redis reliable queue pattern implemented in Sidekiq'
   s.require_paths = ['lib']
-  s.files = `git ls-files`.split($\)
-  s.test_files = []
+  s.files = `git ls-files`.split($\)
+  s.test_files = []
   s.add_dependency 'sidekiq', '~> 5'
 end
data/lib/sidekiq-reliable-fetch.rb CHANGED
@@ -1,4 +1,5 @@
 require 'sidekiq'
+require 'sidekiq/api'
 
 require_relative 'sidekiq/base_reliable_fetch'
 require_relative 'sidekiq/reliable_fetch'
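The new `sidekiq/api` require is worth a note: the `Sidekiq::InterruptedSet` class added below subclasses `Sidekiq::JobSet`, which is defined in `sidekiq/api`, so the library now needs it loaded up front.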
data/lib/sidekiq/base_reliable_fetch.rb CHANGED
@@ -1,6 +1,6 @@
 # frozen_string_literal: true
 
-require 'sidekiq/job_retry'
+require_relative 'interrupted_set'
 
 module Sidekiq
   class BaseReliableFetch
@@ -18,6 +18,9 @@ module Sidekiq
     # Defines the COUNT parameter that will be passed to Redis SCAN command
     SCAN_COUNT = 1000
 
+    # How many times a job can be interrupted
+    DEFAULT_MAX_RETRIES_AFTER_INTERRUPTION = 3
+
     UnitOfWork = Struct.new(:queue, :job) do
       def acknowledge
         Sidekiq.redis { |conn| conn.lrem(Sidekiq::BaseReliableFetch.working_queue_name(queue), 1, job) }
@@ -65,155 +68,173 @@ module Sidekiq
       end
     end
 
-    def self.pid
-      @pid ||= ::Process.pid
+    def self.hostname
+      Socket.gethostname
     end
 
-    def self.hostname
-      @hostname ||= Socket.gethostname
+    def self.process_nonce
+      @@process_nonce ||= SecureRandom.hex(6)
+    end
+
+    def self.identity
+      @@identity ||= "#{hostname}:#{$$}:#{process_nonce}"
     end
 
    def self.heartbeat
      Sidekiq.redis do |conn|
-        conn.set(heartbeat_key(hostname, pid), 1, ex: HEARTBEAT_LIFESPAN)
+        conn.set(heartbeat_key(identity), 1, ex: HEARTBEAT_LIFESPAN)
      end
 
-      Sidekiq.logger.debug("Heartbeat for hostname: #{hostname} and pid: #{pid}")
+      Sidekiq.logger.debug("Heartbeat for #{identity}")
    end
 
    def self.bulk_requeue(inprogress, _options)
      return if inprogress.empty?
 
-      Sidekiq.logger.debug('Re-queueing terminated jobs')
-
      Sidekiq.redis do |conn|
        inprogress.each do |unit_of_work|
          conn.multi do |multi|
-            multi.lpush(unit_of_work.queue, unit_of_work.job)
+            preprocess_interrupted_job(unit_of_work.job, unit_of_work.queue, multi)
+
            multi.lrem(working_queue_name(unit_of_work.queue), 1, unit_of_work.job)
          end
        end
      end
-
-      Sidekiq.logger.info("Pushed #{inprogress.size} jobs back to Redis")
    rescue => e
      Sidekiq.logger.warn("Failed to requeue #{inprogress.size} jobs: #{e.message}")
    end
 
-    def self.heartbeat_key(hostname, pid)
-      "reliable-fetcher-heartbeat-#{hostname}-#{pid}"
-    end
-
-    def self.working_queue_name(queue)
-      "#{WORKING_QUEUE_PREFIX}:#{queue}:#{hostname}:#{pid}"
+    def self.clean_working_queue!(original_queue, working_queue)
+      Sidekiq.redis do |conn|
+        while job = conn.rpop(working_queue)
+          preprocess_interrupted_job(job, original_queue)
+        end
+      end
    end
 
-    attr_reader :cleanup_interval, :last_try_to_take_lease_at, :lease_interval,
-                :queues, :use_semi_reliable_fetch,
-                :strictly_ordered_queues
+    def self.preprocess_interrupted_job(job, queue, conn = nil)
+      msg = Sidekiq.load_json(job)
+      msg['interrupted_count'] = msg['interrupted_count'].to_i + 1
 
-    def initialize(options)
-      @cleanup_interval = options.fetch(:cleanup_interval, DEFAULT_CLEANUP_INTERVAL)
-      @lease_interval = options.fetch(:lease_interval, DEFAULT_LEASE_INTERVAL)
-      @last_try_to_take_lease_at = 0
-      @strictly_ordered_queues = !!options[:strict]
-      @queues = options[:queues].map { |q| "queue:#{q}" }
+      if interruption_exhausted?(msg)
+        send_to_quarantine(msg, conn)
+      else
+        requeue_job(queue, msg, conn)
+      end
    end
 
-    def retrieve_work
-      clean_working_queues! if take_lease
+    def self.valid_identity_format?(identity)
+      # New format is "{hostname}:{pid}:{randomhex}"
+      # Old format is "{hostname}:{pid}"
 
-      retrieve_unit_of_work
-    end
-
-    def retrieve_unit_of_work
-      raise NotImplementedError,
-            "#{self.class} does not implement #{__method__}"
+      # Test the newer format first, only checking the older if necessary
+      identity.match(/[^:]*:[0-9]*:[0-9a-f]*\z/) || identity.match(/([^:]*):([0-9]*)\z/)
    end
 
-    private
-
-    def clean_working_queue!(working_queue)
-      original_queue = working_queue.gsub(/#{WORKING_QUEUE_PREFIX}:|:[^:]*:[0-9]*\z/, '')
+    # Detect "old" jobs and requeue them because the worker they were assigned
+    # to probably failed miserably.
+    def self.clean_working_queues!
+      Sidekiq.logger.info('Cleaning working queues')
 
      Sidekiq.redis do |conn|
-        count = 0
-
-        while job = conn.rpop(working_queue)
-          msg = begin
-            Sidekiq.load_json(job)
-          rescue => e
-            Sidekiq.logger.info("Skipped job: #{job} as we couldn't parse it")
-            next
-          end
-
-          msg['retry_count'] = msg['retry_count'].to_i + 1
-
-          if retries_exhausted?(msg)
-            send_to_morgue(msg)
-          else
-            job = Sidekiq.dump_json(msg)
+        conn.scan_each(match: "#{WORKING_QUEUE_PREFIX}:queue:*", count: SCAN_COUNT) do |key|
+          original_queue, identity = key.scan(/#{WORKING_QUEUE_PREFIX}:(queue:[^:]*):(.*)\z/).flatten
 
-            conn.lpush(original_queue, job)
+          next unless valid_identity_format?(identity)
 
-            count += 1
-          end
+          clean_working_queue!(original_queue, key) if worker_dead?(identity, conn)
        end
-
-        Sidekiq.logger.info("Requeued #{count} dead jobs to #{original_queue}")
      end
    end
 
-    def retries_exhausted?(msg)
-      max_retries_default = Sidekiq.options.fetch(:max_retries, Sidekiq::JobRetry::DEFAULT_MAX_RETRY_ATTEMPTS)
+    def self.worker_dead?(identity, conn)
+      !conn.get(heartbeat_key(identity))
+    end
 
-      max_retry_attempts = retry_attempts_from(msg['retry'], max_retries_default)
+    def self.heartbeat_key(identity)
+      "reliable-fetcher-heartbeat-#{identity.gsub(':', '-')}"
+    end
 
-      msg['retry_count'] >= max_retry_attempts
+    def self.working_queue_name(queue)
+      "#{WORKING_QUEUE_PREFIX}:#{queue}:#{identity}"
    end
 
-    def retry_attempts_from(msg_retry, default)
-      if msg_retry.is_a?(Integer)
-        msg_retry
-      else
-        default
+    def self.interruption_exhausted?(msg)
+      return false if max_retries_after_interruption(msg['class']) < 0
+
+      msg['interrupted_count'].to_i >= max_retries_after_interruption(msg['class'])
+    end
+
+    def self.max_retries_after_interruption(worker_class)
+      max_retries_after_interruption = nil
+
+      max_retries_after_interruption ||= begin
+        Object.const_get(worker_class).sidekiq_options[:max_retries_after_interruption]
+      rescue NameError
      end
+
+      max_retries_after_interruption ||= Sidekiq.options[:max_retries_after_interruption]
+      max_retries_after_interruption ||= DEFAULT_MAX_RETRIES_AFTER_INTERRUPTION
+      max_retries_after_interruption
    end
 
-    def send_to_morgue(msg)
+    def self.send_to_quarantine(msg, multi_connection = nil)
      Sidekiq.logger.warn(
        class: msg['class'],
        jid: msg['jid'],
-        message: %(Reliable Fetcher: adding dead #{msg['class']} job #{msg['jid']})
+        message: %(Reliable Fetcher: adding dead #{msg['class']} job #{msg['jid']} to interrupted queue)
      )
 
-      payload = Sidekiq.dump_json(msg)
-      Sidekiq::DeadSet.new.kill(payload, notify_failure: false)
+      job = Sidekiq.dump_json(msg)
+      Sidekiq::InterruptedSet.new.put(job, connection: multi_connection)
    end
 
-    # Detect "old" jobs and requeue them because the worker they were assigned
-    # to probably failed miserably.
-    def clean_working_queues!
-      Sidekiq.logger.info("Cleaning working queues")
+    # If you want this method to run in the scope of a multi connection,
+    # you need to pass it in
+    def self.requeue_job(queue, msg, conn)
+      with_connection(conn) do |conn|
+        conn.lpush(queue, Sidekiq.dump_json(msg))
+      end
 
-      Sidekiq.redis do |conn|
-        conn.scan_each(match: "#{WORKING_QUEUE_PREFIX}:queue:*", count: SCAN_COUNT) do |key|
-          # Example: "working:name_of_the_job:queue:{hostname}:{PID}"
-          hostname, pid = key.scan(/:([^:]*):([0-9]*)\z/).flatten
+      Sidekiq.logger.info(
+        message: "Pushed job #{msg['jid']} back to queue #{queue}",
+        jid: msg['jid'],
+        queue: queue
+      )
+    end
 
-          continue if hostname.nil? || pid.nil?
+    # Yields the block with an existing connection or creates a new one
+    def self.with_connection(conn, &block)
+      return yield(conn) if conn
 
-          clean_working_queue!(key) if worker_dead?(hostname, pid)
-        end
-      end
+      Sidekiq.redis { |conn| yield(conn) }
    end
 
-    def worker_dead?(hostname, pid)
-      Sidekiq.redis do |conn|
-        !conn.get(self.class.heartbeat_key(hostname, pid))
-      end
+    attr_reader :cleanup_interval, :last_try_to_take_lease_at, :lease_interval,
+                :queues, :use_semi_reliable_fetch,
+                :strictly_ordered_queues
+
+    def initialize(options)
+      @cleanup_interval = options.fetch(:cleanup_interval, DEFAULT_CLEANUP_INTERVAL)
+      @lease_interval = options.fetch(:lease_interval, DEFAULT_LEASE_INTERVAL)
+      @last_try_to_take_lease_at = 0
+      @strictly_ordered_queues = !!options[:strict]
+      @queues = options[:queues].map { |q| "queue:#{q}" }
    end
 
+    def retrieve_work
+      self.class.clean_working_queues! if take_lease
+
+      retrieve_unit_of_work
+    end
+
+    def retrieve_unit_of_work
+      raise NotImplementedError,
+            "#{self.class} does not implement #{__method__}"
+    end
+
+    private
+
    def take_lease
      return unless allowed_to_take_a_lease?
 
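The key-format change above is easier to follow with concrete values. A small illustration, assuming `WORKING_QUEUE_PREFIX` is `working` (as the removed spec comment `"working:name_of_the_job:queue:{hostname}:{PID}"` suggests); the hostname, PID, and nonce below are made up:

```ruby
# A working-queue key as written by the new working_queue_name/identity pair:
key = 'working:queue:default:host-01:12345:0123456789ab'

# clean_working_queues! splits a scanned key into queue name and identity:
original_queue, identity = key.scan(/working:(queue:[^:]*):(.*)\z/).flatten
original_queue # => "queue:default"
identity       # => "host-01:12345:0123456789ab"

# valid_identity_format? accepts both generations of keys:
'host-01:12345:0123456789ab'.match(/[^:]*:[0-9]*:[0-9a-f]*\z/) # new format: matches
'host-01:12345'.match(/([^:]*):([0-9]*)\z/)                    # legacy format: matches
'host-01:12345:X'.match(/[^:]*:[0-9]*:[0-9a-f]*\z/)            # malformed: nil, so skipped
```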
data/lib/sidekiq/interrupted_set.rb ADDED
@@ -0,0 +1,47 @@
+require 'sidekiq/api'
+
+module Sidekiq
+  class InterruptedSet < ::Sidekiq::JobSet
+    DEFAULT_MAX_CAPACITY = 10_000
+    DEFAULT_MAX_TIMEOUT = 90 * 24 * 60 * 60 # 3 months
+
+    def initialize
+      super "interrupted"
+    end
+
+    def put(message, opts = {})
+      now = Time.now.to_f
+
+      with_multi_connection(opts[:connection]) do |conn|
+        conn.zadd(name, now.to_s, message)
+        conn.zremrangebyscore(name, '-inf', now - self.class.timeout)
+        conn.zremrangebyrank(name, 0, - self.class.max_jobs)
+      end
+
+      true
+    end
+
+    # Yields the block inside an existing multi connection or creates a new one
+    def with_multi_connection(conn, &block)
+      return yield(conn) if conn
+
+      Sidekiq.redis do |c|
+        c.multi do |multi|
+          yield(multi)
+        end
+      end
+    end
+
+    def retry_all
+      each(&:retry) while size > 0
+    end
+
+    def self.max_jobs
+      Sidekiq.options[:interrupted_max_jobs] || DEFAULT_MAX_CAPACITY
+    end
+
+    def self.timeout
+      Sidekiq.options[:interrupted_timeout_in_seconds] || DEFAULT_MAX_TIMEOUT
+    end
+  end
+end
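A quick usage sketch for the new set, assuming a running Redis and the gem loaded; the `Bob` payload is a placeholder:

```ruby
require 'sidekiq-reliable-fetch'

interrupted = Sidekiq::InterruptedSet.new

# Quarantine a payload by hand (send_to_quarantine does this internally):
interrupted.put(Sidekiq.dump_json(class: 'Bob', queue: 'default', args: []))

interrupted.size      # => 1
interrupted.retry_all # pushes every quarantined job back onto its queue
```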
data/spec/base_reliable_fetch_spec.rb CHANGED
@@ -5,10 +5,11 @@ require 'sidekiq/reliable_fetch'
 require 'sidekiq/semi_reliable_fetch'
 
 describe Sidekiq::BaseReliableFetch do
+  let(:job) { Sidekiq.dump_json(class: 'Bob', args: [1, 2, 'foo']) }
+
   before { Sidekiq.redis(&:flushdb) }
 
   describe 'UnitOfWork' do
-    let(:job) { Sidekiq.dump_json({ class: 'Bob', args: [1, 2, 'foo'] }) }
     let(:fetcher) { Sidekiq::ReliableFetch.new(queues: ['foo']) }
 
     describe '#requeue' do
@@ -39,19 +40,42 @@ describe Sidekiq::BaseReliableFetch do
   end
 
   describe '.bulk_requeue' do
+    let!(:queue1) { Sidekiq::Queue.new('foo') }
+    let!(:queue2) { Sidekiq::Queue.new('bar') }
+
     it 'requeues the bulk' do
-      queue1 = Sidekiq::Queue.new('foo')
-      queue2 = Sidekiq::Queue.new('bar')
+      uow = described_class::UnitOfWork
+      jobs = [ uow.new('queue:foo', job), uow.new('queue:foo', job), uow.new('queue:bar', job) ]
+      described_class.bulk_requeue(jobs, queues: [])
 
-      expect(queue1.size).to eq 0
-      expect(queue2.size).to eq 0
+      expect(queue1.size).to eq 2
+      expect(queue2.size).to eq 1
+    end
 
+    it 'puts jobs into interrupted queue' do
       uow = described_class::UnitOfWork
-      jobs = [ uow.new('queue:foo', 'bob'), uow.new('queue:foo', 'bar'), uow.new('queue:bar', 'widget') ]
+      interrupted_job = Sidekiq.dump_json(class: 'Bob', args: [1, 2, 'foo'], interrupted_count: 3)
+      jobs = [ uow.new('queue:foo', interrupted_job), uow.new('queue:foo', job), uow.new('queue:bar', job) ]
+      described_class.bulk_requeue(jobs, queues: [])
+
+      expect(queue1.size).to eq 1
+      expect(queue2.size).to eq 1
+      expect(Sidekiq::InterruptedSet.new.size).to eq 1
+    end
+
+    it 'does not put jobs into interrupted queue if it is disabled' do
+      Sidekiq.options[:max_retries_after_interruption] = -1
+
+      uow = described_class::UnitOfWork
+      interrupted_job = Sidekiq.dump_json(class: 'Bob', args: [1, 2, 'foo'], interrupted_count: 3)
+      jobs = [ uow.new('queue:foo', interrupted_job), uow.new('queue:foo', job), uow.new('queue:bar', job) ]
       described_class.bulk_requeue(jobs, queues: [])
 
       expect(queue1.size).to eq 2
       expect(queue2.size).to eq 1
+      expect(Sidekiq::InterruptedSet.new.size).to eq 0
+
+      Sidekiq.options[:max_retries_after_interruption] = 3
     end
   end
 
@@ -63,7 +87,7 @@ describe Sidekiq::BaseReliableFetch do
       Sidekiq.redis do |conn|
         sleep 0.2 # Give the time to heartbeat thread to make a loop
 
-        heartbeat_key = described_class.heartbeat_key(Socket.gethostname, ::Process.pid)
+        heartbeat_key = described_class.heartbeat_key(described_class.identity)
         heartbeat = conn.get(heartbeat_key)
 
         expect(heartbeat).not_to be_nil
data/spec/fetch_shared_examples.rb CHANGED
@@ -4,7 +4,7 @@ shared_examples 'a Sidekiq fetcher' do
   before { Sidekiq.redis(&:flushdb) }
 
   describe '#retrieve_work' do
-    let(:job) { Sidekiq.dump_json({ class: 'Bob', args: [1, 2, 'foo'] }) }
+    let(:job) { Sidekiq.dump_json(class: 'Bob', args: [1, 2, 'foo']) }
     let(:fetcher) { described_class.new(queues: ['assigned']) }
 
     it 'retrieves the job and puts it to working queue' do
@@ -24,17 +24,18 @@ shared_examples 'a Sidekiq fetcher' do
       expect(fetcher.retrieve_work).to be_nil
     end
 
-    it 'requeues jobs from dead working queue with incremented retry_count' do
+    it 'requeues jobs from dead working queue with incremented interrupted_count' do
       Sidekiq.redis do |conn|
         conn.rpush(other_process_working_queue_name('assigned'), job)
       end
 
       expected_job = Sidekiq.load_json(job)
-      expected_job['retry_count'] = 1
+      expected_job['interrupted_count'] = 1
       expected_job = Sidekiq.dump_json(expected_job)
 
       uow = fetcher.retrieve_work
 
+      expect(uow).to_not be_nil
       expect(uow.job).to eq expected_job
 
       Sidekiq.redis do |conn|
@@ -42,6 +43,40 @@ shared_examples 'a Sidekiq fetcher' do
       end
     end
 
+    it 'ignores working queue keys in unknown formats' do
+      # Add a spurious non-numeric char segment at the end; this simulates any other
+      # incorrect form in general
+      malformed_key = "#{other_process_working_queue_name('assigned')}:X"
+      Sidekiq.redis do |conn|
+        conn.rpush(malformed_key, job)
+      end
+
+      uow = fetcher.retrieve_work
+
+      Sidekiq.redis do |conn|
+        expect(conn.llen(malformed_key)).to eq 1
+      end
+    end
+
+    it 'requeues jobs from legacy dead working queue with incremented interrupted_count' do
+      Sidekiq.redis do |conn|
+        conn.rpush(legacy_other_process_working_queue_name('assigned'), job)
+      end
+
+      expected_job = Sidekiq.load_json(job)
+      expected_job['interrupted_count'] = 1
+      expected_job = Sidekiq.dump_json(expected_job)
+
+      uow = fetcher.retrieve_work
+
+      expect(uow).to_not be_nil
+      expect(uow.job).to eq expected_job
+
+      Sidekiq.redis do |conn|
+        expect(conn.llen(legacy_other_process_working_queue_name('assigned'))).to eq 0
+      end
+    end
+
     it 'does not requeue jobs from live working queue' do
       working_queue = live_other_process_working_queue_name('assigned')
 
@@ -61,8 +96,7 @@ shared_examples 'a Sidekiq fetcher' do
     it 'does not clean up orphaned jobs more than once per cleanup interval' do
       Sidekiq.redis = Sidekiq::RedisConnection.create(url: REDIS_URL, size: 10)
 
-      expect_any_instance_of(described_class)
-        .to receive(:clean_working_queues!).once
+      expect(described_class).to receive(:clean_working_queues!).once
 
       threads = 10.times.map do
        Thread.new do
@@ -98,6 +132,24 @@ shared_examples 'a Sidekiq fetcher' do
 
       expect(jobs).to include 'this_job_should_not_stuck'
     end
+
+    context 'with short cleanup interval' do
+      let(:short_interval) { 1 }
+      let(:fetcher) { described_class.new(queues: queues, lease_interval: short_interval, cleanup_interval: short_interval) }
+
+      it 'requeues when there is no heartbeat' do
+        Sidekiq.redis { |conn| conn.rpush('queue:assigned', job) }
+        # Use of retrieve_work twice with a sleep ensures we have exercised the
+        # `identity` method to create the working queue key name and that it
+        # matches the patterns used in the cleanup
+        uow = fetcher.retrieve_work
+        sleep(short_interval + 1)
+        uow = fetcher.retrieve_work
+
+        # Will only receive a UnitOfWork if the job was detected as failed and requeued
+        expect(uow).to_not be_nil
+      end
+    end
   end
 end
 
@@ -107,17 +159,23 @@ def working_queue_size(queue_name)
   end
 end
 
-def other_process_working_queue_name(queue)
+def legacy_other_process_working_queue_name(queue)
   "#{Sidekiq::BaseReliableFetch::WORKING_QUEUE_PREFIX}:queue:#{queue}:#{Socket.gethostname}:#{::Process.pid + 1}"
 end
 
+
+def other_process_working_queue_name(queue)
+  "#{Sidekiq::BaseReliableFetch::WORKING_QUEUE_PREFIX}:queue:#{queue}:#{Socket.gethostname}:#{::Process.pid + 1}:#{::SecureRandom.hex(6)}"
+end
+
 def live_other_process_working_queue_name(queue)
   pid = ::Process.pid + 1
   hostname = Socket.gethostname
+  nonce = SecureRandom.hex(6)
 
   Sidekiq.redis do |conn|
-    conn.set(Sidekiq::BaseReliableFetch.heartbeat_key(hostname, pid), 1)
+    conn.set(Sidekiq::BaseReliableFetch.heartbeat_key("#{hostname}-#{pid}-#{nonce}"), 1)
   end
 
-  "#{Sidekiq::BaseReliableFetch::WORKING_QUEUE_PREFIX}:queue:#{queue}:#{hostname}:#{pid}"
+  "#{Sidekiq::BaseReliableFetch::WORKING_QUEUE_PREFIX}:queue:#{queue}:#{hostname}:#{pid}:#{nonce}"
 end
data/tests/README.md CHANGED
@@ -18,15 +18,20 @@ You need to have redis server running on default HTTP port `6379`. To use other
 This tool spawns a configured number of Sidekiq workers, and when the number of processed jobs reaches about half of the original
 number it kills all the workers with `kill -9` and then spawns new workers again until all the jobs are processed. To track the process and counters we use Redis keys/counters.
 
-# How to run retry tests
+# How to run interruption tests
 
 ```
-cd retry_test
-bundle exec ruby retry_test.rb
+cd tests/interruption
+
+# Verify "KILL" signal
+bundle exec ruby test_kill_signal.rb
+
+# Verify "TERM" signal
+bundle exec ruby test_term_signal.rb
 ```
 
 It requires Redis to be running on port 6379.
 
 ## How it works
 
-It spawns Sidekiq workers then creates a job that will kill itself after a moment. The reliable fetcher will bring it back. The purpose is to verify that job is run no more then `retry` parameter says even when job was killed.
+It spawns Sidekiq workers, then creates a job that will kill itself after a moment. The reliable fetcher will bring it back. The purpose is to verify that the job is run no more than the allowed number of times.
data/tests/interruption/test_kill_signal.rb ADDED
@@ -0,0 +1,25 @@
+# frozen_string_literal: true
+
+require 'sidekiq'
+require_relative 'config'
+require_relative '../support/utils'
+
+EXPECTED_NUM_TIMES_BEEN_RUN = 3
+NUM_WORKERS = EXPECTED_NUM_TIMES_BEEN_RUN + 1
+
+Sidekiq.redis(&:flushdb)
+
+pids = spawn_workers(NUM_WORKERS)
+
+RetryTestWorker.perform_async
+
+sleep 300
+
+Sidekiq.redis do |redis|
+  times_has_been_run = redis.get('times_has_been_run').to_i
+  assert 'The job has been run', times_has_been_run, EXPECTED_NUM_TIMES_BEEN_RUN
+end
+
+assert 'Found interruption exhausted jobs', Sidekiq::InterruptedSet.new.size, 1
+
+stop_workers(pids)
data/tests/interruption/test_term_signal.rb ADDED
@@ -0,0 +1,25 @@
+# frozen_string_literal: true
+
+require 'sidekiq'
+require_relative 'config'
+require_relative '../support/utils'
+
+EXPECTED_NUM_TIMES_BEEN_RUN = 3
+NUM_WORKERS = EXPECTED_NUM_TIMES_BEEN_RUN + 1
+
+Sidekiq.redis(&:flushdb)
+
+pids = spawn_workers(NUM_WORKERS)
+
+RetryTestWorker.perform_async('TERM', 60)
+
+sleep 300
+
+Sidekiq.redis do |redis|
+  times_has_been_run = redis.get('times_has_been_run').to_i
+  assert 'The job has been run', times_has_been_run, EXPECTED_NUM_TIMES_BEEN_RUN
+end
+
+assert 'Found interruption exhausted jobs', Sidekiq::InterruptedSet.new.size, 1
+
+stop_workers(pids)
data/tests/interruption/worker.rb ADDED
@@ -0,0 +1,15 @@
+# frozen_string_literal: true
+
+class RetryTestWorker
+  include Sidekiq::Worker
+
+  def perform(signal = 'KILL', wait_seconds = 1)
+    Sidekiq.redis do |redis|
+      redis.incr('times_has_been_run')
+    end
+
+    Process.kill(signal, Process.pid)
+
+    sleep wait_seconds
+  end
+end
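Note how the two test scripts reuse this one worker: the KILL variant relies on the defaults (`'KILL'`, one second), so the process dies before it can report anything, while the TERM variant passes `('TERM', 60)` so Sidekiq has time to shut down gracefully and requeue the in-flight job through `bulk_requeue`.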
data/tests/support/utils.rb ADDED
@@ -0,0 +1,26 @@
+def assert(text, actual, expected)
+  if actual == expected
+    puts "#{text}: #{actual} (Success)"
+  else
+    puts "#{text}: #{actual} (Failed). Expected: #{expected}"
+    exit 1
+  end
+end
+
+def spawn_workers(number)
+  pids = []
+
+  number.times do
+    pids << spawn('sidekiq -r ./config.rb')
+  end
+
+  pids
+end
+
+# Stop Sidekiq workers
+def stop_workers(pids)
+  pids.each do |pid|
+    Process.kill('KILL', pid)
+    Process.wait pid
+  end
+end
metadata CHANGED
@@ -1,15 +1,15 @@
 --- !ruby/object:Gem::Specification
 name: gitlab-sidekiq-fetcher
 version: !ruby/object:Gem::Version
-  version: 0.5.0.pre.alpha
+  version: 0.5.4
 platform: ruby
 authors:
 - TEA
 - GitLab
-autorequire:
+autorequire:
 bindir: bin
 cert_chain: []
-date: 2019-08-02 00:00:00.000000000 Z
+date: 2021-02-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: sidekiq
@@ -42,6 +42,7 @@ files:
 - gitlab-sidekiq-fetcher.gemspec
 - lib/sidekiq-reliable-fetch.rb
 - lib/sidekiq/base_reliable_fetch.rb
+- lib/sidekiq/interrupted_set.rb
 - lib/sidekiq/reliable_fetch.rb
 - lib/sidekiq/semi_reliable_fetch.rb
 - spec/base_reliable_fetch_spec.rb
@@ -50,18 +51,19 @@ files:
 - spec/semi_reliable_fetch_spec.rb
 - spec/spec_helper.rb
 - tests/README.md
-- tests/reliability_test/config.rb
-- tests/reliability_test/reliability_test.rb
-- tests/reliability_test/worker.rb
-- tests/retry_test/config.rb
-- tests/retry_test/retry_test.rb
-- tests/retry_test/simple_assert.rb
-- tests/retry_test/worker.rb
+- tests/interruption/config.rb
+- tests/interruption/test_kill_signal.rb
+- tests/interruption/test_term_signal.rb
+- tests/interruption/worker.rb
+- tests/reliability/config.rb
+- tests/reliability/reliability_test.rb
+- tests/reliability/worker.rb
+- tests/support/utils.rb
 homepage: https://gitlab.com/gitlab-org/sidekiq-reliable-fetch/
 licenses:
 - LGPL-3.0
 metadata: {}
-post_install_message:
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -72,12 +74,12 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - ">"
+  - - ">="
     - !ruby/object:Gem::Version
-      version: 1.3.1
+      version: '0'
 requirements: []
-rubygems_version: 3.0.3
-signing_key:
+rubygems_version: 3.1.4
+signing_key:
 specification_version: 4
 summary: Reliable fetch extension for Sidekiq
 test_files: []
data/tests/retry_test/retry_test.rb DELETED
@@ -1,40 +0,0 @@
-# frozen_string_literal: true
-
-require 'sidekiq'
-require 'sidekiq/util'
-require 'sidekiq/cli'
-require_relative 'config'
-require_relative 'simple_assert'
-
-NUM_WORKERS = RetryTestWorker::EXPECTED_NUM_TIMES_BEEN_RUN + 1
-
-Sidekiq.redis(&:flushdb)
-
-def spawn_workers
-  pids = []
-
-  NUM_WORKERS.times do
-    pids << spawn('sidekiq -r ./config.rb')
-  end
-
-  pids
-end
-
-pids = spawn_workers
-
-jid = RetryTestWorker.perform_async
-
-sleep 300
-
-Sidekiq.redis do |redis|
-  times_has_been_run = redis.get('times_has_been_run').to_i
-  assert "The job has been run", times_has_been_run, 2
-end
-
-assert "Found dead jobs", Sidekiq::DeadSet.new.size, 1
-
-# Stop Sidekiq workers
-pids.each do |pid|
-  Process.kill('KILL', pid)
-  Process.wait pid
-end
data/tests/retry_test/simple_assert.rb DELETED
@@ -1,8 +0,0 @@
-def assert(text, actual, expected)
-  if actual == expected
-    puts "#{text}: #{actual} (Success)"
-  else
-    puts "#{text}: #{actual} (Failed). Expected: #{expected}"
-    exit 1
-  end
-end
data/tests/retry_test/worker.rb DELETED
@@ -1,23 +0,0 @@
-# frozen_string_literal: true
-
-class RetryTestWorker
-  include Sidekiq::Worker
-
-  EXPECTED_NUM_TIMES_BEEN_RUN = 2
-
-  sidekiq_options retry: EXPECTED_NUM_TIMES_BEEN_RUN
-
-  sidekiq_retry_in do |count, exception|
-    1 # retry in one second
-  end
-
-  def perform
-    sleep 1
-
-    Sidekiq.redis do |redis|
-      redis.incr('times_has_been_run')
-    end
-
-    Process.kill('KILL', Process.pid) # Job suicide, OOM killer imitation
-  end
-end