minitest-distributed 0.1.2 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: bfef30635f4dd487b913e94e04a1098f27d27278af1f1adbb794a1f91f54e880
- data.tar.gz: cff09c3e45440e77a7f39dabcdc6faf3a99336cc05362ea1740c810c91bb227b
+ metadata.gz: 076d9467680eff44b28d42436648c5f4878fb5689bca7ddd9365bd05991a5352
+ data.tar.gz: '0068262916e339e61d4997eb478df915be5943d3a6b39e70fb02def9ba0d0dbb'
  SHA512:
- metadata.gz: b9a05e26c4b0a432d3b7c2d8aec342d141b0fc5b4a1b7a355b6e5e14609a85e834c60fde5d6bf32a85529d9678c0ce19e2a8e9feb1d6da49c761b77289c340b0
- data.tar.gz: 430c1c0da35af7e4db05580676de9ae4d911ecb0673d99535b07c00a72f600dd71e3e279a127e254bba2be13352579d1867ff376edaf1b91fd241b5b6da0b408
+ metadata.gz: 5d71ead8d7f352d9d628ec6682d2367804c63362e1e02e658ae3899b55486e1ff1bedd1cc10e23796a3373a0a3d0580b874646c4f7e3ec472675b8b85d88b001
+ data.tar.gz: cafd308bdad9ee0332323e0e826810fee2be91cbb5bdafa74ec8037891e9a6031ffa6f4f1e4777527e0f21784a48502848893661120145749629240164c3c077
@@ -11,6 +11,10 @@ AllCops:
  Exclude:
  - minitest-distributed.gemspec
 
+ # This cop is broken when using assignments
+ Layout/RescueEnsureAlignment:
+   Enabled: false
+
  ##### Sorbet cops
 
  Sorbet:
data/Gemfile CHANGED
@@ -1,7 +1,7 @@
  # frozen_string_literal: true
  source "https://rubygems.org"
 
- # Specify your gem's dependencies in minitest-stateful.gemspec
+ # Specify your gem's dependencies in minitest-distributed.gemspec
  gemspec
 
  gem "rake", "~> 12.0"
data/README.md CHANGED
@@ -63,8 +63,8 @@ them to fail.
 
  ### Other optional command line arguments
 
- - `--test-timeout=SECONDS` or `ENV[MINITEST_TEST_TIMEOUT]` (default: 30s): the
-   maximum amount a test is allowed to run before it times out. In a distributed
+ - `--test-timeout=SECONDS` or `ENV[MINITEST_TEST_TIMEOUT_SECONDS]` (default: 30s):
+   the maximum amount a test is allowed to run before it times out. In a distributed
    system, it's impossible to differentiate between a worker being slow and a
    worker being broken. When the timeout passes, the other workers will assume
    that the worker running the test has crashed, and will attempt to claim this
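A quick sketch of the renamed setting in use; the 60-second value and the idea of setting it from your own test setup are illustrative only, not something this diff prescribes:

```ruby
# Illustration only: the variable and flag names come from the README above,
# the value and the place you set it are up to you.
ENV['MINITEST_TEST_TIMEOUT_SECONDS'] ||= '60'

# Equivalent command line usage: pass --test-timeout=60 to the test run.
```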
@@ -92,24 +92,40 @@ other tests.
 
  ## Development
 
- After checking out the repo, run `bin/setup` to install dependencies. Then,
- run `rake test` to run the tests. You can also run `bin/console` for an
- interactive prompt that will allow you to experiment.
+ To bootstrap a local development environment:
 
- To install this gem onto your local machine, run `bundle exec rake install`.
- To release a new version, update the version number in `version.rb`, and then
- run `bundle exec rake release`, which will create a git tag for the version,
- push git commits and tags, and push the `.gem` file to
- [rubygems.org](https://rubygems.org).
+ - Run `bin/setup` to install dependencies.
+ - Start a Redis server by running `redis-server`, assuming you have Redis
+   installed locally and the binary is on your `PATH`. Alternatively, you can
+   set the `REDIS_URL` environment variable to point to a Redis instance running
+   elsewhere.
+ - Now, run `bin/rake test` to run the tests, and verify everything is working.
+ - You can also run `bin/console` for an interactive prompt that will allow you
+   to experiment.
+
+ ### Releasing a new version
+
+ - To install this gem onto your local machine, run `bin/rake install`.
+ - Only people at Shopify can release a new version to
+   [rubygems.org](https://rubygems.org). To do so, update the `VERSION` constant
+   in `version.rb`, and merge to master. Shipit will take care of building the
+   `.gem` bundle, and pushing it to rubygems.org.
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at https://github.com/Shopify/minitest-distributed. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/Shopify/minitest-distributed/blob/master/CODE_OF_CONDUCT.md).
+ Bug reports and pull requests are welcome on GitHub at
+ https://github.com/Shopify/minitest-distributed. This project is intended to
+ be a safe, welcoming space for collaboration, and contributors are expected to
+ adhere to the [code of
+ conduct](https://github.com/Shopify/minitest-distributed/blob/master/CODE_OF_CONDUCT.md).
 
  ## License
 
- The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
+ The gem is available as open source under the terms of the [MIT
+ License](https://opensource.org/licenses/MIT).
 
  ## Code of Conduct
 
- Everyone interacting in the Minitest::Stateful project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/Shopify/minitest-distributed/blob/master/CODE_OF_CONDUCT.md).
+ Everyone interacting in the `minitest-distributed` project's codebases, issue
+ trackers, chat rooms and mailing lists is expected to follow the [code of
+ conduct](https://github.com/Shopify/minitest-distributed/blob/master/CODE_OF_CONDUCT.md).
data/bin/setup CHANGED
@@ -4,5 +4,3 @@ IFS=$'\n\t'
  set -vx
 
  bundle install
-
- # Do any other automated setup that you need to do here
@@ -8,8 +8,8 @@ module Minitest
  module Distributed
  class Configuration < T::Struct
  DEFAULT_BATCH_SIZE = 10
- DEFAULT_MAX_ATTEMPTS = 3
- DEFAULT_TEST_TIMEOUT = 30.0 # seconds
+ DEFAULT_MAX_ATTEMPTS = 1
+ DEFAULT_TEST_TIMEOUT_SECONDS = 30.0 # seconds
 
  class << self
  extend T::Sig
@@ -20,11 +20,54 @@ module Minitest
  coordinator_uri: URI(env['MINITEST_COORDINATOR'] || 'memory:'),
  run_id: env['MINITEST_RUN_ID'] || SecureRandom.uuid,
  worker_id: env['MINITEST_WORKER_ID'] || SecureRandom.uuid,
- test_timeout: Float(env['MINITEST_TEST_TIMEOUT'] || DEFAULT_TEST_TIMEOUT),
+ test_timeout_seconds: Float(env['MINITEST_TEST_TIMEOUT_SECONDS'] || DEFAULT_TEST_TIMEOUT_SECONDS),
  test_batch_size: Integer(env['MINITEST_TEST_BATCH_SIZE'] || DEFAULT_BATCH_SIZE),
  max_attempts: Integer(env['MINITEST_MAX_ATTEMPTS'] || DEFAULT_MAX_ATTEMPTS),
+ max_failures: (max_failures_env = env['MINITEST_MAX_FAILURES']) ? Integer(max_failures_env) : nil,
  )
  end
+
+ sig { params(opts: OptionParser).returns(T.attached_class) }
+ def from_command_line_options(opts)
+ configuration = from_env
+
+ opts.on('--coordinator=URI', "The URI pointing to the coordinator") do |uri|
+ configuration.coordinator_uri = URI.parse(uri)
+ end
+
+ opts.on('--test-timeout=TIMEOUT', "The maximum run time for a single test in seconds") do |timeout|
+ configuration.test_timeout_seconds = Float(timeout)
+ end
+
+ opts.on('--max-attempts=ATTEMPTS', "The maximum number of attempts to run a test") do |attempts|
+ configuration.max_attempts = Integer(attempts)
+ end
+
+ opts.on('--test-batch-size=NUMBER', "The number of tests to process per batch") do |batch_size|
+ configuration.test_batch_size = Integer(batch_size)
+ end
+
+ opts.on('--max-failures=FAILURES', "The maximum allowed failure before aborting a run") do |failures|
+ configuration.max_failures = Integer(failures)
+ end
+
+ opts.on('--run-id=ID', "The ID for this run shared between coordinated workers") do |id|
+ configuration.run_id = id
+ end
+
+ opts.on('--worker-id=ID', "The unique ID for this worker") do |id|
+ configuration.worker_id = id
+ end
+
+ opts.on(
+ '--[no-]retry-failures', "Retry failed and errored tests from a previous run attempt " \
+ "with the same run ID (default: enabled)"
+ ) do |enabled|
+ configuration.retry_failures = enabled
+ end
+
+ configuration
+ end
  end
 
  extend T::Sig
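The new `from_command_line_options` hook builds on `from_env` and registers the flags above on an `OptionParser`. A minimal sketch of how a caller might wire it up; the require path and the parsed arguments are assumptions for illustration, not taken from this diff:

```ruby
require 'optparse'
require 'minitest/distributed' # assumed require path for the gem

# Hypothetical wiring: the gem registers its flags on the parser and returns a
# configuration seeded from the environment; parsing then mutates that struct.
opts = OptionParser.new
configuration = Minitest::Distributed::Configuration.from_command_line_options(opts)
opts.parse!(['--coordinator=redis://localhost:6379/1', '--max-failures=5'])

configuration.coordinator_uri # => URI for the Redis coordinator
configuration.max_failures    # => 5
```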
@@ -32,9 +75,11 @@ module Minitest
  prop :coordinator_uri, URI::Generic, default: URI('memory:')
  prop :run_id, String, factory: -> { SecureRandom.uuid }
  prop :worker_id, String, factory: -> { SecureRandom.uuid }
- prop :test_timeout, Float, default: DEFAULT_TEST_TIMEOUT
+ prop :test_timeout_seconds, Float, default: DEFAULT_TEST_TIMEOUT_SECONDS
  prop :test_batch_size, Integer, default: DEFAULT_BATCH_SIZE
  prop :max_attempts, Integer, default: DEFAULT_MAX_ATTEMPTS
+ prop :max_failures, T.nilable(Integer)
+ prop :retry_failures, T::Boolean, default: true
 
  sig { returns(Coordinators::CoordinatorInterface) }
  def coordinator
@@ -18,6 +18,9 @@ module Minitest
  sig { abstract.returns(ResultAggregate) }
  def combined_results; end
 
+ sig { abstract.returns(T::Boolean) }
+ def aborted?; end
+
  sig { abstract.params(test_selector: TestSelector).void }
  def produce(test_selector:); end
 
@@ -25,7 +25,8 @@ module Minitest
 
  @leader = T.let(Mutex.new, Mutex)
  @queue = T.let(Queue.new, Queue)
- @local_results = T.let(ResultAggregate.new, ResultAggregate)
+ @local_results = T.let(ResultAggregate.new(max_failures: configuration.max_failures), ResultAggregate)
+ @aborted = T.let(false, T::Boolean)
  end
 
  sig { override.params(reporter: Minitest::CompositeReporter, options: T::Hash[Symbol, T.untyped]).void }
@@ -33,6 +34,11 @@ module Minitest
  # No need for any additional reporters
  end
 
+ sig { override.returns(T::Boolean) }
+ def aborted?
+ @aborted
+ end
+
  sig { override.params(test_selector: TestSelector).void }
  def produce(test_selector:)
  if @leader.try_lock
@@ -41,24 +47,38 @@ module Minitest
  if tests.empty?
  queue.close
  else
- tests.each { |test| queue << test }
+ tests.each do |runnable|
+ queue << EnqueuedRunnable.new(
+ class_name: T.must(runnable.class.name),
+ method_name: runnable.name,
+ test_timeout_seconds: configuration.test_timeout_seconds,
+ max_attempts: configuration.max_attempts,
+ )
+ end
  end
  end
  end
 
  sig { override.params(reporter: AbstractReporter).void }
  def consume(reporter:)
- until queue.empty? && queue.closed?
- enqueued_runnable = queue.pop
+ until queue.closed?
+ enqueued_runnable = T.let(queue.pop, EnqueuedRunnable)
+
  reporter.prerecord(enqueued_runnable.runnable_class, enqueued_runnable.method_name)
- result = enqueued_runnable.run
 
- local_results.update_with_result(result)
- local_results.acks += 1
+ enqueued_result = enqueued_runnable.run do |initial_result|
+ if ResultType.of(initial_result) == ResultType::Requeued
+ queue << enqueued_runnable.next_attempt
+ end
+ EnqueuedRunnable::Result::Commit.success
+ end
 
- reporter.record(result)
+ reporter.record(enqueued_result.committed_result)
+ local_results.update_with_result(enqueued_result)
 
- queue.close if local_results.completed?
+ # We abort a run if we reach the maximum number of failures
+ queue.close if combined_results.abort?
+ queue.close if combined_results.complete?
  end
  end
  end
@@ -15,11 +15,19 @@ module Minitest
  # to the stream.
  #
  # After that, all workers will start consuming from the stream. They will first
- # try to claim stale entries from other workers (determined by the `test_timeout`
- # option), and process them tp to a maxumim of `max_attempts` attempts. Then,
+ # try to claim stale entries from other workers (determined by the `test_timeout_seconds`
+ # option), and process them up to a maximum of `max_attempts` attempts. Then,
  # they will consume tests from the stream, run them, and ack them. This is done
  # in batches to reduce load on Redis.
  #
+ # Retrying failed tests (up to `max_attempts` times) uses the same mechanism.
+ # When a test fails, and we haven't exhausted the maximum number of attempts, we
+ # do not ACK the result with Redis. This means that another worker will eventually
+ # claim the test, and run it again. However, in this case we don't want to slow
+ # things down unnecessarily. When a test fails and we want to retry it, we add the
+ # test to the `retry_set` in Redis. When another worker sees that a test is in this
+ # set, it can immediately claim the test, rather than waiting for the timeout.
+ #
  # Finally, when we have acked the same number of tests as we populated into the
  # queue, the run is considered finished. The first worker to detect this will
  # remove the consumergroup and the associated stream from Redis.
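In Redis terms, the retry set described above is a plain set under the run's key prefix, written with `SADD` by the failing worker and raced over with `SREM` by the others. A self-contained sketch of that handshake; the run and attempt identifiers are placeholders, not values the gem actually produces:

```ruby
require 'redis'

redis = Redis.new

# Placeholders: the gem derives these per run and per attempt.
run_id        = 'an-example-run-id'
attempt_id    = 'MyTest#test_something/attempt-2'
retry_set_key = "minitest/#{run_id}/retry_set"

# Failing worker: leave the stream entry un-acked, but flag it for a fast retry.
redis.sadd(retry_set_key, attempt_id)

# Other workers: whoever wins the SREM race claims the test immediately,
# instead of waiting for the idle timeout to expire.
claims_it = redis.srem(retry_set_key, attempt_id) # => true for exactly one worker
```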
@@ -47,7 +55,10 @@ module Minitest
  attr_reader :local_results
 
  sig { returns(T::Set[EnqueuedRunnable]) }
- attr_reader :reclaimed_tests
+ attr_reader :reclaimed_timeout_tests
+
+ sig { returns(T::Set[EnqueuedRunnable]) }
+ attr_reader :reclaimed_failed_tests
 
  sig { params(configuration: Configuration).void }
  def initialize(configuration:)
@@ -58,7 +69,9 @@ module Minitest
  @group_name = T.let('minitest-distributed', String)
  @local_results = T.let(ResultAggregate.new, ResultAggregate)
  @combined_results = T.let(nil, T.nilable(ResultAggregate))
- @reclaimed_tests = T.let(Set.new, T::Set[EnqueuedRunnable])
+ @reclaimed_timeout_tests = T.let(Set.new, T::Set[EnqueuedRunnable])
+ @reclaimed_failed_tests = T.let(Set.new, T::Set[EnqueuedRunnable])
+ @aborted = T.let(false, T::Boolean)
  end
 
  sig { override.params(reporter: Minitest::CompositeReporter, options: T::Hash[Symbol, T.untyped]).void }
@@ -70,41 +83,51 @@ module Minitest
  def combined_results
  @combined_results ||= begin
  stats_as_string = redis.mget(key('runs'), key('assertions'), key('passes'),
- key('failures'), key('errors'), key('skips'), key('reruns'), key('acks'), key('size'))
+ key('failures'), key('errors'), key('skips'), key('requeues'), key('discards'),
+ key('acks'), key('size'))
 
  ResultAggregate.new(
+ max_failures: configuration.max_failures,
+
  runs: Integer(stats_as_string.fetch(0) || 0),
  assertions: Integer(stats_as_string.fetch(1) || 0),
  passes: Integer(stats_as_string.fetch(2) || 0),
  failures: Integer(stats_as_string.fetch(3) || 0),
  errors: Integer(stats_as_string.fetch(4) || 0),
  skips: Integer(stats_as_string.fetch(5) || 0),
- reruns: Integer(stats_as_string.fetch(6) || 0),
- acks: Integer(stats_as_string.fetch(7) || 0),
+ requeues: Integer(stats_as_string.fetch(6) || 0),
+ discards: Integer(stats_as_string.fetch(7) || 0),
+ acks: Integer(stats_as_string.fetch(8) || 0),
 
- # In the case where we have no build szie number published yet, we initialize
+ # In the case where we have no build size number published yet, we initialize
  # the size of the test suite to be arbitrarily large, to make sure it is
  # higher than the number of acks, so the run is not considered complete yet.
- size: Integer(stats_as_string.fetch(8) || 2_147_483_647),
+ size: Integer(stats_as_string.fetch(9) || 2_147_483_647),
  )
  end
  end
 
+ sig { override.returns(T::Boolean) }
+ def aborted?
+ @aborted
+ end
+
  sig { override.params(test_selector: TestSelector).void }
  def produce(test_selector:)
  # Whoever ends up creating the consumer group will act as leader,
  # and publish the list of tests to the stream.
 
- begin
+ initial_attempt = begin
  # When using `redis.multi`, the second DEL command gets executed even if the initial GROUP
  # fails. This is bad, because only the leader should be issuing the DEL command.
  # When using EVAL and a Lua script, the script aborts after the first XGROUP command
  # fails, and the DEL never gets executed for followers.
- redis.evalsha(
+ keys_deleted = redis.evalsha(
  register_consumergroup_script,
  keys: [stream_key, key('size'), key('acks')],
  argv: [group_name],
  )
+ keys_deleted == 0
 
  rescue Redis::CommandError => ce
  if ce.message.include?('BUSYGROUP')
@@ -118,38 +141,67 @@ module Minitest
  end
  end
 
- run_attempt, previous_failures, previous_errors, _deleted = redis.multi do
- redis.incr(key('attempt'))
- redis.lrange(key('failure_list'), 0, -1)
- redis.lrange(key('error_list'), 0, -1)
- redis.del(key('failure_list'), key('error_list'))
- end
-
- tests = if run_attempt == 1
+ tests = T.let([], T::Array[Minitest::Runnable])
+ tests = if initial_attempt
  # If this is the first attempt for this run ID, we will schedule the full
  # test suite as returned by the test selector to run.
- test_selector.tests
- else
- # For subsequent attempts, we check the list of previous failures and
- # errors, and only schedule to re-run those tests. This allows for faster
- # retries of potentially flaky tests.
- (previous_failures + previous_errors).map do |test_to_retry|
- EnqueuedRunnable.from_hash!(Marshal.load(test_to_retry))
+
+ tests_from_selector = test_selector.tests
+ adjust_combined_results(ResultAggregate.new(size: tests_from_selector.size))
+ tests_from_selector
+
+ elsif configuration.retry_failures
+ # Before starting a retry attempt, we first check if the previous attempt
+ # was aborted before it was completed. If this is the case, we cannot use
+ # retry mode, and should immediately fail the attempt.
+ if combined_results.abort?
+ # We mark this run as aborted, which causes this worker to not be successful.
+ @aborted = true
+
+ # We still publish an empty size run to Redis, so if there are any followers,
+ # they will wind down normally. Only the leader will exit
+ # with a non-zero exit status and fail the build; any follower will
+ # exit with status 0.
+ adjust_combined_results(ResultAggregate.new(size: 0))
+ T.let([], T::Array[Minitest::Runnable])
+ else
+ previous_failures, previous_errors, _deleted = redis.multi do
+ redis.lrange(list_key(ResultType::Failed.serialize), 0, -1)
+ redis.lrange(list_key(ResultType::Error.serialize), 0, -1)
+ redis.del(list_key(ResultType::Failed.serialize), list_key(ResultType::Error.serialize))
+ end
+
+ # We set the `size` key to the number of tests we are planning to schedule.
+ # We also adjust the number of failures and errors back to 0.
+ # We set the number of requeues to the number of tests that failed, so the
+ # run statistics will reflect that we retried some failed tests.
+ #
+ # However, normally requeues are not acked, as we expect the test to be acked
+ # by another worker later. This makes the test loop think it is already done.
+ # To prevent this, we initialize the number of acks negatively, so it evens out
+ # in the statistics.
+ total_failures = previous_failures.length + previous_errors.length
+ adjust_combined_results(ResultAggregate.new(
+ size: total_failures,
+ failures: -previous_failures.length,
+ errors: -previous_errors.length,
+ requeues: total_failures,
+ ))
+
+ # For subsequent attempts, we check the list of previous failures and
+ # errors, and only schedule to re-run those tests. This allows for faster
+ # retries of potentially flaky tests.
+ test_identifiers_to_retry = T.let(previous_failures + previous_errors, T::Array[String])
+ test_identifiers_to_retry.map { |identifier| DefinedRunnable.from_identifier(identifier) }
  end
+ else
+ adjust_combined_results(ResultAggregate.new(size: 0))
+ T.let([], T::Array[Minitest::Runnable])
  end
 
- # We set the `size` key to the number of tests we are planning to schedule.
- # This will allow workers to tell when the run is done. We also adjust the
- # number of failures and errors in case of a retry run.
- adjust_combined_results(ResultAggregate.new(
- size: tests.size,
- failures: -previous_failures.length,
- errors: -previous_errors.length,
- reruns: previous_failures.length + previous_errors.length,
- ))
-
- # TODO: break this up in batches.
- tests.each { |test| redis.xadd(stream_key, test.serialize) }
+ redis.pipelined do
+ tests.each { |test| redis.xadd(stream_key, class_name: T.must(test.class.name), method_name: test.name) }
+ end
  end
 
  sig { override.params(reporter: AbstractReporter).void }
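As a worked example of the counter adjustment above (numbers invented): if attempt one left 3 failures and 1 error behind, the retry leader publishes `size: 4`, `failures: -3`, `errors: -1`, `requeues: 4`, so the combined statistics end up describing only the final outcome:

```ruby
# Worked example of the retry-attempt bookkeeping; all values are invented.
combined   = { size: 20, failures: 3, errors: 1, requeues: 0 } # left over from attempt 1
adjustment = { size: 4, failures: -3, errors: -1, requeues: 4 } # published by the retry leader

combined.merge!(adjustment) { |_key, old, new| old + new }
# => { size: 24, failures: 0, errors: 0, requeues: 4 }
```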
@@ -158,26 +210,29 @@ module Minitest
  loop do
  # First, see if there are any pending tests from other workers to claim.
  stale_runnables = claim_stale_runnables
- stale_processed = process_batch(stale_runnables, reporter)
+ process_batch(stale_runnables, reporter)
 
- # Finally, try to process a regular batch of messages
+ # Then, try to process a regular batch of messages
  fresh_runnables = claim_fresh_runnables(block: exponential_backoff)
- fresh_processed = process_batch(fresh_runnables, reporter)
+ process_batch(fresh_runnables, reporter)
 
  # If we have acked the same amount of tests as we were supposed to, the run
  # is complete and we can exit our loop. Generally, only one worker will detect
  # this condition. The other workers will quit their consumer loop because the
  # consumergroup will be deleted by the first worker, and their Redis commands
  # will start to fail - see the rescue block below.
- break if combined_results.completed?
+ break if combined_results.complete?
+
+ # We also abort a run if we reach the maximum number of failures
+ break if combined_results.abort?
 
  # To make sure we don't end up in a busy loop overwhelming Redis with commands
  # when there is no work to do, we increase the blocking time exponentially,
- # and reset it to the initial value if we processed any messages
- if stale_processed > 0 || fresh_processed > 0
- exponential_backoff = INITIAL_BACKOFF
- else
+ # and reset it to the initial value if we processed any tests.
+ if stale_runnables.empty? && fresh_runnables.empty?
  exponential_backoff <<= 1
+ else
+ exponential_backoff = INITIAL_BACKOFF
  end
  end
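The blocking time starts at `INITIAL_BACKOFF` (10 ms, defined at the end of this file) and is doubled with `<<= 1` on every idle iteration, so an idle worker blocks 10, 20, 40, 80, ... ms on XREADGROUP and snaps back to 10 ms as soon as it processes something. A tiny sketch of that arithmetic:

```ruby
# Sketch of the backoff progression used in the consume loop above.
INITIAL_BACKOFF = 10 # milliseconds
backoff = INITIAL_BACKOFF

4.times do
  puts backoff  # => 10, 20, 40, 80
  backoff <<= 1 # double the XREADGROUP block time while nothing is claimed
end

backoff = INITIAL_BACKOFF # reset as soon as a batch contained work
```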
 
@@ -203,28 +258,20 @@ module Minitest
  @redis ||= Redis.new(url: configuration.coordinator_uri)
  end
 
- sig { returns(String) }
- def ack_batch_script
- @ack_batch_script = T.let(@ack_batch_script, T.nilable(String))
- @ack_batch_script ||= redis.script(:load, <<~LUA)
- local acked_ids, acked, i = {}, 0, 2
- while ARGV[i] do
- if redis.call('XACK', KEYS[1], ARGV[1], ARGV[i]) > 0 then
- acked = acked + 1
- acked_ids[acked] = ARGV[i]
- end
- i = i + 1
- end
- return acked_ids
- LUA
- end
-
  sig { returns(String) }
  def register_consumergroup_script
  @register_consumergroup_script = T.let(@register_consumergroup_script, T.nilable(String))
  @register_consumergroup_script ||= redis.script(:load, <<~LUA)
+ -- Try to create the consumergroup. This will raise an error if the
+ -- consumergroup has already been registered by somebody else, which
+ -- means another worker will be acting as leader.
+ -- In that case, the next Redis DEL call will not be executed.
  redis.call('XGROUP', 'CREATE', KEYS[1], ARGV[1], '0', 'MKSTREAM')
- redis.call('DEL', KEYS[2], KEYS[3])
+
+ -- The leader should reset the size and acks key for this run attempt.
+ -- We return the number of keys that were deleted, which can be used to
+ -- determine whether this was the first attempt for this run or not.
+ return redis.call('DEL', KEYS[2], KEYS[3])
  LUA
  end
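Read together with `produce` above, the script answers two questions in one round trip: whether this worker created the consumer group (leader vs. follower), and whether `DEL` removed leftover counters (retry vs. first attempt). A self-contained sketch of how that return value can be interpreted; the key and group names are placeholders:

```ruby
require 'redis'

redis = Redis.new

# Same script body as register_consumergroup_script above.
script_sha = redis.script(:load, <<~LUA)
  redis.call('XGROUP', 'CREATE', KEYS[1], ARGV[1], '0', 'MKSTREAM')
  return redis.call('DEL', KEYS[2], KEYS[3])
LUA

begin
  keys_deleted = redis.evalsha(script_sha,
    keys: ['minitest/example-run/stream', 'minitest/example-run/size', 'minitest/example-run/acks'],
    argv: ['minitest-distributed'])
  leader = true
  initial_attempt = keys_deleted == 0 # nothing left to delete => first attempt for this run ID
rescue Redis::CommandError => e
  raise unless e.message.include?('BUSYGROUP')
  leader = false # another worker registered the consumer group first
end
```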
 
@@ -232,51 +279,119 @@ module Minitest
  def claim_fresh_runnables(block:)
  result = redis.xreadgroup(group_name, configuration.worker_id, stream_key, '>',
  block: block, count: configuration.test_batch_size)
- EnqueuedRunnable.from_redis_stream_claim(result.fetch(stream_key, []))
+ EnqueuedRunnable.from_redis_stream_claim(result.fetch(stream_key, []), configuration: configuration)
+ end
+
+ sig do
+ params(
+ pending_messages: T::Hash[String, PendingExecution],
+ max_idle_time_ms: Integer,
+ ).returns(T::Array[EnqueuedRunnable])
+ end
+ def xclaim_messages(pending_messages, max_idle_time_ms:)
+ return [] if pending_messages.empty?
+ claimed = redis.xclaim(stream_key, group_name, configuration.worker_id,
+ max_idle_time_ms, pending_messages.keys)
+
+ EnqueuedRunnable.from_redis_stream_claim(claimed, pending_messages, configuration: configuration)
  end
 
  sig { returns(T::Array[EnqueuedRunnable]) }
  def claim_stale_runnables
- # When we have to reclaim stale tests, those test are potentially too slow
- # to run inside the test timeout. We only claim one test at a time in order
- # to prevent the exact same batch from being too slow on repeated attempts,
- # which would cause us to mark all the tests in that batch as failed.
- #
- # This has the side effect that for a retried test, the test timeout
- # will be TEST_TIMEOUT * BATCH_SIZE in practice. This gives us a higher
- # likelihood that the test will pass if the batch size > 1.
- pending = redis.xpending(stream_key, group_name, '-', '+', 1)
-
- # Every test is allowed to take test_timeout milliseconds. Because we process tests in
- # batches, they should never be pending for TEST_TIMEOUT * BATCH_SIZE milliseconds.
+ # Every test is allowed to take test_timeout_seconds. Because we process tests in
+ # batches, they should never be pending for TEST_TIMEOUT_SECONDS * BATCH_SIZE seconds.
  # So, only try to claim messages older than that, with a bit of jitter.
- max_idle_time = Integer(configuration.test_timeout * configuration.test_batch_size * 1000)
- max_idle_time_with_jitter = max_idle_time * rand(1.0...1.2)
- to_claim = pending.each_with_object({}) do |message, hash|
- if message['elapsed'] > max_idle_time_with_jitter
- hash[message.fetch('entry_id')] = message
+ max_idle_time_ms = Integer(configuration.test_timeout_seconds * configuration.test_batch_size * 1000)
+ max_idle_time_ms_with_jitter = max_idle_time_ms * rand(1.0...1.2)
+
+ # Find all the pending messages to see if we want to attempt to claim some.
+ pending = redis.xpending(stream_key, group_name, '-', '+', configuration.test_batch_size)
+ return [] if pending.empty?
+
+ active_consumers = Set[configuration.worker_id]
+
+ stale_messages = {}
+ active_messages = {}
+ pending.each do |msg|
+ message = PendingExecution.from_xpending(msg)
+ if message.elapsed_time_ms < max_idle_time_ms_with_jitter
+ active_consumers << message.worker_id
+ active_messages[message.entry_id] = message
+ else
+ stale_messages[message.entry_id] = message
  end
  end
 
- if to_claim.empty?
- []
- else
- claimed = redis.xclaim(stream_key, group_name, configuration.worker_id, max_idle_time, to_claim.keys)
- enqueued_runnables = EnqueuedRunnable.from_redis_stream_claim(claimed)
- enqueued_runnables.each do |er|
- # `count` will be set to the current attempt of a different worker that has timed out.
- # The attempt we are going to try will be the next one, so add one.
- attempt = to_claim.fetch(er.execution_id).fetch('count') + 1
- if attempt > configuration.max_attempts
- # If we exhaust our attempts, we will mark the test to immediately fail when it will be run next.
- mark_runnable_to_fail_immediately(er)
- else
- reclaimed_tests << er
+ # If we only have evidence of one active consumer based on the pending message,
+ # we will query Redis for all consumers to make sure we have full data.
+ # We can skip this if we already know that there is more than one active one.
+ if active_consumers.size == 1
+ begin
+ redis.xinfo('consumers', stream_key, group_name).each do |consumer|
+ if consumer.fetch('idle') < max_idle_time_ms
+ active_consumers << consumer.fetch('name')
+ end
  end
+ rescue Redis::CommandError
+ # This command can fail, specifically during the cleanup phase at the end
+ # of a build, when another worker has removed the stream key already.
  end
+ end
+
+ # Now, see if we want to claim any stale messages. If we are the only active
+ # consumer, we want to claim our own messages as well as messages from other
+ # (stale) consumers. If there are multiple active consumers, we are going to
+ # let another consumer claim our own messages.
+ if active_consumers.size > 1
+ stale_messages.reject! { |_key, message| message.worker_id == configuration.worker_id }
+ end
+
+ unless stale_messages.empty?
+ # When we have to reclaim stale tests, those tests are potentially too slow
+ # to run inside the test timeout. We only claim one timed out test at a time in order
+ # to prevent the exact same batch from being too slow on repeated attempts,
+ # which would cause us to mark all the tests in that batch as failed.
+ #
+ # This has the side effect that for a retried test, the test timeout
+ # will be TEST_TIMEOUT_SECONDS * BATCH_SIZE in practice. This gives us a higher
+ # likelihood that the test will pass if the batch size > 1.
+ stale_messages = stale_messages.slice(stale_messages.keys.first)
+
+ enqueued_runnables = xclaim_messages(stale_messages, max_idle_time_ms: max_idle_time_ms)
+ reclaimed_timeout_tests.merge(enqueued_runnables)
+ return enqueued_runnables
+ end
+
+ # Now, see if we want to claim any failed tests to retry. Again, if we are the only
+ # active consumer, we want to claim our own messages as well as messages from other
+ # (stale) consumers. If there are multiple active consumers, we are going to let
+ # another consumer claim our own messages.
+ if active_consumers.size > 1
+ active_messages.reject! { |_key, message| message.worker_id == configuration.worker_id }
+ end
 
- enqueued_runnables
+ # For all the active messages, we can check whether they are marked for a retry by
+ # trying to remove the test from the retry set in Redis. Only one worker will be
+ # able to remove the entry from the set, so only one worker will end up trying to
+ # claim the test for the next attempt.
+ #
+ # We use `redis.multi` so we only need one round-trip for the entire list. Note that
+ # this is not an atomic operation with the XCLAIM call. This is OK, because the retry
+ # set is only there to speed things up and prevent us from having to wait for the test
+ # timeout. If the worker crashes between removing an item from the retry set, the test
+ # will eventually be picked up by another worker.
+ messages_in_retry_set = {}
+ redis.multi do
+ active_messages.each do |key, message|
+ messages_in_retry_set[key] = redis.srem(key('retry_set'), message.attempt_id)
+ end
  end
+
+ # Now, we only select the messages that were on the retry set, and try to claim them.
+ active_messages.keep_if { |key, _value| messages_in_retry_set.fetch(key).value }
+ enqueued_runnables = xclaim_messages(active_messages, max_idle_time_ms: 0)
+ reclaimed_failed_tests.merge(enqueued_runnables)
+ enqueued_runnables
  end
 
  sig { void }
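Concretely, the staleness threshold above works out to `test_timeout_seconds * test_batch_size * 1000` milliseconds plus up to 20% jitter, so with the defaults from `Configuration` (30 s timeout, batches of 10) an entry only counts as stale after roughly 300 to 360 seconds. A small sketch of that computation:

```ruby
# Worked example of the idle threshold, using the defaults from Configuration.
test_timeout_seconds = 30.0
test_batch_size      = 10

max_idle_time_ms             = Integer(test_timeout_seconds * test_batch_size * 1000) # => 300000
max_idle_time_ms_with_jitter = max_idle_time_ms * rand(1.0...1.2)                     # 300000.0...360000.0
```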
@@ -288,18 +403,6 @@ module Minitest
  # so we can assume that all the Redis cleanup was completed.
  end
 
- sig { params(er: EnqueuedRunnable).void }
- def mark_runnable_to_fail_immediately(er)
- assertion = Minitest::Assertion.new(<<~EOM.chomp)
- This test takes too long to run (> #{configuration.test_timeout}s).
-
- We have tried running this test #{configuration.max_attempts} on different workers, but every time the worker has not reported back a result within #{configuration.test_timeout}ms.
- Try to make the test faster, or increase the test timeout.
- EOM
- assertion.set_backtrace(caller)
- er.canned_failure = assertion
- end
-
  sig { params(results: ResultAggregate).void }
  def adjust_combined_results(results)
  updated = redis.multi do
@@ -309,14 +412,16 @@ module Minitest
  redis.incrby(key('failures'), results.failures)
  redis.incrby(key('errors'), results.errors)
  redis.incrby(key('skips'), results.skips)
- redis.incrby(key('reruns'), results.reruns)
+ redis.incrby(key('requeues'), results.requeues)
+ redis.incrby(key('discards'), results.discards)
  redis.incrby(key('acks'), results.acks)
  redis.incrby(key('size'), results.size)
  end
 
- @combined_results = ResultAggregate.new(runs: updated[0], assertions: updated[1], passes: updated[2],
- failures: updated[3], errors: updated[4], skips: updated[5], reruns: updated[6],
- acks: updated[7], size: updated[8])
+ @combined_results = ResultAggregate.new(max_failures: configuration.max_failures,
+ runs: updated[0], assertions: updated[1], passes: updated[2],
+ failures: updated[3], errors: updated[4], skips: updated[5], requeues: updated[6], discards: updated[7],
+ acks: updated[8], size: updated[9])
  end
 
  sig { params(name: String).returns(String) }
@@ -324,59 +429,56 @@ module Minitest
  "minitest/#{configuration.run_id}/#{name}"
  end
 
- sig { params(batch: T::Array[EnqueuedRunnable], reporter: AbstractReporter).returns(Integer) }
+ sig { params(name: String).returns(String) }
+ def list_key(name)
+ key("#{name}_list")
+ end
+
+ sig { params(batch: T::Array[EnqueuedRunnable], reporter: AbstractReporter).void }
  def process_batch(batch, reporter)
- to_be_acked = {}
+ return 0 if batch.empty?
+
+ local_results.size += batch.size
+
+ runnable_results = T.let([], T::Array[EnqueuedRunnable::Result])
+ redis.multi do
+ batch.each do |enqueued_runnable|
+ # Fulfill the reporter contract by calling `prerecord` before we run the test.
+ reporter.prerecord(enqueued_runnable.runnable_class, enqueued_runnable.method_name)
+
+ # Actually run the test!
+ runnable_results << enqueued_runnable.run do |initial_result|
+ if ResultType.of(initial_result) == ResultType::Requeued
+ sadd_future = redis.sadd(key('retry_set'), enqueued_runnable.attempt_id)
+ EnqueuedRunnable::Result::Commit.new { sadd_future.value }
+ else
+ xack_future = redis.xack(stream_key, group_name, enqueued_runnable.entry_id)
+ EnqueuedRunnable::Result::Commit.new { xack_future.value == 1 }
+ end
+ end
+ end
+ end
+
+ batch_result_aggregate = ResultAggregate.new
+ runnable_results.each do |runnable_result|
+ # Complete the reporter contract by calling `record` with the result.
+ reporter.record(runnable_result.committed_result)
 
- batch.each do |enqueued_runnable|
- local_results.size += 1
- reporter.prerecord(enqueued_runnable.runnable_class, enqueued_runnable.method_name)
- result = enqueued_runnable.run
+ # Update statistics.
+ batch_result_aggregate.update_with_result(runnable_result)
+ local_results.update_with_result(runnable_result)
 
- case (result_type = ResultType.of(result))
- when ResultType::Passed
+ case (result_type = ResultType.of(runnable_result.committed_result))
+ when ResultType::Skipped, ResultType::Failed, ResultType::Error
+ redis.lpush(list_key(result_type.serialize), runnable_result.enqueued_runnable.identifier)
+ when ResultType::Passed, ResultType::Requeued, ResultType::Discarded
  # noop
- when ResultType::Skipped
- redis.lpush(key('skip_list'), Marshal.dump(enqueued_runnable.serialize))
- when ResultType::Failed
- redis.lpush(key('failure_list'), Marshal.dump(enqueued_runnable.serialize))
- when ResultType::Error
- redis.lpush(key('error_list'), Marshal.dump(enqueued_runnable.serialize))
  else
  T.absurd(result_type)
  end
-
- local_results.update_with_result(result)
- to_be_acked[enqueued_runnable.execution_id] = result
- end
-
- return 0 if to_be_acked.empty?
-
- acked = redis.evalsha(
- ack_batch_script,
- keys: [stream_key],
- argv: [group_name] + to_be_acked.keys
- )
-
- batch_results = ResultAggregate.new(acks: acked.length)
- acked.each do |execution_id|
- acked_result = to_be_acked.delete(execution_id)
- reporter.record(acked_result)
- batch_results.update_with_result(acked_result)
- end
-
- to_be_acked.each do |_execution_id, unacked_result|
- # TODO: use custom assertion class.
- discard_assertion = Minitest::Skip.new("The test result was discarded, " \
- "because the test has been claimed another worker.")
- discard_assertion.set_backtrace(caller)
- unacked_result.failures = [discard_assertion]
- reporter.record(unacked_result)
  end
 
- adjust_combined_results(batch_results)
- local_results.acks += acked.length
- acked.length
+ adjust_combined_results(batch_result_aggregate)
  end
 
  INITIAL_BACKOFF = 10 # milliseconds
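The rewritten `process_batch` issues its `XACK`/`SADD` calls inside one `redis.multi` block, so each call returns a redis-rb future whose `value` is only read after the transaction is sent; the commit blocks above rely on exactly that. A stripped-down sketch of the future pattern in the redis-rb 4.x style used in this diff; the stream, group, and entry ID are placeholders:

```ruby
require 'redis'

redis = Redis.new

# Inside a MULTI block, commands return Redis::Future objects immediately.
xack_future = nil
redis.multi do
  xack_future = redis.xack('example-stream', 'example-group', '1526919030474-55')
end

# After the block, the transaction has executed and the future is resolved.
acked = xack_future.value == 1
```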