nogara-redis_failover 0.9.7 → 0.9.7.2
- data/Changes.md +17 -0
- data/README.md +44 -18
- data/examples/config.yml +3 -0
- data/lib/redis_failover.rb +4 -1
- data/lib/redis_failover/cli.rb +25 -2
- data/lib/redis_failover/client.rb +18 -11
- data/lib/redis_failover/errors.rb +0 -4
- data/lib/redis_failover/failover_strategy.rb +25 -0
- data/lib/redis_failover/failover_strategy/latency.rb +21 -0
- data/lib/redis_failover/manual_failover.rb +16 -4
- data/lib/redis_failover/node.rb +2 -1
- data/lib/redis_failover/node_manager.rb +408 -138
- data/lib/redis_failover/node_snapshot.rb +81 -0
- data/lib/redis_failover/node_strategy.rb +34 -0
- data/lib/redis_failover/node_strategy/consensus.rb +18 -0
- data/lib/redis_failover/node_strategy/majority.rb +18 -0
- data/lib/redis_failover/node_strategy/single.rb +17 -0
- data/lib/redis_failover/node_watcher.rb +22 -14
- data/lib/redis_failover/util.rb +2 -2
- data/lib/redis_failover/version.rb +1 -1
- data/redis_failover.gemspec +1 -1
- data/spec/failover_strategy/latency_spec.rb +41 -0
- data/spec/failover_strategy_spec.rb +17 -0
- data/spec/node_snapshot_spec.rb +30 -0
- data/spec/node_strategy/consensus_spec.rb +30 -0
- data/spec/node_strategy/majority_spec.rb +22 -0
- data/spec/node_strategy/single_spec.rb +22 -0
- data/spec/node_strategy_spec.rb +22 -0
- data/spec/node_watcher_spec.rb +2 -2
- data/spec/spec_helper.rb +2 -1
- data/spec/support/node_manager_stub.rb +29 -8
- metadata +33 -6
data/Changes.md
CHANGED
@@ -1,3 +1,20 @@
+HEAD
+-----------
+- redis_failover now supports distributed monitoring among the node managers! Previously, the node managers were only used
+  as a means of redundancy in case a particular node manager crashed. Starting with version 1.0 of redis_failover, the node
+  managers will all periodically report their health snapshots. The primary node manager will utilize a configurable
+  "node strategy" to determine if a particular node is available or unavailable.
+- redis_failover now supports a configurable "failover strategy" that's consulted when performing a failover. Currently,
+  a single strategy is provided that takes into account the average latency of the last health check to the redis server.
+
+0.9.7.2
+-----------
+- Add support for Redis#client's location method. Fixes a compatibility issue between redis_failover and Sidekiq.
+
+0.9.7.1
+-----------
+- Stop repeated attempts to acquire exclusive lock in Node Manager (#36)
+
 0.9.7
 -----------
 - Stubbed Client#client to return itself, fixes a fork reconnect bug with Resque (dbalatero)
data/README.md
CHANGED
@@ -2,7 +2,7 @@
 
 [![Build Status](https://secure.travis-ci.org/ryanlecompte/redis_failover.png?branch=master)](http://travis-ci.org/ryanlecompte/redis_failover)
 
-redis_failover attempts to provides a full automatic master/slave failover solution for Ruby. Redis does not provide
+redis_failover attempts to provide a full automatic master/slave failover solution for Ruby. Redis does not currently provide
 an automatic failover capability when configured for master/slave replication. When the master node dies,
 a new master must be manually brought online and assigned as the slave's new master. This manual
 switch-over is not desirable in high traffic sites where Redis is a critical part of the overall
@@ -10,8 +10,8 @@ architecture. The existing standard Redis client for Ruby also only supports con
 Redis server. When using master/slave replication, it is desirable to have all writes go to the
 master, and all reads go to one of the N configured slaves.
 
-This gem (built using [ZK][]) attempts to address these failover scenarios.
-
+This gem (built using [ZK][]) attempts to address these failover scenarios. One or more Node Manager daemons run as background
+processes and monitor all of your configured master/slave nodes. When the daemon starts up, it
 automatically discovers the current master/slaves. Background watchers are setup for each of
 the redis nodes. As soon as a node is detected as being offline, it will be moved to an "unavailable" state.
 If the node that went offline was the master, then one of the slaves will be promoted as the new master.
@@ -22,8 +22,10 @@ nodes. Note that detection of a node going down should be nearly instantaneous,
 used to keep tabs on a node is via a blocking Redis BLPOP call (no polling). This call fails nearly
 immediately when the node actually goes offline. To avoid false positives (i.e., intermittent flaky
 network interruption), the Node Manager will only mark a node as unavailable if it fails to communicate with
-it 3 times (this is configurable via --max-failures, see configuration options below). Note that you can
-deploy multiple Node Manager daemons
+it 3 times (this is configurable via --max-failures, see configuration options below). Note that you can (and should)
+deploy multiple Node Manager daemons since they each report periodic health snapshots of the redis servers. A
+"node strategy" is used to determine if a node is actually unavailable. By default, a majority strategy is used, but
+you can also configure "consensus" or "single".
 
 This gem provides a RedisFailover::Client wrapper that is master/slave aware. The client is configured
 with a list of ZooKeeper servers. The client will automatically contact the ZooKeeper cluster to find out
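The majority strategy mentioned above ships in node_strategy/majority.rb (listed in the file summary but not reproduced in this extract). As a rough sketch only of how such a strategy can be written: the determine_state(node, snapshots) contract is visible later in the node_manager.rb diff, while the snapshot counters used here are assumed helper names, not confirmed API.

    module RedisFailover
      class NodeStrategy
        # Sketch of a majority-style strategy. available_count and
        # unavailable_count are assumed NodeSnapshot helpers (hypothetical).
        class Majority < NodeStrategy
          def determine_state(node, snapshots)
            snapshot = snapshots[node]
            if snapshot.unavailable_count > snapshot.available_count
              :unavailable
            else
              :available
            end
          end
        end
      end
    end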
@@ -64,15 +66,20 @@ following options:
 
     Usage: redis_node_manager [OPTIONS]
+
     Specific options:
-        -n, --nodes NODES
-        -z, --zkservers SERVERS
-        -p, --password PASSWORD
-            --znode-path PATH
-            --max-failures COUNT
-        -C, --config PATH
-
-
+        -n, --nodes NODES                Comma-separated redis host:port pairs
+        -z, --zkservers SERVERS          Comma-separated ZooKeeper host:port pairs
+        -p, --password PASSWORD          Redis password
+            --znode-path PATH            Znode path override for storing redis server list
+            --max-failures COUNT         Max failures before manager marks node unavailable
+        -C, --config PATH                Path to YAML config file
+            --with-chroot ROOT           Path to ZooKeeper's chroot
+        -E, --environment ENV            Config environment to use
+            --node-strategy STRATEGY     Strategy used when determining availability of nodes (default: majority)
+            --failover-strategy STRATEGY Strategy used when failing over to a new node (default: latency)
+            --required-node-managers COUNT  Required Node Managers that must be reachable to determine node state (default: 1)
+        -h, --help                       Display all options
 
 To start the daemon for a simple master/slave configuration, use the following:
@@ -103,10 +110,12 @@ directory for configuration file samples.
 
 The Node Manager will automatically discover the master/slaves upon startup. Note that it is
 a good idea to run more than one instance of the Node Manager daemon in your environment. At
-any moment, a single Node Manager process will be designated to
+any moment, a single Node Manager process will be designated to manage the redis servers. If
 this Node Manager process dies or becomes partitioned from the network, another Node Manager
-will be promoted as the primary
-processes as you'd like
+will be promoted as the primary manager of redis servers. You can run as many Node Manager
+processes as you'd like. Every Node Manager periodically records health "snapshots" which the
+primary/master Node Manager consults when determining if it should officially mark a redis
+server as unavailable. By default, a majority strategy is used.
 
 ## Client Usage
@@ -149,6 +158,25 @@ server passed to #manual_failover, or it will pick a random slave to become the
     client = RedisFailover::Client.new(:zkservers => 'localhost:2181,localhost:2182,localhost:2183')
     client.manual_failover(:host => 'localhost', :port => 2222)
 
+## Node & Failover Strategies
+
+As of redis_failover version 1.0, the notion of "node" and "failover" strategies exists. All running Node Managers will periodically record
+"snapshots" of their view of the redis nodes. The primary Node Manager will process these snapshots from all of the Node Managers by running a configurable
+node strategy. By default, a majority strategy is used. This means that if a majority of Node Managers indicate that a node is unavailable, then the primary
+Node Manager will officially mark it as unavailable. Other strategies exist:
+
+- consensus (all Node Managers must agree that the node is unavailable)
+- single (at least one Node Manager saying the node is unavailable will cause the node to be marked as such)
+
+When a failover happens, the primary Node Manager will consult a "failover strategy" to determine which candidate node should be used. Currently only a single
+strategy is provided by redis_failover: latency. This strategy selects a node that is both marked as available by all Node Managers and has the lowest
+average latency for its last health check.
+
+Note that you should set the "required_node_managers" configuration option appropriately. This value (defaults to 1) is used to determine if enough Node
+Managers have reported their view of a node's state. For example, if you have deployed 5 Node Managers, then you should set this value to 5 if you only
+want to accept a node's availability when all 5 Node Managers are part of the snapshot. To give yourself flexibility, you may want to set this value to 3
+instead. This would allow you to take down 2 Node Managers while still allowing the cluster to be managed appropriately.
+
 ## Documentation
 
 redis_failover uses YARD for its API documentation. Refer to the generated [API documentation](http://rubydoc.info/github/ryanlecompte/redis_failover/master/frames) for full coverage.
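To make the required_node_managers guidance above concrete, here is an illustrative options hash for a five-manager deployment. The option keys match the CLI flags and the options.fetch calls shown elsewhere in this diff; hosts and ports are placeholders.

    require 'redis_failover'

    # Five Node Managers run this; requiring only 3 lets two managers be
    # down for maintenance while availability decisions are still made.
    RedisFailover::NodeManager.new(
      :nodes => [{:host => 'redis1'}, {:host => 'redis2'}, {:host => 'redis3'}],
      :zkservers => 'zk1:2181,zk2:2181,zk3:2181',
      :node_strategy => :majority,
      :failover_strategy => :latency,
      :required_node_managers => 3).start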
@@ -166,8 +194,6 @@ redis_failover uses YARD for its API documentation. Refer to the generated [API
 
 - Note that it's still possible for the RedisFailover::Client instances to see a stale list of servers for a very small window. In most cases this will not be the case due to how ZooKeeper handles distributed communication, but you should be aware that in the worst case the client could write to a "stale" master for a small period of time until the next watch event is received by the client via ZooKeeper.
 
-- Note that currently multiple Node Managers are currently used for redundancy purposes only. The Node Managers do not communicate with each other to perform any type of election or voting to determine if they all agree on promoting a new master. Right now Node Managers that are not "active" just sit and wait until they can grab the lock to become the single decision-maker for which Redis servers are available or not. This means that a scenario could present itself where a Node Manager thinks the Redis master is available, however the actual RedisFailover::Client instances think they can't reach the Redis master (either due to network partitions or the Node Manager flapping due to machine failure, etc). We are exploring ways to improve this situation.
-
 ## Resources
 
 - Check out Steve Whittaker's [redis-failover-test](https://github.com/swhitt/redis-failover-test) project which shows how to test redis_failover in a non-trivial configuration using Vagrant/Chef.
data/examples/config.yml
CHANGED
data/lib/redis_failover.rb
CHANGED
@@ -6,6 +6,7 @@ require 'thread'
 require 'logger'
 require 'timeout'
 require 'optparse'
+require 'benchmark'
 require 'multi_json'
 require 'securerandom'
 
@@ -16,7 +17,9 @@ require 'redis_failover/errors'
 require 'redis_failover/client'
 require 'redis_failover/runner'
 require 'redis_failover/version'
+require 'redis_failover/node_strategy'
 require 'redis_failover/node_manager'
 require 'redis_failover/node_watcher'
+require 'redis_failover/node_snapshot'
 require 'redis_failover/manual_failover'
-
+require 'redis_failover/failover_strategy'
data/lib/redis_failover/cli.rb
CHANGED
@@ -48,6 +48,21 @@ module RedisFailover
         options[:config_environment] = config_env
       end
 
+      opts.on('--node-strategy STRATEGY',
+        'Strategy used when determining availability of nodes (default: majority)') do |strategy|
+        options[:node_strategy] = strategy
+      end
+
+      opts.on('--failover-strategy STRATEGY',
+        'Strategy used when failing over to a new node (default: latency)') do |strategy|
+        options[:failover_strategy] = strategy
+      end
+
+      opts.on('--required-node-managers COUNT',
+        'Required Node Managers that must be reachable to determine node state (default: 1)') do |count|
+        options[:required_node_managers] = Integer(count)
+      end
+
       opts.on('-h', '--help', 'Display all options') do
         puts opts
         exit
@@ -59,7 +74,7 @@ module RedisFailover
       options = from_file(config_file, options[:config_environment])
     end
 
-    if
+    if invalid_options?(options)
       puts parser
       exit
     end
@@ -68,7 +83,7 @@ module RedisFailover
     end
 
     # @return [Boolean] true if required options missing, false otherwise
-    def self.
+    def self.invalid_options?(options)
       return true if options.empty?
       return true unless options.values_at(:nodes, :zkservers).all?
       false
@@ -113,6 +128,14 @@ module RedisFailover
       options[:nodes].each { |opts| opts.update(:password => password) }
     end
 
+    if node_strategy = options[:node_strategy]
+      options[:node_strategy] = node_strategy.to_sym
+    end
+
+    if failover_strategy = options[:failover_strategy]
+      options[:failover_strategy] = failover_strategy.to_sym
+    end
+
     options
   end
 end
data/lib/redis_failover/client.rb
CHANGED
@@ -79,10 +79,12 @@ module RedisFailover
       self
     end
 
-    #
-    #
+    # Delegates to the underlying Redis client to fetch the location.
+    # This method always returns the location of the master.
+    #
+    # @return [String] the redis location
     def location
-
+      dispatch(:client).location
     end
 
     # Specifies a callback to invoke when the current redis node list changes.
@@ -131,7 +133,7 @@ module RedisFailover
     # @option options [String] :host the host of the failover candidate
     # @option options [String] :port the port of the failover candidate
     def manual_failover(options = {})
-      ManualFailover.new(@zk, options).perform
+      ManualFailover.new(@zk, @root_znode, options).perform
       self
     end
 
@@ -176,12 +178,12 @@ module RedisFailover
     # Sets up the underlying ZooKeeper connection.
     def setup_zk
       @zk = ZK.new(@zkservers)
-      @zk.watcher.register(
+      @zk.watcher.register(redis_nodes_path) { |event| handle_zk_event(event) }
       if @safe_mode
         @zk.on_expired_session { purge_clients }
       end
-      @zk.on_connected { @zk.stat(
-      @zk.stat(
+      @zk.on_connected { @zk.stat(redis_nodes_path, :watch => true) }
+      @zk.stat(redis_nodes_path, :watch => true)
       update_znode_timestamp
     end
@@ -194,12 +196,12 @@ module RedisFailover
         build_clients
       elsif event.node_deleted?
         purge_clients
-        @zk.stat(
+        @zk.stat(redis_nodes_path, :watch => true)
       else
         logger.error("Unknown ZK node event: #{event.inspect}")
       end
     ensure
-      @zk.stat(
+      @zk.stat(redis_nodes_path, :watch => true)
     end
 
     # Determines if a method is a known redis operation.
@@ -308,7 +310,7 @@ module RedisFailover
     #
     # @return [Hash] the known master/slave redis servers
     def fetch_nodes
-      data = @zk.get(
+      data = @zk.get(redis_nodes_path, :watch => true).first
       nodes = symbolize_keys(decode(data))
       logger.debug("Fetched nodes: #{nodes.inspect}")
@@ -474,7 +476,7 @@ module RedisFailover
     # @param [Hash] options the configuration options
     def parse_options(options)
       @zkservers = options.fetch(:zkservers) { raise ArgumentError, ':zkservers required'}
-      @
+      @root_znode = options.fetch(:znode_path, Util::DEFAULT_ROOT_ZNODE_PATH)
       @namespace = options[:namespace]
       @password = options[:password]
       @db = options[:db]
@@ -483,5 +485,10 @@ module RedisFailover
       @safe_mode = options.fetch(:safe_mode, true)
       @master_only = options.fetch(:master_only, false)
     end
+
+    # @return [String] the znode path for the master redis nodes config
+    def redis_nodes_path
+      "#{@root_znode}/nodes"
+    end
   end
 end
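One practical consequence of the @root_znode plumbing above: a client and its Node Managers must agree on the root znode. A sketch follows; the default comes from Util::DEFAULT_ROOT_ZNODE_PATH, whose value is not shown in this extract, so the path below is illustrative.

    require 'redis_failover'

    # This client watches "#{@root_znode}/nodes", i.e. /my_app/redis_failover/nodes.
    client = RedisFailover::Client.new(
      :zkservers  => 'localhost:2181,localhost:2182,localhost:2183',
      :znode_path => '/my_app/redis_failover')

    client.set('foo', 'bar') # writes are dispatched to the master
    client.get('foo')        # reads are dispatched to one of the slaves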
data/lib/redis_failover/failover_strategy.rb
ADDED
@@ -0,0 +1,25 @@
+module RedisFailover
+  # Base class for strategies that determine which node is used during failover.
+  class FailoverStrategy
+    include Util
+
+    # Loads a strategy based on the given name.
+    #
+    # @param [String, Symbol] name the strategy name
+    # @return [Object] a new strategy instance
+    def self.for(name)
+      require "redis_failover/failover_strategy/#{name.downcase}"
+      const_get(name.capitalize).new
+    rescue LoadError, NameError
+      raise "Failed to find failover strategy: #{name}"
+    end
+
+    # Returns a candidate node as determined by this strategy.
+    #
+    # @param [Hash<Node, NodeSnapshot>] snapshots the node snapshots
+    # @return [Node] the candidate node or nil if one couldn't be found
+    def find_candidate(snapshots)
+      raise NotImplementedError
+    end
+  end
+end
data/lib/redis_failover/failover_strategy/latency.rb
ADDED
@@ -0,0 +1,21 @@
+module RedisFailover
+  class FailoverStrategy
+    # Failover strategy that selects an available node that is both seen by all
+    # node managers and has the lowest reported health check latency.
+    class Latency < FailoverStrategy
+      # @see RedisFailover::FailoverStrategy#find_candidate
+      def find_candidate(snapshots)
+        candidates = {}
+        snapshots.each do |node, snapshot|
+          if snapshot.all_available?
+            candidates[node] = snapshot.avg_latency
+          end
+        end
+
+        if candidate = candidates.min_by(&:last)
+          candidate.first
+        end
+      end
+    end
+  end
+end
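Because FailoverStrategy.for requires redis_failover/failover_strategy/#{name} and constantizes the capitalized name, a custom strategy only needs to follow that file/class naming and implement find_candidate. A hypothetical example, not part of the gem:

    # redis_failover/failover_strategy/random.rb (hypothetical)
    module RedisFailover
      class FailoverStrategy
        # Promotes any node that every reporting Node Manager sees as available.
        class Random < FailoverStrategy
          def find_candidate(snapshots)
            candidates = snapshots.select { |node, snapshot| snapshot.all_available? }
            candidates.keys.sample
          end
        end
      end
    end

    # Loaded exactly like the built-in strategy:
    # RedisFailover::FailoverStrategy.for(:random)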
data/lib/redis_failover/manual_failover.rb
CHANGED
@@ -2,37 +2,49 @@ module RedisFailover
   # Provides manual failover support to a new master.
   class ManualFailover
     # Path for manual failover communication.
-    ZNODE_PATH = '
+    ZNODE_PATH = 'manual_failover'.freeze
 
     # Denotes that any slave can be used as a candidate for promotion.
     ANY_SLAVE = "ANY_SLAVE".freeze
 
+    def self.path(root_znode)
+      "#{root_znode}/#{ZNODE_PATH}"
+    end
+
     # Creates a new instance.
     #
     # @param [ZK] zk the ZooKeeper client
+    # @param [ZNode] root_znode the root ZK node
     # @param [Hash] options the options used for manual failover
     # @option options [String] :host the host of the failover candidate
     # @option options [String] :port the port of the failover candidate
     # @note
     #   If options is empty, a random slave will be used
     #   as a failover candidate.
-    def initialize(zk, options = {})
+    def initialize(zk, root_znode, options = {})
       @zk = zk
+      @root_znode = root_znode
       @options = options
+
+      unless @options.empty?
+        port = Integer(@options[:port]) rescue nil
+        raise ArgumentError, ':host not properly specified' if @options[:host].to_s.empty?
+        raise ArgumentError, ':port not properly specified' if port.nil?
+      end
     end
 
     # Performs a manual failover.
     def perform
       create_path
       node = @options.empty? ? ANY_SLAVE : "#{@options[:host]}:#{@options[:port]}"
-      @zk.set(
+      @zk.set(self.class.path(@root_znode), node)
     end
 
     private
 
     # Creates the znode path used for coordinating manual failovers.
     def create_path
-      @zk.create(
+      @zk.create(self.class.path(@root_znode))
     rescue ZK::Exceptions::NodeExists
       # best effort
     end
data/lib/redis_failover/node.rb
CHANGED
@@ -22,7 +22,8 @@ module RedisFailover
     # @option options [String] :host the host of the redis server
     # @option options [String] :port the port of the redis server
     def initialize(options = {})
-      @host = options
+      @host = options[:host]
+      raise InvalidNodeError, 'missing host' if @host.to_s.empty?
       @port = Integer(options[:port] || 6379)
       @password = options[:password]
     end
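The net effect of the stricter initializer above, in irb terms:

    RedisFailover::Node.new(:host => 'localhost')                  # port defaults to 6379
    RedisFailover::Node.new(:host => 'localhost', :port => '6380') # Integer('6380') => 6380
    RedisFailover::Node.new(:port => 6379)                         # raises InvalidNodeError, 'missing host'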
data/lib/redis_failover/node_manager.rb
CHANGED
@@ -3,20 +3,23 @@ module RedisFailover
   # will discover the current redis master and slaves. Each redis node is
   # monitored by a NodeWatcher instance. The NodeWatchers periodically
   # report the current state of the redis node it's watching to the
-  # NodeManager
-  #
-  #
+  # NodeManager. The NodeManager processes the state reports and reacts
+  # appropriately by handling stale/dead nodes, and promoting a new redis master
+  # if it sees fit to do so.
   class NodeManager
     include Util
 
     # Number of seconds to wait before retrying bootstrap process.
-    TIMEOUT =
+    TIMEOUT = 5
+    # Number of seconds for checking node snapshots.
+    CHECK_INTERVAL = 10
+    # Number of max attempts to promote a master before releasing master lock.
+    MAX_PROMOTION_ATTEMPTS = 3
 
     # ZK Errors that the Node Manager cares about.
     ZK_ERRORS = [
       ZK::Exceptions::LockAssertionFailedError,
-      ZK::Exceptions::InterruptedSession
-      ZKDisconnectedError
+      ZK::Exceptions::InterruptedSession
     ].freeze
 
     # Errors that can happen during the node discovery process.
@@ -38,15 +41,16 @@ module RedisFailover
     def initialize(options)
       logger.info("Redis Node Manager v#{VERSION} starting (#{RUBY_DESCRIPTION})")
       @options = options
-      @
-      @
-      @
+      @required_node_managers = options.fetch(:required_node_managers, 1)
+      @root_znode = options.fetch(:znode_path, Util::DEFAULT_ROOT_ZNODE_PATH)
+      @node_strategy = NodeStrategy.for(options.fetch(:node_strategy, :majority))
+      @failover_strategy = FailoverStrategy.for(options.fetch(:failover_strategy, :latency))
+      @nodes = Array(@options[:nodes]).map { |opts| Node.new(opts) }.uniq
+      @master_manager = false
+      @master_promotion_attempts = 0
+      @sufficient_node_managers = false
+      @lock = Monitor.new
       @shutdown = false
-      @leader = false
-      @master = nil
-      @slaves = []
-      @unavailable = []
-      @lock_path = "#{@znode}_lock".freeze
     end
 
     # Starts the node manager.
@@ -54,21 +58,18 @@ module RedisFailover
     # @note This method does not return until the manager terminates.
     def start
       return unless running?
-      @queue = Queue.new
       setup_zk
-
-
-      @leader = true
-      logger.info('Acquired master Node Manager lock')
-      if discover_nodes
-        initialize_path
-        spawn_watchers
-        handle_state_reports
-      end
-    end
+      spawn_watchers
+      wait_until_master
     rescue *ZK_ERRORS => ex
       logger.error("ZK error while attempting to manage nodes: #{ex.inspect}")
       reset
+      sleep(TIMEOUT)
+      retry
+    rescue NoMasterError
+      logger.error("Failed to promote a new master after #{MAX_PROMOTION_ATTEMPTS} attempts.")
+      reset
+      sleep(TIMEOUT)
       retry
     end
 
@@ -77,78 +78,58 @@ module RedisFailover
     #
     # @param [Node] node the node
     # @param [Symbol] state the state
-
-
+    # @param [Integer] latency an optional latency
+    def notify_state(node, state, latency = nil)
+      @lock.synchronize do
+        if running?
+          update_current_state(node, state, latency)
+        end
+      end
+    rescue => ex
+      logger.error("Error handling state report #{[node, state].inspect}: #{ex.inspect}")
+      logger.error(ex.backtrace.join("\n"))
     end
 
     # Performs a reset of the manager.
     def reset
-      @
+      @master_manager = false
+      @master_promotion_attempts = 0
       @watchers.each(&:shutdown) if @watchers
-      @queue.clear
-      @zk.close! if @zk
-      @zk_lock = nil
     end
 
     # Initiates a graceful shutdown.
     def shutdown
       logger.info('Shutting down ...')
-      @
+      @lock.synchronize do
         @shutdown = true
       end
+
+      reset
+      exit
     end
 
     private
 
     # Configures the ZooKeeper client.
    def setup_zk
-
-
-
-
-      @zk.register(@manual_znode) do |event|
-        if event.node_created? || event.node_changed?
-          perform_manual_failover
+      unless @zk
+        @zk = ZK.new("#{@options[:zkservers]}#{@options[:chroot] || ''}")
+        @zk.register(manual_failover_path) do |event|
+          handle_manual_failover_update(event)
         end
+        @zk.on_connected { @zk.stat(manual_failover_path, :watch => true) }
       end
 
-
-
-
-
-      # Handles periodic state reports from {RedisFailover::NodeWatcher} instances.
-      def handle_state_reports
-        while running? && (state_report = @queue.pop)
-          begin
-            @mutex.synchronize do
-              return unless running?
-              @zk_lock.assert!
-              node, state = state_report
-              case state
-              when :unavailable then handle_unavailable(node)
-              when :available then handle_available(node)
-              when :syncing then handle_syncing(node)
-              when :zk_disconnected then raise ZKDisconnectedError
-              else raise InvalidNodeStateError.new(node, state)
-              end
-
-              # flush current state
-              write_state
-            end
-          rescue *ZK_ERRORS
-            # fail hard if this is a ZK connection-related error
-            raise
-          rescue => ex
-            logger.error("Error handling #{state_report.inspect}: #{ex.inspect}")
-            logger.error(ex.backtrace.join("\n"))
-          end
-        end
+      create_path(@root_znode)
+      create_path(current_state_root)
+      @zk.stat(manual_failover_path, :watch => true)
     end
 
     # Handles an unavailable node.
     #
     # @param [Node] node the unavailable node
-
+    # @param [Hash<Node, NodeSnapshot>] snapshots the current set of snapshots
+    def handle_unavailable(node, snapshots)
       # no-op if we already know about this node
       return if @unavailable.include?(node)
       logger.info("Handling unavailable node: #{node}")
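node_watcher.rb's updated body (+22/-14 in the file summary) is not reproduced in this extract, but given the new require 'benchmark' and the notify_state(node, state, latency) signature above, the watcher presumably times each health check roughly like this. A sketch only; the rescued error class is an assumption:

    # Inside a hypothetical NodeWatcher monitor loop:
    begin
      latency = Benchmark.realtime { @node.ping }
      @manager.notify_state(@node, :available, latency)
    rescue NodeUnavailableError
      @manager.notify_state(@node, :unavailable)
    end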
@@ -157,7 +138,7 @@ module RedisFailover
       # find a new master if this node was a master
       if node == @master
         logger.info("Demoting currently unavailable master #{node}.")
-        promote_new_master
+        promote_new_master(snapshots)
       else
         @slaves.delete(node)
       end
@@ -166,7 +147,8 @@ module RedisFailover
     # Handles an available node.
     #
     # @param [Node] node the available node
-
+    # @param [Hash<Node, NodeSnapshot>] snapshots the current set of snapshots
+    def handle_available(node, snapshots)
       reconcile(node)
 
       # no-op if we already know about this node
@@ -179,7 +161,7 @@ module RedisFailover
         @slaves << node
       else
         # no master exists, make this the new master
-        promote_new_master(node)
+        promote_new_master(snapshots, node)
       end
 
       @unavailable.delete(node)
@@ -188,74 +170,75 @@ module RedisFailover
     # Handles a node that is currently syncing.
     #
     # @param [Node] node the syncing node
-
+    # @param [Hash<Node, NodeSnapshot>] snapshots the current set of snapshots
+    def handle_syncing(node, snapshots)
       reconcile(node)
 
       if node.syncing_with_master? && node.prohibits_stale_reads?
         logger.info("Node #{node} not ready yet, still syncing with master.")
         force_unavailable_slave(node)
-
+      else
+        # otherwise, we can use this node
+        handle_available(node, snapshots)
       end
-
-      # otherwise, we can use this node
-      handle_available(node)
     end
 
     # Handles a manual failover request to the given node.
     #
     # @param [Node] node the candidate node for failover
-
+    # @param [Hash<Node, NodeSnapshot>] snapshots the current set of snapshots
+    def handle_manual_failover(node, snapshots)
       # no-op if node to be failed over is already master
       return if @master == node
       logger.info("Handling manual failover")
 
+      # ensure we can talk to the node
+      node.ping
+
       # make current master a slave, and promote new master
       @slaves << @master if @master
       @slaves.delete(node)
-      promote_new_master(node)
+      promote_new_master(snapshots, node)
     end
 
     # Promotes a new master.
     #
+    # @param [Hash<Node, NodeSnapshot>] snapshots the current set of snapshots
     # @param [Node] node the optional node to promote
-
-
-      delete_path
+    def promote_new_master(snapshots, node = nil)
+      delete_path(redis_nodes_path)
       @master = nil
 
-      # make a specific node or
-      candidate = node ||
-
+      # make a specific node or selected candidate the new master
+      candidate = node || failover_strategy_candidate(snapshots)
+
+      if candidate.nil?
         logger.error('Failed to promote a new master, no candidate available.')
-
+      else
+        @slaves.delete(candidate)
+        @unavailable.delete(candidate)
+        redirect_slaves_to(candidate)
+        candidate.make_master!
+        @master = candidate
+        write_current_redis_nodes
+        @master_promotion_attempts = 0
+        logger.info("Successfully promoted #{candidate} to master.")
       end
-
-      redirect_slaves_to(candidate)
-      candidate.make_master!
-      @master = candidate
-
-      create_path
-      write_state
-      logger.info("Successfully promoted #{candidate} to master.")
     end
 
     # Discovers the current master and slave nodes.
     # @return [Boolean] true if nodes successfully discovered, false otherwise
     def discover_nodes
-      @
-      return
-
+      @lock.synchronize do
+        return unless running?
+        @slaves, @unavailable = [], []
        if @master = find_existing_master
          logger.info("Using master #{@master} from existing znode config.")
-        elsif @master = guess_master(nodes)
+        elsif @master = guess_master(@nodes)
          logger.info("Guessed master #{@master} from known redis nodes.")
        end
-        @slaves = nodes - [@master]
-        logger.info("Managing master (#{@master}) and slaves "
-          "(#{@slaves.map(&:to_s).join(', ')})")
-        # ensure that slaves are correctly pointing to this master
-        redirect_slaves_to(@master)
-        true
+        @slaves = @nodes - [@master]
+        logger.info("Managing master (#{@master}) and slaves #{stringify_nodes(@slaves)}")
      end
    rescue *NODE_DISCOVERY_ERRORS => ex
      msg = <<-MSG.gsub(/\s+/, ' ')
@@ -273,7 +256,7 @@ module RedisFailover
 
     # Seeds the initial node master from an existing znode config.
     def find_existing_master
-      if data = @zk.get(
+      if data = @zk.get(redis_nodes_path).first
         nodes = symbolize_keys(decode(data))
         master = node_from(nodes[:master])
         logger.info("Master from existing znode config: #{master || 'none'}")
@@ -302,10 +285,13 @@ module RedisFailover
 
     # Spawns the {RedisFailover::NodeWatcher} instances for each managed node.
     def spawn_watchers
-      @
-
+      @zk.delete(current_state_path, :ignore => :no_node)
+      @monitored_available, @monitored_unavailable = {}, []
+      @watchers = @nodes.map do |node|
+        NodeWatcher.new(self, node, @options.fetch(:max_failures, 3))
       end
       @watchers.each(&:watch)
+      logger.info("Monitoring redis nodes at #{stringify_nodes(@nodes)}")
     end
 
     # Searches for the master node.
@@ -373,77 +359,361 @@ module RedisFailover
       }
     end
 
+    # @return [Hash] the set of currently available/unavailable nodes as
+    # seen by this node manager instance
+    def node_availability_state
+      {
+        :available => Hash[@monitored_available.map { |k, v| [k.to_s, v] }],
+        :unavailable => @monitored_unavailable.map(&:to_s)
+      }
+    end
+
     # Deletes the znode path containing the redis nodes.
-
-
-
+    #
+    # @param [String] path the znode path to delete
+    def delete_path(path)
+      @zk.delete(path)
+      logger.info("Deleted ZK node #{path}")
     rescue ZK::Exceptions::NoNode => ex
       logger.info("Tried to delete missing znode: #{ex.inspect}")
     end
 
-    # Creates
-
-
-
-
+    # Creates a znode path.
+    #
+    # @param [String] path the znode path to create
+    # @param [Hash] options the options used to create the path
+    # @option options [String] :initial_value an initial value for the znode
+    # @option options [Boolean] :ephemeral true if node is ephemeral, false otherwise
+    def create_path(path, options = {})
+      unless @zk.exists?(path)
+        @zk.create(path,
+          options[:initial_value],
+          :ephemeral => options.fetch(:ephemeral, false))
+        logger.info("Created ZK node #{path}")
       end
     rescue ZK::Exceptions::NodeExists
       # best effort
     end
 
-    #
-
-
-
+    # Writes state to a particular znode path.
+    #
+    # @param [String] path the znode path that should be written to
+    # @param [String] value the value to write to the znode
+    # @param [Hash] options the default options to be used when creating the node
+    # @note the path will be created if it doesn't exist
+    def write_state(path, value, options = {})
+      create_path(path, options.merge(:initial_value => value))
+      @zk.set(path, value)
+    end
+
+    # Handles a manual failover znode update.
+    #
+    # @param [ZK::Event] event the ZK event to handle
+    def handle_manual_failover_update(event)
+      if event.node_created? || event.node_changed?
+        perform_manual_failover
+      end
+    rescue => ex
+      logger.error("Error scheduling a manual failover: #{ex.inspect}")
+      logger.error(ex.backtrace.join("\n"))
+    ensure
+      @zk.stat(manual_failover_path, :watch => true)
+    end
+
+    # Produces a FQDN id for this Node Manager.
+    #
+    # @return [String] the FQDN for this Node Manager
+    def manager_id
+      @manager_id ||= [
+        Socket.gethostbyname(Socket.gethostname)[0],
+        Process.pid
+      ].join('-')
+    end
+
+    # Writes the current master list of redis nodes. This method is only invoked
+    # if this node manager instance is the master/primary manager.
+    def write_current_redis_nodes
+      write_state(redis_nodes_path, encode(current_nodes))
+    end
+
+    # @return [String] root path for current node manager state
+    def current_state_root
+      "#{@root_znode}/manager_node_state"
+    end
+
+    # @return [String] the znode path for this node manager's view
+    # of available nodes
+    def current_state_path
+      "#{current_state_root}/#{manager_id}"
+    end
+
+    # @return [String] the znode path for the master redis nodes config
+    def redis_nodes_path
+      "#{@root_znode}/nodes"
+    end
+
+    # @return [String] the znode path used for performing manual failovers
+    def manual_failover_path
+      ManualFailover.path(@root_znode)
+    end
+
+    # @return [Boolean] true if this node manager is the master, false otherwise
+    def master_manager?
+      @master_manager
     end
 
-    #
-
-
-
+    # Used to update the master node manager state. These states are only handled if
+    # this node manager instance is serving as the master manager.
+    #
+    # @param [Node] node the node to handle
+    # @param [Hash<Node, NodeSnapshot>] snapshots the current set of snapshots
+    def update_master_state(node, snapshots)
+      state = @node_strategy.determine_state(node, snapshots)
+      case state
+      when :unavailable
+        handle_unavailable(node, snapshots)
+      when :available
+        if node.syncing_with_master?
+          handle_syncing(node, snapshots)
+        else
+          handle_available(node, snapshots)
+        end
+      else
+        raise InvalidNodeStateError.new(node, state)
+      end
+    rescue *ZK_ERRORS
+      # fail hard if this is a ZK connection-related error
+      raise
+    rescue => ex
+      logger.error("Error handling state report for #{[node, state].inspect}: #{ex.inspect}")
+    end
+
+    # Updates the current view of the world for this particular node
+    # manager instance. All node managers write this state regardless
+    # of whether they are the master manager or not.
+    #
+    # @param [Node] node the node to handle
+    # @param [Symbol] state the node state
+    # @param [Integer] latency an optional latency
+    def update_current_state(node, state, latency = nil)
+      case state
+      when :unavailable
+        @monitored_unavailable |= [node]
+        @monitored_available.delete(node)
+      when :available
+        @monitored_available[node] = latency
+        @monitored_unavailable.delete(node)
+      else
+        raise InvalidNodeStateError.new(node, state)
+      end
+
+      # flush ephemeral current node manager state
+      write_state(current_state_path,
+        encode(node_availability_state),
+        :ephemeral => true)
+    end
+
+    # Fetches each currently running node manager's view of the
+    # world in terms of which nodes they think are available/unavailable.
+    #
+    # @return [Hash<String, Array>] a hash of node manager to host states
+    def fetch_node_manager_states
+      states = {}
+      @zk.children(current_state_root).each do |child|
+        full_path = "#{current_state_root}/#{child}"
+        begin
+          states[child] = symbolize_keys(decode(@zk.get(full_path).first))
+        rescue ZK::Exceptions::NoNode
+          # ignore, this is an edge case that can happen when a node manager
+          # process dies while fetching its state
+        rescue => ex
+          logger.error("Failed to fetch states for #{full_path}: #{ex.inspect}")
+        end
+      end
+      states
+    end
+
+    # Builds current snapshots of nodes across all running node managers.
+    #
+    # @return [Hash<Node, NodeSnapshot>] the snapshots for all nodes
+    def current_node_snapshots
+      nodes = {}
+      snapshots = Hash.new { |h, k| h[k] = NodeSnapshot.new(k) }
+      fetch_node_manager_states.each do |node_manager, states|
+        available, unavailable = states.values_at(:available, :unavailable)
+        available.each do |node_string, latency|
+          node = nodes[node_string] ||= node_from(node_string)
+          snapshots[node].viewable_by(node_manager, latency)
+        end
+        unavailable.each do |node_string|
+          node = nodes[node_string] ||= node_from(node_string)
+          snapshots[node].unviewable_by(node_manager)
+        end
+      end
+
+      snapshots
+    end
+
+    # Waits until this node manager becomes the master.
+    def wait_until_master
+      logger.info('Waiting to become master Node Manager ...')
+
+      with_lock do
+        @master_manager = true
+        logger.info('Acquired master Node Manager lock.')
+        logger.info("Configured node strategy #{@node_strategy.class}")
+        logger.info("Configured failover strategy #{@failover_strategy.class}")
+        logger.info("Required Node Managers to make a decision: #{@required_node_managers}")
+        manage_nodes
+      end
+    end
+
+    # Manages the redis nodes by periodically processing snapshots.
+    def manage_nodes
+      # Re-discover nodes, since the state of the world may have been changed
+      # by the time we've become the primary node manager.
+      discover_nodes
+
+      # ensure that slaves are correctly pointing to this master
+      redirect_slaves_to(@master)
+
+      # Periodically update master config state.
+      while running? && master_manager?
+        @zk_lock.assert!
+        sleep(CHECK_INTERVAL)
+
+        @lock.synchronize do
+          snapshots = current_node_snapshots
+          if ensure_sufficient_node_managers(snapshots)
+            snapshots.each_key do |node|
+              update_master_state(node, snapshots)
+            end
+
+            # flush current master state
+            write_current_redis_nodes
+
+            # check if we've exhausted our attempts to promote a master
+            unless @master
+              @master_promotion_attempts += 1
+              raise NoMasterError if @master_promotion_attempts > MAX_PROMOTION_ATTEMPTS
+            end
+          end
+        end
+      end
+    end
+
+    # Creates a Node instance from a string.
+    #
+    # @param [String] node_string a string representation of a node (e.g., host:port)
+    # @return [Node] the Node representation
+    def node_from(node_string)
+      return if node_string.nil?
+      host, port = node_string.split(':', 2)
+      Node.new(:host => host, :port => port, :password => @options[:password])
     end
 
     # Executes a block wrapped in a ZK exclusive lock.
     def with_lock
-      @zk_lock
-
-
+      @zk_lock ||= @zk.locker('master_redis_node_manager_lock')
+
+      begin
+        @zk_lock.lock!(true)
+      rescue Exception
+        # handle shutdown case
+        running? ? raise : return
       end
 
       if running?
+        @zk_lock.assert!
         yield
       end
     ensure
-
+      if @zk_lock
+        begin
+          @zk_lock.unlock!
+        rescue => ex
+          logger.warn("Failed to release lock: #{ex.inspect}")
+        end
+      end
     end
 
     # Perform a manual failover to a redis node.
     def perform_manual_failover
-      @
-      return unless running? && @
+      @lock.synchronize do
+        return unless running? && @master_manager && @zk_lock
        @zk_lock.assert!
-        new_master = @zk.get(
+        new_master = @zk.get(manual_failover_path, :watch => true).first
        return unless new_master && new_master.size > 0
        logger.info("Received manual failover request for: #{new_master}")
        logger.info("Current nodes: #{current_nodes.inspect}")
-
-
+        snapshots = current_node_snapshots
+
+        node = if new_master == ManualFailover::ANY_SLAVE
+          failover_strategy_candidate(snapshots)
+        else
+          node_from(new_master)
+        end
+
        if node
-          handle_manual_failover(node)
+          handle_manual_failover(node, snapshots)
        else
          logger.error('Failed to perform manual failover, no candidate found.')
        end
      end
    rescue => ex
-      logger.error("Error handling
+      logger.error("Error handling manual failover: #{ex.inspect}")
      logger.error(ex.backtrace.join("\n"))
    ensure
-      @zk.stat(
+      @zk.stat(manual_failover_path, :watch => true)
    end
 
    # @return [Boolean] true if running, false otherwise
    def running?
-      !@shutdown
+      @lock.synchronize { !@shutdown }
+    end
+
+    # @return [String] a stringified version of redis nodes
+    def stringify_nodes(nodes)
+      "(#{nodes.map(&:to_s).join(', ')})"
+    end
+
+    # Determines if each snapshot has a sufficient number of node managers.
+    #
+    # @param [Hash<Node, Snapshot>] snapshots the current snapshots
+    # @return [Boolean] true if sufficient, false otherwise
+    def ensure_sufficient_node_managers(snapshots)
+      currently_sufficient = true
+      snapshots.each do |node, snapshot|
+        node_managers = snapshot.node_managers
+        if node_managers.size < @required_node_managers
+          logger.error("Not enough Node Managers in snapshot for node #{node}. " +
+            "Required: #{@required_node_managers}, " +
+            "Available: #{node_managers.size} #{node_managers}")
+          currently_sufficient = false
+        end
+      end
+
+      if currently_sufficient && !@sufficient_node_managers
+        logger.info("Required Node Managers are visible: #{@required_node_managers}")
+      end
+
+      @sufficient_node_managers = currently_sufficient
+      @sufficient_node_managers
+    end
+
+    # Invokes the configured failover strategy.
+    #
+    # @param [Hash<Node, NodeSnapshot>] snapshots the node snapshots
+    # @return [Node] a failover candidate
+    def failover_strategy_candidate(snapshots)
+      # only include nodes that this master Node Manager can see
+      filtered_snapshots = snapshots.select do |node, snapshot|
+        snapshot.viewable_by?(manager_id)
+      end
+
+      logger.info('Attempting to find candidate from snapshots:')
+      logger.info("\n" + filtered_snapshots.values.join("\n"))
+      @failover_strategy.find_candidate(filtered_snapshots)
    end
  end
end
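node_snapshot.rb (+81 in the file summary) is likewise absent from this extract. The minimal sketch below is inferred purely from the calls made on snapshots above (viewable_by, unviewable_by, viewable_by?, all_available?, avg_latency, node_managers); the gem's real implementation will differ.

    module RedisFailover
      # Sketch of the snapshot API used by NodeManager and the latency strategy.
      class NodeSnapshot
        attr_reader :node

        def initialize(node)
          @node = node
          @available = {}   # node manager id => reported latency
          @unavailable = [] # node manager ids reporting the node as down
        end

        def viewable_by(node_manager, latency)
          @available[node_manager] = latency
        end

        def unviewable_by(node_manager)
          @unavailable << node_manager unless @unavailable.include?(node_manager)
        end

        def viewable_by?(node_manager)
          @available.key?(node_manager)
        end

        def all_available?
          !@available.empty? && @unavailable.empty?
        end

        def avg_latency
          return if @available.empty?
          @available.values.inject(:+) / @available.size
        end

        def node_managers
          @available.keys | @unavailable
        end
      end
    end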