semian 0.3.0 → 0.4.0
- checksums.yaml +4 -4
- data/.gitignore +2 -2
- data/.rubocop.yml +113 -0
- data/CHANGELOG.md +8 -0
- data/Gemfile +5 -0
- data/LICENSE.md +1 -1
- data/README.md +488 -39
- data/Rakefile +15 -8
- data/ext/semian/extconf.rb +2 -2
- data/lib/semian.rb +16 -6
- data/lib/semian/adapter.rb +1 -1
- data/lib/semian/circuit_breaker.rb +38 -37
- data/lib/semian/mysql2.rb +21 -1
- data/lib/semian/net_http.rb +95 -0
- data/lib/semian/protected_resource.rb +7 -2
- data/lib/semian/resource.rb +1 -1
- data/lib/semian/simple_integer.rb +23 -0
- data/lib/semian/simple_sliding_window.rb +43 -0
- data/lib/semian/simple_state.rb +43 -0
- data/lib/semian/unprotected_resource.rb +4 -1
- data/lib/semian/version.rb +1 -1
- data/repodb.yml +1 -0
- data/scripts/install_toxiproxy.sh +3 -3
- data/semian.gemspec +4 -3
- data/test/circuit_breaker_test.rb +6 -2
- data/test/helpers/background_helper.rb +1 -1
- data/test/instrumentation_test.rb +1 -1
- data/test/mysql2_test.rb +57 -1
- data/test/net_http_test.rb +481 -0
- data/test/redis_test.rb +3 -3
- data/test/resource_test.rb +33 -31
- data/test/semian_test.rb +3 -2
- data/test/simple_integer_test.rb +49 -0
- data/test/simple_sliding_window_test.rb +65 -0
- data/test/simple_state_test.rb +45 -0
- data/test/test_helper.rb +5 -0
- data/test/unprotected_resource_test.rb +1 -1
- metadata +30 -27
- checksums.yaml.gz.sig +0 -0
- data.tar.gz.sig +0 -1
- metadata.gz.sig +0 -0
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9df1d9d29629650b74dab6ead2cc8d288de28d78
+  data.tar.gz: 8752cce5af3858930ee820eab7c2f1c558eeed22
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2070b61bb3ac7080c36eaba3c0508eb81928b32111dd11e6f5d9668b78fada81549f2eac75f3021787ce03fd30b6f3c5f01a4f5f99d077b0b3aa9aeecc5a3e9d
+  data.tar.gz: 63f9400b70e1bf039819bb26accf5ecbcea87ab1fb938ff0abaf8e44ccfbbd78e016c3ff48832960cab6fb1c9428785da0f4c2333d15c651b2647bf99d9ae41d
data/.gitignore
CHANGED
data/.rubocop.yml
ADDED
@@ -0,0 +1,113 @@
+AllCops:
+  Exclude:
+    - Gemfile
+    - lib/snippets/**/*
+    - vendor/**/*
+    - data/**/*
+    - db/schema.rb
+    - db/migrate/*
+    - test/dummy/**/*
+    - bin/rails
+    - lib/shipit-engine.rb
+    - tmp/**/*
+
+Style/GuardClause:
+  Enabled: false
+
+Lint/AssignmentInCondition:
+  Enabled: false
+
+Lint/HandleExceptions:
+  Enabled: false
+
+Lint/EndAlignment:
+  Enabled: false
+
+Style/NumericLiterals:
+  Exclude:
+    - db/schema.rb
+
+Style/SingleSpaceBeforeFirstArg:
+  Exclude:
+    - db/schema.rb
+
+Style/DoubleNegation:
+  Enabled: false
+
+Metrics/LineLength:
+  Max: 135
+
+Metrics/MethodLength:
+  Max: 40
+
+Metrics/ClassLength:
+  Max: 500
+
+Metrics/AbcSize:
+  Max: 50
+
+Metrics/CyclomaticComplexity:
+  Max: 10
+
+Style/Documentation:
+  Enabled: false
+
+Style/SingleLineBlockParams:
+  Enabled: false
+
+Style/SignalException:
+  Enabled: false
+
+Style/RaiseArgs:
+  Enabled: false
+
+Style/ModuleFunction:
+  Enabled: false
+
+Style/RedundantReturn:
+  AllowMultipleReturnValues: true
+
+Style/IndentHash:
+  Enabled: false
+
+Style/TrailingComma:
+  EnforcedStyleForMultiline: comma
+
+Style/ClassAndModuleChildren:
+  Enabled: false
+
+Style/PredicateName:
+  Exclude:
+    - app/serializers/**/*
+
+Style/SpaceInsideHashLiteralBraces:
+  EnforcedStyle: no_space
+
+Style/StringLiterals:
+  Enabled: false
+
+Style/PerlBackrefs:
+  Enabled: false
+
+Style/TrivialAccessors:
+  AllowPredicates: true
+
+Style/ExtraSpacing:
+  AllowForAlignment: false
+
+Style/GlobalVars:
+  Exclude:
+    - 'ext/semian/extconf.rb'
+
+Lint/Eval:
+  Exclude:
+    - 'Rakefile'
+
+Metrics/ParameterLists:
+  Enabled: false
+
+Style/IfUnlessModifier:
+  Enabled: false
+
+Style/CaseIndentation:
+  IndentWhenRelativeTo: end
data/CHANGELOG.md
ADDED
@@ -0,0 +1,8 @@
+# v0.4.0
+
+* net/http: add adapter for net/http #58
+* circuit_breaker: split circuit breaker into three data structures to allow for
+  alternative implementations in the future #62
+* mysql: don't prevent rollbacks on transactions #60
+* core: fix initialization bug when the resource is accessed before the options
+  are set #65
data/Gemfile
CHANGED
data/LICENSE.md
CHANGED
@@ -1,6 +1,6 @@
 The MIT License (MIT)

-Copyright (c) 2014
+Copyright (c) 2014 Shopify

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
data/README.md
CHANGED
@@ -1,15 +1,47 @@
|
|
1
1
|
## Semian [](https://travis-ci.org/Shopify/semian)
|
2
2
|
|
3
|
-
|
4
|
-
applications against misbehaving external services. It allows you to fail fast
|
5
|
-
so you can handle errors gracefully. The patterns are inspired by
|
6
|
-
[Hystrix][hystrix] and [Release It][release-it]. Semian is an extraction from
|
7
|
-
[Shopify][shopify] where it's been running successfully in production since
|
8
|
-
October, 2014.
|
3
|
+

|
9
4
|
|
10
|
-
|
11
|
-
|
12
|
-
|
5
|
+
Semian is a library for controlling access to slow or unresponsive external
|
6
|
+
services to avoid cascading failures.
|
7
|
+
|
8
|
+
When services are down they typically fail fast with errors like `ECONNREFUSED`
|
9
|
+
and `ECONNRESET` which can be rescued in code. However, slow resources fail
|
10
|
+
slowly. The thread serving the request blocks until it hits the timeout for the
|
11
|
+
slow resource. During that time, the thread is doing nothing useful and thus the
|
12
|
+
slow resource has caused a cascading failure by occupying workers and therefore
|
13
|
+
losing capacity. **Semian is a library for failing fast in these situations,
|
14
|
+
allowing you to handle errors gracefully.** Semian does this by intercepting
|
15
|
+
resource access through heuristic patterns inspired by [Hystrix][hystrix] and
|
16
|
+
[Release It][release-it]:
|
17
|
+
|
18
|
+
* [**Circuit breaker**](#circuit-breaker). A pattern for limiting the
|
19
|
+
amount of requests to a dependency that is having issues.
|
20
|
+
* [**Bulkheading**](#bulkheading). Controlling the concurrent access to
|
21
|
+
a single resource; access is coordinated server-wide with [SysV
|
22
|
+
semaphores][sysv].
|
23
|
+
|
24
|
+
Resource drivers are monkey-patched to be aware of Semian; these are called
|
25
|
+
[Semian Adapters](#adapters). Thus, every time resource access is requested
|
26
|
+
Semian is queried for status on the resource first. If Semian, through the
|
27
|
+
patterns above, deems the resource to be unavailable it will raise an exception.
|
28
|
+
**The ultimate outcome of Semian is always an exception that can then be rescued
|
29
|
+
for a graceful fallback**. Instead of waiting for the timeout, Semian raises
|
30
|
+
straight away.
|
31
|
+
|
32
|
+
If you are already rescuing exceptions for failing resources and timeouts,
|
33
|
+
Semian is mostly a drop-in library with a little configuration that will make
|
34
|
+
your code more resilient to slow resource access. But, [do you even need
|
35
|
+
Semian?](#do-i-need-semian)
|
36
|
+
|
37
|
+
For an overview of building resilient Ruby applications, start by reading [the
|
38
|
+
Shopify blog post on Toxiproxy and Semian][resiliency-blog-post]. For more in
|
39
|
+
depth information on Semian, see [Understanding Semian](#understanding-semian).
|
40
|
+
Semian is an extraction from [Shopify][shopify] where it's been running
|
41
|
+
successfully in production since October, 2014.
|
42
|
+
|
43
|
+
The other component to your Ruby resiliency kit is [Toxiproxy][toxiproxy] to
|
44
|
+
write automated resiliency tests.
|
13
45
|
|
14
46
|
# Usage
|
15
47
|
|
@@ -26,10 +58,47 @@ section](#configuration) on how to configure adapters.
|
|
26
58
|
|
27
59
|
## Adapters
|
28
60
|
|
29
|
-
|
61
|
+
Semian works by intercepting resource access. Every time access is requested,
|
62
|
+
Semian is queried, and it will raise an exception if the resource is unavailable
|
63
|
+
according to the circuit breaker or bulkheads. This is done by monkey-patching
|
64
|
+
the resource driver. **The exception raised by the driver always inherits from
|
65
|
+
the Base exception class of the driver**, meaning you can always simply rescue
|
66
|
+
the base class and catch both Semian and driver errors in the same rescue for
|
67
|
+
fallbacks.
|
68
|
+
|
69
|
+
The following adapters are in Semian and tested heavily in production; the
|
70
|
+
version is the version of the public gem with the same name:
|
30
71
|
|
31
72
|
* [`semian/mysql2`][mysql-semian-adapter] (~> 0.3.16)
|
32
73
|
* [`semian/redis`][redis-semian-adapter] (~> 3.2.1)
|
74
|
+
* [`semian/net_http`][nethttp-semian-adapter]
|
75
|
+
|
76
|
+
### Creating Adapters
|
77
|
+
|
78
|
+
To create a Semian adapter you must implement the following methods:
|
79
|
+
|
80
|
+
1. [`include Semian::Adapter`][semian-adapter]. Use the helpers to wrap the
|
81
|
+
resource. This takes care of situations such as monitoring, nested resources,
|
82
|
+
unsupported platforms, creating the Semian resource if it doesn't already
|
83
|
+
exist and so on.
|
84
|
+
2. `#semian_identifier`. This is responsible for returning a symbol that
|
85
|
+
represents every unique resource, for example `redis_master` or
|
86
|
+
`mysql_shard_1`. This is usually assembled from a `name` attribute on the
|
87
|
+
Semian configuration hash, but could also be `<host>:<port>`.
|
88
|
+
3. `connect`. The name of this method varies. You must override the driver's
|
89
|
+
connect method with one that wraps the connect call with
|
90
|
+
`Semian::Resource#acquire`. You should do this at the lowest possible level.
|
91
|
+
4. `query`. Same as `connect` but for queries on the resource.
|
92
|
+
5. Define exceptions `ResourceBusyError` and `CircuitOpenError`. These are
|
93
|
+
raised when the request was rejected early because the resource is out of
|
94
|
+
tickets or because the circuit breaker is open (see [Understanding
|
95
|
+
Semian](#understanding-semian)). They should inherit from the base exception
|
96
|
+
class from the raw driver. For example `Mysql2::Error` or
|
97
|
+
`Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
|
98
|
+
easy to `rescue` and handle them gracefully in application code, by
|
99
|
+
`rescue`ing the base class.
|
100
|
+
|
101
|
+
The best resource is looking at the [already implemented adapters](#adapters).
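As a rough illustration of steps 3 to 5, the sketch below wraps a made-up driver. `ToyDriver`, `ToyGate` and `SemianToyAdapter` are hypothetical names invented for this example and are not part of Semian; a real adapter would use `Semian::Adapter` and `Semian::Resource#acquire` where this sketch uses the stand-in gate.

```ruby
# Hypothetical driver, standing in for something like Mysql2 or Redis.
class ToyDriver
  class BaseError < StandardError; end

  # Step 5: the errors raised on early rejection inherit from the
  # driver's own base error class.
  ResourceBusyError = Class.new(BaseError)
  CircuitOpenError  = Class.new(BaseError)

  def query(sql)
    ["row for #{sql}"]
  end
end

# Stand-in for the Semian resource: it either yields or rejects the call.
class ToyGate
  attr_accessor :open

  def initialize
    @open = false
  end

  def acquire
    raise ToyDriver::CircuitOpenError, "circuit open" if open
    yield
  end
end

# Steps 3 and 4: override the driver method and wrap the original call.
module SemianToyAdapter
  def gate
    @gate ||= ToyGate.new
  end

  def query(sql)
    gate.acquire { super }
  end
end

ToyDriver.prepend(SemianToyAdapter)

# Application code rescues the driver's base class only.
def fetch_rows(driver)
  driver.query("SELECT 1")
rescue ToyDriver::BaseError
  [] # graceful fallback for driver errors and early rejections alike
end
```

Because the rejection errors inherit from `ToyDriver::BaseError`, `fetch_rows` needs no knowledge of the wrapper to fall back gracefully.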
|
33
102
|
|
34
103
|
### Configuration
|
35
104
|
|
@@ -58,32 +127,370 @@ client = Redis.new(semian: {
|
|
58
127
|
})
|
59
128
|
```
|
60
129
|
|
61
|
-
|
130
|
+
#### Net::HTTP
|
131
|
+
For the `Net::HTTP` specific Semian adapter, since many external libraries may create
|
132
|
+
HTTP connections on the user's behalf, the parameters are instead provided
|
133
|
+
by calling specific functions in `Semian::NetHTTP`, perhaps in an initialization file.
|
62
134
|
|
63
|
-
|
135
|
+
##### Naming and Options
|
136
|
+
To give Semian parameters, assign a `proc` to `Semian::NetHTTP.semian_configuration`
|
137
|
+
that takes two parameters, `host` and `port` like `127.0.0.1` and `80` or `github_com` and `80`,
|
138
|
+
and returns a `Hash` with keys as follows.
|
64
139
|
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
|
74
|
-
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
raised when the request was rejected early because the resource is out of
|
79
|
-
tickets or because the circuit breaker is open (see [Understanding
|
80
|
-
Semian](#understanding-semian). They should inherit from the base exception
|
81
|
-
class from the raw driver. For example `Mysql2::Error` or
|
82
|
-
`Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
|
83
|
-
easy to `rescue` and handle them gracefully in application code, by
|
84
|
-
`rescue`ing the base class.
|
140
|
+
```ruby
|
141
|
+
SEMIAN_PARAMETERS = { tickets: 1,
|
142
|
+
success_threshold: 1,
|
143
|
+
error_threshold: 3,
|
144
|
+
error_timeout: 10 }
|
145
|
+
Semian::NetHTTP.semian_configuration = proc do |host, port|
|
146
|
+
# Let's make it only active for github.com
|
147
|
+
if host == "github.com" && port.to_i == 80
|
148
|
+
SEMIAN_PARAMETERS.merge(name: "github.com_80")
|
149
|
+
else
|
150
|
+
nil
|
151
|
+
end
|
152
|
+
end
|
85
153
|
|
86
|
-
|
154
|
+
# Called from within API:
|
155
|
+
# semian_options = Semian::NetHTTP.semian_configuration("github.com", 80)
|
156
|
+
# semian_identifier = "nethttp_#{semian_options[:name]}"
|
157
|
+
```
|
158
|
+
|
159
|
+
The `name` should be carefully chosen since it identifies the resource being protected.
|
160
|
+
The `semian_options` passed apply to that resource. Semian creates the `semian_identifier`
|
161
|
+
from the `name` to look up and store changes in the circuit breaker and bulkhead states
|
162
|
+
and associate successes, failures, errors with the protected resource.
|
163
|
+
|
164
|
+
For most purposes, `"#{host}_#{port}"` is a good default `name`. Custom `name` formats
|
165
|
+
can be useful for grouping related subdomains as one resource, so that they all
|
166
|
+
contribute to the same circuit breaker and bulkhead state and fail together.
|
167
|
+
|
168
|
+
A return value of `nil` for `semian_configuration` means Semian is disabled for that
|
169
|
+
HTTP endpoint. This works well since the result of a failed Hash lookup is `nil` also.
|
170
|
+
This behavior lets the adapter default to whitelisting, although the
|
171
|
+
behavior can be changed to blacklisting or even be completely disabled by varying
|
172
|
+
the use of returning `nil` in the assigned closure.
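The naming and `nil` conventions described above can be sketched with plain procs. The constant names, the `example.com` domain and the `localhost` exclusion are assumptions made up for this illustration; only the option keys and the `host`/`port` arguments come from the text above.

```ruby
# Hypothetical shared settings, using the option keys shown earlier.
EXAMPLE_DEFAULTS = {tickets: 1, success_threshold: 1, error_threshold: 3, error_timeout: 10}.freeze

# Whitelist: only listed endpoints are protected. A failed Hash lookup
# returns nil, which disables Semian for every other endpoint.
WHITELIST = {
  ["github.com", 80] => EXAMPLE_DEFAULTS.merge(name: "github.com_80"),
}.freeze

WHITELIST_CONFIG = proc { |host, port| WHITELIST[[host, port.to_i]] }

# Blacklist: everything is protected except the listed hosts.
UNPROTECTED_HOSTS = ["localhost"].freeze

BLACKLIST_CONFIG = proc do |host, port|
  if UNPROTECTED_HOSTS.include?(host)
    nil
  else
    EXAMPLE_DEFAULTS.merge(name: "#{host}_#{port}")
  end
end

# Grouping: every subdomain of example.com shares one name, so they all
# contribute to the same circuit breaker and bulkhead state.
GROUPED_CONFIG = proc do |host, port|
  name = host.end_with?(".example.com") ? "example.com_#{port}" : "#{host}_#{port}"
  EXAMPLE_DEFAULTS.merge(name: name)
end
```

Any one of these procs could then be assigned to `Semian::NetHTTP.semian_configuration`.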
|
173
|
+
|
174
|
+
##### Additional Exceptions
|
175
|
+
Since we envision this particular adapter can be used in combination with many
|
176
|
+
external libraries, that can raise additional exceptions, we added functionality to
|
177
|
+
expand the Exceptions that can be tracked as part of Semian's circuit breaker.
|
178
|
+
This may be necessary for libraries that introduce new exceptions or re-raise them.
|
179
|
+
Add exceptions and reset to the [`default`][nethttp-default-errors] list using the following:
|
180
|
+
|
181
|
+
```ruby
|
182
|
+
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
|
183
|
+
Semian::NetHTTP.exceptions += [::OpenSSL::SSL::SSLError]
|
184
|
+
|
185
|
+
Semian::NetHTTP.reset_exceptions
|
186
|
+
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
|
187
|
+
```
|
188
|
+
|
189
|
+
# Understanding Semian
|
190
|
+
|
191
|
+
Semian is a library with heuristics for failing fast. This section will explain
|
192
|
+
in depth how Semian works and which situations it's applicable for. First we
|
193
|
+
explain the category of problems Semian is meant to solve. Then we dive into how
|
194
|
+
Semian works to solve these problems.
|
195
|
+
|
196
|
+
## Do I need Semian?
|
197
|
+
|
198
|
+
Semian is not a trivial library to understand; it introduces complexity and thus
|
199
|
+
should be introduced with care. Remember, all Semian does is raise exceptions
|
200
|
+
based on heuristics. It is paramount that you understand Semian before
|
201
|
+
including it in production as you may otherwise be surprised by its behaviour.
|
202
|
+
|
203
|
+
Applications that benefit from Semian are those working on eliminating SPOFs
|
204
|
+
(Single Points of Failure), and specifically are running into a wall regarding
|
205
|
+
slow resources. But it is by no means a magic wand that solves all your latency
|
206
|
+
problems by being added to your `Gemfile`. This section describes the types of
|
207
|
+
problems Semian solves.
|
208
|
+
|
209
|
+
If your application is multithreaded or evented (i.e. not Resque or Unicorn)
|
210
|
+
these problems are not as pressing. You can still get use out of Semian however.
|
211
|
+
|
212
|
+
### Real World Example
|
213
|
+
|
214
|
+
This is better illustrated with a real world example from Shopify. When you are
|
215
|
+
browsing a store while signed in, Shopify stores your session in Redis.
|
216
|
+
If Redis becomes unavailable, the driver will start throwing exceptions.
|
217
|
+
We rescue these exceptions and simply disable all customer sign in functionality
|
218
|
+
on the store until Redis is back online.
|
219
|
+
|
220
|
+
This is great if querying the resource fails instantly, because it means we fail
|
221
|
+
in just a single roundtrip of ~1ms. But if the resource is unresponsive or slow,
|
222
|
+
this can take as long as our timeout which is easily 200ms. This means every
|
223
|
+
request, even if it does rescue the exception, now takes an extra 200ms.
|
224
|
+
Because every request takes that long, our capacity is also significantly
|
225
|
+
degraded. These problems are explained in depth in the next two sections.
|
226
|
+
|
227
|
+
With Semian, the slow resource would fail instantly (after a small amount of
|
228
|
+
convergence time), keeping your response time from spiking and preserving the
|
229
|
+
capacity of the cluster.
|
230
|
+
|
231
|
+
If this sounds familiar to you, Semian is what you need to be resilient to
|
232
|
+
latency. You may not need the graceful fallback depending on your application,
|
233
|
+
in which case it will just result in an error (e.g. a `HTTP 500`) faster.
|
234
|
+
|
235
|
+
We will now examine the two problems in detail.
|
236
|
+
|
237
|
+
#### In-depth analysis of real world example
|
238
|
+
|
239
|
+
If a single resource is slow, every single request is going to suffer. We saw
|
240
|
+
this in the example before. Let's illustrate this more clearly in the following
|
241
|
+
Rails example where the user session is stored in Redis:
|
242
|
+
|
243
|
+
```ruby
|
244
|
+
def index
|
245
|
+
@user = fetch_user
|
246
|
+
@posts = Post.all
|
247
|
+
end
|
248
|
+
|
249
|
+
private
|
250
|
+
def fetch_user
|
251
|
+
user = User.find(session[:user_id])
|
252
|
+
rescue Redis::CannotConnectError
|
253
|
+
nil
|
254
|
+
end
|
255
|
+
```
|
256
|
+
|
257
|
+
Our code is resilient to a failure of the session layer; it doesn't `HTTP 500`
|
258
|
+
if the session store is unavailable (this can be tested with
|
259
|
+
[Toxiproxy][toxiproxy]). If the `User` and `Post` data store is unavailable, the
|
260
|
+
server will send back `HTTP 500`. We accept that, because it's our primary data
|
261
|
+
store. This could be prevented with a caching tier or something else out of
|
262
|
+
scope.
|
263
|
+
|
264
|
+
This code has two flaws however:
|
265
|
+
|
266
|
+
1. **What happens if the session storage is consistently slow?** I.e. the majority
|
267
|
+
of requests take, say, more than half the timeout time (but it should only
|
268
|
+
take ~1ms)?
|
269
|
+
2. **What happens if the session storage is unavailable and is not responding at
|
270
|
+
all?** I.e. we hit timeouts on every request.
|
271
|
+
|
272
|
+
These two problems in turn have two related problems associated with them:
|
273
|
+
response time and capacity.
|
274
|
+
|
275
|
+
#### Response time
|
276
|
+
|
277
|
+
Requests that attempt to access a down session storage are all gracefully handled, the
|
278
|
+
`@user` will simply be `nil`, which the code handles. There is still a
|
279
|
+
major impact on users however, as every request to the storage has to time
|
280
|
+
out. This causes the average response time to all pages that access it to go up by
|
281
|
+
however long your timeout is. The added delay is proportional to your worst case timeout,
|
282
|
+
as well as the number of attempts to hit it on each page. This is the problem Semian
|
283
|
+
solves by using heuristics to fail these requests early which causes a much better
|
284
|
+
user experience during downtime.
|
285
|
+
|
286
|
+
#### Capacity loss
|
287
|
+
|
288
|
+
When your single-threaded worker is waiting for a resource to return, it's
|
289
|
+
effectively doing nothing when it could be serving fast requests. To use the
|
290
|
+
example from before, perhaps some actions do not access the session storage at
|
291
|
+
all. These requests will pile up behind the now slow requests that are trying to
|
292
|
+
access that layer, because they're failing slowly. Essentially, your capacity
|
293
|
+
degrades significantly because your average response time goes up (as explained
|
294
|
+
in the previous section). Capacity loss simply follows from an increase in
|
295
|
+
response time. The higher your timeout and the slower your resource, the more
|
296
|
+
capacity you lose.
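The link between response time and capacity can be made concrete with a little arithmetic. Every number below is hypothetical and chosen only for illustration; the model assumes single-threaded workers and one timed-out call to the slow resource per request.

```ruby
# Requests per second a pool of single-threaded workers can serve.
def requests_per_second(workers:, response_time_ms:)
  workers * 1000.0 / response_time_ms
end

WORKERS    = 16   # hypothetical worker count on one server
NORMAL_MS  = 100  # hypothetical response time with a healthy session store
TIMEOUT_MS = 200  # hypothetical timeout, hit once per request during the outage

HEALTHY_RPS   = requests_per_second(workers: WORKERS, response_time_ms: NORMAL_MS)
DEGRADED_RPS  = requests_per_second(workers: WORKERS, response_time_ms: NORMAL_MS + TIMEOUT_MS)
CAPACITY_LOST = 1 - DEGRADED_RPS / HEALTHY_RPS

puts format("healthy: %.1f req/s, degraded: %.1f req/s, capacity lost: %.0f%%",
            HEALTHY_RPS, DEGRADED_RPS, CAPACITY_LOST * 100)
```

Under these assumptions the response time triples and roughly two thirds of the capacity is gone, even though every request is handled gracefully.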
|
297
|
+
|
298
|
+
#### Timeouts aren't enough
|
299
|
+
|
300
|
+
It should be clear by now that timeouts aren't enough. Consistent timeouts will
|
301
|
+
increase the average response time, which causes a bad user experience, and
|
302
|
+
ultimately compromise the performance of the entire system. Even if the timeout
|
303
|
+
is as low as ~250ms (just enough to allow a single TCP retransmit) there's a
|
304
|
+
large loss of capacity and for many applications a 100-300% increase in average
|
305
|
+
response time. This is the problem Semian solves by failing fast.
|
306
|
+
|
307
|
+
## How does Semian work?
|
308
|
+
|
309
|
+
Semian consists of two parts: circuit breaker and bulkheading. To understand
|
310
|
+
Semian, and especially how to configure it, we must understand these patterns
|
311
|
+
and their implementation.
|
312
|
+
|
313
|
+
### Circuit Breaker
|
314
|
+
|
315
|
+
The circuit breaker pattern is based on a simple observation - if we hit a
|
316
|
+
timeout or any other error for a given service one or more times, we’re likely
|
317
|
+
to hit it again for some amount of time. Instead of hitting the timeout
|
318
|
+
repeatedly, we can mark the resource as dead for some amount of time during
|
319
|
+
which we raise an exception instantly on any call to it. This is called the
|
320
|
+
[circuit breaker pattern][cbp].
|
321
|
+
|
322
|
+

|
323
|
+
|
324
|
+
When we perform a Remote Procedure Call (RPC), it will first check the circuit.
|
325
|
+
If the circuit is rejecting requests because of too many failures reported by
|
326
|
+
the driver, it will throw an exception immediately. Otherwise the circuit will
|
327
|
+
call the driver. If the driver fails to get data back from the data store, it
|
328
|
+
will notify the circuit. The circuit will count the error so that if too many
|
329
|
+
errors have happened recently, it will start rejecting requests immediately
|
330
|
+
instead of waiting for the driver to time out. The exception will then be raised
|
331
|
+
back to the original caller. If the driver’s request was successful, it will
|
332
|
+
return the data back to the calling method and notify the circuit that it made a
|
333
|
+
successful call.
|
334
|
+
|
335
|
+
The state of the circuit breaker is local to the worker and is not shared across
|
336
|
+
all workers on a server.
|
337
|
+
|
338
|
+
#### Circuit Breaker Configuration
|
339
|
+
|
340
|
+
There are three configuration parameters for circuit breakers in Semian:
|
341
|
+
|
342
|
+
* **error_threshold**. The amount of errors to encounter for the worker before
|
343
|
+
opening the circuit, that is to start rejecting requests instantly.
|
344
|
+
* **error_timeout**. The amount of time until trying to query the resource
|
345
|
+
again.
|
346
|
+
* **success_threshold**. The amount of successes on the circuit until closing it
|
347
|
+
again, that is to start accepting all requests to the circuit.
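To show how the three parameters above interact, here is a deliberately simplified sketch. `ToyCircuitBreaker` is a hypothetical class written for this example and is not Semian's implementation: it simply counts errors instead of tracking them in a time window, and the `clock` argument exists only so the behaviour can be exercised without sleeping.

```ruby
class ToyCircuitBreaker
  class OpenCircuitError < StandardError; end

  attr_reader :state

  def initialize(error_threshold:, error_timeout:, success_threshold:, clock: -> { Time.now.to_f })
    @error_threshold = error_threshold
    @error_timeout = error_timeout
    @success_threshold = success_threshold
    @clock = clock
    @state = :closed
    @errors = 0
    @successes = 0
    @opened_at = nil
  end

  def acquire
    # After error_timeout has passed, let requests through again on trial.
    half_open if @state == :open && @clock.call - @opened_at >= @error_timeout
    raise OpenCircuitError, "circuit is open" if @state == :open

    begin
      result = yield
    rescue StandardError
      mark_failed
      raise
    end
    mark_success
    result
  end

  private

  def half_open
    @state = :half_open
    @successes = 0
  end

  def mark_failed
    @errors += 1
    # A failure during the trial period re-opens the circuit immediately.
    return unless @state == :half_open || @errors >= @error_threshold
    @state = :open
    @opened_at = @clock.call
    @errors = 0
  end

  def mark_success
    return unless @state == :half_open
    @successes += 1
    @state = :closed if @successes >= @success_threshold
  end
end
```

While the circuit is open, `acquire` raises without calling the block at all, which is what turns a slow failure into a fast one.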
|
348
|
+
|
349
|
+
### Bulkheading
|
350
|
+
|
351
|
+
For many applications, circuit breakers are not enough however. This is best
|
352
|
+
illustrated with an extreme. Imagine if the timeout for our data store isn't as
|
353
|
+
low as 200ms, but actually 10 seconds. For example, you might have a relational data
|
354
|
+
store where for some customers, 10s queries are (unfortunately) legitimate.
|
355
|
+
Reducing the time of worst case queries requires a lot of effort. Dropping the
|
356
|
+
query immediately could potentially leave some customers unable to access certain
|
357
|
+
functionality. High timeouts are especially critical in a non-threaded
|
358
|
+
environment where blocking IO means a worker is effectively doing nothing.
|
359
|
+
|
360
|
+
In this case, circuit breakers aren't sufficient. Assuming the circuit is shared
|
361
|
+
across all processes on a server, it will still take at least 10s before the
|
362
|
+
circuit is open—in that time every worker is blocked. Meaning we are in a
|
363
|
+
reduced capacity state for at least 20s, with the last 10s timeouts
|
364
|
+
occurring just before the circuit opens at the 10s mark when a couple of
|
365
|
+
workers have hit a timeout and the circuit opens. We thought of a number of
|
366
|
+
potential solutions to this problem - stricter timeouts, grouping timeouts by
|
367
|
+
section of our application, timeouts per statement—but they all still revolved
|
368
|
+
around timeouts, and those are extremely hard to get right.
|
369
|
+
|
370
|
+
Instead of thinking about timeouts, we took inspiration from Hystrix by Netflix
|
371
|
+
and the book Release It (the resiliency bible), and look at our services as
|
372
|
+
connection pools. On a server with `W` workers, only a certain number of them
|
373
|
+
are expected to be talking to a single data store at once. Let's say we've
|
374
|
+
determined from our monitoring that there’s a 10% chance they’re talking to
|
375
|
+
`mysql_shard_0` at any given point in time under normal traffic. The probability
|
376
|
+
that five workers are talking to it at the same time is 0.001%. If we only allow
|
377
|
+
five workers to talk to a resource at any given point in time, and accept the
|
378
|
+
0.001% false positive rate—we can fail the sixth worker attempting to check out
|
379
|
+
a connection instantly. This means that while the five workers are waiting for a
|
380
|
+
timeout, all the other `W-5` workers on the node will instantly be failing on
|
381
|
+
checking out the connection and opening their circuits. Our capacity is only
|
382
|
+
degraded by a relatively small amount.
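The 0.001% figure follows from multiplying the per-worker probability five times. The sketch below assumes, as the text implicitly does, that workers are busy independently of each other and that we look at one given group of five workers; the constant names are made up for the example.

```ruby
# Chance that a single worker is talking to the shard at a given moment.
P_BUSY = Rational(1, 10)

# Chance that a given group of five workers is busy at the same moment.
P_FIVE_BUSY = P_BUSY**5

# The same value expressed as a percentage: 1/1000 of a percent.
P_FIVE_BUSY_PERCENT = P_FIVE_BUSY * 100

puts "#{P_FIVE_BUSY_PERCENT.to_f}%"
```

Exact rationals are used here to avoid floating point noise in such a small number.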
|
383
|
+
|
384
|
+
We call this limitation primitive "tickets". In this case, the resource access
|
385
|
+
is limited to 5 tickets (see Configuration). The timeout value specifies the
|
386
|
+
maximum amount of time to block if no ticket is available.
|
387
|
+
|
388
|
+
How do we limit the access to a resource for all workers on a server when the
|
389
|
+
workers do not directly share memory? This is implemented with [SysV
|
390
|
+
semaphores][sysv] to provide server-wide access control.
|
391
|
+
|
392
|
+
#### Bulkhead Configuration
|
393
|
+
|
394
|
+
There are two configuration values. It's not easy to choose good values and we're
|
395
|
+
still experimenting with ways to figure out optimal ticket numbers. Generally
|
396
|
+
something below half the number of workers on the server for endpoints that are
|
397
|
+
queried frequently has worked well for us.
|
398
|
+
|
399
|
+
* **tickets**. Number of workers that can concurrently access a resource.
|
400
|
+
* **timeout**. Time to wait to acquire a ticket if there are no tickets left.
|
401
|
+
We recommend this to be `0` unless you have very few workers running (i.e.
|
402
|
+
less than ~5).
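The meaning of `tickets` and `timeout` can be sketched with an in-process stand-in. `ToyBulkhead` is a hypothetical class for illustration only: it limits threads inside one process with a `Mutex`, whereas Semian coordinates workers across processes with SysV semaphores.

```ruby
class ToyBulkhead
  class ResourceBusyError < StandardError; end

  def initialize(tickets:, timeout: 0)
    @tickets = tickets
    @timeout = timeout
    @mutex = Mutex.new
    @ticket_returned = ConditionVariable.new
  end

  # Runs the block while holding a ticket, or raises if none can be had.
  def acquire
    take_ticket
    begin
      yield
    ensure
      return_ticket
    end
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  def take_ticket
    @mutex.synchronize do
      deadline = now + @timeout
      while @tickets.zero?
        remaining = deadline - now
        # With timeout 0 this rejects immediately when no ticket is free.
        raise ResourceBusyError, "no tickets left" if remaining <= 0
        @ticket_returned.wait(@mutex, remaining)
      end
      @tickets -= 1
    end
  end

  def return_ticket
    @mutex.synchronize do
      @tickets += 1
      @ticket_returned.signal
    end
  end
end
```

With `timeout: 0`, a caller that finds no free ticket is rejected at once instead of queueing behind the slow resource.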
|
403
|
+
|
404
|
+
## Defense line
|
405
|
+
|
406
|
+
The finished defense line for resource access with circuit breakers and
|
407
|
+
bulkheads then looks like this:
|
408
|
+
|
409
|
+

|
410
|
+
|
411
|
+
The RPC first checks the circuit; if the circuit is open it will raise the
|
412
|
+
exception straight away which will trigger the fallback (the default fallback is
|
413
|
+
a 500 response). Otherwise, it will try Semian which fails instantly if too many
|
414
|
+
workers are already querying the resource. Finally the driver will query the
|
415
|
+
data store. If the data store succeeds, the driver will return the data back to
|
416
|
+
the RPC. If the data store is slow or fails, this is our last line of defense
|
417
|
+
against a misbehaving resource. The driver will raise an exception after trying
|
418
|
+
to connect with a timeout or after an immediate failure. These driver actions
|
419
|
+
will affect the circuit and Semian, which can make future calls fail faster.
|
420
|
+
|
421
|
+
## Failing gracefully
|
422
|
+
|
423
|
+
Ok, great, we've got a way to fail fast with slow resources, how does that make
|
424
|
+
my application more resilient?
|
425
|
+
|
426
|
+
Failing fast is only half the battle. It's up to you what you do with these
|
427
|
+
errors; in the [session example](#real-world-example) we handle it gracefully by
|
428
|
+
signing people out and disabling all session related functionality till the data
|
429
|
+
store is back online. However, not rescuing the exception and simply sending
|
430
|
+
`HTTP 500` back to the client faster will help with [capacity
|
431
|
+
loss](#capacity-loss).
|
432
|
+
|
433
|
+
### Exceptions inherit from base class
|
434
|
+
|
435
|
+
It's important to understand that the exceptions raised by [Semian
|
436
|
+
Adapters](#adapters) inherit from the base class of the driver itself, meaning
|
437
|
+
that if you do something like:
|
438
|
+
|
439
|
+
```ruby
|
440
|
+
def posts
|
441
|
+
Post.all
|
442
|
+
rescue Mysql2::Error
|
443
|
+
[]
|
444
|
+
end
|
445
|
+
```
|
446
|
+
|
447
|
+
Exceptions raised by Semian's `MySQL2` adapter will also get caught.
|
448
|
+
|
449
|
+
### Patterns
|
450
|
+
|
451
|
+
We do not recommend mindlessly sprinkling `rescue`s all over the place. What you
|
452
|
+
should do instead is write decorators around secondary data stores (e.g. sessions)
|
453
|
+
that provide resiliency for free. For example, if we stored the tags associated
|
454
|
+
with products in a secondary data store it could look something like this:
|
455
|
+
|
456
|
+
```ruby
|
457
|
+
# Resilient decorator for storing a Set in Redis.
|
458
|
+
class RedisSet
|
459
|
+
def initialize(key)
|
460
|
+
@key = key
|
461
|
+
end
|
462
|
+
|
463
|
+
def get
|
464
|
+
redis.smembers(@key)
|
465
|
+
rescue Redis::BaseConnectionError
|
466
|
+
[]
|
467
|
+
end
|
468
|
+
|
469
|
+
private
|
470
|
+
|
471
|
+
def redis
|
472
|
+
@redis ||= Redis.new
|
473
|
+
end
|
474
|
+
end
|
475
|
+
|
476
|
+
class Product
|
477
|
+
# This will simply return an empty array in the case of a Redis outage.
|
478
|
+
def tags
|
479
|
+
tags_set.get
|
480
|
+
end
|
481
|
+
|
482
|
+
private
|
483
|
+
|
484
|
+
def tags_set
|
485
|
+
@tags_set ||= RedisSet.new("product:tags:#{self.id}")
|
486
|
+
end
|
487
|
+
end
|
488
|
+
```
|
489
|
+
|
490
|
+
These decorators can be resiliency tested with [Toxiproxy][toxiproxy]. You can
|
491
|
+
provide fallbacks around your primary data store as well. In our case, we simply
|
492
|
+
`HTTP 500` in those cases unless it's cached because these pages aren't worth
|
493
|
+
much without data from their primary data store.
|
87
494
|
|
88
495
|
## Monitoring
|
89
496
|
|
@@ -105,17 +512,59 @@ Semian.subscribe do |event, resource, scope, adapter|
|
|
105
512
|
end
|
106
513
|
```
|
107
514
|
|
108
|
-
#
|
515
|
+
# FAQ
|
516
|
+
|
517
|
+
**How does Semian work with containers?** Semian uses [SysV semaphores][sysv] to
|
518
|
+
coordinate access to a resource. The semaphore is only shared within the
|
519
|
+
[IPC namespace][namespaces]. Unless you are running many workers inside every container,
|
520
|
+
this leaves the bulkheading pattern effectively useless. We recommend sharing
|
521
|
+
the IPC namespace between all containers on your host for the best ticket
|
522
|
+
economy. If you are using Docker, this can be done with the [--ipc
|
523
|
+
flag](https://docs.docker.com/reference/run/#ipc-settings).
|
524
|
+
|
525
|
+
**Why isn't resource access shared across the entire cluster?** This implies a
|
526
|
+
coordination data store. Semian would have to be resilient to failures of this
|
527
|
+
data store as well, and fall back to other primitives. While it's nice to have
|
528
|
+
all workers have the same view of the world, this greatly increases the
|
529
|
+
complexity of the implementation which is not favourable for resiliency code.
|
530
|
+
|
531
|
+
**Why isn't the circuit breaker implemented as a host-wide mechanism?** No good
|
532
|
+
reason. Patches welcome!
|
533
|
+
|
534
|
+
**Why is there no fallback mechanism in Semian?** Read the [Failing
|
535
|
+
Gracefully](#failing-gracefully) section. In short, exceptions are exactly this.
|
536
|
+
We did not want to put an extra level of abstraction on top of this. In the
|
537
|
+
first internal implementation this was the case, but we later moved away from
|
538
|
+
it.
|
539
|
+
|
540
|
+
**Why does it not use normal Ruby semaphores?** To work properly the access
|
541
|
+
control needs to be performed across many workers. With MRI that means having
|
542
|
+
multiple processes, not threads. Thus we need a primitive outside of the
|
543
|
+
interpreter. For other Ruby implementations a driver that uses Ruby semaphores
|
544
|
+
could be used (and would be accepted as a PR).
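A small experiment illustrates why an in-process primitive cannot coordinate forked workers. The sketch assumes a platform where `fork` is available (for example MRI on Linux), and `fork_and_increment` is a helper invented for this example.

```ruby
# Increments a counter inside a forked child and reports what each side sees.
# Returns [value_in_parent, value_in_child].
def fork_and_increment(counter)
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    counter += 1          # only changes the child's copy of the variable
    writer.puts(counter)
    writer.close
    exit!(0)              # leave the child without running at_exit hooks
  end

  writer.close
  child_value = reader.gets.to_i
  reader.close
  Process.wait(pid)

  [counter, child_value]
end

parent_value, child_value = fork_and_increment(0)
puts "parent sees #{parent_value}, child saw #{child_value}"
```

The child's change never reaches the parent, so a ticket count kept in Ruby memory would be tracked separately by every worker process.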
|
545
|
+
|
546
|
+
**Why are there three semaphores in the semaphore sets for each resource?** This
|
547
|
+
has to do with being able to resize the number of tickets for a resource online.
|
548
|
+
|
549
|
+
**Can I change the number of tickets freely?** Yes, the logic for this isn't
|
550
|
+
trivial but it works well.
|
109
551
|
|
110
|
-
|
552
|
+
**What is the performance overhead of Semian?** Extremely minimal in comparison
|
553
|
+
to going to the network. Don't worry about it unless you're instrumenting
|
554
|
+
non-IO.
|
111
555
|
|
112
556
|
[hystrix]: https://github.com/Netflix/Hystrix
|
113
557
|
[release-it]: https://pragprog.com/book/mnee/release-it
|
114
558
|
[shopify]: http://www.shopify.com/
|
115
|
-
[mysql-semian-adapter]:
|
116
|
-
[redis-semian-adapter]:
|
117
|
-
[semian-adapter]:
|
118
|
-
[semian-
|
559
|
+
[mysql-semian-adapter]: lib/semian/mysql2.rb
|
560
|
+
[redis-semian-adapter]: lib/semian/redis.rb
|
561
|
+
[semian-adapter]: lib/semian/adapter.rb
|
562
|
+
[nethttp-semian-adapter]: lib/semian/net_http.rb
|
563
|
+
[nethttp-default-errors]: lib/semian/net_http.rb#L33-L43
|
564
|
+
[semian-instrumentable]: lib/semian/instrumentable.rb
|
119
565
|
[statsd-instrument]: http://github.com/shopify/statsd-instrument
|
120
566
|
[resiliency-blog-post]: http://www.shopify.com/technology/16906928-building-and-testing-resilient-ruby-on-rails-applications
|
121
567
|
[toxiproxy]: https://github.com/Shopify/toxiproxy
|
568
|
+
[sysv]: http://man7.org/linux/man-pages/man7/svipc.7.html
|
569
|
+
[cbp]: https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
|
570
|
+
[namespaces]: http://man7.org/linux/man-pages/man7/namespaces.7.html
|