semian 0.3.0 → 0.4.0
- checksums.yaml +4 -4
- data/.gitignore +2 -2
- data/.rubocop.yml +113 -0
- data/CHANGELOG.md +8 -0
- data/Gemfile +5 -0
- data/LICENSE.md +1 -1
- data/README.md +488 -39
- data/Rakefile +15 -8
- data/ext/semian/extconf.rb +2 -2
- data/lib/semian.rb +16 -6
- data/lib/semian/adapter.rb +1 -1
- data/lib/semian/circuit_breaker.rb +38 -37
- data/lib/semian/mysql2.rb +21 -1
- data/lib/semian/net_http.rb +95 -0
- data/lib/semian/protected_resource.rb +7 -2
- data/lib/semian/resource.rb +1 -1
- data/lib/semian/simple_integer.rb +23 -0
- data/lib/semian/simple_sliding_window.rb +43 -0
- data/lib/semian/simple_state.rb +43 -0
- data/lib/semian/unprotected_resource.rb +4 -1
- data/lib/semian/version.rb +1 -1
- data/repodb.yml +1 -0
- data/scripts/install_toxiproxy.sh +3 -3
- data/semian.gemspec +4 -3
- data/test/circuit_breaker_test.rb +6 -2
- data/test/helpers/background_helper.rb +1 -1
- data/test/instrumentation_test.rb +1 -1
- data/test/mysql2_test.rb +57 -1
- data/test/net_http_test.rb +481 -0
- data/test/redis_test.rb +3 -3
- data/test/resource_test.rb +33 -31
- data/test/semian_test.rb +3 -2
- data/test/simple_integer_test.rb +49 -0
- data/test/simple_sliding_window_test.rb +65 -0
- data/test/simple_state_test.rb +45 -0
- data/test/test_helper.rb +5 -0
- data/test/unprotected_resource_test.rb +1 -1
- metadata +30 -27
- checksums.yaml.gz.sig +0 -0
- data.tar.gz.sig +0 -1
- metadata.gz.sig +0 -0
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 9df1d9d29629650b74dab6ead2cc8d288de28d78
|
4
|
+
data.tar.gz: 8752cce5af3858930ee820eab7c2f1c558eeed22
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 2070b61bb3ac7080c36eaba3c0508eb81928b32111dd11e6f5d9668b78fada81549f2eac75f3021787ce03fd30b6f3c5f01a4f5f99d077b0b3aa9aeecc5a3e9d
|
7
|
+
data.tar.gz: 63f9400b70e1bf039819bb26accf5ecbcea87ab1fb938ff0abaf8e44ccfbbd78e016c3ff48832960cab6fb1c9428785da0f4c2333d15c651b2647bf99d9ae41d
|
data/.gitignore
CHANGED
data/.rubocop.yml
ADDED
@@ -0,0 +1,113 @@
|
|
1
|
+
AllCops:
|
2
|
+
Exclude:
|
3
|
+
- Gemfile
|
4
|
+
- lib/snippets/**/*
|
5
|
+
- vendor/**/*
|
6
|
+
- data/**/*
|
7
|
+
- db/schema.rb
|
8
|
+
- db/migrate/*
|
9
|
+
- test/dummy/**/*
|
10
|
+
- bin/rails
|
11
|
+
- lib/shipit-engine.rb
|
12
|
+
- tmp/**/*
|
13
|
+
|
14
|
+
Style/GuardClause:
|
15
|
+
Enabled: false
|
16
|
+
|
17
|
+
Lint/AssignmentInCondition:
|
18
|
+
Enabled: false
|
19
|
+
|
20
|
+
Lint/HandleExceptions:
|
21
|
+
Enabled: false
|
22
|
+
|
23
|
+
Lint/EndAlignment:
|
24
|
+
Enabled: false
|
25
|
+
|
26
|
+
Style/NumericLiterals:
|
27
|
+
Exclude:
|
28
|
+
- db/schema.rb
|
29
|
+
|
30
|
+
Style/SingleSpaceBeforeFirstArg:
|
31
|
+
Exclude:
|
32
|
+
- db/schema.rb
|
33
|
+
|
34
|
+
Style/DoubleNegation:
|
35
|
+
Enabled: false
|
36
|
+
|
37
|
+
Metrics/LineLength:
|
38
|
+
Max: 135
|
39
|
+
|
40
|
+
Metrics/MethodLength:
|
41
|
+
Max: 40
|
42
|
+
|
43
|
+
Metrics/ClassLength:
|
44
|
+
Max: 500
|
45
|
+
|
46
|
+
Metrics/AbcSize:
|
47
|
+
Max: 50
|
48
|
+
|
49
|
+
Metrics/CyclomaticComplexity:
|
50
|
+
Max: 10
|
51
|
+
|
52
|
+
Style/Documentation:
|
53
|
+
Enabled: false
|
54
|
+
|
55
|
+
Style/SingleLineBlockParams:
|
56
|
+
Enabled: false
|
57
|
+
|
58
|
+
Style/SignalException:
|
59
|
+
Enabled: false
|
60
|
+
|
61
|
+
Style/RaiseArgs:
|
62
|
+
Enabled: false
|
63
|
+
|
64
|
+
Style/ModuleFunction:
|
65
|
+
Enabled: false
|
66
|
+
|
67
|
+
Style/RedundantReturn:
|
68
|
+
AllowMultipleReturnValues: true
|
69
|
+
|
70
|
+
Style/IndentHash:
|
71
|
+
Enabled: false
|
72
|
+
|
73
|
+
Style/TrailingComma:
|
74
|
+
EnforcedStyleForMultiline: comma
|
75
|
+
|
76
|
+
Style/ClassAndModuleChildren:
|
77
|
+
Enabled: false
|
78
|
+
|
79
|
+
Style/PredicateName:
|
80
|
+
Exclude:
|
81
|
+
- app/serializers/**/*
|
82
|
+
|
83
|
+
Style/SpaceInsideHashLiteralBraces:
|
84
|
+
EnforcedStyle: no_space
|
85
|
+
|
86
|
+
Style/StringLiterals:
|
87
|
+
Enabled: false
|
88
|
+
|
89
|
+
Style/PerlBackrefs:
|
90
|
+
Enabled: false
|
91
|
+
|
92
|
+
Style/TrivialAccessors:
|
93
|
+
AllowPredicates: true
|
94
|
+
|
95
|
+
Style/ExtraSpacing:
|
96
|
+
AllowForAlignment: false
|
97
|
+
|
98
|
+
Style/GlobalVars:
|
99
|
+
Exclude:
|
100
|
+
- 'ext/semian/extconf.rb'
|
101
|
+
|
102
|
+
Lint/Eval:
|
103
|
+
Exclude:
|
104
|
+
- 'Rakefile'
|
105
|
+
|
106
|
+
Metrics/ParameterLists:
|
107
|
+
Enabled: false
|
108
|
+
|
109
|
+
Style/IfUnlessModifier:
|
110
|
+
Enabled: false
|
111
|
+
|
112
|
+
Style/CaseIndentation:
|
113
|
+
IndentWhenRelativeTo: end
|
data/CHANGELOG.md
ADDED
@@ -0,0 +1,8 @@
|
|
1
|
+
# v0.4.0
|
2
|
+
|
3
|
+
* net/http: add adapter for net/http #58
|
4
|
+
* circuit_breaker: split circuit breaker into three data structures to allow for
|
5
|
+
alternative implementations in the future #62
|
6
|
+
* mysql: don't prevent rollbacks on transactions #60
|
7
|
+
* core: fix initialization bug when the resource is accessed before the options
|
8
|
+
are set #65
|
data/Gemfile
CHANGED
data/LICENSE.md
CHANGED
@@ -1,6 +1,6 @@
|
|
1
1
|
The MIT License (MIT)
|
2
2
|
|
3
|
-
Copyright (c) 2014
|
3
|
+
Copyright (c) 2014 Shopify
|
4
4
|
|
5
5
|
Permission is hereby granted, free of charge, to any person obtaining a copy
|
6
6
|
of this software and associated documentation files (the "Software"), to deal
|
data/README.md
CHANGED
@@ -1,15 +1,47 @@
|
|
1
1
|
## Semian [Build Status](https://travis-ci.org/Shopify/semian)
|
2
2
|
|
3
|
-
|
4
|
-
applications against misbehaving external services. It allows you to fail fast
|
5
|
-
so you can handle errors gracefully. The patterns are inspired by
|
6
|
-
[Hystrix][hystrix] and [Release It][release-it]. Semian is an extraction from
|
7
|
-
[Shopify][shopify] where it's been running successfully in production since
|
8
|
-
October, 2014.
|
3
|
+
|
9
4
|
|
10
|
-
|
11
|
-
|
12
|
-
|
5
|
+
Semian is a library for controlling access to slow or unresponsive external
|
6
|
+
services to avoid cascading failures.
|
7
|
+
|
8
|
+
When services are down they typically fail fast with errors like `ECONNREFUSED`
|
9
|
+
and `ECONNRESET` which can be rescued in code. However, slow resources fail
|
10
|
+
slowly. The thread serving the request blocks until it hits the timeout for the
|
11
|
+
slow resource. During that time, the thread is doing nothing useful and thus the
|
12
|
+
slow resource has caused a cascading failure by occupying workers and therefore
|
13
|
+
losing capacity. **Semian is a library for failing fast in these situations,
|
14
|
+
allowing you to handle errors gracefully.** Semian does this by intercepting
|
15
|
+
resource access through heuristic patterns inspired by [Hystrix][hystrix] and
|
16
|
+
[Release It][release-it]:
|
17
|
+
|
18
|
+
* [**Circuit breaker**](#circuit-breaker). A pattern for limiting the
|
19
|
+
amount of requests to a dependency that is having issues.
|
20
|
+
* [**Bulkheading**](#bulkheading). Controlling the concurrent access to
|
21
|
+
a single resource; access is coordinated server-wide with [SysV
|
22
|
+
semaphores][sysv].
|
23
|
+
|
24
|
+
Resource drivers are monkey-patched to be aware of Semian; these are called
|
25
|
+
[Semian Adapters](#adapters). Thus, every time resource access is requested
|
26
|
+
Semian is queried for status on the resource first. If Semian, through the
|
27
|
+
patterns above, deems the resource to be unavailable it will raise an exception.
|
28
|
+
**The ultimate outcome of Semian is always an exception that can then be rescued
|
29
|
+
for a graceful fallback**. Instead of waiting for the timeout, Semian raises
|
30
|
+
straight away.
|
31
|
+
|
32
|
+
If you are already rescuing exceptions for failing resources and timeouts,
|
33
|
+
Semian is mostly a drop-in library with a little configuration that will make
|
34
|
+
your code more resilient to slow resource access. But, [do you even need
|
35
|
+
Semian?](#do-i-need-semian)
|
36
|
+
|
37
|
+
For an overview of building resilient Ruby applications, start by reading [the
|
38
|
+
Shopify blog post on Toxiproxy and Semian][resiliency-blog-post]. For more in
|
39
|
+
depth information on Semian, see [Understanding Semian](#understanding-semian).
|
40
|
+
Semian is an extraction from [Shopify][shopify] where it's been running
|
41
|
+
successfully in production since October, 2014.
|
42
|
+
|
43
|
+
The other component to your Ruby resiliency kit is [Toxiproxy][toxiproxy] to
|
44
|
+
write automated resiliency tests.
|
13
45
|
|
14
46
|
# Usage
|
15
47
|
|
@@ -26,10 +58,47 @@ section](#configuration) on how to configure adapters.
|
|
26
58
|
|
27
59
|
## Adapters
|
28
60
|
|
29
|
-
|
61
|
+
Semian works by intercepting resource access. Every time access is requested,
|
62
|
+
Semian is queried, and it will raise an exception if the resource is unavailable
|
63
|
+
according to the circuit breaker or bulkheads. This is done by monkey-patching
|
64
|
+
the resource driver. **The exception raised by the driver always inherits from
|
65
|
+
the Base exception class of the driver**, meaning you can always simply rescue
|
66
|
+
the base class and catch both Semian and driver errors in the same rescue for
|
67
|
+
fallbacks.
|
68
|
+
|
69
|
+
The following adapters are in Semian and tested heavily in production; the
|
70
|
+
version is the version of the public gem with the same name:
|
30
71
|
|
31
72
|
* [`semian/mysql2`][mysql-semian-adapter] (~> 0.3.16)
|
32
73
|
* [`semian/redis`][redis-semian-adapter] (~> 3.2.1)
|
74
|
+
* [`semian/net_http`][nethttp-semian-adapter]
|
75
|
+
|
76
|
+
### Creating Adapters
|
77
|
+
|
78
|
+
To create a Semian adapter you must implement the following methods:
|
79
|
+
|
80
|
+
1. [`include Semian::Adapter`][semian-adapter]. Use the helpers to wrap the
|
81
|
+
resource. This takes care of situations such as monitoring, nested resources,
|
82
|
+
unsupported platforms, creating the Semian resource if it doesn't already
|
83
|
+
exist and so on.
|
84
|
+
2. `#semian_identifier`. This is responsible for returning a symbol that
|
85
|
+
represents every unique resource, for example `redis_master` or
|
86
|
+
`mysql_shard_1`. This is usually assembled from a `name` attribute on the
|
87
|
+
Semian configuration hash, but could also be `<host>:<port>`.
|
88
|
+
3. `connect`. The name of this method varies. You must override the driver's
|
89
|
+
connect method with one that wraps the connect call with
|
90
|
+
`Semian::Resource#acquire`. You should do this at the lowest possible level.
|
91
|
+
4. `query`. Same as `connect` but for queries on the resource.
|
92
|
+
5. Define exceptions `ResourceBusyError` and `CircuitOpenError`. These are
|
93
|
+
raised when the request was rejected early because the resource is out of
|
94
|
+
tickets or because the circuit breaker is open (see [Understanding
|
95
|
+
Semian](#understanding-semian)). They should inherit from the base exception
|
96
|
+
class from the raw driver. For example `Mysql2::Error` or
|
97
|
+
`Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
|
98
|
+
easy to `rescue` and handle them gracefully in application code, by
|
99
|
+
`rescue`ing the base class.
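Step 5 can be tried out without a real driver. The following is a minimal sketch, assuming a made-up `FakeDriver` module (it is not part of Semian or of any gem); it only shows why inheriting from the driver's base class lets a single `rescue` cover both driver errors and Semian-style errors.

```ruby
# Hypothetical driver used only for illustration; not part of Semian.
module FakeDriver
  # Base exception class of the (imaginary) raw driver.
  class Error < StandardError; end

  # Raised by the driver itself, e.g. on a socket timeout.
  class TimeoutError < Error; end

  # Semian-style errors defined by the adapter. Inheriting from the driver's
  # base class lets callers rescue both kinds of failure in one place.
  class ResourceBusyError < Error; end
  class CircuitOpenError < Error; end
end

# Application code rescues only the driver's base class.
def fetch_with_fallback
  yield
rescue FakeDriver::Error
  :fallback
end

fetch_with_fallback { raise FakeDriver::CircuitOpenError } # returns :fallback
fetch_with_fallback { raise FakeDriver::TimeoutError }     # returns :fallback
fetch_with_fallback { :data }                              # returns :data
```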
|
100
|
+
|
101
|
+
The best resource is looking at the [already implemented adapters](#adapters).
|
33
102
|
|
34
103
|
### Configuration
|
35
104
|
|
@@ -58,32 +127,370 @@ client = Redis.new(semian: {
|
|
58
127
|
})
|
59
128
|
```
|
60
129
|
|
61
|
-
|
130
|
+
#### Net::HTTP
|
131
|
+
For the `Net::HTTP` specific Semian adapter, since many external libraries may create
|
132
|
+
HTTP connections on the user's behalf, the parameters are instead provided
|
133
|
+
by calling specific functions in `Semian::NetHTTP`, perhaps in an initialization file.
|
62
134
|
|
63
|
-
|
135
|
+
##### Naming and Options
|
136
|
+
To give Semian parameters, assign a `proc` to `Semian::NetHTTP.semian_configuration`
|
137
|
+
that takes two parameters, `host` and `port` like `127.0.0.1` and `80` or `github_com` and `80`,
|
138
|
+
and returns a `Hash` with keys as follows.
|
64
139
|
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
|
74
|
-
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
raised when the request was rejected early because the resource is out of
|
79
|
-
tickets or because the circuit breaker is open (see [Understanding
|
80
|
-
Semian](#understanding-semian). They should inherit from the base exception
|
81
|
-
class from the raw driver. For example `Mysql2::Error` or
|
82
|
-
`Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
|
83
|
-
easy to `rescue` and handle them gracefully in application code, by
|
84
|
-
`rescue`ing the base class.
|
140
|
+
```ruby
|
141
|
+
SEMIAN_PARAMETERS = { tickets: 1,
|
142
|
+
success_threshold: 1,
|
143
|
+
error_threshold: 3,
|
144
|
+
error_timeout: 10 }
|
145
|
+
Semian::NetHTTP.semian_configuration = proc do |host, port|
|
146
|
+
# Let's make it only active for github.com
|
147
|
+
if host == "github.com" && port.to_i == 80
|
148
|
+
SEMIAN_PARAMETERS.merge(name: "github.com_80")
|
149
|
+
else
|
150
|
+
nil
|
151
|
+
end
|
152
|
+
end
|
85
153
|
|
86
|
-
|
154
|
+
# Called from within API:
|
155
|
+
# semian_options = Semian::NetHTTP.semian_configuration("github.com", 80)
|
156
|
+
# semian_identifier = "nethttp_#{semian_options[:name]}"
|
157
|
+
```
|
158
|
+
|
159
|
+
The `name` should be carefully chosen since it identifies the resource being protected.
|
160
|
+
The `semian_options` passed apply to that resource. Semian creates the `semian_identifier`
|
161
|
+
from the `name` to look up and store changes in the circuit breaker and bulkhead states
|
162
|
+
and associate successes, failures, errors with the protected resource.
|
163
|
+
|
164
|
+
For most purposes, `"#{host}_#{port}"` is a good default `name`. Custom `name` formats
|
165
|
+
can be useful for grouping related subdomains as one resource, so that they all
|
166
|
+
contribute to the same circuit breaker and bulkhead state and fail together.
|
167
|
+
|
168
|
+
A return value of `nil` for `semian_configuration` means Semian is disabled for that
|
169
|
+
HTTP endpoint. This works well since the result of a failed Hash lookup is `nil` also.
|
170
|
+
This behavior lets the adapter default to whitelisting, although the
|
171
|
+
behavior can be changed to blacklisting or even be completely disabled by varying
|
172
|
+
the use of returning `nil` in the assigned closure.
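As a rough sketch of these naming and opt-out choices, the proc below protects every endpoint except a short exclusion list, and groups all subdomains of one domain under a single `name`. The option values, the `UNPROTECTED_HOSTS` list and the `example.com` domain are assumptions made for illustration, not recommendations.

```ruby
# Hypothetical defaults; tune these for your own services.
DEFAULT_SEMIAN_OPTIONS = {
  tickets: 2,
  success_threshold: 1,
  error_threshold: 3,
  error_timeout: 10,
}.freeze

# Hosts that should never be protected (assumed list, for illustration).
UNPROTECTED_HOSTS = %w(localhost 127.0.0.1).freeze

NET_HTTP_CONFIGURATION = proc do |host, port|
  if UNPROTECTED_HOSTS.include?(host)
    nil # returning nil disables Semian for this endpoint
  elsif host == "example.com" || host.end_with?(".example.com")
    # Group all subdomains under one name so they share the same circuit
    # breaker and bulkhead state and fail together.
    DEFAULT_SEMIAN_OPTIONS.merge(name: "example_com_#{port}")
  else
    DEFAULT_SEMIAN_OPTIONS.merge(name: "#{host}_#{port}")
  end
end

# With the semian gem loaded, the proc would be assigned like this:
#   Semian::NetHTTP.semian_configuration = NET_HTTP_CONFIGURATION
```

Because the proc is plain Ruby, it can be unit tested by calling it directly with a host and port before wiring it into Semian.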
|
173
|
+
|
174
|
+
##### Additional Exceptions
|
175
|
+
Since we envision this particular adapter can be used in combination with many
|
176
|
+
external libraries, that can raise additional exceptions, we added functionality to
|
177
|
+
expand the Exceptions that can be tracked as part of Semian's circuit breaker.
|
178
|
+
This may be necessary for libraries that introduce new exceptions or re-raise them.
|
179
|
+
Add exceptions and reset to the [`default`][nethttp-default-errors] list using the following:
|
180
|
+
|
181
|
+
```ruby
|
182
|
+
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
|
183
|
+
Semian::NetHTTP.exceptions += [::OpenSSL::SSL::SSLError]
|
184
|
+
|
185
|
+
Semian::NetHTTP.reset_exceptions
|
186
|
+
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
|
187
|
+
```
|
188
|
+
|
189
|
+
# Understanding Semian
|
190
|
+
|
191
|
+
Semian is a library with heuristics for failing fast. This section will explain
|
192
|
+
in depth how Semian works and which situations it's applicable for. First we
|
193
|
+
explain the category of problems Semian is meant to solve. Then we dive into how
|
194
|
+
Semian works to solve these problems.
|
195
|
+
|
196
|
+
## Do I need Semian?
|
197
|
+
|
198
|
+
Semian is not a trivial library to understand, introduces complexity and thus
|
199
|
+
should be introduced with care. Remember, all Semian does is raise exceptions
|
200
|
+
based on heuristics. It is paramount that you understand Semian before
|
201
|
+
including it in production as you may otherwise be surprised by its behaviour.
|
202
|
+
|
203
|
+
Applications that benefit from Semian are those working on eliminating SPOFs
|
204
|
+
(Single Points of Failure), and specifically are running into a wall regarding
|
205
|
+
slow resources. But it is by no means a magic wand that solves all your latency
|
206
|
+
problems by being added to your `Gemfile`. This section describes the types of
|
207
|
+
problems Semian solves.
|
208
|
+
|
209
|
+
If your application is multithreaded or evented (i.e. not Resque or Unicorn)
|
210
|
+
these problems are not as pressing. You can still get use out of Semian however.
|
211
|
+
|
212
|
+
### Real World Example
|
213
|
+
|
214
|
+
This is better illustrated with a real world example from Shopify. When you are
|
215
|
+
browsing a store while signed in, Shopify stores your session in Redis.
|
216
|
+
If Redis becomes unavailable, the driver will start throwing exceptions.
|
217
|
+
We rescue these exceptions and simply disable all customer sign in functionality
|
218
|
+
on the store until Redis is back online.
|
219
|
+
|
220
|
+
This is great if querying the resource fails instantly, because it means we fail
|
221
|
+
in just a single roundtrip of ~1ms. But if the resource is unresponsive or slow,
|
222
|
+
this can take as long as our timeout which is easily 200ms. This means every
|
223
|
+
request, even if it does rescue the exception, now takes an extra 200ms.
|
224
|
+
Because every resource takes that long, our capacity is also significantly
|
225
|
+
degraded. These problems are explained in depth in the next two sections.
|
226
|
+
|
227
|
+
With Semian, the slow resource would fail instantly (after a small amount of
|
228
|
+
convergence time) preventing your response time from spiking and not decreasing
|
229
|
+
capacity of the cluster.
|
230
|
+
|
231
|
+
If this sounds familiar to you, Semian is what you need to be resilient to
|
232
|
+
latency. You may not need the graceful fallback depending on your application,
|
233
|
+
in which case it will just result in an error (e.g. a `HTTP 500`) faster.
|
234
|
+
|
235
|
+
We will now examine the two problems in detail.
|
236
|
+
|
237
|
+
#### In-depth analysis of real world example
|
238
|
+
|
239
|
+
If a single resource is slow, every single request is going to suffer. We saw
|
240
|
+
this in the example before. Let's illustrate this more clearly in the following
|
241
|
+
Rails example where the user session is stored in Redis:
|
242
|
+
|
243
|
+
```ruby
|
244
|
+
def index
|
245
|
+
@user = fetch_user
|
246
|
+
@posts = Post.all
|
247
|
+
end
|
248
|
+
|
249
|
+
private
|
250
|
+
def fetch_user
|
251
|
+
user = User.find(session[:user_id])
|
252
|
+
rescue Redis::CannotConnectError
|
253
|
+
nil
|
254
|
+
end
|
255
|
+
```
|
256
|
+
|
257
|
+
Our code is resilient to a failure of the session layer, it doesn't `HTTP 500`
|
258
|
+
if the session store is unavailable (this can be tested with
|
259
|
+
[Toxiproxy][toxiproxy]). If the `User` and `Post` data store is unavailable, the
|
260
|
+
server will send back `HTTP 500`. We accept that, because it's our primary data
|
261
|
+
store. This could be prevented with a caching tier or something else out of
|
262
|
+
scope.
|
263
|
+
|
264
|
+
This code has two flaws however:
|
265
|
+
|
266
|
+
1. **What happens if the session storage is consistently slow?** I.e. the majority
|
267
|
+
of requests take, say, more than half the timeout time (but it should only
|
268
|
+
take ~1ms)?
|
269
|
+
2. **What happens if the session storage is unavailable and is not responding at
|
270
|
+
all?** I.e. we hit timeouts on every request.
|
271
|
+
|
272
|
+
These two problems in turn have two related problems associated with them:
|
273
|
+
response time and capacity.
|
274
|
+
|
275
|
+
#### Response time
|
276
|
+
|
277
|
+
Requests that attempt to access a down session storage are all gracefully handled, the
|
278
|
+
`@user` will simply be `nil`, which the code handles. There is still a
|
279
|
+
major impact on users however, as every request to the storage has to time
|
280
|
+
out. This causes the average response time to all pages that access it to go up by
|
281
|
+
however long your timeout is. The added latency is proportional to your worst-case timeout,
|
282
|
+
as well as the number of attempts to hit it on each page. This is the problem Semian
|
283
|
+
solves by using heuristics to fail these requests early which causes a much better
|
284
|
+
user experience during downtime.
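The arithmetic behind this is simple enough to sketch. The numbers below (a 100ms healthy response time, a 200ms timeout and two accesses per page) are assumptions picked for illustration, and every access is assumed to block for the full timeout.

```ruby
# Response time of a page while the slow resource is timing out.
# Every access to the resource is assumed to block for the full timeout.
def degraded_response_time_ms(baseline_ms, timeout_ms, accesses_per_page)
  baseline_ms + timeout_ms * accesses_per_page
end

baseline = 100 # assumed healthy response time in ms
degraded = degraded_response_time_ms(baseline, 200, 2)
increase = (degraded - baseline) * 100 / baseline

puts "healthy: #{baseline}ms, degraded: #{degraded}ms (+#{increase}%)"
# prints "healthy: 100ms, degraded: 500ms (+400%)"
```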
|
285
|
+
|
286
|
+
#### Capacity loss
|
287
|
+
|
288
|
+
When your single-threaded worker is waiting for a resource to return, it's
|
289
|
+
effectively doing nothing when it could be serving fast requests. To use the
|
290
|
+
example from before, perhaps some actions do not access the session storage at
|
291
|
+
all. These requests will pile up behind the now slow requests that are trying to
|
292
|
+
access that layer, because they're failing slowly. Essentially, your capacity
|
293
|
+
degrades significantly because your average response time goes up (as explained
|
294
|
+
in the previous section). Capacity loss simply follows from an increase in
|
295
|
+
response time. The higher your timeout and the slower your resource, the more
|
296
|
+
capacity you lose.
|
297
|
+
|
298
|
+
#### Timeouts aren't enough
|
299
|
+
|
300
|
+
It should be clear by now that timeouts aren't enough. Consistent timeouts will
|
301
|
+
increase the average response time, which causes a bad user experience, and
|
302
|
+
ultimately compromise the performance of the entire system. Even if the timeout
|
303
|
+
is as low as ~250ms (just enough to allow a single TCP retransmit) there's a
|
304
|
+
large loss of capacity and for many applications a 100-300% increase in average
|
305
|
+
response time. This is the problem Semian solves by failing fast.
|
306
|
+
|
307
|
+
## How does Semian work?
|
308
|
+
|
309
|
+
Semian consists of two parts: circuit breaker and bulkheading. To understand
|
310
|
+
Semian, and especially how to configure it, we must understand these patterns
|
311
|
+
and their implementation.
|
312
|
+
|
313
|
+
### Circuit Breaker
|
314
|
+
|
315
|
+
The circuit breaker pattern is based on a simple observation - if we hit a
|
316
|
+
timeout or any other error for a given service one or more times, we’re likely
|
317
|
+
to hit it again for some amount of time. Instead of hitting the timeout
|
318
|
+
repeatedly, we can mark the resource as dead for some amount of time during
|
319
|
+
which we raise an exception instantly on any call to it. This is called the
|
320
|
+
[circuit breaker pattern][cbp].
|
321
|
+
|
322
|
+
[image: circuit breaker diagram]
|
323
|
+
|
324
|
+
When we perform a Remote Procedure Call (RPC), it will first check the circuit.
|
325
|
+
If the circuit is rejecting requests because of too many failures reported by
|
326
|
+
the driver, it will throw an exception immediately. Otherwise the circuit will
|
327
|
+
call the driver. If the driver fails to get data back from the data store, it
|
328
|
+
will notify the circuit. The circuit will count the error so that if too many
|
329
|
+
errors have happened recently, it will start rejecting requests immediately
|
330
|
+
instead of waiting for the driver to time out. The exception will then be raised
|
331
|
+
back to the original caller. If the driver’s request was successful, it will
|
332
|
+
return the data back to the calling method and notify the circuit that it made a
|
333
|
+
successful call.
|
334
|
+
|
335
|
+
The state of the circuit breaker is local to the worker and is not shared across
|
336
|
+
all workers on a server.
|
337
|
+
|
338
|
+
#### Circuit Breaker Configuration
|
339
|
+
|
340
|
+
There are three configuration parameters for circuit breakers in Semian:
|
341
|
+
|
342
|
+
* **error_threshold**. The amount of errors to encounter for the worker before
|
343
|
+
opening the circuit, that is to start rejecting requests instantly.
|
344
|
+
* **error_timeout**. The amount of time until trying to query the resource
|
345
|
+
again.
|
346
|
+
* **success_threshold**. The amount of successes on the circuit until closing it
|
347
|
+
again, that is to start accepting all requests to the circuit.
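To make the three parameters concrete, here is a minimal single-process sketch of a circuit breaker. It is not Semian's implementation: the class name, the `clock` argument (injected so the example can be tested without sleeping) and the choice to count only errors that happened within the last `error_timeout` seconds are all assumptions of this sketch.

```ruby
# Minimal, single-process circuit breaker sketch. Not Semian's implementation.
class SketchCircuitBreaker
  class OpenCircuitError < StandardError; end

  attr_reader :state

  def initialize(error_threshold:, error_timeout:, success_threshold:,
                 clock: -> { Time.now.to_f })
    @error_threshold = error_threshold
    @error_timeout = error_timeout
    @success_threshold = success_threshold
    @clock = clock
    @state = :closed
    @errors = []   # timestamps of recent errors
    @successes = 0 # successes seen while half-open
  end

  def acquire
    # After error_timeout has passed, let requests through again on probation.
    half_open! if @state == :open && @clock.call - @errors.last >= @error_timeout
    raise OpenCircuitError, "circuit is open" if @state == :open

    begin
      result = yield
    rescue StandardError
      mark_failed
      raise
    end
    mark_success
    result
  end

  private

  def half_open!
    @state = :half_open
    @successes = 0
  end

  def mark_failed
    now = @clock.call
    @errors << now
    # Keep only errors inside the error_timeout window.
    @errors.reject! { |t| now - t > @error_timeout }
    @state = :open if @state == :half_open || @errors.size >= @error_threshold
  end

  def mark_success
    return unless @state == :half_open
    @successes += 1
    return if @successes < @success_threshold
    @state = :closed
    @errors.clear
  end
end
```

A caller wraps each resource access in `breaker.acquire { ... }` and rescues `OpenCircuitError` alongside the driver's own errors.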
|
348
|
+
|
349
|
+
### Bulkheading
|
350
|
+
|
351
|
+
For many applications, circuit breakers are not enough however. This is best
|
352
|
+
illustrated with an extreme. Imagine if the timeout for our data store isn't as
|
353
|
+
low as 200ms, but actually 10 seconds. For example, you might have a relational data
|
354
|
+
store where for some customers, 10s queries are (unfortunately) legitimate.
|
355
|
+
Reducing the time of worst case queries requires a lot of effort. Dropping the
|
356
|
+
query immediately could potentially leave some customers unable to access certain
|
357
|
+
functionality. High timeouts are especially critical in a non-threaded
|
358
|
+
environment where blocking IO means a worker is effectively doing nothing.
|
359
|
+
|
360
|
+
In this case, circuit breakers aren't sufficient. Assuming the circuit is shared
|
361
|
+
across all processes on a server, it will still take at least 10s before the
|
362
|
+
circuit is open—in that time every worker is blocked. Meaning we are in a
|
363
|
+
reduced capacity state for at least 20s, with the last 10s timeouts
|
364
|
+
occurring just before the circuit opens at the 10s mark when a couple of
|
365
|
+
workers have hit a timeout and the circuit opens. We thought of a number of
|
366
|
+
potential solutions to this problem - stricter timeouts, grouping timeouts by
|
367
|
+
section of our application, timeouts per statement—but they all still revolved
|
368
|
+
around timeouts, and those are extremely hard to get right.
|
369
|
+
|
370
|
+
Instead of thinking about timeouts, we took inspiration from Hystrix by Netflix
|
371
|
+
and the book Release It (the resiliency bible), and look at our services as
|
372
|
+
connection pools. On a server with `W` workers, only a certain number of them
|
373
|
+
are expected to be talking to a single data store at once. Let's say we've
|
374
|
+
determined from our monitoring that there’s a 10% chance they’re talking to
|
375
|
+
`mysql_shard_0` at any given point in time under normal traffic. The probability
|
376
|
+
that five workers are talking to it at the same time is 0.001%. If we only allow
|
377
|
+
five workers to talk to a resource at any given point in time, and accept the
|
378
|
+
0.001% false positive rate—we can fail the sixth worker attempting to check out
|
379
|
+
a connection instantly. This means that while the five workers are waiting for a
|
380
|
+
timeout, all the other `W-5` workers on the node will instantly be failing on
|
381
|
+
checking out the connection and opening their circuits. Our capacity is only
|
382
|
+
degraded by a relatively small amount.
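The 0.001% figure follows from treating each worker as independently busy with the resource 10% of the time. A small sketch of that arithmetic, where the method name is made up for this example:

```ruby
# Chance that `tickets` given workers are all talking to the resource at once,
# assuming each one is busy with it independently with probability `p`.
def all_busy_probability(p, tickets)
  p**tickets
end

probability = all_busy_probability(0.10, 5)
puts format("%.3f%%", probability * 100) # prints "0.001%"
```

This is the simplified model used in the text. On a server with many workers, the chance that any five of them are busy at the same moment is higher, so treat the number as a starting point and check it against your own monitoring.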
|
383
|
+
|
384
|
+
We call this limitation primitive "tickets". In this case, the resource access
|
385
|
+
is limited to 5 tickets (see Configuration). The timeout value specifies the
|
386
|
+
maximum amount of time to block if no ticket is available.
|
387
|
+
|
388
|
+
How do we limit the access to a resource for all workers on a server when the
|
389
|
+
workers do not directly share memory? This is implemented with [SysV
|
390
|
+
semaphores][sysv] to provide server-wide access control.
|
391
|
+
|
392
|
+
#### Bulkhead Configuration
|
393
|
+
|
394
|
+
There are two configuration values. It's not easy to choose good values and we're
|
395
|
+
still experimenting with ways to figure out optimal ticket numbers. Generally
|
396
|
+
something below half the number of workers on the server for endpoints that are
|
397
|
+
queried frequently has worked well for us.
|
398
|
+
|
399
|
+
* **tickets**. Number of workers that can concurrently access a resource.
|
400
|
+
* **timeout**. Time to wait to acquire a ticket if there are no tickets left.
|
401
|
+
We recommend this to be `0` unless you have very few workers running (i.e.
|
402
|
+
less than ~5).
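The effect of `tickets` and `timeout` can be sketched with an in-process stand-in. Semian coordinates tickets across processes with SysV semaphores; the class below only limits threads inside one process using a `Mutex` and a `ConditionVariable`, and its name and error class are assumptions of this sketch.

```ruby
# In-process stand-in for Semian's server-wide tickets (illustration only).
class SketchBulkhead
  class ResourceBusyError < StandardError; end

  def initialize(tickets:, timeout: 0)
    @available = tickets
    @timeout = timeout
    @mutex = Mutex.new
    @ticket_returned = ConditionVariable.new
  end

  # Runs the block while holding a ticket; raises if none frees up in time.
  def acquire
    take_ticket
    begin
      yield
    ensure
      return_ticket
    end
  end

  private

  def take_ticket
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @timeout
    @mutex.synchronize do
      while @available.zero?
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        # With timeout 0 this fails instantly when no ticket is left.
        raise ResourceBusyError, "no tickets available" if remaining <= 0
        @ticket_returned.wait(@mutex, remaining)
      end
      @available -= 1
    end
  end

  def return_ticket
    @mutex.synchronize do
      @available += 1
      @ticket_returned.signal
    end
  end
end
```

With `timeout: 0`, a caller that finds no free ticket gets `ResourceBusyError` straight away instead of queueing behind the slow resource.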
|
403
|
+
|
404
|
+
## Defense line
|
405
|
+
|
406
|
+
The finished defense line for resource access with circuit breakers and
|
407
|
+
bulkheads then looks like this:
|
408
|
+
|
409
|
+
[image: diagram of the defense line]
|
410
|
+
|
411
|
+
The RPC first checks the circuit; if the circuit is open it will raise the
|
412
|
+
exception straight away which will trigger the fallback (the default fallback is
|
413
|
+
a 500 response). Otherwise, it will try Semian which fails instantly if too many
|
414
|
+
workers are already querying the resource. Finally the driver will query the
|
415
|
+
data store. If the data store succeeds, the driver will return the data back to
|
416
|
+
the RPC. If the data store is slow or fails, this is our last line of defense
|
417
|
+
against a misbehaving resource. The driver will raise an exception after trying
|
418
|
+
to connect with a timeout or after an immediate failure. These driver actions
|
419
|
+
will affect the circuit and Semian, which can make future calls fail faster.
|
420
|
+
|
421
|
+
## Failing gracefully
|
422
|
+
|
423
|
+
Ok, great, we've got a way to fail fast with slow resources, how does that make
|
424
|
+
my application more resilient?
|
425
|
+
|
426
|
+
Failing fast is only half the battle. It's up to you what you do with these
|
427
|
+
errors, in the [session example](#real-world-example) we handle it gracefully by
|
428
|
+
signing people out and disabling all session related functionality till the data
|
429
|
+
store is back online. However, not rescuing the exception and simply sending
|
430
|
+
`HTTP 500` back to the client faster will help with [capacity
|
431
|
+
loss](#capacity-loss).
|
432
|
+
|
433
|
+
### Exceptions inherit from base class
|
434
|
+
|
435
|
+
It's important to understand that the exceptions raised by [Semian
|
436
|
+
Adapters](#adapters) inherit from the base class of the driver itself, meaning
|
437
|
+
that if you do something like:
|
438
|
+
|
439
|
+
```ruby
|
440
|
+
def posts
|
441
|
+
Post.all
|
442
|
+
rescue Mysql2::Error
|
443
|
+
[]
|
444
|
+
end
|
445
|
+
```
|
446
|
+
|
447
|
+
Exceptions raised by Semian's `MySQL2` adapter will also get caught.
|
448
|
+
|
449
|
+
### Patterns
|
450
|
+
|
451
|
+
We do not recommend mindlessly sprinkling `rescue`s all over the place. What you
|
452
|
+
should do instead is write decorators around secondary data stores (e.g. sessions)
|
453
|
+
that provide resiliency for free. For example, if we stored the tags associated
|
454
|
+
with products in a secondary data store it could look something like this:
|
455
|
+
|
456
|
+
```ruby
|
457
|
+
# Resilient decorator for storing a Set in Redis.
|
458
|
+
class RedisSet
|
459
|
+
def initialize(key)
|
460
|
+
@key = key
|
461
|
+
end
|
462
|
+
|
463
|
+
def get
|
464
|
+
redis.smembers(@key)
|
465
|
+
rescue Redis::BaseConnectionError
|
466
|
+
[]
|
467
|
+
end
|
468
|
+
|
469
|
+
private
|
470
|
+
|
471
|
+
def redis
|
472
|
+
@redis ||= Redis.new
|
473
|
+
end
|
474
|
+
end
|
475
|
+
|
476
|
+
class Product
|
477
|
+
# This will simply return an empty array in the case of a Redis outage.
|
478
|
+
def tags
|
479
|
+
tags_set.get
|
480
|
+
end
|
481
|
+
|
482
|
+
private
|
483
|
+
|
484
|
+
def tags_set
|
485
|
+
@tags_set ||= RedisSet.new("product:tags:#{self.id}")
|
486
|
+
end
|
487
|
+
end
|
488
|
+
```
|
489
|
+
|
490
|
+
These decorators can be resiliency tested with [Toxiproxy][toxiproxy]. You can
|
491
|
+
provide fallbacks around your primary data store as well. In our case, we simply
|
492
|
+
return `HTTP 500` in those cases unless the page is cached, because these pages aren't worth
|
493
|
+
much without data from their primary data store.
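
When a Toxiproxy instance isn't at hand (in a plain unit test, say), the fallback path of such a decorator can also be exercised with a stub client. The sketch below does not use Toxiproxy. `FakeConnectionError`, `DownClient`, `UpClient` and `ResilientSet` are all hypothetical names, and the client is injected so that no Redis server or gem is needed.

```ruby
# Hypothetical stand-in for a connection error such as Redis::BaseConnectionError.
class FakeConnectionError < StandardError; end

# Hypothetical client that behaves like a data store during an outage.
class DownClient
  def smembers(_key)
    raise FakeConnectionError, "connection refused"
  end
end

# Hypothetical healthy client backed by a Hash.
class UpClient
  def initialize(data)
    @data = data
  end

  def smembers(key)
    @data.fetch(key, [])
  end
end

# Same shape as the RedisSet decorator, with the client injected for testing.
class ResilientSet
  def initialize(key, client)
    @key = key
    @client = client
  end

  # Returns the members, or an empty array if the store is unreachable.
  def get
    @client.smembers(@key)
  rescue FakeConnectionError
    []
  end
end

key = "product:tags:1"
healthy = ResilientSet.new(key, UpClient.new(key => ["sale"])).get
outage  = ResilientSet.new(key, DownClient.new).get
```

A stub like this only checks the rescue path. It says nothing about timeouts or slow responses, which is where a proxy-based test earns its keep.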
|
87
494
|
|
88
495
|
## Monitoring
|
89
496
|
|
@@ -105,17 +512,59 @@ Semian.subscribe do |event, resource, scope, adapter|
|
|
105
512
|
end
|
106
513
|
```
|
107
514
|
|
108
|
-
#
|
515
|
+
# FAQ
|
516
|
+
|
517
|
+
**How does Semian work with containers?** Semian uses [SysV semaphores][sysv] to
|
518
|
+
coordinate access to a resource. The semaphore is only shared within the
|
519
|
+
[IPC namespace][namespaces]. Unless you are running many workers inside every container,
|
520
|
+
this leaves the bulkheading pattern effectively useless. We recommend sharing
|
521
|
+
the IPC namespace between all containers on your host for the best ticket
|
522
|
+
economy. If you are using Docker, this can be done with the [--ipc
|
523
|
+
flag](https://docs.docker.com/reference/run/#ipc-settings).
|
524
|
+
|
525
|
+
**Why isn't resource access shared across the entire cluster?** This implies a
|
526
|
+
coordination data store. Semian would have to be resilient to failures of this
|
527
|
+
data store as well, and fall back to other primitives. While it's nice to have
|
528
|
+
all workers have the same view of the world, this greatly increases the
|
529
|
+
complexity of the implementation, which is not favourable for resiliency code.
|
530
|
+
|
531
|
+
**Why isn't the circuit breaker implemented as a host-wide mechanism?** No good
|
532
|
+
reason. Patches welcome!
|
533
|
+
|
534
|
+
**Why is there no fallback mechanism in Semian?** Read the [Failing
|
535
|
+
Gracefully](#failing-gracefully) section. In short, exceptions are exactly this.
|
536
|
+
We did not want to put an extra level of abstraction on top of this. In the
|
537
|
+
first internal implementation this was the case, but we later moved away from
|
538
|
+
it.
|
539
|
+
|
540
|
+
**Why does it not use normal Ruby semaphores?** To work properly the access
|
541
|
+
control needs to be performed across many workers. With MRI that means having
|
542
|
+
multiple processes, not threads. Thus we need a primitive outside of the
|
543
|
+
interpreter. For other Ruby implementations a driver that uses Ruby semaphores
|
544
|
+
could be used (and would be accepted as a PR).
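
A small sketch of the underlying problem, assuming a platform where `Process.fork` is available (so not Windows): state that lives inside the interpreter is copied into each forked worker rather than shared. The `tickets` counter here is purely illustrative and is not how Semian tracks tickets.

```ruby
# Sketch: state held inside the interpreter is copied on fork, not shared.
tickets = 2

reader, writer = IO.pipe
pid = fork do
  reader.close
  tickets -= 1          # the child "acquires" a ticket in its own copy
  writer.puts(tickets)  # report the child's view back to the parent
  writer.close
end

writer.close
child_view = reader.gets.to_i
reader.close
Process.wait(pid)

parent_view = tickets   # unchanged: the parent never saw the decrement
```

The child sees one ticket fewer while the parent's count is untouched, which is why access control across worker processes needs a primitive that lives outside any single interpreter.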
|
545
|
+
|
546
|
+
**Why are there three semaphores in the semaphore sets for each resource?** This
|
547
|
+
has to do with being able to resize the number of tickets for a resource online.
|
548
|
+
|
549
|
+
**Can I change the number of tickets freely?** Yes, the logic for this isn't
|
550
|
+
trivial but it works well.
|
109
551
|
|
110
|
-
|
552
|
+
**What is the performance overhead of Semian?** Extremely minimal in comparison
|
553
|
+
to going to the network. Don't worry about it unless you're instrumenting
|
554
|
+
non-IO.
|
111
555
|
|
112
556
|
[hystrix]: https://github.com/Netflix/Hystrix
|
113
557
|
[release-it]: https://pragprog.com/book/mnee/release-it
|
114
558
|
[shopify]: http://www.shopify.com/
|
115
|
-
[mysql-semian-adapter]:
|
116
|
-
[redis-semian-adapter]:
|
117
|
-
[semian-adapter]:
|
118
|
-
[semian-
|
559
|
+
[mysql-semian-adapter]: lib/semian/mysql2.rb
|
560
|
+
[redis-semian-adapter]: lib/semian/redis.rb
|
561
|
+
[semian-adapter]: lib/semian/adapter.rb
|
562
|
+
[nethttp-semian-adapter]: lib/semian/net_http.rb
|
563
|
+
[nethttp-default-errors]: lib/semian/net_http.rb#L33-L43
|
564
|
+
[semian-instrumentable]: lib/semian/instrumentable.rb
|
119
565
|
[statsd-instrument]: http://github.com/shopify/statsd-instrument
|
120
566
|
[resiliency-blog-post]: http://www.shopify.com/technology/16906928-building-and-testing-resilient-ruby-on-rails-applications
|
121
567
|
[toxiproxy]: https://github.com/Shopify/toxiproxy
|
568
|
+
[sysv]: http://man7.org/linux/man-pages/man7/svipc.7.html
|
569
|
+
[cbp]: https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
|
570
|
+
[namespaces]: http://man7.org/linux/man-pages/man7/namespaces.7.html
|