semian 0.12.0 → 0.13.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +235 -0
- data/LICENSE.md +21 -0
- data/README.md +836 -0
- data/ext/semian/extconf.rb +21 -19
- data/lib/semian/adapter.rb +8 -4
- data/lib/semian/circuit_breaker.rb +16 -10
- data/lib/semian/grpc.rb +32 -10
- data/lib/semian/instrumentable.rb +2 -0
- data/lib/semian/lru_hash.rb +15 -14
- data/lib/semian/mysql2.rb +13 -9
- data/lib/semian/net_http.rb +10 -4
- data/lib/semian/platform.rb +3 -1
- data/lib/semian/protected_resource.rb +5 -3
- data/lib/semian/rails.rb +12 -6
- data/lib/semian/redis.rb +15 -13
- data/lib/semian/redis_client.rb +5 -3
- data/lib/semian/resource.rb +5 -3
- data/lib/semian/simple_integer.rb +4 -2
- data/lib/semian/simple_sliding_window.rb +5 -3
- data/lib/semian/simple_state.rb +3 -1
- data/lib/semian/unprotected_resource.rb +2 -0
- data/lib/semian/version.rb +3 -1
- data/lib/semian.rb +61 -45
- metadata +11 -201
data/README.md
ADDED
@@ -0,0 +1,836 @@
## Semian ![Build Status](https://github.com/Shopify/semian/actions/workflows/main.yml/badge.svg)

![](http://i.imgur.com/7Vn2ibF.png)

Semian is a library for controlling access to slow or unresponsive external
services to avoid cascading failures.

When services are down they typically fail fast with errors like `ECONNREFUSED`
and `ECONNRESET`, which can be rescued in code. However, slow resources fail
slowly: the thread serving the request blocks until it hits the timeout for the
slow resource. During that time the thread is doing nothing useful, so the
slow resource has caused a cascading failure by occupying workers and thereby
losing capacity. **Semian is a library for failing fast in these situations,
allowing you to handle errors gracefully.** Semian does this by intercepting
resource access through heuristic patterns inspired by [Hystrix][hystrix] and
[Release It][release-it]:

* [**Circuit breaker**](#circuit-breaker). A pattern for limiting the
  number of requests to a dependency that is having issues.
* [**Bulkheading**](#bulkheading). Controlling concurrent access to a
  single resource; access is coordinated server-wide with [SysV
  semaphores][sysv].

Resource drivers are monkey-patched to be aware of Semian; these are called
[Semian Adapters](#adapters). Thus, every time resource access is requested,
Semian is queried for status on the resource first. If Semian, through the
patterns above, deems the resource to be unavailable, it raises an exception.
**The ultimate outcome of Semian is always an exception that can then be
rescued for a graceful fallback.** Instead of waiting for the timeout, Semian
raises straight away.

If you are already rescuing exceptions for failing resources and timeouts,
Semian is mostly a drop-in library with a little configuration that will make
your code more resilient to slow resource access. But [do you even need
Semian?](#do-i-need-semian)

For an overview of building resilient Ruby applications, start by reading [the
Shopify blog post on Toxiproxy and Semian][resiliency-blog-post]. For more
in-depth information on Semian, see [Understanding Semian](#understanding-semian).
Semian is an extraction from [Shopify][shopify], where it has been running
successfully in production since October 2014.

The other component of your Ruby resiliency kit is [Toxiproxy][toxiproxy], for
writing automated resiliency tests.

# Usage

Install by adding the gem to your `Gemfile` and require the [adapters](#adapters) you need:

```ruby
gem 'semian', require: %w(semian semian/mysql2 semian/redis)
```

We recommend this pattern of requiring adapters directly from the `Gemfile`.
This ensures Semian adapters are loaded as early as possible, and also
protects your application during boot. Please see the [adapter configuration
section](#configuration) on how to configure adapters.

## Adapters

Semian works by intercepting resource access. Every time access is requested,
Semian is queried, and it will raise an exception if the resource is unavailable
according to the circuit breaker or bulkheads. This is done by monkey-patching
the resource driver. **The exception raised by the driver always inherits from
the base exception class of the driver**, meaning you can simply rescue the
base class and catch both Semian and driver errors in the same rescue for
fallbacks.

The following adapters are part of Semian and tested heavily in production; the
version is the version of the public gem with the same name:

* [`semian/mysql2`][mysql-semian-adapter] (~> 0.3.16)
* [`semian/redis`][redis-semian-adapter] (~> 3.2.1)
* [`semian/net_http`][nethttp-semian-adapter]

### Creating Adapters

To create a Semian adapter you must implement the following methods:

1. [`include Semian::Adapter`][semian-adapter]. Use the helpers to wrap the
   resource. This takes care of situations such as monitoring, nested resources,
   unsupported platforms, creating the Semian resource if it doesn't already
   exist, and so on.
2. `#semian_identifier`. This is responsible for returning a symbol that
   represents every unique resource, for example `redis_master` or
   `mysql_shard_1`. This is usually assembled from a `name` attribute on the
   Semian configuration hash, but could also be `<host>:<port>`.
3. `connect`. The name of this method varies. You must override the driver's
   connect method with one that wraps the connect call with
   `Semian::Resource#acquire`. You should do this at the lowest possible level.
4. `query`. Same as `connect` but for queries on the resource.
5. Define the exceptions `ResourceBusyError` and `CircuitOpenError`. These are
   raised when the request was rejected early because the resource is out of
   tickets or because the circuit breaker is open (see [Understanding
   Semian](#understanding-semian)). They should inherit from the base exception
   class of the raw driver, for example `Mysql2::Error` or
   `Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
   easy to `rescue` and handle them gracefully in application code, by
   `rescue`ing the base class.

The best resource is looking at the [already implemented adapters](#adapters).
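
The five steps above can be sketched as a toy adapter. This is only an illustration of the shape, not Semian's real API surface: `MyDriver` is a hypothetical driver, and `acquire` below is a stand-in for `Semian::Resource#acquire`, so the sketch runs without the gem.

```ruby
# Toy stand-ins so the sketch runs without the semian gem or a real driver.
# MyDriver and its Error class are hypothetical.
module MyDriver
  Error = Class.new(StandardError)

  class Client
    def connect
      "connected"
    end

    def query(sql)
      "result of #{sql}"
    end
  end
end

module SemianMyDriver
  # Step 5: both Semian errors inherit from the driver's base error class,
  # so `rescue MyDriver::Error` catches driver *and* Semian failures.
  ResourceBusyError = Class.new(MyDriver::Error)
  CircuitOpenError = Class.new(MyDriver::Error)

  # Step 2: one unique symbol per protected resource.
  def semian_identifier
    :"mydriver_#{@semian_name}"
  end

  # Steps 3 and 4: override connect/query, wrapping the driver's own
  # implementation (reached via super) in resource acquisition.
  def connect
    acquire { super }
  end

  def query(sql)
    acquire { super }
  end

  private

  # Stand-in for Semian::Resource#acquire; the real one can raise
  # ResourceBusyError or CircuitOpenError instead of yielding.
  def acquire
    yield
  end
end

class ProtectedClient < MyDriver::Client
  prepend SemianMyDriver

  def initialize(name)
    @semian_name = name
  end
end

client = ProtectedClient.new("shard_1")
client.semian_identifier # => :mydriver_shard_1
client.query("SELECT 1") # => "result of SELECT 1"
```

`prepend` puts the wrapper ahead of the driver's methods in the lookup chain, which is why `super` reaches the original implementation; the real adapters patch at the lowest level of the driver instead.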

### Configuration

There are some global configuration options that can be set for Semian:

```ruby
# Maximum size of the LRU cache (default: 500)
# Note: Setting this to 0 enables aggressive garbage collection.
Semian.maximum_lru_size = 0

# Minimum time a resource should be resident in the LRU cache (default: 300s)
Semian.minimum_lru_time = 60
```

Note: `minimum_lru_time` is a stronger guarantee than `maximum_lru_size`. That
is, if a resource has been updated more recently than `minimum_lru_time` it
will not be garbage collected, even if it would cause the LRU cache to grow
larger than `maximum_lru_size`.

When instantiating a resource it now needs to be configured for Semian. This is
done by passing `semian` as an argument when initializing the client. Examples
with the built-in adapters:

```ruby
# MySQL2 client
# In Rails this means having a Semian key in database.yml for each db.
client = Mysql2::Client.new(host: "localhost", username: "root", semian: {
  name: "master",
  tickets: 8, # See the Understanding Semian section on picking these values
  success_threshold: 2,
  error_threshold: 3,
  error_timeout: 10
})

# Redis client
client = Redis.new(semian: {
  name: "inventory",
  tickets: 4,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})
```
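
Because the exceptions Semian raises inherit from the driver's base error class (as noted in the Adapters section), a single rescue gives you a graceful fallback. A small sketch: the `Redis` constants below are minimal stubs standing in for the real gems (which define classes with this same inheritance shape), and `FlakyClient` / `inventory_count` are hypothetical names.

```ruby
# Minimal stubs standing in for `require "redis"` + `require "semian/redis"`.
module Redis
  BaseConnectionError = Class.new(StandardError)
  CircuitOpenError = Class.new(BaseConnectionError)
end

# Hypothetical client whose circuit is currently open.
class FlakyClient
  def get(_key)
    raise Redis::CircuitOpenError, "inventory circuit open"
  end
end

# One rescue on the base class covers driver errors, bulkhead rejections,
# and open circuits alike.
def inventory_count(client)
  Integer(client.get("inventory:count"))
rescue Redis::BaseConnectionError
  nil # degrade gracefully while Redis is slow, down, or circuit-broken
end

inventory_count(FlakyClient.new) # => nil
```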

#### Thread Safety

Semian's circuit breaker implementation is thread-safe by default as of
`v0.7.0`. If you'd like to disable it for performance reasons, pass
`thread_safety_disabled: true` to the resource options.

Bulkheads should be disabled (pass `bulkhead: false`) in a threaded environment
(e.g. Puma or Sidekiq), but can safely be enabled in non-threaded environments
(e.g. Resque and Unicorn). As described in this document, circuit breakers alone
should be adequate in most environments with reasonably low timeouts.

Internally, Semian uses `SEM_UNDO` for several SysV semaphore operations:

* Acquire
* Worker registration
* Semaphore metadata state lock

The intention behind `SEM_UNDO` is that a semaphore operation is automatically
undone when the process exits. This is true even if the process exits
abnormally (crashes, receives a `SIGKILL`, etc.) because it is handled by the
operating system and not the process itself.

If, however, a thread performs a semop, the `SEM_UNDO` is on its parent
process. This means that the operation *will not* be undone when the thread
exits. This can result in the following unfavorable behavior when using
threads:

* Threads acquire a resource, but are killed and the resource ticket is never
  released. For a process, the ticket would be released by `SEM_UNDO`, but
  since it's a thread there is the potential for ticket starvation. This can
  result in deadlock on the resource.
* Threads register workers on a resource, but are killed and never
  unregistered. For a process, the worker count would be automatically
  decremented by `SEM_UNDO`, but for threads the worker count will continue to
  increment, only being undone when the parent process dies. This can cause
  the number of tickets to dramatically exceed the quota.
* If a thread acquires the semaphore metadata lock and dies before releasing
  it, Semian will deadlock on anything attempting to acquire the metadata lock
  until the thread's parent process exits. This can prevent the ticket count
  from being updated.

Moreover, a strategy that utilizes `SEM_UNDO` is not compatible with a
strategy that attempts to manage the semaphore's tickets manually. In order to
support threads, operations that currently use `SEM_UNDO` would need to use no
semaphore flag, and the calling process would be responsible for ensuring that
threads are appropriately cleaned up. It is still possible to implement this,
but it would likely require an in-memory semaphore managed by the parent
process of the threads. PRs welcome for this functionality.

#### Quotas

You may also set quotas per worker:

```ruby
client = Redis.new(semian: {
  name: "inventory",
  quota: 0.51,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})
```

With a quota you no longer need to care about the number of tickets. Instead,
the tickets are computed as a proportion of the number of active workers.

In this case, we'd allow 51% of the workers on a particular host to connect to
this Redis resource. As long as the workers are in their own processes, they
will automatically be registered. The quota will set the bulkhead threshold
based on the number of registered workers, whenever a new worker registers.

This is ideal for environments with non-uniform worker distribution, and it
eliminates the need to manually calculate and adjust ticket counts.

**Note**:

- You must pass **exactly** one of `tickets` or `quota`.
- The number of tickets available will be the ceiling of the quota ratio times the number of workers.
  - So, with one worker, there will always be a minimum of 1 ticket.
- Workers in different processes will automatically unregister when the process exits.
- If you have a small number of workers (exactly 2) it's possible that the bulkhead will be too sensitive when using quotas.
- If you use a forking web server (like Unicorn) you should call `Semian.unregister_all_resources` before/after forking.
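
The ceiling rule in the notes above can be checked with a few lines; `tickets_for` is a hypothetical helper written for this sketch, not Semian API:

```ruby
# Tickets derived from a quota: the ceiling of quota * registered workers,
# so a single worker always yields at least 1 ticket.
def tickets_for(quota, workers)
  (quota * workers).ceil
end

tickets_for(0.51, 10) # => 6 (just over half of 10 workers)
tickets_for(0.51, 2)  # => 2 (with 2 workers, small quota changes swing the threshold)
tickets_for(0.25, 1)  # => 1 (never less than one ticket)
```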

#### Net::HTTP

For the `Net::HTTP`-specific Semian adapter, since many external libraries may
create HTTP connections on the user's behalf, the parameters are instead
provided by associating callback functions with `Semian::NetHTTP`, perhaps in
an initialization file.

##### Naming and Options

To give Semian parameters, assign a `proc` to `Semian::NetHTTP.semian_configuration`
that takes two parameters, `host` and `port` (like `127.0.0.1`, `443` or
`github_com`, `80`), and returns a `Hash` with configuration parameters as
follows. The `proc` is used as a callback to initialize the configuration
options, similar to other adapters.

```ruby
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10 }
Semian::NetHTTP.semian_configuration = proc do |host, port|
  # Let's make it only active for github.com
  if host == "github.com" && port == "80"
    SEMIAN_PARAMETERS.merge(name: "github.com_80")
  else
    nil
  end
end

# Called from within the adapter:
# semian_options = Semian::NetHTTP.semian_configuration("github.com", 80)
# semian_identifier = "nethttp_#{semian_options[:name]}"
```

The `name` should be carefully chosen since it identifies the resource being
protected. The `semian_options` passed apply to that resource. Semian creates
the `semian_identifier` from the `name` to look up and store changes in the
circuit breaker and bulkhead states, and to associate successes, failures, and
errors with the protected resource.

We only require that:
* the `semian_configuration` be **set only once** over the lifetime of the library
* the output of the `proc` be the same over time, that is, the configuration produced by
  each pair of `host`, `port` is **the same each time** the callback is invoked.

For most purposes, `"#{host}_#{port}"` is a good default `name`. Custom `name`
formats can be useful for grouping related subdomains as one resource, so that
they all contribute to the same circuit breaker and bulkhead state and fail
together.

A return value of `nil` from `semian_configuration` means Semian is disabled
for that HTTP endpoint. This works well since the result of a failed `Hash`
lookup is also `nil`. This behavior lets the adapter default to whitelisting,
although the behavior can be changed to blacklisting or even be completely
disabled by varying the use of returning `nil` in the assigned closure.

##### Additional Exceptions

Since we envision this particular adapter being used in combination with many
external libraries, which can raise additional exceptions, we added
functionality to expand the exceptions that can be tracked as part of Semian's
circuit breaker. This may be necessary for libraries that introduce new
exceptions or re-raise them. Add exceptions, and reset to the
[`default`][nethttp-default-errors] list, using the following:

```ruby
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
Semian::NetHTTP.exceptions += [::OpenSSL::SSL::SSLError]

Semian::NetHTTP.reset_exceptions
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
```

##### Mark Unsuccessful Responses as Failures

Unsuccessful responses (e.g. 5xx responses) do not raise exceptions, and as
such are not marked as failures by default. The `open_circuit_server_errors`
Semian configuration parameter may be set to enable this behaviour, marking
unsuccessful responses as failures as seen below:

```ruby
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10,
                      open_circuit_server_errors: true }
```

# Understanding Semian

Semian is a library with heuristics for failing fast. This section explains in
depth how Semian works and which situations it's applicable for. First we
explain the category of problems Semian is meant to solve. Then we dive into
how Semian works to solve these problems.

## Do I need Semian?

Semian is not a trivial library to understand; it introduces complexity and
thus should be introduced with care. Remember, all Semian does is raise
exceptions based on heuristics. It is paramount that you understand Semian
before including it in production, as you may otherwise be surprised by its
behaviour.

Applications that benefit from Semian are those working on eliminating SPOFs
(Single Points of Failure), and specifically those running into a wall
regarding slow resources. But it is by no means a magic wand that solves all
your latency problems by being added to your `Gemfile`. This section describes
the types of problems Semian solves.

If your application is multithreaded or evented (i.e. not Resque and Unicorn),
these problems are not as pressing. You can still get use out of Semian,
however.

### Real World Example

This is better illustrated with a real world example from Shopify. When you
are browsing a store while signed in, Shopify stores your session in Redis. If
Redis becomes unavailable, the driver starts throwing exceptions. We rescue
these exceptions and simply disable all customer sign-in functionality on the
store until Redis is back online.

This is great if querying the resource fails instantly, because it means we
fail in just a single roundtrip of ~1ms. But if the resource is unresponsive
or slow, this can take as long as our timeout, which is easily 200ms. This
means every request, even if it does rescue the exception, now takes an extra
200ms. Because every request takes that long, our capacity is also
significantly degraded. These problems are explained in depth in the next two
sections.

With Semian, the slow resource would fail instantly (after a small amount of
convergence time), preventing your response time from spiking and not
decreasing the capacity of the cluster.

If this sounds familiar to you, Semian is what you need to be resilient to
latency. You may not need the graceful fallback, depending on your
application, in which case it will just result in an error (e.g. an
`HTTP 500`) faster.

We will now examine the two problems in detail.

#### In-depth analysis of real world example

If a single resource is slow, every single request is going to suffer. We saw
this in the example before. Let's illustrate this more clearly in the
following Rails example where the user session is stored in Redis:

```ruby
def index
  @user = fetch_user
  @posts = Post.all
end

private

def fetch_user
  User.find(session[:user_id])
rescue Redis::CannotConnectError
  nil
end
```

Our code is resilient to a failure of the session layer; it doesn't `HTTP 500`
if the session store is unavailable (this can be tested with
[Toxiproxy][toxiproxy]). If the `User` and `Post` data store is unavailable,
the server will send back `HTTP 500`. We accept that, because it's our primary
data store. This could be prevented with a caching tier or something else out
of scope.

This code has two flaws, however:

1. **What happens if the session storage is consistently slow?** I.e. the
   majority of requests take, say, more than half the timeout time (but it
   should only take ~1ms)?
2. **What happens if the session storage is unavailable and is not responding
   at all?** I.e. we hit timeouts on every request.

These two problems in turn have two related problems associated with them:
response time and capacity.

#### Response time

Requests that attempt to access a down session storage are all gracefully
handled: `@user` will simply be `nil`, which the code handles. There is still
a major impact on users, however, as every request to the storage has to time
out. This causes the average response time of all pages that access it to go
up by however long your timeout is. The slowdown is proportional to your
worst-case timeout, as well as the number of attempts to hit the resource on
each page. This is the problem Semian solves: by using heuristics to fail
these requests early, it provides a much better user experience during
downtime.

#### Capacity loss

When your single-threaded worker is waiting for a resource to return, it's
effectively doing nothing when it could be serving fast requests. To use the
example from before, perhaps some actions do not access the session storage at
all. These requests will pile up behind the now slow requests that are trying
to access that layer, because they're failing slowly. Essentially, your
capacity degrades significantly because your average response time goes up (as
explained in the previous section). Capacity loss simply follows from an
increase in response time. The higher your timeout and the slower your
resource, the more capacity you lose.

#### Timeouts aren't enough

It should be clear by now that timeouts aren't enough. Consistent timeouts
will increase the average response time, which causes a bad user experience
and ultimately compromises the performance of the entire system. Even if the
timeout is as low as ~250ms (just enough to allow a single TCP retransmit),
there's a large loss of capacity and, for many applications, a 100-300%
increase in average response time. This is the problem Semian solves by
failing fast.

## How does Semian work?

Semian consists of two parts: circuit breaker and bulkheading. To understand
Semian, and especially how to configure it, we must understand these patterns
and their implementation.

### Circuit Breaker

The circuit breaker pattern is based on a simple observation: if we hit a
timeout or any other error for a given service one or more times, we're likely
to hit it again for some amount of time. Instead of hitting the timeout
repeatedly, we can mark the resource as dead for some amount of time, during
which we raise an exception instantly on any call to it. This is called the
[circuit breaker pattern][cbp].

![](http://cdn.shopify.com/s/files/1/0070/7032/files/image01_grande.png)

When we perform a Remote Procedure Call (RPC), it will first check the
circuit. If the circuit is rejecting requests because of too many failures
reported by the driver, it will throw an exception immediately. Otherwise the
circuit will call the driver. If the driver fails to get data back from the
data store, it will notify the circuit. The circuit will count the error so
that if too many errors have happened recently, it will start rejecting
requests immediately instead of waiting for the driver to time out. The
exception will then be raised back to the original caller. If the driver's
request was successful, it will return the data back to the calling method and
notify the circuit that it made a successful call.

The state of the circuit breaker is local to the worker and is not shared
across all workers on a server.
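
The cycle described above can be sketched as a toy state machine. This is an illustration of the pattern only, not Semian's implementation; `ToyCircuit` and its method names are invented for the sketch, with time passed in explicitly so the behavior is deterministic.

```ruby
# Toy circuit breaker: closed -> open (too many errors) -> half-open
# (after error_timeout elapses) -> closed (enough successes).
class ToyCircuit
  class OpenCircuitError < StandardError; end

  attr_reader :state

  def initialize(error_threshold:, error_timeout:, success_threshold:)
    @error_threshold = error_threshold     # errors within error_timeout to open
    @error_timeout = error_timeout         # seconds before allowing a probe
    @success_threshold = success_threshold # successes in half-open to close
    @errors = []                           # timestamps of recent errors
    @successes = 0
    @state = :closed
  end

  def acquire(now:)
    # After error_timeout has elapsed, let a probe request through.
    if @state == :open && now - @opened_at >= @error_timeout
      @state = :half_open
      @successes = 0
    end
    raise OpenCircuitError if @state == :open

    begin
      result = yield
    rescue
      record_error(now)
      raise
    end
    record_success
    result
  end

  private

  def record_error(now)
    @errors << now
    @errors.shift while now - @errors.first > @error_timeout
    return unless @state == :half_open || @errors.size >= @error_threshold

    @state = :open
    @opened_at = now
    @errors.clear
  end

  def record_success
    return unless @state == :half_open

    @successes += 1
    @state = :closed if @successes >= @success_threshold
  end
end

cb = ToyCircuit.new(error_threshold: 3, error_timeout: 10, success_threshold: 2)
3.times do
  begin
    cb.acquire(now: 0.0) { raise "boom" } # three driver failures...
  rescue RuntimeError
  end
end
cb.state # => :open, further calls fail instantly
cb.acquire(now: 11.0) { :ok } # probe allowed once error_timeout has passed
cb.state # => :half_open
cb.acquire(now: 11.0) { :ok } # second success reaches success_threshold
cb.state # => :closed
```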

#### Circuit Breaker Configuration

There are five configuration parameters for circuit breakers in Semian:

* **error_threshold**. The number of errors a worker encounters within
  `error_threshold_timeout` before opening the circuit, that is, before
  starting to reject requests instantly.
* **error_threshold_timeout**. The amount of time in seconds within which
  `error_threshold` errors must occur to open the circuit. Defaults to
  `error_timeout` seconds if not set.
* **error_timeout**. The amount of time in seconds until trying to query the
  resource again.
* **success_threshold**. The number of successes on the circuit until closing
  it again, that is, until accepting all requests to the circuit.
* **half_open_resource_timeout**. Timeout for the resource in seconds when the
  circuit is half-open (supported for MySQL, Net::HTTP, and Redis).

For more information about configuring these parameters, please read [this post](https://engineering.shopify.com/blogs/engineering/circuit-breaker-misconfigured).

### Bulkheading

For some applications, circuit breakers are not enough. This is best
illustrated with an example. Imagine the timeout for our data store isn't as
low as 200ms, but is actually 10 seconds. For example, you might have a
relational data store where, for some customers, 10s queries are
(unfortunately) legitimate. Reducing the time of worst-case queries requires a
lot of effort. Dropping the query immediately could potentially leave some
customers unable to access certain functionality. High timeouts are especially
critical in a non-threaded environment, where blocking IO means a worker is
effectively doing nothing.

In this case, circuit breakers aren't sufficient. Assuming the circuit is
shared across all processes on a server, it will still take at least 10s
before the circuit opens, when a couple of workers have hit a timeout. In that
time every worker is blocked (see also the "Defense line" section for an
in-depth explanation of the co-operation between circuit breakers and
bulkheads), which means we're at reduced capacity for at least 20s, with the
last 10s of timeouts occurring just before the circuit opens at the 10s mark.
We thought of a number of potential solutions to this problem - stricter
timeouts, grouping timeouts by section of our application, timeouts per
statement - but they all still revolved around timeouts, and those are
extremely hard to get right.

Instead of thinking about timeouts, we took inspiration from Hystrix by
Netflix and the book Release It (the resiliency bible), and we look at our
services as connection pools. On a server with `W` workers, only a certain
number of them are expected to be talking to a single data store at once.
Let's say we've determined from our monitoring that there's a 10% chance a
worker is talking to `mysql_shard_0` at any given point in time under normal
traffic. The probability that five workers are talking to it at the same time
is 0.001%. If we only allow five workers to talk to a resource at any given
point in time, and accept the 0.001% false positive rate, we can fail the
sixth worker attempting to check out a connection instantly. This means that
while the five workers are waiting for a timeout, all the other `W - 5`
workers on the node will instantly fail on checking out the connection and
open their circuits. Our capacity is only degraded by a relatively small
amount.
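
The arithmetic behind the 0.001% figure above, under the simplifying assumption that workers use the shard independently:

```ruby
# 10% chance any given worker is talking to mysql_shard_0, so the chance
# that 5 particular workers all are at once is 0.1**5.
p_worker = 0.1
p_five_at_once = p_worker**5
format("%.3f%%", p_five_at_once * 100) # => "0.001%"
```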

We call this limitation primitive "tickets". In this case, the resource access
is limited to 5 tickets (see [Configuration](#configuration)). The timeout
value specifies the maximum amount of time to block if no ticket is available.

How do we limit the access to a resource for all workers on a server when the
workers do not directly share memory? This is implemented with [SysV
semaphores][sysv] to provide server-wide access control.

#### Bulkhead Configuration

There are two configuration values. It's not easy to choose good values, and
we're still experimenting with ways to figure out optimal ticket numbers.
Generally, something below half the number of workers on the server has worked
well for us for endpoints that are queried frequently.

* **tickets**. Number of workers that can concurrently access a resource.
* **timeout**. Time to wait in seconds to acquire a ticket if there are no
  tickets left. We recommend this to be `0` unless you have very few workers
  running (i.e. less than ~5).

Note that there are system-wide limitations on how many tickets can be
allocated on a system. `cat /proc/sys/kernel/sem` will tell you:

> System-wide limit on the number of semaphore sets. On Linux
> systems before version 3.19, the default value for this limit
> was 128. Since Linux 3.19, the default value is 32,000. On
> Linux, this limit can be read and modified via the fourth
> field of `/proc/sys/kernel/sem`.

#### Bulkhead debugging on Linux

Note: it is often helpful to examine the actual IPC resources on the system.
Semian provides an easy way to get the semaphore key:

```
irb> require 'semian'
irb> puts Semian::Resource.new(:your_resource_name, tickets: 1).key # do this from a dev machine
"0x48af51ea"
```

This key can then be used to easily inspect the semaphore on any host machine:

```
ipcs -si $(ipcs -s | grep 0x48af51ea | awk '{print $2}')
```

Which should output something like:

```
Semaphore Array semid=5570729
uid=8192  gid=8192  cuid=8192  cgid=8192
mode=0660, access_perms=0660
nsems = 4
otime = Thu Mar 30 15:06:16 2017
ctime = Mon Mar 13 20:25:36 2017
semnum     value      ncount     zcount     pid
0          1          0          0          48
1          25         0          0          48
2          25         0          0          27
3          31         0          0          48
```

In the above example, we can see each of the semaphores. Looking at the enum
in `ext/semian/sysv_semaphores.h`, we can see that:

* 0: is the Semian meta lock (mutex) protecting updates to the other semaphores. It's currently free.
* 1: is the number of available tickets. Currently no tickets are in use, because it's the same as 2.
* 2: is the configured (maximum) number of tickets.
* 3: is the number of registered workers (processes) that would be considered if using the quota strategy.

## Defense line

The finished defense line for resource access with circuit breakers and
bulkheads then looks like this:

![](http://cdn.shopify.com/s/files/1/0070/7032/files/image02_grande.png)

The RPC first checks the circuit; if the circuit is open, it raises the
exception straight away, which triggers the fallback (the default fallback is
a 500 response). Otherwise, it tries the bulkhead, which fails instantly if too
many workers are already querying the resource. Finally, the driver queries the
data store. If the data store succeeds, the driver returns the data back to
the RPC. If the data store is slow or fails, this is our last line of defense
against a misbehaving resource. The driver will raise an exception after trying
to connect with a timeout, or after an immediate failure. These driver actions
feed back into the circuit and the bulkhead, which can make future calls fail faster.
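
The order of checks described above can be sketched as follows. All names here
are illustrative stand-ins of ours, not Semian's API:

```ruby
# Illustrative sketch of the defense line's check order; not Semian's API.
CircuitOpenError  = Class.new(StandardError)
ResourceBusyError = Class.new(StandardError)

# 1) Fail fast if the circuit is open.
# 2) Fail fast if no bulkhead ticket is available.
# 3) Only then let the driver query the data store (the block).
def protected_call(circuit_open:, tickets_left:)
  raise CircuitOpenError, "circuit open" if circuit_open
  raise ResourceBusyError, "no tickets" unless tickets_left > 0
  yield
end
```

A caller rescues these errors to trigger its fallback, e.g. rendering a 500.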

A useful way to think about the co-operation between bulkheads and circuit
breakers is to visualize a failure scenario, graphing capacity as a
function of time. If an incident strikes that makes the server unresponsive
with a `20s` timeout on the client, and you only have circuit breakers
enabled, you will lose capacity until all workers have tripped their circuit
breakers. The slope of this line depends on the amount of traffic to the now
unavailable service. If the slope is steep (i.e. high traffic), you'll lose
capacity more quickly. The higher the client driver timeout, the longer you'll
lose capacity for. In the example below, the circuit breakers are configured to
open after 3 failures:

![resiliency- circuit breakers](https://cloud.githubusercontent.com/assets/97400/22405538/53229758-e612-11e6-81b2-824f873c3fb7.png)

If we imagine the same scenario but with _only_ bulkheads, configured to have
tickets for 50% of workers at any given time, we'll see the following
flat-lining scenario:

![resiliency- bulkheads](https://cloud.githubusercontent.com/assets/97400/22405542/6832a372-e612-11e6-88c4-2452b64b3121.png)

Circuit breakers have the nice property of regaining 100% capacity. Bulkheads
have the desirable property of guaranteeing a minimum capacity. If we add
the two graphs together, marrying bulkheads and circuit breakers, we get a
plummy outcome:

![resiliency- circuit breakers bulkheads](https://cloud.githubusercontent.com/assets/97400/22405550/a25749c2-e612-11e6-8bc8-5fe29e212b3b.png)

This means that if the slope or client timeout is sufficiently low, bulkheads
provide little value and are likely not necessary.

## Failing gracefully

Ok, great, we've got a way to fail fast with slow resources. How does that make
my application more resilient?

Failing fast is only half the battle; it's up to you what you do with these
errors. In the [session example](#real-world-example) we handle them gracefully
by signing people out and disabling all session-related functionality until the
data store is back online. However, even not rescuing the exception and simply
sending `HTTP 500` back to the client faster will help with [capacity
loss](#capacity-loss).

### Exceptions inherit from base class

It's important to understand that the exceptions raised by [Semian
Adapters](#adapters) inherit from the base class of the driver itself, meaning
that if you do something like:

```ruby
def posts
  Post.all
rescue Mysql2::Error
  []
end
```

Exceptions raised by Semian's `Mysql2` adapter will also get caught.
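
This inheritance trick can be demonstrated in isolation. `FakeDriver` below is
a stand-in module of ours, not a real driver or Semian constant; the point is
that a rescue on the driver's base error also catches the adapter's subclass:

```ruby
# Stand-in classes to demonstrate the inheritance; not Semian's real constants.
module FakeDriver
  Error = Class.new(StandardError)     # plays the role of Mysql2::Error
  CircuitOpenError = Class.new(Error)  # plays the role of the Semian adapter error
end

def posts
  raise FakeDriver::CircuitOpenError, "circuit open"
rescue FakeDriver::Error # broad driver rescue also catches the Semian error
  []
end
```

If you need to treat Semian's fast failures differently from real driver
errors, rescue the more specific class before the driver's base class.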

### Patterns

We do not recommend mindlessly sprinkling `rescue`s all over the place. What you
should do instead is write decorators around secondary data stores (e.g. sessions)
that provide resiliency for free. For example, if we stored the tags associated
with products in a secondary data store, it could look something like this:

```ruby
# Resilient decorator for storing a Set in Redis.
class RedisSet
  def initialize(key)
    @key = key
  end

  def get
    redis.smembers(@key)
  rescue Redis::BaseConnectionError
    []
  end

  private

  def redis
    @redis ||= Redis.new
  end
end

class Product
  # This will simply return an empty array in the case of a Redis outage.
  def tags
    tags_set.get
  end

  private

  def tags_set
    @tags_set ||= RedisSet.new("product:tags:#{self.id}")
  end
end
```

These decorators can be resiliency tested with [Toxiproxy][toxiproxy]. You can
provide fallbacks around your primary data store as well. In our case, we simply
return `HTTP 500` unless the page is cached, because these pages aren't worth
much without data from their primary data store.

## Monitoring

With [`Semian::Instrumentable`][semian-instrumentable], clients can monitor
Semian internals. For example, to instrument events with
[`statsd-instrument`][statsd-instrument]:

```ruby
# `event` is `success`, `busy`, `circuit_open`, `state_change`, or `lru_hash_gc`.
# `resource` is the `Semian::Resource` object (or an `LRUHash` object for `lru_hash_gc`).
# `scope` is `connection` or `query` (others can be instrumented too from the adapter; it is nil for `lru_hash_gc`).
# `adapter` is the name of the adapter (mysql2, redis, ...) (it is a payload hash for `lru_hash_gc`).
Semian.subscribe do |event, resource, scope, adapter|
  case event
  when :success, :busy, :circuit_open, :state_change
    StatsD.increment("semian.#{event}", tags: {
      resource: resource.name,
      adapter: adapter,
      type: scope,
    })
  else
    StatsD.increment("semian.#{event}")
  end
end
```

# FAQ

**How does Semian work with containers?** Semian uses [SysV semaphores][sysv] to
coordinate access to a resource. The semaphore is only shared within the
[IPC namespace][namespaces]. Unless you are running many workers inside every
container, this leaves the bulkheading pattern effectively useless. We recommend
sharing the IPC namespace between all containers on your host for the best
ticket economy. If you are using Docker, this can be done with the [--ipc
flag](https://docs.docker.com/engine/reference/run/#ipc-settings---ipc).

**Why isn't resource access shared across the entire cluster?** This would imply
a coordination data store. Semian would have to be resilient to failures of this
data store as well, and fall back to other primitives. While it's nice to have
all workers share the same view of the world, this greatly increases the
complexity of the implementation, which is not favourable for resiliency code.

**Why isn't the circuit breaker implemented as a host-wide mechanism?** No good
reason. Patches welcome!

**Why is there no fallback mechanism in Semian?** Read the [Failing
Gracefully](#failing-gracefully) section. In short, exceptions are exactly this.
We did not want to put an extra level of abstraction on top of this. The
first internal implementation did have one, but we later moved away from it.

**Why does it not use normal Ruby semaphores?** To work properly, the access
control needs to be performed across many workers. With MRI that means having
multiple processes, not threads, so we need a primitive outside of the
interpreter. For other Ruby implementations, a driver that uses Ruby semaphores
could be used (and would be accepted as a PR).

**Why are there three semaphores in the semaphore sets for each resource?** This
has to do with being able to resize the number of tickets for a resource online.

**Can I change the number of tickets freely?** Yes. The logic for this isn't
trivial, but it works well.

**What is the performance overhead of Semian?** Extremely minimal in comparison
to going to the network. Don't worry about it unless you're instrumenting
non-IO.

# Developing Semian

Semian requires a Linux environment. We provide a [docker-compose](https://docs.docker.com/compose/) file
that runs MySQL, Redis, Toxiproxy and Ruby in containers. Use
the steps below to work on Semian from a macOS environment.

## Prerequisites

```bash
# install docker-for-desktop
$ brew cask install docker

# install latest docker-compose
$ brew install docker-compose

# install visual-studio-code (optional)
$ brew cask install visual-studio-code

# clone Semian
$ git clone https://github.com/Shopify/semian.git
$ cd semian
```

## Visual Studio Code

- Open Semian in VS Code
- Install the recommended extensions (one-off requirement)
- Click `Reopen in Container` (first boot might take about a minute)

See https://code.visualstudio.com/docs/remote/containers for more details.

If you make any changes to `.devcontainer/`, you'll need to recreate the container:

- Select `Rebuild Container` from the command palette

Running tests:

- `$ bundle exec rake`. Run with `SKIP_FLAKY_TESTS=true` to skip flaky tests (CI runs all tests).

## Everything else

Test Semian in containers:

- `$ docker-compose -f .devcontainer/docker-compose.yml up -d`
- `$ docker exec -it semian bash`

If you make any changes to `.devcontainer/`, you'll need to recreate the containers:

- `$ docker-compose -f .devcontainer/docker-compose.yml up -d --force-recreate`

Run tests in containers:

```shell
$ docker-compose -f ./.devcontainer/docker-compose.yml run --rm test
```

Running tests:

- `$ bundle exec rake`. Run with `SKIP_FLAKY_TESTS=true` to skip flaky tests (CI runs all tests).

### Running tests in batches

* *TEST_WORKERS* - Total number of workers or batches. Determines how many
  batches are run in parallel. *Default: 1*
* *TEST_WORKER_NUM* - Which batch to run. The value is between 1 and
  *TEST_WORKERS*. *Default: 1*

```shell
$ bundle exec rake test:parallel TEST_WORKERS=5 TEST_WORKER_NUM=1
```

[hystrix]: https://github.com/Netflix/Hystrix
[release-it]: https://pragprog.com/titles/mnee2/release-it-second-edition/
[shopify]: http://www.shopify.com/
[mysql-semian-adapter]: lib/semian/mysql2.rb
[redis-semian-adapter]: lib/semian/redis.rb
[semian-adapter]: lib/semian/adapter.rb
[nethttp-semian-adapter]: lib/semian/net_http.rb
[nethttp-default-errors]: lib/semian/net_http.rb#L35-L45
[semian-instrumentable]: lib/semian/instrumentable.rb
[statsd-instrument]: http://github.com/shopify/statsd-instrument
[resiliency-blog-post]: https://engineering.shopify.com/blogs/engineering/building-and-testing-resilient-ruby-on-rails-applications
[toxiproxy]: https://github.com/Shopify/toxiproxy
[sysv]: http://man7.org/linux/man-pages/man7/svipc.7.html
[cbp]: https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
[namespaces]: http://man7.org/linux/man-pages/man7/namespaces.7.html