semian 0.3.0 → 0.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: fc4fd1a356e755e2aa60896da8d30394e545d44a
4
- data.tar.gz: ea85970b77be890bf35c69a9d047525280f60e31
3
+ metadata.gz: 9df1d9d29629650b74dab6ead2cc8d288de28d78
4
+ data.tar.gz: 8752cce5af3858930ee820eab7c2f1c558eeed22
5
5
  SHA512:
6
- metadata.gz: 455a0fb4fef078ea9a7c2a5c5ee63584492726aa32e98d2a062a274d7680d6abb2366120722ffb20aa3069bc2efb282ac0906dc20af5dd656df8a588e5cfde63
7
- data.tar.gz: bfe6ab61e67e6c5f6426bd0bf18b5710c50711f82966136bb8da420a2da2c091191886f57b834f130babff966620e3f717c16d6f65ed3e9d4dbd67f09c93ea74
6
+ metadata.gz: 2070b61bb3ac7080c36eaba3c0508eb81928b32111dd11e6f5d9668b78fada81549f2eac75f3021787ce03fd30b6f3c5f01a4f5f99d077b0b3aa9aeecc5a3e9d
7
+ data.tar.gz: 63f9400b70e1bf039819bb26accf5ecbcea87ab1fb938ff0abaf8e44ccfbbd78e016c3ff48832960cab6fb1c9428785da0f4c2333d15c651b2647bf99d9ae41d
data/.gitignore CHANGED
@@ -1,6 +1,6 @@
1
1
  /.bundle/
2
- /lib/semian/*.so
3
- /lib/semian/*.bundle
2
+ /lib/**/*.so
3
+ /lib/**/*.bundle
4
4
  /tmp/*
5
5
  *.gem
6
6
  /html/
@@ -0,0 +1,113 @@
1
+ AllCops:
2
+ Exclude:
3
+ - Gemfile
4
+ - lib/snippets/**/*
5
+ - vendor/**/*
6
+ - data/**/*
7
+ - db/schema.rb
8
+ - db/migrate/*
9
+ - test/dummy/**/*
10
+ - bin/rails
11
+ - lib/shipit-engine.rb
12
+ - tmp/**/*
13
+
14
+ Style/GuardClause:
15
+ Enabled: false
16
+
17
+ Lint/AssignmentInCondition:
18
+ Enabled: false
19
+
20
+ Lint/HandleExceptions:
21
+ Enabled: false
22
+
23
+ Lint/EndAlignment:
24
+ Enabled: false
25
+
26
+ Style/NumericLiterals:
27
+ Exclude:
28
+ - db/schema.rb
29
+
30
+ Style/SingleSpaceBeforeFirstArg:
31
+ Exclude:
32
+ - db/schema.rb
33
+
34
+ Style/DoubleNegation:
35
+ Enabled: false
36
+
37
+ Metrics/LineLength:
38
+ Max: 135
39
+
40
+ Metrics/MethodLength:
41
+ Max: 40
42
+
43
+ Metrics/ClassLength:
44
+ Max: 500
45
+
46
+ Metrics/AbcSize:
47
+ Max: 50
48
+
49
+ Metrics/CyclomaticComplexity:
50
+ Max: 10
51
+
52
+ Style/Documentation:
53
+ Enabled: false
54
+
55
+ Style/SingleLineBlockParams:
56
+ Enabled: false
57
+
58
+ Style/SignalException:
59
+ Enabled: false
60
+
61
+ Style/RaiseArgs:
62
+ Enabled: false
63
+
64
+ Style/ModuleFunction:
65
+ Enabled: false
66
+
67
+ Style/RedundantReturn:
68
+ AllowMultipleReturnValues: true
69
+
70
+ Style/IndentHash:
71
+ Enabled: false
72
+
73
+ Style/TrailingComma:
74
+ EnforcedStyleForMultiline: comma
75
+
76
+ Style/ClassAndModuleChildren:
77
+ Enabled: false
78
+
79
+ Style/PredicateName:
80
+ Exclude:
81
+ - app/serializers/**/*
82
+
83
+ Style/SpaceInsideHashLiteralBraces:
84
+ EnforcedStyle: no_space
85
+
86
+ Style/StringLiterals:
87
+ Enabled: false
88
+
89
+ Style/PerlBackrefs:
90
+ Enabled: false
91
+
92
+ Style/TrivialAccessors:
93
+ AllowPredicates: true
94
+
95
+ Style/ExtraSpacing:
96
+ AllowForAlignment: false
97
+
98
+ Style/GlobalVars:
99
+ Exclude:
100
+ - 'ext/semian/extconf.rb'
101
+
102
+ Lint/Eval:
103
+ Exclude:
104
+ - 'Rakefile'
105
+
106
+ Metrics/ParameterLists:
107
+ Enabled: false
108
+
109
+ Style/IfUnlessModifier:
110
+ Enabled: false
111
+
112
+ Style/CaseIndentation:
113
+ IndentWhenRelativeTo: end
@@ -0,0 +1,8 @@
1
+ # v0.4.0
2
+
3
+ * net/http: add adapter for net/http #58
4
+ * circuit_breaker: split circuit breaker into three data structures to allow for
5
+ alternative implementations in the future #62
6
+ * mysql: don't prevent rollbacks on transactions #60
7
+ * core: fix initialization bug when the resource is accessed before the options
8
+ are set #65
data/Gemfile CHANGED
@@ -4,3 +4,8 @@ gemspec
4
4
  group :debug do
5
5
  gem 'byebug'
6
6
  end
7
+
8
+ group :development, :test do
9
+ gem 'toxiproxy', github: 'Shopify/toxiproxy-ruby', ref: 'f0c5d0bebca01180e2cfd5234e3d18affefbc670', require: 'toxiproxy'
10
+ gem 'rubocop', '~> 0.34.2'
11
+ end
data/LICENSE.md CHANGED
@@ -1,6 +1,6 @@
1
1
  The MIT License (MIT)
2
2
 
3
- Copyright (c) 2014 Scott Francis <scott.francis@shopify.com>
3
+ Copyright (c) 2014 Shopify
4
4
 
5
5
  Permission is hereby granted, free of charge, to any person obtaining a copy
6
6
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -1,15 +1,47 @@
1
1
  ## Semian [![Build Status](https://travis-ci.org/Shopify/semian.svg?branch=master)](https://travis-ci.org/Shopify/semian)
2
2
 
3
- Semian is a latency and fault tolerance library for protecting your Ruby
4
- applications against misbehaving external services. It allows you to fail fast
5
- so you can handle errors gracefully. The patterns are inspired by
6
- [Hystrix][hystrix] and [Release It][release-it]. Semian is an extraction from
7
- [Shopify][shopify] where it's been running successfully in production since
8
- October, 2014.
3
+ ![](http://i.imgur.com/7Vn2ibF.png)
9
4
 
10
- For an overview of building resilient Ruby application, see [the blog post on
11
- Toxiproxy and Semian][resiliency-blog-post]. We recommend using
12
- [Toxiproxy][toxiproxy] to test for resiliency.
5
+ Semian is a library for controlling access to slow or unresponsive external
6
+ services to avoid cascading failures.
7
+
8
+ When services are down they typically fail fast with errors like `ECONNREFUSED`
9
+ and `ECONNRESET` which can be rescued in code. However, slow resources fail
10
+ slowly. The thread serving the request blocks until it hits the timeout for the
11
+ slow resource. During that time, the thread is doing nothing useful and thus the
12
+ slow resource has caused a cascading failure by occupying workers and therefore
13
+ losing capacity. **Semian is a library for failing fast in these situations,
14
+ allowing you to handle errors gracefully.** Semian does this by intercepting
15
+ resource access through heuristic patterns inspired by [Hystrix][hystrix] and
16
+ [Release It][release-it]:
17
+
18
+ * [**Circuit breaker**](#circuit-breaker). A pattern for limiting the
19
+ amount of requests to a dependency that is having issues.
20
+ * [**Bulkheading**](#bulkheading). Controlling the concurrent access to
21
+ a single resource; access is coordinated server-wide with [SysV
22
+ semaphores][sysv].
23
+
24
+ Resource drivers are monkey-patched to be aware of Semian; these are called
25
+ [Semian Adapters](#adapters). Thus, every time resource access is requested
26
+ Semian is queried for status on the resource first. If Semian, through the
27
+ patterns above, deems the resource to be unavailable it will raise an exception.
28
+ **The ultimate outcome of Semian is always an exception that can then be rescued
29
+ for a graceful fallback**. Instead of waiting for the timeout, Semian raises
30
+ straight away.
31
+
32
+ If you are already rescuing exceptions for failing resources and timeouts,
33
+ Semian is mostly a drop-in library with a little configuration that will make
34
+ your code more resilient to slow resource access. But, [do you even need
35
+ Semian?](#do-i-need-semian)
36
+
37
+ For an overview of building resilient Ruby applications, start by reading [the
38
+ Shopify blog post on Toxiproxy and Semian][resiliency-blog-post]. For more in
39
+ depth information on Semian, see [Understanding Semian](#understanding-semian).
40
+ Semian is an extraction from [Shopify][shopify] where it's been running
41
+ successfully in production since October, 2014.
42
+
43
+ The other component to your Ruby resiliency kit is [Toxiproxy][toxiproxy] to
44
+ write automated resiliency tests.
13
45
 
14
46
  # Usage
15
47
 
@@ -26,10 +58,47 @@ section](#configuration) on how to configure adapters.
26
58
 
27
59
  ## Adapters
28
60
 
29
- The following adapters are in Semian and work against the public gems:
61
+ Semian works by intercepting resource access. Every time access is requested,
62
+ Semian is queried, and it will raise an exception if the resource is unavailable
63
+ according to the circuit breaker or bulkheads. This is done by monkey-patching
64
+ the resource driver. **The exception raised by the driver always inherits from
65
+ the Base exception class of the driver**, meaning you can always simply rescue
66
+ the base class and catch both Semian and driver errors in the same rescue for
67
+ fallbacks.
68
+
69
+ The following adapters are in Semian and tested heavily in production, the
70
+ version is the version of the public gem with the same name:
30
71
 
31
72
  * [`semian/mysql2`][mysql-semian-adapter] (~> 0.3.16)
32
73
  * [`semian/redis`][redis-semian-adapter] (~> 3.2.1)
74
+ * [`semian/net_http`][nethttp-semian-adapter]
75
+
76
+ ### Creating Adapters
77
+
78
+ To create a Semian adapter you must implement the following methods:
79
+
80
+ 1. [`include Semian::Adapter`][semian-adapter]. Use the helpers to wrap the
81
+ resource. This takes care of situations such as monitoring, nested resources,
82
+ unsupported platforms, creating the Semian resource if it doesn't already
83
+ exist and so on.
84
+ 2. `#semian_identifier`. This is responsible for returning a symbol that
85
+ represents every unique resource, for example `redis_master` or
86
+ `mysql_shard_1`. This is usually assembled from a `name` attribute on the
87
+ Semian configuration hash, but could also be `<host>:<port>`.
88
+ 3. `connect`. The name of this method varies. You must override the driver's
89
+ connect method with one that wraps the connect call with
90
+ `Semian::Resource#acquire`. You should do this at the lowest possible level.
91
+ 4. `query`. Same as `connect` but for queries on the resource.
92
+ 5. Define exceptions `ResourceBusyError` and `CircuitOpenError`. These are
93
+ raised when the request was rejected early because the resource is out of
94
+ tickets or because the circuit breaker is open (see [Understanding
95
+ Semian](#understanding-semian)). They should inherit from the base exception
96
+ class from the raw driver. For example `Mysql2::Error` or
97
+ `Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
98
+ easy to `rescue` and handle them gracefully in application code, by
99
+ `rescue`ing the base class.
100
+
101
+ The best resource is looking at the [already implemented adapters](#adapters).
33
102
 
34
103
  ### Configuration
35
104
 
@@ -58,32 +127,370 @@ client = Redis.new(semian: {
58
127
  })
59
128
  ```
60
129
 
61
- ### Creating an adapter
130
+ #### Net::HTTP
131
+ For the `Net::HTTP` specific Semian adapter, since many external libraries may create
132
+ HTTP connections on the user's behalf, the parameters are instead provided
133
+ by calling specific functions in `Semian::NetHTTP`, perhaps in an initialization file.
62
134
 
63
- To create a Semian adapter you must implement the following methods:
135
+ ##### Naming and Options
136
+ To give Semian parameters, assign a `proc` to `Semian::NetHTTP.semian_configuration`
137
+ that takes two parameters, `host` and `port` (for example `127.0.0.1` and `80`, or `github.com` and `80`),
138
+ and returns a `Hash` with keys as follows.
64
139
 
65
- 1. [`include Semian::Adapter`][semian-adapter]. Use the helpers to wrap the
66
- resource. This takes care of situations such as monitoring, nested resources,
67
- unsupported platforms, creating the Semian resource if it doesn't already
68
- exist and so on.
69
- 2. `#semian_identifier`. This is responsible for returning a symbol that
70
- represents every unique resource, for example `redis_master` or
71
- `mysql_shard_1`. This is usually assembled from a `name` attribute on the
72
- Semian configuration hash, but could also be `<host>:<port>`.
73
- 3. `connect`. The name of this method varies. You must override the driver's
74
- connect method with one that wraps the connect call with
75
- `Semian::Resource#acquire`. You should do this at the lowest possible level.
76
- 4. `query`. Same as `connect` but for queries on the resource.
77
- 5. Define exceptions `ResourceBusyError` and `CircuitOpenError`. These are
78
- raised when the request was rejected early because the resource is out of
79
- tickets or because the circuit breaker is open (see [Understanding
80
- Semian](#understanding-semian). They should inherit from the base exception
81
- class from the raw driver. For example `Mysql2::Error` or
82
- `Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
83
- easy to `rescue` and handle them gracefully in application code, by
84
- `rescue`ing the base class.
140
+ ```ruby
141
+ SEMIAN_PARAMETERS = { tickets: 1,
142
+ success_threshold: 1,
143
+ error_threshold: 3,
144
+ error_timeout: 10 }
145
+ Semian::NetHTTP.semian_configuration = proc do |host, port|
146
+ # Let's make it only active for github.com
147
+ if host == "github.com" && port.to_i == 80
148
+ SEMIAN_PARAMETERS.merge(name: "github.com_80")
149
+ else
150
+ nil
151
+ end
152
+ end
85
153
 
86
- The best resource is looking at the [already implemented adapters](#adapters).
154
+ # Called from within API:
155
+ # semian_options = Semian::NetHTTP.semian_configuration("github.com", 80)
156
+ # semian_identifier = "nethttp_#{semian_options[:name]}"
157
+ ```
158
+
159
+ The `name` should be carefully chosen since it identifies the resource being protected.
160
+ The `semian_options` passed apply to that resource. Semian creates the `semian_identifier`
161
+ from the `name` to look up and store changes in the circuit breaker and bulkhead states
162
+ and associate successes, failures, errors with the protected resource.
163
+
164
+ For most purposes, `"#{host}_#{port}"` is a good default `name`. Custom `name` formats
165
+ can be useful for grouping related subdomains as one resource, so that they all
166
+ contribute to the same circuit breaker and bulkhead state and fail together.
167
+
168
+ A return value of `nil` for `semian_configuration` means Semian is disabled for that
169
+ HTTP endpoint. This works well since the result of a failed Hash lookup is `nil` also.
170
+ This behavior lets the adapter default to whitelisting, although the
171
+ behavior can be changed to blacklisting or even be completely disabled by varying
172
+ the use of returning `nil` in the assigned closure.
173
+
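The whitelisting and grouping behaviour described above can be sketched with a plain `Hash` lookup. This is a minimal illustration, not part of Semian: the host names, the `SKETCH_*` constants and the parameter values are all hypothetical. In an application, the proc would be assigned to `Semian::NetHTTP.semian_configuration`.

```ruby
# Hypothetical parameter values, for illustration only.
SKETCH_PARAMETERS = {
  tickets: 1,
  success_threshold: 1,
  error_threshold: 3,
  error_timeout: 10,
}.freeze

# Hypothetical whitelist: each protected host maps to the resource name it
# shares. Both subdomains feed the same circuit breaker and bulkhead.
SKETCH_RESOURCE_NAMES = {
  "api.example.com" => "example_com",
  "cdn.example.com" => "example_com",
}.freeze

# A failed Hash lookup returns nil, which disables Semian for that endpoint.
SKETCH_CONFIGURATION = proc do |host, _port|
  name = SKETCH_RESOURCE_NAMES[host]
  name && SKETCH_PARAMETERS.merge(name: name)
end
```

Calling `SKETCH_CONFIGURATION.call("api.example.com", 80)` returns the options hash with `name: "example_com"`, while any host missing from the whitelist yields `nil`.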
174
+ ##### Additional Exceptions
175
+ Since we envision this particular adapter can be used in combination with many
176
+ external libraries, that can raise additional exceptions, we added functionality to
177
+ expand the Exceptions that can be tracked as part of Semian's circuit breaker.
178
+ This may be necessary for libraries that introduce new exceptions or re-raise them.
179
+ Add exceptions and reset to the [`default`][nethttp-default-errors] list using the following:
180
+
181
+ ```ruby
182
+ # assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
183
+ Semian::NetHTTP.exceptions += [::OpenSSL::SSL::SSLError]
184
+
185
+ Semian::NetHTTP.reset_exceptions
186
+ # assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
187
+ ```
188
+
189
+ # Understanding Semian
190
+
191
+ Semian is a library with heuristics for failing fast. This section will explain
192
+ in depth how Semian works and which situations it's applicable for. First we
193
+ explain the category of problems Semian is meant to solve. Then we dive into how
194
+ Semian works to solve these problems.
195
+
196
+ ## Do I need Semian?
197
+
198
+ Semian is not a trivial library to understand, introduces complexity and thus
199
+ should be introduced with care. Remember, all Semian does is raise exceptions
200
+ based on heuristics. It is paramount that you understand Semian before
201
+ including it in production as you may otherwise be surprised by its behaviour.
202
+
203
+ Applications that benefit from Semian are those working on eliminating SPOFs
204
+ (Single Points of Failure), and specifically are running into a wall regarding
205
+ slow resources. But it is by no means a magic wand that solves all your latency
206
+ problems by being added to your `Gemfile`. This section describes the types of
207
+ problems Semian solves.
208
+
209
+ If your application is multithreaded or evented (e.g. not Resque and Unicorn)
210
+ these problems are not as pressing. You can still get use out of Semian however.
211
+
212
+ ### Real World Example
213
+
214
+ This is better illustrated with a real world example from Shopify. When you are
215
+ browsing a store while signed in, Shopify stores your session in Redis.
216
+ If Redis becomes unavailable, the driver will start throwing exceptions.
217
+ We rescue these exceptions and simply disable all customer sign in functionality
218
+ on the store until Redis is back online.
219
+
220
+ This is great if querying the resource fails instantly, because it means we fail
221
+ in just a single roundtrip of ~1ms. But if the resource is unresponsive or slow,
222
+ this can take as long as our timeout which is easily 200ms. This means every
223
+ request, even if it does rescue the exception, now takes an extra 200ms.
224
+ Because every resource takes that long, our capacity is also significantly
225
+ degraded. These problems are explained in depth in the next two sections.
226
+
227
+ With Semian, the slow resource would fail instantly (after a small amount of
228
+ convergence time) preventing your response time from spiking and not decreasing
229
+ capacity of the cluster.
230
+
231
+ If this sounds familiar to you, Semian is what you need to be resilient to
232
+ latency. You may not need the graceful fallback depending on your application,
233
+ in which case it will just result in an error (e.g. a `HTTP 500`) faster.
234
+
235
+ We will now examine the two problems in detail.
236
+
237
+ #### In-depth analysis of real world example
238
+
239
+ If a single resource is slow, every single request is going to suffer. We saw
240
+ this in the example before. Let's illustrate this more clearly in the following
241
+ Rails example where the user session is stored in Redis:
242
+
243
+ ```ruby
244
+ def index
245
+ @user = fetch_user
246
+ @posts = Post.all
247
+ end
248
+
249
+ private
250
+ def fetch_user
251
+ user = User.find(session[:user_id])
252
+ rescue Redis::CannotConnectError
253
+ nil
254
+ end
255
+ ```
256
+
257
+ Our code is resilient to a failure of the session layer, it doesn't `HTTP 500`
258
+ if the session store is unavailable (this can be tested with
259
+ [Toxiproxy][toxiproxy]). If the `User` and `Post` data store is unavailable, the
260
+ server will send back `HTTP 500`. We accept that, because it's our primary data
261
+ store. This could be prevented with a caching tier or something else out of
262
+ scope.
263
+
264
+ This code has two flaws however:
265
+
266
+ 1. **What happens if the session storage is consistently slow?** I.e. the majority
267
+ of requests take, say, more than half the timeout time (but it should only
268
+ take ~1ms)?
269
+ 2. **What happens if the session storage is unavailable and is not responding at
270
+ all?** I.e. we hit timeouts on every request.
271
+
272
+ These two problems in turn have two related problems associated with them:
273
+ response time and capacity.
274
+
275
+ #### Response time
276
+
277
+ Requests that attempt to access a down session storage are all gracefully handled, the
278
+ `@user` will simply be `nil`, which the code handles. There is still a
279
+ major impact on users however, as every request to the storage has to time
280
+ out. This causes the average response time to all pages that access it to go up by
281
+ however long your timeout is. The added latency is proportional to your worst case timeout,
282
+ as well as the number of attempts to hit it on each page. This is the problem Semian
283
+ solves by using heuristics to fail these requests early which causes a much better
284
+ user experience during downtime.
285
+
286
+ #### Capacity loss
287
+
288
+ When your single-threaded worker is waiting for a resource to return, it's
289
+ effectively doing nothing when it could be serving fast requests. To use the
290
+ example from before, perhaps some actions do not access the session storage at
291
+ all. These requests will pile up behind the now slow requests that are trying to
292
+ access that layer, because they're failing slowly. Essentially, your capacity
293
+ degrades significantly because your average response time goes up (as explained
294
+ in the previous section). Capacity loss simply follows from an increase in
295
+ response time. The higher your timeout and the slower your resource, the more
296
+ capacity you lose.
297
+
298
+ #### Timeouts aren't enough
299
+
300
+ It should be clear by now that timeouts aren't enough. Consistent timeouts will
301
+ increase the average response time, which causes a bad user experience, and
302
+ ultimately compromise the performance of the entire system. Even if the timeout
303
+ is as low as ~250ms (just enough to allow a single TCP retransmit) there's a
304
+ large loss of capacity and for many applications a 100-300% increase in average
305
+ response time. This is the problem Semian solves by failing fast.
306
+
307
+ ## How does Semian work?
308
+
309
+ Semian consists of two parts: circuit breaker and bulkheading. To understand
310
+ Semian, and especially how to configure it, we must understand these patterns
311
+ and their implementation.
312
+
313
+ ### Circuit Breaker
314
+
315
+ The circuit breaker pattern is based on a simple observation - if we hit a
316
+ timeout or any other error for a given service one or more times, we’re likely
317
+ to hit it again for some amount of time. Instead of hitting the timeout
318
+ repeatedly, we can mark the resource as dead for some amount of time during
319
+ which we raise an exception instantly on any call to it. This is called the
320
+ [circuit breaker pattern][cbp].
321
+
322
+ ![](http://cdn.shopify.com/s/files/1/0070/7032/files/image01_grande.png)
323
+
324
+ When we perform a Remote Procedure Call (RPC), it will first check the circuit.
325
+ If the circuit is rejecting requests because of too many failures reported by
326
+ the driver, it will throw an exception immediately. Otherwise the circuit will
327
+ call the driver. If the driver fails to get data back from the data store, it
328
+ will notify the circuit. The circuit will count the error so that if too many
329
+ errors have happened recently, it will start rejecting requests immediately
330
+ instead of waiting for the driver to time out. The exception will then be raised
331
+ back to the original caller. If the driver’s request was successful, it will
332
+ return the data back to the calling method and notify the circuit that it made a
333
+ successful call.
334
+
335
+ The state of the circuit breaker is local to the worker and is not shared across
336
+ all workers on a server.
337
+
338
+ #### Circuit Breaker Configuration
339
+
340
+ There are three configuration parameters for circuit breakers in Semian:
341
+
342
+ * **error_threshold**. The number of errors to encounter for the worker before
343
+ opening the circuit, that is to start rejecting requests instantly.
344
+ * **error_timeout**. The amount of time until trying to query the resource
345
+ again.
346
+ * **success_threshold**. The number of successes on the circuit until closing it
347
+ again, that is to start accepting all requests to the circuit.
348
+
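To make the interplay of the three parameters concrete, here is a minimal in-process sketch of a circuit breaker. It is not Semian's implementation: the class name `SketchCircuitBreaker`, the injectable `clock`, the half-open state and the choice to count consecutive errors are all assumptions made for illustration.

```ruby
# Illustrative circuit breaker; names and details are hypothetical.
class SketchCircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(error_threshold:, error_timeout:, success_threshold:,
                 clock: -> { Time.now.to_f })
    @error_threshold = error_threshold
    @error_timeout = error_timeout
    @success_threshold = success_threshold
    @clock = clock
    @state = :closed
    @errors = 0
    @successes = 0
    @opened_at = nil
  end

  # Runs the block unless the circuit is open and error_timeout has not elapsed.
  def acquire
    if @state == :open
      raise OpenError, "circuit open" if @clock.call - @opened_at < @error_timeout
      # error_timeout has passed: let requests through again, on probation.
      @state = :half_open
      @successes = 0
    end

    begin
      result = yield
    rescue StandardError
      mark_failed
      raise
    end
    mark_success
    result
  end

  private

  def mark_failed
    @errors += 1
    # A failure on probation, or too many consecutive errors, opens the circuit.
    open! if @state == :half_open || @errors >= @error_threshold
  end

  def mark_success
    if @state == :half_open
      @successes += 1
      close! if @successes >= @success_threshold
    else
      @errors = 0
    end
  end

  def open!
    @state = :open
    @opened_at = @clock.call
  end

  def close!
    @state = :closed
    @errors = 0
  end
end
```

With `error_threshold: 3`, the third consecutive failure opens the circuit. Calls are then rejected with `OpenError` until `error_timeout` seconds have passed, after which `success_threshold` successful calls close it again.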
349
+ ### Bulkheading
350
+
351
+ For many applications, circuit breakers are not enough however. This is best
352
+ illustrated with an extreme. Imagine if the timeout for our data store isn't as
353
+ low as 200ms, but actually 10 seconds. For example, you might have a relational data
354
+ store where for some customers, 10s queries are (unfortunately) legitimate.
355
+ Reducing the time of worst case queries requires a lot of effort. Dropping the
356
+ query immediately could potentially leave some customers unable to access certain
357
+ functionality. High timeouts are especially critical in a non-threaded
358
+ environment where blocking IO means a worker is effectively doing nothing.
359
+
360
+ In this case, circuit breakers aren't sufficient. Assuming the circuit is shared
361
+ across all processes on a server, it will still take at least 10s before the
362
+ circuit is open—in that time every worker is blocked. Meaning we are in a
363
+ reduced capacity state for at least 20s, with the last 10s timeouts
364
+ occurring just before the circuit opens at the 10s mark when a couple of
365
+ workers have hit a timeout and the circuit opens. We thought of a number of
366
+ potential solutions to this problem - stricter timeouts, grouping timeouts by
367
+ section of our application, timeouts per statement—but they all still revolved
368
+ around timeouts, and those are extremely hard to get right.
369
+
370
+ Instead of thinking about timeouts, we took inspiration from Hystrix by Netflix
371
+ and the book Release It (the resiliency bible), and look at our services as
372
+ connection pools. On a server with `W` workers, only a certain number of them
373
+ are expected to be talking to a single data store at once. Let's say we've
374
+ determined from our monitoring that there’s a 10% chance they’re talking to
375
+ `mysql_shard_0` at any given point in time under normal traffic. The probability
376
+ that five workers are talking to it at the same time is 0.001%. If we only allow
377
+ five workers to talk to a resource at any given point in time, and accept the
378
+ 0.001% false positive rate—we can fail the sixth worker attempting to check out
379
+ a connection instantly. This means that while the five workers are waiting for a
380
+ timeout, all the other `W-5` workers on the node will instantly be failing on
381
+ checking out the connection and opening their circuits. Our capacity is only
382
+ degraded by a relatively small amount.
383
+
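The 0.001% figure follows from treating each worker as independent: if one worker is talking to the shard with probability 10%, five given workers are all talking to it with probability 0.1 raised to the fifth power. A small sketch of that arithmetic, where the method name is hypothetical and independence between workers is an assumption:

```ruby
# Probability that `workers` independent workers are all busy with the same
# resource, given the probability that a single worker is. Rational keeps the
# arithmetic exact.
def sketch_all_busy_probability(p_single, workers)
  p_single**workers
end

p_five = sketch_all_busy_probability(Rational(1, 10), 5)
puts format("%.3f%%", (p_five * 100).to_f) # prints "0.001%"
```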
384
+ We call this limitation primitive "tickets". In this case, the resource access
385
+ is limited to 5 tickets (see Configuration). The timeout value specifies the
386
+ maximum amount of time to block if no ticket is available.
387
+
388
+ How do we limit the access to a resource for all workers on a server when the
389
+ workers do not directly share memory? This is implemented with [SysV
390
+ semaphores][sysv] to provide server-wide access control.
391
+
392
+ #### Bulkhead Configuration
393
+
394
+ There are two configuration values. It's not easy to choose good values and we're
395
+ still experimenting with ways to figure out optimal ticket numbers. Generally
396
+ something below half the number of workers on the server for endpoints that are
397
+ queried frequently has worked well for us.
398
+
399
+ * **tickets**. Number of workers that can concurrently access a resource.
400
+ * **timeout**. Time to wait to acquire a ticket if there are no tickets left.
401
+ We recommend this to be `0` unless you have very few workers running (i.e.
402
+ less than ~5).
403
+
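The effect of `tickets` with a `timeout` of `0` can be sketched in a single process as follows. This is only an illustration of the semantics: Semian coordinates tickets across processes with SysV semaphores, whereas the hypothetical `SketchBulkhead` class below uses a `Mutex` and a counter.

```ruby
# In-process illustration of ticket semantics; not how Semian implements it.
class SketchBulkhead
  class BusyError < StandardError; end

  def initialize(tickets:)
    @tickets = tickets
    @in_use = 0
    @lock = Mutex.new
  end

  # Behaves like a ticket timeout of 0: never wait, fail immediately.
  def acquire
    @lock.synchronize do
      raise BusyError, "no tickets left" if @in_use >= @tickets
      @in_use += 1
    end

    begin
      yield
    ensure
      # Always hand the ticket back, even if the block raised.
      @lock.synchronize { @in_use -= 1 }
    end
  end
end
```

With `tickets: 2`, a third concurrent caller is rejected with `BusyError` straight away instead of queueing behind the two callers that already hold tickets.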
404
+ ## Defense line
405
+
406
+ The finished defense line for resource access with circuit breakers and
407
+ bulkheads then looks like this:
408
+
409
+ ![](http://cdn.shopify.com/s/files/1/0070/7032/files/image02_grande.png)
410
+
411
+ The RPC first checks the circuit; if the circuit is open it will raise the
412
+ exception straight away which will trigger the fallback (the default fallback is
413
+ a 500 response). Otherwise, it will try Semian which fails instantly if too many
414
+ workers are already querying the resource. Finally the driver will query the
415
+ data store. If the data store succeeds, the driver will return the data back to
416
+ the RPC. If the data store is slow or fails, this is our last line of defense
417
+ against a misbehaving resource. The driver will raise an exception after trying
418
+ to connect with a timeout or after an immediate failure. These driver actions
419
+ will affect the circuit and Semian, which can make future calls fail faster.
420
+
421
+ ## Failing gracefully
422
+
423
+ Ok, great, we've got a way to fail fast with slow resources. How does that make
424
+ my application more resilient?
425
+
426
+ Failing fast is only half the battle. It's up to you what you do with these
427
+ errors, in the [session example](#real-world-example) we handle it gracefully by
428
+ signing people out and disabling all session related functionality till the data
429
+ store is back online. However, not rescuing the exception and simply sending
430
+ `HTTP 500` back to the client faster will help with [capacity
431
+ loss](#capacity-loss).
432
+
433
+ ### Exceptions inherit from base class
434
+
435
+ It's important to understand that the exceptions raised by [Semian
436
+ Adapters](#adapters) inherit from the base class of the driver itself, meaning
437
+ that if you do something like:
438
+
439
+ ```ruby
440
+ def posts
441
+ Post.all
442
+ rescue Mysql2::Error
443
+ []
444
+ end
445
+ ```
446
+
447
+ Exceptions raised by Semian's `MySQL2` adapter will also get caught.
448
+
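The mechanism behind this is ordinary Ruby exception inheritance. The sketch below uses a made-up `FakeDriver` module in place of a real driver; only the error names `CircuitOpenError` and `ResourceBusyError` come from the adapter description, and the helper method is hypothetical.

```ruby
# Hypothetical driver namespace, standing in for e.g. Mysql2 or Redis.
module FakeDriver
  # Base error class of the driver.
  class Error < StandardError; end

  # Errors an adapter would define, inheriting from the driver's base class.
  class CircuitOpenError < Error; end
  class ResourceBusyError < Error; end
end

# Rescuing the base class catches driver errors and adapter errors alike.
def sketch_fetch_with_fallback
  yield
rescue FakeDriver::Error
  []
end
```

`sketch_fetch_with_fallback { raise FakeDriver::CircuitOpenError }` returns the `[]` fallback, exactly as it would for a plain `FakeDriver::Error`.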
449
+ ### Patterns
450
+
451
+ We do not recommend mindlessly sprinkling `rescue`s all over the place. What you
452
+ should do instead is write decorators around secondary data stores (e.g. sessions)
453
+ that provide resiliency for free. For example, if we stored the tags associated
454
+ with products in a secondary data store it could look something like this:
455
+
456
+ ```ruby
457
+ # Resilient decorator for storing a Set in Redis.
458
+ class RedisSet
459
+ def initialize(key)
460
+ @key = key
461
+ end
462
+
463
+ def get
464
+ redis.smembers(@key)
465
+ rescue Redis::BaseConnectionError
466
+ []
467
+ end
468
+
469
+ private
470
+
471
+ def redis
472
+ @redis ||= Redis.new
473
+ end
474
+ end
475
+
476
+ class Product
477
+ # This will simply return an empty array in the case of a Redis outage.
478
+ def tags
479
+ tags_set.get
480
+ end
481
+
482
+ private
483
+
484
+ def tags_set
485
+ @tags_set ||= RedisSet.new("product:tags:#{self.id}")
486
+ end
487
+ end
488
+ ```
489
+
490
+ These decorators can be resiliency tested with [Toxiproxy][toxiproxy]. You can
491
+ provide fallbacks around your primary data store as well. In our case, we simply
492
+ `HTTP 500` in those cases unless it's cached because these pages aren't worth
493
+ much without data from their primary data store.
87
494
 
88
495
  ## Monitoring
89
496
 
@@ -105,17 +512,59 @@ Semian.subscribe do |event, resource, scope, adapter|
105
512
  end
106
513
  ```
107
514
 
108
- # Understanding Semian
515
+ # FAQ
516
+
517
+ **How does Semian work with containers?** Semian uses [SysV semaphores][sysv] to
518
+ coordinate access to a resource. The semaphore is only shared within the
519
+ [IPC][namespaces]. Unless you are running many workers inside every container,
520
+ this leaves the bulkheading pattern effectively useless. We recommend sharing
521
+ the IPC namespace between all containers on your host for the best ticket
522
+ economy. If you are using Docker, this can be done with the [--ipc
523
+ flag](https://docs.docker.com/reference/run/#ipc-settings).
524
+
525
+ **Why isn't resource access shared across the entire cluster?** This implies a
526
+ coordination data store. Semian would have to be resilient to failures of this
527
+ data store as well, and fall back to other primitives. While it's nice to have
528
+ all workers have the same view of the world, this greatly increases the
529
+ complexity of the implementation which is not favourable for resiliency code.
530
+
531
+ **Why isn't the circuit breaker implemented as a host-wide mechanism?** No good
532
+ reason. Patches welcome!
533
+
534
+ **Why is there no fallback mechanism in Semian?** Read the [Failing
535
+ Gracefully](#failing-gracefully) section. In short, exceptions are exactly this.
536
+ We did not want to put an extra level of abstraction on top of this. In the
537
+ first internal implementation this was the case, but we later moved away from
538
+ it.
539
+
540
+ **Why does it not use normal Ruby semaphores?** To work properly the access
541
+ control needs to be performed across many workers. With MRI that means having
542
+ multiple processes, not threads. Thus we need a primitive outside of the
543
+ interpreter. For other Ruby implementations a driver that uses Ruby semaphores
544
+ could be used (and would be accepted as a PR).
545
+
546
+ **Why are there three semaphores in the semaphore sets for each resource?** This
547
+ has to do with being able to resize the number of tickets for a resource online.
548
+
549
+ **Can I change the number of tickets freely?** Yes, the logic for this isn't
550
+ trivial but it works well.
109
551
 
110
- Coming soon!
552
+ **What is the performance overhead of Semian?** Extremely minimal in comparison
553
+ to going to the network. Don't worry about it unless you're instrumenting
554
+ non-IO.
111
555
 
112
556
  [hystrix]: https://github.com/Netflix/Hystrix
113
557
  [release-it]: https://pragprog.com/book/mnee/release-it
114
558
  [shopify]: http://www.shopify.com/
115
- [mysql-semian-adapter]: https://github.com/Shopify/semian/blob/master/lib/semian/mysql2.rb
116
- [redis-semian-adapter]: https://github.com/Shopify/semian/blob/master/lib/semian/redis.rb
117
- [semian-adapter]: https://github.com/Shopify/semian/blob/master/lib/semian/adapter.rb
118
- [semian-instrumentable]: https://github.com/Shopify/semian/blob/master/lib/semian/instrumentable.rb
559
+ [mysql-semian-adapter]: lib/semian/mysql2.rb
560
+ [redis-semian-adapter]: lib/semian/redis.rb
561
+ [semian-adapter]: lib/semian/adapter.rb
562
+ [nethttp-semian-adapter]: lib/semian/net_http.rb
563
+ [nethttp-default-errors]: lib/semian/net_http.rb#L33-L43
564
+ [semian-instrumentable]: lib/semian/instrumentable.rb
119
565
  [statsd-instrument]: http://github.com/shopify/statsd-instrument
120
566
  [resiliency-blog-post]: http://www.shopify.com/technology/16906928-building-and-testing-resilient-ruby-on-rails-applications
121
567
  [toxiproxy]: https://github.com/Shopify/toxiproxy
568
+ [sysv]: http://man7.org/linux/man-pages/man7/svipc.7.html
569
+ [cbp]: https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
570
+ [namespaces]: http://man7.org/linux/man-pages/man7/namespaces.7.html