semian 0.12.0 → 0.13.0

data/README.md ADDED

## Semian ![Build Status](https://github.com/Shopify/semian/actions/workflows/main.yml/badge.svg)

![](http://i.imgur.com/7Vn2ibF.png)

Semian is a library for controlling access to slow or unresponsive external
services to avoid cascading failures.

When services are down they typically fail fast with errors like `ECONNREFUSED`
and `ECONNRESET`, which can be rescued in code. However, slow resources fail
slowly. The thread serving the request blocks until it hits the timeout for the
slow resource. During that time the thread is doing nothing useful, and thus the
slow resource has caused a cascading failure by occupying workers and losing
capacity. **Semian is a library for failing fast in these situations, allowing
you to handle errors gracefully.** Semian does this by intercepting resource
access through heuristic patterns inspired by [Hystrix][hystrix] and
[Release It][release-it]:

* [**Circuit breaker**](#circuit-breaker). A pattern for limiting the number
  of requests to a dependency that is having issues.
* [**Bulkheading**](#bulkheading). Controls concurrent access to a single
  resource; access is coordinated server-wide with [SysV semaphores][sysv].

Resource drivers are monkey-patched to be aware of Semian; these patches are
called [Semian Adapters](#adapters). Every time resource access is requested,
Semian is first queried for the status of the resource. If Semian, through the
patterns above, deems the resource to be unavailable, it will raise an exception.
**The ultimate outcome of Semian is always an exception that can then be rescued
for a graceful fallback.** Instead of waiting for the timeout, Semian raises
straight away.

If you are already rescuing exceptions for failing resources and timeouts,
Semian is mostly a drop-in library with a little configuration that will make
your code more resilient to slow resource access. But, [do you even need
Semian?](#do-i-need-semian)

For an overview of building resilient Ruby applications, start by reading [the
Shopify blog post on Toxiproxy and Semian][resiliency-blog-post]. For more
in-depth information on Semian, see [Understanding Semian](#understanding-semian).
Semian is an extraction from [Shopify][shopify], where it has been running
successfully in production since October 2014.

The other component of your Ruby resiliency kit is [Toxiproxy][toxiproxy], for
writing automated resiliency tests.

# Usage

Install by adding the gem to your `Gemfile` and require the [adapters](#adapters) you need:

```ruby
gem 'semian', require: %w(semian semian/mysql2 semian/redis)
```

We recommend this pattern of requiring adapters directly from the `Gemfile`.
This ensures Semian adapters are loaded as early as possible, and also protects
your application during boot. Please see the [adapter configuration
section](#configuration) for how to configure adapters.

## Adapters

Semian works by intercepting resource access. Every time access is requested,
Semian is queried, and it will raise an exception if the resource is unavailable
according to the circuit breaker or bulkheads. This is done by monkey-patching
the resource driver. **The exceptions raised by Semian always inherit from the
base exception class of the driver**, meaning you can simply rescue the base
class and catch both Semian and driver errors in the same rescue for fallbacks.

The following adapters ship with Semian and are tested heavily in production;
the version listed is the version of the public gem with the same name:

* [`semian/mysql2`][mysql-semian-adapter] (~> 0.3.16)
* [`semian/redis`][redis-semian-adapter] (~> 3.2.1)
* [`semian/net_http`][nethttp-semian-adapter]

### Creating Adapters

To create a Semian adapter you must implement the following methods:

1. [`include Semian::Adapter`][semian-adapter]. Use the helpers to wrap the
   resource. This takes care of situations such as monitoring, nested resources,
   unsupported platforms, creating the Semian resource if it doesn't already
   exist, and so on.
2. `#semian_identifier`. This is responsible for returning a symbol that
   uniquely identifies the resource, for example `redis_master` or
   `mysql_shard_1`. This is usually assembled from a `name` attribute on the
   Semian configuration hash, but could also be `<host>:<port>`.
3. `connect`. The name of this method varies. You must override the driver's
   connect method with one that wraps the connect call with
   `Semian::Resource#acquire`. You should do this at the lowest possible level.
4. `query`. Same as `connect` but for queries on the resource.
5. Define the exceptions `ResourceBusyError` and `CircuitOpenError`. These are
   raised when the request was rejected early because the resource is out of
   tickets or because the circuit breaker is open (see [Understanding
   Semian](#understanding-semian)). They should inherit from the base exception
   class of the raw driver, for example `Mysql2::Error` or
   `Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
   easy to `rescue` and handle them gracefully in application code by
   `rescue`ing the base class.

The best resource is to look at the [already implemented adapters](#adapters).
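
The steps above can be made concrete with a self-contained toy sketch for a hypothetical `FakeDB` driver. Everything here is illustrative: `ToyResource` only mimics the early-rejection behavior of `Semian::Resource#acquire` (the real state lives in SysV semaphores), and a real adapter would `include Semian::Adapter` rather than hand-rolling `acquire`:

```ruby
module FakeDB # hypothetical driver namespace, not a real gem
  class Error < StandardError; end

  class Client
    # Step 5: adapter errors inherit from the driver's base error, so
    # `rescue FakeDB::Error` catches driver and Semian failures alike.
    class ResourceBusyError < Error; end
    class CircuitOpenError < Error; end

    def initialize(name:, resource:)
      @name = name
      @resource = resource # stand-in for the Semian resource
    end

    # Step 2: a symbol that uniquely identifies the resource.
    def semian_identifier
      :"fakedb_#{@name}"
    end

    # Step 3: wrap the lowest-level connect call.
    def connect
      acquire { :connected }
    end

    # Step 4: same treatment for queries.
    def query(sql)
      acquire { "result of #{sql}" }
    end

    private

    # Toy version of Semian::Resource#acquire: reject early when the
    # circuit is open or no ticket is available.
    def acquire
      raise CircuitOpenError, "#{semian_identifier} circuit open" if @resource.circuit_open
      raise ResourceBusyError, "#{semian_identifier} out of tickets" unless @resource.try_ticket

      begin
        yield
      ensure
        @resource.release_ticket
      end
    end
  end

  # Minimal ticket/circuit state, purely for this sketch.
  ToyResource = Struct.new(:tickets, :circuit_open) do
    def try_ticket
      return false if tickets <= 0

      self.tickets -= 1
      true
    end

    def release_ticket
      self.tickets += 1
    end
  end
end

client = FakeDB::Client.new(name: "shard_1", resource: FakeDB::ToyResource.new(1, false))
client.query("SELECT 1") # => "result of SELECT 1"
```

The important property to preserve is the inheritance chain: application code that rescues the driver's base error automatically handles Semian's early rejections too.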

### Configuration

There are some global configuration options that can be set for Semian:

```ruby
# Maximum size of the LRU cache (default: 500)
# Note: Setting this to 0 enables aggressive garbage collection.
Semian.maximum_lru_size = 0

# Minimum time a resource should be resident in the LRU cache (default: 300s)
Semian.minimum_lru_time = 60
```

Note: `minimum_lru_time` is a stronger guarantee than `maximum_lru_size`. That
is, if a resource has been updated more recently than `minimum_lru_time`, it
will not be garbage collected, even if it would cause the LRU cache to grow
larger than `maximum_lru_size`.

When instantiating a resource, it needs to be configured for Semian. This is
done by passing `semian` as an argument when initializing the client. Examples
for the built-in adapters:

```ruby
# MySQL2 client
# In Rails this means having a Semian key in database.yml for each db.
client = Mysql2::Client.new(host: "localhost", username: "root", semian: {
  name: "master",
  tickets: 8, # See the Understanding Semian section on picking these values
  success_threshold: 2,
  error_threshold: 3,
  error_timeout: 10
})

# Redis client
client = Redis.new(semian: {
  name: "inventory",
  tickets: 4,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})
```

#### Thread Safety

Semian's circuit breaker implementation is thread-safe by default as of
`v0.7.0`. If you'd like to disable it for performance reasons, pass
`thread_safety_disabled: true` to the resource options.

Bulkheads should be disabled (pass `bulkhead: false`) in a threaded environment
(e.g. Puma or Sidekiq), but can safely be enabled in non-threaded environments
(e.g. Resque and Unicorn). As described in this document, circuit breakers alone
should be adequate in most environments with reasonably low timeouts.
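
For example, a Redis client in a threaded worker might keep only the circuit breaker (a sketch; the option values are illustrative, not recommendations):

```ruby
# Sketch for a threaded worker (Puma, Sidekiq): keep the circuit breaker,
# disable the SysV-semaphore bulkhead. Values are illustrative.
client = Redis.new(semian: {
  name: "inventory",
  bulkhead: false, # no tickets/quota needed when the bulkhead is off
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})
```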

Internally, Semian uses `SEM_UNDO` for several SysV semaphore operations:

* Acquire
* Worker registration
* Semaphore metadata state lock

The intention behind `SEM_UNDO` is that a semaphore operation is automatically
undone when the process exits. This is true even if the process exits
abnormally (crashes, receives a `SIGKILL`, etc.), because it is handled by the
operating system and not the process itself.

If, however, a thread performs a semop, the `SEM_UNDO` is attributed to its
parent process. This means that the operation *will not* be undone when the
thread exits. This can result in the following unfavorable behavior when using
threads:

* Threads acquire a resource, but are killed and the resource ticket is never
  released. For a process, the ticket would be released by `SEM_UNDO`, but
  since it's a thread there is the potential for ticket starvation. This can
  result in deadlock on the resource.
* Threads register workers on a resource, but are killed and never
  unregister. For a process, the worker count would be automatically
  decremented by `SEM_UNDO`, but for threads the worker count will continue to
  increment, only being undone when the parent process dies. This can cause
  the number of tickets to dramatically exceed the quota.
* If a thread acquires the semaphore metadata lock and dies before releasing
  it, Semian will deadlock on anything attempting to acquire the metadata lock
  until the thread's parent process exits. This can prevent the ticket count
  from being updated.

Moreover, a strategy that utilizes `SEM_UNDO` is not compatible with one that
adjusts semaphore tickets manually. In order to support threads, operations
that currently use `SEM_UNDO` would need to use no semaphore flags, and the
calling process would be responsible for ensuring that threads are
appropriately cleaned up. It is still possible to implement this, but it would
likely require an in-memory semaphore managed by the parent process of the
threads. PRs welcome for this functionality.

#### Quotas

You may now set quotas per worker:

```ruby
client = Redis.new(semian: {
  name: "inventory",
  quota: 0.51,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})
```

With the configuration above, you no longer need to care about the number of
tickets. Instead, the tickets are computed as a proportion of the number of
active workers.

In this case, we'd allow 51% of the workers on a particular host to connect to
this Redis resource. So long as each worker runs in its own process, it will
automatically be registered. The quota will set the bulkhead threshold based
on the number of registered workers, whenever a new worker registers.

This is ideal for environments with non-uniform worker distribution, and it
eliminates the need to manually calculate and adjust ticket counts.

**Note**:

- You must pass **exactly** one of `tickets` or `quota`.
- Tickets available will be the ceiling of the quota ratio to the number of
  workers.
  - So, with one worker, there will always be a minimum of 1 ticket.
- Workers in different processes will automatically unregister when the
  process exits.
- If you have a small number of workers (e.g. exactly 2), it's possible that
  the bulkhead will be too sensitive when using quotas.
- If you use a forking web server (like Unicorn) you should call
  `Semian.unregister_all_resources` before/after forking.
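
The ticket arithmetic in the notes above can be sketched as follows. `tickets_for` is a hypothetical helper written for illustration; Semian performs this computation internally:

```ruby
# Tickets implied by a quota: the ceiling of (quota * registered workers).
def tickets_for(quota:, workers:)
  (quota * workers).ceil
end

tickets_for(quota: 0.51, workers: 1)  # => 1 (a single worker always gets a ticket)
tickets_for(quota: 0.51, workers: 10) # => 6
tickets_for(quota: 0.51, workers: 40) # => 21
```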

#### Net::HTTP

For the `Net::HTTP`-specific Semian adapter, since many external libraries may
create HTTP connections on the user's behalf, the parameters are instead
provided by associating callback functions with `Semian::NetHTTP`, perhaps in
an initialization file.

##### Naming and Options

To give Semian parameters, assign a `proc` to `Semian::NetHTTP.semian_configuration`
that takes two parameters, `host` and `port` (like `127.0.0.1`, `443` or
`github.com`, `80`), and returns a `Hash` of configuration parameters as
follows. The `proc` is used as a callback to initialize the configuration
options, similar to other adapters.

```ruby
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10 }
Semian::NetHTTP.semian_configuration = proc do |host, port|
  # Let's make it only active for github.com
  if host == "github.com" && port == "80"
    SEMIAN_PARAMETERS.merge(name: "github.com_80")
  else
    nil
  end
end

# Called from within API:
# semian_options = Semian::NetHTTP.semian_configuration("github.com", 80)
# semian_identifier = "nethttp_#{semian_options[:name]}"
```

The `name` should be carefully chosen, since it identifies the resource being
protected. The `semian_options` passed apply to that resource. Semian creates
the `semian_identifier` from the `name` to look up and store changes in the
circuit breaker and bulkhead states, and to associate successes, failures, and
errors with the protected resource.

We only require that:
* the `semian_configuration` be **set only once** over the lifetime of the library
* the output of the `proc` be the same over time, that is, the configuration
  produced by each pair of `host`, `port` is **the same each time** the
  callback is invoked.

For most purposes, `"#{host}_#{port}"` is a good default `name`. Custom `name`
formats can be useful for grouping related subdomains as one resource, so that
they all contribute to the same circuit breaker and bulkhead state and fail
together.
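
As a sketch of such a grouping, the following hypothetical configuration folds every subdomain of `example.com` into a single resource; the host-matching rule and names are assumptions for illustration:

```ruby
# Hypothetical grouping: every subdomain of example.com shares one
# circuit breaker and bulkhead, so the subdomains fail together.
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10 }

GROUPED_CONFIGURATION = proc do |host, port|
  if host =~ /(\A|\.)example\.com\z/
    SEMIAN_PARAMETERS.merge(name: "example.com_#{port}")
  else
    nil # Semian stays disabled for every other endpoint
  end
end

# Semian::NetHTTP.semian_configuration = GROUPED_CONFIGURATION
```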

A return value of `nil` from `semian_configuration` means Semian is disabled
for that HTTP endpoint. This works well since the result of a failed `Hash`
lookup is also `nil`. This behavior lets the adapter default to whitelisting,
although it can be changed to blacklisting, or the adapter can be completely
disabled, by varying where the assigned closure returns `nil`.

##### Additional Exceptions

Since this adapter may be used in combination with many external libraries,
which can raise additional exceptions, we added functionality to expand the
exceptions that are tracked as part of Semian's circuit breaker. This may be
necessary for libraries that introduce new exceptions or re-raise them. Add
exceptions, and reset to the [`default`][nethttp-default-errors] list, using
the following:

```ruby
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
Semian::NetHTTP.exceptions += [::OpenSSL::SSL::SSLError]

Semian::NetHTTP.reset_exceptions
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
```

##### Mark Unsuccessful Responses as Failures

Unsuccessful responses (e.g. 5xx responses) do not raise exceptions, and as
such are not marked as failures by default. The `open_circuit_server_errors`
Semian configuration parameter may be set to mark unsuccessful responses as
failures, as seen below:

```ruby
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10,
                      open_circuit_server_errors: true }
```

# Understanding Semian

Semian is a library with heuristics for failing fast. This section explains in
depth how Semian works and which situations it's applicable for. First we
explain the category of problems Semian is meant to solve. Then we dive into
how Semian works to solve these problems.

## Do I need Semian?

Semian is not a trivial library to understand; it introduces complexity and
thus should be introduced with care. Remember, all Semian does is raise
exceptions based on heuristics. It is paramount that you understand Semian
before including it in production, as you may otherwise be surprised by its
behaviour.

Applications that benefit from Semian are those working on eliminating SPOFs
(Single Points of Failure), and specifically those running into a wall
regarding slow resources. But it is by no means a magic wand that solves all
your latency problems just by being added to your `Gemfile`. This section
describes the types of problems Semian solves.

If your application is multithreaded or evented (i.e. not Resque or Unicorn),
these problems are not as pressing. You can still get use out of Semian,
however.
320
+
321
+ ### Real World Example
322
+
323
+ This is better illustrated with a real world example from Shopify. When you are
324
+ browsing a store while signed in, Shopify stores your session in Redis.
325
+ If Redis becomes unavailable, the driver will start throwing exceptions.
326
+ We rescue these exceptions and simply disable all customer sign in functionality
327
+ on the store until Redis is back online.
328
+
329
+ This is great if querying the resource fails instantly, because it means we fail
330
+ in just a single roundtrip of ~1ms. But if the resource is unresponsive or slow,
331
+ this can take as long as our timeout which is easily 200ms. This means every
332
+ request, even if it does rescue the exception, now takes an extra 200ms.
333
+ Because every resource takes that long, our capacity is also significantly
334
+ degraded. These problems are explained in depth in the next two sections.
335
+
336
+ With Semian, the slow resource would fail instantly (after a small amount of
337
+ convergence time) preventing your response time from spiking and not decreasing
338
+ capacity of the cluster.
339
+
340
+ If this sounds familiar to you, Semian is what you need to be resilient to
341
+ latency. You may not need the graceful fallback depending on your application,
342
+ in which case it will just result in an error (e.g. a `HTTP 500`) faster.
343
+
344
+ We will now examine the two problems in detail.

#### In-depth analysis of real world example

If a single resource is slow, every single request is going to suffer. We saw
this in the example before. Let's illustrate this more clearly in the
following Rails example, where the user session is stored in Redis:

```ruby
def index
  @user = fetch_user
  @posts = Post.all
end

private

def fetch_user
  User.find(session[:user_id])
rescue Redis::CannotConnectError
  nil
end
```

Our code is resilient to a failure of the session layer; it doesn't return an
`HTTP 500` if the session store is unavailable (this can be tested with
[Toxiproxy][toxiproxy]). If the `User` and `Post` data store is unavailable,
the server will send back an `HTTP 500`. We accept that, because it's our
primary data store. This could be prevented with a caching tier or something
else out of scope.

This code has two flaws, however:

1. **What happens if the session storage is consistently slow?** I.e. the
   majority of requests take, say, more than half the timeout time (when they
   should only take ~1ms)?
2. **What happens if the session storage is unavailable and is not responding
   at all?** I.e. we hit timeouts on every request.

These two problems in turn have two related problems associated with them:
response time and capacity.

#### Response time

Requests that attempt to access a down session storage are all gracefully
handled: `@user` will simply be `nil`, which the code handles. There is still
a major impact on users, however, as every request to the storage has to time
out. This causes the average response time of all pages that access it to go
up by however long your timeout is. The increase is proportional to your
worst-case timeout, as well as the number of attempts to hit the resource on
each page. This is the problem Semian solves by using heuristics to fail these
requests early, which gives a much better user experience during downtime.

#### Capacity loss

When your single-threaded worker is waiting for a resource to return, it's
effectively doing nothing when it could be serving fast requests. To use the
example from before, perhaps some actions do not access the session storage
at all. These requests will pile up behind the now slow requests that are
trying to access that layer, because those are failing slowly. Essentially,
your capacity degrades significantly because your average response time goes
up (as explained in the previous section). Capacity loss simply follows from
an increase in response time. The higher your timeout and the slower your
resource, the more capacity you lose.

#### Timeouts aren't enough

It should be clear by now that timeouts aren't enough. Consistent timeouts
will increase the average response time, which causes a bad user experience,
and ultimately compromises the performance of the entire system. Even if the
timeout is as low as ~250ms (just enough to allow a single TCP retransmit),
there's a large loss of capacity and, for many applications, a 100-300%
increase in average response time. This is the problem Semian solves by
failing fast.

## How does Semian work?

Semian consists of two parts: circuit breaker and bulkheading. To understand
Semian, and especially how to configure it, we must understand these patterns
and their implementation.

### Circuit Breaker

The circuit breaker pattern is based on a simple observation: if we hit a
timeout or any other error for a given service one or more times, we're likely
to hit it again for some amount of time. Instead of hitting the timeout
repeatedly, we can mark the resource as dead for some amount of time, during
which we raise an exception instantly on any call to it. This is called the
[circuit breaker pattern][cbp].

![](http://cdn.shopify.com/s/files/1/0070/7032/files/image01_grande.png)

When we perform a Remote Procedure Call (RPC), it will first check the
circuit. If the circuit is rejecting requests because of too many failures
reported by the driver, it will raise an exception immediately. Otherwise the
circuit will call the driver. If the driver fails to get data back from the
data store, it will notify the circuit. The circuit will count the error so
that if too many errors have happened recently, it will start rejecting
requests immediately instead of waiting for the driver to time out. The
exception will then be raised back to the original caller. If the driver's
request was successful, it will return the data back to the calling method
and notify the circuit that it made a successful call.

The state of the circuit breaker is local to the worker and is not shared
across all workers on a server.

#### Circuit Breaker Configuration

There are five configuration parameters for circuit breakers in Semian:

* **error_threshold**. The number of errors a worker encounters within
  `error_threshold_timeout` before opening the circuit, that is, before it
  starts rejecting requests instantly.
* **error_threshold_timeout**. The window of time in seconds within which
  `error_threshold` errors must occur to open the circuit. Defaults to
  `error_timeout` seconds if not set.
* **error_timeout**. The amount of time in seconds until trying to query the
  resource again.
* **success_threshold**. The number of successes on the circuit until it is
  closed again, that is, until all requests to the circuit are accepted again.
* **half_open_resource_timeout**. Timeout for the resource in seconds when the
  circuit is half-open (supported for MySQL, Net::HTTP and Redis).

For more information about configuring these parameters, please read [this post](https://engineering.shopify.com/blogs/engineering/circuit-breaker-misconfigured).
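
As an illustration of how the parameters combine (the values below are made up, not recommendations): this circuit opens after 3 errors within 20s, waits 10s before allowing a retry, applies a tighter 1s resource timeout while half-open, and closes again after 2 successes:

```ruby
# Illustrative values only, not recommendations.
SEMIAN_CIRCUIT_OPTIONS = {
  error_threshold: 3,           # 3 errors ...
  error_threshold_timeout: 20,  # ... within 20s open the circuit
  error_timeout: 10,            # wait 10s before querying the resource again
  half_open_resource_timeout: 1, # tighter resource timeout while half-open
  success_threshold: 2,         # 2 successes close the circuit again
}
```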

### Bulkheading

For some applications, circuit breakers are not enough. This is best
illustrated with an example. Imagine the timeout for our data store isn't as
low as 200ms, but is actually 10 seconds. For example, you might have a
relational data store where, for some customers, 10s queries are
(unfortunately) legitimate. Reducing the time of worst-case queries requires a
lot of effort. Dropping the query immediately could potentially leave some
customers unable to access certain functionality. High timeouts are especially
critical in a non-threaded environment, where blocking IO means a worker is
effectively doing nothing.

In this case, circuit breakers aren't sufficient. Assuming the circuit is
shared across all processes on a server, it will still take at least 10s
before the circuit opens. In that time every worker is blocked (see also the
"Defense Line" section for an in-depth explanation of the co-operation between
circuit breakers and bulkheads). This means we're at reduced capacity for at
least 20s, with the final 10s of timeouts hitting requests that started just
before a couple of workers hit their timeouts and the circuit opened at the
10s mark. We thought of a number of potential solutions to this problem:
stricter timeouts, grouping timeouts by section of our application, timeouts
per statement. But they all still revolved around timeouts, and those are
extremely hard to get right.

Instead of thinking about timeouts, we took inspiration from Hystrix by
Netflix and the book Release It (the resiliency bible), and looked at our
services as connection pools. On a server with `W` workers, only a certain
number of them are expected to be talking to a single data store at once.
Let's say we've determined from our monitoring that there's a 10% chance a
worker is talking to `mysql_shard_0` at any given point in time under normal
traffic. The probability that five workers are talking to it at the same time
is 0.001%. If we only allow five workers to talk to that resource at any
given point in time, and accept the 0.001% false positive rate, we can fail
the sixth worker attempting to check out a connection instantly. This means
that while the five workers are waiting for a timeout, all the other `W - 5`
workers on the node will instantly fail when checking out the connection and
open their circuits. Our capacity is only degraded by a relatively small
amount.
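
The 0.001% figure is a back-of-the-envelope computation assuming workers behave independently (a simplification): with each worker having a 10% chance of talking to the resource, the chance that a given set of five workers does so simultaneously is `0.1**5`:

```ruby
# Back-of-the-envelope check of the figures above, assuming independent workers.
p_busy  = 0.10 # chance a given worker is talking to mysql_shard_0
tickets = 5

false_positive = p_busy**tickets
puts false_positive # ≈ 1.0e-05, i.e. 0.001%
```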

We call this limiting primitive "tickets". In this case, resource access is
limited to 5 tickets (see Configuration). The timeout value specifies the
maximum amount of time to block if no ticket is available.

How do we limit access to a resource for all workers on a server when the
workers do not directly share memory? This is implemented with [SysV
semaphores][sysv] to provide server-wide access control.

#### Bulkhead Configuration

There are two configuration values. It's not easy to choose good values, and
we're still experimenting with ways to figure out optimal ticket numbers.
Generally, something below half the number of workers on the server has
worked well for us for endpoints that are queried frequently.

* **tickets**. Number of workers that can concurrently access a resource.
* **timeout**. Time to wait in seconds to acquire a ticket if there are no
  tickets left. We recommend this to be `0` unless you have very few workers
  running (i.e. less than ~5).

Note that there are system-wide limitations on how many tickets can be
allocated on a system. `cat /proc/sys/kernel/sem` will tell you:

> System-wide limit on the number of semaphore sets. On Linux
> systems before version 3.19, the default value for this limit
> was 128. Since Linux 3.19, the default value is 32,000. On
> Linux, this limit can be read and modified via the fourth
> field of `/proc/sys/kernel/sem`.

#### Bulkhead debugging on Linux

Note: It is often helpful to examine the actual IPC resources on the system.
Semian provides an easy way to get the semaphore key:

```
irb> require 'semian'
irb> puts Semian::Resource.new(:your_resource_name, tickets: 1).key # do this from a dev machine
"0x48af51ea"
```

This key can then be used to easily inspect the semaphore on any host machine:

```
ipcs -si $(ipcs -s | grep 0x48af51ea | awk '{print $2}')
```

Which should output something like:

```
Semaphore Array semid=5570729
uid=8192  gid=8192  cuid=8192  cgid=8192
mode=0660, access_perms=0660
nsems = 4
otime = Thu Mar 30 15:06:16 2017
ctime = Mon Mar 13 20:25:36 2017
semnum     value      ncount     zcount     pid
0          1          0          0          48
1          25         0          0          48
2          25         0          0          27
3          31         0          0          48
```

In the above example, we can see each of the semaphores. Looking at the enum
code in `ext/semian/sysv_semaphores.h` we can see that:

* 0: is the Semian meta lock (mutex), protecting updates to the other
  resources. It's currently free.
* 1: is the number of available tickets; currently no tickets are in use,
  because it's the same as 2.
* 2: is the configured (maximum) number of tickets.
* 3: is the number of registered workers (processes) that would be considered
  if using the quota strategy.

## Defense line

The finished defense line for resource access, with circuit breakers and
bulkheads, then looks like this:

![](http://cdn.shopify.com/s/files/1/0070/7032/files/image02_grande.png)

The RPC first checks the circuit; if the circuit is open, it will raise the
exception straight away, which will trigger the fallback (the default fallback
is a 500 response). Otherwise, it will try the bulkhead, which fails instantly
if too many workers are already querying the resource. Finally, the driver
will query the data store. If the data store succeeds, the driver will return
the data back to the RPC. If the data store is slow or fails, this is our last
line of defense against a misbehaving resource. The driver will raise an
exception after trying to connect with a timeout, or after an immediate
failure. These driver actions will affect the circuit and the bulkhead, which
can make future calls fail faster.
585

A useful way to think about the co-operation between bulkheads and circuit
breakers is to graph capacity as a function of time during a failure scenario.
If an incident strikes that makes the server unresponsive, with a `20s`
timeout on the client, and you only have circuit breakers enabled, you will
lose capacity until all workers have tripped their circuit breakers. The slope
of this line depends on the amount of traffic to the now unavailable service.
If the slope is steep (i.e. high traffic), you'll lose capacity more quickly.
The higher the client driver timeout, the longer you'll lose capacity for. In
the example below the circuit breakers are configured to open after 3
failures:

![resiliency- circuit breakers](https://cloud.githubusercontent.com/assets/97400/22405538/53229758-e612-11e6-81b2-824f873c3fb7.png)
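
The "open after 3 failures" behaviour can be sketched with a toy breaker. This
is a deliberate simplification of Semian's real circuit breaker (which also
has error timeouts and a half-open state that this sketch omits); the class
names are illustrative only:

```ruby
# Raised when the toy circuit is open and calls fail fast.
class ToyCircuitOpenError < StandardError; end

# Toy circuit breaker: opens after `error_threshold` failures, after
# which it fails fast without touching the underlying resource.
class ToyBreaker
  def initialize(error_threshold: 3)
    @error_threshold = error_threshold
    @errors = 0
  end

  def open?
    @errors >= @error_threshold
  end

  def acquire
    raise ToyCircuitOpenError, "circuit open" if open?
    begin
      yield
    rescue => e
      @errors += 1  # every driver error counts toward the threshold
      raise e
    end
  end
end

breaker = ToyBreaker.new(error_threshold: 3)
3.times do
  begin
    breaker.acquire { raise "slow resource timed out" }
  rescue RuntimeError
    # each timeout trips the counter
  end
end
breaker.open?  # => true; further calls fail fast with ToyCircuitOpenError
```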

If we imagine the same scenario but with _only_ bulkheads, configured to have
tickets for 50% of workers at any given time, we'll see the following
flat-lining scenario:

![resiliency- bulkheads](https://cloud.githubusercontent.com/assets/97400/22405542/6832a372-e612-11e6-88c4-2452b64b3121.png)
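
The flat line corresponds to a fixed ticket count: only that many workers can
hold the resource at once, so the rest of the fleet keeps serving. Ticket
accounting can be sketched in-process like this (the real implementation uses
SysV semaphores to coordinate across processes, which this sketch ignores;
the names are illustrative):

```ruby
# Raised when no tickets are left and the call fails fast.
class ToyResourceBusyError < StandardError; end

# Toy bulkhead: a fixed pool of tickets; acquiring with none left
# raises instead of queueing behind the slow resource.
class ToyBulkhead
  def initialize(tickets:)
    @tickets = tickets
  end

  def acquire
    raise ToyResourceBusyError, "no tickets left" if @tickets.zero?
    @tickets -= 1
    begin
      yield
    ensure
      @tickets += 1  # always hand the ticket back
    end
  end
end

bulkhead = ToyBulkhead.new(tickets: 2)
bulkhead.acquire do
  bulkhead.acquire do
    # two workers inside; a third acquire here would raise ToyResourceBusyError
  end
end
```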

Circuit breakers have the nice property of regaining 100% capacity. Bulkheads
have the desirable property of guaranteeing a minimum capacity. If we add the
two graphs together, marrying bulkheads and circuit breakers, we have a
plummy outcome:

![resiliency- circuit breakers bulkheads](https://cloud.githubusercontent.com/assets/97400/22405550/a25749c2-e612-11e6-8bc8-5fe29e212b3b.png)

This means that if the slope or client timeout is sufficiently low, bulkheads
will provide little value and are likely not necessary.

## Failing gracefully

Ok, great, we've got a way to fail fast with slow resources, but how does that
make my application more resilient?

Failing fast is only half the battle. It's up to you what you do with these
errors. In the [session example](#real-world-example) we handle them
gracefully by signing people out and disabling all session-related
functionality until the data store is back online. However, even without
rescuing the exception, simply sending `HTTP 500` back to the client faster
will help with [capacity loss](#capacity-loss).

### Exceptions inherit from base class

It's important to understand that the exceptions raised by [Semian
Adapters](#adapters) inherit from the base class of the driver itself, meaning
that if you do something like:

```ruby
def posts
  Post.all
rescue Mysql2::Error
  []
end
```

Exceptions raised by Semian's `Mysql2` adapter will also get caught.
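
The same hierarchy can be demonstrated with plain Ruby, using stub classes in
place of `Mysql2::Error` and the adapter's error (the stub names are ours, not
Semian's):

```ruby
# Stub hierarchy mirroring how an adapter error subclasses the
# driver's base error class.
class StubMysql2Error < StandardError; end
class StubCircuitOpenError < StubMysql2Error; end

def posts
  raise StubCircuitOpenError, "circuit open"  # Semian failing fast
rescue StubMysql2Error
  []  # a rescue written for driver errors also catches Semian's
end

posts  # => []
```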

### Patterns

We do not recommend mindlessly sprinkling `rescue`s all over the place. What
you should do instead is write decorators around secondary data stores (e.g.
sessions) that provide resiliency for free. For example, if we stored the tags
associated with products in a secondary data store, it could look something
like this:

```ruby
# Resilient decorator for storing a Set in Redis.
class RedisSet
  def initialize(key)
    @key = key
  end

  def get
    redis.smembers(@key)
  rescue Redis::BaseConnectionError
    []
  end

  private

  def redis
    @redis ||= Redis.new
  end
end

class Product
  # This will simply return an empty array in the case of a Redis outage.
  def tags
    tags_set.get
  end

  private

  def tags_set
    @tags_set ||= RedisSet.new("product:tags:#{id}")
  end
end
```

These decorators can be resiliency-tested with [Toxiproxy][toxiproxy]. You can
provide fallbacks around your primary data store as well. In our case, we
simply return `HTTP 500` unless the page is cached, because these pages aren't
worth much without data from their primary data store.

## Monitoring

With [`Semian::Instrumentable`][semian-instrumentable] clients can monitor
Semian internals. For example, to instrument events with
[`statsd-instrument`][statsd-instrument]:

```ruby
# `event` is `success`, `busy`, `circuit_open`, `state_change`, or `lru_hash_gc`.
# `resource` is the `Semian::Resource` object (or an `LRUHash` object for `lru_hash_gc`).
# `scope` is `connection` or `query` (others can be instrumented too from the adapter); it is nil for `lru_hash_gc`.
# `adapter` is the name of the adapter (mysql2, redis, ...); it is a payload hash for `lru_hash_gc`.
Semian.subscribe do |event, resource, scope, adapter|
  case event
  when :success, :busy, :circuit_open, :state_change
    StatsD.increment("semian.#{event}", tags: {
      resource: resource.name,
      adapter: adapter,
      type: scope,
    })
  else
    StatsD.increment("semian.#{event}")
  end
end
```

# FAQ

**How does Semian work with containers?** Semian uses [SysV semaphores][sysv]
to coordinate access to a resource. The semaphore is only shared within the
[IPC namespace][namespaces]. Unless you are running many workers inside every
container, this leaves the bulkheading pattern effectively useless. We
recommend sharing the IPC namespace between all containers on your host for
the best ticket economy. If you are using Docker, this can be done with the
[--ipc flag](https://docs.docker.com/engine/reference/run/#ipc-settings---ipc).

**Why isn't resource access shared across the entire cluster?** This implies a
coordination data store. Semian would have to be resilient to failures of this
data store as well, and fall back to other primitives. While it's nice to have
all workers share the same view of the world, it greatly increases the
complexity of the implementation, which is not favourable for resiliency code.

**Why isn't the circuit breaker implemented as a host-wide mechanism?** No good
reason. Patches welcome!

**Why is there no fallback mechanism in Semian?** Read the [Failing
Gracefully](#failing-gracefully) section. In short, exceptions are exactly
this. We did not want to put an extra level of abstraction on top of them. The
first internal implementation had one, but we later moved away from it.

**Why does it not use normal Ruby semaphores?** To work properly, the access
control needs to be performed across many workers. With MRI that means having
multiple processes, not threads, so we need a primitive outside of the
interpreter. For other Ruby implementations, a driver that uses Ruby
semaphores could be used (and would be accepted as a PR).

**Why are there three semaphores in the semaphore set for each resource?**
This has to do with being able to resize the number of tickets for a resource
online.

**Can I change the number of tickets freely?** Yes. The logic for this isn't
trivial, but it works well.

**What is the performance overhead of Semian?** Extremely minimal in
comparison to going to the network. Don't worry about it unless you're
instrumenting non-IO.

# Developing Semian

Semian requires a Linux environment. We provide a
[docker-compose](https://docs.docker.com/compose/) file that runs MySQL,
Redis, Toxiproxy, and Ruby in containers. Use the steps below to work on
Semian from a macOS environment.

## Prerequisites

```bash
# install Docker Desktop
$ brew install --cask docker

# install latest docker-compose
$ brew install docker-compose

# install Visual Studio Code (optional)
$ brew install --cask visual-studio-code

# clone Semian
$ git clone https://github.com/Shopify/semian.git
$ cd semian
```

## Visual Studio Code

- Open Semian in VS Code
- Install the recommended extensions (one-off requirement)
- Click `Reopen in Container` (first boot might take about a minute)

See https://code.visualstudio.com/docs/remote/containers for more details.

If you make any changes to `.devcontainer/`, you'll need to recreate the
containers:

- Select `Rebuild Container` from the command palette

Running tests:

- `$ bundle exec rake` (run with `SKIP_FLAKY_TESTS=true` to skip flaky tests; CI runs all tests)

## Everything else

Test Semian in containers:

- `$ docker-compose -f .devcontainer/docker-compose.yml up -d`
- `$ docker exec -it semian bash`

If you make any changes to `.devcontainer/`, you'll need to recreate the
containers:

- `$ docker-compose -f .devcontainer/docker-compose.yml up -d --force-recreate`

Run tests in containers:

```shell
$ docker-compose -f ./.devcontainer/docker-compose.yml run --rm test
```

### Running tests in batches

* *TEST_WORKERS* - the total number of workers (batches). It determines how
  many batches the test suite is split into for parallel runs. *Default: 1*
* *TEST_WORKER_NUM* - which batch to run. The value is between 1 and
  *TEST_WORKERS*. *Default: 1*

```shell
$ bundle exec rake test:parallel TEST_WORKERS=5 TEST_WORKER_NUM=1
```
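
Conceptually, these two variables split the test files round-robin into
batches, and each worker runs only its own batch. A sketch of that
partitioning (our illustration, not the rake task's actual code):

```ruby
# Round-robin split of test files across `workers` batches; a worker
# keeps only the files selected by its 1-based `worker_num`.
def batch_for(files, workers:, worker_num:)
  files.each_slice(workers).flat_map { |slice| [slice[worker_num - 1]].compact }
end

files = %w(a_test.rb b_test.rb c_test.rb d_test.rb e_test.rb)
batch_for(files, workers: 2, worker_num: 1)  # => ["a_test.rb", "c_test.rb", "e_test.rb"]
batch_for(files, workers: 2, worker_num: 2)  # => ["b_test.rb", "d_test.rb"]
```

Every file lands in exactly one batch, so running all `TEST_WORKERS` batches
covers the whole suite.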

[hystrix]: https://github.com/Netflix/Hystrix
[release-it]: https://pragprog.com/titles/mnee2/release-it-second-edition/
[shopify]: http://www.shopify.com/
[mysql-semian-adapter]: lib/semian/mysql2.rb
[redis-semian-adapter]: lib/semian/redis.rb
[semian-adapter]: lib/semian/adapter.rb
[nethttp-semian-adapter]: lib/semian/net_http.rb
[nethttp-default-errors]: lib/semian/net_http.rb#L35-L45
[semian-instrumentable]: lib/semian/instrumentable.rb
[statsd-instrument]: http://github.com/shopify/statsd-instrument
[resiliency-blog-post]: https://engineering.shopify.com/blogs/engineering/building-and-testing-resilient-ruby-on-rails-applications
[toxiproxy]: https://github.com/Shopify/toxiproxy
[sysv]: http://man7.org/linux/man-pages/man7/svipc.7.html
[cbp]: https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
[namespaces]: http://man7.org/linux/man-pages/man7/namespaces.7.html