semian 0.12.0 → 0.13.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md ADDED

## Semian ![Build Status](https://github.com/Shopify/semian/actions/workflows/main.yml/badge.svg)

![](http://i.imgur.com/7Vn2ibF.png)

Semian is a library for controlling access to slow or unresponsive external
services to avoid cascading failures.

When services are down they typically fail fast with errors like `ECONNREFUSED`
and `ECONNRESET`, which can be rescued in code. However, slow resources fail
slowly. The thread serving the request blocks until it hits the timeout for the
slow resource. During that time the thread is doing nothing useful, and thus the
slow resource has caused a cascading failure by occupying workers and therefore
losing capacity. **Semian is a library for failing fast in these situations,
allowing you to handle errors gracefully.** Semian does this by intercepting
resource access through heuristic patterns inspired by [Hystrix][hystrix] and
[Release It][release-it]:

* [**Circuit breaker**](#circuit-breaker). A pattern for limiting the
  number of requests to a dependency that is having issues.
* [**Bulkheading**](#bulkheading). Controlling concurrent access to
  a single resource; access is coordinated server-wide with [SysV
  semaphores][sysv].

Resource drivers are monkey-patched to be aware of Semian; these patches are
called [Semian Adapters](#adapters). Thus, every time resource access is
requested, Semian is queried for the status of the resource first. If Semian,
through the patterns above, deems the resource to be unavailable, it will raise
an exception.
**The ultimate outcome of Semian is always an exception that can then be rescued
for a graceful fallback**. Instead of waiting for the timeout, Semian raises
straight away.

If you are already rescuing exceptions for failing resources and timeouts,
Semian is mostly a drop-in library with a little configuration that will make
your code more resilient to slow resource access. But, [do you even need
Semian?](#do-i-need-semian)

For an overview of building resilient Ruby applications, start by reading [the
Shopify blog post on Toxiproxy and Semian][resiliency-blog-post]. For more
in-depth information on Semian, see [Understanding Semian](#understanding-semian).
Semian is an extraction from [Shopify][shopify], where it has been running
successfully in production since October 2014.

The other component of your Ruby resiliency kit is [Toxiproxy][toxiproxy], for
writing automated resiliency tests.

# Usage

Install by adding the gem to your `Gemfile` and require the [adapters](#adapters) you need:

```ruby
gem 'semian', require: %w(semian semian/mysql2 semian/redis)
```

We recommend this pattern of requiring adapters directly from the `Gemfile`.
This ensures Semian adapters are loaded as early as possible, and also
protects your application during boot. Please see the [adapter configuration
section](#configuration) on how to configure adapters.

## Adapters

Semian works by intercepting resource access. Every time access is requested,
Semian is queried, and it will raise an exception if the resource is unavailable
according to the circuit breaker or bulkheads. This is done by monkey-patching
the resource driver. **The exception raised by the adapter always inherits from
the base exception class of the driver**, meaning you can simply rescue
the base class and catch both Semian and driver errors in the same rescue for
fallbacks.

The following adapters are part of Semian and tested heavily in production; the
version listed is the version of the public gem with the same name:

* [`semian/mysql2`][mysql-semian-adapter] (~> 0.3.16)
* [`semian/redis`][redis-semian-adapter] (~> 3.2.1)
* [`semian/net_http`][nethttp-semian-adapter]

### Creating Adapters

To create a Semian adapter you must implement the following methods:

1. [`include Semian::Adapter`][semian-adapter]. Use the helpers to wrap the
   resource. This takes care of situations such as monitoring, nested resources,
   unsupported platforms, creating the Semian resource if it doesn't already
   exist, and so on.
2. `#semian_identifier`. This is responsible for returning a symbol that
   uniquely identifies the resource, for example `redis_master` or
   `mysql_shard_1`. This is usually assembled from a `name` attribute on the
   Semian configuration hash, but could also be `<host>:<port>`.
3. `connect`. The name of this method varies per driver. You must override the
   driver's connect method with one that wraps the connect call with
   `Semian::Resource#acquire`. You should do this at the lowest possible level.
4. `query`. Same as `connect`, but for queries on the resource.
5. Define the exceptions `ResourceBusyError` and `CircuitOpenError`. These are
   raised when the request was rejected early because the resource is out of
   tickets or because the circuit breaker is open (see [Understanding
   Semian](#understanding-semian)). They should inherit from the base exception
   class of the raw driver, for example `Mysql2::Error` or
   `Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
   easy to `rescue` and handle them gracefully in application code, by
   `rescue`ing the base class.

The best resource is looking at the [already implemented adapters](#adapters);
a minimal sketch follows.
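
To make that shape concrete, here is a minimal sketch for a hypothetical
`FooClient` driver. Everything `FooClient`-related (its `Error` class, its
`connect`/`query` methods, the `@options` hash) is an assumption for
illustration; the `Semian::Adapter` helpers shown (`acquire_semian_resource`,
`raw_semian_options`) mirror how the bundled adapters are structured, so treat
the real adapters as the authoritative reference.

```ruby
require "semian"
require "semian/adapter"

module FooClient
  # Semian errors must inherit from the driver's own base error class
  # (FooClient::Error is assumed here) so one rescue catches both.
  ResourceBusyError = Class.new(Error)
  CircuitOpenError = Class.new(Error)
end

module Semian
  module FooClient
    include Semian::Adapter

    ResourceBusyError = ::FooClient::ResourceBusyError
    CircuitOpenError = ::FooClient::CircuitOpenError

    # Symbol uniquely identifying the resource, e.g. :foo_client_inventory.
    def semian_identifier
      @semian_identifier ||= :"foo_client_#{raw_semian_options[:name]}"
    end

    # Wrap the driver's lowest-level connect and query entry points.
    def connect(*args)
      acquire_semian_resource(adapter: :foo_client, scope: :connection) { super }
    end

    def query(*args)
      acquire_semian_resource(adapter: :foo_client, scope: :query) { super }
    end

    private

    # The `semian` key from the client's options hash, e.g.
    # FooClient::Client.new(semian: { name: "inventory", tickets: 2 })
    def raw_semian_options
      @options.fetch(:semian)
    end
  end
end

FooClient::Client.prepend(Semian::FooClient)
```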

### Configuration

There are some global configuration options that can be set for Semian:

```ruby
# Maximum size of the LRU cache (default: 500)
# Note: Setting this to 0 enables aggressive garbage collection.
Semian.maximum_lru_size = 0

# Minimum time a resource should be resident in the LRU cache (default: 300s)
Semian.minimum_lru_time = 60
```

Note: `minimum_lru_time` is a stronger guarantee than `maximum_lru_size`. That
is, if a resource has been updated more recently than `minimum_lru_time`, it
will not be garbage collected, even if that would cause the LRU cache to grow
larger than `maximum_lru_size`.

When instantiating a resource, it needs to be configured for Semian. This is
done by passing `semian` as an argument when initializing the client. Examples
for the built-in adapters:

```ruby
# Mysql2 client
# In Rails this means having a Semian key in database.yml for each db.
client = Mysql2::Client.new(host: "localhost", username: "root", semian: {
  name: "master",
  tickets: 8, # See the Understanding Semian section on picking these values
  success_threshold: 2,
  error_threshold: 3,
  error_timeout: 10
})

# Redis client
client = Redis.new(semian: {
  name: "inventory",
  tickets: 4,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})
```

#### Thread Safety

Semian's circuit breaker implementation is thread-safe by default as of
`v0.7.0`. If you'd like to disable thread safety for performance reasons, pass
`thread_safety_disabled: true` to the resource options.

Bulkheads should be disabled (pass `bulkhead: false`) in a threaded environment
(e.g. Puma or Sidekiq), but can safely be enabled in non-threaded environments
(e.g. Resque and Unicorn). As described in this document, circuit breakers alone
should be adequate in most environments with reasonably low timeouts.
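
For example, a sketch of the Redis client from above configured for a threaded
server such as Puma, with the bulkhead turned off:

```ruby
# In a threaded server, rely on the circuit breaker alone; the name and
# thresholds are carried over from the earlier Redis example.
client = Redis.new(semian: {
  name: "inventory",
  bulkhead: false,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})
```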

Internally, Semian uses `SEM_UNDO` for several SysV semaphore operations:

* Acquire
* Worker registration
* Semaphore metadata state lock

The intention behind `SEM_UNDO` is that a semaphore operation is automatically undone when the process exits. This
is true even if the process exits abnormally - crashes, receives a `SIGKILL`, etc. - because the undo is handled by
the operating system and not the process itself.

If, however, a thread performs a semaphore operation, the `SEM_UNDO` is attached to its parent process. This means that the operation
*will not* be undone when the thread exits. This can result in the following unfavorable behavior when using
threads:

* Threads acquire a resource but are killed, and the resource ticket is never released. For a process, the
  ticket would be released by `SEM_UNDO`, but since it's a thread there is the potential for ticket starvation.
  This can result in deadlock on the resource.
* Threads register workers on a resource but are killed and never unregistered. For a process, the worker
  count would be automatically decremented by `SEM_UNDO`, but for threads the worker count will keep growing,
  only being undone when the parent process dies. This can cause the number of tickets to dramatically exceed the quota.
* If a thread acquires the semaphore metadata lock and dies before releasing it, Semian will deadlock on anything
  attempting to acquire the metadata lock until the thread's parent process exits. This can prevent the ticket count
  from being updated.

Moreover, a strategy that utilizes `SEM_UNDO` is not compatible with a strategy that adjusts the semaphore's tickets manually.
In order to support threads, operations that currently use `SEM_UNDO` would need to use no semaphore flag, and the calling process
would be responsible for ensuring that threads are appropriately cleaned up. It is still possible to implement this, but
it would likely require an in-memory semaphore managed by the parent process of the threads. PRs welcome for this functionality.

#### Quotas

You may also set quotas per worker:

```ruby
client = Redis.new(semian: {
  name: "inventory",
  quota: 0.51,
  success_threshold: 2,
  error_threshold: 4,
  error_timeout: 20
})
```

With a quota, you no longer need to care about the number of tickets. Instead,
the tickets are computed as a proportion of the number of active workers.

In this case, we'd allow just over half (51%) of the workers on a particular host to connect to this Redis resource.
As long as the workers are in their own processes, they will automatically be registered. Whenever a new worker
registers, the quota sets the bulkhead threshold based on the number of registered workers.

This is ideal for environments with non-uniform worker distribution, and it eliminates the need to manually
calculate and adjust ticket counts.

**Note**:

- You must pass **exactly** one of `tickets` or `quota`.
- The number of tickets available will be the ceiling of the quota times the number of registered workers,
  as illustrated below.
  - So, with one worker, there will always be a minimum of 1 ticket.
- Workers in different processes will automatically unregister when the process exits.
- If you have a small number of workers (e.g. exactly 2), it's possible that the bulkhead will be too sensitive when using quotas.
- If you use a forking web server (like Unicorn) you should call `Semian.unregister_all_resources` before/after forking.
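
A quick sketch of that ceiling arithmetic (plain Ruby, not Semian API), using
the `quota: 0.51` from the example above:

```ruby
# Tickets are the ceiling of quota * registered workers.
quota = 0.51
[1, 2, 10, 40].each do |workers|
  tickets = (quota * workers).ceil
  puts "#{workers} workers -> #{tickets} tickets"
end
# 1 workers -> 1 tickets
# 2 workers -> 2 tickets  (both workers get a ticket)
# 10 workers -> 6 tickets
# 40 workers -> 21 tickets
```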

#### Net::HTTP
For the `Net::HTTP` specific Semian adapter, since many external libraries may create
HTTP connections on the user's behalf, the parameters are instead provided
by associating a callback function with `Semian::NetHTTP`, perhaps in an initialization file.

##### Naming and Options
To give Semian parameters, assign a `proc` to `Semian::NetHTTP.semian_configuration`
that takes two parameters, `host` and `port` (like `127.0.0.1`, `443` or `github_com`, `80`),
and returns a `Hash` with configuration parameters as follows. The `proc` is used as a
callback to initialize the configuration options, similar to other adapters.

```ruby
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10 }
Semian::NetHTTP.semian_configuration = proc do |host, port|
  # Let's make it only active for github.com
  if host == "github.com" && port == "80"
    SEMIAN_PARAMETERS.merge(name: "github.com_80")
  else
    nil
  end
end

# Called from within the adapter:
# semian_options = Semian::NetHTTP.semian_configuration("github.com", 80)
# semian_identifier = "nethttp_#{semian_options[:name]}"
```

The `name` should be carefully chosen since it identifies the resource being protected.
The `semian_options` passed apply to that resource. Semian creates the `semian_identifier`
from the `name` to look up and store changes in the circuit breaker and bulkhead states,
and to associate successes, failures, and errors with the protected resource.

We only require that:
* the `semian_configuration` be **set only once** over the lifetime of the library
* the output of the `proc` be stable over time, that is, the configuration produced by
  each pair of `host`, `port` is **the same each time** the callback is invoked.

For most purposes, `"#{host}_#{port}"` is a good default `name`. Custom `name` formats
can be useful for grouping related subdomains as one resource, so that they all
contribute to the same circuit breaker and bulkhead state and fail together.
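
For instance, a sketch that groups a hypothetical `example.com` and all of its
subdomains into one resource, reusing the `SEMIAN_PARAMETERS` from above:

```ruby
# api.example.com and cdn.example.com share one circuit and bulkhead,
# so they trip and recover together.
Semian::NetHTTP.semian_configuration = proc do |host, port|
  if host == "example.com" || host.end_with?(".example.com")
    SEMIAN_PARAMETERS.merge(name: "example.com_#{port}")
  end
end
```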

A return value of `nil` from `semian_configuration` means Semian is disabled for that
HTTP endpoint. This works well since the result of a failed `Hash` lookup is also `nil`.
This behavior lets the adapter default to whitelisting, although the
behavior can be changed to blacklisting, or even be completely disabled, by varying
the use of `nil` returns in the assigned closure.

##### Additional Exceptions
Since we envision this adapter being used in combination with many external
libraries, which can raise additional exceptions, we added functionality to
expand the exceptions that are tracked by Semian's circuit breaker.
This may be necessary for libraries that introduce new exceptions or re-raise them.
Add exceptions, or reset to the [`default`][nethttp-default-errors] list, using the following:

```ruby
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
Semian::NetHTTP.exceptions += [::OpenSSL::SSL::SSLError]

Semian::NetHTTP.reset_exceptions
# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
```

##### Mark Unsuccessful Responses as Failures
Unsuccessful responses (e.g. 5xx responses) do not raise exceptions, and as such are not marked as failures by default. The `open_circuit_server_errors` Semian configuration parameter may be set to mark unsuccessful responses as failures, as seen below:

```ruby
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10,
                      open_circuit_server_errors: true }
```

# Understanding Semian

Semian is a library with heuristics for failing fast. This section explains
in depth how Semian works and which situations it's applicable for. First we
explain the category of problems Semian is meant to solve, then we dive into how
Semian works to solve these problems.

## Do I need Semian?

Semian is not a trivial library to understand; it introduces complexity, and thus
should be introduced with care. Remember, all Semian does is raise exceptions
based on heuristics. It is paramount that you understand Semian before
including it in production, as you may otherwise be surprised by its behaviour.

Applications that benefit from Semian are those working on eliminating SPOFs
(Single Points of Failure), and specifically those running into a wall with
slow resources. But it is by no means a magic wand that solves all your latency
problems just by being added to your `Gemfile`. This section describes the types
of problems Semian solves.

If your application is multithreaded or evented (i.e. not Resque and Unicorn),
these problems are not as pressing. You can still get use out of Semian, however.

### Real World Example

This is better illustrated with a real world example from Shopify. When you are
browsing a store while signed in, Shopify stores your session in Redis.
If Redis becomes unavailable, the driver will start throwing exceptions.
We rescue these exceptions and simply disable all customer sign-in functionality
on the store until Redis is back online.

This is great if querying the resource fails instantly, because it means we fail
in just a single roundtrip of ~1ms. But if the resource is unresponsive or slow,
this can take as long as our timeout, which is easily 200ms. This means every
request, even if it does rescue the exception, now takes an extra 200ms.
Because every request takes that long, our capacity is also significantly
degraded. These problems are explained in depth in the next two sections.

With Semian, the slow resource would fail instantly (after a small amount of
convergence time), preventing your response time from spiking and preserving
the capacity of the cluster.

If this sounds familiar to you, Semian is what you need to be resilient to
latency. You may not need the graceful fallback depending on your application,
in which case it will just result in an error (e.g. an `HTTP 500`) faster.

We will now examine the two problems in detail.

#### In-depth analysis of real world example

If a single resource is slow, every single request is going to suffer. We saw
this in the example before. Let's illustrate this more clearly with the following
Rails example where the user session is stored in Redis:

```ruby
def index
  @user = fetch_user
  @posts = Post.all
end

private

def fetch_user
  User.find(session[:user_id])
rescue Redis::CannotConnectError
  nil
end
```

Our code is resilient to a failure of the session layer; it doesn't `HTTP 500`
if the session store is unavailable (this can be tested with
[Toxiproxy][toxiproxy]). If the `User` and `Post` data store is unavailable, the
server will send back `HTTP 500`. We accept that, because it's our primary data
store. This could be prevented with a caching tier or something else out of
scope.

This code has two flaws, however:

1. **What happens if the session storage is consistently slow?** I.e. the majority
   of requests take, say, more than half the timeout time (but it should only
   take ~1ms)?
2. **What happens if the session storage is unavailable and is not responding at
   all?** I.e. we hit timeouts on every request.

These two problems in turn have two related problems associated with them:
response time and capacity.

#### Response time

Requests that attempt to access a down session storage are all gracefully handled: the
`@user` will simply be `nil`, which the code handles. There is still a
major impact on users, however, as every request to the storage has to time
out. This causes the average response time of all pages that access it to go up by
however long your timeout is. The increase is proportional to your worst-case timeout,
as well as the number of attempts to hit the resource on each page. This is the
problem Semian solves by using heuristics to fail these requests early, which
gives a much better user experience during downtime.

#### Capacity loss

When your single-threaded worker is waiting for a resource to return, it's
effectively doing nothing when it could be serving fast requests. To use the
example from before, perhaps some actions do not access the session storage at
all. These requests will pile up behind the now slow requests that are trying to
access that layer, because they're failing slowly. Essentially, your capacity
degrades significantly because your average response time goes up (as explained
in the previous section). Capacity loss simply follows from an increase in
response time. The higher your timeout and the slower your resource, the more
capacity you lose.
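
A back-of-the-envelope sketch of that relationship, with assumed numbers
(10 single-threaded workers, 50ms normal response time, 30% of requests
blocking on a 200ms timeout):

```ruby
workers = 10
normal_ms = 50.0
timeout_ms = 200.0
slow_fraction = 0.3

normal_rps = workers * (1000.0 / normal_ms)        # 200 req/s
avg_ms = (1 - slow_fraction) * normal_ms +
         slow_fraction * (normal_ms + timeout_ms)  # 110ms average
degraded_rps = workers * (1000.0 / avg_ms)         # ~91 req/s

puts "capacity: #{normal_rps.round} -> #{degraded_rps.round} req/s"
```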

#### Timeouts aren't enough

It should be clear by now that timeouts aren't enough. Consistent timeouts will
increase the average response time, which causes a bad user experience and
ultimately compromises the performance of the entire system. Even if the timeout
is as low as ~250ms (just enough to allow a single TCP retransmit) there's a
large loss of capacity, and for many applications a 100-300% increase in average
response time. This is the problem Semian solves by failing fast.

## How does Semian work?

Semian consists of two parts: circuit breaker and bulkheading. To understand
Semian, and especially how to configure it, we must understand these patterns
and their implementation.

### Circuit Breaker

The circuit breaker pattern is based on a simple observation: if we hit a
timeout or any other error for a given service one or more times, we're likely
to hit it again for some amount of time. Instead of hitting the timeout
repeatedly, we can mark the resource as dead for some amount of time during
which we raise an exception instantly on any call to it. This is called the
[circuit breaker pattern][cbp].

![](http://cdn.shopify.com/s/files/1/0070/7032/files/image01_grande.png)

When we perform a Remote Procedure Call (RPC), it will first check the circuit.
If the circuit is rejecting requests because of too many failures reported by
the driver, it will throw an exception immediately. Otherwise the circuit will
call the driver. If the driver fails to get data back from the data store, it
will notify the circuit. The circuit will count the error so that if too many
errors have happened recently, it will start rejecting requests immediately
instead of waiting for the driver to time out. The exception will then be raised
back to the original caller. If the driver's request was successful, it will
return the data back to the calling method and notify the circuit that it made a
successful call.

The state of the circuit breaker is local to the worker and is not shared across
all workers on a server.

#### Circuit Breaker Configuration

There are five configuration parameters for circuit breakers in Semian:

* **error_threshold**. The number of errors a worker encounters within `error_threshold_timeout`
  before opening the circuit, that is, before it starts rejecting requests instantly.
* **error_threshold_timeout**. The window of time in seconds during which `error_threshold` errors must occur to open the circuit. Defaults to `error_timeout` seconds if not set.
* **error_timeout**. The amount of time in seconds until the resource is queried
  again.
* **success_threshold**. The number of successes on the circuit until it closes
  again, that is, until it starts accepting all requests to the circuit.
* **half_open_resource_timeout**. Timeout for the resource in seconds while the circuit is half-open (supported for MySQL, Net::HTTP and Redis).

For more information about configuring these parameters, please read [this post](https://engineering.shopify.com/blogs/engineering/circuit-breaker-misconfigured).
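
Resources can also be registered directly, which is a convenient way to see the
circuit breaker parameters in isolation. A sketch, assuming a resource we choose
to name `:inventory`, with the bulkhead disabled as discussed under Thread Safety:

```ruby
require "semian"

Semian.register(
  :inventory,
  bulkhead: false,             # circuit breaker only
  error_threshold: 3,          # open after 3 errors...
  error_threshold_timeout: 20, # ...within a 20s window
  error_timeout: 10,           # stay open for 10s before retrying
  success_threshold: 2         # close after 2 consecutive successes
)

Semian[:inventory].acquire do
  # protected call here; Semian::OpenCircuitError is raised while open
end
```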

### Bulkheading

For some applications, circuit breakers are not enough. This is best illustrated
with an example. Imagine the timeout for our data store isn't as low as
200ms, but is actually 10 seconds. For example, you might have a relational data
store where, for some customers, 10s queries are (unfortunately) legitimate.
Reducing the time of worst-case queries requires a lot of effort, and dropping the
query immediately could potentially leave some customers unable to access
certain functionality. High timeouts are especially critical in a non-threaded
environment where blocking IO means a worker is effectively doing nothing.

In this case, circuit breakers aren't sufficient. Even assuming the circuit is
shared across all processes on a server, it will still take at least 10s before
the circuit opens. In that time every worker is blocked (see also the "Defense
line" section for an in-depth explanation of the co-operation between circuit
breakers and bulkheads). This means we're at reduced capacity for at least 20s:
the circuit opens at the 10s mark, once a couple of workers have hit their
timeout, and the workers that started a request just before that point spend up
to another 10s timing out. We thought of a number of potential solutions to this
problem - stricter timeouts, grouping timeouts by section of our application,
timeouts per statement - but they all still revolved around timeouts, and those
are extremely hard to get right.

Instead of thinking about timeouts, we took inspiration from Hystrix by Netflix
and the book Release It (the resiliency bible), and looked at our services as
connection pools. On a server with `W` workers, only a certain number of them
are expected to be talking to a single data store at once. Let's say we've
determined from our monitoring that there's a 10% chance a given worker is
talking to `mysql_shard_0` at any given point in time under normal traffic. The
probability that five workers are talking to it at the same time is 0.001%. If
we only allow five workers to talk to a resource at any given point in time, and
accept the 0.001% false positive rate, we can fail the sixth worker attempting
to check out a connection instantly. This means that while the five workers are
waiting for a timeout, all the other `W - 5` workers on the node will instantly
fail on checking out the connection and open their circuits. Our capacity is
only degraded by a relatively small amount.
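
The arithmetic behind the 0.001% figure, assuming each worker independently has
a 10% chance of using the shard at any instant:

```ruby
# Probability that a specific set of five workers are all talking to
# mysql_shard_0 at the same moment, assuming independence.
p_busy = 0.1
p_five = p_busy**5 # => 1.0e-05, i.e. 0.001%
```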

We call this limiting primitive "tickets". In this case, resource access
is limited to 5 tickets (see Configuration). The timeout value specifies the
maximum amount of time to block if no ticket is available.

How do we limit access to a resource for all workers on a server when the
workers do not directly share memory? This is implemented with [SysV
semaphores][sysv] to provide server-wide access control.

#### Bulkhead Configuration

There are two configuration values. It's not easy to choose good values, and we're
still experimenting with ways to figure out optimal ticket numbers. Generally,
something below half the number of workers on the server has worked well for us
for endpoints that are queried frequently.

* **tickets**. Number of workers that can concurrently access a resource.
* **timeout**. Time to wait in seconds to acquire a ticket if there are no tickets left.
  We recommend this to be `0` unless you have very few workers running (i.e.
  less than ~5).
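
A bulkhead-only sketch using the Redis adapter from earlier; treating
`circuit_breaker: false` as the option that disables the circuit breaker so
only tickets apply:

```ruby
# At most 5 workers may hold a connection at once; a sixth fails
# instantly (timeout: 0) instead of queuing for a ticket.
client = Redis.new(semian: {
  name: "inventory",
  tickets: 5,
  timeout: 0,
  circuit_breaker: false
})
```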

Note that there are system-wide limitations on how many tickets can be allocated
on a system. `cat /proc/sys/kernel/sem` will tell you.

> System-wide limit on the number of semaphore sets. On Linux
> systems before version 3.19, the default value for this limit
> was 128. Since Linux 3.19, the default value is 32,000. On
> Linux, this limit can be read and modified via the fourth
> field of `/proc/sys/kernel/sem`.

#### Bulkhead debugging on Linux

Note: it is often helpful to examine the actual IPC resources on the system. Semian
provides an easy way to get the semaphore key:

```
irb> require 'semian'
irb> puts Semian::Resource.new(:your_resource_name, tickets: 1).key # do this from a dev machine
"0x48af51ea"
```

This key can then be used to inspect the semaphore on any host machine:

```
ipcs -si $(ipcs -s | grep 0x48af51ea | awk '{print $2}')
```

Which should output something like:

```
Semaphore Array semid=5570729
uid=8192  gid=8192  cuid=8192  cgid=8192
mode=0660, access_perms=0660
nsems = 4
otime = Thu Mar 30 15:06:16 2017
ctime = Mon Mar 13 20:25:36 2017
semnum     value      ncount     zcount     pid
0          1          0          0          48
1          25         0          0          48
2          25         0          0          27
3          31         0          0          48
```

In the above example, we can see each of the semaphores. Looking at the enum
in `ext/semian/sysv_semaphores.h` we can see that:

* 0: is the Semian meta lock (mutex) protecting updates to the other semaphores. It's currently free.
* 1: is the number of available tickets - currently no tickets are in use because it's the same as 2.
* 2: is the configured (maximum) number of tickets.
* 3: is the number of registered workers (processes), which would be considered if using the quota strategy.

## Defense line

The finished defense line for resource access with circuit breakers and
bulkheads then looks like this:

![](http://cdn.shopify.com/s/files/1/0070/7032/files/image02_grande.png)

The RPC first checks the circuit; if the circuit is open it will raise the
exception straight away, which will trigger the fallback (the default fallback is
a 500 response). Otherwise, it will try Semian, which fails instantly if too many
workers are already querying the resource. Finally the driver will query the
data store. If the data store succeeds, the driver will return the data back to
the RPC. If the data store is slow or fails, this is our last line of defense
against a misbehaving resource. The driver will raise an exception after trying
to connect with a timeout, or after an immediate failure. These driver actions
will affect the circuit and Semian, which can make future calls fail faster.

A useful way to think about the co-operation between bulkheads and circuit
breakers is to visualize a failure scenario, graphing capacity as a
function of time. If an incident strikes that makes the server unresponsive
with a `20s` timeout on the client, and you only have circuit breakers
enabled, you will lose capacity until all workers have tripped their circuit
breakers. The slope of this line will depend on the amount of traffic to the now
unavailable service. If the slope is steep (i.e. high traffic), you'll lose
capacity quicker. The higher the client driver timeout, the longer you'll lose
capacity for. In the example below we have the circuit breakers configured to
open after 3 failures:

![resiliency- circuit breakers](https://cloud.githubusercontent.com/assets/97400/22405538/53229758-e612-11e6-81b2-824f873c3fb7.png)

If we imagine the same scenario but with _only_ bulkheads, configured to have
tickets for 50% of workers at any given time, we'll see the following
flat-lining scenario:

![resiliency- bulkheads](https://cloud.githubusercontent.com/assets/97400/22405542/6832a372-e612-11e6-88c4-2452b64b3121.png)

Circuit breakers have the nice property of regaining 100% capacity. Bulkheads
have the desirable property of guaranteeing a minimum capacity. If we add
the two graphs together, marrying bulkheads and circuit breakers, we get a
plummy outcome:

![resiliency- circuit breakers bulkheads](https://cloud.githubusercontent.com/assets/97400/22405550/a25749c2-e612-11e6-8bc8-5fe29e212b3b.png)

This means that if the slope or client timeout is sufficiently low, bulkheads
will provide little value and are likely not necessary.

## Failing gracefully

Ok, great, we've got a way to fail fast with slow resources, but how does that make
my application more resilient?

Failing fast is only half the battle; it's up to you what you do with these
errors. In the [session example](#real-world-example) we handle it gracefully by
signing people out and disabling all session-related functionality until the data
store is back online. However, even not rescuing the exception, and simply sending
`HTTP 500` back to the client faster, will help with [capacity
loss](#capacity-loss).

### Exceptions inherit from base class

It's important to understand that the exceptions raised by [Semian
Adapters](#adapters) inherit from the base class of the driver itself, meaning
that if you do something like:

```ruby
def posts
  Post.all
rescue Mysql2::Error
  []
end
```

Exceptions raised by Semian's `Mysql2` adapter will also get caught.

### Patterns

We do not recommend mindlessly sprinkling `rescue`s all over the place. What you
should do instead is write decorators around secondary data stores (e.g. sessions)
that provide resiliency for free. For example, if we stored the tags associated
with products in a secondary data store, it could look something like this:

```ruby
# Resilient decorator for storing a Set in Redis.
class RedisSet
  def initialize(key)
    @key = key
  end

  def get
    redis.smembers(@key)
  rescue Redis::BaseConnectionError
    []
  end

  private

  def redis
    @redis ||= Redis.new
  end
end

class Product
  # This will simply return an empty array in the case of a Redis outage.
  def tags
    tags_set.get
  end

  private

  def tags_set
    @tags_set ||= RedisSet.new("product:tags:#{id}")
  end
end
```

These decorators can be resiliency-tested with [Toxiproxy][toxiproxy]. You can
provide fallbacks around your primary data store as well. In our case we simply
`HTTP 500` in those cases unless the page is cached, because these pages aren't
worth much without data from their primary data store.

## Monitoring

With [`Semian::Instrumentable`][semian-instrumentable], clients can monitor
Semian internals. For example, to instrument events with
[`statsd-instrument`][statsd-instrument]:

```ruby
# `event` is `success`, `busy`, `circuit_open`, `state_change`, or `lru_hash_gc`.
# `resource` is the `Semian::Resource` object (or an `LRUHash` object for `lru_hash_gc`).
# `scope` is `connection` or `query` (others can be instrumented too from the adapter; nil for `lru_hash_gc`).
# `adapter` is the name of the adapter (mysql2, redis, ..) (a payload hash for `lru_hash_gc`).
Semian.subscribe do |event, resource, scope, adapter|
  case event
  when :success, :busy, :circuit_open, :state_change
    StatsD.increment("semian.#{event}", tags: {
      resource: resource.name,
      adapter: adapter,
      type: scope,
    })
  else
    StatsD.increment("semian.#{event}")
  end
end
```

# FAQ

**How does Semian work with containers?** Semian uses [SysV semaphores][sysv] to
coordinate access to a resource. The semaphore is only shared within the
[IPC namespace][namespaces]. Unless you are running many workers inside every container,
this leaves the bulkheading pattern effectively useless. We recommend sharing
the IPC namespace between all containers on your host for the best ticket
economy. If you are using Docker, this can be done with the [--ipc
flag](https://docs.docker.com/engine/reference/run/#ipc-settings---ipc).

**Why isn't resource access shared across the entire cluster?** This would imply a
coordination data store. Semian would have to be resilient to failures of this
data store as well, and fall back to other primitives. While it's nice to have
all workers share the same view of the world, this greatly increases the
complexity of the implementation, which is not favourable for resiliency code.

**Why isn't the circuit breaker implemented as a host-wide mechanism?** No good
reason. Patches welcome!

**Why is there no fallback mechanism in Semian?** Read the [Failing
Gracefully](#failing-gracefully) section. In short, exceptions are exactly this.
We did not want to put an extra level of abstraction on top of this. In the
first internal implementation this was the case, but we later moved away from
it.

**Why does it not use normal Ruby semaphores?** To work properly, the access
control needs to be performed across many workers. With MRI that means having
multiple processes, not threads. Thus we need a primitive outside of the
interpreter. For other Ruby implementations a driver that uses Ruby semaphores
could be used (and would be accepted as a PR).

**Why are there four semaphores in the semaphore set for each resource?** This
has to do with being able to resize the number of tickets for a resource online
(see the enum in `ext/semian/sysv_semaphores.h` and the debugging section above).

**Can I change the number of tickets freely?** Yes, the logic for this isn't
trivial, but it works well.

**What is the performance overhead of Semian?** Extremely minimal in comparison
to going to the network. Don't worry about it unless you're instrumenting
non-IO.

# Developing Semian

Semian requires a Linux environment. We provide a [docker-compose](https://docs.docker.com/compose/) file
that runs MySQL, Redis, Toxiproxy and Ruby in containers. Use
the steps below to work on Semian from a macOS environment.

## Prerequisites

```bash
# install docker-for-desktop
$ brew cask install docker

# install latest docker-compose
$ brew install docker-compose

# install visual-studio-code (optional)
$ brew cask install visual-studio-code

# clone Semian
$ git clone https://github.com/Shopify/semian.git
$ cd semian
```

## Visual Studio Code

- Open Semian in VS Code
- Install the recommended extensions (one-off requirement)
- Click `Reopen in Container` (first boot might take about a minute)

See https://code.visualstudio.com/docs/remote/containers for more details.

If you make any changes to `.devcontainer/`, you'll need to recreate the containers:

- Select `Rebuild Container` from the command palette

Running tests:

- `$ bundle exec rake`. Run with `SKIP_FLAKY_TESTS=true` to skip flaky tests (CI runs all tests).

## Everything else

Test Semian in containers:

- `$ docker-compose -f .devcontainer/docker-compose.yml up -d`
- `$ docker exec -it semian bash`

If you make any changes to `.devcontainer/`, you'll need to recreate the containers:

- `$ docker-compose -f .devcontainer/docker-compose.yml up -d --force-recreate`

Run tests in containers:

```shell
$ docker-compose -f ./.devcontainer/docker-compose.yml run --rm test
```

Running tests:

- `$ bundle exec rake`. Run with `SKIP_FLAKY_TESTS=true` to skip flaky tests (CI runs all tests).

### Running tests in batches

* *TEST_WORKERS* - Total number of workers, or batches. Used to specify the total number of batches that will run in parallel. *Default: 1*
* *TEST_WORKER_NUM* - Which batch to run. The value is between 1 and *TEST_WORKERS*. *Default: 1*

```shell
$ bundle exec rake test:parallel TEST_WORKERS=5 TEST_WORKER_NUM=1
```

[hystrix]: https://github.com/Netflix/Hystrix
[release-it]: https://pragprog.com/titles/mnee2/release-it-second-edition/
[shopify]: http://www.shopify.com/
[mysql-semian-adapter]: lib/semian/mysql2.rb
[redis-semian-adapter]: lib/semian/redis.rb
[semian-adapter]: lib/semian/adapter.rb
[nethttp-semian-adapter]: lib/semian/net_http.rb
[nethttp-default-errors]: lib/semian/net_http.rb#L35-L45
[semian-instrumentable]: lib/semian/instrumentable.rb
[statsd-instrument]: http://github.com/shopify/statsd-instrument
[resiliency-blog-post]: https://engineering.shopify.com/blogs/engineering/building-and-testing-resilient-ruby-on-rails-applications
[toxiproxy]: https://github.com/Shopify/toxiproxy
[sysv]: http://man7.org/linux/man-pages/man7/svipc.7.html
[cbp]: https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
[namespaces]: http://man7.org/linux/man-pages/man7/namespaces.7.html