RubyGems - semian - Versions diffs - 0.3.0 → 0.4.0 - Mend

semian 0.3.0 → 0.4.0

Files changed (41)

checksums.yaml +4 -4
data/.gitignore +2 -2
data/.rubocop.yml +113 -0
data/CHANGELOG.md +8 -0
data/Gemfile +5 -0
data/LICENSE.md +1 -1
data/README.md +488 -39
data/Rakefile +15 -8
data/ext/semian/extconf.rb +2 -2
data/lib/semian.rb +16 -6
data/lib/semian/adapter.rb +1 -1
data/lib/semian/circuit_breaker.rb +38 -37
data/lib/semian/mysql2.rb +21 -1
data/lib/semian/net_http.rb +95 -0
data/lib/semian/protected_resource.rb +7 -2
data/lib/semian/resource.rb +1 -1
data/lib/semian/simple_integer.rb +23 -0
data/lib/semian/simple_sliding_window.rb +43 -0
data/lib/semian/simple_state.rb +43 -0
data/lib/semian/unprotected_resource.rb +4 -1
data/lib/semian/version.rb +1 -1
data/repodb.yml +1 -0
data/scripts/install_toxiproxy.sh +3 -3
data/semian.gemspec +4 -3
data/test/circuit_breaker_test.rb +6 -2
data/test/helpers/background_helper.rb +1 -1
data/test/instrumentation_test.rb +1 -1
data/test/mysql2_test.rb +57 -1
data/test/net_http_test.rb +481 -0
data/test/redis_test.rb +3 -3
data/test/resource_test.rb +33 -31
data/test/semian_test.rb +3 -2
data/test/simple_integer_test.rb +49 -0
data/test/simple_sliding_window_test.rb +65 -0
data/test/simple_state_test.rb +45 -0
data/test/test_helper.rb +5 -0
data/test/unprotected_resource_test.rb +1 -1
metadata +30 -27
checksums.yaml.gz.sig +0 -0
data.tar.gz.sig +0 -1
metadata.gz.sig +0 -0

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: fc4fd1a356e755e2aa60896da8d30394e545d44a
-  data.tar.gz: ea85970b77be890bf35c69a9d047525280f60e31
+  metadata.gz: 9df1d9d29629650b74dab6ead2cc8d288de28d78
+  data.tar.gz: 8752cce5af3858930ee820eab7c2f1c558eeed22
 SHA512:
-  metadata.gz: 455a0fb4fef078ea9a7c2a5c5ee63584492726aa32e98d2a062a274d7680d6abb2366120722ffb20aa3069bc2efb282ac0906dc20af5dd656df8a588e5cfde63
-  data.tar.gz: bfe6ab61e67e6c5f6426bd0bf18b5710c50711f82966136bb8da420a2da2c091191886f57b834f130babff966620e3f717c16d6f65ed3e9d4dbd67f09c93ea74
+  metadata.gz: 2070b61bb3ac7080c36eaba3c0508eb81928b32111dd11e6f5d9668b78fada81549f2eac75f3021787ce03fd30b6f3c5f01a4f5f99d077b0b3aa9aeecc5a3e9d
+  data.tar.gz: 63f9400b70e1bf039819bb26accf5ecbcea87ab1fb938ff0abaf8e44ccfbbd78e016c3ff48832960cab6fb1c9428785da0f4c2333d15c651b2647bf99d9ae41d

data/.gitignore CHANGED

@@ -1,6 +1,6 @@
 /.bundle/
-/lib/semian/*.so
-/lib/semian/*.bundle
+/lib/**/*.so
+/lib/**/*.bundle
 /tmp/*
 *.gem
 /html/

data/.rubocop.yml ADDED

@@ -0,0 +1,113 @@
+AllCops:
+  Exclude:
+    - Gemfile
+    - lib/snippets/**/*
+    - vendor/**/*
+    - data/**/*
+    - db/schema.rb
+    - db/migrate/*
+    - test/dummy/**/*
+    - bin/rails
+    - lib/shipit-engine.rb
+    - tmp/**/*
+Style/GuardClause:
+  Enabled: false
+Lint/AssignmentInCondition:
+  Enabled: false
+Lint/HandleExceptions:
+  Enabled: false
+Lint/EndAlignment:
+  Enabled: false
+Style/NumericLiterals:
+  Exclude:
+    - db/schema.rb
+Style/SingleSpaceBeforeFirstArg:
+  Exclude:
+    - db/schema.rb
+Style/DoubleNegation:
+  Enabled: false
+Metrics/LineLength:
+  Max: 135
+Metrics/MethodLength:
+  Max: 40
+Metrics/ClassLength:
+  Max: 500
+Metrics/AbcSize:
+  Max: 50
+Metrics/CyclomaticComplexity:
+  Max: 10
+Style/Documentation:
+  Enabled: false
+Style/SingleLineBlockParams:
+  Enabled: false
+Style/SignalException:
+  Enabled: false
+Style/RaiseArgs:
+  Enabled: false
+Style/ModuleFunction:
+  Enabled: false
+Style/RedundantReturn:
+  AllowMultipleReturnValues: true
+Style/IndentHash:
+  Enabled: false
+Style/TrailingComma:
+  EnforcedStyleForMultiline: comma
+Style/ClassAndModuleChildren:
+  Enabled: false
+Style/PredicateName:
+  Exclude:
+    - app/serializers/**/*
+Style/SpaceInsideHashLiteralBraces:
+  EnforcedStyle: no_space
+Style/StringLiterals:
+  Enabled: false
+Style/PerlBackrefs:
+  Enabled: false
+Style/TrivialAccessors:
+  AllowPredicates: true
+Style/ExtraSpacing:
+  AllowForAlignment: false
+Style/GlobalVars:
+  Exclude:
+    - 'ext/semian/extconf.rb'
+Lint/Eval:
+  Exclude:
+    - 'Rakefile'
+Metrics/ParameterLists:
+  Enabled: false
+Style/IfUnlessModifier:
+  Enabled: false
+Style/CaseIndentation:
+  IndentWhenRelativeTo: end

data/CHANGELOG.md ADDED

@@ -0,0 +1,8 @@
+# v0.4.0
+* net/http: add adapter for net/http #58
+* circuit_breaker: split circuit breaker into three data structures to allow for
+  alternative implementations in the future #62
+* mysql: don't prevent rollbacks on transactions #60
+* core: fix initialization bug when the resource is accessed before the options
+  are set #65

data/Gemfile CHANGED

@@ -4,3 +4,8 @@ gemspec
 group :debug do
   gem 'byebug'
 end
+group :development, :test do
+  gem 'toxiproxy', github: 'Shopify/toxiproxy-ruby', ref: 'f0c5d0bebca01180e2cfd5234e3d18affefbc670', require: 'toxiproxy'
+  gem 'rubocop', '~> 0.34.2'
+end

data/LICENSE.md CHANGED

@@ -1,6 +1,6 @@
 The MIT License (MIT)
-Copyright (c) 2014 Scott Francis <scott.francis@shopify.com>
+Copyright (c) 2014 Shopify
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

data/README.md CHANGED

@@ -1,15 +1,47 @@
 ## Semian [![Build Status](https://travis-ci.org/Shopify/semian.svg?branch=master)](https://travis-ci.org/Shopify/semian)
-Semian is a latency and fault tolerance library for protecting your Ruby
-applications against misbehaving external services. It allows you to fail fast
-so you can handle errors gracefully. The patterns are inspired by
-[Hystrix][hystrix] and [Release It][release-it]. Semian is an extraction from
-[Shopify][shopify] where it's been running successfully in production since
-October, 2014.
+![](http://i.imgur.com/7Vn2ibF.png)
-For an overview of building resilient Ruby application, see [the blog post on
-Toxiproxy and Semian][resiliency-blog-post]. We recommend using
-[Toxiproxy][toxiproxy] to test for resiliency.
+Semian is a library for controlling access to slow or unresponsive external
+services to avoid cascading failures.
+When services are down they typically fail fast with errors like `ECONNREFUSED`
+and `ECONNRESET` which can be rescued in code. However, slow resources fail
+slowly. The thread serving the request blocks until it hits the timeout for the
+slow resource. During that time, the thread is doing nothing useful and thus the
+slow resource has caused a cascading failure by occupying workers and therefore
+losing capacity. **Semian is a library for failing fast in these situations,
+allowing you to handle errors gracefully.** Semian does this by intercepting
+resource access through heuristic patterns inspired by [Hystrix][hystrix] and
+[Release It][release-it]:
+* [**Circuit breaker**](#circuit-breaker). A pattern for limiting the
+  number of requests to a dependency that is having issues.
+* [**Bulkheading**](#bulkheading). Controlling the concurrent access to
+  a single resource. Access is coordinated server-wide with [SysV
+  semaphores][sysv].
+Resource drivers are monkey-patched to be aware of Semian; these are called
+[Semian Adapters](#adapters). Thus, every time resource access is requested,
+Semian is queried for status on the resource first. If Semian, through the
+patterns above, deems the resource to be unavailable, it will raise an exception.
+**The ultimate outcome of Semian is always an exception that can then be rescued
+for a graceful fallback**. Instead of waiting for the timeout, Semian raises
+straight away.
+If you are already rescuing exceptions for failing resources and timeouts,
+Semian is mostly a drop-in library with a little configuration that will make
+your code more resilient to slow resource access. But, [do you even need
+Semian?](#do-i-need-semian)
+For an overview of building resilient Ruby applications, start by reading [the
+Shopify blog post on Toxiproxy and Semian][resiliency-blog-post]. For more in
+depth information on Semian, see [Understanding Semian](#understanding-semian).
+Semian is an extraction from [Shopify][shopify] where it's been running
+successfully in production since October, 2014.
+The other component to your Ruby resiliency kit is [Toxiproxy][toxiproxy] to
+write automated resiliency tests.
 # Usage
@@ -26,10 +58,47 @@ section](#configuration) on how to configure adapters.
 ## Adapters
-The following adapters are in Semian and work against the public gems:
+Semian works by intercepting resource access. Every time access is requested,
+Semian is queried, and it will raise an exception if the resource is unavailable
+according to the circuit breaker or bulkheads. This is done by monkey-patching
+the resource driver. **The exception raised by the driver always inherits from
+the base exception class of the driver**, meaning you can always simply rescue
+the base class and catch both Semian and driver errors in the same rescue for
+fallbacks.
+The following adapters are in Semian and tested heavily in production; the
+version is the version of the public gem with the same name:
 * [`semian/mysql2`][mysql-semian-adapter] (~> 0.3.16)
 * [`semian/redis`][redis-semian-adapter] (~> 3.2.1)
+* [`semian/net_http`][nethttp-semian-adapter]
+### Creating Adapters
+To create a Semian adapter you must implement the following methods:
+1. [`include Semian::Adapter`][semian-adapter]. Use the helpers to wrap the
+   resource. This takes care of situations such as monitoring, nested resources,
+   unsupported platforms, creating the Semian resource if it doesn't already
+   exist and so on.
+2. `#semian_identifier`. This is responsible for returning a symbol that
+   represents every unique resource, for example `redis_master` or
+   `mysql_shard_1`. This is usually assembled from a `name` attribute on the
+   Semian configuration hash, but could also be `<host>:<port>`.
+3. `connect`. The name of this method varies. You must override the driver's
+   connect method with one that wraps the connect call with
+   `Semian::Resource#acquire`. You should do this at the lowest possible level.
+4. `query`. Same as `connect` but for queries on the resource.
+5. Define exceptions `ResourceBusyError` and `CircuitOpenError`. These are
+   raised when the request was rejected early because the resource is out of
+   tickets or because the circuit breaker is open (see [Understanding
+   Semian](#understanding-semian)). They should inherit from the base exception
+   class from the raw driver. For example `Mysql2::Error` or
+   `Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
+   easy to `rescue` and handle them gracefully in application code, by
+   `rescue`ing the base class.
+The best resource is looking at the [already implemented adapters](#adapters).
 ### Configuration
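Step 5 of the adapter checklist in the hunk above is what lets application code rescue a single class for both driver failures and Semian rejections. A minimal sketch of that inheritance follows. `FakeDriver` and `with_fallback` are made-up names for illustration only; they are not a real gem and not part of Semian.

```ruby
# Hypothetical driver namespace, used only to illustrate the inheritance rule.
module FakeDriver
  # Base exception class of the (imaginary) raw driver.
  class Error < StandardError; end

  # An ordinary driver failure.
  class TimeoutError < Error; end

  # Errors an adapter would define. They inherit from the driver's base
  # class, so one `rescue FakeDriver::Error` covers both kinds of failure.
  class ResourceBusyError < Error; end # rejected early: out of tickets
  class CircuitOpenError < Error; end  # rejected early: circuit is open
end

# Runs the block and falls back when the driver or the adapter raises.
def with_fallback
  yield
rescue FakeDriver::Error
  :fallback
end

puts with_fallback { raise FakeDriver::CircuitOpenError }
puts with_fallback { :data }
```

Because the rescue names only the base class, existing fallback code keeps working when an adapter starts raising its own errors.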
@@ -58,32 +127,370 @@ client = Redis.new(semian: {
 })
 ```
-### Creating an adapter
+#### Net::HTTP
+For the `Net::HTTP` specific Semian adapter, since many external libraries may create
+HTTP connections on the user's behalf, the parameters are instead provided
+by calling specific functions in `Semian::NetHTTP`, perhaps in an initialization file.
-To create a Semian adapter you must implement the following methods:
+##### Naming and Options
+To give Semian parameters, assign a `proc` to `Semian::NetHTTP.semian_configuration`
+that takes two parameters, `host` and `port` (like `127.0.0.1` and `80` or `github_com` and `80`),
+and returns a `Hash` with keys as follows.
-1. [`include Semian::Adapter`][semian-adapter]. Use the helpers to wrap the
-   resource. This takes care of situations such as monitoring, nested resources,
-   unsupported platforms, creating the Semian resource if it doesn't already
-   exist and so on.
-2. `#semian_identifier`. This is responsible for returning a symbol that
-   represents every unique resource, for example `redis_master` or
-   `mysql_shard_1`. This is usually assembled from a `name` attribute on the
-   Semian configuration hash, but could also be `<host>:<port>`.
-3. `connect`. The name of this method varies. You must override the driver's
-   connect method with one that wraps the connect call with
-   `Semian::Resource#acquire`. You should do this at the lowest possible level.
-4. `query`. Same as `connect` but for queries on the resource.
-5. Define exceptions `ResourceBusyError` and `CircuitOpenError`. These are
-   raised when the request was rejected early because the resource is out of
-   tickets or because the circuit breaker is open (see [Understanding
-   Semian](#understanding-semian)). They should inherit from the base exception
-   class from the raw driver. For example `Mysql2::Error` or
-   `Redis::BaseConnectionError` for the MySQL and Redis drivers. This makes it
-   easy to `rescue` and handle them gracefully in application code, by
-   `rescue`ing the base class.
+```ruby
+SEMIAN_PARAMETERS = { tickets: 1,
+                      success_threshold: 1,
+                      error_threshold: 3,
+                      error_timeout: 10 }
+Semian::NetHTTP.semian_configuration = proc do |host, port|
+  # Let's make it only active for github.com
+  if host == "github.com" && port.to_i == 80
+    SEMIAN_PARAMETERS.merge(name: "github.com_80")
+  else
+    nil
+  end
+end
-The best resource is looking at the [already implemented adapters](#adapters).
+# Called from within API:
+# semian_options = Semian::NetHTTP.semian_configuration("github.com", 80)
+# semian_identifier = "nethttp_#{semian_options[:name]}"
+```
+The `name` should be carefully chosen since it identifies the resource being protected.
+The `semian_options` passed apply to that resource. Semian creates the `semian_identifier`
+from the `name` to look up and store changes in the circuit breaker and bulkhead states
+and associate successes, failures, errors with the protected resource.
+For most purposes, `"#{host}_#{port}"` is a good default `name`. Custom `name` formats
+can be useful for grouping related subdomains as one resource, so that they all
+contribute to the same circuit breaker and bulkhead state and fail together.
+A return value of `nil` for `semian_configuration` means Semian is disabled for that
+HTTP endpoint. This works well since the result of a failed Hash lookup is `nil` also.
+This behavior lets the adapter default to whitelisting, although the
+behavior can be changed to blacklisting or even be completely disabled by varying
+the use of returning `nil` in the assigned closure.
+##### Additional Exceptions
+Since we envision this particular adapter can be used in combination with many
+external libraries, that can raise additional exceptions, we added functionality to
+expand the Exceptions that can be tracked as part of Semian's circuit breaker.
+This may be necessary for libraries that introduce new exceptions or re-raise them.
+Add exceptions and reset to the [`default`][nethttp-default-errors] list using the following:
+```ruby
+# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
+Semian::NetHTTP.exceptions += [::OpenSSL::SSL::SSLError]
+Semian::NetHTTP.reset_exceptions
+# assert_equal(Semian::NetHTTP.exceptions, Semian::NetHTTP::DEFAULT_ERRORS)
+```
+# Understanding Semian
+Semian is a library with heuristics for failing fast. This section will explain
+in depth how Semian works and which situations it's applicable for. First we
+explain the category of problems Semian is meant to solve. Then we dive into how
+Semian works to solve these problems.
+## Do I need Semian?
+Semian is not a trivial library to understand. It introduces complexity and thus
+should be adopted with care. Remember, all Semian does is raise exceptions
+based on heuristics. It is paramount that you understand Semian before
+including it in production as you may otherwise be surprised by its behaviour.
+Applications that benefit from Semian are those working on eliminating SPOFs
+(Single Points of Failure), and specifically are running into a wall regarding
+slow resources. But it is by no means a magic wand that solves all your latency
+problems by being added to your `Gemfile`. This section describes the types of
+problems Semian solves.
+If your application is multithreaded or evented (e.g. not Resque and Unicorn)
+these problems are not as pressing. You can still get use out of Semian however.
+### Real World Example
+This is better illustrated with a real world example from Shopify. When you are
+browsing a store while signed in, Shopify stores your session in Redis.
+If Redis becomes unavailable, the driver will start throwing exceptions.
+We rescue these exceptions and simply disable all customer sign in functionality
+on the store until Redis is back online.
+This is great if querying the resource fails instantly, because it means we fail
+in just a single roundtrip of ~1ms. But if the resource is unresponsive or slow,
+this can take as long as our timeout which is easily 200ms. This means every
+request, even if it does rescue the exception, now takes an extra 200ms.
+Because every request takes that long, our capacity is also significantly
+degraded. These problems are explained in depth in the next two sections.
+With Semian, the slow resource would fail instantly (after a small amount of
+convergence time), which keeps your response time from spiking and preserves
+the capacity of the cluster.
+If this sounds familiar to you, Semian is what you need to be resilient to
+latency. You may not need the graceful fallback depending on your application,
+in which case it will just result in an error (e.g. a `HTTP 500`) faster.
+We will now examine the two problems in detail.
+#### In-depth analysis of real world example
+If a single resource is slow, every single request is going to suffer. We saw
+this in the example before. Let's illustrate this more clearly in the following
+Rails example where the user session is stored in Redis:
+```ruby
+def index
+  @user = fetch_user
+  @posts = Post.all
+end
+private
+def fetch_user
+  user = User.find(session[:user_id])
+rescue Redis::CannotConnectError
+  nil
+end
+```
+Our code is resilient to a failure of the session layer: it doesn't respond with
+`HTTP 500` if the session store is unavailable (this can be tested with
+[Toxiproxy][toxiproxy]). If the `User` and `Post` data store is unavailable, the
+server will send back `HTTP 500`. We accept that, because it's our primary data
+store. This could be prevented with a caching tier or something else out of
+scope.
+This code has two flaws however:
+1. **What happens if the session storage is consistently slow?** I.e. the majority
+   of requests take, say, more than half the timeout time (but it should only
+   take ~1ms)?
+2. **What happens if the session storage is unavailable and is not responding at
+   all?** I.e. we hit timeouts on every request.
+These two problems in turn have two related problems associated with them:
+response time and capacity.
+#### Response time
+Requests that attempt to access a down session storage are all gracefully handled: the
+`@user` will simply be `nil`, which the code handles. There is still a
+major impact on users however, as every request to the storage has to time
+out. This causes the average response time of all pages that access it to go up by
+however long your timeout is. The added latency is proportional to your worst-case
+timeout, as well as to the number of attempts to hit the storage on each page. This is
+the problem Semian solves by using heuristics to fail these requests early, which
+makes for a much better user experience during downtime.
+#### Capacity loss
+When your single-threaded worker is waiting for a resource to return, it's
+effectively doing nothing when it could be serving fast requests. To use the
+example from before, perhaps some actions do not access the session storage at
+all. These requests will pile up behind the now slow requests that are trying to
+access that layer, because they're failing slowly. Essentially, your capacity
+degrades significantly because your average response time goes up (as explained
+in the previous section). Capacity loss simply follows from an increase in
+response time. The higher your timeout and the slower your resource, the more
+capacity you lose.
+#### Timeouts aren't enough
+It should be clear by now that timeouts aren't enough. Consistent timeouts will
+increase the average response time, which causes a bad user experience, and
+ultimately compromise the performance of the entire system. Even if the timeout
+is as low as ~250ms (just enough to allow a single TCP retransmit) there's a
+large loss of capacity and for many applications a 100-300% increase in average
+response time. This is the problem Semian solves by failing fast.
+## How does Semian work?
+Semian consists of two parts: circuit breaker and bulkheading. To understand
+Semian, and especially how to configure it, we must understand these patterns
+and their implementation.
+### Circuit Breaker
+The circuit breaker pattern is based on a simple observation - if we hit a
+timeout or any other error for a given service one or more times, we’re likely
+to hit it again for some amount of time. Instead of hitting the timeout
+repeatedly, we can mark the resource as dead for some amount of time during
+which we raise an exception instantly on any call to it. This is called the
+[circuit breaker pattern][cbp].
+![](http://cdn.shopify.com/s/files/1/0070/7032/files/image01_grande.png)
+When we perform a Remote Procedure Call (RPC), it will first check the circuit.
+If the circuit is rejecting requests because of too many failures reported by
+the driver, it will throw an exception immediately. Otherwise the circuit will
+call the driver. If the driver fails to get data back from the data store, it
+will notify the circuit. The circuit will count the error so that if too many
+errors have happened recently, it will start rejecting requests immediately
+instead of waiting for the driver to time out. The exception will then be raised
+back to the original caller. If the driver’s request was successful, it will
+return the data back to the calling method and notify the circuit that it made a
+successful call.
+The state of the circuit breaker is local to the worker and is not shared across
+all workers on a server.
+#### Circuit Breaker Configuration
+There are three configuration parameters for circuit breakers in Semian:
+* **error_threshold**. The number of errors the worker must encounter before
+  opening the circuit, that is, before it starts rejecting requests instantly.
+* **error_timeout**. The amount of time to wait before trying to query the
+  resource again.
+* **success_threshold**. The number of successes on the circuit required to
+  close it again, that is, to start accepting all requests to the circuit.
+### Bulkheading
+For many applications, circuit breakers are not enough however. This is best
+illustrated with an extreme. Imagine if the timeout for our data store isn't as
+low as 200ms, but actually 10 seconds. For example, you might have a relational data
+store where for some customers, 10s queries are (unfortunately) legitimate.
+Reducing the time of worst case queries requires a lot of effort. Dropping the
+query immediately could potentially leave some customers unable to access certain
+functionality. High timeouts are especially critical in a non-threaded
+environment where blocking IO means a worker is effectively doing nothing.
+In this case, circuit breakers aren't sufficient. Even assuming the circuit is
+shared across all processes on a server, it will still take at least 10s before
+the circuit opens, and in that time every worker is blocked. This means we are in
+a reduced capacity state for at least 20s, because the last 10s timeouts start
+just before the circuit opens at the 10s mark, when a couple of workers have hit
+a timeout. We thought of a number of
+potential solutions to this problem - stricter timeouts, grouping timeouts by
+section of our application, timeouts per statement—but they all still revolved
+around timeouts, and those are extremely hard to get right.
+Instead of thinking about timeouts, we took inspiration from Hystrix by Netflix
+and the book Release It (the resiliency bible), and looked at our services as
+connection pools. On a server with `W` workers, only a certain number of them
+are expected to be talking to a single data store at once. Let's say we've
+determined from our monitoring that there’s a 10% chance they’re talking to
+`mysql_shard_0` at any given point in time under normal traffic. The probability
+that five workers are talking to it at the same time is 0.001%. If we only allow
+five workers to talk to a resource at any given point in time, and accept the
+0.001% false positive rate—we can fail the sixth worker attempting to check out
+a connection instantly. This means that while the five workers are waiting for a
+timeout, all the other `W-5` workers on the node will instantly be failing on
+checking out the connection and opening their circuits. Our capacity is only
+degraded by a relatively small amount.
+We call this limitation primitive "tickets". In this case, the resource access
+is limited to 5 tickets (see Configuration). The timeout value specifies the
+maximum amount of time to block if no ticket is available.
+How do we limit the access to a resource for all workers on a server when the
+workers do not directly share memory? This is implemented with [SysV
+semaphores][sysv] to provide server-wide access control.
+#### Bulkhead Configuration
+There are two configuration values. It's not easy to choose good values and we're
+still experimenting with ways to figure out optimal ticket numbers. Generally
+something below half the number of workers on the server for endpoints that are
+queried frequently has worked well for us.
+* **tickets**. Number of workers that can concurrently access a resource.
+* **timeout**. Time to wait to acquire a ticket if there are no tickets left.
+  We recommend this to be `0` unless you have very few workers running (i.e.
+  less than ~5).
+## Defense line
+The finished defense line for resource access with circuit breakers and
+bulkheads then looks like this:
+![](http://cdn.shopify.com/s/files/1/0070/7032/files/image02_grande.png)
+The RPC first checks the circuit; if the circuit is open it will raise the
+exception straight away which will trigger the fallback (the default fallback is
+a 500 response). Otherwise, it will try Semian which fails instantly if too many
+workers are already querying the resource. Finally the driver will query the
+data store. If the data store succeeds, the driver will return the data back to
+the RPC. If the data store is slow or fails, this is our last line of defense
+against a misbehaving resource. The driver will raise an exception after trying
+to connect with a timeout or after an immediate failure. These driver actions
+will affect the circuit and Semian, which can make future calls fail faster.
+## Failing gracefully
+Ok, great, we've got a way to fail fast with slow resources. How does that make
+my application more resilient?
+Failing fast is only half the battle. It's up to you what you do with these
+errors. In the [session example](#real-world-example) we handle it gracefully by
+signing people out and disabling all session related functionality till the data
+store is back online. However, not rescuing the exception and simply sending
+`HTTP 500` back to the client faster will help with [capacity
+loss](#capacity-loss).
+### Exceptions inherit from base class
+It's important to understand that the exceptions raised by [Semian
+Adapters](#adapters) inherit from the base class of the driver itself, meaning
+that if you do something like:
+```ruby
+def posts
+  Post.all
+rescue Mysql2::Error
+  []
+end
+```
+Exceptions raised by Semian's `MySQL2` adapter will also get caught.
+### Patterns
+We do not recommend mindlessly sprinkling `rescue`s all over the place. What you
+should do instead is write decorators around secondary data stores (e.g. sessions)
+that provide resiliency for free. For example, if we stored the tags associated
+with products in a secondary data store it could look something like this:
+```ruby
+# Resilient decorator for storing a Set in Redis.
+class RedisSet
+  def initialize(key)
+    @key = key
+  end
+  def get
+    redis.smembers(@key)
+  rescue Redis::BaseConnectionError
+    []
+  end
+  private
+  def redis
+    @redis ||= Redis.new
+  end
+end
+class Product
+  # This will simply return an empty array in the case of a Redis outage.
+  def tags
+    tags_set.get
+  end
+  private
+  def tags_set
+    @tags_set ||= RedisSet.new("product:tags:#{self.id}")
+  end
+end
+```
+These decorators can be resiliency tested with [Toxiproxy][toxiproxy]. You can
+provide fallbacks around your primary data store as well. In our case, we simply
+`HTTP 500` in those cases unless it's cached because these pages aren't worth
+much without data from their primary data store.
 ## Monitoring
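The circuit breaker described in the hunk above, with its three parameters `error_threshold`, `error_timeout` and `success_threshold`, can be sketched in plain Ruby. This is an illustration under stated assumptions, not Semian's implementation: the class name `SketchCircuitBreaker`, the injectable `clock`, and the rule that only errors from the last `error_timeout` seconds count towards the threshold are all invented for the example.

```ruby
# Illustrative circuit breaker. Names and the windowing rule are assumptions,
# not Semian's internals.
class SketchCircuitBreaker
  class OpenCircuitError < StandardError; end

  attr_reader :state

  # The clock is injectable so the sketch can be exercised without sleeping.
  def initialize(error_threshold:, error_timeout:, success_threshold:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold
    @error_timeout = error_timeout
    @success_threshold = success_threshold
    @clock = clock
    @state = :closed
    @error_times = [] # timestamps of recent errors
    @successes = 0    # successes seen while half-open
  end

  # Runs the block unless the circuit is open, and records the outcome.
  def acquire
    try_half_open
    raise OpenCircuitError, "circuit is open" if @state == :open

    begin
      result = yield
    rescue StandardError
      mark_failed
      raise
    end
    mark_success
    result
  end

  private

  # Once error_timeout has passed since the last error, allow trial requests.
  def try_half_open
    return unless @state == :open
    @state = :half_open if @clock.call - @error_times.last >= @error_timeout
  end

  # Opens the circuit on any failure while half-open, or once enough recent
  # errors have accumulated.
  def mark_failed
    now = @clock.call
    @error_times << now
    @error_times.reject! { |t| now - t > @error_timeout }
    return unless @state == :half_open || @error_times.size >= @error_threshold
    @state = :open
    @successes = 0
  end

  # Closes the circuit after success_threshold successes while half-open.
  def mark_success
    return unless @state == :half_open
    @successes += 1
    return if @successes < @success_threshold
    @state = :closed
    @error_times.clear
    @successes = 0
  end
end

breaker = SketchCircuitBreaker.new(error_threshold: 3, error_timeout: 10, success_threshold: 2)
value = breaker.acquire { :response }
puts value
```

While the circuit is open, `acquire` raises without running the block, which is the "fail instantly instead of waiting for the timeout" behaviour the README describes.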
@@ -105,17 +512,59 @@ Semian.subscribe do |event, resource, scope, adapter|
 end
 ```
-# Understanding Semian
+# FAQ
+**How does Semian work with containers?** Semian uses [SysV semaphores][sysv] to
+coordinate access to a resource. The semaphore is only shared within the
+[IPC namespace][namespaces]. Unless you are running many workers inside every
+container, this leaves the bulkheading pattern effectively useless. We recommend sharing
+the IPC namespace between all containers on your host for the best ticket
+economy. If you are using Docker, this can be done with the [--ipc
+flag](https://docs.docker.com/reference/run/#ipc-settings).
+**Why isn't resource access shared across the entire cluster?** This implies a
+coordination data store. Semian would have to be resilient to failures of this
+data store as well, and fall back to other primitives. While it's nice to have
+all workers have the same view of the world, this greatly increases the
+complexity of the implementation which is not favourable for resiliency code.
+**Why isn't the circuit breaker implemented as a host-wide mechanism?** No good
+reason. Patches welcome!
+**Why is there no fallback mechanism in Semian?** Read the [Failing
+Gracefully](#failing-gracefully) section. In short, exceptions are exactly this.
+We did not want to put an extra level of abstraction on top of them. In the
+first internal implementation this was the case, but we later moved away from
+it.
+**Why does it not use normal Ruby semaphores?** To work properly the access
+control needs to be performed across many workers. With MRI that means having
+multiple processes, not threads. Thus we need a primitive outside of the
+interpreter. For other Ruby implementations a driver that uses Ruby semaphores
+could be used (and would be accepted as a PR).
+**Why are there three semaphores in the semaphore sets for each resource?** This
+has to do with being able to resize the number of tickets for a resource online.
+**Can I change the number of tickets freely?** Yes, the logic for this isn't
+trivial but it works well.
-Coming soon!
+**What is the performance overhead of Semian?** Extremely minimal in comparison
+to going to the network. Don't worry about it unless you're instrumenting
+non-IO.
 [hystrix]: https://github.com/Netflix/Hystrix
 [release-it]: https://pragprog.com/book/mnee/release-it
 [shopify]: http://www.shopify.com/
-[mysql-semian-adapter]: https://github.com/Shopify/semian/blob/master/lib/semian/mysql2.rb
-[redis-semian-adapter]: https://github.com/Shopify/semian/blob/master/lib/semian/redis.rb
-[semian-adapter]: https://github.com/Shopify/semian/blob/master/lib/semian/adapter.rb
-[semian-instrumentable]: https://github.com/Shopify/semian/blob/master/lib/semian/instrumentable.rb
+[mysql-semian-adapter]: lib/semian/mysql2.rb
+[redis-semian-adapter]: lib/semian/redis.rb
+[semian-adapter]: lib/semian/adapter.rb
+[nethttp-semian-adapter]: lib/semian/net_http.rb
+[nethttp-default-errors]: lib/semian/net_http.rb#L33-L43
+[semian-instrumentable]: lib/semian/instrumentable.rb
 [statsd-instrument]: http://github.com/shopify/statsd-instrument
 [resiliency-blog-post]: http://www.shopify.com/technology/16906928-building-and-testing-resilient-ruby-on-rails-applications
 [toxiproxy]: https://github.com/Shopify/toxiproxy
+[sysv]: http://man7.org/linux/man-pages/man7/svipc.7.html
+[cbp]: https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
+[namespaces]: http://man7.org/linux/man-pages/man7/namespaces.7.html
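The ticket arithmetic and the ticket check from the Bulkheading section of the README diff above can be sketched as follows. The sketch assumes workers hit the resource independently, in which case the README's 0.001% figure is 10% raised to the fifth power. It uses an in-process counter guarded by a `Mutex` as a stand-in for the SysV semaphore. `SketchBulkhead` is a made-up name, and unlike Semian this version coordinates nothing across processes. It behaves like a `timeout` of `0`: it never waits for a ticket.

```ruby
# Worked arithmetic: chance that five given workers use the resource at the
# same moment, assuming each is busy with it 10% of the time, independently.
p_busy = Rational(1, 10)
p_five_busy = p_busy**5
puts "five at once: #{(p_five_busy * 100).to_f}%"

# In-process stand-in for server-wide tickets (assumption: Semian itself uses
# SysV semaphores so that separate worker processes share the count).
class SketchBulkhead
  class ResourceBusyError < StandardError; end

  def initialize(tickets:)
    @tickets = tickets
    @in_use = 0
    @lock = Mutex.new
  end

  # Takes a ticket or fails immediately; always returns the ticket afterwards.
  def acquire
    @lock.synchronize do
      raise ResourceBusyError, "no tickets left" if @in_use >= @tickets
      @in_use += 1
    end
    begin
      yield
    ensure
      @lock.synchronize { @in_use -= 1 }
    end
  end
end

bulkhead = SketchBulkhead.new(tickets: 5)
rows = bulkhead.acquire { [:row] }
puts rows.inspect
```

With five tickets, a sixth concurrent caller gets `ResourceBusyError` straight away instead of queueing behind a slow resource.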