einhorn 0.7.0 → 0.8.2

@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: f3f9c31b861db9b8b7ab5d2345be06f60223d60e1243bc2618ec5ef1db2b72e5
+   data.tar.gz: 38144bb080c8719b4d164bcbf8d96d498844626a66b9a522c75fb8bab6309c4f
+ SHA512:
+   metadata.gz: 3370ff020a249f5af7be26bfb48392a0b5721d139b895684e37aa92f612bb00d10e48eeb1b95acc15c74448af96c1df039fc841f6ff45ffdf91d6a16bcc614ab
+   data.tar.gz: 3e6b93f1ed82a46a9578dd3dd59cdb0c5c962772d285e59f1e78e531b7577fa4debbc7e3a28a0e4f9cddf053e4481979b37229ce40a6ed9e476dbba6e7985f1e
@@ -1,8 +1,10 @@
  language: ruby
  rvm:
- - 1.8.7
- - 1.9.2
- - 1.9.3
  - 2.0.0
  - 2.1
- - ree
+ - 2.2
+
+ # This is to work around the version of bundler installed in Travis and
+ # https://github.com/bundler/bundler/issues/3558
+ before_install:
+ - gem update bundler
data/README.md CHANGED
@@ -194,6 +194,17 @@ library.
  You can set the name that Einhorn and your workers show in PS. Just
  pass `-c <name>`.

+ ### Re-exec
+
+ You can use the `--reexec-as` option to replace the `einhorn` command with a command or script of your own. This might be useful for those with a Capistrano-like deploy process that has changing symlinks. To ensure that you are following the symlinks, you could use a bash script like this:
+
+     #!/bin/bash
+
+     cd <symlinked directory>
+     exec /usr/local/bin/einhorn "$@"
+
+ Then set `--reexec-as` to the name of your bash script and it will run in place of the plain `einhorn` command.
+
  ### Options

  -b, --bind ADDR Bind an address and add the corresponding FD via the environment
@@ -217,11 +228,18 @@ pass `-c <name>`.
  Unix nice level at which to run the einhorn processes. If not running as root, make sure to ulimit -e as appropriate.
  --with-state-fd STATE [Internal option] With file descriptor containing state
  --upgrade-check [Internal option] Check if Einhorn can exec itself and exit with status 0 before loading code
+ -t, --signal-timeout=T If children do not react to signals after T seconds, escalate to SIGKILL
  --version Show version


  ## Contributing

+ ### Development Status
+
+ Einhorn is still in active use at Stripe, but it is no longer under active
+ development. PRs are very welcome, and we will review and merge them, but we
+ are unlikely to triage and fix reported issues that do not come with code.
+
  Contributions are definitely welcome. To contribute, just follow the
  usual workflow:

@@ -249,10 +267,28 @@ EventMachine-LE to support file-descriptor passing. Check out

  ## Compatibility

- Einhorn was developed and tested under Ruby 1.8.7.
+ Einhorn runs in Ruby 2.0, 2.1, and 2.2.
+
+ The following libraries ease integration with Einhorn for languages other than
+ Ruby:
+
+ - **[go-einhorn](https://github.com/stripe/go-einhorn)**: Stripe's own library
+   for *talking* to an einhorn master (doesn't wrap socket code).
+ - **[goji](https://github.com/zenazn/goji/)**: Go (golang) server framework. The
+   [`bind`](https://godoc.org/github.com/zenazn/goji/bind) and
+   [`graceful`](https://godoc.org/github.com/zenazn/goji/graceful)
+   packages provide helpers and HTTP/TCP connection wrappers for Einhorn
+   integration.
+ - **[github.com/CHH/einhorn](https://github.com/CHH/einhorn)**: PHP library
+ - **[thin-attach\_socket](https://github.com/ConradIrwin/thin-attach_socket)**:
+   run `thin` behind Einhorn
+ - **[baseplate](https://reddit.github.io/baseplate/cli/serve.html)**: a
+   collection of Python helpers and libraries, with support for running behind
+   Einhorn
+
+ *NB: this list should not imply any official endorsement or vetting!*

  ## About

- Einhorn is a project of [Stripe](https://stripe.com), led by [Greg
- Brockman](https://twitter.com/thegdb). Feel free to get in touch at
+ Einhorn is a project of [Stripe](https://stripe.com), led by [Carl Jackson](https://github.com/zenazn). Feel free to get in touch at
  info@stripe.com.
@@ -67,10 +67,28 @@ EventMachine-LE to support file-descriptor passing. Check out

  ## Compatibility

- Einhorn was developed and tested under Ruby 1.8.7.
+ Einhorn runs in Ruby 2.0, 2.1, and 2.2.
+
+ The following libraries ease integration with Einhorn for languages other than
+ Ruby:
+
+ - **[go-einhorn](https://github.com/stripe/go-einhorn)**: Stripe's own library
+   for *talking* to an einhorn master (doesn't wrap socket code).
+ - **[goji](https://github.com/zenazn/goji/)**: Go (golang) server framework. The
+   [`bind`](https://godoc.org/github.com/zenazn/goji/bind) and
+   [`graceful`](https://godoc.org/github.com/zenazn/goji/graceful)
+   packages provide helpers and HTTP/TCP connection wrappers for Einhorn
+   integration.
+ - **[github.com/CHH/einhorn](https://github.com/CHH/einhorn)**: PHP library
+ - **[thin-attach\_socket](https://github.com/ConradIrwin/thin-attach_socket)**:
+   run `thin` behind Einhorn
+ - **[baseplate](https://reddit.github.io/baseplate/cli/serve.html)**: a
+   collection of Python helpers and libraries, with support for running behind
+   Einhorn
+
+ *NB: this list should not imply any official endorsement or vetting!*

  ## About

- Einhorn is a project of [Stripe](https://stripe.com), led by [Greg
- Brockman](https://twitter.com/thegdb). Feel free to get in touch at
+ Einhorn is a project of [Stripe](https://stripe.com), led by [Carl Jackson](https://github.com/zenazn). Feel free to get in touch at
  info@stripe.com.
@@ -266,8 +266,11 @@ if true # $0 == __FILE__
  Einhorn::Command.quieter(false)
  end

- opts.on('-s', '--seconds N', 'Number of seconds to wait until respawning') do |b|
- Einhorn::State.config[:seconds] = s.to_i
+ opts.on('-s', '--seconds N', 'Number of seconds to wait until respawning') do |s|
+ seconds = Float(s)
+ raise ArgumentError, 'seconds must be > 0' if seconds.zero?
+
+ Einhorn::State.config[:seconds] = seconds
  end

  opts.on('-v', '--verbose', 'Make output verbose (can be reconfigured on the fly)') do
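The rewritten `-s/--seconds` handler above fixes a latent bug (the old block bound `|b|` while the body referenced `s`, so the option would raise `NameError`) and switches from `to_i` to `Float()`, which rejects malformed values instead of silently coercing them to 0. A quick illustration of the difference:

```ruby
'2.5'.to_i    # => 2    (truncates)
'abc'.to_i    # => 0    (silently accepts junk)
Float('2.5')  # => 2.5
Float('abc')  # raises ArgumentError (invalid value for Float(): "abc")
```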
@@ -310,6 +313,18 @@ if true # $0 == __FILE__
  Einhorn::State.signal_timeout = Integer(t)
  end

+ opts.on('--max-unacked=N', 'Maximum number of workers that can be unacked when gracefully upgrading.') do |n|
+ Einhorn::State.config[:max_unacked] = Integer(n)
+ end
+
+ opts.on('--max-upgrade-additional=N', 'Maximum number of additional workers that can be running during an upgrade.') do |n|
+ Einhorn::State.config[:max_upgrade_additional] = Integer(n)
+ end
+
+ opts.on('--gc-before-fork', 'Run the GC three times before forking to improve memory sharing for copy-on-write.') do
+ Einhorn::State.config[:gc_before_fork] = true
+ end
+
  opts.on('--version', 'Show version') do
  puts Einhorn::VERSION
  exit
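All three new flags land in `Einhorn::State.config`. A minimal standalone sketch of the same `OptionParser` pattern (the `config` hash here stands in for Einhorn's state and is illustrative only):

```ruby
require 'optparse'

config = {}
OptionParser.new do |opts|
  # Integer(n) raises on malformed input, unlike n.to_i, which silently returns 0.
  opts.on('--max-unacked=N') { |n| config[:max_unacked] = Integer(n) }
  opts.on('--max-upgrade-additional=N') { |n| config[:max_upgrade_additional] = Integer(n) }
  opts.on('--gc-before-fork') { config[:gc_before_fork] = true }
end.parse(['--max-unacked=4', '--gc-before-fork'])

config  # => {:max_unacked=>4, :gc_before_fork=>true}
```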
@@ -21,22 +21,14 @@ module Einhorn
  end

  def send_command(hash)
- begin
- @client.send_command(hash)
- while response = @client.receive_message
- if response.kind_of?(Hash)
- yield response['message']
- return unless response['wait']
- else
- puts "Invalid response type #{response.class}: #{response.inspect}"
- end
+ @client.send_command(hash)
+ while response = @client.receive_message
+ if response.kind_of?(Hash)
+ yield response['message']
+ return unless response['wait']
+ else
+ puts "Invalid response type #{response.class}: #{response.inspect}"
  end
- rescue Errno::EPIPE => e
- emit("einhornsh: Error communicating with Einhorn: #{e} (#{e.class})")
- emit("einhornsh: Attempting to reconnect...")
- reconnect
-
- retry
  end
  end

@@ -15,6 +15,7 @@ Gem::Specification.new do |gem|
  gem.name = 'einhorn'
  gem.require_paths = ['lib']

+ gem.add_development_dependency 'rack', '~> 1.6'
  gem.add_development_dependency 'rake'
  gem.add_development_dependency 'pry'
  gem.add_development_dependency 'minitest', '< 5.0'
@@ -16,4 +16,4 @@ def einhorn_main
  puts "From PID #{$$}: Doing some work"
  sleep 1
  end
- end
+ end
@@ -45,6 +45,7 @@ module Einhorn
  :orig_cmd => nil,
  :bind => [],
  :bind_fds => [],
+ :bound_ports => [],
  :cmd => nil,
  :script_name => nil,
  :respawn => true,
@@ -68,14 +69,9 @@ module Einhorn
  :reexec_commandline => nil,
  :drop_environment_variables => [],
  :signal_timeout => nil,
+ :preloaded => false
  }
  end
-
- def self.dumpable_state
- dump = state
- dump[:reloading_for_preload_upgrade] = dump[:reloading_for_upgrade]
- dump
- end
  end

  module TransientState
@@ -83,7 +79,6 @@ module Einhorn
  def self.default_state
  {
  :whatami => :master,
- :preloaded => false,
  :script_name => nil,
  :argv => [],
  :environ => {},
@@ -110,38 +105,24 @@ module Einhorn
  updated_state = old_state.dup

  # Handle changes in state format updates from previous einhorn versions
- if store == Einhorn::State
- # TODO: Drop this backwards compatibility hack when we hit 0.7
- if updated_state.include?(:reloading_for_preload_upgrade) &&
- !updated_state.include?(:reloading_for_upgrade)
- updated_state[:reloading_for_upgrade] = updated_state.delete(:reloading_for_preload_upgrade)
- message << "upgraded :reloading_for_preload_upgrade to :reloading_for_upgrade"
- end
-
- if updated_state[:children]
- # For a period, we created children entries for state_passers,
- # but we don't want that (in particular, it probably died
- # before we could setup our SIGCHLD handler
- updated_state[:children].delete_if {|k, v| v[:type] == :state_passer}
-
- # Depending on what is passed for --reexec-as, it's possible
- # that the process received a SIGCHLD while something other
- # than einhorn was the active executable. If that happened,
- # einhorn might not know about a dead child, so let's check
- # them all
- dead = []
- updated_state[:children].each do |pid, v|
- begin
- pid = Process.wait(pid, Process::WNOHANG)
- dead << pid if pid
- rescue Errno::ECHILD
- dead << pid
- end
- end
- Einhorn::Event::Timer.open(0) do
- dead.each {|pid| Einhorn::Command.mourn(pid)}
+ if store == Einhorn::State && updated_state[:children]
+ # Depending on what is passed for --reexec-as, it's possible
+ # that the process received a SIGCHLD while something other
+ # than einhorn was the active executable. If that happened,
+ # einhorn might not know about a dead child, so let's check
+ # them all
+ dead = []
+ updated_state[:children].each do |pid, v|
+ begin
+ pid = Process.wait(pid, Process::WNOHANG)
+ dead << pid if pid
+ rescue Errno::ECHILD
+ dead << pid
  end
  end
+ Einhorn::Event::Timer.open(0) do
+ dead.each {|pid| Einhorn::Command.cleanup(pid)}
+ end
  end

  default = store.default_state
@@ -182,20 +163,23 @@ module Einhorn
  end

  Einhorn::TransientState.socket_handles << sd
- sd.fileno
+ [sd.fileno, sd.local_address.ip_port]
  end

  # Implement these ourselves so it plays nicely with state persistence
  def self.log_debug(msg, tag=nil)
  $stderr.puts("#{log_tag} DEBUG: #{msg}\n") if Einhorn::State.verbosity <= 0
+ $stderr.flush
  self.send_tagged_message(tag, msg) if tag
  end
  def self.log_info(msg, tag=nil)
  $stderr.puts("#{log_tag} INFO: #{msg}\n") if Einhorn::State.verbosity <= 1
+ $stderr.flush
  self.send_tagged_message(tag, msg) if tag
  end
  def self.log_error(msg, tag=nil)
  $stderr.puts("#{log_tag} ERROR: #{msg}\n") if Einhorn::State.verbosity <= 2
+ $stderr.flush
  self.send_tagged_message(tag, "ERROR: #{msg}") if tag
  end

@@ -246,6 +230,8 @@ module Einhorn
  set_argv(Einhorn::State.cmd, false)

  begin
+ # Reset preloaded state to false - this allows us to monitor for failed preloads during reloads.
+ Einhorn::State.preloaded = false
  # If it's not going to be requireable, then load it.
  if !path.end_with?('.rb') && File.exists?(path)
  log_info("Loading #{path} (if this hangs, make sure your code can be properly loaded as a library)", :upgrade)
@@ -253,13 +239,15 @@ module Einhorn
  else
  log_info("Requiring #{path} (if this hangs, make sure your code can be properly loaded as a library)", :upgrade)
  require path
+
+ force_move_to_oldgen if Einhorn::State.config[:gc_before_fork]
  end
  rescue Exception => e
  log_info("Proceeding with postload -- could not load #{path}: #{e} (#{e.class})\n #{e.backtrace.join("\n ")}", :upgrade)
  else
  if defined?(einhorn_main)
  log_info("Successfully loaded #{path}", :upgrade)
- Einhorn::TransientState.preloaded = true
+ Einhorn::State.preloaded = true
  else
  log_info("Proceeding with postload -- loaded #{path}, but no einhorn_main method was defined", :upgrade)
  end
@@ -267,6 +255,22 @@ module Einhorn
  end
  end

+ # Make the GC more copy-on-write friendly by forcibly incrementing the generation
+ # counter on all objects to its maximum value. Learn more at: https://github.com/ko1/nakayoshi_fork
+ def self.force_move_to_oldgen
+ log_info("Starting GC to improve copy-on-write memory sharing", :upgrade)
+
+ GC.start
+ 3.times do
+ GC.start(full_mark: false)
+ end
+
+ GC.compact if GC.respond_to?(:compact)
+
+ log_info("Finished GC after preloading", :upgrade)
+ end
+ private_class_method :force_move_to_oldgen
+
  def self.set_argv(cmd, set_ps_name)
  # TODO: clean up this hack
  idx = 0
@@ -324,8 +328,9 @@ module Einhorn

  def self.socketify_env!
  Einhorn::State.bind.each do |host, port, flags|
- fd = bind(host, port, flags)
+ fd, actual_port = bind(host, port, flags)
  Einhorn::State.bind_fds << fd
+ Einhorn::State.bound_ports << actual_port
  end
  end

@@ -339,7 +344,8 @@ module Einhorn
  host = $2
  port = $3
  flags = $4.split(',').select {|flag| flag.length > 0}.map {|flag| flag.downcase}
- fd = (Einhorn::State.sockets[[host, port]] ||= bind(host, port, flags))
+ Einhorn::State.sockets[[host, port]] ||= bind(host, port, flags)[0]
+ fd = Einhorn::State.sockets[[host, port]]
  "#{opt}#{fd}"
  else
  arg
@@ -431,6 +437,14 @@ module Einhorn
  Einhorn::State.reloading_for_upgrade = false
  end

+ # If setting a signal-timeout, timeout the event loop
+ # in the same timeframe, ensuring processes are culled
+ # on a regular basis.
+ if Einhorn::State.signal_timeout
+ Einhorn::Event.default_timeout = Einhorn::Event.default_timeout.nil? ?
+ Einhorn::State.signal_timeout : [Einhorn::State.signal_timeout, Einhorn::Event.default_timeout].min
+ end
+
  while Einhorn::State.respawn || Einhorn::State.children.size > 0
  log_debug("Entering event loop")

@@ -1,5 +1,4 @@
  require 'set'
- require 'uri'
  require 'yaml'

  module Einhorn
@@ -22,12 +21,12 @@ module Einhorn

  def self.serialize_message(message)
  serialized = YAML.dump(message)
- escaped = URI.escape(serialized, "%\n")
+ escaped = serialized.gsub(/%|\n/, '%' => '%25', "\n" => '%0A')
  escaped + "\n"
  end

  def self.deserialize_message(line)
- serialized = URI.unescape(line)
+ serialized = line.gsub(/%(25|0A)/, '%25' => '%', '%0A' => "\n")
  YAML.load(serialized)
  end
  end
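The hand-rolled percent-encoding above replaces `URI.escape`/`URI.unescape`, which were deprecated (and later removed) in newer Rubies. Only `%` and newline need escaping to keep each YAML message on a single line. A quick round trip with an illustrative payload:

```ruby
payload = "---\n:command: ack\n"

escaped = payload.gsub(/%|\n/, '%' => '%25', "\n" => '%0A')
# => "---%0A:command: ack%0A"   (one line; safe to terminate with "\n")

escaped.gsub(/%(25|0A)/, '%25' => '%', '%0A' => "\n") == payload
# => true
```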
@@ -3,6 +3,7 @@ require 'set'
  require 'tmpdir'

  require 'einhorn/command/interface'
+ require 'einhorn/prctl'

  module Einhorn
  module Command
@@ -10,18 +11,16 @@ module Einhorn
  begin
  while true
  Einhorn.log_debug('Going to reap a child process')
-
  pid = Process.wait(-1, Process::WNOHANG)
  return unless pid
- mourn(pid)
+ cleanup(pid)
  Einhorn::Event.break_loop
  end
  rescue Errno::ECHILD
  end
  end

- # Mourn the death of your child
- def self.mourn(pid)
+ def self.cleanup(pid)
  unless spec = Einhorn::State.children[pid]
  Einhorn.log_error("Could not find any config for exited child #{pid.inspect}! This probably indicates a bug in Einhorn.")
  return
@@ -40,11 +39,23 @@ module Einhorn
  case type = spec[:type]
  when :worker
  Einhorn.log_info("===> Exited worker #{pid.inspect}#{extra}", :upgrade)
+ when :state_passer
+ Einhorn.log_debug("===> Exited state passing process #{pid.inspect}", :upgrade)
  else
  Einhorn.log_error("===> Exited process #{pid.inspect} has unrecognized type #{type.inspect}: #{spec.inspect}", :upgrade)
  end
  end

+ def self.register_ping(pid, request_id)
+ unless spec = Einhorn::State.children[pid]
+ Einhorn.log_error("Could not find state for PID #{pid.inspect}; ignoring ACK.")
+ return
+ end
+
+ spec[:pinged_at] = Time.now
+ spec[:pinged_request_id] = request_id
+ end
+
  def self.register_manual_ack(pid)
  ack_mode = Einhorn::State.ack_mode
  unless ack_mode[:type] == :manual
@@ -98,8 +109,8 @@ module Einhorn

  def self.signal_all(signal, children=nil, record=true)
  children ||= Einhorn::WorkerPool.workers
+ signaled = {}

- signaled = []
  Einhorn.log_info("Sending #{signal} to #{children.inspect}", :upgrade)

  children.each do |child|
@@ -113,22 +124,31 @@ module Einhorn
  Einhorn.log_error("Re-sending #{signal} to already-signaled child #{child.inspect}. It may be slow to spin down, or it may be swallowing #{signal}s.", :upgrade)
  end
  spec[:signaled].add(signal)
+ spec[:last_signaled_at] = Time.now
  end

  begin
  Process.kill(signal, child)
  rescue Errno::ESRCH
+ Einhorn.log_debug("Attempted to #{signal} child #{child.inspect} but the process does not exist", :upgrade)
  else
- signaled << child
+ signaled[child] = spec
  end
  end

- if Einhorn::State.signal_timeout
+ if Einhorn::State.signal_timeout && record
  Einhorn::Event::Timer.open(Einhorn::State.signal_timeout) do
  children.each do |child|
- next unless spec = Einhorn::State.children[child]
+ spec = Einhorn::State.children[child]
+ next unless spec # Process is already dead and removed by cleanup
+ signaled_spec = signaled[child]
+ next unless signaled_spec # We got ESRCH when trying to signal
+ if spec[:spinup_time] != signaled_spec[:spinup_time]
+ Einhorn.log_info("Different spinup time recorded for #{child} after #{Einhorn::State.signal_timeout}s. This probably indicates a PID rollover.", :upgrade)
+ next
+ end

- Einhorn.log_info("Child #{child.inspect} is still active after #{Einhorn::State.signal_timeout}. Sending SIGKILL.")
+ Einhorn.log_info("Child #{child.inspect} is still active after #{Einhorn::State.signal_timeout}s. Sending SIGKILL.")
  begin
  Process.kill('KILL', child)
  rescue Errno::ESRCH
@@ -136,11 +156,12 @@ module Einhorn
  spec[:signaled].add('KILL')
  end
  end
- end

- "Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.inspect}"
+ Einhorn.log_info("Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}")
+ end
  end
  end
+
  def self.increment
  Einhorn::Event.break_loop
  old = Einhorn::State.config[:number]
@@ -211,6 +232,7 @@ module Einhorn

  fork do
  Einhorn::TransientState.whatami = :state_passer
+ Einhorn::State.children[Process.pid] = {type: :state_passer}
  Einhorn::State.generation += 1
  read.close

@@ -256,7 +278,8 @@ module Einhorn
  def self.spinup(cmd=nil)
  cmd ||= Einhorn::State.cmd
  index = next_index
- if Einhorn::TransientState.preloaded
+ expected_ppid = Process.pid
+ if Einhorn::State.preloaded
  pid = fork do
  Einhorn::TransientState.whatami = :worker
  prepare_child_process
@@ -268,6 +291,8 @@ module Einhorn

  reseed_random

+ setup_parent_watch(expected_ppid)
+
  prepare_child_environment(index)
  einhorn_main
  end
@@ -277,6 +302,7 @@ module Einhorn
  prepare_child_process

  Einhorn.log_info("About to exec #{cmd.inspect}")
+ Einhorn::Command::Interface.uninit
  # Here's the only case where cloexec would help. Since we
  # have to track and manually close FDs for other cases, we
  # may as well just reuse close_all rather than also set
@@ -285,20 +311,24 @@ module Einhorn
  # Note that Ruby 1.9's close_others option is useful here.
  Einhorn::Event.close_all_for_worker

+ setup_parent_watch(expected_ppid)
+
  prepare_child_environment(index)
  Einhorn::Compat.exec(cmd[0], cmd[1..-1], :close_others => false)
  end
  end

  Einhorn.log_info("===> Launched #{pid} (index: #{index})", :upgrade)
+ Einhorn::State.last_spinup = Time.now
  Einhorn::State.children[pid] = {
  :type => :worker,
  :version => Einhorn::State.version,
  :acked => false,
  :signaled => Set.new,
- :index => index
+ :last_signaled_at => nil,
+ :index => index,
+ :spinup_time => Einhorn::State.last_spinup,
  }
- Einhorn::State.last_spinup = Time.now

  # Set up whatever's needed for ACKing
  ack_mode = Einhorn::State.ack_mode
@@ -364,9 +394,28 @@ module Einhorn
  end

  def self.prepare_child_process
+ Process.setpgrp
  Einhorn.renice_self
  end

+ def self.setup_parent_watch(expected_ppid)
+ if Einhorn::State.kill_children_on_exit then
+ begin
+ # NB: Having the USR2 signal handler set to terminate (the default) at
+ # this point is required. If it's set to a ruby handler, there are
+ # race conditions that could cause the worker to leak.
+
+ Einhorn::Prctl.set_pdeathsig("USR2")
+ if Process.ppid != expected_ppid then
+ Einhorn.log_error("Parent process died before we set pdeathsig; cowardly refusing to exec child process.")
+ exit(1)
+ end
+ rescue NotImplementedError
+ # Unsupported OS; silently continue.
+ end
+ end
+ end
+
  # @param options [Hash]
  #
  # @option options [Boolean] :smooth (false) Whether to perform a smooth or
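`Einhorn::Prctl.set_pdeathsig` asks the kernel to signal the worker when its parent dies, which is what makes the `kill_children_on_exit` behaviour robust even if the master exits abruptly. On Linux this is the `prctl(PR_SET_PDEATHSIG, sig)` syscall. A standalone sketch of the idea using stdlib Fiddle (Linux-only, and not Einhorn's actual implementation in `einhorn/prctl`):

```ruby
require 'fiddle'

PR_SET_PDEATHSIG = 1 # from <sys/prctl.h>

parent_pid = Process.pid
fork do
  prctl = Fiddle::Function.new(
    Fiddle::Handle::DEFAULT['prctl'],
    [Fiddle::TYPE_INT, Fiddle::TYPE_LONG],
    Fiddle::TYPE_INT
  )
  # Ask the kernel to deliver SIGUSR2 to this process when its parent exits.
  prctl.call(PR_SET_PDEATHSIG, Signal.list.fetch('USR2'))

  # pdeathsig only covers deaths that happen *after* the call, so re-check:
  # if the parent is already gone we were reparented and will never be signaled.
  exit(1) if Process.ppid != parent_pid

  sleep # worker body would go here
end
```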
@@ -451,6 +500,41 @@ module Einhorn
  Einhorn.log_info("Have too many workers at the current version, so killing off #{excess.length} of them.")
  signal_all("USR2", excess)
  end
+
+ # Ensure all signaled workers that have outlived signal_timeout get killed.
+ kill_expired_signaled_workers if Einhorn::State.signal_timeout
+ end
+
+ def self.kill_expired_signaled_workers
+ now = Time.now
+ children = Einhorn::State.children.select do |_,c|
+ # Only interested in USR2 signaled workers
+ next unless c[:signaled] && c[:signaled].length > 0
+ next unless c[:signaled].include?('USR2')
+
+ # Ignore processes that have received KILL since it can't be trapped.
+ next if c[:signaled].include?('KILL')
+
+ # Filter out those children that have not reached signal_timeout yet.
+ next unless c[:last_signaled_at]
+ expires_at = c[:last_signaled_at] + Einhorn::State.signal_timeout
+ next unless now >= expires_at
+
+ true
+ end
+
+ Einhorn.log_info("#{children.size} expired signaled workers found.") if children.size > 0
+ children.each do |pid, child|
+ Einhorn.log_info("Child #{pid.inspect} was signaled #{(child[:last_signaled_at] - now).abs.to_i}s ago. Sending SIGKILL as it is still active after #{Einhorn::State.signal_timeout}s timeout.", :upgrade)
+ begin
+ Process.kill('KILL', pid)
+ rescue Errno::ESRCH
+ Einhorn.log_debug("Attempted to SIGKILL child #{pid.inspect} but the process does not exist.")
+ end
+
+ child[:signaled].add('KILL')
+ child[:last_signaled_at] = Time.now
+ end
  end

  def self.stop_respawning
@@ -478,10 +562,18 @@ module Einhorn
  missing.times {spinup}
  end

+ # Unbounded exponential backoff is not a thing: we run into problems if
+ # e.g., each of our hundred workers simultaneously fail to boot for the same
+ # ephemeral reason. Instead cap backoff by some reasonable maximum, so we
+ # don't wait until the heat death of the universe to spin up new capacity.
+ MAX_SPINUP_INTERVAL = 30.0
+
  def self.replenish_gradually(max_unacked=nil)
  return if Einhorn::TransientState.has_outstanding_spinup_timer
  return unless Einhorn::WorkerPool.missing_worker_count > 0

+ max_unacked ||= Einhorn::State.config[:max_unacked]
+
  # default to spinning up at most NCPU workers at once
  unless max_unacked
  begin
@@ -500,14 +592,12 @@ module Einhorn
  # Exponentially backoff automated spinup if we're just having
  # things die before ACKing
  spinup_interval = Einhorn::State.config[:seconds] * (1.5 ** Einhorn::State.consecutive_deaths_before_ack)
+ spinup_interval = [spinup_interval, MAX_SPINUP_INTERVAL].min
  seconds_ago = (Time.now - Einhorn::State.last_spinup).to_f

  if seconds_ago > spinup_interval
- unacked = Einhorn::WorkerPool.unacked_unsignaled_modern_workers.length
- if unacked >= max_unacked
- Einhorn.log_debug("There are #{unacked} unacked new workers, and max_unacked is #{max_unacked}, so not spinning up a new process")
- else
- msg = "Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so spinning up a new process"
+ if trigger_spinup?(max_unacked)
+ msg = "Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so spinning up a new process."

  if Einhorn::State.consecutive_deaths_before_ack > 0
@@ -518,7 +608,7 @@ module Einhorn
  spinup
  end
  else
- Einhorn.log_debug("Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so not spinning up a new process")
+ Einhorn.log_debug("Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so not spinning up a new process.")
  end

  Einhorn::TransientState.has_outstanding_spinup_timer = true
@@ -541,5 +631,22 @@ module Einhorn
  Einhorn.log_info(output) if log
  output
  end
+
+ def self.trigger_spinup?(max_unacked)
+ unacked = Einhorn::WorkerPool.unacked_unsignaled_modern_workers.length
+ if unacked >= max_unacked
+ Einhorn.log_info("There are #{unacked} unacked new workers, and max_unacked is #{max_unacked}, so not spinning up a new process.")
+ return false
+ elsif Einhorn::State.config[:max_upgrade_additional]
+ capacity_exceeded = (Einhorn::State.config[:number] + Einhorn::State.config[:max_upgrade_additional]) - Einhorn::WorkerPool.workers_with_state.length
+ if capacity_exceeded < 0
+ Einhorn.log_info("Over worker capacity by #{capacity_exceeded.abs} during upgrade, #{Einhorn::WorkerPool.modern_workers.length} new workers of #{Einhorn::WorkerPool.workers_with_state.length} total. Waiting for old workers to exit before spinning up a process.")
+
+ return false
+ end
+ end
+
+ true
+ end
  end
  end
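`trigger_spinup?` combines the existing `max_unacked` limit with the new `--max-upgrade-additional` budget, which bounds how far the live worker count may overshoot the target while old and new workers coexist during an upgrade. A worked example of the capacity arithmetic (all numbers illustrative):

```ruby
number                 = 10 # target worker count
max_upgrade_additional = 2  # extra workers tolerated during an upgrade

# capacity_exceeded = (target + headroom) - workers currently alive
(number + max_upgrade_additional) - 12 # => 0   still within budget, OK to spin up
(number + max_upgrade_additional) - 13 # => -1  over budget; wait for old workers to exit
```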