einhorn 0.7.0 → 0.8.2

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: f3f9c31b861db9b8b7ab5d2345be06f60223d60e1243bc2618ec5ef1db2b72e5
4
+ data.tar.gz: 38144bb080c8719b4d164bcbf8d96d498844626a66b9a522c75fb8bab6309c4f
5
+ SHA512:
6
+ metadata.gz: 3370ff020a249f5af7be26bfb48392a0b5721d139b895684e37aa92f612bb00d10e48eeb1b95acc15c74448af96c1df039fc841f6ff45ffdf91d6a16bcc614ab
7
+ data.tar.gz: 3e6b93f1ed82a46a9578dd3dd59cdb0c5c962772d285e59f1e78e531b7577fa4debbc7e3a28a0e4f9cddf053e4481979b37229ce40a6ed9e476dbba6e7985f1e
@@ -1,8 +1,10 @@
1
1
  language: ruby
2
2
  rvm:
3
- - 1.8.7
4
- - 1.9.2
5
- - 1.9.3
6
3
  - 2.0.0
7
4
  - 2.1
8
- - ree
5
+ - 2.2
6
+
7
+ # This is to work around the version of bundler installed in Travis and
8
+ # https://github.com/bundler/bundler/issues/3558
9
+ before_install:
10
+ - gem update bundler
data/README.md CHANGED
@@ -194,6 +194,17 @@ library.
194
194
  You can set the name that Einhorn and your workers show in PS. Just
195
195
  pass `-c <name>`.
196
196
 
197
+ ### Re exec
198
+
199
+ You can use the `--reexec-as` option to replace the `einhorn` command with a command or script of your own. This can be useful with a Capistrano-like deploy process whose symlinks change between releases. To ensure that you are following the symlinks, you could use a bash script like this:
200
+
201
+ #!/bin/bash
202
+
203
+ cd <symlinked directory>
204
+ exec /usr/local/bin/einhorn "$@"
205
+
206
+ Then you could set `--reexec-as=` to the name of your bash script and it will run in place of the plain einhorn command.
207
+
197
208
  ### Options
198
209
 
199
210
  -b, --bind ADDR Bind an address and add the corresponding FD via the environment
@@ -217,11 +228,18 @@ pass `-c <name>`.
217
228
  Unix nice level at which to run the einhorn processes. If not running as root, make sure to ulimit -e as appropriate.
218
229
  --with-state-fd STATE [Internal option] With file descriptor containing state
219
230
  --upgrade-check [Internal option] Check if Einhorn can exec itself and exit with status 0 before loading code
231
+ -t, --signal-timeout=T If children do not react to signals after T seconds, escalate to SIGKILL
220
232
  --version Show version
221
233
 
222
234
 
223
235
  ## Contributing
224
236
 
237
+ ### Development Status
238
+
239
+ Einhorn is still in active operation at Stripe, but we are not maintaining
240
+ Einhorn actively. PRs are very welcome, and we will review and merge,
241
+ but we are unlikely to triage and fix reported issues without code.
242
+
225
243
  Contributions are definitely welcome. To contribute, just follow the
226
244
  usual workflow:
227
245
 
@@ -249,10 +267,28 @@ EventMachine-LE to support file-descriptor passing. Check out
249
267
 
250
268
  ## Compatibility
251
269
 
252
- Einhorn was developed and tested under Ruby 1.8.7.
270
+ Einhorn runs in Ruby 2.0, 2.1, and 2.2
271
+
272
+ The following libraries ease integration of Einhorn with languages other than
273
+ Ruby:
274
+
275
+ - **[go-einhorn](https://github.com/stripe/go-einhorn)**: Stripe's own library
276
+ for *talking* to an einhorn master (doesn't wrap socket code).
277
+ - **[goji](https://github.com/zenazn/goji/)**: Go (golang) server framework. The
278
+ [`bind`](https://godoc.org/github.com/zenazn/goji/bind) and
279
+ [`graceful`](https://godoc.org/github.com/zenazn/goji/graceful)
280
+ packages provide helpers and HTTP/TCP connection wrappers for Einhorn
281
+ integration.
282
+ - **[github.com/CHH/einhorn](https://github.com/CHH/einhorn)**: PHP library
283
+ - **[thin-attach\_socket](https://github.com/ConradIrwin/thin-attach_socket)**:
284
+ run `thin` behind Einhorn
285
+ - **[baseplate](https://reddit.github.io/baseplate/cli/serve.html)**: a
286
+ collection of Python helpers and libraries, with support for running behind
287
+ Einhorn
288
+
289
+ *NB: this list should not imply any official endorsement or vetting!*
253
290
 
254
291
  ## About
255
292
 
256
- Einhorn is a project of [Stripe](https://stripe.com), led by [Greg
257
- Brockman](https://twitter.com/thegdb). Feel free to get in touch at
293
+ Einhorn is a project of [Stripe](https://stripe.com), led by [Carl Jackson](https://github.com/zenazn). Feel free to get in touch at
258
294
  info@stripe.com.
@@ -67,10 +67,28 @@ EventMachine-LE to support file-descriptor passing. Check out
67
67
 
68
68
  ## Compatibility
69
69
 
70
- Einhorn was developed and tested under Ruby 1.8.7.
70
+ Einhorn runs in Ruby 2.0, 2.1, and 2.2
71
+
72
+ The following libraries ease integration of Einhorn with languages other than
73
+ Ruby:
74
+
75
+ - **[go-einhorn](https://github.com/stripe/go-einhorn)**: Stripe's own library
76
+ for *talking* to an einhorn master (doesn't wrap socket code).
77
+ - **[goji](https://github.com/zenazn/goji/)**: Go (golang) server framework. The
78
+ [`bind`](https://godoc.org/github.com/zenazn/goji/bind) and
79
+ [`graceful`](https://godoc.org/github.com/zenazn/goji/graceful)
80
+ packages provide helpers and HTTP/TCP connection wrappers for Einhorn
81
+ integration.
82
+ - **[github.com/CHH/einhorn](https://github.com/CHH/einhorn)**: PHP library
83
+ - **[thin-attach\_socket](https://github.com/ConradIrwin/thin-attach_socket)**:
84
+ run `thin` behind Einhorn
85
+ - **[baseplate](https://reddit.github.io/baseplate/cli/serve.html)**: a
86
+ collection of Python helpers and libraries, with support for running behind
87
+ Einhorn
88
+
89
+ *NB: this list should not imply any official endorsement or vetting!*
71
90
 
72
91
  ## About
73
92
 
74
- Einhorn is a project of [Stripe](https://stripe.com), led by [Greg
75
- Brockman](https://twitter.com/thegdb). Feel free to get in touch at
93
+ Einhorn is a project of [Stripe](https://stripe.com), led by [Carl Jackson](https://github.com/zenazn). Feel free to get in touch at
76
94
  info@stripe.com.
@@ -266,8 +266,11 @@ if true # $0 == __FILE__
266
266
  Einhorn::Command.quieter(false)
267
267
  end
268
268
 
269
- opts.on('-s', '--seconds N', 'Number of seconds to wait until respawning') do |b|
270
- Einhorn::State.config[:seconds] = s.to_i
269
+ opts.on('-s', '--seconds N', 'Number of seconds to wait until respawning') do |s|
270
+ seconds = Float(s)
271
+ raise ArgumentError, 'seconds must be > 0' if seconds.zero?
272
+
273
+ Einhorn::State.config[:seconds] = seconds
271
274
  end
272
275
 
273
276
  opts.on('-v', '--verbose', 'Make output verbose (can be reconfigured on the fly)') do
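Review note on the `--seconds` hunk above: the old block bound its argument as `b` but read `s`, and the new code also switches from `to_i` to `Float(s)` so fractional intervals are accepted. As written, the new guard rejects only an exact zero, so a negative value would pass even though the message says "> 0". A minimal sketch of a stricter check follows; the helper name `parse_respawn_seconds` is hypothetical and not part of Einhorn.

```ruby
# Hypothetical helper, not Einhorn's API: parse a respawn interval and
# reject anything that is not strictly positive.
def parse_respawn_seconds(raw)
  seconds = Float(raw) # raises ArgumentError on non-numeric input
  raise ArgumentError, 'seconds must be > 0' unless seconds > 0
  seconds
end

interval = parse_respawn_seconds('0.5')
puts "respawn interval: #{interval}s"
```

Whether negative intervals should be rejected is a judgment call for the maintainers; the sketch only shows what a check matching the error message could look like.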
@@ -310,6 +313,18 @@ if true # $0 == __FILE__
310
313
  Einhorn::State.signal_timeout = Integer(t)
311
314
  end
312
315
 
316
+ opts.on('--max-unacked=N', 'Maximum number of workers that can be unacked when gracefully upgrading.') do |n|
317
+ Einhorn::State.config[:max_unacked] = Integer(n)
318
+ end
319
+
320
+ opts.on('--max-upgrade-additional=N', 'Maximum number of additional workers that can be running during an upgrade.') do |n|
321
+ Einhorn::State.config[:max_upgrade_additional] = Integer(n)
322
+ end
323
+
324
+ opts.on('--gc-before-fork', 'Run the GC three times before forking to improve memory sharing for copy-on-write.') do
325
+ Einhorn::State.config[:gc_before_fork] = true
326
+ end
327
+
313
328
  opts.on('--version', 'Show version') do
314
329
  puts Einhorn::VERSION
315
330
  exit
@@ -21,22 +21,14 @@ module Einhorn
21
21
  end
22
22
 
23
23
  def send_command(hash)
24
- begin
25
- @client.send_command(hash)
26
- while response = @client.receive_message
27
- if response.kind_of?(Hash)
28
- yield response['message']
29
- return unless response['wait']
30
- else
31
- puts "Invalid response type #{response.class}: #{response.inspect}"
32
- end
24
+ @client.send_command(hash)
25
+ while response = @client.receive_message
26
+ if response.kind_of?(Hash)
27
+ yield response['message']
28
+ return unless response['wait']
29
+ else
30
+ puts "Invalid response type #{response.class}: #{response.inspect}"
33
31
  end
34
- rescue Errno::EPIPE => e
35
- emit("einhornsh: Error communicating with Einhorn: #{e} (#{e.class})")
36
- emit("einhornsh: Attempting to reconnect...")
37
- reconnect
38
-
39
- retry
40
32
  end
41
33
  end
42
34
 
@@ -15,6 +15,7 @@ Gem::Specification.new do |gem|
15
15
  gem.name = 'einhorn'
16
16
  gem.require_paths = ['lib']
17
17
 
18
+ gem.add_development_dependency 'rack', '~> 1.6'
18
19
  gem.add_development_dependency 'rake'
19
20
  gem.add_development_dependency 'pry'
20
21
  gem.add_development_dependency 'minitest', '< 5.0'
@@ -16,4 +16,4 @@ def einhorn_main
16
16
  puts "From PID #{$$}: Doing some work"
17
17
  sleep 1
18
18
  end
19
- end
19
+ end
@@ -45,6 +45,7 @@ module Einhorn
45
45
  :orig_cmd => nil,
46
46
  :bind => [],
47
47
  :bind_fds => [],
48
+ :bound_ports => [],
48
49
  :cmd => nil,
49
50
  :script_name => nil,
50
51
  :respawn => true,
@@ -68,14 +69,9 @@ module Einhorn
68
69
  :reexec_commandline => nil,
69
70
  :drop_environment_variables => [],
70
71
  :signal_timeout => nil,
72
+ :preloaded => false
71
73
  }
72
74
  end
73
-
74
- def self.dumpable_state
75
- dump = state
76
- dump[:reloading_for_preload_upgrade] = dump[:reloading_for_upgrade]
77
- dump
78
- end
79
75
  end
80
76
 
81
77
  module TransientState
@@ -83,7 +79,6 @@ module Einhorn
83
79
  def self.default_state
84
80
  {
85
81
  :whatami => :master,
86
- :preloaded => false,
87
82
  :script_name => nil,
88
83
  :argv => [],
89
84
  :environ => {},
@@ -110,38 +105,24 @@ module Einhorn
110
105
  updated_state = old_state.dup
111
106
 
112
107
  # Handle changes in state format updates from previous einhorn versions
113
- if store == Einhorn::State
114
- # TODO: Drop this backwards compatibility hack when we hit 0.7
115
- if updated_state.include?(:reloading_for_preload_upgrade) &&
116
- !updated_state.include?(:reloading_for_upgrade)
117
- updated_state[:reloading_for_upgrade] = updated_state.delete(:reloading_for_preload_upgrade)
118
- message << "upgraded :reloading_for_preload_upgrade to :reloading_for_upgrade"
119
- end
120
-
121
- if updated_state[:children]
122
- # For a period, we created children entries for state_passers,
123
- # but we don't want that (in particular, it probably died
124
- # before we could setup our SIGCHLD handler
125
- updated_state[:children].delete_if {|k, v| v[:type] == :state_passer}
126
-
127
- # Depending on what is passed for --reexec-as, it's possible
128
- # that the process received a SIGCHLD while something other
129
- # than einhorn was the active executable. If that happened,
130
- # einhorn might not know about a dead child, so let's check
131
- # them all
132
- dead = []
133
- updated_state[:children].each do |pid, v|
134
- begin
135
- pid = Process.wait(pid, Process::WNOHANG)
136
- dead << pid if pid
137
- rescue Errno::ECHILD
138
- dead << pid
139
- end
140
- end
141
- Einhorn::Event::Timer.open(0) do
142
- dead.each {|pid| Einhorn::Command.mourn(pid)}
108
+ if store == Einhorn::State && updated_state[:children]
109
+ # Depending on what is passed for --reexec-as, it's possible
110
+ # that the process received a SIGCHLD while something other
111
+ # than einhorn was the active executable. If that happened,
112
+ # einhorn might not know about a dead child, so let's check
113
+ # them all
114
+ dead = []
115
+ updated_state[:children].each do |pid, v|
116
+ begin
117
+ pid = Process.wait(pid, Process::WNOHANG)
118
+ dead << pid if pid
119
+ rescue Errno::ECHILD
120
+ dead << pid
143
121
  end
144
122
  end
123
+ Einhorn::Event::Timer.open(0) do
124
+ dead.each {|pid| Einhorn::Command.cleanup(pid)}
125
+ end
145
126
  end
146
127
 
147
128
  default = store.default_state
@@ -182,20 +163,23 @@ module Einhorn
182
163
  end
183
164
 
184
165
  Einhorn::TransientState.socket_handles << sd
185
- sd.fileno
166
+ [sd.fileno, sd.local_address.ip_port]
186
167
  end
187
168
 
188
169
  # Implement these ourselves so it plays nicely with state persistence
189
170
  def self.log_debug(msg, tag=nil)
190
171
  $stderr.puts("#{log_tag} DEBUG: #{msg}\n") if Einhorn::State.verbosity <= 0
172
+ $stderr.flush
191
173
  self.send_tagged_message(tag, msg) if tag
192
174
  end
193
175
  def self.log_info(msg, tag=nil)
194
176
  $stderr.puts("#{log_tag} INFO: #{msg}\n") if Einhorn::State.verbosity <= 1
177
+ $stderr.flush
195
178
  self.send_tagged_message(tag, msg) if tag
196
179
  end
197
180
  def self.log_error(msg, tag=nil)
198
181
  $stderr.puts("#{log_tag} ERROR: #{msg}\n") if Einhorn::State.verbosity <= 2
182
+ $stderr.flush
199
183
  self.send_tagged_message(tag, "ERROR: #{msg}") if tag
200
184
  end
201
185
 
@@ -246,6 +230,8 @@ module Einhorn
246
230
  set_argv(Einhorn::State.cmd, false)
247
231
 
248
232
  begin
233
+ # Reset preloaded state to false - this allows us to monitor for failed preloads during reloads.
234
+ Einhorn::State.preloaded = false
249
235
  # If it's not going to be requireable, then load it.
250
236
  if !path.end_with?('.rb') && File.exists?(path)
251
237
  log_info("Loading #{path} (if this hangs, make sure your code can be properly loaded as a library)", :upgrade)
@@ -253,13 +239,15 @@ module Einhorn
253
239
  else
254
240
  log_info("Requiring #{path} (if this hangs, make sure your code can be properly loaded as a library)", :upgrade)
255
241
  require path
242
+
243
+ force_move_to_oldgen if Einhorn::State.config[:gc_before_fork]
256
244
  end
257
245
  rescue Exception => e
258
246
  log_info("Proceeding with postload -- could not load #{path}: #{e} (#{e.class})\n #{e.backtrace.join("\n ")}", :upgrade)
259
247
  else
260
248
  if defined?(einhorn_main)
261
249
  log_info("Successfully loaded #{path}", :upgrade)
262
- Einhorn::TransientState.preloaded = true
250
+ Einhorn::State.preloaded = true
263
251
  else
264
252
  log_info("Proceeding with postload -- loaded #{path}, but no einhorn_main method was defined", :upgrade)
265
253
  end
@@ -267,6 +255,22 @@ module Einhorn
267
255
  end
268
256
  end
269
257
 
258
+ # Make the GC more copy-on-write friendly by forcibly incrementing the generation
259
+ # counter on all objects to its maximum value. Learn more at: https://github.com/ko1/nakayoshi_fork
260
+ def self.force_move_to_oldgen
261
+ log_info("Starting GC to improve copy-on-write memory sharing", :upgrade)
262
+
263
+ GC.start
264
+ 3.times do
265
+ GC.start(full_mark: false)
266
+ end
267
+
268
+ GC.compact if GC.respond_to?(:compact)
269
+
270
+ log_info("Finished GC after preloading", :upgrade)
271
+ end
272
+ private_class_method :force_move_to_oldgen
273
+
270
274
  def self.set_argv(cmd, set_ps_name)
271
275
  # TODO: clean up this hack
272
276
  idx = 0
@@ -324,8 +328,9 @@ module Einhorn
324
328
 
325
329
  def self.socketify_env!
326
330
  Einhorn::State.bind.each do |host, port, flags|
327
- fd = bind(host, port, flags)
331
+ fd, actual_port = bind(host, port, flags)
328
332
  Einhorn::State.bind_fds << fd
333
+ Einhorn::State.bound_ports << actual_port
329
334
  end
330
335
  end
331
336
 
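Review note on the `bound_ports` change above: `bind` now returns both the file descriptor and the port the kernel actually assigned, which matters when the requested port is 0. A minimal sketch of that idea using Ruby's standard `socket` library; the helper name `bind_tcp` is hypothetical, and it returns the socket object rather than its descriptor for simplicity.

```ruby
require 'socket'

# Hypothetical helper, not Einhorn's API: bind a listening TCP socket and
# report the port that was actually assigned.
def bind_tcp(host, port)
  sd = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM, 0)
  sd.setsockopt(Socket::SOL_SOCKET, Socket::SO_REUSEADDR, 1)
  sd.bind(Socket.pack_sockaddr_in(port, host))
  sd.listen(5)
  # local_address is an Addrinfo; ip_port is the bound port number.
  [sd, sd.local_address.ip_port]
end

sock, actual_port = bind_tcp('127.0.0.1', 0) # 0 asks the kernel to pick a port
puts "listening on port #{actual_port} (fd #{sock.fileno})"
sock.close
```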
@@ -339,7 +344,8 @@ module Einhorn
339
344
  host = $2
340
345
  port = $3
341
346
  flags = $4.split(',').select {|flag| flag.length > 0}.map {|flag| flag.downcase}
342
- fd = (Einhorn::State.sockets[[host, port]] ||= bind(host, port, flags))
347
+ Einhorn::State.sockets[[host, port]] ||= bind(host, port, flags)[0]
348
+ fd = Einhorn::State.sockets[[host, port]]
343
349
  "#{opt}#{fd}"
344
350
  else
345
351
  arg
@@ -431,6 +437,14 @@ module Einhorn
431
437
  Einhorn::State.reloading_for_upgrade = false
432
438
  end
433
439
 
440
+ # If setting a signal-timeout, timeout the event loop
441
+ # in the same timeframe, ensuring processes are culled
442
+ # on a regular basis.
443
+ if Einhorn::State.signal_timeout
444
+ Einhorn::Event.default_timeout = Einhorn::Event.default_timeout.nil? ?
445
+ Einhorn::State.signal_timeout : [Einhorn::State.signal_timeout, Einhorn::Event.default_timeout].min
446
+ end
447
+
434
448
  while Einhorn::State.respawn || Einhorn::State.children.size > 0
435
449
  log_debug("Entering event loop")
436
450
 
@@ -1,5 +1,4 @@
1
1
  require 'set'
2
- require 'uri'
3
2
  require 'yaml'
4
3
 
5
4
  module Einhorn
@@ -22,12 +21,12 @@ module Einhorn
22
21
 
23
22
  def self.serialize_message(message)
24
23
  serialized = YAML.dump(message)
25
- escaped = URI.escape(serialized, "%\n")
24
+ escaped = serialized.gsub(/%|\n/, '%' => '%25', "\n" => '%0A')
26
25
  escaped + "\n"
27
26
  end
28
27
 
29
28
  def self.deserialize_message(line)
30
- serialized = URI.unescape(line)
29
+ serialized = line.gsub(/%(25|0A)/, '%25' => '%', '%0A' => "\n")
31
30
  YAML.load(serialized)
32
31
  end
33
32
  end
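Review note on the serialization hunk above: the `URI.escape`/`URI.unescape` calls are replaced with a single-pass `gsub` that escapes only `%` and newline, so each YAML message stays on one line of the wire protocol. The sketch below copies the two expressions from the diff into standalone methods to show the round trip; the sample message is made up for illustration.

```ruby
require 'yaml'

# Escape '%' and newline in one pass so the frame contains exactly one
# trailing newline.
def serialize_message(message)
  serialized = YAML.dump(message)
  escaped = serialized.gsub(/%|\n/, '%' => '%25', "\n" => '%0A')
  escaped + "\n"
end

# Reverse the two escapes in one pass, then parse the YAML.
def deserialize_message(line)
  serialized = line.gsub(/%(25|0A)/, '%25' => '%', '%0A' => "\n")
  YAML.load(serialized)
end

# Illustrative message; the keys are not taken from Einhorn's protocol.
message = { 'command' => 'state', 'note' => "50% done\nliteral %0A stays" }
wire = serialize_message(message)
puts deserialize_message(wire) == message
```

Because both directions are single-pass substitutions, a literal `%0A` in the payload is escaped to `%250A` and restored intact rather than being mistaken for an encoded newline.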
@@ -3,6 +3,7 @@ require 'set'
3
3
  require 'tmpdir'
4
4
 
5
5
  require 'einhorn/command/interface'
6
+ require 'einhorn/prctl'
6
7
 
7
8
  module Einhorn
8
9
  module Command
@@ -10,18 +11,16 @@ module Einhorn
10
11
  begin
11
12
  while true
12
13
  Einhorn.log_debug('Going to reap a child process')
13
-
14
14
  pid = Process.wait(-1, Process::WNOHANG)
15
15
  return unless pid
16
- mourn(pid)
16
+ cleanup(pid)
17
17
  Einhorn::Event.break_loop
18
18
  end
19
19
  rescue Errno::ECHILD
20
20
  end
21
21
  end
22
22
 
23
- # Mourn the death of your child
24
- def self.mourn(pid)
23
+ def self.cleanup(pid)
25
24
  unless spec = Einhorn::State.children[pid]
26
25
  Einhorn.log_error("Could not find any config for exited child #{pid.inspect}! This probably indicates a bug in Einhorn.")
27
26
  return
@@ -40,11 +39,23 @@ module Einhorn
40
39
  case type = spec[:type]
41
40
  when :worker
42
41
  Einhorn.log_info("===> Exited worker #{pid.inspect}#{extra}", :upgrade)
42
+ when :state_passer
43
+ Einhorn.log_debug("===> Exited state passing process #{pid.inspect}", :upgrade)
43
44
  else
44
45
  Einhorn.log_error("===> Exited process #{pid.inspect} has unrecognized type #{type.inspect}: #{spec.inspect}", :upgrade)
45
46
  end
46
47
  end
47
48
 
49
+ def self.register_ping(pid, request_id)
50
+ unless spec = Einhorn::State.children[pid]
51
+ Einhorn.log_error("Could not find state for PID #{pid.inspect}; ignoring ACK.")
52
+ return
53
+ end
54
+
55
+ spec[:pinged_at] = Time.now
56
+ spec[:pinged_request_id] = request_id
57
+ end
58
+
48
59
  def self.register_manual_ack(pid)
49
60
  ack_mode = Einhorn::State.ack_mode
50
61
  unless ack_mode[:type] == :manual
@@ -98,8 +109,8 @@ module Einhorn
98
109
 
99
110
  def self.signal_all(signal, children=nil, record=true)
100
111
  children ||= Einhorn::WorkerPool.workers
112
+ signaled = {}
101
113
 
102
- signaled = []
103
114
  Einhorn.log_info("Sending #{signal} to #{children.inspect}", :upgrade)
104
115
 
105
116
  children.each do |child|
@@ -113,22 +124,31 @@ module Einhorn
113
124
  Einhorn.log_error("Re-sending #{signal} to already-signaled child #{child.inspect}. It may be slow to spin down, or it may be swallowing #{signal}s.", :upgrade)
114
125
  end
115
126
  spec[:signaled].add(signal)
127
+ spec[:last_signaled_at] = Time.now
116
128
  end
117
129
 
118
130
  begin
119
131
  Process.kill(signal, child)
120
132
  rescue Errno::ESRCH
133
+ Einhorn.log_debug("Attempted to #{signal} child #{child.inspect} but the process does not exist", :upgrade)
121
134
  else
122
- signaled << child
135
+ signaled[child] = spec
123
136
  end
124
137
  end
125
138
 
126
- if Einhorn::State.signal_timeout
139
+ if Einhorn::State.signal_timeout && record
127
140
  Einhorn::Event::Timer.open(Einhorn::State.signal_timeout) do
128
141
  children.each do |child|
129
- next unless spec = Einhorn::State.children[child]
142
+ spec = Einhorn::State.children[child]
143
+ next unless spec # Process is already dead and removed by cleanup
144
+ signaled_spec = signaled[child]
145
+ next unless signaled_spec # We got ESRCH when trying to signal
146
+ if spec[:spinup_time] != signaled_spec[:spinup_time]
147
+ Einhorn.log_info("Different spinup time recorded for #{child} after #{Einhorn::State.signal_timeout}s. This probably indicates a PID rollover.", :upgrade)
148
+ next
149
+ end
130
150
 
131
- Einhorn.log_info("Child #{child.inspect} is still active after #{Einhorn::State.signal_timeout}. Sending SIGKILL.")
151
+ Einhorn.log_info("Child #{child.inspect} is still active after #{Einhorn::State.signal_timeout}s. Sending SIGKILL.")
132
152
  begin
133
153
  Process.kill('KILL', child)
134
154
  rescue Errno::ESRCH
@@ -136,11 +156,12 @@ module Einhorn
136
156
  spec[:signaled].add('KILL')
137
157
  end
138
158
  end
139
- end
140
159
 
141
- "Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.inspect}"
160
+ Einhorn.log_info("Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}")
161
+ end
142
162
  end
143
163
 
164
+
144
165
  def self.increment
145
166
  Einhorn::Event.break_loop
146
167
  old = Einhorn::State.config[:number]
@@ -211,6 +232,7 @@ module Einhorn
211
232
 
212
233
  fork do
213
234
  Einhorn::TransientState.whatami = :state_passer
235
+ Einhorn::State.children[Process.pid] = {type: :state_passer}
214
236
  Einhorn::State.generation += 1
215
237
  read.close
216
238
 
@@ -256,7 +278,8 @@ module Einhorn
256
278
  def self.spinup(cmd=nil)
257
279
  cmd ||= Einhorn::State.cmd
258
280
  index = next_index
259
- if Einhorn::TransientState.preloaded
281
+ expected_ppid = Process.pid
282
+ if Einhorn::State.preloaded
260
283
  pid = fork do
261
284
  Einhorn::TransientState.whatami = :worker
262
285
  prepare_child_process
@@ -268,6 +291,8 @@ module Einhorn
268
291
 
269
292
  reseed_random
270
293
 
294
+ setup_parent_watch(expected_ppid)
295
+
271
296
  prepare_child_environment(index)
272
297
  einhorn_main
273
298
  end
@@ -277,6 +302,7 @@ module Einhorn
277
302
  prepare_child_process
278
303
 
279
304
  Einhorn.log_info("About to exec #{cmd.inspect}")
305
+ Einhorn::Command::Interface.uninit
280
306
  # Here's the only case where cloexec would help. Since we
281
307
  # have to track and manually close FDs for other cases, we
282
308
  # may as well just reuse close_all rather than also set
@@ -285,20 +311,24 @@ module Einhorn
285
311
  # Note that Ruby 1.9's close_others option is useful here.
286
312
  Einhorn::Event.close_all_for_worker
287
313
 
314
+ setup_parent_watch(expected_ppid)
315
+
288
316
  prepare_child_environment(index)
289
317
  Einhorn::Compat.exec(cmd[0], cmd[1..-1], :close_others => false)
290
318
  end
291
319
  end
292
320
 
293
321
  Einhorn.log_info("===> Launched #{pid} (index: #{index})", :upgrade)
322
+ Einhorn::State.last_spinup = Time.now
294
323
  Einhorn::State.children[pid] = {
295
324
  :type => :worker,
296
325
  :version => Einhorn::State.version,
297
326
  :acked => false,
298
327
  :signaled => Set.new,
299
- :index => index
328
+ :last_signaled_at => nil,
329
+ :index => index,
330
+ :spinup_time => Einhorn::State.last_spinup,
300
331
  }
301
- Einhorn::State.last_spinup = Time.now
302
332
 
303
333
  # Set up whatever's needed for ACKing
304
334
  ack_mode = Einhorn::State.ack_mode
@@ -364,9 +394,28 @@ module Einhorn
364
394
  end
365
395
 
366
396
  def self.prepare_child_process
397
+ Process.setpgrp
367
398
  Einhorn.renice_self
368
399
  end
369
400
 
401
+ def self.setup_parent_watch(expected_ppid)
402
+ if Einhorn::State.kill_children_on_exit then
403
+ begin
404
+ # NB: Having the USR2 signal handler set to terminate (the default) at
405
+ # this point is required. If it's set to a ruby handler, there are
406
+ # race conditions that could cause the worker to leak.
407
+
408
+ Einhorn::Prctl.set_pdeathsig("USR2")
409
+ if Process.ppid != expected_ppid then
410
+ Einhorn.log_error("Parent process died before we set pdeathsig; cowardly refusing to exec child process.")
411
+ exit(1)
412
+ end
413
+ rescue NotImplementedError
414
+ # Unsupported OS; silently continue.
415
+ end
416
+ end
417
+ end
418
+
370
419
  # @param options [Hash]
371
420
  #
372
421
  # @option options [Boolean] :smooth (false) Whether to perform a smooth or
@@ -451,6 +500,41 @@ module Einhorn
451
500
  Einhorn.log_info("Have too many workers at the current version, so killing off #{excess.length} of them.")
452
501
  signal_all("USR2", excess)
453
502
  end
503
+
504
+ # Ensure all signaled workers that have outlived signal_timeout get killed.
505
+ kill_expired_signaled_workers if Einhorn::State.signal_timeout
506
+ end
507
+
508
+ def self.kill_expired_signaled_workers
509
+ now = Time.now
510
+ children = Einhorn::State.children.select do |_,c|
511
+ # Only interested in USR2 signaled workers
512
+ next unless c[:signaled] && c[:signaled].length > 0
513
+ next unless c[:signaled].include?('USR2')
514
+
515
+ # Ignore processes that have received KILL since it can't be trapped.
516
+ next if c[:signaled].include?('KILL')
517
+
518
+ # Filter out those children that have not reached signal_timeout yet.
519
+ next unless c[:last_signaled_at]
520
+ expires_at = c[:last_signaled_at] + Einhorn::State.signal_timeout
521
+ next unless now >= expires_at
522
+
523
+ true
524
+ end
525
+
526
+ Einhorn.log_info("#{children.size} expired signaled workers found.") if children.size > 0
527
+ children.each do |pid, child|
528
+ Einhorn.log_info("Child #{pid.inspect} was signaled #{(child[:last_signaled_at] - now).abs.to_i}s ago. Sending SIGKILL as it is still active after #{Einhorn::State.signal_timeout}s timeout.", :upgrade)
529
+ begin
530
+ Process.kill('KILL', pid)
531
+ rescue Errno::ESRCH
532
+ Einhorn.log_debug("Attempted to SIGKILL child #{pid.inspect} but the process does not exist.")
533
+ end
534
+
535
+ child[:signaled].add('KILL')
536
+ child[:last_signaled_at] = Time.now
537
+ end
454
538
  end
455
539
 
456
540
  def self.stop_respawning
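Review note on `kill_expired_signaled_workers` above: the selection keeps only children that were sent `USR2`, have not yet been sent `KILL`, and whose last signal is at least `signal_timeout` seconds old. A minimal sketch of that filter as a pure function, assuming a hypothetical helper `expired_signaled` and a plain hash in place of `Einhorn::State.children`:

```ruby
require 'set'

# Hypothetical helper, not Einhorn's API: return the subset of children
# that are due for escalation to SIGKILL.
def expired_signaled(children, signal_timeout, now)
  children.select do |_pid, c|
    signaled = c[:signaled]
    next false unless signaled && signaled.include?('USR2')
    next false if signaled.include?('KILL')  # already escalated
    next false unless c[:last_signaled_at]
    now >= c[:last_signaled_at] + signal_timeout
  end
end

now = Time.at(1000)
children = {
  1 => { signaled: Set['USR2'], last_signaled_at: Time.at(980) },
  2 => { signaled: Set['USR2'], last_signaled_at: Time.at(995) }
}
puts expired_signaled(children, 10, now).keys.inspect
```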
@@ -478,10 +562,18 @@ module Einhorn
478
562
  missing.times {spinup}
479
563
  end
480
564
 
565
+ # Unbounded exponential backoff is not a thing: we run into problems if
566
+ # e.g., each of our hundred workers simultaneously fail to boot for the same
567
+ # ephemeral reason. Instead cap backoff by some reasonable maximum, so we
568
+ # don't wait until the heat death of the universe to spin up new capacity.
569
+ MAX_SPINUP_INTERVAL = 30.0
570
+
481
571
  def self.replenish_gradually(max_unacked=nil)
482
572
  return if Einhorn::TransientState.has_outstanding_spinup_timer
483
573
  return unless Einhorn::WorkerPool.missing_worker_count > 0
484
574
 
575
+ max_unacked ||= Einhorn::State.config[:max_unacked]
576
+
485
577
  # default to spinning up at most NCPU workers at once
486
578
  unless max_unacked
487
579
  begin
@@ -500,14 +592,12 @@ module Einhorn
500
592
  # Exponentially backoff automated spinup if we're just having
501
593
  # things die before ACKing
502
594
  spinup_interval = Einhorn::State.config[:seconds] * (1.5 ** Einhorn::State.consecutive_deaths_before_ack)
595
+ spinup_interval = [spinup_interval, MAX_SPINUP_INTERVAL].min
503
596
  seconds_ago = (Time.now - Einhorn::State.last_spinup).to_f
504
597
 
505
598
  if seconds_ago > spinup_interval
506
- unacked = Einhorn::WorkerPool.unacked_unsignaled_modern_workers.length
507
- if unacked >= max_unacked
508
- Einhorn.log_debug("There are #{unacked} unacked new workers, and max_unacked is #{max_unacked}, so not spinning up a new process")
509
- else
510
- msg = "Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so spinning up a new process"
599
+ if trigger_spinup?(max_unacked)
600
+ msg = "Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so spinning up a new process."
511
601
 
512
602
  if Einhorn::State.consecutive_deaths_before_ack > 0
513
603
  Einhorn.log_info("#{msg} (there have been #{Einhorn::State.consecutive_deaths_before_ack} consecutive unacked worker deaths)", :upgrade)
@@ -518,7 +608,7 @@ module Einhorn
518
608
  spinup
519
609
  end
520
610
  else
521
- Einhorn.log_debug("Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so not spinning up a new process")
611
+ Einhorn.log_debug("Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so not spinning up a new process.")
522
612
  end
523
613
 
524
614
  Einhorn::TransientState.has_outstanding_spinup_timer = true
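Review note on the backoff change above: the spinup interval still grows by a factor of 1.5 per consecutive unacked death, but is now clamped to `MAX_SPINUP_INTERVAL`. A minimal sketch of the arithmetic, assuming a hypothetical standalone function `spinup_interval` in place of the inline expression:

```ruby
MAX_SPINUP_INTERVAL = 30.0

# Hypothetical helper: exponential backoff capped at MAX_SPINUP_INTERVAL.
def spinup_interval(base_seconds, consecutive_deaths)
  [base_seconds * (1.5**consecutive_deaths), MAX_SPINUP_INTERVAL].min
end

[0, 2, 8, 9, 20].each do |deaths|
  puts "#{deaths} deaths -> #{spinup_interval(1, deaths)}s"
end
```

With a base of one second the cap takes effect at nine consecutive deaths, since 1.5 to the ninth power already exceeds 30.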
@@ -541,5 +631,22 @@ module Einhorn
541
631
  Einhorn.log_info(output) if log
542
632
  output
543
633
  end
634
+
635
+ def self.trigger_spinup?(max_unacked)
636
+ unacked = Einhorn::WorkerPool.unacked_unsignaled_modern_workers.length
637
+ if unacked >= max_unacked
638
+ Einhorn.log_info("There are #{unacked} unacked new workers, and max_unacked is #{max_unacked}, so not spinning up a new process.")
639
+ return false
640
+ elsif Einhorn::State.config[:max_upgrade_additional]
641
+ capacity_exceeded = (Einhorn::State.config[:number] + Einhorn::State.config[:max_upgrade_additional]) - Einhorn::WorkerPool.workers_with_state.length
642
+ if capacity_exceeded < 0
643
+ Einhorn.log_info("Over worker capacity by #{capacity_exceeded.abs} during upgrade, #{Einhorn::WorkerPool.modern_workers.length} new workers of #{Einhorn::WorkerPool.workers_with_state.length} total. Waiting for old workers to exit before spinning up a process.")
644
+
645
+ return false
646
+ end
647
+ end
648
+
649
+ true
650
+ end
544
651
  end
545
652
  end