einhorn 0.7.0 → 0.8.2

@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: f3f9c31b861db9b8b7ab5d2345be06f60223d60e1243bc2618ec5ef1db2b72e5
+   data.tar.gz: 38144bb080c8719b4d164bcbf8d96d498844626a66b9a522c75fb8bab6309c4f
+ SHA512:
+   metadata.gz: 3370ff020a249f5af7be26bfb48392a0b5721d139b895684e37aa92f612bb00d10e48eeb1b95acc15c74448af96c1df039fc841f6ff45ffdf91d6a16bcc614ab
+   data.tar.gz: 3e6b93f1ed82a46a9578dd3dd59cdb0c5c962772d285e59f1e78e531b7577fa4debbc7e3a28a0e4f9cddf053e4481979b37229ce40a6ed9e476dbba6e7985f1e
@@ -1,8 +1,10 @@
  language: ruby
  rvm:
- - 1.8.7
- - 1.9.2
- - 1.9.3
  - 2.0.0
  - 2.1
- - ree
+ - 2.2
+
+ # This is to work around the version of bundler installed in Travis and
+ # https://github.com/bundler/bundler/issues/3558
+ before_install:
+ - gem update bundler
data/README.md CHANGED
@@ -194,6 +194,17 @@ library.
  You can set the name that Einhorn and your workers show in PS. Just
  pass `-c <name>`.

+ ### Re-exec
+
+ You can use the `--reexec-as` option to replace the `einhorn` command with a command or script of your own. This might be useful for those with a Capistrano-like deploy process that has changing symlinks. To ensure that you are following the symlinks, you could use a bash script like this:
+
+     #!/bin/bash
+
+     cd <symlinked directory>
+     exec /usr/local/bin/einhorn "$@"
+
+ Then set `--reexec-as` to the name of your bash script and it will run in place of the plain `einhorn` command.
+
  ### Options

  -b, --bind ADDR Bind an address and add the corresponding FD via the environment
@@ -217,11 +228,18 @@ pass `-c <name>`.
  Unix nice level at which to run the einhorn processes. If not running as root, make sure to ulimit -e as appropriate.
  --with-state-fd STATE [Internal option] With file descriptor containing state
  --upgrade-check [Internal option] Check if Einhorn can exec itself and exit with status 0 before loading code
+ -t, --signal-timeout=T If children do not react to signals after T seconds, escalate to SIGKILL
  --version Show version


  ## Contributing

+ ### Development Status
+
+ Einhorn is still in active use at Stripe, but it is no longer under active
+ development. PRs are very welcome, and we will review and merge them, but we
+ are unlikely to triage and fix reported issues that do not come with code.
+
  Contributions are definitely welcome. To contribute, just follow the
  usual workflow:

@@ -249,10 +267,28 @@ EventMachine-LE to support file-descriptor passing. Check out

  ## Compatibility

- Einhorn was developed and tested under Ruby 1.8.7.
+ Einhorn runs in Ruby 2.0, 2.1, and 2.2.
+
+ The following libraries ease integration with Einhorn for languages other than
+ Ruby:
+
+ - **[go-einhorn](https://github.com/stripe/go-einhorn)**: Stripe's own library
+   for *talking* to an einhorn master (doesn't wrap socket code).
+ - **[goji](https://github.com/zenazn/goji/)**: Go (golang) server framework. The
+   [`bind`](https://godoc.org/github.com/zenazn/goji/bind) and
+   [`graceful`](https://godoc.org/github.com/zenazn/goji/graceful)
+   packages provide helpers and HTTP/TCP connection wrappers for Einhorn
+   integration.
+ - **[github.com/CHH/einhorn](https://github.com/CHH/einhorn)**: PHP library
+ - **[thin-attach\_socket](https://github.com/ConradIrwin/thin-attach_socket)**:
+   run `thin` behind Einhorn
+ - **[baseplate](https://reddit.github.io/baseplate/cli/serve.html)**: a
+   collection of Python helpers and libraries, with support for running behind
+   Einhorn
+
+ *NB: this list should not imply any official endorsement or vetting!*

  ## About

- Einhorn is a project of [Stripe](https://stripe.com), led by [Greg
- Brockman](https://twitter.com/thegdb). Feel free to get in touch at
+ Einhorn is a project of [Stripe](https://stripe.com), led by [Carl Jackson](https://github.com/zenazn). Feel free to get in touch at
  info@stripe.com.
@@ -67,10 +67,28 @@ EventMachine-LE to support file-descriptor passing. Check out

  ## Compatibility

- Einhorn was developed and tested under Ruby 1.8.7.
+ Einhorn runs in Ruby 2.0, 2.1, and 2.2.
+
+ The following libraries ease integration with Einhorn for languages other than
+ Ruby:
+
+ - **[go-einhorn](https://github.com/stripe/go-einhorn)**: Stripe's own library
+   for *talking* to an einhorn master (doesn't wrap socket code).
+ - **[goji](https://github.com/zenazn/goji/)**: Go (golang) server framework. The
+   [`bind`](https://godoc.org/github.com/zenazn/goji/bind) and
+   [`graceful`](https://godoc.org/github.com/zenazn/goji/graceful)
+   packages provide helpers and HTTP/TCP connection wrappers for Einhorn
+   integration.
+ - **[github.com/CHH/einhorn](https://github.com/CHH/einhorn)**: PHP library
+ - **[thin-attach\_socket](https://github.com/ConradIrwin/thin-attach_socket)**:
+   run `thin` behind Einhorn
+ - **[baseplate](https://reddit.github.io/baseplate/cli/serve.html)**: a
+   collection of Python helpers and libraries, with support for running behind
+   Einhorn
+
+ *NB: this list should not imply any official endorsement or vetting!*

  ## About

- Einhorn is a project of [Stripe](https://stripe.com), led by [Greg
- Brockman](https://twitter.com/thegdb). Feel free to get in touch at
+ Einhorn is a project of [Stripe](https://stripe.com), led by [Carl Jackson](https://github.com/zenazn). Feel free to get in touch at
  info@stripe.com.
@@ -266,8 +266,11 @@ if true # $0 == __FILE__
  Einhorn::Command.quieter(false)
  end

- opts.on('-s', '--seconds N', 'Number of seconds to wait until respawning') do |b|
- Einhorn::State.config[:seconds] = s.to_i
+ opts.on('-s', '--seconds N', 'Number of seconds to wait until respawning') do |s|
+ seconds = Float(s)
+ raise ArgumentError, 'seconds must be > 0' if seconds.zero?
+
+ Einhorn::State.config[:seconds] = seconds
  end

  opts.on('-v', '--verbose', 'Make output verbose (can be reconfigured on the fly)') do
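The rewritten `-s/--seconds` handler above fixes a latent bug (the old block bound `|b|` while the body referenced `s`, so the option would raise `NameError`) and switches from `to_i` to `Float()`, which rejects malformed values instead of silently coercing them to 0. A quick illustration of the difference:

```ruby
'2.5'.to_i    # => 2    (truncates)
'abc'.to_i    # => 0    (silently accepts junk)
Float('2.5')  # => 2.5
Float('abc')  # raises ArgumentError (invalid value for Float(): "abc")
```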
@@ -310,6 +313,18 @@ if true # $0 == __FILE__
  Einhorn::State.signal_timeout = Integer(t)
  end

+ opts.on('--max-unacked=N', 'Maximum number of workers that can be unacked when gracefully upgrading.') do |n|
+ Einhorn::State.config[:max_unacked] = Integer(n)
+ end
+
+ opts.on('--max-upgrade-additional=N', 'Maximum number of additional workers that can be running during an upgrade.') do |n|
+ Einhorn::State.config[:max_upgrade_additional] = Integer(n)
+ end
+
+ opts.on('--gc-before-fork', 'Run the GC three times before forking to improve memory sharing for copy-on-write.') do
+ Einhorn::State.config[:gc_before_fork] = true
+ end
+
  opts.on('--version', 'Show version') do
  puts Einhorn::VERSION
  exit
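All three new flags land in `Einhorn::State.config`. A minimal standalone sketch of the same `OptionParser` pattern (the `config` hash here stands in for Einhorn's state and is illustrative only):

```ruby
require 'optparse'

config = {}
OptionParser.new do |opts|
  # Integer(n) raises on malformed input, unlike n.to_i, which silently returns 0.
  opts.on('--max-unacked=N') { |n| config[:max_unacked] = Integer(n) }
  opts.on('--max-upgrade-additional=N') { |n| config[:max_upgrade_additional] = Integer(n) }
  opts.on('--gc-before-fork') { config[:gc_before_fork] = true }
end.parse(['--max-unacked=4', '--gc-before-fork'])

config  # => {:max_unacked=>4, :gc_before_fork=>true}
```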
@@ -21,22 +21,14 @@ module Einhorn
  end

  def send_command(hash)
- begin
- @client.send_command(hash)
- while response = @client.receive_message
- if response.kind_of?(Hash)
- yield response['message']
- return unless response['wait']
- else
- puts "Invalid response type #{response.class}: #{response.inspect}"
- end
+ @client.send_command(hash)
+ while response = @client.receive_message
+ if response.kind_of?(Hash)
+ yield response['message']
+ return unless response['wait']
+ else
+ puts "Invalid response type #{response.class}: #{response.inspect}"
  end
- rescue Errno::EPIPE => e
- emit("einhornsh: Error communicating with Einhorn: #{e} (#{e.class})")
- emit("einhornsh: Attempting to reconnect...")
- reconnect
-
- retry
  end
  end

@@ -15,6 +15,7 @@ Gem::Specification.new do |gem|
  gem.name = 'einhorn'
  gem.require_paths = ['lib']

+ gem.add_development_dependency 'rack', '~> 1.6'
  gem.add_development_dependency 'rake'
  gem.add_development_dependency 'pry'
  gem.add_development_dependency 'minitest', '< 5.0'
@@ -16,4 +16,4 @@ def einhorn_main
  puts "From PID #{$$}: Doing some work"
  sleep 1
  end
- end
+ end
@@ -45,6 +45,7 @@ module Einhorn
  :orig_cmd => nil,
  :bind => [],
  :bind_fds => [],
+ :bound_ports => [],
  :cmd => nil,
  :script_name => nil,
  :respawn => true,
@@ -68,14 +69,9 @@ module Einhorn
  :reexec_commandline => nil,
  :drop_environment_variables => [],
  :signal_timeout => nil,
+ :preloaded => false
  }
  end
-
- def self.dumpable_state
- dump = state
- dump[:reloading_for_preload_upgrade] = dump[:reloading_for_upgrade]
- dump
- end
  end

  module TransientState
@@ -83,7 +79,6 @@ module Einhorn
  def self.default_state
  {
  :whatami => :master,
- :preloaded => false,
  :script_name => nil,
  :argv => [],
  :environ => {},
@@ -110,38 +105,24 @@ module Einhorn
  updated_state = old_state.dup

  # Handle changes in state format updates from previous einhorn versions
- if store == Einhorn::State
- # TODO: Drop this backwards compatibility hack when we hit 0.7
- if updated_state.include?(:reloading_for_preload_upgrade) &&
- !updated_state.include?(:reloading_for_upgrade)
- updated_state[:reloading_for_upgrade] = updated_state.delete(:reloading_for_preload_upgrade)
- message << "upgraded :reloading_for_preload_upgrade to :reloading_for_upgrade"
- end
-
- if updated_state[:children]
- # For a period, we created children entries for state_passers,
- # but we don't want that (in particular, it probably died
- # before we could setup our SIGCHLD handler
- updated_state[:children].delete_if {|k, v| v[:type] == :state_passer}
-
- # Depending on what is passed for --reexec-as, it's possible
- # that the process received a SIGCHLD while something other
- # than einhorn was the active executable. If that happened,
- # einhorn might not know about a dead child, so let's check
- # them all
- dead = []
- updated_state[:children].each do |pid, v|
- begin
- pid = Process.wait(pid, Process::WNOHANG)
- dead << pid if pid
- rescue Errno::ECHILD
- dead << pid
- end
- end
- Einhorn::Event::Timer.open(0) do
- dead.each {|pid| Einhorn::Command.mourn(pid)}
+ if store == Einhorn::State && updated_state[:children]
+ # Depending on what is passed for --reexec-as, it's possible
+ # that the process received a SIGCHLD while something other
+ # than einhorn was the active executable. If that happened,
+ # einhorn might not know about a dead child, so let's check
+ # them all
+ dead = []
+ updated_state[:children].each do |pid, v|
+ begin
+ pid = Process.wait(pid, Process::WNOHANG)
+ dead << pid if pid
+ rescue Errno::ECHILD
+ dead << pid
  end
  end
+ Einhorn::Event::Timer.open(0) do
+ dead.each {|pid| Einhorn::Command.cleanup(pid)}
+ end
  end

  default = store.default_state
@@ -182,20 +163,23 @@ module Einhorn
  end

  Einhorn::TransientState.socket_handles << sd
- sd.fileno
+ [sd.fileno, sd.local_address.ip_port]
  end

  # Implement these ourselves so it plays nicely with state persistence
  def self.log_debug(msg, tag=nil)
  $stderr.puts("#{log_tag} DEBUG: #{msg}\n") if Einhorn::State.verbosity <= 0
+ $stderr.flush
  self.send_tagged_message(tag, msg) if tag
  end
  def self.log_info(msg, tag=nil)
  $stderr.puts("#{log_tag} INFO: #{msg}\n") if Einhorn::State.verbosity <= 1
+ $stderr.flush
  self.send_tagged_message(tag, msg) if tag
  end
  def self.log_error(msg, tag=nil)
  $stderr.puts("#{log_tag} ERROR: #{msg}\n") if Einhorn::State.verbosity <= 2
+ $stderr.flush
  self.send_tagged_message(tag, "ERROR: #{msg}") if tag
  end

@@ -246,6 +230,8 @@ module Einhorn
  set_argv(Einhorn::State.cmd, false)

  begin
+ # Reset preloaded state to false - this allows us to monitor for failed preloads during reloads.
+ Einhorn::State.preloaded = false
  # If it's not going to be requireable, then load it.
  if !path.end_with?('.rb') && File.exists?(path)
  log_info("Loading #{path} (if this hangs, make sure your code can be properly loaded as a library)", :upgrade)
@@ -253,13 +239,15 @@ module Einhorn
  else
  log_info("Requiring #{path} (if this hangs, make sure your code can be properly loaded as a library)", :upgrade)
  require path
+
+ force_move_to_oldgen if Einhorn::State.config[:gc_before_fork]
  end
  rescue Exception => e
  log_info("Proceeding with postload -- could not load #{path}: #{e} (#{e.class})\n #{e.backtrace.join("\n ")}", :upgrade)
  else
  if defined?(einhorn_main)
  log_info("Successfully loaded #{path}", :upgrade)
- Einhorn::TransientState.preloaded = true
+ Einhorn::State.preloaded = true
  else
  log_info("Proceeding with postload -- loaded #{path}, but no einhorn_main method was defined", :upgrade)
  end
@@ -267,6 +255,22 @@ module Einhorn
  end
  end

+ # Make the GC more copy-on-write friendly by forcibly incrementing the generation
+ # counter on all objects to its maximum value. Learn more at: https://github.com/ko1/nakayoshi_fork
+ def self.force_move_to_oldgen
+ log_info("Starting GC to improve copy-on-write memory sharing", :upgrade)
+
+ GC.start
+ 3.times do
+ GC.start(full_mark: false)
+ end
+
+ GC.compact if GC.respond_to?(:compact)
+
+ log_info("Finished GC after preloading", :upgrade)
+ end
+ private_class_method :force_move_to_oldgen
+
  def self.set_argv(cmd, set_ps_name)
  # TODO: clean up this hack
  idx = 0
@@ -324,8 +328,9 @@ module Einhorn

  def self.socketify_env!
  Einhorn::State.bind.each do |host, port, flags|
- fd = bind(host, port, flags)
+ fd, actual_port = bind(host, port, flags)
  Einhorn::State.bind_fds << fd
+ Einhorn::State.bound_ports << actual_port
  end
  end

@@ -339,7 +344,8 @@ module Einhorn
  host = $2
  port = $3
  flags = $4.split(',').select {|flag| flag.length > 0}.map {|flag| flag.downcase}
- fd = (Einhorn::State.sockets[[host, port]] ||= bind(host, port, flags))
+ Einhorn::State.sockets[[host, port]] ||= bind(host, port, flags)[0]
+ fd = Einhorn::State.sockets[[host, port]]
  "#{opt}#{fd}"
  else
  arg
@@ -431,6 +437,14 @@ module Einhorn
  Einhorn::State.reloading_for_upgrade = false
  end

+ # If setting a signal-timeout, timeout the event loop
+ # in the same timeframe, ensuring processes are culled
+ # on a regular basis.
+ if Einhorn::State.signal_timeout
+ Einhorn::Event.default_timeout = Einhorn::Event.default_timeout.nil? ?
+ Einhorn::State.signal_timeout : [Einhorn::State.signal_timeout, Einhorn::Event.default_timeout].min
+ end
+
  while Einhorn::State.respawn || Einhorn::State.children.size > 0
  log_debug("Entering event loop")

@@ -1,5 +1,4 @@
  require 'set'
- require 'uri'
  require 'yaml'

  module Einhorn
@@ -22,12 +21,12 @@ module Einhorn

  def self.serialize_message(message)
  serialized = YAML.dump(message)
- escaped = URI.escape(serialized, "%\n")
+ escaped = serialized.gsub(/%|\n/, '%' => '%25', "\n" => '%0A')
  escaped + "\n"
  end

  def self.deserialize_message(line)
- serialized = URI.unescape(line)
+ serialized = line.gsub(/%(25|0A)/, '%25' => '%', '%0A' => "\n")
  YAML.load(serialized)
  end
  end
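The hand-rolled percent-encoding above replaces `URI.escape`/`URI.unescape`, which were deprecated (and later removed) in newer Rubies. Only `%` and newline need escaping to keep each YAML message on a single line. A quick round trip with an illustrative payload:

```ruby
payload = "---\n:command: ack\n"

escaped = payload.gsub(/%|\n/, '%' => '%25', "\n" => '%0A')
# => "---%0A:command: ack%0A"   (one line; safe to terminate with "\n")

escaped.gsub(/%(25|0A)/, '%25' => '%', '%0A' => "\n") == payload
# => true
```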
@@ -3,6 +3,7 @@ require 'set'
  require 'tmpdir'

  require 'einhorn/command/interface'
+ require 'einhorn/prctl'

  module Einhorn
  module Command
@@ -10,18 +11,16 @@ module Einhorn
  begin
  while true
  Einhorn.log_debug('Going to reap a child process')
-
  pid = Process.wait(-1, Process::WNOHANG)
  return unless pid
- mourn(pid)
+ cleanup(pid)
  Einhorn::Event.break_loop
  end
  rescue Errno::ECHILD
  end
  end

- # Mourn the death of your child
- def self.mourn(pid)
+ def self.cleanup(pid)
  unless spec = Einhorn::State.children[pid]
  Einhorn.log_error("Could not find any config for exited child #{pid.inspect}! This probably indicates a bug in Einhorn.")
  return
@@ -40,11 +39,23 @@ module Einhorn
  case type = spec[:type]
  when :worker
  Einhorn.log_info("===> Exited worker #{pid.inspect}#{extra}", :upgrade)
+ when :state_passer
+ Einhorn.log_debug("===> Exited state passing process #{pid.inspect}", :upgrade)
  else
  Einhorn.log_error("===> Exited process #{pid.inspect} has unrecognized type #{type.inspect}: #{spec.inspect}", :upgrade)
  end
  end

+ def self.register_ping(pid, request_id)
+ unless spec = Einhorn::State.children[pid]
+ Einhorn.log_error("Could not find state for PID #{pid.inspect}; ignoring ACK.")
+ return
+ end
+
+ spec[:pinged_at] = Time.now
+ spec[:pinged_request_id] = request_id
+ end
+
  def self.register_manual_ack(pid)
  ack_mode = Einhorn::State.ack_mode
  unless ack_mode[:type] == :manual
@@ -98,8 +109,8 @@ module Einhorn

  def self.signal_all(signal, children=nil, record=true)
  children ||= Einhorn::WorkerPool.workers
+ signaled = {}

- signaled = []
  Einhorn.log_info("Sending #{signal} to #{children.inspect}", :upgrade)

  children.each do |child|
@@ -113,22 +124,31 @@ module Einhorn
  Einhorn.log_error("Re-sending #{signal} to already-signaled child #{child.inspect}. It may be slow to spin down, or it may be swallowing #{signal}s.", :upgrade)
  end
  spec[:signaled].add(signal)
+ spec[:last_signaled_at] = Time.now
  end

  begin
  Process.kill(signal, child)
  rescue Errno::ESRCH
+ Einhorn.log_debug("Attempted to #{signal} child #{child.inspect} but the process does not exist", :upgrade)
  else
- signaled << child
+ signaled[child] = spec
  end
  end

- if Einhorn::State.signal_timeout
+ if Einhorn::State.signal_timeout && record
  Einhorn::Event::Timer.open(Einhorn::State.signal_timeout) do
  children.each do |child|
- next unless spec = Einhorn::State.children[child]
+ spec = Einhorn::State.children[child]
+ next unless spec # Process is already dead and removed by cleanup
+ signaled_spec = signaled[child]
+ next unless signaled_spec # We got ESRCH when trying to signal
+ if spec[:spinup_time] != signaled_spec[:spinup_time]
+ Einhorn.log_info("Different spinup time recorded for #{child} after #{Einhorn::State.signal_timeout}s. This probably indicates a PID rollover.", :upgrade)
+ next
+ end

- Einhorn.log_info("Child #{child.inspect} is still active after #{Einhorn::State.signal_timeout}. Sending SIGKILL.")
+ Einhorn.log_info("Child #{child.inspect} is still active after #{Einhorn::State.signal_timeout}s. Sending SIGKILL.")
  begin
  Process.kill('KILL', child)
  rescue Errno::ESRCH
@@ -136,11 +156,12 @@ module Einhorn
  spec[:signaled].add('KILL')
  end
  end
- end

- "Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.inspect}"
+ Einhorn.log_info("Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}")
+ end
  end
  end
+
  def self.increment
  Einhorn::Event.break_loop
  old = Einhorn::State.config[:number]
@@ -211,6 +232,7 @@ module Einhorn

  fork do
  Einhorn::TransientState.whatami = :state_passer
+ Einhorn::State.children[Process.pid] = {type: :state_passer}
  Einhorn::State.generation += 1
  read.close

@@ -256,7 +278,8 @@ module Einhorn
  def self.spinup(cmd=nil)
  cmd ||= Einhorn::State.cmd
  index = next_index
- if Einhorn::TransientState.preloaded
+ expected_ppid = Process.pid
+ if Einhorn::State.preloaded
  pid = fork do
  Einhorn::TransientState.whatami = :worker
  prepare_child_process
@@ -268,6 +291,8 @@ module Einhorn

  reseed_random

+ setup_parent_watch(expected_ppid)
+
  prepare_child_environment(index)
  einhorn_main
  end
@@ -277,6 +302,7 @@ module Einhorn
  prepare_child_process

  Einhorn.log_info("About to exec #{cmd.inspect}")
+ Einhorn::Command::Interface.uninit
  # Here's the only case where cloexec would help. Since we
  # have to track and manually close FDs for other cases, we
  # may as well just reuse close_all rather than also set
@@ -285,20 +311,24 @@ module Einhorn
  # Note that Ruby 1.9's close_others option is useful here.
  Einhorn::Event.close_all_for_worker

+ setup_parent_watch(expected_ppid)
+
  prepare_child_environment(index)
  Einhorn::Compat.exec(cmd[0], cmd[1..-1], :close_others => false)
  end
  end

  Einhorn.log_info("===> Launched #{pid} (index: #{index})", :upgrade)
+ Einhorn::State.last_spinup = Time.now
  Einhorn::State.children[pid] = {
  :type => :worker,
  :version => Einhorn::State.version,
  :acked => false,
  :signaled => Set.new,
- :index => index
+ :last_signaled_at => nil,
+ :index => index,
+ :spinup_time => Einhorn::State.last_spinup,
  }
- Einhorn::State.last_spinup = Time.now

  # Set up whatever's needed for ACKing
  ack_mode = Einhorn::State.ack_mode
@@ -364,9 +394,28 @@ module Einhorn
  end

  def self.prepare_child_process
+ Process.setpgrp
  Einhorn.renice_self
  end

+ def self.setup_parent_watch(expected_ppid)
+ if Einhorn::State.kill_children_on_exit then
+ begin
+ # NB: Having the USR2 signal handler set to terminate (the default) at
+ # this point is required. If it's set to a ruby handler, there are
+ # race conditions that could cause the worker to leak.
+
+ Einhorn::Prctl.set_pdeathsig("USR2")
+ if Process.ppid != expected_ppid then
+ Einhorn.log_error("Parent process died before we set pdeathsig; cowardly refusing to exec child process.")
+ exit(1)
+ end
+ rescue NotImplementedError
+ # Unsupported OS; silently continue.
+ end
+ end
+ end
+
  # @param options [Hash]
  #
  # @option options [Boolean] :smooth (false) Whether to perform a smooth or
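`Einhorn::Prctl.set_pdeathsig` asks the kernel to signal the worker when its parent dies, which is what makes the `kill_children_on_exit` behaviour robust even if the master exits abruptly. On Linux this is the `prctl(PR_SET_PDEATHSIG, sig)` syscall. A standalone sketch of the idea using stdlib Fiddle (Linux-only, and not Einhorn's actual implementation in `einhorn/prctl`):

```ruby
require 'fiddle'

PR_SET_PDEATHSIG = 1 # from <sys/prctl.h>

parent_pid = Process.pid
fork do
  prctl = Fiddle::Function.new(
    Fiddle::Handle::DEFAULT['prctl'],
    [Fiddle::TYPE_INT, Fiddle::TYPE_LONG],
    Fiddle::TYPE_INT
  )
  # Ask the kernel to deliver SIGUSR2 to this process when its parent exits.
  prctl.call(PR_SET_PDEATHSIG, Signal.list.fetch('USR2'))

  # pdeathsig only covers deaths that happen *after* the call, so re-check:
  # if the parent is already gone we were reparented and will never be signaled.
  exit(1) if Process.ppid != parent_pid

  sleep # worker body would go here
end
```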
@@ -451,6 +500,41 @@ module Einhorn
  Einhorn.log_info("Have too many workers at the current version, so killing off #{excess.length} of them.")
  signal_all("USR2", excess)
  end
+
+ # Ensure all signaled workers that have outlived signal_timeout get killed.
+ kill_expired_signaled_workers if Einhorn::State.signal_timeout
+ end
+
+ def self.kill_expired_signaled_workers
+ now = Time.now
+ children = Einhorn::State.children.select do |_,c|
+ # Only interested in USR2 signaled workers
+ next unless c[:signaled] && c[:signaled].length > 0
+ next unless c[:signaled].include?('USR2')
+
+ # Ignore processes that have received KILL since it can't be trapped.
+ next if c[:signaled].include?('KILL')
+
+ # Filter out those children that have not reached signal_timeout yet.
+ next unless c[:last_signaled_at]
+ expires_at = c[:last_signaled_at] + Einhorn::State.signal_timeout
+ next unless now >= expires_at
+
+ true
+ end
+
+ Einhorn.log_info("#{children.size} expired signaled workers found.") if children.size > 0
+ children.each do |pid, child|
+ Einhorn.log_info("Child #{pid.inspect} was signaled #{(child[:last_signaled_at] - now).abs.to_i}s ago. Sending SIGKILL as it is still active after #{Einhorn::State.signal_timeout}s timeout.", :upgrade)
+ begin
+ Process.kill('KILL', pid)
+ rescue Errno::ESRCH
+ Einhorn.log_debug("Attempted to SIGKILL child #{pid.inspect} but the process does not exist.")
+ end
+
+ child[:signaled].add('KILL')
+ child[:last_signaled_at] = Time.now
+ end
  end

  def self.stop_respawning
@@ -478,10 +562,18 @@ module Einhorn
  missing.times {spinup}
  end

+ # Unbounded exponential backoff is not a thing: we run into problems if
+ # e.g., each of our hundred workers simultaneously fail to boot for the same
+ # ephemeral reason. Instead cap backoff by some reasonable maximum, so we
+ # don't wait until the heat death of the universe to spin up new capacity.
+ MAX_SPINUP_INTERVAL = 30.0
+
  def self.replenish_gradually(max_unacked=nil)
  return if Einhorn::TransientState.has_outstanding_spinup_timer
  return unless Einhorn::WorkerPool.missing_worker_count > 0

+ max_unacked ||= Einhorn::State.config[:max_unacked]
+
  # default to spinning up at most NCPU workers at once
  unless max_unacked
  begin
@@ -500,14 +592,12 @@ module Einhorn
  # Exponentially backoff automated spinup if we're just having
  # things die before ACKing
  spinup_interval = Einhorn::State.config[:seconds] * (1.5 ** Einhorn::State.consecutive_deaths_before_ack)
+ spinup_interval = [spinup_interval, MAX_SPINUP_INTERVAL].min
  seconds_ago = (Time.now - Einhorn::State.last_spinup).to_f

  if seconds_ago > spinup_interval
- unacked = Einhorn::WorkerPool.unacked_unsignaled_modern_workers.length
- if unacked >= max_unacked
- Einhorn.log_debug("There are #{unacked} unacked new workers, and max_unacked is #{max_unacked}, so not spinning up a new process")
- else
- msg = "Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so spinning up a new process"
+ if trigger_spinup?(max_unacked)
+ msg = "Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so spinning up a new process."

  if Einhorn::State.consecutive_deaths_before_ack > 0
@@ -518,7 +608,7 @@ module Einhorn
  spinup
  end
  else
- Einhorn.log_debug("Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so not spinning up a new process")
+ Einhorn.log_debug("Last spinup was #{seconds_ago}s ago, and spinup_interval is #{spinup_interval}s, so not spinning up a new process.")
  end

  Einhorn::TransientState.has_outstanding_spinup_timer = true
@@ -541,5 +631,22 @@ module Einhorn
  Einhorn.log_info(output) if log
  output
  end
+
+ def self.trigger_spinup?(max_unacked)
+ unacked = Einhorn::WorkerPool.unacked_unsignaled_modern_workers.length
+ if unacked >= max_unacked
+ Einhorn.log_info("There are #{unacked} unacked new workers, and max_unacked is #{max_unacked}, so not spinning up a new process.")
+ return false
+ elsif Einhorn::State.config[:max_upgrade_additional]
+ capacity_exceeded = (Einhorn::State.config[:number] + Einhorn::State.config[:max_upgrade_additional]) - Einhorn::WorkerPool.workers_with_state.length
+ if capacity_exceeded < 0
+ Einhorn.log_info("Over worker capacity by #{capacity_exceeded.abs} during upgrade, #{Einhorn::WorkerPool.modern_workers.length} new workers of #{Einhorn::WorkerPool.workers_with_state.length} total. Waiting for old workers to exit before spinning up a process.")
+
+ return false
+ end
+ end
+
+ true
+ end
  end
  end
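`trigger_spinup?` combines the existing `max_unacked` limit with the new `--max-upgrade-additional` budget, which bounds how far the live worker count may overshoot the target while old and new workers coexist during an upgrade. A worked example of the capacity arithmetic (all numbers illustrative):

```ruby
number                 = 10 # target worker count
max_upgrade_additional = 2  # extra workers tolerated during an upgrade

# capacity_exceeded = (target + headroom) - workers currently alive
(number + max_upgrade_additional) - 12 # => 0   still within budget, OK to spin up
(number + max_upgrade_additional) - 13 # => -1  over budget; wait for old workers to exit
```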