daemon_controller 0.2.2

data/LICENSE.txt

Copyright (c) 2008 Phusion

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

data/README.markdown

Introduction
============

*daemon_controller* is a library for starting and stopping specific daemons
programmatically in a robust, race-condition-free manner.

It's not a daemon monitoring system like God or Monit. It's also not a library
for writing daemons.

It provides the following functionality:

* Starting daemons. If the daemon fails to start then an exception will be
  raised. *daemon_controller* can even detect failures that occur after the
  daemon has already daemonized.

  Starting daemons is done in a race-condition-free manner. If another
  process using *daemon_controller* is trying to start the same daemon,
  then *daemon_controller* will guarantee serialization.

  *daemon_controller* also raises an exception if it detects that the daemon
  is already started.
* Connecting to a daemon, starting it if it's not already started. This too
  is done in a race-condition-free manner. If the daemon fails to start then
  an exception will be raised.
* Stopping daemons.
* Checking whether a daemon is running.


## What is it for?

There is a lot of software (both Rails related and unrelated) that relies on
servers or daemons. To name a few, in no particular order:

* [Ultrasphinx](http://blog.evanweaver.com/files/doc/fauna/ultrasphinx/), a
  Rails library for full-text searching. It makes use of the [Sphinx search
  software](http://www.sphinxsearch.com/) for indexing and searching. Indexing
  is done by running a command, while searching is done by querying the Sphinx
  search server.
* [acts_as_ferret](http://projects.jkraemer.net/acts_as_ferret/wiki), another
  Rails library for full-text searching. It uses the Ferret search software.
  In production environments, it relies on the Ferret DRb server for both
  searching and indexing.
* [BackgrounDRb](http://backgroundrb.rubyforge.org/), a Ruby job server and
  scheduler. Scheduling is done by contacting the BackgrounDRb daemon.
* [mongrel_cluster](http://mongrel.rubyforge.org/wiki/MongrelCluster), which
  starts and stops multiple Mongrel daemons.

Relying on daemons is quite common, but not without problems. Let's go over
some of them.

### Starting daemons is a hassle

If you've used similar software, then you might agree that managing these
daemons is a hassle. If you're using BackgrounDRb, then the daemon must be
running. Starting the daemon is not hard, but it is annoying. It's also
possible that the system administrator forgets to start the daemon. While
configuring the system to automatically start a daemon at startup is not hard,
it is an extra thing to do, and thus a hassle. We thought, why can't such
daemons be automatically started? Of course, this isn't possible if the daemon
runs on a remote machine. But in the vast majority of use cases, the
daemon runs on the same host as the Rails application. If a Rails application -
or indeed, *any* application - is configured to contact a daemon on the
local host, then why not start the daemon automatically on demand?

### Daemon starting code may not be robust or efficient

We've also observed that people write daemon controlling code over and over
again. Consider for example UltraSphinx, which provides a
`rake sphinx:daemon:start` Rake task to start the daemon. The time that a
daemon needs to initialize is variable, and depends on things such as the
current system load. The Sphinx daemon usually needs less than a second before
we can connect to it. However, different software handles daemon startup in
different ways. We've observed that waiting a fixed amount of time is by far the
most common approach. For example, UltraSphinx's daemon starting code looks like
this:

    system "searchd --config '#{Ultrasphinx::CONF_PATH}'"
    sleep(4) # give daemon a chance to write the pid file
    if ultrasphinx_daemon_running?
      say "started successfully"
    else
      say "failed to start"
    end

This is in no way a slam against UltraSphinx. However, if the daemon starts in
200 milliseconds, then the user who issued the start command will be waiting
3.8 seconds for no good reason. This is not good for usability or for the
user's patience.
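
A fixed `sleep` can be replaced by polling with a deadline: check for readiness every few milliseconds and stop as soon as the check succeeds or the timeout expires. The sketch below is plain Ruby and is not daemon_controller's implementation. The helper name `wait_until` and its `timeout`/`interval` parameters are hypothetical, and a real starter would check something like the PID file or a TCP connection instead of a clock.

```ruby
# Hypothetical helper (not part of daemon_controller): poll a condition until it
# holds or the timeout expires. Returns true as soon as the block returns a
# truthy value, or false if the deadline passes first.
def wait_until(timeout, interval = 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep(interval)
  end
end

# Simulated daemon that becomes ready about 200 milliseconds from now.
ready_at = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 0.2
if wait_until(4) { Process.clock_gettime(Process::CLOCK_MONOTONIC) >= ready_at }
  puts "started successfully"
else
  puts "failed to start"
end
```

With a 200 ms startup, this returns after roughly 200 ms instead of the full four seconds. The timeout only comes into play when the daemon is actually slow or broken.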

### Startup error handling

Different software handles daemon startup errors in different ways. Some might
not even handle errors at all. For example, consider 'mongrel_cluster'. If
there's a typo in one of your application source files, then 'mongrel_cluster'
will not report the error. Instead, you have to check its log files to see what
happened. This is not good for usability: many people will be wondering why
they can't connect to their Mongrel ports after issuing a
`mongrel_rails cluster::start` - until they realize that they should read the
log file. But the thing is, not everybody realizes this. And typing in an extra
command to read the log file to check whether Mongrel started correctly is
just a big hassle. Why can't the daemon startup code report such errors
immediately?

### Stale or corrupt PID files

Suppose that you're running a Mongrel cluster, and your server suddenly powers
off because of a power outage. When the server is online again, it fails to
start your Mongrel cluster because the PID file that it had written still
exists, and wasn't cleaned up properly (it's supposed to be cleaned up when
Mongrel exits). mongrel_cluster provides the `--clean` option to check whether
the PID file is *stale*, and will automatically clean it up if it is. But not
all daemon controlling software supports this. Why can't all software check for
stale PID files automatically?


## Implementation issues

From the problem descriptions, it should be apparent that our wishlist is as
follows. Why is this wishlist often not implemented? Let's go over each item.

- **A daemon should be automatically started on demand, instead of requiring the user to manually start it.**

  The most obvious problems are related to concurrency. Suppose that your web
  application has a search box, and you want to start the search daemon if it
  isn't already started, then connect to it. Two problems will arise:

  * Suppose that Rails process A is still starting the daemon. At the same
    time, another visitor tries to search something, and Rails process B
    notices that the daemon is not running. If B tries to start the daemon
    while it's already being started by A, then things can go wrong.
    *A robust daemon starter must ensure that only one process at a time may start the daemon.*
  * It's not a good idea to wait a fixed amount of time for the daemon to
    start, because you don't know in advance how long it will take for it to
    start. For example, if you wait 2 seconds, then try to connect to the
    daemon, and the daemon isn't done initializing yet, then it will seem as
    if the daemon failed to start.

  These are the most probable reasons why people don't try to write
  auto-starting code, and instead require the user to start the daemon
  manually.

  These problems, as well as several less obvious problems, are closely
  related to the next few points.

- **The daemon starter must wait until the daemon is done initializing, no longer and no shorter**

  Only after the daemon is fully initialized is it safe to connect to it,
  and the user should not have to wait any longer than necessary. During
  startup, the daemon must therefore be checked continuously to see whether
  it's done initializing or whether an error occurred. Writing this code can
  be quite a hassle, which is why most people don't do it.

- **The daemon starter must report any startup errors**

  If the daemon starting command - e.g. `searchd -c config_file.conf`,
  `apachectl start` or `mongrel_rails cluster::start` - reports startup
  errors, then all is fine as long as the user is running the command from a
  terminal. A problem arises when the error occurs after the daemon has
  already gone into the background. Such errors are only reported to the log
  file.
  *The daemon starter should also check the log file for any startup errors.*

  Furthermore, it should be able to raise startup errors as exceptions. This
  allows the application to decide what to do with the error. For less
  experienced system administrators, the error might be displayed in the
  browser, allowing the administrators to become aware of the problem without
  forcing them to manually check the log files. Or the error might be emailed
  to a system administrator.

- **The daemon starter must be able to correct stale or corrupted PID files**

  If the PID file is stale, or for some reason has been corrupted, then the
  daemon starter must be able to cope with that.
  *It should check whether the PID file contains a valid PID, and whether the PID exists.*
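
The PID check in the last item can be sketched with the POSIX convention that sending signal 0 to a process tests whether it exists without affecting it. Ruby raises `Errno::ESRCH` if no such process exists and `Errno::EPERM` if it exists but belongs to another user. This is a minimal sketch, not daemon_controller's code, and `running_pid` is a hypothetical name.

```ruby
# Hypothetical helper (not part of daemon_controller): returns the PID from
# pid_file if that process exists, or nil if the file is missing, corrupt
# (not a positive integer) or stale (no such process).
def running_pid(pid_file)
  content = File.read(pid_file).strip
  return nil unless content =~ /\A\d+\z/   # corrupt: not a plain integer
  pid = content.to_i
  return nil if pid <= 0                   # "0" is not a valid daemon PID
  Process.kill(0, pid)                     # signal 0 only checks for existence
  pid
rescue Errno::ENOENT, Errno::ESRCH, RangeError
  nil   # no PID file, stale PID, or a number too large to be a PID
rescue Errno::EPERM
  pid   # the process exists but belongs to another user
end
```

A starter could delete the PID file whenever this returns nil and the file exists, instead of refusing to start. A PID can also be reused by an unrelated process after a reboot, so this check reduces the risk of acting on a stale file rather than eliminating it.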


Introducing daemon_controller
=============================

*daemon_controller* is a library for managing daemons in a robust manner. It is
not a standalone tool for managing daemons. Rather, it is a library which lets
you write applications that manage daemons in a robust manner. For example,
'mongrel_cluster' or UltraSphinx may be adapted to utilize this library, for
more robust daemon management.

*daemon_controller* implements all items in the aforementioned wishlist. It
provides the following functionality:

### Starting a daemon

This ensures that no two processes can start the same daemon at the same time.
It also reports any startup errors, even errors that occur after the
daemon has already gone into the background but before it has fully
initialized. It also allows you to set a timeout, and will try to abort the
daemon if it takes too long to initialize.

The start function won't return until the daemon has been fully initialized,
and is responding to connections. So if the start function has returned, then
the daemon is guaranteed to be usable.

### Stopping a daemon

It stops the daemon, but only if it's running. Any errors are reported. If
the daemon isn't running, then it will silently succeed. Just like starting a
daemon, you can set a timeout for stopping the daemon.

Like the start function, the stop function won't return until the daemon is no
longer running. This makes it safe to immediately start the same daemon again
after having stopped it, without worrying that the previous daemon instance
hasn't exited yet and might conflict with the newly started daemon instance.
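
Waiting for a stopped daemon to really exit can be done by sending `SIGTERM` and then polling with signal 0 until the process no longer exists. The sketch below is one possible plain-Ruby approach, not daemon_controller's implementation, and `stop_and_wait` and its parameters are hypothetical. It uses a throwaway `sleep 30` child as a stand-in daemon. Because that child is our own process, it must be reaped (here with `Process.detach`). Otherwise it lingers as a zombie that signal 0 still reports as existing. A real daemon has already been reparented, so this concern does not apply to it.

```ruby
# Hypothetical helper (not part of daemon_controller): send SIGTERM, then wait
# until the process has really exited. Returns true once it is gone, or false
# if it is still alive when the timeout expires (a caller might then escalate
# to SIGKILL).
def stop_and_wait(pid, timeout = 5, interval = 0.05)
  begin
    Process.kill("TERM", pid)
  rescue Errno::ESRCH
    return true                      # not running: stopping silently succeeds
  end
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      Process.kill(0, pid)           # raises Errno::ESRCH once the process is gone
    rescue Errno::ESRCH
      return true
    end
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep(interval)
  end
end

# Stand-in "daemon": a child process that would otherwise sleep for 30 seconds.
pid = Process.spawn("sleep", "30")
Process.detach(pid)   # reap it when it exits, so it does not linger as a zombie
puts stop_and_wait(pid) ? "stopped" : "still running after the timeout"
```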

### Connecting to a daemon, starting it if it isn't running

Every daemon has to be connected to in its own way. As a developer, you
tell 'daemon_controller' how to connect to the daemon. It will then attempt to
do that, and if that fails, it will check whether the daemon is running. If it
isn't running, then it will automatically start the daemon, and attempt to
connect to the daemon again. Failures are reported.
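
The connect-then-start flow described above might look roughly like this. It is a hypothetical sketch, not `DaemonController#connect` itself. The method names and the `connect:`, `running:` and `start:` keyword parameters are invented for illustration, `start` is assumed to return only once the daemon accepts connections, and a real implementation would also serialize the start step across processes. A plain `TCPServer` on a free local port stands in for the daemon.

```ruby
require 'socket'

# Hypothetical helper: run the caller's connect proc, treating a nil result or
# "connection refused" as "not connected".
def attempt_connect(connect)
  connect.call
rescue Errno::ECONNREFUSED
  nil
end

# Hypothetical sketch: try to connect; if that fails, start the daemon unless
# it is already running, then try once more and report failure as an exception.
def connect_or_start(connect:, running:, start:)
  connection = attempt_connect(connect)
  return connection if connection
  start.call unless running.call    # a real implementation must serialize this
  connection = attempt_connect(connect)
  raise "daemon was started, but connecting to it still failed" unless connection
  connection
end

# Stand-in daemon: a TCPServer bound to a port that was free a moment ago.
probe = TCPServer.new("127.0.0.1", 0)
port = probe.addr[1]
probe.close
server = nil
socket = connect_or_start(
  connect: -> { TCPSocket.new("127.0.0.1", port) },
  running: -> { !server.nil? },
  start:   -> { server = TCPServer.new("127.0.0.1", port) }
)
puts "connected to port #{port}"
socket.close
server.close
```

Treating `nil` and `Errno::ECONNREFUSED` alike matches the behavior this README describes for `DaemonController#connect` in Tutorial #2.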

### Checking whether a daemon is running

This information is retrieved from the PID file. It also checks whether the PID
file is stale.

### All failures are reported via exceptions

So that you can determine exactly how you want to handle errors.

### Lots and lots of error checking

So that there are very few ways in which the system can screw up.

daemon_controller's goal is to make daemon management less of a hassle, and as
automatic and straightforward as possible.


What about Monit/God?
=====================

daemon_controller is not a replacement for [Monit](http://www.tildeslash.com/monit/)
or [God](http://god.rubyforge.org/). Rather, it is a solution to the following
problem:

> **Hongli:** hey Ninh, do a 'git pull', I just implemented awesome searching
> features in our application!
>
> **Ninh:** cool. *pulls from repository*
>
> **Ninh:** hey Hongli, it doesn't work.
>
> **Hongli:** what do you mean, it doesn't work?
>
> **Ninh:** it says "connection refused", or something
>
> **Hongli:** oh I forgot to mention it, you have to run the Sphinx search
> daemon before it works. type "rake sphinx:daemon:start" to do
> that
>
> **Ninh:** great. but now I get a different error. something about
> BackgrounDRb.
>
> **Hongli:** oops, I forgot to mention this too. you need to start the
> BackgrounDRb server with "rake backgroundrb:start_server"
>
> **Ninh:** okay, so every time I want to use this app, I have to type
> "rake sphinx:daemon:start", "rake backgroundrb:start_server" and
> "./script/server"?
>
> **Hongli:** yep

Imagine the above conversation becoming just:

> **Hongli:** hey Ninh, do a 'git pull', I just implemented awesome searching
> features in our application!
>
> **Ninh:** cool. *pulls from repository*
>
> **Ninh:** awesome, it works!

This is not something that can be achieved with Monit/God. Monit/God are for
monitoring daemons and auto-restarting them when they use too many resources.
daemon_controller's goal is to allow developers to implement robust daemon
starting/stopping and auto-starting code. daemon_controller is intended to be
used to make daemon-dependent applications Just Work(tm) without having to
start the daemons manually.


Tutorial #1: controlling Apache
===============================

Suppose that you're a [Phusion Passenger](http://www.modrails.com/) developer,
and you need to write tests for the Apache module. In particular, you want to
test whether the different Phusion Passenger configuration directives are
working as expected. Obviously, to test the Apache module, the Apache web
server must be running. For every test, you will want the unit test suite to:

1. Write an Apache configuration file, with the relevant configuration
   directive set to a specific value.
2. Start Apache.
3. Send an HTTP request to Apache and check whether the HTTP response matches
   your expectations.
4. Stop Apache.

That can be done with the following code:

    require 'socket'
    require 'daemon_controller'

    File.open("apache.conf", "w") do |f|
      f.write("PidFile apache.pid\n")
      f.write("ErrorLog apache.log\n")
      f.write("Listen 1234\n")
      # ... write other relevant configuration options here ...
    end

    controller = DaemonController.new(
      :identifier    => 'Apache web server',
      :start_command => 'apachectl -f apache.conf -k start',
      :ping_command  => lambda { TCPSocket.new('localhost', 1234) },
      :pid_file      => 'apache.pid',
      :log_file      => 'apache.log',
      :timeout       => 25
    )
    controller.start

    # Apache is now started.
    # ... some test code here ...

    controller.stop

The `File.open` block is obvious: it writes the relevant Apache configuration
file.

The next statement creates a new DaemonController object. We pass a
human-readable identifier for this daemon ("Apache web server") to the
constructor. This is used for generating friendlier error messages.
We also tell it how Apache is supposed to be started (`:start_command`), how to
check whether it can be connected to (`:ping_command`), and where its PID file
and log file are. If Apache fails with an error during startup, then the error
will be reported. If Apache fails with an error after it has gone into the
background, then that will be reported too: the given log file is monitored for
new error messages.
Finally, a timeout of 25 seconds is given. If Apache doesn't start within 25
seconds, then an exception will be raised.
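
One simple way to notice errors that a daemon writes to its log after going into the background is to remember where the log file ended before starting the daemon, and later read only what was appended. The `LogWatcher` class below is a hypothetical sketch of that idea, not daemon_controller's own implementation. Deciding which of the new lines count as errors is left to the caller, and the log line in the example is made up.

```ruby
require 'tmpdir'

# Hypothetical helper: remembers the log file's size when it is created and
# later returns only the text appended after that point.
class LogWatcher
  def initialize(path)
    @path = path
    @offset = File.exist?(path) ? File.size(path) : 0
  end

  # Text written to the log since this watcher was created ("" if none).
  def new_output
    return "" unless File.exist?(@path)
    File.read(@path, nil, @offset) || ""
  end
end

Dir.mktmpdir do |dir|
  log = File.join(dir, "apache.log")
  File.write(log, "message left over from a previous run\n")
  watcher = LogWatcher.new(log)                    # taken just before starting
  File.open(log, "a") { |f| f.puts "made-up startup error" }
  puts watcher.new_output                          # only the newly appended line
end
```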

The ping command is just a `Proc` whose return value is treated as true or
false: in this example, a successful connection returns a socket object, which
counts as true. If the Proc raises `Errno::ECONNREFUSED`, then DaemonController
also interprets that as meaning that the daemon isn't responding yet.
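
The ping semantics described above, including the automatic closing of a returned socket that this tutorial mentions, can be summarized in a few lines of plain Ruby. This is a hypothetical sketch for illustration (`responding?` is not a daemon_controller method). A local `TCPServer` on a free port stands in for Apache.

```ruby
require 'socket'

# Hypothetical helper (not daemon_controller's API): run a ping proc and decide
# whether the daemon is responding. A truthy result means "yes"; nil, false or
# Errno::ECONNREFUSED mean "not yet". A result that responds to #close, such as
# a socket, is closed immediately so it does not leak.
def responding?(ping_command)
  result = ping_command.call
  result.close if result.respond_to?(:close)
  result ? true : false
rescue Errno::ECONNREFUSED
  false
end

server = TCPServer.new("127.0.0.1", 0)     # stand-in daemon on a free port
port = server.addr[1]
ping = lambda { TCPSocket.new("127.0.0.1", port) }
puts responding?(ping)                      # prints true: the server is listening
server.close
puts responding?(ping)                      # prints false: connection refused
```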

After `controller.start` has returned, we can continue with the test case. At
this point, we know that Apache has finished initializing.
When we're done with Apache, we stop it with `controller.stop`. This does not
return until Apache has fully stopped.

The cautious reader might notice that the socket returned by the ping command
is never closed. That is intentional: DaemonController closes it automatically
for us if it notices that the ping command proc's return value responds to
`#close`.

From this example, it becomes apparent that for daemon_controller to work, you
must know how to start the daemon, how to contact it, and where it will put its
PID file and log file.


Tutorial #2: Sphinx indexing and search server management
=========================================================

We at Phusion are currently developing a web application with full-text search
capabilities, and we're using Sphinx for this purpose. We want to make the
lives of our developers and our system administrators as easy as possible,
leaving little room for human screw-up, and that is why we've developed this
library. Our Sphinx search daemon is completely managed through this library
and is automatically started on demand.

Our Sphinx config file is generated from an ERB template. This ERB template
writes different values into the config file, depending on whether we're in
development, test or production mode. We want to regenerate this config
file every time, just before we start the search daemon.
But there's more. The search daemon will fail if there is no search index. If a
new developer has just checked out the application's source code, then there is
no search index yet. We don't want them to go through the pain of having to
generate the index manually. (That said, it isn't that much of a pain, but it's
just yet-another-thing to do, which can and should be automated.) So before
starting the daemon, we also want to check whether the index exists. If
not, then we'll generate it, and then start the daemon. Of course, no two Rails
processes may generate the config file or the index at the same time.
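
Generating a per-environment config file from an ERB template can be sketched with Ruby's standard `erb` library. The settings hash, port numbers and template below are made-up placeholders rather than Phusion's actual configuration. Check the directive names (`listen`, `pid_file`, `log`) against the Sphinx version you use.

```ruby
require 'erb'

# Hypothetical per-environment settings; the port numbers are placeholders.
SPHINX_SETTINGS = {
  "development" => { port: 1234 },
  "test"        => { port: 1235 },
  "production"  => { port: 1234 }
}

# Illustrative searchd section; directive names may differ between Sphinx versions.
SPHINX_TEMPLATE = <<~'CONF'
  searchd
  {
    listen = <%= settings[:port] %>
    pid_file = tmp/pids/sphinx.pid
    log = log/sphinx.<%= environment %>.log
  }
CONF

# Renders the config for one environment; raises KeyError for unknown ones.
def render_sphinx_config(environment)
  settings = SPHINX_SETTINGS.fetch(environment)
  ERB.new(SPHINX_TEMPLATE).result(binding)
end

puts render_sphinx_config("test")
```

A `before_start` hook could write this string to `config/sphinx.conf` just before `searchd` is launched, so the file always matches the current environment.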

When querying the search server, we want to automatically start it if it
isn't running.

This can be achieved with the following code:

    require 'socket'
    require 'daemon_controller'

    class SearchServer
      SEARCH_SERVER_PORT = 1234

      def initialize
        @controller = DaemonController.new(
          :identifier    => 'Sphinx search server',
          :start_command => "searchd -c config/sphinx.conf",
          :before_start  => method(:before_start),
          :ping_command  => lambda { TCPSocket.new('localhost', SEARCH_SERVER_PORT) },
          :pid_file      => 'tmp/pids/sphinx.pid',
          :log_file      => 'log/sphinx.log')
      end

      def query(search_terms)
        socket = @controller.connect do
          TCPSocket.new('localhost', SEARCH_SERVER_PORT)
        end
        send_query(socket, search_terms)
        retrieve_results(socket)
      end

      private
        def before_start
          generate_configuration_file
          generate_index unless index_exists?
        end

        # send_query, retrieve_results, generate_configuration_file,
        # index_exists? and generate_index are application-specific
        # and omitted here.
    end

Notice the `:before_start` option. We pass a piece of code (here the private
`before_start` method, wrapped with `method(:before_start)`) that is run just
before the daemon is started. Running this code and starting the daemon are
completely serialized together: while one process is inside `before_start`, it's
guaranteed that no other process is running it at the same time.
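
Cross-process serialization of this kind is commonly built on an exclusive advisory file lock. The sketch below uses Ruby's `File#flock` and only illustrates the technique. It does not show daemon_controller's internals, and `with_exclusive_lock` and the lock file name are hypothetical. Because `flock` locks are advisory, every process must go through the same lock file for the guarantee to hold.

```ruby
require 'tmpdir'

# Hypothetical helper (not daemon_controller's internals): run the block while
# holding an exclusive flock(2) lock on lock_path. Other processes calling this
# with the same path block in #flock until the current holder is done.
def with_exclusive_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0644) do |f|
    f.flock(File::LOCK_EX)   # waits until no other process holds the lock
    begin
      yield
    ensure
      f.flock(File::LOCK_UN) # closing the file would also release the lock
    end
  end
end

Dir.mktmpdir do |dir|
  with_exclusive_lock(File.join(dir, "sphinx_start.lock")) do
    # Only one process at a time gets here: this is where the README's
    # before_start work (generate the config file, generate the index if it
    # is missing) and the actual daemon start would go.
    puts "holding the start lock"
  end
end
```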

The `#query` method queries the search server with the given search terms and
returns a list of results. It uses `DaemonController#connect`: you pass that
method a block containing code for connecting to the daemon. If the block
returns nil, or if it raises `Errno::ECONNREFUSED`, then
`DaemonController#connect` will automatically take care of auto-starting the
Sphinx daemon for us.


A little bit of history
=======================

The issue of managing daemons has been a thorn in our side for quite some time
now. Until now, we've solved this problem by equipping any daemons that we
write with the ability to gracefully handle being concurrently started, the
ability to initialize as much as possible *before* forking into the background,
etc. However, building all this robustness into our code over and over again
is a lot of work. We've considered documenting a standard behavior for daemons
so that they can properly support auto-starting and such.

However, we've recently realized that that's probably a futile effort.
Convincing everybody to write a lot of code for a bit more robustness is
probably not realistic. So we took the pragmatic approach and developed a
library which adds more robustness on top of daemons' existing behavior. And
thus, daemon_controller was born. It is a little less efficient than a daemon
designed from the beginning with such abilities in mind, but it's compatible
with virtually all daemons, and is easy to use.


Concurrency notes
=================

DaemonController can only guarantee concurrency safety between multiple threads
in the same process if all of those threads use the same DaemonController
object. Synchronization between multiple processes works fine.


API documentation
=================

Detailed API documentation is available in the form of inline comments in
`lib/daemon_controller.rb`.