clusterfuck 0.1.0
- data/.document +5 -0
- data/.gitignore +5 -0
- data/LICENSE +20 -0
- data/README.rdoc +49 -0
- data/Rakefile +60 -0
- data/VERSION +1 -0
- data/bin/clusterfuck +20 -0
- data/lib/clusterfuck.rb +218 -0
- data/test/clusterfuck_test.rb +7 -0
- data/test/test_helper.rb +9 -0
- metadata +66 -0
data/.document
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,20 @@
Copyright (c) 2009 Trevor Fountain

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc
ADDED
@@ -0,0 +1,49 @@
= Clusterfuck
==== A Subversive Distributed-Systems Tool

Clusterfuck is a tool for automating the process of SSH-ing into remote machines and kickstarting a large number
of jobs. It's probably best explained by an example, so here's what I use it for:

As part of my research I need to compute the distance between each pair of objects in a set of about 70,000 items.
Computing the distance between each pair takes a few seconds; running the entire job on a single machine generally takes over a day.
However, as a member of the University I have an ssh login that works on quite a few machines, so I found myself breaking the job up into smaller, quicker chunks and running each chunk on a different machine.
Clusterfuck was born out of my frustration with that method -- "surely," I said to myself, "this can be automated."

If you have a lot of jobs to run and access to multiple machines on which to run them, Clusterfuck is for you!

== Usage
To use Clusterfuck you'll first need to create a configuration file (a "clusterfile"). An example clusterfile might look something like this:

  Clusterfuck::Task.new do |task|
    task.hosts = %w{clark asimov}
    task.jobs = (0..3).map { |x| Clusterfuck::Job.new("host#{x}","sleep 0.5 && hostname") }
    task.temp = "fragments"
    task.username = "SSHUSERNAME"
    task.password = "SSHPASSWORD"
    task.debug = true
  end

This creates a new clusterfuck task and distributes the jobs across two hosts, +clark+ and +asimov+.
The jobs to be run in this case are pretty trivial; we basically ssh into each machine, sleep for a little bit, then get the hostname.
Whatever each job prints to stdout is saved in +task+.+temp+ (under the current working directory); running
this clusterfile will create 4 files in <code>./fragments/</code>: host0.[hostname], host1.[hostname], host2.[hostname], and host3.[hostname] (where [hostname] is the name of the machine on which the job was run).
+task+.+username+ and +task+.+password+ are the SSH credentials used to log into each machine -- currently, Clusterfuck
can only use one global set of credentials. There's no technical reason for this, other than the fact that I don't
really need machine-specific logins, so they'll probably appear in a future release.
+task+.+verbose+ turns on verbose output (messages to stdout each time a job is started, skipped, or canceled).

Once you have a clusterfile you can kick off your jobs by running the command +clusterfuck+ in the same directory.

== Note on Patches/Pull Requests

* Fork the project.
* Add something cool or fix a nefarious bug. Documentation wins extra love.
* Add tests for it. I'd really like this, but since I haven't written any tests myself yet I can't really blame you if you skip it...
* Commit, but do not mess with the rakefile, version, or history.
  (If you want to have your own version, that's ok -- but
  bump the version in a separate commit that I can ignore when I pull.)
* Send me a pull request.

== Copyright

Copyright (c) 2009 Trevor Fountain. See LICENSE for details.
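Since each job writes its stdout to a fragment file named <code>[short_name].[hostname]</code>, a common follow-up step is stitching the fragments back together. The sketch below is illustrative, not part of Clusterfuck itself; the fragment names and contents are made up for the example.

```ruby
require "fileutils"
require "tmpdir"

# Simulate a finished run: a fragments directory holding one output
# file per job, named <short_name>.<hostname>.
dir = Dir.mktmpdir
frag_dir = File.join(dir, "fragments")
FileUtils.mkdir_p(frag_dir)
File.write(File.join(frag_dir, "host0.clark"), "result 0\n")
File.write(File.join(frag_dir, "host1.asimov"), "result 1\n")

# Merge the fragments in name order into a single result.
merged = Dir.glob(File.join(frag_dir, "*")).sort.map { |f| File.read(f) }.join
print merged

FileUtils.remove_entry(dir)
```

Sorting by filename works here because the short names share a common prefix; for numeric job ids past 9 you would want a natural sort instead.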
data/Rakefile
ADDED
@@ -0,0 +1,60 @@
require 'rubygems'
require 'rake'

begin
  require 'jeweler'
  Jeweler::Tasks.new do |gem|
    gem.name = "clusterfuck"
    gem.summary = %Q{Run jobs across multiple machines via ssh}
    gem.description = %Q{Automate the execution of jobs across multiple machines with SSH. Ideal for systems with shared filesystems.}
    gem.email = "doches@gmail.com"
    gem.homepage = "http://github.com/doches/clusterfuck"
    gem.authors = ["Trevor Fountain"]
    # gem is a Gem::Specification... see http://www.rubygems.org/read/chapter/20 for additional settings
  end
  Jeweler::GemcutterTasks.new
rescue LoadError
  puts "Jeweler (or a dependency) not available. Install it with: sudo gem install jeweler"
end

require 'rake/testtask'
Rake::TestTask.new(:test) do |test|
  test.libs << 'lib' << 'test'
  test.pattern = 'test/**/*_test.rb'
  test.verbose = true
end

begin
  require 'rcov/rcovtask'
  Rcov::RcovTask.new do |test|
    test.libs << 'test'
    test.pattern = 'test/**/*_test.rb'
    test.verbose = true
  end
rescue LoadError
  task :rcov do
    abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
  end
end

task :test => :check_dependencies

task :default => :test

gem 'rdoc'
require 'rdoc'
require 'rake/rdoctask'
Rake::RDocTask.new do |rdoc|
  if File.exist?('VERSION')
    version = File.read('VERSION')
  else
    version = ""
  end

  rdoc.rdoc_dir = 'rdoc'
  rdoc.title = "Clusterfuck #{version}"
  rdoc.rdoc_files.include('README*')
  rdoc.rdoc_files.include('lib/*.rb')
  rdoc.main = "README.rdoc"
  rdoc.options += ["-SHN", "-f", "darkfish"]
end
data/VERSION
ADDED
@@ -0,0 +1 @@
0.1.0
data/bin/clusterfuck
ADDED
@@ -0,0 +1,20 @@
#!/usr/bin/env ruby

require 'clusterfuck'
if ARGV[0]
  # Use specified file
  load ARGV[0]
else
  # Search the current directory for a clusterfile
  found = false
  Dir.foreach(".") do |file|
    if file.downcase == "clusterfile"
      load file
      found = true
      break
    end
  end
  if not found
    STDERR.puts "No clusterfile found!"
  end
end
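The case-insensitive clusterfile lookup performed by the executable above can be pulled out into a small testable function. This is a standalone sketch of the same logic, not code from the gem; the directory and file used are invented for the demonstration.

```ruby
require "tmpdir"

# Scan a directory for a file named "clusterfile", matching
# case-insensitively (so "Clusterfile" and "CLUSTERFILE" both count).
def find_clusterfile(dir = ".")
  Dir.children(dir).find { |f| f.downcase == "clusterfile" }
end

# Demonstrate against a throwaway directory containing a capitalized
# clusterfile, the way the executable would find it in the cwd.
tmp = Dir.mktmpdir
File.write(File.join(tmp, "Clusterfile"), "# jobs go here")
found = find_clusterfile(tmp)
puts found  # => Clusterfile
```

`Dir.children` (Ruby 2.5+) already excludes `.` and `..`, which is why no explicit filtering is needed here, unlike the `Dir.foreach` loop in the executable.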
data/lib/clusterfuck.rb
ADDED
@@ -0,0 +1,218 @@
require 'socket'
require 'net/ssh'

# Clusterfuck is an ugly, dirty hack to run a large number of jobs on multiple machines.
# If you can break your task up into a series of small, independent jobs, clusterfuck
# can automate the process of distributing jobs across machines.
module Clusterfuck
  # Print a message when a job is cancelled due to too many failures
  VERBOSE_CANCEL = 0
  # Print a message when a job is cancelled AND at each failure
  VERBOSE_FAIL = 1
  # Print a message for cancellations and failures, AND each time a job is started.
  VERBOSE_ALL = 2

  # The flag used to prefix dry run (debugging) messages.
  DEBUG_WARN = "[DRY-RUN]"
  # The interval to sleep instead of running jobs when performing a dry run (in seconds)
  DEBUG_INTERVAL = [0.2, 1.0]

  # A configuration holds the various pieces of information Clusterfuck needs
  # to represent a task.
  #
  # You probably won't need to instantiate a Configuration directly; one is created
  # when you create a new Task, and passed to the block it takes as a parameter. See Task
  # for more information.
  #
  # Possible configuration options include:
  # [timeout] Number of seconds to wait before an SSH connection 'times out' (DEFAULT: 2)
  # [max_fail] Max number of times a failing job will be re-attempted on a new machine (DEFAULT: 3)
  # [hosts] Array of hostnames (or ip addresses) as Strings to use as nodes
  # [jobs] Array of Job objects, one per job, which will be allocated to the +hosts+. If you're lazy,
  #        you can also just use an array of strings (where each string is the command to run) -- a short
  #        name for each will be produced using the first 8 chars from the command.
  # [verbose] Level of message reporting. One of +VERBOSE_CANCEL+, +VERBOSE_FAIL+, or +VERBOSE_ALL+
  #           (DEFAULT: +VERBOSE_CANCEL+)
  # [username] The SSH username to use to connect
  # [password] The SSH password to use to connect
  # [show_report] Show a report after all jobs are complete that gives statistics for each machine.
  # [debug] Do a 'dry run' -- allocate jobs to machines and display the result but DO NOT actually
  #         connect to any machines or run any jobs. Useful for testing your clusterfile before
  #         kicking off a major run.
  # [temp] Directory in which to capture stdout from each job. Setting this to +false+
  #        will cause clusterfuck to ignore job output, leaving it up to you to capture the results
  #        of each job. (DEFAULT: ./fragments)
  class Configuration
    # Holds the user-specified options. Again, you probably don't want to access this directly -- use the
    # getter/setter syntax instead.
    attr_reader :options

    # Create a new Configuration object with default options.
    def initialize
      @options = {
        "timeout" => 2,
        "max_fail" => 3,
        "verbose" => VERBOSE_CANCEL,
        "show_report" => true,
        "temp" => "./fragments",
      }
    end

    # You can get/set options as if they were attributes, i.e. +config.foo = "bar"+ will set the option +foo+ to "bar".
    def method_missing(key, args = nil)
      if args.nil?
        return @options[key.to_s]
      else
        key = key.to_s.gsub("=", "")
        @options[key] = args
      end
    end

    # Get a pretty-printed version of the currently set options
    def to_s
      @options.map { |pair| "#{pair[0]} = \"#{pair[1]}\"" }.join(", ")
    end

    # Convert array of string commands to Job objects if necessary
    def jobify!
      @options["jobs"].map! do |job|
        if not job.is_a?(Job) # Ah-ha, make this string into a job
          short = job.downcase.gsub(/[^a-z]/, "")
          short = job[0..7] if short.size > 8
          Job.new(short, job)
        else # Don't change anything...
          job
        end
      end
    end
  end

  # The primary means of interacting with Clusterfuck. Create a new
  # Task, passing in a block that takes a Configuration object as a parameter (rake-style).
  # The constructor returns after all jobs have been completed.
  class Task
    # See Configuration for a list of recognized configuration options.
    def initialize(&custom)
      # Run configuration options specified in clusterfile
      config = Configuration.new
      custom.call(config)
      config.jobify!

      # Make output fragment directory
      `mkdir #{config.temp}` if config.temp and not File.exists?(config.temp)

      # Run all jobs
      machines = config.hosts.map { |name| Machine.new(name, config) }
      machines.each { |machine| machine.run }

      # Wait for jobs to terminate
      machines.each do |machine|
        begin
          machine.thread.join
        rescue Timeout::Error
          STDERR.puts machine.to_s
        end
      end

      # Print a report, if requested
      if config.show_report
        puts " Machine\t| STARTED\t| COMPLETE\t| FAILED\t|"
        machines.each { |machine| puts machine.report }
      end
    end
  end

  # Represents a single machine (node) in our ad hoc cluster
  class Machine
    # The hostname of this machine
    attr_accessor :host
    # The global config options specified when the task was created
    attr_accessor :config
    # The thread representing this machine's ssh process
    attr_reader :thread
    # The number of jobs this machine has completed
    attr_reader :jobs_completed
    # The number of jobs this machine has attempted
    attr_reader :jobs_attempted
    # Was this machine dropped from the host list (too many failed jobs)?
    attr_reader :dropped

    # Create a new machine with the specified +host+ and +config+
    def initialize(host, config)
      self.host = host
      self.config = config

      @thread = nil
      @jobs_completed = 0
      @jobs_attempted = 0
      @dropped = false
    end

    # Open an SSH connection to this machine and process jobs until the global job queue is empty
    def run
      @thread = Thread.new do
        while config.jobs.size > 0
          job = config.jobs.shift
          if config.debug
            puts "#{DEBUG_WARN} #{self.host} starting job '#{job.short_name}'"
            puts "#{DEBUG_WARN} #{job.command}"
            delay = rand * (DEBUG_INTERVAL[1] - DEBUG_INTERVAL[0]) + DEBUG_INTERVAL[0]
            @jobs_attempted += 1
            sleep(delay)
            @jobs_completed += 1
          else
            begin
              @jobs_attempted += 1
              Net::SSH.start(self.host, config.username, :password => config.password, :timeout => config.timeout) do |ssh|
                puts "Starting job #{job.short_name} on #{self.host}" if config.verbose >= VERBOSE_ALL
                if config.temp
                  ssh.exec(job.command + " > #{Dir.getwd}/#{config.temp}/#{job.short_name}.#{self.host}")
                else
                  ssh.exec(job.command)
                end
                @jobs_completed += 1
              end
            rescue Timeout::Error
              puts "#{job.short_name} FAILED on #{self.host}, dropping it from the hostlist" if config.verbose >= VERBOSE_FAIL
              if job.failed < config.max_fail
                # Requeue the job so another machine can pick it up
                job.failed += 1
                config.jobs.push job
              else
                puts "CANCELLING #{job.short_name}, too many failures (#{job.failed})" if config.verbose >= VERBOSE_CANCEL
              end
              @dropped = true
              break
            end
          end
        end
      end
    end

    # Get a one-line summary of this machine's performance
    def report
      tab = "\t"
      if self.host.size > 7
        tab = ""
      end
      "#{self.host}#{tab}\t| #{@jobs_attempted}\t\t| #{@jobs_completed}\t\t| #{@dropped ? 'YES' : 'no'}\t\t|"
    end
  end

  # Represents an individual job to be run
  class Job
    # The short name of this job, used to name the temporary file it produces
    attr_accessor :short_name
    # The actual command to run to execute this job.
    attr_accessor :command
    # The number of times this job has been unsuccessfully attempted.
    attr_accessor :failed

    # Create a new job with the specified short name and command
    def initialize(short_name, command)
      self.short_name = short_name
      self.command = command

      self.failed = 0
    end
  end
end
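The `method_missing` idiom that Configuration uses for its getter/setter syntax can be sketched in isolation. The class below is a simplified stand-in written for this example (the name `OptionBag` and its defaults are invented), but it routes unknown method calls into an options hash the same way: a bare call reads an option, a call ending in `=` writes one.

```ruby
# Minimal stand-in for Configuration's dynamic option access:
# bag.foo reads @options["foo"]; bag.foo = x writes it.
class OptionBag
  attr_reader :options

  def initialize(defaults = {})
    @options = defaults
  end

  def method_missing(key, *args)
    if args.empty?
      @options[key.to_s]                          # getter: bag.foo
    else
      @options[key.to_s.sub("=", "")] = args.first # setter: bag.foo = val
    end
  end

  # Advertise that any method name is handled dynamically.
  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

bag = OptionBag.new("timeout" => 2)
bag.max_fail = 5
puts bag.timeout   # => 2
puts bag.max_fail  # => 5
```

One consequence of this design, visible in Configuration as well: reading an option that was never set silently returns `nil` rather than raising `NoMethodError`, so a typo in a clusterfile option name goes unnoticed.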
data/test/test_helper.rb
ADDED
metadata
ADDED
@@ -0,0 +1,66 @@
--- !ruby/object:Gem::Specification
name: clusterfuck
version: !ruby/object:Gem::Version
  version: 0.1.0
platform: ruby
authors:
- Trevor Fountain
autorequire:
bindir: bin
cert_chain: []

date: 2009-10-20 00:00:00 +01:00
default_executable: clusterfuck
dependencies: []

description: Automate the execution of jobs across multiple machines with SSH. Ideal for systems with shared filesystems.
email: doches@gmail.com
executables:
- clusterfuck
extensions: []

extra_rdoc_files:
- LICENSE
- README.rdoc
files:
- .document
- .gitignore
- LICENSE
- README.rdoc
- Rakefile
- VERSION
- bin/clusterfuck
- lib/clusterfuck.rb
- test/clusterfuck_test.rb
- test/test_helper.rb
has_rdoc: true
homepage: http://github.com/doches/clusterfuck
licenses: []

post_install_message:
rdoc_options:
- --charset=UTF-8
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: "0"
  version:
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: "0"
  version:
requirements: []

rubyforge_project:
rubygems_version: 1.3.5
signing_key:
specification_version: 3
summary: Run jobs across multiple machines via ssh
test_files:
- test/test_helper.rb
- test/clusterfuck_test.rb